Data warehouse generation algorithm explained

Posted on by Lulzim Bilali

There are already a couple of data warehouse generation tools in the market which make an amazing job in simplifying your work in building and maintaining a data warehouse. If you ever wondered how these tools work, than this post is for you. Here we are going to explain how a data warehouse generator works. The algorithm described in this post is based on the algorithm implemented by open source data warehouse generator Quipu but it does not describe a one to one copy of the algorithm implemented on the tool. For example, this algorithm does not deal with reference tables and does not have any option to group relationships in a link.

This algorithm is going to strait a data vault model based on a source model. Data vault is selected as output based on the popularity of the method and simplicity of the generation but other methods like anchor modeling could be considered and can be generated with the same simplicity. To better understand this post the reader must have some basic data vault knowledge.

To better illustrate the algorithm, a simple model is going to be transformed step by step as the algorithm is explained.
The sample model is an order tracking system where a client orders products in a store. The orders can be daily and might contain many products.
For every client there is one shipping address used to send ordered products. To make it easier to search and group, products are tagged with keywords called tags. A product might have many tags and a tag can be used in many products.
Below is a diagram of the source model:

(For better quality click on the picture)

Source model

Source Model

As you can see from the diagram, the primary key columns are marked as business key.

To simplify the process we are going to split the process in 3 steps:

  1. Analyze This step is going to check every table in the source model and based on the primary keys and foreign keys of the tables mark them as Hub, Link or Satellite.
  2. Review In this step the analyst can overrule the decisions of the analyzer by specifying what a table should be. Keep in mind that not every thing is possible to change.
  3. Generate This step is going to generate the data vault model based on the result of the first two steps.

Step 1, Analyze

Mark potential satellite tables

RULE: Every table on which there are no foreign keys referencing to and has only one foreign key with all the referencing columns also primary key columns and no other columns are part of primary key: is a candidate to become a satellite.
In ER terms: The entity has only one dependent 1:1 relationship and it is not referenced from other relationships.
In our sample model, shipping address is the only table which full-fills the rule. We are going to mark that table as a satellite, which in data vault modeling is represented with the yellow color.

Mark tables as satellite

Mark tables as satellite

Mark as peg leg Links

RULE: Every table on which there are no foreign keys referencing to and has only one foreign key with all the referencing columns also primary key columns but primary key is wider than the foreign key: is a candidate to become a peg leg link.
In ER terms: The entity has only one dependent 1:n relationship and it is not referenced from other relationships.
In our sample model, tags is the only table which full-fills the rule. We are going to mark that table as link, which in data vault modeling is represented with the red color.

Mark table as peg-leg-links

Mark table as peg-leg-links

Mark as links

RULE: Every table on which there are no foreign keys referencing to and has more than one foreign key with all the referencing columns also primary key columns: is a candidate to become a link.
In this case does not matter if the primary key is wider than all the foreign keys together or not.
In ER terms: The entity has more than one 1:n relationship and all the relationships are dependent and it is not referenced from other relationships.
In our sample model, table details fulfill the rule. We are marking that table as link.

Mark tables as link

Mark tables as link

Mark as hubs

Every table which does not fit on any of the categories above is going to be a hub.
In our sample model, the tables to be marked as hub are Client, Store, Product and Order. In data vault modeling the hubs are represented with the blue color.

Mark tables as hub

Mark tables as hub

Extra Options

Some analysts might like Order table to be a link and not a hub. In that case, the rule to find the potential links must be changed to:
Every table which has more than one foreign key with all the referencing columns also primary key columns: is a candidate to become a link.
In this case, the model will result into a Link to link relationship.
Liking or disliking link to links is a matter of choice and here in this post we decided not to use them.

Step 2, Review

The analyst might not be happy with all the decisions of the analyzer and can overrule that output but not every change is possible.
Below is a list of possible changes.

In our sample model we are not going to apply and change on the result of the analyzer.

Step 3, Generate

In this step we are going throw every table in source model and create the corresponding data vault tables.

Create hubs
During first iteration we are creating a hub and a satellite for each table marked as hub.

For each source table marked as hub.

  1. Create a hub table in the data vault model.
  2. Create an id column called hub_id as primary key in the just created hub.
  3. For every primary key column on source tables create a column on the just created hub.
  4. Create a satellite table in the data vault model.
  5. Create an id on the just created satellite.
  6. Create a foreign key referencing from id on the satellite to the id of the hub.
  7. Create a column load_date as primary key on the just created satellite.
  8. Create a column load_date_end on the just created satellite.
  9. Create a column voided on just crated satellite. This column is used to keep track if the record is deleted.
  10. For each column in source table not part of primary key
    1. Create a column in the satellite.

By applying the above steps to out sample model we will get the model below.

create hubs and hub satellites

create hubs and hub satellites

Create links from relationships of tables marked as hubs

Now we are going thru every relationship of source tables marked as hub and create a link and a satellite for each of them.

For each foreign key of tables marked as hub

  1. Create a link table in the data vault model.
  2. Create an id column called link_id as primary key on the link table.
  3. Create an id column called <foreign_table_name>_hub_id on the link table.
  4. Create a foreign key from <foreign_table_name>_hub_id to the hub_id of the hub created from the source table.
  5. Create an id column called <primary_table_name>_hub_id on the link table.
  6. Create a foreign key from <primary_table_name>_hub_id to the hub_id of the hub created from the source table where the foreign key is referencing to.
  7. Create a satellite table in the data vault model.
  8. Create an id on the just created satellite.
  9. Create a foreign key referencing from id on the satellite to the link_id of the link.
  10. Create a column load_date as primary key on the just created satellite.
  11. Create a column load_date_end on the just created satellite.
  12. Create a column voided on just crated satellite.

By applying these steps to our sample model we will get two links and their satellites. The model up to this point looks like the one below.

Create link and link satellites from hub relationships

Create link and link satellites from hub relationships

Create links from tables marked as links

Now we are going to create links based on tables marked as links. Here are included peg leg links as well.

For each source table marked as link create a link on the data vault model.

  1. Create a link table in the data vault model.
  2. Create an id column called link_id as primary key on the link table.
  3. For each foreign key of the link
    1. Create a column called <primary_table_name>_hub_id on the link table.
    2. Create a foreign key referencing from just created column to the hub_id of the hub created from the source table where the foreign key is referencing to.
  4. Create a satellite table in the data vault model.
  5. Create an id on the just created satellite.
  6. Create a foreign key referencing from id on the satellite to the link_id of the link.
  7. Create a column load_date as primary key on the just created satellite.
  8. Create a column load_date_end on the just created satellite.
  9. Create a column voided on just crated satellite.

After these steps our data vault model will be almost finished. Below is the diagram as of this step.

Create link and link satellites from tables marked as link

Create link and link satellites from tables marked as link

Create satellites based on tables marked as satellites

The last step is to create the satellites for the tables marked as satellites.

For each source table marked as satellite

  1. Create a satellite table in the data vault model.
  2. Create an id on the just created satellite.
  3. Create a foreign key referencing from id on the satellite to the hub or link created as result of the source table parent of the only foreign key of the table.
  4. Create a column load_date as primary key on the just created satellite.
  5. Create a column load_date_end on the just created satellite.
  6. Create a column voided on just crated satellite.
  7. For each column in source table not part of primary key
    1. Create a column in the satellite.

Below is the final diagram of our sample model.

Create satellites from tables marked as satellites

Create satellites from tables marked as satellites

Conclusion

As you can see with a simple set of instructions it is possible to generate a data vault model. This set of instructions not only helps you understand how data warehouse generator tools work, but can be viewed as a way of thinking when creating a data vault model by “pen and paper”.
It is important to understand that generators will help you a lot by getting a big chunk of work out of your hands; think of the DDL code and ETL code generated for free from them, but they need your help as well. The work of the analyst is still needed to properly define the model. This is especially important during the process of building a business data vault where the business model differentiates from the source model. Even here, with the help of generator tools the analysts can be focused on the business side of the problem and only the parts which can not be generated like business rules and let the tool do the obvious work.