Loading your data vault

Posted by Lulzim Bilali

In a 3NF model, tables reference each other, and depending on the model these reference chains can go very deep. Enforcing referential integrity means the tables must be loaded in a certain order; in fact, that is exactly why we enforce referential integrity. In an OLTP system, data arrives incrementally and in an order designed around specific business processes. For instance, a product should be registered in the warehouse before it can be sold; in other words, the product must first exist in the system before it can be sold.

While the order of inserting data in an OLTP system ensures that users do not make undesirable insertions or transactions, it can be an unnecessary barrier in a DWH, where the data is usually loaded in bulk by an ETL tool. There the margin for error is almost nonexistent compared with an OLTP system, where the user is responsible for entering the data.

Data vault is a method where, provided one does not use link-to-link references, the dependency chain never goes deeper than three levels: hubs sit at the top of the hierarchy, links and hub satellites depend on hubs, and link satellites depend on links. This makes it easy to load tables in parallel, which speeds up the load process while still maintaining referential integrity.
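To make the hierarchy concrete, here is a minimal sketch that models it as a dependency map and derives the load waves from it. The table names are illustrative assumptions, not taken from any particular model:

```python
# Each table maps to the tables it depends on; the graph is assumed acyclic.
DEPENDENCIES = {
    "Client_h": [],                              # hubs depend on nothing
    "Order_h": [],
    "Order_Client_l": ["Client_h", "Order_h"],   # links depend on their hubs
    "Client_s": ["Client_h"],                    # hub satellites depend on hubs
    "Order_s": ["Order_h"],
    "Order_Client_l_s": ["Order_Client_l"],      # link satellites depend on links
}

def load_waves(deps):
    """Group tables into waves; every table in a wave depends
    only on tables loaded in earlier waves."""
    loaded, waves = set(), []
    while len(loaded) < len(deps):
        wave = [t for t, d in deps.items()
                if t not in loaded and all(p in loaded for p in d)]
        waves.append(wave)
        loaded.update(wave)
    return waves

print(load_waves(DEPENDENCIES))
# [['Client_h', 'Order_h'], ['Order_Client_l', 'Client_s', 'Order_s'],
#  ['Order_Client_l_s']]
```

Because the dependency chain stops at three levels, the derived schedule never needs more than three waves, no matter how many tables the model contains.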

In the diagram below, which is part of the diagram from my previous post “Data warehouse generation algorithm explained”, the load order might be (a code sketch follows the list):
1. Execute step 1 and step 2
2. Execute step 3, step 4 and step 5
3. Execute step 6
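Assuming each step is a callable that runs one load script, the three waves above could be executed like this; the step bodies are placeholders for the real loads:

```python
from concurrent.futures import ThreadPoolExecutor

def make_step(name):
    # Stand-in for a real load script (e.g. an INSERT ... SELECT from staging).
    def step():
        print(f"running {name}")
    return step

waves = [
    [make_step("step 1"), make_step("step 2")],                       # hubs
    [make_step("step 3"), make_step("step 4"), make_step("step 5")],  # link and hub satellites
    [make_step("step 6")],                                            # link satellites
]

with ThreadPoolExecutor() as pool:
    for wave in waves:
        # All steps of a wave run in parallel; the next wave starts
        # only after every step of the current one has finished.
        for future in [pool.submit(step) for step in wave]:
            future.result()
```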

Load data dependently

This not only speeds up the load process by making it possible to run load scripts in parallel, but also makes it very simple to work out the dependencies and the order of execution. That is very useful when loading the full data warehouse. But there are cases when we do not want to load the full DWH, for instance because loading the full data warehouse takes too long or because we do not have all the data at the moment of loading.
Let’s say we want near real-time reports (refreshed every 10 minutes) based on orders, displayed on big monitors around the office as an important KPI for our company, but we are not yet interested in which clients made the orders. Having to load both tables might slow down the load process, making it impractical to refresh the reports every 10 minutes. So what we actually need is to load only the data from the Order table every 10 minutes, and to load the data from the Client table once during the night for the rest of the reports. In any case, to load Order_Client_l we first need to load Client_h.
While we could leave Order_Client_l unloaded and still get the desired report, that is not best practice, especially when you do delta loads; furthermore, there will be cases where we have to load the links to get the desired reports.
To skip loading the data from the Client table, we can load Client_h with only the data coming from the Order table. If a new client places an order, the Order table contains the ClientNr needed to load Client_h. The rest of the information about the client, which goes into the satellite, will be loaded during the night from the Client table.
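A minimal sketch of such a load, using SQLite as a stand-in database; all table and column names here (stg_order, Client_h, ClientNr, load_dts, record_source) are assumptions about the schema, not the actual model. The NOT EXISTS guard inserts only previously unseen client keys, so the same statement stays safe whether it runs in the 10-minute job or in the nightly load:

```python
import sqlite3

# "step 1 ap": seed Client_h with the client keys found in the staged orders.
STEP_1_AP = """
INSERT INTO Client_h (ClientNr, load_dts, record_source)
SELECT DISTINCT o.ClientNr, DATETIME('now'), 'Order'
FROM stg_order AS o
WHERE o.ClientNr IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM Client_h AS h WHERE h.ClientNr = o.ClientNr);
"""

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Client_h (ClientNr TEXT, load_dts TEXT, record_source TEXT)")
con.execute("CREATE TABLE stg_order (OrderNr TEXT, ClientNr TEXT)")
con.executemany("INSERT INTO stg_order VALUES (?, ?)",
                [("O1", "C1"), ("O2", "C1"), ("O3", "C2")])
con.execute(STEP_1_AP)
print(con.execute("SELECT ClientNr FROM Client_h").fetchall())
# [('C1',), ('C2',)] -- only distinct, previously unseen keys are propagated
```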
The diagram below shows the steps needed to load all the data from the Order table without first loading the data from the Client table.
The load order in this case might be:
1. Execute “step 1 ap” and step 2, where “ap” stands for “artificially propagated”.
2. Execute step 3 and step 5
3. Execute step 6

Load data independently

Loading Client_h with the data in the Order table solves another issue: when referential integrity is not enforced in the source system, there can be orders with a ClientNr that does not exist in the Client table. This is an undesirable situation, but we still cannot leave the data out; remember that business rules and data cleansing have to be applied after the data is loaded into the data warehouse, never before. This makes it necessary to run “step 1 ap” even when we load the full data warehouse.
The diagram below shows the full set of steps, where the order of execution (also sketched in code after the list) is:
1. Execute step 1 and step 2
2. Execute “step 1 ap”, step 4 and step 5
3. Execute step 3
4. Execute step 6
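Expressed as waves for the runner sketched earlier, the combined order might look as follows; the step-to-table mapping is, again, an assumption based on the diagrams:

```python
combined_waves = [
    ["step 1", "step 2"],               # hubs from the Client and Order tables
    ["step 1 ap", "step 4", "step 5"],  # propagated client keys, satellites
    ["step 3"],                         # Order_Client_l, after both hub loads
    ["step 6"],                         # link satellite
]
```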

Load data combined

While “step 1 ap” can be executed even before step 1, the two cannot be executed at the same time, because both write to Client_h and would collide. The preferred order of execution is step 1 first and then “step 1 ap”.
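One simple way to honor this constraint is to chain the two steps into a single task, so they stay serialized with each other while independent steps still run in parallel. The function bodies below are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def step_1():    print("load Client_h from the Client table")
def step_1_ap(): print("load Client_h from the keys in the Order table")
def step_2():    print("load Order_h")

def load_client_h():
    # Both steps write to Client_h, so they must never overlap;
    # chaining them keeps the hub load serialized.
    step_1()
    step_1_ap()

with ThreadPoolExecutor() as pool:
    for future in [pool.submit(load_client_h), pool.submit(step_2)]:
        future.result()
```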

Of course, in the real world there will be many steps that can be executed in parallel, and running all of them at the same time may be inefficient. Usually, with the ETL generated by Quipu, we run tests with different numbers of steps in parallel to see where we get the best performance. The number of steps to execute in parallel depends on the hardware, the data sets and the platform you are running on.
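Such a test can be as simple as timing the same set of steps under different degrees of parallelism; the sketch below uses sleeps as stand-ins for real load scripts:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_step():
    time.sleep(0.1)  # stand-in for one load script

def benchmark(steps, workers):
    """Time one wave executed with a given degree of parallelism."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for future in [pool.submit(step) for step in steps]:
            future.result()
    return time.perf_counter() - start

steps = [fake_step] * 8
for workers in (1, 2, 4, 8):
    print(f"{workers} workers: {benchmark(steps, workers):.2f}s")
```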