A comparison of head and version, Data Vault and anchor modeling
In this exercise we are going to evolve our data warehouse from one model to the next based on some simple requirement changes.
To make things more concrete we are going to use the very simple model below.
Our simple model keeps information about products and their categorization. Every category belongs to a cluster. Because we do not keep more information than the cluster’s name the clusters is modeled as a column on the category table.
Here we are skipping a lot of details on implementation from all the methods and focus more on the big differences
Head and version
Our first requirement is to keep history for the product prices and on which category they belong to, as well as the changes that categories go on regarding clustering.
The most flexible and generic way to keep history is by creating a new row for every mutation.
To implement it we can split the columns that we expect to change in a separate table and add a row identifier on that table. For referencing the columns left on the immutable table we can add an id column as object identifier.
So on our model we are bringing price and cluster on separate tables and add ids on those tables. We are adding ids on head tables as well to internally identify the objects. We can now use head_id to reference their corresponding head tables from version tables.
Our DWH implemented as head and version model will look like:
Head and version works well and will handle many requirement changes such as adding new records and data integration. But when it comes to changes like changing relationship cardinality from 1:M to N:M or changing non history attributes to history attributes it will start to give us extra work. So we will have to change the model, migrate data, change ETL/ELT etc.
So why not prevent those cases from the start?
Let’s bring all columns but business key column to history tables and let’s split relationship columns in its own many to many tables.
We are going to make some other small changes. For instance we are going to remove its own ids on the history tables and use the head_ids, which by the way are renamed as id.
We are using ids together with the load date as primary key on the history tables and we are going to rename the tables as well so we can better indicate which tables are hubs, links or satellites.
Our DWH implemented as Data Vault will look like:
In the model above we created only one satellite per hub but we can have as many satellites as we want per hub/link, giving us the possibility to split the columns in many ways including change frequency.
Data vault works very well and it has the flexibility and agility to accommodate many changes which might come from the business or we might need to implement for technical reasons.
But let’s see how Data Vault handles data integration when the same objects are identified differently on different systems and/or business units.
Let’s assume that in our little example a new model will be loaded where the product is identified by it’s name and the product code does not even exists at all on the new source.
In this case we will have to create a different hub where the business key is the name and link the two product hubs via a same as link which in this case can be easily populated with a simple rule.
This means that on Data Vault modeling the hub does not actually represent a business object but it represents a business key instead.
What if we take the business key columns out of the hub and put them in its own tables?
In the same way we will split every attribute on in its own table.
Now we can combine on the fly whatever combination of columns that is unique and fits our needs and use them as identifier for an object removing the need to create multiple hub/anchor tables for the same business entity.
We are making some other changes on the model like keeping history only for attributes required to. And we are renaming tables to better indicate which ones are anchors, ties and attribute tables.
Our DWH implemented based on anchor modeling will look like:
Keep in mind that with the approach used by anchor modeling there might be some cases to be considered. Take in account the case when:
1. First product with code “apl” arrives from source 1 but the name is null.
2. Than the product with name “Apple” comes from source 2 but the code is null.
3. In a later time the first source brings the information that the product with code “apl” has the name “Apple”.
In this case two entires will be created on “Product anchor” for the same product. Now we need some sort of “same as” tie to keep the information that this two entries represent the same object.
Moving from one method to the other we are adding flexibility to our Data warehouse but in the same time we are adding complicity to it, even though using the right tools this added complexity can be hidden from us.
Each of this methods has it’s own strong and week points and there is no winner. All depends on individual cases and needs.