Data Modeling in a Big Data World

Posted by Lulzim Bilali

Before we talk about Data Modeling in a Big Data World, let’s talk about Data Modeling in a Data World. People often talk about Data Modeling and yet mean different things.

To some people, Data Modeling is the set of tables, columns, primary keys and other objects that have been implemented in a database (Physical Model). Others see it as a set of entities, attributes, relationships, etc. (ER Modeling). For others it is a set of facts (Fact Oriented Modeling), and so on.

So let’s establish some common ground.

Everything which represents how the tables are implemented in the database is a Physical Data Model and in this article we will refer to it as Physical Data Modeling or implementation.

The other models represent which facts we capture within the organization and how these facts are related to one another. We will refer to them as Information Modeling.

When we talk about information modeling, it does not matter if we are in a Big Data World, a Small Data World or an <add an adjective> Data World: the models are the same. There is no such thing as big data modeling or small data modeling, only information modeling.

When it comes to the implementation of the information model, things start to change. An information model concerning products and categories could be implemented as 2 tables (3NF), 1 table (denormalized into a dimension) or 4+ tables (Ensemble Modeling), and all are valid implementations in their domains.
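To make the first two options concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are assumptions made up for illustration; the point is only that the same facts about products and categories can sit in two 3NF tables or in one denormalized dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- 3NF: two tables, the category is stored once and referenced by id.
-- All names and rows here are hypothetical.
CREATE TABLE category (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT NOT NULL
);
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    category_id  INTEGER NOT NULL REFERENCES category (category_id)
);
INSERT INTO category VALUES (1, 'Books'), (2, 'Games');
INSERT INTO product VALUES (10, 'Atlas', 1), (11, 'Chess set', 2);

-- Denormalized dimension: one table, the category name is repeated per product.
CREATE TABLE dim_product AS
SELECT p.product_id, p.product_name, c.category_name
FROM product AS p
JOIN category AS c ON c.category_id = p.category_id;
""")

dim_rows = conn.execute(
    "SELECT product_id, product_name, category_name "
    "FROM dim_product ORDER BY product_id"
).fetchall()
print(dim_rows)
```

Both layouts answer the question "which category does this product belong to?"; they differ in how much joining the reader has to do and how much repetition the writer has to maintain.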

Furthermore, implementation of an information model means more than just transforming it into tables. It means that we have to recognize the needs, analyze the data, decide the architecture, choose the tools, check the budget, acknowledge the talent, and more.

But are there some general guidelines on how to turn an information model into a physical model which supports the needs of today and can easily extend to the demands of tomorrow?

1. Model your information

As tempting as it is to start immediately implementing and showing results, do yourself a favor and take the time to create an information model independent of your implementation. Having a clear picture of which facts you possess and where they reside will be priceless in the long run, especially to the future members of the team.

2. Generate primary keys based on business keys

In a big data world, data does not reside on a single platform where we can easily generate ids and look them up whenever we need them. Instead, data lives on multiple platforms, deployed on internally run servers and/or in the cloud.

Generating primary keys in a simple and universal way, for example by hashing the business keys with a widely available algorithm such as SHA1, makes data integration across platforms a breeze: every platform can compute the same key on its own, without looking it up anywhere.
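As a rough sketch of the idea, the function below derives a key from one or more business key parts using SHA1 from Python's standard hashlib module. The function name hash_key, the trim-and-uppercase normalization and the "||" delimiter are assumptions chosen for illustration, not a prescribed standard; whatever rules you pick must be identical on every platform.

```python
import hashlib


def hash_key(*business_key_parts: str, delimiter: str = "||") -> str:
    """Return a deterministic key derived from the business key parts.

    hash_key is a hypothetical helper; the normalization rules below are
    assumptions and must be applied the same way on every platform.
    """
    # Trim and uppercase so cosmetic differences do not produce different keys.
    normalized = [part.strip().upper() for part in business_key_parts]
    # Join with a delimiter so ("AB", "C") and ("A", "BC") hash differently.
    payload = delimiter.join(normalized).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


product_key = hash_key("SKU-1001")
order_line_key = hash_key("ORD-42", "SKU-1001")
print(product_key, order_line_key)
```

Because the key depends only on the business key and the agreed rules, two systems that never talk to each other will still produce the same primary key for the same product.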

3. Choose a data structure which easily accommodates changes

It is crucial to have a physical data model which can be extended to support ever-changing business needs without having to halt or redesign the existing parts of it.

For instance, physical data models based on methodologies like data vault are easily extensible, allowing developers to just “plug in” new structures without having to change existing ones or their ETL/ELT processes.
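The "plug in" idea can be sketched with a heavily simplified, data-vault-style layout in SQLite. The hub and satellite tables below, and all their column names, are hypothetical and omit much of what a real data vault model would carry; the sketch only shows that a new requirement is met by adding a table while the existing definitions stay untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Existing model (hypothetical names): a hub with the business key and one
# satellite with descriptive attributes.
conn.executescript("""
CREATE TABLE hub_product (
    product_hk    TEXT PRIMARY KEY,  -- hash of the business key
    product_code  TEXT NOT NULL,     -- the business key itself
    load_ts       TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE sat_product_details (
    product_hk TEXT NOT NULL REFERENCES hub_product (product_hk),
    load_ts    TEXT NOT NULL,
    name       TEXT,
    price      REAL,
    PRIMARY KEY (product_hk, load_ts)
);
""")


def table_ddl(connection):
    """Return {table_name: CREATE statement} for every table in the database."""
    rows = connection.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
    )
    return dict(rows.fetchall())


ddl_before = table_ddl(conn)

# New requirement: track product ratings. We plug in a new satellite
# instead of altering the hub or the existing satellite.
conn.execute("""
CREATE TABLE sat_product_rating (
    product_hk TEXT NOT NULL REFERENCES hub_product (product_hk),
    load_ts    TEXT NOT NULL,
    rating     REAL,
    PRIMARY KEY (product_hk, load_ts)
)
""")

ddl_after = table_ddl(conn)
# Every table that existed before still has exactly the same definition.
unchanged = all(ddl_after[name] == sql for name, sql in ddl_before.items())
print(unchanged, sorted(ddl_after))
```

Since the existing tables are not altered, the processes that load them do not need to change either; only a new load for the new satellite is added.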

4. Turn your ETLs into ELTs

The ability to load data fast is crucial, and moving the transformations to the way out of the data warehouse, instead of the way in, will greatly improve loading speed. In addition, transforming on the way out means that you keep the source data unmodified, which has multiple benefits of its own.
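A minimal sketch of the difference, again using SQLite through Python: the load step lands the source rows exactly as they arrive, and the cleanup lives in a view that is applied on the way out. The table, view and column names, the sample rows and the cleansing rules are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: land the source rows as-is, everything as text, no cleansing.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, currency TEXT)")
source_rows = [("A-1", " 10.50", "eur"), ("A-2", "7", "EUR"), ("A-3", "n/a", "usd")]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)

# Transform: applied when data is read, not when it is loaded.
# The rules (trim, cast, uppercase, skip non-numeric amounts) are hypothetical.
conn.execute("""
CREATE VIEW orders_clean AS
SELECT order_id,
       CAST(TRIM(amount) AS REAL) AS amount,
       UPPER(currency)            AS currency
FROM raw_orders
WHERE TRIM(amount) GLOB '[0-9]*'
""")

clean_rows = conn.execute(
    "SELECT order_id, amount, currency FROM orders_clean ORDER BY order_id"
).fetchall()
raw_rows = conn.execute(
    "SELECT order_id, amount, currency FROM raw_orders ORDER BY order_id"
).fetchall()
print(clean_rows)
print(raw_rows)
```

If the cleansing rules turn out to be wrong, only the view has to change; the raw rows are still there, so nothing needs to be reloaded from the source.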

There you have 4 guidelines which will make your data warehouse a bit more future-proof.