
Advanced Data Modelling

Data model layers, environments, tests and data quality explained

AI generated image using Kandinsky

Data modelling is an essential part of data engineering. I would say it is a must if you want to become a successful data practitioner. Building SQL transformation pipelines with multiple layers is a challenging task, and during the process it is important to keep things organised. In this story, I summarise some techniques for convenient data structuring and describe the modelling techniques I use daily. They often help me design and develop a data platform or data warehouse that is accurate, easy to navigate and user-friendly.

Naming convention

A well-designed naming convention gives a clear and unambiguous sense of what a given database object contains. It is always good to have naming policies for tables and columns in place. It demonstrates how mature your data warehouse is and helps a lot during development.

Database entity names must be human-readable – at a minimum.

Maintaining the database or a DWH with this in mind improves user experience and simply makes it look more user-friendly.

The naming convention must look solid. At a minimum, it should be consistent with standard Kimball techniques, where we prefix dimension tables with dim_ and fact tables with fact_, i.e. dim_[Dimension Name] / fact_[Fact Table Name].

This is a good practice.

Semantic layer names must look consistent and logically coherent providing a solid foundation for robust data modelling.

Data lineage example for marketing attribution model. Image by author.

I’ll try to be concise here, as this topic is very well covered by renowned authors (Inmon, Kimball, et al.). I’ll just focus on the things that I believe are crucial and significantly impact DWH scalability.

Abbreviations

  • In general, I would recommend avoiding abbreviations that might be misleading (stg_, int_) and using full English words in database object names where possible.

Most modern data warehouse (DWH) solutions support identifier names of 20–30 characters, and Postgres supports even more, so using full words is usually feasible. There is no right or wrong solution here, and the optimal naming format depends on the nature of the database entity, the end users and the data model. For example, I often abbreviate common terms to a minimum of three letters for visibility and better discovery, i.e. dev for development, prod for production, etc.

From this point of view, stg instead of staging doesn’t make much sense.

It’s okay though, we can live with that.

  • I recommend using singular names where possible, for example _transaction, not _transactions.
  • I recommend using lower_snake_case with no quotes around database identifiers – again, for consistency and convenience.

I understand, though, that in many companies naming conventions might be inherited from legacy projects and databases. So it’s absolutely fine if we see something conflicting with the typical Kimball guide, e.g.

cst_ instead of customer_

There has to be a reason why these names were introduced. One such reason is the platform and microservice architecture in the company. Server-side development, e.g. existing implementations of event streams, connectors, SDK wrappers, etc., might rely heavily on these names, and refactoring is obviously not worth the risk in such a case.

Often we face a scenario where we have an OLTP database somewhere with a naming convention that doesn’t look like best practice either. When contemplating adjustments to the DWH, it is good to keep in mind that the new naming convention won’t work well with the legacy one we have in the OLTP database.

It might simply be very difficult to join the DWH schema back onto the OLTP source.

This can also have further adverse implications for ad-hoc analytics and any required data deep dives and investigations.

  • I would recommend using consistent naming conventions for DWH objects and enriching them with metadata as necessary. This gives a hint to end users and serves as a single source of truth, but it also makes the solution maintainable and user-friendly.

Data marts

Denormalized data is great, and we can take full advantage of it using modern DWH tools. Maintenance can become a pain point, though, when we have to work with a considerable number of different data marts, and we can end up with multiple star schemas. In this case, we want to ensure that a single data item (a dimension, a column, etc.) is used consistently and in the same way across all data marts, so it has to be designed as a standard, shared dimension used consistently in all data marts. It would be nice to have a matrix of conformed dimensions and facts (if one is not in place already). This ensures the DWH solution is business-process-oriented and predictable, and query performance tweaks can be done easily if needed.
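For illustration only (the business processes and dimensions below are hypothetical, not from a real project), such a matrix of conformed dimensions and facts can be as simple as:

BUSINESS PROCESS       DIM_DATE   DIM_CUSTOMER   DIM_PRODUCT
-------------------------------------------------------------
orders (finance)          x           x              x
payments (finance)        x           x
sessions (marketing)      x           x

Every mart that reports on these processes then joins the same shared dimensions, so a field like customer segment means the same thing everywhere.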

Primary and surrogate keys

Using surrogate keys is important too. Typically, surrogate keys don’t have any contextual or business meaning, but they become useful if designed right. Natural keys have a tendency to cause problems when scaling the data model. They can also deteriorate over time and become meaningless. A good example is a stock price ticker after an M&A or a company reorganisation (GOOG/GOOGL, etc.).
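For example, a hashed surrogate key can be derived from one or more natural-key columns. The dbt_utils package provides a helper for this; the model and column names below are hypothetical:

-- dim_company.sql (sketch): the surrogate key survives ticker renames (GOOG/GOOGL)
select
    {{ dbt_utils.generate_surrogate_key(['company_id', 'listing_exchange']) }} as company_key,
    company_id,
    ticker,
    company_name
from {{ ref('base_companies') }}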

DWH/database structure

Naming source data with the src_ prefix makes sense if you have only one project or database. In this case, it is a clear indication that this particular entity is a declared data source and no transformation has been applied to it. I previously wrote about it here:

Building a Data Warehouse

A data location like analytics_src.* suggests the nature (data origin) of a business function (analytics). However, this might seem a little confusing, as analytics is typically an entity that contains multitudes of ad-hoc queries, materialisations and/or reporting objects, so analytics itself shouldn’t have the _src suffix. int_analytics or simply analytics would suffice.

Remember the rule though:

Never use `select *` on raw data.

Staging data model layer: stg_

We can often see a couple of great things here, like "explicit declaration of fields" and "fields named according to the naming convention". This is a really good practice.

Table name example stg_raw_customer_data clearly indicates the staging development area.

  • I would consider removing the stg_raw_ prefix from table names. Ideally, I would drop it from the table name and instead create a dedicated database for staging/dev and one more for automated testing – with a _test.* suffix.

Data environments are very important and I tried to cover them here:

Database Data Transformation for Data Engineers

The following transformations should not be expected in this data layer:

  • no filter (no "where")
  • no joins
  • no group by
  • no calculations

This layer is essentially an implementation of a base model: only basic transformations are applied and no complex logic is expected.
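A minimal sketch of such a staging/base model (the source and column names are hypothetical): explicit field declarations, renaming and casting only, with no filters, joins, aggregations or calculations:

-- stg_customer.sql (sketch)
select
    id                      as customer_id,
    email                   as email,
    country                 as country_code,
    created_at::timestamp   as created_at
from {{ source('server_db_1', 'customers') }}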

Intermediate models: int_ layer

What I keep seeing very often is that data analysts and BI developers use ephemeral materialisations in their model’s lineage.

Using ephemeral materialisations is recommended by dbt Labs at early stages, as it makes simple CTEs reusable.

  • I wouldn’t recommend using ephemeral constructs as they don’t produce an output, are hard to unit test for data transformation logic and don’t support model contracts. Writing unit tests for any ephemeral CTE inside a data transformation pipeline is problematic.
  • The int_ abbreviation might seem confusing for data developers, as analysts often use it to name internal data sources. I wouldn’t use it in my naming convention. However, it’s a matter of taste and some analysts would differ.

According to common sense, complex transformations are not expected in this data model layer.

The models in this layer should be relatively simple. If the model includes many complex transformations, it may be more suitable for a business logic layer (biz_) or a data mart layer (mrt_).

Usually, this layer is used for temporary transformations. If a model is not materialised for the sake of query optimisation, it should be ephemeral (in 100% of cases).

This is ultimately a good thing as it reduces the database clutter and allows reusable logic for simple tasks.

Ephemerals are a good choice for basic data preparation, deduplication and JSON extracts early on, but only if there is no need for unit testing (as opposed to data testing) of the actual data transformation logic.
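A minimal sketch of such an ephemeral model (hypothetical names), used only as reusable preparation logic with no persisted output:

-- int_orders_deduplicated.sql (sketch)
{{ config(materialized='ephemeral') }}

select
    order_id,
    customer_id,
    updated_at
from {{ ref('base_orders') }}
qualify row_number() over (partition by order_id order by updated_at desc) = 1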

Dimension layer: dim_

Input model layer types:

  • int
  • dim

This presumes that fact data also flows into this layer. All common data transformations are allowed here, e.g. joins, aggregations and filtering.

We can create persisted fact tables for improved query performance if needed.

I prefer naming dimension entities with the dim_ prefix. Fact tables are fact by default so no need to prefix them.
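As a sketch (hypothetical names), a persisted fact table in this layer can be materialised as a table with clustering to improve query performance:

-- orders.sql (sketch): a persisted fact table joining a conformed dimension
{{ config(
    materialized='table',
    cluster_by=['order_date']
) }}

select
    o.order_id,
    c.customer_key,
    o.order_total,
    o.order_date
from {{ ref('base_orders') }} o
join {{ ref('dim_customer') }} c
    on o.customer_id = c.customer_id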

I previously wrote about the benefits of the dimensional approach here:

Data Modelling For Data Engineers

Mart and Biz layers: mrt or biz prefix (or suffix)

It’s a data model layer with materialised entities containing some complex data transformation logic. Using incremental strategies here is considered best practice.

Often the mrt_ layer provides an input for biz_, and in this case materialisation or data pre-calculation is not necessary – a simple view is enough. However, it all depends on the data volumes we process.
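For instance, when data volumes are modest, a mart can remain a lightweight view that the biz_ layer reads from (model and column names below are hypothetical):

-- mrt_finance_orders.sql (sketch)
{{ config(materialized='view') }}

select
    order_id,
    customer_key,
    order_total,
    order_date
from {{ ref('orders') }}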

A good example of an incremental table update can be found here:

Advanced SQL techniques for beginners

Data platform and/or data warehouse architecture

For data modelling, I would recommend templating your dbt / Dataform project and using custom database names to split the data environment between production, dev and test data. The data model naming convention can then be simplified by moving the raw_ and base_ prefixes into the database names:

DATABASE_NAME    SCHEMA_NAME
------------------------------------
RAW_DEV          SERVER_DB_1   -- mocked data
RAW_DEV          SERVER_DB_2   -- mocked data
RAW_DEV          EVENTS        -- mocked data
RAW_PROD         SERVER_DB_1   -- real production data from pipelines
RAW_PROD         SERVER_DB_2   -- real production data from pipelines
RAW_PROD         EVENTS        -- real production data from pipelines
...
BASE_PROD        EVENTS        -- enriched data
BASE_DEV         EVENTS        -- enriched data
...
ANALYTICS_PROD   REPORTING     -- materialised queries and aggregates
ANALYTICS_DEV    REPORTING
ANALYTICS_PROD   AD_HOC        -- ad-hoc queries and views

Using custom database names helps to simplify this environment split. To dynamically inject custom database names, we just need to override the generate_database_name macro:

-- ./macros/generate_database_name.sql
-- Use the database from the model config when provided,
-- otherwise fall back to the target database from profiles.yml.
{% macro generate_database_name(custom_database_name=none, node=none) -%}
    {%- set default_database = target.database -%}
    {%- if custom_database_name is none -%}
        {{ default_database }}
    {%- else -%}
        {{ custom_database_name | trim }}
    {%- endif -%}
{%- endmacro %}

So now, whenever we compile our models, dbt will apply the custom database name from the model’s config, or fall back to the database of the selected target (see the profiles.yml sketch after the list below), i.e.:

  • dbt run -t dev -> select * from raw_dev.schema.table
  • dbt run -t prod -> select * from raw_prod.schema.table
  • dbt run -t test -> select * from raw_test.schema.table
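Those per-target database names come from the connection profile. A minimal profiles.yml sketch, assuming a Snowflake adapter with placeholder account, user, role and warehouse values (authentication omitted for brevity):

# ./profiles.yml (sketch)
my_dwh:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: my_account      # placeholder
      user: my_user            # placeholder
      role: transformer
      warehouse: transforming
      database: raw_dev        # used when no custom database is set in the model config
      schema: analytics
    prod:
      type: snowflake
      account: my_account
      user: my_user
      role: transformer
      warehouse: transforming
      database: raw_prod
      schema: analytics
    test:
      type: snowflake
      account: my_account
      user: my_user
      role: transformer
      warehouse: transforming
      database: raw_test
      schema: analytics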

Personally, I try to design and introduce a base_ data model layer with minimal data manipulation at the column level. Sometimes it is worth doing to get better query performance:

-- simplified project structure:
.
└── models
    └── some_data_source
        ├── _data_source_model__docs.md
        ├── _data_source__models.yml
        ├── _sources.yml  -- raw data table declarations
        ├── base -- base transformations, e.g. JSON to columns
        │   ├── base_transactions.sql
        │   └── base_orders.sql
        └── analytics -- deeply enriched, production-grade data, QA'ed
            ├── _analytics__models.yml
            ├── some_model.sql
            └── some_other_model.sql
-- extended example with various data sources and marts:
└── models
    ├── int -- only if required and 100% necessary for reusable logic
    │   └── finance
    │       ├── _int_finance__models.yml
    │       └── int_payments_pivoted_to_orders.sql
    ├── marts -- deeply enriched, QAed data with complex transformations
    │   ├── finance
    │   │   ├── _finance__models.yml
    │   │   ├── orders.sql
    │   │   └── payments.sql
    │   └── marketing
    │       ├── _marketing__models.yml
    │       └── customers.sql
    └── src (or staging) -- raw data with basic transformations applied
        ├── some_data_source
        │   ├── _data_source_model__docs.md
        │   ├── _data_source__models.yml
        │   ├── _sources.yml
        │   └── base
        │       ├── base_transactions.sql
        │       └── base_orders.sql
        └── another_data_source
            ├── _data_source_model__docs.md
            ├── _data_source__models.yml
            ├── _sources.yml
            └── base
                ├── base_marketing.sql
                └── base_events.sql

A good practice would be to use the following techniques:

  • Use persisted materialization and clustering for biz_ layer objects if needed
  • Avoid using Google Sheets as a source
  • Avoid using dbt seeds. Seeds accept only CSV files and shouldn’t be used to populate tables. Try seeding _test database tables using a custom materialization instead, e.g. a SQL statement which generates an output and can be referenced (and is also visible on the data lineage graph):
--./macros/operation.sql
{%- materialization operation, default -%}

  {%- set identifier = model['alias'] -%}
  {%- set target_relation = api.Relation.create(
        identifier=identifier, schema=schema, database=database,
        type='table') -%}

  -- ... setup database ...
  -- ... run pre-hooks ...

  -- build model
  {% call statement('main') -%}
    {{ sql }}
  {%- endcall %}

  -- ... run post-hooks ...
  -- ... clean up the database ...

  -- `COMMIT` happens here
  {{ adapter.commit() }}

  -- Return the relations created in this materialization
  {{ return({'relations': [target_relation]}) }}

{%- endmaterialization -%}

Now if we add it to our model config as an operation it will simply run the SQL and we will be able to reference it using the ref() function:

{{ config(
    materialized='operation',
    alias='table_a',
    schema='schema',
    tags=["tag"]
) }}
create or replace table {{this.database}}.{{this.schema}}.{{this.name}} (
     id      number(38,0)   
    ,comments varchar(100)  
);
# dbt run --select table_a.sql
19:10:08  Concurrency: 1 threads (target='dev')
19:10:08  
19:10:08  1 of 1 START sql operation model schema.table_a ................................ [RUN]
19:10:09  1 of 1 OK created sql operation model schema.table_a ........................... [SUCCESS 1 in 1.10s]
19:10:09  
19:10:09  Finished running 1 operation model in 0 hours 0 minutes and 4.16 seconds (4.16s).
19:10:09  
19:10:09  Completed successfully
19:10:09  
19:10:09  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Now if we want to have another table or a view that can reference this operation we can use the standard ref() function and our table_a would appear as a dependency in data lineage:

Image by author
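For example, a downstream model (hypothetical name) can now depend on the seeded table:

-- table_b.sql (sketch): table_a appears as an upstream dependency in the lineage graph
select
    id,
    comments
from {{ ref('table_a') }}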
  • I would recommend using incremental updates with clustering and incremental predicates, e.g.
-- Sample SQL for standard transaction table
-- will use conditional ref in incremental strategy
-- depends_on: {{ ref('transactions_view') }}
{{
    config(
        materialized='incremental',
        unique_key='id',
        on_schema_change='append_new_columns',
        incremental_strategy='merge',
        cluster_by = ['updated_at'],
        incremental_predicates = [
            "DBT_INTERNAL_SOURCE.load_ts > dateadd(day, -30, current_date)"
        ],
        schema='source_db',
        alias='transactions_',
        query_tag = 'transactions__incremental',
        tags=["incremental", "transactions", "unit_testing"]
    )
}}
select
    *
{% if is_incremental() %}
    from {{ref('transactions_view')}}
{% else %}
    from {{ref('transactions__log')}}
{% endif %}
where
    id is not null
qualify
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at::TIMESTAMP DESC ) = 1
  • Avoid select * patterns and consider moving attribution_with_paid_ads and paid_ads_without_attribution into separate model files, then unit test them.

Testing the model logic

This is a very important part where we can run tests for our data model logic. Do you unit test your Python functions? I tend to do it in a similar way here.

Consider the example below. I use the dbt_utils.equality test to compare the expected dataset with the table we get after we run our model:

version: 2

models:
  - name: my_model_i_want_to_test
    database: |
      {%- if  target.name == "dev" -%} raw_dev
      {%- elif target.name == "prod"  -%} raw_prod
      {%- elif target.name == "test"  -%} raw_test
      {%- else -%} invalid_database
      {%- endif -%}
    enabled: true
    description:
    tests:
      - dbt_utils.equality:
          tags: ['unit_testing']
          compare_model: ref('my_model_i_want_to_test_test_expected')
          compare_columns:
            - col_1
            - col_2
            ...
    config:
      alias: my_model_i_want_to_test
      schema: marketing

Now if we run dbt build --select tag:unit_testing -t test in our command line, dbt will build our source table (transactions) and compare the model with the expected table built in the test environment. This table is called my_model_i_want_to_test_test_expected and is referenced using our favourite ref() function.

-- my_model_i_want_to_test_test_expected.sql
{{ config(
    materialized='table',
    schema='some_schema',
    query_tag = 'tag',
    tags=["unit_testing"]
) }}
select
'{
  "data": {
    ...
  }
}'...
...
...
dbt build --select tag:unit_testing -t test
[RUN]
19:40:35  2 of 2 PASS dbt_utils_source_equality__db_my_model_i_want_to_test_test_expected_..._col1__ref_..._test_expected_  [PASS in 1.53s]
19:40:35  
19:40:35  Finished running 1 table model, 1 test in 0 hours 0 minutes and 10.97 seconds (10.97s).
19:40:35  
19:40:35  Completed successfully
19:40:35  
19:40:35  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

dbt_utils.equality will compile our test into something like this to check that the actual table generated after the run equals the expected table we defined for the test:

with a as (
    select * from raw_test.schema.my_model_i_want_to_test
),

b as (
    select * from raw_test.schema.my_model_i_want_to_test_test_expected
),

a_minus_b as (

    select filename from a

    except
    select filename from b

),

b_minus_a as (
    select filename from b

    except
    select filename from a

),

unioned as (
    select 'a_minus_b' as which_diff, a_minus_b.* from a_minus_b
    union all
    select 'b_minus_a' as which_diff, b_minus_a.* from b_minus_a

)

select * from unioned

    ) dbt_internal_test

This feature is very useful: this way we can test any model with just one extra package (dbt_utils).
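If it is not installed yet, dbt_utils is added via packages.yml and dbt deps; a version range along these lines (adjust to your dbt version) would work:

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]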

dbt goes even further: starting with dbt-core version 1.8, it ships independent unit tests that look even better.

Unit tests | dbt Developer Hub
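A minimal sketch of such a native unit test, assuming a hypothetical orders model that keeps the latest row per id from a stg_orders input:

# models/_unit_tests.yml (sketch)
unit_tests:
  - name: test_orders_keeps_latest_row_per_id
    model: orders
    given:
      - input: ref('stg_orders')
        rows:
          - {id: 1, updated_at: "2024-01-01"}
          - {id: 1, updated_at: "2024-01-02"}
    expect:
      rows:
        - {id: 1, updated_at: "2024-01-02"}

dbt builds the model on top of the mocked input rows and compares the result with the expected rows, with no extra packages required.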

Conclusion

It’s always great to have an opportunity to design a DWH solution from scratch. In this case, we are lucky not to rely on any legacy processes and OLTP databases. Often this is not the case, and the world might seem a bit more complicated. It’s good to bear in mind that the new naming convention must be usable and easily applicable to any existing infrastructure resources we have.

I would recommend using consistent naming conventions for DWH objects and enriching them with metadata as necessary. At the very least, data marts must follow the naming convention with meaningful business names. Other raw_/source_ and live_/prod_ schema object names can keep the existing naming convention, as long as they are enriched with metadata and flow clearly downstream into data mart entities on the data lineage graph. This gives a hint to end users and serves as a single source of truth, but it also makes the solution maintainable and user-friendly.

If we need to use names longer than 30 characters, it would probably be a good idea to come up with abbreviation rules. That would make development easier in the future.

Building a SQL transformation pipeline with multiple models is a challenging task. It is not only about data frames and CTEs but also about the different data model layers, because they help to keep the code maintainable and split the logic between different data environments. During the process, it is important to keep things organised. I summarised some techniques for convenient structuring of our data transformation steps that result in a nice DWH design.

Data Warehouse Design Patterns

Model metadata is very important. No matter what data platform architecture we use, I would always recommend using consistent naming conventions for DWH objects and enriching them with metadata as necessary. This gives a hint to end users and serves as a single source of truth, but it also makes the solution maintainable and user-friendly.

The final data model might look very complex and difficult to read. In some companies, this can be considered an anti-pattern in SQL development. Breaking that SQL into smaller parts makes our code reusable and easier to maintain. Modern data build tools (Dataform, dbt, etc.) offer a set of useful features to improve this process. We can power our SQL templates by injecting pieces of code using macros, variables and constants. From my experience, this, combined with infrastructure as code, helps to ensure adequate CI/CD processes, which saves time during development. It helps to organise and simplify the data environment split between dev, live and automated testing so we can focus on business logic and the continuous improvement of our state-of-the-art data platform.

Recommended read:

[1] https://medium.com/towards-data-science/building-a-data-warehouse-9696b238b2da

[2] https://towardsdatascience.com/database-data-transformation-for-data-engineers-6404ed8e6000

[3] https://towardsdatascience.com/data-modelling-for-data-engineers-93d058efa302

[4] https://towardsdatascience.com/advanced-sql-techniques-for-beginners-211851a28488

[5] https://docs.getdbt.com/docs/build/unit-tests

