
An Introduction To Analytics Engineering

Who Analytics Engineers are and what they are supposed to do

Image generated via DALL-E2

Traditionally, data teams consisted of Data Engineers and Data Analysts.

Data Engineers are responsible for building the infrastructure that supports data operations. This includes configuring databases and implementing ETL processes that ingest data from external sources into a destination system (perhaps another database). Furthermore, Data Engineers are typically in charge of ensuring data integrity, freshness and security, so that Analysts can then query the data. A typical Data Engineer skillset includes Python (or Java), SQL, orchestration (using tools such as Apache Airflow) and data modeling.

On the other hand, Data Analysts are supposed to build dashboards and reports using Excel or SQL in order to provide business insights to internal users and departments.

Traditional formation of Data Teams

Transitioning From ETL to ELT

In order to process data and gain valuable insights, we first need to extract it, right? 🤯

Data Ingestion is performed using ETL (and more recently ELT) processes. Both the ETL and ELT paradigms involve three main steps: Extract, Transform and Load. For now, let’s ignore the order in which these steps are executed and focus on what each step does on its own.

Extract

This step refers to the process of pulling data from a persistent source. This source could be a database, an API endpoint, a file or a message queue.

Extract step pulls data from various sources – Source: Author
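
As a quick illustration, here is a minimal Python sketch of an extract step pulling records from a REST endpoint. The URL and response shape are assumptions made for the example, not part of any real system:

```python
import requests

def extract(api_url: str) -> list[dict]:
    """Pull raw records from a REST API endpoint."""
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()

# Example usage (hypothetical endpoint):
# raw_records = extract("https://api.example.com/v1/orders")
```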

Transform

In the Transform step, the pipeline applies changes to the structure and/or format of the data in order to achieve a certain goal. A transformation could be a value modification (e.g. mapping "United States" to "US"), an attribute selection, a numerical calculation or a join.

The transformation step performs a number of transformations on the input raw data – Source: Author
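
To make this concrete, the sketch below uses pandas to apply exactly the kinds of transformations listed above: a value modification, a numerical calculation and an attribute selection. The column names (country, quantity, unit_price, order_id) are illustrative assumptions:

```python
import pandas as pd

def transform(raw_records: list[dict]) -> pd.DataFrame:
    """Apply a few simple transformations to the raw records."""
    df = pd.DataFrame(raw_records)

    # Value modification: map full country names to short codes
    df["country"] = df["country"].replace({"United States": "US"})

    # Numerical calculation: derive a total amount per record
    df["total"] = df["quantity"] * df["unit_price"]

    # Attribute selection: keep only the columns downstream users need
    return df[["order_id", "country", "total"]]
```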

Load

This step refers to the process of moving data (either raw or transformed) into a destination system. The target is usually either an OLTP system, such as a database, or an OLAP system, such as a Data Warehouse.

Loading data into a destination system – Source: Author
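
A minimal load step could look like the sketch below, which writes a DataFrame into a destination table via SQLAlchemy. The connection string and table name are illustrative assumptions:

```python
import pandas as pd
from sqlalchemy import create_engine

def load(df: pd.DataFrame, table_name: str, connection_uri: str) -> None:
    """Write a DataFrame into a destination table, replacing it if it exists."""
    engine = create_engine(connection_uri)
    df.to_sql(table_name, engine, if_exists="replace", index=False)

# Example usage (hypothetical Postgres connection string):
# load(df, "orders", "postgresql://user:password@localhost:5432/warehouse")
```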

ETL: Extract → Transform → Load

ETL refers to the process where the data extraction step is followed by the transformation step and ends with the load step.

A visual representation of an ETL process – Source: Author

The data transformation step in ETL processes occurs in a staging environment outside of the target system, where the data is transformed just before it gets loaded to the destination.
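
Putting the earlier sketches together, an ETL pipeline is simply the composition of the three functions, with the transformation happening before anything reaches the destination (all names remain hypothetical):

```python
def run_etl(api_url: str, table_name: str, connection_uri: str) -> None:
    raw_records = extract(api_url)        # 1. pull data from the source
    df = transform(raw_records)           # 2. reshape it in a staging environment
    load(df, table_name, connection_uri)  # 3. only transformed data reaches the target
```

Note that the raw records never reach the target system, which is precisely the second drawback discussed below.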

ETL has been around for a while, but its use has slowly started to fade out, for two main reasons:

  1. Since the transformation happens on an intermediate (staging) server, there is an overhead in moving the transformed data into the target system.
  2. The target system won’t contain the raw data (i.e. the data in the format prior to the transformation). This means that whenever additional transformations are required, we would have to pull the raw data once again.

The emergence of Cloud technologies has shifted the way data is ingested and transformed. Data Warehouses hosted on the cloud have made it possible to store huge volumes of data at a very low cost. So is there really a need to apply transformations "on the fly" and discard the raw data every time a transformation is performed?

ELT: Extract → Load → Transform

ELT refers to a process where the extraction step is followed by the load step and the final data transformation step happens at the very end.

A visual representation of an ELT process – Source: Author

In contrast to ETL, in ELT no staging environment/server is required since data transformation is performed within the destination system, which is usually a Data Warehouse or Data Lake hosted on the Cloud.

In addition, the raw data lives in the destination system and is thus available for further transformations at any time.
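
In code, the only change compared to the ETL sketch earlier is the order of the steps: the raw data is landed first, and the transformation then runs inside the warehouse as SQL. Table and column names are still illustrative assumptions:

```python
import pandas as pd
from sqlalchemy import create_engine, text

def run_elt(api_url: str, connection_uri: str) -> None:
    engine = create_engine(connection_uri)

    # Extract + Load: land the raw data in the warehouse untouched
    # (reuses extract() from the earlier sketch)
    raw_df = pd.DataFrame(extract(api_url))
    raw_df.to_sql("raw_orders", engine, if_exists="replace", index=False)

    # Transform: runs *inside* the destination system; the raw table is kept
    with engine.begin() as conn:
        conn.execute(text("DROP TABLE IF EXISTS orders_clean"))
        conn.execute(text("""
            CREATE TABLE orders_clean AS
            SELECT order_id,
                   CASE WHEN country = 'United States' THEN 'US'
                        ELSE country END AS country,
                   quantity * unit_price AS total
            FROM raw_orders
        """))
```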


Analytics Engineering

As a reminder, in older data team formations, engineers were in charge of maintaining the ETL layer while analysts were responsible for the creation of dashboards and reporting. But the question now is: where do Analytics Engineers fit into the picture?

In older data team formations, Data Engineers were responsible for ETL and Data Analysts for reporting – Source: Author

Analytics Engineers are essentially the link between Data Engineers and Data Analysts. Their responsibility is to take the raw data and apply transformations, so that Data Analysts can then use the transformed data to prepare Dashboards and Reports on the Business Intelligence layer, enabling internal users to make data-informed decisions. Data Engineers, in turn, can focus more on the ingestion layer and the wider infrastructure of the data platform.

In ELT pipelines, Data Engineers are responsible for Extraction and Load of data in a Data Warehouse, Analytics Engineers for the data transformation layer and Analysts for the creation of business dashboards – Source: Author

dbt: The ultimate tool for Analytics Engineering

Analytics Engineers are people who can help data teams scale and move faster. But to do so, they also need to take advantage of tools that can help them get the job done. And the ultimate Analytics Engineering tool is data build tool (dbt).

dbt is a tool for building and managing data models in a scalable and cost-effective fashion. Instead of you having to figure out all the inter-dependencies between models in order to decide the sequence in which they must be executed, dbt does all the dirty work for you. Furthermore, it provides functionality for data quality tests, source freshness checks and documentation, among other things.

In order to better understand what dbt does, it’s important to visualise the wider context and see where it fits within the modern data stack. dbt sits on the T layer of an ELT pipeline: transformations are performed within the Data Warehouse where the raw data resides.

Using dbt to perform transformations over raw data within the Data Warehouse – Source: Author

dbt is a CLI (Command Line Interface) tool that enables Analytics Engineering teams to deploy and manage data models following software engineering best practices. Some of these practices include support for multiple environments (development and production), version control and CI/CD (Continuous Integration and Continuous Deployment). Data models can be written in (Jinja-templated) SQL, but more recent versions of the tool also support model definitions in Python!
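
As a taste of the latter, here is a minimal sketch of what a dbt Python model can look like. The upstream model name (stg_orders) and the columns are assumptions, and the exact dataframe API depends on the warehouse adapter (e.g. Snowpark on Snowflake or PySpark on Databricks):

```python
# models/orders_enriched.py: a sketch of a dbt Python model
def model(dbt, session):
    # Tell dbt how to materialise this model
    dbt.config(materialized="table")

    # dbt.ref() declares the dependency on an upstream model,
    # which is how dbt infers the execution order automatically
    orders = dbt.ref("stg_orders")  # hypothetical upstream model

    # Adapter-dependent: Snowpark dataframes, for example,
    # can be converted to pandas for familiar transformations
    df = orders.to_pandas()
    df["total"] = df["quantity"] * df["unit_price"]  # hypothetical columns
    return df
```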


Final Thoughts

Analytics Engineering is an emerging field at the intersection of Data Engineering and Data Analytics that aims to speed up the development of analytics products, improve data quality and increase trust in data. The main tool facilitating the lifecycle of data products is dbt, which has drastically changed the way data teams work and collaborate. It is therefore important to familiarise yourself with it, since it’s here to stay for the long run.

In upcoming articles we are going to focus more on dbt and how you can use it to build and manage your data models effectively. So make sure to subscribe in order to be notified when the articles are out!

