In this article, I want to tackle some of the biggest challenges data engineers face when working with pipelines throughout the data lifecycle. Understanding how to manage the data lifecycle is key in our constantly changing field. As a data engineer, I often deal with huge volumes of different types of data, including unstructured data, coming from various sources like databases, data lakes, and third-party APIs. These factors can make managing critical data really tough. We’ll cover all the important stages of data processing, from collection and analysis to storage and destruction, and I’ll share the best practices I use every day.
Data lifecycle management
Data lifecycle management gives businesses a strategic, regulated approach to organising and managing data from source to its final state, such as archiving or destruction.
Essentially, it is a set of policies designed to maximise the value of data throughout its useful life – from creation to the point where it becomes obsolete or must be destroyed to meet compliance regulations.
The typical data lifecycle follows a well-known ETL pattern.
- Data Sources: Data is created somewhere – this is the data creation stage. Sources can be external or internal: APIs (CRM systems, accounting software, etc.), relational databases (MySQL, Postgres), cloud storage and many others, including sources we build ourselves.
- Data Collection: Extract ("E" in ETL). We would want to extract data from the data source and do something with it – either load "as is" or transform first and then load.
- Data Ingestion: ( "T" and "L" in ETL). We are talking about ELT / ETL services that can transform our data, handle different file formats and orchestrate data ingestion into a data warehouse solution (DWH) or data lake.
- Data Storage: it all depends on our data pipeline design pattern [2] and can be a data lake, DWH or OLAP database. It would typically perform a storage function, i.e., a landing area for data files and be a proxy stage of many other pipelines.
- Data Governance: a set of strategic policies to make our data always available to end users, secure and accurate.
- Business Intelligence and Reporting – creating reports is obviously another challenging task. BI developers usually do this.
- Data Orchestration: Finally we would want to orchestrate all this madness effectively.
As a result, our data platform infrastructure often ends up looking very complex.

Data creation stage
Pain point 1: Data source management and data observability
Data lineage looks okay but where is this data coming from?
Managing multiple data sources can become a significant challenge when orchestrating data extraction from numerous origins. Consider the complexity of integrating data from a dozen APIs, relational databases, and native platforms like Google Analytics 4 via Firebase into your data warehouse. Effective management of these diverse inputs is crucial, and it starts with the meticulous declaration of every source.
It’s all about data sources and their declarations.
Solution:
- Use Git and metadata to declare sources.
- Add data source descriptions, data origins and owners.
- Add sources to your data lineage graphs so everyone can find the information about them.
- Create a single source of truth for all data pipelines using a modern data modelling tool.
- Deploy data observability tests to understand data health and the state of your data better.
Advanced tools such as Dataform (by Google) and DBT offer comprehensive features to streamline and optimize the management of these data sources, ensuring efficient and organized data integration. Consider this snippet below:
version: 2

sources:
  - name: sales_data
    description: Data source description and origins of data.
    meta:
      owner: data source owner  # custom metadata such as ownership
    database: analytics
    schema: raw_sales
    tables:
      - name: transactions
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'completed', 'returned']
          - name: customer_id
      - name: customers
Data collection and ingestion
Indeed, these two steps are often combined because they are performed by the same data service. It can be anything capable of efficient data manipulation, e.g. a PySpark application built in Databricks or AWS Glue, or a tiny Cloud Function invoked by message queue events (RabbitMQ, SQS, Pub/Sub, Kafka, etc.).
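As an illustration of the last point, here is a minimal sketch of what such a Cloud Function might look like in Python, assuming the Pub/Sub message carries the URI of a newly landed file; the bucket, table and field names are hypothetical:

import base64
import json

from google.cloud import bigquery


def ingest_event(event, context):
    """Cloud Function entry point triggered by a Pub/Sub message.

    The message payload is expected to contain the GCS URI of a newly
    landed file, e.g. {"uri": "gs://my-landing-bucket/sales/2024-01-01.json"}.
    """
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )

    # Load the file straight from cloud storage into the raw layer.
    load_job = client.load_table_from_uri(
        payload["uri"],
        "raw_prod.sales_transactions",  # hypothetical destination table
        job_config=job_config,
    )
    load_job.result()  # wait for completion so failures surface in the logs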
Pain point 2:
Lack of functionality in out-of-the-box (OOTB) solutions available in the market.
In data collection or extraction we can always work with managed tools like Fivetran or Stitch but sometimes they don’t have the capabilities you need.
For example, consider a data pipeline that generates invoices using sales data from the DWH. These invoices must be uploaded into your accounting system, and such an integration rarely exists out of the box. You would want to build your own data service and orchestrate invoice uploads as needed, keep it warm, and refresh access tokens regularly so the service can authenticate successfully.
Solution:
If you are a data engineer then you are the solution.
Why would someone need to pay for any third-party service if they have a data engineer with relevant programming skills? Check out these ETL techniques that I use daily to design and deploy durable data pipelines.
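To make the invoice example above more concrete, here is a simplified sketch of such a custom data service in Python; the endpoints, field names and token lifetime are hypothetical and would depend on your accounting system's API:

import time

import requests

TOKEN_URL = "https://accounting.example.com/oauth/token"      # hypothetical
INVOICE_URL = "https://accounting.example.com/api/invoices"   # hypothetical

_token = {"value": None, "expires_at": 0.0}


def get_access_token(client_id: str, client_secret: str) -> str:
    """Return a cached OAuth token, refreshing it shortly before expiry."""
    if time.time() > _token["expires_at"] - 60:
        response = requests.post(
            TOKEN_URL,
            data={
                "grant_type": "client_credentials",
                "client_id": client_id,
                "client_secret": client_secret,
            },
            timeout=30,
        )
        response.raise_for_status()
        body = response.json()
        _token["value"] = body["access_token"]
        _token["expires_at"] = time.time() + body.get("expires_in", 3600)
    return _token["value"]


def upload_invoice(invoice: dict, client_id: str, client_secret: str) -> None:
    """Push one invoice built from DWH sales data into the accounting system."""
    token = get_access_token(client_id, client_secret)
    response = requests.post(
        INVOICE_URL,
        json=invoice,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()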
Pain point 3:
Third-party tools are too expensive
Whether it’s Fivetran, Stitch or anything else, their pricing models are usually row-based for the data we need to extract. This would typically result in a thousand-dollar bill every month.
Solution:
- Build robust, durable and cost-effective data pipelines which are unit-tested for side effects.
- Make sure your code is idempotent, reusable and easy to maintain (see the sketch after this list for one way to achieve idempotency).
- Provision resources effectively with infrastructure as code.
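As a sketch of the idempotency point above (assuming BigQuery and a date-partitioned table; the table name and date format are hypothetical), re-running the load for the same day simply replaces that day's partition instead of appending duplicates:

from google.cloud import bigquery


def load_partition(run_date: str, uri: str) -> None:
    """Idempotent load: re-running for the same date overwrites the same
    partition instead of appending duplicate rows.

    run_date is expected in YYYYMMDD format, e.g. "20240101".
    """
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition="WRITE_TRUNCATE",  # replace the partition contents
    )

    # The "$YYYYMMDD" partition decorator targets a single partition only.
    destination = f"raw_prod.user_events${run_date}"  # hypothetical table
    client.load_table_from_uri(uri, destination, job_config=job_config).result()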
Building Durable Data Pipelines
Pain point 4: Scalability, Insight Delays and Operational Overheads
Choosing the right architecture for your data pipelines is the key to success
Designing robust data pipelines is a complex task that demands thorough requirements analysis. Focus on business needs to determine the "what" and "why" of data processing. The "how" is defined through functional requirements, developed in collaboration with technical teams. For instance, if business units require real-time analytics, data freshness will dictate whether to use batch or streaming processing.
A deep understanding of business logic is essential for selecting the appropriate tools. Consider a simple pipeline that transfers data from a business or mart model in the data warehouse into a Google spreadsheet – this is straightforward for any data engineer.
However, when dealing with highly sensitive data in a transaction monitoring compliance pipeline, more sophisticated solutions are necessary to meet regulatory requirements for Personally Identifiable Information (PII) and effectively manage data input/output features.
Solution:
- Focus on high-level business requirements first; be project-focused and understand the limitations and deadlines.
- Collect all functional requirements to fulfil what business needs.
- Ensure you understand the skillset of your end users and main stakeholders – this will define the data tools.
- Finally, choose the data pipeline design pattern that fits well into this framework.
Data storage
In many companies data storage is the key data platform component used to stage, preserve, categorise and archive the data we extract from other systems and use in analytics.

Pain point 5:
Storage costs, schema drifts, file formats and partitioning layout – how to optimise and which ones to use?
Even a data warehouse solution has a storage component, which translates into associated costs. It’s a known fact that modern DWH tools have decoupled compute and storage resources, and even though storage costs are relatively low, they accumulate over time. Consider a super large dataset with an activity schema and terabytes of user engagement events generated daily. Now imagine this being loaded into your BigQuery data warehouse. The good news is that partitions which haven’t been modified for 90 days are billed at the cheaper long-term storage rate; everything that is still being actively written to stays on the more expensive active storage rate.

ORC vs Parquet vs AVRO. No JSON?
In the application world, a huge amount of data is gathered and kept in JSON format. So why not store it as JSON? Simply because JSON doesn’t carry any schema information, so processing it with big data and Hadoop tools can be slow. Basically, it’s a no-go for big data processing frameworks. This is the main reason the Parquet and ORC formats were created.
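A tiny sketch with pyarrow shows what "carrying a schema" means in practice – the Parquet file stores column types in its footer, so any engine reading it gets them back without guessing (the toy data and file name are just for illustration):

import pyarrow as pa
import pyarrow.parquet as pq

# A tiny batch of events that would normally arrive as JSON.
events = pa.table({
    "user_id": [1, 2, 3],
    "event": ["view", "click", "purchase"],
    "revenue": [0.0, 0.0, 9.99],
})

# Parquet stores the data in columns and embeds the schema in the file footer.
pq.write_table(events, "events.parquet", compression="snappy")

# Any engine reading the file gets the types back without guessing.
print(pq.read_schema("events.parquet"))
# user_id: int64
# event: string
# revenue: double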
Solution:
- Storing data in an optimized storage class is essential for efficiency. Utilizing cloud storage provides the flexibility to process data across various platforms within the data lake using Spark, AWS Glue, or Databricks. It facilitates reloading data into the data warehouse if necessary.
- Parquet and ORC store data in columns and offer a higher compression ratio than AVRO. If your data structure might change over time and you need schema evolution support, then choose AVRO. ORC is usually considered the best file format for Hive, whereas Parquet is considered the optimal choice across the wider Hadoop ecosystem.
- Create a cloud storage bucket for an archive with a Hive partitioning layout using one of the major cloud providers (Google, AWS, or Azure).
- Establish a routine to monitor and offload obsolete data, archiving it to reduce costs – see the lifecycle sketch below.
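Here is a minimal sketch of such an archiving routine using Google Cloud Storage lifecycle rules in Python; the bucket name and retention periods are hypothetical:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("events-archive")  # hypothetical archive bucket

# Move objects to a colder storage class after 90 days
# and delete them once the retention period (here 3 years) has passed.
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=90)
bucket.add_lifecycle_delete_rule(age=365 * 3)
bucket.patch()  # persist the updated lifecycle configuration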
Consider this code below. It shows how easy it is to unload the data and then load it back:
-- export data from BigQuery to Google Cloud Storage
EXPORT DATA
  OPTIONS (
    uri = 'gs://events-export/public-project/events-*.json',
    format = 'JSON', -- or CSV, Parquet
    overwrite = true
  )
AS (
  SELECT *
  FROM `firebase-public-project.analytics_153293282.events_20181001`
);
Load data back if needed:
LOAD DATA INTO source.json_external_test
FROM FILES (
  format = 'JSON',
  uris = ['gs://events-export/public-project/*']
);
When your Stack is a Lake House
Pain point 6:
Security, major access controls and data retention policies
If we are dealing with sensitive company data then it must be protected. We would want to ensure that only authorized users have access and that the data we hold can be treated as a single source of truth for every dataset and model we have.
Solution:
- Use VPC to restrict access. Virtual private clouds are a good practice to ensure your data is isolated from the outer world.
- Apply versioning to your data. This ensures data consistency in case any user or service working with the data lake changes the data.
- Back up data regularly and introduce data loss prevention policies.
Consider the Terraform example below. It enables versioning on an AWS S3 bucket. It’s a simple thing, but it helps keep your data lineage clean and protected, ensuring the data lake remains a single source of truth – even with database replication pipelines where data can easily be manipulated, updated or erased.
resource "aws_s3_bucket" "example" {
bucket = "example-bucket"
}
resource "aws_s3_bucket_acl" "example" {
bucket = aws_s3_bucket.example.id
acl = "private"
}
resource "aws_s3_bucket_versioning" "versioning_example" {
bucket = aws_s3_bucket.example.id
versioning_configuration {
status = "Enabled"
}
}
Data governance
Data Governance is all about making data accurate, secure and available to main stakeholders.
Pain point 7: Granular access controls for major stakeholders and data availability.
Sometimes it is a good idea to treat the DWH solution as a separate data processing stage. Indeed, data management, role-based access controls and robust data governance features are what make these tools so practical in modern data stacks [3]. I have discussed this in more detail in a previous story.
Solution:
- If not using a data warehouse apply identity-based access controls for data lake datasets.
- Apply row/column-based access controls for your DWH data where needed.
- Apply object tagging and tag-based masking policies.
- Implement alerting using access history. If you are a DBA then you must have it under control.
- Apply regular data quality checks. They can be scheduled or run in CI for every pull request.
For example, DBT has a set of generic and singular tests that we might want to run on data regularly. This is very easy to implement like so:
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id
As an example, in one of my previous stories I wrote how to create email alerts in BigQuery using row conditions for data quality:
Automated emails and data quality checks for your data
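The idea boils down to running a row-condition query and sending an email whenever it returns rows. Here is a simplified sketch; the SQL condition, table name and SMTP settings are hypothetical:

import smtplib
from email.message import EmailMessage

from google.cloud import bigquery

# Hypothetical row condition: orders with negative totals should never exist.
CHECK_SQL = """
    SELECT order_id, total
    FROM `analytics.reporting.orders`
    WHERE total < 0
"""


def run_quality_check() -> None:
    client = bigquery.Client()
    bad_rows = list(client.query(CHECK_SQL).result())

    if not bad_rows:
        return  # all good, no alert needed

    msg = EmailMessage()
    msg["Subject"] = f"Data quality alert: {len(bad_rows)} bad order rows"
    msg["From"] = "alerts@example.com"
    msg["To"] = "data-team@example.com"
    msg.set_content("\n".join(str(dict(row)) for row in bad_rows[:20]))

    with smtplib.SMTP("smtp.example.com", 587) as server:  # hypothetical SMTP host
        server.starttls()
        # server.login(...) here if your SMTP host requires authentication
        server.send_message(msg)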
Pain point 8:
Data quality and testing
This is what matters most in data development. You would want to follow the "Don’t Repeat Yourself" (DRY) modular approach and create data models that are templated, easy to maintain, reusable and unit-tested.
- Introduce separate data environments and separate production, development and testing areas in both DWH and data lake solutions.
- Use data modelling and data lineage graphs for your data pipelines.
- Run unit tests for every data transformation you have.
Splitting data environments makes perfect sense as you wouldn’t want staging data to mix up with your production one. In this light, it would be best to design your data platform to have data environment layers split like in this example below:
DATABASE_NAME   SCHEMA_NAME
---------------------------------
RAW_DEV         SERVER_DB_1   -- mocked data
RAW_DEV         SERVER_DB_2   -- mocked data
RAW_DEV         EVENTS        -- mocked data
RAW_PROD        SERVER_DB_1   -- real production data from pipelines
RAW_PROD        SERVER_DB_2   -- real production data from pipelines
RAW_PROD        EVENTS        -- real production data from pipelines
...
BASE_PROD       EVENTS        -- enriched data
BASE_DEV        EVENTS        -- enriched data
...
ANALYTICS_PROD  REPORTING     -- materialised queries and aggregates
ANALYTICS_DEV   REPORTING
ANALYTICS_PROD  AD_HOC        -- ad-hoc queries and views
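A minimal sketch of how a pipeline can resolve the target environment at runtime, so the same code writes to the _DEV or _PROD layer (the variable and table names are just for illustration):

import os

# The pipeline picks its target layer from an environment variable,
# so the same code writes to RAW_DEV locally and RAW_PROD in production.
ENV = os.getenv("DATA_ENV", "DEV").upper()  # "DEV" or "PROD"

RAW_DATABASE = f"RAW_{ENV}"
BASE_DATABASE = f"BASE_{ENV}"
ANALYTICS_DATABASE = f"ANALYTICS_{ENV}"

target_table = f"{RAW_DATABASE}.SERVER_DB_1.users"
print(target_table)  # RAW_DEV.SERVER_DB_1.users when DATA_ENV is not set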
SQL unit tests are very useful: they ensure that your data transformation logic is preserved and stays the same unless you change it intentionally.
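Here is a minimal sketch of such a test; SQLite stands in for the warehouse purely to keep the example self-contained, and the table and transformation are hypothetical:

import sqlite3

# The transformation under test: deduplicate raw orders, keeping the latest row.
TRANSFORM_SQL = """
    SELECT order_id, MAX(updated_at) AS updated_at
    FROM raw_orders
    GROUP BY order_id
"""


def test_orders_are_deduplicated():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (order_id INT, updated_at TEXT)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?)",
        [(1, "2024-01-01"), (1, "2024-01-02"), (2, "2024-01-01")],
    )

    result = conn.execute(TRANSFORM_SQL).fetchall()

    # The logic must stay the same unless we change it intentionally.
    assert sorted(result) == [(1, "2024-01-02"), (2, "2024-01-01")]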
I have written about this approach in more detail before.
Think of data governance as a set of policies and tools designed to guide key data stakeholders, especially management, on the proper use of data. Some data warehouse solutions, like Snowflake, offer object categorization for potentially personal or sensitive data. This feature can help ensure compliance with privacy regulations when necessary.
Business Intelligence and Reporting
This is the most interesting stage, as I keep seeing BI tools becoming increasingly packed with features that are typically implemented in earlier data lifecycle stages.
Pain point 9:
My BI tool is too expensive
Indeed, modern reporting solutions are expensive, especially if they are meant for enterprise-level companies. What should we do if we are an SME?
Well, I would say one single data engineer can replace numerous features in modern BI tools.
Solution:
- Optimise data modelling: BI with data modelling capabilities seems like overkill in the DBT era. All SQL data transformations should be implemented before they reach your BI solution. In that case, paying for two Looker instances (development and production) doesn’t make much sense to me, and we can save a lot of money.
- BI and dedicated OLAP cubes: Why would you need something that processes your data outside your data warehouse? It doesn’t make sense to pay twice by using a BI tool with this feature.
- Caching and data freshness: Even free community-based BI solutions like Looker Studio have this functionality. Data freshness tests can be implemented using either row conditions or DBT generic tests, and a data engineer can schedule analytics model materialisation and trigger a full refresh when needed (see the sketch after this list).
- Semantic layer: This is a great example. Think of it as a model with dimensions, metrics or measures for each dimension, and clear instructions on how they all connect. That’s why I like Looker; it takes data modelling to the next level. But what if I told you that you can get this for free? Try the DBT semantic layer with dbt-metricflow and you’ll see that it is not that difficult to do it yourself.
- Git and source control: This is my favourite one. At the moment there are just a few BI tools with this feature. The most popular ones are Looker and Sisense, but what if I told you that we can go open-source and do it ourselves? Try Apache Superset, create virtual datasets, source-control them in your repository and, finally, deploy them using the API.
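As promised above, here is a simplified sketch of a data freshness check; the model name, column and threshold are hypothetical:

import datetime as dt

from google.cloud import bigquery

FRESHNESS_THRESHOLD = dt.timedelta(hours=24)


def check_freshness() -> bool:
    """Return True if the reporting model was refreshed within the threshold."""
    client = bigquery.Client()
    query = """
        SELECT MAX(updated_at) AS last_update
        FROM `analytics.reporting.daily_revenue`  -- hypothetical model
    """
    row = list(client.query(query).result())[0]
    if row.last_update is None:
        return False  # no data at all counts as stale
    lag = dt.datetime.now(dt.timezone.utc) - row.last_update
    return lag <= FRESHNESS_THRESHOLD


if __name__ == "__main__":
    if not check_freshness():
        raise SystemExit("Data is stale – trigger a full refresh.")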
Data pipeline orchestration
Now we need to orchestrate this modern data stack and all the fancy pipelines we have built.
Pain point 10:
Orchestration
My previous research highlights that an emphasis on heterogeneity, support for multiple programming languages, effective use of metadata, and the adoption of data mesh architecture are essential trends in data pipeline orchestration for building modern, robust, and scalable data platforms.
Solution:
- If you rely on legacy architecture built with Apache Airflow or AWS Step Functions, then I would recommend sticking with that stack. However, if you’re looking for a more robust and heterogeneous solution, then the Mage orchestrator might be a better fit.
- If you need to orchestrate a lot of machine learning pipelines I would highly recommend trying Flyte.
The support for multiple programming languages is expected to be a significant trend in the coming years. For example, Temporal accommodates various languages and runtimes, whereas Airflow primarily emphasizes Python.
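For completeness, here is a minimal sketch of what a Python-centric DAG looks like with the Airflow 2.x TaskFlow API; the schedule, names and task bodies are placeholders:

import datetime

from airflow.decorators import dag, task


@dag(
    schedule="@daily",
    start_date=datetime.datetime(2024, 1, 1),
    catchup=False,
    tags=["example"],
)
def sales_pipeline():
    @task
    def extract() -> list[dict]:
        # Placeholder: pull rows from an API or a source database.
        return [{"order_id": 1, "total": 9.99}]

    @task
    def load(rows: list[dict]) -> None:
        # Placeholder: write rows to the raw layer of the warehouse.
        print(f"Loading {len(rows)} rows")

    load(extract())


sales_pipeline()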
Conclusion
Third-party tools for data extraction often fail to meet specific needs, and their pricing models – typically based on the volume of data extracted – can be prohibitively expensive. As a data engineer, it is important to know how to build and deploy durable, cost-effective data pipelines.
Data storage presents another challenge that may not be immediately apparent. Utilizing more efficient storage classes is essential. Cloud storage offers the flexibility to process data in various environments within the data lake, such as Spark, AWS Glue, or Databricks, and facilitates reloading data into the data warehouse when necessary. Implementing versioning and protection for your data lake buckets is a best practice, as is establishing data retention policies that specify which data should be archived and for how long. Ensure archived data is stored in standard big data formats to maintain accessibility as technology evolves.
Data governance should be viewed as a comprehensive set of policies and tools guiding key data stakeholders, particularly management, in the proper usage of data. This strategy ensures high-quality data is available to all involved parties.
Business intelligence (BI) tools with data modelling capabilities may seem excessive in the DBT era. All SQL data transformations should be completed before they reach your BI solution, aligning with the principle that everything should be unit-tested prior to deployment. This approach exemplifies data governance and data modelling at their finest.
Recommended read:
[1] https://towardsdatascience.com/python-for-data-engineers-f3d5db59b6dd
[2] https://towardsdatascience.com/building-durable-data-pipelines-cf3cbf68a7e6
[3] https://towardsdatascience.com/data-pipeline-design-patterns-100afa4b93e3
[4] https://pub.towardsai.net/when-your-stack-is-a-lake-house-6bcb17f9bff6
[5] https://towardsdatascience.com/modern-data-warehousing-2b1b0486ce4a
[6] https://towardsdatascience.com/automated-emails-and-data-quality-checks-for-your-data-1de86ed47cf0
[7] https://towardsdatascience.com/how-data-engineering-evolved-since-2014-9cc85f37fea6
[8] https://towardsdatascience.com/advanced-data-modelling-1e496578bc91
[10] https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_versioning
[11] https://docs.getdbt.com/docs/build/data-tests
[12] https://towardsdatascience.com/database-data-transformation-for-data-engineers-6404ed8e6000