
How Data Engineering Evolved since 2014

Top trends to help your data pipelines scale with ease

AI generated image using Kandinsky

In this discussion, I aim to explore the evolving trends in data orchestration and data modelling, highlighting the advancements in tools and their core benefits for data engineers. While Airflow has been the dominant player since 2014, the Data Engineering landscape has significantly transformed, now addressing more sophisticated use cases and requirements, including support for multiple programming languages, integrations, and enhanced scalability. I will examine contemporary and perhaps unconventional tools that streamline my data engineering processes, enabling me to effortlessly create, manage, and orchestrate robust, durable, and scalable data pipelines.


During the last decade we witnessed a "Cambrian explosion" of ETL frameworks for data extraction, transformation and orchestration. It’s no surprise that many of them are open-source and Python-based.

The most popular ones:

  • Airflow, 2014
  • Luigi, 2014
  • Prefect, 2018
  • Temporal, 2019
  • Flyte, 2020
  • Dagster, 2020
  • Mage, 2021
  • Orchestra, 2023

Apache Airflow, 2014

Apache Airflow was created at Airbnb in 2014, open-sourced in 2015, joined the Apache Incubator in 2016 and became a top-level Apache Software Foundation project in 2019. It’s a platform where you can programmatically create, schedule, and monitor workflows. Airflow popularized the DAG (Directed Acyclic Graph) concept in orchestration: using Python, engineers define DAGs to organize and visualize tasks, an approach now found in almost every orchestrator – a "Prefect DAG", for example, looks almost identical. The interface is simple and makes it easy to monitor data pipelines:

Image by author

DAGs are easy to understand and are written in Python:

import json
from datetime import datetime

from airflow.decorators import dag, task

data_string = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'
default_args = {"owner": "airflow", "start_date": datetime(2024, 1, 1), "retries": 1}

@dag(default_args=default_args, tags=['etl'])
def etl_pipeline():

    @task()
    def extract() -> dict:
        # Parse the raw JSON payload into a dict of orders
        return json.loads(data_string)

    @task(multiple_outputs=True)
    def transform(order_data_dict: dict) -> dict:
        # multiple_outputs turns each dict key into its own XCom value
        return {"total_count": len(order_data_dict)}

    @task()
    def load(total_count: int):
        print(f"Total order count is: {total_count}")

    extracted = extract()
    transformed = transform(extracted)
    load(transformed["total_count"])

etl_pipeline()

In one of my previous stories I wrote about a more advanced Airflow example:

How to Become a Data Engineer

Advantages:

  • Simple and Robust UI: Detailed views of DAGs, retries, task durations and failures – everything looks so familiar after all these years.
  • Community Support: After all these years the open-source project has a massive community of followers. It is very popular and offers plugins for a huge range of data sources and destinations. Combined with regular updates, this makes it an obvious choice for many data engineers.
  • Python: Being Python-native makes everything flexible and easy to customize. In one of my stories many years ago I customized an ML connector, and it was very easy to do [1].

Training your ML model using Google AI Platform and Custom Environment containers

  • Jinja and Parameters: This feature will be familiar to many DBT users; it lets us create dynamic pipelines much like the templated models we build in DBT (see the sketch below).
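As a rough illustration (not from the original story), Airflow renders templated operator fields with Jinja at runtime; the DAG, task and parameter names below are hypothetical.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="templated_report", start_date=datetime(2024, 1, 1), tags=['etl']) as dag:
    # {{ ds }} is the logical date; params behave much like dbt variables
    daily_report = BashOperator(
        task_id="daily_report",
        bash_command="echo 'report for {{ ds }}, limit {{ params.limit }} rows'",
        params={"limit": 100},
    )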

Luigi, 2014

Another orchestrator for our ETL pipelines. Developed by Spotify to handle massive data processing workloads, Luigi features a command-line interface and excellent visualization capabilities. However, even simple ETL pipelines require some Python programming skills. Nonetheless, Luigi works well in many situations.
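To illustrate the programming model, here is a minimal, hypothetical Luigi pipeline with two dependent tasks; the class names and file paths are made up for illustration.

import luigi

class ExtractOrders(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # Luigi marks the task complete only once this target exists, keeping runs atomic
        return luigi.LocalTarget(f"data/orders_{self.date}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("order_id,amount\n1001,301.27\n")

class LoadOrders(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        # Declares the dependency graph: ExtractOrders must finish first
        return ExtractOrders(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"data/orders_{self.date}.loaded")

    def run(self):
        with self.input().open("r") as src, self.output().open("w") as dst:
            dst.write(f"loaded {len(src.readlines()) - 1} rows\n")

if __name__ == "__main__":
    luigi.run()

Running python this_file.py LoadOrders --date 2024-01-01 --local-scheduler executes the chain and skips any task whose output already exists.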

Advantages

  • HDFS Support: Luigi gives teams a handy toolbox with task templates, including file system abstractions for HDFS and local files. This helps keep operations atomic and data pipelines consistent, making everything run smoothly.
  • Easy to use: Strong community of followers and contributors, and the project is actively maintained.
  • Nice UI: The web UI lets you search, filter and prioritize tasks, which makes it easy to work with pipeline dependencies.
  • Robust infrastructure: Luigi supports complex pipelines spanning multiple tools, A/B tests, recommendations and external reports.

From my experience, it’s excellent for strict and straightforward pipelines, though implementing complex branching logic can be challenging. Even Spotify moved away from it in 2019, saying it was hard to maintain and that they needed multi-language support [2].

Why We Switched Our Data Orchestration Service – Spotify Engineering

They moved to Flyte (more below).

Prefect, 2018

Established in 2018, Prefect aims to overcome some of the challenges engineers face with other orchestration tools like Airflow. Prefect employs a much simpler model: instead of explicitly declaring a DAG structure, you turn plain Python functions into flows and tasks (a "Prefect DAG") using decorators.

It offers a hybrid execution model, allowing tasks to be executed locally or on any cloud platform.

Prefect promotes itself as "the new standard in dataflow automation."

Advantages:

  1. Open-source: Flexibility and multiple deployment options.
  2. Lightweight: With just one command we can set up our development environment for orchestration.
  3. Dynamic Mapping: Enables dynamic task execution based on another task’s output.
  4. Python-first: Prefect’s Python-first experience provides cleaner abstractions than Airflow’s while staying familiar to Airflow users.
  5. Monitoring: Monitor key operational metrics and run failures using Prefect’s Cloud UI. It simply looks intuitive and nice.
  6. Alerts and Notifications: Discord, Email, Slack and many more channels for our pipeline alerts.
Image by author

The open-source version provides everything we typically use in Airflow, including scheduling, native secrets, retries, concurrency management, scaling functionality and great logging.

With Prefect Cloud we get a few extras, e.g. Automations, Webhooks, Event feeds, Workspaces and Organisations, tailored to the hosted environment we might need.

The key components are the tasks and flows we build. Each step of the pipeline can be described with Prefect’s @task decorator – the base unit of a flow. For each task, we can supply settings using arguments – tags, descriptions, retries, cache settings, etc. Consider the @task definitions below.

@task
def extract_data(source, pipe):
    # Pull raw data from the source using the supplied pipe settings
    ...
    return result

@task
def load_data_to_dwh(data, database):
    # Load the data into the warehouse (e.g. BigQuery) and return the table name
    ...
    return table_name

@task
def run_data_checks(dataset, row_conditions):
    # Run row-level quality checks against the loaded dataset
    ...

In the example below, we create a flow using the @flow decorator. The flow executes all tasks in order, generating inputs and outputs and passing them from one task to another.

@flow
def etl_workflow():
    raw_data = extract_data(google_spreadsheet, pipe_object_with_settings)
    table_name = load_data_to_dwh(raw_data, database)
    run_data_checks(table_name, rules)

if __name__ == "__main__":
    etl_workflow()

Prefect uses Work Pools to manage work allocation efficiently and prioritize tasks in a required environment (test, dev, prod) for optimal performance and automated testing. We can create Workers (agents) locally or in the cloud.

Prefect can be installed with pip:

pip install -U "prefect==2.17.1"
# or get the latest version from prefect
pip install -U git+https://github.com/PrefectHQ/prefect

Here is a simple script to extract data from the NASA API:


# ./asteroids.py
import requests
API_KEY="fsMlsu69Y7KdMNB4P2m9sqIpw5TGuF9IuYkhURzW"
ASTEROIDS_API_URL="https://api.nasa.gov/neo/rest/v1/feed"

def get_asteroids_data():
    print('Fetching data from NASA Asteroids API...')
    session = requests.Session()
    url=ASTEROIDS_API_URL
    apiKey=API_KEY
    requestParams = {
        'api_key': apiKey,
        'start_date': '2023-04-20',
        'end_date': '2023-04-21'
    }
    response = session.get(url, params=requestParams)
    print(response.status_code)
    near_earth_objects = (response.json())['near_earth_objects']
    return near_earth_objects

if __name__ == "__main__":
    get_asteroids_data()

We can turn it into a flow like so:

# my_nasa_pipeline.py
import requests   # an HTTP client library and dependency of Prefect
from prefect import flow, task

API_KEY = "fsMlsu69Y7KdMNB4P2m9sqIpw5TGuF9IuYkhURzW"
ASTEROIDS_API_URL = "https://api.nasa.gov/neo/rest/v1/feed"

@task(retries=2)
def get_asteroids_data(api_key: str, url: str):
    """Get asteroids close to Earth for specific dates
    - will retry twice after failing"""
    print('Fetching data from NASA Asteroids API...')
    session = requests.Session()
    requestParams = {
        'api_key': api_key,
        'start_date': '2023-04-20',
        'end_date': '2023-04-21'
    }
    response = session.get(url, params=requestParams)
    print(response.status_code)
    near_earth_objects = (response.json())['near_earth_objects']
    return near_earth_objects

@task
def save_to_s3(data):
    """Save data to S3 storage"""
    # Do some ETL here; for this demo we just print the data
    # and return a placeholder S3 location
    print(data)
    return "s3://my-bucket/asteroids/2023-04-21.json"

@flow(log_prints=True)
def asteroids_info(date: str = "2023-04-21"):
    """
    Given a date, saves data to S3 storage
    """
    asteroids_data = get_asteroids_data(API_KEY, ASTEROIDS_API_URL)
    print(f"Close to Eart asteroids: {asteroids_data}")

    s3_location = save_to_s3(asteroids_data)
    print(f"Saved to: {s3_location}")

if __name__ == "__main__":
    asteroids_info("2023-04-21")

Run the flow:

python my_nasa_pipeline.py
Image by author

Creating a Prefect server

Now we want to create a deployment to schedule our flow runs:

# create_deployment.py
from prefect import flow

if __name__ == "__main__":
    flow.from_source(
        # source="https://github.com/your_repo/prefect_nasa.git",
        source="./",
        entrypoint="my_nasa_pipeline.py:asteroids_info",
    ).deploy(
        name="my-first-deployment",
        work_pool_name="test-pool",
        cron="0 1 * * *",
    )

In your command line run:

python create_deployment.py
# Run the workflow manually
# prefect deployment run 'repo-info/my-first-deployment'
Image by author

To run our scheduled flow, we need to create a work pool, start a worker and start the server:

prefect work-pool create test-pool
prefect worker start --pool 'test-pool'
prefect server start
Image by author
Image by author

Alternatively, we can run this command and follow the prompt to create a deployment:


prefect deploy ./my_nasa_pipeline.py:asteroids_info -n my-first-deployment
Image by author

Prefect integrations

Prefect becomes even more powerful with its robust set of integrations, a full list of which can be found here [3].

For example, we can use Prefect with DBT, which makes our data modelling even more powerful. In your command line run:

pip install -U prefect-dbt
# register blocks to start using them in Prefect
prefect block register -m  prefect_dbt

And now we can use Prefect with dbt-core:

from prefect import flow
from prefect_dbt.cli.commands import DbtCoreOperation

@flow
def trigger_dbt_flow() -> str:
    result = DbtCoreOperation(
        commands=["pwd", "dbt debug", "dbt run"],
        project_dir="PROJECT-DIRECTORY-PLACEHOLDER",
        profiles_dir="PROFILES-DIRECTORY-PLACEHOLDER"
    ).run()
    return result

if __name__ == "__main__":
    trigger_dbt_flow()

In one of my stories I wrote about how to configure DBT for convenient execution from any environment:

Database Data Transformation for Data Engineers

Temporal, 2019

Temporal supports workflow triggering through APIs and allows for multiple concurrent workflow executions.

Advantages:

  • Fault-tolerance and Retries: Fault tolerance with automatic retries for any tasks that fail. The ability to manage long-running workflows that may last for several days or even months.
  • Scalability: Capabilities to accommodate high-throughput workloads.
  • Enhanced monitoring: Insight into workflow execution and historical data.
  • Support for temporal queries and event-driven workflows.
  • Heterogeneity: Multiple programming language support.

Temporal can orchestrate intricate data processing workflows, managing failures effectively while maintaining data consistency and reliability throughout the entire process. Its capability to manage states across distributed systems allows Temporal to enhance resource allocation.

Temporal accommodates continuously running workflows, enabling the modelling of the lifecycles of various entities. Temporal workflows are inherently dynamic and capable of executing multi-step core business logic. They can send or await signals from external processes, facilitating notifications to humans or initiating processes for intervention.
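As a rough sketch of what this looks like in code, here is a hypothetical ETL workflow written with Temporal’s Python SDK (temporalio); the workflow, activity and argument names are illustrative, not from the article.

from datetime import timedelta

from temporalio import activity, workflow

@activity.defn
async def extract_orders(source: str) -> int:
    # Activities hold the side effects (API calls, DB writes); Temporal retries them on failure
    print(f"Extracting orders from {source}...")
    return 42

@workflow.defn
class EtlWorkflow:
    @workflow.run
    async def run(self, source: str) -> str:
        # Workflow code is deterministic; its state is durable and survives worker restarts
        row_count = await workflow.execute_activity(
            extract_orders,
            source,
            start_to_close_timeout=timedelta(minutes=5),
        )
        return f"Loaded {row_count} rows from {source}"

To execute it, the workflow and activity are registered with a temporalio Worker and started via a Client connected to a Temporal server; the same definitions can run concurrently across many workers.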

Flyte, 2020

Flyte is an open-source orchestration tool for managing Machine Learning and AI workflows, built on Kubernetes. With data being crucial for businesses, running large-scale compute jobs is essential but operationally challenging. Scaling, monitoring, and managing clusters can burden product teams, slowing innovation. These workflows also have complex data dependencies, making collaboration difficult without platform abstraction.

Image by author

Flyte aims to boost development speed for machine learning and data processing by simplifying these tasks. It ensures reliable, scalable compute, letting teams focus on business logic. Additionally, it promotes sharing and reuse across teams, solving problems once and enhancing collaboration as data and machine learning roles merge.

Advantages

  • End-to-end Data Lineage: This allows you to monitor the health of your data and ML workflows throughout each execution stage. Easily analyze data flows to pinpoint the source of any errors.
  • Parameters and Caching: Designed with ML in mind it allows dynamic pipelines and use of cached pre-computed artifacts. For example, during hyperparameter optimization, you can easily apply different parameters for each run. If a task has already been computed in a previous execution, Flyte will efficiently use the cached output, saving time and resources.
  • Multi-tenant and Serverless: Flyte eliminates the need to manage infrastructure, letting you focus on business challenges instead of machines. As a multi-tenant service, it provides an isolated repository for your work, allowing you to deploy and scale independently. Your code is versioned, and containerized with its dependencies, and every execution is reproducible.
  • Extensible: Flyte tasks can be highly complex. It can be a simple ETL job or a call to remote Hive cluster or distributed Spark executions. The best solution might be hosted elsewhere, so task extensibility allows you to integrate external solutions into Flyte and your infrastructure.
  • Heterogeneous and multi-language support: Data pipelines can be complex. Each step can be written in a different language and use various frameworks. One step could use Spark to prepare data, while the next step trains a deep learning model.
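To make the "Parameters and Caching" point above concrete, here is a minimal, hypothetical Flytekit sketch; the task names, inputs and cache settings are illustrative rather than taken from a real project.

from typing import List

from flytekit import task, workflow

@task(cache=True, cache_version="1.0")
def prepare_features(limit: int) -> List[float]:
    # Cached task: re-runs with the same inputs reuse the stored output
    return [float(i) for i in range(limit)]

@task
def train_model(features: List[float]) -> float:
    # Placeholder "training" step returning a dummy score
    return sum(features) / max(len(features), 1)

@workflow
def training_pipeline(limit: int = 10) -> float:
    # Flyte workflows pass strongly typed outputs between tasks
    features = prepare_features(limit=limit)
    return train_model(features=features)

if __name__ == "__main__":
    # Runs locally; on a Flyte cluster each task becomes a containerized, versioned execution
    print(training_pipeline(limit=5))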

Dagster, 2020

Dagster is an open-source data orchestrator that facilitates the development, management, and monitoring of data pipelines. It supports job monitoring, debugging, data asset inspection, and backfill execution.

Essentially, Dagster serves as a framework for building data pipelines, acting as the abstraction layer within your data ecosystem.

Heterogeneous Everything

Advantages:

  • Heterogeneous: Dagster offers a comprehensive interface that enables users to build, test, deploy, run, and refine their data pipelines all in one place. This unified approach streamlines the workflow for data engineers, making it easier to manage the entire lifecycle of data processing.
  • Improved Abstractions: It employs declarative programming through software-defined assets (SDAs), which enhances abstraction and simplifies pipeline design. Users benefit from shared, reusable, and configurable components that facilitate efficient data processing. Additionally, Dagster includes declarative scheduling features that allow users to implement freshness policies, ensuring data is up-to-date.
  • Testing Capabilities: To ensure the integrity of data, Dagster incorporates quality checks using types that define the acceptable values for the inputs and outputs of your pipelines. It also supports asset versioning for both code and data, along with caching mechanisms that enhance performance.
  • Great Monitoring Features: Dagster comes equipped with a built-in observability dashboard, providing real-time monitoring of pipeline performance and health. Its built-in testability feature allows users to verify the functionality of their pipelines seamlessly.
  • Robust Integrations: Dagster offers robust integrations with various tools in your data ecosystem, including Airflow, dbt, Databricks, Snowflake, Fivetran, Great Expectations, Spark, and Pandas.
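To make the software-defined assets idea concrete, here is a minimal, hypothetical Dagster sketch; the asset names and logic are illustrative.

from dagster import Definitions, asset

@asset
def raw_orders():
    # In a real pipeline this would pull from an API or a warehouse table
    return [{"id": 1001, "amount": 301.27}, {"id": 1002, "amount": 433.21}]

@asset
def order_totals(raw_orders):
    # Depends on raw_orders simply by naming it as an input parameter
    return sum(order["amount"] for order in raw_orders)

# Definitions expose the assets to the Dagster UI, schedulers and sensors
defs = Definitions(assets=[raw_orders, order_totals])

Saving this as, say, assets.py and running dagster dev -f assets.py brings up the observability dashboard, where both assets can be materialized and monitored.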

Mage, 2021

It looks like Mage was created with a focus on speed and scalability, tailored specifically for containerized environments such as Kubernetes. Created in 2021, Mage was developed to meet the increasing need for real-time data processing in microservice architectures.

With Mage, we can use multiple programming languages such as R, Python and SQL combined with robust template functionality. Mage definitely introduced a couple of new things in the orchestration space.

Mage’s DBT support. Image by author.

Advantages:

  • DRY Components and DBT support: Built with a focus on modular design; for example, we can integrate different repos, packages and components with ease, and build, run and manage dbt models just as easily. Very useful for Data Mesh platforms.
  • Cost Effectiveness: Claims to optimize resource provisioning and consumption, with obvious cost benefits.
  • Native Kubernetes: Easy deployments for modern data platform architectures.
  • Real-time data pipelines: Works beautifully with real-time data. Ingesting and transforming streaming data is a real pain point for many companies.
  • Built-in Integrations: Supports dozens of sources and destinations, e.g. Amazon S3, BigQuery, Redshift, PowerBI, Tableau, Salesforce, Snowflake, and more.

Mage’s user interface also offers the capability to preview the pipelines you create prior to deployment.

This feature allows users to visualize and examine the structure and functionality of their pipelines, ensuring everything is set up correctly before going live. By providing this preview option, Mage helps users identify any potential issues or adjustments that may be needed, ultimately leading to a smoother deployment process and enhancing overall workflow efficiency.

Mage’s key concepts are pretty much the same as in Prefect: projects, blocks, pipeline, data product, backfill, trigger and runs.

The interactive notebook UI, which lets me preview my code’s output, is one of my favourite features.
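For a feel of the block-based model, below is a rough sketch of a Mage data-loader block with a test; Mage normally generates this scaffolding for you in the notebook UI, and the URL here is hypothetical.

import io

import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test

@data_loader
def load_data_from_api(*args, **kwargs):
    # Load a CSV from a (hypothetical) URL into a DataFrame for downstream transformer blocks
    url = 'https://example.com/orders.csv'
    response = requests.get(url)
    return pd.read_csv(io.StringIO(response.text), sep=',')

@test
def test_output(output, *args) -> None:
    # Mage runs this check against the block's output after each execution
    assert output is not None, 'The output is undefined'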

Orchestra, 2023

Created in 2023 by Hugo Lu, Orchestra is a new-generation orchestration solution focusing on a serverless architecture to unify everything in one platform.

Advantages:

  • Serverless Modular Architecture: You don’t need a Kubernetes cluster anymore.
  • Deliver fast: Build enterprise-grade orchestration in minutes.
  • Enhanced Monitoring: Don’t waste your time identifying bugs.
  • State of the Art integrations: Hundreds of out-of-the-box integrations.

Conclusion

This research highlights that an emphasis on heterogeneity, support for multiple programming languages, effective use of metadata, and the adoption of data mesh architecture are essential trends in data pipeline orchestration for building modern, robust, and scalable data platforms.

For instance, Apache Airflow provides a variety of pre-built data connectors that facilitate seamless ETL tasks across various cloud vendors, including AWS, GCP, and Azure. However, it lacks features such as intra-task checkpointing and caching, and it is not optimized for machine learning pipelines.

The support for multiple programming languages is expected to be a significant trend in the coming years. For example, Temporal accommodates various languages and runtimes, whereas Airflow primarily emphasizes Python.

Collaboration in the Data Mesh era is the key to success in the data space. Your organization may have separate teams for data management, classification models, and forecasting models, all of which can collaboratively utilize the same platform – Mage or Flyte, for example. This allows them to operate within the same workspace without interfering with one another’s work.

Although the evolution of tools is truly fascinating, the basic principles of data engineering remain the same:

Python for Data Engineers

Modern Data Engineering

Recommended read

[1] https://www.datacamp.com/tutorial/ml-workflow-orchestration-with-prefect

[2] https://docs.prefect.io/latest/concepts/work-pools/

[3] https://docs.prefect.io/latest/integrations/prefect-dbt/

[4] https://docs.prefect.io/latest/integrations/prefect-dbt/

[5] https://engineering.atspotify.com/2022/03/why-we-switched-our-data-orchestration-service/

[6] https://thenewstack.io/flyte-an-open-source-orchestrator-for-ml-ai-workflows/

[7] https://atlan.com/dagster-data-orchestration/

