
Airflow Data Intervals: A Deep Dive

Building idempotent and re-playable data pipelines

Photo by Gareth David on Unsplash

Apache Airflow is a powerful orchestration tool for scheduling and monitoring workflows, but its behaviour can sometimes feel counterintuitive, especially when it comes to data intervals.

Understanding these intervals is crucial for building reliable data pipelines, ensuring idempotency, and enabling replayability. By leveraging data intervals effectively, you can guarantee that your workflows produce consistent and accurate results, even under retries or backfills.

In this article, we’ll explore Airflow’s data intervals in detail, discuss the reasoning behind their design, why they were introduced, and how they can simplify and enhance day-to-day Data Engineering work.


What Are Data Intervals in Airflow?

Data intervals sit at the heart of how Apache Airflow schedules and executes workflows. Simply put, a data interval represents the specific time range that a DAG run is responsible for processing.

For instance, in a daily-scheduled DAG, each data interval starts at midnight (00:00) and ends at midnight the following day (24:00). The DAG executes only after the data interval has ended, ensuring that the data for that interval is complete and ready for processing.

The intuition behind data intervals

The introduction of data intervals was driven by the need to standardize and simplify how workflows operate on time-based data. In many data engineering scenarios, tasks need to process data for a specific period, such as hourly logs or daily transaction records. Without a clear concept of data intervals, workflows might operate on incomplete data or overlap with other runs, leading to inconsistencies and potential errors.

Data intervals provide a structured way to:

  • Ensure that DAGs process only the data they are meant to handle
  • Avoid issues caused by data that arrives late or is still being written
  • Enable accurate backfilling and replaying of workflows for historical periods
  • Support idempotency by ensuring that each DAG run processes data for a clearly defined and immutable time range. This prevents duplication or missed data when workflows are retried or replayed

By tying each DAG run to a specific data interval, Airflow ensures that tasks can be safely retried without affecting other runs, and workflows can be replayed with confidence, knowing that they will produce consistent results every time.

How do data intervals work?

When a DAG is scheduled, Airflow assigns each run a logical date, which corresponds to the start of its data interval. This logical date is not the actual execution time but rather a marker for the data interval. Each data interval is defined by:

  • Data Interval Start: The beginning of the time range the DAG run is responsible for processing.
  • Data Interval End: The end of the time range for the same DAG run.
  • Logical Date: A timestamp representing the start of the data interval, used as a reference point for the DAG run.

For example:

If a DAG is scheduled to run daily with a start_date of January 1, 2025, the first data interval will have:

  • Data Interval Start: January 1, 2025, 00:00
  • Data Interval End: January 2, 2025, 00:00
  • Logical Date: January 1, 2025
  • The DAG run for this interval will execute after the interval ends, at midnight on January 2, 2025.

This approach ensures that the DAG operates on a complete set of data for the interval it represents.
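
To make these three values concrete, the short sketch below prints them from a task's runtime context. It is a minimal example assuming Airflow 2.x; the DAG id and task id are illustrative.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def show_interval(**context):
    # Airflow passes the runtime context into the callable's keyword arguments
    print(f"Data interval start: {context['data_interval_start']}")
    print(f"Data interval end:   {context['data_interval_end']}")
    print(f"Logical date:        {context['logical_date']}")

with DAG(
    dag_id='interval_inspection',   # illustrative name
    schedule_interval='@daily',
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(task_id='show_interval', python_callable=show_interval)

For the run covering January 1, 2025, this task would log a data interval start of 2025-01-01 00:00, a data interval end of 2025-01-02 00:00, and a logical date of 2025-01-01.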


The role of start_date

The start_date in a DAG defines the logical start of the first data interval. It is a critical parameter that determines when a DAG begins tracking its data intervals and influences backfilling capabilities.

When you set a start_date, it serves as a reference point for Airflow to calculate data intervals. Once again, it’s worth highlighting that a DAG does not execute immediately upon deployment; it waits for the first data interval to end before initiating a run. This ensures that the data for the interval is complete and ready for processing.

Why start_date Matters

The start_date also plays a significant role in how Airflow determines the boundaries of data intervals. By defining a fixed starting point, Airflow ensures that all intervals are consistently calculated, enabling workflows to process data in a structured and reliable manner.


  1. Logical Start of Intervals: The start_date establishes the beginning of the first data interval, which serves as the foundation for all subsequent intervals. For example, if your start_date is January 1, 2025, and the schedule is @daily, the first interval will span January 1, 2025, 00:00 to January 2, 2025, 00:00.
  2. Alignment with Schedules: Ensuring that the start_date aligns with the DAG’s schedule is crucial for predictable execution. A mismatch between the two can lead to unintended gaps or overlaps in processing.
  3. Backfilling Support: A thoughtfully chosen start_date enables you to backfill historical data by triggering DAG runs for past intervals. This capability is vital for workflows that need to process older datasets.
  4. Reproducibility: By anchoring DAG runs to a consistent start_date, you ensure reproducible workflows that operate predictably across different environments and timeframes.

In the example DAG presented below,

  • The DAG is scheduled to run daily starting January 1, 2025
  • The catchup=True parameter ensures that if the DAG is deployed after the start_date, it will backfill all missed intervals
Python">from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id='example_dag',
    schedule_interval='@daily',
    start_date=datetime(2025, 1, 1),
    catchup=True,  # Enables backfilling
) as dag:
    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    start >> end

If the DAG is deployed at 10:00 AM on January 1, 2025, the first run will not execute immediately. Instead, it will run at midnight on January 2, 2025, processing data for January 1, 2025.

If the DAG is deployed on January 3, 2025, it will backfill runs for January 1 and January 2, 2025, before executing the current interval.
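
Missed or historical intervals can also be re-run on demand with Airflow's backfill CLI (airflow dags backfill), which accepts explicit start and end dates for the range you want to process.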

Common Pitfalls with start_date

While the start_date parameter is straightforward, its improper configuration can lead to unexpected behavior or workflow failures. Understanding these pitfalls is crucial to avoid common mistakes and ensure smooth execution.

Choosing an Arbitrary start_date: Picking a random start_date without considering the data availability or schedule can lead to failed runs or incomplete processing.

Using Dynamic Values: Avoid setting the start_date to a dynamic value like datetime.now(). Since the start_date is meant to be a fixed reference point, using a dynamic value can result in unpredictable intervals and make backfilling impossible.

Ignoring Time Zones: Always ensure that the start_date accounts for the time zone of your data source to avoid misaligned intervals.

Misconfigured Catchup Settings: If catchup is disabled, missed intervals will not be processed, potentially leading to data gaps.
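
As a sketch that sidesteps the dynamic-value and time-zone pitfalls, the snippet below pins the start_date to a fixed, time-zone-aware timestamp using pendulum (the datetime library Airflow itself relies on) and keeps the catchup decision explicit. The DAG id and time zone are illustrative.

import pendulum

from airflow import DAG

with DAG(
    dag_id='tz_aware_dag',                                          # illustrative name
    schedule_interval='@daily',
    start_date=pendulum.datetime(2025, 1, 1, tz='Europe/London'),   # fixed and time-zone-aware
    catchup=True,                                                   # deliberate, not an accident
) as dag:
    ...  # tasks go here

Because the start_date is a constant, every scheduler restart and every backfill computes exactly the same intervals.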


Airflow Templates Reference

Airflow provides several templated variables to simplify interactions with data intervals. These variables are crucial for creating dynamic, maintainable, and idempotent workflows.

By using these templates, you can ensure that your DAGs adapt to changing data intervals without requiring hardcoded values, making them more robust and easier to debug.

Airflow uses Jinja templating under the hood to enable this dynamic behaviour, allowing you to embed these variables directly in your task parameters or scripts.

Data Interval Templated Variables

Here’s a list of some commonly used templated references that you can embed in your DAGs:

  • {{ data_interval_start }}: Represents the start of the current data interval. Use this to query or process data that falls within the beginning of the interval.
  • {{ data_interval_end }}: Represents the end of the current data interval. This is helpful for defining the boundary for data processing tasks.
  • {{ logical_date }}: The logical date for the DAG run, which aligns with the start of the data interval. This is often used for logging or metadata purposes.
  • {{ prev_data_interval_start_success }}: The start of the data interval for the previous successful DAG run. Use this for tasks that depend on the output of earlier runs.
  • {{ prev_data_interval_end_success }}: The end of the data interval for the previous successful DAG run. This ensures continuity in workflows that process sequential data.

You can find a more comprehensive list of all available templated references in the relevant section of the Airflow documentation.
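
To see how these references are used in practice, here is a minimal sketch that embeds the interval boundaries in a templated field. It assumes the task is defined inside a DAG; the BashOperator, the task id, and the export.py script are purely illustrative.

from airflow.operators.bash import BashOperator

export_window = BashOperator(
    task_id='export_interval_window',      # illustrative name
    bash_command=(
        'python export.py '                # hypothetical script
        '--start "{{ data_interval_start }}" '
        '--end "{{ data_interval_end }}"'
    ),
)

At runtime, Jinja renders the two placeholders with the boundaries of the current data interval, so each run exports exactly its own slice of data.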

The importance of template variables when authoring DAGs

Template variables are an integral part of writing flexible and efficient DAGs in Airflow. They provide a way to dynamically reference key properties of data intervals, ensuring that your workflows remain adaptable to changes and are easy to maintain.

Here are some of the main reasons why these variables are so important:

  1. Dynamic Adaptability: Template variables allow your DAGs to automatically adjust to the current data interval, eliminating the need for hardcoding specific dates or time ranges.
  2. Idempotency: By tying task parameters to specific data intervals, you ensure that reruns or retries produce the same results, regardless of when they are executed.
  3. Ease of Maintenance: Using templated variables reduces the risk of errors and simplifies updates. For example, if your data processing logic changes, you can adjust the templates without rewriting the DAG.
  4. Facilitate Backfilling: These variables make it easy to backfill historical data, as each DAG run is automatically tied to the appropriate data interval.
  5. Leverage Jinja Templating: Jinja templating enables you to embed these variables dynamically in your task commands, SQL queries, or scripts. This ensures your workflows remain flexible and context-aware.
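
As a sketch of the idempotency point above, the snippet below keys a delete-then-insert query on the interval boundaries, so re-running the task for the same interval always produces the same rows. It assumes the common SQL provider is installed and that a warehouse connection plus events and staging_events tables exist; all of these names are hypothetical, and some databases require the two statements to be submitted separately.

from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

load_events = SQLExecuteQueryOperator(
    task_id='load_events',        # illustrative name
    conn_id='warehouse',          # hypothetical connection
    sql="""
        -- remove anything previously written for this interval
        DELETE FROM events
        WHERE event_time >= '{{ data_interval_start }}'
          AND event_time <  '{{ data_interval_end }}';

        -- re-insert the interval's data from staging
        INSERT INTO events
        SELECT *
        FROM staging_events
        WHERE event_time >= '{{ data_interval_start }}'
          AND event_time <  '{{ data_interval_end }}';
    """,
)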

Visualising Data Intervals

To better understand data intervals, consider the following visualisation:

A visualisation of how data intervals work in Airflow – Source: Author

Every DAG has a single start date, and every DAG run executes only after the corresponding data interval has ended. The table below shows the values of the three data interval variables for each daily run.

| Logical Date | Data Interval Start | Data Interval End |
| ------------ | ------------------- | ----------------- |
| 2025-01-01   | 2025-01-01 00:00    | 2025-01-02 00:00  |
| 2025-01-02   | 2025-01-02 00:00    | 2025-01-03 00:00  |
| 2025-01-03   | 2025-01-03 00:00    | 2025-01-04 00:00  |
| 2025-01-04   | 2025-01-04 00:00    | 2025-01-05 00:00  |
...

A working example

In the example DAG below, we fetch posts from a public API (jsonplaceholder) that accepts a date as a parameter. In our scenario, we want to set the date parameter to the beginning of the data interval. In other words, if the DAG runs today, we call the API using yesterday’s date.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def get_posts_for_data_interval_start_date(**kwargs):
    """
    Fetch Posts from jsonplaceholder
    """
    import requests

    data_interval_start_dt = kwargs['data_interval_start']

    # Format the date as required by the API
    formatted_date = data_interval_start_dt.strftime('%Y-%m-%d')    
    print(f"Calling API for date: {formatted_date}")

    # Call the API
    response = requests.get(
        'https://jsonplaceholder.typicode.com/posts', 
        params={'date': formatted_date}
    )

    if response.status_code == 200:
        print(f'API call successful for {formatted_date}')
    else:
        print(f'API call failed for {formatted_date} with status code {response.status_code}')

dag = DAG(
    'test_dag',
    default_args={
        'owner': 'airflow',
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    },
    schedule_interval='@daily',
    start_date=datetime(2025, 1, 1),
    catchup=True,  # Enable backfilling, to run missed intervals
)

api_task = PythonOperator(
    task_id='call_api_for_date',
    python_callable=get_posts_for_data_interval_start_date,
    dag=dag,
)

The DAG is scheduled to run daily, starting from January 1st, 2025. It leverages the data_interval_start templated variable to dynamically pass the date for each run. Specifically, data_interval_start corresponds to the start of the data interval for each DAG run, allowing the API to receive the correct date parameter.

The start_date parameter determines when the DAG execution begins, while the catchup=True setting ensures that missed intervals are backfilled. This means that if the DAG is delayed or the system is down, Airflow will automatically execute tasks for the missed dates, ensuring no data is skipped.

The core of the DAG is a Python task that calls the API, passing the data_interval_start as a formatted date. This allows for a flexible, interval-based API query system.
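
If you want to sanity-check this behaviour locally before deploying, one option (assuming Airflow 2.5 or newer, where DAG.test() is available) is to run a single in-process DAG run for a chosen logical date:

if __name__ == '__main__':
    import pendulum

    # The data interval is derived from the schedule, so for this logical date
    # the task would call the API with formatted_date = '2025-01-02'.
    dag.test(execution_date=pendulum.datetime(2025, 1, 2, tz='UTC'))

This executes the task in the current Python process, which makes it easy to confirm that the templated date matches your expectations.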


Final Thoughts

Understanding Airflow’s data intervals is essential for building reliable workflows. By learning how start_date, logical dates, and templated variables work together, you can create pipelines that process data accurately and efficiently. Whether it’s backfilling historical data or managing complex workflows, these principles ensure your pipelines run smoothly and consistently.

Mastering data intervals will help you design workflows that are easier to maintain and adapt, making your day-to-day work in data engineering more effective and predictable.

