
Beneath the Surface: A Closer Look at 4 Airflow Internals

Four Apache Airflow internals you might have missed

Image generated via DALL-E

I have been working with Airflow for more than three years now and, overall, I am quite comfortable with it. It’s a powerful orchestrator that helps me build data pipelines quickly and in a scalable fashion, and for most things I want to implement, it comes with batteries included.

Recently, while preparing for an Airflow certification, I came across many things I had literally no clue about. That was essentially my motivation for writing this article and sharing a few Airflow internals that totally blew my mind!


1. Scheduler only parses files containing certain keywords

The Airflow Scheduler will only parse files that contain the strings airflow and dag in their code! Yes, you read that right! If a file under the DAG folder does not contain both of these keywords, it will simply not be parsed by the scheduler.

If you no longer want this requirement to apply to the scheduler, you can simply set the DAG_DISCOVERY_SAFE_MODE configuration setting to False. In that case, the scheduler will parse all files under your DAG folder (/dags).

I wouldn’t recommend disabling this check though, since doing so doesn’t offer much benefit: a proper DAG file will contain Airflow imports and a DAG definition, which means it already meets the parsing requirements. Still, it is worth knowing that this rule exists.
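To get a feel for what happens under the hood, here is a rough sketch of what the safe-mode check boils down to; the helper name below is illustrative, not Airflow’s actual code:

from pathlib import Path

def might_contain_dag(file_path: str) -> bool:
    # Safe mode: a file is handed to the DAG parser only if it mentions
    # both "airflow" and "dag" somewhere in its content (case-insensitively).
    content = Path(file_path).read_text().lower()
    return "airflow" in content and "dag" in content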


2. Variables with certain keywords in their name have their values hidden

We know that by default, Airflow will hide sensitive information stored in a Connection (and more specifically in the password field), but what about Variables?

Well, this is indeed possible, and the mind-blowing thing is that Airflow can do it automatically for you. If a variable’s name contains certain keywords that may indicate sensitive information, its value will automatically be hidden.

Here’s the list of keywords that make a Variable qualify as having sensitive information stored as its value:

  • access_token
  • api_key
  • apikey
  • authorization
  • passphrase
  • passwd
  • password
  • private_key
  • secret
  • token
  • keyfile_dict
  • service_account

This means that if your variable’s name contains any of these keywords, Airflow will automatically mask its value.
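To make the rule concrete, here is a rough sketch (not Airflow’s actual code) of how a variable name is checked against the keyword list above:

SENSITIVE_KEYWORDS = {
    "access_token", "api_key", "apikey", "authorization", "passphrase",
    "passwd", "password", "private_key", "secret", "token",
    "keyfile_dict", "service_account",
}

def looks_sensitive(variable_name: str) -> bool:
    # A variable qualifies as sensitive if any keyword appears in its name.
    name = variable_name.lower()
    return any(keyword in name for keyword in SENSITIVE_KEYWORDS)

looks_sensitive("my_var")      # False -> value shown as-is in the UI
looks_sensitive("my_api_key")  # True  -> value masked with asterisks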

Let’s see this feature in action. First, let’s create a variable from the UI whose name does not contain any keyword that would make it qualify as sensitive. As you can see in the screenshot below, the value of the variable my_var is visible in the User Interface.

Airflow variable that does not qualify as sensitive – Source: Author

Now let’s create another variable containing one of the keywords in its name and see what happens. From the UI, we go to Admin -> Variables -> + and create a new variable called my_api_key:

Creating a new Airflow variable via User Interface – Source: Author

Now if we go to the Variables section in the User Interface, we can see our newly created Variable, but notice that its value is hidden, replaced with asterisks.

Airflow variables containing certain keywords in their name will have their values hidden from the UI – Source: Author
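Keep in mind that masking only affects how the value is displayed; code that reads the variable still gets the real value. A minimal sketch:

from airflow.models import Variable

# The UI shows asterisks, but retrieving the variable returns the actual value.
api_key = Variable.get("my_api_key")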

Personally, I prefer storing Airflow variables in a secret store, such as HashiCorp Vault, and therefore I had no clue that Airflow handles variables this way. In fact, it is a very nice and useful feature. However, I would expect it to be a bit more flexible: instead of relying on a set of pre-defined keywords, it would be more convenient if we could explicitly mark a variable as sensitive, rather than restricting ourselves to naming it with a particular keyword. Let’s hope this gets implemented in a future Airflow version.

For now, you can still extend the list of sensitive keywords by configuring Airflow accordingly, either using sensitive_var_conn_names in airflow.cfg (under the [core] section) or by exporting the AIRFLOW__CORE__SENSITIVE_VAR_CONN_NAMES environment variable; a short example follows the documentation excerpt below.

sensitive_var_conn_names

  • New in version 2.1.0.

A comma-separated list of extra sensitive keywords to look for in variables names or connection’s extra JSON.

Type: string
Default: ''
Environment Variable: AIRFLOW__CORE__SENSITIVE_VAR_CONN_NAMES

Airflow Docs
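As a quick illustration, here is how the list could be extended; client_secret and db_uri below are made-up examples of extra keywords, not defaults:

# airflow.cfg
[core]
sensitive_var_conn_names = client_secret,db_uri

or, equivalently, via the environment:

export AIRFLOW__CORE__SENSITIVE_VAR_CONN_NAMES="client_secret,db_uri"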


3. Two or more DAGs can actually have the same DAG ID (but the UI won’t like it)

I always make sure to use unique IDs for all of my DAGs. In fact, I do so by naming them after their Python filename, since this approach guarantees that no duplicate filename – and thus DAG ID – can be created by mistake. All this time, I was under the impression that it was not even possible to have two or more DAGs with the same ID, but surprisingly, this is far from true!

Two or more DAGs can actually share the same DAG ID, and you will not see any error at all. However, this is bad practice and must be avoided. Even though the scheduler will not complain about DAGs with the same ID, only one of them will appear in the User Interface. In fact, if you refresh the UI after a while, the DAG previously shown might disappear and another DAG with the same ID might appear in its place.

It’s not just about rendering the DAG in the User Interface though. Things can go extremely wrong when you want to reference a DAG whose ID is also used by other DAG(s). As an example, consider the TriggerDagRunOperator, which can be used in one DAG to trigger another. In such a case, the wrong DAG might get triggered. Users have even reported that clicking ‘Run’ in the UI for the DAG being shown resulted in the execution of the DAG with the same ID that was not shown in the UI. So make sure to avoid duplicate DAG IDs.
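For instance, a trigger task references its target purely by ID; target_dag below is a hypothetical ID, and if two DAG files declared it, which DAG actually gets triggered becomes ambiguous:

from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# Inside a DAG definition: this task triggers whichever DAG is
# registered under the "target_dag" ID.
trigger_downstream = TriggerDagRunOperator(
    task_id="trigger_downstream",
    trigger_dag_id="target_dag",
)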

💡 Pro Tip: Instead of hardcoding the ID of a DAG, you can dynamically name it after the filename itself, using Path(__file__).stem:

from pathlib import Path
from airflow import DAG

with DAG(
    dag_id=Path(__file__).stem,
    ...
):
    ...

4. Airflow supports an "ignore" file

I am pretty sure you are familiar with the .gitignore file, but personally, I had no clue that Airflow supports the same construct, too.

You can instruct the scheduler to ignore certain files in your DAG folder by creating a .airflowignore file under the DAG folder itself (i.e. /dags/.airflowignore). Overall, it works like a .gitignore file: you use it to specify directories or files in the DAG folder. Each line in .airflowignore is essentially a regex pattern, which means that any directory or file whose name matches any of the specified patterns will be ignored.

As an example, let’s assume that we have created a .airflowignore file with the content outlined below.

# .airflowignore

helper
dbt_[\d]

Files like

  • helper_functions.py
  • utils_HeLpEr.py
  • dbt_1.py
  • dbt_2.py
  • helper/utils.py

will be completely ignored. Be careful though: if a directory’s name matches any of the patterns, the directory and all of its subfolders will not be scanned by Airflow at all.

This feature can help you speed up DAG parsing in case you have tons of files in your DAG folder that your scheduler should not care about.


Final Thoughts

Whether you are new to Airflow or a seasoned engineer, there are always opportunities to learn new things about the technology, which might not even be mentioned (explicitly) in the documentation.

💡 Thanks for taking the time to read this article. If you found it useful, make sure to follow me and subscribe to my Medium newsletter to be notified when the next article is out.

