Kubernetes — Understanding and Utilizing Probes Effectively
https://towardsdatascience.com/kubernetes-understanding-and-utilizing-probes-effectively/
Thu, 06 Mar 2025 03:59:54 +0000
Why proper configuration and implementation of Kubernetes probes is vital for any critical deployment

Introduction

Let’s talk about Kubernetes probes and why they matter in your deployments. When managing production-facing containerized applications, even small optimizations can have enormous benefits.

Reducing deployment times, helping your applications react better to scaling events, and keeping running pods healthy all require fine-tuning your container lifecycle management. This is exactly why proper configuration — and implementation — of Kubernetes probes is vital for any critical deployment. They help your cluster make intelligent decisions about traffic routing, restarts, and resource allocation.

Properly configured probes dramatically improve your application's reliability, reduce deployment downtime, and help it handle unexpected errors gracefully. In this article, we'll explore the three types of probes available in Kubernetes and how utilizing them alongside each other helps configure more resilient systems.

Quick refresher

Understanding exactly what each probe does and some common configuration patterns is essential. Each of them serves a specific purpose in the container lifecycle, and when used together, they create a rock-solid framework for maintaining your application's availability and performance.

Startup: Optimizing start-up times

Start-up probes are evaluated once, when a new pod is spun up because of a scale-up event or a new deployment. The startup probe serves as a gatekeeper for the rest of the container checks, and fine-tuning it will help your applications better handle increased load or service degradation.

Sample Config:

startupProbe:
  httpGet:
    path: /health
    port: 80
  failureThreshold: 30
  periodSeconds: 10

Key takeaways:

  • Keep periodSeconds low, so that the probe fires often, quickly detecting a successful deployment.
  • Increase failureThreshold to a high enough value to accommodate the worst-case start-up time.

The startup probe checks whether your container has started by querying the configured path. Until it succeeds, it also holds back the liveness and readiness probes from firing.

Liveness: Detecting dead containers

Your liveness probes answer a very simple question: “Is this pod still running properly?” If not, K8s will restart it.

Sample Config:

livenessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 10
  failureThreshold: 3

Key takeaways:

  • Since a failed check means K8s will completely restart your container and spin up a new one, set failureThreshold high enough to tolerate intermittent abnormalities.
  • Avoid using initialDelaySeconds as it is too restrictive — use a Start-up probe instead.

Be mindful that a failing liveness probe will bring down your currently running pod and spin up a new one, so avoid making it too aggressive — aggressiveness is the job of the next probe.

Readiness: Handling unexpected errors

The readiness probe determines whether a pod should start — or continue — to receive traffic. It is extremely useful in situations where your container has lost connection to the database or is otherwise over-utilized and should not receive new requests.

Sample Config:

readinessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 3
  failureThreshold: 1
  timeoutSeconds: 1

Key takeaways:

  • Since this is your first guard against sending traffic to unhealthy targets, make the probe aggressive and keep periodSeconds low.
  • Keep failureThreshold at a minimum — you want to fail fast.
  • Keep timeoutSeconds low as well, so that slow-responding containers are flagged quickly.
  • Give the readinessProbe ample time to recover by having a longer-running livenessProbe.

Readiness probes ensure that traffic will not reach a container that is not ready for it, which makes them one of the most important probes in the stack.

Putting it all together

As you can see, even though each probe has its own distinct use, the best way to improve your application's resilience strategy is to use them alongside each other.

Your startup probe will assist you in scale-up scenarios and new deployments, allowing your containers to be brought up quickly. It fires only once and holds back the execution of the other probes until it completes successfully.

The liveness probe helps in dealing with dead containers suffering from non-recoverable errors and tells the cluster to bring up a new, fresh pod just for you.

The readiness probe is the one telling K8s whether a pod should receive traffic. It can be extremely useful for dealing with intermittent errors or high resource consumption that results in slower response times.
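To make this concrete, below is a minimal sketch of how the three probes might sit side by side on a single container. It simply combines the sample configs from above; the image name, /health endpoint, port, and threshold values are assumptions to be tuned to your application's actual start-up and response behavior.

containers:
  - name: app
    image: my-app:latest      # assumed image name
    startupProbe:
      httpGet:
        path: /health
        port: 80
      failureThreshold: 30
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /health
        port: 80
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health
        port: 80
      periodSeconds: 3
      failureThreshold: 1
      timeoutSeconds: 1

With this combination, the liveness and readiness checks only begin once the startup probe has succeeded, and each probe keeps its own cadence and failure budget.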

Additional configurations

Probes can be further configured to use a command in their checks instead of an HTTP request, as well as giving ample time for the container to safely terminate. While these are useful in more specific scenarios, understanding how you can extend your deployment configuration can be beneficial, so I’d recommend doing some additional reading if your containers handle unique use cases.
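For illustration, a command-based liveness check might look like the sketch below. The command and file path are assumptions purely for the example; the grace period for safe termination is controlled by terminationGracePeriodSeconds, which lives at the pod spec level.

livenessProbe:
  exec:
    command:
      - cat
      - /tmp/healthy      # assumed path to a file your app maintains while healthy
  periodSeconds: 10
  failureThreshold: 3

The probe passes as long as the command exits with status 0, which makes it handy for containers that don't expose an HTTP endpoint.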

Further reading:
Liveness, Readiness, and Startup Probes
Configure Liveness, Readiness and Startup Probes


Building a Data Engineering Center of Excellence
https://towardsdatascience.com/building-a-data-engineering-center-of-excellence/
Fri, 14 Feb 2025 02:35:48 +0000

As data continues to grow in importance and become more complex, the need for skilled data engineers has never been greater. But what is data engineering, and why is it so important? In this blog post, we will discuss the essential components of a functioning data engineering practice, why data engineering is becoming increasingly critical for businesses today, and how you can build your very own Data Engineering Center of Excellence!

I’ve had the privilege to build, manage, lead, and foster a sizeable, high-performing team of data warehouse & ELT engineers for many years. With the help of my team, I have spent a considerable amount of time every year consciously planning and preparing to manage the growth of our data month-over-month and address the changing reporting and analytics needs of our 20,000+ global data consumers. We built many data warehouses to store and centralize massive amounts of data generated from many OLTP sources. We’ve implemented the Kimball methodology by creating star schemas both in our on-premise data warehouses and in those in the cloud.

The objective is to enable our user base to perform fast analytics and reporting on the data, so that our analyst community and business users can make accurate, data-driven decisions.

It took me about three years to transform teams (plural) of data warehouse and ETL programmers into one cohesive Data Engineering team.

I have compiled in this post some of my learnings from building a global data engineering team, in hopes that data professionals and leaders of all levels of technical proficiency can benefit.

Evolution of the Data Engineer

It has never been a better time to be a data engineer. Over the last decade, we have seen a massive awakening of enterprises now recognizing their data as the company’s heartbeat, making data engineering the job function that ensures accurate, current, and quality data flow to the solutions that depend on it.

Historically, the role of data engineers has evolved from that of data warehouse developers and ETL/ELT (extract, transform, and load) developers.

The data warehouse developers are responsible for designing, building, developing, administering, and maintaining data warehouses to meet an enterprise’s reporting needs. This is done primarily by extracting data from operational and transactional systems and piping it, using extract-transform-load (ETL/ELT) methodology, to a storage layer like a data warehouse or a data lake. The data warehouse or the data lake is where data analysts, data scientists, and business users consume data. The developers also perform transformations to conform the ingested data to a data model with aggregated data for easy analysis.

A data engineer’s prime responsibility is to produce and make data securely available for multiple consumers.

Data engineers oversee the ingestion, transformation, modeling, delivery, and movement of data through every part of an organization. Data extraction happens from many different data sources & applications. Data engineers load the data into data warehouses and data lakes, where it is transformed not just for the data science & predictive analytics initiatives (as everyone likes to talk about) but primarily for data analysts. Data analysts & data scientists perform operational reporting, exploratory analytics, and service-level agreement (SLA) based business intelligence reports and dashboards on the catered data. In this post, we will address all of these job functions.

The role of a data engineer is to acquire, store, and aggregate data from both cloud and on-premise systems, new and existing, with data modeling and a feasible data architecture. Without data engineers, analysts and data scientists won’t have valuable data to work with; hence, data engineers are the first to be hired at the inception of every new data team. Based on the data and analytics tools available within an enterprise, data engineering teams’ role profiles, constructs, and approaches have several options for what should be included in their responsibilities, which we will discuss below.

Data Engineering team

Software is increasingly automating the historically manual and tedious tasks of data engineers. Data processing tools and technologies have evolved massively over several years and will continue to grow. For example, cloud-based data warehouses (Snowflake, for instance) have made data storage and processing affordable and fast. Data pipeline services (like Informatica IICS, Apache Airflow, Matillion, and Fivetran) have turned data extraction into work that can be completed quickly and efficiently. The data engineering team should be leveraging such technologies as force multipliers, taking a consistent and cohesive approach to integration and management of enterprise data, not just relying on legacy siloed approaches to building custom data pipelines with fragile, non-performant, hard-to-maintain code. Continuing with the latter approach will stifle the pace of innovation within the enterprise and force the future focus to be on managing data infrastructure issues rather than on helping generate value for your business.

The primary role of an enterprise Data Engineering team should be to transform raw data into a shape that’s ready for analysis — laying the foundation for real-world analytics and data science application.

The Data Engineering team should serve as the librarian for enterprise-level data with the responsibility to curate the organization’s data and act as a resource for those who want to make use of it, such as Reporting & Analytics teams, Data Science teams, and other groups that are doing more self-service or business group driven analytics leveraging the enterprise data platform. This team should serve as the steward of organizational knowledge, managing and refining the catalog so that analysis can be done more effectively. Let’s look at the essential responsibilities of a well-functioning Data Engineering team.

Responsibilities of a Data Engineering Team

The Data Engineering team should provide a shared capability within the enterprise that cuts across to support both the Reporting/Analytics and Data Science capabilities to provide access to clean, transformed, formatted, scalable, and secure data ready for analysis. The Data Engineering teams’ core responsibilities should include:

· Build, manage, and optimize the core data platform infrastructure

· Build and maintain custom and off-the-shelf data integrations and ingestion pipelines from a variety of structured and unstructured sources

· Manage overall data pipeline orchestration

· Manage transformation of data either before or after load of raw data through both technical processes and business logic

· Support analytics teams with design and performance optimizations of data warehouses

Data is an Enterprise Asset.

Data as an Asset should be shared and protected.

Data should be valued as an enterprise asset, leveraged across all business units to enhance the company’s value to its respective customer base by accelerating decision making and improving competitive advantage with the help of data. Good data stewardship, along with legal and regulatory requirements, dictates that we protect the data we own from unauthorized access and disclosure.

In other words, managing Security is a crucial responsibility.

Why Create a Centralized Data Engineering Team?

Treating Data Engineering as a standard and core capability that underpins both the Analytics and Data Science capabilities will help an enterprise evolve how it approaches Data and Analytics. The enterprise needs to stop treating data vertically based on the technology stack involved, as we tend to see so often, and move to more of a horizontal approach of managing a data fabric or mesh layer that cuts across the organization and can connect to various technologies as needed to drive analytic initiatives. This is a new way of thinking and working, but it can drive efficiency as the various data organizations look to scale. Additionally, there is value in creating a dedicated structure and career path for Data Engineering resources. Data engineering skill sets are in high demand in the market; therefore, hiring outside the company can be costly. Companies must enable programmers, database administrators, and software developers with a career path to gain the needed experience with the above-defined skill sets by working across technologies. Usually, forming a data engineering center of excellence or a capability center would be the first step toward making such progression possible.

Challenges for creating a centralized Data Engineering Team

The centralization of the Data Engineering team as a service approach is different from how Reporting & Analytics and Data Science teams operate. It does, in principle, mean giving up some level of control of resources and establishing new processes for how these teams will collaborate and work together to deliver initiatives.

The Data Engineering team will need to demonstrate that it can effectively support the needs of both Reporting & Analytics and Data Science teams, no matter how large these teams are. Data Engineering teams must effectively prioritize workloads while ensuring they can bring the right skillsets and experience to assigned projects.

Data engineering is essential because it serves as the backbone of data-driven companies. It enables analysts to work with clean and well-organized data, necessary for deriving insights and making sound decisions. To build a functioning data engineering practice, you need the following critical components:

Data Engineering Center of Excellence

The Data Engineering team should be a core capability within the enterprise, but it should effectively serve as a support function involved in almost everything data-related. It should interact with the Reporting and Analytics and Data Science teams in a collaborative support role to make the entire team successful.

The Data Engineering team doesn’t create direct business value — but the value should come from making the Reporting and Analytics and Data Science teams more productive and efficient to ensure delivery of maximum value to business stakeholders through Data & Analytics initiatives. To make that possible, the six key responsibilities within the data engineering capability center would be as follows:

Data Engineering Center of Excellence — Image by Author.

Let’s review the 6 pillars of responsibilities:

1. Determine Central Data Location for Collation and Wrangling

Understanding and having a strategy for a Data Lake (a centralized data repository or data warehouse for the mass consumption of data for analysis). Defining requisite data tables and where they will be joined in the context of data engineering, and subsequently converting raw data into digestible and valuable formats.

2. Data Ingestion and Transformation

Moving data from one or more sources to a new destination (your data lake or cloud data warehouse) where it can be stored and further analyzed and then converting data from the format of the source system to that of the destination

3. ETL/ELT Operations

Extracting, transforming, and loading data from one or more sources into a destination system to represent the data in a new context or style.

4. Data Modeling

Data modeling is an essential function of a data engineering team, granted that not all data engineers excel at this capability. It means formalizing relationships between data objects and business rules into a conceptual representation through understanding information system workflows, modeling required queries, designing tables, determining primary keys, and effectively utilizing data to create informed output.

I’ve seen engineers in technical interviews mess up more with this than with coding. It’s essential to understand the differences between dimension, fact, and aggregate tables.

5. Security and Access

Ensuring that sensitive data is protected and implementing proper authentication and authorization to reduce the risk of a data breach

6. Architecture and Administration

Defining the models, policies, and standards that administer what data is collected, where and how it is stored, and how such data is integrated into various analytical systems.

The six pillars of responsibilities for data engineering capabilities center on the ability to determine a central data location for collation and wrangling, ingest and transform data, execute ETL/ELT operations, model data, secure access and administer an architecture. While all companies have their own specific needs with regards to these functions, it is important to ensure that your team has the necessary skillset in order to build a foundation for big data success.

Besides Data Engineering, the following are the other capability centers that need to be considered within an enterprise:

Analytics Capability Center

The analytics capability center enables consistent, effective, and efficient BI, analytics, and advanced analytics capabilities across the company. It assists business functions in triaging, prioritizing, and achieving their objectives and goals through reporting, analytics, and dashboard solutions, while providing operational reports and visualizations, self-service analytics, and the required tools to automate the generation of such insights.

Data Science Capability Center

The data science capability center is for exploring cutting-edge technologies and concepts to unlock new insights and opportunities, better inform employees and create a culture of prescriptive information usage using Automated AI and Automated ML solutions such as H2O.ai, Dataiku, Aible, DataRobot, and C3.ai.

Data Governance

The data governance office empowers users with trusted, understood, and timely data to drive effectiveness while keeping the integrity and sanctity of data in the right hands for mass consumption.


As your company grows, you will want to make sure that the data engineering capabilities are in place to support the six pillars of responsibilities. By doing this, you will be able to ensure that all aspects of data management and analysis are covered and that your data is safe and accessible by those who need it. Have you started thinking about how your company will grow? What steps have you taken to put a centralized data engineering team in place?

Thank you for reading!

Polars vs. Pandas — An Independent Speed Comparison
https://towardsdatascience.com/polars-vs-pandas-an-independent-speed-comparison/
Tue, 11 Feb 2025 21:07:55 +0000

Overview
  1. Introduction — Purpose and Reasons
  2. Datasets, Tasks, and Settings
  3. Results
  4. Conclusions
  5. Wrapping Up

Introduction — Purpose and Reasons

Speed is important when dealing with large amounts of data. If you are handling data in a cloud data warehouse or similar, then the speed of execution for your data ingestion and processing affects the following:

  • Cloud costs: This is probably the biggest factor. More compute time equals more costs in most billing models. In other billing models based on a certain amount of preallocated resources, you could have chosen a lower service level if the speed of your ingestion and processing were higher.
  • Data timeliness: If you have a real-time stream that takes 5 minutes to process data, then your users will have a lag of at least 5 minutes when viewing the data through e.g. a Power BI report. This difference can matter a lot in certain situations. Even for batch jobs, data timeliness is important. If you are running a batch job every hour, it is a lot better if it takes 2 minutes rather than 20 minutes.
  • Feedback loop: If your batch job takes only a minute to run, then you get a very quick feedback loop. This probably makes your job more enjoyable. In addition, it enables you to find logical mistakes more quickly.

As you’ve probably understood from the title, I am going to provide a speed comparison between the two Python libraries Polars and Pandas. If you know anything about Pandas and Polars from before, then you know that Polars is the (relatively) new kid on the block proclaiming to be much faster than Pandas. You probably also know that Polars is implemented in Rust, which is a trend for many other modern Python tools like uv and Ruff.

There are two distinct reasons that I want to do a speed comparison test between Polars and Pandas:

Reason 1 — Investigating Claims

Polars boasts the following claim on its website: Compared to pandas, it (Polars) can achieve more than 30x performance gains.

On their website, you can follow a link to the benchmarks that they have. It’s commendable that their speed tests are open source. But if you are writing the comparison tests for both your own tool and a competitor’s tool, then there might be a slight conflict of interest. I’m not saying that they are purposefully overselling the speed of Polars, but rather that they might have unconsciously selected for favorable comparisons.

Hence the first reason to do a speed comparison test is simply to see whether this supports the claims presented by Polars or not.

Reason 2 — Greater granularity

Another reason for doing a speed comparison test between Polars and Pandas is to make it slightly more transparent where the performance gains might be.

This might be already clear if you’re an expert on both libraries. However, speed tests between Polars and Pandas are mostly of interest to those considering switching up their tool. In that case, you might not yet have played around much with Polars because you are unsure if it is worth it.

Hence the second reason to do a speed comparison is simply to see where the speed gains are located.

I want to test both libraries on different tasks within both data ingestion and data processing. I also want to consider datasets that are both small and large. I will stick to common tasks within data engineering, rather than esoteric tasks that one seldom uses.

What I will not do

  • I will not give a tutorial on either Pandas or Polars. If you want to learn Pandas or Polars, then a good place to start is their documentation.
  • I will not cover other common data processing libraries. This might be disappointing to a fan of PySpark, but having a distributed compute model makes comparisons a bit more difficult. You might find that PySpark is quicker than Polars on tasks that are very easy to parallelize, but slower on other tasks where keeping all the data in memory reduces travel times.
  • I will not provide full reproducibility. Since this is, in humble words, only a blog post, I will only explain the datasets, tasks, and system settings that I have used. I will not host a complete running environment with the datasets and bundle everything neatly. This is not a precise scientific experiment, but rather a guide that only cares about rough estimations.

Finally, before we start, I want to say that I like both Polars and Pandas as tools. I’m not financially or otherwise compensated by any of them obviously, and don’t have any incentive other than being curious about their performance ☺

Datasets, Tasks, and Settings

Let’s first describe the datasets that I will be considering, the tasks that the libraries will perform, and the system settings that I will be running them on.

Datasets

At most companies, you will need to work with both small and (relatively) large datasets. In my opinion, a good data processing tool can tackle both ends of the spectrum. Small datasets challenge the start-up time of tasks, while larger datasets challenge scalability. I will consider two datasets, both of which can be found on Kaggle:

  • A small dataset in CSV format: It is no secret that CSV files are everywhere! Often they are quite small, coming from Excel files or database dumps. What better example of this than the classical iris dataset (licensed with CC0 1.0 Universal License) with 5 columns and 150 rows. The iris version I linked to on Kaggle has 6 columns, but the classical one does not have a running index column. So remove this column if you want precisely the same dataset as I have. The iris dataset is certainly small data by any stretch of the imagination.
  • A large dataset in Parquet format: The parquet format is super useful for large data as it has built-in column-wise compression (along with many other benefits). I will use the Transaction dataset (licensed with Apache License 2.0) representing financial transactions. The dataset has 24 columns and 7 483 766 rows. It is close to 3 GB in its CSV format found on Kaggle. I used Pandas & Pyarrow to convert this to a parquet file (a sketch of this conversion follows the list). The final result is only 905 MB due to the compression of the parquet file format. This is at the low end of what people call big data, but it will suffice for us.
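For reference, a CSV-to-Parquet conversion like the one mentioned above can be done in a couple of lines. The file names below are assumptions; Pandas uses the pyarrow engine for Parquet when it is installed.

import pandas as pd

# Assumed file names for the transactions dataset
df = pd.read_csv("transactions.csv")
df.to_parquet("transactions.parquet", engine="pyarrow")  # compressed, columnar output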

Tasks

I will do a speed comparison on five different tasks. The first two are I/O tasks, while the last three are common tasks in data processing. Specifically, the tasks are:

  1. Reading data: I will read both files using the respective methods read_csv() and read_parquet() from the two libraries. I will not use any optional arguments as I want to compare their default behavior.
  2. Writing data: I will write both files back to identical copies as new files using the respective methods to_csv() and to_parquet() for Pandas and write_csv() and write_parquet() for Polars. I will not use any optional arguments as I want to compare their default behavior.
  3. Computing Numeric Expressions: For the iris dataset I will compute the expression SepalLengthCm ** 2 + SepalWidthCm as a new column in a copy of the DataFrame. For the transactions dataset, I will simply compute the expression (amount + 10) ** 2 as a new column in a copy of the DataFrame. I will use the standard way to transform columns in Pandas, while in Polars I will use the standard functions all(), col(), and alias() to make an equivalent transformation.
  4. Filters: For the iris dataset, I will select the rows corresponding to the criteria SepalLengthCm >= 5.0 and SepalWidthCm <= 4.0. For the transactions dataset, I will select the rows corresponding to the categorical criterion merchant_category == 'Restaurant'. I will use the standard filtering method based on Boolean expressions in each library. In pandas, this is syntax such as df_new = df[df['col'] < 5], while in Polars this is given similarly by the filter() function along with the col() function. I will use the and-operator & for both libraries to combine the two numeric conditions for the iris dataset.
  5. Group By: For the iris dataset, I will group by the Species column and calculate the mean values for each species of the four columns SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm. For the transactions dataset, I will group by the column merchant_category and count the number of instances in each of the classes within merchant_category. Naturally, I will use the groupby() function in Pandas and the group_by() function in Polars in obvious ways. (A code sketch of these tasks follows the list.)
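To make the tasks concrete, here is a rough sketch of how they might look in both libraries. It is not the exact benchmarking code used for the measurements, just an illustration; the column names come from the iris dataset described above and the file names are assumptions.

import pandas as pd
import polars as pl

# Tasks 1 & 2: reading and writing with default arguments
pd_df = pd.read_csv("iris.csv")
pl_df = pl.read_csv("iris.csv")
pd_df.to_csv("iris_copy_pandas.csv")
pl_df.write_csv("iris_copy_polars.csv")

# Task 3: numeric expression as a new column
pd_expr = pd_df.copy()
pd_expr["NewCol"] = pd_expr["SepalLengthCm"] ** 2 + pd_expr["SepalWidthCm"]
pl_expr = pl_df.with_columns(
    (pl.col("SepalLengthCm") ** 2 + pl.col("SepalWidthCm")).alias("NewCol")
)

# Task 4: filtering with two combined numeric conditions
pd_filtered = pd_df[(pd_df["SepalLengthCm"] >= 5.0) & (pd_df["SepalWidthCm"] <= 4.0)]
pl_filtered = pl_df.filter(
    (pl.col("SepalLengthCm") >= 5.0) & (pl.col("SepalWidthCm") <= 4.0)
)

# Task 5: group by species and take the mean of the four numeric columns
pd_grouped = pd_df.groupby("Species").mean(numeric_only=True)
pl_grouped = pl_df.group_by("Species").agg(
    pl.col("SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm").mean()
)

The transactions dataset follows the same pattern with read_parquet()/write_parquet() and the amount and merchant_category columns.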

Settings

  • System Settings: I’m running all the tasks locally with 16GB RAM and an Intel Core i5–10400F CPU with 6 Cores (12 logical cores through hyperthreading). So it’s not state-of-the-art by any means, but good enough for simple benchmarking.
  • Python: I’m running Python 3.12. This is not the most current stable version (which is Python 3.13), but I think this is a good thing. Commonly the latest supported Python version in cloud data warehouses is one or two versions behind.
  • Polars & Pandas: I’m using Polars version 1.21 and Pandas 2.2.3. These are roughly the newest stable releases to both packages.
  • Timeit: I’m using the standard timeit module in Python and finding the median of 10 runs (see the sketch after this list).
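As an illustration of the timing setup (not the exact code used for the results below), one way to take the median of 10 runs with timeit is:

import statistics
import timeit

import polars as pl

def read_task():
    pl.read_csv("iris.csv")  # assumed file name

# With number=1, each of the 10 repetitions measures a single execution
runs = timeit.repeat(read_task, repeat=10, number=1)
print(f"Median over 10 runs: {statistics.median(runs):.6f} seconds")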

It will be especially interesting to see how Polars can take advantage of the 12 logical cores through multithreading. There are ways to make Pandas take advantage of multiple processors, but I want to compare Polars and Pandas out of the box without any external modification. After all, this is probably how they are run in most companies around the world.

Results

Here I will write down the results for each of the five tasks and make some minor comments. In the next section I will try to summarize the main points into a conclusion and point out a disadvantage that Polars has in this comparison:

Task 1 — Reading data

The median run time over 10 runs for the reading task was as follows:

# Iris Dataset
Pandas: 0.79 milliseconds
Polars: 0.31 milliseconds

# Transactions Dataset
Pandas: 14.14 seconds
Polars: 1.25 seconds

For reading the Iris dataset, Polars was roughly 2.5x faster than Pandas. For the transactions dataset, the difference is even starker where Polars was 11x faster than Pandas. We can see that Polars is much faster than Pandas for reading both small and large files. The performance difference grows with the size of the file.

Task 2 — Writing data

The median run time over 10 runs for the writing task was as follows:

# Iris Dataset
Pandas: 1.06 milliseconds
Polars: 0.60 milliseconds

# Transactions Dataset
Pandas: 20.55 seconds
Polars: 10.39 seconds

For writing the iris dataset, Polars was around 75% faster than Pandas. For the transactions dataset, Polars was roughly 2x as fast as Pandas. Again we see that Polars is faster than Pandas, but the difference here is smaller than for reading files. Still, a difference of close to 2x in performance is a massive difference.

Task 3 — Computing Numeric Expressions

The median run time over 10 runs for the computing numeric expressions task was as follows:

# Iris Dataset
Pandas: 0.35 milliseconds
Polars: 0.15 milliseconds

# Transactions Dataset
Pandas: 54.58 milliseconds
Polars: 14.92 milliseconds

For computing the numeric expressions, Polars beats Pandas with a rate of roughly 2.5x for the iris dataset, and roughly 3.5x for the transactions dataset. This is a pretty massive difference. It should be noted that computing numeric expressions is fast in both libraries even for the large dataset transactions.

Task 4 — Filters

The median run time over 10 runs for the filters task was as follows:

# Iris Dataset
Pandas: 0.40 milliseconds
Polars: 0.15 milliseconds

# Transactions Dataset
Pandas: 0.70 seconds
Polars: 0.07 seconds

For filters, Polars is 2.6x faster on the iris dataset and 10x as fast on the transactions dataset. This is probably the most surprising improvement for me since I suspected that the speed improvements for filtering tasks would not be this massive.

Task 5 — Group By

The median run time over 10 runs for the group by task was as follows:

# Iris Dataset
Pandas: 0.54 milliseconds
Polars: 0.18 milliseconds

# Transactions Dataset
Pandas: 334 milliseconds 
Polars: 126 milliseconds

For the group-by task, there is a 3x speed improvement for Polars in the case of the iris dataset. For the transactions dataset, there is a 2.6x improvement of Polars over Pandas.

Conclusions

Before highlighting each point below, I want to point out that Polars is somewhat in an unfair position throughout my comparisons. In practice, multiple data transformations are often performed one after another. For this, Polars has the lazy API, which optimizes the whole pipeline before calculating. Since I have considered single ingestions and transformations, this advantage of Polars is hidden. How much this would matter in practical situations is not clear, but it would probably make the difference in performance even bigger.

Data Ingestion

Polars is significantly faster than Pandas for both reading and writing data. The difference is largest in reading data, where we had a massive 11x difference in performance for the transactions dataset. On all measurements, Polars performs significantly better than Pandas.

Data Processing

Polars is significantly faster than Pandas for common data processing tasks. The difference was starkest for filters, but you can at least expect a 2–3x difference in performance across the board.

Final Verdict

Polars consistently performs faster than Pandas on all tasks with both small and large data. The improvements are very significant, ranging from a 2x improvement to a whopping 11x improvement. When it comes to reading large parquet files or performing filter statements, Polars is leaps and bounds ahead of Pandas.

However… nowhere here is Polars remotely close to performing 30x better than Pandas, as Polars’ benchmarking suggests. I would argue that the tasks that I have presented are standard tasks performed on realistic hardware infrastructure. So I think that my conclusions give us some room to question whether the claims put forward by Polars give a realistic picture of the improvements that you can expect.

Nevertheless, I have no doubt that Polars is significantly faster than Pandas. Working with Polars is not more complicated than working with Pandas. So for your next data engineering project where the data fits in memory, I would strongly suggest that you opt for Polars rather than Pandas.

Wrapping Up

Photo by Spencer Bergen on Unsplash

I hope this blog post gave you a different perspective on the speed difference between Polars and Pandas. Please comment if you have a different experience with the performance difference between Polars and Pandas than what I have presented.

If you are interested in AI, Data Science, or data engineering, please follow me or connect on LinkedIn.

Like my writing? Check out some of my other posts:

Align Your Data Architecture for Universal Data Supply
https://towardsdatascience.com/align-your-data-architecture-for-universal-data-supply-656349c9ae66/
Fri, 31 Jan 2025 17:02:00 +0000
Follow me through the steps on how to evolve your architecture to align with your business needs

Photo by Simone Hutsch on Unsplash

Now that we understand the business requirements, we need to check if the current Data Architecture supports them.

If you’re wondering what to assess in our data architecture and what the current setup looks like, check the business case description.

· Assessing against short-term requirements · Initial alignment approach · Medium-term requirements and long-term vision · Step-by-step conversion · Agility requires some foresight · Build your business process and information model · Holistically challenge your architecture · Decouple and evolve

Assessing against short-term requirements

Let’s recap the short-term requirements:

  1. Immediate feedback with automated compliance monitors: Providing timely feedback to staff on compliance to reinforce hand hygiene practices effectively. Calculate the compliance rates in near real time and show them on ward monitors using a simple traffic light visualization.
  2. Device availability and maintenance: Ensuring dispensers are always functional, with near real-time tracking for refills to avoid compliance failures due to empty dispensers.

The current weekly batch ETL process is obviously not able to deliver immediate feedback.

However, we could try to reduce the batch runtime as much as possible and loop it continuously. For near real-time feedback, we would also need to run the query continuously to get the latest compliance rate report.

Both of these technical requirements are challenging. The weekly batch process from the HIS handles large data volumes and can’t be adjusted to run in seconds. Continuous monitoring would also put a heavy load on the data warehouse if we keep the current model, which is optimized for tracking history.

Before we dig deeper to solve this, let’s also examine the second requirement.

The smart dispenser can be loaded with bottles of various sizes, tracked in the Dispenser Master Data. To calculate the current fill level, we subtract the total amount dispensed from the initial volume. Each time the bottle is replaced, the fill level should reset to the initial volume. To support this, the dispenser manufacturer has announced two new events to be implemented in a future release:

  • The dispenser will automatically track its fill level and send a refill warning when it reaches a configurable low point. This threshold is based on the estimated time until the bottle is empty (remaining time to failure).
  • When the dispenser’s bottle is replaced, it will send a bottle exchange event.

However, these improved devices won’t be available for about 12 months. As a workaround, the current ETL process needs to be updated to perform the required calculations and generate the events.
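As an illustration of that interim workaround, the bookkeeping the ETL process has to take over could look roughly like the sketch below. All names and event shapes are hypothetical, and the warning uses a simple volume threshold for brevity, whereas the real logic would estimate the remaining time to failure from usage rates and the Dispenser Master Data.

from dataclasses import dataclass

@dataclass
class DispenserState:
    dispenser_id: str
    bottle_volume_ml: float      # initial volume of the loaded bottle (hypothetical field)
    dispensed_total_ml: float    # accumulated usage since the last bottle exchange

    @property
    def fill_level_ml(self) -> float:
        return self.bottle_volume_ml - self.dispensed_total_ml

def process_usage(state: DispenserState, dispensed_ml: float, warn_below_ml: float):
    """Add new usage and emit a hypothetical refill-warning event when the level gets low."""
    state.dispensed_total_ml += dispensed_ml
    if state.fill_level_ml <= warn_below_ml:
        return {"event": "refill_warning", "dispenser_id": state.dispenser_id,
                "fill_level_ml": state.fill_level_ml}
    return None

def process_bottle_exchange(state: DispenserState):
    """Reset the fill level to the initial volume and emit a bottle-exchange event."""
    state.dispensed_total_ml = 0.0
    return {"event": "bottle_exchange", "dispenser_id": state.dispenser_id}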

A new report is needed based on these events to inform support staff about dispensers requiring timely bottle replacement. In medium-sized hospitals with 200–500 dispensers, intensive care units use about two 1-liter bottles of disinfectant per month. This means around 19 dispensers need refilling in the support staff’s weekly exchange plan.

Since dispenser usage varies widely across wards, the locations needing bottle replacements are spread throughout the hospital. Support staff would like to receive the bottle exchange list organized in an optimal route through the building.

Initial alignment approach

Following the principle "Never change a running system," we could try to reuse as many components as possible to minimize changes.

Initial idea to implement short-term requirements — Image by author

We would have to build NEW components (in green) and CHANGE existing components (in dark blue) to support the new requirements.

We know the batch needs to be replaced with stream processing for near real-time feedback. We consider using Change Data Capture (CDC), a technology to get updates on dispenser usage from the internal relational database. However, tests on the Dispenser Monitoring System showed that the Dispenser Usage Data Collector only updates the database every 5 minutes. To keep things simple, we decide to reschedule the weekly batch extraction process to sync with the monitoring system’s 5-minute update cycle.

By reducing the batch runtime and continuously looping over it, we effectively create a microbatch that supports stream processing. For more details, see my article on how to unify batch and stream processing.

Reducing the runtime of the HIS Data ETL batch process is a major challenge due to the large amount of data involved. We could decouple patient and occupancy data from the rest of the HIS data, but the HIS database extraction process is a complex, long-neglected COBOL program that no one dares to modify. The extraction logic is buried deep within the COBOL monolith, and there is limited knowledge of the source systems. Therefore, we consider implementing near real-time extraction of patient and occupancy data from HIS as "not feasible."

Instead, we plan to adjust the Compliance Rate Calculation to allow near real-time Dispenser Usage Data to be combined with the still-weekly updated HIS data. After discussing this with the hygiene specialists, we agree that the low rates of change in patient treatment and occupancy suggest the situation will remain fairly stable throughout the week.

The Continuous Compliance Rate On Ward Level will be stored in a real-time partition associated with the ward entity of the data warehouse. It will support short runtimes of the new Traffic Light Monitor Query that is scheduled as a successor to the respective ETL batch process.

Consequently, the monitor will be updated every 5 minutes, which seems close enough to near real-time. The new Exchange List Query will be scheduled weekly to create the Weekly Bottle-Exchange Plan to be sent by email to the support staff.

We feel confident that this will adequately address the short-term requirements.

Medium-term requirements and long-term vision

However, before we start sprinting ahead with the short-term solution, we should also examine the medium and long-term vision. Let’s recap the identified requirements:

  1. Granular data insights: Moving beyond aggregate reports to gain insight into compliance at more specific levels (e.g., by shift or even person).
  2. Actionable alerts for non-compliance: Leveraging historical data with near real-time extended monitoring data to enable systems to notify staff immediately of missed hygiene actions, ideally personalized by healthcare worker.
  3. Personalized compliance dashboards: Creating personalized dashboards that show each worker’s compliance history, improvement opportunities, and benchmarks.
  4. Integration with smart wearables: Utilizing wearable technology to give real-time and discrete feedback directly to healthcare workers, supporting compliance at the point of care.

These long-term visions highlight the need to significantly improve real-time processing capabilities. They also emphasize the importance of processing data at a more granular level and using intelligent processing to derive individualized insights. Processing personalized information raises security concerns that must be properly addressed as well. Finally, we need to seamlessly integrate advanced monitoring devices and smart wearables to receive personalized information in a secure, discreet, and timely manner.

That leads to a whole chain of additional challenges for our current architecture.

But it’s not only the requirements of the hygiene monitoring that are challenging; the hospital is also about to be taken over by a large private hospital operator.

This means the current HIS must be integrated into a larger system that will cover 30 hospitals. The goal is to extend the advanced monitoring functionality for hygiene dispensers so that other hospitals in the new operator’s network can also benefit. As a long-term vision, they want the monitoring functionality to be seamlessly integrated into their global HIS.

Another challenge is planning for the announced innovations from the dispenser manufacturer. Through ongoing discussions about remaining time to failure, refill warnings, and bottle exchange events, we know the manufacturer is open to enabling real-time streaming for Dispenser Usage Data. This would allow data to be sent directly to consumers, bypassing the current 5-minute batch process through the relational database.

Step-by-step conversion

We want to counter the enormous challenges facing our architecture with a gradual transformation.

Since we’ve learned that working agile is beneficial, we want to start with the initial idea and then refine the system in subsequent steps.

But is this really agile working?

Agility requires some foresight

What I often encounter is that people equate "acting in small incremental steps" with working agile. While it’s true that we want to evolve our architecture progressively, each step should aim at the long-term target.

If we constrain our evolution to what the current IT architecture can deliver, we might not be moving toward what is truly needed.

When we developed our initial alignment, we reasoned only about how to implement the first step within the existing architecture’s constraints. However, this approach narrows our view to what’s ‘feasible’ within the current setup.

So, let’s try the opposite and clearly address what’s needed, including the long-term requirements. Only then can we target the next steps to move the architecture in the right direction.

For architecture decisions, we don’t need to detail every aspect of the business processes using standards like Business Process Model and Notation (BPMN). We just need a high-level understanding of the process and information flow.

But what’s the right level of detail that allows us to make evolutionary architecture decisions?

Build your business process and information model

Let’s start very high to find out about the right level.

In part 3 of my series on Challenges and Solutions in Data Mesh I have outlined an approach based on modeling patterns to model an ontology or enterprise data model. Let’s apply this approach to our example.

Note: We can’t create a complete ontology for the healthcare industry in this article. However, we can apply this approach to the small sub-topic relevant to our example.


Let’s identify the obvious modeling patterns relevant for our example:

Party & Role: The parties acting in our healthcare example include patients, medical device suppliers, healthcare professionals (doctors, nurses, hygiene specialists, etc.), the hospital operator, support staff and the hospital as an organizational unit.

Location: The hospital building address, patient rooms, floors, laboratories, operating rooms, etc.

Resource / Asset: The hospital as a building, medical devices like our intelligent dispensers, etc.

Document: All kinds of files representing patient information like diagnosis, written agreements, treatment plans, etc.

Event: We have identified dispenser-related events, such as bottle exchange and refill warnings, as well as healthcare practitioner-related events, like an identified hand hygiene opportunity or moment.

Task: From the doctor’s patient treatment plan, we can directly derive procedures or activities that healthcare workers need to perform. Monitoring these procedures is one of the many information requirements for delivering healthcare services.


The following high-level modeling patterns may not be as obvious for the healthcare setup in our example at first sight:

Product: Although we might not think of hospitals as being product-oriented, they certainly provide services like diagnoses or patient treatments. If pharmaceuticals, supplies, and medical equipment are offered, we can even talk about typical products. A better overall term would probably be a "healthcare offering".

Agreement: Not only agreements between provider networks and supplier agreements for the purchase of medical products and medicines but also agreements between patients and doctors.

Account: Our use case is mainly concerned with upholding best hygiene practices by closely monitoring and educating staff. We just don’t focus on accounting aspects here. However, accounting in general as well as claims management and payment settlement are very important healthcare business processes. A large part of the Hospital Information System (HIS) therefore deals with accounting.

Let’s visualize our use case with these high-level modeling patterns and their relationships.

Our example from the healthcare sector, illustrated with high-level modeling patterns – Image by author

What does this buy us?

With this high-level model we can identify ‘hygiene monitoring’ as an overall business process to observe patient care and take appropriate action so that infections associated with care are prevented in the best possible way.

We recognize ‘patient management’ as an overall process to manage and track all the patient care activities related to the healthcare plan prepared by the doctors.

We recognize ‘hospital management’ that organizes assets like hospital buildings with patient bedrooms as well as all medical devices and instrumentation inside. Patients and staff occupy and use these assets over time and this usage needs to be managed.

Let’s describe some of the processes:

  • A Doctor documents the Diagnosis derived from the examination of the Patient
  • A Doctor discusses the derived Diagnosis with the Patient and documents everything that has been agreed with the Patient about the recommended treatment in a Patient Treatment Plan.
  • The Agreement on the treatment triggers the Treatment Procedure and reflects the responsibility of the Doctor and Nurses for the patient’s treatment.
  • A Nurse responsible for Patient Bed Occupancy will assign a patient bed at the ward, which triggers a Patient Bed Allocation.
  • A Nurse responsible for the patient’s treatment takes a blood sample from the patient and triggers several Hand Hygiene Opportunities and Dispenser Hygiene Actions detected by Hygiene Monitoring.
  • The Hygiene Monitoring calculates compliance from Dispenser Hygiene Action, Hand Hygiene Opportunity, and Patient Bed Allocation information and documents it for the Continuous Compliance Monitor.
  • During the week ongoing Dispenser Hygiene Actions cause the Hygiene Monitoring to trigger Dispenser Refill Warnings.
  • A Hygiene Specialist responsible for the Hygiene Monitoring compiles a weekly Bottle Exchange Plan from accumulated Dispenser Refill Warnings.
  • Support Staff responsible for the weekly Exchange Bottle Tour receives the Bottle Exchange Plan and triggers Dispenser Bottle Exchange events when replacing empty bottles for the affected dispensers.
  • and so on …

This way we get an overall functional view of our business. The view is completely independent of the architectural style we’ll choose to actually implement the business requirements.

A high-level business process and information model is therefore a perfect artifact to discuss any use case with healthcare practitioners.

Holistically challenge your architecture

With such a thorough understanding of our business, we can challenge our architecture more holistically. Everything we already understand and know today can and should be used to drive our next step toward the target architecture.

Let’s examine why our initial architecture approach falls short to properly support all identified requirements:

  • Near real-time processing is only partly addressed

A traditional data warehouse architecture is not the ideal architectural approach for near real-time processing. In our example, the long-running HIS data extraction process is a batch-oriented monolith that cannot be tuned to support low-latency requirements.

We can split the monolith into independent extraction processes, but to really enable all involved applications for near real-time processing, we need to rethink the way we share data across applications.

As data engineers, we should create abstractions that relieve the application developer from low-level data processing decisions. They should neither have to reason about whether batch or stream processing style needs to be chosen nor need to know how to actually implement this technically.

If we allow the application developers to implement the required business logic independent of these technical data details, it would greatly simplify their job.

You can get more details on how to practically implement this in my article on unifying batch and stream processing.

  • The initial alignment is driven by technology, not by business

Business requirements should drive the IT architecture decisions. If we turn a blind eye and soften the requirements to such an extent that it becomes ‘feasible’, we allow technology to drive the process.

The discussion with the hygiene specialists about the low rates of change in patient treatment and occupancy is such a softening of requirements. We know that there will be situations where the state changes during the week, but we accept the assumption of stability to keep the current IT architecture.

Even if we won’t be able to immediately change the complete architecture, we should take steps in the right direction. Even if we cannot enable all applications at once to support near real-time processing, we should take action to create support for it.

  • Smart devices, standard operational systems (HIS) and advanced monitoring need to be seamlessly integrated

The long-term vision is to seamlessly integrate the monitoring functionality with available HIS features. This includes the integration of various new (sub-)systems and new technical devices that are essential for operating the hospital.

With an architecture that focuses one-sidedly on analytical processing, we cannot adequately address these cross-cutting needs. We need to find ways to enable flexible data flow between all future participants in the system. Every application or system component needs to be connected to our mesh of data without having to change the component itself.

Overall, we can state that the initial architecture change plan won’t be a targeted step towards such a flexible integration approach.

Decouple and evolve

To ensure that each and every step is effective in moving towards our target architecture, we need a balanced decoupling of our current architecture components.

Universal data supply therefore defines the abstraction "data as a product" for the exchange of data between applications of any kind. To enable current applications to create data as a product without having to completely redesign them, we use data agents to (re-)direct data flow from the application to the mesh.

Modern Data And Application Engineering Breaks the Loss of Business Context

By using these abstractions, any application can also become near real-time capable. Because it doesn’t matter if the application is part of the operational or the analytical plane, the intended integration of the operational HIS with hygiene monitoring components is significantly simplified.

Operational and Analytical Data

Let’s examine how the decoupling helps, for instance, to integrate the current data warehouse to the mesh.

The data warehouse can be redefined to act like one among many applications in the mesh. We can, for instance, redesign the ETL component Continuous Compliance Rate on Ward Level as an independent application producing the data as a product abstraction. If we don't want to, or can't, touch the ETL logic itself, we can instead use the data agent abstraction to transform the data into the target structure.

We can do the same for Dispenser Exchange Events or any other ETL or query / reporting component identified. The COBOL monolith HIS Data can be decoupled by implementing a data agent that separates the data products HIS occupancy data and HIS patient data. This allows the data-delivering components to evolve completely independently of the consumers.
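
To make this more tangible, here is a minimal, purely illustrative Python sketch (the class names, record structure and filtering logic are assumptions for this example, not part of any specific framework): a data agent reads the raw HIS extract and republishes it as two separate data products, so consumers never depend on the HIS internals.

# Illustrative sketch only - names and record structure are hypothetical.
from dataclasses import dataclass

@dataclass
class DataProduct:
    # In a real mesh this would also carry schema, ownership, metadata and SLAs.
    name: str
    records: list

class HISDataAgent:
    """Splits the raw HIS extract into separate data products without touching the HIS."""

    def __init__(self, raw_his_records: list):
        self.raw = raw_his_records

    def produce(self) -> list:
        occupancy = [r for r in self.raw if r.get("record_type") == "occupancy"]
        patients = [r for r in self.raw if r.get("record_type") == "patient"]
        return [
            DataProduct("HIS occupancy data", occupancy),
            DataProduct("HIS patient data", patients),
        ]

# Consumers depend only on the published data products, never on the HIS internals.
agent = HISDataAgent([
    {"record_type": "patient", "id": 1},
    {"record_type": "occupancy", "ward": "A"},
])
for product in agent.produce():
    print(product.name, len(product.records))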

Whenever the dispenser vendor is ready to deliver advanced functionalities to directly create the required exchange events, we would just have to change the Dispenser Exchange Events component. Either the vendor can deliver the data as a product abstraction directly, or we can convert the dispenser's proprietary data output by adapting the Dispenser Exchange Events data agent and logic.

Aligned Architecture as an Adapted Data Mesh enabling universal data supply - Image by author

Whenever we are able to directly create HIS patient data or HIS occupancy data from the HIS, we can partly or completely decommission the HIS Data component without affecting the rest of the system.


We need to assess our architecture holistically, considering all known business requirements. A technology-constrained approach can lead to intermediate steps that are not geared towards what’s needed but just towards what seems feasible.

Dive deep into your business and derive technology-agnostic processes and information models. These models will foster your business understanding and at the same time allow your business to drive your architecture.


In subsequent steps, we will look at more technical details on how to design data as a product and data agents based on these ideas. Stay tuned for more insights!

The post Align Your Data Architecture for Universal Data Supply appeared first on Towards Data Science.

]]>
Stop Creating Bad DAGs – Optimize Your Airflow Environment By Improving Your Python Code https://towardsdatascience.com/stop-creating-bad-dags-optimize-your-airflow-environment-by-improving-your-python-code-146fcf4d27f7/ Thu, 30 Jan 2025 20:31:57 +0000 https://towardsdatascience.com/stop-creating-bad-dags-optimize-your-airflow-environment-by-improving-your-python-code-146fcf4d27f7/ Valuable tips to reduce your DAGs' parse time and save resources.

The post Stop Creating Bad DAGs – Optimize Your Airflow Environment By Improving Your Python Code appeared first on Towards Data Science.

]]>
Valuable tips to reduce your DAGs’ parse time and save resources.
Photo by Dan Roizer on Unsplash

Apache Airflow is one of the most popular orchestration tools in the data field, powering workflows for companies worldwide. However, anyone who has already worked with Airflow in a production environment, especially in a complex one, knows that it can occasionally present some problems and weird bugs.

Among the many aspects you need to manage in an Airflow environment, one critical metric often flies under the radar: DAG parse time. Monitoring and optimizing parse time is essential to avoid performance bottlenecks and ensure the correct functioning of your orchestrations, as we’ll explore in this article.

That said, this tutorial aims to introduce [airflow-parse-bench](https://github.com/AlvaroCavalcante/airflow-parse-bench), an open-source tool I developed to help Data engineers monitor and optimize their Airflow environments, providing insights to reduce code complexity and parse time.

Why Parse Time Matters

Regarding Airflow, DAG parse time is often an overlooked metric. Parsing occurs every time Airflow processes your Python files to build the DAGs dynamically.

By default, all your DAGs are parsed every 30 seconds – a frequency controlled by the configuration variable _min_file_process_interval_. This means that every 30 seconds, all the Python code that’s present in your dags folder is read, imported, and processed to generate DAG objects containing the tasks to be scheduled. Successfully processed files are then added to the DAG Bag.

This parsing work is handled by Airflow's DAG file processing components (commonly referred to as the dag processor), which are executed by the Airflow Scheduler, ensuring that your DAG objects are updated before being triggered. However, for scalability and security reasons, it is also possible to run the dag processor as a separate component in your cluster.

If your environment only has a few dozen DAGs, it’s unlikely that the parsing process will cause any kind of problem. However, it’s common to find production environments with hundreds or even thousands of DAGs. In this case, if your parse time is too high, it can lead to:

  • Delayed DAG scheduling.
  • Increased resource utilization.
  • Environment heartbeat issues.
  • Scheduler failures.
  • Excessive CPU and memory usage, wasting resources.

Now, imagine having an environment with hundreds of DAGs containing unnecessarily complex parsing logic. Small inefficiencies can quickly turn into significant problems, affecting the stability and performance of your entire Airflow setup.

How to write better DAGs?

When writing Airflow DAGs, there are some important best practices to bear in mind to create optimized code. Although you can find a lot of tutorials on how to improve your DAGs, I’ll summarize some of the key principles that can significantly enhance your DAG performance.

Limit Top-Level Code

One of the most common causes of high DAG parsing times is inefficient or complex top-level code. Top-level code in an Airflow DAG file is executed every time the Scheduler parses the file. If this code includes resource-intensive operations, such as database queries, API calls, or dynamic task generation, it can significantly impact parsing performance.

The following code shows an example of a non-optimized DAG:
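
A minimal sketch of what such a non-optimized DAG might look like (the API endpoint, response fields and task logic here are illustrative assumptions):

# Hypothetical example: expensive work at the top level runs on every single parse.
import pandas as pd
import requests
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Top-level code: executed every time the Scheduler parses this file.
response = requests.get("https://api.example.com/products")  # hypothetical endpoint
df = pd.DataFrame(response.json())
product_ids = df["id"].tolist()  # assumes the payload contains an "id" field

def process_products():
    print(f"Processing {len(product_ids)} products")

with DAG(
    dag_id="non_optimized_dag",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="process_products", python_callable=process_products)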

In this case, every time the file is parsed by the Scheduler, the top-level code is executed, making an API request and processing the DataFrame, which can significantly impact the parse time.

Another important factor contributing to slow parsing is top-level imports. Every library imported at the top level is loaded into memory during parsing, which can be time-consuming. To avoid this, you can move imports into functions or task definitions.

The following code shows a better version of the same DAG:
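
As an illustrative sketch of the improved version (same hypothetical endpoint), the heavy imports and the API call now live inside the task callable, so parsing the file only builds the DAG structure:

# Hypothetical optimized version: no network calls or heavy imports at the top level.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def process_products():
    # Imports and the API call only run at task execution time, not at parse time.
    import pandas as pd
    import requests

    response = requests.get("https://api.example.com/products")  # hypothetical endpoint
    df = pd.DataFrame(response.json())
    print(f"Processing {len(df)} products")

with DAG(
    dag_id="optimized_dag",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="process_products", python_callable=process_products)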

Avoid Xcoms and Variables in Top-Level Code

On the same topic, it is particularly important to avoid using XComs and Variables in your top-level code. As stated in Google's documentation:

If you are using Variable.get() in top level code, every time the .py file is parsed, Airflow executes a Variable.get() which opens a session to the DB. This can dramatically slow down parse times.

To address this, consider using a JSON dictionary to retrieve multiple variables in a single database query, rather than making multiple Variable.get() calls. Alternatively, use Jinja templates, as variables retrieved this way are only processed during task execution, not during DAG parsing.
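
A minimal sketch of this pattern (the DAG, variable and endpoint names are made up): the commented-out line shows the anti-pattern, while the Jinja-templated command defers the variable lookup to task runtime.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="variables_example",
    start_date=datetime(2025, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Anti-pattern (avoid): api_key = Variable.get("my_api_key") at the top level
    # would open a DB session on every parse.
    # Better: a Jinja template, resolved only when the task actually runs.
    fetch_data = BashOperator(
        task_id="fetch_data",
        bash_command=(
            "curl -H 'Authorization: {{ var.value.my_api_key }}' "
            "https://api.example.com/data"
        ),
    )
    # If you need several values, store them in one JSON Variable and read it inside a
    # task with Variable.get("my_settings", deserialize_json=True) - a single DB query.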

Remove Unnecessary DAGs

Although it seems obvious, it’s always important to remember to periodically clean up unnecessary DAGs and files from your environment:

  • Remove unused DAGs: Check your dags folder and delete any files that are no longer needed.
  • Use .airflowignore: Specify the files Airflow should intentionally ignore, skipping parsing.
  • Review paused DAGs: Paused DAGs are still parsed by the Scheduler, consuming resources. If they are no longer required, consider removing or archiving them.

Change Airflow Configurations

Lastly, you can tweak a few Airflow configurations to reduce the Scheduler's resource usage, as sketched in the configuration excerpt after this list:

  • min_file_process_interval: This setting controls how often (in seconds) Airflow parses your DAG files. Increasing it from the default 30 seconds can reduce the Scheduler’s load at the cost of slower DAG updates.
  • dag_dir_list_interval: This determines how often (in seconds) Airflow scans the dags directory for new DAGs. If you deploy new DAGs infrequently, consider increasing this interval to reduce CPU usage.
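
For reference, a minimal airflow.cfg sketch (assuming a recent Airflow 2.x release, where both settings live in the [scheduler] section; the values are illustrative, not recommendations):

[scheduler]
# Parse each DAG file at most every 2 minutes instead of every 30 seconds
min_file_process_interval = 120
# Scan the dags folder for new files every 10 minutes instead of every 5 minutes
dag_dir_list_interval = 600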

How to Measure DAG Parse Time?

We’ve discussed a lot about the importance of creating optimized DAGs to maintain a healthy Airflow environment. But how do you actually measure the parse time of your DAGs? Fortunately, there are several ways to do this, depending on your Airflow deployment or operating system.

For example, if you have a Cloud Composer deployment, you can easily retrieve a DAG parse report by executing the following command on Google CLI:

gcloud composer environments run $ENVIRONMENT_NAME \
  --location $LOCATION \
  dags report

While retrieving parse metrics is straightforward, measuring the effectiveness of your code optimizations can be less so. Every time you modify your code, you need to redeploy the updated Python file to your cloud provider, wait for the DAG to be parsed, and then extract a new report – a slow and time-consuming process.

Another possible approach, if you’re on Linux or Mac, is to run this command to measure the parse time locally on your machine:

time python airflow/example_dags/example.py

However, while simple, this approach is not practical for systematically measuring and comparing the parse times of multiple DAGs.

To address these challenges, I created the airflow-parse-bench, a Python library that simplifies measuring and comparing the parse times of your DAGs using Airflow’s native parse method.

Measuring and Comparing Your DAG’s Parse Times

The [airflow-parse-bench](https://github.com/AlvaroCavalcante/airflow-parse-bench) tool makes it easy to store parse times, compare results, and standardize comparisons across your DAGs.

Installing the Library

Before installation, it’s recommended to use a virtualenv to avoid library conflicts. Once set up, you can install the package by running the following command:

pip install airflow-parse-bench

Note: This command only installs the essential dependencies (related to Airflow and Airflow providers). You must manually install any additional libraries your DAGs depend on.

For example, if a DAG uses boto3 to interact with AWS, ensure that boto3 is installed in your environment. Otherwise, you’ll encounter parse errors.

After that, it’s necessary to initialize your Airflow database. This can be done by executing the following command:

airflow db init

In addition, if your DAGs use Airflow Variables, you must define them locally as well. However, you don't need to set real values for these variables, as the actual values aren't required for parsing purposes:

airflow variables set MY_VARIABLE 'ANY TEST VALUE'

Without this, you’ll encounter an error like:

error: 'Variable MY_VARIABLE does not exist'

Using the Tool

After installing the library, you can begin measuring parse times. For example, suppose you have a DAG file named dag_test.py containing the non-optimized DAG code used in the example above.

To measure its parse time, simply run:

airflow-parse-bench --path dag_test.py

This execution produces the following output:

Execution result. Image by author.

As observed, our DAG presented a parse time of 0.61 seconds. If I run the command again, I’ll see some small differences, as parse times can vary slightly across runs due to system and environmental factors:

Result of another execution of the same DAG. Image by author.

In order to present a more concise number, it’s possible to aggregate multiple executions by specifying the number of iterations:

airflow-parse-bench --path dag_test.py --num-iterations 5

Although it takes a bit longer to finish, this calculates the average parse time across five executions.

Now, to evaluate the impact of the aforementioned optimizations, I replaced the code in my dag_test.py file with the optimized version shared earlier. After executing the same command, I got the following result:

Parse result of the optimized code. Image by author.

As you can see, just applying a few good practices reduced the DAG parse time by almost 0.5 seconds, highlighting the importance of the changes we made!

Further Exploring the Tool

There are a few other features of the tool that are worth sharing.

As a reminder, if you have any doubts or problems using the tool, you can access the complete documentation on GitHub.

Besides that, to view all the parameters supported by the library, simply run:

airflow-parse-bench --help

Testing Multiple DAGs

In most cases, you'll have dozens of DAGs whose parse times you want to test. To simulate this use case, I created a folder named dags and put four Python files inside it.

To measure the parse times for all the DAGs in a folder, it’s just necessary to specify the folder path in the --path parameter:

airflow-parse-bench --path my_path/dags

Running this command produces a table summarizing the parse times for all the DAGs in the folder:

Testing the parse time of multiple DAGs. Image by author.

By default, the table is sorted from the fastest to the slowest DAG. However, you can reverse the order by using the --order parameter:

airflow-parse-bench --path my_path/dags --order desc
Inverted sorting order. Image by author.

Skipping Unchanged DAGs

The --skip-unchanged parameter can be especially useful during development. As the name suggests, this option skips the parse execution for DAGs that haven’t been modified since the last execution:

airflow-parse-bench --path my_path/dags --skip-unchanged

As shown below, when the DAGs remain unchanged, the output reflects no difference in parse times:

Output with no difference for unchanged files. Image by author.

Resetting the Database

All DAG information, including metrics and history, is stored in a local SQLite database. If you want to clear all stored data and start fresh, use the --reset-db flag:

airflow-parse-bench --path my_path/dags --reset-db

This command resets the database and processes the DAGs as if it were the first execution.

Conclusion

Parse time is an important metric for maintaining scalable and efficient Airflow environments, especially as your orchestration requirements become increasingly complex.

For this reason, the [airflow-parse-bench](https://github.com/AlvaroCavalcante/airflow-parse-bench) library can be an important tool for helping data engineers create better DAGs. By testing your DAGs' parse times locally, you can quickly and easily find your code bottlenecks and make your DAGs faster and more performant.

Since the code is executed locally, the measured parse time won't exactly match the one in your Airflow cluster. However, if you are able to reduce the parse time on your local machine, there's a good chance the improvement will carry over to your cloud environment.

Finally, this project is open for collaboration! If you have suggestions, ideas, or improvements, feel free to contribute on GitHub.

References

maximize the benefits of Cloud Composer and reduce parse times | Google Cloud Blog

Optimize Cloud Composer via Better Airflow DAGs | Google Cloud Blog

Scheduler – Airflow Documentation

Best Practices – Airflow Documentation

GitHub – AlvaroCavalcante/airflow-parse-bench: Stop creating bad DAGs! Use this tool to measure and…

The post Stop Creating Bad DAGs – Optimize Your Airflow Environment By Improving Your Python Code appeared first on Towards Data Science.

]]>
Battle of the Ducks https://towardsdatascience.com/battle-of-the-ducks-24fd55260fae/ Tue, 28 Jan 2025 14:01:58 +0000 https://towardsdatascience.com/battle-of-the-ducks-24fd55260fae/ DuckDB vs Fireducks: the ultimate throwdown

The post Battle of the Ducks appeared first on Towards Data Science.

]]>
Image by AI (Dalle-3)

As some of you may know, I’m a big fan of the DuckDB Python library, and I’ve written many articles on it. I was also one of the first to write an article about an even newer Python library called Fireducks and helped bring that to people’s attention.

If you’ve never heard of these useful libraries, check out the links below for an introduction to them.

DuckDB

New Pandas rival, FireDucks, brings the smoke!

Both libraries are capturing a growing share of data science workloads, where it could be argued that data manipulation and general wrangling are at least as important as the analysis and insight that the machine learning side brings.

The core foundations of both tools are very different; DuckDB is a modern, embedded analytics database designed for efficient processing and querying of gigabytes of data from various sources. Fireducks is designed to be a much faster replacement for Pandas.

Their key commonality, however, is that they are both highly performant for general mid-sized Data Processing tasks. If that’s your use case, which one should you choose? That’s what we’ll find out today.

Here are the tests I’ll perform.

  • read a large CSV file into memory, i.e. a DuckDB table and a Fireducks dataframe
  • perform some typical data processing tasks against both sets of in-memory data
  • create a new column in the in-memory data sets based on existing table/data frame column data.
  • write out the updated in-memory data sets as CSV and Parquet

Input data set

I created a CSV file with fake sales data containing 100 million records.

The schema of the input data is this,

  • order_id (int)
  • order_date (date)
  • customer_id (int)
  • customer_name (str)
  • product_id (int)
  • product_name (str)
  • category (str)
  • quantity (int)
  • price (float)
  • total (float)

Here is a Python program you can use to create the CSV file. On my system, this resulted in a file of approximately 7.5GB.

# generate the 100m CSV file
#
import polars as pl
import numpy as np
from datetime import datetime, timedelta

def generate(nrows: int, filename: str):
    names = np.asarray(
        [
            "Laptop",
            "Smartphone",
            "Desk",
            "Chair",
            "Monitor",
            "Printer",
            "Paper",
            "Pen",
            "Notebook",
            "Coffee Maker",
            "Cabinet",
            "Plastic Cups",
        ]
    )

    categories = np.asarray(
        [
            "Electronics",
            "Electronics",
            "Office",
            "Office",
            "Electronics",
            "Electronics",
            "Stationery",
            "Stationery",
            "Stationery",
            "Electronics",
            "Office",
            "Sundry",
        ]
    )

    product_id = np.random.randint(len(names), size=nrows)
    quantity = np.random.randint(1, 11, size=nrows)
    price = np.random.randint(199, 10000, size=nrows) / 100

    # Generate random dates between 2010-01-01 and 2023-12-31
    start_date = datetime(2010, 1, 1)
    end_date = datetime(2023, 12, 31)
    date_range = (end_date - start_date).days

    # Create random dates as np.array and convert to string format
    order_dates = np.array([(start_date + timedelta(days=np.random.randint(0, date_range))).strftime('%Y-%m-%d') for _ in range(nrows)])

    # Define columns
    columns = {
        "order_id": np.arange(nrows),
        "order_date": order_dates,
        "customer_id": np.random.randint(100, 1000, size=nrows),
        "customer_name": [f"Customer_{i}" for i in np.random.randint(2**15, size=nrows)],
        "product_id": product_id + 200,
        "product_names": names[product_id],
        "categories": categories[product_id],
        "quantity": quantity,
        "price": price,
        "total": price * quantity,
    }

    # Create Polars DataFrame and write to CSV with explicit delimiter
    df = pl.DataFrame(columns)
    df.write_csv(filename, separator=',',include_header=True)  # Ensure comma is used as the delimiter

# Generate data with random order_date and save to CSV
generate(100_000_000, "/mnt/d/sales_data/sales_data_100m.csv")

Installing WSL2 Ubuntu

Fireducks only runs under Linux, so as I usually run Windows, I’ll be using WSL2 Ubuntu for my Linux environment, but the same code should work on any Linux/Unix setup. I have a full guide on installing WSL2 here.

Setting up a dev environment

OK, we should set up a separate development environment before starting our coding examples. That way, what we do won't interfere with other library versions or projects we might have on the go.

I use Miniconda for this, but you can use whatever method suits you best.

If you want to go down the Miniconda route and don’t already have it, you must install Miniconda first. Get it using this link,

Miniconda – Anaconda documentation

Once the environment is created, switch to it using the activate command, and then install Jupyter and any required Python libraries.

#create our test environment
(base) $ conda create -n duck_battle python=3.11 -y
# Now activate it
(base) $ conda activate duck_battle
# Install python libraries, etc ...
(duck_battle) $ pip install jupyter fireducks duckdb

Test 1 – Reading a large CSV file and display the last 10 records

DuckDB

import duckdb

print(duckdb.__version__)

'1.1.3'
# DuckDB read CSV file 
#
import duckdb
import time

# Start the timer
start_time = time.time()

# Create a connection to an in-memory DuckDB database
con = duckdb.connect(':memory:')

# Create a table from the CSV file
con.execute(f"CREATE TABLE sales AS SELECT * FROM read_csv('/mnt/d/sales_data/sales_data_100m.csv',header=true)")

# Fetch the last 10 rows
query = "SELECT * FROM sales ORDER BY rowid DESC LIMIT 10"
df = con.execute(query).df()

# Display the last 10 rows
print("nLast 10 rows of the file:")
print(df)

# End the timer and calculate the total elapsed time
total_elapsed_time = time.time() - start_time

print(f"DuckDB: Time taken to read the CSV file and display the last 10 records: {total_elapsed_time} seconds")

#
# DuckDB output
#

Last 10 rows of the file:
   order_id order_date  customer_id   customer_name  product_id product_names  
0  99999999 2023-06-16          102   Customer_9650         203         Chair   
1  99999998 2022-03-02          709  Customer_23966         208      Notebook   
2  99999997 2019-05-10          673  Customer_25709         202          Desk   
3  99999996 2011-10-21          593  Customer_29352         200        Laptop   
4  99999995 2011-10-24          501  Customer_29289         202          Desk   
5  99999994 2023-09-27          119  Customer_15532         209  Coffee Maker   
6  99999993 2015-01-15          294  Customer_27081         200        Laptop   
7  99999992 2016-04-07          379   Customer_1353         207           Pen   
8  99999991 2010-09-19          253  Customer_29439         204       Monitor   
9  99999990 2016-05-19          174  Customer_11294         210       Cabinet   

    categories  quantity  price   total  
0       Office         4  59.58  238.32  
1   Stationery         1  78.91   78.91  
2       Office         5   9.12   45.60  
3  Electronics         3  67.42  202.26  
4       Office         7  53.78  376.46  
5  Electronics         2  55.10  110.20  
6  Electronics         9  86.01  774.09  
7   Stationery         5  21.56  107.80  
8  Electronics         4   5.17   20.68  
9       Office         9  65.10  585.90  

DuckDB: Time taken to read the CSV file and display the last 10 records: 59.23184013366699 seconds

Fireducks

import fireducks
import fireducks.pandas as pd

print(fireducks.__version__)
print(pd.__version__)

1.1.6
2.2.3
# Fireducks read CSV
#
import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

# Path to the CSV file
file_path = "/mnt/d/sales_data/sales_data_100m.csv"

# Read the CSV file into a DataFrame
df_fire = pd.read_csv(file_path)

# Display the last 10 rows of the DataFrame
print(df_fire.tail(10))

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to read the CSV file and display the last 10 records: {elapsed_time} seconds")         

#
# Fireducks output
#

          order_id  order_date  customer_id   customer_name  product_id  
99999990  99999990  2016-05-19          174  Customer_11294         210   
99999991  99999991  2010-09-19          253  Customer_29439         204   
99999992  99999992  2016-04-07          379   Customer_1353         207   
99999993  99999993  2015-01-15          294  Customer_27081         200   
99999994  99999994  2023-09-27          119  Customer_15532         209   
99999995  99999995  2011-10-24          501  Customer_29289         202   
99999996  99999996  2011-10-21          593  Customer_29352         200   
99999997  99999997  2019-05-10          673  Customer_25709         202   
99999998  99999998  2022-03-02          709  Customer_23966         208   
99999999  99999999  2023-06-16          102   Customer_9650         203   

         product_names   categories  quantity  price   total  
99999990       Cabinet       Office         9  65.10  585.90  
99999991       Monitor  Electronics         4   5.17   20.68  
99999992           Pen   Stationery         5  21.56  107.80  
99999993        Laptop  Electronics         9  86.01  774.09  
99999994  Coffee Maker  Electronics         2  55.10  110.20  
99999995          Desk       Office         7  53.78  376.46  
99999996        Laptop  Electronics         3  67.42  202.26  
99999997          Desk       Office         5   9.12   45.60  
99999998      Notebook   Stationery         1  78.91   78.91  
99999999         Chair       Office         4  59.58  238.32 

Fireducks: Time taken to read the CSV file and display the last 10 records: 65.69259881973267 seconds

There is not much in it; DuckDB edges it by about 6 seconds.

Test 2— Calculate total sales by category

DuckDB

# duckdb process data
#
import duckdb
import time

# Start total runtime timer
query_sql="""
SELECT 
    categories, 
    SUM(total) AS total_sales
FROM sales
GROUP BY categories
ORDER BY total_sales DESC
"""
start_time = time.time()

# 1. Total sales by category
start = time.time()
results = con.execute(query_sql).df()

print(f"DuckDB: Time for sales by category calculation: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

DuckDB: Time for sales by category calculation: 0.1401681900024414 seconds

  categories  total_sales
0 Electronics 1.168493e+10
1 Stationery  7.014109e+09
2 Office      7.006807e+09
3 Sundry      2.338428e+09

Fireducks

import fireducks.pandas as pd

# Start the timer
start_time = time.time()

total_sales_by_category = df_fire.groupby('categories')['total'].sum().sort_values(ascending=False)
print(total_sales_by_category)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to calculate sales by category: {elapsed_time} seconds")

#
# Fireducks output
#

categories
Electronics    1.168493e+10
Stationery     7.014109e+09
Office         7.006807e+09
Sundry         2.338428e+09
Name: total, dtype: float64

Fireducks: Time taken to calculate sales by category:  0.13571524620056152 seconds

There is not much in it there, either. Fireducks shades it.

Test 3— Top 5 customer spend

DuckDB

# duckdb process data
#
import duckdb
import time

# Start total runtime timer
query_sql="""
SELECT 
    customer_id, 
    customer_name, 
    SUM(total) AS total_purchase
FROM sales
GROUP BY customer_id, customer_name
ORDER BY total_purchase DESC
LIMIT 5
"""
start_time = time.time()

# 1. Total sales by category
start = time.time()
results = con.execute(query_sql).df()

print(f"DuckdDB: Time to calculate top 5 customers: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

DuckDB: Time to calculate top 5 customers: 1.4588654041290283 seconds

  customer_id customer_name  total_purchase
0 681         Customer_20387 6892.96
1 740         Customer_30499 6613.11
2 389         Customer_22686 6597.35
3 316         Customer_185   6565.38
4 529         Customer_1609  6494.35

Fireducks

import fireducks.pandas as pd

# Start the timer
start_time = time.time()

top_5_customers = df_fire.groupby(['customer_id', 'customer_name'])['total'].sum().sort_values(ascending=False).head(5)
print(top_5_customers)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to calculate top 5 customers: {elapsed_time} seconds")

#
# Fireducks output
#

customer_id  customer_name 
681          Customer_20387    6892.96
740          Customer_30499    6613.11
389          Customer_22686    6597.35
316          Customer_1859     6565.38
529          Customer_1609     6494.35
Name: total, dtype: float64
Fireducks: Time taken to calculate top 5 customers: 2.823930263519287 seconds

DuckDB wins that one, being almost twice as fast as Fireducks.

Test 4— Monthly sales figures

DuckDB

import duckdb
import time

# Start total runtime timer
query_sql="""
SELECT 
    DATE_TRUNC('month', order_date) AS month,
    SUM(total) AS monthly_sales
FROM sales
GROUP BY month
ORDER BY month
"""
start_time = time.time()

# 1. Total sales by category
start = time.time()
results = con.execute(query_sql).df()

print(f"DuckDB: Time for seasonal trend calculation: {time.time() - start_time} seconds")

results

# 
# DuckDB output
#

DuckDB: Time for seasonal trend calculation: 0.16109275817871094 seconds

  month        monthly_sales
0 2010-01-01   1.699500e+08
1 2010-02-01   1.535730e+08
2 2010-03-01   1.702968e+08
3 2010-04-01   1.646421e+08
4 2010-05-01   1.704506e+08
... ... ...
163 2023-08-01 1.699263e+08
164 2023-09-01 1.646018e+08
165 2023-10-01 1.692184e+08
166 2023-11-01 1.644883e+08
167 2023-12-01 1.643962e+08

168 rows × 2 columns

Fireducks

import fireducks.pandas as pd
import time

def seasonal_trend():
    # Ensure 'order_date' is datetime
    df_fire['order_date'] = pd.to_datetime(df_fire['order_date'])

    # Extract 'month' as string
    df_fire['month'] = df_fire['order_date'].dt.strftime('%Y-%m')

    # Group by 'month' and sum 'total'
    results = (
        df_fire.groupby('month')['total']
        .sum()
        .reset_index()
        .sort_values('month')
    )
    print(results)

start_time = time.time()
seasonal_trend()
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time

print(f"Fireducks: Time for seasonal trend calculation: {time.time() - start_time} seconds")

#
# Fireducks Output
#

       month         total
0    2010-01  1.699500e+08
1    2010-02  1.535730e+08
2    2010-03  1.702968e+08
3    2010-04  1.646421e+08
4    2010-05  1.704506e+08
..       ...           ...
163  2023-08  1.699263e+08
164  2023-09  1.646018e+08
165  2023-10  1.692184e+08
166  2023-11  1.644883e+08
167  2023-12  1.643962e+08

[168 rows x 2 columns]
Fireducks: Time for seasonal trend calculation: 3.109074354171753 seconds

DuckDB was significantly quicker in this example.

Test 5— Average order by product

DuckDB

import duckdb
import time

# Start total runtime timer
query_sql="""
SELECT 
    product_id,
    product_names,
    AVG(total) AS avg_order_value
FROM sales
GROUP BY product_id, product_names
ORDER BY avg_order_value DESC
"""
start_time = time.time()

# 1. Total sales by category
start = time.time()
results = con.execute(query_sql).df()

print(f"DuckDB: Time for average order by product calculation: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

DuckDB: Time for average order by product calculation: 0.13720130920410156 seconds

  product_id product_names avg_order_value
0 206        Paper         280.529144
1 208        Notebook      280.497268
2 201        Smartphone    280.494779
3 207        Pen           280.491508
4 205        Printer       280.470150
5 200        Laptop        280.456913
6 209        Coffee Maker  280.445365
7 211        Plastic Cups  280.440161
8 210        Cabinet       280.426960
9 202        Desk          280.367135
10 203       Chair         280.364045
11 204       Monitor       280.329706

Fireducks

import fireducks.pandas as pd

# Start the timer
start_time = time.time()

avg_order_value = df_fire.groupby(['product_id', 'product_names'])['total'].mean().sort_values(ascending=False)
print(avg_order_value)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time

print(f"Fireducks: Time for average order calculation: {time.time() - start_time} seconds")

#
# Fireducks output
#

product_id  product_names
206         Paper            280.529144
208         Notebook         280.497268
201         Smartphone       280.494779
207         Pen              280.491508
205         Printer          280.470150
200         Laptop           280.456913
209         Coffee Maker     280.445365
211         Plastic Cups     280.440161
210         Cabinet          280.426960
202         Desk             280.367135
203         Chair            280.364045
204         Monitor          280.329706
Name: total, dtype: float64
Fireducks: Time for average order calculation: 0.06766319274902344 seconds

Fireducks gets one back there and was twice as fast as DuckDB.

Test 6— product performance analysis

DuckDB

import duckdb
import time

# Start total runtime timer
query_sql="""
WITH yearly_sales AS (
    SELECT 
        EXTRACT(YEAR FROM order_date) AS year,
        SUM(total) AS total_sales
    FROM sales
    GROUP BY year
)
SELECT 
    year,
    total_sales,
    LAG(total_sales) OVER (ORDER BY year) AS prev_year_sales,
    (total_sales - LAG(total_sales) OVER (ORDER BY year)) / LAG(total_sales) OVER (ORDER BY year) * 100 AS yoy_growth
FROM yearly_sales
ORDER BY year
"""
start_time = time.time()

# 1. Total sales by category
start = time.time()
results = con.execute(query_sql).df()

print(f"DuckDB: Time for product performance analysis calculation: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

Time for product performance analysis  calculation: 0.03958845138549805 seconds

   year total_sales prev_year_sales yoy_growth
0  2010 2.002066e+09 NaN            NaN
1  2011 2.002441e+09 2.002066e+09   0.018739
2  2012 2.008966e+09 2.002441e+09   0.325848
3  2013 2.002901e+09 2.008966e+09  -0.301900
4  2014 2.000773e+09 2.002901e+09  -0.106225
5  2015 2.001931e+09 2.000773e+09   0.057855
6  2016 2.008762e+09 2.001931e+09   0.341229
7  2017 2.002164e+09 2.008762e+09  -0.328457
8  2018 2.002383e+09 2.002164e+09   0.010927
9  2019 2.002891e+09 2.002383e+09   0.025383
10 2020 2.008585e+09 2.002891e+09   0.284318
11 2021 2.000244e+09 2.008585e+09  -0.415281
12 2022 2.004500e+09 2.000244e+09   0.212756
13 2023 1.995672e+09 2.004500e+09  -0.440401

Fireducks

import fireducks.pandas as pd

# Start the timer
start_time = time.time()

df_fire['year'] = pd.to_datetime(df_fire['order_date']).dt.year
yearly_sales = df_fire.groupby('year')['total'].sum().sort_index()
yoy_growth = yearly_sales.pct_change() * 100

result = pd.DataFrame({
    'year': yearly_sales.index,
    'total_sales': yearly_sales.values,
    'prev_year_sales': yearly_sales.shift().values,
    'yoy_growth': yoy_growth.values
})

print(result)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time for product performance analysis  calculation: {time.time() - start_time} seconds")

#
# Fireducks output
#

    year   total_sales  prev_year_sales  yoy_growth
0   2010  2.002066e+09              NaN         NaN
1   2011  2.002441e+09     2.002066e+09    0.018739
2   2012  2.008966e+09     2.002441e+09    0.325848
3   2013  2.002901e+09     2.008966e+09   -0.301900
4   2014  2.000773e+09     2.002901e+09   -0.106225
5   2015  2.001931e+09     2.000773e+09    0.057855
6   2016  2.008762e+09     2.001931e+09    0.341229
7   2017  2.002164e+09     2.008762e+09   -0.328457
8   2018  2.002383e+09     2.002164e+09    0.010927
9   2019  2.002891e+09     2.002383e+09    0.025383
10  2020  2.008585e+09     2.002891e+09    0.284318
11  2021  2.000244e+09     2.008585e+09   -0.415281
12  2022  2.004500e+09     2.000244e+09    0.212756
13  2023  1.995672e+09     2.004500e+09   -0.440401

Time for product performance analysis  calculation: 0.17495489120483398 seconds

DuckDB is quicker this time.

Test 7 – Add a new column to the data set and update its value

DuckDB

import duckdb

from datetime import datetime

start_time = time.time()

# Add new columns
con.execute("""
ALTER TABLE sales ADD COLUMN total_with_tax FLOAT
"""
)

# Perform the calculations and update the table
con.execute("""
UPDATE sales
SET total_with_tax = CASE 
    WHEN total <= 100 THEN total * 1.125  -- 12.5% tax
    WHEN total > 100 AND total <= 200 THEN total * 1.15   -- 15% tax
    WHEN total > 200 AND total <= 500 THEN total * 1.17   -- 17% tax
    WHEN total > 500 THEN total * 1.20   -- 20% tax
END;
""")

print(f"Time to add new column: {time.time() - start_time} seconds")

# Verify the new columns
result = con.execute("""
    SELECT 
        *
    FROM sales
    LIMIT 10;
""").fetchdf()

print(result)

#
# DuckDB output
#

Time to add new column: 2.4016575813293457 seconds

   order_id order_date  customer_id   customer_name  product_id product_names  
0         0 2021-11-25          238  Customer_25600         211  Plastic Cups   
1         1 2017-06-10          534  Customer_14188         209  Coffee Maker   
2         2 2010-02-15          924  Customer_14013         207           Pen   
3         3 2011-01-26          633   Customer_6120         211  Plastic Cups   
4         4 2014-01-11          561   Customer_1352         205       Printer   
5         5 2021-04-19          533   Customer_5342         208      Notebook   
6         6 2012-03-14          684  Customer_21604         207           Pen   
7         7 2017-07-01          744  Customer_30291         201    Smartphone   
8         8 2013-02-13          678  Customer_32618         204       Monitor   
9         9 2023-01-04          340  Customer_16898         207           Pen   

    categories  quantity  price   total  total_with_tax  
0       Sundry         2  99.80  199.60      229.539993  
1  Electronics         8   7.19   57.52       64.709999  
2   Stationery         6  70.98  425.88      498.279602  
3       Sundry         6  94.38  566.28      679.536011  
4  Electronics         4  44.68  178.72      205.528000  
5   Stationery         4  21.85   87.40       98.324997  
6   Stationery         3  93.66  280.98      328.746613  
7  Electronics         6  39.41  236.46      276.658203  
8  Electronics         2   4.30    8.60        9.675000  
9   Stationery         2   6.67   13.34       15.007500  

Fireducks

import numpy as np
import time
import fireducks.pandas as pd

# Start total runtime timer
start_time = time.time()
# Define tax rate conditions and choices
conditions = [
    (df_fire['total'] <= 100),
    (df_fire['total'] > 100) & (df_fire['total'] <= 200),
    (df_fire['total'] > 200) & (df_fire['total'] <= 500),
    (df_fire['total'] > 500)
]

choices = [1.125, 1.15, 1.17, 1.20]

# Calculate total_with_tax using np.select for efficiency
df_fire['total_with_tax'] = df_fire['total'] * np.select(conditions, choices)

# Print total runtime
print(f"Fireducks: Time to add new column: {time.time() - start_time} seconds")
print(df_fire)

#
# Fireducks output
#

Fireducks: Time to add new column: 2.7112433910369873 seconds

          order_id order_date  customer_id   customer_name  product_id  
0                0 2021-11-25          238  Customer_25600         211   
1                1 2017-06-10          534  Customer_14188         209   
2                2 2010-02-15          924  Customer_14013         207   
3                3 2011-01-26          633   Customer_6120         211   
4                4 2014-01-11          561   Customer_1352         205   
...            ...        ...          ...             ...         ...   
99999995  99999995 2011-10-24          501  Customer_29289         202   
99999996  99999996 2011-10-21          593  Customer_29352         200   
99999997  99999997 2019-05-10          673  Customer_25709         202   
99999998  99999998 2022-03-02          709  Customer_23966         208   
99999999  99999999 2023-06-16          102   Customer_9650         203   

         product_names   categories  quantity  price   total    month  year  
0         Plastic Cups       Sundry         2  99.80  199.60  2021-11  2021   
1         Coffee Maker  Electronics         8   7.19   57.52  2017-06  2017   
2                  Pen   Stationery         6  70.98  425.88  2010-02  2010   
3         Plastic Cups       Sundry         6  94.38  566.28  2011-01  2011   
4              Printer  Electronics         4  44.68  178.72  2014-01  2014   
...                ...          ...       ...    ...     ...      ...   ...   
99999995          Desk       Office         7  53.78  376.46  2011-10  2011   
99999996        Laptop  Electronics         3  67.42  202.26  2011-10  2011   
99999997          Desk       Office         5   9.12   45.60  2019-05  2019   
99999998      Notebook   Stationery         1  78.91   78.91  2022-03  2022   
99999999         Chair       Office         4  59.58  238.32  2023-06  2023   

          total_with_tax  
0              229.54000  
1               64.71000  
2              498.27960  
3              679.53600  
4              205.52800  
...                  ...  
99999995       440.45820  
99999996       236.64420  
99999997        51.30000  
99999998        88.77375  
99999999       278.83440  

[100000000 rows x 13 columns]

They have very similar run times yet again. A draw.

Test 8 – Write out the updated data to a CSV file

DuckDB

start_time = time.time()

# Write the modified sales_data table to a CSV file
start = time.time()
con.execute("""
    COPY (SELECT * FROM sales) TO '/mnt/d/sales_data/final_sales_data_duckdb.csv' WITH (HEADER TRUE, DELIMITER ',')
""")

print(f"DuckDB: Time to write CSV to file: {time.time() - start_time} seconds")

DuckDB: Time to write CSV to file: 54.899176597595215 seconds

Fireducks

# fireducks write data back to CSV
#
import fireducks.pandas as pd

# Tidy up DF before writing out
cols_to_drop = ['year', 'month']
df_fire = df_fire.drop(columns=cols_to_drop)
df_fire['total_with_tax'] = df_fire['total_with_tax'].round(2) 
df_fire['order_date'] = df_fire['order_date'].dt.date

# Start total runtime timer
start_time = time.time()

df_fire.to_csv('/mnt/d/sales_data/fireducks_sales.csv',quoting=0,index=False)

# Print total runtime
print(f"Fireducks: Time to write CSV  to file: {time.time() - start_time} seconds")

Fireducks: Time to write CSV  to file: 54.490307331085205 seconds

Too close to call again.

Test 9— Write out the updated data to a parquet file

DuckDB

# DuckDB write Parquet data
# 

start_time = time.time()

# Write the modified sales_data table to a Parquet file
start = time.time()
con.execute("COPY sales TO '/mnt/d/sales_data/final_sales_data_duckdb.parquet' (FORMAT 'parquet');")

print(f"DuckDB: Time to write parquet to file: {time.time() - start_time} seconds")

DuckDB: Time to write parquet to file: 30.011869192123413 seconds

Fireducks

import fireducks.pandas as pd
import time

# Start total runtime timer
start_time = time.time()

df_fire.to_parquet('/mnt/d/sales_data/fireducks_sales.parquet')

# Print total runtime
print(f"Fireducks: Time to write Parquet to file: {time.time() - start_time} seconds")

Fireducks: Time to write Parquet to file: 86.29632377624512 seconds

That’s the first major discrepancy between run times. Fireducks took almost a minute longer to write out its data to Parquet than did DuckDB.

Summary

So, what are we to make of all this? Simply put, there is not much between these two libraries. Both are superfast and capable of processing large data sets. Once your data is in memory, either in a DuckDB table or a Fireducks dataframe, both libraries are equally capable of processing it in double-quick time.

The choice of which one to use depends on your existing infrastructure and skill set.

If you’re a database person, DuckDB is the obvious library to use, as your SQL skills would be instantly transferable.

Alternatively, if you're already embedded in the Pandas world, Fireducks would be a great choice for you.

_OK, that’s all for me just now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories, follow me or subscribe to get notified when I post new content._

If you like this content, you might find these articles interesting, too.

Building a Data Dashboard

Speed up Pandas code with Numpy

The post Battle of the Ducks appeared first on Towards Data Science.

]]>
The Concepts Data Professionals Should Know in 2025: Part 2 https://towardsdatascience.com/the-concepts-data-professionals-should-know-in-2025-part-2-c0e308946463/ Mon, 20 Jan 2025 11:02:02 +0000 https://towardsdatascience.com/the-concepts-data-professionals-should-know-in-2025-part-2-c0e308946463/ From AI Agent to Human-In-The-Loop - Master 12 critical data concepts and turn them into simple projects to stay ahead in IT.

The post The Concepts Data Professionals Should Know in 2025: Part 2 appeared first on Towards Data Science.

]]>
From AI Agent to Human-In-The-Loop — Master 12 data concepts and turn them into simple projects to stay ahead in IT.

Innovation in the field of data is progressing rapidly.

Let’s take a quick look at the timeline of GenAI: ChatGPT, launched in November 2022, became the world’s best-known application of generative AI in early 2023. By spring 2025, leading companies like Salesforce (Marketing Cloud Growth) and Adobe (Firefly) integrated it into mainstream applications – making it accessible to companies of various sizes. Tools like MidJourney advanced image generation, while at the same time, discussions about agentic AI took center stage. Today, tools like ChatGPT have already become common for many private users.

That's why I have compiled 12 terms that you will certainly encounter as a data engineer, data scientist or data analyst in 2025 and that are important to understand. Why are they relevant? What are the challenges? And how can you apply them to a small project?

Table of Contents
Terms 1–6 in part 1: Data Warehouse, Data Lake, Data Lakehouse, Cloud Platforms, Optimizing Data Storage, Big Data Technologies, ETL, ELT and Zero-ETL, Event-Driven Architecture
  • 7 – Data Lineage & XAI
  • 8 – Gen AI
  • 9 – Agentic AI
  • 10 – Inference Time Compute
  • 11 – Near Infinite Memory
  • 12 – Human-In-The-Loop Augmentation
  • Final Thoughts

In the first part, we looked at terms for the basics of understanding modern data systems (storage, management & processing of data). In part 2, we now move beyond infrastructure and dive into some terms related to Artificial Intelligence that use this data to drive innovation.

7 – Explainability of predictions and traceability of data: XAI & Data Lineage

As data and AI tools become increasingly important in our everyday lives, we also need to know how to track them and create transparency for decision-making processes and predictions:

Let’s imagine a scenario in a hospital: A deep learning model is used to predict the chances of success of an operation. A patient is categorised as ‘unsuitable’ for the operation. The problem for the medical team? There is no explanation as to how the model arrived at this decision. The internal processes and calculations that led to the prediction remain hidden. It is also not clear which attributes – such as age, state of health or other parameters – were decisive for this assessment. Should the medical team nevertheless believe the prediction and not proceed with the operation? Or should they proceed as they see best fit?

This lack of transparency can lead to uncertainty or even mistrust in AI-supported decisions. Why does this happen? Many deep learning models provide us with results and excellent predictions – much better than simple models can do. However, the models are ‘black boxes’ – we don’t know exactly how the models arrived at the results and what features they used to do so. While this lack of transparency hardly plays a role in everyday applications, such as distinguishing between cat and dog photos, the situation is different in critical areas: For example, in healthcare, financial decisions, criminology or recruitment processes, we need to be able to understand how and why a model arrives at certain results.

This is where Explainable AI (XAI) comes into play: techniques and methods that attempt to make the decision-making process of AI models understandable and comprehensible. Examples of this are SHAP (SHapley Additive ExPlanations) or LIME (Local Interpretable Model-agnostic Explanations). These tools can at least show us which features contributed most to a decision.

Data Lineage, on the other hand, helps us understand where data comes from, how it has been processed and how it is ultimately used. In a BI tool, for example, a report with incorrect figures could be used to check whether the problem occurred with the data source, the transformation or when loading the data.

Why are the terms important?

XAI: The more AI models we use in everyday life and as decision-making aids, the more we need to know how these models have achieved their results. Especially in areas such as finance and healthcare, but also in processes such as HR and social services.

Data Lineage: In the EU there is GDPR, in California CCPA. These require companies to document the origin and use of data in a comprehensible manner. What does that mean in concrete terms? If companies have to comply with data protection laws, they must always know where the data comes from and how it was processed.

What are the challenges?

  1. Complexity of the data landscape (data lineage): In distributed systems and multi-cloud environments, it is difficult to fully track the data flow.
  2. Performance vs. transparency (XAI): Deep learning models often deliver more precise results, but their decision paths are difficult to trace. Simpler models, on the other hand, are usually easier to interpret but less accurate.

Small project idea to better understand the terms:

Use SHAP (SHapley Additive ExPlanations) to explain the decision logic of a machine learning model: Create a simple ML model with scikit-learn to predict house prices, for example. Then install the SHAP library in Python and visualize how the different features influence the price prediction.
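
A minimal sketch of such a project, assuming scikit-learn's built-in California housing data stands in for "house prices" and a random forest is used as the model:

# Minimal SHAP sketch - the dataset and model choice are just one possible setup.
import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Train a simple model on a house-price dataset
data = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Explain the predictions: which features push the predicted price up or down?
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test.iloc[:200])  # a subset keeps it fast
shap.summary_plot(shap_values, X_test.iloc[:200])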

8 – Generative AI (Gen AI)

Since ChatGPT took off in January 2023, the term Gen AI has also been on everyone's lips. Generative AI refers to AI models that can generate new content from an input. Outputs can be texts, images, music or videos. For example, there are now even fashion stores that have created their advertising images using generative AI (e.g. Calvin Klein, Zalando).

"We started OpenAI almost nine years ago because we believed that AGI was possible, and that it could be the most impactful technology in human history. We wanted to figure out how to build it and make it broadly beneficial; […]"

Reference: Sam Altman, CEO of OpenAI

Why is the term important?

Clearly, GenAI can greatly increase efficiency. The time required for tasks such as content creation, design or texts is reduced for companies. GenAI is also changing many areas of our working world. Tasks are being performed differently, jobs are changing and data is becoming even more important.

In Salesforce’s latest marketing automation tool, for example, users can enter a prompt in natural language, which generates an email layout – even if this does not always work reliably in reality.

What are the challenges?

  1. Copyrights and ethics: The models are trained with huge amounts of data that originate from us humans and try to generate the most realistic results possible based on this (e.g. also with texts by authors or images by well-known painters). One problem is that GenAI can imitate existing works. Who owns the result? A simple way to minimize this problem at least somewhat is to clearly label AI-generated content as such.
  2. Costs and energy: Large models require a very large amount of computing resources.
  3. Bias and misinformation: The models are trained with specific data. If the data already contains a bias (e.g. less data from one gender, less data from one country), these models can reproduce biases. For example, if an HR tool has been trained with more male than female data, it could favor male applicants in a job application. And of course, sometimes the models simply provide incorrect information.

Small project idea to better understand the terms:

Create a simple chatbot that accesses the GPT-4 API and can answer a question. I have attached a step-by-step guide at the bottom of the page.
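
A minimal sketch using the openai Python package (v1-style client); it assumes you have installed the package, set an OPENAI_API_KEY environment variable and have access to a GPT-4 model:

# Minimal chatbot sketch - requires `pip install openai` and an OPENAI_API_KEY env variable.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # adjust to whichever model your account can access
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask("Explain data lineage in one sentence."))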

9 – Agentic AI / AI Agents

Agentic AI is currently a hotly debated topic and is based on generative AI. AI agents describe intelligent systems that can think, plan and act "autonomously":

"This is what AI was meant to be. […] And I am really excited about this. I think this is going to change companies forever. I think it’s going to change software forever. And I think it’ll change Salesforce forever."

_Reference: Marc Benioff, Salesforce CEO about Agents & Agentforce_

AI Agents are, so to speak, a continuation of traditional chatbots and bots. These systems promise to solve complex problems by creating multi-level plans, learning from data and making decisions based on this and executing them autonomously.

Multi-step plans mean that the AI thinks several steps ahead to achieve a goal.

Let’s imagine a quick example: An AI agent has the task of delivering a parcel. Instead of simply following the sequence of orders, the AI could first analyze the traffic situation, calculate the fastest route and then deliver the various parcels in this calculated sequence.

Why is the term important?

The ability to execute multi-step plans sets AI Agents apart from previous bots and chatbots and brings a new era of autonomous systems.

If AI Agents can actually be used in businesses, companies can automate repetitive tasks through agents, reducing costs and increasing efficiency. The economic benefits and competitive advantage would be there. As the Salesforce CEO says in the interview, it can change our corporate world tremendously.

What are the challenges?

  1. Logical consistency and (current) technological limitations: Current models struggle with consistent logical thinking – especially when it comes to handling complex scenarios with multiple variables. And that’s exactly what they’re there for – or that’s how they’re advertised. This means that in 2025 there will definitely be an increased need for better models.
  2. Ethics and acceptance: Autonomous systems can make decisions and solve their own tasks independently. How can we ensure that autonomous systems do not make decisions that violate ethical standards? As a society, we also need to define how quickly we want to integrate such changes into our everyday (working) lives without taking employees by surprise. Not everyone has the same technical know-how.

Small project idea to better understand the term:

Create a simple AI agent with Python: first define the agent's task. For example, the agent should retrieve data from an API. Use Python to coordinate the API query, filter the results and automatically email them to the user. Then implement a simple decision logic: for example, if no result matches the filter criteria, the search radius is extended. A minimal sketch follows below.
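
Here is one possible version of this agent loop, assuming a hypothetical offers API and with the email step stubbed out:

# Illustrative agent sketch - the endpoint is hypothetical and the email step is stubbed out.
import requests

def search_offers(radius_km: int) -> list:
    # Hypothetical API call; replace with a real endpoint and parameters.
    response = requests.get("https://api.example.com/offers", params={"radius_km": radius_km})
    response.raise_for_status()
    return response.json()

def notify_user(offers: list) -> None:
    # Placeholder for smtplib or an email service.
    print(f"Emailing the user about {len(offers)} matching offers")

def run_agent(max_price: float, radius_km: int = 10, max_radius_km: int = 50) -> None:
    while radius_km <= max_radius_km:
        offers = [o for o in search_offers(radius_km) if o.get("price", float("inf")) <= max_price]
        if offers:  # goal reached: report and stop
            notify_user(offers)
            return
        radius_km += 10  # simple decision logic: widen the search and try again
    print("No matching offers found, even with the maximum search radius.")

if __name__ == "__main__":
    run_agent(max_price=100.0)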

10 – Inference Time Compute

Next, we focus on the efficiency and performance of using AI models: An AI model receives input data, makes a prediction or decision based on it and gives an output. This process requires computing time, which is referred to as inference time compute. Modern models such as AI agents go one step further by flexibly adapting their computing time to the complexity of the task.

Basically, it’s the same as with us humans: When we have to solve more complex problems, we invest more time. AI models use dynamic reasoning (adapting computing time according to task requirements) and chain reasoning (using multiple decision steps to solve complex problems).

Why is the term important?

AI and models are becoming increasingly important in our everyday lives. The demand for dynamic AI systems (AI that adapts flexibly to requests and understands our requests) will increase. Inference time affects the performance of systems such as chatbots, autonomous vehicles and real-time translators. AI models that adapt their inference time to the complexity of the task and therefore "think" for different lengths of time will improve efficiency and accuracy.

What are the challenges?

  1. Performance vs. quality: Do you want a fast but less accurate or a slow but very accurate solution? Shorter inference times improve efficiency, but can compromise accuracy for complex tasks.
  2. Energy consumption: The longer the inference time, the more computing power is required. This in turn increases energy consumption.

11 – Near Infinite Memory

Near Infinite Memory is a concept that describes how technologies can store and process enormous amounts of data almost indefinitely.

For us users, it seems like infinite storage – but it is actually more of a combination of scalable cloud services, data-optimized storage solutions and intelligent data management systems.

Why is this term important?

The data we generate is growing exponentially due to the increasing use of IoT, AI and Big Data. As already described in terms 1–3, this creates ever greater demands on data architectures such as data lakehouses. AI models also require enormous amounts of data for training and validation. It is therefore important that storage solutions become more efficient.

What are the challenges?

  1. Energy consumption: Large storage solutions in cloud data centers consume immense amounts of energy.
  2. Security concerns and dependence on centralized services: Many near-infinite memory solutions are provided by cloud providers. This can create a dependency that brings financial and data protection risks.

Small project idea to better understand the terms:

Develop a practical understanding of how different data types affect storage requirements and learn how to use storage space efficiently. Take a look at the project under the term "Optimizing Data Storage".

12 – Human-In-The-Loop Augmentation

AI is becoming increasingly important, as the previous terms have shown. However, with the increasing importance of AI, we should ensure that the human part is not lost in the process.

"We need to let people who are harmed by technology imagine the future that they want."

_Reference: Timnit Gebru, former Head of Department of Ethics in AI at Google_

Human-in-the-loop augmentation is the interface between computer science and psychology, so to speak. It describes the collaboration between us humans and artificial intelligence. The aim is to combine the strengths of both sides:

  • A great strength of AI is that such models can efficiently process data in large quantities and discover patterns in it that are difficult for us to recognize.
  • We humans, on the other hand, bring judgment, ethics, creativity and contextual understanding to the table without needing to be pre-trained, and we have the ability to cope with unforeseen situations.

The goal must be for AI to serve us humans – and not the other way around.

Why is the term important?

AI can improve decision-making processes and minimize errors. In particular, AI can recognize patterns in data that are not visible to us, for example in the field of medicine or biology.

The MIT Center for Collective Intelligence published a study in Nature Human Behaviour in which they analyzed how well human-AI combinations perform compared to purely human or purely AI-controlled systems:

  • In decision-making tasks, human-AI combinations often performed worse than AI systems alone (e.g. medical diagnoses / classification of deepfakes).
  • In creative tasks, the interaction already works better. Here, human-AI teams outperformed both humans and AI alone.

However, the study shows that human-in-the-loop augmentation does not yet work perfectly.

Reference: Humans and AI: Do they work better together or alone?

What are the challenges?

  1. Lack of synergy and mistrust: It seems that there is a lack of intuitive interfaces that make it easier for us humans to interact effectively enough with AI tools. Another challenge is that AI systems are sometimes viewed critically or even rejected.
  2. (Current) technological limitations of AI: Current AI systems struggle to understand logical consistency and context. This can lead to erroneous or inaccurate results. For example, an AI diagnostic system could misjudge a rare case because it does not have enough data for such cases.

Final Thoughts

The terms in this article show only a selection of the innovations we are currently seeing – the list could certainly be extended. For example, in the area of AI models, model size will also play an important role: in addition to very large models (with up to 50 trillion parameters), very small models will probably also be developed that contain only a few billion parameters. The advantage of these small models will be that they do not require huge data centers and GPUs, but can run on our laptops or even on our smartphones and perform very specific tasks.

Which terms do you think are super important? Let us know in the comments.

Where can you continue learning?

Own visualization – Illustrations from unDraw.co

All information in this article is based on the current status in January 2025.


The Concepts Data Professionals Should Know in 2025: Part 1 https://towardsdatascience.com/the-concepts-data-professionals-should-know-in-2025-part-1-47e7e797801d/ Sun, 19 Jan 2025 19:02:04 +0000 https://towardsdatascience.com/the-concepts-data-professionals-should-know-in-2025-part-1-47e7e797801d/ From Data Lakehouses to Event-Driven Architecture - Master 12 data concepts and turn them into simple projects to stay ahead in IT.


From Data Lakehouses to Event-Driven Architecture — Master 12 data concepts and turn them into simple projects to stay ahead in IT.

When I scroll through YouTube or LinkedIn and see topics like RAG, Agents or Quantum Computing, I sometimes get a queasy feeling about keeping up with these innovations as a data professional.

But when I then reflect on the topics my customers face daily as a Salesforce Consultant or as a Data Scientist at university, the challenges often seem more tangible: examples are faster data access, better data quality or boosting employees’ tech skills. The key issues are often less futuristic and can usually be simplified. That’s the focus of this and the next article:

I have compiled 12 terms that you will certainly encounter as a data engineer, data scientist and data analyst in 2025. Why are they relevant? What are the challenges? And how can you apply them to a small project?

So – Let’s dive in.

Table of Contents
1 – Data Warehouse, Data Lake, Data Lakehouse
2 – Cloud Platforms such as AWS, Azure & Google Cloud Platform
3 – Optimizing Data Storage
4 – Big Data Technologies such as Apache Spark & Kafka
5 – How Data Integration Becomes Real-Time Capable: ETL, ELT and Zero-ETL
6 – Event-Driven Architecture (EDA)
Terms 7–12 in part 2: Data Lineage & XAI, Gen AI, Agentic AI, Inference Time Compute, Near Infinite Memory, Human-In-The-Loop Augmentation
Final Thoughts

1 – Data Warehouse, Data Lake, Data Lakehouse

We start with the foundation for data architecture and storage to understand modern data management systems.

Data warehouses became really well known in the 1990s thanks to Business Intelligence tools from Oracle and SAP, for example. Companies began to store structured data from various sources in a central database. An example is weekly processed sales data in a business intelligence tool.

The next innovation was data lakes, which arose from the need to be able to store unstructured or semi-structured data flexibly. A data lake is a large, open space for raw data. It stores both structured and unstructured data, such as sales data alongside social media posts and images.

The next step in innovation combined data lake architecture with warehouse architecture: Data lakehouses were created.

The term was popularized by companies such as Databricks, which introduced its Delta Lake technology. This concept combines the strengths of both previous data platforms. It allows us to store unstructured data as well as quickly query structured data in a single system. The need for this data architecture has arisen primarily because warehouses are often too restrictive, while lakes are difficult to search.

Why are the terms important?

We are living in the era of Big Data – companies and private individuals are generating more and more data (structured as well as semi-structured and unstructured data).

A short personal anecdote: The year I turned 15, Facebook cracked the 500 million active user mark for the first time. Instagram was founded in the same year. In addition, the release of the iPhone 4 significantly accelerated the global spread of smartphones and shaped the mobile era. In the same year, Microsoft further developed and promoted Azure (which was released in 2008) to compete with Google Cloud and AWS. From today’s perspective, I can see how all these events made 2010 a decisive year for digitalisation – a year in which the transition to cloud technologies gained real momentum.

In 2010, around 2 zettabytes (ZB) of data were generated; in 2020 it was around 64 ZB, and in 2024 we are at around 149 ZB.

Reference: Statista

Due to the explosive data growth in recent years, we need to store the data somewhere – efficiently. This is where these three terms come into play. Hybrid architectures such as data lakehouses solve many of the challenges of big data. The demand for (near) real-time data analysis is also rising (see term 5 on zero ETL). And to remain competitive, companies are under pressure to use data faster and more efficiently. Data lakehouses are becoming more important as they offer the flexibility of a data lake and the efficiency of a data warehouse – without having to operate two separate systems.

What are the challenges?

  1. Data integration: As there are many different data sources (structured, semi-structured, unstructured), complex ETL / ELT processes are required.
  2. Scaling & costs: While data warehouses are expensive, data lakes can easily lead to data chaos (if no good data governance is in place) and lakehouses require technical know-how & investment.
  3. Access to the data: Permissions need to be clearly defined if the data is stored in a centralized storage.

Small project idea to better understand the terms:

Create a mini data lake with AWS S3: Upload JSON or CSV data to an S3 bucket, then process the data with Python and perform data analysis with Pandas, for example.
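A possible sketch of this project, assuming boto3 credentials are configured; the bucket name, file name and column names are placeholders you would replace with your own:

```python
# Sketch of a mini "data lake": upload a CSV to S3, read it back, analyse it.
# Assumes boto3 credentials are configured; names below are placeholders.
import io
import boto3
import pandas as pd

BUCKET = "my-mini-data-lake"      # placeholder bucket name
KEY = "raw/sales.csv"

s3 = boto3.client("s3")
s3.upload_file("sales.csv", BUCKET, KEY)           # 1) land the local raw file

obj = s3.get_object(Bucket=BUCKET, Key=KEY)        # 2) read it back
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

print(df.describe())                               # 3) simple analysis
print(df.groupby("region")["revenue"].sum())       # assumes these columns exist
```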

2 – Cloud Platforms such as AWS, Azure & Google Cloud Platform

Now we move on to the platforms on which the concepts from term 1 are often implemented.

Of course, everyone knows the term cloud platforms such as AWS, Azure or Google Cloud. These services provide us with a scalable infrastructure for storing large volumes of data. We can also use them to process data in real-time and to use Business Intelligence and Machine Learning tools efficiently.

But why are the terms important?

I work in a web design agency where we host our clients’ websites in one of the other departments. Before the easy availability of cloud platforms, this meant running our own servers in the basement – with all the challenges such as cooling, maintenance and limited scalability.

Today, most of our data architectures and AI applications run in the cloud. Cloud platforms have changed the way we store, process and analyse data over the last decades. Platforms such as AWS, Azure or Google Cloud offer us a completely new level of flexibility and scalability for model training, real-time analyses and generative AI.

What are the challenges?

  1. A quick personal example of how complex things get: While preparing for my Salesforce Data Cloud Certification (a data lakehouse), I found myself diving into a sea of new terms – all specific to the Salesforce world. Each cloud platform has its own terminology and tools, which makes it time-consuming for employees in companies to familiarize themselves with them.
  2. Data security: Sensitive data can often be stored in the cloud. Access control must be clearly defined – user management is required.

Small project idea to better understand the terms:

Create a simple data pipeline: Register with AWS, Azure or GCP with a free account and upload a CSV file (e.g. to an AWS S3 bucket). Then load the data into a relational database and use an SQL tool to perform queries.
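A local sketch of the second half of this pipeline, using SQLite as a stand-in for the managed relational database; the file and column names are again placeholders:

```python
# Local stand-in for the pipeline: CSV -> relational database -> SQL query.
# SQLite replaces the managed cloud database to keep the sketch self-contained.
import sqlite3
import pandas as pd

df = pd.read_csv("sales.csv")          # the file you uploaded to the bucket

conn = sqlite3.connect("pipeline.db")
df.to_sql("sales", conn, if_exists="replace", index=False)

query = "SELECT region, SUM(revenue) AS total FROM sales GROUP BY region ORDER BY total DESC"
print(pd.read_sql_query(query, conn))
conn.close()
```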

3 – Optimizing Data Storage

More and more data = more and more storage space required = more and more costs.

With the use of large amounts of data and the platforms and concepts from terms 1 and 2, the issue of efficiency and cost management also arises. To save on storage, reduce costs and speed up access, we need better ways to store, organize and access data more efficiently.

Strategies include data compression (e.g. Gzip) by removing redundant or unneeded data, data partitioning by splitting large data sets, indexing to speed up queries and the choice of storage format (e.g. CSV, Parquet, Avro).

Why is the term important?

Not only is my Google Drive and OneDrive storage nearly maxed out…

… in 2028, a total data volume of 394 zettabytes is expected.

It will therefore be necessary for us to be able to cope with growing data volumes and rising costs. In addition, large data centers consume immense amounts of energy, which in turn is critical in terms of the energy and climate crisis.

What are the challenges?

  1. Different formats are optimized for different use cases. Parquet, for example, is particularly suitable for analytical queries and large data sets, as it is organized on a column basis and read access is efficient. Avro, on the other hand, is ideal for streaming data because it can quickly convert data into a format that is sent over the network (serialization) and just as quickly convert it back to its original form when it is received (deserialization). Choosing the wrong format can affect performance by either wasting disk space or increasing query times.
  2. Cost / benefit trade-off: Compression and partitioning save storage space but can slow down computing performance and data access.
  3. Dependency on cloud providers: As a lot of data is stored in the cloud today, optimization strategies are often tied to specific platforms.

Small project idea to better understand the terms:

Compare different storage optimization strategies: Generate a 1 GB dataset with random numbers. Save the data set in three different formats such as CSV, Parquet & Avro (using the corresponding Python libraries). Then compress the files with Gzip or Snappy. Now load the data into a Pandas DataFrame using Python and compare the query speed.
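A possible starting point for the comparison, restricted to CSV with Gzip and Parquet with Snappy (Avro could be added with a library such as fastavro). It assumes pandas, NumPy and pyarrow are installed; increase the row count to approach 1 GB:

```python
# Sketch of the comparison for CSV (gzip) vs. Parquet (snappy).
import os
import time
import numpy as np
import pandas as pd

# 1 million rows of random numbers; scale this up towards 1 GB if you like.
df = pd.DataFrame(np.random.rand(1_000_000, 5), columns=list("abcde"))

df.to_csv("data.csv.gz", index=False, compression="gzip")
df.to_parquet("data.parquet", compression="snappy")   # requires pyarrow

for path, reader in [("data.csv.gz", pd.read_csv), ("data.parquet", pd.read_parquet)]:
    start = time.perf_counter()
    reader(path)
    print(f"{path}: {os.path.getsize(path) / 1e6:.1f} MB, "
          f"read in {time.perf_counter() - start:.2f}s")
```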

4 – Big Data Technologies such as Apache Spark & Kafka

Once the data has been stored using the storage concepts described in sections 1–3, we need technologies to process it efficiently.

We can use tools such as Apache Spark or Kafka to process and analyze huge amounts of data. They allow us to do this in real-time or in batch mode.

Spark is a framework that processes large amounts of data in a distributed manner and is used for tasks such as machine learning, Data Engineering and ETL processes.

Kafka is a tool that transfers data streams in real-time so that various applications can access and use them immediately. One example is the processing of real-time data streams in financial transactions or logistics.

Why is the term important?

In addition to the exponential growth in data, AI and machine learning are becoming increasingly important. Companies want to be able to process data in (almost) real-time: These Big Data technologies are the basis for real-time and batch processing of large amounts of data and are required for AI and streaming applications.

What are the challenges?

  1. Complexity of implementation: Setting up, maintaining and optimizing tools such as Apache Spark and Kafka requires in-depth technical expertise. In many companies, this is not readily available and must be built up or brought in externally. Distributed systems in particular can be complex to coordinate. In addition, processing large volumes of data can lead to high costs if the computing capacities in the cloud need to be scaled.
  2. Data quality: If I had to name one of my customers’ biggest problems, it would probably be data quality. Anyone who works with data knows that data quality can often be optimized in many companies… When data streams are processed in real-time, this becomes even more important. Why? In real-time systems, data is processed without delay and the results are sometimes used directly for decisions or are followed by reactions. Incorrect or inaccurate data can lead to wrong decisions.

Small project idea to better understand the terms:

Develop a small pipeline with Python that simulates, processes and saves real-time data: For example, simulate real-time data streams of temperature values. Then check whether the temperature exceeds a critical threshold value. As an extension, you can plot the temperature data in real-time.
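A minimal sketch of the simulated pipeline; the threshold, the value range and the CSV sink are arbitrary choices:

```python
# Simulated temperature stream with a threshold check, saved to CSV.
import csv
import random
import time
from datetime import datetime

THRESHOLD = 30.0  # °C

with open("temperatures.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "temperature", "alert"])
    for _ in range(20):                       # 20 simulated sensor readings
        temp = round(random.uniform(20.0, 35.0), 1)
        alert = temp > THRESHOLD
        writer.writerow([datetime.now().isoformat(), temp, alert])
        if alert:
            print(f"ALERT: {temp} °C exceeds the threshold of {THRESHOLD} °C")
        time.sleep(0.5)                       # simulates the arrival of new data
```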

5 – How Data Integration Becomes Real-Time Capable: ETL, ELT and Zero-ETL

ETL, ELT and Zero-ETL describe different approaches to integrating and transforming data.

While ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are familiar to most, Zero-ETL is a data integration concept introduced by AWS in 2022. It eliminates the need for separate extraction, transformation, and loading steps. Instead, data is analyzed directly in its original format – almost in real-time. The technology promises to reduce latency and simplify processes within a single platform.

Let’s take a look at an example: A company using Snowflake as a data warehouse can create a table that references the data in the Salesforce Data Cloud. This means that the organization can query the data directly in Snowflake, even if it remains in the Data Cloud.

Why are the terms important?

We live in an age of instant availability – thanks to the success of platforms such as WhatsApp, Netflix and Spotify.

This is exactly the promise cloud providers such as Amazon Web Services, Google Cloud and Microsoft Azure have made: data should be processed and analyzed almost in real-time and without major delays.

What are the challenges?

Here, too, there are similar challenges as with big data technologies: Data quality must be adequate, as incorrect data can lead directly to incorrect decisions during real-time processing. In addition, integration can be complex, although less so than with tools such as Apache Spark or Kafka.

Let me share a quick example to illustrate this: We implemented Data Cloud for a customer – the first-ever implementation in Switzerland since Salesforce started offering the Data Lakehouse solution. The entire knowledge base had to be built on the customer’s side. What did that mean? 1:1 training sessions with the power users and writing a lot of documentation.

This demonstrates a key challenge companies face: They must first build up this knowledge internally or rely on external resources as agencies or consulting companies.

Small project idea to better understand the terms:

Create a relational database with MySQL or PostgreSQL, add (simulated) real-time data from orders and use a cloud service such as AWS to stream the data directly into an analysis tool. Then visualize the data in a dashboard and show how new data becomes immediately visible.
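A local sketch that simulates the idea: SQLite stands in for MySQL/PostgreSQL, and instead of a real cloud streaming service a loop inserts order events and immediately re-queries them, so you can see how new data becomes visible right away:

```python
# Local sketch: simulated order events written to a relational table and
# queried right away. SQLite stands in for MySQL/PostgreSQL; a real setup
# would stream the rows into the database via a cloud service instead.
import random
import sqlite3
import time
from datetime import datetime

conn = sqlite3.connect("orders.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (ts TEXT, product TEXT, amount REAL)")

for _ in range(10):                                        # simulate 10 incoming orders
    order = (datetime.now().isoformat(),
             random.choice(["laptop", "phone", "monitor"]),
             round(random.uniform(100, 2000), 2))
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)", order)
    conn.commit()
    # the "dashboard" query sees the new order immediately
    total = conn.execute("SELECT COUNT(*), ROUND(SUM(amount), 2) FROM orders").fetchone()
    print(f"orders so far: {total[0]}, revenue: {total[1]}")
    time.sleep(0.5)

conn.close()
```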

6 – Event-Driven Architecture (EDA)

If we can transfer data between systems in (almost) real time, we also want to be able to react to it in (almost) real time: This is where the term Event-Driven Architecture (EDA) comes into play.

EDA is an architectural pattern in which applications are driven by events. An event is any relevant change in the system. Examples are when customers log in to the application or when a payment is received. Components of the architecture react to these events without being directly connected to each other. This in turn increases the flexibility and scalability of the application. Typical technologies include Apache Kafka or AWS EventBridge.

Why is the term important?

EDA plays an important role in real-time data processing. With the growing demand for fast and efficient systems, this architecture pattern is becoming increasingly important as it makes the processing of large data streams more flexible and efficient. This is particularly crucial for IoT, e-commerce and financial technologies.

Event-driven architecture also decouples systems: By allowing components to communicate via events, the individual components do not have to be directly dependent on each other.

Let’s take a look at an example: In an online store, the "order sent" event can automatically start a payment process or inform the warehouse management system. The individual systems do not have to be directly connected to each other.

What are the challenges?

  1. Data consistency: The asynchronous nature of EDA makes it difficult to ensure that all parts of the system have consistent data. For example, an order may be saved as successful in the database while the warehouse component has not correctly reduced the stock due to a network issue.
  2. Scaling the infrastructure: With high data volumes, scaling the messaging infrastructure (e.g. Kafka cluster) is challenging and expensive.

Small project idea to better understand the terms:

Simulate an Event-Driven Architecture in Python that reacts to customer events:

  1. First define an event: An example could be ‘New order’.
  2. Then create two functions that react to the event: 1) Send an automatic message to a customer. 2) Reduce the stock level by 1.
  3. Call the two functions one after the other as soon as the event is triggered. If you want to extend the project, you can work with frameworks such as Flask or FastAPI to trigger the events through external user input.
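A minimal sketch of the three steps above, with a tiny in-memory publish/subscribe mechanism standing in for a message broker such as Kafka or EventBridge:

```python
# Minimal event-driven sketch: an event is published and two decoupled
# handlers react to it without knowing about each other.
from collections import defaultdict

handlers = defaultdict(list)          # event name -> list of subscribed functions
stock = {"item-42": 10}

def subscribe(event_name, handler):
    handlers[event_name].append(handler)

def publish(event_name, payload):
    for handler in handlers[event_name]:
        handler(payload)

def send_customer_message(order):     # reaction 1
    print(f"Message to {order['customer']}: your order was received.")

def reduce_stock(order):              # reaction 2
    stock[order["item"]] -= 1
    print(f"Stock of {order['item']} is now {stock[order['item']]}")

subscribe("new_order", send_customer_message)
subscribe("new_order", reduce_stock)
publish("new_order", {"customer": "anna@example.com", "item": "item-42"})
```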

Final Thoughts

In this part, we have looked at terms that focus primarily on the storage, management & processing of data. These terms lay the foundation for understanding modern data systems.

In part 2, we shift the focus to AI-driven concepts and explore some key terms such as Gen AI, agent-based AI and human-in-the-loop augmentation.

Own visualization – Illustrations from unDraw.co

All information in this article is based on the current status in January 2025.


Modern Data And Application Engineering Breaks the Loss of Business Context https://towardsdatascience.com/modern-data-and-application-engineering-breaks-the-loss-of-business-context-7d0bca755adb/ Sat, 18 Jan 2025 15:57:59 +0000 https://towardsdatascience.com/modern-data-and-application-engineering-breaks-the-loss-of-business-context-7d0bca755adb/ Here's how your data retains its business relevance as it travels through your enterprise


A giant task for data and application engineering – Image created by DALL-E

I have previously written about the need to redefine the current data engineering discipline. I looked at it primarily from an organizational perspective and described what a data engineer should and should not take responsibility for.

The main argument was that business logic should be the concern of application engineers (developers) while all about data should be the data engineers’ concern. I advocated a redefinition of Data Engineering as "all about the movement, manipulation, and management of data".

Who cares for the intersection of data and logic? – Image from author

Now, as a matter of fact, the representation of the logic created by application engineers also results in data. Depending on which angle we look at this from, it means that we either have a technical gap or too much overlap at the intersection of data and logic.

So let’s roll up our sleeves and jointly take on the responsibility for maintaining the dependency between logic and data.

What exactly is data, information and the logic in between?

Let’s go through some basic definitions to better understand that dependency and how we can preserve it.

  • Data is the digitalized representation of information.
  • Information is data that has been processed and contextualized to provide meaning.
  • Logic is inherently conceptual, representing reasoning processes of various kinds, such as decision-making, answering, and problem-solving.
  • Applications are machine-executable, digital representations of human-defined logic using programming languages.
  • Programming languages are formal representation systems designed to express human logic in a way that computers can understand and execute as applications.
  • Machine Learning (ML) is the process of deriving information and logic from data through logic (sophisticated algorithms). The resulting logic can be saved in models.
  • Models are generated representations of logic derived from ML. Models can be used in applications to make intelligent predictions or decisions based on previously unseen data input. In this sense, models are software modules for logic that can’t be easily expressed by humans using programming languages.

Finally, we can conclude that logic applied to source data leads to information or other (machine-generated) logic. The logic itself can also be encoded or represented as data – quite similar to how information is digitalized.

The representation can be in the form of programming languages, compiled applications or executable images (like Docker), generated models from ML (like ONNX) and other intermediate representations such as Java bytecode for the JVM, LLVM Intermediate Representation (IR), or .NET Common Intermediate Language (CIL).

If we really work hard to maintain the relation between source data and the applied logic, we can re-create derived information at any time by re-executing that logic.

Now what does this buy us?

Business context is key to derive insight from data

Data without any business context (or metadata) is by and large worthless. The less you know about the schema and the logic that produced the data, the more difficult it is to derive information from it.

Data Empowers Business

Regrettably, we often regard metadata as secondary. Although the required information is usually available in the source applications, it’s rarely stored together with the related data. And this despite the fact that we know that even with the help of AI it’s extremely challenging and expensive to reconstruct the business context from data alone.

Why is context lost?

So why do we throw away context, when we later have to reconstruct it at much higher costs?

Remember, I’m not only talking about the data schema, which is generally considered important. It’s about the complete business context in which the information was created. This includes everything needed to re-create the information from the available sources (source data or the source application itself, the schema, and the logic in digitalized form) and information that helps to understand the meaning and background (descriptions, relations, time of creation, data owner, etc.).

The strategy to keep the ability to reconstruct derived data from logic is similar to the core principles of functional programming (FP) or data-oriented programming (DOP). These principles advise us to separate logic from data and allow us to transparently decide whether we keep only the logic and the source data or also cache the result of that logic for optimization purposes. Both ways are conceptually the same, as long as the logic (function) is idempotent and the source data is immutable.

Retaining or losing business context – Image by author

Now, I don’t want to add arguments to the discussion for or against the use of functional programming languages. While functional programming is increasingly used today, it was noted in 2023 that functional programming languages still collectively hold less than 5% of mind share.

Perhaps this is the reason why, at the enterprise level, we are still by and large only caching the resulting data and thus losing the business context of the source applications.

This really is a lamentable practice that data and application engineering urgently need to fix.

If we base our data and application architecture on the following principles, we stand a good chance of retaining business relevance as data flows through the enterprise.

Save and version all logic applied to source data

Logic and data versioned and persisted as an object – Image by author

Idempotency

Referencing functional programming principles, applications in our enterprise architecture can and should act like idempotent functions.

These functions, when called multiple times with the same input, produce exactly the same output as if they were called only once. In other words, executing such an application multiple times doesn’t change the result beyond the initial application.

However, within the application (at the micro level), the internal processing of data can vary extensively, as long as these variations do not affect the application’s external output or observable side effects (at the macro level).

Such internal processing might include local data manipulations, temporary calculations, intermediate state changes, or caching.

The internal workings of the application can even process data differently each time, as long as the final output remains consistent for the same input.

What we really need to avoid are global state changes that could make repeated calls with the same input produce different output.
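A small Python illustration of this distinction; the record structure and the internal cache are purely illustrative:

```python
# The first function is idempotent at the macro level (same input -> same
# output, internal caching allowed); the second is not, because it mutates
# global state on every call.
_cache = {}                                   # internal optimisation only

def normalise(record: dict) -> dict:          # idempotent
    key = record["id"]
    if key not in _cache:                     # internal processing may vary...
        _cache[key] = {**record, "name": record["name"].strip().lower()}
    return _cache[key]                        # ...but the output never does

counter = 0

def tag_with_sequence(record: dict) -> dict:  # NOT idempotent
    global counter
    counter += 1                              # global state change
    return {**record, "sequence": counter}    # same input, different output each call

r = {"id": 1, "name": "  Alice "}
assert normalise(r) == normalise(r)                 # repeated calls: identical result
print(tag_with_sequence(r), tag_with_sequence(r))   # repeated calls: diverging results
```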

Treat logic as data

The representations of our applications are currently managed by the application engineers. They store the source code and – since the emergence of DevOps – everything else needed to derive executable code in code repositories (such as Git). This is certainly a good practice, but the relationship between the application logic actually applied and a specific version of data is not managed by application engineering.

We don’t currently have a good system to manage and store the dynamic relationship between application logic and data with the same rigor as we do this for data on its own.

Digital representation starts with data, not logic. The logic is encoded in the source code of a programming language, which is compiled into machine-executable applications (files with machine-executable byte codes). For the operating system, it’s only data until a special, executable file is started as a program.

An operating system can easily start any application version to process data of a specific version. However, it also has no built-in functionality to track which application version has processed which data version.

We urgently need such a system on enterprise level. It’s needed as urgently as databases were once needed to manage data.

Since the representation of the application logic is also data, I believe both engineering disciplines are called upon to take responsibility.

Actively maintain relationships between logic and data

There are two main approaches to how logic and its associated data are managed in systems today. Either it is application-centric, as practiced in application engineering, or it is data-centric, as practiced in data engineering.


The type of logic management mainly practiced in the enterprise today is application-centric.

Applications are installed on operating systems primarily using application packaging systems. These systems make it possible to pull application versions from central repositories, handling all necessary dependencies.

By default, the well-known APT (Advanced Package Tool) does not support installing multiple versions of one application at the same time. It’s designed to manage and install a single version only.

Since container technology emerged on Linux, application engineering enhanced this system to better enable the management of applications in isolated environments.

This allows us to install and manage several versions of the same application side by side.

In a Kubernetes cluster, for instance, the executable docker images are managed in an image database called registry. The cluster dynamically installs and runs any application (a micro-service if you like) of a specific version requested in an isolated pod. Data is then read and written using persistent volume claims (PVC) from and to a database or data system.

The same application running concurrently as different versions in isolated pods – Image by author

While we do see advancements in managing the concurrent execution of several application versions, the dynamic relation of data and applied logic is still neglected. There is no standard way of managing this relationship over time.


Apache Spark, as a typical data-centric system, treats logic as functions that are tightly coupled to its source data. The core abstraction of a Resilient Distributed Dataset (RDD) defines such a data object as an abstract class with pre-defined low-level functions (map, reduce/aggregate, filter, join, etc.) that can sequentially be applied to the source data.

The chain of functions applied to the source data is tracked as a directed acyclic graph (DAG). An application in Spark is therefore an instantiated chain of functions applied to source data. Hence, the relationship of data and logic is properly managed by Spark in the RDD.

However, directly passing RDDs between applications is not possible due to the nature of RDDs and Spark’s architecture. An RDD tracks the lineage of logic applied to the source data, but it’s ephemeral and local to the application and can’t be transferred to another Spark application. Whenever you persist the data from an RDD to exchange it with other applications, the context of applied logic is again stripped away.
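For illustration, a small PySpark sketch (assuming a local Spark installation) that shows the lineage an RDD carries and how persisting the result strips that context away:

```python
# Sketch of how Spark tracks applied logic as lineage inside an RDD,
# and how that lineage is lost once the result is persisted as plain data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

source = sc.parallelize(range(1, 1_000_001))          # immutable source data
derived = source.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)

print(derived.toDebugString().decode())               # the DAG of applied logic
derived.saveAsTextFile("derived_output")              # persisted result: lineage is gone
                                                      # (fails if the directory exists)
spark.stop()
```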


Unfortunately both engineering disciplines cook their own soups. On one side we have applications managed in file systems, code repositories, and image registries maintained by application engineers. And on the other side we have data managed in databases or data platforms allowing application logic to be applied but maintained by data engineers.

Unfortunately, neither discipline has invented a good common system to manage the combination of data and applied logic. This relation is largely lost as soon as the logic has been applied and the resulting data needs to be persisted.

I can already hear you screaming that we have a principle to handle this. And yes, we have object-oriented programming (OOP), which has taught us to bundle logic and data into objects. This is true, but unfortunately it’s also true that OOP failed to deliver completely.

A good solution for the persistence and exchange of objects between applications running in completely different environments was not provided here either. Object-oriented database management systems (OODBMS) have never gained acceptance due to this restriction.

I think data and application engineering have to agree on a way to maintain the unit of data and applied logic as an object, but allow both parts to evolve independently.

Just imagine RDDs as a persistable abstraction that tracks the lineage of arbitrarily complex logic and can be exchanged between applications across system boundaries.

I described such an object as the abstraction ‘data as a product using a pure data structure’ in my article "Deliver Your Data as a Product, But Not as an Application".

Note that this concept is different from completely event-based data processing. Event-based processing systems are architectures in which all participating applications only communicate and process data in the form of events. An event is a record of a significant change or action that has occurred within a system and is comparable to the data atoms described in the next chapter.

These systems are typically designed to consistently handle real-time data flows by reacting to events as they happen. However, processing at the enterprise level typically requires many more different ways to transform and manage data. Especially legacy applications may use completely different processing styles and can’t directly participate in event-based processing.

But as we’ve seen, applications can locally act in very different styles as long as they stay idempotent at the macro level.

If we adhere to the principles described, we can integrate applications of any kind at the macro (enterprise) level and prevent the loss of business context. We do not have to force applications to only process data atoms (events) in near real-time. Applications can manage their internal data (data on the inside) as needed and completely independent of data to be exchanged (data on the outside) with other applications.

Applications can stay idempotent at the macro-level by managing internal data (data on the inside) completely different to shared data – Image by author

Create source data in atomic form and keep it immutable

Now, if we are able to seamlessly track and manage the lineage of data through all applications in the enterprise, we need to have a special look at source data.

Source data is special because it’s original information that, apart from the initial encoding into data, has not yet been further transformed by application logic.

This is new information that cannot be obtained by applying logic to existing data. Rather, it must first be created, measured, observed or otherwise recorded by the company and encoded to data.

If we save the original information created by source applications in immutable and atomic form, we enable the data to be stored in the most compact and lossless way to be subsequently usable in the most flexible way.

Immutability

Immutability forces the versioning of any source data updates instead of directly overwriting the data. This enables us to fully preserve all data that has ever been used for application transformation logic.

Data immutability refers to the concept that data, once created, cannot be altered later.

Does that mean that we can’t change anything, once created?

No, this would be completely impractical. Instead of modifying existing data structures, new ones are created that can easily be versioned.

But isn’t most of the information in the enterprise derived from original information, instead of being created as new?

Yes, and as discussed, this derivation can best be tracked and managed by the chain of application versions applied to immutable source data.
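A minimal Python sketch of this idea, with an illustrative customer record and an append-only list standing in for a real versioned store:

```python
# Immutable, versioned records: updates append a new version instead of
# overwriting the existing structure.
from dataclasses import dataclass, replace
from datetime import datetime, timezone

@dataclass(frozen=True)                 # instances cannot be mutated
class CustomerRecord:
    customer_id: int
    email: str
    version: int
    recorded_at: str

log: list[CustomerRecord] = []          # append-only store

def record_update(previous: CustomerRecord, **changes) -> CustomerRecord:
    new = replace(previous, **changes,
                  version=previous.version + 1,
                  recorded_at=datetime.now(timezone.utc).isoformat())
    log.append(new)
    return new

v1 = CustomerRecord(1, "old@example.com", 1, datetime.now(timezone.utc).isoformat())
log.append(v1)
v2 = record_update(v1, email="new@example.com")
print([(r.version, r.email) for r in log])   # the full history is preserved
```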

Besides this core benefit of immutable data, it offers other benefits as well:

Predictability

Since data doesn’t change, applications that operate on immutable data are easier to understand and their effects can be better predicted.

Concurrency

Immutable data structures are inherently thread-safe. This enables concurrent processing without the complexities of managing shared mutable state.

Debugging

With immutable data, the state of the system at any point in time is fixed. This greatly simplifies the debugging process.

But let me assure you once again: We do not have to convert all our databases and data storage to immutability. It’s perfectly fine for an application to use a conventional relational database system to manage its local state, for example.

It’s the publicly shared original information at the macro (enterprise) level which needs to stay immutable.

Atomicity

Storing data in atomic form is an optimal model for source data because it captures every detail of what has happened or become known to the organization over time.

As described in my article on taking a fresh view on data modeling, any other data model can be derived from atomic data by applying appropriate transformation logic. In concept descriptions of the Data Mesh, data as a product is often classified as source-aligned and consumer-aligned. This is an overly coarse classification of the many possible intermediate data models that can be derived from source data in atomic form.

Because source data can’t be re-created with saved logic, it’s really important to durably save that data. So better set up a proper backup process for it.

If we decide to persist (or cache) specific derived data, we are able to use this as a specialized physical data model to optimize further logic based on that data. Any derived data model can in this setup be treated as a long-term cache for the logic applied.

Check my article on Modern Enterprise Data Modeling for more details on how to encode complex information as time-ordered data atoms and organize data governance at enterprise level. The minimal schema applied to encode the information enables extremely flexible use, comparable to completely unstructured data. However, it allows the data to be processed much more efficient compared to its unstructured variant.

This maximum flexibility is especially important for source data, where we do not yet know how it will be further transformed and used in the enterprise.

By adhering to the principles described, we can integrate applications of any kind at the macro or enterprise level and break the loss of business context.


If both engineering disciplines agree upon a common system that acts at the intersection of data and logic, we could better maintain business meaning throughout the enterprise.

  • Application engineering provides source data in atomic form when consumers and their individual requirements are not yet known.
  • Data and application engineering agree on the management of data and logic relationship by a common system.
  • Data engineering doesn’t implement any business logic, but leaves this to application engineers.
  • Data engineering abstracts away the low-level differences between data streaming and batch processing as well as eventual and immediate consistency for data.

This modern way of managing data in the enterprise is the backbone of what I call universal data supply.

Towards Universal Data Supply

