DuckDB and AWS – How to Aggregate 100 Million Rows in 1 Minute

Process huge volumes of data with Python and DuckDB – An AWS S3 example.

Photo by Growtika on Unsplash

When companies need a secure, performant, and scalable storage solution, they tend to gravitate toward the cloud. One of the most popular platforms in the game is AWS S3 – and for good reason – it’s an industry-leading object storage service that can double as a data lake.

The question is – Can you aggregate S3 bucket data without downloading it? And can you do it fast?

The answer is yes to both questions. DuckDB lets you connect to your S3 bucket directly via its httpfs extension. You’ll learn how to use it today by aggregating around 111 million rows split across 37 Parquet files.

Spoiler alert: It will take you around a minute.

Note: I wrote this post because I was searching for a more performant Pandas alternative. My goal was to perform analysis on large datasets locally instead of opting for cloud solutions. I have no affiliations with DuckDB or AWS.


AWS S3 Setup

First things first, you’ll need an AWS account and an S3 bucket. You’ll also want to create an IAM user for which you can generate an access key.

As for the data, I’ve downloaded the Yellow Taxi trip record Parquet files covering January 2021 through January 2024 from the following link:

This is what it looks like when loaded into an S3 bucket:

Image 1 – Parquet files in an S3 bucket (image by author)

The bucket now contains 37 Parquet files that take up 1.79 GB of space and hold over 111 million rows.
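If you prefer to script the upload instead of dragging files through the AWS console, a small boto3 snippet gets the job done. The sketch below is only illustrative – the bucket name and local folder are placeholders, not values from this article:

import glob
import os

import boto3

# Placeholder names - replace with your own bucket and local data folder
BUCKET_NAME = "<your-bucket-name>"
LOCAL_DIR = "data/yellow_taxi"

s3 = boto3.client("s3")

# Upload every local Parquet file to the root of the bucket
for path in glob.glob(os.path.join(LOCAL_DIR, "*.parquet")):
    key = os.path.basename(path)
    s3.upload_file(path, BUCKET_NAME, key)
    print(f"Uploaded {key}")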

DuckDB AWS S3 Setup

Setup on the Python end requires the duckdb library and the httpfs extension for DuckDB. Assuming you have the library installed (a simple pip install), import it and create a new connection:

import duckdb

# In-memory DuckDB connection - no database file is persisted to disk
conn = duckdb.connect()

DuckDB httpfs Extension

The httpfs extension, among other things, allows you to read files from and write files to remote locations, including AWS S3 buckets.

Install it and load it with the following Python command (run the installation only once):

conn.execute("""
    INSTALL httpfs;
    LOAD httpfs;
""").df()

You should see a success message like this:

Image 2 – Installing and loading httpfs extension (image by author)
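If you want to double-check that the extension is actually available, DuckDB exposes a duckdb_extensions() table function you can query. A quick sketch (column names reflect recent DuckDB versions):

# Verify that httpfs is installed and loaded
conn.execute("""
    SELECT extension_name, installed, loaded
    FROM duckdb_extensions()
    WHERE extension_name = 'httpfs';
""").df()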

DuckDB S3 Configuration

As for the S3 configuration, provide the region, access key, and secret access key to DuckDB:

conn.execute("""
    SET s3_region = '<your-region>';
    SET s3_access_key_id = '<your-access-key>';
    SET s3_secret_access_key = '<your-secret-key>';
""").df()

You should see a success message once again:

Image 3 – DuckDB S3 configuration (image by author)

And that’s it! You can now query S3 data directly from DuckDB.
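As a side note, newer DuckDB releases (0.10 and later) also ship a secrets manager, which scopes the credentials to a named secret instead of global settings. A minimal sketch with the same placeholder values:

# Alternative for DuckDB 0.10+: register the credentials as an S3 secret
conn.execute("""
    CREATE SECRET aws_s3 (
        TYPE S3,
        KEY_ID '<your-access-key>',
        SECRET '<your-secret-key>',
        REGION '<your-region>'
    );
""").df()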

Python and DuckDB – How to Get Data From AWS

This section shows how long it takes to run two queries – a simple count and an aggregation – against the 37 Parquet files stored on S3.

Query #1 – Simple Count

To read Parquet data from an S3 bucket, use the parquet_scan() function and provide a glob pattern that matches all Parquet files in the bucket’s root. Just remember to change the bucket name:

res_count = conn.execute("""
    select count(*)
    from parquet_scan('s3://<your-bucket-name>/*.parquet');
""").df()

res_count

Getting a count of over 111 million takes only 7 seconds:

Image 4 – DuckDB count results (image by author)
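The 7-second figure comes from my own run; if you want to measure it on your end, a quick perf_counter wrapper around the query is enough (the bucket name is a placeholder again):

import time

start = time.perf_counter()
res_count = conn.execute("""
    select count(*)
    from parquet_scan('s3://<your-bucket-name>/*.parquet');
""").df()
elapsed = time.perf_counter() - start

print(f"Query finished in {elapsed:.2f} seconds")
print(res_count)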

Query #2 – Monthly Summary Statistics

And now let’s calculate summary statistics over all Parquet files. The goal is to get counts, sums, and averages for certain columns grouped on a monthly level:

res_agg = conn.execute("""
    select 
        period,
        count(*) as num_rides,
        round(avg(trip_duration), 2) as avg_trip_duration,
        round(avg(trip_distance), 2) as avg_trip_distance,
        round(sum(trip_distance), 2) as total_trip_distance,
        round(avg(total_amount), 2) as avg_trip_price,
        round(sum(total_amount), 2) as total_trip_price,
        round(avg(tip_amount), 2) as avg_tip_amount
    from (
        select
            date_part('year', tpep_pickup_datetime) as trip_year,
            strftime(tpep_pickup_datetime, '%Y-%m') as period,
            epoch(tpep_dropoff_datetime - tpep_pickup_datetime) as trip_duration,
            trip_distance,
            total_amount,
            tip_amount
        from parquet_scan('s3://duckdb-bucket-20240422/*.parquet')
        where trip_year >= 2021 and trip_year <= 2024
    )
    group by period
    order by period
""").df()

res_agg

It takes just over a minute to perform this aggregation:

Image 5 – DuckDB aggregation results (image by author)

As a frame of reference, the same aggregation took under 2 seconds when the files were stored on a local disk, while Pandas needed over 11 minutes for the same task.
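For context, the Pandas baseline looks roughly like the sketch below – reading the local Parquet files one by one, deriving the same columns, and grouping by month. Treat it as an approximation of the comparison rather than the exact benchmark code:

import glob

import pandas as pd

frames = []
for path in glob.glob("data/yellow_taxi/*.parquet"):  # local copies of the files
    df = pd.read_parquet(path, columns=[
        "tpep_pickup_datetime", "tpep_dropoff_datetime",
        "trip_distance", "total_amount", "tip_amount",
    ])
    # Same derived columns as in the DuckDB query
    df["period"] = df["tpep_pickup_datetime"].dt.strftime("%Y-%m")
    df["trip_duration"] = (
        df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    ).dt.total_seconds()
    frames.append(df)

data = pd.concat(frames, ignore_index=True)
monthly = data.groupby("period").agg(
    num_rides=("trip_distance", "size"),
    avg_trip_duration=("trip_duration", "mean"),
    avg_trip_distance=("trip_distance", "mean"),
    total_trip_distance=("trip_distance", "sum"),
    avg_trip_price=("total_amount", "mean"),
    total_trip_price=("total_amount", "sum"),
    avg_tip_amount=("tip_amount", "mean"),
).round(2)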


Summing up DuckDB and AWS in Python

Overall, if you have huge volumes of data stored on S3, DuckDB is your friend. You can analyze and aggregate data in no time without file downloads – which might not be allowed at all due to privacy and security concerns.

You can also use DuckDB as an analysis and aggregation layer between two AWS S3 storage layers. Raw data goes in, aggregated data comes out – ideally into another bucket or subdirectory.
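That pattern maps directly onto DuckDB’s COPY ... TO statement, which can write query results straight back to S3 over the same httpfs connection. A minimal sketch with placeholder bucket names:

# Aggregate raw Parquet files from one bucket and write the result to another
conn.execute("""
    COPY (
        select
            strftime(tpep_pickup_datetime, '%Y-%m') as period,
            count(*) as num_rides,
            round(sum(total_amount), 2) as total_trip_price
        from parquet_scan('s3://<raw-data-bucket>/*.parquet')
        group by period
    ) TO 's3://<aggregated-data-bucket>/monthly_summary.parquet' (FORMAT PARQUET);
""").df()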
