When companies need a secure, performant, and scalable storage solution, they tend to gravitate toward the cloud. One of the most popular platforms in the game is AWS S3, and for good reason: it's an industry-leading object storage service that can also serve as a data lake.
The question is: can you aggregate S3 bucket data without downloading it? And can you do it fast?
The answer to both is yes. DuckDB lets you connect to your S3 bucket directly via the httpfs extension. You'll learn how to use it today by aggregating around 111 million rows split across 37 Parquet files.
Spoiler alert: it will take you around a minute.
Note: I wrote this post because I was searching for a more performant Pandas alternative. My goal was to perform analysis on large datasets locally instead of opting for cloud solutions. I have no affiliations with DuckDB or AWS.
AWS S3 Setup
First things first: you'll need an AWS account and an S3 bucket. You'll also want to create an IAM user and generate an access key (key ID and secret) for it.
As for the data, I’ve downloaded Yellow Taxi data Parquet files from January 2021 to January 2024 from the following link:
- New York City Taxi and Limousine Commission (TLC) Trip Record Data, accessed on April 18, 2024, from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page. The data is free to use; license info can be found at https://www.nyc.gov/home/terms-of-use.page.
This is what it looks like when loaded into an S3 bucket:

The bucket now contains 37 Parquet files that take up 1.79 GB of space and hold over 111 million rows.
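If you'd rather script the upload than drag files into the AWS console, a minimal boto3 sketch like the one below could do it. This assumes boto3 is installed and your AWS credentials are configured locally; the bucket name and local folder are placeholders:
import os
import boto3

s3 = boto3.client("s3")
local_dir = "yellow_taxi_data"   # placeholder - folder with the downloaded Parquet files
bucket = "<your-bucket-name>"    # placeholder - your S3 bucket

for file_name in os.listdir(local_dir):
    if file_name.endswith(".parquet"):
        # Upload each Parquet file to the bucket root
        s3.upload_file(os.path.join(local_dir, file_name), bucket, file_name)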
DuckDB AWS S3 Setup
Setup on the Python end requires the duckdb library and the httpfs extension for DuckDB. Assuming you have the library installed (a simple pip install), import it and create a new connection:
import duckdb
conn = duckdb.connect()
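Calling duckdb.connect() with no arguments gives you an in-memory database, which is all this example needs. If you want tables to persist between sessions, you could pass a file path instead (the file name here is just a placeholder):
# Alternative: a persistent database stored in a local file
conn = duckdb.connect("taxi_analysis.duckdb")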
DuckDB httpfs Extension
The httpfs extension, among other things, allows you to read and write files directly from and to a given AWS S3 bucket.
Install it and load it with the following Python command (run the installation only once):
conn.execute("""
INSTALL httpfs;
LOAD httpfs;
""").df()
You should see a success message like this:

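If you want to verify that the extension is actually loaded, you can query DuckDB's built-in duckdb_extensions() table function; this step is optional:
# Sanity check - httpfs should show loaded = True
conn.execute("""
SELECT extension_name, loaded, installed
FROM duckdb_extensions()
WHERE extension_name = 'httpfs';
""").df()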
DuckDB S3 Configuration
As for the S3 configuration, provide the region, access key, and secret access key to DuckDB:
conn.execute("""
SET s3_region = '<your-region>';
SET s3_access_key_id = '<your-access-key>';
SET s3_secret_access_key = '<your-secret-key>';
""").df()
You should see a success message once again:

And that’s it! You can now query S3 data directly from DuckDB.
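If you'd rather not hardcode credentials in a notebook, one option is to read them from environment variables and interpolate them into the same SET statements. A minimal sketch, assuming the standard AWS_* variables are exported in your shell:
import os

# Read credentials from environment variables instead of hardcoding them
s3_region = os.environ["AWS_DEFAULT_REGION"]
s3_key = os.environ["AWS_ACCESS_KEY_ID"]
s3_secret = os.environ["AWS_SECRET_ACCESS_KEY"]

conn.execute(f"""
SET s3_region = '{s3_region}';
SET s3_access_key_id = '{s3_key}';
SET s3_secret_access_key = '{s3_secret}';
""").df()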
Python and DuckDB – How to Get Data From AWS
This section will show how long it takes to run two queries – simple count and aggregation – from 37 Parquet files stored on S3.
Query #1 – Simple Count
To read Parquet data from an S3 bucket, use the parquet_scan() function and provide a glob path matching all Parquet files stored in the root path. Just remember to change the bucket name:
res_count = conn.execute("""
select count(*)
from parquet_scan('s3://<your-bucket-name>/*.parquet');
""").df()
res_count
Getting a count of over 111 million rows takes only 7 seconds:

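If all you need is the number itself rather than a one-row DataFrame, you can fetch the result as a plain Python tuple instead; a small variation on the same query (bucket name is again a placeholder):
# fetchone() returns a one-element tuple with the row count
row_count = conn.execute("""
select count(*)
from parquet_scan('s3://<your-bucket-name>/*.parquet');
""").fetchone()[0]
print(row_count)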
Query #2 – Monthly Summary Statistics
And now let's calculate summary statistics over all Parquet files. The goal is to get counts, sums, and averages for certain columns, grouped at a monthly level:
res_agg = conn.execute("""
select
period,
count(*) as num_rides,
round(avg(trip_duration), 2) as avg_trip_duration,
round(avg(trip_distance), 2) as avg_trip_distance,
round(sum(trip_distance), 2) as total_trip_distance,
round(avg(total_amount), 2) as avg_trip_price,
round(sum(total_amount), 2) as total_trip_price,
round(avg(tip_amount), 2) as avg_tip_amount
from (
select
date_part('year', tpep_pickup_datetime) as trip_year,
strftime(tpep_pickup_datetime, '%Y-%m') as period,
epoch(tpep_dropoff_datetime - tpep_pickup_datetime) as trip_duration,
trip_distance,
total_amount,
tip_amount
from parquet_scan('s3://<your-bucket-name>/*.parquet')
where trip_year >= 2021 and trip_year <= 2024
)
group by period
order by period
""").df()
res_agg
It takes just over a minute to perform this aggregation:

As a frame of reference, the same operation took under 2 seconds when files were saved on a local disk. Pandas took over 11 minutes for the same task.
Summing up DuckDB and AWS in Python
Overall, if you have huge volumes of data stored on S3, DuckDB is your friend. You can analyze and aggregate data in no time without downloading files, which might not even be allowed due to privacy and security concerns.
You can also use DuckDB as an analysis and aggregation layer between two AWS S3 storage layers. Raw data goes in, and aggregated data goes out, ideally into another bucket or subdirectory.
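As a rough sketch of that pattern: DuckDB's COPY statement can write query results straight back to S3 through the same httpfs extension, and the Python client can query an in-scope pandas DataFrame (like res_agg above) by name. The destination path is a placeholder, and this assumes the credentials you configured earlier also allow writes:
# Write the monthly summary back to S3 as a single Parquet file
conn.execute("""
COPY (select * from res_agg)
TO 's3://<your-bucket-name>/aggregated/monthly_summary.parquet' (FORMAT PARQUET);
""")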