Battle of the Ducks

DuckDB vs Fireducks: the ultimate throwdown

Image by AI (Dalle-3)

As some of you may know, I’m a big fan of the DuckDB Python library, and I’ve written many articles on it. I was also one of the first to write an article about an even newer Python library called Fireducks and helped bring that to people’s attention.

If you’ve never heard of these useful libraries, check out the links below for an introduction to them.

DuckDB

New Pandas rival, FireDucks, brings the smoke!

Both libraries are taking on a growing share of data science workloads, where it could be argued that data manipulation and general wrangling are at least as important as the analysis and insight that the machine learning side of things brings.

The core foundations of both tools are very different; DuckDB is a modern, embedded analytics database designed for efficient processing and querying of gigabytes of data from various sources. Fireducks is designed to be a much faster replacement for Pandas.
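To make that difference concrete, here is a minimal sketch (not one of the benchmarks below) of how the same "total sales by category" aggregation might be expressed in each library. The file path and column names are assumptions based on the data set generated later in this article.

# Illustrative only: the same aggregation in each library's native style.
# Assumes a CSV with 'categories' and 'total' columns, like the one
# generated further down; the path is a placeholder.
import duckdb
import fireducks.pandas as pd

# DuckDB: SQL over a file (or an in-memory table)
sales_by_cat_sql = duckdb.sql("""
    SELECT categories, SUM(total) AS total_sales
    FROM read_csv('sales_data.csv', header=true)
    GROUP BY categories
    ORDER BY total_sales DESC
""").df()

# Fireducks: the familiar Pandas API, accelerated under the hood
df = pd.read_csv('sales_data.csv')
sales_by_cat_df = (
    df.groupby('categories')['total']
      .sum()
      .sort_values(ascending=False)
)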

Their key commonality, however, is that they are both highly performant for general mid-sized data processing tasks. If that's your use case, which one should you choose? That's what we'll find out today.

Here are the tests I’ll perform.

  • read a large CSV file into memory, i.e. a DuckDB table and a Fireducks dataframe
  • perform some typical data processing tasks against both sets of in-memory data
  • create a new column in the in-memory data sets based on existing table/data frame column data.
  • write out the updated in-memory data sets as CSV and Parquet

Input data set

I created a CSV file with fake sales data containing 100 million records.

The schema of the input data is this,

  • order_id (int)
  • order_date (date)
  • customer_id (int)
  • customer_name (str)
  • product_id (int)
  • product_names (str)
  • categories (str)
  • quantity (int)
  • price (float)
  • total (float)

Here is a Python program you can use to create the CSV file. On my system, this resulted in a file of approximately 7.5GB.

# generate the 100m CSV file
#
import polars as pl
import numpy as np
from datetime import datetime, timedelta

def generate(nrows: int, filename: str):
    names = np.asarray(
        [
            "Laptop",
            "Smartphone",
            "Desk",
            "Chair",
            "Monitor",
            "Printer",
            "Paper",
            "Pen",
            "Notebook",
            "Coffee Maker",
            "Cabinet",
            "Plastic Cups",
        ]
    )

    categories = np.asarray(
        [
            "Electronics",
            "Electronics",
            "Office",
            "Office",
            "Electronics",
            "Electronics",
            "Stationery",
            "Stationery",
            "Stationery",
            "Electronics",
            "Office",
            "Sundry",
        ]
    )

    product_id = np.random.randint(len(names), size=nrows)
    quantity = np.random.randint(1, 11, size=nrows)
    price = np.random.randint(199, 10000, size=nrows) / 100

    # Generate random dates between 2010-01-01 and 2023-12-31
    start_date = datetime(2010, 1, 1)
    end_date = datetime(2023, 12, 31)
    date_range = (end_date - start_date).days

    # Create random dates as np.array and convert to string format
    order_dates = np.array([(start_date + timedelta(days=np.random.randint(0, date_range))).strftime('%Y-%m-%d') for _ in range(nrows)])

    # Define columns
    columns = {
        "order_id": np.arange(nrows),
        "order_date": order_dates,
        "customer_id": np.random.randint(100, 1000, size=nrows),
        "customer_name": [f"Customer_{i}" for i in np.random.randint(2**15, size=nrows)],
        "product_id": product_id + 200,
        "product_names": names[product_id],
        "categories": categories[product_id],
        "quantity": quantity,
        "price": price,
        "total": price * quantity,
    }

    # Create Polars DataFrame and write to CSV with explicit delimiter
    df = pl.DataFrame(columns)
    df.write_csv(filename, separator=',',include_header=True)  # Ensure comma is used as the delimiter

# Generate data with random order_date and save to CSV
generate(100_000_000, "/mnt/d/sales_data/sales_data_100m.csv")

Installing WSL2 Ubuntu

Fireducks only runs under Linux, so as I usually run Windows, I’ll be using WSL2 Ubuntu for my Linux environment, but the same code should work on any Linux/Unix setup. I have a full guide on installing WSL2 here.

Setting up a dev environment

OK, we should set up a separate development environment before starting our coding examples. That way, what we do won't interfere with other versions of libraries, programs and so on that we might have on the go for other projects.

I use Miniconda for this, but you can use whatever method suits you best.

If you want to go down the Miniconda route and don’t already have it, you must install Miniconda first. Get it using this link,

Miniconda – Anaconda documentation

Once the environment is created, switch to it using the activate command, and then install Jupyter and any required Python libraries.

#create our test environment
(base) $ conda create -n duck_battle python=3.11 -y
# Now activate it
(base) $ conda activate duck_battle
# Install python libraries, etc ...
(duck_battle) $ pip install jupyter fireducks duckdb

Test 1 – Read a large CSV file and display the last 10 records

DuckDB

import duckdb

print(duckdb.__version__)

'1.1.3'
# DuckDB read CSV file 
#
import duckdb
import time

# Start the timer
start_time = time.time()

# Create a connection to an in-memory DuckDB database
con = duckdb.connect(':memory:')

# Create a table from the CSV file
con.execute(f"CREATE TABLE sales AS SELECT * FROM read_csv('/mnt/d/sales_data/sales_data_100m.csv',header=true)")

# Fetch the last 10 rows
query = "SELECT * FROM sales ORDER BY rowid DESC LIMIT 10"
df = con.execute(query).df()

# Display the last 10 rows
print("nLast 10 rows of the file:")
print(df)

# End the timer and calculate the total elapsed time
total_elapsed_time = time.time() - start_time

print(f"DuckDB: Time taken to read the CSV file and display the last 10 records: {total_elapsed_time} seconds")

#
# DuckDB output
#

Last 10 rows of the file:
   order_id order_date  customer_id   customer_name  product_id product_names  
0  99999999 2023-06-16          102   Customer_9650         203         Chair   
1  99999998 2022-03-02          709  Customer_23966         208      Notebook   
2  99999997 2019-05-10          673  Customer_25709         202          Desk   
3  99999996 2011-10-21          593  Customer_29352         200        Laptop   
4  99999995 2011-10-24          501  Customer_29289         202          Desk   
5  99999994 2023-09-27          119  Customer_15532         209  Coffee Maker   
6  99999993 2015-01-15          294  Customer_27081         200        Laptop   
7  99999992 2016-04-07          379   Customer_1353         207           Pen   
8  99999991 2010-09-19          253  Customer_29439         204       Monitor   
9  99999990 2016-05-19          174  Customer_11294         210       Cabinet   

    categories  quantity  price   total  
0       Office         4  59.58  238.32  
1   Stationery         1  78.91   78.91  
2       Office         5   9.12   45.60  
3  Electronics         3  67.42  202.26  
4       Office         7  53.78  376.46  
5  Electronics         2  55.10  110.20  
6  Electronics         9  86.01  774.09  
7   Stationery         5  21.56  107.80  
8  Electronics         4   5.17   20.68  
9       Office         9  65.10  585.90  

DuckDB: Time taken to read the CSV file and display the last 10 records: 59.23184013366699 seconds

Fireducks

import fireducks
import fireducks.pandas as pd

print(fireducks.__version__)
print(pd.__version__)

1.1.6
2.2.3
# Fireducks read CSV
#
import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

# Path to the CSV file
file_path = "/mnt/d/sales_data/sales_data_100m.csv"

# Read the CSV file into a DataFrame
df_fire = pd.read_csv(file_path)

# Display the last 10 rows of the DataFrame
print(df_fire.tail(10))

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to read the CSV file and display the last 10 records: {elapsed_time} seconds")         

#
# Fireducks output
#

          order_id  order_date  customer_id   customer_name  product_id  
99999990  99999990  2016-05-19          174  Customer_11294         210   
99999991  99999991  2010-09-19          253  Customer_29439         204   
99999992  99999992  2016-04-07          379   Customer_1353         207   
99999993  99999993  2015-01-15          294  Customer_27081         200   
99999994  99999994  2023-09-27          119  Customer_15532         209   
99999995  99999995  2011-10-24          501  Customer_29289         202   
99999996  99999996  2011-10-21          593  Customer_29352         200   
99999997  99999997  2019-05-10          673  Customer_25709         202   
99999998  99999998  2022-03-02          709  Customer_23966         208   
99999999  99999999  2023-06-16          102   Customer_9650         203   

         product_names   categories  quantity  price   total  
99999990       Cabinet       Office         9  65.10  585.90  
99999991       Monitor  Electronics         4   5.17   20.68  
99999992           Pen   Stationery         5  21.56  107.80  
99999993        Laptop  Electronics         9  86.01  774.09  
99999994  Coffee Maker  Electronics         2  55.10  110.20  
99999995          Desk       Office         7  53.78  376.46  
99999996        Laptop  Electronics         3  67.42  202.26  
99999997          Desk       Office         5   9.12   45.60  
99999998      Notebook   Stationery         1  78.91   78.91  
99999999         Chair       Office         4  59.58  238.32 

Fireducks: Time taken to read the CSV file and display the last 10 records: 65.69259881973267 seconds

There is not much in it; DuckDB edges it by about 6 seconds.

Test 2— Calculate total sales by category

DuckDB

# duckdb process data
#
import duckdb
import time

# Start total runtime timer
query_sql="""
SELECT 
    categories, 
    SUM(total) AS total_sales
FROM sales
GROUP BY categories
ORDER BY total_sales DESC
"""
start_time = time.time()

# 1. Total sales by category
start = time.time()
results = con.execute(query_sql).df()

print(f"DuckDB: Time for sales by category calculation: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

DuckDB: Time for sales by category calculation: 0.1401681900024414 seconds

  categories  total_sales
0 Electronics 1.168493e+10
1 Stationery  7.014109e+09
2 Office      7.006807e+09
3 Sundry      2.338428e+09

Fireducks

import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

total_sales_by_category = df_fire.groupby('categories')['total'].sum().sort_values(ascending=False)
print(total_sales_by_category)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to calculate sales by category: {elapsed_time} seconds")

#
# Fireducks output
#

categories
Electronics    1.168493e+10
Stationery     7.014109e+09
Office         7.006807e+09
Sundry         2.338428e+09
Name: total, dtype: float64

Fireducks: Time taken to calculate sales by category:  0.13571524620056152 seconds

There is not much in it there, either. Fireducks shades it.

Test 3— Top 5 customer spend

DuckDB

# duckdb process data
#
import duckdb
import time

# Start total runtime timer
query_sql="""
SELECT 
    customer_id, 
    customer_name, 
    SUM(total) AS total_purchase
FROM sales
GROUP BY customer_id, customer_name
ORDER BY total_purchase DESC
LIMIT 5
"""
start_time = time.time()

# Top 5 customers by total spend
start = time.time()
results = con.execute(query_sql).df()

print(f"DuckdDB: Time to calculate top 5 customers: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

DuckDB: Time to calculate top 5 customers: 1.4588654041290283 seconds

  customer_id customer_name  total_purchase
0 681         Customer_20387 6892.96
1 740         Customer_30499 6613.11
2 389         Customer_22686 6597.35
3 316         Customer_185   6565.38
4 529         Customer_1609  6494.35

Fireducks

import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

top_5_customers = df_fire.groupby(['customer_id', 'customer_name'])['total'].sum().sort_values(ascending=False).head(5)
print(top_5_customers)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to calculate top 5 customers: {elapsed_time} seconds")

#
# Fireducks output
#

customer_id  customer_name 
681          Customer_20387    6892.96
740          Customer_30499    6613.11
389          Customer_22686    6597.35
316          Customer_1859     6565.38
529          Customer_1609     6494.35
Name: total, dtype: float64
Fireducks: Time taken to calculate top 5 customers: 2.823930263519287 seconds

DuckDB wins that one, being almost twice as fast as Fireducks.

Test 4— Monthly sales figures

DuckDB

import duckdb
import time

# Start total runtime timer
query_sql="""
SELECT 
    DATE_TRUNC('month', order_date) AS month,
    SUM(total) AS monthly_sales
FROM sales
GROUP BY month
ORDER BY month
"""
start_time = time.time()

# Monthly sales figures
start = time.time()
results = con.execute(query_sql).df()

print(f"DuckDB: Time for seasonal trend calculation: {time.time() - start_time} seconds")

results

# 
# DuckDB output
#

DuckDB: Time for seasonal trend calculation: 0.16109275817871094 seconds

  month        monthly_sales
0 2010-01-01   1.699500e+08
1 2010-02-01   1.535730e+08
2 2010-03-01   1.702968e+08
3 2010-04-01   1.646421e+08
4 2010-05-01   1.704506e+08
... ... ...
163 2023-08-01 1.699263e+08
164 2023-09-01 1.646018e+08
165 2023-10-01 1.692184e+08
166 2023-11-01 1.644883e+08
167 2023-12-01 1.643962e+08

168 rows × 2 columns

Fireducks

import fireducks.pandas as pd
import time

def seasonal_trend():
    # Ensure 'order_date' is datetime
    df_fire['order_date'] = pd.to_datetime(df_fire['order_date'])

    # Extract 'month' as string
    df_fire['month'] = df_fire['order_date'].dt.strftime('%Y-%m')

    # Group by 'month' and sum 'total'
    results = (
        df_fire.groupby('month')['total']
        .sum()
        .reset_index()
        .sort_values('month')
    )
    print(results)

start_time = time.time()
seasonal_trend()
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time

print(f"Fireducks: Time for seasonal trend calculation: {time.time() - start_time} seconds")

#
# Fireducks Output
#

       month         total
0    2010-01  1.699500e+08
1    2010-02  1.535730e+08
2    2010-03  1.702968e+08
3    2010-04  1.646421e+08
4    2010-05  1.704506e+08
..       ...           ...
163  2023-08  1.699263e+08
164  2023-09  1.646018e+08
165  2023-10  1.692184e+08
166  2023-11  1.644883e+08
167  2023-12  1.643962e+08

[168 rows x 2 columns]
Fireducks: Time for seasonal trend calculation: 3.109074354171753 seconds

DuckDB was significantly quicker in this example.

Test 5— Average order by product

DuckDB

import duckdb
import time

# Start total runtime timer
query_sql="""
SELECT 
    product_id,
    product_names,
    AVG(total) AS avg_order_value
FROM sales
GROUP BY product_id, product_names
ORDER BY avg_order_value DESC
"""
start_time = time.time()

# Average order value by product
start = time.time()
results = con.execute(query_sql).df()

print(f"DuckDB: Time for average order by product calculation: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

DuckDB: Time for average order by product calculation: 0.13720130920410156 seconds

  product_id product_names avg_order_value
0 206        Paper         280.529144
1 208        Notebook      280.497268
2 201        Smartphone    280.494779
3 207        Pen           280.491508
4 205        Printer       280.470150
5 200        Laptop        280.456913
6 209        Coffee Maker  280.445365
7 211        Plastic Cups  280.440161
8 210        Cabinet       280.426960
9 202        Desk          280.367135
10 203       Chair         280.364045
11 204       Monitor       280.329706

Fireducks

import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

avg_order_value = df_fire.groupby(['product_id', 'product_names'])['total'].mean().sort_values(ascending=False)
print(avg_order_value)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time

print(f"Fireducks: Time for average order calculation: {time.time() - start_time} seconds")

#
# Fireducks output
#

product_id  product_names
206         Paper            280.529144
208         Notebook         280.497268
201         Smartphone       280.494779
207         Pen              280.491508
205         Printer          280.470150
200         Laptop           280.456913
209         Coffee Maker     280.445365
211         Plastic Cups     280.440161
210         Cabinet          280.426960
202         Desk             280.367135
203         Chair            280.364045
204         Monitor          280.329706
Name: total, dtype: float64
Fireducks: Time for average order calculation: 0.06766319274902344 seconds

Fireducks gets one back there and was twice as fast as DuckDB.

Test 6— Product performance analysis

DuckDB

import duckdb
import time

# Start total runtime timer
query_sql="""
WITH yearly_sales AS (
    SELECT 
        EXTRACT(YEAR FROM order_date) AS year,
        SUM(total) AS total_sales
    FROM sales
    GROUP BY year
)
SELECT 
    year,
    total_sales,
    LAG(total_sales) OVER (ORDER BY year) AS prev_year_sales,
    (total_sales - LAG(total_sales) OVER (ORDER BY year)) / LAG(total_sales) OVER (ORDER BY year) * 100 AS yoy_growth
FROM yearly_sales
ORDER BY year
"""
start_time = time.time()

# Year-on-year sales growth
start = time.time()
results = con.execute(query_sql).df()

print(f"DuckDB: Time for product performance analysis calculation: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

DuckDB: Time for product performance analysis calculation: 0.03958845138549805 seconds

   year total_sales prev_year_sales yoy_growth
0  2010 2.002066e+09 NaN            NaN
1  2011 2.002441e+09 2.002066e+09   0.018739
2  2012 2.008966e+09 2.002441e+09   0.325848
3  2013 2.002901e+09 2.008966e+09  -0.301900
4  2014 2.000773e+09 2.002901e+09  -0.106225
5  2015 2.001931e+09 2.000773e+09   0.057855
6  2016 2.008762e+09 2.001931e+09   0.341229
7  2017 2.002164e+09 2.008762e+09  -0.328457
8  2018 2.002383e+09 2.002164e+09   0.010927
9  2019 2.002891e+09 2.002383e+09   0.025383
10 2020 2.008585e+09 2.002891e+09   0.284318
11 2021 2.000244e+09 2.008585e+09  -0.415281
12 2022 2.004500e+09 2.000244e+09   0.212756
13 2023 1.995672e+09 2.004500e+09  -0.440401

Fireducks

import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

df_fire['year'] = pd.to_datetime(df_fire['order_date']).dt.year
yearly_sales = df_fire.groupby('year')['total'].sum().sort_index()
yoy_growth = yearly_sales.pct_change() * 100

result = pd.DataFrame({
    'year': yearly_sales.index,
    'total_sales': yearly_sales.values,
    'prev_year_sales': yearly_sales.shift().values,
    'yoy_growth': yoy_growth.values
})

print(result)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time for product performance analysis  calculation: {time.time() - start_time} seconds")

#
# Fireducks output
#

    year   total_sales  prev_year_sales  yoy_growth
0   2010  2.002066e+09              NaN         NaN
1   2011  2.002441e+09     2.002066e+09    0.018739
2   2012  2.008966e+09     2.002441e+09    0.325848
3   2013  2.002901e+09     2.008966e+09   -0.301900
4   2014  2.000773e+09     2.002901e+09   -0.106225
5   2015  2.001931e+09     2.000773e+09    0.057855
6   2016  2.008762e+09     2.001931e+09    0.341229
7   2017  2.002164e+09     2.008762e+09   -0.328457
8   2018  2.002383e+09     2.002164e+09    0.010927
9   2019  2.002891e+09     2.002383e+09    0.025383
10  2020  2.008585e+09     2.002891e+09    0.284318
11  2021  2.000244e+09     2.008585e+09   -0.415281
12  2022  2.004500e+09     2.000244e+09    0.212756
13  2023  1.995672e+09     2.004500e+09   -0.440401

Fireducks: Time for product performance analysis calculation: 0.17495489120483398 seconds

DuckDB is quicker this time.

Test 7 – Add a new column to the data set and update its value

DuckDB

import duckdb
import time

start_time = time.time()

# Add new columns
con.execute("""
ALTER TABLE sales ADD COLUMN total_with_tax FLOAT
"""
)

# Perform the calculations and update the table
con.execute("""
UPDATE sales
SET total_with_tax = CASE 
    WHEN total <= 100 THEN total * 1.125  -- 12.5% tax
    WHEN total > 100 AND total <= 200 THEN total * 1.15   -- 15% tax
    WHEN total > 200 AND total <= 500 THEN total * 1.17   -- 17% tax
    WHEN total > 500 THEN total * 1.20   -- 20% tax
END;
""")

print(f"Time to add new column: {time.time() - start_time} seconds")

# Verify the new columns
result = con.execute("""
    SELECT 
        *
    FROM sales
    LIMIT 10;
""").fetchdf()

print(result)

#
# DuckDB output
#

Time to add new column: 2.4016575813293457 seconds

   order_id order_date  customer_id   customer_name  product_id product_names  
0         0 2021-11-25          238  Customer_25600         211  Plastic Cups   
1         1 2017-06-10          534  Customer_14188         209  Coffee Maker   
2         2 2010-02-15          924  Customer_14013         207           Pen   
3         3 2011-01-26          633   Customer_6120         211  Plastic Cups   
4         4 2014-01-11          561   Customer_1352         205       Printer   
5         5 2021-04-19          533   Customer_5342         208      Notebook   
6         6 2012-03-14          684  Customer_21604         207           Pen   
7         7 2017-07-01          744  Customer_30291         201    Smartphone   
8         8 2013-02-13          678  Customer_32618         204       Monitor   
9         9 2023-01-04          340  Customer_16898         207           Pen   

    categories  quantity  price   total  total_with_tax  
0       Sundry         2  99.80  199.60      229.539993  
1  Electronics         8   7.19   57.52       64.709999  
2   Stationery         6  70.98  425.88      498.279602  
3       Sundry         6  94.38  566.28      679.536011  
4  Electronics         4  44.68  178.72      205.528000  
5   Stationery         4  21.85   87.40       98.324997  
6   Stationery         3  93.66  280.98      328.746613  
7  Electronics         6  39.41  236.46      276.658203  
8  Electronics         2   4.30    8.60        9.675000  
9   Stationery         2   6.67   13.34       15.007500  

Fireducks

import numpy as np
import time
import fireducks.pandas as pd

# Start total runtime timer
start_time = time.time()
# Define tax rate conditions and choices
conditions = [
    (df_fire['total'] <= 100),
    (df_fire['total'] > 100) & (df_fire['total'] <= 200),
    (df_fire['total'] > 200) & (df_fire['total'] <= 500),
    (df_fire['total'] > 500)
]

choices = [1.125, 1.15, 1.17, 1.20]

# Calculate total_with_tax using np.select for efficiency
df_fire['total_with_tax'] = df_fire['total'] * np.select(conditions, choices)

# Print total runtime
print(f"Fireducks: Time to add new column: {time.time() - start_time} seconds")
print(df_fire)

#
# Fireducks output
#

Fireducks: Time to add new column: 2.7112433910369873 seconds

          order_id order_date  customer_id   customer_name  product_id  
0                0 2021-11-25          238  Customer_25600         211   
1                1 2017-06-10          534  Customer_14188         209   
2                2 2010-02-15          924  Customer_14013         207   
3                3 2011-01-26          633   Customer_6120         211   
4                4 2014-01-11          561   Customer_1352         205   
...            ...        ...          ...             ...         ...   
99999995  99999995 2011-10-24          501  Customer_29289         202   
99999996  99999996 2011-10-21          593  Customer_29352         200   
99999997  99999997 2019-05-10          673  Customer_25709         202   
99999998  99999998 2022-03-02          709  Customer_23966         208   
99999999  99999999 2023-06-16          102   Customer_9650         203   

         product_names   categories  quantity  price   total    month  year  
0         Plastic Cups       Sundry         2  99.80  199.60  2021-11  2021   
1         Coffee Maker  Electronics         8   7.19   57.52  2017-06  2017   
2                  Pen   Stationery         6  70.98  425.88  2010-02  2010   
3         Plastic Cups       Sundry         6  94.38  566.28  2011-01  2011   
4              Printer  Electronics         4  44.68  178.72  2014-01  2014   
...                ...          ...       ...    ...     ...      ...   ...   
99999995          Desk       Office         7  53.78  376.46  2011-10  2011   
99999996        Laptop  Electronics         3  67.42  202.26  2011-10  2011   
99999997          Desk       Office         5   9.12   45.60  2019-05  2019   
99999998      Notebook   Stationery         1  78.91   78.91  2022-03  2022   
99999999         Chair       Office         4  59.58  238.32  2023-06  2023   

          total_with_tax  
0              229.54000  
1               64.71000  
2              498.27960  
3              679.53600  
4              205.52800  
...                  ...  
99999995       440.45820  
99999996       236.64420  
99999997        51.30000  
99999998        88.77375  
99999999       278.83440  

[100000000 rows x 13 columns]

They have very similar run times yet again. A draw.

Test 8 – Write out the updated data to a CSV file

DuckDB

start_time = time.time()

# Write the modified sales_data table to a CSV file
start = time.time()
con.execute("""
    COPY (SELECT * FROM sales) TO '/mnt/d/sales_data/final_sales_data_duckdb.csv' WITH (HEADER TRUE, DELIMITER ',')
""")

print(f"DuckDB: Time to write CSV to file: {time.time() - start_time} seconds")

DuckDB: Time to write CSV to file: 54.899176597595215 seconds

Fireducks

# fireducks write data back to CSV
#
import fireducks.pandas as pd
import time

# Tidy up DF before writing out
cols_to_drop = ['year', 'month']
df_fire = df_fire.drop(columns=cols_to_drop)
df_fire['total_with_tax'] = df_fire['total_with_tax'].round(2) 
df_fire['order_date'] = df_fire['order_date'].dt.date

# Start total runtime timer
start_time = time.time()

df_fire.to_csv('/mnt/d/sales_data/fireducks_sales.csv',quoting=0,index=False)

# Print total runtime
print(f"Fireducks: Time to write CSV  to file: {time.time() - start_time} seconds")

Fireducks: Time to write CSV  to file: 54.490307331085205 seconds

Too close to call again.

Test 9— Write out the updated data to a parquet file

DuckDB

# DuckDB write Parquet data
# 

start_time = time.time()

# Write the modified sales_data table to a Parquet file
start = time.time()
con.execute("COPY sales TO '/mnt/d/sales_data/final_sales_data_duckdb.parquet' (FORMAT 'parquet');")

print(f"DuckDB: Time to write parquet to file: {time.time() - start_time} seconds")

DuckDB: Time to write parquet to file: 30.011869192123413 seconds

Fireducks

import fireducks.pandas as pd
import time

# Start total runtime timer
start_time = time.time()

df_fire.to_parquet('/mnt/d/sales_data/fireducks_sales.parquet')

# Print total runtime
print(f"Fireducks: Time to write Parquet to file: {time.time() - start_time} seconds")

Fireducks: Time to write Parquet to file: 86.29632377624512 seconds

That’s the first major discrepancy between run times. Fireducks took almost a minute longer to write out its data to Parquet than did DuckDB.

Summary

So, what are we to make of all this? Simply put, there is not much in it between these two libraries. Both are super fast and capable of processing large data sets. Once your data is in memory, either in a DuckDB table or a Fireducks dataframe, both libraries are equally capable of processing it in double-quick time.

The choice of which one to use depends on your existing infrastructure and skill set.

If you’re a database person, DuckDB is the obvious library to use, as your SQL skills would be instantly transferable.

Alternatively, if you’re already embedded in the Pandas’ world, Fireducks would be a great choice for you.

_OK, that’s all for me just now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories, follow me or subscribe to get notified when I post new content._

If you like this content, you might find these articles interesting, too.

Building a Data Dashboard

Speed up Pandas code with Numpy

Building a Data Dashboard

Using the Streamlit Python library

Image by Author

With source data from a Postgres database

In my many years as a Python data engineer, one area I was not much involved in was the production of data dashboards. That all changed when Python-based libraries such as Streamlit, Gradio and Taipy came along.

With their introduction, Python programmers had no excuse not to craft nice-looking front ends and dashboards.

Until then, the only other options were to use specialised tools like Tableau or AWS's QuickSight or—horror of horrors—get your hands dirty with CSS, HTML, and JavaScript.

So, if you’ve never used one of these new Python-based graphical front-end libraries before, this article is for you as I’ll be taking you through how to code up a data dashboard using one of the most popular libraries for this purpose called Streamlit.

My intention is that this will be the first part of a series of articles on developing a data dashboard using three of the most popular Python-based GUI libraries. In addition to this one, I also plan to release articles on Gradio and Taipy, so look out for those. As much as possible, I'll try to replicate the same layout and functionality in each dashboard. I'll also use the exact same data for all three, albeit in different formats, e.g. a CSV file, a database and so on.

Please also note that I have no connection or affiliation with Streamlit/Snowflake, Postgres or any other company or tool mentioned in this post.

What is Streamlit?

Founded in 2018 by Adrien Treuille, Amanda Kelly, and Thiago Teixeira, Streamlit quickly gained popularity among data scientists and machine learning engineers when it introduced its open-source Python framework to simplify the creation of interactive data applications.

In March 2022, Snowflake, a Data Cloud company, acquired Streamlit and its capabilities were integrated into the Snowflake ecosystem to enhance data application development.

Streamlit’s open-source framework has been widely adopted, with over 8 million downloads and more than 1.5 million applications built using the platform. An active community of developers and contributors continues to play a significant role in its ongoing development and success.

What we’ll develop

We’re going to develop a data dashboard. Our source data for the dashboard will be in a single Postgres database table and contain 100,000 synthetic sales records.

To be honest, the actual source of the data isn’t that important. It could just as easily be a text or CSV file, SQLite, or any database you can connect to. I chose Postgres because I have a copy on my local PC, and it’s convenient for me to use.

This is what our final dashboard will look like.

Image by Author
Image by Author

There are four main sections.

  • The top row allows the user to choose specific start and end dates and/or product categories via date pickers and a drop-down list, respectively.
  • The second row – Key metrics – shows a top-level summary of the chosen data.
  • The Visualisation section allows the user to select one of three graphs to display the input data set.
  • The raw data section is exactly what it says. This is a tabular representation of the chosen data, effectively viewing the underlying Postgres database table data.

Using the dashboard is easy. Initially, stats for the whole data set are displayed. The user can then narrow the data focus using the 3 choice fields at the top of the display. The graphs, key metrics and raw data sections dynamically change to reflect what the user has chosen.

The underlying data

As mentioned, the dashboard’s source data is contained in a single Postgres database table. The data is a set of 100,000 synthetic sales-related data records. Here is the Postgres table creation script for reference.

CREATE TABLE IF NOT EXISTS public.sales_data
(
    order_id integer NOT NULL,
    order_date date,
    customer_id integer,
    customer_name character varying(255) COLLATE pg_catalog."default",
    product_id integer,
    product_names character varying(255) COLLATE pg_catalog."default",
    categories character varying(100) COLLATE pg_catalog."default",
    quantity integer,
    price numeric(10,2),
    total numeric(10,2)
)

And here is some Python code you can use to generate the data set for yourself. Make sure both the numpy and polars libraries are installed first.

# generate the 100,000 record CSV file
#
import polars as pl
import numpy as np
from datetime import datetime, timedelta

def generate(nrows: int, filename: str):
    names = np.asarray(
        [
            "Laptop",
            "Smartphone",
            "Desk",
            "Chair",
            "Monitor",
            "Printer",
            "Paper",
            "Pen",
            "Notebook",
            "Coffee Maker",
            "Cabinet",
            "Plastic Cups",
        ]
    )

    categories = np.asarray(
        [
            "Electronics",
            "Electronics",
            "Office",
            "Office",
            "Electronics",
            "Electronics",
            "Stationery",
            "Stationery",
            "Stationery",
            "Electronics",
            "Office",
            "Sundry",
        ]
    )

    product_id = np.random.randint(len(names), size=nrows)
    quantity = np.random.randint(1, 11, size=nrows)
    price = np.random.randint(199, 10000, size=nrows) / 100

    # Generate random dates between 2010-01-01 and 2023-12-31
    start_date = datetime(2010, 1, 1)
    end_date = datetime(2023, 12, 31)
    date_range = (end_date - start_date).days

    # Create random dates as np.array and convert to string format
    order_dates = np.array([(start_date + timedelta(days=np.random.randint(0, date_range))).strftime('%Y-%m-%d') for _ in range(nrows)])

    # Define columns
    columns = {
        "order_id": np.arange(nrows),
        "order_date": order_dates,
        "customer_id": np.random.randint(100, 1000, size=nrows),
        "customer_name": [f"Customer_{i}" for i in np.random.randint(2**15, size=nrows)],
        "product_id": product_id + 200,
        "product_names": names[product_id],
        "categories": categories[product_id],
        "quantity": quantity,
        "price": price,
        "total": price * quantity,
    }

    # Create Polars DataFrame and write to CSV with explicit delimiter
    df = pl.DataFrame(columns)
    df.write_csv(filename, separator=',',include_header=True)  # Ensure comma is used as the delimiter

# Generate 100,000 rows of data with random order_date and save to CSV
generate(100_000, "/mnt/d/sales_data/sales_data.csv")
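To get the generated CSV into the sales_data table, one option is psycopg2's copy_expert. This is a minimal sketch only; it assumes a local Postgres instance with the same credentials used in the dashboard code below, and that the table has already been created with the script above.

# Minimal sketch: bulk-load the generated CSV into public.sales_data.
# The connection details are assumptions; adjust them for your setup.
import psycopg2

conn = psycopg2.connect(
    dbname="postgres", user="postgres", password="postgres",
    host="localhost", port="5432"
)
try:
    with conn.cursor() as cur, open("/mnt/d/sales_data/sales_data.csv") as f:
        cur.copy_expert(
            "COPY public.sales_data (order_id, order_date, customer_id, "
            "customer_name, product_id, product_names, categories, "
            "quantity, price, total) FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )
    conn.commit()
finally:
    conn.close()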

Setting up our development environment

Before we get to the example code, let’s set up a separate development environment. That way, what we do won’t interfere with other versions of libraries, programming, etc… we might have on the go for other projects we’re working on.

I use Miniconda for this, but you can use whatever method suits you best.

If you want to go down the Miniconda route and don’t already have it, you must install Miniconda first. Get it using this link,

Miniconda – Anaconda documentation

Once the environment is created, switch to it using the activate command, and then pip install our required Python libraries.

#create our test environment
(base) C:\Users\thoma>conda create -n streamlit_test python=3.12 -y
# Now activate it
(base) C:\Users\thoma>conda activate streamlit_test
# Install python libraries, etc ...
(streamlit_test) C:\Users\thoma>pip install streamlit pandas matplotlib psycopg2

The Code

I’ll split the code up into sections and explain each one along the way.

#
# Streamlit sales performance dashboard
#
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import psycopg2
from psycopg2 import sql
from psycopg2 import pool

# Initialize connection pool
try:
    connection_pool = psycopg2.pool.ThreadedConnectionPool(
        minconn=5,
        maxconn=20,
        dbname="postgres",
        user="postgres",
        password="postgres",
        host="localhost",
        port="5432"
    )
except psycopg2.Error as e:
    st.error(f"Error creating connection pool: {e}")

def get_connection():
    try:
        return connection_pool.getconn()
    except psycopg2.Error as e:
        st.error(f"Error getting connection from pool: {e}")
        return None

def release_connection(conn):
    try:
        connection_pool.putconn(conn)
    except psycopg2.Error as e:
        st.error(f"Error releasing connection back to pool: {e}")

We start by importing all the external libraries we'll need. Next, we set up a ThreadedConnectionPool that allows multiple threads to share a pool of database connections. Two helper functions follow, one to get a database connection and the other to release it. This is overkill for a simple single-user app but essential for handling multiple simultaneous users or threads accessing the database in a web app environment.
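As an aside, if you would rather not repeat the try/finally pattern in every query function, the two helpers could be wrapped in a small context manager. This is just a sketch of an alternative, not part of the dashboard code that follows:

# Optional alternative (a sketch, not used below): borrow a connection from
# the pool and guarantee it is returned, so callers can write
# `with pooled_connection() as conn:` instead of pairing
# get_connection() with release_connection() by hand.
from contextlib import contextmanager

@contextmanager
def pooled_connection():
    conn = get_connection()
    if conn is None:
        raise RuntimeError("No database connection available")
    try:
        yield conn
    finally:
        release_connection(conn)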


def get_date_range():
    conn = get_connection()
    if conn is None:
        return None, None
    try:
        with conn.cursor() as cur:
            query = sql.SQL("SELECT MIN(order_date), MAX(order_date) FROM public.sales_data")
            cur.execute(query)
            return cur.fetchone()
    finally:
        release_connection(conn)

def get_unique_categories():
    conn = get_connection()
    if conn is None:
        return []
    try:
        with conn.cursor() as cur:
            query = sql.SQL("SELECT DISTINCT categories FROM public.sales_data ORDER BY categories")
            cur.execute(query)
            return [row[0].capitalize() for row in cur.fetchall()]
    finally:
        release_connection(conn)

def get_dashboard_stats(start_date, end_date, category):
    conn = get_connection()
    if conn is None:
        return None
    try:
        with conn.cursor() as cur:
            query = sql.SQL("""
                WITH category_totals AS (
                    SELECT 
                        categories,
                        SUM(price * quantity) as category_revenue
                    FROM public.sales_data
                    WHERE order_date BETWEEN %s AND %s
                    AND (%s = 'All Categories' OR categories = %s)
                    GROUP BY categories
                ),
                top_category AS (
                    SELECT categories
                    FROM category_totals
                    ORDER BY category_revenue DESC
                    LIMIT 1
                ),
                overall_stats AS (
                    SELECT 
                        SUM(price * quantity) as total_revenue,
                        COUNT(DISTINCT order_id) as total_orders,
                        SUM(price * quantity) / COUNT(DISTINCT order_id) as avg_order_value
                    FROM public.sales_data
                    WHERE order_date BETWEEN %s AND %s
                    AND (%s = 'All Categories' OR categories = %s)
                )
                SELECT 
                    total_revenue,
                    total_orders,
                    avg_order_value,
                    (SELECT categories FROM top_category) as top_category
                FROM overall_stats
            """)
            cur.execute(query, [start_date, end_date, category, category, 
                                start_date, end_date, category, category])
            return cur.fetchone()
    finally:
        release_connection(conn)

The get_date_range function executes the SQL query to find the range of dates (MIN and MAX) in the order_date column and returns the two dates as a tuple: (start_date, end_date).

The get_unique_categories function runs an SQL query to fetch unique values from the categories column. It capitalizes the category names (first letter uppercase) before returning them as a list.

The get_dashboard_stats function executes a SQL query with the following parts:

  • category_totals: Calculates total revenue for each category in the given date range.
  • top_category: Finds the category with the highest revenue.
  • overall_stats: Computes overall statistics:
  • Total revenue (SUM(price * quantity)).
  • Total number of unique orders (COUNT(DISTINCT order_id)).
  • Average order value (total revenue divided by total orders).

It returns a single row containing:

  • total_revenue: Total revenue in the specified period.
  • total_orders: Number of distinct orders.
  • avg_order_value: Average revenue per order.
  • top_category: The category with the highest revenue.

def get_plot_data(start_date, end_date, category):
    conn = get_connection()
    if conn is None:
        return pd.DataFrame()
    try:
        with conn.cursor() as cur:
            query = sql.SQL("""
                SELECT DATE(order_date) as date,
                       SUM(price * quantity) as revenue
                FROM public.sales_data
                WHERE order_date BETWEEN %s AND %s
                  AND (%s = 'All Categories' OR categories = %s)
                GROUP BY DATE(order_date)
                ORDER BY date
            """)
            cur.execute(query, [start_date, end_date, category, category])
            return pd.DataFrame(cur.fetchall(), columns=['date', 'revenue'])
    finally:
        release_connection(conn)

def get_revenue_by_category(start_date, end_date, category):
    conn = get_connection()
    if conn is None:
        return pd.DataFrame()
    try:
        with conn.cursor() as cur:
            query = sql.SQL("""
                SELECT categories,
                       SUM(price * quantity) as revenue
                FROM public.sales_data
                WHERE order_date BETWEEN %s AND %s
                  AND (%s = 'All Categories' OR categories = %s)
                GROUP BY categories
                ORDER BY revenue DESC
            """)
            cur.execute(query, [start_date, end_date, category, category])
            return pd.DataFrame(cur.fetchall(), columns=['categories', 'revenue'])
    finally:
        release_connection(conn)

def get_top_products(start_date, end_date, category):
    conn = get_connection()
    if conn is None:
        return pd.DataFrame()
    try:
        with conn.cursor() as cur:
            query = sql.SQL("""
                SELECT product_names,
                       SUM(price * quantity) as revenue
                FROM public.sales_data
                WHERE order_date BETWEEN %s AND %s
                  AND (%s = 'All Categories' OR categories = %s)
                GROUP BY product_names
                ORDER BY revenue DESC
                LIMIT 10
            """)
            cur.execute(query, [start_date, end_date, category, category])
            return pd.DataFrame(cur.fetchall(), columns=['product_names', 'revenue'])
    finally:
        release_connection(conn)

def get_raw_data(start_date, end_date, category):
    conn = get_connection()
    if conn is None:
        return pd.DataFrame()
    try:
        with conn.cursor() as cur:
            query = sql.SQL("""
                SELECT 
                    order_id, order_date, customer_id, customer_name, 
                    product_id, product_names, categories, quantity, price, 
                    (price * quantity) as revenue
                FROM public.sales_data
                WHERE order_date BETWEEN %s AND %s
                  AND (%s = 'All Categories' OR categories = %s)
                ORDER BY order_date, order_id
            """)
            cur.execute(query, [start_date, end_date, category, category])
            return pd.DataFrame(cur.fetchall(), columns=[desc[0] for desc in cur.description])
    finally:
        release_connection(conn)

def plot_data(data, x_col, y_col, title, xlabel, ylabel, orientation='v'):
    fig, ax = plt.subplots(figsize=(10, 6))
    if not data.empty:
        if orientation == 'v':
            ax.bar(data[x_col], data[y_col])
        else:
            ax.barh(data[x_col], data[y_col])
        ax.set_title(title)
        ax.set_xlabel(xlabel)
        ax.set_ylabel(ylabel)
        plt.xticks(rotation=45)
    else:
        ax.text(0.5, 0.5, "No data available", ha='center', va='center')
    return fig

The get_plot_data function fetches daily revenue within the given date range and category. It retrieves data grouped by the day (DATE(order_date)) and calculates daily revenue (SUM(price * quantity)), then returns a Pandas DataFrame with columns: date (the day) and revenue (total revenue for that day).

The get_revenue_by_category function fetches revenue totals grouped by category within the specified date range. It groups data by categories and calculates revenue for each category (SUM(price * quantity)), orders the results by revenue in descending order and returns a Pandas DataFrame with columns: categories (category name) and revenue (total revenue for the category).

The get_top_products function retrieves the top 10 products by revenue within the given date range and category. It groups data by product_names and calculates revenue for each product (SUM(price * quantity)), orders the products by revenue in descending order and limits results to the top 10 before returning a Pandas DataFrame with columns: product_names (product name) and revenue (total revenue for the product).

The get_raw_data function fetches raw transaction data within the specified date range and category.

The plot_data function takes in some data (in a pandas DataFrame) and the names of the columns you want to plot on the x- and y-axes. It then creates a bar chart – either vertical or horizontal, depending on the chosen orientation – labels the axes, adds a title, and returns the finished chart (a Matplotlib Figure). If the data is empty, it just displays a "No data available" message instead of trying to plot anything.


# Streamlit App
st.title("Sales Performance Dashboard")

# Filters
with st.container():
    col1, col2, col3 = st.columns([1, 1, 2])
    min_date, max_date = get_date_range()
    start_date = col1.date_input("Start Date", min_date)
    end_date = col2.date_input("End Date", max_date)
    categories = get_unique_categories()
    category = col3.selectbox("Category", ["All Categories"] + categories)

# Custom CSS for metrics
st.markdown("""
    <style>
    .metric-row {
        display: flex;
        justify-content: space-between;
        margin-bottom: 20px;
    }
    .metric-container {
        flex: 1;
        padding: 10px;
        text-align: center;
        background-color: #f0f2f6;
        border-radius: 5px;
        margin: 0 5px;
    }
    .metric-label {
        font-size: 14px;
        color: #555;
        margin-bottom: 5px;
    }
    .metric-value {
        font-size: 18px;
        font-weight: bold;
        color: #0e1117;
    }
    </style>
""", unsafe_allow_html=True)

# Metrics
st.header("Key Metrics")
stats = get_dashboard_stats(start_date, end_date, category)
if stats:
    total_revenue, total_orders, avg_order_value, top_category = stats
else:
    total_revenue, total_orders, avg_order_value, top_category = 0, 0, 0, "N/A"

# Custom metrics display
metrics_html = f"""
<div class="metric-row">
    <div class="metric-container">
        <div class="metric-label">Total Revenue</div>
        <div class="metric-value">${total_revenue:,.2f}</div>
    </div>
    <div class="metric-container">
        <div class="metric-label">Total Orders</div>
        <div class="metric-value">{total_orders:,}</div>
    </div>
    <div class="metric-container">
        <div class="metric-label">Average Order Value</div>
        <div class="metric-value">${avg_order_value:,.2f}</div>
    </div>
    <div class="metric-container">
        <div class="metric-label">Top Category</div>
        <div class="metric-value">{top_category}</div>
    </div>
</div>
"""
st.markdown(metrics_html, unsafe_allow_html=True)

This code section creates the main structure for displaying the key metrics in the Streamlit dashboard. It:

  1. Sets up the page title: "Sales Performance Dashboard."
  2. Presents filters for start/end dates and category selection.
  3. Retrieves metrics (such as total revenue, total orders, etc.) for the chosen filters from the database.
  4. Applies custom CSS to style these metrics in a row of boxes with labels and values.
  5. Displays the metrics within an HTML block, ensuring each metric gets its own styled container.

# Visualization Tabs
st.header("Visualizations")
tabs = st.tabs(["Revenue Over Time", "Revenue by Category", "Top Products"])

# Revenue Over Time Tab
with tabs[0]:
    st.subheader("Revenue Over Time")
    revenue_data = get_plot_data(start_date, end_date, category)
    st.pyplot(plot_data(revenue_data, 'date', 'revenue', "Revenue Over Time", "Date", "Revenue"))

# Revenue by Category Tab
with tabs[1]:
    st.subheader("Revenue by Category")
    category_data = get_revenue_by_category(start_date, end_date, category)
    st.pyplot(plot_data(category_data, 'categories', 'revenue', "Revenue by Category", "Category", "Revenue"))

# Top Products Tab
with tabs[2]:
    st.subheader("Top Products")
    top_products_data = get_top_products(start_date, end_date, category)
    st.pyplot(plot_data(top_products_data, 'product_names', 'revenue', "Top Products", "Revenue", "Product Name", orientation='h'))

This section adds a header titled "Visualizations" to this part of the dashboard. It creates three tabs, each of which displays a different graphical representation of the data:

Tab 1: Revenue Over Time

  • Fetches revenue data grouped by date for the given filters using get_plot_data().
  • Calls plot_data() to generate a bar chart of revenue over time, with dates on the x-axis and revenue on the y-axis.
  • Displays the chart in the first tab.

Tab 2: Revenue by Category

  • Fetches revenue grouped by category using get_revenue_by_category().
  • Calls plot_data() to create a bar chart of revenue by category.
  • Displays the chart in the second tab.

Tab 3: Top Products

  • Fetches top 10 products by revenue for the given filters using get_top_products().
  • Calls plot_data() to create a horizontal bar chart (indicated by orientation='h').
  • Displays the chart in the third tab.


st.header("Raw Data")

raw_data = get_raw_data(
    start_date=start_date,
    end_date=end_date,
    category=category
)

# Remove the index by resetting it and dropping the old index
raw_data = raw_data.reset_index(drop=True)

st.dataframe(raw_data,hide_index=True)

# Add spacing
st.write("")

The final section displays the raw data in a dataframe. The user is able to scroll up and down as required to see all records available.

An empty st.write("") is added at the end to provide spacing for better visual alignment.

Running the App

Let’s say you save your code into a file called app.py. You can run it using this from the command line,

(streamlit_test) C:\Users\thoma> python -m streamlit run app.py

If everything works as expected, you will see this after you run the above command.


  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://192.168.0.59:8501

Click on the Local URLs shown, and a browser screen should appear with the Streamlit app running.

Summary

In this article, I’ve attempted to provide a comprehensive guide to building an interactive sales performance dashboard using Streamlit with a Postgres database table as its source data.

Streamlit is a modern, Python-based open-source framework that simplifies the creation of data-driven dashboards and applications. The dashboard I developed allows users to filter data by date ranges and product categories, view key metrics such as total revenue and top-performing categories, explore visualizations like revenue trends and top products, and browse the raw data in a scrollable table.

This guide includes a complete implementation, from setting up a Postgres database with sample data to creating Python functions for querying data, generating plots, and handling user input. This step-by-step approach demonstrates how to leverage Streamlit’s capabilities to create user-friendly and dynamic dashboards, making it ideal for data engineers and scientists who want to build interactive data applications.

Although I used Postgres for my data, it should be straightforward to modify the code to use a CSV file or any other relational database management system (RDBMS), such as SQLite, as your data source.
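For example, here is a minimal sketch of what the connection helpers might look like against SQLite. The database file name is an assumption, the table and columns are assumed to match the Postgres schema above, and the query functions' %s placeholders would also need changing to SQLite's ? style.

# Hypothetical SQLite variant of the connection helpers. Assumes a local
# sales.db file containing a sales_data table with the same columns as the
# Postgres version; a connection pool isn't needed for a small local app.
import sqlite3

def get_connection():
    # check_same_thread=False lets Streamlit's worker threads share it
    return sqlite3.connect("sales.db", check_same_thread=False)

def release_connection(conn):
    conn.close()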


_That’s all from me for now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content._

If you liked this content, Medium thinks you’ll find these articles interesting, too.

Speed up Pandas code with Numpy

Introducing Deepseek Artifacts


Image by AI (Dalle-3)

Speed up Pandas Code with NumPy

But I can’t vectorise this, can I? …. yes, you probably can!

In one of the first articles I wrote on Medium, I talked about using the apply() method on Pandas dataframes and said it should be avoided, if possible, on larger dataframes. I’ll put a link to that article at the end of this one if you want to check it out.

Although I talked a bit then about possible alternatives, i.e. using vectorisation, I didn't give many examples, so I intend to remedy that here. Specifically, I want to talk about how NumPy and a couple of its lesser-known methods (where and select) can be used to speed up Pandas operations that involve complex if/then/else conditions.

Vectorisation in the context of Pandas refers to the method of applying operations to entire blocks of data at once rather than iterating through them row by row or element by element. This approach is possible due to Pandas’ reliance on NumPy, which supports vectorised operations that are highly optimized and written in C, enabling faster processing. When you use vectorised operations in Pandas, such as applying arithmetic operations or functions to DataFrame or Series objects, the operations are dispatched to multiple data elements simultaneously.

This not only leads to more concise and readable code but can significantly boost performance by reducing the overhead of Python loops and taking advantage of modern CPUs’ capabilities to perform operations on multiple data points in parallel. Vectorization is a key feature that makes Pandas powerful for data manipulation and analysis in Python.

The problem is that some vectorisation operations are obvious and happen without you even realising that’s what’s being done under the hood. Many times, though, when the types of operations you want to perform become more complex, it is difficult to see how vectorisation can be applied in these situations.

In this article, I’ll discuss some common scenarios in which this is the case and show you how vectorisation can be applied.

Getting some data to work with

In many of my articles looking at the performance of Python libraries and database systems, I use a synthetic set of sales data for my testing. The schema of this data set looks like this,

  • order_id (int)
  • order_date (date)
  • customer_id (int)
  • customer_name (str)
  • product_id (int)
  • product_name (str)
  • category (str)
  • quantity (int)
  • price (float)
  • total (float)

Here is a Python program that you can use to generate such a data set. It produces a CSV file. The number of records to create and the location of the output file are configurable. It uses the NumPy and polars libraries, so you must install these before running it.

For this test, as I have a fairly high-spec PC, I’ll create and use a 1 million record CSV.

# generate the 1m record CSV file
#
import polars as pl
import numpy as np
from datetime import datetime, timedelta

def generate(nrows: int, filename: str):
    names = np.asarray(
        [
            "Laptop",
            "Smartphone",
            "Desk",
            "Chair",
            "Monitor",
            "Printer",
            "Paper",
            "Pen",
            "Notebook",
            "Coffee Maker",
            "Cabinet",
            "Plastic Cups",
        ]
    )

    categories = np.asarray(
        [
            "Electronics",
            "Electronics",
            "Office",
            "Office",
            "Electronics",
            "Electronics",
            "Stationery",
            "Stationery",
            "Stationery",
            "Electronics",
            "Office",
            "Sundry",
        ]
    )

    product_id = np.random.randint(len(names), size=nrows)
    quantity = np.random.randint(1, 11, size=nrows)
    price = np.random.randint(199, 10000, size=nrows) / 100

    # Generate random dates between 2010-01-01 and 2023-12-31
    start_date = datetime(2010, 1, 1)
    end_date = datetime(2023, 12, 31)
    date_range = (end_date - start_date).days

    # Create random dates as np.array and convert to string format
    order_dates = np.array([(start_date + timedelta(days=np.random.randint(0, date_range))).strftime('%Y-%m-%d') for _ in range(nrows)])

    # Define columns
    columns = {
        "order_id": np.arange(nrows),
        "order_date": order_dates,
        "customer_id": np.random.randint(100, 1000, size=nrows),
        "customer_name": [f"Customer_{i}" for i in np.random.randint(2**15, size=nrows)],
        "product_id": product_id + 200,
        "product_names": names[product_id],
        "categories": categories[product_id],
        "quantity": quantity,
        "price": price,
        "total": price * quantity,
    }

    # Create Polars DataFrame and write to CSV with explicit delimiter
    df = pl.DataFrame(columns)
    df.write_csv(filename, separator=',',include_header=True)  # Ensure comma is used as the delimiter

# Generate data with random order_date and save to CSV
generate(1_000_000, "/mnt/d/sales_data/sales_data_1m.csv")

Setting up our development environment

Before we get to the example code, let’s set up a separate development environment. That way, what we do won’t interfere with other versions of libraries, programming, etc… we might have on the go for other projects we’re working on.

I use Miniconda for this, but you can use whatever method suits you best.

If you want to go down the Miniconda route and don’t already have it, you must install Miniconda first. Get it using this link,

Miniconda – Anaconda documentation

Once the environment is created, switch to it using the activate command, and then install Jupyter and any required Python libraries.

#create our test environment
(base) C:\Users\thoma>conda create -n pandas_vect python=3.12 -y
# Now activate it
(base) C:\Users\thoma>conda activate pandas_vect
# Install python libraries, etc ...
(pandas_vect) C:\Users\thoma>conda install pandas numpy jupyter -y

Now type jupyter notebook into your command prompt. You should see a Jupyter notebook open in your browser. If that doesn't happen automatically, what you'll likely see is a screenful of information after the jupyter notebook command.

Near the bottom, there will be a URL that you should copy and paste into your browser to initiate the Jupyter Notebook.

Your URL will be different to mine, but it should look something like this:-

http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69da

The code

To begin, let’s read our data set into a Pandas dataframe. We’ll time everything to get an idea of any speed-ups we gain.

import pandas as pd
import numpy as np
import time

# Start the timer
start_time = time.time()

# Path to the CSV file
file_path = "d:sales_datasales_data_1m.csv"

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Display the last 10 rows of the DataFrame
print(df.head())

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken to read the CSV file : {elapsed_time} seconds")

#
# Here is our output
#

  order_id  order_date  customer_id   customer_name  product_id  
0         0  2022-08-01          245    Customer_884         201   
1         1  2022-02-19          701   Customer_1672         205   
2         2  2017-01-01          184  Customer_21720         208   
3         3  2013-03-09          275  Customer_23770         200   
4         4  2022-04-23          960  Customer_23790         210   

  product_names   categories  quantity  price   total  
0    Smartphone  Electronics         3  90.02  270.06  
1       Printer  Electronics         6  12.74   76.44  
2      Notebook   Stationery         8  48.35  386.80  
3        Laptop  Electronics         3  74.85  224.55  
4       Cabinet       Office         6  53.77  322.62  

Time taken to read the CSV file : 1.0295870304107666 seconds

Example 1 – Setting the scene

So, many operations you do on a Pandas dataframe are inherently vectorised. For example, suppose we want to multiply the quantity field by 5 and update the total value column.

# Start the timer
start_time = time.time()

df['quantity'] *= 5

# Update 'total' to reflect the new 'quantity'
df['total'] = df['quantity'] * df['price']

# Display the updated DataFrame
print(df.head())

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken : {elapsed_time} seconds")

#
# Output
#

   order_id  order_date  customer_id   customer_name  product_id  
0         0  2022-08-01          245    Customer_884         201   
1         1  2022-02-19          701   Customer_1672         205   
2         2  2017-01-01          184  Customer_21720         208   
3         3  2013-03-09          275  Customer_23770         200   
4         4  2022-04-23          960  Customer_23790         210   

  product_names   categories  quantity  price    total  
0    Smartphone  Electronics        15  90.02  1350.30  
1       Printer  Electronics        30  12.74   382.20  
2      Notebook   Stationery        40  48.35  1934.00  
3        Laptop  Electronics        15  74.85  1122.75  
4       Cabinet       Office        30  53.77  1613.10  

Time taken : 0.009307861328125 seconds

It took less than one-hundredth of a second to process 1 million records. How do we know the above was a vectorisation process? Apart from the minuscule amount of time it took to run, we can check to see how long it takes to do the same thing using a non-vectorised method—specifically, using the apply() method.

If you weren’t very experienced in coding Pandas, this is a method you might have devised to solve this problem in the first place.

Using the Apply() method

Apply() allows you to run a function that will be applied to every record in a dataframe. This should be quicker than using a for loop to iterate over the dataframe rows, but how does it compare to the original vectorised code?

# Start the timer
start_time = time.time()

# Define the function to update 'quantity' and 'total'
def update_row(row):
    row['quantity'] *= 5
    row['total'] = row['quantity'] * row['price']
    return row

# Apply the function to each row
df = df.apply(update_row, axis=1)

# Display the updated DataFrame
print(df.head())

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken : {elapsed_time} seconds")

#
# Output
#

   order_id  order_date  customer_id   customer_name  product_id  
0         0  2022-08-01          245    Customer_884         201   
1         1  2022-02-19          701   Customer_1672         205   
2         2  2017-01-01          184  Customer_21720         208   
3         3  2013-03-09          275  Customer_23770         200   
4         4  2022-04-23          960  Customer_23790         210   

  product_names   categories  quantity  price    total  
0    Smartphone  Electronics        15  90.02  1350.30  
1       Printer  Electronics        30  12.74   382.20  
2      Notebook   Stationery        40  48.35  1934.00  
3        Laptop  Electronics        15  74.85  1122.75  
4       Cabinet       Office        30  53.77  1613.10 

Time taken : 75.53943586349487 seconds

So, as you can see from the timing of the above operation, we've established that vectorisation in Pandas coding is essential when processing large datasets and that it often just gets implemented "behind the scenes" for simpler code problems.

But what happens when our coding needs are slightly more complicated?

Example 2 – Vectorise an if/then/else condition

In this example, say we want to do the same operation as before (multiply the quantity by 5), but this time only for Smartphone products. For any other product, we want to multiply the quantity by 2.

Here is the naive apply() implementation.

def update_record(row):
    if row['product_names'] == 'Smartphone':
        row['quantity'] *= 5
    else:
        row['quantity'] *= 2
    row['total'] = row['quantity'] * row['price']
    return row

# Start the timer
start_time = time.time()

# Apply the update_record function to each row
df = df.apply(update_record, axis=1)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time

# Display the updated DataFrame
print(df)
print(f"Time taken: {elapsed_time} seconds")

        order_id  order_date  customer_id   customer_name  product_id  
0              0  2022-08-01          245    Customer_884         201   
1              1  2022-02-19          701   Customer_1672         205   
2              2  2017-01-01          184  Customer_21720         208   
3              3  2013-03-09          275  Customer_23770         200   
4              4  2022-04-23          960  Customer_23790         210   
...          ...         ...          ...             ...         ...   
999995    999995  2011-05-08          408  Customer_26518         202   
999996    999996  2019-02-11          850   Customer_4581         208   
999997    999997  2021-11-19          399  Customer_28681         205   
999998    999998  2016-05-02          714  Customer_12693         209   
999999    999999  2018-08-12          324  Customer_28553         207   

       product_names   categories  quantity  price    total  
0         Smartphone  Electronics        15  90.02  1350.30  
1            Printer  Electronics        12  12.74   152.88  
2           Notebook   Stationery        16  48.35   773.60  
3             Laptop  Electronics         6  74.85   449.10  
4            Cabinet       Office        12  53.77   645.24  
...              ...          ...       ...    ...      ...  
999995          Desk       Office        12  32.29   387.48  
999996      Notebook   Stationery         6   8.16    48.96  
999997       Printer  Electronics         6  92.69   556.14  
999998  Coffee Maker  Electronics         4  18.10    72.40  
999999           Pen   Stationery        16  93.04  1488.64  

[1000000 rows x 10 columns]
Time taken: 78.65310955047607 seconds

At almost 80 seconds, the run time is far too long.

There are a couple of ways we can improve this using vectorisation. The first is the obvious way and probably what most experienced Pandas coders would turn to as it seems natural. It’s quick, too.

# Start the timer
start_time = time.time()

# Multiply 'quantity' by 5 for Smartphones, by 2 for others
df.loc[df['product_names'] == 'Smartphone', 'quantity'] *= 5
df.loc[df['product_names'] != 'Smartphone', 'quantity'] *= 2

# Update 'total' based on new 'quantity'
df['total'] = df['quantity'] * df['price']

# Display the updated DataFrame
print(df.head())

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken : {elapsed_time} seconds")

#
# Output
#

   order_id  order_date  customer_id   customer_name  product_id  
0         0  2022-08-01          245    Customer_884         201   
1         1  2022-02-19          701   Customer_1672         205   
2         2  2017-01-01          184  Customer_21720         208   
3         3  2013-03-09          275  Customer_23770         200   
4         4  2022-04-23          960  Customer_23790         210   

  product_names   categories  quantity  price    total  
0    Smartphone  Electronics        15  90.02  1350.30  
1       Printer  Electronics        12  12.74   152.88  
2      Notebook   Stationery        16  48.35   773.60  
3        Laptop  Electronics         6  74.85   449.10  
4       Cabinet       Office        12  53.77   645.24  
Time taken : 0.14528226852416992 seconds

The second way I’ll show uses a NumPy method you might not have encountered before – the NumPy where() function, and it’s even faster than regular vectorisation. Numpy.where() takes three arguments. The first argument is the condition you’re testing for. The second is what is returned if the test condition returns True, and the third is returned if the test condition is False.

# Start the timer
start_time = time.time()

# Update 'quantity' using numpy.where
df['quantity'] = np.where(df['product_names'].values == 'Smartphone', df['quantity'].values * 5, df['quantity'].values * 2)

# Recalculate 'total' based on the new 'quantity'
df['total'] = df['quantity'] * df['price']

# Display the updated DataFrame
print(df.head())

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken : {elapsed_time} seconds")

#
# Output
#

   order_id  order_date  customer_id   customer_name  product_id  
0         0  2022-08-01          245    Customer_884         201   
1         1  2022-02-19          701   Customer_1672         205   
2         2  2017-01-01          184  Customer_21720         208   
3         3  2013-03-09          275  Customer_23770         200   
4         4  2022-04-23          960  Customer_23790         210   

  product_names   categories  quantity  price    total  
0    Smartphone  Electronics        15  90.02  1350.30  
1       Printer  Electronics        12  12.74   152.88  
2      Notebook   Stationery        16  48.35   773.60  
3        Laptop  Electronics         6  74.85   449.10  
4       Cabinet       Office        12  53.77   645.24  

Time taken : 0.026806116104125977 seconds

Using NumPy's where() in this case was more than 5x faster than the regular vectorised approach and almost 3,000x faster than using apply().

Example 3 – Vectorise multiple if/then/else conditions

Nested if/then/else conditions can be handled using nested Numpy.where conditions. But it gets harder to read and maintain such code when there are many such cases. If you’ve ever used nested IF statements in Excel, you’ll know what I mean.
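
Before moving on, here's a minimal sketch (illustrative only, not one of the timed runs above) of what that nesting looks like if we extend the earlier Smartphone rule with just one more product using nested where() calls. It works, but every additional branch buries the logic one level deeper.

# Hypothetical nested numpy.where: Smartphones x5, Printers x2.5, everything else x2.
# Each extra product rule would need yet another nested where() call.
df['quantity'] = np.where(
    df['product_names'] == 'Smartphone', df['quantity'] * 5,
    np.where(df['product_names'] == 'Printer', df['quantity'] * 2.5, df['quantity'] * 2)
)
df['total'] = df['quantity'] * df['price']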

In this instance, we can use another pretty cool NumPy method not many people have heard of, which is called select().

Like where(), select() also takes 3 arguments. The first is a Python list of conditions to test for. The second argument is a Python list of choices that tells NumPy what to return if the equivalent condition is met. The third argument is a default statement of what to return if none of the conditions are met.

The order of these is important. The first item in the choices list goes with the first item in the conditions list, and so on.

In a sense, it’s a bit like a case statement you would use if programming in Python, C, Java, etc …

Take our previous example to an extreme. Say we want to multiply our initial quantity by a different amount for each different type of product.

import pandas as pd
import numpy as np

# Start the timer
start_time = time.time()

# Define conditions
conditions = [
    df['product_names'] == 'Smartphone',
    df['product_names'] == 'Printer',
    df['product_names'] == 'Notebook',
    df['product_names'] == 'Laptop',
    df['product_names'] == 'Cabinet'
]

# Define choices
choices = [
    df['quantity'] * 1.5,
    df['quantity'] * 2.5,
    df['quantity'] * 3.5,
    df['quantity'] * 4.5,
    df['quantity'] * 5.5
]

# Default value if none of the conditions are met
default_choice = 0

# Update 'quantity' using numpy.select
df['quantity'] = np.select(conditions, choices, default=default_choice)

# Recalculate 'total' based on the new 'quantity'
df['total'] = df['quantity'] * df['price']

# Display the updated DataFrame
print(df.head(10))

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time taken : {elapsed_time} seconds")

#
# Output
#

 order_id  order_date  customer_id   customer_name  product_id  
0         0  2022-08-01          245    Customer_884         201   
1         1  2022-02-19          701   Customer_1672         205   
2         2  2017-01-01          184  Customer_21720         208   
3         3  2013-03-09          275  Customer_23770         200   
4         4  2022-04-23          960  Customer_23790         210   
5         5  2019-07-10          197  Customer_25587         202   
6         6  2014-11-12          510   Customer_6912         204   
7         7  2016-07-12          150  Customer_17761         200   
8         8  2016-11-12          997  Customer_23801         209   
9         9  2017-01-23          151  Customer_30325         207   

  product_names   categories  quantity  price     total  
0    Smartphone  Electronics       4.5  90.02   405.090  
1       Printer  Electronics      15.0  12.74   191.100  
2      Notebook   Stationery      28.0  48.35  1353.800  
3        Laptop  Electronics      13.5  74.85  1010.475  
4       Cabinet       Office      33.0  53.77  1774.410  
5          Desk       Office       0.0  47.17     0.000  
6       Monitor  Electronics       0.0  22.50     0.000  
7        Laptop  Electronics      40.5  49.33  1997.865  
8  Coffee Maker  Electronics       0.0  47.22     0.000  
9           Pen   Stationery       0.0   3.50     0.000  

Time taken : 0.26836657524108887 seconds

Example 4 – Vectorise nested multiple if/then/else conditions

My final example is similar to the previous one, but now the quantity for Smartphones only gets multiplied by 1.5 if the existing quantity for Smartphones is > 5. Otherwise, the quantity gets multiplied by 2.

The other conditions are unchanged. In other words, we have a nested if/then/else scenario. If this logic was in a function in a programming language like Python, it would look similar to this,

def adjust_quantity(row):
    if row['product_names'] == 'Smartphone':
        if row['quantity'] > 5:
            row['quantity'] *= 1.5
        else:
            row['quantity'] *= 2
    elif row['product_names'] == 'Printer':
        row['quantity'] *= 2.5
    elif row['product_names'] == 'Notebook':
        row['quantity'] *= 3.5
    elif row['product_names'] == 'Laptop':
        row['quantity'] *= 4.5
    elif row['product_names'] == 'Cabinet':
        row['quantity'] *= 5.5
    else:
        row['quantity'] = 0  # Default case if none of the conditions are met
    return row

Can we vectorise this? Yes, we can by modifying the same method as before. This time, we add some extra boolean logical tests and the extra actions required to the choices list.

I won’t repeat the whole code, just the changes.

# Define conditions
conditions = [
  (df['product_names'] == 'Smartphone') & (df['quantity'] > 5),
  (df['product_names'] == 'Smartphone') & (df['quantity'] <= 5),
  df['product_names'] == 'Printer',
  df['product_names'] == 'Notebook',
  df['product_names'] == 'Laptop',
  df['product_names'] == 'Cabinet'
]

# Define choices
choices = [
  df['quantity'] * 1.5,
  df['quantity'] * 2,    
  df['quantity'] * 2.5,
  df['quantity'] * 3.5,
  df['quantity'] * 4.5,
  df['quantity'] * 5.5
]

Running our new code, we get this output,

   order_id  order_date  customer_id   customer_name  product_id  
0         0  2022-08-01          245    Customer_884         201   
1         1  2022-02-19          701   Customer_1672         205   
2         2  2017-01-01          184  Customer_21720         208   
3         3  2013-03-09          275  Customer_23770         200   
4         4  2022-04-23          960  Customer_23790         210   
5         5  2019-07-10          197  Customer_25587         202   
6         6  2014-11-12          510   Customer_6912         204   
7         7  2016-07-12          150  Customer_17761         200   
8         8  2016-11-12          997  Customer_23801         209   
9         9  2017-01-23          151  Customer_30325         207   

  product_names   categories  quantity  price     total  
0    Smartphone  Electronics       6.0  90.02   540.120  
1       Printer  Electronics      15.0  12.74   191.100  
2      Notebook   Stationery      28.0  48.35  1353.800  
3        Laptop  Electronics      13.5  74.85  1010.475  
4       Cabinet       Office      33.0  53.77  1774.410  
5          Desk       Office       0.0  47.17     0.000  
6       Monitor  Electronics       0.0  22.50     0.000  
7        Laptop  Electronics      40.5  49.33  1997.865  
8  Coffee Maker  Electronics       0.0  47.22     0.000  
9           Pen   Stationery       0.0   3.50     0.000  
Time taken : 0.33173537254333496 seconds

The run time is slightly slower than the previous run as more work is being done.

Summary

Hopefully, you’ll take away two things from this article. The first is that vectorisation is an essential ingredient for performant Pandas code when dealing with medium to large data sets. The second is that even if the operations you need to apply to dataframe records are complex, they can still often be vectorised.

I explained why vectorisation in Pandas is essential for performant code and showed several examples of how it can be applied, even in complex data processing scenarios.

_That’s all from me for now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content._

If you liked this content, Medium thinks you’ll find these articles interesting, too.

Structured output in the OpenAI API

PySpark Explained: Delta Tables

Here’s the link to the story I referred to at the beginning about avoiding the use of the apply() method in Pandas when dealing with large dataframes.

Thinking of using Pandas apply() on a big dataframe: Stop! – read this instead

The post Speed up Pandas code with Numpy appeared first on Towards Data Science.

Structured LLM Output Using Ollama https://towardsdatascience.com/structured-llm-output-using-ollama-73422889c7ad/ Tue, 17 Dec 2024 00:35:17 +0000 https://towardsdatascience.com/structured-llm-output-using-ollama-73422889c7ad/ Control your model responses effectively

The post Structured LLM Output Using Ollama appeared first on Towards Data Science.

With version 0.5, Ollama released a significant enhancement to its LLM API. By introducing structured outputs, Ollama now makes it possible to constrain a model's output to a specific format defined by a JSON schema. In Python, that schema is typically defined with Pydantic, as you'll see in the examples below.

Image by Author (Dalle-3)

Structured output solves a nagging problem many developers face when a system or process takes the output from an LLM for further processing. It's important for that system to "know" what to expect as its input so it can process it accurately, with repeatable results each time.

Likewise, you want to display model output in the same format each time you present it to a user, to avoid confusion and errors.

Until now, ensuring consistent output formats from most models has been a pain, but the new functionality from Ollama makes doing so quite easy, as I hope to show in my example code snippets.

Before that, though, you need to install the latest version of Ollama. This isn’t a tutorial on Ollama or how to run it. If you want that information, click my article below, where I go through all that good stuff.

Introduction to Ollama – Part 1

Suffice it to say that Ollama runs on Windows, Linux, and macOS, and you can install the latest version on Windows or MacOS by navigating to https://ollama.com/ and clicking on the big black download button you’ll see onscreen. I’ll be using a Linux system, and for this, you can install it by running this command,

$ curl -fsSL https://ollama.com/install.sh | sh

When the download has finished, run the installer. Next, we need to set up our development environment.

Setting up our dev environment

Before coding, I always create a separate Python development environment where I can install any needed software. Now, anything I do in this environment is siloed and will not impact my other projects.

I use Miniconda for this, but you can use whatever method you know and that suits you best.

If you want to go down the Miniconda route and don’t already have it, you must install Miniconda first. Get it using this link,

Miniconda – Anaconda documentation

1/ Create our new dev environment and install the required libraries

(base) $ conda create -n ollama_test python=3.12 -y
(base) $ conda activate ollama_test
(ollama_test) $ pip install ollama --upgrade
(ollama_test) $ pip install pydantic bs4
# Check the installed version is >= 0.5
(ollama_test) $ ollama --version
ollama version is 0.5.1
(ollama_test) $

2/ Decide what model to use with Ollama

Ollama has access to hundreds of open-source models. Choose which one(s) you want to use and pull them from Ollama. Meta recently released their latest Llama model (version 3.3), so I will use it. Also, as I'll be trying out an image-based task, I'll use Meta's Llama 3.2 vision model.

(ollama_test) $ ollama pull llama3.2-vision
(ollama_test) $ ollama pull llama3.3

I normally code my examples in a Jupyter Notebook. However, there is currently an issue when trying to run the latest versions of Jupyter with Ollama due to an incompatibility with a third-party library.

Jupyter expects a certain version of this library to be present, and Ollama expects a different version of it to be present.

So, this time, I’m simply saving my code in a Python file and running it with Python on the command line.

Example code 1 – Image interpretation

For this example, I’m asking the model to identify the different animal types in a PNG image. Here is that image.

Image collage by Author (Individual animal images from pexels.com)

Here is the code. It’s heavily commented and short, so I won’t go into the details of what it’s doing.

from ollama import chat
from pydantic import BaseModel

# Define a Pydantic model for representing a single animal with its type.
class Animal(BaseModel):
    animal: str

# Define a Pydantic model for representing a list of animals.
# This model contains a list of Animal objects.
class AnimalList(BaseModel):
    animals: list[Animal]

# Function to analyze an image and identify all animals present in it.
# Uses the Ollama `chat` function to interact with a vision-based model (`llama3.2-vision`).
# Returns the results as an AnimalList object.
def analyze_animals_in_image(image_path: str) -> AnimalList:
    # Call the `chat` function with the specified model, format, and parameters.
    response = chat(
        model='llama3.2-vision',
        format=AnimalList.model_json_schema(),
        messages=[
            {
                'role': 'user',
                'content': '''Analyze this image and identify all animals present. For each animal, provide:
                - The type of animal
                Return information for ALL animal types visible in the image.''',
                'images': [image_path],
            },
        ],
        options={'temperature': 0}  # Ensure deterministic output by setting temperature to 0
    )
    # Validate and parse the response JSON into an AnimalList object.
    animals_data = AnimalList.model_validate_json(response.message.content)
    return animals_data

# Main block to execute the script.
if __name__ == "__main__":
    # Path to the image to be analyzed.
    image_path = "D:/photos/2024/animals.png"

    # Print an initial message before starting the analysis.
    print("nAnalyzing image for animals...")

    # Call the function to analyze the image and get the results.
    animals_result = analyze_animals_in_image(image_path)

    # Print the analysis results.
    print("Animal Analysis Results:")
    print(f"Found {len(animals_result.animals)} animals in the image:")

    # Loop through the list of animals and print details for each one.
    for i, animal in enumerate(animals_result.animals, 1):
        print(f"Animal #{i}:")
        print(animal.model_dump_json())

This produced the following output.

Analyzing image for animals...
Animal Analysis Results:
Found 5 animals in the image:
Animal #1:
{"animal":"Walrus"}
Animal #2:
{"animal":"Elephant Seal"}
Animal #3:
{"animal":"Zebra"}
Animal #4:
{"animal":"Elephants"}
Animal #5:
{"animal":"Kittens"}

That’s not too bad at all. The model may have gotten confused with the top left image. I’m unsure if it’s of a Walrus or an elephant seal. The former, I think.

Example code 2— Text summarisation

This is useful if you have a bunch of different texts you want to summarise but want the summaries to have the same structure. In this example, we’ll process the Wikipedia entries for some famous scientists and retrieve certain key facts about them in a highly organized way.

In our summary, we want to output the following structure for each scientist,

  • The name of the Scientist
  • When and where they were born
  • Their main claim to fame
  • The year they won the Nobel Prize
  • When and where they died

Here is the code.

from pydantic import BaseModel
import requests
from bs4 import BeautifulSoup
from ollama import chat
from typing import List
import json  # For parsing JSON content from the response

# List of Wikipedia URLs
urls = [
    "https://en.wikipedia.org/wiki/Albert_Einstein",
    "https://en.wikipedia.org/wiki/Richard_Feynman",
    "https://en.wikipedia.org/wiki/James_Clerk_Maxwell",
    "https://en.wikipedia.org/wiki/Alan_Guth"
]

# Scientist names extracted from URLs for validation
specified_scientists = ["Albert Einstein", "Richard Feynman", "James Clerk Maxwell", "Alan Guth"]

# Function to scrape Wikipedia content
def get_article_content(url):
    try:
        print(f"Scraping URL: {url}")  # Debug print
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        article = soup.find("div", class_="mw-body-content")
        if article:
            content = "n".join(p.text for p in article.find_all("p"))
            print(f"Successfully scraped content from: {url}")  # Debug print
            return content
        else:
            print(f"No content found in: {url}")  # Debug print
            return ""
    except requests.exceptions.RequestException as e:
        print(f"Error scraping {url}: {e}")
        return ""

# Fetch content from each URL
print("Fetching content from all URLs...")  # Debug print
contents = [get_article_content(url) for url in urls]
print("Finished fetching content from all URLs.")  # Debug print

# Prompt for the summarization task
summarization_prompt = '''
    You will be provided with content from an article about a famous scientist.
    Your goal will be to summarize the article following the schema provided.
    Focus only on the specified scientist in the article.
    Here is a description of the parameters:
    - name: The name of the Scientist
    - born: When and where the scientist was born
    - fame: A summary of what their main claim to fame is
    - prize: The year they won the Nobel Prize
    - death: When and where they died
'''

# Pydantic model classes
class ArticleSummary(BaseModel):
    name: str
    born: str
    fame: str
    prize: int
    death: str

class ArticleSummaryList(BaseModel):
    articles: List[ArticleSummary]

# Function to summarize an article
def get_article_summary(text: str):
    try:
        print("Sending content to chat model for summarization...")  # Debug print
        completion = chat(
            model='llama3.3',
            messages=[
                {"role": "system", "content": summarization_prompt},
                {"role": "user", "content": text}
            ],
            format=ArticleSummaryList.model_json_schema(),
        )
        print("Chat model returned a response.")  # Debug print

        # Parse and validate the JSON response
        articles = ArticleSummaryList.model_validate_json(completion.message.content)
        print("Successfully validated and parsed articles.")  # Debug print
        return articles
    except Exception as e:
        print(f"Error during summarization: {e}")
        return None

# Function to format and filter summaries
def format_summary(summary: ArticleSummaryList):
    formatted = []
    for article in summary.articles:  # Accessing the 'articles' attribute directly
        # Filter out scientists not in the specified list
        if article.name in specified_scientists:
            formatted.append(
                f"The name of the Scientist: {article.name}n"
                f"When and where they were born: {article.born}n"
                f"Their main claim to fame: {article.fame}n"
                f"The year they won the Nobel Prize: {article.prize}n"
                f"When and where they died: {article.death}n"
            )
    print("Finished formatting summary.")  # Debug print
    return "n".join(formatted)

# Main function to process all articles
def main():
    summaries = []
    for i, content in enumerate(contents):
        print(f"Processing content {i+1}/{len(contents)}...")  # Debug print
        if content.strip():  # Skip empty articles
            summary = get_article_summary(content)
            if summary:
                formatted_summary = format_summary(summary)
                if formatted_summary:  # Only add if not empty after filtering
                    summaries.append(formatted_summary)

    # Print all formatted summaries
    print("Final Summaries:")
    print("nn".join(summaries))

if __name__ == '__main__':
     main()

Here is the final output. It took around 5 minutes to fully run, and my system is quite high-spec, so be warned. Also, the quality of the response is highly dependent on the quality of the LLM you use. I tried it with Llama3.2, and the output was significantly worse than when using the 3.3 version.

(ollama_test) C:\Users\thoma\ollama-test>python tomtest.py
Fetching content from all URLs...
Scraping URL: https://en.wikipedia.org/wiki/Albert_Einstein
Successfully scraped content from: https://en.wikipedia.org/wiki/Albert_Einstein
Scraping URL: https://en.wikipedia.org/wiki/Richard_Feynman
Successfully scraped content from: https://en.wikipedia.org/wiki/Richard_Feynman
Scraping URL: https://en.wikipedia.org/wiki/James_Clerk_Maxwell
Successfully scraped content from: https://en.wikipedia.org/wiki/James_Clerk_Maxwell
Scraping URL: https://en.wikipedia.org/wiki/Alan_Guth
Successfully scraped content from: https://en.wikipedia.org/wiki/Alan_Guth
Finished fetching content from all URLs.
Processing content 1/4...
Sending content to chat model for summarization...
Chat model returned a response.
Successfully validated and parsed articles.
Finished formatting summary.
Processing content 2/4...
Sending content to chat model for summarization...
Chat model returned a response.
Successfully validated and parsed articles.
Finished formatting summary.
Processing content 3/4...
Sending content to chat model for summarization...
Chat model returned a response.
Successfully validated and parsed articles.
Finished formatting summary.
Processing content 4/4...
Sending content to chat model for summarization...
Chat model returned a response.
Successfully validated and parsed articles.
Finished formatting summary.
Final Summaries:
The name of the Scientist: Albert Einstein
When and where they were born: 14 March 1879
Their main claim to fame: Einstein became one of the most famous scientific celebrities after the confirmation of his general theory of relativity in 1919.
The year they won the Nobel Prize: 1921
When and where they died: 18 April 1955

The name of the Scientist: Richard Feynman
When and where they were born: May 11, 1918
Their main claim to fame: Physicist and mathematician
The year they won the Nobel Prize: 1965
When and where they died: February 15, 1988

The name of the Scientist: James Clerk Maxwell
When and where they were born: 13 June 1831
Their main claim to fame: Scottish physicist and mathematician
The year they won the Nobel Prize: 0
When and where they died: 5 November 1879

The name of the Scientist: Alan Guth
When and where they were born:
Their main claim to fame: theoretical physics
The year they won the Nobel Prize: 2014
When and where they died:

Note that Alan Guth is still alive; hence, the when/where they died part for him is blank. James Clerk Maxwell did not receive a Nobel Prize, as the prizes didn't exist during his lifetime. Also, note that the model could not extract the place of death for any of the scientists, even though that information was contained in the Wikipedia extracts.

Summary

In this article, I’ve provided code and demonstrated two key capabilities of structured outputs using Ollama. The first example showed the use of structured output in image processing, while the second focused on text summarization.

Specifying structured output from LLMs is a big step for Ollama and has many applications. By organizing information in a predictable JSON format, structured outputs improve clarity and make LLMs’ responses more consistent, reducing ambiguities. This structured approach enables seamless integration into downstream applications like APIs, databases, or visualization tools without extensive preprocessing while simplifying data parsing and automation.

Validation against predefined rules becomes easier, minimizing errors and ensuring compliance with expected standards. Ultimately, structured output transforms LLMs into highly practical tools for diverse real-world use cases.
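
As a quick illustration of that validation point, here's a minimal sketch (my own addition, assuming the ArticleSummaryList model from the earlier example is in scope) showing how Pydantic rejects a response that doesn't match the declared schema instead of letting it slip through.

from pydantic import ValidationError

# A well-formed response parses cleanly ...
good = '{"articles": [{"name": "Albert Einstein", "born": "14 March 1879", "fame": "Relativity", "prize": 1921, "death": "18 April 1955"}]}'
print(ArticleSummaryList.model_validate_json(good).articles[0].name)  # Albert Einstein

# ... while a malformed one (missing fields, wrong type for prize) raises a ValidationError.
bad = '{"articles": [{"name": "Albert Einstein", "prize": "nineteen twenty-one"}]}'
try:
    ArticleSummaryList.model_validate_json(bad)
except ValidationError as e:
    print(f"Rejected malformed response with {e.error_count()} validation errors")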

_That’s all from me for now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content._

I know times are tough and wallets constrained, but if you got real value from this article, please consider buying me a wee dram.

If you liked this content, I think you’ll also find these articles interesting.

Introducing the New Anthropic Token Counting API

Polars … but even faster!

The post Structured LLM Output Using Ollama appeared first on Towards Data Science.

OpenAI Prompt Cache Monitoring https://towardsdatascience.com/openai-prompt-cache-monitoring-7cb8df21d0d0/ Tue, 10 Dec 2024 15:01:48 +0000 https://towardsdatascience.com/openai-prompt-cache-monitoring-7cb8df21d0d0/ A worked example using Python and the chat completion API

The post OpenAI Prompt Cache Monitoring appeared first on Towards Data Science.

As part of their recent DEV Day presentation, OpenAI announced that Prompt Caching was now available for various models. At the time of writing, those models were:-

GPT-4o, GPT-4o mini, o1-preview and o1-mini, as well as fine-tuned versions of those models.

This news shouldn’t be underestimated, as it will allow developers to save on costs and reduce application runtime latency.

API calls to supported models will automatically benefit from Prompt Caching on prompts longer than 1,024 tokens. The API caches the longest prefix of a prompt that has been previously computed, starting at 1,024 tokens and increasing in 128-token increments. If you reuse prompts with common prefixes, OpenAI will automatically apply the Prompt Caching discount without requiring you to change your API integration.

As an OpenAI API developer, the only thing you may have to worry about is how to monitor your Prompt Caching use, i.e. check that it’s being applied.

In this article, I’ll show you how to do that using Python, a Jupyter Notebook and a chat completion example.

Install WSL2 Ubuntu

I’m on Windows, but I’ll run my example code under WSL2 Ubuntu. Check out the link below for a comprehensive guide on installing WSL2 for Windows.

Installing WSL2 Ubuntu for Windows

Setting up our development environment

Before developing like this, I always create a separate Python development environment where I can install any software needed and experiment with coding. Now, anything I do in this environment will be siloed and won’t impact my other projects.

I use Miniconda for this, but there are many other ways to do it, so use whatever method you know best.

If you want to go down the Miniconda route and don’t already have it, you must install Miniconda first. Get it using this link,

Miniconda – Anaconda documentation

To follow along with my example, you’ll need an OpenAI API key. Create an OpenAI account if you don’t already have one, then you can get a key from the OpenAI platform using the link below:

https://platform.openai.com/api-keys

1/ Create our new dev environment and install the required libraries

(base) $ conda create -n oai_test python=3.11 -y
(base) $ conda activate oai_test
(oai_test) $ pip install openai --upgrade
(oai_test) $ pip install jupyter 

2/ Start Jupyter

Now type in jupyter notebook into your command prompt. You should see a jupyter notebook open in your browser. If that doesn’t happen automatically, you’ll likely see a screenful of information after the jupyter notebook command. Near the bottom, there will be a URL that you should copy and paste into your browser to initiate the Jupyter Notebook.

Your URL will be different to mine, but it should look something like this:-

http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69

The code

Prompt caching is automatic so you don’t have to change your existing code base. But recall that it only kicks in when the combined system and user prompt are > 1024 tokens.

OpenAI recommends structuring your prompts so that any static information is at the beginning and dynamic content towards the end. This ties in nicely with the static data being in the system prompt and the dynamic data in the user prompt. You don’t have to do this, but it makes the most sense to do so.

So, let’s put all this together by showing a hypothetical example grounded in a real-use case study. In our hypothetical scenario, we’ll model a smart home system where you can remotely request actions to be taken in or around your home. For example, you might like your smart home system to turn on your lights, heating system, etc…. when you’re away from your house.

Our code consists of two tools (functions) that the LLM can use. One does the actual switching on/off of a control device, and the other can do so in response to a timed event.

After that, we have our system prompt, which clearly defines what the smart home system should be capable of and any rules/guidance it needs to perform its function.

Additionally, we have, in the first instance, a simple user prompt that requests the control system to turn on the house lights. We run this initial command and get a count of the total tokens in the prompts, the number of cached tokens and a few other data points.

After this initial run, we ask the control system to perform a different task, and once again, we get various token counts for that operation.

from openai import OpenAI
import os
import json
import time

api_key = "YOUR_API_KEY_GOES_HERE"
client = OpenAI( api_key=api_key)

# Define tools (functions)
tools = [
    {
        "type": "function",
        "function": {
            "name": "control_device",
            "description": "Control a smart home device, such as turning it on/off or changing settings.",
            "parameters": {
                "type": "object",
                "properties": {
                    "device_id": {
                        "type": "string",
                        "description": "The unique identifier of the device to control."
                    },
                    "action": {
                        "type": "string",
                        "description": "The action to perform (e.g., 'turn_on', 'turn_off', 'set_temperature')."
                    },
                    "value": {
                        "type": ["string", "number"],
                        "description": "Optional value for the action, such as temperature setting."
                    }
                },
                "required": ["device_id", "action"],
                "additionalProperties": False
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "set_schedule",
            "description": "Set a schedule for a smart home device to perform an action at a specified time.",
            "parameters": {
                "type": "object",
                "properties": {
                    "device_id": {
                        "type": "string",
                        "description": "The unique identifier of the device to schedule."
                    },
                    "action": {
                        "type": "string",
                        "description": "The action to perform (e.g., 'turn_on', 'turn_off')."
                    },
                    "schedule_time": {
                        "type": "string",
                        "description": "The time to perform the action, in ISO 8601 format or a natural language description."
                    }
                },
                "required": ["device_id", "action", "schedule_time"],
                "additionalProperties": False
            }
        }
    }
]

# System message with guidelines
# Expanded system message to exceed 1024 tokens
# to make sure Prompt Caching enabled
messages = [
    {
        "role": "system",
        "content": (
            "You are a smart home assistant that helps users control their smart home devices securely and efficiently. "
            "Your goals are to execute user commands, provide device statuses, and manage schedules while ensuring safety and privacy. "
            "Always confirm actions with the user before executing them, especially for critical devices like security systems or door locks. "
            "Maintain a friendly and professional tone, adapting to the user's level of technical expertise.nn"
            # Begin expansion
            "Important guidelines to follow:nn"
            "1. **User Privacy and Security**: Handle all personal and device information confidentially. "
            "Verify the user's identity if necessary before performing sensitive actions. Never share personal data with unauthorized parties. "
            "Ensure that all communications comply with data protection laws and regulations.nn"
            "2. **Confirmation Before Actions**: Always confirm the user's intent before executing actions that affect their devices. "
            "For example, if a user asks to unlock the front door, verify their identity and confirm the action to prevent unauthorized access.nn"
            "3. **Error Handling**: If an action cannot be completed, politely inform the user and suggest alternative solutions. "
            "Provide clear explanations for any issues, and guide the user through troubleshooting steps if appropriate.nn"
            "4. **Safety Measures**: Ensure that commands do not compromise safety. "
            "Avoid setting temperatures beyond safe limits, and alert the user if a requested action might be unsafe. "
            "For instance, if the user tries to turn off security cameras, remind them of potential security risks.nn"
            "5. **No Unauthorized Access**: Do not control devices without explicit user permission. "
            "Ensure that any scheduled tasks or automated routines are clearly communicated and approved by the user.nn"
            "6. **Clear Communication**: Use simple language and avoid technical jargon unless the user is familiar with it. "
            "Explain any technical terms if necessary, and ensure that instructions are easy to understand.nn"
            "7. **Compliance**: Adhere to all relevant laws, regulations, and company policies regarding smart home operations. "
            "Stay updated on changes to regulations that may affect how devices should be controlled or monitored.nn"
            "8. **Accurate Information**: Provide precise device statuses and avoid speculation. "
            "If unsure about a device's status, inform the user and suggest ways to verify or troubleshoot the issue.nn"
            "9. **Accessibility Considerations**: Be mindful of users with disabilities. "
            "Ensure that instructions and responses are accessible, and offer alternative interaction methods if needed.nn"
            "10. **Personalization**: Adapt to the user's preferences and prior interactions. "
            "Remember frequent commands and offer suggestions based on usage patterns, while respecting privacy settings.nn"
            "11. **Timeouts and Idle States**: If a session is idle for a prolonged period, securely end the session to protect user data. "
            "Notify the user when the session is about to expire and provide options to extend it if necessary.nn"
            "12. **Multi-User Environments**: Recognize when multiple users may be interacting with the system. "
            "Manage profiles separately to ensure personalized experiences and maintain privacy between users.nn"
            "13. **Energy Efficiency**: Promote energy-saving practices. "
            "If a user forgets to turn off devices, gently remind them or offer to automate energy-saving routines.nn"
            "14. **Emergency Protocols**: Be prepared to assist during emergencies. "
            "Provide quick access to emergency services if requested, and understand basic protocols for common emergencies.nn"
            "15. **Continuous Learning**: Stay updated with the latest device integrations and features. "
            "Inform users about new capabilities that may enhance their smart home experience.nn"
            "16. **Language and Cultural Sensitivity**: Be aware of cultural differences and language preferences. "
            "Support multiple languages if possible and be sensitive to cultural norms in communication.nn"
            "17. **Proactive Assistance**: Anticipate user needs by offering helpful suggestions. "
            "For example, if the weather forecast indicates rain, suggest closing windows or adjusting irrigation systems.nn"
            "18. **Logging and Monitoring**: Keep accurate logs of actions taken, while ensuring compliance with privacy policies. "
            "Use logs to help troubleshoot issues but never share log details with unauthorized parties.nn"
            "19. **Third-Party Integrations**: When interacting with third-party services, ensure secure connections and compliance with their terms of service. "
            "Inform users when third-party services are involved.nn"
            "20. **Disaster Recovery**: In case of system failures, have protocols in place to restore functionality quickly. "
            "Keep the user informed about outages and provide estimated resolution times.nn"
        )
    },
    {
        "role": "user",
        "content": "Hi, could you please turn on the living room lights?"
    }
]
# Function to run completion with the provided message history and tools
def completion_run(messages, tools):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        tools=tools,
        messages=messages,
        tool_choice="required"
    )
    usage_data = json.dumps(completion.to_dict(), indent=4)
    return usage_data

# Main function to handle the runs
def main(messages, tools):
    # Run 1: Initial query
    print("Run 1:")
    run1 = completion_run(messages, tools)
    print(run1)

    # Delay for 3 seconds
    time.sleep(3)

    # Append user_query2 to the message history
    user_query2 = {
        "role": "user",
        "content": "Actually, could you set the thermostat to 72 degrees at 6 PM every day?"
    }
    messages.append(user_query2)

    # Run 2: With appended query
    print("nRun 2:")
    run2 = completion_run(messages, tools)
    print(run2)

# Run the main function
if __name__ == "__main__":
    main(messages, tools)

And our output is:-

Run 1:
{
    "id": "chatcmpl-AFePFIyWQtNJ4txIGcLbXZaZleEZv",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": null,
                "refusal": null,
                "role": "assistant",
                "tool_calls": [
                    {
                        "id": "call_m4V9sn2PY7X3EapH7ph1K8t9",
                        "function": {
                            "arguments": "{"device_id":"living_room_lights","action":"turn_on"}",
                            "name": "control_device"
                        },
                        "type": "function"
                    }
                ]
            }
        }
    ],
    "created": 1728293605,
    "model": "gpt-4o-mini-2024-07-18",
    "object": "chat.completion",
    "system_fingerprint": "fp_f85bea6784",
    "usage": {
        "completion_tokens": 21,
        "prompt_tokens": 1070,
        "total_tokens": 1091,
        "completion_tokens_details": {
            "reasoning_tokens": 0
        },
        "prompt_tokens_details": {
            "cached_tokens": 0
        }
    }
}

Run 2:
{
    "id": "chatcmpl-AFePJwIczKSjJnvwed7wpyRI7gLWU",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": null,
                "refusal": null,
                "role": "assistant",
                "tool_calls": [
                    {
                        "id": "call_PjCse4kD4QJxYcFuZ7KlqJAc",
                        "function": {
                            "arguments": "{"device_id": "living_room_lights", "action": "turn_on"}",
                            "name": "control_device"
                        },
                        "type": "function"
                    },
                    {
                        "id": "call_GOr7qfGUPD0ZV9gAgUktyKj6",
                        "function": {
                            "arguments": "{"device_id": "thermostat", "action": "set_temperature", "schedule_time": "2023-10-23T18:00:00"}",
                            "name": "set_schedule"
                        },
                        "type": "function"
                    }
                ]
            }
        }
    ],
    "created": 1728293609,
    "model": "gpt-4o-mini-2024-07-18",
    "object": "chat.completion",
    "system_fingerprint": "fp_f85bea6784",
    "usage": {
        "completion_tokens": 75,
        "prompt_tokens": 1092,
        "total_tokens": 1167,
        "completion_tokens_details": {
            "reasoning_tokens": 0
        },
        "prompt_tokens_details": {
            "cached_tokens": 1024
        }
    }
}

We can see that in Run 1, the cached_tokens count is zero, which is to be expected. However, in Run 2, the `cached_tokens` count is 1024. This indicates that caching took place.

Summary

Prompt caching is a very useful new addition to OpenAI’s capabilities. It can cut application run times by reducing latency, and it lowers your token costs. So it’s important to monitor if and when it’s being used, and to investigate why it isn’t when you think it should be.

So, using code like that shown above, you can effectively monitor your system and intervene when prompt caching isn’t being applied as expected. It would be fairly straightforward to send an automated message to yourself or to a team flagging a potential caching issue.
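As a rough sketch of what that monitoring might look like (reusing the completion_run() output format shown above; the helper name and the alerting hook are purely illustrative), you could parse each run’s usage block and flag any run where you expected a cache hit but cached_tokens came back as zero. Note that OpenAI only caches prompts of 1,024 tokens or more, hence the threshold check.

import json

# Minimal sketch: inspect the usage block returned by completion_run() above
# and flag runs where prompt caching was expected but did not occur.
def check_cached_tokens(usage_json: str, run_label: str, expect_cache: bool) -> None:
    usage = json.loads(usage_json)["usage"]
    prompt_tokens = usage.get("prompt_tokens", 0)
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    if expect_cache and cached == 0 and prompt_tokens >= 1024:
        # Hook your own alerting in here (email, Slack webhook, etc.)
        print(f"{run_label}: expected a cache hit but cached_tokens is 0 - investigate")
    else:
        print(f"{run_label}: {cached} of {prompt_tokens} prompt tokens were served from cache")

# Example usage with the two runs from main():
# check_cached_tokens(run1, "Run 1", expect_cache=False)
# check_cached_tokens(run2, "Run 2", expect_cache=True)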

_That’s all from me for now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content._

I know times are tough and wallets constrained, but if you got real value from this article, please consider buying me a wee dram.

If you liked this content, Medium thinks you’ll find these articles interesting, too.

Structured output in the OpenAI API

Introducing the more-itertools Python library

The post OpenAI Prompt Cache Monitoring appeared first on Towards Data Science.

Introducing the New Anthropic PDF Processing API https://towardsdatascience.com/introducing-the-new-anthropic-pdf-processing-api-0010657f595f/ Wed, 27 Nov 2024 13:02:15 +0000 https://towardsdatascience.com/introducing-the-new-anthropic-pdf-processing-api-0010657f595f/ Anthropic Claude 3.5 now understands PDF input

The post Introducing the New Anthropic PDF Processing API appeared first on Towards Data Science.

In the last few weeks, Anthropic has released some exciting beta features that have largely gone under the radar. One of these was its new token-counting API. I have already written an article on this, which you can read by clicking the link below.

Introducing the New Anthropic Token Counting API

The other exciting feature, and the subject of this article, is that Claude 3.5 can now process PDFs and understand both text and visual content within PDF documents.

PDF Capabilities

Claude works with any standard PDF file, allowing you to inquire about text, images, charts, and tables within your documents. Here are some common use cases:

  • Analyzing financial reports, interpreting charts and tables
  • Extracting key information from legal documents
  • Assisting with document translations
  • Converting document content into structured formats

Limitations

Because this is still a Beta release, there are a few limitations to its use. Right now, it can handle a maximum file size of 32MB, and the number of pages in any one document is limited to 100.
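If you want to guard against those limits programmatically, here is a quick, illustrative pre-flight check. It uses the third-party pypdf package (not part of this article’s setup) purely to count pages, so treat it as a sketch rather than anything official.

import os
from pypdf import PdfReader  # third-party helper, used here only to count pages

# Illustrative pre-flight check against the current beta limits
# (32MB file size, 100 pages) before sending a PDF to Claude.
def pdf_within_limits(path: str, max_mb: int = 32, max_pages: int = 100) -> bool:
    size_mb = os.path.getsize(path) / (1024 * 1024)
    num_pages = len(PdfReader(path).pages)
    print(f"{path}: {size_mb:.1f} MB, {num_pages} pages")
    return size_mb <= max_mb and num_pages <= max_pages

# Example: pdf_within_limits("/mnt/d/tesla/tesla_q10_sept_23.pdf")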

Supported Platforms and Models

PDF support is currently available on the latest Claude 3.5 Sonnet model (claude-3-5-sonnet-20241022) through direct API access.

Calculate Expected Token Usage

The token count for a PDF file is determined by the amount of text extracted and the total number of pages. Each page is converted to an image, and token costs are calculated accordingly. Depending on content density, each page typically requires between 1,500 and 3,000 tokens.

Standard input token pricing applies, with no extra fees for PDF processing.

You can also use token counting (see story link above) to calculate the number of tokens for a message that includes PDFs.
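Based on that guidance, a back-of-the-envelope estimate is easy to sketch. The per-page figures below are just the approximations quoted above, not values returned by the API.

# Rough estimate of PDF input tokens, using the guideline of
# roughly 1,500-3,000 tokens per page (approximate figures, not an API call).
def estimate_pdf_token_range(num_pages: int,
                             low_per_page: int = 1500,
                             high_per_page: int = 3000) -> tuple[int, int]:
    return num_pages * low_per_page, num_pages * high_per_page

low, high = estimate_pdf_token_range(51)  # e.g. a 51-page filing
print(f"Expected input tokens: roughly {low:,} to {high:,}")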

Okay, let’s get started. First, I’m developing using Windows WSL2 Ubuntu. If you’re a Windows user, I have a comprehensive guide on installing WSL2, which you can find here.

Setting up a dev environment

Before we start coding, let’s set up a separate development environment. That way, all our projects will be siloed and won’t interfere with each other. I use conda for this, but use whichever tool you’re familiar with.

(base) $ conda create -n claude_pdf python=3.10 -y
(base) $ conda activate claude_pdf
# Install required Libraries
(claude_pdf) pip install anthropic jupyter

Getting an Anthropic API key

You’ll need an Anthropic API key if you don’t already have one. You can get that from the Anthropic Console. Register or Sign-In, then you’ll see a screen like this,

Image from Anthropic Website

Click the Get API Keys button and follow the instructions from there. Take note of your key and set the environment variable ANTHROPIC_API_KEY to it.

The code

For my input PDF, I’ll use a copy of Tesla’s Q10 September 2023 quarterly submission to the Securities and Exchange Commission that I downloaded to my local PC.

This document is 51 pages of mixed text and tabular data. You can see what it looks like online by clicking here.

Example 1 – Asking a basic question

"What is tesla’s phone number?"

import anthropic
import base64

# First fetch the file
with open("/mnt/d/tesla/tesla_q10_sept_23.pdf", "rb") as pdf_file:
    pdf_data = base64.standard_b64encode(pdf_file.read()).decode("utf-8")

# Finally send the API request
client = anthropic.Anthropic()

message = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    betas=["pdfs-2024-09-25"],
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data
                    }
                },
                {
                    "type": "text",
                    "text": "What is tesla's phone number?"
                }
            ]
        }
    ],
)

print(message.content)

It came back with this answer.

[BetaTextBlock(text="According to the document, Tesla's phone number 
is (512) 516-8177. This is listed on the first page of the Form 10-Q as 
their registrant's telephone number.", type='text')]

Not too shabby. It’s an impressive start.

Example 2 – Let’s try a harder question.

What were the energy generation and storage sales for the Three Months Ended September 30 in 2022 and 2023 ?

If we look at the PDF, we can see that the answer to this is in a table on Page 10. The figures are 966 and 1416 million dollars, respectively.

Image by Author
message = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    betas=["pdfs-2024-09-25"],
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data
                    }
                },
                {
                    "type": "text",
                    "text": "What were the energy generation and storage sales for the Three Months Ended September 30 in 2022 and 2023 ?"
                }
            ]
        }
    ],
)

print(message.content)

And the response from Claude.

[BetaTextBlock(text="According to the financial statements, Tesla's 
energy generation and storage sales were:\n\n- Three months ended 
September 30, 2023: $1,416 million\n- 
Three months ended September 30, 2022: $966 million\n\n
This represents an increase of $450 million or approximately 
47% year-over-year for that segment's sales revenue.", type='text')]

That’s fantastic. That is a spot-on answer again.

Example 3 – using prompt caching

For repeated analysis of a PDF, Anthropic recommends the use of prompt caching to reduce your token usage and hence your costs. Prompt caching can be "switched on" by making the following small changes to the message API code,

1/ Change 

betas=["pdfs-2024-09-25"],

to

betas=["pdfs-2024-09-25", "prompt-caching-2024-07-31"],

2/ Add the following to the messages content section in the API call

...
 "cache_control": {"type": "ephemeral"}
...

Now, when you run your RAG code, all the document contents will be cached, and subsequent calls to interrogate it will use the cached version, resulting in far fewer tokens being used. According to the Anthropic documentation,

"The cache has a 5-minute lifetime, refreshed each time the cached content is used."

Let’s see another full example and include the prompt caching code.

We are asking an old favourite question of mine, which I’ve used in previous articles on implementing RAG on Tesla’s Q10 PDF.

"What are the Total liabilities and Total assets for 2022 and 2023"

To a human, the answer is easy. Just go to page 4 of the PDF, and you’ll see this table,

Image by Author

As you can see, the Total assets were $82,338 million (December 31, 2022) and $93,941 million (September 30, 2023), while the Total liabilities were $36,440 million and $39,446 million respectively. Let’s see if Claude can answer this.

message = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    betas=["pdfs-2024-09-25", "prompt-caching-2024-07-31"],
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data
                    },
                    "cache_control": {"type": "ephemeral"}
                },
                {
                    "type": "text",
                    "text": ""What are the Total liabilities and Total assets for 2022 and 2023"?"
                }
            ]
        }
    ],
)
print(message.content)

And the answer.

[BetaTextBlock(text='According to the consolidated balance sheets in the 
document:\n\nFor September 30, 2023:\n- Total liabilities: $39,446 million\n- 
Total assets: $93,941 million\n\nFor December 31, 2022:\n- Total liabilities: 
$36,440 million\n- Total assets: $82,338 million', type='text')]

Spot on again.
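If you want to confirm the cache is actually being hit, you can inspect the usage information on the returned message. The cache-related field names below (cache_creation_input_tokens and cache_read_input_tokens) reflect my understanding of the prompt-caching beta at the time of writing, so treat this as a sketch.

# Sketch: print cache-related usage counters from the previous response.
# getattr() is used defensively in case a field isn't present on the object.
usage = message.usage
print("Input tokens:", usage.input_tokens)
print("Cache write tokens:", getattr(usage, "cache_creation_input_tokens", None))
print("Cache read tokens:", getattr(usage, "cache_read_input_tokens", None))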

Example 4— Interpreting diagrams/images

For my final example, I created a PDF, then pasted an image of an AWS architecture diagram into it, and saved it. Here is what it looks like.

Image from AWS website

Let’s see if the model can interpret what it is.

import anthropic
import base64

# First fetch the file
with open("/mnt/d/images/arch.pdf", "rb") as pdf_file:
    pdf_data = base64.standard_b64encode(pdf_file.read()).decode("utf-8")

# Send the API request
client = anthropic.Anthropic()

message = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    betas=["pdfs-2024-09-25"],
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data
                    }
                },
                {
                    "type": "text",
                    "text": "What does the diagram depict"
                }
            ]
        }
    ],
)

for block in message.content:
    print(block.text)  # Print only the text attribute

And the response:

The diagram (Figure 1) depicts an AWS Cloud architecture workflow for data 
processing and inventory management. It shows:

1. An ingestion phase starting with Amazon Redshift
2. Post-processing steps using Amazon RDS for archival metadata
3. A series of AWS services working together including:
   - Amazon EventBridge
   - AWS Lambda functions
   - Amazon S3
   - AWS Step Functions
   - S3 Inventory
   - S3 Glacier

The workflow appears to handle data movement, processing, and storage with 
various status updates and notifications. There's a daily inventory process 
and both temporary and long-term storage solutions implemented through Amazon 
S3 and S3 Glacier. The diagram shows how data flows through these different 
AWS services and includes features for data deletion and archival management.

This seems to be a complete data pipeline architecture that handles ingestion, 
processing, storage, and lifecycle management of data within the AWS ecosystem.

Another impressive response.

Summary

All I can say is that PDF processing using LLMs has come a long way in a short space of time. Claude’s PDF understanding is very impressive.

Until recently, the last question I asked on the Tesla PDF about total liabilities and assets was almost impossible for AI and RAG models to answer correctly. I’ve tried several methods before, most recently by using Google’s Gemini Flash 1.5 model.

The only way I could get that model to answer correctly was by telling it which specific page of the PDF document to go to for the information.

Before that, I also tried using AWS Bedrock with a knowledge base and Claude V1.2 LLM. With that setup, I got close to the correct answer, but it was still not 100% right.

The only time I got the correct answer immediately was when I used LlamaParse.

The big difference between this version of Claude and a traditional RAG system like LlamaParse is its simplicity. There’s …

  • No chunking.
  • No vectorisation / Embedding
  • No vector DB storage
  • No similarity searching.
  • No fuss.

I’ve said it before, and I’ll repeat it here: I believe traditional RAG processing is dead in the water for many, not all, use cases. What do you think?

To find out more about PDF processing with Anthropic, check out their documentation using this link.

_Anyway, that’s all for me for now. Hopefully, you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content._

Times are tough and wallets constrained, but if you got real value from this article, please consider buying me a wee dram.

If you liked this content, I think you’ll also find these related articles interesting.

Develop, then deploy a WEBP to PNG image converter Taipy App to the web – Part 1

C Programming Using Claude’s New Computer Use Model

The post Introducing the New Anthropic PDF Processing API appeared first on Towards Data Science.

Boost Your Python Code with CUDA https://towardsdatascience.com/boost-your-python-code-with-cuda-8bbdd08fc51e/ Wed, 20 Nov 2024 18:06:21 +0000 https://towardsdatascience.com/boost-your-python-code-with-cuda-8bbdd08fc51e/ Target your GPU easily with Numba's CUDA JIT

The post Boost Your Python Code with CUDA appeared first on Towards Data Science.

I’ve written about the Python library Numba before. Check my article out using the link below,

Python on Steroids: The Numba Boost

The TL;DR of the above was that I showed how to realise significant speed up in your Python code using Numba. Numba is a high-performance Python library designed to optimize your code for speed. At its core, Numba is a Just-In-Time (JIT) compiler that translates a subset of Python and NumPy code into fast machine code. This process is automatic and dynamic, allowing Python developers to gain real performance improvements with minimal changes to their original Python code.

The regular Numba JIT compiler is all about optimising code run-time for your CPU, but if you are lucky enough to have access to a GPU, in this article, I’ll show you how you can use Numba again, this time with its Cuda JIT, to accelerate your Python code even further by targeting the GPU to run code on.

Pre-requisites

To use NVIDIA CUDA on your system, you will need the following:-

  • a CUDA-capable NVIDIA GPU
  • an up-to-date NVIDIA driver for your operating system
  • the NVIDIA CUDA Toolkit

For comprehensive instructions, you’re best to visit the official Installation guide at Nvidia. Click here for that.

It would also be useful to get acquainted with some terminology specific to the GPU world. For example,

  • host: a synonym for the CPU
  • device: a synonym for the GPU
  • host memory: your system’s main memory (RAM)
  • device memory: the onboard memory on your GPU card
  • kernel: a GPU function launched by the host and executed on the device
  • device function: a GPU function executed on the device which can only be called from the device (i.e. from a kernel or another device function)
  • Streaming Multiprocessors: These are the fundamental computational units within an NVIDIA GPU architecture. They are responsible for executing the instructions of threads in parallel, making GPUs highly effective at parallel processing tasks.
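
If you want to check what device Numba sees on your system, including how many Streaming Multiprocessors it has, a quick, illustrative query looks like this (it assumes a CUDA-capable GPU and driver are installed):

from numba import cuda

# Print a few properties of the GPU that Numba will target.
if cuda.is_available():
    device = cuda.get_current_device()
    print("Device name:", device.name)
    print("Compute capability:", device.compute_capability)
    print("Streaming Multiprocessors:", device.MULTIPROCESSOR_COUNT)
else:
    print("No CUDA-capable GPU detected")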

Understanding the Memory Hierarchy in GPUs

To fully understand how Numba CUDA programming works, it’s worthwhile learning the memory hierarchy and grid layout system as they apply to GPUs. Unlike CPUs, which have a single, unified memory space, GPUs have a hierarchical memory architecture that consists of:

  1. Registers: Small, fast on-chip memory that stores temporary results and variables.
  2. Shared Memory: A small, fast on-chip memory shared among threads within a block.
  3. Global Memory: A large, off-chip memory that stores data and program instructions.
  4. Texture Memory: A read-only memory that stores 2D arrays and is optimized for texture mapping.
  5. Constant Memory: A small, read-only memory that stores constants and is optimized for broadcasting.
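
To make the shared memory idea (item 2 above) a little more concrete, here is a small, illustrative kernel in which each block stages a tile of the input in shared memory before one thread sums it. It’s a sketch for demonstration only and assumes a CUDA-capable GPU.

from numba import cuda, float32
import numpy as np

TPB = 16  # threads per block (also the shared-memory tile size)

@cuda.jit
def block_sum(x, out):
    # One shared-memory tile per block, visible to all threads in that block
    tile = cuda.shared.array(TPB, dtype=float32)
    tx = cuda.threadIdx.x
    i = cuda.grid(1)
    tile[tx] = x[i] if i < x.size else 0.0
    cuda.syncthreads()  # wait until every thread has written its element
    if tx == 0:
        total = 0.0
        for k in range(TPB):
            total += tile[k]
        out[cuda.blockIdx.x] = total  # one partial sum per block

x = np.arange(64, dtype=np.float32)
out = np.zeros(4, dtype=np.float32)  # 4 blocks of 16 threads cover 64 elements
block_sum[4, TPB](x, out)
print(out, out.sum())                # partial sums and the overall total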

Understanding the Grid system in GPUs

Another very important idea to grasp is that of the Grid System. In GPU programming, the grid system is a fundamental concept that allows developers to organize and execute parallel computations on the GPU. The grid system consists of:

  1. The Grid. A 1D, 2D or 3D array of blocks.
  2. The Block. A group of threads that are executed together. Each block can contain a certain number of threads, and all threads within a block can cooperate using shared memory. Threads within a block are typically arranged in 1D, 2D, or 3D structures.
  3. The Thread. The smallest unit of execution. A thread is akin to a single instruction stream running on the GPU. Each thread performs computations on a specific portion of the data.

How the Grid Works

  • The grid can be defined in 1, 2, or 3 dimensions, depending on the problem you’re solving. For example, if you’re processing a 2D image, you might choose a 2D grid to better map the computational tasks to the data structure.
  • Each block in the grid can also be 1D, 2D, or 3D. The block dimensions define the number of threads per block.
  • When you launch a CUDA kernel, you specify the grid and block dimensions. The CUDA runtime distributes the threads across the available Streaming Multiprocessors (SMs) on the GPU. Each block is assigned to an SM, and the threads within the block are distributed among the cores of that SM.

Cuda has several built-in values that can help you determine block and thread positions on the grid. To keep things simple, let’s consider a 2D block arrangement.

Image by Author
Block Location
---------------
bx = cuda.blockIdx.x  ---------> 1 in our example diagram
by = cuda.blockIdx.y  ---------> 1

Block Dimensions
------------------
bw=cuda.blockDim.x    ---------> 3
bh=cuda.blockDim.y    ---------> 3

Block thread location
---------------------
tx=cuda.threadIdx.x   ---------> 0
ty=cuda.threadIdx.y   ---------> 0

Grid thread location
--------------------
X = bw * bx + tx     ----------> 3
Y = bh * by + ty     ----------> 3

       or

X,Y = cuda.grid(2)

Setting up our dev environment

Before we get to the coding, let’s set up a separate development environment for our work. I use conda for this, but you can use whatever method you know and suits you best.

#create our test environment
(base) $ conda create -n numba_cuda python=3.11 -y
# Now activate it
(base) $ conda activate numba_cuda
(numba_cuda) $ 

Now that our environment is set up, we can install the required libraries and software.

According to the Numba requirements for Cuda programming, as I have CUDA 12 installed, I needed the following libraries,

 (numba_cuda) $ conda install -c conda-forge cuda-nvcc cuda-nvrtc "cuda-version>=12.0"

I also need these,

(numba_cuda) $ conda install numba jupyter  -y
(numba_cuda) $ pip install matplotlib

Numba CUDA in use

For our tests, I’ll repeat some of the programming snippets I used in my Numba JIT article, and we’ll see how much of an improvement we can squeeze out of converting them to use Numba CUDA.

Example 1 – Simple for loop test

Numba JIT version

from numba import jit
import time

# Decorate the function with @jit to enable JIT compilation
@jit(nopython=True)  # nopython mode is recommended for best performance
def loop_test_jit():
    result = 0.0
    # Outer loop
    for i in range(10000):
        # Inner loop
        for j in range(10000):
            # Perform a simple operation
            result += i * j * 0.1
    return result

# Call the function to allow Numba to compile it
loop_test_jit()

# Record start time
start_time = time.time()

# Call the JIT-compiled function
for i in range(5):
    result = loop_test_jit()

# Record end time
end_time = time.time()

# Calculate and print the execution time
print(f"CUDA JIT result = {result}")
print(f"Execution time: {(end_time - start_time)/5} seconds")

#
# Output  below
#
NUMBA JIT result = 249950002500000.0
Execution time: 0.09600849151611328 seconds

Recall that the first time Numba encounters a function, it takes some time to compile it before running it. Therefore, I run the function once for the compilation stage, then call it again in a loop 5 times and take the average time per run in the loop. This should give a fair comparison between run times.

The Numba CUDA version

from numba import cuda
import numpy as np
import time

# Define the number of threads that will run per block
threads_per_block = 256

# Define the CUDA kernel function
@cuda.jit
def loop_test_kernel(results):
    i = cuda.grid(1)
    # Make sure we don't go out of bounds
    if i < results.size:
        result = 0.0
        for j in range(10000):
            result += i * j * 0.1
        results[i] = result

# Main function to manage the computation
def loop_test_cuda():
    num_elements = 10000
    # calculates the number of blocks (blocks_per_grid) needed to 
    # process all num_elements with the given number of threads per block.
    blocks_per_grid = (num_elements + (threads_per_block - 1)) // threads_per_block

    # Allocate space for the results on the device (GPU)
    results = cuda.device_array(num_elements, dtype=np.float64)

    # Launch the kernel on the GPU with the required
    # number of blocks and threads
    loop_test_kernel[blocks_per_grid, threads_per_block](results)

    # Copy the results back to the host (CPU)
    results_host = results.copy_to_host()

    # Aggregate the results
    return results_host.sum()

# Warm up the CUDA kernel to allow JIT compilation
loop_test_cuda()

# Record start time
start_time = time.time()

# Call the CUDA function
for i in range(5):
    result = loop_test_cuda()

# Record end time
end_time = time.time()

# Calculate and print the execution time
print(f"NUMBA CUDA result = {result}")
print(f"Execution time: {(end_time - start_time)/5} seconds")

#
# Output  below
#
NUMBA CUDA result = 249950002500000.0
Execution time: 0.01670536994934082 seconds

Straight away, we see a 6x improvement on a piece of code that was already quick.

The CUDA code is more complex; most of the extra complexity comes from the mapping we must do when allocating the for-loop work to threads on the GPU.

I also received the following warning message when the code ran…

NumbaPerformanceWarning: Grid size 40 will likely result in 
GPU under-utilization due to low occupancy.

So, there’s scope for playing around with some of the numbers to see if the runtime can be improved further. For example, the warning message disappeared when I changed the threads_per_block variable from 256 to 64. This increases the number of blocks per grid, which is counter-intuitive.
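If you want to experiment systematically rather than by trial and error, a small timing harness like the sketch below makes it easy to compare block sizes. It re-declares the kernel so it can stand alone, and the results will obviously vary by GPU.

from numba import cuda
import numpy as np
import time

@cuda.jit
def loop_test_kernel(results):
    i = cuda.grid(1)
    if i < results.size:
        acc = 0.0
        for j in range(10000):
            acc += i * j * 0.1
        results[i] = acc

def time_block_size(threads_per_block, num_elements=10000, repeats=5):
    blocks_per_grid = (num_elements + threads_per_block - 1) // threads_per_block
    results = cuda.device_array(num_elements, dtype=np.float64)
    # Warm-up launch so JIT compilation isn't included in the timing
    loop_test_kernel[blocks_per_grid, threads_per_block](results)
    cuda.synchronize()
    start = time.time()
    for _ in range(repeats):
        loop_test_kernel[blocks_per_grid, threads_per_block](results)
    cuda.synchronize()
    return (time.time() - start) / repeats

for tpb in (32, 64, 128, 256):
    print(f"threads_per_block={tpb}: {time_block_size(tpb):.5f} seconds per run")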

Example 2 – recursive functions

Numba can also speed up recursive function calling. Rather than go down the Fibonacci road, we’ll try a similar algorithm you might not have heard of before called Lucas numbers. Lucas numbers are similar to Fibonacci numbers, following the same recursive pattern but starting with different initial values. The Lucas sequence starts with 2 and 1 instead of 0 and 1 for the Fibonacci sequence. The nth Lucas number can be defined recursively as L(n)=L(n−1)+L(n−2) with base cases L(0)=2 and L(1)=1.

Numba JIT Version

from numba import jit

# Apply Numba's JIT decorator
@jit(nopython=True)
def lucas_numba(n):
    if n == 0:
        return 2
    elif n == 1:
        return 1
    else:
        return lucas_numba(n-1) + lucas_numba(n-2)

lucas_result_numba = lucas_numba(40)  # Example input

# Timing the JIT-compiled function
start_time = time.time()
for _ in range(5):
    lucas_result_numba = lucas_numba(40)  # Example input
end_time = time.time()

print(f"Lucas number 40 with Numba: {lucas_result_numba}")
print(f"Execution time with Numba: {(end_time - start_time)/5} seconds")

# 
# Output
#

Lucas number 40 with Numba: 228826127
Execution time with Numba: 0.7562449932098388 seconds

Numba Cuda version

from numba import cuda
import numpy as np
import time

# CUDA kernel to calculate Lucas numbers
@cuda.jit
def lucas_cuda(n, result):
    i = cuda.grid(1)  # 1D grid, i represents the index in the array

    if i <= n:  # Ensure we don't go out of bounds
        if i == 0:
            result[i] = 2
        elif i == 1:
            result[i] = 1
        else:
            a = 2
            b = 1
            for j in range(2, i + 1):
                c = a + b
                a = b
                b = c
            result[i] = b

# Define the target number (40th Lucas number)
n = 40

# Allocate result array on the device
result = np.zeros(n + 1, dtype=np.int32)  # We need an array of size 41 (0-40)
result_device = cuda.to_device(result)

# Define threads per block and blocks per grid
# There's a bit of trial and error to this
threads_per_block = 128  
blocks_per_grid = (n + (threads_per_block - 1)) // threads_per_block

# Launch the CUDA kernel
start_time = time.time()
lucas_cuda[blocks_per_grid, threads_per_block](n, result_device)
# Wait till all threads are done
cuda.synchronize()
end_time = time.time()

# Copy the result back to the host
result_host = result_device.copy_to_host()

# Print the 40th Lucas number (index 40)
print(f"Lucas number for {n} with CUDA: {result_host[n]}")
print(f"Execution time with CUDA: {end_time - start_time} seconds")

#
# Output
#

Lucas number for 40 with CUDA: 228826127
Execution time with CUDA: 0.10776114463806152 seconds

Approximately a 7x speed up on the original Numba JIT code that time.

Example 3 – image processing

In this test, we take an image of the Taj Mahal and convert it to greyscale. On my system, the original colour image (PNG format) was 3.7 MB in size.

Numba JIT version

from numba import jit
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.image import imread

# Numba-optimized function to convert RGB to grayscale
@jit(nopython=True)
def rgb_to_grayscale_numba(rgb):
    # Preallocate the output grayscale array
    grayscale = np.zeros((rgb.shape[0], rgb.shape[1]), dtype=np.float64)

    # Loop through each pixel and apply grayscale conversion
    for i in range(rgb.shape[0]):
        for j in range(rgb.shape[1]):
            grayscale[i, j] = (0.299 * rgb[i, j, 0] + 
                               0.587 * rgb[i, j, 1] + 
                               0.114 * rgb[i, j, 2])
    return grayscale

# Load the image 
img = imread("d:/images/enlarged_taj_mahal.png")

grayscale_img_numba = rgb_to_grayscale_numba(img)

# Just timing the numba part
start_time = time.time()
for _ in range(5):
    # Convert to grayscale using Numba
    grayscale_img_numba = rgb_to_grayscale_numba(img)

print(f"Numba Execution Time: {time.time() - start_time} seconds")

# Display the original and grayscale images
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.imshow(img)
plt.title('Original Image')
plt.axis('off')

plt.subplot(1, 2, 2)
plt.imshow(grayscale_img_numba, cmap='gray')
plt.title('Grayscale Image with Numba JIT')
plt.axis('off')

plt.show()

The output is:-

Original image by Yury Taranik (licensed from Shutterstock)

Numba CUDA version

from numba import cuda
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.image import imread
import time

# CUDA kernel to convert RGB to grayscale
@cuda.jit
def rgb_to_grayscale_cuda(rgb, grayscale):
    i, j = cuda.grid(2)  # Get the 2D grid index for each thread

    if i < rgb.shape[0] and j < rgb.shape[1]:  # Check bounds
        grayscale[i, j] = (0.299 * rgb[i, j, 0] + 
                           0.587 * rgb[i, j, 1] + 
                           0.114 * rgb[i, j, 2])

# Load the image
img = imread("d:/images/enlarged_taj_mahal.png")

# Preallocate the output grayscale array on the host
grayscale_img = np.zeros((img.shape[0], img.shape[1]), dtype=np.float32)

# Allocate device memory for the input and output images
img_device = cuda.to_device(img)
grayscale_img_device = cuda.device_array((img.shape[0], img.shape[1]), dtype=np.float32)

# Define the threads per block and blocks per grid
threads_per_block = (16, 16)  # 16x16 threads per block is a common choice
blocks_per_grid_x = (img.shape[0] + threads_per_block[0] - 1) // threads_per_block[0]
blocks_per_grid_y = (img.shape[1] + threads_per_block[1] - 1) // threads_per_block[1]
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)

rgb_to_grayscale_cuda[blocks_per_grid, threads_per_block](img_device, grayscale_img_device)

# Start timing
start_time = time.time()
for _ in range(5):
    # Launch the CUDA kernel
    rgb_to_grayscale_cuda[blocks_per_grid, threads_per_block](img_device, grayscale_img_device)

# Copy the result back to the host
grayscale_img = grayscale_img_device.copy_to_host()

print(f"CUDA Execution Time: {time.time() - start_time} seconds")

# Display the original and grayscale images
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.imshow(img)
plt.title('Original Image')
plt.axis('off')

plt.subplot(1, 2, 2)
plt.imshow(grayscale_img, cmap='gray')
plt.title('Grayscale Image with NUMBA CUDA')
plt.axis('off')

plt.show()

And the output?

Original image by Yury Taranik (licensed from Shutterstock)

The speed-up was only around 2x on this occasion, but it’s still pretty impressive.

Summary

In this article, I’ve described how, with little effort, you can squeeze even more performance from your Python code – if you have access to a GPU.

The above timing improvements may not seem that impressive. But bear in mind that the base level we were starting from was already an incredibly improved position over our initial non-optimised code using Numba JIT.

For example, look at the progression in the runtime of the Lucas number calculation from Regular code -> Numba JIT -> Numba CUDA

Regular Python: 29 sec
     NUMBA JIT: 0.71 sec
    NUMBA CUDA: 0.1 sec

That’s almost a 300x speed-up on the non-optimised code.

_OK, that’s all for me just now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content._

I know times are tough and wallets constrained, but if you got real value from this article, please consider buying me a wee dram.

If you liked this content, I think you’ll find these articles interesting, too.

Need for Speed: cuDF Pandas vs. Pandas

Python’s Parallel Paradigm Shift

The post Boost Your Python Code with CUDA appeared first on Towards Data Science.

Introducing the New Anthropic Token Counting API https://towardsdatascience.com/introducing-the-new-anthropic-token-counting-api-5afd58bad5ff/ Fri, 08 Nov 2024 17:57:53 +0000 https://towardsdatascience.com/introducing-the-new-anthropic-token-counting-api-5afd58bad5ff/ Keep a closer eye on your costs when using Claude

The post Introducing the New Anthropic Token Counting API appeared first on Towards Data Science.

Anthropic has released some exciting beta features in the last couple of days that have largely gone under the radar. One of these was the ability to process PDFs with their models, which can now understand both text and visual content within PDF documents. I’ll maybe write up something on that at a later date.

The other exciting beta feature, and the subject of this article, was the introduction of token counting. Crucially, you can count the tokens in user messages, PDFs and images before you send them to Claude. This is excellent news for those who like to monitor their token usage costs closely.

According to the official announcement from Anthropic (link here),

"The token counting endpoint accepts the same structured list of inputs for creating a message, including support for system prompts, tools, images, and PDFs. The response contains the total number of input tokens."

And supports the following models,

"Claude 3.5 Sonnet Claude 3.5 Haiku Claude 3 Haiku Claude 3 Opus"

The good news is that token counting is free to use but subject to requests per minute rate limits based on your usage tier.

For the rest of this article, we’ll go through some examples of using the token counting API to count tokens in user/system messages, PDFs and images.

To make things more interactive, once we have the basics of our code developed, we’ll wrap up the functionality in a Gradio app that will display a nice user interface to enter user text or upload PDFs and images, then count the tokens. It’ll look a bit like this,

Image by Author

Ok, let’s get started. First off, I’m developing using Windows WSL2 Ubuntu. If you’re a Windows user, I have a comprehensive guide on installing WSL2, which you can find here.

Setting up a dev environment

Before we start coding, let’s set up a separate development environment. That way, all our projects will be siloed and won’t interfere with each other. I use conda for this, but use whichever tool you’re familiar with.

(base) $ conda create -n token_count python=3.10 -y
(base) $ conda activate token_count
# Install required Libraries
(token_count) pip install anthropic jupyter

Getting an Anthropic API key

You can get that from the Anthropic Console. Register or Sign-In, then you’ll see a screen like this,

Image from Anthropic Website

Click the Get API Keys button and follow the instructions from there. Take note of your key and set the environment variable ANTHROPIC_API_KEY to it.

The code

Example 1 – Counting tokens in the user and system prompts.

import anthropic
import os

client = anthropic.Anthropic()

response = client.beta.messages.count_tokens(
    betas=["token-counting-2024-11-01"],
    model="claude-3-5-sonnet-20241022",
    system="""
        You are a helpful assistant and will respond to users's queries 
        in a polite, friendly and knowledgable manner
    """,
    messages=[{
        "role": "user",
        "content": "What is the capital city of France"
    }],
)

print(response.json())

#
# Output
#

{"input_tokens":41}

Example 2— Counting tokens in a PDF

For my input PDF, I’ll use a copy of Tesla’s Q10 September 2023 quarterly submission to the Securities and Exchange Commission. This document is 51 pages of mixed text and tabular data. You can see what it looks like online by clicking here.

import base64
import anthropic

client = anthropic.Anthropic()

with open("/mnt/d/tesla/tesla_q10_sept_23.pdf", "rb") as pdf_file:
    pdf_base64 = base64.standard_b64encode(pdf_file.read()).decode("utf-8")

response = client.beta.messages.count_tokens(
    betas=["token-counting-2024-11-01", "pdfs-2024-09-25"],
    model="claude-3-5-sonnet-20241022",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_base64
                }
            },
            {
                "type": "text",
                "text": "Please summarize this document."
            }
        ]
    }]
)

print(response.json())

#
# Output
#

{"input_tokens":118967}

Example 3 – Counting tokens in an image

This is the image I’ll use.

Image by AI (Dalle-3)

It’s a PNG and approximately 2.6MB in size.

import anthropic
import base64

image_path = "/mnt/d/images/android.png"
image_media_type = "image/png"
# Read the image file and encode it to base64
with open(image_path, "rb") as image_file:
    image_data = base64.standard_b64encode(image_file.read()).decode("utf-8")

client = anthropic.Anthropic()

# Create the request using the locally stored image
response = client.beta.messages.count_tokens(
    betas=["token-counting-2024-11-01"],
    model="claude-3-5-sonnet-20241022",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": image_media_type,
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Describe this image"
                }
            ],
        }
    ],
)

print(response.json())

#
# Output
#

{"input_tokens":1575}

Note that in all the above examples, no requests were sent to the LLM to answer any user questions. It was just token counting.

Pulling it all together into a Gradio app.

Now that we have all the code we need, let’s design a user interface for it using Gradio.

We need two input text boxes, one for an optional system prompt and one for an optional user prompt.

Next, we’ll need an input field where the user can select PDF or image files to upload. Below this field, there will be an Add button to allow the user to add the chosen files. The names of any chosen files or images will be displayed in a message box.

Finally, there will be a button that calls the code to calculate the token cost and a button to clear all input and output fields.

We can do this part using an LLM. It took a bit of back and forth, but eventually, with GPT-4o’s help, I developed this code. It’s heavily commented, so it should be relatively easy to follow.

# Import Gradio for building the web app interface
import gradio as gr
# Import the Anthropic client for the token counting API
import anthropic
# Import base64 for encoding files in base64 format
import base64
# Import os for interacting with the file system (though not used in this script)
import os

# Initialize the Anthropic client to access the API functions
# need to have your ANTHROPIC_API_KEY environment variable set
client = anthropic.Anthropic()

# Define a function to handle file uploads incrementally, allowing files to be added without overwriting previous uploads
def add_files(uploaded_files, current_files):
    # Initialize the current_files list if it's empty
    if current_files is None:
        current_files = []

    # Append any newly uploaded files to the current list of files
    if uploaded_files:
        current_files.extend(uploaded_files)

    # Create a list of file names for display purposes
    file_names = [file.name for file in current_files]

    # Return the updated file list, the display names, and clear the uploaded_files input
    return current_files, file_names, None

# Define a function to count tokens in system and user prompts, as well as in uploaded files
def count_tokens(system_prompt, user_prompt, all_files):
    # Check if all inputs are empty or cleared; if so, return 0
    if not system_prompt and not user_prompt and not all_files:
        return 0

    # Initialize an empty list to store the message objects for the API request
    messages = []

    # Add the user prompt to the messages list if it's provided
    if user_prompt:
        messages.append({
            "role": "user",
            "content": user_prompt
        })

    # Process each uploaded file, determining whether it's a PDF or an image
    if all_files:
        for file in all_files:
            # Get the file type by extracting and converting the file extension to lowercase
            file_type = file.name.split(".")[-1].lower()

            # If the file is a PDF, encode it in base64 and prepare a document message
            if file_type == "pdf":
                with open(file.name, "rb") as f:
                    pdf_base64 = base64.standard_b64encode(f.read()).decode("utf-8")
                pdf_content = {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_base64
                    }
                }
                # Add the PDF message to the messages list with a prompt for summarization
                messages.append({
                    "role": "user",
                    "content": [pdf_content, {"type": "text", "text": "Please summarize this document."}]
                })

            # If the file is an image (JPEG or PNG), encode it in base64 and prepare an image message
            elif file_type in ["jpg", "jpeg", "png"]:
                media_type = f"image/{file_type}"
                with open(file.name, "rb") as f:
                    image_base64 = base64.standard_b64encode(f.read()).decode("utf-8")
                image_content = {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_base64,
                    }
                }
                # Add the image message to the messages list with a prompt to describe it
                messages.append({
                    "role": "user",
                    "content": [image_content, {"type": "text", "text": "Describe this image"}]
                })

    # If no prompts or files are provided, add a placeholder message
    if not messages:
        messages.append({
            "role": "user",
            "content": ""
        })

    # Call the Anthropic API to count tokens, using the system prompt and messages as input
    response = client.beta.messages.count_tokens(
        betas=["token-counting-2024-11-01", "pdfs-2024-09-25"],
        model="claude-3-5-sonnet-20241022",
        system=system_prompt,
        messages=messages,
    )

    # Return the total number of tokens counted
    return response.input_tokens

# Define a function to clear all input fields in the Gradio app
def clear_inputs():
    return "", "", [], "", ""

# Build the Gradio interface
with gr.Blocks(theme="huggingface") as app:
    # Display a title for the app
    gr.Markdown("<h1 style='text-align: center;'>Anthropic Token Counter</h1>")

    # Create input fields for system and user prompts
    with gr.Row():
        system_prompt = gr.Textbox(label="System Prompt", placeholder="Enter the system prompt here...", lines=3)
        user_prompt = gr.Textbox(label="User Prompt", placeholder="Enter the user prompt here...", lines=3)

    # Create an upload field for multiple PDF or image files
    uploaded_files = gr.File(label="Upload PDF(s) or Image(s)", file_count="multiple", file_types=[".pdf", ".jpg", ".jpeg", ".png"])

    # Create a state variable to hold the list of currently uploaded files
    current_files = gr.State([])

    # Display a text box to show the names of uploaded files
    file_display = gr.Textbox(label="Uploaded Files", interactive=False) 

    # Define buttons for adding files, counting tokens, and clearing inputs
    add_files_button = gr.Button("Add Files")
    with gr.Row():
        count_button = gr.Button("Count Tokens", size="small")
        clear_button = gr.Button("Clear", size="small")

    # Display the token count result in a text box
    result = gr.Textbox(label="Token Count", interactive=False)

    # Configure the "Add Files" button to append files to the current file list
    add_files_button.click(fn=add_files, inputs=[uploaded_files, current_files], outputs=[current_files, file_display, uploaded_files])

    # Configure the "Count Tokens" button to process the prompts and files, displaying the token count
    count_button.click(fn=count_tokens, inputs=[system_prompt, user_prompt, current_files], outputs=result)

    # Configure the "Clear" button to reset all inputs and the token count display
    clear_button.click(fn=clear_inputs, outputs=[system_prompt, user_prompt, current_files, file_display, result])

# Launch the Gradio app
app.launch()

To use the app, do the following.

  • Enter a system and/or user prompt if required. You can leave these blank if you want.
  • To upload one or more files, drag a file into the file upload box or click on it and choose a file. After this, click the Add button, and your chosen file should appear in the Uploaded Files list box.
  • Repeat the step above to add more files if you want
  • Click the Count Tokens button to display a count of the tokens in all uploaded files and/or any text entered into the user or system prompts
  • Click the Clear button to reset everything and start from scratch

Here’s an example run where I uploaded 2 PDF files and an image along with a user prompt.

Image by Author

Summary

In this article, I wrote about an announcement made by Anthropic about a new token-counting API that had been released in beta. I then went on to use the API to develop code that counts tokens for user and system prompts, as well as for uploaded images and PDF documents.

I then showed how you would develop a user interface for the code using Gradio, bundling the code we developed into the app.

Finally, I showed what the app looks like and provided a working example of its use.

_Ok, that’s all for me for now. Hopefully, you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content._

I know times are tough and wallets constrained, but if you got real value from this article, please consider buying me a wee dram.

If you liked this content, I think you’ll find these articles interesting, too.

Develop, then deploy a WEBP to PNG image converter Taipy App to the web – Part 1

C Programming Using Claude’s New Computer Use Model

The post Introducing the New Anthropic Token Counting API appeared first on Towards Data Science.

Build and Deploy a Multi-File RAG App to the Web https://towardsdatascience.com/build-and-deploy-a-multi-file-rag-app-to-the-web-70ee4eceb0e3/ Fri, 01 Nov 2024 11:01:56 +0000 https://towardsdatascience.com/build-and-deploy-a-multi-file-rag-app-to-the-web-70ee4eceb0e3/ Part 2 - Deploying to the web using Hugging Face Spaces

The post Build and Deploy a Multi-File RAG App to the Web appeared first on Towards Data Science.


This is the second of a two-part series of articles on building and deploying a Gradio AI-based web application.

This part is all about how to deploy your finished app to the world wide web using Hugging Face Spaces.

PS. If you want a sneak peek at the deployed app on Hugging Face Spaces, click on this link

I’ve talked about Gradio before in many of my articles. In my opinion, it’s one of the easiest ways to build a GUI app on top of your Python code.

If Gradio is completely new to you, or you’re only vaguely aware of it, I suggest checking out my article below where I introduce who they are and what they do. I also show some small sample code snippets showing Gradio in action.

Gradio: Rapid GUI Prototyping

In a previous article, I took you through the process of building a Mutli-file RAG chat app that can upload, read and analyse various document formats including PDF, Text, Microsoft Word and Excel files. Check the link below if you haven’t seen it yet.

Build and Deploy a Multi-File, Multi-Format RAG App to the Web

Now that you have a new super-duper Gradio app, the next question you might be asking is "How do I share this with the world?"

One of the ways, which is also FREE, is to deploy on Hugging Face Spaces. In the rest of this article, I’ll show you how to do this.

Who is Hugging Face?

If you haven’t heard of Hugging Face (HF) before, it’s a prominent technology company and community platform in the field of artificial intelligence and machine learning. They also happen to own Gradio. HF is made up of several distinct parts. The main ones are.

  1. An AI Platform.

It facilitates the development, sharing, and deployment of machine learning models, particularly in natural language processing (NLP).

  2. Model Hub.

They maintain a vast repository of pre-trained models that developers and researchers can use, adapt, and build upon.

  3. Transformers Library.

Hugging Face is famous for its Transformers library, an open-source library that provides thousands of pre-trained models and an API to perform tasks on texts, images, and audio.

  4. Spaces.
  • Spaces is a platform provided by Hugging Face that allows developers, researchers, and machine learning enthusiasts to easily host, deploy, and share machine learning models and demos. As this is what we’ll be using, let’s dive a bit deeper into what benefits Spaces provides.
  • Spaces provide free hosting for machine learning demos and applications.
  • It aims to simplify the process of deploying and sharing machine learning models and applications.
  • It allows for the creation of interactive demos for AI models without needing extensive web development skills.
  • It supports Gradio and Streamlit, two popular frameworks for creating AI GUI apps.
  • Continuous deployment ensures your app automatically updates when you push changes to its linked repository

Pre-requisites

Before deploying to HF, there are a few things you need.

1/ Git installed on your system. Instructions for that are here. But this isn’t a tutorial on Git, so I’m assuming you have a basic knowledge of how to use it.

2/ A hugging face account. This is free. Head over to,

Hugging Face – The AI community building the future.

You should see a screen like this, where you can register and/or sign in.

Image from Hugging Face website

3/ You also require a Hugging Face token. Again, this is free.

  • Go to https://huggingface.co/settings/tokens
  • Click on "New token"
  • Set the token type to Write
  • Give it a name (e.g., "Git access token")
  • Click "Create token"
  • Copy the token immediately and save it somewhere (you won’t be able to see it again)

Create an HF Space

Click the link below

Spaces – Hugging Face

Near the top right, click the Create new Space button. You’ll see a screen like this.

Image from HF Website
  • Type in a name for your new space.
  • Select the licence type you want to apply to your App.
  • Choose Gradio->Blank as the SDK type
  • Click Public if you want the world to see your App
  • Click the Create Space button

After a few seconds, you should be greeted by a page that says your Space has been created, together with instructions on how to proceed.

Like this.

Image from HF Website

The final thing you may want to do with your HF Spaces is set up one or more secret keys. This will depend on your app, but for example, if it uses things like API Keys this is where you should set them up.

To do that, in your HF Spaces, click on the Settings link near the top right of the page. On the page that’s displayed, scroll down until you see a section labelled Variables and secrets.

Click the New Secret button and fill in the details as required. In my case, I was using a Groq API key, so I called mine GROQ_API_KEY as that’s how I was referencing it in my original code.

Setting up the coding environment

I’m showing how to do this using WSL2 Ubuntu for Windows, but you can just as easily do this under Windows directly. If you want to try out Ubuntu for Windows I have a comprehensive guide on installing it that you can find here.

From this point on, the setup is similar to what you would do if developing any regular app using Git. But, instead of deploying code etc … to a remote repository on GitHub, we deploy to a remote repository hosted by Hugging Face Spaces.

What I normally do is have a Projects directory where I put all my separate applications. For example,

$ cd /usr/tom
$ mkdir projects
$ cd projects

Next, initialise your Git environment if you haven’t already done so.

$ git config --global user.email "you@example.com"
$ git config --global user.name "Your Name"

Deploying your App

The next stage is to git clone the HF repository that was created as part of your Spaces creation. You can see the command you need by referring to the instruction page that was displayed earlier. In my case, it was this,

$ git clone https://huggingface.co/spaces/taupirho/gradio_multi_file_rag

This will create a sub-folder under Projects containing README.md and .gitattributes files.

Now create your app.py containing your Gradio code. My code looked like this.

Python"># Contents of my app.py file
#
import gradio as gr
from huggingface_hub import InferenceClient
import os
import groq
import warnings
import asyncio
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.groq import Groq
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# A warning may appear which doesn't 
# affect the operation of the code
# Suppress it with this code
warnings.filterwarnings("ignore", message=".*clean_up_tokenization_spaces.*")

# Global variables
index = None
query_engine = None

# Initialize Groq LLM and ensure it is used
llm = Groq(model="mixtral-8x7b-32768")
Settings.llm = llm  # Ensure Groq is the LLM being used

# Initialize our chosen embedding model
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

# These are our RAG functions, called in response to user
# initiated events e.g clicking the Load Documents button
# on the GUI
#
def load_documents(file_objs):
    global index, query_engine
    try:
        if not file_objs:
            return "Error: No files selected."

        documents = []
        document_names = []
        for file_obj in file_objs:
            document_names.append(file_obj.name)
            loaded_docs = SimpleDirectoryReader(input_files=[file_obj.name]).load_data()
            documents.extend(loaded_docs)

        if not documents:
            return "No documents found in the selected files."

        # Create index from documents using Groq LLM and HuggingFace Embeddings
        index = VectorStoreIndex.from_documents(
            documents,
            llm=llm,  # Ensure Groq is used here
            embed_model=embed_model
        )

        # Create query engine
        query_engine = index.as_query_engine()

        return f"Successfully loaded {len(documents)} documents from the files: {', '.join(document_names)}"
    except Exception as e:
        return f"Error loading documents: {str(e)}"

async def perform_rag(query, history):
    global query_engine
    if query_engine is None:
        return history + [("Please load documents first.", None)]
    try:
        response = await asyncio.to_thread(query_engine.query, query)
        return history + [(query, str(response))]
    except Exception as e:
        return history + [(query, f"Error processing query: {str(e)}")]

def clear_all():
    global index, query_engine
    index = None
    query_engine = None
    return None, "", [], ""  # Reset file input, load output, chatbot, and message input to default states

# Create the Gradio interface
with gr.Blocks(theme=gr.themes.Soft()) as demo:
    gr.Markdown("# RAG Multi-file Chat Application")

    with gr.Row():
        file_input = gr.File(label="Select files to load", file_count="multiple")
        load_btn = gr.Button("Load Documents")

    load_output = gr.Textbox(label="Load Status")

    msg = gr.Textbox(label="Enter your question")
    chatbot = gr.Chatbot()  
    clear = gr.Button("Clear")

    # Set up event handlers
    load_btn.click(load_documents, inputs=[file_input], outputs=[load_output])
    msg.submit(perform_rag, inputs=[msg, chatbot], outputs=[chatbot])
    clear.click(clear_all, outputs=[file_input, load_output, chatbot, msg], queue=False)

# Run the app
if __name__ == "__main__":
    demo.queue()
    demo.launch()
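
Before pushing anything, it's worth a quick local sanity check. Assuming your Groq API key is available in your local environment (more on secrets in a moment), you can run the app directly and open the local URL Gradio prints, which is typically http://127.0.0.1:7860,

$ python app.py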

There is one change you should make to your code if it uses secrets such as API keys. In my code, for example, I initially had a line like this,

...
os.environ["GROQ_API_KEY"] = "YOUR_GROQ_API_KEY"
...

I was able to remove this line completely because I had already stored my Groq API key as an HF Spaces secret labelled GROQ_API_KEY. Hugging Face automatically exposes each secret as an operating-system environment variable with the same name as the secret's label.
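
If you want to see how that works in practice, here is a minimal sketch (not part of my app.py) showing the secret surfacing inside the running Space. Passing api_key explicitly is optional here, since the Groq client will also pick up GROQ_API_KEY from the environment on its own.

# Minimal sketch: an HF Spaces secret labelled GROQ_API_KEY
# appears as an environment variable of the same name.
import os
from llama_index.llms.groq import Groq

groq_api_key = os.environ.get("GROQ_API_KEY")  # set automatically by HF from the secret
if groq_api_key is None:
    raise RuntimeError("GROQ_API_KEY not set - add it as a secret in your Space settings")

# Equivalent to letting the client read GROQ_API_KEY itself
llm = Groq(model="mixtral-8x7b-32768", api_key=groq_api_key)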

Next, create a requirements.txt file listing all the external libraries, such as Gradio and Groq, that your application code needs in order to run.

Mine looked like this,

# Contents of my requirements.txt file
#
huggingface_hub==0.22.2
gradio
groq
llama-index-llms-groq 
llama_index
openpyxl
llama-index-embeddings-huggingface
docx2txt
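
If you want to confirm that this list is complete before deploying, one option is to install it into a fresh virtual environment and try running the app from there. A rough sketch,

$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
$ python app.py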

It's also good practice to update the README.md file to let users know what your app does and how to use it.

Now all our code changes are done. The last thing we need is to authenticate ourselves to our host provider (i.e. Hugging Face). This is where the token we created earlier comes into play.

Type the following in at your system command line, replacing your_hf_username and your_hf_spaces_name with your own HF username and Space name, and replacing the token shown with your own access token.

$ git config --global credential.helper store
$ git remote set-url origin https://your_hf_username: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@huggingface.co/spaces/your_hf_username/your_hf_spaces_name
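
Before pushing, it does no harm to confirm that the remote now points at your Space and to see which files Git will include. Standard Git commands cover this,

$ git remote -v
$ git status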

Now to finally deploy our app. Because app.py and requirements.txt are new, untracked files, stage them with git add first; a plain git commit -am on its own would only pick up changes to files Git already tracks.

$ git add app.py requirements.txt
$ git commit -am "Update Gradio App"
$ git push

Assuming all your code is correct, you should see on your HF Spaces page (via the Files link near the top right) that your files have been pushed to the HF Spaces repository.

Click on the App link (also near the top right of your Spaces page) and you’ll see the progress of your app build.

Any errors will show up in the build output; fix them locally, then commit and push your changes to your HF Spaces repo as before.

If all is OK, after a minute or two the build will complete and your app should be displayed for you to try out.
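
From here on, redeploying is simply a matter of committing and pushing again, since each push to the repository triggers a fresh build of the Space. For example, after editing app.py,

$ git commit -am "Tweak the UI"
$ git push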

Congratulations, you have just deployed your Gradio app to HF Spaces!

If you want to check out my HF Spaces app, click here.

Also, the app.py, requirements.txt and README.md files are viewable by anyone using the Files link near the top right of my HF Space.

Summary

Well done if you made it to the end and managed to deploy your app to the web. There are a lot of moving parts, but no individual step is particularly complex.

In this article, I showed how to deploy a Gradio app to the web. Along the way, I explained the prerequisites required, how to set up a Hugging Face account and create a Hugging Face Space.

I then explained in detail the steps required for deployment including authentication with Hugging Face and the uploading of files to your Git repository on Hugging Face Spaces.

_OK, that’s all for me just now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories, follow me or subscribe to get notified when I post new content._

I know times are tough and wallets constrained, but if you got real value from this article, please consider buying me a wee dram.

If you liked this content, I think you'll also find these articles interesting.

Create Your Own Prompt Enhancer from Scratch

Deploying a Streamlit App to the Web

The post Build and Deploy a Multi-File RAG App to the Web appeared first on Towards Data Science.

]]>