SQL | Towards Data Science (https://towardsdatascience.com/category/data-science/sql/)

How to Enhance SQL Code Security and Maintainability (https://towardsdatascience.com/how-to-enhance-sql-code-security-and-maintainability-3e398b4dd68e/, 31 Jan 2025): Introduction to SQL procedures, their applications and benefits, and how to encrypt SQL code

Photo by FlyD on Unsplash

When you write SQL code to accomplish tasks for your company, have you ever worried that the code might get leaked and expose critical business logic to competitors? Or have you noticed that long, complex SQL code is very difficult to maintain and fix when issues arise? SQL procedures can address both problems and are a key step for data professionals looking to advance their coding skills, yet not everyone pays enough attention to this technique.

A SQL procedure, also known as a SQL stored procedure, is a database object that is created and stored in the database management system and can be executed with a single call. It’s a powerful tool for improving database security, modularity, and code reusability.

People often confuse SQL user-defined functions (UDFs) with SQL procedures. Both techniques are used to improve the performance, maintainability, and security of SQL queries. They share many similarities: both let developers write a block of SQL code once and reuse it throughout their applications, and both can accept input parameters. Because of these similarities, the two techniques can sometimes achieve the same goals.

I’ll use the mock data promo_sales to illustrate these similarities. promo_sales holds sales performance data from a department store and consists of the fields Sale_Person_ID, Department, and Sales_Amount. According to company policy, 20% of the sales amount is paid as a bonus to each department. We can write SQL code to query a department summary that includes the average sales per person and the bonus for each department.

To boost sales during the holiday season, the company’s senior management decided to provide an extra 10% bonus to departments with a sales amount over 2000K USD.

CREATE TABLE promo_sales(
  Sale_Person_ID VARCHAR(40) PRIMARY KEY,
  Department VARCHAR(40),
  Sales_Amount INT
);

INSERT INTO promo_sales VALUES ('001', 'Cosmetics', 500);
INSERT INTO promo_sales VALUES ('002', 'Cosmetics', 700);
INSERT INTO promo_sales VALUES ('003', 'Fashion', 1000);
INSERT INTO promo_sales VALUES ('004', 'Jewellery', 800);
INSERT INTO promo_sales VALUES ('005', 'Fashion', 850);
INSERT INTO promo_sales VALUES ('006', 'Kid', 500);
INSERT INTO promo_sales VALUES ('007', 'Cosmetics', 900);
INSERT INTO promo_sales VALUES ('008', 'Fashion', 600);
INSERT INTO promo_sales VALUES ('009', 'Fashion', 1200);
INSERT INTO promo_sales VALUES ('010', 'Jewellery', 900);
INSERT INTO promo_sales VALUES ('011', 'Kid', 700);
INSERT INTO promo_sales VALUES ('012', 'Fashion', 1500);
INSERT INTO promo_sales VALUES ('013', 'Cosmetics', 850);
INSERT INTO promo_sales VALUES ('014', 'Kid', 750);
INSERT INTO promo_sales VALUES ('015', 'Jewellery', 950);

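For context, the department summary described above can be written as a plain, one-off query. Below is a minimal sketch against the promo_sales table, with the holiday rule hard-coded, which is exactly the kind of logic that becomes painful to rewrite later:

-- Department summary: 20% bonus, plus an extra 10% for departments over 2000
SELECT
    Department,
    SUM(Sales_Amount) AS Total_Sales,
    COUNT(DISTINCT Sale_Person_ID) AS Number_of_Sales_Persons,
    AVG(Sales_Amount) AS Avg_Sales_Per_Person,
    CASE
        WHEN SUM(Sales_Amount) > 2000 THEN SUM(Sales_Amount) * 0.2 * 1.1
        ELSE SUM(Sales_Amount) * 0.2
    END AS Bonus
FROM promo_sales
GROUP BY Department;
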
If we need to update the bonus-calculation logic, we would have to rewrite the code, and for projects with complex logic this can become a real challenge. To resolve this problem, we can write a UDF consisting of two statements that represent the logic before and after the update. This design significantly improves code maintainability.

CREATE FUNCTION dbo.MultiStmt_GetDepartmentSummary()
RETURNS @DeptSummary TABLE 
(
    Department VARCHAR(40),
    Total_Sales INT,
    Number_of_Sales_Persons INT,
    Avg_Sales_Per_Person DECIMAL(10, 2),
    Bonus DECIMAL(10, 2) 
)
AS
BEGIN
    -- First Statement: Initialize the table variable with department sales summary
    INSERT INTO @DeptSummary (Department, Total_Sales, Number_of_Sales_Persons, Avg_Sales_Per_Person, Bonus)
    SELECT 
        Department,
        SUM(Sales_Amount) AS Total_Sales,
        COUNT(DISTINCT Sale_Person_ID) AS Number_of_Sales_Persons,
        AVG(Sales_Amount) AS Avg_Sales_Per_Person,
        SUM(Sales_Amount) * 0.2 AS Bonus
    FROM promo_sales
    GROUP BY Department;  

    -- Second Statement: Update rows in the table variable
    UPDATE @DeptSummary
    SET Bonus = Bonus * 1.1  
    WHERE Total_Sales > 2000;

    -- Return the final table
    RETURN;
END;
GO

-- Usage:
SELECT * FROM dbo.MultiStmt_GetDepartmentSummary();

Alternatively, we can use a SQL procedure to generate the same result.

-- Creating the stored procedure to achieve the same functionality
CREATE PROCEDURE dbo.GetDepartmentSummary_Proc
AS
BEGIN
    -- Create a temporary table to store the department summary
    CREATE TABLE #DeptSummary 
    (
        Department VARCHAR(40),
        Total_Sales INT,
        Number_of_Sales_Persons INT,
        Avg_Sales_Per_Person DECIMAL(10, 2),
        Bonus DECIMAL(10, 2)
    );

    -- First Statement: Insert department summary into the temporary table
    INSERT INTO #DeptSummary (Department, Total_Sales, Number_of_Sales_Persons, Avg_Sales_Per_Person, Bonus)
    SELECT 
        Department,
        SUM(Sales_Amount) AS Total_Sales,
        COUNT(DISTINCT Sale_Person_ID) AS Number_of_Sales_Persons,
        AVG(Sales_Amount) AS Avg_Sales_Per_Person,
        SUM(Sales_Amount) * 0.2 AS Bonus
    FROM promo_sales
    GROUP BY Department;

    -- Second Statement: Update rows in the temporary table 
    UPDATE #DeptSummary    
    SET Bonus = Bonus * 1.1  
    WHERE Total_Sales > 2000;

    -- Return the final table
    SELECT * FROM #DeptSummary;

    -- Clean up: Drop the temporary table
    DROP TABLE #DeptSummary;
END;
GO

-- Usage:
EXEC dbo.GetDepartmentSummary_Proc;
Image by the author (SQL procedure output)

Although there are similarities, there are also differences that give developers greater flexibility when using stored procedures in SQL. Compared to SQL functions, which must return a value and can only have input parameters, stored procedures don’t have to return a result and can be implemented with both input and output parameters. In today’s article, I will focus on the important features of SQL stored procedures.

If you’re interested in SQL user-defined functions, you can refer to my other article, ‘SQL User Defined Functions (UDFs)‘.

SQL User Defined Functions (UDFs)


Syntax For SQL Stored Procedures

A universal syntax of SQL stored procedure is:

CREATE PROCEDURE procedure_name(parameters)
AS
BEGIN
    -- statements
END;

EXEC procedure_name;

In this syntax, parameters are optional. When creating the SQL procedure dbo.GetDepartmentSummary_Proc, no parameters were assigned. The stored procedure contains several statements that perform different tasks, such as table creation, data insertion, variable calculation, variable updates, summary queries, table deletion, and so on. Unlike a SQL UDF, the RETURN statement is not used here to return a table, although it can optionally be used to return an integer status value. Another key difference is that a SQL procedure uses the EXEC statement to execute the defined procedure and obtain the intended result.

Creating a Stored Procedure with Default Parameters

A default parameter in a SQL procedure is a value assigned to a parameter that is used automatically when no value is provided at execution time. For the procedure created above, we can include a parameter that filters the results by a specific department. If the parameter isn’t specified, the procedure returns the summary of all departments.

-- Creating the stored procedure to achieve the same functionality
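-- Note: this block (and the output-parameter variant later in the article) reuses
-- the procedure name defined above. Drop the existing procedure first, or use
-- CREATE OR ALTER (available from SQL Server 2016 SP1), before re-creating it.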
CREATE PROCEDURE dbo.GetDepartmentSummary_Proc
    @DepartmentFilter VARCHAR(40) = NULL 
AS
BEGIN
    -- Create a temporary table to store the department summary
    CREATE TABLE #DeptSummary 
    (
        Department VARCHAR(40),
        Total_Sales INT,
        Number_of_Sales_Persons INT,
        Avg_Sales_Per_Person DECIMAL(10, 2),
        Bonus DECIMAL(10, 2)
    );

    -- First Statement: Insert department summary into the temporary table
    INSERT INTO #DeptSummary (Department, Total_Sales, Number_of_Sales_Persons, Avg_Sales_Per_Person, Bonus)
    SELECT 
        Department,
        SUM(Sales_Amount) AS Total_Sales,
        COUNT(DISTINCT Sale_Person_ID) AS Number_of_Sales_Persons,
        AVG(Sales_Amount) AS Avg_Sales_Per_Person,
        SUM(Sales_Amount) * 0.2 AS Bonus
    FROM promo_sales
    WHERE (@DepartmentFilter IS NULL OR Department = @DepartmentFilter)
    GROUP BY Department;

    -- Second Statement: Update rows in the temporary table 
    UPDATE #DeptSummary    
    SET Bonus = Bonus * 1.1  
    WHERE Total_Sales > 2000;

    -- Return the final table
    SELECT * FROM #DeptSummary;

    -- Clean up: Drop the temporary table
    DROP TABLE #DeptSummary;
END;
GO

EXEC dbo.GetDepartmentSummary_Proc @DepartmentFilter = 'Cosmetics';
GO

In this example, the @DepartmentFilter parameter is defined, but when it is left as NULL we still generate the summary of all departments. By adding the WHERE clause, we can dynamically filter the data based on the parameter’s value. This approach makes the stored procedure flexible and reusable for different scenarios.

Creating a SQL Procedure with Output Parameters

An output parameter in a SQL procedure is a parameter that returns a value back to the caller after the procedure executes. Returning a single value through an output parameter is considered more efficient than returning it as a separate result set, and it lets developers pass additional information back alongside the main result set. This feature gives SQL procedures more flexibility in handling data.

-- Creating the stored procedure to achieve the same functionality
CREATE PROCEDURE dbo.GetDepartmentSummary_Proc
    @TotalDepartments INT OUTPUT
AS
BEGIN
    -- Create a temporary table to store the department summary
    CREATE TABLE #DeptSummary 
    (
        Department VARCHAR(40),
        Total_Sales INT,
        Number_of_Sales_Persons INT,
        Avg_Sales_Per_Person DECIMAL(10, 2),
        Bonus DECIMAL(10, 2)
    );

    -- First Statement: Insert department summary into the temporary table
    INSERT INTO #DeptSummary (Department, Total_Sales, Number_of_Sales_Persons, Avg_Sales_Per_Person, Bonus)
    SELECT 
        Department,
        SUM(Sales_Amount) AS Total_Sales,
        COUNT(DISTINCT Sale_Person_ID) AS Number_of_Sales_Persons,
        AVG(Sales_Amount) AS Avg_Sales_Per_Person,
        SUM(Sales_Amount) * 0.2 AS Bonus
    FROM promo_sales
    GROUP BY Department;

    -- Set the output parameter to the total number of departments processed
    SELECT @TotalDepartments = COUNT(*) 
    FROM #DeptSummary;

    -- Second Statement: Update rows in the temporary table 
    UPDATE #DeptSummary    
    SET Bonus = Bonus * 1.1  
    WHERE Total_Sales > 2000;

    -- Return the final table
    SELECT * FROM #DeptSummary;

    -- Clean up: Drop the temporary table
    DROP TABLE #DeptSummary;
END;
GO

-- Usage:
-- Declare a variable to hold the output parameter value
DECLARE @TotalDepts INT;

-- Execute the procedure and pass the output parameter
EXEC dbo.GetDepartmentSummary_Proc 
    @TotalDepartments = @TotalDepts OUTPUT; -- Capture the output parameter

-- Display the value of the output parameter
PRINT 'Total Departments Processed: ' + CAST(@TotalDepts AS VARCHAR);

The output parameter @TotalDepartments INT OUTPUT is defined to store the total number of departments. Its value is set during procedure execution and can be printed out after the procedure completes.


Encrypting Stored Procedure in SQL

Nowadays, data security has become a growing concern for many companies. Keeping sensitive data away from unauthorized users, complying with data privacy laws and regulations, and maintaining data integrity can all require encrypting stored procedures in SQL.

If you’re a data engineer or a data analyst in a key business department, learning how to encrypt your SQL procedures is a must-have skill. Encrypting a stored procedure is quite straightforward – you only need to add the ENCRYPTION keyword while creating the procedure. The keyword obfuscates and hides the source code. If someone attempts to retrieve the source code with the system stored procedure sp_helptext, the server responds with "The text for object ‘procedure_name’ is encrypted."

CREATE PROCEDURE procedure_name(parameters)
WITH ENCRYPTION
AS
BEGIN
    -- statements
END;

sp_helptext procedure_name;
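As a concrete illustration, a slimmed-down department summary procedure could be created with encryption as sketched below. This assumes SQL Server; the procedure name dbo.GetDepartmentSummary_Encrypted is made up for this example, and WITH ENCRYPTION obfuscates the stored module text rather than encrypting the underlying data.

CREATE PROCEDURE dbo.GetDepartmentSummary_Encrypted
WITH ENCRYPTION
AS
BEGIN
    -- Same summary logic as before, now with a hidden definition
    SELECT
        Department,
        SUM(Sales_Amount) AS Total_Sales,
        SUM(Sales_Amount) * 0.2 AS Bonus
    FROM promo_sales
    GROUP BY Department;
END;
GO

-- Attempting to view the definition now returns the 'is encrypted' message
EXEC sp_helptext 'dbo.GetDepartmentSummary_Encrypted';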

Conclusion

As one of the key advanced SQL techniques, stored procedures undoubtedly influence the performance, flexibility, maintainability and efficiency of your code and ultimately generate a significant impact on your database management and data analysis activities. There are always debates about which is better – SQL functions or stored procedures. In my view, the choice depends on the complexity of the data and logic, the functionality aimed to achieve, and the expected outputs. Which technique do you prefer? Please share your thoughts in the comments.

Thank you for reading! If you found this article helpful, please give it some claps! Follow me and subscribe via email to receive notifications whenever I publish a new article.

Advanced SQL Techniques for Unstructured Data Handling (https://towardsdatascience.com/advanced-sql-techniques-for-unstructured-data-handling-832f3c7c43b9/, 8 Jan 2025): Everything you need to know to get started with text mining

Photo by Etienne Girardet on Unsplash

The ideal dataset for data analysis is like Table_1:

Table_1 (mock data by the author)

However, the datasets that we encounter in reality are mostly like Table_2:

Table_2: customer support log (mock data by the author)

The main difference between these two tables is whether the data is well organized into rows and columns and presented purely as numbers or short text values. Because of this difference, the data in Table_1 is called structured data, while the data in Table_2 is categorized as unstructured data.

Unstructured data refers to information that doesn’t have a predetermined structure or format. It’s difficult to store and manage in a relational database, but it often contains valuable information that is useful for generating data insights, training machine learning models, or performing natural language processing (NLP).

In this article, I’ll introduce 7 advanced SQL techniques used to handle unstructured data. Although these techniques are called ‘advanced’ in SQL, they actually form the foundation of data parsing and text mining.


JSON Parsing

JSON is short for "JavaScript Object Notation". It is a text-based format used in web development to exchange information between a server and a web application. JSON data is widely used because of advantages such as ease of use, platform independence, and flexibility. But we usually cannot directly analyze and obtain insights from JSON data because of its complex structure and varying levels of nesting. The syntax for JSON parsing is as follows:

SELECT JSON_VALUE(json_column, '$.key') AS extracted_value
FROM table_name;

In Table_2, the column customer_data is JSON data, which can be converted into two columns, name and age.

SELECT JSON_VALUE(customer_data, '$.name') AS name, JSON_VALUE(customer_data, '$.age') AS age
FROM support_logs;
Output of JSON parsing (screenshot by the author)

Regular Expression

A regular expression (regex) is a sequence of characters used to define a search pattern. Regex allows developers to find and manipulate complex string data within a database by matching specific patterns within text. The syntax for regex in SQL is:

SELECT column_name
FROM table_name
WHERE column_name REGEXP 'pattern';

Regex is widely used for flexible search, data validation and data extraction. A typical use case is to extract emails from text.

SELECT column_name
FROM users
WHERE column_name REGEXP '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}';

People sometimes confuse the REGEXP operator with the LIKE operator. Both are used in SQL to perform pattern matching within strings, but LIKE only supports basic wildcards, while REGEXP allows complex patterns using regular expression syntax and offers greater flexibility for advanced pattern-matching scenarios due to its more robust syntax. On the other hand, regex requires a deeper understanding of the structure of the original data, which can be challenging.


Key-Value Pair Parsing

With JSON parsing, the format is relatively standardized. If the string data has a more ad hoc structure, with each key and value separated by a delimiter such as a colon, a semicolon, or an equals sign, JSON parsing is not enough for data extraction. Instead, we should use key-value pair parsing.

Key-Value Pair Parsing is the process of extracting and separating data stored in a format where each piece of information is represented as a "key" paired with its corresponding "value". This method is implemented with string functions or regular expressions (regex).

String functions like SUBSTRING, SUBSTRING_INDEX, POSITION, and REPLACE are frequently used to extract the key and value parts based on the delimiters. An example of using the SUBSTRING_INDEX function for parsing key-value pairs is as follows.

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(column_name, 'key=', -1), ';', 1) AS value
FROM table_name;
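For example, if a column held the made-up string 'priority=high;status=open' (the same format used in the worked example later in this article), the status value could be extracted like this:

-- The inner call keeps everything after 'status=';
-- the outer call cuts the result off at the next ';'
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('priority=high;status=open', 'status=', -1), ';', 1) AS status;
-- Returns 'open'
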

Text Analytics with Window Functions

Window functions perform calculations across a specified set of rows, known as a "window", and return a single value related to the current row. You can refer to my other article, ‘The Most Useful Advanced SQL Techniques to Succeed in the Tech Industry’, for a thorough understanding of the syntax, benefits, and use cases of window functions.

The Most Useful Advanced SQL Techniques to Succeed in the Tech Industry

Besides the capability of computing metrics across partitions, window functions can be used for text-based analysis as well, which very few people have realised.

The applications of window functions for text-based analysis are versatile. For example, if we’d like to rank the text by length, we can use the syntax below:

SELECT column_name, RANK() OVER (ORDER BY LENGTH(column_name) DESC) AS length_rank
FROM table_name;

Data Tokenization

Data tokenization is a security technique where sensitive data, such as credit card numbers or social security numbers, is replaced with randomized tokens. SQL itself doesn’t inherently support tokenization, but it can work with tokenized data through:

  • Lookup Tables: A mapping table associates tokens with their original values.
  • Encryption or Hash Functions: While not true tokenization, these methods can obfuscate data.

Data tokenization is not typically considered a methodology for cleaning unstructured datasets. However, it is an important technique for text mining, especially as data privacy breaches become more common and data security becomes a serious risk in data usage.


COALESCE()

COALESCE() is a function that returns the first non-null value from a list of expressions. It’s useful for handling incomplete or inconsistent data, which is very common in unstructured datasets. The syntax of the COALESCE() function is:

SELECT COALESCE(column1, column2, 'default_value') AS result
FROM table_name;

COALESCE() is widely used for replacing null values, selecting the first available value or fallback logic.


CAST()

CAST() converts data from one type to another. The syntax is:

SELECT CAST(column_name AS target_type) AS converted_value
FROM table_name;

When using the CAST() function, we must be cautious, especially when the data contains missing values (NULL).

SELECT 
    CAST(JSON_VALUE(customer_data, '$.age') AS INT) AS customer_age
FROM 
    support_logs;

The code above will return an error: MySQL’s CAST() does not support INT as a target type, so you need to cast to SIGNED or UNSIGNED instead. Note also that the parsed column contains NULL values; CAST() simply passes them through as NULL, so check for them explicitly if your analysis requires complete data.

SELECT 
    CAST(JSON_VALUE(customer_data, '$.age') AS UNSIGNED) AS customer_age
FROM 
    support_logs;

SQL Example of Handling Unstructured Data

Let’s revisit Table_2: the customer support logs table, and use the techniques mentioned above to convert the unstructured data into a structured data table which is ready for analysis.

Table_2: customer support log (mock data by the author)

Here are the tasks that we aim to accomplish:

  1. Extract customer name and age from the customer_data column.
  2. Handle the missing values in the issue_description column.
  3. Extract the priority and status of each ticket from the extra_info column to help the IT team prioritize workloads and track status of each ticket.
  4. Extract the resolution time (in hours) for further analysis.
  5. Rank tickets by the length of the issue_description .
  6. Tokenize phone numbers to protect customer privacy.
SELECT 
    ticket_id,
    customer_id,

    -- Extract customer names and ages
    JSON_VALUE(customer_data, '$.name') AS customer_name,
    CAST(JSON_VALUE(customer_data, '$.age') AS UNSIGNED) AS customer_age,

    -- Handle missing issue descriptions
    COALESCE(issue_description, 'No issue reported') AS issue_description,

    -- Extract ticket priority and status
    SUBSTRING_INDEX(REGEXP_SUBSTR(extra_info, 'priority=[^;]+'), '=', -1) AS priority,
    SUBSTRING_INDEX(SUBSTRING_INDEX(extra_info, 'status=', -1), ';', 1) AS status,

    -- Extract hours of ticket resolution
    CAST(REGEXP_REPLACE(NULLIF(resolution_time, 'N/A'), '[^0-9]', '') AS UNSIGNED) AS resolution_time_hours,

    -- Rank tickets by lengths
    RANK() OVER (ORDER BY LENGTH(issue_description) DESC) AS issue_length_rank,

    -- Tokenize phone number
    CONCAT('TOKEN-', RIGHT(MD5(phone_number), 8)) AS tokenized_phone_number
FROM support_logs;

We can transform the original unstructured data into a structured table with no sensitive information, as shown below:

Output of text mining (screenshot by the author)

Conclusion

SQL is not only a powerful tool for data retrieval and manipulation; its functionality for handling text from various sources, such as logs, emails, websites, social media, and mobile apps, is also exceptionally robust.

Unstructured data may be confusing and difficult to interpret, but by utilizing SQL’s relevant functionalities, we can extract highly valuable insights from the data and drive the success of data science projects to new heights.

Thank you for reading! If you found this article helpful, please give it some claps! Follow me and subscribe via email to receive notifications whenever I publish a new article. My goal is to help data analysts and data scientists, whether you’re a beginner or experienced, enhance your technical skills and achieve greater success in your career.

All The SQL a Data Scientist Needs to Know (https://towardsdatascience.com/all-the-sql-a-data-scientist-needs-to-know-7a5328176a67/, 7 Jan 2025): What you need to know, best practices, and where you can practice your skills

Image artificially generated using Grok 2.

Introduction

In my opinion, SQL is one of the most important skills a Data professional should have. Whether you’re a Data Analyst, Data Scientist, or Software Developer, you’re likely going to be interacting with databases daily using SQL.

Speaking from a data scientist’s perspective, you do not need to be an expert in SQL. Being able to extract, manipulate, and analyse data using SQL should be enough for the majority of data scientists’ tasks. You will often find that you only use SQL for loading data into a Jupyter Notebook prior to implementing some exploratory data analysis (EDA) using Pandas.

The purpose of this article is to discuss the fundamentals of SQL syntax, discuss SQL best practices, and what resources are available for you to practice your SQL skills.

What is SQL?

SQL is a domain-specific language created to manage and manipulate relational databases. SQL has been widely adopted by not only Data Scientists but the majority of data professionals as the go-to language whenever interacting with databases.

The acronym SQL stands for:

  • Structured: Data is stored in an organised state, unlike unstructured data (e.g. audio, video, text).
  • Query: How users speak to the database, extracting the information they’re looking for by writing SQL queries.
  • Language: SQL is a programming language, designed to be extremely user-friendly and very easy to read, unlike some traditional programming languages.

SQL comes in many different flavors; the main difference between them is whether they are a paid or a free service. Over the years there have been several open-source flavors of SQL released, the most popular being MySQL and PostgreSQL.

From my experience, Transact-SQL (via MS SQL Server), GoogleSQL (via BigQuery), and PostgreSQL are the most popular. I would focus on Transact-SQL (via MS SQL Server) if I was starting from scratch as most tutorials cover this flavor of SQL.

For more information on SQL, please see here.

SQL Fundamentals

Some professions such as Data Engineers and Database Administrators (DBAs) need to have an advanced knowledge of SQL, but this is not the case for Data Scientists. As you gain experience, you will find that writing SQL scripts becomes quite repetitive, and most of the time you’re just copying previous scripts and making minor amendments.

Most Data Scientists will use SQL to perform basic data transformations before importing into a Python environment. I am going to provide you with all the fundamental commands required to perform 90% of your day-to-day SQL-related tasks as a Data Scientist.

Selecting Data

The most important SQL command is SELECT, this command allows you to define what columns you would like to select from the table stated in your query.

select
    order_date,
    product_sku,
    order_quantity
from
    my_store.ecommerce.orders

Columns can either be stated individually or you can use an asterisk (*) indicating that you would like to select all columns in that table.

The query above will select all rows from the my_store.ecommerce.orders table regardless of the presence of duplicated rows. To prevent this from happening you can use the DISTINCT command to only return unique rows.

select distinct
    order_date,
    product_sku,
    order_quantity
from
    my_store.ecommerce.orders

Engineering Data

Sometimes your table does not contain the column you would like, but what it does have is the underlying data to create that column. Using something like a CASE command, you can engineer your own feature in your SQL query.

select distinct
    order_date,
    product_sku,
    order_quantity,
    case when order_quantity >= 5 then "High"
         when order_quantity between 3 and 5 then "Medium"
         else "Low" end as order_quantity_status
from
    my_store.ecommerce.orders

In the query above, we have created the column order_quantity_status based on the values in the column order_quantity. The CASE command acts as an IF-ELSE statement, similar to what you might have come across in other programming languages.

Note: There are many alternative approaches than using CASE to engineer new features. More information on these approaches is available via the learning resources at the bottom of this article.

Grouping and Ordering Data

These clauses are very self-explanatory: the GROUP BY clause is used when aggregating columns, whereas the ORDER BY clause is used when you want to order the columns in your output.

select
    order_date,
    count(distinct product_sku) as distinct_product_count
from
    my_store.ecommerce.orders
group by
    order_date
order by
    count(distinct product_sku) desc

In the query above we are grouping by order_date and counting how many unique products were sold each day. After calculating this aggregation, we return the output ordered descending by the newly created distinct_product_count column.

Filtering Data

It is not uncommon to encounter database tables which are terabytes in size. To reduce processing costs and time, including filtering in your queries is essential.

select
    order_date,
    product_sku,
    order_quantity
from
    my_store.ecommerce.orders
where
    order_date >= "2024-12-01"

Including the WHERE clause in your query allows you to take advantage of partitions and/or indexing. By reducing the size of the data your query has to process, your queries will run much faster and at a much lower cost. Your Data Engineers and DBAs will thank you!

Not only is the WHERE clause good for filtering dates, it can also be implemented on any column in your table. For example, if we wanted to only include SKUs SKU100, SKU123, and SKU420, and only wanted to see orders of those products with a quantity of less than 3, we can use the following query:

select
    order_date,
    product_sku,
    order_quantity
from
    my_store.ecommerce.orders
where
    order_date >= "2024-12-01"
    and product_sku in ("SKU100", "SKU123", "SKU420")
    and order_quantity < 3

Note: Also spend some time looking at the HAVING clause, this is an alternative approach to filtering using aggregated column values. The query below demonstrates this by only returning order dates and the total number of orders when the daily sum is greater or equal to 100.

select
    order_date,
    sum(order_quantity) as total_orders
from
    my_store.ecommerce.orders
where
    order_date >= "2024-12-01"
group by
    order_date
having
    sum(order_quantity) >= 100

Joining Data

The most adopted database design pattern is the Star Schema which uses fact and dimension tables. Fact tables consist of quantitative data such as metrics and measurements whereas dimension tables provide more descriptive information adding further context to the information provided in the fact table.

As a Data Scientist, it’s your responsibility to identify the tables where the data you require is located. Secondly, you must perform the correct join to combine these tables.

select
    o.order_date,
    o.product_sku,
    o.order_quantity,
    p.product_name,
    p.product_weight
from
    my_store.ecommerce.orders o
inner join
    my_store.ecommerce.product_details p
on
    o.product_sku = p.product_sku
where
    o.order_date >= "2024-12-01"

In the query above, we are performing an INNER JOIN on the product_sku column. An INNER JOIN will return all of the order rows where we successfully identify the product_sku in the product_details table.

It is important to pay attention to the aliases assigned to each table; it is not uncommon for more than one table to have the same column name, and using aliases allows you to state the specific column you are referencing.

Note: Ensure you spend the time researching alternative joins e.g. LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN. For those who are visual learners, check out this link on SQL joins.
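For instance, a LEFT JOIN keeps every order row even when no matching SKU exists in the product details table; the unmatched rows simply receive NULL product attributes. A quick sketch using the same tables:

select
    o.order_date,
    o.product_sku,
    o.order_quantity,
    p.product_name
from
    my_store.ecommerce.orders o
left join
    my_store.ecommerce.product_details p
on
    o.product_sku = p.product_sku
where
    o.order_date >= "2024-12-01"
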

Aggregating Data

Aggregating columns is something you should get extremely familiar with when using SQL. The most common commands you will use frequently are COUNT(), SUM(), MIN(), MAX(), and AVG().

select
    count(product_sku) as product_count,
    sum(order_quantity) as total_orders,
    min(order_quantity) as minimum_orders,
    max(order_quantity) as maximum_orders,
    avg(order_quantity) as average_orders
from
    my_store.ecommerce.orders
where
    order_date >= "2024-12-01"

These aggregation functions are used to generate descriptive statistics from your data. Although this can be completed using Python, I find it more productive for me to complete this task using SQL, especially if it is answering a stakeholder question on the fly.

Next Steps

After mastering the fundamentals, you should expand your knowledge and focus on intermediate SQL. Some common constructs that arise frequently in my day job are [common table expressions (CTEs)](https://www.atlassian.com/data/sql/using-common-table-expressions#:~:text=A%20Common%20Table%20Expression%20(CTE,focus%20on%20non%2Drecurrsive%20CTEs.) and window functions.

CTEs

As I do the majority of my SQL using BigQuery via GCP, I use CTEs in almost all my queries. CTEs allow you to create temporary tables that can be later referenced as part of a broader, larger SQL query.

with total_product_orders_daily as 
(
  select
    order_date,
    product_sku,
    sum(order_quantity) as total_orders
  from
    my_store.ecommerce.orders
  where
    order_date >= "2024-12-01"
  group by
    order_date,
    product_sku
)

select
    tpod.order_date,
    tpod.product_sku,
    p.product_name,
    tpod.total_orders
from
    total_product_orders_daily tpod
inner join
    my_store.ecommerce.product_details p
on
    tpod.product_sku = p.product_sku

The query above creates a CTE that first calculates total_orders per order_date and product_sku before joining the total_product_orders_daily table to the my_store.ecommerce.product_details table. Also note that the WHERE clause occurs as early as possible, inside the CTE: you should always aim to reduce the amount of data you’re working with as soon as possible.

Window Functions

A window function performs a calculation across a set of rows that are related to the current row, each row maintains a separate identity. For example, if you want to rank your data or identify duplicated records, you can do this by implementing window functions.

select
    order_date,
    product_sku,
    order_quantity,
    rank() over (partition by order_date, product_sku order by order_quantity desc) as daily_sku_order_rank
from
    my_store.ecommerce.orders
where
    order_date >= "2024-12-01"

The query above creates the column daily_sku_order_rank which is ranking each product_sku per order_date in descending order.

To identify and remove duplicated records using window functions you can use the following code:

with base_table as 
(
  select
    order_date,
    product_sku,
    order_quantity,
    row_number() over (partition by order_date, product_sku) as daily_sku_row_num
  from
    my_store.ecommerce.orders
  where
    order_date >= "2024-12-01"
)

select
    order_date,
    product_sku,
    order_quantity
from
    base_table
where
    daily_sku_row_num = 1

Rows where daily_sku_row_num is greater than 1 (duplicated records) are removed in the final SELECT once the CTE executes and generates an output.

Note: There are more window functions available, such as DENSE_RANK; more information is available here.
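As a quick illustration of the difference, the sketch below ranks order quantities with both functions. If two SKUs tie for first place on a given day, RANK() assigns the next row rank 3, whereas DENSE_RANK() assigns it rank 2, leaving no gaps:

select
    order_date,
    product_sku,
    order_quantity,
    rank() over (partition by order_date order by order_quantity desc) as qty_rank,
    dense_rank() over (partition by order_date order by order_quantity desc) as qty_dense_rank
from
    my_store.ecommerce.orders
where
    order_date >= "2024-12-01"
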

SQL Best Practices

As with any other programming language, when composing a SQL script you should always assume that somebody else will reuse your code. To make this easier, it is best to follow some SQL best practices. Some standout ones are:

  1. Use Meaningful Naming Conventions: It is better to have longer more descriptive column/table naming conventions.
  2. Code Formatting: Use consistent indentation throughout your script. There is no right or wrong regarding upper or lowercase text, pick one and stick to it.
  3. Refrain from Selecting All: Select specific columns you want to include in your output, do not use the asterisks when selecting from a table.
  4. Indexing Columns: Columns that are frequently used in WHERE, JOIN, or ORDER BY clauses should be indexed, therefore optimising query performance.
  5. Where to Use Functions: Just as you should be deliberate about which columns you index, avoid wrapping columns in functions (e.g. CAST(), LEN()) within WHERE, JOIN, or ORDER BY clauses, as this can stop indexes from being used. The same goes for leading wildcards.

Note: There are more SQL best practices than those stated above, and they can sometimes be dependent on the flavor of SQL you’re using. I encourage you to enquire within your company to see whether any internal SQL best practices have been established which you could look to implement in your work.

Resources for Practicing SQL

Your SQL development will always advance at a faster pace when working with real data in a professional setting. For aspiring Data Scientists who are yet to land their first position, there are many online alternatives you can use to maintain and grow your SQL skills.

Some of the top resources I have found for learning SQL are:

W3Schools.com

Solve SQL Code Challenges

Master Coding for Data Science

SQLZoo

SQLBolt – Learn SQL – Introduction to SQL

DataLemur – Ace the SQL & Data Science Interview

Personally, I find StrataScratch to be the best as it allows you to choose between different flavors of SQL, has a great selection of questions, and has a good UI (similar to LeetCode).

For more theoretical learning, W3Schools is the one I would select. I started reading this resource when I was first learning SQL, it is always in my bookmarks should I need to refresh my memory on a specific topic.

One thing I would suggest is not to spend too much time trying to find the right resource, pick one, and start tackling the challenges. Start with the beginner tasks and work your way up, be patient and consistent with your learning. You do not need to complete all the hard challenges before you are deemed interview-ready, as you progress your confidence will grow.

Note: Some of these resources are free, whereas others have free tiers but some of their content sits behind a paywall.

Final Thoughts

All Data Scientists should have at least a foundational knowledge of SQL. Unfortunately, SQL does not get much recognition at an academic level, which often results in graduates lacking this skill when trying to land their first Data Scientist role.

The language isn’t the most attractive and when compared to learning Python, can often be described as being quite boring. It is only when you begin working in a professional setting that you understand how important SQL will be in your career.

Not only are there vast amounts of free resources online to teach you, but there is also a great community of engineers and scientists discussing SQL best practices online.

Take the time to learn SQL, having this skill in your toolbox early in your career will definitely make you stand out from the competition.


Disclaimer: I have no affiliation with any of the companies, software, or products discussed in this article. Furthermore, unless stated otherwise, all images included in this article are owned by the author.


If you enjoyed reading this article, please follow me on Medium, X, and GitHub for similar content relating to Data Science, Artificial Intelligence, and Engineering.

Happy learning! 🚀

5 Simple Projects to Start Today: A Learning Roadmap for Data Engineering (https://towardsdatascience.com/5-simple-projects-to-start-today-a-learning-roadmap-for-data-engineering-940ecbad6b5f/, 2 Jan 2025)

Start with 5 practical projects to lay the foundation for your data engineering roadmap

Tutorials help you understand the basics, and you will definitely learn something from them. However, the real learning effect comes when you implement small projects directly and thus combine theory with practice.

You will benefit even more if you explain what you have learned to someone else. You can also use ChatGPT as a learning partner or tutor – explain in your own words what you have learned and get feedback. Use one of the prompts that I have attached after the roadmap.

In this article, I present a roadmap for 4 months to learn the most important concepts in data engineering for beginners. You start with the basics and increase the level of difficulty to tackle more complex topics. The only requirements are that you have some Python programming skills, basic knowledge of data manipulation (e.g. simple SQL queries) and motivation 🚀

Why only 4 months?

It is much easier for us to commit to a goal over a shorter period of time. We stay more focused and motivated. Open your favorite app right away and start a project based on the examples. Or set a calendar entry to make time for the implementation.

5 projects for your 4-month roadmap

As a data engineer, you ensure that the right data is collected, stored and prepared in such a way that it is accessible and usable for data scientists and analysts.

You are, so to speak, the person who runs the kitchen, keeping it organized and ensuring that all the ingredients are fresh and ready to hand. The data scientist is the chef who combines them into creative dishes.

Month 1 – Programming and SQL

Deepen your knowledge of Python basics. CSV and JSON files are common formats for data exchange, so learn how to edit them, and understand how to manipulate data with the Python libraries Pandas and NumPy.

A small project to start in month 1: Clean a CSV file with unstructured data, prepare it for data analysis, and save it in a clean format. Use Pandas for data manipulation and basic Python functions for editing.

  1. Read the file with ‘pd.read_csv()’ and get an overview with ‘df.head()’ and ‘df.info()’.
  2. Remove duplicates with ‘df.drop_duplicates()’ and fill in missing values with the average using ‘df.fillna(df.mean())’. Optional: Research what options are available to handle missing values.

  3. Create a new column with ‘df[‘new_column’]’, which, for example, fills all rows above a certain value with a ‘True’ and all others with a ‘False’.
  4. Save the cleansed data with ‘df.to_csv(‘new_name.csv’, index=False)’ in a new CSV file.

What problem does this project solve? Data quality is key, but unfortunately the data you receive in the business world is often not clean.

Tools & Languages: Python (Pandas & NumPy library), Jupyter Lab

Understanding SQL. SQL allows you to query and organize data efficiently. Understand how to use the most important commands such as CREATE TABLE, ALTER TABLE, DROP TABLE, SELECT, WHERE, ORDER BY, GROUP BY, HAVING, COUNT, SUM, AVG, MAX & MIN, and JOIN.

A small project to deepen your knowledge in month 1: Create a relational data model that maps real business processes. Is there a medium-sized bookstore in your city? That is certainly a good scenario to start with.

  1. Think about what data the bookshop manages. For example, books with the data title, author, ISBN (unique identification number), customers with the data name, e-mail, etc.
  2. Now draw a diagram that shows the relationships between the data. A bookstore has several books, which can be from several authors. Customers buy these books at the same time. Think about how this data is connected.
  3. Next, write down which tables you need and which columns each table has. For example, the columns ISBN, title, author and price for the book table. Do this step for all the data you identified in step 1 (a sketch of two additional tables follows the example code below).
  4. Optional: Create the tables with ‘CREATE TABLE nametable ();’ in a SQLite database. You can create a table with the following code.
-- Creating a table with the name of the columns and their data types
CREATE TABLE Books (
    BookID INT PRIMARY KEY,
    Title VARCHAR(100),
    Author VARCHAR(50),
    Price DECIMAL(10, 2)
);
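To capture the relationships from steps 2 and 3, you could add, for example, a Customers table and an Orders table that references both. The column names below are just one possible design:

-- Customers of the bookstore
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    Name VARCHAR(100),
    Email VARCHAR(100)
);

-- Each order links one customer to one book
CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    BookID INT,
    OrderDate DATE,
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID),
    FOREIGN KEY (BookID) REFERENCES Books(BookID)
);
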

What problem does this project solve? With a well thought-out data model, a company can efficiently set up important processes such as tracking customer purchases or managing inventory.

Tools & languages: SQL, SQLite, MySQL or PostgreSQL

Month 2 – Databases and ETL pipelines

Mastering relational DBs and NoSQL databases. Understand the concepts of tables, relationships, normalization, and queries with SQL. Understand what CRUD operations (Create, Read, Update, Delete) are. Learn how to store, organize, and query data efficiently, and understand the advantages of NoSQL over relational databases.

Tools and languages: SQLite, MySQL, PostgreSQL for relational databases; MongoDB or Apache Cassandra for NoSQL databases

Understand the ETL basics. Learn how to extract data from CSV, JSON, or XML files and from APIs, and how to load the cleansed data into a relational database.

A small project for month 2: Create a pipeline that extracts data from a CSV file, transforms it, and loads it into a SQLite database. Implement a simple ETL logic.

  1. Load a CSV file with ‘pd.read_csv()’ and get an overview of the data again. Again, remove missing values and duplicates (see project 1). You can find publicly accessible datasets on Kaggle. For example, search for a dataset with products.
  2. Create a SQLite database and define a table according to the data from the CSV. Below you can see an example code for this. SQLite is easier to get started with, as the SQLite library is available in Python by default (module sqlite3).
  3. Load the cleaned data from the DataFrame into the SQLite database with ‘df.to_sql(‘tablename’, conn, if_exists=’replace’, index=False)’.
  4. Now execute a simple SQL query, e.g. with SELECT and ORDER BY, and limit the results to 5 rows (see the example query after the code below). Close the connection to the database at the end.
import sqlite3

# Create the connection to the SQLite-DB
conn = sqlite3.connect('produkte.db')

# Create the table
conn.execute('''
CREATE TABLE IF NOT EXISTS Produkte (
    ProduktID INTEGER PRIMARY KEY,
    Name TEXT,
    Kategorie TEXT,
    Preis REAL
)
''')
print("Tabelle erstellt.")

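For step 4, as referenced above, a minimal query against the new table might look like the sketch below, assuming the cleaned DataFrame was loaded into the Produkte table created earlier. In Python you would run it with conn.execute(...).fetchall() and then call conn.close().

-- Five most expensive products
SELECT Name, Kategorie, Preis
FROM Produkte
ORDER BY Preis DESC
LIMIT 5;
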
Tools and languages: Python (SQLAlchemy library), SQL

Month 3 – Workflow orchestration and cloud storage

Workflow orchestration. Workflow orchestration means automating and coordinating processes (tasks) in a specific order. Learn how to plan and execute simple workflows, and gain a basic understanding of the DAG (Directed Acyclic Graph) concept. A DAG is the basic structure in Apache Airflow and describes which tasks are executed in a workflow and in which order.

Tools and languages: Apache Airflow

Cloud storage. Learn how to store data in the cloud. Know at least the names of the major products from the biggest cloud providers, such as S3, EC2, and Redshift from AWS; BigQuery, Dataflow, and Cloud Storage from Google Cloud; and Azure Blob Storage, Synapse Analytics, and Azure Data Factory from Azure. The many different products can be overwhelming – start with something you enjoy.

A small project for month 3: Create a simple workflow orchestration concept with Python (without Apache Airflow, which lowers the barrier to getting started) that sends you automated reminders during your daily routine:

  1. Plan the workflow: Define tasks such as reminders to "Drink water", "Exercise for 3 minutes" or "Get some fresh air".
  2. Create a sequence of the tasks (DAG): Decide the order in which the tasks should be executed. Define if they are dependent on each other. For example, Task A ("Drink water") runs first, followed by Task B ("Exercise for 3 minutes"), and so on.
  3. Implement the task in Python: Write a Python function for each reminder (see code snippet 1 below as an example).
  4. Link the tasks: Arrange the functions so that they execute sequentially (see code snippet 2 below as an example).
import os
import time

# Task 1: Send a reminder
def send_reminder():
    print("Reminder: Drink water!")  # Print a reminder message
    time.sleep(1)  # Pause for 1 second before proceeding to the next task
if __name__ == "__main__":
    print("Start Workflow...")  # Indicate the workflow has started

    # Execute tasks in sequence
    send_reminder()  # Task 1: Send a reminder to drink water

    # Additional tasks (uncomment and define these functions if needed)
    # reminder_exercise()  # Example: Send the second reminder
    # create_task_list()    # Advanced-Example: Create a daily task list

    print("Workflow is done!")  # Indicate the workflow has completed

Too easy? Install Apache Airflow and create your first DAG that performs the task of printing out "Hello World" or load your transformed data into an S3 bucket and analyze it locally.

Tools and languages: AWS, Google Cloud, Azure

Implement the 5 projects to learn twice as much as if you only look at the theory.

Month 4 – Introduction to Big Data and Visualization

Big data basics. Understand the basics of Hadoop and Apache Spark. Below you can find a great, super-short video from simplilearn to introduce you to Hadoop and Apache Spark.

Tools and languages: Hadoop, Apache Spark, PySpark (Python API for Apache Spark), Python

Data visualization. Understand the basics of data visualization.

A small project for month 4: To avoid the need for big data tools like Apache Spark or Hadoop while still applying the concepts, download a dataset from Kaggle, analyze it with Python, and visualize the results:

  1. Download a publicly available medium sized dataset from Kaggle (e.g. weather data), read in the dataset with Pandas and get an overview of your data.
  2. Perform a small exploratory data analysis (EDA).
  3. Create e.g. a line chart of average temperatures or a bar chart of rain and sun days per month.

Tools and languages: Python (Matplotlib & Seaborn libraries)

2 prompts to use ChatGPT as a learning partner or tutor

When I learn something new, the two prompts help me to reproduce what I have learned and use ChatGPT to check whether I have understood it. Try it out and see if it helps you too.

  1. I have just learned about the [topic / project] and want to make sure I have understood it correctly. Here is my explanation: [your explanation]. Give me feedback on my explanation. Add anything that is missing or that I have not explained clearly.
  2. I would like to understand the topic [topic/project] better. Here is what I have learned so far: [your explanation]. Are there any mistakes, gaps or tips on how I can explain this even better? Optional: How could I expand the project? What could I learn next?

What comes next?

  • Deepen the concepts from months 1–4.
  • Learn complex SQL queries such as subqueries and database optimization techniques.
  • Understand the principles of data warehouses, data lakes and data lakehouses. Look at tools such as Snowflake, Amazon Redshift, Google BigQuery or Salesforce Data Cloud.
  • Learn CI/CD practices for data engineers.
  • Learn how to prepare data pipelines for machine learning models
  • Deepen your knowledge of cloud platforms – especially in the area of serverless computing (e.g. AWS Lambda)
Own visualization – Illustrations from unDraw.co

Final Thoughts

Companies and individuals are generating more and more data – and the growth continues to accelerate. One reason for this is that we have more and more data from sources such as IoT devices, social media and customer interactions. At the same time, data forms the basis for machine learning models, the importance of which will presumably continue to increase in everyday life. The use of cloud services such as AWS, Google Cloud or Azure is also becoming more widespread. Without well-designed data pipelines and scalable infrastructures, this data can neither be processed efficiently nor used effectively. In addition, in areas such as e-commerce or financial technology, it is becoming increasingly important that we can process data in real-time.

As data engineers, we create the infrastructure so that the data is available for machine learning models and real-time streaming (zero ETL). With the points from this roadmap, you can develop the foundations.

Where can you continue learning?

Scaling Statistics: Incremental Standard Deviation in SQL with dbt (https://towardsdatascience.com/scaling-statistics-incremental-standard-deviation-in-sql-with-dbt-2eb0505aad2b/, 1 Jan 2025): Why scan yesterday's data when you can increment today's?

SQL aggregation functions can be computationally expensive when applied to large datasets. As datasets grow, recalculating metrics over the entire dataset repeatedly becomes inefficient. To address this challenge, incremental aggregation is often employed – a method that involves maintaining a previous state and updating it with new incoming data. While this approach is straightforward for aggregations like COUNT or SUM, the question arises: how can it be applied to more complex metrics like standard deviation?

Standard deviation is a statistical metric that measures the extent of variation or dispersion in a variable’s values relative to its mean. It is derived by taking the square root of the variance. The formula for calculating the variance of a sample is as follows:

Sample variance formula
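In symbols, writing x_i for the observations, n for the sample size, and \bar{x} for the sample mean, this is the standard estimator:

S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2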

Calculating standard deviation can be complex, as it involves updating both the mean and the sum of squared differences across all data points. However, with algebraic manipulation, we can derive a formula for incremental computation – enabling updates using an existing dataset and incorporating new data seamlessly. This approach avoids recalculating from scratch whenever new data is added, making the process much more efficient (A detailed derivation is available on my GitHub).

Derived sample variance formula

The formula essentially breaks down into three parts:

  1. The existing set's weighted variance
  2. The new set’s weighted variance
  3. The mean difference variance, accounting for between-group variance.

This method enables incremental variance computation by retaining the COUNT (n), AVG (µn), and VAR (Sn) of the existing set and combining them with the COUNT (k), AVG (µk), and VAR (Sk) of the new batch. As a result, the updated standard deviation can be calculated efficiently without rescanning the entire dataset.
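Written out, and consistent with the SQL below where n, µn, Sn describe the existing set and k, µk, Sk the new batch, the combined sample variance takes the form:

S_{n+k}^2 = \frac{(n - 1)\, S_n^2}{n + k - 1} + \frac{(k - 1)\, S_k^2}{n + k - 1} + \frac{n \, k \, (\mu_n - \mu_k)^2}{(n + k)(n + k - 1)}

The first two terms are the weighted variances of the existing and new sets, and the last term accounts for the shift between the two means.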

Now that we’ve wrapped our heads around the math behind incremental standard deviation (or at least caught the gist of it), let’s dive into the dbt SQL implementation. In the following example, we’ll walk through how to set up an incremental model to calculate and update these statistics for a user’s transaction data.

Consider a transactions table named stg__transactions, which tracks user transactions (events). Our goal is to create a time-static table, int__user_tx_state, that aggregates the ‘state’ of user transactions. The column details for both tables are provided in the picture below.

Image by the author

To make the process efficient, we aim to update the state table incrementally by combining the new incoming transactions data with the existing aggregated data (i.e. the current user state). This approach allows us to calculate the updated user state without scanning through all historical data.

Image by the author
Image by the author

The code below assumes an understanding of some dbt concepts. If you’re unfamiliar with them, you may still be able to follow the code, although I strongly encourage going through dbt’s incremental guide or reading this awesome post.

We’ll construct a full dbt Sql step by step, aiming to calculate incremental aggregations efficiently without repeatedly scanning the entire table. The process begins by defining the model as incremental in dbt and using unique_key to update existing rows rather than inserting new ones.

-- depends_on: {{ ref('stg__transactions') }}
{{ config(materialized='incremental', unique_key=['USER_ID'], incremental_strategy='merge') }}

Next, we fetch records from the stg__transactions table. The is_incremental block filters transactions with timestamps later than the latest user update, effectively including "only new transactions".

WITH NEW_USER_TX_DATA AS (
    SELECT
        USER_ID,
        TX_ID,
        TX_TIMESTAMP,
        TX_VALUE
    FROM {{ ref('stg__transactions') }}
    {% if is_incremental() %}
      WHERE TX_TIMESTAMP > COALESCE((select max(UPDATED_AT) from {{ this }}), 0::TIMESTAMP_NTZ)
    {% endif %}
)

After retrieving the new transaction records, we aggregate them by user, allowing us to incrementally update each user’s state in the following CTEs.

INCREMENTAL_USER_TX_DATA AS (
    SELECT
        USER_ID,
        MAX(TX_TIMESTAMP) AS UPDATED_AT,
        COUNT(TX_VALUE) AS INCREMENTAL_COUNT,
        AVG(TX_VALUE) AS INCREMENTAL_AVG,
        SUM(TX_VALUE) AS INCREMENTAL_SUM,
        COALESCE(STDDEV(TX_VALUE), 0) AS INCREMENTAL_STDDEV
    FROM
        NEW_USER_TX_DATA
    GROUP BY
        USER_ID
)

Now we get to the heavy part, where we actually calculate the aggregations. When we’re not in incremental mode (i.e. we don’t have any "state" rows yet), we simply select the new aggregations:

NEW_USER_CULMULATIVE_DATA AS (
    SELECT
        NEW_DATA.USER_ID,
        {% if not is_incremental() %}
            NEW_DATA.UPDATED_AT AS UPDATED_AT,
            NEW_DATA.INCREMENTAL_COUNT AS COUNT_TX,
            NEW_DATA.INCREMENTAL_AVG AS AVG_TX,
            NEW_DATA.INCREMENTAL_SUM AS SUM_TX,
            NEW_DATA.INCREMENTAL_STDDEV AS STDDEV_TX
        {% else %}
        ...

But when we’re in incremental mode, we need to join past data and combine it with the new data we created in the INCREMENTAL_USER_TX_DATA CTE based on the formula described above. We start by calculating the new SUM, COUNT and AVG:

  ...
  {% else %}
      COALESCE(EXISTING_USER_DATA.COUNT_TX, 0) AS _n, -- this is n
      NEW_DATA.INCREMENTAL_COUNT AS _k,  -- this is k
      COALESCE(EXISTING_USER_DATA.SUM_TX, 0) + NEW_DATA.INCREMENTAL_SUM AS NEW_SUM_TX,  -- new sum
      COALESCE(EXISTING_USER_DATA.COUNT_TX, 0) + NEW_DATA.INCREMENTAL_COUNT AS NEW_COUNT_TX,  -- new count
      NEW_SUM_TX / NEW_COUNT_TX AS AVG_TX,  -- new avg
   ...

We then calculate the three parts of the variance formula:

  1. The existing weighted variance, which is set to 0 if the previous set contains one item or fewer:
    ...
      CASE
          WHEN _n > 1 THEN (((_n - 1) / (NEW_COUNT_TX - 1)) * POWER(COALESCE(EXISTING_USER_DATA.STDDEV_TX, 0), 2))
          ELSE 0
      END AS EXISTING_WEIGHTED_VARIANCE,  -- existing weighted variance
    ...
  2. The incremental weighted variance, computed in the same way:
    ...
      CASE
          WHEN _k > 1 THEN (((_k - 1) / (NEW_COUNT_TX - 1)) * POWER(NEW_DATA.INCREMENTAL_STDDEV, 2))
          ELSE 0
      END AS INCREMENTAL_WEIGHTED_VARIANCE,  -- incremental weighted variance
    ...
  3. The mean difference variance, as outlined earlier, along with the SQL join terms to include past data.
    ...
      POWER((COALESCE(EXISTING_USER_DATA.AVG_TX, 0) - NEW_DATA.INCREMENTAL_AVG), 2) AS MEAN_DIFF_SQUARED,
      CASE
          WHEN NEW_COUNT_TX = 1 THEN 0
          ELSE (_n * _k) / (NEW_COUNT_TX * (NEW_COUNT_TX - 1))
      END AS BETWEEN_GROUP_WEIGHT,  -- between group weight
      BETWEEN_GROUP_WEIGHT * MEAN_DIFF_SQUARED AS MEAN_DIFF_VARIANCE,  -- mean diff variance
      EXISTING_WEIGHTED_VARIANCE + INCREMENTAL_WEIGHTED_VARIANCE + MEAN_DIFF_VARIANCE AS VARIANCE_TX,
      CASE
          WHEN _n = 0 THEN NEW_DATA.INCREMENTAL_STDDEV -- no "past" data
          WHEN _k = 0 THEN EXISTING_USER_DATA.STDDEV_TX -- no "new" data
          ELSE SQRT(VARIANCE_TX)  -- stddev (which is the root of variance)
      END AS STDDEV_TX,
      NEW_DATA.UPDATED_AT AS UPDATED_AT,
      NEW_SUM_TX AS SUM_TX,
      NEW_COUNT_TX AS COUNT_TX
  {% endif %}
    FROM
        INCREMENTAL_USER_TX_DATA new_data
    {% if is_incremental() %}
    LEFT JOIN
        {{ this }} EXISTING_USER_DATA
    ON
        NEW_DATA.USER_ID = EXISTING_USER_DATA.USER_ID
    {% endif %}
)

Finally, we select the table’s columns, accounting for both incremental and non-incremental cases:

SELECT
    USER_ID,
    UPDATED_AT,
    COUNT_TX,
    SUM_TX,
    AVG_TX,
    STDDEV_TX
FROM NEW_USER_CULMULATIVE_DATA

By combining all these steps, we arrive at the final SQL model:

-- depends_on: {{ ref('stg__transactions') }}
{{ config(materialized='incremental', unique_key=['USER_ID'], incremental_strategy='merge') }}
WITH NEW_USER_TX_DATA AS (
    SELECT
        USER_ID,
        TX_ID,
        TX_TIMESTAMP,
        TX_VALUE
    FROM {{ ref('stg__transactions') }}
    {% if is_incremental() %}
      WHERE TX_TIMESTAMP > COALESCE((select max(UPDATED_AT) from {{ this }}), 0::TIMESTAMP_NTZ)
    {% endif %}
),
INCREMENTAL_USER_TX_DATA AS (
    SELECT
        USER_ID,
        MAX(TX_TIMESTAMP) AS UPDATED_AT,
        COUNT(TX_VALUE) AS INCREMENTAL_COUNT,
        AVG(TX_VALUE) AS INCREMENTAL_AVG,
        SUM(TX_VALUE) AS INCREMENTAL_SUM,
        COALESCE(STDDEV(TX_VALUE), 0) AS INCREMENTAL_STDDEV
    FROM
        NEW_USER_TX_DATA
    GROUP BY
        USER_ID
),

NEW_USER_CULMULATIVE_DATA AS (
    SELECT
        NEW_DATA.USER_ID,
        {% if not is_incremental() %}
            NEW_DATA.UPDATED_AT AS UPDATED_AT,
            NEW_DATA.INCREMENTAL_COUNT AS COUNT_TX,
            NEW_DATA.INCREMENTAL_AVG AS AVG_TX,
            NEW_DATA.INCREMENTAL_SUM AS SUM_TX,
            NEW_DATA.INCREMENTAL_STDDEV AS STDDEV_TX
        {% else %}
            COALESCE(EXISTING_USER_DATA.COUNT_TX, 0) AS _n, -- this is n
            NEW_DATA.INCREMENTAL_COUNT AS _k,  -- this is k
            COALESCE(EXISTING_USER_DATA.SUM_TX, 0) + NEW_DATA.INCREMENTAL_SUM AS NEW_SUM_TX,  -- new sum
            COALESCE(EXISTING_USER_DATA.COUNT_TX, 0) + NEW_DATA.INCREMENTAL_COUNT AS NEW_COUNT_TX,  -- new count
            NEW_SUM_TX / NEW_COUNT_TX AS AVG_TX,  -- new avg
            CASE
                WHEN _n > 1 THEN (((_n - 1) / (NEW_COUNT_TX - 1)) * POWER(COALESCE(EXISTING_USER_DATA.STDDEV_TX, 0), 2))
                ELSE 0
            END AS EXISTING_WEIGHTED_VARIANCE,  -- existing weighted variance
            CASE
                WHEN _k > 1 THEN (((_k - 1) / (NEW_COUNT_TX - 1)) * POWER(NEW_DATA.INCREMENTAL_STDDEV, 2))
                ELSE 0
            END AS INCREMENTAL_WEIGHTED_VARIANCE,  -- incremental weighted variance
            POWER((COALESCE(EXISTING_USER_DATA.AVG_TX, 0) - NEW_DATA.INCREMENTAL_AVG), 2) AS MEAN_DIFF_SQUARED,
            CASE
                WHEN NEW_COUNT_TX = 1 THEN 0
                ELSE (_n * _k) / (NEW_COUNT_TX * (NEW_COUNT_TX - 1))
            END AS BETWEEN_GROUP_WEIGHT,  -- between group weight
            BETWEEN_GROUP_WEIGHT * MEAN_DIFF_SQUARED AS MEAN_DIFF_VARIANCE,
            EXISTING_WEIGHTED_VARIANCE + INCREMENTAL_WEIGHTED_VARIANCE + MEAN_DIFF_VARIANCE AS VARIANCE_TX,
            CASE
                WHEN _n = 0 THEN NEW_DATA.INCREMENTAL_STDDEV -- no "past" data
                WHEN _k = 0 THEN EXISTING_USER_DATA.STDDEV_TX -- no "new" data
                ELSE SQRT(VARIANCE_TX)  -- stddev (which is the root of variance)
            END AS STDDEV_TX,
            NEW_DATA.UPDATED_AT AS UPDATED_AT,
            NEW_SUM_TX AS SUM_TX,
            NEW_COUNT_TX AS COUNT_TX
        {% endif %}
    FROM
        INCREMENTAL_USER_TX_DATA new_data
    {% if is_incremental() %}
    LEFT JOIN
        {{ this }} EXISTING_USER_DATA
    ON
        NEW_DATA.USER_ID = EXISTING_USER_DATA.USER_ID
    {% endif %}
)

SELECT
    USER_ID,
    UPDATED_AT,
    COUNT_TX,
    SUM_TX,
    AVG_TX,
    STDDEV_TX
FROM NEW_USER_CULMULATIVE_DATA

Throughout this process, we demonstrated how to handle both non-incremental and incremental modes effectively, leveraging mathematical techniques to update metrics like variance and standard deviation efficiently. By combining historical and new data seamlessly, we achieved an optimized, scalable approach for real-time data aggregation.

In this article, we explored the mathematical technique for incrementally calculating standard deviation and how to implement it using dbt’s incremental models. This approach proves to be highly efficient, enabling the processing of large datasets without the need to re-scan them in full. In practice, this leads to faster, more scalable systems that can handle real-time updates efficiently. If you’d like to discuss this further or share your thoughts, feel free to reach out – I’d love to hear from you!

The post Scaling Statistics: Incremental Standard Deviation in SQL with dbt appeared first on Towards Data Science.

]]>
Measuring Cross-Product Adoption Using dbt_set_similarity https://towardsdatascience.com/measuring-cross-product-adoption-using-dbt-set-similarity-fdf7c1f88bc2/ Sat, 28 Dec 2024 13:02:07 +0000 https://towardsdatascience.com/measuring-cross-product-adoption-using-dbt-set-similarity-fdf7c1f88bc2/ Enhancing cross-product insights within dbt workflows

The post Measuring Cross-Product Adoption Using dbt_set_similarity appeared first on Towards Data Science.

]]>
Introduction

For multi-product companies, one critical metric is what is often called "cross-product adoption", i.e. understanding how users engage with multiple offerings in a given product portfolio.

One measure suggested to calculate cross-product or cross-feature usage in the popular book Hacking Growth [1] is the Jaccard Index. Traditionally used to measure the similarity between two sets, the Jaccard Index can also serve as a powerful tool for assessing product adoption patterns. It does this by quantifying the overlap in users between products, from which one can identify cross-product synergies and growth opportunities.

The dbt package dbt_set_similarity is designed to simplify the calculation of set similarity metrics directly within an analytics workflow. This package provides a method to calculate the Jaccard Indices within SQL transformational layers.

To import this package into your dbt project, add the following to the packages.yml file. We will also need dbt_utils for the purposes of this article’s example. Run a dbt deps command within your project to install the packages.

packages:
  - package: Matts52/dbt_set_similarity
    version: 0.1.1
  - package: dbt-labs/dbt_utils
    version: 1.3.0

The Jaccard Index

The Jaccard Index, also known as the Jaccard Similarity Coefficient, is a metric used to measure the similarity between two sets. It is defined as the size of the intersection of the sets divided by the size of their union.

Mathematically, it can be expressed as:

The Jaccard Index represents the "Intersection" over the "Union" of two sets (image by author)
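Since the formula is shown as an image in the original post, here it is transcribed:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}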

Where:

  • A and B are two sets (e.g. users of product A and product B)
  • The numerator represents the number of elements in both sets
  • The denominator represents the total number of distinct elements across both sets
(image by author)

The Jaccard Index is particularly useful in the context of cross-product adoption because:

  • It focuses on the overlap between two sets, making it ideal for understanding shared user bases
  • It accounts for differences in the total size of the sets, ensuring that results are proportional and not skewed by outliers

For example:

  • If 100 users adopt Product A and 50 adopt Product B, with 25 users adopting both, the Jaccard Index is 25 / (100 + 50 - 25) = 0.2, indicating a 20% overlap between the two user bases.
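To make the arithmetic above concrete, here is a minimal plain-Python sketch of the same calculation (illustrative only – the dbt package computes this in SQL, and the user IDs below are made up):

def jaccard_index(a: set, b: set) -> float:
    # Size of the intersection divided by the size of the union
    union = a | b
    return len(a & b) / len(union) if union else 0.0

product_a_users = set(range(1, 101))   # 100 hypothetical users adopted product A
product_b_users = set(range(76, 126))  # 50 hypothetical users adopted product B (25 overlap with A)
print(jaccard_index(product_a_users, product_b_users))  # 0.2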

Example Data

The example dataset we will be using is from a fictional SaaS company which offers storage space as a product for consumers. This company provides two distinct storage products: document storage (doc_storage) and photo storage (photo_storage). These fields are either true, indicating the product has been adopted, or false, indicating it has not.

Additionally, the demographics (user_category) that this company serves are either tech enthusiasts or homeowners.

For the sake of this example, we will read this csv file in as a "seed" model named seed_example within the dbt project.

Simple Cross-Product Adoption

Now, let’s say we want to calculate the jaccard index (cross-adoption) between our document storage and photo storage products. First, we need to create an array (list) of the users who have the document storage product, alongside an array of the users who have the photo storage product. In the second cte, we apply the jaccard_coef function from the dbt_set_similarity package to help us easily compute the jaccard coefficient between the two arrays of user id’s.

with product_users as (
    select
        array_agg(user_id) filter (where doc_storage = true)
            as doc_storage_users,
        array_agg(user_id) filter (where photo_storage = true)
            as photo_storage_users
    from {{ ref('seed_example') }}
)

select
    doc_storage_users,
    photo_storage_users,
    {{
        dbt_set_similarity.jaccard_coef(
            'doc_storage_users',
            'photo_storage_users'
        )
    }} as cross_product_jaccard_coef
from product_users
Output from the above dbt model (image by author)

As we can see, 60% of the users who have adopted either of the products have adopted both. We can graphically verify our result by placing the user ID sets into a Venn diagram, where we see three users have adopted both products amongst five total users: 3/5 = 0.6.

What the collection of user IDs and product adoption would look like, verifying our result (image by author)

Segmented Cross-Product Adoption

Using the dbt_set_similarity package, creating segmented Jaccard indices for our different user categories is fairly natural. We will follow the same pattern as before; however, we will simply group our aggregations by user category.

with product_users as (
    select
        user_category,
        array_agg(user_id) filter (where doc_storage = true)
            as doc_storage_users,
        array_agg(user_id) filter (where photo_storage = true)
            as photo_storage_users
    from {{ ref('seed_example') }}
    group by user_category
)

select
    user_category,
    doc_storage_users,
    photo_storage_users,
    {{
        dbt_set_similarity.jaccard_coef(
            'doc_storage_users',
            'photo_storage_users'
        )
    }} as cross_product_jaccard_coef
from product_users
Output from the above dbt model (image by author)

From the output, cross-product adoption is higher amongst homeowners when considering Jaccard indices. As shown above, all homeowners who have adopted one of the products have adopted both. Meanwhile, only one-third of the tech enthusiasts who have adopted one product have adopted both. Thus, in our very small dataset, cross-product adoption is higher amongst homeowners than amongst tech enthusiasts.

We can graphically verify the output by again creating Venn diagrams:

Venn diagrams split by the two segments (image by author)

Conclusion

dbt_set_similarity provides a straightforward and efficient way to calculate cross-product adoption metrics such as the Jaccard Index directly within a dbt workflow. By applying this method, multi-product companies can gain valuable insights into user behavior and adoption patterns across their product portfolio. In our example, we demonstrated the calculation of overall cross-product adoption as well as segmented adoption for distinct user categories.

Using the package for cross-product adoption is simply one straightforward application. In reality, there exist countless other potential applications of this technique, for example:

  • Feature usage analysis
  • Marketing campaign impact analysis
  • Support analysis

Additionally, this style of analysis is certainly not limited to just SaaS, but can apply to virtually any industry. Happy Jaccard-ing!

References

[1] Sean Ellis and Morgan Brown, Hacking Growth (2017)

Resources

dbt package hub

The post Measuring Cross-Product Adoption Using dbt_set_similarity appeared first on Towards Data Science.

]]>
From Prototype to Production: Enhancing LLM Accuracy https://towardsdatascience.com/from-prototype-to-production-enhancing-llm-accuracy-791d79b0af9b/ Thu, 19 Dec 2024 20:32:55 +0000 https://towardsdatascience.com/from-prototype-to-production-enhancing-llm-accuracy-791d79b0af9b/ Implementing evaluation frameworks to optimize accuracy in real-world applications

The post From Prototype to Production: Enhancing LLM Accuracy appeared first on Towards Data Science.

]]>
Building a prototype for an LLM application is surprisingly straightforward. You can often create a functional first version within just a few hours. This initial prototype will likely provide results that look legitimate and be a good tool to demonstrate your approach. However, this is usually not enough for production use.

LLMs are probabilistic by nature, as they generate tokens based on the distribution of likely continuations. This means that in many cases, we get the answer close to the "correct" one from the distribution. Sometimes, this is acceptable – for example, it doesn’t matter whether the app says "Hello, John!" or "Hi, John!". In other cases, the difference is critical, such as between "The revenue in 2024 was 20M USD" and "The revenue in 2024 was 20M GBP".

In many real-world business scenarios, precision is crucial, and "almost right" isn’t good enough – for example, when your LLM application needs to execute API calls or summarize financial reports. From my experience, ensuring the accuracy and consistency of results is far more complex and time-consuming than building the initial prototype.

In this article, I will discuss how to approach measuring and improving accuracy. We’ll build an SQL Agent where precision is vital for ensuring that queries are executable. Starting with a basic prototype, we’ll explore methods to measure accuracy and test various techniques to enhance it, such as self-reflection and retrieval-augmented generation (RAG).

Setup

As usual, let’s begin with the setup. The core components of our SQL agent solution are the LLM model, which generates queries, and the SQL database, which executes them.

LLM model – Llama

For this project, we will use an open-source Llama model released by Meta. I’ve chosen Llama 3.1 8B because it is lightweight enough to run on my laptop while still being quite powerful (refer to the documentation for details).

If you haven’t installed it yet, you can find guides here. I use it locally on MacOS via Ollama. Using the following command, we can download the model.

ollama pull llama3.1:8b

We will use Ollama with LangChain, so let’s start by installing the required package.

pip install -qU langchain_ollama 

Now, we can run the Llama model and see the first results.

from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="llama3.1:8b")
llm.invoke("How are you?")
# I'm just a computer program, so I don't have feelings or emotions 
# like humans do. I'm functioning properly and ready to help with 
# any questions or tasks you may have! How can I assist you today?

We would like to pass a system message alongside customer questions. So, following the Llama 3.1 model documentation, let’s put together a helper function to construct a prompt and test this function.

def get_llama_prompt(user_message, system_message=""):
  system_prompt = ""
  if system_message != "":
    system_prompt = (
      f"<|start_header_id|>system<|end_header_id|>\n\n{system_message}"
      f"<|eot_id|>"
    )
  prompt = (f"<|begin_of_text|>{system_prompt}"
            f"<|start_header_id|>user<|end_header_id|>\n\n"
            f"{user_message}"
            f"<|eot_id|>"
            f"<|start_header_id|>assistant<|end_header_id|>\n\n"
           )
  return prompt

system_prompt = '''
You are Rudolph, the spirited reindeer with a glowing red nose, 
bursting with excitement as you prepare to lead Santa's sleigh 
through snowy skies. Your joy shines as brightly as your nose, 
eager to spread Christmas cheer to the world!
Please, answer questions concisely in 1-2 sentences.
'''
prompt = get_llama_prompt('How are you?', system_prompt)
llm.invoke(prompt)

# I'm feeling jolly and bright, ready for a magical night! 
# My shiny red nose is glowing brighter than ever, just perfect 
# for navigating through the starry skies. 

The new system prompt has changed the answer significantly, so it works. With this, our local LLM setup is ready to go.

Database – ClickHouse

I will use an open-source database ClickHouse. I’ve chosen ClickHouse because it has a specific SQL dialect. LLMs have likely encountered fewer examples of this dialect during training, making the task a bit more challenging. However, you can choose any other database.

Installing ClickHouse is pretty straightforward – just follow the instructions provided in the documentation.

We will be working with two tables: ecommerce.users and ecommerce.sessions. These tables contain fictional data, including customer personal information and their session activity on the e-commerce website.

You can find the code for generating synthetic data and uploading it on GitHub.

With that, the setup is complete, and we’re ready to move on to building the basic prototype.

The first prototype

As discussed, our goal is to build an SQL Agent – an application that generates SQL queries to answer customer questions. In the future, we can add another layer to this system: executing the SQL query, passing both the initial question and the database results back to the LLM, and asking it to generate a human-friendly answer. However, for this article, we’ll focus on the first step.

The best practice with LLM applications (similar to any other complex tasks) is to start simple and then iterate. The most straightforward implementation is to do one LLM call and share all the necessary information (such as schema description) in the system prompt. So, the first step is to put together the prompt.

generate_query_system_prompt = '''
You are a senior data analyst with more than 10 years of experience writing complex SQL queries. 
There are two tables in the database with the following schemas. 

Table: ecommerce.users 
Description: customers of the online shop
Fields: 
- user_id (integer) - unique identifier of customer, for example, 1000004 or 3000004
- country (string) - country of residence, for example, "Netherlands" or "United Kingdom"
- is_active (integer) - 1 if customer is still active and 0 otherwise
- age (integer) - customer age in full years, for example, 31 or 72

Table: ecommerce.sessions 
Description: sessions of usage the online shop
Fields: 
- user_id (integer) - unique identifier of customer, for example, 1000004 or 3000004
- session_id (integer) - unique identifier of session, for example, 106 or 1023
- action_date (date) - session start date, for example, "2021-01-03" or "2024-12-02"
- session_duration (integer) - duration of session in seconds, for example, 125 or 49
- os (string) - operation system that customer used, for example, "Windows" or "Android"
- browser (string) - browser that customer used, for example, "Chrome" or "Safari"
- is_fraud (integer) - 1 if session is marked as fraud and 0 otherwise
- revenue (float) - income in USD (the sum of purchased items), for example, 0.0 or 1506.7

Write a query in ClickHouse SQL to answer the following question. 
Add "format TabSeparatedWithNames" at the end of the query to get data from ClickHouse database in the right format. 
'''

I’ve included the example values for each field in the prompt to ensure that LLM understands the data format.

And that’s it! With this, we have our first functional prototype for the SQL Agent. Now, it’s time to put it to the test.

prompt = get_llama_prompt('How many customers made purchase in December 2024?', 
  generate_query_system_prompt)
llm.invoke(prompt)

# To find out how many customers made a purchase in December 2024, 
# we need to join the `sessions` table with the `users` table on the `user_id`, 
# and then filter by `action_date` for sessions that took place in December 2024. 

# Here is the query:
# 
# ```sql
# SELECT COUNT(DISTINCT u.user_id)
# FROM ecommerce.sessions s
# JOIN ecommerce.users u ON s.user_id = u.user_id
# WHERE 
#     EXTRACT(YEAR FROM s.action_date) = 2024
#     AND EXTRACT(MONTH FROM s.action_date) = 12
#     AND revenue > 0;
# ```

# This query first joins the `sessions` table with the `users` table 
# on the `user_id`. Then it filters out sessions where no purchase 
# was made (revenue is greater than 0). Finally, it selects the distinct 
# user IDs of these customers.

# When you run this query in ClickHouse SQL, you can add "format TabSeparatedWithNames" 
# at the end to get the result in the right format:

# ```sql
# SELECT COUNT(DISTINCT u.user_id)
# FROM ecommerce.sessions s
# JOIN ecommerce.users u ON s.user_id = u.user_id
# WHERE 
#     EXTRACT(YEAR FROM s.action_date) = 2024
#     AND EXTRACT(MONTH FROM s.action_date) = 12
#     AND revenue > 0;
# format TabSeparatedWithNames;
# ```

The agent produced a fairly decent result, but there’s one issue – the LLM returned not only the SQL query but also some commentary. Since we plan to execute SQL queries later, this format is not suitable for our task. Let’s work on fixing it.

Fortunately, this problem has already been solved, and we don’t need to parse the SQL queries from the text manually. We can use the chat model ChatOllama. Unfortunately, it doesn’t support structured output, but we can leverage tool calling to achieve the same result.

To do this, we will define a dummy tool to execute the query and instruct the model in the system prompt always to call this tool. I’ve kept the comments in the output to give the model some space for reasoning, following the chain-of-thought pattern.

from langchain_ollama import ChatOllama
from langchain_core.tools import tool

@tool
def execute_query(comments: str, query: str) -> str:
  """Excutes SQL query.

  Args:
      comments (str): 1-2 sentences describing the result SQL query 
          and what it does to answer the question,
      query (str): SQL query
  """
  pass 

chat_llm = ChatOllama(model="llama3.1:8b").bind_tools([execute_query])
result = chat_llm.invoke(prompt)
print(result.tool_calls)

# [{'name': 'execute_query',
#   'args': {'comments': 'SQL query returns number of customers who made a purchase in December 2024. The query joins the sessions and users tables based on user ID to filter out inactive customers and find those with non-zero revenue in December 2024.',
#   'query': 'SELECT COUNT(DISTINCT T2.user_id) FROM ecommerce.sessions AS T1 INNER JOIN ecommerce.users AS T2 ON T1.user_id = T2.user_id WHERE YEAR(T1.action_date) = 2024 AND MONTH(T1.action_date) = 12 AND T2.is_active = 1 AND T1.revenue > 0'},
#   'type': 'tool_call'}]

With the tool calling, we can now get the SQL query directly from the model. That’s an excellent result. However, the generated query is not entirely accurate:

  • It includes a filter for is_active = 1, even though we didn’t specify the need to filter out inactive customers.
  • The LLM missed specifying the format despite our explicit request in the system prompt.

Clearly, we need to focus on improving the model’s accuracy. But as Peter Drucker famously said, "You can’t improve what you don’t measure." So, the next logical step is to build a system for evaluating the model’s quality. This system will be a cornerstone for performance improvement iterations. Without it, we’d essentially be navigating in the dark.

Evaluating the accuracy

Evaluation basics

To ensure we’re improving, we need a robust way to measure accuracy. The most common approach is to create a "golden" evaluation set with questions and correct answers. Then, we can compare the model’s output with these "golden" answers and calculate the share of correct ones. While this approach sounds simple, there are a few nuances worth discussing.

First, you might feel overwhelmed at the thought of creating a comprehensive set of questions and answers. Building such a dataset can seem like a daunting task, potentially requiring weeks or months. However, we can start small by creating an initial set of 20–50 examples and iterating on it.

As always, quality is more important than quantity. Our goal is to create a representative and diverse dataset. Ideally, this should include:

  • Common questions. In most real-life cases, we can take the history of actual questions and use it as our initial evaluation set.
  • Challenging edge cases. It’s worth adding examples where the model tends to hallucinate. You can find such cases either while experimenting yourself or by gathering feedback from the first prototype.

Once the dataset is ready, the next challenge is how to score the generated results. We can consider several approaches:

  • Comparing SQL queries. The first idea is to compare the generated SQL query with the one in the evaluation set. However, it might be tricky. Similarly-looking queries can yield completely different results. At the same time, queries that look different can lead to the same conclusions. Additionally, simply comparing SQL queries doesn’t verify whether the generated query is actually executable. Given these challenges, I wouldn’t consider this approach the most reliable solution for our case.
  • Exact matches. We can use old-school exact matching when answers in our evaluation set are deterministic. For example, if the question is, "How many customers are there?" and the answer is "592800", the model’s response must match precisely. However, this approach has its limitations. Consider the example above, and the model responds, "There are 592,800 customers". While the answer is absolutely correct, an exact match approach would flag it as invalid.
  • Using LLMs for scoring. A more robust and flexible approach is to leverage LLMs for evaluation. Instead of focusing on query structure, we can ask the LLM to compare the results of SQL executions. This method is particularly effective in cases where the query might differ but still yields correct outputs.

It’s worth keeping in mind that evaluation isn’t a one-time task; it’s a continuous process. To push our model’s performance further, we need to expand the dataset with examples causing the model’s hallucinations. In production mode, we can create a feedback loop. By gathering input from users, we can identify cases where the model fails and include them in our evaluation set.

In our example, we will be assessing only whether the result of execution is valid (SQL query can be executed) and correct. Still, you can look at other parameters as well. For example, if you care about efficiency, you can compare the execution times of generated queries against those in the golden set.
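For instance, a minimal timing sketch (an illustration only, assuming the get_clickhouse_data helper defined later in this article) might look like this:

import time

def timed_query(query):
    # Returns the query output together with its wall-clock execution time in seconds
    start = time.monotonic()
    output = get_clickhouse_data(query)
    return output, time.monotonic() - start

# e.g. inside the evaluation loop:
# _, golden_seconds = timed_query(rec['sql_query'])
# _, generated_seconds = timed_query(rec['generated_query'])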

Evaluation set and validation

Now that we’ve covered the basics, we’re ready to put them into practice. I spent about 20 minutes putting together a set of 10 examples. While small, this set is sufficient for our toy task. It consists of a list of questions paired with their corresponding SQL queries, like this:

[
  {
    "question": "How many customers made purchase in December 2024?",
    "sql_query": "select uniqExact(user_id) as customers from ecommerce.sessions where (toStartOfMonth(action_date) = '2024-12-01') and (revenue > 0) format TabSeparatedWithNames"
  },
  {
    "question": "What was the fraud rate in 2023, expressed as a percentage?",
    "sql_query": "select 100*uniqExactIf(user_id, is_fraud = 1)/uniqExact(user_id) as fraud_rate from ecommerce.sessions where (toStartOfYear(action_date) = '2023-01-01') format TabSeparatedWithNames"
  },
  ...
]

You can find the full list on GitHub – link.

We can load the dataset into a DataFrame, making it ready for use in the code.

import json
import pandas as pd  # needed for the DataFrames below
with open('golden_set.json', 'r') as f:
  golden_set = json.loads(f.read())

golden_df = pd.DataFrame(golden_set) 
golden_df['id'] = list(range(golden_df.shape[0]))

First, let’s generate the SQL queries for each question in the evaluation set.

def generate_query(question):
  prompt = get_llama_prompt(question, generate_query_system_prompt)
  result = chat_llm.invoke(prompt)
  try:
    generated_query = result.tool_calls[0]['args']['query']
  except:
    generated_query = ''
  return generated_query

import tqdm

tmp = []
for rec in tqdm.tqdm(golden_df.to_dict('records')):
  generated_query = generate_query(rec['question'])
  tmp.append(
    {
      'id': rec['id'],
      'generated_query': generated_query
    }
  )

eval_df = golden_df.merge(pd.DataFrame(tmp))

Before moving on to the LLM-based scoring of query outputs, it’s important to first ensure that the SQL query is valid. To do this, we need to execute the queries and examine the database output.

I’ve created a function that runs a query in ClickHouse. It also ensures that the output format is correctly specified, as this may be critical in business applications.

CH_HOST = 'http://localhost:8123' # default address 
import requests
import io

def get_clickhouse_data(query, host = CH_HOST, connection_timeout = 1500):
  # pushing model to return data in the format that we want
  if not 'format tabseparatedwithnames' in query.lower():
    return "Database returned the following error:n Please, specify the output format."

  r = requests.post(host, params = {'query': query}, 
    timeout = connection_timeout)
  if r.status_code == 200:
    return r.text
  else: 
    return 'Database returned the following error:\n' + r.text
    # giving feedback to LLM instead of raising exception

The next step is to execute both the generated and golden queries and then save their outputs.

tmp = []

for rec in tqdm.tqdm(eval_df.to_dict('records')):
  golden_output = get_clickhouse_data(rec['sql_query'])
  generated_output = get_clickhouse_data(rec['generated_query'])

  tmp.append(
    {
      'id': rec['id'],
      'golden_output': golden_output,
      'generated_output': generated_output
    }
  )

eval_df = eval_df.merge(pd.DataFrame(tmp))

Next, let’s check the output to see whether the SQL query is valid or not.

def is_valid_output(s):
  if s.startswith('Database returned the following error:'):
    return 'error'
  if len(s.strip().split('\n')) >= 1000:
    return 'too many rows'
  return 'ok'

eval_df['golden_output_valid'] = eval_df.golden_output.map(is_valid_output)
eval_df['generated_output_valid'] = eval_df.generated_output.map(is_valid_output)

Then, we can evaluate the SQL validity for both the golden and generated sets.
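A minimal way to inspect these validity shares (a small pandas sketch, not from the original notebook) is:

print(eval_df['golden_output_valid'].value_counts(normalize=True))
print(eval_df['generated_output_valid'].value_counts(normalize=True))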

The initial results are not very promising; the LLM was unable to generate even a single valid query. Looking at the errors, it’s clear that the model failed to specify the right format despite it being explicitly defined in the system prompt. So, we definitely need to work more on the accuracy.

Checking the correctness

However, validity alone is not enough. It’s crucial that we not only generate valid SQL queries but also produce the correct results. Although we already know that all our queries are invalid, let’s now incorporate output evaluation into our process.

As discussed, we will use LLMs to compare the outputs of the SQL queries. I typically prefer using a more powerful model for evaluation, following the day-to-day logic where a senior team member reviews the work. For this task, I’ve chosen OpenAI GPT-4o-mini.

Similar to our generation flow, I’ve set up all the building blocks necessary for accuracy assessment.

from langchain_openai import ChatOpenAI

accuracy_system_prompt = '''
You are a senior and very diligent QA specialist and your task is to compare data in datasets. 
They are similar if they are almost identical, or if they convey the same information. 
Disregard if column names specified in the first row have different names or in a different order.
Focus on comparing the actual information (numbers). If values in datasets are different, then it means that they are not identical.
Always execute tool to provide results.
'''

@tool
def compare_datasets(comments: str, score: int) -> str:
  """Stores info about datasets.
  Args:
      comments (str): 1-2 sentences about the comparison of datasets,
      score (int): 0 if dataset provides different values and 1 if it shows identical information
  """
  pass

accuracy_chat_llm = ChatOpenAI(model="gpt-4o-mini", temperature = 0.0).bind_tools([compare_datasets])

accuracy_question_tmp = '''
Here are the two datasets to compare delimited by ####
Dataset #1: 
####
{dataset1}
####
Dataset #2: 
####
{dataset2}
####
'''

def get_openai_prompt(question, system):
  messages = [
    ("system", system),
    ("human", question)
  ]
  return messages

Now, it’s time to test the accuracy assessment process.

prompt = get_openai_prompt(accuracy_question_tmp.format(
  dataset1 = 'customers\n114032\n', dataset2 = 'customers\n114031\n'),
  accuracy_system_prompt)

accuracy_result = accuracy_chat_llm.invoke(prompt)
accuracy_result.tool_calls[0]['args']
# {'comments': 'The datasets contain different customer counts: 114032 in Dataset #1 and 114031 in Dataset #2.',
#  'score': 0}

prompt = get_openai_prompt(accuracy_question_tmp.format(
  dataset1 = 'users\n114032\n', dataset2 = 'customers\n114032\n'),
  accuracy_system_prompt)
accuracy_result = accuracy_chat_llm.invoke(prompt)
accuracy_result.tool_calls[0]['args']
# {'comments': 'The datasets contain the same numerical value (114032) despite different column names, indicating they convey identical information.',
#  'score': 1}

Fantastic! It looks like everything is working as expected. Let’s now encapsulate this into a function.

def is_answer_accurate(output1, output2):
  prompt = get_openai_prompt(
    accuracy_question_tmp.format(dataset1 = output1, dataset2 = output2),
    accuracy_system_prompt
  )

  accuracy_result = accuracy_chat_llm.invoke(prompt)

  try:
    return accuracy_result.tool_calls[0]['args']['score']
  except:
    return None

Putting the evaluation approach together

As we discussed, building an LLM application is an iterative process, so we’ll need to run our accuracy assessment multiple times. It will be helpful to have all this logic encapsulated in a single function.

The function will take two arguments as input:

  • generate_query_func: a function that generates an SQL query for a given question.
  • golden_df: an evaluation dataset with questions and correct answers in the form of a pandas DataFrame.

As output, the function will return a DataFrame with all evaluation results and a couple of charts displaying the main KPIs.


import plotly.express as px  # used for the evaluation charts below

def evaluate_sql_agent(generate_query_func, golden_df):

  # generating SQL
  tmp = []
  for rec in tqdm.tqdm(golden_df.to_dict('records')):
    generated_query = generate_query_func(rec['question'])
    tmp.append(
      {
          'id': rec['id'],
          'generated_query': generated_query
      }
    )

  eval_df = golden_df.merge(pd.DataFrame(tmp))

  # executing SQL queries
  tmp = []
  for rec in tqdm.tqdm(eval_df.to_dict('records')):
    golden_output = get_clickhouse_data(rec['sql_query'])
    generated_output = get_clickhouse_data(rec['generated_query'])

    tmp.append(
      {
        'id': rec['id'],
        'golden_output': golden_output,
        'generated_output': generated_output
      }
    )

  eval_df = eval_df.merge(pd.DataFrame(tmp))

  # checking accuracy
  eval_df['golden_output_valid'] = eval_df.golden_output.map(is_valid_output)
  eval_df['generated_output_valid'] = eval_df.generated_output.map(is_valid_output)

  eval_df['correct_output'] = list(map(
    is_answer_accurate,
    eval_df['golden_output'],
    eval_df['generated_output']
  ))

  eval_df['accuracy'] = list(map(
    lambda x, y: 'invalid: ' + x if x != 'ok' else ('correct' if y == 1 else 'incorrect'),
    eval_df.generated_output_valid,
    eval_df.correct_output
  ))

  valid_stats_df = (eval_df.groupby('golden_output_valid')[['id']].count().rename(columns = {'id': 'golden set'}).join(
    eval_df.groupby('generated_output_valid')[['id']].count().rename(columns = {'id': 'generated'}), how = 'outer')).fillna(0).T

  fig1 = px.bar(
    valid_stats_df.apply(lambda x: 100*x/valid_stats_df.sum(axis = 1)),
    orientation = 'h', 
    title = '<b>LLM SQL Agent evaluation</b>: query validity',
    text_auto = '.1f',
    color_discrete_map = {'ok': '#00b38a', 'error': '#ea324c', 'too many rows': '#f2ac42'},
    labels = {'index': '', 'variable': 'validity', 'value': 'share of queries, %'}
  )
  fig1.show()

  accuracy_stats_df = eval_df.groupby('accuracy')[['id']].count()
  accuracy_stats_df['share'] = accuracy_stats_df.id*100/accuracy_stats_df.id.sum()

  fig2 = px.bar(
    accuracy_stats_df[['share']],
    title = '<b>LLM SQL Agent evaluation</b>: query accuracy',
    text_auto = '.1f', orientation = 'h',
    color_discrete_sequence = ['#0077B5'],
    labels = {'index': '', 'variable': 'accuracy', 'value': 'share of queries, %'}
  )

  fig2.update_layout(showlegend = False)
  fig2.show()

  return eval_df

With that, we’ve completed the evaluation setup and can now move on to the core task of improving the model’s accuracy.

Improving accuracy: Self-reflection

Let’s do a quick recap. We’ve built and tested the first version of SQL Agent. Unfortunately, all generated queries were invalid because they were missing the output format. Let’s address this issue.

One potential solution is self-reflection. We can make an additional call to the LLM, sharing the error and asking it to correct the bug. Let’s create a function to handle generation with self-reflection.

reflection_user_query_tmpl = '''
You've got the following question: "{question}". 
You've generated the SQL query: "{query}".
However, the database returned an error: "{output}". 
Please, revise the query to correct mistake. 
'''

def generate_query_reflection(question):
  generated_query = generate_query(question) 
  print('Initial query:', generated_query)

  db_output = get_clickhouse_data(generated_query)
  is_valid_db_output = is_valid_output(db_output)
  if is_valid_db_output == 'too many rows':
    db_output = "Database unexpectedly returned more than 1000 rows."

  if is_valid_db_output == 'ok': 
    return generated_query

  reflection_user_query = reflection_user_query_tmpl.format(
    question = question,
    query = generated_query,
    output = db_output
  )

  reflection_prompt = get_llama_prompt(reflection_user_query, 
    generate_query_system_prompt) 
  reflection_result = chat_llm.invoke(reflection_prompt)

  try:
    reflected_query = reflection_result.tool_calls[0]['args']['query']
  except:
    reflected_query = ''
  print('Reflected query:', reflected_query)
  return reflected_query

Now, let’s use our evaluation function to check whether the quality has improved. Assessing the next iteration has become effortless.

refl_eval_df = evaluate_sql_agent(generate_query_reflection, golden_df)

Wonderful! We’ve achieved better results – 50% of the queries are now valid, and all format issues have been resolved. So, self-reflection is pretty effective.

However, self-reflection has its limitations. When we examine the accuracy, we see that the model returns the correct answer for only one question. So, our journey is not over yet.

Improving accuracy: RAG

Another approach to improving accuracy is using RAG (retrieval-augmented generation). The idea is to identify question-and-answer pairs similar to the customer query and include them in the system prompt, enabling the LLM to generate a more accurate response.

RAG consists of the following stages:

  • Loading documents: importing data from available sources.
  • Splitting documents: creating smaller chunks.
  • Storage: using vector stores to process and store data efficiently.
  • Retrieval: extracting documents that are relevant to the query.
  • Generation: passing a question and relevant documents to LLM to generate the final answer.

If you’d like a refresher on RAG, you can check out my previous article, "RAG: How to Talk to Your Data."

Vector stores use embeddings to find chunks that are similar to the query. For this purpose, we will use OpenAI embeddings.

from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

We will use the Chroma database as a local vector store – to store and retrieve the embedded chunks.

from langchain_chroma import Chroma
vector_store = Chroma(embedding_function=embeddings)

Since we can’t use examples from our evaluation set (as they are already being used to assess quality), I’ve created a separate set of question-and-answer pairs for RAG. You can find it on GitHub.

Now, let’s load the set and create a list of pairs in the following format: Question: %s; Answer: %s.

with open('rag_set.json', 'r') as f:
    rag_set = json.loads(f.read())
rag_set_df = pd.DataFrame(rag_set)

rag_set_df['formatted_txt'] = list(map(
    lambda x, y: 'Question: %s; Answer: %s' % (x, y),
    rag_set_df.question,
    rag_set_df.sql_query
))

rag_string_data = '\n\n'.join(rag_set_df.formatted_txt)

Next, I used LangChain’s text splitter by character to create chunks, with each question-and-answer pair as a separate chunk. Since we are splitting the text semantically, no overlap is necessary.

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="nn",
    chunk_size=1, # to split by character without merging
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False,
)

texts = text_splitter.create_documents([rag_string_data])

The final step is to load the chunks into our vector storage.

document_ids = vector_store.add_documents(documents=texts)
print(vector_store._collection.count())
# 32

Now, we can test the retrieval to see the results. They look quite similar to the customer question.

question = 'What was the share of users using Windows yesterday?'
retrieved_docs = vector_store.similarity_search(question, 3)
context = "nn".join(map(lambda x: x.page_content, retrieved_docs))
print(context)

# Question: What was the share of users using Windows the day before yesterday?; 
# Answer: select 100*uniqExactIf(user_id, os = 'Windows')/uniqExact(user_id) as windows_share from ecommerce.sessions where (action_date = today() - 2) format TabSeparatedWithNames
# Question: What was the share of users using Windows in the last week?; 
# Answer: select 100*uniqExactIf(user_id, os = 'Windows')/uniqExact(user_id) as windows_share from ecommerce.sessions where (action_date >= today() - 7) and (action_date < today()) format TabSeparatedWithNames
# Question: What was the share of users using Android yesterday?; 
# Answer: select 100*uniqExactIf(user_id, os = 'Android')/uniqExact(user_id) as android_share from ecommerce.sessions where (action_date = today() - 1) format TabSeparatedWithNames

Let’s adjust the system prompt to include the examples we retrieved.

generate_query_system_prompt_with_examples_tmpl = '''
You are a senior data analyst with more than 10 years of experience writing complex SQL queries. 
There are two tables in the database you're working with with the following schemas. 

Table: ecommerce.users 
Description: customers of the online shop
Fields: 
- user_id (integer) - unique identifier of customer, for example, 1000004 or 3000004
- country (string) - country of residence, for example, "Netherlands" or "United Kingdom"
- is_active (integer) - 1 if customer is still active and 0 otherwise
- age (integer) - customer age in full years, for example, 31 or 72

Table: ecommerce.sessions 
Description: sessions of usage the online shop
Fields: 
- user_id (integer) - unique identifier of customer, for example, 1000004 or 3000004
- session_id (integer) - unique identifier of session, for example, 106 or 1023
- action_date (date) - session start date, for example, "2021-01-03" or "2024-12-02"
- session_duration (integer) - duration of session in seconds, for example, 125 or 49
- os (string) - operation system that customer used, for example, "Windows" or "Android"
- browser (string) - browser that customer used, for example, "Chrome" or "Safari"
- is_fraud (integer) - 1 if session is marked as fraud and 0 otherwise
- revenue (float) - income in USD (the sum of purchased items), for example, 0.0 or 1506.7

Write a query in ClickHouse SQL to answer the following question. 
Add "format TabSeparatedWithNames" at the end of the query to get data from ClickHouse database in the right format. 
Answer questions following the instructions and providing all the needed information and sharing your reasoning. 

Examples of questions and answers: 
{examples}
'''

Once again, let’s create the generate query function with RAG.

def generate_query_rag(question):
  retrieved_docs = vector_store.similarity_search(question, 3)
  context = "\n\n".join(map(lambda x: x.page_content, retrieved_docs))

  prompt = get_llama_prompt(question, 
    generate_query_system_prompt_with_examples_tmpl.format(examples = context))
  result = chat_llm.invoke(prompt)

  try:
    generated_query = result.tool_calls[0]['args']['query']
  except:
    generated_query = ''
  return generated_query

As usual, let’s use our evaluation function to test the new approach.

rag_eval_df = evaluate_sql_agent(generate_query_rag, golden_df)

We can see a significant improvement, increasing from 1 to 6 correct answers out of 10. It’s still not ideal, but we’re moving in the right direction.

We can also experiment with combining two approaches: RAG and self-reflection.

def generate_query_rag_with_reflection(question):
  generated_query = generate_query_rag(question) 

  db_output = get_clickhouse_data(generated_query)
  is_valid_db_output = is_valid_output(db_output)
  if is_valid_db_output == 'too many rows':
      db_output = "Database unexpectedly returned more than 1000 rows."

  if is_valid_db_output == 'ok': 
      return generated_query

  reflection_user_query = reflection_user_query_tmpl.format(
    question = question,
    query = generated_query,
    output = db_output
  )

  reflection_prompt = get_llama_prompt(reflection_user_query, generate_query_system_prompt) 
  reflection_result = chat_llm.invoke(reflection_prompt)

  try:
    reflected_query = reflection_result.tool_calls[0]['args']['query']
  except:
    reflected_query = ''
  return reflected_query

rag_refl_eval_df = evaluate_sql_agent(generate_query_rag_with_reflection, 
  golden_df)

We can see another slight improvement: we’ve completely eliminated invalid SQL queries (thanks to self-reflection) and increased the number of correct answers to 7 out of 10.

That’s it. It’s been quite a journey. We started with 0 valid SQL queries and have now achieved 70% accuracy.

You can find the complete code on GitHub.

Summary

In this article, we explored the iterative process of improving accuracy for LLM applications.

  • We built an evaluation set and the scoring criteria that allowed us to compare different iterations and understand whether we were moving in the right direction.
  • We leveraged self-reflection to allow the LLM to correct its mistakes and significantly reduce the number of invalid SQL queries.
  • Additionally, we implemented Retrieval-Augmented Generation (RAG) to further enhance the quality, achieving an accuracy rate of 60–70%.

While this is a solid result, it still falls short of the 90%+ accuracy threshold typically expected for production applications. To achieve such a high bar, we need to use fine-tuning, which will be the topic of the next article.

Thank you for reading this article. I hope it was insightful for you. If you have any follow-up questions or comments, please leave them in the comments section.

Reference

All the images are produced by the author unless otherwise stated.

This article is inspired by the "Improving Accuracy of LLM Applications" short course from DeepLearning.AI.

The post From Prototype to Production: Enhancing LLM Accuracy appeared first on Towards Data Science.

]]>
An Introduction to CTEs in SQL https://towardsdatascience.com/an-introduction-to-ctes-in-sql-ab0a979578f9/ Wed, 04 Dec 2024 16:01:05 +0000 https://towardsdatascience.com/an-introduction-to-ctes-in-sql-ab0a979578f9/ Explore how Common Table Expression (CTE) can help optimize SQL performance and readability

The post An Introduction to CTEs in SQL appeared first on Towards Data Science.

]]>
In the past few months, I have learned the importance of writing clean, readable, and efficient SQL queries. They are essential for integrating information from different tables.

However, writing complex queries from scratch can be time-consuming, especially when you frequently use them for analyzing your data. To address this need, SQL offers a powerful construct, known as Common Table Expression.

In this article, I am going to explain what a Common Table Expression is, why it’s useful, and demonstrate its application through examples. Let’s get started!


Table of contents:

  • What is a Common Table Expression?
  • Setting up DBeaver and Database
  • Three simple examples of CTEs

What is a Common Table Expression?

A Common Table Expression (CTE) is a temporary result set that simplifies complex queries, making them more readable and maintainable. It works by breaking a complex query into smaller, more manageable pieces, essentially building on sequential, modular subqueries. It isn’t limited to managing complicated queries: it can also serve as an alternative to a view, self-reference a table, or be used for recursion.

To create a common table expression, the keyword WITH is followed by the name of the CTE and the keyword AS. Within the parentheses, you specify the SQL query that defines the CTE. Afterward, you can select the entire result set by referencing the CTE’s name in the main query.

WITH cte_name AS (
 QUERY
)
SELECT *
FROM cte_name

Alternatively, you can select only a few columns from the CTE:

WITH cte_name AS (
 QUERY
)
SELECT column1, column2
FROM cte_name

Setting up DBeaver and Database

Screenshot by Author. 1) Download DBeaver

DBeaver is a free and open-source database management tool that works with relational databases. Compared to other database tools, its interface is very intuitive and simple. For that reason, I would recommend installing it. It can be installed on Windows, Mac, and Linux.

Once you download it, it’s time to create the database. For this tutorial, I decided to create synthetic data to demonstrate the strength of CTEs. We can do this by asking ChatGPT to generate it.

Prompt: I want to create a database about sales of a fashion online company, 
zalando: create tables and insert rows using SQL. 
The goal of this database is to demonstrate the strenghts of SQL CTE. 
It should contain syntetic data that resemble real data between 2023 and 2024. 

After sending the prompt, ChatGPT designs the structure of this new database and the rows it should contain. The output is very long, so I will just show a GIF to give an idea of what you can obtain.

GIF By Author. 2) Create synthetic data using ChatGPT

Even if it’s only a very small database, the result is astonishing! It created simple and meaningful tables linked between each other.

Now we can finally create a new database in DBeaver. We just need to create a new connection with SQLite, which is suitable for a light, local database. Then, we press the "Create" button and select the path where we want to store the database.

GIF By Author. 3) Create New Connection using DBeaver

Then we copy and execute the SQL code generated by ChatGPT to create the tables and insert the rows into each table.

The new database is composed of five main tables:

  • Customers
  • Categories
  • Products
  • Orders
  • OrderDetails
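Since the full ChatGPT-generated script is too long to reproduce here, the sketch below shows roughly what the schema looks like. The table and column names (join_date, order_date, quantity, and so on) are inferred from the queries used later in this article; the script you generate may differ in the details.

-- Assumed schema, reconstructed from the columns used in the examples
CREATE TABLE Customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT,
    join_date   DATE            -- used in Example 1
);

CREATE TABLE Categories (
    category_id INTEGER PRIMARY KEY,
    name        TEXT
);

CREATE TABLE Products (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,           -- used in Example 2
    category_id INTEGER REFERENCES Categories(category_id)
);

CREATE TABLE Orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES Customers(customer_id),
    order_date  DATE            -- used in Example 3
);

CREATE TABLE OrderDetails (
    order_detail_id INTEGER PRIMARY KEY,
    order_id        INTEGER REFERENCES Orders(order_id),
    product_id      INTEGER REFERENCES Products(product_id),
    quantity        INTEGER    -- used in Example 2
);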

Example 1: Simple CTE

To understand how to define a CTE, let's start with a simple example. Let's suppose we want to know how many customers joined the company's website each year.

WITH NumberCustomerByYear AS (
   SELECT STRFTIME('%Y',c.join_date) AS Year, count(*) AS NumberCustomers
   FROM Customers c
   GROUP BY STRFTIME('%Y',c.join_date)

)

SELECT *
FROM NumberCustomerByYear
ORDER BY Year DESC;

This is the output:

Year NumberCustomers
2024 5
2023 5

Now we have the number of clients by year, which is extracted from the column c.join_date using the SQLite function STRFTIME. We show the number of customers in decreasing order by year so that the most recent data appears first.

Example 2: Simplify a Complex Query

In this section, we build a CTE that lists the products that have sold more than three units in total. This time we need a left join between Products and OrderDetails to obtain the information.

WITH PopularProducts AS (
 SELECT
   p.name AS product_name,
   SUM(od.quantity) AS total_quantity_sold
 FROM Products p 
 LEFT JOIN OrderDetails od ON p.product_id = od.product_id
 GROUP BY p.name
 HAVING SUM(od.quantity)>3
)

SELECT *
FROM PopularProducts
ORDER BY total_quantity_sold DESC;

The output table is the following:

product_name    total_quantity_sold
Sandals         5
T-shirt         5
Tracksuit       4

That's good! Now we have obtained the names of the most popular products.

Example 3: Use Multiple CTEs in a Query

So far, the examples have used only a single common table expression. This time, let's try to solve a problem that requires two CTEs.

Let’s suppose that we want to compare the number of orders each month with the previous month. The first CTE MonthlyOrders contains the number of orders by year and month.

The second CTE is MonthlyComparison and has five columns: order_year, order_month, current_month_orders, previous_month_orders and order_difference. The last two fields, previous_month_orders and order_difference, are obtained using a self-join, which is very useful when comparing a row with other rows within the same table.

When there is more than one CTE, we don't repeat the WITH keyword for the second CTE; instead, we separate the two definitions with a comma.

WITH MonthlyOrders AS (
    SELECT 
        STRFTIME('%Y',order_date) AS order_year,
        CAST(STRFTIME('%m',order_date) AS INTEGER) AS order_month,
        COUNT(order_id) AS total_orders
    FROM Orders
    GROUP BY STRFTIME('%Y',order_date), STRFTIME('%m',order_date)
),
MonthlyComparison AS (
    SELECT 
        mo1.order_year,
        mo1.order_month,
        mo1.total_orders AS current_month_orders,
        COALESCE(mo2.total_orders, 0) AS previous_month_orders,
        mo1.total_orders - COALESCE(mo2.total_orders, 0) AS order_difference
    FROM MonthlyOrders mo1
    LEFT JOIN MonthlyOrders mo2 
        ON (mo1.order_year = mo2.order_year AND mo1.order_month = mo2.order_month + 1)
         OR (mo1.order_year = mo2.order_year+1 AND mo1.order_month=1 AND mo2.order_month=12)
)
SELECT *
FROM MonthlyComparison
ORDER BY order_year DESC, order_month DESC;

In the main query, we select all the columns from the second CTE that compares the number of orders each month with the previous month. The results of the query are the following:

order_year  order_month  current_month_orders  previous_month_orders  order_difference
2024        5            1                     1                      0
2024        4            1                     1                      0
2024        3            1                     1                      0
2024        2            1                     1                      0
2024        1            1                     0                      1
2023        7            1                     1                      0
2023        6            1                     1                      0
2023        5            1                     1                      0
2023        4            1                     1                      0
2023        3            1                     0                      1

This is great! This is just a taste of what you can obtain with multiple CTEs! The numeric values are not very realistic, since the data is synthetic, but they are enough to illustrate the approach.


Summary of SQL functions used

  • COUNT(*) to return the number of records
  • SUM(od.quantity) to sum the values of the field quantity
  • STRFTIME('%Y', order_date) to extract the year from the date column
  • CAST(STRFTIME('%m', order_date) AS INTEGER) to convert the extracted month from TEXT to INTEGER type
  • COALESCE(total_orders,0) to replace null values of total_orders with 0

Final thoughts

I hope that you have appreciated this guide for getting started with Common Table Expressions in SQL. This topic can be intimidating without practical examples, moving from the simplest to the hardest.

Be aware that some of the SQL functions used in the examples can change depending on the connection type selected, such as SQL Server or Google BigQuery. For example, STRFTIME is replaced by YEAR in SQL Server and by EXTRACT in Google BigQuery.
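For instance, extracting the year from order_date looks slightly different across these engines. The snippets below are a quick reference, reusing the Orders table from the examples above:

-- SQLite
SELECT STRFTIME('%Y', order_date) AS order_year FROM Orders;

-- SQL Server
SELECT YEAR(order_date) AS order_year FROM Orders;

-- Google BigQuery
SELECT EXTRACT(YEAR FROM order_date) AS order_year FROM Orders;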

If you want to go deeper into CTEs, check the resources below. The code for creating the tables, inserting the rows, and building the CTEs is here if you want to replicate the results, which are based on the synthetic database generated using ChatGPT. Since the code is very long, I didn't put all the code lines in the article for readability reasons.

Thanks for reading! Have a nice day!


Useful resources:

The post An Introduction to CTEs in SQL appeared first on Towards Data Science.

]]>
SQL vs. Calculators: Building Champion/Challenger Tests from Scratch https://towardsdatascience.com/sql-vs-calculators-building-champion-challenger-tests-from-scratch-b457dc43d784/ Wed, 04 Dec 2024 14:01:03 +0000 https://towardsdatascience.com/sql-vs-calculators-building-champion-challenger-tests-from-scratch-b457dc43d784/ In depth SQL code for creating your own statistical test design

The post SQL vs. Calculators: Building Champion/Challenger Tests from Scratch appeared first on Towards Data Science.

]]>
CODE OR CLICK: WHAT IS BETTER FOR A/B TESTING
Image from Imagen 3

The $300 Million Button: How A/B Testing Changed E-Commerce Forever

I am sure a lot of people are aware of the $300 million button story. For those who are not, it is about a major e-commerce platform losing millions in potential revenue due to customer drop-offs at checkout. When this large online retailer changed a single button labeled "Register" to "Continue," with an option to register later, the company saw a $300 million increase in annual revenue. This case study was documented by UX expert Jared Spool (Source: UIE, Jared Spool, "The $300 Million Button"), showing how a minor change can drastically impact business outcomes.

Yet surprisingly, 58% of executives still rely on intuition when making business decisions, according to a PwC report (Source: PwC Global Data and Analytics Survey). I believe the intuition of people with industry knowledge who are well versed in business processes is important, but it adds more value when combined with observed evidence from data and numbers. Champion/challenger testing is one such approach to decision-making that turns guesswork into scientific validation.

What Is Champion/Challenger Testing?

Champion/challenger testing (A/B testing) is a technique businesses use to optimize processes and operations by selecting the options that improve performance: increasing revenue, reducing costs, and enhancing decision-making. The champion is the current operation or methodology that works best, while the challenger is the new method or strategy you want to test against your champion to see whether it works better or worse than your current process. Your champion and challenger groups should have the same type of setup, like similar types of accounts or customer segments, to ensure an apples-to-apples comparison. It is important to know what goal you are trying to achieve and which key performance indicators will measure the success of the test.
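As a concrete illustration of that setup, the sketch below shows one way you might randomly assign comparable accounts to champion and challenger groups in Oracle SQL. The accounts table, the account_id column, and the 80/20 split are assumptions for illustration only, not part of the test-design code that follows later:

-- Randomly tag each account so champion and challenger draw from the same population
SELECT
    account_id,
    CASE
        WHEN DBMS_RANDOM.VALUE < 0.20 THEN 'CHALLENGER'  -- 20% receive the new treatment (voicemails)
        ELSE 'CHAMPION'                                   -- 80% continue with the current strategy
    END AS test_group
FROM accounts;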

Implementation Through Oracle SQL: A Practical Guide

When implementing champion/challenger testing, I always wondered whether to rely on online calculators or invest in a database-driven SQL implementation. The answer depends on various factors, but let us explore an SQL approach through a practical example. While going through the example, I will also walk you through the variables and conditions to consider to ensure we design a solid champion/challenger test.

Imagine a collection agency wanting to test the effectiveness of leaving voicemails versus not leaving them. The current strategy involves no voicemails. Some believe leaving voicemails could improve metrics like contact rate and payment rate, but implementing this change across all accounts carries risks: a potential reduction in contact rates, compliance considerations with leaving messages, the resource cost of leaving voicemails, and a possible decrease in payment rates. Let us design a rigorous test to evaluate the hypothesis.

To begin our implementation, we need to create a structured foundation that will track our test from start to finish. I used Oracle SQL Developer to write my SQL and, for illustration purposes in the voicemail testing context, I assumed the key component values mentioned below to generate the voicemail champion/challenger test. Here is what each of these key components means:

  1. Baseline Conversion Rate: Your current conversion rate for the metric you’re testing. In this specific voicemail test example, we are assuming 8% current payment rate as baseline conversion rate.
  2. Minimum Detectable Effect (MDE): The smallest improvement in conversion rate you care about detecting. For voicemails, we want to see if we can improve the current conversion rate by 10%, which means increasing it to 8.8% (8% * (1 + 0.10) = 8.8%).
  3. Statistical Significance Level: Typically set at 95%, meaning you’re 95% confident that your results are not due to chance.
  4. Statistical Power: Often set at 80%, this is a measure of whether the test has enough data to reach a conclusive result.
  5. Hypothesis / Tail type: a statement that predicts whether changing a certain variable will affect customer behavior. There are two types of hypotheses to consider, more commonly known as tail tests:

a) One-tail test: This test is recommended only when you are testing whether something is only better than current performance, or only worse. Voicemail testing with a one-tail test would mean we only want to know if voicemails improve payment rates.

b) Two-tail test: This test is recommended when you need to detect any change in performance: you are testing whether something is either better or worse than current performance. Voicemail testing with a two-tail test means we want to know whether voicemails will increase or decrease payment rates.

As we do not know whether voicemails will increase or decrease payment rates, we will be going with a two-tailed test.

with test_parameters as(
    select 
        0.08 as baseline_rate,       -- assuming current rate of 8% of payment rate
        10 as min_detectable_effect, -- wanting 10% improvement
        95 as significance_level,    -- 95% confidence level
        80 as statistical_power,     -- 80% statistical power
        'TWO' as tail_type,          -- 'ONE' or 'TWO' for tail type test 
        &volume as monthly_volume    -- dynamic query to pull volume data can be used 
        -- example: (select count(*) from accounts where assign_date>=add_months(sysdate,-1) ) 
    from dual
    )

   select * from test_parameters;
SQL prompt for monthly volume input
Output Result

This configuration is important because it records what we are testing and why. These metrics are the key inputs to the sample size calculation. I will show you the sample size calculation, the split ratios, the months and days needed to run the test, and finally the recommendations for different monthly volumes.

Sample Size Calculation

Using the right sample size is important to make sure your test results are statistically significant. A sample size that's too small may produce inaccurate results, while larger sample sizes give you more accurate averages, help identify outliers, and provide smaller margins of error. The question, ultimately, is what counts as too small versus too large; you will find the answer as you go through the article.

The Oracle script below shows how to calculate the sample size. I am using CTEs and have partitioned them into multiple sections of snapshots to explain the code better. If you want to use the script, you need to put all the sections together. Now, I am going to set up our statistical parameters.

--statistical parameter conversion
    ,statistical_parameters as(
    select
        baseline_rate,
        min_detectable_effect,
        monthly_volume,
        tail_type,

    --set confidence level z-score based on tail type
        case when tail_type='ONE' then 
         case significance_level 
              when 90 then 1.28 -- One tailed test for 90% confidence
              when 95 then 1.645 -- One tailed test for 95% confidence
              when 99 then 2.326 -- One tailed test for 99% confidence
              else 1.645 end 
         else
             case significance_level 
              when 90 then 1.645 -- Two tailed test for 90% confidence
              when 95 then 1.96 -- Two tailed test for 95% confidence
              when 99 then 2.576 -- Two tailed test for 99% confidence
              else 1.96 end end as z_alpha,

    --set power level z-score (same for both tail types)
        case statistical_power
            when 80 then 0.84
            when 90 then 1.28
            when 95 then 1.645
            else 0.84 end as z_beta
    from test_parameters
    )

    select * from statistical_parameters;

This step turns the confidence levels into the statistical values used in the sample size calculation. For collections, 95% confidence means that 5% of the time the results could be wrong, for example concluding that voicemails help when they don't.

In statistical terms, z-alpha represents our confidence level, with different values based on both the confidence level and the tail type. Two-tailed values are typically higher than one-tailed values because the error rate is split across both directions. In the voicemail testing scenario, the 5% chance of being wrong is split evenly (a 0.025 probability of payments going lower and 0.025 of payments going higher), whereas a one-tailed test concentrates the entire 0.05 probability in one direction, as we would only be interested in payments going either up or down, not both.

The z-score associated with statistical power is known as z-beta. When we set 80% statistical power (z-beta = 0.84), we are saying we want to catch real changes 80% of the time and will accept missing them 20% of the time.

Put together, z-alpha and z-beta mean that if voicemails truly improve payment rates, we will detect the improvement 80% of the time, and when we do detect it, we can be 95% confident it is a real improvement and not due to chance.

Output Result

Let us now move on to calculating the sample size needed. This calculation determines how many accounts we need to test. In our voicemail scenario, where we are looking to improve the payment rate from 8% to 8.8%, it tells us how many accounts we need to be confident that an observed increase or decrease in payment rate is real and not just chance.

--Sample size calculation
    ,sample_size_calculation as(
    select 
        baseline_rate,
        min_detectable_effect,
        monthly_volume,
        tail_type,
        z_alpha,
        z_beta,

    --calculate minimum effect size
        baseline_rate*(min_detectable_effect/100) as minimum_effect,

    --calculate base sample size
        ceil(
             case tail_type 
                  when 'ONE' then
                       ( power(z_alpha + z_beta, 2) * baseline_Rate * (1 - baseline_Rate)) / (power(baseline_Rate * (min_detectable_effect/100), 2))
                  else
                       (2 * power(z_alpha + z_beta, 2) * baseline_Rate * (1 - baseline_Rate)) / (power(baseline_Rate * (min_detectable_effect/100), 2)) 
                  end
             ) as required_sample_size     
    from statistical_parameters
    )
Output Result
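As a quick sanity check of the two-tailed formula with the assumed parameters (z-alpha = 1.96, z-beta = 0.84, baseline rate 8%, MDE 10%):

n = \frac{2\,(z_{\alpha}+z_{\beta})^{2}\,p\,(1-p)}{(p \cdot \mathrm{MDE})^{2}}
  = \frac{2\,(1.96+0.84)^{2} \times 0.08 \times 0.92}{(0.08 \times 0.10)^{2}}
  = \frac{1.154048}{0.000064}
  = 18{,}032

That is roughly 18k accounts per group, or about 36k in total, which matches the figures discussed later for the 50/50 split.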

Split Ratios and Test Duration

Split ratios determine how you divide your dataset between the champion (your current version) and the challenger(s) (your test versions). Common split ratios include two-way splits (like 50/50, 80/20 or 90/10) or multi-way splits like 50/25/25 or 70/10/10/10. Multi-way tests let you evaluate several variations while still keeping a control group.

The choice of split ratio should not be random or depend solely on volume availability; it should also consider factors like your confidence in the challenger, the impact of the change (especially if it hurts current metrics), and whether the test meets the minimum sample size requirement.

The analysis below translates the statistical requirements into business terms and shows how different split ratios affect test duration. It also shows the risk level for each split ratio. Split ratios represent how we divide accounts between champion and challenger.

 --split ratio
    ,split_ratios as(
    --generate split ratios from 10 to 50 for challenger
    Select  
        level * 10 as challenger_pct,
        100 - (level * 10) as control_pct
    from dual
    connect by level <= 5 -- This generates 10/90, 20/80, 30/70, 40/60, 50/50
    )

    --split_analysis
    ,split_analysis as(
    select 
        s.baseline_Rate * 100 as current_rate_pct,
        s.baseline_rate * (1 + s.min_detectable_effect/100) * 100 as target_rate_pct,
        s.min_detectable_effect as improvement_pct,
        s.tail_type,
        s.required_sample_size as sample_size_per_group,
        s.required_sample_size * 2 as total_sample_needed,
        s.monthly_volume,
        r.challenger_pct,
        r.control_pct,

    --calculate test duration (months) for different splits
        round(s.required_sample_size / (s.monthly_volume * (r.challenger_pct/100)), 1) as months_needed,

    --calculate test days needed for each split
        round(s.required_sample_size / (s.monthly_volume * (r.challenger_pct/100)) * 30, 0) as days_needed,

     --Assess risk level for each split
        case 
            when r.challenger_pct <= 20 then 'Conservative'
            when r.challenger_pct <= 35 then 'Balanced'
            else 'Aggressive' end as risk_level
    from sample_size_calculation s cross join split_ratios r
    )

    select * from split_analysis;

A conservative split exposes only 10–20% of accounts to the new treatment and shields the remaining 80–90% from potential negative impacts, but it takes longer to gather enough data. A balanced split impacts about a third of the accounts and protects the rest while gathering data faster. An aggressive split impacts up to half the accounts; although it gathers data quickly, it exposes more accounts to risk.

Part of the output result

It is important to know how long a champion/challenger test should run. Run a test for too short a time and you risk making decisions based on incomplete or misleading data; run it too long and you may waste resources and delay decision-making. To maintain the balance, tests should generally run for at least one full business cycle. They typically shouldn't run for more than 4–8 weeks, so that the results don't get mixed up with other operational or seasonal changes.

Risk Assessment and Volume Requirements

I often see analysts new to champion/challenger testing unsure of which split ratio to opt for. We can decide by weighing the risks associated with each split ratio against the volume needed to support it.

The worst-case scenario must be calculated to assess the risk level.

,risk_Assessment as(
        select 
            monthly_volume,
            sample_size_per_group,
            challenger_pct,
            risk_level,
        --assess potential impact
    round(monthly_volume * (challenger_pct/100) * (current_rate_pct/100)) as accounts_at_risk,
    round(monthly_volume * (challenger_pct/100) * (current_rate_pct/100) * (1 - (improvement_pct/100))) as worst_case_scenario
        from split_analysis
    )

    ,volume_recommendations as(
        select distinct 
            sample_size_per_group,
            --recommended monthly volumes for different completion timeframes for all splits
            ceil(sample_size_per_group / 0.5) as volume_for_1_month_50_50, --50/50 split
            ceil(sample_size_per_group / 0.4) as volume_for_1_month_40_60, --40/60 split
            ceil(sample_size_per_group / 0.3) as volume_for_1_month_30_70, --30/70 split
            ceil(sample_size_per_group / 0.2) as volume_for_1_month_20_80, --20/80 split
            ceil(sample_size_per_group / 0.1) as volume_for_1_month_10_90  --10/90 split
        from split_analysis
        )
Part of the output result

Let us say we opt for the 30/70 split ratio, which shows as "balanced" for voicemails. With 10,000 monthly accounts, 3,000 accounts will receive voicemails while 7,000 accounts continue as normal. If voicemails perform poorly, the change affects 3,000 accounts, and the maximum exposure is 240 payments at risk (3,000 × 8%). In the scenario where the voicemail test decreases payment rates by 10% instead of improving them, we would only receive 216 payments (3,000 × 8% × (1 – 10%)). This means we lose 24 payments that we would otherwise have received.

This worst-case calculation helps us understand what's at risk. With a more aggressive 50/50 split, we would have 5,000 accounts in the test group, risking a potential loss of 40 payments under worst-case conditions. A conservative 20/80 split would only risk 16 payments, though it would take longer to complete the test.
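For reference, those worst-case figures follow the same arithmetic as above (accounts in the test group × baseline payment rate × 10% decline):

5{,}000 \times 0.08 \times 0.10 = 40
\qquad
2{,}000 \times 0.08 \times 0.10 = 16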

With a 50/50 split, we need a total volume of 36k accounts to get our required 18k accounts in the test group. Since we only have 10k accounts monthly, this means our test would take approximately 3.6 months to complete. Moving to the most conservative 10/90 split would require 180k accounts, making the test duration impractically long at 18 months.

,final_Recommendation as(
    select
        sa.*,
        ra.accounts_At_Risk,
        ra.worst_case_scenario,
        vr.volume_for_1_month_50_50,
        vr.volume_for_1_month_40_60,
        vr.volume_for_1_month_30_70,
        vr.volume_for_1_month_20_80,
        vr.volume_for_1_month_10_90,
        --Generate final recommendations based on all split ratios
    case when sa.monthly_volume >= vr.volume_for_1_month_50_50 and sa.challenger_pct = 50 
         then 'AGGRESSIVE: 50/50 split possible. Fastest completion in ' || sa.days_needed || ' days but highest risk ' 
         when sa.monthly_volume >= vr.volume_for_1_month_40_60 and sa.challenger_pct = 40 
         then 'MODERATELY AGGRESSIVE: 40/60 split feasible. Completes in ' || sa.days_needed || ' days with moderate-high risk.'
         when sa.monthly_volume >= vr.volume_for_1_month_30_70 and sa.challenger_pct = 30 
         then 'BALANCED: 30/70 split recommended. Completes in ' || sa.days_needed || ' days with balanced risk.'
         when sa.monthly_volume >= vr.volume_for_1_month_20_80 and sa.challenger_pct = 20 
         then 'CONSERVATIVE: 20/80 split possible. Takes ' || sa.days_needed || ' days with lower risk.'
         when sa.monthly_volume >= vr.volume_for_1_month_10_90 and sa.challenger_pct = 10 
         then 'MOST CONSERVATIVE: 10/90 split possible. Takes ' || sa.days_needed || ' days but minimizes risk.'
         else 'NOT RECOMMENDED: Current volume of ' || sa.monthly_volume || ' insufficient for reliable testing with ' 
              || sa.challenger_pct || '/' ||  sa.control_pct || ' split.' end as recommendation
    from split_analysis sa join risk_assessment ra on sa.challenger_pct=ra.challenger_pct
        cross join volume_recommendations vr 
        )
select      
        tail_type as test_type,
        current_rate_pct || '%' as current_rate,
        target_rate_pct || '%' as target_rate,
        improvement_pct || '%' as improvement,
        sample_size_per_group as needed_per_group,
        total_sample_needed as total_needed,
        monthly_volume,
        challenger_pct || '/' || control_pct || ' split' as split_ratio,
        days_needed || ' days (' || round(months_needed, 1) || ' months)' as duration,
        risk_level,
        accounts_At_Risk || ' accounts at risk' as risk_exposure,
        worst_Case_Scenario || ' worst case' as risk_scenario,
            case
                when challenger_pct = 10 then
                    case    
                        when monthly_volume >= volume_for_1_month_10_90 
                        then 'Current volume (' || monthly_volume || ') sufficient for 10/90 split'
                        else 'Need ' || volume_for_1_month_10_90 
                        || ' monthly accounts for 10/90 split (current: ' || monthly_volume || ')'
                    end
                when challenger_pct = 20 then
                    case    
                        when monthly_volume >= volume_for_1_month_20_80 
                        then 'Current volume (' || monthly_volume || ') sufficient for 20/80 split'
                        else 'Need ' || volume_for_1_month_20_80 
                        || ' monthly accounts for 20/80 split (current: ' || monthly_volume || ')'
                    end
                 when challenger_pct = 30 then
                    case    
                        when monthly_volume >= volume_for_1_month_30_70 
                        then 'Current volume (' || monthly_volume || ') sufficient for 30/70 split'
                        else 'Need ' || volume_for_1_month_30_70 
                        || ' monthly accounts for 30/70 split (current: ' || monthly_volume || ')'
                    end
                 when challenger_pct = 40 then
                    case    
                        when monthly_volume >= volume_for_1_month_40_60 
                        then 'Current volume (' || monthly_volume || ') sufficient for 40/60 split'
                        else 'Need ' || volume_for_1_month_40_60 
                        || ' monthly accounts for 40/60 split (current: ' || monthly_volume || ')'
                    end
                else
                    case    
                        when monthly_volume >= volume_for_1_month_50_50 
                        then 'Current volume (' || monthly_volume || ') sufficient for 50/50 split'
                        else 'Need ' || volume_for_1_month_50_50 
                        || ' monthly accounts for 50/50 split (current: ' || monthly_volume || ')'
                    end
                end as volume_assessment,
            recommendation
        from final_Recommendation
        order by challenger_pct;
Part of the output result for 10k monthly volume

If monthly volume is 50,000 accounts:

Part of the output result for 50k monthly volume

Certain questions need to be considered to decide which split ratio to choose, what risk level is acceptable, and whether the volume available is sufficient to test voicemails. Can the business accept potentially losing 40 payments monthly in exchange for completing the test in 3.6 months, or would it be better to risk only 16 payments monthly but extend the test duration? By carefully choosing your split ratios and understanding what sample sizes are appropriate, you can design tests that provide accurate and actionable insights.

Calculators versus SQL Implementation

Online calculators like Evan Miller's and Optimizely's are valuable tools, but they typically default to a 50/50 split ratio or two-tailed tests. Another online tool, Statsig, doesn't default to anything, but it also doesn't provide the additional details we just coded in our SQL implementation. The SQL implementation becomes valuable here because it tracks not just the basic metrics but also risk exposure and test duration based on your actual monthly volume. This comprehensive view helps especially when you need to deviate from standard 50/50 splits or want to understand the effect of different split ratios on your test design and business risk.

Continuous Testing

Champion/challenger testing is not a one-time effort but a continuous cycle of improvement. Create performance reports, continuously monitor the results, and adapt to changing conditions, including seasonal shifts and economic changes. By integrating this approach into your strategy testing, you create a systematic approach to decision-making that drives innovation, mitigates risk, and, most importantly, backs up intuition with solid data evidence.
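As a starting point for that ongoing monitoring, a sketch like the one below could compare champion and challenger performance side by side. The accounts and payments tables, the test_group column, and the join key are assumptions for illustration; adapt them to your own schema:

-- Hypothetical monitoring query: payment rate by test group
SELECT
    a.test_group,                                        -- 'CHAMPION' or 'CHALLENGER'
    COUNT(DISTINCT a.account_id)                         AS accounts_in_group,
    COUNT(DISTINCT p.account_id)                         AS accounts_paid,
    ROUND(COUNT(DISTINCT p.account_id)
          / COUNT(DISTINCT a.account_id) * 100, 2)       AS payment_rate_pct
FROM accounts a
LEFT JOIN payments p
       ON a.account_id = p.account_id
GROUP BY a.test_group;

Running a report like this on a regular cadence keeps the comparison visible while the test accumulates the volume calculated earlier.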

Note: All images, unless otherwise noted, are by the author.

The post SQL vs. Calculators: Building Champion/Challenger Tests from Scratch appeared first on Towards Data Science.

]]>