One-Tailed Vs. Two-Tailed Tests https://towardsdatascience.com/one-tailed-vs-two-tailed-tests/ Thu, 06 Mar 2025 04:22:42 +0000 https://towardsdatascience.com/?p=598815 Choosing between one- and two-tailed hypotheses affects every stage of A/B testing. Learn why the hypothesis direction matters and explore the pros and cons of each approach.

Introduction

If you’ve ever analyzed data using built-in t-test functions, such as those in R or SciPy, here’s a question for you: have you ever adjusted the default setting for the alternative hypothesis? If your answer is no—or if you’re not even sure what this means—then this blog post is for you!

The alternative hypothesis parameter, commonly referred to as “one-tailed” versus “two-tailed” in statistics, defines the expected direction of the difference between control and treatment groups. In a two-tailed test, we assess whether there is any difference in mean values between the groups, without specifying a direction. A one-tailed test, on the other hand, posits a specific direction—whether the control group’s mean is either less than or greater than that of the treatment group.
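
For instance, in SciPy this choice is simply the alternative argument of the independent t-test (available in recent SciPy versions). A minimal sketch, with made-up values purely for illustration:

import numpy as np
from scipy import stats

# Hypothetical metric values for the two groups (illustrative only)
control = np.array([0.42, 0.51, 0.39, 0.47, 0.53, 0.45])
treatment = np.array([0.49, 0.55, 0.48, 0.57, 0.52, 0.50])

# Default: two-tailed test (a difference in either direction)
two_tailed = stats.ttest_ind(control, treatment, alternative="two-sided")

# One-tailed test: the control mean is hypothesized to be LESS than the treatment mean
one_tailed = stats.ttest_ind(control, treatment, alternative="less")

print("Two-tailed p-value:", two_tailed.pvalue)
print("One-tailed p-value:", one_tailed.pvalue)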

Choosing between one- and two-tailed hypotheses might seem like a minor detail, but it affects every stage of A/B testing: from test planning to data analysis and results interpretation. This article builds a theoretical foundation on why the hypothesis direction matters and explores the pros and cons of each approach.

One-tailed vs. two-tailed hypothesis testing: Understanding the difference

To understand the importance of choosing between one-tailed and two-tailed hypotheses, let’s briefly review the basics of the t-test, a commonly used method in A/B testing. Like other hypothesis testing methods, the t-test begins with a conservative assumption: there is no difference between the two groups (the null hypothesis). Only if we find strong evidence against this assumption can we reject the null hypothesis and conclude that the treatment has had an effect.

But what qualifies as “strong evidence”? To that end, a rejection region is determined under the null hypothesis, and any result that falls within this region is deemed so unlikely that we take it as evidence against the null hypothesis. The size of this rejection region is based on a predetermined probability, known as alpha (α), which represents the likelihood of incorrectly rejecting the null hypothesis.

What does this have to do with the direction of the alternative hypothesis? Quite a bit, actually. While the alpha level determines the size of the rejection region, the alternative hypothesis dictates its placement. In a one-tailed test, where we hypothesize a specific direction of difference, the rejection region is situated in only one tail of the distribution. For a hypothesized positive effect (e.g., that the treatment group mean is higher than the control group mean), the rejection region would lie in the right tail, creating a right-tailed test. Conversely, if we hypothesize a negative effect (e.g., that the treatment group mean is less than the control group mean), the rejection region would be placed in the left tail, resulting in a left-tailed test.

In contrast, a two-tailed test allows for the detection of a difference in either direction, so the rejection region is split between both tails of the distribution. This accommodates the possibility of observing extreme values in either direction, whether the effect is positive or negative.

To build intuition, let’s visualize how the rejection regions appear under the different hypotheses. Recall that according to the null hypothesis, the difference between the two groups should center around zero. Thanks to the central limit theorem, we also know this distribution approximates a normal distribution. Consequently, the rejection regions corresponding to the different alternative hypotheses look like this:
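
To make the placement concrete, here is a small sketch (assuming a standard normal test statistic and α = 5%) that computes where those rejection regions begin:

from scipy import stats

alpha = 0.05

# Two-tailed: alpha is split between both tails
two_tailed_cutoff = stats.norm.ppf(1 - alpha / 2)   # ≈ 1.96, reject if |z| is beyond this

# Right-tailed: the whole alpha sits in the right tail
right_tailed_cutoff = stats.norm.ppf(1 - alpha)     # ≈ 1.64, reject only for large positive z

print(f"Two-tailed: reject H0 if |z| > {two_tailed_cutoff:.2f}")
print(f"Right-tailed: reject H0 if z > {right_tailed_cutoff:.2f}")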

Why does it make a difference?

The choice of direction for the alternative hypothesis impacts the entire A/B testing process, starting with the planning phase—specifically, in determining the sample size. Sample size is calculated based on the desired power of the test, which is the probability of detecting a true difference between the two groups when one exists. To compute power, we examine the area under the alternative hypothesis that corresponds to the rejection region (since power reflects the ability to reject the null hypothesis when the alternative hypothesis is true).

Since the direction of the hypothesis affects the size of this rejection region, power is generally lower for a two-tailed hypothesis. This is due to the rejection region being divided across both tails, making it more challenging to detect an effect in any one direction. The following graph illustrates the comparison between the two types of hypotheses. Note that the purple area is larger for the one-tailed hypothesis, compared to the two-tailed hypothesis:

In practice, to maintain the desired power level, we compensate for the reduced power of a two-tailed hypothesis by increasing the sample size (Increasing sample size raises power, though the mechanics of this can be a topic for a separate article). Thus, the choice between one- and two-tailed hypotheses directly influences the required sample size for your test. 

Beyond the planning phase, the choice of alternative hypothesis directly impacts the analysis and interpretation of results. There are cases where a test may reach significance with a one-tailed approach but not with a two-tailed one, and vice versa. Reviewing the previous graph can help illustrate this: for example, a result in the left tail might be significant under a two-tailed hypothesis but not under a right one-tailed hypothesis. Conversely, certain results might fall within the rejection region of a right one-tailed test but lie outside the rejection area in a two-tailed test.

How to decide between a one-tailed and two-tailed hypothesis

Let’s start with the bottom line: there’s no absolute right or wrong choice here. Both approaches are valid, and the primary consideration should be your specific business needs. To help you decide which option best suits your company, we’ll outline the key pros and cons of each.

At first glance, a one-tailed alternative may appear to be the clear choice, as it often aligns better with business objectives. In industry applications, the focus is typically on improving specific metrics rather than exploring a treatment’s impact in both directions. This is especially relevant in A/B testing, where the goal is often to optimize conversion rates or enhance revenue. If the treatment doesn’t lead to a significant improvement, the examined change won’t be implemented.

Beyond this conceptual advantage, we have already mentioned one key benefit of a one-tailed hypothesis: it requires a smaller sample size. Thus, choosing a one-tailed alternative can save both time and resources. To illustrate this advantage, the following graphs show the required sample sizes for one- and two-tailed hypotheses with different power levels (alpha is set at 5%).
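
If you want to reproduce this kind of comparison yourself, here is a minimal sketch using statsmodels; the effect size of 0.2 is an arbitrary choice for illustration:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Arbitrary small effect size for illustration; alpha fixed at 5%
effect_size, alpha, power = 0.2, 0.05, 0.8

n_two_tailed = analysis.solve_power(effect_size=effect_size, alpha=alpha,
                                    power=power, alternative="two-sided")
n_one_tailed = analysis.solve_power(effect_size=effect_size, alpha=alpha,
                                    power=power, alternative="larger")

print(f"Per-group sample size (two-tailed): {n_two_tailed:.0f}")
print(f"Per-group sample size (one-tailed): {n_one_tailed:.0f}")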

In this context, the decision between one- and two-tailed hypotheses becomes particularly important in sequential testing—a method that allows for ongoing data analysis without inflating the alpha level. Here, selecting a one-tailed test can significantly reduce the duration of the test, enabling faster decision-making, which is especially valuable in dynamic business environments where prompt responses are essential.

However, don’t be too quick to dismiss the two-tailed hypothesis! It has its own advantages. In some business contexts, the ability to detect “negative significant results” is a major benefit. As one client once shared, he preferred negative significant results over inconclusive ones because they offer valuable learning opportunities. Even if the outcome wasn’t as expected, he could conclude that the treatment had a negative effect and gain insights into the product.

Another benefit of two-tailed tests is their straightforward interpretation using confidence intervals (CIs). In two-tailed tests, a CI that doesn’t include zero directly indicates significance, making it easier for practitioners to interpret results at a glance. This clarity is particularly appealing since CIs are widely used in A/B testing platforms. Conversely, with one-tailed tests, a significant result might still include zero in the CI, potentially leading to confusion or mistrust in the findings. Although one-sided confidence intervals can be employed with one-tailed tests, this practice is less common.
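
As a rough sketch of that CI-based reading (simulated data, purely for demonstration), a two-sided 95% interval for the difference in means can be checked against zero directly:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=500)    # simulated control metric
treatment = rng.normal(loc=10.3, scale=2.0, size=500)  # simulated treatment metric

diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
dof = len(treatment) + len(control) - 2  # simple approximation of the degrees of freedom

t_crit = stats.t.ppf(0.975, dof)  # two-sided 95% CI
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

print(f"95% CI for the difference: [{ci_low:.3f}, {ci_high:.3f}]")
print("Significant at the 5% level (two-tailed):", not (ci_low <= 0 <= ci_high))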

Conclusions

By adjusting a single parameter, you can significantly impact your A/B testing: specifically, the sample size you need to collect and the interpretation of the results. When deciding between one- and two-tailed hypotheses, consider factors such as the available sample size, the advantages of detecting negative effects, and the convenience of aligning confidence intervals (CIs) with hypothesis testing. Ultimately, this decision should be made thoughtfully, taking into account what best fits your business needs.

(Note: all the images in this post were created by the author)

Kubernetes — Understanding and Utilizing Probes Effectively https://towardsdatascience.com/kubernetes-understanding-and-utilizing-probes-effectively/ Thu, 06 Mar 2025 03:59:54 +0000 https://towardsdatascience.com/?p=598812 Why proper configuration and implementation of Kubernetes probes is vital for any critical deployment

Introduction

Let’s talk about Kubernetes probes and why they matter in your deployments. When managing production-facing containerized applications, even small optimizations can have enormous benefits.

Reducing deployment times, making your applications react better to scaling events, and keeping running pods healthy all require fine-tuning your container lifecycle management. This is exactly why proper configuration — and implementation — of Kubernetes probes is vital for any critical deployment. They help your cluster make intelligent decisions about traffic routing, restarts, and resource allocation.

Properly configured probes dramatically improve your application’s reliability, reduce deployment downtime, and help it handle unexpected errors gracefully. In this article, we’ll explore the three types of probes available in Kubernetes and how utilizing them alongside each other helps configure more resilient systems.

Quick refresher

Understanding exactly what each probe does and some common configuration patterns is essential. Each of them serves a specific purpose in the container lifecycle and, when used together, they create a rock-solid framework for maintaining your application’s availability and performance.

Startup: Optimizing start-up times

A startup probe is evaluated once, when a new pod is spun up because of a scale-up event or a new deployment. It serves as a gatekeeper for the rest of the container checks, and fine-tuning it will help your applications better handle increased load or service degradation.

Sample Config:

startupProbe:
  httpGet:
    path: /health
    port: 80
  failureThreshold: 30
  periodSeconds: 10

Key takeaways:

  • Keep periodSeconds low, so that the probe fires often, quickly detecting a successful deployment.
  • Increase failureThreshold to a high enough value to accommodate for the worst-case start-up time.

The startup probe checks whether your container has started by querying the configured path. It also holds off the liveness and readiness probes until it succeeds.

Liveness: Detecting dead containers

Your liveness probes answer a very simple question: “Is this pod still running properly?” If not, K8s will restart it.

Sample Config:

livenessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 10
  failureThreshold: 3

Key takeaways:

  • Since a failure means K8s will kill and restart your container, set failureThreshold high enough to tolerate intermittent abnormalities.
  • Avoid using initialDelaySeconds as it is too restrictive — use a startup probe instead.

Be mindful that a failing liveness probe will bring down your currently running container and restart it, so avoid making it too aggressive — that’s the job of the next one.

Readiness: Handling unexpected errors

The readiness probe determines whether a pod should start — or continue — to receive traffic. It is extremely useful in situations where your container has lost its connection to the database or is otherwise over-utilized and should not receive new requests.

Sample Config:

readinessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 3
  failureThreshold: 1
  timeoutSeconds: 1

Key takeaways:

  • Since this is your first guard against sending traffic to unhealthy targets, make the probe aggressive and reduce the periodSeconds.
  • Keep failureThreshold at a minimum; you want to fail fast.
  • The timeout period should also be kept at a minimum, so that slower containers are quickly marked as not ready.
  • Give the readinessProbe ample time to recover by having a longer-running livenessProbe.

Readiness probes ensure that traffic will not reach a container that is not ready for it, and as such they are among the most important probes in the stack.

Putting it all together

As you can see, even if all of the probes have their own distinct uses, the best way to improve your application’s resilience strategy is using them alongside each other.

Your startup probe will assist you in scale-up scenarios and new deployments, allowing your containers to be brought up quickly. It fires only once and also holds off the execution of the other probes until it completes successfully.

The liveness probe helps in dealing with dead containers suffering from non-recoverable errors, telling the cluster to kill them and bring up a fresh container for you.

The readiness probe is the one telling K8s whether a pod should receive traffic or not. It can be extremely useful for dealing with intermittent errors or high resource consumption resulting in slower response times.
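
If you want to audit which probes your running pods actually have configured, here is a small sketch using the official Kubernetes Python client; the pod name and namespace below are placeholders:

from kubernetes import client, config

# Assumes a local kubeconfig; use config.load_incluster_config() when running inside a cluster
config.load_kube_config()
v1 = client.CoreV1Api()

# "my-app" and "default" are hypothetical values
pod = v1.read_namespaced_pod(name="my-app", namespace="default")
for container in pod.spec.containers:
    print(f"Container: {container.name}")
    print("  startupProbe:  ", container.startup_probe)
    print("  livenessProbe: ", container.liveness_probe)
    print("  readinessProbe:", container.readiness_probe)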

Additional configurations

Probes can be further configured to use a command in their checks instead of an HTTP request, as well as giving ample time for the container to safely terminate. While these are useful in more specific scenarios, understanding how you can extend your deployment configuration can be beneficial, so I’d recommend doing some additional reading if your containers handle unique use cases.

Further reading:
Liveness, Readiness, and Startup Probes
Configure Liveness, Readiness and Startup Probes

Practical SQL Puzzles That Will Level Up Your Skill https://towardsdatascience.com/practical-sql-puzzles-that-will-level-up-your-skill/ Tue, 04 Mar 2025 19:46:10 +0000 https://towardsdatascience.com/?p=598673 Three real-world SQL patterns that can be applied to many problems

There are some SQL patterns that, once you know them, you start seeing everywhere. The solutions to the puzzles that I will show you today are actually very simple SQL queries, but understanding the concept behind them will surely unlock new solutions to the queries you write on a day-to-day basis.

These challenges are all based on real-world scenarios, as over the past few months I made a point of writing down every puzzle-like query that I had to build. I also encourage you to try them for yourself, so that you can challenge yourself first, which will improve your learning!

All queries to generate the datasets will be provided in a PostgreSQL and DuckDB-friendly syntax, so that you can easily copy and play with them. At the end I will also provide you a link to a GitHub repo containing all the code, as well as the answer to the bonus challenge I will leave for you!

I organized these puzzles in order of increasing difficulty, so, if you find the first ones too easy, at least take a look at the last one, which uses a technique that I truly believe you won’t have seen before.

Okay, let’s get started.

Analyzing ticket moves

I love this puzzle because of how short and simple the final query is, even though it deals with many edge cases. The data for this challenge shows tickets moving in between Kanban stages, and the objective is to find how long, on average, tickets stay in the Doing stage.

The data contains the ID of the ticket, the date the ticket was created, the date of the move, and the “from” and “to” stages of the move. The stages present are New, Doing, Review, and Done.

Some things you need to know (edge cases):

  • Tickets can move backwards, meaning tickets can go back to the Doing stage.
  • You should not include tickets that are still stuck in the Doing stage, as there is no way to know how long they will stay there.
  • Tickets are not always created in the New stage.
CREATE TABLE ticket_moves (
    ticket_id INT NOT NULL,
    create_date DATE NOT NULL,
    move_date DATE NOT NULL,
    from_stage TEXT NOT NULL,
    to_stage TEXT NOT NULL
);
INSERT INTO ticket_moves (ticket_id, create_date, move_date, from_stage, to_stage)
    VALUES
        -- Ticket 1: Created in "New", then moves to Doing, Review, Done.
        (1, '2024-09-01', '2024-09-03', 'New', 'Doing'),
        (1, '2024-09-01', '2024-09-07', 'Doing', 'Review'),
        (1, '2024-09-01', '2024-09-10', 'Review', 'Done'),
        -- Ticket 2: Created in "New", then moves: New → Doing → Review → Doing again → Review.
        (2, '2024-09-05', '2024-09-08', 'New', 'Doing'),
        (2, '2024-09-05', '2024-09-12', 'Doing', 'Review'),
        (2, '2024-09-05', '2024-09-15', 'Review', 'Doing'),
        (2, '2024-09-05', '2024-09-20', 'Doing', 'Review'),
        -- Ticket 3: Created in "New", then moves to Doing. (Edge case: no subsequent move from Doing.)
        (3, '2024-09-10', '2024-09-16', 'New', 'Doing'),
        -- Ticket 4: Created already in "Doing", then moves to Review.
        (4, '2024-09-15', '2024-09-22', 'Doing', 'Review');

A summary of the data:

  • Ticket 1: Created in the New stage, moves normally to Doing, then Review, and then Done.
  • Ticket 2: Created in New, then moves: New → Doing → Review → Doing again → Review.
  • Ticket 3: Created in New, moves to Doing, but it is still stuck there.
  • Ticket 4: Created in the Doing stage, moves to Review afterward.

It might be a good idea to stop for a bit and think about how you would deal with this. Can you find out how long a ticket stays in a single stage?

Honestly, this sounds intimidating at first, and it looks like it will be a nightmare to deal with all the edge cases. Let me show you the full solution to the problem, and then I will explain what is happening afterward.

WITH stage_intervals AS (
    SELECT
        ticket_id,
        from_stage,
        move_date 
        - COALESCE(
            LAG(move_date) OVER (
                PARTITION BY ticket_id 
                ORDER BY move_date
            ), 
            create_date
        ) AS days_in_stage
    FROM
        ticket_moves
)
SELECT
    SUM(days_in_stage) / COUNT(DISTINCT ticket_id) as avg_days_in_doing
FROM
    stage_intervals
WHERE
    from_stage = 'Doing';

The first CTE uses the LAG function to find the previous move of the ticket, which will be the time the ticket entered that stage. Calculating the duration is as simple as subtracting the previous date from the move date.

What you should notice is the use of COALESCE on the previous move date. If a ticket doesn’t have a previous move, COALESCE falls back to the ticket’s creation date. This takes care of tickets created directly in the Doing stage, as the time it took to leave the stage is still calculated properly.

This is the result of the first CTE, showing the time spent in each stage. Notice how Ticket 2 has two entries, as it visited the Doing stage on two separate occasions.

With this done, it’s just a matter of getting the average as the SUM of total days spent in Doing, divided by the number of distinct tickets that ever left the stage. Doing it this way, instead of simply using AVG, makes sure that the two rows for Ticket 2 get properly accounted for as a single ticket.

Not so bad, right?

Finding contract sequences

The goal of this second challenge is to find the most recent contract sequence of every employee. A break of sequence happens when two contracts have a gap of more than one day between them. 

In this dataset, there are no contract overlaps, meaning that a contract for the same employee either has a gap or ends a day before the new one starts.

CREATE TABLE contracts (
    contract_id integer PRIMARY KEY,
    employee_id integer NOT NULL,
    start_date date NOT NULL,
    end_date date NOT NULL
);

INSERT INTO contracts (contract_id, employee_id, start_date, end_date)
VALUES 
    -- Employee 1: Two continuous contracts
    (1, 1, '2024-01-01', '2024-03-31'),
    (2, 1, '2024-04-01', '2024-06-30'),
    -- Employee 2: One contract, then a gap of three days, then two contracts
    (3, 2, '2024-01-01', '2024-02-15'),
    (4, 2, '2024-02-19', '2024-04-30'),
    (5, 2, '2024-05-01', '2024-07-31'),
    -- Employee 3: One contract
    (6, 3, '2024-03-01', '2024-08-31');

As a summary of the data:

  • Employee 1: Has two continuous contracts.
  • Employee 2: One contract, then a gap of three days, then two contracts.
  • Employee 3: One contract.

The expected result, given the dataset, is that all contracts should be included except for the first contract of Employee 2, which is the only one that has a gap.

Before explaining the logic behind the solution, I would like you to think about which operation can be used to join the contracts that belong to the same sequence. Focus only on the second row of data: what information do you need to know whether this contract was a break or not?

I hope it’s clear that this is the perfect situation for window functions, again. They are incredibly useful for solving problems like this, and understanding when to use them helps a lot in finding clean solutions to problems.

The first thing to do, then, is to get the end date of the previous contract for the same employee with the LAG function. With that, it’s simple to compare both dates and check whether there was a break in the sequence.

WITH ordered_contracts AS (
    SELECT
        *,
        LAG(end_date) OVER (PARTITION BY employee_id ORDER BY start_date) AS previous_end_date
    FROM
        contracts
),
gapped_contracts AS (
    SELECT
        *,
        -- Deals with the case of the first contract, which won't have
        -- a previous end date. In this case, it's still the start of a new
        -- sequence.
        CASE WHEN previous_end_date IS NULL
            OR previous_end_date < start_date - INTERVAL '1 day' THEN
            1
        ELSE
            0
        END AS is_new_sequence
    FROM
        ordered_contracts
)
SELECT * FROM gapped_contracts ORDER BY employee_id ASC;

An intuitive way to continue the query is to number the sequences of each employee. For example, an employee who has no gap will always be on their first sequence, but an employee who had 5 breaks in their contracts will be on their 6th sequence. Funnily enough, this is done by another window function.

--
-- Previous CTEs
--
sequences AS (
    SELECT
        *,
        SUM(is_new_sequence) OVER (PARTITION BY employee_id ORDER BY start_date) AS sequence_id
FROM
    gapped_contracts
)
SELECT * FROM sequences ORDER BY employee_id ASC;

Notice how, for Employee 2, he starts his sequence #2 after the first gapped value. To finish this query, I grouped the data by employee, got the value of their most recent sequence, and then did an inner join with the sequences to keep only the most recent one.

--
-- Previous CTEs
--
max_sequence AS (
    SELECT
        employee_id,
        MAX(sequence_id) AS max_sequence_id
FROM
    sequences
GROUP BY
    employee_id
),
latest_contract_sequence AS (
    SELECT
        c.contract_id,
        c.employee_id,
        c.start_date,
        c.end_date
    FROM
        sequences c
        JOIN max_sequence m ON c.sequence_id = m.max_sequence_id
            AND c.employee_id = m.employee_id
        ORDER BY
            c.employee_id,
            c.start_date
)
SELECT
    *
FROM
    latest_contract_sequence;

As expected, our final result is basically our starting dataset, just with the first contract of Employee 2 missing!

Tracking concurrent events

Finally, the last puzzle — I’m glad you made it this far. 

For me, this is the most mind-blowing one, as when I first encountered this problem I thought of a completely different solution that would be a mess to implement in SQL.

For this puzzle, I’ve changed the context from what I had to deal with for my job, as I think it will make it easier to explain. 

Imagine you’re a data analyst at an event venue, and you’re analyzing the talks scheduled for an upcoming event. You want to find the time of day where there will be the highest number of talks happening at the same time.

This is what you should know about the schedules:

  • Rooms are booked in increments of 30min, e.g. from 9h-10h30.
  • The data is clean; there are no overbookings of meeting rooms.
  • There can be back-to-back meetings in a single meeting room.

Meeting schedule visualized (this is the actual data). 

CREATE TABLE meetings (
    room TEXT NOT NULL,
    start_time TIMESTAMP NOT NULL,
    end_time TIMESTAMP NOT NULL
);

INSERT INTO meetings (room, start_time, end_time) VALUES
    -- Room A meetings
    ('Room A', '2024-10-01 09:00', '2024-10-01 10:00'),
    ('Room A', '2024-10-01 10:00', '2024-10-01 11:00'),
    ('Room A', '2024-10-01 11:00', '2024-10-01 12:00'),
    -- Room B meetings
    ('Room B', '2024-10-01 09:30', '2024-10-01 11:30'),
    -- Room C meetings
    ('Room C', '2024-10-01 09:00', '2024-10-01 10:00'),
    ('Room C', '2024-10-01 11:30', '2024-10-01 12:00');

The way to solve this is by using what is called a Sweep Line Algorithm, also known as an event-based solution. This last name actually helps to understand what will be done, as the idea is that instead of dealing with intervals, which is what we have in the original data, we deal with events instead.

To do this, we need to transform every row into two separate events. The first event will be the Start of the meeting, and the second event will be the End of the meeting.

WITH events AS (
  -- Create an event for the start of each meeting (+1)
  SELECT 
    start_time AS event_time, 
    1 AS delta
  FROM meetings
  UNION ALL
  -- Create an event for the end of each meeting (-1)
  SELECT 
   -- Small trick to work with the back-to-back meetings (explained later)
    end_time - interval '1 minute' as end_time,
    -1 AS delta
  FROM meetings
)
SELECT * FROM events;

Take the time to understand what is happening here. To create two events from a single row of data, we’re simply unioning the dataset on itself; the first half uses the start time as the timestamp, and the second part uses the end time.

You might already notice the delta column created and see where this is going. When an event starts, we count it as +1, when it ends, we count it as -1. You might even be already thinking of another window function to solve this, and you’re actually right!

But before that, let me just explain the trick I used with the end dates. As I don’t want back-to-back meetings to count as two concurrent meetings, I’m subtracting a single minute from every end date. This way, if a meeting ends and another starts at 10h30, it won’t be assumed that two meetings are concurrently happening at 10h30.

Okay, back to the query and yet another window function. This time, though, the function of choice is a rolling SUM.

--
-- Previous CTEs
--
ordered_events AS (
  SELECT
    event_time,
    delta,
    SUM(delta) OVER (ORDER BY event_time, delta DESC) AS concurrent_meetings
  FROM events
)
SELECT * FROM ordered_events ORDER BY event_time DESC;

The rolling SUM over the delta column is essentially walking down every record and finding how many events are active at that time. For example, at 9 am sharp, it sees two events starting, so it marks the number of concurrent meetings as two!

When the third meeting starts, the count goes up to three. But when it gets to 9h59 (10 am), two meetings end, bringing the counter back to one. With this data, the only thing missing is to find when the highest value of concurrent meetings happens.

--
-- Previous CTEs
--
max_events AS (
  -- Find the maximum concurrent meetings value
  SELECT 
    event_time, 
    concurrent_meetings,
    RANK() OVER (ORDER BY concurrent_meetings DESC) AS rnk
  FROM ordered_events
)
SELECT event_time, concurrent_meetings
FROM max_events
WHERE rnk = 1;

That’s it! The interval of 9h30–10h is the one with the largest number of concurrent meetings, which checks out with the schedule visualization above!

This solution looks incredibly simple in my opinion, and it works for so many situations. Every time you are dealing with intervals now, you should ask yourself whether the query wouldn’t be easier if you thought about it from the perspective of events.
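
As an aside, the same event-based idea is just as easy to sketch outside of SQL — here is a small Python version of the sweep line over the schedule above, with times simplified to strings for brevity:

# Each meeting as (start, end); same schedule as the SQL example above
meetings = [
    ("09:00", "10:00"), ("10:00", "11:00"), ("11:00", "12:00"),  # Room A
    ("09:30", "11:30"),                                          # Room B
    ("09:00", "10:00"), ("11:30", "12:00"),                      # Room C
]

# Turn every interval into two events: +1 at the start, -1 at the end.
# Sorting the (time, delta) tuples puts -1 before +1 at the same timestamp,
# so back-to-back meetings are not counted as concurrent.
events = [(start, 1) for start, _ in meetings] + [(end, -1) for _, end in meetings]
events.sort()

running, best, best_time = 0, 0, None
for time, delta in events:
    running += delta
    if running > best:
        best, best_time = running, time

print(f"Peak of {best} concurrent meetings, starting at {best_time}")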

But before you move on, and to really nail down this concept, I want to leave you with a bonus challenge, which is also a common application of the Sweep Line Algorithm. I hope you give it a try!

Bonus challenge

The context for this one is still the same as the last puzzle, but now, instead of trying to find the period when there are the most concurrent meetings, the objective is to find bad scheduling. It seems that there are overlaps in the meeting rooms, which need to be listed so they can be fixed ASAP.

How would you find out if the same meeting room has two or more meetings booked at the same time? Here are some tips on how to solve it:

  • It’s still the same algorithm.
  • This means you will still do the UNION, but it will look slightly different.
  • You should think from the perspective of each meeting room.

You can use this data for the challenge:

CREATE TABLE meetings_overlap (
    room TEXT NOT NULL,
    start_time TIMESTAMP NOT NULL,
    end_time TIMESTAMP NOT NULL
);

INSERT INTO meetings_overlap (room, start_time, end_time) VALUES
    -- Room A meetings
    ('Room A', '2024-10-01 09:00', '2024-10-01 10:00'),
    ('Room A', '2024-10-01 10:00', '2024-10-01 11:00'),
    ('Room A', '2024-10-01 11:00', '2024-10-01 12:00'),
    -- Room B meetings
    ('Room B', '2024-10-01 09:30', '2024-10-01 11:30'),
    -- Room C meetings
    ('Room C', '2024-10-01 09:00', '2024-10-01 10:00'),
    -- Overlaps with previous meeting.
    ('Room C', '2024-10-01 09:30', '2024-10-01 12:00');

If you’re interested in the solution to this puzzle, as well as the rest of the queries, check this GitHub repo.

Conclusion

The first takeaway from this blog post is that window functions are overpowered. Ever since I got more comfortable with using them, I feel that my queries have gotten so much simpler and easier to read, and I hope the same happens to you.

If you’re interested in learning more about them, you would probably enjoy reading this other blog post I’ve written, where I go over how you can understand and use them effectively.

The second takeaway is that these patterns used in the challenges really do happen in many other places. You might need to find sequences of subscriptions, customer retention, or you might need to find overlap of tasks. There are many situations when you will need to use window functions in a very similar fashion to what was done in the puzzles.

The third thing I want you to remember is this approach of working with events instead of intervals. I’ve looked at some problems I solved a long time ago that I could’ve used this pattern on to make my life easier, and unfortunately, I didn’t know about it at the time.


I really do hope you enjoyed this post and gave the puzzles a shot yourself. And I’m sure that if you made it this far, you either learned something new about SQL or strengthened your knowledge of window functions!

Thank you so much for reading. If you have questions or just want to get in touch with me, don’t hesitate to contact me at mtrentz.com.

All images by the author unless stated otherwise.

Mastering 1:1s as a Data Scientist: From Status Updates to Career Growth https://towardsdatascience.com/mastering-11s-as-a-data-scientist-from-status-updates-to-career-growth/ Tue, 04 Mar 2025 19:12:13 +0000 https://towardsdatascience.com/?p=598667 Use your 1:1s to gain visibility, solve challenges, and advance your career

I have been a data team manager for six months, and my team has grown from three to five.

I wrote about my initial manager experiences back in November. In this article, I want to talk about something that is more essential to the relationship between a DS or DA individual contributor (IC) and their manager — the 1:1 meetings. I remember when I first started my career, I felt nervous and awkward in my 1:1s, as I didn’t know what to expect or what was useful. Now, having been on both sides during 1:1s, I understand better how to have an effective 1:1 meeting.

If you have ever struggled with how to make the best out of your 1:1s, here are my essential tips.

I. Set up a regular 1:1 cadence

First and foremost, 1:1 meetings with your manager should happen regularly. It could be weekly or biweekly, depending on the pace of your projects. For example, if you are more analytics-focused and have lots of fast-moving reporting and analysis tasks, a weekly 1:1 might be better to provide timely updates and align on project prioritization. However, if you are focusing on a long-term machine learning project that will span multiple weeks, you might feel more comfortable with a biweekly cadence — this allows you to do your research, try different approaches, and have meaningful conversations during 1:1s.

I have weekly recurring 30-minute 1:1 slots with everyone on my team, just to make sure I always have this dedicated time for them every week. These meetings sometimes end up being short 15-minute chats or even casual conversations about life after work, but I still find them super helpful for staying updated on what’s on top of everyone’s mind and building personal connections.

II. Make preparations and update your 1:1 agenda

Preparing for your 1:1 is critical. I maintain a shared 1:1 document with my manager and update it every week before our meetings. I also appreciate my direct reports preparing their 1:1 agenda beforehand. Here is why:

  • Throughout the week, I like to jot down discussion topics quickly on my 1:1 doc whenever they come to my mind. This ensures I cover all important points during the meeting and improves communication effectiveness.
  • Having an agenda helps both you and your manager keep track of what has been discussed and keeps everyone accountable. We talk to many people every day, so it is totally normal if you lose track of what you have mentioned to someone. Therefore, having such a doc reminds you of your previous conversations. Now, as a manager with a team of five, I also turn to the 1:1 docs to ensure I address all open questions and action items from the last meeting and find links to past projects.
  • It can also assist your performance review process. When writing my self-review, I read through my 1:1 doc to list my achievements. Similarly, I also use the 1:1 docs with my team to make sure I do not miss any highlights from their projects.

So, what are good topics for 1:1? See the section below.

III. Topics on your 1:1 agenda

While each manager has their preferences, there’s a wide range of topics that are generally appropriate for 1:1s. You don’t have to cover every one of them, but I hope they give you some inspiration and you no longer feel clueless about your 1:1.

  • Achievements since the last 1:1: I recommend listing the latest achievements in your 1:1 doc. You don’t have to talk about each one in detail during the meeting, but it’s good to give your manager visibility and remind them how good you are 🙂. It is also a good idea to highlight both your effort and impact. Business is usually impact-driven, and the data team is no exception. If your A/B test leads to a go/no-go decision, mention that in the meeting. If your analysis leads to a product idea, bring it up and discuss how you plan to support the development and measure the impact.
  • Ongoing and upcoming projects: One common pattern I’ve observed in my 7-year career is that Data Teams usually have long backlogs with numerous “urgent” requests. 1:1 is a good time to align with your manager on shifting priorities and timelines.
    • If your project is blocked, let your manager know. While independence is always appreciated, unexpected blockers can arise at any time. It’s perfectly acceptable to work through the blockers with your manager, as they typically have more experience and are supposed to empower you to complete your projects. It is better to let your manager know ahead of time instead of letting them find out themselves later and ask you why you missed the timeline. Meanwhile, ideally, you don’t just bring up the blockers but also suggest possible solutions or ask for specific help. For example, “I am blocked on accessing X data. Should I prioritize building the data pipeline with the data engineer or push for an ad-hoc pull?” This shows you are a true problem-solver with a growth mindset.
  • Career growth: You can also use the 1:1 time to talk about career growth topics. Career growth for data scientists isn’t just about promotions. You might be more interested in growing technical expertise in a specific domain, such as experimentation, or moving from DS to different functions like MLE, or gaining Leadership experience and transitioning to a people management role, just like me. To make sure you are moving towards your career goal, you should have this conversation with your manager regularly so they can provide corresponding advice and match you with projects that align with your long-term goal.
    • I also have monthly career growth check-in sessions with my team to specifically talk about career progress. If you always find your 1:1 time being occupied by project updates, consider setting up a separate meeting like this with your manager.
  • Feedback: Feedback should go both directions.
    • Your manager likely does not have as much time to work on data projects as you do. Therefore, you might notice inefficiencies in project workflows, analysis processes, or cross-functional collaboration that they aren’t aware of. Don’t hesitate to bring these up. And similar to handling blockers, it’s recommended to think about potential solutions before going to the meeting to show your manager you are a team player who contributes to the team’s culture and success. For example, instead of saying, “We’re getting too many ad-hoc requests,” frame it as “Ad-hoc requests coming through Slack DMs reduce our focus time on planned projects. Could we invite stakeholders to our sprint planning meetings to align on priorities and have a more formal request intake process during the sprint?”
    • Meanwhile, you can also use this opportunity to ask your manager for any feedback on your performance. This helps you identify gaps, improve continuously, and ensures there are no surprises during your official performance review 🙂
  • Team and company goals: Change is the only constant in business. Data teams work closely with stakeholders, so data scientists need to understand the company’s priorities and what matters most at the moment. For example, if your company is focusing on retention, you might want to analyze drivers of higher retention and propose corresponding marketing campaign ideas to your stakeholder.

To give you a more concrete idea of the 1:1 agenda, let’s assume you work at a consumer bank and focus on the credit card rewards domain. Here is a sample agenda:

Date: 03/03/2025

✅ Last week’s accomplishments

  • Rewards A/B test analysis [link]: Shared with stakeholders, and we will launch the winning treatment A to broader users in Q1.
  • Rewards redemption analysis [link]: Most users redeem rewards for statement balance. Talking to the marketing team to run an email campaign advertising other redemption options.

🗒 Ongoing projects

  • [P0] Rewards <> churn analysis: Understand if rewards activities are correlated with churn. ETA 3/7.
  • [P1] Rewards costs dashboard: Build a dashboard tracking the costs of all rewards activities. ETA 3/12.
  • [Blocked] Travel credit usage dashboard: Waiting for DE to set up the travel booking table. Followed up on 2/27. Need escalation?
  • [Deprioritized] Retail merchant bonus rewards campaign support: This was deprioritized by the marketing team as we delayed the campaign.

🔍 Other topics

  • I would like to gain more experience in machine learning. Are there any project opportunities?
  • Any feedback on my collaboration with the stakeholder?

Please also keep in mind that you should update your 1:1 doc actively during the meeting. It should reflect what is discussed and include important notes for each bullet point. You can even add an ‘Action Items’ section at the bottom of each meeting agenda to make the next steps clear.

Final thoughts

Above are my essential tips to run effective 1:1s as a data scientist. By establishing regular meetings, preparing thoughtful agendas, and covering meaningful topics, you can transform these meetings from awkward status updates into valuable growth opportunities. Remember, your 1:1 isn’t just about updating your manager — it’s about getting the support, guidance, and visibility you need to grow in your role.

Data Science: From School to Work, Part II https://towardsdatascience.com/data-science-from-school-to-work-part-ii/ Mon, 03 Mar 2025 14:00:00 +0000 https://towardsdatascience.com/?p=598609 How to write clean Python code

In my previous article, I highlighted the importance of effective project management in Python development. Now, let’s shift our focus to the code itself and explore how to write clean, maintainable code — an essential practice in professional and collaborative environments. 

  • Readability & Maintainability: Well-structured code is easier to read, understand, and modify. Other developers — or even your future self — can quickly grasp the logic without struggling to decipher messy code.
  • Debugging & Troubleshooting: Organized code with clear variable names and structured functions makes it easier to identify and fix bugs efficiently.
  • Scalability & Reusability: Modular, well-organized code can be reused across different projects, allowing for seamless scaling without disrupting existing functionality.

So, as you work on your next Python project, remember: 

Half of good code is Clean Code.


Introduction

Python is one of the most popular and versatile programming languages, appreciated for its simplicity, comprehensibility and large community. Whether for web development, data analysis, artificial intelligence or task automation, Python offers powerful and flexible tools that are suitable for a wide range of areas.

However, the efficiency and maintainability of a Python project depend heavily on the practices used by the developers. Poor code structure, a lack of conventions or even a lack of documentation can quickly turn a promising project into a puzzle that is costly to maintain and extend. It is precisely this point that makes the difference between student code and professional code.

This article is intended to present the most important best practices for writing high-quality Python code. By following these recommendations, developers can create scripts and applications that are not only functional, but also readable, performant and easily maintainable by third parties.

Adopting these best practices right from the start of a project not only ensures better collaboration within teams, but also prepares your code to evolve with future needs. Whether you’re a beginner or an experienced developer, this guide is designed to support you in all your Python developments.


The code structuration

Good code structuring in Python is essential. There are two main project layouts: flat layout and src layout.

The flat layout places the source code directly in the project root without an additional folder. This approach simplifies the structure and is well-suited for small scripts, quick prototypes, and projects that do not require complex packaging. However, it may lead to unintended import issues when running tests or scripts.

📂 my_project/
├── 📂 my_project/                  # Directly in the root
│   ├── 🐍 __init__.py
│   ├── 🐍 main.py                   # Main entry point (if needed)
│   ├── 🐍 module1.py             # Example module
│   └── 🐍 utils.py
├── 📂 tests/                            # Unit tests
│   ├── 🐍 test_module1.py
│   ├── 🐍 test_utils.py
│   └── ...
├── 📄 .gitignore                      # Git ignored files
├── 📄 pyproject.toml              # Project configuration (Poetry, setuptools)
├── 📄 uv.lock                         # UV file
├── 📄 README.md               # Main project documentation
├── 📄 LICENSE                     # Project license
├── 📄 Makefile                       # Automates common tasks
├── 📄 Dockerfile                   # To create Docker image
├── 📂 .github/                        # GitHub Actions workflows (CI/CD)
│   ├── 📂 actions/               
│   └── 📂 workflows/

On the other hand, the src layout (src is the contraction of source) organizes the source code inside a dedicated src/ directory, preventing accidental imports from the working directory and ensuring a clear separation between source files and other project components like tests or configuration files. This layout is ideal for large projects, libraries, and production-ready applications as it enforces proper package installation and avoids import conflicts.

📂 my-project/
├── 📂 src/                              # Main source code
│   ├── 📂 my_project/            # Main package
│   │   ├── 🐍 __init__.py        # Makes the folder a package
│   │   ├── 🐍 main.py             # Main entry point (if needed)
│   │   ├── 🐍 module1.py       # Example module
│   │   └── ...
│   │   ├── 📂 utils/                  # Utility functions
│   │   │   ├── 🐍 __init__.py     
│   │   │   ├── 🐍 data_utils.py  # data functions
│   │   │   ├── 🐍 io_utils.py      # Input/output functions
│   │   │   └── ...
├── 📂 tests/                             # Unit tests
│   ├── 🐍 test_module1.py     
│   ├── 🐍 test_module2.py     
│   ├── 🐍 conftest.py              # Pytest configurations
│   └── ...
├── 📂 docs/                            # Documentation
│   ├── 📄 index.md                
│   ├── 📄 architecture.md         
│   ├── 📄 installation.md         
│   └── ...                     
├── 📂 notebooks/                   # Jupyter Notebooks for exploration
│   ├── 📄 exploration.ipynb       
│   └── ...                     
├── 📂 scripts/                         # Standalone scripts (ETL, data processing)
│   ├── 🐍 run_pipeline.py         
│   ├── 🐍 clean_data.py           
│   └── ...                     
├── 📂 data/                            # Raw or processed data (if applicable)
│   ├── 📂 raw/                    
│   ├── 📂 processed/
│   └── ....                                 
├── 📄 .gitignore                      # Git ignored files
├── 📄 pyproject.toml              # Project configuration (Poetry, setuptools)
├── 📄 uv.lock                         # UV file
├── 📄 README.md               # Main project documentation
├── 🐍 setup.py                       # Installation script (if applicable)
├── 📄 LICENSE                     # Project license
├── 📄 Makefile                       # Automates common tasks
├── 📄 Dockerfile                   # To create Docker image
├── 📂 .github/                        # GitHub Actions workflows (CI/CD)
│   ├── 📂 actions/               
│   └── 📂 workflows/

Choosing between these layouts depends on the project’s complexity and long-term goals. For production-quality code, the src/ layout is often recommended, whereas the flat layout works well for simple or short-lived projects.

You can imagine different templates that are better adapted to your use case. It is important that you maintain the modularity of your project. Do not hesitate to create subdirectories and to group together scripts with similar functionalities and separate those with different uses. A good code structure ensures readability, maintainability, scalability and reusability and helps to identify and correct errors efficiently.

Cookiecutter is an open-source tool for generating preconfigured project structures from templates. It is particularly useful for ensuring the coherence and organization of projects, especially in Python, by applying good practices from the outset. Both the flat layout and the src layout can be initialized using the uv tool.


The SOLID principles

SOLID programming is an essential approach to software development based on five basic principles for improving code quality, maintainability and scalability. These principles provide a clear framework for developing robust, flexible systems. By following the SOLID principles, you reduce the risk of complex dependencies, make testing easier and ensure that applications can evolve more easily in the face of change. Whether you are working on a single project or a large-scale application, mastering SOLID is an important step towards adopting object-oriented programming best practices.

S — Single Responsibility Principle (SRP)

The principle of single responsibility means that a class/function can only manage one thing. This means that it only has one reason to change. This makes the code more maintainable and easier to read. A class/function with multiple responsibilities is difficult to understand and often a source of errors.

Example:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler


# Violates SRP
class MLPipeline:
    def __init__(self, df: pd.DataFrame, target_column: str):
        self.df = df
        self.target_column = target_column
        self.scaler = StandardScaler()
        self.model = RandomForestClassifier()
   
    def preprocess_data(self):
        self.df.fillna(self.df.mean(), inplace=True)  # Handle missing values
        X = self.df.drop(columns=[self.target_column])
        y = self.df[self.target_column]
        X_scaled = self.scaler.fit_transform(X)  # Feature scaling
        return X_scaled, y
        
    def train_model(self):
        X, y = self.preprocess_data()  # Data preprocessing inside model training
        self.model.fit(X, y)
        print("Model training complete.")

Here, the MLPipeline class has two responsibilities: preprocessing the data and training the model.

# Follows SRP
class DataPreprocessor:
    def __init__(self):
        self.scaler = StandardScaler()
        
    def preprocess(self, df: pd.DataFrame, target_column: str):
        df = df.copy()
        df.fillna(df.mean(), inplace=True)  # Handle missing values
        X = df.drop(columns=[target_column])
        y = df[target_column]
        X_scaled = self.scaler.fit_transform(X)  # Feature scaling
        return X_scaled, y


class ModelTrainer:
    def __init__(self, model):
        self.model = model
        
    def train(self, X, y):
        self.model.fit(X, y)
        print("Model training complete.")

O — Open/Closed Principle (OCP)

The open/closed principle means that a class/function must be open to extension, but closed to modification. This makes it possible to add functionality without the risk of breaking existing code.

It is not easy to develop with this principle in mind, but a good indicator for the main developer is to see more and more additions (+) and fewer and fewer removals (-) in the merge requests during project development.
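
As an illustration, here is a minimal sketch in the same spirit as the other examples: new export formats are added by creating new classes, not by modifying existing ones.

import json


# Follows OCP: adding a new export format means adding a class,
# not changing the existing ones.
class Exporter:
    def export(self, data: dict) -> str:
        raise NotImplementedError


class JSONExporter(Exporter):
    def export(self, data: dict) -> str:
        return json.dumps(data)


class CSVExporter(Exporter):
    def export(self, data: dict) -> str:
        header = ",".join(data.keys())
        values = ",".join(str(value) for value in data.values())
        return f"{header}\n{values}"


def export_report(data: dict, exporter: Exporter) -> str:
    # Open to extension (new Exporter subclasses), closed to modification
    return exporter.export(data)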

L — Liskov Substitution Principle (LSP)

The Liskov substitution principle states that a subclass must be able to replace its parent class without changing the behavior of the program, ensuring that the subclass meets the expectations defined by the base class. It limits the risk of unexpected errors.

Example:

# Violates LSP
class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


class Square(Rectangle):
    def __init__(self, side):
        super().__init__(side, side)
# Changing the width of a square violates the idea of a square.

To respect the LSP, it is better to avoid this hierarchy and use independent classes:

class Shape:
    def area(self):
        raise NotImplementedError


class Rectangle(Shape):
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side

I — Interface Segregation Principle (ISP)

The interface segregation principle states that several small, specific classes should be built instead of one large class containing methods that cannot be used in certain cases. This reduces unnecessary dependencies.

Example:

# Violates ISP
class Animal:
    def fly(self):
        raise NotImplementedError

    def swim(self):
        raise NotImplementedError

It is better to split the class Animal into several classes:

# Follows ISP
class CanFly:
    def fly(self):
        raise NotImplementedError


class CanSwim:
    def swim(self):
        raise NotImplementedError


class Bird(CanFly):
    def fly(self):
        print("Flying")


class Fish(CanSwim):
    def swim(self):
        print("Swimming")

D — Dependency Inversion Principle (DIP)

The dependency inversion principle means that a class must depend on an abstraction (an abstract class or interface) and not on a concrete class. This reduces the coupling between classes and makes the code more modular.

Example:

# Violates DIP
class Database:
    def connect(self):
        print("Connecting to database")


class UserService:
    def __init__(self):
        self.db = Database()

    def get_users(self):
        self.db.connect()
        print("Getting users")

Here, the db attribute of UserService depends on the concrete Database class. To respect the DIP, db has to depend on an abstract class.

# Follows DIP
class DatabaseInterface:
    def connect(self):
        raise NotImplementedError


class MySQLDatabase(DatabaseInterface):
    def connect(self):
        print("Connecting to MySQL database")


class UserService:
    def __init__(self, db: DatabaseInterface):
        self.db = db

    def get_users(self):
        self.db.connect()
        print("Getting users")


# We can easily change the used database.
db = MySQLDatabase()
service = UserService(db)
service.get_users()

PEP standards

PEPs (Python Enhancement Proposals) are technical and informative documents that describe new features, language improvements or guidelines for the Python community. Among them, PEP 8, which defines style conventions for Python code, plays a fundamental role in promoting readability and consistency in projects.

Adopting the PEP standards, especially PEP 8, not only ensures that the code is understandable to other developers, but also that it conforms to the standards set by the community. This facilitates collaboration, re-reads and long-term maintenance.

In this article, I present the most important aspects of the PEP standards, including:

  • Style Conventions (PEP 8): Indentations, variable names and import organization.
  • Best practices for documenting code (PEP 257).
  • Recommendations for writing typed, maintainable code (PEP 484 and PEP 563).

Understanding and applying these standards is essential to take full advantage of the Python ecosystem and contribute to professional quality projects.


PEP 8

This documentation is about coding conventions to standardize the code, and there exists a lot of documentation about PEP 8. I will not show all the recommendations in this post, only those that I consider essential when reviewing code.

Naming conventions

Variable, function and module names should be in lower case, and use underscore to separate words. This typographical convention is called snake_case.


my_variable
my_new_function()
my_module

Constants are written in capital letters and defined at the beginning of the script (after the imports):


LIGHT_SPEED
MY_CONSTANT

Finally, class names and exceptions use the CamelCase format (a capital letter at the beginning of each word). Exception names must end with Error.


MyGreatClass
MyGreatError

Remember to give your variables names that make sense! Don’t use variable names like v1, v2, func1, i, toto…

Single-character variable names are permitted for loops and indexes:

my_list = [1, 3, 5, 7, 9, 11]
for i in range(len(my_list)):
    print(my_list[i])

A more “pythonic” way of writing, to be preferred to the previous example, gets rid of the i index:

my_list = [1, 3, 5, 7, 9, 11]
for element in my_list:
    print(element)

Spaces management

It is recommended to surround operators (+, -, *, /, //, %, ==, !=, >, not, in, and, or, …) with a space before AND after:

# recommended code:
my_variable = 3 + 7
my_text = "mouse"
my_text == my_variable

# not recommended code:
my_variable=3+7
my_text="mouse"
my_text== my_variable

Do not add more than one space around an operator. On the other hand, there are no spaces inside square brackets, braces or parentheses:

# recommended code:
my_list[1]
my_dict["key"]
my_function(argument)

# not recommended code:
my_list[ 1 ]
my_dict[ "key" ]
my_function( argument )

A space is recommended after the characters “:” and “,”, but not before:

# recommended code:
my_list = [1, 2, 3]
my_dict = {"key1": "value1", "key2": "value2"}
my_function(argument1, argument2)

# not recommended code:
my_list = [1 , 2 , 3]
my_dict = {"key1":"value1", "key2":"value2"}
my_function(argument1 , argument2)

However, when slicing lists, we don’t put spaces around the “:”:

my_list = [1, 3, 5, 7, 9, 1]

# recommended code:
my_list[1:3]
my_list[1:4:2]
my_list[::2]

# not recommended code:
my_list[1 : 3]
my_list[1: 4:2 ]
my_list[ : :2]

Line length

For the sake of readability, it is recommended to write lines of code no longer than 80 characters. In certain circumstances this rule can be broken; for example, if you are working on a Dash project, it may be complicated to respect this recommendation.

The \ character can be used to cut lines that are too long.

For example:

my_variable = 3
if my_variable > 1 and my_variable < 10 \
    and my_variable % 2 == 1 and my_variable % 3 == 0:
    print(f"My variable is equal to {my_variable }")

Within parentheses, you can continue on a new line without using the \ character. This can be useful for specifying the arguments of a function or method when defining or using it:

def my_function(argument_1, argument_2,
                argument_3, argument_4):
    return argument_1 + argument_2

It is also possible to create multi-line lists or dictionaries by going to a new line after a comma:

my_list = [1, 2, 3,
           4, 5, 6,
           7, 8, 9]
my_dict = {"key1": 13,
           "key2": 42,
           "key3": -10}

Blank lines

In a script, blank lines are useful for visually separating different parts of the code. It is recommended to leave two blank lines before the definition of a function or class, and to leave a single blank line before the definition of a method (in a class). You can also leave a blank line in the body of a function to separate the logical sections of the function, but this should be used sparingly.
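A small sketch of this layout (with hypothetical names):

import math


def circle_area(radius):
    return math.pi * radius ** 2


class Circle:
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return circle_area(self.radius)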

Comments

Comments always begin with the # symbol followed by a space. They give clear explanations of the purpose of the code and must be kept synchronized with it: if the code is modified, the comments must be updated too (if applicable). They are indented at the same level as the code they describe. Comments are complete sentences, with a capital letter at the beginning (unless the first word is a variable, which is written without a capital letter) and a period at the end. I strongly recommend writing comments in English, and it is important to be consistent between the language used for comments and the language used to name variables. Finally, comments that follow the code on the same line should be avoided wherever possible and, when used, should be separated from the code by at least two spaces.
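For example, a block comment and an inline comment following these rules (with hypothetical variable names) could look like this:

# Compute the average temperature over the sampling window.
average_temperature = sum(samples) / len(samples)

threshold = 25  # Threshold in degrees Celsius.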

Tool to help you

Ruff, written in Rust, is a linter (code analysis tool) and formatter for Python code. It combines the advantages of the flake8 linter with the black and isort formatters, while being faster.

Ruff has an extension on the VS Code editor.

To check your code you can type:

ruff check my_module.py

It is also possible to automatically format it with the following command:

ruff format my_module.py
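Note that ruff format only reformats the code; if you also want Ruff to automatically fix the lint violations it detects, the check command accepts a --fix flag:

ruff check --fix my_module.py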

PEP 20

PEP 20: The Zen of Python is a set of 19 principles written in poetic form. They are more a way of coding than actual guidelines.
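You can display them at any time by importing the this module in a Python interpreter:

>>> import this
The Zen of Python, by Tim Peters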

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren’t special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one– and preferably only one –obvious way to do it.
Although that way may not be obvious at first unless you’re Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it’s a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea — let’s do more of those!

PEP 257

The aim of PEP 257 is to standardize the use of docstrings.

What is a docstring?

A docstring is a string that appears as the first instruction after the definition of a function, class or method. A docstring becomes the output of the __doc__ special attribute of this object.

def my_function():
    """This is a doctring."""
    pass

And we have:

>>> my_function.__doc__
>>> 'This is a docstring.'

We always write a docstring between triple double quotes (""").

Docstring on a line

Used for simple functions or methods, it must fit on a single line. The closing quotes are on the same line as the opening quotes, and there are no blank lines before or after the docstring.

def add(a, b):
    """Return the sum of a and b."""
    return a + b

A single-line docstring MUST NOT restate the function/method signature. Do not do:

def my_function(a, b):
    """ my_function(a, b) -> list"""

Docstring on several lines

The first line should be a summary of the object being documented. An empty line follows, followed by more detailed explanations or clarifications of the arguments.

def divide(a, b):
    """Divide a by b.

    Returns the result of the division. Raises a ValueError if b equals 0.
    """
    if b == 0:
        raise ValueError("Only Chuck Norris can divide by 0")
    return a / b

Complete Docstring

A complete docstring is made up of several parts (in this case, based on the numpydoc standard).

  1. Short description: Summarizes the main functionality.
  2. Parameters: Describes the arguments with their type, name and role.
  3. Returns: Specifies the type and role of the returned value.
  4. Raises: Documents exceptions raised by the function.
  5. Notes (optional): Provides additional explanations.
  6. Examples (optional): Contains illustrated usage examples with expected results or exceptions.

def calculate_mean(numbers: list[float]) -> float:
    """
    Calculate the mean of a list of numbers.

    Parameters
    ----------
    numbers : list of float
        A list of numerical values for which the mean is to be calculated.

    Returns
    -------
    float
        The mean of the input numbers.

    Raises
    ------
    ValueError
        If the input list is empty.

    Notes
    -----
    The mean is calculated as the sum of all elements divided by the number of elements.

    Examples
    --------
    Calculate the mean of a list of numbers:
    >>> calculate_mean([1.0, 2.0, 3.0, 4.0])
    2.5
    """

Tool to help you

VS Code’s autoDocstring extension lets you automatically create a docstring template.

PEP 484

In some programming languages, typing is mandatory when declaring a variable. In Python, typing is optional, but strongly recommended. PEP 484 introduces a typing system for Python, annotating the types of variables, function arguments and return values. This PEP provides a basis for improving code readability, facilitating static analysis and reducing errors.

What is typing?

Typing consists of explicitly declaring the type (float, string, etc.) of a variable. The typing module provides standard tools for defining generic types, such as Sequence, List, Union, Any, etc.

To annotate a function, we use “:” for its arguments and “->” for the type of the returned value.

Here is a list of untyped functions:

def show_message(message):
    print(f"Message : {message}")

def addition(a, b):
    return a + b

def is_even(n):
    return n % 2 == 0

def list_square(numbers):
    return [x**2 for x in numbers]

def reverse_dictionary(d):
    return {v: k for k, v in d.items()}

def add_element(ensemble, element):
    ensemble.add(element)
    return ensemble

Now here’s how they should look:

from typing import List, Tuple, Dict, Set, Any

def show_message(message: str) -> None:
    print(f"Message : {message}")

def addition(a: int, b: int) -> int:
    return a + b

def is_even(n: int) -> bool:
    return n % 2 == 0

def list_square(numbers: List[int]) -> List[int]:
    return [x**2 for x in numbers]

def reverse_dictionary(d: Dict[str, int]) -> Dict[int, str]:
    return {v: k for k, v in d.items()}

def add_element(ensemble: Set[int], element: int) -> Set[int]:
    ensemble.add(element)
    return ensemble

Tool to help you

The MyPy extension automatically checks whether the use of a variable corresponds to the declared type. For example, for the following function:

def my_function(x: float) -> float:
    return x.mean()

The editor will point out that a float has no “mean” attribute.


The benefit is twofold: you’ll know whether the declared type is the right one and whether the use of this variable corresponds to its type.

In the above example, x must be of a type that has a mean() method (e.g. np.ndarray).
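A minimal corrected sketch (assuming NumPy is the intended array type) that MyPy would accept:

import numpy as np

def my_function(x: np.ndarray) -> float:
    # np.ndarray provides a mean() method, so the annotation and the usage now agree.
    return float(x.mean())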


Conclusion

In this article, we have looked at the most important principles for creating clean Python production code. A solid architecture, adherence to SOLID principles, and compliance with PEP recommendations (at least the four discussed here) are essential for ensuring code quality. The desire for beautiful code is not (just) coquetry. It standardizes development practices and makes teamwork and maintenance much easier. There’s nothing more frustrating than spending hours (or even days) reverse-engineering a program, deciphering poorly written code before you’re finally able to fix the bugs. By applying these best practices, you ensure that your code remains clear, scalable, and easy for any developer to work with in the future.


References

1. src layout vs flat layout

2. SOLID principles

3. Python Enhancement Proposals index

The post Data Science: From School to Work, Part II appeared first on Towards Data Science.

]]>
I Won’t Change Unless You Do https://towardsdatascience.com/i-wont-change-unless-you-do/ Fri, 28 Feb 2025 12:00:00 +0000 https://towardsdatascience.com/?p=598543 Game Theory 101: The Nash equilibrium

The post I Won’t Change Unless You Do appeared first on Towards Data Science.

]]>
In game theory, how can players ever settle on their decisions if there might still be a better option to choose? Maybe one player still wants to change their decision. But if they do, maybe the other player wants to change too. How can they ever hope to escape from this vicious circle? To solve this problem, game theory relies on the concept of a Nash equilibrium, which I will explain in this article.

This article is the second part of a four-chapter series on game theory. If you haven’t checked out the first chapter yet, I’d encourage you to do that to get familiar with the main terms and concepts of game theory. If you did so, you are prepared for the next steps of our journey through game theory. Let’s go!

Finding the solution

Finding a solution to a game in game theory can be tricky sometimes. Photo by Mel Poole on Unsplash

We will now try to find a solution for a game in game theory. A solution is a set of actions in which each player maximizes their utility and therefore behaves rationally. That does not necessarily mean that each player wins the game, but that they do the best they can, given that they don’t know what the other players will do. Let’s consider the following game:

If you are unfamiliar with this matrix-notation, you might want to take a look back at Chapter 1 and refresh your memory. Do you remember that this matrix gives you the reward for each player given a specific pair of actions? For example, if player 1 chooses action Y and player 2 chooses action B, player 1 will get a reward of 1 and player 2 will get a reward of 3. 

Okay, what actions should the players decide on now? Player 1 does not know what player 2 will do, but they can still try to find out what would be the best action depending on player 2’s choice. If we compare the utilities of actions Y and Z (indicated by the blue and red boxes in the next figure), we notice something interesting: if player 2 chooses action A (first column of the matrix), player 1 will get a reward of 3 if they choose action Y and a reward of 2 if they choose action Z, so action Y is better in that case. But what happens if player 2 decides on action B (second column)? In that case, action Y gives a reward of 1 and action Z gives a reward of 0, so Y is better than Z again. And if player 2 chooses action C (third column), Y is still better than Z (reward of 2 vs. reward of 1). That means that player 1 should never use action Z, because action Y is always better.

We compare the rewards for player 1 for actions Y and Z.

With the aforementioned considerations, player 2 can anticipate that player 1 will never use action Z, and hence player 2 doesn’t have to care about the rewards that belong to action Z. This makes the game much smaller, because now there are only two options left for player 1, and this also helps player 2 decide on their action.

We found out that for player 1, Y is always better than Z, so we don’t consider Z anymore.

If we look at the truncated game, we see that for player 2, action B is always better than action A. If player 1 chooses X, action B (with a reward of 2) is better than action A (with a reward of 1), and the same applies if player 1 chooses action Y. Note that this would not be the case if action Z were still in the game. However, we already saw that action Z will never be played by player 1 anyway.

We compare the rewards for player 2 for actions A and B.

As a consequence, player 2 would never use action A. Now if player 1 anticipates that player 2 never uses action A, the game becomes smaller again and fewer options have to be considered.

We saw that for player 2, action B is always better than action A, so we don’t have to consider A anymore.

We can easily continue in a similar fashion and see that for player 1, X is now always better than Y (2>1 and 4>2). Finally, given that player 1 chooses action X, player 2 will choose action B, which is better than C (2>0). In the end, only action X (for player 1) and action B (for player 2) are left. That is the solution of our game:

In the end, only one option remains, namely player 1 using X and player 2 using B.

It would be rational for player 1 to choose action X and for player 2 to choose action B. Note that we came to that conclusion without exactly knowing what the other player would do. We just anticipated that some actions would never be taken, because they are always worse than other actions. Such actions are called strictly dominated. For example, action Z is strictly dominated by action Y, because Y is always better than Z. 

The best answer

Scrabble is one of those games where searching for the best answer can take ages. Photo by Freysteinn G. Jonsson on Unsplash

Such strictly dominated actions do not always exist, but there is a similar concept that is of importance for us, called a best answer. Say we know which action the other player chooses. In that case, deciding on an action becomes very easy: we just take the action that has the highest reward. If player 1 knew that player 2 chose option A, the best answer for player 1 would be Y, because Y has the highest reward in that column. Do you see how we always searched for the best answers before? For each possible action of the other player, we searched for the best answer if the other player chose that action. More formally, player i’s best answer to a given set of actions of all other players is the action of player i that maximizes their utility, given the other players’ actions. Also be aware that a strictly dominated action can never be a best answer.

Let us come back to a game we introduced in the first chapter: The prisoners’ dilemma. What are the best answers here?

Prisoners’ dilemma

How should player 1 decide if player 2 confesses or denies? If player 2 confesses, player 1 should confess as well, because a reward of -3 is better than a reward of -6. And what happens if player 2 denies? In that case, confessing is better again, because it would give a reward of 0, which is better than a reward of -1 for denying. That means that for player 1, confessing is the best answer to both actions of player 2. Player 1 doesn’t have to worry about the other player’s actions at all and should always confess. Because of the game’s symmetry, the same applies to player 2. For them, confessing is also the best answer, no matter what player 1 does.

The Nash Equilibrium

The Nash equilibrium is somewhat like the master key that allows us to solve game-theoretic problems. Researchers were very happy when they found it. Photo by rc.xyz NFT gallery on Unsplash

If all players play their best answer, we have reached a solution of the game that is called a Nash equilibrium. This is a key concept in game theory because of an important property: in a Nash equilibrium, no player has any reason to change their action, unless another player does. That means all players are as happy as they can be in the situation, and they wouldn’t change even if they could. Consider the prisoners’ dilemma from above: the Nash equilibrium is reached when both confess. In this case, no player would change their action unilaterally. They would both be better off if they both changed their actions and decided to deny, but since they can’t communicate, they don’t expect any change from the other player, and so they don’t change themselves either.
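To make this concrete, here is a minimal Python sketch (not part of the original text) that brute-forces the pure-strategy Nash equilibria of the prisoners’ dilemma, using the rewards described above:

# Payoffs as (player 1 reward, player 2 reward) for each action pair.
payoffs = {
    ("confess", "confess"): (-3, -3),
    ("confess", "deny"): (0, -6),
    ("deny", "confess"): (-6, 0),
    ("deny", "deny"): (-1, -1),
}
actions = ["confess", "deny"]

def is_nash_equilibrium(a1, a2):
    # Neither player can improve their reward by deviating alone.
    u1, u2 = payoffs[(a1, a2)]
    best_u1 = max(payoffs[(alt, a2)][0] for alt in actions)
    best_u2 = max(payoffs[(a1, alt)][1] for alt in actions)
    return u1 == best_u1 and u2 == best_u2

equilibria = [(a1, a2) for a1 in actions for a2 in actions
              if is_nash_equilibrium(a1, a2)]
print(equilibria)  # [('confess', 'confess')]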

You may wonder if there is always a single Nash equilibrium for each game. Let me tell you there can also be multiple ones, as in the Bach vs. Stravinsky game that we already got to know in Chapter 1:

Bach vs. Stravinsky

This game has two Nash equilibria: (Bach, Bach) and (Stravinsky, Stravinsky). In both scenarios, you can easily imagine that there is no reason for any player to change their action in isolation. If you sit in the Bach concerto with your friend, you would not leave your seat to go to the Stravinsky concerto alone, even if you favour Stravinsky over Bach. Likewise, the Bach fan wouldn’t walk out of the Stravinsky concerto if that meant leaving their friend alone. In the remaining two scenarios, you would think differently though: if you were in the Stravinsky concerto alone, you would want to get out of there and join your friend in the Bach concerto. That is, you would change your action even if the other player doesn’t change theirs. This tells you that the scenario you were in was not a Nash equilibrium.

However, there can also be games that have no Nash equilibrium at all. Imagine you are a soccer goalkeeper during a penalty shot. For simplicity, we assume you can jump to the left or to the right. The player of the opposing team can also shoot into the left or right corner, and we assume that you catch the ball if you choose the same corner as they do, and that you don’t catch it if you choose opposite corners. We can display this game as follows:

A game matrix for a penalty shooting.

You won’t find any Nash equilibrium here. Each scenario has a clear winner (reward 1) and a clear loser (reward -1), and hence one of the players will always want to change. If you jump to the right and catch the ball, your opponent will wish to change to the left corner. But then you again will want to change your decision, which will make your opponent choose the other corner again and so on.

Summary

We learned about finding a point of balance, where nobody wants to change anymore. That is a Nash equilibrium. Photo by Eran Menashri on Unsplash

This chapter showed how to find solutions for games by using the concept of a Nash equilibrium. Let us summarize what we have learned so far: 

  • A solution of a game in game theory maximizes every player’s utility or reward. 
  • An action is called strictly dominated if there is another action that is always better. In this case, it would be irrational to ever play the strictly dominated action.
  • The action that yields the highest reward given the actions taken by the other players is called a best answer.
  • A Nash equilibrium is a state where every player plays their best answer.
  • In a Nash equilibrium, no player wants to change their action unless another player does. In that sense, Nash equilibria are optimal states. 
  • Some games have multiple Nash equilibria and some games have none.

If you were saddened by the fact that there is no Nash equilibrium in some games, don’t despair! In the next chapter, we will introduce probabilities of actions and this will allow us to find more equilibria. Stay tuned!

References

The topics introduced here are typically covered in standard textbooks on game theory. I mainly used this one, which is written in German though:

  • Bartholomae, F., & Wiens, M. (2016). Spieltheorie. Ein anwendungsorientiertes Lehrbuch. Wiesbaden: Springer Fachmedien Wiesbaden.

An alternative in English language could be this one:

  • Espinola-Arredondo, A., & Muñoz-Garcia, F. (2023). Game Theory: An Introduction with Step-by-step Examples. Springer Nature.

Game theory is a rather young field of research, with the first main textbook being this one:

  • Von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior.

Like this article? Follow me to be notified of my future posts.

The post I Won’t Change Unless You Do appeared first on Towards Data Science.

]]>
Debugging the Dreaded NaN https://towardsdatascience.com/debugging-the-dreaded-nan/ Thu, 27 Feb 2025 21:52:06 +0000 https://towardsdatascience.com/?p=598513 Capturing and reproducing failures in PyTorch training with Lightning

The post Debugging the Dreaded NaN appeared first on Towards Data Science.

]]>
You are training your latest AI model, anxiously watching as the loss steadily decreases when suddenly — boom! Your logs are flooded with NaNs (Not a Number) — your model is irreparably corrupted and you’re left staring at your screen in despair. To make matters worse, the NaNs don’t appear consistently. Sometimes your model trains just fine; other times, it fails inexplicably. Sometimes it will crash immediately, sometimes after many days of training.

NaNs in Deep Learning workloads are amongst the most frustrating issues to encounter. And because they often appear sporadically — triggered by a specific combination of model state, input data, and stochastic factors — they can be incredibly difficult to reproduce and debug.

Given the considerable cost of training AI models and the potential waste caused by NaN failures, it is recommended to have dedicated tools for capturing and analyzing NaN occurrences. In a previous post, we discussed the challenge of debugging NaNs in a TensorFlow training workload. We proposed an efficient scheme for capturing and reproducing NaNs and shared a sample TensorFlow implementation. In this post, we adopt and demonstrate a similar mechanism for debugging NaNs in PyTorch workloads. The general scheme is as follows:

On each training step:

  1. Save a copy of the training input batch.
  2. Check the gradients for NaN values. If any appear, save a checkpoint with the current model weights before the model is corrupted. Also, save the input batch and, if necessary, the stochastic state. Discontinue the training job.
  3. Reproduce and debug the NaN occurrence by loading the saved experiment state.

Although this scheme can be easily implemented in native PyTorch, we will take the opportunity to demonstrate some of the conveniences of PyTorch Lightning — a powerful open-source framework designed to streamline the development of machine learning (ML) models. Built on PyTorch, Lightning abstracts away many of the boiler-plate components of an ML experiment, such as training loops, data distribution, logging, and more, enabling developers to focus on the core logic of their models.

To implement our NaN capturing scheme, we will use Lightning’s callback interface — a dedicated structure that enables inserting custom logic at specific points during the flow of execution.

Importantly, please do not view our choice of Lightning or any other tool or technique that we mention as an endorsement of its use. The code that we will share is intended for demonstrative purposes — please do not rely on its correctness or optimality.

Many thanks to Rom Maltser for his contributions to this post.

NaNCapture Callback

To implement our NaN capturing solution, we create a NaNCapture Lightning callback. The constructor receives a directory path for storing/loading checkpoints and sets up the NaNCapture state. We also define utilities for checking for NaNs, storing checkpoints, and halting the training job.

import os
import torch
from copy import deepcopy
import lightning.pytorch as pl

class NaNCapture(pl.Callback):

    def __init__(self, dirpath: str):
        # path to checkpoint
        self.dirpath = dirpath
        
        # update to True when Nan is identified
        self.nan_captured = False
        
        # stores a copy of the last batch
        self.last_batch = None
        self.batch_idx = None

    @staticmethod
    def contains_nan(tensor):
        return torch.isnan(tensor).any().item()
        # alternatively check for finite
        # return not torch.isfinite(tensor).item()

    @staticmethod
    def halt_training(trainer):
        trainer.should_stop = True
        # communicate stop command to all other ranks
        trainer.strategy.reduce_boolean_decision(trainer.should_stop,
                                                 all=False)

    def save_ckpt(self, trainer):
        os.makedirs(self.dirpath, exist_ok=True)
        # include trainer.global_rank to avoid conflict
        filename = f"nan_checkpoint_rank_{trainer.global_rank}.ckpt"
        full_path = os.path.join(self.dirpath, filename)
        print(f"saving ckpt to {full_path}")
        trainer.save_checkpoint(full_path, False)

Callback Function: on_train_batch_start

We begin by implementing the on_train_batch_start hook to store a copy of each input batch. In case of a NaN event, this batch will be stored in the checkpoint.
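A minimal version of this hook might look as follows (the complete version, which also captures the random state, appears later in this post):

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        if not self.nan_captured:
            # keep a copy of the current batch in case a NaN appears in this step
            self.last_batch = deepcopy(batch)
            self.batch_idx = batch_idx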

Callback Function: on_before_optimizer_step

Next we implement the on_before_optimizer_step hook. Here, we check for NaN entries in all of the gradient tensors. If found, we store a checkpoint with the uncorrupted model weights and halt the training.

Python">    def on_before_optimizer_step(self, trainer, pl_module, optimizer):
        if not self.nan_captured:
            # Check if gradients contain NaN
            grads = [p.grad.view(-1) for p in pl_module.parameters()
                     if p.grad is not None]
            all_grads = torch.cat(grads)
            if self.contains_nan(all_grads):
                print("nan found")
                self.save_ckpt(trainer)
                self.halt_training(trainer)

Capturing the Training State

To enable reproducibility, we include the NaNCapture state in the checkpoint by appending it to the training state dictionary. Lightning provides dedicated utilities for saving and loading a callback state:

    def state_dict(self):
        d = {"nan_captured": self.nan_captured}
        if self.nan_captured:
            d["last_batch"] = self.last_batch
        return d


    def load_state_dict(self, state_dict):
        self.nan_captured = state_dict.get("nan_captured", False)
        if self.nan_captured:
            self.last_batch = state_dict["last_batch"]

Reproducing the NaN Occurrence

We have described how our NaNCapture callback can be used to store the training state that resulted in a NaN, but how do we reload this state in order to reproduce the issue and debug it? To accomplish this, we leverage Lightning’s dedicated data loading class, LightningDataModule.

DataModule Function: on_before_batch_transfer

In the code block below, we extend the LightningDataModule class to allow injecting a fixed training input batch. This is achieved by overriding the on_before_batch_transfer hook, as shown below:

from lightning.pytorch import LightningDataModule

class InjectableDataModule(LightningDataModule):

    def __init__(self):
        super().__init__()
        self.cached_batch = None

    def set_custom_batch(self, batch):
        self.cached_batch = batch

    def on_before_batch_transfer(self, batch, dataloader_idx):
        if self.cached_batch:
            return self.cached_batch
        return batch

Callback Function: on_train_start

The final step is modifying the on_train_start hook of our NaNCapture callback to inject the stored training batch into the LightningDataModule.

    def on_train_start(self, trainer, pl_module):
        if self.nan_captured:
            datamodule = trainer.datamodule
            datamodule.set_custom_batch(self.last_batch)

In the next section we will demonstrate the end-to-end solution using a toy example.

Toy Example

To test our new callback, we create a resnet50-based image classification model with a loss function deliberately designed to trigger NaN occurrences.

Instead of using the standard CrossEntropy loss, we compute binary_cross_entropy_with_logits for each class independently and divide the result by the number of samples belonging to that class. Inevitably, we will encounter a batch in which one or more classes are missing, leading to a divide-by-zero operation, resulting in NaN values and corrupting the model.

The implementation below follows Lightning’s introductory tutorial.

import lightning.pytorch as pl
import torch
import torchvision
import torch.nn.functional as F

num_classes = 20


# define a lightning module
class ResnetModel(pl.LightningModule):
    def __init__(self):
        """Initializes a new instance of the MNISTModel class."""
        super().__init__()
        self.model = torchvision.models.resnet50(num_classes=num_classes)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_nb):
        x, y = batch
        outputs = self(x)
        # uncomment for default loss
        # return F.cross_entropy(outputs, y)
        
        # calculate binary_cross_entropy for each class individually
        losses = []
        for c in range(num_classes):
            count = torch.count_nonzero(y==c)
            masked = torch.where(y==c, 1., 0.)
            loss = F.binary_cross_entropy_with_logits(
                outputs[..., c],
                masked,
                reduction='sum'
            )
            mean_loss = loss/count # could result in NaN
            losses.append(mean_loss)
        total_loss = torch.stack(losses).mean()
        return total_loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

We define a synthetic dataset and encapsulate it in our InjectableDataModule class:

import os
import random
from torch.utils.data import Dataset, DataLoader

batch_size = 128
num_steps = 800

# A dataset with random images and labels
class FakeDataset(Dataset):
    def __len__(self):
        return batch_size*num_steps

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(random.randint(0, num_classes-1),
                             dtype=torch.int64)
        return rand_image, label



# define a lightning datamodule
class FakeDataModule(InjectableDataModule):

    def train_dataloader(self):
        dataset = FakeDataset()
        return DataLoader(
            dataset,
            batch_size=batch_size,
            num_workers=os.cpu_count(),
            pin_memory=True
        )

Finally, we initialize a Lightning Trainer with our NaNCapture callback and call trainer.fit with our Lightning module and Lightning DataModule.

import time

if __name__ == "__main__":

    # Initialize a lightning module
    lit_module = ResnetModel()

    # Initialize a DataModule
    mnist_data = FakeDataModule()

    # Train the model
    ckpt_dir = "./ckpt_dir"
    trainer = pl.Trainer(
        max_epochs=1,
        callbacks=[NaNCapture(ckpt_dir)]
    )

    ckpt_path = None
    
    # check if nan ckpt exists
    if os.path.isdir(ckpt_dir):
        dir_contents = [os.path.join(ckpt_dir, f)
                        for f in os.listdir(ckpt_dir)]
        ckpts = [f for f in dir_contents
                 if os.path.isfile(f) and f.endswith('.ckpt')]
        if ckpts:
            ckpt_path = ckpts[0]

    t0 = time.perf_counter()
    trainer.fit(lit_module, mnist_data, ckpt_path=ckpt_path)
    print(f"total runtime: {time.perf_counter() - t0}")

After a number of training steps, a NaN event will occur. At this point a checkpoint is saved with the full training state and the training is halted.

When the script is run again the exact state that caused the NaN will be reloaded allowing us to easily reproduce the issue and debug its root cause.

Performance Overhead

To assess the impact of our NaNCapture callback on runtime performance, we modified our experiment to use CrossEntropyLoss (to avoid NaNs) and measured the average throughput when running with and without NaNCapture callback. The experiments were conducted on an NVIDIA L40S GPU, with a PyTorch 2.5.1 Docker image.

Overhead of NaNCapture Callback (by Author)

For our toy model, the NaNCapture callback adds a minimal 1.5% overhead to the runtime performance — a small price to pay for the valuable debugging capabilities it provides.

Naturally, the actual overhead will depend on the specifics of the model and runtime environment.

How to Handle Stochasticity

The solution we have described thus far will succeed in reproducing the training state, provided that the model does not include any randomness. However, introducing stochasticity into the model definition is often critical for convergence. A common example of a stochastic layer is torch.nn.Dropout.

You may find that your NaN event depends on the precise state of randomness when the failure occurred. Consequently, we would like to enhance our NaNCapture callback to capture and restore the random state at the point of failure. The random state is determined by a number of libraries. In the code block below, we attempt to capture the full state of randomness:

import os
import torch
import random
import numpy as np
from copy import deepcopy
import lightning.pytorch as pl

class NaNCapture(pl.Callback):

    def __init__(self, dirpath: str):
        # path to checkpoint
        self.dirpath = dirpath
        
        # update to True when Nan is identified
        self.nan_captured = False
        
        # stores a copy of the last batch
        self.last_batch = None
        self.batch_idx = None

        # rng state
        self.rng_state = {
            "torch": None,
            "torch_cuda": None,
            "numpy": None,
            "random": None
        }

    @staticmethod
    def contains_nan(tensor):
        return torch.isnan(tensor).any().item()
        # alternatively check for finite
        # return not torch.isfinite(tensor).item()

    @staticmethod
    def halt_training(trainer):
        trainer.should_stop = True
        trainer.strategy.reduce_boolean_decision(trainer.should_stop,
                                                 all=False)

    def save_ckpt(self, trainer):
        os.makedirs(self.dirpath, exist_ok=True)
        # include trainer.global_rank to avoid conflict
        filename = f"nan_checkpoint_rank_{trainer.global_rank}.ckpt"
        full_path = os.path.join(self.dirpath, filename)
        print(f"saving ckpt to {full_path}")
        trainer.save_checkpoint(full_path, False)

    def on_train_start(self, trainer, pl_module):
        if self.nan_captured:
            # inject batch
            datamodule = trainer.datamodule
            datamodule.set_custom_batch(self.last_batch)

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        if self.nan_captured:
            # restore random state
            torch.random.set_rng_state(self.rng_state["torch"])
            torch.cuda.set_rng_state_all(self.rng_state["torch_cuda"])
            np.random.set_state(self.rng_state["numpy"])
            random.setstate(self.rng_state["random"])
        else:
            # capture current batch
            self.last_batch = deepcopy(batch)
            self.batch_idx = batch_idx
    
            # capture current random state
            self.rng_state["torch"] = torch.random.get_rng_state()
            self.rng_state["torch_cuda"] = torch.cuda.get_rng_state_all()
            self.rng_state["numpy"] = np.random.get_state()
            self.rng_state["random"] = random.getstate()
    
    def on_before_optimizer_step(self, trainer, pl_module, optimizer):
        if not self.nan_captured:
            # Check if gradients contain NaN
            grads = [p.grad.view(-1) for p in pl_module.parameters()
                     if p.grad is not None]
            all_grads = torch.cat(grads)
            if self.contains_nan(all_grads):
                print("nan found")
                self.save_ckpt(trainer)
                self.halt_training(trainer)

    def state_dict(self):
        d = {"nan_captured": self.nan_captured}
        if self.nan_captured:
            d["last_batch"] = self.last_batch
            d["rng_state"] = self.rng_state
        return d

    def load_state_dict(self, state_dict):
        self.nan_captured = state_dict.get("nan_captured", False)
        if self.nan_captured:
            self.last_batch = state_dict["last_batch"]
            self.rng_state = state_dict["rng_state"]

Importantly, setting the random state may not guarantee full reproducibility. The GPU owes its power to its massive parallelism. In some GPU operations, multiple threads may read or write concurrently to the same memory locations, resulting in nondeterminism. PyTorch allows for some control over this via its use_deterministic_algorithms setting, but this may impact the runtime performance. Additionally, there is a possibility that the NaN event will not be reproduced once this configuration setting is changed. Please see the PyTorch documentation on reproducibility for more details.

Summary

Encountering NaN failures is one of the most discouraging events that can happen in machine learning development. These errors not only waste valuable computation and development resources, but often indicate fundamental issues in the model architecture or experiment design. Due to their sporadic, sometimes elusive nature, debugging NaN failures can be a nightmare.

This post introduced a proactive approach for capturing and reproducing NaN errors using a dedicated Lightning callback. The solution we shared is a proposal which can be modified and extended for your specific use case.

While this solution may not address every possible NaN scenario, it significantly reduces debugging time when applicable, potentially saving developers countless hours of frustration and wasted effort.

The post Debugging the Dreaded NaN appeared first on Towards Data Science.

]]>
The Dangers of Deceptive Data–Confusing Charts and Misleading Headlines https://towardsdatascience.com/the-dangers-of-deceptive-data-confusing-charts-and-misleading-headlines/ Thu, 27 Feb 2025 02:15:25 +0000 https://towardsdatascience.com/?p=598469 A deep dive into the ways data can be used to misinform the masses

The post The Dangers of Deceptive Data–Confusing Charts and Misleading Headlines appeared first on Towards Data Science.

]]>
“You don’t have to be an expert to deceive someone, though you might need some expertise to reliably recognize when you are being deceived.”

When my co-instructor and I start our quarterly lesson on deceptive visualizations for the data visualization course we teach at the University of Washington, he emphasizes the point above to our students. With the advent of modern technology, developing pretty and convincing claims about data is easier than ever. Anyone can make something that seems passable, but contains oversights that render it inaccurate and even harmful. Furthermore, there are also malicious actors who actively want to deceive you, and who have studied some of the best ways to do it.

I often start this lecture with a bit of a quip, looking seriously at my students and asking two questions:

  1. “Is it a good thing if someone is gaslighting you?”
  2. After the general murmur of confusion followed by agreement that gaslighting is indeed bad, I ask the second question: “What’s the best way to ensure no one ever gaslights you?”

The students generally ponder that second question for a bit longer, before chuckling a bit and realizing the answer: It’s to learn how people gaslight in the first place. Not so you can take advantage of others, but so you can prevent others from taking advantage of you.

The same applies in the realm of misinformation and disinformation. People who want to mislead with data are empowered with a host of tools, from high-speed internet to social media to, most recently, generative AI and large language models. To protect yourself from being misled, you need to learn their tricks.

In this article, I’ve taken the key ideas from my data visualization course’s unit on deception–drawn from Alberto Cairo’s excellent book How Charts Lie–and broadened them into some general principles about deception and data. My hope is that you read it, internalize it, and take it with you to arm yourself against the onslaught of lies perpetuated by ill-intentioned people powered with data.

Humans Cannot Interpret Area

At least, not as well as we interpret other visual cues. Let’s illustrate this with an example. Say we have an extremely simple numerical data set; it’s one dimensional and consists of just two values: 50 and 100. One way to represent this visually is via the length of bars, as follows:

This is true to the underlying data. Length is a one-dimensional quantity, and we have doubled it in order to indicate a doubling of value. But what happens if we want to represent the same data with circles? Well, circles aren’t really defined by a length or width. One option is to double the radius:

Hmm. The first circle has a radius of 100 pixels, and the second has a radius of 50 pixels–so this is technically correct if we wanted to double the radius. However, because of the way that area is calculated (πr²), we’ve way more than doubled the area. So what if we tried just doing that, since it seems more visually accurate? Here is a revised version:

Now we have a different problem. The larger circle is mathematically twice the area of the smaller one, but it no longer looks that way. In other words, even though it is a visually accurate comparison of a doubled quantity, human eyes have difficulty perceiving it.

The issue here is trying to use area as a visual marker in the first place. It’s not necessarily wrong, but it is confusing. We’re increasing a one-dimensional value, but area is a two-dimensional quantity. To the human eye, it’s always going to be difficult to interpret accurately, especially when compared with a more natural visual representation like bars.

Now, this may seem like it’s not a huge deal–but let’s take a look at what happens when you extend this to an actual data set. Below, I’ve pasted two images of charts I made in Altair (a Python-based visualization package). Each chart shows the maximum temperature (in Celsius) during the first week of 2012 in Seattle, USA. The first one uses bar lengths to make the comparison, and the second uses circle areas.
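The exact plotting code is not shown here, but a sketch along these lines (assuming the seattle_weather sample bundled with vega_datasets, which covers daily Seattle weather starting in 2012) would reproduce the comparison:

import altair as alt
from vega_datasets import data

weather = data.seattle_weather()
first_week = weather[weather["date"] < "2012-01-08"]

# Encode temperature as bar length: precise comparisons are easy.
bars = alt.Chart(first_week).mark_bar().encode(x="date:T", y="temp_max:Q")

# Encode the same values as circle area: differences become much harder to judge.
circles = alt.Chart(first_week).mark_circle().encode(x="date:T", size="temp_max:Q")

bars & circles  # stack the two charts vertically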

Which one makes it easier to see the differences? The legend helps in the second one, but if we’re being honest, it’s a lost cause. It is much easier to make precise comparisons with the bars, even in a setting where we have such limited data.

Remember that the point of a visualization is to clarify data–to make hidden trends easier to see for the average person. To achieve this goal, it’s best to use visual cues that simplify the process of making that distinction.

Beware Political Headlines (In Any Direction)

There is a small trick question I sometimes ask my students on a homework assignment around the fourth week of class. The assignment mostly involves generating visualizations in Python–but for the last question, I give them a chart I myself generated accompanied by a single question:

Question: There is one thing egregiously wrong with the chart above, an unforgivable error in Data Visualization. What is it?

Most think it has something to do with the axes, marks, or some other visual aspect, often suggesting improvements like filling in the circles or making the axis labels more informative. Those are fine suggestions, but not the most pressing.

The most flawed trait (or lack thereof, rather) in the chart above is the missing title. A title is crucial to an effective data visualization. Without it, how are we supposed to know what this visualization is even about? As of now, we can only ascertain that it must vaguely have something to do with carbon dioxide levels across a span of years. That isn’t much.

Many folks, feeling this requirement is too stringent, argue that a visualization is often meant to be understood in context, as part of a larger article or press release or other accompanying piece of text. Unfortunately, this line of thinking is far too idealistic; in reality, a visualization must stand alone, because it will often be the only thing people look at–and in social media blow-up cases, the only thing that gets shared widely. As a result, it should have a title to explain itself.

Of course, the title of this very subsection tells you to be wary of such headlines. That is true. While they are necessary, they are a double-edged sword. Since visualization designers know viewers will pay attention to the title, ill-meaning ones can also use it to sway people in less-than-accurate directions. Let’s look at an example:

The above is a picture shared by the White House’s public Twitter account in 2017. The picture is also referenced by Alberto Cairo in his book, which emphasizes many of the points I will now make.

First things first. The word “chain migration,” referring to what is formally known as family-based migration (where an immigrant may sponsor family members to come to the United States), has been criticized by many who argue that it is needlessly aggressive and makes legal immigrants sound threatening for no reason.

Of course, politics is by its very nature divisive, and it is possible for any side to make a heated argument. The primary issue here is actually a data-related one–specifically, what the use of the word “chain” implies in the context of the chart shared with the tweet. “Chain” migration seems to indicate that people can immigrate one after the other, in a seemingly endless stream, uninhibited and unperturbed by the distance of family relations. The reality, of course, is that a single immigrant can mostly just sponsor immediate family members, and even that takes quite a bit of time. But when one reads the phrase “chain migration” and then immediately looks at a seemingly sensible chart depicting it, it is easy to believe that an individual can in fact spawn additional immigrants at a base-3 exponential growth rate.

That is the issue with any kind of political headline–it makes it far too easy to conceal dishonest, inaccurate workings with actual data processing, analysis, and visualization.

There is no data underlying the chart above. None. Zero. It is completely random, and that is not okay for a chart that is purposefully made to appear as if it is showing something meaningful and quantitative.

As a fun little rabbit hole to go down which highlights the dangers of political headlining within data, here is a link to FloorCharts, a Twitter account that posts the most absurd graphics shown on the U.S. Congress floor.

Don’t Use 3D. Please.

I’ll end this article on a slightly lighter topic–but still an important one. Under no circumstances–none at all–should you ever utilize a 3D chart. And if you’re in the shoes of the viewer–that is, if you’re looking at a 3D pie chart made by someone else–don’t trust it.

The reason for this is simple, and connects back to what I discussed with circles and rectangles: a third dimension severely distorts the actuality behind what are usually one-dimensional measures. Area was already hard to interpret–how well do you really think the human eye does with volume?

Here is a 3D pie chart I generated with random numbers:

Now, here is the exact same pie chart, but in two dimensions:

Notice how the blue is not quite as dominant as the 3D version seems to suggest, and that the red and orange are closer to one another in size than originally portrayed. I also removed the percentage labels intentionally (technically bad practice) in order to emphasize how even with the labels present in the first one, our eyes automatically pay more attention to the more drastic visual differences. If you’re reading this article with an analytical eye, perhaps you think it doesn’t make that much of a difference. But the fact is, you’ll often see such charts in the news or on social media, and a quick glance is all they’ll ever get.

It is important to ensure that the story told by that quick glance is a truthful one.

Final Thoughts

Data science is often touted as the perfect synthesis of Statistics, computing, and society, a way to obtain and share deep and meaningful insights about an information-heavy world. This is true–but as the capacity to widely share such insights expands, so must our general ability to interpret them accurately. It is my hope that in light of that, you have found this primer to be helpful.

Stay tuned for Part 2, in which I’ll talk about a few deceptive techniques a bit more involved in nature–including base proportions, (un)trustworthy statistical measures, and measures of correlation.

In the meantime, try not to get deceived.

The post The Dangers of Deceptive Data–Confusing Charts and Misleading Headlines appeared first on Towards Data Science.

]]>
Efficient Data Handling in Python with Arrow https://towardsdatascience.com/efficient-data-handling-in-python-with-arrow/ Tue, 25 Feb 2025 20:56:16 +0000 https://towardsdatascience.com/?p=598426 Introducing Arrow to those who are still unaware of its power

The post Efficient Data Handling in Python with Arrow appeared first on Towards Data Science.

]]>
1. Introduction

We’re all used to working with CSVs, JSON files… With the traditional libraries, and for large datasets, these can be extremely slow to read, write and operate on, leading to performance bottlenecks (been there). It’s precisely with big amounts of data that efficient data handling is crucial for our data science/analytics workflow, and this is exactly where Apache Arrow comes into play. 

Why? The main reason resides in how the data is stored in memory. While JSON and CSVs, for example, are text-based formats, Arrow is a columnar in-memory data format (and that allows for fast data interchange between different data processing tools). Arrow is therefore designed to optimize performance by enabling zero-copy reads, reducing memory usage, and supporting efficient compression. 

Moreover, Apache Arrow is open-source and optimized for analytics. It is designed to accelerate big data processing while maintaining interoperability with various data tools, such as Pandas, Spark, and Dask. By storing data in a columnar format, Arrow enables faster read/write operations and efficient memory usage, making it ideal for analytical workloads.

Sounds great right? What’s best is that this is all the introduction to Arrow I’ll provide. Enough theory, we want to see it in action. So, in this post, we’ll explore how to use Arrow in Python and how to make the most out of it.

2. Arrow in Python

To get started, you need to install the necessary libraries: pandas and pyarrow.

pip install pyarrow pandas

Then, as always, import them in your Python script:

import pyarrow as pa
import pandas as pd

Nothing new yet, just necessary steps to do what follows. Let’s start by performing some simple operations.

2.1. Creating and Storing a Table

The simplest we can do is hardcode our table’s data. Let’s create a two-column table with football data:

teams = pa.array(['Barcelona', 'Real Madrid', 'Rayo Vallecano', 'Athletic Club', 'Real Betis'], type=pa.string())
goals = pa.array([30, 23, 9, 24, 12], type=pa.int8())

team_goals_table = pa.table([teams, goals], names=['Team', 'Goals'])

The format is pyarrow.table, but we can easily convert it to pandas if we want:

df = team_goals_table.to_pandas()

And restore it back to arrow using:

team_goals_table = pa.Table.from_pandas(df)

And we’ll finally store the table in a file. We could use different formats, like feather, parquet… I’ll use this last one because it’s fast and memory-optimized:

import pyarrow.parquet as pq
pq.write_table(team_goals_table, 'data.parquet')

Reading a parquet file would just consist of using pq.read_table('data.parquet').

2.2. Compute Functions

Arrow has its own compute module for the usual operations. Let’s start by comparing two arrays element-wise:

import pyarrow.compute as pc
>>> a = pa.array([1, 2, 3, 4, 5, 6])
>>> b = pa.array([2, 2, 4, 4, 6, 6])
>>> pc.equal(a,b)
[
  false,
  true,
  false,
  true,
  false,
  true
]

That was easy, we could sum all elements in an array with:

>>> pc.sum(a)
<pyarrow.Int64Scalar: 21>

And from this we could easily guess how we can compute a count, a floor, an exp, a mean, a max, a multiplication… No need to go over them, then. So let’s move to tabular operations.

We’ll start by showing how to sort it:

>>> table = pa.table({'i': ['a','b','a'], 'x': [1,2,3], 'y': [4,5,6]})
>>> pc.sort_indices(table, sort_keys=[('y', 'descending')])
<pyarrow.lib.UInt64Array object at 0x1291643a0>
[
  2,
  1,
  0
]

Just like in pandas, we can group values and aggregate the data. Let’s, for example, group by “i” and compute the sum on “x” and the mean on “y”:

>>> table.group_by('i').aggregate([('x', 'sum'), ('y', 'mean')])
pyarrow.Table
i: string
x_sum: int64
y_mean: double
----
i: [["a","b"]]
x_sum: [[4,2]]
y_mean: [[5,5]]

Or we can join two tables:

>>> t1 = pa.table({'i': ['a','b','c'], 'x': [1,2,3]})
>>> t2 = pa.table({'i': ['a','b','c'], 'y': [4,5,6]})
>>> t1.join(t2, keys="i")
pyarrow.Table
i: string
x: int64
y: int64
----
i: [["a","b","c"]]
x: [[1,2,3]]
y: [[4,5,6]]

By default, this is a left outer join, but we can change it through the join_type parameter.
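For instance, here is a small sketch with an inner join (t3 is made up for the example):

t3 = pa.table({'i': ['a', 'b', 'd'], 'z': [7, 8, 9]})
# Inner join: only keys present in both tables ('a' and 'b') survive
t1.join(t3, keys='i', join_type='inner')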

There are many more useful operations, but let’s see just one more to avoid making this too long: appending a new column to a table.

>>> t1.append_column("z", pa.array([22, 44, 99]))
pyarrow.Table
i: string
x: int64
z: int64
----
i: [["a","b","c"]]
x: [[1,2,3]]
z: [[22,44,99]]

Before ending this section, let’s see how to filter a table or array:

>>> t1.filter((pc.field('x') > 0) & (pc.field('x') < 3))
pyarrow.Table
i: string
x: int64
----
i: [["a","b"]]
x: [[1,2]]
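Expressions compose nicely, so we could also keep only the rows whose key falls in a given set (a small sketch):

# Keeps the rows with i == 'a' and i == 'c'
t1.filter(pc.field('i').isin(['a', 'c']))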

Easy, right? Especially if you’ve been using pandas and numpy for years!

3. Working with files

We’ve already seen how we can read and write Parquet files. But let’s check some other popular file types so that we have several options available.

3.1. Apache ORC

Informally, Apache ORC can be understood as the equivalent of Arrow in the realm of file formats (even though its origins have nothing to do with Arrow). More precisely, it’s an open-source, columnar storage format.

Reading and writing it is as follows:

from pyarrow import orc
# Write table
orc.write_table(t1, 't1.orc')
# Read table
t1 = orc.read_table('t1.orc')

As a side note, we could decide to compress the file while writing by using the “compression” parameter.
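For example, a sketch that assumes your PyArrow build includes the chosen codec:

orc.write_table(t1, 't1_zstd.orc', compression='zstd')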

3.2. CSV

No secret here, pyarrow has the CSV module:

from pyarrow import csv
# Write CSV
csv.write_csv(t1, "t1.csv")
# Read CSV
t1 = csv.read_csv("t1.csv")

# Write CSV compressed and without header
options = csv.WriteOptions(include_header=False)
with pa.CompressedOutputStream("t1.csv.gz", "gzip") as out:
    csv.write_csv(t1, out, options)

# Read the compressed CSV back, supplying the column names ourselves
# (no header was written, so there is nothing to skip)
t1 = csv.read_csv("t1.csv.gz", read_options=csv.ReadOptions(
    column_names=["i", "x"]
))

3.3. JSON

PyArrow allows JSON reading but not writing. It’s pretty straightforward; let’s see an example, supposing we have our JSON data in “data.json”:

from pyarrow import json
# Read json
fn = "data.json"
table = json.read_json(fn)

# We can now convert it to pandas if we want to
df = table.to_pandas()
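If we don’t want to rely on type inference, we can also pass an explicit schema through the parse options (a sketch; the field names here are assumptions about what “data.json” contains):

schema = pa.schema([('team', pa.string()), ('goals', pa.int32())])
table = json.read_json(fn, parse_options=json.ParseOptions(explicit_schema=schema))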

3.4. Feather

Feather is a portable file format for storing Arrow tables or data frames (from languages like Python or R) that uses the Arrow IPC format internally. So, unlike Apache ORC, this one was indeed created early in the Arrow project.

from pyarrow import feather
# Write feather from pandas DF
feather.write_feather(df, "t1.feather")
# Write feather from table, and compressed
feather.write_feather(t1, "t1.feather.lz4", compression="lz4")

# Read feather into table
t1 = feather.read_table("t1.feather")
# Read feather into df
df = feather.read_feather("t1.feather")
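As with Parquet, we can also read back only a subset of columns (a minimal sketch):

subset = feather.read_table("t1.feather", columns=["i"])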

4. Advanced Features

We’ve just touched upon the most basic features, which is what most people will need while working with Arrow. However, its capabilities don’t end here; this is just where they start.

As these features are quite domain-specific and not useful for everyone (nor exactly introductory), I’ll just mention some of them briefly:

  • We can handle memory management through the Buffer type (built on top of the C++ Buffer object). Creating a buffer from our data does not allocate any memory; it is a zero-copy view on the memory exported from the data bytes object. On top of this, an instance of MemoryPool tracks all allocations and deallocations (like malloc and free in C), which lets us monitor how much memory is being allocated.
  • Similarly, there are different ways to work with input/output streams in batches.
  • PyArrow comes with an abstract filesystem interface, as well as concrete implementations for various storage types. For example, we can write and read Parquet files to and from an S3 bucket using the S3FileSystem. Google Cloud Storage and the Hadoop Distributed File System (HDFS) are also supported. A short sketch of this point and of the Buffer/MemoryPool one above follows this list.
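Here is that sketch; the bucket name and region are made up, and the S3 part assumes properly configured AWS credentials:

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# Zero-copy view over existing bytes: no extra allocation for the data itself
data = b'abcdefgh'
buf = pa.py_buffer(data)
print(buf.size)                    # 8
print(pa.total_allocated_bytes())  # allocations tracked by the default MemoryPool

# Write a Parquet file directly to S3 (hypothetical bucket/path)
s3 = fs.S3FileSystem(region='eu-west-1')
pq.write_table(team_goals_table, 'my-bucket/football/data.parquet', filesystem=s3)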

5. Conclusion and Key Takeaways

Apache Arrow is a powerful tool for efficient Data Handling in Python. Its columnar storage format, zero-copy reads, and interoperability with popular data processing libraries make it ideal for data science workflows. By integrating Arrow into your pipeline, you can significantly boost performance and optimize memory usage.

The post Efficient Data Handling in Python with Arrow appeared first on Towards Data Science.
