Data Analysis | Towards Data Science
https://towardsdatascience.com/tag/data-analysis/
The world’s leading publication for data science, AI, and ML professionals.

One-Tailed Vs. Two-Tailed Tests
https://towardsdatascience.com/one-tailed-vs-two-tailed-tests/
Thu, 06 Mar 2025 04:22:42 +0000
Choosing between one- and two-tailed hypotheses affects every stage of A/B testing. Learn why the hypothesis direction matters and explore the pros and cons of each approach.

Introduction

If you’ve ever analyzed data using built-in t-test functions, such as those in R or SciPy, here’s a question for you: have you ever adjusted the default setting for the alternative hypothesis? If your answer is no—or if you’re not even sure what this means—then this blog post is for you!

The alternative hypothesis parameter, commonly referred to as “one-tailed” versus “two-tailed” in statistics, defines the expected direction of the difference between control and treatment groups. In a two-tailed test, we assess whether there is any difference in mean values between the groups, without specifying a direction. A one-tailed test, on the other hand, posits a specific direction—whether the control group’s mean is either less than or greater than that of the treatment group.
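
For instance, in SciPy this choice is controlled by the alternative argument of scipy.stats.ttest_ind (available since SciPy 1.6). The minimal sketch below uses simulated data purely for illustration; the group sizes and effect are assumptions, not values from any real experiment:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=500)    # simulated control group
treatment = rng.normal(loc=10.3, scale=2.0, size=500)  # simulated treatment group

# Default: two-tailed test (detects a difference in either direction)
t_two, p_two = stats.ttest_ind(treatment, control, alternative="two-sided")

# One-tailed test: treatment mean hypothesized to be greater than control mean
t_one, p_one = stats.ttest_ind(treatment, control, alternative="greater")

print(f"two-sided p = {p_two:.4f}, one-sided p = {p_one:.4f}")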

Choosing between one- and two-tailed hypotheses might seem like a minor detail, but it affects every stage of A/B testing: from test planning to Data Analysis and results interpretation. This article builds a theoretical foundation on why the hypothesis direction matters and explores the pros and cons of each approach.

One-tailed vs. two-tailed hypothesis testing: Understanding the difference

To understand the importance of choosing between one-tailed and two-tailed hypotheses, let’s briefly review the basics of the t-test, the commonly used method in A/B testing. Like other Hypothesis Testing methods, the t-test begins with a conservative assumption: there is no difference between the two groups (the null hypothesis). Only if we find strong evidence against this assumption can we reject the null hypothesis and conclude that the treatment has had an effect.

But what qualifies as “strong evidence”? To that end, a rejection region is determined under the null hypothesis and all results that fall within this region are deemed so unlikely that we take them as evidence against the feasibility of the null hypothesis. The size of this rejection region is based on a predetermined probability, known as alpha (α), which represents the likelihood of incorrectly rejecting the null hypothesis. 

What does this have to do with the direction of the alternative hypothesis? Quite a bit, actually. While the alpha level determines the size of the rejection region, the alternative hypothesis dictates its placement. In a one-tailed test, where we hypothesize a specific direction of difference, the rejection region is situated in only one tail of the distribution. For a hypothesized positive effect (e.g., that the treatment group mean is higher than the control group mean), the rejection region would lie in the right tail, creating a right-tailed test. Conversely, if we hypothesize a negative effect (e.g., that the treatment group mean is less than the control group mean), the rejection region would be placed in the left tail, resulting in a left-tailed test.

In contrast, a two-tailed test allows for the detection of a difference in either direction, so the rejection region is split between both tails of the distribution. This accommodates the possibility of observing extreme values in either direction, whether the effect is positive or negative.
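
As a quick numerical illustration of where these cutoffs sit, the critical values bounding the rejection regions can be computed directly. This is a minimal sketch using the normal approximation (rather than the exact t distribution) at alpha = 0.05:

from scipy import stats

alpha = 0.05

# Two-tailed: alpha is split between both tails, so each tail gets alpha/2
z_two_tailed = stats.norm.ppf(1 - alpha / 2)   # ~1.96; reject beyond +/-1.96

# Right-tailed: all of alpha sits in the right tail
z_right_tailed = stats.norm.ppf(1 - alpha)     # ~1.645; reject above 1.645

print(f"two-tailed cutoffs: +/-{z_two_tailed:.3f}, right-tailed cutoff: {z_right_tailed:.3f}")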

To build intuition, let’s visualize how the rejection regions appear under the different hypotheses. Recall that according to the null hypothesis, the difference between the two groups should center around zero. Thanks to the central limit theorem, we also know this distribution approximates a normal distribution. Consequently, the rejection areas corresponding to the different alternative hypotheses look as follows:

Why does it make a difference?

The choice of direction for the alternative hypothesis impacts the entire A/B testing process, starting with the planning phase—specifically, in determining the sample size. Sample size is calculated based on the desired power of the test, which is the probability of detecting a true difference between the two groups when one exists. To compute power, we examine the area under the alternative hypothesis that corresponds to the rejection region (since power reflects the ability to reject the null hypothesis when the alternative hypothesis is true).

Since the direction of the hypothesis affects the size of this rejection region, power is generally lower for a two-tailed hypothesis. This is due to the rejection region being divided across both tails, making it more challenging to detect an effect in any one direction. The following graph illustrates the comparison between the two types of hypotheses. Note that the purple area is larger for the one-tailed hypothesis, compared to the two-tailed hypothesis:

In practice, to maintain the desired power level, we compensate for the reduced power of a two-tailed hypothesis by increasing the sample size (Increasing sample size raises power, though the mechanics of this can be a topic for a separate article). Thus, the choice between one- and two-tailed hypotheses directly influences the required sample size for your test. 
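
To make this concrete, here is a hedged sketch of the sample-size calculation using statsmodels’ power utilities; the effect size and power below are arbitrary placeholders, not values from any particular test:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.2   # assumed standardized effect size (Cohen's d)
alpha = 0.05
power = 0.8

# Required sample size per group for a two-tailed vs. a right one-tailed test
n_two_tailed = analysis.solve_power(effect_size=effect_size, alpha=alpha,
                                    power=power, alternative="two-sided")
n_one_tailed = analysis.solve_power(effect_size=effect_size, alpha=alpha,
                                    power=power, alternative="larger")

print(f"two-tailed: {n_two_tailed:.0f} per group, one-tailed: {n_one_tailed:.0f} per group")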

Beyond the planning phase, the choice of alternative hypothesis directly impacts the analysis and interpretation of results. There are cases where a test may reach significance with a one-tailed approach but not with a two-tailed one, and vice versa. Reviewing the previous graph can help illustrate this: for example, a result in the left tail might be significant under a two-tailed hypothesis but not under a right one-tailed hypothesis. Conversely, certain results might fall within the rejection region of a right one-tailed test but lie outside the rejection area in a two-tailed test.

How to decide between a one-tailed and two-tailed hypothesis

Let’s start with the bottom line: there’s no absolute right or wrong choice here. Both approaches are valid, and the primary consideration should be your specific business needs. To help you decide which option best suits your company, we’ll outline the key pros and cons of each.

At first glance, a one-tailed alternative may appear to be the clear choice, as it often aligns better with business objectives. In industry applications, the focus is typically on improving specific metrics rather than exploring a treatment’s impact in both directions. This is especially relevant in A/B testing, where the goal is often to optimize conversion rates or enhance revenue. If the treatment doesn’t lead to a significant improvement, the examined change won’t be implemented.

Beyond this conceptual advantage, we have already mentioned one key benefit of a one-tailed hypothesis: it requires a smaller sample size. Thus, choosing a one-tailed alternative can save both time and resources. To illustrate this advantage, the following graphs show the required sample sizes for one- and two-tailed hypotheses with different power levels (alpha is set at 5%).

In this context, the decision between one- and two-tailed hypotheses becomes particularly important in sequential testing—a method that allows for ongoing data analysis without inflating the alpha level. Here, selecting a one-tailed test can significantly reduce the duration of the test, enabling faster decision-making, which is especially valuable in dynamic business environments where prompt responses are essential.

However, don’t be too quick to dismiss the two-tailed hypothesis! It has its own advantages. In some business contexts, the ability to detect “negative significant results” is a major benefit. As one client once shared, he preferred negative significant results over inconclusive ones because they offer valuable learning opportunities. Even if the outcome wasn’t as expected, he could conclude that the treatment had a negative effect and gain insights into the product.

Another benefit of two-tailed tests is their straightforward interpretation using confidence intervals (CIs). In two-tailed tests, a CI that doesn’t include zero directly indicates significance, making it easier for practitioners to interpret results at a glance. This clarity is particularly appealing since CIs are widely used in A/B testing platforms. Conversely, with one-tailed tests, a significant result might still include zero in the CI, potentially leading to confusion or mistrust in the findings. Although one-sided confidence intervals can be employed with one-tailed tests, this practice is less common.

Conclusions

By adjusting a single parameter, you can significantly impact your A/B testing: specifically, the sample size you need to collect and the interpretation of the results. When deciding between one- and two-tailed hypotheses, consider factors such as the available sample size, the advantages of detecting negative effects, and the convenience of aligning confidence intervals (CIs) with hypothesis testing. Ultimately, this decision should be made thoughtfully, taking into account what best fits your business needs.

(Note: all the images in this post were created by the author)

Do European M&Ms Actually Taste Better than American M&Ms?
https://towardsdatascience.com/do-european-mms-actually-taste-better-than-american-mms/
Fri, 21 Feb 2025 21:52:58 +0000
An overly-enthusiastic application of science and data visualization to a question we’ve all been asking

(Oh, I am the only one who’s been asking this question…? Hm. Well, if you have a minute, please enjoy this exploratory Data Analysis — featuring experimental design, statistics, and interactive visualization — applied a bit too earnestly to resolve an international debate.)

1. Introduction

1.1 Background and motivation

Chocolate is enjoyed around the world. From ancient practices harvesting organic cacao in the Amazon basin, to chocolatiers sculpting edible art in the mountains of Switzerland, and enormous factories in Hershey, Pennsylvania churning out 70 million kisses per day, the nuanced forms and flavors of chocolate have been integrated into many cultures and their customs. While quality can greatly vary across chocolate products, a well-known, shelf-stable, easily shareable form of chocolate are M&Ms. Readily found by convenience store check-out counters and in hotel vending machines, the brightly colored pellets are a popular treat whose packaging is re-branded to fit nearly any commercializable American holiday.

While living in Denmark in 2022, I heard a concerning claim: M&Ms manufactured in Europe taste different, and arguably “better,” than M&Ms produced in the United States. While I recognized that fancy European chocolate is indeed quite tasty and often superior to American chocolate, it was unclear to me if the same claim should hold for M&Ms. I learned that many Europeans perceive an “unpleasant” or “tangy” taste in American chocolate, which is largely attributed to butyric acid, a compound resulting from differences in how milk is treated before incorporation into milk chocolate.

But honestly, how much of a difference could this make for M&Ms? M&Ms!? I imagined M&Ms would retain a relatively processed/mass-produced/cheap candy flavor wherever they were manufactured. As the lone American visiting a diverse lab of international scientists pursuing cutting-edge research in biosustainability, I was inspired to break out my data science toolbox and investigate this M&M flavor phenomenon.

1.2 Previous work

To quote a European woman, who shall remain anonymous, after she tasted an American M&M while traveling in New York:

“They taste so gross. Like vomit. I don’t understand how people can eat this. I threw the rest of the bag away.”

Vomit? Really? In my experience, children raised in the United States had no qualms about eating M&Ms. Growing up, I was accustomed to bowls of M&Ms strategically placed in high traffic areas around my house to provide readily available sugar. Clearly American M&Ms are edible. But are they significantly different and/or inferior to their European equivalent?

In response to the anonymous European woman’s scathing report, myself and two other Americans visiting Denmark sampled M&Ms purchased locally in the Lyngby Storcenter Føtex. We hoped to experience the incredible improvement in M&M flavor that was apparently hidden from us throughout our youths. But curiously, we detected no obvious flavor improvements.

Unfortunately, neither preliminary study was able to conduct a side-by-side taste test with proper controls and randomized M&M sampling. Thus, we turn to science.

1.3 Study Goals

This study seeks to remedy the previous lack of thoroughness and investigate the following questions:

  1. Is there a global consensus that European M&Ms are in fact better than American M&Ms?
  2. Can Europeans actually detect a difference between M&Ms purchased in the US vs in Europe when they don’t know which one they are eating? Or is this a grand, coordinated lie amongst Europeans to make Americans feel embarrassed?
  3. Are Americans actually taste-blind to American vs European M&Ms? Or can they taste a difference but simply don’t describe this difference as “an improvement” in flavor?
  4. Can these alleged taste differences be perceived by citizens of other continents? If so, do they find one flavor obviously superior?

2. Methods

2.1 Experimental design and data collection

Participants were recruited by luring — er, inviting them to a social gathering (with the promise of free food) that was conveniently co-located with the testing site. Once a participant agreed to pause socializing and join the study, they were positioned at a testing station with a trained experimenter who guided them through the following steps:

  • Participants sat at a table and received two cups: 1 empty and 1 full of water. With one cup in each hand, the participant was asked to close their eyes, and keep them closed through the remainder of the experiment.
  • The experimenter randomly extracted one M&M with a spoon, delivered it to the participant’s empty cup, and the participant was asked to eat the M&M (eyes still closed).
  • After eating each M&M, the experimenter collected the taste response by asking the participant to report if they thought the M&M tasted: Especially Good, Especially Bad, or Normal.
  • Each participant received a total of 10 M&Ms (5 European, 5 American), one at a time, in a random sequence determined by random.org.
  • Between eating each M&M, the participant was asked to take a sip of water to help “cleanse their palate.”
  • Data collected: for each participant, the experimenter recorded the participant’s continent of origin (if this was ambiguous, the participant was asked to list the continent on which they have the strongest memories of eating candy as a child). For each of the 10 M&Ms delivered, the experimenter recorded the M&M origin (“Denmark” or “USA”), the M&M color, and the participant’s taste response. Experimenters were also encouraged to jot down any amusing phrases uttered by the participant during the test, recorded under notes (data available here).

2.2 Sourcing materials and recruiting participants

Two bags of M&Ms were purchased for this study. The American-sourced M&Ms (“USA M&M”) were acquired at the SFO airport and delivered by the author’s parents, who visited her in Denmark. The European-sourced M&Ms (“Denmark M&M”) were purchased at a local Føtex grocery store in Lyngby, a little north of Copenhagen.

Experiments were conducted at two main time points. The first 14 participants were tested in Lyngby, Denmark in August 2022. They mostly consisted of friends and housemates the author met at the Novo Nordisk Foundation Center for Biosustainability at the Technical University of Denmark (DTU) who came to a “going away party” into which the experimental procedure was inserted. A few additional friends and family who visited Denmark were also tested during their travels (e.g. on the train).

The remaining 37 participants were tested in Seattle, WA, USA in October 2022, primarily during a “TGIF happy hour” hosted by graduate students in the computer science PhD program at the University of Washington. This second batch mostly consisted of students and staff of the Paul G. Allen School of Computer Science & Engineering (UW CSE) who responded to the weekly Friday summoning to the Allen Center atrium for free snacks and drinks.

Figure 1. Distribution of participants recruited to the study. In the first sampling event in Lyngby, participants primarily hailed from North America and Europe, and a few additionally came from Asia, South America, or Australia. Our second sampling event in Seattle greatly increased participants, primarily from North America and Asia, and a few more from Europe. Neither event recruited participants from Africa. Figure made with Altair.

While this study set out to analyze global trends, unfortunately data was only collected from 51 participants the author was able to lure to the study sites and is not well-balanced nor representative of the 6 inhabited continents of Earth (Figure 1). We hope to improve our recruitment tactics in future work. For now, our analytical power with this dataset is limited to response trends for individuals from North America, Europe, and Asia, highly biased by subcommunities the author happened to engage with in late 2022.

2.3 Risks

While we did not acquire formal approval for experimentation with human test subjects, there were minor risks associated with this experiment: participants were warned that they may be subjected to increased levels of sugar and possible “unpleasant flavors” as a result of participating in this study. No other risks were anticipated.

After the experiment however, we unfortunately observed several cases of deflated pride when a participant learned their taste response was skewed more positively towards the M&M type they were not expecting. This pride deflation seemed most severe among European participants who learned their own or their fiancé’s preference skewed towards USA M&Ms, though this was not quantitatively measured and cannot be confirmed beyond anecdotal evidence.

3. Results & Discussion

3.1 Overall response to “USA M&Ms” vs “Denmark M&Ms”

3.1.1 Categorical response analysis — entire dataset

In our first analysis, we count the total number of “Bad”, “Normal”, and “Good” taste responses and report the percentage of each response received by each M&M type. M&Ms from Denmark more frequently received “Good” responses than USA M&Ms but also more frequently received “Bad” responses. M&Ms from the USA were most frequently reported to taste “Normal” (Figure 2). This may result from the elevated number of participants hailing from North America, where the USA M&M is the default and thus more “Normal,” while the Denmark M&M was more often perceived as better or worse than the baseline.

Figure 2. Qualitative taste response distribution across the whole dataset. The percentage of taste responses for “Bad”, “Normal” or “Good” was calculated for each type of M&M. Figure made with Altair.

Now let’s break out some statistics, such as a chi-squared (X²) test to compare our observed distributions of categorical taste responses. Using the scipy.stats chi2_contingency function, we built contingency tables of the observed counts of “Good,” “Normal,” and “Bad” responses to each M&M type. Using the X² test to evaluate the null hypothesis that there is no difference between the two M&Ms, we found the p-value for the test statistic to be 0.0185, which is significant at the common p-value cutoff of 0.05, but not at 0.01. So a solid “maybe,” depending on whether you’d like this result to be significant or not.
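
For readers who want to reproduce this step, here is a minimal sketch of the chi-squared test; the counts below are illustrative placeholders, not the study’s actual numbers:

from scipy.stats import chi2_contingency

# Rows: M&M type (Denmark, USA); columns: counts of "Good", "Normal", "Bad" responses.
# These counts are made up for illustration only.
observed = [[80, 120, 55],   # Denmark M&Ms
            [60, 150, 45]]   # USA M&Ms

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, dof = {dof}")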

3.1.2 Quantitative response analysis — entire dataset.

The X² test helps evaluate if there is a difference in categorical responses, but next, we want to determine a relative taste ranking between the two M&M types. To do this, we converted taste responses to a quantitative distribution and calculated a taste score. Briefly, “Bad” = 1, “Normal” = 2, “Good” = 3. For each participant, we averaged the taste scores across the 5 M&Ms they tasted of each type, maintaining separate taste scores for each M&M type.
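
A sketch of this conversion in pandas, assuming a long-format table with hypothetical columns participant_id, mm_type, and response (the column names and rows are placeholders, not the study’s actual schema):

import pandas as pd

# Illustrative data only
responses = pd.DataFrame({
    "participant_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "mm_type": ["USA", "USA", "Denmark", "Denmark"] * 2,
    "response": ["Normal", "Good", "Bad", "Good", "Normal", "Normal", "Good", "Good"],
})

score_map = {"Bad": 1, "Normal": 2, "Good": 3}
responses["taste_score"] = responses["response"].map(score_map)

# Average taste score per participant, kept separate for each M&M type
avg_scores = (responses
              .groupby(["participant_id", "mm_type"])["taste_score"]
              .mean()
              .unstack())
print(avg_scores)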

Figure 3. Quantitative taste score distributions across the whole dataset. Kernel density estimation of the average taste score calculated for each participant for each M&M type. Figure made with Seaborn.

With the average taste score for each M&M type in hand, we turn to scipy.stats ttest_ind (“T-test”) to evaluate if the means of the USA and Denmark M&M taste scores are different (the null hypothesis being that the means are identical). If the means are significantly different, it would provide evidence that one M&M is perceived as significantly tastier than the other.
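
A minimal sketch of that comparison, reusing the avg_scores table from the snippet above (so the "USA" and "Denmark" column names are my assumption about the data layout):

from scipy.stats import ttest_ind

# One average score per participant per M&M type
t_stat, p_value = ttest_ind(avg_scores["USA"], avg_scores["Denmark"])
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")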

We found the average taste scores for USA M&Ms and Denmark M&Ms to be quite close (Figure 3), and not significantly different (T-test: p = 0.721). Thus, across all participants, we do not observe a difference between the perceived taste of the two M&M types (or if you enjoy parsing triple negatives: “we cannot reject the null hypothesis that there is not a difference”).

But does this change if we separate participants by continent of origin?

3.2 Continent-specific responses to “USA M&Ms” vs “Denmark M&Ms”

We repeated the above X² and T-test analyses after grouping participants by their continents of origin. The Australia and South America groups were combined as a minimal attempt to preserve data privacy. Due to the relatively small sample size of even the combined Australia/South America group (n=3), we will refrain from analyzing trends for this group but include the data in several figures for completeness and enjoyment of the participants who may eventually read this.

3.2.1 Categorical response analysis — by continent

In Figure 4, we display both the taste response counts (upper panel, note the interactive legend) and the response percentages (lower panel) for each continent group. Both North America and Asia follow a similar trend to the whole population dataset: participants report Denmark M&Ms as “Good” more frequently than USA M&Ms, but also report Denmark M&Ms as “Bad” more frequently. USA M&Ms were most frequently reported as “Normal” (Figure 4).

On the contrary, European participants report USA M&Ms as “Bad” nearly 50% of the time and “Good” only 18% of the time, which is the most negative and least positive response pattern, respectively (when excluding the under-sampled Australia/South America group).

Figure 4. Qualitative taste response distribution by continent. Upper panel: counts of taste responses — click the legend to interactively filter! Lower panel: percentage of taste responses for each type of M&M. Figure made with Altair.

This appeared striking in bar chart form; however, only North America had a significant X² p-value (p = 0.0058) when evaluating each continent’s difference in taste response profile between the two M&M types. The European p-value is perhaps “approaching significance” in some circles, but we’re about to accumulate several more hypothesis tests and should be mindful of multiple hypothesis testing (Table 1). A false positive result here would be devastating.

When comparing the taste response profiles between two continents for the same M&M type, there are a couple of interesting notes. First, we observed no major taste discrepancies between all pairs of continents when evaluating Denmark M&Ms — the world seems generally consistent in their range of feelings about M&Ms sourced from Europe (right column X² p-values, Table 2). To visualize this comparison more easily, we reorganize the bars in Figure 4 to group them by M&M type (Figure 5).

Figure 5. Qualitative taste response distribution by M&M type, reported as percentages. (Same data as Figure 4 but re-arranged). Figure made with Altair.

However, when comparing continents to each other in response to USA M&Ms, we see larger discrepancies. We found one pairing to be significantly different: European and North American participants evaluated USA M&Ms very differently (p = 0.000007) (Table 2). It seems very unlikely that this observed difference is due to random chance (left column, Table 2).

3.2.2 Quantitative response analysis — by continent

We again convert the categorical profiles to quantitative distributions to assess continents’ relative preference of M&M types. For North America, we see that the taste score means of the two M&M types are actually quite similar, but there is a higher density around “Normal” scores for USA M&Ms (Figure 6A). The European distributions maintain a bit more of a separation in their means (though not quite significantly so), with USA M&Ms scoring lower (Figure 6B). The taste score distributions of Asian participants are the most similar (Figure 6C).

Reorienting to compare the quantitative means between continents’ taste scores for the same M&M type, only the comparison between North American and European participants on USA M&Ms is significantly different based on a T-test (p = 0.001) (Figure 6D), though now we really are in danger of multiple hypothesis testing! Be cautious if you are taking this analysis at all seriously.

Figure 6. Quantitative taste score distributions by continent. Kernel density estimation of the average taste score calculated for each continent for each M&M type. A. Comparison of North America responses to each M&M. B. Comparison of Europe responses to each M&M. C. Comparison of Asia responses to each M&M. D. Comparison of continents for USA M&Ms. E. Comparison of continents for Denmark M&Ms. Figure made with Seaborn.

At this point, I feel myself considering that maybe Europeans are not just making this up. I’m not saying it’s as dramatic as some of them claim, but perhaps a difference does indeed exist… To some degree, North American participants also perceive a difference, but the evaluation of Europe-sourced M&Ms is not consistently positive or negative.

3.3 M&M taste alignment chart

In our analyses thus far, we did not account for the baseline differences in M&M appreciation between participants. For example, say Person 1 scored all Denmark M&Ms as “Good” and all USA M&Ms as “Normal”, while Person 2 scored all Denmark M&Ms as “Normal” and all USA M&Ms as “Bad.” They would have the same relative preference for Denmark M&Ms over USA M&Ms, but Person 2 perhaps just does not enjoy M&Ms as much as Person 1, and the relative preference signal is muddled by averaging the raw scores.

Inspired by the Lawful/Chaotic x Good/Evil alignment chart used in tabletop role playing games like Dungeons & Dragons©™, in Figure 7, we establish an M&M alignment chart to help determine the distribution of participants across M&M enjoyment classes.

Figure 7. M&M enjoyment alignment chart. The x-axis represents a participant’s average taste score for USA M&Ms; the y-axis is a participant’s average taste score for Denmark M&Ms. Figure made with Altair.

Notably, the upper right quadrant where both M&M types are perceived as “Good” to “Normal” is mostly occupied by North American participants and a few Asian participants. All European participants land in the left half of the figure where USA M&Ms are “Normal” to “Bad”, but Europeans are somewhat split between the upper and lower halves, where perceptions of Denmark M&Ms range from “Good” to “Bad.”

An interactive version of Figure 7 is provided below for the reader to explore the counts of various M&M alignment regions.

Figure 7 (interactive): click and brush your mouse over the scatter plot to see the counts of continents in different M&M enjoyment regions. Figure made with Altair.

3.4 Participant taste response ratio

Next, to factor out baseline M&M enjoyment and focus on participants’ relative preference between the two M&M types, we took the log ratio of each person’s USA M&M taste score average divided by their Denmark M&M taste score average.

Equation 1: Equation to calculate each participant’s overall M&M preference ratio.

As such, positive scores indicate a preference towards USA M&Ms while negative scores indicate a preference towards Denmark M&Ms.
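
Since the equation image is not reproduced here, a sketch of the calculation as described, again reusing the per-participant average scores (column names assumed), might look like:

import numpy as np

# avg_scores: one row per participant with columns "USA" and "Denmark"
preference_ratio = np.log(avg_scores["USA"] / avg_scores["Denmark"])

# > 0 : relative preference for USA M&Ms
# < 0 : relative preference for Denmark M&Ms
print(preference_ratio)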

On average, European participants had the strongest preference towards Denmark M&Ms, with Asians also exhibiting a slight preference towards Denmark M&Ms (Figure 8). To the two Europeans who exhibited deflated pride upon learning their slight preference towards USA M&Ms, fear not: you did not think USA M&Ms were “Good,” but simply ranked them as less bad than Denmark M&Ms (see participant_id 4 and 17 in the interactive version of Figure 7). If you assert that M&Ms are a bad American invention not worth replicating and return to consuming artisanal European chocolate, your honor can likely be restored.

Figure 8. Distribution of participant M&M preference ratios by continent. Preference ratios are calculated as in Equation 1. Positive numbers indicate a relative preference for USA M&Ms, while negative indicate a relative preference for Denmark M&Ms. Figure made with Seaborn.

North American participants are pretty split in their preference ratios: some fall quite neutrally around 0, others strongly prefer the familiar USA M&M, while a handful moderately prefer Denmark M&Ms. Anecdotally, North Americans who learned their preference skewed towards European M&Ms displayed signals of inflated pride, as if their results signaled posh refinement.

Overall, a T-test comparing the distributions of M&M preference ratios shows a possibly significant difference in the means between European and North American participants (p = 0.049), but come on, this is like the 20th p-value I’ve reported — this one is probably too close to call.

3.5 Taste inconsistency and “Perfect Classifiers”

For each participant, we assessed their taste score consistency by averaging the standard deviations of their responses to each M&M type, and plotting that against their preference ratio (Figure 9).

Figure 9. Participant taste consistency by preference ratio. The x-axis is a participant’s relative M&M preference ratio. The y-axis is the average of the standard deviation of their USA M&M scores and the standard deviation of their Denmark M&M scores. A value of 0 on the y-axis indicates perfect consistency in responses, while higher values indicate more inconsistent responses. Figure made with Altair.

Most participants were somewhat inconsistent in their ratings, ranking the same M&M type differently across the 5 samples. This would be expected if the taste difference between European-sourced and American-sourced M&Ms is not actually all that perceptible. Most inconsistent were participants who gave the same M&M type “Good”, “Normal”, and “Bad” responses (e.g., points high on the y-axis, with wider standard deviations of taste scores), indicating lower taste perception abilities.

Intriguingly, four participants — one from each continent group — were perfectly consistent: they reported the same taste response for each of the 5 M&Ms from each M&M type, resulting in an average standard deviation of 0.0 (bottom of Figure 9). Excluding the one of the four who simply rated all 10 M&Ms as “Normal”, the other three appeared to be “Perfect Classifiers” — either rating all M&Ms of one type “Good” and the other “Normal”, or rating all M&Ms of one type “Normal” and the other “Bad.” Perhaps these folks are “super tasters.”

3.6 M&M color

Another possible explanation for the inconsistency in individual taste responses is that there exists a perceptible taste difference based on the M&M color. Visually, the USA M&Ms were noticeably more smooth and vibrant than the Denmark M&Ms, which were somewhat more “splotchy” in appearance (Figure 10A). M&M color was recorded during the experiment, and although balanced sampling was not formally built into the experimental design, colors seemed to be sampled roughly evenly, with the exception of Blue USA M&Ms, which were oversampled (Figure 10B).

Figure 10. M&M colors. A. Photo of each M&M color of each type. It’s perhaps a bit hard to perceive on screen in my unprofessionally lit photo, but with the naked eye, USA M&Ms seemed to be brighter and more uniformly colored while Denmark M&Ms have a duller and more mottled color. Is it just me, or can you already hear the Europeans saying “They are brighter because of all those extra chemicals you put in your food that we ban here!” B. Distribution of M&Ms of each color sampled over the course of the experiment. The Blue USA M&Ms were not intentionally oversampled — they must be especially bright/tempting to experimenters. Figure made with Altair.

We briefly visualized possible differences in taste responses based on color (Figure 11), however we do not believe there are enough data to support firm conclusions. After all, on average each participant would likely only taste 5 of the 6 M&M colors once, and 1 color not at all. We leave further M&M color investigations to future work.

Figure 11. Taste response profiles for M&Ms of each color and type. Profiles are reported as percentages of “Bad”, “Normal”, and “Good” responses, though not all M&Ms were sampled exactly evenly. Figure made with Altair.

3.7 Colorful commentary

We assured each participant that there was no “right” answer in this experiment and that all feelings are valid. While some participants took this to heart and occasionally spent over a minute deeply savoring each M&M and evaluating it as if they were a sommelier, many participants seemed to view the experiment as a competition (which occasionally led to deflated or inflated pride). Experimenters wrote down quotes and notes in conjunction with M&M responses, some of which were a bit “colorful.” We provide a hastily rendered word cloud for each M&M type for entertainment purposes (Figure 12) though we caution against reading too far into them without diligent sentiment analysis.

Figure 12. A simple word cloud generated from the notes column of each M&M type. Fair warning — these have not been properly analyzed for sentiment and some inappropriate language was recorded. Figure made with WordCloud.

4. Conclusion

Overall, there does not appear to be a “global consensus” that European M&Ms are better than American M&Ms. However, European participants tended to more strongly express negative reactions to USA M&Ms while North American participants seemed relatively split on whether they preferred M&Ms sourced from the USA vs from Europe. The preference trends of Asian participants often fell somewhere between the North Americans and Europeans.

Therefore, I’ll admit that it’s probable that Europeans are not engaged in a grand coordinated lie about M&Ms. The skew of most European participants towards Denmark M&Ms is compelling, especially since I was the experimenter who personally collected much of the taste response data. If they found a way to cheat, it was done well enough to exceed my own passive perception such that I didn’t notice. However, based on this study, it would appear that a strongly negative “vomit flavor” is not universally perceived and does not become apparent to non-Europeans when tasting both M&M types side by side.

We hope this study has been illuminating! We would look forward to extensions of this work with improved participant sampling, additional M&M types sourced from other continents, and deeper investigations into possible taste differences due to color.

Thank you to everyone who participated and ate M&Ms in the name of science!

Figures and analysis can be found on github: https://github.com/erinhwilson/mnm-taste-test

Article by Erin H. Wilson, Ph.D.[1,2,3] who decided the time between defending her dissertation and starting her next job would be best spent on this highly valuable analysis. Hopefully it is clear that this article is intended to be comedic— I do not actually harbor any negative feelings towards Europeans who don’t like American M&Ms, but enjoyed the chance to be sassy and poke fun at our lively debates with overly-enthusiastic data analysis.

Shout out to Matt, Galen, Ameya, and Gian-Marco for assisting in data collection!

[1] Former Ph.D. student in the Paul G. Allen School of Computer Science and Engineering at the University of Washington

[2] Former visiting Ph.D. student at the Novo Nordisk Foundation Center for Biosustainability at the Technical University of Denmark

[3] Future data scientist at LanzaTech

Triangle Forecasting: Why Traditional Impact Estimates Are Inflated (And How to Fix Them)
https://towardsdatascience.com/triangle-forecasting-why-traditional-impact-estimates-are-inflated-and-how-to-fix-them/
Fri, 07 Feb 2025 20:39:59 +0000

Accurate impact estimations can make or break your business case.

Yet, despite its importance, most teams use oversimplified calculations that can lead to inflated projections. These shot-in-the-dark numbers not only destroy credibility with stakeholders but can also result in misallocation of resources and failed initiatives. But there’s a better way to forecast effects of gradual customer acquisition, without requiring messy Excel spreadsheets and formulas that error out.

By the end of this article, you will be able to calculate accurate yearly forecasts and implement a scalable Python solution for Triangle Forecasting.

The Hidden Cost of Inaccurate Forecasts

When asked for annual impact estimations, product teams routinely overestimate impact by applying a one-size-fits-all approach to customer cohorts. Teams frequently opt for a simplistic approach: 

Multiply monthly revenue (or any other relevant metric) by twelve to estimate annual impact. 

While the calculation is easy, this formula ignores a fundamental premise that applies to most businesses:

Customer acquisition happens gradually throughout the year.

The contribution from all customers to yearly estimates is not equal since later cohorts contribute fewer months of revenue. 

Triangle Forecasting can cut projection errors by accounting for effects of customer acquisition timelines.

Let us explore this concept with a basic example. Let’s say you’re launching a new subscription service:

  • Monthly subscription fee: $100 per customer
  • Monthly customer acquisition target: 100 new customers
  • Goal: Calculate total revenue for the year

An oversimplified multiplication suggests a revenue of $1,440,000 in the first year (= 100 new customers/month * 12 months * $100 spent / month * 12 months).

The actual number is only $780,000! 

This 46% overestimation is why impact estimations frequently do not pass stakeholders’ sniff test.

Accurate forecasting is not just about mathematics — 

It is a tool that helps you build trust and gets your initiatives approved faster without the risk of over-promising and under-delivering.

Moreover, data professionals spend hours building manual forecasts in Excel, which are volatile, can result in formula errors, and are challenging to iterate upon. 

Having a standardized, explainable methodology can help simplify this process.

Introducing Triangle Forecasting

Triangle Forecasting is a systematic, mathematical approach to estimate the yearly impact when customers are acquired gradually. It accounts for the fact that incoming customers will contribute differently to the annual impact, depending on when they onboard onto your product.

This method is particularly handy for:

  • New Product Launches: When customer acquisition happens over time
  • Subscription Revenue Forecasts: For accurate revenue projections for subscription-based products
  • Phased Rollouts: For estimating the cumulative impact of gradual rollouts
  • Acquisition Planning: For setting realistic monthly acquisition targets to hit annual goals
Image generated by author

The “triangle” in Triangle Forecasting refers to the way individual cohort contributions are visualized. A cohort refers to the month in which the customers were acquired. Each bar in the triangle represents a cohort’s contribution to the annual impact. Earlier cohorts have longer bars because they contributed for an extended period.

To calculate the impact of a new initiative, model or feature in the first year:

  1. For each month (m) of the year:
  • Calculate number of customers acquired (Am)
  • Calculate average monthly spend/impact per customer (S)
  • Calculate remaining months in year (Rm = 13-m)
  • Monthly cohort impact = Am × S × Rm

  2. Total yearly impact = Sum of all monthly cohort impacts
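
Put compactly: with a constant acquisition rate A and monthly spend S, the total is A × S × (12 + 11 + … + 1) = A × S × 78. A quick sanity check in Python, using the subscription example from above:

monthly_acquisitions = 100   # A: new customers acquired each month
monthly_spend = 100          # S: monthly spend per customer ($)

# Each month-m cohort contributes for (13 - m) months
total = sum(monthly_acquisitions * monthly_spend * (13 - m) for m in range(1, 13))
print(total)  # 780000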

Image generated by author

Building Your First Triangle Forecast

Let’s calculate the actual revenue for our subscription service:

  • January: 100 customers × $100 × 12 months = $120,000
  • February: 100 customers × $100 × 11 months = $110,000
  • March: 100 customers × $100 × 10 months = $100,000
  • And so on…

Calculating in Excel, we get:

Image generated by author

Summing the monthly cohorts gives $10,000 × (12 + 11 + … + 1) = $10,000 × 78, so the total annual revenue equals $780,000 — 46% lower than the oversimplified estimate!

💡 Pro Tip: Save the spreadsheet calculations as a template to reuse for different scenarios.

Need to build estimates without perfect data? Read my guide on “Building Defendable Impact Estimates When Data is Imperfect”.

Putting Theory into Practice: An Implementation Guide

While we can implement Triangle Forecasting in Excel using the above method, such spreadsheets become difficult to maintain and modify. Product owners also struggle to update forecasts quickly when assumptions or timelines change.

Here’s how we can build the same forecast in Python in minutes:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def triangle_forecast(monthly_acquisitions, monthly_spend_per_customer):
    """
    Calculate yearly impact using the triangle forecasting method.

    monthly_acquisitions: a single number (constant rate) or a list of 12 monthly values.
    monthly_spend_per_customer: average monthly spend/impact per customer.
    """
    # Create a DataFrame for calculations
    months = range(1, 13)
    df = pd.DataFrame(index=months, 
                     columns=['month', 'new_customers', 
                             'months_contributing', 'total_impact'])

    # Convert to list if single number, else use provided list
    acquisitions = [monthly_acquisitions] * 12 if type(monthly_acquisitions) in [int, float] else monthly_acquisitions
    
    # Calculate impact for each cohort
    for month in months:
        df.loc[month, 'month'] = f'Month {month}'
        df.loc[month, 'new_customers'] = acquisitions[month-1]
        df.loc[month, 'months_contributing'] = 13 - month
        df.loc[month, 'total_impact'] = (
            acquisitions[month-1] * 
            monthly_spend_per_customer * 
            (13 - month)
        )
    
    total_yearly_impact = df['total_impact'].sum()
    
    return df, total_yearly_impact

Continuing with our previous example of a subscription service, the revenue from each monthly cohort can be calculated as follows:

# Example
monthly_acquisitions = 100  # 100 new customers each month
monthly_spend = 100        # $100 per customer per month

# Calculate forecast
df, total_impact = triangle_forecast(monthly_acquisitions, monthly_spend)

# Print results
print("Monthly Breakdown:")
print(df)
print(f"\nTotal Yearly Impact: ${total_impact:,.2f}")
Image generated by author

We can also leverage Python to visualize the cohort contributions as a bar chart. Note how the impact decreases linearly as we move through the months. 
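
The plotting code for this chart is not shown in the original post; a minimal matplotlib sketch, reusing the df returned by triangle_forecast above, could look like this:

import matplotlib.pyplot as plt

# Bar chart of each monthly cohort's contribution to the yearly impact
plt.figure(figsize=(10, 5))
plt.bar(df["month"], df["total_impact"].astype(float))
plt.ylabel("Cohort impact ($)")
plt.xticks(rotation=45)
plt.title("Contribution of each monthly cohort to yearly revenue")
plt.tight_layout()
plt.show()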

Image generated by author

Using this Python code, you can now generate and iterate on annual impact estimations quickly and efficiently, without having to manually perform version control on crashing spreadsheets.

Beyond Basic Forecasts 

While the above example is straightforward, it assumes that monthly acquisitions and spending are constant across all months, which need not be true. Triangle Forecasting can be easily adapted and scaled to account for:

  • Multiple spend tiers

For monthly spend that varies across spend tiers, create a distinct triangle forecast for each tier and then aggregate the individual forecasts to calculate the total annual impact.

  • Varying acquisition rates

Typically, businesses don’t acquire customers at a constant rate throughout the year. Acquisition might start at a slow pace and ramp up as marketing kicks in, or we might have a burst of early adopters followed by slower growth. To handle varying rates, pass a list of monthly targets instead of a single rate:

# Example: Gradual ramp-up in acquisitions
varying_acquisitions = [50, 75, 100, 150, 200, 250, 
                        300, 300, 300, 250, 200, 150]
df, total_impact = triangle_forecast(varying_acquisitions, monthly_spend)
Image generated by author
  • Seasonality adjustments

To account for seasonality, multiply each month’s impact by its corresponding seasonal factor (e.g., 1.2 for high-season months like December, 0.8 for low-season months like February, etc.) before calculating the total impact.

Here is how you can modify the Python code to account for seasonal variations:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def triangle_forecast(monthly_acquisitions, monthly_spend_per_customer, seasonal_factors = None):
    """
    Calculate yearly impact using triangle forecasting method.
    """    
    # Create a DataFrame for calculations
    months = range(1, 13)
    df = pd.DataFrame(index=months, 
                     columns=['month', 'new_customers', 
                             'months_contributing', 'total_impact'])

    # Convert to list if single number, else use provided list
    acquisitions = [monthly_acquisitions] * 12 if type(monthly_acquisitions) in [int, float] else monthly_acquisitions

    if seasonal_factors is None:
        seasonality = [1] * 12
    else:
        seasonality = [seasonal_factors] * 12 if type(seasonal_factors) in [int, float] else seasonal_factors        
    
    # Calculate impact for each cohort
    for month in months:
        df.loc[month, 'month'] = f'Month {month}'
        df.loc[month, 'new_customers'] = acquisitions[month-1]
        df.loc[month, 'months_contributing'] = 13 - month
        df.loc[month, 'total_impact'] = (
            acquisitions[month-1] * 
            monthly_spend_per_customer * 
            (13 - month)*
            seasonality[month-1]
        )
    
    total_yearly_impact = df['total_impact'].sum()
    
    return df, total_yearly_impact

# Seasonality-adjusted example 
monthly_acquisitions = 100  # 100 new customers each month
monthly_spend = 100        # $100 per customer per month
seasonal_factors = [1.2,  # January (New Year)
            0.8,  # February (Post-holiday)
            0.9,  # March
            1.0,  # April
            1.1,  # May
            1.2,  # June (Summer)
            1.2,  # July (Summer)
            1.0,  # August
            0.9,  # September
            1.1, # October (Halloween) 
            1.2, # November (Pre-holiday)
            1.5  # December (Holiday)
                   ]

# Calculate forecast
df, total_impact = triangle_forecast(monthly_acquisitions, 
                                     monthly_spend, 
                                     seasonal_factors)
Image generated by author

These customizations can help you model different growth scenarios including:

  • Gradual ramp-ups in early stages of launch
  • Step-function growth based on promotional campaigns
  • Seasonal variations in customer acquisition

The Bottom Line

Having dependable and intuitive forecasts can make or break the case for your initiatives. 

But that’s not all — triangle forecasting also finds applications beyond revenue forecasting, including calculating:

  • Customer Activations
  • Portfolio Loss Rates
  • Credit Card Spend

Ready to dive in? Download the Python template shared above and build your first Triangle forecast in 15 minutes! 

  1. Input your monthly acquisition targets
  2. Set your expected monthly customer impact
  3. Visualize your annual trajectory with automated visualizations

Real-world estimations often require dealing with imperfect or incomplete data. Check out my article “Building Defendable Impact Estimates When Data is Imperfect” for a framework to build defendable estimates in such scenarios.

Acknowledgement:

Thank you to my wonderful mentor, Kathryne Maurer, for developing the core concept and first iteration of the Triangle Forecasting method and allowing me to build on it through equations and code.

I’m always open to feedback and suggestions on how to make these guides more valuable for you. Happy reading!

Myths vs. Data: Does an Apple a Day Keep the Doctor Away?
https://towardsdatascience.com/myths-vs-data-does-an-apple-a-day-keep-the-doctor-away/
Thu, 06 Feb 2025 03:55:10 +0000

Introduction

“Money can’t buy happiness.” “You can’t judge a book by its cover.” “An apple a day keeps the doctor away.”

You’ve probably heard these sayings several times, but do they actually hold up when we look at the data? In this article series, I want to take popular myths/sayings and put them to the test using real-world data. 

We might confirm some unexpected truths, or debunk some popular beliefs. Hopefully, in either case we will gain new insights into the world around us.

The hypothesis

“An apple a day keeps the doctor away”: is there any real evidence to support this?

If the myth is true, we should expect a negative correlation between apple consumption per capita and doctor visits per capita. So, the more apples a country consumes, the fewer doctor visits people should need.

Let’s look into the data and see what the numbers really say.

Testing the relationship between apple consumption and doctor visits

Let’s start with a simple correlation check between apple consumption per capita and doctor visits per capita.

Data sources

The data comes from:

Since data availability varies by year, 2017 was selected as it provided the most complete coverage in terms of number of countries. However, the results are consistent across other years.

The United States had the highest apple consumption per capita, exceeding 55 kg per year, while Lithuania had the lowest, consuming just under 1 kg per year.
South Korea had the highest number of doctor visits per capita, at more than 18 visits per year, while Colombia had the lowest, with just above 2 visits per year.

Visualizing the relationship

To visualize whether higher apple consumption is associated with fewer doctor visits, we start by looking at a scatter plot with a regression line.
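
A minimal sketch of such a plot with seaborn; the dataframe and column names below are placeholders for illustration (the real analysis has one row per country):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative placeholder data, not the actual country-level values
data = pd.DataFrame({
    "apples_kg_per_capita": [1.0, 8.5, 12.0, 20.3, 33.1, 55.0],
    "doctor_visits_per_capita": [2.3, 6.1, 5.4, 7.0, 4.2, 4.0],
})

# Scatter plot with a fitted OLS regression line
sns.regplot(data=data, x="apples_kg_per_capita", y="doctor_visits_per_capita")
plt.xlabel("Apple consumption per capita (kg/year)")
plt.ylabel("Doctor visits per capita (per year)")
plt.show()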

The regression plot shows a very slim negative correlation, meaning that in countries where people eat more apples, there is a barely noticeable tendency to have fewer doctor visits.
Unfortunately, the trend is so weak that it cannot be considered meaningful.

OLS regression

To test this relationship statistically, we run a linear regression (OLS), where doctor visits per capita is the dependent variable and apple consumption per capita is the independent variable.
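
A sketch of how such a regression might be run with statsmodels, reusing the placeholder data from the previous snippet:

import statsmodels.api as sm

X = sm.add_constant(data["apples_kg_per_capita"])  # add an intercept term
y = data["doctor_visits_per_capita"]

model = sm.OLS(y, X).fit()
print(model.summary())  # coefficient, p-value, and R² for apple consumption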

The results confirm what the scatterplot suggested:

  • The coefficient for apple consumption is -0.0107, meaning that even if there is an effect, it is very small.
  • The p-value is 0.860 (86%), far more than the standard significance threshold of 5%.
  • The R² value is almost zero, meaning apple consumption explains virtually none of the variation in doctor visits.

This doesn’t strictly mean that there is no relationship, but rather that we cannot prove one with the available data. It’s possible that any real effect is too small to detect, that other factors we didn’t include play a larger role, or that the data simply doesn’t reflect the relationship well.

Controlling for confounders

Are we done? Not quite. So far, we’ve only checked for a direct relationship between apple consumption and doctor visits. 

As already mentioned, many other factors could be influencing both variables, potentially hiding a true relationship or creating an artificial one.

If we consider this causal graph:

We are assuming that apple consumption directly affects doctor visits. However, other hidden factors might be at play. If we don’t account for them, we risk failing to detect a real relationship if one exists.

A well-known example where confounder variables are on display comes from a study by Messerli (2012), which found an interesting correlation between chocolate consumption per capita and the number of Nobel laureates. 

So, would starting to eat a lot of chocolate help us win a Nobel Prize? Probably not. The likely explanation was that GDP per capita was a confounder. That means that richer countries tend to have both higher chocolate consumption and more Nobel Prize winners. The observed relationship wasn’t causal but rather due to a hidden (confounding) factor.

The same thing could be happening in our case. There might be confounding variables that influence both apple consumption and doctor visits, making it difficult to see a real relationship if one exists. 

Two key confounders to consider are GDP per capita and median age. Wealthier countries have better healthcare systems and different dietary patterns, and older populations tend to visit doctors more often and may have different eating habits.

To control for this, we change our model by introducing these confounders:

Data sources

The data comes from:

Luxembourg had the highest GDP per capita, exceeding 115K USD, while Colombia had the lowest, at 14.3K USD.
Japan had the highest median age, at over 46 years, while Mexico had the lowest, at under 27 years.

OLS regression (with confounders)

After controlling for GDP per capita and median age, we run a multiple regression to test whether apple consumption has any meaningful effect on doctor visits.
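
Extending the earlier sketch, the confounders are simply added as extra columns in the design matrix; the column names and values below are again illustrative placeholders:

import pandas as pd
import statsmodels.api as sm

# Placeholder data with the two confounders added
data = pd.DataFrame({
    "apples_kg_per_capita":     [1.0, 8.5, 12.0, 20.3, 33.1, 55.0],
    "doctor_visits_per_capita": [2.3, 6.1, 5.4, 7.0, 4.2, 4.0],
    "gdp_per_capita_kusd":      [14.3, 35.0, 48.0, 52.0, 80.0, 115.0],
    "median_age":               [27.0, 38.0, 42.0, 44.0, 40.0, 46.0],
})

X = sm.add_constant(data[["apples_kg_per_capita", "gdp_per_capita_kusd", "median_age"]])
y = data["doctor_visits_per_capita"]

model = sm.OLS(y, X).fit()
print(model.summary())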

The results confirm what we observed earlier:

  • The coefficient for apple consumption remains very small (-0.0100), meaning any potential effect is negligible.
  • The p-value (85.5%) is still extremely high, far from statistical significance.
  • We still cannot reject the null hypothesis, meaning we have no strong evidence to support the idea that eating more apples leads to fewer doctor visits.

Same as before, this does not necessarily mean that no relationship exists, but rather that we cannot prove one using the available data. It could still be possible that the real effect is too small to detect or that there are yet other factors we didn’t include.

One interesting observation, however, is that GDP per capita also shows no significant relationship with doctor visits, as its p-value is 0.668 (66.8%), indicating that we could not find evidence in this data that wealth explains variations in healthcare usage.

On the other hand, median age appears to be strongly associated with doctor visits, with a p-value of 0.001 (0.1%) and a positive coefficient (0.4952). This suggests that older populations tend to visit doctors more frequently, which is actually not really surprising if we think about it!

So while we find no support for the apple myth, the data does reveal an interesting relationship between aging and healthcare usage.

Median age → Doctor visits

The results from the OLS regression showed a strong relationship between median age and doctor visits, and the visualization below confirms this trend.

There is a clear upward trend, indicating that countries with older populations tend to have more doctor visits per capita.

Since we are only looking at median age and doctor visits here, one could argue that GDP per capita might be a confounder, influencing both. However, the previous OLS regression demonstrated that even when GDP was included in the model, this relationship remained strong and statistically significant.

This suggests that median age is a key factor in explaining differences in doctor visits across countries, independent of GDP.

GDP → Apple consumption

While not directly related to doctor visits, an interesting secondary finding emerges when looking at the relationship between GDP per capita and apple consumption.

One possible explanation is that wealthier countries have better access to fresh products. Another possibility is that climate and geography play a role, so it could be that many high-GDP countries are located in regions with strong apple production, making apples more available and affordable. 

Of course, other factors could be influencing this relationship, but we won’t dig deeper here.

The scatterplot shows a positive correlation: as GDP per capita increases, apple consumption also tends to rise. However, compared to median age and doctor visits, this trend is weaker, with more variation in the data.

The OLS regression confirms the relationship: with a coefficient of 0.2257 for GDP per capita, we can estimate an increase of around 0.23 kg in apple consumption per capita for each additional $1,000 of GDP per capita.

The p-value of 3.8% allows us to reject the null hypothesis, so the relationship is statistically significant. However, the R² value (0.145) is relatively low: while GDP explains some of the variation in apple consumption, many other factors likely contribute.

Conclusion

The saying goes:

“An apple a day keeps the doctor away,”

But after putting this myth to the test with real-world data, the results do not support it. Across multiple years, the findings were consistent: no meaningful relationship between apple consumption and doctor visits emerged, even after controlling for confounders. It seems that apples alone aren't enough to keep the doctor away.

However, this doesn’t completely disprove the idea that eating more apples could reduce doctor visits. Observational data, no matter how well we control for confounders, can never fully prove or disprove causality. 

To get a more statistically accurate answer, and to rule out all possible confounders at a level of granularity that could be actionable for an individual, we would need to conduct an A/B test.
In such an experiment, participants would be randomly assigned to two groups: for example, one eating a fixed amount of apples daily and the other avoiding apples. By comparing doctor visits over time between these two groups, we could determine whether any difference arises, providing stronger evidence of a causal effect.

For obvious reasons, I chose not to go that route. Recruiting enough participants would be expensive, and forcing people to avoid apples for science is ethically questionable.

However, we did find some interesting patterns. The strongest predictor of doctor visits wasn't apple consumption, but median age: the older a country's population, the more often people see a doctor.

Meanwhile, GDP showed a mild connection to apple consumption, possibly because wealthier countries have better access to fresh produce, or because apple-growing regions tend to be more developed.

So, while we can’t confirm the original myth, we can offer a less poetic, but data-backed version:

“A young age keeps the doctor away.”

If you enjoyed this analysis and want to connect, you can find me on LinkedIn

The full analysis is available in this notebook on GitHub.


Data Sources

Fruit Consumption: Food and Agriculture Organization of the United Nations (2023) — with major processing by Our World in Data. “Per capita consumption of apples — FAO” [dataset]. Food and Agriculture Organization of the United Nations, “Food Balances: Food Balances (-2013, old methodology and population)”; Food and Agriculture Organization of the United Nations, “Food Balances: Food Balances (2010-)” [original data]. Licensed under CC BY 4.0.

Doctor Visits: OECD (2024), Consultations, URL (accessed on January 22, 2025). Licensed under CC BY 4.0.

GDP per Capita: World Bank (2025) — with minor processing by Our World in Data. “GDP per capita — World Bank — In constant 2021 international $” [dataset]. World Bank, “World Bank World Development Indicators” [original data]. Retrieved January 31, 2025 from https://ourworldindata.org/grapher/gdp-per-capita-worldbank. Licensed under CC BY 4.0.

Median Age: UN, World Population Prospects (2024) — processed by Our World in Data. “Median age, medium projection — UN WPP” [dataset]. United Nations, “World Population Prospects” [original data]. Licensed under CC BY 4.0.


All images, unless otherwise noted, are by the author.

The post Myths vs. Data: Does an Apple a Day Keep the Doctor Away? appeared first on Towards Data Science.

]]>
How to do Date calculations in DAX https://towardsdatascience.com/how-to-do-date-calculations-in-dax-95e792f65e5e/ Tue, 28 Jan 2025 17:01:59 +0000 https://towardsdatascience.com/how-to-do-date-calculations-in-dax-95e792f65e5e/ Moving back and forth in time is a common task for Time Intelligence in DAX. Let's take a deeper look on how DATEADD() works.

The post How to do Date calculations in DAX appeared first on Towards Data Science.

]]>
How to Do Date Calculations in DAX

Moving back and forth in time is a common task for Time Intelligence calculations in DAX. We have some excellent functions; one of the most useful is DATEADD. Let’s take a detailed look at it.

Photo by Towfiqu barbhuiya on Unsplash
Photo by Towfiqu barbhuiya on Unsplash

What is it for?

I have shown the usefulness of the DATEADD() function in one of my past articles:

Explore variants of Time Intelligence in DAX

But sometimes, we want to do other stuff.

For example, we have a fixed date and want to move it back or forward in time.

If we want to do it for some days, it’s very easy:

For example, I want to move the date 2025/01/15 10 days into the future:

Figure 1 - Moving a date ten days into the future (Figure by the Author)
Figure 1 – Moving a date ten days into the future (Figure by the Author)

I must add the curly brackets as EVALUATE expects a table, and these brackets create a table from the output of the expression.

But what if we want to do this for months, quarters, years, or weeks?

Let's look at some use cases.

The DATEADD() function seems the obvious choice for such tasks.

Before working with DAX, I worked with T-SQL for SQL Server, where the T-SQL DATEADD() function can be used for precisely this.

OK, let's try going back two months, starting from 2025/01/25:

Figure 2 - Error message when we try using DATEADD() with a simple Date variable (Figure by the Author)
Figure 2 – Error message when we try using DATEADD() with a simple Date variable (Figure by the Author)

Even though it looks intuitive, this doesn't work, because I passed a scalar date variable instead of a column of dates.

The documentation explains that the first parameter must be a column containing dates.

Therefore, we must persuade the engine to think that we pass a date column to the function.

One way to do it is by using TREATAS():

DEFINE
 VAR MyDate = DATE(2025, 1, 25)

EVALUATE
 DATEADD(
  TREATAS({MyDate}
   ,'Date'[Date]
   )
  ,-2
  ,MONTH)

Notice that we don’t need the curly brackets, as DATEADD() returns a table with one column and, in this case, one row.

When executing this query, DATEADD() thinks that it receives a value from the [Date] column from the Date table as a parameter and performs the calculation as expected:

Figure 3 - Using DATEADD() with a fixed date and TREATAS() (Figure by the Author)
Figure 3 – Using DATEADD() with a fixed date and TREATAS() (Figure by the Author)

Another way to perform this task is by using the Date table and passing the variable holding the date as a filter:

DEFINE
    VAR MyDate = DATE(2025, 1, 25)

EVALUATE
    CALCULATETABLE(
                DATEADD('Date'[Date]
                        ,-2
                        ,MONTH)
                ,'Date'[Date] = MyDate
                )

The result is the same as before, and there is no notable difference in the execution time (between 7 and 10 ms).

Again, looking at the documentation on DAX.guide, it uses the second approach for this job.

We can change the second expression to get one single value:

DEFINE
    VAR MyDate = DATE(2025, 1, 25)

    VAR Result =
        CALCULATE(DATEADD('Date'[Date]
                        ,-2
                        ,MONTH)
                ,'Date'[Date] = MyDate
                )

EVALUATE
    { Result }

These three approaches are functionally almost identical. The difference is that their outputs can be used differently, as the next two examples show.

For example, suppose we want to calculate the Online Sales for the day two months before 2025/01/25.

Here is the first approach:

DEFINE
    VAR MyDate = DATE(2025, 1, 25)

    VAR TargetDate =
                DATEADD(
                        TREATAS({MyDate}
                            ,'Date'[Date]
                            )
                        ,-2
                        ,MONTH)

EVALUATE
{
    CALCULATE([Sum Online Sales]
            ,TargetDate
            )
            }

And here is the second one:

DEFINE
    VAR MyDate = DATE(2025, 1, 25)

    VAR TargetDate =
                CALCULATE(DATEADD('Date'[Date]
                        ,-2
                        ,MONTH)
                ,'Date'[Date] = MyDate
                )

EVALUATE
{
    CALCULATE([Sum Online Sales]
            ,'Date'[Date] = TargetDate
            )
            }

The result of the expression is not essential here.

But do you notice the difference?

The first expression uses DATEADD() to get a table aligned with the [Date] column of the Date table. I can use this table directly as a filter in CALCULATE(): it keeps the lineage (the origin) of the column and can therefore be applied directly to the data model.

The second expression generates a single value and loses the lineage to the Date table.

Therefore, we must use the = operator to filter the Date table.

Again, both approaches return identical results and are equivalent performance-wise.

What about weeks?

Calculating with weeks is not as easy as using DATEADD(), as this function supports only Days, Months, Quarters, and Years.

Fortunately, weeks always have 7 days. Therefore, we can move back and forth by multiplying 7 days by the number of weeks we want to move the start date.

For example, I want to go back three weeks:

7 x 3 = 21 days: 2025/01/25 to 2025/01/04:

Figure 4 - Going back three weeks by subtracting 21 days from the starting date (Figure by the Author)
Figure 4 – Going back three weeks by subtracting 21 days from the starting date (Figure by the Author)

A more elaborate pattern for the same task is this:

DEFINE
    VAR MyDate = DATE(2025, 1, 25)

    VAR DayOfWeek = CALCULATE(MIN('Date'[DayOfWeek])
                                ,'Date'[Date] = MyDate
                                )

    VAR WeekIndex = CALCULATE(MIN('Date'[WeekIndex])
                                ,'Date'[Date] = MyDate)

EVALUATE
    { CALCULATE(MIN('Date'[Date])
                ,REMOVEFILTERS('Date')
                ,'Date'[DayOfWeek] = DayOfWeek
                ,'Date'[WeekIndex] = WeekIndex - 3
                )
        }

Here, I leverage two columns I commonly add to my Date tables:

  • Day of Week
  • WeekIndex: This one counts the number of weeks starting from the week of the last data refresh. Today, it’s zero, the week before the current is -1, and the week after the current week is 1, etc.

Now, I can use them to perform a dynamic calculation based on these columns.

Interestingly, this version of the query is not slower than the previous one.

The result is still the same:

Figure 5 - More complex approach to go back three weeks (Figure by the Author)
Figure 5 – More complex approach to go back three weeks (Figure by the Author)

But why should I use a more complex approach when subtracting some days leads to the same result?

Well, it depends on what you want to achieve. I have encountered scenarios where a more complex approach opens up more possibilities and can be used more generically.

In this specific case, the first approach works very well.

Conclusion

Manipulating dates is one of the main tasks when analyzing or visualizing data.

The DATEADD() function is a crucial part of the DAX language’s toolset, and it’s important to know how it works and how it can be used.

There are three main points with this function:

  • What the input date parameter must look like
  • How to manipulate a date so it works with DATEADD()
  • How to use the output of DATEADD() in measures

I tried to show you all of them in this piece. I hope that you found this helpful.

I will show you more complex scenarios in an upcoming piece, where using such additional columns can save you much time – both development and execution time.

In the meantime, you can start incorporating these techniques into your work.

Photo by Debby Hudson on Unsplash
Photo by Debby Hudson on Unsplash

References

Here, the article referenced at the start of this one about exploring variants of Time Intelligence:

Explore variants of Time Intelligence in DAX

You can find more information about my date-table here:

3 Ways to Improve your reporting with an expanded date table

Read this piece to learn how to extract performance data in DAX-Studio and how to interpret it:

How to get performance data from Power BI with DAX Studio

A list of all Time Intelligence functions in DAX.guide:

Time Intelligence – DAX Guide

Like in my previous articles, I use the Contoso sample dataset. You can download the ContosoRetailDW Dataset for free from Microsoft here.

The Contoso Data can be freely used under the MIT License, as described here.

I changed the dataset to shift the data to contemporary dates.


The post How to do Date calculations in DAX appeared first on Towards Data Science.

]]>
Water Cooler Small Talk, Ep 7: Anscombe’s Quartet and the Datasaurus https://towardsdatascience.com/water-cooler-small-talk-ep-7-anscombes-quartet-and-the-datasaurus-09a143400320/ Mon, 27 Jan 2025 17:52:11 +0000 https://towardsdatascience.com/water-cooler-small-talk-ep-7-anscombes-quartet-and-the-datasaurus-09a143400320/ Why descriptive statistics aren't enough and plotting your data is always essential

The post Water Cooler Small Talk, Ep 7: Anscombe’s Quartet and the Datasaurus appeared first on Towards Data Science.

]]>
DATA ANALYSIS
Image created by the author using GPT-4
Image created by the author using GPT-4

Ever heard a co-worker confidently declaring something like "The longer I lose at roulette, the closer I am to winning"? Or had a boss who demanded that you not overcomplicate things and provide "just one number", ignoring your attempts to explain why such a number doesn't exist? Maybe you've even shared a birthday with a colleague, and everyone in the office commented on what a bizarre cosmic coincidence it must be.

These moments are typical examples of water cooler small talk – a special kind of small talk, thriving around break rooms, coffee machines, and of course, water coolers. It is where employees share all kinds of corporate gossip, myths, and legends, inaccurate scientific opinions, outrageous personal anecdotes, or outright lies. Anything goes. So, in my Water Cooler Small Talk posts, I discuss strange and usually scientifically invalid opinions I have overheard in office, and explore what’s really going on.


So, here’s the water cooler opinion of today’s post:

The descriptive statistics match perfectly, so the datasets are basically the same. No need to dig deeper.

Sure, that might make some sense – if you've never taken a statistics class. 🙃 But in reality, very different datasets can have the same descriptive statistics, such as the mean, variance, or standard deviation. In other words, while descriptive statistics describe a dataset, they don't define it. One always needs to plot the data in order to get the full picture.

Anyways, one of the first people to demonstrate this was Frank Anscombe with his now-famous Anscombe's quartet.


🍨 DataCream is a newsletter offering data-driven articles and perspectives on data, tech, AI, and ML. If you are interested in these topics subscribe here.

Maria Mouschoutzi, PhD – Medium


What about Anscombe’s quartet?

So, Anscombe's quartet is a set of four datasets with almost the same descriptive statistics that nevertheless look very different when visualized. The dataset is in the public domain and is conveniently available in the seaborn library, allowing us to play around and explore what is happening. We can easily load it by:

import seaborn as sns
data = sns.load_dataset("anscombe")

Then, we can create a visualization of the four datasets using matplotlib:

import matplotlib.pyplot as plt

sns.relplot(
    data=data,
    x="x", y="y",
    col="dataset", hue="dataset",
    kind="scatter",
    palette="deep",
    height=4, aspect=1
)
plt.suptitle("Anscombe's Quartet", y=1.05)
plt.show()
Image created by the author in Python using Anscombe's Quartet data
Image created by the author in Python using Anscombe’s Quartet data

In particular, all four datasets consist of 11 (x, y) points:

  • dataset I seems to be a simple linear relationship
  • dataset II clearly is a parabolic curve
  • dataset III clearly is a linear relationship, but with one large outlier
  • and dataset IV is a vertical line, but again, distorted by one large outlier

All four datasets are very different from one another; however, when we calculate their descriptive statistics, we are in for a plot twist – all four share almost identical values. More specifically, we can calculate the descriptive statistics as follows:

import pandas as pd
from scipy.stats import linregress

def calculate_statistics(group):
    mean_x = group['x'].mean()
    var_x = group['x'].var()
    mean_y = group['y'].mean()
    var_y = group['y'].var()
    correlation = group[['x', 'y']].corr().iloc[0, 1]
    slope, intercept, r_value, p_value, std_err = linregress(group['x'], group['y'])
    r_squared = r_value ** 2

    return pd.Series({
        "Mean of x": mean_x,
        "Variance of x": var_x,
        "Mean of y": mean_y,
        "Variance of y": var_y,
        "Correlation of x and y": correlation,
        "Linear regression line": f"y = {intercept:.2f} + {slope:.2f}x",
        "R2 of regression line": r_squared
    })

statistics = data.groupby("dataset").apply(calculate_statistics)

for dataset, stats in statistics.iterrows():
    print(f"Dataset: {dataset}")
    for key, value in stats.items():
        print(f"  {key}: {value}")
    print()
Image created by the author in Python using Anscombe's Quartet data
Image created by the author in Python using Anscombe’s Quartet data

Identical numbers, wildly different visuals. 🤷‍♀️ Crazy, right? This is why it is so important to always visualize the data, no matter what the numbers suggest. In Anscombe’s own words, a common but misguided belief among statisticians is that "numerical calculations are exact, but graphs are rough".

Anscombe's quartet is such a powerful example for highlighting the importance of data visualization because the visualizations are not just different, but clearly different – with just one glance, one can see that the datasets are completely distinct from each other. In other words, the visualizations immediately provide us with meaningful information that the descriptive statistics fail to capture.


No one knows exactly how Anscombe came up with these datasets in the first place, but this 2017 study presents a method for creating such datasets from scratch. This makes it possible to produce endless examples of very different datasets with almost identical descriptive statistics – my favorite by far is the Datasaurus dozen 🦖. Similarly to Anscombe's quartet, the Datasaurus dozen includes thirteen – a dozen plus the Datasaurus – very different datasets with almost identical descriptive statistics.

The datasets are available in the datasauRus R package under the MIT License, which permits commercial use.

Image created by the author in R using the Datasaurus dozen datasets from the datasauRus R package
Image created by the author in R using the Datasaurus dozen datasets from the datasauRus R package
Image created by the author in R using the Datasaurus dozen datasets from the datasauRus R package
Image created by the author in R using the Datasaurus dozen datasets from the datasauRus R package

Again, all of these datasets have almost identical descriptive statistics; however, they are strikingly different. Clearly, skipping the visualization step would result in a terrible loss of information.


When a plot could have helped

Anscombe's quartet or the Datasaurus may seem like fun, simplistic examples, aiming to teach us about the importance of data visualization. But don't be fooled; the lesson – the importance of data visualization – is not a theoretical concept, but a very real one, with tangible implications in the real world.

1. 2008 financial crisis

Take for instance the 2008 financial crisis. The models used by banks and investment firms back then relied heavily on aggregate risk metrics like Value at Risk (VaR). More specifically, VaR provides an estimate of the potential maximum loss of a portfolio over a given time frame at a specified confidence level (e.g. 95% or 99%). For instance, a 99% VaR of $100 means that under normal market conditions, losses are expected to exceed $100 only 1% of the time.

VaR calculations are based on a number of assumptions, such as the often-used assumption that market returns are normally distributed. Nevertheless, as you may have figured out by now, 'normally distributed' is usually a rather sloppy approximation of the real world – real life is in most cases much more nuanced and not so straightforward.

Anyways, in the case of the 2008 financial crisis, the assumed 'normal distribution' failed to represent reality and account for the 'fat tails' appearing in the actual data on market returns. A 'fat tail' in a distribution represents a higher-than-expected (relative to the normal distribution) probability of extreme events – here, extreme losses. In other words, extreme losses occurred much more frequently than the models assumed.

Image created by the author in Python with data generated using scipy.stats
Image created by the author in Python with data generated using scipy.stats
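As a rough, self-contained illustration of the idea (not a reproduction of the figure above), the sketch below compares the probability of an extreme negative return under a standard normal distribution and under a fat-tailed Student's t distribution; the threshold and the degrees of freedom are arbitrary choices for demonstration.

# Compare tail probabilities of a normal vs. a fat-tailed t distribution (illustrative values only)
from scipy import stats

threshold = -4.0  # an "extreme loss" threshold in standardized units

p_normal = stats.norm.cdf(threshold)       # probability of a return below -4 under the normal
p_fat_tail = stats.t.cdf(threshold, df=3)  # same probability under a Student's t with 3 degrees of freedom

print(f"P(return < -4), normal:  {p_normal:.6f}")
print(f"P(return < -4), t(df=3): {p_fat_tail:.6f}")
# The fat-tailed distribution assigns a far higher probability to extreme losses --
# exactly the kind of risk that a single aggregate number can hide.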

Nonetheless, focusing on aggregate metrics like VaR does not make such insights visible. A detailed visualization of historical returns, on the contrary, can provide deeper insight into what is really happening. Individuals and institutions that relied on visualizations and detailed analysis did identify the risks.

Ultimately, VaR is a great example of how people tend to feel confident and secure when presented with a number – a calculation – irrespective of whether that number is effectively pulled out of thin air. At the core of the 2008 financial crisis, overreliance on quantitative models led institutions to greatly underestimate risks. Of course, an abundance of other factors contributed to the outcome – for instance, the systemic risk of interrelated portfolios – but blind trust in aggregate numbers remains a strong factor.

2. Challenger space shuttle disaster

Even in the Challenger space shuttle disaster, some data visualization could have helped. For reference, on January 28, 1986, the Space Shuttle Challenger broke apart 73 seconds after takeoff due to the failure of the O-rings sealing the joints between the sections of the rocket booster. The O-rings failed because of the extremely low temperatures (2°C / 35°F, reaching -8°C / 17°F the night before). For a long period prior to the launch, engineers at Morton Thiokol – the NASA subcontractor that manufactured the O-rings – had been concerned about the performance of the O-rings at low temperatures, and they even explicitly recommended against the launch. According to the accident report, they presented the relevant data to their company's management as it appears in the following picture. Nonetheless, their concerns were dismissed by the company's management, which eventually recommended to NASA that it was OK to launch.

History of O-ring damage and temperatures of various Solid Rocket Motors (SRMs), Source: https://www.nasa.gov/history/rogersrep/v5p896.htm, Public Domain
History of O-ring damage and temperatures of various Solid Rocket Motors (SRMs), Source: https://www.nasa.gov/history/rogersrep/v5p896.htm, Public Domain

This illustration of O-ring damage in relation to temperature contains a lot of data; however, its point is not clearly communicated. Edward Tufte, in his book 'Visual Explanations', argues that a simple, clear graphical representation of the correlation between temperature and O-ring failure could have gone a long way, possibly even resulting in a different decision about approving the launch. Allegedly, Morton Thiokol also presented NASA with a scatter plot of O-ring failure incidents in relation to temperature – but the plot omitted the flights with no failures, so it didn't make much sense. A better plot, also including the flights with no O-ring incidents, is presented in the 'Report to the President by the Presidential Commission on the Space Shuttle Challenger Accident'. This is a much more useful plot, allowing one to immediately suspect the correlation between O-ring failure and temperature. Tufte's critique emphasizes how this omission contributed to poor decision-making.

Scatter plots of number of O-ring failure incidents per mission versus temperature, Source: _Report to the President By the Presidential Commission On the Space Shuttle Challenger Accident_, Public Domain
Scatter plots of number of O-ring failure incidents per mission versus temperature, Source: _Report to the President By the Presidential Commission On the Space Shuttle Challenger Accident_, Public Domain

Be that as it may, with today’s historical and technological distance from those events, it’s easy to find fault in cases of the past. Nonetheless, it is heartbreaking to realize that in many cases disaster can be foreseen, but cannot be effectively communicated to decision makers in order to take the right decisions.

3. 1854 Broad Street cholera outbreak

A remarkable use case of effective data visualization is the case of 1854 Broad Street cholera outbreak. In particular, this outbreak occurred in Soho, London, near Broad Street, during the 1846–1860 cholera pandemic and killed 616 people.

During this outbreak, Dr. John Snow plotted the cholera cases on a map and realized that they clustered around a water pump on Broad Street. This led to identifying the Broad Street water pump as contaminated and essentially the source of the outbreak. The simple act of removing the handle of the water pump led to the cessation of the epidemic.

Dr. John Snow's cholera map, Source: https://commons.wikimedia.org/wiki/File:Snow-cholera-map-1.jpg?uselang=en#Licensing, Public Domain
Dr. John Snow’s cholera map, Source: https://commons.wikimedia.org/wiki/File:Snow-cholera-map-1.jpg?uselang=en#Licensing, Public Domain

This was a huge discovery that not only stopped the epidemic, but also revolutionized the understanding of disease transmission and epidemiology. Undeniably, a great example of how a simple visualization can go a long way and unlock impactful insights.


On my mind

Ultimately, the lesson from datasets like Anscombe's quartet or the Datasaurus dozen is that our deep-rooted notion that 'numerical calculations are exact, but graphs are rough' is flawed. Both visualization and numerical calculation are essential for extracting meaningful insights from data. In the end, interpreting data in a meaningful way may be more of an art than an exact science, as there is no single, one-size-fits-all calculation approach that should be followed. Data visualization is much more than pretty pictures – it is a necessity for avoiding misinterpretation and poor decisions in Data Analysis.

…'cause a Datasaurus may be lurking somewhere in the data.

🦖


Data problem? 🍨 DataCream can help!

  • Insights: Unlocking actionable insights with customized analyses to fuel strategic growth.
  • Dashboards: Building real-time, visually compelling dashboards for informed decision-making.

Got an interesting data project? Need data-centric editorial content or a fancy data visual? Drop me an email at 💌 maria.mouschoutzi@gmail.com or contact me on 💼 LinkedIn.


💖 Loved this post?

Let’s be friends! Join me on 💌 Substack 💼 LinkedIn ☕ Buy me a coffee!

or, take a look at my other Water Cooler Small Talks:

Water Cooler Small Talk: Benford’s Law

Water Cooler Small Talk: Simpson’s Paradox

The post Water Cooler Small Talk, Ep 7: Anscombe’s Quartet and the Datasaurus appeared first on Towards Data Science.

]]>
Does It Matter That Online Experiments Interact? https://towardsdatascience.com/does-it-matter-that-online-experiments-interact-9c4012b75fbd/ Fri, 24 Jan 2025 12:02:03 +0000 https://towardsdatascience.com/does-it-matter-that-online-experiments-interact-9c4012b75fbd/ What interactions do, why they are just like any other change in the environment post-experiment, and some reassurance

The post Does It Matter That Online Experiments Interact? appeared first on Towards Data Science.

]]>
Photo by Uriel Soberanes on Unsplash
Photo by Uriel Soberanes on Unsplash

Experiments do not run one at a time. At any moment, hundreds to thousands of experiments run on a mature website. The question comes up: what if these experiments interact with each other? Is that a problem? As with many interesting questions, the answer is "yes and no." Read on to get even more definite, actionable, entirely clear, and confident takes like that!

Definition: Experiments interact when the treatment effect for one experiment depends on which variant of another experiment the unit gets assigned to.

For example, suppose we have an experiment testing a new search model and another testing a new recommendation model, powering a "people also bought" module. Both experiments are ultimately about helping customers find what they want to buy. Units assigned to the better recommendation algorithm may have a smaller treatment effect in the search experiment because they are less likely to be influenced by the search algorithm: they made their purchase because of the better recommendation.

Some empirical evidence suggests that typical interaction effects are small. Maybe you don’t find this particularly comforting. I’m not sure I do, either. After all, the size of interaction effects depends on the experiments we run. For your particular organization, experiments might interact more or less. It might be the case that interaction effects are larger in your context than at the companies typically profiled in these types of analyses.

So, this blog post is not an empirical argument. It’s theoretical. That means it includes math. So it goes. We will try to understand the issues with interactions with an explicit model without reference to a particular company’s data. Even if interaction effects are relatively large, we’ll find that they rarely matter for decision-making. Interaction effects must be massive and have a peculiar pattern to affect which experiment wins. The point of the blog is to bring you peace of mind.

Interactions Aren’t So Special, And They Aren’t So Bad

Suppose we have two A/B experiments. Let Z = 1 indicate treatment in the first experiment and W = 1 indicate treatment in the second experiment. Y is the metric of interest.

The treatment effect in experiment 1 is:

Let’s decompose these terms to look at how interaction impacts the treatment effect.

Bucketing for one randomized experiment is independent of bucketing in another randomized experiment, so:
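E[Y | Z=z] = E[Y | Z=z, W=1] * pr(W=1) + E[Y | Z=z, W=0] * pr(W=0)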

So, the treatment effect is:
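TE = (E[Y | Z=1, W=1] - E[Y | Z=0, W=1]) * pr(W=1) + (E[Y | Z=1, W=0] - E[Y | Z=0, W=0]) * pr(W=0)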

Or, more succinctly, the treatment effect is the weighted average of the treatment effect within the W=1 and W=0 populations:
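TE = TE(W=1) * pr(W=1) + TE(W=0) * pr(W=0)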

One of the great things about just writing the math down is that it makes our problem concrete. We can see exactly the form the bias from interaction will take and what will determine its size.

The problem is this: only W = 1 or W = 0 will launch after the second experiment ends. So, the environment during the first experiment will not be the same as the environment after it. This introduces the following bias in the treatment effect:

Suppose W = w launches, then the post-experiment treatment effect for the first experiment, TE(W=w), is mismeasured by the experiment treatment effect, TE, leading to the bias:
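TE - TE(W=w) = pr(W=1-w) * (TE(W=1-w) - TE(W=w))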

If there is an interaction between the second experiment and the first, then TE(W=1-w) – TE(W=w) != 0, so there is a bias.

So, yes, interactions cause a bias. The bias is directly proportional to the size of the interaction effect.

But interactions are not special. Anything that differs between the experiment's environment and the future environment and that affects the treatment effect leads to a bias of the same form. Does your product have seasonal demand? Was there a large supply shock? Did inflation rise sharply? What about the butterflies in Korea? Did they flap their wings?

Online Experiments are not Laboratory Experiments. We cannot control the environment. The economy is not under our control (sadly). We always face biases like this.

So, Online Experiments are not about estimating treatment effects that hold in perpetuity. They are about making decisions. Is A better than B? That answer is unlikely to change because of an interaction effect for the same reason that we don’t usually worry about it flipping because we ran the experiment in March instead of some other month of the year.

For interactions to matter for decision-making, we need, say, TE ≥ 0 (so we would launch B in the first experiment) and TE(W=w) < 0 (but we should have launched A given what happened in the second experiment).

TE ≥ 0 if and only if:
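TE(W=w) * pr(W=w) + TE(W=1-w) * pr(W=1-w) ≥ 0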

Taking the typical allocation pr(W=w) = 0.50, this means:
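TE(W=1-w) ≥ -TE(W=w)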

Because TE(W=w) < 0, this can only be true if TE(W=1-w) > 0. Which makes sense. For interactions to be a problem for decision-making, the interaction effect has to be large enough that an experiment that is negative under one treatment is positive under the other.

The interaction effect has to be extreme at typical 50–50 allocations. If the treatment effect is -$2 per unit under the variant that launches, the effect must be at least +$2 per unit under the other variant for the overall estimate to come out non-negative and mislead us. To make the wrong decision from the standard treatment effect, we'd have to be cursed with massive interaction effects that flip the sign of the treatment effect while maintaining its magnitude!
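To make the magnitude argument concrete, here is a small simulation sketch with made-up effect sizes (two independent 50–50 experiments and a sizable interaction); it is an illustration, not data from any real experiment.

# Simulate two independent 50/50 experiments with an interaction (made-up numbers)
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

z = rng.integers(0, 2, n)  # assignment in the first experiment (the one we analyze)
w = rng.integers(0, 2, n)  # assignment in the second, interacting experiment

# Hypothetical outcome: the effect of Z is +2 when W=0 but only +0.5 when W=1
y = 10 + 2.0 * z * (w == 0) + 0.5 * z * (w == 1) + rng.normal(0, 1, n)

te = y[z == 1].mean() - y[z == 0].mean()
te_w0 = y[(z == 1) & (w == 0)].mean() - y[(z == 0) & (w == 0)].mean()
te_w1 = y[(z == 1) & (w == 1)].mean() - y[(z == 0) & (w == 1)].mean()

print(f"TE overall: {te:.2f}")
print(f"TE(W=0): {te_w0:.2f}, TE(W=1): {te_w1:.2f}")
print(f"Weighted average: {0.5 * te_w0 + 0.5 * te_w1:.2f}")
# Despite a sizable interaction, every estimate is positive, so the launch decision is unchanged.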

This is why we’re not concerned about interactions and all those other factors (seasonality, etc.) that we can’t keep the same during and after the experiment. The change in environment would have to radically alter the user’s experience of the feature. It probably doesn’t.

It’s always a good sign when your final take includes "probably."


Thanks for reading!

Zach

Connect at: https://linkedin.com/in/zlflynn

The post Does It Matter That Online Experiments Interact? appeared first on Towards Data Science.

]]>
Behind the Scenes of a Successful Data Analytics Project https://towardsdatascience.com/behind-the-scenes-of-a-successful-data-analytics-project-8d45fb33e9c5/ Thu, 23 Jan 2025 11:32:03 +0000 https://towardsdatascience.com/behind-the-scenes-of-a-successful-data-analytics-project-8d45fb33e9c5/ Learn the steps to approach any data analytics project like a pro.

The post Behind the Scenes of a Successful Data Analytics Project appeared first on Towards Data Science.

]]>
Learn the steps to approach any data analytics project like a pro.

Having worked as a data analyst for a while and tackled numerous projects, I can say that even though each project is unique, there is always a proven way to approach it.

Today, I’ll share with you the steps I usually take when working on a data project so you can follow them too.

Step 1: Define the Problem and Objectives

You cannot solve a problem or answer a business question if you do not understand what it is and how it fits into the bigger picture.

No matter how big or complex the task is, you must always understand what your business stakeholders are trying to achieve before diving into data. This is the part where you ask many questions, and before you get at least some answers, you are not diving into any data.

I learned this the hard way early in my career. Back then, when a vague request like "We saw visitors drop this month. Can you check why?" came, I would immediately jump into work. But every single time, I wasted hours trying to understand the real problem because I didn’t ask the right questions upfront.

I didn’t ask for context:

  • Why did the team need the traffic to be high?
  • What was the chosen strategy (brand awareness vs demand generation)?
  • What were the chosen tactics (paid search vs programmatic)?
  • What were the investments?

I didn’t ask stakeholders what they would do after receiving the data.

  • Did they want to increase signups and sales?
  • Were they aware that website visits may look impressive but not necessarily correlate with business outcomes and that focusing on metrics such as conversion rate would have a much better effect?

This initial first step is important because it affects everything else: from the data sources you will use to retrieve the data to the metrics you will analyze, the format you will use to present the insights, and the timeline you need to be ready for.

So don't ever skip this step or settle for only a partial understanding, hoping you will figure it out along the way.

Step 2: Set Expectations

Once you’ve defined the problem, it’s time to set expectations.

Stakeholders don’t always realize how much time and effort goes into collecting and analyzing data. You are among the few people in the organization who can find the answers, so you receive many requests. That is why you need to prioritize and set expectations.

Understanding the problem, its complexity, and how it aligns with the organization's goals (from Step 1) helps you prioritize and communicate to stakeholders when the task can be done – or why you will not be prioritizing it right now. You want to focus on the most impactful work.

A colleague of mine took a smart approach. They required stakeholders to fill out a questionnaire when submitting a task. This questionnaire included various questions about the problem description, timeline, etc., and it also asked, "What will you do with the insights?". This approach not only gathered all the necessary information upfront, eliminating the need for back-and-forth communication, but it also made stakeholders think twice before submitting another "Can you quickly look at…?" request. Genius, right?

Step 3: Prepare the Data

Now that you’ve defined the problem and set expectations, it’s time to prepare the data.

This is the step where you ask yourself:

  • Do I have all the data available, or do I need to collect it first?
  • Do I have all the domain knowledge needed, or do I need to do the research?
  • Do I have documentation available for the associated datasets? (If there is no documentation, you may need to contact the data owners for clarification.)

Another critical question to answer at this step is: "What metrics should I measure?"

I always align my metrics with the business objectives. For instance, if the goal is to increase brand awareness, I prioritize metrics like impressions, branded search volume, direct traffic, and reach. If the objective is to drive sales, I focus on conversion rates, average order value, and customer acquisition cost. I also explore secondary metrics (demographics, device usage, customer behavior) to ensure my analysis is comprehensive and paints a complete picture.

Step 4: Explore the Data

Now comes the fun part – Exploratory Data Analysis (EDA). I love this part because it is where all the magic happens. Like a detective, you review the evidence, investigate the case, formulate hypotheses, and look for hidden patterns.

As you explore the data, you:

  • Ask better questions. As you become more familiar with data, you can approach data owners with concrete questions, making you look competent, knowledgeable, and confident in the eyes of your colleagues.
  • Innovate with feature engineering. You understand whether or not you need to create new features from existing ones. This helps to better capture the underlying patterns in the data that would otherwise go unnoticed.
  • Assess data quality. You check the number of rows of data and whether there are any anomalies, such as outliers, missing, or duplicate data.

If the exploration step shows that the data needs to be cleaned (and believe me, it almost always does), you proceed with data cleanup.

Step 5: Clean the Data

No matter how polished a dataset looks at first glance, never assume it’s clean. Data quality issues are more common than not.

The most common data quality problems you need to fix are:

1. Missing values:

The way you will handle missing data differs from case to case.

  • If it is due to errors in data entry, collaborate with the relevant teams to correct it.
  • If the original data cannot be recovered, you need to either remove missing values or impute them using industry benchmarks, calculating the mean or median, or applying machine learning methods.
  • If missing values represent a small portion of the dataset and won’t significantly impact the analysis, it is usually OK to remove them.

2. Inconsistent data: Check data for inconsistent data formats and standardize them.

3. Duplicate records: Identify and remove duplicate records to avoid skewing results.

4. Outliers or errors in data: Check for outliers or errors in the data. Based on its context, decide whether to remove, fix, or keep it.
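To make these checks concrete, here is a minimal sketch in Python with pandas; the file name and column names (signups.csv, signup_date, channel, revenue) are hypothetical and stand in for whatever fields your dataset actually has.

# Illustrative data-quality checks on a hypothetical signups dataset
import pandas as pd

df = pd.read_csv("signups.csv")  # placeholder file name

# 1. Missing values: inspect first, then impute or drop depending on context
print(df.isna().sum())
df["revenue"] = df["revenue"].fillna(df["revenue"].median())  # example: median imputation
df = df.dropna(subset=["signup_date"])                        # example: drop rows missing a key field

# 2. Inconsistent data: standardize formats
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["channel"] = df["channel"].str.strip().str.lower()

# 3. Duplicate records: remove exact duplicates
df = df.drop_duplicates()

# 4. Outliers: flag values far outside the interquartile range for manual review
q1, q3 = df["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["revenue"] < q1 - 1.5 * iqr) | (df["revenue"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers to review")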

Once your data is cleaned, it is time to proceed to the analysis phase.

Step 6: Analyze the Data

This is where your detective work starts to pay off.

The key is to start with a very focused and specific question and not to be biased by having a hypothesis in mind. Using data to tell the story you or your colleagues want or expect to hear might be tempting, but you must let the dataset speak for itself.

I prefer to use the root-cause approach when analyzing data. For example, to answer the question, "Why do we see a drop in signups?" I would follow these 10 steps:

  1. Trend analysis: When does the drop happen for the first time? Is it seasonal?
  2. Traffic and conversion rates: Are fewer people visiting the site or fewer visitors signing up?
  3. Offer performance: Is the decline widespread or isolated to a particular offer?
  4. Website Performance: Are there any technical issues or broken links?
  5. User insights: Is the pattern specific to a particular segment or all users?
  6. User journey analysis: Are there any friction points where potential customers drop off?
  7. Campaign performance: Have any recent marketing campaigns or changes in strategy, budget allocation, or execution impacted effectiveness?
  8. Competitor activity: Have competitors launched a marketing campaign, new product, or feature? Have they changed their prices? Is there another reason that might be attracting customers away?
  9. Market trends: Are there market trends and changes in consumer behavior affecting sales in the industry?
  10. Customer feedback: Are customers dissatisfied with the offering? Did their needs change? Do we receive more support tickets?

Another important point is that the fastest answer and the most accurate answer aren't usually the same, and a lot depends on the context. That is why you need to collaborate with cross-functional teams and develop strong domain and industry knowledge.

Step 7: Build the Story

This step is my second favorite after data exploration because it is when all the data pieces fall into place, revealing a clear story and making perfect sense.

A common mistake here is including everything you found interesting instead of focusing on what the audience cares about. I get it. After working hard to get insights, it’s tempting to show off all the cool stuff you did. But if you overload your audience with data, you can further confuse them.

Don’t throw every data point at stakeholders; focus on what matters most to your audience instead. Think about their level of seniority, how familiar they are with the topic, their data literacy level, how much time they have, and whether you’re presenting in person or sending a report via email. This way you don’t waste anyone’s time – yours or theirs.

Lastly, always include actionable recommendations to stakeholders in your story. Your story should guide stakeholders on the next steps, ensuring that your insights drive meaningful decisions.

This brings us to the next point – sharing the insights and recommendations.

Step 8: Share the Insights

As a Data Analyst, you have the power to drive change. The secret lies in how you share data and tell the story.

First, consider the format your audience expects (see Step 1). Are you creating a dashboard, emailing a report, or presenting in person? Data storytelling becomes crucial for live presentations.

A great data story blends data, narrative, visuals, and practice:

Data: Focus only on insights with real business impact. If you can’t find a compelling reason why your insight will matter to the audience, if it’s unclear what they should do with the insights, or if the business impact is minimal, move it to the appendix.

Narrative: Ensure that your story has a clear structure.

  • Set the scene: What’s happening now?
  • Introduce the problem (to create some tension).
  • Reveal the key insights: What did you discover?
  • Finish with actionable steps: What should they do next?

This keeps your audience interested and makes your story memorable.

Visuals: The chart that helped you discover an insight isn’t always the best for presenting it. Highlight the key points and avoid clutter. For example, if you analyzed 10 categories but only 2 are critical, focus on those.

Practice: Practicing helps you feel more comfortable with the material. It also allows you to focus on important things like eye contact, hand gestures, and pacing. The more you practice, the more confident and credible you will appear.

You might think that once you've shared your insights, your job as a data analyst is done. In reality, you want people not only to hear what you've discovered but also to act on your insights. This leads us to the final step – making people act on your data.

Step 9: Make People Act on Your Data

Seeing my work have an impact and a chance to drive real change brings me the most satisfaction. So don’t let your hard work go to waste either.

  • Work with the relevant teams to set clear action steps, timelines, and success metrics.
  • Monitor progress and ensure your recommendations are being implemented.
  • Communicate regularly with cross-functional teams to track the impact of your recommendations.

I understand that this might feel like a lot right now, but please don’t worry. With practice, it will become easier, and before you know it, these steps will become second nature.

Good luck on your data analyst journey! You’re on the right track!


All images were created by the author using Canva.com.

The post Behind the Scenes of a Successful Data Analytics Project appeared first on Towards Data Science.

]]>
Data-Driven Decision Making with Sentiment Analysis in R https://towardsdatascience.com/data-driven-decision-making-with-sentiment-analysis-in-r-3d4a3b19a0db/ Tue, 21 Jan 2025 19:06:30 +0000 https://towardsdatascience.com/data-driven-decision-making-with-sentiment-analysis-in-r-3d4a3b19a0db/ Leveraging the Quanteda, Textstem and Sentimentr Packages to Extract Customer Insights and Enhance Business Strategy

The post Data-Driven Decision Making with Sentiment Analysis in R appeared first on Towards Data Science.

]]>
Leveraging the Quanteda, Textstem and Sentimentr Packages to Extract Customer Insights and Enhance Business Strategy
Image by Ralf Ruppert from Pixabay
Image by Ralf Ruppert from Pixabay

Should Businesses Really Hear Their Customers’ Voices?

In a rapidly evolving world that is getting more and more AI-driven every instant, businesses now need to constantly seek a competitive edge to remain sustainable. Companies may do this by regularly observing and analyzing customer opinions regarding their products and services. They achieve this by assessing comments from many sources, both online and offline. Identifying positive and negative trends in customer feedback allows them to fine-tune product features and design marketing strategies that meet the needs of customers.

Thus, customer opinions need to be discerned appropriately to find valuable insights that can help make informed business decisions.

Familiarizing Yourself with Sentiment Analysis

Sentiment analysis, a part of natural language processing (NLP), is a popular technique today because it studies people's opinions, sentiments, and emotions in any given text. By applying sentiment analysis to collected feedback, businesses can understand public opinion, monitor brand reputation, and improve customer experiences. This feedback contains valuable information, but its unstructured nature can make it difficult to analyze. By regularly analyzing customer sentiments, companies can identify their strengths and weaknesses, decide how to steer product development, and build better marketing strategies.

Powerful packages for sentiment analysis in both Python and R enable businesses to uncover valuable patterns, track sentiment trends, and make data-driven decisions. In this article, we will explore how to use different packages (Quanteda, Sentimentr and Textstem) to perform sentiment analysis on customer feedback by processing, analyzing, and visualizing textual data.

Adding a Real-world Context

For this tutorial, let us consider a fictional tech company, PhoneTech, that has recently launched a new smartphone in the budget segment for its young audience. Now, they want to know the public perception of their newly launched product and, hence, want to analyze the customer feedback from social media, online reviews, and customer surveys.

To achieve this, PhoneTech needs to use Sentiment Analysis to find product strengths and weaknesses, guide product development, and adjust marketing strategies. For example, PhoneTech has collected feedback from various platforms like social media (e.g., informal comments like "The camera is 🔥 but battery life 😒 . #Disappointed"), online reviews (e.g., semi-structured comments such as "Amazing build quality! ⭐⭐⭐⭐ Battery could last longer, though"), and customer surveys (e.g., structured responses to questions like "What do you like/dislike about the product?").

It’s important to note that customer feedback often includes informal language, emojis, and specific terms. We can use R packages to clean, tokenize, and analyze this data in order to turn raw text into actionable business insights.

Implementing Sentiment Analysis

Next, we'll build a model for sentiment analysis in R using the chosen quanteda package.

1. Importing necessary packages and dataset

For evaluating sentiments in a given dataset, we need several packages, including dplyr to manipulate the data of customer feedback entries, quanteda(License: GPL-3.0 license) for text analysis, and quanteda.textplots to create a word cloud. Additionally, tidytext (License: [MIT](https://cran.r-project.org/web/licenses/MIT) + file [LICENSE](https://cran.r-project.org/web/packages/sentimentr/LICENSE)) to use sentiment lexicons for scoring while ggplot2 will be used for data visualization, textstem (License: GPL-2) will aid in text stemming and lemmatization, sentimentr (License: MIT + file LICENSE) will be utilized for sentiment analysis, and RColorBrewer will provide color palettes for our visualizations.

These can be easily installed with the following command-

install.packages(c("dplyr", "ggplot2", "quanteda", "quanteda.textplots",
                   "tidytext", "textstem", "sentimentr", "RColorBrewer"))

After installation, we can load the packages as:

# Load necessary R packages
library(dplyr)
library(ggplot2)
library(quanteda)
library(quanteda.textplots)
library(tidytext)
library(textstem)
library(sentimentr)
library(RColorBrewer)

Dataset for customer reviews

In the case of the real-world dataset, this data would actually be scraped using multiple tools from various social media platforms. The collected data would represent the feedback that includes informal language, emojis, and domain-specific terms. Such a combined dataset can allow for a detailed analysis of customer sentiments and opinions across different sources.

However, for this tutorial, let us use a synthetic dataset generated in R using packages that cover these above points. The dataset with 200 rows represents customer feedback (~2–3 sentences in each row) from different sources and includes raw text with emojis and symbols, abbreviations, etc., mimicking real-world scenarios. These sentences are simply a generic representation of the reviews commonly seen on e-commerce or product websites (talk about keywords such as UI, design, phone features and price, experienced battery life, customer service support, etc.) and are combined in random patterns with emojis for creating a review text.

You can find the synthetic dataset generated using R on GitHub here.

#load the dataset
data <- read.csv("sentiment_data.csv")
# Print the dimensions (number of rows and columns) of the dataset
dataset_size <- dim(data)
print(dataset_size)

Since the dataset contains a lot of text, let's print just the first few words of each row to get an overview.

To achieve this, we’ll first define a function to extract the first few words from each feedback entry in our dataset. We’ll then randomly sample 5 rows from the dataset and apply the function to truncate the feedback text. Finally, we’ll print the resulting data frame to get an idea of the feedback text.

# Function to extract the first few words of a text entry
extract_first_words <- function(text, num_words = 10) {
  if (is.na(text) || !is.character(text)) {
    return(NA)
  }
  words <- unlist(strsplit(text, "\\s+"))  # split on whitespace
  return(paste(words[1:min(num_words, length(words))], collapse = " "))
}
# Randomly sample 5 rows from the dataset
set.seed(123)
random_feedback <- data[sample(nrow(data), size = 5, replace = FALSE), ]
# Truncate each sampled entry to its first 10 words
random_feedback$text <- sapply(random_feedback$text, function(text) {
  truncated <- extract_first_words(text)
  paste0(truncated, "...")
})
# Print the data frame
print(random_feedback)

2. Preprocessing Text Data

Before moving to text analysis, we need to preprocess the text to ensure a clean and consistent format. Preprocessing will involve several key steps:

  1. Text Cleaning which includes removal of punctuation, numbers, and special characters;
  2. Text Normalizing which includes conversion of the alphabets to lowercase;
  3. Tokenizing the text which includes splitting the text into individual words or tokens;
  4. Removing stop words which includes intentional removal of words that do not contribute to sentiment analysis (e.g., "the," "and"); and finally,
  5. Stemming or lemmatizing the text where the words are reduced to their root forms. These steps help lessen the noise and improve the accuracy of the analysis.

Now, we’ll implement the above preprocessing steps on our dataset.

# Cleaning the dataset
corpus <- quanteda::corpus(data$text)
tokens_clean <- quanteda::tokens(corpus, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("en"))
# Convert tokens to character vectors for lemmatization
tokens_char <- sapply(tokens_clean, function(tokens) paste(tokens, collapse = " "))
# Lemmatize the cleaned texts with textstem
lemmatized_texts <- lemmatize_strings(tokens_char)

In this code, we convert the dataset's text column into a quanteda corpus object and clean it by tokenizing, which involves removing punctuation, numbers, and symbols, converting all words to lowercase, and filtering out common stopwords. Note that we deliberately do not apply stemming. Stemming uses simple rules to chop off word endings, which can leave partial or incomplete words: the algorithm might strip the suffix "-ing" from "amazing," producing "amaz," or reduce "terrible" to "terribl". To obtain more accurate root forms, we use lemmatization instead, a more sophisticated process that relies on dictionaries and considers the context and part of speech of each word to return its base (dictionary) form, such as mapping "running" to "run".
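To make the difference concrete, here is a small side-by-side comparison using textstem (already loaded above). The example words mirror the ones discussed in the text; the exact outputs depend on the stemmer and lemma dictionary that textstem uses.

# Compare rule-based stemming with dictionary-based lemmatization
examples <- c("amazing", "terrible", "running")
stem_strings(examples)       # rule-based stems, e.g. "amazing" becomes "amaz"
lemmatize_strings(examples)  # dictionary-based lemmas, e.g. "running" becomes "run"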

Now that we have cleaned and tokenized the text data, we can move on to the next step. Our goal is to analyze the sentiments in the feedback entries. We will use the sentimentr package to evaluate the sentiments in our structured data, providing insights into the emotional tone of the feedback entries.

3. Performing Sentiment Analysis Using Sentimentr package

Now, we can perform sentiment analysis on these texts with the sentiment function from the sentimentr package. This function calculates sentence-level polarity scores for each piece of text, based on a lexicon of positive and negative words together with valence shifters such as negators and amplifiers.

Next, we summarize the sentiment scores for each document. We group the scores by document, count how many of the returned scores are positive and how many are negative, compute a compound score as their sum, and categorize the overall sentiment as either positive or negative.

# Perform sentiment analysis using sentimentr
sentiment_scores <- sentiment(lemmatized_texts)
# Summarize sentiment scores for each document
sentiment_summary <- sentiment_scores %>%
  group_by(element_id) %>%
  summarize(
    positive_words = sum(sentiment > 0),
    negative_words = sum(sentiment < 0),
    compound = sum(sentiment)
  ) %>%
  mutate(
    sentiment = ifelse(compound > 0, "Positive", "Negative")
  )

Finally, we merge this sentiment summary with the original text data and print the results. This gives us a clear, concise evaluation of the sentiment in our dataset.

# Merge with the original text for context, using the row number as a common key
sentiment_summary <- sentiment_summary %>%
  mutate(doc_id = as.character(element_id)) %>%
  left_join(data %>% mutate(doc_id = as.character(1:nrow(data))), by = "doc_id") %>%
  select(text, positive_words, negative_words, compound, sentiment)
# Print the sentiment evaluation table
print(sentiment_summary)

The output table shows the positive and negative counts per review, along with the compound score and the predicted sentiment. At a glance, the model does a reasonably good job of sorting positive and negative reviews. Some reviews that look clearly negative from the truncated text in the table (e.g. "Would not recommend….") are still labeled positive; most likely, the full review contains enough positive keywords (satisfies, best, good, etc.) to tip the compound score above zero. Such borderline reviews should be inspected separately before the results are used for decision-making.
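One simple way to surface such cases, sketched below, is to flag reviews whose compound score sits near zero or that contain a mix of positive and negative components; the 0.2 cut-off is an arbitrary choice for illustration.

# Flag borderline reviews (near-zero compound score or mixed counts) for manual inspection
borderline_reviews <- sentiment_summary %>%
  filter(abs(compound) < 0.2 | (positive_words > 0 & negative_words > 0)) %>%
  arrange(abs(compound))
print(borderline_reviews)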

Next, we create a Document-Feature Matrix (DFM), a structured representation of the text in which rows represent documents and columns represent features (words). Each cell contains the frequency of a specific word in a document. Here, the cleaned tokens are transformed into a DFM, making the data ready for statistical analysis and visualization.

# Create a document-feature matrix (DFM) from the cleaned tokens
dfm_reviews <- dfm(tokens_clean)
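As a quick sanity check before analysis, we can inspect the matrix dimensions and sparsity (a minimal sketch):

# Inspect the DFM: printing it reports the number of documents, features, and sparsity
print(dfm_reviews)
ndoc(dfm_reviews)
nfeat(dfm_reviews)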

To recap the sentiment step above: for each text entry, we count the positively and negatively scored components and compute a compound score as the sum of the sentiment scores. A positive compound score indicates positive sentiment and a negative score indicates negative sentiment. This information is combined with the original text for a comprehensive sentiment evaluation.

4. Analyzing Sentiment Proportions

# Evaluate sentiment proportions as percentages
sentiment_proportion <- sentiment_summary %>%
  group_by(sentiment) %>%
  summarise(count = n()) %>%
  mutate(proportion = count / sum(count) * 100)
print(sentiment_proportion)

To understand the overall sentiment distribution, we calculate the proportions of positive and negative sentiments in the dataset. Grouping by sentiment type, the count of entries in each category is calculated and normalized to derive their proportions.

5. Visualizing Sentiment Distribution

We'll create a bar chart in ggplot2 to visualize the proportions of positive and negative sentiments, giving an intuitive view of the sentiment distribution and making it easy to see which sentiment dominates.

# Plot sentiment distribution as percentages
ggplot(sentiment_proportion, aes(x = sentiment, y = proportion, fill = sentiment)) +
  geom_bar(stat = "identity", width = 0.7) +
  scale_fill_manual(values = c("Positive" = "blue", "Negative" = "red")) +
  labs(title = "Distribution of Sentiments",
       x = "Sentiment Type",
       y = "Percentage",
       fill = "Sentiment") +
  theme_minimal() +
  theme(panel.grid = element_blank())
Image by Author

In our dataset, positive sentiment seems dominant. Hence, a larger proportion of the customers are happy with PhoneTech’s product.

6. Visualizing Top Terms

# Plot the top 10 terms across all reviews
top_terms <- topfeatures(dfm_reviews, 10)
bar_colors <- colorRampPalette(c("lightblue", "blue"))(length(top_terms))
# Horizontal bar plot with a colour gradient
barplot(top_terms, main = "Top 10 Terms", las = 2, col = bar_colors, horiz = TRUE, cex.names = 0.7)
Image by Author

The five most frequent terms in the dataset appear to be "recommend", "design", "smartphone", "display," and "terrible". Such words are not very informative on their own, but PhoneTech personnel could dig deeper into how they are used in the reviews, for example by examining the surrounding context of each term or by building additional plots, as sketched below.
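One lightweight way to add that context is quanteda's keyword-in-context view, which shows every occurrence of a term with a few words on either side. Here the raw tokens are used so the surrounding words stay readable, and "terrible" with a 5-word window is an illustrative choice.

# Keyword-in-context: show each occurrence of "terrible" with 5 words of surrounding context
kwic_terrible <- kwic(quanteda::tokens(corpus), pattern = "terrible", window = 5)
head(kwic_terrible)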

So, let's filter the positive feedback, create a DFM from it, and plot again to see what happy customers are really saying about the product.

# Filter for positive feedback
positive_feedback <- sentiment_summary %>%
  filter(sentiment == "Positive")
# Tokenize and clean the positive feedback, then build its DFM
positive_tokens <- quanteda::tokens(positive_feedback$text, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("en"))
positive_dfm <- dfm(positive_tokens)
# Plot the top 5 positive terms with a colour gradient
top_positive <- topfeatures(positive_dfm, 5)
bar_colors <- colorRampPalette(c("lightblue", "blue"))(length(top_positive))
barplot(top_positive, main = "Top 5 Positive Terms", las = 2, col = bar_colors, horiz = TRUE, cex.names = 0.7)
Image by Author

Performance, "smartphone" (likely referring to the product or brand itself), display, and design appear to be the most talked-about terms in the positive reviews.

Another way to visualize the most frequent terms in the dataset is to generate a word cloud, limiting the number of words plotted with the max_words parameter as needed.

7. Generating a Word Cloud

# Word cloud of the most frequent terms across all reviews
textplot_wordcloud(dfm_reviews, max_words = 200, color = RColorBrewer::brewer.pal(8, "Reds"), min_size = 0.5, max_size = 8)

A word cloud displays the most frequent terms in an engaging, intuitive format. Larger words indicate higher frequencies, which makes this plot particularly useful for quickly identifying key themes in the dataset.

Image by Author

For the PhoneTech team, it might be worth building two separate word clouds, one for positive and one for negative feedback, to better understand which features customers love most and where the pain points are; a sketch of the negative side follows.
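Mirroring the positive-feedback steps above, a minimal sketch for the negative side could look like this (the colour palette and word limit are arbitrary choices):

# Filter negative feedback and build its DFM
negative_feedback <- sentiment_summary %>%
  filter(sentiment == "Negative")
negative_tokens <- quanteda::tokens(negative_feedback$text, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("en"))
negative_dfm <- dfm(negative_tokens)
# Word cloud of the most frequent terms in negative reviews
textplot_wordcloud(negative_dfm, max_words = 100, color = RColorBrewer::brewer.pal(8, "Blues"), min_size = 0.5, max_size = 8)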

8. Sampling and Reviewing Sentiments

Finally, we'll print five random entries from the dataset to inspect their sentiment evaluation results. This helps validate the sentiment analysis outputs and provides insight into individual reviews.

# Sample 5 sentences from the dataset
sample_indices <- sample(1:nrow(sentiment_summary), 5)
sample_sentiment_summary <- sentiment_summary[sample_indices, ]
# Print the sample sentences
print(sample_sentiment_summary)

All of the above steps form a comprehensive pipeline for analyzing textual data. Together, they transform raw text into actionable insights, supporting data-driven decision-making for any company.

Interpreting Sentiment Analysis Results

It is crucial to assess and interpret the findings of the sentiment analysis correctly. To that end, we generated a Document-Feature Matrix (DFM) to find the top words and the overall sentiment distribution, which helps us understand the general customer mood and identify patterns in the feedback. In addition, the model produced sentiment scores that give an idea of the tone of each review.

For example, if PhoneTech finds that 68% of the feedback is positive, with the top words being "design" and "performance," this highlights key selling points for marketing. Conversely, if the remaining 32% of reviews, i.e., the negative comments, talk about customer service and poor photos, these indicate potential areas for improvement.

Comparing sentiment trends over time or across sources, such as social media versus online reviews, helps identify shifts in customer perception. An accurate interpretation is important for making informed decisions and developing targeted strategies.
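As a sketch of what such a comparison could look like: assuming the feedback table also carried date and source columns (which our synthetic dataset does not include), sentiment could be aggregated by month and source and plotted over time.

# Hypothetical example: the date and source columns are assumed and not part of the synthetic dataset
sentiment_trends <- sentiment_summary %>%
  mutate(month = format(as.Date(date), "%Y-%m")) %>%
  group_by(month, source) %>%
  summarise(share_positive = mean(sentiment == "Positive") * 100, .groups = "drop")
ggplot(sentiment_trends, aes(x = month, y = share_positive, group = source, color = source)) +
  geom_line() +
  labs(title = "Share of Positive Feedback Over Time", x = "Month", y = "% Positive")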

While the model seems to identify positive and negative reviews effectively, a further step could be to separate out neutral reviews, if any, for a more comprehensive analysis; a simple threshold-based sketch follows.
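A minimal sketch, assuming that compound scores within a small band around zero count as neutral (the 0.05 cut-off is arbitrary), could look like this:

# Classify near-zero compound scores as "Neutral" (the 0.05 threshold is an illustrative choice)
sentiment_summary <- sentiment_summary %>%
  mutate(sentiment = case_when(
    compound > 0.05  ~ "Positive",
    compound < -0.05 ~ "Negative",
    TRUE             ~ "Neutral"
  ))
table(sentiment_summary$sentiment)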

Applying Sentiment Insights to Fine-tune Strategy

The sentiment analysis has revealed key strengths of the product as well as areas for improvement that PhoneTech can leverage to enhance its business. By addressing both positive and negative customer feedback, PhoneTech can drive overall satisfaction and attract more buyers.

Based on sentiment analysis results, PhoneTech could identify the following actionable insights and strategies to improve its business:

Strategies building on positive feedback:

(1) Refine Marketing Strategies:

  • Customers seem to be happy with the sleek and fast UI.
  • Positive feedback on the UI design indicates that this is a key selling point, which PhoneTech should continue promoting in their marketing campaigns to attract more buyers.

Strategies addressing negative feedback:

(1) Enhance Product Features:

  • Frequent complaints about image quality suggest an issue with either the hardware or software.
  • Improving these areas quickly can enhance the user experience and reduce negative reviews.

(2) Addressing Customer Service Issues:

  • Handling customer service issues and resolving them promptly will boost product satisfaction.
  • These actions can prevent or reduce negative reviews while ensuring a better user experience and increasing overall reliability.

Best Practices in Sentiment Analysis

  • Text Context: As lexicon-based sentiment analysis often misses sarcasm and context, using advanced techniques like machine learning helps better capture nuances.
  • Domain-Specific Language: As general lexicons may not understand industry-specific terms and slang, tailoring lexicons to include technical terms relevant to the industry improves accuracy.
  • Use of Informal Language and Emojis: Since informal language and emojis can be challenging to analyze, using tools like quanteda to clean and systematically analyze data is beneficial.
  • Combining Techniques: As relying on one method limits analysis depth, combining text processing with machine learning provides comprehensive insights.

Key Takeaways

  • Sentiment analysis helps businesses understand customer opinions to improve products and services.
  • The R packages quanteda, sentimentr, and textstem work well together for text analysis of customer reviews.
  • The outlined approach for sentiment analysis can be easily applied across industries like finance, healthcare, and retail for actionable insights.

Conclusion

Sentiment analysis gives businesses a clear picture of their customers' needs and pain points. Companies can leverage these insights to improve products and craft data-driven strategies.

In this article, we explored how R packages can support sentiment analysis of customer feedback for a tech product. We discussed the background of the challenge and walked through the steps needed to implement sentiment analysis in R: data collection and preparation, corpus creation, tokenization, feature extraction, building sentiment summaries, and visualizing the results. We also considered the outcomes of the analysis and how the company could use them to further refine the product.

Companies in other domains that are looking to gain actionable insights, enhance product features, refine marketing strategies, and monitor brand reputation can take a very similar approach to sentiment analysis.
