One-Tailed Vs. Two-Tailed Tests

Choosing between one- and two-tailed hypotheses affects every stage of A/B testing. Learn why the hypothesis direction matters and explore the pros and cons of each approach.

Introduction

If you’ve ever analyzed data using built-in t-test functions, such as those in R or SciPy, here’s a question for you: have you ever adjusted the default setting for the alternative hypothesis? If your answer is no—or if you’re not even sure what this means—then this blog post is for you!

The alternative hypothesis parameter, commonly referred to as “one-tailed” versus “two-tailed” in statistics, defines the expected direction of the difference between control and treatment groups. In a two-tailed test, we assess whether there is any difference in mean values between the groups, without specifying a direction. A one-tailed test, on the other hand, posits a specific direction—whether the control group’s mean is either less than or greater than that of the treatment group.
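As a quick illustration, here is a minimal sketch of where this parameter lives in SciPy (version 1.6 or later); the two simulated samples below are placeholders, not real experiment data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=200)    # simulated control group
treatment = rng.normal(loc=10.4, scale=2.0, size=200)  # simulated treatment group

# Default setting: a two-tailed test
t_stat, p_two_sided = stats.ttest_ind(treatment, control, alternative='two-sided')

# One-tailed test: treatment mean is hypothesized to be greater than control mean
t_stat, p_one_sided = stats.ttest_ind(treatment, control, alternative='greater')

print(p_two_sided, p_one_sided)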

Choosing between one- and two-tailed hypotheses might seem like a minor detail, but it affects every stage of A/B testing: from test planning to data analysis and results interpretation. This article builds a theoretical foundation on why the hypothesis direction matters and explores the pros and cons of each approach.

One-tailed vs. two-tailed hypothesis testing: Understanding the difference

To understand the importance of choosing between one-tailed and two-tailed hypotheses, let’s briefly review the basics of the t-test, a commonly used method in A/B testing. Like other hypothesis testing methods, the t-test begins with a conservative assumption: there is no difference between the two groups (the null hypothesis). Only if we find strong evidence against this assumption can we reject the null hypothesis and conclude that the treatment has had an effect.

But what qualifies as “strong evidence”? To answer this, a rejection region is defined under the null hypothesis: any result that falls within this region is deemed so unlikely under the null that we take it as evidence against the null hypothesis. The size of this rejection region is based on a predetermined probability, known as alpha (α), which represents the likelihood of incorrectly rejecting the null hypothesis.

What does this have to do with the direction of the alternative hypothesis? Quite a bit, actually. While the alpha level determines the size of the rejection region, the alternative hypothesis dictates its placement. In a one-tailed test, where we hypothesize a specific direction of difference, the rejection region is situated in only one tail of the distribution. For a hypothesized positive effect (e.g., that the treatment group mean is higher than the control group mean), the rejection region would lie in the right tail, creating a right-tailed test. Conversely, if we hypothesize a negative effect (e.g., that the treatment group mean is less than the control group mean), the rejection region would be placed in the left tail, resulting in a left-tailed test.

In contrast, a two-tailed test allows for the detection of a difference in either direction, so the rejection region is split between both tails of the distribution. This accommodates the possibility of observing extreme values in either direction, whether the effect is positive or negative.

To build intuition, let’s visualize how the rejection regions appear under the different hypotheses. Recall that according to the null hypothesis, the difference between the two groups should center around zero. Thanks to the central limit theorem, we also know this distribution approximates a normal distribution. Consequently, the rejection regions corresponding to the different alternative hypotheses look like this:
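If you’d like to compute the corresponding cut-off points yourself, here is a minimal sketch using the standard normal approximation described above (α = 0.05 is an assumption chosen for illustration):

from scipy import stats

alpha = 0.05

# Two-tailed: alpha is split evenly between both tails
lower, upper = stats.norm.ppf(alpha / 2), stats.norm.ppf(1 - alpha / 2)
print(f'two-tailed rejection region: z < {lower:.2f} or z > {upper:.2f}')

# Right-tailed: all of alpha sits in the right tail
print(f'right-tailed rejection region: z > {stats.norm.ppf(1 - alpha):.2f}')

# Left-tailed: all of alpha sits in the left tail
print(f'left-tailed rejection region: z < {stats.norm.ppf(alpha):.2f}')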

Why does it make a difference?

The choice of direction for the alternative hypothesis impacts the entire A/B testing process, starting with the planning phase—specifically, in determining the sample size. Sample size is calculated based on the desired power of the test, which is the probability of detecting a true difference between the two groups when one exists. To compute power, we examine the area under the alternative hypothesis that corresponds to the rejection region (since power reflects the ability to reject the null hypothesis when the alternative hypothesis is true).

Since the direction of the hypothesis affects the size of this rejection region, power is generally lower for a two-tailed hypothesis. This is due to the rejection region being divided across both tails, making it more challenging to detect an effect in any one direction. The following graph illustrates the comparison between the two types of hypotheses. Note that the purple area is larger for the one-tailed hypothesis, compared to the two-tailed hypothesis:

In practice, to maintain the desired power level, we compensate for the reduced power of a two-tailed hypothesis by increasing the sample size (Increasing sample size raises power, though the mechanics of this can be a topic for a separate article). Thus, the choice between one- and two-tailed hypotheses directly influences the required sample size for your test. 

Beyond the planning phase, the choice of alternative hypothesis directly impacts the analysis and interpretation of results. There are cases where a test may reach significance with a one-tailed approach but not with a two-tailed one, and vice versa. Reviewing the previous graph can help illustrate this: for example, a result in the left tail might be significant under a two-tailed hypothesis but not under a right one-tailed hypothesis. Conversely, certain results might fall within the rejection region of a right one-tailed test but lie outside the rejection area in a two-tailed test.
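To make this concrete, consider a hypothetical borderline result. Because the t distribution is symmetric, when the observed effect is in the hypothesized direction the one-tailed p-value is half of the two-tailed one, which can flip the conclusion at the same alpha level:

alpha = 0.05
p_two_sided = 0.08             # hypothetical two-tailed p-value: not significant
p_one_sided = p_two_sided / 2  # 0.04: significant in a right-tailed test
print(p_two_sided < alpha, p_one_sided < alpha)  # False, True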

How to decide between a one-tailed and two-tailed hypothesis

Let’s start with the bottom line: there’s no absolute right or wrong choice here. Both approaches are valid, and the primary consideration should be your specific business needs. To help you decide which option best suits your company, we’ll outline the key pros and cons of each.

At first glance, a one-tailed alternative may appear to be the clear choice, as it often aligns better with business objectives. In industry applications, the focus is typically on improving specific metrics rather than exploring a treatment’s impact in both directions. This is especially relevant in A/B testing, where the goal is often to optimize conversion rates or enhance revenue. If the treatment doesn’t lead to a significant improvement, the examined change won’t be implemented.

Beyond this conceptual advantage, we have already mentioned one key benefit of a one-tailed hypothesis: it requires a smaller sample size. Thus, choosing a one-tailed alternative can save both time and resources. To illustrate this advantage, the following graphs show the required sample sizes for one- and two-tailed hypotheses with different power levels (alpha is set at 5%).
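If you want to reproduce this kind of comparison yourself, here is a sketch using statsmodels’ power solver; the effect size (Cohen’s d = 0.2) is an assumption picked purely for illustration:

from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
for power in (0.8, 0.9):
    # One-tailed: the whole rejection region sits in one tail ('larger')
    n_one = solver.solve_power(effect_size=0.2, alpha=0.05, power=power,
                               alternative='larger')
    # Two-tailed: the rejection region is split between both tails
    n_two = solver.solve_power(effect_size=0.2, alpha=0.05, power=power,
                               alternative='two-sided')
    print(f'power={power}: one-tailed n={n_one:.0f}, two-tailed n={n_two:.0f} per group')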

In this context, the decision between one- and two-tailed hypotheses becomes particularly important in sequential testing—a method that allows for ongoing data analysis without inflating the alpha level. Here, selecting a one-tailed test can significantly reduce the duration of the test, enabling faster decision-making, which is especially valuable in dynamic business environments where prompt responses are essential.

However, don’t be too quick to dismiss the two-tailed hypothesis! It has its own advantages. In some business contexts, the ability to detect “negative significant results” is a major benefit. As one client once shared, he preferred negative significant results over inconclusive ones because they offer valuable learning opportunities. Even if the outcome wasn’t as expected, he could conclude that the treatment had a negative effect and gain insights into the product.

Another benefit of two-tailed tests is their straightforward interpretation using confidence intervals (CIs). In two-tailed tests, a CI that doesn’t include zero directly indicates significance, making it easier for practitioners to interpret results at a glance. This clarity is particularly appealing since CIs are widely used in A/B testing platforms. Conversely, with one-tailed tests, a significant result might still include zero in the CI, potentially leading to confusion or mistrust in the findings. Although one-sided confidence intervals can be employed with one-tailed tests, this practice is less common.
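As a rough sketch of this difference, here is how you could compute both interval types with statsmodels; the two samples are simulated placeholders, and alternative='larger' is the one-sided counterpart of a right-tailed test:

import numpy as np
from statsmodels.stats.weightstats import CompareMeans, DescrStatsW

rng = np.random.default_rng(0)
treatment = rng.normal(10.4, 2.0, 300)  # simulated treatment group
control = rng.normal(10.0, 2.0, 300)    # simulated control group

cm = CompareMeans(DescrStatsW(treatment), DescrStatsW(control))

# Two-sided 95% CI for the difference in means: significant if it excludes zero
print(cm.tconfint_diff(alpha=0.05, alternative='two-sided'))

# One-sided 95% interval (a lower bound only), matching a right-tailed test
print(cm.tconfint_diff(alpha=0.05, alternative='larger'))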

Conclusions

By adjusting a single parameter, you can significantly impact your A/B testing: specifically, the sample size you need to collect and the interpretation of the results. When deciding between one- and two-tailed hypotheses, consider factors such as the available sample size, the advantages of detecting negative effects, and the convenience of aligning confidence intervals (CIs) with hypothesis testing. Ultimately, this decision should be made thoughtfully, taking into account what best fits your business needs.

(Note: all the images in this post were created by the author)

Statistics in Python – Using Chi-Square for Feature Selection

In my previous two articles, I talked about how to measure correlations between the various columns in your dataset and how to detect multicollinearity between them:

Statistics in Python – Understanding Variance, Covariance, and Correlation

Statistics in Python – Collinearity and Multicollinearity

However, these techniques are useful when the variables you are trying to compare are continuous. How do you compare them if your variables are categorical? In this article, I will explain how you can test two categorical columns in your dataset to determine if they are dependent on each other (i.e. correlated). We will use a statistical test known as chi-square (commonly written as χ²).

Before we start our discussion on chi-square, here is a quick summary of the test methods that can be used for testing the various types of variables:

Using the chi-square statistics to determine if two categorical variables are correlated

The chi-square (χ²) statistic is a way to check the relationship between two categorical nominal variables.

Nominal variables contain values that have no intrinsic ordering. Examples of nominal variables are sex, race, eye color, skin color, etc. Ordinal variables, on the other hand, contain values that have an ordering. Examples of ordinal variables are grade, education level, economic status, etc.

The key idea behind the chi-square test is to compare the observed values in your data to the expected values and see if they are related or not. In particular, it is a useful way to check if two categorical nominal variables are correlated. This is particularly important in machine learning where you only want features that are correlated to the target to be used for training.

There are two types of chi-square tests:

  • Chi-Square Goodness of Fit Test – test if one variable is likely to come from a given distribution.
  • Chi-Square Test of Independence – test if two variables might be correlated or not.

Check out https://www.jmp.com/en_us/statistics-knowledge-portal/chi-square-test.html for a more detailed discussion of the above two chi-square tests.

When comparing to see if two categorical variables are correlated, you will use the Chi-Square Test of Independence.

Steps to Performing a Chi-Square Test

To use the chi-square test, you need to perform the following steps:

  1. Define your null hypothesis and alternate hypothesis. They are:
  • H₀ (Null Hypothesis) – that the 2 categorical variables being compared are independent of each other.
  • H₁ (Alternate Hypothesis) – that the 2 categorical variables being compared are dependent on each other.
  2. Decide on the α value. This is the risk that you are willing to take in drawing the wrong conclusion. As an example, say you set α=0.05 when testing for independence. This means you are undertaking a 5% risk of concluding that the two variables are dependent when in reality they are independent.
  3. Calculate the chi-square score using the two categorical variables and use it to calculate the p-value. A low p-value means there is a high correlation between your two categorical variables (they are dependent on each other). The p-value is calculated from the chi-square score and will tell you if your test results are significant or not.

In a chi-square analysis, the p-value is the probability of obtaining a chi-square as large or larger than that in the current experiment and yet the data will still support the hypothesis. It is the probability of deviations from what was expected being due to mere chance. In general a p-value of 0.05 or greater is considered critical, anything less means the deviations are significant and the hypothesis being tested must be rejected.

Source: https://passel2.unl.edu/view/lesson/9beaa382bf7e/8

To calculate the p-value, you need two pieces of information:

  • Degrees of freedom – for a test of independence, this is (number of rows − 1) × (number of columns − 1) in the contingency table
  • Chi-square score.

If the p-value obtained is:

  • < 0.05 (the α value you have chosen), you reject the H₀ (Null Hypothesis) and accept the H₁ (Alternate Hypothesis). This means the two categorical variables are dependent.
  • ≥ 0.05, you accept the H₀ (Null Hypothesis) and reject the H₁ (Alternate Hypothesis). This means the two categorical variables are independent.

In the case of feature selection for machine learning, you would want the feature that is being compared to the target to have a low p-value (less than 0.05), as this means that the feature is dependent on (correlated to) the target.

With the chi-square score that is calculated, you can also use it to refer to a chi-square table to see if your score falls within the rejection region or the acceptance region.

All the steps above may sound a little vague, and the best way to really understand how chi-square works is to look at an example. In the next section, I will use the Titanic dataset and apply the chi-square test to a few of the features to see if they are correlated to the target.

Using chi-square test on the Titanic dataset

A good way to understand a new topic is to go through the concepts using an example. For this, I am going to use the classic Titanic dataset (https://www.kaggle.com/tedllh/titanic-train).

The Titanic dataset is often used in machine learning to demonstrate how to build a machine learning model and use it to make predictions. In particular, the dataset contains several features (Pclass, Sex, Age, Embarked, etc) and one target (Survived). Several features in the dataset are categorical variables:

  • Pclass – the class of cabin that the passenger was in
  • Sex – the sex of the passenger
  • Embarked – the port of embarkation
  • Survived – if the passenger survived the disaster

Because this article explores the relationships between categorical features and the target, we are only interested in those columns that contain categorical values.

Loading the Dataset

Now that you have obtained the dataset, let’s load it up in a Pandas DataFrame:

import pandas as pd
import numpy as np
df = pd.read_csv('titanic_train.csv')
df.sample(5)
Image by author

Data Cleansing and Feature Engineering

There are some columns that are not really useful and hence we will proceed to drop them. Also, there are some missing values so let’s drop all those rows with empty values:

df.drop(columns=['PassengerId','Name', 'Ticket','Fare','Cabin'], 
        inplace=True)
df.dropna(inplace=True)
df
Image by author

We will also add one more column named Alone, based on the Parch (parent or children) and SibSp (siblings or spouse) columns. The idea we want to explore is whether being alone affects the survivability of the passenger. So Alone is 1 if both Parch and SibSp are 0, else it is 0:

df['Alone'] = (df['Parch'] + df['SibSp']).apply(
                  lambda x: 1 if x == 0 else 0)
df
Image by author

Visualizing the correlations between features and target

Now that the data is cleaned, let’s try to visualize how the sex of passengers is related to their survival of the accident:

import seaborn as sns
sns.barplot(x='Sex', y='Survived', data=df, ci=None)     

The Sex column contains nominal data (i.e. ranking is not important).

Image by author

From the above figure, you can see that of all the female passengers, more than 70% survived; of all the men, about 20% survived. It seems like there is a very strong relationship between the Sex and Survived features; we will use the chi-square test to confirm this later on.

How about Pclass and Survived? Are they related?

sns.barplot(x='Pclass', y='Survived', data=df, ci=None)
Image by author

Perhaps unsurprisingly, it shows that the higher the class of cabin the passenger was in (Pclass 1 being the highest), the higher the passenger’s survival rate.

The next feature of interest is if the place of embarkation determines who survives and who doesn’t:

sns.barplot(x='Embarked', y='Survived', data=df, ci=None)
Image by author

From the chart it seems like more people who embarked from C (Cherbourg) survived.

C = Cherbourg; Q = Queenstown; S = Southampton

You also want to know if being alone on the trip makes one more likely to survive:

ax = sns.barplot(x='Alone', y='Survived', data=df, ci=None)    
ax.set_xticklabels(['Not Alone','Alone'])
Image by author

You can see that passengers travelling with their family had a higher chance of survival.

Visualizing the correlations between each feature

Now that we have visualized the relationships between the categorical features against the target (Survived), we want to now visualize the relationships between each feature. Before you can do that, you need to convert the label values in the Sex and Embarked columns to numeric. To do that, you can make use of the LabelEncoder class in sklearn:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df['Sex'])
df['Sex'] = le.transform(df['Sex'])
sex_labels = dict(zip(le.classes_, le.transform(le.classes_)))
print(sex_labels)
le.fit(df['Embarked'])
df['Embarked'] = le.transform(df['Embarked'])
embarked_labels = dict(zip(le.classes_, 
                      le.transform(le.classes_)))
print(embarked_labels)

The above code snippet label-encodes the Sex and Embarked columns. The output shows the mappings of the values for each column, which is very useful later when performing predictions:

{'female': 0, 'male': 1}
{'C': 0, 'Q': 1, 'S': 2}

The following statements show the relationship between Embarked and Sex:

ax = sns.barplot(x='Embarked', y='Sex', data=df, ci=None)
ax.set_xticklabels(embarked_labels.keys())
Image by author

Seems like a larger proportion of those who boarded at Southampton (S) were male, compared with Queenstown (Q) and Cherbourg (C).

How about Embarked and Alone?

ax = sns.barplot(x='Embarked', y='Alone', data=df, ci=None)
ax.set_xticklabels(embarked_labels.keys())
Image by author

Seems like a large proportion of those who embarked from Queenstown are alone.

And finally, let’s see the relationship between Sex and Alone:

ax = sns.barplot(x='Sex', y='Alone', data=df, ci=None)
ax.set_xticklabels(sex_labels.keys())
Image by author

As you can see, a larger proportion of males than females were travelling alone.

Defining the Hypotheses

You now define your null hypothesis and alternate hypothesis. As explained earlier, they are:

  • H₀ (Null Hypothesis) – that the 2 categorical variables to be compared are independent of each other.
  • H₁ (Alternate Hypothesis) – that the 2 categorical variables being compared are dependent on each other.

And you draw your conclusions based on the following p-value conditions:

  • p < 0.05 – this means the two categorical variables are correlated.
  • p > 0.05 – this means the two categorical variables are not correlated.

Calculating χ² manually

Let’s manually go through the steps in calculating the χ2 values. The first step is to create a contingency table. Using the Sex and Survived columns as example, you first create a contingency table:

Image by author

The contingency table above displays the frequency distribution of the two categorical columns – Sex and Survived.

The degrees of freedom are next calculated as (number of rows − 1) × (number of columns − 1). In this example, the degrees of freedom are (2 − 1) × (2 − 1) = 1.

Once the contingency table is created, sum up all the rows and columns, like this:

Image by author

The above is your Observed values.

Next, you are going to calculate the Expected values. Here is how they are calculated:

  • Replace each value in the observed table with the product of the sum of its row and the sum of its column, divided by the grand total.

The following figure shows how the first value is calculated:

Image by author

The next figure shows how the second value is calculated:

Image by author

Here is the result for the Expected values:

Image by author
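If you prefer to verify this step in code, here is a minimal sketch of the outer-product calculation; the observed counts below are placeholders, not the exact values from the dataset:

import numpy as np

# Placeholder 2x2 observed counts (rows = Sex, columns = Survived)
observed = np.array([[60, 200],
                     [360, 90]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# expected[i, j] = row_total[i] * col_total[j] / grand_total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)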

Then, calculate the chi-square value for each cell using the formula for χ², i.e. (Observed − Expected)² / Expected:

Image by author

Applying this formula to the Observed and Expected values, you get the chi-square values:

Image by author

The chi-square score is the grand total of the chi-square values:

Image by author

You can use the following websites to verify if the numbers are correct:

The Python implementation for the above steps is contained within the following chi2_by_hand() function:

def chi2_by_hand(df, col1, col2):    
    #---create the contingency table---
    df_cont = pd.crosstab(index = df[col1], columns = df[col2])
    display(df_cont)
    #---calculate degree of freedom---
    degree_f = (df_cont.shape[0]-1) * (df_cont.shape[1]-1)
    #---sum up the totals for row and columns---
    df_cont.loc[:,'Total']= df_cont.sum(axis=1)
    df_cont.loc['Total']= df_cont.sum()
    print('---Observed (O)---')
    display(df_cont)
    #---create the expected value dataframe---
    df_exp = df_cont.copy()    
    df_exp.iloc[:,:] = (np.multiply.outer(
        df_cont.sum(1).values, df_cont.sum().values) /
        df_cont.sum().sum())
    print('---Expected (E)---')
    display(df_exp)

    # calculate chi-square values
    df_chi2 = ((df_cont - df_exp)**2) / df_exp    
    df_chi2.loc[:,'Total']= df_chi2.sum(axis=1)
    df_chi2.loc['Total']= df_chi2.sum()

    print('---Chi-Square---')
    display(df_chi2)
    #---get chi-square score---   
    chi_square_score = df_chi2.iloc[:-1,:-1].sum().sum()

    return chi_square_score, degree_f

The chi2_by_hand() function takes in three arguments – the dataframe containing all your columns, followed by two strings containing the names of the two columns you are comparing. It returns a tuple – the chi-square score and the degrees of freedom. (The display() calls assume you are running the code in a Jupyter notebook.)

Let’s now test the above function using the Titanic dataset. First, let’s compare the Sex and the Survived columns:

chi_score, degree_f = chi2_by_hand(df,'Sex','Survived')
print(f'Chi2_score: {chi_score}, Degrees of freedom: {degree_f}')

You will see the following result:

Chi2_score: 205.1364846934008, Degrees of freedom: 1

Using the chi-square score, you can now decide if you will accept or reject the null hypothesis using the chi-square distribution curve:

Image by author

The x-axis represents the _χ_2 score. The area that is to the right of the critical chi-square region is known as the rejection region. Area to the left of it is known as the acceptance region. If the chi-square score that you have obtained falls in the acceptance region, the null hypothesis is accepted; else the alternate hypothesis is accepted.

So how do you obtain the critical chi-square region? For this, you have to check the chi-square table:

Table from https://people.smp.uq.edu.au/YoniNazarathy/stat_models_B_course_spring_07/distributions/chisqtab.pdf; annotations by author

You can check out the Chi-Square Table at https://www.mathsisfun.com/data/chi-square-table.html

This is how you use the chi-square table. With your α set to 0.05 and 1 degree of freedom, the critical chi-square value is 3.84 (refer to the table above). Putting this value into the chi-square distribution curve, you can conclude that:

Image by author
  • As the calculated chi-square value (205) is greater than 3.84, it falls in the rejection region, and hence the null hypothesis is rejected and the alternate hypothesis is accepted.
  • Recall our alternate hypothesis: H₁ (Alternate Hypothesis) – that the 2 categorical variables being compared are dependent on each other.

This means that the Sex and Survived columns are dependent on each other.
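As a sanity check, you can compare the hand-computed score against SciPy’s built-in chi2_contingency() function. For 2×2 tables SciPy applies Yates’ continuity correction by default, so pass correction=False to match the manual (uncorrected) calculation above:

from scipy.stats import chi2_contingency

# Reuse the cleaned dataframe from earlier
observed = pd.crosstab(df['Sex'], df['Survived'])
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p, dof)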

As a practice exercise, you can use the chi2_by_hand() function on the other features.

Calculating the p-value

The previous section shows how you can accept or reject the null hypothesis by examining the chi-square score and comparing it with the chi-square distribution curve.

An alternative way to accept or reject the null hypothesis is by using the p-value. Remember, the p-value can be calculated using the chi-square score and the degrees of freedom.

For simplicity, we shall not go into the details of how to calculate the p-value by hand.

In Python, you can calculate the p-value using the sf() (survival function) of the chi2 distribution in SciPy’s stats module:

def chi2_by_hand(df, col1, col2):    
    #---create the contingency table---
    df_cont = pd.crosstab(index = df[col1], columns = df[col2])
    display(df_cont)
    ...
    chi_square_score = df_chi2.iloc[:-1,:-1].sum().sum()

    #---calculate the p-value---
    from scipy import stats
    p = stats.distributions.chi2.sf(chi_square_score, degree_f)
    return chi_square_score, degree_f, p

You can now call the chi2_by_hand() function and get the chi-square score, degrees of freedom, and p-value:

chi_score, degree_f, p = chi2_by_hand(df,'Sex','Survived')
print(f'Chi2_score: {chi_score}, Degrees of freedom: {degree_f}, p-value: {p}')

The above code results in the following p-value:

Chi2_score: 205.1364846934008, Degrees of freedom: 1, p-value: 1.581266384342472e-46

As a quick recap, you accept or reject the hypotheses and form your conclusion based on the following p-value conditions:

  • p < 0.05 – this means the two categorical variables are correlated.
  • p > 0.05 – this means the two categorical variables are not correlated.

And since p < 0.05 – this means the two categorical variables are correlated.

Trying out the other features

Let’s try out the other categorical columns that contain nominal values:

chi_score, degree_f, p = chi2_by_hand(df,'Embarked','Survived')
print(f'Chi2_score: {chi_score}, Degrees of freedom: {degree_f}, p-value: {p}')
# Chi2_score: 27.918691003688615, Degrees of freedom: 2, 
# p-value: 8.660306799267924e-07

chi_score, degree_f, p = chi2_by_hand(df,'Alone','Survived')
print(f'Chi2_score: {chi_score}, Degrees of freedom: {degree_f}, p-value: {p}')
# Chi2_score: 28.406341862069905, Degrees of freedom: 1, 
# p-value: 9.834262807301776e-08

Since the p-values for both Embarked and Alone are < 0.05, you can conclude that both the Embarked and Alone features are correlated to the Survived target, and should be included for training in your model.
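To screen several features at once, you could wrap the calls in a small loop along the following lines; the feature list is simply the categorical columns used in this article:

candidate_features = ['Sex', 'Pclass', 'Embarked', 'Alone']

results = []
for feature in candidate_features:
    chi_score, degree_f, p = chi2_by_hand(df, feature, 'Survived')
    results.append({'feature': feature, 'chi2': chi_score,
                    'dof': degree_f, 'p_value': p})

# Features with p < 0.05 are candidates to keep for training
summary = pd.DataFrame(results).sort_values('p_value')
print(summary)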

Summary

In this article, I have gone through a brief explanation of how the chi-square statistics test works, and how you can apply it to the Titanic dataset. A few notes of caution would be useful here:

  1. While the Pearson’s coefficient and Spearman’s rank coefficient measure the strength of an association between two variables, the chi-square test measures the significance of the association between two variables. What it tells you is whether the relationship you found in the sample is likely to exist in the population, or how likely it is by chance due to sampling error.
  2. The chi-square test is sensitive to small frequencies in your contingency table. Generally, if a cell in your contingency table has an expected frequency of 5 or less, the chi-square test can lead to erroneous conclusions. Also, the chi-square test should not be used if the sample size is less than 50.

I hope you now have a better understanding of how chi-square works and how it can be used for feature selection in machine learning. See you in my next article!

