
Statistics Bootcamp 7: Balancing Type I and II Errors

Learn the math and methods behind the libraries you use daily as a data scientist

Statistics Bootcamp

Image by Author

This article is part of a larger Bootcamp series (see kicker for full list!). This one is dedicated to understanding Type I and Type II errors and introducing the t-distribution.

Any decision we make based on hypothesis testing may be incorrect: we can reject the null hypothesis when we should not have, and vice versa. This risk arises because we use a single sample to make inferences about the population as a whole.

In statistics, we consider two types of errors, which depend on the ground truth of the situation – something we rarely know but can estimate. These are Type I and Type II errors. A Type I error occurs when we reject the null hypothesis when we shouldn’t have, i.e. the null hypothesis is true. Conversely, a Type II error occurs when we fail to reject the null hypothesis (accept the null hypothesis) when we shouldn’t have, i.e. the null hypothesis is false. You can see the contingency table below for a visual representation of this phenomenon.

Graphically, we can represent this as:

Example. H₀: μ = 150 lbs (the average female weighs 150 lbs); H₁: μ ≠ 150 lbs (the average female does not weigh 150 lbs)

A Type I error has occurred if the ground truth is μ = 150 lbs, but the data analysis has led to the conclusion that μ ≠ 150 lbs. Conversely, a Type II error has occurred if the ground truth is μ ≠ 150 lbs, but the data analysis has led to the conclusion that μ = 150 lbs.

Probability of Errors

Now, both Type I and Type II errors have a certain probability of occurring. The probability of a Type I error is denoted P(Type I error) = α, and we refer to this as our significance level in statistics. This is the probability of rejecting the null hypothesis by sheer chance, and corresponds to the area under the curve (AUC) that falls in the rejection region. P(Type I error) = P(reject H₀ | H₀ is true)

The probability of a Type II error is denoted P(Type II error) = β. This is the probability of not rejecting the null hypothesis when it is false and the alternative hypothesis is true. P(Type II error) = P(fail to reject H₀ | H₁ is true) or: P(Type II error) = P(fail to reject H₀ | H₀ is false)

We have to balance the risk of α against the risk of β, as the two trade off: the smaller α is, the larger β becomes, and vice versa, assuming the sample size is kept constant. We can see this represented visually here:

If we think about sliding the rejection cutoff over these overlapping distributions: if we decrease α (the Type I error rate) by moving the cutoff farther right, we inevitably increase β (the Type II error rate).
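To see this trade-off numerically, here is a minimal sketch using scipy (the values μ₀ = 150, μ₁ = 160, σ = 20 and n = 25, and the right-tailed setup, are assumptions chosen purely for illustration):

from scipy.stats import norm
import numpy as np

# Hypothetical example: H0: mu = 150 vs. H1: mu = 160,
# with sigma = 20 and n = 25 (so the standard error is 4)
mu0, mu1, sigma, n = 150, 160, 20, 25
se = sigma / np.sqrt(n)

for alpha in [0.10, 0.05, 0.01]:
    # Right-tailed test: reject H0 when xbar exceeds this cutoff
    cutoff = norm.ppf(1 - alpha, loc=mu0, scale=se)
    # beta = P(fail to reject H0 | H1 true) = P(xbar < cutoff | mu = mu1)
    beta = norm.cdf(cutoff, loc=mu1, scale=se)
    print(f"alpha = {alpha:.2f} -> beta = {beta:.3f}")

Running this shows β climbing (roughly 0.11, 0.20, 0.43) as α shrinks from 0.10 to 0.01 – exactly the trade-off described above.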

Coming back to our contingency table from above, we can fill in what we have just learned and incorporate it into what we learned previously.

P-value

A probability value (p-value) refers to the area under the distribution curve that denotes the probability of getting a result at least as extreme as the one we observe (test statistic) from our data if the null hypothesis is true. This tells us how ‘surprised’ we should be by our results – i.e. how much evidence we have against H₀ and in favor of H₁. As before, we have 3 different kinds of tests: two-tailed, right-tailed and left-tailed.

Note that for the two-tailed test, we have to multiply our one-tail area by 2, because the significance level is split evenly between the two tails (2.5% in each, if α = 0.05); otherwise we would be comparing against only half of what we want. It should come as no surprise that if your p-value is very small (meaning the test statistic lies at the far tail(s) of the distribution), it becomes more likely that H₀ is false, as results farther from the mean of the distribution provide more evidence against it.
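The tail areas can be computed directly with scipy (the z value here is an assumed example, not from the text):

from scipy.stats import norm

z = 1.96  # an assumed test statistic, for illustration only

p_right = 1 - norm.cdf(z)            # right-tailed p-value, ~0.025
p_left = norm.cdf(z)                 # left-tailed p-value, ~0.975
p_two = 2 * (1 - norm.cdf(abs(z)))   # two-tailed: double the tail area, ~0.05
print(p_right, p_left, p_two)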

In our last bootcamp, we learned hypothesis testing with the z-statistic using the critical value approach. With the p-value approach we follow the same first 4 steps outlined below; where we differ is step 5. Instead of determining our critical points, we determine the p-value corresponding to the calculated statistic.

One Sample z-test (P-value approach)

We are now going to discuss hypothesis testing using a p-value approach. Our assumptions are the same as previously: simple random sample, normal and/or large population, and σ is known.

  1. State our null and alternative hypotheses – H₀ (μ = μ₀) and H₁.
  2. Determine from our hypotheses if this constitutes a right (μ > μ₀), left (μ < μ₀) or two-tailed test (μ ≠ μ₀).
  3. Ascertain our significance level α.
  4. Compute the test statistic based on our data.
  5. Determine our p-value (area under distribution) based on the test statistic from our corresponding z-table.
  6. Compare our p-value to α.
  7. Make the decision to reject or not reject the null hypothesis. If p-value ≤ α, reject the null hypothesis. If p-value > α, do not reject the null hypothesis.
  8. Interpret your findings – ‘there is sufficient evidence to support the rejection of the null’ OR ‘there is not sufficient evidence to reject the null hypothesis’.
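Putting the steps together, here is a minimal sketch of the procedure (the function one_sample_z_test is my own illustrative helper, not a standard library routine):

from scipy.stats import norm
import numpy as np

def one_sample_z_test(xbar, mu0, sigma, n, alpha=0.05, tail="two"):
    """Steps 4-7 of the p-value approach (illustrative helper)."""
    z = (xbar - mu0) / (sigma / np.sqrt(n))   # step 4: test statistic
    if tail == "right":
        p = 1 - norm.cdf(z)                   # step 5: right-tail area
    elif tail == "left":
        p = norm.cdf(z)                       # step 5: left-tail area
    else:
        p = 2 * (1 - norm.cdf(abs(z)))        # step 5: two-tailed area
    return z, p, p <= alpha                   # steps 6-7: reject if p <= alpha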

If you are wondering whether these 2 methods (critical value vs. p-value) give different results – they don’t. They are just two different ways to think about the same phenomenon. With the critical value approach, you compare your test statistic to the critical value directly. With the p-value approach, you compare the area under the curve (AUC) associated with that test statistic to α.

Example (p-value approach). A librarian wishes to see if the mean number of books her students check out on a daily basis is >50. She collects a sample of books checked out on 30 random weekdays during the school year, which is found to have a mean of 52 books. At α = 0.05, test the claim that the mean number of books checked out is >50 per school day. The population standard deviation is 4 books.

Let’s check our assumptions: 1) we have obtained a random sample, 2) sample size is n=30, which is ≥ 30, 3) the standard deviation (σ) of the population is known.

  1. State the hypotheses. H₀: μ = 50, H₁: μ > 50
  2. Directionality of the test: right-tailed test (since we are testing ‘greater than’)
  3. Our significance level is α = 0.05
  4. Compute the test statistic: z = (x̄ − μ₀)/(σ/√n) = (52 − 50)/(4/√30) ≈ 2.74

5. Determine our p-value (area under distribution) based on the test statistic from our corresponding z-table (right-tailed). We read the z-table by finding 2.7 on the rows and then 0.04 in the columns to get z = 2.74; the area to the left of z_calc is 0.99693, therefore the area to the right is 1 − 0.99693 = 0.00307 – our p-value.

6. Our p-value of 0.00307 < α = 0.05, therefore we reject the null hypothesis.

7. Interpret our findings. There is enough evidence to support the claim that the mean number of books checked out in a day is >50.
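We can sanity-check this example with scipy (a quick sketch, not part of the original worked solution):

from scipy.stats import norm
import numpy as np

# Librarian example: xbar = 52, mu0 = 50, sigma = 4, n = 30
z = (52 - 50) / (4 / np.sqrt(30))  # ~2.74
p = 1 - norm.cdf(z)                # right-tail area, ~0.0031
print(z, p, p < 0.05)              # True -> reject H0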

Example (p-value approach). The Nature family of scientific journals reports that the average time to review a paper is 8 weeks. To see if the average review time at an individual journal is different, a researcher selects a random sample of 35 papers, which have an average time to review of 9 weeks. The standard deviation (σ) is 1 week. At α = 0.01, can it be concluded that the average time to review is different from 8 weeks?

Let’s check our assumptions: 1) we have obtained a random sample, 2) sample size is n = 35, which is ≥ 30, 3) the standard deviation of the population is provided.

  1. State the hypotheses. H₀: μ = 8, H₁: μ ≠ 8
  2. Directionality of the test: two-tailed
  3. α = 0.01
  4. Compute the z-test statistic: z = (x̄ − μ₀)/(σ/√n) = (9 − 8)/(1/√35) ≈ 5.92

5. Determine our p-value (area under distribution) based on the test statistic (two-tailed). A z-value of 5.92 lies beyond the range of most z-tables: the area to the right of z_calc is essentially 0, and even after doubling it for the two-tailed test the p-value is approximately 0.

6. Since our p-value < α = 0.01, the statistic falls into the rejection region, and we reject H₀.

7. Interpret our findings. There is sufficient evidence to support the claim that the average time to review is different from 8 weeks.
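Again, a quick scipy check of the arithmetic (illustrative only):

from scipy.stats import norm
import numpy as np

# Journal example: xbar = 9, mu0 = 8, sigma = 1, n = 35
z = (9 - 8) / (1 / np.sqrt(35))    # ~5.92
p = 2 * (1 - norm.cdf(abs(z)))     # two-tailed p-value, ~3e-9
print(z, p, p < 0.01)              # True -> reject H0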

Confidence Intervals When σ is Unknown

In prior bootcamps, the examples we’ve worked through have had the population standard deviation, σ, provided. However, it is rarely known in practice. To compensate when σ is unknown, we use the t-distribution and t-values rather than the standard normal N(0,1) distribution and z-values. The equations look very similar, though the ‘look up’ table they correspond to is different:

In short, if σ (the population standard deviation) is known, use the z-distribution. Otherwise, use the t-distribution, working with s, the sample standard deviation.

t-Distribution

The t-distribution is parameterized by the degrees of freedom (DOF), which can be any positive integer; the value of α comes in when we look up critical values. Note that for a one-sample test, DOF = n (number of samples) − 1. There are several properties of the t-distribution:

  • Shape is determined by DOF = n − 1
  • Can take on any value in (−∞, ∞)
  • It is symmetric around 0, but flatter than our standard normal curve – N(0,1)

As the DOF of a t-distribution increases, it approaches the standard normal distribution – N(0,1). The t-distribution is denoted:

Plotting different t-distributions with different DOF we get (code below):

from scipy.stats import t
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

np.random.seed(0)  # make the random draws reproducible

# Generate 1000 random samples from t-distributions with different DOF
x = t.rvs(df=12, size=1000)
y = t.rvs(df=2, size=1000)
z = t.rvs(df=8, size=1000)

# Overlay the kernel density estimates of the three samples
fig = sns.kdeplot(x, color="r")
fig = sns.kdeplot(y, color="b")
fig = sns.kdeplot(z, color="y")
plt.legend(loc='upper right', labels=['df = 12', 'df = 2', 'df = 8'])
plt.show()

For a curve with 15 degrees of freedom, what is the critical value of t for t(0.05) in a right-tailed test? To find this, use a t-table: the value is 1.753.
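We can confirm the table lookup with scipy’s percent point function (the inverse CDF):

from scipy.stats import t

# Right-tailed critical value: area 0.05 in the upper tail, 15 DOF
t_crit = t.ppf(1 - 0.05, df=15)
print(round(t_crit, 3))  # 1.753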

One Sample t-Interval

Now that you are comfortable determining t-values from t-tables, we can translate this into determining a confidence interval (CI) for a given μ when σ is unknown.

Again, our assumptions are that we have obtained a simple random sample, that the sample comes from a normal population or is large enough to assume normality, and that σ is unknown. The procedure is very similar to how we calculated CIs with the z-distribution:

  1. Determine the confidence level: 1 – α
  2. Determine DOF = n – 1 (where n is sample size)
  3. Reference a t-table to find the t-value for t_{α/2}
  4. Compute the mean and standard deviation of the sample, x̄ and s respectively
  5. The confidence interval is represented as: x̄ ± t_{α/2} · (s/√n)
  6. Summarize the CI. ‘We can be (confidence level)% confident that the population mean falls within these bounds.’

Example. Confidence Interval in t-test. A diabetes study recruited 25 participants to investigate the effect of a new diabetes drug program on a1c levels. After 1 month, the subjects’ a1c were recorded below. Use the data to find the 95% confidence interval for the mean decrease in hemoglobin a1c, μ. Assume the sample comes from a normally distributed population.

6.1  5.9  7.0  6.5  6.4 
5.3  7.1  6.3  5.5  7.0
7.6  6.3  6.6  7.2  5.7
6.0  5.4  5.8  6.2  6.4
5.9  6.2  6.6  6.8  7.1 
  1. Our confidence level as specified in the question is 0.95, with α = 0.05
  2. DOF = n–1 = 25–1 = 24
  3. The value as per the t-table corresponding to t_{α/2} = t(24, 0.025) is 2.064

4. The mean x̄ = 6.356 and the sample standard deviation s = 0.6021

5. Using our formula: 6.356 ± 2.064 × (0.6021/√25) = 6.356 ± 0.249, giving a CI of (6.11, 6.60)

6. Our interpretation of this CI: we can be 95% confident that the mean a1c in this population lies somewhere between 6.11 and 6.60
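A quick scipy check of this interval (a sketch; the variable names are mine):

from scipy.stats import t
import numpy as np

a1c = np.array([6.1, 5.9, 7.0, 6.5, 6.4,
                5.3, 7.1, 6.3, 5.5, 7.0,
                7.6, 6.3, 6.6, 7.2, 5.7,
                6.0, 5.4, 5.8, 6.2, 6.4,
                5.9, 6.2, 6.6, 6.8, 7.1])

n = len(a1c)
xbar = a1c.mean()
s = a1c.std(ddof=1)                     # sample standard deviation, ~0.6021
t_half = t.ppf(1 - 0.05 / 2, df=n - 1)  # t_{alpha/2} with 24 DOF, ~2.064
margin = t_half * s / np.sqrt(n)
print(xbar - margin, xbar + margin)     # ~(6.11, 6.60)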

t-Test: P-value vs. Critical Value

As indicated earlier in this article when describing the difference between the z-test approaches (critical value vs. p-value), the same holds true for the t-test. They are just two different ways to think about the same phenomenon. With the critical value approach, you compare your test statistic to the critical value directly. With the p-value approach, you compare the area under the curve (AUC) associated with that test statistic to α. The only difference is that the t-test uses the sample standard deviation s rather than σ, so we also calculate degrees of freedom (DOF) when finding our values/areas. Let’s work the same problem implementing both solution types.

Example. One Sample t-test, critical value approach. A swimming coach claims that male swimmers are on average taller than their female counterparts. A sample of 15 male swimmers has a mean height of 188 cm with a standard deviation of 5 cm. If the average height of female swimmers is 175 cm, is there enough evidence to support this claim at α = 0.05? Assume the population is normally distributed.

Let’s first check our assumptions:

  • We have obtained a random sample
  • Population is normally distributed
  • The population standard deviation is unknown – only the sample standard deviation is provided, hence the t-test
  1. State the hypotheses. H₀: μ = 175 cm, H₁: μ > 175 cm
  2. Directionality of the test: right-tailed
  3. α = 0.05, df = 15–1 = 14
  4. Compute t_calc = (x̄ − μ₀)/(s/√n) = (188 − 175)/(5/√15) ≈ 10.07
  5. Find the critical value, based on the t-table. Since α = 0.05 and the test is a right-tailed test, the critical value is t_crit(0.05) = 1.761. Reject H₀ if t_calc > 1.761.
  6. Since 10.07 > 1.761, it falls into the rejection region, and we reject H₀.

7. Interpret our findings. There is sufficient evidence to support the claim that the average male swimmer is taller than the average female swimmer.

Example. One Sample t-test, p-value approach

Let’s work through the same problem above, but using a p-value approach.

  1. The corresponding p-value (area under the distribution) based on the t-statistic (10.07) with 14 DOF is < 0.00001.
  2. Since our p-value < α, the statistic falls into the rejection region, and we reject H₀.
  3. Interpret our findings. There is sufficient evidence to support the claim that the average male swimmer is taller than the average female swimmer.

Summary

In this bootcamp we’ve covered errors, z- versus t-testing, and how to perform statistical calculations using the 2 main approaches (p-value and critical value). After reading this article, you should understand how changing the rate of Type I error influences the rate of Type II error. As an FYI, Type I errors are considered to be more egregious. The rationale for this is that the status quo (H₀) is our default assumption, and it is better to default to assuming no effect than to claim a false one.

All images unless otherwise stated are created by the author.


Additionally, if you like seeing articles like this and want unlimited access to my articles and all those supplied by Medium, consider signing up using my referral link below. Membership is $5(USD)/month; I make a small commission that in turn helps to fuel more content and articles!

Join Medium with my referral link – Adrienne Kline

