
Statistics Bootcamp 5: What is Normal?

Learn the math and methods behind the libraries you use daily as a data scientist

Statistics Bootcamp

Image by Author

To more formally address the need for a statistics lecture series on Medium, I have started to create a series of statistics boot camps, as seen in the title above. These will build on one another and as such will be numbered accordingly. The motivation for doing so is to democratize the knowledge of statistics in a ground-up fashion, addressing the need for more formal statistics training in the data science community. These will begin simple and expand upwards and outwards, with exercises and worked examples along the way. My personal philosophy when it comes to engineering, coding, and statistics is that if you understand the math and the methods, the abstraction now seen in a multitude of libraries falls away and allows you to be a producer, not only a consumer, of information. Many facets of these boot camps will be a review for some learners/readers, but having a comprehensive understanding and a resource to refer back to is important. Happy reading/learning!

This article is dedicated to introducing the normal distribution and its properties.

What is Normal?

Medical researchers have determined so-called normal intervals for a person’s blood pressure, cholesterol, and triglycerides.

Ex. systolic blood pressure: 110–140 (these metrics differ for in office versus home blood pressure measurements)

But our question remains, how does one determine the so-called normal intervals?

The Normal Distribution

The normal distribution (Gaussian) is a continuous, symmetric, bell-shaped distribution of a variable, denoted by N(μ,σ).

The mathematical equation representing it is the probability density function, or p.d.f. for short:

f(x) = (1/(σ·sqrt(2π))) · e^(−(x−μ)²/(2σ²))

where the mean is denoted by μ and the standard deviation by σ.

Properties:

  • Bell-shaped with the two tails continuing indefinitely in both directions
  • Symmetric about center – mean, median, and mode
  • Total area under the distribution curve is equal to 1
  • Area under the curve represents probability

Example. Last year, high schools in Chicago had 3,264 female 12th grade students enrolled. The mean height of these students was 64.4 inches and the standard deviation was 2.4 inches. Here the variable is height and the population consists of the 3,264 female students attending 12th grade. We model it with the normal curve with the same mean and standard deviation: μ = 64.4 and σ = 2.4.

Thus we can approximate the percentage of students between 67 and 68 inches tall by the area under the curve between 67 and 68. That area is 0.0735, shown on the graph as occurring between 67 and 68 inches.
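This area can also be computed directly from the normal CDF, Φ(z) = (1 + erf(z/√2))/2, without a table. A quick sketch using only the stated mean and standard deviation:

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma), via the error function."""
    z = (x - mu) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 64.4, 2.4
area = normal_cdf(68, mu, sigma) - normal_cdf(67, mu, sigma)
print(f"P(67 < X < 68) = {area:.4f}")  # roughly 0.073
```

The value differs slightly from a z-table answer because tables use z-scores rounded to two decimals.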

Empirical Rule – Revisited

Although we covered this in a previous bootcamp, spaced repetition is the best way to ensure recall and retention! So here we have the empirical rule plotted for graphical emphasis.

  1. Approximately 68.3% of the data values will fall within 1 standard deviation of the mean, x±1s for samples and with endpoints μ±1σ for populations
  2. Approximately 95.4% of the data values will fall within 2 standard deviations of the mean, x±2s for samples and with endpoints μ±2σ for populations
  3. Approximately 99.7% of the data values will fall within 3 standard deviations of the mean, x±3s for samples and with endpoints μ±3σ for populations
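These three percentages fall straight out of the standard normal CDF; a quick sketch to verify them:

```python
import math

def phi(z):
    """Standard normal CDF: P(Z < z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

for k in (1, 2, 3):
    # Probability mass within k standard deviations of the mean
    within = phi(k) - phi(-k)
    print(f"within {k} sd: {within:.3f}")  # 0.683, 0.954, 0.997
```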

Standard Normal Distribution N(0,1)

Different normal distributions have different means and standard deviations. A standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. The p.d.f. for the standard normal distribution is:

f(z) = (1/sqrt(2π)) · e^(−z²/2)

Looking at the equation above, it is the simplification of the p.d.f. of the normal distribution with μ = 0 and σ = 1. Standardization is the process by which we convert any normal distribution into the standard normal distribution N(0,1) for comparison. We can convert the values of any normally distributed variable using the equation below, which is called the z-score:

z = (x − μ)/σ

Compare this with the previously defined p.d.f. for the normal curve above.
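As a small illustration, standardization is a one-line computation. Reusing the height example from above (μ = 64.4, σ = 2.4):

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

print(z_score(67, 64.4, 2.4))    # about 1.08 -- 67 inches is ~1 sd above the mean
print(z_score(64.4, 64.4, 2.4))  # 0.0 -- the mean always maps to z = 0
```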

Area Under the Standard Normal Distribution

After converting your normal distribution into a standard normal distribution, look up z-scores in the Standard Normal Distribution table (z-table) to determine the area under the curve. We use this table to find the area under the curve (highlighted in the figure below) that lies (a) to the left of a specified z-score b) to the right of a specified z-score, and c) between two specified z-scores.

Examples.

a) Find the area to the left of z = 2.06: P(Z < 2.06) = 0.9803 = 98.03%
b) Find the area to the right of z = −1.19: P(Z > −1.19) = 0.8830 = 88.30%
c) Find the area between z = −1.37 and z = 1.68: P(−1.37 < Z < 1.68) = P(Z < 1.68) − P(Z < −1.37) = 0.9535 − 0.0853 = 0.8682 = 86.82%
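The same three areas can be reproduced without a printed table, using the standard normal CDF; a small sketch:

```python
import math

def phi(z):
    """Standard normal CDF: P(Z < z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

a = phi(2.06)               # area to the left of z = 2.06
b = 1 - phi(-1.19)          # area to the right of z = -1.19
c = phi(1.68) - phi(-1.37)  # area between z = -1.37 and z = 1.68
print(f"{a:.4f} {b:.4f} {c:.4f}")  # 0.9803 0.8830 0.8682
```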

Remember that a z-table defaults to giving areas to the LEFT of a given z-score.

Example. During the 2008 baseball season, Mark recorded his distances (in meters) for each home run and found they are normally distributed with a mean of 100 and a standard deviation of 16. Determine the probability of his next home run falling between 115 and 140 meters: P(115<X<140) = ?, where X = distance of each home run.

Sketch the normal curve: X ~ N(100,16), with μ = 100 and σ = 16.

Compute the z-scores:
z1 = (115–100)/16 = 0.94
z2 = (140–100)/16 = 2.50

P(z1 < Z < z2) = ? Where Z~ N(0,1).

The area under the standard normal curve that lies between 0.94 and 2.50 is the same as the area between 115 and 140 under the normal curve with mean 100 and standard deviation 16, i.e. P(115<X<140) = P(0.94<Z<2.50) = P(Z<2.50) – P(Z<0.94).

The area to the left of 0.94 is 0.8264, and the area to the left of 2.50 is 0.9938. The required area is therefore 0.9938–0.8264 = 0.1674. The chance that Mark’s next home run falls between 115 and 140 meters is 0.1674 = 16.74%.
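The same computation can be done without a table. The sketch below gives a slightly different fourth decimal because the table-based answer rounds z-scores to two decimals:

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma)."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 100, 16
prob = normal_cdf(140, mu, sigma) - normal_cdf(115, mu, sigma)
print(f"P(115 < X < 140) = {prob:.4f}")  # about 0.168
```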

Example. What if you were given an area under the standard normal distribution curve and asked to find the z-score? In that case, work the table in reverse: locate the given area in the body of the z-table and read the corresponding z-score from the row and column margins.

Assessing Normality

Construct a normal probability plot, given a dataset:

A normal probability plot is a plot of the observed data versus normal scores. If the plot is linear, the variable is approximately normally distributed; if it is not linear, the variable is not normally distributed. Plotting the data above, we get:

Normal Q-Q Plot

A Q-Q plot (quantile-quantile plot) is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other.

If the data are truly sampled from a Gaussian distribution, the Q-Q plot will be linear:

Code to generate the plot above:

import numpy as np
import statsmodels.api as statmod
import matplotlib.pyplot as plt
#create dataset with 100 values that follow a normal distribution
data = np.random.normal(0,1,100)
#create Q-Q plot with 45-degree line added to plot
fig = statmod.qqplot(data, line='45')
plt.show()

Distribution of Sample Means

The sample means x̄ from different samples represent a random variable and follow a distribution.

A sampling distribution of sample means is a distribution using the means computed from all possible random samples of a specific size taken from a population. If the samples are randomly selected with replacement, the sample means, for the most part, will be somewhat different from the population mean μ. These differences are caused by sampling error. Sampling error is the difference between the sample measure and the corresponding population measure, due to the fact that the sample is not a perfect representation of the population.

Example. Suppose a professor gave an 8-point quiz to a small class of four students. The results of the quiz were 2, 4, 6, and 8. For the sake of discussion, assume that the four students constitute the population. The mean of the population is:

μ = (2 + 4 + 6 + 8)/4 = 5

The standard deviation of the population is:

σ = sqrt(((2−5)² + (4−5)² + (6−5)² + (8−5)²)/4) = sqrt(5) ≈ 2.236

Now, if all possible samples of size 2 are taken with replacement and the mean of each sample is found, we obtain 16 sample means. The mean of the sample means is:

μ_x̄ = 5 = μ

The standard deviation of the sample means is:

σ_x̄ = sqrt(5)/sqrt(2) ≈ 1.581 = σ/sqrt(2)
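We can enumerate all 16 ordered samples of size 2 (drawn with replacement) and verify these two properties directly; a quick sketch:

```python
import itertools
import math

population = [2, 4, 6, 8]

# All ordered samples of size 2, drawn with replacement: 4 * 4 = 16 samples
samples = list(itertools.product(population, repeat=2))
sample_means = [sum(s) / 2 for s in samples]

mean_of_means = sum(sample_means) / len(sample_means)
var_of_means = sum((m - mean_of_means) ** 2 for m in sample_means) / len(sample_means)

print(mean_of_means)            # 5.0 -- equals the population mean
print(math.sqrt(var_of_means))  # about 1.581 -- equals sigma / sqrt(2)
```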

Properties of Sample Means

  1. The mean of the sample means will be the same as the population mean: μ_x̄ = μ.
  2. The standard deviation of the sample means will be smaller than the standard deviation of the population, and equal to the population standard deviation divided by the square root of the sample size: σ_x̄ = σ/sqrt(n). A larger sample size therefore means less sampling error.

The standard deviation of the sample means is called the standard error of the mean.

The larger the sample size:

  • the closer the sample mean x̄ approximates the population mean μ
  • the less the sampling error
  • the smaller the standard deviation of the sample means around the population mean

Distribution of sample means for a large number of samples

  1. Conclusion: 95.4% of sample means will fall within 2σ_x̄ of each side of μ_x̄ (= μ).
  2. In other words: if we have 100 samples, around 95 of the sample means will fall within 2σ_x̄ of each side of μ.

Confidence Interval

If we repeat the above experiment several times, and each time construct an interval extending 2 standard errors to each side of the estimate x̄, then we can be confident (think empirical rule) that 95.4% of these intervals will cover the population parameter, μ.
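A simulation makes this concrete. The sketch below (using an arbitrary population and seed of my choosing) repeatedly draws samples, builds x̄ ± 2σ/sqrt(n) intervals, and checks how often they cover μ; the coverage should land near 95.4%:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 50.0, 10.0, 30, 2000

covered = 0
for _ in range(trials):
    sample = rng.normal(mu, sigma, size=n)
    xbar = sample.mean()
    half_width = 2 * sigma / np.sqrt(n)  # 2 standard errors
    if xbar - half_width < mu < xbar + half_width:
        covered += 1

coverage = covered / trials
print(f"coverage: {coverage:.3f}")  # close to 0.954
```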

Central Limit Theorem

The central limit theorem states that as the sample size n increases without limit, the shape of the distribution of sample means taken with replacement from a population with mean μ and standard deviation σ will approach a normal distribution. This distribution will have mean μ and standard deviation σ/sqrt(n).

Why is the central limit theorem (CLT) so important?

If the sample size is sufficiently large, the CLT can be used to answer questions about sample means, regardless of what the distribution of the population is.

That is:

μ_x̄ = μ and σ_x̄ = σ/sqrt(n)

and converting to a z-score:

z = (x̄ − μ)/(σ/sqrt(n))

  1. When the variable from the population is normally distributed, the distribution of the sample means will be normally distributed, for any sample size, n.
  2. When the variable from the population is not normally distributed, the rule of thumb is: a sample size of 30 or more is needed to use a normal distribution to approximate the distribution of the sample mean. The larger the sample size, the better the approximation will be. That is, we need n ≥ 30 for the CLT to kick in.
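To see the CLT at work on a decidedly non-normal population, the sketch below (with an exponential population and a seed chosen for illustration) draws many sample means and checks that their center and spread match μ and σ/sqrt(n):

```python
import numpy as np

rng = np.random.default_rng(42)

# Exponential population: heavily skewed, definitely not normal
mu, sigma = 1.0, 1.0        # for Exponential(1), mean = sd = 1
n, num_samples = 50, 5000   # n >= 30, so the CLT should apply

sample_means = rng.exponential(mu, size=(num_samples, n)).mean(axis=1)

print(sample_means.mean())  # close to mu = 1.0
print(sample_means.std())   # close to sigma / sqrt(n), about 0.141
```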

Summary of z-score formulas

The first is used to gain information about an individual data point when the variable is normally distributed:

z = (x − μ)/σ

Note: the z-table used with this formula DEFAULTS TO THE LEFT!

The second is used, when applying the central limit theorem, to gain information about a sample mean when the variable is normally distributed OR when the sample size ≥ 30:

z = (x̄ − μ)/(σ/sqrt(n))

Example. The average amount of meat that a person consumes per year is 218.4 pounds. Assume that the standard deviation is 25 pounds and the distribution is approximately normal. a) Find the probability that a person selected at random consumes less than 224 pounds per year. b) If a sample of 40 individuals is selected, find the probability that the mean of the sample will be less than 224 pounds per year.

Solution: a) The question asks about an individual person:

z = (224 − 218.4)/25 = 0.22, so P(X < 224) = P(Z < 0.22) = 0.5871 = 58.71%

b) The question concerns the mean of a sample of size 40:

z = (224 − 218.4)/(25/sqrt(40)) = 5.6/3.95 = 1.42, so P(x̄ < 224) = P(Z < 1.42) = 0.9222 = 92.22%

The large difference between these two probabilities is due to the fact that the distribution of sample means is much less variable than the distribution of individual data values. (Note: an individual person is the equivalent of saying n = 1.)
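Both parts can be verified with the normal CDF; the only thing that changes between them is the standard deviation used (a sketch; table-based answers differ in the fourth decimal because of z-score rounding):

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma)."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma, n = 218.4, 25, 40

# a) individual: use sigma directly
p_individual = normal_cdf(224, mu, sigma)

# b) sample mean: use the standard error sigma / sqrt(n)
p_sample_mean = normal_cdf(224, mu, sigma / math.sqrt(n))

print(f"{p_individual:.4f} {p_sample_mean:.4f}")  # about 0.5886 and 0.9217
```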

The Normal Approximation to The Binomial Distribution

Recall that the binomial distribution is determined by n (the number of trials) and p (the probability of success). When p is approximately 0.5, and as n increases, the shape of the binomial distribution becomes similar to that of a normal distribution.

When p is close to 0 or 1 and n is relatively small, a normal approximation is inaccurate. As a rule of thumb, statisticians generally agree that a normal approximation should be used only when np and nq are both greater than or equal to 5, i.e. np ≥ 5 and nq ≥ 5.

import math
import matplotlib.pyplot as plt

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n trials with success probability p."""
    return math.comb(n, x) * (p ** x) * ((1 - p) ** (n - x))

n = 50
p = 0.5

# Evaluate the binomial pmf at every possible outcome 0..n
xs = list(range(n + 1))
pmf = [binomial_pmf(x, n, p) for x in xs]

plt.bar(x=xs, height=pmf)
plt.show()

A correction for continuity is a correction employed when a continuous distribution is used to approximate a discrete distribution.

For all cases: μ = np, σ = sqrt(npq), with np ≥ 5 and nq ≥ 5.

Step-wise Process for the Normal Approximation to the Binomial Distribution

  1. Check to see whether the normal approximation can be used
  2. Find the mean μ, and standard deviation σ
  3. Write the problem using probability notation, e.g. P(X=x)
  4. Rewrite the problem leveraging the continuity correction factor and show the corresponding AUC of the normal distribution
  5. Calculate corresponding z values
  6. Solve!

Example. A magazine reported that 6% of Americans look at their phone while driving. If 300 drivers are selected at random, find the exact probability that 25 of them look at their phone while driving, then use the normal approximation to find the approximate probability. Here p = 0.06, q = 0.94, n = 300, X = 25. The exact probability comes from the binomial distribution:

Normal approximation approach:

  1. Check whether the normal approximation can be used: np = 300 × 0.06 = 18, nq = 300 × 0.94 = 282. Since np ≥ 5 and nq ≥ 5, the normal distribution can be used.

  2. Find the mean and standard deviation: μ = np = 18, σ = sqrt(npq) = sqrt(300 × 0.06 × 0.94) ≈ 4.11.

  3. Write the problem in probability notation: P(X = 25).

  4. Rewrite the problem using the continuity correction factor: P(25 − 0.5 < X < 25 + 0.5) = P(24.5 < X < 25.5).

  5. Find the corresponding z values: z1 = (24.5 − 18)/4.11 ≈ 1.58, z2 = (25.5 − 18)/4.11 ≈ 1.82.

  6. Solve: P(1.58 < Z < 1.82) = P(Z < 1.82) − P(Z < 1.58) = 0.9656 − 0.9429 = 0.0227.

The probability that 25 of the 300 drivers look at their phone while driving is approximately 2.27%.
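The whole example can be checked end to end: the exact binomial probability versus the continuity-corrected normal approximation (a sketch):

```python
import math

def phi(z):
    """Standard normal CDF: P(Z < z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p, x = 300, 0.06, 25
q = 1 - p

# Exact binomial probability P(X = 25)
exact = math.comb(n, x) * p**x * q**(n - x)

# Normal approximation with continuity correction
mu, sigma = n * p, math.sqrt(n * p * q)
approx = phi((x + 0.5 - mu) / sigma) - phi((x - 0.5 - mu) / sigma)

print(f"exact = {exact:.4f}, approx = {approx:.4f}")
```

The two agree to roughly two decimal places, consistent with the 2.27% worked out above; the small remaining gap comes from rounding z-scores in the table.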

Wrap-Up

In this bootcamp, we have covered the normal and standard normal distributions, z-scores and areas under the curve, assessing normality with normal probability and Q-Q plots, the distribution of sample means and the central limit theorem, and the normal approximation to the binomial distribution. Look out for the next installment of this series, where we will continue to build our knowledge of stats!!

Previous boot camps in the series:

#1 Laying the Foundations
#2 Center, Variation and Position
#3 Probability… Probability
#4 Bayes, Fish, Goats and Cars

All images unless otherwise stated are created by the author.


Additionally, if you like seeing articles like this and want unlimited access to my articles and all those supplied by Medium, consider signing up using my referral link below. Membership is $5(USD)/month; I make a small commission that in turn helps to fuel more content and articles!


