
Likelihood, Probability, and the Math You Should Know

What role does likelihood play in machine learning?

Photo by Saad Ahmad on Unsplash

Likelihood is a confusing term. Likelihood is not a probability, but it is proportional to a probability; the two terms can’t be used interchangeably. In this post, we will dissect likelihood as a concept and understand its importance in machine learning.

Intuition

Let us understand likelihood and how it is different from a probability distribution with an imaginary city, Databerg (a cringe name, but bear with me). Let’s also imagine we have access to the pricing data of all houses in this city. I don’t know exactly how this distribution looks since Databerg isn’t a real city, but intuitively I’d say we would notice many houses that are moderately priced and a few houses that are very expensive. If one were to plot a distribution of these prices, it might look something like this.

Figure 1

The continuous house price values are plotted along the x-axis and the probability density values are plotted on the y-axis. Figure 1 thus represents a probability density function. The phrase probability density function is often used interchangeably with probability distribution function when talking about continuous values, such as house prices. So throughout this article, we will use the more relatable phrase probability distribution function to represent both. But don’t be confused when other sources talk about the same concept using the phrase probability density function.

Since we are dealing with a probability distribution function, the area under the orange curve is 1. This means that if we were to randomly choose a house in Databerg, the probability of this price being some positive real number is 100%; this makes sense. We can also determine the probability a house price lies between two price points. Consider the following graph.

Figure 2

Let’s say the area under the red striped section is 0.45. We can then make the following statement.

Given this house price distribution, the probability that a house is priced between $600K and $800K is 0.45.
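Here is a minimal Python sketch (using NumPy and SciPy) of how one might compute such a probability from an assumed log-normal price distribution; the parameters below are purely illustrative, so the resulting number will not match the 0.45 above. The probability of landing in an interval is just the difference of the cumulative distribution function evaluated at its endpoints.

```python
import numpy as np
from scipy import stats

# Hypothetical log-normal distribution of Databerg house prices:
# the log of the price is assumed to have mean 13.2 and standard deviation 0.4.
mu, sigma = 13.2, 0.4
price_dist = stats.lognorm(s=sigma, scale=np.exp(mu))

# Probability that a randomly chosen house costs between $600K and $800K:
# the area under the density curve between the two price points.
prob = price_dist.cdf(800_000) - price_dist.cdf(600_000)
print(f"P($600K < price < $800K) = {prob:.3f}")

# Sanity check: the total area under the density curve is 1.
print(f"P(price > 0) = {price_dist.cdf(np.inf):.3f}")
```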

Now that we have a notion of probability, let us invert our setup to talk about likelihood. Instead of being given all housing information in Databerg (and hence the probability distribution function of house prices shown in Figure 1), let us assume we are given the prices of only 10,000 houses. Our goal now is to determine (or at least approximate) the probability distribution function in Figure 1.

Figure 3

Each dot near the x-axis is a house. How well the current distribution fits this data is quantified by the likelihood. Let’s say the distribution above is a log normal distribution with mean 13.2 and standard deviation 0.4 (the parameters of the logarithm of the price), and that its likelihood given the house samples is 1.78. We can then make the following statement:

Given these 10,000 houses, the likelihood of the distribution’s mean being 13.2 and standard deviation being 0.4 is 1.78.

Note that the value of a likelihood can be greater than 1, so it is not a probability. In fact, the likelihood value of 1.78 carries more meaning when compared with the likelihoods of other distributions with respect to the same data.
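As a rough sketch of such a comparison, we can simulate 10,000 prices from an assumed “true” log-normal distribution and score a few candidate parameter settings against them with SciPy. The log of the likelihood is used here purely to keep the numbers manageable (more on that later in the post); a higher value indicates a better fit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate 10,000 observed house prices from an assumed "true" log-normal
# distribution (parameters are invented for illustration).
true_mu, true_sigma = 13.2, 0.4
prices = stats.lognorm(s=true_sigma, scale=np.exp(true_mu)).rvs(10_000, random_state=rng)

# A few candidate (mean, std) settings for the log of the price,
# like the dotted curves in Figure 4.
candidates = [(12.8, 0.4), (13.2, 0.4), (13.2, 0.8), (13.6, 0.3)]

for mu, sigma in candidates:
    dist = stats.lognorm(s=sigma, scale=np.exp(mu))
    # Log-likelihood of this candidate: sum of log-densities over all prices.
    log_lik = np.sum(dist.logpdf(prices))
    print(f"mean={mu}, std={sigma}: log-likelihood = {log_lik:,.1f}")
```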

Figure 4

In Figure 4, each dotted-line distribution is obtained by changing the mean and standard deviation of the log normal distribution. For each distribution, we can determine the likelihood, which quantifies how well that distribution fits the 10,000 data points. Let us now continue our discussion of likelihood with a definition.

Defining Likelihood

R. A. Fisher’s paper On the Mathematical Foundations of Theoretical Statistics gives us a clear definition of likelihood.

Likelihood that any parameter (or set of parameters) should have any assigned value (or set of values) is proportional to the probability that if this were so, the totality of observations should be that observed.

The set of parameters refers to the parameters of the distribution. In Figure 3, the house price distribution we are assuming is log normal. The parameters of this distribution are the mean and standard deviation. The assigned set of values are 13.2 for the mean and 0.4 for the standard deviation. The observations are the 10,000 house prices. Given this information, we can translate the definition into math.

Equation 1: L(μ, σ | y₁, y₂, …, y₁₀₀₀₀) ∝ P(y₁, y₂, …, y₁₀₀₀₀ | μ, σ)

The L on the left-hand side is the likelihood function. It is a function of the parameters of the probability density function. The P on the right-hand side is a conditional joint probability distribution function: the probability that each house y has the price we observe, given the distribution we assumed. The likelihood is proportional to this probability, and not necessarily equal to it.

Likelihood & Machine Learning

In parametric models like linear regression and logistic regression, we are given a set of data points with the goal of finding the parameters of these models that best fit the observed data. Let’s consider the same house price example introduced in the previous section. We want to fit a statistical model to predict house prices given some information about the property, such as the number of bedrooms, the size of the house in square feet (sqft), and the age of the house.

Figure 5

These are 3 features that go into the model here, but there can be more in practice. Let us assume we want to perform a linear regression. Because of this assumption, the input features and output label are related in the following way.

Equation 2: price = 𝜃₀ + 𝜃₁·(number of bedrooms) + 𝜃₂·(sqft) + 𝜃₃·(age) + 𝜖

One can write this more generally with the following form.

Equation 3: yᵢ = 𝜃₀ + 𝜃₁x₁ᵢ + 𝜃₂x₂ᵢ + 𝜃₃x₃ᵢ + 𝜖ᵢ

The x terms are the features of the iᵗʰ house, yᵢ is the price of the iᵗʰ house, the 𝜃 terms are the coefficients of each feature, and the epsilon 𝜖 denotes an irreducible error. This error stems from inherent randomness in the system and from features that are not accounted for.

To construct this linear regression model, we need to know the values of the 𝜃 terms. To find the 𝜃 terms, we need examples of house features and their prices, i.e., we need pairs of (𝑥, 𝑦) to fill in the values in Equation 3 and estimate the 𝜃 terms. This is why we need training data.

Remember our imaginary city Databerg? Let’s add details to make this data useful for training a model. We have access to 10,000 house records in Databerg. Each record has information about a house: number of bedrooms; size of the house in square feet; age of the house; and the price at which the house was valued. Since we want to predict the price of a house, the label y is this price. The other fields in this imaginary dataset are the features x that are inputs to our linear regression to predict the corresponding price. This training data looks like the table below. One house is $757,000, a second house is $780,000, a third house is $680,000, and so on.

Figure 6
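In code, this hypothetical training set is simply a feature matrix and a label vector; the rows below are invented to mirror the first few rows of Figure 6.

```python
import numpy as np

# Hypothetical Databerg records: one row per house, columns are
# [number of bedrooms, size in sqft, age in years]. Values are invented
# to mirror the first rows of Figure 6.
X = np.array([
    [3, 2100, 12],
    [4, 2350,  5],
    [2, 1600, 30],
])

# Labels: the price each house was valued at, in dollars.
y = np.array([757_000, 780_000, 680_000])

print(X.shape, y.shape)  # (3, 3) and (3,): n houses by 3 features, and n prices
```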

If we assume the values of the 𝜃 terms, we can quantify how well the linear regression model fits the training data using the likelihood function. In the end, we want to determine the 𝜃 terms that best fit the given data; in other words, we want to determine the values of the 𝜃 terms that maximize the likelihood function. This is translated into math as follows.

Equation 4: 𝜃̂₀ᴹᴸᴱ, 𝜃̂₁ᴹᴸᴱ, 𝜃̂₂ᴹᴸᴱ, 𝜃̂₃ᴹᴸᴱ = arg max_(𝜃₀, 𝜃₁, 𝜃₂, 𝜃₃) L(𝜃₀, 𝜃₁, 𝜃₂, 𝜃₃)

We will explain these terms shortly. But before that, let us get rid of this cumbersome notation by representing all the 𝜃 terms in a vector form.

Equation 5: 𝜃 = [𝜃₀, 𝜃₁, 𝜃₂, 𝜃₃]ᵀ

Now, Equation 4 can be written in the more general and concise form.

Equation 6: 𝜃̂ᴹᴸᴱ = arg max_𝜃 L(𝜃)

This notation is telling us a few things: the likelihood function L is a function of the model parameters 𝜃; the arg max function returns the value of 𝜃 that maximizes this likelihood function L. Quite literally by definition, this value of 𝜃 is the maximum likelihood estimate of 𝜃. To distinguish the variable 𝜃 used on the right-hand side from this specific value of 𝜃 we seek on the left, we add an MLE superscript to the latter. Furthermore, the hat on the 𝜃 maximum likelihood estimate shows that this value is just that – an estimate.
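To see the arg max quite literally, here is a toy sketch that grid-searches a single parameter (the mean of the log-price, with the standard deviation held fixed) over a small, simulated sample and picks the value with the largest likelihood. The sample is kept small so that the raw product of densities stays representable, a problem discussed properly below.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# A small, simulated sample of log-prices (invented "true" mean 13.2, std 0.4).
log_prices = rng.normal(loc=13.2, scale=0.4, size=20)

# Candidate values for the one parameter we vary here: the mean.
candidate_means = np.linspace(12.5, 14.0, 151)

# Likelihood of each candidate: the product of normal densities over the sample.
likelihoods = np.array([
    np.prod(stats.norm.pdf(log_prices, loc=m, scale=0.4))
    for m in candidate_means
])

# The arg max: the candidate mean with the highest likelihood.
mle_mean = candidate_means[np.argmax(likelihoods)]
print(f"Maximum likelihood estimate of the mean: {mle_mean:.3f}")
```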

Maximum Likelihood Estimation of 𝜃

From our definition of likelihood, the likelihood function is proportional to the joint probability distribution of the observed training data.

Equation 7: L(𝜃) ∝ P(y₁, y₂, …, y₁₀₀₀₀ | 𝜃)

An assumption we tend to make in machine learning is that the prices of these houses do not influence each other. In math language, the house samples are independent and identically distributed, or i.i.d.; this assumption is reasonable for the most part. Mathematically, this means the joint probability P can now be expressed as a product of individual probability distributions p.

Equation 8: P(y₁, y₂, …, y₁₀₀₀₀ | 𝜃) = p(y₁ | 𝜃) · p(y₂ | 𝜃) · ⋯ · p(y₁₀₀₀₀ | 𝜃)
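A quick numerical check of Equation 8 (a sketch with an arbitrary Gaussian model, chosen only because its densities are easy to evaluate in SciPy): the joint density of independent samples equals the product of their individual densities.

```python
import numpy as np
from scipy import stats

# Three hypothetical observations and an arbitrary Gaussian model.
samples = np.array([13.1, 13.4, 12.9])
mu, sigma = 13.2, 0.4

# Joint density of the three samples, treating them as one draw from a
# multivariate normal with independent (diagonal-covariance) components.
joint = stats.multivariate_normal(mean=[mu] * 3, cov=np.eye(3) * sigma**2).pdf(samples)

# Product of the individual densities, as in Equation 8.
product = np.prod(stats.norm.pdf(samples, loc=mu, scale=sigma))

print(np.isclose(joint, product))  # True: independence lets P factor into a product
```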

Let’s make this notation compact with the product symbol, and use n for the number of samples instead of the specific 10,000 training samples we have chosen.

Equation 9: L(𝜃) ∝ ∏ᵢ₌₁ⁿ p(yᵢ | 𝜃)

More precisely, each house price label y is modeled given the other features of that house (number of bedrooms, sqft, and age). Let us make this explicit in the conditional probability notation.

Equation 10: L(𝜃) ∝ ∏ᵢ₌₁ⁿ p(yᵢ | x₁ᵢ, x₂ᵢ, x₃ᵢ, 𝜃)

Like we did with 𝜃, let’s make x more compact using vector notation.

Equation 11: xᵢ = [1, x₁ᵢ, x₂ᵢ, x₃ᵢ]ᵀ

You can think of the 1 in the xᵢ vector as the input that multiplies the constant term 𝜃₀ in Equation 3. We can now write the likelihood function more cleanly as follows.

Equation 12: L(𝜃) ∝ ∏ᵢ₌₁ⁿ p(yᵢ | xᵢ, 𝜃)

Each of these terms on the right-hand side is a probability that lies between 0 and 1. The product of these terms, even for a modestly sized dataset, will be a number vanishingly close to 0. Machine learning is done with computers, and computers cannot calculate this product with adequate precision; this condition is called arithmetic underflow. To combat it, we maximize the logarithm of the likelihood function, converting the product into a sum of logs.
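Before writing this out (Equation 13 below), here is a small numerical illustration of the underflow problem with made-up probabilities: the raw product of thousands of small values collapses to exactly 0.0 in 64-bit floating point, while the sum of their logarithms remains an ordinary finite number.

```python
import numpy as np

rng = np.random.default_rng(2)

# 10,000 hypothetical per-sample probabilities, each a small number in (0, 1).
probs = rng.uniform(1e-4, 1e-2, size=10_000)

# Naive product: underflows to exactly 0.0 in 64-bit floating point.
print(np.prod(probs))        # 0.0

# Sum of logs: a large negative but perfectly representable number.
print(np.sum(np.log(probs)))
```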

Equation 13: log L(𝜃) ∝ ∑ᵢ₌₁ⁿ log p(yᵢ | xᵢ, 𝜃)

This formulation takes advantage of an important property of logarithms: the logarithm of a product is the sum of the logarithms of its factors. In the end, we are interested in the value of the parameters 𝜃 that maximizes the original likelihood function, not in the value of the likelihood function itself. The value of the 𝜃 vector that maximizes the original likelihood function is the same value of the 𝜃 vector that maximizes the log-likelihood function. We can therefore write the following.

Equation 14: 𝜃̂ᴹᴸᴱ = arg max_𝜃 L(𝜃) = arg max_𝜃 log L(𝜃) = arg max_𝜃 ∑ᵢ₌₁ⁿ log p(yᵢ | xᵢ, 𝜃)

This is because the logarithm is monotonically increasing: if one number is greater than another, so is its logarithm. So the value of the parameters that maximizes the likelihood function also maximizes its logarithm. Furthermore, we are able to write the last line in Equation 14 because the log-likelihood and the sum of log-probabilities are proportional to each other (Equation 13); hence they share the same maximizing 𝜃. Let us rewrite the final likelihood equation we need to maximize to obtain the model parameters.

Equation 15: 𝜃̂ᴹᴸᴱ = arg max_𝜃 ∑ᵢ₌₁ⁿ log p(yᵢ | xᵢ, 𝜃)

How we solve this likelihood function depends on the type of machine learning model. For example, in linear regression we assume that the labels y follow a normal distribution given the features x and the parameters 𝜃. The proof for solving the maximum likelihood estimation for simple linear regression is given in the resources below. An interesting realization after doing this math is that the optimal values of 𝜃 are exactly the values that minimize the residual sum of squares, a fundamental quantity in machine learning. Similarly, we can use Equation 15 as a starting point for parametric classification models like logistic regression. Here is a video of me working out the maximum likelihood estimation for logistic regression.
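To close the loop, here is an end-to-end sketch on synthetic data (all numbers invented): it maximizes the log-likelihood of Equation 15 under a Gaussian noise assumption, with the noise standard deviation treated as known for simplicity, by minimizing its negative with SciPy's general-purpose optimizer. It then checks that the resulting 𝜃 agrees with the ordinary least-squares solution, which is exactly the residual-sum-of-squares connection mentioned above.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(3)

# Synthetic training data: n houses with a constant term plus 3 features
# (bedrooms, size in thousands of sqft, age). All coefficients are invented.
n = 1_000
X = np.column_stack([
    np.ones(n),                     # the 1 in each x_i (constant term)
    rng.integers(1, 6, size=n),     # number of bedrooms
    rng.uniform(0.8, 3.5, size=n),  # size in thousands of sqft
    rng.uniform(0, 50, size=n),     # age in years
])
true_theta = np.array([200.0, 25.0, 150.0, -1.5])  # prices in $K
sigma = 30.0                                       # noise std, treated as known
y = X @ true_theta + rng.normal(0, sigma, size=n)

def neg_log_likelihood(theta):
    # Equation 15 with a Gaussian assumption: y_i | x_i, theta ~ N(x_i . theta, sigma^2).
    # Minimizing the negative log-likelihood maximizes the likelihood.
    return -np.sum(stats.norm.logpdf(y, loc=X @ theta, scale=sigma))

theta_mle = optimize.minimize(neg_log_likelihood, x0=np.zeros(4)).x
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("MLE:", np.round(theta_mle, 2))
print("OLS:", np.round(theta_ols, 2))  # the two estimates agree up to optimizer tolerance
```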

Conclusion

This post started with the intuition behind likelihood. We then formally defined likelihood and its relation to probability. We discussed how likelihood can be used to estimate the parameters of a statistical model that best fit training data using maximum likelihood estimation. Along the way, we introduced math notation to generalize how maximum likelihood estimation is conducted for any parametric model. We ended our discussion by deriving a general form for the likelihood equation that can be used to solve for parameters in models like linear regression and logistic regression.

Thank you for reading until the end! For more technical content, please follow me on the Code Emporium YouTube channel where I teach data science and machine learning.

Resources

[1] Code Emporium, Gradient Descent – the math you should know (2019), YouTube.

[2] Stephen Pettigrew, From model to log-likelihood (2014), University of Pennsylvania

[3] Liyan Xu, Machine Learning: MLE vs MAP (2021), Just Chillin’ Blog

[4] whuber, What is the difference between "likelihood" and "probability"? (2019), stats.stackexchange.

[5] StatQuest with Josh Starmer, Probability is not Likelihood (2019), YouTube

[6] Matt Bognar, Probability Distribution Applet (2016), University of Iowa

[7] Joram Soch, Maximum Likelihood Estimation for simple Linear Regression (2021), The book of statistical proofs

[8] user132704, What is the difference between "probability density function" and "probability distribution function"? (2019), stats.stackexchange.

All images and figures used in this post were created by the author.

