I Won’t Change Unless You Do
https://towardsdatascience.com/i-wont-change-unless-you-do/
Game Theory 101: The Nash equilibrium

In game theory, how can players ever settle on a decision if a better option might still be available? Maybe one player wants to change their decision. But if they do, maybe the other player wants to change too. How can they ever hope to escape from this vicious circle? The concept of a Nash equilibrium, which I will explain in this article, is game theory’s fundamental answer to this problem.

This article is the second part of a four-chapter series on game theory. If you haven’t checked out the first chapter yet, I’d encourage you to do that to get familiar with the main terms and concepts of game theory. Once you have, you are prepared for the next steps of our journey. Let’s go!

Finding the solution

Finding a solution to a game in game theory can be tricky sometimes. Photo by Mel Poole on Unsplash

We will now try to find a solution for a game in game theory. A solution is a set of actions in which each player maximizes their utility and therefore behaves rationally. That does not necessarily mean that each player wins the game, but that they do the best they can, given that they don’t know what the other players will do. Let’s consider the following game:

If you are unfamiliar with this matrix notation, you might want to take a look back at Chapter 1 and refresh your memory. Do you remember that this matrix gives you the reward for each player for a specific pair of actions? For example, if player 1 chooses action Y and player 2 chooses action B, player 1 will get a reward of 1 and player 2 will get a reward of 3.

Okay, which actions should the players decide on now? Player 1 does not know what player 2 will do, but they can still try to find the best action for each of player 2’s possible choices. If we compare the utilities of actions Y and Z (indicated by the blue and red boxes in the next figure), we notice something interesting: If player 2 chooses action A (first column of the matrix), player 1 will get a reward of 3 for choosing action Y and a reward of 2 for choosing action Z, so action Y is better in that case. But what happens if player 2 decides on action B (second column)? In that case, action Y gives a reward of 1 and action Z gives a reward of 0, so Y is better than Z again. And if player 2 chooses action C (third column), Y is still better than Z (reward of 2 vs. reward of 1). That means player 1 should never use action Z, because action Y is always better.

We compare the rewards for player 1 for actions Y and Z.

With the aforementioned considerations, player 2 can anticipate that player 1 would never use action Z, and hence player 2 doesn’t have to care about the rewards that belong to action Z. This makes the game much smaller, because now there are only two options left for player 1, and it also helps player 2 decide on their action.

We found out that for player 1, Y is always better than Z, so we don’t consider Z anymore.

If we look at the truncated game, we see that for player 2, action B is always better than action A. If player 1 chooses X, action B (with a reward of 2) is better than action A (with a reward of 1), and the same applies if player 1 chooses action Y. Note that this would not be the case if action Z were still in the game. However, we already saw that action Z will never be played by player 1 anyway.

We compare the rewards for player 2 for actions A and B.

As a consequence, player 2 would never use action A. Now if player 1 anticipates that player 2 never uses action A, the game becomes smaller again and fewer options have to be considered.

We saw that for player 2, action B is always better than action A, so we don’t have to consider A anymore.

We can easily continue in this fashion and see that for player 1, X is now always better than Y (2>1 and 4>2). Finally, given that player 1 chooses action X, player 2 will choose action B, which is better than C (2>0). In the end, only action X (for player 1) and action B (for player 2) are left. That is the solution of our game:

In the end, only one option remains, namely player 1 using X and player 2 using B.

It would be rational for player 1 to choose action X and for player 2 to choose action B. Note that we came to that conclusion without knowing exactly what the other player would do. We just anticipated that some actions would never be taken, because they are always worse than other actions. Such actions are called strictly dominated. For example, action Z is strictly dominated by action Y, because Y is always better than Z.
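This step-by-step procedure is known as iterated elimination of strictly dominated strategies, and it is mechanical enough to automate. Below is a minimal Python sketch. Note that the text above only states some of the payoff entries explicitly; the remaining entries are placeholder values chosen to be consistent with the comparisons above, so treat the matrices as illustrative.

import numpy as np

# Rows = player 1's actions (X, Y, Z), columns = player 2's actions (A, B, C).
# Entries stated in the text are used as given; the rest are placeholders.
p1 = np.array([[2, 2, 4],    # X ((X,A) is a placeholder)
               [3, 1, 2],    # Y
               [2, 0, 1]])   # Z
p2 = np.array([[1, 2, 0],    # several of player 2's entries are placeholders
               [1, 3, 0],
               [3, 2, 0]])

def eliminate_strictly_dominated(p1, p2):
    """Iteratively remove strictly dominated actions for both players."""
    rows, cols = list(range(p1.shape[0])), list(range(p1.shape[1]))
    changed = True
    while changed:
        changed = False
        for i in list(rows):  # is row i strictly dominated by another row?
            if any(all(p1[j, c] > p1[i, c] for c in cols) for j in rows if j != i):
                rows.remove(i)
                changed = True
        for c in list(cols):  # is column c strictly dominated by another column?
            if any(all(p2[r, d] > p2[r, c] for r in rows) for d in cols if d != c):
                cols.remove(c)
                changed = True
    return rows, cols

print(eliminate_strictly_dominated(p1, p2))  # ([0], [1]) -> only X and B survive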

The best answer

Scrabble is one of those games where searching for the best answer can take ages. Photo by Freysteinn G. Jonsson on Unsplash

Such strictly dominated actions do not always exist, but there is a similar concept that is of importance for us, called a best answer. Say we know which action the other player chooses. In that case, deciding on an action becomes very easy: We just take the action that has the highest reward. If player 1 knew that player 2 chose action A, the best answer for player 1 would be Y, because Y has the highest reward in that column. Do you see how we were already searching for best answers before? For each possible action of the other player, we searched for the best answer in case the other player chose that action. More formally, player i’s best answer to a given set of actions of all other players is the action of player i that maximizes the utility given the other players’ actions. Also be aware that a strictly dominated action can never be a best answer.

Let us come back to a game we introduced in the first chapter: the prisoner’s dilemma. What are the best answers here?

The prisoner’s dilemma

How should player 1 decide if player 2 confesses or denies? If player 2 confesses, player 1 should confess as well, because a reward of -3 is better than a reward of -6. And what happens if player 2 denies? In that case, confessing is better again, because it gives a reward of 0, which is better than the reward of -1 for denying. That means that for player 1, confessing is the best answer to both actions of player 2. Player 1 doesn’t have to worry about the other player’s actions at all and should always confess. Because of the game’s symmetry, the same applies to player 2. For them, confessing is also the best answer, no matter what player 1 does.

The Nash Equilibrium

The Nash equilibrium is somewhat like the master key that allows us to solve game-theoretic problems. Researchers were very happy when they found it. Photo by rc.xyz NFT gallery on Unsplash

If all players play their best answer, we have reached a solution of the game that is called a Nash equilibrium. This is a key concept in game theory because of an important property: In a Nash equilibrium, no player has any reason to change their action unless another player does. That means all players are as happy as they can be in the situation, and they wouldn’t change, even if they could. Consider the prisoner’s dilemma from above: The Nash equilibrium is reached when both confess. In this case, no player would change their action on their own. They could both become better off if they changed their actions together and decided to deny, but since they can’t communicate, they don’t expect any change from the other player, and so they don’t change themselves either.
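This condition of mutual best answers is easy to check mechanically for a game in matrix form. Here is a small sketch that scans all action pairs of a two-player game, applied to the prisoner’s dilemma payoffs from above:

import numpy as np

def find_pure_nash(p1, p2):
    """Return all action pairs where both players play a best answer."""
    equilibria = []
    for r in range(p1.shape[0]):
        for c in range(p1.shape[1]):
            best_for_1 = p1[r, c] == p1[:, c].max()  # player 1 can't gain by switching rows
            best_for_2 = p2[r, c] == p2[r, :].max()  # player 2 can't gain by switching columns
            if best_for_1 and best_for_2:
                equilibria.append((r, c))
    return equilibria

# prisoner's dilemma: action 0 = confess, action 1 = deny
pd_p1 = np.array([[-3,  0],
                  [-6, -1]])
pd_p2 = pd_p1.T  # the game is symmetric
print(find_pure_nash(pd_p1, pd_p2))  # [(0, 0)] -> both confess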

You may wonder if there is always a single Nash equilibrium for each game. Let me tell you there can also be multiple ones, as in the Bach vs. Stravinsky game that we already got to know in Chapter 1:

Bach vs. Stravinsky

This game has two Nash equilibria: (Bach, Bach) and (Stravinsky, Stravinsky). In both scenarios, you can easily imagine that there is no reason for either player to change their action in isolation. If you sit in the Bach concert with your friend, you would not leave your seat to go to the Stravinsky concert alone, even if you favour Stravinsky over Bach. Likewise, the Bach fan wouldn’t leave the Stravinsky concert if that meant leaving their friend alone. In the remaining two scenarios, you would think differently though: If you were at the Stravinsky concert alone, you would want to get out of there and join your friend at the Bach concert. That is, you would change your action even if the other player doesn’t change theirs. This tells you that the scenario you were in was not a Nash equilibrium.

However, there can also be games that have no Nash equilibrium at all. Imagine you are a soccer goalkeeper facing a penalty shot. For simplicity, we assume you can jump to the left or to the right. The player of the opposing team can also shoot into the left or the right corner, and we assume that you catch the ball if you pick the same corner as they do and that you don’t catch it if you pick opposite corners. We can display this game as follows:

A game matrix for a penalty shot.

You won’t find any Nash equilibrium here. Each scenario has a clear winner (reward 1) and a clear loser (reward -1), and hence one of the players will always want to change. If you jump to the right and catch the ball, your opponent will wish they had chosen the left corner. But then you will want to change your decision again, which will make your opponent choose the other corner again, and so on.
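Reusing the find_pure_nash sketch from above confirms this, with the rewards taken from the matrix just described:

# penalty shot: action 0 = left, action 1 = right; you are the keeper
penalty_keeper = np.array([[ 1, -1],
                           [-1,  1]])   # you win if the corners match
penalty_shooter = -penalty_keeper       # zero-sum: the shooter wins otherwise
print(find_pure_nash(penalty_keeper, penalty_shooter))  # [] -> no pure-strategy equilibrium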

Summary

We learned about finding a point of balance, where nobody wants to change anymore. That is a Nash equilibrium. Photo by Eran Menashri on Unsplash

This chapter showed how to find solutions for games by using the concept of a Nash equilibrium. Let us summarize what we have learned so far:

  • In a solution of a game in game theory, every player maximizes their utility or reward given the other players’ actions.
  • An action is called strictly dominated if there is another action that is always better. In this case, it would be irrational to ever play the strictly dominated action.
  • The action that yields the highest reward given the actions taken by the other players is called a best answer.
  • A Nash equilibrium is a state where every player plays their best answer.
  • In a Nash equilibrium, no player wants to change their action unless another player does. In that sense, Nash equilibria are optimal states.
  • Some games have multiple Nash equilibria and some games have none.

If you were saddened by the fact that some games have no Nash equilibrium, don’t despair! In the next chapter, we will introduce probabilities of actions, and this will allow us to find more equilibria. Stay tuned!

References

The topics introduced here are typically covered in standard textbooks on game theory. I mainly used this one, which is written in German though:

  • Bartholomae, F., & Wiens, M. (2016). Spieltheorie. Ein anwendungsorientiertes Lehrbuch. Wiesbaden: Springer Fachmedien Wiesbaden.

An alternative in English could be this one:

  • Espinola-Arredondo, A., & Muñoz-Garcia, F. (2023). Game Theory: An Introduction with Step-by-step Examples. Springer Nature.

Game theory is a rather young field of research, with the first main textbook being this one:

  • Von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior.

Like this article? Follow me to be notified of my future posts.

Talking about Games
https://towardsdatascience.com/talking-about-games/
Game Theory 101: terms and concepts

Game theory is a field of research that is quite prominent in economics but rather unpopular in other scientific disciplines. However, the concepts used in game theory can be of interest to a wider audience, including data scientists, statisticians, computer scientists and psychologists, to name just a few. This article is the opener of a four-chapter tutorial series on the fundamentals of game theory, so stay tuned for the upcoming articles.

In this article, I will explain the kinds of problems game theory deals with and introduce the main terms and concepts used to describe a game. We will see some examples of games that are typically analysed within game theory and lay the foundation for deeper insights into the capabilities of game theory in the later chapters. But before we go into the details, I want to introduce you to some applications of game theory that show the multitude of areas game-theoretic concepts can be applied to.

Applications of game theory

Even French fries can be an application of game theory. Photo by engin akyurt on Unsplash

Does it make sense to vote for a small party in an election if this party may not have a chance to win anyway? Is it worth starting a price war with your competitor who offers the same goods as you? Do you gain anything by reducing your catch in overfished areas if your competitors simply carry on as before? Should you take out insurance if you believe that the government will pay for the reconstruction after the next hurricane anyway? And how should you behave in the next auction where you are about to bid on your favourite Picasso painting?

All these questions (and many more) live within the area of applications that can be modelled with game theory. Whenever a situation includes strategic decisions in interaction with others, game-theoretic concepts can be applied to describe this situation formally and to search for decisions that are not made intuitively but are backed by a notion of rationality. Key to all the situations above is that your decisions depend on other people’s behaviour. If everybody agrees to conserve the overfished areas, you want to play along to preserve nature, but if you think that everybody else will continue fishing, why should you be the only one to stop? Likewise, your voting behaviour in an election might heavily depend on your assumptions about other people’s votes. If nobody votes for a candidate, your vote will be wasted, but if everybody thinks so, the candidate doesn’t have a chance at all. Maybe there are many people who say “I would vote for him if others voted for him too”.

Similar dynamics can arise in very different contexts. Have you ever thought about having food delivered and everybody said “You don’t have to order anything because of me, but if you order anyway, I’d take some French fries”? All these examples can be applications of game theory, so let’s start understanding what game theory is all about.

Understanding the game

Before playing, you need to understand the components of the game. Photo by Laine Cooper on Unsplash

When you hear the word game, you might think of video games such as Minecraft, board games such as Monopoly, or card games such as poker. There are some common principles to all these games: We always have some players who are allowed to do certain things determined by the game’s rules. For example, in poker, you can raise, check or give up. In Monopoly, you can buy a property you land on or leave it. What we also have is some notion of how to win the game. In poker, you have to get the best hand to win, and in Monopoly, you have to be the last person standing after everybody else has gone bankrupt. That also means that some actions are better than others in some scenarios. If you have two aces in your hand, staying in the game is better than giving up.

When we look at games from the perspective of game theory, we use the same concepts, just more formally.

A game in game theory consists of n players, where each player has a strategy set and a utility function.

A game consists of a set of players I = {1, …, n}, where each player has a set of strategies S and a utility function u_i(s_1, s_2, …, s_n). The set of strategies is determined by the rules of the game. For example, it could be S = {check, raise, give-up}, and the player would have to decide which of these actions they want to use. The utility function u (also called reward) describes how valuable a certain action of a player is, given the actions of the other players. Every player wants to maximize their utility, but now comes the tricky part: The utility of an action of yours depends on the other players’ actions. But for them, the same applies: Their actions’ utilities depend on the actions of the other players (including yours).

Let’s consider a well-known game to illustrate this point. In rock-paper-scissors, we have n=2 players, and each player can choose between three actions, hence the strategy set is S = {rock, paper, scissors} for each player. But the utility of an action depends on what the other player does. If your opponent chooses rock, the utility of paper is high (1), because paper beats rock. But if your opponent chooses scissors, the utility of paper is low (-1), because you would lose. Finally, if your opponent chooses paper as well, you reach a draw and the utility is 0.

Utility values for player 1 choosing paper, for the three choices of the opponent’s strategy.

Instead of writing down the utility function for each case individually, it is common to display games in a matrix like this:

The first player decides on the row of the matrix by selecting their action, and the second player decides on the column. For example, if player 1 chooses paper and player 2 chooses scissors, we end up in the cell in the second row and third column. The value in this cell is the utility for both players, where the first value corresponds to player 1 and the second value corresponds to player 2. (-1,1) means that player 1 has a utility of -1 and player 2 has a utility of 1. Scissors beat paper.
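If you prefer code over matrices, the same utility function can be written out directly. Here is a small sketch of player 1’s utilities; since the game is zero-sum, player 2’s utility in each cell is just the negative:

# player 1's utility for each (own action, opponent's action) pair
utility_1 = {
    ("rock", "rock"): 0,      ("rock", "paper"): -1,     ("rock", "scissors"): 1,
    ("paper", "rock"): 1,     ("paper", "paper"): 0,     ("paper", "scissors"): -1,
    ("scissors", "rock"): -1, ("scissors", "paper"): 1,  ("scissors", "scissors"): 0,
}

print(utility_1[("paper", "scissors")])   # -1: scissors beat paper
print(-utility_1[("paper", "scissors")])  # +1: player 2's utility in the same cell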

Some more details

Now we have understood the main components of a game in game theory. Let me add a few more hints on what game theory is about and what assumptions it uses to describe its scenarios. 

  • We often assume that the players select their actions at the same time (like in rock-paper-scissors). We call such games static games. There are also dynamic games in which players take turns deciding on their actions (like in chess). We will consider these cases in a later chapter of this tutorial. 
  • In game theory, it is typically assumed that the players cannot communicate with each other, so they can’t come to an agreement before deciding on their actions. In rock-paper-scissors, you wouldn’t want to do that anyway, but there are other games where communication would make it easier to choose an action. However, we will always assume that communication is not possible.
  • Game theory is considered a normative theory, not a descriptive one. That means we will analyse games concerning the question “What would be the rational solution?” This may not always be what people do in a similar situation in reality. Such descriptions of real human behaviour are part of the research field of behavioural economics, which is located on the border between psychology and economics.

The prisoner’s dilemma

The prisoner’s dilemma is all about not ending up here. Photo by De an Sun on Unsplash

Let us become more familiar with the main concepts of game theory by looking at some typical games that are often analyzed. Often, such games are derived from a story or scenario that may happen in the real world and requires people to decide between some actions. One such story could be as follows:

Say we have two criminals who are suspected of having committed a crime. The police have some circumstantial evidence, but no actual proof of their guilt. Hence they question the two criminals, who now have to decide if they want to confess to or deny the crime. If you are in the position of one of the criminals, you might think that denying is always better than confessing, but now comes the tricky part: The police propose a deal. If you confess while your partner denies, you are considered a crown witness and will not be punished. In this case, you are free to go, but your partner will go to jail for six years. Sounds like a good deal, but be aware that the outcome also depends on your partner’s action. If you both confess, there is no crown witness anymore and you both go to jail for three years. If you both deny, the police can only use the circumstantial evidence against you, which will lead to one year in prison for both you and your partner. But be aware that your partner is offered the same deal. If you deny and he confesses, he is the crown witness and you go to jail for six years. How do you decide?

The prisoner’s dilemma.

The game derived from this story is called the prisoner’s dilemma and is a typical example of a game in game theory. We can visualize it as a matrix, just like we did with rock-paper-scissors before, and in this matrix we easily see the dilemma the players are in. If both deny, they receive a rather low punishment. But if you assume that your partner denies, you might be tempted to confess, which would prevent you from going to jail. However, your partner might think the same, and if you both confess, you both go to jail for longer. Such a game can easily make you go round in circles. We will talk about solutions to this problem in the next chapter of this tutorial. First, let’s consider some more examples.

Bach vs. Stravinsky

Who do you prefer, Bach or Stravinsky? Photo by Sigmund on Unsplash

You and your friend want to go to a concert together. You are a fan of Bach’s music, but your friend favors the Russian 20th-century composer Igor Stravinsky. However, you both want to avoid being alone at a concert. Although you prefer Bach over Stravinsky, you would rather go to the Stravinsky concert with your friend than go to the Bach concert alone. We can create a matrix for this game:

Bach vs. Stravinsky

You decide on the row by going to the Bach or the Stravinsky concert, and your friend decides on the column by going to one of the concerts as well. For you, it would be best if you both chose Bach. Your reward would be 2, and your friend would get a reward of 1, which is still better for him than being at the Stravinsky concert all by himself. However, he would be even happier if you were at the Stravinsky concert together.

Do you remember that we said players are not allowed to communicate before making their decision? This example illustrates why. If you could just call each other and decide where to go, this would no longer be a game to investigate with game theory. But you can’t call each other, so you just have to go to one of the concerts and hope you will meet your friend there. What do you do?

Arm or disarm?

Make love, not war. Photo by Artem Beliaikin on Unsplash

A third example brings us to the realm of international politics. The world would be a much happier place with fewer firearms, wouldn’t it? However, if nations think about disarmament, they also have to consider the choices other nations make. If the USA disarms, the Soviet Union might want to keep upgrading its arsenal to be able to attack the USA. That was the thinking during the Cold War, at least. Such a scenario could be described with the following matrix:

The matrix for the disarm vs. upgrade game.

As you see, when both nations disarm, they get the highest reward (3 each), because there are fewer firearms in the world and the risk of war is minimized. However, if you disarm while your opponent upgrades, your opponent is in the better position and gets a reward of 2, while you only get 0. In that case, it might have been better to upgrade yourself, which gives a reward of 1 for both players. That is better than being the only one who disarms, but not as good as it would be if both nations disarmed.

The solution?

All these examples have one thing in common: There is no single option that is always the best. Instead, the utility of an action for one player always depends on the other player’s action, which, in turn, depends on the first player’s action, and so on. Game theory is interested in finding the optimal solution and deciding what would be the rational action; that is, the action that maximizes the expected reward. Different ideas on what exactly such a solution looks like will be part of the next chapter in this series.

Summary

Learning about game theory is as much fun as playing a game, don’t you think? Photo by Christopher Paul High on Unsplash

Before continuing with finding solutions in the next chapter, let us recap what we have learned so far. 

  • A game consists of players who decide on actions, which have a utility or reward.
  • The utility/reward of an action depends on the other players’ actions. 
  • In static games, players decide on their actions simultaneously. In dynamic games, they take turns.
  • The prisoner’s dilemma is a very popular example of a game in game theory.
  • Games become increasingly interesting if there is no single action that is better than any other. 

Now that you are familiar with how games are described in game theory, you can check out the next chapter to learn how to find solutions for games in game theory. 

References

The topics introduced here are typically covered in standard textbooks on game theory. I mainly used this one, which is written in German though: 

  • Bartholomae, F., & Wiens, M. (2016). Spieltheorie. Ein anwendungsorientiertes Lehrbuch. Wiesbaden: Springer Fachmedien Wiesbaden.

An alternative in English could be this one:

  • Espinola-Arredondo, A., & Muñoz-Garcia, F. (2023). Game Theory: An Introduction with Step-by-step Examples. Springer Nature.

Game theory is a rather young field of research, with the first main textbook being this one: 

  • Von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior.

Like this article? Follow me to be notified of my future posts.

Bringing Structure to Your Data
https://towardsdatascience.com/bringing-structure-to-your-data-9acbd1051a1f/
Testing assumptions with path models

Data scientists often collect a multitude of variables and search for relationships between them. During this journey, it is helpful to have assumptions and hypotheses on how exactly the variables relate to each other. Does a student’s motivation to study for the next exam influence their grades? Or do good grades lead to motivation to study in the first place? And what exactly are the behavioral patterns that motivated people show that lead to good grades in the end?

To give some structure to questions like these and to provide a tool for testing them empirically, I want to explain path models, also called Structural Equation Models (SEMs), in this article. While path models are commonly used in social sciences like psychology, I feel they are not that prominent in other areas like data science and computer science. Hence I want to give an overview of the main concept of path analysis and introduce semopy, a package for applying path analysis in Python. Throughout this article, we will analyze artificial data to showcase typical problems that can be solved with path models and introduce the concepts of moderators and mediators. Be aware that this data has been generated for demonstration purposes and may not be realistic in every detail.

Research question

Before analysing data, we need to have an idea of what we are searching for. Photo by Ansia Lasa on Unsplash

If we want to analyze data, we need to have a research question in mind that we want to investigate. For this article, let us investigate school children and the grades they achieve. We might be interested in factors that foster learning and achieving good grades. These could be the amount of fun they have in school, their feeling of belonging to the class, their interest in the subject, their number of friends in the class, their relationship with the teacher, their intelligence and much more. So we go into different schools and collect data by handing out questionnaires on the feeling of belonging, the relationship with the teacher, the interest in the topic and the fun the pupils have in school; we conduct an IQ test with the pupils; and we ask them how many friends they have. And of course we collect their grades in the exams.

It all starts with data

We now have data for all the variables shown here:

Our next step is to investigate how exactly the variables influence the grade. We can make different assumptions about the influences, and we can verify these assumptions with the data. Let us start with the most trivial case, where we assume that each variable has a direct influence on the grades that is independent of all the other variables. For example, we would assume that higher intelligence leads to a better grade, no matter the interest in the topic or the fun the pupil has in school. We would hypothesize a similar relationship with the grades for the other variables as well. Visually displayed, this relationship looks like this:

Model assuming all variables directly influence the variable grades. Image by author.

Each arrow describes an influence between the variables. We could also formulate this relationship as a weighted sum, like this:

grades = a*feeling_of_belonging + b*number_of_friends + c*relationship_with_teacher + d*fun_in_school + e*intelligence + f*interest_in_topic

Here a, b, c, d, e and f are weights that tell us how strong the influence of the different variables is on our outcome grades. Okay, that is our assumption. Now we want to test this assumption against the data. Let’s say we have a data frame called data, where we have one column for each of the aforementioned variables. Then we can use semopy in Python like this:

import semopy

# specify the model: grades is predicted by all the other variables
path = """
grades ~ intelligence + interest_in_topic + feeling_of_belonging + relationship_with_teacher + fun_in_school + number_of_friends
"""

m = semopy.Model(path)
m.fit(data)

In the last lines, we create a semopy.Model object and fit it to the data. The most interesting part is the variable path defined before. Here we specify the assumption we just made, namely that the variable grades is a combination of all the other variables. On the left side of the tilde (~), we have the variable that we expect to depend on the variables to the right of the tilde. Note that we didn’t explicitly specify the weights a, b, c, d, e and f. These weights are actually what we want to know, so let us run the following line to get a result:

m.inspect()
Results of the assumption that all variables directly influence the variable grade. Image by author.

The weights a, b, c, d, e and f are what we see in the column Estimate. What information can we extract from this table? First, we see that some weights are bigger and some are smaller. For example, feeling_of_belonging has the biggest weight (0.40), indicating that it has the strongest influence. interest_in_topic, for example, has a much lower influence (0.08), and other variables like intelligence and number_of_friends have a weight of (almost) zero.

Also, take a look at the p-value column. If you are familiar with statistical tests, you may already know how to interpret this. If not, don’t worry. There is a vast pile of literature on how to understand the topic of significance (which is what this column indicates), and I encourage you to deepen your knowledge about it. For the moment, we can just say that this column gives us some idea of how likely it is that an effect we found is just random noise. For example, the influence of number_of_friends on grades is very small (-0.01), and it is very likely (p-value of 0.42) that it is just a coincidence. Hence we would say there is no effect, although the weight is not exactly zero. The other way round, if the p-value is (close to) zero, we can assume that we indeed found an effect that is not just coincidence.

Okay, so according to our analysis, there are three variables that have an influence on the grade: interest_in_topic (0.08), feeling_of_belonging (0.40) and relationship_with_teacher (0.19). The other variables have no influence. Is this our final answer?

It is not! Remember that the calculations performed by semopy were influenced by the assumptions we gave it. We said that we assume all variables to directly influence the grades, independently of each other. But what if the actual relationship looks different? There are many other ways variables could influence each other, so let us come up with some different assumptions and thereby explore the concepts of mediators and moderators.

Mediators

Mediators can be like billiard balls, where one pushes the other. Photo by Steve Mushero on Unsplash

Instead of saying that both number_of_friends and feeling_of_belonging influence grades directly, let us think in a different direction. If you don’t have any friends in class, you would not feel a sense of belonging to the class, would you? This feeling of (not) belonging might then influence the grade. So the relationship would rather look like this:

Model assuming that number_of_friends influences feeling_of_belonging, which in turn influences grades. Image by author.

Note that the direct effect of number_of_friends on grades has vanished, but we have an influence of number_of_friends on feeling_of_belonging, which in turn influences grades. We can take this assumption and let semopy test it:

path = """
feeling_of_belonging ~ number_of_friends
grades ~ feeling_of_belonging
"""
m = semopy.Model(path)
m.fit(data)

Here we said that feeling_of_belonging depends on number_of_friends and that grades depends on feeling_of_belonging. You see the output in the following. There is still a weight of 0.40 between feeling_of_belonging and grades, but now we also have a weight of 0.29 between number_of_friends and feeling_of_belonging. Looks like our assumption is valid. The number of friends influences the feeling of belonging, and this, in turn, influences the grade.

Results of the assumption of number_of_friends influencing feeling_of_belonging. Image by author.

The kind of influence we have modelled here is called a mediator, because one variable mediates the influence of another. In other words, number_of_friends does not have a direct influence on grades, but an indirect one, mediated through the feeling_of_belonging.
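In standard mediation analysis, the size of such an indirect effect is the product of the weights along the path. A quick hand calculation with the estimates from above (this is not semopy output):

# indirect effect of number_of_friends on grades via feeling_of_belonging
a = 0.29  # number_of_friends -> feeling_of_belonging
b = 0.40  # feeling_of_belonging -> grades
print(a * b)  # ~0.12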

Mediations can help us understand the exact ways and processes by which some variables influence each other. Students who have clear goals and ideas of what they want to become are less likely to drop out of high school, but what exactly are the behavioral patterns that lead to performing well in school? Is it learning more? Is it seeking help if one doesn’t understand a topic? These could both be mediators that (partly) explain the influence of clear goals on academic achievement.

Moderators

A moderator can be like a valve that only allows a certain throughput. Photo by Igal Ness on Unsplash

We just saw that assuming a different relationship between the variables helped describe the data more effectively. Maybe we can do something similar to make sense of the fact that intelligence has no influence on the grade in our data. This is surprising, as we would expect more intelligent pupils to reach higher grades on average, wouldn’t we? However, if a pupil is just not interested in the topic they wouldn’t spend much effort, would they? Maybe there is not a direct influence of intelligence on the grades, but there is a joint force of intelligence and interest. If pupils are interested in the topics, the more intelligent ones will receive higher grades, but if they are not interested, it doesn’t matter, because they don’t spend any effort. We could visualize this relationship like this:

Model assuming that interest_in_topic moderates the influence of intelligence on grades. Image by author.

That is, we assume there is an effect of intelligence on the grades, but this effect is influenced by interest_in_topic. If interest is high, pupils will make use of their cognitive abilities and achieve higher grades, but if interest is low, they will not.

If we want to test this assumption in semopy, we have to create a new variable that is the product of intelligence and interest_in_topic. Do you see how multiplying the variables reflects the idea we just had? If interest_in_topic is near zero, the whole product is close to zero, no matter the intelligence. If interest_in_topic is high though, the product will be mainly driven by the high or low intelligence. So, we calculate a new column of our dataframe and call it intelligence_x_interest.
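Creating this column is a one-liner on the dataframe (a sketch assuming the column names used throughout this article):

# interaction term: the product of intelligence and interest in the topic
data["intelligence_x_interest"] = data["intelligence"] * data["interest_in_topic"]

With that column in place, we feed semopy with our assumed relationship between this variable and the grades: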

path = """
grades ~ intelligence_x_interest
"""
m = semopy.Model(path)
m.fit(data)

And we find an effect:

Result of the assumption of the product of intelligence and interest influencing grades. Image by author.

Previously, intelligence had no effect on grades and interest_in_topic had a very small one (0.08). But if we combine them, we find a very big effect of 0.81. Looks like this combination of both variables describes our data much better.

This interaction of variables is called moderation. We would say that interest_in_topic moderates the influence of intelligence on grades, because the strength of the connection between intelligence and grades depends on the interest. Moderations can be important to understand how relations between variables differ in different circumstances or between different groups of participants. For example, longer experience in a job influences the salary positively, but for men, this influence is even stronger than for women. In this case, gender is the moderator of the effect of work experience on salary.

Summing up

If we combine all the previous steps, our new model looks like this:

Full model with all the previous assumptions. Image by author.
Results of the full model. Image by author.

Now we have a more sophisticated and more plausible structure for our data. Note that fun_in_school still has no influence on the grades (hence I gave it a dashed line in the visualization above). Either there is none in the data, or we just did not find the correct interplay with the other variables yet. We might even be missing some interesting variables. Just like intelligence only made sense to look at in combination with interest_in_topic, maybe there is another variable that is required to understand the influence fun_in_school has on the grades. This shows you that for path analysis, it is important to make sense of your data and have an idea of what you want to investigate. It all starts with assumptions, which you derive from theory (or sometimes from gut feeling) and which you then test with the data to understand it better.

This is what path models are about. Let us sum up what we just learned.

  • Path models allow us to test assumptions on how exactly variables influence each other.
  • Mediations appear, if a variable a does not have a direct influence on a variable c, but influences another variable b, which then influences c.
  • We speak of moderations if the influence of a variable a on a variable c becomes stronger or less strong depending on another variable b. This can be modelled by calculating the product of variables.
  • Semopy can be used to test path models with given data in Python.

I hope I have been able to convince you of the usefulness of path models. What I showed you is just the very beginning of it, though. Many more sophisticated assumptions can be tested with path models or other models derived from them, which go way beyond the scope of this article.

References

You can find semopy here:

If you want to learn more about path analysis, Wikipedia can be a good entry point:

I use this book for statistical background (unfortunately, it is available in German only):

  • Eid, M., Gollwitzer, M., & Schmitt, M. (2015). Statistik und Forschungsmethoden.

This is how the data for this article has been generated:

import numpy as np
import pandas as pd

np.random.seed(42)

N = 7500

def norm(x):
    return (x - np.mean(x)) / np.std(x)

number_of_friends = [int(x) for x in np.random.exponential(2, N)]

# let's assume the questionnaires here had a range from 0 to 5
relationship_with_teacher = np.random.normal(3.5,1,N)
relationship_with_teacher = np.clip(relationship_with_teacher, 0,5)
fun_in_school = np.random.normal(2.5, 2, N)
fun_in_school = np.clip(fun_in_school, 0,5)

# let's assume the interest_in_topic questionnaire goes from 0 to 10
interest_in_topic = 10-np.random.exponential(1, N)
interest_in_topic = np.clip(interest_in_topic, 0, 10)

intelligence = np.random.normal(100, 15, N)
# normalize variables
interest_in_topic = norm(interest_in_topic)
fun_in_school = norm(fun_in_school)
intelligence = norm(intelligence)
relationship_with_teacher = norm(relationship_with_teacher)
number_of_friends = norm(number_of_friends)

# create dependent variables
feeling_of_belonging = np.multiply(0.3, number_of_friends) + np.random.normal(0, 1, N)
grades = 0.8 * intelligence * interest_in_topic + 0.2 * relationship_with_teacher + 0.4*feeling_of_belonging + np.random.normal(0,0.5,N)

data = pd.DataFrame({
    "grades":grades,
    "intelligence":intelligence,
    "number_of_friends":number_of_friends,
    "fun_in_school":fun_in_school,
    "feeling_of_belonging": feeling_of_belonging,
    "interest_in_topic":interest_in_topic,
    "intellgence_x_interest" : intelligence * interest_in_topic,
    "relationship_with_teacher":relationship_with_teacher
})

Like this article? Follow me to be notified of my future posts.

Take a Look Under the Hood
https://towardsdatascience.com/take-a-look-under-the-hood-24e40281c900/
Using Monosemanticity to understand the concepts a Large Language Model learned

Take a Look Under the Hood
Just like with the brain, it is quite hard to understand what is really happening inside an LLM. Photo by Robina Weermeijer on Unsplash

With the increasing use of Large Language Models (LLMs), the need to understand their reasoning and behavior increases as well. In this article, I want to present to you an approach that sheds some light on the concepts an LLM represents internally. In this approach, a representation is extracted that allows one to understand a model’s activations in terms of discrete concepts being used for a given input. This is called Monosemanticity, indicating that these concepts have just a single (mono) meaning (semantic).

In this article, I will first describe the main idea behind Monosemanticity. For that, I will explain sparse autoencoders, which are a core mechanism within the approach, and show how they are used to structure an LLM’s activations in an interpretable way. Then I will retrace some demonstrations the authors of the Monosemanticity approach proposed to explain the insights of their approach, closely following their original publication.

Sparse autoencoders

Just like an hourglass, an autoencoder has a bottleneck the data must pass through. Photo by Alexandar Todov on Unsplash

We have to start by taking a look at sparse autoencoders. First of all, an autoencoder is a neural net that is trained to reproduce a given input, i.e. it is supposed to produce exactly the vector it was given. Now you may wonder, what’s the point? The important detail is that the autoencoder has intermediate layers that are smaller than the input and output. Passing information through these layers necessarily leads to a loss of information, and hence the model is not able to just learn the input by heart and reproduce it fully. It has to pass the information through a bottleneck and hence needs to come up with a dense representation of the input that still allows it to reproduce the input as well as possible. The first half of the model we call the encoder (from input to bottleneck) and the second half we call the decoder (from bottleneck to output). After having trained the model, you may throw away the decoder. The encoder now transforms a given input into a representation that keeps important information but has a different structure than the input and potentially removes unneeded parts of the data.

To make an autoencoder sparse, its objective is extended. Besides reconstructing the input as well as possible, the model is also encouraged to activate as few neurons as possible. Instead of using all the neurons a little, it should focus on using just a few of them, but with a high activation. This also allows the model to have more neurons in total, making the bottleneck disappear from the model’s architecture. However, the fact that activating too many neurons is punished still keeps the idea of compressing the data as much as possible. The neurons that are activated are then expected to represent important concepts that describe the data in a meaningful way. We call them features from now on.
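To make this concrete, here is a minimal sparse autoencoder sketch in PyTorch. The dimensions, the ReLU activation, and the simple L1 sparsity penalty are illustrative assumptions, not the exact setup used by the authors:

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, input_dim, feature_dim, sparsity_weight=1e-3):
        super().__init__()
        # more features than inputs: no architectural bottleneck,
        # sparsity is enforced by the loss instead
        self.encoder = nn.Linear(input_dim, feature_dim)
        self.decoder = nn.Linear(feature_dim, input_dim)
        self.sparsity_weight = sparsity_weight

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # the (hopefully sparse) feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

    def loss(self, x):
        reconstruction, features = self(x)
        reconstruction_loss = ((reconstruction - x) ** 2).mean()
        sparsity_loss = features.abs().mean()  # L1 penalty: most features should stay near zero
        return reconstruction_loss + self.sparsity_weight * sparsity_loss

# hypothetical usage: a batch of activations from an intermediate LLM layer
sae = SparseAutoencoder(input_dim=4096, feature_dim=16384)
activations = torch.randn(8, 4096)
sae.loss(activations).backward()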

In the original Monosemanticity publication, such a sparse autoencoder is trained on an intermediate layer in the middle of the Claude 3 Sonnet model (an LLM published by Anthropic that can be said to play in the same league as the GPT models from OpenAI). That is, you can take some tokens (i.e. text snippets), forward them through the first half of the Claude 3 Sonnet model, and feed that activation to the sparse autoencoder. You will then get an activation of the features that represent the input. However, we don’t really know what these features mean so far. To find out, let’s imagine we feed the following texts to the model:

  • The cat is chasing the dog.
  • My cat is lying on the couch all day long.
  • I don’t have a cat.

If there is one feature that activates for all three sentences, you may guess that this feature represents the idea of a cat. There may be other features, though, that activate just for single sentences and not for the others. For sentence one, you would expect the feature for dog to be activated, and to represent the meaning of sentence three, you would expect a feature that represents some form of negation or “not having something”.

Different features

Features can describe quite different things, from apples and bananas to the notion of being edible and tasting sweet. Photo by Jonas Kakaroto on Unsplash

From the aforementioned example, we saw that features can describe quite different things. There may be features that represent concrete objects or entities (such as cats, the Eiffel Tower, or Benedict Cumberbatch), but there may also be features dedicated to more abstract concepts like sadness, gender, revolution, lying, things that can melt, or the German letter ß (yes, we Germans indeed have an additional letter just for ourselves). As the model also saw programming code during its training, it also includes many features that are related to programming languages, representing contexts such as code errors or computational functions. You can explore the features of the Claude 3 model here.

If the model is capable of speaking multiple languages, the features are found to be multilingual. That means a feature that corresponds to, say, the concept of sorrow would be activated by relevant sentences in each language. Likewise, the features are also multimodal if the model is able to work with different input modalities. The Benedict Cumberbatch feature would then activate for the name, but also for pictures of or verbal mentions of Benedict Cumberbatch.

Influence on behavior

Features can influence behavior, just like a steering wheel influences the way you take. Photo by Niklas Garnholz on Unsplash

So far we have seen that certain features are activated when the model produces a certain output. From the model’s perspective, the direction of causality is the other way round, though. If the feature for the Golden Gate Bridge is activated, this causes the model to produce an answer that is related to this feature’s concept. In the following, this is demonstrated by artificially increasing the activation of a feature during the model’s inference.

Answers of the model being influenced by a high activation of a certain feature. Image taken from the original publication.

On the left, we see the answers to two questions in the normal setup, and on the right we see how these answers change if the activation of the features Golden Gate Bridge (first row) and brain sciences (second row) is increased. It is quite intuitive that activating these features makes the model produce texts that include the concepts of the Golden Gate Bridge and brain sciences. In the usual case, the features are activated by the model’s input and its prompt, but with the approach we saw here, one can also activate some features in a more deliberate and explicit way. You could think of always activating the politeness feature to steer the model’s answers in the desired way. Without the notion of features, you would do that by adding instructions to the prompt such as “always be polite in your answers”, but with the feature concept, this could be done more explicitly. On the other hand, you can also think of explicitly deactivating features to avoid the model telling you how to build an atomic bomb or conduct tax fraud.
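Schematically, such an intervention could look like the sketch below, building on the SparseAutoencoder from above. The feature index and the strength value are hypothetical, and the actual procedure in the publication is more involved:

def steer(sae, activation, feature_idx, strength=10.0):
    # encode, clamp one learned feature to a high value, and decode
    # back into the model's activation space
    features = torch.relu(sae.encoder(activation))
    features[..., feature_idx] = strength
    return sae.decoder(features)

# hypothetical: boost feature 1234 ("Golden Gate Bridge") during inference
steered_activation = steer(sae, torch.randn(1, 4096), feature_idx=1234)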

Taking a deeper look: Specificity, Sensitivity and Completeness

Let’s observe the features in more detail. Photo by K8 on Unsplash

Now that we have understood how the features are extracted, we can follow some of the authors’ experiments that show us which features and concepts the model actually learned.

First, we want to know how specific the features are, i.e. how well they stick to their exact concept. We may ask: does the feature that represents Benedict Cumberbatch indeed activate only for Benedict Cumberbatch and not for other actors? To shed some light on this question, the authors used an LLM to rate texts regarding their relevance to a given concept. In the following example, it was assessed how much a text relates to the concept of brain science, on a scale from 0 (completely irrelevant) to 3 (very relevant). In the next figure, we see these ratings as the colors (blue for 0, red for 3) and the activation level on the x-axis. The further we go to the right, the more the feature is activated.

The activation of the feature for brain science together with relevance scores of the inputs. Image taken from the original publication.

We see a clear correlation between the activation (x-axis) and the relevance (color). The higher the activation, the more often the text is considered highly relevant to the topic of brain sciences. The other way round, for texts that are of little or no relevance to the topic of brain sciences, the feature only activates marginally (if at all). That means the feature is quite specific to the topic of brain science and does not activate as much for related topics such as psychology or medicine.

Sensitivity

The other side of the coin of specificity is sensitivity. We just saw an example of how a feature activates only for its topic and not for related topics (at least not as much), which is the specificity. Sensitivity now asks the question “but does it activate for every mention of the topic?” In general, you can easily have the one without the other. A feature may only activate for the topic of brain science (high specificity), but it may miss the topic in many sentences (low sensitivity).

The authors spent less effort on the investigation of sensitivity. However, there is a demonstration that is quite easy to understand: The feature for the Golden Gate Bridge activates for sentences on that topic in many different languages, even without an explicit mention of the English term “Golden Gate Bridge”. More fine-grained analyses are quite difficult here, because it is not always clear what a feature is supposed to represent in detail. Say you have a feature that you think represents Benedict Cumberbatch. Now you find out that it is very specific (reacting to Benedict Cumberbatch only), but only reacts to some, not all, pictures. How can you know if the feature is just insensitive, or if it is rather a feature for a more fine-grained subconcept such as Sherlock from the BBC series (played by Benedict Cumberbatch)?

Completeness

In addition to the features’ activation for their concepts (specificity and sensitivity), you may wonder whether the model has features for all important concepts. It is quite difficult to decide which concepts it should have, though. Do you really need a feature for Benedict Cumberbatch? Are “sadness” and “feeling sad” two different features? Is “misbehaving” a feature on its own, or can it be represented by the combination of the features for “behaving” and “negation”?

To catch a glimpse of the features’ completeness, the authors selected some categories of concepts with a limited number of members, such as the elements of the periodic table. In the following figure, we see all the elements on the x-axis, and we see whether a corresponding feature has been found for three different sizes of the autoencoder model (from 1 million to 34 million parameters).

Elements of the periodic table having a feature in the autoencoders of different sizes. Image taken from the original publication.

It is not surprising that the biggest autoencoder has features for more elements of the periodic table than the smaller ones. However, even it doesn’t catch all of them. We don’t know, though, whether this really means that the model does not have a clear concept of, say, Bohrium, or whether the feature just did not survive within the autoencoder.

Limitations

While we saw some demonstrations of the features representing the concepts the model learned, we have to emphasize that these were in fact qualitative demonstrations and not quantitative evaluations. All the examples were great for getting an idea of what the model actually learned and for demonstrating the usefulness of the Monosemanticity approach. However, a formal evaluation that assesses all the features in a systematic way is needed to really back up the insights gained from such investigations. That is easy to say and hard to conduct, as it is not clear what such an evaluation could look like. Future research is needed to find ways to underpin such demonstrations with quantitative and systematic data.

Summary

Monosemanticity is an interesting path, but we don’t yet know where it will lead us. Photo by Ksenia Kudelkina on Unsplash

We just saw an approach that allows us to gain some insights into the concepts a Large Language Model may leverage to arrive at its answers. A number of demonstrations showed how the features extracted with a sparse autoencoder can be interpreted in a quite intuitive way. This promises a new way to understand Large Language Models. If you know that the model has a feature for the concept of lying, you could expect it to do so, and having a concept of politeness (vs. not having it) can influence its answers quite a lot. For a given input, the features can also be used to understand the model’s thought traces. When asking a model to tell a story, the activation of the feature happy end may explain how it arrives at a certain ending, and when the model does your tax declaration, you may want to know whether the concept of fraud is activated or not.

As we see, there is quite some potential to understand LLMs in more detail. A more formal and systematic evaluation of the features is needed, though, to back the promises this form of analysis introduces.


Sources

This article is based on this publication, where the Monosemanticity approach is applied to an LLM:

There is also a previous work that introduces the core ideas in a more basic model:

For the Claude 3 model that has been analyzed, see here:

The features can be explored here:

An Overview of the LoRA Family https://towardsdatascience.com/an-overview-of-the-lora-family-515d81134725/ Sun, 10 Mar 2024 15:27:45 +0000 LoRA, DoRA, AdaLoRA, Delta-LoRA, and more variants of low-rank adaptation.

LoRA comes in different shapes and varieties. Photo by Lucas George Wendt on Unsplash.

Low-Rank Adaptation (LoRA) can be considered a major breakthrough towards the ability to train large language models for specific tasks efficiently. It is widely used today in many applications and has inspired research on how to improve upon its main ideas to achieve better performance or train models even faster.

In this article, I want to give an overview of some variants of LoRA that promise to improve LoRA’s capabilities in different ways. I will first explain the basic concept of LoRA itself, before presenting LoRA+, VeRA, LoRA-FA, LoRA-drop, AdaLoRA, DoRA, and Delta-LoRA. For each, I will introduce the basic concepts and main ideas and show how the approach deviates from the original LoRA. I will spare you the technical details unless they are important for the basic concepts, and I will not discuss evaluations in detail. For interested readers, I have linked the original papers at the end.

LoRA

The main idea of LoRA is to add two smaller tunable matrices A and B next to the pre-trained weight matrix W, without changing the parameters of W. Image from [1].

Low-Rank Adaptation (LoRA) [1] is a technique that is widely used today to train large language models (LLMs). Large language models come with the capability to predict tokens of natural language given a natural language input. This is an astonishing capability, but for solving many problems it is not enough. Most of the time, you want to train an LLM on a given downstream task, such as classifying sentences or generating answers to given questions. The most straightforward way of doing that is fine-tuning, where you train some of the layers of the LLM with data of the desired task. That means training very big models with millions to billions of parameters, though.

LoRA gives an alternative way of training that is much faster and easier to conduct due to a drastically reduced number of parameters. Next to the parameter weights of an already pre-trained LLM layer, LoRA introduces two matrices A and B, called adapters, that are much smaller. If the original matrix of parameters W is of size d x d, the matrices A and B are of size r x d and d x r, respectively, where r is much smaller (typically below 100). The parameter r is called the rank. That is, if you use LoRA with a rank of r=16, these matrices are of shape 16 x d and d x 16. The higher the rank, the more parameters you train. That can lead to better performance on the one hand, but needs more computation time on the other.

Now that we have these new matrices A and B, what happens with them? The input fed to W is also fed through BA, and the output of BA is added to the output of the original matrix W. That is, you train some parameters on top and add their output to the original prediction, which allows you to influence the model’s behavior. You don’t train W anymore, which is why we sometimes say that W is frozen. Importantly, the addition of A and B is not only done at the very end (which would just add a layer on top) but can be applied to layers deep inside the neural network.

That is the main idea of LoRA, and its biggest advantage is that you have to train fewer parameters than in fine-tuning, but still get comparable performance. One more technical detail I want to mention at this point: at the beginning, the matrix A is initialized with random values of mean zero, but with some variance around that mean. The matrix B is initialized as a matrix of all zeros. This ensures that the LoRA matrices don’t change the output of the original W in a random fashion from the very beginning. The update of A and B on W’s output should rather be an addition to the original output, once the parameters of A and B are tuned in the desired direction. However, we will later see that some approaches deviate from this idea for different reasons.
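To make this concrete, here is a minimal sketch of what a LoRA-augmented linear layer could look like in PyTorch. The class, the shapes, and the omitted alpha/r scaling are illustrative simplifications, not the reference implementation:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, r: int = 16):
        super().__init__()
        # pre-trained weight W, frozen during LoRA training
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.W.weight.requires_grad = False
        # adapters: A is random with mean zero, B starts as all zeros,
        # so the update B*A*x is zero at initialization
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # output of the frozen layer plus the low-rank update
        return self.W(x) + x @ self.A.T @ self.B.T

Since B starts at zero, the model’s predictions are unchanged at the start of training; only A and B receive gradients, and after training, the product of B and A can be merged into W so that inference costs nothing extra.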

LoRA as just explained is used very often with today’s LLMs. However, by now there are many variants of LoRA that deviate from the original method in different ways and aim at improving speed, performance, or both. Some of these I want to present to you in the following.

LoRA+

LoRA+ introduces different learning rates for the two matrices A and B, here indicated by the parameter λ. Image from [2].

LoRA+ [2] introduces a more efficient way of training LoRA adapters by using different learning rates for matrices A and B. Most of the time, when training a neural network, there is just one learning rate that is applied to all weight matrices in the same way. However, for the adapter matrices used in LoRA, the authors of LoRA+ show that this single learning rate is suboptimal. The training becomes more efficient by setting the learning rate of matrix B much higher than that of matrix A.

There is a theoretical argument justifying that approach, which is mainly based on numerical caveats of a neural network’s initialization when the model becomes very wide in terms of the number of its neurons. However, the math required to prove that is quite complicated (if you are really into it, feel free to take a look at the original paper [2]). Intuitively, you may think that matrix B, which is initialized with zeros, could use bigger update steps than the randomly initialized matrix A. In addition, there is empirical evidence for an improvement by that approach. By setting the learning rate of matrix B 16 times higher than that of matrix A, the authors were able to gain a small improvement in model accuracy (around 2%), while speeding up training time by a factor of two for models such as RoBERTa or Llama-7b.
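In practice, this amounts to nothing more than two optimizer parameter groups. A minimal sketch, with the ratio of 16 taken from the paper and everything else being assumed values:

import torch

d, r = 512, 16
A = torch.nn.Parameter(torch.randn(r, d) * 0.01)
B = torch.nn.Parameter(torch.zeros(d, r))

base_lr = 1e-4  # assumed base learning rate for matrix A
optimizer = torch.optim.AdamW([
    {"params": [A], "lr": base_lr},
    {"params": [B], "lr": base_lr * 16},  # B learns 16 times faster
])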

VeRA

VeRA doesn’t train A and B, but initializes them to a random projection and trains additional vectors d and b instead. Image from [3].

With VeRA (Vector-based Random Matrix Adaptation) [3], the authors introduce an approach to drastically reduce the parameter count of the LoRA adapters. Instead of training the matrices A and B, which is the core idea of LoRA in the first place, they initialize these matrices with shared random weights (i.e. all the matrices A and B in all the layers have the same weights) and add two new vectors d and b. Only these vectors d and b are trained.

You may wonder how this can work at all. A and B are matrices of random weights. How should they contribute anything to the model’s performance if they are not trained at all? This approach is based on an interesting field of research on so-called random projections. There is quite some research indicating that in a large neural network, only a small fraction of the weights is used to steer the behavior and lead to the desired performance on the task the model was trained for. Due to the random initialization, some parts of the model (or sub-networks) contribute more towards the desired model behavior from the very beginning. During training, all parameters are trained though, as it is not known which the important sub-networks are. That makes training very costly, as most of the parameters that are updated don’t add any value to the model’s prediction.

Based on this idea, there are approaches to only train these relevant sub-networks. A similar behavior can be obtained by not training the sub-networks themselves, but by adding projection vectors after the matrix. Due to the multiplication of the matrix with the vector, this can lead to the same output as tuning some sparse parameters in the matrix would. That is exactly what the authors of VeRA propose by introducing the vectors d and b, which are trained while the matrices A and B are frozen. Also, in contrast to the original LoRA approach, matrix B is not set to zero anymore but is initialized randomly, just like matrix A.
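A rough sketch of what the VeRA update could look like; the shapes and the initialization of the vectors d and b are simplifying assumptions:

import torch

d_model, r = 512, 16
# frozen random projections, shared across all layers
A = torch.randn(r, d_model, requires_grad=False)
B = torch.randn(d_model, r, requires_grad=False)
# per-layer trainable scaling vectors
d_vec = torch.nn.Parameter(torch.ones(r))
b_vec = torch.nn.Parameter(torch.zeros(d_model))

def vera_update(x: torch.Tensor) -> torch.Tensor:
    # scale the frozen random projections element-wise with d and b
    return b_vec * ((d_vec * (x @ A.T)) @ B.T)

x = torch.randn(8, d_model)
print(vera_update(x).shape)  # torch.Size([8, 512])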

This approach naturally leads to a number of parameters that is much smaller than with the full matrices A and B. For example, if you introduce LoRA layers of rank 16 to GPT-3, you would have 75.5M parameters. With VeRA, you only have 2.8M (a reduction of 97%). But how is the performance with such a small number of parameters? The authors of VeRA performed an evaluation on some common benchmarks such as GLUE and E2E, with models based on RoBERTa and GPT-2 Medium. Their results suggest that the VeRA model yields performance that is only marginally lower than that of models that are fully fine-tuned or that use the original LoRA technique.

LoRA-FA

LoRA-FA freezes matrix A and only trains matrix B. Image from [4].

Another approach, LoRA-FA [4], which stands for LoRA with Frozen-A, goes in a similar direction as VeRA. In LoRA-FA, the matrix A is frozen after initialization and hence serves as a random projection. Instead of adding new vectors, matrix B is trained after being initialized with zeros (just as in the original LoRA). This halves the number of trainable parameters while having comparable performance to normal LoRA.

LoRA-drop

LoRA-drop uses the output of B*A to decide, which LoRA-layers are worth to be trained at all. Image from [5].

In the beginning, I explained that you can add LoRA matrices to any layer of the neural network. LoRA-drop [5] introduces an algorithm to decide which layers are worth enhancing with LoRA, and for which ones this is not worth the effort. Even though training LoRA adapters is much cheaper than fine-tuning the whole model, the more LoRA adapters you add, the more expensive training still becomes.

LoRA-drop consists of two steps. In the first step, you sample a subset of the data and train the LoRA adapters for a few iterations. Then you calculate the importance of each LoRA adapter as BAx, where A and B are the LoRA matrices and x is the input. That is simply the output of each LoRA adapter that is added to the output of its frozen layer. If this output is big, it changes the behavior of the frozen layer drastically. If it is small, this indicates that the LoRA adapter has only little influence on the frozen layer and could as well be omitted.

Given that importance, you now select the LoRA layers that are most important. There are different ways of doing that: you can sum up the importance values until you reach a threshold that is controlled by a hyperparameter, or you just take the top n LoRA layers with the highest importance for a fixed n. In either case, in the next step, you conduct the full training on the whole dataset (remember that you used a subset of the data for the previous step) but only on the layers you just selected. The other adapters are fixed to a shared set of parameters that won’t be changed anymore during training.
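A small sketch of how the importance score could be computed; the adapter weights and the selection of the top n layers are made up for illustration, and in the real algorithm, A and B would come from the few warm-up iterations:

import torch

def adapter_importance(A: torch.Tensor, B: torch.Tensor,
                       inputs: torch.Tensor) -> float:
    # importance = typical size of the adapter output B*A*x,
    # averaged over a sample of inputs (shape: batch x d)
    update = inputs @ A.T @ B.T
    return update.norm(dim=-1).mean().item()

# hypothetical adapters of three layers after a few training iterations
adapters = {f"layer_{i}": (torch.randn(8, 512) * 0.01,
                           torch.randn(512, 8) * 0.01) for i in range(3)}
sample = torch.randn(64, 512)  # the sampled data subset
scores = {name: adapter_importance(A, B, sample)
          for name, (A, B) in adapters.items()}
top = sorted(scores, key=scores.get, reverse=True)[:2]  # keep the top n=2
print(top)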

The LoRA-drop algorithm hence allows training a model with just a subset of the LoRA layers. The authors provide empirical evidence that indicates only marginal changes in accuracy compared to training all LoRA layers, but at reduced computation time due to the smaller number of parameters that have to be trained.

AdaLoRA

AdaLoRA allows to adapt the rank of the LoRA matrices dynamically. Photo by Hasmik Ghazaryan Olson on Unsplash

There are alternative ways to decide which LoRA parameters are more important than others. In this section, I will present AdaLoRA [6], which stands for Adaptive LoRA. What part of LoRA is adaptive here? It is the rank (i.e. the size) of the LoRA matrices. The main problem is the same as in the previous section: it may not be worth adding LoRA matrices A and B to each layer, and for some layers, the LoRA training may be more important (i.e. may lead to more change in the model’s behavior) than for others. To decide on that importance, the authors of AdaLoRA propose to consider the singular values of the LoRA matrices as indicators of their importance.

What is meant by that? First, we have to understand that a matrix multiplication can also be seen as applying a function to a vector. When dealing with neural networks, this is quite obvious: most of the time you use neural networks as functions, i.e. you give an input (say, a matrix of pixel values) and obtain a result (say, a classification of an image). Under the hood, this function application is powered by a sequence of matrix multiplications. Now, let’s say you want to reduce the number of parameters in such a matrix. That will change the function’s behavior, but you want it to change as little as possible. One way to do that is to compute the eigenvalues of the matrix, which tell you how much variance is captured by each of the matrix’s rows. You may then decide to set those rows to zero that capture only a small fraction of the variance and hence don’t add much information to the function.

This is the main idea of AdaLoRA, since the aforementioned singular values are exactly the square roots of the eigenvalues. That is, based on the singular values, AdaLoRA decides which rows of which LoRA matrices are more important and which can be omitted. This effectively shrinks the rank of matrices that have many rows that don’t contribute much. However, note an important difference from LoRA-drop from the previous section: in LoRA-drop, the adapter of a layer is selected either to be trained fully or not to be trained at all. AdaLoRA can also decide to keep adapters for some layers but with lower rank. That means, in the end, different adapters can have different ranks (whereas in the original LoRA approach, all adapters have the same rank).

There are some more details to the AdaLoRA approach, which I omit for brevity. I want to mention two of them, though. First, AdaLoRA does not calculate the singular values explicitly all the time (as that would be very costly), but decomposes the weight matrices with a singular value decomposition. This decomposition is another way of representing the same information as in a single matrix, but it allows reading off the singular values directly, without costly computation. Second, AdaLoRA does not decide on the singular values alone, but also takes into account the sensitivity of the loss to certain parameters. If setting a parameter to zero has a large influence on the loss, this parameter is said to have high sensitivity. When deciding where to shrink the rank, the mean sensitivity of a row’s elements is taken into consideration in addition to the singular value.
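As a toy illustration of the underlying idea (not the authors’ actual algorithm, which works on an SVD-style parametrization and additionally uses the sensitivity scores), one could rank the rank-1 components of an adapter update by their singular values:

import torch

# a hypothetical trained adapter update of rank 8 in a 512-dimensional layer
A = torch.randn(8, 512)
B = torch.randn(512, 8)
delta_w = B @ A

# only the first 8 singular values can be non-zero for a rank-8 product
s = torch.linalg.svdvals(delta_w)[:8]
relative = s / s.sum()
keep = relative > 0.05  # assumed pruning threshold
print(f"keeping {int(keep.sum())} of {len(s)} rank-1 components")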

Empirical evidence for the value of the approach is given by comparing AdaLoRA with standard LoRA under the same rank budget. That is, both approaches have the same number of parameters in total, but these are distributed differently. In LoRA, all matrices have the same rank, while in AdaLoRA, some have a higher and some have a lower rank, which leads to the same number of parameters in the end. In many scenarios, AdaLoRA yields better scores than the standard LoRA approach, indicating a better distribution of trainable parameters on parts of the model that are of particular importance for the given task. The following plot gives an example of how AdaLoRA distributed the ranks for a given model. As we see, it gives higher ranks to the layers towards the end of the model, indicating that adapting these is more important.

On different layers of the network, the LoRA matrices are given different ranks. On later layers, the ranks are higher, in general. Image from [6].

DoRA

In DoRA, the weight matrix W is decomposed into magnitude m and direction V, which are tuned independently. Image from [7].

Another approach to modifying LoRA for better performance is Weight-Decomposed Low-Rank Adaptation, or DoRA [7]. DoRA starts with the idea that each matrix can be decomposed into the product of a magnitude and a direction. For a vector in 2D space, you can easily visualize that: a vector is nothing else than an arrow starting at the position of zero and ending at a certain point in the vector space. With the vector’s entries, you specify that point, e.g. by saying x=1 and y=1, if your space has two dimensions x and y. Alternatively, you could describe the very same point in a different way by specifying a magnitude and an angle (i.e. a direction), such as m=√2 and a=45°. That means that you start at point zero and move in the direction of 45° with an arrow of length √2. That will lead you to the same point (x=1, y=1).

This decomposition into magnitude and direction can also be done with matrices of higher order. The authors of DoRA apply it to the weight matrices that describe the updates of the training steps, both for a model trained with normal fine-tuning and for a model trained with LoRA adapters. A comparison of these two techniques can be seen in the following plot:

Finetuning and LoRA differ in the relationship between the changes in magnitude and direction. Image from [7].

We see two plots, one for a fine-tuned model (left) and one for a model trained with LoRA adapters (right). On the x-axis, we see the change in direction, on the y-axis the change in magnitude, and each scatter point in the plots belongs to one layer of the model. There is an important difference between the two ways of training: in the left plot, there is a small negative correlation between the update in direction and the update in magnitude, while in the right plot, there is a much stronger positive relationship. You may wonder which is better, or whether this has any meaning at all. Remember that the main idea of LoRA is to achieve the same performance as fine-tuning, but with fewer parameters. That means, ideally, we want LoRA’s training to share as many properties with fine-tuning as possible, as long as this does not increase the costs. If the correlation between direction and magnitude is slightly negative in fine-tuning, this may be a desirable property for LoRA as well, if it is achievable. In other words, if the relationship between direction and magnitude is different in LoRA compared to full fine-tuning, this may be one of the reasons why LoRA sometimes performs less well than fine-tuning.

The authors of DoRA introduce a method to train magnitude and direction independently by separating the pre-trained matrix W into a magnitude vector m of size 1 x d and a direction matrix V. The direction matrix V is then enhanced by B*A, as known from the standard LoRA approach, and m is trained as it is, which is feasible because it has just one dimension. While LoRA tends to change magnitude and direction together (as indicated by the high positive correlation between the two), DoRA can more easily adjust one without the other, or compensate changes in one with negative changes in the other. We can see that the relationship between direction and magnitude is more like the one in fine-tuning:

For DoRA, the relationship between magnitude and direction is more like that in finetuning. Image from [7].

On several benchmarks, DoRA outperforms LoRA in accuracy. Decomposing the weight updates into magnitude and direction may allow DoRA to perform training that is closer to the training done in fine-tuning, while still using the smaller parameter space introduced by LoRA.
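A brief sketch of the decomposition for a single layer; using the column-wise norm as the magnitude follows the paper’s description, while the shapes and initialization are simplified assumptions:

import torch

d, r = 512, 16
W0 = torch.randn(d, d)  # pre-trained weight, frozen
m = torch.nn.Parameter(W0.norm(dim=0, keepdim=True))  # magnitude (1 x d)
A = torch.nn.Parameter(torch.randn(r, d) * 0.01)
B = torch.nn.Parameter(torch.zeros(d, r))

def dora_weight() -> torch.Tensor:
    # direction: the LoRA-updated matrix, normalized column-wise;
    # the trainable magnitude m then scales each column independently
    V = W0 + B @ A
    return m * (V / V.norm(dim=0, keepdim=True))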

Delta-LoRA

Delta-LoRA doesn’t freeze the matrix W but updates it with the gradient obtained from B*A. Image from [8].

Delta-LoRA [8] introduces yet another idea to improve LoRA. This time, the pre-trained matrix W comes into play again. Remember that the main idea of LoRA is to not (!) tune the pre-trained matrix W, as that is too costly (it would be normal fine-tuning). That is why LoRA introduced the new, smaller matrices A and B. However, those smaller matrices have less capability to learn the downstream task, which is why the performance of a LoRA-trained model is often lower than the performance of a fine-tuned model. Tuning W during training would be great, but how can we afford that?

The authors of Delta-LoRA propose to update the matrix W by the delta of AB, i.e. the difference between AB in two consecutive time steps. This delta is scaled with a hyperparameter λ, which controls how big the influence of the new training on the pre-trained weights should be, and is then added to W (while α and r (the rank) are hyperparameters from the original LoRA setup):

W is updated with the difference of AB in two consecutive steps. Image from [8].
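In code, the update step could be sketched as follows, with all hyperparameter values being assumptions:

import torch

d, r = 512, 16
alpha, lam = 32, 0.5  # LoRA scaling alpha and Delta-LoRA scale lambda

def delta_lora_step(W: torch.Tensor,
                    A_prev: torch.Tensor, B_prev: torch.Tensor,
                    A_new: torch.Tensor, B_new: torch.Tensor) -> torch.Tensor:
    # update the pre-trained W with the change of B*A between two
    # consecutive optimizer steps, scaled by lambda * alpha / r
    delta = (B_new @ A_new) - (B_prev @ A_prev)
    return W + lam * (alpha / r) * delta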

That introduces more parameters to be trained at almost no computational overhead. We do not have to calculate the gradient for the whole matrix W, as we would in fine-tuning, but update it with a delta we already obtain in the LoRA training anyway. The authors compared this method on a number of benchmarks using models like RoBERTa and GPT-2 and found a boost in performance over the standard LoRA approach.

Summary

Congrats. You’ve made it to the end. Photo by david Griffiths on Unsplash

We just saw a number of approaches that vary the core idea of LoRA to reduce computation time or improve performance (or both). To close, here is a short summary of the different approaches:

  • LoRA introduces low-rank matrices A and B that are trained, while the pre-trained weight matrix W is frozen.
  • LoRA+ suggests a much higher learning rate for B than for A.
  • VeRA does not train A and B, but initializes them randomly and trains new vectors d and b on top.
  • LoRA-FA only trains matrix B.
  • LoRA-drop uses the output of B*A to determine which layers are worth training at all.
  • AdaLoRA adapts the ranks of A and B in different layers dynamically, allowing a higher rank in those layers where more contribution to the model’s performance is expected.
  • DoRA splits the LoRA adapter into the two components of magnitude and direction and allows training them more independently.
  • Delta-LoRA changes the weights of W by the delta of A*B between two consecutive training steps.

The field of research on LoRA and related methods is very rich and vivid, with new contributions every other day. In this article, I wanted to explain the core ideas of some approaches. Naturally, this was only a selection that is far from a complete review.

I hope that I have been able to share some knowledge with you and possibly inspire you to new ideas. LoRA and related methods are a field of research with great potential, as we saw. New breakthroughs in improving performance or computation time in training large language models can be expected soon, I suppose.

References and Further Reading

These are the papers on the concepts explained in this article:

  • [1] LoRA: Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., … & Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  • [2] LoRA+: Hayou, S., Ghosh, N., & Yu, B. (2024). LoRA+: Efficient Low Rank Adaptation of Large Models. arXiv preprint arXiv:2402.12354.
  • [3] VeRA: Kopiczko, D. J., Blankevoort, T., & Asano, Y. M. (2023). Vera: Vector-based random matrix adaptation. arXiv preprint arXiv:2310.11454.
  • [4] LoRA-FA: Zhang, L., Zhang, L., Shi, S., Chu, X., & Li, B. (2023). Lora-fa: Memory-efficient low-rank adaptation for large language models fine-tuning. arXiv preprint arXiv:2308.03303.
  • [5] LoRA-drop: Zhou, H., Lu, X., Xu, W., Zhu, C., & Zhao, T. (2024). LoRA-drop: Efficient LoRA Parameter Pruning based on Output Evaluation. arXiv preprint arXiv:2402.07721.
  • [6] AdaLoRA: Zhang, Q., Chen, M., Bukharin, A., He, P., Cheng, Y., Chen, W., & Zhao, T. (2023). Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512.
  • [7] DoRA: Liu, S. Y., Wang, C. Y., Yin, H., Molchanov, P., Wang, Y. C. F., Cheng, K. T., & Chen, M. H. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. arXiv preprint arXiv:2402.09353.
  • [8] Delta-LoRA: Zi, B., Qi, X., Wang, L., Wang, J., Wong, K. F., & Zhang, L. (2023). Delta-lora: Fine-tuning high-rank parameters with the delta of low-rank matrices. arXiv preprint arXiv:2309.02411.

For some core ideas on random projection, as mentioned in the section on VeRA, this is one of the major contributions to the field:

  • Frankle, J., & Carbin, M. (2018). The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.

For a more fine-grained explanation of LoRA and DoRA, I can recommend this article:

The German Tank Problem https://towardsdatascience.com/the-german-tank-problem-3455237aaf43/ Wed, 06 Mar 2024 06:14:20 +0000 Estimating your chances of winning the lottery with sampling

Statistical estimates can be fascinating, can’t they? By just sampling a few instances from a population, you can infer properties of that population such as the mean value or the variance. Likewise, under the right circumstances, it is possible to estimate the total size of the population, as I want to show you in this article.

I will use the example of drawing lottery tickets to estimate how many tickets there are in total, and hence calculate the likelihood of winning. More formally, this means estimating the population size given a discrete uniform distribution. We will see different estimates and discuss their differences and weaknesses. In addition, I will point you to some other use cases in which this approach can be applied.

Playing the lottery

I’m too anxious to ride one of those, but if it pleases you…Photo by Oneisha Lee on Unsplash

Let’s imagine I go to a state fair and buy some tickets in the lottery. As a data scientist, I want to know the probability of winning the main prize, of course. Let’s assume there is just a single ticket that wins the main prize. So, to estimate the likelihood of winning, I need to know the total number of lottery tickets N, as my chance of winning is then 1/N (or k/N, if I buy k tickets). But how can I estimate N by just buying a few tickets (which are, as I saw, all losers)?

I will make use of the fact that the lottery tickets have numbers on them, and I assume that these are consecutive running numbers (which means that I assume a discrete uniform distribution). Say I have bought some tickets and their numbers in order are [242, 412, 823, 1429, 1702]. What do I know about the total number of tickets now? Well, obviously there are at least 1702 tickets (as that is the highest number I have seen so far). That gives me a first lower bound on the number of tickets, but how close is it to the actual number? Just because the highest number I have drawn is 1702, that doesn’t mean there aren’t any numbers higher than that. It is very unlikely that I caught the lottery ticket with the highest number in my sample.

However, we can make more out of the data. Let us think as follows: if we knew the middle number of all the tickets, we could easily derive the total number from that. If the middle number is m, then there are m-1 tickets below that middle number and m-1 tickets above it. That is, the total number of tickets would be (m-1) + (m-1) + 1 (with the +1 being the ticket of number m itself), which is equal to 2m-1. We don’t know that middle number m, but we can estimate it by the mean or the median of our sample. My sample above has the (rounded) average of 922, which yields 2*922-1 = 1843. That is, from that calculation, the estimated number of tickets is 1843.

That was quite interesting so far, as just from a few lottery ticket numbers, I was able to give an estimate of the total number of tickets. However, you may wonder whether that is the best estimate we can get. Let me spoil it right away: it is not.

The method we used has some drawbacks. Let me demonstrate that with another example: say we have the numbers [12, 30, 88], which leads us to 2*43-1 = 85. That means the formula suggests there are 85 tickets in total. However, we have ticket number 88 in our sample, so this cannot be true at all! There is a general problem with this method: the estimated N can be lower than the highest number in the sample. In that case, the estimate has no meaning at all, as we already know that the highest number in the sample is a natural lower bound for the overall N.

A better approach: Using even spacing

These birds are quite evenly spaced on the power line, which is an important concept for our next approach. Photo by Ridham Nagralawala on Unsplash

Okay, so what can we do? Let us think in a different direction. The lottery tickets I bought have been sampled randomly from the distribution that goes from 1 to the unknown N. My ticket with the highest number is number 1702, and I wonder how far away it is from the highest ticket overall. In other words, what is the gap between 1702 and N? If I knew that gap, I could easily calculate N from it. What do I know about that gap, though? Well, I have reason to assume that this gap is expected to be as big as all the other gaps between two consecutive tickets in my sample. The gap between the first and the second ticket should, on average, be as big as the gap between the second and the third ticket, and so on. There is no reason why any of those gaps should be bigger or smaller than the others, except for random deviation, of course. I sampled my lottery tickets independently, so they should be evenly spaced over the range of all possible ticket numbers. On average, the numbers in the range of 0 to N would look like birds on a power line, all having the same gap between them.

That means I expect N-1702 to equal the average of all the other gaps. The other gaps are 242-0=242, 412-242=170, 823-412=411, 1429-823=606, and 1702-1429=273, which gives an average of 340. Hence I estimate N to be 1702+340=2042. In short, this can be denoted by the following formula:

N = x + (x - k) / k

Here x is the biggest number observed (1702, in our case), and k is the number of samples (5, in our case). This formula is just a short form of calculating the average gap as we just did.
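Both estimates can be reproduced for our ticket sample in a few lines:

import numpy as np

sample = [242, 412, 823, 1429, 1702]
k, x = len(sample), max(sample)

estimate_mean = 2 * round(np.mean(sample)) - 1  # 2*m - 1 -> 1843
estimate_gap = x + (x - k) / k                  # -> 2041.4; the text above
print(estimate_mean, estimate_gap)              # rounds the average gap first
                                                # (340), which gives 2042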

Let’s do a simulation

We just saw two estimates of the total number of lottery tickets. First, we calculated 2*m-1, which gave us 1843, and then we used the more sophisticated approach of x + (x-k)/k and obtained 2042. I wonder which estimate is more accurate. Are my chances of winning the lottery 1/1843 or 1/2042?

To show some properties of the estimates we just used, I ran a simulation. I drew samples of different sizes k from a distribution where the highest number is 2000, and I repeated that a few hundred times for each k. Hence we would expect our estimates to also return 2000, at least on average. This is the outcome of the simulation:

Probability densities of the different estimates for varying k. Note that the ground truth N is 2000. Image by author.

What do we see here? On the x-axis, we see k, i.e. the number of samples we take. For each k, we see the distribution of the estimates based on a few hundred simulations for each of the two formulas we just got to know. The dark point indicates the mean value of the simulations, which is close to 2000 for every k. That is a very interesting point: both estimates converge to the correct value if they are repeated an infinite number of times.

However, besides the common average, the distributions differ quite a lot. We see that the formula 2*m-1 has higher variance, i.e. its estimates are far away from the real value more often than those of the other formula. The variance has a tendency to decrease with higher k, though. This decrease does not always hold perfectly, as this is just a simulation and is still subject to random influences. However, it is quite understandable and expected: the more samples I take, the more precise my estimate becomes. That is a very common property of statistical estimates.

We also see that for the first estimate the deviations are symmetrical, i.e. underestimating the real value is as likely as overestimating it. For the second approach, this symmetry does not hold: while most of the density is above the real value, there are more and larger outliers below. Why is that? Let’s retrace how we computed that estimate. We took the biggest number in our sample and added the average gap size to it. Naturally, the biggest number in our sample can only be as big as the biggest number in total (the N that we want to estimate). In that case, we add the average gap size to N, but we can’t get any higher than that with our estimate. In the other direction, the biggest number can be very low. If we are unlucky, we could draw the sample [1,2,3,4,5], in which case the biggest number in our sample (5) is very far away from the actual N. That is why larger deviations are possible when underestimating the real value than when overestimating it.

Which is better?

From what we just saw, which estimate is better now? Well, both give the correct value on average. However, the formula x + (x-k)/k has lower variance, and that is a big advantage. It means that you are closer to the real value more often. Let me demonstrate that. In the following, you see the probability density plots of the two estimates for a sample size of k=5.

Probability densities for the two estimates for k=5. The colored shape under the curves is covering the space from N=1750 to N=2250. Image by author.

I highlighted the point N=2000 (the real value of N) with a dotted line. First of all, we still see the symmetry we observed before: in the left plot, the density is distributed symmetrically around N=2000, but in the right plot, it is shifted to the right and has a longer tail to the left. Now let’s take a look at the grey area under each of the curves. In both cases, it goes from N=1750 to N=2250. However, in the left plot, this area accounts for 42% of the total area under the curve, while in the right plot, it accounts for 73%. In other words, in the left plot, you have a 42% chance that your estimate deviates no more than 250 points in either direction. In the right plot, that chance is 73%. That means you are much more likely to be that close to the real value. However, you are more likely to slightly overestimate than underestimate.

I can tell you that x + (x-k)/k is the so-called uniformly minimum variance unbiased estimator, i.e. among all unbiased estimators, it is the one with the smallest variance. You won’t find any unbiased estimate with lower variance, so this is the best you can use, in general.

Use cases

Make love, not war 💙 . Photo by Marco Xu on Unsplash

We just saw how to estimate the total number of elements in a pool if these elements are indicated by consecutive numbers. Formally, this is a discrete uniform distribution. This problem is commonly known as the German Tank Problem: in the Second World War, the Allies used this approach to estimate how many tanks the German forces had, just by using the serial numbers of the tanks they had destroyed or captured so far.

We can now think of more examples where we can use this approach. Some are:

  • You can estimate how many instances of a product have been produced if they are labeled with a running serial number.
  • You can estimate the number of users or customers if you are able to sample some of their IDs.
  • You can estimate how many students are (or have been) at your university if you sample students’ matriculation numbers (given that the university has not reused the first numbers after reaching the maximum already).

However, be aware that some requirements need to be fulfilled to use this approach. The most important one is that you indeed draw your samples randomly and independently of each other. If you ask your friends, who all enrolled in the same year, for their matriculation numbers, those won’t be evenly spaced over the whole range of matriculation numbers but will be quite clustered. Likewise, if you buy articles with running numbers from a store, you need to make sure that this store got these articles in a random fashion. If it was supplied with the products of numbers 1000 to 1050, you don’t draw randomly from the whole pool.

Conclusion

We just saw different ways of estimating the total number of instances in a pool under a discrete uniform distribution. Although both estimates give the same expected value in the long run, they differ in terms of their variance, with one being superior to the other. This is interesting because neither of the approaches is wrong or right. Both are backed by reasonable theoretical considerations and estimate the real population size correctly (in frequentist statistical terms).

I now know that my chance of winning the state fair lottery is estimated to be 1/2042 ≈ 0.049% (or 0.24% with the 5 tickets I bought). Maybe I should rather invest my money in cotton candy; that would be a safe win.

References & Literature

Mathematical background on the estimates discussed in this article can be found here:

  • Johnson, R. W. (1994). Estimating the size of a population. Teaching Statistics, 16(2), 50–52.

Also feel free to check out the Wikipedia articles on the German tank problem and related topics, which are quite explanatory:

This is the script to do the simulation and create the plots shown in the article:

import numpy as np
import random
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

if __name__ == "__main__":
    N = 2000
    n_simulations = 500

    # estimate 1: twice the (rounded) sample mean minus one
    estimate_1 = lambda sample: 2 * round(np.mean(sample)) - 1
    # estimate 2: sample maximum plus the average gap, x + (x - k) / k
    estimate_2 = lambda sample, k: round(max(sample) + ((max(sample) - k) / k))

    estimate_1_per_k, estimate_2_per_k = [], []
    k_range = range(2, 10)
    for k in k_range:
        # sample k ticket numbers from 1..N without duplicates:
        samples = [random.sample(range(1, N + 1), k) for _ in range(n_simulations)]
        estimate_1_per_k.append([estimate_1(sample) for sample in samples])
        estimate_2_per_k.append([estimate_2(sample, k) for sample in samples])

    fig, axs = plt.subplots(1, 2, sharey=True, sharex=True)
    axs[0].violinplot(estimate_1_per_k, positions=k_range, showextrema=True)
    axs[0].scatter(k_range, [np.mean(d) for d in estimate_1_per_k], color="purple")
    axs[1].violinplot(estimate_2_per_k, positions=k_range, showextrema=True)
    axs[1].scatter(k_range, [np.mean(d) for d in estimate_2_per_k], color="purple")

    axs[0].set_xlabel("k")
    axs[1].set_xlabel("k")
    axs[0].set_ylabel("Estimated N")
    axs[0].set_title(r"$2\times m-1$")
    axs[1].set_title(r"$x+\frac{x-k}{k}$")
    plt.show()

    plt.gcf().clf()
    k = 5
    idx = k_range.index(k)  # position of k=5 within k_range, not the raw value
    xs = np.linspace(500, 3500, 500)

    fig, axs = plt.subplots(1, 2, sharey=True)
    density_1 = gaussian_kde(estimate_1_per_k[idx])
    axs[0].plot(xs, density_1(xs))
    density_2 = gaussian_kde(estimate_2_per_k[idx])
    axs[1].plot(xs, density_2(xs))
    axs[0].vlines(2000, ymin=0, ymax=0.003, color="grey", linestyles="dotted")
    axs[1].vlines(2000, ymin=0, ymax=0.003, color="grey", linestyles="dotted")
    axs[0].set_ylim(0, 0.0025)

    # shade the area from N=1750 to N=2250 under each density curve
    a, b = 1750, 2250
    ix = np.linspace(a, b)
    verts = [(a, 0), *zip(ix, density_1(ix)), (b, 0)]
    poly = plt.Polygon(verts, facecolor='0.9', edgecolor='0.5')
    axs[0].add_patch(poly)
    print("Integral for estimate 1: ", density_1.integrate_box_1d(a, b))

    verts = [(a, 0), *zip(ix, density_2(ix)), (b, 0)]
    poly = plt.Polygon(verts, facecolor='0.9', edgecolor='0.5')
    axs[1].add_patch(poly)
    print("Integral for estimate 2: ", density_2.integrate_box_1d(a, b))

    axs[0].set_ylabel("Probability Density")
    axs[0].set_xlabel("N")
    axs[1].set_xlabel("N")
    axs[0].set_title(r"$2\times m-1$")
    axs[1].set_title(r"$x+\frac{x-k}{k}$")

    plt.show()

How Nightshade Works https://towardsdatascience.com/how-nightshade-works-b1ae14ae76c3/ Fri, 03 Nov 2023 17:08:03 +0000 Confusing image-generating AI with poisoned data

Just like the high walls of a castle, Nightshade can be a way to defend intellectual properties against illegitimate use. Photo by Nabih El Boustani on Unsplash

The recent emergence of Nightshade, an algorithm that makes it possible to create poisoned data for confusing image-generating AI models, has given new life to the discussion on adversarial attacks on such models. This discussion is also influenced by ethical and social considerations, as such attacks may provide a way for artists, content creators, and others to fight back if they feel treated unjustly by the fact that AI models use their content without permission, but they could also be used with bad intentions.

In this article, I want to explain the core concepts of Nightshade. To this end, I will first explain the general idea of data poisoning and highlight its shortcomings. Then I will introduce Nightshade, an algorithm that overcomes some of the disadvantages of the naive approach. In the end, I will briefly discuss some ethical considerations that come with its use.

Data poisoning

Poisonous or not? Photo by Fiona Smallwood on Unsplash

Let’s start with the idea of data poisoning in general. Say you want to influence image-generating AIs in such a way that they fail to generate certain kinds of images or are unable to understand certain prompts. Why would you want to do this? The most likely non-destructive reasons might be that you are an artist and want to prevent an image-generating model from generating images in your style, or that you have created a new comic character that should not be reproduced by an image-generating model without your permission.

So, what do you do? Let us start by understanding a basic concept of how generative AI learns. Of course, an image-generating AI depends on its training data. To be precise, it relies on the fact that there are images showing a certain concept (say, a dog) and that those images are associated with a text describing their content (e.g. an image caption like a cute dog with glasses). From that, it learns to extract the visual properties shared by images that also share certain keywords in their captions. That is, the model learns what dogs look like by learning the properties of all those images that mention dog in their caption.

Now, what would happen if you introduced images that show dogs, but whose captions always mention cats? In the end, dog and cat are just symbols for what can be seen in the images. If the images that show dogs are labeled as cats, the model will just learn that the symbol cat refers to what we would call a dog. Without any prior knowledge of the English language, how would the model know that the labels are wrong if they are so consistent? If you don’t speak German, and I showed you a hundred images of dogs and told you their label is Katze, you would assume that Katze is the German word for dog. You wouldn’t know that the actual German word for dog is Hund and that Katze means cat, because you just learned the correlation between the labels and the images’ properties.

The process just described is called data poisoning, stemming from the idea that you introduce data instances that have a malicious effect on the model’s training (just like a poison has a malicious effect on your health).

Naive poisoning attacks

A cute dog with glasses, thinking about how to attack an image-generating model. Photo by Jamie Street on Unsplash

As a naive approach, you could take the aforementioned idea and use it to confuse machine learning models like Stable Diffusion. Say you want to make Stable Diffusion create images of cats when prompted for dogs. For that, you would need to create many images of cats, label them as dogs, and upload them to the internet. Then you hope that those images are scraped for the next training of a Stable Diffusion model.

If many of your images become part of the next training run, that could indeed lead to confusion between cats and dogs. However, there are some drawbacks to that approach:

  • You will need many images. Since there are many other images of cats that are not poisoned, you need a large number of images to have an impact at all. If you provide only 10 poisoned images while there are 1000 non-poisoned images of cats, you have almost no influence on the training. Typically, you can expect to need to poison 20% or more of all images to have an effect.
  • Note that you do not know which images exactly will be part of the training. Hence, if you want to introduce 500 poisoned images into the training, you may have to create 5000 and spread them all over the internet, because only some of them may actually be scraped for training.
  • If you upload images of cats labeled as dogs, humans can easily detect that. Before your images are used for training, they may be filtered out by a quality gate (be it a human or a specialized AI).

Nightshade

The nightshade algorithm has its name from a very poisonous plant. Photo by Georg Eiermann on Unsplash

Now let’s take a look at Nightshade, an algorithm that aims at overcoming those disadvantages. To do so, Nightshade uses two key concepts: it creates images that have the maximum effect on the model (which reduces the number of images needed in total) and that are indistinguishable from non-poisoned images for humans.

First, how do we get the maximum effect out of the images? In theory, you would want to use those images that lead to the biggest change in the gradient during training. However, to find out which images those are, you would have to observe the training process, which you can’t do, in general. The authors of Nightshade propose a different solution: you take an image that has been generated by the model you want to poison. That is, if you want to have cat images labeled as dogs, you prompt the model with a simple prompt like an image of a cat. The image it creates will be a very typical representation of what the model understood a cat to be. If this image is seen in training, it will have a very high influence on the understanding of the concept cat (a much higher one than rather untypical images of cats would have). Hence, if you poison that image, you get a very large effect on the model’s training.

Second, we said that Nightshade’s images should be indistinguishable from non-poisoned images. To reach this goal, Nightshade takes natural images and applies a perturbation (i.e. a small change in the pixels’ values) until the image is perceived differently by the model. Continuing with our dog vs. cat example from above, we would take an image generated by the model that shows a cat. This image we refer to as the anchor image, or xᵃ in the upcoming formulas. Next, we take a very typical image of a dog, which we refer to as xₜ. To this image xₜ, we now add the perturbation δ s.t. it optimizes the following objective:

min over δ: Dist(F(xₜ + δ), F(xᵃ)), subject to |δ| < p

where F() is the image feature extractor used by the model, Dist is a distance function, and p is an upper bound on δ, to avoid the image changing too much. That means we want to find a δ such that the distance between the features of the perturbed dog image (F(xₜ + δ)) and those of the anchor image (showing the cat, F(xᵃ)) is as small as possible. In other words, we want the two images to look alike from the model’s perspective. Be aware that F(x), the result of the feature extractor, is how the model sees the image in feature space, which is different from how you see the image (in pixel space, if you want).
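A conceptual sketch of how such a perturbation could be optimized with gradient descent; feature_extractor stands for the model’s encoder F, the distance is a simple mean squared error, and the budget handling via clamping is a simplification of the paper’s perceptual constraint:

import torch

def poison(x_target, x_anchor, feature_extractor, p=0.05, steps=200, lr=0.01):
    # delta is the perturbation added to the (dog) image x_target
    delta = torch.zeros_like(x_target, requires_grad=True)
    f_anchor = feature_extractor(x_anchor).detach()  # features of the (cat) anchor
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # minimize the feature distance Dist(F(x_t + delta), F(x_a))
        loss = (feature_extractor(x_target + delta) - f_anchor).pow(2).mean()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-p, p)  # keep the change invisible to humans
    return (x_target + delta).detach()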

In the following images, you won’t be able to spot any difference between the original and the poisoned images (at least I can’t). However, in feature space, they differ a lot. The features of the poisoned dog image, for example, are very close to the features of a cat image, and hence to the model it almost looks like a cat.

Two examples of poisoned images. The images in the lower line are perturbated versions of the upper ones. Although no difference can be seen by a human, the original and poisoned images look very different from a model’s perspective. Image taken from the Nightshade paper[1].

With this technique, we are able to generate images that have a very big effect on the model’s training and that can’t be detected as being poisoned. If you uploaded these images to the internet, no human would be suspicious at all, and hence it is very unlikely that they would be filtered out by any quality gate. In addition, since they are so powerful, you don’t need to poison 20% of all dog images in the training data, as you would with the naive approach. With Nightshade, 50 to 100 images are typically enough to ruin a model’s performance on a particular concept.

Generalizability

Beyond the points we just saw, Nightshade has another interesting advantage, which is its ability to generalize in multiple ways.

First of all, poisoning a certain keyword also influences concepts that are related in a linguistic or semantic fashion. E.g. poisoning images of the concept dog also influences keywords like puppy or husky, which are related to dog. In the following examples, the concept dog has been poisoned, and this also impedes the generation of puppies and huskies.

An example of how poisoning a concept (dog) also impedes the generation of related concepts (puppy, husky, wolf). Image taken from the Nightshade paper[1].

Likewise, poisoning a concept such as fantasy also influences concepts that are related semantically, but leaves other concepts unaffected, as can be seen in the following example. As you see, a concept like dragon, which is close to the poisoned fantasy, is affected, while a concept like chair is not.

An example of how poisoning a concept (fantasy) also impedes related concepts (e.g. dragon). Note that non-related concepts (e.g. chair) are not affected. Image taken from the Nightshade paper[1].

In addition, when poisoning multiple concepts, the ability to generate images may break down entirely. In the following example, 100, 250, or 500 concepts have been poisoned. The more concepts are poisoned, the more heavily the generation of other concepts that have not been poisoned at all (like person or painting in the example) is impeded as well.

An example of how poisoning many concepts impedes the ability to generate images in general. Note that the concepts person, painting, and seashell have not been poisoned specifically. Image taken from the Nightshade paper[1].

In addition to that, Nightshade’s effects also generalize over different target models. Remember that we used the model we wanted to attack to generate the anchor images that helped us construct our poisoned images. The idea behind that was that those images are very prototypical and hence have a strong influence on the training. We also needed access to the feature extractor to optimize the perturbation. Naturally, Nightshade’s influence is strongest if the anchor images are generated by the model that is to be attacked and if that model’s feature extractor can be used for the optimization. However, even if the anchor images and the feature extractor come from another model, the poisoning works quite well. That is, you could generate your poisoned images with the help of, say, Stable Diffusion 2, even if you want to attack Stable Diffusion XL. This may be of interest if you don’t have access to the model you actually want to attack.

Ethical Concerns

So far, I have introduced Nightshade as a method that content creators can use to defend their intellectual property against illegitimate use. However, as there are always two sides to a coin, data poisoning can also be used in a harmful manner, be it on purpose or not. Needless to say, data poisoning can be used to deliberately disturb generative AI models, cause financial damage to their creators, and impede scientific research. An AI company poisoning the training data of its competitors to give its own model an advantage is only one of countless examples of malign usage. However, even if you just want to defend your own content, we just saw that poisoning many concepts impedes the AI’s ability to generate images altogether. Hence, if many people make use of Nightshade, this may destroy image-generating AIs even for those concepts that would be legitimate to use. So even with the intention to protect their own content, a content creator using Nightshade may cause unwanted damage. Whether and to what extent such collateral damage has to be accepted is a question for a lively open debate.

On top of that, as you can imagine, attacking the capabilities of generative AI is a battle of constant ups and downs. Whenever new attacks are invented, the other side comes up with new defense mechanisms. Although the authors claim that Nightshade is quite robust against common defense mechanisms (e.g. specialized classifiers that detect images as being poisoned), it may only be a matter of time until new defenses are discovered that counteract Nightshade. From that perspective, Nightshade may allow creators to protect their content for the time being, but may become outdated sooner or later.

Summary

As we just saw, Nightshade is an algorithm for creating poisoned datasets that goes beyond the naive approach of labeling data incorrectly. It creates images that humans cannot detect as poisoned and that can heavily influence an image-generating AI even in small numbers. This drastically increases the chance of the poisoned images becoming part of the training data and having an effect there. Moreover, it generalizes in a number of ways, which makes the attacks more powerful and harder to defend against. With that, Nightshade provides a new way of fighting back against the use of content for model training without the creator’s permission, but it also has the potential for destructive use and hence calls for a debate on its ethical implications. Used with noble intentions, though, Nightshade can help defend intellectual property such as an artist’s style or inventions.


Sources

That is the original paper introducing Nightshade:

  • [1] Shan, S., Ding, W., Passananti, J., Zheng, H., & Zhao, B. Y. (2023). Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models. arXiv preprint arXiv:2310.13828.

The Capture-ReCapture Method
Estimating a population size without counting it
https://towardsdatascience.com/the-capture-recapture-method-dc9985b928da/ (Wed, 13 Sep 2023)

When you capture your individuals, make sure not to harm them, as you will have to set them free again later. Photo by Anne Nygård on Unsplash

In this article, I want to present the Capture-ReCapture method, a statistical method to estimate the size of a population without counting it fully. Although it originates from biology, the procedure can also be applied to many other fields and scenarios that are of interest to data scientists and related professions.

I will first demonstrate the procedure on a biological example before I talk about its statistical background and the properties that allow for its usage. After that, I will present examples from different domains to demonstrate the method’s capabilities in different scenarios.

How many snails are in my garden?

Many people don’t like snails, but I still think they are adorable. Let’s count them without hurting them. Photo by Krzysztof Niewolny on Unsplash

Say I want to know how many snails are living in my garden. I could try to count all of them, but how will I know when I’m finished with that? Even if I don’t find any more snails, I can never be sure that there are none left. Instead, there is a different method I can use.

On the first day, I dedicate half an hour to collecting and counting snails. Additionally, I mark each one with a dot of paint before I release it into my garden again. Say I have collected 21 snails. Can I already give an estimate of the total number of snails in my garden? No, not yet (besides the fact that there must be at least 21 snails), but I am not finished.

A day later, I go to my garden again and count snails for half an hour. Some of the snails I find that day already have a dot of paint on their shell (i.e. I already found them yesterday), while others don’t. Say I count 28 snails that day, 9 of which are already marked with a dot of paint. Now I can give an estimate of the total number of snails. Let’s do the math.

On the second day, a proportion of 9/28 of the snails I caught were ones I had already found the day before. That ratio should be equal to the ratio of the snails I found on the first day over the total number of snails, i.e. 21/N = 9/28, where N is the total number of snails. I can rearrange that to get the number of snails as N = (21*28)/9 ≈ 65.

Why is that? On the second day, a certain ratio of individuals (say p%) has a certain property (namely, being marked). If I draw a random sample from the population, I expect that p% of my sample has that property as well. That is very intuitive: if you randomly sample from the population of your city, you would also expect the ratio of genders in your sample to reflect the ratio of genders in the whole population, right? On the second day, we know this ratio p, which we didn’t know on the first day: when painting the snails, we didn’t know which fraction of all snails we had just caught, but now we know it was p% of them. It is easy to derive the total number of snails from that: if I painted 21 snails, and I now know that this is 9/28 ≈ 32% of the population, there are roughly 65 snails in total (with 21 being roughly 32% of 65).
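This simple estimator is known as the Lincoln-Petersen estimator. As a quick sanity check, here is a minimal sketch of the computation in Python (the function name is my own, not part of any library):

def capture_recapture_estimate(n_marked, n_caught_second, n_overlap):
    """Lincoln-Petersen estimate of the total population size.

    n_marked:        individuals caught and marked on the first occasion
    n_caught_second: individuals caught on the second occasion
    n_overlap:       individuals of the second catch that were already marked
    """
    return (n_marked * n_caught_second) / n_overlap

# The snail example: 21 marked on day one, 28 caught on day two, 9 of them marked
print(capture_recapture_estimate(21, 28, 9))  # ~65.3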

Conditions for recapturing

Before using the Capture-ReCapture method, make sure that the required conditions are fulfilled. Photo by Sung Jin Cho on Unsplash

Besides counting the number of snails in your garden, there are many other scenarios where you can apply the aforementioned procedure. As you can imagine, the interval between the two sampling steps doesn’t have to be a day, and the marking can also be done in other ways than literally painting individuals. You may, for instance, just keep a list of the individuals you have drawn in the first round, as long as you can easily determine whether an individual you find in the second iteration is already on the list. However, for the Capture-ReCapture method to be applicable, the following conditions have to be fulfilled (the simulation sketch after the list illustrates the procedure under these conditions):

  • At both points of data collection, the population has to be the same. In particular, this demands that no individuals are added or removed between the two points in time.
  • At both points of data collection, one has to draw randomly and independently from the population, i.e. each individual must have the same likelihood of being caught. In particular, being marked or not should not change the likelihood of being drawn on the other occasion.
  • The number of individuals drawn on each occasion has to be large enough to create a meaningful overlap. You can easily imagine that randomly sampling 100 books each from your local library, where the number of books is in the millions, creates no overlap at all and hence doesn’t help your estimation.
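To see the estimator at work under these conditions, here is a small simulation sketch. The true population size and the seed are assumptions chosen for illustration; the other numbers match the snail example:

import numpy as np

rng = np.random.default_rng(42)
true_population = 65   # assumed "true" number of snails
n_marked = 21          # marked on the first day
n_second = 28          # caught on the second day

estimates = []
for _ in range(10_000):
    # Draw the second sample uniformly at random from the whole population;
    # individuals 0..20 represent the ones marked on day one.
    sample = rng.choice(true_population, size=n_second, replace=False)
    n_overlap = np.sum(sample < n_marked)
    if n_overlap > 0:  # skip unlucky draws with no overlap to avoid division by zero
        estimates.append(n_marked * n_second / n_overlap)

# Typically lands close to (slightly above) the true 65; the simple
# estimator is known to be a little biased upward for small samples.
print(np.mean(estimates))

If one of the conditions is violated, e.g. because marked snails hide and are less likely to be caught again, the overlap shrinks and the estimate becomes systematically too large.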

Example Use cases

Spoiler: Medicine is a domain where variants of the Capture-ReCapture method are used a lot. Photo by Ksenia Yakovleva on Unsplash

Now that we have understood the Capture-ReCapture method, let’s take a look at some examples of where to use it. It comes in handy whenever we want to determine the size of a population without being able to count it fully. However, different scenarios may come with different pitfalls regarding the method’s prerequisites that have to be taken into consideration.

Counting the number of guests at a party

At the next party you attend, you can take five minutes to mark some individuals (either by literally marking them or by keeping a list of them), and some minutes later you draw random individuals again. However, make sure that you really draw randomly and independently. That is, you should catch people from all over the place and avoid being biased towards people you know or don’t know. Also, make sure that the interval between the two points of data collection is not too big; otherwise, your estimate might be biased by people having left the party in the meantime.

Capturing from two independent lists

A variant of the Capture-ReCapture method doesn’t recapture at a different point in time but uses two independent data sources (drawn from the same population) and their overlap. In this form, the method is often used in medical scenarios, so let’s take a look at an example where we estimate the prevalence of a disease.

Say I have a list of patients from a hospital listing 142 persons with a certain disease, and another list from the National Health Service listing 442 persons with that disease. Say 71 people appear on both lists. Then we can use the formula from above and obtain (142*442)/71 = 884. That is, 884 people are estimated to suffer from the disease.
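In code, this is the very same computation as before, with the two lists playing the roles of the two capture occasions (a minimal sketch with the numbers from the example):

n_hospital = 142  # patients on the hospital list
n_nhs = 442       # patients on the National Health Service list
n_both = 71       # patients appearing on both lists

print((n_hospital * n_nhs) / n_both)  # 884.0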

Most important for this variant is that the two lists are indeed independent, i.e. the likelihood of an individual being on one list should not depend on whether that individual is on the other list.

Estimate the number of potential customers

Say you have a website to sell your breathtaking new product. On one day, you capture all visitors to your website (e.g. by tracking their IP addresses), and you do the same a few days later. With the overlap between the two days, you can estimate the number of potential customers for your product. However, you should be aware that this scenario may easily violate an important assumption, namely that the draws on both captures are independent. In particular, one could argue that visiting the website on day one increases the likelihood of visiting it again.

Summary

We have now seen some examples of the Capture-ReCapture method, which allows us to estimate the size of a population without counting it fully. Instead of counting each individual in the population, the method requires drawing two independent samples of the population (either at different points in time or from different sources) and using their overlap to estimate the population size. This can be used in a variety of domains whenever a full observation of the population is not feasible.


Further reading

I adapted the example of counting snails in the garden from the following book:

  • Kit Yates (2019). The Math of Life and Death. Why Math Is (Almost) Everything. Quercus Editions Ltd, London.

An overview of the Capture-ReCapture method used in medical domains can be found here:

  • Ramos, P. L., Sousa, I., Santana, R., Morgan, W. H., Gordon, K., Crewe, J., … & Macedo, A. F. (2020). A review of capture-recapture methods and its possibilities in ophthalmology and vision sciences. Ophthalmic Epidemiology, 27(4), 310–324.


Multilevel Regression Models and Simpson’s paradox
Avoiding false conclusions with the proper tooling
https://towardsdatascience.com/multilevel-regression-models-and-simpsons-paradox-acb9820e836d/ (Tue, 08 Aug 2023)

Data Analysis influences our conclusions, but we should use the proper tooling to take the right path. Photo by Brendan Church on Unsplash

Data Analysis is – as indicated by that profession’s name – an integral part of a Data Scientist’s work, ranging from descriptive statistics and simple regression models to sophisticated machine learning approaches. However, those approaches have to be handled with care, and selecting the right one is far from trivial. Complex data often includes hidden structures that, if not considered appropriately, can lead to fallacies that end in invalid conclusions.

In this article, I want to give an example of Simpson’s paradox and show how a simple yet shortsighted analysis can lead to false conclusions that appear to be backed by the data, although they are nothing more than a misinterpretation. In doing so, I demonstrate the use of multilevel regression models as an appropriate way of analyzing hierarchical, i.e. nested, data.

The problem

Let’s start already! Say we have published a smartphone app and want to learn more about our users and how satisfied they are. Hence we perform a small survey and ask some of our users to tell us how satisfied they are with our app on a scale from 1 (very unhappy) to 4 (very happy). Additionally, we measured how much time they spent in our app in the last week, and to get a rich sample, we asked users in different countries. Then our data may look like this (I am using generated data for this article):

   Satisfaction  Time_spent  Country
0      2.140440    1.585295        0
1      2.053545    0.636235        0
2      1.589258    1.468033        1
3      1.853545    0.968651        2
4      1.449286    0.967104        2
.      .            .              .
.      .            .              .
.      .            .              .

We are interested in the relationship between the time spent in our app and the reported satisfaction. To be precise, we want to know whether spending more time in our app is associated with more satisfaction or less, and we want to quantify that association, i.e. we want to make a statement like "spending one hour more in our app is associated with x points more/less satisfaction". When we look at the data, we might already have a first intuition that more time spent in the app is correlated with lower satisfaction:

Linear regression

Let’s do a Linear Regression to see if we are right. With linear regression, we try to predict satisfaction given the time spent as a linear function of the form satisfaction = intercept + regression_coefficient * time_spent. We can easily do that with the statsmodels package using the OLS (Ordinary Least Squares) function.

import statsmodels.api as sm

# Fit an ordinary least squares model: Satisfaction ~ intercept + Time_spent
result = sm.OLS(df["Satisfaction"], sm.add_constant(df["Time_spent"])).fit()
print(result.params)

The add_constant method is just a technical detail we use to tell the model that we want an intercept in our equation (which is required as long as our data is not standardized). The result.params gives us two values, namely the intercept (const) and the regression coefficient for the variable Time_spent.

const         3.229412
Time_spent   -0.655470

That is, our model tells us that satisfaction can be predicted as 3.229 - 0.655 * time_spent. In other words, one hour more spent in the app leads to a decrease of 0.655 points in satisfaction (because of the minus sign). However, one doesn’t start from zero: the mean satisfaction of a person just from their first impression (i.e. time_spent = 0) is 3.229. We can also make that visible as a line with intercept 3.229 and slope -0.655:
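If you want to reproduce such a plot yourself, a minimal sketch could look like this (assuming the df and result objects from the code above, and matplotlib being installed):

import matplotlib.pyplot as plt
import numpy as np

# Scatter the raw data and draw the fitted regression line on top
intercept, slope = result.params["const"], result.params["Time_spent"]
x_line = np.linspace(df["Time_spent"].min(), df["Time_spent"].max(), 100)

plt.scatter(df["Time_spent"], df["Satisfaction"])
plt.plot(x_line, intercept + slope * x_line, color="black")
plt.xlabel("Time_spent")
plt.ylabel("Satisfaction")
plt.show()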

Of course, this prediction is not perfect, but at least it gives us a trend. Okay, so the case is clear, right? Spending more time in the app leads to a decrease in satisfaction, and we can even quantify that decrease. We could now draw our conclusions from that and think about how to improve the app (we want our users to be more satisfied as they use the app more, of course), or do a more detailed survey to find out why the users are not satisfied.

Well, not so fast!

Grouping per country

Remember, that we collected data from users in different countries? What happens, if we take a look at the data separated by country? In the following plot, we see the very same data points as before, but now we highlighted each country in a different color.

There are two observations we can make from that plot. First, the countries seem to differ in their satisfaction and in the time spent in the app. On average, subjects from the blue country spend more time in the app but are less satisfied than subjects from the other countries. Even more, when we look at each of the three countries separately, we might think that the association between time spent in the app and satisfaction is in fact positive. Isn’t that contradicting our previous analysis?

Simpson’s paradox

Well, it is actually not named after those Simpsons… Photo by Stefan Grage on Unsplash

The effect we just saw is called Simpson’s paradox. It occurs when a correlation in the data differs across groups vs. within groups. While very counterintuitive, this can indeed happen (as we just saw), and the reason for it is confounding variables. Let’s explain that with our example above. When looking at each country in isolation, we see a positive trend: more time spent in the app is associated with higher satisfaction. However, as we already saw, the countries differ in their mean satisfaction and time spent in the app. In the blue country, the mean satisfaction is lower but the mean time spent in the app is higher than in the orange or green country; a trend that opposes the trend within the countries. However, there may be another variable causing this. E.g. one could imagine that in the blue country, people are bored more often, leading to less satisfaction in general (and hence a less positive mood towards our app) but more time to spend in the app. Of course, that is just one possible explanation, and there can be many others. However, the correct explanation doesn’t matter too much at the moment. For us, it is important to understand that there are systematic differences between the countries.

So, why didn’t we find that out in our previous analysis? Did we make a mistake when performing the linear regression? Well, yes: it was wrong to perform a linear regression at all, because one of its core assumptions was violated. Linear regression assumes that all data points are sampled independently and from the same distribution. That is not the case in our example, though! Obviously, the distributions of time spent in the app and satisfaction differ across the countries. Now, if the assumptions of linear regression are violated, linear regression is not the right tool for data analysis.

Hierarchical models

What can we do now to analyze our data in a more appropriate way? Luckily, statistical models are available that extend the idea of linear regression to hierarchical data. We speak of hierarchical data if the data points we sampled are nested in a hierarchical structure, like in our case, where the people we asked are nested in the countries. Those statistical models are called hierarchical linear models, multilevel models, or linear mixed effects models. Such models account for the group structure by introducing so-called fixed effects and random effects. In a simple example, where we want to predict one variable given a single other variable (as we want to predict satisfaction given the time spent in the app), the fixed effects consist of one intercept and one slope for all groups together. So far, that is exactly the same as in linear regression.

Now the random effects can introduce a deviation from that intercept for each group separately. For example, the intercept for the blue country may be a little lower and the intercept for the green country may be a little higher than the fixed intercept. That would account for the differences in the countries’ mean levels of satisfaction.

Additionally, the random effects can introduce a deviation of the slope for each group. For example, in the orange group, the slope may be higher than the fixed slope (i.e. the association between satisfaction and time spent is stronger), and in the green country, it may be lower.
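Putting both together, the model we are about to fit can be sketched in the notation of our example (with i indexing users and j indexing countries) as:

satisfaction[i,j] = (fixed_intercept + random_intercept[j])
                    + (fixed_slope + random_slope[j]) * time_spent[i,j]
                    + error[i,j]

The fixed parts are shared across all countries, while the random parts shift the intercept and the slope for each country j individually; the random effects reported by the model below are exactly these per-country shifts.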

Hierarchical models in action

Let’s see that in action to understand what really happens. We conduct a new analysis, but now we use statsmodels’ mixedlm function. We clarify that we want to predict satisfaction given the time spent (and not vice versa) by the formula "Satisfaction ~ Time_spent", and we indicate that the "Country" column of our dataframe determines the different groups. Additionally, the parameter re_formula="Time_spent" tells the model that we want a separate slope for each group. Without it, the random effects would include a group-specific intercept only, but not a group-specific slope.

import statsmodels.formula.api as smf

# Mixed effects model: fixed intercept and slope, plus a random intercept
# and a random slope for Time_spent per country
result = smf.mixedlm("Satisfaction ~ Time_spent", data=df, groups=df["Country"], re_formula="Time_spent").fit()
print(result.fe_params)
print(result.random_effects)

If we print the fixed effects (fe_params) and the random_effects, we get values like these:


Fixed effects
  Intercept     2.286638
  Time_spent    0.497657
Random Effects
  {0: Group -0.958805, Time_spent -0.018178,
   1: Group 0.155233,  Time_spent 0.274222,
   2: Group 0.803572,  Time_spent -0.256044}

So, what does that mean? For the fixed effects, we have one value for the intercept and one value for our variable Time_spent. For the random effects, however, we have two values per country (0, 1, 2): one for the intercept (Group) and one for the slope of our variable (Time_spent). As we saw above, the random effects describe each group’s deviation from the mean effects. For our three groups, we can construct three different linear equations by adding the random effects for the intercept and the slope to the respective fixed effects.

satisfaction_0 = (2.286638 - 0.958805) + (0.497657 - 0.018178) * time_spent = 1.327833 + 0.479479 * time_spent
satisfaction_1 = (2.286638 + 0.155233) + (0.497657 + 0.274222) * time_spent = 2.441871 + 0.771879 * time_spent
satisfaction_2 = (2.286638 + 0.803572) + (0.497657 - 0.256044) * time_spent = 3.090210 + 0.241613 * time_spent
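Instead of typing these equations by hand, we can also construct them programmatically from the fitted model (a sketch assuming the result object from above):

# Combine fixed and random effects into one regression equation per country
for country, re in result.random_effects.items():
    intercept = result.fe_params["Intercept"] + re["Group"]
    slope = result.fe_params["Time_spent"] + re["Time_spent"]
    print(f"Country {country}: satisfaction = {intercept:.6f} + {slope:.6f} * time_spent")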

We see that the random intercept for group 0 is negative (-0.958) and the random intercept for group 2 is positive (0.803), so group 0 is below the fixed intercept, and group 2 is above. Consequently, group 0 has the lowest intercept in its linear function (1.327) and group 2 has the highest (3.090). In other words, in country 0, satisfaction starts at a lower level than in country 2.

We also see that the slopes differ between the groups. In group 1, the slope is highest at 0.771, while for group 2 it is only 0.241. That means the association between satisfaction and time spent in the app is much stronger in country 1 than in country 2. In other words, in country 1, an increase of one hour of time spent in the app leads to 0.771 points more satisfaction (on average), while in country 2 it only adds another 0.241 points. In addition, all slopes are positive, which we expected from the plot above, but which contradicts the negative slope of the linear regression we did at the beginning.

We can plot one regression line for each country now:
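A minimal sketch for producing this plot, building on the loop above (again assuming df, the fitted result, and matplotlib):

import matplotlib.pyplot as plt
import numpy as np

x_line = np.linspace(df["Time_spent"].min(), df["Time_spent"].max(), 100)
for country, re in result.random_effects.items():
    intercept = result.fe_params["Intercept"] + re["Group"]
    slope = result.fe_params["Time_spent"] + re["Time_spent"]
    # Draw the country's regression line and its data points in the same color
    line, = plt.plot(x_line, intercept + slope * x_line, label=f"Country {country}")
    subset = df[df["Country"] == country]
    plt.scatter(subset["Time_spent"], subset["Satisfaction"], color=line.get_color())

plt.xlabel("Time_spent")
plt.ylabel("Satisfaction")
plt.legend()
plt.show()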

Now we clearly see the positive trend in each country and the different intercepts (i.e. the positions where the lines would be at time_spent = 0).

Conclusion

In the above example, we saw how a short-sighted analysis can easily lead us to false conclusions. Ignoring the nested structure of the data, i.e. users coming from different countries, we could have easily stopped after the linear regression and concluded that more time spent in the app was associated with lower satisfaction. Only by understanding that our data does not fulfill the core assumptions of linear regression, as the data points are not sampled from the same distribution, were we motivated to do further analyses, which revealed that the opposite is in fact the case: more time spent in the app is associated with higher satisfaction.

So, let’s formulate some takeaways from this example:

  • Before using a statistical method for analysis, its assumptions should be validated with the data.
  • For nested data, the assumption that all data points are sampled from the same distribution may not hold.
  • It can happen, that a trend in the data overall is different from a trend inside single groups that form that data altogether. This is called Simpson’s paradox.
  • Multilevel linear models are one way to cope with nested data structures and avoid false conclusions from Simpson’s paradox.

Further reading

We used statsmodels’ mixedlm function as the implementation of hierarchical models; see the statsmodels documentation for details.

I used the following statistics textbook (which is only available in German, unfortunately).

  • Eid, M., Gollwitzer, M., & Schmitt, M. (2017). Statistik und Forschungsmethoden.

If you want to reproduce the results, this is how the data was generated:

import numpy as np
import pandas as pd

# Note: no random seed is set, so the exact numbers will vary between runs.
# Country 0: high time spent, low satisfaction
group_1_x = np.random.uniform(0.5, 1.8, 25)
group_1_y = (1 + 0.3 * group_1_x) + np.random.rand(len(group_1_x))

# Country 1: medium time spent, steeper positive slope
group_2_x = np.random.uniform(0.3, 1.3, 22)
group_2_y = (2 + 0.7 * group_2_x) + np.random.rand(len(group_2_x))

# Country 2: low time spent, high satisfaction
group_3_x = np.random.uniform(0, 1, 32)
group_3_y = (2.5 + 0.3 * group_3_x) + np.random.rand(len(group_3_x))

all_x = np.concatenate([group_1_x, group_2_x, group_3_x])
all_y = np.concatenate([group_1_y, group_2_y, group_3_y])
df = pd.DataFrame({"Satisfaction": all_y, "Time_spent": all_x, "Country": [0]*len(group_1_x) + [1]*len(group_2_x) + [2]*len(group_3_x)})
