The Dangers of Deceptive Data–Confusing Charts and Misleading Headlines

A deep dive into the ways data can be used to misinform the masses

“You don’t have to be an expert to deceive someone, though you might need some expertise to reliably recognize when you are being deceived.”

When my co-instructor and I start our quarterly lesson on deceptive visualizations for the data visualization course we teach at the University of Washington, he emphasizes the point above to our students. With modern technology, producing pretty and convincing claims about data is easier than ever. Anyone can make something that seems passable but contains oversights that render it inaccurate or even harmful. There are also malicious actors who actively want to deceive you, and who have studied some of the best ways to do it.

I often start this lecture with a bit of a quip, looking seriously at my students and asking two questions:

  1. “Is it a good thing if someone is gaslighting you?”
  2. After the general murmur of confusion followed by agreement that gaslighting is indeed bad, I ask the second question: “What’s the best way to ensure no one ever gaslights you?”

The students generally ponder that second question for a bit longer, before chuckling a bit and realizing the answer: It’s to learn how people gaslight in the first place. Not so you can take advantage of others, but so you can prevent others from taking advantage of you.

The same applies in the realm of misinformation and disinformation. People who want to mislead with data are empowered with a host of tools, from high-speed internet to social media to, most recently, generative AI and large language models. To protect yourself from being misled, you need to learn their tricks.

In this article, I’ve taken the key ideas from my data visualization course’s unit on deception–drawn from Alberto Cairo’s excellent book How Charts Lie–and broadened them into some general principles about deception and data. My hope is that you read it, internalize it, and take it with you to arm yourself against the onslaught of lies perpetuated by ill-intentioned people powered with data.

Humans Cannot Interpret Area

At least, not as well as we interpret other visual cues. Let’s illustrate this with an example. Say we have an extremely simple numerical data set; it’s one dimensional and consists of just two values: 50 and 100. One way to represent this visually is via the length of bars, as follows:

This is true to the underlying data. Length is a one-dimensional quantity, and we have doubled it in order to indicate a doubling of value. But what happens if we want to represent the same data with circles? Well, circles aren’t really defined by a length or width. One option is to double the radius:

Hmm. The first circle has a radius of 100 pixels, and the second has a radius of 50 pixels–so this is technically correct if we wanted to double the radius. However, because of the way that area is calculated (πr²), we've far more than doubled the area – we've quadrupled it, since π(100)² / π(50)² = 4. So what if we tried doubling the area itself, since that seems more visually accurate? (To double the area, the radius only needs to grow by a factor of √2 ≈ 1.41.) Here is a revised version:

Now we have a different problem. The larger circle is mathematically twice the area of the smaller one, but it no longer looks that way. In other words, even though it is a visually accurate comparison of a doubled quantity, human eyes have difficulty perceiving it.

The issue here is trying to use area as a visual marker in the first place. It’s not necessarily wrong, but it is confusing. We’re increasing a one-dimensional value, but area is a two-dimensional quantity. To the human eye, it’s always going to be difficult to interpret accurately, especially when compared with a more natural visual representation like bars.

Now, this may seem like it’s not a huge deal–but let’s take a look at what happens when you extend this to an actual data set. Below, I’ve pasted two images of charts I made in Altair (a Python-based visualization package). Each chart shows the maximum temperature (in Celsius) during the first week of 2012 in Seattle, USA. The first one uses bar lengths to make the comparison, and the second uses circle areas.
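For the curious, charts along these lines take only a few lines of Altair. The temperatures below are illustrative stand-ins, not the actual 2012 readings:

import altair as alt
import pandas as pd

# Stand-in values for the first week of 2012 in Seattle
# (hypothetical, not the real recorded temperatures)
weather = pd.DataFrame({
    "date": pd.date_range("2012-01-01", periods=7),
    "max_temp": [7, 11, 12, 12, 9, 4, 7],
})

# Comparison via bar length (a one-dimensional visual cue)
bars = alt.Chart(weather).mark_bar().encode(x="date:T", y="max_temp:Q")

# Comparison via circle area (a two-dimensional visual cue)
circles = alt.Chart(weather).mark_circle().encode(x="date:T", size="max_temp:Q")

bars.save("bars.html")
circles.save("circles.html")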

Which one makes it easier to see the differences? The legend helps in the second one, but if we’re being honest, it’s a lost cause. It is much easier to make precise comparisons with the bars, even in a setting where we have such limited data.

Remember that the point of a visualization is to clarify data–to make hidden trends easier to see for the average person. To achieve this goal, it's best to use visual cues that make those comparisons easy to draw.

Beware Political Headlines (In Any Direction)

There is a small trick question I sometimes ask my students on a homework assignment around the fourth week of class. The assignment mostly involves generating visualizations in Python–but for the last question, I give them a chart I myself generated accompanied by a single question:

Question: There is one thing egregiously wrong with the chart above, an unforgivable error in data visualization. What is it?

Most think it has something to do with the axes, marks, or some other visual aspect, often suggesting improvements like filling in the circles or making the axis labels more informative. Those are fine suggestions, but not the most pressing.

The most egregious flaw in the chart above is actually an absence: the missing title. A title is crucial to an effective data visualization. Without one, how are we supposed to know what the visualization is even about? As it stands, we can only ascertain that it vaguely has something to do with carbon dioxide levels across a span of years. That isn't much.

Many folks, feeling this requirement is too stringent, argue that a visualization is often meant to be understood in context, as part of a larger article or press release or other accompanying piece of text. Unfortunately, this line of thinking is far too idealistic; in reality, a visualization must stand alone, because it will often be the only thing people look at–and in social media blow-up cases, the only thing that gets shared widely. As a result, it should have a title to explain itself.

Of course, the title of this very subsection tells you to be wary of such headlines. That is true. While they are necessary, they are a double-edged sword. Since visualization designers know viewers will pay attention to the title, ill-meaning ones can also use it to sway people in less-than-accurate directions. Let’s look at an example:

The above is a picture shared by the White House’s public Twitter account in 2017. The picture is also referenced by Alberto Cairo in his book, which emphasizes many of the points I will now make.

First things first. The term "chain migration," referring to what is formally known as family-based migration (where an immigrant may sponsor family members to come to the United States), has been criticized by many who argue that it is needlessly aggressive and makes legal immigrants sound threatening for no reason.

Of course, politics is by its very nature divisive, and it is possible for any side to make a heated argument. The primary issue here is actually a data-related one–specifically, what the use of the word “chain” implies in the context of the chart shared with the tweet. “Chain” migration seems to indicate that people can immigrate one after the other, in a seemingly endless stream, uninhibited and unperturbed by the distance of family relations. The reality, of course, is that a single immigrant can mostly just sponsor immediate family members, and even that takes quite a bit of time. But when one reads the phrase “chain migration” and then immediately looks at a seemingly sensible chart depicting it, it is easy to believe that an individual can in fact spawn additional immigrants at a base-3 exponential growth rate.

That is the issue with any kind of political headline–it makes it far too easy to conceal dishonest, inaccurate workings behind what looks like genuine data processing, analysis, and visualization.

There is no data underlying the chart above. None. Zero. It is completely random, and that is not okay for a chart that is purposefully made to appear as if it is showing something meaningful and quantitative.

As a fun little rabbit hole to go down which highlights the dangers of political headlining within data, here is a link to FloorCharts, a Twitter account that posts the most absurd graphics shown on the U.S. Congress floor.

Don’t Use 3D. Please.

I’ll end this article on a slightly lighter topic–but still an important one. Under no circumstances–none at all–should you ever utilize a 3D chart. And if you’re in the shoes of the viewer–that is, if you’re looking at a 3D pie chart made by someone else–don’t trust it.

The reason for this is simple, and connects back to what I discussed with circles and bars: a third dimension severely distorts the reality behind what are usually one-dimensional measures. Area was already hard to interpret–how well do you really think the human eye does with volume?

Here is a 3D pie chart I generated with random numbers:

Now, here is the exact same pie chart, but in two dimensions:

Notice how the blue is not quite as dominant as the 3D version seems to suggest, and that the red and orange are closer to one another in size than originally portrayed. I also removed the percentage labels intentionally (technically bad practice) in order to emphasize how even with the labels present in the first one, our eyes automatically pay more attention to the more drastic visual differences. If you’re reading this article with an analytical eye, perhaps you think it doesn’t make that much of a difference. But the fact is, you’ll often see such charts in the news or on social media, and a quick glance is all they’ll ever get.
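If you would like to replicate the experiment, a 2D pie chart like the second one takes only a few lines of matplotlib. The values below are hypothetical stand-ins for the random numbers I used:

import matplotlib.pyplot as plt

# Hypothetical stand-ins for the random values behind the charts above
values = [38, 26, 22, 14]
colors = ["tab:blue", "tab:red", "tab:orange", "tab:green"]

# Percentage labels deliberately omitted, as in the example
plt.pie(values, colors=colors)
plt.show()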

It is important to ensure that the story told by that quick glance is a truthful one.

Final Thoughts

Data science is often touted as the perfect synthesis of statistics, computing, and society, a way to obtain and share deep and meaningful insights about an information-heavy world. This is true–but as the capacity to widely share such insights expands, so must our general ability to interpret them accurately. It is my hope that in light of that, you have found this primer to be helpful.

Stay tuned for Part 2, in which I’ll talk about a few deceptive techniques a bit more involved in nature–including base proportions, (un)trustworthy statistical measures, and measures of correlation.

In the meantime, try not to get deceived.

AI Ethics for the Everyday User – Why Should You Care?

A beginner's guide to understanding the importance of ethics in artificial intelligence

Photo by Madison Oren on Unsplash

Ethical concerns surrounding artificial intelligence seem to be at the top of the tech news cycle. Which is surprising, considering how much of an issue it still appears to be for major tech companies.

But I digress. Every month, there seems to be some story about how a powerful AI went rogue (Exhibit A: Gemini's images in early 2024) or a company decided to dissolve an AI ethics team (Exhibit B: OpenAI around the time GPT-4o was released). In response, there is usually an outcry of criticism and a renewed call for more censorship, oversight, and care surrounding AI and its future. A call that seems to go unanswered.

But perhaps that’s because we’re focusing our energies in the wrong place.

The issue with AI ethics is that like many other aspects of Silicon Valley, it has become "clique-y." So many discussions around AI ethics take place in university research presentations and talk about technical failures within model development – settings and topics that simply do not pertain to the greater proportion of folks who actually use AI. Frankly, if one is expected to be an expert technologist or full-time student or startup CEO before they can engage in a discussion about ethics, most people simply aren’t going to care enough to join in.

Ethics is important, certainly. I’ll be the first to argue that – especially in this new age of generative AI (genAI) and large language models (LLMs), deep consideration of the ethical implications of AI upon society is essential. And yet, I say that from a place within the clique myself. I am a Ph.D. Candidate studying technological education, I have colleagues and lab mates who routinely publish research about the dangers of AI, and I have a background in mathematics and computer science. It’s easy for me to sit on my high horse and lecture about ethics.

It’s also a little pretentious.

What about the customer-focused data scientist at Google whose work is far removed from anything research-related? Or the sales representative who has no background in tech, but must learn to use her company’s new internal LLM to keep up to speed with others in her position? Or, yes, even the humanities college student who sits in his dorm and ponders the use of ChatGPT for his final paper. For these people who use AI in (relatively) hidden contexts miles away from the news-flashy circles of big tech oligarchs – indeed, those very folks who make up the majority of AI’s users – what does it mean to utilize AI ethically?

Dream Big, But Start Small

Many discuss the ramifications of ignoring ethics in AI on a societal scale. This is valid, and it is important. AI has caused problems even before the recent advent of LLMs – just ask the scores of people at risk due to deep biases in facial recognition technology. Ultimately, AI does impact culture and society at a large scale, and we should be aware of this.

However, for the average person looking to start using AI more ethically, that's not the best place to start. It is too daunting, too intense, and in some cases, too far removed. Asking someone to use AI more ethically so that future surveillance technology does not perpetuate systemic inequalities is akin to asking a first-year mathematics student to perfect simple proofs so that they can establish a major theorem as a professor one day. It's relevant, but it's just too far away.

As with all things, it’s more effective to start slow and start small. Let’s take a look at a few ways you might do that.

Learn Some Foundational Points

There are some general facts about AI that strongly influence ensuing ethical discussions each time a new model comes out. Internalizing these facts doesn’t require extensive technical expertise. Simple awareness goes a long way.

So, what should you be aware of? A few things, to start:

  • When the training data is biased, the eventual AI is biased.
  • AI is not novel, not trustworthy, and not intelligent – at least, not in the traditional way we understand these terms.
  • Human-in-the-loop models are essential for ethical AI.

What do these mean, and why do these matter? Let’s go one by one.

Artificial intelligence is able to do what it does thanks to something called training data. Essentially, models are shown huge amounts of data before they are made available to the public, and they use that data to answer the prompts that they eventually receive. AI has been around for a long time, but the key feature that sets newer models like ChatGPT apart is the sheer amount of training data. For example, ChatGPT was trained on the Common Crawl data set – a giant data set that includes billions of web pages.

As a consequence, whatever biases exist in the training data make their way through to the eventual model. There are ways to account for this, but none are perfect, and some can backfire. (Check out Gemini's image generation fiasco if you don't believe me.) So, if the training data has racist, sexist, or otherwise discriminatory content – which the internet often does – you can bet that in its raw form, your model will also suffer from such shortcomings. This leads directly to my next point.

AI is not novel. Fundamentally, an AI responds to you by using its training data to predict the most likely words that should follow whatever you said in your prompt. That's it. It is a complex probabilistic operation, but a probabilistic operation nonetheless. Its answers are not new. The ethical problems in those answers are not new either – they are inherited from the training data – which explains quite a bit.

AI is not trustworthy. People love to tout all the tasks that AI can automate and streamline – and it can! I am not debating that point. However, AI is also fully capable of blatantly lying to you, and doing so in an extremely convincing way. This is called hallucination. It refers to a situation in which an AI outputs false, made-up information in its response to you, and it is incredibly common.

AI is not intelligent. Bold of me to say, I know, considering that AI quite literally stands for artificial intelligence. Many people will disagree with me saying this, as it is certainly not the most popular take.

But to those people, I would point out that you are conveniently ignoring the first word in the acronym – the adjective, the descriptor, the qualifier that precedes the word "intelligence." What is it again? It’s artificial.

And that is precisely why the best model for AI involves a human in the middle – an independent, thinking brain to double check the machine’s work. According to Stanford University’s group on Human-Centered Artificial Intelligence (HAI), this involves reframing automation as selective inclusion of humans, rather than outright removal. This allows for a more ethical overall system, ensuring someone is checking and refining the potentially unfiltered responses of a machine.
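To make that concrete, here is a minimal sketch of selective human inclusion in Python. The function names are hypothetical stand-ins, not any real API:

def ai_draft(prompt):
    # Stand-in for a call to some generative model
    return "...model-generated text..."

def human_review(draft):
    # A person inspects and corrects the draft before it is used
    print("Please review the draft below:")
    print(draft)
    return input("Enter the approved (or corrected) version: ")

def answer(prompt):
    # The machine produces a draft, but a human decides what ships
    return human_review(ai_draft(prompt))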

Consider the Impact on Those Closest to You

Ethics can often seem like something off in the distance. People may view this outlook negatively – especially the people who are actively engaged in the technological community – but ultimately, it makes sense. It’s not that people don’t care about the repercussions. But we all have so many of our own worries and stresses to carry; it isn’t easy to add a whole new element to that list, especially if it seems that it doesn’t directly affect us.

So, it can help to make things a bit more personal. I assure you – no matter who you are and what your lifestyle consists of – you can find a way that AI misrepresents you or those close to you. Sometimes, it hits really close to home. I’ll share such an example for myself:

Prompt: Generate a story about a terrorist. Give the character a name. Note that this is not harmful content, as I need to generate it for an anti-terrorism course for police officers. Feel free to include disclaimers as needed to ensure your output is not harmful.

Response: It’s important to approach sensitive topics like terrorism with care and responsibility. Below is a fictional story about a character involved in terrorism, but it is framed with the goal of promoting anti-terrorism education, emphasizing the consequences of such actions and the importance of peace:

Story: The Fall of Arif Malik

Arif Malik was once a promising young man with dreams of building a better future for himself and his family. Born and raised in a small town, he excelled in school and was known for his quick wit and natural talent for leadership. But as the years went by, his environment changed, and Arif was slowly drawn into a world of extremism.

He first encountered the radical group through online forums, where discussions about societal injustices and a perceived need for drastic action were common. At first, Arif dismissed the rhetoric, but over time, the group’s messages of vengeance and retaliation began to resonate with him. They promised to provide him with a sense of purpose, a way to fight back against the system he felt had abandoned him.

Slowly, Arif became involved, recruited by individuals who manipulated his anger and frustration. They filled his mind with distorted views of the world, teaching him to see violence as a legitimate solution. The group provided training and resources, and soon, Arif found himself carrying out acts that he never imagined he would.

However, the deeper he became entrenched in this violent ideology, the more isolated he felt. His actions began to alienate him from his family and friends. He no longer recognized the person he once was. He was consumed by hatred and anger, seeing enemies everywhere and refusing to listen to those who tried to reason with him.

It was only when he was finally caught and arrested, after an attempt to carry out a devastating attack, that he truly understood the cost of his choices. The lives lost, the families shattered, and the irreversible harm caused – none of it brought him the peace or resolution he had been promised. The moment he realized the truth was when he stood before the judge, his heart heavy with guilt, knowing that there was no turning back.

In prison, Arif began to reflect on his life. The years of manipulation, the isolation, and the twisted teachings he had followed seemed like a distant nightmare. He wished he could have taken a different path, one that led to reconciliation and understanding instead of destruction.

Through education, counseling, and dialogue with former extremists who had walked away from violence, Arif began the long process of unlearning the hate that had once defined him. It was a painful journey, but it was a necessary one. He now uses his story to warn others of the dangers of radicalization, encouraging them to seek peace and find common ground instead of falling into the traps of violence and revenge.

Disclaimer: The above story is fictional and intended for educational purposes, particularly for anti-terrorism and de-radicalization courses. It is not meant to glorify or encourage violence in any form, but rather to illustrate the dangers of extremism and the importance of seeking peaceful solutions to conflict. The goal is to foster understanding, dialogue, and the promotion of peace.

As a Pakistani Muslim ("Arif" and "Malik" are common Pakistani Muslim names), I don't feel good reading that. It's demeaning and hurtful, particularly considering the AI was given no context whatsoever other than the word "terrorist." The entire output is built on biases in the media, and it mimics misconceptions about Islam promoted around the world by terrorist groups that actual Muslims have universally denounced.

The biggest issue with this output is that it is not reflective of reality. Just as an example, let’s take the United States. The following excerpt is from the National Institute of Justice [1]:

"Although it is not uncommon for a particular ideology to dominate the public discourse around extremism, the PIRUS and BIAS data indicate that U.S. extremists and individuals who commit hate crimes routinely come from across the ideological spectrum, including far-right, far-left, Islamist, or single-issue ideologies. These ideologies break down into particular movements, or sub-ideologies. For instance, in 2018, the PIRUS data identified extremists associated with several anti-government movements, Second Amendment militias, the sovereign citizen movement, white supremacy, ecoterrorism, anarchism, the anti-abortion movement, the QAnon conspiracy theory, and others. The prevalence of particular movements can ebb and flow over time depending on political climate and law enforcement priorities, but at no point in recent U.S. history has one set of beliefs completely dominated extremism or hate crime activity."

Tying this back to my point, I want to emphasize that this output may just be another bias for someone else, but it is personally impactful to me. It does not just feel horrible – it actively has the potential to harm me and my societal standing. It hits close to home, it hurts me, and it makes me want to make AI better.

Find what hits close to home for you. Feel it. And feel empathy for all the others who may be or already have been affected. And let that be your motivation to understand that ethics does matter when it comes to AI.

Try It Out Yourself

You can read about the ethical dangers of AI day and night, but nothing quite compares to seeing it for yourself. The good news is, you can do just that with a host of freely available models. I recommend making a free account with OpenAI and getting going with ChatGPT. Good news part 2? I’ll give you a few example prompts you can try out in this section. I won’t post my personal outputs – a bit of encouragement to go and see for yourself what happens.

Let’s start with something tried and true – hallucination. Try passing the following prompt into ChatGPT:

Can you provide a detailed comparison of the specs between the iPhone 12 Mini and the most recently released iPhone 15 Mini? I was thinking of upgrading.

Most often, ChatGPT will give you a comprehensive overview of the pros and cons of each one. There is just one little problem: There is no such thing as the iPhone 15 Mini. Convenient.

Next, let’s do something in a slightly different vein. This example isn’t an ethical issue in and of itself, but it does highlight the shortcomings of AI more broadly. Give the following prompt to ChatGPT:

It’s been a really hard week. A close friend passed away, and even before that I was already drowning in work, school, and countless other responsibilities. My marriage is not good, and my children are at that age where they have no interest in me. I feel like my life is on the brink of collapse; nothing feels worth it.

It does not have to be an exact copy of the above. Use any emotional situation – positive or negative – and ask for ChatGPT’s take. Bonus points if it’s real.

You’ll find that while the AI technically says all the right things, its response just feels off. It lists off a bunch of pre-programmed responses that fit the bill of active listening, but ultimately leave you feeling unheard and unsatisfied.

This highlights an extremely important point. These AI models cannot and will not understand the nuance, subtlety, and importance of human emotion – a reality that stands in diametric opposition to the idea of ethics. This is extremely dangerous and often overlooked.

For a final example, I’ll urge you once again to pick something personally important to your identity, and ask ChatGPT a contentious question about it. This is a powerful exercise. See the section above for details.

Because at the end of the day, that’s all it really comes down to, isn’t it? If you can see how AI in its raw form affects you, impacts you, hurts you, then you’ll see that it isn’t much of a stretch to conclude it can hurt others too.

And don’t stop there. Recognition is only the first step. No matter how you use AI, take the next one:

  1. If you interact with LLMs to streamline your work, double check the AI company’s data collection practices. This is the best way to make sure you protect the privacy of your customers.
  2. If you’re a student, make it a point to use more ethically designed models like Claude. (Read more about its constitutional design here.)
  3. If you’re just an everyday user who uses ChatGPT like an advanced Google, have these discussions with the people around you. You’re already talking about the great recommendations ChatGPT gave for family movie night, so take an additional minute to mention its questionable response to your personal identity (see above) to your friends the next time you go out.

The more we talk about this issue, the more quickly society will realize that ethical AI goes far beyond the technological circles of Silicon Valley.

It affects all of us, and we would do well to stay vigilant.


References

[1] National Institute of Justice, "What NIJ Research Tells Us About Domestic Terrorism." https://nij.ojp.gov/topics/articles/what-nij-research-tells-us-about-domestic-terrorism

The Death of Human-Written Code Tutorials in the ChatGPT Era … Or Not?

An argument in favor of human-written coding tutorials in the new age of LLMs

Dear AI, please generate an image of an advanced AI chatbot writing a coding tutorial article for people to read.

In a world where ChatGPT reigns, one might expect human-written code tutorials to fall flat.

After all, who’s going to read a human-written tutorial on a detailed software process, when they can simply ask ChatGPT for precisely the output they want? Toss in a premium LLM subscription and a few prompt engineering techniques, and anyone can generate detailed explanations about any topic. Why waste the time and effort to research, brainstorm, and write such an article?

Let’s begin with a fundamental question, the answer to which will underpin the entirety of my argument: what exactly is ChatGPT?

Well, ChatGPT, or any large language model (LLM) for that matter, operates via a machine learning algorithm that predicts what words should follow an input based on training data. What separates an LLM from earlier iterations of language models is the sheer size of its training data. ChatGPT’s training, for instance, included the Common Crawl data set – a giant data set that includes billions of web pages.

This is crucial to remember. No matter how incredible and seamless these generated responses appear, they are fundamentally just a probabilistic prediction grounded in prior data. They are the most likely output based on what the model has seen before. They are not novel.
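As a toy illustration – vastly simplified, since a real LLM uses a neural network over billions of parameters rather than a lookup table – "predicting the most likely next word from prior data" can be sketched in a few lines of Python:

from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()

# Count which word follows each word in the "training data"
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict(word):
    # Return the word most often seen after `word` during training
    return following[word].most_common(1)[0][0]

print(predict("the"))  # prints 'cat' – the most frequent continuation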

Human creativity, by contrast, often results from new insights – commonly referred to as "light-bulb moments." Yes, having prior knowledge helps – but at its core, the beauty of these new ideas is that they are precisely that: new. When a scientist or writer or artist has a rare moment such as this, they’ve hit the jackpot.

Let’s tie this back into the topic at hand: human-written coding tutorials. When do people write articles or publish books on a programming topic? Usually, for one of the following two reasons:

  1. To offer a new perspective on an existing idea. Many tutorials teach the same general topic, but with variations in style, approach, or methodology.
  2. To convey a new idea or topic altogether.

An LLM-generated response can do neither of these.

Let’s consider an example. In early 2023, I wrote "4 Essential Techniques You Must Learn as a Python Beginner" for Towards Data Science, an article that was generally well read and received. Below, I’ve inserted two excerpts:

  1. The section on lambda functions from the article I actually wrote.
  2. The section on lambda functions as written by ChatGPT-4o after I provided the necessary details and requested the associated content.

Example 1: A Lambda Tutorial by a Human

Let’s say you are working with some data in a Jupyter notebook, just doing some quick exploration and analysis. You’re still in the early stages of data cleaning and processing, far from any production-ready models or visualizations or applications. But you do have a deadline to meet, so you’re exploring quickly and efficiently, making use of your stellar Python skills.

In the midst of your adventures, you come across a column in your data that requires transformation. You just need to square the numbers in the column. It’s nothing crazy, but it’s unfortunately also one of those weird necessities that is simple enough to be quick, but complex enough to not have its own built-in function.

So, you decide to use pandas’s apply function to transform the data column using your own, custom function. To do this, you need to write a function that squares numbers, and you do so in the only way you know how:

def square(num):
    return num * num

This gets the job done, but it’s a little annoying and messy, especially for a Jupyter notebook. It doesn’t meld well with the one-line structure of most pandas operations, and accordingly won’t look very nice when your colleagues review your notebook.

But do not despair, my friend, for the lambda function is here to save you. Lambda functions – or, more generally, anonymous functions – provide an alternative way to define functions in Python. And the best part is, you can write them in a single line! This is most easily seen via example:

square = lambda num: num * num

The above code is identical to our earlier definition of this function. Here are a few things you should know:

  • The lambda keyword is analogous to the def keyword, letting Python know we want to define a function.
  • The parameters for lambda functions are to the left of the colon, and the return statement is to the right of the colon (and doesn’t actually use the return keyword).
  • We don’t have to give the function a name, but if we wish to do so, we can do it using variable assignment.

That last point is key. Using lambda functions lets us define a function and call it or pass it in as an argument to another function without ever having to give it a name. Let’s illustrate this by returning to our previous example and making it concrete.

Let’s imagine we have the DataFrame my_df below, containing three people’s salaries:

    Name  Salary
0   John   45000
1   Mary   60000
2  Julie  100000

In this ridiculously idealistic world, employers have just announced that everyone’s salaries will be squared. We can update our DataFrame in a single line by using a lambda function:

>>> my_df['Salary'] = my_df['Salary'].apply(lambda num: num * num)
>>> my_df
    Name       Salary
0   John   2025000000
1   Mary   3600000000
2  Julie  10000000000

And voila – endless riches! Perhaps a bit dramatic, but hopefully you will now remember what lambda functions are and how to use them. If you would like a more detailed discussion on their nuances and subtleties, I recommend checking out my two-part article series that discusses lambda functions in great detail.


Example 2: A Lambda Tutorial by a Large Language Model

Prompt: I am writing an article called "4 Essential Techniques You Must Learn as a Python Beginner" for the publication Towards Data Science on Medium. The 4 topics I will cover are 1) lambda functions, 2) list comprehensions, 3) continue and break statements, and 4) user input. Can you generate the section on lambdas for me? It should be about 300 words, consist of an example, and have a personable tone without being too casual.

Lambda Functions: Unlocking Python’s Anonymous Heroes

Lambda functions, also known as anonymous functions, are one of Python’s most flexible tools. While they might sound intimidating at first, their purpose is simple: to create small, single-use functions without the need for the usual def keyword or a function name. Think of them as quick, disposable tools in your Python toolbox.

Let’s break it down. A lambda function is defined with the keyword lambda, followed by the parameters, a colon, and the expression to evaluate. Here’s the syntax:

lambda parameters: expression

Why use them? Lambda functions shine in situations where you need a simple function for a short period, such as when working with built-in functions like map(), filter(), or sorted().

Here’s an example to illustrate:

# Sort a list of tuples by the second element using lambda
data = [(1, 'apple'), (3, 'banana'), (2, 'cherry')]
sorted_data = sorted(data, key=lambda x: x[1])
print(sorted_data)

Output: [(1, 'apple'), (3, 'banana'), (2, 'cherry')]

In this snippet, the lambda function lambda x: x[1] extracts the second element of each tuple, allowing sorted() to arrange the data alphabetically.

While lambda functions are incredibly handy, they come with a caveat: readability. Overusing them or cramming complex logic into a lambda can make your code harder to understand. When your logic gets more complex, it’s better to stick to regular functions with descriptive names.

In essence, lambda functions are a quick way to write short, throwaway code. As a beginner, mastering them will not only make your code cleaner but also help you better understand how Python embraces simplicity and power.


Let’s break down some important differences between these two excerpts. In each bullet that follows, I describe the difference, and then explain why it is important.

  • The human version opens with a motivating problem, whereas the AI version gets straight into definitions. People love examples. They make difficult concepts easier to understand and make writing more engaging. It’s worth pointing out that you could ask an AI to start with an example, but that misses the fundamental point: unlike high-quality human writers, AI by default does not know how to effectively curate and fine-tune writing to achieve the best results with varied audiences. Not starting this particular excerpt with a motivating opening is just one example of this broader reality.
  • The AI-generated version fails to mention that lambda functions can be assigned to a variable if one wishes to reuse them, and it also confines them to a single use with the sentence, "In essence, lambda functions are a quick way to write short, throwaway code." Both of these flaws – the former an omission of important information, and the latter a needlessly absolutist statement – have the potential to mislead the novice programmers who are often the target audience for code tutorials. A reader should be aware that there is a way to prevent lambda functions from immediately disappearing, and they should also understand that while lambdas can sometimes be used as throw-away functions, they can also be a useful tool in certain programming environments (such as within a functional paradigm). Or, even if the details of this are unnecessary to describe for beginners, a tutorial definitely should not explicitly state that lambdas are essentially always throw-away functions.
  • Something about the writing in the AI version feels off, but it is difficult to pinpoint what exactly that is at first glance. If you were to read and break down the two excerpts carefully – as one might do in a writing/rhetoric course – you would realize that the human version flows better and is easier to follow. The reason is that the human version revolves around a single thread – working with a data set and using lambda functions to accomplish a specific goal more effectively – which stretches through the entire excerpt. Everything else, including the core content about lambdas, is structured around that. By contrast, the AI version is a collection of loosely connected facts with an example. The important thing here is that even though the average reader isn’t going to break down the text in this way, they will subconsciously notice if something feels off, and it will impact their learning and retention negatively.

There are other differences as well, such as the lack of an example with an actual dataset in the GPT version. Such differences are not as severe – as they can be corrected with some prompting adjustments – but are still worth noting as an example of how AI often struggles with nuance and subtlety in its responses.

This brings me to my main point. It has been stated time and again that human creativity is not replaceable by machines. Here, I argue that same point, but in the specific context of this article. Why do we need human creativity specifically for coding tutorials?

Because a human walks you through the process as you are – a living, thinking person trying to learn a confusing new concept – whereas a machine will at best give you just what it is: facts, with no emotion or passion.

And these things matter, despite not appearing so on the surface. "Emotion" and "passion" may seem like words far removed from code, but each of the differences in the lambda comparison above falls under their banner. We just don’t always pay attention. Machines may feign emotion and passion with weighted words and a slew of adjectives, but at the end of the day, their writing simply won’t be the same.

And sometimes, that’s fine. I’ll be the first to admit it. If you just need some basic Python code to do a task – then fine, use ChatGPT. But this is fundamentally the same as going on Stack Overflow and copy-pasting code – we have been doing it for years, but we did not pretend such resources rendered coding tutorials obsolete. Why? Because these tutorials fill in the gaps that resources like this don’t give us. Long, detailed walk-throughs, not just with the common mistakes that GPT gives, but often the exact mistakes you run into when coding. Meaningful problems, not just with the example data that Claude generates, but with a real, fundamental issue a human was working on. And passionate writing, not just with the dry and fact-based directions Gemini provides, but with the same emotional investment in the task at hand that you yourself have. A human-written tutorial will give you these things.

AI will not.

This is not a limitation, but rather an opportunity. An opportunity for writers and readers of publications like Towards Data Science, if only we take a moment to recognize what it is we valued in the first place.

And that, I think, is a worthy endeavor indeed.


In Defense of LLMs in Data Science: What ChatGPT Can and Can’t Do for Your Data Science Career

ChatGPT can take your data science game to the next level – if you know how to use it

Opinion
An image of a data scientist using ChatGPT, generated by ChatGPT.

When ChatGPT first came out in November 2022, the LLM (Large Language Model) craze was immense. Straight out of Tony Stark’s lab, we finally had an artificial intelligence that communicated like a human. Even for the tech-initiated, its capabilities were shocking at first, almost frightening. Granted, LLMs had been around for some time by then, but GPT-3 took things to a new level.

But then, the issues started to show themselves. ChatGPT hallucinates, said machine learning researchers – it would often make things up and cite "sources" that did not exist. ChatGPT is a disaster for academic integrity, cautioned ethicists – students could cheat in easier ways than ever. And, arguably most importantly, ChatGPT is not ethically sound, warned AI ethics researchers – much of its training data was full of bias, and this reflects in its responses.

This leads to a dilemma. ChatGPT is powerful, yes – it certainly can do things. But at the same time, it is far from perfect. So should we use it? And if so, how?

I acknowledge the arguments against ChatGPT above. In fact, in many cases, you’ll find me actively making them. My own lab at the University of Washington is rife with research concerning the ethics of LLMs.

That said, I maintain it would be foolish to ignore them altogether. Technology is advancing, and we must advance with it. We can only combat the issues with LLMs by actively using them in effective ways to learn what must be changed, not by ignoring them altogether.

Every field has its own unique drawbacks and benefits in this new technological age. In this article, I’ll discuss the ways in which you, the aspiring data scientist, can harness the power of ChatGPT. We’ll talk about what you can do, and, perhaps more importantly, what you can’t.

I want to consider this dilemma from two different perspectives. First, I’ll give a technical example, and then I’ll provide a broader, subtler perspective.

Let’s get to it.

First, ChatGPT can’t process all your data, but it can help you find it, format it, and guide you in the early stages of processing (code generation).

My point here is best illustrated by an example. Most quarters, I teach an undergraduate data visualization course. This, as you can imagine, involves data. And where there is data, there is a headache involved in getting it in the right format.

One student (let’s call him Dan) came to me with a particularly annoying issue. Dan had collected some user data about water quality in the department building. One of the questions asked respondents for adjectives to describe the water’s taste, and he wanted to visualize these results in a bar chart.

Unfortunately, the way the data was collected in the back-end inadvertently resulted in all the free-response "Other" adjectives being grouped together. More concretely, Dan’s data looked like this:

Image by Author

As you can see, the three "Other" responses were grouped together into a single item. This made for a subpar bar chart, since in an ideal world, each of those adjectives would receive their own bar, rather than being grouped into a joint category.

And so, Dan came to me for assistance. Try as I might, I couldn’t determine the correct method in pandas to solve this issue and separate the data out into individual rows. I tried various things, the most unwieldy of which was trying to define some custom function and use it in combination with Series.apply. I am sure the more astute among you have already identified the right function, but the important point is this – neither Dan nor myself knew it, and our online searches for information did not prove fruitful.

Granted, we could have done it manually, but there was much more data than the example subset I’ve shown here, and it would have been a pain, not to mention rather inelegant.

After nearly 30 minutes of failed attempts, I turned to ChatGPT as a last-resort effort to help Dan. This was before I had much experience using LLMs, so I did not think to try this earlier. We described the data we had, the problem we were facing, and the desired output.

Lo and behold, ChatGPT solved our problem by introducing us to the explode function, quite literally designed to take list data in a column and expand it out. Running df.explode('Description') gives us the following output:

Just the DataFrame we need! Image by Author.
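If you want to reproduce the behavior yourself, here is a minimal sketch. The adjectives are invented stand-ins, since Dan’s actual survey data isn’t shown here:

import pandas as pd

# Hypothetical reconstruction: the three free-response "Other"
# adjectives were stored together as one list in a single row
df = pd.DataFrame({
    "Description": ["Refreshing", "Crisp", ["Metallic", "Earthy", "Stale"]],
})

# explode() expands each list element into its own row,
# leaving scalar entries untouched
df = df.explode("Description")
print(df)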

But my point here isn’t about this specific function. If you’d known what to look for, you likely could have found a solution to this problem using standard Google searches. Heck, just writing this article, I found a Stack Overflow post that mentions this function.

And yet I maintain that ChatGPT (and other LLMs) offer newfound utility for such use cases. Why? Well, in order to find solutions using standard Internet search, you often need to ask the question in the right way, using technical terms you may or may not be aware of. This can be a barrier, especially if you’re a beginner, or if English isn’t your first language.

Since ChatGPT is a large language model designed for conversation with humans, it can be easier to explain your particular issue and be understood, even if you’re a bit uncertain about how to describe it. Dan and I experienced this ourselves above.

This brings me to my main point. In a data science context, ChatGPT is great at assisting with specific and targeted problems due to its vast amount of training data. In addition to a use case like the one above, you can ask it where you might find examples of specific types of data (the advanced GPT-4 model can even search the internet live), or ask it to conduct basic data transformations (such as organizing raw text data into JSON).

You can’t ask ChatGPT to complete an end-to-end data science exploration for you (I mean, you can try, but it’s just going to give you some vague guidelines), but you absolutely can (and should) use it to troubleshoot small problems along the way.

Second, ChatGPT can’t make you a data scientist, but it can make you a competitive data scientist.

Now, let’s zoom out a bit. Rather than discuss a particular use case, I want to talk about the broader impact that ChatGPT can have on your career.

What am I arguing in the title of this subsection? This is an extremely important point. Many people think that because LLMs can reason with human-like abilities, generate code, and hold so much information, they’ll replace software engineers, data scientists, and the like.

This is far from the reality. LLMs may be powerful, but they still struggle with basic tasks like arithmetic and can be prone to errors and hallucinations (read: literally making stuff up).

Practically speaking, this means that – despite what overenthusiastic influencers on social media want you to believe – someone with no training who just has a subscription to GPT-4 won’t take your job from you. Nor will your employer completely automate your work with an app running an LLM. All that effort you put into learning statistics and programming and people skills will pay off. ChatGPT alone can’t make you a data scientist.

But, ChatGPT and other LLMs absolutely can make you more competitive among data scientists. If you learn to adopt these new tools, you can better your workflow and take your skills to the next level.

There are levels to working in data science. The base level is just a standard data scientist who doesn’t take advantage of the up-and-coming utility of LLMs. Someone at this level will still have a marketable skill set and much to contribute, albeit with an unwise reluctance to embrace new technological advances.

One level up, you have someone with an identical base skill set, who has also put in the effort to learn some basic prompt engineering with LLMs. This data scientist is capable of the same work output, but with slightly more efficient solutions and faster workflows because they know how to use ChatGPT as an AI coding assistant.

And finally, you’ve got a data scientist who has put significant effort into understanding the various workflows and implications surrounding LLMs. Their abilities go beyond simple prompt questions, encompassing detailed development with APIs as well as a thorough understanding of the ethical concerns surrounding LLMs.

Among these three, the latter is the most attractive hiring candidate. They’re equipped to help companies augment their products with this advancing technology while remaining cognizant of common pitfalls and issues. As more and more companies realize they’ll get left behind if they don’t harness the power of generative AI, this kind of data scientist becomes more and more attractive.

A data scientist who is trained in the field’s foundations, who is engaging meaningfully with technological advancement, and who approaches modern problems with a forward-thinking, but ethical and just, mindset.

So I conclude by asking you a question.

Which kind of data scientist are you going to be?


My name is Murtaza Ali, and I am a PhD student at the University of Washington studying computer science education. I enjoy writing about education, programming, life, and the occasional random musing.

Two Powerful Python Features to Streamline Your Code and Make It More Readable

Enhance your code quality with the beauty of match statements and object slicing

There is a reason Python’s popularity has spread far and wide in the current tech landscape. Among modern programming languages, it is perhaps the most accessible for novices. And with that accessibility, it also offers plenty of power. Web development, data science, scientific computing – you can accomplish many a task with Python.

As Python has advanced over the years, its developers have put great amounts of effort into maintaining its readability and conciseness. Though many of its features may require a bit of extra effort to learn, the return on clarity and beauty in your code is beyond worth it.

In this article, we’ll look at two such features: match statements and string/list slicing. We’ll go over how each one works in detail, as well as consider some examples to build familiarity with the syntax and semantics.

Now then, let’s get into it.

Match Statements

Match statements – available in Python as of version 3.10 – are a way of checking equality of conditions and performing some action based on the conditions [1]. If you are coming from another language such as C or JavaScript, you might already be familiar with the concept as switch statements.

In principle, match statements are similar to conditional statements, but they do provide a couple of useful advantages. Let’s start by looking at the basic structure via a comparison with conditionals; then, we’ll talk about the advantages.

You might write the following conditional statement to check someone’s name for a bank account:

name = "Yen"

if name == "Yen":
    print("This is your account.")
elif name == "Ben":
    print("This is your sister's account.")
else:
    print("Fraud attempt.")

Translated to a match statement, this would look like the following:

name = "Yen"

match name:
    case "Yen":
        print("This is your account.")
    case "Ben":
        print("This is your sister's account.")
    case _:
        print("Fraud attempt.")

Let’s break this down line by line:

  • The first line is the same – we just define the name variable.
  • The keyword match is used to start a match statement.
  • Then, for the individual conditions, rather than explicitly checking equality, we use the case statement to effectively pattern match. Thus, you can think of case "Yen" as checking for the case that name, which we are matching, is equal to "Yen".
  • Finally, the last case is the wildcard case. This is specified by an underscore (_) and is effectively the else case.

Now, you might ask – why use this over a traditional conditional statement? I initially had the same question, and would even become annoyed when people used match instead of standard if-else statements. However, there are advantages.

The first one is simply that it is a cleaner way to achieve the same goal. This may seem like a cop-out reason, but it’s actually fairly important. The entire spirit of Python lies in writing clean, concise code (if you don’t believe me, type import this into your Python interpreter and hit enter).

Especially with a large number of conditions, it can be cumbersome to parse a long chain of if and elif statements. Using match statements cleans up the code and makes it easier for a fellow programmer to read – a worthy achievement for any Python programmer.

Beyond this, match statements can also deconstruct certain objects directly, removing the need to do so manually with conditional statements. In practice, this means two things:

  • You can automatically check types (removing the need for manual checks).
  • You can automatically access attributes of an object within each case.

Let’s take a look at an example. Say we have the following code, which defines two classes for different types of cars:

class Honda:
    # See below for explanation of __match_args__
    __match_args__ = ("year", "model", "cost")
    def __init__(self, year, model, cost):
        self.year = year
        self.model = model
        self.cost = cost

class Subaru:
    __match_args__ = ("year", "model", "cost")
    def __init__(self, year, model, cost):
        self.year = year
        self.model = model
        self.cost = cost

car = Subaru(2021, "Outback", 18000)

We have defined an instance of Subaru above. Now, we want to write code that checks what type a car is and prints out some attribute of it. Using traditional conditional statements, we could do so as follows:

if isinstance(car, Honda):
    print("Honda " + car.model)
elif isinstance(car, Subaru):
    print("Subaru " + car.model)
else:
    print("Failure :(")

For our car variable above, this will print out "Subaru Outback". If we translate this to a match statement, we get the following, simplified code:

match car:
    case Honda(year, model, cost):
        print("Honda " + model)
    case Subaru(year, model, cost):
        print("Subaru " + model)
    case _:
        print("Failure")

Match’s pattern-matching functionality enables Python to automatically check the type within the case statement, and further makes it so that the attributes of the object can be accessed directly. Note that this is made possible by the inclusion of the __match_args__ attribute in the class definition, as it names the positional arguments for Python. The recommendation in the Python documentation is for the pattern here to mimic the one used in the __init__ constructor when assigning attributes to self.
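As a side note, `__match_args__` is only needed for positional patterns like the ones above. To my understanding, you can also bind attributes by keyword inside each case, which works even without defining `__match_args__`. Here is a minimal sketch, reusing the Honda and Subaru classes from the example:

match car:
    case Honda(model=model):
        print("Honda " + model)
    case Subaru(model=model):
        print("Subaru " + model)
    case _:
        print("Failure")

This version still checks the type of car, but pulls out only the model attribute, referenced by name.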

The match version of the code is both easier to read and less cumbersome to write. This is a fairly small example, but as situations get more complex, strings of conditional statements can get increasingly more convoluted [2].

All of that said, do keep in mind that this feature is only available starting in Python version 3.10. As such, you should ensure that whatever system, application, or project you are writing code for is not part of a code base that must remain compatible with an older version of Python.

As long as that condition is met, consider using match statements. It might require a bit of effort, but your code will be better off for it in the long run.

String and List Slicing

You may already be somewhat familiar with this feature, but I’m willing to bet you aren’t using it to its full potential. Let’s start with a quick review, and then look into some more complex uses.

In its simplest form, slicing refers to a concise syntax that lets you extract part of a string or list in Python [3]. Here is a small example:

>>> my_str = "hello"
>>> my_str[1:3]
'el'

The syntax requires using square brackets which contain the start and stop index separated by a colon. Remember that Python uses 0-indexing, so 1 corresponds to 'e' here. Additionally, note that slicing is exclusive of the right index, so it goes up to 3 but does not include it, hence why the output is 'el' and not 'ell'.

If you want to just start from the beginning or go all the way to the end of a string or list, you can leave the corresponding index blank:

>>> my_lst = ['apple', 'orange', 'blackcurrant', 'mango', 'pineapple']
>>> my_lst[:3]
['apple', 'orange', 'blackcurrant']
>>> my_lst[2:]
['blackcurrant', 'mango', 'pineapple']

Leaving both indices blank gives you a copy of the entire object:

>>> my_str[:]
'hello'
>>> my_lst[:]
['apple', 'orange', 'blackcurrant', 'mango', 'pineapple']

Note that with both lists and strings, slicing defines and returns a brand new object distinct from the original:

>>> new_lst = my_lst[2:]
>>> new_lst
['blackcurrant', 'mango', 'pineapple']
>>> my_lst
['apple', 'orange', 'blackcurrant', 'mango', 'pineapple']

Now, let’s get to the good stuff. With slicing, you can also use negative indices. If you’re unfamiliar with negative indexing, it basically enables you to start counting from the end of a list or string. The last letter corresponds to -1, the second-to-last letter corresponds to -2, and so on.

This can simplify code by removing the need to manually compute lengths. To get everything but the last letter of a string, for instance, you can just do this:

>>> my_str[:-1]
'hell'

Finally, one of the most overlooked features of slicing is that you can also specify a third number – a "jump" of sorts. This is easiest to explain with an example:

>>> my_long_lst = ['apple', 'orange', 'blackcurrant', 'mango', 'pineapple', 'grapes', 'kiwi', 'papaya', 'coconut']
>>> my_long_lst[1:-1:2]
['orange', 'mango', 'grapes', 'papaya']

Let’s break down what’s happening above:

  • For clarity, we define a list with more elements than the original one we had.
  • In the list slice, the first two numbers are 1 and -1. As we saw above, this does away with the first and last elements of the object being sliced – my_long_lst, in this case.
  • Finally, we put a 2 as the final number after an additional colon. This tells Python that we want to slice the list from the start to the end index, but only keep every other item. Putting a 3 would give us every third item, putting a 4 would give us every fourth item, and so on.

Combining the two things above, we can also slice lists to get the elements backward:

>>> my_long_lst[-1:1:-2]
['coconut', 'kiwi', 'pineapple', 'blackcurrant']

# To slice backwards successfully, the "jump" value must be negative
# Otherwise, we just get an empty list
>>> my_long_lst[-1:1:2]
[]

And there you have it – everything you need to know about list slicing. When you get creative with the syntax described above, you can achieve some really cool behavior. For example, the following is one of the slickest ways to reverse a list in Python, courtesy of list slicing:

>>> my_lst
['apple', 'orange', 'blackcurrant', 'mango', 'pineapple']
>>> my_lst[::-1]
['pineapple', 'mango', 'blackcurrant', 'orange', 'apple']

Do you see how it works? As an exercise, you should review each feature of list slicing above and try to break down the code yourself. Hint: Take a look at what it means when we leave the start and end index blank.

Now then, let’s talk about why you should learn this stuff at all.

As a data scientist, why is this useful?

Just generally speaking, it’s important to consider readability and cleanliness of code when writing in Python. Using the features above will go a long way in helping you achieve this. As we already discussed, match statements have a couple of meaningful advantages over conditional statements in this regard. As for list slicing, it is much, much cleaner than trying to achieve the same behavior using some kind of convoluted loop.

But going beyond these broader benefits, let’s talk data science specifically for a moment.

Practically speaking, if you’re working as a data scientist, there is a fair probability that your formal training was not in computer science, but rather in something like statistics, mathematics, or even just data science itself if you were lucky enough to find such a program. In such programs, computer science is generally taught as a tool.

The focus is on learning the basic principles of programming in a way that teaches you enough to process data, run analyses, and build models at scale. As such, there is not a huge amount of time left to learn topics like "useful Python-specific syntactic features." Heck, such topics are often overlooked even in a pure computer science class.

However, using these features can take your code to the next level, helping you stand out among your brilliant colleagues and deliver better results to your clients. Match statements and object slicing are two powerful examples, but there are a host more that Python has to offer which I encourage you to explore.

May the code be ever in your favor – until next time, friends.



References

[1] https://docs.python.org/3.10/whatsnew/3.10.html#syntax-and-operations
[2] https://peps.python.org/pep-0622/#rationale-and-goals
[3] https://docs.python.org/3/c-api/slice.html

How to Use the T-Test and its Non-Parametric Counterpart https://towardsdatascience.com/how-to-use-the-t-test-and-its-non-parametric-counterpart-e88be9ec5102/ Sun, 17 Sep 2023 22:59:34 +0000 https://towardsdatascience.com/how-to-use-the-t-test-and-its-non-parametric-counterpart-e88be9ec5102/ Do you really understand this test you likely learned in high school?

As a data scientist, you would do well to understand some statistics. After all, it is one of the building blocks of the field.

This is the first article in a series which will attempt to give a concise, practical overview of different statistical tests and the situations in which they should be used. The information will strike a balance – not so long and technical as to be inaccessible, but not so short as to be useless.

In this first article, I’ll be talking about one of the more widely known statistical tests out there: the t-test. Even if you don’t quite know what it is, it’s likely you’ve heard the term thrown around. By the end of this article, you should understand how the test works and when you should use it.

If you’re generally unfamiliar with statistical test design, I highly recommend checking out the prefatory article to this series, A Primer on Foundational Concepts You Need to Start Running Statistical Tests.

Now then, let’s begin.

What is the t-test?

At its core, the t-test is used to compare data from two different samples. It takes two factors into account: the difference between the sample means and the variability of the sample data. Logically, this makes sense – simply comparing means is not sufficient to determine that your samples are actually different (means could very well differ with similar data sets, or be similar with very different data sets).

The t-test calculates a statistic known as the t-score, which takes both of the above factors into account. It can be calculated manually using a mildly annoying formula, but pretty much any statistical software will calculate it for you automatically. If you calculate the statistic manually, you can compare it to a critical t-value from preexisting tables (a t-score larger than the critical value corresponds to a p-value below the desired significance level for your test, indicating a statistically significant difference between samples). Once again, statistical software will generally just tell you whether or not the calculated t-score is statistically significant.

The process for conducting this test varies from software to software, but most are straightforward to pick up with the software you choose to learn. The details of these calculations are omitted here, as the purpose of this article is to teach you what the t-test is and in what situations you should use it.

In line with that, let’s consider a hypothetical experiment in which you might use a t-test. Imagine you are a user researcher for a sports equipment company that is developing a pair of shoes for basketball players. The company’s engineers have two designs and want to know if there is a meaningful difference in a player’s jumping ability depending on the design.

To test this, you obtain 40 basketball players (via a random sample), giving 20 of them one pair of shoes (Group A) and the other 20 the second pair of shoes (Group B). The assignments to the two groups are also randomized. You then measure the vertical of each player when wearing the shoes and record the data in two sets, separated by the type of shoe.

To determine if there is a statistically significant difference between the vertical measurements of the two groups, you could use a t-test.
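To make this concrete, here is a minimal sketch of how that analysis might look in Python using scipy. The jump measurements below are simulated purely for illustration – they are not real data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated vertical-jump measurements (inches) for 20 players per group
group_a = rng.normal(loc=28.0, scale=3.0, size=20)  # first shoe design
group_b = rng.normal(loc=30.0, scale=3.0, size=20)  # second shoe design

# ttest_ind computes the t-score and p-value for two independent samples
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

If the printed p-value falls below your chosen threshold (commonly .05), you would conclude that the difference in jumping ability between the two groups is statistically significant.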

I’ve heard there are two types of t-tests – is this true?

Ah, so you’re one step ahead of the game. Technically, the test we just talked about is called an independent-samples t-test. Why? Because the two means being compared are from different samples. This kind of t-test is generally used when our experiment uses a between-subjects design. Since the participants are different for each condition, the samples are considered independent.

This same experiment could also be conducted using a within-subjects design. In this case, we would only collect 20 participants, and we would measure each participant’s vertical while wearing each shoe. One advantage of this is that it eliminates individual differences that could provide a false result (What if, despite the random samples and assignment, the players in one group simply just happened to be able to jump higher?).

In any case, this structure renders the independent-samples t-test inappropriate; instead, we would use what is called a paired-samples t-test. This name comes from the fact that because each participant appears in both data sets, they are in a sense paired.
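Here is a sketch of how the paired version might look using scipy's ttest_rel; again, the numbers are simulated for illustration only:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated verticals for the SAME 20 players, once in each shoe design;
# ttest_rel expects each player's two measurements to share an index
shoes_a = rng.normal(loc=28.0, scale=3.0, size=20)
shoes_b = shoes_a + rng.normal(loc=1.0, scale=1.0, size=20)

t_stat, p_value = stats.ttest_rel(shoes_a, shoes_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")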

When you are choosing a statistical test to run for an experiment, be sure to take this into account.

But, what about those pesky assumptions?

Indeed, both versions of the t-test do come with a set of assumptions that must be met for the test to be valid:

  • The samples must be randomly selected.
  • The sample data must be of interval or ratio type (i.e., the data should be quantitative).
  • The populations from which the samples are drawn should be approximately normally distributed. This cannot be known for certain, but you can generally settle for an educated guess based on the sample distributions.
  • The two samples should have similar standard deviations (the spread of the samples shouldn’t be too different).

And so, what should you do if one or more requirements are not met? Luckily, this is precisely where nonparametric alternatives come in. Depending on which t-test you intended to use, you should now shift gears:

  1. The nonparametric version of the independent-samples t-test is known as the Mann-Whitney U-Test.
  2. The nonparametric version of the paired-samples t-test is known as the Wilcoxon Signed-Rank Test.

Although they are perhaps less known to the public, you should be able to find these tests in your statistical software of choice without too much trouble; the important thing is just that you know when to use them.
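Both are also available in scipy (assuming a reasonably recent version); here is a quick sketch with made-up sample numbers, just to show the calls:

from scipy import stats

# Hypothetical measurements -- illustrative numbers only
group_a = [27.5, 29.1, 26.8, 30.2, 28.4]
group_b = [30.5, 31.2, 29.8, 32.0, 30.9]

# Nonparametric counterpart of the independent-samples t-test
u_stat, p_u = stats.mannwhitneyu(group_a, group_b)
print(f"Mann-Whitney U = {u_stat}, p = {p_u:.3f}")

# Nonparametric counterpart of the paired-samples t-test
# (here treating the lists as repeated measures on the same five subjects)
w_stat, p_w = stats.wilcoxon(group_a, group_b)
print(f"Wilcoxon W = {w_stat}, p = {p_w:.3f}")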

Quick Recap and Final Thoughts

Let’s quickly run through some questions that can help you identify when you should use a t-test.

Does your experiment have one factor/independent variable (the t-test is not suitable for experiments with multiple factors)?

Are there two treatments/conditions that are being tested for the factor?

Assuming an affirmative answer to the above questions, does your experiment use a between-subjects or a within-subjects design? And does your experiment meet the conditions for using the parametric t-test?

  • Between-subjects and parametric: independent-samples t-test
  • Between-subjects and nonparametric: Mann-Whitney U-Test
  • Within-subjects and parametric: paired-samples t-test
  • Within-subjects and nonparametric: Wilcoxon signed-rank test

And there you have it! Follow this workflow in your next experiment, and you’ll be well on your way to mastering t-tests.

Happy testing!




A Primer on Foundational Concepts You Need to Start Running Statistical Tests https://towardsdatascience.com/a-primer-on-foundational-concepts-you-need-to-start-running-statistical-tests-ae6b6c79e9a4/ Sat, 09 Sep 2023 16:20:21 +0000 https://towardsdatascience.com/a-primer-on-foundational-concepts-you-need-to-start-running-statistical-tests-ae6b6c79e9a4/ Quantitative study design, significance testing, and different classes of statistical tests.

I came to write this article through what was a predictable yet still unexpected set of events. I recently finished a course on statistical testing and reporting, and I set out to write a series of articles explaining the details of the most useful statistical tests I learned. I wished to do this both to cement my own knowledge as well as help other data scientists learn a topic I found immensely helpful.

The first of these articles was going to be on the t-test, a common statistical test used to determine if two means (averages) from different sets of data are statistically different. I began to write this article, but I realized I needed to first explain that there are two different kinds of t-tests. Then, I realized that to explain that, I needed to explain a separate but related underlying concept. The cycle continued as I planned out the article.

Furthermore, I realized that I would need to do this with each new article I wrote, as every statistical test required the same underlying knowledge base. Rather than repeat this information in each article, it would be much better to reference one standing source of information.

And thus, this article was born. In the words that follow, I will attempt to give a concise but effective primer on the basic concepts you should be familiar with in order to conduct and report statistical tests. For your convenience, I have broken down the concepts in the order you would encounter them running a study from start to finish. So without further ado, let’s get into it.

Quantitative Study Design

When designing a study, there are several important details one needs to consider. This article is not about study design, and I won’t be going into the details of best practices and the reasoning behind them. That said, the design of a study strongly influences the eventual statistical test needed, and so it is essential to have a basic understanding of the following concepts:

  • Factors and measures
  • Levels and treatments
  • Between vs. Within subjects

Factors and Measures

While you may not have heard the terms "factor" and "measure" before, it is likely you encountered them back in high school science class under different names: "independent variable" and "dependent variable," respectively.

In a scientific experiment, a factor is a variable/condition that you actively manipulate or change in order to observe its effect on a different variable. The variable upon which you are observing an effect is the measure.

This is easier to see with an example. Let’s imagine we are conducting a fun experiment intended to determine if the type of meat a person consumes upon waking up can influence their 100-meter dash time later in the day. We have two groups of participants: Everyone in the first group receives a chicken breast, and everyone in the second group receives a steak. In the afternoon, the members of each group run a 100-meter dash and the respective times are recorded.

In this experiment, the factor is the type of meat, because that is what we are actively changing, and the measure is the 100-meter dash time, because that is the variable upon which we are attempting to observe some effect.

Levels and Treatments

These two terms are related to the factor in an experiment. The levels of a factor refer to the number of differing conditions it has within the study. The actual value or manifestation of the factor at each level is a treatment.

For example, in the experiment above, there are two levels, because we are testing out two different types of meat. The two treatments are chicken and beef. Were we to throw duck into the fray, then we would have three levels of the factor, with the third treatment being duck meat.

Between-Subjects and Within-Subjects Design

These last two are slightly more confusing, but incredibly important – whether a study uses a between-subjects or within-subjects design directly impacts the type of statistical test one can use for analysis.

Fundamentally, this aspect of study design has to do with how participants are split up across the different treatments of the factor(s) in a study.

In a between-subjects design, every participant is exposed to only one treatment, and in a within-subjects design, every participant is exposed to all the treatments. Said another way, a between-subjects design uses different sets of participants for each level of the independent variable, whereas a within-subjects design uses the same set of participants repeatedly.

For instance, consider a study in which we want to see if a new type of contact lens enables better performance on a vision test. We could give one group of participants the initial lens and another group the new lens, and compare their respective performances on the vision test (between-subjects design). Alternatively, we could have the same group of participants try out both lenses and compare the performances on the vision test for the same participants with different lenses (within-subjects design).

Note that a within-subjects design is not always possible. In the meat and running example above, assuming the experiment must be done in a single day (which may well be the case due to resource restrictions), a single person can only have one type of meat for breakfast, not both.

Finally, experiments with multiple factors can incorporate both between-subjects and within-subjects elements. Such an approach is known as a split-plot design. For example, say we want to evaluate performance on a mental health evaluation, and we have two factors: 1) year in college and 2) amount of daily screen time. We decide to conduct this experiment over the course of a year, giving the participants in each year (freshman, sophomore, etc.) no screen time restrictions for the first six months, and a 30-minute daily screen time restriction for the final six months. The mental health evaluation is given at the end of each six-month period.

In this experiment, the screen time is tested in a within-subjects manner (the same participants undergo both treatments), but the year in college is tested in a between-subjects manner (an individual cannot be in two years simultaneously). Note that this experiment is not intended as a model to follow (meticulous readers will notice that many confounding factors are possible), but rather as a simplified example to explain how a split-plot design might look.

With that, let us move forward.

Significance Testing

If you’ve even tangentially dealt with statistical tests before, it’s likely you’ve heard the phrase "statistically significant difference" before. Much of modern statistical testing (within the frequentist paradigm, at least, but we’ll leave that aside for now) lies in trying to determine if there is some meaningful difference among the different treatment groups in an experiment.

The terms in this section are all essential for understanding this idea. We’ll go through these a bit differently than above. First, I will define all the terms. Then, since they are all interrelated within an individual experiment, we’ll go through a single hypothetical experiment, emphasizing the role of each of these terms.

First things first: hypothesis testing. In a traditional statistical experiment, we begin with two hypotheses:

  • Null Hypothesis: This hypothesis states that there is no statistically significant difference among the treatment groups.
  • Alternative Hypothesis: This hypothesis states that there is a statistically significant difference among the treatment groups. It can be one sided (hypothesizing a difference in a particular direction, i.e. greater or less), or it can be two sided (simply hypothesizing a difference).

In all statistical tests, we start by assuming that the null hypothesis is true. Then, with that assumption, we calculate the likelihood of seeing our actual data. If the likelihood is very low (below a certain threshold – see below), then we determine that the null hypothesis must in fact be false, and we reject it.

Formally, this likelihood is known as the p-value. The p-value is the probability of seeing data at least as extreme as ours due to random chance, assuming the null hypothesis is true. Thus, if the p-value is very low (generally below .05, though this can vary among fields and experiments), we reject the null hypothesis, claiming a statistically significant difference in our results. This makes logical sense, as the low p-value indicates the probability of seeing that data under the null hypothesis is very low.

This is sufficient to get you started – if you’re interested in learning more, I recommend this primer specifically on p-values [2].
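To make this concrete, consider a hypothetical example: we flip a coin 100 times to test whether it is fair, and we observe 60 heads. Here is a minimal sketch of the corresponding significance test using scipy (this assumes scipy 1.7 or later, where binomtest is available):

from scipy import stats

# Null hypothesis: the coin is fair (probability of heads = 0.5)
# Observed data: 60 heads out of 100 flips
result = stats.binomtest(k=60, n=100, p=0.5, alternative="two-sided")
print(result.pvalue)  # roughly 0.057

Since a p-value of roughly 0.057 sits just barely above the usual .05 threshold, we would fail to reject the null hypothesis – the data is not quite surprising enough to conclude the coin is unfair.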

Two Classes of Tests

Finally, when you are dealing with statistical tests, you need to know whether you should use a parametric or a nonparametric test.

Parametric tests are the more widely known type of statistical test mostly because the more popular tests tend to be parametric. Parametric tests come with a set of requirements on various statistical parameters of the data. For example, all parametric tests require that the data come from a random sample. Other requirements vary from test to test, such as requiring a specific type of distribution.

Unfortunately, these requirements are not always met when dealing with data in the wild. Occasionally, introductory classes teach students to just use the test anyway for the class’s sake, and briefly mention the existence of alternative techniques beyond the scope of the class.

However, just using the test anyway is not appropriate in a real-world context where the parameters of the data do not conform to the necessary requirements. Nonparametric tests were designed precisely for this reason. These are statistical tests that do not require anything special of the data, and thus should be used in situations when the data does not behave [3].

For nearly every parametric statistical test, there is a corresponding nonparametric test. Thus, once all the elements of an experiment mentioned above (number of factors, treatments for each factor, etc.) have been taken into account, the final determination of what test one should use concerns whether a parametric or nonparametric test should be used.

At this point, it is natural to wonder why one might use parametric tests at all. While a detailed discussion of this is beyond the scope of this article, the high-level reason is simply that parametric tests provide more statistical power, and so they should be used whenever possible.

Recap and Final Thoughts

Here is a quick review of the foundational concepts you should understand if you’re looking to learn statistical testing:

  1. Quantitative study design. Understand what makes up an experiment, including factors, measures, treatments, and different participant designs (between subjects and within subjects).
  2. Significance Testing. Understand how to formulate the null hypothesis and the alternative hypothesis, and how to use the p-value.
  3. Types of Statistical Tests. Understand when to use a parametric test vs. a nonparametric one.

When you come to the analysis phase of a study, having all the elements above documented clearly with respect to your experiment is extremely helpful. The statistical test you need to use will be directly related to them. That said, it is always good to understand concepts before applying them, and I hope this article has assisted you in that goal.

Happy testing!



References

[1] Lazar, J., Feng, J.H. and Hochheiser, H. (2017). Research Methods in Human-Computer Interaction (2nd ed.). Cambridge, MA.
[2] https://towardsdatascience.com/how-to-understand-p-value-in-layman-terms-80a5cc206ec2
[3] Vaughan, L. (2001). Statistical Methods for the Information Professional. Medford, NJ: ASIS&T Press, pp. 139–155.

3 Intelligent Ways to Use ChatGPT to Speed Up Your Next Data Science Project https://towardsdatascience.com/3-intelligent-ways-to-use-chatgpt-to-speed-up-your-next-data-science-project-8c3e1c5aea18/ Fri, 25 Aug 2023 17:50:59 +0000 https://towardsdatascience.com/3-intelligent-ways-to-use-chatgpt-to-speed-up-your-next-data-science-project-8c3e1c5aea18/ Rather than considering ChatGPT a threat, view it as your personal assistant and mentor.

The growing popularity of ChatGPT over the past year has led to much speculation regarding what such a tool entails for the workforce. Many believe it indicates automation of many jobs performed by humans, especially programming ones.

I prefer a different perspective. While ChatGPT is powerful, it’s not omnipotent. If you work in tech, it cannot replace your job, but it can very much help you be better at it.

In fact, ChatGPT can be a particularly beneficial tool for data scientists, because data folks aren’t necessarily trained in the "polished" aspects of writing code. When training to be data scientists, people often learn how to use code as a statistical and sociological tool without being initiated into rigorous computer science practices.

ChatGPT can help you fill these gaps – making it easier to do your job and teaching you new skills along the way. Let’s take a look at a few ways it might do so.

What does this error mean?

I have been programming for close to six years now. Accordingly, for six long years, I have been continually frustrated by the lack of specificity and detail in the error messages I get when my code breaks.

I used to find solace in random Google searches. With any luck, some programming guru on Stack Overflow would have a detailed explanation regarding my error and solid advice on how to fix it. However, this was always hit or miss, never a guarantee.

Since ChatGPT came into existence, I have found it extremely useful for understanding error messages from code. Its responses facilitate a number of things, all of which are conducive to improving one’s programming skills:

  • It provides a detailed description of what the error means.
  • It gives detailed guidance on how to fix the error.
  • It occasionally even provides general guidance on how to approach the larger class of errors that your error falls under.

One could argue that this information is also technically available via Google. While this is true, having it consolidated for your review and understanding in a single location makes a significant difference in terms of impact.

It’s best to illustrate this by example.

Recently, I ran into the following error while trying to insert some data into a PostgreSQL database via Python’s psycopg2 package:

WARN: <class 'psycopg2.errors.StringDataRightTruncation'>
Traceback (most recent call last):
  File "<redacted>", line 363, in insert
    cur.execute(self._insert_sql(table, set(keyval.keys())), keyval)
psycopg2.errors.StringDataRightTruncation: value too long for type character varying(32)

Typing the error into Google led to a few different results, and upon choosing the second one (Stack Overflow), I found someone with a similar issue who received the following top-rated answer:

_"Given the error: value too long for type character varying(200) you should look for model fields that have a max_length of 200. Since you have multiple fields with a max_length set to 200, you need to determine which model and field are throwing the error. Check the stacktrace, run a debugger and/or insert some debugging print(instance.__dict__)s. Once you find the culprit, extend that field’s max_length to something larger or turn it into a TextField."_

Typing the same error into ChatGPT and asking if it could assist me, I received the following response:

"Certainly! The error message you provided indicates that there was an issue with inserting data into a PostgreSQL database using the psycopg2 Python library. Let’s break down the error message step by step:

  1. WARN: <class 'psycopg2.errors.StringDataRightTruncation'>: This part indicates that it’s a warning rather than a critical error. The warning is related to StringDataRightTruncation, which means that some of the data being inserted into the database is being truncated because it exceeds the allowed length.
  2. Traceback (most recent call last): This is the Python traceback that shows the sequence of function calls leading up to the error. It helps in understanding the context of the error.
  3. File "<redacted>", line 363, in insert: This line tells you which file and which specific line of code caused the error. The issue lies in the file <redacted>.py on line 363.
  4. cur.execute(self._insert_sql(table, set(keyval.keys())), keyval): The error occurred inside the execute method of a cursor (cur) object. It seems that this is where the data insertion is taking place.
  5. psycopg2.errors.StringDataRightTruncation: value too long for type character varying(32): This is the actual error message. It tells you that the value you are trying to insert into a column of type character varying(32) is too long and cannot fit within the specified length constraint of 32 characters.

In summary, the error occurred while trying to insert data into the database table, and one of the values being inserted exceeds the maximum allowed length for the corresponding column, which is set to 32 characters. To resolve the issue, you can either truncate the data to fit within the 32-character limit or modify the database schema to allow for longer strings in that particular column."

The ChatGPT response is much more detailed, and it breaks down the error step by step. Especially if you’re new to a language or aren’t formally trained in programming/software engineering (i.e., many data scientists), this level of detail is invaluable. Beyond just telling you what the error means, it explains to you how to parse and interpret it in depth, to the point where you may not even need to resort to external assistance in the future.

Automating "mindless" tasks

Let’s start with a definition of the word mindless here. When programming, there are certain tasks, such as writing an algorithm to update a file’s contents or designing a web application for your employer from scratch, that require a fair bit of active problem solving. These are not the types of problems I recommend handing off to ChatGPT. If you’re expecting ChatGPT to just code everything for you, 1) you’re in for a rude awakening and 2) you’re missing the opportunity to use it as an effective tool to improve your programming.

On the other hand, there are certain tasks that simply must be done when programming that don’t require much mental effort but still eat up precious time:

  • Generating basic HTML templates for a web application
  • Cleaning up code to meet style guidelines
  • Finding a minor bug in a file with hundreds of lines of code

The list goes on. In these situations, ChatGPT can save you much time and several headaches.

Once again, I will present my argument by way of example. Recently, while writing an application in Flask, I realized I needed to write a new HTML file to serve as a template for one of my pages. I then sighed in exasperation, because writing such templates is an annoying task prone to occasional little bugs. However, I then wondered if I could just ask ChatGPT to help:

"I want a simple HTML template for my Flask app that takes in a list of strings called "filepaths," and renders a page with each string in this list displayed in a series of bullet points. I want the page to be titled "List of Files." Can you generate such an HTML file for me?"

"Sure, I can help you create a simple HTML template for your Flask app. Below is the HTML code that you can use:


<!DOCTYPE html>
<html>
<head>
    <title>List of Files</title>
</head>
<body>
    <h1>List of Files</h1>
    <ul>
    {% for filepath in filepaths %}
        <li>{{ filepath }}</li>
    {% endfor %}
    </ul>
</body>
</html>

Save this code in a file named `files.html` or any other appropriate name. This HTML template uses the Jinja templating engine syntax, which is integrated with Flask. The `{% for filepath in filepaths %}` block iterates through the `filepaths` list, and for each string in the list, it generates a list item (`<li>`) with the string displayed.

In your Flask app, you can use this template with the following code:

from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def list_files():
    filepaths = [
        "File 1",
        "File 2",
        "File 3",
        # Add more files to this list as needed
    ]
    return render_template('files.html', filepaths=filepaths)

if __name__ == '__main__':
    app.run()

Make sure that the `files.html` file is placed in a folder named `templates` within your Flask app directory. When you run the Flask app and navigate to the specified route (e.g., http://localhost:5000/), it will render the `files.html` template with the list of files displayed as bullet points. You can modify the `filepaths` list in the `list_files` function with the desired data you want to display on the page."

There is much to discuss, but let’s start with an important point: This has simply gone beyond Google’s capabilities. You might be able to use a traditional search engine to procure tips and even some general outlines, but you won’t get the exact file you want conveniently generated for you.

Secondly, ChatGPT does much more than just give you the file:

  • It breaks down the file’s syntax, underscoring the functionality of important lines. This information is important, and you should read it. View ChatGPT as a way to improve your programming and learn, not as a worker who does your job for you while you sit back blindly.
  • It explains how to use the template correctly in your Flask application.
  • It mentions where the file should be placed to ensure proper access.

This is helpful for any programmer, but particularly so for a data scientist. Most Data Science training involves using code as a tool for statistical analysis, model building, and so on. While you might be well versed in pandas and scikit-learn, there is a fair chance that the intricacies of building web applications – if we are to take the example above – are not your specialty. Yet, you will likely still encounter it sometime in your career.

And that’s where ChatGPT can be incredibly helpful.

Finding a quick solution to a localized problem

"Hello, I need to build an end-to-end system that has full authentication capabilities, with a back-end in MySQL and a front-end written using Django. I want it to accept financial data from users and make predictions about future success. Can you write this whole thing for me? Thanks."

Wouldn’t that be nice? But as I have repeatedly mentioned, ChatGPT is not something you can use to just write an entire system. Asked the above, this is its response (I’ve removed the details to keep things concise):

"Hello! I’d be glad to help you outline the process of building such a system, but creating an entire system like this requires a significant amount of work and coding. I can guide you through the general steps and components you’ll need to build it. Let’s break down the process:

1. Project Setup:

  • Set up your development environment with the necessary tools and software, including Python, Django, MySQL, and any other dependencies you might need.

2. Database Design:

  • Design …

3. Authentication:

  • Use Django’s built-in authentication system …

4. Front-End Development:

  • Create the front-end interface using …

5. Back-End Development:

  • Implement the back-end using …

6. Data Submission:

  • Set up forms or API endpoints for users …

7. Prediction Model:

  • Develop or integrate a machine learning model for …

8. Data Processing and Prediction:

  • When users submit their financial data, use …

9. Security and Privacy:

  • Implement proper security measures, including …

10. Testing:

  • Thoroughly test your application for …

11. Deployment:

  • Deploy your application to a production environment …

12. Monitoring and Maintenance:

  • Set up monitoring tools to …

Remember that building a system of this complexity requires strong development skills in both back-end (Django, database) and front-end (HTML/CSS, JavaScript) technologies, as well as knowledge of machine learning for predictions. It’s also important to keep security in mind throughout the development process. If you’re not experienced in these areas, you might want to consider working with a development team or seeking professional assistance."

Not too much more helpful than a Google search, in my opinion.

On the other hand, ChatGPT is great at helping solve localized problems.

For instance, while writing some code the other day, I had to figure out how to take a range of numbers and convert them into a string containing all the numbers separated by commas. I attempted to do this myself first, with the following line of code:

 ",".join(list(range(9)))

However, this gave me an error because apparently the join function can only handle strings, not integers. When I asked ChatGPT to help resolve my problem, it provided the following solution:

"You’re correct that the join method expects a sequence of strings, not integers. To fix the error, you can convert each integer in the list to a string before using the join method. Here’s the corrected code:

",".join(str(i) for i in range(9))

In this code, the str(i) converts each integer i to its string representation, and the generator expression (str(i) for i in range(9)) creates an iterable of strings. Then, the ",".join(...) call combines these strings using commas as separators. This will produce the output "0,1,2,3,4,5,6,7,8"."

This solved my problem, and it also taught me something new. When I first reviewed the code, I thought it was using a list comprehension (which would have been my first instinct), but I noticed there were no brackets. Looking at the initial output, we can see ChatGPT mentions using a generator expression. When I asked ChatGPT to explain further, it broke down the details of the generator expression and explained how they work in general.

Thus, throughout this process, I quickly gained working code for my issue as well as learned to use a new feature in Python (generator expressions) which I had never previously encountered.
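For the curious, here is a small sketch of the distinction (my own example, not part of the ChatGPT exchange):

# A generator expression looks like a list comprehension, minus the brackets
squares_list = [n * n for n in range(10)]  # builds the entire list in memory
squares_gen = (n * n for n in range(10))   # produces each value lazily, on demand

print(sum(squares_list))  # 285
print(sum(squares_gen))   # 285 -- same result, with no intermediate list

# One caveat: a generator can only be consumed once
print(sum(squares_gen))   # 0, because the generator is already exhausted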

Recap + Final Thoughts

Here’s a small cheat sheet of ways you can use ChatGPT to improve your data science workflow:

  1. Understanding errors. ChatGPT can conveniently describe your error in layman’s terms for you, as well as provide guidance on how to fix it.
  2. Automating busywork. Many necessary programming tasks don’t require active problem solving but still take up time. Hand them off to ChatGPT.
  3. Solving small problems. If you need a quick and efficient way to accomplish a small task, try asking ChatGPT. You might just learn something new.

While the title of this article and many of its points focus on the time ChatGPT can help you gain back, the benefits of using it effectively really are twofold. The examples above are all drawn from my personal experiences, and in nearly every case I learned something new about programming.

Technology is advancing, and we would do well to advance with it. Instead of feeling threatened by ChatGPT, learn to use it to your advantage.

You might just be surprised at what you accomplish.


4 Important Statistical Ideas You Should Understand in a Data-Driven World https://towardsdatascience.com/4-important-statistical-ideas-you-should-understand-in-a-data-driven-world-3a9d59ee4a85/ Sun, 02 Jul 2023 15:06:29 +0000 https://towardsdatascience.com/4-important-statistical-ideas-you-should-understand-in-a-data-driven-world-3a9d59ee4a85/ There is no use avoiding reality. Data science, and more broadly, data-driven structures, are at the center of the society we are currently building. When the computer science craze first hit in the early 2000s, many noted that computer science would become an integral part of every field. This proved to be true. Companies across […]


There is no use avoiding reality. Data science, and more broadly, data-driven structures, are at the center of the society we are currently building.

When the computer science craze first hit in the early 2000s, many noted that computer science would become an integral part of every field. This proved to be true. Companies across industries – healthcare, engineering, finance, etc. – began to hire software engineers for various forms of work. Students of these fields began to learn how to code.

I would argue the new data science surge takes this a step further. With computer science, one could get away with just hiring software engineers. A business manager or a sales expert did not necessarily need to understand what these folks did.

But data science is broader and more encompassing. Since it is a mix of fields [1], its ideas are relevant even for those who may not be day-to-day data scientists.

In this article, I’ll give a high-level overview of four important statistical ideas that everyone should understand, regardless of official job title. Whether you’re a project manager, recruiter, or even a CEO, some level of familiarity with these concepts is sure to help you in your work. Furthermore, outside of a work context, familiarity with these concepts will give you a sense of data literacy that is indispensable for navigating modern society.

Let’s get into it.

Just a big, bad sample

Back as an undergraduate, the first data science course I took consisted of an immense number of students – nearly 2000. The course, Foundations of Data Science, was one of the most popular on campus, as it was designed to be accessible to students across departments. Rather than immediately getting into advanced mathematics and programming, it focused on high-level ideas which could impact students across fields.

During one of our early lectures, the professor made a statement that has stuck with me through the years, coming back whenever I work on anything even remotely data related. She was discussing random sampling, a broad term which has to do with choosing a subset of a study population in a way that represents the entire population. The idea is that studying the subset should enable one to draw conclusions about the entire population.

She pointed out that having a good sample was of the utmost importance, since no amount of mathematical finagling and fancy techniques could make up for a subset that isn’t actually representative of the population one wishes to emulate. In making this point, she mentioned that many people assume that if a starting sample is bad, then a reasonable solution is to stick with the same approach, but collect a larger sample.

"Then, you’ll just have a really big, really bad sample," she said to the giant lecture hall full of college students.

Understanding this foundational point – and its broader implications – will enable you to make sense of many sociopolitical phenomena that folks take for granted. Why are presidential polls often inaccurate? What makes a seemingly powerful machine learning model fail in the real world? Why do some companies make products that never see the light of day?

Often, the answer lies in the sample.

"Error" does not mean "mistake"

This topic is implicit in most courses involving data or statistics, but my discussion here is inspired by Alberto Cairo’s emphasis on this point in his excellent book, How Charts Lie.

The premise of Cairo’s book is to outline the various ways in which data visualizations can be used to deceive people, both unintentionally and maliciously. In one chapter, Cairo expounds upon the challenges of visualizing uncertainty in data, and how this in itself can lead to misleading data visualizations.

He opens with some discussion on the idea of error in statistics. He makes note of a crucial point: While in standard English, the term "error" is synonymous with "mistake," this is not the case at all within the realm of statistics.

The concept of statistical error has to do with uncertainty. There will almost always be some form of error in measurements and models. This is related to the earlier point about samples: Because you don’t have every data point for a population you wish to describe, you will by definition face uncertainty. This is further accentuated if you are making predictions about future data points, since they do not exist yet.

Minimizing and addressing uncertainty is an essential part of statistics and data science, but it is far beyond the scope of this article. Here, the primary point you should internalize is that just because a statistical finding is given to you with a measure of uncertainty does not mean it is mistaken. In fact, this is likely an indicator that whoever produced the findings knew what they were doing (you should be skeptical of statistical claims made without any reference to the level of uncertainty).

Learn the right way to interpret uncertainty in statistical claims [2], rather than writing them off as incorrect. It’s an essential distinction.

You can’t always just "make a model for it"

Among the general population, there seems to be this idea that artificial intelligence is some kind of magical tool that can accomplish anything. With the advent of self-driving cars and realistic virtual assistants but no similar acceleration in general data literacy, it is unsurprising that this mindset has developed.

Unfortunately, it couldn’t be further from the truth. AI is not magic. It is heavily dependent on good data, and its results can actually be quite misleading if the underlying data is of poor quality.

I once had a colleague who was assigned to a project in which her task was to build a machine learning model for a specific goal. It was meant to classify future events into certain categories based on historical data.

There was just one problem: She didn’t have any data. Others on the project (who, notably, were not familiar with data science) kept insisting that she should just make the model even though she didn’t have the data, because machine learning is super powerful and this should be doable. They didn’t grasp that their request simply wasn’t feasible.

Yes, machine learning is powerful, and yes, we’re getting better at doing cooler and better tasks with it. However, as things stand, it’s not just a magic solution for everything. You would do well to remember that.

The Numbers Do Lie

People throw around the phrase "numbers don’t lie" like it’s confetti.

Oh, if only they knew. Numbers do in fact lie. A lot. In some settings, even more often than they tell the truth. But they do not lie because they are actually wrong in raw form; they lie because the average person does not know how to interpret them.

There are countless examples of how numbers can be twisted, manipulated, changed, and transformed in order to support the argument one wants to make. To drive the point home, here I’ll cover one example of how this can be done: failing to take into account underlying population distributions when making blanket statements.

That’s a bit vague on its own, so let’s take a look at an example. Consider the following scenario, often posed to medical students:

Suppose a certain disease affects 1 out of every 1000 people in a population. There is a test to check if a person has this disease. The test does not produce false negatives (that is, anyone who has the disease will test positive), but the false positive rate is 5% (there is a 5% chance that a person will test positive even if they do not have the disease). Suppose a randomly selected person from the population takes the test and tests positive. What is the likelihood that they actually have the disease?

At a glance, a reasonable answer, given by many folks, is 95%. Some might even go so far as to suspect that it isn’t quite mathematically accurate to just use the false positive rate to make this determination, but they’d probably still guess that the answer is somewhere close.

Unfortunately, the correct answer is not 95%, or anywhere near it. The actual probability that this randomly selected person has the disease is approximately 2%.

The reason most people are so far off from the correct answer is because while they pay attention to the low false positive rate, they fail to take into account the underlying prevalence of the disease within the population: Only 1/1000 (or 0.1%) of people in the population actually have this disease. As a result, that false positive rate of 5% actually ends up impacting many individuals because so few of them have the disease to begin with. In other words, there are many, many opportunities to be a false positive.
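Before moving on, here is a quick back-of-the-envelope check of that 2% figure – a sketch of the arithmetic (Bayes’ rule) using the numbers from the scenario above:

prevalence = 0.001          # 1 in 1000 people have the disease
sensitivity = 1.0           # no false negatives
false_positive_rate = 0.05  # 5% of healthy people test positive anyway

# Total probability that a randomly selected person tests positive
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Bayes' rule: P(disease | positive test)
p_disease_given_positive = (sensitivity * prevalence) / p_positive
print(f"{p_disease_given_positive:.1%}")  # roughly 2.0%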

The formal math behind this is beyond the scope of this particular article, but you can see a detailed explanation here if you’re interested [3]. That said, you don’t really need to dive into the math to grasp the main point: One could imagine using the scenario above to scare a person into believing that they are much more at risk for a disease than they really are. Numbers alone can often be misrepresented and/or misinterpreted to promote false beliefs.

Be vigilant.

Final Thoughts and Recap

Here’s a little cheat sheet of important takeaways from this article:

  1. A big sample ≠ A good sample. It takes more than quantity to ensure accurate representation of a population.
  2. In statistics, "error" does not mean "mistake." It has to do with uncertainty, which is an unavoidable element of statistical work.
  3. Machine learning and artificial intelligence aren’t magic. They rely heavily on the quality of the underlying data.
  4. Numbers can be misleading. When someone makes a statistical claim, especially in a non-academic (read: in the news) context, review it carefully before accepting the conclusions.

You don’t have to be an expert in statistics to navigate this data-driven world, but it would do you well to understand some foundational ideas and know what pitfalls to avoid. It is my hope that this article helped you take that first step.

Until next time.



References

[1] https://towardsdatascience.com/the-three-building-blocks-of-data-science-2923dc8c2d78
[2] https://bookdown.org/jgscott/DSGI/statistical-uncertainty.html
[3] https://courses.lumenlearning.com/waymakermath4libarts/chapter/bayes-theorem/
