Generative AI and Civic Institutions https://towardsdatascience.com/generative-ai-and-civic-institutions/ Mon, 03 Mar 2025 23:57:58 +0000 Should human obsolescence be our goal?

Different sectors, different goals

Recent events have got me thinking about AI as it relates to our civic institutions — think government, education, public libraries, and so on. We often forget that civic and governmental organizations are inherently deeply different from private companies and profit-making enterprises. They exist to enable people to live their best lives, protect people’s rights, and make opportunities accessible, even if (especially if) this work doesn’t have immediate monetary returns. The public library is an example I often think about, as I come from a library-loving and defending family — their goal is to provide books, cultural materials, social supports, community engagement, and a love of reading to the entire community, regardless of ability to pay.

In the private sector, efficiency is an optimization goal because any dollar spent on providing a product or service to customers is a dollar taken away from the profits. The (simplified) goal is to spend the bare minimum possible to run your business, with the maximum amount returned to you or the shareholders in profit form. In the civic space, on the other hand, efficiency is only a meaningful goal insomuch as it enables higher effectiveness — more of the service the institution provides getting to more constituents.

In the civic space, efficiency is only a meaningful goal insomuch as it enables higher effectiveness — more of the service the institution provides getting to more constituents.

So, if you’re at the library, and you could use an AI chatbot to answer patron questions online instead of assigning a librarian to do that, that librarian could be helping in-person patrons, developing educational curricula, supporting community services, or many other things. That’s a general efficiency that could make for higher effectiveness of the library as an institution. Moving from card catalogs to digital catalogs is a prime example of this kind of efficiency-to-effectiveness pipeline, because you can find out from your couch whether the book you want is in stock using search keywords instead of flipping through hundreds of notecards in a cabinet drawer like we did when I was a kid.

However, we can pivot too hard in the direction of efficiency and lose sight of the end goal of effectiveness. If, for example, your online librarian chat is often used by schoolchildren at home to get homework help, replacing them with an AI chatbot could be a disaster — after getting incorrect information from such a bot and getting a bad grade at school, a child might be turned off from patronizing the library or seeking help there for a long time, or forever. So, it’s important to deploy generative AI solutions only when they are well thought out and purposeful, not just because the media is telling us that “AI is neat.” (Eagle-eyed readers will know that this is essentially the same advice I’ve given in the past about deploying AI in businesses.)

As a result, what we thought was a gain in efficiency leading to net higher effectiveness actually could diminish the number of lifelong patrons and library visitors, which would mean a loss of effectiveness for the library. Sometimes unintended effects from attempts to improve efficiency can diminish our ability to provide a universal service. That is, there may be a tradeoff between making every single dollar stretch as far as it can possibly go and providing reliable, comprehensive services to all the constituents of your institution.

Sometimes unintended effects from attempts to improve efficiency can diminish our ability to provide a universal service.

AI for efficiency

It’s worth it to take a closer look at this concept — AI as a driver of efficiency. Broadly speaking, the theory we hear often is that incorporating generative AI more into our workplaces and organizations can increase productivity. Framing it at the most Econ 101 level: using AI, more work can be completed by fewer people in the same amount of time, right?

Let’s challenge some aspects of this idea. AI is useful to complete certain tasks but is sadly inadequate for others. (As our imaginary schoolchild library patron learned, an LLM is not a reliable source of facts, and should not be treated like one.) So, AI’s ability to increase the volume of work being done with fewer people (efficiency) is limited by what kind of work we need to complete.

If our chat interface is only used for simple questions like “What are the library’s hours on Memorial Day?” we can hook up a RAG (Retrieval Augmented Generation) system with an LLM and make that quite useful. But outside of the limited bounds of what information we can provide to the LLM, we should probably set guardrails and make the model refuse to answer, to avoid giving out false information to patrons.
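
To make the guardrail idea a bit more concrete, here is a minimal sketch of how that refusal logic might look. Everything in it is a hypothetical stand-in: the library_faq lookup, the retrieve and answer_patron helpers, and the relevance cutoff are illustrations rather than a real library system or a production RAG pipeline, and the LLM call itself is left as a comment.

# Hypothetical sketch: refuse when retrieval finds nothing relevant,
# rather than letting a model guess and mislead a patron.

library_faq = {
    "memorial day hours": "The library is closed on Memorial Day.",
    "holiday hours": "Holiday hours are posted at the front desk and on the website.",
}

def retrieve(question: str) -> str | None:
    """Return the best-matching FAQ passage, or None if nothing matches well."""
    words = set(question.lower().rstrip("?").split())
    best_passage, best_overlap = None, 0
    for key, passage in library_faq.items():
        overlap = len(words & set(key.split()))
        if overlap > best_overlap:
            best_passage, best_overlap = passage, overlap
    return best_passage if best_overlap >= 2 else None  # crude relevance cutoff

def answer_patron(question: str) -> str:
    passage = retrieve(question)
    if passage is None:
        # The guardrail: decline instead of letting a model fabricate an answer.
        return "I'm not able to answer that reliably. Please contact the reference desk."
    # In a real system, the retrieved passage would be handed to an LLM here as
    # grounding context; for this sketch we just return it directly.
    return passage

print(answer_patron("What are the library's hours on Memorial Day?"))
print(answer_patron("Can you help me with my homework?"))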

So, let’s play that out. We have a chatbot that does a very limited job, but does it well. The librarian who was on chatbot duty may now see some reduction in the work required of them, but there will still be a subset of questions that require their help. We have some choices: put the librarian on chatbot duty for a reduced number of hours a week, and hope the questions come in while they’re on? Tell people to just call the reference desk or send an email if the chatbot refuses to answer them? Hope that people come into the library in person to ask their questions?

I suspect the likeliest option is actually “the patron will seek their answer elsewhere, perhaps from another LLM like ChatGPT, Claude, or Gemini.” Once again, we’ve ended up in a situation where the library loses patronage because their offering wasn’t meeting the needs of the patron. And to boot, the patron may have gotten another wrong answer somewhere else, for all we know.

I am spinning out this long example just to illustrate that efficiency and effectiveness in the civic environment can have a lot more push and pull than we would initially assume. It’s not to say that AI isn’t useful to help civic organizations stretch their capabilities to serve the public, of course! But just like with any application of generative AI, we need to be very careful to think about what we’re doing, what our goals are, and whether those two are compatible.

Conversion of labor

Now, this has been a very simplistic example, and eventually we could hook up the whole encyclopedia to that chatbot RAG or something, of course, and try to make it work. In fact, I think we can and should continue developing more ways to chain together AI models to expand the scope of valuable work they can do, including making different specific models for different responsibilities. However, this development is itself work. It’s not really just a matter of “people do work” or “models do work”, but instead it’s “people do work building AI” or “people do work providing services to people”. There’s a calculation to be made to determine when it would be more efficient to do the targeted work itself, and when AI is the right way to go.

Working on the AI has an advantage in that it will hopefully render the task reproducible, so it will lead to efficiency, but let’s remember that AI engineering is vastly different from the work of the reference librarian. We’re not interchanging the same workers, tasks, or skill sets here, and in our contemporary economy, the AI engineer’s time costs a heck of a lot more. So if we did want to measure this efficiency all in dollars and cents, the same amount of time spent working at the reference desk and doing the chat service will be much cheaper than paying an AI engineer to develop a better agentic AI for the use case. Given a bit of time, we could calculate out how many hours, days, years of work as a reference librarian we’d need to save with this chatbot to make it worth building, but often that calculation isn’t done before we move towards AI solutions.
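
For what it’s worth, the shape of that calculation is simple. Every number below is a made-up assumption for illustration only; the hourly costs and build time are hypothetical, not figures from any real library or engineering team.

# Hypothetical break-even math: how many hours of reference-desk chat duty
# would the chatbot need to replace before it pays for its own development?

librarian_hourly_cost = 35      # assumed fully loaded cost of librarian time, in dollars
engineer_hourly_cost = 150      # assumed fully loaded cost of AI engineering time, in dollars
build_and_maintain_hours = 400  # assumed engineering hours for year one

build_cost = engineer_hourly_cost * build_and_maintain_hours
breakeven_hours = build_cost / librarian_hourly_cost

print(f"Chatbot cost (year one): ${build_cost:,.0f}")
print(f"Librarian chat hours saved to break even: {breakeven_hours:,.0f}")
# With these made-up numbers: $60,000 and roughly 1,714 hours of chat duty.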

We need to interrogate the assumption that incorporating generative AI in any given scenario is a guaranteed net gain in efficiency.

Externalities

While we’re on this topic of weighing whether the AI solution is worth doing in a particular situation, we should remember that developing and using AI for tasks does not happen in a vacuum. It has some cost environmentally and economically when we choose to use a generative AI tool, even when it’s a single prompt and a single response. Consider that the newly released GPT-4.5 has increased prices 30x for input tokens ($2.50 per million to $75 per million) and 15x for output tokens ($10 per million to $150 per million) just since GPT-4o. And that isn’t even taking into account the water consumed cooling data centers (3 bottles per 100-word output for GPT-4), the electricity use, or the rare earth minerals used in GPUs. Many civic institutions have improving the world around them and the lives of the citizens of their communities as a macro-level goal, and concern for the environment has to have a place in that. Should organizations whose purpose is to have a positive impact weigh the possibility of incorporating AI more carefully? I think so.

Plus, I don’t often get too much into this, but I think we should take a moment to consider some folks’ end game for incorporating AI — reducing staffing altogether. Instead of making our existing dollars in an institution go farther, some people’s idea is just reducing the number of dollars and redistributing those dollars somewhere else. This brings up many questions, naturally, about where those dollars will go instead and whether they will be used to advance the interests of the community residents some other way, but let’s set that aside for now. My concern is for the people who might lose their jobs under this administrative model.

For-profit companies hire and fire employees all the time, and their priorities and objectives are focused on profit, so this is not particularly hypocritical or inconsistent. But as I noted above, civic organizations have objectives around improving the community or communities in which they exist. In a very real way, they are advancing that goal when part of what they provide is economic opportunity to their workers. We live in a society where working is the overwhelmingly predominant way people provide for themselves and their families, and giving jobs to people in the community and supporting the economic well-being of the community is a role that civic institutions do play.

[R]educing staffing is not an unqualified good for civic organizations and government, but instead must be balanced critically against whatever other use the money that was paying their salaries will go to.

At the bare minimum, this means that reducing staffing is not an unqualified good for civic organizations and government, but instead must be balanced critically against whatever other use the money that was paying their salaries will go to. It’s not impossible for reducing staff to be the right decision, but we have to bluntly acknowledge that when members of communities experience joblessness, that effect cascades. They are now no longer able to patronize the shops and services they would have been supporting with their money, the tax base may be reduced, and this negatively affects the whole collective.

Workers aren’t just workers; they’re also patrons, customers, and participants in all aspects of the community. When we think of civic workers as simply money pits to be replaced with AI or whose cost for labor we need to minimize, we lose sight of the reasons for the work to be done in the first place.

Conclusion

I hope this discussion has brought some clarity about how really difficult it is to decide if, when, and how to apply generative AI to the civic space. It’s not nearly as simple a thought process as it might be in the for-profit sphere because the purpose and core meaning of civic institutions are completely different. Those of us who do machine learning and build AI solutions in the private sector might think, “Oh, I can see a way to use this in government,” but we have to recognize and appreciate the complex contextual implications that might have.

Next month, I’ll be bringing you a discussion of how social science research is incorporating generative AI, which has some very intriguing aspects.

As you may have heard, Towards Data Science has moved to an independent platform, but I will continue to post my work on my Medium page, my personal website, and the new TDS platform, so you’ll be able to find me wherever you happen to go. Subscribe to my newsletter on Medium if you’d like to ensure you get every article in your inbox.

Find more of my work at www.stephaniekirmer.com.

Further reading

“It’s a lemon” – OpenAI’s largest AI model ever arrives to mixed reviews: GPT-4.5 offers marginal gains in capability and poor coding performance despite 30x the cost. arstechnica.com

Using GPT-4 to generate 100 words consumes up to 3 bottles of water: New research shows generative AI consumes a lot of water – up to 1,408ml to generate 100 words of text. www.tomshardware.com

Environmental Implications of the AI Boom: The digital world can’t exist without the natural resources to run it. What are the costs of the tech we’re using… towardsdatascience.com

Economics of Generative AI: What’s the business model for generative AI, given what we know today about the technology and the market? towardsdatascience.com

The Cultural Backlash Against Generative AI https://towardsdatascience.com/the-cultural-backlash-against-generative-ai-30372d3b9080/ Sat, 01 Feb 2025 23:49:07 +0000 https://towardsdatascience.com/the-cultural-backlash-against-generative-ai-30372d3b9080/ What's making many people resent generative AI, and what impact does that have on the companies responsible?

Photo by Joshua Hoehne on Unsplash

The recent reveal of DeepSeek-R1, the large-scale LLM developed by a Chinese company (also named DeepSeek), has been a very interesting event for those of us who spend time observing and analyzing the cultural and social phenomena around AI. Evidence suggests that R1 was trained for a fraction of the price that it cost to train ChatGPT (any of their recent models, really), and there are a few reasons that might be true. But that’s not really what I want to talk about here – tons of thoughtful writers have commented on what DeepSeek-R1 is, and what really happened in the training process.

What I’m more interested in at the moment is how this news shifted some of the momentum in the AI space. Nvidia and other related stocks dropped precipitously when the news of DeepSeek-R1 came out, largely (it seems) because it didn’t require the newest GPUs to train, and by training more efficiently, it required less power than an OpenAI model. I had already been thinking about the cultural backlash that Big Generative AI was facing, and something like this opens up even more space for people to be critical of the practices and promises of generative AI companies.

Where are we in terms of the critical voices against generative AI as a business or as a technology? Where is that coming from, and why might it be occurring?

Schools of Thought

The two often overlapping angles of criticism that I think are most interesting are first, the social or community good perspective, and second, the practical perspective. From a social good perspective, critiques of generative AI as a business and an industry are myriad, and I’ve talked a lot about them in my writing here. Making generative AI into something ubiquitous comes at extraordinary costs, from the environmental to the economic and beyond.

As a practical matter, it might be simplest to boil it down to "this technology doesn’t work the way we were promised". Generative AI lies to us, or "hallucinates", and it performs poorly on many of the kinds of tasks where we most need technological help. We are led to believe we can trust this technology, but it fails to meet expectations, while simultaneously being used for such misery-inducing and criminal things as synthetic CSAM and deepfakes to undermine democracy.

So when we look at these together, you can develop a pretty strong argument: this technology is not living up to the overhyped expectations, and in exchange for this underwhelming performance, we’re giving up electricity, water, climate, money, culture, and jobs. Not a worthwhile trade, in many people’s eyes, to put it mildly!

I do like to bring a little nuance to the space, because I think when we accept the limitations on what generative AI can do, and the harm it can cause, and don’t play the overhype game, we can find a passable middle ground. I don’t think we should be paying the steep price for training and for inference of these models unless the results are really, REALLY worth it. Developing new molecules for medical research? Maybe, yes. Helping kids cheat (poorly) on homework? No thanks. I’m not even sure it’s worth the externality cost to help me write code a little bit more efficiently at work, unless I’m doing something really valuable. We need to be honest and realistic about the true price of both creating and using this technology.

How we got here

So, with that said, I’d like to dive in and look at how this situation came to be. I wrote way back in September 2023 that machine learning had a public perception problem, and in the case of generative AI, I think that has been proven out by events. Specifically, if people don’t have realistic expectations and understanding of what LLMs are good for and what they’re not good for, they’re going to bounce off, and backlash will ensue.

"My argument goes something like this:

1. People are not naturally prepared to understand and interact with machine learning.

2. Without understanding these tools, some people may avoid or distrust them.

3. Worse, some individuals may misuse these tools due to misinformation, resulting in detrimental outcomes.

4. After experiencing the negative consequences of misuse, people might become reluctant to adopt future machine learning tools that could enhance their lives and communities."

me, in Machine Learning’s Public Perception Problem, Sept 2023

So what happened? Well, the generative AI industry dove head first into the problem and we’re seeing the repercussions.

Generative AI applications don’t meet people’s needs

Part of the problem is that generative AI really can’t effectively do everything the hype claims. An LLM can’t be reliably used to answer questions, because it’s not a "facts machine". It’s a "probable next word in a sentence machine". But we’re seeing promises of all kinds that ignore these limitations, and tech companies are forcing generative AI features into every kind of software you can think of. People hated Microsoft’s Clippy because it wasn’t any good and they didn’t want to have it shoved down their throats – and one might say they’re doing the same basic thing with an improved version, and we can see that some people still understandably resent it.

When someone goes to an LLM today and asks for the price of ingredients in a recipe at their local grocery store right now, there’s absolutely no chance that model can answer that correctly, reliably. That is not within its capabilities, because the true data about those prices is not available to the model. The model might accidentally guess that a bag of carrots is $1.99 at Publix, but it’s just that, an accident. In the future, with chaining models together in agentic forms, there’s a chance we could develop a narrow model to do this kind of thing correctly, but right now it’s absolutely bogus.

But people are asking LLMs these questions today! And when they get to the store, they’re very disappointed about being lied to by a technology that they thought was a magic answer box. If you’re OpenAI or Anthropic, you might shrug, because if that person was paying you a monthly fee, well, you already got the cash. And if they weren’t, well, you got the user number to tick up one more, and that’s growth.

However, this is actually a major business problem. When your product fails like this, in an obvious, predictable (inevitable!) way, you’re beginning to singe the bridge between that user and your product. It may not burn it all at once, but it’s gradually tearing down the relationship the user has with your product, and you only get so many chances before someone gives up and goes from a user to a critic. In the case of generative AI, it seems to me like you don’t get many chances at all. Plus, failure in one mode can make people mistrust the entire technology in all its forms. Is that user going to trust or believe you in a few years when you’ve hooked up the LLM backend to realtime price APIs and can in fact correctly return grocery store prices? I doubt it. That user might not even let your model help revise emails to coworkers after it failed them on some other task.

From what I can see, tech companies think they can just wear people down, forcing them to accept that generative AI is an inescapable part of all their software now, whether it works or not. Maybe they can, but I think this is a self-defeating strategy. Users may trudge along and accept the state of affairs, but they won’t feel positive towards the tech or towards your brand as a result. Begrudging acceptance is not the kind of energy you want your brand to inspire among users!

What Silicon Valley has to do with it

You might think, well, that’s clear enough – let’s back off on the generative AI features in software, and just apply it to tasks where it can wow the user and work well. They’ll have a good experience, and then as the technology gets better, we’ll add more where it makes sense. And this would be somewhat reasonable thinking (although, as I mentioned before, the externality costs will be extremely high to our world and our communities).

However, I don’t think the big generative AI players can really do that, and here’s why. Tech leaders have spent a truly exorbitant amount of money on creating and trying to improve this technology – from investing in companies that develop it, to building power plants and data centers, to lobbying to avoid copyright laws, there are hundreds of billions of dollars sunk into this space already with more soon to come.

In the tech industry, profit expectations are quite different from what you might encounter in other sectors – a VC funded software startup has to make back 10–100x what’s invested (depending on stage) to look like a really standout success. So investors in tech push companies, explicitly or implicitly, to take bigger swings and bigger risks in order to make higher returns plausible. This starts to develop into what we call a "bubble" – valuations become out of alignment with the real economic possibilities, escalating higher and higher with no hope of ever becoming reality. As Gerrit De Vynck in the Washington Post noted, "… Wall Street analysts are expecting Big Tech companies to spend around $60 billion a year on developing AI models by 2026, but reap only around $20 billion a year in revenue from AI by that point… Venture capitalists have also poured billions more into thousands of AI start-ups. The AI boom has helped contribute to the $55.6 billion that venture investors put into U.S. start-ups in the second quarter of 2024, the highest amount in a single quarter in two years, according to venture capital data firm PitchBook."

So, given the billions invested, there are serious arguments to be made that the amount invested in developing generative AI to date is impossible to match with returns. There just isn’t that much money to be made here, by this technology, certainly not in comparison to the amount that’s been invested. But, companies are certainly going to try. I believe that’s part of the reason why we’re seeing generative AI inserted into all manner of use cases where it might not actually be particularly helpful, effective, or welcomed. In a way, "we’ve spent all this money on this technology, so we have to find a way to sell it" is kind of the framework. Keep in mind, too, that investments continue to be sunk in to try and make the tech work better, but any LLM advancement these days is proving very slow and incremental.

Where to now?

Generative AI tools are not proving essential to people’s lives, so the economic calculus is not working to make a product available and convince folks to buy it. So, we’re seeing companies move to the "feature" model of generative AI, which I theorized could happen in my article from August 2024. However, the approach is taking a very heavy hand, as with Microsoft adding generative AI to Office365 and making the features and the accompanying price increase both mandatory. I admit I hadn’t made the connection between the public image problem and the feature vs product model problem until recently – but now we can see that they are intertwined. Giving people a feature that has the functionality problems we’re seeing, and then upcharging them for it, is still a real problem for companies. Maybe when something just doesn’t work for a task, it’s neither a product nor a feature? If that turns out to be the case, then investors in generative AI will have a real problem on their hands, so companies are committing to generative AI features, whether they work well or not.

I’m going to be watching with great interest to see how things progress in this space. I do not expect any great leaps in generative AI functionality, although depending on how things turn out with DeepSeek, we may see some leaps in efficiency, at least in training. If companies listen to their users’ complaints and pivot, to target generative AI at the applications it’s actually useful for, they may have a better chance of weathering the backlash, for better or for worse. However, that to me seems highly, highly unlikely to be compatible with the desperate profit incentive they’re facing. Along the way, we’ll end up wasting tremendous resources on foolish uses of generative AI, instead of focusing our efforts on advancing the applications of the technology that are really worth the trade.


Read more of my work at www.stephaniekirmer.com.


Further Reading

https://www.bbc.com/news/articles/c5yv5976z9po

https://www.cnbc.com/2025/01/27/nvidia-sheds-almost-600-billion-in-market-cap-biggest-drop-ever.html

https://medium.com/towards-data-science/environmental-implications-of-the-ai-boom-279300a24184

https://hbr.org/2023/06/the-ai-hype-cycle-is-distracting-companies

https://www.theverge.com/2025/1/16/24345051/microsoft-365-personal-family-copilot-office-ai-price-rises

https://www.reuters.com/technology/artificial-intelligence/openai-talks-investment-round-valuing-it-up-340-billion-wsj-reports-2025-01-30/

https://www.cnn.com/2025/01/21/tech/openai-oracle-softbank-trump-ai-investment/index.html

https://www.washingtonpost.com/technology/2024/07/24/ai-bubble-big-tech-stocks-goldman-sachs/

https://www.wheresyoured.at/oai-business/

https://medium.com/towards-data-science/economics-of-generative-ai-75f550288097

The Cultural Impact of AI Generated Content: Part 2 https://towardsdatascience.com/the-cultural-impact-of-ai-generated-content-part-2-228bf685b8ff/ Fri, 03 Jan 2025 17:24:24 +0000 https://towardsdatascience.com/the-cultural-impact-of-ai-generated-content-part-2-228bf685b8ff/ What can we do about the increasingly sophisticated AI generated content in our lives?

Photo by Meszárcsek Gergely on Unsplash

In my prior column, I established how AI generated content is expanding online, and described scenarios to illustrate why it’s occurring. (Please read that before you go on here!) Let’s move on now to talking about what the impact is, and what possibilities the future might hold.

Social and Creative Creatures

Human beings are social creatures, and visual ones as well. We learn about our world through images and language, and we use visual inputs to shape how we think and understand concepts. We are shaped by our surroundings, whether we want to be or not.

Accordingly, no matter how much we are consciously aware of the existence of AI generated content in our own ecosystems of media consumption, our subconscious response and reaction to that content will not be fully within our control. As the truism goes, everyone thinks they’re immune to advertising – they’re too smart to be led by the nose by some ad executive. But advertising continues! Why? Because it works. It inclines people to make purchasing choices that they otherwise wouldn’t have, whether just from increasing brand visibility, to appealing to emotion, or any other advertising technique.

AI-generated content may end up being similar, albeit in a less controlled way. We’re all inclined to believe we’re not being fooled by some bot with an LLM generating text in a chat box, but in subtle or overt ways, we’re being affected by the continued exposure. As much as it may be alarming that advertising really does work on us, consider that with advertising the subconscious or subtle effects are being designed and intentionally driven by ad creators. In the case of generative AI, a great deal of what goes into creating the content, whatever its purpose, is determined by an algorithm choosing, based on its training on historical data, the features most likely to appeal, and human actors are far less in control of what the model generates.

I mean to say that the results of generative AI routinely surprise us, because we’re not that well attuned to what our history really says, and we often don’t think of edge cases or interpretations of prompts we write. The patterns that AI is uncovering in the data are sometimes completely invisible to human beings, and we can’t control how these patterns influence the output. As a result, our thinking and understanding are being influenced by models that we don’t completely understand and can’t always control.

Critical Thinking

Beyond that, as I’ve mentioned, public critical thinking and critical media consumption skills are struggling to keep pace with AI generated content, to give us the ability to be as discerning and thoughtful as the situation demands. Similarly to the development of Photoshop, we need to adapt, but it’s unclear whether we have the ability to do so.

We are all learning tell-tale signs of AI generated content, such as certain visual clues in images, or phrasing choices in text. The average internet user today has learned a huge amount in just a few years about what AI generated content is and what it looks like. However, suppliers of the models used to create this content are trying to improve their performance to make such clues subtler, attempting to close the gap between obviously AI generated and obviously human produced media. We’re in a race with AI companies, to see whether they can make more sophisticated models faster than we can learn to spot their output.

We’re in a race with AI companies, to see whether they can make more sophisticated models faster than we can learn to spot their output.

In this race, it’s unclear if we will catch up, as people’s perceptions of patterns and aesthetic data have limitations. (If you’re skeptical, try your hand at detecting AI generated text: https://roft.io/) We can’t examine images down to the pixel level the way a model can. We can’t independently analyze word choices and frequencies throughout a document at a glance. We can and should build tools that help do this work for us, and there are some promising approaches for this, but when it’s just us facing an image, a video, or a paragraph, it’s just our eyes and brains versus the content. Can we win? Right now, we often don’t. People are fooled every day by AI-generated content, and for every piece that gets debunked or revealed, there must be many that slip past us unnoticed.

One takeaway to keep in mind is that it’s not just a matter of "people need to be more discerning" – it’s not as simple as that, and if you don’t catch AI generated materials or deepfakes when they cross your path every time, it’s not all your fault. This is being made increasingly difficult on purpose.

Plus, Bots!

So, living in this reality, we have to cope with a disturbing fact. We can’t trust what we see, at least not in the way we have become accustomed to. In a lot of ways, however, this isn’t that new. As I described in my first part of this series, we kind of know, deep down, that photographs may be manipulated to change how we interpret them and how we perceive events. Hoaxes have been perpetuated with newspapers and radio since their invention as well. But it’s a little different because of the race – the hoaxes are coming fast and furious, always getting a little more sophisticated and a little harder to spot.

We can’t trust what we see, at least not in the way we have become accustomed to.

There’s also an additional layer of complexity in the fact that a large amount of the AI generated content we see, particularly on social media, is being created and posted by bots (or agents, in the new generative AI parlance), for engagement farming/clickbait/scams and other purposes as I discussed in part 1 of this series. Frequently we are quite a few steps disconnected from a person responsible for the content we’re seeing, who used models and automation as tools to produce it. This obfuscates the origins of the content, and can make it harder to infer the artificiality of the content by context clues. If, for example, a post or image seems too good (or weird) to be true, I might investigate the motives of the poster to help me figure out if I should be skeptical. Does the user have a credible history, or institutional affiliations that inspire trust? But what if the poster is a fake account, with an AI generated profile picture and fake name? It only adds to the challenge for a regular person to try and spot the artificiality and avoid a scam, deepfake, or fraud.

As an aside, I also think there’s general harm from our continued exposure to unlabeled bot content. When we get more and more social media in front of us that is fake and the "users" are plausibly convincing bots, we can end up dehumanizing all social media engagement outside of people we know in analog life. People already struggle to humanize and empathize through computer screens, hence the longstanding problems with abuse and mistreatment online in comments sections, on social media threads, and so on. Is there a risk that people’s numbness to humanity online worsens, and degrades the way they respond to people and models/bots/computers?

What Now?

How do we as a society respond, to try and prevent being taken in by AI-generated fictions? There’s no amount of individual effort or "do your homework" that can necessarily get us out of this. The patterns and clues in AI-generated content may be undetectable to the human eye, and even undetectable to the person who built the model. Where you might normally do online searches to validate what you see or read, those searches are heavily populated with AI-generated content themselves, so they are increasingly no more trustworthy than anything else. We absolutely need photographs, videos, text, and music to learn about the world around us, as well as to connect with each other and understand the broader human experience. Even though this pool of material is becoming poisoned, we can’t quit using it.

There are a number of possibilities for what I think might come next that could help with this dilemma.

  • AI declines in popularity or fails due to resource issues. There are a lot of factors that threaten the growth and expansion of generative AI commercially, and these are mostly not mutually exclusive. Generative AI very possibly could suffer some degree of collapse due to AI generated content infiltrating the training datasets. Economic and/or environmental challenges (insufficient power, natural resources, or capital for investment) could all slow down or hinder the expansion of AI generation systems. Even if these issues don’t affect the commercialization of generative AI, they could create barriers to the technology’s progressing further past the point of easy human detection.
  • Organic content becomes premium and gains new market appeal. If we are swarmed with AI generated content, that becomes cheap and low quality, but the scarcity of organic, human-produced content may drive a demand for it. In addition, there is a significant growth already in backlash against AI. When customers and consumers find AI generated material off-putting, companies will move to adapt. This aligns with some arguments that AI is in a bubble, and that the excessive hype will die down in time.
  • Technological work challenges the negative effects of AI. Detector models and algorithms will be necessary to differentiate organic and generated content where we can’t do it ourselves, and work is already going on in this direction. As generative AI grows in sophistication, making this necessary, a commercial and social market for these detector models may develop. These models need to become a lot more accurate than they are today for this to be possible – we don’t want to rely upon notably bad models like those being used to identify generative AI content in student essays in educational institutions today. But, a lot of work is being done in this space, so there’s reason for hope. (I have included a few research papers on these topics in the notes at the end of this article.)
  • Regulatory efforts expand and gain sophistication. Regulatory frameworks may develop sufficiently to be helpful in reining in the excesses and abuses generative AI enables. Establishing accountability and provenance for AI agents and bots would be a massively positive step. However, all this relies on the effectiveness of governments around the world, which is always uncertain. We know big tech companies are intent on fighting against regulatory obligations and have immense resources to do so.

I think it very unlikely that generative AI will continue to gain sophistication at the rate seen in 2022–2023, unless a significantly different training methodology is developed. We are running short of organic training data, and throwing more data at the problem is showing diminishing returns, for exorbitant costs. I am concerned about the ubiquity of AI-generated content, but I (optimistically) don’t think these technologies are going to advance at more than a slow incremental rate going forward, for reasons I have written about before.

This means our efforts to moderate the negative externalities of generative AI have a pretty clear target. While we continue to struggle with difficulty detecting AI-generated content, we have a chance to catch up if technologists and regulators put the effort in. I also think it is vital that we work to counteract the cynicism this AI "slop" inspires. I love machine learning, and I’m very glad to be a part of this field, but I’m also a sociologist and a citizen, and we need to take care of our communities and our world as well as pursuing technical progress.


Read more of my work at www.stephaniekirmer.com.


Further Reading

Detecting AI May Be Impossible. That’s a Big Problem For Teachers. – NSF Institute for Trustworthy…

Q&A: The increasing difficulty of detecting AI- versus human-generated text | Penn State University

AI detectors are easily fooled, researchers find | EdScoop

RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors

Real or Fake Text

https://dl.acm.org/doi/pdf/10.1145/3637528.3671463

https://arxiv.org/pdf/2402.00045

https://iris.uniroma1.it/bitstream/11573/1710645/1/Maiano_Human_2024.pdf

https://dl.acm.org/doi/abs/10.1145/3658644.3670306

https://www.cis.upenn.edu/~ccb/publications/miragenews.pdf

The Cultural Impact of AI Generated Content: Part 1 https://towardsdatascience.com/the-cultural-impact-of-ai-generated-content-part-1-6e6a8a51800f/ Tue, 03 Dec 2024 17:36:15 +0000 https://towardsdatascience.com/the-cultural-impact-of-ai-generated-content-part-1-6e6a8a51800f/ What happens when AI generated media becomes ubiquitous in our lives? How does this relate to what we've experienced before, and how does...

What happens when AI generated media becomes ubiquitous in our lives? How does this relate to what we’ve experienced before, and how does it change us?
Photo by Annie Spratt on Unsplash

This is the first part of a two part series I’m writing analyzing how people and communities are affected by the expansion of AI generated content. I’ve already talked at some length about the environmental, economic, and labor issues involved, as well as discrimination and social bias. But this time I want to dig in a little and focus on some psychological and social impacts from the AI generated media and content we consume, specifically on our relationship to critical thinking, learning, and conceptualizing knowledge.

History

Hoaxes have been perpetrated using photography essentially since its invention. The moment we started having a form of media that was believed to show us true, unmediated reality of phenomena and events, was the moment that people started coming up with ways to manipulate that form of media, to great artistic and philosophical effect. (As well as humorous or simply fraudulent effect.) We have a form of unwarranted trust in photographs, despite this, and we have developed a relationship with the form that balances between trust and skepticism.

When I was a child, the internet was not yet broadly available to the general public, and certainly very few homes had access to it, but by the time I was a teenager that had completely changed, and everyone I knew spent time on AOL instant messenger. Around the time I left graduate school, the iPhone was launched and the smartphone era started. I retell all this to make the point that cultural creation and consumption changed startlingly quickly and beyond recognition in just a couple of decades.

I think the current moment represents a whole new era specifically in the media and cultural content we consume and create, because of the launch of generative AI. It’s a little like when Photoshop became broadly available, and we started to realize that photos were sometimes retouched, and we began to question whether we could trust what images looked like. (Readers may find the ongoing conversation around "what is a photograph" an interesting extension of this issue.) But even then, Photoshop was expensive and had a skill level requirement to use it effectively, so most photos we encountered were relatively true to life, and I think people generally expected that images in advertising and film were not going to be "real". Our expectations and intuitions had to adjust to the changes in technology, and we more or less did.

Current Day

Today, AI content generators have democratized the ability to artificially produce or alter any kind of content, including images. Unfortunately, it’s extremely difficult to get an estimate of how much of the content online may be AI-generated – if you google this question you’ll get references to an article from Europol that supposedly says the number will be 90% by 2026 – but read it and you’ll see that the research paper says nothing of the sort. You might also find a paper by some AWS researchers being cited, saying that 57% is the number – but that’s also a mistaken reading (they’re talking about text content being machine translated, not text generated from whole cloth, to say nothing of images or video). As far as I can tell, there’s no reliable, scientifically based work indicating how much of the content we consume may actually be AI generated – and even if there were, the moment it was published it would be outdated.

But if you think about it, this is perfectly sensible. A huge part of the reason AI generated content keeps coming is because it’s harder than ever before in human history to tell whether a human being actually created what you are looking at, and whether that representation is a reflection of reality. How do you count something, or even estimate a count, when it’s explicitly unclear how you can identify it in the first place?

I think we all have the lived experience of spotting content with questionable provenance. We see images that seem to be in the uncanny valley, or strongly suspect that a product review on a retail site sounds unnaturally positive and generic, and think, that must have been created using generative AI and a bot. Ladies, have you tried to find inspiration pictures for a haircut online recently? In my own personal experience, 50%+ of the pictures on Pinterest or other such sites are clearly AI generated, with tell-tale signs: textureless skin, rubbery features, straps and necklaces disappearing into nowhere, images explicitly not including hands, never showing both ears straight on, etc. These are easy to dismiss, but a large swath makes you question whether you’re seeing heavily filtered real images or wholly AI generated content. I make it my business to understand these things, and I’m often not sure myself. I hear tell that single men on dating apps are so swamped with scamming bots based on generative AI that there’s a name for the way to check – the "Potato Test". If you ask the bot to say "potato" it will ignore you, but a real human person will likely do it. The small, everyday areas of our lives are being infiltrated by AI content without anything like our consent or approval.

Why?

What’s the point of dumping AI slop in all these online spaces? The best case scenario goal may be to get folks to click through to sites where advertising lives, offering nonsense text and images just convincing enough to get those precious ad impressions and get a few cents from the advertiser. Artificial reviews and images for online products are generated by the truckload, so that drop-shippers and vendors of cheap junk can fool customers into buying something that’s just a little cheaper than all the competition, letting them hope they’re getting a legitimate item. Perhaps the item can be so incredibly cheap that the disappointed buyer will just accept the loss and not go to the trouble of getting their money back.

Worse, bots using LLMs to generate text and images can be used to lure people into scams, and because the only real resource necessary is compute, the scaling of such scams costs pennies – well worth the expense if you can steal even one person’s money every so often. AI generated content is used for criminal abuse, including pig butchering scams, AI-generated CSAM and non-consensual intimate images, which can turn into blackmail schemes as well.

There are also political motivations for AI-generated images, video, and text – in this US election year, entities all across the world with different angles and objectives produced AI-generated images and videos to support their viewpoints, and spewed propagandistic messages via generative AI bots to social media, especially on the former Twitter, where content moderation to prevent abuse, harassment, and bigotry has largely ceased. The expectation from those disseminating this material is that uninformed internet users will absorb their message through continual, repetitive exposure to this content, and for every item they realize is artificial, an unknown number will be accepted as legitimate. Additionally, this material creates an information ecosystem where truth is impossible to define or prove, neutralizing good actors and their attempts to cut through the noise.

A small minority of the AI-generated content online will be actual attempts to create appealing images just for enjoyment, or relatively harmless boilerplate text generated to fill out corporate websites, but as we are all well aware, the internet is rife with scams and get-rich-quick schemers, and the advances of generative AI have brought us into a whole new era for these sectors. (And, these applications have massive negative implications for real creators, energy and the environment, and other issues.)

Where We Are

I’m painting a pretty grim picture of our online ecosystems, I realize. Unfortunately, I think it’s accurate and only getting worse. I’m not arguing that there’s no good use of generative AI, but I’m becoming more and more convinced that the downsides for our society are going to have a larger, more direct, and more harmful impact than the positives.

I think about it this way: We’ve reached a point where it is unclear if we can trust what we see or read, and we routinely can’t know if entities we encounter online are human or AI. What does this do to our reactions to what we encounter? It would be silly to expect our ways of thinking to not change as a result of these experiences, and I worry very much that the change we’re undergoing is not for the better.

The ambiguity is a big part of the challenge, however. It’s not that we know that we’re consuming untrustworthy information, it’s that it’s essentially unknowable. We’re never able to be sure. Critical thinking and critical media consumption habits help, but the expansion of AI generated content may be outstripping our critical capabilities, at least in some cases. This seems to me to have a real implication for our concepts of trust and confidence in information.

In my next article, I’ll discuss in detail what kind of effects this may have on our thoughts and ideas about the world around us, and consider what, if anything, our communities might do about it.


Read more of my work at www.stephaniekirmer.com.

Also, regular readers will know I publish on a two week schedule, but I am moving to a monthly publishing cadence going forward. Thank you for reading, and I look forward to continuing to share my ideas!


Further Reading

https://www.theverge.com/2024/2/2/24059955/samsung-no-such-thing-as-real-photo-ai

https://arxiv.org/pdf/2401.05749 – Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, and Marcello Federico of AWS

https://www.europol.europa.eu/cms/sites/default/files/documents/Europol_Innovation_Lab_Facing_Reality_Law_Enforcement_And_The_Challenge_Of_Deepfakes.pdf

https://www.404media.co/ai-generated-child-sexual-abuse-material-is-not-a-victimless-crime/

https://www.404media.co/fbi-arrests-man-for-generating-ai-child-sexual-abuse-imagery/

Instagram Advertises Nonconsensual AI Nude Apps

https://www.smithsonianmag.com/innovation/history-spirit-photography-future-deepfake-videos-180979010

https://www.brennancenter.org/our-work/research-reports/generative-ai-political-advertising

Choosing and Implementing Hugging Face Models https://towardsdatascience.com/choosing-and-implementing-hugging-face-models-026d71426fbe/ Fri, 01 Nov 2024 14:02:09 +0000 https://towardsdatascience.com/choosing-and-implementing-hugging-face-models-026d71426fbe/ Pulling pre-trained models out of the box for your use case

Photo by Erda Estremera on Unsplash

I’ve been having a lot of fun in my daily work recently experimenting with models from the Hugging Face catalog, and I thought this might be a good time to share what I’ve learned and give readers some tips for how to apply these models with a minimum of stress.

My specific task recently has involved looking at blobs of unstructured text data (think memos, emails, free text comment fields, etc) and classifying them according to categories that are relevant to a business use case. There are a ton of ways you can do this, and I’ve been exploring as many as I can feasibly do, including simple stuff like pattern matching and lexicon search, but also expanding to using pre-built neural network models for a number of different functionalities, and I’ve been moderately pleased with the results.

I think the best strategy is to incorporate multiple techniques, in some form of ensembling, to get the best of the options. I don’t trust these models necessarily to get things right often enough (and definitely not consistently enough) to use them solo, but when combined with more basic techniques they can add to the signal.

Choosing the use case

For me, as I’ve mentioned, the task is just to take blobs of text, usually written by a human, with no consistent format or schema, and try to figure out what categories apply to that text. I’ve taken a few different approaches, outside of the analysis methods mentioned earlier, to do that, and these range from very low effort to somewhat more work on my part. These are three of the strategies that I’ve tested so far.

  • Ask the model to choose the category (zero-shot classification – I’ll use this as an example later on in this article)
  • Use a named entity recognition model to find key objects referenced in the text, and make classification based on that (a rough sketch of this approach follows the list below)
  • Ask the model to summarize the text, then apply other techniques to make classification based on the summary
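
To give a sense of what I mean by that second approach, here is a rough sketch. The model name, the org_to_category lexicon, and the confidence cutoff are all illustrative assumptions rather than recommendations; the idea is just to pull out named entities and map the ones you care about to business categories.

# Rough sketch of the NER-based strategy. The model choice, lexicon,
# and threshold below are hypothetical examples, not recommendations.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# Hypothetical mapping from organizations we expect to see to business categories
org_to_category = {
    "acme corp": "Vendor Communication",
    "internal revenue service": "Tax",
}

def classify_via_entities(text: str) -> set[str]:
    categories = set()
    for ent in ner(text):  # each entity is a dict with 'word', 'entity_group', 'score'
        if ent["entity_group"] == "ORG" and ent["score"] > 0.8:  # arbitrary confidence cutoff
            label = org_to_category.get(ent["word"].lower())
            if label:
                categories.add(label)
    return categories

What I like about this framing is that the heavy lifting is done by a general-purpose model, while the business logic lives in a small lexicon you can inspect and edit.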

Finding the models

This is some of the most fun – looking through the Hugging Face catalog for models! At https://huggingface.co/models you can see a gigantic assortment of the models available, which have been added to the catalog by users. I have a few tips and pieces of advice for how to select wisely.

  • Look at the download and like numbers, and don’t choose something that has not been tried and tested by a decent number of other users. You can also check the Community tab on each model page to see if users are discussing challenges or reporting bugs.
  • Investigate who uploaded the model, if possible, and determine if you find them trustworthy. This person who trained or tuned the model may or may not know what they’re doing, and the quality of your results will depend on them!
  • Read the documentation closely, and skip models with little or no documentation. You’ll struggle to use them effectively anyway.
  • Use the filters on the side of the page to narrow down to models suited to your task. The volume of choices can be overwhelming, but they are well categorized to help you find what you need.
  • Most model cards offer a quick test you can run to see the model’s behavior, but keep in mind that this is just one example, and it’s probably one that was chosen because the model handles that case well and finds it pretty easy.

Incorporating into your code

Once you’ve found a model you’d like to try, it’s easy to get going – click the "Use this Model" button on the top right of the Model Card page, and you’ll see the choices for how to implement. If you choose the Transformers option, you’ll get some instructions that look like this.

Screenshot taken by author

If a model you’ve selected is not supported by the Transformers library, there may be other techniques listed, like TF-Keras, scikit-learn, or more, but all should show instructions and sample code for easy use when you click that button.

In my experiments, all the models were supported by Transformers, so I had a mostly easy time getting them running, just by following these steps. If you find that you have questions, you can also look at the deeper documentation and see full API details for the Transformers library and the different classes it offers. I’ve definitely spent some time looking at these docs for specific classes when optimizing, but to get the basics up and running you shouldn’t really need to.

Preparing inference data

Ok, so you’ve picked out a model that you want to try. Do you already have data? If not, I have been using several publicly available datasets for this experimentation, mainly from Kaggle, and you can find lots of useful datasets there as well. In addition, Hugging Face also has a dataset catalog you can check out, but in my experience it’s not as easy to search or to understand the data contents over there (just not as much documentation).

Once you pick a dataset of unstructured text data, loading it to use in these models isn’t that difficult. Load your model and your tokenizer (from the docs provided on Hugging Face as noted above) and pass all this to the pipeline function from the transformers library. You’ll loop over your blobs of text in a list or pandas Series and pass them to the model function. This is essentially the same for whatever kind of task you’re doing, although for zero-shot classification you also need to provide a candidate label or list of labels, as I’ll show below.

Code Example

So, let’s take a closer look at zero-shot classification. As I’ve noted above, this involves using a pretrained model to classify a text according to categories that it hasn’t been specifically trained on, in the hopes that it can use its learned semantic embeddings to measure similarities between the text and the label terms.

from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import pipeline

# Load the NLI model and its tokenizer (inputs capped at 512 tokens)
nli_model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli", model_max_length=512)
classifier = pipeline("zero-shot-classification", device="cpu", model=nli_model, tokenizer=tokenizer)

label_list = ['News', 'Science', 'Art']

all_results = []
for text in list_of_texts:
    # multi_label=True scores each candidate label independently instead of forcing them to sum to 1
    prob = classifier(text, label_list, multi_label=True)
    results_dict = dict(zip(prob["labels"], prob["scores"]))
    all_results.append(results_dict)

This returns a list of dicts, one per input text; each dict’s keys are the candidate labels and its values are the probability assigned to each label. You don’t have to use the pipeline as I’ve done here, but it makes multi-label zero-shot classification a lot easier than writing that code manually, and it returns results that are easy to interpret and work with.
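
Once you have that list of dicts, you might find it convenient to look at the scores in tabular form. This little sketch assumes the all_results list from the loop above and uses pandas purely for convenience.

import pandas as pd

# Each dict shares the same label keys, so the list stacks neatly into a dataframe
results_df = pd.DataFrame(all_results)

# If you want a single prediction per text, take the highest-scoring label in each row
results_df["best_label"] = results_df.idxmax(axis=1)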

If you prefer not to use the pipeline, you can do something like this instead, but you’ll have to run it once for each label. Notice how the logits returned by the model need to be processed explicitly so that you get human-interpretable output. Also, you still need to load the tokenizer and the model as described above.

def run_zero_shot_classifier(text, label):
    # Frame the label as an NLI hypothesis about the text
    hypothesis = f"This example is related to {label}."

    x = tokenizer.encode(
        text, 
        hypothesis, 
        return_tensors="pt", 
        truncation="only_first"
    )

    logits = nli_model(x.to("cpu"))[0]

    # For bart-large-mnli the classes are [contradiction, neutral, entailment];
    # drop "neutral" and softmax over the remaining two to get the entailment probability
    entail_contradiction_logits = logits[:, [0, 2]]
    probs = entail_contradiction_logits.softmax(dim=1)
    prob_label_is_true = probs[:, 1]

    return prob_label_is_true.item()

label_list = ['News', 'Science', 'Art']
all_results = []
for text in list_of_texts:
    # Keep the scores keyed by label so each text's results stay together
    text_results = {}
    for label in label_list:
        text_results[label] = run_zero_shot_classifier(text, label)
    all_results.append(text_results)

To tune, or not?

You have probably noticed that I haven’t talked about fine-tuning the models myself for this project – that’s true. I may do this in the future, but I’m limited by the fact that I have minimal labeled training data to work with at this time. I could use semi-supervised techniques or bootstrap a labeled training set, but this whole experiment has been to see how far I can get with straight off-the-shelf models. I do have a few small labeled data samples for testing the models’ performance, but that’s nowhere near the volume of data I would need to tune the models.

If you do have good training data and would like to tune a base model, Hugging Face has some docs that can help. https://huggingface.co/docs/transformers/en/training
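
For reference, the standard fine-tuning loop in the Transformers library looks roughly like the sketch below. The base checkpoint, label count, and the tokenized_train/tokenized_eval datasets are placeholders I’m inventing for illustration, not artifacts of my experiments, so adapt them to your own data.

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Placeholder base model with a fresh classification head sized for three labels
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

training_args = TrainingArguments(output_dir="tuned-classifier", num_train_epochs=3)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,  # hypothetical pre-tokenized datasets
    eval_dataset=tokenized_eval,
)
trainer.train()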

Computation and speed

Performance has been an interesting problem, as I’ve run all my experiments on my local laptop so far. Naturally, using these models from Hugging Face will be much more compute intensive and slower than the basic strategies like regex and lexicon search, but it provides signal that can’t really be achieved any other way, so finding ways to optimize can be worthwhile. All these models are GPU enabled, and it’s very easy to push them to be run on GPU. (If you want to try it on GPU quickly, review the code I’ve shown above, and where you see "cpu" substitute in "cuda" if you have a GPU available in your programming environment.) Keep in mind that using GPUs from cloud providers is not cheap, however, so prioritize accordingly and decide if more speed is worth the price.
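
One small convenience: rather than hand-editing the device string, you can let the code pick for you. This assumes the same nli_model and tokenizer objects from the earlier example.

import torch
from transformers import pipeline

# Use the GPU when one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
classifier = pipeline("zero-shot-classification", device=device, model=nli_model, tokenizer=tokenizer)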

Most of the time, using the GPU is much more important for training (keep that in mind if you choose to fine-tune) but less vital for inference. I’m not digging into more details about optimization here, but you’ll want to consider parallelism as well if this is important to you – both data parallelism and actual training/compute parallelism.

Testing and understanding output

We’ve run the model! Results are here. I have a few closing tips for how to review the output and actually apply it to business questions.

  • Don’t trust the model output blindly, but run rigorous tests and evaluate performance. Just because a transformer model does well on a certain text blob, or is able to correctly match text to a certain label regularly, doesn’t mean this is a generalizable result. Use lots of different examples and different kinds of text to prove the performance is going to be sufficient. (A minimal sketch of this kind of check follows this list.)
  • If you feel confident in the model and want to use it in a production setting, track and log the model’s behavior. This is just good practice for any model in production, but you should keep the results it has produced alongside the inputs you gave it, so you can continually check up on it and make sure the performance doesn’t decline. This is more important for these kinds of deep learning models because we don’t have as much interpretability of why and how the model is coming up with its inferences. It’s dangerous to make too many assumptions about the inner workings of the model.
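
Here’s a minimal sketch of that first point, assuming you have a small hand-labeled sample to compare against. The true_labels list below is hypothetical and needs to be aligned with the list_of_texts that produced all_results earlier.

from sklearn.metrics import accuracy_score, f1_score

# Take the highest-scoring label for each text as the model's prediction
predicted_labels = [max(result, key=result.get) for result in all_results]

# true_labels is a hypothetical hand-labeled list aligned with list_of_texts
print("Accuracy:", accuracy_score(true_labels, predicted_labels))
print("Macro F1:", f1_score(true_labels, predicted_labels, average="macro"))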

As I mentioned earlier, I like using these kinds of model output as part of a larger pool of techniques, combining them in ensemble strategies – that way I’m not only relying on one approach, but I do get the signal those inferences can provide.

I hope this overview is useful for those of you getting started with pre-trained models for text (or other modes of) analysis – good luck!


Read more of my work at www.stephaniekirmer.com.


Further Reading

Models – Hugging Face

Model Parallelism

Find Open Datasets and Machine Learning Projects | Kaggle

The post Choosing and Implementing Hugging Face Models appeared first on Towards Data Science.

A Critical Look at AI Image Generation https://towardsdatascience.com/a-critical-look-at-ai-image-generation-45001f410147/ Thu, 17 Oct 2024 07:36:08 +0000 https://towardsdatascience.com/a-critical-look-at-ai-image-generation-45001f410147/ What does image generative AI really tell us about our world?

The post A Critical Look at AI Image Generation appeared first on Towards Data Science.

I recently had the opportunity to provide analysis on an interesting project, and I had more to say than could be included in that single piece, so today I’m going to discuss some more of my thoughts about it.

The approach the researchers took with this project involved providing a series of prompts to different generative AI image generation tools: Stable Diffusion, Midjourney, YandexART, and ERNIE-ViLG (by Baidu). The prompts were particularly framed around different generations – Baby Boomers, Gen X, Millennials, and Gen Z, and requested images of these groups in different contexts, such as "with family", "on vacation", or "at work".

While the results were very interesting, and perhaps revealed some insights about visual representation, I think we should also take note of what this cannot tell us, or what the limitations are. I’m going to divide up my discussion into the aesthetics (what the pictures look like) and representation (what is actually shown in the images), with a few side tracks into how these images come to exist in the first place, because that’s really important to both topics.

Introduction

Before I start, though, a quick overview of these image generator models. They’re created by taking giant datasets of images (photographs, artwork, etc) paired with short text descriptions, and the goal is to get the model to learn the relationships between words and the appearance of the images, such that when given a word the model can create an image that matches, more or less. There’s a lot more detail under the hood, and the models (like other generative AI) have a built in degree of randomness that allows for variations and surprises.

When you use one of these hosted models, you give a text prompt and an image is returned. However, it’s important to note that your prompt is not the ONLY thing the model gets. There are also built in instructions, which I call pre-prompting instructions sometimes, and these can have an effect on what the output is. Examples might be telling the model to refuse to create certain kinds of offensive images, or to reject prompts using offensive language.

Training Data

An important framing point here is that the training data, those big sets of images that are paired with text blurbs, is what the model is trying to replicate. So, we should ask more questions about the training data, and where it comes from. To train models like these, the volume of image data required is extraordinary. Midjourney was trained on data from LAION (https://laion.ai/), whose larger dataset has 5 billion image-text pairs across multiple languages, and we can assume the other models were trained on similar volumes of content. This means that engineers can’t be TOO picky about which images are used for training, because they basically need everything they can get their hands on.

Ok, so where do we get images? How are they generated? Well, we create our own and post them on social media by the bucketload, so that’s necessarily going to be a chunk of it. (It’s also easy to get a hold of, from these platforms.) Media and advertising also create tons of images, from movies to commercials to magazines and beyond. Many other images are never going to be accessible to these models, like your grandma’s photo album that no one has digitized, but the ones that are available to train are largely from these two buckets: independent/individual creators and media/ads.

So, what do you actually get when you use one of these models?

Aesthetics

One thing you’ll notice if you try out these different image generators is the stylistic distinctions between them, and the internal consistency of styles. I think this is really fascinating, because they feel like they almost have personalities! Midjourney is dark and moody, with shadowy elements, while Stable Diffusion is bright and hyper-saturated, with very high contrast. ERNIE-ViLG seems to lean towards a cartoonish style, also with very high contrast and textures appearing rubbery or highly filtered. YandexART has washed out coloring, with often featureless or very blurred backgrounds and the appearance of spotlighting (it reminds me of a family photo taken at a department store in some cases). A number of different elements may be responsible for each model’s trademark style.

As I’ve mentioned, pre-prompting instructions are applied in addition to whatever input the user gives. These may indicate specific aesthetic components that the outputs should always have, such as stylistic choices like the color tones, brightness, and contrast, or they may instruct the model not to follow objectionable instructions, among other things. This forms a way for the model provider to implement some limits and guardrails on the tool, preventing abuse, but can also create aesthetic continuity.

The process of fine tuning with reinforcement learning may also affect style, where human observers are making judgments about the outputs that are provided back to the model for learning. The human observers will have been trained and given instructions about what kinds of features of the output images to approve of/accept and which kinds should be rejected or down-scored, and this may involve giving higher ratings to certain kinds of visuals.

The type of training data also has an impact. We know some of the massive datasets that are employed for training the models, but there is probably more we don’t know, so we have to infer from what the models produce. If the model is producing high-contrast, brightly colored images, there’s a good chance the training data included a lot of images with those characteristics.

As we analyze the outputs of the different models, however, it’s important to keep in mind that these styles are probably a combination of pre-prompting instructions, the training data, and the human fine tuning.

Beyond the visual appeal/style of the images, what’s actually in them?

Representation

Limitations

What the models will have the capability to do is going to be limited by the reality of how they’re trained. These models are trained on images from the past – some the very recent past, but some much further back. For example, consider: as we move forward in time, younger generations will have images of their entire lives online, but for older groups, images from their youth or young adulthood are not available digitally in large quantities (or high quality) for training data, so we may never see them presented by these models as young people. It’s very visible in this project: For Gen Z and Millennials, in this data we see that the models struggle to "age" the subjects in the output appropriately to the actual age ranges of the generation today. Both groups seem to look more or less the same age in most cases, with Gen Z sometimes shown (in prompts related to schooling, for example) as actual children. In contrast, Boomers and Gen X are shown primarily in middle age or old age, because the training data that exists is unlikely to have scanned copies of photographs from their younger years, from the 1960s-1990s. This makes perfect sense if you think in the context of the training data.

[A]s we move forward in time, younger generations will have images of their entire lives online, but for older groups, images from their youth or young adulthood are not available digitally for training data, so we may never see them presented by these models as young people.

Identity

With this in mind, I’d argue that what we can get from these images, if we investigate them, is some impression of A. how different age groups present themselves in imagery, particularly selfies for the younger sets, and B. how media representation looks for these groups. (It’s hard to break these apart sometimes, because media and youth culture are so dialectical.)

The training data didn’t come out of nowhere – human beings chose to create, share, label, and curate the images, so those people’s choices are coloring everything about them. The models are getting the image of these generations that someone has chosen to portray, and in all cases these portrayals have a reason and intention behind them.

A teen or twentysomething taking a selfie and posting it online (so that it is accessible to become training data for these models) probably took ten, or twenty, or fifty before choosing which one to post to Instagram. At the same time, a professional photographer choosing a model to shoot for an ad campaign has many considerations in play, including the product, the audience, the brand identity, and more. Because professional advertising isn’t free of racism, sexism, ageism, or any of the other -isms, these images won’t be either, and as a result, the image output of these models comes with that same baggage. Looking at the images, you can see many more phenotypes resembling people of color among Millennial and Gen Z for certain models (Midjourney and Yandex in particular), but hardly any of those phenotypes among Gen X and Boomers in the same models. This may be at least partly because advertisers targeting certain groups choose representation of race and ethnicity (as well as age) among models that they believe will appeal to them and be relatable, and they’re presupposing that Boomers and Gen X are more likely to purchase if the models are older and white. These are the images that get created, and then end up in the training data, so that’s what the models learn to produce.

The point I want to make is that these are not free of influence from culture and society – whether that influence is good or bad. The training data came from human creations, so the model is bringing along all the social baggage that those humans had.

The point I want to make is that these are not free of influence from culture and society – whether that influence is good or bad. The training data came from human creations, so the model is bringing along all the social baggage that those humans had.

Because of this reality, I think that asking whether we can learn about generations from the images that models produce is kind of the wrong question, or at least a misguided premise. We might incidentally learn something about the people whose creations are in the training set, which may include selfies, but we’re much more likely to learn about the broader society, in the form of people taking pictures of others as well as themselves, the media, and commercialism. Some (or even a lot) of what we’re getting, especially for the older groups who don’t contribute as much self-generated visual media online, is at best perceptions of that group from advertising and media, which we know has inherent flaws.

Is there anything to be gained about generational understanding from these images? Perhaps. I’d say that this project can potentially help us see how generational identities are being filtered through media, although I wonder if it is the most convenient or easy way to do that analysis. After all, we could go to the source – although the aggregation that these models conduct may be academically interesting. It also may be more useful for younger generations, because more of the training data is self-produced, but even then I still think we should remember that we imbue our own biases and agendas into the images we put out into the world about ourselves.

As an aside, there is a knee-jerk impulse among some commentators to demand some sort of whitewashing of the things that models like this create— that’s how we get models that will create images of Nazi soldiers of various racial and ethnic appearances. As I’ve written before, this is largely a way to avoid dealing with the realities about our society that models feed back to us. We don’t like the way the mirror looks, so we paint over the glass instead of considering our own face.

Of course, that’s not completely true either – all of our norms and culture are not going to be represented in the model’s output, only that which we commit to images and feed in to the training data. We’re seeing some slice of our society, but not the whole thing in a truly warts-and-all fashion. So, we must set our expectations realistically based on what these models are and how they are created. We are not getting a pristine picture of our lives in these models, because the photos we take (and the ones we don’t take, or don’t share), and the images media creates and disseminates, are not free of bias or objective. It’s the same reason we shouldn’t judge ourselves and our lives against the images our friends post on Instagram – that’s not a complete and accurate picture of their life either. Unless we implement a massive campaign of Photography and image labeling that pursues accuracy and equal representation, for use in training data, we are not going to be able to change the way this system works.

Conclusion

Getting to spend time with these ideas has been really interesting for me, and I hope the analysis is helpful for those of you who use these kinds of models regularly. There are lots of issues with using generative AI image generating models, from the environmental to the economic, but I think understanding what they are (and aren’t) and what they really do is critical if you choose to use the models in your day to day.


Read more from me at www.stephaniekirmer.com.


Further Reading

Seeing Our Reflection in LLMs

https://www.theverge.com/2024/2/21/24079371/google-ai-gemini-generative-inaccurate-historical

The project: https://bit.ly/genaiSK

The post A Critical Look at AI Image Generation appeared first on Towards Data Science.

Consent in Training AI https://towardsdatascience.com/consent-in-training-ai-75a377f32f65/ Wed, 02 Oct 2024 04:17:53 +0000 https://towardsdatascience.com/consent-in-training-ai-75a377f32f65/ Should you have control over whether information about you gets used in training generative AI?

The post Consent in Training AI appeared first on Towards Data Science.

I’m sure lots of you reading this have heard about the recent controversy where LinkedIn apparently began silently using user personal data for training LLMs without notifying users or updating their privacy policy to allow for this. As I noted at the time over there, this struck me as a pretty startling move, given what we increasingly know about regulatory postures around AI and general public concern. In more recent news, online training platform Udemy has done something somewhat similar, where they quietly offered instructors a small window for opting out of having their personal data and course materials used in training AI, and have closed that window, allowing no more opting out. In both of these cases, businesses have chosen to use passive opt-in frameworks, which can have pros and cons.

To explain what happened in these cases, let’s start with some level setting. Social platforms like Udemy and LinkedIn have two general kinds of content related to users. There’s personal data, meaning information you provide (or which they make educated guesses about) that could be used alone or together to identify you in real life. Then, there’s other content you create or post, including things like comments or Likes you put on other people’s posts, slide decks you create for courses, and more. Some of that content probably doesn’t qualify as personal data, because it has no real possibility of identifying you individually. That doesn’t mean it isn’t important to you, but data privacy law doesn’t usually cover those things. Legal protections in various jurisdictions, when they exist, usually cover personal data, so that’s what I’m going to focus on here.

The LinkedIn Story

LinkedIn has a general and very standard policy around the rights to general content (not personal data), where they get non-exclusive rights that permit them to make this content visible to users, generally making their platform possible.

However, a separate policy governs data privacy, as it relates to your personal data instead of the posts you make, and this is the one that’s been at issue in the AI training situation. Today (September 30, 2024), it says:

How we use your personal data will depend on which Services you use, how you use those Services and the choices you make in your settings. We may use your personal data to improve, develop, and provide products and Services, develop and train artificial intelligence (AI) models, develop, provide, and personalize our Services, and gain insights with the help of AI, automated systems, and inferences, so that our Services can be more relevant and useful to you and others. You can review LinkedIn’s Responsible AI principles here (https://www.linkedin.com/help/linkedin/answer/a5538339?hcppcid=search) and learn more about our approach to generative AI here. Learn more about the inferences we may make, including as to your age and gender and how we use them.

Of course, it didn’t say this back when they started using your personal data for AI model training. The earlier version from mid-September 2024 (thanks to the Wayback Machine) was:

How we use your personal data will depend on which Services you use, how you use those Services and the choices you make in your settings. We use the data that we have about you to provide and personalize our Services, including with the help of automated systems and inferences we make, so that our Services (including ads) can be more relevant and useful to you and others.

In theory, "with the help of automated systems and inferences we make" could be stretched in some ways to include AI, but that would be a tough sell to most users. However, before this text was changed on September 18, people had already noticed that a very deeply buried opt-out toggle had been added to the LinkedIn website that looks like this:

Screenshot by the author from linkedin.com

(My toggle is Off because I changed it, but the default is "On".)

This suggests strongly that LinkedIn was already using people’s personal data and content for generative AI development before the terms of service were updated. We can’t tell for sure, of course, but lots of users have questions.

The Udemy Story

For Udemy’s case, the facts are slightly different (and new facts are being uncovered as we speak) but the underlying questions are similar. Udemy teachers and students provide large quantities of personal data as well as material they have written and created to the Udemy platform, and Udemy provides the infrastructure and coordination to allow courses to take place.

Udemy published an Instructor Generative AI policy in August, and this contains quite a bit of detail about the data rights they want to have, but it is very short on detail about what their AI program actually is. From reading the document, I’m very unclear as to what models they plan to train or are already training, or what outcomes they expect to achieve. It doesn’t distinguish between personal data, such as the likeness or personal details of instructors, and other things like lecture transcripts or comments. It seems clear that this policy covers personal data, and they’re pretty open about this in their privacy policy as well. Under "What We Use Your Data For", we find:

Improve our Services and develop new products, services, and features (all data categories), including through the use of AI consistent with the Instructor GenAI Policy (Instructor Shared Content);

The "all data categories" they refer to include, among others:

  • Account Data: username, password, but for instructors also "government ID information, verification photo, date of birth, race/ethnicity, and phone number" if you provide it
  • Profile Data: "photo, headline, biography, language, website link, social media profiles, country, or other data."
  • System Data: "your IP address, device type, operating system type and version, unique device identifiers, browser, browser language, domain and other systems data, and platform types."
  • Approximate Geographic Data: "country, city, and geographic coordinates, calculated based on your IP address."

But all of these categories can contain personal data, sometimes even PII, which is protected by comprehensive Data Privacy legislation in a number of jurisdictions around the world.

The generative AI move appears to have been rolled out quietly starting this summer, and like with LinkedIn, it’s an opt-out mechanism, so users who don’t want to participate must take active steps. They don’t seem to have started all this before changing their privacy policy, at least so far as we can tell, but in an unusual move, Udemy has chosen to make opt-out a time limited affair, and their instructors have to wait until a specified period each year to make changes to their involvement. This has already begun to make users feel blindsided, especially because the notifications of this time window were evidently not shared broadly. Udemy was not doing anything new or unexpected from an American data privacy perspective until they implemented this strange time limit on opt-out, provided they updated their privacy policy and made at least some attempt to inform users before they started training on the personal data.

(There’s also a question of the IP rights of teachers on the platform to their own creations, but that’s a question outside the scope of my article here, because IP law is very different from privacy law.)

Ethics

With these facts laid out, and inferring that LinkedIn was in fact starting to use people’s data for training GenAI models before notifying them, where does that leave us? If you’re a user of one of these platforms, does this matter? Should you care about any of this?

I’m going to suggest there are a few important reasons to care about these developing patterns of data use, independent of whether you personally mind having your data included in training sets generally.

Your personal data creates risk.

Your personal data is valuable to these companies, but it also constitutes risk. When your data is out there being moved around and used for multiple purposes, including training AI, the risk of breach or data loss to bad actors is increased as more copies are made. In generative AI there is also a risk that poorly trained LLMs can accidentally release personal information directly in their output. Every new model that uses your data in training is an opportunity for unintended exposure of your data in these ways, especially because lots of people in machine learning are woefully unaware of the best practices for protecting data.

The principle of informed consent should be taken seriously.

Informed consent is a well known bedrock principle in biomedical research and healthcare, but it doesn’t get as much attention in other sectors. The idea is that every individual has rights that should not be abridged without that individual agreeing, with full possession of the pertinent facts so they can make their decision carefully. If we believe that protection of your personal data is part of this set of rights, then informed consent should be required for these kinds of situations. If we let companies slide when they ignore these rights, we are setting a precedent that says these violations are not a big deal, and more companies will continue behaving the same way.

Dark patterns can constitute coercion.

In social science, there is quite a bit of scholarship about opt-in and opt-out as frameworks. Often, making a sensitive issue like this opt-out is meant to make it hard for people to exercise their true choices, either because it’s difficult to navigate, or because they don’t even realize they have an option. Entities have the ability to encourage and even coerce behavior in the direction that benefits business by the way they structure the interface where people assert their choices. This kind of design with coercive tendencies falls into what we call dark patterns of user experience design online. When you add on the layer of Udemy limiting opt-out to a time window, this becomes even more problematic.

This is about images and multimedia as well as text.

This might not occur to everyone immediately, but I just want to highlight that when you upload a profile photo or any kind of personal photographs to these platforms, that becomes part of the data they collect about you. Even if you might not be so concerned with your comment on a LinkedIn post being tossed in to a model training process, you might care more that your face is being used to train the kinds of generative AI models that generate deepfakes. Maybe not! But just keep this in mind when you consider your data being used in generative AI.

What to do?

At this time, unfortunately, affected users have few choices when it comes to reacting to these kinds of unsavory business practices.

If you become aware that your data is being used for training generative AI and you’d prefer that not happen, you can opt out, if the business allows it. However, if (as in the case of Udemy) they limit that option, or don’t offer it at all, you have to look to the regulatory space. Many Americans are unlikely to have much recourse, but comprehensive data privacy laws like CCPA often touch on this sort of thing a bit. (See the IAPP tracker to check your state’s status.) CCPA generally permits opt-out frameworks, where a user taking no action is interpreted as consent. However, CCPA does require that opting out is not made outlandishly difficult. For example, you can’t require opt-outs be sent as a paper letter in the mail when you are able to give affirmative consent by email. Companies must also respond in 15 days to an opt-out request. Is Udemy limiting the opt-out to a specific timeframe once a year going to fit the bill?

But let’s step back. If you have no awareness that your data is being used to train AI, and you find out after the fact, what do you do then? Well, CCPA lets the consent be passive, but it does require that you be informed about the use of your personal data. Disclosure in a privacy policy is usually good enough, so given that LinkedIn didn’t do this at the outset, that might be cause for some legal challenges.

Notably, EU residents likely won’t have to worry about any of this, because the laws that protect them are much clearer and more consistent. I’ve written before about the EU AI Act, which has quite a bit of restriction on how AI can be applied, but it doesn’t really cover consent or how data can be used for training. Instead, GDPR is more likely to protect people from the kinds of things that are happening here. Under that law, EU residents must be informed and asked to positively affirm their consent, not just be given a chance to opt out. They must also have the ability to revoke consent for use of their personal data, and we don’t know if a time limited window for such action would pass muster, because the GDPR requirement is that a request to stop processing someone’s personal data must be handled within a month.

Lessons Learned

We don’t know with clarity what Udemy and LinkedIn are actually doing with this personal data, aside from the general idea that they’re training generative AI models, but one thing I think we can learn from these two news stories is that protecting individuals’ data rights can’t be abdicated to corporate interests without government engagement. For all the ethical businesses out there who are careful to notify customers and make opt-out easy, there are going to be many others that will skirt the rules and do the bare minimum or less unless people’s rights are protected with enforcement.


Read more of my work at www.stephaniekirmer.com.


Further Reading

https://www.datagrail.io/blog/data-privacy/opt-out-and-opt-in-consent-explained

https://www.404media.co/massive-e-learning-platform-udemy-gave-teachers-a-gen-ai-opt-out-window-its-already-over/

LinkedIn Is Training AI on User Data Before Updating Its Terms of Service

US State Privacy Legislation Tracker

The privacy policies

https://web.archive.org/web/20240917144440/https://www.linkedin.com/legal/privacy-policy#use

https://www.linkedin.com/blog/member/trust-and-safety/updates-to-our-terms-of-service-2024

https://www.linkedin.com/legal/privacy-policy#use

https://www.udemy.com/terms/privacy/#section1

GDPR and CCPA

GDPR compliance checklist – GDPR.eu

California Consumer Privacy Act (CCPA)

The post Consent in Training AI appeared first on Towards Data Science.

Disability, Accessibility, and AI https://towardsdatascience.com/disability-accessibility-and-ai-0d5ab06ec140/ Mon, 16 Sep 2024 19:37:49 +0000 https://towardsdatascience.com/disability-accessibility-and-ai-0d5ab06ec140/ A discussion of how AI can help and harm people with disabilities

The post Disability, Accessibility, and AI appeared first on Towards Data Science.

Disability, Accessibility, and AI

Photo by Thought Catalog on Unsplash

I recently read a September 4th thread on Bluesky by Dr. Johnathan Flowers of American University about the dustup that occurred when organizers of NaNoWriMo put out a statement saying that they approved of people using generative AI such as LLM chatbots as part of this year’s event.

"Like, art is often the ONE PLACE where misfitting between the disabled bodymind and the world can be overcome without relying on ablebodied generosity or engaging in forced intimacy. To say that we need AI help is to ignore all of that." –Dr. Johnathan Flowers, Sept 4 2024

Dr. Flowers argued that by specifically calling out this decision as an attempt to provide access to people with disabilities and marginalized groups, the organizers were downplaying the capability of these groups to be creative and participate in art. As a person with a disability himself, he notes that art is one of a relatively few places in society where disability may not be a barrier to participation in the same way it is in less accessible spaces.

Since the original announcement and this and much other criticism, the NaNoWriMo organizers have softened or walked back some of their statement, with the most recent post seeming to have been augmented earlier this week. Unfortunately, as so often happens, much of this conversation on social media devolved into an unproductive discussion.

I’ve talked in this space before about the difficulty in assessing what it really means when generative AI is involved in art, and I still stand by my point that as a consumer of art, I am seeking a connection to another person’s perspective and view of the world, so AI-generated material doesn’t interest me in that way. However, I have not spent as much time thinking about the role of AI as Accessibility tooling, and that’s what I’d like to discuss today.

I am not a person with physical disability, so I can only approach this topic as a social scientist and a viewer of that community from the outside. My views are my own, not those of any community or organization.

Framing

In a recent presentation, I was asked to begin with a definition of "AI", which I always kind of dread because it’s so nebulous and difficult, but this time I took a fresh stab at it, and read some of the more recent regulatory and policy discussions, and came up with this:

AI: use of certain forms of machine learning to perform labor that otherwise must be done by people.

I’m still workshopping, and probably will be forever as the world changes, but I think this is useful for today’s discussion. Notice that this is NOT limiting our conversation to generative AI, and that’s important. This conversation about AI specifically relates to applying machine learning, whether it involves deep learning or not, to completing tasks that would not be automatable in any other way currently available to us.

Social theory around Disability is its own discipline, with tremendous depth and complexity. As with discussions and scholarship examining other groups of people, it’s incredibly important for actual members of this community not only to have their voices heard, but to lead discussions about how they are treated and their opportunities in the broader society. Based on what I understand of the field, I want to prioritize concerns about people with disabilities having the amount of autonomy and independence they desire, with the amount of support necessary to have opportunities and outcomes comparable to people without disabilities. It’s also worth mentioning that much of the technology that was originally developed to aid people with disabilities is assistive to all people, such as automatic doors.

AI as a tool

So, what role can AI really play in this objective? Is AI a net good for people with disabilities? Technology in general, not just AI-related development, has been applied in a number of ways to provide autonomy and independence to people with disabilities that would not otherwise be possible. Anyone who has, like me, been watching the Paris Paralympics these past few weeks will be able to think of examples of technology used in this way.

But I’m curious what AI brings to the table that isn’t otherwise there, and what the downsides or risks may be. It turns out quite a bit of really interesting scholarly research has already been done on this question, and more continues to be released. I’m going to give a brief overview of a few key areas and provide more sources if you happen to be interested in a deeper dive into any of them.

Positives

Neurological and Communication Issues

This seems like it ought to be a good wheelhouse for AI tools. LLMs have great usefulness for restating, rephrasing, or summarizing texts. When individuals struggle with reading long texts/concentration, having the ability to generate accurate summaries can make the difference between a text’s themes being accessible to those people or not. This isn’t necessarily a substitution for the whole text, but just might be a tool augmenting the reader’s understanding. (Like Cliff Notes, but for the way they’re supposed to be used.) I wouldn’t recommend things like asking LLMs direct questions about the meaning of a passage, because that is more likely to produce error or inaccuracies, but summarizing a text that already exists is a good use case.

Secondarily, people with difficulty in either producing or consuming spoken communication can get support from AI tools. The technologies can either take spoken text and generate highly accurate automatic transcriptions, which may be easier for people with forms of aphasia to comprehend, or it can allow a person who struggles with speaking to write a text and convert this to a highly realistic sounding human spoken voice. (Really, AI synthetic voices are becoming so amazing recently!)

This is not even getting into the ways that AI can help people with hearing impairment, either! Hearing aids can use models to identify and isolate the sounds the user wants to focus on, and diminish distractions or background noise. Anyone who’s used active noise canceling is benefiting from this kind of technology, and it’s a great example of things that are helpful for people with and without disabilities both.

Vision and Images

For people with visual impairments, there may be barriers to digital participation, including things like poorly designed websites for screen readers, as well as the lack of alt text describing the contents of images. Models are increasingly skilled at identifying objects or features within images, and this may be a highly valuable form of AI if made widely accessible so that screen reading software could generate its own alt text or descriptions of images.

Physical Prosthetics

There are also forms of AI that help prosthetics and physical accessibility tools work better. I don’t mean necessarily technologies using neural implants, although that kind of thing is being studied, but there are many models that learn the physics of human movement to help computerized powered prosthetics work better for people. These can integrate with muscles and nerve endings, or they can subtly automate certain movements that help with things like fine motor skills with upper limb prosthetics. Lower body limb prosthetics can use AI to better understand and produce stride lengths and fluidity, among other things.

Negatives

Representation and Erasure

Ok, so that is just a handful of the great things that AI can do for disability needs. However, we should also spend some time discussing the areas where AI can be detrimental for people with disabilities and our society at large. Most of these areas are about the cultural production using AI, and I think they are predominantly caused by the fact that these models replicate and reinforce social biases and discrimination.

For example:

  • Because our social structures don’t prioritize or highlight people with disabilities and their needs, models don’t either. Our society is shot through with ableism and this comes out in texts produced by AI. We can explicitly try to correct for that in prompt engineering, but a lot of people won’t spend the time or think to do that.
  • Similarly, images generated by AI models tend to erase all kinds of communities that are not dominant culturally or prioritized in media, including people with disabilities. The more these models use training data that includes representation of people with disabilities in positive context, the better this will get, but there is always a natural tension between representation proportions being true to life and having more representation because we want to have better visibility and not erasure.

Data Privacy and Ethics

This area has two major themes that have negative potential for people with disabilities.

  • First, there is a high risk of AI being used to make assumptions about desires and capabilities of people with disabilities, leading to discrimination. As with any group, asking AI what the group might prefer, need, or find desirable is no substitute for actually getting that community involved in decisions that will affect them. But it’s easy and lazy for people to just "ask AI" instead, and that is undoubtedly going to happen at times.
  • Second, data privacy is a complicated topic here. Specifically, when someone is using accessibility technologies, such as a screen reader for a cell phone or webpage, this can create inferred data about disability status. If that data is not carefully protected, the disability status of an individual, or the perceived status if the inference is wrong, can be a liability that will subject the person to risks of discrimination in other areas. We need to ensure that whether or not someone is using an accessibility tool or feature is regarded as sensitive personal data just like other information about them.

Bias in Medical Treatment

When the medical community starts using AI in their work, we should take a close look at the side effects for marginalized communities including people with disabilities. Similarly to how LLM use can mean the actual voices of people with disabilities are overlooked in important decision making, if medical professionals are using LLMs to advise on the diagnosis or therapies for disabilities, this advice will be affected by the social and cultural negative biases that these models carry.

This might mean that non-stereotypical or uncommon presentations of disability may be overlooked or ignored, because models necessarily struggle to understand outliers and exceptional cases. It may also mean that patients have difficulty convincing providers of their lived experience when it runs counter to what a model expects or predicts. As I’ve discussed in other work, people can become too confident in the accuracy of machine learning models, and human perspectives can be seen as less trustworthy in comparison, even when this is not a justifiable assertion.

Access to technologies

There are quite a few other technologies I haven’t had time to cover here, but I do want to make note that the mere existence of a technology is not the same thing as people with disabilities having easy, affordable access to these things to actually use. People with disabilities are often disadvantaged economically, in large part because of unnecessary barriers to economic participation, so many of the exceptional advances are not actually accessible to lots of the people who might need them. This is important to recognize as a problem our society needs to take responsibility for – as with other areas of healthcare in the United States in particular, we do a truly terrible job meeting people’s needs for the care and tools that would allow them to live their best lives and participate in the economy in the way they otherwise could.

Conclusions

This is only a cursory review of some of the key issues in this space, and I think it’s an important topic for those of us working in machine learning to be aware of. The technologies we build have benefits and risks both for marginalized populations, including people with disabilities, and our responsibility is to take this into account as we work and do our best to mitigate those risks.

Further Reading

https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2020.571955/full?ref=blog.mondato.com

https://slate.com/technology/2024/09/national-novel-writing-month-ai-bots-controversy.html

Professorial Lecturer


Read more of my work at www.stephaniekirmer.com.

The post Disability, Accessibility, and AI appeared first on Towards Data Science.

Writing a Good Job Description for Data Science/Machine Learning https://towardsdatascience.com/writing-a-good-job-description-for-data-science-machine-learning-bd98f29c75cb/ Fri, 16 Aug 2024 15:25:50 +0000 https://towardsdatascience.com/writing-a-good-job-description-for-data-science-machine-learning-bd98f29c75cb/ Things to do and things to avoid in order to find the right candidates for your open position

The post Writing a Good Job Description for Data Science/Machine Learning appeared first on Towards Data Science.

I’ve probably been involved in the hiring process for data scientists a dozen times or more over my career, while never being the hiring manager myself, and I have been closely involved in writing the job description for several of these. It kind of seems like this should be easy – you’re just trying to convince people to apply for your job, so you can pick the one you like best, right?

Well, it’s actually more complicated than that. Most of the people out there in the world are not qualified for any given job, and even among those who are qualified, there may be reasons they wouldn’t like working in this role. It’s not a one-way street; you don’t want just anybody to apply, you want the best suited people, for whom this job would work, to apply. So, how do you thread that needle? What should you write?

This column is only my opinion and does not represent the views of my employer. I have not been involved in writing any job descriptions my current employer has posted, for ML or anything else.

Why write a Job Description?

To figure out what to write, let’s break down what a good job description is supposed to do, for a DS/ML job or for any other kind.

  • Explain to candidates what the job is, and what they would do in the job
  • Explain to candidates what qualifications you’re looking for in applicants

These are the bare essential functions, although there are several other things your job description posting should also do:

  • Make your organization seem like an attractive place to work for a diverse pool of qualified candidates
  • Describe the compensation, work circumstances, and benefits, so candidates can decide whether to bother applying

With this, we’re starting to get into more subjective and complicated components, in some ways.

In some spots, I’m going to give advice for two different scenarios: first, for a small organization with few or zero existing DS/ML staff members, and second, for a medium or large sized organization with some DS/ML staff. These two can be quite different situations, with different needs and challenges in certain areas.

You may notice I’m using "DS/ML" a lot in this article – I consider the advice here good for people hiring data scientists as well as those hiring machine learning engineers, so I want to be inclusive where possible. Sorry it’s a little clunky.

What is this job?

Firstly, for any organization, consider what kind of role you have open. I’ve written in the past about the different kinds of data scientist, and I’d strongly recommend taking a look and seeing what archetypes your role fits into. Think about how this person will fit into your organization, and be clear about that as you proceed.

The Small Organization

A challenge, especially for small organizations with limited or no existing DS/ML capability or expertise, is that you don’t really know what your ML Engineer or Data Scientist may end up doing. You know what general outcomes you’d like this person to produce, but you don’t know how they’ll achieve them, because this isn’t your area of expertise!

However, you’ll still need to figure out some way to describe the role’s responsibilities anyway. I advise being honest and up-front about the level of Data Science sophistication at your company, and explaining the outcomes you’re hoping to see. Candidates with enough experience and skill to help you will be able to conceptualize how they’d attack the problem, and in the interview process you should ask them to do that. You should have some kind of project or goal in mind for this person, otherwise why are you hiring in the first place?

The Larger Organization

In this case, you already have at least a couple of DS/ML staff members, so you can hopefully call on those folks to tell you what the job is like day to day for an IC. Ask them! It’s surprising how often you’ll find HR or management not actually taking advantage of the expertise they already have in house in situations like this.

However, you should also determine whether this new hire is going to be doing mostly or entirely the same thing as someone already in place, or whether they may end up filling a different kind of gap. If your existing problem is just not having enough skilled hands to do all the work on your plate, then it’s probably reasonable to expect the new hire will be filling a role similar to what’s there. But, if you are hiring someone for a very specific skillset (say, a new NLP problem came up and nobody on your team knows that stuff very well), then make sure you are clear in your job posting about the unique responsibilities this role will have to pioneer.

What do you need?

This brings us to an important point, as well – how much experience and which skills does your candidate need to have in order to successfully do the job?

The Small Organization

  • Experience: If this person is your first or second DS/ML hire, do not hire someone without some substantial work experience. These folks will cost more, but in your situation, you need someone who can be very self-directed and who has seen data science and Machine Learning practice done well in other professional settings already. This might go without saying, but you have little or no in-house capacity to train this person on the job, so you need them to already have acquired training from other previous roles.
  • Technical skills: But what skills do you really need to look for, then? What technical competencies, programming languages, etc will someone need to have to pursue your goals effectively? Beyond making sure that they can use Python, I’d recommend seeking advice from other practitioners in the field already if you can, to ask them what the skillset for your needs should look like. This changes a lot, as this is a very fast-moving discipline, so I can’t tell you today what your Data Scientist or MLE will need to be able to do a year from now. (I can tell you that asking for a Ph.D. is almost definitely not the answer.)

If you do go looking for advice, make sure you’re consulting people who are practicing DS/ML on the ground, not just "thought leaders" or people who market themselves as recruiting whisperers. If you don’t know anybody directly who fits the bill, try looking through your network or reaching out to DS/ML professional organizations. Take a look at other job postings that sound like what you need, but be cautious, since these other postings may not be that good either.

Regardless, take this seriously – if you write unrealistic, unreasonable, or absurdly irrelevant/outdated skills in your job description, you will turn off qualified candidates because they’ll recognize "Oh, this company doesn’t know what they’re doing", and that will defeat the whole point of this exercise.

Another option is finding a freelancer in data science/machine learning to get you started, instead of hiring someone yourself at all. There are a lot of fractional or freelance practitioners these days, as well as consulting firms that can take this whole problem off your plate. A quick google of "fractional data scientist" produces lots of options, but remember to do your due diligence.

The Larger Organization

  • Experience: I’m a big believer in hiring less senior folks and training them up, if your organization can handle it. New entrants to the field have to learn somehow, and business experience is often the biggest gap in a new data scientist’s skillset. Consider whether you really need to hire a Senior Staff Machine Learning Engineer, or whether you could promote internally and backfill a junior person. There’s no right or wrong answer, but give it some thought instead of jumping right to hiring the most senior level. We senior folks are both expensive and rare!
  • Technical skills: As with the job responsibilities, this is the time to ask your existing team for their advice. Don't just ask them what tech they use; also ask what they might like to learn if someone skilled enough to share that knowledge were brought on. (These skills go in Optional or Nice to Have, not Requirements!) You already have a DS/ML tech stack in place, of course, so the new person will need to be able to work with that, but if there are adjacent or newer technologies that might benefit your team, this is a good time to find out and potentially bring them on board. Don't fall into the trap of asking only for the same stuff everyone in your org already uses, without giving any weight to additional competencies.

Also keep in mind what your candidates need to already have, in contrast with what they could learn on the job from your team. Don’t inflate your requirements to make the role sound more prestigious, or to artificially weed out candidates, especially if you’re not paying commensurate with those inflated requirements, because you’ll be shooting yourself in the foot. You’ll be deterring the candidates who might be a good fit for the level, and getting overqualified people in the pipeline who wouldn’t accept the pay you have available. And don’t ask for a Ph.D. if it’s not vital! (It’s almost never vital.)

The Job Title

It may seem insignificant, but once you’ve defined the role, picking a title to post really does send signals to candidates out there deciding what to apply for. I’ve talked in other pieces about the evolution of titles in data science, and this continues to change over time. But my shorthand advice, at least today, is:

  • Data Scientist: Not responsible for data engineering, pipelining, or doing their own deployment, although they may be capable of it. May do BI or analytics as well as model development.
  • Machine Learning Engineer: Responsible for any or all of data engineering, pipelining, and doing their own deployment. Does model development, but minimal or no BI or analytics work.

For leveling, here's a very rough rule of thumb (your mileage may vary significantly):

  • Junior or Associate: Fresh out of school. No work experience. Maybe an internship.
  • No Level: May have had one or two professional jobs, or 2–3 years of experience.
  • Senior: 3+ years of professional experience.

Beyond that, there are higher levels that some orgs have and some don’t:

  • Staff: Maybe 6–10 years of professional experience.
  • Principal, Senior Staff, etc.: More than that. It varies so widely between orgs that it's really hard to say.

So if you want someone who can do their own pipelines, deployment, and modeling, you don't need them to do analytics, and you want them to have several years of experience, then Senior Machine Learning Engineer is what you should write. If you are looking for someone fresh out of school to do some modeling and analytics, while engineers handle the deployment, then you need an Associate Data Scientist.

This advice is subject to change as the field continues to evolve. If you really want to write something special like Machine Learning Scientist, I’d advise against it unless you have a really good explanation as to why. Clarity and findability are key here – use the terms your candidates will be familiar with and searching for.

Selling Yourselves

Now we can move on to your pitch: selling your organization as a good place to work, and sharing the compensation and benefits you have to offer. We've spent a lot of time telling candidates what they need to bring, and what they need to be prepared to do if they get this job, but that's not all a job description is about. You also need to advertise your company and department as an appealing place to work, in order to get the best candidates on your radar. This advice is mostly generalizable to any organization size.

Don’t Lie

I have a few rules of thumb when it comes to describing a company to job candidates, in writing or in interviews. The main one is Don't Lie. Don't say you have a "fast-paced culture" when it takes three weeks to deploy. Don't say you "value work-life balance" when no one on the team has taken a vacation in a year. And DON'T say "remote" when you mean "hybrid," for Pete's sake! You may think you're just throwing in nice-sounding boilerplate, but these words mean something.

Feeling like you got bait-and-switched into joining an organization that is a bad fit is awful. Think of it like selling a product – if you overpromise and underdeliver, maybe you made that initial sale, but that customer is going to churn and be out the door with a bad taste in their mouth as soon as they realize their mistake. Then not only have you lost that customer, you also have someone out there in the world with a bad opinion of your company, who may be telling their whole network about the experience.

If you can’t think of good selling points for your company that aren’t either lies or stretching the truth, then you need to take a cold hard look at your company’s operations.

Being honest will not only make your eventual hire a better fit, it will also attract candidates who genuinely want to work at a company like yours. Everyone has different wants and needs from a job, and not everyone wants to work at a place that "works hard and plays hard." There's no single right culture for companies, and owning the culture your company actually has will get you the candidates who could be happy working there.

Value Diversity

Another important point is making sure, and showing, that your company values and includes the full range of diversity among your staff. Your job description is the candidate's first introduction to how you take care of your people, regardless of protected class or general diversity of experience, background, ability, and so on. That means you need to consider your choices of language very carefully. Unless you really mean it, don't ask for an "expert" in a skill set. Don't say your candidates must be "rock stars." This deters candidates with reasonable humility about their skills, and it also makes your organization sound, well, kind of like jerks.

Note: the old saw we've all heard a million times, that "women don't apply to a job unless they meet all the requirements," is very tired and problematic for many reasons, but it is a useful reminder to ask for the skills you actually need, not just a laundry list of wishes.

Instead, use inclusive language. I advise writing your desired qualifications in the form "Successful candidates can do …," followed by action-oriented items like "build machine learning models using Python" or "perform model evaluation using appropriate metrics such as recall, precision, MAE, RMSE, etc." Be clear, and make it easy for someone to say "oh, I can do that" or "nope, I can't do that."
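
To make that last item concrete, here is a minimal sketch of what "perform model evaluation using appropriate metrics" might look like in practice, assuming scikit-learn is available; the toy labels and predictions are purely illustrative.

from sklearn.metrics import (
    precision_score,
    recall_score,
    mean_absolute_error,
    mean_squared_error,
)

# Classification example: precision and recall on toy labels
y_true_cls = [1, 0, 1, 1, 0, 1]
y_pred_cls = [1, 0, 0, 1, 0, 1]
print("Precision:", precision_score(y_true_cls, y_pred_cls))
print("Recall:", recall_score(y_true_cls, y_pred_cls))

# Regression example: MAE and RMSE on toy values
y_true_reg = [3.0, 5.5, 2.1, 7.8]
y_pred_reg = [2.8, 6.0, 2.5, 7.2]
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
print("RMSE:", mean_squared_error(y_true_reg, y_pred_reg) ** 0.5)

A candidate who can read, write, and explain a snippet like this clearly meets that qualification, which is exactly the kind of unambiguous bar an action-oriented bullet sets.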

If you know your pool of potential candidates is very homogeneous, for example because not many people of color get college degrees in your field, consider whether you need to take extra steps to get your job in front of those candidates. Take the time to post jobs on diversity-oriented job boards, and share your posting with professional organizations for different kinds of people. If your posting never gets seen by varied individuals, you won’t get varied candidates applying.

Compensation and Benefits

Now, this should really go without saying, but be transparent and clear about the benefits and compensation for the role. Give a compensation range even if you don't have to by law. If you're not hiring in a state that requires one, you may think this isn't an issue for you, but it is: candidates with choices will prefer to apply to postings where they can clearly see the pay is commensurate with their expectations. Leaving off a compensation range (or giving a range spanning $100k, which makes it effectively useless) makes you look exploitative. Get with the times and give a reasonable range.

Also, I already mentioned it but it bears repeating – be honest about the working circumstances. Don’t advertise a job as "remote" only to reveal in the interviews that it’s 3 days a week on site. That’s also really bad practice and a rude waste of everyone’s time. Give candidates the details they need to make an informed decision about applying.

Beyond that, remember that health insurance is important to anyone in America looking for a job, and be as clear as you can about what you are offering. If you can list the insurance carrier, do that; it may help people know whether their doctor or provider would be in network. It's not a huge deal to every candidate, but many candidates, including those with disabilities or health concerns (or dependents with health concerns), will appreciate it.

Conclusion

Hiring for technical roles, including DS/ML, is hard. This advice might all sound like a lot of work you’d rather avoid, but consider: the alternative is weeding through thousands of applications from terribly unqualified candidates, or candidates who would never accept the job. Do some work up front so you’re not wasting your own time and that of the applicants down the road. It’s not only more efficient, it’s also the ethical choice. Applicants are real people and deserve to be treated as such.

To recap:

  • Figure out what the job would do (or what outcomes you want to see)
  • Figure out what the experience level and technical skillset needs to be (not your dream wish list, but realistic needs)
  • Write a job title that’s clear, accurate, and searchable
  • Don’t lie about your organization or the job
  • State the compensation range up front, and describe the benefits

Good luck out there!


Read more of my work at www.stephaniekirmer.com.


Further Reading

DEI Hiring: How to Create Inclusive Job Descriptions

Archetypes of the Data Scientist Role

Your Data Scientist Does Not Need a STEM Ph.D.

Machine Learning Engineers – what do they actually do?

Why Women Don’t Apply for Jobs Unless They’re 100% Qualified

