
Welcome back to "Learning from Machine Learning," a series of interviews that delves into the fascinating world of machine learning. As AI revolutionizes our world, each episode offers technical insights along with Career Advice and life lessons from leading practitioners in the field.
In this episode, Vincent Warmerdam, creator of calmcode and koaning.io, ML Engineer at Explosion, and former Research Advocate at Rasa, shares his experiences and advice. He has held numerous roles in the data science world including consultant, advocate, trainer, lead, educator and recruiter. His work has focused on simplifying processes, utilizing models effectively and educating others.
His site calmcode provides high-quality, easily consumable educational data science content and has garnered consistent positive feedback. It currently attracts thousands of users monthly and has become a valuable resource for data scientists. Vincent helped organize PyData Amsterdam, and he has been a consistent speaker at PyData and NormConf. His talks are enjoyable and engaging (check them out on YouTube).
Vincent has created dozens of tools to help in machine learning development, data processing, testing, and natural language processing.
Vincent’s passion for building practical tools, sharing his work openly and addressing real-world problems with simplicity and calmness has influenced the way that I do my work.
Summary
Vincent has had an unconventional career journey and entered data science while it was still emerging. He attributes his success to a combination of luck, the popularity of his blog, organizing meetups, and his open-source contributions, which resulted in job offers directly from CEOs and CTOs instead of going through traditional recruitment processes. His academic background includes studying operations research and design, which gave him a unique perspective on algorithms and constraints. He also mentioned that his gig as a bartender at a comedy club helped sharpen his presentation skills.
After a family member received a wrong medical diagnosis, Vincent was drawn to machine learning as a way to use algorithms to make better decisions and predictions. He took the opportunity to focus on machine learning during his Master's thesis.
His open source work has been driven by his desire to solve specific problems and sometimes to scratch his own itches. He started with his first PyPI project, evol, and developed libraries like scikit-lego, human-learn, whatlies, and doubtlab based on his needs and experiences at different companies.
The bulk library emerged as a result of his work on human-learn and whatlies. It leverages embeddings, dimensionality reduction and clustering to facilitate bulk labeling and data exploration. While the tool does not provide perfect labels, it helps lower the barrier to entry and build intuition when working on new datasets.
He started calmcode as a response to the lack of quality educational content in data science, aiming to provide concise and opinionated tutorials that focus on getting the idea across effectively. His lessons consist of a series of short, easily digestible videos. He focuses on teaching tools and approaches to improve day-to-day data science work.
Vincent’s experience in the field provided great perspective on the current state of machine learning. We were able to discuss the limitations of generative models, the role of rule-based systems to complement ML models and the complexities of ground truth labeling.
Vincent’s advice was real and poignant. He cautions against placing too much emphasis on monetary gains early in your career, urging you instead to prioritize discovering your passion and true motivations. He grapples with contrasting feelings about the field: machine learning is useful and needs skilled professionals, yet it carries an abundance of hype. Most importantly, Vincent emphasizes gaining a clear understanding of what you can and cannot influence, and focusing your energy on areas where you can make a difference.
Advice & Takeaways
- Understanding the problem correctly is most important – a better algorithm may not always yield better outcomes if applied to the wrong problem.
- Rephrasing problems can be beneficial – think holistically about system optimization rather than focusing solely on improving individual components.
- "People often forget that the algorithm is usually just a cog in the system. And we’re interested in building a better system, not a better cog. So, if you’re building a better cog but it doesn’t fit the rest, it’s not a better cog because you don’t get a better system."
- Consider starting a blog where you share what you learn. Writing short "today I learned" snippets can be an easy way to build an online presence and demonstrate your continuous learning.
- Don’t feel the pressure to know everything when starting out. Consider related roles like being an analyst, which can provide valuable skills and knowledge that will benefit your data science journey.
- "You can make python packages more often than you might think, so just build one." More people should create their own Python packages, even for internal tools or small helper functions. Not enough people take advantage of the benefits of packaging and reusing their own code.
- Stepping outside the machine learning bubble can be refreshing and provide inspiration. Interacting with people who are not in the field and understanding their perspectives can help you create better applications and focus on the human aspect of your work.
- Be careful of getting too focused on money early in your career. It’s important to figure out what you enjoy and what makes you tick rather than just chasing a higher salary.
- "I’m kind of a mixed bag when it comes to the whole machine learning thing. Part of my opinion is it’s a super useful tool and we need more good people doing machine learning. But at the same time, it’s like a gross bucket of hype that we really want to have less of. And my day to day is to sort of deal with both of these feelings."
- "Calling .fit and .predict are the easy bits. It’s all the stuff around that. It’s way trickier. Especially when you consider themes of fairness, all the things that can go wrong, can we really know that upfront? I don’t know if you always can."
- "There’s some stuff that you can control, some stuff that you can’t. Just make sure that you understand what you can and can not control and then move on from there."
Table of Contents
- Introduction
- Summary
- Advice & Takeaways
- Full Interview
- Welcome
- Background
- Academic Background
- Machine Learning Attraction
- Calmcode
- Open Source
- Bulk
- Understanding the Problem
- Unanswered Questions in Machine Learning
- Generative and Predictive Machine Learning
- Influences
- Career Advice
- Advice for New Data Scientists
- Learning from Machine Learning Video Interview
- Spotify Audio
- Previous Episode featuring Maarten Grootendorst
- References
Full Interview
(Note: This interview was lightly edited for clarity.)
Welcome
Seth: Welcome to Learning from Machine Learning. On this episode, I have the pleasure of having Vincent Warmerdam. He’s currently a machine learning engineer at Explosion, the company behind spaCy and Prodigy. Vincent is an educator, a blogger, and a consistent PyData speaker.
He’s created many valuable open source tools. He’s endorsed for awesomeness on LinkedIn over a hundred times. And he’s truly an inspiring force in the data science community. Welcome to the podcast.
Vincent: Hi. That comment is making me check. Do I have a hundred endorsements? I didn’t know.
Seth: Over.
Vincent: Oh, okay, cool.
It’s an inside joke I have with a former colleague, whoever can get the most. I’ve got more than ninety-nine endorsements for awesomeness on LinkedIn now. Nice!
Background
Seth: Why don’t you give us some background on your career journey? How did you get to where you are today?
Vincent: It’s a little bit hard to give proper career advice because I want to just recognize I’m a little bit privileged and I got lucky a whole bunch. When I started this whole data science thing, this was the era when random forests were kind of new. And if you could just use a random forest, you were already way better than all the econometricians with linear models. So can you run fit and predict? Bang. You’ve got a job. And I was kind of in that era at the right time.
After that though, I started blogging. I started helping out by arranging some meetups. There’s a machine learning meetup in Amsterdam – I helped organize PyData Amsterdam.
And I had a pretty popular blog as well, so people started recognizing me for that. And, at some point, that recognition gets you places: you get invited to speak, and then people sort of see you as an authority figure. I’m not; I like to think I have sensible ideas, but I try to be modest about that. But that has kind of been the story of my career, because the people who knew my blog would usually include the CTO of a company. And then a CTO would say, "Hey, I like your blog. Can we just have a beer?"
And then usually, that led to the job offer. I have yet to talk to a recruiter. Honestly. I’ve never been hired through the recruitment pipeline so far in my career.
It’s always been via the CEO or CTO because they knew of my work beforehand. And this is very weird, this is a very bad story, because it is very hard to replicate for others; I just got very lucky when it comes to this. I do think having a blog and being able to get recognized is very useful, just very hard to replicate. But I also like to think that some of the side projects that I did for open source definitely helped out. Calmcode is something that people know me for these days. There’s a saying: plant a thousand flowers and one of them will be a lotus.
…plant a thousand flowers and one of them will be a lotus

I subscribe to that idea, but a lot of it comes down to luck and a bit of privilege as well. I just want to be honest about that. Being able to be recognizable has proven to be useful to me.
Seth: Yeah. Definitely. I mean, everyone’s journey is unique. What are the roles that you’ve played at different companies?
I know you’ve been a consultant at one point, an advocate. You’ve had some interesting titles. Right?
Vincent: There was a phase at a previous company where they would really let you pick any title you wanted, and me and a couple of colleagues thought it was funny to see how far we could stretch it. So I called myself a Pokemon master because I thought that was kind of funny. My favorite title was senior person, because at some point I was just one of the old guys at the company. So I just called myself that.
As far as roles go, I’ve been doing lots of training, so at some point you call yourself a trainer. Because you’re doing consultancy, you have different roles within different teams. So I was a lead at some point as well.
I was also helping a recruiting company recruit people at companies for specific data teams. Sometimes my role would be super temporary. Sometimes it’d be for two years. But usually, I was a person on a data team trying to get the team to become productive, doing whatever was needed. The thing I like to think I’m pretty good at is keeping things very simple.
I can get a lot of mileage from linear models, which tends to work quite nicely. And before that, I also had lots of different jobs in college, and I do like to think that that kind of helped as well. My background is in operations research, but before that, I actually studied design. And as an evening job, I was a bartender at a comedy theater in the Netherlands, which I do like to think might have helped my presentation skills in a way. Having studied design for a while also makes me think differently about algorithms.
And operations research makes me think about constraints a lot when I’m doing data science things. I like to think that I have a somewhat diverse background, and it’s that somewhat diverse background that makes it easy for me to do the stuff that I’m doing now. That’s a summary.
Academic Background
Seth: That makes sense. So diving into your academic background, what was it? So, it was some operations research and some design? That sounds pretty unique there.
Vincent: Yeah, I studied industrial design engineering for a year, and then I found out it wasn’t for me. So I had to switch majors. The bachelor’s was econometrics and operations research, and the master’s was operations research. I thought it was the more interesting of the two, but also because it had a little gateway to computer science.
So if I wanted to do the computer science courses, that’d be kind of an easy way for me to get the courses I wanted to take. But I was also rowing when I was in college and did a bit of partying, so I wasn’t allowed to start my master’s just yet. So I had this one year where I just took whatever course interested me. I did a couple of courses in neuroscience, which was pretty interesting, some psychology and biology and whatnot.
Again, I like to think that the diversification of knowledge has proven to be quite useful, but the official title for my academic background is operations research. So that’s the mathematics behind optimizing systems. That’s kind of the thing that you’re taught there.
Seth: Very cool. What’s the quintessential or canonical problem in operations research that people try to solve?
Vincent: So, a very textbook example: in machine learning, you usually try to optimize towards something. You want to get the loss as low as possible or the accuracy as high as possible. And you’ve got algorithms for that. You typically take your data and the label you want to predict, you come up with some sort of loss function, and you try to get it as small as possible. And in operations research, you do a very similar thing.
It’s just that in operations research, you typically aren’t dealing with a machine learning algorithm. You’re dealing with, let’s say, "Hey, we have stocks that we would like to invest in. And oh, by the way, that also introduces constraints. Because, yes, we want to get the highest return, but of course we don’t want to overshoot the budget, and we also have a risk preference." And those are defined as hard mathematical constraints. Then if you want to optimize, it’s kind of a different ball game, because if your algorithm ever exceeds the bounds of the constraint, then you’re in sort of bad territory. So I would argue that’s the main thing.
If you’re doing operations research, then you’re taught that constraints really matter and you want to deal with those in mathematically proper ways. And that’s a different ball game. But that’s the main thing that they’re dealing with, this constrained optimization. That’s what they do.
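(Note: to make the constrained optimization idea concrete, here is a minimal sketch in Python. The returns, covariance matrix, and risk cap are invented numbers for illustration, not anything from the interview.)

```python
import numpy as np
from scipy.optimize import minimize

# Toy inputs: expected returns and a diagonal covariance matrix (all made up).
mu = np.array([0.08, 0.12, 0.05])
cov = np.diag([0.02, 0.09, 0.01])
max_variance = 0.03  # the "risk preference", expressed as a hard cap

def neg_return(w):
    # Minimize the negative expected return, i.e. maximize the return.
    return -mu @ w

constraints = [
    {"type": "eq", "fun": lambda w: w.sum() - 1.0},                 # spend the whole budget
    {"type": "ineq", "fun": lambda w: max_variance - w @ cov @ w},  # stay under the risk cap
]
bounds = [(0.0, 1.0)] * len(mu)  # no short selling, no leverage

result = minimize(neg_return, x0=np.ones(3) / 3, bounds=bounds, constraints=constraints)
print(result.x)  # portfolio weights that respect every constraint
```

The point of the sketch is the shape of the problem: exceeding a constraint is not "a slightly worse loss," it is simply not allowed.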
Seth: Very cool. Yeah. Sounds like you need a very strong math background. Some linear algebra in there too?
Vincent: Lots of calculus and linear algebra. Although, I will say, it depends a bit on what you do. When you do the master’s degree, of course, you do theoretical courses and you’ve got to do the proofs. But the moment that you start doing your thesis, [things change]. I had a professor who basically said, "I know nothing about machine learning, Vincent, but you seem so eager. You do machine learning and you just teach me how it works because I’ve got no idea."
And that was great. The professors just let me do what I wanted to do, and I was also able to teach myself that way. But if you really want to do the proper operations research and especially if you want to do a PhD, It is super math heavy. That’s true.
A little bit too heavy for my comfort, to be honest. I’m a little bit more on the applied side of things, but I still know people who ended up doing a PhD, and they are definitely the math-proof kind of people. They’re the walking-cookbook-of-linear-algebra kind of person. That’s definitely in the field.
Machine Learning Attraction
Seth: What was it that attracted you to machine learning? What initially got you interested in it?
Vincent: There’s always something cool about making predictions. Right? So there’s something about that I thought was pretty interesting. The longer story, though: I do remember a very close family member of mine got a wrong medical diagnosis.
The wrong diagnosis was that they told the person, you have a very bad disease, and the person didn’t. And we found out just in time, thank God. But they might have made some really weird life decisions, like selling the house immediately, because of that diagnosis. So that made me think, okay, there’s a definite consequence to making the wrong decisions. Anything that we can do to make better decisions is interesting.
And maybe there’s something about this machine learning. The whole idea that you try to learn more from data by using a machine, there’s something plausible about that. It seems very interesting. So that was around the time that I did think, hey, let’s see if these algorithms might be able to do something.
And then the career prospects turned out to be amazing. So that’s another motivation to go into that realm. But the initial spark was a wrong decision that got made. That’s how I started thinking, hey, maybe there are systems that we can improve here.
Calmcode
Seth: Very interesting, and we’re going to dive into machine learning in a bit. But first, I’m talking with the creator and maintainer of calmcode, which is such an unbelievable resource.
[Vincent Laughs]
Seth: No, nothing nothing to laugh about. Calmcode is incredible.
It’s in the top two or three things that I recommend to every new data scientist. The way that you break down really complex things into a nice calm and logical and rational way is extremely valuable.
Vincent: Glad to hear.
Seth: So can you talk a little bit about calmcode? Why’d you start it? And, well, what is it? Give everyone a breakdown.
Vincent: So first of all, happy to hear that you like calmcode, and happy to hear that it helps. So, basically, the story behind calmcode was: at some point, I was looking at educational content around data science. And, as an educator, I just started noticing there’s just so much gunk.
So to give an example: the number one tutorial, maybe four or five years ago, on how scikit-learn works used this dataset called load_boston, which is about Boston house prices. There are so many tutorials that use that dataset, from all your O’Reilly books to a lot of open source packages. But then you look at the data, and it turns out that one of the variables they’re using to predict the house price is skin color.
I forgot the exact name, but it was something like percentage of blacks in the town. You don’t want to put that in a predictive model. It’s a really, really bad idea. Also, why is this dataset in scikit-learn? Why are so many people using it?
So that led to a lot of frustration on my end. And then I also noticed that there are these enterprise courses that use load_boston and charge a thousand dollars a day. And you look at it and you kind of go, this is a mess. And then I figured, if I’m this frustrated, maybe I can get the energy out by putting this stuff out there for free.
I knew that I was knowledgeable enough to be able to teach these kinds of topics because I’ve taught them before. But I’ve also noticed that a lot of this educational content seems to focus more on the creator and less on just getting the idea across. So I figured it would be kind of a fun little experiment: if I were to make a learning platform, how would I do it? And that’s how calmcode got created.
The idea is you have a maximum of five one-minute videos to explain a single topic, and a sequence of those can be a little course on pandas or a little course on whatever.
…for every single topic I can say, Is this a calm tool? Is it something that makes your day to day nicer? And if the answer is maybe not, then I just don’t teach it.
What I like about doing the calmcode thing is that for every single topic I can say, Is this a calm tool? Is it something that makes your day to day nicer? And if the answer is maybe not, then I just don’t teach it, which is also one of the reasons why I don’t teach Spark, to be honest. Because installing it is just such a pain. And sometimes there are easier ways of analyzing the data than resorting to a very big data tool.
So it’s just a very opinionated learning environment that people seem to have really liked. I’ve gotten lots of very nice responses. Ever since the baby showed up, I have been doing way less.
But it has been very cool to see that this little hobby project of mine, without distractions and very calm, seems to be getting between ten and twenty thousand people a month. And I get lots of people buying me beers at conferences out of the blue. That kind of stuff is pretty cool.
Seth: Yeah. It’s a great resource. Definitely paying it forward, creating a place for data scientists to go to learn anything. Have you ever found yourself going back to an old calmcode course to refresh yourself on some of these skills?
Vincent: Yeah. That’s another reason. One thing that calmcode has turned out to be, quite nicely, is a kind of snippets library for me. Because I know the courses that I made, and I know that I definitely mentioned something in one of them, and when I kind of need a config file: where is it?
Just today, I was looking at my typer course because I needed, "oh, how do options work again? Copy, paste." So it’s almost a snippets tool for myself at this point as well. Not the original intent, but it is something that seems to be happening.
And also, I’m building search for calmcode now as well, kind of as a hobby project. I’m contemplating, hey, maybe the main thing the search feature should do is just find the right snippets, which is kind of an interesting search problem on its own.
But, yeah, totally, I need a reminder too. There are many courses. I don’t have all of them in my head all of the time. I still watch my own stuff in that sense. Yeah.
Seth: Yeah. It’s a good resource for you that you can consume which also became something so many other people can use. I’m trying to think of my first usage of it. I think it was, like, args kwargs, which is one of the first ones on there. Yeah. I revisit it every so often.
Vincent: Nice. Yeah. Nice.
Seth: Thank you.
Vincent: Well, so then I would love to do more.
But the simple fact of the matter is my life is a little bit different now because of the baby. There are so many ideas I have that I could do with calmcode. The one thing I also kind of like about the project is that I can spend no effort on it and the site will just still run. Right. So that’s also kind of the calm design of it.
I really like having a hobby project that’s nearly impossible to break. And if any of it breaks, it’s super easy to fix because it’s just a static website. So that makes it super easy.
Seth: If there were no time constraints or any resource constraints, what would you do to improve calmcode?
Vincent: There are a couple of courses in particular that I would love to do. One of them is just embeddings. I think there seems to be a bit of hype around them, but also, you can make embeddings do different things and there are reasons why they work. But they don’t solve every problem, and I could do a fun course where you start with letter embeddings, move on to other embeddings and images, and then also show how they can fail. I think that could be super cool.
Bayesian MCMC [Markov chain Monte Carlo] stuff would be nice to have as well, because you can make very articulate models, which is a trick not enough people are appreciative of.
And then I would love to have a new section on the site, which is all about demos and benchmarks. And that’s because I think it’s very hard to do a benchmark right. All benchmarks are wrong, but some of them can be very insightful. And I think just celebrating that a bit more would also just be fun. I’ve got a couple of examples lined up, but I have no time to actually produce them.
But things like, hey, what can you do to actually make numeric algorithms converge a bit quicker? Does standardization really help or not? Just exploring that a little could be super fun.
Stuff like that’s in my mind. There’s always stuff to make. Another thing I’m playing with is: would it be fun to collaborate on that project? Maybe. I don’t know. But of course, there’s no rush. So it’s also kind of fine if I don’t spend time on it right now. That’s also cool.
Open Source
Seth: Yeah. Awesome. Switching gears a little bit into some of your open source work. I think the first library of yours that I was exposed to was bulk. Maybe something else before that, but that was the first one I was really using. And then you also have embetter, human-learn, whatlies, doubtlab, cluestar. Those are the ones that I’m most familiar with. I know there are about another two or three dozen.
Vincent: Yeah. A few dozen at this point.
Seth: When do you decide that a project deserves an open source library? When do you decide it’s a tool?
Vincent: It helps a bit to maybe explain how the open source thing kind of got started. So my first open source project that I put on PyPI was called evol, which is basically a DSL for evolutionary programming. I made it with a colleague of mine. It was a very cute idea. And I wanted to have my own little library.
So I was looking for a problem. And then I found out that if I have a population object and an evolution object, those two can interact in nice ways, and it becomes super easy to make genetic algorithms. Alright. Cool library; I did a bunch of talks on that.
But then at some point, I taught myself how to make Python packages. And then I was a consultant, and I started noticing at different clients that I would be writing the same scikit-learn components. So I figured, I have to have a library with these components that I keep reusing. And that’s how scikit-lego came to be, and that’s how I familiarized myself with the scikit-learn ecosystem.
And then I started working at Rasa. There, we did lots of benchmarks on sentence classification, because Rasa builds chatbots. And when you’re building chatbots, a sentence comes in and we need to figure out the intent. Okay. So I wrote a bunch of benchmarking tools because that’s what I needed, and some of those could be open sourced.
Whatlies was an example of that because I wanted to have a library where very quickly I could have many non-English embeddings and see if they were better. And then it turned out that there’s a whole non-English community around Rasa who was super interested in that.
So I was able to build some Rasa plugins to support all these non-English tools. And then at some point, I started maintaining my own libraries, and I noticed that I needed some unit tests for my docs because I don’t want my docs to break. So I made a couple of tools to help me do that. mktestdocs is one of those tools.
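(Note: the core trick of mktestdocs, per its README, is to run every Python code block in your markdown docs as a pytest test. The `docs/` path below is just an example location.)

```python
# test_docs.py
import pathlib

import pytest
from mktestdocs import check_md_file

# One test per markdown file: if any code block in the file raises, the test fails.
@pytest.mark.parametrize("fpath", pathlib.Path("docs").glob("**/*.md"), ids=str)
def test_docs_run(fpath):
    check_md_file(fpath=fpath)
```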
I noticed the tests at Rasa were running super slow, so I made pytest-duration-insights so I could figure out which tests were slowest. And you can see how all these things accumulate, but it’s always because I’m scratching another itch. And my preferred way of operating is to do that in public.
And of course, there are tools that I can’t do in public. I work at a company. Some tools are private. That’s fine. But most of the time, I’ve encountered a problem, and I just want to be able to solve it again later with very low effort. And because I’ve made packages before, it’s just super easy to repeat.
And that’s also how doubtlab happened, and it’s also how embetter happened, and honestly, also how bulk happened. It’s just that at some point, I figured I need this for my work. It’s nice to have around, so let’s just package it and go build in public, and that works very well for me. That’s the main story there.
Seth: Yeah. Very cool. And that’s a great story. It seems like building one tool builds up certain skills, and then one thing kind of leads to another, and then it’s not such a big deal. Once you have, I guess, around three dozen amazing tools, adding that thirty-seventh tool is easy.
Vincent: So, yes, but I do want to make one comment, because in general, if I look at the companies that I visited with my background as a consultant, I do think not enough people make their own Python packages.
For example, imagine that you have a pandas query that has to deal with time series, or something that works on a very specific database. Okay. Then the function that reads out the data from the database can probably be a function that gets reused. And maybe you have to add sessions, or maybe you have a very specific machine learning model that you want to reuse.
And you don’t want any of these utilities to live in a notebook. You want them to live in a Python package. And I have seen that not enough people make their own internal tools, which I do think is a shame. I was around a couple of mature colleagues at the time, and we would write our own Python tools internally.
…you can make Python packages more often than you might think. So, just build one even if it’s for your own little helper functions in pandas that you like to use.
And because we had that habit, it was also quite easy for me to make one that was just public. So, this is advice I might have for a more general crowd: you can make Python packages more often than you might think. So just build one, even if it’s for your own little helper functions in pandas that you like to use. That’s a totally legitimate use case.
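(Note: a minimal sketch of the kind of internal helper worth packaging. The package, file, and column names here are hypothetical, invented purely for illustration.)

```python
# my_helpers/io.py -- a module inside a hypothetical internal package.
# Once this lives in a package (e.g. installed with "pip install -e ."),
# every notebook and script reuses the same loading logic instead of
# copy-pasting it around.
import pandas as pd

def read_orders(path: str) -> pd.DataFrame:
    """Load the orders export with the parsing rules everyone needs."""
    return (
        pd.read_csv(path, parse_dates=["created_at"])
          .dropna(subset=["customer_id"])
          .assign(month=lambda d: d["created_at"].dt.to_period("M"))
    )
```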
Bulk
Seth: Yeah. To dive into the one that I’ve used the most: bulk. Can you talk about bulk? What is the pipeline and what are the requirements for it? What are the mechanisms at play?
Vincent: Yeah. So it might be fun to also explain how that library accidentally happened. I had a library called human-learn. There are a couple of really cool features, but the whole idea with human-learn is that, as a human, you can make scikit-learn models without knowing anything about machine learning. One thing you can do is turn a Python function into a scikit-learn compatible component, which is useful. So you can grid search over the kwargs and all that.
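(Note: a rough sketch of that function-to-model trick, based on human-learn’s documentation; the fare-based rule mirrors the example its docs use, so treat the details as approximate.)

```python
import numpy as np
from hulearn.classification import FunctionClassifier
from sklearn.model_selection import GridSearchCV

def fare_based(dataf, threshold=10):
    # A hand-written rule instead of a learned model: predict the positive
    # class whenever the fare column exceeds some threshold.
    return np.array(dataf["fare"] > threshold).astype(int)

clf = FunctionClassifier(fare_based)

# Because the rule now behaves like a scikit-learn model, you can grid
# search over its keyword arguments, e.g. grid.fit(df, df["survived"]).
grid = GridSearchCV(clf, cv=5, param_grid={"threshold": np.linspace(0, 100, 11)})
```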
However, one thing I thought was kind of cool too is: usually you see a plot with some blue dots there, some yellow dots there, some red ones there. And people say, this is what we need machine learning for, and then an algorithm dissects them. But then I figured, you know, you can just draw a circle around the yellow dots and a circle around the blue ones, and translate those circles into a scikit-learn model. So that’s a feature of human-learn. In human-learn, we have bokeh components that can do that from a notebook.
And while I was working on that, I was also working on whatlies over at Rasa for all of these word embeddings. At some point, it started dawning on me that when you take these word embeddings and pass them through UMAP, you kind of get these clusters. And then I figured, oh, I just want to select them. Oh, hang on. I’ve got this tool called human-learn that does just that.
And within, like, an hour, I had that working in a notebook. Then I showed it to a bunch of colleagues, and they all kind of went, "This is super useful, Vinny. Well done." So that was a notebook that got shared around a lot.
And now I no longer work at Rasa; I started working at this company called Explosion. We have an annotation tool. And I felt like doing the bulk trick again, but I didn’t feel like using a notebook. So I turned it into a little web app that you can run locally, and it’s one of the pre-processing steps I like to use before you start annotating in Prodigy. You just take your data, you embed it into a 2-D plot using UMAP, and then you typically see clusters, and you try to explore that space, make a selection, and that’s it.
It’s a very nice way to do bulk labeling because clusters tend to appear from these embeddings. And that’s basically the whole trick. These bulk labeling techniques kind of work, but they’re not perfect. They seem pragmatic enough for me to go ahead and get started within an hour. And that’s kind of the power of it. Stuff that used to take me six hours now takes me only one hour.
And it’s a trick that only works for getting started, but I get started a lot on a lot of new data sets. So for me, it totally solves a problem. Bulk is also one of these projects where I would love to have more time to fix some of the rough edges, but it is a little hack that totally works and I love using it. And there seems to be a little crowd of people who seem very appreciative of that tool as well, especially because it does text but also images. Out of the box, it just does that.
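(Note: roughly the recipe that bulk automates, sketched by hand. The embedding model name and the CSV layout below are assumptions made for illustration; bulk itself wires this up for you.)

```python
import pandas as pd
from sentence_transformers import SentenceTransformer
from umap import UMAP

texts = pd.read_csv("examples.csv")["text"].tolist()  # hypothetical input file

# Step 1: embed every sentence.
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)

# Step 2: squash the embeddings down to 2-D; clusters tend to appear here.
coords = UMAP(n_components=2).fit_transform(embeddings)

# Step 3: plot and explore. In bulk you lasso-select regions of this plot
# and export the selection as a first, rough set of labels.
df = pd.DataFrame({"text": texts, "x": coords[:, 0], "y": coords[:, 1]})
df.plot.scatter(x="x", y="y")
```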
Seth: Very cool. Yeah. I used bulk when it was in a notebook. I know I reached out to you, and you were very generous with your time trying to help me get it running in different environments.
Vincent: Yeah, the first notebook was definitely buggy. That’s definitely true. Yeah. I definitely remember that.
Seth: Still did the trick.
Vincent: Yeah. Well, the thing is, back at Rasa I made a habit of making these videos. So bulk had a YouTube video attached as well, which is how a lot of people found out about it. And I think there’s this one repository that happens to have that notebook, which is still getting stars these days.
But I recommend people just use the command line thing now, because it’s less distraction and a bit more stable.
Seth: Yeah. And then, interestingly for me, as I was moving a lot of my work out of notebooks and into scripts, I came across bulk again, and now I’m using more of the web app. I love both of them. They’re great tools, and you make a good point.
Sometimes lowering the barrier to get started on a problem is just so important because then you start to get the ball rolling, you start to get some thoughts going, and you can make some meaningful progress. What I like about it is you start building some intuition, by exploring the data and you start to think, "Oh, okay. These could be some potential categories."
Vincent: There’s definitely a human-in-the-loop-who-is-learning aspect of it that I also think is really useful. Especially when they dump a new dataset on you. Yeah, you can start throwing it into an algorithm, and that’s [fine]. But genuinely understanding what’s in the dataset typically is the thing that takes the most time. And it’s nice that, as a side effect of bulk, you are at least exposing yourself to these clusters. And that on its own seems quite useful.
Right now, you can do bulk labeling on sentences and images. One of the things I am working on is doing that for phrases as well, for substrings in text. So right now I can embed the entire sentence, but what I want to move towards is that I’m also able to say, take every noun phrase in that sentence and make a small little point for that. Because that way, if you’re interested in doing named entity recognition or something like that, we can also do bulk labeling for you.
And especially for things like video game titles that span multiple tokens – Star Wars is two tokens. It’d be nice if we could turn that into a single phrase. And over at our company Explosion, we have lots of tricks that totally solve all of this. It’s just that I need to have an afternoon to make that work inside of bulk.
But it’s on the roadmap; I am definitely interested in solving some of those problems as well.
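(Note: the phrase idea expressed with spaCy’s built-in noun chunks; this is a sketch, not bulk’s actual code, and whether a given title comes out as one span depends on the model. The example sentence is invented.)

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I spent the whole weekend playing Star Wars with my cousin.")

# Instead of one embedding per sentence, make one candidate per noun phrase.
for chunk in doc.noun_chunks:
    # Ideally "Star Wars" comes out as a single span, even though it is two tokens.
    print(chunk.text, chunk.start, chunk.end)
```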
Understanding the Problem
Seth: Yeah. So I’ve noticed, going through some of your work, that a lot of it is focused on creating high-quality datasets. But something that comes before that is actually understanding the problem. And I watched one of your PyData talks, basically about rephrasing the problem.
And you gave an incredible example about a problem where someone is looking for beans, beef, and bread.
Vincent: Oh, yeah.
Seth: Can you talk about that one?
Vincent: So this was not my tale. I actually met the person, who works at the World Food Programme doing operations research. And one of the problems that they had was [dealing with] hunger in the world. Sometimes a village with hunger says, we need more beans, or we need more chicken; there’s demand for certain products. And then part of what the World Food Programme tries to do is to source those foodstuffs cheaply.
And then part of the cost picture here is the logistics of it. So, can we get the food on the truck? And how expensive is it to get the truck? All the logistics. And as this person was saying, they originally defined the problem the wrong way, because when a person says, I need beans, yes, they can say that, but it’s not beans that they need, it’s nutrients. And beans are high in fiber and high in protein.
Okay, there’s other food, like lentils, that is also high in fiber and high in protein. And if we’re fighting hunger, then we’re not going to be very picky about whether we get beans or lentils. And maybe if we do that, we can get the foodstuff without needing a shipyard. We can just send the truck.
And just by redefining that problem, I believe they got something like a five percent cost reduction, which is a crazy high number for a problem that people have already spent years trying to optimize. Getting a five percent cost reduction is almost unheard of, but it was basically because they had been solving the wrong problem. And my theory, at least, is that while this is an anecdote of a thing that happened to one person at the World Food Programme, quite typically this whole act of rephrasing is a very useful exercise, and maybe not enough of us do it.
An example in NLP: one of the problems we’ll sometimes see on our support forum is, let’s say, someone has a resume that they want to parse. And they say, well, I want to have the start date and the end date per job. So I want an algorithm that can detect the start date. And, you know, you can build an algorithm that can detect the start date; that’s fine. But if you rephrase the problem into: let’s first find all the dates, and then afterwards figure out which one’s the start date and which one’s the end date, then the second problem becomes, well, the start date is probably first and the end date is probably after that. Oh, the whole problem just becomes a whole lot simpler if you rephrase it into a two-step approach instead of considering it end to end.
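(Note: the rephrased two-step approach sketched with spaCy’s pretrained NER. The resume sentence is invented, and exactly how the model splits the dates will vary; the point is the shape of the solution, not the exact output.)

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Data Scientist at Acme from March 2018 to June 2021.")

# Step 1: find every date, a problem pretrained models are already decent at.
dates = [ent for ent in doc.ents if ent.label_ == "DATE"]

# Step 2: a simple rule, not a model. The first date is probably the start
# and the last one is probably the end.
if len(dates) >= 2:
    print("start:", dates[0].text, "end:", dates[-1].text)
```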
And there are lots of these opportunities that people forget about. And, again, to come back to calmcode, I fear that some of the machine learning textbooks are partially to blame, because very few machine learning books actually tell you that you can choose to ignore half the data if that makes more sense. You can choose to just solve a different problem if that’s easier to solve. But that’s not the mode of thinking I seem to see, especially with new graduates. Which is a bit of a shame.
But with that World Food Programme story, I have to trust the person on stage who told it to me, but that definitely happens. The World Food Programme found a way to reduce the cost of transportation by five percent just by rephrasing a mathematical problem. That definitely happens in real life.
"It wasn’t the algorithm that saved the day, rather the understanding of the world. A better algorithm would yield a worse outcome if it is used on the wrong problem."
Seth: Right, yeah. And doing something at that scale, any sort of reduction, a five percent reduction, is massive. My favorite quote from that presentation was when you said, "It wasn’t the algorithm that saved the day, rather the understanding of the world. A better algorithm would yield a worse outcome if it is used on the wrong problem." I really liked that one.
Vincent: Oh, happy to hear it. So, there are more anecdotes in that story. But if people are interested in this, there is an operations researcher, [Russell] Ackoff.
And he wrote this one paper whose title was The Future of Operations Research is Past, which he wrote in, like, the eighties. It basically outlines why operations research algorithms can fail, for reasons related to this anecdote. The reason I want to bring this up is because some of those arguments work for data science too. It’s an article from the eighties, but everyone should read it: The Future of Operations Research is Past.
And I wrote a similar article called The Future of Data Science is Past, just repeating a couple of those arguments. But people often forget that the algorithm is usually just a cog in the system, and we’re interested in building a better system, not a better cog. So if you’re building a better cog but it doesn’t fit the rest, it’s not a better cog, because you don’t get a better system.
…people often forget that the algorithm is usually just a cog in the system, and we’re interested in building a better system, not a better cog. So if you’re building a better cog but it doesn’t fit the rest, it’s not a better cog because you don’t get a better system.
Another thing that Ackoff does very well in his books is explain a lot of these systems theories. And one idea there that I can recommend people think more about: instead of making, let’s say, a better cog, instead of thinking, "Hey, maybe there’s one part of the system that we can optimize," try to see if you can make the communication between two parts better. Because if you think about it from a systems perspective, by doing that, you’re optimizing two things.
And you’re also gaining clarity, so that’s always good. And this sort of thinking, let’s reduce a problem down to a single number and not consider anything else, is usually a rabbit hole where people lose themselves in data science, I think.
Seth: Yeah. It’s super interesting, because I think there are a lot of times when people approach problems with a modular way of thinking. They focus on the different modules and go, oh, if I make this one thing the best that it could be, then the whole system will be better. And in some cases, it will make a great improvement. But other times, it’s very important to understand the supporting system and how everything integrates. It reminds me that you have to have good integration tests, and you need to make sure that everything fits into the system properly.
Vincent: To give an anecdote here: the former CEO of bol.com wrote about this in his autobiography. So, bol.com is like the Dutch Amazon. Amazon’s not that big here. Bol.com is basically Amazon, but blue, and Dutch. It’s kind of a thing we have here.
But they hired their first data scientist at some point. And this book has a chapter on that: what happened when we got our first data scientist? And in the book, the first data scientist is portrayed as kind of an arrogant person, who’s always complaining that all these humans are not as good as my algorithm.
And then one of the things that he does is figure out that there’s an optimal time to tweet on their social channels about new video games that come out, etcetera. So that’s a thing he did. In Holland, we have this thing called Remembrance Day. I believe it’s at seven o’clock, could be eight, but during Remembrance Day we remember the Second World War, and basically the entire country observes two minutes of silence.
You might have seen some of the photos where people on their bikes delivering pizzas step off the bike and just stand still for two minutes. It’s a thing that people take quite seriously. So seven o’clock on Remembrance Day would be a very bad time to tweet about the new Call of Duty shooting game where you can shoot a bunch of people. And it would be especially bad to tweet that you’re super excited about the prospect of shooting people during Remembrance Day. But that’s exactly what happened, because his algorithm determined that seven o’clock was the optimal time to start tweeting about this sort of thing.
And there are so many of these stories. Right? And on its own, on paper, I cannot necessarily blame the data scientist for doing his or her work. But this is the systems thing. Group one has a concern that something might go wrong; group two does not.
If you just get them talking to each other, then usually the world’s a better place. That’s the theme, I would say.
Seth: When you get the answer to your problem, you have to ask yourself: does this make sense? That’s sometimes a little step that a lot of people skip over, and it’s extremely important.
Vincent: I do want to acknowledge that it’s also hard, right? I think calling .fit and .predict are the easy bits.
It’s all the stuff around that. It’s way trickier. Especially when you consider themes of fairness, all the things that can go wrong, can we really know that upfront? I don’t know if you always can.
To give one shout-out, though: there is this project called deon, the deon checklist. There’s a calmcode course on it.
Deon is a data science checklist: just a bunch of stuff that has gone wrong at different companies, where there are newspaper articles explaining how bad the situation became. They have a checklist of items like, "hey, check for this before you push live, because stuff might go wrong." And for every item on that checklist, they also have two newspaper articles about stuff that happened in the past. So you, as the data scientist, can go up to your boss and say, "I want to [minimize] risk because this actually went wrong."
And it’s a really cool project just because they actually did the proper collecting of anecdotes, which is a powerful act in this day and age.
Unanswered Questions in Machine Learning
Seth: Yeah. A hundred percent. Having a story connected to anything in data science is always valuable.
To zoom out and talk about machine learning in general, what’s an important question that you believe remains unanswered in machine learning?
Vincent: Okay. So I was drinking at a PyData afterparty, and a couple of people came up to me, people I would consider relatively senior. They knew their stuff, and they asked me to predict the future of machine learning.
And I kind of felt like making a joke because, you know, you’re at the bar. I wasn’t really inclined to go super deep into that. As a joke, I figured I would say: "You know what I think of the future of data science? People are going to really realize just the sheer amount of nonsense that’s in our field. And we should maybe just stop altogether."
But I decided to think about it more, and I will say, there is some truth in that actually. I do kind of worry that maybe a lot of the stuff that we’re doing is more the hype thing instead of asking, are we sure that we understand the problem?
So what’s missing in machine learning? Well, maybe we’re doing too much of it. This is kind of a feeling that I have.
And of course, there’s a place in machine learning in the future. It’s definitely going to happen, but it doesn’t have to be everything. That’s kind of more the thing that I’m afraid of.
There’s an author who wrote a book about artificial weirdness, just all the weird gunk that artificial intelligence can produce.
The book is called You Look Like a Thing and I Love You, by Janelle Shane. Have a read. The book starts with the author saying, I have all of these Tinder texts, and I want an algorithm to figure out the best Tinder text to send. And the algorithm came up with, "You look like a thing and I love you." Which is kind of hilariously brilliant, but it’s not the thing you should send on Tinder, I think.
But the book is full of these examples, where you kind of have to be careful that artificial stupidity is not happening at the same time. Right? There are plenty of examples where that happens. The Call of Duty thing is just one example.
I find myself to be kind of the grumpy old guy who yells at clouds. Kind of a, "sure, machine learning has a place, but can we do without it first?" First try the simple thing, because that’s something people seem to forget to do. And that’s the more pressing concern, personally.
Seth: In a similar vein, with everything that’s going on in natural language processing right now with generative models and ChatGPT, how do you view the gap between the hype and the reality? I’m excited to get the grumpy old guy’s perspective on this.
Vincent: So I am actually professionally toying with this stuff. If you have a look at Explosion’s repositories, there’s now one with OpenAI Prodigy recipes. So we are experimenting a little bit with, hey, can we say to ChatGPT: here’s a sentence, detect all the dates.
Just so we can pre-highlight that in our Prodigy interface. It’s something we are exploring right now. And it turns out it’s actually really good at some of these examples, and really bad at others. We don’t fully understand why yet.
But I will acknowledge that it can be quite useful. If it’s something you can use to get better training data quicker, because the annotation becomes a lot easier (saying yes or no is quicker than highlighting every single item in the user interface), that seems totally fine.
What I think is a bit more of a concern, though, is that people sort of say, "oh, it’s magic. That’s how this works. It’s magic." It’s not magic. This is to some extent like the Markov chain thing, where it just predicts the next word. And you can imagine that if you give that enough text and enough compute power, you might be able to have it generate very plausible text that you might find on the Internet. Then you can ask questions like: is it generalizing? Or is it just remembering?

And, those are all fair questions. But it’s not intelligence just yet. It’s not real reasoning. And I have plenty of silly examples that demonstrate that it’s not actual intelligence that is happening under the hood.
That said, again, as long as there’s a human in the loop and it proves to be useful and productive, then I think it’s fine. But that’s me wearing the lens of professional interests. There are, of course, harmful factors that I do think need to be taken into consideration as well. You can definitely send more mass emails in bulk and maybe have more Twitter bots, and all those things I’m not particularly fond of.
So anyway, that’s one aspect. Another thing I do want to highlight, because I also tried the Midjourney thing: I’ve tried to generate Magic: The Gathering cards.
Seth: Okay. I’ve seen them and they’re pretty funny.
Vincent: I thought at some point it would be kind of funny to say, hey, let’s make Magic: The Gathering cards of orcs in the office. You would have an orc warlord product manager, an orc venture capitalist, and an orc TED keynote speaker. And immediately, this idea is pretty funny, because if you think about the office, you kind of think of a dull gray suit. And if you think of an orc, you think of World of Warcraft and a warmonger, etcetera. So, that was pretty funny.
But then the next question is: can we actually generate the really funny pictures? And that turned out to be somewhat hard. So I have this one picture of an orc paladin, totally covered in iron mesh, basically, behind the computer. And you kind of go, okay, data engineer. Kind of okay. That’s kind of funny already.
But I wanted this orc to be a data analytics engineer, because they are talking about data lakes. And then I thought the funny thing would be the heavy, ironclad orc but with a little yellow snorkel coming out of the helmet. That would just be the funniest thing. And for the life of me, I could not get it to generate a yellow snorkel.
And you start thinking about why that might be. And then you also think, well, Vincent, you’re already kind of stretching it to have these World of Warcraft and Dungeons and Dragons styles in an office. The fact that those two styles are even compatible is already kind of a stretch, let alone that you’d also generate some sort of weird snorkel on top of it. Right?
So if people consider these tools magic, the best advice that I have is to try to come up with an awkward, weird task that touches the edges of where such algorithms are comfortable. That’s usually going to give you examples that can help you see that it’s not really magic that’s happening. It’s trying to remember. It’s trying to generate stuff that it’s seen before. And there are plenty of edge cases where this sort of stuff is just "You look like a thing and I love you." Read that book. It has really compelling examples, and the style of the book is lovely too. I highly recommend it.
Generative and Predictive Machine Learning
Seth: Thank you. Yeah. I will check it out. I think generative models are super interesting, because unlike predictive models, where, for example, if you’re doing text categorization you can sort of know whether it’s correct or not (there usually is a ground truth), with generative models, where you want to do something like create an orc that’s wearing a snorkel, how do you know that it’s correct?
It’s not so clear cut.
Vincent: How many labels of unrelated photos do you need to actually generate that? Right? But also, part of the solution here is obviously the user interface as well. There are amazing things you can do having text as an input. But in this case, you also want to say: okay, we’re almost there, I just want to select the region around the helmet where a yellow snorkel needs to appear.
Something like that is going to happen at some point, and that’s going to make these systems better. And then I can move on to enterprise elves and figure out some other edge case. Right? And that will kind of be a continuous thing.
But, yeah, in general, because you mentioned ground truth: ground truth is tricky too. And this is also where a lot of artificial stupidity comes from. And my personal gripe with that: consider image classification, the famous cat-dog thing. Is this a picture of a dog or is this a picture of a cat?

Standard classification would say, okay, this is a binary task. But then you kind of go, well, we can have photos with no cats or dogs. So, we need three classes? Okay. What do you do with photos that have both a cat and a dog? Oh, yeah. Okay. That can happen too, right?
Okay. Reality is more complex. And what do we do then?
Well, maybe we have to say, is there a dog in the photo? Yes or no. And is there a cat in the photo? Yes or no. Maybe those should just be two binary classifiers. Maybe that’d be more sensible. Okay. What do you do when there’s four dogs in the photo?
Again, the more you start thinking about it, the more you realize that even well-defined text classification doesn’t always mix well with reality. And even if you have ground truth labels, you kind of have to wonder whether those labels mix with reality when the task is defined as a single-label classification task, because a sentence can be about more than one topic and a photo can be about more than one thing as well.
So take a step back and just really wonder. Some of these things can be details as long as we really understand the problem, so maybe we should focus on that. Maybe we should skip hyperparameter tuning and only worry about: do we really understand the problem?
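(Note: the reframing as two independent yes/no questions, sketched with scikit-learn’s multilabel tooling. The features and labels below are random placeholders, purely for illustration.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 16))              # pretend image features
y = rng.integers(0, 2, size=(100, 2))  # columns: [has_cat, has_dog]

# One binary classifier per question, so "both" and "neither" are valid answers.
clf = MultiOutputClassifier(LogisticRegression()).fit(X, y)
print(clf.predict(X[:3]))              # each row answers both questions
```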
Seth: Yeah. That’s a really good point. I think that when you’re approaching a problem, people tend to jump to a solution. If you’re doing something like text classification – oh, okay. I’m going to create a multi-class text classifier. Well, it turns out that it is never really quite that simple. Right?
It’s really multi-label. Should I use a hierarchy? Should I do this? Should I do that? And, you know, getting a better understanding of the problem always helps you figure out more. It’s so much more valuable than doing hyperparameter tuning on that original multi-class text classifier.
"The model can do one step, but your system can do two or three if need be. So definitely feel free to consider the two-step system where we have a couple of classifiers that detect a couple of properties, and then we have a rule-based system after that’s going to say, ‘Okay, this combination of things that seems interesting. Let’s go for that.’ People forget about the rule-based system that can be built on top of. And that’s, you know, a bit of a miss. But it’s also, like 80% of the time, that’s also the fix."
Vincent: Well, so the main thing: I do have a little bit of advice in general. I'm on the Prodigy forum and I help some spaCy users with their problems. The most general advice that I give people in this domain is to consider that maybe the model can do one step, but your system can do two or three if need be. So definitely feel free to consider the two-step system where you have a couple of classifiers that detect a couple of properties, and then you have a rule-based system after that that's going to say, "Oh, okay. This combination of things seems interesting. Let's go for that."
People forget about the rule-based system that can be built on top. And that's a bit of a miss. But it's also, like eighty percent of the time, that's also the fix. So do with this information what you will, dear audience, but I do think that this two-step approach definitely does work in general.
Seth: I think that’s really good advice especially right now with all of the hype with deep learning. I think we’re still in a world where finding the right combination between machine learning models and heuristics, sometimes pretty basic heuristics, often yields the best results.
Influences
Seth: To move into the learning from machine learning portion of our talk, we'll start with this: who are some people in the machine learning field who have influenced you?
Vincent: I’ve had some really lovely direct colleagues that I still hang out with. So, those obviously. Back when I started, I was learning R, so Hadley Wickham was a person that I definitely looked up to a lot. And I also met him on a couple of occasions, which is super cool. He did an advanced course, like, five years ago, and I was a TA. Great great experience, I got to meet the guy.
Katharine Jarmul is a person who also comes to mind. She was one of the people who kick-started PyLadies, and she has also been a great advocate for privacy and fairness in machine learning. She has reviewed my slides for a couple of talks in the past, and she's just great.
Vicki Boykis, I think, is one of the funniest people – she deserves way more credit for shitposting; she's great. NormConf was also an amazing thing that she helped kick-start. It was great.
And then Bret Victor, I think, has given the best talk I've ever seen, and that I will ever see: The Future of Programming. That's a thing I watch every year, basically. It's the most gobsmacking, most inspirational thing I've ever seen. I won't tell you what it's about. Just watch it.
Seth: I’m looking forward to it.
Vincent: And then, I guess, [Russell] Ackoff. But the main thing with Ackoff was that I did this whole master's degree in operations research, and then a professor was going to retire and I was one of the speakers at his party. At some point he said, "The reason I wanted you here is because you really remind me of Ackoff." I was like, "Who is he?"
"He's this amazing guy. Just buy his book." And then you read his stuff, and he's like me, but in the eighties. So that was definitely also a good source of inspiration.
…the average Joe is pretty inspirational, but the average Joe doesn’t think that he or she should be on stage.
One thing I do want to mention about this: back when I was organizing PyData, you kind of think, "Okay, who are good keynote speakers and who are good invited speakers, et cetera." And my impression is that the average Joe is pretty inspirational, but the average Joe doesn't think that he or she should be on stage.
And the best example of this is, at PyData London, there was a normal talk by a guy who was building drones to find endangered species of orangutan in the rainforest of Borneo.
Seth: Wow!
Vincent: And he had the small room, but his talk was amazing. So I figured, screw this. You’re the keynote at Amsterdam. This is the most amazing thing I’ve ever heard. This is your hobby.
So he was the keynote speaker the next year. And he was grateful and very good fun. But he hadn't realized that it was definite keynote material. Similarly, I once read a blog post where a guy was trying to figure out which words are the most metal.
And the way he did that was by training a huge Markov chain on metal lyrics and non-metal lyrics. The conclusion of the blog post was that the least metal word is "cooperation," because it only appears in the corpus once. And you read this and think: this is amazing. Because he's basically applying the theory correctly to a pretty humorously silly problem.
But there’s passion here. And the guy, when I did approach him, [I said] you really need to apply for PyData. I don’t have to review your thing, I think you’re going to be in.
And it just hadn’t occurred to him that this was something he could do. And I like to think that there are so many more people who suffer from this, that they might have a really grand amazing inspirational moment, but don’t consider that they’re able to share that. And of course, some people are, properly introverts, which is also just fine. But one lesson I have learned at PyData is that the inspiration can really come from surprising angles that you don’t expect. So don’t focus too much on the big names.
That’s also the thing.
Seth: Yeah. Some of the best people are very humble, and they do such a good job with their work. You can tell how much they care about what they do and how much, I don't know if pride's the right word, they take their work seriously. They care.
Vincent: They care, yeah. You can be the smartest person, but if you don't care about your topic, it's not going to be a great talk.
And let’s say that maybe you’ve cut a few corners, but you calculated the optimal Pokemon. I don’t know, something like that. It can still be a great talk.
And again, more people should do it. If people are interested in doing more blogs and talks, by the way: consider lightning talks, and consider very short blog posts, the kind called "Today I Learned". The world definitely needs more of that. And I'm happy to see that PyAmsterdam, the meetup, does a lightning talk meetup once a year, where ten people give five-minute presentations.
Those meetups tend to be amazing. Any PyData organizers listening, feel free to steal this idea. Those meetups are always fun.
Career Advice
Seth: Very cool. So you've given a lot of advice so far, but I want to ask: what's one piece of advice that you've received that's stuck with you and helped you in your machine learning or career journey?
Vincent: I got this very early on in my career. My former CTO, whom I still hang out with, gave me pretty good career advice when I was twenty-three. He said, "Be careful of getting a raise. Because if your job starts earning a lot of money but it's kind of getting boring, then the money might be a reason that you're going to stick around."
And that's a dangerous thing early in your career, because maybe you have to figure out what you like in life, and maybe you have to figure out what makes you tick. If you're going to hyper-focus on the money, it's kind of like hyper-focusing on the metric: you're going to over-optimize for something that might not matter as much. So that was pretty cool advice, kind of on the meta side. But I do think, in general, I have been able to apply that quite well.
Again, this is privilege speaking here, right? But I have been able to apply it. So that's been cool.
Kind of a weird anecdote, but it's surprisingly inspirational as well. So I have a lot of friends who do nothing in data science. And I love that. I'm nerdy Vincent, and when I drink beer with them, they say, "Stop being nerdy Vincent. You're amongst normal people; you can just talk about life now."
And I live in a neighborhood where you know all of your neighbors, basically. It's sort of still kind of a middle-class neighborhood – it's changing because of gentrification, but we all know each other. So there's a guy on my street, and he's a painter. When it's nice weather, he puts a crate of beer on the bench outside of his house.
And the whole street just goes for a pint, basically. It's the cutest thing. Cutest neighborhood ever. But the thing with him is that he recently became an independent contractor as a painter, which also meant that he bought his first laptop ever. And he's forty-two.
And he needs help, not just with his website, but, like, with getting Word started. His entire life, the main computer he had was his phone, and he has been fine, but he finds a computer terribly, terribly complex. To be honest, I find that just such a refreshing thing, and a reminder that the way I experience computers doesn't necessarily have to be normal. That's a very useful reminder. So it's the best inspiration, in a sense.
Maybe don’t be in machine learning all the time. It is my advice. Especially if you’re making machine learning for apps that the average person uses. It really helps to remember that they really don’t care about your algorithm. They just don’t. They really, really don’t.
I have found myself to be stuck in a machine learning bubble at times. And I just find it very refreshing to [step outside].
I used to do this at consultancy gigs as well. I was making an app that truckers would have to use for logistics and stuff. And at some point, I would just hang out at the smokers' corner where all the truckers would hang out, to sort of understand what kind of people they were, and also to understand what they found frustrating about the app. Doing more of that, really. Being more of a human in the loop.
Focusing on the human thing is what I'm trying to do more of, and it's what I find very inspirational.
Advice for New Data Scientists
Seth: Yeah. I really like that. For somebody who is just starting out in the field, let's say they just got hired as a junior data scientist or they're thinking about starting in data science, what would your advice be to them?
Vincent: Okay. So step one: I gave a talk on this topic at NormConf. There's a talk titled "Group-by statements that save the day." This talk is precisely designed for you.
Having said that, I'm a really bad person to give career advice, because looking back, I think that a large chunk of where I am today is due to luck, and that's something that's kind of hard to optimize for.
What I do think is useful in general is to have your own blog where you just share things that you learned today. Calmcode is still kind of my snippets library in a way; your blog can be the same for you. And I have found that these "today I learned" snippets work: instead of a big blog post that takes hours, each post shouldn't take more than tens of minutes, let's say half an hour at most. If you're able to write two a month and you do it for a year, you've got a blog with 24 posts.
If you're learning and you're able to share knowledge, then people are going to acknowledge that. You have a bit of a resume there demonstrating that you're learning stuff. So that seems like a pretty easy thing to do if you want something of an online presence with low effort. That's something I recommend.
I do want to acknowledge that if you are a super junior just getting started, it's kind of hard. It's a bit of a shame, the [state of the] hiring market now and all that. But one thing that you can do to make it maybe slightly easier for yourself is to consider that you don't have to know everything in order to get the job.
You might also be able to get a related job. Some advice that I have given to friends of mine who wanted to get into this data science field: it's maybe a little bit easier to learn R than it is to learn Python, and it's maybe a little bit easier to just be an analyst for a year or two.
And all the skills you learn while being an analyst are going to be super useful if you want to become a data science person later. So if it’s easier and you get paid to learn, don’t optimize for a title. Just optimize for the stuff that you learn while on the job. That seems easier. And there’s nothing wrong with being a good analyst.
Maybe we need more good analysts than we do good data scientists as well. Right? Maybe we need more group-by statements that save the day. Hint: watch the talk.
But I do think there's a little bit of snobbery when it comes to job titles. Like, "Oh, I'm the super senior staff mega engineer." Sure. But if you're just a really decent analyst, that's also fine. We need more very decent analysts. Go for that.
Learning from Machine Learning
Seth: Yeah. That’s definitely good advice. And now the question that we’ve all been waiting for, what has a career in machine learning taught you about life?
Vincent: Some problems solve themselves when you ignore them. Seriously, I've been in so many situations where the problem got solved by just ignoring the machine learning bit that you kind of start to wonder: maybe some problems do solve themselves if you ignore them.
And I have noticed that in a few instances this is just the case. Especially when you have a child, you learn that there's some stuff you can over-optimize for as well. Like, oh, the baby's not sleeping well. Well, that problem will sort itself out at some point. It's not like influence from my end is going to make a very significant impact there.
And I guess the same thing goes for machine learning. There's some stuff that you can control and some stuff that you can't. Just make sure that you understand what you can and cannot control, and then move on from there.
And again, I'm kind of a mixed bag when it comes to the whole machine learning thing. Part of my opinion is that it's a super useful tool and we need more good people doing machine learning. But at the same time, it's a gross bucket of hype that we really want to have less of. And my day-to-day is to sort of deal with both of these feelings.
I hope this answered the question in some way, but that's kind of where I'm at. Try to do it calmly; that's my final pun. That's also something I might recommend.
Seth: There you go. Yeah. I think that some problems do resolve themselves over time. And I also like that the first rule of machine learning is: do you really need to use machine learning?
Vincent: Yeah, I agree. And one thing I really do want to do, maybe, is brag about the employer a bit.
One thing I really like about spaCy is that you don't have to use the machine learning bits. You can also just use the non-machine-learning bits in spaCy, and they are also performant, fast, and super useful. It's a machine learning package that also allows you to do rule-based stuff, and if you're doing NLP, that's really why I love using spaCy.
You don't have to use statistical stuff all the time. The rule-based engines are great too. End of pitch.
Seth: I’ve been a huge fan of SpaCy for a while now – at least four years, probably more. It’s helped me solve lots of problems from named entity recognition, text classification, cool ways of doing matching, all of that.
Vincent: Well, so if I can give one final pitch. There's a lot of talk about data-centric AI these days, but the reason I started getting interested in what these Explosion people were doing back in the day is a blog post from 2017 called "Supervised Learning is great – it's data collection that's broken." They were doing data-centric stuff in 2017, and that's one of the best blog posts I've ever read.
They talk about data quality, and one of the best quotes ever is: don't expect great data if you're boring the shit out of underpaid people. Because Mechanical Turk is still the way people go sometimes. Read that blog post. I will give you a link for the show notes. It's a highly inspirational thing people should read.
Seth: Awesome. Thank you so much. It has been such a pleasure to talk with you. You’ve given me tons of great resources. Putting together the show notes for this one is definitely going to be a good time. If there are some places that you would want listeners to learn more about you, what would those places be?
Vincent: So I’m on Twitter and Fosstodon these days. But the main thing is I can’t announce anything just yet. It’s that I work at Explosion and I can see the stuff that’s in the pipeline. So I’m working on very cool stuff and there’s definitely going to be announcements of super cool stuff all my other colleagues are working on – just follow Explosion.
There’s a bunch of really cool stuff in the pipeline. And if you do that, then you also at some point, will hear about some of the stuff that I’m working on.
Seth: Awesome. Thank you so much, Vincent. It has truly been a pleasure.
Vincent: Likewise.
Listen Now
The video of the full interview can be seen here:
The podcast is now available on all podcast platforms:
Our previous episode featured Maarten Grootendorst, the creator of BERTopic and KeyBERT, and a prolific author here on Towards Data Science. He discussed Open Source projects, the intersection of psychology with machine learning and software development, and the future of Natural Language Processing.
Learning from Machine Learning | Maarten Grootendorst: BERTopic, Data Science, Psychology
References
Resources to learn more about Vincent Warmerdam:
- calmcode
- https://koaning.io/
- Vincent Warmerdam: The profession of solving (the wrong problem) | PyData Amsterdam 2019
- Group-by statements that save the day – Vincent D Warmerdam
- https://github.com/koaning
References from the Episode
- You Look Like a Thing and I Love You: How Artificial Intelligence Works and Why It’s Making the World a Weirder Place
- The Future of Operational Research is Past
- Supervised Learning is great – it’s data collection that’s broken
- Deon – An ethics checklist for data scientists
- Hadley Wickham
- Katharine Jarmul
- Vicki Boykis
- Bret Victor
Resources to learn more about Learning from Machine Learning: