Office Hours
When I got my first corporate "data scientist" gig a few years ago, I barely knew what a decision tree was, I had no clue why people kept talking about random forests, and I had no grasp of what people actually meant by "AI," which to me was mostly associated with dystopian movies. I was quite overwhelmed, to put it mildly.
These days I feel a lot more at ease: I comfortably read research papers on a wide range of topics within AI and ML, give keynote talks, work as a lead data scientist at a corporate pharma company, and so on. A lot of lessons have been learned along the way. In this post, I'll outline what I believe to be some of the biggest insights, in no particular order, along with some general lessons for thriving in a corporate data science position.
Note: "Data science" varies a lot across different industries, so this is the perspective from someone working in corporate biotech/pharma companies.
1. Passion, Curiosity, and Capacity
Passion and curiosity are the most important qualities of a data scientist. If the choice is between a run-of-the-mill but highly experienced senior data scientist and a candidate with an unmistakable fire in their belly, pick the latter, even if that means less experience, a shorter education, or whatever.
I have seen many data scientists with long, impressive CVs and costly salaries who were completely out of touch with the technical or business details of the problem being solved. Meanwhile, I've seen a handful of fiercely passionate candidates without any prior experience achieve rapid success.
Passion and curiosity are the most important qualities of a data scientist
The notion that data scientists should be "passionate" is nothing new under the sun. There is a slightly more controversial appendix to that, though: you also need the "mental" capacity. The details of what that means are perhaps the subject of a future post. Most importantly, it requires the tenacity to keep pushing yourself to become better, as well as at least a certain level of raw brain horsepower (i.e., "intelligence").

2. Get or create the right job
A "data scientist" job is not a guarantee that you will do cool AI stuff all day long. On the contrary, a large majority of corporate data scientists I have met never or rarely even get to fit a linear regression model because all their time gets allocated to meetings, data cleaning or wrangling, dashboarding, etc.
A "data scientist" job is not a guarantee that you will do cool AI stuff all day long
The thing is, if you are not intimately familiar with how, for instance, the latest NLP algorithms work, then once you have spent weeks ingesting and cleaning all the internal documents in the company, there is a strong incentive for the company to hire consultants for a few weeks to do the actual data science work; doing so is more efficient than waiting for you to "catch up" by studying.
If your goal is to master the craft, you have to exercise the craft. Data science is like strength training: you can read as many books as you want, but you will not get strong unless you get under the bar and lift heavy things. In the above example, that could mean studying and applying the NLP algorithms while doing the data extraction and cleaning, as in the sketch below. Alternatively, start doing hobby projects, Kaggle competitions, and so on to hone your skills. Do this on company (and personal) time, and if necessary, without your manager's consent. That might be controversial, but see the next lesson for why I believe it is OK.
Data science is like strength training; you can read as many books as you want, but unless you get under the bar and lift heavy things, you will not get strong.
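For instance, here is a minimal sketch of what "applying while cleaning" could look like: tagging each document as it passes through your cleaning loop with a zero-shot classifier from the Hugging Face transformers library, so the grunt work doubles as hands-on model experience. The labels and documents below are hypothetical.

```python
# Minimal sketch: practice NLP while doing ingestion work.
# Assumes the `transformers` library; texts and labels are placeholders.
from transformers import pipeline

classifier = pipeline("zero-shot-classification")  # downloads a default model

candidate_labels = ["clinical trial", "manufacturing", "finance"]  # hypothetical

def clean_text(raw: str) -> str:
    """Stand-in for whatever cleaning you are already doing anyway."""
    return " ".join(raw.split())

documents = ["Raw   internal  document text ..."]  # placeholder corpus
for raw in documents:
    text = clean_text(raw)
    result = classifier(text, candidate_labels=candidate_labels)
    # Labels come back sorted by score; print the top guess.
    print(result["labels"][0], round(result["scores"][0], 3))
```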

3. Infinite-X Data Scientist
A "10X developer" concept was introduced decades ago to signify a developer that is 10 times more productive than the average developer. Although a lot of controversies surrounds the claim that such developers exist, my personal experience is that the fold-order may be a lot higher than 10X when it comes to data scientists – you can argue that an average developer may be able to solve just about any task given enough time, but I do not believe the average data scientist can similarly solve any task given "enough time."
It is in your own, and your employers, interest that you spend a lot of time honing your skills and expanding your horizons.
Say, for example, that you face a business problem that cannot be solved with traditional ML and instead requires some custom implementation of a transformer-based architecture on multiple misaligned time series while also being interpretable. If you have never even played with simple neural networks, you will not be able to crack that problem in any reasonable amount of time. And yes, my experience is that these problems do come up, and if you do not have a solid overview of the methodologies that can solve these problems, either you will not solve the problems, or the problems will not be solved at all.

4. Do not fall into the Proof-of-Concept (POC) trap
This is a big one, and you will likely see it repeatedly: you get data, fit some models, and get nice initial results that look very promising. And then, nothing. Nobody picks it up. Nobody puts it into production. It never generates value. And the whole point is to generate value.
Your primary goal is generating business value. POCs do not generate value.
The reasons this happens are multiple and are a topic in their own right. For one, putting machine learning models into production is difficult, as demonstrated in part in my previous post on getting started with MLOps. Another typical reason is poor change management and anchoring in the business. Make sure you have someone dedicated to this process so that the work you do ends up generating value in the end.
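To make "production" slightly more concrete, here is a minimal sketch (not my actual setup) of the first step past the notebook stage: wrapping a trained model in a prediction endpoint with FastAPI. The model file and feature schema are hypothetical, and the saved model is assumed to be a scikit-learn pipeline that handles its own preprocessing.

```python
# Minimal sketch of a prediction API. Assumptions: a pre-trained pipeline
# saved as model.joblib, pydantic v2 (for model_dump). Run: uvicorn app:app
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical pre-trained pipeline

class Features(BaseModel):
    price: float
    promo: bool

@app.post("/predict")
def predict(features: Features) -> dict:
    # One-row DataFrame so the pipeline sees named columns.
    X = pd.DataFrame([features.model_dump()])
    return {"prediction": float(model.predict(X)[0])}
```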

5. Your background is important
I did not have a background in data or computer science when I landed my first data science job. However, I had been coding almost daily for 15 years and had earned a Ph.D. in computational biotechnology. This meant I could jump straight into understanding various algorithms and testing them out without also having to learn to code. I could also start reading research papers right out of the gate, albeit slowly at first, as you have to get into the terminology.
The point? You do not master data science in 1, 3, or even 6 months, and you should beware of anyone trying to tell you otherwise. You can learn to be useful and employable in a short amount of time, but data science is a vast and multifaceted subject that takes years or decades to master.

6. Study all the subjects
A typical data science recommendation is to specialize in a given subject to avoid becoming a "jack of all trades, master of none." This is BS, and you risk becoming outdated very quickly if you follow this advice. Rather, one should strive to become a "jack of all trades, master of several."
Do not shun the idea of the "data science unicorn"; rather, strive to become one.
Why? There are countless synergies across different fields. The more perspective you have, the more often you will find yourself employing weird "tricks" from one field within another, enabling you to solve problems you otherwise could not have. If nothing else, when your particular data science specialization eventually does get disrupted by new technology, it is nice to have other strengths to play on.
You can always lose your job, but the knowledge you acquire is yours to keep.
Not only do I believe you benefit from studying as many data science disciplines as possible, I believe you should also pursue knowledge of general programming, data engineering systems and workflows, setting up cloud infrastructure, front-end development, etc. The days when a data scientist could sit and fit models in RStudio all day and hand them over to the operations team are dead.

7. It might not be your job, but it is your problem
A friend recently brought up the sentiment (see here) that the most successful data scientists typically have a "might not be my job, still my problem" mentality. If nobody is there to do the data ingestion or set up the infrastructure, if someone messed up the data cleaning, or if nobody is doing the stakeholder management, you have to find a way to deal with it. Dealing with these things builds character.
Take ownership, and do whatever it takes to generate business value.
One of the best ways to grow as a data scientist is to take a difficult data science problem all the way to where it creates business value – not just to a POC. This is yet another reason many projects fail even after successful POCs: nobody steps up to take on the various challenges of putting things into production.

8. Define the right problem
There are many reasons why data science projects fail to generate business value. One that I believe is often overlooked is failure to solve the right problem. Some people spend months ingesting or cleaning data with no idea or plan for what to do with it. Some get too attached to the idea of using a specific modeling approach. Most typically, though, people spend a long time creating something that the business never actually wanted or needed.
Sometimes it is not possible to create the model you want. Or even worse, nobody ever wanted the model in the first place.
Before spending months cleaning data, establish exactly what you want to use that data for, and establish a baseline ML model to guide you on your cleaning journey. Talk with the stakeholders, establish a good relationship, and make sure they are keener on seeing the model in production than you are. This is a major task in and of itself – you need to translate a real-world problem into a machine learning problem that can be solved.
When working on a given solution, if after a good amount of effort it does not seem like you can create a sufficiently good model, rethink whether you can solve a different "right" problem: maybe it is possible to solve a classification problem instead of a forecasting problem, or to reformulate the approach as a ranking problem rather than a regression problem, etc.
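As a concrete illustration, here is a minimal sketch of the forecasting-to-classification reformulation, with a made-up demand table and an assumed capacity threshold:

```python
# Minimal sketch: instead of predicting next month's exact demand (regression),
# predict whether it will exceed a capacity threshold (classification).
# Column names, values, and the threshold are all made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "month": pd.date_range("2022-01-01", periods=6, freq="MS"),
    "demand": [80, 120, 95, 140, 130, 90],
})

THRESHOLD = 100  # e.g., current production capacity (assumed)

# The hard regression target: next month's exact demand.
df["demand_next"] = df["demand"].shift(-1)

# The easier classification target: does next month's demand exceed capacity?
# Often this is all the business actually needs to make a decision.
known = df["demand_next"].notna()
df.loc[known, "exceeds_capacity_next"] = (
    df.loc[known, "demand_next"] > THRESHOLD
).astype(int)

print(df)
```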

9. Establish good habits
Willpower and passion are not enough to master data science, unless you become obsessed and spend every hour of every day eating, sleeping, and breathing data science. I did the latter for a long period, and while it was definitely fruitful, that path eventually leads to burnout. Instead:
Good habits must be established and adhered to, in order to ensure gradual and constant improvement of your skills.
Establishing successful habits is a topic in itself, but a book like Atomic Habits is an excellent starting point. Examples of habits could be reading a research paper every day before breakfast, always writing docstrings on functions and methods, always establishing a simple baseline before any model development, re-implementing one research paper from scratch per month, and so on.

10. Data science is not going extinct
Within my first week of doing data science, someone told me that we would make ourselves obsolete within a few years. Yes, I believe the goal is to automate whatever we do right now and make it "obsolete." No, data science is not going extinct. Your skills might. Or the "data science" title may splinter into subdisciplines over time. But the core idea of working with data to generate insights and value is here to stay for a very long time. The only thing AutoML and no/low-code data science tools will do is let you focus on more interesting and difficult problems rather than on the repetitive parts of building a model.

11. Learn to tell a story
Say we could map your storytelling skills onto a continuous scale, where a value of 0 indicates a complete inability to communicate your results and 1 indicates extraordinarily efficient and engaging communication. Now, take all your technical skills as a data scientist, and:
the value that you can create will be proportional to your storytelling ability multiplied by your technical prowess
The lesson? Learn to communicate your results and tell an engaging story, but do not compromise your technical ability; i.e., do not spend all your time making PowerPoints, even if that is what your boss wants.

12. Eat your vegetables
Mastering data science is hard. You have to understand the business problems, talk with people and communicate effectively, implement complex coding and analytics solutions, and constantly do research and reinvent yourself to stay on the cutting edge.
To do all these things, your brain needs to function at optimal capacity, especially if you are going to be pushing your own boundaries. Educate yourself, experiment, and objectively evaluate the results of different diets and exercise regimens. Do not fall into the trap of thinking that you "need" specific foods, stimulants, etc. – maybe read a book like The Pleasure Trap if you're completely new to this. I wish it were not true, but I know for a fact that my health strongly influences my performance.

13. Master the fundamentals
Relearn the fundamentals regularly. In a typical predictive use case, the goal is to create an ML model that will perform well in real life, that is, on data never seen in either our training or test set. To help us create a model that generalizes to unseen data, we typically rely on the fundamental technique of cross-validation. Make sure you truly understand it, so you do not find yourself iteratively optimizing a score that will never translate into generalization.
At the end of the day, the goal of an ML model is to work on new, unseen data
Do not let information leak from the training to the test set. Be sure you understand stratified cross-validation, time-series cross-validation, etc. Be intimately familiar with different evaluation metrics, their strengths, and their weaknesses. Get to the level where you have enough experience to know that improving a model is often much more about data quality than the model itself. And remember: at the end of the day, the goal of an ML model is to perform well on new, unseen data.
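To make the leakage point concrete, here is a minimal sketch using scikit-learn on synthetic data: the scaler lives inside a Pipeline so it is re-fit within each fold, and TimeSeriesSplit ensures the training folds never see the future.

```python
# Minimal, leakage-safe evaluation sketch on synthetic data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))  # synthetic, time-ordered features
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)

# WRONG: X = StandardScaler().fit_transform(X) before splitting would leak
# test-fold statistics (mean/std) into the training folds.

model = make_pipeline(StandardScaler(), Ridge())  # scaler re-fit per fold
cv = TimeSeriesSplit(n_splits=5)                  # training never sees the future

scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(scores.round(3), round(scores.mean(), 3))
```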

14. Establish a baseline
When facing a new problem, it is not rare to see people jump straight to super-advanced models. Always start with the simplest model imaginable and establish that as your "baseline." Then you can progressively experiment with data and model refinements. Log all experiments using a tool such as MLflow, W&B, etc.
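A minimal sketch of what that could look like with scikit-learn and MLflow, on synthetic data: a DummyRegressor that always predicts the training mean becomes the baseline every later experiment must beat. The experiment and run names are made up.

```python
# Minimal baseline-plus-tracking sketch; data is synthetic.
import mlflow
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] * 2.0 + rng.normal(size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlflow.set_experiment("demand-model")  # hypothetical experiment name
with mlflow.start_run(run_name="mean-baseline"):
    baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, baseline.predict(X_te))
    mlflow.log_param("model", "DummyRegressor(mean)")
    mlflow.log_metric("mae", mae)
    print(f"baseline MAE: {mae:.3f}")  # the number every later model must beat
```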

15. Do not trust your own results
Never present any results without 1) ensuring that your cross-validation setup is legit, and 2) thoroughly inspecting SHAP values or similar for your predictions.
Before presenting predictive model evaluation results ask yourself: 1) does my validation strategy make sense? and 2) have I thoroughly inspected and interpreted why the model is predicting what it does?
Too often, a practitioner finds that a given model performs extraordinarily well on a given problem, only to realize later that information was leaked from the test set to the training set during evaluation. Alternatively, they realize the implementation leaked information from the target, e.g., in the form of some highly correlated input feature that should not be there. Or, in some cases (e.g., with images), they realize that the model is focusing on something completely different from what we want it to focus on.
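Here is a minimal sketch of the kind of SHAP sanity check I mean, with a target leak planted deliberately in synthetic data so the attributions expose it:

```python
# Minimal SHAP sanity-check sketch; the leak is constructed on purpose.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=300)

# Deliberately leak the target as a fourth, near-perfect "feature".
X = np.column_stack([X, y + rng.normal(scale=0.01, size=300)])

model = RandomForestRegressor(random_state=1).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Mean |SHAP| per feature: the leaked column (index 3) dominates completely,
# which is exactly the red flag to catch before anyone sees a slide deck.
print(np.abs(shap_values).mean(axis=0).round(3))
```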

16. Do not trust other people’s results
Also, do not blindly accept the results of others. That goes for the scientific literature as well as for the results of your colleagues. Even if everything "compiles" just fine and produces a reasonable result, there might still be a completely non-obvious error in a given solution, so always carefully study how your colleagues solved a specific problem. And hopefully, they will do the same for you!
I do think there is a good framework for thinking. It is physics – you know the sort of first principles reasoning … What I mean by that is boil things down to their fundamental truths and reason up from there as opposed to reasoning by analogy – Elon Musk
This lesson goes deeper, though. The devil is in the details. Do not trust that one library resizes images the same way as another library you might use in production. Do not trust that people implemented a given algorithm in a given library correctly. Do not assume that people know how to calculate a standard deviation properly. Question everything.
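A small, concrete instance of this: Pillow and OpenCV both offer "bilinear" resizing, yet they can disagree substantially on the resulting pixels (Pillow antialiases when downscaling; OpenCV's INTER_LINEAR does not). A minimal sketch on random data:

```python
# Minimal sketch: the "same" resize operation in two libraries.
# Requires Pillow and opencv-python; the image is random noise.
import cv2
import numpy as np
from PIL import Image

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)

pil_out = np.asarray(
    Image.fromarray(img).resize((64, 64), resample=Image.BILINEAR)
)
cv2_out = cv2.resize(img, (64, 64), interpolation=cv2.INTER_LINEAR)

# Same nominal operation, two libraries, non-trivial disagreement:
diff = np.abs(pil_out.astype(int) - cv2_out.astype(int))
print("max abs pixel difference:", int(diff.max()))
```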

17. Do not trust consultants
Working in a corporate environment, you will work alongside consultants at least once in a while. Among these consultants you can find some of the brightest minds and best data scientists. You can, however, also find senior/principal data scientists who have no clue what they are doing.
So should you always go with the most expensive consultants from the big consultancy houses? No. My experience is actually quite the contrary: startups and smaller independent consultancies typically produce better results. In any case, carefully evaluate the individuals in question (and not the consultancy as a whole) as you would any candidate, and establish a good working relationship.
Most importantly, though, ALWAYS have internal data scientists (i.e., technical people) work together with consultants to ensure quality, consistency, and anchoring of results within the organization.

18. Beware of time series problems
This one may sound weird, and maybe it is just me, but I have been burned by it multiple times: be wary of time series problems. Real problems are rarely like the forecasting tutorials. Especially in financial problems, you quickly end up with multiple timestamp entries in different time zones for different events occurring for different items, and cleaning that up before modeling can be a real hassle requiring a lot of knowledge about the underlying business process. Always allocate a bit of extra time for these problems, as things can get hairy once you start getting into the business logic.
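As a small illustration of the timestamp mess, here is a minimal pandas sketch normalizing mixed-time-zone event timestamps to UTC before any modeling (the events are made up):

```python
# Minimal sketch: event timestamps arriving with mixed UTC offsets must be
# normalized to a single clock (UTC) before resampling or feature engineering.
import pandas as pd

events = pd.DataFrame({
    "item": ["A", "A", "B"],
    "ts": [
        "2023-03-01 09:00:00+01:00",  # Central European Time
        "2023-03-01 04:10:00-05:00",  # US Eastern
        "2023-03-01 08:30:00+00:00",  # UTC
    ],
})

# utc=True parses each offset and converts everything to one common clock.
events["ts_utc"] = pd.to_datetime(events["ts"], utc=True)
events = events.sort_values("ts_utc")  # only now is ordering meaningful
print(events)
```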

19. Everything is a web app
Even if you create the most awesome model ever, it rarely has much value if it just sits on your local machine or if you just share the raw implementation with the end-user. The barrier to using a given ML implementation has to be as low as humanly possible, and one of the most accessible ways to expose your model is as a web application, where the end-user can point their browser to, e.g., [www.mycompany.com/sales_forecast](http://www.mycompany.com/sales_forecast) to get their sales forecast. Apps can be implemented using simple frameworks such as Streamlit or Dash, or alternatively using more established backend frameworks (e.g., Flask, Django, or FastAPI) and frontend frameworks (React, Angular, Vue).
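For illustration, here is a minimal Streamlit sketch of such an app. The model file, feature names, and page content are hypothetical, and the saved model is assumed to be a pipeline that handles its own preprocessing; you would run it with `streamlit run app.py`.

```python
# Minimal sketch of exposing a model as a web app with Streamlit.
import joblib
import pandas as pd
import streamlit as st

st.title("Sales forecast")  # what the end-user sees at /sales_forecast

model = joblib.load("sales_model.joblib")  # hypothetical pre-trained pipeline

# Collect the model's inputs through simple widgets instead of raw code.
region = st.selectbox("Region", ["North", "South", "East", "West"])
promo = st.checkbox("Promotion running?")
price = st.number_input("Unit price", min_value=0.0, value=9.99)

if st.button("Forecast"):
    features = pd.DataFrame([{"region": region, "promo": promo, "price": price}])
    prediction = model.predict(features)[0]
    st.metric("Expected weekly sales", f"{prediction:,.0f} units")
```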

20. Be humble
Finally, humility is a virtue that every data scientist must embrace. The field is enormous and ever-changing, and we constantly have to face the fact that the technology we spent years learning is now obsolete, that we did not understand a given concept as well as we thought we did, or that there are a lot of people out there who are way smarter than we are. Keep learning and stay humble; there is always a lot more to learn.

Final Remarks
This turned out to be quite a lengthy post, so if you made it this far, thank you for reading 🙌 I'm sure I forgot many lessons that have become second nature instead of being subject to conscious focus. I would love to hear lessons learned from other people, so feel free to reach out. I am sure the experiences of data scientists working in technology companies differ from mine, which comes from working in corporate biotech and pharma companies for a limited number of years.