Writing | Towards Data Science
https://towardsdatascience.com/category/writing/
The world’s leading publication for data science, AI, and ML professionals.

Announcing the Towards Data Science Author Payment Program
https://towardsdatascience.com/announcing-the-towards-data-science-author-payment-program/ | Fri, 28 Feb 2025
Rewarding contributors for the time and effort required to write great articles

At TDS, we see value in every article we publish and recognize that authors share their work with us for a wide range of reasons — some wish to spread their knowledge and help other learners, others aim to grow their public profile and advance in their career, and some look at writing as an additional income stream. In many cases, it’s a combination of all of the above.

Historically, there was no direct monetization involved in contributing to TDS (unless authors chose to join the partner program at our former hosting platform). As we establish TDS as an independent, self-sustaining publication, we’ve decided to change course: it is important to us to reward the articles that help us reach our business goals in proportion to their impact.

How it works

The TDS Author Payment Program is structured around a 30-day window. Articles are eligible for payment based on the number of readers who engage with them in the first 30 days after publication.

Authors are paid based on three earning tiers:

  • 25,000+ Views: The article will earn $0.1 per view within 30 days of publication: a minimum of $2,500, and up to $7,500, which is the cap for earnings per article.
  • 10,000-24,999 Views: The article will earn $0.05 per view within 30 days of publication: a minimum of $500, and up to $1,249.
  • 5,000-9,999 Views: The article will earn $0.025 per view within 30 days of publication: a minimum of $125, and up to $249.

A few important points to keep in mind: 

  • Views are counted only if a reader stays on the page for at least 30 seconds, ensuring that the payouts reflect real engagement, not clicks.
  • Articles with fewer than 5,000 views in 30 days will not qualify for payment.
  • During these 30 days, articles must remain exclusive to Towards Data Science. After that, authors are free to republish or remove their articles.
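
For illustration, here is a minimal Python sketch of how these rules translate qualifying 30-day views into a payout. The thresholds and rates come from the tiers listed above; the function name and the rounding behavior are just for this example.

def estimate_payout(qualifying_views_30d: int) -> float:
    # Thresholds and per-view rates come from the tiers listed above (illustrative sketch only)
    if qualifying_views_30d >= 25_000:
        return min(0.10 * qualifying_views_30d, 7_500.0)  # top tier, capped at $7,500 per article
    if qualifying_views_30d >= 10_000:
        return 0.05 * qualifying_views_30d  # middle tier
    if qualifying_views_30d >= 5_000:
        return 0.025 * qualifying_views_30d  # lowest qualifying tier
    return 0.0  # articles under 5,000 views in 30 days do not qualify
print(estimate_payout(30_000))  # 3000.0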

Who can participate?

This program is available to every current TDS contributor, and to any new author who becomes eligible once an article reaches the first earning tier.

Participation in the program is subject to approval to ensure authentic traffic. We reserve the right to pause or decline participation if we detect unusual spikes or fraudulent activity. Additionally, payments are only available to authors who live in countries supported by Stripe.

Authors can submit up to four articles per month for paid participation.

Why we’re doing this

We built this program to create a transparent and sustainable system that pays contributors for the time and effort required to write great articles that attract a wide audience of data science and machine learning professionals. By tracking genuine engagement, we ensure that the best work gets recognized and rewarded while keeping the system simple and transparent.

We’re excited to offer this opportunity and look forward to supporting our contributors who keep Towards Data Science the leading destination in the data science community. 

How to contribute your work

We’re working swiftly to roll out an author portal that will streamline article pitches and feedback.

In the meantime, please send your upcoming article directly to our team using this form.

If you’re having an issue with our online form, please let us know via email (publication@towardsdatascience.com) so we can help you complete the process. Please do not email us an article that you have already sent via our form.

Bridging the Data Literacy Gap
https://towardsdatascience.com/bridging-the-data-literacy-gap-2d9284d33f96/ | Fri, 06 Dec 2024
The Advent, Evolution, and Current State of "Data Translators"

Introduction

With data constantly glorified as the most valuable asset organizations can own, leaders and decision-makers are always looking for effective ways to put their data insights to use. Every time customers interact with digital products, millions of data points are generated, and the opportunity cost of not harnessing them to make better products, optimize revenue generation, and improve customer footprint is simply too high to ignore. The role of "Data Translator" began to emerge on analytics and data science job boards in the 2010s to help bridge the knowledge gap between business and data teams and enable organizations to become more data-informed. Over the last decade, the role has evolved, absorbing more and more facets of data-driven decision-making and providing much-needed context and translation for business leadership. It also plays an important part in interfacing with stakeholder groups such as Marketing, Product, and Strategy to help make all decisions data-centric. Given the well-accepted importance of this role and the nimble nature of the responsibilities assigned to it, it is essential for all data practitioners to build the "data translation" muscle to excel, succeed, and progress in their roles and career paths.

Addressing the Data Knowledge Gap

Decision-making has been a cornerstone of successful business stories across all industries. Peter Drucker, the noted management theory and practice expert, famously said, "Every decision is risky: it is a commitment of present resources to an uncertain and unknown future." In most modern organizations, data centricity and data-informed decision-making are accepted as proven ways to reduce risk and ambiguity and to give business decisions a higher likelihood of successful outcomes. Data and marketing executives are tasked with making a series of decisions each day that have far-reaching impacts on an organization’s day-to-day operations and long-term priorities. While data resources are abundant in the current landscape, the process of utilizing them is still a struggle. According to a recent study released by Oracle titled "The Decision Dilemma" (April 2023), 72% of business leaders said that the enormous volume of data available, along with a lack of trust and inconsistencies in data sources, has stopped them from making decisions, and 89% believe that the growing number of data sources has limited the success of their organizations, despite understanding that decisions not backed by data can be less accurate, less successful, and more prone to errors.

Data-driven decision-making is certainly not a new concept; in fact, the first set of decision models based on data and statistical principles was proposed in 1953 by Irwin D.J. Bross, distinguishing between the real and the symbolic and illustrating the importance of measurements and validation. Organizations have consistently evolved over the past few decades to make data investments and have crafted strategies to put data at the center of their risk mitigation and decision-making efforts. Despite having these resources, organizations currently struggle with the unique problem of balancing the appetite for high-quality, actionable insights against the availability of resources. A simple "seesaw" analogy can be used to describe these business circumstances. An excessive appetite for knowledge and actionable insights combined with inadequate data resources may result in data leaders relying on past decisions, anecdotal evidence, and gut feelings to make decisions. On the other hand, an abundance of data resources combined with a limited appetite for knowledge can ultimately result in unnecessary data solutions and too many self-serve dashboards built with no clear strategy to make these resources useful.

Source: Image illustrated by the author using AI prompts.

It is becoming increasingly clear that the data knowledge gap is becoming wider despite the availability of abundant data resources. We are increasingly observing a unique situation of a Broken Seesaw, where both Data resources and appetite for knowledge exist, but owing to the lack of efforts to translate the value that the data teams provide to business leaders, both sides of the seesaw get overloaded, eventually leading to broken and inefficient decision making processes in the long run. Valuable data creates impact, everything else sleeps in a dashboard.

Is data literacy the answer?

Yes and No.

Data literacy has quickly become a focal point in the conversation around data insights and delivery, as organizations recognize the importance of equipping employees with the skills to understand and utilize data effectively. The movement gained momentum with research highlighting a significant skills gap among business professionals when it comes to interpreting data. However, the emphasis on training users to interpret data can sometimes overlook the steep learning curve required to build the skill of thinking critically about data evidence and interpreting it in a way that aids risk mitigation.

Technology barriers are another bottleneck between data teams and business stakeholders. We can break this bottleneck down into two parts. First, an insufficient analytics tool stack can hinder effective data utilization and communication by non-super users of data. Second, a lack of training on the tools that do exist often leads to misinterpretation and misalignment with other data sources, hindering the chance to establish a single source of truth. This eventually affects the credibility of the data teams.

A significant drawback of the current emphasis on data literacy is the tendency to place undue blame on users for the shortcomings of data initiatives. When data products fail to deliver value or are met with resistance, the reflexive response is often to assume a lack of user skill or understanding. This perspective overlooks the critical role that business literacy and business context play in communicating data insights effectively, whether that means proving or disproving business hypotheses. Data literacy is a two-way street. Oftentimes, it is the responsibility of data team members to view the task from the business perspective and understand why the business would care about what the data team has to say. Acknowledging and addressing these shortcomings and aligning data initiatives with business goals can lead to more effective and harmonious data-driven cultures.

The Advent of Data Translators – Who were they back then, and who are they now?

One solution that the data industry has adopted to address this data and knowledge gap and the shortcomings of Data Literacy efforts is the introduction of "Data Translator" roles within organizations. The role of a data translator has evolved significantly over the years, reflecting changes in how organizations utilize data analytics. Initially emerging as a bridge between data scientists and business units, the role was designed to ensure that complex data insights were translated into actionable business strategies.

In the early stages, data translators were primarily seen as intermediaries who could communicate technical findings to non-technical stakeholders, helping to prioritize business problems and ensuring that analytics solutions were aligned with business goals. As the demand for data-driven decision-making grew, so did the importance of this role. By 2019, the role had become more prevalent, with about a third of companies having positions fitting the data translator description. The responsibilities expanded to include not only communication but also ensuring that analytics tools were adopted across enterprises and that data democratization was achieved. Recently, there has been a shift towards integrating these roles into broader functions such as Data Product Owners, reflecting an evolution towards more holistic roles that encompass both technical and strategic responsibilities. This evolution highlights the ongoing need for roles that can effectively link data insights with business outcomes.

Figure 1: Dominant themes in the Data Translator skill set universe. Source: Image illustrated by author.
Figure 2: The various roles that the Data Translator skills fuel. Source: Image illustrated by author.

The Data Translator role can take on a multitude of responsibilities depending upon the nature of the organizations they serve. For example, consulting organizations typically assign a dedicated Data Translator who is responsible for translating the provided data solutions to the business audience. Professionals who are hired in-house typically take the form of either dedicated Data Translator resources, Data Product Managers, or Analytics Delivery Managers with the responsibility of ensuring that the Data team’s efforts are utilized appropriately for critical business decisions. Despite having various job titles, Data Translators are tasked with the critical responsibility of proving the value and impact driven by data teams. They accomplish this by focusing on the following key areas:

1. Cost Containment:

Data Translators work as liaisons between the business leaders and data teams by consistently quantifying the impact of the projects delivered by the data team and weighing on the thoughtful allocation of data resources. For example, they may do this by keeping a record of monetary impact and decisions driven by the data teams they support. This record is often helpful in estimating resources for new strategic initiatives and serves as a reference for data solutions that can be replicated for similar problems in new contexts.

2. Strategy and Prioritization:

Data translators have a solid grasp of business goals and priorities and work on aligning their team’s efforts with the broader business objectives. This process often involves identifying projects that not only leverage the team’s skills but also have the potential to influence strategic outcomes. A popular approach to prioritization is using a framework that assesses the potential impact and feasibility of projects. By streamlining the data team’s intake systems and focusing on initiatives that promise significant returns or solve critical business problems, data teams can maximize their usefulness and productivity. In an article explaining the traits of data product managers and translators, Harvard Business Review identified business context, broad technical fluency, project management skills, an entrepreneurial spirit, and the ability to explain data needs and strategy to the rest of the organization as the key traits.

3. Bridging the Data Literacy Gap

Data Translators work with Governance teams across the organization to establish common data language, definitions, and standards to ensure that all teams are aligned in their understanding and interpretation of data. This ensures that all data efforts are working together cohesively to establish a single source of truth.

4. Stakeholder Engagement

Identifying and prioritizing key stakeholders is essential for data teams to ensure their efforts are aligned with the organization’s strategic goals. Data Translators often accomplish this by using a project management technique called the "Interest – Influence Matrix". This process begins by mapping stakeholders across two dimensions: their level of interest in data initiatives and their influence on decision-making. High-interest and high-influence stakeholders are considered key players and should be prioritized for regular communication and collaboration. Building strong relationships with these individuals is crucial, as they can champion data projects, help secure resources, and remove roadblocks. For less influential stakeholders, maintaining periodic contact ensures they remain informed without overextending team resources. This type of thoughtful engagement enables data teams to focus their efforts where they can have the most significant impact, driving value for the organization as a whole.

5. Internal Promotion and Outreach

In an increasingly data-centric landscape, the role of Data teams has become significant, yet they are often misunderstood. Data Translators often create roadshows, presentations, and educational materials to share out the Data Team’s achievements and value provided in order to build and maintain credibility and trust across the organization.

Building the Data Translator Muscle

Observing the history and evolution of the Data Translator role shows that, along with data fluency, it is essential to have domain knowledge, business context, and a solid understanding of organizational nuances such as goals, expected outcomes, and effective stakeholder partnerships to be successful in this role. The nimble nature of this role cannot go unnoticed. Over the past few decades, professionals across the data ecosystem with various job titles have been absorbed into "Data Translator" roles and responsibilities in different ways. In order to future-proof their data careers and be consistently successful and valuable to their organizations, data professionals must build the "Data Translator" muscle.

Practical tips Analysts can follow to empower themselves to be Data translators

Below is a non-exhaustive list of practical tips that will help analysts become well-versed in data translation.

Curse of Knowledge

The curse of knowledge is a cognitive bias that occurs when a person who has specialized knowledge assumes that others share that same knowledge. This bias makes it difficult for knowledgeable individuals to imagine what it’s like to lack their expertise. Assuming everyone shares the same understanding and background knowledge leads to misunderstandings, wrong assumptions and ineffective communication. This is particularly true when interfacing with teams such as Marketing and Product, where the stakeholders are not necessarily data fluent, but data plays a major role in their projects and campaigns being efficient and fruitful. A data translator must have the unique capability to dissect the problem statement and map it into data points available, make the connections, find answers, and explain it to stakeholders in plain English. Here is a Marketing Analytics example:

Statement 1 (Analyst): Looking at the channel attribution charts, it looks like most of your campaign’s ROAS is negative, but it looks like there is less churn and more engagement, it’s not all wasted effort.

Statement 2 (Data translator): After assessing the marketing dollar spend and returns, it looks like your campaign is losing money in the short term. But looking at the big picture, the users acquired by your marketing campaigns are engaging and returning more hence creating long-term value.

The data translated version of the statement clearly explains the findings and illustrates the long-term impact of the campaign without the Data Analytics jargon.

Move on from business questions to business goals.

Oftentimes, analysts confine themselves to the bounds of their job responsibilities and focus purely on answering business questions. Sometimes, this phenomenon is also an unexpected side effect of organization-wide data literacy efforts. Answering business questions limits the insights to a specific problem, while focusing on the overall business outcome gives both the data and business teams a chance to look at data insights at a more holistic level. Data literacy goes hand in hand with business literacy. Data Translators are always expected to have a working knowledge of the business outcomes so they can tie insights to the overarching goals.

For example,

Business Question: How is my newly launched brand campaign doing?

Answer (Analyst): We had 6000 impressions in 3 days which is 50% higher compared to the last time we ran a similar campaign same time last year.

Answer (Data Translator): The expected outcome of this campaign is to improve brand awareness. We had 3000 net new users visit our website from this campaign. We also measured brand perception metrics before vs. after using a survey poll for these specific users and their opinions and awareness about the brand’s product offerings have improved.

Learn to Zoom out

Learning to zoom out and look at the big picture, and being able to map out individual tasks into overall priorities help Data translators focus their efforts on impactful initiatives. This skill also enables them to learn to build scalable analytics solutions that can be repurposed, eventually leading to time savings and better speed to insight.

Become a good data storyteller

"I didn’t have time to write a short letter, so I wrote a long one instead."

― Mark Twain

Data storytelling is equal parts science and art. And it is an essential tool in the Data Translator toolkit. It requires a thorough understanding of the problem and solution, constructing a clear, concise, and relatable narrative, and ending with recommendations and insights that can be acted upon. Every data story needs a governing idea that loosely follows an arc. One effective way to arrange the analysis story deck is in the order of Problem, Problem’s Impact, Findings, Recommendations, and Next steps. This ensures that your data story is easy to follow and speaks for itself even when you’re not around to narrate and walk through the whole deck.

The order may look different in repeating tasks such as routine performance updates or retrospective summaries. But for a typical request requiring data insights to aid decision-making, this order is a great starting point. The main thing to ensure in this step is to have accurate and relevant data points that clearly support your story. Apart from that, I have a few tips to help wrap up your Analysis solution neatly. These little details go a long way in presentation delivery and helping the audience remember the key insights.

· Clearly indicate if the key data point being shared is a good sign or a bad sign by using arrows and colors. (Example: A low bounce rate is a good sign, but a low conversion rate is a bad sign.)

· Always add context for any number (data point) shared in the slide by including benchmarking details or trend analyses. (Example: Conversion rate for this month was 12%, this is in line with other SKUs in the same product line and higher compared to the average conversion rate for the same months in the past three years.)

· Tie back the insights to some part of the original business question, goal, and outcome in each slide.

· Including details such as sample size, analysis time frames and important annotations in the footnote will help build trust and credibility.

In essence, a data story can be deemed effective when it leaves the audience informed and inspired to act.

Conclusion

Data Translators perform the critical role of bridging the gap between Data and Business teams. Their skill set is instrumental in proving the worth and impact of data investments, promoting data literacy, prioritizing high-impact initiatives, and protecting Analysts’ time from working on low-value tasks. Organizations and data teams can reap symbiotic benefits by encouraging, incorporating, and nurturing team members with data translator skills.

About the Author :

Nithhyaa Ramamoorthy is a Data Subject matter Expert with over 12 years’ worth of experience in Analytics and Big Data, specifically in the intersection of Healthcare and Consumer behavior. She holds a Master’s Degree in Information Sciences and more recently a CSPO along with several other professional certifications. She is passionate about leveraging her analytics skills to drive business decisions that create inclusive and equitable digital products rooted in empathy.

The post Bridging the Data Literacy Gap appeared first on Towards Data Science.

]]>
My Medium Journey as a Data Scientist: 6 Months, 18 Articles, and 3,000 Followers
https://towardsdatascience.com/my-medium-journey-as-a-data-scientist-6-months-18-articles-and-3-000-followers-c449306e45f7/ | Mon, 11 Nov 2024
Real numbers, earnings, and data-driven growth strategy for Medium writers

I started writing data science and AI content on Medium in May 2024. This is my sixth month and I just hit a major milestone – 3,000 followers! I am very proud of my achievements.

In this article, I will share how this journey started, what I have been writing, and what I learned. Plus, as a data scientist, I always enjoy analyzing my own data. I collected a dataset of my Medium stats, including article views👀 , reads📖 , claps👏 , earnings💵 , etc. Join me as I break down my Medium experience using data and share my data-driven writing strategies.

Image created by DALL·E

My Medium Journey Overview

How it all began

My writing habit dates back well before I started writing on Medium. I have been running my data science portfolio site since 2018, back when I started my first full-time job. I post articles there and occasionally share them on LinkedIn. It helps me connect with friends and colleagues in the data domain. Earlier this year, I posted an article about my experimentation with custom GPTs, and it reached nearly 10k impressions on LinkedIn. That is not bad at all, but it also led me to wonder how I could reach an even wider audience.

Meanwhile, I have been a Medium member since 2020. It has been invaluable for learning skills outside of my daily work and keeping up with new technologies in the industry. Having been in the industry for seven years, I feel it is time to be on the other side and share my knowledge with the community (and get my $5 monthly Medium subscription fee back 😀 ).

This is how the story started. I first tried posting some of my old articles on Medium, then moved on to writing brand-new content, submitting my articles to publications like Towards Data Science, and posting two to four new articles each month.

What I write about

My articles cover these three categories:

  • Technical tutorials: Many people come to Medium to learn how to do X, just as I do. Therefore, a majority of my articles fall under this category. This includes my article with the highest earning: Mastering SQL Optimization: From Functional to Efficient Queries.
  • Learnings: We don’t know everything, but that is okay. I enjoy exploring new things and sharing my discoveries on Medium. For example, I have a series of articles comparing ChatGPT, Claude, and Gemini on various data science and analytics tasks.
  • My career stories: With seven years in the industry, I have lots of career stories and reflections. In fact, the article that brought me the most claps and new followers is 330 Weeks of Data Visualizations: My Journey and Key Takeaways.

How writing on Medium has helped me

Writing on Medium of course helped me engage more with the data science community and earn some extra money. But it brought me many more benefits, including:

  • It makes me more confident in expressing my opinions. I have been following Towards Data Science for many years as a reader and have always seen it as a publication for top-notch data science articles. Now, as an author who publishes here regularly, I feel much more confident in my data skills and storytelling abilities. And every clap and comment is a wonderful form of recognition.
  • It enhances my knowledge and skills. The process of writing an article is like re-learning something or re-experiencing a journey. It requires lots of fact-checking and reflection. Therefore, every article I write reinforces my understanding of the topic.
  • It helps me keep the habit of reading and writing. Working in a second language isn’t easy (my native language is Mandarin Chinese), and regular reading and writing are the keys to constantly improving my English communication. Now that I am writing on Medium, I also tend to read others’ articles more for inspiration. This has created a positive cycle of reading and writing.

Mapping My Journey with Data

As a data scientist, I like collecting and analyzing data to improve decision-making. This also applies to blogging. Let’s start with some key metrics of my Medium journey (as of 11/3):

  • Stories posted: 18
  • Total reads: 54k
  • Total claps: 6,926 (~385 per article)
  • Total followers: 3,210
  • Total earning: $2,140

These are just the top-line metrics. To dig deeper, I prepared a dataset with daily stats on views, reads, claps, follows, and earnings for every article by following this guide. Here is what I discovered from the exploratory Data Analysis.

Key Data Insights

1. 80% of article views happen in the first 7 days.

As shown in the charts below, on average, 50% of the views come within the first 3 days, and 80% within the first 7 days. After 2 weeks, daily views usually drop below 50. This is likely because 1. publications like Towards Data Science usually share new articles on social media within the first few days after publishing, and 2. Medium prioritizes newer articles when distributing them through its recommendation system.

This means you can already tell if your article is a hit in 3 days.

Daily views visualization, data and image by the author
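
If you want to reproduce this kind of decay curve on your own stats, here is a rough pandas sketch. The file name and column names (article_id, days_since_publication, views) are hypothetical stand-ins for whatever your own export looks like.

import pandas as pd
stats = pd.read_csv("medium_daily_stats.csv")  # hypothetical columns: article_id, days_since_publication, views
stats = stats.sort_values(["article_id", "days_since_publication"])
# Share of each article's total views accumulated by each day after publication
stats["cum_share"] = stats.groupby("article_id")["views"].cumsum() / stats.groupby("article_id")["views"].transform("sum")
print(stats.groupby("days_since_publication")["cum_share"].mean().loc[[3, 7]])  # roughly 0.5 by day 3 and 0.8 by day 7 in my data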

2. Medium members are 3x more likely to read an article than non-members.

Medium defines views as people who visited your story’s page and reads as people who read your story for at least 30 seconds. Therefore, the read ratio = # reads / # views tells how engaging your article is to the audience that visits it.

An interesting pattern I noticed is that the Medium members have a read ratio of around 60%, while it is closer to 20% for non-members. This shows the motivation to read more when you are paying the subscription fee 🙂 Meanwhile, it might also be driven by the fact that non-members will hit the paywall if they have already exceeded the preview limit for the month (if those views are not excluded from the Medium stats, which I could not verify).

Member vs. non-member read ratio, data and image by the author

3. Article earnings follow the 80/20 rule.

80% of my earnings come from just 3 articles, which is a perfect example of the 80/20 law. In fact, my best-performing article alone has brought me nearly $1,000 now. On the other hand, as you can see in the histogram below, many articles earn less than $10.

My three best-performing articles also happen to be the three that are boosted by Medium. "Boost" is a program where Medium hand-picks high-quality stories and weights those stories for extra distribution via the recommendation algorithm. According to Medium, "95% of Boosted stories get at least 500 extra views within two weeks". You can read more about this program here.

Article earning histogram, data and image by the author

4. Member reads and whether boosted or not are key to earnings.

So what factors determine the earnings? Medium has never revealed its formula but shared some key factors in its help center article. And here is my take by analyzing my (small sample of) earnings data. Two major factors that influence earnings the most are:

  1. Whether your article is boosted or not. In the help article, Medium says there is "a multiplier of engagement points when the story is Boosted." As you can see in my chart below, earnings from the boosted articles are clear outliers compared to the non-boosted ones.
  2. Number of member reads. It is not surprising that the more reads you get from the Medium members, the higher your earnings will be. When I separated boosted vs. not boosted articles, I found a strong positive correlation between member reads and earnings. And please note that this is member reads – unfortunately reads from non-members don’t matter according to the help article.

Correlation between member reads, boosts and earnings, data and image by the author

Here are the fitted regression formulas:

  1. Boosted articles: Earnings = 0.28 * member reads – 43

    • R-squared = 0.998
    • P-value = 0.029
    • But please note that I only have 3 data points haha!
  2. Not-boosted articles: Earnings = 0.027 * member reads + 2.1

    • R-squared = 0.965
    • P-value < 0.001
    • Sample size = 15

The slope for boosted articles is 10x that of non-boosted ones. In other words, when your article is boosted, you earn 10x 💰 .
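
For anyone who wants to run a similar fit on their own stats, here is a minimal sketch using scipy’s linregress. The file and column names are hypothetical, and the coefficients above come from my data, not from this snippet.

import pandas as pd
from scipy import stats
articles = pd.read_csv("medium_article_stats.csv")  # hypothetical columns: member_reads, earnings, boosted
for boosted, group in articles.groupby("boosted"):
    # One simple linear fit per group (boosted vs. not boosted)
    fit = stats.linregress(group["member_reads"], group["earnings"])
    print(f"boosted={boosted}: earnings ≈ {fit.slope:.3f} * member_reads + {fit.intercept:.1f}, "
          f"R-squared={fit.rvalue ** 2:.3f}, p-value={fit.pvalue:.3g}, n={len(group)}")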

Medium says reading time and engagement like claps, highlights, and responses also impact earnings. However, my articles are mostly between 7 to 10 minutes long, so the reading time probably doesn’t vary too much (and the data is not available to me). As for the engagement metrics, they all appear to be highly correlated with member reads. Therefore, just using member reads itself already has a strong predictive power in my case.

Eventually, when I get a significantly larger dataset one day, I plan to run a more rigorous regression analysis with all the metrics I have access to. But please let me know if my findings match your Medium article stats 🙂


Data-driven Medium Writing Strategy

What can we learn from the analysis above? Here are my data-driven recommendations on Medium Writing:

  1. Write regularly to build your audience: Earning is highly correlated with member reads. What is the best way to increase member reads? To build your audience. Every new article has a chance to attract more followers, and if your articles show up on someone’s homepage often, they have a higher chance to follow you and read your future articles.
  2. Quality over quantity: I’ve seen people recommending posting articles every day. But that is not what I mean by writing regularly. I believe fully polishing an article on a topic that you are really into is the way to engage your audience and improve the read ratio. It also increases your chance of getting "boosted". (And honestly, I am not the type of creative person who can come up with new writing ideas every day…)
  3. Submit to publications. Publications like Towards Data Science have established their subscriber bases and distribute accepted articles across various channels like emails, LinkedIn posts, Twitter (I mean… X), etc. This means your article will reach a much wider audience than you just letting Medium do their recommendation algorithm magic or sharing it on your social media. This is particularly important for new writers. Additionally, only publication editors can nominate your article for a "Boost" (read more here). So this also gives you a higher chance to earn more money.
  4. Optimize your title and opening. A ‘read’ counts when someone reads your story for at least 30 seconds. What can people see in 30 seconds? That’s probably only enough time to read the title and subtitle and skim through the first paragraph. Therefore, you should try to optimize the first impression to grab the reader’s interest. This is the same reason why companies do SEO and marketing email optimization. I would like to A/B test my titles if I could, but unfortunately, that is not doable on Medium. So I am also learning by trial and error now.
  5. Create content that is ‘you’. Among my past articles, the ones that perform best are always the ones with more personal touches. Even for technical topics like SQL optimization, I included my personal experiences and examples. Essentially, your content shouldn’t be something that ChatGPT is able to create by itself.

I hope this article gives you more insights into writing on Medium (especially in the data science domain) and inspires you to embark on a similar journey.

If you have enjoyed this article, please follow me and check out my other articles on data science, analytics, and AI. 🙂

Four Takeaways from a 5-Year Journey
https://towardsdatascience.com/technology-graduation-speech-takeaways-69adf310ef6a/ | Fri, 04 Oct 2024
Navigating the post-graduation learnings

With September, a month of new beginnings (often more so than January for some people), having just ended, I think it is the perfect time to share a recent experience I had for the first time in my professional career: giving a speech at a master’s graduation ceremony!

When I was offered the opportunity to give the graduation speech at the ceremony for the same master’s program I completed almost five years ago, I immediately accepted. But then panic set in — what was I supposed to say? Did I have any good advice to offer the new graduates?

I procrastinated nearly until the last week but eventually realized that after five years of working in both industry and academia, I did have four key things I had learned along the way and would have appreciated knowing right after graduation.

It was a good experience after all, and now I would like to share my four takeaways with you!

1. Graduation is the Perfect Time to Explore

Self-made slide prepared for the graduation speech.

I remember graduating with the feeling that I needed to have everything figured out, like where I should work or which professional path to take. But over time, I realized that the period after graduation is actually the best time to explore and try new things.

In the past five years, I have had the opportunity to experience different roles. I started in a private company, then switched to a research assistant position at a University. I also had the chance to work at one of my dream places, the European Organization for Nuclear Research (CERN) in Geneva, Switzerland. This experience was twofold: I fulfilled a childhood dream by working there for almost four years, and I also got to experience the challenge of leaving my hometown to start a new life abroad.

I think every tech person goes through a phase where they believe they can start their own business with a friend. I went through this too! For the past two years, I have been collaborating with one of my best friends to create online content in Data Science and Artificial Intelligence.

So no, I did not have everything figured out after graduation, but what a journey it has been!

2. Learning Does Not End at Graduation

Self-made slide prepared for the graduation speech.

After Graduation, I also had the impression that I had studied a lot and passed many exams, but when it came to real-world experience, I felt like I didn’t know anything. However, I quickly realized that employers are already aware of this. When hiring new graduates, they are looking for energy, enthusiasm, and a willingness to learn. If you bring those qualities, they will train you in whatever their core tasks are.

Another valuable learning opportunity I recently discovered is Summer Schools. These programs offer a chance to step away from your main responsibilities and emails and spend a week learning about various topics from renowned experts in the field.

They are usually held in amazing locations, and you stay there for the duration of the program. Some are sponsored and free, while others may require a fee. I highly recommend the ACM Summer School and the CERN School of Computing.

3. Don’t Hesitate: People Are Always Willing to Help

Self-made slide prepared for the graduation speech.

I am the type of person who tends to overthink before asking for help.

I often worry that I might be bothering people, especially when asking for something like a letter of recommendation. However, I’ve never actually received a negative response.

Over time, I have realized that people won’t waste their time if they don’t want to, but they are usually more than willing to help or provide advice if the request is reasonable. I’ve found that professionals in Computer Science, Data Science, and Artificial Intelligence are particularly responsive.

From personal experience, this summer I began supervising students. During the internship, it was always a pleasure to offer my opinions or explain my past decisions when they asked for advice. After the internship, when the time came for recommendation letters, it was equally rewarding to help good students take the next steps in their careers, although it required a time investment.

4. Pursue your Passions

Self-made slide prepared for the graduation speech.

The outside world after graduation is tough, but one should never give up on pursuing their passion. I remember being a 15-year-old, reading Breakfast with Particles and dreaming of becoming a physicist at CERN.

Well, things didn’t turn out exactly as I had imagined. I grew tired of Physics and ended up working as a Computer Scientist, though still at CERN. I have come to realize that we also learn to scale our dreams. As a teenager, I dreamed of discovering a new particle. Now, contributing to a small component in a large system feels like a significant professional achievement.

The moral? I’ve changed, and so have my interests, but I never gave up on pursuing my passions. That’s something I deeply value about my career path and would strongly recommend to any new graduate.

Final Thoughts

In the end, standing in front of the new graduates, I realized that I had quite a few things to share from this five-year journey. It may not have been useful to everyone sitting there, but I hope it resonated with some of them.

If you’ve recently graduated, or are about to, keep this in mind: Graduation is not the finish line, but the beginning of an unpredictable, exciting journey. You don’t need to have everything figured out right away, and that’s okay. Keep trying new things, keep learning, and don’t hesitate to ask for help along the way. Most importantly, always pursue what excites you. If you do that, like me, you may find that even if the path looks different from what you imagined, it can still lead you to something truly fulfilling.

🧐 And you? What are your main takeaways after graduation?


Many thanks for reading!

This is the first personal article I am sharing.

I normally write content about Artificial Intelligence, especially Large Language Models. Feel free to take a look at my content and subscribe to my Newsletter if you like it!

DALL-E 3: a step towards content policy moderation

Why LLMs are not Good for Coding

Creating Project Environments in Python with VSCode
https://towardsdatascience.com/creating-project-environments-in-python-with-vscode-b95b530cd627/ | Fri, 13 Sep 2024
Learn how to manage different environments for your Python projects


Introduction

Creating Data Science projects can be pretty straightforward. With the numerous resources available these days, it’s just a matter of choosing a development tool and kicking off your project.

Documentation is also easily available, in addition to several AI bots that will help you work through almost anything you want to create.

However, as projects become more complex and professional, there is a need to start isolating them from one another. Sometimes, modules that work well together in project A may fail to run together in project B. Or a method with the same name in two different packages can generate confusion. I mean, a lot can happen in non-isolated environments.

That is when we find the need to start isolating development environments. So, in this post, the idea is to show you a quick and easy way to create an isolated environment using Python and VS Code.

Let’s get to work.

Project Environments

As already mentioned, a development environment is an isolated "box" created inside your computer to install only the modules to be used for that project.

A development environment is an isolated "box" created inside your computer for better package control.

Imagine we will create a classification project that needs Pandas, Scikit Learn, and Streamlit. In this case, we can install only those modules and their dependencies, without the need to add many other packages that will never be used. It can then be separated from another project that won’t be using Streamlit, for example.

Now let’s move on and start coding a little.

Using Pip

The easiest way to create an environment is using Python’s native tools. To do that, just start a VS Code session and open a new PowerShell terminal (Terminal > New Terminal).

Next, you can create a new folder for the project.

mkdir name_your_project

Then, change folders to access the recently created directory.

cd name_your_project

At this point, you can use VS Code to open the newly created folder, if you’d like. Just remember to re-open the Terminal from the new window.

Within the new folder, it is time to create the new environment. Use the following command. I will create a virtual environment using the standard name .venv .

python -m venv .venv

There you go. Now, to activate it, you can use this command in PowerShell.

.venv\Scripts\Activate.ps1
This is the virtual environment activated. Image by the author.

Now, whatever you install while this environment is activated will be isolated under this project and won’t affect your other projects. Let’s install Pandas and Streamlit, but not Scipy.

pip install pandas
pip install streamlit

Both packages are installed. If I run a quick script to check, here is the result.

import pandas as pd
import streamlit as st
print(pd.__version__)
print(st.__version__)

[OUT]:
2.2.2
1.38.0

If I check for Scipy with import scipy:

import scipy
[OUT]: ModuleNotFoundError: No module named 'scipy'

If we create another environment now and install only Scipy, look what happens.

python -m venv env2
env2\Scripts\Activate.ps1
pip install scipy

import pandas as pd
[OUT]: ModuleNotFoundError: No module named 'pandas'

See how Pandas is not installed in env2. Now let’s check Scipy.

import scipy.stats as scs
print(scs.norm.rvs(loc=1, scale=3))

[OUT]:
0.5100109427428302

Works like a charm.
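
One more detail worth mentioning before moving on (standard pip/venv behavior, not specific to this example): you can leave the active environment and record its packages so the same setup can be rebuilt later.

# Leave the currently active virtual environment
deactivate
# Record the installed packages so the environment can be rebuilt later
pip freeze > requirements.txt
# Inside a fresh environment, restore everything with:
pip install -r requirements.txt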

Using PyEnv and Poetry

Now let’s see another way to do the same thing with PyEnv and Poetry, two Python packages suited for this purpose. This is even easier than using Python’s native tools.

Using PyEnv is a good idea because it can manage different versions of Python on the same machine. When working with many projects, a recurring problem is that a given Python version is not compatible with a package you are working with (or want to work with). In that case, you will need to install an older or newer version of Python. PyEnv solves that.

Installing the package is a little tricky, but by following this tutorial you can do it in no time on Windows.

Now, to install Poetry, you must first install pipx. Follow these steps here, and then use the command pipx install poetry to finish installing Poetry.

You might also need to run the next command so that Poetry creates and manages the virtual environment inside the project folder.

poetry config virtualenvs.in-project true

Once the initial installation steps are done, creating a new project with Poetry is as easy as running this in your PowerShell terminal in VS Code:

poetry new test_proj  

Created package test_proj in test_proj

With that command, Poetry creates the project skeleton: the package folder, a tests folder, and a nice pyproject.toml file with all the project specifications. It is amazing. Take a look at the toml file.

pyproject.toml file. Image by the author.

When I run poetry shell, Poetry creates the .venv folder and activates it.

Now, to add new packages to the project, you can use:

poetry add pandas

This gets added to the toml file:
[tool.poetry.dependencies]
python = "^3.12"
pandas = "^2.2.2"

Or to remove them, use:

poetry remove pandas
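
A nice side effect of the pyproject.toml file (and the poetry.lock file that Poetry maintains alongside it) is that anyone cloning the project can rebuild the same environment with a single standard command:

# Recreate the virtual environment and install the declared dependencies
poetry install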

To use a different version of Python for this project, we can ask PyEnv to use 3.11.5, for example.

pyenv local 3.11.5
Python Version changed. Image by the author.

Once done, you can type exit in the shell to deactivate your environment.

Before You Go

With that, we finish our little tour of environment management in Python using VS Code, Pip, and Poetry.

This knowledge is useful for isolating the effects of our projects in a controlled "box", mitigating dependency problems and the famous "it worked on my machine".

I believe that the toml file generated by Poetry is also very useful and gives you a summary of what’s in the project. Additionally, Poetry does not list every transitive dependency there; it records only the packages you actually requested to install, like Pandas or Scipy, instead of showing numpy and other dependencies.

References

How to install pyenv on windows [2023]

Introduction

GitHub – pyenv/pyenv: Simple Python version management

Managing Multiple Python Versions With pyenv – Real Python

Peer Review Demystified: What, Why, and How
https://towardsdatascience.com/demystifying-peer-review-what-why-and-how-9bf27276fc63/ | Wed, 04 Sep 2024
Learnings as an AI & Robotics Associate Editor with 100 Peer Reviews

I will share what I have learnt about the academic peer review process through a personal journey from a hesitant reviewer to an Associate Editor for the IEEE Robotics and Automation Letters (Impact Factor 4.6).

While most traditional science and engineering publications require prior publication experience and academic credentials to serve as reviewers, machine learning and data science might be an exception. A significant driver of the widespread adoption and use of data science has been open-source projects and repositories. Many influential contributors to open-source data science are not published researchers but possess deep knowledge of the field through practice and experimentation. Additionally, formal academic degrees in machine learning have only existed for a few years, and many current researchers come from diverse backgrounds. I, for example, have a background in Mechanical Engineering.

With the above in mind, I hope that if you are a machine learning practitioner who is curious about the review process and wants to get involved, this article should provide some value.

Table of Contents

  • My Story
  • What is Peer Review?
    • Shouldn’t Editorial Board Members be the Experts?
  • Peer Review Process
  • Why You Should Consider Peer Reviewing
  • How Can You Get Involved?
    • Tracking Peer Reviews using Web of Science
    • Do I need to be a published researcher?
  • How Much Time Does it Take?
  • Conclusion
  • Cold Email Template
  • Disclaimer

My Story

In August 2024, I reached 100 verified peer reviews for 9 different academic journals and conferences. Although I performed my first review in 2016, it was not until mid-2022 that I truly started enjoying the process.

Peer Review metrics from my Web of Science profile. Image by author.

As a graduate student (2015–2020), I never really enjoyed reviewing papers. Instead, I mostly did it as an academic obligation when my advisor asked me to. Furthermore, I lacked confidence in my ability to critique others’ work, given that I only had a few publications under my belt.

After graduating, I found it challenging to stay up-to-date with new research. As a student, reading papers was part of the job. In industry, however, I only read the most popular papers. To stay current with the latest research, fulfill my academic responsibilities, and build a stronger research profile, I began emailing editors of various journals to express my interest in becoming a reviewer. Although I received responses from almost all the journals, only 2–3 assigned me papers initially. Over time, I started receiving review requests from journals I hadn’t contacted as well.

In late 2023, I applied to IEEE RA-L for an Associate Editor role and was eventually selected to serve in the human-robot-interaction track.

In the rest of this article, I will explain:

  • the importance of peer reviewing and what the process entails,
  • why you should consider reviewing for academic publications and how you can get started
  • time commitment and other factors to consider

Finally, I will also share a cold email template that you can use to reach out to editors.

Although there is some controversy over the efficacy of the peer review process, I do not consider myself well-versed enough to comment on that aspect. Instead, I will focus on sharing my experiences and learnings.


What is Peer Review?

Peer review is a crucial tool that, ideally, ensures high-quality scientific work. It is a process used to evaluate the quality, validity, and relevance of research or scholarly work before it is published or accepted for presentation. This evaluation is performed by expert peers in the relevant domain. The peer review process helps ensure that published research is of high quality and contributes meaningfully to the field, maintaining academic standards and credibility.

Publications rely on a network of volunteer peer reviewers for the above. This is primarily due to two reasons:

  1. Submission Volume: Academic publications may receive thousands of potential manuscripts each year. For instance, the IEEE Computer Vision and Pattern Recognition Conference (CVPR) received 11,532 submissions in 2024. Even though the editorial boards of popular journals and conferences may include a few hundred members, they are far outnumbered by the submissions. Additionally, most publications have at least 2 rounds of reviews, effectively doubling the number of reviews required.
  2. Varied Domain Expertise: Although most publications have a relatively narrow scope, they still cover a vast domain of scientific knowledge within a specific field. To this end, editorial boards comprise experts from numerous subdomains, but the nature of academic research is highly specific, and it is nigh impossible for the editorial staff to have the right expertise to fairly critique every submission.

Shouldn’t Editorial Board Members be the Experts?

Yes, but the scope (or focus areas) of journals is often too broad. For example, consider the scope of the IEEE Robotics and Automation Letters (RA-L), where I serve as an Associate Editor:

publishes peer-reviewed articles that provide a timely and concise account of innovative research ideas and application results, reporting significant theoretical findings and application case studies in areas of robotics and automation.

The phrase "areas of robotics and automation" describes the wide variety of work the journal focuses on. This may include bio-inspired robotics, biomedical robotics, field robotics, human-robot interaction, humanoid robotics, and soft robotics, to name a few. In addition, the automation part may be based on machine learning, rule-based methods, or good old control theory.

Most robotics researchers specialize in a particular domain. I myself obtained my doctorate in medical robotics. Within that, I focused on physical therapy assistive robots. Within that, upper limb stroke rehabilitation. And finally, within that, I explored the use of advanced deep learning and biomechanical signals for automated assistance. So although on paper I am a so-called "expert" on medical robots, I do not have in-depth knowledge of, say, the use of deep learning for surgical robotics.

However, I do know fellow researchers and colleagues with expertise in these specialized fields. I can rely on their knowledge to provide feedback and recommendations for publications. These peers are essential to the publication process, ensuring that submissions receive informed and comprehensive evaluations.


Peer Review Process

Very briefly, the peer review process usually comprises the following steps:

  • Submission: Author submits a manuscript to a journal or conference.
  • Initial Screening: Editor checks if the submission fits the journal’s scope and standards.
  • Review Assignment: Editor sends the manuscript to experts (peer reviewers) in the field.
  • Review: Reviewers assess the manuscript’s quality, methodology, and significance, providing feedback. They may recommend acceptance, revisions, or rejection.
  • Editorial Decision: Editor decides to accept, request revisions, or reject the manuscript based on feedback from multiple reviewers.
  • Revisions: If needed, the author revises the manuscript and resubmits it for further review.
  • Re-Review: Reviewers evaluate the revised manuscript and again recommend acceptance, revisions, or rejection. Most journals only allow a binary accept-or-reject decision at this stage, although this varies.
  • Publication: Accepted manuscripts are edited and published.

Why You Should Consider Peer Reviewing

Peer reviewing can be a gratifying experience and a valuable way to contribute to the advancement of scientific research, even if you are not an active researcher. Here’s a breakdown of why peer review is important:

  1. Academic Responsibility: If you are a researcher who publishes papers, the general guideline is to maintain a 3:1 peer-review-to-publication ratio. This means that for every paper you publish, you should ideally review three papers. This ratio reflects the typical practice of assigning three reviewers to each submission.
  2. Staying Up to Date: Reviewing papers involves reading work that has not yet been published, often representing the cutting edge of your field. While you are not permitted to disclose or use results from unpublished reviews, you still gain insight into new techniques and current research trends within your area of expertise.
  3. Build Research Network and Profile: Serving as a peer reviewer highlights your expertise in a particular field and is an excellent way to expand your research network. It connects you with fellow researchers globally and provides direct access to editorial board members, enhancing your professional visibility and connections.
  4. Improve Paper Writing: Most journals allow you to review the feedback provided by other reviewers on the same submission. This exposure offers valuable insights into what fellow researchers consider strong versus weak papers, which can help you refine and enhance your own writing skills.
  5. Green Card Criteria: The following is not legal advice and only reflects my personal experience. Please consult an immigration lawyer if you need further information. This is relevant if you are an immigrant in the U.S. seeking an Employment-Based (EB) Green Card. Categories such as EB1-A, EB1-B, and EB2-NIW often require "evidence of participation, either on a panel or individually, as a judge of the work of others in the same or allied academic field" as one of the criteria to demonstrate expertise. Therefore, reviewing more papers can strengthen your application and increase your chances of meeting this criterion. In fact, I used my peer review background to satisfy this criterion for my EB1-B Green Card.


How Can You Get Involved?

Peer reviewing might seem intimidating, but it is quite manageable and resembles the code review process. Just as a pull request needs to be reviewed before it is merged, a manuscript needs to be reviewed before it can be published.

Editors are constantly seeking peer reviewers and are often very receptive if you reach out to them. A straightforward cold email can be very effective. I will provide an email template at the end of this article for you to use.

As long as your aim is to offer unbiased feedback to help authors improve their work, you are approaching the process with the right mindset. Most editors will value and respect your contributions.

Tracking Peer Reviews Using Web of Science

I would highly recommend creating a Web of Science profile. It lets you get your reviews verified and keeps them all in one place. It has a handy export feature that can serve as proof of your reviewer experience, which most organizations accept. It also provides interesting metrics, such as the average length of your reviews.

Some interesting metrics that can be generated by Web of Science. Image by author.

Do I need to be a published researcher?

Not necessarily.

While many top journals and conferences require some publication experience, others do not. If you do not have publications and are a millennial (i.e., suffer from imposter syndrome), you can start by reviewing poster or abstract submissions for smaller, local conferences to build your profile and confidence. You can then use this experience as a stepping stone to international publications and as a way to offset a thin publication history.

Keep in mind that each manuscript is typically reviewed by 2–3 reviewers at various career stages. As a newcomer, you may offer a fresh perspective compared to more experienced researchers. Editors value all feedback, and a diverse range of viewpoints is highly beneficial.

How Much Time Does it Take?

The time commitment for peer reviewing varies and is entirely up to you. I typically limit myself to 1–2 papers per month (this includes my AE assignments). In 2023, I reviewed more frequently, but I am now more selective. You can always decline invitations if needed, as peer reviewing is a voluntary activity, and editors respect your time.

The duration of each review also varies. In my experience, reviews can take anywhere from a couple of hours to several days. If a paper is closely related to my research, I can complete it in an afternoon. However, papers that are adjacent to my field or involve complex equations can take longer, especially if they require extensive verification. Personally, I avoid papers with too many equations as I do not enjoy reading them. More pictures, less math, please!

Sometimes I do receive poor-quality papers that feel like a waste of time, but these usually take the least amount of time to review anyway.

In summary, the time commitment varies, but you can choose the number and type of papers to review based on your preferences and availability.


Conclusion

The peer review process is a crucial component of academic publishing that ensures the quality and integrity of scientific research. The process, while challenging, offers significant benefits, including staying abreast of cutting-edge research, enhancing one’s academic profile, and contributing meaningfully to the scholarly community.

Ultimately, peer reviewing is not only a responsibility but also an opportunity for personal growth and professional development. It provides a platform for researchers to influence the advancement of their field, build valuable networks, and improve their own research skills. While the peer review system is not without its criticisms, it remains a vital component of the research ecosystem, fostering academic rigor and innovation.


Cold Email Template

As promised, here is the cold email template I have used in the past.

SUB: Request to serve as peer reviewer for [Publication Name]

Dear [Editor’s Name],

I hope this message finds you well.

I am writing to express my interest in serving as a peer reviewer for [Publication Name]. Currently, I am a [Your Role] at [Your Organization], with a [Bachelor’s/Master’s/Doctorate] in [Field] from [University]. My areas of expertise include [Expertise 1, Expertise 2, Expertise 3], and I have demonstrated proficiency through [briefly mention any relevant experience or achievements].

I have published articles in [Publication 1, Publication 2, Publication 3] and contributed to open-source projects such as [Project 1, Project 2] and [Blog 1, Blog 2]. I also serve as a reviewer for [Publication 1, Publication 2, Publication 3].

You can find more information about my work on my [Google Scholar and/or GitHub] profile, and I have attached my resume for your reference.

I am confident that my background and expertise make me a suitable candidate for reviewing submissions, particularly in areas related to [List Areas of Interest]. I would be honored to contribute to [Publication Name] and support the advancement of research in these fields.

Thank you for considering my application. I look forward to your response.

Best regards,

[Your Full Name] [Your Contact Information]

Disclaimer

ChatGPT was used as a proofreading tool for this article; minor edits were made based on its feedback. The content was created by the author.

LLM Agents, Text Vectorization, Advanced SQL, and Other Must-Reads by Our Newest Authors https://towardsdatascience.com/llm-agents-text-vectorization-advanced-sql-and-other-must-reads-by-our-newest-authors-3263a1bdd7eb/ Thu, 22 Aug 2024 13:31:48 +0000 https://towardsdatascience.com/llm-agents-text-vectorization-advanced-sql-and-other-must-reads-by-our-newest-authors-3263a1bdd7eb/ Our weekly selection of must-read Editors' Picks and original features

If you’re a regular reader of the Variable, you might have noticed that we stress—every week—that TDS is always open to contributions from new authors. And we mean it! Some of you might have seen this message and thought something along the lines of "great, I’d love to write an article!" but then wondered what kinds of posts would be a good fit, what topics our readers are interested in, and what types of experiences and skill sets are welcome.

This week’s Variable edition highlights some of our best recent articles, so if you have no desire to become a TDS author, that’s totally fine! We hope you enjoy your reading as always. We’ve focused exclusively on posts by our most recent cohort of authors, however, in the hope that their work inspires you to give this a try, too.

As you’ll see, TDS contributors come to us with a wide range of experience levels (from early learners to PhDs and industry veterans), interests, and writing styles. What unites them are strong storytelling skills and a desire to share their knowledge with a broader community. We hope (and are fairly certain) you’ll enjoy our weekly lineup.


  • What Do Large Language Models "Understand"? "When we attribute human-like abilities to LLMs, we fall into an anthropomorphic bias by likening their capabilities to our own. But are we also showing an anthropocentric bias by failing to recognize the capabilities that LLMs consistently demonstrate?" In one of the most thought-provoking articles we’ve read recently, Tarik Dzekman tackles the question of LLMs’ capacity to understand language, looking at the topic through a philosophy- and psychology-informed lens.

  • Integrating LLM Agents with LangChain into VICA "Our goal is to say goodbye to the robotic and awkward form-like experience within a chatbot, and say hello to personalized conversations with human-like assistance." Ng Wei Cheng and Nicole Ren share practical insights and lessons learned from their extensive work on Singapore’s GovTech Virtual Intelligent Chat Assistant (VICA) platform.
  • Text Vectorization Demystified: Transforming Language into Data "For those of us who are aware of the Machine Learning pipeline in general, we understand that feature engineering is a very crucial step in generating good results from the model. The same concept applies in NLP as well." Lakshmi Narayanan offers a thorough overview of text-vectorization approaches and weighs their respective advantages and limitations.

Photo by Totte Annerbrink on Unsplash
  • Leveraging Gemini-1.5-Pro-Latest for Smarter Eating "It is worth noting here that with advancements in the world of AI, it is incumbent on data scientists to gradually shift from traditional deep learning to generative AI techniques in order to revolutionize their role." Mary Ara presents an end-to-end project walkthrough that demonstrates how to do precisely that—in this case, through the creation of a calorie-tracking app that leverages a cutting-edge multimodal model.

  • The Most Useful Advanced SQL Techniques to Succeed in the Tech Industry "Although mastering basic and intermediate SQL is relatively easy, achieving mastery of this tool and wielding it adeptly in diverse scenarios is sometimes challenging." Jiayan Yin aims to help data analysts and other practitioners bridge that skill gap with a comprehensive overview of the more advanced SQL techniques you should add to your querying toolkit.
  • Fine-Tune the Audio Spectrogram Transformer with Hugging Face Transformers "This process adapts the model’s capabilities to the unique characteristics of our dataset, such as classes and data distribution, ensuring the relevance of the results." Writing at the intersection of machine learning and audio data, Marius Steger outlines a detailed workflow for fine-tuning the Audio Spectrogram Transformer (AST) on any audio-classification dataset.
  • Algorithm-Agnostic Model Building with MLflow "Consider this scenario: we have an sklearn model currently deployed in production for a particular use case. Later on, we find that a deep learning model performs even better. If the sklearn model was deployed in its native format, transitioning to the deep learning model could be a hassle because the two model artifacts are very different." Mena Wang, PhD explains why it can sometimes make a lot of sense to work with algorithm-agnostic models—and shows how to get started in MLflow.
  • A Fresh Look at Nonlinearity in Deep Learning "But why do we need activation functions in the first place, specifically nonlinear activation functions? There’s a traditional reasoning, and also a new way to look at it." Harys Dalvi unpacks the stakes of using a linear layer for the output of deep learning classifiers and the value we can gain by interpreting the consequences of linearity and nonlinearity in multiple ways.


Thank you for supporting the work of our authors! As we mentioned above, we love publishing articles from new authors, so if you’ve recently written an interesting project walkthrough, tutorial, or theoretical reflection on any of our core topics, don’t hesitate to share it with us.

Until the next Variable,

TDS Team

How To Start Technical Writing & Blogging https://towardsdatascience.com/how-to-start-technical-writing-blogging-3c685c27bc53/ Sun, 21 Jul 2024 19:04:16 +0000 https://towardsdatascience.com/how-to-start-technical-writing-blogging-3c685c27bc53/ Why writing data science blogs changed my career.


Starting a Data Science blog on Medium was one of the best decisions I have made in my life. It up-skilled me in my career, opened many opportunities, and even made me some money.

Technical writing and blogging are not just valuable skills; they’re powerful tools that can elevate your career, especially for data scientists who often need to communicate complex ideas to various people with different technical skills.

That’s why in this article, I will explain what technical writing is, why you should do it, and how you can start your tech blog today!

Note: You can also watch the video version of this post on my YouTube channel:

What Is Technical Writing/Blogging?

Put simply, technical writing is the process of writing tutorials or explainers on technical subjects like maths or coding. You basically educate and teach audiences complex and challenging topics in a digestible way.

For example, you can write about how to use Pandas in Python, how a machine-learning algorithm works, or what ChatGPT is doing under the hood. The detail and complexity you go into are up to you, but the goal is to explain the topic as clearly as possible to a specified target audience.

Technical writing and blogging are very similar but slightly different if we are being pedantic. The former is more of a formal full-time profession, whereas the latter is a more hobbyist, relaxed approach. However, at their heart, they are both trying to do the same thing of explaining technical topics.

That’s also not to say that you can’t earn a full-time income from tech blogs or online writing; many writers earn a living wage from their posts.

One of the best examples is The PyCoach, who consistently made over $5,000 per month in 2022. Obviously, you shouldn’t expect this, and he is clearly an anomaly, but it does show you that writing articles can be very lucrative if that’s what you are after.

Why Have A Blog?

What’s the point of a blog? Well, there are so many reasons, both altruistic and selfish if we are being completely honest.

From the purely altruistic perspective:

  • You likely have specialist knowledge in a particular tech area that others may benefit from learning from you.
  • You can help people who want to get a job in your field by advising them on the best way to do so and on the resources they may need.

On the more selfish side:

  • Writing articles shows your interest and abilities in a field and will make you stand out to potential employers and recruiters.
  • It helps you learn new skills and topics that can advance your career. By writing posts, you are basically using the Feynman technique as you improve your understanding by teaching others.
  • You can earn some money on the side to supplement your full-time income and maybe even exceed it in some cases.

As JP Morgan once said:

A man always has two reasons for doing anything: a good reason and the real reason.

But, no matter your reason, it is clear that starting a blog carries very low risk for a potentially very high reward, and the pros clearly outweigh the cons.

If you are still on the fence, I encourage you to read "Show Your Work!" by Austin Kleon. After reading this book, I promise you will run straight to your computer and start writing something!

Show Your Work! a book by Austin Kleon

How To Start?

Choose Your Niche

When I started my blog, I was just about to start my first job as a data scientist. Naturally, I began writing articles on data, maths, statistics, and AI, all areas relevant to my role and things I wanted to learn more about.

I was far from an expert and I hadn’t even started the job when I wrote my first article! However, I took the "learn in public" approach to documenting my learning as I progressed in my career.

Recently, I pivoted to writing more career advice articles, but that’s only after working in the field for nearly three years.

You can literally write about anything, but I recommend choosing a niche that interests you or that you have experience in. Then, decide whether you want to give advice or document your journey. Of course, you can do both, depending on your experience and what stage you are at in your career.

You can find my first article below. It’s not my best work, but we all start from somewhere!

One Hot Encoding Simply Explained

Platform

After you have decided on your niche, it’s time to choose a platform for your writing.

One of the most successful online writers, Nicolas Cole, said in one of his posts:

Your blog has no distribution flywheel.

A blog is just a website. So the big question is, how are people going to find out that your website exists?

He then goes on to say:

Which is why you are so much better off starting your digital writing journey by writing on social platforms.

  • Twitter
  • Quora
  • Medium
  • LinkedIn

Basically, anywhere readers already are.

Why You Shouldn’t Start A Blog (And Where You Should Write Instead)

To expand on this, when starting your tech blog, writing on your personal website is not a good idea if you have no existing audience. The main reason is that it is tough to gain traction, as your site would need to outrank other, more established websites in search results.

Writing on platforms like Medium or Quora will likely generate more traffic to your blog, as they already have a large audience and their URLs rank highly in search engines.

So, what platforms should you choose?

If you don’t want to write complete articles just yet, X/Twitter and LinkedIn are your best bets. They have a massive tech community with an established audience.

There are a couple of options for full blog posts, but the main two for tech articles are Medium and Hashnode. Hashnode is more on the tech side and a bit more developer-friendly. Still, in my opinion, Medium is much better, as it has lower friction, much better distribution channels, and one of the biggest tech communities in Towards Data Science.

I have been writing on Medium for nearly three years now, and it is truly amazing. I really recommend starting here.

Plan Your Articles

After we have chosen our niche and platform, it’s time to plan our posts!

In general, for tech blogging, I think there are three main types of posts you can create, based on my experience:

  • Career advice – As the name says, give advice on how to break into the industry. These include how to become a software engineer, tech interview advice, the best degree to get a job as a data scientist, etc.
  • Technical breakdowns – Create tutorials of things you already know how to do. These could include how to deploy a website, use git, find the best cloud platform provider, etc.
  • Document learning – Choose something you want to learn, study it, and then write about it to solidify your understanding. This is an example of the Feynman technique, where you build your understanding through teaching.

Honestly, it doesn’t matter which one you choose; I have done all of them in the past, and they can all make for compelling reads.

Career advice and technical breakdowns are more straightforward as you already know what you are writing about. To document your learning, you must learn new things simultaneously, so the process takes longer. However, it comes with the added benefit of improving your existing knowledge.

Repeat & Refine

Regardless of your path, you will quickly realise that tech blogging is an example of an infinite game; there is no end destination. You keep playing forever because it’s fun and you enjoy it.

I plan on doing a separate article on how to grow your blog if that interests you. However, the whole growth process generally boils down to consistency and slowly improving with each article. I know that’s obvious, but it is the truth; there is no secret.

Growth should not necessarily be your sole goal, though. For my blog, I am not trying to make it "blow up" but to use it as a tool to learn data science and gain a deeper understanding. I feel this is far more valuable in the long run than trying to go "viral."

Summary & Further Thoughts

Starting this data science blog was indeed one of the best decisions I have ever made. I became a better data scientist, networked with amazing people, and earned some money. Starting a blog is pretty straightforward, and you can literally write about anything from career advice to documenting your learning journey. I recommend starting here on Medium as it’s a terrific platform with a large tech community you can access right from the beginning.

Another Thing!

I have a free newsletter, Dishing the Data, where I share weekly tips and advice as a practising data scientist. Plus, when you subscribe, you will get my FREE data science resume!

Dishing The Data | Egor Howell | Substack

Connect With Me!

The Most Undervalued Skill for Data Scientists https://towardsdatascience.com/the-most-undervalued-skill-for-data-scientists-e0e0d7709321/ Wed, 03 Jul 2024 19:19:00 +0000 https://towardsdatascience.com/the-most-undervalued-skill-for-data-scientists-e0e0d7709321/ Why writing is crucial for technical roles, and how to get good at it

Image by author (Midjourney)

"Why is my manager nitpicking my write-up? What difference does it make changing the wording from X to Y?"

You have probably caught yourself thinking this when you see your manager’s numerous suggestions all over your document; I know I have. In fact, I used to think that writing was the most trivial part of a data scientist’s job, because the analyses and numbers should speak for themselves, right? Wrong!

Over the last few years, I have realized that writing is an essential skill for data scientists, and that the ability to write well is one of the key things that sets high-impact data scientists apart from their peers.

In this article, I will first convince you that writing is at least as important as your technical skills, and then give you concrete tips to help you improve your writing.

Why is writing so important for data scientists?

1. It’s used everywhere in the corporate world – I have highlighted the importance of communication in my previous articles, and, like it or not, the majority of communication in the corporate world happens in written form. From project-scoping documents to weekly updates, analysis and experiment write-ups, feedback and performance reviews, JIRA tickets and wiki pages, everything relies on effective written communication to get the message across.

2. Writing helps to bring clarity to your thinking process – Paul Graham, cofounder of the startup accelerator Y Combinator (and a computer scientist AND writer), famously said in one of his memos:

If writing down your ideas always makes them more precise and more complete, then no one who hasn’t written about a topic has fully formed ideas about it. And someone who never writes has no fully formed ideas about anything nontrivial.

— Paul Graham

Very often, when you start writing things down, you realize how little you know about a subject and the potential gaps in your thinking/analysis.

3. Writing is the "last mile" of your data science work. None of your stakeholders will read your SQL query or look at your Jupyter Notebook (a lot of engineers and data scientists would like to believe the opposite, but trust me, they likely won’t). If you want your work to be understood by others and influence decisions, then you need to do the final step of packaging it in an effective write-up. If you skip this step, it’s like leaving the package in the warehouse instead of delivering it to the customer.

Image by author

What does "good" writing look like in data science?

Be clear about your audience. If you are writing for everyone, you are writing for no one. Be very specific about who this particular piece of writing is for, and tailor it to that audience and their needs.

Focus on the "so what"; the sausage-making goes in the appendix. As data scientists, we love to talk about the complex analysis we did or how we designed the experiment. Because we put in all that work, it feels wasteful NOT to talk about it. But the harsh truth is, most of the time, our audience does not care; they just want to understand the takeaways.

You can describe the technical details of your work in the appendix in case someone wants to go deep, but the main part should focus on the insights and recommendations.

Have a clear storyline. Fiction or not, every piece of (long-form) writing should be a story, because that’s how humans communicate and how our brains process information. Usually the storyline for an analysis goes like this:

⮕ We found out about something interesting and this is why you should care about it / what you should do (summary to get your readers hooked, including a recommendation if applicable)

⮕ Here’s how we arrived at these insights (analysis details for the curious explorers)

⮕ Here are the caveats and alternative paths forward (optionality in case someone challenges the recommendation)

⮕ Here are additional resources you might find interesting (appendix for those that really want to go deep on the topic)

It might help to build the skeleton first before adding the details. If the story depends on how the analysis goes (which is often the case in data science, since the work is more exploratory), at least figure out the structure of the doc before diving into the details.

If you are building a deck/presentation, I have a little

Have a clear summary. The pyramid principle I mentioned in my previous post about communication is especially important in written communication. Because the summary is your first touch point with your readers, it should be interesting enough to capture their attention so they want to read on; at the same time, it should capture the essence, so that if they decide to stop reading after the summary, they still get the most crucial information they need to know.

Be succinct. When it comes to writing, less is more.

Keep it simple. We work in a technical field and use technical jargon all the time. Often, data scientists think it makes them seem more competent if they use technical language. If you look closely, though, you will notice that the more senior people become, the simpler their choice of words. VPs and C-Level executives can explain complex topics in language that anyone can understand, regardless of their (technical) background. You can use tools like the Hemingway app to check if your writing is too complex.
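If you prefer a programmatic check, a readability score can serve a similar purpose. Below is a minimal sketch in Python, assuming the third-party textstat package (my own suggestion as an alternative, not the tool mentioned above); it estimates how hard a draft is to read.

```python
# Rough, Hemingway-style readability check for a draft paragraph.
# Assumes the third-party `textstat` package (pip install textstat).
import textstat

draft = (
    "We leveraged a heterogeneous ensemble of gradient-boosted estimators "
    "to operationalize the propensity-scoring initiative across verticals."
)

# Flesch Reading Ease: higher means easier to read (60-70 is roughly plain English).
ease = textstat.flesch_reading_ease(draft)

# Flesch-Kincaid Grade: approximate U.S. school grade needed to follow the text.
grade = textstat.flesch_kincaid_grade(draft)

print(f"Reading ease: {ease:.1f}, grade level: {grade:.1f}")
if grade > 12:
    print("Consider shorter sentences and simpler words.")
```

As a rough rule of thumb, if the grade level climbs well past the early teens, the text will probably feel dense to readers outside your immediate team.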

Use signposting. Signposting is a technique that makes it easier for the reader to understand your document. The core idea is to use words and phrases that make it immediately clear what the sentence or section is about, so that readers can quickly skim the text and make sense of it. For example:

  • Using the phrase "for example" before you give an example
  • Writing "in conclusion" before you summarize
  • Labeling sequences of arguments with "Firstly / secondly / finally"

Always have your readers in mind – what do your readers care about the most in this piece of analysis? What do they already know about the context and background, and what else is necessary for them to know?

Add visualizations. It’s a cliche for a reason: "A picture says more than a thousand words." When you are trying to communicate dense technical content, a crisp diagram, framework, or flowchart can help a lot to get your point across. For example, illustrating what the "pyramid principle" means, as in the graph below, will hopefully give you a better idea of how to carry it out in your own writing.

Image by author

How can you improve your writing?

Read a lot. This includes both guides on how to write well (by reading this post, you’ve made the first step!) and strong technical writing that you can imitate (you can find some examples here).

If you want to dig deeper into the science of writing well, I recommend you take a look at "On Writing Well" by William Zinsser.

Practice, practice, practice. As with everything else, practice makes perfect. Here are a few concrete things you can do to practice your writing:

  1. Document your work in a personal wiki. Few data scientists do this in my experience, but it’s a very useful resource to have and a great way to get more writing practice.
  2. Write structured Slack messages. Most of the Slack messages we send and receive all day feel like a stream of consciousness (or worse, like teenagers’ text messages). People tend to type what comes to their mind and hit "Send" without taking the time to structure the message in a way that makes it easy for the reader to understand it. Writing succinct, structured Slack messages using the principles discussed above is a great way to stand out.
  3. Write online. Writing these posts on Medium is ongoing writing practice for me. Try it out; you might even enjoy it and find an audience that enjoys your insights.

Challenge yourself. "You are your own worst enemy" might not be a bad thing when it comes to writing. You need to be able to read your own writing like it’s your first time seeing it so you can be objective about what’s missing, what’s confusing and what needs to be shortened.

Ask others to be your devil’s advocate. Being your own devil’s advocate can be extremely hard sometimes, because true objectivity requires you to abandon your current knowledge about the topic and your ego. It’s sometimes just easier to find another challenger for your work. Ideally this is someone who truly knows nothing about the subject matter and is willing to be very honest with you about their opinion.

What are some good examples of strong technical writing?

I described above what good writing looks like in theory but it’s easier to understand once you see a few examples of it. Here I’m providing some concrete examples for some of the points I mentioned above so you can have a better idea about how to put those suggestions into practice.

Clear audience

The Data-Driven VC newsletter is targeted specifically towards Venture Capitalists and startup founders who want to take a data-driven approach to investing in and growing companies. While this creates a niche blog that might not appeal to everyone, picking this specific target audience makes it easier to provide value for them.

Strong visualizations

For a crash-course on how to visualize complex systems and technical subject matter in general, check out ByteByteGo. Their diagrams make it super easy to understand things that would take multiple paragraphs of jargon to describe accurately.

SeattleDataGuy is also using plenty of visualizations, but typically in a slightly less serious way (e.g. see his post on Apache Iceberg here).

Keeping it simple

Gergely Orosz, who writes The Pragmatic Engineer, does a good job summarizing complex topics in relatively simple terms. E.g. check out his post on how AI Software Engineering Agents work.

Combining best practices: Simple, succinct language with clear visualizations

Daily Dose of Data Science is a prime example of how to combine multiple best practices to produce easy-to-understand but still insightful data science content.

For example, check out their recent post on Confidence Intervals and Prediction Intervals. Or their super brief, but informative post on Cross-Validation Techniques.

In conclusion

Being able to write (well) is crucial for your work, even (or, you could argue, especially) for technical folks. Being able to succinctly communicate your thoughts on paper takes practice. Reading a lot, writing a lot and being open to feedback are the keys to getting better at this craft.

For more hands-on tips and interesting topics about data science, follow me here on Medium, on LinkedIn, or on Substack.
