Mastering 1:1s as a Data Scientist: From Status Updates to Career Growth
https://towardsdatascience.com/mastering-11s-as-a-data-scientist-from-status-updates-to-career-growth/ (Tue, 04 Mar 2025)
Use your 1:1s to gain visibility, solve challenges, and advance your career

I have been a data team manager for six months, and my team has grown from three to five.

I wrote about my initial manager experiences back in November. In this article, I want to talk about something that is more essential to the relationship between a DS or DA individual contributor (IC) and their manager — the 1:1 meetings. I remember when I first started my career, I felt nervous and awkward in my 1:1s, as I didn’t know what to expect or what was useful. Now, having been on both sides during 1:1s, I understand better how to have an effective 1:1 meeting.

If you have ever struggled with how to make the best out of your 1:1s, here are my essential tips.

I. Set up a regular 1:1 cadence

First and foremost, 1:1 meetings with your manager should happen regularly. It could be weekly or biweekly, depending on the pace of your projects. For example, if you are more analytics-focused and have lots of fast-moving reporting and analysis tasks, a weekly 1:1 might be better to provide timely updates and align on project prioritization. However, if you are focusing on a long-term machine learning project that will span multiple weeks, you might feel more comfortable with a biweekly cadence — this allows you to do your research, try different approaches, and have meaningful conversations during 1:1s.

I have weekly recurring 30-minute 1:1 slots with everyone on my team, just to make sure I always have this dedicated time for them every week. These meetings sometimes end up being short 15-minute chats or even casual conversations about life after work, but I still find them super helpful for staying updated on what’s on top of everyone’s mind and building personal connections.

II. Make preparations and update your 1:1 agenda

Preparing for your 1:1 is critical. I maintain a shared 1:1 document with my manager and update it every week before our meetings. I also appreciate my direct reports preparing their 1:1 agenda beforehand. Here is why:

  • Throughout the week, I like to jot down discussion topics quickly on my 1:1 doc whenever they come to my mind. This ensures I cover all important points during the meeting and improves communication effectiveness.
  • Having an agenda helps both you and your manager keep track of what has been discussed and keeps everyone accountable. We talk to many people every day, so it is totally normal if you lose track of what you have mentioned to someone. Therefore, having such a doc reminds you of your previous conversations. Now, as a manager with a team of five, I also turn to the 1:1 docs to ensure I address all open questions and action items from the last meeting and find links to past projects.
  • It can also assist your performance review process. When writing my self-review, I read through my 1:1 doc to list my achievements. Similarly, I also use the 1:1 docs with my team to make sure I do not miss any highlights from their projects.

So, what are good topics for 1:1? See the section below.

III. Topics on your 1:1 agenda

While each manager has their preferences, there’s a wide range of topics that are generally appropriate for 1:1s. You don’t have to cover every one of them, but I hope they give you some inspiration and you no longer feel clueless about your 1:1.

  • Achievements since the last 1:1: I recommend listing the latest achievements in your 1:1 doc. You don’t have to talk about each one in detail during the meeting, but it’s good to give your manager visibility and remind them how good you are 🙂. It is also a good idea to highlight both your effort and impact. Business is usually impact-driven, and the data team is no exception. If your A/B test leads to a go/no-go decision, mention that in the meeting. If your analysis leads to a product idea, bring it up and discuss how you plan to support the development and measure the impact.
  • Ongoing and upcoming projects: One common pattern I’ve observed in my 7-year career is that data teams usually have long backlogs with numerous “urgent” requests. A 1:1 is a good time to align with your manager on shifting priorities and timelines.
    • If your project is blocked, let your manager know. While independence is always appreciated, unexpected blockers can arise at any time. It’s perfectly acceptable to work through the blockers with your manager, as they typically have more experience and are supposed to empower you to complete your projects. It is better to let your manager know ahead of time instead of letting them find out themselves later and ask you why you missed the timeline. Meanwhile, ideally, you don’t just bring up the blockers but also suggest possible solutions or ask for specific help. For example, “I am blocked on accessing X data. Should I prioritize building the data pipeline with the data engineer or push for an ad-hoc pull?” This shows you are a true problem-solver with a growth mindset.
  • Career growth: You can also use the 1:1 time to talk about career growth topics. Career growth for data scientists isn’t just about promotions. You might be more interested in growing technical expertise in a specific domain, such as experimentation, or moving from DS to different functions like MLE, or gaining Leadership experience and transitioning to a people management role, just like me. To make sure you are moving towards your career goal, you should have this conversation with your manager regularly so they can provide corresponding advice and match you with projects that align with your long-term goal.
    • I also have monthly career growth check-in sessions with my team to specifically talk about career progress. If you always find your 1:1 time being occupied by project updates, consider setting up a separate meeting like this with your manager.
  • Feedback: Feedback should go both directions.
    • Your manager likely does not have as much time to work on data projects as you do. Therefore, you might notice inefficiencies in project workflows, analysis processes, or cross-functional collaboration that they aren’t aware of. Don’t hesitate to bring these up. And similar to handling blockers, it’s recommended to think about potential solutions before going to the meeting to show your manager you are a team player who contributes to the team’s culture and success. For example, instead of saying, “We’re getting too many ad-hoc requests,” frame it as “Ad-hoc requests coming through Slack DMs reduce our focus time on planned projects. Could we invite stakeholders to our sprint planning meetings to align on priorities and have a more formal request intake process during the sprint?”
    • Meanwhile, you can also use this opportunity to ask your manager for any feedback on your performance. This helps you identify gaps, improve continuously, and ensures there are no surprises during your official performance review 🙂
  • Team and company goals: Change is the only constant in business. Data teams work closely with stakeholders, so data scientists need to understand the company’s priorities and what matters most at the moment. For example, if your company is focusing on retention, you might want to analyze drivers of higher retention and propose corresponding marketing campaign ideas to your stakeholder.

To give you a more concrete idea of the 1:1 agenda, let’s assume you work at a consumer bank and focus on the credit card rewards domain. Here is a sample agenda:

Date: 03/03/2025

✅ Last week’s accomplishments

  • Rewards A/B test analysis [link]: Shared with stakeholders, and we will launch the winning treatment A to broader users in Q1.
  • Rewards redemption analysis [link]: Most users redeem rewards for statement balance. Talking to the marketing team to run an email campaign advertising other redemption options.

🗒 Ongoing projects

  • [P0] Rewards <> churn analysis: Understand if rewards activities are correlated with churn. ETA 3/7.
  • [P1] Rewards costs dashboard: Build a dashboard tracking the costs of all rewards activities. ETA 3/12.
  • [Blocked] Travel credit usage dashboard: Waiting for DE to set up the travel booking table. Followed up on 2/27. Need escalation?
  • [Deprioritized] Retail merchant bonus rewards campaign support: This was deprioritized by the marketing team as we delayed the campaign.

🔍 Other topics

  • I would like to gain more experience in machine learning. Are there any project opportunities?
  • Any feedback on my collaboration with the stakeholder?

Please also keep in mind that you should update your 1:1 doc actively during the meeting. It should reflect what is discussed and include important notes for each bullet point. You can even add an ‘Action Items’ section at the bottom of each meeting agenda to make the next steps clear.

Final thoughts

Above are my essential tips to run effective 1:1s as a data scientist. By establishing regular meetings, preparing thoughtful agendas, and covering meaningful topics, you can transform these meetings from awkward status updates into valuable growth opportunities. Remember, your 1:1 isn’t just about updating your manager — it’s about getting the support, guidance, and visibility you need to grow in your role.

DeepSeek V3: A New Contender in AI-Powered Data Science
https://towardsdatascience.com/deepseek-v3-a-new-contender-in-ai-powered-data-science-eec8992e46f5/ (Sat, 01 Feb 2025)
How DeepSeek's budget-friendly AI model stacks up against ChatGPT, Claude, and Gemini in SQL, EDA, and machine learning

Nvidia stock price slumped over 15% on Monday, Jan 27th, after a Chinese startup, DeepSeek, released its new AI model. The model performance is on par with ChatGPT, Llama, and Claude but at a fraction of the cost. According to Wired, OpenAI spent more than USD$100m to train GPT-4. But DeepSeek’s V3 model was trained for just $5.6m. This cost efficiency is also reflected in the API costs – for every 1M tokens, the deepseek-chat model (V3) costs $0.14, and the deepseek-reasoner model (R1) costs only $0.55 (DeepSeek API Pricing). Meanwhile, gpt-4o API costs $2.50 / 1M input tokens, and o1 API costs $15.00 / 1M input tokens (OpenAI API Pricing).

Always intrigued by emerging LLMs and their application in data science, I decided to put DeepSeek to the test. My goal was to see how well its chatbot (V3) model could assist or even replace data scientists in their daily tasks. I used the same criteria from my previous article series, where I evaluated the performance of ChatGPT-4o vs. Claude 3.5 Sonnet vs. Gemini Advanced on SQL queries, Exploratory Data Analysis (EDA), and Machine Learning (ML).

Image created by DALL·E

I. First Impressions

Here are some quick observations from my initial exploration of DeepSeek’s web chatbot UI:

  • Interface: The chatbot UI looks similar to ChatGPT, with all the past chats listed on the left and the current conversation in the main panel.
  • Chat Labels: ChatGPT usually labels your past chats with a brief summary. However, by default, DeepSeek labels old chats with the first several words of your prompt. For new chats in the same web session, it simply shows ‘New chat’, which could be confusing when there are multiple. But of course, you can rename any chats for clarity.
  • Model Options: In the message input box, you can opt to use the reasoning model (R1) or enable the web search functionality.
  • Formatting: The chatbot formats keywords and code snippets neatly, making it easy to read.
DeepSeek UI (image by author)

II. SQL

1. Problem Solving (3/3)

I started by testing DeepSeek’s problem-solving skills with three challenging LeetCode SQL questions (262, 185, and 1341) that have low acceptance rates. These questions have clear descriptions with input and output table structures and are similar to interview questions.

DeepSeek aced all three questions using aggregation, filters, window functions, etc. It also offered step-by-step breakdowns and clear explanations.

DeepSeek's response to LeetCode SQL Questions (image by author)

2. Business Logic (3.5/4)

Next, I uploaded four synthetic datasets to simulate real-world scenarios where table descriptions are often incomplete and you have to infer information based on what the data looks like.

Though the total size of the four CSV files is only ~300KB, I got the error message "DeepSeek can only read 40% of all files. Try replacing the attached files with smaller excerpts." So I cut the file size down to only 30KB by truncating to the top 100 rows of each file – I no longer got the above error message, but this time it said "Oops! DeepSeek is experiencing high traffic at the moment. Please check back in a little while." Eventually, I resorted to uploading screenshots of the top rows of each dataset, which worked.

DeepSeek reads table screenshots and writes SQL queries (image by author)

The datasets include:

  1. users: User-level data with demographic information.
  2. products: Product-level data.
  3. orders: Order-level data with payment information.
  4. ordered_products: A table linking orders and products.

I asked DeepSeek to write queries for metrics like total order amount by month from US users, monthly new user counts, top 5 best-selling product categories, and monthly user retention rate. These are common metrics that you would track at an e-commerce company. It was able to generate correct queries for the first three but made an error in the retention rate question (this was also the question other AI tools struggled with the most). However, after I pointed out the issue, it was able to fix the query.

DeepSeek fixed retention rate query with hints (image by author)

3. Query Optimization (2.5/3)

Finally, I tested DeepSeek’s ability to optimize suboptimal SQL queries. I used inefficient code examples from my SQL optimization article. It improved queries by only selecting necessary columns, moving aggregation steps earlier, avoiding redundant de-duplication operations, etc.

What I like most is that it not only suggested improvements but also provided detailed explanations of everything that could be optimized, including database-specific tips.

DeepSeek's SQL query optimization performance 1 (image by author)
DeepSeek's SQL query optimization performance 2 (image by author)

I only encountered an issue with the last question, where I asked it to further optimize the query by adjusting the window function, but it generated a query that produced duplicate rows. It quickly corrected the issue after I pointed it out.

DeepSeek corrected the issue it made (image by author)

SQL Performance Summary

Overall, DeepSeek performed very well in the SQL section, providing clear explanations and suggestions for SQL queries. It only made two small mistakes and managed to fix them quickly after my prompts. However, its file upload limitations could cause inconvenience for users.

SQL performance evaluation (image by author)

III. Exploratory Data Analysis (EDA)

Now let’s switch gears to Exploratory Data Analysis. Due to DeepSeek’s file upload constraints, I only managed to upload a very tiny dataset (2KB) of my Medium article performance. If you are interested, you can find a detailed analysis and review of my Medium journey in my past article.

Here is my prompt:

I have been writing articles on Medium and collected this dataset of my articles’ performance. You are a data science professional. Your objective today is to help me conduct a thorough exploratory data analysis (EDA) of this dataset with necessary steps, such as data cleaning, analysis and visualizations, clear insights, and actionable recommendations.

Your EDA will be used to better understand the medium earning and inform future writing strategies.

Below are the rubrics I used to evaluate the EDA capability of AI tools.

EDA evaluation rubrics (image by author)

1. Completeness (4/5)

DeepSeek’s EDA response was very organized and covered most of the critical components of EDA.

Data inspection: You could click on the uploaded dataset to get a preview. However, the preview was text-based, making it hard to digest. It also did not provide any text description of the dataset. Therefore, I considered this step incomplete.

Inspect the dataset in DeepSeek (image by author)

Data cleaning: DeepSeek started its report with data cleaning. It checked for missing values and data types, and removed unnecessary columns. Though it did not display the results, it provided summaries and instructions at each step.

Data cleaning steps in DeepSeek (image by author)
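
For readers who want to reproduce this step locally, here is a minimal sketch of what such a cleaning pass could look like in pandas. The file and column names (medium_article_stats.csv, published_date, earnings) are placeholders I chose for illustration, not the exact schema of my dataset or DeepSeek's exact code.

```python
import pandas as pd

# Load the exported article stats (file name is a placeholder)
df = pd.read_csv("medium_article_stats.csv")

# Inspect structure, data types, and missing values
print(df.info())
print(df.isna().sum())

# Drop columns that are not useful for the analysis (hypothetical names)
df = df.drop(columns=["article_url", "notes"], errors="ignore")

# Parse dates and fill missing earnings with 0 where that makes sense
df["published_date"] = pd.to_datetime(df["published_date"], errors="coerce")
df["earnings"] = df["earnings"].fillna(0)
```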

Univariate analysis: DeepSeek examined the distribution of earnings and other columns with Python code to run analysis and generate visualizations. It does not plot the charts in the UI, so I ran the code manually in my Jupyter Notebook to validate the results.

Univariate analysis in DeepSeek (image by author)

Bivariate and multivariate analysis: DeepSeek explored the relationship between earnings and many other variables to understand the drivers of Medium earnings.

Bivariate analysis in DeepSeek (image by author)

Insights and recommendations: DeepSeek also provided actionable insights based on its analysis.

Insights and recommendations by DeepSeek (image by author)

2. Accuracy (3/4)

I reviewed the Python script DeepSeek generated and ran it manually. While most of the code worked well, the correlation matrix section threw an error due to non-numeric columns being included. I reported the error message back, and it corrected the issue by adding df.select_dtypes(include=[np.number]) to filter on numeric columns only.
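
For reference, the corrected pattern looks roughly like this. It is a generic sketch rather than DeepSeek's exact script, and it assumes df is the cleaned DataFrame from the earlier steps, with earnings as a placeholder column name.

```python
import numpy as np

# df is assumed to be the cleaned article-level DataFrame from the EDA steps above.
# Restrict the correlation matrix to numeric columns only; including strings and
# dates is what caused the original error.
numeric_df = df.select_dtypes(include=[np.number])
corr_matrix = numeric_df.corr()

# 'earnings' is a placeholder for the target column
print(corr_matrix["earnings"].sort_values(ascending=False))
```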

This minor error resulted in a one-point deduction.

Correlation matrix code error (image by author)

3. Visualization (2/4)

DeepSeek did not display the visualizations in its UI, only the Python code. While the code generated accurate visualizations (except for the correlation matrix error above), the overall experience was less user-friendly compared to other tools like ChatGPT. Therefore, I deducted two points for this limitation.

4. Insightfulness (4/4)

DeepSeek provided valuable insights and actionable recommendations based on its analysis. It covered content strategy, publication selection, the power of "Boost", etc.

5. Reproducibility and Documentation (3/3)

DeepSeek structured its EDA report logically, from data cleaning to analysis and insights. The report is also well formatted with bullet points, code blocks, and highlighted keywords.

EDA Performance Summary

DeepSeek delivered a logically structured EDA report with functional code and clear insights. However, its inability to display visualizations in the UI was a notable drawback – this added an additional step for users to run the code locally and adjust the charts manually.

EDA performance evaluation (image by author)

IV. Machine Learning (ML)

I used the same dataset to evaluate how DeepSeek could assist in Machine Learning projects. Here are my rubrics.

Machine learning evaluation rubrics (image by author)

1. Feature Engineering (3/3)

I first asked it to conduct feature engineering with the below prompt:

I have been writing articles on Medium and collected this dataset of my articles’ performance. You are a data scientist professional. I would like you to help me build a machine learning model to forecast article earnings and understand how to improve earnings.

Let’s do the task step by step. First, please focus on feature engineering.

Can you suggest some feature engineering techniques that could help improve the performance of my model?

Please consider transformations, interactions between features, and any domain-specific features that might be relevant.

Provide a brief explanation for each suggested feature or transformation.

DeepSeek suggested 10 feature engineering techniques. Most were pretty reasonable: for example, applying log transformations to right-skewed variables, calculating engagement-per-view ratios, and adding temporal features.

Feature engineering by DeepSeek (image by author)
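
To make a few of these suggestions concrete, here is a short sketch of the log transformation, ratio, and temporal features in pandas. The column names (views, claps, published_date) are assumptions about my dataset, not DeepSeek's exact output.

```python
import numpy as np
import pandas as pd

# df is assumed to hold one row per article with hypothetical columns:
# 'earnings', 'views', 'claps', 'published_date'

# 1. Log transformation for right-skewed variables
df["log_views"] = np.log1p(df["views"])

# 2. Engagement-per-view ratio
df["claps_per_view"] = df["claps"] / df["views"].clip(lower=1)

# 3. Temporal features derived from the publish date
df["published_date"] = pd.to_datetime(df["published_date"])
df["publish_month"] = df["published_date"].dt.month
df["publish_dayofweek"] = df["published_date"].dt.dayofweek
```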

2. Model Selection (3/3)

Next, I asked DeepSeek to recommend the most suitable model: "Can you recommend the most suitable machine learning models for this task? For each recommended model, provide a brief explanation of why it is appropriate and mention any important considerations for using it effectively".

DeepSeek listed eight model options, from Linear Regression and its variations, to Random Forest and other tree-based models, to Neural Networks. It provided clear pros and cons for each model, ending with a summary and actionable next steps.

Model recommendations by DeepSeek (image by author)

3. Model Training and Evaluation (3.5/4)

Lastly, let’s see its capability in model training and evaluation. My prompt is:

Can you provide the code to train a ridge regression model? Please ensure that it includes steps like splitting the data into training and testing sets and performing cross-validation. Please also suggest the appropriate evaluation metrics and potential hyperparameter tuning opportunities.

DeepSeek’s code has a clear structure and comments. It ran well and output regression coefficients. It also offered a reasonable strategy for picking the right evaluation metrics and tuning hyperparameters. When I followed up with a question about how to interpret the coefficients of a ridge regression model, it was able to explain the interpretation methodology and the challenges posed by multicollinearity.
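
If you want to try this step yourself, here is a minimal ridge regression sketch covering the same pieces (train/test split, cross-validation, evaluation metrics, and coefficients). X and y are placeholders for the engineered feature matrix and the earnings target; this is my own illustration, not DeepSeek's verbatim output.

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# X: engineered feature matrix, y: article earnings (placeholders)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# alpha is the main hyperparameter to tune (e.g., via GridSearchCV)
ridge = Ridge(alpha=1.0)

# 5-fold cross-validation on the training set
cv_scores = cross_val_score(ridge, X_train, y_train, cv=5, scoring="neg_mean_absolute_error")
print("CV MAE:", -cv_scores.mean())

# Fit and evaluate on the held-out test set
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)
print("Test MAE:", mean_absolute_error(y_test, y_pred))
print("Test R^2:", r2_score(y_test, y_pred))
print("Coefficients:", ridge.coef_)
```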

However, it only provided the basic code without incorporating any of the feature engineering ideas from earlier in the same thread. When I asked it to add those features in, it kept erroring out with the message "The server is busy. Please try again later." I finally got the output after four retries. I’ve noticed this in earlier threads as well – DeepSeek’s server did not seem very reliable and errored out often, especially in longer chats. Therefore, I deducted 0.5 points for the server reliability issue.

Model training and evaluation steps by DeepSeek (image by author)

ML Performance Summary

For Machine Learning use cases, similar to the other AI tools, DeepSeek excelled at suggesting feature engineering ideas, brainstorming models, and writing code. However, it required human expertise to provide guidance, ask follow-up questions, and make final calls.

Machine learning performance (image by author)

Summary and Final Thoughts

When it comes to Data Science projects, DeepSeek v3’s performance is very much on par with ChatGPT-4o, Claude 3.5 Sonnet, and Gemini Advanced. This is especially impressive given its much lower training costs.

Performance comparison (image by author)
  • DeepSeek excels in coding but misses the key functionality of executing Python code and displaying visualizations directly in the UI. This is very similar to my observation of Claude 3.5 Sonnet back in August last year, when it was also missing interactive visualization. However, Claude has since added an analysis tool (via JavaScript and React instead of Python), overcoming its previous drawback. DeepSeek might follow the trend and add that feature as well.
  • Its server seems less reliable than the other tools right now – it feels like ChatGPT in its early days. The file upload limitations could also pose a challenge to using the chatbot meaningfully in data science workflows.
  • However, its free chatbot access and more affordable API costs give it a significant competitive edge, particularly for users in China and for small businesses worldwide. It could democratize access to advanced AI tools, enabling smaller companies and individual developers to leverage powerful models at a much lower cost.
  • DeepSeek’s rise will for sure incentivize more AI innovations globally, both from AI giants like OpenAI and Anthropic and from smaller AI startups. Super excited to see how this space will evolve in the coming year!

Enjoyed this article? Follow me to stay tuned for more articles on data science and AI.

ChatGPT vs. Claude vs. Gemini for Data Analysis (Part 1)

ChatGPT vs. Claude vs. Gemini for Data Analysis (Part 2): Who’s the Best at EDA?

ChatGPT vs. Claude vs. Gemini for Data Analysis (Part 3): Best AI Assistant for Machine Learning

How ChatGPT Became My Best Solo Travel Buddy in 2024

Unlocking the Power of Machine Learning in Analytics: Practical Use Cases and Skills
https://towardsdatascience.com/unlocking-the-power-of-machine-learning-in-analytics-practical-use-cases-and-skills-5201cf457360/ (Wed, 15 Jan 2025)
Your essential machine learning checklist to excel as a data scientist in analytics

In the past decade, we have seen explosive growth in the data science industry, with a rise in machine learning and AI use cases. Meanwhile, the "Data Scientist" title has evolved into different roles at different companies. In terms of function, there are product data scientists, marketing data scientists, those specializing in finance and risk, and those supporting operations, HR, and other teams.

Another common distinction is the DS Analytics (often referred to as DSA) and the DS Machine Learning (DSML) tracks. As the names suggest, the former focuses on analyzing data to derive insights, while the latter trains and deploys more machine learning models. However, this does not mean that DSA positions do not involve machine learning projects. You can often find machine learning among the required skills in the job descriptions of DSA openings.

This overlap often leads to confusion among aspiring data scientists. During coffee chats, I frequently hear questions like: Do DSA positions still require machine learning skills? Or do DSAs also deploy machine learning models? Unfortunately, the answer is not a simple yes or no. Firstly, the boundaries between the two positions are always blurry (even a decade after the data science job became a trend). Sometimes, within the same company, DSAs supporting different functions could be using very different skill sets. Secondly, machine learning itself is a broad field, covering everything from simple linear regression to complex neural networks and even LLMs. Therefore, some of these techniques can be fantastic tools for analytics purposes, while others are more tailored to building production-grade predictive models.

But if I have to answer the question: "Yes, DSAs also use machine learning skills, but just not in the same way as DSML positions. Instead of building complex models and optimizing for scalability, accuracy, and latency, DSAs primarily leverage machine learning as a tool to generate insights or support analyses and focus more on generating interpretable and actionable outputs to better inform business decisions."

In the section below, I will explain this answer further with three common machine learning use cases in Analytics. These examples will provide a more concrete answer to the above question, and highlight the machine learning skills you need to succeed in a DSA position.

Image created by DALL·E

Use Case 1: Metrics Driver Analysis

Usually, the first step to understanding a business or a product is to define and track the right metrics. But after completing this step, the question becomes how we can move the metrics in the right direction. Machine learning is a powerful tool for understanding what drives these metrics and how to influence them.

Let’s say you are tasked with improving customer retention. You might have heard tons of assumptions from user interviews, surveys, and even business intuition. A common way to validate the assumptions is to build a classification model to predict retention and identify important features.

Here is the process:

  1. Establish the assumptions based on conversations with your stakeholders and customers, and your business understanding. For example,

    • Customers with longer tenures are less likely to churn;
    • Heavy usage of feature X indicates higher loyalty to the product;
    • Customers who only use the mobile app are more likely to churn due to the missing features on the app.
  2. Collect features based on your assumptions, for example, customer tenure, time spent and frequency on feature X, and if the user is mobile only or not.
  3. Build a classification model to predict if a customer churned or not. Common classification models include Logistic Regression and tree-based models (random forests, XGBoost, LightGBM, CatBoost, etc.). My go-to model is the XGBoost model because it captures nonlinear relationships and feature interactions, has built-in methods to handle imbalanced datasets and avoid overfitting, and typically achieves a good accuracy baseline even before extensive parameter tuning.
  4. Generate data insights and business recommendations. You can use the model’s Feature Importance to understand which assumptions were correct. You can also use SHAP values to decompose the prediction into contributions from each feature. For example, if the model reveals that being a mobile-only user is a very important feature and is correlated with a high churn rate, the next steps could be 1. improving the mobile app to close the feature gaps; 2. launching an email campaign encouraging mobile-only users to try more features on the desktop version. And of course, you will track the effectiveness of these recommendations and continue monitoring the retention rate. A minimal sketch of steps 3 and 4 follows this list.
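
Here is that sketch, assuming a customer-level DataFrame df with a binary churned label. The feature names (tenure_months, feature_x_usage, is_mobile_only) are illustrative, not from a real dataset.

```python
import shap
import xgboost as xgb
from sklearn.model_selection import train_test_split

# df: one row per customer with a binary 'churned' label and hypothetical features
features = ["tenure_months", "feature_x_usage", "is_mobile_only"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["churned"], test_size=0.2, random_state=42
)

model = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X_train, y_train)

# Global view: which assumptions actually matter
print(dict(zip(features, model.feature_importances_)))

# SHAP values decompose each prediction into per-feature contributions
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```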

From the above example, you can see that building the model is only part of the task for DSAs. The real value comes from interpreting results and translating them into actionable business strategies.

Use Case 2: Customer Segmentation

We always talk about product-market fit; a critical part is understanding your customer portfolio and their needs. Therefore, data scientists are often asked to conduct customer segmentation tasks, grouping users based on similar behaviors or preferences.

There are numerous approaches to customer segmentation. Let me name a few:

  1. Simple demographics segmentation. For example, you work for a fashion retailer that owns product lines with different price tiers. In this case, a simple solution to customer segmentation is to break down customers based on their household income (if you have this data).
  2. RFM (Recency, Frequency, Monetary) analysis. You can collect metrics including how recently a customer spends (recency), how often they spend (frequency), and how much they spend (monetary). Then you can categorize high-recency, high-frequency, and high monetary value customers as VIP customers; high-recency, high-frequency, and low monetary value customers as price-sensitive customers, etc.
  3. Back to our topic, you can also build an unsupervised learning model (K-Means Clustering, Hierarchical Clustering, DBSCAN, etc.). Unlike the classification model example above, unsupervised learning is a type of machine learning that identifies patterns and relationships in unlabeled data without predefined output labels. Let’s still stick to the fashion retailer example:

    • You can collect features including customer demographics (age, income, etc.) and purchase patterns (how often and how much they spend on different product lines and types), then run them through a clustering algorithm like K-Means. This will automatically group your customers into several clusters.
    • Though there are no existing true labels, you still need to evaluate the model outputs. Usually, you would examine the characteristics of each cluster to understand if the output makes business sense and what type of customers they represent, then give them a label that is intuitive to your stakeholders. For example, you might find a cluster of customers mostly purchasing discounted products, and you can label them "Bargain Hunters"; Another cluster of customers who spend a lot on luxury brands could be the "Luxury Shoppers" group.
    • Once you have reasonable segments, they can be used to inform product strategies, targeted email campaigns, and personalized product recommendations. A brief clustering sketch follows this list.
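
Here is that sketch. The feature names and the choice of four clusters are illustrative assumptions; in practice you would compare several values of k (for example, with the elbow method or silhouette scores).

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# df: one row per customer with hypothetical demographic and purchase features
features = ["age", "income", "orders_per_year", "avg_order_value", "discount_share"]

# Scale features so no single one dominates the distance metric
X_scaled = StandardScaler().fit_transform(df[features])

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df["segment"] = kmeans.fit_predict(X_scaled)

# Profile each cluster to decide on business-friendly labels
# (e.g., "Bargain Hunters", "Luxury Shoppers")
print(df.groupby("segment")[features].mean())
```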

Use Case 3: Experimentation and Causal Inference

Another common task for DSAs is measuring the causal impact of a certain event. This involves randomized experimentation and more advanced causal inference methods when a controlled experiment is not feasible.

When it comes to randomized experiments like A/B tests, an application of machine learning is to reduce noise in experiment results by controlling for covariates. These adjustments improve the sensitivity of experiments and lead to smaller sample sizes or shorter test durations while maintaining statistical power. CUPED (Controlled-experiment Using Pre-Experiment Data) is one of those variance reduction methods that can incorporate machine learning techniques. It reduces the variance of the outcome metric by adjusting for pre-experiment covariates that are predictive of the outcome – something you can achieve with a machine learning model.
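
As a rough illustration of the mechanics, here is the standard single-covariate CUPED adjustment on simulated data. In practice the covariate is often the pre-period value of the same metric, or a machine learning prediction built from several pre-experiment covariates.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: pre-experiment covariate x is predictive of the outcome y
n = 10_000
x = rng.normal(100, 20, n)           # pre-experiment metric
y = 0.8 * x + rng.normal(0, 10, n)   # in-experiment outcome

# CUPED: y_cuped = y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())

print("Variance before:", round(y.var(ddof=1), 1))
print("Variance after CUPED:", round(y_cuped.var(ddof=1), 1))  # substantially smaller
```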

When it comes to causal inference, there are many machine learning use cases, as it can help address key challenges like confounding, non-linearity, and high-dimensional data.

Here is an example of using machine learning to enhance Propensity Score Matching, which is a useful technique to manually create two comparable groups when you don’t have perfectly randomized test and control groups. Suppose your company launched a newsletter program and your stakeholder wants you to assess its impact on customer retention. However, users who subscribed to the newsletter might inherently differ from those who did not. To apply the Propensity Score Matching method here (a simplified sketch follows these steps):

  1. You can train a machine learning model (for example, logistic regression or XGBoost) to predict the likelihood of a user subscribing to the newsletter.
  2. Use the predicted "propensity scores" to match newsletter subscribers with similar non-subscribers.
  3. Compare the retention rates between the matched treatment and control groups. This will give you a fairer evaluation of the newsletter’s impact on customer retention.
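
Here is that simplified sketch, using logistic regression for the propensity model and greedy 1:1 nearest-neighbor matching without replacement. The column names (subscribed, retained, and the covariates) are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# df: one row per user with 'subscribed' (newsletter flag), 'retained' (outcome),
# and pre-treatment covariates (hypothetical names)
covariates = ["tenure_months", "orders_last_90d", "is_mobile_only"]

# Step 1: model the probability of subscribing (the propensity score)
ps_model = LogisticRegression(max_iter=1000).fit(df[covariates], df["subscribed"])
df["propensity"] = ps_model.predict_proba(df[covariates])[:, 1]

# Step 2: greedy 1:1 nearest-neighbor matching on the propensity score
treated = df[df["subscribed"] == 1]
control = df[df["subscribed"] == 0].copy()
matched_rows = []
for _, row in treated.iterrows():
    idx = (control["propensity"] - row["propensity"]).abs().idxmin()
    matched_rows.append(control.loc[idx])
    control = control.drop(idx)  # match without replacement

matched_control = pd.DataFrame(matched_rows)

# Step 3: compare retention between the matched groups
print("Treated retention:", treated["retained"].mean())
print("Matched control retention:", matched_control["retained"].mean())
```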

Machine learning can also be incorporated into other causal inference methods, such as Regression Discontinuity, Instrumental Variables, and Synthetic Control, which help address biases in observational data and derive causal relationships.


I hope the three use cases above give you a concrete idea of how you can use machine learning in analytics workflows.

If you aim to be a data scientist specializing in analytics, here is your essential machine learning skills checklist for interview preparation and daily work.

  1. Data Preparation:

    • Handle missing values and outliers
    • Encode categorical variables
    • Apply data normalization techniques
  2. Common ML Algorithms: understand the assumptions and the pros and cons of each model, and how to pick the right one

    • Supervised learning models: Regression models (Linear Regression, Logistic Regression), Tree-based models (Random Forest, XGBoost)
    • Unsupervised learning models: K-Means and Hierarchical Clustering
  3. Model Training and Evaluation:

    • Select the right evaluation metric
    • Prevent overfitting
    • Handle imbalanced datasets
    • Feature selection and feature engineering
    • Hyperparameter tuning
  4. Model Interpretation:

    • Understand coefficients in Regression models
    • Understand Feature Importance in tree-based models
    • Use interpretability tools like SHAP
    • Translate model insights into business insights and communicate them effectively to non-technical stakeholders

By mastering these skills and understanding the use cases discussed, you’ll be well-equipped to leverage machine learning effectively as a Data Scientist in Analytics.


If you have enjoyed this article, please follow me and check out my other articles on Data Science, analytics, and AI.

Seven Common Causes of Data Leakage in Machine Learning

ChatGPT vs. Claude vs. Gemini for Data Analysis (Part 3): Best AI Assistant for Machine Learning

From Data Scientist to Data Manager: My First 3 Months Leading a Team

5 Essential Tips to Build Business Dashboards Stakeholders Love
https://towardsdatascience.com/5-essential-tips-to-build-business-dashboards-stakeholders-love-3312c8a94b6c/ (Wed, 11 Dec 2024)
A practical guide to designing clear, effective, and actionable dashboards for decision-making

Working in data science, dashboarding often feels like an unfavored but unavoidable task. Why is it unfavored? Dashboarding is less technical (less fancy) than analysis and modeling, and more repetitive. But why is it also unavoidable? It is the first and must-have step to understanding any product, opening the gate to analysis and modeling. It even helps build trust, as dashboards are usually among the first requests from your stakeholders.

Meanwhile, the difficulty of dashboarding is also often undervalued. I have seen numerous dashboards built by colleagues in the past seven years. Surprisingly, not everyone does this ‘easy task’ well. You might be wondering, isn’t dashboarding just creating a bunch of charts? Well, yes and no. To build a good dashboard, you need to logically organize each visualization and simplify complex data for your audience. It is more about data-informed decision-making than about charts. Moreover, a good dashboard can enable self-serve analytics and reduce tedious data pull requests, benefiting the data team in return.

In this article, I will share my top five tips for designing effective business dashboards that provide clarity, actionable insights, and lasting value. These tips apply to any dashboarding tool you use – whether it is Tableau, Looker, Power BI, or something else.

Image created by DALL·E

I. Start with the North Star metric 🌟

Dashboards are built to measure performance – whether it’s for a product, a marketing campaign, an experiment, a team, etc. Therefore, you should have a North Star metric that best aligns with the goal. At a high level, the North Star metric should be simple, focused, and actionable. For example, it could be DAU (Daily Active Users) for the Facebook app, GMV (Gross Merchandise Value) for Amazon, and Minutes Watched for YouTube.

Pro Tips:

  • You should discuss with your stakeholders to agree on the North Star metric, and highlight it in a prominent location on the dashboard (usually the first section with the largest font sizes).
  • If your team has a specific target for the North Star metric, make sure to display it clearly on your dashboard to indicate if the metric is trending in the right direction.
  • To take it one step further, you can also set up alerts if the metric drops below a critical threshold to provide a timely warning – this is a common feature supported by most BI tools.

II. Focus on common business questions 🤔

The best dashboards deliver actionable insights promptly and enable decision-making. Therefore, beyond displaying your North Star metric, the dashboard should include secondary metrics and deep-dive functionality to address common business questions and automate data exploration requests.

Let’s assume we are building a dashboard tracking the health of the ChatGPT app and we have decided on MAU (Monthly Active Users) as the North Star metric, measuring overall user base size and stickiness. But when it goes up or down significantly, it is always the data scientist’s responsibility to identify the root cause. A common way to conduct root cause analysis is to break down the metric by its funnel steps and explore important segments. To automate this step, after showing the MAU trend on the dashboard, we could add the below sections:

  • Let’s decompose MAU: MAU this month = MAU last month + new users acquired this month - users churned this month. Therefore, it is helpful to include the new-user trend and churn rate trend to diagnose whether a decrease in MAU is due to lower user acquisition or higher user churn.
  • Stakeholders might care more about certain segments. For example, maybe ChatGPT is expanding into European markets right now, so people want to understand the MAU trend in Europe vs. other markets. In this case, you should highlight it in a section breaking down the MAU by region. Other common segments include devices, user tenure, age groups, etc. You can also include these segments as filters on the dashboard, so your stakeholders can explore the data on their own.

The example above also demonstrates that the secondary metrics and deep-dive sections are largely shaped by business understanding. Therefore, dashboarding is also an iterative process with stakeholders, and it is critical to identify common business questions and design the dashboards accordingly.

Pro Tips:

  • Align with stakeholders about what business questions they would like to answer with the dashboard, and include corresponding deep-dive sections to enable self-service analytics;
  • It is a good idea to provide a dashboard mockup during the design phase, including the North Star and secondary metrics, segments, and filters on the dashboard. Simply discussing in a meeting is hard, but providing a visual draft will inspire conversation, clarify requests, and reduce unnecessary revisions later.

III. Define metrics clearly 📐

Now that we have discussed what metrics to include on the dashboard, the following tips focus on how to deliver the data insights easily.

Though we always say that a good data visualization should be intuitive and self-explanatory with minimal text, let’s be honest: business dashboards value accuracy and clarity more than anything else. Therefore, I recommend including concrete metric definitions and all the necessary descriptions on the dashboard.

Returning to our ChatGPT MAU example, to avoid confusion, you should be explicit about:

  • What qualifies as an ‘active’ user? Does it mean creating at least one new conversation in the past 30 days, or does interacting with an old thread count too?
  • Are we looking at all registered users or only ChatGPT Plus subscribers?

Every reader has their own interpretation of a metric, and your company might have had multiple versions of an OKR – My team was just cleaning up four versions of the TTR (time to resolution) calculation for the CX team last week… Therefore, it is a good idea to be extra specific.

Pro Tips:

  • Include a detailed description of each metric to avoid confusion. You can also link to internal documentation if it exists.
  • Other helpful details include the goal and scope of the dashboard, the freshness of the dashboard (refresh frequency and when was it last updated), and the links to related dashboards or analyses.

IV. Choose the right visualizations 📈

We are at the fourth tip now and finally get to the visualization part 🙂 I have written about my 7-year weekly visualization journey and 7 less common but powerful visualization types. These articles focus on crafting individual data visualizations, but those ideas can be too creative or complicated for visualizations in a dashboard. Again, the goal of a dashboard is to deliver data insights quickly, so it prioritizes clarity and simplicity.

You can find many examples of good business dashboard design on this Tableau Public page.

Pro Tips:

  • Pick simple and intuitive visualization types. I recommend sticking to basic chart types in dashboards— use bar charts for comparisons and line charts for trends. I am personally against pie charts because it is visually hard to compare the relative sizes of the slices (unless you explicitly label them).
  • Combine summary stats and trendlines. It might sound repetitive, but putting summary stats and trendlines together helps stakeholders quickly see the current number without hovering over the charts and compare it with the target. For example, when tracking a weekly metric like CSAT, I will include summary stats of CSAT this week, the % change week-over-week, and the % above/below the target, and then put the weekly CSAT trend next to them.
  • Of course, all the important considerations for single visualizations are still relevant and you should be detail-oriented. For example, you should cover sufficient historical data in line charts, include clear title and axis labels, use effective color schemes, add helpful tooltips, etc.

V. Optimize dashboard performance 🚅

Apart from visual design, performance is another key piece in dashboarding. Honestly, the most common complaint I’ve heard from stakeholders about business dashboards is "This dashboard loads too slowly" 🙁

Here are two parts of the dashboard performance:

  1. Data connection and underlying tables. All the metrics come from your database, so your data infrastructure and pipeline influence the dashboard performance. Taking Looker as an example, if you load a Looker dashboard without any cached results, it hits your data warehouse, runs the SQL query, retrieves the result, and at the end renders the visualization. Therefore, how your underlying table is structured and your data warehouse setup matters. I once saw a Looker-generated query that joins six large tables together given how the joins are defined in LookML (the Looker configuration files used to create semantic data models). However, only three tables were really necessary for the specific metric. This sort of suboptimal setup increases the query time and eventually slows down your dashboard.
  2. BI tool limitation. Of course, each BI tool has its own limitations. But generally speaking, you should limit the number of charts on your dashboard to ensure the responsiveness of the UI elements.

Above are my top five tips for designing an effective business dashboard. To recap, here is my checklist 📋 :

  • Align with your stakeholders on the key business questions they would like to answer with the dashboard;
  • Define your North Star metric, secondary metrics, and key segments;
  • Make sure all the metrics are clearly defined on the dashboard to avoid confusion;
  • Choose simple and intuitive visualization types with attention to detail;
  • Optimize the dashboard performance with thoughtful underlying infrastructure design.

Dashboarding is more than just data visualization; it requires stakeholder collaboration, business understanding, data modeling, and even a bit of UX design. By following these tips, you will create dashboards that empower decision-making and build trust with your audience.

Do you have other dashboarding tips or lessons learned? Please share them in the comments 🙂


If you have enjoyed this article, please follow me and check out my other articles on Data Science, analytics, and AI.

Beyond Line and Bar Charts: 7 Less Common But Powerful Visualization Types

330 Weeks of Data Visualizations: My Journey and Key Takeaways

Evaluating ChatGPT’s Data Analysis Improvements: Interactive Tables and Charts

From Data Scientist to Data Manager: My First 3 Months Leading a Team
https://towardsdatascience.com/from-data-scientist-to-data-manager-my-first-3-months-leading-a-team-40c1c7c05e5c/ (Tue, 26 Nov 2024)
Reflections on moving from hands-on work to mentoring and leading

This is the 7th year in my data science career, a journey filled with dashboards, metrics, analyses, and models. But in August, I stepped into new territory: becoming a people manager for the first time. To be honest, whenever asked about my career goal in the past, I always said I preferred staying on the IC track. I loved the technical challenges and owning projects end-to-end. However, when this opportunity came up, I decided to give it a shot. After all, you don’t know if something is right for you until you try.

In this article, I will share my initial experience as a manager – what has changed, what I’ve enjoyed, and what’s been challenging. If you are debating between the IC and people management path, I hope it will help shed some light.

Photo by Mimi Thian on Unsplash

My path to management

To set the stage, let me share how I transitioned to a people manager. When I first joined the team four years ago, everyone on the team was a ‘full-stack’ data scientist – we each supported a specific domain and owned everything from building data pipelines, defining business metrics, and dashboarding, to analysis, experimentation, and modeling. This framework worked well for a startup. However, with the company becoming more mature and the team growing, we started to see the limitations: team members had varying preferences over data engineering vs. data analytics vs. data science work, but we were all required to do a bit of everything; stakeholders often evaluated us based on the dashboards or reports we delivered, but did not realize how much effort we needed to put into building the underlying data pipeline; it was hard to standardize things like data engineering best practices as it was only one part of everyone’s role.

As a result, late last year, we restructured the team into three sub-teams: Data Engineers, Data Analysts, and Data Scientists. This change clarified responsibility and allowed for deeper expertise in each stage of the data cycle. I was then a Senior Data Scientist on the DS team. But as the data org grew, in August, I was offered the opportunity to manage the Data Analysts team, focusing on generating source-of-truth metrics reporting and actionable data insights. As I mentioned above, I decided to embrace the challenge and experience the people manager life.


What has changed?

1. My meeting time doubled

The first change I noticed is how much my meeting time has increased… Let me show you some numbers: when I was an IC, my average meeting time was about 7 hours per week, which means I had at least 80% of the time to focus on my projects. However, in the past three months, my average meeting time was roughly 14 hours per week, with one week exceeding 18 hours, more than doubling my prior meeting time.

So where does all this time go? Here is the breakdown:

  1. Regular meetings with the team (~5 hours). I host a weekly team meeting for sprint planning, team events, and discussions. I also have weekly 1:1s and monthly growth check-ins with everyone on the team, helping me understand their projects and growth opportunities, and serving as a two-way feedback channel. And of course, I have 1:1s scheduled with my manager and with the managers of the DE and DS teams to align priorities and collaborate effectively.
  2. Regular cross-functional meetings (~3 hours). The data analytics team works closely with stakeholders to help them track key metrics and conduct analysis. Therefore, it is critical to have regular meetings with stakeholders to take new requests and present findings. While I encourage my team to lead these discussions, I often attend to understand the team’s focus, provide support, clarify priorities, or push back when needed.
  3. Project syncs and ad-hoc meetings. There are often projects that involve multiple teams or span multiple sprints. In that case, project sync meetings are unavoidable. I attend those meetings to align on the project scope and progress. I also always make myself available for ad-hoc meetings, when my team wants to review their projects or brainstorm analysis ideas.

Do I like meetings? Unfortunately no, as I am an introvert. I’ve also found my days to be more scattered as I only have 30-minute blocks here and there between the meetings. However, these conversations are crucial for me to always be on the same page with my team.

2. More mentoring and coaching

When I was an IC, success meant delivering high-quality projects. However, as a manager, my success comes from ensuring my team has everything they need to deliver their projects. Therefore, management comes with a lot of mentoring and coaching.

The monthly growth check-in meetings are the perfect venue for me to understand what my team is missing and what areas they want to grow in. Based on their feedback, I host monthly L&D sessions on topics like text analytics with LLMs, A/B Testing 101, and (up next) Causal Inference 101.

Of course, applying a skill in real projects is the best way to master it. Therefore, I try my best to review my team’s projects promptly and brainstorm analysis ideas with them. It might be a small piece of advice every time, for example, how to optimize a query they wrote, how to make a visualization more user-friendly, or how to better format an analysis report (you can find all of them in my past articles as my writing ideas are always inspired by real work). However, I believe these small but consistent improvements help them become better data analysts every day.

Mentoring and coaching have been among the most fulfilling aspects of becoming a manager – I feel deeply rewarded when I see people grow with my help.

3. Overseeing more projects with a broader scope

As an IC, I focused on projects for specific domains like Risk, Operations, CX, Product, Implementation, etc. Now, as a manager, my scope is essentially the whole company…

We have data analysts assigned to different organizations across the company. Therefore, I had to learn new functions like Sales and Marketing quickly to better support them. At first, I was a bit worried about whether I would be helpful enough given my limited context there. I eventually bridged the gap by reading existing documentation, key dashboards, and past analyses, and by diving into the ongoing projects. One lesson I learned is that the manager-report relationship is not just one-way coaching, but a mutual learning experience. My direct reports are my best teachers when it comes to learning domain knowledge, and I trust their judgment in scoping their work.

This increased scope also helps me zoom out from single projects and see the big picture. I've started noticing connections between projects across domains as they serve the same company goals. This benefits me a lot when I need to prioritize projects for my team and make sure everyone is aware of similar initiatives their peers are working on.

On the other hand, this change also means less time for me to do hands-on projects. Though I still carve out up to 50% of my capacity for IC work (given the limited resources on the team), the time I could spend on diving into new methodologies is now rare, and that is the piece I truly miss.

4. Visibility into the behind-the-scenes work

Another very positive change for me is that I now have more visibility into what happens behind the scenes. Here are some examples:

  • Process optimization: As an IC, I just followed the process established by my manager without much thought, whether it was adopting a new sprint planning process, building a team hub in Confluence, or following a certain doc format. However, as a manager, I now constantly think about how to improve team productivity through process improvements. For example, several people on the team mentioned they got ad-hoc requests all over the place, from Slack direct messages, JIRA tickets, ad-hoc meetings, emails, etc. To improve the request intake process, I suggested centralizing all requests in specialized Slack channels like #gtm-data-help or #ops-data-help using a request form, and always asking stakeholders to post their requests there. This helped the team better manage their workload and helped stakeholders better format their requests while seeing all the other ongoing data tasks in their domain.
  • Hiring: I've successfully hired two people since becoming a manager. In the past, I had only been involved in the hiring process as an interviewer. But being a hiring manager is a totally different story – I needed to prepare the job description, align with recruiters on the candidate profile, design the interview loops, review resumes, conduct interviews, participate in every debrief, and eventually make hiring decisions. This took much more effort than I expected. But I also find it very rewarding to grow my team.
  • Performance review: I recently completed my first performance review. It was not the critical year-end review that decides promotions and compensation changes, but a smaller quarterly review – still, it was something very new to me. To ensure the performance review never surprises anyone, I constantly use the weekly 1:1s and monthly growth check-ins to align expectations and collect feedback. As a result, doing the performance review is more or less celebrating the great work they've done in the past quarter and setting goals for the next one. Going through this process myself also gives me more insight into how daily work translates into ratings.

How do I like my new role?

With all the changes above, how do I like my new role?

On the positive side, I enjoy helping people grow. It is very fulfilling to pay forward the mentorship and guidance I have received in my career. The expanded scope also gives me valuable insights into how businesses run and how to better align data projects with company goals.

On the flip side, I do miss the IC time of doing hands-on Data Science work, owning projects end-to-end, and diving into technical details. Sitting in meetings all day is exhausting and makes it hard to carve out focused time.

However, I am absolutely enjoying the challenge so far. Whether I stick with management long-term or return to the IC track, this experience is teaching me invaluable lessons that will benefit my career for years.

How was your transition to a manager? I’d love to hear your thoughts, advice, and insights in the comments below!


If you have enjoyed this article, please follow me and check out my other articles on data science, analytics, and AI.

My Medium Journey as a Data Scientist: 6 Months, 18 Articles, and 3,000 Followers

From Insights to Impact: Presentation Skills Every Data Scientist Needs

Beyond Line and Bar Charts: 7 Less Common But Powerful Visualization Types

The post From Data Scientist to Data Manager: My First 3 Months Leading a Team appeared first on Towards Data Science.

]]>
My Medium Journey as a Data Scientist: 6 Months, 18 Articles, and 3,000 Followers https://towardsdatascience.com/my-medium-journey-as-a-data-scientist-6-months-18-articles-and-3-000-followers-c449306e45f7/ Mon, 11 Nov 2024 12:02:05 +0000 https://towardsdatascience.com/my-medium-journey-as-a-data-scientist-6-months-18-articles-and-3-000-followers-c449306e45f7/ Real numbers, earnings, and data-driven growth strategy for Medium writers

The post My Medium Journey as a Data Scientist: 6 Months, 18 Articles, and 3,000 Followers appeared first on Towards Data Science.

]]>
I started writing data science and AI content on Medium in May 2024. This is my sixth month and I just hit a major milestone – 3,000 followers! I am very proud of my achievements.

In this article, I will share how this journey started, what I have been writing, and what I learned. Plus, as a data scientist, I always enjoy analyzing my own data. I collected a dataset of my Medium stats, including article views 👀, reads 📖, claps 👏, earnings 💵, etc. Join me as I break down my Medium experience using data and share my data-driven writing strategies.

Image created by DALL·E

My Medium Journey Overview

How it all began

My writing habit dates back well before I started writing on Medium. I have been running my data science portfolio site since 2018, back when I started my first full-time job. I post articles there and occasionally share them on LinkedIn, which helps me connect with friends and colleagues in the data domain. Earlier this year, I posted an article about my experimentation with custom GPTs, and it reached nearly 10k impressions on LinkedIn. That is not bad at all, but it also led me to wonder how I could reach an even wider audience.

Meanwhile, I have been a Medium Member since 2020. It has been invaluable for me to learn skills outside of my daily work and keep up with new technologies in the industry. Having been in the industry for seven years, I feel it is time to be on the other side and share my knowledge with the community (and get my $5 monthly Medium subscription fee back 😀).

This is how the story started. I first tried posting some of my old articles on Medium, then moved on to writing brand-new content, submitting my articles to publications like Towards Data Science, and posting two to four new articles each month.

What I write about

My articles cover these three categories:

  • Technical tutorials: Many people come to Medium to learn how to do X, just as I do. Therefore, a majority of my articles fall under this category. This includes my article with the highest earnings: Mastering SQL Optimization: From Functional to Efficient Queries.
  • Learnings: We don’t know everything, but that is okay. I enjoy exploring new things and sharing my discoveries on Medium. For example, I have a series of articles comparing ChatGPT, Claude, and Gemini on various data science and analytics tasks.
  • My career stories: With seven years in the industry, I have lots of career stories and reflections. In fact, the article that brought me the most claps and new followers is 330 Weeks of Data Visualizations: My Journey and Key Takeaways.

How writing on Medium has helped me

Writing on Medium of course helped me engage more with the data science community and earn some extra money. But it brought me many more benefits, including:

  • It makes me more confident in expressing my opinions. I have been following Towards Data Science for many years as a reader, and have always seen it as a publication for top-notch data science articles. Now, being an author who publishes here regularly, I feel much more confident in my data skills and storytelling abilities. And every clap and comment is a wonderful form of recognition.
  • It enhances my knowledge and skills. The process of writing an article is like re-learning something or re-experiencing a journey. It requires lots of fact-checking and reflection. Therefore, every article I write reinforces my understanding of the topic.
  • It helps me keep up the habit of reading and writing. Working in a second language isn't easy (my native language is Mandarin Chinese), and regular reading and writing are key to constantly improving my English communication. Now that I am writing on Medium, I also tend to read others' articles more for inspiration. This creates a positive cycle of reading and writing.

Mapping My Journey with Data

As a data scientist, I like collecting and analyzing data to improve decision-making. This also applies to blogging. Let’s start with some key metrics of my Medium journey (as of 11/3):

  • Stories posted: 18
  • Total reads: 54k
  • Total claps: 6,926 (~385 per article)
  • Total followers: 3,210
  • Total earning: $2,140

These are just the top-line metrics. To dig deeper, I prepared a dataset with daily stats on views, reads, claps, follows, and earnings for every article by following this guide. Here is what I discovered from the exploratory data analysis.
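If you want to run a similar analysis on your own stats, here is a minimal pandas sketch for loading and summarizing such a dataset. The file name and column names are assumptions for illustration, not the exact schema produced by the guide above.

    import pandas as pd

    # Hypothetical daily export: one row per article per day.
    # The file name and columns below are illustrative, not an official schema.
    stats = pd.read_csv(
        "medium_daily_stats.csv",
        parse_dates=["date", "published_at"],
    )

    # Per-article lifetime totals, similar to the top-line metrics listed above
    totals = stats.groupby("article")[["views", "reads", "claps", "earnings"]].sum()
    print(totals.sort_values("earnings", ascending=False))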

Key Data Insights

1. 80% of article views happen in the first 7 days.

As shown in the charts below, on average, 50% of the views come within the first 3 days, and 80% within the first 7 days. After 2 weeks, daily views usually drop below 50. This is likely because 1. publications like Towards Data Science usually share new articles on social media within the first few days after publishing, and 2. Medium prioritizes newer articles when distributing them through its recommendation system.

This means you can already tell if your article is a hit in 3 days.

Daily views visualization, data and image by the author
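To reproduce this kind of decay curve from a daily stats dataset like the one sketched earlier, you can compute the cumulative share of lifetime views by day since publication. Again, the file and column names are illustrative assumptions:

    import pandas as pd

    stats = pd.read_csv("medium_daily_stats.csv", parse_dates=["date", "published_at"])

    # Days since publication for each article/day row
    stats["day_n"] = (stats["date"] - stats["published_at"]).dt.days

    # Cumulative share of each article's lifetime views, by day since publication
    daily = stats.groupby(["article", "day_n"])["views"].sum()
    cum_share = (
        daily.groupby(level="article").cumsum()
        / daily.groupby(level="article").transform("sum")
    )

    # Average decay curve across articles; in my data it is ~50% by day 3 and ~80% by day 7
    avg_curve = cum_share.groupby(level="day_n").mean()
    print(avg_curve.reindex([3, 7, 14]))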

2. Medium members are 3x more likely to read an article than non-members.

Medium defines views as people who visited your story’s page and reads as people who read your story for at least 30 seconds. Therefore, the read ratio = # reads / # views tells how engaging your article is to the audience that visits it.

An interesting pattern I noticed is that Medium members have a read ratio of around 60%, while it is closer to 20% for non-members. This suggests people are more motivated to read when they are paying the subscription fee 🙂 Meanwhile, it might also be driven by the fact that non-members hit the paywall once they have exceeded the free preview limit for the month (if those views are not excluded from the Medium stats, which I could not verify).

Member vs. non-member read ratio, data and image by the author
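If your stats export breaks views and reads out by membership status, a small pandas sketch like the one below can compute the two read ratios. The column names here are assumptions for illustration, not an official Medium export schema.

    import pandas as pd

    # Hypothetical per-article, per-day stats with a membership breakdown
    stats = pd.read_csv("medium_daily_stats.csv")

    by_article = stats.groupby("article")[
        ["member_views", "member_reads", "nonmember_views", "nonmember_reads"]
    ].sum()

    read_ratio = pd.DataFrame({
        "member": by_article["member_reads"] / by_article["member_views"],
        "non_member": by_article["nonmember_reads"] / by_article["nonmember_views"],
    })

    # In my data this lands around ~60% for members vs. ~20% for non-members
    print(read_ratio.mean())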

3. Article earnings follow the 80/20 rule.

80% of my earnings come from just 3 articles, a perfect example of the 80/20 rule. In fact, my best-performing article alone has brought me nearly $1,000. On the other hand, as you can see in the histogram below, many articles earn less than $10.

My three best-performing articles also happen to be the three that are boosted by Medium. "Boost" is a program where Medium hand-picks high-quality stories and weights those stories for extra distribution via the recommendation algorithm. According to Medium, "95% of Boosted stories get at least 500 extra views within two weeks". You can read more about this program here.

Article earning histogram, data and image by the author

4. Member reads and whether boosted or not are key to earnings.

So what factors determine earnings? Medium has never revealed its formula, but it has shared some key factors in its help center article. Here is my take from analyzing my (small sample of) earnings data. The two factors that influence earnings the most are:

  1. Whether your article is boosted or not. In the help article, Medium also says there is "a multiplier of engagement points when the story is Boosted." As you can see in my chart below, earnings from the boosted articles are clear outliers compared to the non-boosted articles.
  2. Number of member reads. It is not surprising that the more reads you get from the Medium members, the higher your earnings will be. When I separated boosted vs. not boosted articles, I found a strong positive correlation between member reads and earnings. And please note that this is member reads – unfortunately reads from non-members don’t matter according to the help article.
Correlation between member reads, boosts and earnings, data and image by the author

Here are the fitted regression formulas:

  1. Boosted articles: Earnings = 0.28 * member reads – 43

    • R-squared = 0.998
    • P-value = 0.029
    • But please note that I only have 3 data points haha!
  2. Not Boosted articles: Earnings = 0.027 * member reads + 2.1

    • R-squared = 0.965
    • P-value < 0.001
    • Sample size = 15

The slope for boosted articles is 10x that of non-boosted ones. In other words, when your article is boosted, you earn 10x 💰 .
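If you want to run the same check on your own stats, here is a minimal sketch with scipy that fits earnings against member reads separately for boosted and non-boosted articles. The numbers in the toy DataFrame are made-up placeholders purely to make the snippet runnable, not my real stats; plug in your own per-article totals.

    import pandas as pd
    from scipy.stats import linregress

    # One row per article; the values below are made-up placeholders for illustration.
    articles = pd.DataFrame({
        "member_reads": [4200, 3100, 2600, 900, 700, 450, 300, 150],
        "earnings":     [1150,  830,  690,  26,  21,  14,   9,   6],
        "boosted":      [True, True, True, False, False, False, False, False],
    })

    # Fit a separate simple linear regression for boosted vs. non-boosted articles
    for boosted, group in articles.groupby("boosted"):
        fit = linregress(group["member_reads"], group["earnings"])
        print(
            f"boosted={boosted}: earnings ~ {fit.slope:.3f} * member reads + {fit.intercept:.1f} "
            f"(R^2={fit.rvalue ** 2:.3f}, p={fit.pvalue:.3g}, n={len(group)})"
        )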

Medium says reading time and engagement like claps, highlights, and responses also impact earnings. However, my articles are mostly between 7 and 10 minutes long, so the reading time probably doesn't vary too much (and the data is not available to me). As for the engagement metrics, they all appear to be highly correlated with member reads. Therefore, member reads alone already has strong predictive power in my case.

Eventually, when I get a significantly larger dataset one day, I plan to run a more rigorous regression analysis with all the metrics I have access to. But please let me know if my findings match your Medium article stats 🙂


Data-driven Medium Writing Strategy

What can we learn from the analysis above? Here are my data-driven recommendations on Medium Writing:

  1. Write regularly to build your audience: Earning is highly correlated with member reads. What is the best way to increase member reads? To build your audience. Every new article has a chance to attract more followers, and if your articles show up on someone’s homepage often, they have a higher chance to follow you and read your future articles.
  2. Quality over quantity: I’ve seen people recommending posting articles every day. But that is not what I mean by writing regularly. I believe fully polishing an article on a topic that you are really into is the way to engage your audience and improve the read ratio. It also increases your chance of getting "boosted". (And honestly, I am not the type of creative person who can come up with new writing ideas every day…)
  3. Submit to publications. Publications like Towards Data Science have established subscriber bases and distribute accepted articles across various channels like emails, LinkedIn posts, Twitter (I mean… X), etc. This means your article will reach a much wider audience than if you just let Medium's recommendation algorithm work its magic or shared it on your own social media. This is particularly important for new writers. Additionally, only publication editors can nominate your article for a "Boost" (read more here). So this also gives you a higher chance to earn more money.
  4. Optimize your title and opening. A ‘read’ counts when someone reads your story for at least 30 seconds. What can people see in 30 seconds? That’s probably only enough time to read the title and subtitle and skim through the first paragraph. Therefore, you should try to optimize the first impression to grab the reader’s interest. This is the same reason why companies do SEO and marketing email optimization. I would like to A/B test my titles if I could, but unfortunately, that is not doable on Medium. So I am also learning by trial and error now.
  5. Create content that is ‘you’. Among my past articles, the ones that perform best are always the ones with more personal touches. Even for technical topics like SQL optimization, I included my personal experiences and examples. Essentially, your content shouldn’t be something that ChatGPT is able to create by itself.

I hope this article gives you more insights into writing on Medium (especially in the data science domain) and inspires you to embark on a similar journey.

If you have enjoyed this article, please follow me and check out my other articles on data science, analytics, and AI. 🙂

The post My Medium Journey as a Data Scientist: 6 Months, 18 Articles, and 3,000 Followers appeared first on Towards Data Science.

]]>
From Insights to Impact: Presentation Skills Every Data Scientist Needs https://towardsdatascience.com/from-insights-to-impact-presentation-skills-every-data-scientist-needs-045945a681f2/ Mon, 28 Oct 2024 23:29:33 +0000 https://towardsdatascience.com/from-insights-to-impact-presentation-skills-every-data-scientist-needs-045945a681f2/ How to Structure, Design, and Deliver Data Presentations That Win Over Stakeholders

The post From Insights to Impact: Presentation Skills Every Data Scientist Needs appeared first on Towards Data Science.

]]>
How to structure, design, and deliver data presentations that win over stakeholders

Being a data scientist today is more than just a technical role. It has evolved into a highly cross-functional job, as you need to explain your data insights and sell your ideas to stakeholders to drive real business impact. Therefore, to be a successful data scientist, presentation is an essential skill – many times using old-school PowerPoint or Google Slides. However, it is a skill that entry-level data scientists and analysts, who focus heavily on technical skills, often lack.

In this article, I will share the presentation format that I have found most effective across the numerous presentations I have given during my seven years as a data scientist. I will also walk you through slide examples, which I created using fake data for illustration purposes. Hopefully, this will help you improve your presentation skills and advance your data career.

Please note that this article focuses on the presentation step, assuming you already have a solid analysis or model 🙂

Image created by DALL·E

Part I – Context

To kick off your presentation, you should start with the project context to set the stage and align expectations, in case not everyone in the audience is already aware of the business problem. This section should be concise, with a maximum of 2 slides. It should cover:

  1. Goal: What business question are you trying to solve? Is it analyzing the outcome of an experiment or building a machine learning model to predict customer churn? Does it come from a recent metrics trend or follow a product change?
  2. Supplementary Context: You should also include other key information about the project. For example, if you are presenting an experiment analysis, you should briefly talk about the experiment design and target population; for a machine learning model, you should provide a quick overview of the training data.

Let’s say you are a data scientist at a subscription-based B2C company (for example, Netflix, HelloFresh, The Economist, etc.). You have conducted an analysis to identify the drivers of subscriber churn and you are going to present your findings. Here is a sample slide that effectively introduces the analysis context:

Example context slide (image and fake data created by author for illustration purpose)

Part II – Key Insights and Recommendations

This might not be a conventional idea, but I recommend sharing your key insights and recommendations before diving into the analysis details. Why? This helps you capture the attention of your audience as the decision-makers are most interested in the actions. It also makes it easier for the audience to connect the dots later when you go through the detailed analysis.

Here are some tips to make this section effective:

  1. Pick the 3 to 5 most important insights. Make sure to include data points in your key insights to increase credibility.
  2. Provide actionable recommendations. The recommendations should be directly linked to the key insights, and be logical and practical for the stakeholders. If possible, add the opportunity sizing of your recommendations. This will help your stakeholders understand the potential impact and prioritize the ideas.

Here are the example slides (with made-up data for illustration purposes). I like to use color and bold fonts to highlight the key data points and messages.

Example key insights and recommendation slides (image and fake data created by author for illustration purpose)

Part III – Detailed Analysis

Next, it's time to present your analytical process. Your analysis should serve as supporting evidence for the key insights and recommendations you shared earlier. Therefore, you should be selective instead of overwhelming your audience with every analysis detail. The slides should also form a logical storyline that connects your findings.

What makes good analysis slides?

  1. Summarize insight in the slide title: Instead of using generic keywords like "trend analysis" or "churn rate by tenure" as the slide title, I recommend using a one-sentence summary, such as "churn rate is 15% higher for mobile users". This guides focus and improves the storytelling efficiency.
  2. Limit text on the slides: Don’t dump too much text on the slides. It’s hard for your audience to read a long paragraph while listening to what you are talking about.
  3. Polish your visualizations: On the other hand, visualizations are a better way to explain your insights intuitively. Effective visualizations should be simple and uncluttered, use easy-to-follow chart types, have appropriate (and large enough) labels, and follow a consistent design style.

As shown in the slide example below, the slide title summarizes the finding, followed by a short explanation and a clear chart supporting it.

Example analysis slide (image and fake data created by author for illustration purpose)

Part IV – Summary Slides

After going through your detailed analysis, I recommend wrapping up your presentation by repeating the key insights and recommendations. This might sound redundant, but it reinforces your conclusions and sets the stage for Q&As. You can simply copy and paste the slides at the beginning.

Part V – Appendix

The above four sections cover all the slides you want to go through during the presentation. However, it is always a good idea to prepare additional slides in the appendix to cover the following topics:

  1. Answers for the anticipated Q&A topics – It is not uncommon to foresee certain questions coming up during the Q&A, especially when you have done a mock presentation with your teammates. Preparing slides to address those questions can help you feel less nervous during the presentation and make you look more prepared in front of the stakeholders.
  2. Technical details – If your audience includes both technical and non-technical stakeholders, you might want to include the technical details in the appendix, in case someone wants to learn more about your methodology.
  3. More supporting data and visualizations – Given the time limit, there might be more interesting data points that you could not cover in the presentation. You can also save them in the appendix for further reading.
  4. Important caveats of this project – If there are any important caveats of the project, for example, a certain data point is not available or a dataset is not accessible, you should explain them in the appendix.

For example, I would include a slide like this in the appendix to show the detailed result of a regression analysis.

Example appendix slide (image and fake data created by author for illustration purpose)

Summary

As a rule of thumb, for a 30-minute presentation (including 5–10 minutes reserved for Q&A), your deck should not exceed 12 slides (excluding the appendix), which include:

  • Context: 1–2 slides
  • Key insights and recommendations: 1–2 slides
  • Detailed analysis: 5–7 slides
  • Summary: 1–2 slides

And of course, presentation is more than preparing a well-structured deck. Here are some bonus tips for those who have made it this far:

  1. Rehearse your presentation: If you are not confident enough with your presentation skills or you have an upcoming presentation with senior leadership, it is always a good idea to rehearse your presentation. You should get yourself 100% familiar with the content on each slide, and practice your flow, tone, pace, and timing. The more you practice, the more natural your presentation will be. You can also ask your manager or teammates for feedback.
  2. Maintain eye contact: During the presentation, be sure to engage your audience by maintaining eye contact with them, especially when you are delivering your key points. This helps you evaluate if they resonate with your message or if they need any clarification on certain topics. It also makes your presentation more persuasive.
  3. Handle Q&A strategically: Q&A can be stressful. Though I said above that you could anticipate some questions and prepare answers in the appendix, there will always be unexpected questions coming up. When that happens, if you have any ideas in mind, don’t be shy and share your thoughts confidently. Meanwhile, it is also totally okay to admit when you don’t know – you can always say "That’s a great question. I will look into it further and follow up with you."

A successful presentation is the key to converting your data insights into real business impact. I hope you have found the above advice helpful and feel more confident in your presentation skills. If you have any additional suggestions, please share them in the comments below!


If you have enjoyed this article, please follow me and check out my other articles on Data Science, analytics, and AI.

Top 5 Principles for Building User-Friendly Data Tables

Beyond Line and Bar Charts: 7 Less Common But Powerful Visualization Types

Seven Common Causes of Data Leakage in Machine Learning

The post From Insights to Impact: Presentation Skills Every Data Scientist Needs appeared first on Towards Data Science.

]]>
Top 5 Principles for Building User-Friendly Data Tables https://towardsdatascience.com/top-5-principles-for-building-user-friendly-data-tables-0dfe168cecc1/ Sun, 13 Oct 2024 11:02:42 +0000 https://towardsdatascience.com/top-5-principles-for-building-user-friendly-data-tables-0dfe168cecc1/ Designing intuitive and reliable tables that your data team will love

The post Top 5 Principles for Building User-Friendly Data Tables appeared first on Towards Data Science.

]]>
Having worked in data science and analytics for seven years, I have created and queried many tables. There have been numerous times when I wondered, "What does this column mean?" "Why are there two columns with the same name in table A and table B? Which one should I use?" "What is the granularity of this table?" etc.

If you’ve faced the same frustration, this article is for you!

In this article, I will share five principles that will help you create tables that your colleagues will appreciate. Please note that this is written from the perspective of a data scientist. Therefore, it will not cover the traditional database design best practices but focus on the strategies to make user-friendly tables.

Image created by DALL·E

I. Single Source of Truth

Maintaining a single source of truth for each key data point or metric is very important for reporting and analysis. There should not be any repeated logic in multiple tables.

For convenience, we sometimes compute the same metric in multiple tables. For example, the Gross Merchandise Value (GMV) calculation might exist in the customer table, the monthly financial report table, the merchant table, and so on. These columns are then referenced in more downstream tables with even more variations. Over time, they can diverge. Everything worked fine, until one day a stakeholder came to us asking, "Why is the monthly GMV in this dashboard different from the other? Is your data wrong?" When we dug into layers and layers of the data pipeline, we realized that last quarter we had agreed to only include cleared transactions in GMV, but forgot to make the change in every table… This hurts stakeholders' trust, and we end up spending more and more time investigating and maintaining the complex data model.

So how do we solve this? Maintain a single version of the GMV calculation in a dedicated table. Then every other table that needs the GMV metric should reference this table instead of creating its own calculation logic.

DON’Ts:

  • Keep multiple versions of the same calculation across different tables.

DOs:

  • Maintain one version of each key metric.
  • Reference that source-of-truth table in downstream tables, rather than duplicating the logic everywhere.

II. Consistent Granularity

If a table is on the daily level, keep it daily. Mixing data with different granularities in a single table can lead to confusion and data integrity issues.

Here is a real example of mine: I once created a table to report Operations Service Level Agreement (SLA) performance – we had SLAs defined for different Operations workflows and wanted to track how often we could meet them. The SLA was defined at two levels: touchpoint level (each interaction) and case level (the entire process) – a case can involve back-and-forth and be touched by multiple people on the Operations team. We want each touchpoint to be completed within a specific time limit and the whole case to be solved within a time range. I thought it would be much easier to dump one table into BI tools for reporting, so I created a table mixing the two granularities. It had a column called sla_level with two values, touchpoint and case. I was hoping people would filter on a specific SLA level to get the metrics they needed.

Why is this problematic? Not everyone had the context I just explained. They often did not include this filter in their reporting and ended up double-counting cases, since each case would show up at both the case level and the touchpoint level.

What should I have done instead? Create two tables, one on the touchpoint level, and another on the case level, explain their differences clearly in the table documentation, and only use the appropriate one for different use cases.

DON’Ts:

  • Mix rows or columns with different data granularities in the same table. This could lead to misuse of your data and incorrect analysis.

DOs:

  • Only have one type of granularity in each table.

III. Descriptive Naming Conventions

I have to admit that I have created temp tables called txn_temp, txn_temp_v2, txn_subset_v3, txn_final 🙂 It's okay to have them in your exploratory code (as long as you can remember what they stand for when you look back…). But if you are going to share them with your colleagues or use them in an official data pipeline, you should give them more intuitive names. For example, txn_subset_v3 should be something like refund_transactions, which clearly explains what the subset is and spells out transactions instead of txn.

The same principle applies to column names. This just happened to me yesterday – I was querying the monthly number of cases created, and I ran SELECT DATE_TRUNC('month', created_at) AS created_month, COUNT(DISTINCT case_id) FROM cases GROUP BY 1 ORDER BY 1. Then I got a 'created_at' does not exist error… Taking a closer look, I realized the timestamp column was named created_date, though it was actually of the datetime type. What makes it worse is that similar columns could be created_at or created_time or created_date in different tables.

Another example is that we sometimes add a column created by the row_number() window function to easily query the first or last occurrence within a partition of the table. I have seen multiple times that people just leave the column name as rn. emmm… Right Now? Real Numbers? Registered Nurse? Even if you know it stands for row_number, you will still need to look up the underlying SQL code to understand what the partition is and in what order it is sorted. Therefore, if you do need to keep this column in the final table, please make it descriptive, like row_num_by_customer_order_date_asc. This is a much longer name, but in my opinion, longer names are better than short but confusing names.

It would also be helpful to have a consistent naming convention across tables. For example, all the tables on the customer level could be named with the customer_ prefix, so you have customer_demographics, customer_status, customer_signup_information, etc.

DON’Ts:

  • Tables called xxx_temp, xxx_v1, xxx_final.
  • Unconventional/confusing abbreviations in table or column names.
  • Tables on the same level and topic but named very differently (e.g. customer_demographics and status_of_customer).

DOs:

  • Clear, descriptive, and intuitive table and column names – longer names are better than short but confusing names.
  • Consistent naming conventions across tables.

IV. Appropriate Data Type and Format

When it comes to intuitive tables, it’s not just about descriptive table and column names, but also about appropriate data types and formats.

Let me give you some examples:

  1. Assume that we have country fields in different tables, such as the country where a customer is located, the merchant country of a transaction, the billing address country, etc. However, some of them use the full country name (United States), some use the two-digit country code (US), and some are in lower case (us). Then later, you are going to build a fraud detection model, and you want to create a feature that checks whether the merchant country is the same as the customer's billing address country. Ideally, this can be done with a simple join, but in this case, you will have to check whether all the country values in the two columns are consistent. As a result, you will need to clean up a bunch of values and find a mapping from country names to two-digit country codes to join the two tables. This is a very error-prone process.
  2. We all love binary columns – they are very handy when we need to filter on some common segments. However, I've seen them in both TRUE/FALSE and 1/0 formats here and there. What's worse? Sometimes the 1/0 is of integer type, and sometimes it is a string… It is clearly much better to keep them all consistent in the TRUE/FALSE format.
  3. All monetary value fields like transaction_amount, delivery_fee, or product_price should be in a consistent unit as well. If some of them are in dollars while others are in cents, you can easily calculate profits that are off by 100x or declare a profitable deal unprofitable.
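To make the first two examples above concrete, here is a minimal pandas sketch of the kind of normalization I have in mind. The column names and the country mapping are illustrative only.

    import pandas as pd

    transactions = pd.DataFrame({
        "merchant_country": ["United States", "us", "DE"],   # mixed names and codes
        "is_refunded": ["1", 0, "TRUE"],                     # mixed strings, ints, booleans
    })

    # Standardize every country field to upper-case two-letter codes
    country_to_iso2 = {"UNITED STATES": "US", "GERMANY": "DE"}  # illustrative mapping
    transactions["merchant_country"] = (
        transactions["merchant_country"].str.strip().str.upper().replace(country_to_iso2)
    )

    # Standardize binary columns to real TRUE/FALSE booleans
    transactions["is_refunded"] = (
        transactions["is_refunded"].astype(str).str.upper().isin(["1", "TRUE"])
    )

    print(transactions.dtypes)

Doing this cleanup once in the upstream table means every downstream join and filter can rely on the same codes and boolean types.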

DON’Ts:

  • Have different data types or formats for columns that are similar.

DOs:

  • Use appropriate data types. For example, always use TRUE/FALSE boolean values for binary columns; xx_time columns should always be in datetime format, while xx_date columns should be in date format.
  • Keep consistent formats. For example, all the country fields are in two-digit country codes, and all the money-related fields are in dollars.

V. Complete Documentation

Last but not least, a good table always comes with complete table documentation. Good documentation ensures that everyone using the table will understand what it's for and use it appropriately.

What makes table documentation "complete" is a separate topic that might be worth its own article. But here are the key components in my opinion:

  1. Table description: What is the purpose and data topic of this table? When should people use this table? What is the granularity of the table? Any caveats that users should keep in mind?
  2. Column documentation: The data type and meaning of each column. If a column has a complex calculation logic, it is also important to explain that logic.
  3. Data source: Where does the data come from? In the ideal state, data users should be able to explore the data lineage easily to understand the upstream tables and columns. This helps a lot when we need to troubleshoot a data issue or a metric change.
  4. Frequency and freshness: How often is the table updated and when was it last refreshed?

DON’Ts:

  • Create a table with no documentation or incomplete information;
  • Let the documentation go stale when you update a table's logic.

DOs:

  • Always maintain complete and up-to-date table documentation.

These are my top five principles for making a table that is friendly to your data colleagues. Follow them, and data investigations will be faster, analyses will be easier, and models will be more accurate.

Is there anything else you would like to add to the list? Please comment below!


If you have enjoyed this article, please follow me and check out my other articles on Data Science, analytics, and AI.

Beyond Line and Bar Charts: 7 Less Common But Powerful Visualization Types

Seven Common Causes of Data Leakage in Machine Learning

Mastering SQL Optimization: From Functional to Efficient Queries

The post Top 5 Principles for Building User-Friendly Data Tables appeared first on Towards Data Science.

]]>
Beyond Line and Bar Charts: 7 Less Common But Powerful Visualization Types https://towardsdatascience.com/beyond-line-and-bar-charts-7-less-common-but-powerful-visualization-types-0503fbaa4131/ Thu, 26 Sep 2024 07:28:12 +0000 https://towardsdatascience.com/beyond-line-and-bar-charts-7-less-common-but-powerful-visualization-types-0503fbaa4131/ Step up your data storytelling game with these creative and insightful visualizations

The post Beyond Line and Bar Charts: 7 Less Common But Powerful Visualization Types appeared first on Towards Data Science.

]]>
In a previous article, I shared my journey of creating one visualization every week since 2018 – I have 350+ of them now on my Tableau Public profile! Not surprisingly, among all the visualization types, I have used bar charts and line charts the most. They are simple but highly effective and intuitive in telling stories. However, I sometimes feel bored making similar charts over and over again, and they could fail to demonstrate complex data patterns.

In this article, I will introduce seven less common but powerful visualization types. I will talk about their specific use cases, share my visualization examples, discuss their strengths and weaknesses, and show you how to create them in visualization tools like Tableau.

Image created by DALL·E

1. Bump Chart: Tracking Ranks Over Time

A bump chart is a special type of line chart that visualizes the change in rank of multiple categories over time. Its y-axis shows the category ranks, so it illustrates how the categories "bump" each other up and down over time. Therefore, it is perfect for demonstrating the competition between categories.

Limitations: Since it only keeps the rank information and overlooks the underlying values, it doesn't show the scale of the differences between categories or the value change of each category over time. Therefore, the choice depends on the story you want to tell – if your main goal is to emphasize the relative rank changes among certain categories, it is the perfect visualization type for you; but if you also want to show how the gap between two categories is growing or shrinking over time, a traditional line chart might be better.

My Bump Chart Example

Example: In the above visualization, I plotted the rank of U.S. credit card issuers based on their purchase volume by year. You can clearly see that while American Express used to be in 1st place, Chase has surpassed it since 2020. Meanwhile, Citi stayed in 3rd place throughout 2016 to 2021. However, as mentioned earlier, this chart does not tell us how large the difference between Chase and American Express is, or whether the overall market is expanding or shrinking.

How to make it:

  1. Set the Time Dimension: Place the time dimension (e.g. years, months) on the x-axis.
  2. Calculate Ranks: For each category at every time point, calculate the rank based on the metric of interest (purchase volume in my example). In Tableau, this can be done using the RANK() function. Please make sure that you set up the calculation correctly by choosing the appropriate dimension in the Compute Using settings.
  3. Plot the Lines: Use a line chart to connect the ranks of each category over time.
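If you would rather sketch a bump chart in Python, the same three steps translate into a few lines of pandas and matplotlib. The issuer names and volumes below are made up purely for illustration.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Made-up purchase volumes (rows = years, columns = issuers), for illustration only
    volume = pd.DataFrame(
        {"Issuer A": [100, 120, 150], "Issuer B": [130, 125, 140], "Issuer C": [90, 95, 100]},
        index=[2019, 2020, 2021],
    )

    # Step 2: rank within each year (rank 1 = largest volume)
    ranks = volume.rank(axis=1, ascending=False)

    # Step 3: connect each issuer's rank over time
    for issuer in ranks.columns:
        plt.plot(ranks.index, ranks[issuer], marker="o", label=issuer)

    plt.gca().invert_yaxis()                       # put rank 1 at the top
    plt.yticks(range(1, len(ranks.columns) + 1))
    plt.xticks(ranks.index)
    plt.legend()
    plt.show()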

2. Gantt Plot: Visualizing Schedules and Flows

A Gantt Plot is a special type of bar chart. People often use it in project management to visualize task schedules. You might have seen it in tools like JIRA. Unlike common bar charts, bars in a Gantt Plot do not start at the same value. Instead, the bars start at the event start time, and their lengths represent the duration of the event. You can also use it to show value changes (instead of duration) alongside a series of events or categories.

Limitations: A Gantt plot contains rich information as it shows both the schedules and durations in one chart. However, the fact that not all bars start from the same value makes it less intuitive to compare the duration across the events or categories. If you need to emphasize the duration differences, it might not be the best choice.

My Gantt Plot Example

Example: My visualization above shows the Caltrain passenger flow at each station during its peak time. In this Gantt Plot, I do not visualize the time schedule but show the passenger volume changes. The red bars represent how many passengers get on the train, while the grey bars represent the number of passengers getting off. For example, during the morning peak hours, on the northbound trains, we see more passengers board at stations like San Jose and Sunnyvale, while more people get off than get on at the Palo Alto Station.

How to Make It:

  1. Define the Two Axes: You should put the numeric variable (e.g. dates, volume) on the X-axis, and have the categories (e.g. tasks, events) listed in sequence on the Y-axis.
  2. Plot Bars to Show Duration or Value Changes: For each category, decide the starting point, then draw a bar with the length representing the duration/change size.

The Gantt plot is a chart type natively supported by Tableau. Drag the measure that represents the starting value to Columns, and put the change-size measure on the Size card. Here is an official tutorial.
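Outside of Tableau, a quick way to sketch a schedule-style Gantt plot is plotly.express.timeline. The tasks and dates below are made up for illustration:

    import pandas as pd
    import plotly.express as px

    # Made-up tasks purely to illustrate the start/end structure of a Gantt plot
    tasks = pd.DataFrame({
        "task":  ["Collect data", "Build model", "Write report"],
        "start": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-20"]),
        "end":   pd.to_datetime(["2024-01-10", "2024-01-22", "2024-01-28"]),
    })

    fig = px.timeline(tasks, x_start="start", x_end="end", y="task")
    fig.update_yaxes(autorange="reversed")  # list tasks from top to bottom
    fig.show()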


3. Radar Chart: Visualizing Multi-Dimensional Data

A radar chart is also known as a spider chart or web chart, given its shape. It no longer has two axes, but multiple axes depending on how many variables your dataset has. Therefore, it is a great fit for visualizing multivariate data – it allows you to compare multiple aspects of several records/items at the same time.

Limitations: While radar charts are helpful in visualizing multiple dimensions, that does not mean you can include as many axes as you want. When there are too many dimensions, the chart will get very busy, making it hard for the audience to digest useful information quickly.

My Radar Chart Example

Example: In the visualization above, I compared the nutrition profiles of multiple McDonald’s menu items using a radar chart. Each axis of the chart represents a common nutrient like Carbs or Protein, and each colored curve is one menu item. For example, you can see that the Chocolate Triple Thick Shake is extremely high in Total Carbs and Sugars, while the Hamburger is more balanced across the nutrients. I also normalized the dimensions by the maximum value, so they all start with 0% and end with 100% (of maximum), as shown by the grey reference lines.

How to Make It:

  1. Define the Axes: Each axis in the Radar Chart should represent a variable that you want to compare. Generally speaking, you should keep the number of axes in the range between 3 to 8.
  2. Normalize the Scale: It is important to make sure all variables are on a similar scale or normalized. This will make the comparison more intuitive and less biased by different scales. For example, you can adjust the axis scale to 0%-100% of its maximum as I did above.
  3. Plot the Data: For each item, plot points on each axis based on its value for that variable, then connect the points to form a polygon.

Specifically, if you want to make a Radar Chart in Tableau, you can calculate the x and y coordinates using calculated fields with some simple trigonometric functions. Assuming you have X variables (and therefore X axes), each with a unique index from 0 to X-1, and the value for each variable stored in a value field:

  • angle = [index] / X * (2 * pi()): this creates a new field angle, representing the angle to the x-axis of each variable on the chart.
  • X-coordinate = [value] * COS([angle]) and Y-coordinate = [value] * SIN([angle]) : These two fields calculate the coordinates of an item on a specific variable axis.

You can read more in this article.
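The same trick works in Python. Here is a minimal matplotlib sketch on a polar axis – the axis labels and the (already normalized) values are made up for illustration:

    import numpy as np
    import matplotlib.pyplot as plt

    # Illustrative axes and values, already normalized to 0-1 (0%-100% of the axis maximum)
    labels = ["Calories", "Total Carbs", "Sugars", "Protein", "Total Fat"]
    values = [0.45, 0.35, 0.25, 0.40, 0.30]   # made-up numbers for one item

    n = len(labels)
    # Same idea as the Tableau field: angle_i = i / n * 2 * pi
    angles = [i / n * 2 * np.pi for i in range(n)]
    angles += angles[:1]                       # repeat the first angle to close the polygon
    values = values + values[:1]

    ax = plt.subplot(polar=True)
    ax.plot(angles, values)
    ax.fill(angles, values, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(labels)
    ax.set_ylim(0, 1)
    plt.show()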


4. Rose Chart: Unfolding Data in Circular Form

A rose chart, also known as a polar area chart or wind rose, is a circular visualization. It indeed looks like a rose – each of its "petals" is a data category, and the radius of the "petals" represents the category value. It might look like a pie chart at first glance, but each segment of the rose chart has the same angle but a different radius. Rose charts work best when you want to show the proportional size of each segment and the distribution within it.

Limitations: Rose charts can be difficult to read when there are too many categories or when the differences between the categories are too small. They are also not as intuitive as stacked bar charts for audiences who are unfamiliar with this type of visualization.

My Rose Chart Example

Example: My visualization above compares the cost of a night out in various cities for a date night for two people and a party night for one person. Each rose chart represents a breakdown of expenses for various activities. If you open it in Tableau, you can hover over each segment to see the city name, segment, and costs.

How to Make It:

  1. Define the Categories: Determine the categories and allocate them in angular sectors ("petals") around the circle with the same angle.
  2. Calculate the Radius: The radius of each category represents the value for that category. If you have multiple segments within each category, you should place them outward from the center, with their radius proportional to their values.

Making it in Tableau can be pretty tricky, as you need to use a Polygon chart with a set of point coordinates to plot a smooth curve. You can see in my example above that I only used minimal points (the four corner points) for each segment, so the outer shape looks straighter than it ideally should. You can read more about the detailed methodology in this article. Alternatively, you can make it more easily in Python with packages like Plotly.
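For reference, here is what a minimal Plotly version could look like, using go.Barpolar so that every "petal" spans the same angle while its radius encodes the value. The cities and costs are made up for illustration:

    import plotly.graph_objects as go

    # Made-up night-out costs, one equal-angle petal per city
    cities = ["City A", "City B", "City C", "City D"]
    costs = [120, 95, 150, 80]

    fig = go.Figure(
        go.Barpolar(
            r=costs,                       # petal length encodes the cost
            theta=[45, 135, 225, 315],     # equal-angle petal centers
            width=[90, 90, 90, 90],        # each petal spans 90 degrees
            hovertext=cities,
            marker_color=["#636efa", "#ef553b", "#00cc96", "#ab63fa"],
        )
    )
    fig.update_layout(showlegend=False)
    fig.show()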


5. Circular Bar Plot: Adding a Twist to Traditional Bar Charts

A circular bar plot is a variation of the bar chart. While traditional bar charts are plotted on a Cartesian coordinate system as straight bars, a circular bar plot uses a polar coordinate system and wraps the bars around a circle. It provides a visually attractive way to show the differences across categories.

Limitations: Circular bar plots can be challenging to read when there are many categories. The curvature of the bars can distort perception (just as with the pie chart), making it harder to judge the actual lengths compared to traditional bar charts.

My Circular Bar Plot Example

Example: The visualization above shows Amazon Prime Day’s Gross Merchandise Sales from 2015 to 2022. In this chart, each circular bar represents the total sales on that year’s Prime Day. The circular layout also mimics the shape of Amazon’s logo. It might be hard to compare the actual sales without the labels, but it creates a visually appealing way to show the exponential sales growth.

How to Make It:

  1. Determine the Categories and Values: Similar to a classic bar chart, you only need two values here— categories and their sizes.
  2. Plot the Circular Bars: Each circular bar will represent a category. They should start on the same axis around the same central point. The end angle of each bar should be proportional to its value.

In Tableau, you can achieve this by creating bins to plot each circular bar as a series of points. You can follow a detailed guide here. Alternatively, you can use existing Tableau extensions like this one.


6. Ridgeline Plot: Revealing Trends Across Multiple Distributions

A ridgeline plot is also known as a joyplot. It displays the distribution of multiple categories over time or another continuous dimension. Each category is represented by a density plot or an area chart, and the plots are layered one above another (usually in a meaningful sequence). This special layout makes the chart look like mountain ridgelines. This visualization type is especially useful for showing how one category catches up with and replaces another. One of its most popular use cases is visualizing Google Search Trends.

Limitations: If the distribution of each "ridge" is not unimodal, or when there are too many categories competing with each other, ridgeline plots can be hard to digest.

My Ridgeline Plot Example

Example: My visualization above shows the evolution of U.S. recorded music revenues by format from 1970 to 2025. Each ridgeline represents the revenue of a different music format over time. The plot shows how the popularity of different formats has shifted. For example, the CD was the dominant music distribution format between 1990 and 2010, but it has since been replaced by paid music subscriptions.

How to Make It:

  1. Create Individual Area Charts: For each category, plot an area chart (or density curve or histogram) to represent its value over time (or another numeric dimension).
  2. Stack the Plots: Stack the plots vertically, one above another. You should be careful with the sequence of the categories to show the data insights. In my example above, I manually adjusted the order of the music formats to show the evolution of popularity over time.

To make this in Tableau, you can simply use the Area Chart to plot the trend of each category over time, then put the categories on the Rows to stack them.
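If you prefer Python, the stacking step is just an offset per category. Here is a minimal matplotlib sketch, with made-up smooth curves standing in for the revenue of each format:

    import numpy as np
    import matplotlib.pyplot as plt

    # Made-up "revenue" curves purely to illustrate the layering idea
    years = np.arange(1990, 2026)
    formats = {
        "CD": np.exp(-((years - 2000) ** 2) / 50),
        "Download": np.exp(-((years - 2010) ** 2) / 40),
        "Streaming": np.exp(-((years - 2022) ** 2) / 60),
    }

    fig, ax = plt.subplots()
    offset_step = 0.7                       # vertical spacing between ridges
    for i, (name, values) in enumerate(formats.items()):
        baseline = -i * offset_step         # each category sits a bit lower than the previous one
        ax.fill_between(years, baseline, baseline + values, alpha=0.8, zorder=-i)
        ax.text(years[0], baseline + 0.05, name, ha="left", va="bottom")

    ax.set_yticks([])                       # the y-axis is layout only, not a value scale
    ax.set_xlabel("Year")
    plt.show()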


7. Sankey Diagram: Visualizing Flows and Relationships

A Sankey diagram is a flow chart to show the movement or distribution of resources (e.g. money, energy, people) between different stages or categories. The width of each flow represents the magnitude of the movement. Therefore, it is a powerful visualization type to demonstrate complex changes.

Limitations: Sankey diagrams can become overwhelming to interpret when there are too many categories or flows. They are best used when you want to highlight a few key flows or relationships.

My Sankey Diagram Example

Example: The Sankey Diagram above visualizes the housing situation of participants in the Australia Homelessness Services from 2017 to 2018. It uses different colors to represent the different housing situations. The colors of the flows correspond to their start situations. It shows that fewer participants were couch surfers or in short-term temporary accommodation after the program ended. However, there was an 8% unknown status in the end situation, preventing us from drawing more reliable conclusions.

How to Make It:

  1. Identify the Nodes: Nodes in a Sankey Diagram are the starting and end points, and the intermediate stages (if any). They are the essential components of a Sankey Diagram.
  2. Plot the Flows: Flows connect the nodes. To define a flow, we need to know: 1. What are the nodes it connects, and 2. The size of the flow (percentage of people in my example).

In Tableau, it is very complicated to create a Sankey Diagram manually, as you will need to create all the points that will be used to draw the curves. This usually needs to be done by creating bins and calculating coordinates with the Sigmoid function. This article provides a clear step-by-step guide. However, Tableau offers an official extension, which you can use to build a Sankey Diagram much more easily.

Plotly also offers an easy way to make a Sankey Diagram in Python.
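Here is a minimal sketch of that Plotly approach – the node labels and flow sizes below are made up for illustration, not the numbers from my visualization:

    import plotly.graph_objects as go

    # Made-up nodes and flows, just to show the node/link structure
    labels = ["Couch surfing (start)", "Short-term housing (start)",
              "Stable housing (end)", "Unknown (end)"]

    fig = go.Figure(
        go.Sankey(
            node=dict(label=labels, pad=20, thickness=15),
            link=dict(
                source=[0, 0, 1, 1],    # index into labels for where each flow starts
                target=[2, 3, 2, 3],    # index into labels for where each flow ends
                value=[30, 5, 50, 15],  # flow size, e.g. % of participants
            ),
        )
    )
    fig.show()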


Conclusion

Each of the above visualization types has its unique strengths (and weaknesses). Used appropriately, they can bring a fresh perspective to your data storytelling. Next time you want to visualize a dataset differently, consider stepping outside the classic bar and line charts and using one of these visualizations. It can help you highlight complex data patterns and make your analyses more engaging.

Do you know any other less common but useful visualization types? Please share your thoughts and experiences in the comments below!


If you have enjoyed this article, please follow me and check out my other articles on Data Science, analytics, and AI.

Seven Common Causes of Data Leakage in Machine Learning

330 Weeks of Data Visualizations: My Journey and Key Takeaways

Evaluating ChatGPT’s Data Analysis Improvements: Interactive Tables and Charts

The post Beyond Line and Bar Charts: 7 Less Common But Powerful Visualization Types appeared first on Towards Data Science.

]]>