2024: The Year of the Value-Driven Data Person

Growth at all costs has been replaced with a need to operate efficiently and be ROI-driven–data teams are no exception

It’s been a whirlwind few years if you’ve worked in tech.

  • VC funding declined by 72% from 2022 to 2023
  • New IPOs fell by 82% from 2021 to 2022
  • More than 150,000 tech workers were laid off in the US in 2023

2023 reality check. Source: Author, using publications such as TechCrunch, The Verge, and CNN

During the heyday that lasted until 2021, funding was easy to come by, and teams couldn’t grow fast enough. In 2022, growth at all costs was replaced with profitability goals. Budgets were no longer allocated based on finger-in-the-air goals but were heavily scrutinized by the CFO.

Data teams were not isolated from this. A 2023 survey by dbt found that 28% of data teams planned on reducing headcount.

Looking at the number of data roles at selected scale-ups, more have reduced headcount than expanded it since the start of last year.

The new reality for data teams

Data teams now find themselves at a crossroads.

On one hand, the ROI of data teams has historically been difficult to measure. On the other hand, AI is having its moment (according to a survey by MIT Technology Review, 81% of executives believe that AI will be a significant competitive advantage for their business). AI & ML projects often have clearer ROI cases, and data teams are at the center of this, with an increasing number of machine learning systems being powered by the data warehouse.

So, what are data people to do in 2024?

Below, I’ve looked into five steps you can take to make sure you’re well-positioned and aligned to business value if you work in a data role.

Ask your stakeholders for feedback

People like it when they get to share their opinions about you. It makes them feel listened to and gives you a chance to learn about your weak spots. You should lean into this and proactively ask your important stakeholders for feedback.

While you may not want to survey everyone in the company, you can create a group of the people most reliant on data, such as everyone in a senior role. Ask them to give candid, anonymous feedback on questions such as their happiness with self-serve, the quality of their dashboards, and whether there are enough data people in their area (this also gives you some ammunition before asking for headcount).

Source: Author

End with the question, "If you had a magic wand, what would you change?" to allow them to come up with open-ended suggestions.

Survey results–data about data teams’ data work. It doesn’t get better…

The marketing data team needs some scrutiny. Data foundations are not doing great, either. Source: Author

Be transparent with the survey results and play them back to stakeholders with a clear action plan for addressing areas that need improvement. If you run the survey every six months and put your money where your mouth is, you can come back and show meaningful improvements. Make sure to collect data about which business area the respondents work in. This will give you invaluable insights into where you have blind spots and whether there are specific pain points in business areas you didn’t know about.

Build a business case as if you were seeking VC funding

You can sit back and wait for stakeholder requests to come to you. But if you’re like most data people, you want to have a say in what projects you work on and may even have some ideas yourself.

Back in my days as a Data Analyst at Google, one of the business unit directors shared a wise piece of advice: "If you want me to buy into your project, present it to me as if you were a founder raising capital for your startup". This may sound like Silicon Valley talk, but he had some valid points when I dug into it.

  • Show what the total $ opportunity is and what % you expect to capture
  • Show that you’ve made an MVP to prove that it can be done
  • Show me the alternatives and why I should pick your idea

Example–ML model business case proposal summary

Source: Author

Business case proposals like the one above are presented to a few of the senior stakeholders in your area to get buy-in that you should spend your time here instead of on one of the hundreds of other things you could be doing. It gives them a transparent forum for becoming part of the project and being brought in from the get-go–and also a way to shoot down projects early where the opportunity is too small or the risk too big.

Projects such as a new ML model or a new project to create operational efficiencies are particularly well suited for this. But even if you’re asked to revamp a set of data models or build a new company-wide KPI dashboard, applying some of the same principles can make sense.

Take a holistic approach to cost reduction

When you think about cost, it’s easy to end up where you can’t see the forest for the trees. For example, it may sound impressive that a data analyst can shave off $5,000/month by optimizing some of the longest-running queries in dbt. But while these achievements shouldn’t be ignored, a more holistic approach to cost savings is helpful.

Start by asking yourself what all the costs of the data team consist of and what the implications of this are.

If you take a typical mid-sized data team in a scaleup, it’s not uncommon to see the three largest cost drivers disproportionately allocated like this:

  • Headcount (15 FTEs x $100,000): $1,500,000
  • Data warehouse costs (e.g., Snowflake, BigQuery): $150,000
  • Other data tooling (e.g., Looker, Fivetran): $100,000
Source: Author

This is not to say that you should immediately be focusing on headcount reduction, but if your cost distribution looks anything like the above, ask yourself questions like:

  • Should we have 2x FTEs build this in-house tool, or could we buy it instead?
  • Are there low-value initiatives where expensive headcount is tied up?
  • Are two weeks of work for a $5,000 cost saving a good return on investment?
  • Are there optimizations in the development workflow, such as the speed of CI/CD checks, that could free up time?

Balance speed and quality

I’ve seen teams get bogged down by having tens of thousands of dbt tests across thousands of data models. It’s hard to know which ones are important, and developing new data models takes twice as long because everything is scrutinized through the same lens.

On the other hand, teams who barely test their data pipelines and build data models that don’t follow solid data modeling principles too often find themselves slowed down and have to spend twice as much time cleaning up and fixing issues retrospectively.

The value-driven data person carefully balances speed and quality through:

  • A shared understanding of what the most important data assets are
  • Definitions of the level of testing that’s expected for data assets based on importance
  • Expectations and SLAs for issues based on severity

They also know that to be successful, their company needs to operate more like a speed boat and less like a tanker–taking quick turns as you learn through experiments what works and what doesn’t, reviewing progress every other week, and giving autonomy to each team to set their direction.

Data teams often operate under uncertainty (e.g., will this ML model work?). The faster you ship, the quicker you learn what works and what doesn’t. The best data people are always careful to keep this in mind and know where on the curve they fall.

For example, if you’re an ML engineer working on a model to decide which customers can sign up for a multi-billion dollar neobank, you can no longer get away with quick and dirty work. But if you’re working in a seed-stage startup where the entire backend may be redone in a few months, you know when to favor speed over quality.

Proactively share the impact of your work

People in data roles are often not the ones to shout the loudest about their achievements. While nobody wants to be a shameless self-promoter, there’s a balance to strive towards.

If you’ve done work that had an impact, don’t be afraid to let your colleagues know. It’s even better if you have some numbers to back it up (who better to put numbers to the impact of data work than you?). When doing this, it’s easy to get bogged down by implementation details of how hard it was to build, the fancy algorithm you used, or how many lines of code you wrote. But stakeholders care little about this. Instead, consider this framing.

Focus on impact: instead of saying “delivered X”, say “delivered X, and it had Y impact”.

Don’t be afraid to call out early when things are not progressing as expected. For example, call it out if you’re working on a project going nowhere or getting increasingly complex. You may fear that you put yourself at risk by doing so, but your stakeholders will perceive it as showing a high level of ownership and not falling for the sunk cost fallacy.


If you have any more ideas or questions about the article, reach out to me on LinkedIn.

The Hidden Cost of Data Quality Issues on the Return of Ad Spend

Your data has a lot of things to say about which customers turned out to be money in the bank and which ones didn’t. Regardless of whether you work as a Lifecycle Marketing Manager in a B2B company where you optimize for driving free trials to paid customers, or as a Data Scientist at a B2C e-commerce company where you optimize for getting first-time users to buy your product, each user has value to you.

Leading companies have become adept at predicting the lifetime value of customers at various stages based on their interactions with websites or products. Armed with this data, they can adjust their bids accordingly, justifiably paying an extra $5 for a user who is likely to generate an additional $50 in their lifetime.

In other words, you’re sitting on a goldmine that you can turn into predictions and input directly to Google and Meta to adjust your bidding strategy and win in the market by paying the price that’s right for each customer.

Data issues impacting the customer lifetime value (CLTV) calculation cause value bids to be based on wrong assumptions

But the return on your Ad Spend is only as good as your customer lifetime value calculations.

The average 250–500 person company uses dozens of data sources across many hundreds of tables and doesn’t always have the right level of visibility into whether the data it uses is accurate. This means budget gets allocated to the wrong users, wasting hundreds of thousands of dollars in the process.

In this post, we will delve into the data quality issues data-driven marketing teams face as raw data undergoes transformation, serving as input for value-based bidding in ad platforms. We’ll specifically address the following areas:

  • 360 overview – why it’s important to have an overview of your entire marketing data stack
  • Monitoring – common issues that you should look out for in your marketing pipelines
  • People & tools – the importance of aligning people and tools to build reliable marketing data pipelines

Why you need a 360 overview of your marketing pipelines

To gain an understanding of the value of each customer, you can analyze user behaviors and data points that serve as strong indicators. This often reveals a list of predictive factors, derived from dozens of different systems. By combining these factors, you can obtain a full view of your customers, and connect the dots to understand the key drivers behind behaviors and actions that indicate that a customer has a high value.

For example, if you are a marketer in a B2B company, you may have an understanding of the factors that drive customers to transition from free to paid users.

  • Logging in twice makes customers 50% more likely to convert (Stripe)
  • Referring others within 7 days makes customers 70% more valuable (Segment)
  • Users with company email addresses and 250+ employees are 30% more likely to become paying customers (Clearbit)
  • Mobile-only logins decrease customer value by 30% (Amplitude)

Dozens of upstream sources go into the data warehouse before being sent to Google & Facebook for ad bidding

Without a comprehensive overview, you may mistakenly assume the accuracy of data inputted into your bidding systems, only to later realize critical issues such as:

  • Incorrect extraction of company size from email domain names due to faulty Clearbit/Segment integration.
  • Event tracking conflicts resulting in missing data for essential actions in the checkout flow from Amplitude.
  • Inaccurate data sync from the Stripe integration, leading to incomplete information about customer purchases.

"Our CLTV calculation broke due to an issue with a 3rd party data source. Not only did we lose some of the £100,000 we spent on Google that day, but we also had to wait a few days for the CLTV model to recalibrate" – 500-person fintech

The significance of multiple factors in predicting CLTV for online retailer ASOS is highlighted in a research paper. The study finds that key factors include order behaviors (e.g., number of orders, recent order history), demographic information (e.g., country, age), web/app session logs (e.g., days since last session), and purchasing data (e.g., total ordered value). These insights are the outcome of hundreds of data transformations and integrations of dozens of 3rd party sources.

ASOS – factors to determine CLTV

Data issues – known unknowns and unknown unknowns

Having a comprehensive data overview is not enough; it is important to proactively identify potential issues affecting CLTV calculations. These issues can be categorized into two types:

Known unknowns: issues that are discovered and acknowledged, such as pipeline failures leading to the Google API not syncing data for 12 hours.

Unknown unknowns: issues that may go unnoticed, such as incorrect syncing of product analytics event data to the data warehouse, resulting in inaccurate assumptions about user behavior.

"We are spending $50,000 per day on Facebook marketing and one of our upstream pipelines was not syncing for 3 days, causing us to waste half of our budget. We had no idea this was happening until they notified us" – 250-person e-commerce company

To proactively identify and address data issues impacting CLTV calculations, consider monitoring across the following areas:

Logical tests: Apply assumptions to different columns and tables using a tool like dbt. For example, ensure that user_id columns are unique and order_id columns never contain empty values. Implement additional logical checks, such as validating that phone number fields only contain integers or that the average order size is not above a reasonable limit.
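
As a minimal sketch, checks like these can be declared in a dbt schema .yml file next to your models. The model and column names below are hypothetical, and aggregate checks such as a cap on average order size would typically live in a singular test instead:

models:
  - name: dim_users            # hypothetical model name
    columns:
      - name: user_id
        tests:
          - unique             # every user appears exactly once
          - not_null
  - name: fct_orders           # hypothetical model name
    columns:
      - name: order_id
        tests:
          - not_null           # order ids never contain empty values

Example of declaring logical tests in a dbt .yml file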

Volume: Monitor data volumes for anomalies. A sudden increase in new rows in the order table, for instance, could indicate duplicates from an incorrect data transformation or reflect the success of a new product.

Freshness: Be aware of the latest refresh times for all data tables, as data pipeline failures may go unnoticed in more granular areas. For instance, an integration issue pausing the collection of company-size data from Clearbit could persist without immediate detection.
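
For dbt sources, freshness expectations can be declared directly in the source definition and checked with dbt’s source freshness command. A minimal sketch, where the source and table names are hypothetical and the thresholds are assumptions to tune for your own pipelines:

sources:
  - name: clearbit             # hypothetical source name
    loaded_at_field: _synced_at
    freshness:
      warn_after: {count: 12, period: hour}    # warn after 12 hours without new data
      error_after: {count: 24, period: hour}   # fail after a full day without new data
    tables:
      - name: companies

Example of declaring source freshness expectations in a dbt .yml file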

Segments: Identify issues within specific segments, such as mislabeling certain product categories, which can be challenging to detect without proper checks in place.

Establishing responsibility and ownership

Once you have a comprehensive overview of your data and monitoring systems in place, it is key to define responsibilities for different aspects of monitoring. In the examples mentioned earlier, data ownership spans product usage, demographics, billing, and orders. Assigning owners for relevant sources and tables ensures prompt issue triaging and resolution.

"We had an important test alert go off for weeks without it being addressed as the person who was receiving the alert had left the company" – UK Fintech Unicorn

Additionally, prioritize the most critical components of your data product and establish Service Level Agreements (SLAs). Regularly assess uptime and performance to address any areas requiring attention in a systematic manner.

Summary

Leading companies use data from many sources to accurately predict the Customer Lifetime Value (CLTV) of each customer. This allows them to optimize their ad bids and target the most profitable customers. However, the success of your ad spend ultimately depends on the accuracy of your CLTV calculations, making undiscovered data issues a significant risk.

To ensure high-quality data for value-based ad bidding, we recommend focusing on two key areas:

  1. 360 Overview: Without a comprehensive overview, you run the risk of assuming data accuracy in your bidding systems, only to later discover critical issues. These issues could include stale data in platforms like Amplitude or integration problems with Clearbit.
  2. Monitoring: Proactively identifying and addressing data issues that impact CLTV calculations is crucial. Implement monitoring processes that encompass logical tests, data freshness, volume tracking, and segment analysis.

By prioritizing a comprehensive overview and proactive monitoring, companies can mitigate the risks associated with faulty CLTV calculations and improve the effectiveness of their value-based ad bidding strategies.


If you work with marketing data and are thinking about how to build reliable data, I’d love to speak with you. Reach out at mikkel@synq.io.

How to Identify Your Business-Critical Data
Practical steps to identify business-critical data models and dashboards and drive confidence in your data

Source: synq.io

This article has been co-written with Lindsay Murphy

Not all data is created equal. If you work in a data team you know that if a certain dashboard breaks you drop everything and jump on it, whereas other issues can wait until the end of the week. There’s a good reason for this. The first may mean that your entire company is missing data, whereas the latter may have no significant impact.

However, keeping track of all your business-critical data as you scale your team and grow the number of data models and dashboards can be difficult. This is why situations like these happen:

"I had no idea finance was relying on this dashboard for their monthly audit report"

or

"What the heck, did our CEO bookmark this dashboard that I made in a rush as a one-off request six months ago"

In this article we’ll look into

  • Why you should identify your critical data assets
  • How to identify critical dashboards and data models
  • Creating a culture of uptime for critical data

Why you should identify your business-critical data

When you have mapped out your business-critical assets you can have an end-to-end overview across your stack that shows which data models or dashboards are business-critical, where they are used, and what their latest status is.

This can be really useful, in a number of different ways:

  • It can become an important piece of documentation that helps drive alignment across the business on the most important data assets
  • It breeds confidence in the data team to make changes and updates to existing models or features, without fear of breaking something critical downstream
  • It enables better decision making, speed, and prioritisation when issues arise
  • It gives your team permission to focus more of your energy on the highly-critical assets, and let some less important things slide
Example of seeing important impacted data models and dashboards for an incident. Source: synq.io

In this article we’ll look at how to identify your business-critical data models and dashboards. You can apply most of the same principles to other types of data assets that may be critical to your business.

What data is business-critical

Data used for decision-making is important, and if data is incorrect it may lead to wrong decisions and, over time, a loss of trust in data. But data-forward businesses have data that is truly business-critical. If this data is wrong or stale you have a hair-on-fire moment, and there is an immediate business impact if you don’t fix it, such as:

  • Tens of thousands of customers may get the wrong email as the reverse ETL tool is reading from a stale data model
  • You’re reporting incorrect data to regulators and your C-suite can be held personally liable
  • Your forecasting model is not running and hundreds of employees in customer support can’t get their next shift schedules before the holidays

Mapping out these use cases requires you to have a deep understanding of how your company works, what’s most important to your stakeholders and what potential implications of issues are.

Identifying your business-critical dashboards

Looker exposes metadata about content usage in pre-built Explores that you can enrich with your own data to make it more useful. In the following examples, we’ll be using Looker, but most modern BI tools enable usage-based reporting in some form (Lightdash also has built-in Usage Analytics, Tableau Cloud offers Admin Insights, and Mode’s Discovery Database offers access to usage data, just to name a few).

Importance based on business-critical use case

When you speak with your business leaders you can ask questions such as:

  • What are your top priorities for the next three months?
  • How do you measure success for your area?
  • What are the most critical issues you’ve had in the past year?

Your business leaders may not know that the reason why average customer support response times jumped from two hours to 24 hours over Christmas was due to a forecasting error from stale upstream data, but they’ll describe the painful experience to you. If you can map out the most critical operations and workflows and understand how data is used you’ll start uncovering the truly business-critical data.

Importance based on dashboard usage

The most obvious important dashboards are ones that everyone in the company uses. Most of these you may already be aware of such as "Company wide KPIs", "Product usage dashboards", or "Customer service metrics". But you’ll sometimes be surprised to discover that dozens of people are relying on dashboards you had no idea existed.

Source: synq.io

In most cases you should filter for recent usage to not include dashboards that had a lot of users six months ago but no usage in the last month. There are exceptions to this such as a quarterly OKR dashboard that’s only used every three months.

Importance based on C-suite usage

Like it or not, if your CEO uses a dashboard regularly it’s important, even if there’s only a handful of other users. In the worst case scenario you realise that a member of the C-suite has been using a dashboard for months with incorrect data without you having any idea this dashboard existed.

"We discovered that our CEO was religiously looking at a daily email delivered with a report on revenue, but it was incorrectly filtered to include a specific segment, so it didn’t match the centralised company KPI dashboard." – Canadian healthcare startup

If you have an employee system of record, you may be able to easily get identifiers for people’s titles and enrich your usage data with this. If not, you can maintain a manual mapping of these and update them when the executive team changes.

Source: synq.io

While usage by seniority is highly correlated with importance, your first priority should be mapping out the business-critical use cases. For example, a larger fintech company has a dashboard used by the Head of Regulatory Reporting to share critical information with regulators. The accuracy of this data can be of higher importance to your CEO than the dashboard they look at every day.

Identifying your business-critical data models

With many dbt projects exceeding hundreds or thousands of data models, it’s important to know which ones are business-critical so you know when to prioritise a run or test failure, or build extra robust tests.

Data models with many downstream dependencies

You likely have a set of data models where if they break everything else is delayed or impacted. These are typically models that everything else depends on such as users, orders or transactions.

You may already know which ones these are. If not, you can also use the manifest.json file that dbt produces as part of the artifacts at each invocation and the depends_on property for each node to loop through all your models and count the total number of models that depend on them.

In most cases you’ll find a handful of models with disproportionately many dependencies. These should be marked as critical.

Data models on the critical path

Data models are rarely critical on their own, but most often because of the importance of their downstream dependency, such as an important dashboard or a machine learning model used to serve recommendations to users on your website.

All data models upstream of a business-critical dashboard are on the critical path. Source: synq.io

Once you’ve gone through the hard work of identifying your business-critical downstream dependencies and use cases you can use exposures in dbt to manually map these or use a tool that automatically connects your lineage across tools.
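
As a sketch, a dbt exposure that maps a critical dashboard to its upstream models could look like the example below, where the exposure name, URL, owner, and referenced models are all hypothetical:

exposures:
  - name: company_kpi_dashboard       # hypothetical exposure name
    type: dashboard
    maturity: high
    url: https://looker.example.com/dashboards/42    # hypothetical dashboard URL
    description: Company-wide KPI dashboard used by the executive team
    depends_on:
      - ref('fct_orders')
      - ref('dim_users')
    owner:
      name: Data Team
      email: data-team@example.com

Example of mapping a critical dashboard as a dbt exposure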

Anything upstream of a critical asset should be marked as critical or as on the critical path.

How to keep your critical data model definitions updated

Automate as much as possible around tagging your critical data models. For example:

  • Use check-model-tags from the pre-commit dbt package to enforce that each data model has a criticality tag (see the sketch below)
  • Build a script, or use a tool, that automatically adds a critical-path tag to all models that are upstream of a business-critical asset
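
As a minimal sketch, the hook can be wired up in a .pre-commit-config.yaml like this (the repository revision and the allowed tag names here are assumptions; check the package’s documentation for the current setup):

repos:
  - repo: https://github.com/offbi/pre-commit-dbt   # the pre-commit dbt package
    rev: v1.0.0                                     # assumed revision; pin to a real release
    hooks:
      - id: check-model-tags
        args: ["--tags", "critical", "critical-path", "non-critical", "--"]

With this in place, a pull request that introduces a model without one of the allowed tags fails the check before it can be merged.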

Defining criticality labels

There’s no one right answer to how to define criticality, but you should ask yourself two questions:

  1. What are your plans for how you treat critical data assets differently?
  2. How do you maintain a consistent definition of what’s critical so that everyone is on the same page?

Most companies use a tiered approach (e.g. bronze, silver, gold) or a binary approach (e.g. critical, non-critical). Both options can work and the best solution depends on your situation.

Source: synq.io

You should be consistent in how you define criticality, write this up as part of your onboarding for new joiners, and avoid postponing it. For example, the definition of tiering could be:

  • Tier 1: Data model used by a machine learning system to determine which users are allowed to sign up for your product
  • Tier 2: Dashboard used by the CMO for the weekly marketing review
  • Tier 3: Dashboard used by your product manager to track monthly product engagement

If you’re not consistently updating and tagging your assets it leads to a lack of trust and an assumption that you can’t rely on the definition.

Where to define criticality

There’s no one right place to define criticality but it’s most commonly done either in the tool where you create the data asset, or in a data catalog, such as Secoda.

Defining criticality in the tool where you create the data asset

In dbt you can keep your criticality definitions in your .yml file alongside your data model definition. This has several advantages, such as being able to enforce criticality when merging a PR or easily carrying over this information to other tools such as a data catalogue or observability tool.

models:
  - name: fct_orders
    description: All orders
    meta:
      criticality: high

Example of defining criticality in a .yml file

In BI tools, one option that makes it transparent to everyone is to label the title of a dashboard with e.g. "Tier 1" to indicate that it’s critical. This data can typically be extracted and used in other tools.

Source: synq.io

Defining criticality in a data catalog

In a data catalog you can easily access all your company data and find answers to common questions by searching across your stack, which makes it easier to align on metrics and models.

Tagging critical data. Source: secoda.co

Acting based on criticality

Mapping your business-critical assets will only pay off if you act differently because of it. Here are some processes to build in quality by design.

Dashboards:

  • Tier 1 dashboards need a code reviewer before being pushed to production
  • Tier 1 dashboards should adhere to specific performance metrics around load time and have a consistent visual layout
  • Usage of Tier 1 dashboards should be monitored monthly by the owner

Data models:

  • Test or run failures on critical data models should be acted on within the same day
  • Issues on critical data models should be sent to PagerDuty (an on-call team member) so they can be quickly actioned
  • Critical data models should have at least unique and not null tests as well as an owner defined, as sketched below
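
Pulling the pieces together, a critical model’s .yml entry could combine a criticality tag, an owner, and baseline tests in one place. A sketch, where the owner handle is hypothetical and the meta keys are conventions to agree on with your team:

models:
  - name: fct_orders
    config:
      tags: ['critical']
    meta:
      criticality: high
      owner: data-platform-team     # hypothetical owner handle
    columns:
      - name: order_id
        tests:
          - unique
          - not_null

Example of combining criticality, ownership, and baseline tests for a critical model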

You can read more about how to act on data issues in our guide Designing severity levels for data issues.

Summary

If you identify and map out your business-critical data assets you can act faster on issues that are important and be intentional about where you build high quality data assets.

  • To identify dashboards that are business-critical, start by looking at your business use cases. Then consider usage data such as the number of users or whether anyone from the C-suite is using a dashboard
  • Data models that are business-critical often have many downstream dependencies and/or critical downstream dependencies
  • Define criticality, either directly in the tools where you create the data assets, or use a data catalog
  • Be explicit about how you act on issues within business-critical assets and put in procedures for building quality by design

The Important Purple People Outside the Data Team
Anna Filippova from dbt wrote about how we need more "purple" people – generalists who can navigate both the business context and the modern data stack.

Some of the best hires I’ve made have been purple people elsewhere in the company who’ve wanted to move to a data role. They may work in customer support and have become the go-to person for data-related questions in their team. Or they may work as an account manager and have built superb dashboards used by everyone in the sales team.

These people bring a unique combination of having a deep understanding of the business and a drive to learn about data. And you’re in luck – in many cases they want to make a move to the data team.

They naturally align themselves well to solve real problems that drive ROI for the business. They understand the impact of automating a tedious process around scoring customer health because they know how painful it was to do that manually. And they understand why a 3-month project to measure the impact of a marketing campaign on sales doesn’t make sense because the sales teams systematically don’t log calls on time.

You likely find yourself working with some of these people and may have already made an effort to integrate them into the data team.

Wherever you fall on the scale, I recommend being deliberate about how you work with them to create the best outcomes for the data team and to give them the best opportunities for career growth. When done well, they will function as extensions to the data team, help solve important problems and handle ad-hoc requests.

Practical tips for creating success with people outside the data team

Start by mapping out who these people are. If you’re in a management role you can ask your team and they’ll know.

I’ve found these steps to work well:

  • Map out everyone doing data-like work outside the data team and have your team select which ones have the most impact and are most eager to learn
  • Invite a few to be part of data team rituals. Give them a mentor on the data team and invite them to your offsite and weekly team meetings. This gives you a better sense for how they work and gives them a chance to learn
  • If it’s a good fit on both ends and their manager agrees, you can consider bringing them onto the data team. If you do, I suggest a 3-month probation period where clear expectations are set

If you find this working well, consider making it into a more formal data rotation program that anyone can apply to.

Common pitfalls when bringing in people from outside the data team

Unfortunately, it doesn’t always work out when people are brought into the data team. I’ve seen a few common pitfalls that you should be on the lookout for as early as possible:

  • They have a hard time letting go of the work from their old role and keep getting bogged down by DMs or operational work from previous stakeholders
  • They struggle to change their mindset, don’t take the time to learn to do things right the first time around, and end up cutting corners
  • They don’t have the right level of support in the data team and are not onboarded well

Be sure to do what you can to ensure they have the best possible setup for success, and use the 3-month probation period to raise feedback with them early on so they have a chance to improve.

What happens when you have entire data-like teams outside the data team?

Engaging and bringing in people to the data team is a win-win. If done right, you’ll get a lot of value and help eager and ambitious people make a move into data.

The situation is more complicated when you start seeing entire teams outside of the core data team who are doing data-like work. These people are often brought in to do business critical work such as making the forecasting model that determines which support agents should work when or building data models to determine the credit score of customers.

If done wrong, these teams risk deteriorating the reliability of data and can lower the quality of decisions made across the company.

Data reliability is only as strong as the weakest link in the chain.

You can think of it as an equation:

Data reliability = lowest(upstream data quality, data model quality, dashboard quality, ...)

In the example above you could have a situation where

  • Rigorous logging of calls in the CRM (sales team) = high
  • dbt models with high test coverage (data team) = high
  • Scrappy LookML code with logical errors (sales ops) = low

Regardless of the quality of the upstream and data modelling layers, the quality of the data used to make decisions is low, as it is only as strong as the weakest link.

In the same way that you can’t build reliable data on top of flaky data from upstream producers, you can’t confidently rely on data-driven decision making if decisions are made with downstream teams not following the standards you expect.

"Data-like teams frequently work on high importance business problems but too often the data team is not informed or involved in their work. Sometimes senior stakeholders circumvent the data team to move faster but end up creating long-term data debt"

It’s important to mention that in no way should data work be exclusively for people in the data team. In fact, the more people that engage in data work, the stronger a data culture you have. However, you need to be clear about where you expect high quality. That may mean creating a rule that for the most critical use cases of data, the data team needs to at least be informed.

How to spot teams doing data-like work

Here are some signs you should look out for to spot teams that you might want to bring in closer to the data team.

A group of sales ops analysts making dashboards with scrappy LookML code used by hundreds of people without the data team being aware of it.

An operations team maintaining a forecasting model that determines when the shift of workers starts and ends built in Pandas, being run manually each morning on a local machine.

A business strategy team that decides to use Google Data Studio for a dashboard for investors because someone used it in their previous job although the data team policy is to use Looker.

A credit analyst team who’ve started to develop their own dbt project for data models used to decide which customers are allowed to borrow money.

Before you know it, business critical decisions will be made by people using dashboards and data models that the data team had no idea existed.

You need to build a system for when teams and individuals should be in the core data team and when they should remain independent.

High overlap: You risk ending up with a mess that will eventually come back to the data team to fix. You should consider making them part of the core data team with the same expectations and onboarding you’d have for anyone else.

Some overlap: The quality of data and decisions heavily depend on them but the work is too different from what the data team does. Invite them to some of the data rituals and pair them up with a mentor from the data team.

Little overlap: It’s great that they’re building their own dashboards but you shouldn’t invest much time here as you risk spreading yourself too thin. Instead, offer to do office hours and a monthly training on Looker best practices.

Conclusion

Some of the best hires you can make into the data team likely already work elsewhere within your company.

  • Structure the way you bring people outside the data team in. Gradually make them part of the data team and have a probation period with clear goals
  • Data reliability is only as strong as the weakest link in the chain. Even if you have excellent data sources and great data modelling, if analysts make sloppy dashboards, you’ll end up making decisions based on faulty data
  • Have a strategy for how you deal with data-like teams outside the data team. If their work closely resembles what you already do in the data team, you should consider making them part of the core team

If you have experience with how to best bring in data-like people from outside the data team and create a chain of reliable data, I’d love to hear from you.

The Difficult Life of the Data Lead

Why balancing managing a team with demanding stakeholders and still being hands-on is no easy task
Working in data has never been harder. The data stack is growing more complex and expectations are higher than ever before.

However, one Data role has it harder than most: The Data Lead.

People in data middle-management roles often go by the name of Data Lead or Data Manager and it’s the only role where you have to balance managing a team with working as part of a leadership group and still doing hands-on work. No easy combination.

Image by Author

If you’re an IC (individual contributor) your job is challenging but your focus is clear. You succeed by delivering high quality analysis and data products and by working with stakeholders to make sure your work has impact.

If you’re a Head of Data you work on a strategic level and if you’re lucky you have a seat at the top management table. Your role is difficult but your focus is also clear; build a great team around you and make sure the data team is working on the right priorities, and is set up to succeed in the future.

But if you’re a Data Lead you’ve got to do all of these at once.

Image by Author

As a Data Lead you have to manage your direct team. This includes dealing with performance management issues, making sure top performers are challenged, and hiring the right people.

You also have to manage stakeholders who often have competing priorities and keep context on what goes on in each area where someone from your direct team is involved.

And you need to stay hands-on and be able to jump into code, build a dashboard or deliver an analysis that sets the bar for what good looks like.

This is a lot to balance.

Image by Author

All this combined makes the role of data middle management unique and as a data industry we haven’t figured out how to deal with this yet.

As the data IC career progression path is starting to become a viable alternative to the people-management ladder, I’m starting to see more Data Leads being drawn to this. They’re still ambitious and want to progress in their careers but also want to get back to having time to focus on their craft and doing deep work.

If you look at engineering teams, many Engineering Managers have left behind most of their IC-related work. In fact, seeing an Engineering Manager push code to the production codebase is uncommon in many organisations.

Companies still need data managers so not everyone can move to the IC ladder. Some have suggested that Data Leads should operate more like Engineering Managers but I’m not so sure that’s what they want. In my experience, being hands-on is the part of the job they often enjoy the most.

A likely root cause of the issue

As data teams are getting larger the need for data middle managers will only increase and figuring out the best operating model is key. One solution is to stop expecting them to be able to do it all at once.

If the top priority is to grow the team, stakeholders shouldn’t expect the Data Lead to be as hands-on until the team has been hired.

If there’s a performance management issue and someone on the team requires a lot of attention, the hiring team should take on more of the initial candidate calls.

My take is that the most common root cause of the strain on data managers lies with stakeholders. They are not deliberately being difficult (I hope) and often have good intentions to push for their own business goals. But many stakeholders don’t know how to work with data people. In high-growth companies you often have stakeholders coming from all kinds of backgrounds. People coming from traditional companies in particular may expect the data team to operate more as a service function where the goal is to respond to ad-hoc data requests. This is a topic that I’ve seen consistently create a lot of friction.

It’s also not uncommon to see stakeholders fight for data "resources" to be allocated to their projects, leaving the Data Lead in the difficult position of not being able to make anyone happy.

What can be done to improve this?

One problem is that many organisations don’t have a senior data person who has a seat at the top management table and can speak for data. This is unlike what you see in engineering where it would be uncommon to not have at least one senior technical person there.

Another solution is to educate stakeholders. I’ve always found it much easier to work with senior stakeholders who have a background in data. They appreciate how difficult some work can be and know when to cut corners. It may be wishful thinking that all senior leaders have a data background but data teams who don’t work as service functions are still a relatively new thing. Hopefully more stakeholders will get accustomed to how to work with data people.

I’m still waiting for someone to write the ultimate manual to stakeholders for how to work with data people.

If you’re working on this or have any experiences with this topic, I’d love to hear from you!

The Unsung Data Heroes

How a few data heroes make the wheels spin at high growth companies and why you can’t afford to lose them

There’s a good reason why many of the best Data people want to work at high growth technology companies. The learning curve is steep, you can progress quickly, the company’s mission is exciting, and you get to work with the latest technologies.

But working at high growth companies also means that everything changes all the time. Constant changes to the business mean that data expectations have to change too. The median tenure at startups is just two years and there are always a lot of new joiners.

This creates the perfect environment for the data hero.

The data hero is often one of the longest-standing members of the data team. It’s that person who still understands the fct_orders.sql table with 500 upstream dependencies and is the only one courageous enough to dare make a change.

Source: Unsplash.com (modifications by Author)

If you run a data team you know these are the people you can always count on to save the day when things are about to go really bad. They fix stuff faster than everyone else, if they don’t know something they always figure it out, and they take personal responsibility for the consequences to the business if data is wrong.

The problem with data heroes is that they hide issues in your team that would otherwise have been exposed, and if they leave, they’re hard to replace.

Unfortunately, managers rely on data heroes for too much of the mundane work such as fixing urgent issues because they know it will get done. This work is not glamorous and around the two-year mark data heroes too often get fed up and start looking around for jobs elsewhere.

The opportunity companies miss when data heroes leave sooner than they otherwise would have is massive.

Image by Author

It takes a while to hit your stride in a new data role. You have to get to know the business, understand existing data models and build relationships with the right people. I’d expect the average data hire to be twice as productive in their second year. Still, the median tenure at startups is just two years.

If you can make your best people stay for four years instead of two you’re getting twice the productivity for the same price.

So, how do you retain your data heroes? Let’s start with what not to do:

  • Don’t give them all the mundane work just because they know how to do it. Don’t let it be on them to fix it every time your "model_with_500dependencies that nobody knows.sql" fails.
  • Don’t let them be the ones that always have to log in on weekends just because they deeply care about the consequences to the business of data being wrong.

Instead, build the right behaviours into your data culture.

Creating a culture that retains data heroes

Make caring about important data everyone’s responsibility by design. Have a setup where the right people are notified of the data issues they should care about so it doesn’t fall on the same few people each time.

Reward behaviours where data heroes share best practices or decouple the largest data models so more people can contribute.

Make sure that there’s a progression path for individual contributors (ICs) that give your best people room to progress without going into management. We’re starting to see the appearance of staff data roles but they are still uncommon compared to engineering.

Image by Author

In the example above, if you take the happy path you’ll have a data team that’s twice as productive.

Take good care of your data heroes and don’t forget to make room for new ones.

Any thoughts or feedback on the topic? Let me know!

Data Teams Are Getting Larger, Faster

On the relationship between data team size and complexity

Data teams at high-growth companies are getting larger and some of the best tech companies are approaching a data-to-engineers ratio of 1:2.

More data people means more analysis, more insights, more machine learning models, and more data-informed decisions. But more data people also means more complexity, more data models, more dependencies, more alerts, and higher expectations.

When a data team is small you may be resource-constrained but things feel easy. Everyone knows everyone, you know the data stack inside out, and if anything fails you can fix it in no time.

But something happens when a data team grows past 10 people. You no longer know if the data you use is reliable, the lineage is too large to make sense of, and end-users start complaining about data issues every other day.

It doesn’t get easier from there. By the time the data team is 50 people you start having new joiners you’ve never met, people who have already left the company are still tagged in critical alerts, and the daily pipeline is only done by 11am, leaving stakeholders complaining that data is never ready on time.

How did this happen?

With scale, data becomes exponentially more difficult.

Image by Author

The data lineage becomes unmanageable. Visualising the data lineage is still the best way to get a representation of all dependencies and how data flows. But as you exceed hundreds of data models the lineage loses its purpose. At this scale you may have models with hundreds of dependencies and it feels more like a spaghetti mess than something useful. As it gets harder to visualise dependencies it also gets more difficult to reason about how everything fits together and knowing where the bottlenecks are.

The pipeline runs a bit slower every day. You have so many dependencies that you no longer know what depends on what. Before you know it you find yourself in a mess that’s hard to get out of. That upstream data model with hundreds of downstream dependencies is made 30 minutes slower by one quirky join that someone made without knowing the consequences. Your data pipeline gradually degrades until stakeholders start complaining that data is never ready before noon. At that point you have to drop everything to fix it and spend months on something that could have been avoided.

Data alerts get increasingly difficult to manage. If you’re unlucky you’re stuck with hundreds of alerts, people mute the #data-alerts channel, or analysts stop writing tests altogether (beware of broken windows). If you’re more fortunate you get fewer alerts but still find it difficult to manage data issues. It’s unclear who’s looking at which issue. You often end up wasting time looking at data issues that have already been flagged to the upstream engineering team who will be making a root cause fix next week.

The largest data challenge is organisational. With scale you have teams that operate centrally, embedded and hybrid. You no longer know everyone in the team, in each all-hands meeting there are many new joiners you’ve never heard of, and people you have never met rely on data models you created a year ago and constantly come to you with questions. As new people join they find it increasingly difficult to understand how everything fits together. You end up relying on the same few data heroes who are the only ones that understand how everything fits together. If you lose one of them you wouldn’t even know where to begin.

Image by Author

All of the above are challenges faced by a growing number of data teams. Many growth companies that are approaching IPO stage have already surpassed a hundred people in their data teams.

How to deal with scale

How to deal with data teams at scale is something everyone is still trying to figure out. Here are a few of my own observations from having worked in a data team approaching a hundred people.

Embrace it. The first part is accepting that things get exponentially harder with scale and it won’t be as easy as when you were five people. Your business is much more complex, there are exponentially more dependencies, and you may face scrutiny from preparing for an IPO or from regulators that you didn’t have before.

If things feel difficult that’s okay and they probably should.

Work as if you were a group of small teams. The big problem when teams scale is that the data stack is still treated as everyone’s responsibility.

As a rule of thumb a new joiner should be able to clearly see data models and systems that are important to them but more importantly know what they don’t have to pay attention to. It should be clear which data models from other teams you depend on. Some data teams have already started making progress on only exposing certain well-crafted data models to people outside their own team.

Don’t make all data everyone’s problem. Some people thrive from complex architectural-like work such as improving the pipeline run time. But some of the best data analysts are more like Sherlock Holmes and shine when they can dig for insights in a haystack of data.

Avoid mixing these too much. If your data analysts spend 50% of their time debugging data issues or sorting out pipeline performance, you should probably invest more in distributing this responsibility to data engineers or analytics engineers who shine at (and enjoy) this type of work.

Growing data teams are here to stay, and we've only scratched the surface of how to approach them. If you have any thoughts on how to do this, let me know!

The post Data teams are getting larger, faster appeared first on Towards Data Science.

]]>
How should analysts spend their time https://towardsdatascience.com/how-should-analysts-spend-their-time-e390872ceb70/ Mon, 30 May 2022 06:11:45 +0000 https://towardsdatascience.com/how-should-analysts-spend-their-time-e390872ceb70/ How to do more of the work that matters and less of the work that doesn't

The post How should analysts spend their time appeared first on Towards Data Science.

]]>
How Should Analysts Spend Their Time?

How to do more of the work that matters and less of the work that doesn’t

When I was at Google and started building machine learning models, one of the first things I was shown was the image below. It highlighted that writing the actual machine learning code is only a fraction of the total work I’d be doing. This turned out to be true.

Source: Hidden Technical Debt in Machine Learning Systems

But working in data is much more than just machine learning, so what does this chart look like for a data analyst?

The answer is not a simple one. There will be times when it makes sense to focus on building dashboards and the underlying data models, and other times when it makes more sense to focus on analysis and insights.

With that in mind, this is my assessment of what an average week looks like in the life of an analyst.

Image by Author

Just as with machine learning, there are many other tasks that are as valuable as the analysis itself.

However, one common theme is that data people spend upwards of 50% of their time working reactively, often dealing with data issues or trying to find or get access to data. Examples of this are:

  • A stakeholder mentions that a KPI in a dashboard looks different from last week, and you have to explain why
  • A data test in dbt is failing, and you have to work out the root cause of the issue
  • You want to use a data point for a new customer segment, but a search in Looker turns up five different definitions and it's unclear which one to use

Allocating time differently

Here's an alternative for how data analysts could spend their time:

Image by Author

So, what’s different?

More time for analysis

Analysts should do… wait for it… analysis. They should have the freedom to get an idea on the way into work and have it answered by lunchtime. This type of work often requires long stretches of focused time, the right tools in place, and an acceptance that it's okay to spend half a day on work that may go nowhere.

I've seen this done particularly well when data analysts work closely with others, such as user researchers, to quickly formulate hypotheses for which A/B tests to run, knowing full well that not all of them will be home runs.

"Taking that extra time to make sure you're doing the right work, instead of just doing the work right, is often some of your best-spent time"

Less time spent dealing with data issues

Data teams I've spoken to spend upwards of 20% of their time dealing with data issues, and it gets more painful as you scale. When you are a small data team, you can look at a few data models to get to the root cause. As you scale, you have dozens of other data people relying on your data models, engineering systems breaking without you being notified, and code changes you had no idea about impacting your data models.

Less time finding what data to use

The difficulty of finding the right data also increases exponentially with data team size. When you're a small team, you know where everything is. As you get larger, it becomes increasingly difficult, and you often run into multiple definitions of the same metric. Once you're really large, just getting access to the right data can in the worst instances take months.

In search of the optimal time allocation

The optimal allocation is different for everyone, but my guess is that you have room for improvement. The most important first step is to be deliberate about how you spend your time.

"I'd recommend setting aside a few minutes each week to look back at the past week and note down how you spent your time. If you do this regularly, you'll get a surprisingly good understanding of whether you spend your time right"
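A lightweight way to run this weekly review is a few lines of Python that tally a log into percentages. The categories and hours below are invented for illustration:

```python
from collections import Counter

# One (category, hours) entry per block of work noted during the week.
week_log = [
    ("analysis", 8), ("data_issues", 10), ("dashboards", 6),
    ("finding_data", 4), ("meetings", 7),
]

totals = Counter()
for category, hours in week_log:
    totals[category] += hours

# Print each category's share so weeks can be compared at a glance.
total_hours = sum(totals.values())
for category, hours in totals.most_common():
    print(f"{category:<14}{hours:>3}h  {hours / total_hours:.0%}")
```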

Finding the right data

Data catalogues are great in theory, but they often fail to live up to their potential. Instead, they end up as an afterthought, and something not too dissimilar from the broken windows theory starts to happen: as soon as people stop maintaining a few data points in the catalogue, you may as well throw the whole thing out the window. Luckily, things are moving in the right direction, and many people are thinking about how to get definitions and metadata to live closer to the tools people already use every day.

Reduce time spent dealing with data issues

Over the last few years, data teams have gotten larger and much more data is being created. Dealing with data issues was easy when you were a five-person data team with daily stand-ups. It's not so straightforward when there are dozens of data people creating different data left and right.

Image by Author

We need to apply some of what made life easy in a small data team to larger data teams. Some ideas for how this could be done (a minimal ownership check is sketched after the list):

  • Every data asset should have an owner
  • Encapsulate data assets into domains with public and private access endpoints so not all data is accessible to everyone
  • Close the loop to data producers with shared ownership so issues are caught as far upstream as possible
  • Enable everyone to debug data issues so they don’t have to be escalated to the same few "data heroes" each time
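Here's the first idea, every data asset having an owner, as a minimal CI-style check. The asset registry is hypothetical; in practice the metadata might come from dbt or a data catalogue:

```python
# Hypothetical asset registry; `dim_users` intentionally lacks an owner
# so the example exercises the failure path.
assets = {
    "fct_orders": {"owner": "finance-data"},
    "dim_users": {"owner": None},
    "mart_marketing": {"owner": "marketing-data"},
}

unowned = [name for name, meta in assets.items() if not meta["owner"]]
if unowned:
    # Fail the pipeline/CI run until every asset has an owner.
    raise SystemExit(f"Assets missing an owner: {unowned}")
print("All assets have owners")
```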

"Instead of treating every data issue as an ad-hoc fire, invest in fundamental data quality and controls that make data issues less likely to reoccur"

If you have some ideas of how to improve where you spend your time in a data team let me know!

The post How should analysts spend their time appeared first on Towards Data Science.

]]>
Data, engineers and designers: How US compares to Europe https://towardsdatascience.com/data-engineers-and-designers-how-us-compares-to-europe-e1ce6f0a8908/ Sun, 13 Feb 2022 19:19:01 +0000 https://towardsdatascience.com/data-engineers-and-designers-how-us-compares-to-europe-e1ce6f0a8908/ Data, engineers, and designers: How US compares to Europe Earlier I made the claim that data was having its moment with companies doubling down on data hiring and the data to engineers ratio approaching 1:2 for some top European tech companies. How does it compare across the pond? 🇺🇸 The median data to engineers ratio […]

The post Data, engineers and designers: How US compares to Europe appeared first on Towards Data Science.

]]>
Data, engineers, and designers: How US compares to Europe

Earlier I made the claim that data was having its moment, with companies doubling down on data hiring and the data to engineers ratio approaching 1:2 for some top European tech companies.

How does it compare across the pond? 🇺🇸

The median data to engineers ratio for the US companies I looked at is 1:7 compared to 1:4 for the European companies. And the design to engineers ratio is 1:9 for both groups.

This post gives some answers to why this is but also leaves some questions unanswered.

Let’s dig in!

That’s quite the gap between companies at different ends of the scale.

Analytics companies are in fact not that analytical. Disappointing as it may be, analytics companies such as Amplitude, Pendo, Snowflake, and Fullstory are… not that analytical. All of these companies have only ~2% of their total workforce in data roles, and more than half of their employees don't work in tech roles at all.

Developer tools companies really are for developers. HashiCorp, Sentry, and Gitlab are all at the bottom of the list when it comes to the number of data people and designers per engineer. This makes sense, as they all have large engineering teams – some exceeding 50% of their total workforce.

Data is the secret Texan hot sauce

Plotting a matrix of design to engineers against data to engineers ratios tells an interesting story of where companies focus.

Is this any different than how it looked for European companies? Quite a bit!

Only 50% of US companies have more data people than designers, compared to 80% of the European companies I looked at.

This doesn’t mean that Europeans invest less in design; they just invest more in data. The median design to developer ratio is 1:9 for both US and European companies. This is not too far from what Nielsen Norman Group found in a study of 500 companies which showed that 50% were targeting a design to developer ratio of at least 1:10.

US companies are more often engineering first: 36 of the US companies fall in that quadrant, compared to 19 of the European companies.

Is this a coincidence, or is there something structural going on here? Let's go deeper 🇺🇸 🤺 🇪🇺.

Cookie or biscuit? How US companies compare to Europeans

From last week's analysis we know that the business model is one of the best indicators of the data to engineers ratio; marketplaces, for example, had 2–3x as many data people per engineer compared to B2B. Did I simply pick more US companies from low-ratio business models? Perhaps, but even within the same business models, US and European companies have very different data to engineers ratios.

The picture is clear. European companies have notably more data people per engineer compared to US companies across all business models.

The US is a larger market with larger engineering teams – e.g. Doordash & Deliveroo. With a data to engineers ratio of 0.42, Deliveroo's (🇪🇺) ratio is more than double that of Doordash (🇺🇸) at 0.18. This doesn't mean that Doordash is less data-driven; in fact, in absolute terms they have more data people than Deliveroo, and there's no shortage of awesome data work coming out of their team. They just have a much larger engineering team than Deliveroo.

Looking at all 100 companies in my analysis, engineers make up 21% of the total workforce at US companies compared to 17% at the European ones.

Europe has more deep tech and the US more dev tools – e.g. HashiCorp & Onfido. Onfido (🇪🇺) has a data to engineers ratio of 0.16 compared to 0.02 for HashiCorp (🇺🇸). If you're Onfido, that makes sense: machine learning is a core part of your offering, and investing more in data is a good idea. Europe is home to many similar deep tech startups. In contrast, dev tools companies have a very low data to engineers ratio, and the US has by far the largest share of the DevOps tools market.

If we plot the data to engineers ratio side by side, the contrast is stark, with the gap exceeding 2x for some verticals.

While larger engineering teams in the US and a focus on deep tech in Europe help explain some of this, there are also other factors at play. Here are a few of my (speculative) hypotheses:

1) The US market is larger, more competitive and there are more experienced sales executives. Therefore US-based B2B companies have relatively larger sales forces.

2) European startups are often scattered across many markets from day one. This creates unique data challenges that require larger data teams.

3) More US companies are led by engineering founders that create a more engineering-focused culture.

For good order's sake, I also checked whether there's any correlation between company size and the data to engineers ratio, as the US companies in my sample are slightly larger (the average US company has 2,100 employees compared to 1,500 for the Europeans). I found no significant correlation.
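For the curious, that check is a one-liner with Python 3.10's statistics module. The values below are invented stand-ins, since the underlying company data isn't public:

```python
from statistics import correlation  # available from Python 3.10

# Invented stand-in values; the real analysis covered 100 companies.
company_size = [2100, 1500, 3400, 800, 1200]
data_ratio = [0.25, 0.18, 0.30, 0.15, 0.42]

r = correlation(company_size, data_ratio)  # Pearson's r
print(f"Pearson r = {r:.2f}")  # a value near zero means no clear relationship
```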

Let me know your thoughts.


About the data

I've looked at job titles on LinkedIn and included all keyword matches (for example, a product engineer would be matched by the engineer term):

  • Data: Data Analyst, Data Scientist, Machine Learning, Data Engineer, Data Manager, Analytics Engineer, Product Analyst, Business Intelligence, Data Lead/Manager/Director/VP
  • Engineering: Engineer (excluding Data Engineer), Tech/Technical Lead
  • Design: Design(er), User Experience, UX, User Research

I've deliberately not included all analyst roles, which means that roles such as financial analyst, sales analyst, and strategy analyst are not counted as data roles, although you could classify some of their work as data work.
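To make the methodology concrete, here's a sketch of the matching logic with invented job titles. The keyword lists are abbreviated versions of the ones above, and checking data keywords first is what keeps data engineers out of the engineering bucket:

```python
# Abbreviated keyword lists mirroring the methodology described above.
DATA_KEYWORDS = ["data analyst", "data scientist", "machine learning",
                 "data engineer", "analytics engineer", "product analyst",
                 "business intelligence"]
DESIGN_KEYWORDS = ["design", "user experience", "ux", "user research"]

def classify(title: str) -> str:
    """Bucket a job title into data, design, engineering, or other."""
    t = title.lower()
    if any(k in t for k in DATA_KEYWORDS):  # checked first: data engineer -> data
        return "data"
    if any(k in t for k in DESIGN_KEYWORDS):
        return "design"
    if "engineer" in t:  # substring match, so "product engineer" counts too
        return "engineering"
    return "other"

for title in ["Senior Product Engineer", "Data Engineer", "UX Researcher"]:
    print(title, "->", classify(title))
```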

The post Data, engineers and designers: How US compares to Europe appeared first on Towards Data Science.

]]>