Next autumn, I will start my Master’s degree in Data Science in Zurich – with its three thematic pillars of Data Analytics, Data Engineering and Data Services, it offers exactly the opportunities we need in the current economy. But before I specialize in one of these areas, the crucial question arises:
What does a data engineer do differently from a data scientist?
Buzzwords such as data analyst, data scientist, data engineer, machine learning engineer or even business analyst are often mixed up, which leads to confusion. Of course, there are overlaps, and the roles are not performed in exactly the same way in every company. But in practice, these roles have clearly defined tasks that are both super relevant for a modern company: a data engineer is there to enable the work of a data scientist or a data analyst. The data scientist uses the infrastructure provided and the processed data to create insights and draw conclusions. The two are responsible for different tasks, but both are necessary.
In this article, I use the example of a house price prediction model to show which tasks are performed by a data engineer and which by a data scientist. Regardless of which direction you want to develop in, it’s best to play through the entire super-simplified example to better understand the differences between the two roles. I’ve also put together an overview of which skills and tools you need to know in which role and a checklist of questions to find out which role your heart beats faster for 💞 .
Table of Contents 1) Beginner’s tutorial: What do Data Engineers and Data Scientists do? 2) What does a data engineer do? What does a data scientist do? 3) Career guide: How to choose between Data Engineer and Data Scientist 4) Final Thoughts
1) Beginner’s tutorial: What do Data Engineers and Data Scientists do?
In this simplified example, the data engineer provides the data by loading it, cleaning it and storing it in an SQLite database. The data scientist then uses this data to visualize it and to predict house prices with a machine learning model. The first three steps (loading the raw data, cleaning it and saving it in a database) belong to the role of the data engineer: they ensure the infrastructure and the data preparation. The last two steps (analyzing the data and training the ML model) belong to the role of the data scientist, who uses the data to gain insights and generate predictions.
Steps of the Data Engineer
1. Cooking up the raw ingredients – Prepping data like a chef: providing the data
We use the California Housing dataset from scikit-learn (BSD license) so that we don’t have to install tools like Apache Airflow or Apache NiFi, but can run everything in a Python environment. This allows you to play through the example even if you’re just diving into this world.
I use Anaconda to manage my various Python environments. If you don’t know why and how to install a specific environment, you can have a look at this article ‘Python Data Analysis Ecosystem – A Beginner’s Roadmap‘. For this practical example, you need to have the following packages installed: pandas, numpy, matplotlib, seaborn, scikit-learn and jupyterlab, plus of course Python itself (e.g. version 3.9) – sqlite3 is already included in Python’s standard library.
from sklearn.datasets import fetch_california_housing
import pandas as pd
# Loading California Housing Dataset
data = fetch_california_housing(as_frame=True)
df = data.frame
print(df.head())
2. Decluttering the fridge – Cleaning up the data
Next, we remove duplicate values with ‘drop_duplicates()’ and fill in missing values with an average value:
# Removing duplicate rows
df = df.drop_duplicates()
print(f"Number of rows after removing duplicates: {len(df)}")
# Displaying missing values
print("Missing values per column:")
print(df.isnull().sum())
# Filling missing values with the average value of the respective column
df = df.fillna(df.mean())
print("Missing values were replaced with the average value.")
This data set is provided by scikit-learn and therefore has good data quality. In practice, however, this is often not the case – data is often incomplete, inconsistent or incorrect. For example, the data may be in the wrong format and you may have to convert it into a float or datetime, or you may have to convert categorical variables such as ‘YES/NO’ into numerical values for machine learning models. It is also possible that variables such as income and square meters have very different orders of magnitude. Here you have to standardize the values, for example, so that they are in comparable ranges.
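As a rough sketch of what such cleaning steps could look like in pandas – the columns here are made up for illustration and are not part of the California Housing dataset:
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Made-up raw data with the quality issues described above
raw = pd.DataFrame({
    'price':  ['250000', '310000', '275000'],  # numbers stored as strings
    'sold':   ['YES', 'NO', 'YES'],            # categorical variable
    'income': [45000, 120000, 65000],          # large order of magnitude
    'sqm':    [80, 120, 95],                   # small order of magnitude
})
# Converting the wrong format into a float
raw['price'] = raw['price'].astype(float)
# Converting the categorical YES/NO variable into numerical values
raw['sold'] = raw['sold'].map({'YES': 1, 'NO': 0})
# Standardizing variables with very different orders of magnitude
scaler = StandardScaler()
raw[['income', 'sqm']] = scaler.fit_transform(raw[['income', 'sqm']])
print(raw)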
3. Stocking the pantry – Saving the data so it’s ready for the next chef (Data Scientist)
We now save the cleansed data in the SQLite database:
import sqlite3
# Creating a connection to the SQLite database
conn = sqlite3.connect('california_housing.db')
# Saving data in the SQLite database
df.to_sql('housing', conn, if_exists='replace', index=False)
print("Data successfully saved in the SQLite database.")
If you are working with a lot of complex data that is stored in the database, as a data engineer you may need to implement additional points:
- Organize data into a suitable structure, e.g. by creating indices on important columns to speed up queries (see the sketch after this list).
- Add primary or foreign keys to define the relationships between the tables.
- For larger projects, you could set up an automated data import. For example, to regularly pull data from an API.
- You could also configure the access rights to the database, for example, to ensure that only authorized data scientists can access the data.
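As a minimal sketch of the first two points with SQLite – this assumes the open connection and the ‘housing’ table from above, and the index name is made up:
# Creating an index on an important column to speed up queries
conn.execute("CREATE INDEX IF NOT EXISTS idx_housing_medinc ON housing (MedInc)")
# Enforcing foreign key constraints (relevant once several related tables exist)
conn.execute("PRAGMA foreign_keys = ON")
conn.commit()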
As the work for the data engineer is now complete, we close the connection to the database:
conn.close()
For larger projects, we can use MySQL or PostgreSQL. For our application, however, SQLite is completely sufficient.
Steps of the Data Scientist
Now we take on the role of a data scientist. They take the data provided by the data engineer and aim to generate insights from it.
4. Digging into the ingredients – Analyzing and visualizing the data
In this step, we want to gain a basic understanding of the data and recognize patterns. We therefore first load the data from the SQLite database:
# Connecting to the database
conn = sqlite3.connect('california_housing.db')
# Executing a SQL query
df = pd.read_sql_query("SELECT * FROM housing", conn)
print(df.head())
To analyze the data, we use ‘describe()’ to output a statistical summary with the mean, the standard deviation, the minimum & maximum and the quartiles.
print(df.describe())
We use matplotlib and seaborn to create a scatterplot and a histogram:
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram of the target variable (Median House Value)
plt.hist(df['MedHouseVal'], bins=30, color='blue', edgecolor='black')
plt.title('Distribution of the Median House Values')
plt.xlabel('Median House Value')
plt.ylabel('Frequency')
plt.show()
# Scatterplot: House Value vs. Median Income
sns.scatterplot(x='MedInc', y='MedHouseVal', data=df)
plt.title('Median House Value vs. Median Income')
plt.xlabel('Median Income')
plt.ylabel('Median House Value')
plt.show()
We could extend this step almost indefinitely. Of course, we would first have to carry out an exploratory data analysis (EDA) to better understand the data. We could also output further visualizations such as heat maps or correlations. Instead of visualizing the data with Python, we could also use Tableau or Power BI to put together interactive dashboards.
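As a small sketch, such a correlation heatmap for our dataset could look like this (reusing the imports from above):
# Heatmap of the correlations between all numerical columns
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlations in the California Housing Dataset')
plt.show()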
In one of my old articles, you will find 9 essential steps for EDA that you can apply before using an ML model.
5. Whipping up some magic & serving the final dish – Training & creating a machine learning model
In the last step, we want to train a model that predicts house prices based on one or more variables.
To do this, we first prepare the data for the model by defining the independent variable X and the dependent variable y. We then use sklearn’s ‘train_test_split’ to split the data into training and test data in an 80–20 ratio. We specify the ‘random_state’ so that the split is reproducible: to obtain consistent results, it is important that the same training and test data are selected for each run. With ‘fit()’ we train a linear regression model and with ‘predict()’ we generate the predictions. We also calculate the mean squared error (MSE) to see how good our prediction model is. At the end, we visualize the predictions.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Defining the independent and dependent variable
X = df[['MedInc']] # Median Income as Feature
y = df['MedHouseVal'] # Target variable
# Splitting the data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Creating and training a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predictions based on test data
y_pred = model.predict(X_test)
# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
plt.scatter(y_test, y_pred, alpha=0.5)
plt.title('Actual vs. predicted values')
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.show()
Alternatively, we could use more complex models such as Random Forest or Gradient Boosting to generate more accurate predictions. If you want to know how to use Random Forest to generate predictions on this dataset, you can find a step-by-step guide in the article ‘Beginner’s Guide to Predicting House Prices with Random Forest: Step-by-Step Introduction with a Built-in scikit-learn Dataset’. We could also do feature engineering to improve the predictions (could also fall into the role of the data engineer).
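If you just want a rough idea of what such a swap could look like, here is a minimal sketch that reuses the train/test split from above – the hyperparameters are only examples:
from sklearn.ensemble import RandomForestRegressor
# Training a Random Forest on the same training data as the linear model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Evaluating it with the same metric for a direct comparison
rf_pred = rf_model.predict(X_test)
print(f"Random Forest MSE: {mean_squared_error(y_test, rf_pred)}")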

2) What does a data engineer do? What does a data scientist do?
If we look at a data workflow, the data engineer is responsible for the first step (and maybe for the second step): ensuring that the collected data is correctly ingested and stored in a system in such a way that it is easily accessible and ready to be analyzed. The Data Engineer lays the foundation for Data Analysts, Data Scientists & Machine Learning Engineers, so to speak: they build and optimize the infrastructure that enables data scientists to access and analyze data.

What are the main responsibilities of a Data Engineer?
The main tasks as a Data Engineer are to ensure that data is collected, stored and prepared in a way that makes it accessible for Data Scientists and Analysts to use.
Ingest data from various data sources and store it in a system You need to import data from APIs, databases, flat files like CSV, JSON, XML or streams like Kafka. Kafka Streams is an API library that can be used to consume and process data from Kafka in real time and write the results to other systems.
For example, you develop a workflow that collects customer data from a Salesforce API, transactional data from a PostgreSQL database, and website tracking data from a Kafka stream and stores all the data in a centralized data warehouse like Snowflake.
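A heavily simplified sketch of such an ingestion step could look like this – the API URL and the table name are made up, and a real setup would write to a data warehouse instead of a local SQLite file:
import sqlite3
import pandas as pd
import requests
# Pulling customer data from a (hypothetical) REST API
response = requests.get("https://api.example.com/customers")  # made-up endpoint
customers = pd.DataFrame(response.json())
# Storing it centrally so that analysts and scientists can access it
conn = sqlite3.connect('warehouse.db')
customers.to_sql('customers', conn, if_exists='replace', index=False)
conn.close()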
Set up databases and optimize them for subsequent analysis This includes defining tables and relationships in a meaningful way (=creating a DB schema), setting so-called indices for faster queries and splitting large amounts of data into smaller parts (partitioning). You can use indices in a similar way to a table of contents that takes you directly to the right place for a query. You may also need to be able to use NoSQL databases such as MongoDB or Cassandra to store semi-structured data. If the data is too large for individual machines, you need to be able to use frameworks such as Apache Spark or Hadoop to distribute the processing across multiple computers (= distributed computing).
For example, for a company that sells digital devices and stores millions of customer and order data, you set up a database, optimize the queries with indices and split the data by year so that a data analyst can perform sales analyses quickly.
Create & manage data pipelines You need to be able to create automated workflows and set up ETL, ELT or Zero ETL processes. In this article ‘Why ETL-Zero? Understanding the Shift in Data Integration‘ you will find an introduction to what ETL-Zero or ETL-Copy means.
Remove corrupt data It is also your responsibility to find incorrect or missing data and clean it up before it is processed further. You can use tools such as Great Expectations, which automatically checks whether data complies with the desired rules, or you can write simple scripts to correct such errors.
For example, you find a customer data table with entries with invalid email addresses. With a Python script, you can recognize and mark these entries so that they can be removed or checked manually.
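A minimal sketch of such a script – the customer data is made up, and the pattern is deliberately kept simple:
import re
import pandas as pd
# Made-up customer data with one invalid email address
customers = pd.DataFrame({
    'name': ['Anna', 'Ben', 'Carla'],
    'email': ['anna@example.com', 'ben@invalid', 'carla@example.org'],
})
# Very simple pattern – real-world email validation is more involved
pattern = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')
# Marking invalid entries instead of silently deleting them
customers['email_valid'] = customers['email'].apply(lambda e: bool(pattern.match(e)))
print(customers[~customers['email_valid']])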
Develop, design, test and maintain data architectures Specifically, this means that you determine how and where the data is stored, processed and forwarded. You also need to ensure, for example, that the system remains stable as data volumes increase and that new requirements can be easily integrated.
For example, you work for an e-commerce company where customers and order data come from different sources: The customer data comes from a Salesforce API, the product data from a database and the payment data from a payment provider such as Stripe or PayPal. Your task is now to bring this data into a central data warehouse such as Snowflake via a pipeline and ensure that the pipeline remains stable and fast even if the number of orders doubles.
Your main customers as a Data Engineer are Data Scientists and Analysts. You have to ensure that the data is available in the right form and quality so that it can then be evaluated, visualized and further processed using machine learning models.
Where can you continue learning?
- Datacamp Data Engineer Certification Path (not free, but I can recommend it; no affiliate link)
- Linkedin Learning Apache Spark (no affiliate link)
- IBM Training Apache Spark
- Datacamp Blog Apache Airflow vs. Apache Nifi
- Datacamp Blog Data Engineering Tools
What are the main responsibilities of a data scientist?
As a Data Scientist, you transform raw data into actionable insights that drive decision-making. It’s also about telling a story with the data that stakeholders can understand and act upon.
Analyze, evaluate, visualize data & communicate results As a Data Scientist, you need to analyze data to find trends, anomalies or patterns. You may also need to be able to create interactive dashboards using tools such as Tableau, Power BI or Python libraries such as Plotly. You must be able to present the results of the analyses and the models developed to other analysts, management or other stakeholders in an understandable form. Your keyword here is storytelling with data.
For example, your line manager wants to know how sales have developed over the last 6 months and whether there are any seasonal patterns. To do this, you retrieve the sales data from a database, analyze the sales per month and put together an interactive dashboard in Tableau or with Python that visualizes the sales.
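A rough sketch of the analysis part in Python – the database, table and column names here are assumptions for illustration:
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
# Retrieving the sales data (names are made up)
conn = sqlite3.connect('sales.db')
sales = pd.read_sql_query("SELECT order_date, order_amount FROM sales_data", conn)
conn.close()
# Aggregating the sales per month to look for seasonal patterns
sales['order_date'] = pd.to_datetime(sales['order_date'])
monthly = sales.set_index('order_date')['order_amount'].resample('M').sum()
monthly.plot(kind='bar', title='Sales per Month')
plt.ylabel('Total Sales')
plt.show()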
Access databases You need to master SQL so that you can query databases and retrieve the desired data for your analyses. It is also important that you can write efficient queries to minimize performance problems.
For example, you use the following SQL query to retrieve the sales of the last 6 months:
SELECT order_date, SUM(order_amount) AS total_sales
FROM sales_data
WHERE order_date >= DATEADD(month, -6, GETDATE())
GROUP BY order_date
ORDER BY order_date;
Creating machine learning models and feature engineering As a data scientist, you have to develop models that answer relevant business questions. You must also be able to create new features from the raw data that can improve the performance of the machine learning models (feature engineering).
For example, you will develop an ML model that predicts the turnover of an online shop based on the number of website visits. To do this, you will create a new feature (=feature engineering) that calculates the average order quantity per visit and then train a linear regression model using the selected features and the raw data.
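As a minimal sketch with made-up shop data:
import pandas as pd
from sklearn.linear_model import LinearRegression
# Made-up raw data of an online shop
shop = pd.DataFrame({
    'visits':  [1000, 1500, 1200, 1800],
    'orders':  [100, 180, 130, 200],
    'revenue': [5000, 9500, 6800, 11000],
})
# Feature engineering: average order quantity per visit as a new feature
shop['orders_per_visit'] = shop['orders'] / shop['visits']
# Training a linear regression on the raw and the engineered feature
X = shop[['visits', 'orders_per_visit']]
y = shop['revenue']
model = LinearRegression().fit(X, y)
print(model.predict(pd.DataFrame({'visits': [1400], 'orders_per_visit': [0.12]})))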
Carrying out A/B tests and statistical analyses You must also be able to formulate and test hypotheses and measure success.
For example, you test whether a new version of a landing page (variant B) generates more sales than the current version (variant A). To do this, you collect data from 1000 visitors. The result shows that variant A has a conversion rate of 20%, while variant B achieves a rate of 26%. Finally, you use a statistical test such as the chi-square test to check whether the difference is actually significant.
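As a sketch of this test with scipy (not part of the package list above), assuming 500 visitors per variant:
from scipy.stats import chi2_contingency
# 500 visitors per variant: 20% vs. 26% conversion rate
#            converted  not converted
observed = [[100, 400],   # Variant A
            [130, 370]]   # Variant B
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"p-value: {p_value}")  # below 0.05 → the difference is likely significant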
As a data scientist, your main customer tends to be management and other analysts. You often use the results of your work to support strategic decisions.
Where can you continue learning?
- Datacamp Machine Learning Course (no affiliate link)
- Introduction University Course about Machine Learning (Numpy, Pandas, Minimal Plotting, EDA, Regression Models, Classification, Clustering)
- GeeksForGeeks SQL for Data Science
- IBM Blog What is Feature Engineering?
3) Career guide: How to choose between Data Engineer and Data Scientist
The two roles also overlap – especially as modern data projects also require close collaboration. For example, SQL and Python are needed in both roles, although in slightly different contexts. Both also need to know how to cleanse and validate data.
Checklist – Which role suits you?
- Do you love diving deep into technical systems and organizing data so that it can be processed efficiently? For example, you set up a database that stores millions of customer records and optimize it so that queries are fast.
- Do you enjoy working with large amounts of data that need to be processed in real time? For example, you build a pipeline that transfers real-time data from an IoT application to a data warehouse.
- Do you enjoy solving technical challenges such as integrating data from different sources? For example, you ensure that customer data from the app, CRM and e-commerce website flows into a central platform.
- Do you like to work systematically, and are you detail-oriented when it comes to documenting data flows & resolving bottlenecks? For example, you analyze why a database query takes too long and then implement indices to speed things up.
→ If you tick the boxes for these questions, you are probably more of a data engineer.
In the graphic, you can see some of the most important tools and skills you need to be able to do in each role.

- Do your eyes light up when you can analyze data & extract trends, patterns or insights from large data sets? For example, you analyze customer data to find out which products sell better in certain regions.
- Do you enjoy working with mathematical models, statistics or machine learning to solve problems and make predictions? For example, you develop a model that predicts a company’s sales based on historical data.
- Do you enjoy visualizing data for interactive dashboards or reports and telling stories from the numbers? For example, you visually show how the sales figures of a particular department have developed in recent months.
- Do you enjoy working closely with management, and do you want to influence strategic decisions? For example, you use the data to explain why a particular marketing campaign was particularly successful.
- Are you creative, and do you like to experiment with hypotheses or new approaches to answer complex questions? For example, you use the data to find out which factors have the greatest influence on the success of a product.
→ If you answer yes to these questions, Data Science is probably the right choice for you.
If you are still unsure, it is probably helpful to gain further experience in both areas.
4) Final Thoughts
The two roles complement each other in a company. One role cannot really function without the other. A small analogy makes clear why both roles are necessary: the Data Engineer is the chef who organizes the kitchen and ensures that all the ingredients are prepared, fresh and ready to hand. The Data Scientist, on the other hand, is the chef who creatively combines the ingredients to create new and exciting dishes. Without the Data Engineer, the Data Scientist has no high-quality ingredients to work with; and without the Data Scientist, the carefully prepared ingredients are never transformed into valuable insights & results.
Which role do you already fulfill or would you like to develop further?