Polars vs. Pandas — An Independent Speed Comparison https://towardsdatascience.com/polars-vs-pandas-an-independent-speed-comparison/ Tue, 11 Feb 2025

Overview
  1. Introduction — Purpose and Reasons
  2. Datasets, Tasks, and Settings
  3. Results
  4. Conclusions
  5. Wrapping Up

Introduction — Purpose and Reasons

Speed is important when dealing with large amounts of data. If you are handling data in a cloud data warehouse or similar, then the speed of execution for your data ingestion and processing affects the following:

  • Cloud costs: This is probably the biggest factor. More compute time equals more costs in most billing models. In billing models based on a certain amount of preallocated resources, you could have chosen a lower service level if your ingestion and processing had been faster.
  • Data timeliness: If you have a real-time stream that takes 5 minutes to process data, then your users will have a lag of at least 5 minutes when viewing the data through e.g. a Power BI report. This difference can matter a lot in certain situations. Even for batch jobs, data timeliness is important. If you are running a batch job every hour, it is a lot better if it takes 2 minutes rather than 20 minutes.
  • Feedback loop: If your batch job takes only a minute to run, then you get a very quick feedback loop. This probably makes your job more enjoyable. In addition, it enables you to find logical mistakes more quickly.

As you’ve probably understood from the title, I am going to provide a speed comparison between the two Python libraries Polars and Pandas. If you know anything about Pandas and Polars from before, then you know that Polars is the (relatively) new kid on the block proclaiming to be much faster than Pandas. You probably also know that Polars is implemented in Rust, which is a trend for many other modern Python tools like uv and Ruff.

There are two distinct reasons that I want to do a speed comparison test between Polars and Pandas:

Reason 1 — Investigating Claims

Polars boasts the following claim on its website: Compared to pandas, it (Polars) can achieve more than 30x performance gains.

As you can see, you can follow a link to the benchmarks that they have. It’s commendable that their speed tests are open source. But if you are writing the comparison tests for both your own tool and a competitor’s tool, then there might be a slight conflict of interest. I’m not saying that they are purposefully overselling the speed of Polars, but rather that they might have unconsciously selected for favorable comparisons.

Hence the first reason to do a speed comparison test is simply to see whether this supports the claims presented by Polars or not.

Reason 2 — Greater granularity

Another reason for doing a speed comparison test between Polars and Pandas is to make it slightly more transparent where the performance gains might be.

This might be already clear if you’re an expert on both libraries. However, speed tests between Polars and Pandas are mostly of interest to those considering switching up their tool. In that case, you might not yet have played around much with Polars because you are unsure if it is worth it.

Hence the second reason to do a speed comparison is simply to see where the speed gains are located.

I want to test both libraries on different tasks within both data ingestion and data processing. I also want to consider datasets that are both small and large. I will stick to common tasks within data engineering, rather than esoteric tasks that one seldom uses.

What I will not do

  • I will not give a tutorial on either Pandas or Polars. If you want to learn Pandas or Polars, then a good place to start is their documentation.
  • I will not cover other common data processing libraries. This might be disappointing to a fan of PySpark, but having a distributed compute model makes comparisons a bit more difficult. You might find that PySpark is quicker than Polars on tasks that are very easy to parallelize, but slower on other tasks where keeping all the data in memory reduces travel times.
  • I will not provide full reproducibility. Since this is, in humble words, only a blog post, I will only explain the datasets, tasks, and system settings that I have used. I will not host a complete running environment with the datasets and bundle everything neatly. This is not a precise scientific experiment, but rather a guide that only cares about rough estimations.

Finally, before we start, I want to say that I like both Polars and Pandas as tools. I’m obviously not financially or otherwise compensated by either of them, and don’t have any incentive other than being curious about their performance ☺

Datasets, Tasks, and Settings

Let’s first describe the datasets that I will be considering, the tasks that the libraries will perform, and the system settings that I will be running them on.

Datasets

At most companies, you will need to work with both small and (relatively) large datasets. In my opinion, a good data processing tool can tackle both ends of the spectrum. Small datasets challenge the start-up time of tasks, while larger datasets challenge scalability. I will consider two datasets, both of which can be found on Kaggle:

  • A small dataset in CSV format: It is no secret that CSV files are everywhere! Often they are quite small, coming from Excel files or database dumps. What better example of this than the classic iris dataset (licensed with CC0 1.0 Universal License) with 5 columns and 150 rows. The iris version I linked to on Kaggle has 6 columns, but the classic one does not have a running index column. So remove this column if you want precisely the same dataset as I have. The iris dataset is certainly small data by any measure.
  • A large dataset in Parquet format: The parquet format is super useful for large data as it has built-in column-wise compression (along with many other benefits). I will use the Transaction dataset (licensed with Apache License 2.0) representing financial transactions. The dataset has 24 columns and 7,483,766 rows. It is close to 3 GB in its CSV format found on Kaggle. I used Pandas & Pyarrow to convert this to a parquet file (a sketch of this conversion is shown right after this list). The final result is only 905 MB due to the compression of the parquet file format. This is at the low end of what people call big data, but it will suffice for us.
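As a side note, the CSV-to-parquet conversion mentioned above only takes a few lines. This is just a sketch of the idea, not the exact script I used, and the file paths are placeholders:

import pandas as pd

# Read the roughly 3 GB CSV file from Kaggle and write it back out as parquet.
# The pyarrow engine handles the column-wise compression.
transactions = pd.read_csv("transactions.csv")
transactions.to_parquet("transactions.parquet", engine="pyarrow")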

Tasks

I will do a speed comparison on five different tasks. The first two are I/O tasks, while the last three are common tasks in data processing. Specifically, the tasks are:

  1. Reading data: I will read both files using the respective methods read_csv() and read_parquet() from the two libraries. I will not use any optional arguments as I want to compare their default behavior.
  2. Writing data: I will write both files back to identical copies as new files using the respective methods to_csv() and to_parquet() for Pandas and write_csv() and write_parquet() for Polars. I will not use any optional arguments as I want to compare their default behavior.
  3. Computing Numeric Expressions: For the iris dataset I will compute the expression SepalLengthCm ** 2 + SepalWidthCm as a new column in a copy of the DataFrame. For the transactions dataset, I will simply compute the expression (amount + 10) ** 2 as a new column in a copy of the DataFrame. I will use the standard way to transform columns in Pandas, while in Polars I will use the standard functions all(), col(), and alias() to make an equivalent transformation.
  4. Filters: For the iris dataset, I will select the rows corresponding to the criteria SepalLengthCm >= 5.0 and SepalWidthCm <= 4.0. For the transactions dataset, I will select the rows corresponding to the categorical criteria merchant_category == 'Restaurant'. I will use the standard filtering method based on Boolean expressions in each library. In pandas, this is syntax such as df_new = df[df['col'] < 5], while in Polars this is given similarly by the filter() function along with the col() function. I will use the and-operator & for both libraries to combine the two numeric conditions for the iris dataset.
  5. Group By: For the iris dataset, I will group by the Species column and calculate the mean values for each species of the four columns SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm. For the transactions dataset, I will group by the column merchant_category and count the number of instances in each of the classes within merchant_category. Naturally, I will use the groupby() function in Pandas and the group_by() function in Polars in obvious ways. A rough sketch of how these tasks look in both libraries follows this list.
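To make the tasks above more concrete, here is a hedged sketch of how they might look in the two libraries. The column names come from the iris dataset, the file paths are placeholders, and this is not the exact benchmark code:

import pandas as pd
import polars as pl

# Tasks 1 and 2: reading and writing with default arguments
df_pd = pd.read_csv("iris.csv")
df_pl = pl.read_csv("iris.csv")
df_pd.to_csv("iris_pandas_copy.csv")
df_pl.write_csv("iris_polars_copy.csv")

# Task 3: computing a numeric expression as a new column
expr_pd = df_pd.copy()
expr_pd["New"] = expr_pd["SepalLengthCm"] ** 2 + expr_pd["SepalWidthCm"]
expr_pl = df_pl.with_columns(
    (pl.col("SepalLengthCm") ** 2 + pl.col("SepalWidthCm")).alias("New")
)

# Task 4: filtering on two numeric conditions combined with &
filtered_pd = df_pd[(df_pd["SepalLengthCm"] >= 5.0) & (df_pd["SepalWidthCm"] <= 4.0)]
filtered_pl = df_pl.filter(
    (pl.col("SepalLengthCm") >= 5.0) & (pl.col("SepalWidthCm") <= 4.0)
)

# Task 5: grouping by species and taking the mean of the numeric columns
grouped_pd = df_pd.groupby("Species")[
    ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
].mean()
grouped_pl = df_pl.group_by("Species").agg(
    pl.col("SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm").mean()
)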

Settings

  • System Settings: I’m running all the tasks locally with 16GB RAM and an Intel Core i5-10400F CPU with 6 cores (12 logical cores through hyperthreading). So it’s not state-of-the-art by any means, but good enough for simple benchmarking.
  • Python: I’m running Python 3.12. This is not the most current stable version (which is Python 3.13), but I think this is a good thing. Commonly the latest supported Python version in cloud data warehouses is one or two versions behind.
  • Polars & Pandas: I’m using Polars version 1.21 and Pandas 2.2.3. These are roughly the newest stable releases to both packages.
  • Timeit: I’m using the standard timeit module in Python and finding the median of 10 runs, roughly as sketched right after this list.
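For reference, a minimal sketch of this timing approach could look like the following. The task function and file path are hypothetical placeholders, not the exact benchmark code:

import statistics
import timeit

import polars as pl

def read_transactions():
    # The task being measured, e.g. reading the large parquet file
    pl.read_parquet("transactions.parquet")

# Run the task 10 times and report the median run time
runs = timeit.repeat(read_transactions, repeat=10, number=1)
print(f"Median run time: {statistics.median(runs):.4f} seconds")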

Especially interesting will be how Polars can take advantage of the 12 logical cores through multithreading. There are ways to make Pandas take advantage of multiple processors, but I want to compare Polars and Pandas out of the box without any external modification. After all, this is probably how they are running in most companies around the world.

Results

Here I will write down the results for each of the five tasks and make some minor comments. In the next section I will try to summarize the main points into a conclusion and point out a disadvantage that Polars has in this comparison:

Task 1 — Reading data

The median run time over 10 runs for the reading task was as follows:

# Iris Dataset
Pandas: 0.79 milliseconds
Polars: 0.31 milliseconds

# Transactions Dataset
Pandas: 14.14 seconds
Polars: 1.25 seconds

For reading the Iris dataset, Polars was roughly 2.5x faster than Pandas. For the transactions dataset, the difference is even starker where Polars was 11x faster than Pandas. We can see that Polars is much faster than Pandas for reading both small and large files. The performance difference grows with the size of the file.

Task 2 — Writing data

The median run time over 10 runs for the writing task was as follows:

# Iris Dataset
Pandas: 1.06 milliseconds
Polars: 0.60 milliseconds

# Transactions Dataset
Pandas: 20.55 seconds
Polars: 10.39 seconds

For writing the iris dataset, Polars was around 75% faster than Pandas. For the transactions dataset, Polars was roughly 2x as fast as Pandas. Again we see that Polars is faster than Pandas, but the difference here is smaller than for reading files. Still, a difference of close to 2x in performance is a massive difference.

Task 3 — Computing Numeric Expressions

The median run time over 10 runs for the computing numeric expressions task was as follows:

# Iris Dataset
Pandas: 0.35 milliseconds
Polars: 0.15 milliseconds

# Transactions Dataset
Pandas: 54.58 milliseconds
Polars: 14.92 milliseconds

For computing the numeric expressions, Polars beats Pandas by roughly 2.5x for the iris dataset and roughly 3.5x for the transactions dataset. This is a pretty massive difference. It should be noted that computing numeric expressions is fast in both libraries, even for the large transactions dataset.

Task 4 — Filters

The median run time over 10 runs for the filters task was as follows:

# Iris Dataset
Pandas: 0.40 milliseconds
Polars: 0.15 milliseconds

# Transactions Dataset
Pandas: 0.70 seconds
Polars: 0.07 seconds

For filters, Polars is 2.6x faster on the iris dataset and 10x as fast on the transactions dataset. This is probably the most surprising improvement for me since I suspected that the speed improvements for filtering tasks would not be this massive.

Task 5 — Group By

The median run time over 10 runs for the group by task was as follows:

# Iris Dataset
Pandas: 0.54 milliseconds
Polars: 0.18 milliseconds

# Transactions Dataset
Pandas: 334 milliseconds 
Polars: 126 milliseconds

For the group-by task, there is a 3x speed improvement for Polars in the case of the iris dataset. For the transactions dataset, there is a 2.6x improvement of Polars over Pandas.

Conclusions

Before highlighting each point below, I want to point out that Polars is somewhat in an unfair position throughout my comparisons. In practice, multiple data transformations are often performed one after another. For this, Polars has the lazy API that optimizes the whole pipeline before calculating. Since I have considered single ingestions and transformations, this advantage of Polars is hidden. How much this would improve practical situations is not clear, but it would probably make the difference in performance even bigger.
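To illustrate what this means, here is a hedged sketch of how several of the transformations above could be chained with the Polars lazy API, so the query optimizer sees the whole pipeline before anything is executed. The file path and column names are assumptions based on the transactions dataset:

import polars as pl

result = (
    pl.scan_parquet("transactions.parquet")  # lazy scan - nothing is read yet
    .filter(pl.col("merchant_category") == "Restaurant")
    .with_columns(((pl.col("amount") + 10) ** 2).alias("amount_transformed"))
    .group_by("merchant_category")
    .agg(pl.len().alias("num_transactions"))
    .collect()  # the optimized query plan is executed here
)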

Data Ingestion

Polars is significantly faster than Pandas for both reading and writing data. The difference is largest in reading data, where we had a massive 11x difference in performance for the transactions dataset. On all measurements, Polars performs significantly better than Pandas.

Data Processing

Polars is significantly faster than Pandas for common data processing tasks. The difference was starkest for filters, but you can at least expect a 2–3x difference in performance across the board.

Final Verdict

Polars consistently performs faster than Pandas on all tasks with both small and large data. The improvements are very significant, ranging from a 2x improvement to a whopping 11x improvement. When it comes to reading large parquet files or performing filter statements, Polars is leaps and bounds ahead of Pandas.

However… nowhere here is Polars remotely close to performing 30x better than Pandas, as Polars’ benchmarking suggests. I would argue that the tasks that I have presented are standard tasks performed on realistic hardware infrastructure. So I think that my conclusions give us some room to question whether the claims put forward by Polars give a realistic picture of the improvements that you can expect.

Nevertheless, I have no doubt that Polars is significantly faster than Pandas. Working with Polars is not more complicated than working with Pandas. So for your next data engineering project where the data fits in memory, I would strongly suggest that you opt for Polars rather than Pandas.

Wrapping Up

Photo by Spencer Bergen on Unsplash

I hope this blog post gave you a different perspective on the speed difference between Polars and Pandas. Please comment if you have a different experience with the performance difference between Polars and Pandas than what I have presented.

If you are interested in AI, data science, or data engineering, please follow me or connect on LinkedIn.

Like my writing? Check out some of my other posts:

Use Tablib to Handle Simple Tabular Data in Python https://towardsdatascience.com/use-tablib-to-handle-simple-tabular-data-in-python-fa9e6f0af37f/ Wed, 27 Nov 2024
Sometimes a Shallow Abstraction is more Valuable than Performance

Overview
  1. Introduction – What is Tablib?
  2. Working with Datasets
  3. Importing Data
  4. Exporting Data
  5. Dynamic Columns
  6. Formatters
  7. Wrapping Up

Introduction – What is Tablib?

For many years I have been working with tools like Pandas and PySpark in Python for data ingestion, data processing, and data exporting. These tools are great for complex data transformations and big data sizes (Pandas when the data fits in memory). However, I have often used these tools when the following conditions apply:

  • The data size is relatively small. Think well below 100,000 rows of data.
  • Performance is not an issue at all. Think of a one-off job or a job that repeats at midnight every night, where I don’t care if it takes 20 seconds or 5 minutes.
  • There are no complex transformations needed. Think of simply importing 20 JSON files with the same format, stacking them on top of each other, and then exporting this as a CSV file.

In these cases, tools like Pandas and (especially) PySpark are like shooting a fly with a cannon. This is where the library Tablib is perfect 🔥

The Python library Tablib is a small library that deals exclusively with small tabular datasets. It is much less performant than both Pandas and PySpark, but it does not aim for performance at all. The library is only around 1000 lines of code and is a shallow abstraction over Python. What is the advantage of this?

It is simple to understand the nitty-gritty details of Tablib since nothing is optimized. Tablib only gives you base functionality and falls back on Python logic for most things.

If you are unsure how something is implemented in Tablib, then a quick glance at the source code gives you the answer. In contrast, understanding how various methods in PySpark are implemented requires an understanding of Scala and the JVM. And loads of free time 😬

Tablib is also focused on importing and exporting tabular data in different formats. While big data is certainly more sexy than small data, the reality is that even in companies with big data, small data sets are abundant. Spinning up a Spark cluster every night to read a CSV file with 200 lines is just a waste of money and sanity.

Tablib should simply be another tool in your toolbox that can be used in the case of small data with few performance requirements and no complex data transformations. It is surprising how often these conditions are satisfied.

In this blog post, I will tell you everything you need to know about Tablib. Unlike Pandas and PySpark, I will manage to explain 80% of Tablib in a single blog post since it is such a simple and shallow abstraction. After reading this blog post, you should be able to easily use this tool where it is useful.

You can find all the code and artifacts used in this blog post in the following Github repository. In addition to this blog post, I’ve also made a free video series on YouTube on the same topic that is gradually coming out. If you prefer videos, then you can check that out instead:

Alright! Let’s get started 😃


Working with Datasets

To get started, head over to the Tablib homepage. You can read a quick overview there before going to the installation page. I would recommend using the following command to install Tablib:

pip install "tablib[all]"

Now that Tablib is installed, we can first learn about the main class called Dataset. We can open a file and write the following:

from tablib import Dataset

# Creating a dataset
data = Dataset()

# Adding headers
data.headers = ["Name", "Phone Number"]

# Adding rows
data.append(["Eirik", "74937475"])
data.append(["Stine", "75839478"])

# Have nice printing
print(data)

# Get a standard Python representation
print(data.dict)

Here we import Dataset from tablib and create a dataset. Then we add headers for the columns and add two rows with the .append() method. When we print out data to the console, we get a nicely formatted tabular data structure. We can also get a native Python representation by using the .dict attribute.

The variable data will be our format-agnostic container for the data that we will later import. First, let us see how we can add new columns, select both columns and rows, and delete both columns and rows. This data manipulation will be useful when we want to make simple data transformations to imported data.

To add a column, we will use the method .append_col(). It takes a list of values and a keyword argument representing the header name of the column:

data.append_col([29, 30], header="Age")
print(data)

If you print out data again, then you will see that we have added a column representing the age of the individuals. So use .append() to append rows and .append_col() to append columns.

To select columns, we can simply use the notation that you are probably familiar with from either Python dictionaries or from working with Pandas. For selecting rows, we can use index notation or slicing:

# Selecting columns and rows
print(data["Age"])
print(f"Average age: {sum(data['Age']) / len(data['Age'])}")
print(data[0])
print(data[0:2])

As you can see, the value data["Age"] is simply a Python list. We can thus use built-in functions like sum() and len() to work with this and calculate the average age.

Notice that this reduces data transformations to pure Python logic, which is not particularly fast. But speed is not the aim of Tablib; predictability and simplicity are 💪

Finally, to delete both columns and rows, we can use the built-in del keyword in Python as follows:

# Delete columns and rows
del data["Age"]
print(data)
del data[1]
print(data)

To summarize so far, we initialize an instance of the Dataset class, then add rows and columns with simple append-type methods. We can easily select and remove both columns and rows as we please.

Notice also that since everything is handled with Python lists, which are not data type homogeneous, there is nothing that requires data types to be consistent within a column. We can have a column that holds many different data types. You can add your own data validation logic with formatters, as we will see later.
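A small illustration of this point (the values here are made up):

from tablib import Dataset

mixed = Dataset(headers=["Name", "Age"])
mixed.append(["Eirik", 29])        # an integer in the Age column
mixed.append(["Stine", "thirty"])  # a string in the same column - no error
print(mixed.dict)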


Importing Data

It’s time to import some data. The way we created data from scratch in the previous section is not typical. The usual workflow is that we import data and use the tools from the previous section to modify or retrieve pieces of information.

Let’s say that we have a folder called /artifacts for the rest of the blog post. Within that folder are various files we want to import. First, let us assume that there is a CSV file called simple_example.csv with both headers and rows. We can import this with the following simple piece of code:

from tablib import Dataset

# Import a single CSV file
with open('artifacts/simple_example.csv', 'r') as file:
    imported_data = Dataset().load(file)
print(imported_data)

As you can see, we use the .load() method to load the file into a newly created dataset.

The important thing to notice here is that we don’t have a separate method for each file type! There is not a .load_csv() method and a separate .load_json() method. There is simply a .load() method that detects the file type. This makes reusability very convenient.

To import a JSON file called standard.json in the artifacts folder we could simply write the following:

# Import a single JSON file
with open('artifacts/standard.json', 'r') as file:
    json_file = Dataset().load(file)
print(json_file)

So you don’t need to learn separate methods for separate datatypes.

One thing to note is that if you have a CSV file that does not have headers, you need to specify the headers=False keyword argument in the .load() method. Afterwards, you can set headers if you want. Here you can see an example of this with a file called no_header.csv:

# Import a CSV file with no headers
with open('artifacts/no_header.csv', 'r') as file:
    no_header_data = Dataset().load(file, headers=False)
no_header_data.headers = ['Name', 'Age']
print(no_header_data)

Finally, a common issue is that you have multiple files that need to be imported and combined. Say that we have a subfolder /artifacts/multiple/ with three CSV files. In Tablib, there is not a separate method for this situation. You have to use basic Python logic to load the files, and then you can use the .extend() method to combine them as follows. Here we can use the built-in library pathlib to manage this:

from pathlib import Path

# Work with multiple files
combined_data = Dataset(headers=('first_name', 'last_name'))
for path in Path('artifacts/multiple/').iterdir():
    with open(path, 'r') as file:
        temp_data = Dataset().load(file)
        combined_data.extend(temp_data)
print(combined_data)

Exporting Data

Now it is time to export data. The cool thing is that Tablib has a single method called .export() for exporting that makes this super easy:

from tablib import Dataset

# Import a single CSV file
with open('artifacts/simple_example.csv', 'r') as file:
    imported_data = Dataset().load(file)

# Write as a JSON file
print(imported_data.export('json'))

Notice that the .export() method does not involve the file system at all! It does not let you specify a file to export the data to. What is happening? 😮

The method .export() simply converts the data to a string with the specified format you require. So far we just print this string out to the console. We need to use standard Python logic to write this information to a file.

Again this shows that Tablib wants to remain simple: no interaction with the file system here, you have to use Python logic for this. If you are used to working with Python, this gives you control of this aspect. The cost of this tradeoff is performance, but again, Tablib does not care about performance.

To export the file as a JSON file, you can simply write this:

from tablib import Dataset

# Import a single CSV file
with open('artifacts/simple_example.csv', 'r') as file:
    imported_data = Dataset().load(file)

# Export to a JSON file
with open('artifacts/new_file.json', 'w') as file:
    file.write(imported_data.export('json'))

Simple, right? Here are some of the other file formats that Tablib supports out of the box:

from tablib import Dataset

# Import a single CSV file
with open('artifacts/simple_example.csv', 'r') as file:
    imported_data = Dataset().load(file)

# Write as a CSV file
print(imported_data.export('csv'))

# Or as JSON
print(imported_data.export('json'))

# Or as YAML
print(imported_data.export('yaml'))

# Or as HTML
print(imported_data.export('html'))

# Or as Excel
print(imported_data.export('xls'))

When you are writing to an Excel file, make sure that you use the "write binary" option with wb instead of w since Excel files are binary files.
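A short sketch of what that could look like, assuming the same CSV import as above and a hypothetical output path:

from tablib import Dataset

# Import a single CSV file
with open('artifacts/simple_example.csv', 'r') as file:
    imported_data = Dataset().load(file)

# Export to an Excel file - note the 'wb' mode since the export is binary
with open('artifacts/new_file.xls', 'wb') as file:
    file.write(imported_data.export('xls'))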

Finally, you should know that you can easily transition between Tablib and Pandas as follows:

from tablib import Dataset
from pandas import DataFrame

# Import a single CSV file
with open('artifacts/simple_example.csv', 'r') as file:
    imported_data = Dataset().load(file)

df = DataFrame(imported_data.dict)
print(df.head())

You should nevertheless use this sparingly. If you feel the need to constantly move over to Pandas for complex transformations, then maybe you should have used Pandas all along. Going to Pandas for a one-off function is fine, but too many conversions to Pandas are a symptom that you have chosen the wrong library.


Dynamic Columns

Now you know how to import data, do simple transformations, and export to various formats. This is the base functionality of Tablib. In addition to this, dynamic columns and formatters are convenient. Let’s first look at dynamic columns.

Let us assume that we have a CSV file called students.csv that looks like this:

student_id,score
84947,75
85345,64
84637,32
89274,98
84636,82
85146,55

We want to calculate a grade for each student based on the score. We could do this after loading the data. However, it would be nice to have the grade automatically calculated when new rows are introduced. To do this, we write a dynamic column as follows:

from tablib import Dataset

# Write a dynamic column
def calculate_grade(row):
    """Calculates the grade of a student based on the score."""
    score = int(row[1])
    if score > 93:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 66:
        return 'C'
    elif score >= 55:
        return 'D'
    elif score >= 40:
        return 'E'
    else:
        return 'F'

# Import a single CSV file
with open('artifacts/students.csv', 'r') as file:
    student_data = Dataset().load(file)

# Add the dynamically calculated column
student_data.append_col(calculate_grade, header="grade")

# Print out the data
print(student_data)

Here we have written the function calculate_grade() to accept individual rows and use that information to return a single value, namely the grade of the student. We can attach this as a dynamic column to the dataset student_data with the method .append_col() as described above.

Now calculate_grade() works as a callback function, so it is applied every time a new row is added:

# Add more rows
student_data.append(['81237', '86'])
print(student_data)

As you can see if you run the code, the grade is automatically calculated for the new student. If I want to specify the grade manually, I can do so as well:

# Can add the dynamic column myself
student_data.append(['81237', '56', 'D'])
print(student_data)

If I add the column value myself, the callback function does nothing. If I don’t, then the callback function takes on that responsibility. This is super convenient for automatic data augmentation. You can use this for complex use cases like machine learning predictions based on the other features of a row.


Formatters

Finally, I want to take a quick look at formatters. This is something that is not described in the documentation of Tablib, and you need to read the source code to find this feature. So I want to highlight this here, as this is also pretty convenient.

To understand formatters, you need to realize that Tablib does not do any data validation or data cleaning by default:

from tablib import Dataset

# Creating a dataset
data = Dataset()

# Adding headers
data.headers = ["Name", "Phone Number"]

# Add data with a whitespace error
data.append(["Eirik", "74937475 "])
data.append(["Stine", "75839478"])

# No data formatting - Whitespace is kept
print(data.dict)

Again, we could clean this up after the fact. Going by the same principle as for dynamic columns, it would be nice to attach a callback function to data that automatically formats the data correctly. This is precisely what formatters do:

# Create a formatter
def remove_whitespace(phone_num: str) -> str:
    """Removes whitespace from phone numbers."""
    return phone_num.strip()

# Add the formatter
data.add_formatter("Phone Number", remove_whitespace)

# Check that the formatter has been added
print(data._formatters)

# Append more data with whitespace errors
data.append(["Eirik", " 74937475 "])
data.append(["Stine", "75839478"])

# Data is automatically formatted on insertion.
print(data.dict)

Formatters are callback functions that are added with the .add_formatter() method. You can check the registered formatters on data by using the "private" attribute data._formatters. When you now add more data with whitespace errors, these are automatically cleaned when appended to data.

The difference between dynamic columns and formatters is that dynamic columns create new columns, while formatters modify existing ones. Use dynamic columns for data augmentation, while formatters for data cleaning and data validation 😍


Wrapping Up

Photo by Spencer Bergen on Unsplash

I hope this blog post helped you understand the library Tablib and what it can do in Python. If you are interested in AI, data science, or data engineering, please follow me or connect on LinkedIn.

Like my writing? Check out some of my other posts for more content:

Get started with SQLite3 in Python Creating Tables & Fetching Rows https://towardsdatascience.com/get-started-with-sqlite3-in-python-creating-tables-fetching-rows-a52bdf41e52a/ Tue, 18 Jun 2024
Learn to use SQLite - The most used DBMS in the world

Get Started with SQLite3 in Python, Creating Tables and Fetching Rows
Photo by Sunder Muthukumaran on Unsplash

Overview

  1. Introduction – What is SQLite and SQLite3?
  2. Creating our First SQLite Database
  3. Connectors and Cursors
  4. Creating Tables
  5. Inserting Rows Into the Database
  6. Fetching Rows From the Database
  7. Wrapping Up

Introduction – What is SQLite and SQLite3?

One of the core skills for most modern IT professionals is Structured Query Language (SQL). This is a declarative language that is used to interact with relational databases. Data engineers and analysts regularly use SQL to run data pipelines and investigate useful relationships within the data.

Going straight to common database management systems (DBMS) like PostgreSQL or MySQL can be a bit intimidating when you don’t have any SQL experience. Luckily, SQLite is a great option for learning the basics of SQL. It is simple to set up and easy to manage since it has no separate server process. So although data engineers and data analysts will typically use different database management systems than SQLite, it is a great place to learn SQL. In fact, SQLite is the most commonly used DBMS in the world!

Additionally, the Python library sqlite3 is a simple interface for interacting with SQLite. In this blog post, we will use SQLite and the sqlite3 library to learn two major concepts:

  • Some elementary ways of using the most basic and useful commands of SQL such as CREATE TABLE, INSERT INTO, and SELECT - FROM.
  • How a programming language (in our case Python) can be used to interact with a relational database.

We will set up a SQLite database, create a database connection from Python with sqlite3, and insert/fetch some rows in the database. The goal is not that you become an expert on SQL, but that you see how it can be used and learn some basic commands to get started. If you want to learn more, I have 8 free videos on YouTube that start out the same as this blog post, but go much further:


Creating our First SQLite Database

On the official webpage for SQLite, you can find information about downloading SQLite. However, for most of us, this is not necessary as SQLite is already included on most machines. You also need the sqlite3 library in Python, but this is in the standard library, and thus included with most Python distributions. So most likely, there is nothing to install 😃

To check if everything is already installed, open a new Python file and write the single command:

import sqlite3

If the above file runs fine, then both SQLite and the Python library sqlite3 are installed. We are ready to go!

After the import step, we need to create a connection to the database. This is done by using the connect() function in the sqlite3 library:

# Create a connection to the database
connection = sqlite3.connect("music.db")

The argument passed into the connect() function will be the name of the database. Since we don’t have a database yet, this will simply create a new database for us. If you now run the Python file, then a new file will appear in the directory you are working in called music.db. This is our database!

A relational database consists of various tables. If you are new to this, then you can think of it like a collection of Excel sheets. This is underselling how powerful relational databases are, but it is a nice mental model in the beginning.

After creating a connection object, we need to create a cursor. A cursor can execute SQL commands against the database. To create this, we use the .cursor() method on the connection object as follows:

# Create a cursor
cursor = connection.cursor()

The variable cursor now holds a cursor object that we can use to insert and fetch data from the database. So far, you should have the following code:

import sqlite3

# Create a connection to the database
connection = sqlite3.connect("music.db")

# Create a cursor
cursor = connection.cursor()

Creating Tables

First of all, we need a table in our database. We are going to be working with data representing songs from the 80s. On the cursor object, we can call the method execute() to execute SQL statements. The first statement we are going to learn is the CREATE TABLE statement:

# Create a table
cursor.execute("CREATE TABLE IF NOT EXISTS songs(name, artist, album, year, duration)")

As you can see from the command above, we create a table called songs that has five columns: name, artist, album, year, and duration. The optional part IF NOT EXISTS ensures that the table is only created if it does not already exist. If it does exist, then the command does nothing.

Even though our table is empty for now, the schema is clear. We are setting up a table that records relevant information about songs. The information we want to track for each song is the name, artist, album, year, and duration. We will soon populate this table with rows that represent various songs.

After running the Python file, your immediate idea might be to open the database file music.db in your current directory to investigate what has happened. However, the information in music.db is not intended to be accessed like this. You will only see scrambled information there, as the file format is not meant to be read directly. We will have to write more SQL commands to read the information in the database.

Congratulations on learning your first SQL command! Make sure to separate in your mind what are SQL commands and what is the Python library sqlite3. It is only the statement CREATE TABLE ... that is an SQL command. The connection and cursor are Python objects that are used to interact with the database.


Inserting Rows Into the Database

We now have a database with a single table. But the table is empty! For a database to be useful, we need some data in it. Let us now see how we can insert data into tables in our database with the SQL keywords INSERT INTO. We first create a list of songs, where each song is represented as a tuple of information:

# Rows for the songs table
songs = [
    ("I Wanna Dance with Somebody (Who Loves Me)", "Whitney Houston", "Whitney", 1987, 291),
    ("Dancing in the Dark", "Bruce Springsteen", "Born In The U.S.A.", 1984, 241),
    ("Take On Me", "a-ha", "Hunting High and Low", 1985, 225),
    ("Africa", "TOTO", "Toto IV", 1982, 295),
    ("Never Gonna Give You Up", "Rick Astley", "Whenever You Need Somebody", 1987, 213)
]

As you can see, each tuple has five parts that correspond to the columns in the songs table. For the first song, we have:

  • The name will be I Wanna Dance with Somebody (Who Loves Me).
  • The artist will be Whitney Houston.
  • The album will be Whitney.
  • The year will be 1987.
  • The duration (in seconds) will be 291.

Now that we have our rows ready, we need to insert them into the songs table in the music.db database.

One way to do this is to insert a single row into the table at a time. The following code inserts the first song into the table:

# Insert a single value into the database
cursor.execute("INSERT INTO songs VALUES(?, ?, ?, ?, ?)", songs[0])
connection.commit()

Here we have picked out the first song in the songs list and inserted it into the table songs. We use the SQL command INSERT INTO table VALUES where table is the table we want to insert into. Finally, we use the .commit() method to ensure that the transaction is fully complete.

By repeating this approach with a for-loop in Python, we can write the following code to insert all the rows into the table:

# Insert all the values into the table by looping
for song in songs:
    cursor.execute("INSERT INTO songs VALUES(?, ?, ?, ?, ?)", song)
    connection.commit()

There is no new SQL command here, only some Python logic to ensure that all the rows are inserted into the songs table.

The disadvantage of the above method is that it is pretty slow when you have a lot of rows to insert. In our example, everything is quick since we only have a few songs. But tables in relational databases can have millions or even billions of rows. Then looping in Python can slow down the insertion of rows.

The solution to this is to insert all of the rows at once, rather than loop through them. We can do this by using the .executemany() method on the cursor object, rather than the .execute() method that we have used so far. The following code inserts all the rows in a single batch:

# Can insert all the values at the same time with a batch approach
cursor.executemany("INSERT INTO songs VALUES(?, ?, ?, ?, ?)", songs)
connection.commit()

We now have a table songs in the database music.db that has some rows inserted. The code we have written so far (without comments) looks like this:

import sqlite3

songs = [
    ("I Wanna Dance with Somebody (Who Loves Me)", "Whitney Houston", "Whitney", 1987, 291),
    ("Dancing in the Dark", "Bruce Springsteen", "Born In The U.S.A.", 1984, 241),
    ("Take On Me", "a-ha", "Hunting High and Low", 1985, 225),
    ("Africa", "TOTO", "Toto IV", 1982, 295),
    ("Never Gonna Give You Up", "Rick Astley", "Whenever You Need Somebody", 1987, 213)
]

connection = sqlite3.connect("music.db")

cursor = connection.cursor()

cursor.execute("DROP TABLE IF EXISTS songs")

cursor.execute("CREATE TABLE IF NOT EXISTS songs(name, artist, album, year, duration)")

cursor.executemany("INSERT INTO songs VALUES(?, ?, ?, ?, ?)", songs)
connection.commit()

Looking closely, you will see that I have sneaked in a new line in the code. This is the line that executes the SQL command DROP TABLE IF EXISTS songs. If you run the code above, it first removes the table if it exists, and then builds it again.

This avoids our experimentation leaving us with different results from run to run. By running the Python file above, we reset the state of the database and should get the same results in the next section. A statement like this in a production system would be very costly, as the entire table is rebuilt every time we insert rows. However, it is fine for the experimentation we are doing here.


Fetching Rows from the Database

Photo by Andy Powell on Unsplash

It is now time to fetch rows back from the database. We will use the SQL keywords SELECT and FROM to do this. Let us start with just getting a single song back from the database:

# Fetch a single song
single_song = cursor.execute("SELECT * FROM songs").fetchone()
print(single_song)

As usual, we use the .execute() method on the cursor object to execute an SQL statement. The statement SELECT * FROM songs fetches all the columns and all the rows from the database. So this gives us everything. However, we use the .fetchone() method in sqlite3 to only fetch a single one of those rows. By doing this, we only print out a single song when running our Python file.

We have used the wildcard symbol * to retrieve all the columns back. If you only need some of the columns, then you can specify them as follows:

# Fetch only name and artist column of a single song
name_and_artist = cursor.execute("SELECT name, artist FROM songs").fetchone()
print(name_and_artist)

In addition to the method .fetchone(), you can also use the methods .fetchmany(number_of_rows) and .fetchall() to get more rows. The following code selects all the songs with the .fetchall() method:

# Getting all the rows and columns back
full_songs = cursor.execute("SELECT * FROM songs").fetchall()
print(full_songs)
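The .fetchmany() method mentioned above works in the same way; here is a small sketch using the same cursor, where the number of rows is arbitrary:

# Fetch only the first three songs
first_three = cursor.execute("SELECT * FROM songs").fetchmany(3)
print(first_three)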

Once you have the information back into Python, you can use standard Python logic to get useful insights. The following code exemplifies this, by finding the average duration for all the songs in our database:

# Get the average duration with Python logic
average_duration = 0
for song in full_songs:
    average_duration += song[-1]
average_duration = average_duration / len(full_songs)
print(f"The average 80s song is {int(average_duration // 60)} minutes and {int(average_duration % 60)} seconds long.")

You might think that we are going a bit back and forth. We already have the original songs list in the Python script, so why do we first need to insert this into the database, and then retrieve it again? This is where a tutorial is a bit artificial. In practice, the Python scripts that insert and fetch data from the database are not the same. There might also be multiple Python scripts (or other interfaces) that insert data into the database. Hence fetching the data from the database and calculating the average duration might be our only option to get this information.

Finally, before finishing up we need to close the database connection. We opened the connection to the database with the connect() function earlier. If we don’t close it, it will remain open and can cause performance and persistence problems in more complicated applications. It is good practice to always ensure the connection to the database is closed. To do this in sqlite3, we can use the .close() method on the connection object:

# Close the connection
connection.close()
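If you want to be certain that the connection is closed even when an error happens along the way, one option is to wrap the work in a try/finally block. This is just a sketch of the pattern, not part of the script we have been building:

import sqlite3

connection = sqlite3.connect("music.db")
try:
    cursor = connection.cursor()
    # ... create tables, insert rows, and fetch data here ...
finally:
    # Runs even if an exception was raised above
    connection.close()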

The following code demonstrates everything we have done:

import sqlite3

songs = [
    ("I Wanna Dance with Somebody (Who Loves Me)", "Whitney Houston", "Whitney", 1987, 291),
    ("Dancing in the Dark", "Bruce Springsteen", "Born In The U.S.A.", 1984, 241),
    ("Take On Me", "a-ha", "Hunting High and Low", 1985, 225),
    ("Africa", "TOTO", "Toto IV", 1982, 295),
    ("Never Gonna Give You Up", "Rick Astley", "Whenever You Need Somebody", 1987, 213)
]

connection = sqlite3.connect("music.db")

cursor = connection.cursor()

cursor.execute("DROP TABLE IF EXISTS songs")

cursor.execute("CREATE TABLE IF NOT EXISTS songs(name, artist, album, year, duration)")

cursor.executemany("INSERT INTO songs VALUES(?, ?, ?, ?, ?)", songs)
connection.commit()

full_songs = cursor.execute("SELECT name, artist, album, year, duration FROM songs").fetchall()

average_duration = 0
for song in full_songs:
    average_duration += song[-1]
average_duration = average_duration / len(full_songs)
print(f"The average 80s song is {int(average_duration // 60)} minutes and {int(average_duration % 60)} seconds long.")

connection.close()

Wrapping Up

Photo by Spencer Bergen on Unsplash

I hope this blog post helped you understand SQL commands and the sqlite3 library in Python. If you are interested in AI, data science, or data engineering, then feel free to follow me or connect on LinkedIn.

Like my writing? Check out some of my other posts for more content:

Seeing is Believing – Deepfakes and How They Warp Truth https://towardsdatascience.com/seeing-is-believing-deepfakes-and-how-they-warp-truth-38a59af51562/ Wed, 20 Mar 2024
Bridging Autoencoders and Media Literacy

Photo by Brett Jordan on Unsplash

Overview

  1. Introduction – What are Deepfakes?
  2. Examples of Malicious Deepfakes
  3. Autoencoders
  4. Media Literacy and Detecting Deepfakes
  5. Wrapping Up

Introduction – What are Deepfakes?

The act of photo manipulation is an old one. It has been used to colorize old WW1 images, but it has also been used for propaganda; Josef Stalin infamously manipulated photos so that his political opponent Leon Trotsky did not appear in important settings. Photo manipulation has been used for over 100 years to both captivate and deceive viewers.

Moving to the current time period, we consume not only images but also video and audio in our daily lives. The internet has facilitated a dramatic increase in video and audio sharing by third-party individuals and organizations, in contrast to the fixed TV and radio channels of the past. This is great for gaining new perspectives. However, together with innovations in artificial intelligence, it has led us to a new, scary concept: deepfakes.

Deepfakes are synthetic media that have been altered or generated with the use of deep neural networks so that the content is fake. This could be images and speech, but increasingly also videos. There are legitimate use cases for deepfakes within video production. This could be aging/de-aging an actor for a role that spans a long time within the narrative. It could also be used as an alternative to reshooting close-up scenes where only minor changes are necessary. Yet, as of now, deepfakes present more malicious applications than legitimate ones.

In this blog post, I will give you some context for how deepfakes can be used maliciously. I will discuss autoencoders since this is one of the most common ways deepfakes are generated. Finally, I will talk about how media literacy and common sense are among the most effective ways we can combat deepfakes.


Examples of Malicious Deepfakes

Unfortunately, the majority of deepfake applications today have a negative effect on individuals, companies, and society at large. Let’s look at a malicious way deepfakes can be used at each of these three levels:

Individual – Non-Consensual Pornography

The word deepfake actually originates from a username on Reddit. The user called deepfakes shared pornographic videos of celebrities, where deep neural networks had been used to add their faces to existing pornographic scenes. In recent years the technology has advanced, and it has now become easier to create non-consensual pornographic deepfake (NCPD) content of everyday people. The majority of NCPD victims are women, and the goal is either bullying, revenge porn, or extortion. Often the victims do not know who has made the content, and it can be almost impossible to remove it from the web. Even though it is "fake", it can still have a damaging psychological effect on the victims and impact their ability to obtain jobs.

Company – Social Engineering Scams

For a long time, social engineering has been one of the top methods for gaining unauthorized access to sensitive company information such as employee passwords. Social engineering is the practice of manipulating employees into giving out information, rather than finding technical security holes in software. This often takes the form of impersonating a colleague who needs quick access to an employee’s user account. Deepfakes bring new fears into this, as one can now create videos that further give the illusion that you are talking to a colleague. In a now famous instance, cybercriminals convinced a finance employee to transfer $25 million after a deepfake video call with the CFO.

Society – Spreading Misinformation

Looking more broadly, deepfakes present an obstacle to truth in societal discourse. Deepfakes can be used to alter perceptions of political figures. There are many elections taking place in 2024 around the world, and there has been a major uptick over the last few years in deepfakes used to sway political opinions. Deepfakes can also be used by interested parties to cast doubt on scientific evidence that is not beneficial to them. Many of our opinions are based on emotions, and there is thus a big advantage to portraying opponents as ill-informed or simply on a crusade.

It is also possible to use simpler methods such as photoshopping or inserting a word into a speech to distort reality; this is sometimes called cheapfakes or shallowfakes, in contrast to deepfakes. While this has been an issue for some time, deepfakes lower the barrier and heighten the impact. Today, almost anyone can create deepfake content without much expertise or knowledge.

Given that deepfakes can affect individuals, companies, and society as a whole, it is vital that we understand how they work and what measures we can take against them. Let’s start with understanding one way deepfakes are created.


Autoencoders

I now want to give an overview of autoencoders and how they are used to create deepfakes. You should know that other architectures such as Generative Adversarial Networks (GANs) can also be used to create deepfakes. I’ve chosen to focus on autoencoders since they illustrate the process of deepfakes most clearly in my opinion. I will only give a high-level overview of autoencoders here – just enough to discuss the detection of deepfakes in the next section.

Autoencoders are a specific type of deep neural network that emulates compression and decompression. Specifically, the neural network first has an encoder part that compresses a given image into a lower-dimensional space. The target space here is often called the latent space or latent representation, since it tries to extract the latent features (or essence) of an image.

Finally, there is a decoder part that tries to reconstruct the original image from the latent space. The goal is to go through both processes, encoding and decoding, while staying as close as possible to the original image. Hence the measure of success for an autoencoder is the similarity between the input and the output.

Image created by the author.

The latent space needs to be smaller than the input and output layers in size. In the extreme example, if the latent space is the same size as the input and output layers, then no compression takes place. On the other hand, if the size of the latent space is a single node, then this is clearly not enough information to reconstruct an image.

Below you can see how an autoencoder can reconstruct images of simple clothing items. For more information about this specific example, you can check out the Tensorflow Documentation on Autoencoders.

So what do autoencoders have to do with deepfakes? The trick to creating deepfakes is as follows. You have two persons named Alice and Bob. Train two autoencoders, each of them using data consisting of faces from Alice and Bob, respectively. However, make sure that they have the same encoder, but different decoders. Then you do the following:

Take a picture of Alice. Pass it through the common encoder to get the latent features of Alice. Then pass this through the decoder for Bob. The output will try to reconstruct the face of Bob with the latent features of Alice. This gives the deepfake effect where the facial features of one person are superimposed on another. For more illustrations of how this works, I recommend this blog post.
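To make this concrete, here is a schematic sketch of the setup in Keras. The image size, layer sizes, and training details are assumptions purely for illustration; real deepfake models are far larger and usually convolutional:

import tensorflow as tf
from tensorflow.keras import layers

image_dim = 64 * 64 * 3   # flattened 64x64 RGB face crops (assumed)
latent_dim = 128          # size of the latent representation

# One shared encoder that compresses a face into the latent space
shared_encoder = tf.keras.Sequential([
    layers.Input(shape=(image_dim,)),
    layers.Dense(512, activation="relu"),
    layers.Dense(latent_dim, activation="relu"),
])

def make_decoder():
    """One decoder per person, mapping the latent space back to an image."""
    return tf.keras.Sequential([
        layers.Input(shape=(latent_dim,)),
        layers.Dense(512, activation="relu"),
        layers.Dense(image_dim, activation="sigmoid"),
    ])

decoder_alice = make_decoder()
decoder_bob = make_decoder()

# Two autoencoders that share the encoder but have different decoders
autoencoder_alice = tf.keras.Sequential([shared_encoder, decoder_alice])
autoencoder_bob = tf.keras.Sequential([shared_encoder, decoder_bob])

for model in (autoencoder_alice, autoencoder_bob):
    model.compile(optimizer="adam", loss="mse")

# Training would use faces of each person as both input and target, e.g.:
# autoencoder_alice.fit(faces_alice, faces_alice, epochs=..., batch_size=...)
# autoencoder_bob.fit(faces_bob, faces_bob, epochs=..., batch_size=...)

# The deepfake step: encode a picture of Alice, decode it with Bob's decoder
# fake_bob = decoder_bob(shared_encoder(picture_of_alice))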

So now that we have an overarching understanding of how deepfakes can be made, what does this say about how we can respond to them?


Media Literacy and Detecting Deepfakes

Recently, several big players in tech such as Adobe, Microsoft, and Google pledged to work together to combat deepfakes in various ways. This is not an easy task. The reason is clear when we reflect on how autoencoders create deepfakes.

The images are based on something real, and gradually fitted with possibly millions or billions of parameters to the wrong face. This means that unlike photoshopping, where rough edges might appear, or taking clips out of context, where the transitions are not smooth, deepfakes don’t really have a clear characteristic. For some time, deepfakes struggled with hands, as in the famous Pope Francis image:

AI-generated image of Pope Francis.

You can see that the left hand is not really all there, and that the item Francis is holding is more merged into the hand than properly held. These artifacts will disappear with better models, and we will be left with images and videos that could be real from a purely visual standpoint.

So is there anything more we can do? Yes! Images, speech, and video do not exist in a vacuum, but rely on us to interpret their meaning and their reliability. So we’re back to classic media literacy as a tool that helps with misinformation resulting from deepfakes.

Asking questions such as "Are other sources confirming this?" or "Does this seem plausible?" goes a long way. Is the information radically different from what the person has said before? Does anyone benefit from me believing this? Be critical of the content, the author, and other interests! The context of the situation can often be enough to detect a deepfake.

A recent example is a deepfake call from U.S. President Biden to Democratic voters in New Hampshire, telling them not to vote in the New Hampshire primary and to save their vote for when it counts: the election in November. We don’t need deepfake-detecting technology to understand, with the tiniest bit of media literacy, that this does not make any sense. The sitting president does not call individual residents and tell them not to vote for his own party. You would find no official statements from the White House anywhere near this if you looked, and it is radically different from what you would expect. Almost any question you could ask yourself about the situation should make you really skeptical.

Seek out news agencies that have a long history of unbiased reporting and fact-checking. You should not outsource all your critical thinking to a third party, but think of this as analogous to an antivirus program. The antivirus program will filter out much, but it is ultimately your responsibility not to download everything you come across on the internet.

When it comes to social engineering fraud, the same critical awareness needs to be present. If someone asks you through speech or a short video to do something drastic (like transferring a large sum of money or giving up account information), then ask the person for verification through a different channel or in person. This simple rule makes social engineering a lot more difficult to pull off.

For victims of non-consensual pornographic deepfakes, there are unfortunately fewer measures that can be taken as of now. The legal system has not caught up to the fact that this is happening, or how to properly respond to it. While you should be extra careful not to put out pictures of children on the open internet, it is unrealistic that adults will not have pictures online. The best thing we can do as of now is to be aware that this is a possibility, and not to discount someone in a hiring process because of it. Understanding how uncomfortable it is to be a victim of this can make us more empathetic when we learn that it has happened to someone we know.


Wrapping Up

Photo by Spencer Bergen on Unsplash

I hope you got an overview of how deepfakes work and how to combat malicious deepfakes. If you are interested in AI, data science, or data engineering, then feel free to follow me or connect on LinkedIn.

Like my writing? Check out some of my other posts for more content:

The post Seeing is Believing – Deepfakes and How They Warp Truth appeared first on Towards Data Science.

]]>
An Overview of Microsoft Fabric Going Into 2024 https://towardsdatascience.com/an-overview-of-microsoft-fabric-going-into-2024-cd51a18c07b5/ Sun, 31 Dec 2023 17:55:55 +0000 https://towardsdatascience.com/an-overview-of-microsoft-fabric-going-into-2024-cd51a18c07b5/ What can Microsoft Fabric Bring to the Table in 2024?

The post An Overview of Microsoft Fabric Going Into 2024 appeared first on Towards Data Science.

]]>
Photo by Ricardo Loaiza on Unsplash

Overview

  1. Introduction
  2. What is Microsoft Fabric?
  3. The Major Components of Microsoft Fabric
  4. 3 Upsides to Using Microsoft Fabric
  5. 3 Downsides to Using Microsoft Fabric
  6. Should you Change?
  7. Wrapping Up

Introduction

Microsoft Fabric is advertised by Microsoft as being an all-encompassing platform for data analytics, data engineering, and AI. It was introduced in preview in the spring of 2023 and was made generally available for purchase in November 2023. The platform builds on features from existing services like Azure Synapse Analytics, Azure Data Factory, Azure Data Lake Gen 2, Microsoft Purview, and Power BI.

In this blog post, I want to give you a high-level overview of the Microsoft Fabric platform going into 2024. Specifically, I want to give you answers to the following questions:

  • What parts of the data lifecycle does Microsoft Fabric cover?
  • What is each component in Microsoft Fabric trying to achieve?
  • What are the upsides and downsides to using Microsoft Fabric?
  • Should you consider migrating to Microsoft Fabric?

My experience with Microsoft Fabric is based on a four-day in-depth course from Microsoft that I attended, as well as experimentation with Microsoft Fabric over the last couple of months. I also have broad experience with the tools that Microsoft Fabric takes inspiration from. I am, however, not affiliated with Microsoft and gain nothing monetarily from either overselling or underselling Microsoft Fabric. Based on this, I can give an unbiased overview of Microsoft Fabric.

As you will be able to tell from the rest of the blog post, I think that Microsoft Fabric offers some genuinely useful features for a unified data platform. However, like everything else, the choice of migrating to Microsoft Fabric will depend on a lot of factors.


What is Microsoft Fabric?

Let’s try to get an overview of Microsoft Fabric and what it is trying to achieve.

Microsoft Fabric is an all-encompassing data & analytics platform that handles data from the collection stage to the analytics stage. This includes data storage, data pipelines, data alerts, data lineage, data governance, AI features, Power BI integration, and more. The platform is built on previous Microsoft services and collects many existing features into a single package.

According to Microsoft, there are four focus areas of Microsoft Fabric that shape its goals and what it is trying to achieve:

Image from Microsoft's publicly available learning material

Complete Analytics Platform

The Microsoft Fabric platform is a full-fledged ecosystem that aims to give you a complete package of what you need in a data & analytics platform.

Most data platforms, such as Databricks and Azure Synapse Analytics, are PaaS (Platform as a Service) based, where the supplier handles things like operating systems, maintenance, and distribution of workloads, while you control the code and data. Microsoft Fabric profiles itself as a SaaS (Software as a Service) platform, where the supplier takes a larger role in the code and configuration. This is achieved through a bigger focus on low-code tools like Azure Data Factory, Azure Data Activator, and Power BI, as we shall see.

I think that calling Microsoft Fabric a SaaS platform is somewhat of a stretch. For large-scale projects there is still a high need to write code, whether that is Spark, SQL, or Python. Nevertheless, the Microsoft Fabric platform is truly a step towards SaaS data platforms by relying more heavily on low-code/no-code tools.

Microsoft Fabric also emphasizes governance and security through features previously available in the Microsoft Purview service. This includes securing data ownership by grouping data into domains and workspaces. It also includes visibility through data catalogues and lineage, making the solution scalable without completely losing track of which data is available and to whom.

Lake Centric and Open

Microsoft Fabric uses OneLake, which promises to simplify data storage, maintenance, and copying of data.

OneLake is a single, unified, logical data lake for your whole organization – Microsoft Documentation

OneLake builds on Azure Data Lake Gen 2, which most Azure users have experience with. It is designed to be a single unified place for storing data across the organization, rather than setting up multiple data lakes for different branches and teams within the organization. Hence you can have one, and only one, OneLake connected to a Microsoft Fabric tenant. The ownership of data is handled within OneLake through organizational features like workspaces and domains.

OneLake can support any file format, whether structured or unstructured. However, it is a bit partial to the Delta Parquet format, since any warehouses or lakehouses within Fabric store their data in this format by default.

OneLake uses shortcuts, a feature that emulates the shortcuts we are all familiar with on our local machines. Shortcuts are used to share data without data duplication problems. Users with the right privileges can make shortcuts across workspaces and even to external services like S3 storage in AWS and Dataverse storage in the low-code Power Platform.

Empower Every Business User

The Microsoft Fabric user interface is very familiar to Power BI users and is pretty intuitive for most people coming from different data platforms. This allows business users on the analytics side to take a bigger role in managing data storage and data transformations.

Microsoft Fabric also goes to great lengths to integrate with two other platforms that business users love – Power Platform and Microsoft 365. Both of these integrations allow business users to get closer to the data and collaborate more seamlessly with data engineers and data scientists.

AI-Powered

Finally, the Microsoft Fabric platform has integrated AI into the platform in different ways. One aspect of this is incorporating LLMs (Large Language Models) like GPT models and Microsoft Copilot to speed up the development process. These tools are starting to be heavily incorporated into the platform. If this incorporation is successful, then it is a major selling point for Microsoft Fabric.

AI has also arrived in Microsoft Fabric in another way. You can now train machine learning models and do everything from running experiments to saving and deploying models. This seems to be built on the experience Microsoft has drawn from the Azure Machine Learning service, where all of this has been possible for some time. So while this is not a new feature in the grand scheme of Microsoft Azure, it is new that it is so tightly coupled with data engineering tasks in Microsoft Fabric.

In Azure Synapse Analytics, there were no serious features available for machine learning. Other platforms like Databricks have already had ML coupled with data engineering for quite some time. So in this regard, Microsoft is catching up to what is expected in modern data & analytics platforms.


The Major Components of Microsoft Fabric

The following illustration from Microsoft highlights the components that together constitute the Microsoft Fabric platform. Let’s go through each of them briefly.

Image from Microsoft's publicly available learning material

OneLake

I’ve already talked a bit about OneLake but would like to add some meat to the bones. As the illustration below shows, OneLake acts as a common foundation for the other components.

Image from Microsoft's publicly available learning material

Workloads from these components automatically store their data in the OneLake workspace folders. The data in OneLake is then indexed for multiple purposes. One of these is data lineage, where you can track which transformations have been applied to a dataset. Another is PII (Personally Identifiable Information) scans, where sensitive information can be highlighted.

I personally think that one of the biggest advantages of OneLake is transparency. When working in platforms like Azure Synapse Analytics, it becomes difficult for everyone to keep track of what data is available and which transformations have already been applied to it. A data analyst might get access to the Azure Data Lake Gen 2 storage to fetch the finished data for visualizations, but have little knowledge of which transformations were applied to the data to get it into this form. While there are ways to handle this by involving Microsoft Purview, it is a bit cumbersome. Having transparency by default is a feature that will not reach the headlines but is crucial for better collaboration.

Data Factory

Data Factory is an existing service in the Azure ecosystem that has been incorporated into the Microsoft Fabric platform. This tool is used to connect to data sources like databases, streaming systems like Kafka, documents in SharePoint, and tons of other sources. Then you can write data pipelines to transform the data in simple steps and automate the pipeline management.

Image from Microsoft's publicly available learning material

Data Factory also includes Dataflow Gen2, which is a low-code tool for data ingestion and simple transformations. Users of Power BI will find this very familiar, since Dataflow Gen2 looks a lot like the Power Query Editor they are used to. In this way, data analysts and business users can take a bigger role in data ingestion and processing.

Data Factory was already present within the Azure Synapse Analytics platform through the feature called pipelines. Hence the inclusion of Data Factory in Microsoft Fabric is nothing less than expected.

Synapse Data Engineering

Some simple transformations can be done with low-code tools in Data Factory. For more complicated processing, you can use Synapse Data Engineering to set up Spark jobs and notebooks to wrangle the data in a more customized way.

Image from Microsoft's publicly available learning material

Synapse Data Engineering also allows you to set up lakehouses where you can manage both structured and unstructured data in a single location. The data can then be transformed using Spark jobs and notebooks. The lakehouse also comes with a SQL analytics endpoint so that you can write SQL-based queries to fetch the data. Note that the SQL analytics endpoint is designed for read operations only. Most data engineers will spend a lot of their time in Microsoft Fabric working within Synapse Data Engineering.
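
As an illustration of the kind of work that happens here, below is a minimal PySpark sketch of a notebook cell that reads a lakehouse table, aggregates it, and writes the result back as a Delta table. The table and column names (sales, order_date, amount, daily_revenue) are made up for the example, and the exact way you reference lakehouse tables depends on your setup.

```python
# A minimal PySpark sketch of a Synapse Data Engineering notebook cell.
# Table and column names are placeholders; adjust them to your own lakehouse.
from pyspark.sql import SparkSession, functions as F

# A Fabric notebook already provides a `spark` session; this line just makes
# the sketch self-contained if you run it elsewhere.
spark = SparkSession.builder.getOrCreate()

df = spark.read.table("sales")  # read a Delta table from the attached lakehouse

daily_revenue = (
    df.groupBy("order_date")
      .agg(F.sum("amount").alias("revenue"))
      .orderBy("order_date")
)

# Write the aggregate back to the lakehouse as a new Delta table
daily_revenue.write.format("delta").mode("overwrite").saveAsTable("daily_revenue")
```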

Synapse Data Science

A departure from Azure Synapse Analytics is the inclusion of data science, specifically the lifecycle of machine learning model development.

Image from Microsoft's publicly available learning material

Synapse Data Science includes model hosting, experiments, and deployment of ML models. There is a built-in MLflow experience so that tracking of parameters and metrics is simplified. Microsoft Fabric also supports autologging, a feature that simplifies the logging experience.

The training of ML models can be done with Python/PySpark and SparklyR/R. Popular libraries like scikit-learn can easily be incorporated, and the experience of developing models has been made a lot simpler. Other Azure-based AI tools like Azure OpenAI Service and Text Analytics can also easily be used from Microsoft Fabric. This connection is in preview as of now, but will include more of the Microsoft AI services in the future.
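
To illustrate the autologging workflow, here is a minimal sketch using MLflow with scikit-learn. This is generic MLflow/scikit-learn code rather than anything Fabric-specific; in a Fabric notebook the runs would typically land in a Fabric experiment, and the run name below is made up.

```python
# A minimal sketch of MLflow autologging with scikit-learn.
import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

mlflow.autolog()  # parameters, metrics, and the fitted model are logged automatically

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    print("Test accuracy:", model.score(X_test, y_test))
```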

I think that some more time, testing, and further development is needed before Microsoft Fabric can be called a full-fledged MLOps platform, but the changes they have already made are impressive.

Synapse Data Warehousing

I mentioned earlier that lakehouses in Microsoft Fabric have a SQL analytics endpoint where you can write read-only SQL queries on the data (as well as create views). Microsoft Fabric also has a fully functioning data warehouse solution that supports DDL and DML queries.

Image from Microsoft's publicly available learning material

Hence with Synapse Data Warehousing you have a fully-fledged data warehouse with T-SQL capabilities. Whether to choose a SQL analytics endpoint coming from a lakehouse or a full-fledged data warehouse is a trade-off that needs to be considered in most situations. Microsoft has a lot of documentation that illustrates the trade-off and which features you get from the different options.

Synapse Real-Time Analytics

The expectation of real-time, on-demand data is addressed by Synapse Real-Time Analytics. Many systems collect data continuously to display in dashboards or to be used in ML models. Examples include IoT data from sensors or browsing data from customers on a website. The Synapse Real-Time Analytics component of Microsoft Fabric tackles streaming data comprehensively.

Image from Microsoft's publicly available learning material

It uses KQL (Kusto Query Language) for querying event streams. It is optimized for time-series data and has loads of features that support automatic partitioning and scaling. The end result can easily be integrated with other components in Microsoft Fabric such as Synapse Data Engineering or Power BI.

Power BI

I doubt that Power BI needs much of an introduction. It has been one of the de facto visualization/dashboard solutions for the last decade. With Power BI you can create updated and gorgeous dashboards that can be distributed to people with the right access and privileges.

Image from Microsoft's publicly available learning material

The new thing in Microsoft Fabric is that Power BI is so tightly integrated with the rest of the data & analytics platform. Previously, data engineers could work in Azure Synapse Analytics while data analysts worked in Power BI with minimal interaction between them. In Microsoft Fabric, data analysts are encouraged to take a bigger role in data processing, while data engineers are encouraged to think more about how the data will facilitate insights in the visualization stage.

There is also a new connection mode called Direct Lake Mode that seems very promising as a middle ground between speed and avoidance of data duplication. This is optimal for large datasets with frequent updates. I have not done any benchmarking, but I am cautiously optimistic that this might be valuable in many cases.

Image from Microsoft's publicly available learning material

Data Activator

The final component of the Microsoft Fabric platform is Data Activator. Data Activator is a no-code tool for taking action whenever a condition or pattern is detected in the data. It sets up reflexes – items that contain the necessary information to connect to data, monitor the data for certain conditions, and then act through triggers.

Image from Microsoft's publicly available learning material

No-code rules and triggers send notifications in applications like Microsoft Teams or Outlook to notify you about interesting changes. It is also possible to use Power Automate in the Power Platform to write custom workflows for how an end user should be alerted.

It has been possible to integrate alert systems with Azure Synapse Analytics through Logic Apps or other services. But alert systems quickly become neglected if there is even a hint of effort required to connect them, so having Data Activator as part of Microsoft Fabric is great. In my opinion it offers nothing revolutionary, but it makes the whole Microsoft Fabric platform more holistic.


3 Upsides to Using Microsoft Fabric

Now that we have described the components that go into Microsoft Fabric, I want to discuss some upsides to Microsoft Fabric. These upsides are based on my subjective interests.

SaaS and Learning Curve

With Microsoft Fabric being a step towards a SaaS solution, the upskilling should be faster and available to a larger group of people than with PaaS solutions. Specifically, I think the hope is that data analysts can take on more tasks that traditionally belonged to data engineers. This includes monitoring data, setting up data pipelines, and writing code for data transformations. My initial experiments with Data Activator and OneLake confirm that it is indeed very quick to get started with. The components also have a friendly interface that does not look intimidating to begin with. I think this will drive data analysts to try out and experiment with tasks that were previously left to data engineers.

Much is also being done in terms of learning material. I attended a free four-day digital course from Microsoft that aimed to teach the basics of the Fabric platform. A new Microsoft certification called Microsoft Certified: Fabric Analytics Engineer Associate will also be launched in 2024. It seems like Microsoft is really committing to Microsoft Fabric and is willing to produce a lot of learning material for this solution.

Closing the Gap Between Data Analysts and Data Engineers

I have briefly touched on this previously but want to really focus on this point a bit. In the data sphere, there are many roles such as data scientist, data engineer, data analyst, ML engineer, data architect, and so on. While there are some clear differences between these roles, the data field has become unnecessarily fragmented in terms of roles. The fragmentation is not based on different ideologies or anything lofty like that, but simply on a separation of tools. Microsoft Fabric makes this more cohesive and less fragmented in the way it is structured.

A data engineer should think about how the end result of the data transformations will be used in visualizations. Similarly, a data analyst should care about which data transformations take place before the data is ready for visualization. What often happens in real life is that data engineers and data analysts are segregated into different tools and have a minimal interface between them. This interface is requirement-based and does not really facilitate collaboration or sharing of insights. This can lead to much back-and-forth and silos that work independently.

With Microsoft Fabric, data analysts and data engineers are encouraged to work closely with each other. They can more easily view each other’s work and contribute outside their own specialty. This does not necessarily mean that data engineers will design Power BI dashboards or that data analysts will write Spark code. But the intersection between a data analyst and a data engineer will be bigger, and more collaboration is possible. I think that this is a major highlight of a data & analytics platform.

A Platform Where AI is not an Afterthought

In many data & analytics platforms, AI and machine learning are more of an afterthought than anything else. They offer some hosting features for ML models, but are really data platforms first and foremost, with some AI features sprinkled on top.

Microsoft Fabric takes a different approach and places AI front and center in the platform. Not only are the ML model lifecycle features relatively competitive, but the native integration of LLMs like Microsoft Copilot and GPT models is built carefully into the platform. Since Microsoft is a major player in generative AI, it is useful to have access to new developments and improvements as quickly as possible.

It seems like Microsoft Fabric is also gradually building more connections to other Azure AI Services (previously called Azure Cognitive Services). These services can of course be used through their respective endpoints as separate services, but Microsoft Fabric is trying to make the connection as smooth as possible. I think that within half a year from now, most of the Azure AI Services will be easily accessible from Microsoft Fabric. Having advanced document intelligence for parsing PDF documents or advanced text-to-speech easily available with the click of a button is something that most other data platforms will struggle to compete with.


3 Downsides to Using Microsoft Fabric

To counter the upsides to Microsoft Fabric I gave previously, here are some possible downsides to using Microsoft Fabric. Again, these are based on my own concerns.

Uniformization vs Lock-in

Microsoft Fabric certainly unifies a lot of existing Azure solutions into a single package with its own billing and a unified OneLake. However, this also means that you are encouraged to use Microsoft Fabric as a full-fledged solution, rather than as a single piece in a microservice architecture. This makes the data platform a lot more locked into the Microsoft ecosystem. Whether this is good or bad depends on which other tools you are using and what your ambitions are for the platform going forward. For many, this might be a downside that they are not willing to compromise on.

The Double-Edged Sword of Low-Code

The advantage of low-code is that it allows more people to engage. Data analysts and business users can take on a bigger set of tasks with Microsoft Fabric. But low-code is a double-edged sword in this regard. The simplicity of low-code typically also means fewer possibilities for customization. The more GUI-based a tool is, the fewer options are available for fine-tuning.

As a concrete example, Data Factory is a low-code tool that can extract data from e.g. transactional databases. But the functionality that Data Factory offers is less than what you could achieve if you wrote SQL queries to fetch the data. This is natural, as SQL is a full declarative language, while Data Factory has a few presets and options to configure. Hence Data Factory will do the trick 9 times out of 10 without issues, but sometimes writing it out in code will give you more possibilities.

The fact that Microsoft Fabric is going down the low-code road might be a development that does not make everyone ecstatic. I am quite happy with the balance they have achieved between low-code and code-based tools. Nevertheless, a few more steps in the direction of low-code would make the platform more difficult to manage for myself and many others with coding backgrounds. This development could be a future downside that is worth watching closely.

New Technology – Less Competence

This one is true for any new technology. There are few people out there who are very comfortable with Microsoft Fabric yet. If you are trying to build an internal team, then requiring Fabric experience for new hires is probably too much to expect. You need to do internal upskilling and spend some time finding out which patterns work in Microsoft Fabric and which don’t. The upside is that Fabric draws so much from services like Synapse Analytics, Data Factory, and Power BI that this background should be enough to get started with Microsoft Fabric quickly.


Should you Change?

Changing your existing solution to Microsoft Fabric is a complicated decision. Basing such an important decision on a single blog post would be a foolish idea. Yet, here are the two cases that are clear-cut:

  • Are you using Synapse Analytics as a data platform? Are you using other tools such as Microsoft Purview and Power BI? Do you find it cumbersome to connect the services together and difficult to keep track of them? In this case, migration is promising. Making the change could make your data platform more manageable. Start experimenting with Microsoft Fabric! Try to duplicate some of the data pipelines you have in your existing data platform. If this brings good results, then you have a serious contender for a new data platform.
  • Are you using tools where several of them are outside the Microsoft platform? Perhaps you are not using Power BI but a different dashboard solution like Grafana? Do you have a focus on code-based tools and open-source tooling? Don’t switch to Microsoft Fabric in this case. You should still keep Microsoft Fabric on your radar. But your princess is unfortunately in another castle.

Outside of clear-cut cases like these, you have to experiment with Microsoft Fabric. Only then will you be able to grasp whether it fits your needs. Changing your data & analytics platform is a heavy decision. Technical competence and strong business understanding are both needed to succeed.


Wrapping Up

Photo by Spencer Bergen on Unsplash

I hope you got an honest overview of Microsoft Fabric and what it can offer. If you are interested in AI, data science, or data engineering, then feel free to follow me or connect on LinkedIn. What is your experience with Microsoft Fabric? I’d love to hear what you have to say 😃

Like my writing? Check out some of my other posts for more content:

The post An Overview of Microsoft Fabric Going Into 2024 appeared first on Towards Data Science.

]]>
Beginner Tutorial: Connect GPT Models with Company Data in Microsoft Azure https://towardsdatascience.com/beginner-tutorial-connect-gpt-models-with-company-data-in-microsoft-azure-81177929da18/ Sun, 15 Oct 2023 15:37:45 +0000 https://towardsdatascience.com/beginner-tutorial-connect-gpt-models-with-company-data-in-microsoft-azure-81177929da18/ Using OpenAI Studio, Cognitive Search and Storage Accounts

The post Beginner Tutorial: Connect GPT Models with Company Data in Microsoft Azure appeared first on Towards Data Science.

]]>
Photo by Volodymyr Hryshchenko on Unsplash

Overview

  1. Introduction
  2. Setting up the Data
  3. Creating an Index and Deploying a Model
  4. Other Considerations
  5. Wrapping Up

Introduction

There has been a lot of hype in the last year about GPT models and generative AI in general. While promises about a full technological revolution can seem somewhat overblown, it is true that GPT models are impressive in many ways. However, the true value of GPT models comes when connecting them to internal documents. Why is this? 🤔

When you use plain vanilla GPT models provided by OpenAI, they do not really understand the inner workings of your company. If you ask it questions, it will answer based on what it most likely finds out about other companies. This is a problem when you want to use GPT models to ask questions like:

  • What are the steps in an internal procedure that I must follow?
  • What is the full interaction history between my company and a specific customer?
  • Who should I call if I have any issues with a specific software or routine?

Trying to ask plain vanilla GPT models these questions will give you nothing of value (try it!). But if you connect GPT models to your internal data, then you can get meaningful answers to these questions and many others.

In this tutorial, I want to show you how to connect GPT models with internal company data in Microsoft Azure. Just in the last few months, this has become a lot simpler. I will walk slowly through setting up resources and doing the necessary configurations. This is a beginner tutorial, so if you are very comfortable with Microsoft Azure, then you can probably skim through the tutorial.

You need to have two things before proceeding to follow along:

  • A Microsoft Azure tenant where you have sufficient permissions to upload documents, create resources, etc.
  • As of publishing, your company needs to apply to get access to the Azure OpenAI resource that we will be using. This will probably be lifted sometime in the future, but for now, this is required. The time it takes after applying until you get access is a few days.

NOTE: The real difficulty with making amazing AI assistants comes down to data quality, scoping the project correctly, understanding user needs, user testing, automating data ingestion, and much more. So don’t leave the tutorial thinking that creating a great AI assistant is simple. It is merely that setting up the infrastructure is simple.


Setting up the Data

Everything starts with data. The first step is to upload some internal company data to Azure. In my example, I will use the following text that you can also copy and use:

In company SeriousDataBusiness we always make sure to clean our 
desks before we leave the office.

Save the text into a text file called company_info.txt and store it somewhere convenient. Now we will go to Microsoft Azure and upload the text document. Search the marketplace on Azure to find the Storage account resource:

When creating Azure resources there are many fields that you can fill out. For a storage account, the important ones are:

  • Subscription: The subscription you want to create the storage account in.
  • Resource group: The resource group you want to create the storage account in. You might also decide to create a new resource group for this tutorial.
  • Storage account name: A unique name across all Azure accounts that is between 3 and 24 characters long. It can only contain lowercase letters and numbers.
  • Region: The Azure region that will host the data.
  • Performance: The choice Standard is good enough for testing.
  • Redundancy: The choice of Locally-redundant storage is good enough for testing.

Once you’ve clicked Review and then Create, there should be a storage account waiting for you in the resource group you chose within a few minutes. Once you go into the storage account, go to Containers in the left sidebar:

In there, you can create a new container that essentially works as a namespace for your data. I named my container newcontainer and entered it. You can now see an upload button in the upper left corner. Click upload and then locate the beloved company_info.txt file you saved earlier.

Now our data is in Azure. We can proceed to the next step 🔥


Creating an Index and Deploying a Model

When I read a cookbook, I often consult the index at the back of the book. An index tells you quickly which recipe is on which page. Looking through the whole book every time I want to make pancakes is not good enough in a busy world.

Why am I telling you this? Because we are also going to make indexes for the internal data that we uploaded in the previous section! This will make sure that we can quickly locate relevant information in our internal documents. This way, we don’t need to send all the data with every question to the GPT models. That would not only be costly but also impossible for even medium-sized data sources due to the token limits in GPT models.

We’re going to need an Azure Cognitive Search resource. This is a resource that will help us to automatically index our documents. As before, head over to the Azure marketplace and find the Cognitive Search resource:

When creating the Azure cognitive search resource, you should choose the same subscription, resource group, and location as for the storage account. Give it a unique service name and choose the pricing tier Standard. Proceed by clicking the Review button and then click the Create button.

When it is completed, we are actually going to create another resource, namely the Azure OpenAI resource. The reason is that we are not going to create the index in the Cognitive Search resource, but rather do it indirectly from the Azure OpenAI resource. This is more convenient for simple applications where you don’t need a lot of fine-tuning of the index.

Head again over to the Azure marketplace and find the Azure OpenAI resource:

You need to pick the same subscription, resource group, and region as the other resources. Give it a name and select the pricing tier Standard S0. Click your way to the Review and Submit section, and then click Create. This is the final resource you need for the tutorial. Grab a coffee or another beverage while you wait for the resource to complete.

When inside the Azure OpenAI resource, you should see something like this in the Overview section:

Click on Explore , which takes you to Azure OpenAI Studio. In the studio, you can deploy models and connect your internal data by using a graphical user interface. You should now see something like this:

Let us first create a deployment of a GPT model. Head over to the Models section in the left sidebar. This will show you the available models that you can use. The models you see might be different than mine, based on the region you have chosen. I will select the model gpt-35-turbo and click Deploy. If you don’t have access to this model, then pick another one.

Pick a deployment name and create the deployment. If you head over to the Deployments section in the left sidebar, then you can see your deployment. We will now head over to the Chat section in the left sidebar, where we will start to connect the data through an index.

You should see a tab called Add your data (preview) that you can select:

When you are reading this tutorial, this feature might be out of preview mode. Select Add a data source and choose Azure Blob Storage as your data source. The rest of the information you need to input is the subscription, the Azure Blob Storage resource, the storage container where you placed the document company_info.txt, and the Azure Cognitive Search resource we created:

Enter an index name and leave the option Indexer schedule as Once. This is how often the index should be updated based on potentially new data. Since our data won’t change, we pick Once for simplicity. Accept that connecting to an Azure Cognitive Search account will incur usage charges and continue. You can pick Keyword as the Search type under Data management:

Click Save and close and wait for the indexing to finish. Now the deployed GPT model has access to your internal data! You can ask a question in the Chat session to try it out:

The chatbot gives the correct answer based on the internal documents 😍

It gives a reference to the correct document so that you can check out the source material for confirmation.

There is also a button called View code, where you can see the requests made in various programming languages. You can send this request from anywhere, as long as you include the endpoint and access keys listed. Hence you are not limited to the playground here, but can incorporate this into your own applications.
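
As a rough illustration, a request from plain Python could look like the hedged sketch below, which uses the requests library to call the chat completions endpoint. The resource name, deployment name, API key, and API version are placeholders, and the extra fields needed to ground the answers in your own data vary between API versions, so copy the exact endpoint and payload from the View code button rather than relying on this sketch.

```python
# A hedged sketch of calling the deployed model over REST from Python.
# Resource name, deployment name, API key, and API version are placeholders;
# the fields that connect the request to your own data should be copied from
# the View code button, since they differ between API versions.
import requests

endpoint = "https://<your-resource>.openai.azure.com"
deployment = "<your-deployment-name>"
api_version = "2023-07-01-preview"  # assumption; use the version shown in View code
headers = {"api-key": "<your-api-key>", "Content-Type": "application/json"}

payload = {
    "messages": [
        {"role": "system", "content": "You are an AI assistant that helps people find information."},
        {"role": "user", "content": "What is the desk policy in SeriousDataBusiness?"},
    ],
    "temperature": 0.2,
}

url = f"{endpoint}/openai/deployments/{deployment}/chat/completions?api-version={api_version}"
response = requests.post(url, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])
```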

You have now successfully connected a GPT model with internal data! Sure, the internal data is not very interesting in our tutorial. But you can imagine doing this with more pressing material than desk policies.


Other Considerations

Here I want to point you towards some further things to play with.

System Messages

You can also specify a system message in the Chat playground:

This is sometimes called a pre-prompt in other settings. It is a message that is sent every time before the actual question is asked by the user. The purpose is to give context to the GPT model about the task at hand. It defaults to something generic like You are an AI assistant that helps people find information.

You can change the system message to request a specific format of the response, or to change the tone of voice for the answer. Feel free to play around with this.

Parameters

You can find a Configuration panel (it is either already visible or you need to go to Show panels and select it). It looks something like this:

Here you can tweak many parameters. Maybe the most important one is Temperature, which indicates how deterministic the answer should be. A low value means highly deterministic, so the model will give roughly the same answer each time. A high value is the opposite, so the answer varies more each time. A high value often makes the model seem more creative.

Deploying to a web app

When you have finished tweaking system messages and parameters, you might want to deploy the model to a web application. This can be done rather easily from within Azure OpenAI Studio. Simply click the Deploy to button and select A new web app...

After filling out the relevant information you can access the model from a web application. This is one of the ways to make the model available to others.


Wrapping Up

Photo by Spencer Bergen on Unsplash

In this tutorial, I’ve shown you how to connect GPT models with internal company data in Azure. Again, I want to emphasize that this is only the first step to getting an amazing AI assistant. The next steps require expertise in areas such as data quality, index optimization, service design, and automation. But you now have a minimal setup that you can develop further 👋

If you are interested in AI or data science then feel free to follow me or connect on LinkedIn. What is your experience with connecting GPT models to company data? I’d love to hear what you have to say 😃

Like my writing? Check out some of my other posts for more content:

The post Beginner Tutorial: Connect GPT Models with Company Data in Microsoft Azure appeared first on Towards Data Science.

]]>
From Meetups to Mentoring: How to Network as a Data Scientist https://towardsdatascience.com/from-meetups-to-mentoring-how-to-network-as-a-data-scientist-ecab8a95ab35/ Thu, 03 Aug 2023 18:20:02 +0000 https://towardsdatascience.com/from-meetups-to-mentoring-how-to-network-as-a-data-scientist-ecab8a95ab35/ Five Tips for Aspiring, Junior, and Senior Data Scientists

The post From Meetups to Mentoring: How to Network as a Data Scientist appeared first on Towards Data Science.

]]>
Photo by Nik on Unsplash

Overview

  1. Introduction
  2. Tips for Aspiring Data Scientists
  3. Tips for Junior Data Scientists
  4. Tips for Senior Data Scientists
  5. Wrapping Up

Introduction

Networking. We’re constantly told phrases like "I’ve heard networking is useful, why don’t you try that?". Encouragements like this are well-meant, but not very useful. Most people understand the value of networking. Many data scientists are not hired through official job listings but through referrals and connections.

It can be tricky to understand how to network effectively. Some advice is outdated, some is not relevant for data scientists, and some is outright wrong. It’s easier to look back at the previous step in your career and understand what actually worked. This is because you can see the common denominator between successful people around you ✨

In this blog post, I want to give you my tips for how to network as a data scientist. Below I’ve divided this into tips for aspiring data scientists, junior data scientists, and senior data scientists. I’m a senior data scientist myself, so I have experience with all of these steps. I’ve also talked with many professional recruiters and been on both sides of hiring interviews for data scientists. Still, please take everything I say with a grain of salt. I’m no more an authority on this topic than anyone else, and these are just my own reflections.

Before we start, you should know that I think approaching networking as an optimization problem is fundamentally flawed. Having 1,000 LinkedIn connections or 5,000 likes on your post about LLMs is not the goal. The goal of networking is to establish genuine connections with other like-minded people who love the same things you do. These connections give you a network of people that you can both help and reach out to when needed.

Paradoxically, approaching networking as an optimization problem makes you bad at networking. It makes you focus on metrics rather than people. Meeting someone who only converses with you for the "clout" is off-putting. This approach is also psychologically damaging. You begin to see people as mere means to achieving some arbitrary metric. This is neither particularly fun nor effective 😔

So while reading the tips below, don’t forget that networking is fundamentally about people. This, I’m sure, sounds about as cliché as it gets. But it can also be surprisingly calming. You don’t need to put on a salesperson persona for networking. You don’t need to wear suits and shoot finger guns while talking to new people. Your passion for Data Science is more than enough.


Tips for Aspiring Data Scientists

Photo by Tim Bogdanov on Unsplash

I’m glad you are interested in pursuing a career in data science. I will assume that you are not currently working as a data scientist, but want to do so in the future. You are probably studying at a university or a similar institution. At this stage of your career, I would focus on the following five tips for networking:

Tip 1: Take relevant courses to meet like-minded people

I’m sure courses in archeology or environmental law are interesting. Yet, you will probably not meet many future colleagues in such courses if you aim to be a data scientist. If you take courses in data science, statistics, or informatics, then this will be much more likely. The main reason to take relevant courses is of course to gain knowledge, but don’t underestimate the value of forming a solid network.

When taking relevant courses, try to talk to different people and get to know them a bit. You don’t need to be incredibly social if you don’t want to, but a little goes a long way. Having formed connections with 5–10 people for each relevant course you take quickly adds up.

Tip 2: If possible, apply for a (paid!) summer job that is relevant

If you are able to work a summer job as a data scientist, then this is immensely useful for gaining connections. Some people use this to get a foot in the door at the company and often get a job there after they have finished studying. Even if you don’t, having previous professional experience with data science is a big plus on any CV.

For the networking aspect, working at a summer job will introduce you to mentors and colleagues that you can connect with. People will have real-world knowledge of data science, which is often quite different from the academic version. Forming a network here is crucial for your future career.

When working in the summer job, try to get to know people outside the little bubble you will be put in. Have lunch with designers and chit-chat with product owners. Being in a summer job means that most people there will be very open to you asking questions and connecting. This makes the networking aspect a lot easier 😃

Tip 3: Go to career days and talk to representatives

Often universities will organize career days where companies come and give presentations. This is a great opportunity to talk with representatives from companies and get to know what they are all about. You might even find yourself some free food, which is always a plus as a student.

I know from experience that attending career days is a bit awkward. It’s semi-formal but also relaxed at the same time. Don’t worry. The representatives of the companies often feel the same way. When attending career days, make sure to make the most of it.

Tip 4: Enroll in a mentorship program

Universities often have a mentorship/alumni program, where students can connect with professionals in the field to learn about the day-to-day work of the profession. This is a great opportunity that is most likely free and will introduce you to working professionals.

Mentors can give you tips on technical stuff, but also give you advice on how/when to apply for positions or which positions are good. Professionals who mentor students like this are usually doing it since they love their profession, and know it inside-out. If you decide to participate in such a program, then make sure to come prepared. Have questions you want to ask ready and respect the time of the mentor 🙏

Don’t be afraid to stay in touch with your mentor. Mentors usually love to hear how things are going and can give you more tips if you wonder about something. Later in your career, you can "give back" and become a mentor for someone else if you want to.

Tip 5: Have at least a minimal online presence

You should sign up for the most used online networking tool for data scientists in your country. For most of us, this will be LinkedIn, so I will just refer to LinkedIn from now on. From talking with recruiters, I know that not having a LinkedIn profile at all is somewhat looked down upon. While I don’t necessarily agree personally, this hardly matters.

Add people on LinkedIn that you meet through classes, meetups, and summer jobs. This is a convenient way to stay in touch with them. You can also use LinkedIn to advertise that you are looking for a job when that time comes. Many of my friends have found jobs in this way.

Remember that you don’t need to post anything you are not comfortable with online. In fact, I don’t really think you need to post anything at this stage of your career. Simply create an account, add the relevant information, and stay in touch with people you meet. Simple as that!


Tips for Junior Data Scientists

Photo by Microsoft Edge on Unsplash

I see you are working as a junior data scientist. Congratulations! It’s an exciting field, and I’m glad you are with us. You have a lot to learn technically, but paying some attention to networking is also good. At this stage of your career, I would focus on the following five tips for networking:

Tip 1: Cultivate good relationships with colleagues

This one is almost too obvious. You will most likely be working with other data scientists, data engineers, data analysts, and ML engineers in your career. Make sure that you have a good relationship with them and connect with them. Talk about data science and related disciplines by all means, but don’t shy away from talking about other things as well. Building a relationship with someone is usually easier if you know some simple things about them personally.

I would recommend getting to know people in other departments as well. When you change jobs, you might find that someone from the communication department at your previous company works there now. They can still give a recommendation based on your personality, even though they probably can’t say much about your data science skills. Talking to people from different disciplines is also a great way to avoid being completely insulated. Some of the best data scientists I know have at least a minimal working knowledge of nearby fields.

If you are working remotely, this will be significantly more difficult. I’ve worked remotely myself, so I’m not discouraging remote work as a practice in general. I just want to be honest about the reality of the situation if you choose such a position. If you do choose a remote position, then you can put more work into the next tips 💪

Tip 2: Attend meetups, workshops, and conferences

Outside of your job, there will be opportunities to meet other data scientists. This usually comes in the form of meetups, workshops, and conferences. All of these are good to attend from time to time to network. If they are good, then you might also learn a lot. While meetups and workshops are often free, conferences usually have fees that are quite expensive. You can ask your company to cover this.

I would advise you to only go to such events if they genuinely interest you. There is little point in you going to a React workshop if you are not interested in front-end development. Find something that genuinely interests you – otherwise, you will just be exhausted.

Finally, if you find a meetup that you like and have attended a few times, don’t be afraid to volunteer to present. People who organize meetups often struggle with finding presenters. You don’t need to be a senior data scientist to give a good talk. In fact, some of the best talks I’ve seen have been from less experienced people.

Tip 3: Don’t be afraid to mentor others

You might think that you are still not at the level where you can mentor others. But this is not true. Even someone who has only 6 months of working experience can mentor a new graduate on some things. Think back on when you started your job. What would have been helpful to know? Maybe it’s something as mundane as where certain information is located. Or maybe how data science deployment works in the organization? This might be easy for you now, but challenging in the beginning.

By mentoring others, you form a connection with them. They can come to you with questions, and you will try to assist them. It’s not surprising that many new employees think of their mentor (given that the mentor actually does their job) as the person they know best. So ask your manager if you can be a mentor for new employees as soon as possible 🔥

Tip 4: Write blogs or teach a course on technical topics

You are in a position now where you can teach others. While mentoring as mentioned above is a very personalized way of doing this, don’t be afraid to do it through blogs or courses as well. If you write a blog post on a technical library that lacks good documentation, then many people will be grateful. Just make sure to pick a topic that you are comfortable with.

My tip here would be to start small. Are people in your company complaining that the new graduates don’t know version control software like Git? Offer to give a 1-hour course on it. Are colleagues complaining that it is difficult to understand when to use Pandas vs Polars vs Spark for data processing? Create a blog post for this where you take the time to investigate the differences. This can be valuable and great practice as well.

Tip 5: Improve your online presence

I think it can be beneficial to up your online presence game a little at this stage of your career. Don’t worry, you don’t need to become an opinionated data scientist influencer. But putting a bit more effort into your profile and interactions is probably a good thing.

For your profile, add more detailed job descriptions, skills, certifications, and so on. If you want to work towards becoming a senior data scientist, then emulate such profiles. Have a clear picture and an inviting and professional about-me section. This should not take you many hours to write, but many recruiters care about this. You can do better than having a picture of your cat as a profile picture and a description that reads "I do that data stuff."

If you want to post content, then post about interesting technology, interesting talks or blogs, or anything that genuinely interests you. You don’t need to make posts addressing current hype trends in data science or make clickbait posts with titles like "SQL is dead!". It’s not, and you hopefully know that by now 😅


Tips for Senior Data Scientists

Photo by bruce mars on Unsplash

Congratulations on becoming a senior data scientist! I’m sure that you’ve worked hard for this. This change in seniority also changes how you network quite a lot. In short, the major change is that you are expected to "put things out there" in various ways. At this stage of your career, I would focus on the following five tips for networking:

Tip 1: Present at conferences, meetups, webinars, seminars, etc.

In the previous stages of your career, you mostly attended things. While this is still a valuable way to network, it pales in comparison to presenting. When presenting at various venues, you are indirectly showcasing your knowledge. If you do a good job, people will come to you. This reverses the polarity of your networking – you now have a pull effect. Just know that presenting well takes a lot of time:

  • First of all, it takes years for most people to become good at presenting. Start early. The earlier in your career, the more accepting people are of mediocre presentations. As a junior data scientist, people will be impressed with your dedication even if your presentation skills need refining. At a certain point in your career, being awful at presentations is looked down upon a lot more.
  • Secondly, even when you are pretty good at it, preparing a presentation takes a lot of work. Hence you should select carefully when to present at an event. It is better to give 5 excellent presentations a year than 50 mediocre ones. Prioritize quality over quantity.

Tip 2: Improve your writing

You will find that there is a lot of writing in your job. Whether this is documentation, memos, instructions, presentations, summaries, point-of-views, blog posts, emails, or something completely different does not really matter. The point is that writing has become something you do daily. And you are suddenly expected to be good at it.

The fact is that you will be judged on your writing. As such, it can be a serious impediment to networking at a senior level. To some it seems shallow to judge someone on how they write. But the writing is often the only output the receiver can judge or even understand. Most stakeholders do not understand your clever hyperparameter search or your intricate data pipeline. If you can’t explain in simple terms why a project should be continued, then don’t act surprised when it is cancelled 😲

Having clear written communication is increasingly important for networking in particular. You will talk to many people through chat, email, and similar interfaces. If you write poorly, then this can sour the impression you give off.

For many of us, English is not our native language. Hence we often have two (or more!) languages that we need to communicate clearly in. This sounds daunting. Luckily for us, writing clearly is mostly language independent. Start working on the language you use the most in your networking. You will quickly find that the clarity of your other languages improves in parallel.

Tip 3: Build something of value

One of the ways you can prove your skills in the eyes of others is to build something of value. Whether this is a Python library that implements a new ML algorithm or a GitHub repo that demonstrates a cool use case is entirely up to you. It can also be an internal tool in your organization.

I want to emphasize that it is not always necessary to build something with great originality. Maybe your organization uses poor data pipeline orchestration routines. Taking ownership and fixing this brings value, even though it has been done 1000 times before in other organizations. Try to create something that will genuinely help others.

If you make internal tools, then this will gain the respect of others in your organization. If you make open-source tools, then this will help a wider audience. Both of these options are great for networking. If people already have a good impression of you, then networking becomes a lot simpler.

Tip 4: Become an expert on something

When starting out as a data scientist, you want to learn a bit about everything. This ranges from NLP to cloud platforms. As a senior data scientist, you should be knowledgeable about a broad range of topics. But you should also become an expert on something. This can be something technical like anomaly detection. It can also be more soft-skill based, like how to successfully implement agile methodologies in data science.

By becoming an expert, you become sought after because of your specialized knowledge. This opens many doors for networking, as you will be asked to speak about and write about your expert knowledge. Picking a specialized topic can be difficult. My best suggestion is to deep dive into something you really care about. If the topic is applicable in a wide range of settings, this is also a plus.

When networking, you can leverage your expert knowledge to stand out. Almost every data scientist knows a bit about deep learning. But how many data scientists do you know who are experts in ensemble methods? Or in aligning long-term organizational strategy with data initiatives? This stands out 😍

Tip 5: Have a strong online presence

As a senior data scientist, you should aim to have a strong online presence. This includes engaging more with others and spending more time putting your ideas and opinions out there. There are many ways to do this, and you get to decide how. The format could be everything from LinkedIn posts to video tutorials. Focus on making quality content that really engages and illuminates topics. Again, remember that quality trumps quantity.

People often speak of developing your own brand. This term is a bit loaded and I prefer to talk about developing your own voice. When you write and talk, you will need to make some choices. Will you put on a friendly and helpful voice, or a more authoritative one? The choice of voice should be paired with what you want to achieve:

  • Say you want to hire more junior data scientists for the team you have started to lead. Then putting on a "get off my lawn" type reactionary voice online is probably not a good idea.
  • Say that you want to convince the world that Rust is the future of data science. Then putting on a friendly and agreeable voice is maybe also a mismatch. You should never choose an abusive voice, but sometimes a tinge of pent-up passive aggressiveness is entertaining 😤

Wrapping Up

Photo by Spencer Bergen on Unsplash

In this blog post, I’ve given you my tips for networking at the various stages of a data scientist’s career. I don’t want you to treat this as gospel, but rather as pointers for what you could be working on.

If you are interested in data science then feel free to follow me on LinkedIn. But to make the connection meaningful, please tell me a personal opinion of yours when it comes to networking. I’d love to hear what you have to say 😃

Like my writing? Check out some of my other posts for more content:

The post From Meetups to Mentoring: How to Network as a Data Scientist appeared first on Towards Data Science.

How to *Not* Get Machine Learning Models in Production https://towardsdatascience.com/how-to-not-get-machine-learning-models-in-production-742e6b79847d/ Sat, 08 Jul 2023 05:59:37 +0000

Overview of Your Journey

  1. Introduction – No Production, No Problems!
  2. Notebooks can be Used for Everything!
  3. Why Automate When You have the Time?
  4. Testing? Just Never Make Mistakes!
  5. Dependency Management in my Head!
  6. Wrapping Up

1 – Introduction – No Production, No Problems!

As data scientists, data engineers, and ML engineers, we are bombarded with information about ML models in production. Hundreds of videos and thousands of blogs have tried their best to help us avoid the situation we find ourselves in. But to no avail. Right now, it pains me to say, there are ML models in production all around the world. Generating value for millions of unknowing people. On every street corner. And we tolerate it, just because it’s common.

When going to conferences, I listen to nervous data scientists on stage in front of big audiences talk about production. It’s clear from their forehead sweat and clammy hands that the situation is serious. This has been going on for many years, but we didn’t trust their prophecies. And look at us now. We should have listened.

This is no time for anyone to say "I told you so". We need to band together to retake what is ours. We need to unilaterally profess disdain for the modern ways of doing things. We need to go back to better times, when ML models in production were just a nonbinding bullet point on job adverts for middle-sized companies.

Someone has to take the lead and guide this journey of redemption. And who better than me? Without intending to brag, I’ve made several ML models that did not reach production. Some of them were not even close. I can share with you some of the best tips so that you don’t need to replicate your development environment – because there will be nothing else.

I’ve clearly divided each following section into two parts. The first is called The Righteous Way. It tells you how to avoid ML models in production. The second is called The Sinful Way. This lets you know what to avoid, as it is a fast track to getting ML models in production. Don’t get them confused!


2 – A Single Notebook can be Used for Everything!

The Righteous Way

I’ve found that one of the simplest ways to avoid getting ML models in production is simply to have your entire code base in a single Jupyter notebook file. This includes model training, model predictions, data processing, and even configuration files.

Think about it. When all the sensitive configuration files are within your main document, it becomes virtually impossible to upload anything to GitHub or Azure DevOps without introducing major security risks. In addition, have you ever tried to read a pull request from someone who has modified a file with 100,000 lines? The best response is to simply drop the remote hosting and version control altogether.

This, my friend, puts us on the fast track to avoiding production. No remote hosting means that silos will be inevitable. And have you stopped to think about the cloud consumption costs of running the entire model training every time you need a prediction? Remember that we only have a single Jupyter notebook for both training and prediction. The single notebook architecture, as I’ve proudly named it, is simply genius.

The Sinful Way

Instead of the single notebook architecture, people have over time gone astray. They’ve started to split their code, configurations, and documentation into multiple files. These files unfortunately follow conventions that make them easy to navigate. There are even tools like Cookiecutter that provide easy-to-use templates for structuring ML projects.

The code itself is often divided into modular components like functions, classes, and files. This encourages reuse and improves the organization. Coding standards are enforced with PEP8 and automatic formatting tools like Black. There are even perverse blog posts guiding data scientists to better software development practices. Disgusting!

Not only is the source code structured. There are tools like Data Version Control (DVC) for versioning big data, or tools like MLflow for structuring ML model packaging. And none of these are even particularly difficult to get started with. Steer clear of this!
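
To show just how dangerously easy this sinfulness is, here is a minimal sketch of MLflow experiment tracking. The dataset, model, and parameter values are made up for illustration, and it assumes mlflow and scikit-learn are installed:

```python
# A toy training run tracked with MLflow: parameters, a metric, and the
# packaged model artifact are all logged so the run can be reproduced later.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    accuracy = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "model")  # the packaged model artifact
```

A few lines like these are all it takes before someone can find, compare, and deploy your runs. You have been warned.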


3 – Why Automate When You have the Time?

The Righteous Way

Back in the day, things were different. We felt ownership of our code. This ownership came from revisiting the code routinely at regular time intervals. That’s right, we ran all the code manually. This is clearly the way god intended, and I am tired of pretending that it’s not.

The manual approach gave you complete control. There was no need to check if a data pipeline had been initiated at 3 a.m. – you were there! Meticulously typing out execution commands from memory every night. When you have the time, why rely on CRON expressions or Airflow DAGs to get the job done? At the end of the day, can you really trust anyone?

This manual approach turned out to be a blessing against ML models in production. Data drift happens to ML models, and then the models need to be retrained. But with the manual approach, someone has to stay put and watch for when sufficient data drift has happened. In this economy? Not likely! It’s probably better to drop the whole production aspect and return to better ways.

The Sinful Way

Let’s talk about the elephant in the room: Continuous integration & Continuous delivery (CI/CD). I know, it makes my blood boil as well. Nowadays, people automatically check for code quality and run tests before updating the code base. Tools like GitHub Actions or Azure DevOps Pipelines are also used to automate retraining of ML models in production. Do we even stand a chance?

There is more! People now use tools like Terraform to set up the infrastructure necessary to support ML models in production. With this infrastructure as code approach, environments can be replicated across a wide range of settings. Suddenly, ML models in production in Azure are also in production in AWS!

There are also tools like Airflow that help with the orchestration of data pipelines. Airflow ensures that preprocessing steps wait for the previous steps to complete before executing. It also gives you a GUI where you can inspect previous pipeline runs to see how things have gone. While these things seem simple, they can catch errors quickly before they propagate through your system and corrupt your data quality. High quality data is unfortunately key for successful ML models in production.
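
For the record, here is roughly what such sinful orchestration looks like. This is a minimal sketch assuming Airflow 2.x; the DAG name, schedule, and task bodies are made up for illustration:

```python
# A toy DAG where cleaning waits for extraction, and training waits for cleaning.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extracting raw data")


def clean():
    print("cleaning data")


def train():
    print("training the model")


with DAG(
    dag_id="nightly_training",            # hypothetical pipeline name
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 3 * * *",        # 3 a.m. every night, no manual typing required
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    clean_task = PythonOperator(task_id="clean", python_callable=clean)
    train_task = PythonOperator(task_id="train", python_callable=train)

    # Each step only runs after the previous one has completed successfully.
    extract_task >> clean_task >> train_task
```

Notice how little ceremony is needed to never again stand watch at 3 a.m. Truly shameful.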


4 – Testing? Just Never Make Mistakes!

The Righteous Way

The philosophy of testing is a response to countless production failures and security breaches. The idea is to catch as many issues as possible before launching a product. While indeed noble, I think this is misguided. A much simpler approach reveals itself to those who pay attention. Precisely! It’s clear as day once you see it. You should simply never make mistakes.

With mistake-free code, there is no need for tests. Even better, while you are 100% certain about the validity of your code, others are less so inclined. They will doubt the correctness of your code. As such, they will avoid putting your ML models in production for fear that everything will break. You know it wouldn’t. But they don’t have to know that. Keep your pristine coding skills to yourself.

Not writing tests has the additional advantage that you get less documentation. Sometimes tests are vital for letting others understand what your code should be doing. Without tests, even fewer people will bother you, and you can continue on in peace.

The Sinful Way

Writing tests has become common, even for ML models and data engineering pipelines. Two major classes are unit tests and integration tests. Unit tests, as the name suggests, test whether a single unit, like a Python function, works as intended. Integration tests on the other hand test whether different components work seamlessly together. Let’s focus on unit tests.

In Python, you have the built-in library unittest for writing, you guessed it, unit tests. This is OOP-based and requires a bit of boilerplate code to get started. More and more people are using the external library pytest to write unit tests in Python. It is functional rather than OOP-based, and requires less boilerplate code.
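
To illustrate how little boilerplate we are talking about, here is a minimal pytest sketch. The cleaning function is a made-up example, not from any real project; save the file as test_cleaning.py and run pytest to execute it:

```python
# A tiny unit under test plus two pytest-style tests for it.
def remove_negative_prices(prices: list[float]) -> list[float]:
    """Hypothetical cleaning step: drop obviously invalid price values."""
    return [price for price in prices if price >= 0]


def test_remove_negative_prices():
    assert remove_negative_prices([10.0, -1.0, 3.5]) == [10.0, 3.5]


def test_remove_negative_prices_keeps_zero():
    assert remove_negative_prices([0.0]) == [0.0]
```

Plain functions and plain assert statements, and pytest discovers and runs them on its own. No wonder people have stopped making mistakes the righteous way.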

Writing unit tests also has a side effect. It forces you to write code modularly. Modular code that is well-tested breaks less often in production. This is bad news for your ML models. If they don’t break in production, then they will stay there forever. When you think about it, ML models breaking in production are just trying to escape. And who could blame them?


5 – Dependency Management in my Head!

The Righteous Way

Managing dependencies is an important part of any ML project. This includes for instance knowing which Python libraries you’ve used. My advice is to simply remember which libraries you’ve installed, the operating system you are using, the runtime version, and so on. It’s not that hard, I’m sure there is even an app for keeping track of this.

I do sometimes wake up at night and wonder if I have version 0.2.3 or version 0.3.2 of scikit-learn running. No matter! All the versions exist, so there shouldn’t be a problem…right? If I didn’t routinely solve dependency conflicts, then my dependency conflict skills would go stale.

The advantage of simply remembering your dependencies is that it becomes difficult for others to run your code. Especially if you don’t want to tell them all the dependency details. In this way, you can avoid someone suddenly getting the sneaky idea to lift your ML models to production.

The Sinful Way

People with bad memory skills have opted for the easy way out. They’ve looked for solutions that handle your dependencies for you. I swear, people get lazier every day! A simple approach to managing runtime versions and library dependencies is to use a virtual environment and a requirements.txt file. In Python, there are tools like virtualenv that allow you to set up virtual environments easily.
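
For the lazy among us, pinning versions can even be done programmatically. Below is a minimal sketch using only the Python standard library (3.8+); the package list is a made-up example, and the listed packages must already be installed. In practice, running pip freeze > requirements.txt inside your virtual environment achieves the same thing:

```python
# Write the exact installed versions of a few packages to a requirements.txt file,
# so nobody has to remember whether scikit-learn is on 0.2.3 or 0.3.2.
from importlib.metadata import version

packages = ["scikit-learn", "pandas", "numpy"]  # hypothetical project dependencies

with open("requirements.txt", "w") as requirements_file:
    for name in packages:
        requirements_file.write(f"{name}=={version(name)}\n")
```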

Those that want to go a step further use a container-based technology like Docker. This can replicate everything from the operating system and upwards. In this way, sharing ML models becomes pretty effortless as long as everyone knows how to run Docker containers. In modern tools like Azure Machine Learning, there are easy ways to use standardized Docker images. These can include commonly used libraries like scikit-learn and tensorflow. And you don’t even have to remember the versions yourself!


6 – Wrapping Up

I hopefully don’t have to point out that you DO want Machine Learning models in production. The points I’ve argued for here can be reversed and help you get those pesky ML models in production successfully.

Feel free to follow me on LinkedIn and say hi, I’m not nearly as grumpy as I pretended in this blog post ✋

Like my writing? Check out some of my other posts for more Python content:

The post How to *Not* Get Machine Learning Models in Production appeared first on Towards Data Science.

The Soft Skills You Need to Succeed as a Data Scientist https://towardsdatascience.com/the-soft-skills-you-need-to-succeed-as-a-data-scientist-ceac760230d3/ Mon, 19 Jun 2023 19:10:03 +0000 Top 5 Soft Skills That Can Advance Your Career as a Data Scientist

Overview of Your Journey
  1. Introduction
  2. Skill 1 – Communication
  3. Skill 2 – Collaboration
  4. Skill 3 – Curiosity
  5. Skill 4 – Project Management
  6. Skill 5 – Mentoring
  7. Wrapping Up

Introduction

When you are working on your career as a data scientist, it’s easy to focus on the hard skills. You might want to learn a new ML algorithm like an SVM with a non-linear kernel, a new software technology like MLflow, or a new AI trend like ChatGPT.

These skills are comfortable to learn because it is easy to measure success in them. Take MLflow as an example. You might first start to learn about what MLflow can provide to your ML lifecycle. You learn about model artifacts, ML project structure, and model registration. You finish a course, spend a few hours reading the user guide, and even implement it in a real-life project. Great! After doing this, you can confidently say that you know some MLflow and can add this as a skillset in your CV.

What about a soft skill like, say, time management? How would you go through the same process? Really stop and think about this. There are certainly books on time management you could read, but it is not nearly as concrete as reading the documentation on MLflow. You could implement time management in your daily routines, but it is not as demonstrable as implementing MLflow in an ML project. You could list time management in your CV, but what does that even mean? 😧

It is a fact of life that soft skills are harder to measure, simply by their nature of being less tangible. Many people draw the conclusion that soft skills are less valuable than hard skills. But this is a critical mistake! Just because something is difficult to measure does not mean that it is not worth working on!

I’m sure that we’ve all experienced a colleague that had time management down to such a degree that their output was almost twice the amount of others. This is a boost that is almost impossible to obtain with hard skills. Nowhere have I ever seen someone learn MLflow, and then have twice the output of other data scientists. So even if soft skills are hard to measure, they can provide value well beyond many typically hard skills 🔥

This is especially true for Data Science. The positive stereotype of a data scientist is someone with great problem-solving capabilities. The negative stereotype is that he or she is a bit lacking in common soft skills needed to succeed in business environments. By spending some of your time working on soft skills, you can gain a massive advantage and use this to forge your own career path.

In this blog post, you will learn the top 5 soft skills a data scientist needs to succeed. This is, of course, just based on my own opinion. However, that opinion has been shaped by watching many other data scientists and seeing what has made some of them stand out from the rest.


Skill 1 – Communication

The first one is as standard as they come. You should learn how to communicate well. This includes many things:

  • You should be able to communicate your findings of an exploratory data analysis (EDA) phase clearly. And for the love of god tailor it to the audience. A CEO does not want to hear about the choice of distribution you made to fit the data, or which Docker image you used for running the experiment. The CEO might be enthusiastic about data science, but he or she has hundreds of other things to consider as well. Give a high-level overview of the EDA and focus on business outcomes for the CEO.
  • When giving talks, make sure that you say something that is of value to the audience. This sounds obvious, but apparently, it’s not. Don’t ramble on about complex architecture or intricate hyperparameter tuning just to make yourself sound smart. This is just a defense mechanism. Rather, make sure that what you say leaves the listener with something of value. By doing this, you will suddenly have people coming up to you wanting to discuss what you talked about.
  • When speaking to others, make sure that you listen to what they are saying. This is not the same as nodding and waiting for your turn to speak. To actually listen means to put yourself into the shoes of the speaker, and to carefully try to understand their perspective. Say that a product owner explains to you that they want a faster pace and less exploration. Rather than waiting to explain why they are wrong, take a minute to actually listen. The product owner is maybe evaluated on progress, and might not understand the upside to exploratory analysis. And the exploratory analysis might have gone on a bit longer than necessary if you are being honest. Try to listen and then work with the product owner to find a good solution for both of you.

In addition, you should work on your writing. No, really. It’s not awful, but it could be shorter. You have a tendency to write complicated sentences, make unnecessary explanations, and drag on about details that don’t matter half as much as you think. Don’t worry, I do too 😳

Brevity can signal confidence. Compare the following two responses to an email about a feature request due next Friday:

  • I think that will be possible by next Friday. I will start by looking into the problem, understand the solution space, and then work on it in an iterative manner. I will ask for advice if needed, and otherwise progress towards the goal of finishing the feature by next Friday as you requested. I’m sure that everything will go well and that I will deliver a satisfactory result by next Friday.
  • I will work towards delivering that by next Friday. I will ask for advice if any blockers emerge.

There is not much more information in the first statement, except for vague talk about the iterations of your work and promises of satisfactory results. Imagine being a project manager or a tech lead who has to read stuff like this day in and day out. Cut this out! Say what you want to say in a professional manner, and then move on to solving the problems at hand.


Skill 2 – Collaboration

Very little impactful data science work is done by a single data scientist. Sure, there are a few exceptions. But most impactful data science is done by teams of data professionals with backing from other occupations like front-end/back-end developers, platform engineers, testers, domain experts, project managers, and the list goes on.

This means that collaboration is not only useful but completely essential for successful data science. Here are a few ways you can work on your collaboration skills:

  • When depending on other roles, understand the interface between your work and theirs. Say that you are collaborating with a data engineer who writes Spark in Databricks or in Synapse Analytics. The output of their work is tables that are cleaned and in the correct format for data science. But what is the correct format? This depends completely on which features you want to use, and which algorithms you plan to use. You don’t want to end up in a situation where a data engineer meticulously cleans a column in a table that you immediately drop because you are not planning to use it. This is a symptom of bad collaboration.
  • When other roles depend on you, plan early for how to secure good collaboration. Say you are planning to develop an ML model that ingests real-time data and predicts a value. The prediction will then be sent to both the user in a front-end app and also to a Power BI dashboard for internal tracking. Then both the front-end developers and the data analysts should be kept in the loop regarding the future format of the data. You might even provide them with mock data showing the exact structure of the data, as in the sketch after this list. In this way, you ensure that the people that depend on you don’t have to wait for you to finish to do their work. When people are collaborating well, it’s like parallel processing. When they are not, it becomes single-threaded and everything slows down.
  • When collaborating with other data scientists, delegate clear ownership. Since data scientists come from a variety of backgrounds, their skill sets are quite different. You might have a data scientist who is really good at getting those extra percentages of accuracy for a model. Another data scientist might be really good at getting models into production and monitoring for data drift. Different people can take ownership of different aspects based on their experience. Every data scientist can still contribute to every aspect, though.
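
As a concrete example of the mock data idea above, here is a minimal sketch of a prediction payload you could hand to front-end developers before the model exists. All field names and values are hypothetical:

```python
# A mock prediction payload that fixes the interface early, so front-end work
# can proceed in parallel with model development.
import json

mock_prediction = {
    "customer_id": "12345",
    "churn_probability": 0.42,
    "model_version": "0.1.0",
    "predicted_at": "2023-06-01T12:00:00Z",
}

print(json.dumps(mock_prediction, indent=2))
```

Agreeing on a structure like this early is cheap, and it spares everyone the last-minute scramble when the real model starts producing output.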

Finally, there is the more generic stuff that has nothing to do with data science. Everyone working on the same team deserves to be treated with respect. This is independent of their technical background, skill level, gender, or any other factor that is irrelevant for common courtesy.

People sometimes make mistakes. It’s important to acknowledge mistakes while recognizing that they are a natural part of learning. You should aim to create a culture in the team where mistakes can be admitted without fear of retaliation or ridicule. If you fail to create such a culture, then mistakes will not stop happening. They will simply happen under the radar and resurface later when fixing them is a lot harder.


Skill 3 – Curiosity

Photo by nine koepfer on Unsplash

I’ve always felt that data scientists are naturally curious people. They like to learn a new ML algorithm or keep track of new developments in their field. But it varies a lot more whether this curiosity extends to technologies, methods, and approaches in nearby disciplines.

Some data scientists are excited to learn more about software development, design principles, project management, data analysis, data engineering, business impact, and so on. Others want to stick to their own bubble and only work on data science. While this is perfectly fine, you should not be surprised if you are evaluated lower than colleagues that have the curiosity to explore nearby areas.

Is this unfair? Not really 🤷

The data scientist who knows software development is simply more useful than the one that doesn’t. Software development skills ensure that there are more possible projects that data scientists can work on. The interface to roles like front-end/back-end developers and data engineers also suddenly becomes a lot easier to manage.

Others only view your output through their own lens. Say a tester is tasked with running integration tests for several components that you have written. If your components are well-documented, composed modularly, use good coding standards, and have unit tests, then the job of the tester is a lot easier. On the other hand, if you simply have a lot of free-flowing code in a massive R script, then tracking down errors becomes a lot of work for the tester. Naturally, the tester will think that the person that puts effort into the software aspect is more skilled. This is independent of what the ML model within the script does.

Business impact is another classic. One of the most common negative feedback data scientists get is that they are too removed from business objectives. A data scientist that understands the business and comes up with data science use cases that generate ROI will naturally be more valuable to the business.

So how does one work on this broad curiosity? There is only limited time you can spend on other disciplines, but I have two general suggestions:

  • Spend some time trying to understand what other roles are really working on when talking to them. It is pretty quick to pick up some knowledge of KPIs and OKRs when talking to a business analyst, but this knowledge can be super valuable. Personally, I know very little about computer networks since I don’t have an informatics background. But I do know why one would use a private network, the (extremely) rough outline of how one would set this up, and when it might be appropriate to invest in this. I’ve gained this knowledge mainly from talking to network engineers. This knowledge, although pretty surface level, is valuable for knowing when to contact a network engineer about this.
  • When working on projects, jump on the opportunity to do something slightly out of your comfort zone. Does someone need to implement automatic linting in a continuous integration pipeline? I’ll have a go at that! Even though you don’t know much about CI/CD or YAML files, you will probably figure it out. If not, you can always ask for help. By jumping at opportunities to learn something new you…learn something new. I know, it’s pretty profound 😉

Skill 4 – Project Management

Think back on previous projects that have involved a team effort. Think about those projects that have failed to meet deadlines, or have gone over budget. What is the common denominator? Is it too little hyperparameter tuning? Too poor model artifact logging?

Probably not, right? One of the most common reasons for project failures is bad project management. Project management has the responsibility of breaking a project down into manageable phases. Each phase should then be continuously estimated for the amount of work left.

There is a lot more than this that a dedicated project manager is responsible for, ranging from sprint execution to retrospectives. But I don’t want to focus on project management as a role. I want to focus on project management as a skill. In the same way that anyone in a team can display leadership as a skill, anyone in a team can also display project management as a skill. And boy, is this a useful skill for a data scientist.

Let’s for concreteness focus on estimating a single phase. The fact of the matter is that much of data science work is very difficult to estimate:

  • How long will a data cleaning phase take? Completely depends on the data you are working with.
  • How long will an exploratory data analysis phase take? Completely depends on what you find out along the way.

You get my point. This has led many to think that estimating the duration of the phases in a data science project is pointless.

I think this is the wrong conclusion. What is more accurate is that estimating the duration of a data science phase is difficult to do accurately before starting the phase. But project management is about working with continuous estimation. Or, at least, this is what good project management is supposed to be doing 😁

Imagine instead of estimating a data cleaning job in advance that you are one week into the task of cleaning the data. You now know that there are three data sources stored in different databases. Two of the databases are lacking proper documentation, while the last one is lacking data models but is pretty well documented. Some of the data is missing in all three data sources, but not as much as you feared. What can you say about this?

Certainly, you don’t have zero information. You know that you won’t finish the data cleaning job tomorrow. On the other hand, you are very sure that three months are way too long for this job. Hence you have a kind of distribution giving the probability of when the phase is finished. This distribution has a "mean" (a guess for the duration of the phase) and a "standard deviation" (the amount of uncertainty in the guess).

The important point is that this conceptual distribution changes every day. You get more and more information about the work that needs to be done. Naturally, the "standard deviation" will shrink over time as you become more and more certain of when the phase will be finished. It is your job to quantify this information to stakeholders. And don’t use the distribution language I’ve used when explaining this to stakeholders, that can stay between us.
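
To make this concrete, here is a minimal sketch of how you could quantify such an estimate. The three-point (PERT-style) formula is my own illustration rather than anything prescribed here, and the numbers are made up:

```python
# Turn an optimistic, likely, and pessimistic guess (in weeks) into a rough
# mean and spread, and update it whenever new information arrives.
def rough_estimate(optimistic: float, likely: float, pessimistic: float) -> tuple[float, float]:
    mean = (optimistic + 4 * likely + pessimistic) / 6
    spread = (pessimistic - optimistic) / 6
    return mean, spread

# One week into the data cleaning phase described above:
mean, spread = rough_estimate(optimistic=3, likely=4, pessimistic=6)
print(f"Roughly {mean:.1f} weeks remaining, give or take {spread:.1f}")
```

Re-running a calculation like this every week, as the spread shrinks, is exactly the kind of continuous estimation that stakeholders appreciate.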

Having a data scientist able to say something like this is super valuable:

"I think this phase will take between 3 and 6 weeks. I can give you an updated estimate in a week that will be more accurate.


Skill 5 – Mentoring

Mentoring more junior data scientists is often seen as a necessary evil. It’s honest work for sure, but something that is not emphasized much. If junior data scientists would magically learn the concepts themselves, that would be preferable, right?

As you probably can tell, I disagree. Mentoring junior data scientists is immensely helpful for both you and them. Here are three reasons:

  • You learn a lot from explaining concepts: This one is pretty straightforward. By explaining concepts and ideas to junior data scientists, you learn the concepts better yourself. I’ve often found that explaining something to a junior data scientist has helped me articulate something more clearly. It is often only when someone asks you that you realize that you might not understand something as well as you thought. This is a great opportunity to learn more about the topic. In addition, you can highlight to the junior data scientist that it is okay to not know everything. In fact, this is inevitable.
  • You get minor management experience: Soon, you might be stepping into more senior roles like e.g., a chief data scientist. Roles like these often do not have formal management responsibilities for other employees. Yet, there is the expectation that you can lead and influence others. Like any other skill, this comes with practice. In your day-to-day data cleaning and model tuning, you get little practice with this. So if you never mentor anyone, then don’t be surprised if you struggle to lead and influence others. And if you are weighing the possibility of going into a management track, then no mentoring responsibilities in the past is a bit of a red flag. Why have you never mentored anyone? Is it because you don’t want to do it, or because other people don’t want you to do it? None of these possibilities look great.
  • You get to build a connection with junior data scientists: Sure, there is a natural power imbalance between a mentor and a junior data scientist. Nevertheless, it is often the mentor that the junior data scientist will connect most with if the mentor does a good job. By taking responsibility and mentoring junior data scientists, you will quickly find yourself surrounded by people whom you have mentored. These people often look up to you and value your advice. This is not such a bad situation to be in.

My advice is to become a mentor quickly in your career. The three benefits above are only valid if you take the mentoring job seriously. If you do a poor job mentoring, you get few of the benefits and might even get a reputation for being a bad mentor 😬

Some companies have very low expectations for mentoring. You can be asked to have a coffee with the junior data scientist once a month. I would advise going beyond the call of duty. Let the junior data scientist know that they can come to you with problems and questions. Stepping up like this for junior data scientists is a sign that you can take on responsibilities without being explicitly asked.


Wrapping Up

Photo by Spencer Bergen on Unsplash

In this blog post, we’ve seen how soft skills can be super valuable for data scientists who want to move forward with their careers. When interviewing senior candidates in data science, I look at the soft skills they have accumulated as much as the hard skills. Write me a comment if there are other soft skills that you think are essential for data scientists to have.

If you are interested in data science, programming, or anything in between, then feel free to follow me on LinkedIn and say hi ✋

Like my writing? Check out some of my other posts for more Python content:

The post The Soft Skills You Need to Succeed as a Data Scientist appeared first on Towards Data Science.
