Sarah Lea, Author at Towards Data Science
https://towardsdatascience.com/author/schuerch_sarah/

Deep Research by OpenAI: A Practical Test of AI-Powered Literature Review
https://towardsdatascience.com/deep-research-by-openai-a-practical-test-of-ai-powered-literature-review/
Tue, 04 Mar 2025
How Deep Research handled a state-of-the-art review and possible challenges for research

“Conduct a comprehensive literature review on the state-of-the-art in Machine Learning and energy consumption. […]”

With this prompt, I tested the new Deep Research function, which has been integrated into the OpenAI o3 reasoning model since the end of February — and conducted a state-of-the-art literature review within 6 minutes.

This function goes beyond a normal web search (for example, with ChatGPT 4o): the research query is broken down and structured, the internet is searched for information, the findings are evaluated, and finally a structured, comprehensive report is created.

Let’s take a closer look at this.

Table of Contents
1. What is Deep Research from OpenAI and what can you do with it?
2. How does Deep Research work?
3. How you can use Deep Research: A practical example
4. Challenges and risks of the Deep Research feature
Final Thoughts
Where can you continue learning?

1. What is Deep Research from OpenAI and what can you do with it?

If you have an OpenAI Plus account (the $20 per month plan), you have access to Deep Research with 10 queries per month. With the Pro subscription ($200 per month), you get extended access to Deep Research with 120 queries per month, as well as access to the research preview of GPT-4.5.

OpenAI promises that we can perform multi-step research using data from the public web.

Duration: 5 to 30 minutes, depending on complexity. 

Previously, such research usually took hours.

It is intended for complex tasks that require a deep search and thoroughness.

What do concrete use cases look like?

  • Literature review: Conduct a literature review on state-of-the-art machine learning and energy consumption.
  • Market analysis: Create a comparative report on the best marketing automation platforms for companies in 2025 based on current market trends and evaluations.
  • Technology & software development: Investigate programming languages and frameworks for AI application development, with performance and use-case analysis.
  • Investment & financial analysis: Conduct research on the impact of AI-powered trading on the financial market based on recent reports and academic studies.
  • Legal research: Conduct an overview of data protection laws in Europe compared to the US, including relevant rulings and recent changes.

2. How does Deep Research work?

Deep Research uses various Deep Learning methods to carry out a systematic and detailed analysis of information. The entire process can be divided into four main phases:

1. Decomposition and structuring of the research question

In the first step, the tool processes the research question using natural language processing (NLP) methods. It identifies the most important key terms, concepts, and sub-questions.

This step ensures that the AI understands the question not only literally, but also in terms of content.

2. Obtaining relevant information

Once the tool has structured the research question, it searches for information in a targeted way. Deep Research uses a mixture of internal databases, scientific publications, APIs, and web scraping. These can be open-access databases such as arXiv, PubMed, or Semantic Scholar, but also public websites or news sites such as The Guardian, the New York Times, or the BBC. In short, any content that is publicly accessible online.

3. Analysis & interpretation of the data

The next step is for the AI model to condense large amounts of text into compact, understandable answers. Transformer architectures and attention mechanisms ensure that the most important information is prioritized, so the result is not simply a summary of everything that was found. The quality and credibility of the sources are also assessed, and cross-validation methods are normally used to identify incorrect or contradictory information: the tool compares several sources with each other. However, it is not publicly known exactly how Deep Research does this or which criteria it applies.

4. Generation of the final report

Finally, the report is generated and displayed to us. This is done using natural language generation (NLG), so that we see easily readable text.

The AI system generates diagrams or tables if requested in the prompt and adapts the response to the user’s style. The primary sources used are also listed at the end of the report.

3. How you can use Deep Research: A practical example

In the first step, it is best to use one of the standard models to ask how you should optimize the prompt for Deep Research. I did this with ChatGPT 4o, using the following prompt:

“Optimize this prompt to conduct a deep research:
Carrying out a literature search: Carry out a literature search on the state of the art on machine learning and energy consumption.”

The 4o model suggested the following prompt for the Deep Research function:

Deep Research screenshot (German and English)
Screenshot taken by the author

The tool then asked me to clarify the scope and focus of the literature review. I therefore provided some additional specifications:

Deep research screenshot
Screenshot taken by the author

ChatGPT then returned the clarification and started the research.

In the meantime, I could see the progress and how more sources were gradually added.

After 6 minutes, the state-of-the-art literature review was complete, and the report, including all sources, was available to me.

[Video: Deep Research example]

4. Challenges and risks of the Deep Research feature

Let’s take a look at two definitions of research:

“A detailed study of a subject, especially in order to discover new information or reach a new understanding.”

Reference: Cambridge Dictionary

“Research is creative and systematic work undertaken to increase the stock of knowledge. It involves the collection, organization, and analysis of evidence to increase understanding of a topic, characterized by a particular attentiveness to controlling sources of bias and error.”

Reference: Wikipedia Research

The two definitions show that research is a detailed, systematic investigation of a topic — with the aim of discovering new information or achieving a deeper understanding.

Basically, the deep research function fulfills these definitions to a certain extent: it collects existing information, analyzes it, and presents it in a structured way.

However, I think we also need to be aware of some challenges and risks:

  • Danger of superficiality: Deep Research is primarily designed to efficiently search, summarize, and provide existing information in a structured form (at least at the current stage). Absolutely great for overview research. But what about digging deeper? Real scientific research goes beyond mere reproduction and takes a critical look at the sources. Science also thrives on generating new knowledge.
  • Reinforcement of existing biases in research & publication: Papers with significant results are more likely to be published, while "non-significant" or contradictory results often remain unpublished; this is known as publication bias. If the AI tool now primarily evaluates frequently cited papers, it reinforces this trend, and rare or less widespread but possibly important findings are lost. A possible solution would be a mechanism for weighted source evaluation that also takes less cited but relevant papers into account. Presumably, this effect also applies to us humans.
  • Quality of research papers: While it is obvious that a bachelor’s, master’s, or doctoral thesis cannot be based solely on AI-generated research, the question I have is how universities or scientific institutions deal with this development. Students can get a solid research report with just a single prompt. Presumably, the solution here must be to adapt assessment criteria to give greater weight to in-depth reflection and methodology.

Final Thoughts

In addition to OpenAI, other companies and platforms have integrated similar functions (some even before OpenAI): Perplexity AI, for example, has introduced a deep research function that independently conducts and analyzes searches. Google’s Gemini has also integrated such a feature.

The function gives you an incredibly quick overview of an initial research question. It remains to be seen how reliable the results are. Currently (as of early March 2025), OpenAI itself lists as limitations that the feature is still at an early stage, can sometimes hallucinate facts into answers or draw false conclusions, and has trouble distinguishing authoritative information from rumors. In addition, it is currently unable to accurately convey uncertainty.

But it can be assumed that this function will be expanded further and become a powerful tool for research. If you have simpler questions, it is better to use the standard GPT-4o model (with or without search), where you get an immediate answer.

Where can you continue learning?

Want more tips & tricks about tech, Python, data science, data engineering, machine learning and AI? Then regularly receive a summary of my most-read articles on my Substack — curated and for free.

Click here to subscribe to my Substack!

Why Data Scientists Should Care about Containers — and Stand Out with This Knowledge
https://towardsdatascience.com/why-data-scientists-should-care-about-containers-and-stand-out-with-this-knowledge/
Thu, 20 Feb 2025

“I train models, analyze data and create dashboards — why should I care about Containers?”

Many people who are new to the world of data science ask themselves this question. But imagine you have trained a model that runs perfectly on your laptop. However, error messages keep popping up in the cloud when others access it — for example because they are using different library versions.

This is where containers come into play: They allow us to make machine learning models, data pipelines and development environments stable, portable and scalable — regardless of where they are executed.

Let’s take a closer look.

Table of Contents
1 — Containers vs. Virtual Machines: Why containers are more flexible than VMs
2 — Containers & Data Science: Do I really need Containers? And 4 reasons why the answer is yes.
3 — First Practice, then Theory: Container creation even without much prior knowledge
4 — Your 101 Cheatsheet: The most important Docker commands & concepts at a glance
Final Thoughts: Key takeaways as a data scientist
Where Can You Continue Learning?

1 — Containers vs. Virtual Machines: Why containers are more flexible than VMs

Containers are lightweight, isolated environments. They contain applications with all their dependencies. They also share the kernel of the host operating system, making them fast, portable and resource-efficient.

I have written extensively about virtual machines (VMs) and virtualization in ‘Virtualization & Containers for Data Science Newbies’. But the most important thing is that VMs simulate complete computers and have their own operating system with their own kernel on a hypervisor. This means that they require more resources, but also offer greater isolation.

Both containers and VMs are virtualization technologies.

Both make it possible to run applications in an isolated environment.

But in the two descriptions, you can also see the 3 most important differences:

  • Architecture: While each VM has its own operating system (OS) and runs on a hypervisor, containers share the kernel of the host operating system. However, containers still run in isolation from each other. A hypervisor is the software or firmware layer that manages VMs and abstracts the operating system of the VMs from the physical hardware. This makes it possible to run multiple VMs on a single physical server.
  • Resource consumption: As each VM contains a complete OS, it requires a lot of memory and CPU. Containers, on the other hand, are more lightweight because they share the host OS.
  • Portability: You have to customize a VM for different environments because it requires its own operating system with specific drivers and configurations that depend on the underlying hardware. A container, on the other hand, can be created once and runs anywhere a container runtime is available (Linux, Windows, cloud, on-premise). Container runtime is the software that creates, starts and manages containers — the best-known example is Docker.
Created by the author

You can experiment faster with Docker — whether you’re testing a new ML model or setting up a data pipeline. You can package everything in a container and run it immediately. And you don’t have any “It works on my machine” problems: your container runs the same everywhere, so you can simply share it.

2 — Containers & Data Science: Do I really need Containers? And 4 reasons why the answer is yes.

As a data scientist, your main task is to analyze, process and model data to gain valuable insights and predictions, which in turn are important for management.

Of course, you don’t need to have the same in-depth knowledge of containers, Docker or Kubernetes as a DevOps Engineer or a Site Reliability Engineer (SRE). Nevertheless, it is worth having container knowledge at a basic level — because these are 4 examples of where you will come into contact with it sooner or later:

Model deployment

You are training a model. You not only want to use it locally but also make it available to others. To do this, you can pack it into a container and make it available via a REST API.

Let’s look at a concrete example: Your trained model runs in a Docker container with FastAPI or Flask. The server receives the requests, processes the data and returns ML predictions in real-time.
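
To make this a bit more tangible, here is a minimal sketch of such a prediction service (this is an illustration, not code from this article; the file name model.pkl, the feature format and the endpoint name are assumptions):

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.pkl")  # hypothetical file with a trained scikit-learn model

class Features(BaseModel):
    values: list[float]  # the feature vector the model expects

@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])
    return {"prediction": prediction.tolist()}

# Run locally with: uvicorn main:app --port 8000
# Then send a POST request with JSON like {"values": [1.0, 2.0, 3.0]} to /predict.

You could then bake this script and its dependencies into a Docker image and run it anywhere a container runtime is available.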

Reproducibility and easier collaboration

ML models and pipelines require specific libraries. For example, if you want to use a deep learning model like a Transformer, you need TensorFlow or PyTorch. If you want to train and evaluate classic machine learning models, you need Scikit-Learn, NumPy and Pandas. A Docker container now ensures that your code runs with exactly the same dependencies on every computer, server or in the cloud. You can also deploy a Jupyter Notebook environment as a container so that other people can access it and use exactly the same packages and settings.
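
As a small illustration of what “exactly the same dependencies” can look like in practice, here is a sketch of a Dockerfile with pinned versions (the version numbers and the train.py entry point are only examples):

# Sketch: pin the library versions so every machine builds an identical environment
FROM python:3.11-slim

RUN pip install --no-cache-dir numpy==1.26.4 pandas==2.2.2 scikit-learn==1.5.0

WORKDIR /app
COPY . .

# Example entry point -- replace with your own script
CMD ["python", "train.py"]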

Cloud integration

Containers include all packages, dependencies and configurations that an application requires. They therefore run uniformly on local computers, servers or cloud environments. This means you don’t have to reconfigure the environment.

For example, you write a data pipeline script. This works locally for you. As soon as you deploy it as a container, you can be sure that it will run in exactly the same way on AWS, Azure, GCP or the IBM Cloud.

Scaling with Kubernetes

Kubernetes helps you to orchestrate containers. But more on that below. If you now get a lot of requests for your ML model, you can scale it automatically with Kubernetes. This means that more instances of the container are started.
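
As a rough idea of what that can look like on the command line (a sketch; the deployment name ml-model is illustrative and assumes the container is already running as a deployment in a cluster):

# Scale the deployment manually to 5 running instances (replicas)
kubectl scale deployment ml-model --replicas=5

# Or let Kubernetes scale automatically between 2 and 10 instances based on CPU load
kubectl autoscale deployment ml-model --min=2 --max=10 --cpu-percent=80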

3 — First Practice, then Theory: Container creation even without much prior knowledge

Let’s take a look at an example that anyone can run through with minimal time — even if you haven’t heard much about Docker and containers. It took me 30 minutes.

We’ll set up a Jupyter Notebook inside a Docker container, creating a portable, reproducible Data Science environment. Once it’s up and running, we can easily share it with others and ensure that everyone works with the exact same setup.

0 — Install Docker Desktop and create a project directory

To be able to use containers, we need Docker Desktop. To do this, we download Docker Desktop from the official website.

Now we create a new folder for the project. You can do this directly in the desired location via your file explorer. I do it via the terminal — on Windows, press Windows + R and open CMD.

We use the following command:

Screenshot taken by the author
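
The screenshot shows the command for creating the project folder. Based on the directory name used in the following steps (jupyter-docker), it is presumably simply:

mkdir jupyter-docker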

1. Create a Dockerfile

Now we open VS Code or another editor and create a new file with the name ‘Dockerfile’. We save this file without an extension in the same directory. Why doesn’t it need an extension? Because Docker looks for a file named exactly ‘Dockerfile’ in the build directory by default.

We add the following code to this file:

# Use the official Jupyter notebook image with SciPy
FROM jupyter/scipy-notebook:latest  

# Set the working directory inside the container
WORKDIR /home/jovyan/work  

# Copy all local files into the container
COPY . .

# Start Jupyter Notebook without token
CMD ["start-notebook.sh", "--NotebookApp.token=''"]

We have thus defined a container environment for Jupyter Notebook that is based on the official Jupyter SciPy Notebook image.

First, we define with FROM on which base image the container is built. jupyter/scipy-notebook:latest is a preconfigured Jupyter notebook image and contains libraries such as NumPy, SciPy, Matplotlib and Pandas. Alternatively, we could also use a different image here.

With WORKDIR we set the working directory within the container. /home/jovyan/work is the default path used by Jupyter. User jovyan is the default user in Jupyter Docker images. Another directory could also be selected — but this directory is best practice for Jupyter containers.

With COPY . . we copy all files from the local directory — in this case the Dockerfile, which is located in the jupyter-docker directory — to the working directory /home/jovyan/work in the container.

With CMD ["start-notebook.sh", "--NotebookApp.token=''"] we specify the default start command for the container: it runs the Jupyter start script and starts the notebook without a token — this allows us to access it directly via the browser.

2. Create the Docker image

Next, we will build the Docker image. Make sure you have the previously installed Docker desktop open. We now go back to the terminal and use the following command:

cd jupyter-docker
docker build -t my-jupyter .

With cd jupyter-docker we navigate to the folder we created earlier. With docker build we create a Docker image from the Dockerfile. With -t my-jupyter we give the image a name. The dot at the end is the build context: it tells Docker to use the current directory (which contains the Dockerfile) as the basis for the build. Note the space between the image name and the dot.

The Docker image is the template for the container. This image contains everything needed for the application such as the operating system base (e.g. Ubuntu, Python, Jupyter), dependencies such as Pandas, Numpy, Jupyter Notebook, the application code and the startup commands. When we “build” a Docker image, this means that Docker reads the Dockerfile and executes the steps that we have defined there. The container can then be started from this template (Docker image).

We can now watch the Docker image being built in the terminal.

Screenshot taken by the author

We use docker images to check whether the image exists. If the output my-jupyter appears, the creation was successful.

docker images

If yes, we see the data for the created Docker image:

Screenshot taken by the author

3. Start Jupyter container

Next, we want to start the container and use this command to do so:

docker run -p 8888:8888 my-jupyter

We start a container with docker run. my-jupyter is the name of the image we want to run. And with -p 8888:8888 we map the local port 8888 to port 8888 inside the container — the port Jupyter runs on. This is what makes the notebook reachable from our browser.

Alternatively, you can also perform this step in Docker desktop:

Screenshot taken by the author

4. Open Jupyter Notebook & create a test notebook

Now we open the URL [http://localhost:8888](http://localhost:8888/) in the browser. You should now see the Jupyter Notebook interface.

Here we will now create a Python 3 notebook and insert the following Python code into it.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.plot(x, y)
plt.title("Sine Wave")
plt.show()

Running the code will display the sine curve:

Screenshot taken by the author

5. Terminate the container

At the end, we end the container either with ‘CTRL + C’ in the terminal or in Docker Desktop.

With docker ps we can check in the terminal whether containers are still running and with docker ps -a we can display the container that has just been terminated:

Screenshot taken by the author

6. Share your Docker image

If you now want to upload your Docker image to a registry, you can do this with the following commands. They will upload your image to Docker Hub (you need a Docker Hub account for this). You can also upload it to a private registry such as AWS Elastic Container Registry, Google Artifact Registry, Azure Container Registry or IBM Cloud Container Registry.

docker login

docker tag my-jupyter your-dockerhub-name/my-jupyter:latest

docker push your-dockerhub-name/my-jupyter:latest

If you then open Docker Hub and go to your repositories in your profile, the image should be visible.

This was a very simple example to get started with Docker. If you want to dive a little deeper, you can deploy a trained ML model with FastAPI via a container.

4 — Your 101 Cheatsheet: The most important Docker commands & concepts at a glance

You can actually think of a container like a shipping container. Regardless of whether you load it onto a ship (local computer), a truck (cloud server) or a train (data center) — the content always remains the same.

The most important Docker terms

  • Container: Lightweight, isolated environment for applications that contains all dependencies.
  • Docker: The most popular container platform that allows you to create and manage containers.
  • Docker Image: A read-only template that contains code, dependencies and system libraries.
  • Dockerfile: Text file with commands to create a Docker image.
  • Kubernetes: Orchestration tool to manage many containers automatically.

The basic concepts behind containers

  • Isolation: Each container contains its own processes, libraries and dependencies
  • Portability: Containers run wherever a container runtime is installed.
  • Reproducibility: You can create a container once and it runs exactly the same everywhere.

The most basic Docker commands

docker --version # Check if Docker is installed
docker ps # Show running containers
docker ps -a # Show all containers (including stopped ones)
docker images # List of all available images
docker info # Show system information about the Docker installation

docker run hello-world # Start a test container
docker run -d -p 8080:80 nginx # Start Nginx in the background (-d) with port forwarding
docker run -it ubuntu bash # Start interactive Ubuntu container with bash

docker pull ubuntu # Load an image from Docker Hub
docker build -t my-app . # Build an image from a Dockerfile

Final Thoughts: Key takeaways as a data scientist

👉 With Containers you can solve the “It works on my machine” problem. Containers ensure that ML models, data pipelines, and environments run identically everywhere, independent of OS or dependencies.

👉 Containers are more lightweight and flexible than virtual machines. While VMs come with their own operating system and consume more resources, containers share the host operating system and start faster.

👉 There are three key steps when working with containers: Create a Dockerfile to define the environment, use docker build to create an image, and run it with docker run — optionally pushing it to a registry with docker push.

And then there’s Kubernetes.

A term that comes up a lot in this context: An orchestration tool that automates container management, ensuring scalability, load balancing and fault recovery. This is particularly useful for microservices and cloud applications.

Before Docker, VMs were the go-to solution (see more in ‘Virtualization & Containers for Data Science Newbies’). VMs offer strong isolation, but require more resources and start slower.

So, Docker was developed in 2013 by Solomon Hykes to solve this problem. Instead of virtualizing entire operating systems, containers run independently of the environment — whether on your laptop, a server or in the cloud. They contain all the necessary dependencies so that they work consistently everywhere.

I simplify tech for curious minds🚀 If you enjoy my tech insights on Python, data science, Data Engineering, machine learning and AI, consider subscribing to my substack.

Where Can You Continue Learning?

Virtualization & Containers for Data Science Newbies
https://towardsdatascience.com/virtualization-containers-for-data-science-newbies/
Wed, 12 Feb 2025

Virtualization makes it possible to run multiple virtual machines (VMs) on a single piece of physical hardware. These VMs behave like independent computers, but share the same physical computing power. A computer within a computer, so to speak.

Many cloud services rely on virtualization. But other technologies, such as containerization and serverless computing, have become increasingly important.

Without virtualization, many of the digital services we use every day would not be possible. Of course, this is a simplification, as some cloud services also use bare-metal infrastructures.

In this article, you will learn how to set up your own virtual machine on your laptop in just a few minutes — even if you have never heard of Cloud Computing or containers before.

Table of Contents
1 — The Origins of Cloud Computing: From Mainframes to Serverless Architecture
2 — Understanding Virtualization: Why it’s the Basis of Cloud Computing
3 — What Data Scientists Should Know about Containers and VMs
4 — Create a Virtual Machine with VirtualBox
Final Thoughts
Where can you continue learning?

1 — The Origins of Cloud Computing: From Mainframes to Serverless Architecture

Cloud computing has fundamentally changed the IT landscape — but its roots go back much further than many people think. In fact, the history of the cloud began back in the 1950s with huge mainframes and so-called dumb terminals.

  • The era of mainframes in the 1950s: Companies used mainframes so that several users could access them simultaneously via dumb terminals. The central mainframes were designed for high-volume, business-critical data processing. Large companies still use them today, even if cloud services have reduced their relevance.
  • Time-sharing and virtualization: In the next decade (1960s), time-sharing made it possible for multiple users to access the same computing power simultaneously — an early model of today’s cloud. Around the same time, IBM pioneered virtualization, allowing multiple virtual machines to run on a single piece of hardware.
  • The birth of the internet and web-based applications in the 1990s: Six years before I was born, Tim Berners-Lee developed the World Wide Web, which revolutionized online communication and our entire working and living environment. Can you imagine our lives today without the internet? At the same time, PCs were becoming increasingly popular. In 1999, Salesforce revolutionized the software industry with Software as a Service (SaaS), allowing businesses to use CRM solutions over the internet without local installations.
  • The big breakthrough of cloud computing in the 2010s:
    The modern cloud era began in 2006 with Amazon Web Services (AWS): Companies were able to flexibly rent infrastructure with S3 (storage) and EC2 (virtual servers) instead of buying their own servers. Microsoft Azure and Google Cloud followed with PaaS and IaaS services.
  • The modern cloud-native era: This was followed by the next innovation with containerization. Docker made Containers popular in 2013, followed by Kubernetes in 2014 to simplify the orchestration of containers. Next came serverless computing with AWS Lambda and Google Cloud Functions, which enabled developers to write code that automatically responds to events. The infrastructure is fully managed by the cloud provider.

Cloud computing is more the result of decades of innovation than a single new technology. From time-sharing to virtualization to serverless architectures, the IT landscape has continuously evolved. Today, cloud computing is the foundation for streaming services like Netflix, AI applications like ChatGPT and global platforms like Salesforce.

2 — Understanding Virtualization: Why Virtualization is the Basis of Cloud Computing

Virtualization means abstracting physical hardware, such as servers, storage or networks, into multiple virtual instances.

Several independent systems can be operated on the same physical infrastructure. Instead of dedicating an entire server to a single application, virtualization enables multiple workloads to share resources efficiently. For example, Windows, Linux or another environment can be run simultaneously on a single laptop — each in an isolated virtual machine.

This saves costs and resources.

Even more important, however, is the scalability: Infrastructure can be flexibly adapted to changing requirements.

Before cloud computing became widely available, companies often had to maintain dedicated servers for different applications, leading to high infrastructure costs and limited scalability. If more performance was suddenly required, for example because webshop traffic increased, new hardware was needed. The company had to add more servers (horizontal scaling) or upgrade existing ones (vertical scaling).

This is different with virtualization: For example, I can simply upgrade my virtual Linux machine from 8 GB to 16 GB RAM or assign 4 cores instead of 2. Of course, only if the underlying infrastructure supports this. More on this later.
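
With VirtualBox, for example, this can be done via the GUI or with its command-line tool (a sketch; the VM must be powered off, and the VM name must match the one you chose):

# Assign 16 GB RAM and 4 CPU cores to an existing VM
VBoxManage modifyvm "Ubuntu VM 2025" --memory 16384 --cpus 4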

And this is exactly what cloud computing makes possible: The cloud consists of huge data centers that use virtualization to provide flexible computing power — exactly when it is needed. So, virtualization is a fundamental technology behind cloud computing.

How does serverless computing work?

What if you didn’t even have to manage virtual machines anymore?

Serverless computing goes one step further than virtualization and containerization. The cloud provider handles most infrastructure tasks — including scaling, maintenance and resource allocation. Developers can focus on writing and deploying code.

But does serverless really mean that there are no more servers?

Of course not. The servers are there, but they are invisible for the user. Developers no longer have to worry about them. Instead of manually provisioning a virtual machine or container, you simply deploy your code, and the cloud automatically executes it in a managed environment. Resources are only provided when the code is running. For example, you can use AWS Lambda, Google Cloud Functions or Azure Functions.

What are the advantages of serverless?

As a developer, you don’t have to worry about scaling or maintenance. This means that if there is a lot more traffic at a particular event, the resources are automatically adjusted. Serverless computing can be cost-efficient, especially in Function-as-a-Service (FaaS) models. If nothing is running, you pay nothing. However, some serverless services have baseline costs (e.g. Firestore).

Are there any disadvantages?

You have much less control over the infrastructure and no direct access to the servers. There is also a risk of vendor lock-in. The applications are strongly tied to a cloud provider.

A concrete example of serverless: API without your own server

Imagine you have a website with an API that provides users with the current weather. Normally, a server runs around the clock — even at times when no one is using the API.

With AWS Lambda, things work differently: A user enters ‘Mexico City’ on your website and clicks on ‘Get weather’. This request triggers a Lambda function in the background, which retrieves the weather data and sends it back. The function is then stopped automatically. This means you don’t have a permanently running server and no unnecessary costs — you only pay when the code is executed.
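
A minimal sketch of what such a Lambda handler could look like in Python (the weather API URL and response fields are placeholders, and the requests library would have to be packaged with the function, for example as a Lambda layer):

import json
import requests

def lambda_handler(event, context):
    # City passed as a query parameter via API Gateway, with a default value
    params = event.get("queryStringParameters") or {}
    city = params.get("city", "Mexico City")

    # Hypothetical weather endpoint -- replace with a real provider
    resp = requests.get("https://api.example.com/weather", params={"q": city}, timeout=5)
    data = resp.json()

    return {
        "statusCode": 200,
        "body": json.dumps({"city": city, "temperature_c": data.get("temp_c")}),
    }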

3 — What Data Scientists Should Know about Containers and VMs — What’s the Difference?

You’ve probably heard of containers. But what is the difference to virtual machines — and what is particularly relevant as a data scientist?

Both containers and virtual machines are virtualization technologies.

Both make it possible to run applications in isolation.

Both offer advantages depending on the use case: While VMs provide strong security, containers excel in speed and efficiency.

The main difference lies in the architecture:

  • Virtual machines virtualize the entire hardware — including the operating system. Each VM has its own operating system (OS). This in turn requires more memory and resources.
  • Containers, on the other hand, share the host operating system and only virtualize the application layer. This makes them significantly lighter and faster.

Put simply, virtual machines simulate entire computers, while containers only encapsulate applications.

Why is this important for data scientists?

As a data scientist, you will come into contact with machine learning, data engineering or data pipelines, so it is important to understand the basics of containers and virtual machines. Sure, you don’t need the same in-depth knowledge as a DevOps Engineer or a Site Reliability Engineer (SRE).

Virtual machines are used in data science, for example, when a complete operating system environment is required — such as a Windows VM on a Linux host. Data science projects often need specific environments. With a VM, it is possible to provide exactly the same environment — regardless of which host system is available.

A VM is also needed when training deep learning models with GPUs in the cloud. With cloud VMs such as AWS EC2 or Azure Virtual Machines, you have the option of training the models with GPUs. VMs also completely separate different workloads from each other to ensure performance and security.

Containers are used in data science for data pipelines, for example, where tools such as Apache Airflow run individual processing steps in Docker containers. This means that each step can be executed in isolation and independently of each other — regardless of whether it involves loading, transforming or saving data. Even if you want to deploy machine learning models via Flask / FastAPI, a container ensures that everything your model needs (e.g. Python libraries, framework versions) runs exactly as it should. This makes it super easy to deploy the model on a server or in the cloud.

4 — Create a Virtual Machine with VirtualBox

Let’s make this a little more concrete and create an Ubuntu VM. 🚀

I use the VirtualBox software on my Windows Lenovo laptop. The virtual machine runs in isolation from your main operating system so that no changes are made to your actual system. If you have Windows Pro Edition, you can also enable Hyper-V (pre-installed by default, but disabled). With an Intel Mac, you should also be able to use VirtualBox. With an Apple Silicon Mac, Parallels Desktop or UTM is apparently the better alternative (I have not tested this myself).

1) Install VirtualBox

The first step is to download the installation file from the official VirtualBox website and install VirtualBox. VirtualBox is installed including all necessary drivers.

You can ignore the note about missing dependencies Python Core / win32api as long as you do not want to automate VirtualBox with Python scripts.

Then we start the Oracle VirtualBox Manager:

Screenshot taken by the author

2) Download the Ubuntu ISO file

Next, we download the Ubuntu ISO file from the Ubuntu website. An Ubuntu ISO file is a compressed image file of the Ubuntu operating system. This means that it contains a complete copy of the installation data. I download the LTS version because this version receives security and maintenance updates for 5 years (Long Term Support). Note the location of the .iso file as we will use it later in VirtualBox.

Screenshot taken by the author

3) Create a virtual machine in VirtualBox

Next, we create a new virtual machine in the VirtualBox Manager and give it the name Ubuntu VM 2025. Here we select Linux as the type and Ubuntu (64-bit) as the version. We also select the previously downloaded ISO file from Ubuntu as the ISO image. It would also be possible to add the ISO file later in the mass storage menu.

Screenshot taken by the author

Next, we select a user name vboxuser2025 and a password for access to the Ubuntu system. The hostname is the name of the virtual machine within the network or system. It must not contain any spaces. The domain name is optional and would be used if the network has multiple devices.

We then assign the appropriate resources to the virtual machine. I choose 8 GB (8192 MB) RAM, as my host system has 64 GB RAM. I recommend 4GB (4096) as a minimum. I assign 2 processors, as my host system has 8 cores and 16 logical processors. It would also be possible to assign 4 cores, but this way I have enough resources for my host system. You can find out how many cores your host system has by opening the Task Manager in Windows and looking at the number of cores under the Performance tab under CPU.

Screenshot taken by the author

Next, we click on ‘Create a virtual hard disk now’ to create a virtual hard disk. A VM requires its own virtual hard disk to install the OS (e.g. Ubuntu, Windows). All programs, files and configurations of the VM are stored on it — just like on a physical hard disk. The default value is 25 GB. If you want to use a VM for machine learning or data science, more storage space (e.g. 50–100 GB) would be useful to have room for large data sets and models. I keep the default setting.

We can then see that the virtual machine has been created and can be used:

Screenshot taken by the author

4) Use Ubuntu VM

We can now use the newly created virtual machine like a normal separate operating system. The VM is completely isolated from the host system. This means you can experiment in it without changing or jeopardizing your main system.

If you are new to Linux, you can try out basic commands like ls, cd, mkdir or sudo to get to know the terminal. As a data scientist, you can set up your own development environments, install Python with Pandas and Scikit-learn to develop data analysis and machine learning models. Or you can install PostgreSQL and run SQL queries without having to set up a local database on your main system. You can also use Docker to create containerized applications.
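
A few commands to try out in the VM’s terminal as a starting point (standard Ubuntu package names):

ls                              # list files in the current directory
mkdir projects && cd projects   # create and enter a project folder
sudo apt update                 # refresh the package index
sudo apt install -y python3-pip # install pip for Python 3
pip3 install pandas scikit-learn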

Final Thoughts

Since the VM is isolated, we can install programs, experiment and even destroy the system without affecting the host system.

Let’s see if virtual machines remain relevant in the coming years. As companies increasingly use microservice architectures (instead of monoliths), containers with Docker and Kubernetes will certainly become even more important. But knowing how to set up a virtual machine and what it is used for is certainly useful.

I simplify tech for curious minds. If you enjoy my tech insights on Python, data science, data engineering, machine learning and AI, consider subscribing to my substack.

Where Can You Continue Learning?

The Concepts Data Professionals Should Know in 2025: Part 2
https://towardsdatascience.com/the-concepts-data-professionals-should-know-in-2025-part-2-c0e308946463/
Mon, 20 Jan 2025

From AI Agent to Human-In-The-Loop — Master 12 data concepts and turn them into simple projects to stay ahead in IT.

Innovation in the field of data is progressing rapidly.

Let’s take a quick look at the timeline of GenAI: ChatGPT, launched in November 2022, became the world’s best-known application of generative AI in early 2023. By spring 2025, leading companies like Salesforce (Marketing Cloud Growth) and Adobe (Firefly) integrated it into mainstream applications – making it accessible to companies of various sizes. Tools like MidJourney advanced image generation, while at the same time, discussions about agentic AI took center stage. Today, tools like ChatGPT have already become common for many private users.

That’s why I have compiled 12 terms that you will certainly encounter as a data engineer, data scientist and data analyst in 2025 and are important to understand. Why are they relevant? What are the challenges? And how can you apply them to a small project?

Table of Contents
Terms 1–6 in Part 1: Data Warehouse, Data Lake, Data Lakehouse; Cloud Platforms; Optimizing Data Storage; Big Data Technologies; ETL, ELT and Zero-ETL; Event-Driven Architecture
7 – Data Lineage & XAI
8 – Gen AI
9 – Agentic AI
10 – Inference Time Compute
11 – Near Infinite Memory
12 – Human-In-The-Loop-Augmentation
Final Thoughts

In the first part, we looked at terms for the basics of understanding modern data systems (storage, management & processing of data). In part 2, we now move beyond infrastructure and dive into some terms related to Artificial Intelligence that use this data to drive innovation.

7 – Explainability of predictions and traceability of data: XAI & Data Lineage

As data and AI tools become increasingly important in our everyday lives, we also need to know how to track them and create transparency for decision-making processes and predictions:

Let’s imagine a scenario in a hospital: A deep learning model is used to predict the chances of success of an operation. A patient is categorized as ‘unsuitable’ for the operation. The problem for the medical team? There is no explanation as to how the model arrived at this decision. The internal processes and calculations that led to the prediction remain hidden. It is also not clear which attributes – such as age, state of health or other parameters – were decisive for this assessment. Should the medical team nevertheless believe the prediction and not proceed with the operation? Or should they proceed as they see fit?

This lack of transparency can lead to uncertainty or even mistrust in AI-supported decisions. Why does this happen? Many deep learning models provide us with results and excellent predictions – much better than simple models can do. However, the models are ‘black boxes’ – we don’t know exactly how the models arrived at the results and what features they used to do so. While this lack of transparency hardly plays a role in everyday applications, such as distinguishing between cat and dog photos, the situation is different in critical areas: For example, in healthcare, financial decisions, criminology or recruitment processes, we need to be able to understand how and why a model arrives at certain results.

This is where Explainable AI (XAI) comes into play: techniques and methods that attempt to make the decision-making process of AI models understandable and comprehensible. Examples of this are SHAP (SHapley Additive ExPlanations) or LIME (Local Interpretable Model-agnostic Explanations). These tools can at least show us which features contributed most to a decision.

Data Lineage, on the other hand, helps us understand where data comes from, how it has been processed and how it is ultimately used. In a BI tool, for example, a report with incorrect figures could be used to check whether the problem occurred with the data source, the transformation or when loading the data.

Why are the terms important?

XAI: The more AI models we use in everyday life and as decision-making aids, the more we need to know how these models have achieved their results. Especially in areas such as finance and healthcare, but also in processes such as HR and social services.

Data Lineage: In the EU there is GDPR, in California CCPA. These require companies to document the origin and use of data in a comprehensible manner. What does that mean in concrete terms? If companies have to comply with data protection laws, they must always know where the data comes from and how it was processed.

What are the challenges?

  1. Complexity of the data landscape (data lineage): In distributed systems and multi-cloud environments, it is difficult to fully track the data flow.
  2. Performance vs. transparency (XAI): Deep learning models often deliver more precise results, but their decision paths are difficult to trace. Simpler models, on the other hand, are usually easier to interpret but less accurate.

Small project idea to better understand the terms:

Use SHAP (SHapley Additive ExPlanations) to explain the decision logic of a machine learning model: Create a simple ML model with scikit-learn to predict house prices, for example. Then install the SHAP library in Python and visualize how the different features influence the price prediction.
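
One possible way to implement this project, assuming the shap and scikit-learn packages are installed (the dataset and model are just examples):

import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Train a simple model on a house price dataset
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

# Explain the predictions: which features drive the predicted price up or down?
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test.iloc[:200])
shap.summary_plot(shap_values, X_test.iloc[:200])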

8 – Generative AI (Gen AI)

Since ChatGPT took off in January 2023, the term Gen AI has also been on everyone’s lips. Generative AI refers to AI models that can generate new content from an input. Outputs can be texts, images, music or videos. For example, there are now even fashion stores that have created their advertising images using generative AI (e.g. Calvin Klein, Zalando).

"We started OpenAI almost nine years ago because we believed that AGI was possible, and that it could be the most impactful technology in human history. We wanted to figure out how to build it and make it broadly beneficial; […]"

Reference: Sam Altman, CEO of OpenAI

Why is the term important?

Clearly, GenAI can greatly increase efficiency. The time required for tasks such as content creation, design or texts is reduced for companies. GenAI is also changing many areas of our working world. Tasks are being performed differently, jobs are changing and data is becoming even more important.

In Salesforce’s latest marketing automation tool, for example, users can enter a prompt in natural language, which generates an email layout – even if this does not always work reliably in reality.

What are the challenges?

  1. Copyrights and ethics: The models are trained with huge amounts of data that originate from us humans and try to generate the most realistic results possible based on this (e.g. also with texts by authors or images by well-known painters). One problem is that GenAI can imitate existing works. Who owns the result? A simple way to minimize this problem at least somewhat is to clearly label AI-generated content as such.
  2. Costs and energy: Large models require a very large amount of computing resources.
  3. Bias and misinformation: The models are trained with specific data. If the data already contains a bias (e.g. less data from one gender, less data from one country), these models can reproduce biases. For example, if an HR tool has been trained with more male than female data, it could favor male applicants in a job application. And of course, sometimes the models simply provide incorrect information.

Small project idea to better understand the terms:

Create a simple chatbot that accesses the GPT-4 API and can answer a question. I have attached a step-by-step guide at the bottom of the page.
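
A minimal sketch of such a chatbot, assuming the openai Python package (version 1 or later) is installed and the API key is stored in the OPENAI_API_KEY environment variable (the model name is illustrative):

from openai import OpenAI

client = OpenAI()  # reads the API key from OPENAI_API_KEY

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask("Explain publication bias in two sentences."))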

9 – Agentic AI / AI Agents

Agentic AI is currently a hotly debated topic and is based on generative AI. AI agents describe intelligent systems that can think, plan and act "autonomously":

"This is what AI was meant to be. […] And I am really excited about this. I think this is going to change companies forever. I think it’s going to change software forever. And I think it’ll change Salesforce forever."

Reference: Marc Benioff, Salesforce CEO, about Agents & Agentforce

AI Agents are, so to speak, a continuation of traditional chatbots and bots. These systems promise to solve complex problems by creating multi-level plans, learning from data and making decisions based on this and executing them autonomously.

Multi-step plans mean that the AI thinks several steps ahead to achieve a goal.

Let’s imagine a quick example: An AI agent has the task of delivering a parcel. Instead of simply following the sequence of orders, the AI could first analyze the traffic situation, calculate the fastest route and then deliver the various parcels in this calculated sequence.

Why is the term important?

The ability to execute multi-step plans sets AI Agents apart from previous bots and chatbots and brings a new era of autonomous systems.

If AI Agents can actually be used in businesses, companies can automate repetitive tasks through agents, reducing costs and increasing efficiency. The economic benefits and the competitive advantage are obvious. As the Salesforce CEO says in the interview, it can change our corporate world tremendously.

What are the challenges?

  1. Logical consistency and (current) technological limitations: Current models struggle with consistent logical thinking – especially when it comes to handling complex scenarios with multiple variables. And that’s exactly what they’re there for – or that’s how they’re advertised. This means that in 2025 there will definitely be an increased need for better models.
  2. Ethics and acceptance: Autonomous systems can make decisions and solve their own tasks independently. How can we ensure that autonomous systems do not make decisions that violate ethical standards? As a society, we also need to define how quickly we want to integrate such changes into our everyday (working) lives without taking employees by surprise. Not everyone has the same technical know-how.

Small project idea to better understand the term:

Create a simple AI agent with Python: First, define the agent’s task. For example, the agent should retrieve data from an API. Use Python to coordinate the API query, the filtering of results and the automatic emailing to the user. Then implement a simple decision logic: for example, if no result matches the filter criteria, the search radius is extended.
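
A minimal sketch of this agent logic (the API URL and its parameters are placeholders, and the e-mail step is omitted; in a real project you would plug in a concrete API and, for example, smtplib for mailing):

import requests

def search(radius_km: int) -> list:
    # Hypothetical listings API -- replace with a real endpoint
    response = requests.get(
        "https://api.example.com/listings",
        params={"radius_km": radius_km, "max_price": 300000},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

def run_agent() -> None:
    radius_km = 10
    results = []
    # Simple decision logic: if nothing matches, extend the search radius
    while not results and radius_km <= 50:
        results = search(radius_km)
        if not results:
            radius_km += 10
    print(f"Found {len(results)} results within {radius_km} km")
    # Next step (omitted here): e-mail the filtered results to the user

if __name__ == "__main__":
    run_agent()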

10 – Inference Time Compute

Next, we focus on the efficiency and performance of using AI models: An AI model receives input data, makes a prediction or decision based on it and gives an output. This process requires computing time, which is referred to as inference time compute. Modern models such as AI agents go one step further by flexibly adapting their computing time to the complexity of the task.

Basically, it’s the same as with us humans: When we have to solve more complex problems, we invest more time. AI models use dynamic reasoning (adapting computing time according to task requirements) and chain reasoning (using multiple decision steps to solve complex problems).

Why is the term important?

AI and models are becoming increasingly important in our everyday lives. The demand for dynamic AI systems (AI that adapts flexibly to a request and understands what we want) will increase. Inference time affects the performance of systems such as chatbots, autonomous vehicles and real-time translators. AI models that adapt their inference time to the complexity of the task and therefore "think" for different lengths of time will improve efficiency and accuracy.

What are the challenges?

  1. Performance vs. quality: Do you want a fast but less accurate or a slow but very accurate solution? Shorter inference times improve efficiency, but can compromise accuracy for complex tasks.
  2. Energy consumption: The longer the inference time, the more computing power is required. This in turn increases energy consumption.

11 – Near Infinite Memory

Near Infinite Memory is a concept that describes how technologies can store and process enormous amounts of data almost indefinitely.

For us users, it seems like infinite storage – but it is actually more of a combination of scalable cloud services, data-optimized storage solutions and intelligent data management systems.

Why is this term important?

The data we generate is growing exponentially due to the increasing use of IoT, AI and Big Data. As already described in terms 1–3, this creates ever greater demands on data architectures such as data lakehouses. AI models also require enormous amounts of data for training and validation. It is therefore important that storage solutions become more efficient.

What are the challenges?

  1. Energy consumption: Large storage solutions in cloud data centers consume immense amounts of energy.
  2. Security concerns and dependence on centralized services: Many near-infinite memory solutions are provided by cloud providers. This can create a dependency that brings financial and data protection risks.

Small project idea to better understand the terms:

Develop a practical understanding of how different data types affect storage requirements and learn how to use storage space efficiently. Take a look at the project under the term "Optimizing Data Storage".

12 – Human-In-The-Loop Augmentation

AI is becoming increasingly important, as the previous terms have shown. However, with the increasing importance of AI, we should ensure that the human part is not lost in the process.

"We need to let people who are harmed by technology imagine the future that they want."

Reference: Timnit Gebru, former Head of the Department of Ethics in AI at Google

Human-in-the-loop augmentation is the interface between computer science and psychology, so to speak. It describes the collaboration between us humans and artificial intelligence. The aim is to combine the strengths of both sides:

  • A great strength of AI is that such models can efficiently process data in large quantities and discover patterns in it that are difficult for us to recognize.
  • We humans, on the other hand, bring judgment, ethics, creativity and contextual understanding to the table, and we can cope with unforeseen situations without being "pre-trained" for them.

The goal must be for AI to serve us humans – and not the other way around.

Why is the term important?

AI can improve decision-making processes and minimize errors. In particular, AI can recognize patterns in data that are not visible to us, for example in the field of medicine or biology.

The MIT Center for Collective Intelligence published a study in Nature Human Behavior in which they analyzed how well human-AI combinations perform compared to purely human or purely AI-controlled systems:

  • In decision-making tasks, human-AI combinations often performed worse than AI systems alone (e.g. medical diagnoses / classification of deepfakes).
  • In creative tasks, the interaction already works better. Here, human-AI teams outperformed both humans and AI alone.

However, the study shows that human-in-the-loop augmentation does not yet work perfectly.

Reference: Humans and AI: Do they work better together or alone?

What are the challenges?

  1. Lack of synergy and mistrust: It seems that there is a lack of intuitive interfaces that make it easier for us humans to interact effectively enough with AI tools. Another challenge is that AI systems are sometimes viewed critically or even rejected.
  2. (Current) technological limitations of AI: Current AI systems struggle with logical consistency and contextual understanding. This can lead to erroneous or inaccurate results. For example, an AI diagnostic system could misjudge a rare case because it does not have enough data for such cases.

Final Thoughts

The terms in this article show only a selection of the innovations we are currently seeing – the list could definitely be extended. For example, in the area of AI models, model size will also play an important role: In addition to very large models (with up to 50 trillion parameters), very small models with only a few billion parameters will probably also be developed. The advantage of these small models is that they do not require huge data centers and GPUs, but can run on our laptops or even on our smartphones and perform very specific tasks.

Which terms do you think are super important? Let us know in the comments.

Where can you continue learning?

Own visualization – Illustrations from unDraw.co

All information in this article is based on the current status in January 2025.

The post The Concepts Data Professionals Should Know in 2025: Part 2 appeared first on Towards Data Science.

]]>
The Concepts Data Professionals Should Know in 2025: Part 1 https://towardsdatascience.com/the-concepts-data-professionals-should-know-in-2025-part-1-47e7e797801d/ Sun, 19 Jan 2025 19:02:04 +0000 https://towardsdatascience.com/the-concepts-data-professionals-should-know-in-2025-part-1-47e7e797801d/ From Data Lakehouses to Event-Driven Architecture - Master 12 data concepts and turn them into simple projects to stay ahead in IT.

The post The Concepts Data Professionals Should Know in 2025: Part 1 appeared first on Towards Data Science.

]]>
From Data Lakehouses to Event-Driven Architecture — Master 12 data concepts and turn them into simple projects to stay ahead in IT.

When I scroll through YouTube or LinkedIn and see topics like RAG, Agents or Quantum Computing, I sometimes get a queasy feeling about keeping up with these innovations as a data professional.

But when I then reflect on the topics my customers face daily as a Salesforce Consultant or as a Data Scientist at university, the challenges often seem more tangible: examples are faster data access, better data quality or boosting employees’ tech skills. The key issues are often less futuristic and can usually be simplified. That’s the focus of this and the next article:

I have compiled 12 terms that you will certainly encounter as a data engineer, data scientist and data analyst in 2025. Why are they relevant? What are the challenges? And how can you apply them to a small project?

So – Let’s dive in.

Table of Content
1 – Data Warehouse, Data Lake, Data Lakehouse
2 – Cloud platforms such as AWS, Azure & Google Cloud Platform
3 – Optimizing data storage
4 – Big data technologies such as Apache Spark, Kafka
5 – How data integration becomes real-time capable: ETL, ELT and Zero-ETL
6 – Event-Driven Architecture (EDA)
Terms 7–12 in part 2: Data Lineage & XAI, Gen AI, Agentic AI, Inference Time Compute, Near Infinite Memory, Human-In-The-Loop-Augmentation
Final Thoughts

1 – Data Warehouse, Data Lake, Data Lakehouse

We start with the foundation for data architecture and storage to understand modern data management systems.

Data warehouses became really well known in the 1990s thanks to Business Intelligence tools from Oracle and SAP, for example. Companies began to store structured data from various sources in a central database. An example is weekly processed sales data in a business intelligence tool.

The next innovation was data lakes, which arose from the need to be able to store unstructured or semi-structured data flexibly. A data lake is a large, open space for raw data. It stores both structured and unstructured data, such as sales data alongside social media posts and images.

The next step in innovation combined data lake architecture with warehouse architecture: Data lakehouses were created.

The term was popularized by companies such as Databricks with the introduction of Delta Lake technology. This concept combines the strengths of both previous data platforms. It allows us to store unstructured data as well as quickly query structured data in a single system. The need for this data architecture has arisen primarily because warehouses are often too restrictive, while lakes are difficult to search.

Why are the terms important?

We are living in the era of Big Data – companies and private individuals are generating more and more data (structured as well as semi-structured and unstructured data).

A short personal anecdote: The year I turned 15, Facebook cracked the 500 million active user mark for the first time. Instagram was founded in the same year. In addition, the release of the iPhone 4 significantly accelerated the global spread of smartphones and shaped the mobile era. In the same year, Microsoft further developed and promoted Azure (released in 2008) to compete with Google Cloud and AWS. From today’s perspective, I can see how all these events made 2010 a decisive year in which digitalisation and the transition to cloud technologies gained momentum.

In 2010, around 2 zettabytes (ZB) of data were generated; in 2020 it was around 64 ZB; in 2024 we reached around 149 ZB.

Reference: Statista

Due to the explosive data growth in recent years, we need to store the data somewhere – efficiently. This is where these three terms come into play. Hybrid architectures such as data lakehouses solve many of the challenges of big data. The demand for (near) real-time data analysis is also rising (see term 5 on zero ETL). And to remain competitive, companies are under pressure to use data faster and more efficiently. Data lakehouses are becoming more important as they offer the flexibility of a data lake and the efficiency of a data warehouse – without having to operate two separate systems.

What are the challenges?

  1. Data integration: As there are many different data sources (structured, semi-structured, unstructured), complex ETL / ELT processes are required.
  2. Scaling & costs: While data warehouses are expensive, data lakes can easily lead to data chaos (if no good data governance is in place) and lakehouses require technical know-how & investment.
  3. Access to the data: Permissions need to be clearly defined if the data is stored in a centralized storage.

Small project idea to better understand the terms:

Create a mini data lake with AWS S3: Upload JSON or CSV data to an S3 bucket, then process the data with Python and perform data analysis with Pandas, for example.
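A minimal sketch of what this could look like with boto3 and Pandas (the bucket name, file name and column names are assumptions – adjust them to your own setup, and make sure your AWS credentials are configured locally, e.g. via 'aws configure'):

import boto3
import pandas as pd

# Assumption: the bucket below already exists in your AWS account
BUCKET = "my-mini-data-lake"   # hypothetical bucket name
KEY = "raw/sales_data.csv"     # path ("prefix") inside the bucket

# 1) Upload a local CSV file into the S3 bucket - our "raw zone"
s3 = boto3.client("s3")
s3.upload_file("sales_data.csv", BUCKET, KEY)

# 2) Read the file back from S3 into a Pandas DataFrame
obj = s3.get_object(Bucket=BUCKET, Key=KEY)
df = pd.read_csv(obj["Body"])

# 3) A first simple analysis: revenue per category (column names are made up)
print(df.groupby("category")["revenue"].sum().sort_values(ascending=False))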

2 – Cloud Platforms such as AWS, Azure & Google Cloud Platform

Now we move on to the platforms on which the concepts from 1 are often implemented.

Of course, everyone knows the term cloud platforms such as AWS, Azure or Google Cloud. These services provide us with a scalable infrastructure for storing large volumes of data. We can also use them to process data in real-time and to use Business Intelligence and Machine Learning tools efficiently.

But why are the terms important?

I work in a web design agency where another department hosts our clients’ websites. Before the easy availability of cloud platforms, this meant running our own servers in the basement – with all the challenges such as cooling, maintenance and limited scalability.

Today, most of our data architectures and AI applications run in the cloud. Cloud platforms have changed the way we store, process and analyse data over the last decades. Platforms such as AWS, Azure or Google Cloud offer us a completely new level of flexibility and scalability for model training, real-time analyses and generative AI.

What are the challenges?

  1. A quick personal example of how complex things get: While preparing for my Salesforce Data Cloud Certification (a data lakehouse), I found myself diving into a sea of new terms – all specific to the Salesforce world. Each cloud platform has its own terminology and tools, which makes it time-consuming for employees in companies to familiarize themselves with them.
  2. Data security: Sensitive data can often be stored in the cloud. Access control must be clearly defined – user management is required.

Small project idea to better understand the terms:

Create a simple data pipeline: Register with AWS, Azure or GCP with a free account and upload a CSV file (e.g. to an AWS S3 bucket). Then load the data into a relational database and use an SQL tool to perform queries.
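A minimal sketch of such a pipeline, using SQLite as the relational database to keep the entry barrier low (file and column names are made up – on AWS, Azure or GCP you would point the connection at a managed database such as RDS, Azure SQL or Cloud SQL instead):

import sqlite3
import pandas as pd

# Assumption: 'customers.csv' is the file you previously uploaded to your cloud storage
df = pd.read_csv("customers.csv")

# Load the data into a relational database
conn = sqlite3.connect("pipeline_demo.db")
df.to_sql("customers", conn, if_exists="replace", index=False)

# Run a first SQL query against the new table
query = """
SELECT country, COUNT(*) AS num_customers
FROM customers
GROUP BY country
ORDER BY num_customers DESC
"""
print(pd.read_sql_query(query, conn))
conn.close()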

3 – Optimizing Data Storage

More and more data = more and more storage space required = more and more costs.

With the use of large amounts of data and the platforms and concepts from 1 and 2, there is also the issue of efficiency and cost management. To save on storage, reduce costs and speed up access, we need better ways to store, organize and access data more efficiently.

Strategies include data compression (e.g. Gzip) by removing redundant or unneeded data, data partitioning by splitting large data sets, indexing to speed up queries and the choice of storage format (e.g. CSV, Parquet, Avro).

Why is the term important?

Not only is my Google Drive and OneDrive storage nearly maxed out…

… in 2028, a total data volume of 394 zettabytes is expected.

It will therefore be necessary for us to be able to cope with growing data volumes and rising costs. In addition, large data centers consume immense amounts of energy, which in turn is critical in terms of the energy and climate crisis.

What are the challenges?

  1. Different formats are optimized for different use cases. Parquet, for example, is particularly suitable for analytical queries and large data sets, as it is organized on a column basis and read access is efficient. Avro, on the other hand, is ideal for streaming data because it can quickly convert data into a format that is sent over the network (serialization) and just as quickly convert it back to its original form when it is received (deserialization). Choosing the wrong format can hurt performance by either wasting disk space or increasing query times.
  2. Cost / benefit trade-off: Compression and partitioning save storage space but can slow down computing performance and data access.
  3. Dependency on cloud providers: As a lot of data is stored in the cloud today, optimization strategies are often tied to specific platforms.

Small project idea to better understand the terms:

Compare different storage optimization strategies: Generate a 1 GB dataset with random numbers. Save the data set in three different formats such as CSV, Parquet & Avro (using the corresponding Python libraries). Then compress the files with Gzip or Snappy. Now load the data into a Pandas DataFrame using Python and compare the query speed.
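A minimal sketch of the comparison (scaled down so it runs quickly – increase n_rows to approach 1 GB; Parquet requires 'pyarrow' or 'fastparquet', and Avro is left out here because it needs an extra library such as 'fastavro'):

import time
import numpy as np
import pandas as pd

# Generate a random dataset
n_rows = 1_000_000
df = pd.DataFrame({
    "id": np.arange(n_rows),
    "value": np.random.rand(n_rows),
    "category": np.random.choice(["A", "B", "C"], size=n_rows),
})

# Save the dataset in different formats / compressions
df.to_csv("data.csv", index=False)
df.to_csv("data.csv.gz", index=False, compression="gzip")  # Gzip-compressed CSV
df.to_parquet("data.parquet", compression="snappy")        # columnar format + Snappy

# Compare how fast each format can be read back into a DataFrame
for path, reader in [("data.csv", pd.read_csv),
                     ("data.csv.gz", pd.read_csv),
                     ("data.parquet", pd.read_parquet)]:
    start = time.perf_counter()
    reader(path)
    print(f"{path}: read in {time.perf_counter() - start:.2f} s")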

4 – Big Data Technologies such as Apache Spark & Kafka

Once the data has been stored using the storage concepts described in sections 1–3, we need technologies to process it efficiently.

We can use tools such as Apache Spark or Kafka to process and analyze huge amounts of data. They allow us to do this in real-time or in batch mode.

Spark is a framework that processes large amounts of data in a distributed manner and is used for tasks such as machine learning, Data Engineering and ETL processes.

Kafka is a tool that transfers data streams in real-time so that various applications can access and use them immediately. One example is the processing of real-time data streams in financial transactions or logistics.

Why is the term important?

In addition to the exponential growth in data, AI and machine learning are becoming increasingly important. Companies want to be able to process data in (almost) real-time: These Big Data technologies are the basis for real-time and batch processing of large amounts of data and are required for AI and streaming applications.

What are the challenges?

  1. Complexity of implementation: Setting up, maintaining and optimizing tools such as Apache Spark and Kafka requires in-depth technical expertise. In many companies, this is not readily available and must be built up or brought in externally. Distributed systems in particular can be complex to coordinate. In addition, processing large volumes of data can lead to high costs if the computing capacities in the cloud need to be scaled.
  2. Data quality: If I had to name one of my customers’ biggest problems, it would probably be data quality. Anyone who works with data knows that data quality leaves room for improvement in many companies… When data streams are processed in real-time, this becomes even more important. Why? In real-time systems, data is processed without delay and the results are sometimes used directly for decisions or trigger immediate reactions. Incorrect or inaccurate data can lead to wrong decisions.

Small project idea to better understand the terms:

Develop a small pipeline with Python that simulates, processes and saves real-time data: For example, simulate real-time data streams of temperature values. Then check whether the temperature exceeds a critical threshold value. As an extension, you can plot the temperature data in real-time.
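A minimal sketch of the simulation (the threshold and value range are made up):

import random
import time

THRESHOLD = 30.0  # critical temperature in °C

def temperature_stream(n_readings=10):
    """Simulate a sensor that emits one temperature reading per second."""
    for _ in range(n_readings):
        yield round(random.uniform(20.0, 35.0), 1)
        time.sleep(1)

readings = []
for temp in temperature_stream():
    readings.append(temp)  # "save" the processed value
    if temp > THRESHOLD:
        print(f"ALERT: {temp} °C exceeds the threshold of {THRESHOLD} °C")
    else:
        print(f"OK: {temp} °C")

# Extension: plot the collected readings in real-time with matplotlib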

5 – How Data Integration Becomes Real-Time Capable: ETL, ELT and Zero-ETL

ETL, ELT and Zero-ETL describe different approaches to integrating and transforming data.

While ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) are familiar to most, Zero-ETL is a data integration concept introduced by AWS in 2022. It eliminates the need for separate extraction, transformation, and loading steps. Instead, data is analyzed directly in its original format – almost in real-time. The technology promises to reduce latency and simplify processes within a single platform.

Let’s take a look at an example: A company using Snowflake as a data warehouse can create a table that references the data in the Salesforce Data Cloud. This means that the organization can query the data directly in Snowflake, even if it remains in the Data Cloud.

Why are the terms important?

We live in an age of instant availability – thanks to the success of platforms such as WhatsApp, Netflix and Spotify.

Cloud providers such as Amazon Web Services, Google Cloud and Microsoft Azure have taken exactly this to heart: data should be processed and analyzed almost in real-time and without major delays.

What are the challenges?

The challenges here are similar to those with big data technologies: Data quality must be adequate, as incorrect data can lead directly to incorrect decisions during real-time processing. In addition, the integration can be complex, although less so than with tools such as Apache Spark or Kafka.

Let me share a quick example to illustrate this: We implemented Data Cloud for a customer – the first-ever implementation in Switzerland since Salesforce started offering the data lakehouse solution. The entire knowledge base had to be built up on the customer’s side. What did that mean? 1:1 training sessions with the power users and writing a lot of documentation.

This demonstrates a key challenge companies face: They must first build up this knowledge internally or rely on external resources as agencies or consulting companies.

Small project idea to better understand the terms:

Create a relational database with MySQL or PostgreSQL, add (simulated) real-time data from orders and use a cloud service such as AWS to stream the data directly into an analysis tool. Then visualize the data in a dashboard and show how new data becomes immediately visible.
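A minimal local sketch of the idea, with SQLite standing in for MySQL/PostgreSQL and a simple console "dashboard" (in a real setup, a cloud service such as AWS would stream the orders into your analysis tool; all names and values below are made up):

import random
import sqlite3
import time
from datetime import datetime

conn = sqlite3.connect("orders.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (ts TEXT, product TEXT, amount REAL)")

for _ in range(5):
    # Simulate a new incoming order roughly once per second
    order = (datetime.now().isoformat(timespec="seconds"),
             random.choice(["Book", "Laptop", "Headphones"]),
             round(random.uniform(10, 500), 2))
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)", order)
    conn.commit()

    # "Dashboard" query: revenue per product, refreshed after every new order
    for row in conn.execute("SELECT product, ROUND(SUM(amount), 2) FROM orders GROUP BY product"):
        print(row)
    print("---")
    time.sleep(1)

conn.close()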

6 – Event-Driven Architecture (EDA)

If we can transfer data between systems in (almost) real time, we also want to be able to react to it in (almost) real time: This is where the term Event-Driven Architecture (EDA) comes into play.

EDA is an architectural pattern in which applications are driven by events. An event is any relevant change in the system. Examples are when customers log in to the application or when a payment is received. Components of the architecture react to these events without being directly connected to each other. This in turn increases the flexibility and scalability of the application. Typical technologies include Apache Kafka or AWS EventBridge.

Why is the term important?

EDA plays an important role in real-time data processing. With the growing demand for fast and efficient systems, this architecture pattern is becoming increasingly important as it makes the processing of large data streams more flexible and efficient. This is particularly crucial for IoT, e-commerce and financial technologies.

Event-driven architecture also decouples systems: By allowing components to communicate via events, the individual components do not have to be directly dependent on each other.

Let’s take a look at an example: In an online store, the "order sent" event can automatically start a payment process or inform the warehouse management system. The individual systems do not have to be directly connected to each other.

What are the challenges?

  1. Data consistency: The asynchronous nature of EDA makes it difficult to ensure that all parts of the system have consistent data. For example, an order may be saved as successful in the database while the warehouse component has not correctly reduced the stock due to a network issue.
  2. Scaling the infrastructure: With high data volumes, scaling the messaging infrastructure (e.g. Kafka cluster) is challenging and expensive.

Small project idea to better understand the terms:

Simulate an Event-Driven Architecture in Python that reacts to customer events (a minimal sketch follows after the steps):

  1. First define an event: An example could be ‘New order’.
  2. Then create two functions that react to the event: 1) Send an automatic message to a customer. 2) Reduce the stock level by -1.
  3. Call the two functions one after the other as soon as the event is triggered. If you want to extend the project, you can work with frameworks such as Flask or FastAPI to trigger the events through external user input.
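A minimal sketch of these steps (function and event names are made up; in a real event-driven setup, a broker such as Kafka would dispatch the event instead of a direct function call):

def send_customer_message(order_id):
    print(f"Message to customer: order {order_id} was received.")

def reduce_stock(inventory, product):
    inventory[product] -= 1  # reduce the stock level by 1
    print(f"Stock of {product} is now {inventory[product]}.")

def handle_event(event, inventory):
    # React to the 'New order' event by calling both functions one after the other
    if event["type"] == "new_order":
        send_customer_message(event["order_id"])
        reduce_stock(inventory, event["product"])

inventory = {"Book": 10}
handle_event({"type": "new_order", "order_id": 42, "product": "Book"}, inventory)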

Final Thoughts

In this part, we have looked at terms that focus primarily on the storage, management & processing of data. These terms lay the foundation for understanding modern data systems.

In part 2, we shift the focus to AI-driven concepts and explore some key terms such as Gen AI, agent-based AI and human-in-the-loop augmentation.

Own visualization – Illustrations from unDraw.co

All information in this article is based on the current status in January 2025.

The post The Concepts Data Professionals Should Know in 2025: Part 1 appeared first on Towards Data Science.

]]>
What is MicroPython? Do I Need to Know it as a Data Scientist? https://towardsdatascience.com/what-is-micropython-do-i-need-to-know-it-as-a-data-scientist-5b4567a21b99/ Sun, 12 Jan 2025 16:32:01 +0000 https://towardsdatascience.com/what-is-micropython-do-i-need-to-know-it-as-a-data-scientist-5b4567a21b99/ In this year's edition of the Stack Overflow survey, MicroPython is with 1.6% in the Most Popular Technologies - but why?

The post What is MicroPython? Do I Need to Know it as a Data Scientist? appeared first on Towards Data Science.

]]>
When I saw MicroPython on the list of the Stack Overflow survey from this year, I wanted to know what I could use this language for. And I wondered if it could serve as a bridge between hardware and software. In this article, I break down what MicroPython is and what data scientists should know about it.

Table of Content
1 – What is MicroPython and why is it special?
2 – Why should I know MicroPython as a data scientist?
3 – What is the difference to Python and other programming languages?
4 – What does this look like in practice? (Only with web based simulator)
5 – Final Thoughts & Where to continue learning?

Image from Stack Overflow

What is MicroPython and why is it special?

MicroPython is a simplified, compact version of Python 3 designed specifically for use on microcontrollers and other low-resource embedded systems. As we can read on the official website, the language offers a reduced standard library and special modules to interact directly with hardware components such as GPIO pins, sensors or LEDs.

Reference: Official MicroPython Website

Let’s break this definition down:

  • Simplified, compact Python: MicroPython is designed to use less memory and computing power than the standard Python version. The language is perfect for devices with just a few kilobytes of RAM.
  • Microcontrollers & embedded systems: Think of a microcontroller as a tiny computer on a chip. It can control devices such as IoT sensors, smart home devices and robots.
  • Low-resource systems: This means that these systems have little memory (often less than 1 MB) and limited computing power.
  • GPIO Pins: These are pins on a microcontroller that can be used for various input and output functions. For example, they can be used to control LEDs or read sensor data.

Why is MicroPython relevant?

If you know Python, you can program hardware with MicroPython – without learning a new complex language like C++ or assembly. Sure, you have more options with C++ and assembly and both are closer to machine languages. But if you want to create a prototype with relatively little effort, MicroPython offers you an ideal starting point.

Why should I know MicroPython as a data scientist?

Simply put: Because it is listed in the Stack Overflow survey and is gaining traction in the developer community…

IoT and edge computing are playing an increasingly important role in AI and Data Science projects. Especially as we want to make our cities smarter (smart cities).

MicroPython can serve as a bridge between hardware and software here, as it makes it possible to collect sensor data and process it in data science pipelines or machine learning models. For example, a MicroPython sensor can measure air quality and send the data to a Machine Learning pipeline. MicroPython can also run simple AI models directly on devices (edge computing) – this makes it ideal for local computing without the device being dependent on the cloud.

So my conclusion: MicroPython makes hardware more accessible for data scientists. If you know Python, you can also use MicroPython and apply it in a smart home project.

What is the difference to Python and other programming languages?

While Python was developed for general software applications that run on powerful devices such as PCs or servers, MicroPython was developed for low-resource devices such as microcontrollers, which often only have a few kilobytes of memory and computing power.

As we all know, Python offers an extensive library for data analysis (pandas, numpy), machine learning (scikit-learn, tensorflow) or web development. MicroPython, on the other hand, only contains a reduced standard library and slimmed-down modules such as ‘math’ or ‘os’. Instead, it offers special hardware modules such as ‘utime’ for timers or ‘machine’ for controlling microcontroller pins.

While Python is better suited for data-intensive tasks, MicroPython enables direct access to hardware components and is therefore ideal for embedded systems (e.g. everyday electronics such as microwaves & smart TVs or medical devices such as blood pressure monitors) and IoT projects.

What does this look like in practice? Application areas and a quick simulator demo

In which areas is MicroPython used?

  • Internet of Things (IoT): MicroPython can be used to control smart home devices or control sensor data for dashboards.
  • Edge computing: You can run machine learning models directly on edge devices (e.g. IoT sensors, smartphones, routers, intelligent cameras, smart home devices, etc.).
  • Prototyping: With relatively little effort, you can quickly set up a prototype for a hardware project – especially if you know Python.
  • Robotics: MicroPython can be used to control motors or sensors in robotics projects.

Flashing LED in the simulator as a practical example

Since as a data scientist or software specialist you probably don’t want to buy hardware just to try out MicroPython, I explored a MicroPython simulator available online. This is a simple and beginner-friendly way to get started with programming hardware concepts without the need for physical devices:

  1. Open https://micropython.org/unicorn/
  2. Import time, then define the function and call the function at the end. Type in each code snippet in the web terminal separately and then click ‘Enter’. You can use the following code for this:
#Provides functions to work with time
#(standard Python library instead of 'utime', as the code is used in the simulator)
import time
# Simulated LED by defining the function
def blink_led():
    for _ in range(5):
        print("LED is now: ON")
        time.sleep(0.5)  # Waits for 0.5 seconds
        print("LED is now: OFF")
        time.sleep(0.5)
# Start the blinking by calling the function
blink_led()

Now we see that the LED (only in the console) switches back and forth between ON and OFF. In this simulator example, I only used the time library for the delay. To run the example with real hardware, you should use additional libraries such as ‘machine’ or ‘utime’.
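For reference, a hedged sketch of how the same blinking logic could look on real hardware – the pin number is an assumption and depends on your board and wiring (GPIO 2 is the onboard LED on many ESP32 development boards):

from machine import Pin
import utime

led = Pin(2, Pin.OUT)  # adjust the pin number to your board

def blink_led():
    for _ in range(5):
        led.value(1)         # LED on
        utime.sleep_ms(500)  # wait 0.5 seconds
        led.value(0)         # LED off
        utime.sleep_ms(500)

blink_led()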

In a web simulator, we can write a ‘hello world’ by writing a small script to output a flashing LED.

Final Thoughts

MicroPython is certainly important for people working on hardware projects such as IoT and edge computing. But thanks to its easy accessibility – anyone who knows Python can also use MicroPython – the language bridges a gap between data science, AI and hardware technology. It is certainly good to at least know the purpose of MicroPython and how it differs from Python. If you are interested in trying out smart home devices or IoT for yourself, it is certainly an accessible entry point.

Where to continue learning?

Own visualization – Illustrations from unDraw.co

The post What is MicroPython? Do I Need to Know it as a Data Scientist? appeared first on Towards Data Science.

]]>
5 Simple Projects to Start Today: A Learning Roadmap for Data Engineering https://towardsdatascience.com/5-simple-projects-to-start-today-a-learning-roadmap-for-data-engineering-940ecbad6b5f/ Thu, 02 Jan 2025 11:31:37 +0000 https://towardsdatascience.com/5-simple-projects-to-start-today-a-learning-roadmap-for-data-engineering-940ecbad6b5f/ Start with 5 practical projects to lay the foundation for your data engineering roadmap.

The post 5 Simple Projects to Start Today: A Learning Roadmap for Data Engineering appeared first on Towards Data Science.

]]>
Start with 5 practical projects to lay the foundation for your data engineering roadmap

Tutorials help you to understand the basics. You will definitely learn something. However, the real learning effect comes when you implement small projects directly – and thus combine theory with practice.

You will benefit even more if you explain what you have learned to someone else. You can also use ChatGPT as a learning partner or tutor – explain in your own words what you have learned and get feedback. Use one of the prompts that I have attached after the roadmap.

In this article, I present a roadmap for 4 months to learn the most important concepts in data engineering for beginners. You start with the basics and increase the level of difficulty to tackle more complex topics. The only requirements are that you have some Python programming skills, basic knowledge of data manipulation (e.g. simple SQL queries) and motivation 🚀

Why only 4 months?

It is much easier for us to commit to a goal over a shorter period of time. We stay more focused and motivated. Open your favorite app right away and start a project based on the examples. Or set a calendar entry to make time for the implementation.

5 projects for your 4-month roadmap

As a data engineer, you ensure that the right data is collected, stored and prepared in such a way that it is accessible and usable for data scientists and analysts.

You are, so to speak, the kitchen manager who organizes the kitchen and ensures that all ingredients are fresh and ready to hand. The data scientist is the chef who combines them into creative dishes.

Month 1 – Programming and SQL

Deepen your knowledge of Python basics: CSV and JSON files are common formats for data exchange. Learn how to edit CSV and JSON files. Understand how to manipulate data with the Python libraries Pandas and NumPy.

A small project to start in Month 1: Clean a CSV file with unstructured data, prepare it for data analysis and save it in a clean format. Use Pandas for data manipulation and basic Python functions for editing (a consolidated sketch follows after the steps).

  1. Read the file with ‘pd.read_csv()’ and get an overview with ‘df.head()’ and ‘df.info()’.
  2. Remove duplicates with ‘df.drop_duplicates()’ and fill in missing values with the average using ‘df.fillna(df.mean())’. Optional: Research what options are available to handle missing values.

  3. Create a new column with ‘df[‘new_column’]’, which, for example, fills all rows above a certain value with a ‘True’ and all others with a ‘False’.
  4. Save the cleansed data with ‘df.to_csv(‘new_name.csv’, index=False)’ in a new CSV file.
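A consolidated sketch of these four steps (file and column names are made up – adjust them to your dataset; selecting only the numeric columns avoids errors when filling with the mean):

import pandas as pd

# 1) Read the file and get an overview
df = pd.read_csv("raw_data.csv")
print(df.head())
print(df.info())

# 2) Remove duplicates and fill missing numeric values with the column mean
df = df.drop_duplicates()
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# 3) New flag column: True for rows above a certain value
df["high_value"] = df["price"] > 100

# 4) Save the cleansed data
df.to_csv("cleaned_data.csv", index=False)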

What problem does this project solve? Data quality is key. Unfortunately, good data quality is not always a given when you receive data in the business world.

Tools & Languages: Python (Pandas & NumPy library), Jupyter Lab

Understanding SQL: SQL allows you to query and organize data efficiently. Understand how to use the most important commands such as CREATE TABLE, ALTER TABLE, DROP TABLE, SELECT, WHERE, ORDER BY, GROUP BY, HAVING, COUNT, SUM, AVG, MAX & MIN, JOIN.

A small project to deepen your knowledge in month 1: Create a relational data model that maps real business processes. Is there a medium-sized bookstore in your city? That is certainly a good scenario to start with.

  1. Think about what data the bookshop manages. For example, books with the data title, author, ISBN (unique identification number), customers with the data name, e-mail, etc.
  2. Now draw a diagram that shows the relationships between the data. A bookstore has several books, which can be from several authors. Customers, in turn, buy these books. Think about how this data is connected.
  3. Next, write down which tables you need and which columns each table has. For example, the columns ISBN, title, author and price for the book table. Do this step for all the data you identified in step 1.
  4. Optional: Create the tables with ‘CREATE TABLE nametable ();’ in a SQLite database. You can create a table with the following code.
-- Creating a table with the name of the columns and their data types
CREATE TABLE Books (
    BookID INT PRIMARY KEY,
    Title VARCHAR(100),
    Author VARCHAR(50),
    Price DECIMAL(10, 2)
);

What problem does this project solve? With a well thought-out data model, a company can efficiently set up important processes such as tracking customer purchases or managing inventory.

Tools & languages: SQL, SQLite, MySQL or PostgreSQL

Month 2 – Databases and ETL pipelines

Mastering relational DBs and NoSQL databases: Understand the concepts of tables, relationships, normalization and queries with SQL. Understand what CRUD operations (Create, Read, Update, Delete) are. Learn how to store, organize and query data efficiently. Understand the advantages of NoSQL over relational databases.

Tools and languages: SQLite, MySQL, PostgreSQL for relational databases; MongoDB or Apache Cassandra for NoSQL databases

Understand the ETL basics: Learn how to extract data from CSV, JSON or XML files and from APIs. Learn how to load cleansed data into a relational database.

A small project for month 2: Create a pipeline that extracts data from a CSV file, transforms it and loads it into a SQLite database. Implement a simple ETL logic.

  1. Load a CSV file with ‘pd.read_csv()’ and get an overview of the data again. Again, remove missing values and duplicates (see project 1). You can find publicly accessible datasets on Kaggle. For example, search for a dataset with products.
  2. Create a SQLite database and define a table according to the data from the CSV. Below you can see an example code for this. SQLite is easier to get started with, as the SQLite library is available in Python by default (module sqlite3).
  3. Load the cleaned data from the DataFrame into the SQLite database with ‘df.to_sql(‘tablename’, conn, if_exists=’replace’, index=False)’.
  4. Now execute a simple SQL query, e.g. with SELECT and ORDER BY. Limit the results to 5 rows. Close the connection to the database at the end.
import sqlite3

# Create the connection to the SQLite-DB
conn = sqlite3.connect('produkte.db')

# Create the table
conn.execute('''
CREATE TABLE IF NOT EXISTS Produkte (
    ProduktID INTEGER PRIMARY KEY,
    Name TEXT,
    Kategorie TEXT,
    Preis REAL
)
''')
conn.commit()  # Persist the change
print("Table created.")

Tools and languages: Python (SQLAlchemy library), SQL

Month 3 – Workflow orchestration and cloud storage

Workflow orchestration: Workflow orchestration means that you automate and coordinate processes (tasks) in a specific order. Learn how to plan and execute simple workflows. You will also gain a basic understanding of the DAG (Directed Acyclic Graph) framework. A DAG is the basic structure in Apache Airflow and describes which tasks are executed in a workflow and in which order.

Tools and languages: Apache Airflow

Cloud storage: Learn how to store data in the cloud. Know at least the names of the major products from the biggest cloud providers: S3, EC2 and Redshift from AWS; BigQuery, Dataflow and Cloud Storage from Google Cloud; and Blob Storage, Synapse Analytics and Data Factory from Azure. The many different products can be overwhelming – start with something you enjoy.

A small project for month 3: Create a simple workflow orchestration concept with Python (without Apache Airflow, which lowers the barrier to entry) that sends you automated reminders during your daily routine:

  1. Plan the workflow: Define tasks such as reminders to "Drink water", "Exercise for 3 minutes" or "Get some fresh air".
  2. Create a sequence of the tasks (DAG): Decide the order in which the tasks should be executed. Define if they are dependent on each other. For example, Task A ("Drink water") runs first, followed by Task B ("Exercise for 3 minutes"), and so on.
  3. Implement the task in Python: Write a Python function for each reminder (see code snippet 1 below as an example).
  4. Link the tasks: Arrange the functions so that they execute sequentially (see code snippet 2 below as an example).
import os
import time

# Task 1: Send a reminder
def send_reminder():
    print("Reminder: Drink water!")  # Print a reminder message
    time.sleep(1)  # Pause for 1 second before proceeding to the next task
if __name__ == "__main__":
    print("Start Workflow...")  # Indicate the workflow has started

    # Execute tasks in sequence
    send_reminder()  # Task 1: Send a reminder to drink water

    # Additional tasks (uncomment and define these functions if needed)
    # reminder_exercise()  # Example: Send the second reminder
    # create_task_list()    # Advanced-Example: Create a daily task list

    print("Workflow is done!")  # Indicate the workflow has completed

Too easy? Install Apache Airflow and create your first DAG that performs the task of printing out "Hello World" or load your transformed data into an S3 bucket and analyze it locally.
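A minimal sketch of such a "Hello World" DAG, assuming a recent Apache Airflow 2.x installation (in older 2.x versions the parameter is called schedule_interval; save the file in your dags/ folder, the schedule and start date are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def say_hello():
    print("Hello World")

with DAG(
    dag_id="hello_world",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",  # run once a day
    catchup=False,
) as dag:
    hello_task = PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,
    )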

Tools and languages: AWS, Google Cloud, Azure

Implement the 5 projects to learn twice as much as if you only look at the theory.

Month 4 – Introduction to Big Data and Visualization

Big data basics: Understand the basics of Hadoop and Apache Spark. Simplilearn, for example, offers great, super-short videos that introduce Hadoop and Apache Spark.

Tools and languages: Hadoop, Apache Spark, PySpark (Python API for Apache Spark), Python

Data visualization: Understand the basics of data visualization.

A small project for month 4: To avoid the need for big data tools like Apache Spark or Hadoop, but still apply the concepts, download a dataset from Kaggle, analyze it with Python and visualize the results (see the sketch after this list):

  1. Download a publicly available, medium-sized dataset from Kaggle (e.g. weather data), read in the dataset with Pandas and get an overview of your data.
  2. Perform a small exploratory data analysis (EDA).
  3. Create e.g. a line chart of average temperatures or a bar chart of rain and sun days per month.
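A minimal sketch of such an analysis (file and column names are hypothetical – adjust them to the Kaggle dataset you actually downloaded):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("weather.csv")
print(df.describe())  # quick exploratory overview (EDA)

# Line chart of the average temperature per month
monthly = df.groupby("month")["avg_temp"].mean()
monthly.plot(kind="line", marker="o")
plt.title("Average temperature per month")
plt.xlabel("Month")
plt.ylabel("Temperature (°C)")
plt.tight_layout()
plt.show()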

Tools and languages: Python (Matplotlib & Seaborn libraries)

2 prompts to use ChatGPT as a learning partner or tutor

When I learn something new, these two prompts help me to reproduce what I have learned and use ChatGPT to check whether I have understood it. Try them out and see if they help you too.

  1. I have just learned about the [topic / project] and want to make sure I have understood it correctly. Here is my explanation: [your explanation]. Give me feedback on my explanation. Add anything that is missing or that I have not explained clearly.
  2. I would like to understand the topic [topic/project] better. Here is what I have learned so far: [your explanation]. Are there any mistakes, gaps or tips on how I can explain this even better? Optional: How could I expand the project? What could I learn next?

What comes next?

  • Deepen the concepts from months 1–4.
  • Learn complex SQL queries such as subqueries and database optimization techniques.
  • Understand the principles of data warehouses, data lakes and data lakehouses. Look at tools such as Snowflake, Amazon Redshift, Google BigQuery or Salesforce Data Cloud.
  • Learn CI/CD practices for data engineers.
  • Learn how to prepare data pipelines for machine learning models
  • Deepen your knowledge of cloud platforms – especially in the area of serverless computing (e.g. AWS Lambda)
Own visualization – Illustrations from unDraw.co

Final Thoughts

Companies and individuals are generating more and more data – and the growth continues to accelerate. One reason for this is that we have more and more data from sources such as IoT devices, social media and customer interactions. At the same time, data forms the basis for machine learning models, the importance of which will presumably continue to increase in everyday life. The use of cloud services such as AWS, Google Cloud or Azure is also becoming more widespread. Without well-designed data pipelines and scalable infrastructures, this data can neither be processed efficiently nor used effectively. In addition, in areas such as e-commerce or financial technology, it is becoming increasingly important that we can process data in real-time.

As data engineers, we create the infrastructure so that the data is available for machine learning models and real-time streaming (zero ETL). With the points from this roadmap, you can develop the foundations.

Where can you continue learning?

The post 5 Simple Projects to Start Today: A Learning Roadmap for Data Engineering appeared first on Towards Data Science.

]]>
Master Bots Before Starting with AI Agents: Simple Steps to Create a Mastodon Bot with Python https://towardsdatascience.com/master-bots-before-starting-with-ai-agents-simple-steps-to-create-a-mastodon-bot-with-python-cce4f9ed24ee/ Fri, 27 Dec 2024 14:01:48 +0000 https://towardsdatascience.com/master-bots-before-starting-with-ai-agents-simple-steps-to-create-a-mastodon-bot-with-python-cce4f9ed24ee/ I recently published a post on Mastodon that was shared by six other accounts within two minutes. Curious, I visited the profiles and...

The post Master Bots Before Starting with AI Agents: Simple Steps to Create a Mastodon Bot with Python appeared first on Towards Data Science.

]]>
I recently published a post on Mastodon that was shared by six other accounts within two minutes. Curious, I visited the profiles and discovered that at least one of them was a tech bot – an account that automatically shares posts based on tags such as #datascience or #opensource.

Mastodon is currently growing rapidly as a decentralized alternative to X (formerly Twitter). How can bots on a platform like this make our everyday lives easier? And what are the risks? Do bots enrich or disrupt social networks? How do I have to use the Mastodon API to create a bot myself?

In this article, I will not only show you how bots work in general but also give you a step-by-step guide with code examples and screenshots on how to create a Mastodon bot with Python and use the API.

Table of Content
1 – Why do Mastodon and tech bots exist?
2 – Technical basics for a bot on a social network
3 – Bots: The balancing act between benefit and risk
4 – How to create a Mastodon bot: Step-by-step instructions with Python
Final Thoughts

1 – Why do Mastodon and tech bots exist?

Mastodon is a decentralized social network developed by Eugen Rochko in Germany in 2016. The platform is open-source and is based on a network of servers that together form the so-called ‘Fediverse’. If you want to share posts, you select a server such as mastodon.social or techhub.social and share your posts on this server. Medium also has its own server at me.dm. Each server sets its own rules and moderation guidelines.

Bots are basically software applications that perform tasks automatically. For example, there are simple bots such as crawler bots that search the internet and index websites. Other bots can do repetitive tasks for you, such as sending notifications or processing large amounts of data (Automation bots). Social media bots go one step further by sharing posts or reacting to content and thus interacting with the platforms. For example, a bot can collect and share the latest news from the technology industry so that followers of this bot profile are always up to date – the bot becomes a curator that curates according to precisely defined algorithms…

Chatbots are also a specific type of bot that are used for customer support, for example. They were developed primarily for dialog with us humans and focus much more on natural language processing (NLP) in order to understand our language and respond to it as meaningfully as possible. Agents, which are currently a hot topic of discussion, are in turn a further development of bots and chatbots: agents can generally take on more complex tasks, learn from data and make decisions independently.

Fun fact: ELIZA, a chatbot developed at MIT in the mid-1960s, was already able to simulate simple conversations. Around 60 years later, we have arrived in the world of agents…

Reference: ELIZA-Chatbot

However, bots can also spread disinformation by automatically disseminating false or misleading information on social networks to manipulate public opinion. Such troll bots are repeatedly observed in political elections or crisis situations, for example. Unfortunately, they are also sometimes used for spam messages, data scraping, DDoS cyberattacks or automated ticket sales. It is therefore important that we handle automated bots responsibly.

2 – Technical basics for a bot on a social network

In simple terms, you need these three ingredients for a bot:

  1. Programming language: Typical programming languages are Python or JavaScript with Node.js. But you can also use languages such as Ruby or PHP.
  2. API access: Your bot sends a request to the application programming interface (API) of a social network and receives a response back.
  3. Hosting: Your bot must be hosted on a service such as Heroku, AWS, Replit or Google Cloud. Alternatively, you can run it locally, but this is more suitable for testing.

Programming language: Popular languages for a bot are Python or JavaScript – depending on the requirements and target platform. Python offers many helpful libraries such as Tweepy for Twitter (now limited in use due to the changes at Twitter/X), Mastodon.py for the Mastodon API or the Python Reddit API Wrapper (PRAW) to manage posts and comments for Reddit. Node.js is particularly suitable if your bot requires real-time communication, server-side requests or integration with multiple APIs. There are libraries such as mastodon-api or Botpress that support multiple channels. For bots on Facebook and Instagram, on the other hand, you need to use the Facebook Graph API, which has much stronger restrictions. And for LinkedIn, you can use the LinkedIn REST API, which is designed more for company pages.

API: Most modern APIs for social networks are based on the REST architecture. This API architecture uses HTTP methods such as GET (to retrieve data), POST (to send data), PUT (to update data) or DELETE (to delete data). For many platforms, you need a secure method such as OAuth2 so that your bot can access the API: you first register your bot with the platform to receive a client ID and a client secret. These credentials are used to request an access token, which is then sent with every request to the API.

Hosting: Once your bot is ready, you need an environment in which it can run. You can run it locally for test purposes or prototypes. For longer-term solutions, there are cloud hosting platforms such as AWS, Google Cloud or Heroku. To ensure that your bot also works independently of the server environment without any problems, you can use Docker, which packages your bot together with all the necessary settings, libraries and dependencies in a standardized "package" that can be started on any server.

In addition, you can automate your bot with cron jobs by running your bot at certain times (e.g. every morning at 8.00 a.m.) or when certain events occur (e.g. a post with a certain hashtag was shared).

Own visualization – Illustrations from unDraw.co

3 – Bots: The balancing act between benefit and risk

There are big differences in quality between bots – while a well-programmed bot responds efficiently to requests and delivers added value, a poorly designed bot can be unreliable or even disruptive. As described at the beginning, a bot is a software application that performs automated tasks: The quality of the bot depends on how the underlying algorithms are programmed, what data the bot has been fed with in the case of AI bots and how the design and interactions are structured.

So how do we create ethically responsible bots?

  1. Transparency: Users need to know that they are interacting with a bot and not a human. Bots that disguise this only destroy trust in the technology. For example, Mastodon has a rule that bots’ profiles must be clearly labeled. It is also possible for the bot to add a small note to every interaction or post that makes it clear that the interaction originates from a bot.
  2. No manipulation: Bots must not be used to spread disinformation or manipulate users in a targeted manner.
  3. Respect for the platform and people: Bots must follow the rules of the respective platform.
  4. Data protection must be respected: For example, if bots analyze user profiles, it must be ensured that the bot does not store data that it should not or it must be defined who has access to this data and how it is used in order to comply with data protection laws such as the GDPR in Europe.

Are bots good or bad? Do bots disrupt social networks or enrich them?

In my opinion, technology that automates repetitive tasks is always valuable. On the one hand, well-developed bots can provide us with valuable information, stimulate discussions or act as support for curators. On the other hand, bots can spread spam, be discriminatory or dominate discussions. In my opinion, such technologies are most useful when they are used as support.

Let’s imagine for a moment a social platform that consists only of trained bots carrying out the discussions among themselves – in my opinion, that would be a pretty boring platform, because the humanity is missing. The interactions would have a "bland aftertaste". Also, when it comes to automation, I often think that although technology performs the task more "perfectly", the creativity and love are missing compared to when the task is performed by a human who works professionally and in detail. The human touch, the unforeseen, is missing.

4 – How to create a Mastodon bot: Step-by-step instructions with Python

We want to create a bot that regularly searches Mastodon posts with the hashtag #datascience and automatically reposts these posts.

Everything you need to get started

  • Python must be installed on your device. Tip for newbies: On Windows, you can use ‘python --version’ in PowerShell to check if you already have Python installed.

  • You need an IDE, such as Visual Studio Code, to create the Python files.
  • Optional: If you are working with the Anaconda distribution, it is best to create a new environment with ‘conda create --name NameEnvironment python=3.9 -y’ and install the libraries in this environment so that there are no dependency conflicts between projects. Tips for newbies: You can then activate the environment with ‘conda activate NameEnvironment’. The -y means that all confirmations are automatically accepted during the installation.

1) Install the Mastodon.py library

First we install Mastodon.py with pip:

pip install Mastodon.py

Tips for newbies: With ‘pip --version’ you can check if pip is installed. If no version is displayed, you can install pip with ‘conda install pip’.

2) Register the app for the bot on techhub.social

If you don’t have an account on techhub.social yet, register. Techhub.social describes itself as a Mastodon instance for passionate technologists and states in the rules that bots must be marked as Bot in their profile.

We now register our app for our bot using the ‘Mastodon.create_app()’ function. To do this, we create a Python file with the name ‘register_app.py’ and insert the code below. In this code, we register the bot with Mastodon to gain API access and save the necessary access data. First, we create the app with ‘Mastodon.create_app()’ and save the client credentials in the file ‘pytooter_clientcred.secret’. Then we log in to Mastodon to generate the user credentials, which we save in another file, ‘pytooter_usercred.secret’. We add error handling to catch problems such as incorrect login data.

from mastodon import Mastodon, MastodonIllegalArgumentError, MastodonUnauthorizedError

try:
    # Step 1: Creating the app and saving the client-credentials
    Mastodon.create_app(
        'pyAppName',  # Name of your app
        api_base_url='https://techhub.social',  # URL to the Mastodon instance
        to_file='pytooter_clientcred.secret'  # File to store app credentials
    )
    print("App registered. Client-Credentials are saved.")

    # Step 2: Login & Saving of the User-Credentials
    print("Log in the user...")
    mastodon = Mastodon(
        client_id='pytooter_clientcred.secret',
        api_base_url='https://techhub.social'
    )

    mastodon.log_in(
        'useremail@example.com',  # Your Mastodon-Account-Email
        'YourPassword',  # Your Mastodon-Password
        to_file='pytooter_usercred.secret'  # File to store user credentials
    )
    print("Login successful. User-Credentials saved in 'pytooter_usercred.secret'.")

except MastodonUnauthorizedError as e:
    print("Login failed: Invalid email or password.")
except MastodonIllegalArgumentError as e:
    print("Login failed: Check the client credentials or base URL.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Then we enter this command in the Anaconda prompt to execute the script:

python register_app.py

If everything worked successfully, you will find the file ‘pytooter_clientcred.secret’ in your directory, which contains the app-specific credentials for our app that were generated when the app was registered. In addition, there should be the file ‘pytooter_usercred.secret’, which contains the user-specific access data. This information was generated after the successful login.

If everything worked, you will see a confirmation output in the terminal.

Tips for newbies: Tooten is used in Mastodon to say that a post is published (like tweeting on Twitter). The two secret files contain sensitive information. It is important that you do not share them publicly (e.g. do not add them to your GitHub repository). If you want to use 2FA, you must use the OAuth2 flow instead. If you open your Mastodon account in the desktop application you can check this setting in Settings>Account>Two-Factor-Authentication.

3) Publish test post via API

Once the registration and login have worked successfully, we create an additional file ‘test_bot.py’ and use the following code. First we load the user credentials from ‘pytooter_usercred.secret’ and connect to the Mastodon API. With ‘mastodon.toot()’ we specify the content we want to publish. Finally, we display a confirmation in the terminal that the toot has been sent successfully.

from mastodon import Mastodon

mastodon = Mastodon(
    access_token='pytooter_usercred.secret',
    api_base_url='https://techhub.social'
)

mastodon.toot('Hello from my Mastodon Bot! #datascience')
print("Toot gesendet!")

We save the file in the same directory as the previous files. Then we run the script from the terminal with ‘python test_bot.py’.

On Mastodon we see that the post has been successfully tooted.

4) Reblog posts with a specific hashtag

Now we want to implement that the bot searches for posts with hashtag #datascience and re-shares them.

In a first step, we create a new file ‘reblog_bot.py’ with the following code: Using the ‘reblog_datascience()’ function, we first connect to the Mastodon API by loading the user credentials from ‘pytooter_usercred.secret’. Then the bot uses ‘timeline_hashtag()’ to retrieve the last 3 posts with the hashtag #datascience. With ‘status_reblog()’ we automatically share each post and display the ID of the shared post in the terminal.

To avoid overloading, the API allows up to 300 requests per account within 5 minutes. With ‘limit=3’ we specify that only 3 posts are reblogged at a time – so this is not a problem.

from mastodon import Mastodon

def reblog_datascience():
    mastodon = Mastodon(
        access_token='pytooter_usercred.secret',
        api_base_url='https://techhub.social'
    )
    # Retrieve posts with the hashtag #datascience
    posts = mastodon.timeline_hashtag('datascience', limit=3)
    for post in posts:
        # Reblogging posts
        mastodon.status_reblog(post['id'])
        print(f"Reblogged post ID: {post['id']}")

# Run the function
reblog_datascience()

As soon as you run the file, 3 posts will be reblogged in your profile and you will see the IDs of the 3 posts in the terminal:

3 posts containing the hashtag #datascience will be reposted.
In the terminal we see the IDs of the 3 posts that were reposted.

Note: I have removed the posts from my Mastodon account afterwards as my profile is not labeled as a bot.

Final Thoughts

We could extend the bot even further, for example by adding functions so that duplicate posts are not reblogged or that error messages (e.g. due to missing authorizations) are caught and logged. We could also host the bot on a platform such as AWS, Google Cloud or Heroku instead of running it locally on our computer. For automated execution, it would also make sense to set up a scheduler. On Windows, for example, this can be tried out with the Task Scheduler. This will run the bot regularly (e.g. every morning at 8.00 a.m.), even if the terminal is closed. On Linux or Mac, we could use alternatives such as cron jobs.
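As an illustration of the first extension, here is a hedged sketch of how duplicate reblogs could be avoided by remembering the IDs of already shared posts in a local file (the file and function names are made up):

import json
import os
from mastodon import Mastodon

SEEN_FILE = "reblogged_ids.json"

def load_seen_ids():
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE) as f:
            return set(json.load(f))
    return set()

def reblog_datascience_once():
    mastodon = Mastodon(
        access_token='pytooter_usercred.secret',
        api_base_url='https://techhub.social'
    )
    seen = load_seen_ids()
    for post in mastodon.timeline_hashtag('datascience', limit=3):
        if post['id'] in seen:
            continue  # already reblogged earlier - skip it
        mastodon.status_reblog(post['id'])
        seen.add(post['id'])
        print(f"Reblogged post ID: {post['id']}")
    with open(SEEN_FILE, "w") as f:
        json.dump(list(seen), f)

reblog_datascience_once()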

Like practically any technology, bots can offer great benefits if we use them in a considered, ethical and data protection-compliant manner. However, they can also disrupt social platforms if we misuse them.

Where can you continue learning?

The post Master Bots Before Starting with AI Agents: Simple Steps to Create a Mastodon Bot with Python appeared first on Towards Data Science.

]]>
The Essential Guide to R and Python Libraries for Data Visualization https://towardsdatascience.com/the-essential-guide-to-r-and-python-libraries-for-data-visualization-33be8511c976/ Mon, 16 Dec 2024 21:40:28 +0000 https://towardsdatascience.com/the-essential-guide-to-r-and-python-libraries-for-data-visualization-33be8511c976/ Let's dive into the most important libraries in R and Python to visualise data and create different charts, and what the pros and cons are

The post The Essential Guide to R and Python Libraries for Data Visualization appeared first on Towards Data Science.

]]>
Being a pro in certain programming languages is the goal of every aspiring data professional, and reaching a solid level in one of the countless languages is a critical milestone.

For data engineers, SQL is probably the most important language. As a web developer, you need to know JavaScript, HTML, CSS and PHP inside out. For data scientists, on the other hand, Python and R are the preferred tools. Both languages have their strengths and weaknesses – and both offer powerful tools and a large community for analyzing and visualizing data.

If you’re at the very beginning of your Data Science journey, the choice between R and Python can be overwhelming. But if you want to move into this field in the long term, you will come into contact with both languages sooner or later anyway. Also, if you’re already at university, you probably have courses in both languages.

But let’s dive into the most important libraries in R and Python to visualize data, how creating charts in R and Python is different (with code examples), and what the pros and cons of the two languages are.

Table of Content
1 – What makes R a must-have? (And essential libraries for visualizations)
2 – Python is everywhere: From data analysis to web development (And essential libraries for visualizations)
3 – Step-by-Step-Guide: Creating plots in R and Python with code examples
4 – Advantages and disadvantages: Comparing R & Python for Data Science
5 – Final Thoughts and where to continue learning

1 – What makes R a must-have?

R was developed in the 1990s at the University of Auckland specifically for statistical analyses and graphical representations. R is particularly well suited for statistical analyses, hypothesis testing, data manipulation and visual representations. If you want to move into the academic world and participate in research projects, R is a must anyway (especially in areas such as biostatistics, social sciences, psychology & economics). On CRAN (the Comprehensive R Archive Network) you can find thousands of hosted R packages.

Essential R libraries for visualizations that you should know

R has countless libraries. When it comes to Data Visualization, there are different libraries with their own strengths. Here are 6 important libraries you should definitely know:

1. ggplot2 You need to know this library anyway – it's the undisputed classic in the R community. You can use it to create custom, high-quality visualizations.

2. plotly With Plotly you can create interactive plots. For example, users can zoom into the diagram or switch between different views. The cool thing is that you can also integrate them into web applications and dashboards.

3. lattice If you need an alternative to ggplot2, you can use lattice to create multi-panel (trellis) plots. ggplot2 is much more common, but lattice scores points because it is less complex for beginners.

4. shiny If you need a way to present your data in real-time, Shiny is a good choice. You can use it to develop interactive web applications directly in R. It is also possible to integrate visualizations that you have created with ggplot2 or plotly into the dashboards.

5. leaflet If you want to create interactive geographical maps, it is best to use this library. You can use it to create interactive maps that you can also customize with additional layers, markers and pop-ups.

6. esquisse I recently discovered esquisse. It is particularly suitable if you want to create a prototype quickly. It is a visual tool that allows you to create ggplot2-based visualizations using drag & drop. This means you can create visualizations without writing a single line of code. You can then export the underlying ggplot2 code to further customize your plots. This library probably deserves its own article…

2 – Python is everywhere: From data analysis to web development

Do you know Monty Python? If not, you should definitely watch some clips from the British comedy group. Python was named after Monty Python (not the snake…) and was first released in 1991. The humor is sometimes a bit dark and takes a moment to land – but it's definitely a classic:

https://www.youtube.com/watch?v=xxamBlMta94

But back to the topic: the language was designed to be highly readable and have a clear syntax. Python is a language for ‘everything’, so to speak: you can use it in data analysis, but also for machine learning, deep learning or in web development (e.g. with the Django framework).

Similar to R with CRAN, Python uses PyPI (the Python Package Index) as its central repository, with a huge number of libraries to install. While R is mainly used in statistics and research, Python is used in almost all industries. Now that machine learning and big data are becoming increasingly important, Python is gaining even more ground, as it is the favourite language for machine learning (with scikit-learn, TensorFlow, Keras & PyTorch).

Essential Python libraries for visualizations that you should know

Here I have put together 8 important libraries that you definitely need to know:

1. matplotlib Even if you're a beginner, you've almost certainly come across this library before. With it, you can create a wide range of 2D plots – from simple line charts and histograms to complex subplots. It gives you a lot of control over your plots: for example, if you create a bar chart, you can adjust the axes, colours, fonts or even the width of the bars in the code. However, it can get a bit tedious for complex visualizations – the code becomes longer and more complicated if you want to combine several plots.

2. seaborn This library is based on matplotlib. It is particularly suitable for statistical visualizations such as heat maps, pair plots and box plots. You can also use seaborn to work directly with your tables (pandas DataFrames) without having to convert the data first. This makes it easy to recognize initial patterns, trends and correlations in your data relatively quickly. The library is particularly useful for exploratory data analysis (EDA). In one of my recent articles, 'Mastering Time Series Data: 9 Essential Steps for Beginners before applying Machine Learning in Python', you can find some important steps for an EDA.

3. plotly If you want to create interactive visualizations such as 3D plots, geographical maps or dashboards, use plotly. You can zoom into the diagrams, highlight data points or switch between views. The great thing about the library is that you can also easily integrate the plots into web applications afterwards. If you are just starting out with Python, the plotly.express API is an easy way to get started.

4. pandas Of course, pandas is not just a visualization library but a tool for almost any data manipulation task. If you want to visualize data, you can create a plot directly from a DataFrame with the 'df.plot()' method – for example line, bar or scatter charts (a minimal sketch follows after this list). If you want a quick and easy first look at your data before you dive deeper into the analysis, pandas is definitely a good choice.

5. bokeh This library is ideal if you want to create interactive dashboards that can be embedded directly in HTML. bokeh was specially developed for interactive, web-friendly visualizations. The library impresses with its fast rendering times, especially with large data sets.

6. altair The counterpart of ggplot2 in R, so to speak – the syntax is very similar. Altair is very suitable for exploratory data analysis (EDA) and you can create meaningful plots quickly and easily.

7. holoviews Do you need to be super fast and don't want to write a lot of code? Then the best way to create your visualization is with holoviews. You can create interactive plots with minimal code, making the library ideal for prototypes or when you need quick feedback.

8. folium With folium you can create interactive geographical maps and visualizations. It is based on leaflet.js and allows you to create maps with markers, heatmaps or clusters. For example, you can display data points on a world map or carry out geographical analyses.
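To illustrate point 4, here is a minimal sketch of plotting straight from a DataFrame with 'df.plot()'; the DataFrame and its column names are made up purely for this example:

# Python-Visualization
import pandas as pd
import matplotlib.pyplot as plt

# A small example DataFrame with made-up values
sales = pd.DataFrame({'year': [2021, 2022, 2023, 2024],
                      'revenue': [10, 14, 13, 18]})

# One line is enough for a quick first look at the data
sales.plot(x='year', y='revenue', kind='line', title='Revenue per Year')
plt.show()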

Own visualization – Illustrations from unDraw.co

3 – Step-by-Step-Guide: Creating plots in R and Python with code examples

The best way to get started is by opening RStudio or Jupyter Lab and running through the code examples yourself step by step.

Prerequisites to run the visualizations with R

To follow along, you need to have R installed, and ideally RStudio as well. Then we install the libraries we need for the visualizations with the following command:

# Installing Packages for R
install.packages(c("ggplot2", "plotly", "leaflet"))

To make the visualizations easy to reproduce, we use the built-in dataset ‘mtcars’, which contains vehicle data such as horsepower, weight and fuel consumption. We load the dataset with the following command:

# Loading data for R
data(mtcars)  # Loads the built-in mtcars dataset into memory

Prerequisites to create the visualizations with Python

Of course, you need to have Python installed. I also use JupyterLab for the code examples (if you prefer to work with VS Code, that is also a good alternative). I work with Anaconda (a Python distribution that makes it easier to get started). If you are not using Anaconda, you can install the packages below using pip. To ensure there are no conflicts between libraries, I create a separate environment using the following command:

conda create --name NameEnvironment python=3.10

In this article ‘Python Data Analysis Ecosystem – A Beginner’s Roadmap‘ you will find more detailed instructions on how to get started.

After we have activated the environment (conda activate NameEnvironment) we install the libraries we need:

# Installing Packages for Python
conda install matplotlib seaborn pandas plotly folium scikit-learn  # scikit-learn is needed later to load the Iris dataset

Once the libraries are installed, you can start JupyterLab by entering 'jupyter lab' in the terminal.

Now we load a sample data set that is directly integrated into Pandas. Although the data from this sample dataset is less interesting than real data, it is easier to go through the code examples this way. The iris dataset consists of 150 observations of three iris flower species, each with four characteristics (sepal length, sepal width, petal length, petal width) and the corresponding flower species.

# Python-Visualization
# Importing the libraries
from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Loading the Iris dataset as a pandas DataFrame
iris = load_iris(as_frame=True)
df = iris.frame

# Displaying the first 5 rows of the dataset
print(df.head())

Creating basic plots

Histogram First, we create a histogram with R to visualize the distribution of horsepower (hp): we pass the dataset and column to 'hist()'. We use 'main' and 'xlab' to label the chart and the x-axis, define the fill colour with 'col' and the border colour with 'border'.

# R-Visualization
# Histogram for Horsepower (hp) in the mtcars Dataset
hist(mtcars$hp, 
     main = "Histogram of Horsepower", 
     xlab = "Horsepower", 
     col = "skyblue", 
     border = "black")  # Creates a histogram for the hp column

# Adding gridlines for better readability
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

Using Python, we create a histogram with matplotlib that shows the distribution of sepal length in the Iris dataset. With 'bins=10' we divide the data into 10 intervals. When you create a histogram, the number of bins determines the level of granularity in the distribution display. Starting with 5 to 10 bins is often a good default and gives you a solid overview of the distribution of your data. If you choose fewer bins, you will see more general trends but less detail; with more bins, you can recognize details and patterns more precisely, but the chart can appear cluttered. Especially if you are working with smaller datasets (<100 values), 5–10 bins usually make sense.

# Python-Visualization
# Histogram for Sepal Length in the Iris Dataset
plt.hist(df['sepal length (cm)'], bins=10, color='skyblue', edgecolor='black')  # Creates a histogram for Sepal Length
plt.title('Histogram of Sepal Length')  # Adds a title to the plot
plt.xlabel('Sepal Length (cm)')  # Labels the x-axis
plt.ylabel('Frequency')  # Labels the y-axis

# Add gridlines for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.show()  # Displays the plot

Bar chart We use the bar chart to compare categorical data. First, we analyze in R the frequency distribution of cars with different numbers of cylinders in the mtcars dataset: instead of just selecting the data frame and column as with the histogram, we need to use 'table()' to create a table that counts the frequency of each number of cylinders. This helps us understand how frequently each cylinder category appears in the dataset.

# R-Visualization
# Bar chart for cylinders in the mtcars dataset
barplot(table(mtcars$cyl), 
        main = "Number of Cylinders in Cars", 
        xlab = "Cylinders", 
        ylab = "Frequency", 
        col = "orange", 
        cex.names = 0.8)  # Adjusts the axis label size

In Python, we create a bar chart to analyze the frequency of each target class (i.e. flower type) in the Iris dataset: with 'df['target'].value_counts()' we count how often each target class occurs in the DataFrame. Each class represents one of the three iris flower types. We then pass this result to 'plot()' and specify that a bar chart with orange bars and a black border should be created. This helps us visualize the distribution of flower types in the dataset.

# Python-Visualization
# Bar chart for target classes in the Iris dataset
df['target'].value_counts().plot(kind='bar', color='orange', edgecolor='black')  # Creates the bar chart
plt.title('Frequency of Target Classes')  # Adds a title to the plot
plt.xlabel('Iris Flower Type')  # Labels the x-axis
plt.ylabel('Frequency')  # Labels the y-axis
plt.xticks(rotation=0)  # Ensures x-axis labels are horizontal
plt.show()  # Displays the plot

Scatter plot We use the scatter plot to visualize the relationship between two numerical variables. In R, we examine the relationship between engine power (hp) and car weight (wt). Such diagrams help us to recognize possible patterns or correlations between the variables. For example, we could test the hypothesis that heavier cars tend to have more hp. We begin by specifying the two variables we want to compare. With 'pch=19' we draw the points as filled circles, and with 'col' we display them in blue. Alternatively, you could use 'pch=17' for filled triangles or 'pch=0' for unfilled squares:

# R-Visualization
# Scatter plot for Horsepower (hp) vs. Car Weight (wt) in the mtcars dataset
plot(mtcars$hp, mtcars$wt, 
     main="Scatter Plot: Horsepower vs. Weight", 
     xlab="Horsepower", 
     ylab="Weight", 
     pch=19, 
     col="blue")

Addition: to better identify trends, we can draw a regression line by adding the following command:

# R-Visualization
# Adding a regression line
abline(lm(mtcars$wt ~ mtcars$hp), col="darkgreen", lwd=2)

In Python, we visualize the relationship between sepal length and sepal width. For example, we can check the hypothesis that larger sepals are also wider. In the first line, we pass the values for the x-axis and the y-axis and colour the points blue. With 'alpha=0.7' we make the points slightly transparent so that overlapping points remain visible.

# Python-Visualization
# Scatter plot for Sepal Length vs. Sepal Width
plt.scatter(df['sepal length (cm)'], df['sepal width (cm)'], color='blue', alpha=0.7)
plt.title('Scatter Plot: Sepal Length vs. Sepal Width')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.show()

Addition: for this small dataset, we can use NumPy to calculate and plot a regression line. If the dataset were larger, it would be better to use seaborn (see the sketch after the following code block).

# Python-Visualization
# Calculate data for the regression line
x = df['sepal length (cm)']
y = df['sepal width (cm)']
m, b = np.polyfit(x, y, 1)  # Calculates slope (m) and y-intercept (b)

# Scatter plot
plt.scatter(x, y, color='blue', alpha=0.7, s=50)  # Add point size for better visibility
plt.plot(x, m*x + b, color='red', linewidth=2, label='Regression Line')  # Add regression line
plt.title('Scatter Plot: Sepal Length vs. Sepal Width')
plt.xlabel('Sepal Length (cm)', fontsize=12)
plt.ylabel('Sepal Width (cm)', fontsize=12)
plt.legend()  # Adds a legend to distinguish the regression line
plt.show()
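For completeness, here is roughly how the same plot could look with seaborn's 'regplot', which fits and draws the regression line (with a confidence band) for you – a minimal sketch, assuming the Iris DataFrame 'df' from above is already loaded:

# Python-Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# regplot fits a linear regression and draws the line with a confidence band
sns.regplot(x='sepal length (cm)', y='sepal width (cm)', data=df, color='blue')
plt.title('Scatter Plot with Regression Line (seaborn)')
plt.show()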

Interactive plots with a geographical map

Finally, we want to create an interactive geographical map that marks locations you have already visited, for example. In R we use leaflet for this: first, we create an empty map object with 'leaflet()'. With 'addTiles()' we add the default map tiles, and with 'addMarkers()' we add markers for the locations on the map.

# R-Visualization
library(leaflet)

# Interactive map with Leaflet
leaflet() %>%
  addTiles() %>%
  addMarkers(lng=-0.1278, lat=51.5074, popup="London") %>%
  addMarkers(lng=2.3522, lat=48.8566, popup="Paris")

In Python, we use the library folium: A map is created with ‘folium.Map()’. ‘location’ specifies the starting position of the map. ‘zoom_start’ sets an initial zoom level so that we start with an overview of Europe. We then add the markers with ‘folium.Marker()’. And at the end we save the map as an HTML file that we can open in a browser.

# Python-Visualization
import folium

# Locations and their coordinates
locations = [
    {"name": "London", "coords": [51.5074, -0.1278]},
    {"name": "Paris", "coords": [48.8566, 2.3522]}
]

# Interactive map with Folium
m = folium.Map(location=[51.5074, -0.1278], zoom_start=5)  # Initial view set to London

# Add markers for each location
for location in locations:
    folium.Marker(location["coords"], popup=location["name"]).add_to(m)

# Save the map as an HTML file
m.save('map.html')
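If you want to look at the result right away, you could also open the saved file from Python using the standard library's webbrowser module – a small optional addition that is not part of the original snippet:

# Python-Visualization
import webbrowser

# Open the saved HTML map in the default browser
webbrowser.open('map.html')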

4 – Advantages and disadvantages: Comparing R & Python for Data Science

R was developed specifically for statistics and data analysis, and you can see this when using it.

But what are the strengths and weaknesses of this programming language?

Advantages of R

  • Strength in statistics & data analysis R is specially optimized for statistical operations and visualizations, which quickly becomes noticeable when using the language. R also offers extensive functions for regressions, hypothesis tests and data modeling.
  • Community & open source R has a very active community that is constantly developing new packages, and you can find many resources online. In addition, the language is open source and therefore free and accessible to all.
  • Integration into other environments You can easily integrate R into other environments: for example, into Jupyter Notebooks for interactive analyses, into Shiny dashboards to display results, or via packages such as 'httr' and 'jsonlite' to call APIs. R can also access relational databases directly through packages like 'DBI' and 'RPostgreSQL'.

And what are the disadvantages of R?

  • Steeper learning curve Compared to Python, the syntax can be more difficult for beginners, and the error messages are often less intuitive. An example is the common 'object not found' error: it appears when you reference a variable or object that doesn't exist in your environment, often due to a simple typo or because you forgot to define the variable.
  • Performance R can be slower than Python when processing large datasets. R is also less suitable for machine learning and deep learning: while it offers some support through libraries like 'caret' or 'randomForest', it is not as comprehensive as Python's frameworks such as TensorFlow and PyTorch.
  • Less broad R is primarily specialized for statistics. For other tasks such as web development or machine learning, R cannot be used to the same extent as Python.

Python is one of the most versatile and widely used programming languages. But even Python is not equally suitable for everything:

Let’s start with the advantages of Python:

  • Easy to learn Python has a very intuitive, easy-to-understand syntax with many similarities to English. If you are a beginner, Python is usually the recommended starting point. For example, a simple conditional statement reads almost like plain English (see the snippet after this list).

  • Versatility The great thing about this language is that it is not only suitable for data science. Once you have mastered the language, you can also use it for automation, web development and, of course, machine learning.
  • Large community Like R, Python has a huge community and ecosystem of libraries and resources. Python is also open source and therefore freely available to everyone.
  • Performance With libraries such as NumPy, Pandas and Dask, Python has good tools for processing large amounts of data efficiently. Check out my article 'A Practical Dive into NumPy and Pandas: Data Analysis with Python' for an overview of NumPy and Pandas.
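Here is the conditional statement from the 'Easy to learn' point above, written out on separate lines so that it runs as valid Python (the value of 'age' is just an example):

# Python example: a simple, readable conditional statement
age = 20  # example value

if age >= 18:
    print("You are an adult.")
else:
    print("You are not an adult.")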

And what are the disadvantages of Python?

  • Not as specialized in statistics If you want to perform complex statistical analyses and your focus is clearly on statistics, R is probably the better choice.

  • Performance If you compare Python with other programming languages such as C++ or Java, which are compiled languages, Python can be slower. You can minimize this disadvantage with libraries such as NumPy or Cython.

5 – Final Thoughts

When to use R? When to use Python?

If you are new to programming or are looking for a versatile language, Python is likely the better starting point. However, both programming languages are solid choices – R might be the better option if your primary focus is on statistics and data visualization.

But almost more important than choosing between R and Python is that you understand the basic principles of data analysis: How can you clean raw data and prepare it for analysis? What steps do you need to take to perform exploratory data analysis (EDA) to recognize patterns and relationships in your data? Which visualizations are best suited to show your results to others?

In addition to R and Python, there are other languages and tools that are important in data analysis and visualization. One of them is Julia, which is particularly fast and efficient for numerical calculations and scientific computing. There is also MATLAB, which has powerful visualization and calculation functions. It’s commonly used in academia and engineering for its robust computational capabilities and ease of use in specific domains. However, it’s relatively expensive and less flexible. Tableau and Power BI are excellent tools for creating interactive visualizations without requiring programming skills – and are widely used in business environments. And of course, there is still Excel, which allows practically anyone to create many visualizations very easily without having to know a programming language. While Excel is an excellent tool for beginners, its limitations become apparent when handling larger datasets.

Where can you continue learning?

The post The Essential Guide to R and Python Libraries for Data Visualization appeared first on Towards Data Science.

]]>