Kubernetes — Understanding and Utilizing Probes Effectively

Why proper configuration and implementation of Kubernetes probes is vital for any critical deployment

Introduction

Let’s talk about Kubernetes probes and why they matter in your deployments. When managing production-facing containerized applications, even small optimizations can have enormous benefits.

Reducing deployment times, helping your applications react better to scaling events, and managing the health of running pods all require fine-tuning your container lifecycle management. This is exactly why proper configuration — and implementation — of Kubernetes probes is vital for any critical deployment. They help your cluster make intelligent decisions about traffic routing, restarts, and resource allocation.

Properly configured probes dramatically improve your application's reliability, reduce deployment downtime, and help it handle unexpected errors gracefully. In this article, we'll explore the three types of probes available in Kubernetes and how using them together helps you build more resilient systems.

Quick refresher

Understanding exactly what each probe does, along with some common configuration patterns, is essential. Each serves a specific purpose in the container lifecycle, and when used together they create a rock-solid framework for maintaining your application's availability and performance.

Startup: Optimizing start-up times

Startup probes are evaluated once, when a new pod is spun up because of a scale-up event or a new deployment. They serve as a gatekeeper for the rest of the container checks, and fine-tuning them will help your applications handle increased load or service degradation better.

Sample Config:

startupProbe:
  httpGet:
    path: /health
    port: 80
  failureThreshold: 30
  periodSeconds: 10

Key takeaways:

  • Keep periodSeconds low so that the probe fires often and detects a successful start-up quickly.
  • Increase failureThreshold to a value high enough to accommodate your worst-case start-up time (with the sample config above, the container gets up to failureThreshold × periodSeconds = 30 × 10 = 300 seconds to start).

The startup probe checks whether your container has started by querying the configured path. It also suppresses the liveness and readiness probes until it succeeds.

Liveness: Detecting dead containers

Your liveness probes answer a very simple question: “Is this pod still running properly?” If not, K8s will restart it.

Sample Config:

livenessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 10
  failureThreshold: 3

Key takeaways:

  • Since K8s restarts your container when this probe fails, set failureThreshold above 1 so that intermittent hiccups don't trigger unnecessary restarts.
  • Avoid using initialDelaySeconds, as it is a rigid fixed delay — use a startup probe instead.

Be mindful that a failing liveness probe will restart your currently running container, so avoid making it too aggressive — that's the job of the next probe.

Readiness: Handling unexpected errors

The readiness probe determines whether a pod should start — or continue — to receive traffic. It is extremely useful in situations where your container has lost its connection to the database or is otherwise over-utilized and should not receive new requests.

Sample Config:

readinessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 3
  failureThreshold: 1
  timeoutSeconds: 1

Key takeaways:

  • Since this is your first guard against sending traffic to unhealthy targets, make the probe aggressive by keeping periodSeconds low.
  • Keep failureThreshold at a minimum; you want to fail fast.
  • Keep timeoutSeconds low as well, so that a slow-responding container is marked unready promptly.
  • Give the readiness probe room to take the pod out of rotation and recover by configuring a longer-running, more tolerant liveness probe.

Readiness probes ensure that traffic never reaches a container that isn't ready for it, which makes them one of the most important probes in the stack.

Putting it all together

As you can see, even though each probe has its own distinct use, the best way to improve your application's resilience is to use them alongside each other.

Your startup probe assists in scale-up scenarios and new deployments, allowing your containers to be brought up quickly. It fires only once and holds back the other probes until it completes successfully.

The liveness probe helps deal with dead containers suffering from non-recoverable errors by telling the cluster to restart the container and give you a fresh start.

The readiness probe is the one telling K8s whether a pod should receive traffic. It can be extremely useful when dealing with intermittent errors or high resource consumption that results in slower response times.
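For reference, here is a rough sketch of how the three probes from the samples above might sit together on a single container. The container name, image, /health path, and port are placeholders rather than values from any real deployment; tune the thresholds to your own workload.

containers:
  - name: my-app              # placeholder name
    image: my-app:latest      # placeholder image
    startupProbe:
      httpGet:
        path: /health
        port: 80
      failureThreshold: 30    # allows up to 30 x 10s = 300s of start-up time
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /health
        port: 80
      periodSeconds: 10
      failureThreshold: 3     # tolerate intermittent hiccups before restarting
    readinessProbe:
      httpGet:
        path: /health
        port: 80
      periodSeconds: 3        # aggressive: pull the pod out of rotation quickly
      failureThreshold: 1
      timeoutSeconds: 1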

Additional configurations

Probes can be further configured to run a command inside the container instead of an HTTP request, and to give the container ample time to terminate safely. While these options are useful in more specific scenarios, understanding how you can extend your deployment configuration is beneficial, so I'd recommend doing some additional reading if your containers handle unique use cases.
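As an example of the command-based variant, here is a minimal sketch of an exec liveness probe; the /tmp/healthy path is a placeholder file that the application itself would have to write:

livenessProbe:
  exec:
    command:
      - cat
      - /tmp/healthy   # placeholder health marker written by the application
  periodSeconds: 10
  failureThreshold: 3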

Further reading:
Liveness, Readiness, and Startup Probes
Configure Liveness, Readiness and Startup Probes

Is Python Set to Surpass Its Competitors?

The features that make Python the most suitable programming language for most people

A soufflé is a baked egg dish that originated in France in the 18th century. The process of making an elegant and delicious French soufflé is complex, and in the past, it was typically only prepared by professional French pastry chefs. However, with pre-made soufflé mixes now widely available in supermarkets, this classic French dish has found its way into the kitchens of countless households. 

Python is like the pre-made soufflé mix of programming. Surveys such as the TIOBE index have consistently ranked Python as the most popular programming language among developers, and this lead looks set to keep growing in 2025. Python stands out from languages like C, C++, Java, and Julia because it is highly readable and expressive, flexible and dynamic, and beginner-friendly yet powerful. These characteristics make Python the most suitable programming language even for people without a programming background. The following features distinguish Python from other programming languages:

  • Dynamic Typing
  • List Comprehensions
  • Generators
  • Argument Passing and Mutability

These features reveal Python's intrinsic nature as a programming language; without understanding them, you'll never truly understand Python. In this article, I will elaborate on how Python excels over other programming languages through these features.

Dynamic Typing

In most programming languages, such as Java or C++, explicit data type declarations are required. In Python, however, you don't have to declare a variable's type when you create it. This feature is called dynamic typing, and it makes Python flexible and easy to use.
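As a minimal sketch of what this looks like in practice, the same name can be rebound to objects of completely different types, because the type lives with the object rather than with the variable:

x = 42             # x refers to an int
x = "forty-two"    # now x refers to a str; no declaration needed
x = [4, 2]         # and now to a list
print(type(x))     # <class 'list'>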

List Comprehensions

List comprehensions are used to generate lists from other lists by applying functions to each element in the list. They provide a concise way to apply loops and optional conditions in a list.

For example, if you’d like to create a list of squares for even numbers between 0 and 9, you can use JavaScript, a regular loop in Python and Python’s list comprehension to achieve the same goal. 

JavaScript

let squares = Array.from({ length: 10 }, (_, x) => x)  // Create array [0, 1, 2, ..., 9]
   .filter(x => x % 2 === 0)                          // Filter even numbers
   .map(x => x ** 2);                                 // Square each number
console.log(squares);  // Output: [0, 4, 16, 36, 64]

Regular Loop in Python

squares = []
for x in range(10):
   if x % 2 == 0:
       squares.append(x**2)
print(squares) 

Python’s List Comprehension

squares = [x**2 for x in range(10) if x % 2 == 0]
print(squares)

All three sections of code above generate the same list [0, 4, 16, 36, 64], but Python's list comprehension is the most elegant: the syntax is concise and clearly expresses the intent, while the regular Python loop is more verbose and requires explicit initialization and appending. The JavaScript version is the least readable because it requires chaining Array.from, filter, and map. Neither the Python loop nor the JavaScript version reads as naturally as the Python list comprehension does.

Generators

Generators in Python are a special kind of iterator that allows developers to iterate over a sequence of values without storing them all in memory at once. They are created with the yield keyword. Other programming languages like C++ and Java, though offering similar functionality, don't have a built-in yield keyword in the same simple, integrated way. Here are several key advantages that make Python generators unique:

  • Memory Efficiency: Generators yield one value at a time so that they only compute and hold one item in memory at any given moment. This is in contrast to, say, a list in Python, which stores all items in memory.
  • Lazy Evaluation: Generators enable Python to compute values only as needed. This “lazy” computation results in significant performance improvements when dealing with large or potentially infinite sequences.
  • Simple Syntax: This might be the biggest reason developers choose generators: they can easily convert a regular function into a generator without having to manage state explicitly.

def fibonacci():
   a, b = 0, 1
   while True:
       yield a
       a, b = b, a + b

fib = fibonacci()
for _ in range(100):
   print(next(fib))

The example above shows how to use the yield keyword to create a sequence. In terms of memory usage and runtime, generating 100 Fibonacci numbers with or without a generator shows hardly any difference. But when it comes to 100 million numbers in practice, you are better off using a generator, because a list of 100 million numbers could easily strain system resources.
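If you want to see the memory difference yourself, comparing a materialised list comprehension with the equivalent generator expression gives a rough idea. Exact sizes vary by Python version and platform, and sys.getsizeof does not count the stored integers themselves:

import sys

squares_list = [x * x for x in range(1_000_000)]   # holds a million references
squares_gen = (x * x for x in range(1_000_000))    # holds only iteration state

print(sys.getsizeof(squares_list))  # on the order of megabytes
print(sys.getsizeof(squares_gen))   # on the order of a hundred bytes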

Argument Passing and Mutability

In Python, we don’t really assign values to variables; instead, we bind variables to objects. The result of such an action depends on whether the object is mutable or immutable. If an object is mutable, changes made to it inside the function will affect the original object. 

def modify_list(lst):
   lst.append(4)

my_list = [1, 2, 3]
modify_list(my_list)
print(my_list)  # Output: [1, 2, 3, 4]

In the example above, we append 4 to the list my_list, which is [1, 2, 3]. Because lists are mutable, the append operation changes the original list my_list in place without creating a copy.

However, immutable objects, such as integers, floats, strings, tuples and frozensets, cannot be changed after creation. Therefore, any modification results in a new object. In the example below, because integers are immutable, the function creates a new integer rather than modifying the original variable.

def modify_number(n):
   n += 10
   return n

a = 5
new_a = modify_number(a)
print(a)      # Output: 5
print(new_a)  # Output: 15

Python's argument passing is sometimes described as "pass-by-object-reference" or "pass-by-assignment." This makes Python unusual: it passes references uniformly, while other languages have to differentiate explicitly between pass-by-value and pass-by-reference. Python's uniform approach is simple yet powerful. It avoids the need for explicit pointers or reference parameters, but it requires developers to be mindful of mutable objects.
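A small sketch makes the distinction concrete: rebinding a parameter inside a function has no effect on the caller, while mutating the object it refers to does:

def rebind(lst):
    lst = [0, 0, 0]   # rebinds the local name only

def mutate(lst):
    lst.append(99)    # mutates the shared object

nums = [1, 2, 3]
rebind(nums)
print(nums)  # [1, 2, 3] - unchanged
mutate(nums)
print(nums)  # [1, 2, 3, 99]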

With Python’s argument passing and mutability, we can enjoy the following benefits in coding:

  • Memory Efficiency: It saves memory by passing references instead of making full copies of objects, which especially benefits code that works with large data structures.
  • Performance: It avoids unnecessary copies and thus improves overall performance.
  • Flexibility: It makes updating data structures convenient because developers don't need to explicitly choose between pass-by-value and pass-by-reference.

However, this characteristic forces developers to choose carefully between mutable and immutable data types, and it can also make debugging more complex.

So is Python Really Simple?

Python's popularity results from its simplicity, memory efficiency, good performance, and beginner-friendliness. It is also the programming language that reads most like natural language, so even people who haven't received systematic programming training can still understand it. These characteristics make Python a top choice among enterprises, academic institutes, and government organisations.

For example, suppose we'd like to filter out the "completed" orders with amounts greater than 200 and update a mutable summary report (a dictionary) with the total count and sum of amounts for an e-commerce company. We can use a list comprehension to create a list of orders meeting our criteria, skip the declaration of variable types, and change the original dictionary in place thanks to pass-by-assignment.

import random
import time

def order_stream(num_orders):
   """
   A generator that yields a stream of orders.
   Each order is a dictionary with dynamic types:
     - 'order_id': str
     - 'amount': float
     - 'status': str (randomly chosen among 'completed', 'pending', 'cancelled')
   """
   for i in range(num_orders):
       order = {
           "order_id": f"ORD{i+1}",
           "amount": round(random.uniform(10.0, 500.0), 2),
           "status": random.choice(["completed", "pending", "cancelled"])
       }
       yield order
       time.sleep(0.001)  # simulate delay

def update_summary(report, orders):
   """
   Updates the mutable summary report dictionary in-place.
   For each order in the list, it increments the count and adds the order's amount.
   """
   for order in orders:
       report["count"] += 1
       report["total_amount"] += order["amount"]

# Create a mutable summary report dictionary.
summary_report = {"count": 0, "total_amount": 0.0}

# Use a generator to stream 10,000 orders.
orders_gen = order_stream(10000)

# Use a list comprehension to filter orders that are 'completed' and have amount > 200.
high_value_completed_orders = [order for order in orders_gen
                              if order["status"] == "completed" and order["amount"] > 200]

# Update the summary report using our mutable dictionary.
update_summary(summary_report, high_value_completed_orders)

print("Summary Report for High-Value Completed Orders:")
print(summary_report)

If we'd like to achieve the same goal in Java, since Java lacks built-in generators and list comprehensions, we have to generate a list of orders and then filter it and update a summary using explicit loops, which makes the code more complex, less readable, and harder to maintain.

import java.util.*;
import java.util.concurrent.ThreadLocalRandom;

class Order {
   public String orderId;
   public double amount;
   public String status;
  
   public Order(String orderId, double amount, String status) {
       this.orderId = orderId;
       this.amount = amount;
       this.status = status;
   }
  
   @Override
   public String toString() {
       return String.format("{orderId:%s, amount:%.2f, status:%s}", orderId, amount, status);
   }
}

public class OrderProcessor {
   // Generates a list of orders.
   public static List<Order> generateOrders(int numOrders) {
       List<Order> orders = new ArrayList<>();
       String[] statuses = {"completed", "pending", "cancelled"};
       Random rand = new Random();
       for (int i = 0; i < numOrders; i++) {
           String orderId = "ORD" + (i + 1);
           double amount = Math.round(ThreadLocalRandom.current().nextDouble(10.0, 500.0) * 100.0) / 100.0;
           String status = statuses[rand.nextInt(statuses.length)];
           orders.add(new Order(orderId, amount, status));
       }
       return orders;
   }
  
   // Filters orders based on criteria.
   public static List<Order> filterHighValueCompletedOrders(List<Order> orders) {
       List<Order> filtered = new ArrayList<>();
       for (Order order : orders) {
           if ("completed".equals(order.status) && order.amount > 200) {
               filtered.add(order);
           }
       }
       return filtered;
   }
  
   // Updates a mutable summary Map with the count and total amount.
   public static void updateSummary(Map<String, Object> summary, List<Order> orders) {
       int count = 0;
       double totalAmount = 0.0;
       for (Order order : orders) {
           count++;
           totalAmount += order.amount;
       }
       summary.put("count", count);
       summary.put("total_amount", totalAmount);
   }
  
   public static void main(String[] args) {
       // Generate orders.
       List<Order> orders = generateOrders(10000);
      
       // Filter orders.
       List<Order> highValueCompletedOrders = filterHighValueCompletedOrders(orders);
      
       // Create a mutable summary map.
       Map<String, Object> summaryReport = new HashMap<>();
       summaryReport.put("count", 0);
       summaryReport.put("total_amount", 0.0);
      
       // Update the summary report.
       updateSummary(summaryReport, highValueCompletedOrders);
      
       System.out.println("Summary Report for High-Value Completed Orders:");
       System.out.println(summaryReport);
   }
}

Conclusion

Equipped with dynamic typing, list comprehensions, generators, and its approach to argument passing and mutability, Python keeps coding simple while maintaining memory efficiency and performance. As a result, Python has become an ideal programming language for self-learners.

Thank you for reading!

The Next AI Revolution: A Tutorial Using VAEs to Generate High-Quality Synthetic Data

Leverage the BasicVAE architecture to generate synthetic data and improve classification accuracy on an imbalanced dataset

What is synthetic data?

Data created by a computer intended to replicate or augment existing data.

Why is it useful?

We have all experienced the success of ChatGPT, Llama, and more recently, DeepSeek. These language models are being used ubiquitously across society and have triggered many claims that we are rapidly approaching Artificial General Intelligence — AI capable of replicating any human function. 

Before getting too excited, or scared, depending on your perspective — we are also rapidly approaching a hurdle to the advancement of these language models. According to a paper published by a group from the research institute Epoch [1], we are running out of data. They estimate that by 2028 we will have reached the upper limit of data available to train language models.

Image by Author. Graph based on estimated dataset projections. This is a reconstructed visualisation inspired by Epoch research group [1].

What happens if we run out of data?

Well, if we run out of data then we aren’t going to have anything new with which to train our language models. These models will then stop improving. If we want to pursue Artificial General Intelligence then we are going to have to come up with new ways of improving AI without just increasing the volume of real-world training data. 

One potential saviour is synthetic data which can be generated to mimic existing data and has already been used to improve the performance of models like Gemini and DBRX. 

Synthetic data beyond LLMs

Beyond overcoming data scarcity for large language models, synthetic data can be used in the following situations: 

  • Sensitive Data — if we don’t want to share or use sensitive attributes, synthetic data can be generated which mimics the properties of these features while maintaining anonymity.
  • Expensive data — if collecting data is expensive we can generate a large volume of synthetic data from a small amount of real-world data.
  • Lack of data — datasets are biased when there is a disproportionately low number of individual data points from a particular group. Synthetic data can be used to balance a dataset. 

Imbalanced datasets

Imbalanced datasets can (*but not always*) be problematic as they may not contain enough information to effectively train a predictive model. For example, if a dataset contains many more men than women, our model may be biased towards recognising men and misclassify future female samples as men. 

In this article we show the imbalance in the popular UCI Adult dataset [2], and how we can use a variational auto-encoder to generate Synthetic Data to improve classification on this example. 

We first download the Adult dataset. This dataset contains features such as age, education and occupation which can be used to predict the target outcome ‘income’. 

import pandas as pd
import matplotlib.pyplot as plt

# Download dataset into a dataframe
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
   "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
   "occupation", "relationship", "race", "sex", "capital-gain",
   "capital-loss", "hours-per-week", "native-country", "income"
]
data = pd.read_csv(url, header=None, names=columns, na_values=" ?", skipinitialspace=True)

# Drop rows with missing values
data = data.dropna()

# Split into features and target
X = data.drop(columns=["income"])
y = data['income'].map({'>50K': 1, '<=50K': 0}).values

# Plot distribution of income
plt.figure(figsize=(8, 6))
plt.hist(data['income'], bins=2, edgecolor='black')
plt.title('Distribution of Income')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()

In the Adult dataset, income is a binary variable, representing individuals who earn above, and below, $50,000. We plot the distribution of income over the entire dataset below. We can see that the dataset is heavily imbalanced with a far larger number of individuals who earn less than $50,000. 

Image by Author. Original dataset: Number of data instances with the label ≤50k and >50k. There is a disproportionately larger representation of individuals who earn less than 50k in the dataset.

Despite this imbalance we can still train a machine learning classifier on the Adult dataset which we can use to determine whether unseen, or test, individuals should be classified as earning above, or below, 50k. 

import numpy as np
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Preprocessing: One-hot encode categorical features, scale numerical features
numerical_features = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
categorical_features = [
   "workclass", "education", "marital-status", "occupation", "relationship",
   "race", "sex", "native-country"
]

preprocessor = ColumnTransformer(
   transformers=[
       ("num", StandardScaler(), numerical_features),
       ("cat", OneHotEncoder(), categorical_features)
   ]
)

X_processed = preprocessor.fit_transform(X)

# Convert to numpy array for PyTorch compatibility
X_processed = X_processed.toarray().astype(np.float32)
y_processed = y.astype(np.float32)
# Split dataset in train and test sets
X_model_train, X_model_test, y_model_train, y_model_test = train_test_split(X_processed, y_processed, test_size=0.2, random_state=42)


rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_model_train, y_model_train)

# Make predictions
y_pred = rf_classifier.predict(X_model_test)

# Compute and display the confusion matrix
cm = confusion_matrix(y_model_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="YlGnBu", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

Printing out the confusion matrix of our classifier shows that our model performs fairly well despite the imbalance. It has an overall error rate of 16%, but the error rate for the positive class (income > 50k) is 36%, while the error rate for the negative class (income < 50k) is 8%.

This discrepancy shows that the model is indeed biased towards the negative class. The model is frequently incorrectly classifying individuals who earn more than 50k as earning less than 50k. 

Below we show how we can use a Variational Autoencoder to generate synthetic data of the positive class to balance this dataset. We then train the same model using the synthetically balanced dataset and reduce model errors on the test set. 

Image by Author. Confusion matrix for predictive model on original dataset.

How can we generate synthetic data?

There are many different methods for generating synthetic data. These include more traditional methods such as SMOTE and Gaussian noise, which generate new data by modifying existing data. Alternatively, generative models such as Variational Autoencoders or Generative Adversarial Networks are well suited to generating new data, as their architectures learn the distribution of the real data and use it to produce synthetic samples.
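For comparison, a traditional oversampling approach such as SMOTE takes only a couple of lines. This sketch assumes the third-party imbalanced-learn package is installed and reuses the preprocessed training arrays from above; note that SMOTE will interpolate the one-hot encoded columns as if they were continuous features:

from imblearn.over_sampling import SMOTE

# Oversample the minority (income > 50k) class until the classes are balanced.
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_model_train, y_model_train)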

In this tutorial we use a variational autoencoder to generate synthetic data.

Variational Autoencoders

Variational Autoencoders (VAEs) are great for synthetic data generation because they use real data to learn a continuous latent space. We can view this latent space as a magic bucket from which we can sample synthetic data which closely resembles existing data. The continuity of this space is one of their big selling points as it means the model generalises well and doesn’t just memorise the latent space of specific inputs.

A VAE consists of an encoder, which maps input data into a probability distribution (mean and variance) and a decoder, which reconstructs the data from the latent space. 

For that continuous latent space, VAEs use a reparameterization trick, where a random noise vector is scaled and shifted using the learned mean and variance, ensuring smooth and continuous representations in the latent space.
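Written out in standard VAE notation (not anything specific to the code below), the trick draws

z = \mu + \sigma \odot \epsilon, \qquad \sigma = \exp\!\left(\tfrac{1}{2}\log\sigma^{2}\right), \qquad \epsilon \sim \mathcal{N}(0, I)

so that gradients can flow through \mu and \log\sigma^{2} while all randomness is isolated in \epsilon.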

Below we construct a BasicVAE class which implements this process with a simple architecture.

  •  The encoder compresses the input into a smaller, hidden representation, producing both a mean and log variance that define a Gaussian distribution aka creating our magic sampling bucket. Instead of directly sampling, the model applies the reparameterization trick to generate latent variables, which are then passed to the decoder. 
  • The decoder reconstructs the original data from these latent variables, ensuring the generated data maintains characteristics of the original dataset. 

import torch
import torch.nn as nn


class BasicVAE(nn.Module):
   def __init__(self, input_dim, latent_dim):
       super(BasicVAE, self).__init__()
       # Encoder: Single small layer
       self.encoder = nn.Sequential(
           nn.Linear(input_dim, 8),
           nn.ReLU()
       )
       self.fc_mu = nn.Linear(8, latent_dim)
       self.fc_logvar = nn.Linear(8, latent_dim)
      
       # Decoder: Single small layer
       self.decoder = nn.Sequential(
           nn.Linear(latent_dim, 8),
           nn.ReLU(),
           nn.Linear(8, input_dim),
           nn.Sigmoid()  # Outputs values in range [0, 1]
       )

   def encode(self, x):
       h = self.encoder(x)
       mu = self.fc_mu(h)
       logvar = self.fc_logvar(h)
       return mu, logvar

   def reparameterize(self, mu, logvar):
       std = torch.exp(0.5 * logvar)
       eps = torch.randn_like(std)
       return mu + eps * std

   def decode(self, z):
       return self.decoder(z)

   def forward(self, x):
       mu, logvar = self.encode(x)
       z = self.reparameterize(mu, logvar)
       return self.decode(z), mu, logvar

Given our BasicVAE architecture we construct our loss functions and model training below. 

import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset


def vae_loss(recon_x, x, mu, logvar, tau=0.5, c=1.0):
   recon_loss = nn.MSELoss()(recon_x, x)
 
   # KL Divergence Loss
   kld_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
   return recon_loss + kld_loss / x.size(0)

def train_vae(model, data_loader, epochs, learning_rate):
   optimizer = optim.Adam(model.parameters(), lr=learning_rate)
   model.train()
   losses = []
   reconstruction_mse = []

   for epoch in range(epochs):
       total_loss = 0
       total_mse = 0
       for batch in data_loader:
           batch_data = batch[0]
           optimizer.zero_grad()
           reconstructed, mu, logvar = model(batch_data)
           loss = vae_loss(reconstructed, batch_data, mu, logvar)
           loss.backward()
           optimizer.step()
           total_loss += loss.item()

           # Compute batch-wise MSE for comparison
           mse = nn.MSELoss()(reconstructed, batch_data).item()
           total_mse += mse

       losses.append(total_loss / len(data_loader))
       reconstruction_mse.append(total_mse / len(data_loader))
       print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss:.4f}, MSE: {total_mse:.4f}")
   return losses, reconstruction_mse

combined_data = np.concatenate([X_model_train.copy(), y_model_train.copy().reshape(-1, 1)], axis=1)  # reshape(-1, 1) avoids hard-coding the number of training rows

# Train-test split
X_train, X_test = train_test_split(combined_data, test_size=0.2, random_state=42)

batch_size = 128

# Create DataLoaders
train_loader = DataLoader(TensorDataset(torch.tensor(X_train)), batch_size=batch_size, shuffle=True)
test_loader = DataLoader(TensorDataset(torch.tensor(X_test)), batch_size=batch_size, shuffle=False)

basic_vae = BasicVAE(input_dim=X_train.shape[1], latent_dim=8)

basic_losses, basic_mse = train_vae(
   basic_vae, train_loader, epochs=50, learning_rate=0.001,
)

# Visualize results
plt.figure(figsize=(12, 6))
plt.plot(basic_mse, label="Basic VAE")
plt.ylabel("Reconstruction MSE")
plt.title("Training Reconstruction MSE")
plt.legend()
plt.show()

vae_loss consists of two components: reconstruction loss, which measures how well the generated data matches the original input using Mean Squared Error (MSE), and KL divergence loss, which ensures that the learned latent space follows a normal distribution.
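For a diagonal Gaussian posterior measured against a standard normal prior, the KL term used in the code above is the usual closed form

D_{\mathrm{KL}} = -\tfrac{1}{2}\sum_{j}\left(1 + \log\sigma_{j}^{2} - \mu_{j}^{2} - \sigma_{j}^{2}\right)

which is exactly what -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) computes.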

train_vae optimises the VAE using the Adam optimizer over multiple epochs. During training, the model takes mini-batches of data, reconstructs them, and computes the loss using vae_loss. These errors are then corrected via backpropagation where the model weights are updated. We train the model for 50 epochs and plot how the reconstruction mean squared error decreases over training.

We can see that our model learns quickly how to reconstruct our data, evidencing efficient learning. 

Image by Author. Reconstruction MSE of BasicVAE on the Adult dataset.

Now that we have trained our BasicVAE to accurately reconstruct the Adult dataset, we can use it to generate synthetic data. We want to generate more samples of the positive class (individuals who earn over 50k) in order to balance the classes and remove the bias from our model.

To do this we select all the samples from our VAE dataset where income is the positive class (earn more than 50k). We then encode these samples into the latent space. As we have only selected samples of the positive class to encode, this latent space will reflect properties of the positive class which we can sample from to create synthetic data. 

We sample 15000 new samples from this latent space and decode these latent vectors back into the input data space as our synthetic data points. 

# Assumption: sample_df is the VAE training data (features plus the income label) as a DataFrame;
# it was not defined in the original snippet, so we reconstruct it here from X_train.
sample_df = pd.DataFrame(X_train)

# Create column names
col_number = sample_df.shape[1]
col_names = [str(i) for i in range(col_number)]
sample_df.columns = col_names

# Define the feature value to filter
feature_value = 1.0  # Specify the feature value - here we set the income to 1

# Select the samples whose income label equals 1 (earning over 50k)
selected_samples = sample_df[sample_df[col_names[-1]] == feature_value]
selected_samples = selected_samples.values
selected_samples_tensor = torch.tensor(selected_samples, dtype=torch.float32)

basic_vae.eval()  # Set model to evaluation mode
with torch.no_grad():
   mu, logvar = basic_vae.encode(selected_samples_tensor)
   latent_vectors = basic_vae.reparameterize(mu, logvar)

# Compute the mean latent vector for this feature
mean_latent_vector = latent_vectors.mean(dim=0)


num_samples = 15000  # Number of new samples
latent_dim = 8
latent_samples = mean_latent_vector + 0.1 * torch.randn(num_samples, latent_dim)

with torch.no_grad():
   generated_samples = basic_vae.decode(latent_samples)

Now that we have generated synthetic data for the positive class, we can combine it with the original training data to produce a balanced synthetic dataset.

new_data = pd.DataFrame(generated_samples)

# Create column names
col_number = new_data.shape[1]
col_names = [str(i) for i in range(col_number)]
new_data.columns = col_names

X_synthetic = new_data.drop(col_names[-1],axis=1)
y_synthetic = np.asarray([1 for _ in range(0,X_synthetic.shape[0])])

X_synthetic_train = np.concatenate([X_model_train, X_synthetic.values], axis=0)
y_synthetic_train = np.concatenate([y_model_train, y_synthetic], axis=0)

mapping = {1: '>50K', 0: '<=50K'}
map_function = np.vectorize(lambda x: mapping[x])
# Apply mapping
y_mapped = map_function(y_synthetic_train)

plt.figure(figsize=(8, 6))
plt.hist(y_mapped, bins=2, edgecolor='black')
plt.title('Distribution of Income')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()

Image by Author. Synthetic dataset: Number of data instances with the label ≤50k and >50k. The number of individuals earning more and less than 50k is now balanced.

We can now use our balanced training synthetic dataset to retrain our random forest classifier. We can then evaluate this new model on the original test data to see how effective our synthetic data is at reducing the model bias.

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_synthetic_train, y_synthetic_train)

# Make predictions
y_pred = rf_classifier.predict(X_model_test)

cm = confusion_matrix(y_model_test, y_pred)

# Create heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="YlGnBu", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

Our new classifier, trained on the balanced synthetic dataset, makes fewer errors on the original test set than our original classifier trained on the imbalanced dataset, and the overall error rate is now reduced to 14%.

Image by Author. Confusion matrix for predictive model on synthetic dataset.

However, we have not been able to reduce the discrepancy in errors by a significant amount; our error rate for the positive class is still 36%. This could be due to the following reasons:

  • We have discussed how one of the benefits of VAEs is the learning of a continuous latent space. However, if the majority class dominates, the latent space might skew towards the majority class.
  • The model may not have properly learned a distinct representation for the minority class due to the lack of data, making it hard to sample from that region accurately.

In this tutorial we have introduced and built a BasicVAE architecture that can be used to generate synthetic data and improve classification accuracy on an imbalanced dataset.

Follow for future articles where I will show how we can build more sophisticated VAE architectures which address the above problems with imbalanced sampling and more.

[1] Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., & Hobbhahn, M. (2024). Will we run out of data? Limits of LLM scaling based on human-generated data. arXiv preprint arXiv:2211.04325, 3.

[2] Becker, B. & Kohavi, R. (1996). Adult [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.

How I Became A Machine Learning Engineer (No CS Degree, No Bootcamp)

Machine learning and AI are among the most popular topics nowadays, especially within the tech space. I am fortunate enough to work and develop with these technologies every day as a machine learning engineer!

In this article, I will walk you through my journey to becoming a machine learning engineer, shedding some light and advice on how you can become one yourself!

My Background

In one of my previous articles, I extensively wrote about my journey from school to securing my first Data Science job. I recommend you check out that article, but I will summarise the key timeline here.

Pretty much everyone in my family studied some sort of STEM subject. My great-grandad was an engineer, both my grandparents studied physics, and my mum is a maths teacher.

So, my path was always paved for me.

Me at age 11

I chose to study physics at university after watching The Big Bang Theory at age 12; it’s fair to say everyone was very proud!

At school, I wasn’t dumb by any means. I was actually relatively bright, but I didn’t fully apply myself. I got decent grades, but definitely not what I was fully capable of.

I was very arrogant and thought I would do well with zero work.

I applied to top universities like Oxford and Imperial College, but given my work ethic, I was delusional thinking I had a chance. On results day, I ended up in clearing as I missed my offers. This was probably one of the saddest days of my life.

Clearing in the UK is where universities offer places to students on certain courses where they have space. It’s mainly for students who don’t have a university offer.

I was lucky enough to be offered a chance to study physics at the University of Surrey, and I went on to earn a first-class master’s degree in physics!

There is genuinely no substitute for hard work. It is a cringy cliche, but it is true!

My original plan was to do a PhD and be a full-time researcher or professor, but during my degree, I did a research year, and I just felt a career in research was not for me. Everything moved so slowly, and it didn’t seem there was much opportunity in the space.

During this time, DeepMind released their AlphaGo — The Movie documentary on YouTube, which popped up on my home feed.

From the video, I started to understand how AI worked and learn about neural networks, reinforcement learning, and deep learning. To be honest, to this day I am still not an expert in these areas.

Naturally, I dug deeper and found that a data scientist uses AI and machine learning algorithms to solve problems. I immediately wanted in and started applying for data science graduate roles.

I spent countless hours coding, taking courses, and working on projects. I applied to 300+ jobs and eventually landed my first data science graduate scheme in September 2021.

You can hear more about my journey from a podcast.

Data Science Journey

I started my career in an insurance company, where I built various supervised learning models, mainly using gradient boosted tree packages like CatBoost, XGBoost, and generalised linear models (GLMs).

I built models to predict:

  • Fraud — Did someone fraudulently make a claim to profit.
  • Risk Prices — What’s the premium we should give someone.
  • Number of Claims — How many claims will someone have.
  • Average Cost of Claim — What’s the average claim value someone will have.

I made around six models spanning the regression and classification space. I learned so much here, especially in statistics, as I worked very closely with Actuaries, so my maths knowledge was excellent.

However, due to the company’s structure and setup, it was difficult for my models to advance past the PoC stage, so I felt I lacked the “tech” side of my toolkit and understanding of how companies use machine learning in production.

After a year, my previous employer reached out to me asking if I wanted to apply to a junior data scientist role that specialises in time series forecasting and optimisation problems. I really liked the company, and after a few interviews, I was offered the job!

I worked at this company for about 2.5 years, where I became an expert in forecasting and combinatorial optimisation problems.

I developed many algorithms and deployed my models to production through AWS using software engineering best practices, such as unit testing, lower environment, shadow system, CI/CD pipelines, and much more.

Fair to say I learned a lot. 

I worked very closely with software engineers, so I picked up a lot of engineering knowledge and continued self-studying machine learning and statistics on the side.

I even earned a promotion from junior to mid-level in that time!

Transitioning To MLE

Over time, I realised the actual value of data science is using it to make live decisions. There is a good quote by Pau Labarta Bajo:

ML models inside Jupyter notebooks have a business value of $0

There is no point in building a really complex and sophisticated model if it will not produce results. Seeking out that extra 0.1% accuracy by stacking multiple models is often not worth it.

You are better off building something simple that you can deploy, and that will bring real financial benefit to the company.

With this in mind, I started thinking about the future of data science. In my head, there are two avenues:

  • Analytics -> You work primarily to gain insight into what the business should be doing and what it should be looking into to boost its performance.
  • Engineering -> You ship solutions (models, decision algorithms, etc.) that bring business value.

I feel the data scientist who analyses and builds PoC models will become extinct in the next few years because, as we said above, they don’t provide tangible value to a business.

That’s not to say they are entirely useless; you have to think of it from the business perspective of their return on investment. Ideally, the value you bring in should be more than your salary.

You want to say that you did “X that produced Y”, which the above two avenues allow you to do.

The engineering side was the most interesting and enjoyable for me. I genuinely enjoy coding and building stuff that benefits people, and that they can use, so naturally, that’s where I gravitated towards.

To move to the ML engineering side, I asked my line manager if I could deploy the algorithms and ML models I was building myself. I would get help from software engineers, but I would write all the production code, do my own system design, and set up the deployment process independently.

And that’s exactly what I did.

I basically became a Machine Learning Engineer. I was developing my algorithms and then shipping them to production.

I also took NeetCode’s data structures and algorithms course to improve my fundamentals of computer science and started blogging about software engineering concepts.

Coincidentally, my current employer contacted me around this time and asked if I wanted to apply for a machine learning engineer role that specialises in general ML and optimisation at their company!

Call it luck, but clearly, the universe was telling me something. After several interview rounds, I was offered the role, and I am now a fully fledged machine learning engineer!

Fortunately, a role kind of “fell to me,” but I created my own luck through up-skilling and documenting my learning. That is why I always tell people to show their work — you don’t know what may come from it.

My Advice

I want to share the main bits of advice that helped me transition from a data scientist to a machine learning engineer.

  • Experience — A machine learning engineer is not an entry-level position in my opinion. You need to be well-versed in data science, machine learning, software engineering, etc. You don't need to be an expert in all of them, but you should have good fundamentals across the board. That's why I recommend having a couple of years of experience as either a software engineer or a data scientist while self-studying the other areas.
  • Production Code — If you are coming from data science, you must learn to write good, well-tested production code. You must know things like typing, linting, unit tests, formatting, mocking and CI/CD. It's not too difficult, but it does require some practice. I recommend asking your current company to let you work with software engineers to gain this knowledge; it worked for me!
  • Cloud Systems — Most companies nowadays deploy many of their architecture and systems on the cloud, and machine learning models are no exception. So, it’s best to get practice with these tools and understand how they enable models to go live. I learned most of this on the job, to be honest, but there are courses you can take.
  • Command Line — I am sure most of you know this already, but every tech professional should be proficient in the command line. You will use it extensively when deploying and writing production code. I have a basic guide you can checkout here.
  • Data Structures & Algorithms — Understanding the fundamental algorithms in computer science are very useful for MLE roles. Mainly because you will likely be asked about it in interviews. It’s not too hard to learn compared to machine learning; it just takes time. Any course will do the trick.
  • Git & GitHub — Again, most tech professionals should know Git, but as an MLE, it is essential. How to squash commits, do code reviews, and write outstanding pull requests are musts.
  • Specialise — Many MLE roles I saw required you to have some specialisation in a particular area. I specialise in time series forecasting, optimisation, and general ML based on my previous experience. This helps you stand out in the market, and most companies are looking for specialists nowadays.

The main theme here is that I basically up-skilled my software engineering abilities. This makes sense as I already had all the math, stats, and machine learning knowledge from being a data scientist.

If I were a software engineer, the transition would likely be the reverse. This is why securing a machine learning engineer role can be quite challenging, as it requires proficiency across a wide range of skills.

Summary & Further Thoughts

I have a free newsletter, Dishing the Data, where I share weekly tips and advice as a practising data scientist. Plus, when you subscribe, you will get my FREE data science resume and short PDF version of my AI roadmap!

Publish Interactive Data Visualizations for Free with Python and Marimo

Working in data science, it can be hard to share insights from complex datasets using only static figures. All the facets that describe the shape and meaning of interesting data are not always captured in a handful of pre-generated figures. While we have powerful technologies available for presenting interactive figures — where a viewer can rotate, filter, zoom, and generally explore complex data — they always come with tradeoffs.

Here I present my experience using a recently released Python library — marimo — which opens up exciting new opportunities for publishing interactive visualizations across the entire field of data science.

Interactive Data Visualization

The tradeoffs to consider when selecting an approach for presenting data visualizations can be broken into three categories:

  • Capabilities — what visualizations and interactivity am I able to present to the user?
  • Publication Cost — what are the resources needed for displaying this visualization to users (e.g. running servers, hosting websites)?
  • Ease of Use – how much of a new skillset / codebase do I need to learn upfront?

JavaScript is the foundation of portable interactivity. Every user has a web browser installed on their computer and there are many different frameworks available for displaying any degree of interactivity or visualization you might imagine (for example, this gallery of amazing things people have made with three.js). Since the application is running on the user’s computer, no costly servers are needed. However, a significant drawback for the data science community is ease of use, as JS does not have many of the high-level (i.e. easy-to-use) libraries that data scientists use for data manipulation, plotting, and interactivity.

Python provides a useful point of comparison. Because of its continually growing popularity, some have called this the “Era of Python”. For data scientists in particular, Python stands alongside R as one of the foundational languages for quickly and effectively wielding complex data. While Python may be easier to use than Javascript, there are fewer options for presenting interactive visualizations. Some popular projects providing interactivity and visualization have been Flask, Dash, and Streamlit (also worth mentioning — bokeh, HoloViews, altair, and plotly). The biggest tradeoff for using Python has been the cost for publishing – delivering the tool to users. In the same way that shinyapps require a running computer to serve up the visualization, these Python-based frameworks have exclusively been server-based. This is by no means prohibitive for authors with a budget to spend, but it does limit the number of users who can take advantage of a particular project.

Pyodide is an intriguing middle ground — Python code running directly in the web browser using WebAssembly (WASM). There are resource limitations (only 1 thread and 2GB memory) that make this impractical for doing the heavy lifting of data science. However, this can be more than sufficient for building visualizations and updating based on user input. Because it runs in the browser, no servers are required for hosting. Tools that use Pyodide as a foundation are interesting to explore because they give data scientists an opportunity to write Python code which runs directly on users’ computers without their having to install or run anything outside of the web browser.

As an aside, I’ve been interested previously in one project that has tried this approach: stlite, an in-browser implementation of Streamlit that lets you deploy these flexible and powerful apps to a broad range of users. However, a core limitation is that Streamlit itself is distinct from stlite (the port of Streamlit to WASM), which means that not all features are supported and that advancement of the project is dependent on two separate groups working along compatible lines.

Introducing: Marimo

This brings us to Marimo.

The first public announcements of marimo were in January 2024, so the project is very new, and it has a unique combination of features:

  • The interface resembles a Jupyter notebook, which will be familiar to users.
  • Execution of cells is reactive, so that updating one cell will rerun all cells which depend on its output.
  • User input can be captured with a flexible set of UI components.
  • Notebooks can be quickly converted into apps, hiding the code and showing only the input/output elements.
  • Apps can be run locally or converted into static webpages using WASM/Pyodide.

marimo balances the tradeoffs of technology in a way that is well suited to the skill set of the typical data scientists:

  • Capabilities — user input and visual display features are rather extensive, supporting user input via Altair and Plotly plots.
  • Publication Cost — deploying as static webpages is basically free — no servers required
  • Ease of Use — for users familiar with Python notebooks, marimo will feel very familiar and be easy to pick up.

Publishing Marimo Apps on the Web

The best place to start with marimo is by reading their extensive documentation.

As a simple example of the type of display that can be useful in data science, consisting of explanatory text interspersed with interactive displays, I have created a barebones GitHub repository. Try it out yourself here.

Example publication created with marimo (image created by author)

Using just a little bit of code, users can:

  • Attach source datasets
  • Generate visualizations with flexible interactivity
  • Write narrative text describing their findings
  • Publish to the web for free (i.e. using GitHub Pages)

For more details, read their documentation on web publishing and template repository for deploying to GitHub Pages.
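To give a feel for the format, here is a rough, hand-written sketch of what a minimal marimo notebook can look like. It is a plain Python file rather than JSON; the exact cell decorator and mo.ui signatures below are written from memory, so treat them as an approximation and defer to the marimo documentation:

import marimo

app = marimo.App()


@app.cell
def _():
    import marimo as mo
    return (mo,)


@app.cell
def _(mo):
    # A UI element; cells that read its value rerun automatically when it changes.
    points = mo.ui.slider(1, 100, value=10, label="Number of points")
    points
    return (points,)


@app.cell
def _(mo, points):
    mo.md(f"You selected **{points.value}** points.")
    return


if __name__ == "__main__":
    app.run()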

Public App / Private Data

This new technology offers an exciting new opportunity for collaboration — publish the app publicly to the world, but users can only see specific datasets that they have permission to access.

Rather than building a dedicated data backend for every app, user data can be stored in a generic backend which can be securely authenticated and accessed using a Python client library — all contained within the user’s web browser. For example, the user is given an OAuth login link that will authenticate them with the backend and allow the app to temporarily access input data.

As a proof of concept, I built a simple visualization app which connects to the Cirro data platform, which is used at my institution to manage scientific data. Full disclosure: I was part of the team that built this platform before it spun out as an independent company. In this manner users can:

  • Load the public visualization app — hosted on GitHub Pages
  • Connect securely to their private data store
  • Load the appropriate dataset for display
  • Share a link which will direct authorized collaborators to the same data

Try it out yourself here.

Example visualization app sourcing user controlled data (image created by author)

As a data scientist, this approach of publishing free and open-source visualization apps which can be used to interact with private datasets is extremely exciting. Building and publishing a new app can take hours and days instead of weeks and years, letting researchers quickly share their insights with collaborators and then publish them to the wider world.

Building a Data Engineering Center of Excellence

The post Building a Data Engineering Center of Excellence appeared first on Towards Data Science.

As data continues to grow in importance and become more complex, the need for skilled data engineers has never been greater. But what is data engineering, and why is it so important? In this blog post, we will discuss the essential components of a functioning data engineering practice, why data engineering is becoming increasingly critical for businesses today, and how you can build your very own Data Engineering Center of Excellence!

I’ve had the privilege to build, manage, lead, and foster a sizeable high-performing team of data warehouse & ELT engineers for many years. With the help of my team, I have spent a considerable amount of time every year consciously planning and preparing to manage the growth of our data month-over-month and address the changing reporting and analytics needs of our 20,000+ global data consumers. We built many data warehouses to store and centralize massive amounts of data generated from many OLTP sources. We’ve implemented the Kimball methodology by creating star schemas both within our on-premise data warehouses and in the cloud.

The objective is to enable our user base to perform fast analytics and reporting on the data, so our analyst community and business users can make accurate, data-driven decisions.

It took me about three years to transform teams (plural) of data warehouse and ETL programmers into one cohesive Data Engineering team.

I have compiled some of my learnings building a global data engineering team in this post in hopes that Data professionals and leaders of all levels of technical proficiency can benefit.

Evolution of the Data Engineer

It has never been a better time to be a data engineer. Over the last decade, we have seen a massive awakening of enterprises now recognizing their data as the company’s heartbeat, making data engineering the job function that ensures accurate, current, and quality data flow to the solutions that depend on it.

Historically, the role of data engineers has evolved from that of data warehouse developers and ETL/ELT (extract, transform, and load) developers.

The data warehouse developers are responsible for designing, building, developing, administering, and maintaining data warehouses to meet an enterprise’s reporting needs. This is done primarily by extracting data from operational and transactional systems and piping it, using extract-transform-load (ETL/ELT) methodology, to a storage layer like a data warehouse or a data lake. The data warehouse or the data lake is where data analysts, data scientists, and business users consume data. The developers also perform transformations to conform the ingested data to a data model with aggregated data for easy analysis.

A data engineer’s prime responsibility is to produce and make data securely available for multiple consumers.

Data engineers oversee the ingestion, transformation, modeling, delivery, and movement of data through every part of an organization. Data extraction happens from many different data sources & applications. Data engineers load the data into data warehouses and data lakes, which are transformed not just for data science & predictive analytics initiatives (as everyone likes to talk about) but primarily for data analysts. Data analysts & data scientists perform operational reporting, exploratory analytics, and service-level agreement (SLA) based business intelligence reports and dashboards on the catered data. In this post, we will address all of these job functions.

The role of a data engineer is to acquire, store, and aggregate data from both cloud and on-premise, new and existing systems, with data modeling and feasible data architecture. Without data engineers, analysts and data scientists won’t have valuable data to work with; hence, data engineers are the first to be hired at the inception of every new data team. Based on the data and analytics tools available within an enterprise, data engineering teams’ role profiles, constructs, and approaches have several options for what should be included in their responsibilities, which we will discuss in this post.

Data Engineering team

Software is increasingly automating the historically manual and tedious tasks of data engineers. Data processing tools and technologies have evolved massively over several years and will continue to grow. For example, cloud-based data warehouses (Snowflake, for instance) have made data storage and processing affordable and fast. Data pipeline services (like Informatica IICS, Apache Airflow, Matillion, and Fivetran) have turned data extraction into work that can be completed quickly and efficiently. The data engineering team should be leveraging such technologies as force multipliers, taking a consistent and cohesive approach to integration and management of enterprise data, not just relying on legacy siloed approaches to building custom data pipelines with fragile, non-performant, hard-to-maintain code. Continuing with the latter approach will stifle the pace of innovation within the enterprise and force the future focus to be on managing data infrastructure issues rather than on helping generate value for the business.

The primary role of an enterprise Data Engineering team should be to transform raw data into a shape that’s ready for analysis — laying the foundation for real-world analytics and data science application.

The Data Engineering team should serve as the librarian for enterprise-level data with the responsibility to curate the organization’s data and act as a resource for those who want to make use of it, such as Reporting & Analytics teams, Data Science teams, and other groups that are doing more self-service or business group driven analytics leveraging the enterprise data platform. This team should serve as the steward of organizational knowledge, managing and refining the catalog so that analysis can be done more effectively. Let’s look at the essential responsibilities of a well-functioning Data Engineering team.

Responsibilities of a Data Engineering Team

The Data Engineering team should provide a shared capability within the enterprise that cuts across to support both the Reporting/Analytics and Data Science capabilities to provide access to clean, transformed, formatted, scalable, and secure data ready for analysis. The Data Engineering teams’ core responsibilities should include:

  • Build, manage, and optimize the core data platform infrastructure
  • Build and maintain custom and off-the-shelf data integrations and ingestion pipelines from a variety of structured and unstructured sources
  • Manage overall data pipeline orchestration
  • Manage transformation of data either before or after load of raw data through both technical processes and business logic
  • Support analytics teams with design and performance optimizations of data warehouses

Data is an Enterprise Asset.

Data as an Asset should be shared and protected.

Data should be valued as an enterprise asset, leveraged across all business units to enhance the company’s value to its respective customer base by accelerating decision making and improving competitive advantage with the help of data. Good data stewardship and legal and regulatory requirements dictate that we protect the data we own from unauthorized access and disclosure.

In other words, managing Security is a crucial responsibility.

Why Create a Centralized Data Engineering Team?

Treating Data Engineering as a standard and core capability that underpins both the Analytics and Data Science capabilities will help an enterprise evolve how it approaches Data and Analytics. The enterprise needs to stop treating data vertically based on the technology stack involved, as we so often see, and move to more of a horizontal approach: managing a data fabric or mesh layer that cuts across the organization and can connect to various technologies as needed to drive analytic initiatives. This is a new way of thinking and working, but it can drive efficiency as the various data organizations look to scale. Additionally, there is value in creating a dedicated structure and career path for data engineering resources. Data engineering skill sets are in high demand in the market; therefore, hiring outside the company can be costly. Companies must enable programmers, database administrators, and software developers with a career path to gain the needed experience with the above-defined skill sets by working across technologies. Usually, forming a data engineering center of excellence or a capability center would be the first step toward making such progression possible.

Challenges for creating a centralized Data Engineering Team

The centralization of the Data Engineering team as a service approach is different from how Reporting & Analytics and Data Science teams operate. It does, in principle, mean giving up some level of control of resources and establishing new processes for how these teams will collaborate and work together to deliver initiatives.

The Data Engineering team will need to demonstrate that it can effectively support the needs of both Reporting & Analytics and Data Science teams, no matter how large these teams are. Data Engineering teams must effectively prioritize workloads while ensuring they can bring the right skillsets and experience to assigned projects.

Data engineering is essential because it serves as the backbone of data-driven companies. It enables analysts to work with clean and well-organized data, necessary for deriving insights and making sound decisions. To build a functioning data engineering practice, you need the following critical components:

Data Engineering Center of Excellence

The Data Engineering team should be a core capability within the enterprise, but it should effectively serve as a support function involved in almost everything data-related. It should interact with the Reporting and Analytics and Data Science teams in a collaborative support role to make the entire team successful.

The Data Engineering team doesn’t create direct business value — but the value should come in making the Reporting and Analytics and Data Science teams more productive and efficient to ensure delivery of maximum value to business stakeholders through Data & Analytics initiatives. To make that possible, the six key responsibilities within the data engineering capability center would be as follows:

Data Engineering Center of Excellence — Image by Author.

Let’s review the 6 pillars of responsibilities:

1. Determine Central Data Location for Collation and Wrangling

Understanding and having a strategy for a data lake (a centralized data repository or data warehouse for the mass consumption of data for analysis). Defining requisite data tables and where they will be joined in the context of data engineering, and subsequently converting raw data into digestible and valuable formats.

2. Data Ingestion and Transformation

Moving data from one or more sources to a new destination (your data lake or cloud data warehouse) where it can be stored and further analyzed and then converting data from the format of the source system to that of the destination

3. ETL/ELT Operations

Extracting, transforming, and loading data from one or more sources into a destination system to represent the data in a new context or style.
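
To make this concrete, here is a minimal, hypothetical sketch of an ETL job in Python with pandas; the file names, column names, and business rule are invented for illustration and do not reflect any particular production pipeline.

import pandas as pd

def run_etl(source_csv: str, destination_parquet: str) -> None:
    # Extract: pull raw rows from an operational export (hypothetical file)
    raw = pd.read_csv(source_csv)

    # Transform: conform the data to the destination's context
    transformed = (
        raw.rename(columns=str.lower)
           .assign(order_date=lambda df: pd.to_datetime(df["order_date"]))
           .query("status == 'completed'")  # hypothetical business rule
    )

    # Load: write to the analytics storage layer
    transformed.to_parquet(destination_parquet, index=False)

# run_etl("orders_export.csv", "warehouse/orders.parquet")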

4. Data Modeling

Data modeling is an essential function of a data engineering team, granted not all data engineers excel with this capability. Formalizing relationships between data objects and business rules into a conceptual representation through understanding information system workflows, modeling required queries, designing tables, determining primary keys, and effectively utilizing data to create informed output.

In technical interviews, I’ve seen engineers mess this up more often than the coding questions. It’s essential to understand the differences between dimension, fact, and aggregate tables (see the sketch below).
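
As a toy illustration only (the tables and columns are invented), a star schema pairs a fact table with dimension tables, and an aggregate table rolls the facts up to a coarser grain:

import pandas as pd

# Dimension tables: descriptive attributes keyed by their own identifiers
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Widget", "Gadget"],
    "category": ["Hardware", "Hardware"],
})
dim_date = pd.DataFrame({
    "date_id": [20240101, 20240102],
    "month": ["2024-01", "2024-01"],
})

# Fact table: measures at the finest grain, with foreign keys to the dimensions
fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2],
    "date_id": [20240101, 20240102, 20240102],
    "quantity": [3, 5, 2],
    "revenue": [30.0, 50.0, 40.0],
})

# Aggregate table: facts rolled up to category x month so reports don't scan the full fact table
agg_sales = (
    fact_sales
    .merge(dim_product, on="product_id")
    .merge(dim_date, on="date_id")
    .groupby(["category", "month"], as_index=False)[["quantity", "revenue"]]
    .sum()
)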

5. Security and Access

Ensuring that sensitive data is protected and implementing proper authentication and authorization to reduce the risk of a data breach

6. Architecture and Administration

Defining the models, policies, and standards that administer what data is collected, where and how it is stored, and how such data is integrated into various analytical systems.

The six pillars of responsibilities for data engineering capabilities center on the ability to determine a central data location for collation and wrangling, ingest and transform data, execute ETL/ELT operations, model data, secure access and administer an architecture. While all companies have their own specific needs with regards to these functions, it is important to ensure that your team has the necessary skillset in order to build a foundation for big data success.

Besides the Data Engineering following are the other capability centers that need to be considered within an enterprise:

Analytics Capability Center

The analytics capability center enables consistent, effective, and efficient BI, analytics, and advanced analytics capabilities across the company. It assists business functions in triaging, prioritizing, and achieving their objectives and goals through reporting, analytics, and dashboard solutions, while providing operational reports and visualizations, self-service analytics, and the tools required to automate the generation of such insights.

Data Science Capability Center

The data science capability center explores cutting-edge technologies and concepts to unlock new insights and opportunities, better inform employees, and create a culture of prescriptive information usage using automated AI and automated ML solutions such as H2O.ai, Dataiku, Aible, DataRobot, and C3.ai.

Data Governance

The data governance office empowers users with trusted, understood, and timely data to drive effectiveness while keeping the integrity and sanctity of data in the right hands for mass consumption.


As your company grows, you will want to make sure that the data engineering capabilities are in place to support the six pillars of responsibilities. By doing this, you will be able to ensure that all aspects of data management and analysis are covered and that your data is safe and accessible by those who need it. Have you started thinking about how your company will grow? What steps have you taken to put a centralized data engineering team in place?

Thank you for reading!

The post Building a Data Engineering Center of Excellence appeared first on Towards Data Science.

Learnings from a Machine Learning Engineer — Part 1: The Data https://towardsdatascience.com/learnings-from-a-machine-learning-engineer-part-1-the-data/ Thu, 13 Feb 2025 20:55:53 +0000 https://towardsdatascience.com/?p=597818 It is said that in order for a machine learning model to be successful, you need to have good data. While this is true (and pretty much obvious), it is extremely difficult to define, build, and sustain good data. Let me share with you the unique processes that I have learned over several years building […]

The post Learnings from a Machine Learning Engineer — Part 1: The Data appeared first on Towards Data Science.

It is said that in order for a machine learning model to be successful, you need to have good data. While this is true (and pretty much obvious), it is extremely difficult to define, build, and sustain good data. Let me share with you the unique processes that I have learned over several years building an ever-growing image classification system and how you can apply these techniques to your own application.

With persistence and diligence, you can avoid the classic “garbage in, garbage out”, maximize your model accuracy, and demonstrate real business value.

In this series of articles, I will dive into the care and feeding of a multi-class, single-label image classification app and what it takes to reach the highest level of performance. I won’t get into any coding or specific user interfaces, just the main concepts that you can incorporate to suit your needs with the tools at your disposal.

Here is a brief description of the articles. You will notice that the model is last on the list since we need to focus on curating the data first and foremost:

Background

Over the past six years, I have been primarily focused on building and maintaining an image classification application for a manufacturing company. Back when I started, most of the software did not exist or was too expensive, so I created these from scratch. In this time, I have deployed two identifier applications, the largest handles 1,500 classes and achieves 97–98% accuracy.

It was about eight years ago that I started online studies for Data Science and machine learning. So, when the exciting opportunity to create an AI application presented itself, I was prepared to build the tools I needed to leverage the latest advancements. I jumped in with both feet!

I quickly found that building and deploying a model is probably the easiest part of the job. Feeding high quality data into the model is the best way to improve performance, and that requires focus and patience. Attention to detail is what I do best, so this was a perfect fit.

It all starts with the data

I feel that so much attention is given to model selection (deciding which neural network is best) and that the data is just an afterthought. I have learned the hard way that even one or two pieces of bad data can significantly impact model performance, so that is where we need to focus.

For example, let’s say you train the classic cat versus dog image classifier. You have 50 pictures of cats and 50 pictures of dogs, however one of the “cats” is clearly (objectively) a picture of a dog. The computer doesn’t have the luxury of ignoring the mislabelled image, and instead adjusts the model weights to make it fit. Square peg meets round hole.

Another example would be a picture of a cat that climbed up into a tree. But when you take a holistic view of it, you would describe it as a picture of a tree (first) with a cat (second). Again, the computer doesn’t know to ignore the big tree and focus on the cat — it will start to identify trees as cats, even when the tree actually contains a dog. You can think of these pictures as outliers that should be removed.

It doesn’t matter if you have the best neural network in the world, you can count on the model making poor predictions when it is trained on “bad” data. I’ve learned that any time I see the model make mistakes, it’s time to review the data.

Example Application — Zoo animals

For the rest of this write-up, I will use an example of identifying zoo animals. Let’s assume your goal is to create a mobile app where guests at the zoo can take pictures of the animals they see and have the app identify them. Specifically, this is a multi-class, single-label application.

Here is your challenge:

  • Variety — There are a lot of different animals at the zoo and many of them look very similar.
  • Quality — Guests using the app don’t always take good pictures (zoomed out, blurry, too dark), so we don’t want to provide an answer if the image is poor.
  • Growth — The zoo keeps expanding and adding new species all the time.
  • Out-of-scope — Occasionally you might find that people take pictures of the sparrows near the food court grabbing some dropped popcorn.
  • Pranksters — Just for fun, guests may take a picture of the bag of popcorn just to see what it comes back with.

These are all real challenges — being able to tell the subtle differences between animals, handling out-of-scope cases, and just plain poor images.

Before we get there, let’s start from the beginning.

Collecting and Labelling

There are a lot of tools these days to help you with this part of the process, but the challenge remains the same — collecting, labelling, and curating the data.

Having data to collect is challenge #1. Without images, you have nothing to train. You may need to get creative on sourcing the data, or even creating synthetic data. More on that later.

A quick note about image pre-processing. I convert all my images to the input size of my neural network and save them as PNG. Inside this square PNG, I preserve the aspect ratio of the original picture and fill the background black. I don’t stretch the image nor crop any features out. This also helps center the subject.
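
A minimal sketch of that kind of preprocessing with Pillow (the target size of 224 is just an example; use whatever input size your network expects):

from PIL import Image

def letterbox_to_square(path: str, size: int = 224) -> Image.Image:
    """Fit an image inside a size x size square, preserving aspect ratio,
    centering the subject, and filling the remaining background with black."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((size, size))  # scales in place, keeps the aspect ratio
    canvas = Image.new("RGB", (size, size), (0, 0, 0))
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    return canvas

# letterbox_to_square("zebra.jpg").save("zebra.png")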

Challenge #2 is to establish standards for data quality…and ensure that these standards are followed! These standards will guide you toward that “good” data. And this assumes, of course, correct labels. Having both is much easier said than done!

I hope to show how “good” and “correct” actually go hand-in-hand, and how important it is to apply these standards to every image.

Good Data

First, I want to point out that the image data discussed here is for the training set. What qualifies as a good image for training is a bit different than what qualifies as a good image for evaluation. More on that in Part 3.

So, what is “good” data when talking about images? “A picture is worth a thousand words”, and if the first words you use to describe the picture do not include the subject you are trying to label, then it is not good and you need to remove it from your training set.

For example, let’s say you are shown a picture of a zebra and (removing bias toward your application) you describe it as an “open field with a zebra in the distance”. In other words, if “open field” is the first thing you notice, then you likely do not want to use that image. The opposite is also true — if the picture is way too close, you would describe it as “zebra pattern”.

Photo by Meg von Haartman on Unsplash
Photo by Jason Dent on Unsplash
Photo by Martin Olsen on Unsplash

What you want is a description like, “a zebra, front and center”. This would have your subject taking up about 80–90% of the total frame. Sometimes I will take the time to crop the original image so the subject is framed properly.

Keep in mind the use of image augmentation at the time of training. Having that buffer around the edges will allow “zoom in” augmentation, while “zoom out” augmentation will simulate smaller subjects. So don’t start out with your subject at less than 50% of the total frame, since you lose detail.
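
For example, if you happen to train with PyTorch (the author does not say which framework is used, so this is just one possible setup), zoom in/out augmentation can be expressed with torchvision transforms:

from torchvision import transforms

# scale < 1 simulates a smaller, more distant subject ("zoom out");
# scale > 1 crops in closer on the subject ("zoom in")
train_transforms = transforms.Compose([
    transforms.RandomAffine(degrees=0, scale=(0.8, 1.2)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])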

Another aspect of a “good” image relates to the label. If you can only see the back side of your zoo animal, can you really tell, for example, that it is a cheetah versus a leopard? The key identifying features need to be visible. If a human struggles to identify it, you can’t expect the computer to learn anything.

Photo by Jan Harder on Unsplash

What does a “bad” image look like? Here is what I frequently watch out for:

  • Wide angle lens stretching
  • Back-lit or silhouette
  • High contrast or dark shadows
  • Blurry or hazy
  • Obscured features
  • Multiple subjects
  • “Doctored” images, drawn lines and arrows
  • “Unusual” angles or situations
  • Picture of a mobile device that has a picture of your subject

Correct Labels

If you have a team of subject matter experts (SMEs) on hand to label the images, you are in a good starting position. Animal trainers at the zoo know the various species, and can spot the differences between, for example, a chimpanzee and a bonobo.

Photo by Adèle on Unsplash
Photo by Andrius Ordojan on Unsplash

As a machine learning engineer, it is easy to assume all labels from your SMEs are correct and move right on to training the model. However, even experts make mistakes, so if you can get a second opinion on the labels, your error rate should go down.

In reality, it can be prohibitively expensive to get one, let alone two, subject matter experts to review image labels. The SME usually has years of experience that make them more valuable to the business in other areas of work. My experience is that the machine learning engineer (that’s you and me) becomes the second opinion, and often the first opinion as well.

Over time, you can become pretty adept at labelling, but certainly not an SME. If you do have the luxury of access to an expert, explain to them the labelling standards and how these are required for the application to be successful. Emphasize “quality over quantity”.

It goes without saying that having a correct label is so important. However, all it takes is one or two mislabelled images to degrade performance. These can easily slip into your data set with careless or hasty labelling. So, take the time to get it right.

Ultimately, we as the ML engineer are responsible for model performance. So, if we take the approach of only working on model training and deployment, we will find ourselves wondering why performance is falling short.

Unknown Labels

A lot of times, you will come across a really good picture of a very interesting subject but have no idea what it is! It would be a shame to simply dispose of it. What you can do is assign it a generic label, like “Unknown Bird” or “Random Plant”, that is not included in your training set. In Part 4, you’ll see how to come back to these images later, once you have a better idea what they are, and you’ll be glad you saved them.

Model Assistance

If you have done any image labelling, then you know how time consuming and difficult it can be. But this is where having a model, even a less-than-perfect model, can help you.

Typically, you have a large collection of unlabelled images and you need to go through them one at a time to assign labels. Simply having the model offer a best guess and display the top 3 results lets you step through each image in a matter of seconds!

Even if the top 3 results are wrong, this can help you narrow down your search. Over time, newer models will get better, and the labelling process can even be somewhat fun!
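
A sketch of that top-3 helper, assuming you already have per-class probabilities from whatever model you are using:

import numpy as np

def top3_guesses(probs: np.ndarray, class_names: list[str]) -> list[tuple[str, float]]:
    """Return the three highest-scoring classes with confidence on a 0-100 scale."""
    idx = np.argsort(probs)[::-1][:3]
    return [(class_names[i], float(probs[i]) * 100) for i in idx]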

In Part 4, I will show how you can bulk identify images and take this to the next level for faster labelling.

Classes and Sub-Classes

I mentioned the example above of two species that look very similar, the chimpanzee and the bonobo. When you start out building your data set, you may have very sparse coverage of one or both of these species. In machine learning terms, we call these “classes”. One option is to roll with what you have and hope that the model picks up on the differences with only a handful of example images.

The option that I have used is to merge two or more classes into one, at least temporarily. So, in this case I would create a class called “chimp-bonobo”, which is composed of the limited example pictures of chimpanzee and bonobo species classes. Combined, these may give me enough to train the model on “chimp-bonobo”, with the trade-off that it’s a more generic identification.

Sub-classes can even be normal variations. For example, juvenile pink flamingos are grey instead of pink, and male and female orangutans have distinct facial features. You want to have a fairly balanced number of images for these normal variations, and keeping sub-classes will allow you to accomplish this.

Photo by David Valentine on Unsplash
Photo by Hongbin on Unsplash

Don’t be concerned that you are merging completely different looking classes — the neural network does a nice job of applying the “OR” operator. This works both ways — it can help you identify male or female variations as one species, but it can hurt you when “bad” outlier images sneak in like the example “open field with a zebra in the distance.”

Over time, you will (hopefully) be able to collect more images of the sub-classes and then be able to successfully split them apart (if necessary) and train the model to identify them separately. This process has worked very well for me. Just be sure to double-check all the images when you split them to ensure the labels didn’t get accidentally mixed up — it will be time well spent.

All of this certainly depends on your user requirements, and you can handle this in different ways either by creating a unique class label like “chimp-bonobo”, or at the front-end presentation layer where you notify the user that you have intentionally merged these classes and provide guidance on further refining the results. Even after you decide to split the two classes, you may want to caution the user that the model could be wrong since the two classes are so similar.

Up next…

I realize this was a long write-up for something that on the surface seems intuitive, but these are all areas that have tripped me up in the past because I didn’t give them enough attention. Once you have a solid understanding of these principles, you can go on to build a successful application.

In Part 2, we will take the curated data we collected here to create the classic data sets, with a custom benchmark set that will further enhance your data. Then we will see how best to evaluate our trained model using a specific “training mindset”, and switch to a “production mindset” when evaluating a deployed model.

The post Learnings from a Machine Learning Engineer — Part 1: The Data appeared first on Towards Data Science.

Learnings from a Machine Learning Engineer — Part 2: The Data Sets https://towardsdatascience.com/learnings-from-a-machine-learning-engineer-part-2-the-data-sets/ Thu, 13 Feb 2025 20:29:39 +0000 https://towardsdatascience.com/?p=597856 In Part 1, we discussed the importance of collecting good image data and assigning proper labels for your image classification project to be successful. Also, we talked about classes and sub-classes of your data. These may seem pretty straight forward concepts, but it’s important to have a solid understanding going forward. So, if you haven’t, please […]

The post Learnings from a Machine Learning Engineer — Part 2: The Data Sets appeared first on Towards Data Science.

In Part 1, we discussed the importance of collecting good image data and assigning proper labels for your Image Classification project to be successful. Also, we talked about classes and sub-classes of your data. These may seem pretty straight forward concepts, but it’s important to have a solid understanding going forward. So, if you haven’t, please check it out.

Now we will discuss how to build the various data sets and the techniques that have worked well for my application. Then in the next part, we will dive into the evaluation of your models, beyond simple accuracy.

I will again use the example zoo animals image classification app.

Data Sets

As machine learning engineers, we are all familiar with the train-validation-test sets, but when we include the concept of sub-classes discussed in Part 1, incorporate the concepts discussed below to set a minimum and maximum image count per class, and add staged and synthetic data to the mix, the process gets a bit more complicated. I had to create a custom script to handle these options.

I will walk you through these concepts before we split the data for training:

  • Image cutoffs — Too few images and your model performance will suffer. Too many and you spend more time training than it’s worth.
  • Confidence thresholds — Your model indicates how confident it is in the predictions. Let’s use that to decide when to present results to the user.
  • Benchmark sets — Real-world data is messy and the benchmark sets should reflect that. These need to stretch the model to the limit and help us decide when it is ready for production.
  • Staged and synthetic data — Real-world data is king, but sometimes you need to produce your own or even generate data to get off the ground. Be careful it doesn’t hurt performance.
  • Duplicate images — Repeat data can skew your results and give you a false sense of performance. Make sure your data is diverse.
  • Building the data sets — Combine sub-classes, apply cutoffs, and create your train-validation-test sets. Now we are ready to get the show started.

Image cutoffs

In my experience, using a minimum of 40 images per class provides decent performance. Since I like to use 10% each for the test set and validation set, that means at least 4 images will be used to check the training set, which feels just barely adequate. Using fewer than 40 images per class, I notice my model evaluation tends to suffer.

On the other end, I set a maximum of about 125 images per class. I have found that the performance gains tend to plateau beyond this, so having more data will slow down the training run with little to show for it. Having more than the maximum is fine, and this “overflow” can be added to the test set, so those images don’t go to waste.

There are times when I will drop the minimum cutoff to, say, 35, with no intention of moving the trained model to production. Instead, the purpose is to leverage this throw-away model to find more images from my unlabelled set. This is a technique that I will go into in more detail in Part 3.

Confidence threshold

You are likely familiar with the softmax score. As a reminder, softmax is the probability assigned to each label. I like to think of it as a confidence score, and we are interested in the class that receives the highest confidence. Softmax is a value between zero and one, but I find it easier to interpret confidence scores between zero and 100, like a percentage.

In order to decide if the model is confident enough with its prediction, I have chosen a threshold of 95. I use this threshold when determining if I want to present results to the user.

Scores above the threshold have a better chance of being right, so I can confidently provide the results. Scores below the threshold may not be right — in fact it could be “out-of-scope”, meaning it’s something the model doesn’t know how to identify. So, instead of taking the risk of presenting incorrect results, I instead prompt the user to try again and offer suggestions on how to take a “good” picture.

Admittedly this is a somewhat arbitrary cutoff, and you should decide for your use-case what is appropriate. In fact, this score could probably be adjusted for each trained model, but this would make it harder to compare performance across models.
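
A minimal sketch of that decision, assuming probs is the softmax output of the model for one image:

import numpy as np

CONFIDENCE_THRESHOLD = 95  # softmax scaled to 0-100

def present_or_retry(probs: np.ndarray, class_names: list[str]):
    """Return (label, confidence) if the top score clears the threshold,
    otherwise None so the app can ask the user to retake the picture."""
    best = int(np.argmax(probs))
    confidence = float(probs[best]) * 100
    if confidence >= CONFIDENCE_THRESHOLD:
        return class_names[best], confidence
    return None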

I will refer to this confidence score frequently in the evaluations section in Part 3.

Benchmark sets

Let me introduce what I call the benchmark sets, which you can think of as extended test sets. These are hand-picked images designed to stretch the limits of your model, and provide a measure for specific classes of your data. Use these benchmarks to justify moving your model to production, and for an objective measure to show to your manager.

  • Difficult Benchmark — These are the “extra credit” images, like the bonus questions a professor would add to the quiz to see which students are paying attention. You need a keen eye to spot the difference between the ground truth and a similar looking class. For example, a cheetah sleeping in the shade that could pass as a leopard if you don’t look closely.
  • Out-of-scope Benchmark — These are the “trick question” images. Our model is trained on zoo animals, but people are known for not following the rules. For example, a zoo guest takes a picture of their child wearing cheetah face paint.
  • Most-Common Benchmark — These are your “bread and butter” classes that need to get near perfect scores and zero errors. This would be a make-or-break benchmark for moving to production.
  • Least-Common Benchmark — These are your “rare but exceptional” classes that again need to be correct, but reach a minimum score like the confidence threshold.

When looking for images to add to the benchmarks, you can likely find them in real-world images from your deployed model. See the evaluation in Part 3.

For each benchmark, calculate the min, max, median, and mean scores, and also how many images get scores above and below the confidence threshold. Now you can compare these measures against what is currently in production, and against your minimum requirements, to help decide if the new model is production worthy.
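
A small helper along these lines (a sketch, assuming the benchmark scores have already been collected on the 0-100 scale) can produce the numbers to compare against production:

import numpy as np

def benchmark_summary(scores: np.ndarray, threshold: float = 95.0) -> dict:
    """Summarize confidence scores (0-100) for one benchmark set."""
    return {
        "min": float(scores.min()),
        "max": float(scores.max()),
        "median": float(np.median(scores)),
        "mean": float(scores.mean()),
        "above_threshold": int((scores >= threshold).sum()),
        "below_threshold": int((scores < threshold).sum()),
    }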

Staged or Synthetic data

Perhaps the biggest hurdle to any supervised machine learning application is having data to train the model. Clearly, “real-world” data that comes from actual users of the application is ideal. However, you can’t really collect that until the model is deployed. A chicken-and-egg problem.

One way to get started is to have volunteers collect “staged” images for you, trying to act like real users. So, let’s have our zoo staff go around taking pictures of the animals. This is a good start, but there will be a certain level of bias introduced in these images. For example, the staff may take the photos over a few days, so you may not get the year-round weather conditions.

Another way to get pictures is to use computer-generated “synthetic” images. I would avoid these at all costs, to be honest. Based on my experience, the model struggles with these because they look…different. The lighting is not natural, the subject may be superimposed on a background so the edges look too sharp, etc. Granted, some of the AI-generated images look very realistic, but if you look closely you may spot something unusual. The neural network in your model will notice these, so be careful.

Image generated using Dall-E

The way that I handle these staged or synthetic images is as a sub-class that gets merged into the training set, but only after giving preference to the real-world images. I cap the number of staged images at 60, so if I have 10 real-world images, I now only need 50 staged ones. Eventually, these staged and synthetic images are phased out completely, and I rely entirely on real-world data.

Duplicate images

One problem that can creep into your image set is duplicate images. These can be exact copies of pictures, or they can be extremely similar. You may think that this is harmless, but imagine having 100 pictures of an elephant that are exactly the same — your model will not know what to do with a different angle of the elephant.

Now, let’s say you have only two pictures that are nearly the same. Not so bad, right? Well, here is what can happen to them:

  • Both pictures go in the training set — The model doesn’t learn anything from the repeated image and it wastes time processing them.
  • One goes into the training set, the other goes into the test set — Your test score will be higher, but it is not an accurate evaluation.
  • Both are in the test set — Your test score will be skewed either higher or lower than it should be.

None of these will help your model.

There are a few ways to find duplicates. The approach I have taken is to calculate a hamming distance on all the pictures and identify the ones that are very close. I have an interface that displays the duplicates and I decide which one I like best, and remove the other.
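
One way to implement this (a sketch using the third-party imagehash package, which is not necessarily the author’s exact tooling) is to compare perceptual hashes and flag pairs whose Hamming distance is small:

from itertools import combinations
from PIL import Image
import imagehash  # pip install imagehash pillow

def find_near_duplicates(paths: list[str], max_distance: int = 5) -> list[tuple[str, str]]:
    """Return pairs of images whose perceptual hashes are within max_distance bits."""
    hashes = {p: imagehash.phash(Image.open(p)) for p in paths}
    return [
        (a, b)
        for a, b in combinations(paths, 2)
        if hashes[a] - hashes[b] <= max_distance  # subtraction gives the Hamming distance
    ]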

Another way (I haven’t tried this yet) is to create a vector representation of your images. Store these in a vector database, and you can do a similarity search to find nearly identical images.

Whatever method you use, it is important to clean up the duplicates.

Building the data sets

Now we are ready to build the traditional training, validation, and test sets. This is no longer a straightforward task since I want to:

  1. Merge sub-classes into a main class.
  2. Prioritize real-world images over staged or synthetic images.
  3. Apply a minimum number of images per class.
  4. Apply a maximum number of images per class, sending the “overflow” to the test set.

This process is somewhat complicated and depends on how you manage your image library. First, I would recommend keeping your images in a folder structure that has sub-class folders. You can get image counts by using a script to simply read the folders. Second is to keep a configuration of how the sub-classes are merged. To really set yourself up for success, put these image counts and merge rules in a database for faster lookups.

My train-validation-test set splits are usually 90–10–0. I originally started out using 80–10–10, but with diligence on keeping the entire data set clean, I noticed validation and test scores became pretty even. This allowed me to increase the training set size and use the “overflow” as the test set, as well as using the benchmark sets.
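
Here is a rough sketch of the kind of custom script described above. It assumes sub-classes have already been merged and that each class’s image list is ordered with real-world images first; the cutoffs match the numbers discussed earlier, but the structure is illustrative rather than the author’s actual code.

import random

MIN_IMAGES, MAX_IMAGES = 40, 125   # per-class cutoffs discussed above
VAL_FRAC = 0.10                    # 90-10-0 split; overflow becomes the test set

def build_splits(images_by_class: dict[str, list[str]]):
    train, val, test = [], [], []
    for label, images in images_by_class.items():
        if len(images) < MIN_IMAGES:
            continue  # too sparse to train on (or merge it into another class first)
        keep, overflow = images[:MAX_IMAGES], images[MAX_IMAGES:]
        random.shuffle(keep)
        n_val = max(1, int(len(keep) * VAL_FRAC))
        val += [(path, label) for path in keep[:n_val]]
        train += [(path, label) for path in keep[n_val:]]
        test += [(path, label) for path in overflow]  # "overflow" images become extra test data
    return train, val, test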

Up next…

In this part, we’ve built our data sets by merging sub-classes and using the image count cutoffs. Plus we handle staged and synthetic data as well as cleaning up duplicate images. We also created benchmark sets and defined confidence thresholds, which help us decide when to move a model to production.

In Part 3, we will discuss how we are going to evaluate the different model performances. And then finally we will get to the actual model training and the techniques to enhance accuracy.

The post Learnings from a Machine Learning Engineer — Part 2: The Data Sets appeared first on Towards Data Science.

Polars vs. Pandas — An Independent Speed Comparison https://towardsdatascience.com/polars-vs-pandas-an-independent-speed-comparison/ Tue, 11 Feb 2025 21:07:55 +0000 https://towardsdatascience.com/?p=597637 Overview Introduction — Purpose and Reasons Speed is important when dealing with large amounts of data. If you are handling data in a cloud data warehouse or similar, then the speed of execution for your data ingestion and processing affects the following: As you’ve probably understood from the title, I am going to provide a […]

The post Polars vs. Pandas — An Independent Speed Comparison appeared first on Towards Data Science.

Overview
  1. Introduction — Purpose and Reasons
  2. Datasets, Tasks, and Settings
  3. Results
  4. Conclusions
  5. Wrapping Up

Introduction — Purpose and Reasons

Speed is important when dealing with large amounts of data. If you are handling data in a cloud data warehouse or similar, then the speed of execution for your data ingestion and processing affects the following:

  • Cloud costs: This is probably the biggest factor. More compute time equals more costs in most billing models. In billing models based on a certain amount of preallocated resources, you could have chosen a lower service level if the speed of your ingestion and processing had been higher.
  • Data timeliness: If you have a real-time stream that takes 5 minutes to process data, then your users will have a lag of at least 5 minutes when viewing the data through e.g. a Power BI report. This lag can matter a lot in certain situations. Even for batch jobs, data timeliness is important. If you are running a batch job every hour, it is a lot better if it takes 2 minutes rather than 20 minutes.
  • Feedback loop: If your batch job takes only a minute to run, then you get a very quick feedback loop. This probably makes your job more enjoyable. In addition, it enables you to find logical mistakes more quickly.

As you’ve probably understood from the title, I am going to provide a speed comparison between the two Python libraries Polars and Pandas. If you know anything about Pandas and Polars from before, then you know that Polars is the (relatively) new kid on the block proclaiming to be much faster than Pandas. You probably also know that Polars is implemented in Rust, which is a trend for many other modern Python tools like uv and Ruff.

There are two distinct reasons that I want to do a speed comparison test between Polars and Pandas:

Reason 1 — Investigating Claims

Polars makes the following claim on its website: Compared to pandas, it (Polars) can achieve more than 30x performance gains.

As you can see, you can follow a link to the benchmarks that they have. It’s commendable that their speed tests are open source. But if you are writing the comparison tests for both your own tool and a competitor’s tool, then there might be a slight conflict of interest. I’m not saying that they are purposefully overselling the speed of Polars, but rather that they might have unconsciously selected for favorable comparisons.

Hence the first reason to do a speed comparison test is simply to see whether this supports the claims presented by Polars or not.

Reason 2 — Greater granularity

Another reason for doing a speed comparison test between Polars and Pandas is to make it slightly more transparent where the performance gains might be.

This might be already clear if you’re an expert on both libraries. However, speed tests between Polars and Pandas are mostly of interest to those considering switching up their tool. In that case, you might not yet have played around much with Polars because you are unsure if it is worth it.

Hence the second reason to do a speed comparison is simply to see where the speed gains are located.

I want to test both libraries on different tasks both within data ingestion and Data Processing. I also want to consider datasets that are both small and large. I will stick to common tasks within data engineering, rather than esoteric tasks that one seldom uses.

What I will not do

  • I will not give a tutorial on either Pandas or Polars. If you want to learn Pandas or Polars, then a good place to start is their documentation.
  • I will not cover other common data processing libraries. This might be disappointing to a fan of PySpark, but having a distributed compute model makes comparisons a bit more difficult. You might find that PySpark is quicker than Polars on tasks that are very easy to parallelize, but slower on other tasks where keeping all the data in memory reduces data movement overhead.
  • I will not provide full reproducibility. Since this is, in humble words, only a blog post, then I will only explain the datasets, tasks, and system settings that I have used. I will not host a complete running environment with the datasets and bundle everything neatly. This is not a precise scientific experiment, but rather a guide that only cares about rough estimations.

Finally, before we start, I want to say that I like both Polars and Pandas as tools. I’m not financially or otherwise compensated by any of them obviously, and don’t have any incentive other than being curious about their performance ☺

Datasets, Tasks, and Settings

Let’s first describe the datasets that I will be considering, the tasks that the libraries will perform, and the system settings that I will be running them on.

Datasets

At most companies, you will need to work with both small and (relatively) large datasets. In my opinion, a good data processing tool can tackle both ends of the spectrum. Small datasets challenge the start-up time of tasks, while larger datasets challenge scalability. I will consider two datasets, both of which can be found on Kaggle:

  • A small dataset on the format CSV: It is no secret that CSV files are everywhere! Often they are quite small, coming from Excel files or database dumps. What better example of this than the classical iris dataset (licensed with CC0 1.0 Universal License) with 5 columns and 150 rows. The iris version I linked to on Kaggle has 6 columns, but the classical one does not have a running index column. So remove this column if you want precisely the same dataset as I have. The iris dataset is certainly small data by any stretch of the imagination.
  • A large dataset on the format Parquet: The parquet format is super useful for large data as it has built-in compression column-wise (along with many other benefits). I will use the Transaction dataset (licensed with Apache License 2.0) representing financial transactions. The dataset has 24 columns and 7 483 766 rows. It is close to 3 GB in its CSV format found on Kaggle. I used Pandas & Pyarrow to convert this to a parquet file. The final result is only 905 MB due to the compression of the parquet file format. This is at the low end of what people call big data, but it will suffice for us.

Tasks

I will do a speed comparison on five different tasks. The first two are I/O tasks, while the last three are common tasks in data processing. Specifically, the tasks are:

  1. Reading data: I will read both files using the respective methods read_csv() and read_parquet() from the two libraries. I will not use any optional arguments as I want to compare their default behavior.
  2. Writing data: I will write both files back to identical copies as new files using the respective methods to_csv() and to_parquet() for Pandas and write_csv() and write_parquet() for Polars. I will not use any optional arguments as I want to compare their default behavior.
  3. Computing Numeric Expressions: For the iris dataset I will compute the expression SepalLengthCm ** 2 + SepalWidthCm as a new column in a copy of the DataFrame. For the transactions dataset, I will simply compute the expression (amount + 10) ** 2 as a new column in a copy of the DataFrame. I will use the standard way to transform columns in Pandas, while in Polars I will use the standard functions all(), col(), and alias() to make an equivalent transformation.
  4. Filters: For the iris dataset, I will select the rows corresponding to the criteria SepalLengthCm >= 5.0 and SepalWidthCm <= 4.0. For the transactions dataset, I will select the rows corresponding to the categorical criteria merchant_category == 'Restaurant'. I will use the standard filtering method based on Boolean expressions in each library. In pandas, this is syntax such as df_new = df[df['col'] < 5], while in Polars this is given similarly by the filter() function along with the col() function. I will use the and-operator & for both libraries to combine the two numeric conditions for the iris dataset.
  5. Group By: For the iris dataset, I will group by the Species column and calculate the mean values for each species of the four columns SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm. For the transactions dataset, I will group by the column merchant_category and count the number of instances in each of the classes within merchant_category. Naturally, I will use the groupby() function in Pandas and the group_by() function in Polars in obvious ways. A combined code sketch of tasks 3–5 follows this list.
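
For reference, here is roughly how tasks 3–5 look in both libraries on the iris dataset. This is a sketch based on the descriptions above and may differ slightly from the exact code used to produce the timings; the file name Iris.csv is the one Kaggle typically uses, adjust it to your own copy.

import pandas as pd
import polars as pl

pdf = pd.read_csv("Iris.csv")   # pandas
plf = pl.read_csv("Iris.csv")   # polars

# Task 3: numeric expression as a new column
pdf_expr = pdf.copy()
pdf_expr["NewCol"] = pdf_expr["SepalLengthCm"] ** 2 + pdf_expr["SepalWidthCm"]
plf_expr = plf.with_columns(
    (pl.col("SepalLengthCm") ** 2 + pl.col("SepalWidthCm")).alias("NewCol")
)

# Task 4: filter on two numeric conditions
pdf_filt = pdf[(pdf["SepalLengthCm"] >= 5.0) & (pdf["SepalWidthCm"] <= 4.0)]
plf_filt = plf.filter(
    (pl.col("SepalLengthCm") >= 5.0) & (pl.col("SepalWidthCm") <= 4.0)
)

# Task 5: group by species and take the mean of the numeric columns
cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
pdf_grp = pdf.groupby("Species")[cols].mean()
plf_grp = plf.group_by("Species").agg(pl.col(cols).mean())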

Settings

  • System Settings: I’m running all the tasks locally with 16GB RAM and an Intel Core i5–10400F CPU with 6 Cores (12 logical cores through hyperthreading). So it’s not state-of-the-art by any means, but good enough for simple benchmarking.
  • Python: I’m running Python 3.12. This is not the most current stable version (which is Python 3.13), but I think this is a good thing. Commonly the latest supported Python version in cloud data warehouses is one or two versions behind.
  • Polars & Pandas: I’m using Polars version 1.21 and Pandas 2.2.3. These are roughly the newest stable releases to both packages.
  • Timeit: I’m using the standard timeit module in Python and finding the median of 10 runs. A minimal sketch of this timing harness follows this list.
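
Something along these lines (a sketch, not necessarily the exact harness used for the numbers below):

import statistics
import timeit

def median_runtime(stmt: str, setup: str = "pass", repeats: int = 10) -> float:
    """Run the statement `repeats` times and return the median single-run time in seconds."""
    times = timeit.repeat(stmt, setup=setup, repeat=repeats, number=1, globals=globals())
    return statistics.median(times)

# Example: median_runtime("pl.read_csv('Iris.csv')", setup="import polars as pl")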

Especially interesting will be how Polars can take advantage of the 12 logical cores through multithreading. There are ways to make Pandas take advantage of multiple processors, but I want to compare Polars and Pandas out of the box without any external modification. After all, this is probably how they are running in most companies around the world.

Results

Here I will write down the results for each of the five tasks and make some minor comments. In the next section I will try to summarize the main points into a conclusion and point out a disadvantage that Polars has in this comparison:

Task 1 — Reading data

The median run time over 10 runs for the reading task was as follows:

# Iris Dataset
Pandas: 0.79 milliseconds
Polars: 0.31 milliseconds

# Transactions Dataset
Pandas: 14.14 seconds
Polars: 1.25 seconds

For reading the Iris dataset, Polars was roughly 2.5x faster than Pandas. For the transactions dataset, the difference is even starker where Polars was 11x faster than Pandas. We can see that Polars is much faster than Pandas for reading both small and large files. The performance difference grows with the size of the file.

Task 2 — Writing data

The median run time over 10 runs for the writing task was as follows:

# Iris Dataset
Pandas: 1.06 milliseconds
Polars: 0.60 milliseconds

# Transactions Dataset
Pandas: 20.55 seconds
Polars: 10.39 seconds

For writing the iris dataset, Polars was around 75% faster than Pandas. For the transactions dataset, Polars was roughly 2x as fast as Pandas. Again we see that Polars is faster than Pandas, but the difference here is smaller than for reading files. Still, a difference of close to 2x in performance is a massive difference.

Task 3 — Computing Numeric Expressions

The median run time over 10 runs for the computing numeric expressions task was as follows:

# Iris Dataset
Pandas: 0.35 milliseconds
Polars: 0.15 milliseconds

# Transactions Dataset
Pandas: 54.58 milliseconds
Polars: 14.92 milliseconds

For computing the numeric expressions, Polars beats Pandas with a rate of roughly 2.5x for the iris dataset, and roughly 3.5x for the transactions dataset. This is a pretty massive difference. It should be noted that computing numeric expressions is fast in both libraries even for the large dataset transactions.

Task 4 — Filters

The median run time over 10 runs for the filters task was as follows:

# Iris Dataset
Pandas: 0.40 milliseconds
Polars: 0.15 milliseconds

# Transactions Dataset
Pandas: 0.70 seconds
Polars: 0.07 seconds

For filters, Polars is 2.6x faster on the iris dataset and 10x as fast on the transactions dataset. This is probably the most surprising improvement for me since I suspected that the speed improvements for filtering tasks would not be this massive.

Task 5 — Group By

The median run time over 10 runs for the group by task was as follows:

# Iris Dataset
Pandas: 0.54 milliseconds
Polars: 0.18 milliseconds

# Transactions Dataset
Pandas: 334 milliseconds 
Polars: 126 milliseconds

For the group-by task, there is a 3x speed improvement for Polars in the case of the iris dataset. For the transactions dataset, there is a 2.6x improvement of Polars over Pandas.

Conclusions

Before highlighting each point below, I want to point out that Polars is somewhat in an unfair position throughout my comparisons. In practice, multiple data transformations are often performed one after another. For this, Polars has the lazy API that optimizes the whole query before calculating. Since I have considered single ingestions and transformations, this advantage of Polars is hidden. How much this would improve things in practical situations is not clear, but it would probably make the difference in performance even bigger.
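
For illustration, this is what the eager and lazy versions of a filter-then-aggregate query look like in Polars (the CSV file name here is hypothetical). With the lazy API, the query is planned and optimized, for example with predicate pushdown, before any data is read:

import polars as pl

# Eager: each step materializes an intermediate DataFrame
eager = (
    pl.read_csv("transactions.csv")
      .filter(pl.col("merchant_category") == "Restaurant")
      .group_by("merchant_category")
      .agg(pl.len())
)

# Lazy: nothing runs until collect(), so the whole plan can be optimized first
lazy_result = (
    pl.scan_csv("transactions.csv")
      .filter(pl.col("merchant_category") == "Restaurant")
      .group_by("merchant_category")
      .agg(pl.len())
      .collect()
)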

Data Ingestion

Polars is significantly faster than Pandas for both reading and writing data. The difference is largest in reading data, where we had a massive 11x difference in performance for the transactions dataset. On all measurements, Polars performs significantly better than Pandas.

Data Processing

Polars is significantly faster than Pandas for common data processing tasks. The difference was starkest for filters, but you can at least expect a 2–3x difference in performance across the board.

Final Verdict

Polars consistently performs faster than Pandas on all tasks with both small and large data. The improvements are very significant, ranging from a 2x improvement to a whopping 11x improvement. When it comes to reading large parquet files or performing filter statements, Polars is leaps and bounds ahead of Pandas.

However…Nowhere here is Polars remotely close to performing 30x better than Pandas, as Polars’ benchmarking suggests. I would argue that the tasks that I have presented are standard tasks performed on realistic hardware infrastructure. So I think that my conclusions give us some room to question whether the claims put forward by Polars give a realistic picture of the improvements that you can expect.

Nevertheless, I am in no doubt that Polars is significantly faster than Pandas. Working with Polars is not more complicated than working with Pandas. So for your next data engineering project where the data fits in memory, I would strongly suggest that you opt for Polars rather than Pandas.

Wrapping Up

Photo by Spencer Bergen on Unsplash

I hope this blog post gave you a different perspective on the speed difference between Polars and Pandas. Please comment if you have a different experience with the performance difference between Polars and Pandas than what I have presented.

If you are interested in AI, Data Science, or data engineering, please follow me or connect on LinkedIn.

Like my writing? Check out some of my other posts:

The post Polars vs. Pandas — An Independent Speed Comparison appeared first on Towards Data Science.
