
Use Tablib to Handle Simple Tabular Data in Python

Sometimes a Shallow Abstraction Is More Valuable Than Performance

Photo by David Clode on Unsplash

Overview

  1. Introduction – What is Tablib?
  2. Working with Datasets
  3. Importing Data
  4. Exporting Data
  5. Dynamic Columns
  6. Formatters
  7. Wrapping Up

Introduction – What is Tablib?

For many years, I have been working with tools like Pandas and PySpark in Python for data ingestion, data processing, and data exporting. These tools are great for complex data transformations and large data volumes (Pandas when the data fits in memory). However, I have often used these tools when the following conditions apply:

  • The data size is relatively small. Think well below 100,000 rows of data.
  • Performance is not an issue at all. Think of a one-off job or a job that repeats at midnight every night, but I don’t care if it takes 20 seconds or 5 minutes.
  • There are no complex transformations needed. Think of simply importing 20 JSON files with the same format, stacking them on top of each other, and then exporting this as a CSV file.

In these cases, tools like Pandas and (especially) PySpark are like shooting a fly with a cannon. This is where the library Tablib is perfect 🔥

Tablib is a small Python library that deals exclusively with small tabular datasets. It is much less performant than both Pandas and PySpark, but it does not aim for performance at all. The library is only around 1,000 lines of code and is a shallow abstraction over Python. What is the advantage of this?

It is simple to understand the nitty-gritty details of Tablib since nothing is optimized. Tablib only gives you base functionality and falls back on Python logic for most things.

If you are unsure how something is implemented in Tablib, then a quick glance at the source code gives you the answer. In contrast, understanding how various methods in PySpark are implemented requires an understanding of Scala and the JVM. And loads of free time 😬

Tablib is also focused on importing and exporting tabular data in different formats. While big data is certainly sexier than small data, the reality is that even in companies with big data, small datasets are abundant. Spinning up a Spark cluster every night to read a CSV file with 200 lines is just a waste of money and sanity.

Tablib should simply be another tool in your toolbox that can be used in the case of small data with few performance requirements and no complex data transformations. It is surprising how often these conditions are satisfied.

In this blog post, I will tell you everything you need to know about Tablib. Unlike Pandas and PySpark, I will manage to explain 80% of Tablib in a single blog post since it is such a simple and shallow abstraction. After reading this blog post, you should be able to easily use this tool where it is useful.

You can find all the code and artifacts used in this blog post in the following GitHub repository. In addition to this blog post, I’ve also made a free video series on YouTube on the same topic that is gradually coming out. If you prefer videos, you can check that out instead.

Alright! Let’s get started 😃


Working with Datasets

To get started, head over to the Tablib homepage. You can read a quick overview there before going to the installation page. I would recommend using the following command to install Tablib:

pip install "tablib[all]"

Now that Tablib is installed, we can first learn about the main class, Dataset. We can open a file and write the following:

from tablib import Dataset

# Creating a dataset
data = Dataset()

# Adding headers
data.headers = ["Name", "Phone Number"]

# Adding rows
data.append(["Eirik", "74937475"])
data.append(["Stine", "75839478"])

# Print a nicely formatted table
print(data)

# Get a standard Python representation
print(data.dict)

Here we import Dataset from tablib and create a dataset. Then we add headers for the columns and add two rows with the .append() method. When we print out data to the console, we get a nicely formatted tabular data structure. We can also get a native Python representation by using the .dict attribute.

The variable data will be our format-agnostic container for the data that we will later import. First, let us see how we can add new columns, select both columns and rows, and delete both columns and rows. This data manipulation will be useful when we want to make simple data transformations to imported data.

To add a column, we will use the method .append_col(). It takes a list of values and a keyword argument representing the header name of the column:

data.append_col([29, 30], header="Age")
print(data)

If you print out data again, you will see that we have added a column representing the age of each individual. So use .append() to append rows and .append_col() to append columns.

To select columns, we can simply use the notation that you are probably familiar with from either Python dictionaries or from working with Pandas. For selecting rows, we can use index notation or slicing:

# Selecting columns and rows
print(data["Age"])
print(f"Average age: {sum(data['Age']) / len(data['Age'])}")
print(data[0])
print(data[0:2])

As you can see, the value data["Age"] is simply a Python list. We can thus use built-in functions like sum() and len() to work with it and calculate the average age.

Notice that this reduces data transformations to pure Python logic, which is not particularly fast. But fast is not the aim of Tablib; predictability and simplicity are 💪
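
For instance, here is a minimal sketch of a row filter written in plain Python, assuming the data dataset from above still has its "Age" column:

# Keep only rows where the age is at least 30 -- plain Python, no special API
filtered = Dataset()
filtered.headers = data.headers
for row in data:
    if row[2] >= 30:  # "Age" is the third column
        filtered.append(row)
print(filtered)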

Finally, to delete both columns and rows, we can use the built-in del keyword in Python as follows:

# Delete columns and rows
del data["Age"]
print(data)
del data[1]
print(data)

To summarize so far, we initialize an instance of the Dataset class, then add rows and columns with simple append-type methods. We can easily select and remove both columns and rows as we please.

Notice also that since everything is handled with Python lists, which do not enforce homogeneous types, nothing requires the data types to be consistent within a column. A single column can hold many different data types. You can add your own data validation logic with formatters, as we will see later.
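
As a quick sketch of this, nothing stops a single column from mixing strings, integers, and None:

from tablib import Dataset

# A column with mixed data types -- Tablib does not complain
mixed = Dataset()
mixed.headers = ["Value"]
mixed.append(["text"])
mixed.append([42])
mixed.append([None])
print(mixed.dict)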


Importing Data

It’s time to import some data. The way we created data from scratch in the previous section is not typical. The usual workflow is that we import data and use the tools from the previous section to modify or retrieve pieces of information.

Let’s say that we have a folder called /artifacts for the rest of the blog post. Within that folder are various files we want to import. First, let us assume that there is a CSV file called simple_example.csv with both headers and rows. We can import this with the following simple piece of code:

from tablib import Dataset

# Import a single CSV file
with open('artifacts/simple_example.csv', 'r') as file:
    imported_data = Dataset().load(file)
print(imported_data)

As you can see, we use the .load() method to load the file into a newly created dataset.

The important thing to notice here is that we don’t have a separate method for each file type! There is no .load_csv() method and no separate .load_json() method. There is simply a .load() method that detects the file type. This makes reusability very convenient.

To import a JSON file called standard.json in the artifacts folder we could simply write the following:

# Import a single JSON file
with open('artifacts/standard.json', 'r') as file:
    json_file = Dataset().load(file)
print(json_file)

So you don’t need to learn separate methods for separate file formats.
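
If the automatic detection ever guesses wrong, the .load() method also accepts an explicit format argument — a quick sketch:

# Force the format instead of relying on detection
with open('artifacts/standard.json', 'r') as file:
    json_file = Dataset().load(file, format='json')
print(json_file)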

One thing to note: if you have a CSV file that does not have headers, you need to pass the headers=False keyword argument to the .load() method. Afterwards, you can set the headers yourself if you want. Here you can see an example of this with a file called no_header.csv:

# Import a CSV file with no headers
with open('artifacts/no_header.csv', 'r') as file:
    no_header_data = Dataset().load(file, headers=False)
no_header_data.headers = ['Name', 'Age']
print(no_header_data)

Finally, a common issue is that you have multiple files that need to be imported and combined. Say that we have a subfolder /artifacts/multiple/ containing three CSV files. Tablib has no separate method for this situation: you use basic Python logic to load the files, and then the .extend() method to combine them. Here we can use the built-in library pathlib to manage the paths:

from pathlib import Path

# Work with multiple files
combined_data = Dataset(headers=('first_name', 'last_name'))
for path in Path('artifacts/multiple/').iterdir():
    with open(path, 'r') as file:
        temp_data = Dataset().load(file)
        combined_data.extend(temp_data)
print(combined_data)

Exporting Data

Now it is time to export data. The cool thing is that Tablib has a single method called .export() for exporting that makes this super easy:

from tablib import Dataset

# Import a single CSV file
with open('artifacts/simple_example.csv', 'r') as file:
    imported_data = Dataset().load(file)

# Write as a JSON file
print(imported_data.export('json'))

Notice that the .export() method does not involve the file system at all! It does not let you specify a file to export the data to. What is happening? 😮

The method .export() simply converts the data to a string (or bytes, for binary formats) in the format you specify. So far, we just print this string to the console. We need to use standard Python logic to write it to a file.

Again, this shows that Tablib wants to remain simple: there is no interaction with the file system here; you have to use Python logic for that. If you are used to working with Python, this gives you control over this aspect. The cost of this tradeoff is performance, but again, Tablib does not care about performance.

To export the data to a JSON file, you can simply write this:

from tablib import Dataset

# Import a single CSV file
with open('artifacts/simple_example.csv', 'r') as file:
    imported_data = Dataset().load(file)

# Export to a JSON file
with open('artifacts/new_file.json', 'w') as file:
    file.write(imported_data.export('json'))

Simple, right? Here are some of the other file formats that Tablib supports out of the box:

from tablib import Dataset

# Import a single CSV file
with open('artifacts/simple_example.csv', 'r') as file:
    imported_data = Dataset().load(file)

# Write as a CSV file
print(imported_data.export('csv'))

# Or as JSON
print(imported_data.export('json'))

# Or as YAML
print(imported_data.export('yaml'))

# Or as HTML
print(imported_data.export('html'))

# Or as Excel
print(imported_data.export('xls'))

When you are writing an Excel file, make sure that you open the file in "write binary" mode with wb instead of w, since Excel files are binary files.
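
For example, a sketch of writing the Excel export to disk — the 'xls' export returns bytes, so the file must be opened in binary mode:

# Export to an Excel file -- note the 'wb' mode
with open('artifacts/new_file.xls', 'wb') as file:
    file.write(imported_data.export('xls'))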

Finally, you should know that you can easily transition between Tablib and Pandas as follows:

from tablib import Dataset
from pandas import DataFrame

# Import a single CSV file
with open('artifacts/simple_example.csv', 'r') as file:
    imported_data = Dataset().load(file)

df = DataFrame(imported_data.dict)
print(df.head())

You should nevertheless use this sparingly. If you feel the need to constantly move over to Pandas for complex transformations, then maybe you should have used Pandas all along. Going to Pandas for a one-off function is fine, but too many conversions are a symptom that you have chosen the wrong library.
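
If you do need to come back from Pandas, a minimal sketch is to rebuild the Dataset by hand — no dedicated conversion method is assumed here, just plain Python:

# Rebuild a Dataset from a DataFrame row by row
back = Dataset()
back.headers = list(df.columns)
for row in df.itertuples(index=False):
    back.append(list(row))
print(back)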


Dynamic Columns

Now you know how to import data, do simple transformations, and export to various formats. This is the base functionality of Tablib. In addition to this, dynamic columns and formatters are convenient. Let’s first look at dynamic columns.

Let us assume that we have a CSV file called students.csv that looks like this:

student_id,score
84947,75
85345,64
84637,32
89274,98
84636,82
85146,55

We want to calculate a grade for each student based on the score. We could do this after loading the data. However, it would be nice to have the grade automatically calculated when new rows are introduced. To do this, we write a dynamic column as follows:

from tablib import Dataset

# Write a dynamic column
def calculate_grade(row):
    """Calculates the grade of a student based on the score."""
    score = int(row[1])
    if score > 93:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 66:
        return 'C'
    elif score >= 55:
        return 'D'
    elif score >= 40:
        return 'E'
    else:
        return 'F'

# Import a single CSV file
with open('artifacts/students.csv', 'r') as file:
    student_data = Dataset().load(file)

# Add the dynamically calculated column
student_data.append_col(calculate_grade, header="grade")

# Print out the data
print(student_data)

Here we have written the function calculate_grade() to accept an individual row and use that information to return a single value, namely the grade of the student. We can attach this as a dynamic column to the dataset student_data with the method .append_col() as described above.

Now calculate_grade() works as a callback function, so it is applied every time a new row is added:

# Add more rows
student_data.append(['81237', '86'])
print(student_data)

As you can see if you run the code, the grade is automatically calculated for the new student. If I want to specify the grade manually, I can do so as well:

# Can add the dynamic column myself
student_data.append(['81237', '56', 'D'])
print(student_data)

If I add the column value myself, the callback function does nothing. If I don’t, then the callback function takes on that responsibility. This is super convenient for automatic data augmentation. You can use this for complex use cases like machine learning predictions based on the other features of a row.
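
Here is a hedged sketch of that idea. The ThresholdModel class below is a toy stand-in for a real trained model — the pattern of wrapping a prediction in a dynamic column callback is the point:

# A stand-in "model" -- swap in a real trained model here
class ThresholdModel:
    def predict(self, score: int) -> str:
        return 'pass' if score >= 40 else 'fail'

model = ThresholdModel()

def predict_outcome(row):
    """Dynamic column callback that delegates to the (stand-in) model."""
    return model.predict(int(row[1]))

# Attach the prediction as another dynamic column
student_data.append_col(predict_outcome, header="outcome")
print(student_data)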


Formatters

Finally, I want to take a quick look at formatters. This feature is not described in Tablib’s documentation; you need to read the source code to find it. So I want to highlight it here, as it is also pretty convenient.

To understand formatters, you need to realize that Tablib does not do any data validation or data cleaning by default:

from tablib import Dataset

# Creating a dataset
data = Dataset()

# Adding headers
data.headers = ["Name", "Phone Number"]

# Add data with a whitespace error
data.append(["Eirik", "74937475 "])
data.append(["Stine", "75839478"])

# No data formatting - Whitespace is kept
print(data.dict)

Again, we could clean this up after the fact. Going by the same principle as for dynamic columns, it would be nice to attach a callback function to data that automatically formats the data correctly. This is precisely what formatters do:

# Create a formatter
def remove_whitespace(phone_num: str) -> str:
    """Removes whitespace from phone numbers."""
    return phone_num.strip()

# Add the formatter
data.add_formatter("Phone Number", remove_whitespace)

# Check that the formatter has been added
print(data._formatters)

# Append more data with whitespace errors
data.append(["Eirik", " 74937475 "])
data.append(["Stine", "75839478"])

# Data is automatically formatted on insertion.
print(data.dict)

Formatters are callback functions that are added with the .add_formatter() method. You can check the registered formatters on data by using the "private" attribute data._formatters. When you now add more data with whitespace errors, these are automatically cleaned when appended to data.

The difference between dynamic columns and formatters is that dynamic columns create new columns, while formatters modify existing ones. Use dynamic columns for data augmentation and formatters for data cleaning and data validation 😍
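
Since a formatter is just a callback, it can also enforce validation by raising an error. A sketch, reusing the data dataset from above:

# A formatter that validates as well as cleans
def validate_phone(phone_num: str) -> str:
    """Strips whitespace and rejects non-numeric phone numbers."""
    cleaned = phone_num.strip()
    if not cleaned.isdigit():
        raise ValueError(f"Invalid phone number: {phone_num!r}")
    return cleaned

data.add_formatter("Phone Number", validate_phone)

# data.append(["Oops", "not-a-number"])  # would raise ValueError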


Wrapping Up

Photo by Spencer Bergen on Unsplash

I hope this blog post helped you understand the library Tablib and what it can do in Python. If you are interested in AI, data science, or data engineering, please follow me or connect on LinkedIn.

Like my writing? Check out some of my other posts for more content.

