
Advanced SQL Techniques for Unstructured Data Handling

Everything you need to know to get started with text mining

Photo by Etienne Girardet on Unsplash

The ideal dataset for data analysis is like Table_1:

Table_1 (mock data by the author)

However, the datasets that we encounter in reality are mostly like Table_2:

Table_2: customer support log (mock data by the author)

The main difference between these two tables is whether the data is neatly organized into rows and columns and presented as simple numbers or short text values. Because of this difference, the data in Table_1 is called structured data, while the data in Table_2 is categorized as unstructured data.

Unstructured data refers to information that doesn’t have a predetermined structure or format. It’s difficult to store and manage in relational databases, but it often contains valuable information that is useful for generating data insights, training machine learning models, or performing natural language processing (NLP).

In this article, I’ll introduce 7 advanced SQL techniques used to handle unstructured data. Although we call these techniques ‘advanced’ in SQL, they actually form the foundation of data parsing and text mining.


JSON Parsing

JSON is short for "JavaScript Object Notation". It is a text-based format used in web development to exchange information between a server and a web application. JSON data is widely used because of its advantages, such as ease of use, platform independence, and flexibility. But we can rarely analyze JSON data directly or obtain insights from it, due to its complex structure and varying levels of nesting. The syntax for JSON parsing is as follows:

SELECT JSON_VALUE(json_column, '$.key') AS extracted_value
FROM table_name;

In Table_2, the customer_data column contains JSON data, which can be converted into two columns: name and age.

SELECT 
    JSON_VALUE(customer_data, '$.name') AS name,
    JSON_VALUE(customer_data, '$.age') AS age
FROM support_logs;
Output of JSON parsing (screenshot by the author)
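Note that JSON_VALUE is available in SQL Server and in MySQL 8.0.21 or later. On earlier MySQL 8.0 releases, a sketch using the long-standing JSON_EXTRACT and JSON_UNQUOTE functions achieves the same result:

SELECT 
    -- JSON_EXTRACT pulls the value at the path; JSON_UNQUOTE strips the JSON quotes
    JSON_UNQUOTE(JSON_EXTRACT(customer_data, '$.name')) AS name,
    JSON_UNQUOTE(JSON_EXTRACT(customer_data, '$.age')) AS age
FROM support_logs;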

Regular Expression

A regular expression (regex) is a sequence of characters used to define a search pattern. Regex allows developers to find and manipulate complex string data within a database by matching specific patterns within text. The syntax for regex in SQL is:

SELECT column_name
FROM table_name
WHERE column_name REGEXP 'pattern';

Regex is widely used for flexible search, data validation and data extraction. A typical use case is to extract emails from text.

SELECT column_name
FROM users
WHERE column_name REGEXP '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}';

Sometimes people confuse the REGEXP operator with the LIKE operator. Both operators are used in SQL to perform pattern matching within strings, but LIKE only supports basic wildcards, while REGEXP allows complex patterns using regular expression syntax and offers greater flexibility for advanced pattern matching scenarios due to its more robust syntax. On the other hand, regex requires a deeper understanding of the structure of the original data, which can be challenging.
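To make the contrast concrete, here is a minimal comparison on the issue_description column from Table_2 (the search terms are illustrative):

-- LIKE: fixed substring match with % wildcards only
SELECT issue_description
FROM support_logs
WHERE issue_description LIKE '%error%';

-- REGEXP: one pattern can match several alternatives
SELECT issue_description
FROM support_logs
WHERE issue_description REGEXP 'error|timeout|crash';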


Key-Value Pair Parsing

When using JSON parsing, the format is relatively standardized. If the string data has a less regular structure, but each key and value are separated by a delimiter such as a colon, a semicolon, or an equal sign, JSON parsing may not be enough for data extraction. Instead, we should use Key-Value Pair Parsing.

Key-Value Pair Parsing is the process of extracting and separating data stored in a format where each piece of information is represented as a "key" paired with its corresponding "value". This method is implemented with string functions or regular expressions (regex).

String functions like SUBSTRING, SUBSTRING_INDEX, POSITION, and REPLACE are frequently used to extract the key and value parts based on the delimiters. An example of using the SUBSTRING_INDEX function for parsing key-value pairs is as follows.

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(column_name, 'key=', -1), ';', 1) AS value
FROM table_name;
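Applied to Table_2, where the extra_info column stores pairs such as priority=high;status=open (format assumed from the full example at the end of this article), the same pattern extracts both fields:

SELECT 
    -- everything after 'priority=' up to the next ';'
    SUBSTRING_INDEX(SUBSTRING_INDEX(extra_info, 'priority=', -1), ';', 1) AS priority,
    -- everything after 'status=' up to the next ';'
    SUBSTRING_INDEX(SUBSTRING_INDEX(extra_info, 'status=', -1), ';', 1) AS status
FROM support_logs;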

Text Analytics with Window Functions

Window functions perform calculations across a specified set of rows, known as a "window", from a query and return a single value related to the current row. You can refer to another of my articles, ‘The Most Useful Advanced SQL Techniques to Succeed in the Tech Industry’, for a thorough understanding of the syntax, benefits, and use cases of window functions.


Besides computing metrics across partitions, window functions can also be used for text-based analysis, which very few people realize.

The applications of window functions for text-based analysis are versatile. For example, if we’d like to rank the text by length, we can use the syntax below:

SELECT column_name, RANK() OVER (ORDER BY LENGTH(column_name) DESC) AS length_rank
FROM table_name;
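Beyond ranking, the same idea supports other lightweight text metrics. For instance, here is a sketch comparing each ticket’s description length against the average across all tickets (column names taken from Table_2):

SELECT 
    ticket_id,
    LENGTH(issue_description) AS description_length,
    -- an empty OVER () makes the window span the whole result set
    AVG(LENGTH(issue_description)) OVER () AS avg_description_length
FROM support_logs;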

Data Tokenization

Data tokenization is a security technique where sensitive data, such as credit card numbers or social security numbers, is replaced with randomized tokens. SQL itself doesn’t inherently support tokenization, but it can work with tokenized data through:

  • Lookup Tables: A mapping table associates tokens with their original values (see the sketch after this list).
  • Encryption or Hash Functions: While not true tokenization, these methods can obfuscate data.
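As a minimal sketch of the lookup-table approach, assume a hypothetical token_map table that stores one token per original value; joining on it replaces phone numbers with tokens while keeping the originals recoverable by authorized users:

-- token_map is a hypothetical mapping table:
--   token (VARCHAR, unique)  <->  original_value (VARCHAR)
SELECT 
    s.ticket_id,
    t.token AS tokenized_phone_number
FROM support_logs s
JOIN token_map t
    -- assumes every phone number already has a token in token_map
    ON t.original_value = s.phone_number;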

Data tokenization is not typically considered a methodology for cleaning unstructured datasets. However, it is an important technique for text mining, especially as data privacy breaches become more frequent and data security becomes a serious risk in data usage.


COALESCE()

COALESCE() is a function that returns the first non-null value from a list of expressions. It’s useful for handling incomplete or inconsistent data, which is very common in unstructured datasets. The syntax of the COALESCE() function is:

SELECT COALESCE(column1, column2, 'default_value') AS result
FROM table_name;

COALESCE() is widely used for replacing null values, selecting the first available value, and implementing fallback logic.
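Applied to Table_2, COALESCE() fills the missing issue descriptions with a readable placeholder (the same expression reappears in the full example at the end of this article):

SELECT 
    ticket_id,
    -- NULL descriptions become a readable default
    COALESCE(issue_description, 'No issue reported') AS issue_description
FROM support_logs;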


CAST()

CAST() converts data from one type to another. The syntax is:

SELECT CAST(column_name AS target_type) AS converted_value
FROM table_name;

When using the CAST() function, we must be cautious, especially when the data contains missing values (NULL).

SELECT 
    CAST(JSON_VALUE(customer_data, '$.age') AS INT) AS customer_age
FROM 
    support_logs;

The code above will return an error in MySQL: CAST() does not accept INT as a target type; for integers, the supported targets are SIGNED and UNSIGNED. Note also that customer_age contains NULL values after parsing, and the corrected query below simply passes them through as NULL instead of forcing them into a concrete integer.

SELECT 
    CAST(JSON_VALUE(customer_data, '$.age') AS UNSIGNED) AS customer_age
FROM 
    support_logs;
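A related safeguard, also used in the full example below: when a text column marks missing values with a placeholder string such as 'N/A', NULLIF converts the placeholder to NULL first, so the cast returns NULL instead of tripping over non-numeric characters:

SELECT 
    -- NULLIF turns 'N/A' into NULL; REGEXP_REPLACE strips any non-digit characters
    CAST(REGEXP_REPLACE(NULLIF(resolution_time, 'N/A'), '[^0-9]', '') AS UNSIGNED) AS resolution_time_hours
FROM support_logs;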

SQL Example of Handling Unstructured Data

Let’s revisit Table_2: the customer support logs table, and use the techniques mentioned above to convert the unstructured data into a structured data table which is ready for analysis.

Table_2: customer support log (mock data by the author)

Here are the tasks that we aim to accomplish:

  1. Extract customer name and age from the customer_data column.
  2. Handle the missing values in the issue_description column.
  3. Extract the priority and status of each ticket from the extra_info column to help the IT team prioritize workloads and track status of each ticket.
  4. Extract the resolution time (in hours) for further analysis.
  5. Rank tickets by the length of the issue_description.
  6. Tokenize phone numbers to protect customer privacy.
SELECT 
    ticket_id,
    customer_id,

    -- Extract customer names and ages
    JSON_VALUE(customer_data, '$.name') AS customer_name,
    CAST(JSON_VALUE(customer_data, '$.age') AS UNSIGNED) AS customer_age,

    -- Handle missing issue descriptions
    COALESCE(issue_description, 'No issue reported') AS issue_description,

    -- Extract ticket priority and status
    SUBSTRING_INDEX(REGEXP_SUBSTR(extra_info, 'priority=[^;]+'), '=', -1) AS priority,
    SUBSTRING_INDEX(SUBSTRING_INDEX(extra_info, 'status=', -1), ';', 1) AS status,

    -- Extract hours of ticket resolution
    CAST(REGEXP_REPLACE(NULLIF(resolution_time, 'N/A'), '[^0-9]', '') AS UNSIGNED) AS resolution_time_hours,

    -- Rank tickets by lengths
    RANK() OVER (ORDER BY LENGTH(issue_description) DESC) AS issue_length_rank,

    -- Tokenize phone number
    CONCAT('TOKEN-', RIGHT(MD5(phone_number), 8)) AS tokenized_phone_number
FROM support_logs;

We can transform the original unstructured data into a structured table with no sensitive information, as shown below:

Output of text mining (screenshot by the author)

Conclusion

SQL is not only a powerful tool for data retrieval and manipulation; its syntax for handling text from various sources, such as logs, emails, websites, social media, and mobile apps, is also exceptionally robust.

Unstructured data may be confusing and difficult to interpret, but by utilizing SQL’s relevant functionalities, we can extract highly valuable insights from it and drive Data Science projects to new heights.

Thank you for reading! If you found this article helpful, please give it some claps! Follow me and subscribe via email to receive notifications whenever I publish a new article. My goal is to help data analysts and data scientists, whether you’re a beginner or experienced, enhance your technical skills and achieve greater success in your career.

