
The ideal dataset for data analysis is like Table_1:

However, the datasets that we encounter in reality are mostly like Table_2:

The main difference between these two tables is whether the data is neatly organized into rows and columns and presented purely as numbers or text. Because of this difference, the data in Table_1 is called structured data, while the data in Table_2 is categorized as unstructured data.
Unstructured data refers to information that doesn’t have a predetermined structure or format. It is difficult to store and manage in a relational database, yet it often contains valuable information for generating data insights, training machine learning models, or performing natural language processing (NLP).
In this article, I’ll introduce 7 advanced SQL techniques for handling unstructured data. Although these techniques are called ‘advanced’ in SQL, they actually form the foundation of data parsing and text mining.
JSON Parsing
JSON is short for "JavaScript Object Notation". It is a text-based format used in web development to exchange information between a server and a web application. JSON is widely used because of advantages like ease of use, platform independence and flexibility. However, we can rarely analyze JSON data directly or obtain insights from it, due to its complex structure and varying levels of nesting. The syntax for JSON parsing is as follows:
SELECT JSON_VALUE(json_column, '$.key') AS extracted_value
FROM table_name;
In Table_2, the column customer_data is JSON data, which can be converted into two columns, name and age.
SELECT
JSON_VALUE(customer_data, '$.name') AS name,
JSON_VALUE(customer_data, '$.age') AS age
FROM support_logs;
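JSON_VALUE() also accepts longer path expressions for nested objects. As a minimal sketch, assuming customer_data contained a hypothetical nested address object, the city could be extracted like this:
-- '$.address.city' walks one level deeper into the JSON document
SELECT JSON_VALUE(customer_data, '$.address.city') AS city
FROM support_logs;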

Regular Expression
A regular expression (regex) is a sequence of characters used to define a search pattern. Regex allows developers to find and manipulate complex string data within a database by matching specific patterns in text. The syntax for regex in SQL is:
SELECT column_name
FROM table_name
WHERE column_name REGEXP 'pattern';
Regex is widely used for flexible search, data validation and data extraction. A typical use case is to extract emails from text.
SELECT column_name
FROM users
WHERE column_name REGEXP '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}';
Sometimes people tend to confuse the REGEXP operator with the LIKE operator. Both are used in SQL to perform pattern matching within strings, but LIKE only supports basic wildcards, while REGEXP allows complex patterns using regular expression syntax and offers greater flexibility for advanced pattern matching scenarios due to its more robust syntax. On the other hand, regex requires a deeper understanding of the structure of the original data, which can be challenging.
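To make the contrast concrete, here is a rough sketch of the same search written both ways, reusing the users table from above; the LIKE version can only anchor on a literal substring, while the REGEXP version also validates the characters around it:
-- LIKE: literal substring with % wildcards only
SELECT column_name
FROM users
WHERE column_name LIKE '%@gmail.com%';
-- REGEXP: validates the full pattern around the match
SELECT column_name
FROM users
WHERE column_name REGEXP '[a-zA-Z0-9._%+-]+@gmail\\.com';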
Key-Value Pair Parsing
JSON parsing assumes a relatively standardized format. If the string data has a more ad hoc structure, with keys and values separated by a delimiter such as a colon, a semicolon or an equals sign, JSON parsing is not enough for data extraction. Instead, we should use key-value pair parsing.
Key-Value Pair Parsing is the process of extracting and separating data stored in a format where each piece of information is represented as a "key" paired with its corresponding "value". This method is implemented with string functions or regular expressions (regex).
String functions like SUBSTRING, SUBSTRING_INDEX, POSITION and REPLACE are frequently used to extract the key and value parts based on the delimiters. An example of using the SUBSTRING_INDEX function for parsing key-value pairs is as follows.
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(column_name, 'key=', -1), ';', 1) AS value
FROM table_name;
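For instance, assuming a sample string in the same style as Table_2’s extra_info column, say 'priority=High;status=Open', the nested SUBSTRING_INDEX() calls peel out one value at a time:
SELECT
-- inner call keeps everything after 'priority=', outer call cuts at the next ';'
SUBSTRING_INDEX(SUBSTRING_INDEX('priority=High;status=Open', 'priority=', -1), ';', 1) AS priority,
SUBSTRING_INDEX(SUBSTRING_INDEX('priority=High;status=Open', 'status=', -1), ';', 1) AS status;
This returns 'High' and 'Open' respectively.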
Text Analytics with Window Functions
Window functions perform calculations across a specified set of rows, known as a "window", from a query and return a single value related to the current row. You can refer to my other article ‘The Most Useful Advanced SQL Techniques to Succeed in the Tech Industry’ for a thorough understanding of the syntax, benefits and use cases of window functions.
Besides computing metrics across partitions, window functions can also be used for text-based analysis, which very few people realize.
The applications of window functions to text-based analysis are versatile. For example, if we’d like to rank text values by their length, we can use the syntax below:
SELECT column_name, RANK() OVER (ORDER BY LENGTH(column_name) DESC) AS length_rank
FROM table_name;
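Window functions can also combine text metrics with partitions. As a minimal sketch against the support logs table, this query compares each ticket’s description length with the average length across all tickets from the same customer:
SELECT
ticket_id,
LENGTH(issue_description) AS description_length,
-- average description length within each customer's partition
AVG(LENGTH(issue_description)) OVER (PARTITION BY customer_id) AS avg_length_per_customer
FROM support_logs;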
Data Tokenization
Data tokenization is a security technique where sensitive data, such as credit card numbers or social security numbers, is replaced with randomized tokens. SQL itself doesn’t inherently support tokenization, but it can work with tokenized data through:
- Lookup Tables: A mapping table associates tokens with their original values.
- Encryption or Hash Functions: While not true tokenization, these methods can obfuscate data, as sketched after this list.
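As a minimal sketch of the hash-based approach (strictly speaking pseudo-tokenization, since the token is derived from the value rather than issued at random), a phone number can be replaced with a short token built from its MD5 hash:
SELECT
ticket_id,
-- keep only a short suffix of the hash so the raw number never appears
CONCAT('TOKEN-', RIGHT(MD5(phone_number), 8)) AS tokenized_phone_number
FROM support_logs;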
Data tokenization is not typically considered a methodology for cleaning unstructured datasets. However, it is an important technique for text mining, especially as data privacy is increasingly breached and data security becomes a serious risk in data usage.
COALESCE()
COALESCE() is a function that returns the first non-null value from a list of expressions. It’s useful for handling incomplete or inconsistent data, which is very common in unstructured datasets. The syntax of the COALESCE() function is:
SELECT COALESCE(column1, column2, 'default_value') AS result
FROM table_name;
COALESCE() is widely used for replacing null values, selecting the first available value, or implementing fallback logic.
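For example, a fallback chain across contact fields (assuming a hypothetical email column alongside the phone_number column from Table_2) might look like this:
-- take email if present, otherwise phone number, otherwise a placeholder
SELECT COALESCE(email, phone_number, 'no contact on file') AS best_contact
FROM support_logs;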
CAST()
CAST() converts data from one type to another. The syntax is:
SELECT CAST(column_name AS target_type) AS converted_value
FROM table_name;
When using the CAST() function, we must be cautious, especially when the data contains missing values (NULL).
SELECT
CAST(JSON_VALUE(customer_data, '$.age') AS INT) AS customer_age
FROM
support_logs;
The code above will return an error in MySQL, because CAST() does not accept INT as a target type; it expects SIGNED or UNSIGNED instead. Note that missing values themselves are not the problem: when '$.age' is absent, JSON_VALUE() returns NULL, and CAST() simply passes the NULL through, as in the corrected query below.
SELECT
CAST(JSON_VALUE(customer_data, '$.age') AS UNSIGNED) AS customer_age
FROM
support_logs;
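If a NULL produced by parsing should become a concrete default instead, CAST() can be wrapped in the COALESCE() function introduced above; a minimal sketch, assuming 0 is an acceptable placeholder age:
SELECT
-- fall back to 0 when '$.age' is missing from the JSON document
COALESCE(CAST(JSON_VALUE(customer_data, '$.age') AS UNSIGNED), 0) AS customer_age
FROM support_logs;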
SQL Example of Handling Unstructured Data
Let’s revisit Table_2, the customer support logs table, and use the techniques above to convert the unstructured data into a structured table that is ready for analysis.

Here are the tasks that we aim to accomplish:
- Extract customer name and age from the customer_data column.
- Handle the missing values in the issue_description column.
- Extract the priority and status of each ticket from the extra_info column to help the IT team prioritize workloads and track the status of each ticket.
- Extract the resolution time (in hours) for further analysis.
- Rank tickets by the length of the issue_description.
- Tokenize phone numbers to protect customer privacy.
SELECT
ticket_id,
customer_id,
-- Extract customer names and ages
JSON_VALUE(customer_data, '$.name') AS customer_name,
CAST(JSON_VALUE(customer_data, '$.age') AS UNSIGNED) AS customer_age,
-- Handle missing issue descriptions
COALESCE(issue_description, 'No issue reported') AS issue_description,
-- Extract ticket priority and status
SUBSTRING_INDEX(REGEXP_SUBSTR(extra_info, 'priority=[^;]+'), '=', -1) AS priority,
SUBSTRING_INDEX(SUBSTRING_INDEX(extra_info, 'status=', -1), ';', 1) AS status,
-- Extract hours of ticket resolution
CAST(REGEXP_REPLACE(NULLIF(resolution_time, 'N/A'), '[^0-9]', '') AS UNSIGNED) AS resolution_time_hours,
-- Rank tickets by lengths
RANK() OVER (ORDER BY LENGTH(issue_description) DESC) AS issue_length_rank,
-- Tokenize phone number
CONCAT('TOKEN-', RIGHT(MD5(phone_number), 8)) AS tokenized_phone_number
FROM support_logs;
We can transform the original unstructured data into a structured table with no sensitive information, as shown below:

Conclusion
SQL is not only a powerful tool for data retrieval and manipulation, but its syntax for handling text from various sources, such as logs, emails, websites, social media, and mobile apps, is also exceptionally robust.
Unstructured data may be confusing and difficult to interpret, but by utilizing SQL’s relevant functionalities, we can extract highly valuable insights from it and drive Data Science projects to new heights.
Thank you for reading! If you found this article helpful, please give it some claps! Follow me and subscribe via email to receive notifications whenever I publish a new article. My goal is to help data analysts and data scientists, whether you’re a beginner or experienced, enhance your technical skills and achieve greater success in your career.