Validate Balanced Parenthesis using SQL

Validating if a string contains balanced parenthesis is a practical problem resulting from string/expression parsing/validation in various practical scenarios. In this article, we look at how to validate such a string containing only open ‘(‘ and close ‘)’ parenthesis using just declarative SQL.

Previous Article: Longest Increasing Subsequence of an array in SQL

Problem Statement

Given a string containing only open and closed parenthesis, can you determine if these parenthesis are balanced? Being balanced means:

Every open parenthesis ‘(‘ has exactly one closing parenthesis ‘)’ after it
A closing parenthesis ‘)’ is always matched with exactly one open parenthesis ‘(‘ that appears before it

The following are examples of valid Balanced Parenthesis strings:

((()()))
()()()()
(()()(()()))

The following are example of invalid balanced parenthesis:

((()( – here, the first 2 and the last open parenthesis don’t have a matching closing parenthesis
()()()()) – here, the last closing parenthesis doesn’t match any other unmatched open parenthesis

Online coverage of the problem statement

Input Table Schema

The input table has 2 columns:

idx: The problem index. i.e. the 1st string to check will have idx = 1, the 2nd string to check will have idx = 2, and so on.
parens: A string containing only open and close parenthesis. This string needs to be checked for well-formedness.

CREATE TABLE inputs(idx SERIAL PRIMARY KEY, parens TEXT NOT NULL);

INSERT INTO inputs(parens) VALUES
('((()()))'), ('((()('), ('()()()())'), ('()()()()'), ('(()()(()()))');

SELECT * FROM inputs;

The input table for the balanced parenthesis problem (Image by author)

Solution: Maintain a running counter: O(n)

In imperative Programming languages, this problem is usually solved by maintaining a stack which receives an ‘(‘ every time it is encountered in the input string. Hence, the stack contains only ‘(‘ characters. Every time a ‘)’ is seen in the input, it is matched with an ‘(‘ on the top of the stack, and the ‘)’ character is popped off.

Since we have only one type of open (and close) brackets, we can eliminate the stack and maintain just a counter, which counts how many unmatched ‘(‘ characters have been seen so far. Every time an ‘(‘ character is encountered, we increment the counter, and every time a ‘)’ character is encountered, we decrement the counter.

Negative counter value: If the counter ever reaches a negative (< 0) value, it indicates an unmatched closing parenthesis.
Final counter value: After all the characters in the input string are processed, if the value of the counter is != 0, it indicates a problem. A positive value indicates an unmatched ‘(‘ character in the input, and a negative value is already considered above.

The Sql code for this solution looks like this:

WITH as_split AS (
  -- First split each input string such that every character is on its
  -- own row. We ensure that we tag each input row with the original
  -- index of the input string from which it came so that we know which
  -- problem it is a part of.
  SELECT
    idx,
    UNNEST(
      STRING_TO_ARRAY(
        parens, NULL
      )
    ) AS ch
  FROM inputs
),

with_annotation AS (
  -- Annotate characters from each problem (unique index) with its
  -- position using ROW_NUMBER(). Also annotate an open paren with
  -- +1 and a close paren with a -1 number, so that we can maintain
  -- a running sum of these counters later.
  SELECT
    idx,
    ch,
    ROW_NUMBER() OVER(PARTITION BY idx) AS row_num,
    CASE WHEN ch = '(' THEN +1 ELSE -1 END AS ctr
  FROM as_split
),

with_running_sum AS (
  -- Find the running sum for characters in each problem. Note that we are
  -- solving all the problems at once (think "batch API") instead of feeding
  -- feeding each problem once into the solution machine.
  SELECT
    idx,
    ch,
    ctr,
    row_num,
    SUM(ctr) OVER(PARTITION BY idx ORDER BY row_num ASC) AS running_sum,
    MAX(row_num) OVER(PARTITION BY idx) AS max_row_num
  FROM with_annotation
),

with_result AS (
  -- The result is valid only if we never hit a negative running sum
  -- (which indicates that we have an extra close paren than a corresponding
  -- open paren) and if we end with a running sum of 0. If we end with a
  -- running sum &gt; 0 then we have an additional open paren that is not
  -- matched with a corresponding close paren.
  SELECT
    idx,
    CASE WHEN MIN(running_sum) &lt; 0 THEN TRUE ELSE FALSE END AS has_negative,
    CASE WHEN SUM(
      CASE WHEN row_num = max_row_num THEN running_sum ELSE 0 END
    ) = 0 THEN TRUE ELSE FALSE END AS ends_with_zero_sum
  FROM with_running_sum
  GROUP BY 1
)

SELECT
  lhs.idx,
  rhs.parens,
  CASE
    WHEN has_negative OR NOT ends_with_zero_sum THEN FALSE
    ELSE TRUE END
  AS is_valid_parens
FROM with_result lhs INNER JOIN inputs rhs
ON lhs.idx = rhs.idx
ORDER BY lhs.idx ASC;

There are quite a few intermediate tables used in the solution above to aid readability and to separate out the various processing steps.

This is the first time we have used the UNNEST keyword in SQL. This is also the first time I’ve written a solution that batch-processes multiple inputs and solves them at once. I leverage the idx field which indicates the index of the input string. All intermediate tables use the idx field to separate out solutions to different problems.

The result of the O(n) solution (Image by author)

Estimated Cost: The estimated cost for this query on a table with 5 distinct input rows is 45k. Most of this cost seems to be coming from the use of the window aggregation functions.

While I’ve tagged the runtime to be O(n), it would depend on how the database engine internally executes the query. For example, if the engine notices that the assignment of the row_num column using ROW_NUMBER() results in rows that have a strictly increasing value for that column, and the database is able to preserve that row order across CTE tables, then it can avoid doing a sort later when it encounters an ORDER BY clause in the window function execution here.

SUM(ctr) OVER(PARTITION BY idx ORDER BY row_num ASC) AS running_sum,

The ORDER BY in the OVER() clause above is essential to ensure that we get a running sum, and not an overall sum for the entire partition.

SQL Fiddle

The SQL Fiddle link to the solution in this post can be found here.

Extensions and exercises

What if we have more than one type of parenthesis such as square brackets [], curly braces {}. How would you evolve the solution to account for this change?
What if the string that make up the open and close bracket pairs were not fixed but specified in another dimension table. How would you evolve the solution to account for this change?

We shall see (in later posts) how these extensions solve real-world and practical problems encountered by SQL practitioners.