Regular Expression Matching

Today, let’s focus on implementing a regular expression engine. Matching a given input string against a regular expression pattern is a very common application in the real world. Most regular expression engines are far too complex, and are often based on Finite automata state machines. However, our naive implementation doesn’t use Finite automata state machines. 

Regular Expression Matching - Problem Statement

Given an input string s and a pattern p, implement regular expression matching with support for '.' and '*' where:

  • '.' Matches any single character.
  • '*' Matches zero or more of the preceding element.

The matching should cover the entire input string (not partial).

Regular Expression Matching - Definition

For the input string $s$ and the pattern $p$, each with length $m$ and $n$ characters respectively, let $t[i][j]$ be the evaluation of the string $s_1 \ldots s_i$ against the pattern $p_1 \ldots p_j$ such that $0 \le i \le m$ and $0 \le j \le n$. Here’s our final recursive formula for regular expression matching. 

$t[i][j] = \begin{cases} true, & \text{if $i =0$ and $j =0$} \\[2ex] false, & \text{if $i \gt 0$ and $j = 0$} \\[2ex] false, & \text{if $i = 0$ and $j \gt 0$ and $p_j \;!=\; *$} \\[2ex] t[i][j - 2], & \text{if $i = 0$ and $j \gt 0$ and $p_j \;=\; *$} \\[2ex] true, & \text{if $i \gt 0$ and $j \gt 0$ and $p_j \;=\; *$ and $t[i][j - 2] = true$} \\[2ex] true, & \text{if $i \gt 0$ and $j \gt 0$ and $p_j \;=\; *$ and $t[i][j - 1] = true$} \\[2ex] t[i - 1][j] \;and\;\\(p_{j - 1} = . \;or\; s_i=p_{j-1}), & \text{if $i \gt 0$ and $j \gt 0$ and $p_j \;=\; *\;$ and $\;!t[i][j - 2]\;$ and $!t[i][j - 1]$} \\[2ex] t[i - 1][j - 1], & \text{if $i \gt 0$ and $j \gt 0$ and $p_i = .$} \\[2ex] t[i - 1][j - 1] \;and\; s_i = p_j, & \text{otherwise} \end{cases}$


Now, let me walk you through each case in our piecewise defined function from the top to bottom, explaining which case covers which scenario. The first case, we have both input string $s$ and the pattern $p$ is empty, thus they can be matched with each other with no problem. Hence the respective table entry is set to $true$. The second case deals when the input string $s$ is not empty, but the pattern $p$ is empty. In this case we can’t match the string since there exists no pattern, hence setting all the relevant table entries to $false$.

The cases three and four handle the situation in which we have empty input string $s$ and non-empty pattern $p$. If the current character we consider in the pattern $p_j$ is the regex metacharacter *, then we definitely have to match it with zero occurrences, since no characters exist in the input string $s$. Thus, we merely set the $t[i][j] \gets t[i][j - 2]$. Otherwise, we must match the input string with one or more characters, and we are out of luck, given the input string is empty. Thus we set the relevant table entry to $false$.

That covers all the trivial cases of our recurrence, where either $i = 0$ or $j = 0$ or both $i = j = 0$. Now, let’s move on to the non-trivial cases of our recurrence.

Note that all the subsequent cases deal with non-trivial cases of our recurrence where both the input string $s$ and the pattern $p$ are non-empty. 

Cases five, six and seven coves all the scenarios associated with the regex metacharacter *. Case five, matches it with zero occurrences of the preceding character, case six matches it with exactly one occurrence of the preceding character and case seven matches it with multiple occurrences of the preceding character respectively.  

Case eight deals with the wildcard character . and case nine covers scenarios in which both $s_i$ and $p_j$ contain lowercase english letters. 

Regular Expression Matching - Approach

We first fill in the table entries associated with the trivial cases of our recurrence shown in the previous step. We then start filling the table entries associated with non-trivial cases of our recurrence. Also note that, we fill in the table entries in row-major order. The problem exhibits optimal substructure and overlapping subproblems properties [1]. Thus, not surprisingly, we shall solve this problem with dynamic programming [1]. Here’s the algorithm for the regular expression matching problem described above.

**Input:** Input string $s$ and pattern $p$  
**Output:** $true$ if the input sequence $s$ can be matched with the given pattern $p$, $false$ otherwise.  
$REGEX-MATCHING(s, p)$           
$\phantom{{}++{}}$ $m \gets s.length$  
$\phantom{{}++{}}$ $n \gets p.length$  
$\phantom{{}++{}}$ Let $t[0\ldots m][0 \ldots n]$ be a new table  
$\phantom{{}++{}}$ $t[0][0] \gets true$  
$\phantom{{}++{}}$ `for ` $i=1$ ` to ` $m$  
$\phantom{{}++{}} \phantom{{}++{}}$ $t[i][0] \gets false$  
$\phantom{{}++{}}$ `for ` $j = 1$ ` to ` $n$   
$\phantom{{}++{}} \phantom{{}++{}}$ `if` $p_j = *$  
$\phantom{{}++{}} \phantom{{}++{}} \phantom{{}++{}}$ $t[0][j] \gets t[0][j - 2]$  
$\phantom{{}++{}} \phantom{{}++{}}$ `else`  
$\phantom{{}++{}} \phantom{{}++{}} \phantom{{}++{}}$ $t[0][j] \gets false$  
$\phantom{{}++{}}$ `for ` $i = 1$ ` to ` $m$   
$\phantom{{}++{}}$ $\phantom{{}++{}}$ `for ` $j = 1$ ` to ` $n$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ `if ` $p_j = *$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ `if` $t[i][j - 2]\;$ or $\;t[i][j - 1]$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $t[i][j] \gets true$   
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ `else`  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $t[i][j] \gets t[i - 1][j] \;and\; (p_{j - 1} = . \;or\; s_i = p_{j - 1})$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ `else `     
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$  $t[i][j] \gets t[i - 1][j - 1] \;and\; (p_j = . \;or\; s_i = p_j)$  
$\phantom{{}++{}}$ `return ` $t[m][n]$

Figure 1 shows the table $t$ computed by the procedure $REGEX-MATCHING$ on the input string $s= $ mississippi and the pattern $p = $mis*is*p*.

One more thing that bears noting, we use the abbreviations $T$ and $F$ to represent otherwise more verbose literals $true$ and $false$ in our table $t$.

$$\begin{array}{|c|c|c|}  \hline
 & \text{}  & \text{m} & \text{i} & \text{s} & \text{*} & \text{i}& \text{s}& \text{*} & \text{p} & \text{*} & \text{.} \\ \hline
\text{} & T & F & F & F & F & F & F & F & F & F & F \\ \hline
\text{m} & F & T & F & F & F & F & F & F & F & F & F \\ \hline
\text{i} & F & F & T & F & T & F & F & F & F & F & F \\ \hline
\text{s} & F & F & F & T & T & F & F & F & F & F & F \\ \hline
\text{s} & F & F & F & F & T & F & F & F & F & F & F \\ \hline
\text{i} & F & F & F & F & F & T & F & T & F & T & F \\ \hline
\text{s} & F & F & F & F & F & F & T & T & F & T & T \\ \hline
\text{s} & F & F & F & F & F & F & F & T & F & T & T \\ \hline
\text{i} & F & F & F & F & F & F & F & F & F & F & T \\ \hline
\text{p} & F & F & F & F & F & F & F & F & F & F & F \\ \hline
\text {i}& F  & F  & F & F & F & F & F & F & F & F & F \\ \hline
\end{array}$$


Sample Implementation

Here’s the sample implementation of algorithms described above along with a driver program.


Analysis

Let $m$ and $n$ be the length of our input string $s$ and the pattern $p$ respectively. Our approach of regular expression matching incurs $O(m \cdot n) = O(n^2)$ space and time complexity. 

References

[1] https://www.amazon.com/Introduction-Algorithms-3rd-MIT-Press/dp/0262033844

Comments

Popular posts from this blog

Introducing Java Reactive Extentions in to a SpringBoot Micro Service

Optimal binary search trees

Edit distance