Regular Expression Matching

Today, let’s focus on implementing a regular expression engine. Matching a given input string against a regular expression pattern is a very common application in the real world. Most regular expression engines are far too complex, and are often based on Finite automata state machines. However, our naive implementation doesn’t use Finite automata state machines. 

Regular Expression Matching - Problem Statement

Given an input string s and a pattern p, implement regular expression matching with support for '.' and '*' where:

  • '.' Matches any single character.
  • '*' Matches zero or more of the preceding element.

The matching should cover the entire input string (not partial).

Regular Expression Matching - Definition

For the input string s and the pattern p, each with length m and n characters respectively, let t[i][j] be the evaluation of the string s1si against the pattern p1pj such that 0im and 0jn. Here’s our final recursive formula for regular expression matching. 

t[i][j]={true,if i=0 and j=0false,if i>0 and j=0false,if i=0 and j>0 and pj!=t[i][j2],if i=0 and j>0 and pj=true,if i>0 and j>0 and pj= and t[i][j2]=truetrue,if i>0 and j>0 and pj= and t[i][j1]=truet[i1][j]and(pj1=.orsi=pj1),if i>0 and j>0 and pj= and !t[i][j2] and !t[i][j1]t[i1][j1],if i>0 and j>0 and pi=.t[i1][j1]andsi=pj,otherwise


Now, let me walk you through each case in our piecewise defined function from the top to bottom, explaining which case covers which scenario. The first case, we have both input string s and the pattern p is empty, thus they can be matched with each other with no problem. Hence the respective table entry is set to true. The second case deals when the input string s is not empty, but the pattern p is empty. In this case we can’t match the string since there exists no pattern, hence setting all the relevant table entries to false.

The cases three and four handle the situation in which we have empty input string s and non-empty pattern p. If the current character we consider in the pattern pj is the regex metacharacter *, then we definitely have to match it with zero occurrences, since no characters exist in the input string s. Thus, we merely set the t[i][j]t[i][j2]. Otherwise, we must match the input string with one or more characters, and we are out of luck, given the input string is empty. Thus we set the relevant table entry to false.

That covers all the trivial cases of our recurrence, where either i=0 or j=0 or both i=j=0. Now, let’s move on to the non-trivial cases of our recurrence.

Note that all the subsequent cases deal with non-trivial cases of our recurrence where both the input string s and the pattern p are non-empty. 

Cases five, six and seven coves all the scenarios associated with the regex metacharacter *. Case five, matches it with zero occurrences of the preceding character, case six matches it with exactly one occurrence of the preceding character and case seven matches it with multiple occurrences of the preceding character respectively.  

Case eight deals with the wildcard character . and case nine covers scenarios in which both si and pj contain lowercase english letters. 

Regular Expression Matching - Approach

We first fill in the table entries associated with the trivial cases of our recurrence shown in the previous step. We then start filling the table entries associated with non-trivial cases of our recurrence. Also note that, we fill in the table entries in row-major order. The problem exhibits optimal substructure and overlapping subproblems properties [1]. Thus, not surprisingly, we shall solve this problem with dynamic programming [1]. Here’s the algorithm for the regular expression matching problem described above.

**Input:** Input string s and pattern p  
**Output:** true if the input sequence s can be matched with the given pattern p, false otherwise.  
REGEXMATCHING(s,p)           
++ ms.length  
++ np.length  
++ Let t[0m][0n] be a new table  
++ t[0][0]true  
++ `for ` i=1 ` to ` m  
++++ t[i][0]false  
++ `for ` j=1 ` to ` n   
++++ `if` pj=  
++++++ t[0][j]t[0][j2]  
++++ `else`  
++++++ t[0][j]false  
++ `for ` i=1 ` to ` m   
++ ++ `for ` j=1 ` to ` n  
++ ++ ++ `if ` pj=  
++ ++ ++ ++ `if` t[i][j2] or t[i][j1]  
++ ++ ++ ++ ++ t[i][j]true   
++ ++ ++ ++ `else`  
++ ++ ++ ++ ++ t[i][j]t[i1][j]and(pj1=.orsi=pj1)  
++ ++ ++ `else `     
++ ++ ++ ++  t[i][j]t[i1][j1]and(pj=.orsi=pj)  
++ `return ` t[m][n]

Figure 1 shows the table t computed by the procedure REGEXMATCHING on the input string s= mississippi and the pattern p=mis*is*p*.

One more thing that bears noting, we use the abbreviations T and F to represent otherwise more verbose literals true and false in our table t.

mis*is*p*.TFFFFFFFFFFmFTFFFFFFFFFiFFTFTFFFFFFsFFFTTFFFFFFsFFFFTFFFFFFiFFFFFTFTFTFsFFFFFFTTFTTsFFFFFFFTFTTiFFFFFFFFFFTpFFFFFFFFFFFiFFFFFFFFFFF


Sample Implementation

Here’s the sample implementation of algorithms described above along with a driver program.


class RegexMatcher {
RegexMatcher() {
throw new AssertionError();
}
public static void main(String[] args) {
final String s = "mississippi";
final String p = "mis*is*p*.";
final boolean match = isMatch(s, p);
System.out.println(match);
}
static boolean isMatch(String s, String p) {
final int m = s.length();
final int n = p.length();
final boolean[][] t = new boolean[m + 1][n + 1];
t[0][0] = true;
for (int i = 1; i <= m; i++)
t[i][0] = false;
for (int j = 1; j <= n; j++) {
if (p.charAt(j - 1) == '*')
t[0][j] = t[0][j - 2];
else
t[0][j] = false;
}
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (p.charAt(j - 1) == '*') {
if (t[i][j - 2] || t[i][j - 1])
t[i][j] = true;
else
t[i][j] = t[i - 1][j] && (p.charAt(j - 2) == '.' || p.charAt(j - 2) == s.charAt(i - 1));
} else
t[i][j] = t[i - 1][j - 1] && (p.charAt(j - 1) == '.' || s.charAt(i - 1) == p.charAt(j - 1));
}
}
return t[m][n];
}
}

Analysis

Let m and n be the length of our input string s and the pattern p respectively. Our approach of regular expression matching incurs O(mn)=O(n2) space and time complexity. 

References

[1] https://www.amazon.com/Introduction-Algorithms-3rd-MIT-Press/dp/0262033844

Comments

Popular posts from this blog

Introducing Java Reactive Extentions in to a SpringBoot Micro Service

Combining the emissions of multiple Observables together using RxJava Zip operator in a Spring Boot Micro service

RabbitMQ Transport in WSO2 ESB