Trex — The Tiny Regular Expression Matcher

This note describes a translation to Scheme of the tiny regular expression matcher written by Rob Pike for the book The Practice of Programming by Pike and Brian W. Kernighan. The code was reprinted and described in the book Beautiful Code by Andy Oram and Greg Wilson in a chapter by Kernighan. The types of regular expressions that can be matched are limited, for didactic purposes: c — the literal character c
. — any character
* — any number of repetitions of the preceeding character, including zero
^ or $ — meta-characters that anchor the match to the beginning or end of the text, respectively Here is the original code, in C: [[ /* match: search for re anywhere in text */ int match(char *re, char *text) { if (re[0] == '^') return matchhere(re+1, text); do { /* must look at empty string */ if (matchhere(re, text)) return 1; } while (*text++ != '\0'); return 0; } ]] [[ /* matchhere: search for re at beginning of text */ int matchhere(char *re, char *text) { if (re[0] == '\0') return 1; if (re[1] == '*') return matchstar(re[0], re+2, text); if (re[0] == '$' && re[1] == '\0') return *text == '\0'; if (*text!='\0' && (re[0]=='.' || re[0]==*text)) return matchhere(re+1, text+1); return 0; } ]] [[ /* matchstar: search for c*re at beginning of text */ int matchstar(int c, char *re, char *text) { do { /* a * matches zero or more instances */ if (matchhere(re, text)) return 1; } while (*text!='\0' && (*text++==c || c=='.')); return 0; } ]] The Scheme matcher is similar; it is called as [[(trex]] regexp text[[)]]. The top-level function [[trex]] converts the regexp and text from strings to lists of characters, then calls the matchers that correspond to those shown above: ; trex regex text -- tiny regular expression matcher ; c -- match literal character c ; . -- match any character ; * -- match zero or more of the preceeding character ; ^ and $ -- anchor match to beginning or end of text ; trex :: (string * string) -> boolean (define (trex regex text) << match >> (match (string->list regex) (string->list text))) Matching begins at the [[match]] function. If regex is null, all previous elements of the regular expression matched successfully, and [[match]] returns [[#t]] to indicate a successful match. If text is exhausted before the end of the regular expression, the match fails, and [[match]] returns [[#f]]. If the match is anchored at the beginning of the text, [[match]] calls [[match-here]] to test if the regular expression matches the beginning of the text. Otherwise, [[match]] calls itself recursively at each position through text until if finds a match or exhausts the text. << match >>= (define (match regex text) << match-here >> << match-star >> (cond ((null? regex) #t) ((null? text) #f) ((char=? (car regex) #\^) (match-here (cdr regex) text)) (else (or (match-here regex text) (match regex (cdr text)))))) [[Match-here]] tests if the regex matches at the current position in the input text. If the regex is exhausted, the match succeeds, and [[match-here]] returns [[#t]]. If the regex is at least two characters long, and the second character is an asterisk, [[match-here]] calls [[match-star]] to match the closure. If the regex ends with a dollar sign, [[match-here]] tests the text and returns [[#t]] if text is empty (the null list) and [[#f]] otherwise. If the input text is not exhausted and the next character in the regex is a dot, which matches any character, or is the same as the next character in the text, [[match-here]] advances both regex and text to the next character and calls itself recursively. Otherwise, the match fails, and [[match-here]] returns [[#f]]. << match-here >>= (define (match-here regex text) (cond ((null? regex) #t) ((and (pair? (cdr regex)) (char=? (cadr regex) #\*)) (match-star (car regex) (cddr regex) text)) ((and (char=? (car regex) #\$) (null? (cdr regex))) (null? text)) ((and (pair? text) (or (char=? (car regex) #\.) (char=? (car regex) (car text)))) (match-here (cdr regex) (cdr text))) (else #f))) [[Match-star]] implements closure, which is repetition of a pattern element, using a lazy strategy; it matches the least possible number of repetitions of the pattern element that lead to an overall match of the regex. The first [[cond]] clause tests for a null match and returns [[#t]] immediately. If the input text is not exhausted and the next character in the regex is a dot, which matches any character, or is the same as the next character in the text, [[match-star]] calls itself recursively to match the remaining text. Otherwise, [[match-star]] returns [[#f]] to indicate failure. << match-star >>= (define (match-star c regex text) (cond ((match-here regex text) #t) ((and (pair? text) (or (char=? (car text) c) (char=? c #\.))) (match-star c regex (cdr text))) (else #f))) As Kernighan and Pike point out, the [[match]], [[match-here]] and [[match-star]] procedures are mutually recursive, and it is instructive to trace the recursion on complicated regexs. The test suite below provides a basic set of tests that cover all the code above. Call the test suite as [[(match-test)]]; if all is well, it should produce no output. << match-test >>= (define-syntax assert (syntax-rules () ((assert expr result) (if (not (equal? expr result)) (begin (newline) (display "failed assertion:") (newline) (display 'expr) (newline) (display "expected: ") (display result) (newline) (display "returned: ") (display expr) (newline)))) ((assert descrip expr result) (if (not (equal? expr result)) (begin (newline) (display "failed assertion: ") (display descrip) (newline) (display 'expr) (newline) (display "expected: ") (display result) (newline) (display "returned: ") (display expr) (newline)))))) << match-test >>= ; uncomment next line only for testing ; (define (error s x) (string-append (symbol->string s) ": " x)) << match-test >>= ; executing (match-test) should produce no output (define (match-test) (assert (trex "a" "a") #t) (assert (trex "a" "b") #f) (assert (trex "a" "ab") #t) (assert (trex "a" "ba") #t) (assert (trex "ab" "ab") #t) (assert (trex "ab" "ba") #f) (assert (trex "ab" "xab") #t) (assert (trex "ab" "aab") #t) (assert (trex "a.c" "ac") #f) (assert (trex "a.c" "abc") #t) (assert (trex "a.c" "xac") #f) (assert (trex "a.c" "xabcx") #t) (assert (trex "^ab" "ab") #t) (assert (trex "^ab" "ba") #f) (assert (trex "^ab" "aab") #f) (assert (trex "^ab" "abc") #t) (assert (trex "ab$" "ab") #t) (assert (trex "ab$" "ba") #f) (assert (trex "ab$" "aab") #t) (assert (trex "ab$" "abc") #f) (assert (trex "^ab$" "ab") #t) (assert (trex "^ab$" "ba") #f) (assert (trex "^ab$" "abc") #f) (assert (trex "^ab$" "aba") #f) (assert (trex "a.*c" "ac") #t) (assert (trex "a.*c" "abc") #t) (assert (trex "a.*c" "abbc") #t) (assert (trex "a.*c" "cbba") #f) (assert (trex "aa*" "x") #f) (assert (trex "aa*" "a") #t) (assert (trex "aa*" "aa") #t) (assert (trex "aa*" "ba") #t) (assert (trex "a*a*a" "a") #t) (assert (trex "a*a*a" "aaa") #t) (assert (trex "a*a*a" "xxxxx") #f) )