;;; TSTRIE -- maintain ordered key/value dictionary where keys are strings
;;; by Phil Bewig (pbewig@swbell.net) of St Louis, Missouri USA in April 2002
;;; Copyright (C) 2002 by Philip L. Bewig. All rights reserved.
Ternary search tries are a recently-discovered data structure for mapping
strings to values. They are typically faster than hash tables because
they take advantage of the fact that keys are known to be strings, and
they provide the in-order access of binary trees without the complexity.
Ternary search tries were described by Jon Bentley and Robert Sedgewick
at the 1997 SODA conference; the full reference is given below, along
with several other books, articles, and web sites. Variants of ternary
search tries can be used to sort strings, match partial strings, and
locate near neighbors of strings, among other useful applications,
but these variants are not discussed here.
This program implements ternary search tries as an abstract data type.
A new ternary search trie is created by a function of no arguments
(the term ternary search trie may abbreviated to tstrie, pronounced
tess'-tree):
create-tstrie -- create a new ternary search trie
A tstrie object responds to the following messages:
'lookup key -- if key is present in the ternary search trie, return
a (key . value) pair; if the key is not present, return #f
'insert key value -- return a newly-allocated ternary search trie
containing the indicated key and value, without regard to whether
the key was already present
'delete key -- return a newly-allocated ternary search trie not
containing the indicated key, without regard to whether the key
was already present
'update key value proc -- if the key is present in the ternary
search trie, return a newly-allocated ternary search trie
with the current value associated with the key replaced by
the result of (proc key oldvalue); if the key is not present,
return a newly-allocated ternary search trie containing key and
value; the action performed by 'update is similar to the action
performed by a 'lookup followed by an 'insert, but more efficient,
as only one search must be made through the ternary search trie
'map proc -- return a newly-allocated ternary search trie with each
value replaced by the result of evaluating (proc key oldvalue);
the order in which keys are visited is unspecified
'fold proc base -- the result of applying (proc prior-fold key value)
pairwise first to base and the alphabetically first key/value pair
in the ternary search trie, then applying proc pairwise to the
result of the first application of proc and the alphabetically
second key/value pair in the ternary search trie, and so on,
until reaching the end of the ternary search trie; this is
equivalent to left-folding on the list created by 'to-list,
but more efficient because the intermediate list isn't needed
'for-each proc -- apply the procedure (proc key value) to each
key/value pair in the ternary search trie; the return value is
unspecified; keys are visited in ascending alphabetic order
'to-list -- return a list of (key . value) pairs for each item in
the ternary search trie, in ascending alphabetic order of keys
'from-list list -- return a newly-allocated ternary search trie
containing all the key/value pairs in the original ternary
search trie plus all the (key . value) pairs in list; if a
key from the list is the same as a key already in the ternary
search trie, the associated value in the list replaces the
original value; if the same key appears twice in the list, the
If any error occurs while processing a request in the ternary search trie,
an error value consisting of a pair whose car is the symbol 'ERROR and
whose cdr is a description of the error will be returned.
All of the functions that return a ternary search trie are applicative,
meaning that the original ternary search trie, and all of its key/value
pairs, still exist if they have been bound to a variable. Nodes on the
search path will be newly allocated, but nodes not on the search path
will be shared between the two ternary search tries.
Here is a synopsis of the tstrie abstract data type:
;;; CREATE-TSTRIE => tst -- create a new ternary search trie
;;; tst 'LOOKUP key -- (key . value) if key in tst, #f otherwise
;;; tst 'INSERT key value -- a new tstrie containing key and value
;;; tst 'DELETE key -- new tstrie not containing key
;;; tst 'UPDATE key value proc -- store (proc key oldvalue) or value in tst
;;; tst 'MAP proc -- return a new tstrie with each value updated by proc
;;; tst 'FOLD proc base -- apply proc pairwise across all key/value pairs
;;; tst 'FOR-EACH proc -- perform (proc key value) for each tst item, in order
;;; tst 'TO-LIST -- a newly-allocated list containing (key . value) pairs
;;; tst 'FROM-LIST lst -- insert items in lst into tst
;;; ('ERROR . message) -- value returned on error
;;; TSTRIE? obj -- #t if obj is a tstrie, #f otherwise
As the name implies, ternary search tries are a cross between binary
search trees and radix tries. Consider binary trees. A tree consists
of leaves and nodes, where nodes contain additional leaves and nodes;
a binary tree has no more than two children per node. Sub-nodes imply
an ordering relationship; all sub-nodes of the left child contain keys
less than the current node, and all sub-nodes of the right child contain
keys greater than the current node. One node is distinguished as the
root of the tree, having no ancestors. Binary trees can be searched by
starting at the root and comparing the current key to the search key;
if the two keys are the same, the search terminates with success, if
the search reaches a leaf, it terminates with failure, and otherwise the
search descends to the appropriate child, where it continues recursively.
Because each descent to a new level eliminates half the keys from the
search space, searching in a binary tree takes logarithmic time.
When the keys are strings, radix tries provide an alternative to binary
search trees. A radix trie considers each of the characters in the
string, in sequence. Like a binary search tree, a radix trie consists
of leaves and nodes, where nodes contain additional leaves and nodes;
unlike a binary search tree, a node of a radix trie has as many children
as there are characters in the alphabet (that's twenty-six for english,
but 34,000 for unicode). Radix tries can be searched by starting at the
root and comparing the first character of the string to the trie; if the
appropriate child of the trie is empty, the search ends with failure, if
the appropriate child of the trie is a leaf, it terminates with success,
and otherwise the search descends to the appropriate child, where it
continues recursively with the next character. Compared to binary trees,
radix tries can be blazingly fast; a search for "is" starts at the root,
takes the "i" branch, then the "s" branch, and ends at the desired leaf.
The bad news is that tries have huge space requirements, due to the high
fanout at each level of the trie.
Ternary search tries consist of recursive nodes and leaves, just like
binary search trees and radix tries. As the name implies, a node of a
ternary search trie has three children, one for lesser children, one
for greater children, and a third for equal children. Searches proceed
character-by-character, as with a trie. At each level of the trie, the
search compares the current character of the search string with the
character stored in the node. If the search character is less, the
search continues at the less-than child, and if the search character is
greater, the search continues at the greater-than child. However, when
the two characters are equal, the search continues recursively at the
equal-to child, proceeding to the next character in the search string.
In summary, ternary search tries combine the low space and run-time
overhead of binary search trees with the character-based efficiency
of radix tries. The basic problem of radix tries is excessive use of
space for nearly-empty nodes; ternary search tries can be thought of as
radix tries that solve this problem, at a slight cost for full nodes.
Ternary search tries have been proven to have logarithmic access times for
all of the basic operations (lookup, insert, delete), offer the ability
to traverse the keys in alphabetic order, and nearly always outperform
hash tables in practice.
The interested reader should consult the following articles that describe
the underlying theory and practical uses of ternary search tries:
Fast Algorithms for Sorting and Searching Strings, by Jon L. Bentley
and Robert Sedgewick, appeared in the Proceedings of the Eighth
Annual ACM-SIAM Symposium on Discrete Algorithms in January 1997.
This is the original paper on ternary search tries. It is available
in pdf format at http://www.cs.princeton.edu/~rs/strings/paper.pdf
or in postscript format at the same url but with .pdf changed to .ps.
The paper describes the use of ternary search tries in a commercial
Optical Character Recognition system built at Bell Labs.
Ternary Search Trees, by Jon Bentley and Bob Sedgewick, in Dr. Dobb's
Journal, April 1998, pages 20-25. The article is available at
http://www.ddj.com/articles/1998/9804/9804a/9804a.htm.
A sample implementation of ternary search tries in C is available at
http://www.cs.princeton.edu/~rs/strings/. This code is by Bentley
and Sedgewick and is referenced in the two papers cited above.
Algorithm Alley: Sorting Strings with Three-Way Radix Quicksort,
by Jon Bentley and Robert Sedgewick, in Dr. Dobb's Journal, November
1998, pages 133-138, discusses how an isomorphism to ternary search
tries can be used to sort strings very quickly.
The Analysis of Hybrid Trie Structures, by Julien Clement, Philippe
Flajolet and Brigitte Vallee was published as an INRIA Technical
Report in November 1997; the paper appeared in the Proceedings of
the Ninth ACM-SIAM Symposium on Discrete Algorithms in January 1998
and is avail- able at http://users.info.unicaen.fr/~clement/SODA.
This paper proves that lookup, insert and delete in ternary search
tries are logarithmic in the number of keys present.
Code to implement ternary search tries in ML is available at http:-
//www.hardcoreprocessing.com/Download/StandardMLCode/Collections_-
20001102.tar.gz; the current scheme implementation owes much to this
ML implementation.
Algorithms in C, 3rd edition, by Robert Sedgewick (Addison-Wesley,
1998) discusses ternary search tries in Section 15.4. The C++
version of the book has a similar discussion, also in Section 15.4.
Tstries are represented by two-slot leaf vectors and six-slot node
vectors. The two leaf slots are the same as the first two node slots: a
boolean that tells whether or not the current leaf/node contains a value
and the associated value, which is only meaningful if the boolean is #t.
The third slot in a node is the character that splits the tstrie between
less-than and greater-than. The three remaining node slots are the three
child tstries: the less-than tstrie, called lokid, the equal-to tstrie,
called eqkid, and the greater-than tstrie, called hikid (these names come
from the Bentley/Sedgewick SODA paper). Access functions are provided to
simplify code, written as macros so the code will be in-lined for speed.
Also shown below is the definition of a nil tstrie and a predicate to
recognize it.
<< internal functions of the tstrie abstract data type >>=
;; ACCESS FUNCTIONS -- functions to access members of a tstrie
;; LEAF val? val -- create a leaf
(define-syntax leaf (syntax-rules ()
((leaf v? v) (vector v? v))))
;; NODE val? val split lokid eqkid hikid -- create a node
(define-syntax node (syntax-rules ()
((node v? v s l e h) (vector v? v s l e h))))
;; LEAF? tst -- #t => leaf, #f => node
(define-syntax leaf? (syntax-rules ()
((leaf? tst) (= (vector-length tst) 2))))
;; NODE? tst -- #t => node, #f => leaf
(define-syntax node? (syntax-rules ()
((node? tst) (= (vector-length tst) 6))))
;; VAL? tst -- is value stored in leaf/node?
(define-syntax val? (syntax-rules ()
((val? tst) (vector-ref tst 0))))
;; VAL tst -- the value stored in leaf/node
(define-syntax val (syntax-rules ()
((val tst) (vector-ref tst 1))))
;; SPLIT tst -- node only -- list of key characters
(define-syntax split (syntax-rules ()
((split tst) (vector-ref tst 2))))
;; LOKID tst -- node only -- less-than sub-tstrie
(define-syntax lokid (syntax-rules ()
((lokid tst) (vector-ref tst 3))))
;; EQKID tst -- node only -- equal-to sub-tstrie
(define-syntax eqkid (syntax-rules ()
((eqkid tst) (vector-ref tst 4))))
;; HIKID tst -- node only -- greater-than sub-tstrie
(define-syntax hikid (syntax-rules ()
((hikid tst) (vector-ref tst 5))))
;; NIL -- the nil tstrie
(define nil (leaf #f 0))
;; NIL? tst -- #t if tst is nil, #f otherwise
(define-syntax nil? (syntax-rules ()
((nil? tst) (eq? tst nil))))
The first function to be discussed is the lookup function, because the
other functions -- insert, delete, and update -- are all variations of
lookup, as they first have to find the appropriate place in the tstrie
to perform their work. If there are no more characters in the key --
(null? k) => #t -- then the search terminates; the first four tests
look at all combinations of leaf/node and value/novalue and make the
appropriate return. If characters remain in the key, but the search
has reached a leaf, the search fails; this is the fifth test. Finally,
if both the key and the tstrie are non-null, the search continues lower
in the tstrie; the last three tests compare the key character to the
split character and branch to the appropriate sub-tstrie, recursively
calling the named let. Note that of the three searches to sub-tstries,
only the search on eqkid consumes an element of the key.
<< internal functions of the tstrie abstract data type >>=
;; LOOKUP tst key -- (key . value) if key in tst, #f otherwise
(define (lookup tst key)
(let lookup ((t tst) (k (string->list key)))
(cond ((and (null? k) (leaf? t) (val? t)) (cons key (val t)))
((and (null? k) (leaf? t) (not (val? t))) #f)
((and (null? k) (node? t) (val? t)) (cons key (val t)))
((and (null? k) (node? t) (not (val? t))) #f)
((leaf? t) #f)
((char>=
;; INSERT tst key value -- a new tstrie containing key and value
(define (insert tst key value)
(let insert ((t tst) (k (string->list key)))
(cond ((and (null? k) (leaf? t))
(leaf #t value))
((and (null? k) (node? t))
(node #t value (split t) (lokid t) (eqkid t) (hikid t)))
((leaf? t)
(node (val? t) (val t) (car k) nil (insert nil (cdr k)) nil))
((char>=
;; DELETE tst key -- new tstrie not containing key
(define (delete tst key)
(let delete ((t tst) (k (string->list key)))
(cond ((and (null? k) (leaf? t)) nil)
((and (null? k) (node? t))
(node #f 0 (split t) (lokid t) (eqkid t) (hikid t)))
((leaf? t) t)
((char>=
;; UPDATE tst key value proc -- store (proc key oldvalue) or value in tst
(define (update tst key value proc)
(let update ((t tst) (k (string->list key)))
(cond ((and (null? k) (leaf? t))
(if (val? t) (leaf #t (proc key (val t))) (leaf #t value)))
((and (null? k) (node? t))
(if (val? t) (leaf #t (proc key (val t))) (leaf #t value)))
((leaf? t)
(node (val? t) (val t) (car k)
nil (insert nil (list->string (cdr k)) value) nil))
((char>=
;; MAP tst proc -- return a new tstrie with each value updated by proc
(define (map tst proc)
(let map ((t tst) (k '()))
(if (leaf? t)
(leaf (val? t)
(if (val? t)
(proc (list->string (reverse k)) (val t))
(val t)))
(node (val? t)
(if (val? t)
(proc (list->string (reverse k)) (val t))
(val t))
(split t)
(map (lokid t) k)
(map (eqkid t) (cons (split t) k))
(map (hikid t) k)))))
Fold is a process that repeatedly applies a binary function to an accumu-
lating base value and a key/value pair, in ascending alphabetical order;
it is the same operation as folding a list, applied to the tstrie. For
example, (tst 'fold (lambda (x k v) (+ x 1)) 0) computes the number of
key/value pairs that are in the tstrie. The function is simple enough,
traversing the tstrie with recursion through nodes that termintes at
leaves, updating an accumulating variable x each time (val? t) is true.
<< internal functions of the tstrie abstract data type >>=
;; FOLD tst proc base -- apply proc pairwise across all key/value pairs
(define (fold tst proc base)
(let fold ((t tst) (k '()) (x base))
(if (leaf? t)
(if (val? t)
(proc x (list->string (reverse k)) (val t))
x)
(let ((x (fold (lokid t) k x)))
(let ((x (fold (eqkid t)
(cons (split t) k)
(if (val? t)
(proc x (list->string (reverse k)) (val t))
x))))
(fold (hikid t) k x))))))
For-each visits each key in the tstrie, in ascending alphabetical order,
and performs a procedure at each; the return value is unspecified.
<< internal functions of the tstrie abstract data type >>=
;; FOR-EACH tst proc -- perform (proc key value) for each tst item, in order
(define (for-each tst proc)
(let for-each ((t tst) (k '()))
(if (leaf? t)
(if (val? t) (proc (list->string (reverse k)) (val t)))
(begin
(for-each (lokid t) k)
(if (val? t) (proc (list->string (reverse k)) (val t)))
(for-each (eqkid t) (cons (split t) k))
(for-each (hikid t) k)))))
Converting to a list is done by traversing the tstrie; the traversal
is done in reverse order so that each node may be consed to the front
of the list as it is discovered. To-list has many similarities to the
fold function shown earlier; in fact, to-list could be implemented as
(reverse (tst 'fold (lambda (x k v) (cons (cons k v) x)) '())), except
that would unnecessarily generate garbage for the intermediate list.
<< internal functions of the tstrie abstract data type >>=
;; TO-LIST tst -- a newly-allocated list containing (key . value) pairs
(define (to-list tst)
(let to-list ((t tst) (k '()) (l '()))
(if (leaf? t)
(if (val? t)
(cons (cons (list->string (reverse k)) (val t)) l)
l)
(let ((l (to-list (hikid t) k l)))
(let ((l (to-list (eqkid t) (cons (split t) k) l)))
(to-list (lokid t) k (if (val? t)
(cons (cons (list->string
(reverse k))
(val t))
l)
l)))))))
Inserting a whole list of items into a tstrie is done simply by cdring
down the list, inserting key/value pairs as they arrive.
<< internal functions of the tstrie abstract data type >>=
;; FROM-LIST tst lst -- insert items in lst into tst
(define (from-list tst lst)
(if (null? lst)
tst
(from-list (insert tst (caar lst) (cdar lst)) (cdr lst))))
The following function, purposely not mentioned in the list above,
provides a useful debugging tool. The human-readable output is only
sensible on fairly shallow tstries:
<< internal functions of the tstrie abstract data type >>=
;; SHOW tst -- display tstrie; useful for debugging
(define (show tst)
(let show ((tst tst) (prefix ""))
(if (leaf? tst)
(begin
(display prefix)
(display (val? tst))
(display " ")
(display (val tst))
(newline))
(begin
(display prefix)
(display (split tst))
(display " ")
(display (val? tst))
(display " ")
(display (val tst))
(newline)
(show (lokid tst) (string-append prefix "< "))
(show (eqkid tst) (string-append prefix "= "))
(show (hikid tst) (string-append prefix "> "))))))
The dispatch function is the key to the abstract data type. The variable
root is the root of the tstrie; it is saved in the closure that is
passed from one invocation of the abstract data type to the next.
Functions that return a tstrie -- insert, delete, update, map, and
from-list -- all perform the requested action, then return a function
that takes a message and arguments and calls dispatch. Functions that
merely return a value perform the action and return the appopriate value.
Dispatch is long, but each individual case is short, checking arguments
for domain errors then passing control to the requested action.
<< internal functions of the tstrie abstract data type >>=
;; DISPATCH message . args -- call appropriate procedure based on message
(define (dispatch root message args)
(case message
((lookup) ; lookup key
(if (not (= (length args) 1))
(error "missing or extra arguments to lookup")
(if (not (string? (car args)))
(error "non-string key passed to lookup")
(lookup root (car args)))))
((insert) ; insert key value
(if (not (= (length args) 2))
(error "missing or extra arguments to insert")
(if (not (string? (car args)))
("non-string key passed to insert")
(let ((new-root (insert root (car args) (cadr args))))
(lambda (new-message . new-args)
(dispatch new-root new-message new-args))))))
((delete) ; delete key
(if (not (= (length args) 1))
(error "missing or extra arguments to delete")
(if (not (string? (car args)))
(error "non-string key passed to delete")
(let ((new-root (delete root (car args))))
(lambda (new-message . new-args)
(dispatch new-root new-message new-args))))))
((update) ; update key value proc
(if (not (= (length args) 3))
(error "missing or extra arguments to update")
(if (not (string? (car args)))
(error "non-string key passed to update")
(if (not (procedure? (caddr args)))
(error "non-procedure final argument passed to update")
(let ((new-root (apply update (cons root args))))
(lambda (new-message . new-args)
(dispatch new-root new-message new-args)))))))
((map) ; map proc
(if (not (= (length args) 1))
(error "missing or extra arguments to map")
(if (not (procedure? (car args)))
(error "non-procedure argument passed to map")
(let ((new-root (map root (car args))))
(lambda (new-message . new-args)
(dispatch new-root new-message new-args))))))
((fold) ; fold proc base
(if (not (= (length args) 2))
(error "missing or extra arguments to fold")
(if (not (procedure? (car args)))
(error "non-procedure argument passed to map")
(fold root (car args) (cadr args)))))
((for-each) ; for-each proc
(if (not (= (length args) 1))
(error "missing or extra arguments to for-each")
(if (not (procedure? (car args)))
(error "non-procedure argument passed to for-each")
(for-each root (car args)))))
((to-list) ; to-list
(if (not (= (length args) 0))
(error "extra arguments to to-list")
(to-list root)))
((from-list) ; from-list list
(if (not (= (length args) 1))
(error "missing or extra arguments to from-list")
(if (not (list? (car args)))
(error "non-list argument passed to from-list")
(let ((new-root (from-list root (car args))))
(lambda (new-message . new-args)
(dispatch new-root new-message new-args))))))
((type) ; type
'tstrie)
((show) ; show
(if (not (= (length args) 0))
(error "extra arguments to show")
(show root)))))
The dispatch function is also where error checking on arguments passed
by users is performed. All calls to a tstrie are checked for the proper
number of arguments, and that needed arguments are of the appropriate
type. Errors are reported by sending an error message as a result.
<< internal functions of the tstrie abstract data type >>=
;; ERROR message -- return ('ERROR . message)
(define (error message) (cons 'ERROR message))
Here is the main function that creates tstries and responds to messages.
It is simple, just including all the code above, and then creating a
new function, the tstrie, that calls the dispatch function to process
the request. The initial root tstrie is, of course, nil.
;; CREATE-TSTRIE -- create a newly-allocated, empty tstrie
(define (create-tstrie)
<< internal functions of the tstrie abstract data type >>
;; main body of create-tstrie
(lambda (message . args)
(dispatch nil message args)))
The function tstrie? is a predicate that recognizes ternary search tries,
using the 'type message included in the dispatch function above.
;; TSTRIE? obj -- #t if obj is a tstrie, #f otherwise
(define (tstrie? obj) (and (procedure? obj) (eq? (obj 'type) 'tstrie)))
Testing ...
Timing ...
THE OPTIMIZATIONS DISCUSSED BELOW ARE NOT YET IMPLEMENTED
Even though they have only three-way fanout instead of twenty-six or
higher, ternary search tries can suffer some of the same space problems
as radix tries. In the case where a string has a short common prefix
and is then unique, the unique tail will consist of a chain of nodes
with null less-than and greater-than children. Also, it is common to
have strings that differ only in the last few characters, so the ternary
search string will have a long prefix chain with null less-than and
greater-than children. To reduce required space, the ternary search
tries implemented here will have an optimization of the basic ternary
search tries described above: where two successive nodes in a search
path have null less-than and greater-than children, the split characters
stored in the nodes will be combined into a list of split characters in
a single node. This optimization turns out to be fairly simple to
implement and can reduce space requirements by up to two thirds.
The optimized version of lookup is shown below. The only change occurs
in the last test, where it is necessary to look for a list of characters
instead of a single character in the split field. If all the characters
in the list are the same as the head of the search key, the search
continues recursively at the next level of the tstrie; otherwise, the
search ends in failure.
<< optimized version of the lookup function >>=
;; LOOKUP tst key -- (key . value) if key in tst, #f otherwise
(define (lookup tst key)
(let lookup ((t tst) (k (string->list key)))
(cond ((and (null? k) (leaf? t) (val? t)) (cons key (val t)))
((and (null? k) (leaf? t) (not (val? t))) #f)
((and (null? k) (node? t) (val? t)) (cons key (val t)))
((and (null? k) (node? t) (not (val? t))) #f)
((leaf? t) #f)
((char>=
;; PREFIX? list1 list2 -- #t if list1 is a prefix of list2, #f otherwise
(define (prefix? list1 list2)
(cond ((null? list1) #t)
((null? list2) #f)
((eq? (car list1) (car list2))
(prefix? (cdr list1) (cdr list2)))
(else #f)))
<< utility functions used by the tstrie abstract data type >>=
;; DROP list n -- n'th cdr of list
(define (drop list n)
(cond ((<= n 0) list)
((null? list) '())
(else (drop (cdr list) (- n 1)))))
The hard part of optimizing the insert function is that it is possible
that previously-defined split-character lists will need to be split
during the insert process. The two (null? k) clauses are unchanged.
The easy case is the (leaf? t) clause; here, the leaf is replaced with
a node whose split character is the list of characters remaining in the
key, with nil lokid and hikid sub-tstries and a leaf containing the
value in the eqkid; there is no need to recur character-by-character
until the end of the list of key characters. The hard cases are the
three recursive calls, which have to be divided into the cases where
the split is a single character or a list of characters and further
divided into the cases where the split-list is a prefix of the key or
not.
<< optimized version of the insert function >>=
(define (insert tst key value)
(let insert ((t tst) (k (string->list key)))
(cond ((and (null? k) (leaf? t)) (leaf #t value))
((and (null? k) (node? t)) (node #t value (split t)
(lokid t) (eqkid t) (hikid t)))
((leaf? t) (node (val? t) (val t) k nil (leaf #t value) nil))
((and (char