A Statisticle Speling Korrecter

This note describes a spelling corrector in Scheme similar to the spelling corrector written by Peter Norvig in Python at www.norvig.com/spell-correct.html. Our purpose, besides solving the problem, is to show that Scheme is just as nimble as Python. We'll restrict ourselves to that subset of Scheme defined in R5RS, so we know our code will be portable to any Scheme system. Norvig's system works in two phases. In the training phase, it reads a text, counting the number of occurrences of each word. Then, when asked to suggest a correction, it considers similar words, based on their edit distance from the target word, and chooses the one that occurred most frequently in the training text. Hash tables Looking at Norvig's code, we see that we'll need a dictionary of key/value pairs to store the wordcounts of the training corpus. Scheme provides association-lists, but the linear time-complexity of a-lists will be too slow to deal with the large datasets we'll be using. Norvig used hash tables, but they aren't provided by R5RS; we'll have to build our own. A hash table is essentially a vector of lists containing key/value pairs; it requires a hashing function, to convert the key to an index into the vector, and an equality function, to find the right key/value pair in the list. The [[make-table]] function stores those parameters in a vector, along with a count of the number of items in the hash table, which we will need later: (define (make-table hash eql? size) (vector hash eql? size (make-vector size '()) 0)) It is possible to separate the insert and update functions, but more useful to combine them into one. [[(update! table key value func)]] will insert a new key/value pair into a hash table, if the key is not already present, or update the value currently associated with the key, if the key already exists in the table. The update function takes three arguments, the key, the old value that already exists, and the new value that is being inserted, and returns a value that somehow combines those three arguments and is ultimately stored in the table. For example, if the application requires that the new value replace the existing value, a suitable update function is [[(lambda (k v1 v2) v2)]]. (define (update! table key value func) (let ((hash (vector-ref table 0)) (eql? (vector-ref table 1)) (size (vector-ref table 2)) (tabl (vector-ref table 3)) (count (vector-ref table 4))) (let ((slot (hash size key))) (let loop ((bucket (vector-ref tabl slot))) (cond ((null? bucket) (vector-set! table 4 (+ (vector-ref table 4) 1)) (vector-set! tabl slot (cons (cons key value) (vector-ref tabl slot)))) ((eql? (caar bucket) key) (set-cdr! (car bucket) (func key (cdar bucket) value))) (else (loop (cdr bucket)))))))) The [[lookup]] function retrieves the value associated with a key, or a default value if the key does not exist in the table; it hashes the key, then performs linear search through the list at the indexed location in the hash vector: (define (lookup table key default) (let ((hash (vector-ref table 0)) (eql? (vector-ref table 1)) (size (vector-ref table 2)) (tabl (vector-ref table 3))) (let loop ((bucket (vector-ref tabl (hash size key)))) (cond ((null? bucket) default) ((eql? (caar bucket) key) (cdar bucket)) (else (loop (cdr bucket))))))) Note that [[update!]] and [[lookup]] have a similar structure: both loop on the contents of a slot in the [[tabl]] vector, and both use [[cond]] with identical tests. The similarities give us some confidence that both are correct. [[Count]] is a simple function to return the number of records in a table: (define (count table) (vector-ref table 4)) The keys to our table will be the words in the training corpus, so we need a function to hash strings. The function below dates to the folklore of computing: (define (string-hash size str) (do ((index 0 (+ index 1)) (hash 0 (modulo (+ (* 37 hash) (char->integer (string-ref str index))) size))) ((>= index (string-length str)) hash))) Reading and storing the training corpus The training phase reads a training corpus and returns a dictionary containing all the words and their frequency of occurrence. We assume seven-bit ascii, as Norvig did, and define a word as a maximal sequence of alphabetic characters. [[Read-word]] fetches the next word from the current input port, converting its characters to lower-case, and returns the [[eof-object]] when the port is exhausted: (define (read-word) (let loop ((in-word? #f) (c (read-char)) (word '())) (cond ((eof-object? c) (if in-word? (list->string (reverse word)) c)) ((char-alphabetic? c) (loop #t (read-char) (cons (char-downcase c) word))) (in-word? (list->string (reverse word))) (else (loop #f (read-char) word))))) We store the word/count pairs in a global hash table named [[nwords]]. The [[size]] is a prime number approximately the same as the number of unique words in the training corpus. (define nwords (make-table string-hash string=? 24989)) We build the table of word/count pairs by either inserting the word with a count of [[1]] if it is new or by adding [[1]] to the current count if the word already exists. Norvig gives a source for his training corpus [[big.txt]] in his article: (with-input-from-file "big.txt" (lambda () (do ((word (read-word) (read-word))) ((eof-object? word)) (update! nwords word 1 (lambda (k v1 v2) (+ v1 v2)))))) Sets The spelling correction phase of Norvig's program centers around sets of words. We'll show that code in a moment, but first we need some code for handling sets. We can use the hash tables we've already implemented to represent sets of words; the words will be the hash key, the hash value will always be [[#t]], and the default value of the lookup function will be [[#f]] when the word isn't in the hash table. [[Make-set]] creates a new set, [[adjoin!]] adds a word to a set, and [[member?]] tests membership in the set: (define (make-set size) (make-table string-hash string=? size)) (define (adjoin! set word) (update! set word #t (lambda (k v1 v2) #t))) (define (member? set word) (lookup set word #f)) We'll also need to iterate over the members of a set. We could write a function [[set-for-each]], but we prefer to write syntax for a loop, since that is more in the spirit of Norvig's code: (define-syntax for-set (syntax-rules () ((_ var set body ...) (let ((size (vector-ref set 2)) (tabl (vector-ref set 3))) (do ((slot 0 (+ slot 1))) ((<= size slot)) (do ((bucket (vector-ref tabl slot) (cdr bucket))) ((null? bucket)) (let ((var (caar bucket))) body ...))))))) Technically, this isn't very good code, because it breaks an abstraction barrier -- a set iterator shouldn't know the details of how the hash table that underlies the set is implemented. In a small program like this it's probably overkill to worry about an abstraction barrier, but in larger programs, the discipline achieved by honoring abstraction barriers is an important part of ensuring that the program is correct. Generating similar words When comparing two words for similarity, it is common to talk about the edit distance between them: the number of deletions (remove one letter), transpositions (swap two adjacent letters), alterations (change one letter to another), and insertions (add one letter) that would be required to convert one word to the other. For instance, [[Tuesday]] can be converted into [[Thursday]] by inserting an [[h]] after the initial [[T]] and changing the [[e]] to an [[r]], so there is an edit distance of two between them. Norvig writes the following function that returns all the words within an edit distance of one from a target word: [[ def edits1(word): n = len(word) return set([word[0:i]+word[i+1;] for i in range(n)] + [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet]) ]] The first expression handles deletion of a single character, the second expression handles transpositions of a single character for its successor, the third expression handles alteration of a single character for another character in the alphabet, and the fourth expression handles insertion of a single character from the alphabet; all of the expressions loop over all possible positions within the word. The [[edits1]] function returns a set of words, some of which may be present in the training corpus, and others which are ridiculous. This function, which is concise and readable in Python, would be much more verbose in Scheme. There are two problems of syntax: One problem is string-indexing; where Python uses the expression [[word[i+1:] ]]to represent the string of characters from the [[i]]'th [[+ 1]] position to the end of the string, Scheme would say [[(substring word (+ i 1) (string-length word)]], which isn't nearly as concise. Another problem is the easy notation for loops that has no direct equivalent in Scheme. We solve the first problem by writing a macro [[cat]] that catenates a list of substring clauses: (define-syntax cat (syntax-rules (last) ((_ 'aux base) (apply string-append (reverse base))) ((_ 'aux base (str i last) clauses ...) (cat 'aux (cons (substring str i (string-length str)) base) clauses ...)) ((_ 'aux base (str i j) clauses ...) (cat 'aux (cons (substring str i j) base) clauses ...)) ((_ 'aux base (str i) clauses ...) (cat 'aux (cons (substring str i (+ i 1)) base) clauses ...)) ((_ 'aux base c-or-s clauses ...) (cat 'aux (cons (if (char? c-or-s) (string c-or-s) c-or-s) base) clauses ...)) ((_ clauses ...) (cat 'aux '() clauses ...)))) [[Cat]] calls [[string-append]] to catenate the strings returned by each of its clauses. There are several types of clauses: [[(str i j)]] expands to [[(substring str i j)]], [[(str i last)]] expands to [[(substring str i (string-length str))]], [[(str i)]] expands to [[(string-ref str i)]], a character [[c]] expands to [[(string c)]], and a string [[str]] expands to itself. Our notation is somewhat different than Soegaard's, being easier to parse both for the macro compiler and for the human programmer. Here are some examples of the use of [[cat]]: [[ > (cat ("hello" 2 4)) "ll" > (cat ("hello" 0 1) ("hello" 2 last)) "hllo" > (cat ("hello" 0 1) #\e ("hello" 2 last)) "hello" > (cat ("hello" 0 1) "e" ("hello" 2 last)) "hello" > (cat ("hello" 0 1) ("hello" 1) ("hello" 2 last)) "hello" > (cat ("hello" 0 0)) "" ]] We solve the second problem with two simple looping macros: (define-syntax for (syntax-rules () ((_ var first past body ...) (do ((var first (+ var 1))) ((>= var past)) body ...)))) (define-syntax for-alpha (syntax-rules () ((_ var body ...) (do ((i 97 (+ i 1))) ((>= i 122)) (let ((var (integer->char i))) body ...))))) Given these three macros, our [[edits1]] function is as simple and clear as Norvig's: (define (edits1 word) (let ((n (string-length word)) (edits (make-set 997))) (for i 0 n (adjoin! edits (cat (word 0 i) (word (+ i 1) last)))) (for i 0 (- n 1) (adjoin! edits (cat (word 0 i) (word (+ i 1)) (word i) (word (+ i 2) last)))) (for i 0 n (for-alpha c (adjoin! edits (cat (word 0 i) c (word (+ i 1) last))))) (for i 0 (+ n 1) (for-alpha c (adjoin! edits (cat (word 0 i) c (word i last))))) edits)) We choose [[997]] as the size of the [[edits1]] set based on some math from Norvig, who shows that [[edits1]] will have [[54n+25]] members, where [[n]] is the length of [[word]]. With a table size of [[997]], we can accomodate words with eighteen characters without any of the hash-slots having more than one entry, assuming a good hash function, so lookups in this set should be speedy. Of course, since we have to iterate over the set, we don't want to make it too large. The correction function [[Known-edits2]] builds on [[edits1]] to return the set of all words in the training corpus that are within an edit distance of one from all words that are within an edit distance of one from the target word, giving a set of words that are within an edit distance of two from the given word: (define (known-edits2 word) (let ((edits (make-set 19))) (for-set e1 (edits1 word) (for-set e2 (edits1 e1) (if (lookup nwords e2 #f) (adjoin! edits e2)))) edits)) We choose [[19]] as the size of the set because we expect most words to have only a few suggested corrections. Note the usage of [[lookup]] here. If [[e2]] is in [[nwords]], it will return the count of the number of times the word occurred in the training corpus; otherwise it will return [[#f]]. That makes the return value of [[lookup]] a non-boolean boolean, so it can be used directly in the predicate clause of an [[if]]. The alternative would be [[if (positive? (lookup nwords word 0))]], which is less concise and less clear. This idiom occurs twice more below. [[Known]] takes a set of [[words]] and returns the subset of them that are in [[nwords]]: (define (known words) (let ((known (make-set 19))) (for-set word words (if (lookup nwords word #f) (adjoin! known word))) known)) Scheme provides a [[max]] function, but it's not as versatile as Python's; we need to find the word in a set that has the maximum wordcount in [[nwords]]. (define (max-word set) (let ((count 0) (word "")) (for-set w set (let ((n (lookup nwords w 0))) (if (> n count) (begin (set! count n) (set! word w))))) word)) We're finally ready to write the [[correct]] function. If [[word]] is in the training corpus, return it. Otherwise, build the set of known words at edit distance one; if it is non-empty, return the word with the highest count in the training corpus. Otherwise, build the set of known words at edit distance two; if it is non-empty, return the word with the highest count in the training corpus. Otherwise, punt the original word back to the caller. This is the same algorithm Norvig uses, but expressed somewhat differently; note in particular that Norvig computes [[(edits1 word)]] twice, once directly, and once indirectly in the call to [[(known-edits2 word)]]. (define (correct word) (if (lookup nwords word #f) word (let ((candidates (known (edits1 word)))) (if (positive? (count candidates)) (max-word candidates) (let ((candidates (known-edits2 word))) (if (positive? (count candidates)) (max-word candidates) word)))))) To see the [[correct]] function in action, type words at the spelling-corrector: [[ > (correct "statisticle") "statistics" > (correct "speling") "spelling" > (correct "korrecter") "corrected" ]] In the training corpus, "statistics" appears five times but "statistical" only twice, and "corrector" doesn't appear at all, so the corrections suggested for those words aren't what we expect. Norvig gives details of the math behind the spelling corrector in his article. Testing the program Norvig found a testing corpus and wrote a simple driver that counts the number of tests, the number of misspellings that were fixed, and the time it took to perform the tests. Here's our driver, which is a bit simpler than Norvig's: (define (spelltest tests) (let loop ((count 0) (okay 0) (tests tests)) (cond ((null? tests) (display "tested: ") (display count) (display "; fixed: ") (display okay) (newline)) ((null? (cdar tests)) (loop count okay (cdr tests))) ((string=? (caar tests) (correct (cadar tests))) (loop (+ count 1) (+ okay 1) (cons (cons (caar tests) (cddar tests)) (cdr tests)))) (else (loop (+ count 1) okay (cons (cons (caar tests) (cddar tests)) (cdr tests))))))) Norvig split his test corpus into two pieces, for hygiene. Here's the first: (define tests1 '(("access" "acess") ("accessing" "accesing") ("accommodation" "accomodation" "acommodation" "acomodation") ("account" "acount") ("address" "adress" "adres") ("addressable" "addresable") ("arranged" "aranged" "arrainged") ("arrangeing" "aranging") ("arrangement" "arragment") ("articles" "articals") ("aunt" "annt" "anut" "arnt") ("auxiliary" "auxillary") ("available" "avaible") ("awful" "awfall" "afful") ("basically" "basicaly") ("beginning" "begining") ("benefit" "benifit") ("benefits" "benifits") ("between" "beetween") ("bicycle" "bicycal" "bycicle" "bycycle") ("biscuits" "biscits" "biscutes" "biscuts" "bisquits" "buiscits" "buiscuts") ("built" "biult") ("cake" "cak") ("career" "carrer") ("cemetery" "cemetary" "semetary") ("centrally" "centraly") ("certain" "cirtain") ("challenges" "chalenges" "chalenges") ("chapter" "chaper" "chaphter" "chaptur") ("choice" "choise") ("choosing" "chosing") ("clerical" "clearical") ("committee" "comittee") ("compare" "compair") ("completely" "completly") ("consider" "concider") ("considerable" "conciderable") ("contented" "contenpted" "contende" "contended" "contentid") ("curtains" "cartains" "certans" "courtens" "cuaritains" "curtans" "curtians" "curtions") ("decide" "descide") ("decided" "descided") ("definitely" "definately" "difinately") ("definition" "defenition") ("definitions" "defenitions") ("description" "discription") ("desiccate" "desicate" "dessicate" "dessiccate") ("diagrammatically" "diagrammaticaally") ("different" "diffrent") ("driven" "dirven") ("ecstasy" "exstacy" "ecstacy") ("embarrass" "embaras" "embarass") ("establishing" "astablishing" "establising") ("experience" "experance" "experiance") ("experiences" "experances") ("extended" "extented") ("extremely" "extreamly") ("fails" "failes") ("families" "familes") ("february" "febuary") ("further" "futher") ("gallery" "galery" "gallary" "gallerry" "gallrey") ("hierarchal" "hierachial") ("hierarchy" "hierchy") ("inconvenient" "inconvienient" "inconvient" "inconvinient") ("independent" "independant" "independant") ("initial" "intial") ("initials" "inetials" "inistals" "initails" "initals" "intials") ("juice" "guic" "juce" "jucie" "juise" "juse") ("latest" "lates" "latets" "latiest" "latist") ("laugh" "lagh" "lauf" "laught" "lugh") ("level" "leval") ("levels" "levals") ("liaison" "liaision" "liason") ("lieu" "liew") ("literature" "litriture") ("loans" "lones") ("locally" "localy") ("magnificent" "magnificnet" "magificent" "magnifcent" "magnifecent" "magnifiscant" "magnifisent" "magnificant") ("management" "managment") ("meant" "ment") ("minuscule" "miniscule") ("minutes" "muinets") ("monitoring" "monitering") ("necessary" "neccesary" "necesary" "neccesary" "necassary" "necassery" "neccasary") ("occurrence" "occurence" "occurence") ("often" "ofen" "offen" "offten" "ofton") ("opposite" "opisite" "oppasite" "oppesite" "oppisit" "oppisite" "opposit" "oppossite" "oppossitte") ("parallel" "paralel" "paralell" "parrallel" "parralell" "parrallell") ("particular" "particulaur") ("perhaps" "perhapse") ("personnel" "personnell") ("planned" "planed") ("poem" "poame") ("poems" "poims" "pomes") ("poetry" "poartry" "poertry" "poetre" "poety" "powetry") ("position" "possition") ("possible" "possable") ("pretend" "pertend" "protend" "prtend" "pritend") ("problem" "problam" "proble" "promblem" "proplen") ("pronunciation" "pronounciation") ("purple" "perple" "perpul" "poarple") ("questionnaire" "questionaire") ("really" "realy" "relley" "relly") ("receipt" "receit" "receite" "reciet" "recipt") ("receive" "recieve") ("refreshment" "reafreshment" "refreshmant" "refresment" "refressmunt") ("remember" "rember" "remeber" "rememmer" "rermember") ("remind" "remine" "remined") ("scarcely" "scarcly" "scarecly" "scarely" "scarsely") ("scissors" "scisors" "sissors") ("separate" "seperate") ("singular" "singulaur") ("someone" "somone") ("sources" "sorces") ("southern" "southen") ("special" "speaical" "specail" "specal" "speical") ("splendid" "spledid" "splended" "splened" "splended") ("standardizing" "stanerdizing") ("stomach" "stomac" "stomache" "stomec" "stumache") ("supersede" "supercede" "superceed") ("there" "ther") ("totally" "totaly") ("transferred" "transfred") ("transportability" "transportibility") ("triangular" "triangulaur") ("understand" "undersand" "undistand") ("unexpected" "unexpcted" "unexpeted" "unexspected") ("unfortunately" "unfortunatly") ("unique" "uneque") ("useful" "usefull") ("valuable" "valubale" "valuble") ("variable" "varable") ("variant" "vairiant") ("various" "vairious") ("visited" "fisited" "viseted" "vistid" "vistied") ("visitors" "vistors") ("voluntary" "volantry") ("voting" "voteing") ("wanted" "wantid" "wonted") ("whether" "wether") ("wrote" "rote" "wote"))) And here's the second. Somehow, in editing, we managed to lose a word from the second test. Oh, well.... (define tests2 '(("forbidden" "forbiden") ("decisions" "deciscions" "descisions") ("supposedly" "supposidly") ("embellishing" "embelishing") ("technique") ("tecnique") ("permanently" "perminantly") ("confirmation" "confermation") ("appointment" "appoitment") ("progression" "progresion") ("accompanying" "acompaning") ("applicable" "aplicable") ("regained" "regined") ("guidelines" "guidlines") ("surrounding" "serounding") ("titles" "tittles") ("unavailable" "unavailble") ("advantageous" "advantageos") ("brief" "brif") ("appeal" "apeal") ("consisting" "consisiting") ("clerk" "cleark" "clerck") ("component" "componant") ("favourable" "faverable") ("separation" "seperation") ("search" "serch") ("receive" "recieve") ("employees" "emploies") ("prior" "piror") ("resulting" "reulting") ("suggestion" "sugestion") ("opinion" "oppinion") ("cancellation" "cancelation") ("criticism" "citisum") ("useful" "usful") ("humour" "humor") ("anomalies" "anomolies") ("would" "whould") ("doubt" "doupt") ("examination" "eximination") ("therefore" "therefoe") ("recommend" "recomend") ("separated" "seperated") ("successful" "sucssuful" "succesful") ("apparent" "apparant") ("occurred" "occureed") ("particular" "paerticulaur") ("pivoting" "pivting") ("announcing" "anouncing") ("challenge" "chalange") ("arrangements" "araingements") ("proportions" "proprtions") ("organized" "oranised") ("accept" "acept") ("dependence" "dependance") ("unequalled" "unequaled") ("numbers" "numbuers") ("sense" "sence") ("conversely" "conversly") ("provide" "provid") ("arrangement" "arrangment") ("responsibilities" "responsiblities") ("fourth" "forth") ("ordinary" "ordenary") ("description" "desription" "descvription" "desacription") ("inconceivable" "inconcievable") ("data" "dsata") ("register" "rgister") ("supervision" "supervison") ("encompassing" "encompasing") ("negligible" "negligable") ("allow" "alow") ("operations" "operatins") ("executed" "executted") ("interpretation" "interpritation") ("hierarchy" "heiarky") ("indeed" "indead") ("years" "yesars") ("through" "throut") ("committee" "committe") ("inquiries" "equiries") ("before" "befor") ("continued" "contuned") ("permanent" "perminant") ("choose" "chose") ("virtually" "vertually") ("correspondence" "correspondance") ("eventually" "eventully") ("lonely" "lonley") ("profession" "preffeson") ("they" "thay") ("now" "noe") ("desperately" "despratly") ("university" "unversity") ("adjournment" "adjurnment") ("possibilities" "possablities") ("stopped" "stoped") ("mean" "meen") ("weighted" "wagted") ("adequately" "adequattly") ("shown" "hown") ("matrix" "matriiix") ("profit" "proffit") ("encourage" "encorage")("collate" "colate") ("disaggregate" "disaggreagte" "disaggreaget") ("receiving" "recieving" "reciving") ("proviso" "provisoe") ("umbrella" "umberalla") ("approached" "aproached") ("pleasant" "plesent") ("difficulty" "dificulty") ("appointments" "apointments") ("base" "basse") ("conditioning" "conditining") ("earliest" "earlyest") ("beginning" "begining") ("universally" "universaly") ("unresolved" "unresloved") ("length" "lengh") ("exponentially" "exponentualy") ("utilized" "utalised") ("set" "et") ("surveys" "servays") ("families" "familys") ("system" "sysem") ("approximately" "aproximatly") ("their" "ther") ("scheme" "scheem") ("speaking" "speeking") ("repetitive" "repetative") ("inefficient" "ineffiect") ("geneva" "geniva") ("exactly" "exsactly") ("immediate" "imediate") ("appreciation" "apreciation") ("luckily" "luckeley") ("eliminated" "elimiated") ("believe" "belive") ("appreciated" "apreciated") ("readjusted" "reajusted") ("were" "wer" "where") ("feeling" "fealing") ("and" "anf") ("false" "faulse") ("seen" "seeen") ("interrogating" "interogationg") ("academically" "academicly") ("relatively" "relativly" "relitivly") ("traditionally" "traditionaly") ("studying" "studing") ("majority" "majorty") ("build" "biuld") ("aggravating" "agravating") ("transactions" "trasactions") ("arguing" "aurguing") ("sheets" "sheertes") ("successive" "sucsesive" "sucessive") ("segment" "segemnt") ("especially" "especaily") ("later" "latter") ("senior" "sienior") ("dragged" "draged") ("atmosphere" "atmospher") ("drastically" "drasticaly") ("particularly" "particulary") ("visitor" "vistor") ("session" "sesion") ("continually" "contually") ("availability" "avaiblity") ("busy" "buisy") ("parameters" "perametres") ("surroundings" "suroundings" "seroundings") ("employed" "emploied") ("adequate" "adiquate") ("handle" "handel") ("means" "meens") ("familiar" "familer") ("between" "beeteen") ("overall" "overal") ("timing" "timeing") ("committees" "comittees" "commitees") ("queries" "quies") ("econometric" "economtric") ("erroneous" "errounous") ("decides" "descides") ("reference" "refereence" "refference") ("intelligence" "inteligence") ("edition" "ediion" "ediition") ("are" "arte") ("apologies" "appologies") ("thermawear" "thermawere" "thermawhere") ("techniques" "tecniques") ("voluntary" "volantary") ("subsequent" "subsequant" "subsiquent") ("currently" "curruntly") ("forecast" "forcast") ("weapons" "wepons") ("routine" "rouint") ("neither" "niether") ("approach" "aproach") ("available" "availble") ("recently" "reciently") ("ability" "ablity") ("nature" "natior") ("commercial" "comersial") ("agencies" "agences") ("however" "howeverr") ("suggested" "sugested") ("career" "carear") ("many" "mony") ("annual" "anual") ("according" "acording") ("receives" "recives" "recieves") ("interesting" "intresting") ("expense" "expence") ("relevant" "relavent" "relevaant") ("table" "tasble") ("throughout" "throuout") ("conference" "conferance") ("sensible" "sensable") ("described" "discribed" "describd") ("union" "unioun") ("interest" "intrest") ("flexible" "flexable") ("refered" "reffered") ("controlled" "controled") ("sufficient" "suficient") ("dissension" "desention") ("adaptable" "adabtable") ("representative" "representitive") ("irrelevant" "irrelavent") ("unnecessarily" "unessasarily") ("applied" "upplied") ("apologised" "appologised") ("these" "thees" "thess") ("choices" "choises") ("will" "wil") ("procedure" "proceduer") ("shortened" "shortend") ("manually" "manualy") ("disappointing" "dissapoiting") ("excessively" "exessively") ("comments" "coments") ("containing" "containg") ("develop" "develope") ("credit" "creadit") ("government" "goverment") ("acquaintances" "aquantences") ("orientated" "orentated") ("widely" "widly") ("advise" "advice") ("difficult" "dificult") ("investigated" "investegated") ("bonus" "bonas") ("conceived" "concieved") ("nationally" "nationaly") ("compared" "comppared" "compased") ("moving" "moveing") ("necessity" "nessesity") ("opportunity" "oppertunity" "oppotunity" "opperttunity") ("thoughts" "thorts") ("equalled" "equaled") ("variety" "variatry") ("analysis" "analiss" "analsis" "analisis") ("patterns" "pattarns") ("qualities" "quaties") ("easily" "easyly") ("organization" "oranisation" "oragnisation") ("the" "thw" "hte" "thi") ("corporate" "corparate") ("composed" "compossed") ("enormously" "enomosly") ("financially" "financialy") ("functionally" "functionaly") ("discipline" "disiplin") ("announcement" "anouncement") ("progresses" "progressess") ("except" "excxept") ("recommending" "recomending") ("mathematically" "mathematicaly") ("source" "sorce") ("combine" "comibine") ("input" "inut") ("careers" "currers" "carrers") ("resolved" "resoved") ("demands" "diemands") ("unequivocally" "unequivocaly") ("suffering" "suufering") ("immediately" "imidatly" "imediatly") ("accepted" "acepted") ("projects" "projeccts") ("necessary" "necasery" "nessasary" "nessisary" "neccassary") ("journalism" "journaism") ("unnecessary" "unessessay") ("night" "nite") ("output" "oputput") ("security" "seurity") ("essential" "esential") ("beneficial" "benificial" "benficial") ("explaining" "explaning") ("supplementary" "suplementary") ("questionnaire" "questionare") ("employment" "empolyment") ("proceeding" "proceding") ("decision" "descisions" "descision") ("per" "pere") ("discretion" "discresion") ("reaching" "reching") ("analysed" "analised") ("expansion" "expanion") ("although" "athough") ("subtract" "subtrcat") ("analysing" "aalysing") ("comparison" "comparrison") ("months" "monthes") ("hierarchal" "hierachial") ("misleading" "missleading") ("commit" "comit") ("auguments" "aurgument") ("within" "withing") ("obtaining" "optaning") ("accounts" "acounts") ("primarily" "pimarily") ("operator" "opertor") ("accumulated" "acumulated") ("extremely" "extreemly") ("there" "thear") ("summarys" "sumarys") ("analyse" "analiss") ("understandable" "understadable") ("safeguard" "safegaurd") ("consist" "consisit") ("declarations" "declaratrions") ("minutes" "muinutes" "muiuets") ("associated" "assosiated") ("accessibility" "accessability") ("examine" "examin") ("surveying" "servaying") ("politics" "polatics") ("annoying" "anoying") ("again" "agiin") ("assessing" "accesing") ("ideally" "idealy") ("scrutinized" "scrutiniesed") ("simular" "similar") ("personnel" "personel") ("whereas" "wheras") ("when" "whn") ("geographically" "goegraphicaly") ("gaining" "ganing") ("requested" "rquested") ("separate" "seporate") ("students" "studens") ("prepared" "prepaired") ("generated" "generataed") ("graphically" "graphicaly") ("suited" "suted") ("variable" "varible" "vaiable") ("building" "biulding") ("required" "reequired") ("necessitates" "nessisitates") ("together" "togehter") ("profits" "proffits"))) The tests show how much we've accomplished: [[ Welcome to DrScheme, version 360. Language: Pretty Big (includes MrEd and Advanced Student). > (time (spelltest tests1)) tested: 270; fixed: 202 cpu time: 239000 real time: 239154 gc time: 35479 > (time (spelltest tests2)) tested: 399; fixed: 269 cpu time: -10028 real time: 419651 gc time: 61285 ]] [[ Petite Chez Scheme Version 7.1 Copyright (c) 1985-2006 Cadence Research Systems > (time (spelltest tests1)) tested: 270; fixed: 202 (time (spelltest tests1)) 20796 collections 300719 ms elapsed cpu time, including 3876 ms collecting 300832 ms elapsed real time, including 3839 ms collecting 22590400232 bytes allocated, including 22590418968 bytes reclaimed > (time (spelltest tests2)) tested: 399; fixed: 269 (time (spelltest tests2)) 35758 collections 525578 ms elapsed cpu time, including 6008 ms collecting 534243 ms elapsed real time, including 6095 ms collecting 38843881760 bytes allocated, including 38843074624 bytes reclaimed ]] The Scheme solution is much slower than Norvig's Python solution, because the Python version uses hash tables that are native to Python, running at compiled speed, while the Scheme version uses our own hash tables running at interpreted speed. To speed up the Scheme version, we could try tuning the sizes of the various tables, or compile the program, or use the hash tables provided natively with our version of Scheme; we won't take the time to try those things here, but experiments show that either compiling or using native hash tables makes the Scheme version run about the same speed as the Python version. Discussion The beauty of Scheme shines in the macros. Scheme provides all the functions needed to write the same [[edits1]] function that Norvig uses, but with an ugly syntax. Macros allow us to extend the syntax of Scheme, in a way that neatly fits the problem at hand. Other languages, Python among them, don't offer programmers the same flexibility; if Python doesn't give you the syntax you need, you're just out of luck. If we didn't have those macros, we would have to write [[edits1]] like this: [[ (define (edits1 word) (let ((n (string-length word)) (edits (make-set 997))) ; deletion (do ((i 0 (+ i 1))) ((<= n i)) (adjoin! edits (string-append (substring word 0 i) (substring word (+ i 1) n)))) ; transposition (do ((i 0 (+ i 1))) ((<= (- n 1) i)) (adjoin! edits (string-append (substring word 0 i) (substring word (+ i 1) (+ i 2)) (substring word i (+ i 1)) (substring word (+ i 2) n)))) ; alteration (do ((i 0 (+ i 1))) ((<= n i)) (do ((ci 97 (+ ci 1))) ((<= 123 ci)) (let ((c (integer->char ci))) (adjoin! edits (string-append (substring word 0 i) (string c) (substring word (+ i 1) n)))))) ; insertion (do ((i 0 (+ i 1))) ((<= (+ n 1) i)) (do ((ci 97 (+ ci 1))) ((<= 123 ci)) (let ((c (integer->char ci))) (adjoin! edits (string-append (substring word 0 i) (string c) (substring word i n)))))) edits)) ]] That isn't pretty at all! Syntax does matter, and Scheme lets us build any syntax we need. Python doesn't. Proponents of Python, and other similar languages, often argue that their language is so large that everything a programmer could need is already present in the language, so the macros present in Scheme add little. But that's clearly not true. For example, it's easy to add generators to Scheme; Dorai Sitaram does exactly that, in just a dozen lines of code, in his book Teach Yourself Scheme in Fixnum Days. But Python had to wait ten years for Guido to add generators to the language, making do with some less-appropriate substitute in the meantime. It's really not a fair comparison -- in our minimalism, we Schemers have all the tools and all the fun, the Pythonistas have nothing! References There have been other Scheme solutions to the spelling corrector problem posted to the internet in the wake of Norvig's post. Shiro Kawai posted his solution, using Gauche and a bunch of SRFIs, at http://practical-scheme.net/wiliki/wiliki.cgi/Gauche:SpellingCorrection. Jens Axel Soegaard wrote a version for PLT Scheme in his blog at http://www.scheme.dk/blog/2007/04/writing-spelling-corrector-in-plt.html; his solution provided the inspiration for the [[cat]] macro, though we changed the details to make it easier for both the macro compiler and the human programmer to parse the substring clauses. Shiro and Jens both wrote good solutions, but they neatly point out the problem with Scheme; since Scheme is so minimalistic, every Scheme system extends the language, all in ways incompatible with one another. We stayed with pure R5RS Scheme, and paid the price by writing more code.