A Statisticle Speling Korrecter

This note describes a spelling corrector in Scheme similar to the spelling corrector written by Peter Norvig in Python at www.norvig.com/spell-correct.html. Our purpose, besides solving the problem, is to show that Scheme is just as nimble as Python. We'll restrict ourselves to that subset of Scheme defined in R5RS, so we know our code will be portable to any Scheme system. Looking at Norvig's code, we see that we'll need a dictionary of key/value pairs to store the wordcounts of the training corpus, and a set to store the potential corrections. Scheme provides association-lists to manage those problems, but the linear time-complexity of a-lists will be too slow to deal with the large datasets we'll be using. Norvig used hash tables, but they aren't provided by R5RS; we could build our own, but instead we choose treaps, a randomized system of nearly-balanced trees, that are both fast and simple to code — and besides, I already wrote an implementation of treaps for another program, so it is simple to steal the code. Treaps are essentially binary search trees, represented as either an empty node, called nil, or a node that contains a key and a value and pointers to two sub-nodes, the left child and the right child. Binary search trees obey the property that all nodes in the left child have keys that are less than the current node, and all nodes in the right child have keys that are greater than the current node. One node is distinguished as the root of the tree; it is the only node visible outside the tree. Obviously, a single set of keys could be represented by many different binary search trees, depending on which key is chosen as the root and the choices made at each level under the root. The speed of a search depends on the depth at which the key is located in the tree. Trees that have more balance (fewer nodes with nil children) are faster to search than trees that have less balance (more nodes with nil children) because the overall depth of the tree is less. The two extreme cases occur when a tree is perfectly balanced (no internal nil nodes), when a search takes time logarithmic in the size of the tree, and when a tree is perfectly imbalanced (a long chain of nodes each with one nil child), when a search takes time linear in the size of the tree. Treaps use an ingenious method to assure that the tree is approximately balanced at all times. Each node, in addition to the key, value, and two child nodes, contains an integer priority, which is randomly assigned at the time the node is created and never changed. Of the many possible trees that could represent a particular set of keys, the one that is chosen obeys the heap-order property, with the priority of each node being greater than the priorities of all of its child nodes. Assuming all keys and priorities are distinct, there will be exactly one tree that satisfies both the tree-order and heap-order properties; it is the same tree that would be created if keys were dynmically inserted into a normal binary search tree in order of the priority field. The genius of this method is that the priorities are independent of the keys, making it extremely unlikely that highly-imbalanced trees could be created, thus keeping search times approximately logarithmic. Analysis shows that perfectly balanced trees take about O(1.4 log2 n) comparisons per access; since treaps are never perfectly balanced, they have a somewhat higher constant factor, but normally require somewhat less than O(2 log2 n) comparisons per access. Our implementation of treaps starts with some utility procedures that define the treap datatype. These are all written as macros so they will be in-lined for speed. (define-syntax treap (syntax-rules () ((treap k p v l r) (vector k p v l r)))) (define-syntax key (syntax-rules () ((key treap) (vector-ref treap 0)))) (define-syntax prio (syntax-rules () ((prio treap) (vector-ref treap 1)))) (define-syntax val (syntax-rules () ((val treap) (vector-ref treap 2)))) (define-syntax lkid (syntax-rules () ((lkid treap) (vector-ref treap 3)))) (define-syntax rkid (syntax-rules () ((rkid treap) (vector-ref treap 4)))) (define nil (vector 'nil -1 'nil 'nil 'nil)) (define-syntax nil? (syntax-rules () ((nil? treap) (eq? treap nil)))) The really good news about treaps is their simplicity. Since treaps are just binary search trees, the lookup function for treaps works exactly the same as the lookup function for binary search trees, descending the tree recursively until it either finds a matching key or hits a null node signalling that the key doesn't exist in the treap. Our [[lookup]] function either returns the value associated with the key or a default value that indicates an error: (define (lookup lt? t k x) (cond ((nil? t) x) ((lt? k (key t)) (lookup lt? (lkid t) k x)) ((lt? (key t) k) (lookup lt? (rkid t) k x)) (else (val t)))) Insertion works in two phases: a winding phase locates the insertion point, in a manner similar to [[lookup]], followed by an unwinding phase, where rotations are performed to restore the heap-order property. Our [[insert]] is also used to update existing values in a treap; the update function [[f]] is called when the key is already in the treap, and takes the key, the old value, and the new value, and combines them to form the value that is ultimately stored in the treap: (define (insert lt? t k p v f) (cond ((nil? t) (treap k p v nil nil)) ((lt? k (key t)) (let ((t (treap (key t) (prio t) (val t) (insert lt? (lkid t) k p v f) (rkid t)))) (if (< (prio t) (prio (lkid t))) (rot-right t) t))) ((lt? (key t) k) (let ((t (treap (key t) (prio t) (val t) (lkid t) (insert lt? (rkid t) k p v f)))) (if (< (prio t) (prio (rkid t))) (rot-left t) t))) (else (treap k (prio t) (f (key t) (val t) v) (lkid t) (rkid t))))) ;(else (treap k (prio t) v (lkid t) (rkid t))))) Rotations maintain the tree-order property and restore the heap-order property if it is violated. They come in two varieties, left and right: [[ + - + rotate left + - + | y | <------------ | x | + - + + - + / \ / \ / \ / \ / \ / \ / \ / \ + - + . . + - + | x | / \ / \ | y | + - + / C \ / A \ + - + / \ /_ _ _\ /_ _ _\ / \ / \ / \ / \ / \ . . . . / \ / \ / \ / \ / A \ / B \ ------------> / B \ / C \ /_ _ _\ /_ _ _\ rotate right /_ _ _\ /_ _ _\ ]] In both trees shown above, all the nodes in sub-tree A have keys less than x, all the nodes in sub-tree B have keys between x and y, and all the nodes in sub-tree C have keys greater than y. Transforming the tree so that the root moves from x to y is called a left rotation, and transforming the tree so that the root moves from y to x is called a right rotation; in both cases the root node moves down the tree toward the leaves. Note that both rotations preserve the tree-order property. It is the goal of the insert and delete algorithms to use rotations to maintain the heap-order property. Analysis shows that treaps use about two rotations per update (insert or delete). (define (rot-left t) (let ((l (treap (key t) (prio t) (val t) (lkid t) (lkid (rkid t))))) (treap (key (rkid t)) (prio (rkid t)) (val (rkid t)) l (rkid (rkid t))))) (define (rot-right t) (let ((r (treap (key t) (prio t) (val t) (rkid (lkid t)) (rkid t)))) (treap (key (lkid t)) (prio (lkid t)) (val (lkid t)) (lkid (lkid t)) r))) Treaps were invented by Cecilia Aragon and Raimund Seidel. The interested reader can learn more at [[http://www.ischool.berkeley.edu/~aragon/pubs/rst96.pdf]]. Treaps rely on a random-number generator to set the priorities; here's a simple one, based on the well-known Park/Miller linear-congruential algorithm; since the seed is a constant, it will always produce the same random sequence, but you can change the seed if you wish: (define seed 20070502) (define (rand) (let ((multiplier 48271) (modulus 2147483647)) (set! seed (modulo (* seed multiplier) modulus)) (if (zero? seed) (set! seed modulus)) seed)) Norvig's spelling corrector works in two phases: a training phase and a correction phase. The training phase reads a training corpus and returns a dictionary containing all the words and their frequency of occurrence. We assume seven-bit ascii, as Norvig did, and define a word as a maximal sequence of alphabetic characters. [[Read-word]] fetches the next word from a port, converting its characters to lower-case, and returns the eof-object when the port is exhausted: (define (read-word . port) (let ((p (if (null? port) (current-input-port) (car port)))) (let loop ((in-word? #f) (c (read-char p)) (word '())) (cond ((eof-object? c) (if in-word? (list->string (reverse word)) c)) ((char-alphabetic? c) (loop #t (read-char p) (cons (char-downcase c) word))) (in-word? (list->string (reverse word))) (else (loop #f (read-char p) word)))))) [[Fold-port]] is to ports as a fold function is to lists; it takes a port and a function that reads the next item from the port and combines the items using a [[folder]] function: (define (fold-port reader folder base . port) (let ((p (if (null? port) (current-input-port) (car port)))) (let loop ((item (reader p)) (result base)) (if (eof-object? item) result (loop (reader p) (folder result item)))))) Given these functions, it is easy to write a training function that reads a file and returns a mapping of the words in the file to the counts of those words: (define (train file-name) (with-input-from-file file-name (lambda () (fold-port read-word (lambda (tree word) (insert stringThe Scheme Programming Language, though we represent sets as treaps instead of lists. The [[in]] clause was hard to write because it naturally uses a double-recursion, one on each child; the implementation above places sub-trees on a stack, popping them and binding each key to the bind variable. Here are some samples of the use of [[set-of]]: [[ > (define (set->list s) (if (nil? s) '() (append (set->list (lkid s)) (list (key s)) (set->list (rkid s))))) > (set->list (set-of nil < (+ y 6) (x upto 1 6) (odd? x) (y is (* x x)))) (7 15 31) ]] The basic operation on sets is [[adjoin]], which adds an item to a set if the item doesn't already exist in the set and returns the new set. We insert the item into a treap with value [[#t]], so that the call [[(lookup lt? set item #f)]] returns [[#t]] if the item exists in the set and [[#f]] otherwise: (define (adjoin lt? set item) (insert lt? set item (rand) #t (lambda (k v1 v2) #t))) The other notation problem involves substrings. A few moment's thought suggests a macro [[cat]] that catenates a list of substring clauses: (define-syntax cat (syntax-rules (last) ((_ 'aux base) (apply string-append (reverse base))) ((_ 'aux base (str i last) clauses ...) (cat 'aux (cons (substring str i (string-length str)) base) clauses ...)) ((_ 'aux base (str i j) clauses ...) (cat 'aux (cons (substring str i j) base) clauses ...)) ((_ 'aux base (str i) clauses ...) (cat 'aux (cons (substring str i (+ i 1)) base) clauses ...)) ((_ 'aux base c-or-s clauses ...) (cat 'aux (cons (if (char? c-or-s) (string c-or-s) c-or-s) base) clauses ...)) ((_ clauses ...) (cat 'aux '() clauses ...)))) [[Cat]] calls [[string-append]] to catenate the strings returned by each of its clauses. There are several types of clauses: [[(str i j)]] expands to [[(substring str i j)]], [[(str i last)]] expands to [[(substring str i (string-length str))]], [[(str i)]] expands to [[(string-ref str i)]], a character [[c]] expands to [[(string c)]], and a string [[str]] expands to itself. Our notation is somewhat different than Soegaard's, being easier to parse both for the macro compiler and for the human programmer. Here are some examples of the use of [[cat]]: [[ > (cat ("hello" 2 4)) "ll" > (cat ("hello" 0 1) ("hello" 2 last)) "hllo" > (cat ("hello" 0 1) #\e ("hello" 2 last)) "hello" > (cat ("hello" 0 1) "e" ("hello" 2 last)) "hello" > (cat ("hello" 0 1) ("hello" 1) ("hello" 2 last)) "hello" > (cat ("hello" 0 0)) "" ]] Given [[set-of]] and [[cat]], [[edits1]] is as clear in Scheme as in Python, though perhaps a few lines longer: (define (edits1 word) (let* ((n (string-length word)) (e (set-of nil stringchar ci)))) (e (set-of e stringchar ci))))) e)) [[Known-edits2]] builds on [[edits1]] to return the set of all words in the training corpus that are within an edit distance of one from all words that are within an edit distance of one from the given word, giving a set of words that are within an edit distance of two from the given word: (define (known-edits2 word) (set-of nil string (define nwords (train "big.txt")) > (correct nwords "statisticle") "statistics" > (correct nwords "speling") "spelling" > (correct nwords "korrecter") "corrected" ]] In the training corpus, "statistics" appears five times but "statistical" only twice, and "corrector" doesn't appear at all, so the corrections suggested for those words aren't what we expect. Norvig gives details of the math behind the spelling corrector and gives a source for the training corpus [[big.txt]] in his article. To test the program, Norvig found a testing corpus and wrote a simple driver that counts the number of tests, the number of misspellings that were fixed, and the time it took to perform the tests. Here's our driver, which is a bit simpler than Norvig's: (define (spelltest tests) (let loop ((count 0) (okay 0) (tests tests)) (cond ((null? tests) (display "tested: ") (display count) (display "; fixed: ") (display okay) (newline)) ((null? (cdar tests)) (loop count okay (cdr tests))) ((string=? (caar tests) (correct (cadar tests))) (loop (+ count 1) (+ okay 1) (cons (cons (caar tests) (cddar tests)) (cdr tests)))) (else (loop (+ count 1) okay (cons (cons (caar tests) (cddar tests)) (cdr tests))))))) Norvig split his test corpus into two pieces, for hygiene. Here they are: (define tests1 '(("access" "acess") ("accessing" "accesing") ("accommodation" "accomodation" "acommodation" "acomodation") ("account" "acount") ("address" "adress" "adres") ("addressable" "addresable") ("arranged" "aranged" "arrainged") ("arrangeing" "aranging") ("arrangement" "arragment") ("articles" "articals") ("aunt" "annt" "anut" "arnt") ("auxiliary" "auxillary") ("available" "avaible") ("awful" "awfall" "afful") ("basically" "basicaly") ("beginning" "begining") ("benefit" "benifit") ("benefits" "benifits") ("between" "beetween") ("bicycle" "bicycal" "bycicle" "bycycle") ("biscuits" "biscits" "biscutes" "biscuts" "bisquits" "buiscits" "buiscuts") ("built" "biult") ("cake" "cak") ("career" "carrer") ("cemetery" "cemetary" "semetary") ("centrally" "centraly") ("certain" "cirtain") ("challenges" "chalenges" "chalenges") ("chapter" "chaper" "chaphter" "chaptur") ("choice" "choise") ("choosing" "chosing") ("clerical" "clearical") ("committee" "comittee") ("compare" "compair") ("completely" "completly") ("consider" "concider") ("considerable" "conciderable") ("contented" "contenpted" "contende" "contended" "contentid") ("curtains" "cartains" "certans" "courtens" "cuaritains" "curtans" "curtians" "curtions") ("decide" "descide") ("decided" "descided") ("definitely" "definately" "difinately") ("definition" "defenition") ("definitions" "defenitions") ("description" "discription") ("desiccate" "desicate" "dessicate" "dessiccate") ("diagrammatically" "diagrammaticaally") ("different" "diffrent") ("driven" "dirven") ("ecstasy" "exstacy" "ecstacy") ("embarrass" "embaras" "embarass") ("establishing" "astablishing" "establising") ("experience" "experance" "experiance") ("experiences" "experances") ("extended" "extented") ("extremely" "extreamly") ("fails" "failes") ("families" "familes") ("february" "febuary") ("further" "futher") ("gallery" "galery" "gallary" "gallerry" "gallrey") ("hierarchal" "hierachial") ("hierarchy" "hierchy") ("inconvenient" "inconvienient" "inconvient" "inconvinient") ("independent" "independant" "independant") ("initial" "intial") ("initials" "inetials" "inistals" "initails" "initals" "intials") ("juice" "guic" "juce" "jucie" "juise" "juse") ("latest" "lates" "latets" "latiest" "latist") ("laugh" "lagh" "lauf" "laught" "lugh") ("level" "leval") ("levels" "levals") ("liaison" "liaision" "liason") ("lieu" "liew") ("literature" "litriture") ("loans" "lones") ("locally" "localy") ("magnificent" "magnificnet" "magificent" "magnifcent" "magnifecent" "magnifiscant" "magnifisent" "magnificant") ("management" "managment") ("meant" "ment") ("minuscule" "miniscule") ("minutes" "muinets") ("monitoring" "monitering") ("necessary" "neccesary" "necesary" "neccesary" "necassary" "necassery" "neccasary") ("occurrence" "occurence" "occurence") ("often" "ofen" "offen" "offten" "ofton") ("opposite" "opisite" "oppasite" "oppesite" "oppisit" "oppisite" "opposit" "oppossite" "oppossitte") ("parallel" "paralel" "paralell" "parrallel" "parralell" "parrallell") ("particular" "particulaur") ("perhaps" "perhapse") ("personnel" "personnell") ("planned" "planed") ("poem" "poame") ("poems" "poims" "pomes") ("poetry" "poartry" "poertry" "poetre" "poety" "powetry") ("position" "possition") ("possible" "possable") ("pretend" "pertend" "protend" "prtend" "pritend") ("problem" "problam" "proble" "promblem" "proplen") ("pronunciation" "pronounciation") ("purple" "perple" "perpul" "poarple") ("questionnaire" "questionaire") ("really" "realy" "relley" "relly") ("receipt" "receit" "receite" "reciet" "recipt") ("receive" "recieve") ("refreshment" "reafreshment" "refreshmant" "refresment" "refressmunt") ("remember" "rember" "remeber" "rememmer" "rermember") ("remind" "remine" "remined") ("scarcely" "scarcly" "scarecly" "scarely" "scarsely") ("scissors" "scisors" "sissors") ("separate" "seperate") ("singular" "singulaur") ("someone" "somone") ("sources" "sorces") ("southern" "southen") ("special" "speaical" "specail" "specal" "speical") ("splendid" "spledid" "splended" "splened" "splended") ("standardizing" "stanerdizing") ("stomach" "stomac" "stomache" "stomec" "stumache") ("supersede" "supercede" "superceed") ("there" "ther") ("totally" "totaly") ("transferred" "transfred") ("transportability" "transportibility") ("triangular" "triangulaur") ("understand" "undersand" "undistand") ("unexpected" "unexpcted" "unexpeted" "unexspected") ("unfortunately" "unfortunatly") ("unique" "uneque") ("useful" "usefull") ("valuable" "valubale" "valuble") ("variable" "varable") ("variant" "vairiant") ("various" "vairious") ("visited" "fisited" "viseted" "vistid" "vistied") ("visitors" "vistors") ("voluntary" "volantry") ("voting" "voteing") ("wanted" "wantid" "wonted") ("whether" "wether") ("wrote" "rote" "wote"))) (define tests2 '(("forbidden" "forbiden") ("decisions" "deciscions" "descisions") ("supposedly" "supposidly") ("embellishing" "embelishing") ("technique") ("tecnique") ("permanently" "perminantly") ("confirmation" "confermation") ("appointment" "appoitment") ("progression" "progresion") ("accompanying" "acompaning") ("applicable" "aplicable") ("regained" "regined") ("guidelines" "guidlines") ("surrounding" "serounding") ("titles" "tittles") ("unavailable" "unavailble") ("advantageous" "advantageos") ("brief" "brif") ("appeal" "apeal") ("consisting" "consisiting") ("clerk" "cleark" "clerck") ("component" "componant") ("favourable" "faverable") ("separation" "seperation") ("search" "serch") ("receive" "recieve") ("employees" "emploies") ("prior" "piror") ("resulting" "reulting") ("suggestion" "sugestion") ("opinion" "oppinion") ("cancellation" "cancelation") ("criticism" "citisum") ("useful" "usful") ("humour" "humor") ("anomalies" "anomolies") ("would" "whould") ("doubt" "doupt") ("examination" "eximination") ("therefore" "therefoe") ("recommend" "recomend") ("separated" "seperated") ("successful" "sucssuful" "succesful") ("apparent" "apparant") ("occurred" "occureed") ("particular" "paerticulaur") ("pivoting" "pivting") ("announcing" "anouncing") ("challenge" "chalange") ("arrangements" "araingements") ("proportions" "proprtions") ("organized" "oranised") ("accept" "acept") ("dependence" "dependance") ("unequalled" "unequaled") ("numbers" "numbuers") ("sense" "sence") ("conversely" "conversly") ("provide" "provid") ("arrangement" "arrangment") ("responsibilities" "responsiblities") ("fourth" "forth") ("ordinary" "ordenary") ("description" "desription" "descvription" "desacription") ("inconceivable" "inconcievable") ("data" "dsata") ("register" "rgister") ("supervision" "supervison") ("encompassing" "encompasing") ("negligible" "negligable") ("allow" "alow") ("operations" "operatins") ("executed" "executted") ("interpretation" "interpritation") ("hierarchy" "heiarky") ("indeed" "indead") ("years" "yesars") ("through" "throut") ("committee" "committe") ("inquiries" "equiries") ("before" "befor") ("continued" "contuned") ("permanent" "perminant") ("choose" "chose") ("virtually" "vertually") ("correspondence" "correspondance") ("eventually" "eventully") ("lonely" "lonley") ("profession" "preffeson") ("they" "thay") ("now" "noe") ("desperately" "despratly") ("university" "unversity") ("adjournment" "adjurnment") ("possibilities" "possablities") ("stopped" "stoped") ("mean" "meen") ("weighted" "wagted") ("adequately" "adequattly") ("shown" "hown") ("matrix" "matriiix") ("profit" "proffit") ("encourage" "encorage")("collate" "colate") ("disaggregate" "disaggreagte" "disaggreaget") ("receiving" "recieving" "reciving") ("proviso" "provisoe") ("umbrella" "umberalla") ("approached" "aproached") ("pleasant" "plesent") ("difficulty" "dificulty") ("appointments" "apointments") ("base" "basse") ("conditioning" "conditining") ("earliest" "earlyest") ("beginning" "begining") ("universally" "universaly") ("unresolved" "unresloved") ("length" "lengh") ("exponentially" "exponentualy") ("utilized" "utalised") ("set" "et") ("surveys" "servays") ("families" "familys") ("system" "sysem") ("approximately" "aproximatly") ("their" "ther") ("scheme" "scheem") ("speaking" "speeking") ("repetitive" "repetative") ("inefficient" "ineffiect") ("geneva" "geniva") ("exactly" "exsactly") ("immediate" "imediate") ("appreciation" "apreciation") ("luckily" "luckeley") ("eliminated" "elimiated") ("believe" "belive") ("appreciated" "apreciated") ("readjusted" "reajusted") ("were" "wer" "where") ("feeling" "fealing") ("and" "anf") ("false" "faulse") ("seen" "seeen") ("interrogating" "interogationg") ("academically" "academicly") ("relatively" "relativly" "relitivly") ("traditionally" "traditionaly") ("studying" "studing") ("majority" "majorty") ("build" "biuld") ("aggravating" "agravating") ("transactions" "trasactions") ("arguing" "aurguing") ("sheets" "sheertes") ("successive" "sucsesive" "sucessive") ("segment" "segemnt") ("especially" "especaily") ("later" "latter") ("senior" "sienior") ("dragged" "draged") ("atmosphere" "atmospher") ("drastically" "drasticaly") ("particularly" "particulary") ("visitor" "vistor") ("session" "sesion") ("continually" "contually") ("availability" "avaiblity") ("busy" "buisy") ("parameters" "perametres") ("surroundings" "suroundings" "seroundings") ("employed" "emploied") ("adequate" "adiquate") ("handle" "handel") ("means" "meens") ("familiar" "familer") ("between" "beeteen") ("overall" "overal") ("timing" "timeing") ("committees" "comittees" "commitees") ("queries" "quies") ("econometric" "economtric") ("erroneous" "errounous") ("decides" "descides") ("reference" "refereence" "refference") ("intelligence" "inteligence") ("edition" "ediion" "ediition") ("are" "arte") ("apologies" "appologies") ("thermawear" "thermawere" "thermawhere") ("techniques" "tecniques") ("voluntary" "volantary") ("subsequent" "subsequant" "subsiquent") ("currently" "curruntly") ("forecast" "forcast") ("weapons" "wepons") ("routine" "rouint") ("neither" "niether") ("approach" "aproach") ("available" "availble") ("recently" "reciently") ("ability" "ablity") ("nature" "natior") ("commercial" "comersial") ("agencies" "agences") ("however" "howeverr") ("suggested" "sugested") ("career" "carear") ("many" "mony") ("annual" "anual") ("according" "acording") ("receives" "recives" "recieves") ("interesting" "intresting") ("expense" "expence") ("relevant" "relavent" "relevaant") ("table" "tasble") ("throughout" "throuout") ("conference" "conferance") ("sensible" "sensable") ("described" "discribed" "describd") ("union" "unioun") ("interest" "intrest") ("flexible" "flexable") ("refered" "reffered") ("controlled" "controled") ("sufficient" "suficient") ("dissension" "desention") ("adaptable" "adabtable") ("representative" "representitive") ("irrelevant" "irrelavent") ("unnecessarily" "unessasarily") ("applied" "upplied") ("apologised" "appologised") ("these" "thees" "thess") ("choices" "choises") ("will" "wil") ("procedure" "proceduer") ("shortened" "shortend") ("manually" "manualy") ("disappointing" "dissapoiting") ("excessively" "exessively") ("comments" "coments") ("containing" "containg") ("develop" "develope") ("credit" "creadit") ("government" "goverment") ("acquaintances" "aquantences") ("orientated" "orentated") ("widely" "widly") ("advise" "advice") ("difficult" "dificult") ("investigated" "investegated") ("bonus" "bonas") ("conceived" "concieved") ("nationally" "nationaly") ("compared" "comppared" "compased") ("moving" "moveing") ("necessity" "nessesity") ("opportunity" "oppertunity" "oppotunity" "opperttunity") ("thoughts" "thorts") ("equalled" "equaled") ("variety" "variatry") ("analysis" "analiss" "analsis" "analisis") ("patterns" "pattarns") ("qualities" "quaties") ("easily" "easyly") ("organization" "oranisation" "oragnisation") ("the" "thw" "hte" "thi") ("corporate" "corparate") ("composed" "compossed") ("enormously" "enomosly") ("financially" "financialy") ("functionally" "functionaly") ("discipline" "disiplin") ("announcement" "anouncement") ("progresses" "progressess") ("except" "excxept") ("recommending" "recomending") ("mathematically" "mathematicaly") ("source" "sorce") ("combine" "comibine") ("input" "inut") ("careers" "currers" "carrers") ("resolved" "resoved") ("demands" "diemands") ("unequivocally" "unequivocaly") ("suffering" "suufering") ("immediately" "imidatly" "imediatly") ("accepted" "acepted") ("projects" "projeccts") ("necessary" "necasery" "nessasary" "nessisary" "neccassary") ("journalism" "journaism") ("unnecessary" "unessessay") ("night" "nite") ("output" "oputput") ("security" "seurity") ("essential" "esential") ("beneficial" "benificial" "benficial") ("explaining" "explaning") ("supplementary" "suplementary") ("questionnaire" "questionare") ("employment" "empolyment") ("proceeding" "proceding") ("decision" "descisions" "descision") ("per" "pere") ("discretion" "discresion") ("reaching" "reching") ("analysed" "analised") ("expansion" "expanion") ("although" "athough") ("subtract" "subtrcat") ("analysing" "aalysing") ("comparison" "comparrison") ("months" "monthes") ("hierarchal" "hierachial") ("misleading" "missleading") ("commit" "comit") ("auguments" "aurgument") ("within" "withing") ("obtaining" "optaning") ("accounts" "acounts") ("primarily" "pimarily") ("operator" "opertor") ("accumulated" "acumulated") ("extremely" "extreemly") ("there" "thear") ("summarys" "sumarys") ("analyse" "analiss") ("understandable" "understadable") ("safeguard" "safegaurd") ("consist" "consisit") ("declarations" "declaratrions") ("minutes" "muinutes" "muiuets") ("associated" "assosiated") ("accessibility" "accessability") ("examine" "examin") ("surveying" "servaying") ("politics" "polatics") ("annoying" "anoying") ("again" "agiin") ("assessing" "accesing") ("ideally" "idealy") ("scrutinized" "scrutiniesed") ("simular" "similar") ("personnel" "personel") ("whereas" "wheras") ("when" "whn") ("geographically" "goegraphicaly") ("gaining" "ganing") ("requested" "rquested") ("separate" "seporate") ("students" "studens") ("prepared" "prepaired") ("generated" "generataed") ("graphically" "graphicaly") ("suited" "suted") ("variable" "varible" "vaiable") ("building" "biulding") ("required" "reequired") ("necessitates" "nessisitates") ("together" "togehter") ("profits" "proffits"))) The test show how much we've accomplished: [[ > (define nwords (train "big.txt")) > (time (spelltest tests1)) tested: 270; fixed: 202 cpu time: 198156 real time: 198759 gctime: 27560 > (time (spelltest tests2)) tested: 399; fixed: 269 cpu time: -79121 real time: 355899 gc time: 46332 ]] The Scheme solution is much slower than Norvig's Python solution, because the Python version uses hash tables that are native to Python, running at compiled speed, while the Scheme version uses balanced trees running at interpreted speed. To speed up the Scheme version, either compile it, or use the hash tables provided natively with your version of Scheme. We had to work harder than Norvig because Scheme doesn't provide as large a standard library as Python. But we didn't have to work too hard, because we were able to steal the missing pieces from obvious sources, and from other programs we had previously written. Our program, though larger than Norvig's, is just as clear and easy-to-read, especially in the essentials of the spelling-correction code where clarity is most important. The beauty of Scheme shines in the two macros. Scheme provides all the functions needed to write the same [[edits1]] function that Norvig uses, but with an ugly syntax. Macros allow us to extend the syntax of Scheme, in a way that neatly fits the problem at hand. Other languages, Python among them, don't offer programmers the same flexibility; if Python doesn't give you the syntax you need, you're just out of luck. If you didn't have those macros, you would have to write something like this: [[ (define (edits1 word) (let ((n (string-length word)) (edits (make-hash-table 'equal))) ; deletion (do ((i 0 (+ i 1))) ((<= n i)) (hash-table-put! edits (string-append (substring word 0 i) (substring word (+ i 1) n)) #t)) ; transposition (do ((i 0 (+ i 1))) ((<= (- n 1) i)) (hash-table-put! edits (string-append (substring word 0 i) (string-ref word (+ i 1)) (string-ref word i) (substring word (+ i 2) n)) #t)) ; alteration (do ((i 0 (+ i 1))) ((<= n i)) (do ((ci 97 (+ ci 1))) ((<= 123 ci)) (let ((c (integer->char ci))) (hash-table-put! edits (string-append (substring word 0 i) (string c) (substring word (+ i 1) n)) #t))) ; insertion (do ((i 0 (+ i 1))) ((<= (+ n 1) i)) (do ((ci 97 (+ ci 1))) ((<= 123 ci)) (let ((c (integer->char ci))) (hash-table-put! edits (string-append (substring word 0 i) (string c) (substring word i n)) #t))) edits)) ]] Which isn't pretty at all! Some people see the minimalism of Scheme as its great weakness. But, in fact, minimalism pales in the face of the ease of extensibility that Scheme offers with its macros. People that don't understand Scheme think of macros as an "advanced feature" instead of an integral part of the philosophy of Scheme. But by providing the ability to extend its syntax, Scheme offers its users the ability to mold the language to their needs in a way most languages can't, as in the [[cat]] and [[set-of]] macros we wrote. There have been other Scheme solutions to the spelling corrector problem posted to the internet in the wake of Norvig's post. Shiro Kawai posted his solution, using Gauche and a bunch of SRFIs, at http://practical-scheme.net/wiliki/wiliki.cgi/Gauche:SpellingCorrection. Jens Axel Soegaard wrote a version for PLT Scheme in his blog at http://www.scheme.dk/blog/2007/04/writing-spelling-corrector-in-plt.html; his solution provided the inspiration for the [[cat]] macro. Both solutions are good, but they neatly point out the problem with Scheme; since Scheme is so minimalistic, every Scheme system extends the language, all in ways incompatible with one another. We stayed with pure R5RS Scheme, and paid the price in slower execution time.