# Cluster

Clustering is the process of collecting in groups all of the items from an input collection
that share some common feature. The most obvious clustering operator in common computer
languages is the [[GROUP BY]] operator of SQL. This note describes [[(cluster proc lt? lst)]],
a Scheme procedure that takes an input list and returns a list of lists; [[proc]] computes a
signature of each item in the input list, and each sub-list in the output list contains all
those elements of the input list with identical signatures, with sub-lists in increasing order
of signature according to [[lt?]]. The type of [[cluster]] is (α
→ β) × (β → boolean) × (list α) → (list (list
α)).
The original [[clusterBy]] function
was written in Haskell by Tom Moertel:
[[
import Control.Arrow ((&&&))
import qualified Data.Map as M
clusterBy :: Ord b => (a -> b) -> [a] -> [[a]]
clusterBy f = M.elems . M.map reverse . M.fromListWith (++) . map (f &&& return)
]]
That's a little bit terse, and uses some fiendishly idiomatic Haskell. We'll use some rather
prosaic Scheme instead. Our strategy is to apply [[proc]] to each element of [[lst]] and
insert the signature/value pair in a binary tree. We won't bother to balance the tree, on the
theory that the signature function ought to be unrelated to the order of the input data:
(define (cluster proc lt? lst)
<< insert into a tree >>
<< convert tree to list of lists >>
(let loop ((lst lst) (tree '()))
(if (null? lst)
(in-order tree)
(loop (cdr lst) (insert (proc (car lst)) (car lst) tree)))))
The tree is represented as a recursive four-element list, with signature in the [[car]], list
of values in the [[cadr]], left child in the [[caddr]], and right child in the [[cadddr]].
Since the data structure is recursive, so is the code that does insertion, which begins at
the root of the tree. If a tree is null, the signature must not appear in the tree, so a new
node is built. Otherwise, the signature must already exist in the tree, so the existing node
is updated, either by recursively inserting in the left sub-tree if the new signature is less
than the current signature, or in the right sub-tree if the new signature is greater than the
current signature, or by consing the input element to the current node if the two signatures
are equal.
<< insert into a tree >>=
(define (insert key value tree)
(cond ((null? tree)
(list key (list value) '() '()))
((lt? key (car tree))
(let ((left (insert key value (caddr tree))))
(list (car tree) (cadr tree) left (cadddr tree))))
((lt? (car tree) key)
(let ((right (insert key value (cadddr tree))))
(list (car tree) (cadr tree) (caddr tree) right)))
(else
(let ((new (cons value (cadr tree))))
(list key new (caddr tree) (cadddr tree))))))
On output, the tree is converted to a list by in-order traversal:
<< convert tree to list of lists >>=
(define (in-order tree)
(if (null? tree)
'()
(append (in-order (caddr tree))
(list (cadr tree))
(in-order (cadddr tree)))))
[[Cluster]] is useful in a variety of situations. These examples cluster a list of strings
by the lengths of its words and by the first letter of each word:
[[
> (define x '("this" "is" "a" "fun" "and" "useful" "program"))
]]
[[
> (cluster string-length < x)
(("a") ("is") ("and" "fun") ("this") ("useful") ("program"))
]]
[[
> (cluster (lambda (x) (string-ref x 0)) char (define (anagram s) (list->string (sort charlist s))))
> (define dict '("pots" "time" "spot" "pans" "item" "tops"))
> (cluster anagram string (filter (lambda (x) (> (length x) 2))
(cluster (lambda (x) (anagram (string-append (car x) (cadr x)))) string