Cluster

Clustering is the process of collecting in groups all of the items from an input collection that share some common feature. The most obvious clustering operator in common computer languages is the GROUP BY operator of SQL. This note describes (cluster proc lt? lst), a Scheme procedure that takes an input list and returns a list of lists; proc computes a signature of each item in the input list, and each sub-list in the output list contains all those elements of the input list with identical signatures, with sub-lists in increasing order of signature according to lt?. The type of cluster is (α → β) × (β → boolean) × (list α) → (list (list α)).

The original clusterBy function was written in Haskell by Tom Moertel:

| import Control.Arrow ((&&&))
| import qualified Data.Map as M
|  
| clusterBy :: Ord b => (a -> b) -> [a] -> [[a]]
| clusterBy f = M.elems . M.map reverse . M.fromListWith (++) . map (f &&& return)

That's a little bit terse, and uses some fiendishly idiomatic Haskell. We'll use some rather prosaic Scheme instead. Our strategy is to apply proc to each element of lst and insert the signature/value pair in a binary tree. We won't bother to balance the tree, on the theory that the signature function ought to be unrelated to the order of the input data:

(define (cluster proc lt? lst)
  « insert into a tree »
  « convert tree to list of lists »
  (let loop ((lst lst) (tree '()))
    (if (null? lst)
        (in-order tree)
        (loop (cdr lst) (insert (proc (car lst)) (car lst) tree)))))

The tree is represented as a recursive four-element list, with signature in the car, list of values in the cadr, left child in the caddr, and right child in the cadddr. Since the data structure is recursive, so is the code that does insertion, which begins at the root of the tree. If a tree is null, the signature must not appear in the tree, so a new node is built. Otherwise, the signature must already exist in the tree, so the existing node is updated, either by recursively inserting in the left sub-tree if the new signature is less than the current signature, or in the right sub-tree if the new signature is greater than the current signature, or by consing the input element to the current node if the two signatures are equal.

« insert into a tree »≡
(define (insert key value tree)
  (cond ((null? tree)
          (list key (list value) '() '()))
        ((lt? key (car tree))
          (let ((left (insert key value (caddr tree))))
            (list (car tree) (cadr tree) left (cadddr tree))))
        ((lt? (car tree) key)
          (let ((right (insert key value (cadddr tree))))
            (list (car tree) (cadr tree) (caddr tree) right)))
        (else
          (let ((new (cons value (cadr tree))))
            (list key new (caddr tree) (cadddr tree))))))

On output, the tree is converted to a list by in-order traversal:

« convert tree to list of lists »≡
(define (in-order tree)
  (if (null? tree)
      '()
      (append (in-order (caddr tree))
              (list (cadr tree))
              (in-order (cadddr tree)))))

Cluster is useful in a variety of situations. These examples cluster a list of strings by the lengths of its words and by the first letter of each word:

| > (define x '("this" "is" "a" "fun" "and" "useful" "program"))

| > (cluster string-length < x)
| (("a") ("is") ("and" "fun") ("this") ("useful") ("program"))

| > (cluster (lambda (x) (string-ref x 0)) char<? x)
| (("and" "a") ("fun") ("is") ("program") ("this") ("useful"))

A more powerful example emits anagram clusters; (anagram s) sorts the letters in the word s (for instance, (anagram "pots") returns "opst"), then cluster brings together the instances of equal signatures:

| > (define (anagram s) (list->string (sort char<? (string->list s))))
| > (define dict '("pots" "time" "spot" "pans" "item" "tops"))
| > (cluster anagram string<? dict)
| (("pans") ("item" "time") ("tops" "spot" "pots"))

Moertel used cluster to solve the following word problem: "Take the names of two U. S. States, mix them all together, then rearrange the letters to form the names of two other U. S. States. What states are these?" Here is our answer:

(define states '("alabama" "alaska" "arizona" "arkansas" "california" "colorado"
  "connecticut" "delaware" "florida" "georgia" "hawaii" "idaho" "illinois" "indiana"
  "iowa" "kansas" "kentucky" "louisiana" "maine" "maryland" "massachusetts" "michigan"
  "minnesota" "mississippi" "missouri" "montana" "nebraska" "nevada" "newhampshire"
  "newjersey" "newmexico" "newyork" "northcarolina" "northdakota" "ohio" "oklahoma"
  "oregon" "pennsylvania" "rhodeisland" "southcarolina" "southdakota" "tennessee"
  "texas" "utah" "vermont" "virginia" "washington" "westvirginia" "wisconsin" "wyoming"))

| > (filter (lambda (x) (> (length x) 2))
|     (cluster (lambda (x) (anagram (string-append (car x) (cadr x)))) string<?
|       (cross-product states states)))
| ((("southdakota" "northcarolina")
|    ("southcarolina" "northdakota")
|    ("northdakota" "southcarolina")
|    ("northcarolina" "southdakota")))

For those systems that don't have them, the utility functions sort, filter and cross-product are defined below:

(define (merge lt? x1 x2)
  (cond ((null? x1) x2)
        ((null? x2) x1)
        ((lt? (car x2) (car x1))
          (cons (car x2) (merge x1 (cdr x2))))
        (else (cons (car x1) (merge (cdr x1) x2)))))

(define (sort lt? xs)
  (define (merge x1 x2)
    (cond ((null? x1) x2)
          ((null? x2) x1)
          ((lt? (car x2) (car x1))
            (cons (car x2) (merge x1 (cdr x2))))
          (else (cons (car x1) (merge (cdr x1) x2)))))
  (define (merge-pairs xs k)
    (if (or (null? (cdr xs)) (odd? k)) xs
        (merge-pairs (cons (merge (car xs) (cadr xs)) (cddr xs)) (quotient k 2))))
  (define (next-run run xs)
    (if (or (null? xs) (lt? (car xs) (car run))) (values (reverse run) xs)
        (next-run (cons (car xs) run) (cdr xs))))
  (define (sorting xs ys k)
    (if (null? xs) (car (merge-pairs ys 0))
        (call-with-values
          (lambda () (next-run (list (car xs)) (cdr xs)))
          (lambda (run tail)
            (sorting tail (merge-pairs (cons run ys) (+ k 1)) (+ k 1))))))
  (if (null? xs) xs (sorting xs '() 0)))

(define (filter pred? lst)
  (cond ((null? lst) '())
        ((pred? (car lst)) (cons (car lst) (filter pred? (cdr lst))))
        (else (filter pred? (cdr lst)))))

(define (cross-product . lists)
  (let recur ((lists lists))
    (if (null? lists)
        '(())
        (reverse
          (let ((tails (recur (cdr lists))))
            (let outer ((list (car lists)) (product '()))
              (if (null? list)
                  product
                  (let ((item (car list)) (more (cdr list)))
                    (let inner ((tails tails) (product product))
                      (if (null? tails)
                          (outer more product)
                          (inner (cdr tails) (cons (cons item (car tails)) product))))))))))))

PLB 31OCT2007