Peter Norvig has an old post about writing a spell checker, whose succinctness I’ve always loved. I was on a plane for a few hours and wanted to see how it translated from Python to Haskell. This is the tale of a journey I did not intend to take.
Keeping the same structure
The first step was going through and recreating the main functionality, which is mostly lists and sets, so, no big deal.
Splitting
We split the raw text into words:
words :: Text -> [Text]
words = T.split (not . Char.isAsciiLower) . T.toLower
This code is the intuitive answer to the problem above; however, it’s very slow. We’ll look at performance later in this post.
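It also has a quirk worth noting: consecutive separators produce empty chunks, which end up in the counter (a hypothetical GHCi session):
>>> words "Hello, world"
["hello","","world"]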
Counter / Bag
In Python, the Counter is implemented as a multiset / “bag”. We’ll create our own with a Map from Text (word) to Int (count):
type Counter = Map Text Int
toCounter :: [Text] -> Counter
toCounter = Map.fromListWith (+) . fmap (,1)
counter <- toCounter . words <$> readFile "big.txt"
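A quick sanity check of the counter (a hypothetical GHCi session, assuming OverloadedStrings):
>>> toCounter ["the", "cat", "the"]
fromList [("cat",1),("the",2)]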
There is also a package called multiset, but I didn’t know this because the wifi on the plane didn’t work.
Probability and Correction
To guess the right correction, we need probabilities based on the corpus’s word counts.
prob :: Counter -> Text -> Double
prob counter word = fromIntegral occurrences / fromIntegral totalWords
  where
    occurrences = fromMaybe 0 $ Map.lookup word counter
    totalWords = sum $ Map.elems counter
correction :: Counter -> Text -> Text
correction counter word = maximumBy (comparing p) $ candidates counter word
  where p = prob counter
candidates :: Counter -> Text -> Set Text
candidates counter word = detect
  [ known counter $ Set.singleton word
  , known counter . Set.fromList $ edits1 word
  , known counter . Set.fromList $ edits2 word
  , Set.singleton word
  ]
detect :: [Set Text] -> Set Text
detect = fromMaybe Set.empty . listToMaybe . filter (not . Set.null)
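detect just returns the first non-empty set of candidates (a hypothetical GHCi session):
>>> detect [Set.empty, Set.fromList ["spelling"], Set.fromList ["spewing"]]
fromList ["spelling"]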
known :: Counter -> Set Text -> Set Text
known counter = Set.filter (\w -> Map.member w counter)
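Tying it together, a minimal driver might look like this (a sketch, not from the original code; it assumes Data.Text.IO’s readFile and putStrLn, and uses edits1 / edits2 from the next section):
main :: IO ()
main = do
  counter <- toCounter . words <$> readFile "big.txt"
  putStrLn $ correction counter "speling" -- hopefully prints "spelling"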
Edits / Permutations
I initially squished all of the logic into single list comprehensions, but you’ll see I’ve split the heavier functions out.
edits1 :: Text -> [Text]
edits1 w = nub' $ mconcat [transposes', deletes', replaces', inserts]
  where
    splits = zip (T.inits w) (T.tails w)
    deletes' = deletes splits
    transposes' = transposes splits
    replaces' = replaces splits
    inserts = [l <> c <> r | (l,r) <- splits, c <- alphabet]
-- alphabet is used by both edits1 and replaces, so it lives at the top level
alphabet :: [Text]
alphabet = fmap T.singleton ['a'..'z']
The splits binding gets its own type alias for cleanliness:
type Splits = [(Text, Text)]
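For example, the splits of "derp" (a hypothetical GHCi session):
>>> zip (T.inits "derp") (T.tails "derp")
[("","derp"),("d","erp"),("de","rp"),("der","p"),("derp","")]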
Instead of the if R or if len(R)>1 checks we have in Python, I used a guard to skip over splits whose contents fit a certain criterion (e.g. (l,r) where r is non-empty):
unSplit :: (Monad f, Alternative f) => (Text, Text) -> f (Text,Text)
unSplit = unSplitWith (/= "")
unSplitWith :: (Monad f, Alternative f) => (Text -> Bool) -> (Text, Text) -> f (Text,Text)
unSplitWith f (l, r) = guard (f r) >> pure (l, r)
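In the list monad the guard either keeps the split or drops it entirely (a hypothetical GHCi session):
>>> unSplit ("de", "rp") :: [(Text, Text)]
[("de","rp")]
>>> unSplit ("derp", "") :: [(Text, Text)]
[]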
-- | swap the 1st and 2nd letters across our list of splits ("derp" -> "edrp")
transposes :: Splits -> [Text]
transposes splits =
  [l <> swap' r | x <- splits, (l,r) <- unSplitWith (\a -> T.length a > 1) x]
  where
    swap' w = T.concat [two, one', rest]
      where
        two = T.take 1 $ T.drop 1 w
        one' = T.take 1 w
        rest = T.drop 2 w
-- | remove a letter across all splits: "derp" -> ["erp","drp","dep","der"]
deletes :: Splits -> [Text]
deletes splits =
  [l <> T.tail r | x <- splits, (l,r) <- unSplit x]
-- | try replacing a letter with one from the alphabet in each spot. This one is very large
replaces :: Splits -> [Text]
replaces splits = [l <> c <> T.tail r | x <- splits, (l,r) <- unSplit x, c <- alphabet]
I think this comes out reasonably concise.
edits2 :: Text -> [Text]
edits2 w = nub' [ e2 | e1 <- edits1 w, e2 <- edits1 e1 ]
-- Prelude's nub is prrrrrretty bad, so we use this instead.
nub' :: [Text] -> [Text]
nub' = Set.toList . Set.fromList
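To get a feel for the blow-up (a small sketch; the 54n + 25 count comes from Norvig’s post):
-- for a word of length n: n deletes, (n-1) transposes, 26n replaces
-- and 26(n+1) inserts, i.e. 54n + 25 candidates before nub'
candidateCounts :: Text -> (Int, Int)
candidateCounts w = (length (edits1 w), length (edits2 w))
-- edits2 roughly squares that, which is why filtering with `known` matters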
Performance
The performance of the implementation I came to is… really bad. The time taken to guess even short words was ~4 seconds, which was unacceptable considering the Python version is nearly instant.
After I asked around on IRC and Slack, two main problems were pointed out:
- The words function was extremely inefficient (thanks to @mwutton for pointing this out).
- The Map and Set in the containers package are not optimized for this sort of bagging (thanks to @yaron).
To speed up the words implementation, we just shove the logic into Data.Text’s internals (which is nasty). This buys us ~1 second off the ridiculous 4 seconds. So, I went further.
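I won’t reproduce the final version here, but one cheaper sketch (an assumption, not the post’s actual fix) is to map every non-letter to a space and lean on T.words, which also drops the empty chunks:
-- single pass over the text; T.words discards empty chunks for us
words' :: Text -> [Text]
words' = T.words . T.map (\c -> if Char.isAsciiLower c then c else ' ') . T.toLower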
Since I wasn’t using any order-specific functions on Sets or Maps, I just replaced the containers dependency with unordered-containers and changed the import statements to use them. Bam! This nearly halved the time! But it’s still real bad at ~1 second.
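The swap is mostly mechanical (a sketch; HashMap and HashSet want Hashable keys instead of Ord):
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as Map
import Data.HashSet (HashSet)
import qualified Data.HashSet as Set

-- the qualified aliases keep the rest of the code unchanged, modulo
-- signatures mentioning HashSet instead of Set
type Counter = HashMap Text Int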
I used the profiteur tool to visualize the performance issues while going through this process, which basically just confirmed what we already knew: the Set/Map operations and words were awful.
It seems as though Python’s Counter shouldn’t be all that different from ours (an unordered hash map), but the Haskell version lags behind. I kept the code as intuitive as I knew how, and it wasn’t quite enough for this type of problem.
Lessons
- Always use unordered-containers unless you for some reason need to keep the ordering of your data structures.
- Sometimes pre-processing is worth the effort. You can try as hard as you want to optimize the function, but at some point you have to call it a loss.
I’d welcome any comments about how this could be improved further. The result was not encouraging, but despite this I did learn some things along the way.
The full code is here. A literate Haskell file to follow along is here.