How Spell Checkers Work – Part 2

My previous post How Spell Checkers Work – Part 1 concluded with an introductory taste of n-grams. N-grams have been the cool kids of the search algorithm scene for some time. You’ll find hundreds of papers extolling their merit, and for good reason.

Why N-Grams are Cool

Languages are contextual. Words and characters derive syntactic and semantic meaning from the words and characters that they co-occur with. Consider the word butter in:

peanut butter
cocoa butter

Or the occurrences of o in potato:

po: o follows p.
ot: o precedes t.
to: o follows t.
o$: o precedes nothing (last character).

Same character, changing context.

Co-occurrence

The character o co-occurs with other characters – i.e. it is associated with them. The association is ordered: o follows p and o precedes t. And op is not the same as po. The last o in potato precedes an implicit terminating null character. An implicit null precedes the first p.

Common sense. Common sense makes fabulous software.

Two strings are similar if they have characters in common. They are a lot more similar if they have co-occurring characters in common. Incorporating co-occurrence information improves your similarity algorithm.

N-grams capture co-occurrence. Given a sequence of characters or words, bi-grams capture the co-occurrence of adjacent pairs of its elements.

potato => {$p, po, ot, ta, at, to, o$}
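
If you want to play along in Python, here is a rough sketch of a bi-gram extractor (the helper name and the $ padding convention are mine, not anything standard):

    def char_bigrams(word, pad="$"):
        """Return the set of character bi-grams of word, padded so the
        first and last characters get context too."""
        padded = pad + word + pad
        return {padded[i:i + 2] for i in range(len(padded) - 1)}

    print(char_bigrams("potato"))
    # {'$p', 'po', 'ot', 'ta', 'at', 'to', 'o$'} -- set order may vary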

Tri-grams capture the co-occurrence of triples.

peanut butter allergy
peanut butter sandwich.
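
The same idea works at the word level. Here is a quick sketch (again, the function name is just mine):

    def word_ngrams(text, n=3):
        """Return the n-word sequences (tri-grams by default) in text."""
        words = text.split()
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    print(word_ngrams("my peanut butter sandwich recipe"))
    # ['my peanut butter', 'peanut butter sandwich', 'butter sandwich recipe']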

Interestingly, the N in N-grams rarely exceeds 4 or 5. For spell-checking and string/word similarity scoring, bi-grams are all you need.

An Improved Spell Checking Algorithm

  • Transform each string in your dictionary into its constituent bi-grams.
  • Do the same for the misspelling.
  • Calculate the Jaccard Similarity Coefficient between the bi-gram set of the misspelling and the bi-gram set of each dictionary string (see the sketch after this list).
  • You’ll find that the top N matches returned by this algorithm are actually pretty good.
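
Here is a minimal sketch of that first pass in Python, reusing the char_bigrams helper from earlier (the toy dictionary and the function names are mine):

    def jaccard(set_a, set_b):
        """Jaccard similarity: size of the intersection over size of the union."""
        if not set_a and not set_b:
            return 1.0
        return len(set_a & set_b) / len(set_a | set_b)

    def candidates(misspelling, dictionary, top_n=5):
        """Score every dictionary word against the misspelling by bi-gram
        Jaccard similarity and return the top_n best-scoring words."""
        query = char_bigrams(misspelling)
        scored = [(jaccard(query, char_bigrams(word)), word) for word in dictionary]
        scored.sort(reverse=True)
        return [word for score, word in scored[:top_n]]

    words = ["potato", "tomato", "peanut", "butter", "penicillin"]
    print(candidates("potatoe", words, top_n=3))
    # ['potato', 'tomato', 'peanut'] -- potato shares the most bi-grams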

But you’ll also notice two problems:

  • The ranking order of your corrections isn’t always ideal. Some intuitively superior matches get lower scores.
  • You still have to compare the misspelling against every string in your dictionary.

A Second Pass Ranker

You improve the ranking order of your spelling corrections by using a second algorithm to recalculate string similarities and re-rank the results.

This second edit-distance algorithm is accurate, but also computationally expensive. So you only run it over the candidate set of spelling corrections you found using your Jaccard ranker above!
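
The wiring for that second pass is tiny. Here is a sketch where the edit-distance function is passed in as a parameter; a concrete version of it is sketched at the end of the walkthrough below:

    def rerank(misspelling, candidate_words, edit_distance):
        """Second pass: order the candidate corrections by how few
        edits separate them from the misspelling."""
        return sorted(candidate_words, key=lambda word: edit_distance(misspelling, word))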

Edit Distance

What is edit distance anyway?

You can transform any string A into any string B by adding, removing, replacing or transposing characters in A until you arrive at string B.

The edit distance between string A and B is the minimum number of steps it takes to transform A into B.

  • The edit distance between equal strings is 0.
  • The edit distance between potato and potatoe is 1.
  • The edit distance between potato and potatoes is 2.

The smaller the edit distance, the more similar the strings. Edit distance is a great way to rank corrections because most spelling mistakes are mutations of the correct spelling.

How many individual mutations (adds, removes, replacements) does it take to go from the original string to the mutated string? The fewer the steps, the stronger the relationship between the two strings. The nearest relative to the misspelling is the most likely correct spelling. Genetics, anyone?

Calculating Edit Distance

Calculating edit distances can be tricky. There may be more than one way to turn string A into string B, and you must find the one that takes the fewest steps.

There are a few well known measures of edit distance, each with its own wrinkle. The popular Levenshtein algorithm uses dynamic programming to do the deed.

You may have butted heads with dynamic programming in computer science class. And promptly forgotten what you learnt to protect your brain from trauma. Like other dynamic programming algorithms, edit distance calculation can leave you scratching your head. I had to scratch mine with the vigor of my dog Toby, and it has left a decorative dead zone atop my skull.

Dynamic programming algorithms use values found in earlier iterations to calculate values for the current iteration. They save the results of the current iteration to inform future iterations.

To calculate edit distances, you must first formulate a systematic way to list and count the number of steps needed to transform string A into B. Let’s explore what those steps are by turning string abcd into pqrs.

Delete And Transform

Say you have already calculated how to transform abc ==> pqrs.

To transform abcd ==> pqrs :

  • Transform abcd ==> abc, by deleting d from abcd.
  • Transform abc ==> pqrs [which you’ve already calculated in an earlier iteration]
  • Total Cost = 1 Delete + Count Steps(abc ==> pqrs)

Transform and Add

Say you have already calculated how to transform abcd ==> pqr.

To transform abcd ==> pqrs :

  • Transform abcd ==> pqr [which you’ve already calculated in an earlier iteration]
  • Transform pqr ==> pqrs, by adding s to pqr.
  • Total Cost = 1 Add + Count Steps(abcd ==> pqr)

Transform and Replace

Say you have already calculated how to transform abc ==> pqr.

To transform abcd ==> pqrs :

  • Transform the abc portion of abcd ==> pqr [which you’ve already calculated in an earlier iteration]
  • Replace the last character of abcd – d – with s.
  • Total Cost = 1 Replace + Count Steps(abc ==> pqr)

Putting it Together

The edit distance is the minimum of the individual costs of the 3 transformations listed above.

You may have realized that you will end up calculating the edit distance from each prefix of A to each prefix of B.

    • For each string in {a, ab, abc, abcd}
      • Calculate the edit distance to each string in {p, pq, pqr, pqrs} using the steps outlined above
      • Save the minimum cost found.
    • By the time you actually calculate the edit distance from abcd ==> pqrs, you’ve already computed all the prior information you need.
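
Here is how that whole walkthrough might look in Python: a sketch of the classic Levenshtein table, where dist[i][j] holds the edit distance between the first i characters of A and the first j characters of B (the variable names are mine):

    def edit_distance(a, b):
        """Levenshtein distance between strings a and b, built up
        prefix by prefix exactly as described above."""
        m, n = len(a), len(b)
        # dist[i][j] = edit distance between a[:i] and b[:j]
        dist = [[0] * (n + 1) for _ in range(m + 1)]

        # Turning a prefix into the empty string costs one delete per character;
        # building a prefix out of the empty string costs one add per character.
        for i in range(m + 1):
            dist[i][0] = i
        for j in range(n + 1):
            dist[0][j] = j

        for i in range(1, m + 1):
            for j in range(1, n + 1):
                # Replacing costs nothing if the last characters already match.
                replace_cost = 0 if a[i - 1] == b[j - 1] else 1
                dist[i][j] = min(
                    dist[i - 1][j] + 1,                 # delete and transform
                    dist[i][j - 1] + 1,                 # transform and add
                    dist[i - 1][j - 1] + replace_cost,  # transform and replace
                )
        return dist[m][n]

    print(edit_distance("potato", "potatoe"))   # 1
    print(edit_distance("potato", "potatoes"))  # 2
    print(edit_distance("abcd", "pqrs"))        # 4

Drop a function like this into the rerank sketch from the second-pass section and you have the full two-pass checker: bi-gram Jaccard to pull candidates, edit distance to put the best correction on top.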

Speed, Give me Speed!

And you’ve got yourself an effective and versatile spell checker – one class of suggestion engine (there are others) – that you can use in a range of scenarios. You can use it to autocorrect medication names (can you spell penicillin yet?). You can even use it to find the nearest first or last name in a phone book.

But you still need to fix the speed part.

You already know how to use indexing and caching to nitrous boost your code into running fast and furious. So how do you speed up your spell checker and avoid comparing each misspelling against all strings in your dictionary?