How Spell Checkers Work – Part 2

My previous post How Spell Checkers Work – Part 1 concluded with an introductory taste of n-grams. N-grams have been the cool kids of the search algorithm scene for some time. You’ll find hundreds of papers extolling their merit, and for good reason.

Why N-Grams are Cool

Languages are contextual. Words and characters derive syntactic and semantic meaning from the words and characters that they co-occur with. Consider the word butter in:

peanut butter
cocoa butter

Or the occurrences of o in potato:

po: o follows p.
ot: o precedes t.
to: o follows t.
o$: o precedes nothing (last character).

Same character, changing context.

Co-occurrence

The character o co-occurs with other characters – i.e. it is associated with them. The association is ordered: o follows p, and o precedes t. And op is not the same as po. The last o in potato precedes an implicit terminating null character. An implicit null precedes the first p.

Common sense. Common sense makes fabulous software.

Two strings are similar if they have characters in common. They are a lot more similar if they have co-occurring characters in common. Incorporating co-occurrence information improves your similarity algorithm.

N-grams capture co-occurrence. Given a set, bi-grams capture the co-occurrence of pairs of the set’s elements.

potato => {$p, po, ot, ta, at, to, o$}
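In code, extracting bi-grams is nearly a one-liner. Here's a minimal Python sketch, using $ as the boundary marker:

    def bigrams(word):
        # Pad with boundary markers so the first and last
        # characters carry positional context too.
        padded = "$" + word + "$"
        return {padded[i:i + 2] for i in range(len(padded) - 1)}

    bigrams("potato")  # {'$p', 'po', 'ot', 'ta', 'at', 'to', 'o$'}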

Tri-grams capture the co-occurrence of triples – here, word-level tri-grams:

peanut butter allergy
peanut butter sandwich

Interestingly, the N in N-grams rarely exceeds 4 or 5. For spell-checking and string/word similarity scoring, bi-grams are all you need.

An Improved Spell Checking Algorithm

  • Transform each string in your dictionary into its constituent bi-grams.
  • Do the same for the misspelling.
  • Calculate the Jaccard Similarity Coefficient of the misspelling's bi-gram set against the bi-gram set of each dictionary string.
  • You'll find that the top N matches returned by this algorithm are actually pretty good. The sketch below puts it together.
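Here's one way to wire that up in Python. It's a sketch, not canon – the dictionary contents and top_n value are illustrative, and bigrams is the helper from earlier:

    def jaccard(a, b):
        # Count(A intersection B) / Count(A union B)
        return len(a & b) / len(a | b)

    def suggest(misspelling, dictionary, top_n=5):
        target = bigrams(misspelling)
        scored = [(jaccard(target, bigrams(word)), word) for word in dictionary]
        scored.sort(reverse=True)  # best scores first
        return [word for score, word in scored[:top_n]]

    suggest("potatoe", ["potato", "potatoes", "tomato", "peanut"])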

But you’ll also notice two problems:

  • The ranking order of your corrections isn't always ideal. Some intuitively superior matches get lower scores.
  • You still have to compare the misspelling against every string in your dictionary.

A Second Pass Ranker

You improve the ranking order of your spelling corrections by using a second algorithm to recalculate string similarities and re-rank the results.

This second edit-distance algorithm is accurate, but also computationally expensive. So you only run it over the candidate set of spelling corrections you found using your Jaccard ranker above!
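The two-pass flow looks something like this – a sketch assuming the suggest function above and an edit_distance function like the one developed below:

    def correct(misspelling, dictionary, shortlist_size=20, top_n=5):
        # Pass 1: cheap bi-gram Jaccard scan shortlists candidates.
        candidates = suggest(misspelling, dictionary, top_n=shortlist_size)
        # Pass 2: expensive edit distance, run only over the shortlist.
        candidates.sort(key=lambda word: edit_distance(misspelling, word))
        return candidates[:top_n]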

Edit Distance

What is edit distance anyway?

You can transform any string A into any string B by adding, removing, replacing or transposing characters in A until you arrive at string B.

The edit distance between string A and B is the minimum number of steps it takes to transform A into B.

  • The edit distance between equal strings is 0.
  • The edit distance between potato and potatoe is 1.
  • The edit distance between potato and potatoes is 2.

The smaller the edit distance, the more similar the strings. Edit distance is a great way to rank corrections because most spelling mistakes are mutations of the correct spelling.

How many individual mutations (adds, removes, replacements) does it take to go from the original string to the mutated string? The fewer the steps, the stronger the relationship between the two strings. The nearest relative to the misspelling is the most likely correct spelling. Genetics, anyone?

Calculating Edit Distance

Calculating edit distances can be tricky. There may be more than one way to turn string A into B, and you must find the minimal number of steps required.

There are a few well known measures of edit distance, each with its own wrinkle. The popular Levenshtein algorithm uses dynamic programming to do the deed.

You may have butted heads with dynamic programming in computer science class. And promptly forgotten what you learnt to protect your brain from trauma. Like other dynamic programming algorithms, edit distance calculation can leave you scratching your head. I had to scratch mine with the vigor of my dog Toby, and it has left a decorative dead zone atop my skull.

Dynamic programming algorithms use values found in earlier iterations to calculate values for the current iteration. They save the results of the current iteration to inform future iterations.

To calculate edit distances, you must first formulate a systematic way to list and count the steps needed to transform string A into B. Let's explore what they are by turning string abcd into pqrs.

Delete And Transform

Say you have already calculated how to transform abc ==> pqrs.

To transform abcd ==> pqrs :

  • Transform abcd ==> abc, by deleting d from abcd.
  • Transform abc ==> pqrs [which you’ve already calculated in an earlier iteration]
  • Total Cost = 1 Delete + Count Steps(abc ==> pqrs)

Transform and Add

Say you have already calculated how to transform abcd ==> pqr.

To transform abcd ==> pqrs :

  • Transform abcd ==> pqr [which you’ve already calculated in an earlier iteration]
  • Transform pqr ==> pqrs, by adding s to pqr.
  • Total Cost = 1 Add + Count Steps(abcd ==> pqr)

Transform and Replace

Say you have already calculated how to transform abc ==> pqr.

To transform abcd ==> pqrs :

  • Transform the abc portion of abcd ==> pqr [which you've already calculated earlier]
  • Replace the last character of abcd – d – with s.
  • Total Cost = 1 Replace + Count Steps(abc ==> pqr)

Putting it Together

The edit distance is the minimum of the individual costs of the 3 transformations listed above.
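Written as a recurrence – with D(i, j) standing for the edit distance between the first i characters of A and the first j characters of B – each cell takes the cheapest of the three moves:

    D(i, j) = min(
        D(i-1, j) + 1,       # delete A's i-th character, then transform
        D(i, j-1) + 1,       # transform, then add B's j-th character
        D(i-1, j-1) + cost,  # replace; cost is 0 if A[i] == B[j], else 1
    )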

You may have realized that you will end up calculating the edit distance of each prefix of A to each prefix of B.

    • For each string in {a, ab, abc, abcd}
      • Calculate the edit distance to each string in {p, pq, pqr, pqrs} using the steps outlined above
      • Save the minimum cost found.
    • By the time you actually calculate the edit distance from abcd ==> pqrs, you've already computed all the prior information you need. The sketch below turns this into code.
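Here is a compact Python version of that table-filling idea – the classic Levenshtein dynamic program. (Note that it doesn't count transpositions; the Damerau–Levenshtein variant adds those.)

    def edit_distance(a, b):
        # dp[i][j] = edit distance between a[:i] and b[:j]
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            dp[i][0] = i  # delete everything in a[:i]
        for j in range(len(b) + 1):
            dp[0][j] = j  # add everything in b[:j]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                               dp[i][j - 1] + 1,         # add
                               dp[i - 1][j - 1] + cost)  # replace
        return dp[len(a)][len(b)]

    edit_distance("potato", "potatoes")  # 2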

Speed, Give me Speed!

And with that, you've got yourself an effective and versatile spell checker – really, a general-purpose suggestion engine (there are other kinds) – that you can use in a range of scenarios. You can use it to autocorrect medication names (can you spell penicillin yet?). You can even use it to find the nearest first or last name in a phone book.

But you still need to fix the speed part.

You already know how to use indexing and caching to nitrous boost your code into running fast and furious. So how do you speed up your spell checker and avoid comparing each misspelling against all strings in your dictionary?

How Spell Checkers Work – Part 1

Ever wonder how a spell checker works? How does Bing auto-correct your misspellings (potato instead of potatoe) as you type? How does software automatically correct Penicilin to Penicillin? How many of you even spotted that typo? While the actual algorithms and engineering used by Bing or Google are deep dark trade secrets that we had best not know, the fundamentals are well understood.

Spell Checking

A spellchecker works by searching for a given string in its dictionary of known strings. Any string not found in the dictionary is deemed an alleged misspelling. The misspelling is alleged because it is a misspelling relative to the spellchecker’s dictionary – which could be limited or incomplete. “Blog” is a misspelling to a spellchecker using a dictionary from 1990.

To correct a misspelling, the spellchecker assumes that your misspelling must be a mutation of one of the strings in its dictionary. To suggest a correction, the spellchecker searches its dictionary for strings similar to the misspelling. It then orders the potential corrections by their similarity and likelihood score. The known string nearest to your misspelling is the suggested correction.

Sometimes a correctly spelt and known string may itself require correction. Is a given word the most likely word in the current context? Did you really mean peanut allegory symptoms? Or is it more likely that you meant peanut allergy symptoms? After all, allegory is a perfectly valid English word!

String Similarity

You intuitively know that the strings Penisilin and Penicillin are similar. So are Potato and Potatoe. Tomato and Potato are also similar. The question is – how similar?

How do you measure similarity? Is the string Potato closer to Potatoe or to Tomato?

The measurement of string similarity is relevant to several disciplines, including Genetics. While similarity algorithms typically work with text strings, a string is merely a list of elements taken from some alphabet. A string need not be text. The DNA strands you sometimes blame for your expanding gut are also strings, constructed from the famous 4 element alphabet {A, C, G, T}. A series of musical notes is a string. Polymers are strings. Phone numbers and social security numbers are strings. The repeating patterns of lines and colors on your shirt could be expressed as strings.

This has important implications. What works for text also works elsewhere. You can use algorithms developed to search and auto-correct plain old text, to also give Darth Vader a paternity test. How similar is Vader’s DNA to Luke Skywalker’s? That is conceptually like asking how similar is Penisilin to Penicillin.

The classic string similarity measure is the famed edit-distance. I’ll explore edit-distance algorithms in future posts. But I’ll begin with a simpler and faster alternative that uses intuitive measures of set similarity. Discussing set similarity introduces several important search concepts into the discussion. Set similarity also serves as an effective first-pass algorithm when searching large dictionaries.

Set Similarity

Consider the following strings, expressed as sets. (Duplicate characters are kept, so these are really multisets, or bags – the counts below depend on that.)

Set   String     Elements                    Element Count
A     potatoes   {p, o, t, a, t, o, e, s}    8
B     potato     {p, o, t, a, t, o}          6
C     tomatos    {t, o, m, a, t, o, s}       7

Problem: measure the similarity of set A (potatoes) to set B (potato) and set C (tomatos).

Your basic intuition is sound. The more elements that sets A & B have in common, the more similar they must be. You express your measure as: Count(A intersection B).

This definition gives us the following similarity scores:

  • A (potatoes) & B (potato) have 6 elements in common.
  • A (potatoes) & C (tomatos) also have 6 elements in common.

This simple string similarity measure suggests that potatoes is equally similar to both potato and tomatos. But you know that isn't the case. You know that potatoes is closer to potato than it is to tomatos. The simple similarity measure is unfortunately not discriminating enough.

The Jaccard Coefficient

A better similarity measure does more than count the number of elements the sets have in common. Take all the elements in A and B combined, and calculate the fraction of them found in both sets. The higher the fraction, the more similar the sets.

  • The set of all unique elements in (A, B) => (A union B).
  • The set of common elements in (A, B) => (A intersection B).

You use these two sets to compute a Jaccard Coefficient. The similarity between two sets A & B is given by:

Count(A intersection B) / Count(A union B)

Using the Jaccard Coefficient:

A (potatoes) vs.   Count(Union)   Count(Intersection)   Score
B (potato)         8              6                     6/8 = 0.75
C (tomatos)        9              6                     6/9 ≈ 0.66
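You can check those numbers with a few lines of Python. collections.Counter treats a string as a bag, which matches the duplicate-keeping counts in the table above:

    from collections import Counter

    def bag_jaccard(a, b):
        # & takes minimum counts (intersection), | takes maximum
        # counts (union) – multiset semantics, duplicates included.
        x, y = Counter(a), Counter(b)
        return sum((x & y).values()) / sum((x | y).values())

    bag_jaccard("potatoes", "potato")   # 6/8 = 0.75
    bag_jaccard("potatoes", "tomatos")  # 6/9 ≈ 0.66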

A Basic Spellcheck Algorithm

To find suggestions for a given misspelling:

  • Calculate the Jaccard Coefficient of the misspelling with each string in the dictionary.
  • Collect those strings with a Jaccard score higher than some threshold.
  • Sort the matches by score, and pick the top N (often just 1) matches as suggestions. A sketch follows below.
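A minimal sketch of this basic algorithm, reusing the bag_jaccard helper from the previous snippet (the threshold and top_n values are illustrative):

    def suggest_basic(misspelling, dictionary, threshold=0.5, top_n=5):
        scored = [(bag_jaccard(misspelling, word), word) for word in dictionary]
        # Keep only matches above the threshold, best scores first.
        matches = [(s, w) for s, w in scored if s >= threshold]
        matches.sort(reverse=True)
        return [w for s, w in matches[:top_n]]

    suggest_basic("potatoe", ["potato", "tomato", "peanut"])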

A good start. But the algorithm has several weaknesses, including performance. Comparing a misspelling against every string in the spellchecker’s dictionary (imagine Bing’s) could be prohibitively expensive. The algorithm also fails to consistently produce good corrections for spelling mistakes such as:

  • Single character errors, such as accidentally hitting the wrong key
  • Leaving out letters: sucess instead of success
  • Additional characters: happinness instead of happiness.
  • Phonetic interpretations of spellings
  • And so on…

The Jaccard measure is sound. But the simple expression of potato as the set {p, o, t, a, t, o} is inadequate. It turns out that you can noticeably improve a spellchecker by changing how you express strings as sets. You do this using N-grams.

Instead of expressing potato as the set {p, o, t, a, t, o}, you express it as a set of bi-grams – the set of all overlapping character pairs in the string: {$p, po, ot, ta, at, to, o$}.

More on the exciting world of n-grams and their impact on spellchecking (and elsewhere) in Part 2.