Ever wonder how a spell checker works? How does Bing auto-correct your misspellings (potato, instead of potatoe) as you type? How does software automatically correct Penicilin to Penicillin? How many of you even spotted that typo? While the actual algorithms and engineering used by Bing or Google are deep dark trade secrets that we had best not know, the fundamentals are well understood.

Spell Checking

A spellchecker works by searching for a given string in its dictionary of known strings. Any string not found in the dictionary is deemed an alleged misspelling. The misspelling is alleged because it is a misspelling relative to the spellchecker’s dictionary – which could be limited or incomplete. “Blog” is a misspelling to a spellchecker using a dictionary from 1990.

To correct a misspelling, the spellchecker assumes that your misspelling must be a mutation of one of the strings in its dictionary. To suggest a correction, the spellchecker searches its dictionary for strings similar to the misspelling. It then orders the potential corrections by their similarity and likelihood score. The known string nearest to your misspelling is the suggested correction.

Sometimes a correctly spelt and known string may itself require correction. Is a given word the most likely word in the current context? Did you really mean peanut allegory symptoms? Or is it more likely that you meant peanut allergy symptoms. But allegory is a legal English word!

String Similarity

You intuitively know that the strings Penisilin and Penicillin are similar. So are Potato and Potatoe. Tomato and Potato are also similar. The question is – how similar?

How do you measure similarity? Is the string Potato closer to Potatoe or to Tomato?

The measurement of string similarity is relevant to several disciplines, including Genetics. While similarity algorithms typically work with text strings, a string is merely a list of elements taken from some alphabet. A string need not be text. The DNA strands you sometimes blame for your expanding gut are also strings, constructed from the famous 4 element alphabet {A, C, G, T}. A series of musical notes is a string. Polymers are strings. Phone numbers and social security numbers are strings. The repeating patterns of lines and and colors on your shirt – could be expressed as strings.

This has important implications. What works for text also works elsewhere. You can use algorithms developed to search and auto-correct plain old text, to also give Darth Vader a paternity test. How similar is Vader’s DNA to Luke Skywalker’s? That is conceptually like asking how similar is Penisilin to Penicillin.

The classic string similarity measure is the famed edit-distance. I’ll explore edit-distance algorithms in future posts. But I’ll begin with a simpler and faster alternative that uses intuitive measures of set similarity. Discussing set similarity introduces several important search concepts into the discussion. Set similarity also serves as an effective first-pass algorithm when searching large dictionaries.

SET SIMILARITY

Consider the following strings, expressed as sets.

Set	String	Elements	Element Count
A	potatoes	{p, o, t, a, t, o, e, s}	8
B	potato	{p, o, t, a, t, o}	6
C	tomatos	{t, o, m, a, t, o, s}	7

Problem: measure the similarity of set A (potatoes) to Set B (potato) & Set C (tomatos).

Your basic intuition is sound. The more elements that sets A & B have in common, the more similar they must be. You express your measure as : Count(A intersection B)

This definition gives us the following similarity scores:

A (potatoes) & B (potato) have 6 elements in common.
A (potatoes) & C (tomatos) also have 6 elements in common.

This simple string similarity measure suggests that potatoes is equally similar to both potato and tomatos. But you know that isn’t the case. You know that potatoes is closer to potato than it is to tomatos. Our simple similarity measure is unfortunately not discriminating enough.

The Jaccard Coefficient

A better similarity measure does more than count the number of elements in common between sets. Given a set of elements in sets A and B, calculate the fraction of elements found in both sets. The higher the fraction, the more similar the sets.

The set of all unique elements in (A, B) => (A union B).
The set of common elements in (A, B) => (A intersection B)

You use these two sets to compute a Jaccard Coefficient. The similarity between two sets A & B is given by:

Count(A intersect B) / Count(A union B)

Using the Jaccard Coefficient:

A (Potatoes)	Count(Union)	Count(Intersection)	Score
B (Potato)	8	6	6/8 = 0.75
C (Tomatos)	9	6	6/9 = 0.66

A Basic Spellcheck Algorithm

To find suggestions for a given misspelling:

Calculate the Jaccard Coefficient of the misspelling with each string in the dictionary.
Collect those strings with a Jaccard score higher than some threshold.
Sort the matches by score, and pick the top N (1) matches as suggestions.

A good start. But the algorithm has several weaknesses, including performance. Comparing a misspelling against every string in the spellchecker’s dictionary (imagine Bing’s) could be prohibitively expensive. The algorithm also fails to consistently produce good corrections for spelling mistakes such as:

Single character errors, such as accidentally hitting the wrong key
Leaving out letters: sucess instead of success
Additional characters: happinness instead of happiness.
Phonetic interpretations of spellings
And so on…

The Jaccard measure is sound. But the simple expression of potato as the set {p, o, t, a, t, o} is inadequate. It turns out that you can noticeably improve a spellchecker by changing how you express strings as sets. You do this by this using N-grams.

Instead of expressing potato as the the set, {p, o, t, a, t, o}, you express it as a set of bi-grams – the set of all overlapping character pairs in the string: {$p, po, ot, ta, at, to, o$}.

More on the exciting world of n-grams and their impact on spellchecking (and elsewhere) in the Part 2.

Engineer By Day

Computer Science In Plain English

Monthly Archives: January 2012

How Spell Checkers Work – Part 1

String Similarity

The Jaccard Coefficient

A Basic Spellcheck Algorithm