My previous post How Spell Checkers Work – Part 1 concluded with an introductory taste of n-grams. N-grams have been the cool kids of the search algorithm scene for some time. You’ll find hundreds of papers extolling their merit, and for good reason.

# Why N-Grams are Cool

Languages are **contextual**. Words and characters derive syntactic and semantic meaning from the words and characters that they *co-occur* with. Consider the word *butter* in:

- peanut butter
- cocoa butter

Or the occurrences of **o** in *potato*:

- *po*: o follows p.
- *ot*: o precedes t.
- *to*: o follows t.
- *o$*: o precedes nothing (last character).

Same character, changing context.

## Co-occurrence

The character **o** *co-occurs* with other characters – i.e. it is **associated** with them. The association is **ordered** – **o** follows **p**. And **o** precedes **t**. *op* is **not** the same as *po*. The last **o** in *potato* precedes an implicit terminating null character. An implicit null precedes the first **p**. Common sense. Common sense makes fabulous software.

Two strings are similar if they have characters in common. They are *a lot more* similar if they have **co-occurring characters in common**. Incorporating co-occurrence information improves your similarity algorithm.

**N-grams** capture co-occurrence. Given a set, **bi-grams** capture the co-occurrence of **pairs** of the set's elements.

potato => {$p, po, ot, ta, at, to, o$}
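Decomposing a string into its bi-grams takes only a few lines. Here's a minimal Python sketch; the `bigrams` name is my own, and the `$` sentinel convention follows the example above:

```python
def bigrams(s):
    """Return the set of character bi-grams of s, with '$' marking
    the start and end of the string (as in '$p' and 'o$' above)."""
    padded = f"${s}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

# bigrams("potato") == {"$p", "po", "ot", "ta", "at", "to", "o$"}
```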

**Tri-grams** capture the co-occurrence of **triples**.

- peanut butter allergy
- peanut butter sandwich

Interestingly, the **N** in N-grams rarely exceeds 4 or 5. For spell-checking and string/word similarity scoring, bi-grams are all you need.

# An Improved Spell Checking Algorithm

- Transform each string in your dictionary into its constituent bi-grams.
- Do the same for the misspelling.
- Calculate the Jaccard similarity coefficient of the misspelling's bi-gram set against the bi-gram set of each string in your dictionary.

You'll find that the top N matches returned by this algorithm are actually pretty good.
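The steps above can be sketched as follows. This is a minimal illustration, not production code; the function names are my own, and a real dictionary would be far larger:

```python
def bigrams(s):
    """Character bi-grams of s, with '$' start/end sentinels."""
    padded = f"${s}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def jaccard(a, b):
    """Jaccard similarity coefficient: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def top_matches(misspelling, dictionary, n=3):
    """Rank dictionary strings by bi-gram Jaccard similarity, best first."""
    target = bigrams(misspelling)
    return sorted(dictionary,
                  key=lambda word: jaccard(target, bigrams(word)),
                  reverse=True)[:n]

# top_matches("potatoe", ["potato", "tomato", "peanut", "butter"])
# ranks "potato" first.
```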

But you’ll also notice two problems:

- The ranking order of your corrections isn’t always ideal. Some intuitively superior matches get lower scores.
- You still have to compare the misspelling against every string in your dictionary.

## A Second Pass Ranker

You improve the ranking order of your spelling corrections by using a second algorithm to **recalculate** string similarities and *re-rank* the results.

This second **edit-distance **algorithm is accurate, but also computationally expensive. So you only run it over the *candidate set *of spelling corrections you found using your Jaccard ranker above!
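That second pass is just a sort of the (small) candidate set. A sketch, assuming some `distance(a, b)` edit-distance function already exists; all names here are my own:

```python
def rerank(misspelling, candidates, distance):
    """Second pass: sort the first-pass candidates by edit distance
    to the misspelling, most similar (lowest distance) first."""
    return sorted(candidates, key=lambda word: distance(misspelling, word))

def length_gap(a, b):
    """Stand-in distance for illustration only: absolute length
    difference. A real edit-distance function slots in the same way."""
    return abs(len(a) - len(b))

rerank("potatoe", ["pot", "potatoes", "potato"], length_gap)
```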

# Edit Distance

What is edit distance anyway?

You can transform *any* string A into *any* string B by **adding**, **removing**, **replacing**, or **transposing** characters in A until you arrive at string B.

The **edit distance** between strings A and B is the *minimum* number of steps it takes to transform A into B.

- The edit distance between equal strings is **0**.
- The edit distance between *potato* and *potatoe* is **1**.
- The distance between *potato* and *potatoes* is **2**.

The *smaller *the edit distance, the more *similar *the strings. Edit distance is a great way to rank corrections because most spelling mistakes are **mutations** of the correct spelling.

How many individual mutations (adds, removes, replacements) does it take to go from the original string to the mutated string? The fewer the steps, the stronger the relationship between the two strings. The *nearest relative *to the misspelling is the most likely correct spelling. **Genetics**, anyone?

# Calculating Edit Distance

Calculating edit distances can be tricky. There may be more than one way to turn string A into B, and you must find the **minimal** number of steps.

There are a few well known measures of edit distance, each with its own wrinkle. The popular **Levenshtein** algorithm uses dynamic programming to do the deed.

You may have butted heads with dynamic programming in computer science class. And promptly forgotten what you learnt to protect your brain from trauma. Like other dynamic programming algorithms, edit distance calculation can leave you scratching your head. I had to scratch mine with the vigor of my dog Toby, and it has left a decorative dead zone atop my skull.

Dynamic programming algorithms use values found in earlier iterations to calculate values for the current iteration. They save the results of the current iteration to inform future iterations.

To calculate edit distances, you must first formulate a *systematic way* to list and count the number of steps needed to transform string A into B. Let's explore what they are by turning string *abcd* into *pqrs*.

## Delete And Transform

Say you have *already calculated* how to transform **abc** ==> **pqrs**.

To transform **abcd** ==> **pqrs**:

- Transform **abcd** ==> **abc** by deleting **d** from **abcd**.
- Transform **abc** ==> **pqrs** [which you've already calculated in an *earlier* iteration].
- Total Cost = 1 Delete + Count Steps(**abc** ==> **pqrs**)

## Transform and Add

Say you have *already calculated* how to transform **abcd** ==> **pqr**.

To transform **abcd** ==> **pqrs**:

- Transform **abcd** ==> **pqr** [which you've already calculated in an *earlier* iteration].
- Transform **pqr** ==> **pqrs** by adding **s** to **pqr**.
- Total Cost = 1 Add + Count Steps(**abcd** ==> **pqr**)

## Transform and Replace

Say you have *already calculated* how to transform **abc** ==> **pqr**.

To transform **abcd** ==> **pqrs**:

- Transform the **abc** portion of **abcd** ==> **pqr** [which you've already calculated *earlier*].
- Replace the last character of **abcd** – **d** – with **s**.
- Total Cost = 1 Replace + Count Steps(**abc** ==> **pqr**)

# Putting it Together

The edit distance is the minimum of the individual costs of the 3 transformations listed above.

You may have realized that you will end up calculating the edit distance of *each prefix *of A to *each prefix *of B.

- For each string in {a, ab, abc, abcd}:
  - Calculate the edit distance to each string in {p, pq, pqr, pqrs} using the steps outlined above.
  - Save the **minimum** cost found.
- By the time you actually calculate the edit distance from **abcd** ==> **pqrs**, you've already computed all the prior information you need.

# Speed, Give me Speed!

And you’ve got yourself an effective and versatile spell checker – or a class of suggestion engine (there are others) – that you can use in a range of scenarios. You can use it to autocorrect medication names (can you spell **penicillin **yet?). You can even use it to find the nearest first or last name in a phone book.

But you still need to fix the speed part.

You already know how to use indexing and caching to nitrous boost your code into running fast and furious. So how do you speed up your spell checker and avoid comparing each misspelling against all strings in your dictionary?