How Spell Checkers Work – Part 2

My previous post How Spell Checkers Work – Part 1 concluded with an introductory taste of n-grams. N-grams have been the cool kids of the search algorithm scene for some time. You’ll find hundreds of papers extolling their merit, and for good reason.

Why N-Grams are Cool

Languages are contextual. Words and characters derive syntactic and semantic meaning from the words and characters that they co-occur with. Consider the word butter in:

peanut butter
cocoa butter

Or the occurrences of o in potato:

po : o follows p.
ot: o precedes t.
to : o follows t.
o$: o precedes nothing (last character).

Same character, changing context.

Co-occurrence

The character o co-occurs with other characters – i.e. it is associated with them. The association is orderedo follows p and o precedes t. And op is not the same as po. The last o in potato precedes an implicit terminating null character. An implicit null precedes the first p.

Common sense. Common sense makes fabulous software.

Two strings are similar if they have characters in common. They are a lot more similar if they have co-occurring characters in common. Incorporating co-occurrence information improves your similarity algorithm.

N-grams capture co-occurrence. Given a set, bi-grams capture the co-occurrence of pairs of the set’s elements.

potato => {$p, po, ot, ta, at, to, o$}

Tri-grams capture the co-occurrence of triples.

peanut butter allergy
peanut butter sandwich.

Interestingly, the N in N-grams rarely exceeds 4 or 5. For spell-checking and string/word similarity scoring, bi-grams are all you need.

An Improved Spell Checking Algorithm

  • Transform each string in your dictionary into its constituent bi-grams.
  • Do the same for the misspelling
  • Calculate the Jaccard Similarity Coefficient of the bi-gram set of the misspelling to the bi-gram sets of your dictionary
  • You’ll find that the top N matches returned by this algorithm are actually pretty good.

But you’ll also notice two problems:

  • The ranking order of your corrections isn’t always ideal. Some intuitively superior matches get lower scores.
  • You still have to compare the misspelling against every string in you dictionary.

A Second Pass Ranker

You improve the ranking order of your spelling corrections by using a second algorithm to recalculate string similarities and re-rank the results.

This second edit-distance algorithm is accurate, but also computationally expensive. So you only run it over the candidate set of spelling corrections you found using your Jaccard ranker above!

Edit Distance

What is edit distance anyway?

You can transform any string A into any string B by adding, removing, replacing or transposing characters in A until you arrive at string B.

The edit distance between string A and B is the minimum number of steps it takes to transform A into B.

  • The edit distance between equal strings is 0.
  • The edit distance between potato and potatoe is 1.
  • The distance between potato and potatoes is 2.

The smaller the edit distance, the more similar the strings. Edit distance is a great way to rank corrections because most spelling mistakes are mutations of the correct spelling.

How many individual mutations (adds, removes, replacements) does it take to go from the original string to the mutated string? The fewer the steps, the stronger the relationship between the two strings. The nearest relative to the misspelling is the most likely correct spelling. Genetics, anyone?

Calculating Edit Distance

Calculating edit distances can be tricky. There may be more than one way to turn string A into B, and you must use the minimal number of steps required.

There are a few well known measures of edit distance, each with its own wrinkle. The popular Levenshtein algorithm uses dynamic programming to do the deed.

You may have butted heads with dynamic programming in computer science class. And promptly forgotten what you learnt to protect your brain from trauma. Like other dynamic programming algorithms, edit distance calculation can leave you scratching your head. I had to scratch mine with the vigor of my dog Toby, and it has left a decorative dead zone atop my skull.

Dynamic programming algorithms use values found in earlier iterations to calculate values for the current iteration. They save the results of the current iteration to inform future iterations.

To calculate edit distances, you must first formulate a systematic way to list and count the number of steps needed to transform string A into B. Lets explore what they are by turning string abcd into pqrs.

Delete And Transform

Say you have already calculated how to transform abc ==> pqrs.

To transform abcd ==> pqrs :

  • Transform abcd ==> abc, by deleting d from abcd.
  • Transform abc ==> pqrs [which you’ve already calculated in an earlier iteration]
  • Total Cost = 1 Delete + Count Steps( abc ==> pqrs)

Transform and Add

Say you have already calculated how to transform abcd ==> pqr.

To transform abcd ==> pqrs :

  • Transform abcd ==> pqr [which you’ve already calculated in an earlier iteration]
  • Transform pqr ==> pqrs, by adding s to pqr.
  • Total Cost = 1 Add + Count Steps (abcd ==> pqr)

Transform and Replace

Say you have already calculated how to transform abc ==> pqr.

To transform abcd ==> pqrs :

  • Transform the abc portion of abcd  ==> pqr [which you’ve already calculated earlier]
  • Replace the last character from abcd – d with s.
  • Total Cost = 1 Replace + Count Steps (abc ==> pqr)

Putting it Together

The edit distance is the minimum of the individual costs of the 3 transformations listed above.

You may have realized that you will end up calculating the edit distance of each prefix of A to each prefix of B.

    • For each string in {a, ab, abc, abcd}
      • Calculate the edit distance to each string in {p, pq, pqr, pqrs} using the steps outlined above
      • Save the minimum cost found.
    • By the time you actually calculate the edit distance from abcd ==> pqrs, you’ve already computed all the prior information you need.

Speed, Give me Speed!

And you’ve got yourself an effective and versatile spell checker  – or a class of suggestion engine (there are others) – that you can use in a range of scenarios. You can use it to autocorrect medication names (can you spell penicillin yet?). You can even use it to find the nearest first or last name in a phone book.

But you still need to fix the speed part.

You already know how to use indexing and caching to nitrous boost your code into running fast and furious. So how do you speed up your spell checker and avoid comparing each misspelling against all strings in your dictionary?

How Spell Checkers Work – Part 1

Ever wonder how a spell checker works? How does Bing auto-correct your misspellings (potato, instead of potatoe) as you type? How does software automatically correct Penicilin to Penicillin? How many of you even spotted that typo? While the actual algorithms and engineering used by Bing or Google are deep dark trade secrets that we had best not know, the fundamentals are well understood.

Spell Checking

A spellchecker works by searching for a given string in its dictionary of known strings. Any string not found in the dictionary is deemed an alleged misspelling. The misspelling is alleged because it is a misspelling relative to the spellchecker’s dictionary – which could be limited or incomplete. “Blog” is a misspelling to a spellchecker using a dictionary from 1990.

To correct a misspelling, the spellchecker assumes that your misspelling must be a mutation of one of the strings in its dictionary. To suggest a correction, the spellchecker searches its dictionary for strings similar to the misspelling. It then orders the potential corrections by their similarity and likelihood score. The known string nearest to your misspelling is the suggested correction.

Sometimes a correctly spelt and known string may itself require correction. Is a given word the most likely word in the current context? Did you really mean peanut allegory symptoms? Or is it more likely that you meant peanut allergy symptoms. But allegory is a legal English word!

String Similarity

You intuitively know that the strings Penisilin and Penicillin are similar. So are Potato and Potatoe. Tomato and Potato are also similar. The question is – how similar?

How do you measure similarity? Is the string Potato closer to Potatoe or to Tomato?

The measurement of string similarity is relevant to several disciplines, including Genetics. While similarity algorithms typically work with text strings, a string is merely a list of elements taken from some alphabet. A string need not be text. The DNA strands you sometimes blame for your expanding gut are also strings, constructed from the famous 4 element alphabet {A, C, G, T}. A series of musical notes is a string. Polymers are strings. Phone numbers and social security numbers are strings. The repeating patterns of lines and and colors on your shirt – could be expressed as strings.

This has important implications. What works for text also works elsewhere. You can use algorithms developed to search and auto-correct plain old text, to also give Darth Vader a paternity test. How similar is Vader’s DNA to Luke Skywalker’s? That is conceptually like asking how similar is Penisilin to Penicillin.

The classic string similarity measure is the famed edit-distance. I’ll explore edit-distance algorithms in future posts. But I’ll begin with a simpler and faster alternative that uses intuitive measures of set similarity. Discussing set similarity introduces several important search concepts into the discussion. Set similarity also serves as an effective first-pass algorithm when searching large dictionaries.

SET SIMILARITY

Consider the following strings, expressed as sets.

Set String Elements Element Count
A potatoes {p, o, t, a, t, o, e, s} 8
B potato {p, o, t, a, t, o} 6
C tomatos {t, o, m, a, t, o, s} 7

Problem: measure the similarity of set A (potatoes) to Set B (potato) & Set C (tomatos).

Your basic intuition is sound. The more elements that sets A & B have in common, the more similar they must be. You express your measure as : Count(A intersection B)

This definition gives us the following similarity scores:

  • A (potatoes) & B (potato) have 6 elements in common.
  • A (potatoes) & C (tomatos) also have 6 elements in common.

This simple string similarity measure suggests that potatoes is equally similar to both potato and tomatos. But you know that isn’t the case. You know that potatoes is closer to potato than it is to tomatos. Our simple similarity measure is unfortunately not discriminating enough.

The Jaccard Coefficient

A better similarity measure does more than count the number of elements in common between sets. Given a set of elements in sets A and B, calculate the fraction of elements found in both sets. The higher the fraction, the more similar the sets.

  • The set of all unique elements in (A,  B) => (A union B).
  • The set of common elements in (A, B) => (A intersection B)

You use these two sets to compute a Jaccard Coefficient. The similarity between two sets A & B is given by:

Count(A intersect B) / Count(A union B)

Using the Jaccard Coefficient:

A (Potatoes) Count(Union) Count(Intersection) Score
B (Potato) 8 6 6/8 = 0.75
C (Tomatos) 9 6 6/9 = 0.66

A Basic Spellcheck Algorithm

To find suggestions for a given misspelling:

  • Calculate the Jaccard Coefficient of the misspelling with each string in the dictionary.
  • Collect those strings with a Jaccard score higher than some threshold.
  • Sort the matches by score, and pick the top N (1) matches as suggestions.

A good start. But the algorithm has several weaknesses, including performance. Comparing a misspelling against every string in the spellchecker’s dictionary (imagine Bing’s) could be prohibitively expensive. The algorithm also fails to consistently produce good corrections for spelling mistakes such as:

  • Single character errors, such as accidentally hitting the wrong key
  • Leaving out letters: sucess instead of success
  • Additional characters: happinness instead of happiness.
  • Phonetic interpretations of spellings
  • And so on…

The Jaccard measure is sound. But the simple expression of potato as the set {p, o, t, a, t, o} is inadequate. It turns out that you can noticeably improve a spellchecker by changing how you express strings as sets. You do this by this using N-grams.

Instead of expressing potato as the the set, {p, o, t, a, t, o}, you express it as a set of bi-grams – the set of all overlapping character pairs in the string: {$p, po, ot, ta, at, to, o$}.

More on the exciting world of n-grams and their impact on spellchecking (and elsewhere) in the Part 2.

How encrypted email works

I’ve been working on the Direct Project for the past year or more. The Direct Project is a federally sponsored initiative that uses secure email as the foundation for the ubiquitous nationwide exchange of health information.

To secure an email, you have to, among other things, encrypt the message content. It is no surprise that many newcomers to Direct want to know how encrypted email works. Others, who are comfortable with classic message security, notice that unlike point to point messaging (one sender, one receiver), email is inherently multicast (one sender, many receivers). They ask: how do you encrypt email sent to multiple recipients?

In this inaugural posting for my new blog, I will try to answer both questions in plain English.

Encryption Basics

First, a quick refresher on encryption concepts:

  1. Key: An array of carefully generated bits, used to encrypt and decrypt data.
  2. Encryption: You use a key (secret) and a precise series of complicated steps (encryption algorithm or cipher) to mangle (encrypt) data into undecipherable gibberish.
  3. Decryption: You use a key (secret – hopefully the right one) and a precise series of complicated steps (decryption algorithm or cipher) to un-mangle (decrypt) gibberish back into your original data. If you use the wrong key, or the wrong algorithm, you turn the source gibberish into more gibberish.
  4. Symmetric Encryption: You use the same key to both encrypt and decrypt the data. Both the sender and the receiver have a copy of the same keya shared secret. To share the secret, the sender and receiver must exchange their shared key securely – without an attacker getting a peek. If an attacker can somehow (silently) intercept an inadequately protected secret as it moves from sender to receiver (steaming open the envelope, so to speak), the attacker can also decrypt your encrypted data.
  5. Asymmetric Encryption: You use one key (public) to encrypt the data and an associated but different key (private) to decrypt the data. Data encrypted with your public key can only be decrypted with your associated private key. You boldly give the public part of your key pair to anybody you want to receive encrypted data from. You keep your private key secret and and use it to decrypt data that people send you. Unlike symmetric encryption, there is no shared secret to exchange. You can distribute your public key to the entire world without fear. Data encrypted with your public key is truly for your eyes only – because only you can decrypt it with the secret private key that only you have.The reverse is also true. Data encrypted with your private key can only be decrypted using your public key. This property has important implications for digital signatures (more in future posts).

Symmetric and Asymmetric encryption work differently, – they use different types of keys and different encryption/decryption algorithms.

Symmetric encryption is fast. Asymmetric encryption is slow.

How does email encryption work?

Violet wants people to encrypt the email they send her. To help them do this, Violet creates a (public, private) key pair. She wraps up her public key in a secure package called an X509 Digital Certificate (more on this in future posts) and gives the certificate containing the public key to those she is corresponding with. To make it easy for others to find her public key, she even publishes her certificate in a public directory.

Violet’s good friend Toby Toby decides to send her some encrypted email.

All Toby has to do is use Violet’s public key to encrypt the message, right? Wrong.

To use Violet’s public key to encrypt his email, Toby must use asymmetric encryption. Which, unfortunately, is slow. Toby cannot practically encrypt the content of his email using Violet’s asymmetric public key – it takes too much work!

To encrypt his email content, Toby needs a faster option – symmetric encryption. Toby generates a new symmetric encryption key and uses this key to efficiently encrypt the content of his email.

But how does Violet decrypt Toby’s email? To decrypt, Violet needs a copy of the symmetric encryption key, which she doesn’t have because Toby generated it on the fly and hasn’t given it to her yet! How does Toby securely send Violet a copy of his encryption key?

Toby cleverly solves the problem by attaching the encryption key to the email itself. The message brings its own key with it.

But isn’t that crazy? Anybody can now grab the key and decrypt the email, right? Wrong.

The clever Toby encrypts the symmetric encryption key before attaching it to the email. He does this using Violet’s public key, which he had obtained earlier. And even though this requires slow asymmetric encryption, the performance conscious Toby doesn’t mind because the encryption key is relatively small – usually only 256 bits long at most.

Toby sends his email to Violet. Naturally, Toby does not encrypt the addressing information on the message – the To & From – which have to travel in the clear, just like the addressing information on the envelope of a sealed snail-mail letter. Email servers use the addressing information to transport the email to its destination.

When Violet receives the email, she decrypts the attached encryption key using her private key. She then uses the encryption key to decrypt the email content and receives Toby’s friendly missive.

How do you encrypt email sent to multiple recipients?

Toby wants to send an email message to both Violet and Margaret. How does he encrypt this message?

Should Toby repeat the encryption process twice? Encrypt the email once for Violet and again for Margaret? And what happens if Toby also puts Gitanjali on the To line? Does Toby have to encrypt the message three times? And send out 3 different copies of the same message? Isn’t that getting really inefficient?

Toby has a much better idea. Just like before, he encrypts the email exactly once, using a symmetric encryption key. Then he attaches multiple copies of the same encryption key to the message – one for each recipient and encrypted with that recipient’s public key. Toby encrypts one copy of the encryption key with Violet’s public key. He encrypts a second copy with Margaret’s public key and third with Gitanjali’s. Then he attaches the 3 copies to the message.

When Margaret receives the email, she locates the copy of the encryption key that was intended for her. She decrypts the encryption key, then uses it to decrypt Toby’s note. Violet and Gitanjali do the same.

You can use the same technique to encrypt email sent to as many recipients as you like. Every new recipient merely means the small overhead of an additional attached copy of the encryption key.

S/MIME

You should now have a high level notion of how email encryption works. Those of you who are interested in the gory details should deep dive into S/MIMEthe defacto standard for securing email. Please do peruse the S/MIME and Direct Transport specs for a bit by bit commentary.

It takes more than encryption to secure email. See my follow up posts to learn how:

Source Code

The open source Direct Project Reference implementation contains a full S/MIME and secure email implementation. To learn how to encrypt and sign email and email content in C#, check out the SMIME source code.