The other day, I was in a meeting where somebody said, “….and then you take a SHA-256 hash of the document, which is unique…”.
Not quite. It would be more accurate to say practically unique.
Cryptographic hash functions are astounding. They take arbitrary binary data: document, image, movie, message, bytes.. and crunch over every bit to produce short, fixed length summary called a hash value or digest. E.g. The SHA1 hash function creates a 160 bit digest out of any source input, no matter what its size or content. Cryptographic digests have some very important properties.
Say you create a cryptographic digest of a document. You will find it practically impossible to:
- Find or create a second document (or any other data) that will produce the same digest
- Change or tamper the source document – even a single bit – without also altering the digest .
- Reverse engineer the document from the digest – i.e. by hashing randomly generated documents until you find one that has a matching digest.
These properties make the digest unique for all practical purposes. You can take any binary data and derive a big number that represents that data and that data alone. The digest serves as a digital fingerprint for the data. This property makes cryptographic digests the basis of Digital Signatures and their close cousins, HMACs (we’ll cover both in upcoming posts).
But the fact remains: the cryptographic hash is not actually unique. Where there is a hash value, there will always be a collision: two or more arbitrary pieces of data that reduce down to the same digest.
Why are collisions inevitable? Most of you know how a hash table works. If you don’t, then consider the following : Say you were given 5 balls and asked to place them in 3 buckets. It doesn’t take an engineer to realize that you must put at least 2 balls in buckets that already contain at least 1 other ball. Collisions!
The famous SHA-1 function, the one time champion of cryptographic hashing, produces 160 bit hash values. 160 bits represents is 2^160 (2 raised to 160) possible unique values. 2^160 is a very big number: approx. 1.46 x 10^48 – i.e. 48 zeros. By comparison, the Earth has an estimated 1.33 x 10^50 atoms.
The number of possible inputs (balls) to the hash function is infinite. The number of possible hash values (buckets) is fixed. Collisions! Multiple balls will land in the same bucket. Eventually. But it may take a long while because there are so many buckets!
In fact, the laws of probability tell us that you have a 50-50 chance of getting a collision if you have as few as 2^80 inputs. Which is a smaller but still scarily big number. Why? For the same reason that 23 random people have a 50-50 chance of sharing the same birthday (but not birth year!).
So how do you go find collisions and why? The why is obvious: imagine if you found somebody who had the same fingerprint as you – unlikely though it may be. If you were of the miscreant persuasion, you might take advantage of this knowledge. The same holds for digital fingerprints. If you could tamper with or create digital data in such a way that its digital fingerprint matched (collided) with that of the “real data”, you could cause some mischief. Since the digital fingerprint of the bad data matched the one people expected from good data, they would have little reason to be suspicious.
The simplest way to find collisions– brute force – is also the hardest – primarily because of how long it takes. You could brute force collision detection by calculating the hashes of bazillion (all) inputs using gazillions of computers and then watching a lot of TV as you… wait……for ever. Or you could alter various bits of the original input, and try computing hashes and see if any of them stick. You could do cryptanalysis – which these days is a highly sophisticated version of how the British famously hacked the German Enigma machine. There are other techniques and they all are more involved than this short paragraph may let on (you think?), but you get the general idea.
To find collisions using the hottest cryptographic hash function in town (the SHA2 family) is (currently) practically impossible. In cryptography, practically impossible means computationally infeasible. Which is a fancy way of saying that even if you used all the current computers, algorithms and known mathematics, it would take you so long to solve the problem that it wouldn’t matter any more. You could use all of Azure and Amazon EC2 to crunch your algorithms, but you would die before you succeeded, as would all of humanity and possibly the Earth too. Of course, a brainy breakthrough that exploited of a fundamental flaw in the hash function, or a quantum computing revolution might give you a fighting chance, but until then..
You could also invent some new smarty pants Math that lets you find a collision in feasible time. Cryptographers live in abject terror of computational feasibility – the Freddie Kruger of their dreams.
Cryptographic hash functions are painstakingly designed to reduce the probability of collisions. If you peek at the code for a hash function, you will find it replete with bit operations like xors, bitwise and/ors, shifts and rotations. They operate on each bit, shoving and pushing and twisting the data with seemingly arbitrary, but carefully chosen and massively tested steps. A software blender using every bit in the original data to make a digital smoothie, with each smoothie having a taste of its own that incorporates the flavor of everything that went in. With values distributed values more or less randomly (evenly) across all buckets. Small changes in the original triggering an avalanche of changes in the computed digest.
The mathematics behind why or how any of this works is way over my balding head. The mathematics are actually so subtle and clever that they may include hidden flaws – either mistakes or deliberate weaknesses that a clever chap may exploit at a later time. This is why cryptographic hash functions are few and far between. Rock stars that hold sway for a while even as they are taken apart by brainiacs. Until one of them discovers a weakness. And so went MD4 and MD5, SHA0. And not so long ago, SHA1 also met its fate, even though it was compromised only in theory. Cryptographers are a paranoid bunch, which is just fine with me!