What are Hashes, Checksums and File Digests?

I purchased a digital design assets library the other day containing 45000 files (some 200GB worth of data). The package came segmented into 30 ZIP archives uploaded to Amazon’s cloud. The issue faced with this distribution method was that each 6GB archive often corrupted during the download phase making extraction of the archive impossible whilst wasting a lot of my internet bandwidth in redownloads. (Assuming that the archives were uploaded correctly without any errors.)

There was no way to determine the cause of the extraction problems as they could originate during the initial compression phase (maybe on a low spec machine that struggled to handle the compression of 6GB of data), the upload to Amazon’s Cloud storage, or the download on my home internet connection. (Thanks Australia for medieval download speeds in my area.)

The suggestion I left with the seller was to generate hashes of the files on his computer such that I can confirm the data I downloaded is intact.

What is a hash?

Hashes, checksums and file digests, are all terms for a short, seemingly random string of characters representing a summary of data. What they do is take a given file and use a one-way compression algorithm (a hash function) to compress it down to a relatively short string of characters1 representing that file. The aim of this compression is not to reduce the used disk space. It is impossible (as we will see later) to restore the original data. The purpose of hash functions is to map a given file to a short string of characters that is unique to that file. It’s easy to see where that might come in handy in my download corruption problem.

Hash Function Properties

There are number of properties that are very desirable for hash functions to have:

  1. One-way (preimage resistant)
  2. Map to unique hash values (collision resistant)
  3. Given message, it should be hard to find another message with the same hash (second preimage resistant)

Let’s go through each one in turn:

One-way

Hash functions must be one way and very hard to reverse. This property is also called preimage resistant and the formal definition2 is as follows:

A hash function is preimage resistant if, given a hash h, it is hard to find the original message m such that the message, paired with the hash key k evaluates to the given hash value. h = hash(k, m).

As an example, consider the hash function h(k, m) = 3 - 2m + k/5. This equation can be easily rearranged to find m, given a value of h and k. This is exactly the kind of behavior this property states a hash function should not exhibit.

Collision Resistant

A hash function should reliably map any given input to a hash value unique to that input.

In other words, we do not want a hash function that maps multiple inputs to the same hash value. In other words again, we don’t want an attacker to be able to create a fake message with the same hash value as the legit message. Expressed formally, this property becomes:

hash(m1) = hash(m2)

Consider the hash function hash(k, m) = 16. This function is not collision resistant because all messages map to the value 16. A less obvious example is h(k, m) = m^2 + 1. Given a fixed value for k, matching positive and negative values for m map to the same hash value. (e.g. m1 = 2, m2 = -2)

Second Preimage Resistant

This means that, given a message m1 that maps to a hash value h, it should be extremely hard3 to find a second message m2 that maps to that same hash value h. The mathematical definition of this follows. K represents the key used. m1 is a given message and m2 was calculated to make the equation true.

h = hash(k, m1) = hash(k, m2)

Practical Example

Say you wanted to transfer 1000 dollars from account 1 to account 2 by sending a message to your bank.

transfer(1000, 1, 2)
Hash: 9aEfc3

You do not want an attacker to change the message contents (e.g. direct your money to account 43) without the bank noticing the fraud. If the hash values of the message and the claimed hash value contained within the message do not match, then the bank can detect this kind of attack and reject the transaction request. However, if an alternate message with the same hash value can be found easily, an attacker can transfer money as they please without the bank noticing. This is why it is very important hash functions are second preimage resistant.

transfer(1000, 1, 43)
Hash: 9aEfc3

This example works to illustrate the point, in reality, the chance of finding an alternate message that maps to the same hash value and makes sense in that it contains actual English words is very low.

h = hash(k, m)

Hash functions have become really fast and secure, with SHA-512 using 512-bit encryption and the MD6 algorithm released in 2008 capable of hashing data at a rate of 1GB/s on a 16 core architecture.

How hashes are used on the web

Now that we know what a checksum/hash is we can talk about how we might actually use it. Back to the download problem: The issue is that there is no way to tell where the archive was corrupted (upload, download or during the compression process). By hashing the archive on their computer, the seller can communicate to me what the intact archive’s hash value is. Then, if by some unfortunate bit shifting across the Atlantic, the file I receive generates a different hash to the one supplied by the seller, I can be sure to know that I have to redownload the file and that the issue happened during the upload/download progress. The seller can also confirm whether the file was uploaded correctly to Amazon’s cloud by hashing the file online and comparing hash values. This way, the cause of the problem can be narrowed down strategically.

Of course, with digital downloads of this size and the sheer number of files involved, a torrent based distribution system would be orders of magnitude better than a cloud based one. Considering how well torrents deal with multiple files and how files are split up into smaller pieces, each piece getting hashed and checked for integrity4 before being appended to the file on the users computer, it goes without saying that torrent based distribution is the way to go for delivery of large asset libraries.

  1. 16 characters for MD5 (128-bit checksums), 64 characters for SHA-512 (512-bit checksums) ↩︎

  2. http://www.denimgroup.com/knowartic_secure_hashfunctions.html ↩︎

  3. “hard” meaning computationally expensive, taking a super polynomial time to compute. Hash functions are designed to make this impossible. ↩︎

  4. No need for manual hashing and comparing of checksums. By hashing individual file pieces, rather than the full 6GB archive, we can save bandwidth as well as we only have to redownload that small piece of a file. ↩︎