Monday, April 27, 2009

Electronic Files Rehashed

This month’s blog entry is provided by guest blogger, Aaron Goodisman. Aaron is Valora’s Chief Technology Officer and Vice President of Engineering.

To a lot of people the word “hash” conjures up visions of leftover corned beef and onions (or, in some parts of the country, barbecued pork) – delicious. In the world of computers and electronic files, however, it’s less about chopped up meat and vegetables, and more about chopped up files and documents.

A “hash” (or more completely a “hash code”) is a kind of electronic signature of a computer file. The data bytes that make up the file are processed through a hash function (chopped up) to produce a short, fixed-length snippet of data that can be used to refer to the original file. Since the resulting hash code is short, it’s easy to store a lot of them. It’s also quick to send them across a network or compare them to each other – much quicker than doing the same thing with the whole file.

Web browsers and web servers use hash codes to decide which web pages to refresh. If the browser has a copy of a file stored locally on your computers hard disk (e.g., the header graphic on this page), it sends the hash code of that to the server to ask if the file has changed. The server compares that hash code to the hash code of its version of the file. If they’re different, the server transmits the new header graphic file; otherwise, it tells the browser to go ahead and use its local version. Sending just the hash code takes much less network bandwidth than sending the whole graphic, making pages load faster and reducing the overall load on the network.

The Litigation Support and Computer Forensics industries use hash codes, too. Rather than processing every file on a custodian’s hard drive, savvy practitioners skip over duplicate files, useless files or malicious files by computing the hash code of each file on the disk and comparing it against lists of hash codes of files known to be useless, malicious or already processed. Because the hash codes are small, the lists are easy to manipulate and fast to search.

For all this to work properly, the hash function needs to have several important characteristics:

  • It needs to be fast to compute the hash code of a file (otherwise it would defeat the purpose of being able to do quick comparisons)
  • The hash code of a file needs to change if you change the file, even a little bit (otherwise the web browser wouldn’t know to download the new version)
  • It needs to be extremely difficult to create a file with a specific hash or to create two files with the same hash (called a “collision”).


The last of those is particularly important for electronic file processing, because it wouldn’t do for the critical document in a case to be removed because it accidentally happened to have the same hash code as a standard Windows library file or some non-critical document already processed. Neither would we want a virus writer to be able to manipulate a file to contain a virus, but have the hash code of a different file known to be safe.

Fortunately, over the years a lot of hash functions have been developed and thoroughly tested. One of these hash functions is called MD5, developed in 1991 by MIT professor Ron Rivest. This hash function became extremely popular with the growth of the internet, the proliferation of electronic files, and the widespread use of hashing techniques for various purposes. In fact MD5 became so pervasive that some people in the Litigation Support industry use it as a synonym for term hash code.

Unfortunately, things change. Computers get faster and MD5 has reached the end of its useful life in this industry. In 1996 it was shown that it was theoretically possible to create two files with identical MD5 hash values (a collision) and in 2007 a group of researchers described how to do it. A recent article in Technology Review magazine describes one example of this.

The good news, of course, is that there are many other hash functions to choose from. Several years ago Valora switched to the SHA-1 hash function, designed by the National Security Agency and published as FIPS 180. SHA-1 is similar to MD5, but uses a longer hash code and is orders of magnitude less vulnerable to collisions. Those orders of magnitude don’t mean SHA-1 will be useful forever. Indeed, it’s been shown that SHA-1 is vulnerable to attacks similar to the ones used to bring down MD5, but it will be a while before the necessary computing resources are available. By then we’ll have moved on to other hash functions appropriate to the computing power of the day. But the industry needs to keep moving forward with technology, so when the time comes to switch functions, this discussion doesn’t need to be rehashed.


-Aaron


Aaron Goodisman is a software industry veteran with over 20 years experience in engineering management, software architecture, and product development. Prior to founding Valora Technologies, Mr. Goodisman served as Vice President of Engineering at SilverStream Software, acting as both manager and visionary for this award-winning application server product and its associated development and deployment tools.

Mr. Goodisman received his undergraduate and Master's degrees in Computer Science from MIT and is considered a world expert in Java industry standards and UI design. He is a frequent industry speaker and has authored several articles for industry publications. Mr. Goodisman is named as the inventor on several U.S. patents and currently pending patent applications.