June 6, 2012
As you have no doubt heard, 6 and a half million hashed passwords have been leaked from LinkedIn. Also, those password hashes were unsalted, so if you use your LinkedIn password in other places, change it everywhere. In a word, oops.
There is a 118MB file floating around (which shouldn't be too hard to find if you look for it) that contains the hashes. The file does not contain the usernames that match the passwords. LinkedIn also apparently has over 150 million accounts. On the surface, it doesn't seem like a huge deal, but appearances can be deceptive.
I, like a lot of other people, believe that whoever compromised the database is almost certainly sitting on the rest of the account information. There are several ways that a database dump like this could be acquired, and none of the likely scenarios include a column-level dump of just passwords.
In addition, early analysis of the hash dump indicates that the list has already been deduplicated, so we can't be certain how many accounts these passwords represent (since you KNOW a bunch of idiots had password123 or LinkedIn1). Several people on forums and Twitter have tested the hash of their password against the file and not found it, so it seems that not all of the accounts are represented. This could mean that the cracker is sitting on multiple files, or that the source of the leak was incomplete. We should assume the worst case, which is that all of the accounts are compromised.
So let's take a look at the technical details of what's going on, why it's bad, how it could have been prevented, and so on.
What is floating around is the SHA-1 hash of people's passwords. SHA-1 is a cryptographic hash function, and takes an input up to a certain size (in SHA-1's case, almost 2 exabytes) and returns a fixed-length output (in this case, 20 bytes or 40 hex characters). The idea is that you can take any string, run it through SHA-1, and get the same resulting string every time.
Let's play with it just a little bit. Most Linux and Unix distributions (including Mac OS X) have the 'shasum' command. It reads standard input and spits out a SHA-1 hash. Let's try to make a hash of the string "Matt Simmons":
$ echo -n "Matt Simmons" | shasum
If you ran the exact same command (and don't forget the -n, or you'll introduce a newline at the end of the string), you got the exact same output, and that's the magic of a one-way hash function like this. If you took the string "e7d16b6607cbf9d51cd99122b092d96f8dc99c3d", you would never find an operation on it that gets you the string "Matt Simmons" (or at least, you couldn't find one that would work systematically on all SHA-1 hashes).
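If you'd rather poke at this from a script than from the shell, here is a minimal sketch using Python's standard hashlib module. The helper name sha1_hex is my own invention for illustration; it just shows that the digest is deterministic, always 40 hex characters, and sensitive to a trailing newline.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """Return the SHA-1 digest of text (UTF-8 encoded) as 40 hex characters."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Same input, same output, every time.
first = sha1_hex("Matt Simmons")
second = sha1_hex("Matt Simmons")
print(first == second)   # True: hashing is deterministic
print(len(first))        # 40 hex characters, i.e. 20 bytes

# A trailing newline is part of the input, so it changes the digest.
# This is the same trap as forgetting -n on echo.
print(sha1_hex("Matt Simmons\n") == first)   # False
```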
Because the operation is one-way, and you can't get the input from the output, it is useful for password operations. The concept is that the user sets their password, and you store the SHA-1 hash of it in the database. Each time the user logs in, you hash whatever string they provide, and if the result matches the SHA-1 hash stored in the database, then you know, for all practical purposes, that the user logging in has provided the same password that was provided when the password was set.
Incidentally, there is such a thing as a "hash collision", which is where two different plaintexts produce the same hash. Collisions have to exist, since there are far more possible inputs than there are 160-bit outputs, and there has been theoretical work showing how to find one faster than brute force. It is still statistically very, very unlikely to happen by accident (the most optimistic attack estimates involve something like 2^51 hash operations), and finding a collision is a different problem from recovering a password from its hash.
OK, so you can't get the password out of the hash...so why is everyone worked up about this whole leaked password thing? It's because now ANYONE can do exactly what we just did. If anyone on LinkedIn has the password "Matt Simmons", then the hashed string "e7d16b6607cbf9d51cd99122b092d96f8dc99c3d" appears in the file. Don't believe me? Let's check it out.
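The store-then-compare flow described above can be sketched in a few lines of Python. This is only an illustration of the plain unsalted scheme, not a recommendation, and the function names hash_password and check_login are hypothetical.

```python
import hashlib
import hmac


def hash_password(password: str) -> str:
    """What gets stored in the database: the SHA-1 hex digest, never the password."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


def check_login(stored_hash: str, attempt: str) -> bool:
    """Hash the login attempt and compare it to the stored digest."""
    # compare_digest avoids leaking timing information during the comparison.
    return hmac.compare_digest(stored_hash, hash_password(attempt))


stored = hash_password("correct horse")
print(check_login(stored, "correct horse"))   # True
print(check_login(stored, "wrong guess"))     # False
```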
I have, on occasion, looked through password dictionaries before, and one of the stranger things I've found is how often people use the word "penny" in their passwords. I don't think I've ever seen a password list that doesn't include it. Weird, but it always occurs to me to check. I'm going to slightly modify this very succinct command line from Hacker News member jgrahamc:
$ grep -n `echo -n penny1 | shasum | cut -c6-40` combo_not.txt
Let me explain that cut command in there... When you run shasum on "penny1", you get a hashed string '9507f2bf284be7aaf16530f433d5f4db8939fdbf'. The command line above has cut the first 5 characters off because, as you can see in the line we get returned, the first five characters of the hash are 0. This isn't an accident - what people have observed is that the easiest passwords have already been cracked, and the cracker has marked them as being cracked by making the first 5 characters 0. Whenever we check something against the file, we have to disregard the first 5 characters, in case they've already figured that one out (which, in this case, they did).
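Here is roughly the same check as a Python sketch, in case grep and cut aren't your thing. It assumes the dump holds one 40-character hex hash per line, which is my assumption about the file layout, and find_in_dump is a made-up helper name. The demo runs against a tiny fake dump built in memory rather than the real file.

```python
import hashlib


def find_in_dump(candidate: str, dump_lines) -> list:
    """Return 1-based line numbers whose hash matches candidate.

    The first 5 hex characters are ignored on both sides, because
    already-cracked entries have had them overwritten with zeros.
    """
    tail = hashlib.sha1(candidate.encode("utf-8")).hexdigest()[5:]
    return [
        number
        for number, line in enumerate(dump_lines, start=1)
        if line.strip().lower()[5:] == tail
    ]


# Build a fake two-line dump: one unrelated hash, and the hash of
# "penny1" with its first 5 characters zeroed out.
unrelated = hashlib.sha1(b"something else").hexdigest()
marked = "00000" + hashlib.sha1(b"penny1").hexdigest()[5:]
print(find_in_dump("penny1", [unrelated, marked]))

# Against the real file it would look something like:
#   with open("combo_not.txt") as dump:
#       print(find_in_dump("penny1", dump))
```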
What is really annoying about what happened is that the passwords were not salted. Using a salt is like taking the user-submitted password and adding your own information to it, and then hashing THAT value and storing it in the table, instead of just the password. So in the case above, we were able to match a password because the hash was of "penny1", but if the hash were instead of, say, "penny1028317101397147812", we would have gotten a completely different hash. In order to crack the passwords, we'd have to have the salt AND the password. A salt also means that two users with the same password end up with different hashes, and that precomputed tables of common password hashes are useless, so every hash has to be attacked on its own, and that takes time.
The way UNIX used to store passwords was crypt, which is built on DES. That was eventually upgraded to an MD5-based scheme, which became the default on a lot of systems. SHA-1 is a step up from MD5. What do I mean when I say upgraded? I mean that it takes more work to generate a hash. Check this out:
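To make the salting idea concrete, here is a small Python sketch. It appends a random per-user salt to the password before hashing, the same way the example above tacks digits onto "penny1". The function name hash_with_salt and the 16-byte salt length are my own choices for illustration, not anything LinkedIn or any particular library does.

```python
import hashlib
import os


def hash_with_salt(password: str, salt: bytes = None):
    """Hash password + salt with SHA-1; generate a random salt if none is given.

    Returns (salt, hex_digest). Both have to be stored to verify a login later.
    """
    if salt is None:
        salt = os.urandom(16)   # assumed length, chosen for illustration
    digest = hashlib.sha1(password.encode("utf-8") + salt).hexdigest()
    return salt, digest


# Two users pick the same password but get different salts...
salt_a, hash_a = hash_with_salt("penny1")
salt_b, hash_b = hash_with_salt("penny1")
print(hash_a == hash_b)   # almost certainly False

# ...while re-hashing with the stored salt reproduces the stored hash.
_, again = hash_with_salt("penny1", salt_a)
print(again == hash_a)    # True
```

The point of the design is that an attacker can no longer hash "penny1" once and grep for it; they have to redo the work for every salt in the table.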
msimmons-mbp:~ bandman$ time for i in `seq 1 100` ; do echo "test-$i" | md5 >/dev/null ; done
msimmons-mbp:~ bandman$ time for i in `seq 1 100` ; do echo "test-$i" | shasum >/dev/null ; done
I ran that on my current-generation MacBook Pro. No one sane would ever crack passwords like this, but you can see the comparison in terms of speed: the SHA-1 loop takes about 25 times longer than the MD5 loop. Don't read too much into that number, though. The forking and piping are the same in both loops, but on OS X shasum is a Perl script while md5 is a compiled binary, so a lot of that gap is interpreter startup rather than hashing. SHA-1 does do more work per block than MD5, just not 25 times more.
Brute forcing even longer passwords is well within reach these days when the hash is fast and unsalted, particularly in a world where you can rent as much compute as you like from something like Amazon's EC2. Large salts stored separately from the hashes are really the way to go (although there are some, *ahem*, fervent advocates for bcrypt).
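If you want to compare just the two hash functions, with process startup taken out of the picture entirely, one option is to time them in-process. This is a rough sketch using Python's hashlib and timeit; the helper name time_hash and the round count are arbitrary, and the numbers you get will depend on your machine and Python build.

```python
import hashlib
import timeit


def time_hash(name: str, data: bytes, rounds: int = 100_000) -> float:
    """Seconds taken to hash data `rounds` times with the named algorithm."""
    return timeit.timeit(lambda: hashlib.new(name, data).digest(), number=rounds)


md5_seconds = time_hash("md5", b"test-1", rounds=10_000)
sha1_seconds = time_hash("sha1", b"test-1", rounds=10_000)

print(f"md5:  {md5_seconds:.4f}s")
print(f"sha1: {sha1_seconds:.4f}s")
print(f"sha1/md5 ratio: {sha1_seconds / md5_seconds:.2f}")
```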
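For a feel of the arithmetic behind brute forcing, here is a back-of-the-envelope sketch. The guess rate of one billion hashes per second is purely an assumed figure for illustration, not a measurement of any particular hardware, and both function names are made up. Plug in your own rate and alphabet.

```python
def keyspace(alphabet_size: int, length: int) -> int:
    """Number of distinct passwords of exactly `length` characters."""
    return alphabet_size ** length


def seconds_to_exhaust(alphabet_size: int, length: int, guesses_per_second: float) -> float:
    """Worst-case time to try every password of that length."""
    return keyspace(alphabet_size, length) / guesses_per_second


ASSUMED_RATE = 1_000_000_000   # hypothetical guesses per second
ALPHABET = 62                  # a-z, A-Z, 0-9

for length in (6, 8, 10):
    days = seconds_to_exhaust(ALPHABET, length, ASSUMED_RATE) / 86_400
    print(f"{length} characters: {keyspace(ALPHABET, length)} candidates, "
          f"about {days:.2f} days at the assumed rate")
```

Each extra character multiplies the work by the alphabet size, which is why length helps, and why a slow hash (which divides the guess rate) helps too.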
Defense in depth is the best policy that we, as users of services like this, can have. Don't just use good passwords, use a unique password on every site. I recently started using LastPass, and I've found it very, very handy. It generates long passwords, stores them securely in an encrypted container, and never decrypts the container on the server side.
I've also heard from a lot of people who like KeePass, and I've used PasswordSafe before, but the important part is to use individual passwords for different sites. Don't trust a site to keep your password safe for you, because you can't know how they're storing their stuff on the backend.