"One of the best ways to get yourself a reputation as a dangerous citizen these days is to go about repeating the very phrases which our founding fathers used in the great struggle for independence." --Charles Austin Beard
Last time, I talked about brute force attacks and how efforts to reduce the keyspace of a given algorithm can greatly reduce your computational time (link). This article will discuss hash collisions and the probability of collisions occuring via the birthday paradox. Though this doesnt inherently provide you with any more tools or methods in terms of cracking hashes, it does provide the theory behind generating hash collisions which can severely compromise the security of an algorithm.
Hash Collisions
To quickly review, cryptographic hash functions work by taking in a string (i.e., your password), performing some mathematical operations on it, and spitting out hexadecimal garbage that is utterly uninformative to anyone. If the algorithm is well-designed, the mathematical operations will be one-way; that is, it is either impossible or so computationally impractical to take a hash and run it backwards through the algorithm to regurgitate the original string. The easiest way to get your desired password is to match the hash by testing every possible password and seeing if the hashes match, otherwise known as brute-forcing. Unfortunately, this can take a long time and there are more efficient means at our disposal.
One method is to try and get the algorithm to generate what is known as a "hash collision". From the previous article, we showed that using the hashing algorithm from ExtBasic 11, passwords containing the same set of letters would always generate the same hash, regardless of what order the letters were in ("aab" was the same as "aba" was the same as "baa"). This is a hash collision, two or more separate strings creating an identical hash. Since real-world algorithms do not use such simplistic methods, collisions do not occur so easily, but they do occur. In fact, we know that they must occur!
Take the MD5 algorithm, for example. For every password that is entered, the algorithm will always return a hash that is 128-bits long, or a string of 32 hexadecimal characters. From that, we know that there are a finite number of hashes that can be generated, but the number of passwords entered can be infinite. To put this into perspective, there are 94 characters on a normal keyboard (52 capital and lower-case letters, 10 numbers, and 32 assorted symbols; as far as I know, there are no illegal characters in the MD5 algorithm). Assuming a password length of 20, we use our permutation equation from before:
CODE :
n^k = 16^32 = 2^128 = 3.4e38 possible hashes
94^20 = 2.9e39 possible passwords
So from this, we know that if we try every possible password with 20 characters or more, eventually we will generate a hash collision!
However, this is not really good news. To test every password at a speed of 6 million hashes per second (my laptops average hashing speed) would take 1.5e25 years, orders of magnitude longer than any timespan our brains could possibly conceive. Additionally, since we have to store each hash we compute so that we can check it against future ones and knowing each hash occupies 16 bytes, that means we would have to have so many 2TB HDD that if they were laid out in a grid, they would cover the entire Earths surface 1,100 miles deep. So what to do? Luckily, probability is on our side.
The Birthday Paradox
If you have ever taken a statistics and probability class, you have probably learned about the birthday paradox. It goes something like this: Take a room of 50 people. What are the odds that at least two people will have the same birthday?
Thinking about this quickly, there are 365 days in year (Leap year births not counted; they arent real people anyway), 50 people, so maybe 1 in 6, or about 17%? Wrong. In reality, the probability is a whopping 97%! This unbelievably high, surely there must be a mistake! Well, lets look at the math.
In a room of 50 people, were trying to match two birthdays. So taking any one birthday, there are 49 possible matches to be made. However, if we compare every birthday to every other birthday, our number of possible matches greatly increases and we can use our friend, the combination without replacement equation:
CODE :
50!
---------- = 1,225 combinations
2! (50-2)!
So to think about what this problem is asking in another way, if we had 2,450 people and we paired them off randomly with each other, what are the odds that a person would have the same birthday as the person he was paired with? With 1,225 pairs, now it doesnt seem so far fetched that theres a 97% chance there will be a match.
But how does this apply to hash collisions? Well, since we are simply trying to find to passwords with the same hash ("birthday") it turns out we dont have to calculate the heinous number of passwords we originally thought. The birthday paradox simplifies down to the following equation:
CODE :
k!
------------- = P
(k^n)(k - n)!
Where "k" is the maximum number of items were trying to match (birthdays, in the example), "n" is our sample space (number of people), and "P" is the probability of finding a match. Since factorial calculations rapidly exceed the allowable size of most calculators, the Taylor Series approximation is also useful (for those of us struggling to remember algebra, "e" is th exponential constant and is approximately 2.718):
CODE :
P = 1 - e^(-(n^2)/(2*k))
Solving for n:
n = sqrt(ln(1 - P)*-2*k)
Our "k" is determined by the maximum number of hashes for the algorithm, 3.4e38 as mentioned above. Alright, weve got our equation, now we just need to plug and chug. Since k is fixed, we just need to figure out what is an acceptable probability. I think 99% gives us pretty good odds, so well use P = 0.99. Plugging it into our equation, we discover that we only need to calculate 5.98e19 hashes for a 99% chance of generating a hash collision,less than one trillionth of a percent of our original keyspace. Note that this still presents a formidable logistic challenge, but the computing power is within reach for someone on a relatively modest budget; a cluster of 4 PlayStation 3s could crunch that many hashes in just over 8 months.
Conclusion
It is important to note that finding two passwords with the same hash is NOT the same as finding two passwords that will generate a specific hash. That is, if you have a specific hash, your best bet is to still go with a traditional brute force attack because the birthday problem only applies to matching ANY two passwords with the same hash, not just the one youre looking for. However, if we are able to generate a hash collision, we can start analyzing the calculations performed to generate the hash and, hopefully, figure why the collision occurred and manipulate that information to our advantage.
To that end, a distributed computing project in 2004 known as MD5CRK did that very thing and successfully found collisions within the MD5 algorithm. They were able to manipulate the algorithm and generate matching hashes with considerably less computing time than with a brute force attack. Improvements upon their work have resulted in astounding vulnerabilities in the MD5 algorithm. In 2006, a Czech cryptologist named Vlastimil Klima published an algorithm that was able to generate an MD5 hash collision, on average, in 17 seconds using a 3.4 GHz Pentium 4 (link, PDF). Despite these severe vulnerabilities, MD5 remains one of the most popular cryptographic hash functions in use.
Cast your vote on this article 10 - Highest, 1 - Lowest
Very interesting, sorta reminds me of something I read in a book a while back about false-positives, where something that has a certain degree of accuracy can be utterly inaccurate with the margin of error.
/me likes it, 10*
raddy1313 - 06:03 am Wednesday February 17th, 2010
I just noticed an error...in the third to last paragraph it says: "Our 'n' is determined by the maximum number of hashes for the algorithm."
That should read: "Our 'k' is determined..." Sorry for any confusion!
Once again you managed a great article. Please keep them coming. I enjoy learning new things. I was able to keep up with most of the math until we hit an extremely complex equation, but I do not blame myself, I am still in high school and have only taken a single class covering that subject for but a few weeks. Thanks for the great info, this really helped me understand not only how hashes work, but how we can put them to work for our benefit.
10*
Thanks, I'm really glad you're liking these articles. I'm constantly learning new stuff about crypto, so I should have a steady supply of information coming for awhile. Thanks again!
Also, if you have questions about the math, post them or PM them to me. More than likely, you're not the only person so I may write a short article detailing commonly used math operators and functions.
I think the new standard is going to be the SHA-2 family sometime this year, at least for the U.S. government. Also, the SHA-3 algorithms are going to be in development via a public competition. Just some random information, thought it was interesting that despite the government migrating, the public uses the potentially insecure MD5 generally.
Looking forward to more articles like this one. (and hopefully one on advanced mathematical subjects for people like me whose brains might explode soon)
this is one article that actually has a great explanation on all levels, complex enough to convey perfect and accurate info, but still explained enough were i can get it :P. 11 if i could :)
Awesome, the first real use I've seen for that sort of math I'm learning at school.
And when you do generate a successful hash collision, does that then allow you to try and work out the original hash algorithms? (im n00b)
HackThisSite is is the collective work of the HackThisSite staff, licensed under a CC BY-NC license.
We ask that you inform us upon sharing or distributing.
Very interesting, sorta reminds me of something I read in a book a while back about false-positives, where something that has a certain degree of accuracy can be utterly inaccurate with the margin of error.
/me likes it, 10*