I'll start by saying that this overview isn't for everyone. It's intended for those with a good programming background who have, ideally, worked with imaging a bit. Even if you haven't worked with images and pixel manipulation, this may be the answer to a prayer for anyone asking themselves, "How in the world do I even start to break this thing?!". Keep in mind, though, that when advanced warping techniques are used, a captcha can become almost impossible to break. All that means, however, is that it's -almost- impossible, not impossible ;)
So, what's the purpose of breaking a captcha image? The reasons vary, but most of the time it's to let a bot automate some process (exactly what captcha images are meant to prevent). For example, say the registration page in the Real 1 challenge at "Uncle Arnold's Local Band Review" used a captcha image. We know from the challenge that we have to get the band "Raging Inferno" to the top. In a real-world situation that didn't have the same security flaws as the Real 1 challenge, we could register hundreds of bots that simply vote the band up to the top, and to do that we'd have to break the captcha image at registration.
Remember, though, captcha images are never universal; every site has its own specialized captcha, so there's no simple "global" fix for all of them. That said, once you've written the code it's easy to carry it over into another captcha-breaking project.
This overview is meant to lay the groundwork so that you can break captcha images more easily in the future. You can use virtually any language, but I recommend C/C++ or C# purely for speed. One of these examples I've done in PHP and it works quite well, though it runs slower than most.
Now let's begin our overview of captcha breaking!
[Step 1: Analyze and Prepare]
This is more of a step that you would take after you have read this entire overview, however, I'll fill you in on it now. When starting to break a captcha, look it over, refresh it several times, and find all aspects of the captcha. Does it use different fonts? Does the background change? Is there a background image? Does the text change from bold to italics? Does the text move around on the image? Is the text a completely different color than the image? What characters/charset does it use? Is it case sensitive? These questions and more are all things you must ask yourself and analyze while looking at the different variations of the captcha image.
Now that we've got a good idea of what's what, we need to start the breaking process. This just depends on what language you want to use, but make sure you have a way to open the image in your language and read all the bits into an array, whether by looping through all the pixels and putting them into an RGB array or by using a function like LockBits or GetDIBits. This part is essential to being able to work with the image. Never try to manipulate the image using single-pixel functions that get or set the color of an individual pixel; they usually take an extreme amount of time to perform simple tasks. The only time you'd ever use them is when reading the pixels into the array in the first place. Okay, now that you've got the general idea, on to Step 2!
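To make the "read everything into an array up front" point concrete, here's a minimal Python sketch. It assumes you've already pulled the raw pixel buffer out of the image (the equivalent of what LockBits or GetDIBits hands you), laid out as tightly packed 24-bit RGB triplets in row-major order; that layout is an assumption for illustration, not a rule.

```python
# Sketch: convert a raw 24-bit RGB pixel buffer into a 2D array in one
# pass, instead of calling a slow per-pixel GetPixel-style function.
# Assumes tightly packed RGB triplets, row-major order.
def buffer_to_rgb_array(buf, width, height):
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            i = (y * width + x) * 3
            row.append((buf[i], buf[i + 1], buf[i + 2]))
        rows.append(row)
    return rows

# A 2x2 test image: red, green / blue, white
raw = bytes([255, 0, 0,   0, 255, 0,
             0, 0, 255,   255, 255, 255])
pixels = buffer_to_rgb_array(raw, 2, 2)
```

From here on, every filter works against this in-memory array; you only touch the image object again when you want to write the result back out.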
[Step 2: Get rid of the crap!]
A lot of people who write captcha images like to think they're very crafty and cunning with the garbage they put in to throw you off. Here's a big morale booster... 99% of the time it's just that: crap. You can easily write image filters to go through and wipe out the junk.
Looking for ways to get rid of garbage often means looking for patterns in the image. You have to think hard about what you can and cannot use against them. For example, say you come across a captcha image that has black text, but unfortunately it has an image in the background. How do we separate the text from the image? Simple: write a filter that keeps only black and colors close to it (when an image is saved as JPG, not all colors will be perfect, so you have to account for some variation in color). By filtering out all pixels that aren't close to black, we're left with just the text. One way of thinking is to ask yourself, "How is it possible that I can read this? How can I distinguish the text from the garbage and noise?" A lot of the time these questions will bring you to the answer. Let's look at some examples.
Now, start by asking yourself what you notice in this image. Is it the dark text that jumps out at you? How about the light background? Both of those we can use to our advantage. Now what about those lines? For now, we'll deal with those after we get rid of the background. So we think we have an idea of how to break it... but what happens if they throw something like this at us?
The text is barely visible! Not to mention the amount of noise cluttering up the screen. Let's think about this: how is it possible that we can read this? Simple, the text is still slightly darker than the background. So we'll write our filter to turn all pixels darker than a certain threshold to black, and all pixels lighter than it to white. I find that when working with captcha images, it's really nice to convert them to monochrome, since monochrome is just black and white. You can then use a simple 2-dimensional array for the width and height, with just 0 and 1 for black and white. Here's our result:
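The thresholding pass just described can be sketched in a few lines of Python. Following the convention above, 0 is black and 1 is white; the cutoff of 128 is just a placeholder you'd tune per captcha.

```python
# Sketch: threshold a grayscale image (0-255 values) to monochrome.
# Convention: 0 = black, 1 = white. The cutoff is per-captcha tuning;
# 128 is only a starting guess.
def to_monochrome(gray, cutoff=128):
    return [[0 if v < cutoff else 1 for v in row] for row in gray]

gray = [[ 30, 200,  40],
        [250,  35, 220]]
mono = to_monochrome(gray)
```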
Wow, now the text sure stands out! But what about that annoying background noise? Notice how it looks like there are very distinct horizontal lines. If you look at both of the original images very closely, you'll see they aren't lines, but rows of dots! Getting rid of these is simple: scan the image for a pixel that's white, then a pixel that's black, then another pixel that's white again. By scanning for that pattern, we can find and isolate the dots. Since it's actually both columns and rows of dots, we'll do a 2-way filter: one pass that looks for dots going up and down, and another going left and right. Pseudocode for the left-right pass would look like this:
if (Pixel[x, y] == 1 && Pixel[x + 1, y] == 1 == false && Pixel[x + 2, y] == 1)
Then we have a dot in the middle! We could also add another check that swaps the black and white values to scan for white dots, but we don't need that here. The same can be done for scanning up and down, by adding 1 and 2 to the y instead of the x. The last part of our code is to set the middle dot to white. Here's what we've got now:
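The two-way dot filter just described could look like this in Python, again with 0 as black and 1 as white. Each pass looks for the white-black-white pattern and sets the middle dot to white:

```python
# Sketch: remove isolated dots by scanning for white-black-white runs,
# first left-right, then up-down. Convention: 0 = black, 1 = white.
def remove_dots(img):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    # Left-right pass: white, black, white -> clear the middle dot.
    for y in range(h):
        for x in range(w - 2):
            if img[y][x] == 1 and img[y][x + 1] == 0 and img[y][x + 2] == 1:
                out[y][x + 1] = 1
    # Up-down pass: the same pattern along the y axis.
    for y in range(h - 2):
        for x in range(w):
            if img[y][x] == 1 and img[y + 1][x] == 0 and img[y + 2][x] == 1:
                out[y + 1][x] = 1
    return out

noisy = [[1, 1, 1],
         [1, 0, 1],
         [1, 1, 1]]
clean = remove_dots(noisy)
```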
Much better: we've eliminated the majority of the background and some parts of those random black lines. A big hint on what to do next is that you can use the same filter, or one very close to it, to remove these black lines. If we write something that looks for individual black pixels touching no more than 3 other black pixels (there are 8 pixels surrounding any single pixel that isn't on the border of the image), we can eliminate almost all of the noise.
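A sketch of that neighbor-count filter in Python: a black pixel with 3 or fewer black neighbors among the 8 surrounding pixels is treated as noise and turned white. The threshold of 3 follows the text; border pixels are skipped here for simplicity.

```python
# Sketch: erase black pixels that have too few black neighbors.
# Convention: 0 = black, 1 = white. Interior pixels only.
def remove_isolated(img, max_neighbors=3):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if img[y][x] != 0:
                continue
            # Count black pixels among the 8 surrounding neighbors.
            black = sum(1 for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                        if (dy, dx) != (0, 0) and img[y + dy][x + dx] == 0)
            if black <= max_neighbors:
                out[y][x] = 1  # not enough support: treat as noise

    return out

speck = [[1] * 5 for _ in range(5)]
speck[2][2] = 0          # a lone black pixel in a white field
cleaned = remove_isolated(speck)
```

Pushing `max_neighbors` higher eats into the text itself, which is exactly the trade-off the next paragraph warns about.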
Now that's looking really good. Unfortunately, this is about as far as the above filter can go: if we push it any further and, let's say, eliminate pixels that aren't touching more than 5 or 6 black pixels, we'll start eating away too much of the text. Keeping the text close to its original look is key when cracking captcha images. What we're going to do now is a method I came up with that uses flood-filling to eliminate random garbage. If you're going for top performance you can write your own FloodFill function, or you can find GD libraries that include one. PHP, for example, has the function "imagefilltoborder", which is exactly what I want. I also wrote a performance version of this same application in C#, for which I wrote my own FloodFill function. So how are we going to use FloodFill to eliminate garbage? If we look at the image we have now, we notice that all the garbage comes in really small pieces, while the text is very thick and large. That gives us an advantage: we can simply go through every black pixel, run a FloodFill on it, count the number of pixels that got filled, and throw the region out if the count is less than a certain amount. The smaller pieces of garbage will usually have a pixel count of 20 or less, so we write our function to get rid of anything that fits. You may not even need this step, but if you do use it, the pixel count will have to be adjusted based on your image and how much garbage you have. After we run this new filter, our image looks like this:
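Here's a minimal Python version of that blob-size filter: an iterative flood fill collects each black region, and regions smaller than the cutoff get wiped to white. The cutoff of 20 from the text is only a starting point.

```python
# Sketch: flood-fill every black blob and erase the ones that are too
# small to be part of a letter. Convention: 0 = black, 1 = white.
def flood_fill(img, x, y, visited):
    """Collect the coordinates of the black blob containing (x, y)."""
    h, w = len(img), len(img[0])
    stack, blob = [(x, y)], []
    while stack:
        cx, cy = stack.pop()
        if not (0 <= cx < w and 0 <= cy < h):
            continue
        if (cx, cy) in visited or img[cy][cx] != 0:
            continue
        visited.add((cx, cy))
        blob.append((cx, cy))
        stack.extend([(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)])
    return blob

def remove_small_blobs(img, min_size=20):
    out = [row[:] for row in img]
    visited = set()
    for y in range(len(img)):
        for x in range(len(img[0])):
            if img[y][x] == 0 and (x, y) not in visited:
                blob = flood_fill(img, x, y, visited)
                if len(blob) < min_size:
                    for bx, by in blob:   # too small: wipe it out
                        out[by][bx] = 1
    return out

img = [[1, 1, 1, 1, 1, 1],
       [1, 0, 1, 0, 0, 1],   # a 1-pixel speck and a 4-pixel blob
       [1, 1, 1, 0, 0, 1],
       [1, 1, 1, 1, 1, 1]]
cleaned = remove_small_blobs(img, min_size=3)
```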
Alright! Now just to let you know, depending on the captcha, not all the junk needs to be filtered out. This will also depend on the method you choose in Step 3.
[Step 3: Define our letters]
The third step is usually easier than the second. Whereas before we were just cleaning the image up, now we're going to actually define where our letters are on the image. Lucky for us, the letters are still there and pretty thick, so how should we do this? Here are our options:
Method 1: Break the letters into individual cells
Method 2: Create a bounding box around our letters that will be used as a scanning area.
The advantages of the first method are that it's quick and fairly painless to break up a captcha when you have a nice thick font, and that it's much faster in Step 4 (you'll see why). The disadvantage of breaking letters up shows when the captcha uses thin, small fonts that could be broken by the previous filters, or when two letters end up connected because the previous filters weren't good enough to destroy all the lines between them.
The advantages of the second method are that it doesn't require extra compensation and image checks for connected or broken-up letters, and that it easily handles small and thin fonts. The disadvantages are that it takes much more processing power and a much longer time.
Let's look at the how-to: we can break up the letters in this captcha using the same FloodFill method we used above to eliminate noise, but this time looking for blocks of black with more than 80 pixels or so (based on how thick the letters are). You might ask, "Why did we have to eliminate the garbage with that last FloodFill filter if we're just going to use it again to grab the text?" The answer is that you don't have to, since none of the small garbage we eliminated would have been touching the text. To summarize what we're doing here: the letters get filled in by our function, and because so much got filled in, the program identifies each one as a character, puts it into its own image cell, then moves on to the next until all 4 are in individual cells.
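A sketch of that cell-splitting pass in Python: flood-fill each black blob, keep the ones with at least the minimum pixel count, and return a bounding box per letter, ordered left to right. The 80-pixel cutoff from the text is a tuning parameter; the tiny example below uses 4 so it fits in a few lines.

```python
# Sketch: split letters into cells via flood fill (Method 1).
# Convention: 0 = black, 1 = white.
def find_letter_cells(img, min_pixels=80):
    """Return one bounding box (left, top, right, bottom) per black blob
    with at least min_pixels pixels, ordered left to right."""
    h, w = len(img), len(img[0])
    visited = set()
    cells = []
    for y in range(h):
        for x in range(w):
            if img[y][x] != 0 or (x, y) in visited:
                continue
            # Iterative flood fill to collect this blob.
            stack, blob = [(x, y)], []
            while stack:
                cx, cy = stack.pop()
                if not (0 <= cx < w and 0 <= cy < h):
                    continue
                if (cx, cy) in visited or img[cy][cx] != 0:
                    continue
                visited.add((cx, cy))
                blob.append((cx, cy))
                stack.extend([(cx + 1, cy), (cx - 1, cy),
                              (cx, cy + 1), (cx, cy - 1)])
            if len(blob) >= min_pixels:
                xs = [p[0] for p in blob]
                ys = [p[1] for p in blob]
                cells.append((min(xs), min(ys), max(xs), max(ys)))
    return sorted(cells)   # left to right

img = [[0, 0, 1, 0, 0],    # two 2x2 "letters" with a gap between them
       [0, 0, 1, 0, 0]]
cells = find_letter_cells(img, min_pixels=4)
```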
For our second method we'll again use something similar to the FloodFill, but this time we need to have eliminated most if not all of the garbage first. We do a simple FloodFill scan to find out where the majority of the black pixels are (which should be where the text is), then find the leftmost, rightmost, uppermost, and lowermost borders. This creates a box around our text. It's always a good idea to expand this box a few pixels, say 4, just in case one of our garbage filters took off a thin layer of the text. Now that we've got our region identified or our characters in cells, on to Step 4!
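The bounding-box step for Method 2 is short enough to sketch directly: take the extremes of all black pixels, pad by a few pixels, and clamp to the image edges. The pad of 4 follows the suggestion in the text.

```python
# Sketch: bounding box around all black pixels, expanded by a small
# pad and clamped to the image. Convention: 0 = black, 1 = white.
def text_bounding_box(img, pad=4):
    h, w = len(img), len(img[0])
    xs = [x for y in range(h) for x in range(w) if img[y][x] == 0]
    ys = [y for y in range(h) for x in range(w) if img[y][x] == 0]
    # Expand a few pixels in case a garbage filter shaved off a thin
    # layer of the text, and clamp to the image edges.
    return (max(min(xs) - pad, 0), max(min(ys) - pad, 0),
            min(max(xs) + pad, w - 1), min(max(ys) + pad, h - 1))

img = [[1] * 10 for _ in range(10)]
img[5][5] = 0
img[5][6] = 0
box = text_bounding_box(img)
```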
[Step 4: LERN TO REED!]
The title of Step 4 may sound condescending, but for this part you actually have to make your program "read" the text. There are a few methods of doing this. The simplest is to build up a character set and scan the letters you have against the charset; whether you do this by comparing black pixels, overall pixels, or what have you is your choice. The other option is to build a point profile for each letter and compare it against a pre-made set of point profiles for the entire character set. For now, I'm going to stick with comparing using pixels.
If you wondered why we said Method 2 would be a lot more intensive on your computer, the simple answer is this: in order to read the captcha, you have to loop through every letter in your charset. On top of that, you have to loop through the entire region we set up earlier, and on top of that you have to do the individual scan that compares the character with the image. So there's a scan covering however many pixels the character is, repeated at every position in the region where it could possibly be, repeated for every character in the charset; now you understand why it takes a long time. With Method 2, after all the scanning is complete, the top 4 matches (or however many letters your captcha image has) are chosen and their corresponding letters are output as a string.
With Method 1, the only loop you have to do is over the characters in the charset, scanning each one against the cell image to see if it's a match. Once you find the characters that match, you simply output their corresponding letters.
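A minimal Python sketch of the Method 1 comparison: each charset entry is a 0/1 template the same size as the cell, and the letter whose template agrees with the cell on the most pixels wins. The tiny 3x3 "charset" below is purely illustrative.

```python
# Sketch: match a letter cell against a template charset by counting
# agreeing pixels. Convention: 0 = black, 1 = white.
def match_letter(cell, charset):
    """charset maps a letter to a same-sized 0/1 template; the letter
    whose template agrees with the cell on the most pixels wins."""
    def score(template):
        return sum(1 for row_c, row_t in zip(cell, template)
                   for a, b in zip(row_c, row_t) if a == b)
    return max(charset, key=lambda letter: score(charset[letter]))

charset = {
    'I': [[1, 0, 1],
          [1, 0, 1],
          [1, 0, 1]],
    '-': [[1, 1, 1],
          [0, 0, 0],
          [1, 1, 1]],
}
cell = [[1, 0, 1],
        [1, 0, 1],
        [1, 0, 0]]   # an 'I' with one noisy pixel
letter = match_letter(cell, charset)
```

A real charset would be built from cleaned-up samples of the actual captcha's font, one template per character it can produce.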
Unfortunately, since there's no real "imagery" done at this point, I have nothing to show you, but rest assured it works very well!
[Step 5: Complications with Step 4]
This last step is only for the really hard-to-break captchas. Say we have letters that are rotated or distorted in some way. Rotated letters can be fixed by finding a way to "un-rotate" them; as you'll see in an example below, I "un-rotated" the letters by finding the rotation with the least width. Distorted letters are another case, since it's hard to undo a distortion. I've personally never attempted it, but some of the simpler distortions, such as ones that use a sine wave or ones that simply stretch the text more toward the end, seem like they could be reversed with the right tweaks.
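The "rotation with the least width" idea can be sketched as a brute-force search: rotate the letter's black pixels through a range of candidate angles and keep the angle that produces the narrowest result. Representing the letter as a list of (x, y) points is a simplification for illustration; a real version would rotate the cell image itself.

```python
import math

# Sketch: un-rotate a letter by brute-force searching for the rotation
# angle that minimizes its width. Points are (x, y) black-pixel coords.
def width_at_angle(points, degrees):
    """Width of the point set after rotating it by the given angle."""
    t = math.radians(degrees)
    xs = [x * math.cos(t) - y * math.sin(t) for x, y in points]
    return max(xs) - min(xs)

def best_unrotation(points, lo=-45, hi=45, step=5):
    # The +/-45 degree range avoids turning letters upside-down.
    return min(range(lo, hi + 1, step),
               key=lambda deg: width_at_angle(points, deg))

# A letter stroke slanted at 45 degrees straightens out at +45.
stroke = [(t, t) for t in range(10)]
angle = best_unrotation(stroke)
```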
Now that we've established the ground rules for breaking captchas, take a look at a few more examples and see what you can think of:
The captcha from above in the process of breaking:
A captcha from Rapidshare:
Breaking down the Rapidshare Captcha:
The result after our filters and the identification of the character cells:
And lastly the compact version:
Now go back and look at the original Rapidshare captcha before it was broken, and think about how you would go about breaking it. Here, I simply noticed that the background noise text was thinner than the main text. So I wrote a filter to thin down the walls of all the text about 8 times, until it completely eliminated the background text. That left me with very thin text, so I built it back up again by putting 8 layers on top of it, only filling in those layers where there was black in the original image. Then I separated the letters into cells, rotated each one within a 45-degree range both CW and CCW (to avoid going upside-down), and kept the rotation with the minimum width. After that, I built a character set based on the letters that were left, and that was all it needed.
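The thin-then-rebuild trick can be sketched as one erosion pass followed by one constrained dilation pass; the article applies 8 of each, but the mechanics are the same. Using 4-neighbors here is a simplification for illustration.

```python
# Sketch: one thinning pass and one rebuild pass for the "thin away the
# noise text, then grow the main text back" trick.
# Convention: 0 = black, 1 = white.
def erode(img):
    """A black pixel survives only if all four of its edge neighbors
    are also black, so thin strokes vanish first."""
    h, w = len(img), len(img[0])
    out = [[1] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if (img[y][x] == 0 and img[y - 1][x] == 0 and img[y + 1][x] == 0
                    and img[y][x - 1] == 0 and img[y][x + 1] == 0):
                out[y][x] = 0
    return out

def dilate_within(img, original):
    """Grow the thinned text back out, but only onto pixels that were
    black in the original image, so the noise stays gone."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 0:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and original[ny][nx] == 0:
                    out[ny][nx] = 0
    return out

img = [[1, 1, 1, 1, 1, 1, 1],   # a thick 3x3 block and a thin line
       [1, 0, 0, 0, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1],
       [1, 1, 1, 1, 1, 1, 1]]
thinned = erode(img)             # the thin line is gone, the block core survives
rebuilt = dilate_within(thinned, img)
```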
Let's look at some more examples. Try to spot what you see as a vulnerability, then read the list of what I noticed and compare:
1. Only black and white, no need to write filters to differentiate
2. Dots are easily removed
3. The text is thicker than the dots, so the dots can be filtered out easily.
4. They only use numbers!
5. The text is centered, and the letters come from a monospaced font, meaning every letter will be in the same place.
1. No attempt at background noise whatsoever
2. Different colors do nothing to thwart programs, since we can easily turn the text black.
3. If you could actually see the captcha generation and all its different variations, you'd see there is no variation whatsoever except the letters used, which makes for a very simple crack.
1. All the letters are the same color, which makes it easy to pull the text from the background.
That's about the only thing this person did wrong with their captcha. The letters are spaced unevenly apart from each other, though you could call the spacing itself a vulnerability, since it makes it easy to distinguish which letter is which (no connecting letters). The letters are rotated, a different font is used per letter, and some are bolded! This is an example of an extremely well-made captcha image, but as well made as it is, with enough programming it is still breakable.
[Step 6: From a hacking standpoint]
One last thing about captcha images: there's always a chance that you can exploit the server alongside your captcha-breaking program. For example, right here on HTS they use this captcha-generating script:
If you notice, this script seems to generate the same captcha text with every refresh. The only time it chooses a new set of letters is when the register page is refreshed and the session variable holding the captcha text is updated. This in itself is a vulnerability: you could write a program to break, say, 5 of those, and if your program had trouble breaking one of them, it could check against the other 4 to find the best-guess answer.
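The best-guess trick can be sketched as a per-position majority vote across several reads. This assumes, as described, that the server keeps serving the same hidden text until the register page refreshes; the sample strings are invented for illustration.

```python
from collections import Counter

# Sketch: combine several (possibly imperfect) reads of the same
# captcha text with a per-position majority vote.
def best_guess(reads):
    return ''.join(Counter(chars).most_common(1)[0][0]
                   for chars in zip(*reads))

# Five reads of the same captcha; three contain one misread character.
reads = ['K7PQ', 'K7PO', 'K7PQ', 'X7PQ', 'K1PQ']
guess = best_guess(reads)
```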
There is also the possibility, though slim, that you can reuse the session ID or captcha ID of a captcha that's already been submitted. Say you enter the text for a captcha and it validates, and you notice that a session ID is attached to the HTML form you just submitted. By modifying future session IDs to match that one, there's a chance you could trick the server into thinking you're entering the text from the new captcha, when in fact it was a captcha that had already been shown.
I hope I've done a well enough job enlightening you on the general concepts for breaking a captcha image. Now go forth man, and break something!
( Preferably a captcha :P )
HackThisSite is the collective work of the HackThisSite staff, licensed under a CC BY-NC license.
We ask that you inform us upon sharing or distributing.