Sla.ckers.org
How robots and spiders cause issues, and how to stop them. We can also talk about CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart): their use, their compliance issues, porn proxies, PWNtcha, and other ways to defeat them.
Yahoo/Hotmail/Google CAPTCHA Extraction
Posted by: maluc
Date: March 12, 2008 02:17AM

When it comes time to test and tweak an algorithm, or just to figure out where it's best to begin, it's helpful to have a large sample size to work with. Below are several PHP scripts I used to extract and save 1000 CAPTCHA JPEGs or WAVs from the major email sites. They work very similarly, but each one has subtle changes in parsing.

The URL at the top of each file may need updating when you get ready to run it, since the session variable may have expired.

Yahoo (jpeg)
Hotmail (jpeg)
Hotmail (wav)
Google (jpeg)
Google (wav)
PS - For the sake of others, please don't add/modify things in this pastebin.

-maluc



Edited 1 time(s). Last edit at 03/12/2008 08:32AM by maluc.

Re: Yahoo/Hotmail/Google CAPTCHA Extraction
Posted by: maluc
Date: March 12, 2008 02:58AM

And for those interested, these were my initial notes on the properties of each CAPTCHA. Might save someone a bit of time.

Spammy:
CAPTCHA notes

gmail:
link to view: https://www.google.com/accounts/NewAccount?service=mail&continue=http%3A%2F%2Fmail.google.com%2Fmail%2Fe-11-10ba05aeaa8e9b701e5151437f9a44d3-64aeae753cc34f1c864f7edc97a046ccdc96987b&type=2
length: 5-8
range: a-z
case-sensitive: no
background: always white
overlay: none
text color: solid blue,green,or red. single color.
size: 2000-3900 bytes
width: always 200px
height: always 70px
other: tilting seemingly random, 5chars is rare, red is rare, shade of solid colors may change between captchas

gmail-audio:
link to view: https://www.google.com/accounts/NewAccount?service=mail&continue=http%3A%2F%2Fmail.google.com%2Fmail%2Fe-11-10ba05aeaa8e9b701e5151437f9a44d3-64aeae753cc34f1c864f7edc97a046ccdc96987b&type=2
length: not certain (5-10?)
range: 0-9
case-sensitive: N/A
background: equally loud gibberish and noise, really gets in the way.
size: 200044-440044 bytes
other: way too hard for a human - don't know how blind people do it. pace varies but pitch seems to remain fairly similar.

yahoo:
link to view: https://edit.yahoo.com/registration?.intl=us&new=1&.done=http%3A//mail.yahoo.com&.src=ym&.v=0&.u=ak37rod3tebb2&partner=&.partner=&pkg=&stepid=&.p=&promo=&.last=#
length: 4-6
range: a-z,A-Z,2-8
case-sensitive: no
background: always white
text color: always black
overlay: 1-3 random line paths, always black
size: between 1800 and 3200 bytes
width: always 290px
height: always 80px
other: tilting and bending randomly, 4chars is rare, each letter either 2d sans-serif or 3d serif, some letters not used or in only one case

hotmail:
link to view: https://signup.live.com/newuser.aspx?mkt=en-us&ts=4309539&sh=ynSL&ru=http%3a%2f%2fmail.live.com%2f%3fnewuser%3dyes&rx=http%3a%2f%2fget.live.com%2fmail%2foptions&rollrs=03&lic=1#HipBox
length: 8
range: A-Z,2-3,5-6,8-9
case-sensitive: no
background: always gray
text color: always dark blue
overlay: short line paths with 0-3 bends, always dark blue
size: 3200-4400 bytes
width: always 218px
height: always 48px
other: looks easiest to solve, font size varies

hotmail-audio:
link to view: https://signup.live.com/newuser.aspx?mkt=en-us&ts=4309539&sh=ynSL&ru=http%3a%2f%2fmail.live.com%2f%3fnewuser%3dyes&rx=http%3a%2f%2fget.live.com%2fmail%2foptions&rollrs=03&lic=1#HipBox
length: 10
range: 0-9
case-sensitive: N/A
background: lower volume gibberish, sounds like numbers really fast with extra noise
size: 46000-131000 bytes
other: numbers seem to follow a steady pace, pitch varies, and the voice is either a higher-pitched woman or a low-pitched male with robotic synthesizing
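
The length and range fields above put a rough upper bound on how many answers each CAPTCHA can have. The following is only a sketch of that arithmetic: the function name and dictionary are made up for illustration, the alphabet sizes are read straight off the notes, and the real spaces are smaller wherever letters are unused. gmail-audio is left out because its length is uncertain.

```python
# Upper bounds on the answer space implied by the notes.
# keyspace() and SPACES are illustrative names, not part of any posted script.

def keyspace(alphabet_size, min_len, max_len):
    """Count all strings over the alphabet with a length in [min_len, max_len]."""
    return sum(alphabet_size ** n for n in range(min_len, max_len + 1))

SPACES = {
    "gmail": keyspace(26, 5, 8),            # a-z, 5-8 chars
    "yahoo": keyspace(26 + 7, 4, 6),        # letters (case-insensitive) plus 2-8, 4-6 chars
    "hotmail": keyspace(26 + 6, 8, 8),      # A-Z plus 2,3,5,6,8,9, always 8 chars
    "hotmail-audio": keyspace(10, 10, 10),  # digits 0-9, always 10 of them
}

for name, size in SPACES.items():
    print(f"{name}: {size:,}")
```

Blind guessing is hopeless against any of these, which is why the interesting properties are the ones that make segmentation and recognition easier (fixed size, single text color, fixed length).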

-maluc



Edited 1 time(s). Last edit at 03/12/2008 08:34AM by maluc.

Re: Yahoo/Hotmail/Google CAPTCHA Extraction
Date: March 12, 2008 08:17PM

Nice work, maluc. It should make someone's work with OCR applications a bit easier.


Awesome AnDrEw - That's The Sound Of Your Brain Crackin'
http://www.awesomeandrew.net/

Re: Yahoo/Hotmail/Google CAPTCHA Extraction
Posted by: istari
Date: March 13, 2008 09:35PM

Nice scripts! The one for Google's JPEG is quite similar to one I wrote some time ago for the same purpose, but mine is in Python...

Anyway, did you ever get to do anything with those test images?

Re: Yahoo/Hotmail/Google CAPTCHA Extraction
Posted by: maluc
Date: March 14, 2008 01:39PM

Well, I started using them to train a neural network on an audio CAPTCHA (not one listed above), but I haven't had the time to get it fully working. The audio CAPTCHA I picked to start with has very little noise and few amplitude adjustments, so the NN should identify it easily. The tricky part is segmenting the wave files when they have varying speed and the letters are not evenly paced. :T

Once I get it working (99-100% success? ^^), I'll release the full code for it.
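
For anyone wondering what the segmentation step might look like: one common approach for low-noise audio is to split on short-term energy, which does not care whether the pacing is even. This is a minimal sketch of that generic technique, not the code described in this post. The function name, frame size, threshold, and the synthetic test signal are all assumptions, and real files would first need decoding into floats (for example with the standard `wave` module).

```python
import math

RATE = 8000  # samples per second, assumed for the synthetic demo below

def segment_by_energy(samples, rate, frame_ms=10, threshold=0.02, min_gap_ms=50):
    """Return (start, end) sample indexes for each run of loud frames.

    samples are floats in [-1, 1]. A segment is closed once at least
    min_gap_ms of consecutive quiet frames follow it; shorter gaps are bridged.
    """
    frame = max(1, int(rate * frame_ms / 1000))
    min_gap = max(1, min_gap_ms // frame_ms)  # gap length measured in frames
    segments = []
    start = None  # first sample of the segment being built, if any
    end = 0       # one past the last loud sample seen so far
    quiet = 0     # consecutive quiet frames since the last loud frame
    for pos in range(0, len(samples), frame):
        chunk = samples[pos:pos + frame]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        if rms >= threshold:
            if start is None:
                start = pos
            end = pos + len(chunk)
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_gap:
                segments.append((start, end))
                start = None
    if start is not None:
        segments.append((start, end))
    return segments

def silence(n):
    return [0.0] * n

def tone(n, freq=440.0):
    return [0.5 * math.sin(2 * math.pi * freq * i / RATE) for i in range(n)]

# Two unevenly spaced bursts of different lengths.
demo = silence(800) + tone(1600) + silence(1200) + tone(800) + silence(800)
print(segment_by_energy(demo, RATE))
```

A fixed threshold like this breaks down as soon as background gibberish is as loud as the target, which is the case for some of the audio CAPTCHAs in the notes earlier in the thread.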

-maluc

Re: Yahoo/Hotmail/Google CAPTCHA Extraction
Posted by: istari
Date: March 14, 2008 08:16PM

That's gonna be a must-read for me when it comes out (if it does come out: a 99-100% success rate is a bit high :P )! I never really considered breaking audio CAPTCHAs, as my audio processing skills are as low as it gets...

Anyway, do you think that a big site would not notice an increase in the number of registrations done with the CAPTCHAs meant for the visually impaired? I have absolutely no statistics about this, but my guess is that those CAPTCHAs are not used by too many people... although I don't know if they even keep track of this kind of stuff...



Edited 1 time(s). Last edit at 03/14/2008 08:18PM by istari.

Re: Yahoo/Hotmail/Google CAPTCHA Extraction
Posted by: maluc
Date: March 15, 2008 12:57AM

Heh, I agree 99+% is more hopeful than realistic, but anything above 50% is really pretty brutal, so I'd be satisfied with that. Breaking audio CAPTCHAs seems to be a pretty overlooked vector in CAPTCHA solving, likely because no one seems to wanna learn audio signal processing anymore :T

My sound parsing experience had been limited to DTMF (touch-tone sounds), so it's somewhat new for me too. Not quite the 'fast Fourier and you're done' ^^;
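
For comparison, this is roughly how simple the DTMF case is. The sketch below uses the Goertzel recurrence, a standard alternative to a full FFT when only a handful of frequencies matter. It is an illustration under assumptions rather than anyone's posted code: the function names are made up, and the input is a clean synthetic tone with no noise.

```python
import math

LOW = (697, 770, 852, 941)        # DTMF row frequencies in Hz
HIGH = (1209, 1336, 1477, 1633)   # DTMF column frequencies in Hz
KEYS = ("123A", "456B", "789C", "*0#D")

def goertzel_power(samples, rate, freq):
    """Signal power near freq, computed with the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / rate)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def detect_key(samples, rate):
    """Pick the strongest row tone and column tone and map them to a key."""
    row = max(range(4), key=lambda i: goertzel_power(samples, rate, LOW[i]))
    col = max(range(4), key=lambda i: goertzel_power(samples, rate, HIGH[i]))
    return KEYS[row][col]

def dtmf_tone(key, rate=8000, duration=0.05):
    """Synthesize a clean DTMF burst for the given key (test input only)."""
    row = next(i for i, keys in enumerate(KEYS) if key in keys)
    col = KEYS[row].index(key)
    n = int(rate * duration)
    return [math.sin(2 * math.pi * LOW[row] * t / rate)
            + math.sin(2 * math.pi * HIGH[col] * t / rate)
            for t in range(n)]

print(detect_key(dtmf_tone("5"), 8000))
```

Spoken digits have no fixed frequency pair like this, which is why speech takes more than a frequency lookup.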

You're right. There can't be very many blind web surfers out there. I'd hope the bulk of the major websites are using heuristics to detect mass registrations, since I believe they're a much better long-term spam prevention than unsolvable, userbase-pissing-off CAPTCHAs. But I think that's more hopeful than realistic too.
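
As a rough idea of the simplest such heuristic, a site could count signups per source inside a sliding time window and flag anything over a limit. This is a hypothetical sketch only: the class name, the limit, the window, and the choice of keying on an IP address are all assumptions, and it says nothing about what any real site does.

```python
from collections import defaultdict, deque

class RegistrationRateMonitor:
    """Flag sources that register more than `limit` accounts per `window` seconds."""

    def __init__(self, limit=5, window=3600):
        self.limit = limit
        self.window = window
        self.events = defaultdict(deque)  # source -> timestamps of recent signups

    def record(self, source, timestamp):
        """Record one signup; return True if the source now exceeds the limit."""
        recent = self.events[source]
        recent.append(timestamp)
        # Drop signups that have fallen out of the sliding window.
        while recent and recent[0] <= timestamp - self.window:
            recent.popleft()
        return len(recent) > self.limit

monitor = RegistrationRateMonitor(limit=3, window=60)
# Four signups in 30 seconds trip the flag; one much later does not.
flags = [monitor.record("203.0.113.7", t) for t in (0, 10, 20, 30, 200)]
print(flags)
```

The same counter could be keyed on other signals, such as whether the audio CAPTCHA was used, which would cover the question raised earlier in the thread.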

-maluc



Edited 1 time(s). Last edit at 03/15/2008 02:02AM by maluc.

Re: Yahoo/Hotmail/Google CAPTCHA Extraction
Posted by: istari
Date: April 30, 2008 09:58AM

Well, I just thought I'd post here some code I worked on to get reddit.com CAPTCHAs. I hope someone makes good use of this...

Note that it can easily be extended to pull CAPTCHAs from other sites, as one only needs to change the definition of the URL variable and call random_filename with the appropriate arguments. Also note that reddit.com uses the filename to determine the CAPTCHA's text, and a random floating point number to warp the image. Best results are obtained when this floating point number is between 0.25 and 0.75 (this is implemented in this code).

The code is in Python (but it's still quite fast: 100 CAPTCHAs in 1 m 30 s on a standard connection), and you only need to call the get_captcha function with the number of CAPTCHAs you want as an argument...

Cheers!

[code]
import os
import random
import string
import urllib.request

### INIT ###
# A raw string cannot end in a backslash, so the trailing one is dropped;
# os.path.join adds the separator below.
DIR = r"\home"  # <== Directory to output files

UPPER = string.ascii_uppercase
LOWER = string.ascii_lowercase
NUMBERS = string.digits

### FUNCTIONS ###
def random_filename(length, extension, uppercase=True, lowercase=True, numbers=True):
    # Collect the character classes the caller enabled.
    possible = []
    if uppercase:
        possible.append(UPPER)
    if lowercase:
        possible.append(LOWER)
    if numbers:
        possible.append(NUMBERS)

    # Pick a class first, then a character from that class.
    result = ""
    while len(result) != length:
        result += random.choice(random.choice(possible))

    return result + '.' + extension

def get_captcha(quant):
    for i in range(quant):
        FILENAME = random_filename(32, "png")
        # The query string is the warp factor, kept between 0.25 and 0.75.
        URL = "http://reddit.com/captcha/" + FILENAME + '?' + repr(random.random() / 2 + 0.25)

        print(URL)

        # Read the whole response rather than trusting Content-Length to be present.
        with urllib.request.urlopen(URL) as imf, \
                open(os.path.join(DIR, FILENAME), "wb") as of:
            of.write(imf.read())

### GET SOME CAPTCHAS ###
get_captcha(100)
[/code]



Edited 1 time(s). Last edit at 04/30/2008 10:08AM by istari.

Re: Yahoo/Hotmail/Google CAPTCHA Extraction
Posted by: busin3ss
Date: September 02, 2008 09:11PM

Great stuff, thanks maluc!
