Detecting Spiders is the key to Blackhat SEO
Posted by: rsnake
Date: August 23, 2006 11:33AM

One of the major problems in blackhat SEO (search engine optimization) is detecting what is a robot and what is a user pretending to be a robot to detect what you are doing. There are a lot of tricks out there, but almost all of them can be subverted. It would be interesting to catalogue them and see which ones work for what, instead of just trying to keep them all in our heads at one time.

E.g.: User-Agent detection -> works for all search engines that don't lie about or change their user agents; doesn't work when competitors change their user agent and pretend to be a spider. Etc...
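
A minimal sketch of that user-agent check (Python; the crawler strings are illustrative, not a complete list):

    # Naive spider detection by User-Agent substring -- trivially spoofed,
    # which is exactly the problem described above.
    SPIDER_UAS = ("Googlebot", "Slurp", "msnbot")  # illustrative, not complete

    def looks_like_spider(user_agent):
        return any(token in user_agent for token in SPIDER_UAS)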

- RSnake
Gotta love it. http://ha.ckers.org

Re: Detecting Spiders is the key to Blackhat SEO
Posted by: ryduh
Date: January 27, 2007 08:48PM

I use BBClone, a PHP web counter, and it can identify pretty much all the robots out there. Robot = red line.
http://bbclone.de/demo/show_detailed.php?lng=en

Its database of robots, or its code, would be useful to anyone trying to test out this method.

---------
Patience is a waste of time.

Re: Detecting Spiders is the key to Blackhat SEO
Posted by: rsnake
Date: January 27, 2007 09:00PM

Uhm... yah, but that sorta didn't answer my question. I'm talking about people pretending to be search engine robots, not catching people running their own robots. This only catalogues robots; it doesn't distinguish home-grown robots from actual search engines.

Re: Detecting Spiders is the key to Blackhat SEO
Posted by: kuza55
Date: January 28, 2007 12:04AM

For Googlebot, you can verify that it really is Googlebot, rather than someone spoofing the User-Agent, by doing a reverse DNS lookup, as described here: http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html

Of course, this opens up a potential DoS attack (probably not too likely, but still possible), since doing reverse DNS lookups (which actually involves two queries: one to get the hostname, and then one to verify that the nameserver responsible for that name returns the IP you are checking) isn't exactly a cheap operation.
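
A minimal sketch of that two-query check (Python; assumes, per the blog post, that genuine Googlebot hosts reverse-resolve into googlebot.com):

    import socket

    def is_googlebot(ip):
        # Query 1: reverse (PTR) lookup of the connecting IP.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        # Genuine Googlebot hosts should sit under googlebot.com.
        if not host.endswith(".googlebot.com"):
            return False
        # Query 2: forward lookup -- the hostname must map back to the same IP.
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False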

I've been planning on sending Google an email to see if they can either provide a list of domains somewhere, or get crawl.googlebot.com to return a list of all the IPs which Googlebot has, but I'm lazy and forgetful...

So the only real solution to improve performance that I could come up with is to manually hard-code the netblock in which Google resides (or just use 64.*.*.* - 66.*.*.*), and then only do reverse lookups if the IP is in that range.

Oh, and of course, any IPs which don't return the appropriate reverse DNS results, or which aren't in the netblock, just get added to hosts.deny or to your firewall rules, so that you don't repeat the same lookup for the same IP, and users have to use proxies to attempt a DoS; that should stop it. Furthermore, a caching system with a list of Googlebot's current IPs would also speed things up; there's a fairly recent list available here: http://johnny.ihackstuff.com/index.php?name=PNphpBB2&file=viewtopic&t=3418
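
A hedged sketch of that pre-filter plus verdict cache (Python; reuses the is_googlebot() check sketched above, the 64-66 range is just the rough guess from this post, and a real setup would push failures into hosts.deny or the firewall rather than a dict):

    verified = {}  # ip -> True/False; stands in for hosts.deny / firewall rules

    def check_googlebot_cached(ip):
        if ip in verified:                  # never look up the same IP twice
            return verified[ip]
        first_octet = int(ip.split(".")[0])
        in_block = 64 <= first_octet <= 66  # crude netblock pre-filter
        verified[ip] = in_block and is_googlebot(ip)
        return verified[ip]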

I have no idea how to detect other bots, but if you know which bots you want to allow through you could compile a netblock or IP list.

A friend and I are working on a paper (err, more being lazy and not writing up our ideas than anything, but it's coming) about how to detect bots, among some other things. Hopefully we'll get around to finishing it soon enough, and once we do I'll add it here.

Re: Detecting Spiders is the key to Blackhat SEO
Posted by: rsnake
Date: January 28, 2007 06:06PM

Yah, Google is the easiest of all the bots to detect, though. It's really the others that are more interesting. When you get further along with your bot-detection paper, I'd love to read it.

- RSnake
Gotta love it. http://ha.ckers.org

Re: Detecting Spiders is the key to Blackhat SEO
Posted by: kuza55
Date: January 28, 2007 07:23PM

I may have been a bit misleading about the paper; it's not really focused on bot detection, though that is integral to it. So yeah, don't get your hopes up or anything, but I've sent you an email to explain a bit anyway.

Anyway, other search engines have the same option, e.g.:

MSN/Live Search: http://blogs.msdn.com/livesearch/archive/2006/11/29/search-robots-in-disguise.aspx

Ask.com: http://about.ask.com/en/docs/about/webmasters.shtml#21

And I'm sure that if you did some checks on the IPs which Yahoobot uses, they'd probably all resolve to yahoo.com domains.

Also, one thing I'm curious about: does anyone know whether PHP and other languages do a forward lookup of the hostname returned by gethostbyaddr? I remember reading somewhere that even the default C functions (on some *nix-like OSes) do that... but I could be extremely wrong. I'm primarily curious because the posts above also recommend doing a forward lookup on the hostname after getting a result.

[EDIT]: Supposedly Yahoo does the same, but its hostnames resolve to .inktomisearch.com domains: http://www.webmasterworld.com/google/3092423.htm
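
Putting the links above together, a generalized sketch (Python; the suffix table is my reading of those posts, not verified, so treat each entry as an assumption to double-check against the engine's own documentation):

    import socket

    # Hostname suffixes per engine -- assumptions pulled from the links above.
    ENGINE_SUFFIXES = {
        "google": (".googlebot.com",),
        "msn": (".search.live.com", ".search.msn.com"),
        "yahoo": (".inktomisearch.com",),
    }

    def verify_crawler(ip, engine):
        # Reverse lookup, suffix check, then forward-confirm the hostname.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not host.endswith(ENGINE_SUFFIXES[engine]):
            return False
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False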




Re: Detecting Spiders is the key to Blackhat SEO
Posted by: ntp
Date: January 28, 2007 07:27PM

DNS reverse-lookups and anything source-IP based can still be spoofed via IP/ASN hijacking and a few other methods. Spammers, click-through fraudsters, phishers, and blackhat SEOs are already using these methods, as well as, um... proxies.

The key to detecting bots is going to become similar to detecting someone using ippersonality or Tor. Most of the current techniques seem to be kernel timing based, which is also one of the best current methods to detect rootkits.

I am still looking for running code that makes my bots look like undetectable humans. Billy Hoffman did an interesting presentation on web crawling, but SPI never released the code.

Re: Detecting Spiders is the key to Blackhat SEO
Posted by: kuza55
Date: January 28, 2007 07:41PM

ntp Wrote:
-------------------------------------------------------
> DNS reverse-lookups and anything source-IP based
> can still be spoofed via IP/ASN hijacking and a
> few other methods. Spammers, click-through
> fraudsters, phishers, and blackhat SEOs are
> already using these methods, as well as, um...
> proxies.


Ok, it is possible to perform IP hijacking, DNS MITM attacks, and even DNS traffic-injection attacks if you're somehow on the LAN of the webserver, but I think doing that is out of reach for most people, because (AFAIK) there's no way to do it without compromising a router somewhere between the target website and its DNS server, or a machine on either of those LANs.

Or am I clearly missing something?

Oh, and Acidus also did a presentation on covert crawling at ShmooCon, which frankly I don't remember much of, since I watched it so long ago and have since deleted the video. Anyway, if you're interested, you can find the video here: http://www.shmoocon.org/2006/presentations.html

Re: Detecting Spiders is the key to Blackhat SEO
Posted by: rsnake
Date: January 29, 2007 11:27AM

@ntp - kernel timing? Are you talking about that TCP timing detection stuff that came out 2-3 years ago?

- RSnake
Gotta love it. http://ha.ckers.org

Re: Detecting Spiders is the key to Blackhat SEO
Posted by: ntp
Date: January 29, 2007 04:47PM

Any of the work done by Zalewski, KD-Team, or LightBlueTouchPaper on using timing to identify and fingerprint code of all types.

Re: Detecting Spiders is the key to Blackhat SEO
Posted by: rsnake
Date: January 30, 2007 06:24PM

Ah, yes, there was a timing-offset paper that came out a few years ago... I thought that's what you were talking about. There are implementation problems with that paper, though. Nevermind.

- RSnake
Gotta love it. http://ha.ckers.org

Re: Detecting Spiders is the key to Blackhat SEO
Posted by: berty7386
Date: December 15, 2009 12:44PM

http://www.webshree.com/black-hat-seo-and-xss-attacks.aspx
Check the link for details.

Re: Detecting Spiders is the key to Blackhat SEO
Posted by: rvdh
Date: December 15, 2009 03:24PM

YES! I WILL. I want IT.


