An Introduction to Compassionate Screen Scraping

jp · on April 19, 2011

Pretending to be human is problematic if the server decides you are a robot anyway, based on your User-Agent, your IP subnet (dynamic IPs from cloud providers) or your DNS look-up patterns (CNN and similar sites).

So "behaving like a human" on HN might result in an IP ban because /x is denied in robots.txt. And this gets really funny when you get banned randomly because of dynamic IP addresses in cloud infrastructure.

hung · on April 19, 2011

Caching is nice, but HTTP has a built-in method: conditional GETs. I wrote up a blog post on how to do this with App Engine but it should work generally in Python using urllib2.

http://www.hung-truong.com/blog/2010/12/01/conditional-gets-...
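
A minimal sketch of a conditional GET, assuming Python 3's urllib.request (the successor to urllib2) rather than the code from the linked post. The function names are hypothetical. The idea is to send back the validators (ETag, Last-Modified) from the previous response and treat a 304 as "use your cached copy".

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying whichever cache validators we have."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified).

    body is None when the server answers 304 Not Modified, in which
    case the caller should keep using its cached copy.
    """
    request = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; it is not a failure here.
        if err.code == 304:
            return None, etag, last_modified
        raise
```

On the first fetch you would call fetch_if_changed(url) with no validators, store the returned ETag and Last-Modified next to the cached body, and pass them back in on the next run. This only helps when the server actually sends those headers.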

runningdogx · on April 19, 2011

Screen scraping is taking visual data and transforming it into structured data. A screen scraper would graphically capture a window and try to identify or pick out data. Bots for MMOs tend to do that, along with providing input to the MMO depending on what they "see".

Web or data scraping is what the article talks about. Still a hard problem, easily broken by minor changes to the scraped webpage, but not subject to the vagaries of OCR and computer vision or graphical interpretation problems, which is what I was expecting from the title.

joshu · on April 20, 2011

The original screen scraping was in the context of emulating a 3270 terminal as an "API" into some system running on a mainframe.

I worked at a place that allowed web-based trading in the mid 90s by wrapping a web server (on a Sun SPARCstation 10) around a single emulated terminal running on a SCO box.

It was called screen scraping then, too :)

icandoitbetter · on April 20, 2011

Speaking of screen scraping, can someone recommend good libraries for accessing information on GUI widgets with integrated OCR capabilities?

eli · on April 19, 2011

No mention of observing robots.txt?

Groxx · on April 19, 2011

If you're scraping, and explicitly "behaving" like a human, you're probably grabbing things which aren't exposed in a robot-friendly format.

I.e., technically, you should. Practically, you can't all the time.

megamark16 · on April 19, 2011

It was my understanding that urllib2 respects robots.txt automatically. I can't find much to back that up, but I really thought I read that somewhere reliable once.

functional-tree · on April 19, 2011

This link corroborates that urllib2 respects robots.txt:

http://stackoverflow.com/questions/3197299/urllib2-connectio...

arst · on April 19, 2011

That link is wrong. Python does ship with a robotparser module in the standard library that parses robots.txt files, but urllib2 does not use it out of the box. This can be easily confirmed using Wireshark or a quick glance at the source: http://hg.python.org/cpython/file/08b5e2c9112c/Lib/urllib2.p....
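
Since the check has to be done by hand, here is a minimal sketch of it, assuming Python 3, where the robotparser module lives at urllib.robotparser. The sample rules and the "examplebot" name are hypothetical; the Disallow line only mirrors the /x rule mentioned elsewhere in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /x
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, user_agent, url):
    """Ask the parsed rules whether user_agent may fetch url."""
    return parser.can_fetch(user_agent, url)
```

Against a live site you would call parser.set_url(...) with the site's robots.txt URL followed by parser.read() instead of parse(), then check can_fetch() before every request.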

storborg · on April 19, 2011

The author makes some great suggestions, namely to cache heavily and throttle requests. However, they lost a lot of credibility for me with "screen scraper traffic should be indistinguishable from human traffic". Sorry, but that's BS--socially responsible scraping leaves control with the publisher. If the publisher doesn't want you scraping their content, you shouldn't try to fake a human in order to be able to.
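
The throttling suggestion can be as small as a minimum delay between requests. This is a rough sketch, not the article's code; the Throttle class and its parameters are hypothetical, and the clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling throttle.wait() right before each fetch keeps the request rate bounded no matter how fast the parsing code runs.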

dotBen · on April 19, 2011

I see your point - however, I read it as the author saying that the load your requests put on the server should be indistinguishable from human traffic.

I.e., if the server's log files show hundreds of requests from the same IP address on successive lines, that doesn't look like human behavior.

What would have been nice for a 'best practice' document would be to show how to set the HTTP User-Agent string for the crawler so that it includes an identifier, a version number and some contact method.
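
Something along these lines would do it. This is a sketch assuming Python 3's urllib.request; the bot name, version and contact address are placeholders, and the "name/version (+contact)" layout is just a common convention, not a requirement.

```python
import urllib.request


def make_request(url, bot_name="examplebot", version="0.1",
                 contact="mailto:ops@example.com"):
    """Build a request whose User-Agent identifies the crawler.

    All default values are hypothetical placeholders.
    """
    user_agent = f"{bot_name}/{version} (+{contact})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

The resulting object can be passed straight to urllib.request.urlopen(), so the publisher sees who is crawling and how to reach them.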

helwr · on April 19, 2011

This was already asked: http://groups.google.com/group/comp.lang.java/browse_thread/...

dotBen · on April 19, 2011

I'm confused - your URL is for a thread about Java but this best practice document is Python-oriented.

njs12345 · on April 19, 2011

It's a joke, look at the person asking the question :)

dhruvbird · on April 19, 2011

I can't help but mention that you should probably be using node.js with the jsdom module for such a task these days. You can get the complete power of jQuery with jsdom, making screen scraping child's play.

jerf · on April 19, 2011

It may not use the MIGHTY POWER OF JAVASCRIPT, but BeautifulSoup is a best-of-breed real-world HTML parser. Not just in the API, but in the verification that its parsing algorithm is effective against HTML found on real web sites. I find it unlikely that jsdom is actually significantly better. Does that code really look like it's going to be significantly improved with jQuery?

    titles = [x for x in soup.findAll('td','title') if x.findChildren()][:-1]

packs a lot of punch.

meatmanek · on April 19, 2011

The ability to use CSS/jQuery selectors is really nice, though. In order to find all <td>s whose parents have a class "blah", you have to use a list comprehension:

  tds = [td for td in soup.findAll('td') if td.parent.get('class') == 'blah']

In jQuery, this is more compactly written:

  tds = $('.blah > td')

And if you just want to look for <td>s somewhere within a .blah element, you can use

  tds = $('.blah td')

This is a lot less clear in BeautifulSoup:

  tds = [td for td in soup.findAll('td') if td.findParents(attrs={'class': 'blah'})]

(If there are better ways to write this BeautifulSoup code, please let me know)

Selectors have some other benefits too - you can just go to the CSS file and grab the selector that matches what you want, and you can be reasonably sure it'll work in most cases.
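
The same child versus descendant distinction can also be written as path expressions. The sketch below uses the standard library's xml.etree.ElementTree purely as a stand-in so that it runs without extra packages; it only handles well-formed markup and only a small XPath subset, so for real-world HTML a dedicated HTML parser would be needed. The sample document is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
DOC = """
<table>
  <tr class="blah">
    <td>direct</td>
    <th><table><tr><td>nested</td></tr></table></th>
  </tr>
  <tr class="other"><td>skipped</td></tr>
</table>
"""

root = ET.fromstring(DOC)

# Roughly '.blah > td': <td>s that are direct children of a class="blah" element.
children = root.findall(".//*[@class='blah']/td")

# Roughly '.blah td': <td>s anywhere beneath a class="blah" element.
descendants = root.findall(".//*[@class='blah']//td")

child_text = [td.text for td in children]
descendant_text = [td.text for td in descendants]
```

Note that the [@class='blah'] predicate is an exact string match, so unlike a CSS class selector it would miss class="blah other".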

joshu · on April 20, 2011

lxml in Python lets you use CSS selectors.

cdr · on April 20, 2011

My impression was that BeautifulSoup is actually getting very long in the tooth - I didn't use it much, but I have used a lot of Scrapy, a beautiful framework that completely supplants BeautifulSoup for scraping.

henrybaxter · on April 19, 2011

You can get the best of all worlds imo by using lxml, which supports the selectors you want, is Python (which I prefer), and in my experience is more robust than BeautifulSoup.

I spent more than a year writing hundreds of scrapers that ran for weeks at a time. BeautifulSoup did not work out as well as lxml in practice. On extremely JavaScript-heavy pages we actually used pyv8.

edit: more information at http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciat... the comments are useful too.

cdr · on April 20, 2011

Even better, use Scrapy, which is a whole framework designed specifically for scraping and, like lxml, is built on top of libxml2.

krakensden · on April 20, 2011

Scrapy is overkill for nearly everything. You'll probably have under a page of code using lxml and urllib.

cdr · on April 20, 2011

I have under a page of code with Scrapy for simple projects, and more advanced features when I need them.

That's like saying "jQuery is overkill for just about everything, you should use plain JavaScript".

krakensden · on April 21, 2011

No, it's like saying "The full YUI suite is overkill for just about everything, you should just use the core or jQuery".

'scrapy startproject' creates a couple nested directories, with maybe seven files. Are you writing a scraper that you're going to run regularly? Does it need to be super robust and maintainable? Or are you writing something that you'll run once, maybe twice?

cdr · on April 29, 2011

I seem to be missing why you think using a framework is a bad thing. With, say, Django or YUI there are performance and abstraction issues that can bite you, but I don't see those mattering for so lightweight a framework and so tightly scoped a problem.

bballbackus · on April 20, 2011

The combination of BeautifulSoup and mechanize also makes tasks incredibly simple.

njs12345 · on April 19, 2011

JSoup (http://jsoup.org) gives Java similar capabilities :)