Pretending to be human is problematic if the server thinks you are a robot because of User-Agent, IP subnet (dynamic IP cloud systems) and DNS look-up patterns (CNN and similar sites).
So "behaving like a human" on HN might result in an IP ban because /x is denied in robots.txt. And this gets really funny when you get banned randomly because of dynamic IP addresses in cloud infrastructure.
Caching is nice, but HTTP has a built-in method: conditional GETs. I wrote up a blog post on how to do this with App Engine but it should work generally in Python using urllib2.
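For anyone who wants to see the shape of a conditional GET, here is a minimal sketch. It is not the code from the blog post: it assumes Python 3, where urllib2 became urllib.request, and the helper names build_conditional_request and fetch_if_changed are made up for illustration. The idea is to send back the ETag and Last-Modified values saved from the previous response and treat a 304 as "nothing changed".

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying validators saved from an earlier response."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None on 304 Not Modified."""
    request = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; the cached copy is still good.
        if err.code == 304:
            return None, etag, last_modified
        raise
```

Store the returned validators next to the cached body and pass them in on the next fetch. Servers that send neither header will simply return the full page every time.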
Screen scraping is taking visual data and transforming it into structured data. A screen scraper would graphically capture a window and try to identify or pick out data. Bots for MMOs tend to do that, along with providing input to the MMO depending on what they "see".
Web or data scraping is what the article talks about. Still a hard problem, easily broken by minor changes to the scraped webpage, but not subject to the vagaries of OCR and computer vision or graphical interpretation problems, which is what I was expecting from the title.
The original screen scraping was in the context of emulating a 3270 terminal as an "API" into some system running on a mainframe.
I worked at a place that allowed web-based trading in the mid-90s by wrapping a web server (on a Sun SPARCstation 10) around a single emulated terminal running on a SCO box.
It was my understanding that urllib2 respects robots.txt automatically. I can't find much to back that up, but I really thought I read that somewhere reliable once.
That link is wrong. Python does ship with a robotparser module in the standard library that parses robots.txt files, but urllib2 does not use it out of the box. This can be easily confirmed using Wireshark or a quick glance at the source: http://hg.python.org/cpython/file/08b5e2c9112c/Lib/urllib2.p....
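Since the check has to be done by hand, here is a rough sketch of what it could look like. It assumes Python 3, where the module lives at urllib.robotparser. The robots.txt text is a made-up sample modelled on the /x rule mentioned upthread, and is_allowed and the "examplebot" agent name are hypothetical.

```python
from urllib import robotparser

# Made-up sample; a real crawler would download the site's actual robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /x
"""


def is_allowed(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules before requesting it."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://news.ycombinator.com/news",
                "http://news.ycombinator.com/x?fnid=abc"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "examplebot", url))
```

To work from the live file instead, RobotFileParser also has set_url() and read(), which fetch and parse it in one go.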
The author makes some great suggestions, namely to cache heavily and throttle requests. However, they lost a lot of credibility for me with "screen scraper traffic should be indistinguishable from human traffic". Sorry, but that's BS--socially responsible scraping leaves control with the publisher. If the publisher doesn't want you scraping their content, you shouldn't try to fake a human in order to be able to.
I see your point. However, I read it as the author saying that the load your requests put on the server should be indistinguishable from human traffic.
I.e., if the server's log files have hundreds of requests from the same IP address in successive lines, then that doesn't look like human behavior.
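One way to keep the request rate down is a small throttle that enforces a minimum gap between requests. This is only a sketch: the Throttle class is invented for illustration, and the right interval is something you would have to pick per site. The clock and sleep functions are parameters so the behavior can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call wait() immediately before each request to the same host, for example with Throttle(5.0) for one request every five seconds.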
What would have been nice for a 'best practice' document would be to show how to set the HTTP User-Agent string for the crawler so that it carries an identifier, a version number, and some contact method.
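Something along these lines would do it. This is a sketch assuming Python 3's urllib.request; the bot name, version, and contact URL are placeholders, and the "name/version (+contact)" layout is just a common convention, not a requirement.

```python
import urllib.request


def make_request(url, bot_name, version, contact):
    """Build a request whose User-Agent identifies the crawler and its owner."""
    user_agent = "{}/{} (+{})".format(bot_name, version, contact)
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    # Placeholder values; substitute your own crawler's details.
    request = make_request(
        "http://example.com/",
        bot_name="ExampleBot",
        version="0.1",
        contact="http://example.com/bot.html",
    )
    print(request.get_header("User-agent"))
```

Pass the resulting request to urllib.request.urlopen() as usual. A publisher who sees it in the logs can then tell who is crawling and how to reach them.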
I can't help but mention that you should probably be using node.js with the jsdom module for such a task these days. You can get the complete power of jQuery with jsdom, making screen scraping child's play.
It may not use the MIGHTY POWER OF JAVASCRIPT, but BeautifulSoup is a best-of-breed real-world HTML parser. Not just in the API, but in the verification that its parsing algorithm is effective against HTML found on real web sites. I find it unlikely that jsdom is actually significantly better. Does that code really look like it's going to be significantly improved with jQuery?
titles = [x for x in soup.findAll('td','title') if x.findChildren()][:-1]
The ability to use CSS/jQuery selectors is really nice, though. To find all <td>s whose parents have the class "blah", you have to use a list comprehension:
tds = [td for td in soup.findAll('td') if td.parent.get('class') == 'blah']
In jQuery, this is more compactly written:
tds = $('.blah > td')
And if you just want to look for <td>s somewhere within a .blah element, you can use
tds = $('.blah td')
This is a lot less clear in BeautifulSoup:
tds = [td for td in soup.findAll('td') if td.findParents(attrs={'class': 'blah'})]
(If there are better ways to write this BeautifulSoup code, please let me know)
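One possible answer, with the caveat that it depends on the version: the later BeautifulSoup 4 package (imported as bs4) has a select() method that takes CSS selectors, so both cases can be written much like the jQuery ones. The sketch below assumes bs4 is installed and uses the standard-library "html.parser" backend; the HTML sample is made up.

```python
from bs4 import BeautifulSoup

# Made-up markup: one row with class "blah" and one nested table inside a div.
HTML = """
<table>
  <tr class="blah"><td>a</td><td>b</td></tr>
  <tr><td>c</td></tr>
</table>
<div class="blah"><table><tr><td>d</td></tr></table></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# <td>s that are direct children of a .blah element, like $('.blah > td')
direct = soup.select(".blah > td")

# <td>s anywhere inside a .blah element, like $('.blah td')
nested = soup.select(".blah td")

print([td.get_text() for td in direct])
print([td.get_text() for td in nested])
```

This does not apply to the older BeautifulSoup 3 API used in the snippets above, where the list comprehensions are still needed.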
Selectors have some other benefits too - you can just go to the CSS file and grab the selector that matches what you want, and you can be reasonably sure it'll work in most cases.
My impression was that BeautifulSoup is actually getting very long in the tooth - I didn't use it much, but I have used a lot of Scrapy, a beautiful framework that completely supplants BeautifulSoup for scraping.
You can get the best of all worlds imo by using lxml: it supports the selectors you want, it's Python, which I prefer, and in my experience it is more robust than BeautifulSoup.
I spent more than a year writing hundreds of scrapers that ran for weeks at a time. BeautifulSoup did not work out as well as lxml in practice. On extremely JavaScript-heavy pages we actually used PyV8.
No, it's like saying "The full YUI suite is overkill for just about everything, you should just use the core or jQuery".
'scrapy startproject' creates a couple nested directories, with maybe seven files. Are you writing a scraper that you're going to run regularly? Does it need to be super robust and maintainable? Or are you writing something that you'll run once, maybe twice?
I seem to be missing why you think using a framework is a bad thing. With, say, Django or YUI there are performance and abstraction issues that can bite you, but I don't see those mattering for so lightweight a framework and so tightly scoped a problem.