Develop highly-concurrent web scrapers, easily.
ragstoriches is a combined library/framework to ease writing web scrapers using gevent and requests.
A simple example to tell the story:
import re from lxml.html import document_fromstring from ragstoriches.scraper import Scraper scraper = Scraper(__name__) @scraper def index(requests, url='http://eastidaho.craigslist.org/search/act?query=+'): html = document_fromstring(requests.get(url).content) for ad_link in html.cssselect('.row a'): yield 'posting', ad_link.get('href') nextpage = html.cssselect('.nextpage a') if nextpage: yield 'index', nextpage.get('href') @scraper def posting(requests, url): html = document_fromstring(requests.get(url).content) title = html.cssselect('.postingtitle').text.strip() id = re.findall(r'\d+', html.cssselect('div.postinginfos p').text) date = html.cssselect('.postinginfos date').text.strip() body = html.cssselect('#postingbody').text_content().strip() print title print '=' * len(title) print 'post *%s*, posted on %s' % (id, date) print body print
You will get a bunch of jumbled output, so next step is redirecting stdout to a file:
ragstoriches craigs.py > output.md
Try giving different urls for this scraper on the command-line:
ragstoriches craigs.py http://newyork.craigslist.org/mnh/acc/ > output.md # hustle ragstoriches craigs.py http://orangecounty.craigslist.org/wet/ > output.md # writing OC ragstoriches craigs.py http://seattle.craigslist.org/w4m/ > output.md # sleepless in seattle
There are a lot of commandline-options available, see ragstoriches --help for a list.
ragstoriches works with almost any kind of HTML-parsing library, while using lxml is recommend, you can easily use BeautifulSoup4 or another library (lxml, in my tests, turned out to be about five times as fast as BeautifulSoup though and the CSS-like selectors are a joy to use).
A scraper module consists of some initialization code and a number of subscrapers. Scraping starts by calling the a scraper named index (see the example above).
Scrapers make use of dependency injection - the argument names are looked up in a scope and filled with the relevant instance. This means that if your subscraper takes an argument called url, it will always be the URL to scrape, requests always a pool instance and so on.
The following predefined injections are available:
- requests: A requests session. Can be treated like the top-level API of requests. As long as you use it for fetching webpages, you never have to worry about blocking or exceeding concurrency limits.
- url: The url to scrape and parse.
- data: A callback for passing data out of the scraper. See the example below.
Return values of scrapers are ignored. However, if a scraper is a generater (i.e. contains a yield statement), it should yield tuples of subscraper_name, url or subscraper_name, url, context. These are added to the queue of jobs to scrape, the contents of context are added to the scope for all following subscraper calls.
The url-yield is urlparse.urljoin-ed onto the url passed into the scraper, this means that you do not have to worry about relative links, they just work.
You can decouple scraping/parsing and the actual processing of the resulting data by using receivers. Let’s rewrite the example above a slight bit by replacing the second function with this:
@scraper def posting(requests, url, data): html = document_fromstring(requests.get(url).content) data('posting', title=html.cssselect('.postingtitle').text.strip(), id=re.findall(r'\d+', html.cssselect('div.postinginfos p').text), date=html.cssselect('.postinginfos date').text.strip(), body=html.cssselect('#postingbody').text_content().strip(), )
Two differences: We inject data as an argument and instead of printing our data, we pass it to the new callable.
When you call data, the first argument is the name of a subreceiver and everything passed into it gets passed on to every receiver loaded. We don’t load any receivers, so running the scraper will do nothing but fill up the data-queue.
To illustrate, put the following into a file called printer.py:
from ragstoriches.receiver import Receiver receiver = Receiver(__name__) @receiver def posting(**kwargs): print 'New posting: %r' % kwargs
Afterwards, run ragstoriches -q craigs.py printer.py. The result will be that the receiver prints the extracted data to stdout, nicely decoupling extraction and processing.
You can transparently cache downloaded data, this is especially useful when developing. Simply pass --cache some_name to ragstoriches, which will use requests-cache for caching.
Usage as a library
You can use ragstoriches as a library (instead of as a framework, by using the commandline utilities) as well, but there is no detailed documentation. Drop me a line if this is important for you.
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
|Filename, size||File type||Python version||Upload date||Hashes|
|Filename, size ragstoriches-0.3.1dev.tar.gz (6.6 kB)||File type Source||Python version None||Upload date||Hashes View|