ragstoriches

Develop highly-concurrent web scrapers, easily.

Project description

ragstoriches is a combined library/framework to ease writing web scrapers using gevent and requests.

A simple example to tell the story:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from urlparse import urljoin
import re

from bs4 import BeautifulSoup
from ragstoriches.scraper import Scraper

# the @rr decorator below registers each function as a subscraper on this object
rr = Scraper(__name__)

@rr
def index(requests, url='http://eastidaho.craigslist.org/search/act?query=+'):
    soup = BeautifulSoup(requests.get(url).text)

    # queue one 'posting' job per search result
    for row in soup.find_all(class_='row'):
        yield 'posting', urljoin(url, row.find('a').attrs['href'])

    # follow pagination by queueing the next index page, if there is one
    nextpage = soup.find(class_='nextpage')
    if nextpage:
        yield 'index', urljoin(url, nextpage.find('a').attrs['href'])


@rr
def posting(requests, url):
    soup = BeautifulSoup(requests.get(url).text)
    infos = soup.find(class_='postinginfos').find_all(class_='postinginfo')

    title = soup.find(class_='postingtitle').text.strip()
    # raw string for the regex; post_id avoids shadowing the builtin id()
    post_id = re.findall(r'\d+', infos[0].text)[0]
    date = infos[1].find('date').text.strip()
    body = soup.find(id='postingbody').text.strip()

    # print the posting as a small Markdown section
    print title
    print '=' * len(title)
    print 'post *%s*, posted on %s' % (post_id, date)
    print body
    print

Install the library and BeautifulSoup 4 using pip install ragstoriches beautifulsoup4, save the example above as craigs.py, then run it with ragstoriches craigs.py.

You will get a lot of jumbled output, so the next step is to redirect stdout to a file:

ragstoriches craigs.py > output.md

Try passing different URLs to this scraper on the command line:

ragstoriches craigs.py http://newyork.craigslist.org/mnh/acc/  > output.md  # hustle
ragstoriches craigs.py http://orangecounty.craigslist.org/wet/ > output.md  # writing OC
ragstoriches craigs.py http://seattle.craigslist.org/w4m/      > output.md  # sleepless in seattle

Many command-line options are available; see ragstoriches --help for a list.

Writing scrapers

A scraper module consists of some initialization code and a number of subscrapers. Scraping starts by calling the subscraper named index on the scraper rr in the module (see the example above).

The requests argument should be treated like the requests module (it actually is an instance of requests Pool). As long as you use it for fetching webpages, you never have to worry about blocking or exceeding concurrency limits.
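To make the idea of a concurrency limit concrete, here is a minimal sketch of a fetcher that never runs more than a fixed number of requests at once. It is not ragstoriches' implementation: the library uses gevent, while this sketch uses standard-library threads (Python 3), and LimitedFetcher, fake_fetch and the example.com URLs are hypothetical names made up for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class LimitedFetcher(object):
    """Hypothetical wrapper: at most `limit` calls to `fetch` run at once."""

    def __init__(self, fetch, limit):
        self._fetch = fetch
        self._slots = threading.BoundedSemaphore(limit)

    def get(self, url):
        # Callers beyond the limit wait here until a slot is released.
        with self._slots:
            return self._fetch(url)


active = 0  # number of fetches currently in flight
peak = 0    # highest value `active` ever reached
counter_lock = threading.Lock()


def fake_fetch(url):
    """Stand-in for an HTTP request; records how many calls overlap."""
    global active, peak
    with counter_lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.01)  # pretend to wait for the network
    with counter_lock:
        active -= 1
    return 'body of ' + url


fetcher = LimitedFetcher(fake_fetch, limit=2)
urls = ['http://example.com/%d' % i for i in range(8)]

# Eight workers compete for the fetcher, but only two fetch at any moment.
with ThreadPoolExecutor(max_workers=8) as pool:
    bodies = list(pool.map(fetcher.get, urls))
```

The calling code only ever sees a get() method, which is why a subscraper can use the requests argument as if it were the plain requests module.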

The url argument is the URL to scrape and parse.

Return values of scrapers are ignored. However, if a scraper is a generator (i.e. contains a yield statement), any value it yields must be at least a 2-tuple consisting of the name of a scraper and another URL. These are added to the queue of jobs to scrape.
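The job queue can be pictured with a small, sequential sketch. This is not how ragstoriches is implemented (the real thing runs jobs concurrently on gevent); run, SITE and the example.com URLs are hypothetical, and the requests argument is passed as None because the fake site lives in memory.

```python
import inspect
from collections import deque

# Hypothetical in-memory "site": index pages list postings and a next page.
SITE = {
    'http://example.com/page1': {'posts': ['http://example.com/a'],
                                 'next': 'http://example.com/page2'},
    'http://example.com/page2': {'posts': ['http://example.com/b'],
                                 'next': None},
}


def index(requests, url):
    page = SITE[url]
    for link in page['posts']:
        yield 'posting', link
    if page['next']:
        yield 'index', page['next']


def posting(requests, url):
    # Not a generator, so whatever it returns is ignored by the driver.
    return 'ignored'


def run(scrapers, start_url):
    """Process jobs in FIFO order, starting with the 'index' subscraper."""
    queue = deque([('index', start_url)])
    visited = []
    while queue:
        name, url = queue.popleft()
        visited.append((name, url))
        result = scrapers[name](None, url)
        if inspect.isgenerator(result):
            # Each yielded value starts with (scraper name, url).
            for job in result:
                queue.append((job[0], job[1]))
    return visited


visited = run({'index': index, 'posting': posting},
              'http://example.com/page1')
```

Here visited ends up holding both index pages and both postings, in the order the jobs left the queue.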

Good friends of ragstoriches are the urlparse.urljoin function and BeautifulSoup4.
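A short illustration of why urljoin is so handy: links on a page are often relative, and urljoin resolves them against the URL of the page they were found on. The hrefs below are made up for the example; the import falls back to urllib.parse so the snippet also runs on Python 3.

```python
try:
    from urlparse import urljoin  # Python 2, as in the example above
except ImportError:
    from urllib.parse import urljoin  # Python 3 home of the same function

base = 'http://eastidaho.craigslist.org/search/act?query=+'

# A root-relative href replaces the whole path (and drops the query).
rooted = urljoin(base, '/act/123.html')

# A plain relative href replaces only the last path segment.
sibling = urljoin(base, 'index100.html')

# An absolute href is returned unchanged.
external = urljoin(base, 'http://example.com/page')

for resolved in (rooted, sibling, external):
    print(resolved)
```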

Caching

You can transparently cache downloaded data, which is especially useful during development. Simply pass --cache some_name to ragstoriches, which will use requests-cache for caching.

Usage as a library

You can also use ragstoriches as a library: instead of using the command-line tools, import a scraper and run it with its scrape() method. Remember to monkey-patch using gevent first.

See the source files for details, as there is not that much documentation available at this point.

Release history

0.3.1dev (pre-release), Mar 13, 2013

0.3dev (pre-release), Mar 12, 2013 (this version)

0.2, Mar 10, 2013

Download files

Source Distribution

ragstoriches-0.3dev.tar.gz (5.6 kB)

Uploaded Mar 12, 2013 Source

Hashes for ragstoriches-0.3dev.tar.gz
Algorithm	Hash digest
SHA256	`a216eca9cbfcd8bd0c2923e674a55ef6600fc28dec5d422fbc3ebd5f44046401`
MD5	`4e7d5dc6ccd3946daa01251ea3d971e9`
BLAKE2b-256	`a869ee1c274e2c28046b632c2c0b16f10dc5229d5581db395d66f0b90da7357d`