a library for scraping things
A Python library for scraping things.
Features include:
HTTP, HTTPS, FTP requests via an identical API
HTTP caching, compression and cookies
redirect following
request throttling
robots.txt compliance (optional)
robust error handling
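To make the request throttling feature concrete, here is a minimal sketch of how a per-minute limit can be turned into a delay between requests. It is not scrapelib's actual implementation: the Throttle class, its wait() method, and the clock and sleep parameters are hypothetical names introduced for illustration, and the code assumes Python 3.

```python
import time


class Throttle:
    """Hypothetical helper: allow at most `requests_per_minute` calls per minute."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        if requests_per_minute <= 0:
            raise ValueError("requests_per_minute must be positive")
        # Minimum number of seconds that must separate two requests.
        self.delay = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request; None before the first

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept
```

Calling wait() before each request spaces requests at least 60 / requests_per_minute seconds apart, so a limit of 10 gives a 6 second gap. The clock and sleep arguments exist only so the behaviour can be checked without real delays.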
scrapelib is a project of Sunlight Labs (c) 2011. All code is released under a BSD-style license; see LICENSE for details.
Written by Michael Stephens <mstephens@sunlightfoundation.com> and James Turk <jturk@sunlightfoundation.com>.
Contributors:
Joe Germuska - fix for IPython embedding
Alex Chiang - fix to test suite
Requirements
Python >= 2.6 (experimental support for Python 3.2)
httplib2
chardet
Installation
scrapelib is available on PyPI and can be installed via pip install scrapelib
PyPI package: http://pypi.python.org/pypi/scrapelib
Source: http://github.com/sunlightlabs/scrapelib
Documentation: http://scrapelib.readthedocs.org/en/latest/
Example Usage
    import scrapelib

    s = scrapelib.Scraper(requests_per_minute=10, allow_cookies=True,
                          follow_robots=True)

    # Grab Google front page
    s.urlopen('http://google.com')

    # Will raise RobotExclusionError
    s.urlopen('http://google.com/search')

    # Will be throttled to 10 HTTP requests per minute
    while True:
        s.urlopen('http://example.com')
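The usage example notes that a disallowed URL raises RobotExclusionError when follow_robots=True. As a rough illustration of the kind of check involved, the sketch below evaluates a URL against robots.txt rules using only the Python 3 standard library. It is not scrapelib's code: the is_allowed function and the inline RULES are hypothetical, and the rules are written inline so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, kept inline so the example runs offline.
RULES = """\
User-agent: *
Disallow: /search
""".splitlines()


def is_allowed(url, rules=RULES, user_agent="*"):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse the rule lines directly instead of fetching them
    return parser.can_fetch(user_agent, url)
```

A scraper that honours robots.txt would run a check like this before each request and raise an error, or skip the URL, when it returns False.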