Web scraper framework

Project description

Alcazar is a Python library that simplifies the task of writing web scrapers.

Some of its core features are:

  • Succinct syntax for locating relevant data within an HTML page, a JSON document, or a string of text
  • HTTP caching to disk for exact replay of scrapes without resubmitting HTTP requests
  • Throttling of requests to the same host
  • Automatic retries when an HTTP request fails, or when a page fails to parse as expected
  • Crawler facilities for maintaining a queue of URLs to visit
  • Fail-fast: by default, we'd rather crash than save incorrect or incomplete data
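
To make the caching and throttling features above more concrete, here is a minimal standard-library sketch of the two ideas. It is not Alcazar's implementation or API: the names `HostThrottle` and `DiskCache`, the `download` callable, and the choice of a SHA-256 hash of the URL as the cache key are all assumptions made for illustration.

```python
import hashlib
import time
from pathlib import Path
from urllib.parse import urlsplit


class HostThrottle:
    """Hypothetical helper: enforce a minimum delay between requests to one host."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        host = urlsplit(url).netloc
        last = self._last_request.get(host)
        if last is not None:
            remaining = self.min_interval - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request[host] = self._clock()


class DiskCache:
    """Hypothetical helper: store response bodies on disk, keyed by a hash of the URL."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, url):
        return self.directory / hashlib.sha256(url.encode("utf-8")).hexdigest()

    def fetch(self, url, download):
        path = self._path(url)
        if path.exists():
            return path.read_bytes()  # replay from disk without touching the network
        body = download(url)  # download is any callable returning bytes
        path.write_bytes(body)
        return body
```

In this sketch, a caller would invoke `throttle.wait(url)` only on a cache miss, so replayed scrapes run at full speed while live requests to the same host stay spaced out.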

Alcazar brings together the following libraries:

Getting Started

Alcazar is available on PyPI, so it can be installed using pip:

pip install alcazar

The simplest way to use the library is to instantiate a Scraper and call its fetch method:

>>> import alcazar
>>> scraper = alcazar.Scraper()
>>> page = scraper.fetch('')
>>> print('div[@id="toc"]/preceding-sibling::p[./b]').text.normalized)
Gorgie (/ˈɡɔːrɡiː/ GOR-gee) is a densely populated area of Edinburgh, Scotland. It is located in the west of the city and borders Murrayfield, Ardmillan and Dalry.

In this snippet:

  • we've fetched the HTML for the page
    • if a network error or HTTP error occurs, we retry the fetch a few times, sleeping for an increasing delay between attempts
  • we've parsed the HTML into a tree
    • using lxml's excellent handling of, and recovery from, "broken" HTML as seen in the wild
  • we've located the element we're interested in
    • here using an XPath expression, but we could've used a CSS selector too
    • we've checked that there was one and only one element that matched our query
    • else an exception would've been thrown, ensuring we capture only exactly what we wanted
  • we've extracted its text, removed all tags from it, and normalized its whitespace
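
Three of the steps above (retrying with increasing delays, insisting on exactly one match, and normalizing whitespace) can be sketched in plain Python. This is an illustration under stated assumptions, not Alcazar's code: the function names `fetch_with_retries`, `one` and `normalize_whitespace`, the doubling delay schedule, and the choice to retry on `OSError` are all hypothetical.

```python
import time


def fetch_with_retries(download, url, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call download(url), retrying on OSError with a delay that doubles each time."""
    for attempt in range(attempts):
        try:
            return download(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: fail fast rather than return nothing
            sleep(base_delay * 2 ** attempt)


def one(matches):
    """Return the single item in matches, or raise if there are zero or several."""
    matches = list(matches)
    if len(matches) != 1:
        raise ValueError(f"expected exactly 1 match, found {len(matches)}")
    return matches[0]


def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(text.split())
```

Raising when the match count is not exactly one is what keeps a changed page layout from silently producing empty or wrong data, which is the fail-fast behavior described earlier.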

See the samples directory for a taste of how Alcazar works.

Download files

Files for alcazar, version 0.5.5:

  • alcazar-0.5.5-py3-none-any.whl (89.3 kB), wheel, Python version py3
  • alcazar-0.5.5.tar.gz (65.5 kB), source