Web scraper framework
Alcazar is a Python library that simplifies the task of writing web scrapers.
Some of its core features are:
- Succinct syntax for locating relevant data within an HTML page, a JSON document, or a string of text
- HTTP caching to disk, for exact replay of scrapes without resubmitting HTTP requests
- Throttling of requests to the same host
- Automatic retries when an HTTP request fails, or when a page fails to parse as expected
- Crawler facilities for maintaining a queue of URLs to visit
- Fail-fast: by default, we'd rather crash than save incorrect or incomplete data
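To make the throttling feature concrete, here is a minimal sketch of per-host request throttling using only the standard library. It is not Alcazar's implementation: the `HostThrottle` class, its `min_interval` parameter, and the injectable `clock` and `sleep` arguments are all hypothetical names chosen for illustration.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper for illustration; not part of Alcazar's API.
    """

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until the host of `url` may be contacted again.

        Returns the number of seconds slept.
        """
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            # Sleep only for whatever remains of the minimum interval.
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to go out.
        self._last_request[host] = now + delay
        return delay
```

A scraper would call `throttle.wait(url)` just before each request. Requests to different hosts do not delay each other, because the last-request time is tracked per host.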
Alcazar brings together the following libraries:
Alcazar is available on PyPI, so it can be installed using:
pip install alcazar
The simplest way to use the library is to instantiate a Scraper and call its fetch method:
>>> import alcazar
>>> scraper = alcazar.Scraper()
>>> page = scraper.fetch('https://en.wikipedia.org/wiki/Gorgie')
>>> print(page.one('div[@id="toc"]/preceding-sibling::p[./b]').text.normalized)
Gorgie (/ˈɡɔːrɡiː/ GOR-gee) is a densely populated area of Edinburgh, Scotland. It is located in the west of the city and borders Murrayfield, Ardmillan and Dalry.
In this snippet:
- we've fetched the HTML for the page
  - if any network error or HTTP error happens, we'll retry the fetch a few times, sleeping for an increasing delay between attempts
- we've parsed the HTML into a tree
  - using lxml's excellent handling of, and recovery from, "broken" HTML as seen in the wild
- we've located the element we're interested in
  - here using an XPath expression, but we could've used a CSS selector too
- we've checked that there was one and only one element that matched our query
  - otherwise an exception would've been thrown, ensuring we capture exactly what we wanted
- we've extracted its text, removed all tags from it, and normalized its whitespace
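Three of the steps above (retrying with increasing delays, insisting on exactly one match, and normalizing whitespace) can be sketched in plain Python. This is a rough illustration under stated assumptions, not Alcazar's actual code: the function names `retry`, `one` and `normalize_whitespace`, the doubling delay schedule, and the default of four attempts are all hypothetical.

```python
import re
import time


def retry(action, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `action` until it succeeds, doubling the delay after each failure.

    Hypothetical helper; the delay schedule is an assumption.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: crash rather than return bad data
            sleep(base_delay * 2 ** attempt)


def one(matches):
    """Return the single item in `matches`; raise if there are zero or several."""
    matches = list(matches)
    if len(matches) != 1:
        raise ValueError(f"expected exactly 1 match, got {len(matches)}")
    return matches[0]


def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()
```

Raising in `one` instead of silently picking the first match is the fail-fast choice: if the page layout changes, the scrape stops instead of saving the wrong element.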
See the samples directory for a taste of how Alcazar works.