XML/HTML scraper using XPath queries.
Project description
Copyright (C) 2014-2018 H. Turgut Uyar <uyar@tekir.org>
Piculet is a module for extracting data from XML or HTML documents using XPath queries. It consists of a single source file with no dependencies other than the standard library, which makes it very easy to integrate into applications. It also provides a command line interface.
- PyPI:
- Repository:
- Documentation:
Piculet has been tested with Python 2.7, Python 3.4+, PyPy2 5.7+, and PyPy3 5.7+. You can install the latest version using pip:
pip install piculet
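The idea behind XPath-based extraction can be sketched with the standard library alone. The snippet below is an illustration, not Piculet's own API: the sample document, the `RULES` table, and the `extract` helper are all hypothetical, and it relies on the limited XPath subset supported by `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this illustration.
DOCUMENT = """
<html>
  <head><title>The Shining</title></head>
  <body>
    <h1>The Shining (<span class="year">1980</span>)</h1>
    <ul class="genres">
      <li>Horror</li>
      <li>Drama</li>
    </ul>
  </body>
</html>
"""

# Hypothetical rule table: output key -> XPath query (ElementTree subset).
RULES = {
    "title": ".//title",
    "year": ".//span[@class='year']",
    "genres": ".//ul[@class='genres']/li",
}


def extract(document, rules):
    """Run each XPath query on the document and collect the matched text."""
    root = ET.fromstring(document.strip())
    result = {}
    for key, path in rules.items():
        # Keep the text of every matching element, skipping empty ones.
        texts = [node.text for node in root.findall(path) if node.text]
        if texts:
            result[key] = texts
    return result


data = extract(DOCUMENT, RULES)
print(data)
```

Each key maps to the list of strings matched by its query. Turning such a list into a single value (for example by taking the first item or concatenating all items) is the kind of step the changelog below refers to as reducing.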
History
1.0b7 (2018-03-21)
Dropped support for Python 3.3.
Fixes for handling Unicode data in HTML for Python 2.
Added registry for preprocessors.
1.0b6 (2018-01-17)
Support for writing specifications in YAML.
1.0b5 (2018-01-16)
Added a class-based API for writing specifications.
Added predefined transformation functions.
Removed callables from specification maps. Use the new API instead.
Added support for registering new reducers and transformers.
Added support for defining sections in a document.
Refactored XPath evaluation method in order to parse path expressions once.
Preprocessing will be done only once when the tree is built.
Concatenation is now the default reducing operation.
1.0b4 (2018-01-02)
Added “--version” option to command line arguments.
Added option to force the use of lxml’s HTML builder.
Fixed the error where non-truthy values would be excluded from the result.
Added support for transforming node text during preprocessing.
Added separate preprocessing function to API.
Renamed the “join” reducer as “concat”.
Renamed the “foreach” keyword for keys as “section”.
Removed some low level debug messages to substantially increase speed.
1.0b3 (2017-07-25)
Removed the caching feature.
1.0b2 (2017-06-16)
Added helper function for getting cache hash keys of URLs.
1.0b1 (2017-04-26)
Added optional value transformations.
Added support for custom reducer callables.
Added command-line option for scraping documents from local files.
1.0a2 (2017-04-04)
Added support for Python 2.7.
Fixed lxml support.
1.0a1 (2016-08-24)
First release on PyPI.
Hashes for piculet-1.0b7-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | c14ae0bd8bf0e1ba771c8c2532fcea3259f2133c89a1f85257a52abf1d8e210e
MD5 | 08bab36fcdb7f9177513abdaeeae1fa2
BLAKE2b-256 | f93f688365b8255c035d815887d56a68e4b21a509d37c860f32202f23f971644