XML/HTML scraper using XPath queries.
Project Description
Copyright (C) 2014-2018 H. Turgut Uyar <uyar@tekir.org>
Piculet is a module for extracting data from XML or HTML documents using XPath queries. It consists of a single source file with no dependencies other than the standard library, which makes it very easy to integrate into applications. It also provides a command line interface.
- PyPI: https://pypi.python.org/pypi/piculet/
- Repository: https://bitbucket.org/uyar/piculet
- Documentation: https://piculet.readthedocs.io/
Piculet has been tested with Python 2.7, Python 3.4+, PyPy2 5.7+, and PyPy3 5.7+. You can install the latest version using pip:
pip install piculet
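To give a feel for what XPath-driven extraction looks like, here is a minimal sketch that uses only the standard library's `xml.etree.ElementTree`. It is not Piculet's API: the sample document, the `RULES` mapping, and the `extract` function are hypothetical names made up for this illustration, and ElementTree supports only a limited subset of XPath. See the documentation linked above for Piculet's actual specification format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (well-formed, so the XML parser accepts it).
DOCUMENT = """
<html>
  <head><title>The Shining</title></head>
  <body>
    <span class="year">1980</span>
    <ul class="genres">
      <li>Horror</li>
      <li>Drama</li>
    </ul>
  </body>
</html>
"""

# Hypothetical rule set: output key -> XPath query
# (restricted to the XPath subset that ElementTree understands).
RULES = {
    "title": ".//title",
    "year": ".//span[@class='year']",
    "genres": ".//ul[@class='genres']/li",
}


def extract(document, rules):
    """Apply every XPath rule to the document and collect the matched texts."""
    root = ET.fromstring(document)
    data = {}
    for key, path in rules.items():
        # Each query may match several nodes, so keep all of their texts.
        texts = [(node.text or "").strip() for node in root.findall(path)]
        if texts:
            data[key] = texts
    return data


result = extract(DOCUMENT, RULES)
print(result)
```

Each key maps to the list of texts matched by its query. Collapsing that list into a single value (for example by concatenation) and converting it to another type are the roles that the history below assigns to reducers and transformers.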
History
1.0b7 (2018-03-21)
- Dropped support for Python 3.3.
- Fixes for handling Unicode data in HTML for Python 2.
- Added registry for preprocessors.
1.0b6 (2018-01-17)
- Support for writing specifications in YAML.
1.0b5 (2018-01-16)
- Added a class-based API for writing specifications.
- Added predefined transformation functions.
- Removed callables from specification maps. Use the new API instead.
- Added support for registering new reducers and transformers.
- Added support for defining sections in documents.
- Refactored XPath evaluation method in order to parse path expressions once.
- Preprocessing will be done only once when the tree is built.
- Concatenation is now the default reducing operation.
1.0b4 (2018-01-02)
- Added “--version” option to command line arguments.
- Added option to force the use of lxml’s HTML builder.
- Fixed the error where non-truthy values would be excluded from the result.
- Added support for transforming node text during preprocessing.
- Added separate preprocessing function to API.
- Renamed the “join” reducer as “concat”.
- Renamed the “foreach” keyword for keys as “section”.
- Removed some low level debug messages to substantially increase speed.
1.0b3 (2017-07-25)
- Removed the caching feature.
1.0b2 (2017-06-16)
- Added helper function for getting cache hash keys of URLs.
1.0b1 (2017-04-26)
- Added optional value transformations.
- Added support for custom reducer callables.
- Added command-line option for scraping documents from local files.
1.0a2 (2017-04-04)
- Added support for Python 2.7.
- Fixed lxml support.
1.0a1 (2016-08-24)
- First release on PyPI.
Download files
Filename (size) | File type | Python version | Upload date |
---|---|---|---|
piculet-1.0b7-py2.py3-none-any.whl (13.9 kB) | Wheel | py2.py3 | Mar 21, 2018 |
piculet-1.0b7.tar.gz (32.8 kB) | Source | None | Mar 21, 2018 |