HTML parsing library using YAML definitions and XPath
PyParsy
PyParsy is an HTML parsing library driven by YAML definition files. The idea is to use the YAML file as a statement of intent: you describe the result you want and let PyParsy do the heavy lifting. The difference from similar libraries (e.g. selectorlib) is that it supports multiple versions of selectors for a single field, so you do not need to create a new YAML definition file for every change on a website.
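The fallback idea behind multiple selector versions can be sketched in a few lines. This is only an illustration and not PyParsy's implementation or its YAML syntax: it uses the standard library `re` module, and `first_match`, `TITLE_PATTERNS` and both sample pages are hypothetical names and data made up for this sketch.

```python
import re
from typing import Optional, Sequence


def first_match(html: str, patterns: Sequence[str]) -> Optional[str]:
    """Try each pattern in order and return the first captured group found."""
    for pattern in patterns:
        match = re.search(pattern, html)
        if match:
            return match.group(1).strip()
    return None  # no selector version matched this page


# Two hypothetical versions of the same "title" selector:
# an older page layout and a newer one after a site redesign.
TITLE_PATTERNS = [
    r'<h1 class="title">(.*?)</h1>',
    r'<h1 class="card-title">(.*?)</h1>',
]

old_page = '<html><h1 class="title">Best Sellers</h1></html>'
new_page = '<html><h1 class="card-title"> Best Sellers </h1></html>'

print(first_match(old_page, TITLE_PATTERNS))
print(first_match(new_page, TITLE_PATTERNS))
```

Both calls yield the same title, which is the point: one definition keeps working across layout changes as long as one of its selector versions still matches.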
The YAML files contain:
- The desired structure of the output
- XPath/CSS/Regex selectors for the element extraction
- Return type definition
- Optional children of the field
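To make the return type idea concrete, here is a minimal sketch of how an extracted string might be converted according to its declared `return_type`. It assumes only the scalar type names that appear in the example definition (STRING, INTEGER, FLOAT); `CONVERTERS` and `convert` are hypothetical helpers and do not describe how PyParsy performs the conversion internally.

```python
from typing import Callable, Dict, Union

Scalar = Union[str, int, float]

# Hypothetical mapping from a return_type name to a converter function.
CONVERTERS: Dict[str, Callable[[str], Scalar]] = {
    "STRING": lambda raw: raw.strip(),
    "INTEGER": lambda raw: int(raw.strip()),
    "FLOAT": lambda raw: float(raw.strip()),
}


def convert(raw: str, return_type: str) -> Scalar:
    """Convert raw extracted text to the type named by return_type."""
    try:
        converter = CONVERTERS[return_type]
    except KeyError:
        raise ValueError(f"unsupported return_type: {return_type!r}") from None
    return converter(raw)


print(convert(" 2 ", "INTEGER"))
print(convert("19.99", "FLOAT"))
print(convert("  Best Sellers ", "STRING"))
```

Real pages often wrap numbers in currency symbols or thousands separators (for example "$1,299.00"); this sketch does not clean such values and would raise a `ValueError` for them.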
Features
- YAML File definitions
- YAML File validation
- Intent instead of coding
- Support for XPath, CSS and Regex selectors
- Different output formats e.g. JSON, YAML, XML
- Somewhat opinionated
- 99% test coverage
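As an illustration of what definition validation involves, the sketch below checks a parsed definition (a nested dict) recursively. It is written in plain Python rather than with the `schema` package the project uses, and it is not PyParsy's actual schema: `validate`, `SELECTOR_TYPES` and `RETURN_TYPES` are hypothetical, and the `CSS` and `REGEX` spellings are assumptions (only `XPATH` appears in the example definition).

```python
from typing import Any, Dict, List

# Assumed vocabularies: XPATH, STRING, INTEGER, FLOAT and MAP appear in the
# example definition; CSS and REGEX are guessed spellings.
SELECTOR_TYPES = {"XPATH", "CSS", "REGEX"}
RETURN_TYPES = {"STRING", "INTEGER", "FLOAT", "MAP"}


def validate(definition: Dict[str, Any], path: str = "") -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: List[str] = []
    for name, field in definition.items():
        where = f"{path}.{name}" if path else name
        if not isinstance(field, dict):
            errors.append(f"{where}: expected a mapping")
            continue
        if not isinstance(field.get("selector"), str):
            errors.append(f"{where}: 'selector' must be a string")
        if field.get("selector_type") not in SELECTOR_TYPES:
            errors.append(f"{where}: unknown selector_type")
        if field.get("return_type") not in RETURN_TYPES:
            errors.append(f"{where}: unknown return_type")
        children = field.get("children")
        if isinstance(children, dict):
            errors.extend(validate(children, where))  # recurse into nested fields
        elif children is not None:
            errors.append(f"{where}: 'children' must be a mapping")
    return errors


good = {
    "page": {"selector": "//a/text()", "selector_type": "XPATH", "return_type": "INTEGER"},
}
bad = {
    "products": {
        "selector": "//div",
        "selector_type": "XPATH",
        "return_type": "MAP",
        "children": {
            "price": {"selector": "//span/text()", "selector_type": "XPATH", "return_type": "MONEY"},
        },
    },
}

print(validate(good))
print(validate(bad))
```

Collecting all problems in a list, instead of raising on the first one, lets a user fix a definition file in a single pass.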
Installation
Using pip:
pip install pyparsy
Running Tests
To run the tests, use the following command:
poetry run pytest
Examples
As an example, consider the Amazon bestseller page. First we write the YAML definition file:
title:
  selector: //div[contains(@class, "_card-title_")]/h1/text()
  selector_type: XPATH
  return_type: STRING
page:
  selector: //ul[contains(@class, "a-pagination")]/li[@class="a-selected"]/a/text()
  selector_type: XPATH
  return_type: INTEGER
products:
  selector: //div[@id="gridItemRoot"]
  selector_type: XPATH
  multiple: true
  return_type: MAP
  children:
    image:
      selector: //img[contains(@class, "a-dynamic-image")]/@src
      selector_type: XPATH
      return_type: STRING
    title:
      selector: //a[@class="a-link-normal"]/span/div/text()
      selector_type: XPATH
      return_type: STRING
    price:
      selector: //span[contains(@class, "a-color-price")]/span/text()
      selector_type: XPATH
      return_type: FLOAT
    asin:
      selector: //div[contains(@class, "sc-uncoverable-faceout")]/@id
      selector_type: XPATH
      return_type: STRING
    reviews_count:
      selector: //div[contains(@class, "sc-uncoverable-faceout")]/div/div/a/span/text()
      selector_type: XPATH
      return_type: INTEGER
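To see the output structure this definition describes, the sketch below mirrors part of it as a Python dict and walks it recursively, listing each field path with its return type. The helper `iter_fields` and the `[]` marker for `multiple: true` fields are hypothetical conventions chosen for this illustration, not part of PyParsy.

```python
from typing import Any, Dict, Iterator, Tuple

# A trimmed mirror of the definition above (two top-level fields, two children).
definition: Dict[str, Any] = {
    "page": {
        "selector": '//ul[contains(@class, "a-pagination")]/li[@class="a-selected"]/a/text()',
        "selector_type": "XPATH",
        "return_type": "INTEGER",
    },
    "products": {
        "selector": '//div[@id="gridItemRoot"]',
        "selector_type": "XPATH",
        "multiple": True,
        "return_type": "MAP",
        "children": {
            "price": {
                "selector": '//span[contains(@class, "a-color-price")]/span/text()',
                "selector_type": "XPATH",
                "return_type": "FLOAT",
            },
            "asin": {
                "selector": '//div[contains(@class, "sc-uncoverable-faceout")]/@id',
                "selector_type": "XPATH",
                "return_type": "STRING",
            },
        },
    },
}


def iter_fields(fields: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (dotted path, return_type) for every field, depth first."""
    for name, field in fields.items():
        path = f"{prefix}.{name}" if prefix else name
        if field.get("multiple"):
            path += "[]"  # this field yields a list of results
        yield path, field["return_type"]
        yield from iter_fields(field.get("children", {}), path)


for path, return_type in iter_fields(definition):
    print(f"{path}: {return_type}")
```

Read this way, `products` is a list of maps and each child selector describes one key of those maps.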
For the sake of the example, let's store the file as amazon_bestseller.yaml.
Then we can use the PyParsy library in our code:
import httpx
from pyparsy import Parsy


def main():
    html = httpx.get("https://www.amazon.com/gp/bestsellers/hi/?ie=UTF8&ref_=sv_hg_1")
    parser = Parsy("amazon_bestseller.yaml")
    result = parser.parse(html.text)
    print(result)


if __name__ == "__main__":
    main()
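If the parsed result is a plain nested structure of dicts, lists and scalars (an assumption; the exact type returned by `parse` is not documented here), it can be serialized with the standard library. The `result` value below is a hypothetical placeholder shaped like the definition file, not real output from the page.

```python
import json

# Hypothetical placeholder result, shaped like the definition file.
result = {
    "title": "Example title",
    "page": 1,
    "products": [
        {
            "title": "Example product",
            "price": 19.99,
            "asin": "EXAMPLE-ID",
            "reviews_count": 10,
        },
    ],
}

# Pretty-print as JSON; ensure_ascii=False keeps non-ASCII titles readable.
as_json = json.dumps(result, indent=2, ensure_ascii=False)
print(as_json)
```

This uses `json.dumps` directly and says nothing about PyParsy's own output format options.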
For more examples please see the tests for the library.
Documentation
Documentation (hopefully some day)
Acknowledgements
- selectorlib - It is the main inspiration for this project
- Scrapy - One of the best crawling libraries for Python
- parsel - The Scrapy parsing library; it is used heavily in this project and can be considered its main dependency
- schema - Used for validating the YAML file schema
Contributing
Download files
File details
Details for the file pyparsy-0.1.2.tar.gz.
File metadata
- Download URL: pyparsy-0.1.2.tar.gz
- Upload date:
- Size: 7.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.2.2 CPython/3.9.15 Darwin/22.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4472f2e5e6f299423062dfbd69dde9385a2f25f44ce15a5b1fbbfac719aa48f8 |
| MD5 | f60f7f311b4e33cc8f0938df33c034f9 |
| BLAKE2b-256 | a15df0c68cb3ab0db7a61dcea9553f8151d10a0e87c059f4b76685ee2b7e3006 |
File details
Details for the file pyparsy-0.1.2-py3-none-any.whl.
File metadata
- Download URL: pyparsy-0.1.2-py3-none-any.whl
- Upload date:
- Size: 8.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.2.2 CPython/3.9.15 Darwin/22.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b847eb59aa722b336c67484cce211446ceafeba5ac113109c6b3e360ed714ab5 |
| MD5 | 21bbb8842a60c94eebaa40fd5b2a0434 |
| BLAKE2b-256 | aafa6de326f56a7ae508f617dacb3fe782b6a08680f8390131796f8f6bed24f1 |