PyParsy
HTML parsing library using YAML definitions and XPath

PyParsy is an HTML parsing library driven by YAML definition files. The idea is to use the YAML file as a statement of intent: describe the result you want, and let PyParsy do the heavy lifting for you.
The YAML files contain:
- The desired structure of the output
- XPath variants of the parsed items
Features
- YAML file definitions
- Intent instead of coding
- Support for XPath and Regex
- Different output formats, e.g. JSON, YAML, XML
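PyParsy's own Regex API is not shown in this README, but the kind of extraction a Regex selector enables can be illustrated with Python's standard `re` module on a hypothetical price snippet (the markup and pattern below are illustrative, not part of PyParsy):

```python
import re

# Hypothetical fragment of product-price markup
html_snippet = '<span class="a-color-price"><span>$19.99</span></span>'

# Pull out the numeric price and coerce it to a float,
# roughly what a Regex selector with a FLOAT return type would do
match = re.search(r"\$(\d+\.\d{2})", html_snippet)
price = float(match.group(1)) if match else None
print(price)  # 19.99
```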
Installation
Using pip:

```shell
pip install pyparsy
```
Running Tests
To run tests, run the following command:

```shell
poetry run pytest
```
Examples
As an example, consider the Amazon Best Sellers page. First, we define the .yaml definition file:
```yaml
title:
  selector: //div[contains(@class, "_card-title_")]/h1/text()
  selector_type: XPATH
  return_type: STRING
page:
  selector: //ul[contains(@class, "a-pagination")]/li[@class="a-selected"]/a/text()
  selector_type: XPATH
  return_type: INTEGER
products:
  selector: //div[@id="gridItemRoot"]
  selector_type: XPATH
  multiple: true
  return_type: MAP
  children:
    image:
      selector: //img[contains(@class, "a-dynamic-image")]/@src
      selector_type: XPATH
      return_type: STRING
    title:
      selector: //a[@class="a-link-normal"]/span/div/text()
      selector_type: XPATH
      return_type: STRING
    price:
      selector: //span[contains(@class, "a-color-price")]/span/text()
      selector_type: XPATH
      return_type: FLOAT
    asin:
      selector: //div[contains(@class, "sc-uncoverable-faceout")]/@id
      selector_type: XPATH
      return_type: STRING
    reviews_count:
      selector: //div[contains(@class, "sc-uncoverable-faceout")]/div/div/a/span/text()
      selector_type: XPATH
      return_type: INTEGER
```
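To see what one of these XPath selectors does in isolation, here is a small standalone sketch that applies the title selector with lxml directly (not through PyParsy) to a hypothetical fragment of the page's markup:

```python
from lxml import html

# Hypothetical snippet mirroring the page-title markup the selector targets
doc = html.fromstring('<div class="_card-title_abc"><h1>Best Sellers</h1></div>')

# The same XPath used for the top-level title field in the definition file
titles = doc.xpath('//div[contains(@class, "_card-title_")]/h1/text()')
print(titles)  # ['Best Sellers']
```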
For the example's sake, let's store the file as amazon_bestseller.yaml. Then we can use the PyParsy library in our code:
```python
import httpx
from pyparsy import Parsy


def main():
    html = httpx.get("https://www.amazon.com/gp/bestsellers/hi/?ie=UTF8&ref_=sv_hg_1")
    parser = Parsy("amazon_bestseller.yaml")
    result = parser.parse(html.text)
    print(result)


if __name__ == "__main__":
    main()
```
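Assuming the parsed result is a plain Python data structure shaped by the definition file (an assumption; the field names below are taken from the YAML above, the values are made up), the JSON output format mentioned in the features can be approximated with the standard library:

```python
import json

# Hypothetical result shape implied by the definition file above
result = {
    "title": "Best Sellers",
    "page": 1,
    "products": [
        {
            "image": "https://example.com/img.jpg",
            "title": "Example Product",
            "price": 19.99,
            "asin": "B000000000",
            "reviews_count": 120,
        }
    ],
}

# Serialize to JSON, one of the output formats listed in the features
print(json.dumps(result, indent=2))
```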
For more examples, please see the library's tests.
Documentation
Documentation (hopefully coming some day)
Acknowledgements
- selectorlib - The main inspiration for this project
- Scrapy - One of the best crawling libraries for Python
- Tiangolo - His projects are a real inspiration for producing great software
Contributing
File details
Details for the file pyparsy-0.1.1.tar.gz.
File metadata
- Download URL: pyparsy-0.1.1.tar.gz
- Upload date:
- Size: 6.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.2.2 CPython/3.9.15 Darwin/22.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3cb9719c45b1a5d1be463183ee29e91da424421d388a5a40f152db629b5fece4 |
| MD5 | bb55869c7fdb19952ae07f8f6604f58f |
| BLAKE2b-256 | bd765e05a42a2558b2f16fc39578b26c60c6d51c1f7c983c98bd889fcc4da5ac |
File details
Details for the file pyparsy-0.1.1-py3-none-any.whl.
File metadata
- Download URL: pyparsy-0.1.1-py3-none-any.whl
- Upload date:
- Size: 7.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.2.2 CPython/3.9.15 Darwin/22.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 48a837fdee0a4bbb523d77c819b65673209604e4471526457411cce51790c11c |
| MD5 | c1ad90a75ef5dfe42c9655ab8138a90e |
| BLAKE2b-256 | 1c247b82d0b8381db6026d80879f0e7c44768f41b4dd2c8fe67138359bd5f0ba |