HTML parsing library using YAML definitions and XPath

PyParsy

PyParsy is an HTML parsing library driven by YAML definition files. The idea is to use the YAML file as a statement of intent: you describe the result you want, and Parsy does the heavy lifting for you. The main difference from similar libraries (e.g. selectorlib) is that it supports multiple versions of a selector for a single field. This way you do not need to create a new YAML definition file every time a website changes.
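To illustrate the idea of several selector versions for one field, here is a minimal sketch using only the standard library. It is not PyParsy's implementation or its YAML syntax: the patterns, the HTML snippets and the `extract_first` helper are all hypothetical. It only shows the general technique of trying selectors in order and keeping the first match.

```python
import re
from typing import List, Optional

# Hypothetical selector versions for a single field, newest layout first.
TITLE_PATTERNS = [
    r'<h1 class="card-title">(.*?)</h1>',  # assumed current layout
    r'<h1 id="title">(.*?)</h1>',          # assumed older layout
]


def extract_first(html: str, patterns: List[str]) -> Optional[str]:
    """Return the first capture group of the first pattern that matches."""
    for pattern in patterns:
        match = re.search(pattern, html, flags=re.DOTALL)
        if match:
            return match.group(1).strip()
    return None  # no selector version matched


old_page = '<html><body><h1 id="title">Best Sellers</h1></body></html>'
new_page = '<html><body><h1 class="card-title">Best Sellers</h1></body></html>'

# Both layouts are handled by the same list of patterns.
print(extract_first(old_page, TITLE_PATTERNS))
print(extract_first(new_page, TITLE_PATTERNS))
```

With this approach a layout change on the website only requires adding one more selector to the list, while pages that still use the old layout keep working.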

The YAML files contain:

The desired structure of the output
XPath/CSS/Regex selectors for the element extraction
Return type definition
Optional children of the field
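A return type implies that the extracted text is converted to a Python value. The sketch below shows one way such a conversion could work for the STRING, INTEGER and FLOAT types used later in this README. The cleaning rules and the `convert` helper are assumptions made for illustration; PyParsy's actual conversion may differ.

```python
import re
from typing import Any, Callable, Dict


def _to_int(text: str) -> int:
    # Drop everything that is not a digit, so "1,234" becomes 1234.
    return int(re.sub(r"\D", "", text))


def _to_float(text: str) -> float:
    # Keep digits and the decimal point, so "$12.99" becomes 12.99.
    return float(re.sub(r"[^\d.]", "", text))


# Hypothetical mapping from return type names to converter functions.
CONVERTERS: Dict[str, Callable[[str], Any]] = {
    "STRING": str.strip,
    "INTEGER": _to_int,
    "FLOAT": _to_float,
}


def convert(text: str, return_type: str) -> Any:
    """Convert extracted text according to the declared return type."""
    return CONVERTERS[return_type](text)


print(convert("  Best Sellers ", "STRING"))
print(convert("1,234", "INTEGER"))
print(convert("$12.99", "FLOAT"))
```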

Features

YAML file definitions
YAML file validation
Intent instead of coding
Support for XPath, CSS and Regex selectors
Different output formats, e.g. JSON, YAML, XML
Somewhat opinionated
99% test coverage
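To give a feel for what validating a definition involves, here is a rough sketch in plain Python. PyParsy itself relies on the schema package for this, so the code below is not the library's validator. The `validate` helper is hypothetical, and the set of return types is an assumption limited to the values that appear in the example of this README.

```python
from typing import Any, Dict, List

SELECTOR_TYPES = {"XPATH", "CSS", "REGEX"}
# Assumption: only the return types that appear in this README's example.
RETURN_TYPES = {"STRING", "INTEGER", "FLOAT", "MAP"}
REQUIRED_KEYS = ("selector", "selector_type", "return_type")


def validate(definition: Dict[str, Any], path: str = "") -> List[str]:
    """Collect human-readable errors for a (nested) field definition."""
    errors: List[str] = []
    for name, field in definition.items():
        where = f"{path}{name}"
        if not isinstance(field, dict):
            errors.append(f"{where}: expected a mapping")
            continue
        for key in REQUIRED_KEYS:
            if key not in field:
                errors.append(f"{where}: missing '{key}'")
        selector_type = field.get("selector_type")
        if selector_type is not None and selector_type not in SELECTOR_TYPES:
            errors.append(f"{where}: unknown selector_type '{selector_type}'")
        return_type = field.get("return_type")
        if return_type is not None and return_type not in RETURN_TYPES:
            errors.append(f"{where}: unknown return_type '{return_type}'")
        children = field.get("children")
        if isinstance(children, dict):
            # Validate nested fields with a dotted path prefix.
            errors.extend(validate(children, path=f"{where}."))
    return errors


definition = {
    "title": {"selector": "//h1/text()", "selector_type": "XPATH", "return_type": "STRING"},
    "products": {
        "selector": "//div[@id='gridItemRoot']",
        "selector_type": "XPATH",
        "multiple": True,
        "return_type": "MAP",
        "children": {
            # Deliberately broken: misspelled selector type, no return type.
            "price": {"selector": "//span/text()", "selector_type": "XPTH"},
        },
    },
}

for error in validate(definition):
    print(error)
```

Reporting all problems with their dotted field path, instead of stopping at the first one, makes it easier to fix a large definition file in a single pass.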

Installation

Using pip:

pip install pyparsy

Running Tests

To run the tests, use the following command:

  poetry run pytest

Examples

As an example, consider the Amazon bestseller page. First, we define the YAML definition file:

title:
  selector: //div[contains(@class, "_card-title_")]/h1/text()
  selector_type: XPATH
  return_type: STRING
page:
  selector: //ul[contains(@class, "a-pagination")]/li[@class="a-selected"]/a/text()
  selector_type: XPATH
  return_type: INTEGER
products:
  selector: //div[@id="gridItemRoot"]
  selector_type: XPATH
  multiple: true
  return_type: MAP
  children:
    image:
      selector: //img[contains(@class, "a-dynamic-image")]/@src
      selector_type: XPATH
      return_type: STRING
    title:
      selector: //a[@class="a-link-normal"]/span/div/text()
      selector_type: XPATH
      return_type: STRING
    price:
      selector: //span[contains(@class, "a-color-price")]/span/text()
      selector_type: XPATH
      return_type: FLOAT
    asin:
      selector: //div[contains(@class, "sc-uncoverable-faceout")]/@id
      selector_type: XPATH
      return_type: STRING
    reviews_count:
      selector: //div[contains(@class, "sc-uncoverable-faceout")]/div/div/a/span/text()
      selector_type: XPATH
      return_type: INTEGER

For the sake of the example, let's store the file as amazon_bestseller.yaml.

Then we can use the PyParsy library in our code:

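import httpx
from pyparsy import Parsy


def main():
    # Fetch the bestseller page; httpx.get returns a Response object.
    response = httpx.get("https://www.amazon.com/gp/bestsellers/hi/?ie=UTF8&ref_=sv_hg_1")
    # Load the YAML definition and apply it to the HTML text of the page.
    parser = Parsy("amazon_bestseller.yaml")
    result = parser.parse(response.text)
    print(result)


if __name__ == "__main__":
    main()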
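If the parsed result is a plain structure of dicts, lists and scalars that mirrors the definition (an assumption, not something this README confirms), it can be serialized with the standard library. The `result` value below is a hypothetical stand-in with placeholder values, not real scraped data.

```python
import json

# Hypothetical stand-in for a parsed result; all values are placeholders.
result = {
    "title": "Example title",
    "page": 1,
    "products": [
        {
            "image": "https://example.com/image.jpg",
            "title": "Example product",
            "price": 9.99,
            "asin": "EXAMPLEASIN",
            "reviews_count": 123,
        }
    ],
}

# ensure_ascii=False keeps non-ASCII product titles readable in the output.
as_json = json.dumps(result, indent=2, ensure_ascii=False)
print(as_json)
```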

For more examples please see the tests for the library.

Documentation

Documentation (hopefully some day)

Acknowledgements

selectorlib - The main inspiration for this project
Scrapy - One of the best crawling libraries for Python
parsel - Scrapy's parsing library; it is heavily used in this project and can be considered its main dependency
schema - Used for validating the YAML file schema

Contributing
