Skip to main content

A library for converting any text (xml, html, plain text, stdout, etc) to python datatypes

Project description

Hatch project Documentation Status CI License Version Python-versions codecov

Scrape-schema

This library is designed to write structured, readable, reusable parsers for html, raw text and is inspired by dataclasses and ORM libraries

!!! warning

Scrape-schema is currently in Pre-Alpha. Please expect breaking changes.

Motivation

Simplifying parsers support, where it is difficult to use or the complete absence of the API interfaces and decrease boilerplate code

Also structuring, data serialization and use as an intermediate layer for third-party serialization libraries: json, dataclasses, pydantic, etc


Features

  • Built top on Parsel
  • re, css, xpath, jmespath, chompjs features
  • Fluent interface simular original parsel.Selector API for easy to use.
  • decrease boilerplate code
  • Does not depend on the http client implementation, use any!
  • Python 3.8+ support
  • Reusability, code consistency
  • Dataclass-like structure
  • Partial support auto type-casting from annotations (str, int, float, bool, list, dict, Optional)
  • Detailed logging process to make it easier to write a parser

Install

pip install scrape-schema

Example

The fields interface is similar to the original parsel

# Example from parsel documentation
>>> from parsel import Selector
>>> text = """
        <html>
            <body>
                <h1>Hello, Parsel!</h1>
                <ul>
                    <li><a href="http://example.com">Link 1</a></li>
                    <li><a href="http://scrapy.org">Link 2</a></li>
                </ul>
                <script type="application/json">{"a": ["b", "c"]}</script>
            </body>
        </html>"""
>>> selector = Selector(text=text)
>>> selector.css('h1::text').get()
'Hello, Parsel!'
>>> selector.xpath('//h1/text()').re(r'\w+')
['Hello', 'Parsel']
>>> for li in selector.css('ul > li'):
...     print(li.xpath('.//@href').get())
http://example.com
http://scrapy.org
>>> selector.css('script::text').jmespath("a").get()
'b'
>>> selector.css('script::text').jmespath("a").getall()
['b', 'c']
from scrape_schema import BaseSchema, Parsel, Sc


class Schema(BaseSchema):
    h1: Sc[str, Parsel().css('h1::text').get()]
    words: Sc[list[str], Parsel().xpath('//h1/text()').re(r'\w+')]
    urls: Sc[list[str], Parsel().css('ul > li').xpath('.//@href').getall()]
    sample_jmespath_1: Sc[str, Parsel().css('script::text').jmespath("a").get()]
    sample_jmespath_2: Sc[list[str], Parsel().css('script::text').jmespath("a").getall()]


text = """
        <html>
            <body>
                <h1>Hello, Parsel!</h1>
                <ul>
                    <li><a href="http://example.com">Link 1</a></li>
                    <li><a href="http://scrapy.org">Link 2</a></li>
                </ul>
                <script type="application/json">{"a": ["b", "c"]}</script>
            </body>
        </html>"""

print(Schema(text).dict())
# {'h1': 'Hello, Parsel!',
# 'words': ['Hello', 'Parsel'],
# 'urls': ['http://example.com', 'http://scrapy.org'],
# 'sample_jmespath_1': 'b',
# 'sample_jmespath_2': ['b', 'c']}

See more examples and documentation for get more information/examples


This project is licensed under the terms of the MIT license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrape_schema-0.5.4.tar.gz (18.6 kB view hashes)

Uploaded Source

Built Distribution

scrape_schema-0.5.4-py3-none-any.whl (22.8 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page