A library for converting any text (XML, HTML, plain text, stdout, etc.) to Python datatypes


Scrape-schema

This library helps you write structured, readable, reusable parsers for HTML and raw text. It is inspired by dataclasses and ORM libraries.

!!! warning

    Scrape-schema is currently in Pre-Alpha. Please expect breaking changes.

Motivation

Simplify maintaining parsers for targets that are hard to work with or lack an API, and reduce boilerplate code.

It also structures and serializes data, and can serve as an intermediate layer for third-party serialization libraries: json, dataclasses, pydantic, etc.
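As a sketch of that intermediate-layer idea: a parsed result is a plain dict, so it can be loaded straight into, say, a stdlib dataclass. The `parsed` dict below is a hypothetical result shaped like what a schema's `.dict()` call might return; it stands in for the library here.

```python
from dataclasses import dataclass

# Hypothetical parsed result, shaped like a schema's .dict() output
# (assumption for illustration; not produced by scrape-schema here).
parsed = {
    "h1": "Hello, Parsel!",
    "urls": ["http://example.com", "http://scrapy.org"],
}


@dataclass
class Page:
    h1: str
    urls: list


# Feed the parsed dict into the third-party (here: stdlib) layer.
page = Page(**parsed)
print(page.h1)    # Hello, Parsel!
print(page.urls)  # ['http://example.com', 'http://scrapy.org']
```

The same dict could just as easily be passed to `json.dumps` or a pydantic model.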


Features

  • Built on top of Parsel
  • re, CSS, XPath, JMESPath and chompjs extraction features
  • Fluent interface similar to the original parsel.Selector API, for ease of use
  • Reduces boilerplate code
  • Does not depend on an HTTP client implementation; use any!
  • Python 3.8+ support
  • Reusability and code consistency
  • Dataclass-like structure
  • Partial support for automatic type-casting from annotations (str, int, float, bool, list, dict, Optional)
  • Detailed logging to make writing parsers easier
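To illustrate the type-casting-from-annotations idea (this is a naive stand-alone sketch, not scrape-schema's actual implementation), a class's annotations can drive how raw string values are converted. The `Product` class and `cast_fields` helper below are hypothetical names introduced only for this example.

```python
from typing import get_type_hints


class Product:
    # Annotations drive the casting, as in a dataclass-like schema.
    name: str
    price: float
    in_stock: bool


def cast_fields(cls, raw: dict) -> dict:
    """Naive illustration: cast raw string values using class annotations."""
    hints = get_type_hints(cls)
    out = {}
    for field, value in raw.items():
        target = hints.get(field, str)
        if target is bool:
            # Plain bool("0") would be True, so treat falsy-looking
            # strings explicitly.
            out[field] = value not in ("", "0", "false", "False", None)
        else:
            out[field] = target(value)
    return out


print(cast_fields(Product, {"name": "Widget", "price": "9.99", "in_stock": "1"}))
# {'name': 'Widget', 'price': 9.99, 'in_stock': True}
```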

Install

```
pip install scrape-schema
```

Example

The field interface is similar to the original parsel:

```python
# Example from parsel documentation
>>> from parsel import Selector
>>> text = """
        <html>
            <body>
                <h1>Hello, Parsel!</h1>
                <ul>
                    <li><a href="http://example.com">Link 1</a></li>
                    <li><a href="http://scrapy.org">Link 2</a></li>
                </ul>
                <script type="application/json">{"a": ["b", "c"]}</script>
            </body>
        </html>"""
>>> selector = Selector(text=text)
>>> selector.css('h1::text').get()
'Hello, Parsel!'
>>> selector.xpath('//h1/text()').re(r'\w+')
['Hello', 'Parsel']
>>> for li in selector.css('ul > li'):
...     print(li.xpath('.//@href').get())
http://example.com
http://scrapy.org
>>> selector.css('script::text').jmespath("a").get()
'b'
>>> selector.css('script::text').jmespath("a").getall()
['b', 'c']
```

The same data can be extracted declaratively with a scrape-schema class:

```python
from scrape_schema import BaseSchema, Parsel, Sc


class Schema(BaseSchema):
    h1: Sc[str, Parsel().css('h1::text').get()]
    words: Sc[list[str], Parsel().xpath('//h1/text()').re(r'\w+')]
    urls: Sc[list[str], Parsel().css('ul > li').xpath('.//@href').getall()]
    sample_jmespath_1: Sc[str, Parsel().css('script::text').jmespath("a").get()]
    sample_jmespath_2: Sc[list[str], Parsel().css('script::text').jmespath("a").getall()]


text = """
        <html>
            <body>
                <h1>Hello, Parsel!</h1>
                <ul>
                    <li><a href="http://example.com">Link 1</a></li>
                    <li><a href="http://scrapy.org">Link 2</a></li>
                </ul>
                <script type="application/json">{"a": ["b", "c"]}</script>
            </body>
        </html>"""

print(Schema(text).dict())
# {'h1': 'Hello, Parsel!',
#  'words': ['Hello', 'Parsel'],
#  'urls': ['http://example.com', 'http://scrapy.org'],
#  'sample_jmespath_1': 'b',
#  'sample_jmespath_2': ['b', 'c']}
```
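For intuition, the `.re(r'\w+')` and `.jmespath("a")` steps in the example behave much like stdlib `re` and a plain JSON lookup. A minimal stdlib sketch of what the `words` and JMESPath fields extract:

```python
import json
import re

# What the `words` field's .re(r'\w+') step finds in the <h1> text:
words = re.findall(r"\w+", "Hello, Parsel!")
print(words)  # ['Hello', 'Parsel']

# What the jmespath("a") query resolves to in the embedded JSON;
# .get() takes the first element, .getall() the whole list.
data = json.loads('{"a": ["b", "c"]}')
print(data["a"][0])  # b
print(data["a"])     # ['b', 'c']
```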

See the examples and documentation for more information.


This project is licensed under the terms of the MIT license.
