
A library for converting any text (XML, HTML, plain text, stdout, etc.) to Python datatypes

This project has been archived: the maintainers have marked it as archived, and no new releases are expected.

Project description


Scrape-schema

This library is designed for writing structured, readable, reusable parsers for HTML and raw text, and is inspired by dataclasses and ORM libraries.

!!! warning

Scrape-schema is currently in Pre-Alpha. Please expect breaking changes.

Motivation

Simplify support for parsers where API interfaces are difficult to use or entirely absent, and reduce boilerplate code.

It also provides structuring and data serialization, and can serve as an intermediate layer for third-party serialization libraries: json, dataclasses, pydantic, etc.


Features

  • Built on top of Parsel
  • re, CSS, XPath, JMESPath, and chompjs features
  • Fluent interface similar to the original parsel.Selector API, for ease of use
  • Decreases boilerplate code
  • Does not depend on the HTTP client implementation, use any!
  • Python 3.8+ support
  • Reusability and code consistency
  • Dataclass-like structure
  • Partial support for auto type-casting from annotations (str, int, float, bool, list, dict, Optional)
  • Detailed logging of the parsing process, to make it easier to write a parser
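To make the "auto type-casting from annotations" feature concrete, here is an illustrative, stdlib-only sketch of the general mechanism (this is not scrape-schema's actual implementation; the `RawPage` class and `cast_fields` helper are hypothetical names used only for demonstration):

```python
from typing import get_type_hints


class RawPage:
    # Annotations drive the casting, just as a dataclass-like
    # schema's annotations would.
    views: int
    rating: float
    published: bool
    title: str


def cast_fields(cls, raw: dict) -> dict:
    """Cast raw string values to the types declared in cls annotations."""
    hints = get_type_hints(cls)
    out = {}
    for name, typ in hints.items():
        value = raw[name]
        if typ is bool:
            # bool("false") would be True, so strings need special handling
            out[name] = value.lower() in ("1", "true", "yes")
        else:
            out[name] = typ(value)
    return out


print(cast_fields(
    RawPage,
    {"views": "42", "rating": "4.5", "published": "true", "title": "Hi"},
))
# {'views': 42, 'rating': 4.5, 'published': True, 'title': 'Hi'}
```

Scraped values always arrive as strings, so annotation-driven casting like this removes a layer of manual `int(...)`/`float(...)` conversions from parser code.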

Install

pip install scrape-schema

Example

The field interface is similar to the original parsel API:

# Example from parsel documentation
>>> from parsel import Selector
>>> text = """
        <html>
            <body>
                <h1>Hello, Parsel!</h1>
                <ul>
                    <li><a href="http://example.com">Link 1</a></li>
                    <li><a href="http://scrapy.org">Link 2</a></li>
                </ul>
                <script type="application/json">{"a": ["b", "c"]}</script>
            </body>
        </html>"""
>>> selector = Selector(text=text)
>>> selector.css('h1::text').get()
'Hello, Parsel!'
>>> selector.xpath('//h1/text()').re(r'\w+')
['Hello', 'Parsel']
>>> for li in selector.css('ul > li'):
...     print(li.xpath('.//@href').get())
http://example.com
http://scrapy.org
>>> selector.css('script::text').jmespath("a").get()
'b'
>>> selector.css('script::text').jmespath("a").getall()
['b', 'c']
The same data, extracted declaratively with scrape-schema:

from scrape_schema import BaseSchema, Parsel, Sc


class Schema(BaseSchema):
    h1: Sc[str, Parsel().css('h1::text').get()]
    words: Sc[list[str], Parsel().xpath('//h1/text()').re(r'\w+')]
    urls: Sc[list[str], Parsel().css('ul > li').xpath('.//@href').getall()]
    sample_jmespath_1: Sc[str, Parsel().css('script::text').jmespath("a").get()]
    sample_jmespath_2: Sc[list[str], Parsel().css('script::text').jmespath("a").getall()]


text = """
        <html>
            <body>
                <h1>Hello, Parsel!</h1>
                <ul>
                    <li><a href="http://example.com">Link 1</a></li>
                    <li><a href="http://scrapy.org">Link 2</a></li>
                </ul>
                <script type="application/json">{"a": ["b", "c"]}</script>
            </body>
        </html>"""

print(Schema(text).dict())
# {'h1': 'Hello, Parsel!',
# 'words': ['Hello', 'Parsel'],
# 'urls': ['http://example.com', 'http://scrapy.org'],
# 'sample_jmespath_1': 'b',
# 'sample_jmespath_2': ['b', 'c']}
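Because `.dict()` returns plain Python types, the result plugs straight into third-party serialization layers, as mentioned under Motivation. A minimal sketch using only the stdlib `json` module (the `data` literal below is the dict printed in the example above):

```python
import json

# The dict produced by Schema(text).dict() in the example above
data = {
    "h1": "Hello, Parsel!",
    "words": ["Hello", "Parsel"],
    "urls": ["http://example.com", "http://scrapy.org"],
    "sample_jmespath_1": "b",
    "sample_jmespath_2": ["b", "c"],
}

# Serialize to JSON; the round trip is lossless for these types
print(json.dumps(data, indent=2))
```

The same dict could be passed to `dataclasses` or pydantic model constructors in the same way.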

See the examples and documentation for more information.


This project is licensed under the terms of the MIT license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrape_schema-0.5.4.tar.gz (18.6 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

scrape_schema-0.5.4-py3-none-any.whl (22.8 kB)

Uploaded Python 3

File details

Details for the file scrape_schema-0.5.4.tar.gz.

File metadata

  • Download URL: scrape_schema-0.5.4.tar.gz
  • Upload date:
  • Size: 18.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: python-httpx/0.24.1

File hashes

Hashes for scrape_schema-0.5.4.tar.gz

  • SHA256: 31972d6b86263df0cd44140522e72f74606da1d99d9b1836e47057ed3ae6bc2d
  • MD5: 473250f6dd39471716250d19ecccfc43
  • BLAKE2b-256: e2f65d070b78194c120534e7ad402f7e15c16727919fc15251ede95c99f71863

See more details on using hashes here.
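To check a downloaded file against the published digests above, the stdlib `hashlib` module is enough. A minimal sketch (the `sha256_hex` helper is a hypothetical name for illustration):

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


# In practice, read the downloaded artifact and compare against
# the published digest, e.g.:
#   from pathlib import Path
#   digest = sha256_hex(Path("scrape_schema-0.5.4.tar.gz").read_bytes())
#   assert digest == "31972d6b86263df0cd44140522e72f74606da1d99d9b1836e47057ed3ae6bc2d"
print(sha256_hex(b"example"))
```

pip performs an equivalent check automatically when hashes are pinned in a requirements file.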

File details

Details for the file scrape_schema-0.5.4-py3-none-any.whl.

File metadata

File hashes

Hashes for scrape_schema-0.5.4-py3-none-any.whl

  • SHA256: c6a796d4da9b8cae80690186df57439b8c89bc104812ba42092072e86c57c66e
  • MD5: 9d008bc92f38d56110f0649e5e4e5912
  • BLAKE2b-256: ca1938395a48613c649c4f92f9b7efd07d45732985d36a8c14e941663d2f133c

See more details on using hashes here.
