A library for converting any text (XML, HTML, plain text, stdout, etc.) to Python data types

Project description

Scrape-schema

This library is designed for writing structured, readable, reusable parsers for HTML and raw text. It is inspired by dataclasses and ORM libraries.

!!! warning

Scrape-schema is currently in Pre-Alpha. Please expect breaking changes.

Motivation

Simplify the maintenance of parsers for sources where an API is difficult to use or missing entirely, and reduce boilerplate code.

The library also helps with structuring and serializing data, and can serve as an intermediate layer for third-party serialization libraries: json, dataclasses, pydantic, etc.

Features

Built on top of Parsel
Support for re, css, xpath, jmespath, and chompjs
Fluent interface similar to the original parsel.Selector API, for ease of use
Less boilerplate code
Does not depend on the HTTP client implementation, use any!
Python 3.8+ support
Reusability, code consistency
Dataclass-like structure
Partial support for automatic type casting from annotations (str, int, float, bool, list, dict, Optional)
Detailed logging of the parsing process to make it easier to write a parser
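
To make the idea of type casting from annotations more concrete, here is a minimal sketch using only the standard library. It is not scrape-schema's implementation: `cast_value`, `cast_fields`, and the `Product` class are hypothetical names invented for this illustration, and only a handful of simple types are handled.

```python
from typing import Any, Dict, List, Optional, Union, get_args, get_origin, get_type_hints


def cast_value(value: Any, annotation: Any) -> Any:
    """Cast a raw scraped value to the annotated type (hypothetical helper)."""
    origin = get_origin(annotation)
    args = get_args(annotation)
    # Optional[T] is Union[T, None]: keep None, otherwise cast to T.
    if origin is Union and type(None) in args:
        if value is None:
            return None
        inner = next(a for a in args if a is not type(None))
        return cast_value(value, inner)
    # List[T]: cast every item to T.
    if origin is list:
        return [cast_value(item, args[0]) for item in value]
    if annotation in (int, float, str, bool):
        return annotation(value)
    return value  # unknown annotation: leave the value untouched


class Product:
    """Hypothetical container whose annotations drive the casting."""
    name: str
    price: float
    stock: int
    tags: List[int]
    discount: Optional[float]


def cast_fields(raw: Dict[str, Any], cls: type) -> Dict[str, Any]:
    """Cast every annotated field of `cls` found in `raw`."""
    hints = get_type_hints(cls)
    return {key: cast_value(raw.get(key), hint) for key, hint in hints.items()}


# Scraped values usually arrive as strings.
raw = {"name": "Widget", "price": "9.99", "stock": "12", "tags": ["1", "2"], "discount": None}
result = cast_fields(raw, Product)
print(result)
```

The annotations act as the single source of truth for the output types, so the parsing expressions themselves can stay free of conversion calls.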

Install

pip install scrape-schema

Example

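The fields interface is similar to the original parsel API. The first snippet below is the example from the parsel documentation; the second expresses the same queries as a scrape-schema schema.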

# Example from parsel documentation
>>> from parsel import Selector
>>> text = """
        <html>
            <body>
                <h1>Hello, Parsel!</h1>
                <ul>
                    <li><a href="http://example.com">Link 1</a></li>
                    <li><a href="http://scrapy.org">Link 2</a></li>
                </ul>
                <script type="application/json">{"a": ["b", "c"]}</script>
            </body>
        </html>"""
>>> selector = Selector(text=text)
>>> selector.css('h1::text').get()
'Hello, Parsel!'
>>> selector.xpath('//h1/text()').re(r'\w+')
['Hello', 'Parsel']
>>> for li in selector.css('ul > li'):
...     print(li.xpath('.//@href').get())
http://example.com
http://scrapy.org
>>> selector.css('script::text').jmespath("a").get()
'b'
>>> selector.css('script::text').jmespath("a").getall()
['b', 'c']

# list[str] annotations need Python 3.9+; on Python 3.8, typing.List[str] is the usual substitute
from scrape_schema import BaseSchema, Parsel, Sc


class Schema(BaseSchema):
    h1: Sc[str, Parsel().css('h1::text').get()]
    words: Sc[list[str], Parsel().xpath('//h1/text()').re(r'\w+')]
    urls: Sc[list[str], Parsel().css('ul > li').xpath('.//@href').getall()]
    sample_jmespath_1: Sc[str, Parsel().css('script::text').jmespath("a").get()]
    sample_jmespath_2: Sc[list[str], Parsel().css('script::text').jmespath("a").getall()]


text = """
        <html>
            <body>
                <h1>Hello, Parsel!</h1>
                <ul>
                    <li><a href="http://example.com">Link 1</a></li>
                    <li><a href="http://scrapy.org">Link 2</a></li>
                </ul>
                <script type="application/json">{"a": ["b", "c"]}</script>
            </body>
        </html>"""

print(Schema(text).dict())
# {'h1': 'Hello, Parsel!',
# 'words': ['Hello', 'Parsel'],
# 'urls': ['http://example.com', 'http://scrapy.org'],
# 'sample_jmespath_1': 'b',
# 'sample_jmespath_2': ['b', 'c']}
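
Because the result is a plain dictionary, it can be passed on to other serialization tools. The following sketch uses only the standard library and starts from the dictionary printed above, hard-coded as `parsed`, instead of calling scrape-schema. The `Page` dataclass is a hypothetical name used for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

# The dictionary from the example output above, hard-coded for this sketch.
parsed = {
    "h1": "Hello, Parsel!",
    "words": ["Hello", "Parsel"],
    "urls": ["http://example.com", "http://scrapy.org"],
    "sample_jmespath_1": "b",
    "sample_jmespath_2": ["b", "c"],
}


@dataclass
class Page:  # hypothetical container, not part of scrape-schema
    h1: str
    words: List[str]
    urls: List[str]
    sample_jmespath_1: str
    sample_jmespath_2: List[str]


# Keys of the parsed dictionary map one-to-one onto the dataclass fields.
page = Page(**parsed)

# Round-trip through JSON to show the data is serializable as-is.
as_json = json.dumps(asdict(page), indent=2)
print(as_json)
```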

See the examples and the documentation for more information.

This project is licensed under the terms of the MIT license.

Project details

Release history

Version	Release date
0.6.3	Oct 8, 2023
0.6.2	Oct 5, 2023
0.6.1	Sep 30, 2023
0.6.0	Sep 29, 2023
0.5.5	Sep 21, 2023
0.5.4	Sep 21, 2023 (this version)
0.5.3	Sep 14, 2023
0.5.2	Sep 11, 2023
0.5.1	Aug 31, 2023
0.5.0	Aug 31, 2023
0.4.2	Aug 29, 2023
0.4.1	Aug 29, 2023
0.4.0	Aug 16, 2023
0.3.7	Jul 17, 2023
0.3.6	Jul 16, 2023
0.3.5	Jul 16, 2023
0.3.4	Jul 14, 2023
0.3.3	Jul 10, 2023
0.3.2	Jul 8, 2023
0.3.0	Jul 6, 2023
0.2.4	May 12, 2023
0.2.3	May 11, 2023
0.2.2	May 11, 2023
0.2.0	May 10, 2023
0.1.4	May 5, 2023
0.1.3	May 1, 2023
0.1.1	Apr 28, 2023
0.1.0	Apr 28, 2023
0.0.9	Apr 25, 2023
0.0.8	Apr 24, 2023
0.0.7	Apr 23, 2023
0.0.6	Apr 23, 2023
0.0.5	Apr 22, 2023
0.0.4	Apr 22, 2023
0.0.3	Apr 21, 2023
0.0.2	Apr 21, 2023
0.0.1	Apr 18, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrape_schema-0.5.4.tar.gz (18.6 kB)

Uploaded Sep 21, 2023 Source

Built Distribution

scrape_schema-0.5.4-py3-none-any.whl (22.8 kB)

Uploaded Sep 21, 2023 Python 3

Hashes for scrape_schema-0.5.4.tar.gz

Algorithm	Hash digest
SHA256	`31972d6b86263df0cd44140522e72f74606da1d99d9b1836e47057ed3ae6bc2d`
MD5	`473250f6dd39471716250d19ecccfc43`
BLAKE2b-256	`e2f65d070b78194c120534e7ad402f7e15c16727919fc15251ede95c99f71863`

Hashes for scrape_schema-0.5.4-py3-none-any.whl

Algorithm	Hash digest
SHA256	`c6a796d4da9b8cae80690186df57439b8c89bc104812ba42092072e86c57c66e`
MD5	`9d008bc92f38d56110f0649e5e4e5912`
BLAKE2b-256	`ca1938395a48613c649c4f92f9b7efd07d45732985d36a8c14e941663d2f133c`
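
The digests above can be used to check a downloaded file. The sketch below is one way to do that with the standard library's `hashlib`. The helper names `sha256_of_file` and `verify_download` are hypothetical, and the local path of the archive is an assumption.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for scrape_schema-0.5.4.tar.gz
EXPECTED_SHA256 = "31972d6b86263df0cd44140522e72f74606da1d99d9b1836e47057ed3ae6bc2d"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_SHA256):
    """Compare the file's digest with the expected one (case-insensitive)."""
    return sha256_of_file(path) == expected.lower()


# Assumed download location: the current working directory.
archive = Path("scrape_schema-0.5.4.tar.gz")
if archive.exists():
    print("OK" if verify_download(archive) else "MISMATCH")
```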