A library for converting any text (XML, HTML, plain text, stdout, etc.) to Python data types

Project description


Scrape-schema

This library is designed for writing structured, readable, reusable parsers for HTML and raw text, and is inspired by dataclasses.

Motivation

Simplify writing and maintaining parsers for sources that are hard to work with or have no API at all, and reduce the amount of code.

It also structures and serializes data and can serve as an intermediate layer for third-party serialization libraries: json, dataclasses, pydantic, etc.
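For example, the plain dict returned by a schema's .dict() method can be handed straight to json or pydantic. A minimal sketch, assuming pydantic is installed (the Page and PageModel classes are illustrative):

import json
from typing import List

from pydantic import BaseModel
from scrape_schema import BaseSchema, Parsel, Sc


class Page(BaseSchema):
    title: Sc[str, Parsel().css("h1::text").get()]
    links: Sc[List[str], Parsel().xpath("//a/@href").getall()]


class PageModel(BaseModel):
    title: str
    links: List[str]


page = Page('<html><body><h1>Hi</h1><a href="http://example.com">x</a></body></html>')
print(json.dumps(page.dict()))   # JSON string with the extracted "title" and "links"
print(PageModel(**page.dict()))  # validated pydantic model built from the same dict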


Features

  • Parsel backend.
  • Fluent interface that mimics the original parsel.Selector API for ease of use.
  • Python 3.8+ support.
  • Dataclass-like structure.
  • Partial support for automatic type casting from annotations (str, int, float, bool, list, dict, Optional); see the sketch after this list.
  • Logging to quickly find problems in extracted values.
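A minimal sketch of the annotation-driven casting mentioned above (the field names and patterns are illustrative): values are extracted as strings and cast to the annotated types.

from typing import List, Optional

from scrape_schema import BaseSchema, Parsel, Sc


class Totals(BaseSchema):
    # group 1 of the match ("3") is cast to int
    count: Sc[int, Parsel(raw=True).re_search(r"count=(\d+)")[1]]
    # every match ("1.50", "2.25") is cast to float
    prices: Sc[List[float], Parsel(raw=True).re_findall(r"(\d+\.\d+)")]
    # no match: the default is used, the Optional annotation allows None
    missing: Sc[Optional[str], Parsel(default=None, raw=True).re_search(r"(nope)")[1]]


print(Totals("count=3 1.50 2.25").dict())
# expected: {'count': 3, 'prices': [1.5, 2.25], 'missing': None}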

Install

pip install scrape-schema

Example

The field interface is similar to the original parsel API:

# Example from parsel documentation
>>> from parsel import Selector
>>> text = """
        <html>
            <body>
                <h1>Hello, Parsel!</h1>
                <ul>
                    <li><a href="http://example.com">Link 1</a></li>
                    <li><a href="http://scrapy.org">Link 2</a></li>
                </ul>
                <script type="application/json">{"a": ["b", "c"]}</script>
            </body>
        </html>"""
>>> selector = Selector(text=text)
>>> selector.css('h1::text').get()
'Hello, Parsel!'
>>> selector.xpath('//h1/text()').re(r'\w+')
['Hello', 'Parsel']
>>> for li in selector.css('ul > li'):
...     print(li.xpath('.//@href').get())
http://example.com
http://scrapy.org
>>> selector.css('script::text').jmespath("a").get()
'b'
>>> selector.css('script::text').jmespath("a").getall()
['b', 'c']
The same extraction rules expressed as a scrape_schema schema:

from scrape_schema import BaseSchema, Parsel, Sc


class Schema(BaseSchema):
    h1: Sc[str, Parsel().css('h1::text').get()]
    words: Sc[list[str], Parsel().xpath('//h1/text()').re(r'\w+')]
    urls: Sc[list[str], Parsel().css('ul > li').xpath('.//@href').getall()]
    sample_jmespath_1: Sc[str, Parsel().css('script::text').jmespath("a").get()]
    sample_jmespath_2: Sc[list[str], Parsel().css('script::text').jmespath("a").getall()]


text = """
        <html>
            <body>
                <h1>Hello, Parsel!</h1>
                <ul>
                    <li><a href="http://example.com">Link 1</a></li>
                    <li><a href="http://scrapy.org">Link 2</a></li>
                </ul>
                <script type="application/json">{"a": ["b", "c"]}</script>
            </body>
        </html>"""

print(Schema(text).dict())
# {'h1': 'Hello, Parsel!',
# 'words': ['Hello', 'Parsel'],
# 'urls': ['http://example.com', 'http://scrapy.org'],
# 'sample_jmespath_1': 'b',
# 'sample_jmespath_2': ['b', 'c']}

Code comparison

html

parsel:

from parsel import Selector
import pprint
import requests


def original_parsel(resp: str):
    sel = Selector(resp)
    __RATINGS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
    data: dict[str, list[dict]] = {"books": []}
    for book_sel in sel.xpath(".//section/div/ol[@class='row']/li"):
        if url := book_sel.xpath('.//div[@class="image_container"]/a/@href').get():
            url = f"https://books.toscrape.com/catalogue/{url}"
        if image := book_sel.xpath('.//div[@class="image_container"]/a/img/@src').get():
            image = f"https://books.toscrape.com{image[2:]}"
        if price := book_sel.xpath('.//div[@class="product_price"]/p[@class="price_color"]/text()').get():
            price = float(price[2:])
        else:
            price = .0
        name = book_sel.xpath(".//h3/a/@title").get()
        available = book_sel.xpath('.//div[@class="product_price"]/p[@class="instock availability"]/i').attrib.get('class')
        available = ('icon-ok' in available)
        rating = book_sel.xpath('.//p[contains(@class, "star-rating")]').attrib.get('class')
        rating = __RATINGS.get(rating.split()[-1], 0)
        data['books'].append(dict(url=url, image=image, price=price, name=name, available=available, rating=rating))
    return data


if __name__ == '__main__':
    response = requests.get("https://books.toscrape.com/catalogue/page-2.html").text
    pprint.pprint(original_parsel(response), compact=True)

scrape_schema:

from typing import List
import pprint
import requests
from scrape_schema import BaseSchema, Sc, Nested, sc_param, Parsel


class Book(BaseSchema):
    __RATINGS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
    url: Sc[str, (Parsel()
                  .xpath('//div[@class="image_container"]/a/@href')
                  .get()
                  .concat_l("https://books.toscrape.com/catalogue/"))]
    image: Sc[str, (Parsel()
                    .xpath('//div[@class="image_container"]/a/img/@src')
                    .get()[2:]
                    .concat_l("https://books.toscrape.com"))]
    price: Sc[float, (Parsel(default=.0)
                      .xpath('//div[@class="product_price"]/p[@class="price_color"]/text()')
                      .get()[2:])]
    name: Sc[str, Parsel().xpath("//h3/a/@title").get()]
    available: Sc[bool, (Parsel()
                         .xpath('//div[@class="product_price"]/p[@class="instock availability"]/i')
                         .attrib['class']
                         .fn(lambda s: s == 'icon-ok')  # check available tag
                         )]
    _rating: Sc[str, Parsel().xpath('//p[contains(@class, "star-rating")]').attrib.get(key='class')]

    @sc_param
    def rating(self) -> int:
        return self.__RATINGS.get(self._rating.split()[-1], 0)


class MainPage(BaseSchema):
    books: Sc[List[Book], Nested(Parsel().xpath(".//section/div/ol[@class='row']/li").getall())]


if __name__ == '__main__':
    response = requests.get("https://books.toscrape.com/catalogue/page-2.html").text
    pprint.pprint(MainPage(response).dict(), compact=True)

raw text

original re:

import re
import pprint

TEXT = """
banana potato BANANA POTATO
-foo:10
-bar:20
lorem upsum dolor
192.168.0.1
"""


def parse_text(text: str) -> dict:
    if match := re.search(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", text):
        ipv4 = match[1]
    else:
        ipv4 = None

    if matches := re.findall(r"(\d+)", text):
        max_digit = max(int(i) for i in matches)
    else:
        max_digit = None

    failed_value = bool(re.search(r"(ora)", text))

    if matches := re.findall(r"(\d+)", text):
        digits = [int(i) for i in matches]
        digits_float = [float(f'{i}.5') for i in matches]
    else:
        digits = None
        digits_float = None
    words_lower = matches if (matches := re.findall(r"([a-z]+)", text)) else None
    words_upper = matches if (matches := re.findall(r"([A-Z]+)", text)) else None

    return dict(ipv4=ipv4, max_digit=max_digit, failed_value=failed_value,
                digits=digits, digits_float=digits_float, 
                words_lower=words_lower, words_upper=words_upper)
    

if __name__ == '__main__':
    pprint.pprint(parse_text(TEXT), width=48, compact=True)
    # {'digits': [10, 20, 192, 168, 0, 1],
    #  'digits_float': [10.5, 20.5, 192.5, 168.5, 0.5,
    #                   1.5],
    #  'failed_value': False,
    #  'ipv4': '192.168.0.1',
    #  'max_digit': 192,
    #  'words_lower': ['banana', 'potato', 'foo',
    #                  'bar', 'lorem', 'upsum',
    #                  'dolor'],
    #  'words_upper': ['BANANA', 'POTATO']}

scrape_schema:

from typing import List  # needed for Python 3.8; on Python 3.9+ you can use the built-in list
import pprint
from scrape_schema import Parsel, BaseSchema, Sc, sc_param

# Note: `Sc` is a shortcut for typing.Annotated

TEXT = """
banana potato BANANA POTATO
-foo:10
-bar:20
lorem upsum dolor
192.168.0.1
"""


class MySchema(BaseSchema):
    ipv4: Sc[str, Parsel(raw=True).re_search(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")[1]]
    failed_value: Sc[bool, Parsel(default=False, raw=True).re_search(r"(ora)")[1]]
    digits: Sc[List[int], Parsel(raw=True).re_findall(r"(\d+)")]
    digits_float: Sc[List[float], Parsel(raw=True).re_findall(r"(\d+)").fn(lambda lst: [f"{s}.5" for s in lst])]
    words_lower: Sc[List[str], Parsel(raw=True).re_findall("([a-z]+)")]
    words_upper: Sc[List[str], Parsel(raw=True).re_findall(r"([A-Z]+)")]

    @sc_param
    def sum(self):
        return sum(self.digits)

    @sc_param
    def all_words(self):
        return self.words_lower + self.words_upper


if __name__ == '__main__':
    pprint.pprint(MySchema(TEXT).dict(), compact=True)
# {'all_words': ['banana', 'potato', 'foo', 'bar', 'lorem', 'upsum', 'dolor',
#                'BANANA', 'POTATO'],
#  'digits': [10, 20, 192, 168, 0, 1],
#  'digits_float': [10.5, 20.5, 192.5, 168.5, 0.5, 1.5],
#  'failed_value': False,
#  'ipv4': '192.168.0.1',
#  'sum': 391,
#  'words_lower': ['banana', 'potato', 'foo', 'bar', 'lorem', 'upsum', 'dolor'],
#  'words_upper': ['BANANA', 'POTATO']}

Logging

In this project, logging at the DEBUG level is enabled by default.

To configure the logger, get it by the name "scrape_schema":

import logging

logger = logging.getLogger("scrape_schema")
logger.setLevel(logging.INFO)
...
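A slightly fuller configuration sketch using only the standard library: attach a handler with a format and raise the level so that only warnings and errors from the library reach your output.

import logging

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))

logger = logging.getLogger("scrape_schema")
logger.setLevel(logging.WARNING)  # hide the default DEBUG output
logger.addHandler(handler)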

See the examples and the documentation for more information.


This project is licensed under the terms of the MIT license.

