
A library for converting any text (XML, HTML, plain text, stdout, etc.) to Python data types

Scrape-schema

This library is designed for writing structured, readable, reusable parsers for unstructured text data (such as HTML, stdout, or any other text) and is inspired by dataclasses.

Motivation

Simplify the maintenance of parsers for sources where an API is difficult to use or missing entirely, and reduce the number of lines of code.

The library also helps with structuring and serializing data, and can act as an intermediate layer for third-party serialization libraries: json, dataclasses, pydantic, etc.


Features

  • Partial support for type-casting from annotations (str, int, float, bool, list, dict)
  • Optional checker for whether values were parsed successfully
  • Factory functions for converting values
  • Filter functions for filtering the found values
  • Optional check that a value was successfully retrieved from a field
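
To illustrate the idea behind type-casting from annotations, here is a minimal standard-library sketch. It is not the library's implementation: `Pattern`, `Record`, and `parse` are hypothetical names invented for this example, and the code only shows how metadata stored in `Annotated` can drive both the extraction and the cast.

```python
import re
from typing import Annotated, get_args, get_origin, get_type_hints


class Pattern:
    """Hypothetical marker holding a regex; stands in for a field object."""

    def __init__(self, regex: str) -> None:
        self.regex = regex


class Record:
    # The first Annotated argument is the target type, the second is the marker
    port: Annotated[int, Pattern(r"port=(\d+)")]
    host: Annotated[str, Pattern(r"host=(\w+)")]


def parse(cls: type, text: str) -> dict:
    """Extract every annotated attribute from text and cast it to its type."""
    result = {}
    # include_extras=True keeps the Annotated metadata instead of stripping it
    for name, hint in get_type_hints(cls, include_extras=True).items():
        if get_origin(hint) is not Annotated:
            continue
        target_type, marker = get_args(hint)[:2]
        match = re.search(marker.regex, text)
        # Cast the captured string to the annotated type; None if no match
        result[name] = target_type(match.group(1)) if match else None
    return result


print(parse(Record, "host=localhost port=8080"))
```

The annotation carries both the target type and the extraction rule, so the class body stays declarative and the parsing loop is written only once.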

Built-in parser backends:

  • re
  • bs4
  • selectolax(Modest)
  • parsel (TODO)

Install

Zero dependencies (regex and nested fields):

pip install scrape-schema

Add bs4 fields:

pip install scrape-schema[bs4]

Add selectolax fields:

pip install scrape-schema[selectolax]

Add all fields:

pip install scrape-schema[all]

Example

from typing import Annotated
import pprint

from scrape_schema import BaseSchema
from scrape_schema.fields.regex import ReMatch, ReMatchList

TEXT = """
banana potato BANANA POTATO
-foo:10
-bar:20
lorem upsum dolor
192.168.0.1
"""


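class Schema(BaseSchema):
    # Plain attribute with a default value
    status: str = "OK"
    # Single match of the pattern, kept as a string
    ipv4: Annotated[str, ReMatch(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")]
    # callback converts each match to int; factory reduces the list with max()
    max_digit: Annotated[int, ReMatchList(r"(\d+)",
                                          callback=int,
                                          factory=max)]
    # "ora" never occurs in TEXT, so the default is used
    failed_value: Annotated[bool, ReMatchList(r"(ora)", default=False)]
    # Matches are cast to the element type given in the annotation
    digits: Annotated[list[int], ReMatchList(r"(\d+)")]
    digits_float: Annotated[list[float], ReMatchList(r"(\d+)",
                                                     callback=lambda s: f"{s}.5")]
    words_lower: Annotated[list[str], ReMatchList(r"([a-z]+)")]
    words_upper: Annotated[list[str], ReMatchList(r"([A-Z]+)")]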
    
if __name__ == '__main__':
    schema = Schema(TEXT)
    pprint.pprint(schema.dict(), width=48, compact=True)
    # {'digits': [10, 20, 192, 168, 0, 1],
    #  'digits_float': [10.5, 20.5, 192.5, 168.5, 0.5,
    #                   1.5],
    #  'failed_value': False,
    #  'ipv4': '192.168.0.1',
    #  'max_digit': 192,
    #  'words_lower': ['banana', 'potato', 'foo',
    #                  'bar', 'lorem', 'upsum',
    #                  'dolor'],
    #  'words_upper': ['BANANA', 'POTATO']}
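
For comparison, the values extracted by the schema above can be computed by hand with only the standard `re` module. This is a rough equivalent for illustration, not how the library works internally; `parse_plain` is a hypothetical helper name, and the `status` default is left out because it involves no parsing.

```python
import re

TEXT = """
banana potato BANANA POTATO
-foo:10
-bar:20
lorem upsum dolor
192.168.0.1
"""


def parse_plain(text: str) -> dict:
    """Hand-written counterpart of the Schema example, using only re."""
    raw_digits = re.findall(r"(\d+)", text)
    digits = [int(d) for d in raw_digits]
    ipv4 = re.search(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", text)
    return {
        "ipv4": ipv4.group(1) if ipv4 else None,
        # default=0 avoids a ValueError when the text has no digits
        "max_digit": max(digits, default=0),
        # No "ora" in the text, so this evaluates to False
        "failed_value": bool(re.findall(r"(ora)", text)),
        "digits": digits,
        # Append ".5" to each match, then cast to float
        "digits_float": [float(f"{d}.5") for d in raw_digits],
        "words_lower": re.findall(r"([a-z]+)", text),
        "words_upper": re.findall(r"([A-Z]+)", text),
    }


result = parse_plain(TEXT)
```

Every field here needs its own extraction and conversion code, which is the repetition the declarative schema is meant to remove.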

Logging

In this project, logging at the DEBUG level is enabled by default.

To configure the logger, get it by the name "scrape_schema":

import logging

logger = logging.getLogger("scrape_schema")
logger.setLevel(logging.WARNING)
...
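
A slightly fuller sketch of the same idea raises the level and attaches a handler with a custom format. It uses only the standard `logging` module; the logger name "scrape_schema" comes from the text above, while the function name `configure_scrape_schema_logging` and the format string are assumptions made for this example.

```python
import logging


def configure_scrape_schema_logging(level: int = logging.WARNING) -> logging.Logger:
    """Raise the library logger's level and attach one stream handler."""
    logger = logging.getLogger("scrape_schema")
    logger.setLevel(level)
    # Add a handler only once so repeated calls do not duplicate output
    if not logger.handlers:
        handler = logging.StreamHandler()  # writes to stderr by default
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = configure_scrape_schema_logging()
```

With the level set to WARNING, DEBUG and INFO records sent to this logger are dropped before they reach any handler.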

See the examples and documentation for more information.


This project is licensed under the terms of the MIT license.
