
A library for converting any text (XML, HTML, plain text, stdout, etc.) to Python data types


Scrape-schema

This library is designed for writing structured, readable, reusable parsers for unstructured text data (such as HTML, stdout, or any other text). It is inspired by dataclasses.

Motivation

Simplify parser maintenance in cases where an API is difficult to use or missing entirely, and reduce the number of lines of code.

It also helps with structuring and serializing data, and can serve as an intermediate layer for third-party serialization libraries: json, dataclasses, pydantic, etc.


Features

  • Python 3.8+ support
  • fewer lines of code in your parsers
  • partial support for type casting from annotations (str, int, float, bool, list, dict, Optional)
  • value processing with callbacks, filters, and factories
  • logging to quickly find problems in extracted values
  • optional checking of successful parse attempts for field values
  • standardization and modularity* of parser structures

*If you use schema structures and keep them separate from the logic that obtains the text (stdout output, HTTP requests, etc.)
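
To make the idea of type casting from annotations concrete, here is a minimal standard-library sketch. It is not the library's implementation; the helper name cast_to is hypothetical and only shows how a raw regex result could be converted according to an annotation such as List[int], Optional[str], or bool.

```python
import re
from typing import List, Optional, Union, get_args, get_origin


def cast_to(annotation, value):
    """Cast a raw result (str, list of str, or None) to `annotation`.

    Hypothetical helper for illustration only, not part of scrape_schema.
    """
    origin = get_origin(annotation)
    if origin is Union:  # Optional[T] is Union[T, None]
        if value is None:
            return None
        inner = next(a for a in get_args(annotation) if a is not type(None))
        return cast_to(inner, value)
    if origin is list:  # List[T]: cast every element
        (item_type,) = get_args(annotation)
        return [cast_to(item_type, v) for v in value]
    if annotation is bool:  # an empty result is treated as False
        return bool(value)
    return annotation(value)  # str, int, float


digits = cast_to(List[int], re.findall(r"(\d+)", "-foo:10\n-bar:20"))
missing = cast_to(Optional[str], None)
found = cast_to(bool, re.findall(r"(ora)", "lorem upsum dolor"))
print(digits, missing, found)
```

The sketch handles only the simple cases named in the feature list; dict annotations and nested Optional lists are left out.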

Built-in support for parser libraries:

  • re
  • bs4
  • selectolax(Modest)
  • parsel
  • selenium
  • playwright

Install

Zero dependencies: regex and nested fields (plus typing_extensions on Python < 3.11)

pip install scrape-schema

add bs4 fields

pip install scrape-schema[bs4]

add selectolax fields

pip install scrape-schema[selectolax]

add parsel fields

pip install scrape-schema[parsel]

add all fields

pip install scrape-schema[all]
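
After installing extras, it can be handy to confirm which optional backends are actually importable. The sketch below uses only the standard library; the function name available_backends is hypothetical, and the import names (bs4, selectolax, parsel) are assumed to match the extras listed above.

```python
from importlib.util import find_spec

# Import names of the optional parser backends (assumed from the extras above).
OPTIONAL_BACKENDS = ("bs4", "selectolax", "parsel")


def available_backends():
    """Map each optional backend name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in OPTIONAL_BACKENDS}


print(available_backends())
```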

Code comparison

Before scrape_schema: harder to maintain and to change the logic

import re
import pprint

TEXT = """
banana potato BANANA POTATO
-foo:10
-bar:20
lorem upsum dolor
192.168.0.1
"""


def parse_text(text: str) -> dict:
    if match := re.search(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", text):
        ipv4 = match[1]
    else:
        ipv4 = None

    if matches := re.findall(r"(\d+)", text):
        max_digit = max(int(i) for i in matches)
    else:
        max_digit = None

    failed_value = bool(re.search(r"(ora)", text))

    if matches := re.findall(r"(\d+)", text):
        digits = [int(i) for i in matches]
        digits_float = [float(f'{i}.5') for i in matches]
    else:
        digits = None
        digits_float = None
    words_lower = matches if (matches := re.findall(r"([a-z]+)", text)) else None
    words_upper = matches if (matches := re.findall(r"([A-Z]+)", text)) else None

    return dict(ipv4=ipv4, max_digit=max_digit, failed_value=failed_value,
                digits=digits, digits_float=digits_float, 
                words_lower=words_lower, words_upper=words_upper)
    

if __name__ == '__main__':
    pprint.pprint(parse_text(TEXT), width=48, compact=True)
    # {'digits': [10, 20, 192, 168, 0, 1],
    #  'digits_float': [10.5, 20.5, 192.5, 168.5, 0.5,
    #                   1.5],
    #  'failed_value': False,
    #  'ipv4': '192.168.0.1',
    #  'max_digit': 192,
    #  'words_lower': ['banana', 'potato', 'foo',
    #                  'bar', 'lorem', 'upsum',
    #                  'dolor'],
    #  'words_upper': ['BANANA', 'POTATO']}

After scrape_schema: logic is easy to change, maintain, and port

from typing import List  # on Python 3.8, use the typing generic aliases (List, Dict, ...)
import pprint

from scrape_schema import BaseSchema, ScField
from scrape_schema.fields.regex import ReMatch, ReMatchList

TEXT = """
banana potato BANANA POTATO
-foo:10
-bar:20
lorem upsum dolor
192.168.0.1
"""


class Schema(BaseSchema):
    ipv4: ScField[str, ReMatch(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")]
    failed_value: ScField[bool, ReMatchList(r"(ora)", default=False)]
    digits: ScField[List[int], ReMatchList(r"(\d+)")]
    digits_float: ScField[List[float], ReMatchList(r"(\d+)", 
                                                     callback=lambda s: f"{s}.5")]
    words_lower: ScField[List[str], ReMatchList(r"([a-z]+)")]
    words_upper: ScField[List[str], ReMatchList(r"([A-Z]+)")]
    
    @property
    def max_digit(self) -> int:
        return max(self.digits)
    
    @property
    def all_words(self) -> List[str]:
        return self.words_lower + self.words_upper
    
if __name__ == '__main__':
    schema = Schema(TEXT)
    pprint.pprint(schema.dict(), compact=True)
    # {'all_words': ['banana', 'potato', 'foo', 'bar', 'lorem', 'upsum', 'dolor',
    #           'BANANA', 'POTATO'],
    #  'digits': [10, 20, 192, 168, 0, 1],
    #  'digits_float': [10.5, 20.5, 192.5, 168.5, 0.5, 1.5],
    #  'failed_value': False,
    #  'ipv4': '192.168.0.1',
    #  'max_digit': 192,
    #  'words_lower': ['banana', 'potato', 'foo', 'bar', 'lorem', 'upsum', 'dolor'],
    #  'words_upper': ['BANANA', 'POTATO']}
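
As an illustration of the "intermediate layer" use case, the sketch below passes a plain dict to a dataclass and then to json. The values are copied from the example output above; in real code the dict would come from schema.dict(). The Report class is a hypothetical downstream model, not part of the library.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

# Values copied from the example output; stands in for schema.dict().
parsed = {
    "ipv4": "192.168.0.1",
    "max_digit": 192,
    "digits": [10, 20, 192, 168, 0, 1],
}


@dataclass
class Report:  # hypothetical downstream model
    ipv4: str
    max_digit: int
    digits: List[int]


report = Report(**parsed)  # the dict keys map directly to dataclass fields
payload = json.dumps(asdict(report), sort_keys=True)
print(payload)
```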

Logging

By default, this project logs at the DEBUG level.

To configure the logger, get it by the name "scrape_schema":

import logging

logger = logging.getLogger("scrape_schema")
logger.setLevel(logging.INFO)
...
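
A slightly fuller sketch of the same idea, using only the standard logging module: raise the level to WARNING and attach a handler with a custom format. The in-memory stream and the two log calls are stand-ins for demonstration; in practice the records would come from the library itself and the handler would usually write to a file or stderr.

```python
import io
import logging

logger = logging.getLogger("scrape_schema")
logger.setLevel(logging.WARNING)  # hide DEBUG and INFO records

stream = io.StringIO()  # stand-in for a file or sys.stderr
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)

# Demonstration records emitted by hand on the same logger name.
logger.debug("dropped")  # below WARNING, not emitted
logger.warning("kept")   # passed to the handler

print(stream.getvalue(), end="")
```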

See the examples and documentation for more information.


This project is licensed under the terms of the MIT license.
