A library for converting any text (XML, HTML, plain text, stdout, etc.) to Python datatypes
Scrape-schema
This library is designed for writing structured, readable, reusable parsers for unstructured text data (like HTML, stdout, or any text), and is inspired by dataclasses
Motivation
Simplify parser maintenance for sources that are awkward to work with or lack an API entirely, and reduce the amount of code
Also provide structuring and data serialization, usable as an intermediate layer for third-party serialization libraries: json, dataclasses, pydantic, etc
Features
- python 3.8+ support
- fewer lines of code in your parsers
- partial support for type casting from annotations (str, int, float, bool, list, dict, Optional)
- interaction with extracted values via callbacks, filters, and factories
- logging, to quickly find problems in extracted values
- optional checker of successful parse attempts for field values
- standardized, modular* parser structures (*if your schema structures are separated from the logic that obtains the text: stdout output, HTTP requests, etc.)
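Conceptually, a field extracts raw values, applies a callback to each value, and then a factory to the whole result. A rough plain-Python sketch of that pipeline (the `run_field` name and signature are illustrative, not the library's API):

```python
import re

def run_field(pattern: str, text: str, callback=lambda v: v, factory=list):
    # extract -> per-value callback -> whole-result factory,
    # mirroring the callback/factory options of scrape-schema fields
    return factory(callback(v) for v in re.findall(pattern, text))

# e.g. a "max digit" field: cast each match to int, then take the max
print(run_field(r"(\d+)", "-foo:10 -bar:20", callback=int, factory=max))  # 20
```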
Built-in parser support for the following libraries:
- re
- bs4
- selectolax (Modest)
- parsel
- lxml
- selenium
- playwright
Install
Base install, zero dependencies: regex and nested fields (plus typing_extensions on Python < 3.11)
pip install scrape-schema
add bs4 fields
pip install scrape-schema[bs4]
add selectolax fields
pip install scrape-schema[selectolax]
add parsel fields
pip install scrape-schema[parsel]
add all fields
pip install scrape-schema[all]
Code comparison
Before scrape_schema: harder to maintain and to change logic
import re
import pprint
TEXT = """
banana potato BANANA POTATO
-foo:10
-bar:20
lorem upsum dolor
192.168.0.1
"""
def parse_text(text: str) -> dict:
    if match := re.search(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", text):
        ipv4 = match[1]
    else:
        ipv4 = None
    if matches := re.findall(r"(\d+)", text):
        max_digit = max(int(i) for i in matches)
    else:
        max_digit = None
    failed_value = bool(re.search(r"(ora)", text))
    if matches := re.findall(r"(\d+)", text):
        digits = [int(i) for i in matches]
        digits_float = [float(f'{i}.5') for i in matches]
    else:
        digits = None
        digits_float = None
    words_lower = matches if (matches := re.findall(r"([a-z]+)", text)) else None
    words_upper = matches if (matches := re.findall(r"([A-Z]+)", text)) else None
    return dict(ipv4=ipv4, max_digit=max_digit, failed_value=failed_value,
                digits=digits, digits_float=digits_float,
                words_lower=words_lower, words_upper=words_upper)

if __name__ == '__main__':
    pprint.pprint(parse_text(TEXT), width=48, compact=True)
# {'digits': [10, 20, 192, 168, 0, 1],
# 'digits_float': [10.5, 20.5, 192.5, 168.5, 0.5,
# 1.5],
# 'failed_value': False,
# 'ipv4': '192.168.0.1',
# 'max_digit': 192,
# 'words_lower': ['banana', 'potato', 'foo',
# 'bar', 'lorem', 'upsum',
# 'dolor'],
# 'words_upper': ['BANANA', 'POTATO']}
After scrape_schema: easier to change logic, maintain, and port
from typing import List  # needed on Python 3.8; on 3.9+ built-in generics like list[int] also work
import pprint
from scrape_schema import BaseSchema, ScField
from scrape_schema.fields.regex import ReMatch, ReMatchList
TEXT = """
banana potato BANANA POTATO
-foo:10
-bar:20
lorem upsum dolor
192.168.0.1
"""
class Schema(BaseSchema):
    ipv4: ScField[str, ReMatch(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")]
    max_digit: ScField[int, ReMatchList(r"(\d+)",
                                        callback=int,
                                        factory=max)]
    failed_value: ScField[bool, ReMatchList(r"(ora)", default=False)]
    digits: ScField[List[int], ReMatchList(r"(\d+)")]
    digits_float: ScField[List[float], ReMatchList(r"(\d+)",
                                                   callback=lambda s: f"{s}.5")]
    words_lower: ScField[List[str], ReMatchList(r"([a-z]+)")]
    words_upper: ScField[List[str], ReMatchList(r"([A-Z]+)")]

if __name__ == '__main__':
    schema = Schema(TEXT)
    pprint.pprint(schema.dict(), width=48, compact=True)
# {'digits': [10, 20, 192, 168, 0, 1],
# 'digits_float': [10.5, 20.5, 192.5, 168.5, 0.5,
# 1.5],
# 'failed_value': False,
# 'ipv4': '192.168.0.1',
# 'max_digit': 192,
# 'words_lower': ['banana', 'potato', 'foo',
# 'bar', 'lorem', 'upsum',
# 'dolor'],
# 'words_upper': ['BANANA', 'POTATO']}
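As the Motivation section notes, the resulting dict can serve as an intermediate layer for third-party serialization libraries. A minimal sketch, using a plain dict in place of the output of `Schema(TEXT).dict()` (the `ParsedText` dataclass is a hypothetical consumer, not part of the library):

```python
import json
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ParsedText:
    ipv4: Optional[str]
    max_digit: Optional[int]
    failed_value: bool
    digits: List[int]

# a plain dict standing in for the output of Schema(TEXT).dict()
data = {"ipv4": "192.168.0.1", "max_digit": 192,
        "failed_value": False, "digits": [10, 20, 192, 168, 0, 1]}

record = ParsedText(**data)      # dict -> typed dataclass
print(json.dumps(record.__dict__))  # dataclass -> JSON string
```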
logging
In this project, DEBUG-level logging is enabled by default.
To configure the logger, retrieve it by the name "scrape_schema":
import logging
logger = logging.getLogger("scrape_schema")
logger.setLevel(logging.INFO)
...
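If you also want to route the library's messages somewhere, a typical stdlib-logging setup looks like this (the handler and format choices here are only examples):

```python
import logging

# raise the level to silence DEBUG noise and attach a console handler
logger = logging.getLogger("scrape_schema")
logger.setLevel(logging.WARNING)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(name)s:%(levelname)s %(message)s"))
logger.addHandler(handler)
```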
See the examples and documentation for more information.
This project is licensed under the terms of the MIT license.