A library for converting any text (XML, HTML, plain text, stdout, etc.) to Python data types
Scrape-schema
This library is designed for writing structured, readable, reusable parsers for HTML and raw text. It is inspired by dataclasses and ORM libraries.
!!! warning
Scrape-schema is currently in Pre-Alpha. Please expect breaking changes.
Motivation
Simplify the maintenance of parsers for sources where an API is difficult to use or missing altogether, and reduce boilerplate code.
The library also helps with structuring and serializing data, and can serve as an intermediate layer for third-party serialization libraries such as json, dataclasses, and pydantic.
Features
- Built on top of Parsel
- re, css, xpath, jmespath, chompjs features
- Fluent interface similar to the original parsel.Selector API, for ease of use
- Less boilerplate code
- Does not depend on the http client implementation, use any!
- Python 3.8+ support
- Reusability, code consistency
- Dataclass-like structure
- Partial support for automatic type casting from annotations (str, int, float, bool, list, dict, Optional)
- Detailed logging of the parsing process, which makes parsers easier to write
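To make the idea of type casting from annotations more concrete, here is a minimal, standard-library-only sketch of how a raw parsed value might be converted according to a type hint. It is not scrape-schema's implementation; `cast_value`, `Product`, and `raw` are hypothetical names used only for illustration, and only a few of the listed types are handled.

```python
from typing import Any, List, Optional, Union, get_args, get_origin, get_type_hints


def cast_value(value: Any, annotation: Any) -> Any:
    """Cast a raw value to the annotated type (illustrative, not the library's code)."""
    origin = get_origin(annotation)
    if origin is Union:
        # Optional[T] is Union[T, None]: keep None, otherwise cast to T.
        if value is None:
            return None
        inner = [arg for arg in get_args(annotation) if arg is not type(None)]
        return cast_value(value, inner[0])
    if origin is list:
        # List[T]: cast every item to T.
        (item_type,) = get_args(annotation)
        return [cast_value(item, item_type) for item in value]
    if annotation in (str, int, float):
        return annotation(value)
    # Anything else is returned unchanged in this sketch.
    return value


class Product:
    """Hypothetical annotated container; only its annotations are used."""
    price: float
    stock: Optional[int]
    tags: List[str]


# Raw values as a parser might return them: mostly strings.
raw = {"price": "9.99", "stock": None, "tags": ["a", 1]}

hints = get_type_hints(Product)
casted = {name: cast_value(raw[name], hint) for name, hint in hints.items()}
print(casted)
```

The point of the sketch is only that the annotation, not the parsing expression, decides the final Python type of each field.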
Install
pip install scrape-schema
Example
The fields interface is similar to the original parsel API:
# Example from parsel documentation
>>> from parsel import Selector
>>> text = """
<html>
<body>
<h1>Hello, Parsel!</h1>
<ul>
<li><a href="http://example.com">Link 1</a></li>
<li><a href="http://scrapy.org">Link 2</a></li>
</ul>
<script type="application/json">{"a": ["b", "c"]}</script>
</body>
</html>"""
>>> selector = Selector(text=text)
>>> selector.css('h1::text').get()
'Hello, Parsel!'
>>> selector.xpath('//h1/text()').re(r'\w+')
['Hello', 'Parsel']
>>> for li in selector.css('ul > li'):
...     print(li.xpath('.//@href').get())
http://example.com
http://scrapy.org
>>> selector.css('script::text').jmespath("a").get()
'b'
>>> selector.css('script::text').jmespath("a").getall()
['b', 'c']
# The same extraction expressed as a scrape-schema schema
from scrape_schema import BaseSchema, Parsel, Sc


class Schema(BaseSchema):
    h1: Sc[str, Parsel().css('h1::text').get()]
    words: Sc[list[str], Parsel().xpath('//h1/text()').re(r'\w+')]
    urls: Sc[list[str], Parsel().css('ul > li').xpath('.//@href').getall()]
    sample_jmespath_1: Sc[str, Parsel().css('script::text').jmespath("a").get()]
    sample_jmespath_2: Sc[list[str], Parsel().css('script::text').jmespath("a").getall()]


text = """
<html>
<body>
<h1>Hello, Parsel!</h1>
<ul>
<li><a href="http://example.com">Link 1</a></li>
<li><a href="http://scrapy.org">Link 2</a></li>
</ul>
<script type="application/json">{"a": ["b", "c"]}</script>
</body>
</html>"""
print(Schema(text).dict())
# {'h1': 'Hello, Parsel!',
# 'words': ['Hello', 'Parsel'],
# 'urls': ['http://example.com', 'http://scrapy.org'],
# 'sample_jmespath_1': 'b',
# 'sample_jmespath_2': ['b', 'c']}
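
Since the result of `.dict()` is made of plain Python containers, it can be handed to other serialization tools. The following sketch is illustrative only: it starts from the dictionary printed above rather than calling scrape-schema, and the `Page` dataclass and `to_json` helper are hypothetical names that the library does not provide.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Page:
    """Hypothetical dataclass mirroring the fields of the schema above."""
    h1: str
    words: List[str]
    urls: List[str]
    sample_jmespath_1: str
    sample_jmespath_2: List[str]


# The dictionary shown in the example output above.
parsed = {
    "h1": "Hello, Parsel!",
    "words": ["Hello", "Parsel"],
    "urls": ["http://example.com", "http://scrapy.org"],
    "sample_jmespath_1": "b",
    "sample_jmespath_2": ["b", "c"],
}


def to_json(page: Page) -> str:
    """Serialize a Page to a JSON string (hypothetical helper)."""
    return json.dumps(asdict(page), sort_keys=True)


page = Page(**parsed)   # dict -> dataclass
as_json = to_json(page)  # dataclass -> JSON text
print(as_json)
```

The same dictionary could be passed to a pydantic model in a similar way; the dataclass is used here only to keep the sketch free of extra dependencies.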
See the examples and the documentation for more information.
This project is licensed under the terms of the MIT license.