Provide a universal solution for crawler platforms. Read more: https://github.com/ClericPy/uniparser.
uniparser
Provides a universal solution for crawlers. Requires Python 3.6+.
Install
pip install uniparser -U
Why?
- Reduce the amount of duplicated code across similar crawlers & parsers. Don't Repeat Yourself.
- Make the parsing process of different parsers persistent.
- Separate the parsing process from the downloading.
- Provide a universal solution for crawler platforms.
- Bring together the common string-parsing tools on the market.
Feature List
- Support most of the popular parsers for HTML / XML / JSON / AnyString / Python objects.
  - Parser list:
    1. css (HTML)
        1. bs4
    2. xml
        1. lxml
    3. regex
    4. jsonpath
        1. jsonpath_ng
    5. objectpath
        1. objectpath
    6. jmespath
        1. jmespath
    7. time
    8. loader
        1. json / yaml / toml
            1. toml
            2. pyyaml
    9. udf
        1. source code for exec & eval, which is named **parse**
    10. python
        1. some common Python methods: getitem, split, join...
    11. *waiting for new ones...*
- Request args persistence; supports curl-string, single-url, dict, and JSON.
- A simple Web UI for generating & testing CrawlerRule.
- Serializable JSON rule classes for saving the whole parsing process (see the sketch after this list).
  - Each ParseRule / CrawlerRule / HostRule subclass can be json.dumps-ed to JSON for persistence.
  - Therefore, they can also be loaded back from a JSON string.
  - The nesting relation of rule names is treated as the result format. (A rule's own result is ignored if it has child rules.)
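A minimal sketch of that nesting, using plain dicts only (no uniparser API is called; the rule names `list`, `detail`, `title` and the URL are taken from the Quick Start example below):

```python
# Illustration only: nested rule names become nested result keys.
# A CrawlerRule named 'list' holds a ParseRule named '__request__' whose
# URLs are crawled by a child CrawlerRule named 'detail' (ParseRule 'title').
result_shape = {
    'list': {
        '__request__': ['https://www.python.org/dev/peps/pep-0001'],
        '__result__': [
            {'detail': {'title': 'PEP 1 -- PEP Purpose and Guidelines'}},
        ],
    }
}
print(result_shape['list']['__result__'][0]['detail']['title'])
```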
Quick Start
Mission: Crawl Python Meta-PEPs.
Fewer than 25 lines of necessary code besides the rules (which can be saved externally and auto-loaded).
HostRules will be saved at `$HOME/host_rules.json` by default, so there is no need to initialize them every time.
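Since that file is plain JSON, it can be inspected with the standard library alone (a minimal sketch; the path is just the default mentioned above, and the file only exists after rules have been saved):

```python
import json
from pathlib import Path

# Default HostRule storage location mentioned above.
rule_file = Path.home() / 'host_rules.json'

if rule_file.exists():
    host_rules = json.loads(rule_file.read_text())
    # Top-level keys are host names, e.g. 'www.python.org'.
    print(list(host_rules))
```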
CrawlerRule JSON & Expected Result
# These rules will be saved at `$HOME/host_rules.json`
crawler = Crawler(
storage=JSONRuleStorage.loads(
r'{"www.python.org": {"host": "www.python.org", "crawler_rules": {"main": {"name":"list","request_args":{"method":"get","url":"https://www.python.org/dev/peps/","headers":{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}},"parse_rules":[{"name":"__request__","chain_rules":[["css","#index-by-category #meta-peps-peps-about-peps-or-processes td.num>a","@href"],["re","^/","@https://www.python.org/"],["python","getitem","[:3]"]],"childs":""}],"regex":"^https://www.python.org/dev/peps/$","encoding":""}, "subs": {"name":"detail","request_args":{"method":"get","url":"https://www.python.org/dev/peps/pep-0001/","headers":{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}},"parse_rules":[{"name":"title","chain_rules":[["css","h1.page-title","$text"],["python","getitem","[0]"]],"childs":""}],"regex":"^https://www.python.org/dev/peps/pep-\\d+$","encoding":""}}}}'
))
expected_result = {
'list': {
'__request__': [
'https://www.python.org/dev/peps/pep-0001',
'https://www.python.org/dev/peps/pep-0004',
'https://www.python.org/dev/peps/pep-0005'
],
'__result__': [{
'detail': {
'title': 'PEP 1 -- PEP Purpose and Guidelines'
}
}, {
'detail': {
'title': 'PEP 4 -- Deprecation of Standard Modules'
}
}, {
'detail': {
'title': 'PEP 5 -- Guidelines for Language Evolution'
}
}]
}
}
The Whole Source Code
from uniparser import Crawler, JSONRuleStorage
import asyncio
crawler = Crawler(
storage=JSONRuleStorage.loads(
r'{"www.python.org": {"host": "www.python.org", "crawler_rules": {"main": {"name":"list","request_args":{"method":"get","url":"https://www.python.org/dev/peps/","headers":{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}},"parse_rules":[{"name":"__request__","chain_rules":[["css","#index-by-category #meta-peps-peps-about-peps-or-processes td.num>a","@href"],["re","^/","@https://www.python.org/"],["python","getitem","[:3]"]],"childs":""}],"regex":"^https://www.python.org/dev/peps/$","encoding":""}, "subs": {"name":"detail","request_args":{"method":"get","url":"https://www.python.org/dev/peps/pep-0001/","headers":{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}},"parse_rules":[{"name":"title","chain_rules":[["css","h1.page-title","$text"],["python","getitem","[0]"]],"childs":""}],"regex":"^https://www.python.org/dev/peps/pep-\\d+$","encoding":""}}}}'
))
expected_result = {
'list': {
'__request__': [
'https://www.python.org/dev/peps/pep-0001',
'https://www.python.org/dev/peps/pep-0004',
'https://www.python.org/dev/peps/pep-0005'
],
'__result__': [{
'detail': {
'title': 'PEP 1 -- PEP Purpose and Guidelines'
}
}, {
'detail': {
'title': 'PEP 4 -- Deprecation of Standard Modules'
}
}, {
'detail': {
'title': 'PEP 5 -- Guidelines for Language Evolution'
}
}]
}
}
def test_sync_crawler():
# JSON will be saved if file_path!=None
result = crawler.crawl('https://www.python.org/dev/peps/')
print('sync result:', result)
assert result == expected_result
def test_async_crawler():
async def _test():
# JSON will be saved if file_path!=None
result = await crawler.acrawl('https://www.python.org/dev/peps/')
        print('async result:', result)
assert result == expected_result
asyncio.run(_test())
test_sync_crawler()
test_async_crawler()
Uniparser Test Console Demo (Web UI)
1. Prepare Environment
- pip install bottle uniparser
- python -m uniparser 8080
2. Open a browser at http://127.0.0.1:8080/
Screenshots on the project's GitHub page walk through the start page, preparing the rules, and reading the parse result.
The result is shown as repr(result):
{'HelloWorld': {'rule1-get-first-p': 'Customer name: ', 'rule2-get-legends': [' Pizza Size ', ' Pizza Toppings ']}}
As we can see, the CrawlerRule's name is the root key, and each ParseRule's name is a key inside it.
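To make that mapping concrete, here is the same output assigned to a plain dict and read back (illustration only; the variable name `result` is ours):

```python
# The repr(result) shown above, as a plain Python dict.
result = {
    'HelloWorld': {                     # CrawlerRule's name -> root key
        'rule1-get-first-p': 'Customer name: ',
        'rule2-get-legends': [' Pizza Size ', ' Pizza Toppings '],
    }
}

# Each ParseRule's name is a key under the CrawlerRule's name.
print(result['HelloWorld']['rule2-get-legends'][0])  # ' Pizza Size '
```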
More Usage
Talk is cheap; for now the code is the doc (no time yet to write more).
Test code: test_parsers.py
TODO
- Release to pypi.org
- Upload dist with Web UI
- Add github actions for testing package
- Web UI for testing rules
- Complete the whole doc