uniparser
Provide a universal solution for crawlers. Read more: https://github.com/ClericPy/uniparser.
Install
pip install uniparser -U
Why?
- Reduce duplicated code across plenty of similar crawlers & parsers. Don't Repeat Yourself.
- Make the parsing process of different parsers persistent.
- Separate crawler code from the main app code; no need to redeploy the app when adding a new crawler.
- Provide a universal solution for crawler platforms.
- Summarize common string-parsing tools on the market.
Feature List
- Support most of the popular parsers for HTML / XML / JSON / any string / Python objects.
- Parser list:
  1. css (HTML)
     - bs4
  2. xml
     - lxml
  3. regex
  4. jsonpath
     - jsonpath-rw-ext
  5. objectpath
     - objectpath
  6. jmespath
     - jmespath
  7. time
  8. loader (json / yaml / toml)
     - toml
     - pyyaml
  9. udf
     - source code for exec & eval, named **parse**
  10. python
      - some common Python methods: getitem, split, join...
  11. *waiting for new ones...*
- Request args persistence; supports curl strings, single URLs, dicts, and JSON.
- A simple Web UI for generating & testing CrawlerRules.
- Serializable JSON rule classes for saving the whole parsing process.
  - Each ParseRule / CrawlerRule / HostRule subclass can be json.dumps to JSON for persistence.
  - Therefore, they can also be loaded back from a JSON string.
  - The nesting relation of rule names is treated as the result format. (A rule's own result is ignored if it has child rules.)
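The chained-parser idea above can be sketched in plain Python. This is a conceptual illustration of how each rule in a chain feeds its output into the next rule, not uniparser's actual implementation; the `apply_chain` helper and the lambdas are hypothetical stand-ins for the real parser backends.

```python
import re

def apply_chain(input_object, chain):
    """Apply a list of callables to input_object in order, feeding each
    step's output into the next step (the spirit of chain_rules)."""
    result = input_object
    for step in chain:
        result = step(result)
    return result

# Example chain: extract hrefs with a regex, prefix relative URLs,
# then keep only the first three (like the "[:3]" getitem rule).
html = ('<a href="/dev/peps/pep-0001/"></a><a href="/dev/peps/pep-0004/"></a>'
        '<a href="/dev/peps/pep-0005/"></a><a href="/dev/peps/pep-0008/"></a>')
chain = [
    lambda text: re.findall(r'href="([^"]+)"', text),           # like a css @href rule
    lambda urls: ['https://www.python.org' + u for u in urls],  # like the re "^/" replace rule
    lambda urls: urls[:3],                                      # like the python getitem rule
]
print(apply_chain(html, chain))  # first three absolute PEP links
```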
-
Rule Classes
- JsonSerializable is the base class for all the rules.
  - The dumps classmethod dumps the rule itself as a standard JSON string.
  - The loads classmethod loads a rule from a standard JSON string, so the new object keeps all the methods of a rule.
- ParseRule is the lowest-level unit of a parse mission; it describes how to parse an input_object. Sometimes it also has a list of ParseRules as child rules.
  - A parse result is a dict with rule_name as the key and the result as the value.
- CrawlerRule contains some ParseRules and has 3 attributes besides the rule name:
  - request_args tells the HTTP downloader how to send the request.
  - parse_rules is a list of ParseRules, and the parsing result format is like {CrawlerRule_name: {ParseRule1['name']: ParseRule1_result, ParseRule2['name']: ParseRule2_result}}.
  - regex tells how to find the crawler_rule for a given url.
- HostRule contains a dict like {CrawlerRule['name']: CrawlerRule}; with the find method it can get the matching CrawlerRule for a given url.
- JSONRuleStorage is a simple storage backend that saves the HostRules in a JSON file. It is not a good choice for production environments; redis / mysql / mongodb can give a hand there.
- Uniparser is the center console for the entire crawl process. It handles the download middleware and parse middleware. Detailed usage can be found at uniparser.crawler.Crawler, or have a look at [Quick Start].
- For custom settings, such as the json loader, update the uniparser.config.GlobalConfig.
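Since every rule class is JSON-serializable, persistence is just a string round-trip. A minimal sketch with the stdlib json module, using a ParseRule-shaped dict (field names taken from the Quick Start JSON below); this illustrates the idea without using uniparser itself:

```python
import json

# A ParseRule-shaped dict: name + chain_rules + childs, as described above.
parse_rule = {
    "name": "title",
    "chain_rules": [["css", "h1.page-title", "$text"],
                    ["python", "getitem", "[0]"]],
    "childs": "",
}

# Dump to a JSON string for persistence (what `dumps` does for rule objects)...
saved = json.dumps(parse_rule)
# ...and load it back later (what `loads` does).
restored = json.loads(saved)
print(restored["name"])  # → title
```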
Quick Start
Mission: Crawl python Meta-PEPs
Less than 25 lines of necessary code besides the rules (which can be saved elsewhere and auto-loaded).
HostRules will be saved at `$HOME/host_rules.json` by default, so there is no need to init them every time.
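The default storage path resolves under the user's home directory; a quick stdlib-only way to check where that file would live on your machine:

```python
from pathlib import Path

# Path.home() expands to $HOME (or the Windows user profile dir).
default_storage = Path.home() / 'host_rules.json'
print(default_storage)
```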
CrawlerRule JSON & Expected Result
# These rules will be saved at `$HOME/host_rules.json`
crawler = Crawler(
    storage=JSONRuleStorage.loads(
        r'{"www.python.org": {"host": "www.python.org", "crawler_rules": {"main": {"name":"list","request_args":{"method":"get","url":"https://www.python.org/dev/peps/","headers":{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}},"parse_rules":[{"name":"__request__","chain_rules":[["css","#index-by-category #meta-peps-peps-about-peps-or-processes td.num>a","@href"],["re","^/","@https://www.python.org/"],["python","getitem","[:3]"]],"childs":""}],"regex":"^https://www.python.org/dev/peps/$","encoding":""}, "subs": {"name":"detail","request_args":{"method":"get","url":"https://www.python.org/dev/peps/pep-0001/","headers":{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}},"parse_rules":[{"name":"title","chain_rules":[["css","h1.page-title","$text"],["python","getitem","[0]"]],"childs":""}],"regex":"^https://www.python.org/dev/peps/pep-\\d+$","encoding":""}}}}'
    ))
expected_result = {
'list': {
'__request__': [
'https://www.python.org/dev/peps/pep-0001',
'https://www.python.org/dev/peps/pep-0004',
'https://www.python.org/dev/peps/pep-0005'
],
'__result__': [{
'detail': {
'title': 'PEP 1 -- PEP Purpose and Guidelines'
}
}, {
'detail': {
'title': 'PEP 4 -- Deprecation of Standard Modules'
}
}, {
'detail': {
'title': 'PEP 5 -- Guidelines for Language Evolution'
}
}]
}
}
The Whole Source Code
from uniparser import Crawler, JSONRuleStorage
import asyncio
crawler = Crawler(
    storage=JSONRuleStorage.loads(
        r'{"www.python.org": {"host": "www.python.org", "crawler_rules": {"main": {"name":"list","request_args":{"method":"get","url":"https://www.python.org/dev/peps/","headers":{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}},"parse_rules":[{"name":"__request__","chain_rules":[["css","#index-by-category #meta-peps-peps-about-peps-or-processes td.num>a","@href"],["re","^/","@https://www.python.org/"],["python","getitem","[:3]"]],"childs":""}],"regex":"^https://www.python.org/dev/peps/$","encoding":""}, "subs": {"name":"detail","request_args":{"method":"get","url":"https://www.python.org/dev/peps/pep-0001/","headers":{"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}},"parse_rules":[{"name":"title","chain_rules":[["css","h1.page-title","$text"],["python","getitem","[0]"]],"childs":""}],"regex":"^https://www.python.org/dev/peps/pep-\\d+$","encoding":""}}}}'
    ))
expected_result = {
'list': {
'__request__': [
'https://www.python.org/dev/peps/pep-0001',
'https://www.python.org/dev/peps/pep-0004',
'https://www.python.org/dev/peps/pep-0005'
],
'__result__': [{
'detail': {
'title': 'PEP 1 -- PEP Purpose and Guidelines'
}
}, {
'detail': {
'title': 'PEP 4 -- Deprecation of Standard Modules'
}
}, {
'detail': {
'title': 'PEP 5 -- Guidelines for Language Evolution'
}
}]
}
}
def test_sync_crawler():
    result = crawler.crawl('https://www.python.org/dev/peps/')
    print('sync result:', result)
    assert result == expected_result

def test_async_crawler():
    async def _test():
        result = await crawler.acrawl('https://www.python.org/dev/peps/')
        print('async result:', result)
        assert result == expected_result
    asyncio.run(_test())

test_sync_crawler()
test_async_crawler()
Uniparser Rule Test Console (Web UI)
- pip install bottle uniparser
- python -m uniparser 8080
- open browser => http://127.0.0.1:8080/
Start page
Prepare the rules
Read the parse result
Show result as repr(result)
{'HelloWorld': {'rule1-get-first-p': 'Customer name: ', 'rule2-get-legends': [' Pizza Size ', ' Pizza Toppings ']}}
As we can see, the CrawlerRule's name is the root key, and the ParseRule names are the inner keys.
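Given that result shape, the nested values can be read by rule name. A small sketch using the repr shown above:

```python
# The parse result from the Web UI example above.
result = {'HelloWorld': {'rule1-get-first-p': 'Customer name: ',
                         'rule2-get-legends': [' Pizza Size ', ' Pizza Toppings ']}}

# CrawlerRule name is the root key; ParseRule names index into it.
crawler_result = result['HelloWorld']
print(crawler_result['rule1-get-first-p'])   # the first ParseRule's value
print(crawler_result['rule2-get-legends'])   # the second ParseRule's value
```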
Async environment usage: FastAPI
import uvicorn
from uniparser.fastapi_ui import app
if __name__ == "__main__":
uvicorn.run(app, port=8080)
# http://127.0.0.1:8080
FastAPI subapp usage
import uvicorn
from fastapi import FastAPI
from uniparser.fastapi_ui import app as sub_app
app = FastAPI()
app.mount('/uniparser', sub_app)
if __name__ == "__main__":
uvicorn.run(app, port=8080)
# http://127.0.0.1:8080/uniparser/
More Usage
Talk is cheap; the code is the doc (too little time to write more).
Test code: test_parsers.py
TODO
- Release to pypi.org
- Upload dist with Web UI
- Add github actions for testing package
- Web UI for testing rules
- Complete the doc in detail