uniparser
A universal parsing solution for crawlers, Python 3.6+. Read more: https://github.com/ClericPy/uniparser.
Install
pip install uniparser -U
Why?
- Reduce the amount of duplicated code across many similar crawlers & parsers. Don't Repeat Yourself.
- Make the parsing process of different parsers persistent.
- Separate the parsing process from the downloading.
- Provide a universal solution for crawler platforms.
- Summarize the common string-parsing tools on the market.
Feature List
- Support most of the popular parsers for HTML / XML / JSON / AnyString / Python objects
  - css (HTML), via bs4
  - xml, via lxml
  - regex
  - jsonpath, via jsonpath_ng
  - objectpath, via objectpath
  - jmespath, via jmespath
  - time
  - loader (json / yaml / toml), via toml and pyyaml
  - waiting for new ones...
- Request args persistence: supports curl strings, single URLs, dicts, and JSON.
- A simple Web UI for generating & testing CrawlerRules.
- Serializable JSON rule classes for saving the whole parsing process (see the sketch after this list).
- Each ParseRule / CrawlerRule / HostRule subclass can be serialized to JSON with json.dumps for persistence.
- Therefore, they can also be loaded back from a JSON string.
- The nesting relation of rule names is treated as the result format. (A rule's own result is ignored if it has childs.)
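A minimal sketch of that round-trip. The Demo rule here is hypothetical, and the json.dumps call assumes CrawlerRule behaves like a dict, which the Quick Start's rule['request_args'] access suggests:

import json

from uniparser import CrawlerRule

# A hypothetical rule for illustration; its fields mirror the Quick Start below.
rule_json = r'''
{
    "name": "Demo",
    "request_args": {"method": "get", "url": "https://example.com/"},
    "parse_rules": [],
    "regex": "^https?://example\\.com/$"
}
'''

rule = CrawlerRule.loads(rule_json)       # JSON string -> rule object
dumped = json.dumps(rule)                 # rule is dict-like, so json.dumps serializes it
assert CrawlerRule.loads(dumped) == rule  # the round-trip should preserve the rule
print(rule['request_args']['url'])        # https://example.com/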
Quick Start
Crawl the Python Meta-PEPs
# -*- coding: utf-8 -*-
import asyncio

import httpx
from uniparser import CrawlerRule, Uniparser

try:
    import uvloop
    uvloop.install()
except ImportError:
    pass

list_crawler_json = r'''
{
    "name": "SeedParser",
    "request_args": {
        "method": "get",
        "url": "https://www.python.org/dev/peps/",
        "headers": {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
        }
    },
    "parse_rules": [{
        "name": "links",
        "chain_rules": [[
            "css",
            "#index-by-category #meta-peps-peps-about-peps-or-processes td.num>a",
            "@href"
        ], ["re", "^/", "@https://www.python.org/"]],
        "childs": ""
    }],
    "regex": "^https?://www.python.org/dev/peps/$"
}
'''

detail_crawler_json = r'''
{
    "name": "SeedParser",
    "request_args": {
        "method": "get",
        "url": "https://www.python.org/dev/peps/pep-0001/",
        "headers": {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
        }
    },
    "parse_rules": [{
        "name": "title",
        "chain_rules": [
            ["css", "h1.page-title", "$text"],
            ["python", "getitem", "[0]"]
        ],
        "childs": ""
    }, {
        "name": "author",
        "chain_rules": [
            ["css", "#content > div > section > article > table > tbody > tr:nth-child(3) > td", "$text"],
            ["python", "getitem", "[0]"]
        ],
        "childs": ""
    }],
    "regex": "^https?://www.python.org/dev/peps/pep-\\d+/?$"
}
'''


class CrawlerTask(object):

    def __init__(self, uniparser: Uniparser, list_crawler_json,
                 detail_crawler_json):
        self.uni = uniparser
        self.list_crawler_rule = CrawlerRule.loads(list_crawler_json)
        self.detail_crawler_rule = CrawlerRule.loads(detail_crawler_json)

    async def crawl(self):
        # 1. fetch the seed page and parse the list of PEP links
        async with httpx.AsyncClient() as req:
            resp = await req.request(**self.list_crawler_rule['request_args'])
            # a rule sometimes carries an `encoding` arg
            if self.list_crawler_rule.get('encoding'):
                resp.encoding = self.list_crawler_rule.get('encoding')
            scode = resp.text
            result = self.uni.parse(scode, self.list_crawler_rule, '')
            # print(result)
            # {'SeedParser': {'links': ['https://www.python.org/dev/peps/pep-0001', 'https://www.python.org/dev/peps/pep-0004', 'https://www.python.org/dev/peps/pep-0005', 'https://www.python.org/dev/peps/pep-0006', 'https://www.python.org/dev/peps/pep-0007', 'https://www.python.org/dev/peps/pep-0008', 'https://www.python.org/dev/peps/pep-0010', 'https://www.python.org/dev/peps/pep-0011', 'https://www.python.org/dev/peps/pep-0012']}}
            links = result['SeedParser']['links']
            # 2. fetch each detail page whose URL matches the rule's regex, concurrently
            tasks = [
                asyncio.ensure_future(
                    req.request(**self.detail_crawler_rule.get_request(url=link)))
                for link in links
                if self.detail_crawler_rule.match(link)
            ]
            results = []
            for task in tasks:
                resp = await task
                if self.detail_crawler_rule.get('encoding'):
                    resp.encoding = self.detail_crawler_rule.get('encoding')
                scode = resp.text
                result = self.uni.parse(scode, self.detail_crawler_rule, '')
                results.append(result)
            return results


async def main():
    uni = Uniparser()
    crawler = CrawlerTask(uni, list_crawler_json, detail_crawler_json)
    results = await crawler.crawl()
    for result in results:
        print('Title :', result['SeedParser']['title'])
        print('Author:', result['SeedParser']['author'].strip())
        print('=' * 30)


if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

Output:
Title : PEP 1 -- PEP Purpose and Guidelines
Author: Barry Warsaw, Jeremy Hylton, David Goodger, Nick Coghlan
==============================
Title : PEP 4 -- Deprecation of Standard Modules
Author: Brett Cannon , Martin von Löwis
==============================
...
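As used above, CrawlerRule.match checks a URL against the rule's regex, and get_request merges the stored request_args with a new url. A minimal sketch, reusing detail_crawler_json from the Quick Start:

from uniparser import CrawlerRule

rule = CrawlerRule.loads(detail_crawler_json)  # from the Quick Start above
link = 'https://www.python.org/dev/peps/pep-0008'
if rule.match(link):                    # truthy: the link fits the rule's regex
    args = rule.get_request(url=link)   # request_args with the url swapped in
    print(args['method'], args['url'])  # get https://www.python.org/dev/peps/pep-0008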
Uniparser Test Console Demo (Web UI)
1. Prepare Environment
- pip install bottle uniparser
- python -m uniparser 8080
2. Open a browser at http://127.0.0.1:8080/
2.1 Start page
2.2 Prepare the rules
2.3 Read the parse result
Show result as repr(result)
{'HelloWorld': {'rule1-get-first-p': 'Customer name: ', 'rule2-get-legends': [' Pizza Size ', ' Pizza Toppings ']}}
As you can see, the CrawlerRule's name is the root key, and each ParseRule's name keys its own result.
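A minimal sketch of that result shape, reusing the parse API from the Quick Start. The HelloWorld rule and the inline HTML below are hypothetical stand-ins for the Web UI's demo page:

from uniparser import CrawlerRule, Uniparser

# Hypothetical rule: one CrawlerRule named HelloWorld wrapping one ParseRule.
crawler_rule = CrawlerRule.loads(r'''
{
    "name": "HelloWorld",
    "request_args": {"method": "get", "url": "http://example.com/"},
    "parse_rules": [{
        "name": "rule1-get-first-p",
        "chain_rules": [["css", "p", "$text"], ["python", "getitem", "[0]"]],
        "childs": ""
    }],
    "regex": ""
}
''')

uni = Uniparser()
html = '<form><p>Customer name: </p></form>'
print(uni.parse(html, crawler_rule, ''))
# Expected shape: the rule names become the nested keys, e.g.
# {'HelloWorld': {'rule1-get-first-p': 'Customer name: '}}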
More Usage
Talk is cheap, so for now the code is the doc.
See the examples in test_parsers.py.
TODO
- Release to pypi.org
- Upload dist with Web UI
- Add github actions for testing package
- Web UI for testing rules
- Complete the whole doc