uniparser
Provide a universal solution for crawlers, Python 3.6+. Read more: https://github.com/ClericPy/uniparser
Install
pip install uniparser -U
Why?
- Reduce the amount of duplicated code across many similar crawlers & parsers. Don't Repeat Yourself.
- Make the parsing process of different parsers persistent (serializable).
- Separate the parsing process from the downloading.
- Provide a universal solution for crawler platforms.
- Summarize the common string-parsing tools on the market.
Feature List
- Support most popular parsers for HTML / XML / JSON / any string / Python objects.
- Parser list:
  1. css (HTML)
     - bs4
  2. xml
     - lxml
  3. regex
  4. jsonpath
     - jsonpath_ng
  5. objectpath
     - objectpath
  6. jmespath
     - jmespath
  7. time
  8. loader (json / yaml / toml)
     - toml
     - pyyaml
  9. udf
     - source code for exec & eval, which is named **parse**
  10. python
      - some common Python methods: getitem, split, join...
  11. *waiting for new ones...*
- Request args persistence: supports curl string, single URL, dict, and JSON.
- A simple Web UI for generating & testing CrawlerRule.
- Serializable JSON rule classes for saving the whole parsing process.
  - Each ParseRule / CrawlerRule / HostRule subclass can be json.dumps to JSON for persistence.
  - Therefore, they can also be loaded back from a JSON string.
  - The nesting of rule names is treated as the result format. (A rule's own result is ignored if it has `childs`.)
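Since rules persist as plain JSON, they round-trip losslessly through the standard library. A minimal sketch with a hypothetical rule dict (the real ParseRule / CrawlerRule classes wrap this kind of structure with their own `.dumps()` / `.loads()` helpers, as shown in the Quick Start):

```python
import json

# Hypothetical minimal rule dict, for illustration only.
rule = {
    "name": "links",
    "chain_rules": [["css", "td.num>a", "@href"]],
    "childs": "",
}
serialized = json.dumps(rule)       # persist to a file or database
restored = json.loads(serialized)   # load it back unchanged
assert restored == rule
```
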
Quick Start
Crawl python Meta-PEPs
Only about 25 lines of necessary code besides the rules (which can be stored elsewhere).
JSON Rule
list_crawler_json = r'''
{
"name": "SeedParser",
"request_args": {
"method": "get",
"url": "https://www.python.org/dev/peps/",
"headers": {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
}
},
"parse_rules": [{
"name": "links",
"chain_rules": [[
"css",
"#index-by-category #meta-peps-peps-about-peps-or-processes td.num>a",
"@href"
], ["re", "^/", "@https://www.python.org/"]],
"childs": ""
}],
"regex": "^https?://www.python.org/dev/peps/$"
}
'''
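The second chain rule above, `["re", "^/", "@https://www.python.org/"]`, reads (assuming the `@` prefix marks a replacement string, which is my reading of the rule syntax) as "replace a leading `/` with the absolute host prefix", turning the relative hrefs extracted by the css rule into absolute URLs. In plain Python that step looks like:

```python
import re

# Relative href as extracted by the css rule's "@href" step.
href = "/dev/peps/pep-0001"
# Prepend the host by substituting the leading slash.
absolute = re.sub(r"^/", "https://www.python.org/", href)
print(absolute)  # https://www.python.org/dev/peps/pep-0001
```
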
detail_crawler_json = r'''
{
"name": "SeedParser",
"request_args": {
"method": "get",
"url": "https://www.python.org/dev/peps/pep-0001/",
"headers": {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
}
},
"parse_rules": [
{
"name": "title",
"chain_rules": [
[
"css",
"h1.page-title",
"$text"
],
[
"python",
"getitem",
"[0]"
]
],
"childs": ""
},
{
"name": "author",
"chain_rules": [
[
"css",
"#content > div > section > article > table > tbody > tr:nth-child(3) > td",
"$text"
],
[
"python",
"getitem",
"[0]"
]
],
"childs": ""
}
],
"regex": "^https?://www.python.org/dev/peps/pep-\\d+/?$"
}
'''
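The `["python", "getitem", "[0]"]` steps above appear to index into the previous step's result: since the css `$text` extraction yields a list of matched texts, `"[0]"` picks the first one. The equivalent plain Python, with sample data assumed from the Print Result section below:

```python
# Sample list as produced by a css "$text" extraction (assumed).
texts = ["PEP 1 -- PEP Purpose and Guidelines"]
# The getitem "[0]" step reduces the list to its first element.
title = texts[0]
print(title)  # PEP 1 -- PEP Purpose and Guidelines
```
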
Crawler Code
# -*- coding: utf-8 -*-
import asyncio
import json

from uniparser import CrawlerRule, HTTPXAsyncAdapter, Uniparser


class CrawlerTask(object):

    def __init__(self, uniparser: Uniparser, list_crawler_json,
                 detail_crawler_json):
        self.uni = uniparser
        self.list_crawler_rule = CrawlerRule.loads(list_crawler_json)
        self.detail_crawler_rule = CrawlerRule.loads(detail_crawler_json)

    async def crawl(self):
        # 1. crawl the seed page to get the url list
        result = await self.uni.acrawl(self.list_crawler_rule)
        # print(result)
        # {'SeedParser': {'links': ['https://www.python.org/dev/peps/pep-0001', 'https://www.python.org/dev/peps/pep-0004', 'https://www.python.org/dev/peps/pep-0005', 'https://www.python.org/dev/peps/pep-0006', 'https://www.python.org/dev/peps/pep-0007', 'https://www.python.org/dev/peps/pep-0008', 'https://www.python.org/dev/peps/pep-0010', 'https://www.python.org/dev/peps/pep-0011', 'https://www.python.org/dev/peps/pep-0012']}}
        links = result['SeedParser']['links']
        # 2. crawl each matching detail page concurrently
        tasks = [
            asyncio.ensure_future(
                self.uni.acrawl(self.detail_crawler_rule, url=link))
            for link in links
            if self.detail_crawler_rule.match(link)
        ]
        results = [await task for task in tasks]
        return results


async def main():
    uni = Uniparser(HTTPXAsyncAdapter())
    crawler = CrawlerTask(uni, list_crawler_json, detail_crawler_json)
    results = await crawler.crawl()
    print(json.dumps(results, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
Print Result
[
{
"SeedParser": {
"title": "PEP 1 -- PEP Purpose and Guidelines",
"author": "Barry Warsaw, Jeremy Hylton, David Goodger, Nick Coghlan"
}
},
{
"SeedParser": {
"title": "PEP 4 -- Deprecation of Standard Modules",
"author": "Brett Cannon <brett at python.org>, Martin von Löwis <martin at v.loewis.de>"
}
},
{
"SeedParser": {
"title": "PEP 5 -- Guidelines for Language Evolution",
"author": "paul at prescod.net (Paul Prescod)"
}
},
{
"SeedParser": {
"title": "PEP 6 -- Bug Fix Releases",
"author": "aahz at pythoncraft.com (Aahz), anthony at interlink.com.au (Anthony Baxter)"
}
},
{
"SeedParser": {
"title": "PEP 7 -- Style Guide for C Code",
"author": "Guido van Rossum <guido at python.org>, Barry Warsaw <barry at python.org>"
}
},
{
"SeedParser": {
"title": "PEP 8 -- Style Guide for Python Code",
"author": "Guido van Rossum <guido at python.org>,\nBarry Warsaw <barry at python.org>,\nNick Coghlan <ncoghlan at gmail.com>"
}
},
{
"SeedParser": {
"title": "PEP 10 -- Voting Guidelines",
"author": "barry at python.org (Barry Warsaw)"
}
},
{
"SeedParser": {
"title": "PEP 11 -- Removing support for little used platforms",
"author": "Martin von Löwis <martin at v.loewis.de>,\nBrett Cannon <brett at python.org>"
}
},
{
"SeedParser": {
"title": "PEP 12 -- Sample reStructuredText PEP Template",
"author": "David Goodger <goodger at python.org>,\nBarry Warsaw <barry at python.org>,\nBrett Cannon <brett at python.org>"
}
}
]
Uniparser Test Console Demo (Web UI)
1. Prepare Environment
- pip install bottle uniparser
- python -m uniparser 8080
2. Open http://127.0.0.1:8080/ in a browser
Start page
Prepare the rules
Read the parse result
Show result as repr(result)
{'HelloWorld': {'rule1-get-first-p': 'Customer name: ', 'rule2-get-legends': [' Pizza Size ', ' Pizza Toppings ']}}
As we can see, the CrawlerRule's name is the root key, and each ParseRule's name keys its own parsed value beneath it.
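The result from the Web UI demo above is an ordinary nested dict, so it can be consumed directly:

```python
# Shape of the demo result: CrawlerRule name as root key,
# ParseRule names as the inner keys.
result = {
    'HelloWorld': {
        'rule1-get-first-p': 'Customer name: ',
        'rule2-get-legends': [' Pizza Size ', ' Pizza Toppings '],
    }
}
first_legend = result['HelloWorld']['rule2-get-legends'][0]
print(first_legend)  #  Pizza Size
```
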
More Usage
Talk is cheap; for now the code serves as the documentation (a complete doc is still on the TODO list).
Test code: test_parsers.py
TODO
- Release to pypi.org
- Upload dist with Web UI
- Add github actions for testing package
- Web UI for testing rules
- Complete the whole doc
Hashes for uniparser-0.1.1-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 122a7316d522b4fdd35216306b9fff3e84b8426ef24f24ed6ddcd9ef674e05af
MD5 | af473c0f58e19009a50fd8fa3b71412a
BLAKE2b-256 | f9b5d7de6170e90a766de535de894f880a323a99ad6d27b8c992bd0d6f185da3