Skip to main content

Page Parser Utils For scraping, List index update

Project description

Parse Utilities (ParseUtils)

This package helps you extracting python dict from html/xml contents

Installation

pip install parse-utils

Usage

from parse_utils.page_parser import PageParser, ItemExtracter

html_data = """
<html>
    <head><title>This is title</title></head>
    <body>
        <p id="header">This is header id</p>
        <p id="header">This is header2</p>
        <p class="content">This is content</p>
    </body>
</html>
"""

html_rows_data = """
<html>
    <head><title>This is title</title></head>
    <body>
    <ul id="contents">
        <li><a href="/first"> First Item</a><p>Description of Item 1</p></li>
        <li><a href="/Second"> Second Item</a><p>Description of Item 2</p></li>
        <li><a href="/Second"> Second Item</a><p>Description of Item 3</p></li>
        <li><a href="/Second"> First Item</a><p>Description of Item 4</p></li>
    </ul>
        <p class="content">This is content</p>
    </body>
</html>
"""


json_data = {"name": "Yogendra", "address": {"country": "Nepal", "city": "Pokhara",}}


def test_html_parser():
    """
    """
    config = {
        "header": ['//p[@id="header"]/text()'],
        "content": ['//p[@class="content"]'],
        "body": ["//body"],
    }
    pparser = PageParser(html_data)
    item = pparser.extract_dict(config)

    item2 = pparser.extract_dict(config, is_list=True)
    print(item2["body"])
    item3 = pparser.extract_dict(config, linebreaks=True)
    print(item3["body"])
    print(item)


def test_json_parser():
    """
    """
    config = {
        "header": ["name"],
        "city": ["address", "city"],
    }
    jparser = PageParser(json_data, selector=True)
    item = jparser.extract_dict_from_json(config)
    print(item)


def test_items_parser():
    """
    """
    config = {
        "results": "//ul/li",
        "fields": {
            "title": ["./a/text()"],
            "description": ["./p"],
            "link": ["./a/@href"],
        },
    }
    for item in ItemExtracter.extract_items(
        config["results"], config["fields"], html_rows_data
    ):
        print(item)


if __name__ == "__main__":
    test_html_parser()
    test_json_parser()
    test_items_parser()

Output will be:

['This is header id\n        This is header2\n        This is content']
This is header id
        This is header2
        This is content
{'header': 'This is header id', 'content': 'This is content', 'body': 'This is header id\n        This is header2\n        This is content'}
{'header': 'Yogendra', 'city': 'Pokhara'}
{'title': 'First Item', 'description': 'Description of Item 1', 'link': '/first'}
{'title': 'Second Item', 'description': 'Description of Item 2', 'link': '/Second'}
{'title': 'Second Item', 'description': 'Description of Item 3', 'link': '/Second'}
{'title': 'First Item', 'description': 'Description of Item 4', 'link': '/Second'}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

parse_utils-1.3.1-py3-none-any.whl (4.3 kB view details)

Uploaded Python 3

File details

Details for the file parse_utils-1.3.1-py3-none-any.whl.

File metadata

  • Download URL: parse_utils-1.3.1-py3-none-any.whl
  • Upload date:
  • Size: 4.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.14.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/3.6.9

File hashes

Hashes for parse_utils-1.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 3cf9483c26a24c957613a5628cd088754486dfc3ef282d92e1742ab682dd1a52
MD5 58b83ca61d4142272db971a0f3f9a5e9
BLAKE2b-256 0f38db87a03f594a03031905ae49f16a768a4c2bc697073bb280467ca9800f62

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page