Skip to main content

Page Parser Utils For scraping, List index update

Project description

Parse Utilities (ParseUtils)

This package helps you extracting python dict from html/xml contents

Installation

pip install parse-utils

Usage

from parse_utils.page_parser import PageParser, ItemExtractor

html_data = """
<html>
    <head><title>This is title</title></head>
    <body>
        <p id="header">This is header id</p>
        <p id="header">This is header2</p>
        <p class="content">This is content</p>
    </body>
</html>
"""

html_rows_data = """
<html>
    <head><title>This is title</title></head>
    <body>
    <ul id="contents">
        <li><a href="/first"> First Item</a><p>Description of Item 1</p></li>
        <li><a href="/Second"> Second Item</a><p>Description of Item 2</p></li>
        <li><a href="/Second"> Second Item</a><p>Description of Item 3</p></li>
        <li><a href="/Second"> First Item</a><p>Description of Item 4</p></li>
    </ul>
        <p class="content">This is content</p>
    </body>
</html>
"""


json_data = {"name": "Yogendra", "address": {"country": "Nepal", "city": "Pokhara",}}


def test_html_parser():
    """
    """
    config = {
        "header": ['//p[@id="header"]/text()'],
        "content": ['//p[@class="content"]'],
        "description": ["//body"],
    }
    pparser = PageParser(html_data)
    item = pparser.extract_dict(config)

    item2 = pparser.extract_dict(config, is_list=True)
    print(item2["body"])
    item3 = pparser.extract_dict(config, linebreaks=True)
    print(item3["body"])
    print(item)


def test_json_parser():
    """
    """
    config = {
        "header": ["name"],
        "city": ["address", "city"],
    }
    jparser = PageParser(json_data, selector=True)
    item = jparser.extract_dict_from_json(config)
    print(item)


def test_items_parser():
    """
    """
    config = {
        "results": "//ul/li",
        "fields": {
            "title": ["./a/text()"],
            "description": ["./p"],
            "link": ["./a/@href"],
        },
    }
    for item in ItemExtractor.extract_items(
        config["results"], config["fields"], html_rows_data
    ):
        print(item)


def test_items_parser_with_seed():
    """
    """
    seed_dict = {'default_key': 'default_value'}
    config = {
        "results": "//ul/li",
        "fields": {
            "title": ["./a/text()"],
            "description": ["./p"],
            "link": ["./a/@href"],
        },
    }
    for item in ItemExtractor.extract_items(
        config["results"], config["fields"], html_rows_data, item=seed_dict
    ):
        print(item)

def test_items_parser_with_results():
    """
    """
    seed_dict = {'default': 'default_list'}
    config = {
        "results": ["//apple/ball", "//ul/li"],
        "fields": {
            "title": ["./a/text()"],
            "description": ["./p"],
            "link": ["./a/@href"],
        },
    }
    for item in ItemExtractor.extract_items(
        config["results"], config["fields"], html_rows_data, item=seed_dict
    ):
        print(item)

if __name__ == "__main__":
    test_html_parser()
    test_json_parser()
    test_items_parser()
    test_items_parser_with_seed()
    test_items_parser_with_results()

Output will be:

['This is header id\n        This is header2\n        This is content']
This is header id
        This is header2
        This is content
{'header': 'This is header id', 'content': 'This is content', 'body': 'This is header id\n        This is header2\n        This is content'}
{'header': 'Yogendra', 'city': 'Pokhara'}
{'title': 'First Item', 'description': 'Description of Item 1', 'link': '/first'}
{'title': 'Second Item', 'description': 'Description of Item 2', 'link': '/Second'}
{'title': 'Second Item', 'description': 'Description of Item 3', 'link': '/Second'}
{'title': 'First Item', 'description': 'Description of Item 4', 'link': '/Second'}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

parse_utils-1.3.5-py2.py3-none-any.whl (4.5 kB view details)

Uploaded Python 2Python 3

File details

Details for the file parse_utils-1.3.5-py2.py3-none-any.whl.

File metadata

  • Download URL: parse_utils-1.3.5-py2.py3-none-any.whl
  • Upload date:
  • Size: 4.5 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.0.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7

File hashes

Hashes for parse_utils-1.3.5-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 53a12e68aad05b20d0260366c906cfe12444a8373e7c1af1ee90c26fa5180766
MD5 df707f7236bfa1d657e9e4b5b64c6e02
BLAKE2b-256 de0a27d523a40c7f8dadb92b2ec385194bcaf2ffb0db62f73c714c62e36b8626

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page