A bs4 (BeautifulSoup) parser with YAML config support

Project description

iparse

iparse is a Python package for parsing HTML into structured data with as little code as possible.

It aims to make the process of parsing HTML quick and easy!

iparse highlights:

  • most of the work is done in YAML
  • Python code is needed only to refine raw HTML values, and as little of it as possible
  • when the HTML layout changes, only the YAML needs to be updated

Installation

pip install iparse

A Simple Example

For an HTML page, e.g. the lovely xkcd Python comic,

to get the structured data, all you need to do is:

  • create a class that inherits from IParser
  • write a YAML config file that defines all the locators

Create xkcd_353.py

The parser searches startup_dir for a YAML file named after the class: the class name without its Parser suffix, converted to snake_case. So XkcdParser looks for xkcd.yaml.

from pathlib import Path
from iparse._parse import IParser, RsvWords

# directory containing this script; used as the default startup_dir
HOME_DIR = Path(__file__).parents[0]


class XkcdParser(IParser):
    def __init__(self, file_name, is_test_mode=False, **kwargs):
        # fall back to this script's directory unless the caller sets startup_dir
        kwargs['startup_dir'] = kwargs.get('startup_dir', HOME_DIR)
        super().__init__(file_name, is_test_mode=is_test_mode, **kwargs)


if __name__ == "__main__":
    # parse the saved HTML page and print the resulting dict
    xkcd = XkcdParser(file_name=HOME_DIR / 'xkcd_python_353.htm')
    xkcd.do_parse()
    print(xkcd.data)
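
The config lookup rule described above (drop the Parser suffix, convert the rest to snake_case, append .yaml) can be sketched in a few lines. This only illustrates the naming convention and is not iparse's own code: config_name_for is a hypothetical helper, and the real package may treat edge cases such as acronyms differently.

```python
import re


def config_name_for(class_name: str, suffix: str = "Parser") -> str:
    """Return the YAML file name implied by a parser class name.

    Hypothetical helper illustrating the convention; not part of iparse.
    """
    # Drop the trailing "Parser" suffix if present.
    if class_name.endswith(suffix):
        class_name = class_name[: -len(suffix)]
    # Insert "_" before every capital letter except the first, then lowercase.
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()
    return f"{snake}.yaml"


print(config_name_for("XkcdParser"))    # xkcd.yaml
print(config_name_for("MySiteParser"))  # my_site.yaml
```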

Create a file named xkcd.yaml

You can use any supported locator, but CSS selectors are recommended.

page:
  # css_selector of title: head>title
  title: head>title
  # css_selector: div#footnote
  footnote: div#footnote
  # css_selector: div#licenseText
  license: div#licenseText

The parsed output

The parsed data, xkcd.data, is a dict. It is also available as YAML or JSON through xkcd.data_as_yaml and xkcd.data_as_json.

YAML output

page:
  footnote: "xkcd.com is best viewed with Netscape Navigator 4.0 or below on a Pentium\
    \ 3\xB11 emulated in Javascript on an Apple IIGSat a screen resolution of 1024x1.\
    \ Please enable your ad blockers, disable high-heat drying, and remove your devicefrom\
    \ Airplane Mode and set it to Boat Mode. For security reasons, please leave caps\
    \ lock on while browsing."
  license: '


    This work is licensed under a

    Creative Commons Attribution-NonCommercial 2.5 License.


    This means you''re free to copy and share these comics (but not to sell them).
    More details.

    '
  title: 'xkcd: Python'

JSON output

{
  "page": {
    "footnote": "xkcd.com is best viewed with Netscape Navigator 4.0 or below on a Pentium 3\u00b11 emulated in Javascript on an Apple IIGSat a screen resolution of 1024x1. Please enable your ad blockers, disable high-heat drying, and remove your devicefrom Airplane Mode and set it to Boat Mode. For security reasons, please leave caps lock on while browsing.",
    "license": "\n\nThis work is licensed under a\nCreative Commons Attribution-NonCommercial 2.5 License.\n\nThis means you're free to copy and share these comics (but not to sell them). More details.\n",
    "title": "xkcd: Python"
  }
}
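
The JSON output above has sorted keys, two-space indentation, and the non-ASCII "±" escaped as \u00b1, which is what the standard json module produces with the settings below. This is a sketch of serializing a dict like xkcd.data by hand; that data_as_json is implemented this way is an assumption, and the sample dict is a shortened stand-in.

```python
import json

# Shortened stand-in for xkcd.data; values are abbreviated for illustration.
data = {
    "page": {
        "title": "xkcd: Python",
        "footnote": "Pentium 3±1",
    }
}

# sort_keys and indent reproduce the layout of the output above;
# ensure_ascii (on by default) escapes "±" as \u00b1.
as_json = json.dumps(data, indent=2, sort_keys=True)
print(as_json)

# Loading the text back gives an equal dict.
assert json.loads(as_json) == data
```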

Details

# everything under __raw is kept exactly as written
__raw:
  site_url: https://xkcd.com/


page:
  # if no _locator is supplied, the parent soup is reused
  # page has no parent soup, so the default root soup is used
  title: head>title
  footnote: div#footnote
  license:
    _locator: div#licenseText
    # true strips surrounding whitespace; a str can be specified instead
    _striped: true

top_container:
  # a _locator is set here, so all sub-nodes select within top_container
  _locator: div#topContainer
  top_left:
    # _index: ~ means None, so the whole list of matches is used
    _index: ~
    _locator: div#topLeft>ul>li>a
    # a non-reserved key set to ~ uses the parent soup and takes its text
    # this is a convenient way to get text
    menu_text: ~
    menu_url:
      # when other attributes are present, no _locator is needed to use the parent soup
      _attr: href
      # if _attr needs extra processing, there are two options:
      # 1. `_attr_refine: true` auto-generates the method name => _refine_menu_url_href
      #    (the naming rule is _refine_<key_name>_<attr_value>)
      # 2. `_attr_refine: _a_valid_method_name` names the method explicitly
      _attr_refine: true
  top_right:
    _locator: div#topRight
    masthead:
      # two ways to get more than one attribute of an element
      # e.g. image.src/.alt
      # way 1: if both src and alt need refining, give the attrs as a list
      image_1:
        _attr:
          - src
          - alt
        _attr_refine: true
        _locator: &LOGO_IMG span>a>img
      # way 2: if not all of src/alt need refining, give the attrs as a dict
      image_2:
        _locator: *LOGO_IMG
        src:
          _attr: src
          # only src gets _attr_refine
          # 1. _attr_refine: true => _refine_src_src
          # 2. _attr_refine: _refine_image_1_src reuses an existing method
          _attr_refine: _refine_image_1_src
        alt:
          _attr: alt

      slogan: span#slogan
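
The _attr_refine naming rule from the comments above can be illustrated with a small dispatcher. This is a sketch under assumptions, not iparse's implementation: refine_method_name and DemoRefiner are hypothetical, and the signature of a refine method (taking the raw attribute value and returning the refined one) is assumed for illustration. Here the hypothetical _refine_menu_url_href turns a relative href into an absolute URL based on the site_url value from __raw.

```python
from urllib.parse import urljoin


def refine_method_name(key_name, attr_name, attr_refine):
    """Resolve the refine method name for one attribute (hypothetical helper)."""
    if attr_refine is True:
        # Auto-generated name: _refine_<key_name>_<attr_value>
        return f"_refine_{key_name}_{attr_name}"
    if isinstance(attr_refine, str):
        # An explicit method name was given in the YAML.
        return attr_refine
    return None  # no refinement requested


class DemoRefiner:
    """Stand-in for a parser subclass; the method signature is an assumption."""

    site_url = "https://xkcd.com/"

    def _refine_menu_url_href(self, value):
        # Make a relative link absolute.
        return urljoin(self.site_url, value)

    def refine(self, key_name, attr_name, attr_refine, value):
        # Look up the method by name; pass the value through if none applies.
        name = refine_method_name(key_name, attr_name, attr_refine)
        method = getattr(self, name, None) if name else None
        return method(value) if method else value


demo = DemoRefiner()
print(demo.refine("menu_url", "href", True, "/archive"))  # https://xkcd.com/archive
print(demo.refine("menu_url", "href", None, "/archive"))  # /archive
```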

More

Please check the tests/ directory for more information.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

iparse-0.0.6.4.tar.gz (19.9 kB view details)

Uploaded Source

Built Distributions

iparse-0.0.6.4-py3.8.egg (42.8 kB view details)

Uploaded Source

iparse-0.0.6.4-py3-none-any.whl (32.0 kB view details)

Uploaded Python 3
