A bs4-based parser with YAML config support
iparse is a Python package for parsing HTML to structured data in an easy way with as little code as possible.
It aims to make the process of parsing HTML quick and easy!
- most of the work is done in YAML
- Python is only needed to refine the raw HTML data, with as little code as possible
- when the HTML layout changes a lot, only the YAML needs to be updated
pip install iparse
A Simple Example
For an HTML page, e.g. the lovely xkcd python, all you need to do to get the structured data is:
- create a class that inherits from IParser
- write a YAML config file that describes all the locators
The parser will go through the startup_dir and look for a file named after the snake_case of the class name without its suffix. For XkcdParser, defined in xkcd_353.py below, that file will be xkcd.yaml.
    from pathlib import Path

    from iparse._parse import IParser, RsvWords

    # directory containing this script (assumed to also hold the HTML and YAML files)
    HOME_DIR = Path(__file__).parent


    class XkcdParser(IParser):
        def __init__(self, file_name, is_test_mode=False, **kwargs):
            # default startup_dir to HOME_DIR unless the caller overrides it
            kwargs['startup_dir'] = kwargs.get('startup_dir', HOME_DIR)
            super().__init__(file_name, is_test_mode=is_test_mode, **kwargs)


    if __name__ == "__main__":
        xkcd = XkcdParser(file_name=HOME_DIR / 'xkcd_python_353.htm')
        xkcd.do_parse()
        print(xkcd.data)
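The config lookup described above can be sketched roughly as follows. This is not iparse's actual implementation: `config_name_for` and `find_config` are hypothetical helpers, the dropped suffix is assumed to be `Parser` (which is what XkcdParser mapping to xkcd.yaml suggests), and the recursive search is also an assumption.

```python
import re
from pathlib import Path


def config_name_for(class_name: str, suffix: str = "Parser") -> str:
    """Hypothetical helper: derive a config file name from a class name."""
    # Drop the trailing suffix if present (assumed to be "Parser").
    if class_name.endswith(suffix) and class_name != suffix:
        class_name = class_name[: -len(suffix)]
    # Insert "_" before every capital letter except the first, then lowercase.
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()
    return f"{snake}.yaml"


def find_config(startup_dir: Path, class_name: str):
    """Return the first matching config file under startup_dir, or None."""
    target = config_name_for(class_name)
    # Recursive search is an assumption about how startup_dir is scanned.
    return next(startup_dir.rglob(target), None)


print(config_name_for("XkcdParser"))      # xkcd.yaml
print(config_name_for("NewsListParser"))  # news_list.yaml
```

With this naming convention a class and its config stay paired by name alone, so nothing besides `startup_dir` has to be passed in.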
Create a file named xkcd.yaml.
You can use any supported locator, but CSS selectors are recommended.
    page:
      # css_selector of title: head>title
      title: head>title
      # css_selector: div#footnote
      footnote: div#footnote
      # css_selector: div#licenseText
      license: div#licenseText
[Figure: the output parsed data]
xkcd.data is a dict, but you can also use it with
    # all settings added to __raw are kept exactly as they were added
    __raw:
      site_url: https://xkcd.com/
    page:
      # if no _locator is supplied, the parent soup is reused
      # page has no parent soup, so the default root soup is used
      title: head>title
      footnote: div#footnote
      license:
        _locator: div#licenseText
        # true strips blanks; a str can also be specified
        _striped: true
      top_container:
        # we set a _locator here, so all sub-nodes select within top_container
        _locator: div#topContainer
        top_left:
          # _index: ~ means None, so the whole list is used
          _index: ~
          _locator: div#topLeft>ul>li>a
          # a non-reserved key set to ~ means: use the parent soup and take its text
          # this is a convenient way to get text
          menu_text: ~
          menu_url:
            # when other attributes exist, no _locator is needed to use the parent soup
            _attr: href
            # if _attr needs some extra work, there are two ways
            # 1. `_attr_refine: true` auto-generates => _refine_menu_url_href
            #    the auto-generated name follows _refine_<key_name>_<attr_value>
            # 2. `_attr_refine: _a_valid_method_name`
            _attr_refine: true
        top_right:
          _locator: div#topRight
          masthead:
            # two ways to get more than one attribute of an element
            # e.g. image.src/.alt
            # way 1: if both src and alt need refining, attrs are treated as a list
            image_1:
              _attr:
                - src
                - alt
              _attr_refine: true
              _locator: &LOGO_IMG span>a>img
            # way 2: if not all of src/alt need refining, attrs are treated as a dict
            image_2:
              _locator: *LOGO_IMG
              src:
                _attr: src
                # only set _attr_refine on src
                # 1. _attr_refine: true => _refine_src_src
                # 2. _attr_refine: _refine_image_1_src to reuse an existing method
                _attr_refine: _refine_image_1_src
              alt:
                _attr: alt
            slogan: span#slogan
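The `_attr_refine` naming rule from the config comments can be illustrated with a small standalone sketch. `resolve_refine_name` and `RefineDemo` are hypothetical and are not part of iparse; the refine method signature (raw attribute value in, refined value out) is also an assumption, so check the package's tests/ for the real one.

```python
from urllib.parse import urljoin


def resolve_refine_name(key_name, attr, attr_refine):
    """Hypothetical helper: pick the refine method name for one attribute."""
    if attr_refine is True:
        # Rule from the config comments: _refine_<key_name>_<attr_value>
        return f"_refine_{key_name}_{attr}"
    # Otherwise the config names the method explicitly.
    return attr_refine


class RefineDemo:
    """Stand-in for a parser subclass; this is not the real IParser."""

    site_url = "https://xkcd.com/"

    def _refine_menu_url_href(self, value):
        # Assumed signature: raw attribute value in, refined value out.
        return urljoin(self.site_url, value)

    def refine(self, key_name, attr, attr_refine, value):
        # Look the method up by name and apply it to the raw value.
        method = getattr(self, resolve_refine_name(key_name, attr, attr_refine))
        return method(value)


demo = RefineDemo()
print(resolve_refine_name("menu_url", "href", True))             # _refine_menu_url_href
print(resolve_refine_name("src", "src", "_refine_image_1_src"))  # _refine_image_1_src
print(demo.refine("menu_url", "href", True, "/archive"))         # https://xkcd.com/archive
```

Naming a method explicitly, as `image_2.src` does with `_refine_image_1_src`, lets several keys share one refine method instead of each needing its own.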
Please check tests/ for more information.