Parsing HTML to JSON
Convert an HTML webpage to JSON data using a template defined in JSON.
Installation Guide
This package is available on PyPI. Use pip install -U html2json to install it, then import it with from html2json import collect.
- Note that starting with version 0.3.0, Python 3.9 or later is required.
API
The method is collect(html, template). html is the HTML of the page loaded as a string, and template is the JSON of the template loaded as Python objects.
Note that the HTML must contain a root node, such as <html>...</html> or <div>...</div>.
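As a minimal sketch of the call described above, the snippet below parses a template from JSON text and passes it to collect together with an HTML string. The sample HTML, the names TEMPLATE_TEXT, HTML and load_template, and the guarded import are assumptions made for illustration; the import is guarded only so the snippet still runs where html2json is not installed.

```python
import json

try:
    from html2json import collect  # the function described in this README
except ImportError:  # assumption: allow the sketch to run without the package
    collect = None

# Template as JSON text. Inside JSON, a regex backslash is written as \\.
TEMPLATE_TEXT = r'''
{
    "Color": ["head link:nth-of-type(1)", "href", ["/\\w+(?=\\.css)/"]]
}
'''

# Made-up sample page; note that it contains a root node (<html>).
HTML = (
    "<html><head>"
    '<link rel="stylesheet" href="/static/blue.css">'
    "</head><body></body></html>"
)


def load_template(text):
    """Parse template JSON text into plain Python objects (dict, list, str, None)."""
    return json.loads(text)


template = load_template(TEMPLATE_TEXT)

if collect is not None:
    # html is a string, template is the parsed Python object.
    print(collect(HTML, template))
```

Loading the template with json.loads means JSON null becomes Python None and the doubled backslashes in the regex collapse to single ones before the template reaches collect.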
Template Syntax
For detailed syntax examples, please refer to the unit tests (with 100% coverage).
The basic syntax is keyName: [selector, attr, [listOfRegexes]].
1. selector is a CSS selector (supported by lxml).
- When the selector is null, the root node itself is matched.
- When the selector cannot be matched, null is returned.
- When the selector matches a single element, a string is returned.
- When the selector matches multiple elements, a list of strings is returned.
- If only the selector is needed, you can specify a string instead of a list.
2. attr matches the attribute value. It can be null to match the inner text, or the outer text when the inner text is empty.
- Optional when only the selector is needed.
3. The list of regexes [listOfRegexes] supports two forms of regex operations. The operations within the list are executed sequentially.
- Replacement: s/regex/replacement/g. The trailing g is optional and enables multiple replacements.
- Extraction: /regex/.
- Note that you can use any character as the separator instead of /.
- Optional when only the selector and/or attribute are needed.
For example:
{
"Color": ["head link:nth-of-type(1)", "href", ["/\\w+(?=\\.css)/"]]
}
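The sequential behaviour of the regex list can be sketched with Python's re module. The function apply_regex_ops is a hypothetical stand-in, not html2json's implementation. It assumes that an operation starting with s and containing exactly three separator characters is a replacement, that the separator never appears inside the pattern or replacement, that an extraction returns the whole first match, and that a failed extraction yields None.

```python
import re


def apply_regex_ops(value, ops):
    """Apply each regex operation to value in order (illustrative only)."""
    for op in ops:
        if len(op) > 1 and op[0] == "s" and op.count(op[1]) == 3:
            # Replacement form: s<sep>regex<sep>replacement<sep>[g]
            sep = op[1]
            pattern, replacement, flags = op[2:].split(sep)
            # Without g only the first match is replaced; count=0 means all.
            count = 0 if "g" in flags else 1
            value = re.sub(pattern, replacement, value, count=count)
        else:
            # Extraction form: <sep>regex<sep>
            sep = op[0]
            pattern = op[1:op.rindex(sep)]
            match = re.search(pattern, value)
            if match is None:
                return None  # assumption: nothing left to process
            value = match.group(0)
    return value


print(apply_regex_ops("/static/blue.css", [r"/\w+(?=\.css)/"]))
print(apply_regex_ops("a-b-c", ["s|-|+|g", r"/\w\+\w/"]))
```

The second call uses | as the separator for the replacement and then feeds its result into an extraction, which is the chaining that the list form describes.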
Starting with version 0.3.1, keys can be matched as well as values, using the form "[selector, ...]": .... Note that the key must be a string to be valid JSON.
- When the selector cannot be matched, the key is not added to the JSON.
- When the selector matches a single element, the returned string is used as the key.
- When the selector matches multiple elements, the returned strings are used as multiple keys.
Starting with version 0.3.1, you can also replace part of a value's selector with the current key using the syntax ...{key}.... This is especially useful when the key is dynamic.
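To make the two 0.3.1 features more concrete, the following sketch builds a template whose key is a selector list encoded as a string, and shows the textual effect of {key} substitution. The selectors (.spec dt, #spec-{key}), the sample keys, and the helper fill_key are hypothetical, and the plain str.replace call is only an assumption about how the placeholder is expanded, not the library's implementation.

```python
import json

# A dynamic key: the selector list is serialized to a string, because
# JSON object keys must be strings. The selectors are made up.
key_spec = json.dumps([".spec dt", None, []])

# The value's selector refers to the current key through {key}.
template = {key_spec: ["#spec-{key}", None, []]}


def fill_key(selector, key):
    """Hypothetical helper: substitute the matched key into a value selector."""
    return selector.replace("{key}", key)


# If the key selector matched the strings "weight" and "height",
# each resulting key would get its own value selector.
for key in ["weight", "height"]:
    print(key, "->", fill_key(template[key_spec][0], key))

# The template itself is still valid JSON.
print(json.dumps(template, indent=4))
```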
Because the template is JSON, nested structures can be constructed easily.
{
"Cover": {
"URL": [".cover img", "src", []],
"Number of Favorites": [".cover .favorites", "value", []]
}
}
An alternative simplified syntax keyName: [subRoot, subTemplate] can be used.
1. subRoot is a CSS selector of the new root for each sub-entry.
2. subTemplate is a sub-template for each entry, recursively.
For example, the previous example can be simplified as follows.
{
"Cover": [".cover", {
"URL": ["img", "src", []],
"Number of Favorites": [".favorites", "value", []]
}]
}
To extract a list of sub-entries following the same sub-template, the list syntax is keyName: [[subRoot, subTemplate]]. Please note the difference (the surrounding [ and ]) from the previous syntax.
1. subRoot is the CSS selector of the new root for each sub-entry.
2. subTemplate is the sub-template for each entry, recursively.
- Optional, or null, to match the entire sub-root.
For example:
{
"Comments": [[".comments", {
"From": [".from", null, []],
"Content": [".content", null, []],
"Photos": [["img", {
"URL": ["", "src", []]
}]]
}]]
}
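Since the list-based forms differ only in their bracket structure, a small classifier can help when checking a template by hand. The function template_form below is a hypothetical helper written for this illustration, based only on the shapes described above. It is not part of html2json and may not reflect how the library dispatches internally. The "Title" entry is a made-up example of the string-only selector form.

```python
def template_form(value):
    """Guess which template form a value uses, from its shape alone."""
    if isinstance(value, dict):
        return "nested template"
    if isinstance(value, str):
        return "selector"
    if isinstance(value, list):
        # [[subRoot, subTemplate]]: a single inner list marks the list syntax.
        if len(value) == 1 and isinstance(value[0], list):
            return "list of sub-entries"
        # [subRoot, subTemplate]: the second item is itself a template.
        if len(value) == 2 and isinstance(value[1], dict):
            return "sub-template"
        # [selector, attr, [listOfRegexes]], with optional trailing items.
        return "selector"
    raise TypeError(f"unexpected template value: {value!r}")


template = {
    "Title": "h1",
    "Cover": [".cover", {"URL": ["img", "src", []]}],
    "Comments": [[".comments", {"From": [".from", None, []]}]],
}

for key, value in template.items():
    print(f"{key}: {template_form(value)}")
```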