Parsing HTML to JSON
Convert an HTML webpage to JSON data using a template defined in JSON.
Installation
This package is available on PyPI. Run `pip install -U html2json` to install it, then import it with `from html2json import collect`.
API
The method is `collect(html, template)`. `html` is the HTML of the page loaded as a string, and `template` is the JSON template loaded as Python objects.
Note that the HTML must contain a root node, like `<html>...</html>` or `<div>...</div>`.
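As a minimal sketch of the call described above: the sample HTML, the template, and the wrapper name `extract` are hypothetical and chosen only for illustration. The one call taken from this README is `collect(html, template)`.

```python
import json

# Hypothetical sample inputs, not taken from the project.
HTML = "<html><body><h1 class='title'>Hello</h1></body></html>"
TEMPLATE_JSON = '{"Title": ["h1.title", null, []]}'

# json.loads turns the JSON text into Python objects (JSON null becomes None),
# which is the form the README says collect() expects for its template.
template = json.loads(TEMPLATE_JSON)


def extract(html, template):
    """Pass an HTML string and an already-parsed template to html2json."""
    # Imported lazily so the template handling above works without the package.
    from html2json import collect
    return collect(html, template)
```

With the package installed, calling `extract(HTML, template)` should return the collected data for the `Title` key.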
Template Syntax
- The basic syntax is `keyName: [selector, attr, [listOfRegexes]]`.
  - `selector` is a CSS selector (supported by lxml).
    - When the selector is `null`, the root node itself is matched.
    - When the selector cannot be matched, `null` is returned.
  - `attr` matches the attribute value. It can be `null` to match either the inner text or the outer text when the inner text is empty.
  - The list of regexes `[listOfRegexes]` supports two forms of regex operations. The operations within the list are executed sequentially.
    - Replacement: `s/regex/replacement/g`. The trailing `g` is optional and enables multiple replacements.
    - Extraction: `/regex/`.
For example:
{
    "Color": ["head link:nth-of-type(1)", "href", ["/\\w+(?=\\.css)/"]]
}
- Because the template is JSON, nested structures can be constructed easily.
{
    "Cover": {
        "URL": [".cover img", "src", []],
        "Number of Favorites": [".cover .favorites", "value", []]
    }
}
- An alternative, simplified syntax `keyName: [subRoot, subTemplate]` can be used.
  - `subRoot` is a CSS selector of the new root for each sub-entry.
  - `subTemplate` is a sub-template for each entry, recursively.

For example, the previous example can be simplified as follows.
{
    "Cover": [".cover", {
        "URL": ["img", "src", []],
        "Number of Favorites": [".favorites", "value", []]
    }]
}
- To extract a list of sub-entries following the same sub-template, the list syntax is `keyName: [[subRoot, subTemplate]]`. Please note the difference (the surrounding `[` and `]`) from the previous syntax.
  - `subRoot` is the CSS selector of the new root for each sub-entry.
  - `subTemplate` is the sub-template for each entry, recursively.

For example:
{
    "Comments": [[".comments", {
        "From": [".from", null, []],
        "Content": [".content", null, []],
        "Photos": [["img", {
            "URL": ["", "src", []]
        }]]
    }]]
}
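The two regex operation forms described under Template Syntax (replacement `s/regex/replacement/g` and extraction `/regex/`, executed in order) can be illustrated with a small stand-alone sketch. This is not html2json's own code. The helper names `apply_op` and `apply_ops` are hypothetical, the sketch assumes that patterns and replacements contain no literal `/`, and it assumes that extraction yields the whole match, or `None` when nothing matches.

```python
import re


def apply_op(text, op):
    """Apply one operation string to text (illustrative sketch only)."""
    # Assumption: neither the pattern nor the replacement contains "/".
    parts = op.split("/")
    if parts[0] == "s" and len(parts) == 4:
        # Replacement form: s/regex/replacement/ with an optional trailing g.
        _, pattern, replacement, flags = parts
        count = 0 if "g" in flags else 1  # count=0 means "replace all" in re.sub
        return re.sub(pattern, replacement, text, count=count)
    if parts[0] == "" and len(parts) == 3:
        # Extraction form: /regex/ keeps only the first match.
        match = re.search(parts[1], text)
        return match.group(0) if match else None  # None is an assumption
    raise ValueError(f"unrecognized operation: {op!r}")


def apply_ops(text, ops):
    """Apply the operations sequentially, feeding each result to the next."""
    for op in ops:
        if text is None:
            break  # nothing left to operate on
        text = apply_op(text, op)
    return text


# The pattern from the "Color" example, applied to a hypothetical href value.
print(apply_ops("/static/blue.css", [r"/\w+(?=\.css)/"]))  # blue
# Extraction followed by a global replacement.
print(apply_ops("price: 1,234 USD", [r"/[\d,]+/", "s/,//g"]))  # 1234
```

Note that inside a JSON template the backslashes must be doubled, as in `"/\\w+(?=\\.css)/"`, whereas the Python raw strings above need only one.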