heq: Yet Another 'jq for HTML'
heq is a command-line tool that extracts structured data from HTML as JSON using concise, jq-like expressions. It can also be used as a Python library for scraping HTML through the same jq-inspired DSL, which is built on XPath.
Installation
pip install heq
heq depends on the following Python packages:
- lxml
- parsimonious
Usage as a command-line tool
$ cat << 'EOF' | heq '$`div.product` / {name: $`h2.name`.text}'
<body>
<div id="header">Welcome to Our Store!</div>
<div class="product">
<h2 class="name">Widget A</h2>
<p class="price">$10</p>
<ul class="features"><li>Durable</li><li>Lightweight</li></ul>
<a href="/products/widget_a">Details</a>
</div>
<div class="product">
<h2 class="name">Gadget B</h2>
<p class="price">$20</p>
<ul class="features"><li>Compact</li><li>Energy Efficient</li></ul>
<a href="/products/gadget_b">Details</a>
</div>
</body>
EOF
Output:
[
{"name": "Widget A"},
{"name": "Gadget B"}
]
$ cat expr.heq
`//div[@class="product"]` / {
  name: `.//h2[@class="name"]`.text,
  price: `.//p[@class="price"]`.text,
  features: `.//li` / text,
  url: `.//a`@href
}
$ cat << 'EOF' | heq -f expr.heq
(The same HTML as above)
EOF
Output:
[
{
"name": "Widget A",
"price": "$10",
"features": ["Durable", "Lightweight"],
"url": "/products/widget_a"
},
{
"name": "Gadget B",
"price": "$20",
"features": ["Compact", "Energy Efficient"],
"url": "/products/gadget_b"
}
]
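The two invocation styles above (an inline expression, or -f with an expression file) can also be driven from a script. The following is a minimal sketch, assuming that heq is on PATH, reads HTML from standard input, and writes JSON to standard output as in the examples; the helper names build_heq_command and run_heq are hypothetical and not part of heq.

```python
import json
import shutil
import subprocess


def build_heq_command(expr=None, expr_file=None):
    """Build the argument list for the two invocation styles shown above."""
    if (expr is None) == (expr_file is None):
        raise ValueError("pass exactly one of expr or expr_file")
    if expr_file is not None:
        return ["heq", "-f", expr_file]
    return ["heq", expr]


def run_heq(html, expr=None, expr_file=None):
    """Pipe html to heq on stdin and parse its output (assumed to be JSON)."""
    completed = subprocess.run(
        build_heq_command(expr, expr_file),
        input=html,
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)


# Only call the executable when it is actually installed.
if shutil.which("heq"):
    html = '<body><div class="product"><h2 class="name">Widget A</h2></div></body>'
    print(run_heq(html, expr='$`div.product` / {name: $`h2.name`.text}'))
```

Passing the expression as a single list element avoids shell quoting problems with the backticks and the $ prefix.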
Usage as a library
from heq import extract, xpath
html = '''<body>
<div class="product">
<h2 class="name">Widget A</h2>
<p class="price">$10</p>
<ul class="features"><li>Durable</li><li>Lightweight</li></ul>
</div>
<div class="product">
<h2 class="name">Gadget B</h2>
<p class="price">$20</p>
<ul class="features"><li>Compact</li><li>Energy Efficient</li></ul>
</div>
</body>'''
expr = xpath("//div[@class='product']") / {
    'name': xpath(".//h2[@class='name']").text,
    'price': xpath(".//p[@class='price']").text,
    'features': xpath(".//li") / {
        'feature': xpath('.').text
    }
}
print(extract(expr, html))
Output:
[{'name': 'Widget A',
'price': '$10',
'features': [{'feature': 'Durable'}, {'feature': 'Lightweight'}]},
{'name': 'Gadget B',
'price': '$20',
'features': [{'feature': 'Compact'}, {'feature': 'Energy Efficient'}]}]
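In this library example, features comes back as a list of single-key dictionaries, whereas the command-line example produces a plain list of strings. If the flat shape is preferred, ordinary Python post-processing is enough. The sketch below works on a copy of the output shown above; flatten_features is a hypothetical helper, not part of heq.

```python
# Result copied from the output shown above.
products = [
    {'name': 'Widget A', 'price': '$10',
     'features': [{'feature': 'Durable'}, {'feature': 'Lightweight'}]},
    {'name': 'Gadget B', 'price': '$20',
     'features': [{'feature': 'Compact'}, {'feature': 'Energy Efficient'}]},
]


def flatten_features(product):
    """Return a copy of product whose features are plain strings."""
    flat = dict(product)
    flat['features'] = [item['feature'] for item in product['features']]
    return flat


flattened = [flatten_features(product) for product in products]
print(flattened)
```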
Syntax and Semantics
Informal BNF-like Representation
<S> ::= <expr>
<expr> ::= <selector_lit> '/' <term>
| <term>
<term> ::= <dict_lit> | <dottext> | <atattr> | <filter>
<filter> ::= 'text' | <attr_lit>
<dict_lit> ::= '{' ((<dict_field_value> ',')* <dict_field_value>)? '}'
<dict_field_value> ::= <dict_field> ':' <expr>
<dottext> ::= <selector_lit> '.text'
<atattr> ::= <selector_lit> <attr_lit>
<selector_lit> ::= <css_lit> | <xpath_lit>
<css_lit> ::= '$' <backtick_lit>
<xpath_lit> ::= <backtick_lit>
<attr_lit> ::= '@' <ident_with_hyphen>
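To make the non-dictionary term rules concrete, here is a small recognizer written with regular expressions. It is only an illustration of the grammar, not heq's parser. The grammar leaves <backtick_lit> and <ident_with_hyphen> undefined, so the patterns used for them are assumptions, and classify_term is a hypothetical name.

```python
import re

# Assumed lexical rules: the grammar does not define <backtick_lit> or
# <ident_with_hyphen>, so these two patterns are guesses.
BACKTICK_LIT = r"`[^`]*`"
ATTR_LIT = r"@[A-Za-z_][A-Za-z0-9_-]*"   # '@' <ident_with_hyphen>
SELECTOR_LIT = rf"\$?{BACKTICK_LIT}"     # <css_lit> | <xpath_lit>

PATTERNS = [
    ("dottext", re.compile(rf"{SELECTOR_LIT}\.text")),
    ("atattr", re.compile(rf"{SELECTOR_LIT}{ATTR_LIT}")),
    ("filter", re.compile(rf"text|{ATTR_LIT}")),
]


def classify_term(source):
    """Return the name of the non-dict <term> rule matching source, or None."""
    source = source.strip()
    for name, pattern in PATTERNS:
        if pattern.fullmatch(source):
            return name
    return None


print(classify_term('`//div[@id="header"]`.text'))
print(classify_term('$`a`@href'))
print(classify_term('text'))
```

Dictionary literals are deliberately left out because they nest, which a single regular expression cannot express.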
heq has the concept of a context DOM tree: the DOM tree against which XPath expressions and CSS selectors are evaluated. Initially, the context is the root of the document. When the / operator is applied, the expression on its right-hand side is evaluated once per selected element, with that element as the new context.
Available syntactic constructs and their semantics are as follows:
- Value Forms
  - {key: expression}: Evaluates to a dictionary. key is a string without quotes and expression is an expression.
  - text: Evaluates to a string representing the text content of the context DOM tree.
  - @attr: Evaluates to the value associated with the attribute attr of the context DOM tree.
  - <selector>.text: Evaluates to a string representing the text content of the element(s) selected by the specified selector.
  - <selector>@attr: Evaluates to a string representing the value associated with the attribute attr of the first element selected by the specified selector.
- Selectors
  - `<xpath>`: Selects elements by evaluating the XPath against the context DOM tree.
  - $`<css_selector>`: Selects elements by evaluating the CSS selector against the context DOM tree.
- Mapping Against Query Results
  - <selector> / <value_form>: First, evaluates the selector to obtain a list of elements. Then, for each element, the value_form is evaluated with the element as the new context DOM tree. The entire expression evaluates to an array.
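The evaluation model described above can be imitated in a few lines of plain Python. The sketch below is not heq's implementation: it uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath and requires well-formed markup) in place of lxml, and the helper names text, selected_text, selected_attr, and map_over are hypothetical. It also assumes that <selector>.text applies to the first matching element.

```python
import xml.etree.ElementTree as ET

# A well-formed fragment of the sample HTML, so the stdlib XML parser accepts it.
HTML = """<body>
<div id="header">Welcome to Our Store!</div>
<div class="product">
<h2 class="name">Widget A</h2>
<ul class="features"><li>Durable</li><li>Lightweight</li></ul>
<a href="/products/widget_a">Details</a>
</div>
<div class="product">
<h2 class="name">Gadget B</h2>
<ul class="features"><li>Compact</li><li>Energy Efficient</li></ul>
<a href="/products/gadget_b">Details</a>
</div>
</body>"""


def text(context):
    """text: the text content of the context element."""
    return "".join(context.itertext())


def selected_text(context, path):
    """<selector>.text, assuming the first match is the one wanted."""
    return text(context.find(path))


def selected_attr(context, path, name):
    """<selector>@attr: attribute of the first element selected from context."""
    return context.find(path).get(name)


def map_over(context, path, value_form):
    """<selector> / <value_form>: evaluate value_form once per selected
    element, passing that element as the new context."""
    return [value_form(element) for element in context.findall(path)]


root = ET.fromstring(HTML)  # the initial context is the document root
result = map_over(
    root,
    ".//div[@class='product']",
    lambda product: {
        "name": selected_text(product, ".//h2[@class='name']"),
        "features": map_over(product, ".//li", text),
        "url": selected_attr(product, ".//a", "href"),
    },
)
print(result)
```

Because every inner path is resolved against the product element rather than the root, each dictionary only sees the features and link of its own product.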
Examples
Target HTML
<body>
<div id="header">Welcome to Our Store!</div>
<div class="product">
<h2 class="name">Widget A</h2>
<p class="price">$10</p>
<ul class="features"><li>Durable</li><li>Lightweight</li></ul>
<a href="/products/widget_a">Details</a>
</div>
<div class="product">
<h2 class="name">Gadget B</h2>
<p class="price">$20</p>
<ul class="features"><li>Compact</li><li>Energy Efficient</li></ul>
<a href="/products/gadget_b">Details</a>
</div>
</body>
Example 1
{ header: `//div[@id="header"]`.text }
evaluates to:
{ "header": "Welcome to Our Store!" }
Example 2
`//div[@id="header"]`.text
evaluates to:
"Welcome to Our Store!"
Example 3
`//div[@class="product"]` / `.//a`@href
evaluates to:
["/products/widget_a", "/products/gadget_b"]
Example 4
`//div[@class="product"]` / {
  name: `.//h2[@class="name"]`.text,
  price: `.//p[@class="price"]`.text,
  features: `.//li` / text,
  url: `.//a`@href
}
evaluates to:
[
{
"name": "Widget A",
"price": "$10",
"features": ["Durable", "Lightweight"],
"url": "/products/widget_a"
},
{
"name": "Gadget B",
"price": "$20",
"features": ["Compact", "Energy Efficient"],
"url": "/products/gadget_b"
}
]
Example 5
$`div.product` / {
  name: $`h2.name`.text,
  price: $`p.price`.text,
  features: $`li` / text,
  url: $`a`@href
}
evaluates to:
[
{
"name": "Widget A",
"price": "$10",
"features": ["Durable", "Lightweight"],
"url": "/products/widget_a"
},
{
"name": "Gadget B",
"price": "$20",
"features": ["Compact", "Energy Efficient"],
"url": "/products/gadget_b"
}
]