heq: Yet Another 'jq for HTML'

heq is a command-line tool for extracting structured data from HTML using concise, jq-like expressions. It can also be used as a Python library for scraping HTML through the same jq-inspired DSL, which is built on XPath.

Usage as a command-line tool

$ cat << 'EOF' | heq '`//div[@class="product"]` / {name: `.//h2[@class="name"]`.text}'
<body>
    <div id="header">Welcome to Our Store!</div>
    <div id="announcement">Special Offer: 20% off on all products this week!</div>
    <div class="product">
      <h2 class="name">Widget A</h2>
      <p class="price">$10</p>
      <ul class="features"><li>Durable</li><li>Lightweight</li></ul>
    </div>
    <div class="product">
      <h2 class="name">Gadget B</h2>
      <p class="price">$20</p>
      <ul class="features"><li>Compact</li><li>Energy Efficient</li></ul>
    </div>
</body>
EOF

Output:

[
  {"name": "Widget A"},
  {"name": "Gadget B"}
]

Longer expressions can be kept in a file and passed with -f:

$ cat expr.heq
`//div[@class="product"]` / {
    name: `.//h2[@class="name"]`.text,
    price: `.//p[@class="price"]`.text,
    features: `.//li` / text
}
$ cat << 'EOF' | heq -f expr.heq
(same as above)
EOF

Output:

[
  {
    "name": "Widget A",
    "price": "$10",
    "features": ["Durable", "Lightweight"]
  },
  {
    "name": "Gadget B",
    "price": "$20",
    "features": ["Compact", "Energy Efficient"]
  }
]

Usage as a library

from heq import extract, xpath
import lxml.etree
tree = lxml.etree.HTML('''<body>
    <div class="product">
      <h2 class="name">Widget A</h2>
      <p class="price">$10</p>
      <ul class="features"><li>Durable</li><li>Lightweight</li></ul>
    </div>
    <div class="product">
      <h2 class="name">Gadget B</h2>
      <p class="price">$20</p>
      <ul class="features"><li>Compact</li><li>Energy Efficient</li></ul>
    </div>
</body>''')

expr = xpath("//div[@class='product']") / {
    'name': xpath(".//h2[@class='name']").text,
    'price': xpath(".//p[@class='price']").text,
    'features': xpath(".//li") / {
      'feature': xpath('.').text
    }
}

print(extract(expr, tree))

Output:

[{'name': 'Widget A',
  'price': '$10',
  'features': [{'feature': 'Durable'}, {'feature': 'Lightweight'}]},
 {'name': 'Gadget B',
  'price': '$20',
  'features': [{'feature': 'Compact'}, {'feature': 'Energy Efficient'}]}]
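
Note that this library expression yields each feature as a {'feature': ...} dictionary, whereas the command-line example yields plain strings. If the flat shape is wanted, one option is to post-process the result in plain Python. The sketch below does not call heq at all: it starts from the printed output above, and flatten_features is a hypothetical helper name introduced only for illustration.

```python
import json

# Result copied from the library example's printed output above.
products = [
    {'name': 'Widget A', 'price': '$10',
     'features': [{'feature': 'Durable'}, {'feature': 'Lightweight'}]},
    {'name': 'Gadget B', 'price': '$20',
     'features': [{'feature': 'Compact'}, {'feature': 'Energy Efficient'}]},
]


def flatten_features(product):
    """Return a copy of product whose 'features' is a list of strings."""
    return {**product,
            'features': [item['feature'] for item in product['features']]}


flat = [flatten_features(p) for p in products]

# Serialize as JSON, the format the command-line tool prints.
print(json.dumps(flat, indent=2))
```

The flattened list has the same shape as the command-line output shown earlier.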

Syntax and Semantics

Informal BNF-like Representation

<S> ::= <expr>
<expr> ::= <xpath_lit> '/' <term>
         | <term>
<term> ::= <dict_lit> | <dottext> | <filter>
<filter> ::= 'text'
<dict_lit> ::= '{' ((<dict_field_value> ',')* <dict_field_value>)? '}'
<dict_field_value> ::= <dict_field> ':' <expr>
<dict_field> ::= unquoted string
<xpath_lit> ::= <backtick_lit>
<backtick_lit> ::= '`' XPath expression '`'
<dottext> ::= <xpath_lit> '.text'
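
To make the grammar concrete, here is a minimal recursive-descent parser sketch for it. This is not heq's actual parser. The tuple-based AST, the Parser class, and the tokenize and parse functions are all hypothetical names, and the sketch assumes that dictionary keys are identifier-like and that XPath literals contain no backticks.

```python
import re

# One token per match: backtick XPath literal, '.text', bare name, or punctuation.
TOKEN = re.compile(r"""
    \s*(?:
      `(?P<xpath>[^`]*)`
    | (?P<dottext>\.text\b)
    | (?P<name>[A-Za-z_]\w*)
    | (?P<punct>[{}:,/])
    )""", re.VERBOSE)


def tokenize(src):
    """Split src into (kind, value) pairs."""
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else (None, None)

    def take(self, kind, value=None):
        tok = self.peek()
        if tok[0] != kind or (value is not None and tok[1] != value):
            raise SyntaxError(f"expected {value or kind}, got {tok[1]!r}")
        self.i += 1
        return tok[1]

    def expr(self):
        # <expr> ::= <xpath_lit> '/' <term> | <term>
        if self.peek()[0] == "xpath":
            path = self.take("xpath")
            if self.peek() == ("punct", "/"):
                self.take("punct", "/")
                return ("map", path, self.term())
            self.take("dottext")
            return ("dottext", path)
        return self.term()

    def term(self):
        # <term> ::= <dict_lit> | <dottext> | <filter>
        kind, value = self.peek()
        if kind == "xpath":
            path = self.take("xpath")
            self.take("dottext")
            return ("dottext", path)
        if (kind, value) == ("name", "text"):
            self.take("name")
            return ("text",)
        return self.dict_lit()

    def dict_lit(self):
        # Comma-separated fields; a trailing comma is rejected, as in the BNF.
        self.take("punct", "{")
        fields = {}
        while self.peek() != ("punct", "}"):
            if fields:
                self.take("punct", ",")
            key = self.take("name")
            self.take("punct", ":")
            fields[key] = self.expr()
        self.take("punct", "}")
        return ("dict", fields)


def parse(src):
    """Parse a complete expression; leftover tokens are an error."""
    parser = Parser(tokenize(src))
    tree = parser.expr()
    if parser.i != len(parser.tokens):
        raise SyntaxError("trailing input")
    return tree


ast = parse('`//div[@class="product"]` / {name: `.//h2`.text, features: `.//li` / text}')
print(ast)
```

Because an expression that starts with an XPath literal can be either a mapping or a dottext, the sketch reads the literal first and then decides based on the next token.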

heq has the concept of a context DOM tree: the DOM tree against which XPath expressions are evaluated. When the / operator is applied, the context changes to each of the elements selected by the left-hand XPath expression in turn.

  1. Value Forms
    • {key: expression}: Evaluates to a dictionary. key is an unquoted string and expression is any heq expression, evaluated against the current context DOM tree.
    • text: Evaluates to a string representing the text content of the context DOM tree.
    • `<xpath>`.text: Evaluates to a string representing the text content of the element(s) selected by the specified XPath expression.
  2. Mapping Against Query Results
    • `<xpath>` / <value_form>: First, evaluates the XPath expression to obtain a list of elements. Then, for each element, the value_form is evaluated with the element as the new context DOM tree. The entire expression evaluates to an array.
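
The evaluation rules above can be sketched in a few lines of Python. This is an illustration of the semantics, not heq's implementation. It assumes the standard library's xml.etree.ElementTree as a stand-in for lxml (so paths must be relative, such as .//div, and the input must be well-formed XML), and it assumes that `<xpath>`.text takes the first matching element, which the description above leaves open. The names text, dottext, mapping, and evaluate are hypothetical.

```python
import xml.etree.ElementTree as ET

HTML = """<body>
  <div class="product">
    <h2 class="name">Widget A</h2>
    <ul class="features"><li>Durable</li><li>Lightweight</li></ul>
  </div>
  <div class="product">
    <h2 class="name">Gadget B</h2>
    <ul class="features"><li>Compact</li><li>Energy Efficient</li></ul>
  </div>
</body>"""


def text(ctx):
    """`text`: text content of the current context element."""
    return "".join(ctx.itertext()).strip()


def dottext(path):
    """`<xpath>`.text: text of the first element matching path (assumption)."""
    def run(ctx):
        found = ctx.find(path)
        return None if found is None else text(found)
    return run


def mapping(path, value_form):
    """`<xpath>` / <value_form>: evaluate value_form once per matched element."""
    def run(ctx):
        # Each matched element becomes the new context for value_form.
        return [evaluate(value_form, el) for el in ctx.findall(path)]
    return run


def evaluate(form, ctx):
    """Evaluate a value form against the context element ctx."""
    if isinstance(form, dict):
        # A dictionary form evaluates every field against the same context.
        return {key: evaluate(sub, ctx) for key, sub in form.items()}
    return form(ctx)


expr = mapping(".//div[@class='product']", {
    "name": dottext(".//h2[@class='name']"),
    "features": mapping(".//li", text),
})
result = evaluate(expr, ET.fromstring(HTML))
print(result)
```

The inner .//li path is resolved relative to each product div rather than the whole document, which is why each product gets only its own features.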

Examples

Target HTML

<body>
    <div id="header">Welcome to Our Store!</div>
    <div id="announcement">Special Offer: 20% off on all products this week!</div>
    <div class="product">
      <h2 class="name">Widget A</h2>
      <p class="price">$10</p>
      <ul class="features"><li>Durable</li><li>Lightweight</li></ul>
    </div>
    <div class="product">
      <h2 class="name">Gadget B</h2>
      <p class="price">$20</p>
      <ul class="features"><li>Compact</li><li>Energy Efficient</li></ul>
    </div>
</body>

Example 1

{ header: `//div[@id="header"]`.text }

evaluates to:

{ "header": "Welcome to Our Store!" }

Example 2

`//div[@id="header"]`.text

evaluates to:

"Welcome to Our Store!"

Example 3

`//div[@class="product"]` / {
    name: `.//h2[@class="name"]`.text,
    price: `.//p[@class="price"]`.text,
    features: `.//li` / text
}

evaluates to:

[
  {
    "name": "Widget A",
    "price": "$10",
    "features": ["Durable", "Lightweight"]
  },
  {
    "name": "Gadget B",
    "price": "$20",
    "features": ["Compact", "Energy Efficient"]
  }
]
