
A module and command-line utility to extract structured data from HTML


Hext — Extract Data from HTML


Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext template collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext template is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
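
To make the matching rule above concrete, here is a rough sketch that mimics what the template captures using only Python's standard html.parser module. It is an illustration of the described behavior, not how Hext is implemented; the names LinkCollector and collect_links are made up for this sketch, and trimming the title is an assumption based on the sample output.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Mimics <a href:link @text:title /> for illustration only."""

    def __init__(self):
        super().__init__()
        self.results = []      # one dict per matched <a> element
        self._current = None   # the <a> element being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An element matches only if its tag is "a" and it has an href.
        if tag == "a" and "href" in attrs:
            self._current = {"link": attrs["href"] or "", "title": ""}

    def handle_data(self, data):
        # Accumulate the text inside the currently open <a> element.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Assumption: surrounding whitespace is trimmed, as the
            # sample output suggests.
            self._current["title"] = self._current["title"].strip()
            self.results.append(self._current)
            self._current = None


def collect_links(html_text):
    """Return a list of {"link": ..., "title": ...} dictionaries."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.results


SAMPLE = """<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>"""

for row in collect_links(SAMPLE):
    print(row)
```

The one-line Hext template replaces all of this bookkeeping, which is the point of using a dedicated extraction language.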

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

rule.extract takes an optional second parameter, max_searches, of type unsigned int. The search for matching elements is aborted once this limit is reached. The default is 0, which means the search is never aborted.

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext template.
# Throws an exception of type ValueError on invalid syntax.
# A raw string keeps Python from treating \d in the regex as an escape.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as JSON
# (named row rather than map, which would shadow the built-in)
for row in result:
    print(json.dumps(row, ensure_ascii=False,
                     separators=(',', ':')))
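
Because every captured value is a string, numeric fields such as rank or score usually need converting before further use. The following is a minimal sketch of one way to do that; the helper names to_int and normalize are hypothetical, and the sample row is made-up data shaped like the captures in the template above, not real output.

```python
import re


def to_int(value):
    """Return the first run of digits in value as an int, or None."""
    match = re.search(r"\d+", value or "")
    return int(match.group()) if match else None


def normalize(row):
    """Copy a result dictionary, converting known numeric captures."""
    out = dict(row)
    # The second <tr> in the template is optional (<?tr>), so some of
    # these keys may be missing from a given row.
    for key in ("rank", "score", "comment_count"):
        if key in out:
            out[key] = to_int(out[key])
    return out


# Made-up sample row for illustration only.
sample = {
    "rank": "1.",
    "title": "Example story",
    "score": "42 points",
    "comment_count": "7",
}

print(normalize(sample))
```

In the example above, this would be applied as normalize(row) inside the loop over result.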

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext templates to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version
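
If htmlext is run with -c, its output is one JSON object per line, which is easy to consume from another program. The sketch below assumes that output has already been captured as text; parse_compact_output is a hypothetical helper, and the captured string is a stand-in modeled on the values from the quick example rather than real htmlext output.

```python
import json


def parse_compact_output(text):
    """Parse text holding one JSON object per line into a list of dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            rows.append(json.loads(line))
    return rows


# Stand-in for captured output, modeled on the quick example's values.
captured = (
    '{"link":"one.html","title":"Page 1"}\n'
    '{"link":"two.html","title":"Page 2"}\n'
)

for row in parse_compact_output(captured):
    print(row["link"], row["title"])
```

Parsing line by line means a consumer can start working before the whole output has been read, which is the usual reason to prefer this format over a single JSON array.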

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.

