
A module and command-line utility to extract structured data from HTML

Project description

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext template collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext template is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
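
To make the matching behaviour described above concrete, here is a rough standard-library sketch that produces the same values for the sample HTML. It is not how Hext is implemented, and the names LinkCollector and collect_links are made up for this illustration. It assumes that an element's text is captured with surrounding whitespace removed, as the output above suggests.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href and text of every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace and trim both ends.
            title = " ".join("".join(self._text).split())
            self.results.append({"link": self._href, "title": title})
            self._href = None


def collect_links(html):
    """Return one {"link", "title"} dictionary per matching element."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE = """<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>"""

for row in collect_links(SAMPLE):
    print(row)
```

The Hext template expresses the same intent in a single line, which is the point of the language: the element pattern and the captured names live together.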

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

rule.extract takes an optional second parameter, max_searches, of type unsigned int. The search for matching elements is aborted once this limit is reached. The default is 0, which means the search is never aborted.

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext template.
# Throws an exception of type ValueError on invalid syntax.
# A raw string keeps the backslash in the regex below intact.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as compact JSON, one per line.
# The loop variable is named row to avoid shadowing the builtin map.
for row in result:
    print(json.dumps(row, ensure_ascii=False,
                     separators=(',', ':')))
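
Since every captured value is a string, downstream code usually needs to convert some fields before using them. The following is a minimal sketch of such a step. The helper names to_int and normalize and the sample rows are hypothetical, not real extraction output. It also assumes that rows can lack keys from the optional part of the template.

```python
import re


def to_int(value):
    """Return the first run of digits in value as an int, or None."""
    match = re.search(r"\d+", value or "")
    return int(match.group()) if match else None


def normalize(row):
    """Convert one dictionary of strings into typed fields."""
    return {
        "rank": to_int(row.get("rank")),
        "title": row.get("title", ""),
        "href": row.get("href", ""),
        "score": to_int(row.get("score")),
        "user": row.get("user"),
        "comment_count": to_int(row.get("comment_count")),
    }


# Hypothetical rows shaped like the list returned by rule.extract.
sample = [
    {"rank": "1.", "href": "https://example.com/", "title": "Example",
     "score": "42 points", "user": "alice", "comment_count": "7"},
    {"rank": "2.", "href": "item?id=1", "title": "Job posting"},
]

rows = [normalize(row) for row in sample]
for row in rows:
    print(row)
```

Using row.get rather than indexing keeps the conversion from failing on rows where a capture is absent.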

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext templates to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc
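
The same utility can be driven from a Python script. The sketch below builds an htmlext invocation from the -s, -i and -c options listed in the help text and parses the compact output line by line. The function names are invented for this example. It assumes htmlext is on the PATH and that these three options can be combined in the same way the example above combines -i, -s and -f.

```python
import json
import subprocess


def htmlext_argv(template, html_path):
    """Build the argument list for one htmlext call.

    -s passes the Hext template as a string, -i names the HTML file,
    and -c requests one JSON object per line.
    """
    return ["htmlext", "-s", template, "-i", html_path, "-c"]


def parse_compact(output):
    """Parse compact output: one JSON object per non-empty line."""
    return [json.loads(line) for line in output.splitlines() if line.strip()]


def run_htmlext(template, html_path):
    """Run htmlext and return its results as a list of dictionaries.

    Raises subprocess.CalledProcessError if htmlext exits with an error.
    """
    proc = subprocess.run(
        htmlext_argv(template, html_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_compact(proc.stdout)
```

A call such as run_htmlext('<a href:link @text:title />', 'page.html') would then return the captured values as Python dictionaries, where page.html stands for any local HTML file.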

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • hext-1.0.3-cp310-cp310-manylinux2014_x86_64.whl (1.6 MB), CPython 3.10
  • hext-1.0.3-cp310-cp310-macosx_10_11_x86_64.whl (726.8 kB), CPython 3.10, macOS 10.11+ x86-64
  • hext-1.0.3-cp39-cp39-manylinux2014_x86_64.whl (1.6 MB), CPython 3.9
  • hext-1.0.3-cp39-cp39-macosx_10_11_x86_64.whl (727.0 kB), CPython 3.9, macOS 10.11+ x86-64
  • hext-1.0.3-cp38-cp38-manylinux2014_x86_64.whl (1.6 MB), CPython 3.8
  • hext-1.0.3-cp38-cp38-macosx_10_11_x86_64.whl (726.8 kB), CPython 3.8, macOS 10.11+ x86-64
  • hext-1.0.3-cp37-cp37m-manylinux2014_x86_64.whl (1.6 MB), CPython 3.7m
  • hext-1.0.3-cp37-cp37m-macosx_10_11_x86_64.whl (726.8 kB), CPython 3.7m, macOS 10.11+ x86-64
  • hext-1.0.3-cp36-cp36m-manylinux2014_x86_64.whl (1.6 MB), CPython 3.6m
  • hext-1.0.3-cp36-cp36m-macosx_10_11_x86_64.whl (726.9 kB), CPython 3.6m, macOS 10.11+ x86-64
