A module and command-line utility to extract structured data from HTML

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext snippet collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext snippet is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
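To make the matching behaviour concrete, here is a rough standard-library approximation of what the rule `<a href:link @text:title />` does to the HTML above. It is not how Hext is implemented and does not use the Hext module; the class name `LinkCollector` and the helper `collect_links` are made up for this sketch, and trimming the text with `strip()` is an assumption based on the output shown above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href and trimmed text of every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._current = None  # dict being filled while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Like the Hext rule, only <a> elements carrying an href match.
        if tag == "a" and "href" in attrs:
            self._current = {"link": attrs["href"], "title": ""}

    def handle_data(self, data):
        # Accumulate text only while inside a matching <a>.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Assumption: surrounding whitespace is trimmed, as in the output above.
            self._current["title"] = self._current["title"].strip()
            self.results.append(self._current)
            self._current = None


def collect_links(html_text):
    """Hypothetical helper: return one dict per matching <a> element."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.results


SAMPLE = """<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>"""

for entry in collect_links(SAMPLE):
    print(entry)
```

The hand-written parser needs about thirty lines to express what the Hext rule states in one, which is the point of having a dedicated extraction language.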

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

For example, to extract the submissions listed on the Hacker News front page:

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext snippet.
# Raises an exception of type ValueError on invalid syntax.
# A raw string keeps the backslash in \d+ intact.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as JSON
for item in result:
    print(json.dumps(item, ensure_ascii=False,
                     separators=(',', ':')))
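The rule above pipes the captured text through `filter(/\d+/)` before storing it as comment_count. As a rough illustration of that step, the following sketch approximates it with Python's `re` module. It assumes the filter keeps only the part of the text matched by the regular expression (or the first capture group, if the pattern has one); the function name `filter_text` is hypothetical, and returning `None` when nothing matches is a choice made for this sketch rather than documented Hext behaviour. Hext uses its own regex engine, so edge cases may differ.

```python
import re


def filter_text(text, pattern):
    """Return the part of text matched by pattern, or None if nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return None
    # Prefer the first capture group when the pattern defines one.
    return match.group(1) if match.groups() else match.group(0)


# A link text such as "42 comments" is reduced to its digits.
print(filter_text("42 comments", r"\d+"))
# A link text without digits yields no value in this sketch.
print(filter_text("discuss", r"\d+"))
```

Since extracted values are strings, a caller that wants numbers would still need to convert them, for example with `int(...)`.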

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext snippets to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc
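Because --compact prints one JSON object per line, the output of htmlext is straightforward to consume from other programs. The sketch below shows one way to do that from Python. The helper names `parse_compact_output` and `run_htmlext` are hypothetical, the sample output string is an assumed illustration of the compact format rather than captured output, and `run_htmlext` is defined but not called because it requires htmlext to be installed and on the PATH.

```python
import json
import subprocess


def parse_compact_output(text):
    """Parse text containing one JSON object per line into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def run_htmlext(hext_file, html_file):
    """Run htmlext in compact mode and return the parsed records.

    Assumes htmlext is installed and on PATH; not called in this sketch.
    Only options listed in the help text above are used.
    """
    completed = subprocess.run(
        ["htmlext", "--compact", "--hext", hext_file, "--html", html_file],
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode output to str
    )
    return parse_compact_output(completed.stdout)


# Assumed sample of compact output: one JSON object per line.
sample_output = (
    '{"link":"one.html","title":"Page 1"}\n'
    '{"link":"two.html","title":"Page 2"}\n'
)

for record in parse_compact_output(sample_output):
    print(record["link"], record["title"])
```

Parsing line by line keeps memory use flat for large result sets; if a single JSON document is preferred, the --array option listed above wraps the results in a JSON array instead.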

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.

Built Distributions

  • hext-0.2.3-cp37-cp37m-manylinux1_x86_64.whl (1.6 MB): CPython 3.7m
  • hext-0.2.3-cp37-cp37m-macosx_10_11_x86_64.whl (1.8 MB): CPython 3.7m, macOS 10.11+ x86-64
  • hext-0.2.3-cp36-cp36m-manylinux1_x86_64.whl (1.6 MB): CPython 3.6m
  • hext-0.2.3-cp36-cp36m-macosx_10_11_x86_64.whl (1.8 MB): CPython 3.6m, macOS 10.11+ x86-64
  • hext-0.2.3-cp35-cp35m-manylinux1_x86_64.whl (1.6 MB): CPython 3.5m
  • hext-0.2.3-cp35-cp35m-macosx_10_11_x86_64.whl (1.8 MB): CPython 3.5m, macOS 10.11+ x86-64
  • hext-0.2.3-cp34-cp34m-manylinux1_x86_64.whl (1.6 MB): CPython 3.4m
  • hext-0.2.3-cp34-cp34m-macosx_10_11_x86_64.whl (1.8 MB): CPython 3.4m, macOS 10.11+ x86-64
  • hext-0.2.3-cp27-cp27mu-manylinux1_x86_64.whl (1.6 MB): CPython 2.7mu
  • hext-0.2.3-cp27-cp27mu-macosx_10_11_x86_64.whl (1.8 MB): CPython 2.7mu, macOS 10.11+ x86-64
  • hext-0.2.3-cp27-cp27m-manylinux1_x86_64.whl (1.6 MB): CPython 2.7m
  • hext-0.2.3-cp27-cp27m-macosx_10_11_x86_64.whl (1.8 MB): CPython 2.7m, macOS 10.11+ x86-64
