A module and command-line utility to extract structured data from HTML

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext snippet collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext snippet is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
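To make the matching behavior above concrete, here is a rough plain-Python approximation of what the snippet `<a href:link @text:title />` collects. It uses only the standard library's html.parser and is not how Hext is implemented; the class name `LinkCollector` and the helper `collect_links` are made up for this sketch, and the whitespace trimming is inferred from the example output above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href and trimmed text of every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        # Like the Hext rule, require both the tag "a" and an href attribute.
        if tag == "a" and href is not None:
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace and trim both ends.
            title = " ".join("".join(self._text).split())
            self.results.append({"link": self._href, "title": title})
            self._href = None


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE = """<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>"""

for row in collect_links(SAMPLE):
    print(row)
```

The sketch ignores nested links and malformed markup. The Hext version stays a single line because the matching and capturing logic lives in the language itself.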

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext snippet.
# Throws an exception of type ValueError on invalid syntax.
# A raw string keeps the backslash in /\d+/ intact.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as JSON
for row in result:
    print(json.dumps(row, ensure_ascii=False,
                     separators=(',', ':')))
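Because every captured value is a string, numeric fields usually need converting before further use. The following is a minimal sketch of such a post-processing step. The key names come from the rule above, but the sample rows, the value formats such as "123 points", and the helpers `to_int` and `normalize` are assumptions made for illustration.

```python
import re

# Hypothetical rows shaped like the output of rule.extract(html);
# real values depend on the page being scraped.
SAMPLE_ROWS = [
    {"rank": "1.", "href": "https://example.com/", "title": "Example story",
     "score": "123 points", "user": "alice", "comment_count": "45"},
    {"rank": "2.", "href": "https://example.org/", "title": "Another story"},
]

NUMERIC_KEYS = ("rank", "score", "comment_count")


def to_int(value):
    """Return the first run of digits in value as an int, or None."""
    match = re.search(r"\d+", value or "")
    return int(match.group()) if match else None


def normalize(rows):
    """Copy each row, converting the numeric keys to int (None if absent)."""
    cleaned = []
    for row in rows:
        item = dict(row)
        for key in NUMERIC_KEYS:
            item[key] = to_int(row.get(key))
        cleaned.append(item)
    return cleaned


for item in normalize(SAMPLE_ROWS):
    print(item["rank"], item["title"], item["score"], item["comment_count"])
```

Rows that lack a key keep it as None instead of raising, which suits optional parts of a rule.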

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext snippets to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc
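htmlext can also be driven from a script. The sketch below builds a command line from the usage text above and parses the output of --compact, assuming each non-empty output line is exactly one JSON object, as the option's description suggests. The function names and the file names `links.hext`, `a.html` and `b.html` are hypothetical.

```python
import json
import subprocess


def build_command(hext_file, html_files, executable="htmlext"):
    """Build an argument list following: htmlext [options] <hext-file> <html-file...>"""
    return [executable, "--compact", hext_file, *html_files]


def parse_compact_output(output):
    """Parse --compact output, assumed to hold one JSON object per line."""
    return [json.loads(line) for line in output.splitlines() if line.strip()]


def run_htmlext(hext_file, html_files):
    """Run htmlext and return its results; requires htmlext on the PATH."""
    completed = subprocess.run(
        build_command(hext_file, html_files),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_compact_output(completed.stdout)


# Show the command that would be executed, without running it.
print(build_command("links.hext", ["a.html", "b.html"]))
```

Keeping command construction and output parsing in separate functions makes both easy to check without the binary installed.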

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • hext-0.2.4-cp38-cp38-manylinux1_x86_64.whl (1.6 MB): CPython 3.8
  • hext-0.2.4-cp38-cp38-macosx_10_11_x86_64.whl (1.8 MB): CPython 3.8, macOS 10.11+ x86-64
  • hext-0.2.4-cp37-cp37m-manylinux1_x86_64.whl (1.6 MB): CPython 3.7m
  • hext-0.2.4-cp37-cp37m-macosx_10_11_x86_64.whl (1.8 MB): CPython 3.7m, macOS 10.11+ x86-64
  • hext-0.2.4-cp36-cp36m-manylinux1_x86_64.whl (1.6 MB): CPython 3.6m
  • hext-0.2.4-cp36-cp36m-macosx_10_11_x86_64.whl (1.8 MB): CPython 3.6m, macOS 10.11+ x86-64
  • hext-0.2.4-cp35-cp35m-manylinux1_x86_64.whl (1.6 MB): CPython 3.5m
  • hext-0.2.4-cp35-cp35m-macosx_10_11_x86_64.whl (1.8 MB): CPython 3.5m, macOS 10.11+ x86-64
  • hext-0.2.4-cp34-cp34m-manylinux1_x86_64.whl (1.6 MB): CPython 3.4m
  • hext-0.2.4-cp34-cp34m-macosx_10_11_x86_64.whl (1.8 MB): CPython 3.4m, macOS 10.11+ x86-64
  • hext-0.2.4-cp27-cp27mu-manylinux1_x86_64.whl (1.6 MB): CPython 2.7mu
  • hext-0.2.4-cp27-cp27mu-macosx_10_11_x86_64.whl (1.8 MB): CPython 2.7mu, macOS 10.11+ x86-64
  • hext-0.2.4-cp27-cp27m-manylinux1_x86_64.whl (1.6 MB): CPython 2.7m
  • hext-0.2.4-cp27-cp27m-macosx_10_11_x86_64.whl (1.8 MB): CPython 2.7m, macOS 10.11+ x86-64
