A module and command-line utility to extract structured data from HTML

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext template collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext template is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
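To make the matching rule concrete, the sketch below reproduces the result of this one template using only Python's standard html.parser. It is not how Hext is implemented and does not use the Hext module; the class name LinkCollector, the function collect_links, and the whitespace normalization of the text are assumptions made for illustration. It handles only flat, non-nested anchors that carry an href value.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Mimics <a href:link @text:title /> for flat, non-nested anchors."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._current = None  # dict being filled while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        # An element matches only if it is an <a> with an href value.
        href = dict(attrs).get("href")
        if tag == "a" and href is not None:
            self._current = {"link": href, "title": ""}

    def handle_data(self, data):
        # Accumulate the text found inside the current anchor.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Assumption: trim and collapse whitespace in the captured text.
            self._current["title"] = " ".join(self._current["title"].split())
            self.results.append(self._current)
            self._current = None


def collect_links(html):
    """Return a list of {"link": ..., "title": ...} dictionaries."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE = """<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>"""

links = collect_links(SAMPLE)
```

The Hext template expresses the same intent in a single line, which is the point of the language: the matching and capturing logic above is implied by the template's shape.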

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

rule.extract takes an optional second parameter, max_searches, of type unsigned int. The search for matching elements is aborted once this limit is reached. The default is 0, which never aborts.

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext template.
# Raises ValueError on invalid syntax.
# A raw string keeps the backslash in /\d+/ from being
# treated as a Python escape sequence.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as JSON
for item in result:  # avoid shadowing the built-in map()
    print(json.dumps(item, ensure_ascii=False,
                     separators=(',', ':')))
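
Because rule.extract returns a plain list of dictionaries, the result can be post-processed with ordinary Python. The sketch below wraps the call in a small helper that passes max_searches as the second positional argument, as described above, and serializes each result. The helper name extract_as_json_lines is hypothetical and not part of Hext. How an aborted search is reported is not stated here, so the helper makes no attempt to handle that case.

```python
import json


def extract_as_json_lines(rule, html, max_searches=0):
    """Return one compact JSON string per extracted result.

    `rule` is expected to be a hext.Rule and `html` a hext.Html.
    max_searches=0 is the documented default, which never aborts.
    """
    results = rule.extract(html, max_searches)
    return [
        json.dumps(item, ensure_ascii=False, separators=(",", ":"))
        for item in results
    ]
```

With the rule and html objects from the example above, `for line in extract_as_json_lines(rule, html): print(line)` would print the same output as the loop, while a non-zero third argument bounds the search on large or untrusted documents.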

Using Hext on the Command Line

Hext ships with a command line utility called htmlext, which applies Hext templates to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version
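
For scripts that prefer the command-line utility over the Python module, the following sketch drives htmlext through subprocess. It relies only on the usage line and the --compact option shown in the help text above; the function names and the file names in the usage note are hypothetical, and htmlext is assumed to be on PATH.

```python
import json
import subprocess


def build_htmlext_command(hext_file, html_files, compact=True):
    """Build an argv list following the documented usage line:
    htmlext [options] <hext-file> <html-file...>
    """
    command = ["htmlext"]
    if compact:
        command.append("--compact")  # print one JSON object per line
    command.append(hext_file)
    command.extend(html_files)
    return command


def parse_compact_output(output):
    """Parse --compact output: one JSON object per non-empty line."""
    return [json.loads(line) for line in output.splitlines() if line.strip()]


def run_htmlext(hext_file, html_files):
    """Run htmlext and return the parsed results as a list of dictionaries."""
    completed = subprocess.run(
        build_htmlext_command(hext_file, html_files),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_compact_output(completed.stdout)
```

A call such as `run_htmlext("links.hext", ["page.html"])` would then return the captured values as Python dictionaries. Passing the arguments as a list avoids shell quoting issues with file names.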

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • hext-1.0.1-cp310-cp310-manylinux2014_x86_64.whl (1.6 MB): CPython 3.10
  • hext-1.0.1-cp310-cp310-macosx_10_11_x86_64.whl (726.9 kB): CPython 3.10, macOS 10.11+ x86-64
  • hext-1.0.1-cp39-cp39-manylinux2014_x86_64.whl (1.6 MB): CPython 3.9
  • hext-1.0.1-cp39-cp39-macosx_10_11_x86_64.whl (727.0 kB): CPython 3.9, macOS 10.11+ x86-64
  • hext-1.0.1-cp38-cp38-manylinux2014_x86_64.whl (1.6 MB): CPython 3.8
  • hext-1.0.1-cp38-cp38-macosx_10_11_x86_64.whl (726.9 kB): CPython 3.8, macOS 10.11+ x86-64
  • hext-1.0.1-cp37-cp37m-manylinux2014_x86_64.whl (1.6 MB): CPython 3.7m
  • hext-1.0.1-cp37-cp37m-macosx_10_11_x86_64.whl (726.9 kB): CPython 3.7m, macOS 10.11+ x86-64
  • hext-1.0.1-cp36-cp36m-manylinux2014_x86_64.whl (1.6 MB): CPython 3.6m
  • hext-1.0.1-cp36-cp36m-macosx_10_11_x86_64.whl (726.9 kB): CPython 3.6m, macOS 10.11+ x86-64
