A module and command-line utility to extract structured data from HTML

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext template collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext template is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
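
To make the matching rule concrete, here is a rough standard-library analogue of the template above. It is only an illustrative sketch, not how Hext works internally: the class LinkCollector and the function extract_links are hypothetical names, and the whitespace normalization is an assumption chosen to reproduce the titles shown above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href and text of every <a> element that has an href attribute."""

    def __init__(self):
        super().__init__()
        self.results = []   # finished {"link": ..., "title": ...} records
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        # Like the template, require both the tag "a" and an href attribute.
        if tag == "a" and href is not None:
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Assumption: collapse and trim whitespace in the captured text.
            title = " ".join("".join(self._text).split())
            self.results.append({"link": self._href, "title": title})
            self._href = None


def extract_links(html):
    """Return one dictionary per matching <a> element, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.results


sample = """
<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>
"""

for record in extract_links(sample):
    print(record)
```

The Hext template expresses the same requirement (tag a, attribute href, capture of the text) in a single line, which is the point of the language.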

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}. An optional second parameter, max_searches (unsigned int), aborts the search for matching elements once this limit is reached. The default is 0, which never aborts.

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext template.
# Raises ValueError on invalid syntax.
# A raw string keeps the backslash in /\d+/ intact.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as compact JSON.
# (Named "row" to avoid shadowing the built-in map.)
for row in result:
    print(json.dumps(row, ensure_ascii=False, separators=(',', ':')))
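
Because the result is a plain list of dictionaries, standard-library tooling applies to it directly. As a sketch, the helper below renders such a list as CSV. The function name results_to_csv is hypothetical, and the sample records are made up to resemble the output of the example above; they are not real extraction results.

```python
import csv
import io


def results_to_csv(results):
    """Render a list of {str: str} dictionaries as CSV text."""
    # Collect column names in first-seen order; records may lack some keys.
    fieldnames = []
    for record in results:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


# Made-up sample shaped like the dictionaries printed above.
sample_result = [
    {"rank": "1.", "href": "https://example.com/a",
     "title": "First story", "score": "10 points"},
    {"rank": "2.", "href": "https://example.com/b",
     "title": "Second story"},
]

print(results_to_csv(sample_result))
```

Missing keys are written as empty fields, so records that captured only some of the names still line up under the right columns.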

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext templates to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc
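
htmlext can also be driven from a script. The following is a minimal sketch that uses only options listed in the help text above (--str, --html, --compact). The function names and the file name page.html are placeholders, and the sample output string is made up for illustration.

```python
import json
import subprocess


def build_htmlext_command(template, html_path):
    """Build an argv list using only flags from the help text above."""
    return ["htmlext", "--str", template, "--html", html_path, "--compact"]


def parse_compact_output(text):
    """Parse --compact output: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def run_htmlext(template, html_path):
    """Run htmlext and return the parsed records.

    Raises subprocess.CalledProcessError if htmlext exits with an error,
    and FileNotFoundError if htmlext is not on PATH.
    """
    completed = subprocess.run(
        build_htmlext_command(template, html_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_compact_output(completed.stdout)


# Show the command that would be run; "page.html" is a placeholder.
print(build_htmlext_command('<a href:link @text:title />', "page.html"))

# Parse a made-up example of compact output.
print(parse_compact_output('{"link":"one.html","title":"Page 1"}\n'))
```

Passing the arguments as a list avoids shell quoting problems with the angle brackets and quotes that Hext templates contain.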

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

  • hext-1.0.4-cp311-cp311-manylinux2014_x86_64.whl (1.6 MB): CPython 3.11
  • hext-1.0.4-cp311-cp311-macosx_10_11_x86_64.whl (726.8 kB): CPython 3.11, macOS 10.11+ x86-64
  • hext-1.0.4-cp310-cp310-manylinux2014_x86_64.whl (1.6 MB): CPython 3.10
  • hext-1.0.4-cp310-cp310-macosx_10_11_x86_64.whl (726.8 kB): CPython 3.10, macOS 10.11+ x86-64
  • hext-1.0.4-cp39-cp39-manylinux2014_x86_64.whl (1.6 MB): CPython 3.9
  • hext-1.0.4-cp39-cp39-macosx_10_11_x86_64.whl (726.8 kB): CPython 3.9, macOS 10.11+ x86-64
  • hext-1.0.4-cp38-cp38-manylinux2014_x86_64.whl (1.6 MB): CPython 3.8
  • hext-1.0.4-cp38-cp38-macosx_10_11_x86_64.whl (727.0 kB): CPython 3.8, macOS 10.11+ x86-64
  • hext-1.0.4-cp37-cp37m-manylinux2014_x86_64.whl (1.6 MB): CPython 3.7m
  • hext-1.0.4-cp37-cp37m-macosx_10_11_x86_64.whl (726.8 kB): CPython 3.7m, macOS 10.11+ x86-64
  • hext-1.0.4-cp36-cp36m-manylinux2014_x86_64.whl (1.6 MB): CPython 3.6m
