A module and command-line utility to extract structured data from HTML

Project description

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext template collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext template is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
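
For comparison, the same extraction can be approximated with Python's standard library alone. This is only a rough sketch of what the template above expresses, not how Hext is implemented: the class name LinkCollector is made up for illustration, and the whitespace handling is chosen simply so that the result matches the values shown above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects {"link", "title"} for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Trim and collapse whitespace in the link text.
            title = " ".join("".join(self._text).split())
            self.results.append({"link": self._href, "title": title})
            self._href = None


page = """
<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>
"""

collector = LinkCollector()
collector.feed(page)
for row in collector.results:
    print(row)
```

The one-line Hext template replaces all of the state tracking above, which is the point of describing Hext as a counterpart to templates.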

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

rule.extract has a second optional parameter max_searches which is of type unsigned int. The search for matching elements is aborted after this limit is reached. The default is 0, which never aborts.

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext template.
# Throws an exception of type ValueError on invalid syntax.
# The raw string (r"...") keeps the backslash in /\d+/ intact.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as JSON
for row in result:
    print(json.dumps(row, ensure_ascii=False,
                     separators=(',', ':')))
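
Since the captured values are strings, numeric fields such as rank or comment_count usually need converting before further use. The following is a minimal sketch of such a post-processing step. The helper names first_int and normalize are made up, and the sample rows are hypothetical stand-ins for what rule.extract(html) might return; the second row lacks the score fields on the assumption that the second tr in the template (written as ?tr) is optional.

```python
import re


def first_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None


def normalize(row):
    """Copy a captured row, converting the numeric fields to int where present."""
    cleaned = dict(row)
    for key in ("rank", "score", "comment_count"):
        if key in cleaned:
            cleaned[key] = first_int(cleaned[key])
    return cleaned


# Hypothetical sample standing in for `rule.extract(html)`.
sample = [
    {"rank": "1.", "href": "https://example.com/a", "title": "First story",
     "score": "120 points", "user": "alice", "comment_count": "45"},
    {"rank": "2.", "href": "https://example.com/b", "title": "Second story"},
]

rows = [normalize(r) for r in sample]
print(rows[0]["rank"], rows[0]["score"], rows[0]["comment_count"])  # prints: 1 120 45
```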

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext templates to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version
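
With -c, htmlext prints one JSON object per line, so its output can be consumed line by line from another program. A minimal Python sketch of that step follows. The function name parse_compact_output, the file names in the comment, and the sample output are all hypothetical; the exact formatting htmlext produces may differ.

```python
import json


def parse_compact_output(text):
    """Parse text holding one JSON object per line; blank lines are skipped."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical output of something like: htmlext -c -x rule.hext -i page.html
sample_output = (
    '{"link":"one.html","title":"Page 1"}\n'
    '{"link":"two.html","title":"Page 2"}\n'
)

records = parse_compact_output(sample_output)
print([r["link"] for r in records])  # prints: ['one.html', 'two.html']
```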

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.
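
As a quick illustration of the wheel naming scheme (distribution, version, optional build tag, Python tag, ABI tag, platform tag, separated by dashes), the following sketch splits a file name into its parts. The function name parse_wheel_filename is made up for this example; for real use, a maintained library is a safer choice than hand-rolled parsing.

```python
def parse_wheel_filename(filename):
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of components: {filename!r}")
    return {
        "name": name,
        "version": version,
        "build_tag": build_tag,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


info = parse_wheel_filename("hext-1.0.11-cp313-cp313-manylinux_2_28_x86_64.whl")
print(info["python_tag"], info["platform_tag"])
```

Here cp313 identifies CPython 3.13 and manylinux_2_28_x86_64 identifies Linux on x86-64 with glibc 2.28 or newer.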

  • hext-1.0.11-cp313-cp313-manylinux_2_28_x86_64.whl (1.7 MB): CPython 3.13, manylinux (glibc 2.28+), x86-64
  • hext-1.0.11-cp313-cp313-macosx_11_0_arm64.whl (689.2 kB): CPython 3.13, macOS 11.0+, ARM64
  • hext-1.0.11-cp313-cp313-macosx_10_11_x86_64.whl (742.1 kB): CPython 3.13, macOS 10.11+, x86-64
  • hext-1.0.11-cp312-cp312-manylinux_2_28_x86_64.whl (1.7 MB): CPython 3.12, manylinux (glibc 2.28+), x86-64
  • hext-1.0.11-cp312-cp312-macosx_11_0_arm64.whl (689.2 kB): CPython 3.12, macOS 11.0+, ARM64
  • hext-1.0.11-cp312-cp312-macosx_10_11_x86_64.whl (742.1 kB): CPython 3.12, macOS 10.11+, x86-64
  • hext-1.0.11-cp311-cp311-manylinux_2_28_x86_64.whl (1.7 MB): CPython 3.11, manylinux (glibc 2.28+), x86-64
  • hext-1.0.11-cp311-cp311-macosx_11_0_arm64.whl (689.1 kB): CPython 3.11, macOS 11.0+, ARM64
  • hext-1.0.11-cp311-cp311-macosx_10_11_x86_64.whl (741.8 kB): CPython 3.11, macOS 10.11+, x86-64
  • hext-1.0.11-cp310-cp310-manylinux_2_28_x86_64.whl (1.7 MB): CPython 3.10, manylinux (glibc 2.28+), x86-64
  • hext-1.0.11-cp310-cp310-macosx_11_0_arm64.whl (689.1 kB): CPython 3.10, macOS 11.0+, ARM64
  • hext-1.0.11-cp310-cp310-macosx_10_11_x86_64.whl (741.8 kB): CPython 3.10, macOS 10.11+, x86-64
  • hext-1.0.11-cp39-cp39-manylinux_2_28_x86_64.whl (1.7 MB): CPython 3.9, manylinux (glibc 2.28+), x86-64
  • hext-1.0.11-cp39-cp39-macosx_10_11_x86_64.whl (741.8 kB): CPython 3.9, macOS 10.11+, x86-64
  • hext-1.0.11-cp38-cp38-manylinux_2_28_x86_64.whl (1.7 MB): CPython 3.8, manylinux (glibc 2.28+), x86-64
  • hext-1.0.11-cp37-cp37m-manylinux_2_28_x86_64.whl (1.7 MB): CPython 3.7m, manylinux (glibc 2.28+), x86-64
  • hext-1.0.11-cp36-cp36m-manylinux_2_28_x86_64.whl (1.7 MB): CPython 3.6m, manylinux (glibc 2.28+), x86-64

File details

Details for the file hext-1.0.11-cp313-cp313-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp313-cp313-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 e924058ea48ce5262ad617025f2cc426413346ef02560e3cc21fa28bea9bf2a7
MD5 b1966b4a5e5dd006f9e49a869e4594a2
BLAKE2b-256 bc4238a9deaf805d1085a830b4e7e1eb92399f4f03259165b3443cd96f0d96d2

See more details on using hashes here.
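
A downloaded wheel can be checked against the SHA256 digest listed for it. The sketch below uses only Python's hashlib; the function names sha256_of and matches are made up for illustration, and the file is read in chunks so that large files do not need to fit in memory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's SHA256 digest equals the expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches("hext-1.0.11-cp313-cp313-manylinux_2_28_x86_64.whl", "e924058ea48ce5262ad617025f2cc426413346ef02560e3cc21fa28bea9bf2a7") should return True for an intact copy of that file, assuming it sits in the current directory.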

File details

Details for the file hext-1.0.11-cp313-cp313-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 51a1da68493822ed46f76a430ea1d6d2734d8fdd6b1fbfb9d1994dbc6a14cf90
MD5 5159f84b36fc503c7fd0f0ab708ca383
BLAKE2b-256 385f32bec2b18c9205323662211cd56b62db2d5b23e25a35f9cdb0b97c5ce4b9

File details

Details for the file hext-1.0.11-cp313-cp313-macosx_10_11_x86_64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp313-cp313-macosx_10_11_x86_64.whl
Algorithm Hash digest
SHA256 eedbf9adc3343421ef3ab8067817f4f7f29e3be246adac991feb91c1ce9152b0
MD5 0dee0921de8ff79ce0940a612aa49da8
BLAKE2b-256 87bca363a8e26d4f867b021c98a350c719328f833077b2af3249806a865f47d1

File details

Details for the file hext-1.0.11-cp312-cp312-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp312-cp312-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 e9f9747490640b3f3209aafea6448f024ce2e8b0ca5295b899887ad53a4aecae
MD5 e6e8cb9ae91559294a19e122fd01d430
BLAKE2b-256 a6ff4f5a298252a44f0974d709bcc6748152418c9575ddd6f1c8b76eb2a816aa

File details

Details for the file hext-1.0.11-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 3c7082b2fc08225c677539affea49f1466e6ebd8b7767d16bcbcc8d70d1def8a
MD5 cf2482d5255f877ac86a76f74c16231f
BLAKE2b-256 b40012569026417566f1d437c0c7e269bc2ead491927f26d9d50d2f950950595

File details

Details for the file hext-1.0.11-cp312-cp312-macosx_10_11_x86_64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp312-cp312-macosx_10_11_x86_64.whl
Algorithm Hash digest
SHA256 b589d01dd73805f17dacaebbd68ecc3a5ba2874900cd35bf219baa6937d31d3b
MD5 05497513566bad5efc10ae2ffed6734c
BLAKE2b-256 c0be8886c01df7fa096e7f86ef323dc518c63d4cf0d40d0173a2b8919186d04e

File details

Details for the file hext-1.0.11-cp311-cp311-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp311-cp311-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 411d7468255d49db12d39d1fc9bdf518cd9269c7193d498d00ef329148321d5c
MD5 77331679c2dfd1ccf601e662c66ba956
BLAKE2b-256 ae2773a9e4b6072a0a5b52f8bbfcc6bebc304553cde31b3f8f3b77752466dfa1

File details

Details for the file hext-1.0.11-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 ecf591f451d4f0128bd9c7d96288022b86b611e476a15f03cf6806939a6165ff
MD5 31f40848369473f98740418f95685fbf
BLAKE2b-256 8c597c8a3549dc3f32b2a839037a9d77c7ad8acfa78bef71c4dc76c83c158e99

File details

Details for the file hext-1.0.11-cp311-cp311-macosx_10_11_x86_64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp311-cp311-macosx_10_11_x86_64.whl
Algorithm Hash digest
SHA256 5917c6ad535a02fbcd2489c863de4bacd4dc98fa9be378c1e7d0c04021c3b049
MD5 682bb1a660de4c57fd691a27de999463
BLAKE2b-256 58bf5f9eaf48d171506392c92d3848a1687181e6f73dae536c0615c4a1afe517

File details

Details for the file hext-1.0.11-cp310-cp310-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp310-cp310-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 62d16c4410da2ca0943b2c1b9a20d3e58e37f9a48bf1e38b4ae393c17817d010
MD5 a7852ed969d2ad9e17cad0b269a653c7
BLAKE2b-256 a1d92eeeeaae8bf74741b238ce3c31b9bf917602c5b11e5195cbdcd9894a5b50

File details

Details for the file hext-1.0.11-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 6498ad1e5abac573354c4dd594ec19cda56603091539b600ffd05d008abe682e
MD5 dc72fff19f7a5c8c75a56fb16ae7c925
BLAKE2b-256 febe259e6508adced61cb4dbaf60f66cb5df7c320af5436d719a22fd21ec2242

File details

Details for the file hext-1.0.11-cp310-cp310-macosx_10_11_x86_64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp310-cp310-macosx_10_11_x86_64.whl
Algorithm Hash digest
SHA256 5aea83b502560f4e5c41bca46b71437c88545c0dae3b37acaa73812d9f5c42a4
MD5 a1aa059c938824303441120315f7c936
BLAKE2b-256 7703f758a810fc5cb03ca23478977d73bfe3d2fa5ce9817ffcddc5b98ff5577b

File details

Details for the file hext-1.0.11-cp39-cp39-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp39-cp39-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 eef6c41284890c74ab45b55fc54c11772e9c9e3387dae2021b15c596ab407bb5
MD5 183d739ee9e94572119be2bddc56fdd9
BLAKE2b-256 553fd65e2ef918b0c9b35a7b0f221def4de15ec443e90abfeee03a1e5e8a4023

File details

Details for the file hext-1.0.11-cp39-cp39-macosx_10_11_x86_64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp39-cp39-macosx_10_11_x86_64.whl
Algorithm Hash digest
SHA256 b06b68a7b5c393e4742e8a3557c57ae12ab08ce54e5ea1b265976b25fe57010d
MD5 d6e2d8f0849edc1e35f9eaf626a78614
BLAKE2b-256 5c2079d1cc2ef13e3d772f19565a8b1f223174cbf4a8aad7c150752d136de3b0

File details

Details for the file hext-1.0.11-cp38-cp38-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp38-cp38-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 e588407e22334822098a0ad3d949778fb3b0956dca30bfe81f2390c372e45284
MD5 41be0578207a1593b1996c95a5bca1de
BLAKE2b-256 dcf7d92f96d7554f1f92cfc50a1eec12ef5fe961d0dc30194ba34af660829316

File details

Details for the file hext-1.0.11-cp37-cp37m-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp37-cp37m-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 6f063ae87f609acfbbce9e8044e340159b6118ed1a27919b1f27fa6c1af0b94e
MD5 62ba16ccbef18f0a372cac2c0501909a
BLAKE2b-256 ef66678606ae678b87019af79c887c4c18c823279bf0c0dda72c0e6f45543141

File details

Details for the file hext-1.0.11-cp36-cp36m-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for hext-1.0.11-cp36-cp36m-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 eaccac7f617103b428debd3ec7e4c9f8d3b7e15152d06927b503696cefffd29e
MD5 97498d94ebb1ecd301520d50b554f294
BLAKE2b-256 6b02c6126d5579e5b0688841da354ab6768560eff37e2255cda4ff8899683d91
