A module and command-line utility to extract structured data from HTML

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext template collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext template is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
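The matching rule described above (tag a, attribute href required, capture the href and the text) can be imitated in plain Python to make the idea concrete. This is only an illustrative sketch built on the standard library's html.parser, not how Hext is implemented; the class name LinkCollector is made up for this example, and the whitespace trimming is an assumption chosen to mirror the output shown above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect {"link", "title"} for every <a> element that has an href value."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._current = None  # dict being filled while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        # Only elements with tag "a" and an href value match.
        if tag == "a" and href is not None:
            self._current = {"link": href, "title": ""}

    def handle_data(self, data):
        # Accumulate the text of the element currently being matched.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Assumption: trim surrounding whitespace, as in the output above.
            self._current["title"] = self._current["title"].strip()
            self.results.append(self._current)
            self._current = None


html = """
<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>
"""

collector = LinkCollector()
collector.feed(html)
collector.close()
for row in collector.results:
    print(row)
```

The Hext template expresses the same rule in a single line, which is the point of the language.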

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

rule.extract takes an optional second parameter, max_searches, of type unsigned int. The search for matching elements is aborted once this limit is reached. The default is 0, which never aborts.

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext template.
# Raises an exception of type ValueError on invalid syntax.
# A raw string (r"""...""") keeps the regex backslash in \d literal.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as JSON
# (the loop variable is named row to avoid shadowing the built-in map)
for row in result:
    print(json.dumps(row, ensure_ascii=False,
                     separators=(',', ':')))
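Because rule.extract returns dictionaries of strings, numeric fields such as rank or score arrive as text. The following is a minimal sketch of one way to post-process such rows. The sample rows are hypothetical stand-ins for real output (their exact text, such as "1." or "120 points", is an assumption), and the helper names first_int and normalize are made up for this example. The second sample row assumes that keys from an optional part of the template can be absent.

```python
import re


def first_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None


def normalize(row):
    """Copy a row, converting the numeric-looking fields to int (or None)."""
    typed = dict(row)
    for key in ("rank", "score", "comment_count"):
        typed[key] = first_int(row.get(key))
    return typed


# Hypothetical rows shaped like the captures named in the template above.
sample_rows = [
    {"rank": "1.", "href": "https://example.com/a", "title": "First story",
     "score": "120 points", "user": "alice", "comment_count": "45"},
    {"rank": "2.", "href": "https://example.com/b", "title": "Second story"},
]

typed_rows = [normalize(row) for row in sample_rows]
for row in typed_rows:
    print(row)
```

Using row.get rather than row[key] keeps the sketch tolerant of rows where a capture is missing.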

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext templates to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version
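With --compact, htmlext prints one JSON object per line, a format that is straightforward to consume from another program. The sketch below parses such output in Python. The sample text is a hypothetical stand-in for what an invocation along the lines of htmlext -c -s '<a href:link @text:title />' -i page.html might print; the exact key order and spacing are assumptions, and the helper name parse_json_lines is made up for this example.

```python
import json


def parse_json_lines(text):
    """Parse one JSON object per non-empty line into a list of dictionaries."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical htmlext --compact output, one JSON object per line.
sample_output = (
    '{"link":"one.html","title":"Page 1"}\n'
    '{"link":"two.html","title":"Page 2"}\n'
)

rows = parse_json_lines(sample_output)
for row in rows:
    print(row["link"], row["title"])
```

In practice the text would come from the standard output of the htmlext process rather than from a string literal.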

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • hext-1.0.15-cp314-cp314-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.14, manylinux (glibc 2.28+), x86-64
  • hext-1.0.15-cp314-cp314-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.14, manylinux (glibc 2.28+), ARM64
  • hext-1.0.15-cp314-cp314-macosx_11_0_arm64.whl (711.8 kB): CPython 3.14, macOS 11.0+, ARM64
  • hext-1.0.15-cp314-cp314-macosx_10_11_x86_64.whl (712.2 kB): CPython 3.14, macOS 10.11+, x86-64
  • hext-1.0.15-cp313-cp313-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.13, manylinux (glibc 2.28+), x86-64
  • hext-1.0.15-cp313-cp313-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.13, manylinux (glibc 2.28+), ARM64
  • hext-1.0.15-cp313-cp313-macosx_11_0_arm64.whl (711.7 kB): CPython 3.13, macOS 11.0+, ARM64
  • hext-1.0.15-cp313-cp313-macosx_10_11_x86_64.whl (712.2 kB): CPython 3.13, macOS 10.11+, x86-64
  • hext-1.0.15-cp312-cp312-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.12, manylinux (glibc 2.28+), x86-64
  • hext-1.0.15-cp312-cp312-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.12, manylinux (glibc 2.28+), ARM64
  • hext-1.0.15-cp312-cp312-macosx_11_0_arm64.whl (712.1 kB): CPython 3.12, macOS 11.0+, ARM64
  • hext-1.0.15-cp312-cp312-macosx_10_11_x86_64.whl (712.3 kB): CPython 3.12, macOS 10.11+, x86-64
  • hext-1.0.15-cp311-cp311-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.11, manylinux (glibc 2.28+), x86-64
  • hext-1.0.15-cp311-cp311-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.11, manylinux (glibc 2.28+), ARM64
  • hext-1.0.15-cp311-cp311-macosx_11_0_arm64.whl (711.9 kB): CPython 3.11, macOS 11.0+, ARM64
  • hext-1.0.15-cp311-cp311-macosx_10_11_x86_64.whl (712.3 kB): CPython 3.11, macOS 10.11+, x86-64
  • hext-1.0.15-cp310-cp310-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.10, manylinux (glibc 2.28+), x86-64
  • hext-1.0.15-cp310-cp310-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.10, manylinux (glibc 2.28+), ARM64
  • hext-1.0.15-cp310-cp310-macosx_11_0_arm64.whl (711.9 kB): CPython 3.10, macOS 11.0+, ARM64
  • hext-1.0.15-cp310-cp310-macosx_10_11_x86_64.whl (712.3 kB): CPython 3.10, macOS 10.11+, x86-64
  • hext-1.0.15-cp39-cp39-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.9, manylinux (glibc 2.28+), x86-64
  • hext-1.0.15-cp39-cp39-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.9, manylinux (glibc 2.28+), ARM64
  • hext-1.0.15-cp38-cp38-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.8, manylinux (glibc 2.28+), x86-64
  • hext-1.0.15-cp38-cp38-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.8, manylinux (glibc 2.28+), ARM64
