A module and command-line utility to extract structured data from HTML

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext template collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext template is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
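
To make the matching rule concrete, here is a rough analogy written with only Python's standard library. It is not Hext and says nothing about how Hext is implemented; the class name LinkCollector is made up for this sketch. It visits every element, keeps those with the tag a and an href attribute, and stores the href and the whitespace-trimmed text as link and title.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href and trimmed text of <a href=...> elements."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._current = None  # the capture in progress, or None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Match rule: the tag must be "a" and an href attribute must exist.
        if tag == "a" and "href" in attributes:
            self._current = {"link": attributes["href"], "title": ""}

    def handle_data(self, data):
        # Accumulate text only while inside a matched element.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.results.append(self._current)
            self._current = None


SAMPLE_HTML = """
<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
collector.close()
for captured in collector.results:
    print(captured)
```

The sketch hard-codes a single rule and ignores nested anchors. A Hext template expresses the same intent in one line and generalizes to nested structures.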

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

rule.extract has a second optional parameter max_searches of type unsigned int. The search for matching elements is aborted after this limit is reached. The default is 0, which never aborts.

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext template.
# Throws an exception of type ValueError on invalid syntax.
# The raw string avoids Python's invalid-escape warning for \d.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as JSON
for record in result:
    print(json.dumps(record, ensure_ascii=False,
                     separators=(',', ':')))
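
Captured values are strings, so numeric fields usually need converting before further use. The following is a minimal sketch of such a post-processing step. The helper names first_int and normalize and the sample record are invented for illustration; the record is not real output from the page above. Keys coming from the optional second row may be absent, so the helper only touches keys that exist.

```python
import re


def first_int(text, default=None):
    """Return the first run of digits in text as an int, or default if none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else default


def normalize(record):
    """Copy a captured record and convert numeric-looking fields to int."""
    cleaned = dict(record)
    for key in ("rank", "score", "comment_count"):
        if key in cleaned:
            cleaned[key] = first_int(cleaned[key])
    return cleaned


# Invented sample record using the capture names from the template above.
sample = {
    "rank": "1.",
    "href": "https://example.com/",
    "title": "An example story",
    "score": "120 points",
    "user": "someone",
    "comment_count": "45",
}

print(normalize(sample))
```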

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext templates to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc
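
Since --compact prints one JSON object per line, the output is straightforward to consume from another program. The sketch below parses such text in Python. The function name parse_compact_output is hypothetical, and sample_output is a hand-written stand-in based on the quick example's values, not text captured from an actual htmlext run; the exact spacing htmlext emits may differ.

```python
import json


def parse_compact_output(text):
    """Parse text holding one JSON object per line into a list of dicts."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between objects
        records.append(json.loads(line))
    return records


# Hand-written stand-in for what `htmlext --compact` might print.
sample_output = (
    '{"link":"one.html","title":"Page 1"}\n'
    '{"link":"two.html","title":"Page 2"}\n'
    '{"link":"three.html","title":"Page 3"}\n'
)

records = parse_compact_output(sample_output)
print([record["link"] for record in records])
```

In practice the text would come from the standard output of an htmlext process, for example via subprocess.run with capture_output=True and text=True.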

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.
