A module and command-line utility to extract structured data from HTML

Project description

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext template collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext template is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
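To make the matching rule concrete, here is a rough standard-library approximation of what the template above does for this particular input. It is only a sketch and not how Hext is implemented: LinkCollector and collect_links are hypothetical names, and the whitespace normalization is an assumption inferred from the output shown above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href and text of every <a> element that has an href attribute."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:  # elements without href do not match
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Assumption: trim and collapse whitespace, as in the output above.
            title = " ".join("".join(self._text).split())
            self.results.append({"link": self._href, "title": title})
            self._href = None


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE = """
<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>
"""

links = collect_links(SAMPLE)
for entry in links:
    print(entry)
```

Unlike this sketch, a Hext template can nest elements and combine several captures in one rule, which is where the template syntax pays off.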

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

rule.extract takes an optional second parameter, max_searches, of type unsigned int. The search for matching elements is aborted once this limit is reached. The default is 0, which never aborts.

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext template.
# Throws an exception of type ValueError on invalid syntax.
# A raw string keeps the regex escape \d intact.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as JSON
for row in result:
    print(json.dumps(row, ensure_ascii=False,
                     separators=(',', ':')))
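Because every extracted value is a string, a small post-processing step is often useful. The sketch below is independent of Hext itself: first_int and to_story are hypothetical helpers, the sample rows are invented to resemble what the rule above might return, and it assumes that the keys captured inside the <?tr> block can be missing from a result.

```python
import re


def first_int(value):
    """Return the first run of digits in value as an int, or None."""
    if value is None:
        return None
    match = re.search(r"\d+", value)
    return int(match.group()) if match else None


def to_story(row):
    """Convert one extracted dictionary into a record with typed fields."""
    return {
        "rank": first_int(row.get("rank")),
        "title": row.get("title"),
        "href": row.get("href"),
        "user": row.get("user"),
        "score": first_int(row.get("score")),
        "comment_count": first_int(row.get("comment_count")),
    }


# Invented sample data; real values depend on the page being scraped.
sample_rows = [
    {"rank": "1.", "href": "https://example.com/", "title": "An example story",
     "score": "153 points", "user": "alice", "comment_count": "42"},
    {"rank": "2.", "href": "https://example.org/", "title": "A story without details"},
]

stories = [to_story(row) for row in sample_rows]
for story in stories:
    print(story)
```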

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext templates to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc
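htmlext can also be driven from a script. The following sketch uses only the --str, --html and --compact options listed in the help text above. The function names are hypothetical, and it assumes that htmlext is on the PATH and that compact output contains exactly one JSON object per line.

```python
import json
import subprocess


def build_htmlext_command(hext_template, html_path):
    """Build the argument list for a single htmlext invocation."""
    return ["htmlext", "--str", hext_template, "--html", html_path, "--compact"]


def parse_compact_output(stdout):
    """Parse compact output: one JSON object per non-empty line."""
    return [json.loads(line) for line in stdout.splitlines() if line.strip()]


def run_htmlext(hext_template, html_path):
    """Run htmlext and return its results as a list of dictionaries."""
    completed = subprocess.run(
        build_htmlext_command(hext_template, html_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_compact_output(completed.stdout)


# Parsing can be tried without htmlext installed:
example_output = (
    '{"link":"one.html","title":"Page 1"}\n'
    '{"link":"two.html","title":"Page 2"}\n'
)
parsed = parse_compact_output(example_output)
print(parsed)
```

A call such as run_htmlext('<a href:link @text:title />', 'page.html') would then return the captured values, where page.html is a placeholder file name.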

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

  • hext-1.0.14-cp314-cp314-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.14, manylinux (glibc 2.28+), x86-64
  • hext-1.0.14-cp314-cp314-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.14, manylinux (glibc 2.28+), ARM64
  • hext-1.0.14-cp314-cp314-macosx_11_0_arm64.whl (710.9 kB): CPython 3.14, macOS 11.0+, ARM64
  • hext-1.0.14-cp314-cp314-macosx_10_11_x86_64.whl (713.5 kB): CPython 3.14, macOS 10.11+, x86-64
  • hext-1.0.14-cp313-cp313-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.13, manylinux (glibc 2.28+), x86-64
  • hext-1.0.14-cp313-cp313-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.13, manylinux (glibc 2.28+), ARM64
  • hext-1.0.14-cp313-cp313-macosx_11_0_arm64.whl (710.9 kB): CPython 3.13, macOS 11.0+, ARM64
  • hext-1.0.14-cp313-cp313-macosx_10_11_x86_64.whl (713.5 kB): CPython 3.13, macOS 10.11+, x86-64
  • hext-1.0.14-cp312-cp312-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.12, manylinux (glibc 2.28+), x86-64
  • hext-1.0.14-cp312-cp312-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.12, manylinux (glibc 2.28+), ARM64
  • hext-1.0.14-cp312-cp312-macosx_11_0_arm64.whl (710.9 kB): CPython 3.12, macOS 11.0+, ARM64
  • hext-1.0.14-cp312-cp312-macosx_10_11_x86_64.whl (713.5 kB): CPython 3.12, macOS 10.11+, x86-64
  • hext-1.0.14-cp311-cp311-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.11, manylinux (glibc 2.28+), x86-64
  • hext-1.0.14-cp311-cp311-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.11, manylinux (glibc 2.28+), ARM64
  • hext-1.0.14-cp311-cp311-macosx_11_0_arm64.whl (710.8 kB): CPython 3.11, macOS 11.0+, ARM64
  • hext-1.0.14-cp311-cp311-macosx_10_11_x86_64.whl (713.5 kB): CPython 3.11, macOS 10.11+, x86-64
  • hext-1.0.14-cp310-cp310-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.10, manylinux (glibc 2.28+), x86-64
  • hext-1.0.14-cp310-cp310-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.10, manylinux (glibc 2.28+), ARM64
  • hext-1.0.14-cp310-cp310-macosx_11_0_arm64.whl (710.8 kB): CPython 3.10, macOS 11.0+, ARM64
  • hext-1.0.14-cp310-cp310-macosx_10_11_x86_64.whl (713.5 kB): CPython 3.10, macOS 10.11+, x86-64
  • hext-1.0.14-cp39-cp39-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.9, manylinux (glibc 2.28+), x86-64
  • hext-1.0.14-cp39-cp39-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.9, manylinux (glibc 2.28+), ARM64
  • hext-1.0.14-cp38-cp38-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.8, manylinux (glibc 2.28+), x86-64
  • hext-1.0.14-cp38-cp38-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.8, manylinux (glibc 2.28+), ARM64
