A module and command-line utility to extract structured data from HTML

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext template collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext template is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
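
To make the matching behavior concrete, here is roughly the same extraction written by hand with Python's standard library. This is only a sketch for comparison, not how Hext is implemented: the class LinkCollector and the function collect_links are hypothetical names, and the sketch assumes that trimming the surrounding whitespace of the link text is enough to reproduce the titles shown above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects {"link", "title"} for every <a> element that has an href value."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        # Like the template, require both the tag "a" and an href attribute.
        if tag == "a" and href is not None:
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.results.append({"link": self._href, "title": title})
            self._href = None


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.results


sample_html = """
<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>"""

for entry in collect_links(sample_html):
    print(entry)
```

The one-line Hext template replaces all of the state handling above, which is the main argument for using a dedicated extraction language.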

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

rule.extract takes an optional second parameter, max_searches, of type unsigned int. The search for matching elements is aborted once this limit is reached. The default is 0, which never aborts.

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext template.
# Raises ValueError on invalid syntax.
# The raw string keeps the backslash in /\d+/ from being
# interpreted as a Python escape sequence.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as compact JSON.
# The loop variable is named item to avoid shadowing the builtin map.
for item in result:
    print(json.dumps(item, ensure_ascii=False,
                     separators=(',', ':')))

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext templates to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc
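
When working with the Python module instead of htmlext, comparable output shapes can be produced with the json module. The sketch below is not htmlext's implementation and does not claim to match its output byte for byte. The helper names compact_lines, as_array and filter_values are hypothetical, and the assumption that a filtered result is printed as plain values, one per line, is inferred from the xargs pipeline above.

```python
import json


def compact_lines(results):
    """One JSON object per line, similar in spirit to --compact."""
    return "\n".join(
        json.dumps(r, ensure_ascii=False, separators=(",", ":"))
        for r in results
    )


def as_array(results):
    """All results wrapped in a single JSON array, similar in spirit to --array."""
    return json.dumps(results, ensure_ascii=False, indent=2)


def filter_values(results, key):
    """Only the values captured under key, one per line, similar in spirit to --filter."""
    return "\n".join(r[key] for r in results if key in r)


# Results in the shape returned by rule.extract: a list of dictionaries.
results = [
    {"link": "one.html", "title": "Page 1"},
    {"link": "two.html", "title": "Page 2"},
]

print(compact_lines(results))
print(as_array(results))
print(filter_values(results, "link"))
```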

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • hext-1.0.13-cp313-cp313-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.13, manylinux (glibc 2.28+), x86-64
  • hext-1.0.13-cp313-cp313-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.13, manylinux (glibc 2.28+), ARM64
  • hext-1.0.13-cp313-cp313-macosx_11_0_arm64.whl (705.8 kB): CPython 3.13, macOS 11.0+, ARM64
  • hext-1.0.13-cp313-cp313-macosx_10_11_x86_64.whl (726.1 kB): CPython 3.13, macOS 10.11+, x86-64
  • hext-1.0.13-cp312-cp312-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.12, manylinux (glibc 2.28+), x86-64
  • hext-1.0.13-cp312-cp312-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.12, manylinux (glibc 2.28+), ARM64
  • hext-1.0.13-cp312-cp312-macosx_11_0_arm64.whl (705.8 kB): CPython 3.12, macOS 11.0+, ARM64
  • hext-1.0.13-cp312-cp312-macosx_10_11_x86_64.whl (726.1 kB): CPython 3.12, macOS 10.11+, x86-64
  • hext-1.0.13-cp311-cp311-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.11, manylinux (glibc 2.28+), x86-64
  • hext-1.0.13-cp311-cp311-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.11, manylinux (glibc 2.28+), ARM64
  • hext-1.0.13-cp311-cp311-macosx_11_0_arm64.whl (705.8 kB): CPython 3.11, macOS 11.0+, ARM64
  • hext-1.0.13-cp311-cp311-macosx_10_11_x86_64.whl (726.1 kB): CPython 3.11, macOS 10.11+, x86-64
  • hext-1.0.13-cp310-cp310-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.10, manylinux (glibc 2.28+), x86-64
  • hext-1.0.13-cp310-cp310-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.10, manylinux (glibc 2.28+), ARM64
  • hext-1.0.13-cp310-cp310-macosx_11_0_arm64.whl (705.8 kB): CPython 3.10, macOS 11.0+, ARM64
  • hext-1.0.13-cp310-cp310-macosx_10_11_x86_64.whl (726.1 kB): CPython 3.10, macOS 10.11+, x86-64
  • hext-1.0.13-cp39-cp39-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.9, manylinux (glibc 2.28+), x86-64
  • hext-1.0.13-cp39-cp39-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.9, manylinux (glibc 2.28+), ARM64
  • hext-1.0.13-cp39-cp39-macosx_10_11_x86_64.whl (726.1 kB): CPython 3.9, macOS 10.11+, x86-64
  • hext-1.0.13-cp38-cp38-manylinux_2_28_x86_64.whl (1.6 MB): CPython 3.8, manylinux (glibc 2.28+), x86-64
  • hext-1.0.13-cp38-cp38-manylinux_2_28_aarch64.whl (1.5 MB): CPython 3.8, manylinux (glibc 2.28+), ARM64
