A module and command-line utility to extract structured data from HTML

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext template collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext template is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
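To make the matching rule above concrete, here is a rough standard-library analogue of what the template captures. It is a sketch for illustration only, not how Hext is implemented: the names LinkCollector and collect_links are made up for this example, and collapsing whitespace in the title is an assumption based on the output shown above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects link/title pairs from <a href=...> elements."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # An element only matches if it is an <a> that has an href attribute.
        href = dict(attrs).get("href")
        if tag == "a" and href is not None:
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Assumption: surrounding whitespace is trimmed, as in the output above.
            title = " ".join("".join(self._text).split())
            self.results.append({"link": self._href, "title": title})
            self._href = None


def collect_links(html):
    """Return one dictionary per matching <a> element, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE = """
<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>
"""

for row in collect_links(SAMPLE):
    print(row)
```

The Hext template expresses the same intent in a single line, which is the point of the language.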

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

rule.extract takes an optional second parameter, max_searches, which is a non-negative integer (unsigned int). The search for matching elements is aborted once this limit is reached. The default is 0, which never aborts.

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext template. A raw string avoids an
# invalid-escape warning for the \d in the regex below.
# Throws an exception of type ValueError on invalid syntax.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as compact JSON, one per line.
# (The loop variable is not called "map", which would shadow the builtin.)
for row in result:
    print(json.dumps(row, ensure_ascii=False,
                     separators=(',', ':')))
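The printing step at the end of the example can be tried without network access. The following sketch uses hand-written sample rows in place of a real rule.extract result; the helper name to_json_lines and the sample values are made up for illustration.

```python
import json


def to_json_lines(rows):
    """Serialize each dictionary as compact JSON, one object per line."""
    return "\n".join(
        # ensure_ascii=False keeps non-ASCII text readable instead of \u escapes;
        # the separators drop the spaces after "," and ":".
        json.dumps(row, ensure_ascii=False, separators=(",", ":"))
        for row in rows
    )


# Hypothetical rows, shaped like the dictionaries in the example above.
rows = [
    {"rank": "1.", "title": "Café culture", "score": "42 points"},
    {"rank": "2.", "title": "Second story", "score": "7 points"},
]

print(to_json_lines(rows))
```

One object per line is easy to process with line-oriented tools such as grep.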

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext templates to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc
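The pipeline above relies on -f to reduce each result to the value captured under one name. A rough Python sketch of that step follows. It assumes the name is compared for exact equality and that values keep document order; htmlext's actual matching may differ. The function name filter_values and the sample rows are hypothetical.

```python
def filter_values(rows, key):
    """Return the values captured under `key`, skipping rows that lack it."""
    return [row[key] for row in rows if key in row]


# Hypothetical extraction results; only the "x" captures are wanted.
rows = [
    {"x": "https://example.com/a"},
    {"y": "ignored"},
    {"x": "https://example.com/b"},
]

# One value per line, which is the shape that a tool like xargs consumes.
print("\n".join(filter_values(rows, "x")))
```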

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.

Download files

No source distribution is available for this release. The following built distributions (wheels) are available for version 1.0.9:

  • hext-1.0.9-cp312-cp312-manylinux2014_x86_64.whl (1.6 MB): CPython 3.12
  • hext-1.0.9-cp312-cp312-macosx_11_0_arm64.whl (688.4 kB): CPython 3.12, macOS 11.0+ ARM64
  • hext-1.0.9-cp312-cp312-macosx_10_11_x86_64.whl (736.6 kB): CPython 3.12, macOS 10.11+ x86-64
  • hext-1.0.9-cp311-cp311-manylinux2014_x86_64.whl (1.6 MB): CPython 3.11
  • hext-1.0.9-cp311-cp311-macosx_11_0_arm64.whl (688.3 kB): CPython 3.11, macOS 11.0+ ARM64
  • hext-1.0.9-cp311-cp311-macosx_10_11_x86_64.whl (736.2 kB): CPython 3.11, macOS 10.11+ x86-64
  • hext-1.0.9-cp310-cp310-manylinux2014_x86_64.whl (1.6 MB): CPython 3.10
  • hext-1.0.9-cp310-cp310-macosx_11_0_arm64.whl (688.3 kB): CPython 3.10, macOS 11.0+ ARM64
  • hext-1.0.9-cp310-cp310-macosx_10_11_x86_64.whl (736.2 kB): CPython 3.10, macOS 10.11+ x86-64
  • hext-1.0.9-cp39-cp39-manylinux2014_x86_64.whl (1.6 MB): CPython 3.9
  • hext-1.0.9-cp39-cp39-macosx_10_11_x86_64.whl (736.2 kB): CPython 3.9, macOS 10.11+ x86-64
  • hext-1.0.9-cp38-cp38-manylinux2014_x86_64.whl (1.6 MB): CPython 3.8
  • hext-1.0.9-cp38-cp38-macosx_10_11_x86_64.whl (736.2 kB): CPython 3.8, macOS 10.11+ x86-64
  • hext-1.0.9-cp37-cp37m-manylinux2014_x86_64.whl (1.6 MB): CPython 3.7m
  • hext-1.0.9-cp36-cp36m-manylinux2014_x86_64.whl (1.6 MB): CPython 3.6m
