A module and command-line utility to extract structured data from HTML

Project description

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext template collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext template is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
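
To make the matching rule above concrete, here is a rough standard-library sketch that collects the same two values for every a element with an href attribute. It uses Python's html.parser instead of Hext, so it only mirrors the result of this one template and says nothing about how Hext works internally. The names LinkCollector and collect_links are made up for this illustration, and the whitespace handling is an assumption chosen to reproduce the sample output.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href and the text of every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.results = []   # finished {"link": ..., "title": ...} dictionaries
        self._href = None   # href of the <a> that is currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        # Like the template, require both the tag "a" and an href attribute.
        if tag == "a" and href is not None:
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Assumption: trim and collapse whitespace, as in the sample output.
            title = " ".join("".join(self._text).split())
            self.results.append({"link": self._href, "title": title})
            self._href = None


def collect_links(html_text):
    """Return a list of {"link", "title"} dictionaries for html_text."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.results


SAMPLE = """<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>"""

for row in collect_links(SAMPLE):
    print(row)
```

The sketch needs about thirty lines of Python for what the one-line Hext template expresses.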

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

rule.extract takes an optional second parameter, max_searches, of type unsigned int. The search for matching elements is aborted once this limit is reached. The default is 0, which never aborts.

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext template.
# Throws an exception of type ValueError on invalid syntax.
# A raw string keeps the backslash in the regex /\d+/ intact.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as compact JSON
# ("item" avoids shadowing the built-in map)
for item in result:
    print(json.dumps(item, ensure_ascii=False,
                     separators=(',', ':')))
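
Because rule.extract returns ordinary Python dictionaries, the captured values can be post-processed with the standard library. The sketch below assumes every captured value is a string, as stated above. The helper name to_records and the sample rows are made up for illustration and are not real extraction output.

```python
import json


def to_records(result, int_fields=("comment_count",)):
    """Copy each captured dictionary and convert selected fields to int.

    Fields that are missing or not numeric are left untouched.
    """
    records = []
    for item in result:
        record = dict(item)  # copy, so the original result is not modified
        for field in int_fields:
            if field not in record:
                continue
            try:
                record[field] = int(record[field])
            except (TypeError, ValueError):
                pass  # keep the original value if it is not a number
        records.append(record)
    return records


# Hypothetical rows shaped like the return value of rule.extract(html).
sample = [
    {"rank": "1.", "title": "First story", "comment_count": "42"},
    {"rank": "2.", "title": "Second story"},
]

for record in to_records(sample):
    print(json.dumps(record, ensure_ascii=False))
```

Converting after extraction keeps the Hext template free of type concerns. The template only decides what is captured, for example the digits selected by filter(/\d+/).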

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext templates to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc
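
htmlext can also be driven from a script. The following is a minimal sketch that uses only the options listed in the help text above (-s, -i and -c). It assumes that htmlext is on the PATH and that -c writes one JSON object per line to standard output. The function names are made up for this illustration.

```python
import json
import subprocess


def build_command(template, html_path):
    """Build an htmlext argument list from options shown in the help text."""
    return ["htmlext", "-s", template, "-i", html_path, "-c"]


def parse_compact_output(output):
    """Parse output that holds one JSON object per non-empty line."""
    return [json.loads(line) for line in output.splitlines() if line.strip()]


def run_htmlext(template, html_path):
    """Run htmlext and return the captured objects as a list.

    Raises FileNotFoundError if htmlext is not installed and
    subprocess.CalledProcessError if it exits with a non-zero status.
    """
    completed = subprocess.run(
        build_command(template, html_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compact_output(completed.stdout)


print(build_command('<a href:link @text:title />', "page.html"))
```

A call such as run_htmlext('<a href:link @text:title />', "page.html") would then return a list of dictionaries, provided that the file page.html exists.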

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • hext-1.0.6-cp312-cp312-manylinux2014_x86_64.whl (1.6 MB), CPython 3.12
  • hext-1.0.6-cp312-cp312-macosx_10_11_x86_64.whl (731.3 kB), CPython 3.12, macOS 10.11+ x86-64
  • hext-1.0.6-cp311-cp311-manylinux2014_x86_64.whl (1.6 MB), CPython 3.11
  • hext-1.0.6-cp311-cp311-macosx_10_11_x86_64.whl (731.0 kB), CPython 3.11, macOS 10.11+ x86-64
  • hext-1.0.6-cp310-cp310-manylinux2014_x86_64.whl (1.6 MB), CPython 3.10
  • hext-1.0.6-cp310-cp310-macosx_10_11_x86_64.whl (731.0 kB), CPython 3.10, macOS 10.11+ x86-64
  • hext-1.0.6-cp39-cp39-manylinux2014_x86_64.whl (1.6 MB), CPython 3.9
  • hext-1.0.6-cp39-cp39-macosx_10_11_x86_64.whl (731.0 kB), CPython 3.9, macOS 10.11+ x86-64
  • hext-1.0.6-cp38-cp38-manylinux2014_x86_64.whl (1.6 MB), CPython 3.8
  • hext-1.0.6-cp38-cp38-macosx_10_11_x86_64.whl (731.3 kB), CPython 3.8, macOS 10.11+ x86-64
  • hext-1.0.6-cp37-cp37m-manylinux2014_x86_64.whl (1.6 MB), CPython 3.7m
  • hext-1.0.6-cp36-cp36m-manylinux2014_x86_64.whl (1.6 MB), CPython 3.6m

File details

Hashes for hext-1.0.6-cp312-cp312-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 fd9b88903b11c1419448cf7ed07b7c6bd346d77cdfaf75734f1de82a8e73f6d7
MD5 be62b5e282fb697b4ad4eab5fd990997
BLAKE2b-256 6bf8d86b0aa4ca146725ae994eec364c5b0114db2337590da2cb53b89278e503

Hashes for hext-1.0.6-cp312-cp312-macosx_10_11_x86_64.whl
Algorithm Hash digest
SHA256 b0cb9e8137c003ec249240d78296fb7b306d921c8218ff900e79c8fe78e70825
MD5 a60c7d2df0d89451dd0275cc771db074
BLAKE2b-256 8e945804fcee87464a19e674e277a4f29e2e8948815c56cc7ab2573722b73cb5

Hashes for hext-1.0.6-cp311-cp311-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d8e359c9d05cb9afb6b8d952a7179d2a65e6e39f0c2c7477b37b5edd086a52c2
MD5 61f2c700ac8de647c01ca79c8b37748b
BLAKE2b-256 d32f3b0b84ee6e461351a8ff3a4c6d084a7ead6ec89dcd1f8056ce32d501ab59

Hashes for hext-1.0.6-cp311-cp311-macosx_10_11_x86_64.whl
Algorithm Hash digest
SHA256 df95eab8a701a584283aa4a76b09b72b5806fb48295a6a3dc2d933b350a8ec4a
MD5 0742ccff0b0362615f2ccf8f8725daf4
BLAKE2b-256 937021edfaf26d85d5c2fe22b799ed445d9797491e8dae4716e29b042a5b270f

Hashes for hext-1.0.6-cp310-cp310-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c943d2f1201e32f88c52d46115d9b8221379cbeeefd2b9f8d470bf4d5a46c843
MD5 56f7e90b1fdbe25527790c2d2099c7e9
BLAKE2b-256 602d167165805af9054d1496b164a323887ed484d3fe0c9f3fa25156a7900b06

Hashes for hext-1.0.6-cp310-cp310-macosx_10_11_x86_64.whl
Algorithm Hash digest
SHA256 04fe503640fdd8cfdb791603abae64b1b0b260b3449aa43fb4a198d6e051ef8f
MD5 acd7b9052141161b7344e55e0fa16f28
BLAKE2b-256 db2825e49b91518957d4540384e5e209bda2b16925b2f0a33cf53055b6061481

Hashes for hext-1.0.6-cp39-cp39-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 409f5257bd93a46a677ca95b5b169929931734e2c0de27cf8d5fbf171cc0da37
MD5 527b9226e1b7494a4eeb8f776705eb8f
BLAKE2b-256 d6e165e8ad8b0a08b20a8e1f5f9f7a47181e64f8e215ecd224e2dd53ac630714

Hashes for hext-1.0.6-cp39-cp39-macosx_10_11_x86_64.whl
Algorithm Hash digest
SHA256 7ef34edb062a50c2db4e344a42277b92c74b98b4237dea1d3e0b58ca19ac8a6c
MD5 b4fd4f9fed68a80bb7d69f0108fc389d
BLAKE2b-256 81238dc6797c76fa081b9c4725deb91bb75be90ebde4e1eb92420191882b8f5b

Hashes for hext-1.0.6-cp38-cp38-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 9b289b42b41f12c8a87025fab6fb988d309c266c7baf007d073d1af886da8d0a
MD5 eac4ca425a27024cf962138042d64542
BLAKE2b-256 e500205421eb22c281bea0d4771b88ceca65385cb1a92c8eb3eedf35d323a34c

Hashes for hext-1.0.6-cp38-cp38-macosx_10_11_x86_64.whl
Algorithm Hash digest
SHA256 8104bae7147d70192226ed0e721f45e0c3fda6574e3a9d083890cfe8e764b23a
MD5 8d91d13d311e6d71c8b5296af3cd4b84
BLAKE2b-256 8a9e9e0e61d037a1cae3f52f01b4eaf8773932b79880e31cb562aee4e0c7c7c9

Hashes for hext-1.0.6-cp37-cp37m-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 dfb9dc53f227748ec6c39059ac95d5e302ff2ac6aa3681757e8300e73f68afaa
MD5 3c2400a8c9fe562837f1a4cbe41a6140
BLAKE2b-256 e53eb2983669c1e0d88d02ee77d8638cc044e5b637da2bcacae225ea643cc647

Hashes for hext-1.0.6-cp36-cp36m-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 0404691309a804bfef9cd35ee0cfaa65964c0bad2834cda1397a0f8e32f016e0
MD5 15ca8157fb18bce473f56541476d80f2
BLAKE2b-256 b94184af0c6fb690d832b1a38dff3d1d4f18701161a6a7b132fa940d476b6095
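
A downloaded wheel can be checked against the digests listed above. The following is a minimal sketch using Python's hashlib. The helper names file_digest and matches_digest are made up for this illustration, and the file is assumed to be in the current directory.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of the file at path, read in chunks."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_digest(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with an expected hex string, ignoring case."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


# Example call, assuming the wheel has been downloaded beforehand:
# matches_digest(
#     "hext-1.0.6-cp312-cp312-manylinux2014_x86_64.whl",
#     "fd9b88903b11c1419448cf7ed07b7c6bd346d77cdfaf75734f1de82a8e73f6d7",
# )
```

Passing algorithm="md5" checks the MD5 column in the same way.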
