A module and command-line utility to extract structured data from HTML

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext template collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext template is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
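To make the matching rule above concrete, here is a rough standard-library analogue of what the template captures: every a element with an href, plus its whitespace-trimmed text. This is only an illustrative sketch and not how Hext is implemented; the names LinkCollector and collect_links are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href and text of every <a> element that has an href value."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace and trim both ends.
            title = " ".join("".join(self._text).split())
            self.results.append({"link": self._href, "title": title})
            self._href = None


def collect_links(html_text):
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.results


sample = """<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>"""

for row in collect_links(sample):
    print(row)
```

The Hext template expresses the same intent in a single line, which is the point of the language.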

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

rule.extract takes an optional second parameter, max_searches, of type unsigned int. The search for matching elements is aborted once this limit is reached. The default is 0, which never aborts.

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext template.
# Raises ValueError on invalid syntax.
# The raw string avoids an invalid-escape warning for \d.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as JSON
for row in result:
    print(json.dumps(row, ensure_ascii=False,
                     separators=(',', ':')))
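The two behaviours mentioned above, the ValueError on invalid syntax and the optional max_searches limit, can be combined in a small helper. This is a minimal sketch: the function name extract_with_limit is hypothetical, and hext is imported inside the function so the snippet can be loaded even where the package is not installed.

```python
def extract_with_limit(template, html_text, max_searches=0):
    """Apply a Hext template to an HTML string.

    Returns the list of dictionaries produced by rule.extract,
    or None if the template has invalid syntax.
    """
    import hext  # requires the hext package

    try:
        rule = hext.Rule(template)
    except ValueError as err:
        print(f"invalid Hext template: {err}")
        return None

    # max_searches is passed as the second argument;
    # 0 (the default) means the search is never aborted.
    return rule.extract(hext.Html(html_text), max_searches)
```

For example, extract_with_limit('<a href:link @text:title />', page_source, 1000) would apply the link template from the quick example with a search limit of 1000; both page_source and the limit are placeholders chosen for illustration.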

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext templates to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version
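
Because --compact prints one JSON object per line, the output is easy to consume from another program. The sketch below parses such output with Python's json module; the helper name parse_compact_output and the sample text are hypothetical, with values borrowed from the quick example rather than taken from a real htmlext run.

```python
import json


def parse_compact_output(text):
    """Parse text containing one JSON object per line, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Illustrative stand-in for captured `htmlext --compact` output.
sample_output = (
    '{"link":"one.html","title":"Page 1"}\n'
    '{"link":"two.html","title":"Page 2"}\n'
)

rows = parse_compact_output(sample_output)
print([row["link"] for row in rows])
```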

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.
