A module and command-line utility to extract structured data from HTML

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext template collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext template is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
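
To make the matching rule above concrete, here is a rough sketch that reproduces the same link/title pairs using only Python's standard library. It is not how Hext is implemented: the names LinkCollector and collect_links are hypothetical, and the whitespace handling is an approximation chosen to match the output shown above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects {"link", "title"} pairs from <a> elements that carry an href."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Mirror the template: the tag must be "a" and href must be present.
        href = dict(attrs).get("href")
        if tag == "a" and href is not None:
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Trim and collapse whitespace (an approximation of @text).
            title = " ".join("".join(self._text).split())
            self.results.append({"link": self._href, "title": title})
            self._href = None


def collect_links(html_source):
    """Return one dictionary per matching <a> element, in document order."""
    parser = LinkCollector()
    parser.feed(html_source)
    parser.close()
    return parser.results
```

Calling collect_links on the HTML snippet above yields one dictionary per hyperlink. The Hext template expresses the same intent in a single line, which is the point of the language.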

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext template.
# Raises an exception of type ValueError on invalid syntax.
# The template is a raw string so that the backslash in the
# regular expression /\d+/ reaches Hext unchanged.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as JSON
for row in result:
    print(json.dumps(row, ensure_ascii=False,
                          separators=(',',':')))
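
The same three calls also work on an HTML string you already have, without a network request. The following is a minimal sketch that assumes the hext package is installed. The helper names to_json_lines and extract_rows are hypothetical and not part of the module, and returning None for an invalid template is just one possible way to handle the ValueError.

```python
import json


def to_json_lines(rows):
    """Serialize a list of dictionaries as one compact JSON object per line."""
    return "\n".join(
        json.dumps(row, ensure_ascii=False, separators=(",", ":"))
        for row in rows
    )


def extract_rows(template, html_source):
    """Apply a Hext template to an HTML string and return the captured rows.

    Returns None if the template has invalid syntax, since hext.Rule
    raises ValueError in that case.
    """
    # Imported here so that to_json_lines stays usable without hext installed.
    import hext

    try:
        rule = hext.Rule(template)
    except ValueError as error:
        print(f"invalid Hext template: {error}")
        return None
    return rule.extract(hext.Html(html_source))
```

With hext installed, to_json_lines(extract_rows('<a href:link @text:title />', '<a href="one.html">Page 1</a>')) should give one JSON line per matched link.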

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext templates to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version
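
If you want to call htmlext from a script, the options above can be combined as in the following sketch. It assumes htmlext is on your PATH and that it exits with a non-zero status on failure (an assumption, not something stated above). The function names are hypothetical. Only the --str, --html and --compact options from the help text are used.

```python
import json
import subprocess


def build_htmlext_command(hext_template, html_path):
    """Build the argument list: Hext from a string, HTML from a file,
    and one JSON object per output line."""
    return ["htmlext", "--str", hext_template, "--html", html_path, "--compact"]


def parse_json_lines(output):
    """Parse output that contains one JSON object per line."""
    return [json.loads(line) for line in output.splitlines() if line.strip()]


def run_htmlext(hext_template, html_path):
    """Run htmlext and return the captured values as a list of dictionaries."""
    completed = subprocess.run(
        build_htmlext_command(hext_template, html_path),
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError on a non-zero exit status
    )
    return parse_json_lines(completed.stdout)
```

For example, run_htmlext('<a href:link @text:title />', 'page.html') would apply the quick-example template to a local file named page.html (a placeholder file name).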

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • hext-0.3.0-cp310-cp310-manylinux2014_x86_64.whl (886.6 kB): CPython 3.10
  • hext-0.3.0-cp310-cp310-macosx_10_11_x86_64.whl (725.1 kB): CPython 3.10, macOS 10.11+ x86-64
  • hext-0.3.0-cp39-cp39-manylinux2014_x86_64.whl (906.9 kB): CPython 3.9
  • hext-0.3.0-cp39-cp39-macosx_10_11_x86_64.whl (725.1 kB): CPython 3.9, macOS 10.11+ x86-64
  • hext-0.3.0-cp38-cp38-manylinux2014_x86_64.whl (907.2 kB): CPython 3.8
  • hext-0.3.0-cp38-cp38-macosx_10_11_x86_64.whl (725.3 kB): CPython 3.8, macOS 10.11+ x86-64
  • hext-0.3.0-cp37-cp37m-manylinux2014_x86_64.whl (907.1 kB): CPython 3.7m
  • hext-0.3.0-cp37-cp37m-macosx_10_11_x86_64.whl (725.2 kB): CPython 3.7m, macOS 10.11+ x86-64
  • hext-0.3.0-cp36-cp36m-manylinux2014_x86_64.whl (907.1 kB): CPython 3.6m
  • hext-0.3.0-cp36-cp36m-macosx_10_11_x86_64.whl (725.2 kB): CPython 3.6m, macOS 10.11+ x86-64
  • hext-0.3.0-cp35-cp35m-manylinux2014_x86_64.whl (907.1 kB): CPython 3.5m
