A module and command-line utility to extract structured data from HTML

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext snippet collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext snippet is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
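
To make the matching rule concrete, here is a rough plain-Python analogue of the snippet above. It does not use Hext at all: it swaps in the standard library's html.parser, and the class name LinkCollector is made up for this sketch. It only mirrors the one rule described here (tag a, attribute href required, text captured and trimmed) and makes no claim about how Hext is implemented.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects {"link", "title"} for every <a> element that has an href.

    Hypothetical helper for illustration; not part of Hext.
    """

    def __init__(self):
        super().__init__()
        self.results = []
        self._current = None  # record for the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Mirror the rule: the tag must be "a" and an href must be present.
        if tag == "a" and "href" in attrs:
            self._current = {"link": attrs["href"], "title": ""}

    def handle_data(self, data):
        # Accumulate text only while inside a matching <a>.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Trim surrounding whitespace, as in the values shown above.
            self._current["title"] = self._current["title"].strip()
            self.results.append(self._current)
            self._current = None


html = """<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>"""

collector = LinkCollector()
collector.feed(html)
collector.close()
links = collector.results
```

Here links ends up holding the same three link/title pairs as listed above. The Hext snippet expresses the same intent in a single line, which is the point of the comparison.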

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext snippet.
# Raises ValueError on invalid syntax.
# A raw string keeps the backslash in /\d+/ intact.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as compact JSON
for record in result:
    print(json.dumps(record, ensure_ascii=False,
                     separators=(',', ':')))
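
Because every captured value is a string, numeric fields usually need converting before further use. The sketch below shows one possible way to do that. The helper names first_int and normalize are hypothetical, and the sample list is made-up data in the shape that rule.extract is described to return, not real output. It also assumes that some keys may be absent from a record (for instance if the part of the rule marked with ? does not match, which is a guess about that syntax), so missing values fall back to defaults.

```python
import re


def first_int(value, default=None):
    """Return the first run of digits in value as an int, else default."""
    match = re.search(r"\d+", value or "")
    return int(match.group()) if match else default


def normalize(record):
    """Return a copy of one extracted record with numeric fields as ints."""
    return {
        "rank": first_int(record.get("rank")),
        "title": record.get("title", ""),
        "href": record.get("href", ""),
        "user": record.get("user"),
        "score": first_int(record.get("score"), default=0),
        "comment_count": first_int(record.get("comment_count"), default=0),
    }


# Made-up sample; real values depend on the page being scraped.
sample = [
    {"rank": "1.", "title": "Example story", "href": "https://example.com/",
     "score": "42 points", "user": "alice", "comment_count": "7"},
    {"rank": "2.", "title": "Job posting", "href": "https://example.org/job"},
]

stories = [normalize(record) for record in sample]
```

Keeping the conversion in a separate step leaves the Hext rule responsible only for locating values, which tends to make both parts easier to adjust when the page layout changes.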

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext snippets to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc
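
The JSON output can also be consumed from a script. The following is a minimal sketch that assumes the --compact option prints exactly one JSON object per line, as its help text suggests. The function names parse_compact and run_htmlext are hypothetical, run_htmlext requires htmlext on the PATH and is not called here, and sample_output is hand-written text modelled on the quick example rather than captured output.

```python
import json
import subprocess


def parse_compact(output):
    """Parse text with one JSON object per line into a list of dicts."""
    return [json.loads(line) for line in output.splitlines() if line.strip()]


def run_htmlext(hext_file, html_file):
    """Run htmlext in compact mode and return the parsed records.

    Uses only the invocation shown in the usage text:
    htmlext [options] <hext-file> <html-file...>
    """
    proc = subprocess.run(
        ["htmlext", "--compact", hext_file, html_file],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_compact(proc.stdout)


# Hand-written sample in the assumed one-object-per-line format.
sample_output = (
    '{"link":"one.html","title":"Page 1"}\n'
    '{"link":"two.html","title":"Page 2"}\n'
    '{"link":"three.html","title":"Page 3"}\n'
)

records = parse_compact(sample_output)
hrefs = [record["link"] for record in records]
```

Parsing line by line means a long result stream can be processed incrementally instead of being loaded as one large JSON document.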

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • hext-0.2.0-cp37-cp37m-manylinux1_x86_64.whl (1.8 MB): CPython 3.7m
  • hext-0.2.0-cp37-cp37m-macosx_10_11_x86_64.whl (1.8 MB): CPython 3.7m, macOS 10.11+ x86-64
  • hext-0.2.0-cp36-cp36m-manylinux1_x86_64.whl (1.8 MB): CPython 3.6m
  • hext-0.2.0-cp36-cp36m-macosx_10_11_x86_64.whl (1.8 MB): CPython 3.6m, macOS 10.11+ x86-64
  • hext-0.2.0-cp35-cp35m-manylinux1_x86_64.whl (1.8 MB): CPython 3.5m
  • hext-0.2.0-cp35-cp35m-macosx_10_11_x86_64.whl (1.8 MB): CPython 3.5m, macOS 10.11+ x86-64
  • hext-0.2.0-cp34-cp34m-manylinux1_x86_64.whl (1.8 MB): CPython 3.4m
  • hext-0.2.0-cp34-cp34m-macosx_10_11_x86_64.whl (1.8 MB): CPython 3.4m, macOS 10.11+ x86-64
  • hext-0.2.0-cp27-cp27mu-manylinux1_x86_64.whl (1.8 MB): CPython 2.7mu
  • hext-0.2.0-cp27-cp27mu-macosx_10_11_x86_64.whl (1.8 MB): CPython 2.7mu, macOS 10.11+ x86-64
  • hext-0.2.0-cp27-cp27m-manylinux1_x86_64.whl (1.8 MB): CPython 2.7m
  • hext-0.2.0-cp27-cp27m-macosx_10_11_x86_64.whl (1.8 MB): CPython 2.7m, macOS 10.11+ x86-64
