
A module and command-line utility to extract structured data from HTML


Hext — Extract Data from HTML


Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext snippet collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext snippet is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
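
To make the matching behaviour described above concrete, here is a rough sketch that reproduces the same result using only Python's standard library. It is not how Hext is implemented; the class name LinkCollector and the helper collect_links are made up for this illustration, and the whitespace handling is only an approximation of Hext's textual representation.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects {"link", "title"} pairs from <a> elements that have an href."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None  # href of the <a> currently being read, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        # Like the Hext rule, only <a> elements with an href attribute match.
        if tag == "a" and href is not None:
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace so that "  Page 1" becomes "Page 1".
            title = " ".join("".join(self._text).split())
            self.results.append({"link": self._href, "title": title})
            self._href = None


def collect_links(html):
    """Hypothetical helper: run the collector over a string of HTML."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.results


sample = """<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>"""

for row in collect_links(sample):
    print(row)
```

The hand-written parser needs roughly thirty lines for what the one-line Hext snippet expresses declaratively.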

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext snippet.
# Raises an exception of type ValueError on invalid syntax.
# The raw string keeps the backslash in /\d+/ from being
# treated as a Python escape sequence.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as compact JSON.
# (The loop variable is not called "map" to avoid shadowing the builtin.)
for row in result:
    print(json.dumps(row, ensure_ascii=False,
                     separators=(',', ':')))
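
Since rule.extract returns a list of dictionaries, the result can be passed to ordinary Python tooling. The following sketch writes such a list as CSV. The sample records are invented for illustration and only mimic the keys captured by the rule above; the helper to_csv and the column order in FIELDS are assumptions, not part of the Hext module. Missing keys are written as empty cells, on the assumption that not every match captures every value.

```python
import csv
import io

# Hypothetical records shaped like the output of rule.extract(html);
# the values are made up for illustration.
result = [
    {"rank": "1.", "href": "https://example.com/a", "title": "First story",
     "score": "120 points", "user": "alice", "comment_count": "45"},
    # Assumption: a match may lack some keys, so they are left out here.
    {"rank": "2.", "href": "https://example.com/b", "title": "Second story"},
]

FIELDS = ["rank", "title", "href", "score", "user", "comment_count"]


def to_csv(rows, fields=FIELDS):
    """Serialize a list of string dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, restval="",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(result), end="")
```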

Using Hext on the Command Line

Hext ships with a command line utility called htmlext, which applies Hext snippets to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc
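
When htmlext is run with --compact, the help text above says it prints one JSON object per line. A script consuming that output could parse it as sketched below. The function name parse_json_lines and the sample output are hypothetical; actual htmlext output may differ in detail.

```python
import json


def parse_json_lines(text):
    """Parse text holding one JSON object per line into a list of dicts."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


# Hypothetical output, assumed to resemble what --compact prints.
sample_output = (
    '{"x":"https://example.com/1"}\n'
    '{"x":"https://example.com/2"}\n'
)

records = parse_json_lines(sample_output)
print([record["x"] for record in records])
```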

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.
