Skip to main content

A module and command-line utility to extract structured data from HTML

Project description

Hext — Extract Data from HTML

Hext Logo

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext snippet collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext snippet is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> dictionary of {string -> string}
import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing an UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext snippet.
# Throws an exception of type ValueError on invalid syntax.
rule = hext.Rule("""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as JSON
for map in result:
    print(json.dumps(map, ensure_ascii=False,
                          separators=(',',':')))

Using Hext on the Command Line

Hext ships with a command line utility called htmlext, which applies Hext snippets to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on Github. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

hext-0.2.2-cp27-cp27mu-manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 2.7mu

hext-0.2.2-cp27-cp27mu-macosx_10_11_x86_64.whl (1.8 MB view details)

Uploaded CPython 2.7mumacOS 10.11+ x86-64

hext-0.2.2-cp27-cp27m-manylinux1_x86_64.whl (1.6 MB view details)

Uploaded CPython 2.7m

hext-0.2.2-cp27-cp27m-macosx_10_11_x86_64.whl (1.8 MB view details)

Uploaded CPython 2.7mmacOS 10.11+ x86-64

File details

Details for the file hext-0.2.2-cp27-cp27mu-manylinux1_x86_64.whl.

File metadata

  • Download URL: hext-0.2.2-cp27-cp27mu-manylinux1_x86_64.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 2.7mu
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.7

File hashes

Hashes for hext-0.2.2-cp27-cp27mu-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 080b725d9027ab55f8648346bdda7490bc45f1a4edf7b76141f0f8c6316ff848
MD5 81dea90fb617624c3b5650a082d986e9
BLAKE2b-256 52640dffe522e44d65065ba42f854214efb0493cce55ed52f6e03b02f8149fad

See more details on using hashes here.

File details

Details for the file hext-0.2.2-cp27-cp27mu-macosx_10_11_x86_64.whl.

File metadata

  • Download URL: hext-0.2.2-cp27-cp27mu-macosx_10_11_x86_64.whl
  • Upload date:
  • Size: 1.8 MB
  • Tags: CPython 2.7mu, macOS 10.11+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.7

File hashes

Hashes for hext-0.2.2-cp27-cp27mu-macosx_10_11_x86_64.whl
Algorithm Hash digest
SHA256 b968419bbd400afe12c43c660c6006de5e430edaa876be2a64ac82a12c50879c
MD5 5f231f1106f928252534989d176cc1b8
BLAKE2b-256 ac8093698bdf84be2ec62cc189783b8f358060648028dd6c25f7d1e05e8bbac3

See more details on using hashes here.

File details

Details for the file hext-0.2.2-cp27-cp27m-manylinux1_x86_64.whl.

File metadata

  • Download URL: hext-0.2.2-cp27-cp27m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 2.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.7

File hashes

Hashes for hext-0.2.2-cp27-cp27m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 359930a07af04fad6fd1a7539b91bc34bb3c6e9f21d55764cf2ed2fd814ba140
MD5 5673fc875c3349fed3f850395266337a
BLAKE2b-256 afdcff8853a16a525be8e4ef9e51962fa2f375b5f02d0f8049bf26ad5067092e

See more details on using hashes here.

File details

Details for the file hext-0.2.2-cp27-cp27m-macosx_10_11_x86_64.whl.

File metadata

  • Download URL: hext-0.2.2-cp27-cp27m-macosx_10_11_x86_64.whl
  • Upload date:
  • Size: 1.8 MB
  • Tags: CPython 2.7m, macOS 10.11+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.7

File hashes

Hashes for hext-0.2.2-cp27-cp27m-macosx_10_11_x86_64.whl
Algorithm Hash digest
SHA256 e1134b68c6c47517af433e7259cb1f791de0ee825fd7faeb35c6ae70a868371d
MD5 af86f2d0ae35e23a82c47f131f931de5
BLAKE2b-256 e75b28199b479fc8748ba4ed61622e3f73f2ac1e446f640e2a4f168957c72736

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page