A module and command-line utility to extract structured data from HTML

Hext — Extract Data from HTML

Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

A Quick Example

The following Hext template collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

If the above Hext template is applied to this piece of HTML:

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can use this example in Hext’s live code editor. Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
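
To make the matching rule concrete, here is a rough standard-library analogue of what the template <a href:link @text:title /> asks for. It is only an illustrative sketch and not how Hext itself is implemented: the class name LinkCollector is made up for this example, and the whitespace trimming of the captured text is inferred from the values shown above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect {"link", "title"} for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        # Like the template, require the tag `a` and an `href` attribute.
        if tag == "a" and href is not None:
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.results.append({"link": self._href, "title": title})
            self._href = None


page = """
<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>
"""

collector = LinkCollector()
collector.feed(page)
collector.close()
for row in collector.results:
    print(row)
```

The Hext template expresses the same requirement in a single line, which is the point of the language: the matching and capturing logic above is implied by the template's shape.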

Components

This package includes:

  • The Hext Python module
  • The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

  • html = hext.Html("<html>...</html>") -> object
  • rule = hext.Rule("...") -> object
  • rule.extract(html) -> list of dictionaries of {string -> string}

rule.extract also accepts an optional second parameter, max_searches, of type unsigned int. The search for matching elements is aborted once this limit is reached. The default is 0, which never aborts.

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext template.
# Throws an exception of type ValueError on invalid syntax.
# The raw string (r"...") keeps the backslash in \d intact.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as compact JSON, one per line
for item in result:
    print(json.dumps(item, ensure_ascii=False,
                     separators=(',', ':')))
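
The optional max_searches parameter of rule.extract can be wrapped in a small helper that validates the limit before calling into Hext. This is a sketch under stated assumptions: extract_capped and EchoRule are hypothetical names, the default of 100000 is an arbitrary illustration, the limit is passed positionally because it is described as the second parameter, and nothing is assumed about how an aborted search is reported to the caller.

```python
def extract_capped(rule, html, max_searches=100_000):
    """Call rule.extract(html, max_searches) after validating the limit.

    `rule` is expected to be a hext.Rule and `html` a hext.Html, but any
    objects with the same interface work. A limit of 0 means "never abort".
    """
    if not isinstance(max_searches, int) or max_searches < 0:
        raise ValueError("max_searches must be a non-negative integer")
    return rule.extract(html, max_searches)


class EchoRule:
    """Stand-in for hext.Rule so the example runs without Hext installed."""

    def extract(self, html, max_searches=0):
        # Echo the arguments back in the list-of-dictionaries shape.
        return [{"html": html, "max_searches": str(max_searches)}]


result = extract_capped(EchoRule(), "<html></html>", 500)
print(result)
```

With Hext installed, the same helper would be called with real objects, for example extract_capped(hext.Rule("<a href:link />"), hext.Html(res.text), 10000).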

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext templates to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc
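
htmlext can also be driven from a Python script. The following sketch builds a command line from the --str, --html and --compact options listed in the help text and parses the result, assuming that --compact emits exactly one JSON object per line and that --html may be repeated. The helper names are hypothetical, and run_htmlext additionally assumes that htmlext is on the PATH.

```python
import json
import subprocess


def build_htmlext_args(hext_source, html_files):
    """Build argv for htmlext: Hext from a string, HTML from files."""
    args = ["htmlext", "--compact", "--str", hext_source]
    for path in html_files:
        args += ["--html", path]
    return args


def parse_compact_output(stdout):
    """Parse one JSON object per non-empty line into a list."""
    return [json.loads(line) for line in stdout.splitlines() if line.strip()]


def run_htmlext(hext_source, html_files):
    """Run htmlext and return its parsed results (needs htmlext on PATH)."""
    proc = subprocess.run(
        build_htmlext_args(hext_source, html_files),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_compact_output(proc.stdout)


print(build_htmlext_args('<a class="title" href:x />', ["videos.html"]))
```

Passing the arguments as a list avoids shell quoting problems with the angle brackets and quotes that Hext templates contain.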

License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties:

  • gumbo-parser. Copyright 2010 Google Inc. See gumbo.license.
  • rapidjson. Copyright (C) 2015 THL A29 Limited, a Tencent company, and Milo Yip. See rapidjson.license.
