Skip to main content

Convert html to json.

Project description

HTML to JSON

PyPI Build Status codecov

Convert HTML and/or HTML tables to JSON.

Installation

pip install html-to-json

Usage

HTML to JSON

import html_to_json

html_string = """<head>
    <title>Test site</title>
    <meta charset="UTF-8"></head>"""
output_json = html_to_json.convert(html_string)
print(output_json)

When calling the html_to_json.convert function, you can choose to not capture the text values from the html by passing in the key-word argument capture_element_values=False. You can also choose to not capture the attributes of the elements by passing capture_element_attributes=False into the function.

Example

Example input:

<head>
    <title>Floyd Hightower's Projects</title>
    <meta charset="UTF-8">
    <meta name="description" content="Floyd Hightower&#39;s Projects">
    <meta name="keywords" content="projects,fhightower,Floyd,Hightower">
</head>

Example output:

{
    "head": [
    {
        "title": [
        {
            "_value": "Floyd Hightower's Projects"
        }],
        "meta": [
        {
            "_attributes":
            {
                "charset": "UTF-8"
            }
        },
        {
            "_attributes":
            {
                "name": "description",
                "content": "Floyd Hightower's Projects"
            }
        },
        {
            "_attributes":
            {
                "name": "keywords",
                "content": "projects,fhightower,Floyd,Hightower"
            }
        }]
    }]
}

HTML Tables to JSON

In addition to converting HTML to JSON, this library can also intelligently convert HTML tables to JSON.

Currently, this library can handle three types of tables:

A. Those with table headers in the first row B. Those with table headers in the first column C. Those without table headers

Tables of type A and B are diagrammed below:

This package can handle tables with the headers in the first row or headers in the first column

Example

This code:

import html_to_json

html_string = """<table>
    <tr>
        <th>#</th>
        <th>Malware</th>
        <th>MD5</th>
        <th>Date Added</th>
    </tr>

    <tr>
        <td>25548</td>
        <td><a href="/stats/DarkComet/">DarkComet</a></td>
        <td><a href="/config/034a37b2a2307f876adc9538986d7b86">034a37b2a2307f876adc9538986d7b86</a></td>
        <td>July 9, 2018, 6:25 a.m.</td>
    </tr>

    <tr>
        <td>25547</td>
        <td><a href="/stats/DarkComet/">DarkComet</a></td>
        <td><a href="/config/706eeefbac3de4d58b27d964173999c3">706eeefbac3de4d58b27d964173999c3</a></td>
        <td>July 7, 2018, 6:25 a.m.</td>
    </tr></table>"""
tables = html_to_json.convert_tables(html_string)
print(tables)

will produce this output:

[
    [
        {
            "#": "25548",
            "Malware": "DarkComet",
            "MD5": "034a37b2a2307f876adc9538986d7b86",
            "Date Added": "July 9, 2018, 6:25 a.m."
        }, {
            "#": "25547",
            "Malware": "DarkComet",
            "MD5": "706eeefbac3de4d58b27d964173999c3",
            "Date Added": "July 7, 2018, 6:25 a.m."
        }
    ]
]

Credits

This package was created with Cookiecutter and fhightower's Python project template.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

html_to_json-2.0.0.tar.gz (54.2 kB view details)

Uploaded Source

Built Distribution

html_to_json-2.0.0-py2.py3-none-any.whl (6.4 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file html_to_json-2.0.0.tar.gz.

File metadata

  • Download URL: html_to_json-2.0.0.tar.gz
  • Upload date:
  • Size: 54.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/49.2.1 requests-toolbelt/0.9.1 tqdm/4.58.0 CPython/3.9.1

File hashes

Hashes for html_to_json-2.0.0.tar.gz
Algorithm Hash digest
SHA256 3fc848f40618f444f8e9971f88a22fef041d0cb4569464de018dcf8e3c37669e
MD5 3435ba0c28a24aa9d273cc05799c91a7
BLAKE2b-256 da83c425c27e4c8f4b622901f8b58ad48e53be14a080d341a70c67570f1ec30a

See more details on using hashes here.

File details

Details for the file html_to_json-2.0.0-py2.py3-none-any.whl.

File metadata

  • Download URL: html_to_json-2.0.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 6.4 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/49.2.1 requests-toolbelt/0.9.1 tqdm/4.58.0 CPython/3.9.1

File hashes

Hashes for html_to_json-2.0.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 707ba86390ac05cf59d36a106f3d3da34b6075a245ee597d4c6c06ca9a6d0898
MD5 730212b353bec354b16c5249a66704c1
BLAKE2b-256 5a79aa64abd13c010a02c3cc61f970295357fb0a65505eb096f7c03a2e7cdebd

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page