Convert HTML to JSON.

HTML to JSON

Convert HTML and/or HTML tables to JSON. A testing/debugging interface is also available.

Installation

pip install html-to-json

Usage

HTML to JSON

import html_to_json

html_string = """<head>
    <title>Test site</title>
    <meta charset="UTF-8"></head>"""
output_json = html_to_json.convert(html_string)
print(output_json)

When calling html_to_json.convert, you can skip capturing the text values of the HTML elements by passing the keyword argument capture_element_values=False. Likewise, passing capture_element_attributes=False skips capturing the elements' attributes.
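As a rough illustration of what these two flags leave out, the sketch below prunes the "_value" and "_attributes" keys from a dict that follows the output shape shown in the Example section. It does not call the package: the helper drop_keys is hypothetical, the converted dict is hand-written, and the output the library actually produces with either flag set may differ in detail.

```python
def drop_keys(node, keys):
    """Return a copy of `node` with every dict entry named in `keys` removed, recursively."""
    if isinstance(node, dict):
        return {k: drop_keys(v, keys) for k, v in node.items() if k not in keys}
    if isinstance(node, list):
        return [drop_keys(item, keys) for item in node]
    return node


# Hand-written stand-in for a converted document (assumption, not library output).
converted = {
    "head": [
        {
            "title": [{"_value": "Test site"}],
            "meta": [{"_attributes": {"charset": "UTF-8"}}],
        }
    ]
}

# Roughly what one would expect with capture_element_values=False.
structure_only = drop_keys(converted, {"_value"})

# Roughly what one would expect with capture_element_attributes=False.
no_attributes = drop_keys(converted, {"_attributes"})

print(structure_only)
print(no_attributes)
```

The same helper can be applied after conversion when only some parts of a document should lose their values or attributes.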

Example

Example input:

<head>
    <title>Floyd Hightower's Projects</title>
    <meta charset="UTF-8">
    <meta name="description" content="Floyd Hightower&#39;s Projects">
    <meta name="keywords" content="projects,fhightower,Floyd,Hightower">
</head>

Example output:

{
    "head": [
    {
        "title": [
        {
            "_value": "Floyd Hightower's Projects"
        }],
        "meta": [
        {
            "_attributes":
            {
                "charset": "UTF-8"
            }
        },
        {
            "_attributes":
            {
                "name": "description",
                "content": "Floyd Hightower's Projects"
            }
        },
        {
            "_attributes":
            {
                "name": "keywords",
                "content": "projects,fhightower,Floyd,Hightower"
            }
        }]
    }]
}
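Every element name in the output maps to a list of dicts, with text stored under "_value" and attributes under "_attributes". A minimal sketch of reading values back out of that structure follows; the dict is a literal copy of the example output so that the snippet runs without the package, and the variable names are only illustrative.

```python
# Literal copy of the example output, so this runs without html_to_json installed.
output_json = {
    "head": [
        {
            "title": [{"_value": "Floyd Hightower's Projects"}],
            "meta": [
                {"_attributes": {"charset": "UTF-8"}},
                {"_attributes": {"name": "description",
                                 "content": "Floyd Hightower's Projects"}},
                {"_attributes": {"name": "keywords",
                                 "content": "projects,fhightower,Floyd,Hightower"}},
            ],
        }
    ]
}

# Each tag name maps to a list, so index into it before reading keys.
head = output_json["head"][0]
title = head["title"][0]["_value"]

# Collect <meta name="..." content="..."> pairs; the charset meta has no name and is skipped.
named_meta = {
    meta["_attributes"]["name"]: meta["_attributes"]["content"]
    for meta in head["meta"]
    if "name" in meta.get("_attributes", {})
}

print(title)
print(named_meta["keywords"].split(","))
```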

HTML Tables to JSON

import html_to_json

html_string = """<table class="table table-striped table-bordered table-hover">
    <tr>
        <th>#</th>
        <th>Malware</th>
        <th>MD5</th>
        <th>Date Added</th>
    </tr>

    <tr>
        <td>25548</td>
        <td><a href="/stats/DarkComet/">DarkComet</a></td>
        <td><a href="/config/034a37b2a2307f876adc9538986d7b86">034a37b2a2307f876adc9538986d7b86</a></td>
        <td>July 9, 2018, 6:25 a.m.</td>
    </tr>

    <tr>
        <td>25547</td>
        <td><a href="/stats/DarkComet/">DarkComet</a></td>
        <td><a href="/config/706eeefbac3de4d58b27d964173999c3">706eeefbac3de4d58b27d964173999c3</a></td>
        <td>July 7, 2018, 6:25 a.m.</td>
    </tr></table>"""
tables = html_to_json.convert_tables(html_string)
print(tables)

Currently, this package can handle tables with the headers in the first row or tables with the headers in the first column.

Example

Example input:

<table class="table table-striped table-bordered table-hover">
    <tr>
        <th>#</th>
        <th>Malware</th>
        <th>MD5</th>
        <th>Date Added</th>
    </tr>

    <tr>
        <td>25548</td>
        <td><a href="/stats/DarkComet/">DarkComet</a></td>
        <td><a href="/config/034a37b2a2307f876adc9538986d7b86">034a37b2a2307f876adc9538986d7b86</a></td>
        <td>July 9, 2018, 6:25 a.m.</td>
    </tr>

    <tr>
        <td>25547</td>
        <td><a href="/stats/DarkComet/">DarkComet</a></td>
        <td><a href="/config/706eeefbac3de4d58b27d964173999c3">706eeefbac3de4d58b27d964173999c3</a></td>
        <td>July 7, 2018, 6:25 a.m.</td>
    </tr>
</table>

Example output:

[
    [
        {
            '#': '25548',
            'Malware': 'DarkComet',
            'MD5': '034a37b2a2307f876adc9538986d7b86',
            'Date Added': 'July 9, 2018, 6:25 a.m.'
        }, {
            '#': '25547',
            'Malware': 'DarkComet',
            'MD5': '706eeefbac3de4d58b27d964173999c3',
            'Date Added': 'July 7, 2018, 6:25 a.m.'
        }
    ]
]
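The result is a list of tables, and each table is a list of row dicts keyed by the header text. A small sketch of working with that shape follows; the tables literal is copied from the example output so that the snippet runs on its own, and the variable names are only illustrative.

```python
import json

# Literal copy of the example output, so this runs without html_to_json installed.
tables = [
    [
        {
            "#": "25548",
            "Malware": "DarkComet",
            "MD5": "034a37b2a2307f876adc9538986d7b86",
            "Date Added": "July 9, 2018, 6:25 a.m.",
        },
        {
            "#": "25547",
            "Malware": "DarkComet",
            "MD5": "706eeefbac3de4d58b27d964173999c3",
            "Date Added": "July 7, 2018, 6:25 a.m.",
        },
    ]
]

# The outer list holds one entry per table found in the HTML.
first_table = tables[0]

# Pull a single column out of the rows.
md5_values = [row["MD5"] for row in first_table]

# Index the rows by the "#" column for direct lookup.
rows_by_id = {row["#"]: row for row in first_table}

# Serialize the rows to a JSON string, since the result is a plain Python list.
as_json = json.dumps(first_table, indent=4)

print(md5_values)
print(rows_by_id["25547"]["Date Added"])
```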

Credits

This package was created with Cookiecutter and fhightower's Python project template.
