Convert HTML to JSON.

HTML to JSON

Convert HTML and/or HTML tables to JSON.

If this library is useful to you, or if you use it for a business, please consider sponsoring me. Even a small sponsorship allows me to prioritize work on this library and its ongoing maintenance. Thanks!

Installation

pip install html-to-json

Usage

HTML to JSON

import html_to_json

html_string = """<head>
    <title>Test site</title>
    <meta charset="UTF-8"></head>"""
output_json = html_to_json.convert(html_string)
print(output_json)

By default, html_to_json.convert captures both the text values and the attributes of each element. To skip the text values, pass the keyword argument capture_element_values=False. To skip the attributes, pass capture_element_attributes=False.

Example

Example input:

<head>
    <title>Floyd Hightower's Projects</title>
    <meta charset="UTF-8">
    <meta name="description" content="Floyd Hightower&#39;s Projects">
    <meta name="keywords" content="projects,fhightower,Floyd,Hightower">
</head>

Example output:

{
    "head": [
    {
        "title": [
        {
            "_value": "Floyd Hightower's Projects"
        }],
        "meta": [
        {
            "_attributes":
            {
                "charset": "UTF-8"
            }
        },
        {
            "_attributes":
            {
                "name": "description",
                "content": "Floyd Hightower's Projects"
            }
        },
        {
            "_attributes":
            {
                "name": "keywords",
                "content": "projects,fhightower,Floyd,Hightower"
            }
        }]
    }]
}
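The output nests every element in a list keyed by its tag name, with text under "_value" and attributes under "_attributes". A minimal sketch of reading values back out of that structure follows. The dictionary is a literal copy of the example output above, and named_meta is a hypothetical helper written for this illustration, not part of the library.

```python
# Literal copy of the example output above; in practice this would be the
# return value of html_to_json.convert(html_string).
output_json = {
    "head": [
        {
            "title": [{"_value": "Floyd Hightower's Projects"}],
            "meta": [
                {"_attributes": {"charset": "UTF-8"}},
                {"_attributes": {"name": "description",
                                 "content": "Floyd Hightower's Projects"}},
                {"_attributes": {"name": "keywords",
                                 "content": "projects,fhightower,Floyd,Hightower"}},
            ],
        }
    ]
}


def named_meta(converted):
    """Map each <meta name=...> to its content (hypothetical helper)."""
    result = {}
    for head in converted.get("head", []):
        for meta in head.get("meta", []):
            attrs = meta.get("_attributes", {})
            # Skip meta tags without a name, such as the charset declaration.
            if "name" in attrs:
                result[attrs["name"]] = attrs.get("content")
    return result


# Every tag maps to a list, so index [0] selects the first occurrence.
title = output_json["head"][0]["title"][0]["_value"]
meta_by_name = named_meta(output_json)

print(title)
print(meta_by_name)
```

Using .get() with empty defaults keeps the helper tolerant of elements that have no attributes or no meta children.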

HTML Tables to JSON

In addition to converting HTML to JSON, this library can also intelligently convert HTML tables to JSON.

Currently, this library can handle three types of tables:

A. Those with table headers in the first row
B. Those with table headers in the first column
C. Those without table headers

Tables of type A and B are diagrammed below:

[Diagram: tables with headers in the first row, and tables with headers in the first column]

Example

This code:

import html_to_json

html_string = """<table>
    <tr>
        <th>#</th>
        <th>Malware</th>
        <th>MD5</th>
        <th>Date Added</th>
    </tr>

    <tr>
        <td>25548</td>
        <td><a href="/stats/DarkComet/">DarkComet</a></td>
        <td><a href="/config/034a37b2a2307f876adc9538986d7b86">034a37b2a2307f876adc9538986d7b86</a></td>
        <td>July 9, 2018, 6:25 a.m.</td>
    </tr>

    <tr>
        <td>25547</td>
        <td><a href="/stats/DarkComet/">DarkComet</a></td>
        <td><a href="/config/706eeefbac3de4d58b27d964173999c3">706eeefbac3de4d58b27d964173999c3</a></td>
        <td>July 7, 2018, 6:25 a.m.</td>
    </tr></table>"""
tables = html_to_json.convert_tables(html_string)
print(tables)

will produce this output:

[
    [
        {
            "#": "25548",
            "Malware": "DarkComet",
            "MD5": "034a37b2a2307f876adc9538986d7b86",
            "Date Added": "July 9, 2018, 6:25 a.m."
        }, {
            "#": "25547",
            "Malware": "DarkComet",
            "MD5": "706eeefbac3de4d58b27d964173999c3",
            "Date Added": "July 7, 2018, 6:25 a.m."
        }
    ]
]
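Because each table with header cells comes back as a list of dictionaries keyed by header, it can be passed straight to tools that expect row dictionaries. The sketch below writes the first table to CSV with the standard library. It uses a literal copy of the output above, assumes every row is a dictionary with the same keys, and table_to_csv is a hypothetical helper rather than a library function.

```python
import csv
import io

# Literal copy of the example output above; in practice this would be the
# return value of html_to_json.convert_tables(html_string).
tables = [
    [
        {
            "#": "25548",
            "Malware": "DarkComet",
            "MD5": "034a37b2a2307f876adc9538986d7b86",
            "Date Added": "July 9, 2018, 6:25 a.m.",
        },
        {
            "#": "25547",
            "Malware": "DarkComet",
            "MD5": "706eeefbac3de4d58b27d964173999c3",
            "Date Added": "July 7, 2018, 6:25 a.m.",
        },
    ]
]


def table_to_csv(rows):
    """Serialize one converted table (a list of dicts) as CSV text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    # The keys of the first row serve as the CSV header.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = table_to_csv(tables[0])
print(csv_text)
```

The csv module quotes the "Date Added" values automatically because they contain commas.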

Preserving nested tags in table cells

By default, convert_tables() only captures the text of each cell, so nested tags (such as <a> elements) and their attributes are dropped. To keep them, pass one of the following keyword arguments:

  • record_html=True — capture each cell's inner HTML as a string.
  • record_children=True — capture each cell's children as JSON, using the same structure produced by convert().

If both are given, record_html takes precedence.

For example, html_to_json.convert_tables(html_string, record_html=True) on the table above produces:

[
    [
        {
            "#": "25548",
            "Malware": "<a href=\"/stats/DarkComet/\">DarkComet</a>",
            "MD5": "<a href=\"/config/034a37b2a2307f876adc9538986d7b86\">034a37b2a2307f876adc9538986d7b86</a>",
            "Date Added": "July 9, 2018, 6:25 a.m."
        }, {
            "#": "25547",
            "Malware": "<a href=\"/stats/DarkComet/\">DarkComet</a>",
            "MD5": "<a href=\"/config/706eeefbac3de4d58b27d964173999c3\">706eeefbac3de4d58b27d964173999c3</a>",
            "Date Added": "July 7, 2018, 6:25 a.m."
        }
    ]
]

while html_to_json.convert_tables(html_string, record_children=True) produces:

[
    [
        {
            "#": [{"_value": "25548"}],
            "Malware": [{"a": [{"_attributes": {"href": "/stats/DarkComet/"}, "_value": "DarkComet"}]}],
            "MD5": [{"a": [{"_attributes": {"href": "/config/034a37b2a2307f876adc9538986d7b86"}, "_value": "034a37b2a2307f876adc9538986d7b86"}]}],
            "Date Added": [{"_value": "July 9, 2018, 6:25 a.m."}]
        }, {
            "#": [{"_value": "25547"}],
            "Malware": [{"a": [{"_attributes": {"href": "/stats/DarkComet/"}, "_value": "DarkComet"}]}],
            "MD5": [{"a": [{"_attributes": {"href": "/config/706eeefbac3de4d58b27d964173999c3"}, "_value": "706eeefbac3de4d58b27d964173999c3"}]}],
            "Date Added": [{"_value": "July 7, 2018, 6:25 a.m."}]
        }
    ]
]
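With record_children=True, link targets stay available under "_attributes". As a rough sketch of pulling them out, the snippet below walks a literal copy of the first row shown above. cell_links is a hypothetical helper for this illustration, and it only looks at <a> elements that are direct children of a cell.

```python
# Literal copy of the first row of the record_children=True output above.
tables = [
    [
        {
            "#": [{"_value": "25548"}],
            "Malware": [{"a": [{"_attributes": {"href": "/stats/DarkComet/"},
                                "_value": "DarkComet"}]}],
            "MD5": [{"a": [{"_attributes": {"href": "/config/034a37b2a2307f876adc9538986d7b86"},
                            "_value": "034a37b2a2307f876adc9538986d7b86"}]}],
            "Date Added": [{"_value": "July 9, 2018, 6:25 a.m."}],
        }
    ]
]


def cell_links(cell):
    """Collect (text, href) pairs from the <a> children of one cell."""
    links = []
    for child in cell:
        # Plain-text cells have no "a" key, so they contribute nothing.
        for anchor in child.get("a", []):
            href = anchor.get("_attributes", {}).get("href")
            links.append((anchor.get("_value"), href))
    return links


for row in tables[0]:
    print(row["#"][0]["_value"], cell_links(row["Malware"]), cell_links(row["MD5"]))
```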

Development

This project uses uv for dependency and environment management. Python 3.10+ is required.

uv sync
uv run pytest
./scripts/lint.sh

Credits

This package was created with Cookiecutter and fhightower's Python project template.
