

webtable2json

A Python library to extract HTML tables from webpages and convert them to JSON format. Perfect for web scraping, data extraction, and converting tabular web data into structured JSON.

Features

  • Extract tables from URLs or HTML content
  • Clean and normalize table data
  • Handle complex table structures (thead, tbody, colspan, etc.)
  • Preserve links and images with automatic URL normalization
  • Specialized functions for ranking websites
  • Session support for better performance
  • Built-in logging and error handling
  • Save results directly to JSON files
  • Filter tables by size requirements
  • Type hints for better development experience

Installation

pip install webtable2json

Quick Start

from webtable2json import convert_url_to_json, WebTableToJSON

# Extract all tables from a URL
tables = convert_url_to_json("https://web.archive.org/web/20251115143556/https://www.nirfindia.org/Rankings/2025/ManagementRanking.html")

# Extract a specific table (0-based index)
table = convert_url_to_json("https://web.archive.org/web/20251115143556/https://www.nirfindia.org/Rankings/2025/ManagementRanking.html", table_index=0)

# Use the class for more control
converter = WebTableToJSON()
result = converter.url_to_json("https://web.archive.org/web/20251115143556/https://www.nirfindia.org/Rankings/2025/ManagementRanking.html")

Usage Examples

Basic Table Extraction

from webtable2json import convert_url_to_json, save_tables_to_file

# Get all tables from a webpage
tables = convert_url_to_json("https://www.w3schools.com/html/html_tables.asp")

# Save to file
save_tables_to_file(tables, "extracted_tables.json")

for i, table in enumerate(tables):
    print(f"Table {i}: {table['row_count']} rows, {table['column_count']} columns")
    print(f"First row: {table['data'][0]}")

Working with Custom Headers

from webtable2json import convert_url_to_json

# Custom headers for authentication or specific requirements
headers = {
    'Authorization': 'Bearer your-token',
    'User-Agent': 'My Custom Bot 1.0'
}

tables = convert_url_to_json("https://example.com", headers=headers)

Working with HTML Content

from webtable2json import convert_html_to_json

html = """
<table>
    <tr><th>Name</th><th>Website</th><th>Logo</th></tr>
    <tr>
        <td>Example Corp</td>
        <td><a href="https://example.com">Visit Site</a></td>
        <td><img src="logo.png" alt="Company Logo"></td>
    </tr>
</table>
"""

tables = convert_html_to_json(html, base_url="https://example.com")
print(tables[0]['data'])
# Output includes normalized URLs and image data

Filtering and Utility Functions

from webtable2json import convert_url_to_json, filter_tables_by_size, save_tables_to_file

# Get all tables
all_tables = convert_url_to_json("https://example.com")

# Filter tables with at least 5 rows and 3 columns
large_tables = filter_tables_by_size(all_tables, min_rows=5, min_cols=3)

# Save filtered results
save_tables_to_file(large_tables, "large_tables.json")

Advanced Usage with Sessions

from webtable2json import WebTableToJSON
import requests

# Custom headers and session for better performance
session = requests.Session()
headers = {
    'User-Agent': 'My Custom Bot 1.0',
    'Accept': 'text/html,application/xhtml+xml'
}

converter = WebTableToJSON(headers=headers, session=session, timeout=60)
result = converter.url_to_json("https://example.com")

Specialized Functions

from webtable2json import get_main_table, get_clean_ranking_data

# Get the largest table (usually the main data table)
main_table = get_main_table("https://example.com/data-page")

# Specialized function for ranking websites
ranking_data = get_clean_ranking_data("https://www.nirfindia.org/Rankings/2025/ManagementRanking.html")
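`get_main_table` presumably selects the table with the most cells. A minimal sketch of that selection logic over already-extracted table dicts (the helper name `pick_main_table` and the `row_count * column_count` tie-breaking are illustrative assumptions, not the library's actual implementation):

```python
def pick_main_table(tables):
    """Return the table dict with the largest row_count * column_count.

    `tables` is a list of dicts in the output format documented below;
    returns None for an empty list.
    """
    if not tables:
        return None
    return max(tables, key=lambda t: t["row_count"] * t["column_count"])

tables = [
    {"table_index": 0, "row_count": 2, "column_count": 2, "data": []},
    {"table_index": 1, "row_count": 10, "column_count": 4, "data": []},
]
print(pick_main_table(tables)["table_index"])  # 1
```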

API Reference

Classes

WebTableToJSON

Main class for table extraction and conversion.

Methods:

  • __init__(headers=None, timeout=30, session=None): Initialize with optional custom headers, timeout, and session
  • fetch_webpage(url): Fetch HTML content from URL
  • normalize_url(url, base_url): Convert relative URLs to absolute URLs
  • extract_table_data(table, base_url=None): Extract data from BeautifulSoup table element
  • extract_tables_from_html(html_content, base_url=None): Extract all tables from HTML
  • url_to_json(url, table_index=None): Convert tables from URL to JSON
  • html_to_json(html_content, table_index=None, base_url=None): Convert tables from HTML to JSON
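`normalize_url` likely behaves like `urllib.parse.urljoin`: relative URLs are resolved against the page they came from, while absolute URLs pass through unchanged. A sketch of that expected behavior (an illustration of relative-URL resolution, not the library's exact code):

```python
from urllib.parse import urljoin

def normalize(url, base_url):
    # Resolve a possibly-relative URL against the page it came from;
    # absolute URLs are returned unchanged.
    return urljoin(base_url, url)

print(normalize("logo.png", "https://example.com/page/"))
# https://example.com/page/logo.png
print(normalize("https://cdn.example.com/a.png", "https://example.com/"))
# https://cdn.example.com/a.png
```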

Functions

convert_url_to_json(url, table_index=None, headers=None)

Convert tables from a URL to JSON format.

convert_html_to_json(html_content, table_index=None, base_url=None)

Convert tables from HTML content to JSON format.

save_tables_to_file(tables, filename, indent=2)

Save table data to a JSON file.
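Conceptually this is a thin wrapper around `json.dump`; a sketch of the equivalent logic (`save_tables` is a stand-in name, not the library's internals):

```python
import json

def save_tables(tables, filename, indent=2):
    # Write the list of table dicts as pretty-printed JSON.
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(tables, f, indent=indent, ensure_ascii=False)

save_tables([{"row_count": 1, "column_count": 1, "data": [{"A": "x"}]}],
            "tables.json")
```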

filter_tables_by_size(tables, min_rows=1, min_cols=1)

Filter tables by minimum size requirements.
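The filter presumably checks each table's documented `row_count` and `column_count` keys against both minimums; a sketch of the equivalent logic (`filter_by_size` is a stand-in name):

```python
def filter_by_size(tables, min_rows=1, min_cols=1):
    # Keep only tables meeting both minimums, using the documented
    # row_count / column_count keys of each table dict.
    return [t for t in tables
            if t["row_count"] >= min_rows and t["column_count"] >= min_cols]

tables = [{"row_count": 2, "column_count": 1},
          {"row_count": 6, "column_count": 3}]
print(len(filter_by_size(tables, min_rows=5, min_cols=3)))  # 1
```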

get_main_table(url)

Get the main data table from a URL (usually the largest table).

get_clean_ranking_data(url)

Specialized function for ranking websites like NIRF.

Output Format

Each table is returned as a dictionary with the following structure:

{
    "table_index": 0,
    "row_count": 10,
    "column_count": 3,
    "caption": "Optional table caption",
    "id": "table-id",
    "class": "table-class",
    "source_url": "https://example.com",
    "data": [
        {
            "Column 1": "Simple text value",
            "Column 2": {
                "text": "Link Text",
                "link": "https://example.com/page"
            },
            "Column 3": {
                "text": "Image description",
                "image": "https://example.com/image.jpg",
                "image_alt": "Alt text"
            }
        }
    ]
}
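Because cell values can be plain strings or dicts (for links and images), consumers usually need a small normalizer to get uniform text. A sketch against the structure above (`cell_text` is a hypothetical helper, not part of the library):

```python
def cell_text(cell):
    """Flatten a cell value from the output structure to plain text."""
    if isinstance(cell, dict):
        return cell.get("text", "")
    return cell

row = {
    "Column 1": "Simple text value",
    "Column 2": {"text": "Link Text", "link": "https://example.com/page"},
}
print([cell_text(v) for v in row.values()])
# ['Simple text value', 'Link Text']
```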

Requirements

  • Python 3.7+
  • requests >= 2.25.0
  • beautifulsoup4 >= 4.9.0

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.



Download files

Download the file for your platform.

Source Distribution

webtable2json-1.1.0.tar.gz (8.9 kB)


Built Distribution


webtable2json-1.1.0-py3-none-any.whl (8.8 kB)


File details

Details for the file webtable2json-1.1.0.tar.gz.

File metadata

  • Download URL: webtable2json-1.1.0.tar.gz
  • Size: 8.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for webtable2json-1.1.0.tar.gz
  • SHA256: 76c655e7e007b4e5f828650749f4e4808ceace38f1551fc8b45740e5c716c747
  • MD5: 2709f435eca14763aa2285114f88576c
  • BLAKE2b-256: 42eaf1869abd6c5eaa65b738831f4e74a00a9ddc777ae9289a9cb3570647535a


File details

Details for the file webtable2json-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: webtable2json-1.1.0-py3-none-any.whl
  • Size: 8.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for webtable2json-1.1.0-py3-none-any.whl
  • SHA256: 549f1379edb76857c6eaad7b14ed68e4f1639240fb66595cc28ca76976b68ed7
  • MD5: 2f6015e517b5d264dc2c849fc0d092bd
  • BLAKE2b-256: 19c36ae351025966ceb8bdfb08886a78008ccb336979816149515c25d2fb08ca

