

webtable2json

A Python library to extract HTML tables from webpages and convert them to JSON format. Perfect for web scraping, data extraction, and converting tabular web data into structured JSON.

Features

  • Extract tables from URLs or HTML content
  • Clean and normalize table data
  • Handle complex table structures (thead, tbody, colspan, etc.)
  • Preserve links and images with automatic URL normalization
  • Specialized functions for ranking websites
  • Session support for better performance
  • Built-in logging and error handling
  • Save results directly to JSON files
  • Filter tables by size requirements
  • Type hints for better development experience

Installation

pip install webtable2json

Quick Start

from webtable2json import convert_url_to_json, WebTableToJSON

# Extract all tables from a URL
tables = convert_url_to_json("https://web.archive.org/web/20251115143556/https://www.nirfindia.org/Rankings/2025/ManagementRanking.html")

# Extract a specific table (0-based index)
table = convert_url_to_json("https://web.archive.org/web/20251115143556/https://www.nirfindia.org/Rankings/2025/ManagementRanking.html", table_index=0)

# Use the class for more control
converter = WebTableToJSON()
result = converter.url_to_json("https://web.archive.org/web/20251115143556/https://www.nirfindia.org/Rankings/2025/ManagementRanking.html")

Usage Examples

Basic Table Extraction

from webtable2json import convert_url_to_json, save_tables_to_file

# Get all tables from a webpage
tables = convert_url_to_json("https://www.w3schools.com/html/html_tables.asp")

# Save to file
save_tables_to_file(tables, "extracted_tables.json")

for i, table in enumerate(tables):
    print(f"Table {i}: {table['row_count']} rows, {table['column_count']} columns")
    print(f"First row: {table['data'][0]}")

Working with Custom Headers

from webtable2json import convert_url_to_json

# Custom headers for authentication or specific requirements
headers = {
    'Authorization': 'Bearer your-token',
    'User-Agent': 'My Custom Bot 1.0'
}

tables = convert_url_to_json("https://example.com", headers=headers)

Working with HTML Content

from webtable2json import convert_html_to_json

html = """
<table>
    <tr><th>Name</th><th>Website</th><th>Logo</th></tr>
    <tr>
        <td>Example Corp</td>
        <td><a href="https://example.com">Visit Site</a></td>
        <td><img src="logo.png" alt="Company Logo"></td>
    </tr>
</table>
"""

tables = convert_html_to_json(html, base_url="https://example.com")
print(tables[0]['data'])
# Output includes normalized URLs and image data

Filtering and Utility Functions

from webtable2json import convert_url_to_json, filter_tables_by_size, save_tables_to_file

# Get all tables
all_tables = convert_url_to_json("https://example.com")

# Filter tables with at least 5 rows and 3 columns
large_tables = filter_tables_by_size(all_tables, min_rows=5, min_cols=3)

# Save filtered results
save_tables_to_file(large_tables, "large_tables.json")

Advanced Usage with Sessions

from webtable2json import WebTableToJSON
import requests

# Custom headers and session for better performance
session = requests.Session()
headers = {
    'User-Agent': 'My Custom Bot 1.0',
    'Accept': 'text/html,application/xhtml+xml'
}

converter = WebTableToJSON(headers=headers, session=session, timeout=60)
result = converter.url_to_json("https://example.com")

Specialized Functions

from webtable2json import get_main_table, get_clean_ranking_data

# Get the largest table (usually the main data table)
main_table = get_main_table("https://example.com/data-page")

# Specialized function for ranking websites
ranking_data = get_clean_ranking_data("https://www.nirfindia.org/Rankings/2025/ManagementRanking.html")
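`get_main_table` presumably selects the table with the most cells. A minimal sketch of that selection logic over already-extracted table dicts (the helper name `pick_main_table` and the `row_count * column_count` tie-breaking are illustrative assumptions, not the library's actual implementation):

```python
def pick_main_table(tables):
    """Return the table dict with the largest row_count * column_count.

    `tables` is a list of dicts in the output format documented below;
    returns None for an empty list.
    """
    if not tables:
        return None
    return max(tables, key=lambda t: t["row_count"] * t["column_count"])

tables = [
    {"table_index": 0, "row_count": 2, "column_count": 2, "data": []},
    {"table_index": 1, "row_count": 10, "column_count": 4, "data": []},
]
print(pick_main_table(tables)["table_index"])  # 1
```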

API Reference

Classes

WebTableToJSON

Main class for table extraction and conversion.

Methods:

  • __init__(headers=None, timeout=30, session=None): Initialize with optional custom headers, timeout, and session
  • fetch_webpage(url): Fetch HTML content from URL
  • normalize_url(url, base_url): Convert relative URLs to absolute URLs
  • extract_table_data(table, base_url=None): Extract data from BeautifulSoup table element
  • extract_tables_from_html(html_content, base_url=None): Extract all tables from HTML
  • url_to_json(url, table_index=None): Convert tables from URL to JSON
  • html_to_json(html_content, table_index=None, base_url=None): Convert tables from HTML to JSON
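`normalize_url` likely behaves like `urllib.parse.urljoin`: relative URLs are resolved against the page they came from, while absolute URLs pass through unchanged. A sketch of that expected behavior (an illustration of relative-URL resolution, not the library's exact code):

```python
from urllib.parse import urljoin

def normalize(url, base_url):
    # Resolve a possibly-relative URL against the page it came from;
    # absolute URLs are returned unchanged.
    return urljoin(base_url, url)

print(normalize("logo.png", "https://example.com/page/"))
# https://example.com/page/logo.png
print(normalize("https://cdn.example.com/a.png", "https://example.com/"))
# https://cdn.example.com/a.png
```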

Functions

convert_url_to_json(url, table_index=None, headers=None)

Convert tables from a URL to JSON format.

convert_html_to_json(html_content, table_index=None, base_url=None)

Convert tables from HTML content to JSON format.

save_tables_to_file(tables, filename, indent=2)

Save table data to a JSON file.
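Conceptually this is a thin wrapper around `json.dump`; a sketch of the equivalent logic (`save_tables` is a stand-in name, not the library's internals):

```python
import json

def save_tables(tables, filename, indent=2):
    # Write the list of table dicts as pretty-printed JSON.
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(tables, f, indent=indent, ensure_ascii=False)

save_tables([{"row_count": 1, "column_count": 1, "data": [{"A": "x"}]}],
            "tables.json")
```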

filter_tables_by_size(tables, min_rows=1, min_cols=1)

Filter tables by minimum size requirements.
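The filter presumably checks each table's documented `row_count` and `column_count` keys against both minimums; a sketch of the equivalent logic (`filter_by_size` is a stand-in name):

```python
def filter_by_size(tables, min_rows=1, min_cols=1):
    # Keep only tables meeting both minimums, using the documented
    # row_count / column_count keys of each table dict.
    return [t for t in tables
            if t["row_count"] >= min_rows and t["column_count"] >= min_cols]

tables = [{"row_count": 2, "column_count": 1},
          {"row_count": 6, "column_count": 3}]
print(len(filter_by_size(tables, min_rows=5, min_cols=3)))  # 1
```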

get_main_table(url)

Get the main data table from a URL (usually the largest table).

get_clean_ranking_data(url)

Specialized function for ranking websites like NIRF.

Output Format

Each table is returned as a dictionary with the following structure:

{
    "table_index": 0,
    "row_count": 10,
    "column_count": 3,
    "caption": "Optional table caption",
    "id": "table-id",
    "class": "table-class",
    "source_url": "https://example.com",
    "data": [
        {
            "Column 1": "Simple text value",
            "Column 2": {
                "text": "Link Text",
                "link": "https://example.com/page"
            },
            "Column 3": {
                "text": "Image description",
                "image": "https://example.com/image.jpg",
                "image_alt": "Alt text"
            }
        }
    ]
}
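Because cell values can be plain strings or dicts (for links and images), consumers usually need a small normalizer to get uniform text. A sketch against the structure above (`cell_text` is a hypothetical helper, not part of the library):

```python
def cell_text(cell):
    """Flatten a cell value from the output structure to plain text."""
    if isinstance(cell, dict):
        return cell.get("text", "")
    return cell

row = {
    "Column 1": "Simple text value",
    "Column 2": {"text": "Link Text", "link": "https://example.com/page"},
}
print([cell_text(v) for v in row.values()])
# ['Simple text value', 'Link Text']
```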

Requirements

  • Python 3.7+
  • requests >= 2.25.0
  • beautifulsoup4 >= 4.9.0

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.



Download files

Download the file for your platform.

Source Distribution

webtable2json-1.1.0.tar.gz (8.9 kB)


Built Distribution


webtable2json-1.1.0-py3-none-any.whl (8.8 kB)


File details

Details for the file webtable2json-1.1.0.tar.gz.

File metadata

  • Download URL: webtable2json-1.1.0.tar.gz
  • Size: 8.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for webtable2json-1.1.0.tar.gz
  • SHA256: 76c655e7e007b4e5f828650749f4e4808ceace38f1551fc8b45740e5c716c747
  • MD5: 2709f435eca14763aa2285114f88576c
  • BLAKE2b-256: 42eaf1869abd6c5eaa65b738831f4e74a00a9ddc777ae9289a9cb3570647535a


File details

Details for the file webtable2json-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: webtable2json-1.1.0-py3-none-any.whl
  • Size: 8.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for webtable2json-1.1.0-py3-none-any.whl
  • SHA256: 549f1379edb76857c6eaad7b14ed68e4f1639240fb66595cc28ca76976b68ed7
  • MD5: 2f6015e517b5d264dc2c849fc0d092bd
  • BLAKE2b-256: 19c36ae351025966ceb8bdfb08886a78008ccb336979816149515c25d2fb08ca

