Extract HTML tables from webpages and convert them to JSON format
webtable2json
A Python library to extract HTML tables from webpages and convert them to JSON format. Perfect for web scraping, data extraction, and converting tabular web data into structured JSON.
Features
- Extract tables from URLs or HTML content
- Clean and normalize table data
- Handle complex table structures (thead, tbody, colspan, etc.)
- Preserve links and images with automatic URL normalization
- Specialized functions for ranking websites
- Session support for better performance
- Built-in logging
- Save results directly to JSON files
- Filter tables by size requirements
- Type hints for better development experience
- Comprehensive error handling
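To make the colspan handling mentioned above concrete, here is a minimal sketch using only the standard library's `html.parser`. It is not the library's implementation; `SimpleTableParser` and `table_to_records` are hypothetical names, and the approach (repeating a cell's text across every column it spans) is just one plausible way to keep rows aligned with the header.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Collect table rows as lists of strings, repeating cells that span columns."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None outside a cell
        self._span = 1     # colspan of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            span = dict(attrs).get("colspan") or "1"
            # Fall back to 1 for missing, zero, or non-numeric colspan values.
            self._span = max(1, int(span)) if span.isdigit() else 1

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace, then repeat the text once per spanned column.
            text = " ".join("".join(self._cell).split())
            self._row.extend([text] * self._span)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html):
    """Treat the first row as the header and return the remaining rows as dicts."""
    parser = SimpleTableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]
```

This sketch ignores rowspan and nested tables, both of which a real extractor would need to handle.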
Installation
pip install webtable2json
Quick Start
from webtable2json import convert_url_to_json, WebTableToJSON
# Extract all tables from a URL
tables = convert_url_to_json("https://web.archive.org/web/20251115143556/https://www.nirfindia.org/Rankings/2025/ManagementRanking.html")
# Extract a specific table (0-based index)
table = convert_url_to_json("https://web.archive.org/web/20251115143556/https://www.nirfindia.org/Rankings/2025/ManagementRanking.html", table_index=0)
# Use the class for more control
converter = WebTableToJSON()
result = converter.url_to_json("https://web.archive.org/web/20251115143556/https://www.nirfindia.org/Rankings/2025/ManagementRanking.html")
Usage Examples
Basic Table Extraction
from webtable2json import convert_url_to_json, save_tables_to_file
# Get all tables from a webpage
tables = convert_url_to_json("https://www.w3schools.com/html/html_tables.asp")
# Save to file
save_tables_to_file(tables, "extracted_tables.json")
# Print a short summary of each extracted table
for i, table in enumerate(tables):
    print(f"Table {i}: {table['row_count']} rows, {table['column_count']} columns")
    print(f"First row: {table['data'][0]}")
Working with Custom Headers
from webtable2json import convert_url_to_json
# Custom headers for authentication or specific requirements
headers = {
    'Authorization': 'Bearer your-token',
    'User-Agent': 'My Custom Bot 1.0'
}
tables = convert_url_to_json("https://example.com", headers=headers)
Working with HTML Content
from webtable2json import convert_html_to_json
html = """
<table>
<tr><th>Name</th><th>Website</th><th>Logo</th></tr>
<tr>
<td>Example Corp</td>
<td><a href="https://example.com">Visit Site</a></td>
<td><img src="logo.png" alt="Company Logo"></td>
</tr>
</table>
"""
tables = convert_html_to_json(html, base_url="https://example.com")
print(tables[0]['data'])
# Output includes normalized URLs and image data
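As an illustration of what URL normalization against `base_url` typically means, the sketch below resolves relative links with the standard library's `urllib.parse.urljoin`. Whether the library does exactly this internally is an assumption; `normalize_url_sketch` is a hypothetical helper, not part of the package.

```python
from urllib.parse import urljoin


def normalize_url_sketch(url, base_url=None):
    """Resolve a possibly relative URL against base_url (hypothetical helper)."""
    if not url or not base_url:
        # Nothing to resolve against, so return the input unchanged.
        return url
    return urljoin(base_url, url)


# Relative image path from the HTML example above
print(normalize_url_sketch("logo.png", "https://example.com"))
# Absolute URLs pass through unchanged
print(normalize_url_sketch("https://example.com/about", "https://example.com"))
```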
Filtering and Utility Functions
from webtable2json import convert_url_to_json, filter_tables_by_size, save_tables_to_file
# Get all tables
all_tables = convert_url_to_json("https://example.com")
# Filter tables with at least 5 rows and 3 columns
large_tables = filter_tables_by_size(all_tables, min_rows=5, min_cols=3)
# Save filtered results
save_tables_to_file(large_tables, "large_tables.json")
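For readers who want to apply a similar filter to tables they have already loaded from a JSON file, the following sketch shows the idea on plain dictionaries. It assumes each table carries the `row_count` and `column_count` keys described under Output Format; `filter_by_size_sketch` is a hypothetical stand-in and may differ from how `filter_tables_by_size` is implemented.

```python
def filter_by_size_sketch(tables, min_rows=1, min_cols=1):
    """Keep tables whose recorded dimensions meet both minimums (hypothetical helper)."""
    return [
        table
        for table in tables
        if table.get("row_count", 0) >= min_rows
        and table.get("column_count", 0) >= min_cols
    ]


# Made-up sample data for illustration only
sample = [
    {"table_index": 0, "row_count": 2, "column_count": 2, "data": []},
    {"table_index": 1, "row_count": 12, "column_count": 4, "data": []},
]
large = filter_by_size_sketch(sample, min_rows=5, min_cols=3)
print([table["table_index"] for table in large])
```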
Advanced Usage with Custom Headers and a Session
from webtable2json import WebTableToJSON
import requests
# Custom headers and session for better performance
session = requests.Session()
headers = {
    'User-Agent': 'My Custom Bot 1.0',
    'Accept': 'text/html,application/xhtml+xml'
}
converter = WebTableToJSON(headers=headers, session=session, timeout=60)
result = converter.url_to_json("https://example.com")
Specialized Functions
from webtable2json import get_main_table, get_clean_ranking_data
# Get the largest table (usually the main data table)
main_table = get_main_table("https://example.com/data-page")
# Specialized function for ranking websites
ranking_data = get_clean_ranking_data("https://www.nirfindia.org/Rankings/2025/ManagementRanking.html")
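The "largest table" heuristic can be reproduced on an already extracted list of tables. The sketch below measures size as rows times columns, which is an assumption; the library may rank tables differently. `pick_largest_table` is a hypothetical helper.

```python
def pick_largest_table(tables):
    """Return the table with the most cells, or None for an empty list (hypothetical helper)."""
    if not tables:
        return None
    return max(
        tables,
        key=lambda table: table.get("row_count", 0) * table.get("column_count", 0),
    )


# Made-up sample data for illustration only
candidates = [
    {"table_index": 0, "row_count": 3, "column_count": 2},
    {"table_index": 1, "row_count": 50, "column_count": 6},
    {"table_index": 2, "row_count": 10, "column_count": 1},
]
print(pick_largest_table(candidates)["table_index"])
```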
API Reference
Classes
WebTableToJSON
Main class for table extraction and conversion.
Methods:
- __init__(headers=None, timeout=30, session=None): Initialize with optional custom headers, timeout, and session
- fetch_webpage(url): Fetch HTML content from URL
- normalize_url(url, base_url): Convert relative URLs to absolute URLs
- extract_table_data(table, base_url=None): Extract data from BeautifulSoup table element
- extract_tables_from_html(html_content, base_url=None): Extract all tables from HTML
- url_to_json(url, table_index=None): Convert tables from URL to JSON
- html_to_json(html_content, table_index=None, base_url=None): Convert tables from HTML to JSON
Functions
convert_url_to_json(url, table_index=None, headers=None)
Convert tables from a URL to JSON format.
convert_html_to_json(html_content, table_index=None, base_url=None)
Convert tables from HTML content to JSON format.
save_tables_to_file(tables, filename, indent=2)
Save table data to a JSON file.
filter_tables_by_size(tables, min_rows=1, min_cols=1)
Filter tables by minimum size requirements.
get_main_table(url)
Get the main data table from a URL (usually the largest table).
get_clean_ranking_data(url)
Specialized function for ranking websites like NIRF.
Output Format
Each table is returned as a dictionary with the following structure:
{
  "table_index": 0,
  "row_count": 10,
  "column_count": 3,
  "caption": "Optional table caption",
  "id": "table-id",
  "class": "table-class",
  "source_url": "https://example.com",
  "data": [
    {
      "Column 1": "Simple text value",
      "Column 2": {
        "text": "Link Text",
        "link": "https://example.com/page"
      },
      "Column 3": {
        "text": "Image description",
        "image": "https://example.com/image.jpg",
        "image_alt": "Alt text"
      }
    }
  ]
}
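Because a cell can be either a plain string or a dictionary with a `text` key, downstream code usually needs to handle both shapes. The sketch below flattens a table in this format to CSV text using only the standard library. `cell_to_text` and `table_to_csv` are hypothetical helpers, and the choice to keep only the `text` field (discarding `link` and `image`) is an assumption made for brevity.

```python
import csv
import io


def cell_to_text(cell):
    """Return the display text of a cell, whether it is a string or a dict."""
    if isinstance(cell, dict):
        return str(cell.get("text", ""))
    return "" if cell is None else str(cell)


def table_to_csv(table):
    """Render the table's rows as CSV text, using the first row's keys as the header."""
    rows = table.get("data", [])
    if not rows:
        return ""
    fieldnames = list(rows[0].keys())
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Missing columns become empty strings; columns not in the header are skipped.
        writer.writerow({name: cell_to_text(row.get(name)) for name in fieldnames})
    return buffer.getvalue()


# A table shaped like the structure documented above
example_table = {
    "table_index": 0,
    "data": [
        {
            "Column 1": "Simple text value",
            "Column 2": {"text": "Link Text", "link": "https://example.com/page"},
            "Column 3": {
                "text": "Image description",
                "image": "https://example.com/image.jpg",
                "image_alt": "Alt text",
            },
        }
    ],
}
print(table_to_csv(example_table))
```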
Requirements
- Python 3.7+
- requests >= 2.25.0
- beautifulsoup4 >= 4.9.0
License
MIT License - see LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
File details
Details for the file webtable2json-1.1.0.tar.gz.
File metadata
- Download URL: webtable2json-1.1.0.tar.gz
- Upload date:
- Size: 8.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 76c655e7e007b4e5f828650749f4e4808ceace38f1551fc8b45740e5c716c747 |
| MD5 | 2709f435eca14763aa2285114f88576c |
| BLAKE2b-256 | 42eaf1869abd6c5eaa65b738831f4e74a00a9ddc777ae9289a9cb3570647535a |
File details
Details for the file webtable2json-1.1.0-py3-none-any.whl.
File metadata
- Download URL: webtable2json-1.1.0-py3-none-any.whl
- Upload date:
- Size: 8.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 549f1379edb76857c6eaad7b14ed68e4f1639240fb66595cc28ca76976b68ed7 |
| MD5 | 2f6015e517b5d264dc2c849fc0d092bd |
| BLAKE2b-256 | 19c36ae351025966ceb8bdfb08886a78008ccb336979816149515c25d2fb08ca |