Web Structured Data Extraction Agent
📖 What is web2json-agent?
An AI-powered web scraping agent that automatically generates production-ready parser code from HTML samples, with no manual XPath or CSS selector writing required.
📋 Demo
https://github.com/user-attachments/assets/c82e8e13-fc42-4d1f-a81a-4cec6e3f434b
📊 SWDE Benchmark Results
The SWDE dataset covers 8 vertical fields, 80 websites, and 124,291 pages.
| Method | Precision | Recall | F1 Score |
|---|---|---|---|
| COT | 87.75 | 79.90 | 76.95 |
| Reflexion | 93.28 | 82.76 | 82.40 |
| AUTOSCRAPER | 92.49 | 89.13 | 88.69 |
| Web2JSON-Agent | 91.50 | 90.46 | 89.93 |
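Each F1 value in the table is lower than the harmonic mean of the precision and recall in the same row. One possible explanation, offered here as an assumption rather than something this README states, is that F1 is computed per website and then averaged. The sketch below uses made-up per-site numbers (not SWDE results) to show why the two quantities differ.

```python
from statistics import mean


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative per-site (precision, recall) pairs; NOT taken from SWDE.
per_site = [(0.95, 0.60), (0.70, 0.98), (0.90, 0.90)]

avg_precision = mean(p for p, _ in per_site)
avg_recall = mean(r for _, r in per_site)

macro_f1 = mean(f1(p, r) for p, r in per_site)  # average of per-site F1
f1_of_averages = f1(avg_precision, avg_recall)  # F1 of averaged P and R

print(f"macro F1:       {macro_f1:.4f}")
print(f"F1 of averages: {f1_of_averages:.4f}")
```

Averaging per-site F1 scores never exceeds the F1 of the averaged precision and recall, so a gap like the one in the table is expected under that assumption.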
🚀 Quick Start
Install via pip
# 1. Install package
pip install web2json-agent
# 2. Initialize configuration
web2json setup
Install for Developers
# 1. Clone the repository
git clone https://github.com/ccprocessor/web2json-agent
cd web2json-agent
# 2. Install in editable mode
pip install -e .
# 3. Initialize configuration
web2json setup
📚 Complete User Guide
For a comprehensive tutorial covering installation, configuration, and all usage scenarios, see:
📖 Web2JSON-Agent Complete User Guide (in Chinese)
This guide includes:
- Detailed installation steps
- Configuration methods (interactive wizard, config file, environment variables)
- Layout clustering for mixed HTML types
- Complete API examples and use cases
- FAQ and troubleshooting
🐍 API Usage
Web2JSON provides five simple APIs. Each returns its results in memory, which suits pipelines that write to databases, serve APIs, or process data in real time.
API 1: extract_data - Complete Workflow
Extract structured data from HTML in one step (schema + parser + data).
Auto Mode - Let AI automatically discover and extract fields:
from web2json import Web2JsonConfig, extract_data
config = Web2JsonConfig(
    name="my_project",
    html_path="html_samples/",
    # save=['schema', 'code', 'data'],  # Save to local disk
    # output_path="./results",  # Custom output directory (default: "output")
)
result = extract_data(config)
# Results are always returned in memory
print(result.final_schema)  # Dict: extracted schema
print(result.parser_code)  # str: generated parser code
print(result.parsed_data[0])  # Dict: first record of parsed_data (List[Dict])
Predefined Mode - Extract only specific fields:
from web2json import Web2JsonConfig, extract_data
config = Web2JsonConfig(
    name="articles",
    html_path="html_samples/",
    schema={
        "title": "string",
        "author": "string",
        "date": "string",
        "content": "string"
    },
    # save=['schema', 'code', 'data'],  # Save to local disk
    # output_path="./results",  # Custom output directory
)
result = extract_data(config)
# Returns: ExtractDataResult with schema, code, and data in memory
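With a predefined schema, it can help to confirm that every parsed record actually carries the requested fields. The following is a minimal sketch in plain Python. `missing_fields` is a hypothetical helper, not part of web2json, and it assumes each parsed record is a flat dict keyed by the schema's field names, which this README does not guarantee.

```python
from typing import Any, Dict, List


def missing_fields(schema: Dict[str, str], record: Dict[str, Any]) -> List[str]:
    """Return schema fields that are absent or empty in a parsed record."""
    return [
        field for field in schema
        if record.get(field) in (None, "", [], {})
    ]


# Schema in the same shape as the Predefined Mode example.
schema = {"title": "string", "author": "string", "date": "string", "content": "string"}
# Stand-in record; real records come from result.parsed_data.
record = {"title": "Hello", "author": "", "content": "Body text"}

print(missing_fields(schema, record))  # ['author', 'date']
```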
API 2: extract_schema - Extract Schema Only
Generate a JSON schema describing the data structure in HTML.
from web2json import Web2JsonConfig, extract_schema
config = Web2JsonConfig(
    name="schema_only",
    html_path="html_samples/",
    # save=['schema'],  # Save schema to disk
    # output_path="./schemas",  # Custom output directory
)
result = extract_schema(config)
print(result.final_schema)  # Dict: final schema
print(result.intermediate_schemas)  # List[Dict]: iteration history
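The iteration history in `intermediate_schemas` can be inspected to see how the field set changed between rounds. This is a minimal sketch, assuming each entry is a flat dict keyed by field name (the README does not document the exact shape). `schema_changes` is a hypothetical helper, and the sample history stands in for real output.

```python
from typing import Dict, List, Tuple


def schema_changes(history: List[Dict]) -> List[Tuple[List[str], List[str]]]:
    """For each consecutive pair of schemas, list (added, removed) field names."""
    changes = []
    for previous, current in zip(history, history[1:]):
        added = sorted(set(current) - set(previous))
        removed = sorted(set(previous) - set(current))
        changes.append((added, removed))
    return changes


# Stand-in for result.intermediate_schemas
history = [
    {"title": "string"},
    {"title": "string", "author": "string"},
    {"title": "string", "author": "string", "date": "string"},
]

for round_no, (added, removed) in enumerate(schema_changes(history), start=2):
    print(f"round {round_no}: added={added} removed={removed}")
```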
API 3: infer_code - Generate Parser Code
Generate parser code from a schema, either a manually defined Dict or the output of extract_schema.
from web2json import Web2JsonConfig, infer_code
# Use schema from previous step or define manually
my_schema = {
    "title": "string",
    "author": "string",
    "content": "string"
}
config = Web2JsonConfig(
    name="my_parser",
    html_path="html_samples/",
    schema=my_schema,
    # save=['code'],  # Save parser code and schema to disk
    # output_path="./parsers",  # Custom output directory
)
result = infer_code(config)
print(result.parser_code)  # str: BeautifulSoup parser code
print(result.schema)  # Dict: schema used
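Because `parser_code` is returned as a string, it can be checked for syntax errors before being written to disk. The sketch below uses only the standard library. `save_parser_code` is a hypothetical helper, and the sample string stands in for real generated code, whose actual interface is not documented here.

```python
import ast
import tempfile
from pathlib import Path


def save_parser_code(code: str, path: Path) -> Path:
    """Write code to path after confirming it parses as valid Python."""
    ast.parse(code)  # raises SyntaxError on invalid source, before any write
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(code, encoding="utf-8")
    return path


# Stand-in for result.parser_code
sample_code = "def parse(html):\n    return {'title': None}\n"

with tempfile.TemporaryDirectory() as tmp:
    saved = save_parser_code(sample_code, Path(tmp) / "parsers" / "final_parser.py")
    print(f"saved {saved.name} ({saved.stat().st_size} bytes)")
```

A syntax check only catches malformed source. It says nothing about whether the parser extracts the right fields, so running it on a few sample pages is still worthwhile.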
API 4: extract_data_with_code - Parse with Code
Use parser code to extract data from HTML files.
from web2json import Web2JsonConfig, extract_data_with_code
config = Web2JsonConfig(
    name="parse_demo",
    html_path="new_html_files/",
    parser_code="output/blog/parsers/final_parser.py",  # Path to parser .py file
    save=['data'],  # Save parsed data to disk
    output_path="./parse_results",  # Custom output directory
)
result = extract_data_with_code(config)
print(f"Success: {result.success_count}, Failed: {result.failed_count}")
for item in result.parsed_data:
    print(f"File: {item['filename']}")
    print(f"Data: {item['data']}")
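For downstream processing, the parsed items can be written as JSON Lines. This is a sketch using the standard library only. `write_jsonl` is a hypothetical helper, and it assumes each item is a JSON-serializable dict with the `filename` and `data` keys used in the loop above.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def write_jsonl(items: Iterable[Dict], path: Path) -> int:
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for item in items:
            handle.write(json.dumps(item, ensure_ascii=False) + "\n")
            count += 1
    return count


# Stand-in for result.parsed_data
items = [
    {"filename": "a.html", "data": {"title": "First"}},
    {"filename": "b.html", "data": {"title": "Second"}},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "parsed.jsonl"
    written = write_jsonl(items, out_path)
    print(f"wrote {written} records")
```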
API 5: classify_html_dir - Classify HTML by Layout
Group HTML files by layout similarity (for mixed-layout datasets).
from web2json import Web2JsonConfig, classify_html_dir
config = Web2JsonConfig(
    name="classify_demo",
    html_path="mixed_html/",
    # save=['report', 'files'],  # Save cluster report and copy files to subdirectories
    # output_path="./cluster_analysis",  # Custom output directory
)
result = classify_html_dir(config)
print(f"Found {result.cluster_count} layout types")
print(f"Noise files: {len(result.noise_files)}")
for cluster_name, files in result.clusters.items():
    print(f"{cluster_name}: {len(files)} files")
    for file in files[:3]:
        print(f"  - {file}")
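A common next step is to handle the largest layout groups first. This is a minimal sketch, assuming `result.clusters` maps cluster names to lists of file paths, as the loop above suggests. `largest_clusters` is a hypothetical helper, and the sample dict stands in for real output.

```python
from typing import Dict, List, Tuple


def largest_clusters(clusters: Dict[str, List[str]], top: int = 3) -> List[Tuple[str, int]]:
    """Return (cluster_name, file_count) pairs, largest cluster first."""
    ranked = sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(name, len(files)) for name, files in ranked[:top]]


# Stand-in for result.clusters
clusters = {
    "cluster_0": ["a.html", "b.html"],
    "cluster_1": ["c.html", "d.html", "e.html"],
    "cluster_2": ["f.html"],
}

print(largest_clusters(clusters, top=2))  # [('cluster_1', 3), ('cluster_0', 2)]
```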
Configuration Reference
Web2JsonConfig Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | Required | Project name (for identification) |
| html_path | str | Required | HTML directory or file path |
| output_path | str | "output" | Output directory (used when save is specified) |
| iteration_rounds | int | 3 | Number of samples for learning |
| schema | Dict | None | Predefined schema (None = auto mode) |
| enable_schema_edit | bool | False | Enable manual schema editing |
| parser_code | str | None | Parser code (for extract_data_with_code) |
| save | List[str] | None | Items to save locally (e.g., ['schema', 'code', 'data']). None = memory only |
Standalone API Parameters:
| API | Parameters | Returns |
|---|---|---|
| extract_data | config: Web2JsonConfig | ExtractDataResult |
| extract_schema | config: Web2JsonConfig | ExtractSchemaResult |
| infer_code | config: Web2JsonConfig | InferCodeResult |
| extract_data_with_code | config: Web2JsonConfig | ParseResult |
| classify_html_dir | config: Web2JsonConfig | ClusterResult |
All result objects provide:
- Direct access to data via object attributes
- .to_dict() method for serialization
- .get_summary() method for quick stats
Which API Should I Use?
from web2json import (
    Web2JsonConfig,
    classify_html_dir,
    extract_data,
    extract_data_with_code,
    extract_schema,
    infer_code,
)
# Need data immediately? → extract_data
config = Web2JsonConfig(name="my_run", html_path="html_samples/")
result = extract_data(config)
print(result.parsed_data)
# Want to review/edit schema first? → extract_schema + infer_code
config = Web2JsonConfig(name="schema_run", html_path="html_samples/")
schema_result = extract_schema(config)
# Edit schema if needed, then generate code
config = Web2JsonConfig(
    name="code_run",
    html_path="html_samples/",
    schema=schema_result.final_schema
)
code_result = infer_code(config)
# Parse with the generated code
config = Web2JsonConfig(
    name="parse_run",
    html_path="new_html_files/",
    parser_code=code_result.parser_code
)
data_result = extract_data_with_code(config)
# Have parser code, need to parse more files? → extract_data_with_code
# my_parser_code is parser code obtained earlier, e.g. code_result.parser_code
config = Web2JsonConfig(
    name="parse_more",
    html_path="more_files/",
    parser_code=my_parser_code
)
result = extract_data_with_code(config)
# Mixed layouts (list + detail pages)? → classify_html_dir
config = Web2JsonConfig(name="classify", html_path="mixed_html/")
result = classify_html_dir(config)
📄 License
Apache-2.0 License
Made with ❤️ by the web2json-agent team