Web Structured Data Extraction Agent

🌐 web2json-agent

Stop Coding Scrapers, Start Getting Data — from Hours to Seconds


📖 What is web2json-agent?

An AI-powered web scraping agent that automatically generates production-ready parser code from HTML samples — no manual XPath/CSS selector writing required.


📋 Demo

https://github.com/user-attachments/assets/c82e8e13-fc42-4d1f-a81a-4cec6e3f434b


📊 SWDE Benchmark Results

The SWDE dataset covers 8 verticals, 80 websites, and 124,291 pages. All scores below are percentages.

| Method | Precision | Recall | F1 Score |
|---|---|---|---|
| COT | 87.75 | 79.90 | 76.95 |
| Reflexion | 93.28 | 82.76 | 82.40 |
| AUTOSCRAPER | 92.49 | 89.13 | 88.69 |
| Web2JSON-Agent | 91.50 | 90.46 | 89.93 |
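
For reference, F1 is conventionally the harmonic mean of precision and recall. The sketch below applies that formula to the precision and recall columns above. The values it produces are higher than the reported F1 column, which would be consistent with scores being computed per website or per vertical and then averaged; that reading is an assumption, not something this README states.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the table above (percent).
rows = {
    "COT": (87.75, 79.90),
    "Reflexion": (93.28, 82.76),
    "AUTOSCRAPER": (92.49, 89.13),
    "Web2JSON-Agent": (91.50, 90.46),
}

# F1 recomputed from the aggregate columns, for comparison with the table.
harmonic = {name: round(f1_score(p, r), 2) for name, (p, r) in rows.items()}
for name, value in harmonic.items():
    print(f"{name}: {value}")
```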

🚀 Quick Start

Install via pip

# 1. Install package
pip install web2json-agent

# 2. Initialize configuration
web2json setup

Install for Developers

# 1. Clone the repository
git clone https://github.com/ccprocessor/web2json-agent
cd web2json-agent

# 2. Install in editable mode
pip install -e .

# 3. Initialize configuration
web2json setup

📚 Complete User Guide

For a comprehensive tutorial covering installation, configuration, and all usage scenarios, see:

📖 Web2JSON-Agent Complete User Guide (in Chinese)

This guide includes:

  • Detailed installation steps
  • Configuration methods (interactive wizard, config file, environment variables)
  • Layout clustering for mixed HTML types
  • Complete API examples and use cases
  • FAQ and troubleshooting

🐍 API Usage

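web2json-agent exposes five Python functions. Each one returns its results in memory, so the output can feed a database, a web API, or a real-time pipeline without being written to disk first.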

API 1: extract_data - Complete Workflow

Extract structured data from HTML in one step (schema + parser + data).

Auto Mode - Let AI automatically discover and extract fields:

from web2json import Web2JsonConfig, extract_data

config = Web2JsonConfig(
    name="my_project",
    html_path="html_samples/",
    # save=['schema', 'code', 'data'],  # Save to local disk
    # output_path="./results",  # Custom output directory (default: "output")
)

result = extract_data(config)

# Results are always returned in memory
print(result.final_schema)        # Dict: extracted schema
print(result.parser_code)          # str: generated parser code
print(result.parsed_data[0])       # Dict: first parsed record (parsed_data is a List[Dict])

Predefined Mode - Extract only specific fields:

from web2json import Web2JsonConfig, extract_data

config = Web2JsonConfig(
    name="articles",
    html_path="html_samples/",
    schema={
        "title": "string",
        "author": "string",
        "date": "string",
        "content": "string"
    },
    # save=['schema', 'code', 'data'],  # Save to local disk
    # output_path="./results",  # Custom output directory
)

result = extract_data(config)
# Returns: ExtractDataResult with schema, code, and data in memory
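
With a predefined schema it can be useful to check which fields the parser actually filled. The helper below is a minimal sketch, not part of web2json: it assumes each parsed record is a flat dict keyed by the schema's field names (if records are wrapped as in API 4, pass item['data'] instead). The function name missing_fields and the sample record are hypothetical.

```python
from typing import Any, Dict, List


def missing_fields(schema: Dict[str, str], record: Dict[str, Any]) -> List[str]:
    """Return schema fields that are absent or empty in a parsed record."""
    return [
        field
        for field in schema
        if record.get(field) in (None, "", [], {})
    ]


schema = {
    "title": "string",
    "author": "string",
    "date": "string",
    "content": "string",
}

# Illustrative record, not real extractor output.
record = {"title": "Hello", "author": "", "content": "Body text"}

missing = missing_fields(schema, record)
print(missing)
```

A non-empty result points at fields worth checking against the source HTML or the generated parser.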

API 2: extract_schema - Extract Schema Only

Generate a JSON schema describing the data structure in HTML.

from web2json import Web2JsonConfig, extract_schema

config = Web2JsonConfig(
    name="schema_only",
    html_path="html_samples/",
    # save=['schema'],  # Save schema to disk
    # output_path="./schemas",  # Custom output directory
)

result = extract_schema(config)

print(result.final_schema)         # Dict: final schema
print(result.intermediate_schemas) # List[Dict]: iteration history
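
To see how the schema evolved across iterations, consecutive entries of the history can be compared. This is a sketch under the assumption that each intermediate schema is a dict keyed by field name. The helper schema_changes and the sample history are illustrative and not part of the library.

```python
from typing import Dict, List, Tuple


def schema_changes(
    history: List[Dict[str, object]],
) -> List[Tuple[List[str], List[str]]]:
    """For each consecutive pair of schemas, list (added, removed) field names."""
    changes = []
    for previous, current in zip(history, history[1:]):
        added = sorted(set(current) - set(previous))
        removed = sorted(set(previous) - set(current))
        changes.append((added, removed))
    return changes


# Stand-in for result.intermediate_schemas (illustrative data only).
history = [
    {"title": "string"},
    {"title": "string", "author": "string"},
    {"title": "string", "author": "string", "date": "string"},
]

changes = schema_changes(history)
for round_number, (added, removed) in enumerate(changes, start=2):
    print(f"round {round_number}: added={added} removed={removed}")
```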

API 3: infer_code - Generate Parser Code

Generate parser code from a schema, either one you define as a Dict or the final_schema returned by extract_schema.

from web2json import Web2JsonConfig, infer_code

# Use schema from previous step or define manually
my_schema = {
    "title": "string",
    "author": "string",
    "content": "string"
}

config = Web2JsonConfig(
    name="my_parser",
    html_path="html_samples/",
    schema=my_schema,
    # save=['code'],  # Save parser code and schema to disk
    # output_path="./parsers",  # Custom output directory
)

result = infer_code(config)

print(result.parser_code)  # str: BeautifulSoup parser code
print(result.schema)       # Dict: schema used
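
Because parser_code comes back as a string, one option is to check that it is syntactically valid Python and then write it to a .py file for later runs. The sketch below uses only the standard library. The helper save_parser_code, the sample code string, and the target path are hypothetical, and ast.parse checks syntax only; it does not run the parser.

```python
import ast
import tempfile
from pathlib import Path


def save_parser_code(parser_code: str, destination: Path) -> Path:
    """Check that the code parses as Python, then write it to destination."""
    ast.parse(parser_code)  # raises SyntaxError if the code is malformed
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_text(parser_code, encoding="utf-8")
    return destination


# Stand-in for result.parser_code (illustrative, not generated output).
sample_code = "def parse(html):\n    return {'title': html.strip()}\n"

with tempfile.TemporaryDirectory() as tmp:
    saved = save_parser_code(sample_code, Path(tmp) / "parsers" / "final_parser.py")
    saved_text = saved.read_text(encoding="utf-8")
    print(f"wrote {len(saved_text)} characters to {saved.name}")
```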

API 4: extract_data_with_code - Parse with Code

Use parser code to extract data from HTML files.

from web2json import Web2JsonConfig, extract_data_with_code

config = Web2JsonConfig(
    name="parse_demo",
    html_path="new_html_files/",
    parser_code="output/blog/parsers/final_parser.py",  # Path to parser .py file
    save=['data'],  # Save parsed data to disk
    output_path="./parse_results",  # Custom output directory
)

result = extract_data_with_code(config)

print(f"Success: {result.success_count}, Failed: {result.failed_count}")
for item in result.parsed_data:
    print(f"File: {item['filename']}")
    print(f"Data: {item['data']}")
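
For downstream processing, the parsed items can be written as JSON Lines, one record per line. The sketch below assumes each item is a dict with 'filename' and 'data' keys, as in the loop above. The helper write_jsonl and the sample items are illustrative and not part of web2json.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable


def write_jsonl(items: Iterable[Dict[str, Any]], destination: Path) -> int:
    """Write one JSON object per line; return the number of lines written."""
    count = 0
    with destination.open("w", encoding="utf-8") as handle:
        for item in items:
            row = {"filename": item["filename"], "data": item["data"]}
            # default=str keeps the dump from failing on non-JSON types.
            handle.write(json.dumps(row, ensure_ascii=False, default=str) + "\n")
            count += 1
    return count


# Stand-in for result.parsed_data (illustrative data only).
parsed_data = [
    {"filename": "a.html", "data": {"title": "First"}},
    {"filename": "b.html", "data": {"title": "Second"}},
]

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "parsed.jsonl"
    written = write_jsonl(parsed_data, out_file)
    lines = out_file.read_text(encoding="utf-8").splitlines()
    print(f"wrote {written} records")
```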

API 5: classify_html_dir - Classify HTML by Layout

Group HTML files by layout similarity (for mixed-layout datasets).

from web2json import Web2JsonConfig, classify_html_dir

config = Web2JsonConfig(
    name="classify_demo",
    html_path="mixed_html/",
    # save=['report', 'files'],  # Save cluster report and copy files to subdirectories
    # output_path="./cluster_analysis",  # Custom output directory
)

result = classify_html_dir(config)

print(f"Found {result.cluster_count} layout types")
print(f"Noise files: {len(result.noise_files)}")

for cluster_name, files in result.clusters.items():
    print(f"{cluster_name}: {len(files)} files")
    for file in files[:3]:
        print(f"  - {file}")
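
After clustering, a natural next step is to decide which layouts have enough pages to learn a parser from. The sketch below keeps clusters with at least iteration_rounds files, largest first. Whether the library strictly requires that many samples per layout is an assumption drawn from the parameter's description in the configuration reference. The helper name and the sample clusters are hypothetical.

```python
from typing import Dict, List


def clusters_with_enough_samples(
    clusters: Dict[str, List[str]], iteration_rounds: int = 3
) -> Dict[str, List[str]]:
    """Keep clusters that have at least iteration_rounds files, largest first."""
    kept = {
        name: files
        for name, files in clusters.items()
        if len(files) >= iteration_rounds
    }
    return dict(sorted(kept.items(), key=lambda pair: len(pair[1]), reverse=True))


# Stand-in for result.clusters (illustrative data only).
clusters = {
    "cluster_0": ["list_1.html", "list_2.html"],
    "cluster_1": ["detail_1.html", "detail_2.html", "detail_3.html", "detail_4.html"],
    "cluster_2": ["about_1.html", "about_2.html", "about_3.html"],
}

ready = clusters_with_enough_samples(clusters)
for name, files in ready.items():
    print(f"{name}: {len(files)} files")
```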

Configuration Reference

Web2JsonConfig Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | Required | Project name (for identification) |
| html_path | str | Required | HTML directory or file path |
| output_path | str | "output" | Output directory (used when save is specified) |
| iteration_rounds | int | 3 | Number of samples for learning |
| schema | Dict | None | Predefined schema (None = auto mode) |
| enable_schema_edit | bool | False | Enable manual schema editing |
| parser_code | str | None | Parser code, or path to a parser .py file (for extract_data_with_code) |
| save | List[str] | None | Items to save locally (e.g., ['schema', 'code', 'data']). None = memory only |
Standalone API Parameters:

| API | Parameters | Returns |
|---|---|---|
| extract_data | config: Web2JsonConfig | ExtractDataResult |
| extract_schema | config: Web2JsonConfig | ExtractSchemaResult |
| infer_code | config: Web2JsonConfig | InferCodeResult |
| extract_data_with_code | config: Web2JsonConfig | ParseResult |
| classify_html_dir | config: Web2JsonConfig | ClusterResult |

All result objects provide:

  • Direct access to data via object attributes
  • .to_dict() method for serialization
  • .get_summary() method for quick stats
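
Since every result object exposes .to_dict(), a result can be turned into JSON with the standard library. The sketch below uses a hypothetical StandInResult class in place of a real result object so that it runs without the package. The exact keys that the real to_dict() returns are not documented here and may differ.

```python
import json
from typing import Any, Dict


def result_to_json(result: Any) -> str:
    """Serialize any object exposing to_dict() to a JSON string."""
    return json.dumps(result.to_dict(), ensure_ascii=False, indent=2, default=str)


class StandInResult:
    """Hypothetical stand-in with the same to_dict() method as a result object."""

    def __init__(self, final_schema: Dict[str, str]) -> None:
        self.final_schema = final_schema

    def to_dict(self) -> Dict[str, Any]:
        return {"final_schema": self.final_schema}


payload = result_to_json(StandInResult({"title": "string"}))
print(payload)
```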

Which API Should I Use?

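from web2json import (
    Web2JsonConfig,
    classify_html_dir,
    extract_data,
    extract_data_with_code,
    extract_schema,
    infer_code,
)

# Need data immediately? → extract_data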
config = Web2JsonConfig(name="my_run", html_path="html_samples/")
result = extract_data(config)
print(result.parsed_data)

# Want to review/edit schema first? → extract_schema + infer_code
config = Web2JsonConfig(name="schema_run", html_path="html_samples/")
schema_result = extract_schema(config)

# Edit schema if needed, then generate code
config = Web2JsonConfig(
    name="code_run",
    html_path="html_samples/",
    schema=schema_result.final_schema
)
code_result = infer_code(config)

# Parse with the generated code
config = Web2JsonConfig(
    name="parse_run",
    html_path="new_html_files/",
    parser_code=code_result.parser_code
)
data_result = extract_data_with_code(config)

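# Have parser code, need to parse more files? → extract_data_with_code
# my_parser_code is parser code from an earlier run (here, the code generated above)
my_parser_code = code_result.parser_code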
config = Web2JsonConfig(
    name="parse_more",
    html_path="more_files/",
    parser_code=my_parser_code
)
result = extract_data_with_code(config)

# Mixed layouts (list + detail pages)? → classify_html_dir
config = Web2JsonConfig(name="classify", html_path="mixed_html/")
result = classify_html_dir(config)

📄 License

Apache-2.0 License


Made with ❤️ by the web2json-agent team

