Intelligent web page parsing code generator: automatically generates web parsing code with AI

web2json-agent

Let AI generate web parsing code for you: say goodbye to hand-written XPath and CSS selectors, and get structured data with ease.

💡 Project Introduction

web2json-agent is an intelligent data parsing tool. It automatically analyzes web page structure, generates Python parser code, and runs that code to extract the data. According to the project, this saves about 80% of development time, cutting the work from hours to minutes.
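
To make concrete what "manual XPath" work looks like, here is a minimal hand-written parser of the kind this tool aims to generate for you. Everything in it is illustrative: the sample HTML, the field names, `FIELD_XPATHS`, and `parse_page` are hypothetical and are neither output from web2json-agent nor part of its API. The sketch uses only the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset and requires well-formed markup.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (real pages are usually messier).
SAMPLE_HTML = """
<html>
  <body>
    <h1 class="title">Example Product</h1>
    <span class="price">19.99</span>
  </body>
</html>
"""

# Hypothetical field-to-XPath mapping that a developer would normally
# write and maintain by hand for each website layout.
FIELD_XPATHS = {
    "title": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
}


def parse_page(html: str) -> dict:
    """Extract one structured record from a single page."""
    root = ET.fromstring(html.strip())
    record = {}
    for field, xpath in FIELD_XPATHS.items():
        node = root.find(xpath)
        # A missing element or empty text becomes None instead of raising.
        has_text = node is not None and node.text
        record[field] = node.text.strip() if has_text else None
    return record


if __name__ == "__main__":
    print(json.dumps(parse_page(SAMPLE_HTML), indent=2))
```

Each new site layout needs its own mapping like `FIELD_XPATHS`, and that is the repetitive step the tool is meant to automate.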

📋 Video Demo

https://github.com/user-attachments/assets/772fb610-808e-431d-93b3-d16ca0775b3f


📊 SWDE Benchmark Results

Evaluated on the SWDE dataset (8 verticals, 80 websites, 124,291 pages):

Metric            | Score
Average Precision | 91.50%
Average Recall    | 90.46%
Average F1 Score  | 89.93%
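
The average F1 score above is lower than both the average precision and the average recall. One possible explanation, offered here as an assumption because the source does not state how the averages were computed, is that each metric is computed per website and then averaged (macro-averaging). The sketch below uses made-up counts for two hypothetical sites to show that effect; `prf` and `macro_average` are illustrative helper names.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_average(counts: list[tuple[int, int, int]]) -> tuple[float, float, float]:
    """Average each metric over sites, giving every site equal weight."""
    scores = [prf(*c) for c in counts]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Made-up (tp, fp, fn) counts, not taken from the SWDE evaluation.
site_counts = [
    (100, 0, 100),  # site A: perfect precision, half recall
    (100, 100, 0),  # site B: half precision, perfect recall
]

avg_p, avg_r, avg_f1 = macro_average(site_counts)
print(f"P={avg_p:.4f} R={avg_r:.4f} F1={avg_f1:.4f}")
```

Here both sites have an F1 of about 0.667, so the averaged F1 sits below the averaged precision and recall of 0.75 each.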

🚀 Quick Start

Install via pip

# 1. Install package
pip install web2json-agent

# 2. Initialize configuration
web2json setup

# Mode 1: auto. Quick exploration when you are unsure which fields to extract
web2json -d html_samples/ -o output/result

# Mode 2: predefined. For when you know exactly which fields to extract and need precise control over the output
web2json -d html_samples/ -o output/result --interactive-schema
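
If you want to drive the CLI from a script, the two invocations above can be wrapped with `subprocess`. This is a minimal sketch, not part of the package: `build_command` and `run_web2json` are hypothetical helper names, and the only CLI details used are the `web2json` command and the `-d`, `-o`, and `--interactive-schema` flags shown above.

```python
import shutil
import subprocess


def build_command(html_dir: str, output: str, interactive_schema: bool = False) -> list[str]:
    """Build the argument list for the web2json CLI shown in the Quick Start."""
    cmd = ["web2json", "-d", html_dir, "-o", output]
    if interactive_schema:
        # Predefined mode: the CLI asks which fields to extract.
        cmd.append("--interactive-schema")
    return cmd


def run_web2json(html_dir: str, output: str, interactive_schema: bool = False) -> int:
    """Run the CLI and return its exit code; fail early if it is not installed."""
    if shutil.which("web2json") is None:
        raise RuntimeError("web2json not found on PATH; run 'pip install web2json-agent' first")
    return subprocess.run(build_command(html_dir, output, interactive_schema)).returncode


if __name__ == "__main__":
    # Only print the command here; call run_web2json(...) to actually execute it.
    print(" ".join(build_command("html_samples/", "output/result")))
```

Passing the arguments as a list avoids shell quoting problems when directory names contain spaces.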

🎨 Web UI

The project also provides a Web UI, so you can work with the tool from a browser.

Installation and Launch

# Enter frontend directory
cd web2json_ui/

# Install dependencies
npm install

# Start development server
npm run dev

# Or build production version
npm run build

📄 License

MIT License

