Webis HTML extraction tool

Webis HTML - Intelligent Web Content Extraction Tool

Webis HTML is a modern, intelligent web content extraction tool that uses AI to automatically identify and extract valuable information from web pages, filter out noise content, and produce high-quality text data for knowledge base construction, data analysis, and AI training.

✨ Features

  • 🚀 One-click extraction: Complete HTML content extraction with a single function call
  • 🔄 Batch processing: Supports directory-level batch HTML file processing
  • 🌐 URL support: Extract content directly from web URLs
  • 🤖 AI optimization: Integrated DeepSeek API for intelligent content filtering
  • ⚡ Asynchronous processing: High-performance asynchronous API calls with concurrent processing support
  • 🖥️ Multiple interfaces: Supports Python API, command line, and web interface
  • 📦 Standard package: Compliant with PyPI standards, easy to install and distribute

📦 Installation

Environment Requirements

  • Python 3.8+
  • conda is recommended for environment management

Quick Installation

Method 1: Install from PyPI (Recommended)

# Create conda environment
conda create -n webis_html python=3.10 -y
conda activate webis_html

# Install package
pip install webis-html

Method 2: Install from Source

# Clone repository
git clone https://github.com/Webis/Webis.git
cd Webis/Webis_HTML

# Create environment and install
conda create -n webis_html python=3.10 -y
conda activate webis_html
pip install -e .

Method 3: Test Version Installation

# Install latest test version from TestPyPI
pip install -i https://test.pypi.org/simple/ webis-html

Verify Installation

# Check CLI command
webis-html --help

# Check Python import
python -c "import webis_html; print('✅ Installation successful!')"

🚀 Quick Start

1. Simplest Usage

import webis_html

# Extract from HTML content
html_content = "<html><body><h1>Title</h1><p>Content</p></body></html>"
result = webis_html.extract_from_html(html_content)

# Batch process directory
result = webis_html.extract_from_directory("./html_files", "./output")

# Extract from URL
result = webis_html.extract_from_url("https://example.com")

2. Command Line Usage

# Batch process HTML files
webis-html extract --input ./html_files --output ./results

# Start web interface
webis-html gui

# Check version
webis-html version

📖 Detailed Usage Instructions

Python API

Convenience Functions (Recommended)

import webis_html

# 1. Process HTML content
html_content = """
<html>
<body>
    <h1>Important Title</h1>
    <p>Valuable content</p>
    <div class="ad">Advertisement content</div>
</body>
</html>
"""

result = webis_html.extract_from_html(
    html_content, 
    api_key="sk-your-deepseek-key",  # Optional; enables AI optimization (skipped if no key is provided)
    output_dir="./output"
)

if result['success']:
    print(f"Extraction successful! Total {len(result['results'])} text segments")
    for item in result['results']:
        print(f"File: {item['filename']}")
        print(f"Content: {item['content'][:100]}...")

# 2. Batch process directory
result = webis_html.extract_from_directory(
    input_dir="./html_files",
    output_dir="./output",
    api_key="sk-your-deepseek-key"  # Optional; AI filtering is skipped if not provided
)

# 3. Extract from URL
result = webis_html.extract_from_url(
    "https://example.com",
    api_key="sk-your-deepseek-key",  # Optional; AI filtering is skipped if not provided
    output_dir="./output"
)

Advanced Customization

import webis_html

# Example paths (these mirror the output structure documented below)
input_dir = "./html_files"
output_dir = "./output"
content_dir = "./output/content_output"
dataset_file = "./output/dataset/extra_datasets.json"
results_file = "./output/dataset/pred_results.json"

# Use core components for a custom processing flow
processor = webis_html.HtmlProcessor(input_dir, output_dir)
processor.process_html_folder()

# Generate dataset
webis_html.process_json_folder(content_dir, dataset_file)

# Model prediction
webis_html.process_predictions(dataset_file, results_file)

# Restore text
webis_html.restore_text_from_json(results_file, output_dir)

Command Line Interface

Basic Commands

# Extract HTML content
webis-html extract --input ./html_files --output ./results --api-key YOUR_KEY

# Verbose output
webis-html extract --input ./html_files --verbose

# Start web interface
webis-html gui --web-port 9000 --gui-port 8001

# Test API connection
webis-html check-api --api-key YOUR_KEY

# Check version information
webis-html version

Complete Example

# Process HTML files in samples directory
webis-html extract \
  --input ./samples/input_html \
  --output ./samples/output \
  --api-key sk-your-deepseek-api-key \
  --verbose

Web Interface

Start the web interface for visual operation:

# Start GUI (will automatically start Web API server)
webis-html gui

# Custom ports
webis-html gui --web-port 9000 --gui-port 8001 --api-key YOUR_KEY

Then visit http://localhost:8001 in your browser.

🔑 API Key Configuration

The API key can be supplied in any of the following ways:

1. Configuration File (Recommended)

Create config/api_keys.json:

{
    "deepseek_api_key": "sk-your-deepseek-api-key-here"
}

Note: If no API key is configured, the program still runs normally, but it skips the AI filtering step and performs only basic HTML content extraction.

2. Environment Variables

export DEEPSEEK_API_KEY="sk-your-deepseek-api-key-here"
# or
export LLM_PREDICTOR_API_KEY="sk-your-deepseek-api-key-here"

3. Command Line Parameters

webis-html extract --input ./html --api-key sk-your-key

4. Python Code

result = webis_html.extract_from_html(html_content, api_key="sk-your-key")  # Optional

๐Ÿ“ Output Structure

All processing methods generate a unified output structure:

output/
├── content_output/          # HTML preprocessing results
│   └── *.json              # Structured content data
├── dataset/                # Dataset files
│   ├── extra_datasets.json # Training dataset
│   └── pred_results.json   # Prediction results
├── predicted_texts/        # Basic extraction results
│   └── *.txt              # Extracted text files
└── filtered_texts/         # AI optimized results (if using DeepSeek API)
    └── *.txt              # Filtered high-quality text
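
After a run, the extracted text can be consumed programmatically. The snippet below is a minimal sketch that assumes the layout shown above and a default ./output directory: it prefers the AI-filtered texts when the DeepSeek step ran and falls back to the basic extraction results otherwise.

from pathlib import Path

output_dir = Path("./output")  # the same directory passed as output_dir above

# Prefer AI-filtered results if the DeepSeek step ran; otherwise use basic extraction
text_dir = output_dir / "filtered_texts"
if not text_dir.exists():
    text_dir = output_dir / "predicted_texts"

for txt_file in sorted(text_dir.glob("*.txt")):
    content = txt_file.read_text(encoding="utf-8")
    print(f"{txt_file.name}: {len(content)} characters")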

🛠️ Development and Customization

Project Structure

webis_html/
├── __init__.py             # Main package entry, convenience functions
├── cli/                    # Command line interface
│   ├── cli.py             # CLI implementation
│   └── __main__.py        # CLI entry point
├── core/                   # Core processing modules
│   ├── html_processor.py  # HTML preprocessing
│   ├── dataset_processor.py # Dataset generation
│   ├── llm_predictor.py   # AI prediction
│   ├── content_restorer.py # Content restoration
│   └── llm_clean.py       # DeepSeek filtering
├── server/                 # Web server
│   ├── __init__.py        # FastAPI application
│   ├── api/               # API routes
│   └── services/          # Service components
├── utils/                  # Utility modules
├── config/                 # Configuration files
├── frontend/              # Web interface
└── scripts/               # Startup scripts

Extension Development

# Create custom processor
from webis_html.core import HtmlProcessor

class CustomProcessor(HtmlProcessor):
    def custom_process(self, html_content):
        # Custom processing logic
        pass

# Create web service
from webis_html import create_app
import uvicorn

app = create_app()
uvicorn.run(app, host="0.0.0.0", port=8000)

📊 Performance Features

  • Asynchronous processing: High-performance concurrency using httpx and asyncio (see the sketch after this list)
  • Smart caching: Automatic API key and configuration caching
  • Batch optimization: Batch processing optimization for large numbers of files
  • Memory management: Stream processing of large files to avoid memory overflow
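
As an illustration of the asynchronous pattern (not the library's actual implementation), the following sketch issues several API calls concurrently with httpx and asyncio; the endpoint, payload, and helper names are placeholders.

import asyncio
import httpx

async def call_api(client: httpx.AsyncClient, text: str) -> str:
    # Placeholder endpoint and payload; illustrative only
    response = await client.post("https://api.example.com/filter", json={"text": text})
    response.raise_for_status()
    return response.text

async def process_all(texts: list[str]) -> list[str]:
    async with httpx.AsyncClient(timeout=30.0) as client:
        # Issue all requests concurrently and wait for every result
        return await asyncio.gather(*(call_api(client, t) for t in texts))

results = asyncio.run(process_all(["first text segment", "second text segment"]))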

๐Ÿค Contribution

Contributions are welcome! Please follow these steps:

  1. Fork the project
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

🎯 Use Cases

  • Knowledge base construction: Batch extract structured knowledge from web pages
  • Data mining: Clean web data for analysis
  • AI training: Prepare high-quality training data for large language models
  • Content migration: Website content migration and organization
  • Information extraction: Extract key information from HTML

Start using Webis HTML to make web content extraction simple and efficient! 🚀

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

webis_html-1.0.3.tar.gz (1.2 MB)

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

webis_html-1.0.3-py3-none-any.whl (1.2 MB)

File details

Details for the file webis_html-1.0.3.tar.gz.

File metadata

  • Download URL: webis_html-1.0.3.tar.gz
  • Upload date:
  • Size: 1.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for webis_html-1.0.3.tar.gz
  • SHA256: 6b65e40147e8cafbea2f76a3b46d3d4bc066508b30c28b32de0b454299da2c05
  • MD5: 93a6a0c1534d13ceb26dff47eb432282
  • BLAKE2b-256: 0ca7c5d0061461391529b094d3dea85f9a2bb3085eab6219067e4be7ae7b3384

See more details on using hashes here.

File details

Details for the file webis_html-1.0.3-py3-none-any.whl.

File metadata

  • Download URL: webis_html-1.0.3-py3-none-any.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for webis_html-1.0.3-py3-none-any.whl
  • SHA256: 8acc7eb7ede06364efe7d3d9cacdc28f86c20e13088e4f9c55810f68d2e686b0
  • MD5: af6d0004dda2c741280715c307be114f
  • BLAKE2b-256: cc389290f5c0d2099ebf37065bfec8fe1b5e62b944cefe022471fa8675c7770c

See more details on using hashes here.
