Webis HTML extraction tool

Webis HTML - Intelligent Web Content Extraction Tool

Webis HTML is a modern, intelligent web content extraction tool that uses AI to automatically identify and extract valuable information from web pages, filter out noise content, and produce high-quality text data for knowledge base construction, data analysis, and AI training.

✨ Features

  • 🚀 One-click extraction: Complete HTML content extraction with a single function call
  • 🔄 Batch processing: Supports directory-level batch HTML file processing
  • 🌐 URL support: Extract content directly from web URLs
  • 🤖 AI optimization: Integrated DeepSeek API for intelligent content filtering
  • ⚡ Asynchronous processing: High-performance asynchronous API calls with concurrent processing support
  • 🖥️ Multiple interfaces: Supports Python API, command line, and web interface
  • 📦 Standard package: Compliant with PyPI standards, easy to install and distribute

📦 Installation

Environment Requirements

  • Python 3.8+
  • conda is recommended for environment management

Quick Installation

Method 1: Install from PyPI (Recommended)

# Create conda environment
conda create -n webis_html python=3.10 -y
conda activate webis_html

# Install package
pip install webis-html

Method 2: Install from Source

# Clone repository
git clone https://github.com/Webis/Webis.git
cd Webis/Webis_HTML

# Create environment and install
conda create -n webis_html python=3.10 -y
conda activate webis_html
pip install -e .

Method 3: Test Version Installation

# Install latest test version from TestPyPI
pip install -i https://test.pypi.org/simple/ webis-html

Verify Installation

# Check CLI command
webis-html --help

# Check Python import
python -c "import webis_html; print('✅ Installation successful!')"

🚀 Quick Start

1. Simplest Usage

import webis_html

# Extract from HTML content
html_content = "<html><body><h1>Title</h1><p>Content</p></body></html>"
result = webis_html.extract_from_html(html_content)

# Batch process directory
result = webis_html.extract_from_directory("./html_files", "./output")

# Extract from URL
result = webis_html.extract_from_url("https://example.com")

2. Command Line Usage

# Batch process HTML files
webis-html extract --input ./html_files --output ./results

# Start web interface
webis-html gui

# Check version
webis-html version

📖 Detailed Usage Instructions

Python API

Convenience Functions (Recommended)

import webis_html

# 1. Process HTML content
html_content = """
<html>
<body>
    <h1>Important Title</h1>
    <p>Valuable content</p>
    <div class="ad">Advertisement content</div>
</body>
</html>
"""

result = webis_html.extract_from_html(
    html_content, 
    api_key="sk-your-deepseek-key",  # Optional; enables AI optimization (skipped if no key is provided)
    output_dir="./output"
)

if result['success']:
    print(f"Extraction successful! Total {len(result['results'])} text segments")
    for item in result['results']:
        print(f"File: {item['filename']}")
        print(f"Content: {item['content'][:100]}...")

# 2. Batch process directory
result = webis_html.extract_from_directory(
    input_dir="./html_files",
    output_dir="./output",
    api_key="sk-your-deepseek-key"  # Optional; AI filtering is skipped if not provided
)

# 3. Extract from URL
result = webis_html.extract_from_url(
    "https://example.com",
    api_key="sk-your-deepseek-key",  # Optional; AI filtering is skipped if not provided
    output_dir="./output"
)

Advanced Customization

import webis_html

# Example paths (these mirror the output structure documented below)
input_dir = "./html_files"
output_dir = "./output"
content_dir = "./output/content_output"
dataset_file = "./output/dataset/extra_datasets.json"
results_file = "./output/dataset/pred_results.json"

# Use core components for a custom processing flow
processor = webis_html.HtmlProcessor(input_dir, output_dir)
processor.process_html_folder()

# Generate dataset
webis_html.process_json_folder(content_dir, dataset_file)

# Model prediction
webis_html.process_predictions(dataset_file, results_file)

# Restore text
webis_html.restore_text_from_json(results_file, output_dir)

Command Line Interface

Basic Commands

# Extract HTML content
webis-html extract --input ./html_files --output ./results --api-key YOUR_KEY

# Verbose output
webis-html extract --input ./html_files --verbose

# Start web interface
webis-html gui --web-port 9000 --gui-port 8001

# Test API connection
webis-html check-api --api-key YOUR_KEY

# Check version information
webis-html version

Complete Example

# Process HTML files in samples directory
webis-html extract \
  --input ./samples/input_html \
  --output ./samples/output \
  --api-key sk-your-deepseek-api-key \
  --verbose

Web Interface

Start the web interface for visual operation:

# Start GUI (will automatically start Web API server)
webis-html gui

# Custom ports
webis-html gui --web-port 9000 --gui-port 8001 --api-key YOUR_KEY

Then visit http://localhost:8001 in your browser.

🔑 API Key Configuration

The API key can be supplied in any of the following ways:

1. Configuration File (Recommended)

Create config/api_keys.json:

{
    "deepseek_api_key": "sk-your-deepseek-api-key-here"
}

Note: If no API key is configured, the program still runs normally, but it skips the AI filtering step and performs only basic HTML content extraction.

2. Environment Variables

export DEEPSEEK_API_KEY="sk-your-deepseek-api-key-here"
# or
export LLM_PREDICTOR_API_KEY="sk-your-deepseek-api-key-here"

3. Command Line Parameters

webis-html extract --input ./html --api-key sk-your-key

4. Python Code

result = webis_html.extract_from_html(html_content, api_key="sk-your-key")  # Optional

๐Ÿ“ Output Structure

All processing methods generate a unified output structure:

output/
├── content_output/          # HTML preprocessing results
│   └── *.json              # Structured content data
├── dataset/                # Dataset files
│   ├── extra_datasets.json # Training dataset
│   └── pred_results.json   # Prediction results
├── predicted_texts/        # Basic extraction results
│   └── *.txt              # Extracted text files
└── filtered_texts/         # AI optimized results (if using DeepSeek API)
    └── *.txt              # Filtered high-quality text
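
After a run, the extracted text can be consumed programmatically. The snippet below is a minimal sketch that assumes the layout shown above and a default ./output directory: it prefers the AI-filtered texts when the DeepSeek step ran and falls back to the basic extraction results otherwise.

from pathlib import Path

output_dir = Path("./output")  # the same directory passed as output_dir above

# Prefer AI-filtered results if the DeepSeek step ran; otherwise use basic extraction
text_dir = output_dir / "filtered_texts"
if not text_dir.exists():
    text_dir = output_dir / "predicted_texts"

for txt_file in sorted(text_dir.glob("*.txt")):
    content = txt_file.read_text(encoding="utf-8")
    print(f"{txt_file.name}: {len(content)} characters")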

🛠️ Development and Customization

Project Structure

webis_html/
├── __init__.py             # Main package entry, convenience functions
├── cli/                    # Command line interface
│   ├── cli.py             # CLI implementation
│   └── __main__.py        # CLI entry point
├── core/                   # Core processing modules
│   ├── html_processor.py  # HTML preprocessing
│   ├── dataset_processor.py # Dataset generation
│   ├── llm_predictor.py   # AI prediction
│   ├── content_restorer.py # Content restoration
│   └── llm_clean.py       # DeepSeek filtering
├── server/                 # Web server
│   ├── __init__.py        # FastAPI application
│   ├── api/               # API routes
│   └── services/          # Service components
├── utils/                  # Utility modules
├── config/                 # Configuration files
├── frontend/              # Web interface
└── scripts/               # Startup scripts

Extension Development

# Create custom processor
from webis_html.core import HtmlProcessor

class CustomProcessor(HtmlProcessor):
    def custom_process(self, html_content):
        # Custom processing logic
        pass

# Create web service
from webis_html import create_app
import uvicorn

app = create_app()
uvicorn.run(app, host="0.0.0.0", port=8000)

📊 Performance Features

  • Asynchronous processing: High-performance concurrency using httpx and asyncio (see the sketch after this list)
  • Smart caching: Automatic API key and configuration caching
  • Batch optimization: Batch processing optimization for large numbers of files
  • Memory management: Stream processing of large files to avoid memory overflow
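
As an illustration of the asynchronous pattern (not the library's actual implementation), the following sketch issues several API calls concurrently with httpx and asyncio; the endpoint, payload, and helper names are placeholders.

import asyncio
import httpx

async def call_api(client: httpx.AsyncClient, text: str) -> str:
    # Placeholder endpoint and payload; illustrative only
    response = await client.post("https://api.example.com/filter", json={"text": text})
    response.raise_for_status()
    return response.text

async def process_all(texts: list[str]) -> list[str]:
    async with httpx.AsyncClient(timeout=30.0) as client:
        # Issue all requests concurrently and wait for every result
        return await asyncio.gather(*(call_api(client, t) for t in texts))

results = asyncio.run(process_all(["first text segment", "second text segment"]))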

๐Ÿค Contribution

Contributions are welcome! Please follow these steps:

  1. Fork the project
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

🎯 Use Cases

  • Knowledge base construction: Batch extract structured knowledge from web pages
  • Data mining: Clean web data for analysis
  • AI training: Prepare high-quality training data for large language models
  • Content migration: Website content migration and organization
  • Information extraction: Extract key information from HTML

Start using Webis HTML to make web content extraction simple and efficient! 🚀

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

webis_html-1.0.3.tar.gz (1.2 MB)

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

webis_html-1.0.3-py3-none-any.whl (1.2 MB)

File details

Details for the file webis_html-1.0.3.tar.gz.

File metadata

  • Download URL: webis_html-1.0.3.tar.gz
  • Upload date:
  • Size: 1.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for webis_html-1.0.3.tar.gz
  • SHA256: 6b65e40147e8cafbea2f76a3b46d3d4bc066508b30c28b32de0b454299da2c05
  • MD5: 93a6a0c1534d13ceb26dff47eb432282
  • BLAKE2b-256: 0ca7c5d0061461391529b094d3dea85f9a2bb3085eab6219067e4be7ae7b3384

See more details on using hashes here.

File details

Details for the file webis_html-1.0.3-py3-none-any.whl.

File metadata

  • Download URL: webis_html-1.0.3-py3-none-any.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for webis_html-1.0.3-py3-none-any.whl
  • SHA256: 8acc7eb7ede06364efe7d3d9cacdc28f86c20e13088e4f9c55810f68d2e686b0
  • MD5: af6d0004dda2c741280715c307be114f
  • BLAKE2b-256: cc389290f5c0d2099ebf37065bfec8fe1b5e62b944cefe022471fa8675c7770c

See more details on using hashes here.
