# Webis HTML - Intelligent Web Content Extraction Tool
Webis HTML is an intelligent web content extraction tool that uses AI to automatically identify and extract valuable information from web pages, filter out noise, and provide high-quality text data for knowledge base construction, data analysis, and AI training.
## Features

- **One-click extraction**: complete HTML content extraction with a single function call
- **Batch processing**: directory-level batch processing of HTML files
- **URL support**: extract content directly from web URLs
- **AI optimization**: integrated DeepSeek API for intelligent content filtering
- **Asynchronous processing**: high-performance asynchronous API calls with concurrency support
- **Multiple interfaces**: Python API, command line, and web interface
- **Standard package**: PyPI-compliant, easy to install and distribute
## Installation

### Environment Requirements

- Python 3.8+
- conda is recommended for environment management

### Quick Installation

#### Method 1: Install from PyPI (Recommended)

```shell
# Create a conda environment
conda create -n webis_html python=3.10 -y
conda activate webis_html

# Install the package
pip install webis-html
```
#### Method 2: Install from Source

```shell
# Clone the repository
git clone https://github.com/Webis/Webis.git
cd Webis/Webis_HTML

# Create the environment and install
conda create -n webis_html python=3.10 -y
conda activate webis_html
pip install -e .
```
#### Method 3: Test Version Installation

```shell
# Install the latest test version from TestPyPI
pip install -i https://test.pypi.org/simple/ webis-html
```

### Verify Installation

```shell
# Check the CLI command
webis-html --help

# Check the Python import
python -c "import webis_html; print('Installation successful!')"
```
## Quick Start

### 1. Simplest Usage

```python
import webis_html

# Extract from HTML content
html_content = "<html><body><h1>Title</h1><p>Content</p></body></html>"
result = webis_html.extract_from_html(html_content)

# Batch process a directory
result = webis_html.extract_from_directory("./html_files", "./output")

# Extract from a URL
result = webis_html.extract_from_url("https://example.com")
```

### 2. Command Line Usage

```shell
# Batch process HTML files
webis-html extract --input ./html_files --output ./results

# Start the web interface
webis-html gui

# Check the version
webis-html version
```
## Detailed Usage

### Python API

#### Convenience Functions (Recommended)

```python
import webis_html

# 1. Process HTML content
html_content = """
<html>
  <body>
    <h1>Important Title</h1>
    <p>Valuable content</p>
    <div class="ad">Advertisement content</div>
  </body>
</html>
"""

result = webis_html.extract_from_html(
    html_content,
    api_key="sk-your-deepseek-key",  # Optional; AI filtering is skipped if not provided
    output_dir="./output"
)

if result['success']:
    print(f"Extraction successful! {len(result['results'])} text segments in total")
    for item in result['results']:
        print(f"File: {item['filename']}")
        print(f"Content: {item['content'][:100]}...")

# 2. Batch process a directory
result = webis_html.extract_from_directory(
    input_dir="./html_files",
    output_dir="./output",
    api_key="sk-your-deepseek-key"  # Optional; AI filtering is skipped if not provided
)

# 3. Extract from a URL
result = webis_html.extract_from_url(
    "https://example.com",
    api_key="sk-your-deepseek-key",  # Optional; AI filtering is skipped if not provided
    output_dir="./output"
)
```
#### Advanced Customization

```python
import webis_html

# Paths used by the pipeline steps below (following the output structure
# described later in this document)
input_dir = "./html_files"
output_dir = "./output"
content_dir = "./output/content_output"
dataset_file = "./output/dataset/extra_datasets.json"
results_file = "./output/dataset/pred_results.json"

# Use the core components for a custom processing flow
processor = webis_html.HtmlProcessor(input_dir, output_dir)
processor.process_html_folder()

# Generate the dataset
webis_html.process_json_folder(content_dir, dataset_file)

# Run model prediction
webis_html.process_predictions(dataset_file, results_file)

# Restore the text
webis_html.restore_text_from_json(results_file, output_dir)
```
### Command Line Interface

#### Basic Commands

```shell
# Extract HTML content
webis-html extract --input ./html_files --output ./results --api-key YOUR_KEY

# Verbose output
webis-html extract --input ./html_files --verbose

# Start the web interface
webis-html gui --web-port 9000 --gui-port 8001

# Test the API connection
webis-html check-api --api-key YOUR_KEY

# Check version information
webis-html version
```

#### Complete Example

```shell
# Process the HTML files in the samples directory
webis-html extract \
    --input ./samples/input_html \
    --output ./samples/output \
    --api-key sk-your-deepseek-api-key \
    --verbose
```
### Web Interface

Start the web interface for visual operation:

```shell
# Start the GUI (this also starts the web API server automatically)
webis-html gui

# Custom ports
webis-html gui --web-port 9000 --gui-port 8001 --api-key YOUR_KEY
```

Then visit http://localhost:8001 in your browser.
## API Key Configuration

Several API key configuration methods are supported:

### 1. Configuration File (Recommended)

Create `config/api_keys.json`:

```json
{
  "deepseek_api_key": "sk-your-deepseek-api-key-here"
}
```

Note: if no API key is configured, the program still runs normally, but it skips the AI filtering step and performs only basic HTML content extraction.

### 2. Environment Variables

```shell
export DEEPSEEK_API_KEY="sk-your-deepseek-api-key-here"
# or
export LLM_PREDICTOR_API_KEY="sk-your-deepseek-api-key-here"
```

### 3. Command Line Parameter

```shell
webis-html extract --input ./html --api-key sk-your-key
```

### 4. Python Code

```python
result = webis_html.extract_from_html(html_content, api_key="sk-your-key")  # Optional
```
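The document does not state which of these methods wins when several are configured. The sketch below shows one plausible resolution order (explicit argument, then environment variables, then the config file); `resolve_api_key` is an illustrative helper written for this example, not part of the webis-html API.

```python
import json
import os
from pathlib import Path

def resolve_api_key(explicit_key=None, config_path="config/api_keys.json"):
    """Return the first API key found, or None (AI filtering is then skipped).

    Assumed precedence: explicit argument > environment variable > config file.
    """
    if explicit_key:
        return explicit_key
    # Both variable names mentioned in the section above
    for var in ("DEEPSEEK_API_KEY", "LLM_PREDICTOR_API_KEY"):
        if os.environ.get(var):
            return os.environ[var]
    path = Path(config_path)
    if path.is_file():
        try:
            return json.loads(path.read_text(encoding="utf-8")).get("deepseek_api_key")
        except (json.JSONDecodeError, OSError):
            return None
    return None
```

Returning `None` rather than raising matches the documented behavior of falling back to basic extraction when no key is available.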
## Output Structure

All processing methods generate a unified output structure:

```
output/
├── content_output/          # HTML preprocessing results
│   └── *.json               # Structured content data
├── dataset/                 # Dataset files
│   ├── extra_datasets.json  # Training dataset
│   └── pred_results.json    # Prediction results
├── predicted_texts/         # Basic extraction results
│   └── *.txt                # Extracted text files
└── filtered_texts/          # AI-optimized results (when using the DeepSeek API)
    └── *.txt                # Filtered high-quality text
```
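Given this layout, collecting the final text after a run can be done with a short `pathlib` sketch. This helper is not part of the webis-html API; it assumes the directory names shown above and prefers the AI-filtered output when it exists, falling back to the basic extraction results otherwise.

```python
from pathlib import Path

def collect_texts(output_dir="./output"):
    """Return {filename: text} for the extracted .txt files in output_dir.

    Prefers filtered_texts/ (AI-optimized) over predicted_texts/ (basic).
    """
    root = Path(output_dir)
    filtered = root / "filtered_texts"
    predicted = root / "predicted_texts"
    source = filtered if filtered.is_dir() else predicted
    return {p.name: p.read_text(encoding="utf-8")
            for p in sorted(source.glob("*.txt"))}
```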
## Development and Customization

### Project Structure

```
webis_html/
├── __init__.py              # Main package entry, convenience functions
├── cli/                     # Command line interface
│   ├── cli.py               # CLI implementation
│   └── __main__.py          # CLI entry point
├── core/                    # Core processing modules
│   ├── html_processor.py    # HTML preprocessing
│   ├── dataset_processor.py # Dataset generation
│   ├── llm_predictor.py     # AI prediction
│   ├── content_restorer.py  # Content restoration
│   └── llm_clean.py         # DeepSeek filtering
├── server/                  # Web server
│   ├── __init__.py          # FastAPI application
│   ├── api/                 # API routes
│   └── services/            # Service components
├── utils/                   # Utility modules
├── config/                  # Configuration files
├── frontend/                # Web interface
└── scripts/                 # Startup scripts
```
### Extension Development

```python
# Create a custom processor
from webis_html.core import HtmlProcessor

class CustomProcessor(HtmlProcessor):
    def custom_process(self, html_content):
        # Custom processing logic
        pass

# Create a web service
from webis_html import create_app
import uvicorn

app = create_app()
uvicorn.run(app, host="0.0.0.0", port=8000)
```
## Performance Features

- **Asynchronous processing**: high-performance concurrency using httpx and asyncio
- **Smart caching**: automatic caching of API keys and configuration
- **Batch optimization**: optimized batch processing for large numbers of files
- **Memory management**: large files are processed as streams to avoid exhausting memory
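The asynchronous processing described above typically follows a bounded-concurrency pattern: issue many API calls with `asyncio.gather` while a semaphore caps how many are in flight. The sketch below illustrates that pattern; `fake_predict` stands in for a real httpx call to the DeepSeek API so the example runs offline, and whether webis-html uses exactly this structure is an assumption.

```python
import asyncio

async def fake_predict(text: str) -> str:
    """Stand-in for an async API call (a real version would use httpx)."""
    await asyncio.sleep(0)   # simulates network latency
    return text.upper()      # simulates the model's filtered output

async def predict_all(texts, max_concurrency=5):
    """Run fake_predict over all texts with at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(text):
        async with sem:      # blocks when max_concurrency calls are active
            return await fake_predict(text)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(t) for t in texts))

results = asyncio.run(predict_all(["a", "b", "c"]))
```

Capping concurrency this way keeps throughput high without tripping the API provider's rate limits.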
## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the project
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## License

This project is released under the MIT License; see the LICENSE file for details.
## Support

- Email: example@example.com
- Issues: GitHub Issues
- Documentation: Project Documentation
## Use Cases

- **Knowledge base construction**: batch-extract structured knowledge from web pages
- **Data mining**: clean web data for analysis
- **AI training**: prepare high-quality training data for large language models
- **Content migration**: migrate and organize website content
- **Information extraction**: extract key information from HTML

Start using Webis HTML to make web content extraction simple and efficient!
## Download files
### File: webis_html-1.0.1.tar.gz

File metadata:

- Download URL: webis_html-1.0.1.tar.gz
- Upload date:
- Size: 1.2 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `77e3ca49135bb1a18f078c7bdc7dd2bb7e9512a8cae6b765431570b8aeecdf1d` |
| MD5 | `5234788518eb1fcc43ae73a16c38e862` |
| BLAKE2b-256 | `55c56f321d1b53ae110ca399b99f91430fb2a2c5135e199ef5ca42dfcaaaf184` |
### File: webis_html-1.0.1-py3-none-any.whl

File metadata:

- Download URL: webis_html-1.0.1-py3-none-any.whl
- Upload date:
- Size: 1.2 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `15fbf725eb9fa9bc6ade2b4c5064f59a325f51c95fe1ca522a3c8e0d3d61878d` |
| MD5 | `b646e3584e8e51ffec04521a06a0f5cd` |
| BLAKE2b-256 | `5e67a87ff32fdf12ba96b5ff75e4ae62226f0c4112be74197bdd01fdd0714c7e` |