MinerU-Webkit
MinerU-Webkit is a high-performance web content conversion toolkit. It parses HTML web pages and extracts structured content from them, with support for several output formats and customizable configuration.
Key Features
- 🚀 High-Performance Parsing: Leverages Python and lxml for fast processing and low memory footprint.
- 🎯 Multi-Format Output: Supports Markdown, JSON, plain text (txt), and other output formats to meet diverse needs.
- ⚡ Asynchronous Processing: Supports asynchronous batch processing for improved efficiency with multiple web pages.
- 🌐 Dual-Protocol Support: A unified service gateway supports both the Model Context Protocol (MCP) and traditional RESTful APIs, so the conversion service can be invoked by AI agents (such as Claude and Cursor) as well as by traditional web clients and mobile applications.
- 🔧 Error Resilience: Incorporates robust error recovery mechanisms to handle malformed HTML gracefully.
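The asynchronous batch processing mentioned above can be approximated in user code even without a dedicated async API. The following is a minimal sketch, not the toolkit's own implementation: `fake_convert` is a hypothetical stand-in with the same keyword arguments as the converter shown in the Quick Start, and `convert_batch` and `max_concurrency` are illustrative names that do not come from the package. In real use you would pass the actual converter as the `convert` argument.

```python
from __future__ import annotations

import asyncio
from typing import Callable, Iterable


def fake_convert(main_html: str, url: str, output_format: str = "md") -> str:
    """Hypothetical stand-in for the real converter (assumption, not the package API)."""
    return f"[{output_format}] {url}: {len(main_html)} chars"


async def convert_batch(
    pages: Iterable[tuple[str, str]],
    convert: Callable[..., str] = fake_convert,
    output_format: str = "md",
    max_concurrency: int = 4,
) -> list:
    """Convert (html, url) pairs concurrently and return results in input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def convert_one(html: str, url: str) -> str:
        async with semaphore:
            # The converter is synchronous, so run it in a worker thread
            # to avoid blocking the event loop.
            return await asyncio.to_thread(
                convert, main_html=html, url=url, output_format=output_format
            )

    tasks = [convert_one(html, url) for html, url in pages]
    # return_exceptions=True keeps one failing page from aborting the whole batch;
    # failed entries come back as exception objects in the result list.
    return await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    sample_pages = [
        ("<p>a</p>", "http://a.example"),
        ("<p>bb</p>", "http://b.example"),
    ]
    for item in asyncio.run(convert_batch(sample_pages)):
        print(item)
```

Bounding concurrency with a semaphore is a design choice for this sketch: it keeps the number of worker threads in use predictable when the batch is large.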
Installation
Prerequisites
- Python >= 3.13
Basic Installation (Core Functionality)
For basic usage of MinerU-Webkit, install with core dependencies only:
# Clone the repository
git clone https://github.com/ccprocessor/MinerU-Webkit.git
cd MinerU-Webkit
# Dependencies from pyproject.toml are automatically installed
uv sync --package webpage_converter
Quick Start
1. Basic Usage
from webpage_converter.convert import convert_html_to_structured_data
# Extract main content from HTML
html_content = """
<html>
<body>
<div>
<h1>This is a title</h1>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
</div>
<div>
<p>Related content</p>
<p>Advertising content</p>
</div>
</body>
</html>
"""
result = convert_html_to_structured_data(main_html=html_content, url="http://www.example.com", output_format='mm_md')
print(result)
Configuration
Configuration Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| main_html | str | Required | HTML content to convert |
| url | str | https://example.com | URL of the source page; required in mm_md mode |
| output_format | str | mm_md | Output format: mm_md (Markdown with images), md (Markdown), json, or txt |
| use_raw_image_url | bool | True | Whether to use the original image URLs (only applies to the mm_md format) |
Optional values for output_format
- mm_md: The output format is Markdown with images
- md: The output format is Markdown
- json: The output format is JSON
- txt: The output format is plain text
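The options above can be checked before a conversion is attempted. The helper below is a small sketch under stated assumptions: `build_convert_kwargs` and `SUPPORTED_FORMATS` are hypothetical names that are not part of the package, and the rules it enforces (a URL must be supplied for mm_md, and use_raw_image_url is only passed for mm_md) are one reading of the table above rather than documented library behavior.

```python
from __future__ import annotations

# Format names taken from the configuration table above.
SUPPORTED_FORMATS = ("mm_md", "md", "json", "txt")


def build_convert_kwargs(
    main_html: str,
    url: str | None = None,
    output_format: str = "mm_md",
    use_raw_image_url: bool = True,
) -> dict:
    """Validate options and return keyword arguments for the converter.

    Hypothetical helper: the checks mirror the README table, not library code.
    """
    if not main_html:
        raise ValueError("main_html is required")
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"unsupported output_format {output_format!r}; "
            f"expected one of {SUPPORTED_FORMATS}"
        )
    if output_format == "mm_md" and not url:
        # Assumption: treat the URL as mandatory in mm_md mode.
        raise ValueError("url is required when output_format is 'mm_md'")

    kwargs: dict = {"main_html": main_html, "output_format": output_format}
    if url:
        kwargs["url"] = url
    if output_format == "mm_md":
        # Only meaningful for Markdown with images.
        kwargs["use_raw_image_url"] = use_raw_image_url
    return kwargs
```

The returned dictionary can then be unpacked into the converter call, for example `convert_html_to_structured_data(**build_convert_kwargs(html, url="http://www.example.com"))`.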
TODO
Contributors