MinerU-Webkit

MinerU-Webkit is a high-performance toolkit for converting web content. It parses HTML pages and extracts their structured content, with support for several output formats and customizable configuration.

Key Features

  • 🚀 High-Performance Parsing: Leverages Python and lxml for fast processing and low memory footprint.
  • 🎯 Multi-Format Output: Supports Markdown, JSON, Txt, and other structured formats to meet diverse needs.
  • Asynchronous Processing: Supports asynchronous batch processing for improved efficiency with multiple web pages.
  • 🌐 Dual-Protocol Support: A unified service gateway supports both the Model Context Protocol (MCP) and traditional RESTful APIs, so the conversion service can be invoked by AI agents (such as Claude and Cursor) as well as by traditional web clients and mobile applications.
  • 🔧 Error Resilience: Incorporates robust error recovery mechanisms to handle malformed HTML gracefully.

Installation

Prerequisites

  • Python >= 3.13

Basic Installation (Core Functionality)

For basic usage of MinerU-Webkit, install with core dependencies only:

# Clone the repository
git clone https://github.com/ccprocessor/MinerU-Webkit.git
cd MinerU-Webkit

# Dependencies from pyproject.toml are automatically installed
uv sync --package webpage_converter

Quick Start

1. Basic Usage

from webpage_converter.convert import convert_html_to_structured_data

# Extract main content from HTML
html_content = """
<html>
  <body>
    <div>
    <h1>This is a title</h1>
    <p>This is a paragraph</p>
    <p>This is another paragraph</p>
    </div>
    <div>
    <p>Related content</p>
    <p>Advertising content</p>
    </div>
  </body>
</html>
"""
result = convert_html_to_structured_data(main_html=html_content, url="http://www.example.com", output_format='mm_md')
print(result)
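
To convert several pages, the same call can be wrapped in a small batch helper. The sketch below is an illustration under stated assumptions, not the toolkit's own asynchronous API: it runs the synchronous `convert_html_to_structured_data` call from the example above in worker threads via `asyncio.to_thread`. The helper name `convert_many`, its `convert` and `max_concurrency` parameters, the sample pages, and the fallback stub (used only when the package is not installed, so the sketch still runs) are all hypothetical.

```python
import asyncio

try:
    from webpage_converter.convert import convert_html_to_structured_data
except ImportError:
    # Hypothetical stand-in so the sketch runs without the package installed.
    # It does NOT reproduce the real converter's output.
    def convert_html_to_structured_data(main_html, url, output_format="mm_md"):
        return f"[{output_format}] {url}: {len(main_html)} characters of HTML"


async def convert_many(pages, output_format="mm_md", max_concurrency=4,
                       convert=convert_html_to_structured_data):
    """Convert (url, html) pairs concurrently; results keep the input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def convert_one(url, html):
        async with semaphore:
            # Run the blocking converter in a worker thread.
            return await asyncio.to_thread(
                convert, main_html=html, url=url, output_format=output_format
            )

    tasks = [convert_one(url, html) for url, html in pages]
    # return_exceptions=True keeps one failing page from aborting the batch;
    # a failed page shows up as an exception object in the result list.
    return await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    pages = [
        ("http://www.example.com/a", "<html><body><h1>Page A</h1></body></html>"),
        ("http://www.example.com/b", "<html><body><h1>Page B</h1></body></html>"),
    ]
    results = asyncio.run(convert_many(pages, output_format="md"))
    for (url, _), result in zip(pages, results):
        print(url, "->", result)
```

Threads are used here because the documented entry point is a plain function call; the semaphore only caps how many conversions run at once.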

Configuration

Configuration Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| main_html | str | Required | The HTML to convert |
| url | str | https://example.com | URL of the source page; required in mm_md mode |
| output_format | str | mm_md | Output format: mm_md (Markdown with images), md (Markdown), json, or txt |
| use_raw_image_url | bool | True | Whether to use the original image URLs (only valid for the mm_md format) |

Optional values for output_format

  • mm_md: Markdown with images
  • md: Markdown
  • json: JSON
  • txt: plain text
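
The constraints described above (a required `main_html`, a fixed set of `output_format` values, a `url` that is required in mm_md mode, and `use_raw_image_url` applying only to mm_md) can be checked before calling the converter. The following is a minimal sketch, assuming these options are keyword arguments of `convert_html_to_structured_data`; the `ConvertOptions` class and its `as_kwargs` method are hypothetical helpers and not part of the package.

```python
from dataclasses import dataclass

# Format names taken from the option descriptions above.
SUPPORTED_FORMATS = ("mm_md", "md", "json", "txt")


@dataclass(frozen=True)
class ConvertOptions:
    """Hypothetical container that validates the documented parameters."""

    main_html: str
    url: str = "https://example.com"
    output_format: str = "mm_md"
    use_raw_image_url: bool = True

    def __post_init__(self):
        if not self.main_html.strip():
            raise ValueError("main_html is required and must not be empty")
        if self.output_format not in SUPPORTED_FORMATS:
            raise ValueError(
                f"output_format must be one of {SUPPORTED_FORMATS}, "
                f"got {self.output_format!r}"
            )
        if self.output_format == "mm_md" and not self.url:
            raise ValueError("url is required when output_format is 'mm_md'")

    def as_kwargs(self):
        """Build keyword arguments for the converter call."""
        kwargs = {
            "main_html": self.main_html,
            "url": self.url,
            "output_format": self.output_format,
        }
        # use_raw_image_url is documented as valid for mm_md only,
        # so it is passed along only in that mode.
        if self.output_format == "mm_md":
            kwargs["use_raw_image_url"] = self.use_raw_image_url
        return kwargs


if __name__ == "__main__":
    options = ConvertOptions(main_html="<p>Hello</p>", output_format="json")
    print(options.as_kwargs())
```

With the package installed, the validated options could then be passed on as `convert_html_to_structured_data(**options.as_kwargs())`.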

TODO

contributors

Download files


Source Distribution

mineru_webkit-0.1.5.tar.gz (5.3 MB)

Uploaded Source

Built Distribution


mineru_webkit-0.1.5-py3-none-any.whl (164.8 kB)

Uploaded Python 3

File details

Details for the file mineru_webkit-0.1.5.tar.gz.

File metadata

  • Download URL: mineru_webkit-0.1.5.tar.gz
  • Upload date:
  • Size: 5.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.16

File hashes

Hashes for mineru_webkit-0.1.5.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | b0c258727db00247f089ed26a393331bef3e0a434061a67ba925d31b656eee90 |
| MD5 | 6ea7d68a8a08f2bf803b885329d39af9 |
| BLAKE2b-256 | cb2c464e1adde5fe6208c6a6f75c6b8af88bdc1c03255820c3d0e0cacc500608 |


File details

Details for the file mineru_webkit-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: mineru_webkit-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 164.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.16

File hashes

Hashes for mineru_webkit-0.1.5-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 3951732d6dc4d0ab5f7430fe3bf19f31a417945e7c227bcdf1e06407262df0d0 |
| MD5 | 251f9b26c24d99b55c04f4b0077afc36 |
| BLAKE2b-256 | cd080d2032bb9c68bd68446e5f1f2ddd1693351396a5b0725a16b6a5c55dbded |
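
A downloaded file can be checked against the published SHA256 digest with Python's standard `hashlib` module. This is a minimal sketch: the function names `sha256_of` and `matches_published_digest` are illustrative, and the script assumes the wheel has been saved in the current directory under its published file name.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed local path; the digest is the SHA256 value listed for the wheel.
    wheel = Path("mineru_webkit-0.1.5-py3-none-any.whl")
    published = "3951732d6dc4d0ab5f7430fe3bf19f31a417945e7c227bcdf1e06407262df0d0"
    if wheel.exists():
        ok = matches_published_digest(wheel, published)
        print("digest OK" if ok else "digest MISMATCH")
    else:
        print(f"{wheel} not found; download it first")
```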
