
MinerU-Webkit

MinerU-Webkit is a high-performance web content conversion toolkit. It parses HTML web pages and extracts their structured content, with several output formats and configurable options.

Key Features

  • 🚀 High-Performance Parsing: Leverages Python and lxml for fast processing and low memory footprint.
  • 🎯 Multi-Format Output: Supports Markdown (with or without images), JSON, and plain-text output to meet diverse needs.
  • Asynchronous Processing: Supports asynchronous batch processing for improved efficiency with multiple web pages.
  • 🌐 Dual-Protocol Support: A unified service gateway exposes both the Model Context Protocol (MCP) and a traditional RESTful API, so the conversion service can be called by AI agents (such as Claude or Cursor) as well as by conventional web and mobile clients.
  • 🔧 Error Resilience: Incorporates robust error recovery mechanisms to handle malformed HTML gracefully.
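
The asynchronous batch interface is not documented in this README, so the following is only a minimal sketch of one way to convert several pages concurrently. It assumes the synchronous `convert_html_to_structured_data` function shown under Quick Start and runs it in worker threads. The names `convert_batch`, `max_concurrency`, and `fake_convert` are hypothetical; `fake_convert` is a stand-in so the sketch runs without the package installed.

```python
import asyncio
from typing import Callable, Iterable, List, Tuple


def fake_convert(main_html: str, url: str, output_format: str = "mm_md") -> str:
    """Hypothetical stand-in for convert_html_to_structured_data."""
    return f"[{output_format}] {url} ({len(main_html)} chars)"


async def convert_batch(
    convert: Callable[..., str],
    pages: Iterable[Tuple[str, str]],
    output_format: str = "md",
    max_concurrency: int = 4,
) -> List[str]:
    """Convert (url, html) pairs concurrently; results keep the input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def convert_one(url: str, html: str) -> str:
        async with semaphore:
            # The converter is assumed to be blocking, so run it in a thread.
            return await asyncio.to_thread(
                convert, main_html=html, url=url, output_format=output_format
            )

    return await asyncio.gather(*(convert_one(u, h) for u, h in pages))


pages = [
    ("http://a.example", "<p>A</p>"),
    ("http://b.example", "<p>BB</p>"),
]
results = asyncio.run(convert_batch(fake_convert, pages))
for line in results:
    print(line)
```

In real use, the imported `convert_html_to_structured_data` would be passed in place of `fake_convert`. Threads only help if the converter releases the GIL or waits on I/O; for CPU-bound parsing a process pool may be the better fit.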

Installation

Prerequisites

  • Python >= 3.13

Basic Installation (Core Functionality)

For basic usage of MinerU-Webkit, install with core dependencies only:

# Clone the repository
git clone https://github.com/ccprocessor/MinerU-Webkit.git
cd MinerU-Webkit

# Dependencies from pyproject.toml are automatically installed
uv sync --package webpage_converter
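
After installation, a quick check like the one below can confirm that the interpreter meets the Python >= 3.13 prerequisite and that the `webpage_converter` module (the import name used under Quick Start) can be found. This is a hedged sketch: `check_environment` is a hypothetical helper, not part of MinerU-Webkit.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 13), module="webpage_converter"):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < min_version:
        required = ".".join(str(part) for part in min_version)
        problems.append(f"Python >= {required} is required")
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module) is None:
        problems.append(f"module '{module}' is not installed")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("Problem:", problem)
```

Run it with the interpreter of the environment created by `uv sync`, for example through `uv run`.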

Quick Start

1. Basic Usage

from webpage_converter.convert import convert_html_to_structured_data

# Extract main content from HTML
html_content = """
<html>
  <body>
    <div>
    <h1>This is a title</h1>
    <p>This is a paragraph</p>
    <p>This is another paragraph</p>
    </div>
    <div>
    <p>Related content</p>
    <p>Advertising content</p>
    </div>
  </body>
</html>
"""
result = convert_html_to_structured_data(
    main_html=html_content,
    url="http://www.example.com",
    output_format="mm_md",
)
print(result)

Configuration

Configuration Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| main_html | str | Required | The HTML to convert |
| url | str | https://example.com | Source URL of the HTML; required in mm_md mode |
| output_format | str | mm_md | Output format: mm_md (Markdown with images), md (Markdown), json, or txt |
| use_raw_image_url | bool | True | Whether to keep the original image URLs (only applies to the mm_md format) |

Accepted values for output_format

  • mm_md: Markdown with images
  • md: Markdown without images
  • json: JSON
  • txt: plain text
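
To catch a mistyped format before calling the converter, the options above can be collected in a small validated container. This is a sketch under stated assumptions: `ConvertOptions`, `SUPPORTED_FORMATS`, and `to_kwargs` are hypothetical names, and dropping `use_raw_image_url` outside `mm_md` is an interpretation of the "only applies to mm_md" note, not documented library behavior.

```python
from dataclasses import dataclass

# The four formats listed in this README.
SUPPORTED_FORMATS = ("mm_md", "md", "json", "txt")


@dataclass(frozen=True)
class ConvertOptions:
    """Hypothetical container mirroring the documented parameters."""

    main_html: str
    url: str = "https://example.com"
    output_format: str = "mm_md"
    use_raw_image_url: bool = True

    def __post_init__(self):
        if not self.main_html.strip():
            raise ValueError("main_html must not be empty")
        if self.output_format not in SUPPORTED_FORMATS:
            raise ValueError(
                f"output_format must be one of {SUPPORTED_FORMATS}, "
                f"got {self.output_format!r}"
            )

    def to_kwargs(self):
        """Build keyword arguments for convert_html_to_structured_data."""
        kwargs = {
            "main_html": self.main_html,
            "url": self.url,
            "output_format": self.output_format,
        }
        # Assumption: the image flag is only meaningful for mm_md.
        if self.output_format == "mm_md":
            kwargs["use_raw_image_url"] = self.use_raw_image_url
        return kwargs


options = ConvertOptions(main_html="<p>hi</p>", output_format="md")
print(options.to_kwargs())
```

The resulting dictionary can then be unpacked into the call, as in `convert_html_to_structured_data(**options.to_kwargs())`.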
