PDF to AI Agent Knowledge Bridge - Convert PDFs to enhanced Markdown/JSON

These details have not been verified by PyPI

Project description

Chinvat

PDF to AI Agent Knowledge Bridge

Convert PDFs to enhanced, AI-agent-ready Markdown/JSON documents.

What is Chinvat?

Chinvat bridges the gap between PDFs designed for human reading and the structured, semantic text that AI agents require. It leverages:

PaddleOCR-VL for high-quality PDF-to-Markdown conversion
Ovis2.5-9B Vision-Language Model for intelligent image analysis
Mistune for flexible AST manipulation

Features

Feature	Description
Smart OCR	Extracts text, tables, and images with layout awareness
VLM Image Analysis	Generates captions and descriptions for embedded images
Decorative Filtering	Removes logos, watermarks, and noise automatically
Heading Correction (in development)	Fixes OCR-generated heading hierarchy issues
Multi-Format Output	Export to Markdown, JSON, or structured data
Batch Processing	Efficiently process multiple PDFs
Flexible Pipeline	Run full pipeline or individual steps
Base64 Embedding	Embed images as base64 data URLs in markdown

Quick Start

# Install
pip install -e .

# Convert a PDF
chinvat pipeline document.pdf output/

# Embed images as base64 (single portable file)
chinvat pipeline document.pdf output/ --embed-base64

# Or use as a Python library
python -c "
from chinvat import run_full_pipeline
run_full_pipeline('document.pdf', 'output/')
"

Installation

# Basic installation
pip install -e .

# With development dependencies
pip install -e ".[dev]"

Prerequisites

Chinvat requires two external services to be running:

1. PaddleOCR-VL Server (port 8118)

Requires GPU. Run using Docker:

docker run \
    -it \
    --rm \
    --gpus all \
    --network host \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu \
    paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --host 0.0.0.0 --port 8118 --backend vllm

2. Ovis2.5-9B VLM Server (port 8000)

Requires GPU with ~16GB VRAM. Run with vLLM:

vllm serve AIDC-AI/Ovis2.5-9B \
    --trust-remote-code \
    --port 8000 \
    --gpu-memory-utilization 0.4

Note: If you encounter RuntimeError: Exception from the 'vlm' worker: only 0-dimensional arrays can be converted to Python scalars, install numpy==1.26.4.
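
Before running the pipeline, it can help to confirm that both servers accept connections. The following is a minimal sketch using only the standard library. The helper name `port_is_open` and the `localhost` host are assumptions for illustration, and a successful TCP connection does not prove that the model behind it has finished loading.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports come from the prerequisites above; the host is assumed to be local.
    for name, port in [("PaddleOCR-VL", 8118), ("Ovis2.5-9B VLM", 8000)]:
        status = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} server on port {port}: {status}")
```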

Usage

Command Line

# Full pipeline
chinvat pipeline input.pdf output_dir

# Batch processing
chinvat batch output_base file1.pdf file2.pdf file3.pdf

# Individual steps
chinvat pdf input.pdf output_dir          # PDF → Markdown
chinvat image input.md output.md          # Analyze images
chinvat md enhance input.md output.md     # Enhance AST
chinvat md export input.md out --format json  # Export

Output Options

# Keep images folder (default: enabled)
chinvat pipeline input.pdf output/ --keep-images

# Disable images folder
chinvat pipeline input.pdf output/ --no-keep-images

# Embed images as base64 (single portable markdown file)
chinvat pipeline input.pdf output/ --embed-base64

# Combine options
chinvat pipeline input.pdf output/ --embed-base64 --no-keep-images
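
For readers unfamiliar with data URLs, the sketch below shows what embedding an image as base64 in Markdown generally looks like. It is not taken from Chinvat's source; the function name `to_data_url_markdown` and its parameters are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url_markdown(image_path: str, alt_text: str = "") -> str:
    """Build a Markdown image tag whose source is a base64 data URL."""
    path = Path(image_path)
    # Guess the MIME type from the file extension; fall back to a generic type.
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None:
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"![{alt_text}](data:{mime_type};base64,{encoded})"
```

Base64 encoding makes the image data roughly one third larger, which is the trade-off for getting a single portable file.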

Python API

from chinvat import Pipeline, run_full_pipeline, batch_pipeline

# Simple usage
run_full_pipeline("document.pdf", "output/")

# Advanced usage
pipeline = Pipeline(
    paddleocr_server_url="http://localhost:8118/v1",
    vlm_server_url="http://localhost:8000/v1"
)

outputs = pipeline.run(
    input_path="document.pdf",
    output_folder="output",
    format="both",  # markdown, json, structured, or both
    keep_images=True,  # keep imgs folder
    embed_base64=False,  # or True for single file
)

# Batch processing
batch_pipeline(
    input_paths=["doc1.pdf", "doc2.pdf"],
    output_base_folder="batch_output"
)

pipeline.close()
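
Because the example ends with `pipeline.close()`, an exception raised inside `run` would skip the cleanup. One way to guard against that is `contextlib.closing`, sketched here against a stand-in class so the snippet runs without the servers. `FakePipeline` and `run_and_close` are hypothetical names; the stand-in only mimics the `run` and `close` methods shown above.

```python
from contextlib import closing


class FakePipeline:
    """Stand-in with the same run/close shape as the Pipeline example above."""

    def __init__(self) -> None:
        self.closed = False

    def run(self, input_path: str, output_folder: str) -> dict:
        if not input_path.lower().endswith(".pdf"):
            raise ValueError(f"expected a PDF, got {input_path!r}")
        return {"input": input_path, "output_folder": output_folder}

    def close(self) -> None:
        self.closed = True


def run_and_close(pipeline, input_path: str, output_folder: str) -> dict:
    """Run the pipeline and call close() even if run() raises."""
    with closing(pipeline):
        return pipeline.run(input_path=input_path, output_folder=output_folder)
```

With the real class, `run_and_close(Pipeline(...), "document.pdf", "output")` should behave the same way, assuming `Pipeline` exposes `run` and `close` as in the example.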

Output Structure

output/
└── document_name/
    └── enhanced/
        ├── document.md     # Enhanced Markdown
        ├── document.json   # JSON AST (optional)
        └── imgs/          # Extracted images (optional)
            ├── image1.jpg
            └── image2.jpg

With --embed-base64, images are embedded directly in the markdown file as base64 data URLs.
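
To collect results after a run, the layout above can be walked with `pathlib`. This sketch assumes the `<output>/<document_name>/enhanced/` layout shown in the tree; the helper name `find_enhanced_markdown` is hypothetical.

```python
from pathlib import Path


def find_enhanced_markdown(output_root: str) -> list[Path]:
    """Return every Markdown file under <output_root>/<document>/enhanced/."""
    root = Path(output_root)
    # One directory level per document, then the fixed "enhanced" folder.
    return sorted(root.glob("*/enhanced/*.md"))
```

Calling `find_enhanced_markdown("output")` after a batch run would then list one Markdown file per processed document, provided the layout matches.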

How It Works

Chinvat processes PDFs through 4 stages:

1. PDF → Markdown - Uses PaddleOCR-VL to extract text, tables, and images
2. Markdown → AST - Parses markdown into an abstract syntax tree using Mistune
3. Enhance - Analyzes images with VLM, fixes heading levels, filters decorative elements
4. Export - Outputs as Markdown, JSON, or structured data
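
As an illustration of the heading-level fixes mentioned in the Enhance stage, the sketch below removes gaps in a heading hierarchy, such as an H1 followed directly by an H3. The source does not describe Chinvat's actual algorithm, so this is only one common approach, and `normalize_heading_levels` is a hypothetical name.

```python
def normalize_heading_levels(levels: list[int]) -> list[int]:
    """Remove gaps in a sequence of heading levels while keeping nesting order.

    A stack of original levels tracks the currently open ancestors; the
    normalized level of a heading is its depth in that stack.
    """
    stack: list[int] = []
    normalized: list[int] = []
    for level in levels:
        # Close every open heading that is at the same level or deeper.
        while stack and stack[-1] >= level:
            stack.pop()
        stack.append(level)
        normalized.append(len(stack))
    return normalized


print(normalize_heading_levels([1, 3, 3, 2, 4]))  # [1, 2, 2, 2, 3]
```

A side effect of this approach is that a later, shallower heading can end up at the same depth as earlier headings that were originally deeper, as the H2 in the example does.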

Configuration

Environment Variables

Variable	Default	Description
`API_BASE_URL`	`http://localhost:8000/v1`	VLM server URL
`API_KEY`	`abc-123`	VLM API key
`MODEL_NAME`	`AIDC-AI/Ovis2.5-9B`	VLM model name
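
The table can be mirrored in code when you want to inspect or log the effective settings. This is a sketch only: how Chinvat reads these variables internally is not shown in the source, and `VlmSettings` and `load_vlm_settings` are hypothetical names. The defaults are copied from the table above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class VlmSettings:
    api_base_url: str
    api_key: str
    model_name: str


def load_vlm_settings(env=None) -> VlmSettings:
    """Read the VLM settings, falling back to the defaults in the table above."""
    env = os.environ if env is None else env
    return VlmSettings(
        api_base_url=env.get("API_BASE_URL", "http://localhost:8000/v1"),
        api_key=env.get("API_KEY", "abc-123"),
        model_name=env.get("MODEL_NAME", "AIDC-AI/Ovis2.5-9B"),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test.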

CLI Options

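# --format takes one of: markdown, json, structured, both
# The flags below are listed together for reference; pass only the ones you need
chinvat pipeline input.pdf output/ \
  --format both \
  --no-enhance \
  --no-fix-headings \
  --no-filter-decorative \
  --no-enrich-images \
  --keep-images \
  --embed-base64 \
  --dummy  # Test without API calls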
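
When driving the CLI from a script, building the argument list in one place avoids quoting mistakes. The sketch below uses only flags that appear in this README. The helper name `build_pipeline_command` is hypothetical, and its `"markdown"` default is a choice made for this example, not a documented CLI default.

```python
def build_pipeline_command(
    pdf_path: str,
    output_dir: str,
    output_format: str = "markdown",
    embed_base64: bool = False,
    keep_images: bool = True,
    dummy: bool = False,
) -> list[str]:
    """Assemble an argument list for `chinvat pipeline` from documented flags."""
    allowed_formats = {"markdown", "json", "structured", "both"}
    if output_format not in allowed_formats:
        raise ValueError(f"format must be one of {sorted(allowed_formats)}")
    command = ["chinvat", "pipeline", pdf_path, output_dir, "--format", output_format]
    command.append("--keep-images" if keep_images else "--no-keep-images")
    if embed_base64:
        command.append("--embed-base64")
    if dummy:
        command.append("--dummy")
    return command


if __name__ == "__main__":
    cmd = build_pipeline_command("input.pdf", "output/", output_format="both", dummy=True)
    print(" ".join(cmd))
    # To execute it (requires chinvat on PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```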

Development

# Code formatting
black chinvat/

# Linting
ruff check chinvat/

# Type checking
mypy chinvat/

# Testing
pytest

Project Structure

chinvat/
├── chinvat/              # Main package
│   ├── __init__.py       # Exports
│   ├── main.py           # CLI entry point
│   ├── cli.py            # Command-line interface
│   ├── pipeline.py       # Pipeline orchestration
│   ├── pdf_converter.py  # PDF → Markdown
│   ├── image_analyzer.py # VLM image analysis
│   ├── ast_handler.py    # AST manipulation
│   ├── enhancement.py    # AST enhancement
│   └── exporters.py      # Export utilities
├── pyproject.toml        # Package configuration
├── README.md             # English documentation
├── README_CN.md          # Chinese documentation
├── CLAUDE.md             # Claude Code context
└── .gitignore

License

MIT License. See LICENSE for details.

Contributing

Contributions are welcome! Please read CLAUDE.md for development guidelines.

Built for AI agents, by AI agents

Release history

This version: 0.1.0 (yanked), Feb 13, 2026

Reason this release was yanked: Deprecated, name changed

Download files

Source Distribution

chinvat-0.1.0.tar.gz (26.7 kB)

Uploaded Feb 13, 2026 Source

Built Distribution

chinvat-0.1.0-py3-none-any.whl (28.5 kB)

Uploaded Feb 13, 2026 Python 3

File details

Details for the file chinvat-0.1.0.tar.gz.

File metadata

Download URL: chinvat-0.1.0.tar.gz
Upload date: Feb 13, 2026
Size: 26.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for chinvat-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`66a527bb0df18204b7c0031a0281a41a78c4a477ff46c7b154f9199e510eb7b8`
MD5	`f02bb17ce3b5effd3bee7ff965258036`
BLAKE2b-256	`6b9ea9d0ce86f1951b914d0f13f952b54d696751ae2e1a1f69d48584383aecc7`
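
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; the helper names `sha256_of_file` and `matches_expected` are hypothetical, and the local file path is whatever you saved the download as.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 listed above for chinvat-0.1.0.tar.gz
EXPECTED_SDIST_SHA256 = "66a527bb0df18204b7c0031a0281a41a78c4a477ff46c7b154f9199e510eb7b8"


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 against an expected hex digest."""
    return sha256_of_file(path) == expected_hex.lower()
```

For example, `matches_expected("chinvat-0.1.0.tar.gz", EXPECTED_SDIST_SHA256)` should return `True` for an intact download of the source distribution.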

File details

Details for the file chinvat-0.1.0-py3-none-any.whl.

File metadata

Download URL: chinvat-0.1.0-py3-none-any.whl
Upload date: Feb 13, 2026
Size: 28.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for chinvat-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c378be7fc1f18278cf77bd1ce99f0779652bae2a40f5b3b575a81fec590b51f4`
MD5	`5b61b972f898eeebc501a5b9e4ae90fe`
BLAKE2b-256	`a23917713f7facef4bf20ec6faa28dd203d5d247b2263130fe4a5832c2a70a6c`
