PDF to AI Agent Knowledge Bridge - Convert PDFs to enhanced Markdown/JSON
Anji-Bridge (安济桥)
PDF to AI Agent Knowledge Bridge
Convert PDFs to enhanced, AI-agent-ready Markdown/JSON documents.
What is Anji?
Anji bridges the gap between PDFs designed for human reading and the structured, semantic text that AI agents require. It leverages:
- PaddleOCR-VL for high-quality PDF-to-Markdown conversion
- Ovis2.5-9B Vision-Language Model for intelligent image analysis
- Mistune for flexible AST manipulation
Features
| Feature | Description |
|---|---|
| Smart OCR | Extracts text, tables, and images with layout awareness |
| VLM Image Analysis | Generates captions and descriptions for embedded images |
| Decorative Filtering | Removes logos, watermarks, and noise automatically |
| Heading Correction (in development) | Fixes OCR-generated heading hierarchy issues |
| Multi-Format Output | Export to Markdown, JSON, or structured data |
| Batch Processing | Efficiently process multiple PDFs |
| Flexible Pipeline | Run full pipeline or individual steps |
| Base64 Embedding | Embed images as base64 data URLs in markdown |
Quick Start
# Install
pip install -e .
# Convert a PDF
anji pipeline document.pdf output/
# Embed images as base64 (single portable file)
anji pipeline document.pdf output/ --embed-base64
# Or use as a Python library
python -c "
from anji import run_full_pipeline
run_full_pipeline('document.pdf', 'output/')
"
Installation
# Basic installation
pip install -e .
# With development dependencies
pip install -e ".[dev]"
Prerequisites
Anji requires two external services running:
1. PaddleOCR-VL Server (port 8118)
Requires GPU. Run using Docker:
docker run \
-it \
--rm \
--gpus all \
--network host \
ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu \
paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --host 0.0.0.0 --port 8118 --backend vllm
2. Ovis2.5-9B VLM Server (port 8000)
Requires GPU with ~16GB VRAM. Run with vLLM:
vllm serve AIDC-AI/Ovis2.5-9B \
--trust-remote-code \
--port 8000 \
--gpu-memory-utilization 0.4
Note: If you encounter the following error, install numpy==1.26.4:
RuntimeError: Exception from the 'vlm' worker: only 0-dimensional arrays can be converted to Python scalars
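Before running the pipeline it can help to confirm that both servers are listening. The following is a minimal sketch that is not part of Anji: it only checks that a TCP connection to the two ports named above can be opened. The function names and service labels are hypothetical.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_services(host: str = "localhost") -> dict[str, bool]:
    # Ports come from the prerequisites above; the labels are illustrative only.
    services = {"paddleocr-vl": 8118, "ovis-vlm": 8000}
    return {name: is_port_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

An open port only shows that something is listening; it does not prove that the model has finished loading.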
Usage
Command Line
# Full pipeline
anji pipeline input.pdf output_dir
# Batch processing
anji batch output_base file1.pdf file2.pdf file3.pdf
# Individual steps
anji pdf input.pdf output_dir # PDF → Markdown
anji image input.md output.md # Analyze images
anji md enhance input.md output.md # Enhance AST
anji md export input.md out --format json # Export
Output Options
# Keep images folder (default: enabled)
anji pipeline input.pdf output/ --keep-images
# Disable images folder
anji pipeline input.pdf output/ --no-keep-images
# Embed images as base64 (single portable markdown file)
anji pipeline input.pdf output/ --embed-base64
# Combine options
anji pipeline input.pdf output/ --embed-base64 --no-keep-images
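To illustrate how a paired flag such as --keep-images / --no-keep-images can default to enabled while --embed-base64 stays opt-in, here is a small argparse sketch. It is an assumption about how such flags could be modelled, not Anji's actual CLI code; the parser name and argument names are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="anji-pipeline-sketch")
    parser.add_argument("input_pdf")
    parser.add_argument("output_dir")
    # BooleanOptionalAction (Python 3.9+) generates both --keep-images
    # and --no-keep-images from a single definition.
    parser.add_argument(
        "--keep-images", action=argparse.BooleanOptionalAction, default=True
    )
    # Plain opt-in switch: False unless the flag is given.
    parser.add_argument("--embed-base64", action="store_true")
    return parser


args = build_parser().parse_args(
    ["input.pdf", "output/", "--embed-base64", "--no-keep-images"]
)
print(args.keep_images, args.embed_base64)  # prints: False True
```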
Python API
from anji import Pipeline, run_full_pipeline, batch_pipeline
# Simple usage
run_full_pipeline("document.pdf", "output/")
# Advanced usage
pipeline = Pipeline(
    paddleocr_server_url="http://localhost:8118/v1",
    vlm_server_url="http://localhost:8000/v1",
)
outputs = pipeline.run(
    input_path="document.pdf",
    output_folder="output",
    output_format="both",  # markdown, json, structured, or both
    keep_images=True,      # keep imgs folder
    embed_base64=False,    # or True for single file
)
# Batch processing
batch_pipeline(
    input_paths=["doc1.pdf", "doc2.pdf"],
    output_base_folder="batch_output",
)
pipeline.close()
Output Structure
output/
└── document_name/
    └── enhanced/
        ├── document.md      # Enhanced Markdown
        ├── document.json    # JSON AST (optional)
        └── imgs/            # Extracted images (optional)
            ├── image1.jpg
            └── image2.jpg
With --embed-base64, images are embedded directly in the markdown file as base64 data URLs.
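The idea behind base64 embedding can be sketched with the standard library alone. This is not Anji's implementation: the function names and the regular expression are hypothetical, and the sketch only handles the plain `![alt](path)` image form.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches basic Markdown image syntax: ![alt](target)
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def to_data_url(image_path: Path) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(image_path.name)
    encoded = base64.b64encode(image_path.read_bytes()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"


def embed_images(markdown: str, base_dir: Path) -> str:
    """Replace local image links in markdown with inline data URLs."""

    def replace(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        # Leave remote URLs and already-embedded images untouched.
        if target.startswith(("http://", "https://", "data:")):
            return match.group(0)
        path = base_dir / target
        if not path.is_file():
            return match.group(0)  # keep the link if the file is missing
        return f"![{alt}]({to_data_url(path)})"

    return IMAGE_PATTERN.sub(replace, markdown)
```

Calling `embed_images(text, Path("output/document_name/enhanced"))` would inline every image found under that folder, which is why the result is a single portable file; base64 makes the encoded data roughly a third larger than the original image bytes.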
How It Works
Anji processes PDFs through 4 stages:
- PDF → Markdown - Uses PaddleOCR-VL to extract text, tables, and images
- Markdown → AST - Parses markdown into an abstract syntax tree using Mistune
- Enhance - Analyzes images with VLM, fixes heading levels, filters decorative elements
- Export - Outputs as Markdown, JSON, or structured data
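As an illustration of the Enhance stage, the sketch below shows one possible rule for repairing heading hierarchy: no heading may sit more than one level below the heading before it. This is an assumption, not Anji's algorithm (heading correction is still in development), and the token shape is only loosely modelled on a Mistune-style AST.

```python
from copy import deepcopy
from typing import Any

Token = dict[str, Any]


def normalize_heading_levels(tokens: list[Token]) -> list[Token]:
    """Return a copy where no heading is more than one level deeper
    than the heading that precedes it."""
    result = deepcopy(tokens)
    previous_level = 0  # no heading seen yet, so the first one becomes level 1
    for token in result:
        if token.get("type") != "heading":
            continue
        fixed = min(token["attrs"]["level"], previous_level + 1)
        token["attrs"]["level"] = fixed
        previous_level = fixed
    return result


# Hypothetical AST fragment with an OCR-style broken hierarchy (3, 5, 2).
doc = [
    {"type": "heading", "attrs": {"level": 3}},
    {"type": "paragraph"},
    {"type": "heading", "attrs": {"level": 5}},
    {"type": "heading", "attrs": {"level": 2}},
]
levels = [
    t["attrs"]["level"]
    for t in normalize_heading_levels(doc)
    if t["type"] == "heading"
]
print(levels)  # prints: [1, 2, 2]
```

Working on a deep copy keeps the original AST intact, so the raw OCR result can still be exported alongside the enhanced one.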
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| API_BASE_URL | http://localhost:8000/v1 | VLM server URL |
| API_KEY | abc-123 | VLM API key |
| MODEL_NAME | AIDC-AI/Ovis2.5-9B | VLM model name |
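A minimal sketch of how these variables and their defaults could be read in Python. The `VLMConfig` class and `load_vlm_config` function are hypothetical names used for illustration and are not part of Anji's API; only the variable names and default values come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class VLMConfig:
    api_base_url: str
    api_key: str
    model_name: str


def load_vlm_config(env: Optional[Mapping[str, str]] = None) -> VLMConfig:
    """Read VLM settings from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return VLMConfig(
        api_base_url=env.get("API_BASE_URL", "http://localhost:8000/v1"),
        api_key=env.get("API_KEY", "abc-123"),
        model_name=env.get("MODEL_NAME", "AIDC-AI/Ovis2.5-9B"),
    )
```

Passing a mapping explicitly, for example `load_vlm_config({"API_BASE_URL": "http://gpu-host:8000/v1"})`, makes the lookup easy to test without touching the real environment.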
CLI Options
anji pipeline input.pdf output/ \
--format markdown|json|structured|both \
--no-enhance \
--no-fix-headings \
--no-filter-decorative \
--no-enrich-images \
--keep-images \
--embed-base64 \
--dummy # Test without API calls
Development
# Code formatting
black anji/
# Linting
ruff check anji/
# Type checking
mypy anji/
# Testing
pytest
Project Structure
anji/
├── anji/                     # Main package
│   ├── __init__.py           # Exports
│   ├── main.py               # CLI entry point
│   ├── cli.py                # Command-line interface
│   ├── pipeline.py           # Pipeline orchestration
│   ├── pdf_converter.py      # PDF → Markdown
│   ├── image_analyzer.py     # VLM image analysis
│   ├── ast_handler.py        # AST manipulation
│   ├── enhancement.py        # AST enhancement
│   └── exporters.py          # Export utilities
├── pyproject.toml            # Package configuration
├── README.md                 # English documentation
├── README_CN.md              # Chinese documentation
├── CLAUDE.md                 # Claude Code context
└── .gitignore
License
MIT License. See LICENSE for details.
Contributing
Contributions are welcome! Please read CLAUDE.md for development guidelines.
Built for AI agents, by AI agents