Skip to main content

PDF to AI Agent Knowledge Bridge - Convert PDFs to enhanced Markdown/JSON

Project description

Anji-Bridge (安济桥)

PDF to AI Agent Knowledge Bridge

Python 3.10+ MIT License PyPI

Convert PDFs to enhanced, AI-agent-ready Markdown/JSON documents.

FeaturesQuick StartInstallationUsage


English | 中文


What is Anji?

Anji bridges the gap between PDFs designed for human reading and the structured, semantic text that AI agents require. It leverages:

  • PaddleOCR-VL for high-quality PDF-to-Markdown conversion
  • Ovis2.5-9B Vision-Language Model for intelligent image analysis
  • Mistune for flexible AST manipulation

Features

Feature Description
Smart OCR Extracts text, tables, and images with layout awareness
VLM Image Analysis Generates captions and descriptions for embedded images
Decorative Filtering Removes logos, watermarks, and noise automatically
Heading Correction(developing) Fixes OCR-generated heading hierarchy issues
Multi-Format Output Export to Markdown, JSON, or structured data
Batch Processing Efficiently process multiple PDFs
Flexible Pipeline Run full pipeline or individual steps
Base64 Embedding Embed images as base64 data URLs in markdown

Quick Start

# Install
pip install -e .

# Convert a PDF
anji pipeline document.pdf output/

# Embed images as base64 (single portable file)
anji pipeline document.pdf output/ --embed-base64

# Or use as a Python library
python -c "
from anji import run_full_pipeline
run_full_pipeline('document.pdf', 'output/')
"

Installation

# Basic installation
pip install -e .

# With development dependencies
pip install -e ".[dev]"

Prerequisites

Anji requires two external services running:

1. PaddleOCR-VL Server (port 8118)

Requires GPU. Run using Docker:

docker run \
    -it \
    --rm \
    --gpus all \
    --network host \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu \
    paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --host 0.0.0.0 --port 8118 --backend vllm

2. Ovis2.5-9B VLM Server (port 8000)

Requires GPU with ~16GB VRAM. Run with vLLM:

vllm serve AIDC-AI/Ovis2.5-9B \
    --trust-remote-code \
    --port 8000 \
    --gpu-memory-utilization 0.4

Note: If you encounter RuntimeError: Exception from the 'vlm' worker: only 0-dimensional arrays can be converted to Python scalars, install numpy==1.26.4.

Usage

Command Line

# Full pipeline
anji pipeline input.pdf output_dir

# Batch processing
anji batch output_base file1.pdf file2.pdf file3.pdf

# Individual steps
anji pdf input.pdf output_dir          # PDF → Markdown
anji image input.md output.md          # Analyze images
anji md enhance input.md output.md     # Enhance AST
anji md export input.md out --format json  # Export

Output Options

# Keep images folder (default: enabled)
anji pipeline input.pdf output/ --keep-images

# Disable images folder
anji pipeline input.pdf output/ --no-keep-images

# Embed images as base64 (single portable markdown file)
anji pipeline input.pdf output/ --embed-base64

# Combine options
anji pipeline input.pdf output/ --embed-base64 --no-keep-images

Python API

from anji import Pipeline, run_full_pipeline, batch_pipeline

# Simple usage
run_full_pipeline("document.pdf", "output/")

# Advanced usage
pipeline = Pipeline(
    paddleocr_server_url="http://localhost:8118/v1",
    vlm_server_url="http://localhost:8000/v1"
)

outputs = pipeline.run(
    input_path="document.pdf",
    output_folder="output",
    output_format="both",  # markdown, json, structured, or both
    keep_images=True,  # keep imgs folder
    embed_base64=False,  # or True for single file
)

# Batch processing
batch_pipeline(
    input_paths=["doc1.pdf", "doc2.pdf"],
    output_base_folder="batch_output"
)

pipeline.close()

Output Structure

output/
└── document_name/
    └── enhanced/
        ├── document.md     # Enhanced Markdown
        ├── document.json   # JSON AST (optional)
        └── imgs/          # Extracted images (optional)
            ├── image1.jpg
            └── image2.jpg

With --embed-base64, images are embedded directly in the markdown file as base64 data URLs.

How It Works

Anji processes PDFs through 4 stages:

  1. PDF → Markdown - Uses PaddleOCR-VL to extract text, tables, and images
  2. Markdown → AST - Parses markdown into an abstract syntax tree using Mistune
  3. Enhance - Analyzes images with VLM, fixes heading levels, filters decorative elements
  4. Export - Outputs as Markdown, JSON, or structured data

Configuration

Environment Variables

Variable Default Description
API_BASE_URL http://localhost:8000/v1 VLM server URL
API_KEY abc-123 VLM API key
MODEL_NAME AIDC-AI/Ovis2.5-9B VLM model name

CLI Options

anji pipeline input.pdf output/ \
  --format markdown|json|structured|both \
  --no-enhance \
  --no-fix-headings \
  --no-filter-decorative \
  --no-enrich-images \
  --keep-images \
  --embed-base64 \
  --dummy  # Test without API calls

Development

# Code formatting
black anji/

# Linting
ruff check anji/

# Type checking
mypy anji/

# Testing
pytest

Project Structure

anji/
├── anji/              # Main package
│   ├── __init__.py       # Exports
│   ├── main.py           # CLI entry point
│   ├── cli.py            # Command-line interface
│   ├── pipeline.py       # Pipeline orchestration
│   ├── pdf_converter.py  # PDF → Markdown
│   ├── image_analyzer.py # VLM image analysis
│   ├── ast_handler.py    # AST manipulation
│   ├── enhancement.py    # AST enhancement
│   └── exporters.py      # Export utilities
├── pyproject.toml        # Package configuration
├── README.md             # English documentation
├── README_CN.md          # Chinese documentation
├── CLAUDE.md             # Claude Code context
└── .gitignore

License

MIT License. See LICENSE for details.

Contributing

Contributions are welcome! Please read CLAUDE.md for development guidelines.


Built for AI agents, by AI agents

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

anji_bridge-0.1.0.tar.gz (27.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

anji_bridge-0.1.0-py3-none-any.whl (29.3 kB view details)

Uploaded Python 3

File details

Details for the file anji_bridge-0.1.0.tar.gz.

File metadata

  • Download URL: anji_bridge-0.1.0.tar.gz
  • Upload date:
  • Size: 27.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for anji_bridge-0.1.0.tar.gz
Algorithm Hash digest
SHA256 69dc0ff05de3af83c75afcd9c748f951e283b2f27c8bb023e18c9b737c16c91a
MD5 a71ff655eeb558f7899d47948b14c9d0
BLAKE2b-256 189d874f6da7fe978dda66a4d948572ebe13e635d380e1f9568b02a82385480d

See more details on using hashes here.

File details

Details for the file anji_bridge-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: anji_bridge-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 29.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for anji_bridge-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8091421492edf18443ccdba1f232d65247a661744cd01c323e304a8ccd038dcb
MD5 4ea102c0ee9482f6a051102f52e22b4b
BLAKE2b-256 193657d59af7da86e0a6d04a4da2d07f2e58d5c1c7417dadaea1c2b0951af2e7

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page