PDF2TeX
High-performance RAG-based PDF to LaTeX conversion module for large documents (2000+ pages).
Features
- Intelligent PDF Extraction: Multi-path content processing with PyMuPDF, Nougat, and PaddleOCR
- Math-First Approach: 95%+ accuracy on mathematical content using neural equation recognition
- RAG-Powered Generation: Context-aware LaTeX synthesis with Hugging Face LLMs
- Distributed Processing: Ray-based parallel processing for high throughput
- Chapter-Based Output: One .tex file per chapter, plus a master document
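The chapter-based output in the last bullet can be pictured as a master file that pulls in one file per chapter. The sketch below is illustrative only: the function name `build_master_tex`, the `chapters/chapter_NN` naming, and the `book` document class are assumptions made for this example, not PDF2TeX's documented output layout.

```python
from pathlib import PurePosixPath


def build_master_tex(chapter_count: int, chapter_dir: str = "chapters") -> str:
    """Return the text of a master document that includes one file per chapter.

    Hypothetical layout: chapters/chapter_01.tex, chapters/chapter_02.tex, ...
    """
    lines = [
        r"\documentclass{book}",
        r"\usepackage{amsmath}",
        r"\begin{document}",
    ]
    for number in range(1, chapter_count + 1):
        # \include takes the path without the .tex extension
        path = PurePosixPath(chapter_dir) / f"chapter_{number:02d}"
        lines.append(rf"\include{{{path}}}")
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"
```

Keeping each chapter in its own file means a single chapter can be regenerated or recompiled (for example with `\includeonly`) without touching the rest of the book.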
Architecture
PDF Input → Extract + OCR → Chunk + Index → RAG + LLM → LaTeX Output
↓ ↓ ↓
Ray Distributed Workers Pool
↓
Qdrant | Redis | MinIO | Postgres
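As a rough illustration of the fan-out in the diagram, the sketch below splits a document into page batches, hands them to a worker pool, and merges the results in page order. It uses the standard library's `ThreadPoolExecutor` as a stand-in for Ray so that it runs without a cluster. The names `page_batches`, `extract_batch`, and `run_extraction`, and the batch size of 50, are hypothetical and not part of PDF2TeX.

```python
from concurrent.futures import ThreadPoolExecutor


def page_batches(total_pages: int, batch_size: int) -> list[range]:
    """Split pages 0..total_pages-1 into consecutive batches."""
    return [
        range(start, min(start + batch_size, total_pages))
        for start in range(0, total_pages, batch_size)
    ]


def extract_batch(pages: range) -> list[str]:
    """Placeholder for the Extract + OCR stage (hypothetical)."""
    return [f"text of page {page}" for page in pages]


def run_extraction(total_pages: int, batch_size: int = 50, workers: int = 4) -> list[str]:
    """Fan batches out to a worker pool and merge results in page order."""
    batches = page_batches(total_pages, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so pages stay sorted
        results = pool.map(extract_batch, batches)
        return [text for batch in results for text in batch]
```

In a Ray deployment the same shape would presumably apply, with remote tasks in place of pool submissions; that mapping is an assumption, not something this README specifies.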
Quick Start
Prerequisites
- Python 3.11+
- Docker & Docker Compose
- NVIDIA GPU (recommended)
- Hugging Face API token
Installation
# Clone repository
git clone https://github.com/pdf2tex/pdf2tex.git
cd pdf2tex
# Create virtual environment
python -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -e ".[dev]"
# Start infrastructure
docker-compose up -d
# Run conversion
pdf2tex convert input.pdf --output ./output
Configuration
Create a .env file:
# Hugging Face
HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxx
# Database
POSTGRES_URL=postgresql+asyncpg://pdf2tex:password@localhost:5432/pdf2tex
# Vector Store
QDRANT_URL=http://localhost:6333
# Redis
REDIS_URL=redis://localhost:6379
# MinIO
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin
# Ray
RAY_ADDRESS=auto
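For reference, one way to read these variables from Python is sketched below. It assumes the values have already been exported into the process environment (for example by Docker Compose or a dotenv loader); the `Settings` class and `load_settings` function are hypothetical names for this example and may not match how PDF2TeX loads its configuration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Field names mirror the variable names in the .env file, lowercased
    huggingface_token: str
    postgres_url: str
    qdrant_url: str
    redis_url: str
    minio_endpoint: str
    minio_access_key: str
    minio_secret_key: str
    ray_address: str


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of environment variables.

    Raises KeyError naming the first missing variable.
    """
    env = os.environ if env is None else env
    return Settings(
        **{name: env[name.upper()] for name in Settings.__dataclass_fields__}
    )
```

Failing fast on a missing variable surfaces a misconfigured .env file at startup rather than partway through a long conversion.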
Usage
CLI
# Convert a PDF
pdf2tex convert document.pdf --output ./output
# Resume failed conversion
pdf2tex resume doc_abc123
# Check status
pdf2tex status doc_abc123
API
# Start API server
uvicorn pdf2tex.api.app:app --host 0.0.0.0 --port 8000
# Submit document
curl -X POST http://localhost:8000/documents \
-F "file=@textbook.pdf"
# Check status
curl http://localhost:8000/documents/doc_abc123
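Because a conversion can run for a long time, a client will usually poll the status endpoint rather than check it once. The sketch below shows one way to do that from Python with only the standard library. The response schema is not documented here, so the `"status"` field and the terminal values `"completed"` and `"failed"` are assumptions, and `fetch_document` and `wait_for_document` are hypothetical helper names.

```python
import json
import time
import urllib.request


def fetch_document(base_url: str, doc_id: str) -> dict:
    """GET /documents/{doc_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/documents/{doc_id}") as response:
        return json.load(response)


def wait_for_document(doc_id, fetch, interval=5.0, max_polls=120,
                      done=("completed", "failed")):
    """Poll until the ASSUMED "status" field reaches a terminal value."""
    for _ in range(max_polls):
        document = fetch(doc_id)
        if document.get("status") in done:
            return document
        time.sleep(interval)
    raise TimeoutError(f"{doc_id} did not finish after {max_polls} polls")
```

Against a local server this could be called as `wait_for_document("doc_abc123", lambda d: fetch_document("http://localhost:8000", d))`. Passing the fetch function in keeps the polling loop easy to test without a running API.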
Python SDK
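import asyncio
from pdf2tex import PDF2TeX

async def main():
    converter = PDF2TeX()
    # convert() is awaited, so it has to run inside a coroutine
    result = await converter.convert("textbook.pdf", output_dir="./output")
    print(f"Converted {result.total_pages} pages in {result.duration}")

asyncio.run(main())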
Project Structure
pdf2tex/
├── src/pdf2tex/
│ ├── extraction/ # PDF parsing, OCR, math extraction
│ ├── chunking/ # Text splitting, chapter detection
│ ├── rag/ # Embeddings, vector store, retrieval
│ ├── generation/ # LLM integration, LaTeX synthesis
│ ├── pipeline/ # Orchestration, distributed workers
│ └── api/ # FastAPI endpoints
├── tests/
├── docker-compose.yml
└── pyproject.toml
Performance
| Document Size | Processing Time | Workers |
|---|---|---|
| 500 pages | ~20 min | 10 |
| 1000 pages | ~40 min | 20 |
| 2000 pages | ~72 min | 20 |
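To compare the rows, it helps to convert them into throughput. The short script below does that arithmetic using only the figures in the table; the `throughput` helper is introduced here for illustration.

```python
# (pages, minutes, workers) rows copied from the table above
BENCHMARKS = [(500, 20, 10), (1000, 40, 20), (2000, 72, 20)]


def throughput(pages: int, minutes: float, workers: int) -> tuple[float, float]:
    """Return (pages per minute, pages per minute per worker)."""
    per_minute = pages / minutes
    return per_minute, per_minute / workers


for pages, minutes, workers in BENCHMARKS:
    total, per_worker = throughput(pages, minutes, workers)
    print(f"{pages:>5} pages: {total:5.1f} pages/min, {per_worker:.2f} per worker")
```

By these figures, total throughput sits between roughly 25 and 28 pages per minute in all three rows, even though the worker count doubles from 10 to 20. The table alone does not say why; treat the numbers as approximate guidance rather than a scaling guarantee.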
License
MIT License - see LICENSE for details.