Privacy-first document intelligence engine — converts PDFs, DOCX, PPTX, XLSX, and CSV into AI-ready Markdown + structured JSON for RAG pipelines.
LongParser
Privacy-first document intelligence engine for production RAG pipelines.
Parse PDFs, DOCX, PPTX, XLSX & CSV → validated, AI-ready chunks with HITL review.
Features
| Feature | Detail |
|---|---|
| Multi-format extraction | PDF, DOCX, PPTX, XLSX, CSV via Docling |
| Hybrid chunking | Token-aware, heading-hierarchy-aware, table-aware |
| HITL review | Human-in-the-Loop block & chunk editing before embedding |
| LangGraph HITL | approve / edit / reject workflow with LangGraph interrupt() |
| 3-layer memory | Short-term turns + rolling summary + long-term facts |
| Multi-provider LLM | OpenAI, Gemini, Groq, OpenRouter |
| Multi-backend vectors | Chroma, FAISS, Qdrant |
| Async-first API | FastAPI + Motor (MongoDB) + ARQ (Redis) |
| LangChain adapters | Drop-in BaseRetriever and LlamaIndex QueryEngine |
| Privacy-first | All processing runs locally; no data leaves your infra |
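The "3-layer memory" row can be pictured with a small sketch. This is an illustration only, not LongParser's ChatEngine: the class name `LayeredMemory`, its fields, and the truncation-based "summary" are all assumptions made for the example. A real system would presumably ask an LLM to write the rolling summary.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LayeredMemory:
    """Toy 3-layer chat memory; every name here is hypothetical."""

    max_turns: int = 4            # size of the short-term window
    summary_chars: int = 200      # cap on the rolling summary length
    turns: deque = field(default_factory=deque)   # layer 1: recent turns
    summary: str = ""                              # layer 2: rolling summary
    facts: list = field(default_factory=list)     # layer 3: long-term facts

    def add_turn(self, role: str, text: str) -> None:
        """Append a turn; fold the oldest turn into the summary on overflow."""
        self.turns.append((role, text))
        while len(self.turns) > self.max_turns:
            old_role, old_text = self.turns.popleft()
            # Stand-in for an LLM summarizer: append, then keep the tail.
            merged = f"{self.summary} {old_role}: {old_text}".strip()
            self.summary = merged[-self.summary_chars:]

    def remember(self, fact: str) -> None:
        """Store a long-term fact once."""
        if fact not in self.facts:
            self.facts.append(fact)

    def context(self) -> str:
        """Assemble prompt context from all three layers."""
        parts = []
        if self.facts:
            parts.append("Facts: " + "; ".join(self.facts))
        if self.summary:
            parts.append("Summary: " + self.summary)
        parts.extend(f"{role}: {text}" for role, text in self.turns)
        return "\n".join(parts)
```

The design point is that only the short-term window grows with every message; older turns are compressed and durable facts are stored separately, so the prompt stays bounded.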
Installation
Quick install (recommended)
pip install "longparser[gpu]"
Includes everything — server, embeddings, vector DB, OCR, LangChain, LlamaIndex. Works on CPU machines too; torch just runs in CPU mode automatically.
Core SDK only (no server, no torch)
pip install longparser
Pick only what you need
| Extra | What it adds |
|---|---|
| server | FastAPI + MongoDB + Redis + LangChain chat |
| embeddings-gpu | sentence-transformers (GPU) |
| embeddings-cpu | sentence-transformers (CPU-only torch) |
| faiss-gpu | FAISS GPU vector store |
| faiss-cpu | FAISS CPU vector store |
| chroma | ChromaDB |
| qdrant | Qdrant |
| latex-ocr-gpu | pix2tex equation OCR (GPU) |
| latex-ocr-cpu | pix2tex equation OCR (CPU) |
| langchain | LangChain core adapter |
| llamaindex | LlamaIndex reader adapter |
| gpu | All of the above in one command |
| cpu | All of the above with CPU-only torch |
Advanced: CPU-only install (save ~1.8 GB)
For Docker images, edge devices, or CI environments where CUDA isn't needed:
# Step 1 — CPU torch (~230 MB vs ~2 GB for CUDA)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
# Step 2 — LongParser CPU bundle
pip install "longparser[cpu]"
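After installing, it can be useful to confirm which optional dependencies actually landed in the environment. The sketch below uses only the standard library. The mapping from extras to import names is an assumption based on the usual top-level module of each package, not something taken from LongParser's packaging metadata.

```python
from importlib.util import find_spec

# Assumed mapping from a few extras to the top-level module each one
# is expected to provide; adjust if the packaging differs.
EXTRA_MODULES = {
    "server": "fastapi",
    "embeddings-gpu / embeddings-cpu": "sentence_transformers",
    "faiss-gpu / faiss-cpu": "faiss",
    "chroma": "chromadb",
    "qdrant": "qdrant_client",
    "langchain": "langchain_core",
    "llamaindex": "llama_index",
}


def installed_extras(modules=EXTRA_MODULES):
    """Return {extra: bool} saying whether each module can be located."""
    return {extra: find_spec(name) is not None for extra, name in modules.items()}


if __name__ == "__main__":
    for extra, present in installed_extras().items():
        print(f"{extra:35s} {'installed' if present else 'missing'}")
```

`find_spec` only locates a module without importing it, so the check stays fast even when heavy packages such as torch are present.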
Quick Start
Python SDK
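from longparser import PipelineOrchestrator, ProcessingConfig

# Build a pipeline with the default configuration
pipeline = PipelineOrchestrator()
# Run the pipeline on a single document
result = pipeline.process_file("document.pdf")
# Inspect the parsed document and the generated chunks
print(f"Pages: {result.document.metadata.total_pages}")
print(f"Chunks: {len(result.chunks)}")
print(result.chunks[0].text)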
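To process a whole folder rather than one file, the `process_file` call above can be wrapped in a small loop. The helpers below are a hypothetical sketch: `iter_supported` and `process_folder` are not part of LongParser, and the extension set is simply the list of formats named in this README.

```python
from pathlib import Path

# Extensions taken from the formats listed in this README.
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx", ".csv"}


def iter_supported(root):
    """Yield files under `root` whose extension is a supported format."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            yield path


def process_folder(root, process_file):
    """Apply `process_file` to each supported file; collect results and errors."""
    results, errors = {}, {}
    for path in iter_supported(root):
        try:
            results[path.name] = process_file(str(path))
        except Exception as exc:  # keep going if one document fails
            errors[path.name] = exc
    return results, errors
```

With LongParser installed, this would be called as `process_folder("docs/", PipelineOrchestrator().process_file)`. Passing the callable in keeps the helper independent of any particular pipeline configuration.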
REST API
# 1. Copy and edit configuration
cp .env.example .env
# 2. Start services (MongoDB + Redis)
docker-compose up -d mongo redis
# 3. Start the API
uv run uvicorn longparser.server.app:app --reload --port 8000
# 4. Upload a document
curl -X POST http://localhost:8000/jobs \
-H "X-API-Key: your-key" \
-F "file=@document.pdf"
# 5. Check job status
curl http://localhost:8000/jobs/{job_id} -H "X-API-Key: your-key"
# 6. Finalize and embed
curl -X POST http://localhost:8000/jobs/{job_id}/finalize \
-H "X-API-Key: your-key" \
-H "Content-Type: application/json" \
-d '{"finalize_policy": "approve_all_pending"}'
curl -X POST http://localhost:8000/jobs/{job_id}/embed \
-H "X-API-Key: your-key" \
-H "Content-Type: application/json" \
-d '{"provider": "huggingface", "model": "BAAI/bge-base-en-v1.5", "vector_db": "chroma"}'
# 7. Chat with the document
curl -X POST http://localhost:8000/chat/sessions \
-H "X-API-Key: your-key" \
-H "Content-Type: application/json" \
-d '{"job_id": "your-job-id"}'
curl -X POST http://localhost:8000/chat \
-H "X-API-Key: your-key" \
-H "Content-Type: application/json" \
-d '{"session_id": "...", "job_id": "...", "question": "What is the refund policy?"}'
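The same calls can be issued from Python. The sketch below builds the two requests from step 6 with the standard library and does not send them, so it runs without a server. The endpoint paths, header name, and payloads are copied from the curl examples; the helper names `json_post` and `finalize_and_embed` are made up for illustration.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"   # matches the curl examples
API_KEY = "your-key"                 # placeholder, as in the curl examples


def json_post(path, payload, base_url=BASE_URL, api_key=API_KEY):
    """Build (but do not send) a JSON POST request with the API key header."""
    return Request(
        url=f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def finalize_and_embed(job_id):
    """Return the finalize and embed requests for a given job id."""
    finalize = json_post(
        f"/jobs/{job_id}/finalize",
        {"finalize_policy": "approve_all_pending"},
    )
    embed = json_post(
        f"/jobs/{job_id}/embed",
        {
            "provider": "huggingface",
            "model": "BAAI/bge-base-en-v1.5",
            "vector_db": "chroma",
        },
    )
    return finalize, embed
```

Once the API is running, each request can be sent with `urllib.request.urlopen(req)`. The response schema is not documented in this README, so parsing it is left out.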
Architecture
Document → Extract → Validate → HITL Review → Chunk → Embed → Index
↓
Chat → RAG → LLM → Answer
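The HITL Review step in the flow above gates what reaches chunking and embedding. A minimal sketch of an approve / edit / reject gate follows. It is not LongParser's implementation: `ReviewItem`, `review`, and `finalize` are hypothetical names, and the behavior given to the `approve_all_pending` policy is a guess from its name.

```python
from dataclasses import dataclass
from enum import Enum


class Status(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ReviewItem:
    """A block or chunk awaiting review (hypothetical shape)."""
    item_id: str
    text: str
    status: Status = Status.PENDING


def review(item, action, new_text=None):
    """Apply one reviewer action: 'approve', 'edit', or 'reject'."""
    if action == "approve":
        item.status = Status.APPROVED
    elif action == "edit":
        if new_text is None:
            raise ValueError("edit requires new_text")
        item.text = new_text
        item.status = Status.APPROVED   # assumption: an edit implies approval
    elif action == "reject":
        item.status = Status.REJECTED
    else:
        raise ValueError(f"unknown action: {action}")
    return item


def finalize(items, policy="approve_all_pending"):
    """Resolve leftover pending items, then return only approved ones."""
    if policy == "approve_all_pending":
        for item in items:
            if item.status is Status.PENDING:
                item.status = Status.APPROVED
    return [i for i in items if i.status is Status.APPROVED]
```

Only the items returned by `finalize` would move on to embedding, which is what keeps rejected content out of the vector store.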
Pipeline Stages
- Extract: Docling converts PDF/DOCX/etc. into structured Block objects
- Validate: Per-page confidence scoring and RTL detection
- HITL Review: A human approves, edits, or rejects blocks and chunks via the API
- Chunk: HybridChunker builds token-aware RAG chunks with section hierarchy
- Embed: The embedding engine (HuggingFace / OpenAI) produces vectors stored in Chroma/FAISS/Qdrant
- Chat: LCEL chain with 3-layer memory and citation validation
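To make the Chunk stage concrete, here is a toy version of token-aware, heading-aware chunking. It is a sketch under simplifying assumptions and not the HybridChunker itself: the `Block` class below is a stand-in rather than LongParser's schema, tokens are approximated by whitespace splitting, and table handling is omitted.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Simplified stand-in for an extracted block (not LongParser's schema)."""
    kind: str        # "heading" or "text"
    text: str
    level: int = 0   # heading depth, 1 = top level


def chunk_blocks(blocks, max_tokens=50):
    """Group text blocks into chunks that respect a token budget and never
    cross a heading boundary. A single block larger than the budget is kept
    whole rather than split."""
    chunks, path, buf, used = [], [], [], 0

    def flush():
        nonlocal buf, used
        if buf:
            chunks.append({"section": " > ".join(path), "text": " ".join(buf)})
        buf, used = [], 0

    for block in blocks:
        if block.kind == "heading":
            flush()                              # close the previous section
            # Trim the section path to the parent level, then add this heading.
            path[:] = path[: block.level - 1] + [block.text]
            continue
        n = len(block.text.split())
        if buf and used + n > max_tokens:        # budget exceeded: new chunk
            flush()
        buf.append(block.text)
        used += n
    flush()
    return chunks
```

Each chunk carries its section path (for example "Intro > Details"), which is the kind of hierarchy metadata a retriever can use for filtering or citation.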
Project Structure
src/longparser/
├── schemas.py ← core Pydantic models (Document, Block, Chunk, …)
├── extractors/ ← Docling, LaTeX OCR backends
├── chunkers/ ← HybridChunker
├── pipeline/ ← PipelineOrchestrator
├── integrations/ ← LangChain loader & LlamaIndex reader
├── utils/ ← shared helpers (RTL detection, …)
└── server/ ← REST API layer
    ├── app.py ← FastAPI application (all routes)
    ├── db.py ← Motor async MongoDB
    ├── queue.py ← ARQ/Redis job queue
    ├── worker.py ← ARQ background worker
    ├── embeddings.py ← HuggingFace / OpenAI embedding engine
    ├── vectorstores.py ← Chroma / FAISS / Qdrant adapters
    └── chat/ ← RAG chat engine
        ├── engine.py ← ChatEngine (LCEL + 3-layer memory)
        ├── graph.py ← LangGraph HITL workflow
        ├── schemas.py ← chat Pydantic models
        ├── retriever.py ← LangChain BaseRetriever adapter
        ├── llm_chain.py ← multi-provider LLM factory
        └── callbacks.py ← observability callbacks
LangChain Integration
from longparser.integrations.langchain import LongParserLoader
loader = LongParserLoader("report.pdf")
docs = loader.load() # list[langchain_core.documents.Document]
LlamaIndex Integration
from longparser.integrations.llamaindex import LongParserReader
reader = LongParserReader()
docs = reader.load_data("report.pdf")
Configuration
Copy .env.example to .env and set:
| Variable | Default | Description |
|---|---|---|
| LONGPARSER_MONGO_URL | mongodb://localhost:27017 | MongoDB connection |
| LONGPARSER_REDIS_URL | redis://localhost:6379 | Redis for job queue |
| LONGPARSER_LLM_PROVIDER | openai | LLM provider |
| LONGPARSER_LLM_MODEL | gpt-4o | Model name |
| LONGPARSER_EMBED_PROVIDER | huggingface | Embedding provider |
| LONGPARSER_VECTOR_DB | chroma | Vector store backend |
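For scripts that need the same settings outside the server, the variables can be read with their documented defaults. This is a sketch, not LongParser's own configuration loader: the `Settings` class and `load_settings` function are hypothetical, while the variable names and defaults are copied from the configuration table.

```python
import os
from dataclasses import dataclass

# Variable names and defaults copied from the configuration table.
DEFAULTS = {
    "LONGPARSER_MONGO_URL": "mongodb://localhost:27017",
    "LONGPARSER_REDIS_URL": "redis://localhost:6379",
    "LONGPARSER_LLM_PROVIDER": "openai",
    "LONGPARSER_LLM_MODEL": "gpt-4o",
    "LONGPARSER_EMBED_PROVIDER": "huggingface",
    "LONGPARSER_VECTOR_DB": "chroma",
}

PREFIX = "LONGPARSER_"


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; not LongParser's own config class."""
    mongo_url: str
    redis_url: str
    llm_provider: str
    llm_model: str
    embed_provider: str
    vector_db: str


def load_settings(env=None):
    """Read each variable from `env` (default: os.environ), else use DEFAULTS."""
    env = os.environ if env is None else env
    values = {
        # e.g. LONGPARSER_MONGO_URL -> mongo_url
        key[len(PREFIX):].lower(): env.get(key, default)
        for key, default in DEFAULTS.items()
    }
    return Settings(**values)
```

Accepting an explicit `env` mapping makes the loader easy to exercise in tests without touching the real process environment.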
Running with Docker
docker-compose up
API available at http://localhost:8000 · Docs at http://localhost:8000/docs
Testing
# Install dev dependencies
uv sync --extra dev
# Run unit tests
uv run pytest tests/unit/ -v
# Run with coverage
uv run pytest tests/ --cov=src/longparser --cov-report=term-missing
Contributing
See CONTRIBUTING.md for development setup and PR guidelines.
Security
See SECURITY.md for vulnerability reporting.
License
MIT — Copyright © 2026 ENDEVSOLS
File details
Details for the file longparser-0.1.1.tar.gz.
File metadata
- Download URL: longparser-0.1.1.tar.gz
- Upload date:
- Size: 88.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9155a7f139b46ce7ee6e15239042efc2fac923638a71f4ba8418c2f3d5db0952 |
| MD5 | b4e0d64962ea415f4d0276e35e890939 |
| BLAKE2b-256 | 92ca51a7b567d875857cdb9b79a950b3bb34f4f23f7c3498a0aacbdfd9cc00bb |
File details
Details for the file longparser-0.1.1-py3-none-any.whl.
File metadata
- Download URL: longparser-0.1.1-py3-none-any.whl
- Upload date:
- Size: 94.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1b3fbd144e73ac1adcefd9f1b4c9018bfbce5f72d68ee232fe9849eb3ab94f39 |
| MD5 | 8f66ea8a8f4a7d662c54f7227b11e45b |
| BLAKE2b-256 | fd1d00109417bea74368a4d5e51b6348bac9155eeccf47e3c6713e87d5aa1ea4 |
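A downloaded archive can be checked against the SHA256 digests listed above before installing it. The helper below is a generic standard-library sketch; the function names `file_digest` and `verify` are chosen for this example.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path, expected_hex, algorithm="sha256"):
    """Return True if the file's digest matches the published hex string."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For example, `verify("longparser-0.1.1.tar.gz", "9155a7f139b46ce7ee6e15239042efc2fac923638a71f4ba8418c2f3d5db0952")` should return True for an intact source distribution.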