
Privacy-first document intelligence engine — converts PDFs, DOCX, PPTX, XLSX, and CSV into AI-ready Markdown + structured JSON for RAG pipelines.


LongParser

Privacy-first document intelligence engine for production RAG pipelines.

Parse PDFs, DOCX, PPTX, XLSX & CSV → validated, AI-ready chunks with HITL review.



Features

Feature | Detail
Multi-format extraction | PDF, DOCX, PPTX, XLSX, CSV via Docling & Marker
Hybrid chunking | Token-aware, heading-hierarchy-aware, table-aware
Semantic chunking | Embedding-based boundaries using all-MiniLM-L6-v2
Cross-referencing | Deterministic linking of explicit and implicit charts/figures
Quality scoring | Zero-ML heuristic scoring with dictionary & fastText validation
PII redaction | Hybrid Regex + NER (spaCy) redaction with secure HITL preservation
Summary chunks | Async ARQ worker generating hierarchical LLM section summaries
HITL review | Human-in-the-Loop block & chunk editing before embedding
LangGraph HITL | approve / edit / reject workflow with LangGraph interrupt() and MongoDB checkpointer
3-layer memory | Short-term turns + rolling summary + long-term facts
Multi-provider LLM | OpenAI, Gemini, Groq, OpenRouter
Multi-backend vectors | Chroma, FAISS, Qdrant
Production-ready API | FastAPI + Motor (MongoDB) + ARQ + Redis (Queue & Rate Limiting)
Enterprise Security | Tenant isolation, Role-Based Access Control (RBAC), and CORS
LangChain adapters | Drop-in BaseRetriever and LlamaIndex QueryEngine
Privacy-first | All processing runs locally; no data leaves your infra
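
To make the regex half of the PII redaction feature concrete, here is a minimal sketch. It is not LongParser's implementation: the PATTERNS table, the redact function, and the placeholder format are hypothetical, the NER (spaCy) half is omitted entirely, and the two patterns are deliberately narrow. Returning the findings alongside the redacted text is one assumed way a reviewer could still see what was removed.

```python
import re

# Illustrative patterns only (hypothetical); real PII detection needs far broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text):
    """Replace regex matches with typed placeholders.

    Returns (redacted_text, findings), where findings is a list of
    (label, original_value) pairs kept for later human review.
    """
    findings = []
    for label, pattern in PATTERNS.items():
        def _substitute(match, label=label):
            # Record the original value before it is replaced.
            findings.append((label, match.group(0)))
            return f"[{label}]"

        text = pattern.sub(_substitute, text)
    return text, findings


redacted, found = redact("Contact jane@example.com or +1 555-123-4567.")
print(redacted)
```

In a hybrid setup, an NER pass would typically run in addition to this step to catch names and addresses that fixed patterns cannot describe.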

Installation

Quick install (recommended)

pip install "longparser[gpu]"

Includes everything — server, embeddings, vector DB, OCR, LangChain, LlamaIndex. Works on CPU machines too; torch just runs in CPU mode automatically.

Core SDK only (no server, no torch)

pip install longparser

Pick only what you need

Extra | What it adds
server | FastAPI + MongoDB + Redis + LangChain chat
embeddings-gpu | sentence-transformers (GPU)
embeddings-cpu | sentence-transformers (CPU-only torch)
faiss-gpu | FAISS GPU vector store
faiss-cpu | FAISS CPU vector store
chroma | ChromaDB
qdrant | Qdrant
latex-ocr-gpu | pix2tex equation OCR (GPU)
latex-ocr-cpu | pix2tex equation OCR (CPU)
langchain | LangChain core adapter
llamaindex | LlamaIndex reader adapter
gpu | All of the above, in one command
cpu | All of the above, with CPU-only torch

Advanced: CPU-only install (save ~1.8 GB)

For Docker images, edge devices, or CI environments where CUDA isn't needed:

# Step 1 — CPU torch (~230 MB vs ~2 GB for CUDA)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

# Step 2 — LongParser CPU bundle
pip install "longparser[cpu]"

Quick Start

Python SDK

from longparser import DocumentPipeline, ProcessingConfig

pipeline = DocumentPipeline(ProcessingConfig())
result = pipeline.process_file("document.pdf")

print(f"Pages: {result.document.metadata.total_pages}")
print(f"Chunks: {len(result.chunks)}")
print(result.chunks[0].text)

REST API

# 1. Copy and edit configuration
cp .env.example .env

# 2. Start services (MongoDB + Redis)
docker-compose up -d mongo redis

# 3. Start the API
uv run uvicorn longparser.server.app:app --reload --port 8000

# 4. Upload a document
curl -X POST http://localhost:8000/jobs \
  -H "X-API-Key: your-key" \
  -F "file=@document.pdf"

# 5. Check job status
curl http://localhost:8000/jobs/{job_id} -H "X-API-Key: your-key"

# 6. Finalize and embed
curl -X POST http://localhost:8000/jobs/{job_id}/finalize \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"finalize_policy": "approve_all_pending"}'

curl -X POST http://localhost:8000/jobs/{job_id}/embed \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"provider": "huggingface", "model": "BAAI/bge-base-en-v1.5", "vector_db": "chroma"}'

# 7. Chat with the document
curl -X POST http://localhost:8000/chat/sessions \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"job_id": "your-job-id"}'

curl -X POST http://localhost:8000/chat \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"session_id": "...", "job_id": "...", "question": "What is the refund policy?"}'
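
The JSON calls in steps 5 and 6 can also be scripted from Python. The sketch below uses only the standard library and mirrors the endpoints, header, and payloads shown in the curl commands. The helper names (build_request, send, finalize_and_embed_requests) are hypothetical, and it assumes the API returns JSON bodies. The multipart upload in step 4 is left out.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_request(path, api_key, payload=None, base_url=BASE_URL):
    """Build (but do not send) a request carrying the X-API-Key header.

    A payload makes it a JSON POST; no payload makes it a GET.
    """
    headers = {"X-API-Key": api_key}
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    method = "POST" if payload is not None else "GET"
    return urllib.request.Request(base_url + path, data=data, headers=headers, method=method)


def send(request):
    """Send a prepared request; assumes the server answers with a JSON body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def finalize_and_embed_requests(job_id, api_key):
    """Prepare the two step-6 requests with the payloads from the curl examples."""
    return [
        build_request(
            f"/jobs/{job_id}/finalize",
            api_key,
            {"finalize_policy": "approve_all_pending"},
        ),
        build_request(
            f"/jobs/{job_id}/embed",
            api_key,
            {"provider": "huggingface", "model": "BAAI/bge-base-en-v1.5", "vector_db": "chroma"},
        ),
    ]
```

With the server running, something like `for req in finalize_and_embed_requests(job_id, "your-key"): send(req)` would issue both calls in order. Keeping request construction separate from sending makes the payloads easy to inspect before anything goes over the wire.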

Architecture

Document → Extract → Validate → HITL Review → Chunk → Embed → Index
                                                              ↓
                                             Chat → RAG → LLM → Answer

Pipeline Stages

  1. Extract — Docling converts PDF/DOCX/etc. into structured Block objects
  2. Validate — Per-page confidence scoring and RTL detection
  3. HITL Review — Human approves/edits/rejects blocks and chunks via the API
  4. Chunk — HybridChunker builds token-aware RAG chunks with section hierarchy
  5. Embed — Embedding engine (HuggingFace / OpenAI) produces vectors, which are stored in Chroma/FAISS/Qdrant
  6. Chat — LCEL chain with 3-layer memory and citation validation
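
The chunking stage above can be illustrated with a small, self-contained sketch of token-aware, heading-aware chunking. This is not LongParser's HybridChunker: the Chunk dataclass, the chunk_blocks function, the (kind, level, text) input shape, and the whitespace token count are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    section_path: list[str]  # heading hierarchy the chunk belongs to
    text: str
    token_count: int


def count_tokens(text: str) -> int:
    # Whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def chunk_blocks(blocks, max_tokens=50):
    """Group paragraphs into chunks without crossing heading boundaries.

    blocks: iterable of (kind, level, text), kind being "heading" or "paragraph".
    A single paragraph larger than max_tokens becomes its own oversized chunk.
    """
    chunks = []
    path = []      # current heading hierarchy, e.g. ["Intro", "Details"]
    buffer = []    # paragraphs collected for the chunk in progress
    buffered = 0   # token count of the buffer

    def flush():
        nonlocal buffer, buffered
        if buffer:
            chunks.append(Chunk(list(path), "\n".join(buffer), buffered))
            buffer, buffered = [], 0

    for kind, level, text in blocks:
        if kind == "heading":
            flush()                # close the chunk before the section changes
            del path[level - 1:]   # drop headings at this depth and deeper
            path.append(text)
            continue
        n = count_tokens(text)
        if buffer and buffered + n > max_tokens:
            flush()                # token budget exceeded, start a new chunk
        buffer.append(text)
        buffered += n
    flush()
    return chunks
```

Carrying the section path on every chunk means a retriever can show where an answer came from even when the chunk text itself never repeats the heading.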

Project Structure

src/longparser/
├── schemas.py           ← core Pydantic models (Document, Block, Chunk, …)
├── extractors/          ← Docling, LaTeX OCR backends
├── chunkers/            ← HybridChunker
├── pipeline/            ← DocumentPipeline
├── integrations/        ← LangChain loader & LlamaIndex reader
├── utils/               ← shared helpers (RTL detection, …)
└── server/              ← REST API layer
    ├── app.py           ← FastAPI application (all routes)
    ├── db.py            ← Motor async MongoDB
    ├── queue.py         ← ARQ/Redis job queue
    ├── worker.py        ← ARQ background worker
    ├── embeddings.py    ← HuggingFace / OpenAI embedding engine
    ├── vectorstores.py  ← Chroma / FAISS / Qdrant adapters
    └── chat/            ← RAG chat engine
        ├── engine.py    ← ChatEngine (LCEL + 3-layer memory)
        ├── graph.py     ← LangGraph HITL workflow
        ├── schemas.py   ← chat Pydantic models
        ├── retriever.py ← LangChain BaseRetriever adapter
        ├── llm_chain.py ← multi-provider LLM factory
        └── callbacks.py ← observability callbacks

LangChain Integration

from longparser.integrations.langchain import LongParserLoader

loader = LongParserLoader("report.pdf")
docs = loader.load()  # list[langchain_core.documents.Document]

LlamaIndex Integration

from longparser.integrations.llamaindex import LongParserReader

reader = LongParserReader()
docs = reader.load_data("report.pdf")

Configuration

Copy .env.example to .env and set:

Variable | Default | Description
LONGPARSER_MONGO_URL | mongodb://localhost:27017 | MongoDB connection
LONGPARSER_REDIS_URL | redis://localhost:6379 | Redis for job queue & rate limits
LONGPARSER_LLM_PROVIDER | openai | LLM provider
LONGPARSER_LLM_MODEL | gpt-5.3 | Model name
LONGPARSER_EMBED_PROVIDER | huggingface | Embedding provider
LONGPARSER_VECTOR_DB | chroma | Vector store backend
LONGPARSER_CORS_ORIGINS | * | Allowed CORS origins
LONGPARSER_RATE_LIMIT | 60 | Max RPM per tenant
LONGPARSER_ADMIN_KEYS | (empty) | Comma-separated admin API keys
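
As a rough illustration of how the variables above might be read with their documented defaults, here is a standard-library sketch. The Settings dataclass and load_settings function are hypothetical and do not reflect how LongParser itself loads configuration. Only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    mongo_url: str
    redis_url: str
    llm_provider: str
    llm_model: str
    embed_provider: str
    vector_db: str
    cors_origins: str
    rate_limit: int        # max requests per minute per tenant
    admin_keys: list


def _split_csv(value):
    # "k1, k2" -> ["k1", "k2"]; an empty string yields an empty list.
    return [item.strip() for item in value.split(",") if item.strip()]


def load_settings(env=None):
    """Read LONGPARSER_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        mongo_url=env.get("LONGPARSER_MONGO_URL", "mongodb://localhost:27017"),
        redis_url=env.get("LONGPARSER_REDIS_URL", "redis://localhost:6379"),
        llm_provider=env.get("LONGPARSER_LLM_PROVIDER", "openai"),
        llm_model=env.get("LONGPARSER_LLM_MODEL", "gpt-5.3"),
        embed_provider=env.get("LONGPARSER_EMBED_PROVIDER", "huggingface"),
        vector_db=env.get("LONGPARSER_VECTOR_DB", "chroma"),
        cors_origins=env.get("LONGPARSER_CORS_ORIGINS", "*"),
        rate_limit=int(env.get("LONGPARSER_RATE_LIMIT", "60")),
        admin_keys=_split_csv(env.get("LONGPARSER_ADMIN_KEYS", "")),
    )
```

Passing a plain dict instead of os.environ, for example `load_settings({"LONGPARSER_RATE_LIMIT": "120"})`, makes it easy to check an override without touching the real environment.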

Running with Docker

docker-compose up

API available at http://localhost:8000 · Docs at http://localhost:8000/docs


Testing

# Install dev dependencies
uv sync --extra dev

# Run unit tests
uv run pytest tests/unit/ -v

# Run with coverage
uv run pytest tests/ --cov=src/longparser --cov-report=term-missing

Contributing

See CONTRIBUTING.md for development setup and PR guidelines.

Security

See SECURITY.md for vulnerability reporting.

License

MIT — Copyright © 2026 ENDEVSOLS
