longparser

Privacy-first document intelligence engine — converts PDFs, DOCX, PPTX, XLSX, and CSV into AI-ready Markdown + structured JSON for RAG pipelines.

LongParser

Privacy-first document intelligence engine for production RAG pipelines.

Parse PDFs, DOCX, PPTX, XLSX & CSV → validated, AI-ready chunks with HITL review.

Features

Feature	Detail
Multi-format extraction	PDF, DOCX, PPTX, XLSX, CSV via Docling & Marker
Hybrid chunking	Token-aware, heading-hierarchy-aware, table-aware
Semantic chunking	Embedding-based boundaries using `all-MiniLM-L6-v2`
Cross-referencing	Deterministic linking of explicit and implicit references to charts and figures
Quality scoring	Zero-ML heuristic scoring with dictionary & fastText validation
PII redaction	Hybrid Regex + NER (spaCy) redaction with secure HITL preservation
Summary chunks	Async ARQ worker generating hierarchical LLM section summaries
HITL review	Human-in-the-Loop block & chunk editing before embedding
LangGraph HITL	`approve / edit / reject` workflow with LangGraph `interrupt()` and MongoDB checkpointer
3-layer memory	Short-term turns + rolling summary + long-term facts
Multi-provider LLM	OpenAI, Gemini, Groq, OpenRouter
Multi-backend vectors	Chroma, FAISS, Qdrant
Production-ready API	FastAPI + Motor (MongoDB) + ARQ + Redis (Queue & Rate Limiting)
Enterprise Security	Tenant isolation, Role-Based Access Control (RBAC), and CORS
Framework adapters	Drop-in LangChain `BaseRetriever` and LlamaIndex `QueryEngine`
Privacy-first	All processing runs locally; no data leaves your infra
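
To give a feel for the regex half of the hybrid PII redaction listed above, here is a minimal sketch. The patterns, the `redact` function, and the placeholder format are illustrative assumptions, not LongParser's actual implementation, and the NER (spaCy) pass is left out entirely.

```python
import re

# Hypothetical patterns for illustration only; real PII detection needs
# broader coverage plus the NER pass mentioned in the feature table.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace matches with numbered placeholders and return the mapping."""
    mapping: dict[str, str] = {}
    for label, pattern in PII_PATTERNS.items():
        def substitute(match: re.Match, label: str = label) -> str:
            placeholder = f"[{label}_{len(mapping) + 1}]"
            mapping[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(substitute, text)
    return text, mapping


redacted, originals = redact("Contact jane@example.com or +1 555 010 9999.")
print(redacted)
print(originals)
```

Keeping the placeholder-to-original mapping separate from the redacted text is one way a reviewer could still see the original values during HITL review while only redacted text moves downstream.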

Installation

Quick install (recommended)

pip install "longparser[gpu]"

Includes everything — server, embeddings, vector DB, OCR, LangChain, LlamaIndex. Works on CPU machines too; torch just runs in CPU mode automatically.

Core SDK only (no server, no torch)

pip install longparser

Pick only what you need

Extra	What it adds
`server`	FastAPI + MongoDB + Redis + LangChain chat
`embeddings-gpu`	`sentence-transformers` (GPU)
`embeddings-cpu`	`sentence-transformers` (CPU-only torch)
`faiss-gpu`	FAISS GPU vector store
`faiss-cpu`	FAISS CPU vector store
`chroma`	ChromaDB
`qdrant`	Qdrant
`latex-ocr-gpu`	`pix2tex` equation OCR (GPU)
`latex-ocr-cpu`	`pix2tex` equation OCR (CPU)
`langchain`	LangChain core adapter
`llamaindex`	LlamaIndex reader adapter
`gpu`	All of the above — one command
`cpu`	All of the above — CPU-only torch

Advanced: CPU-only install (save ~1.8 GB)

For Docker images, edge devices, or CI environments where CUDA isn't needed:

# Step 1 — CPU torch (~230 MB vs ~2 GB for CUDA)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

# Step 2 — LongParser CPU bundle
pip install "longparser[cpu]"
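
After either install path, it can be useful to confirm which torch build ended up in the environment. The helper below is a small sketch (the function name `torch_build_summary` is made up for this example); it relies only on `importlib` and on `torch.cuda.is_available()`, and it runs even when torch is absent.

```python
import importlib.util


def torch_build_summary() -> str:
    """Report whether torch is installed and whether CUDA is usable."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the check works without torch

    if torch.cuda.is_available():
        return f"torch {torch.__version__} with CUDA"
    return f"torch {torch.__version__} in CPU mode"


summary = torch_build_summary()
print(summary)
```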

Quick Start

Python SDK

from longparser import DocumentPipeline, ProcessingConfig

pipeline = DocumentPipeline(ProcessingConfig())
result = pipeline.process_file("document.pdf")

print(f"Pages: {result.document.metadata.total_pages}")
print(f"Chunks: {len(result.chunks)}")
print(result.chunks[0].text)
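
A common next step is to persist the chunks for a downstream indexer. The sketch below assumes only what the snippet above shows, namely that each chunk exposes a `.text` attribute; the `chunks_to_jsonl` helper and the record layout are hypothetical, and stand-in objects are used so the example runs without LongParser installed.

```python
import json
from types import SimpleNamespace
from typing import Iterable


def chunks_to_jsonl(chunks: Iterable) -> str:
    """Serialize objects exposing a `.text` attribute as JSON Lines."""
    lines = []
    for index, chunk in enumerate(chunks):
        record = {"index": index, "text": chunk.text}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in objects; in practice you would pass `result.chunks`
# from the Quick Start snippet instead.
demo_chunks = [SimpleNamespace(text="Refund policy"), SimpleNamespace(text="Shipping")]
jsonl = chunks_to_jsonl(demo_chunks)
print(jsonl)
```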

REST API

# 1. Copy and edit configuration
cp .env.example .env

# 2. Start services (MongoDB + Redis)
docker-compose up -d mongo redis

# 3. Start the API
uv run uvicorn longparser.server.app:app --reload --port 8000

# 4. Upload a document
curl -X POST http://localhost:8000/jobs \
  -H "X-API-Key: your-key" \
  -F "file=@document.pdf"

# 5. Check job status (replace {job_id} with the ID of the job created in step 4)
curl http://localhost:8000/jobs/{job_id} -H "X-API-Key: your-key"

# 6. Finalize and embed
curl -X POST http://localhost:8000/jobs/{job_id}/finalize \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"finalize_policy": "approve_all_pending"}'

curl -X POST http://localhost:8000/jobs/{job_id}/embed \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"provider": "huggingface", "model": "BAAI/bge-base-en-v1.5", "vector_db": "chroma"}'

# 7. Chat with the document
curl -X POST http://localhost:8000/chat/sessions \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"job_id": "your-job-id"}'

curl -X POST http://localhost:8000/chat \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"session_id": "...", "job_id": "...", "question": "What is the refund policy?"}'
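
The JSON calls above can also be scripted from Python. The following is a sketch using only the standard library; it reuses the endpoints, header name, and payloads from the curl commands, while the `build_json_post` helper and the placeholder values are assumptions for illustration. The requests are built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port used in the commands above


def build_json_post(path: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request with the API-key header."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


job_id = "your-job-id"  # placeholder; use the ID of your uploaded job
finalize = build_json_post(
    f"/jobs/{job_id}/finalize", "your-key", {"finalize_policy": "approve_all_pending"}
)
embed = build_json_post(
    f"/jobs/{job_id}/embed",
    "your-key",
    {"provider": "huggingface", "model": "BAAI/bge-base-en-v1.5", "vector_db": "chroma"},
)
print(finalize.get_method(), finalize.full_url)
print(embed.get_method(), embed.full_url)
# To send a request once the API is running: urllib.request.urlopen(finalize)
```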

Architecture

Document → Extract → Validate → HITL Review → Chunk → Embed → Index
                                                              ↓
                                             Chat → RAG → LLM → Answer

Pipeline Stages

Extract: Docling converts PDF, DOCX, and the other supported formats into structured Block objects.
Validate: per-page confidence scoring and right-to-left (RTL) text detection.
HITL Review: a human approves, edits, or rejects blocks and chunks via the API.
Chunk: HybridChunker builds token-aware RAG chunks that keep the section hierarchy.
Embed: the embedding engine (HuggingFace / OpenAI) produces vectors, which are stored in Chroma, FAISS, or Qdrant.
Chat: an LCEL chain with 3-layer memory and citation validation answers questions.
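
To make the Chunk stage more concrete, here is a toy sketch of heading-aware, token-budgeted chunking. It is not HybridChunker itself: the `Block` and `Chunk` classes, the `chunk_blocks` function, and the whitespace token count are all simplifying assumptions, and table handling is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Block:
    kind: str       # "heading" or "paragraph" (hypothetical block types)
    text: str
    level: int = 0  # heading depth starting at 1; 0 for non-headings


@dataclass
class Chunk:
    section_path: list[str]
    texts: list[str] = field(default_factory=list)

    @property
    def token_count(self) -> int:
        # Whitespace split stands in for a real tokenizer.
        return sum(len(t.split()) for t in self.texts)


def chunk_blocks(blocks: list[Block], max_tokens: int) -> list[Chunk]:
    """Group paragraphs under their heading path, splitting at a token budget."""
    chunks: list[Chunk] = []
    path: list[str] = []
    current: Optional[Chunk] = None
    for block in blocks:
        if block.kind == "heading":
            # Trim the path to the parent level, then descend into this heading.
            path = path[: max(block.level - 1, 0)] + [block.text]
            current = None  # a new section always starts a new chunk
            continue
        size = len(block.text.split())
        if current is None or current.token_count + size > max_tokens:
            current = Chunk(section_path=list(path))
            chunks.append(current)
        current.texts.append(block.text)
    return chunks


blocks = [
    Block("heading", "Policies", 1),
    Block("heading", "Refunds", 2),
    Block("paragraph", "Refunds are issued within thirty days."),
    Block("paragraph", "Contact support to start a claim."),
    Block("heading", "Shipping", 2),
    Block("paragraph", "Orders ship in two days."),
]
chunks = chunk_blocks(blocks, max_tokens=8)
for chunk in chunks:
    print(" > ".join(chunk.section_path), "|", chunk.token_count, "tokens")
```

In this sketch a single paragraph longer than `max_tokens` still becomes one oversized chunk; a production chunker would need to split inside the paragraph as well.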

Project Structure

src/longparser/
├── schemas.py           ← core Pydantic models (Document, Block, Chunk, …)
├── extractors/          ← Docling, LaTeX OCR backends
├── chunkers/            ← HybridChunker
├── pipeline/            ← DocumentPipeline
├── integrations/        ← LangChain loader & LlamaIndex reader
├── utils/               ← shared helpers (RTL detection, …)
└── server/              ← REST API layer
    ├── app.py           ← FastAPI application (all routes)
    ├── db.py            ← Motor async MongoDB
    ├── queue.py         ← ARQ/Redis job queue
    ├── worker.py        ← ARQ background worker
    ├── embeddings.py    ← HuggingFace / OpenAI embedding engine
    ├── vectorstores.py  ← Chroma / FAISS / Qdrant adapters
    └── chat/            ← RAG chat engine
        ├── engine.py    ← ChatEngine (LCEL + 3-layer memory)
        ├── graph.py     ← LangGraph HITL workflow
        ├── schemas.py   ← chat Pydantic models
        ├── retriever.py ← LangChain BaseRetriever adapter
        ├── llm_chain.py ← multi-provider LLM factory
        └── callbacks.py ← observability callbacks

LangChain Integration

from longparser.integrations.langchain import LongParserLoader

loader = LongParserLoader("report.pdf")
docs = loader.load()  # list[langchain_core.documents.Document]

LlamaIndex Integration

from longparser.integrations.llamaindex import LongParserReader

reader = LongParserReader()
docs = reader.load_data("report.pdf")

Configuration

Copy .env.example to .env and set:

Variable	Default	Description
`LONGPARSER_MONGO_URL`	`mongodb://localhost:27017`	MongoDB connection
`LONGPARSER_REDIS_URL`	`redis://localhost:6379`	Redis for job queue & rate limits
`LONGPARSER_LLM_PROVIDER`	`openai`	LLM provider
`LONGPARSER_LLM_MODEL`	`gpt-5.3`	Model name
`LONGPARSER_EMBED_PROVIDER`	`huggingface`	Embedding provider
`LONGPARSER_VECTOR_DB`	`chroma`	Vector store backend
`LONGPARSER_CORS_ORIGINS`	`*`	Allowed CORS origins
`LONGPARSER_RATE_LIMIT`	`60`	Maximum requests per minute (RPM) per tenant
`LONGPARSER_ADMIN_KEYS`	(empty)	Comma-separated admin API keys
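
For reference, the table can be read as a settings object with defaults. The loader below is a sketch, not the server's actual configuration code: the `Settings` class and `load_settings` function are hypothetical, and treating `LONGPARSER_CORS_ORIGINS` as a comma-separated list is an assumption (the table only states this format for admin keys).

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    mongo_url: str
    redis_url: str
    llm_provider: str
    llm_model: str
    embed_provider: str
    vector_db: str
    cors_origins: list[str]
    rate_limit: int
    admin_keys: list[str]


def _split_csv(value: str) -> list[str]:
    """Split a comma-separated string, dropping empty entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read LONGPARSER_* variables, falling back to the documented defaults."""
    def get(name: str, default: str) -> str:
        return env.get(f"LONGPARSER_{name}", default)

    return Settings(
        mongo_url=get("MONGO_URL", "mongodb://localhost:27017"),
        redis_url=get("REDIS_URL", "redis://localhost:6379"),
        llm_provider=get("LLM_PROVIDER", "openai"),
        llm_model=get("LLM_MODEL", "gpt-5.3"),
        embed_provider=get("EMBED_PROVIDER", "huggingface"),
        vector_db=get("VECTOR_DB", "chroma"),
        cors_origins=_split_csv(get("CORS_ORIGINS", "*")),  # assumed CSV format
        rate_limit=int(get("RATE_LIMIT", "60")),
        admin_keys=_split_csv(get("ADMIN_KEYS", "")),
    )


defaults = load_settings({})
custom = load_settings(
    {"LONGPARSER_RATE_LIMIT": "120", "LONGPARSER_ADMIN_KEYS": "key-a, key-b"}
)
print(defaults)
print(custom.rate_limit, custom.admin_keys)
```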

Running with Docker

docker-compose up

API available at http://localhost:8000 · Docs at http://localhost:8000/docs

Testing

# Install dev dependencies
uv sync --extra dev

# Run unit tests
uv run pytest tests/unit/ -v

# Run with coverage
uv run pytest tests/ --cov=src/longparser --cov-report=term-missing

Contributing

See CONTRIBUTING.md for development setup and PR guidelines.

Security

See SECURITY.md for vulnerability reporting.

License

Release history

Version	Released
0.1.5 (this version)	May 5, 2026
0.1.4	Apr 23, 2026
0.1.3	Apr 13, 2026
0.1.2	Apr 5, 2026
0.1.1	Apr 4, 2026
0.1.0	Apr 4, 2026

Download files

Source Distribution

longparser-0.1.5.tar.gz (112.5 kB), uploaded May 5, 2026

Built Distribution

longparser-0.1.5-py3-none-any.whl (123.9 kB), uploaded May 5, 2026, Python 3

File details

Details for the file longparser-0.1.5.tar.gz.

File metadata

Download URL: longparser-0.1.5.tar.gz
Upload date: May 5, 2026
Size: 112.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for longparser-0.1.5.tar.gz
Algorithm	Hash digest
SHA256	`8204e5268874ede61421c3033ffa2e282fe3c57aa6e0562de3ee4e7be8f8e27e`
MD5	`0b09506b8795de855c935dd99d64e958`
BLAKE2b-256	`99123e40bd789fe35214769d4678375bb75dd701f7d9a00408254611805e9c0f`
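
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a sketch; the helper names `sha256_of` and `verify_download` are made up for the example, and the file is assumed to sit in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files need little memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> bool:
    """Return True when the file's SHA256 matches the published digest."""
    return sha256_of(path) == expected_sha256.lower()


# Digest copied from the table above for longparser-0.1.5.tar.gz.
EXPECTED = "8204e5268874ede61421c3033ffa2e282fe3c57aa6e0562de3ee4e7be8f8e27e"
archive = Path("longparser-0.1.5.tar.gz")
if archive.exists():
    print("hash OK" if verify_download(archive, EXPECTED) else "hash MISMATCH")
else:
    print(f"{archive} not found; download it first")
```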

File details

Details for the file longparser-0.1.5-py3-none-any.whl.

File metadata

Download URL: longparser-0.1.5-py3-none-any.whl
Upload date: May 5, 2026
Size: 123.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for longparser-0.1.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aa75043b1ffa297eebb449f9cdc36218ac239b3a84ce8ea79a986fbc77c266f5`
MD5	`fe5389c2c7d3172e8e5f5293695acf9e`
BLAKE2b-256	`bb66dacb465b74d47da8cbe55b323ddf28fa723ed162e0108e9bebd13af6745e`
