Self-learning RAG field prediction SDK for PDF form filling.
ragpdf-sdk
Self-learning RAG field prediction for PDF form filling.
A fully open-source Python SDK that predicts PDF form field mappings using sentence-transformer embeddings and a dual-model ensemble (RAG + LLM). The vector database is updated after every prediction outcome, so matching should improve as more documents are processed.
What It Does
When filling a PDF form, every field box has context — surrounding text, section headers, position. This SDK learns to predict which standardized field name (e.g. investor_full_legal_name) maps to which field box, by:
- Embedding field context using sentence-transformers (or OpenAI embeddings)
- Matching against a vector database via cosine similarity
- Combining RAG predictions with your LLM predictions into a 5-case ensemble
- Learning from every outcome — boosting correct vectors, decaying wrong ones, regenerating embeddings on errors
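- Tracking accuracy, coverage, and confidence per PDF and across 5 time-series levels (global, category, subcategory, document type, PDF hash)
With the default local backends, everything runs on your own infrastructure: no external services are called and no data leaves your environment. Optional backends such as OpenAI, Anthropic, or Pinecone call their respective hosted APIs.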
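The matching step in the list above can be illustrated with a minimal, dependency-free sketch. It is not the SDK's implementation: `cosine_similarity`, `best_match`, and the toy three-dimensional vectors are hypothetical, and the 0.75 threshold simply mirrors the `RAGPDF_PREDICTION_THRESHOLD` default documented later.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, stored, threshold=0.75):
    """Return (field_name, similarity) for the closest stored vector, or None below threshold."""
    best = None
    for field_name, vector in stored:
        similarity = cosine_similarity(query, vector)
        if similarity >= threshold and (best is None or similarity > best[1]):
            best = (field_name, similarity)
    return best


# Toy "vector database": (standardized field name, embedding) pairs.
stored_vectors = [
    ("investor_full_legal_name", [0.9, 0.1, 0.0]),
    ("investor_tax_id", [0.0, 0.2, 0.9]),
]
match = best_match([1.0, 0.0, 0.0], stored_vectors)
print(match)
```

A query that is not close to any stored vector returns `None`, which corresponds to an unpredicted field.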
Installation
# Minimal (numpy + scikit-learn only; bring your own embeddings)
pip install ragpdf-sdk
# Extras are quoted so that shells such as zsh do not expand the brackets
# With sentence-transformers (recommended default)
pip install "ragpdf-sdk[transformers]"
# With OpenAI embeddings + GPT-4 corrector
pip install "ragpdf-sdk[openai]"
# With Anthropic Claude corrector
pip install "ragpdf-sdk[anthropic]"
# With AWS S3 storage
pip install "ragpdf-sdk[s3]"
# With Pinecone vector store
pip install "ragpdf-sdk[pinecone]"
# With ChromaDB vector store (local, embedded)
pip install "ragpdf-sdk[chroma]"
# With Weaviate vector store
pip install "ragpdf-sdk[weaviate]"
# With FastAPI dev server
pip install "ragpdf-sdk[server]"
# Everything
pip install "ragpdf-sdk[all]"
Quick Start
from ragpdf import RAGPDFClient, LocalStorage, LocalVectorStore, SentenceTransformerBackend

client = RAGPDFClient(
    storage=LocalStorage("./ragpdf_data"),
    vector_store=LocalVectorStore("./ragpdf_data"),
    embedding_backend=SentenceTransformerBackend("all-MiniLM-L6-v2"),
)

# API 1: get RAG predictions for your PDF fields
result = client.get_predictions(
    user_id="user_001",
    session_id="session_abc",
    pdf_id="pdf_xyz",
    fields=[
        {
            "field_id": "f001",
            "field_name": "Investor Name",
            "context": "Full legal name of the investor as it appears on government-issued ID",
            "section_context": "Investor Identity",
            "headers": ["Section 1", "Personal Information"],
        },
    ],
    pdf_hash="md5hashofthepdffile",
    pdf_category={
        "category": "Private Markets",
        "sub_category": "Private Equity",
        "document_type": "LP Subscription Agreement",
    },
)
print(result["summary"])
# {'total_fields': 1, 'predicted_fields': 0, 'unpredicted_fields': 1, 'avg_confidence': 0.0}
# (empty on first run; the vector DB learns from each submission)
Or use environment variables:
cp .env.example .env
# Fill in your settings
client = RAGPDFClient.from_env()
The 6 APIs
API 1 — get_predictions()
Generate RAG predictions for a set of PDF form fields. Saves results to storage.
result = client.get_predictions(
    user_id="user_001",
    session_id="session_abc",
    pdf_id="pdf_xyz",
    fields=[
        {
            "field_id": "f001",          # required: unique ID for this field
            "field_name": "Name Box",    # optional but improves accuracy
            "context": "...",            # surrounding text in the PDF
            "section_context": "...",    # section/heading this field belongs to
            "headers": ["..."],          # list of headers above this field
        }
    ],
    pdf_hash="abc123",                   # MD5/SHA of the PDF (used for dedup + frequency)
    pdf_category={
        "category": "Private Markets",
        "sub_category": "Private Equity",
        "document_type": "LP Subscription Agreement",
    },
)
# Returns: submission_id, frequency, is_duplicate, summary
# RAG predictions are saved to: predictions/{user_id}/{session_id}/{pdf_id}/predictions/rag_predictions.json
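The `pdf_hash` argument is described as an MD5/SHA digest of the PDF. One way to produce such a value with the standard library is sketched below. `hash_pdf_stream` is a hypothetical helper, not part of the SDK, and the choice of SHA-256 is an assumption; any digest works as long as it is used consistently, since the hash drives deduplication.

```python
import hashlib
import io


def hash_pdf_stream(stream, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Hash a binary stream in chunks so large PDFs need not fit in memory."""
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Stand-in for open("form.pdf", "rb"); the bytes here are illustrative only.
pdf_hash = hash_pdf_stream(io.BytesIO(b"%PDF-1.7 example bytes"))
print(pdf_hash)
```

With a real file, pass `open(path, "rb")` inside a `with` block and hand the resulting hex string to `get_predictions(pdf_hash=...)`.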
API 2 — save_filled_pdf()
After your backend fills the PDF (using its own LLM predictions), call this to run the full processing pipeline: case classification → metrics → vector DB update → time series.
result = client.save_filled_pdf(
    user_id="user_001",
    session_id="session_abc",
    pdf_id="pdf_xyz",
    llm_predictions={
        "predictions": {
            "f001": {
                "predicted_field_name": "investor_full_legal_name",
                "confidence": 0.92,
            }
        }
    },
    final_predictions={
        "final_predictions": {
            "f001": {
                "selected_field_name": "investor_full_legal_name",
                "selected_from": "llm",   # "rag" | "llm"
                "rag_confidence": 0.0,
                "llm_confidence": 0.92,
            }
        }
    },
)
# Runs: CaseClassifier → MetricsService → VectorDB update → TimeSeriesService
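The `final_predictions` payload is produced by your own backend. As a rough illustration, the sketch below assembles it from two prediction maps by picking the more confident source per field. `build_final_predictions` is hypothetical, the RAG input shape (`predicted_field_name` plus `confidence` per `field_id`) is assumed to mirror the LLM shape shown above, and the tie-break in favor of the LLM is an arbitrary choice rather than the SDK's rule.

```python
def build_final_predictions(rag: dict, llm: dict) -> dict:
    """Merge per-field predictions; both inputs map field_id -> prediction dict."""
    final = {}
    for field_id in sorted(set(rag) | set(llm)):
        rag_pred = rag.get(field_id)
        llm_pred = llm.get(field_id)
        rag_conf = rag_pred["confidence"] if rag_pred else 0.0
        llm_conf = llm_pred["confidence"] if llm_pred else 0.0
        # Assumed rule: use RAG only when it is the sole source or strictly more confident.
        use_rag = rag_pred is not None and (llm_pred is None or rag_conf > llm_conf)
        chosen = rag_pred if use_rag else llm_pred
        final[field_id] = {
            "selected_field_name": chosen["predicted_field_name"],
            "selected_from": "rag" if use_rag else "llm",
            "rag_confidence": rag_conf,
            "llm_confidence": llm_conf,
        }
    return {"final_predictions": final}


payload = build_final_predictions(
    rag={"f001": {"predicted_field_name": "investor_name", "confidence": 0.6}},
    llm={"f001": {"predicted_field_name": "investor_full_legal_name", "confidence": 0.92}},
)
print(payload["final_predictions"]["f001"]["selected_from"])
```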
API 4 — submit_feedback()
When a user reports a wrong field name after reviewing the filled PDF:
result = client.submit_feedback(
    user_id="user_001",
    session_id="session_abc",
    pdf_id="pdf_xyz",
    errors=[
        {
            "error_type": "wrong_field_name",
            "field_name": "investor_name",   # what was predicted
            "field_type": "text",
            "value": "John Smith",
            "feedback": "Should be full_legal_name",
            "page_number": 1,
            "corners": [[10, 20], [200, 20], [200, 40], [10, 40]],
        }
    ],
)
# Runs: LLM corrector → negative confidence update → embedding regen → metric recalc
API 5 — get_metrics()
# Per-PDF metrics
client.get_metrics("pdf", user_id="u1", session_id="s1", pdf_id="p1")
# Category time series
client.get_metrics("category", category="Private Markets")
# Subcategory time series
client.get_metrics("subcategory", category="Private Markets", subcategory="Private Equity")
# Document type time series
client.get_metrics("doctype", category="Private Markets", subcategory="Private Equity", doctype="LP Subscription Agreement")
# Global metrics — full LLM vs RAG comparison + ensemble stats
client.get_metrics("global")
# Compare multiple PDFs
client.get_metrics("compare", pdfs=[
    {"user_id": "u1", "session_id": "s1", "pdf_id": "p1"},
    {"user_id": "u2", "session_id": "s2", "pdf_id": "p2"},
])
# All submissions for a specific PDF hash
client.get_metrics("pdf_hash", pdf_hash="abc123")
API 6 — get_system_info()
info = client.get_system_info()
# Returns: total PDFs, users, sessions, categories, vectors, breakdown by source
API 7 — get_error_analytics()
analytics = client.get_error_analytics(
    date_from="2026-01-01T00:00:00Z",
    date_to="2026-03-31T23:59:59Z",
    category="Private Markets",            # optional filter
    subcategory="Private Equity",          # optional filter
    doctype="LP Subscription Agreement",   # optional filter
)
# Returns: total_errors + breakdown by category, subcategory, doctype, date, error_type, case_type
Plugin System
Every component is pluggable. Mix and match to fit your stack.
Embedding Backends
| Backend | Install | Best For |
|---|---|---|
| SentenceTransformerBackend (default) | [transformers] | Local, no API calls, great accuracy |
| OpenAIEmbeddingBackend | [openai] | Highest quality, uses API credits |
| Custom (EmbeddingBackend) | none | Any model: Ollama, HuggingFace, Cohere |
# Sentence Transformers (runs locally, no API key)
from ragpdf import SentenceTransformerBackend
backend = SentenceTransformerBackend(model="all-MiniLM-L6-v2")
# Other models: "all-mpnet-base-v2", "paraphrase-MiniLM-L6-v2"
# OpenAI
from ragpdf import OpenAIEmbeddingBackend
backend = OpenAIEmbeddingBackend(api_key="sk-...", model="text-embedding-3-small")
# Custom — implement 2 methods
from ragpdf.embeddings.base import EmbeddingBackend
class MyEmbedder(EmbeddingBackend):
    def embed(self, text: str) -> list[float]:
        return my_model.encode(text).tolist()

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        return my_model.encode(texts).tolist()
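For tests or offline experiments, the two-method interface above can be satisfied without any model at all. The sketch below is a hypothetical stand-in (`HashingEmbedder` is not part of the SDK) that hashes tokens into a fixed-size, L2-normalized bag-of-words vector. It is written as a plain class so it runs on its own; in real use it would subclass `EmbeddingBackend`. It captures exact token overlap only, so it is no substitute for sentence-transformers.

```python
import hashlib
import math


class HashingEmbedder:
    """Deterministic toy embedder: hashed bag-of-words, L2-normalized."""

    def __init__(self, dim: int = 64):
        self.dim = dim

    def embed(self, text: str) -> list[float]:
        vector = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            # Map each token to one bucket using the first 4 bytes of its hash.
            index = int.from_bytes(digest[:4], "big") % self.dim
            vector[index] += 1.0
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector] if norm else vector

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        return [self.embed(text) for text in texts]


embedder = HashingEmbedder(dim=16)
vectors = embedder.embed_batch(["Full legal name of the investor", "Tax identification number"])
print(len(vectors), len(vectors[0]))
```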
Vector Store Backends
| Backend | Install | Best For |
|---|---|---|
| LocalVectorStore (default) | none | Dev/testing, single server |
| S3VectorStore | [s3] | Production, no extra deps |
| PineconeStore | [pinecone] | Large scale, managed |
| ChromaStore | [chroma] | Local production, embedded |
| WeaviateStore | [weaviate] | Self-hosted, full-featured |
| Custom (VectorStoreBackend) | none | pgvector, Qdrant, Milvus, Redis |
from ragpdf import LocalVectorStore, S3VectorStore
from ragpdf.vector_stores import PineconeStore, ChromaStore, WeaviateStore
# Flat JSON on disk (dev)
store = LocalVectorStore(path="./ragpdf_data")
# Flat JSON in your S3 bucket (production)
store = S3VectorStore(bucket="my-bucket", region="us-east-1")
# Pinecone
store = PineconeStore(api_key="...", index_name="ragpdf-vectors", namespace="prod")
# ChromaDB (local, embedded, no external service)
store = ChromaStore(path="./chroma_data", collection="ragpdf_vectors")
# Weaviate
store = WeaviateStore(url="http://localhost:8080", class_name="RagpdfVector")
# Custom — implement 5 methods
from ragpdf.vector_stores.base import VectorStoreBackend
class PgVectorStore(VectorStoreBackend):
    def find_similar(self, embedding, threshold, top_k): ...
    def add_vector(self, field_name, context, section_context, headers, embedding, **meta): ...
    def update_confidence(self, vector_id, is_positive, error_info=None): ...
    def save(self): ...
    def count(self) -> int: ...
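To make the five-method contract more concrete, here is a rough in-memory sketch. The method names and parameters follow the skeleton above, but everything else is an assumption: the return shapes, the starting confidence of 0.5, the 1.05/0.95 multipliers (borrowed from the configuration defaults), and the cap at 1.0. The class is standalone for runnability; a real backend would subclass `VectorStoreBackend` and match whatever that base class actually expects.

```python
import math
import uuid


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    def __init__(self):
        self._vectors = {}  # vector_id -> record dict

    def find_similar(self, embedding, threshold, top_k):
        hits = []
        for vector_id, record in self._vectors.items():
            similarity = _cosine(embedding, record["embedding"])
            if similarity >= threshold:
                hits.append({"vector_id": vector_id, "field_name": record["field_name"],
                             "similarity": similarity, "confidence": record["confidence"]})
        hits.sort(key=lambda hit: hit["similarity"], reverse=True)
        return hits[:top_k]

    def add_vector(self, field_name, context, section_context, headers, embedding, **meta):
        vector_id = uuid.uuid4().hex
        self._vectors[vector_id] = {
            "field_name": field_name, "context": context,
            "section_context": section_context, "headers": list(headers),
            "embedding": list(embedding), "confidence": 0.5, "meta": meta,
        }
        return vector_id

    def update_confidence(self, vector_id, is_positive, error_info=None):
        record = self._vectors[vector_id]
        factor = 1.05 if is_positive else 0.95  # assumed growth / decay multipliers
        record["confidence"] = min(1.0, record["confidence"] * factor)

    def save(self):
        pass  # nothing to persist for an in-memory store

    def count(self) -> int:
        return len(self._vectors)


demo_store = InMemoryVectorStore()
demo_id = demo_store.add_vector("investor_full_legal_name", "Full legal name",
                                "Investor Identity", ["Section 1"], [1.0, 0.0])
print(demo_store.find_similar([0.9, 0.1], threshold=0.75, top_k=5)[0]["field_name"])
```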
LLM Corrector Backends
| Backend | Install | Best For |
|---|---|---|
| NoOpCorrectorBackend (default) | none | No LLM call, offline |
| OpenAICorrectorBackend | [openai] | GPT-4, highest quality corrections |
| AnthropicCorrectorBackend | [anthropic] | Claude, fast + accurate |
| Custom (FieldCorrectorBackend) | none | Llama, Mistral, Ollama, any LLM |
from ragpdf import OpenAICorrectorBackend, AnthropicCorrectorBackend, NoOpCorrectorBackend
# GPT-4
corrector = OpenAICorrectorBackend(api_key="sk-...", model="gpt-4-turbo-preview")
# Claude
corrector = AnthropicCorrectorBackend(api_key="sk-ant-...", model="claude-sonnet-4-20250514")
# No LLM (just cleans the field name to snake_case)
corrector = NoOpCorrectorBackend()
# Custom — implement 1 method
from ragpdf.correctors.base import FieldCorrectorBackend
class OllamaCorrector(FieldCorrectorBackend):
    def generate_corrected_field_name(self, error_data: dict) -> dict:
        # Call Ollama / any local LLM
        return {"corrected_field_name": "name", "confidence": 0.9, "reasoning": "..."}
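The no-LLM option is described as cleaning the field name to snake_case. A possible version of that cleanup is sketched below. `to_snake_case` and `SnakeCaseCorrector` are hypothetical names, the regular expressions are one reasonable normalization rather than the SDK's own, and the returned confidence is a placeholder. The `field_name` key is taken from the feedback payload shown under API 4.

```python
import re


def to_snake_case(name: str) -> str:
    """Normalize free-form or camelCase labels to snake_case."""
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name.strip())  # split camelCase
    name = re.sub(r"[^A-Za-z0-9]+", "_", name)                   # non-alphanumerics -> "_"
    return name.strip("_").lower()


class SnakeCaseCorrector:
    """Standalone stand-in; a real backend would subclass FieldCorrectorBackend."""

    def generate_corrected_field_name(self, error_data: dict) -> dict:
        cleaned = to_snake_case(error_data.get("field_name", ""))
        return {
            "corrected_field_name": cleaned,
            "confidence": 0.5,  # placeholder value, not taken from the SDK
            "reasoning": "Normalized to snake_case without calling an LLM.",
        }


correction = SnakeCaseCorrector().generate_corrected_field_name({"field_name": "Investor Full-Name"})
print(correction["corrected_field_name"])
```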
Storage Backends
from ragpdf import LocalStorage, S3Storage
from ragpdf.storage.base import StorageBackend
# Local filesystem
storage = LocalStorage("./ragpdf_data")
# AWS S3 (your own bucket)
storage = S3Storage(bucket="my-bucket", region="us-east-1", prefix="ragpdf/")
# Custom — implement 5 methods (PostgreSQL, MongoDB, GCS, Azure Blob, etc.)
class PostgresStorage(StorageBackend):
    def save_json(self, key, data): ...
    def load_json(self, key): ...
    def append_to_jsonl(self, key, data): ...
    def load_jsonl(self, key): ...
    def copy_file(self, source, dest): ...
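As a rough guide to what the five storage methods need to do, the sketch below keeps everything in dictionaries. It is a hypothetical test double, not an SDK class. Returning `None` for a missing JSON key and an empty list for a missing JSONL key are assumptions about the contract; check the `StorageBackend` base class before relying on either.

```python
import copy


class InMemoryStorage:
    """Dictionary-backed storage, useful for unit tests."""

    def __init__(self):
        self._json = {}    # key -> JSON-serializable object
        self._jsonl = {}   # key -> list of records

    def save_json(self, key, data):
        self._json[key] = copy.deepcopy(data)

    def load_json(self, key):
        return copy.deepcopy(self._json.get(key))  # None if the key is missing (assumed)

    def append_to_jsonl(self, key, data):
        self._jsonl.setdefault(key, []).append(copy.deepcopy(data))

    def load_jsonl(self, key):
        return copy.deepcopy(self._jsonl.get(key, []))

    def copy_file(self, source, dest):
        if source in self._json:
            self._json[dest] = copy.deepcopy(self._json[source])
        if source in self._jsonl:
            self._jsonl[dest] = copy.deepcopy(self._jsonl[source])


memory = InMemoryStorage()
memory.save_json("vectors/vector_database.json", {"vectors": []})
memory.append_to_jsonl("errors/user_feedback_raw.jsonl", {"error_type": "wrong_field_name"})
print(memory.load_json("vectors/vector_database.json"))
```

Deep copies keep callers from mutating stored state by accident, which is closer to how a file or database backend behaves.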
How the Learning Loop Works
PDF fields submitted
        ↓
EmbeddingBackend.embed(field_context)
        ↓
VectorStoreBackend.find_similar(embedding)
    → RAG prediction + confidence score
        ↓
Your backend runs LLM prediction independently
        ↓
save_filled_pdf(rag_preds, llm_preds, final_preds)
        ↓
CaseClassifier assigns each field to one of 5 cases:
    CASE_A → Both agreed → boost RAG vector confidence
    CASE_B → Conflict    → boost winner, create new vector if LLM selected
    CASE_C → LLM only    → create new vector from LLM prediction
    CASE_D → RAG only    → boost RAG vector confidence
    CASE_E → Neither     → do nothing
        ↓
MetricsService calculates accuracy/coverage/confidence
TimeSeriesService appends to 5 time series levels
        ↓
(optionally) submit_feedback(errors)
        ↓
FieldCorrectorBackend.generate_corrected_field_name(error)
        ↓
VectorStoreBackend.update_confidence(vector_id, is_positive=False)
    → confidence decayed
    → embedding regenerated: "original context corrected:right_field_name"
    → stability_score updated
MetricsService.recalculate_accuracy_after_errors()
TimeSeriesService updates all 5 levels again
Over time, the share of CASE_A fields (both models agreed) should grow, so the LLM is needed less often and predictions become faster and cheaper.
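The five cases in the diagram reduce to a small decision function. The sketch below is a plausible reading of that table, not the SDK's `CaseClassifier`: `classify_case` is a hypothetical name, and it assumes a missing prediction is represented as `None` (or an empty string) and that "agreed" means exact string equality of the predicted field names.

```python
from typing import Optional


def classify_case(rag_name: Optional[str], llm_name: Optional[str]) -> str:
    """Map a (RAG prediction, LLM prediction) pair to CASE_A..CASE_E."""
    if rag_name and llm_name:
        # Both models predicted: agreement is CASE_A, conflict is CASE_B.
        return "CASE_A" if rag_name == llm_name else "CASE_B"
    if llm_name:
        return "CASE_C"  # LLM only
    if rag_name:
        return "CASE_D"  # RAG only
    return "CASE_E"      # neither model produced a prediction


examples = [
    ("investor_full_legal_name", "investor_full_legal_name"),
    ("investor_name", "investor_full_legal_name"),
    (None, "investor_full_legal_name"),
    ("investor_full_legal_name", None),
    (None, None),
]
for rag_name, llm_name in examples:
    print(rag_name, llm_name, classify_case(rag_name, llm_name))
```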
Configuration Reference
Copy .env.example to .env:
# Storage
RAGPDF_STORAGE=local # local | s3
RAGPDF_DATA_PATH=./ragpdf_data
# S3 (if RAGPDF_STORAGE=s3)
RAGPDF_S3_BUCKET=my-bucket
RAGPDF_S3_REGION=us-east-1
# Embedding
RAGPDF_EMBEDDING_BACKEND=sentence_transformer # sentence_transformer | openai
RAGPDF_ST_MODEL=all-MiniLM-L6-v2
OPENAI_API_KEY=sk-...
# Vector store
RAGPDF_VECTOR_STORE=local # local | s3 | pinecone | chroma | weaviate
PINECONE_API_KEY=...
RAGPDF_CHROMA_PATH=./chroma_data
# LLM Corrector
RAGPDF_CORRECTOR_BACKEND=openai # openai | anthropic | noop
ANTHROPIC_API_KEY=sk-ant-...
# Prediction tuning
RAGPDF_PREDICTION_THRESHOLD=0.75 # min cosine similarity to count as a match
RAGPDF_TOP_K=5 # how many candidates to return
RAGPDF_CONFIDENCE_DECAY_RATE=0.95 # multiply confidence on error
RAGPDF_CONFIDENCE_GROWTH_RATE=1.05 # multiply confidence on correct
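The two rate settings are described as multipliers applied to a vector's confidence. A small sketch of how such settings might be read and applied follows. `load_tuning` and `apply_outcome` are hypothetical helpers, and capping confidence at 1.0 is an assumption that the reference above does not state.

```python
import os


def load_tuning(env=None) -> dict:
    """Read the prediction-tuning variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "threshold": float(env.get("RAGPDF_PREDICTION_THRESHOLD", "0.75")),
        "top_k": int(env.get("RAGPDF_TOP_K", "5")),
        "decay": float(env.get("RAGPDF_CONFIDENCE_DECAY_RATE", "0.95")),
        "growth": float(env.get("RAGPDF_CONFIDENCE_GROWTH_RATE", "1.05")),
    }


def apply_outcome(confidence: float, is_positive: bool, tuning: dict) -> float:
    """Multiply confidence by the growth or decay rate; the 1.0 cap is assumed."""
    factor = tuning["growth"] if is_positive else tuning["decay"]
    return min(1.0, confidence * factor)


# An explicit mapping is passed here so the example does not depend on the real environment.
tuning = load_tuning({"RAGPDF_TOP_K": "3"})
confidence = 0.80
confidence = apply_outcome(confidence, is_positive=True, tuning=tuning)   # correct prediction
confidence = apply_outcome(confidence, is_positive=False, tuning=tuning)  # reported error
print(tuning["top_k"], round(confidence, 4))
```

Because 1.05 × 0.95 is slightly below 1, one correct outcome followed by one error leaves confidence marginally lower than where it started.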
S3 Storage Layout
All paths below are relative to your bucket (or RAGPDF_DATA_PATH for local):
vectors/
└── vector_database.json                  # the vector DB
pdf_hash_mapping/
└── mapping.json                          # dedup + frequency tracking
predictions/{user_id}/{session_id}/{pdf_id}/
├── metadata/
│   ├── submission_info.json              # submission_id, frequency
│   └── pdf_info.json                     # pdf_hash, pdf_category
├── predictions/
│   ├── input.json                        # raw fields (for CASE_B/C vector creation)
│   ├── rag_predictions.json              # API 1 output
│   ├── llm_predictions.json              # provided by your backend
│   └── final_predictions.json            # ensemble decisions
├── analysis/
│   ├── case_classification.json          # A/B/C/D/E per field
│   ├── metrics_snapshot.json             # initial metrics
│   └── vector_update_summary.json        # what changed in vector DB
└── errors/
    ├── user_feedback_raw.jsonl           # raw feedback events
    ├── error_analysis.json               # processed error records
    ├── metrics_snapshot_updated.json     # recalculated after errors
    └── error_log_{timestamp}.json        # timestamped error log
metrics/time_series/
├── global/time_series.json
├── category/{category}/time_series.json
├── subcategory/{category}/{sub}/time_series.json
├── doctype/{category}/{sub}/{doc}/time_series.json
└── pdf_hash/{hash}/time_series.json
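When inspecting stored data by hand, it can help to build these keys programmatically. The helpers below are hypothetical and only mirror the layout shown above. Whether the SDK escapes or normalizes characters such as spaces in category names is not documented here, so the sketch joins segments verbatim.

```python
import posixpath


def prediction_key(user_id: str, session_id: str, pdf_id: str, *parts: str) -> str:
    """Key under predictions/{user_id}/{session_id}/{pdf_id}/, always with '/' separators."""
    return posixpath.join("predictions", user_id, session_id, pdf_id, *parts)


def time_series_key(level: str, *segments: str) -> str:
    """Key under metrics/time_series/{level}/.../time_series.json."""
    return posixpath.join("metrics", "time_series", level, *segments, "time_series.json")


rag_key = prediction_key("user_001", "session_abc", "pdf_xyz", "predictions", "rag_predictions.json")
doctype_key = time_series_key("doctype", "Private Markets", "Private Equity", "LP Subscription Agreement")
print(rag_key)
print(doctype_key)
```

`posixpath` is used instead of `os.path` so the keys stay valid S3-style paths on Windows as well.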
Dev Server
pip install "ragpdf-sdk[server]"
uvicorn server.local_server:app --reload --port 8000
Endpoints:
- POST /predict: API 1
- POST /save-filled-pdf: API 2
- POST /feedback: API 4
- POST /metrics: API 5
- GET /system-info: API 6
- POST /error-analytics: API 7
- GET /health: vector count
All endpoints require X-API-Key: dev-key header (set RAGPDF_API_KEY to change).
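A request to the dev server can be prepared with the standard library alone. The sketch below only constructs the request objects and leaves the network call commented out. `build_request` is a hypothetical helper, the base URL assumes the `--port 8000` invocation above, and the JSON body for `/system-info`-style GETs or POST endpoints is not specified in this README, so the example sticks to `/health`.

```python
import json
import urllib.request


def build_request(path, payload=None, base_url="http://localhost:8000", api_key="dev-key"):
    """Build a GET (no payload) or JSON POST request carrying the X-API-Key header."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"X-API-Key": api_key}
    if data is not None:
        headers["Content-Type"] = "application/json"
    method = "GET" if data is None else "POST"
    return urllib.request.Request(base_url + path, data=data, headers=headers, method=method)


health_request = build_request("/health")
print(health_request.get_method(), health_request.full_url)

# With the server running, the request could be sent like this:
# with urllib.request.urlopen(health_request) as response:
#     print(json.load(response))
```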
Testing
# Install dev dependencies
pip install -e ".[dev]"
# Unit tests (no API keys, no network)
pytest tests/unit/ -v
# With coverage
pytest tests/unit/ --cov=ragpdf --cov-report=html
# Integration tests (no API keys — uses DummyEmbeddingBackend)
pytest tests/integration/ -v -m integration
# All tests
pytest
Publishing
# Increment version in pyproject.toml
# Update CHANGELOG.md
git commit -am "Release v0.2.3"
git tag v0.2.3
git push origin main --tags
# GitHub Actions publishes to PyPI automatically
Versioning policy:
- PATCH (0.1.x): bug fixes, no API changes
- MINOR (0.x.0): new backends/features, backwards compatible
- MAJOR (x.0.0): breaking changes to RAGPDFClient or plugin interfaces
License
MIT — see LICENSE