Self-learning RAG field prediction SDK for PDF form filling.

ragpdf-sdk

Self-learning RAG field prediction for PDF form filling.

A fully open-source Python SDK that predicts PDF form field mappings using sentence-transformer embeddings and a dual-model ensemble (RAG + LLM). The vector database is updated after every prediction, so matches can improve as more documents are processed.

What It Does

When filling a PDF form, every field box has context — surrounding text, section headers, position. This SDK learns to predict which standardized field name (e.g. investor_full_legal_name) maps to which field box, by:

Embedding field context using sentence-transformers (or OpenAI embeddings)
Matching against a vector database via cosine similarity
Combining RAG predictions with your LLM predictions into a 5-case ensemble
Learning from every outcome — boosting correct vectors, decaying wrong ones, regenerating embeddings on errors
Tracking accuracy, coverage, and confidence at 5 levels (global, category, subcategory, document type, and per-PDF hash)

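With the default local backends, everything runs on your own infrastructure: no external services are called and no data leaves your environment. The optional OpenAI, Anthropic, and Pinecone backends call those providers' APIs.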
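The matching step (compare a field's embedding against stored vectors by cosine similarity, keep the best candidates above a threshold) can be sketched with the standard library alone. This is an illustration, not the SDK's implementation: the function names, the in-memory list of `(field_name, vector)` pairs, and the defaults of 0.75 and 5 (borrowed from the example values in the configuration reference) are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(query, stored, threshold=0.75, top_k=5):
    """Return up to top_k (field_name, similarity) pairs at or above threshold.

    `stored` is a hypothetical list of (field_name, vector) pairs.
    """
    scored = [(name, cosine_similarity(query, vec)) for name, vec in stored]
    hits = [pair for pair in scored if pair[1] >= threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:top_k]


stored_vectors = [
    ("investor_full_legal_name", [1.0, 0.0, 0.0]),
    ("investor_tax_id", [0.0, 1.0, 0.0]),
]
matches = top_matches([0.9, 0.1, 0.0], stored_vectors)
print(matches)  # only the first stored vector clears the 0.75 threshold
```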

Installation

# Extras are quoted so that shells such as zsh do not expand the brackets.

# Minimal (numpy + scikit-learn only; bring your own embeddings)
pip install ragpdf-sdk

# With sentence-transformers (recommended default)
pip install "ragpdf-sdk[transformers]"

# With OpenAI embeddings + GPT-4 corrector
pip install "ragpdf-sdk[openai]"

# With Anthropic Claude corrector
pip install "ragpdf-sdk[anthropic]"

# With AWS S3 storage
pip install "ragpdf-sdk[s3]"

# With Pinecone vector store
pip install "ragpdf-sdk[pinecone]"

# With ChromaDB vector store (local, embedded)
pip install "ragpdf-sdk[chroma]"

# With Weaviate vector store
pip install "ragpdf-sdk[weaviate]"

# With FastAPI dev server
pip install "ragpdf-sdk[server]"

# Everything
pip install "ragpdf-sdk[all]"

Quick Start

from ragpdf import RAGPDFClient, LocalStorage, LocalVectorStore, SentenceTransformerBackend

client = RAGPDFClient(
    storage=LocalStorage("./ragpdf_data"),
    vector_store=LocalVectorStore("./ragpdf_data"),
    embedding_backend=SentenceTransformerBackend("all-MiniLM-L6-v2"),
)

# API 1 — Get RAG predictions for your PDF fields
result = client.get_predictions(
    user_id="user_001",
    session_id="session_abc",
    pdf_id="pdf_xyz",
    fields=[
        {
            "field_id": "f001",
            "field_name": "Investor Name",
            "context": "Full legal name of the investor as it appears on government-issued ID",
            "section_context": "Investor Identity",
            "headers": ["Section 1", "Personal Information"],
        },
    ],
    pdf_hash="md5hashofthepdffile",
    pdf_category={
        "category": "Private Markets",
        "sub_category": "Private Equity",
        "document_type": "LP Subscription Agreement",
    },
)
print(result["summary"])
# {'total_fields': 1, 'predicted_fields': 0, 'unpredicted_fields': 1, 'avg_confidence': 0.0}
# (empty on first run — vector DB learns from each submission)

Or use environment variables:

cp .env.example .env
# Fill in your settings

client = RAGPDFClient.from_env()

The 6 APIs

API 1 — `get_predictions()`

Generate RAG predictions for a set of PDF form fields. Saves results to storage.

result = client.get_predictions(
    user_id="user_001",
    session_id="session_abc",
    pdf_id="pdf_xyz",
    fields=[
        {
            "field_id": "f001",           # required: unique ID for this field
            "field_name": "Name Box",     # optional but improves accuracy
            "context": "...",             # surrounding text in the PDF
            "section_context": "...",     # section/heading this field belongs to
            "headers": ["..."],           # list of headers above this field
        }
    ],
    pdf_hash="abc123",                    # MD5/SHA of the PDF (used for dedup + frequency)
    pdf_category={
        "category":      "Private Markets",
        "sub_category":  "Private Equity",
        "document_type": "LP Subscription Agreement",
    },
)
# Returns: submission_id, frequency, is_duplicate, summary
# RAG predictions are saved to: predictions/{user_id}/{session_id}/{pdf_id}/predictions/rag_predictions.json
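The `pdf_hash` argument is described only as an MD5/SHA digest of the PDF. One way to produce such a value is sketched below with `hashlib`; the helper names are hypothetical, and the choice of SHA-256 is an assumption (any digest should work as long as it is stable for the same file).

```python
import hashlib


def compute_pdf_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of the PDF's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def compute_pdf_hash_from_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


example_hash = compute_pdf_hash(b"%PDF-1.7 example bytes")
print(example_hash)
```

Hashing the file's bytes rather than its name means a renamed copy of the same PDF still maps to the same hash.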

API 2 — `save_filled_pdf()`

After your backend fills the PDF (using its own LLM predictions), call this to run the full processing pipeline: case classification → metrics → vector DB update → time series.

result = client.save_filled_pdf(
    user_id="user_001",
    session_id="session_abc",
    pdf_id="pdf_xyz",
    llm_predictions={
        "predictions": {
            "f001": {
                "predicted_field_name": "investor_full_legal_name",
                "confidence": 0.92,
            }
        }
    },
    final_predictions={
        "final_predictions": {
            "f001": {
                "selected_field_name": "investor_full_legal_name",
                "selected_from": "llm",       # "rag" | "llm"
                "rag_confidence": 0.0,
                "llm_confidence": 0.92,
            }
        }
    },
)
# Runs: CaseClassifier → MetricsService → VectorDB update → TimeSeriesService
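The `final_predictions` payload is produced by your own backend, so the selection policy is up to you. A minimal sketch of one possible policy (pick whichever source reports the higher confidence) follows. The helper name and the shape assumed for the RAG predictions, `{field_id: {"predicted_field_name", "confidence"}}`, are assumptions; the actual layout of `rag_predictions.json` is not documented here.

```python
def build_final_predictions(rag_predictions, llm_predictions):
    """Pick, per field, whichever source reports the higher confidence.

    Both inputs are hypothetical dicts of the form
    {field_id: {"predicted_field_name": str, "confidence": float}}.
    """
    final = {}
    for field_id in sorted(set(rag_predictions) | set(llm_predictions)):
        rag = rag_predictions.get(field_id)
        llm = llm_predictions.get(field_id)
        rag_conf = rag["confidence"] if rag else 0.0
        llm_conf = llm["confidence"] if llm else 0.0
        # Ties go to the LLM in this sketch; that choice is arbitrary.
        if llm is None or (rag is not None and rag_conf > llm_conf):
            source, chosen = "rag", rag
        else:
            source, chosen = "llm", llm
        final[field_id] = {
            "selected_field_name": chosen["predicted_field_name"],
            "selected_from": source,
            "rag_confidence": rag_conf,
            "llm_confidence": llm_conf,
        }
    return {"final_predictions": final}


payload = build_final_predictions(
    rag_predictions={
        "f001": {"predicted_field_name": "investor_name", "confidence": 0.80},
        "f002": {"predicted_field_name": "investor_tax_id", "confidence": 0.85},
    },
    llm_predictions={
        "f001": {"predicted_field_name": "investor_full_legal_name", "confidence": 0.92},
    },
)
print(payload["final_predictions"]["f001"]["selected_from"])
```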

API 4 — `submit_feedback()`

When a user reports a wrong field name after reviewing the filled PDF:

result = client.submit_feedback(
    user_id="user_001",
    session_id="session_abc",
    pdf_id="pdf_xyz",
    errors=[
        {
            "error_type":  "wrong_field_name",
            "field_name":  "investor_name",        # what was predicted
            "field_type":  "text",
            "value":       "John Smith",
            "feedback":    "Should be full_legal_name",
            "page_number": 1,
            "corners":     [[10, 20], [200, 20], [200, 40], [10, 40]],
        }
    ],
)
# Runs: LLM corrector → negative confidence update → embedding regen → metric recalc

API 5 — `get_metrics()`

# Per-PDF metrics
client.get_metrics("pdf", user_id="u1", session_id="s1", pdf_id="p1")

# Category time series
client.get_metrics("category", category="Private Markets")

# Subcategory time series
client.get_metrics("subcategory", category="Private Markets", subcategory="Private Equity")

# Document type time series
client.get_metrics("doctype", category="Private Markets", subcategory="Private Equity", doctype="LP Subscription Agreement")

# Global metrics — full LLM vs RAG comparison + ensemble stats
client.get_metrics("global")

# Compare multiple PDFs
client.get_metrics("compare", pdfs=[
    {"user_id": "u1", "session_id": "s1", "pdf_id": "p1"},
    {"user_id": "u2", "session_id": "s2", "pdf_id": "p2"},
])

# All submissions for a specific PDF hash
client.get_metrics("pdf_hash", pdf_hash="abc123")

API 6 — `get_system_info()`

info = client.get_system_info()
# Returns: total PDFs, users, sessions, categories, vectors, breakdown by source

API 7 — `get_error_analytics()`

analytics = client.get_error_analytics(
    date_from="2026-01-01T00:00:00Z",
    date_to="2026-03-31T23:59:59Z",
    category="Private Markets",          # optional filter
    subcategory="Private Equity",        # optional filter
    doctype="LP Subscription Agreement", # optional filter
)
# Returns: total_errors + breakdown by category, subcategory, doctype, date, error_type, case_type

Plugin System

Every component is pluggable. Mix and match to fit your stack.

Embedding Backends

| Backend | Install | Best For |
|---|---|---|
| `SentenceTransformerBackend` (default) | `[transformers]` | Local, no API calls, great accuracy |
| `OpenAIEmbeddingBackend` | `[openai]` | Highest quality, uses API credits |
| Custom (`EmbeddingBackend`) | (none) | Any model (Ollama, HuggingFace, Cohere) |

# Sentence Transformers (runs locally, no API key)
from ragpdf import SentenceTransformerBackend
backend = SentenceTransformerBackend(model="all-MiniLM-L6-v2")
# Other models: "all-mpnet-base-v2", "paraphrase-MiniLM-L6-v2"

# OpenAI
from ragpdf import OpenAIEmbeddingBackend
backend = OpenAIEmbeddingBackend(api_key="sk-...", model="text-embedding-3-small")

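# Custom: implement 2 methods
# (my_model is a placeholder for your own model object; it is assumed to expose
# an encode() method that returns a NumPy array, as sentence-transformers models do)
from ragpdf.embeddings.base import EmbeddingBackend
class MyEmbedder(EmbeddingBackend):
    def embed(self, text: str) -> list[float]:
        return my_model.encode(text).tolist()

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        return my_model.encode(texts).tolist()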
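To make the two-method contract concrete, here is a toy embedder that needs no model at all. It is a sketch for experimentation only: `HashingEmbedder` is a hypothetical name, the class is left standalone so it runs without the SDK installed (in real use it would subclass `EmbeddingBackend`), and hashed bag-of-words vectors are far weaker than real sentence embeddings.

```python
import hashlib
import math


class HashingEmbedder:
    """Deterministic toy embedder exposing embed() and embed_batch()."""

    def __init__(self, dimensions: int = 64):
        self.dimensions = dimensions

    def embed(self, text: str) -> list[float]:
        vector = [0.0] * self.dimensions
        for token in text.lower().split():
            # Map each token to a bucket using a stable hash of its bytes.
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % self.dimensions
            vector[index] += 1.0
        norm = math.sqrt(sum(value * value for value in vector))
        if norm == 0.0:
            return vector  # empty text: return the zero vector
        return [value / norm for value in vector]

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        return [self.embed(text) for text in texts]


embedder = HashingEmbedder()
vector = embedder.embed("Full legal name of the investor")
batch = embedder.embed_batch(["Investor Name", "Tax ID"])
print(len(vector), len(batch))
```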

Vector Store Backends

| Backend | Install | Best For |
|---|---|---|
| `LocalVectorStore` (default) | (none) | Dev/testing, single server |
| `S3VectorStore` | `[s3]` | Production, no extra deps |
| `PineconeStore` | `[pinecone]` | Large scale, managed |
| `ChromaStore` | `[chroma]` | Local production, embedded |
| `WeaviateStore` | `[weaviate]` | Self-hosted, full-featured |
| Custom (`VectorStoreBackend`) | (none) | pgvector, Qdrant, Milvus, Redis |

from ragpdf import LocalVectorStore, S3VectorStore
from ragpdf.vector_stores import PineconeStore, ChromaStore, WeaviateStore

# Flat JSON on disk (dev)
store = LocalVectorStore(path="./ragpdf_data")

# Flat JSON in your S3 bucket (production)
store = S3VectorStore(bucket="my-bucket", region="us-east-1")

# Pinecone
store = PineconeStore(api_key="...", index_name="ragpdf-vectors", namespace="prod")

# ChromaDB (local, embedded, no external service)
store = ChromaStore(path="./chroma_data", collection="ragpdf_vectors")

# Weaviate
store = WeaviateStore(url="http://localhost:8080", class_name="RagpdfVector")

# Custom — implement 5 methods
from ragpdf.vector_stores.base import VectorStoreBackend
class PgVectorStore(VectorStoreBackend):
    def find_similar(self, embedding, threshold, top_k): ...
    def add_vector(self, field_name, context, section_context, headers, embedding, **meta): ...
    def update_confidence(self, vector_id, is_positive, error_info=None): ...
    def save(self): ...
    def count(self) -> int: ...
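The five-method interface above can be exercised with a small in-memory store. This is a sketch under assumptions, not the SDK's `LocalVectorStore`: the return shapes (a list of dicts from `find_similar`, a string ID from `add_vector`), the starting confidence of 0.5, and the clamp at 1.0 are all guesses made for illustration, and the class is standalone rather than a real `VectorStoreBackend` subclass.

```python
import math
import uuid


class InMemoryVectorStore:
    """Toy store exposing the five methods listed above."""

    def __init__(self, growth_rate: float = 1.05, decay_rate: float = 0.95):
        self.records = {}
        self.growth_rate = growth_rate
        self.decay_rate = decay_rate

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def find_similar(self, embedding, threshold, top_k):
        scored = [
            {"vector_id": vid, "field_name": rec["field_name"],
             "similarity": self._cosine(embedding, rec["embedding"])}
            for vid, rec in self.records.items()
        ]
        hits = [s for s in scored if s["similarity"] >= threshold]
        hits.sort(key=lambda s: s["similarity"], reverse=True)
        return hits[:top_k]

    def add_vector(self, field_name, context, section_context, headers, embedding, **meta):
        vector_id = str(uuid.uuid4())
        self.records[vector_id] = {
            "field_name": field_name, "context": context,
            "section_context": section_context, "headers": list(headers),
            "embedding": list(embedding), "confidence": 0.5, **meta,
        }
        return vector_id

    def update_confidence(self, vector_id, is_positive, error_info=None):
        record = self.records[vector_id]
        factor = self.growth_rate if is_positive else self.decay_rate
        record["confidence"] = min(1.0, record["confidence"] * factor)
        if error_info is not None:
            record.setdefault("errors", []).append(error_info)

    def save(self):
        pass  # nothing to persist for an in-memory store

    def count(self) -> int:
        return len(self.records)


store = InMemoryVectorStore()
vid = store.add_vector(
    "investor_full_legal_name", "Full legal name", "Investor Identity",
    ["Section 1"], [1.0, 0.0],
)
hits = store.find_similar([0.9, 0.1], threshold=0.75, top_k=5)
store.update_confidence(vid, is_positive=False)
print(store.count(), hits[0]["field_name"])
```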

LLM Corrector Backends

| Backend | Install | Best For |
|---|---|---|
| `NoOpCorrectorBackend` (default) | (none) | No LLM call, offline |
| `OpenAICorrectorBackend` | `[openai]` | GPT-4, highest quality corrections |
| `AnthropicCorrectorBackend` | `[anthropic]` | Claude, fast + accurate |
| Custom (`FieldCorrectorBackend`) | (none) | Llama, Mistral, Ollama, any LLM |

from ragpdf import OpenAICorrectorBackend, AnthropicCorrectorBackend, NoOpCorrectorBackend

# GPT-4
corrector = OpenAICorrectorBackend(api_key="sk-...", model="gpt-4-turbo-preview")

# Claude
corrector = AnthropicCorrectorBackend(api_key="sk-ant-...", model="claude-sonnet-4-20250514")

# No LLM (just cleans the field name to snake_case)
corrector = NoOpCorrectorBackend()

# Custom — implement 1 method
from ragpdf.correctors.base import FieldCorrectorBackend
class OllamaCorrector(FieldCorrectorBackend):
    def generate_corrected_field_name(self, error_data: dict) -> dict:
        # Call Ollama / any local LLM
        return {"corrected_field_name": "name", "confidence": 0.9, "reasoning": "..."}
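For a corrector that makes no LLM call, the snake_case cleanup mentioned above might look roughly like this. It is a guess at the behaviour, not the source of `NoOpCorrectorBackend`: `to_snake_case` and `SnakeCaseCorrector` are hypothetical names, the class is standalone rather than a real `FieldCorrectorBackend` subclass, and the fixed confidence of 0.5 is an arbitrary placeholder.

```python
import re


def to_snake_case(name: str) -> str:
    """Normalize a free-form field label to snake_case."""
    # Split camelCase boundaries, e.g. "investorName" -> "investor Name".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Replace every run of non-alphanumeric characters with one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", spaced)
    return cleaned.strip("_").lower()


class SnakeCaseCorrector:
    """Standalone stand-in for a FieldCorrectorBackend subclass."""

    def generate_corrected_field_name(self, error_data: dict) -> dict:
        return {
            "corrected_field_name": to_snake_case(error_data.get("field_name", "")),
            "confidence": 0.5,  # arbitrary placeholder value
            "reasoning": "Normalized the reported field name to snake_case.",
        }


correction = SnakeCaseCorrector().generate_corrected_field_name(
    {"error_type": "wrong_field_name", "field_name": "Investor Full-Legal Name"}
)
print(correction["corrected_field_name"])
```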

Storage Backends

from ragpdf import LocalStorage, S3Storage
from ragpdf.storage.base import StorageBackend

# Local filesystem
storage = LocalStorage("./ragpdf_data")

# AWS S3 (your own bucket)
storage = S3Storage(bucket="my-bucket", region="us-east-1", prefix="ragpdf/")

# Custom — implement 5 methods (PostgreSQL, MongoDB, GCS, Azure Blob, etc.)
class PostgresStorage(StorageBackend):
    def save_json(self, key, data): ...
    def load_json(self, key): ...
    def append_to_jsonl(self, key, data): ...
    def load_jsonl(self, key): ...
    def copy_file(self, source, dest): ...
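A dictionary-backed version of the five storage methods is handy for tests. The sketch below is standalone (it does not subclass `StorageBackend`), and two behaviours are assumptions rather than documented contract: `load_json` returns `None` for a missing key, and `load_jsonl` returns an empty list.

```python
import json


class InMemoryStorage:
    """Toy key/value storage exposing the five methods listed above."""

    def __init__(self):
        self.blobs = {}  # key -> serialized text

    def save_json(self, key, data):
        self.blobs[key] = json.dumps(data)

    def load_json(self, key):
        raw = self.blobs.get(key)
        return json.loads(raw) if raw is not None else None

    def append_to_jsonl(self, key, data):
        # One JSON document per line, appended to whatever is already stored.
        self.blobs[key] = self.blobs.get(key, "") + json.dumps(data) + "\n"

    def load_jsonl(self, key):
        raw = self.blobs.get(key, "")
        return [json.loads(line) for line in raw.split("\n") if line.strip()]

    def copy_file(self, source, dest):
        self.blobs[dest] = self.blobs[source]


storage = InMemoryStorage()
storage.save_json("vectors/vector_database.json", {"vectors": []})
storage.append_to_jsonl("errors/user_feedback_raw.jsonl", {"field_name": "investor_name"})
storage.append_to_jsonl("errors/user_feedback_raw.jsonl", {"field_name": "tax_id"})
storage.copy_file("vectors/vector_database.json", "vectors/backup.json")
print(storage.load_jsonl("errors/user_feedback_raw.jsonl"))
```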

How the Learning Loop Works

PDF fields submitted
       ↓
EmbeddingBackend.embed(field_context)
       ↓
VectorStoreBackend.find_similar(embedding)
       → RAG prediction + confidence score
       ↓
Your backend runs LLM prediction independently
       ↓
save_filled_pdf(llm_predictions, final_predictions)
       ↓
CaseClassifier assigns each field to one of 5 cases:
  CASE_A → Both agreed     → boost RAG vector confidence
  CASE_B → Conflict        → boost winner, create new vector if LLM selected
  CASE_C → LLM only        → create new vector from LLM prediction
  CASE_D → RAG only        → boost RAG vector confidence
  CASE_E → Neither         → do nothing
       ↓
MetricsService calculates accuracy/coverage/confidence
TimeSeriesService appends to 5 time series levels
       ↓
(optionally) submit_feedback(errors)
       ↓
FieldCorrectorBackend.generate_corrected_field_name(error)
       ↓
VectorStoreBackend.update_confidence(vector_id, is_positive=False)
  → confidence decayed
  → embedding regenerated: "original context corrected:right_field_name"
  → stability_score updated
MetricsService.recalculate_accuracy_after_errors()
TimeSeriesService updates all 5 levels again

Over time, the share of CASE_A fields (both models agree) should grow, so the LLM is needed less often and predictions become faster and cheaper.
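The five-case split in the diagram reduces to a small decision function. The sketch below assumes each source yields either a field name or nothing (`None` or an empty string) for a given field; the SDK's `CaseClassifier` may use confidences or other signals that are not described here.

```python
def classify_case(rag_name, llm_name):
    """Map one field's RAG and LLM predictions to CASE_A..CASE_E.

    A value of None (or an empty string) means that source had no prediction.
    """
    has_rag = bool(rag_name)
    has_llm = bool(llm_name)
    if has_rag and has_llm:
        # Both predicted: agreement is CASE_A, disagreement is CASE_B.
        return "CASE_A" if rag_name == llm_name else "CASE_B"
    if has_llm:
        return "CASE_C"  # LLM only
    if has_rag:
        return "CASE_D"  # RAG only
    return "CASE_E"      # neither


cases = {
    "f001": classify_case("investor_full_legal_name", "investor_full_legal_name"),
    "f002": classify_case("investor_name", "investor_full_legal_name"),
    "f003": classify_case(None, "investor_tax_id"),
    "f004": classify_case("investor_email", None),
    "f005": classify_case(None, None),
}
print(cases)
```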

Configuration Reference

Copy .env.example to .env:

# Storage
RAGPDF_STORAGE=local              # local | s3
RAGPDF_DATA_PATH=./ragpdf_data

# S3 (if RAGPDF_STORAGE=s3)
RAGPDF_S3_BUCKET=my-bucket
RAGPDF_S3_REGION=us-east-1

# Embedding
RAGPDF_EMBEDDING_BACKEND=sentence_transformer   # sentence_transformer | openai
RAGPDF_ST_MODEL=all-MiniLM-L6-v2
OPENAI_API_KEY=sk-...

# Vector store
RAGPDF_VECTOR_STORE=local         # local | s3 | pinecone | chroma | weaviate
PINECONE_API_KEY=...
RAGPDF_CHROMA_PATH=./chroma_data

# LLM Corrector
RAGPDF_CORRECTOR_BACKEND=openai   # openai | anthropic | noop
ANTHROPIC_API_KEY=sk-ant-...

# Prediction tuning
RAGPDF_PREDICTION_THRESHOLD=0.75  # min cosine similarity to count as a match
RAGPDF_TOP_K=5                    # how many candidates to return
RAGPDF_CONFIDENCE_DECAY_RATE=0.95 # multiply confidence on error
RAGPDF_CONFIDENCE_GROWTH_RATE=1.05 # multiply confidence on correct
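As a rough illustration of how the tuning variables could be read and what the multiplicative rates do, consider the sketch below. `TuningSettings` and `load_tuning_settings` are hypothetical names; the fallback values simply mirror the example `.env` above, and whether the SDK uses the same defaults is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningSettings:
    prediction_threshold: float
    top_k: int
    confidence_decay_rate: float
    confidence_growth_rate: float


def load_tuning_settings(env=None) -> TuningSettings:
    """Read the four tuning variables, falling back to the example values."""
    env = os.environ if env is None else env
    return TuningSettings(
        prediction_threshold=float(env.get("RAGPDF_PREDICTION_THRESHOLD", "0.75")),
        top_k=int(env.get("RAGPDF_TOP_K", "5")),
        confidence_decay_rate=float(env.get("RAGPDF_CONFIDENCE_DECAY_RATE", "0.95")),
        confidence_growth_rate=float(env.get("RAGPDF_CONFIDENCE_GROWTH_RATE", "1.05")),
    )


settings = load_tuning_settings({"RAGPDF_TOP_K": "3"})

# Two correct outcomes followed by one error, starting from 0.80:
# 0.80 * 1.05 * 1.05 * 0.95 is approximately 0.8379.
confidence = 0.80
for was_correct in (True, True, False):
    rate = settings.confidence_growth_rate if was_correct else settings.confidence_decay_rate
    confidence *= rate
print(settings.top_k, round(confidence, 4))
```

Because both rates are multiplicative, a single error undoes slightly less than one correct outcome (1.05 × 0.95 = 0.9975), so repeated alternation drifts confidence slowly downward.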

S3 Storage Layout

All paths below are relative to your bucket (or RAGPDF_DATA_PATH for local):

vectors/
└── vector_database.json                              # the vector DB

pdf_hash_mapping/
└── mapping.json                                      # dedup + frequency tracking

predictions/{user_id}/{session_id}/{pdf_id}/
├── metadata/
│   ├── submission_info.json                          # submission_id, frequency
│   └── pdf_info.json                                 # pdf_hash, pdf_category
├── predictions/
│   ├── input.json                                    # raw fields (for CASE_B/C vector creation)
│   ├── rag_predictions.json                          # API 1 output
│   ├── llm_predictions.json                          # provided by your backend
│   └── final_predictions.json                        # ensemble decisions
├── analysis/
│   ├── case_classification.json                      # A/B/C/D/E per field
│   ├── metrics_snapshot.json                         # initial metrics
│   └── vector_update_summary.json                    # what changed in vector DB
└── errors/
    ├── user_feedback_raw.jsonl                        # raw feedback events
    ├── error_analysis.json                            # processed error records
    ├── metrics_snapshot_updated.json                  # recalculated after errors
    └── error_log_{timestamp}.json                     # timestamped error log

metrics/time_series/
├── global/time_series.json
├── category/{category}/time_series.json
├── subcategory/{category}/{sub}/time_series.json
├── doctype/{category}/{sub}/{doc}/time_series.json
└── pdf_hash/{hash}/time_series.json
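When inspecting stored data, it can help to build these keys programmatically. The helpers below are hypothetical and only mirror the layout shown above; how the SDK escapes spaces or special characters in category names is not documented here, so the segments are passed through unchanged.

```python
from pathlib import PurePosixPath


def submission_key(user_id, session_id, pdf_id, *parts):
    """Key for a file under predictions/{user_id}/{session_id}/{pdf_id}/."""
    return str(PurePosixPath("predictions", user_id, session_id, pdf_id, *parts))


def time_series_key(level, *segments):
    """Key for a time series file under metrics/time_series/{level}/."""
    return str(PurePosixPath("metrics", "time_series", level, *segments, "time_series.json"))


rag_key = submission_key(
    "user_001", "session_abc", "pdf_xyz", "predictions", "rag_predictions.json"
)
doctype_key = time_series_key(
    "doctype", "Private Markets", "Private Equity", "LP Subscription Agreement"
)
print(rag_key)
print(doctype_key)
```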

Dev Server

pip install "ragpdf-sdk[server]"
uvicorn server.local_server:app --reload --port 8000

Endpoints:

POST /predict — API 1
POST /save-filled-pdf — API 2
POST /feedback — API 4
POST /metrics — API 5
GET /system-info — API 6
POST /error-analytics — API 7
GET /health — vector count

All endpoints require an X-API-Key header. The default key is dev-key; set RAGPDF_API_KEY to change it.
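A request to the dev server could be assembled with the standard library as sketched below. The JSON body is an assumption: it mirrors the keyword arguments of `get_predictions()`, but the server's actual request schema is not documented here. The helper name `build_predict_request` is hypothetical, and nothing is sent until `urlopen` is called.

```python
import json
import urllib.request


def build_predict_request(base_url, api_key, payload):
    """Build (but do not send) a POST /predict request with the API key header."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
    )


request = build_predict_request(
    "http://localhost:8000",
    "dev-key",
    {
        "user_id": "user_001",
        "session_id": "session_abc",
        "pdf_id": "pdf_xyz",
        "fields": [{"field_id": "f001", "field_name": "Investor Name"}],
        "pdf_hash": "abc123",
    },
)

# To send it against a running dev server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
print(request.get_method(), request.full_url)
```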

Testing

# Install dev dependencies
pip install -e ".[dev]"

# Unit tests (no API keys, no network)
pytest tests/unit/ -v

# With coverage
pytest tests/unit/ --cov=ragpdf --cov-report=html

# Integration tests (no API keys — uses DummyEmbeddingBackend)
pytest tests/integration/ -v -m integration

# All tests
pytest

Publishing

# Increment version in pyproject.toml
# Update CHANGELOG.md

git commit -am "Release v0.2.0"
git tag v0.2.0
git push origin main --tags
# GitHub Actions publishes to PyPI automatically

Versioning policy:

PATCH (0.1.x) — bug fixes, no API changes
MINOR (0.x.0) — new backends/features, backwards compatible
MAJOR (x.0.0) — breaking changes to RAGPDFClient or plugin interfaces

License

MIT — see LICENSE

Release history

0.2.4 (May 16, 2026)
0.2.3 (Apr 28, 2026)
0.2.2 (Apr 3, 2026)
0.2.1 (Apr 2, 2026, this version)
0.2.0 (Apr 2, 2026)
0.1.3 (Mar 18, 2026)
0.1.2 (Mar 18, 2026)
0.1.1 (Mar 17, 2026)

pdf-autofillr-rag 0.2.1
