AI-powered document transcription and semantic chunking for RAG pipelines

open_rag

A Python library for AI-powered document transcription and semantic chunking in RAG (Retrieval-Augmented Generation) pipelines. It transcribes PDFs to Markdown with an LLM (Claude via AWS Bedrock), chunks the resulting Markdown semantically, enriches each chunk with surrounding context, and returns LangChain Document objects that are ready to index into PostgreSQL pgvector.

Version: 0.0.1 | Python: >=3.12 | Build: uv

Features

PDF-to-Markdown transcription powered by Claude via AWS Bedrock
LangGraph-based transcription workflow with configurable retry logic and accuracy thresholds
Semantic chunking with 85th-percentile breakpoints (plus recursive and Markdown-header strategies)
Per-chunk context enrichment via a dedicated LangGraph workflow — each chunk is wrapped with <context> and <content> tags
Pluggable storage backends: local filesystem or AWS S3
Vector indexing into PostgreSQL pgvector via LangChain PGVectorStore
LangSmith tracing support
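
The retry logic and accuracy threshold from the feature list can be pictured as a simple loop: transcribe, score the result, and try again while the score is below the threshold and retries remain. The sketch below is not the library's LangGraph workflow. `transcribe_once` and `score` are hypothetical callables introduced only for illustration, and the parameter names loosely mirror the constructor arguments shown under Usage.

```python
from typing import Callable


def transcribe_with_retries(
    transcribe_once: Callable[[bytes], str],   # hypothetical: one LLM transcription attempt
    score: Callable[[bytes, str], float],      # hypothetical: accuracy estimate in [0, 1]
    page_bytes: bytes,
    accuracy_threshold: float = 0.90,
    max_retries: int = 2,
) -> tuple[str, float, int]:
    """Return (best_text, best_score, attempts_used)."""
    best_text, best_score = "", -1.0
    attempts = 0
    # One initial attempt plus up to max_retries further attempts.
    for _ in range(max_retries + 1):
        attempts += 1
        text = transcribe_once(page_bytes)
        accuracy = score(page_bytes, text)
        if accuracy > best_score:
            best_text, best_score = text, accuracy
        if accuracy >= accuracy_threshold:
            break  # good enough, stop retrying
    return best_text, best_score, attempts


# Toy run: the first attempt scores 0.70, the second 0.95.
scores = iter([0.70, 0.95])
text, accuracy, attempts = transcribe_with_retries(
    transcribe_once=lambda page: "# Title",
    score=lambda page, transcript: next(scores),
    page_bytes=b"%PDF-",
)
print(attempts, accuracy)
```

Keeping the best-scoring attempt, rather than the last one, is a design choice of this sketch; the README does not say which result the real workflow returns when every attempt falls below the threshold.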

Prerequisites

Python 3.12 or higher
uv for dependency management
AWS credentials configured (standard boto3 credential chain — env vars, ~/.aws/credentials, or instance profile)
PostgreSQL database with the pgvector extension enabled

Installation

Install from PyPI (the distribution is published as wizit-open-rag; the import package is open_rag):

pip install wizit-open-rag

For development (clone + install with dev tools):

git clone https://github.com/Restebance/open_rag.git
cd open_rag
uv sync --group dev
cp example.env .env

Fill in .env with your credentials (see Environment Variables below).

Usage

Document Transcription

OpenRagTranscriber accepts the raw bytes of a single PDF page and returns a ParsedDocPage containing the Markdown transcription.

import asyncio
from open_rag import OpenRagTranscriber

transcriber = OpenRagTranscriber(
    langsmith_project_name="my-project",   # required
    langsmith_api_key="lsv2_...",          # required
    llm_model_id="global.anthropic.claude-sonnet-4-6",
    target_language="es-CO",
    transcription_accuracy_threshold=0.90,
    max_transcription_retries=2,
)

with open("page.pdf", "rb") as f:
    page_bytes = f.read()

result = asyncio.run(transcriber.transcribe_document(page_bytes))
print(result.page_text)  # Markdown string
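
Because transcribe_document takes one page at a time, a multi-page document needs one call per page. A minimal sketch of running those calls concurrently while capping the number of in-flight requests follows. `FakeTranscriber` and `FakePage` are hypothetical stand-ins so the example runs without AWS access, `max_concurrency` is an illustrative parameter, and splitting the PDF into single-page payloads is assumed to have happened already.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakePage:
    """Stand-in for ParsedDocPage; only page_text is modelled here."""
    page_text: str


class FakeTranscriber:
    """Hypothetical stand-in for OpenRagTranscriber (no network calls)."""

    async def transcribe_document(self, page_bytes: bytes) -> FakePage:
        await asyncio.sleep(0)  # pretend to wait on the LLM
        return FakePage(page_text=f"# {page_bytes.decode()}")


async def transcribe_pages(transcriber, pages: list[bytes], max_concurrency: int = 4) -> list[str]:
    """Transcribe single-page payloads concurrently, preserving page order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(page_bytes: bytes) -> str:
        async with semaphore:  # cap the number of in-flight requests
            result = await transcriber.transcribe_document(page_bytes)
            return result.page_text

    # gather returns results in the order the awaitables were passed in
    return await asyncio.gather(*(one(page) for page in pages))


markdown_pages = asyncio.run(transcribe_pages(FakeTranscriber(), [b"page 0", b"page 1"]))
print("\n\n".join(markdown_pages))
```

With the real transcriber, a small concurrency limit is a reasonable starting point, since each call reaches a rate-limited model endpoint.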

Semantic Chunking with Context

ChunksManager takes a pre-loaded Markdown string and returns a list of LangChain Document objects, each enriched with a contextual summary.

import asyncio
from open_rag import ChunksManager

manager = ChunksManager(
    langsmith_project_name="my-project",   # required
    langsmith_api_key="lsv2_...",          # required
)

with open("document.md") as f:
    markdown_content = f.read()

docs = asyncio.run(manager.gen_context_chunks(
    file_key="document.md",
    file_markdown_content=markdown_content,
    file_tags={"category": "hr", "department": "onboarding"},
))

# docs is a List[Document]; index to pgvector as needed
for doc in docs:
    print(doc.page_content)

Note: gen_context_chunks does not load files from storage — the caller must pass the content as a string. Indexing to pgvector is the caller's responsibility.
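
When the original chunk text is needed without its generated context (for example, to show a citation to a user), the two parts can be separated again. The sketch below assumes the tags appear literally as `<context>…</context>` and `<content>…</content>` inside page_content; the README does not specify the exact layout, so treat the pattern as an assumption. `split_enriched_chunk` is a hypothetical helper, not part of the library.

```python
import re

# Matches either tag pair; the backreference \1 ensures the closing tag matches the opening one.
_TAG = re.compile(r"<(context|content)>(.*?)</\1>", re.DOTALL)


def split_enriched_chunk(page_content: str) -> dict[str, str]:
    """Return {'context': ..., 'content': ...} for whichever tags are present."""
    return {name: body.strip() for name, body in _TAG.findall(page_content)}


sample = (
    "<context>Section on onboarding paperwork.</context>\n"
    "<content>New hires must sign form A within 5 days.</content>"
)
parts = split_enriched_chunk(sample)
print(parts["content"])
```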

Architecture

The codebase follows a clean layered architecture. Dependency direction: transcription.py / chunks.py → application → domain ↔ infra ← workflows.

open_rag/                     # installable package (src/open_rag/)
├── transcription.py          # Public API — OpenRagTranscriber
├── chunks.py                 # Public API — ChunksManager
├── domain/                   # Core data models (PageToTranscribe, ParsedDocPage, ParsedDoc)
├── application/              # Orchestration + abstract interfaces (ABCs)
├── data/                     # Shared enums and prompt strings
├── infra/
│   ├── llms/                 # AWS Bedrock chat (ChatBedrockConverse)
│   ├── embeddings/           # AWS Bedrock embeddings (BedrockEmbeddings)
│   ├── persistence/          # Local filesystem, AWS S3, PostgreSQL managers
│   ├── rag/                  # SemanticChunks, RecursiveChunks, MarkdownHeadersChunks, PGVectorStore, WeaviateEmbeddingsManager
│   └── secrets/              # AWS Secrets Manager helper
├── utils/                    # validate_file_name_format
└── workflows/                # LangGraph state machines (transcription + context)
tests/                        # pytest suite
data/                         # Sample / test documents
example.env
pyproject.toml

Key Data Flow

PDF bytes
  → ParseDocModelService  (PyMuPDF → base64 pages)
  → TranscriptionWorkflow (LangGraph → Claude via AWS Bedrock → Markdown)

Markdown string + tags
  → SemanticChunks        (AWS Bedrock embeddings, 85th-percentile breakpoints)
  → ContextWorkflow       (LangGraph → Claude adds surrounding context per chunk)
  → List[Document]        (each chunk wrapped in <context> / <content> tags)
  → Caller indexes to pgvector
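
The "85th-percentile breakpoints" step can be illustrated without Bedrock: measure the distance between each pair of consecutive sentence embeddings, then start a new chunk wherever that distance exceeds the 85th percentile of all distances. The sketch below uses toy 2-D vectors in place of real embeddings and a linearly interpolated percentile; both are assumptions for illustration, and the library's actual splitter may differ in its details.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, q):
    """Linearly interpolated percentile, q in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(sentences, embeddings, q=85.0):
    # distances[i] is the gap between sentence i and sentence i + 1
    distances = [cosine_distance(a, b) for a, b in zip(embeddings, embeddings[1:])]
    threshold = percentile(distances, q)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # large semantic jump: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = ["Cats purr.", "Cats nap.", "Dogs bark.", "Dogs fetch."]
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]  # toy vectors
chunks = semantic_chunks(sentences, embeddings)
print(chunks)
```

Only the jump between the second and third sentence exceeds the threshold here, so the toy input splits into one chunk about cats and one about dogs.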

Environment Variables

Copy example.env to .env and fill in the values:

Variable	Purpose
`VECTOR_STORE_CONNECTION`	PostgreSQL connection string (pgvector)
`VECTOR_STORE_TABLE`	pgvector table name
`LANGSMITH_API_KEY`	LangSmith API key for tracing
`LANGCHAIN_PROJECT`	LangSmith project name
`LANGSMITH_TRACING`	Enable LangSmith tracing (`true` / `false`)
`SUPABASE_KEY` / `SUPABASE_URL`	Supabase credentials (optional)

AWS credentials are read from the standard boto3 credential chain and are not set in .env.

Development

Running tests

# Unit tests (mocked — no AWS credentials required)
uv run pytest

# Transcription integration test (requires live AWS credentials)
uv run python src/open_rag/transcription.py

# Chunking integration test (requires live AWS credentials)
uv run python src/open_rag/chunks.py

Profiling

# CPU profiling
uv run pyinstrument test.py transcribe <file.pdf> <source_dir> <target_dir>

# Memory profiling
uv run python -m memray run test.py transcribe <file.pdf> <source_dir> <target_dir>

Building the package

uv build

Gotchas

SemanticChunks calls AWS Bedrock at construction time (via SemanticChunker) — not just at index time. Make sure credentials are available before instantiating ChunksManager.
Both transcribe_document and gen_context_chunks are async; wrap them in asyncio.run(...) from synchronous code.
OpenRagTranscriber and ChunksManager require langsmith_project_name and langsmith_api_key as constructor arguments — they are not read from environment variables.
ParseDocModelService.parse_document_to_base64_pages iterates range(0, page_count) — pages are zero-indexed (page_number=0 is the first page).
AWS Bedrock cross-region model IDs use the global. prefix (e.g. global.anthropic.claude-sonnet-4-6).
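
A related point on the async gotcha: asyncio.run(...) only works from synchronous code. Inside an already running event loop (a notebook cell or an async web handler, for instance) it raises RuntimeError, and the coroutine should be awaited directly instead. The sketch below shows both entry points with a hypothetical stand-in coroutine, `gen_context_chunks_stub`, so it runs without AWS access.

```python
import asyncio


async def gen_context_chunks_stub(text: str) -> list[str]:
    """Hypothetical stand-in for an async library call."""
    await asyncio.sleep(0)
    return text.split("\n\n")


def chunks_sync(text: str) -> list[str]:
    """Entry point for synchronous code (scripts, CLI tools)."""
    return asyncio.run(gen_context_chunks_stub(text))


async def chunks_async(text: str) -> list[str]:
    """Entry point for code already inside an event loop: await, do not nest asyncio.run."""
    return await gen_context_chunks_stub(text)


async def nested_run_error() -> str:
    """Show what happens when asyncio.run is called while a loop is running."""
    coro = gen_context_chunks_stub("a")
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        return type(exc).__name__
    finally:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
    return "no error"


print(chunks_sync("a\n\nb"))
print(asyncio.run(chunks_async("a\n\nb")))
print(asyncio.run(nested_run_error()))
```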

License

Licensed under the Apache License 2.0.

Release history

0.0.1 (this version), released May 20, 2026

Download files

Source Distribution

wizit_open_rag-0.0.1.tar.gz (27.0 kB)

Uploaded May 20, 2026 Source

Built Distribution

wizit_open_rag-0.0.1-py3-none-any.whl (45.2 kB)

Uploaded May 20, 2026 Python 3

File details

Details for the file wizit_open_rag-0.0.1.tar.gz.

File metadata

Download URL: wizit_open_rag-0.0.1.tar.gz
Upload date: May 20, 2026
Size: 27.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.9

File hashes

Hashes for wizit_open_rag-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`8e45652b762ab9b27094f3dfdf1fa84555c160bd8b1af7aac841e82727f68ebd`
MD5	`0c4b1392a625822ec117ef49e4b9d886`
BLAKE2b-256	`f16cf71a495d8c2f1f280cbc1ef4dbe6e3c566dfe44b8393bc87bb681a68166c`

File details

Details for the file wizit_open_rag-0.0.1-py3-none-any.whl.

File metadata

Download URL: wizit_open_rag-0.0.1-py3-none-any.whl
Upload date: May 20, 2026
Size: 45.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.9

File hashes

Hashes for wizit_open_rag-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`28376f02aeefc73c0d535ad5cab98e105bef3d93f2dd0090dbecd8aa4195ecd3`
MD5	`930f72f3c6a68ab3aa1d325b9bb1c5e2`
BLAKE2b-256	`9b9c59d2b62512638f3b0bdb7e074d7d30d0d6bf2eb44eee74c17713ce5e5362`
