AI-powered document transcription and semantic chunking for RAG pipelines
open_rag
A Python library for AI-powered document transcription and semantic chunking in RAG (Retrieval-Augmented Generation) pipelines. It transcribes PDFs to Markdown with an LLM (Claude via AWS Bedrock), chunks the resulting Markdown semantically, enriches each chunk with surrounding context, and returns Document objects ready to index into PostgreSQL pgvector.
Version: 0.0.1 | Python: >=3.12 | Build: uv
Features
- PDF-to-Markdown transcription powered by Claude via AWS Bedrock
- LangGraph-based transcription workflow with configurable retry logic and accuracy thresholds
- Semantic chunking with 85th-percentile breakpoints (plus recursive and Markdown-header strategies)
- Per-chunk context enrichment via a dedicated LangGraph workflow: each chunk is wrapped with <context> and <content> tags
- Pluggable storage backends: local filesystem or AWS S3
- Vector indexing into PostgreSQL pgvector via LangChain PGVectorStore
- LangSmith tracing support
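The 85th-percentile breakpoint idea from the feature list can be illustrated without any AWS calls. The following is a minimal sketch, not the library's implementation: `cosine_distance`, `percentile`, and `split_at_breakpoints` are hypothetical helper names, and the embeddings are toy vectors rather than Bedrock output. It assumes a cut is placed wherever the distance between neighbouring sentence embeddings exceeds the 85th percentile of all such distances.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low = math.floor(rank)
    high = math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def split_at_breakpoints(
    sentences: list[str],
    embeddings: list[list[float]],
    pct: float = 85.0,
) -> list[str]:
    """Group sentences into chunks, cutting wherever the distance between
    neighbouring embeddings exceeds the pct-th percentile of all distances."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks: list[str] = []
    current = [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large semantic jump: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

A lower percentile produces more, smaller chunks; a higher one produces fewer, larger chunks.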
Prerequisites
- Python 3.12 or higher
- uv for dependency management
- AWS credentials configured (standard boto3 credential chain: env vars, ~/.aws/credentials, or instance profile)
- PostgreSQL database with the pgvector extension enabled
Installation
Install from PyPI:
pip install open_rag
For development (clone + install with dev tools):
git clone https://github.com/Restebance/open_rag.git
cd open_rag
uv sync --group dev
cp example.env .env
Fill in .env with your credentials (see Environment Variables below).
Usage
Document Transcription
OpenRagTranscriber accepts the raw bytes of a single PDF page and returns a ParsedDocPage containing the Markdown transcription.
import asyncio
from open_rag import OpenRagTranscriber
transcriber = OpenRagTranscriber(
langsmith_project_name="my-project", # required
langsmith_api_key="lsv2_...", # required
llm_model_id="global.anthropic.claude-sonnet-4-6",
target_language="es-CO",
transcription_accuracy_threshold=0.90,
max_transcription_retries=2,
)
with open("page.pdf", "rb") as f:
    page_bytes = f.read()
result = asyncio.run(transcriber.transcribe_document(page_bytes))
print(result.page_text) # Markdown string
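Because transcribe_document handles one page at a time and is async, a multi-page document can in principle be processed with bounded concurrency. The sketch below shows the pattern with a stand-in coroutine; `transcribe_pages` and `fake_transcribe` are hypothetical names, and whether a single OpenRagTranscriber instance is safe to call concurrently is an assumption this README does not confirm.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def transcribe_pages(
    transcribe_page: Callable[[bytes], Awaitable[object]],
    pages: list[bytes],
    max_concurrency: int = 4,
) -> list[object]:
    """Run an async per-page callable over all pages, at most
    max_concurrency at a time, preserving page order in the result."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page_bytes: bytes) -> object:
        async with semaphore:
            return await transcribe_page(page_bytes)

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(bounded(p) for p in pages))


async def fake_transcribe(page_bytes: bytes) -> str:
    """Hypothetical stand-in for transcriber.transcribe_document."""
    await asyncio.sleep(0)
    return f"# page with {len(page_bytes)} bytes"


if __name__ == "__main__":
    results = asyncio.run(transcribe_pages(fake_transcribe, [b"a", b"bb", b"ccc"]))
    print(results)
```

With a real transcriber, `transcriber.transcribe_document` would be passed in place of `fake_transcribe`. Keeping `max_concurrency` small may help stay within Bedrock rate limits.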
Semantic Chunking with Context
ChunksManager takes a pre-loaded Markdown string and returns a list of LangChain Document objects, each enriched with a contextual summary.
import asyncio
from open_rag import ChunksManager
manager = ChunksManager(
langsmith_project_name="my-project", # required
langsmith_api_key="lsv2_...", # required
)
with open("document.md") as f:
    markdown_content = f.read()
docs = asyncio.run(manager.gen_context_chunks(
file_key="document.md",
file_markdown_content=markdown_content,
file_tags={"category": "hr", "department": "onboarding"},
))
# docs is a List[Document]; index to pgvector as needed
for doc in docs:
    print(doc.page_content)
Note: gen_context_chunks does not load files from storage; the caller must pass the content as a string. Indexing to pgvector is the caller's responsibility.
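When inspecting or post-processing the returned chunks, it can be useful to separate the generated context from the original text. The sketch below assumes page_content has the shape `<context>...</context>` followed by `<content>...</content>`; the exact layout, including the closing tags, is an assumption, and `split_enriched_chunk` is a hypothetical helper that is not part of the library.

```python
import re

# Assumed layout: <context>...</context> then <content>...</content>.
_TAG_PATTERN = re.compile(
    r"<context>(?P<context>.*?)</context>\s*<content>(?P<content>.*?)</content>",
    re.DOTALL,
)


def split_enriched_chunk(page_content: str) -> tuple[str, str]:
    """Return (context, content). When the tags are absent, the context is
    an empty string and the whole input is treated as content."""
    match = _TAG_PATTERN.search(page_content)
    if match is None:
        return "", page_content.strip()
    return match.group("context").strip(), match.group("content").strip()
```

For example, `split_enriched_chunk(doc.page_content)` could be applied to each Document before logging or display.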
Architecture
The codebase follows a clean layered architecture. Dependency direction: transcription.py / chunks.py → application → domain ↔ infra ← workflows.
open_rag/ # installable package (src/open_rag/)
├── transcription.py # Public API — OpenRagTranscriber
├── chunks.py # Public API — ChunksManager
├── domain/ # Core data models (PageToTranscribe, ParsedDocPage, ParsedDoc)
├── application/ # Orchestration + abstract interfaces (ABCs)
├── data/ # Shared enums and prompt strings
├── infra/
│ ├── llms/ # AWS Bedrock chat (ChatBedrockConverse)
│ ├── embeddings/ # AWS Bedrock embeddings (BedrockEmbeddings)
│ ├── persistence/ # Local filesystem, AWS S3, PostgreSQL managers
│ ├── rag/ # SemanticChunks, RecursiveChunks, MarkdownHeadersChunks, PGVectorStore, WeaviateEmbeddingsManager
│ └── secrets/ # AWS Secrets Manager helper
├── utils/ # validate_file_name_format
└── workflows/ # LangGraph state machines (transcription + context)
tests/ # pytest suite
data/ # Sample / test documents
example.env
pyproject.toml
Key Data Flow
PDF bytes
→ ParseDocModelService (PyMuPDF → base64 pages)
→ TranscriptionWorkflow (LangGraph → Claude via AWS Bedrock → Markdown)
Markdown string + tags
→ SemanticChunks (AWS Bedrock embeddings, 85th-percentile breakpoints)
→ ContextWorkflow (LangGraph → Claude adds surrounding context per chunk)
→ List[Document] (each chunk wrapped in <context> / <content> tags)
→ Caller indexes to pgvector
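The interaction between transcription_accuracy_threshold and max_transcription_retries can be pictured as a bounded retry loop. This is only an illustration written as a plain loop: the real workflow is a LangGraph state machine, and how accuracy is scored, or whether the best or the last attempt is returned, is not stated here. `Attempt` and `transcribe_with_retries` are hypothetical names.

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class Attempt:
    markdown: str
    accuracy: float  # assumed score in [0, 1]; the scoring method is not specified


def transcribe_with_retries(
    attempt_once: Callable[[], Attempt],
    threshold: float = 0.90,
    max_retries: int = 2,
) -> Attempt:
    """Call attempt_once up to 1 + max_retries times. Stop as soon as an
    attempt reaches the threshold; otherwise return the best-scoring one."""
    best = attempt_once()
    retries = 0
    while best.accuracy < threshold and retries < max_retries:
        candidate = attempt_once()
        retries += 1
        if candidate.accuracy > best.accuracy:
            best = candidate
    return best
```

With `max_retries=2`, the model is called at most three times per page.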
Environment Variables
Copy example.env to .env and fill in the values:
| Variable | Purpose |
|---|---|
| VECTOR_STORE_CONNECTION | PostgreSQL connection string (pgvector) |
| VECTOR_STORE_TABLE | pgvector table name |
| LANGSMITH_API_KEY | LangSmith API key for tracing |
| LANGCHAIN_PROJECT | LangSmith project name |
| LANGSMITH_TRACING | Enable LangSmith tracing (true / false) |
| SUPABASE_KEY / SUPABASE_URL | Supabase credentials (optional) |
AWS credentials are read from the standard boto3 credential chain and are not set in .env.
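Since the constructors take the LangSmith settings as explicit arguments (see Gotchas), one option is to read them from the environment yourself and pass them in. The helper below is a sketch: `langsmith_kwargs` is a hypothetical name, and mapping LANGCHAIN_PROJECT to langsmith_project_name is an assumption based on the table above. It also assumes the variables are already in the process environment, for example exported in the shell or loaded from .env by a tool of your choice.

```python
import os
from collections.abc import Mapping


def langsmith_kwargs(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Build the two required constructor arguments from environment
    variables, raising a clear error when either one is missing or empty."""
    required = ("LANGCHAIN_PROJECT", "LANGSMITH_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "langsmith_project_name": env["LANGCHAIN_PROJECT"],
        "langsmith_api_key": env["LANGSMITH_API_KEY"],
    }
```

The result can then be unpacked into either constructor, for example `OpenRagTranscriber(**langsmith_kwargs())`.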
Development
Running tests
# Unit tests (mocked — no AWS credentials required)
uv run pytest
# Transcription integration test (requires live AWS credentials)
uv run python src/open_rag/transcription.py
# Chunking integration test (requires live AWS credentials)
uv run python src/open_rag/chunks.py
Profiling
# CPU profiling
uv run pyinstrument test.py transcribe <file.pdf> <source_dir> <target_dir>
# Memory profiling
uv run python -m memray run test.py transcribe <file.pdf> <source_dir> <target_dir>
Building the package
uv build
Gotchas
- SemanticChunks calls AWS Bedrock at construction time (via SemanticChunker), not just at index time. Make sure credentials are available before instantiating ChunksManager.
- Both transcribe_document and gen_context_chunks are async; wrap them in asyncio.run(...) from synchronous code.
- OpenRagTranscriber and ChunksManager require langsmith_project_name and langsmith_api_key as constructor arguments; they are not read from environment variables.
- ParseDocModelService.parse_document_to_base64_pages iterates range(0, page_count), so pages are zero-indexed (page_number=0 is the first page).
- AWS Bedrock cross-region model IDs use the global. prefix (e.g. global.anthropic.claude-sonnet-4-6).
License
Licensed under the Apache License 2.0.