AI-powered document transcription and semantic chunking for RAG pipelines

open_rag

A Python library for AI-powered document transcription and semantic chunking in RAG (Retrieval-Augmented Generation) pipelines. It transcribes PDFs to Markdown with an LLM (Claude via AWS Bedrock), splits the resulting Markdown into semantic chunks, enriches each chunk with surrounding context, and returns LangChain Document objects that are ready to index into PostgreSQL pgvector.

Version: 0.0.1 | Python: >=3.12 | Build: uv


Features

  • PDF-to-Markdown transcription powered by Claude via AWS Bedrock
  • LangGraph-based transcription workflow with configurable retry logic and accuracy thresholds
  • Semantic chunking with 85th-percentile breakpoints (plus recursive and Markdown-header strategies)
  • Per-chunk context enrichment via a dedicated LangGraph workflow — each chunk is wrapped with <context> and <content> tags
  • Pluggable storage backends: local filesystem or AWS S3
  • Vector indexing into PostgreSQL pgvector via LangChain PGVectorStore
  • LangSmith tracing support

Prerequisites

  • Python 3.12 or higher
  • uv for dependency management
  • AWS credentials configured (standard boto3 credential chain — env vars, ~/.aws/credentials, or instance profile)
  • PostgreSQL database with the pgvector extension enabled

Installation

Install from PyPI:

pip install open_rag

For development (clone + install with dev tools):

git clone https://github.com/Restebance/open_rag.git
cd open_rag
uv sync --group dev
cp example.env .env

Fill in .env with your credentials (see Environment Variables below).


Usage

Document Transcription

OpenRagTranscriber accepts the raw bytes of a single PDF page and returns a ParsedDocPage containing the Markdown transcription.

import asyncio
from open_rag import OpenRagTranscriber

transcriber = OpenRagTranscriber(
    langsmith_project_name="my-project",   # required
    langsmith_api_key="lsv2_...",          # required
    llm_model_id="global.anthropic.claude-sonnet-4-6",
    target_language="es-CO",
    transcription_accuracy_threshold=0.90,
    max_transcription_retries=2,
)

with open("page.pdf", "rb") as f:
    page_bytes = f.read()

result = asyncio.run(transcriber.transcribe_document(page_bytes))
print(result.page_text)  # Markdown string
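
The transcription_accuracy_threshold and max_transcription_retries arguments suggest a retry-until-good-enough loop. The following is a minimal sketch of that idea and not the library's actual LangGraph workflow: transcribe_once and score are hypothetical callables, and treating max_retries as the number of extra attempts after the first one is an assumption.

```python
from typing import Callable, Tuple


def transcribe_with_retries(
    transcribe_once: Callable[[], str],  # hypothetical: one LLM transcription call
    score: Callable[[str], float],       # hypothetical: accuracy estimate in [0, 1]
    threshold: float = 0.90,
    max_retries: int = 2,
) -> Tuple[str, float, int]:
    """Return (best_text, best_score, attempts_used)."""
    best_text, best_score = "", -1.0
    attempts = 0
    # One initial attempt plus up to max_retries retries (assumed semantics).
    for _ in range(max_retries + 1):
        attempts += 1
        text = transcribe_once()
        current = score(text)
        # Keep the best attempt seen so far, even if it never reaches the threshold.
        if current > best_score:
            best_text, best_score = text, current
        if best_score >= threshold:
            break  # good enough, stop early
    return best_text, best_score, attempts
```

Keeping the best attempt rather than the last one means a failed final retry cannot make the result worse than an earlier try.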

Semantic Chunking with Context

ChunksManager takes a pre-loaded Markdown string and returns a list of LangChain Document objects, each enriched with a contextual summary.

import asyncio
from open_rag import ChunksManager

manager = ChunksManager(
    langsmith_project_name="my-project",   # required
    langsmith_api_key="lsv2_...",          # required
)

with open("document.md") as f:
    markdown_content = f.read()

docs = asyncio.run(manager.gen_context_chunks(
    file_key="document.md",
    file_markdown_content=markdown_content,
    file_tags={"category": "hr", "department": "onboarding"},
))

# docs is a List[Document]; index to pgvector as needed
for doc in docs:
    print(doc.page_content)

Note: gen_context_chunks does not load files from storage — the caller must pass the content as a string. Indexing to pgvector is the caller's responsibility.
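
Because each chunk is wrapped in <context> and <content> tags, downstream code may want the two parts separately, for example to display only the original text. The helper below is a sketch under the assumption that page_content looks like "<context>...</context>" followed by "<content>...</content>"; the exact layout is not documented here, and split_enriched_chunk is a hypothetical name, not part of open_rag.

```python
import re
from typing import Dict

# Assumed layout: "<context>...</context>" followed by "<content>...</content>".
_TAG_PATTERN = re.compile(
    r"<context>(?P<context>.*?)</context>\s*<content>(?P<content>.*?)</content>",
    re.DOTALL,
)


def split_enriched_chunk(page_content: str) -> Dict[str, str]:
    """Split an enriched chunk into its context summary and original content."""
    match = _TAG_PATTERN.search(page_content)
    if match is None:
        # No tags found: treat the whole string as plain content.
        return {"context": "", "content": page_content.strip()}
    return {
        "context": match.group("context").strip(),
        "content": match.group("content").strip(),
    }
```

It could be applied as split_enriched_chunk(doc.page_content) inside the loop over docs.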


Architecture

The codebase follows a clean layered architecture. Dependency direction: transcription.py / chunks.py → application → domain ↔ infra ← workflows.

open_rag/                     # installable package (src/open_rag/)
├── transcription.py          # Public API — OpenRagTranscriber
├── chunks.py                 # Public API — ChunksManager
├── domain/                   # Core data models (PageToTranscribe, ParsedDocPage, ParsedDoc)
├── application/              # Orchestration + abstract interfaces (ABCs)
├── data/                     # Shared enums and prompt strings
├── infra/
│   ├── llms/                 # AWS Bedrock chat (ChatBedrockConverse)
│   ├── embeddings/           # AWS Bedrock embeddings (BedrockEmbeddings)
│   ├── persistence/          # Local filesystem, AWS S3, PostgreSQL managers
│   ├── rag/                  # SemanticChunks, RecursiveChunks, MarkdownHeadersChunks, PGVectorStore, WeaviateEmbeddingsManager
│   └── secrets/              # AWS Secrets Manager helper
├── utils/                    # validate_file_name_format
└── workflows/                # LangGraph state machines (transcription + context)
tests/                        # pytest suite
data/                         # Sample / test documents
example.env
pyproject.toml
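
To illustrate the layering, here is a small sketch of how an abstract interface in the application layer and a concrete backend in infra/persistence might relate. All names (FileStorage, LocalFileStorage, copy_file) are hypothetical and chosen for illustration; they are not the package's real classes.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class FileStorage(ABC):
    """Hypothetical application-layer interface for a storage backend."""

    @abstractmethod
    def load(self, file_key: str) -> bytes: ...

    @abstractmethod
    def save(self, file_key: str, data: bytes) -> None: ...


class LocalFileStorage(FileStorage):
    """Hypothetical infra-layer backend that stores files under a root folder."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def load(self, file_key: str) -> bytes:
        return (self._root / file_key).read_bytes()

    def save(self, file_key: str, data: bytes) -> None:
        path = self._root / file_key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)


def copy_file(storage: FileStorage, src_key: str, dst_key: str) -> None:
    # Orchestration code depends only on the abstract interface.
    storage.save(dst_key, storage.load(src_key))


def demo_roundtrip() -> bytes:
    # Exercise the local backend inside a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        storage = LocalFileStorage(Path(tmp))
        storage.save("in/doc.md", b"# Title\n")
        copy_file(storage, "in/doc.md", "out/doc.md")
        return storage.load("out/doc.md")
```

An S3 backend would implement the same two methods, so copy_file would not need to change.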

Key Data Flow

PDF bytes
  → ParseDocModelService  (PyMuPDF → base64 pages)
  → TranscriptionWorkflow (LangGraph → Claude via AWS Bedrock → Markdown)

Markdown string + tags
  → SemanticChunks        (AWS Bedrock embeddings, 85th-percentile breakpoints)
  → ContextWorkflow       (LangGraph → Claude adds surrounding context per chunk)
  → List[Document]        (each chunk wrapped in <context> / <content> tags)
  → Caller indexes to pgvector
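
The "85th-percentile breakpoints" step can be pictured as follows: measure the distance between consecutive sentence embeddings, take the 85th percentile of those distances as a threshold, and start a new chunk wherever the distance exceeds it. This is a simplified standard-library sketch of that technique, not the code used by SemanticChunks; the embeddings are passed in as plain lists of non-zero vectors, and all function names are illustrative.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: Sequence[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def split_at_breakpoints(
    sentences: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    pct: float = 85.0,
) -> List[str]:
    # Distance between each sentence and the one that follows it.
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    if not distances:
        return [" ".join(sentences)] if sentences else []
    threshold = percentile(distances, pct)
    chunks: List[str] = []
    current = [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large semantic jump: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

A higher percentile yields fewer, larger chunks, because fewer distances exceed the threshold.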

Environment Variables

Copy example.env to .env and fill in the values:

Variable                     Purpose
VECTOR_STORE_CONNECTION      PostgreSQL connection string (pgvector)
VECTOR_STORE_TABLE           pgvector table name
LANGSMITH_API_KEY            LangSmith API key for tracing
LANGCHAIN_PROJECT            LangSmith project name
LANGSMITH_TRACING            Enable LangSmith tracing (true / false)
SUPABASE_KEY / SUPABASE_URL  Supabase credentials (optional)

AWS credentials are read from the standard boto3 credential chain and are not set in .env.
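
Since the public classes take the LangSmith values as constructor arguments, application code typically has to read these variables itself. The sketch below shows one way to do that with the standard library only. It assumes the variables are already present in the process environment, and which of them are treated as required is an assumption made for the example; load_settings is a hypothetical helper, not part of open_rag.

```python
import os
from typing import Dict, Mapping, Union

# Assumed to be mandatory for this example; the Supabase values are optional.
REQUIRED = (
    "VECTOR_STORE_CONNECTION",
    "VECTOR_STORE_TABLE",
    "LANGSMITH_API_KEY",
    "LANGCHAIN_PROJECT",
)


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Union[str, bool]]:
    """Collect the documented variables into a dict, failing fast on gaps."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "vector_store_connection": env["VECTOR_STORE_CONNECTION"],
        "vector_store_table": env["VECTOR_STORE_TABLE"],
        "langsmith_api_key": env["LANGSMITH_API_KEY"],
        "langsmith_project_name": env["LANGCHAIN_PROJECT"],
        # Anything other than "true" (case-insensitive) disables tracing.
        "langsmith_tracing": env.get("LANGSMITH_TRACING", "false").strip().lower() == "true",
    }
```

The langsmith_api_key and langsmith_project_name entries can then be passed to OpenRagTranscriber or ChunksManager.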


Development

Running tests

# Unit tests (mocked — no AWS credentials required)
uv run pytest

# Transcription integration test (requires live AWS credentials)
uv run python src/open_rag/transcription.py

# Chunking integration test (requires live AWS credentials)
uv run python src/open_rag/chunks.py

Profiling

# CPU profiling
uv run pyinstrument test.py transcribe <file.pdf> <source_dir> <target_dir>

# Memory profiling
uv run python -m memray run test.py transcribe <file.pdf> <source_dir> <target_dir>

Building the package

uv build

Gotchas

  • SemanticChunks calls AWS Bedrock at construction time (via SemanticChunker) — not just at index time. Make sure credentials are available before instantiating ChunksManager.
  • Both transcribe_document and gen_context_chunks are async; wrap them in asyncio.run(...) from synchronous code.
  • OpenRagTranscriber and ChunksManager require langsmith_project_name and langsmith_api_key as constructor arguments — they are not read from environment variables.
  • ParseDocModelService.parse_document_to_base64_pages iterates range(0, page_count) — pages are zero-indexed (page_number=0 is the first page).
  • AWS Bedrock cross-region model IDs use the global. prefix (e.g. global.anthropic.claude-sonnet-4-6).
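
Two of the points above, async entry points and zero-indexed pages, combine naturally when several pages are transcribed from synchronous code. The sketch below uses a fake coroutine in place of transcriber.transcribe_document so it runs without AWS; whether the real method is safe to call concurrently, and how Bedrock rate limits behave under such load, is not stated here and is an assumption to verify.

```python
import asyncio
from typing import Awaitable, Callable, List


async def transcribe_pages(
    pages: List[bytes],
    transcribe_page: Callable[[bytes], Awaitable[str]],  # hypothetical stand-in
) -> List[str]:
    # asyncio.gather preserves input order, so results[0] belongs to page_number=0.
    results = await asyncio.gather(*(transcribe_page(page) for page in pages))
    return list(results)


async def fake_transcribe(page: bytes) -> str:
    # Placeholder for a real LLM call; yields control once, then returns text.
    await asyncio.sleep(0)
    return page.decode().upper()


def run_demo() -> List[str]:
    pages = [b"first", b"second", b"third"]
    # A single asyncio.run call drives all pages from synchronous code.
    return asyncio.run(transcribe_pages(pages, fake_transcribe))
```

Calling asyncio.run once around the whole batch avoids creating a new event loop per page.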

License

Licensed under the Apache License 2.0.
