AI-powered document transcription and semantic chunking for RAG pipelines

wizit_open_rag

A Python library for AI-powered document transcription and semantic chunking for RAG (Retrieval-Augmented Generation) pipelines. It processes PDFs and images through a cost-aware tiered pipeline — plain-text extraction first, OCR second, LLM last — then chunks the resulting Markdown semantically, enriches each chunk with surrounding context, and returns ready-to-index Document objects for PostgreSQL pgvector or Weaviate.

Version: 0.0.9 | Python: >=3.12

Features

Cost-aware tiered transcription: pdfplumber (free) → OCR (AWS Textract or Mistral Document AI) → Claude Haiku (LLM fallback). Each page escalates only when the previous tier scores below the quality threshold.
Image transcription: PNG and JPG/JPEG files bypass the tiered pipeline entirely and go straight to the LLM (Claude vision via AWS Bedrock). Pass file_name="scan.png" to transcribe_document — no other change needed.
LangGraph-based transcription workflow with configurable retry logic and accuracy thresholds.
Per-chunk context enrichment — each chunk is wrapped with <context> and <content> tags for higher retrieval precision.
Markdown-header-based chunking strategy, ready to extend to semantic or recursive splitting.
Pluggable vector store backends: PostgreSQL pgvector (PgEmbeddingsManager) or Weaviate (WeaviateEmbeddingsManager).
LangSmith tracing built in.

Prerequisites

Python 3.12 or higher
AWS credentials configured (standard boto3 credential chain — env vars, ~/.aws/credentials, or instance profile). Required for AWS Bedrock (LLM + embeddings) and optionally for AWS Textract and S3.
For pgvector: PostgreSQL with the pgvector extension enabled.
For Weaviate: a running Weaviate instance (local or cloud).
For Mistral OCR: a MISTRAL_API_KEY environment variable or the key passed directly.
For Voyage AI embeddings: a VOYAGE_API_KEY environment variable or the key passed directly.
For Anthropic direct API: an ANTHROPIC_API_KEY environment variable or the key passed directly to ClaudeModels.

Installation

pip install wizit_open_rag

Quickstart

1. Transcribe a PDF page

OpenRagTranscriber accepts raw bytes for a single PDF page and returns a ParsedDocPage containing the Markdown transcription. By default it uses AWS Bedrock; pass ai_service=ClaudeModels(...) to use the Anthropic direct API instead.

import asyncio
import fitz  # PyMuPDF — pip install pymupdf
from wizit_open_rag import OpenRagTranscriber

transcriber = OpenRagTranscriber(
    langsmith_project_name="my-project",  # required
    langsmith_api_key="lsv2_...",         # required
    target_language="en",
)

# Split a multi-page PDF into single-page byte blobs
with fitz.open("document.pdf") as doc:
    single = fitz.open()
    single.insert_pdf(doc, from_page=0, to_page=0)
    page_bytes = single.tobytes()

result = asyncio.run(transcriber.transcribe_document(page_number=1, page_content=page_bytes))
print(result.page_text)  # Markdown string

Using the Anthropic direct API instead of Bedrock:

import asyncio
import fitz
from wizit_open_rag import OpenRagTranscriber
from wizit_open_rag.infra.llms.claude_model import ClaudeModels

transcriber = OpenRagTranscriber(
    langsmith_project_name="my-project",
    langsmith_api_key="lsv2_...",
    ai_service=ClaudeModels("claude-sonnet-4-6"),  # reads ANTHROPIC_API_KEY from env
    # ai_service=ClaudeModels("claude-sonnet-4-6", api_key="sk-ant-..."),  # or pass directly
    target_language="en",
)

with fitz.open("document.pdf") as doc:
    single = fitz.open()
    single.insert_pdf(doc, from_page=0, to_page=0)
    page_bytes = single.tobytes()

result = asyncio.run(transcriber.transcribe_document(page_number=1, page_content=page_bytes))
print(result.page_text)

1b. Transcribe a standalone image (PNG / JPG)

Pass file_name with the image extension to signal that the input is an image rather than a PDF. Images bypass the tiered pipeline and go directly to the LLM, regardless of whether use_tiered_transcription is set.

import asyncio
from wizit_open_rag import OpenRagTranscriber

transcriber = OpenRagTranscriber(
    langsmith_project_name="my-project",
    langsmith_api_key="lsv2_...",
)

with open("scan.png", "rb") as f:
    image_bytes = f.read()

result = asyncio.run(
    transcriber.transcribe_document(
        page_number=1,
        page_content=image_bytes,
        file_name="scan.png",   # .png / .jpg / .jpeg — triggers image path
    )
)
print(result.page_text)

Supported image extensions: .png, .jpg, .jpeg. Any extension other than these or .pdf raises ValueError. When file_name is None or omitted, PDF is assumed (backwards-compatible default).

2. Chunk Markdown and generate context

ChunksManager takes a pre-loaded Markdown string and returns a list of LangChain Document objects, each enriched with a contextual summary. By default it uses AWS Bedrock; pass ai_service=ClaudeModels(...) to use the Anthropic direct API.

import asyncio
from wizit_open_rag import ChunksManager

manager = ChunksManager(
    langsmith_project_name="my-project",  # required
    langsmith_api_key="lsv2_...",         # required
)

with open("document.md") as f:
    markdown = f.read()

docs = asyncio.run(manager.gen_context_chunks(
    file_key="document.md",
    file_markdown_content=markdown,
    file_tags={"category": "hr", "department": "onboarding"},
))

for doc in docs:
    print(doc.page_content)   # "<context>...</context><content>...</content>"
    print(doc.metadata)       # {"source": "document.md", "category": "hr", ...}
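
If downstream code needs the generated context and the original chunk text separately, the tagged layout shown in the comment above can be split again. The following is a minimal sketch that assumes the two tags appear once each and in that order; `split_enriched_chunk` is a hypothetical helper, not part of wizit_open_rag.

```python
import re

# Pattern for the "<context>...</context><content>...</content>" layout.
# DOTALL lets "." span newlines inside either section.
_CHUNK_RE = re.compile(
    r"<context>(?P<context>.*?)</context>\s*<content>(?P<content>.*?)</content>",
    re.DOTALL,
)


def split_enriched_chunk(page_content: str) -> tuple[str, str]:
    """Return (context, content) from an enriched chunk string.

    Hypothetical helper for illustration; not provided by the library.
    """
    match = _CHUNK_RE.search(page_content)
    if match is None:
        raise ValueError("chunk does not contain <context> and <content> sections")
    return match.group("context").strip(), match.group("content").strip()


sample = (
    "<context>Onboarding policy, section 2.</context>"
    "<content>## Leave\nStaff get 15 days.</content>"
)
context, content = split_enriched_chunk(sample)
print(context)
print(content)
```

This can be useful for showing only the original text to end users while still embedding the enriched string.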

Using the Anthropic direct API:

import asyncio
from wizit_open_rag import ChunksManager
from wizit_open_rag.infra.llms.claude_model import ClaudeModels

manager = ChunksManager(
    langsmith_project_name="my-project",
    langsmith_api_key="lsv2_...",
    ai_service=ClaudeModels("claude-sonnet-4-6"),  # reads ANTHROPIC_API_KEY from env
)

with open("document.md") as f:
    markdown = f.read()

docs = asyncio.run(manager.gen_context_chunks(
    file_key="document.md",
    file_markdown_content=markdown,
    file_tags={"category": "hr"},
))

3. Full pipeline — transcribe, chunk, and index

import asyncio
import fitz
from wizit_open_rag import OpenRagTranscriber, ChunksManager
from wizit_open_rag.infra.embeddings.aws_embeddings import AWSEmbeddingsModels
from wizit_open_rag.infra.rag.weaviate_embeddings import WeaviateEmbeddingsManager

# ── Transcription ──────────────────────────────────────────────────────────────
transcriber = OpenRagTranscriber(
    langsmith_project_name="my-project",
    langsmith_api_key="lsv2_...",
    use_tiered_transcription=True,  # cost-aware: pdfplumber → Textract → Haiku
    tier2_ocr="textract",
    target_language="en",
)

pages_text = []
with fitz.open("document.pdf") as doc:
    for i in range(len(doc)):
        single = fitz.open()
        single.insert_pdf(doc, from_page=i, to_page=i)
        result = asyncio.run(transcriber.transcribe_document(
            page_number=i + 1,
            page_content=single.tobytes(),
        ))
        pages_text.append(result.page_text or "")

markdown = "\n\n".join(pages_text)

# ── Chunking + indexing ────────────────────────────────────────────────────────
embeddings = AWSEmbeddingsModels("amazon.titan-embed-text-v1").load_embeddings_model()

kdb = WeaviateEmbeddingsManager(
    embeddings_model=embeddings,
    weaviate_url="http://localhost:8080",
    collection_name="Documents",
    records_manager_db_url="postgresql://user:password@localhost:5432/vectordb",
)

manager = ChunksManager(
    langsmith_project_name="my-project",
    langsmith_api_key="lsv2_...",
    kdb=kdb,
)

result = asyncio.run(manager.gen_and_index_context_chunks(
    file_key="document.md",
    file_markdown_content=markdown,
    file_tags={"source_doc": "document.pdf"},
))
print(result)  # IndexingResult(num_added=12, num_updated=0, num_deleted=0)
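
The transcription loop above calls asyncio.run once per page, which starts a new event loop each time. Inside an async application it may be preferable to drive all pages from one loop. The sketch below shows that pattern with a stand-in coroutine (`fake_transcribe`) so it runs offline; `transcribe_pages` and `max_concurrency` are hypothetical names. Whether OpenRagTranscriber tolerates concurrent calls is not stated in this README, so treat any concurrency above 1 as an assumption to verify.

```python
import asyncio
from typing import Awaitable, Callable


async def transcribe_pages(
    transcribe_page: Callable[[int, bytes], Awaitable[str]],
    pages: list[bytes],
    max_concurrency: int = 4,
) -> list[str]:
    """Transcribe pages inside one event loop and return texts in page order."""
    limiter = asyncio.Semaphore(max_concurrency)

    async def run_one(index: int, blob: bytes) -> str:
        async with limiter:  # cap the number of in-flight calls
            return await transcribe_page(index + 1, blob)  # page numbers are 1-based

    # gather() preserves input order, so the result lines up with `pages`
    return await asyncio.gather(*(run_one(i, b) for i, b in enumerate(pages)))


async def fake_transcribe(page_number: int, page_content: bytes) -> str:
    """Stand-in for a real transcription call; only here to make the sketch runnable."""
    await asyncio.sleep(0)
    return f"# Page {page_number} ({len(page_content)} bytes)"


texts = asyncio.run(transcribe_pages(fake_transcribe, [b"aa", b"bbb", b"c"]))
markdown = "\n\n".join(texts)
print(markdown)
```

With the real library, the callable would wrap transcribe_document and return `result.page_text or ""`. Setting max_concurrency=1 keeps the calls strictly sequential.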

Transcription Reference

`OpenRagTranscriber`

from wizit_open_rag import OpenRagTranscriber

Constructor parameters

Parameter	Type	Default	Description
`langsmith_project_name`	`str`	required	LangSmith project name for tracing
`langsmith_api_key`	`str`	required	LangSmith API key
`llm_model_id`	`str`	`"global.anthropic.claude-sonnet-4-6"`	Bedrock model ID used when `use_tiered_transcription=False`
`target_language`	`str`	`"es"`	BCP-47 language tag for the output (e.g. `"en"`, `"es-CO"`)
`transcription_additional_instructions`	`str`	`""`	Extra instructions appended to the system prompt
`transcription_accuracy_threshold`	`float`	`0.80`	Minimum quality score `[0.0, 0.95]` to accept a tier's output
`max_transcription_retries`	`int`	`2`	LLM retry attempts `[1, 3]` within the LangGraph loop
`use_tiered_transcription`	`bool`	`False`	Enable cost-aware tiered pipeline
`tier2_ocr`	`"textract" | "mistral"`	`"textract"`	Tier 2 OCR backend
`tier3_model_id`	`str`	`"us.anthropic.claude-haiku-4-5-20251001-v1:0"`	Bedrock model for the LLM fallback tier
`mistral_api_key`	`str | None`	`None`	Mistral API key; falls back to `MISTRAL_API_KEY` env var
`ai_service`	`AiApplicationService | None`	`None`	LLM backend override for the standard path and for image transcription. Pass `ClaudeModels(...)` to use the Anthropic direct API instead of Bedrock. When set, `llm_model_id` is ignored.
`tier3_ai_service`	`AiApplicationService | None`	`None`	LLM backend override for the Tier 3 fallback (only relevant when `use_tiered_transcription=True`). Defaults to `AWSModels(tier3_model_id)` when not set.

Method

async def transcribe_document(
    page_number: int,
    page_content: str | bytes,
    file_name: str | None = None,
) -> ParsedDocPage

page_content — raw bytes of the input. For PDFs, use PyMuPDF to extract a single page. For images, read the file directly.
file_name — optional filename used to detect the input format from its extension. When None or omitted, PDF is assumed. Supported extensions: .pdf, .png, .jpg, .jpeg. Unsupported extensions raise ValueError.

Image routing: when file_name has an image extension, the tiered pipeline is skipped and the page goes directly to llm_model_id (not tier3_model_id), regardless of the use_tiered_transcription setting.

Input model

@dataclass
class PageToTranscribe:
    page_number: int
    page_content: str | bytes
    media_type: str = "application/pdf"  # set automatically from file_name extension

Return type

@dataclass
class ParsedDocPage:
    page_number: int
    page_content: str | bytes  # original input
    page_text: str | None      # Markdown transcription

Tiered pipeline

When use_tiered_transcription=True, each PDF page flows through tiers in order. A tier's output is accepted when its score meets transcription_accuracy_threshold; otherwise the next tier runs.

Tier 1 — pdfplumber    (free, no network, digital text + tables)
    ↓ score < threshold
Tier 2 — AWS Textract  (OCR API, tables + forms)
       OR Mistral OCR  (swap via tier2_ocr="mistral")
    ↓ score < threshold
Tier 3 — Claude Haiku  (LLM fallback, always produces a result)
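
The escalation rule in the diagram can be expressed as a short loop. The sketch below is illustrative only: `run_tiers`, `TierResult`, and the fixed scores are hypothetical, and how the library actually computes a tier's quality score is not described here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TierResult:
    tier: str
    text: str
    score: float


# A tier takes page bytes and returns (text, quality score in [0, 1]).
Tier = Callable[[bytes], tuple[str, float]]


def run_tiers(page: bytes, tiers: list[tuple[str, Tier]], threshold: float = 0.80) -> TierResult:
    """Try tiers in order; accept the first whose score meets the threshold.

    The last tier is the fallback and is accepted regardless of its score.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    for position, (name, tier) in enumerate(tiers):
        text, score = tier(page)
        is_last = position == len(tiers) - 1
        if score >= threshold or is_last:
            return TierResult(name, text, score)
    raise AssertionError("unreachable")  # the loop always returns on the last tier


# Stand-in tiers with fixed scores, purely for illustration.
tiers = [
    ("pdfplumber", lambda page: ("", 0.10)),  # e.g. a scanned page with no text layer
    ("textract", lambda page: ("Invoice 42", 0.91)),
    ("haiku", lambda page: ("Invoice 42", 0.99)),
]
result = run_tiers(b"page bytes", tiers)
print(result.tier, result.score)
```

Because cheaper tiers run first, a digital PDF with a clean text layer would stop at the first tier and never reach the paid services.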

Images always bypass the tiered pipeline. When file_name has a .png, .jpg, or .jpeg extension, the page goes directly to the primary llm_model_id (Sonnet by default), not through Tier 1→2→3. This applies even when use_tiered_transcription=True.

Instantiate OpenRagTranscriber once and reuse it across all pages — both LangGraph workflows (Sonnet for the standard path, Haiku for Tier 3) are compiled at construction time.

Chunking Reference

`ChunksManager`

from wizit_open_rag import ChunksManager

Constructor parameters

Parameter	Type	Default	Description
`langsmith_project_name`	`str`	required	LangSmith project name for tracing
`langsmith_api_key`	`str`	required	LangSmith API key
`llm_model_id`	`str`	`"global.anthropic.claude-sonnet-4-6"`	Bedrock model for context generation
`embeddings_model_id`	`str`	`"amazon.titan-embed-text-v1"`	Bedrock embeddings model
`target_language`	`str`	`"es-CO"`	Output language for generated context
`kdb`	`EmbeddingsManager | None`	`None`	Vector store backend; required only for `gen_and_index_context_chunks`
`ai_service`	`AiApplicationService | None`	`None`	LLM backend override for context generation. Pass `ClaudeModels(...)` to use the Anthropic direct API instead of Bedrock. When set, `llm_model_id` is ignored.

Methods

# Generate enriched chunks — caller handles indexing
async def gen_context_chunks(
    file_key: str,
    file_markdown_content: str,
    file_tags: dict,
) -> list[Document]

# Generate + index in one call — requires kdb= at construction time
async def gen_and_index_context_chunks(
    file_key: str,
    file_markdown_content: str,
    file_tags: dict,
    cleanup: Literal["incremental", "full", "scoped_full"] | None = "incremental",
    source_id_key: str = "source",
) -> IndexingResult

file_key: Filename used as the source metadata key (e.g. "report.md"). Must end with .md.
file_markdown_content: Pre-loaded Markdown string. This method does not read files from disk or S3.
file_tags: Arbitrary key/value metadata propagated to every chunk.
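cleanup: LangChain indexing cleanup mode. Every mode skips chunks whose content is unchanged. "incremental" (default) also deletes stale chunks that share a source with the chunks being indexed; "full" deletes every previously indexed chunk that is not part of the current call; "scoped_full" does the same but only for sources seen in the current call; None deletes nothing.
source_id_key: Metadata key that identifies the source document of each chunk; the cleanup modes use it to group chunks. Defaults to "source".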
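
To make the Markdown-header-based strategy mentioned under Features more concrete, here is a simplified pure-Python sketch of splitting at headers while tracking the header path. It is not the library's MarkdownHeadersChunks implementation; `split_by_headers` is a hypothetical name, and the sketch does not handle `#` lines inside fenced code blocks.

```python
import re

_HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headers(markdown: str) -> list[dict]:
    """Split Markdown into sections, one per header, recording the header path."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # stack of (level, title) for the current position
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headers": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = _HEADER_RE.match(line)
        if match is None:
            body.append(line)
            continue
        flush()  # close the section that ends at this header
        level = len(match.group(1))
        while path and path[-1][0] >= level:  # pop siblings and deeper levels
            path.pop()
        path.append((level, match.group(2)))
    flush()
    return chunks


doc = "# Policy\nIntro.\n## Leave\n15 days.\n## Travel\nEconomy only.\n"
sections = split_by_headers(doc)
for section in sections:
    print(section["headers"], "->", section["text"])
```

Keeping the header path with each section gives the context-generation step a cheap hint about where a chunk sits in the document.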

Vector Store Backends

PostgreSQL pgvector — `PgEmbeddingsManager`

from wizit_open_rag import PgEmbeddingsManager
from wizit_open_rag.infra.embeddings.aws_embeddings import AWSEmbeddingsModels

embeddings = AWSEmbeddingsModels("amazon.titan-embed-text-v1").load_embeddings_model()

kdb = PgEmbeddingsManager(
    embeddings_model=embeddings,
    pg_connection="postgresql://user:password@localhost:5432/vectordb",
    embeddings_vectors_table_name="documents",
    records_manager_table_name="documents_records",
    # optional
    vector_size=1536,                       # must match the embeddings model output (1536 for amazon.titan-embed-text-v1)
    metadata_columns=["source", "category"],
)

# First-time setup: create the table and record-manager schema
kdb.configure_vector_store()

# Create an HNSW index for fast ANN search (requires vector_size <= 2000)
kdb.create_index()

# Index documents
from langchain_core.documents import Document
docs = [Document(page_content="...", metadata={"source": "report.md"})]
result = kdb.index_documents(docs)

# Similarity search (returns top-5 by default)
matches = kdb.search_records("What is the refund policy?")

# Delete a document and all its chunks
ids = kdb.retrieve_documents_by_file_name("report.md")
kdb.delete_documents_by_ids(ids)
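
Because vector_size has to match the model's output dimension, and create_index() only works up to 2000 dimensions, it can help to check both before creating the table. The sketch below is an assumption-labelled helper, not library code: `probe_dimension` and `check_vector_size` are hypothetical, and they rely only on LangChain's standard `embed_query(text)` method. `FakeEmbeddings` is a stand-in so the example runs offline.

```python
HNSW_MAX_DIMENSIONS = 2000  # limit quoted above for create_index()


def probe_dimension(embeddings) -> int:
    """Embed a short string and return the vector length.

    Works with any object exposing embed_query(text) -> list[float].
    """
    return len(embeddings.embed_query("dimension probe"))


def check_vector_size(embeddings, vector_size: int) -> bool:
    """Raise if vector_size disagrees with the model; return whether HNSW is possible."""
    actual = probe_dimension(embeddings)
    if actual != vector_size:
        raise ValueError(f"vector_size={vector_size} but the model returns {actual} dimensions")
    return vector_size <= HNSW_MAX_DIMENSIONS


class FakeEmbeddings:
    """Stand-in model returning fixed-length vectors, so the sketch runs offline."""

    def __init__(self, dimensions: int) -> None:
        self.dimensions = dimensions

    def embed_query(self, text: str) -> list[float]:
        return [0.0] * self.dimensions


print(check_vector_size(FakeEmbeddings(1024), vector_size=1024))
```

With a real embeddings model the probe issues one embedding request, so it is best run once at setup time.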

Weaviate — `WeaviateEmbeddingsManager`

from wizit_open_rag import WeaviateEmbeddingsManager
from wizit_open_rag.infra.embeddings.aws_embeddings import AWSEmbeddingsModels

embeddings = AWSEmbeddingsModels("amazon.titan-embed-text-v1").load_embeddings_model()

kdb = WeaviateEmbeddingsManager(
    embeddings_model=embeddings,
    weaviate_url="http://localhost:8080",
    collection_name="Documents",
    records_manager_db_url="postgresql://user:password@localhost:5432/vectordb",
    # optional
    records_manager_table_name="weaviate_records_manager",
    weaviate_api_key=None,    # set for Weaviate Cloud
    text_key="text",
)

# First-time setup: initialise record-manager schema
# (Weaviate creates the collection automatically on first write)
kdb.configure_vector_store()

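# Index documents
from langchain_core.documents import Document
docs = [Document(page_content="...", metadata={"source": "report.md"})]
result = kdb.index_documents(docs)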

# Similarity search
matches = kdb.search_records("What is the refund policy?", k=5)

# Delete
ids = kdb.retrieve_documents_by_file_name("report.md")
kdb.delete_documents_by_ids(ids)

Both backends implement the same EmbeddingsManager interface and are interchangeable when passed as kdb= to ChunksManager.
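
One practical consequence of the shared interface is that helper code can be written once and tested against a toy backend. The sketch below uses the method names from the examples above, but the signatures and return types are assumptions; `VectorBackend`, `InMemoryBackend`, and `replace_file` are hypothetical and exist only for illustration.

```python
from itertools import count
from typing import Any, Protocol

Doc = dict[str, Any]  # simplified stand-in for a LangChain Document


class VectorBackend(Protocol):
    """Structural sketch of the shared interface; signatures are assumed."""

    def index_documents(self, docs: list[Doc]) -> int: ...
    def retrieve_documents_by_file_name(self, file_name: str) -> list[str]: ...
    def delete_documents_by_ids(self, ids: list[str]) -> None: ...


class InMemoryBackend:
    """Toy backend that keeps documents in a dict, for offline experiments."""

    def __init__(self) -> None:
        self._docs: dict[str, Doc] = {}
        self._ids = count(1)

    def index_documents(self, docs: list[Doc]) -> int:
        for doc in docs:
            self._docs[f"doc-{next(self._ids)}"] = doc
        return len(docs)

    def retrieve_documents_by_file_name(self, file_name: str) -> list[str]:
        return [key for key, doc in self._docs.items() if doc["source"] == file_name]

    def delete_documents_by_ids(self, ids: list[str]) -> None:
        for key in ids:
            self._docs.pop(key, None)


def replace_file(backend: VectorBackend, file_name: str, docs: list[Doc]) -> int:
    """Drop every stored chunk of file_name, then index the new chunks."""
    backend.delete_documents_by_ids(backend.retrieve_documents_by_file_name(file_name))
    return backend.index_documents(docs)


store = InMemoryBackend()
replace_file(store, "report.md", [{"source": "report.md", "text": "v1"}])
replace_file(
    store,
    "report.md",
    [{"source": "report.md", "text": "v2a"}, {"source": "report.md", "text": "v2b"}],
)
print(store.retrieve_documents_by_file_name("report.md"))
```

Code written against the protocol does not need to know which real backend it receives.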

Embeddings Models

AWS Bedrock — `AWSEmbeddingsModels`

from wizit_open_rag.infra.embeddings.aws_embeddings import AWSEmbeddingsModels

# Returns a LangChain-compatible Embeddings instance backed by AWS Bedrock
embeddings = AWSEmbeddingsModels(
    embeddings_model_id="amazon.titan-embed-text-v1",
    region_name="us-east-1",  # default
).load_embeddings_model()

Credentials are read from the standard boto3 credential chain — no explicit key is needed.

Voyage AI — `VoyageEmbeddingsModels`

A drop-in alternative to AWS Bedrock embeddings. Voyage AI models tend to score higher on retrieval benchmarks and support multilingual content out of the box.

from wizit_open_rag.infra.embeddings.voyage_embeddings import VoyageEmbeddingsModels

embeddings = VoyageEmbeddingsModels(
    embeddings_model_id="voyage-3",      # default
    # api_key="voy-...",                 # or set VOYAGE_API_KEY env var
    batch_size=72,                       # default; Voyage's hard limit is 128
).load_embeddings_model()
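
The batch_size setting controls how many texts go into each embedding request. The sketch below only illustrates that slicing and the 128-item limit quoted in the comment above; `batched_texts` is a hypothetical helper, and the library presumably does its own batching internally.

```python
from collections.abc import Iterator, Sequence

VOYAGE_MAX_BATCH = 128  # hard limit quoted in the comment above


def batched_texts(texts: Sequence[str], batch_size: int = 72) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    if not 1 <= batch_size <= VOYAGE_MAX_BATCH:
        raise ValueError(f"batch_size must be between 1 and {VOYAGE_MAX_BATCH}")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


sizes = [len(batch) for batch in batched_texts([f"chunk {i}" for i in range(200)])]
print(sizes)
```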

Available models (pass as embeddings_model_id):

Model	Dimensions	Notes
`voyage-3`	1024	General-purpose, highest quality (default)
`voyage-3-lite`	512	Lower latency, lower cost
`voyage-multilingual-2`	1024	Optimised for multilingual retrieval

The returned object is a standard LangChain Embeddings instance — pass it to PgEmbeddingsManager, WeaviateEmbeddingsManager, or ChunksManager exactly like the AWS variant:

from wizit_open_rag import PgEmbeddingsManager
from wizit_open_rag.infra.embeddings.voyage_embeddings import VoyageEmbeddingsModels

embeddings = VoyageEmbeddingsModels("voyage-3").load_embeddings_model()

kdb = PgEmbeddingsManager(
    embeddings_model=embeddings,
    pg_connection="postgresql://user:password@localhost:5432/vectordb",
    embeddings_vectors_table_name="documents",
    records_manager_table_name="documents_records",
    vector_size=1024,  # must match the model's output dimension
)

Environment Variables

Variables read at runtime (not at import time):

Variable	Purpose
`LANGSMITH_API_KEY`	LangSmith API key (can also be passed as constructor arg)
`LANGCHAIN_PROJECT`	LangSmith project name
`LANGSMITH_TRACING`	Enable LangSmith tracing (`true` / `false`)
`MISTRAL_API_KEY`	Mistral OCR API key (only needed for `tier2_ocr="mistral"`)
`VECTOR_STORE_CONNECTION`	PostgreSQL connection string for pgvector
`VECTOR_STORE_TABLE`	pgvector table name
`WEAVIATE_URL`	Weaviate cluster URL
`WEAVIATE_API_KEY`	Weaviate Cloud API key (optional for local)
`WEAVIATE_COLLECTION`	Weaviate collection name
`VOYAGE_API_KEY`	Voyage AI API key (only needed when using `VoyageEmbeddingsModels`)
`ANTHROPIC_API_KEY`	Anthropic API key (only needed when using `ClaudeModels`; can also be passed directly as `api_key`)

AWS credentials (Bedrock, Textract, S3) are configured via the standard boto3 chain and are not managed by this library.

Architecture

wizit_open_rag/
├── transcription.py       ← OpenRagTranscriber (public API)
├── chunks.py              ← ChunksManager (public API)
├── domain/                ← PageToTranscribe, ParsedDocPage, ParsedDoc
├── application/
│   ├── interfaces.py      ← ABCs: EmbeddingsManager, PageTranscriptionTier, …
│   ├── transcription_app.py         ← LangGraph transcription workflow
│   ├── tiered_transcription_app.py  ← Cost-aware tier sequencer
│   └── context_chunk_app.py         ← Per-chunk context enrichment
├── infra/
│   ├── llms/              ← AWSModels (ChatBedrockConverse), ClaudeModels (ChatAnthropic)
│   ├── embeddings/        ← AWSEmbeddingsModels (BedrockEmbeddings), VoyageEmbeddingsModels
│   ├── transcription/
│   │   ├── pdfplumber_tier.py   ← Tier 1
│   │   ├── textract_tier.py     ← Tier 2a
│   │   ├── mistral_ocr_tier.py  ← Tier 2b
│   │   └── llm_tier.py          ← Tier 3
│   ├── rag/
│   │   ├── pg_embeddings.py        ← PgEmbeddingsManager
│   │   ├── weaviate_embeddings.py  ← WeaviateEmbeddingsManager
│   │   ├── markdown_chunks.py      ← MarkdownHeadersChunks
│   │   ├── semantic_chunks.py      ← SemanticChunks (85th-pct breakpoints)
│   │   └── recursive_chunks.py    ← RecursiveChunks
│   └── persistence/       ← LocalStorageService, S3StorageService, PgConnectionManager
└── workflows/             ← LangGraph state machines (transcription + context)

Gotchas

transcribe_document takes a single-page PDF as bytes. Use PyMuPDF (fitz) to split pages before calling it.
For images, pass the raw file bytes directly — no page-splitting needed. Include file_name="scan.png" so the library detects the format.
Both transcribe_document and gen_context_chunks are async. Use asyncio.run(...) from synchronous code, or await them inside an async function.
OpenRagTranscriber and ChunksManager require langsmith_project_name and langsmith_api_key as constructor arguments — they are not read from environment variables.
AWS Bedrock inference-profile IDs carry a routing prefix. The global. prefix routes requests across regions worldwide (e.g. global.anthropic.claude-sonnet-4-6); a geographic prefix such as us. keeps routing within that geography (e.g. us.anthropic.claude-haiku-4-5-20251001-v1:0).
ClaudeModels uses the Anthropic direct API, so model IDs are plain Anthropic IDs (e.g. "claude-sonnet-4-6"), not the Bedrock-prefixed forms (global. / us.). When ai_service is provided, llm_model_id is ignored; tier3_model_id is ignored only when tier3_ai_service is also provided.
gen_context_chunks does not load files from disk or S3 — pass the Markdown content as a string.
gen_and_index_context_chunks raises ValueError if no kdb= backend was provided at construction time.
WeaviateEmbeddingsManager opens a new Weaviate client connection per operation. Avoid calling it in a tight loop; prefer batching via gen_and_index_context_chunks.
PgEmbeddingsManager.create_index() raises NotImplementedError when vector_size > 2000.
When use_tiered_transcription=True, the OpenRagTranscriber compiles two LangGraph workflows at construction time. Instantiate once and reuse across all pages.
Images passed with an unsupported extension (e.g. .tiff, .bmp, .webp) raise ValueError immediately — they are not silently treated as PDFs.
When file_name is None (default), the library assumes application/pdf. Pass file_name explicitly when the bytes are an image.

License

Licensed under the Apache License 2.0.
