
LlamaIndex Integration for Kawn AI (Baseer Reader & Tbyaan Embeddings)


LlamaIndex Kawn Integration

This repository provides LlamaIndex wrappers for the Kawn AI SDK, allowing you to seamlessly integrate state-of-the-art Arabic AI models into your LlamaIndex pipelines.

This integration includes:

  • KawnEmbedding: High-quality document and query embedding via Kawn's Tbyaan models, optimized for Arabic and Islamic content.
  • BaseerReader: An OCR data reader powered by Kawn's Baseer API that extracts structured Markdown and high-fidelity Arabic text from documents (PDFs, images).

Installation

pip install llama-index-kawn

Setup

The easiest way to configure the integration is by setting your Kawn API key as an environment variable. Alternatively, you can pass it directly to the instances.

export KAWN_API_KEY="your_api_key_here" # Or MISRAJ_API_KEY="your_api_key_here"

Usage

1. KawnEmbedding

KawnEmbedding generates vector representations of queries and documents. You can set it as the default embedding model in your LlamaIndex Settings or call it directly to generate embeddings.

from llama_index_integration.embeddings.kawn import KawnEmbedding
from llama_index.core import Settings

# 1. Initialize the Kawn Embedding model
# It automatically picks up the KAWN_API_KEY environment variable.
embed_model = KawnEmbedding(
    model_name="tbyaan/islamic-embedding-tbyaan-v1",
    dimensions=768 # Optional: specify desired output dimension
)

# 2. Set as the default embedding model in LlamaIndex globally
Settings.embed_model = embed_model

# 3. Direct Embedding Generation
query_embedding = embed_model.get_query_embedding("ما هو تفسير سورة الفاتحة؟")
print(f"Query embedding dimension: {len(query_embedding)}")

text_batch = ["النص الأول", "النص الثاني"]
batch_embeddings = embed_model.get_text_embedding_batch(text_batch)

2. BaseerReader

BaseerReader uses Baseer's OCR service to read documents (such as .pdf, .png, .jpg) and convert them into LlamaIndex Document objects. It is optimized for structural data extraction and complex Arabic layouts.

from llama_index_integration.readers.kawn import BaseerReader

# 1. Initialize the Baseer Reader
reader = BaseerReader(
    model="baseer/baseer-v2",  # Default OCR model
    # Optional: pass additional OCR configuration parameters here
)

# 2. Load and process a local file
file_path = "sample_book.pdf"

# By default, it returns a list of Documents (one per page). 
# Set one_text_result=True to merge everything into a single LlamaIndex Document.
documents = reader.load_data(
    file_path=file_path, 
    one_text_result=False,
    extra_info={"category": "Islamic History"} # Appended to metadata
)

for doc in documents:
    print(f"Page {doc.metadata['page_index']}:")
    print(doc.text[:200]) # Print the first 200 characters of the page

3. End-to-End RAG Pipeline (LlamaIndex Vector Store)

Here is how you can combine both BaseerReader and KawnEmbedding to read an Arabic PDF, embed it, store it in a vector database, and query it.

from llama_index.core import VectorStoreIndex, Settings
from llama_index_integration.embeddings.kawn import KawnEmbedding
from llama_index_integration.readers.kawn import BaseerReader

# 1. Set Kawn as the global embedding model
Settings.embed_model = KawnEmbedding()

# 2. Extract text and structure from a complex Arabic document
reader = BaseerReader()
documents = reader.load_data("complex_arabic_document.pdf")

# 3. Build a Vector Store Index (In-memory by default, easily swapped for Chroma/Qdrant)
index = VectorStoreIndex.from_documents(documents)

# 4. Query the document
query_engine = index.as_query_engine()
response = query_engine.query("ما هي الاستنتاجات الرئيسية في هذا التقرير؟")
print(response)

4. Using BaseerReader with LangChain

While BaseerReader is built as a native LlamaIndex integration, you can also use its OCR capabilities to extract text and feed the result directly into a LangChain conversational pipeline.

from llama_index_integration.readers.kawn import BaseerReader
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# 1. Extract pure text using BaseerReader
reader = BaseerReader()
# one_text_result=True is useful here to pass a single context block to the LLM
documents = reader.load_data("sample_book.pdf", one_text_result=True)
document_text = documents[0].text

# 2. Setup LangChain LLM and Prompt
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
    ("system", "أنت مساعد ذكي. أجب على أسئلة المستخدم بناءً على السياق التالي فقط:\n\n{context}"),
    ("human", "{question}")
])

# 3. Create the chain and ask a question based on the document
chain = prompt | llm
response = chain.invoke({
    "context": document_text,
    "question": "لخص أهم النقاط المذكورة في هذا النص."
})

print(response.content)

Async Support

Both KawnEmbedding and BaseerReader natively support non-blocking asynchronous operations, making them ideal for high-throughput batching or async web servers (like FastAPI).

import asyncio
from llama_index_integration.embeddings.kawn import KawnEmbedding
from llama_index_integration.readers.kawn import BaseerReader

async def main():
    embed_model = KawnEmbedding()
    reader = BaseerReader()

    # Asynchronous Embedding Generation
    query_embed = await embed_model.aget_query_embedding("مرحبا بك في منصة كون")
    
    # Asynchronous OCR Request (handles background polling for you)
    docs = await reader.aload_data("sample_document.pdf")

asyncio.run(main())
