
LlamaIndex Integration for Kawn AI (Baseer Reader & Tbyaan Embeddings)


LlamaIndex Kawn Integration

This repository provides LlamaIndex wrappers for the Kawn AI SDK, allowing you to seamlessly integrate state-of-the-art Arabic AI models into your LlamaIndex pipelines.

This integration includes:

  • KawnEmbedding: High-quality document and query embedding via Kawn's Tbyaan models, optimized for Arabic and Islamic content.
  • BaseerReader: A data reader powered by Kawn's Baseer OCR API that extracts structured Markdown and high-accuracy Arabic text from documents (PDFs, images).

Installation

pip install llama-index-kawn

Setup

The easiest way to configure the integration is by setting your Kawn API key as an environment variable. Alternatively, you can pass it directly to the instances.

export KAWN_API_KEY="your_api_key_here" # Or MISRAJ_API_KEY="your_api_key_here"

Usage

1. KawnEmbedding

KawnEmbedding generates robust vector representations of queries and documents. You can set it as the default embedding model in your LlamaIndex Settings or call it directly to generate embeddings.

from llama_index_integration.embeddings.kawn import KawnEmbedding
from llama_index.core import Settings

# 1. Initialize the Kawn Embedding model
# It automatically picks up the KAWN_API_KEY environment variable.
embed_model = KawnEmbedding(
    model_name="tbyaan/islamic-embedding-tbyaan-v1",
    dimensions=768 # Optional: specify desired output dimension
)

# 2. Set as the default embedding model in LlamaIndex globally
Settings.embed_model = embed_model

# 3. Direct Embedding Generation
query_embedding = embed_model.get_query_embedding("ما هو تفسير سورة الفاتحة؟")
print(f"Query embedding dimension: {len(query_embedding)}")

text_batch = ["النص الأول", "النص الثاني"]
batch_embeddings = embed_model.get_text_embedding_batch(text_batch)
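Once you have query and document vectors, retrieval reduces to a similarity comparison. The snippet below is a self-contained sketch using toy vectors in place of real Tbyaan embeddings; in practice you would substitute `query_embedding` and `batch_embeddings` from the calls above:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
query_vec = [0.1, 0.3, 0.6]
doc_vecs = [[0.1, 0.3, 0.6], [0.9, 0.1, 0.0]]

scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
best = max(range(len(scores)), key=scores.__getitem__)  # index of closest document
```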

2. BaseerReader

BaseerReader leverages Baseer's highly accurate OCR service to read documents (such as .pdf, .png, .jpg) and convert them into LlamaIndex Document objects. It is optimized for structured data extraction and complex Arabic layouts.

from llama_index_integration.readers.kawn import BaseerReader

# 1. Initialize the Baseer Reader
reader = BaseerReader(
    model="baseer/baseer-v2",  # Default OCR model
    # Optional: pass a dictionary of OCR configuration parameters here
)

# 2. Load and process a local file
file_path = "sample_book.pdf"

# By default, it returns a list of Documents (one per page). 
# Set one_text_result=True to merge everything into a single LlamaIndex Document.
documents = reader.load_data(
    file_path=file_path, 
    one_text_result=False,
    extra_info={"category": "Islamic History"} # Appended to metadata
)

for doc in documents:
    print(f"Page {doc.metadata['page_index']}:")
    print(doc.text[:200]) # Print the first 200 characters of the page
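If you keep the default per-page output but later need one contiguous string (without re-running OCR with `one_text_result=True`), the pages can be stitched back together via their `page_index` metadata. A minimal sketch, using a stand-in class in place of the real LlamaIndex Document so it runs standalone:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:  # stand-in for llama_index.core.Document
    text: str
    metadata: dict = field(default_factory=dict)

pages = [
    Doc("second page", {"page_index": 1}),
    Doc("first page", {"page_index": 0}),
]

# Sort by page_index so the merged text preserves reading order
full_text = "\n\n".join(
    d.text for d in sorted(pages, key=lambda d: d.metadata["page_index"])
)
```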

3. End-to-End RAG Pipeline (LlamaIndex Vector Store)

Here is how to combine BaseerReader and KawnEmbedding to read an Arabic PDF, embed it, store it in a vector index, and query it.

from llama_index.core import VectorStoreIndex, Settings
from llama_index_integration.embeddings.kawn import KawnEmbedding
from llama_index_integration.readers.kawn import BaseerReader

# 1. Set Kawn as the global embedding model
Settings.embed_model = KawnEmbedding()

# 2. Extract text and structure from a complex Arabic document
reader = BaseerReader()
documents = reader.load_data("complex_arabic_document.pdf")

# 3. Build a Vector Store Index (In-memory by default, easily swapped for Chroma/Qdrant)
index = VectorStoreIndex.from_documents(documents)

# 4. Query the document
query_engine = index.as_query_engine()
response = query_engine.query("ما هي الاستنتاجات الرئيسية في هذا التقرير؟")
print(response)

4. Using BaseerReader with LangChain

While BaseerReader is built as a native LlamaIndex integration, you can also use its OCR output to feed extracted text directly into a LangChain conversational pipeline.

from llama_index_integration.readers.kawn import BaseerReader
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# 1. Extract pure text using BaseerReader
reader = BaseerReader()
# one_text_result=True is useful here to pass a single context block to the LLM
documents = reader.load_data("sample_book.pdf", one_text_result=True)
document_text = documents[0].text

# 2. Setup LangChain LLM and Prompt
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
    ("system", "أنت مساعد ذكي. أجب على أسئلة المستخدم بناءً على السياق التالي فقط:\n\n{context}"),
    ("human", "{question}")
])

# 3. Create the chain and ask a question based on the document
chain = prompt | llm
response = chain.invoke({
    "context": document_text,
    "question": "لخص أهم النقاط المذكورة في هذا النص."
})

print(response.content)

Async Support

Both KawnEmbedding and BaseerReader natively support non-blocking asynchronous operations, making them ideal for high-throughput batching or async web servers (like FastAPI).

import asyncio
from llama_index_integration.embeddings.kawn import KawnEmbedding
from llama_index_integration.readers.kawn import BaseerReader

async def main():
    embed_model = KawnEmbedding()
    reader = BaseerReader()

    # Asynchronous Embedding Generation
    query_embed = await embed_model.aget_query_embedding("مرحبا بك في منصة كون")
    
    # Asynchronous OCR Request (handles background polling for you)
    docs = await reader.aload_data("sample_document.pdf")

asyncio.run(main())
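For high-throughput batching, the async methods can be fanned out concurrently with `asyncio.gather` rather than awaited one at a time. The sketch below uses a stub coroutine in place of `aget_query_embedding` so it runs standalone; the stub's return value is purely illustrative:

```python
import asyncio

async def fake_embed(text: str) -> list[float]:
    """Stub standing in for embed_model.aget_query_embedding (I/O-bound in reality)."""
    await asyncio.sleep(0)  # simulate network latency
    return [float(len(text))]

async def embed_many(texts: list[str]) -> list[list[float]]:
    # Fan out all requests concurrently; gather preserves input order
    return await asyncio.gather(*(fake_embed(t) for t in texts))

results = asyncio.run(embed_many(["a", "bb", "ccc"]))
```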
