Skip to main content

Faster and smarter Retrieval Augmented Generation using Speculative Retrieval and Context Tetris.

Project description

Quira Logo

Quira

Lightning-Fast, Context-Dense RAG Framework for Python

Stop waiting. Start predicting.


PyPI License Python GitHub



Quickstart  ·  How It Works  ·  Cost Savings  ·  API  ·  Contributing



🔥 The Problem with Traditional RAG

Traditional Retrieval-Augmented Generation (RAG) is slow and expensive:

  1. High Latency: User types query → Hits Enter → WAIT → Vector search → WAIT → Stuff 10 large chunks into LLM → WAIT → Response.
  2. "Lost in the Middle" Syndrome: You stuff massive chunks of text into the context window, most of which is useless filler. The LLM loses track of the actual facts.
  3. Expensive Redundancy: On every turn of the conversation, you re-fetch and re-process the exact same context over and over again.

✨ The Quira Solution

Quira solves this by predicting what users need before they finish typing, dynamically compressing context to maximize density, and statefully tracking the conversation.

⏱️ 85% faster latency | 🧠 2.6× denser context | 💰 40% cheaper token costs

🏗️ Architecture

graph TD
    User([User Typing]) -->|WebSocket Stream| Speculative[1. Speculative Retriever]
    Speculative -->|Predictive Search| Cache[(Redis Cache)]
    UserSubmit([User Hits Enter]) --> Diff[3. Differential Retriever]
    Diff -->|Cosine Similarity > 0.6?| DeltaFetch{Fetch Delta Chunks Only}
    Cache --> DeltaFetch
    DeltaFetch --> Tetris[2. Context Tetris]
    Tetris -->|Relevance, Recency, Density| Groq[Groq LLM Compression]
    Groq -->|U-Shape Order| FinalContext[Packed Context]
    FinalContext --> MainLLM{Your Main LLM}

📦 Quickstart

1. Install via pip

pip install quira

2. Basic Setup

Quira does not hardcode API keys. You bring your own clients, meaning you have full control over your usage and billing.

import asyncio
from quira import quiraPipeline, UserSession
from qdrant_client import QdrantClient
from groq import Groq
import spacy
from fastembed import TextEmbedding

async def main():
    # 1. Initialize your clients (Bring Your Own Keys)
    qdrant = QdrantClient(":memory:") # Or your cloud Qdrant URL
    redis_mock = None # Pass a real Upstash Redis client in production
    groq = Groq(api_key="your_groq_api_key") 
    spacy_model = spacy.load("en_core_web_sm")
    
    embed_model = TextEmbedding("sentence-transformers/all-MiniLM-L6-v2")
    embed_func = lambda text: list(embed_model.embed([text]))[0]

    # 2. Initialize Quira Pipeline
    pipeline = quiraPipeline(
        qdrant_client=qdrant,
        redis_client=redis_mock,
        groq_client=groq,
        embed_func=embed_func,
        spacy_model=spacy_model
    )

    # 3. Create a session for a specific user
    session = UserSession(user_id="user_123")

    # 4. Ingest some documents!
    print("Ingesting document...")
    await pipeline.ingestor.ingest_text("user_123", "Our return policy allows returns within 30 days of purchase.")

    # 5. 🏎️ Speculative fetch (triggers while user is typing in the UI)
    await pipeline.handle_typing_event(session, "What is the re")

    # 6. 🎯 Submit (Context is already warm from the speculative fetch!)
    answer = await pipeline.process_submission(
        session, "What is the return policy?"
    )
    print(answer)

if __name__ == "__main__":
    asyncio.run(main())

⚙️ How It Works: The 4 Core Modules

Quira is built on 4 beautifully orchestrated modules:

🏎️ Module 1: Speculative Retrieval

Instead of waiting for the user to hit "Enter", Quira listens to keystrokes. Using adaptive debouncing (250ms–600ms based on typing speed), it fires Qdrant searches in the background. By the time the user hits Enter, the vector search is already cached in Redis.

🧩 Module 2: Context Tetris

Not all retrieved context is equal. Quira scores every chunk on 4 dimensions:

  1. Relevance (Cosine similarity)
  2. Recency (Half-life decay for older chunks)
  3. Uniqueness (Penalizes duplicate information)
  4. Density (Entity-to-token ratio)

It then uses the blazing-fast Groq LLM to compress filler text out of the chunks, and orders them in a U-shape (best chunks at the very start and end) to prevent the LLM from "losing" facts in the middle of the prompt.

🔄 Module 3: Differential Retrieval

In a normal RAG chat, asking a follow-up question triggers a completely new vector search. Quira maintains a Context Pool. It measures the cosine similarity between the current and previous query. If the topic hasn't changed drastically, Quira only fetches Delta Chunks (new information) and merges it, saving massive amounts of redundant processing.

📄 Module 4: Document Ingestion

Built-in PyMuPDF parsing with overlapping text chunking (default 1000 chars / 200 overlap) to prevent sentence fragmentation. Automatically generates embeddings and upserts them directly into Qdrant.


💰 Why Quira Saves You Money

You might wonder: "Doesn't using Groq for Context Tetris cost extra money?"

No, it actually saves you up to 40% on your bill. Here's why:

  1. Groq is Hyper-Cheap: The llama-3.1-8b-instant model used to compress context costs fractions of a penny.
  2. Your Main LLM is Expensive: You are likely sending your final prompt to a heavy model like GPT-4o or Claude 3.5 Sonnet. By using cheap Groq tokens to compress the context, you send significantly fewer tokens to the expensive main LLM.
  3. Differential Caching: You stop re-fetching and re-sending identical chunks of text on every single conversational turn.

📊 Benchmarks

Metric Traditional RAG Quira Improvement
Avg Latency 1,450 ms 210 ms 🚀 85% faster
Context Density 35% 94% 🧠 2.6× denser
Token Cost Baseline -40% 💰 40% cheaper
Redundant Fetches Every turn Delta only ♻️ ~70% fewer

📚 API Reference

quiraPipeline(qdrant, redis, groq, embed_func, spacy_model)

The main pipeline class. Accepts your own client instances.

Method Description
handle_typing_event(session, keystrokes) Trigger speculative retrieval on keystrokes
process_submission(session, query) Full retrieval + compression pipeline
ingestor.ingest_pdf(user_id, path) Parse, chunk, embed, and store a PDF
ingestor.ingest_text(user_id, text) Chunk, embed, and store raw text

UserSession(user_id, websocket=None)

Tracks per-user conversation state, context pools, and turn history. Keeps different users' data strictly isolated.


🔒 Security

Quira is regularly audited:

  • 0 vulnerabilities across all severity levels via Bandit
  • ✅ SHA-256 hashing for all cache keys (no weak hashes)
  • Bring Your Own Keys architecture — absolutely zero API keys or credentials are included or required by the library itself. You retain 100% control over your API secrets.

🤝 Contributing

Contributions are welcome! Please open an issue or submit a pull request.

# Clone the repo
git clone https://github.com/DevDarsh26/Quira.git
cd Quira

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/


Built with ❤️ by darshmodii.in

GitHub   Website

If you like Quira, drop a ⭐ on GitHub — it means the world!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

quira-0.2.0.tar.gz (30.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

quira-0.2.0-py3-none-any.whl (34.4 kB view details)

Uploaded Python 3

File details

Details for the file quira-0.2.0.tar.gz.

File metadata

  • Download URL: quira-0.2.0.tar.gz
  • Upload date:
  • Size: 30.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for quira-0.2.0.tar.gz
Algorithm Hash digest
SHA256 a6824a97e5986ac76dc66e7463745c72c3c5afdf346cf9642b14b488948708d1
MD5 4f4bb1c77fbf0fdc1f5322a7df5c1c3d
BLAKE2b-256 0cc605f6cd35d33fb6a62825f3bfbfe330af73cb682774edc48b41ed612e48c5

See more details on using hashes here.

Provenance

The following attestation bundles were made for quira-0.2.0.tar.gz:

Publisher: publish.yml on DevDarsh26/Quira

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file quira-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: quira-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 34.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for quira-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6f6a7a3187465073c909be6b4341b3a026ca6fccf068d9ff448730f605903dd7
MD5 16d6b1756795f7b9f40f4bd3718ab492
BLAKE2b-256 f25b1649fb45d11c85b155e46edc5d8a8142968397a0e7473a922f20fa846e38

See more details on using hashes here.

Provenance

The following attestation bundles were made for quira-0.2.0-py3-none-any.whl:

Publisher: publish.yml on DevDarsh26/Quira

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page