Natural Language Question Answering Toolkit (Hybrid NLP + GenAI)

NLQcat

Natural Language Question Answering Toolkit

Bridging the gap between Linguistic Analysis, Semantic Search, and Generative AI.


📖 Overview

NLQcat is a production-ready, hybrid NLP + GenAI library designed to unify classic linguistic analysis (spaCy) with modern semantic search (Vector Databases) and Large Language Models (LLMs).

Unlike purely generative frameworks, NLQcat offers a grounded approach where linguistic structure (POS tagging, NER) informs and refines semantic retrieval, leading to more accurate and context-aware RAG (Retrieval-Augmented Generation) pipelines.

Whether you are building a local document Q&A bot, a complex semantic search engine, or an intelligent agent, NLQcat provides the modular building blocks to get you there fast.

✨ Features

  • 🧠 Hybrid Intelligence: Seamlessly blends symbolic NLP (spaCy) with neural embeddings (SentenceTransformers).
  • 🚀 Unified Pipeline: A single Pipeline class to manage ingestion, analysis, retrieval, and generation.
  • 🔌 Plug-and-Play Vector Stores: Integrated support for ChromaDB, with extensible interfaces for FAISS and Pinecone.
  • 🤖 LLM Agnostic: Built-in support for OpenAI GPT models, with a flexible LLMBase for easy integration of HuggingFace or local LLMs.
  • 🔍 Deep Linguistic Analysis: Extract entities, linguistic tokens, and POS tags to filter or re-rank semantic search results.
  • 🛠️ Production Ready: Type-hinted, modular architecture designed for scalability and maintainability.

📦 Installation

Install NLQcat via pip:

pip install nlqcat

Download the required spaCy model (default):

python -m spacy download en_core_web_sm

🚀 Quick Start

Get a RAG pipeline running in three steps:

from nlqcat.core.pipeline import Pipeline

# 1. Initialize Pipeline with Vector Store (ChromaDB default)
pipe = Pipeline(vector_store_type="chroma")

# 2. Add some knowledge
pipe.add_documents([
    "NLQcat combines linguistic NLP with semantic RAG.",
    "It supports ChromaDB, FAISS, and OpenAI integration."
])

# 3. Ask a question! (Retrieval Only)
result = pipe.query("What does NLQcat support?")
print(result['retrieved_docs'])

Want Generative Answers? Configure an LLM:

from nlqcat.models.openai_llm import OpenAILLM

# Initialize LLM
llm = OpenAILLM(api_key="your-openai-key")

# Attach to Pipeline
pipe = Pipeline(vector_store_type="chroma", llm=llm)

# Query
answer = pipe.query("Explain NLQcat's architecture.")['answer']
print(answer)

📚 Full Usage Guide

1. The Core Pipeline

The Pipeline class is the heart of NLQcat. It orchestrates the flow of data between the NLP analyzer, Vector Store, and LLM.

from nlqcat.core.pipeline import Pipeline

pipe = Pipeline(
    enable_spacy=True,          # Enable linguistic analysis
    vector_store_type="chroma", # 'chroma', 'faiss', 'pinecone' or None
    vector_store_path="./db",   # Persistence path
    llm=my_llm_instance         # Optional LLM instance
)

2. Working with Vector Stores

NLQcat supports modular vector stores. If you need a specific configuration, instantiate the store directly or let the pipeline handle it.

Supported Stores:

  • ChromaStore (Default, excellent for local dev & prod)
  • FaissStore (Fast, in-memory)
  • PineconeStore (Managed cloud vector DB)

# Automatic (Recommended)
pipe = Pipeline(vector_store_type="chroma", vector_store_path="./my_chroma_db")

# Manual
from nlqcat.vector_store.chroma_store import ChromaStore
store = ChromaStore(path="./custom_db")
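
To make the idea of a store adapter concrete, here is a minimal sketch of an in-memory store. The class name `InMemoryStore` and its `add`/`search` methods are hypothetical and are not NLQcat's actual interface; the sketch ranks by squared Euclidean distance, much as a flat L2 index would.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class InMemoryStore:
    """Hypothetical adapter: keeps (vector, text) pairs in a Python list."""

    _items: List[Tuple[Tuple[float, ...], str]] = field(default_factory=list)

    def add(self, vector: Sequence[float], text: str) -> None:
        # Store an immutable copy so later caller mutations do not leak in.
        self._items.append((tuple(vector), text))

    def search(self, query: Sequence[float], k: int = 1) -> List[str]:
        # Rank stored texts by squared Euclidean distance to the query.
        def distance(item: Tuple[Tuple[float, ...], str]) -> float:
            return sum((a - b) ** 2 for a, b in zip(item[0], query))

        return [text for _, text in sorted(self._items, key=distance)[:k]]


store = InMemoryStore()
store.add([1.0, 0.0], "about cats")
store.add([0.0, 1.0], "about code")
nearest = store.search([0.9, 0.1], k=1)
print(nearest)
```

A real adapter would delegate `add` and `search` to the underlying database client; the list-based version only shows the shape of the contract.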

3. Linguistic Analysis (NLP)

Access standard spaCy features conveniently through the unified NLP class.

# Initialize
pipe = Pipeline(enable_spacy=True)
doc = pipe.nlp.analyze("Apple is looking at buying U.K. startup for $1 billion")

# 1. Tokens & POS Tags
print(doc.tokens)   # ['Apple', 'is', 'looking', ...]
print(doc.pos_tags) # [('Apple', 'PROPN'), ('is', 'AUX'), ...]

# 2. Named Entities (NER)
print(doc.entities) 
# [('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')]

# 3. Dependency Parsing
for dep in doc.dependencies:
    print(f"{dep['text']} --[{dep['dep']}]--> {dep['head']}")

4. Semantic Features & GenAI

NLQcat makes complex semantic operations simple.

Embeddings & Similarity

Generate vector embeddings and calculate cosine similarity.

pipe = Pipeline()

# Generate Embeddings
text = "Artificial Intelligence is transforming the world."
emb = pipe.nlp.embed(text)
print(f"Dimensions: {emb.shape}")

# Calculate Similarity
score = pipe.nlp.similarity("I love coding", "Programming is my passion")
print(f"Similarity Score: {score:.4f}") # High score (e.g., 0.85)
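
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The standalone function below is a sketch of that formula in plain Python; it is not NLQcat's implementation, and returning 0.0 for a zero vector is a convention assumed here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention assumed in this sketch for zero vectors
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), orthogonal)
```

Vectors pointing the same way score close to 1, and orthogonal vectors score 0, which is why paraphrases with similar embeddings get high scores.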

Text Summarization

Built-in summarization, abstractive or extractive. By default it uses simple heuristics; a model can be configured instead.

long_text = "Deep learning is part of a broader family of machine learning methods..."
summary = pipe.nlp.summarize(long_text)
print(summary)
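
As an illustration of what a simple extractive heuristic can look like, the sketch below scores each sentence by the average frequency of its words and keeps the top ones. The function `extractive_summary` is hypothetical and is not the heuristic NLQcat ships.

```python
import re
from collections import Counter
from typing import List


def extractive_summary(text: str, max_sentences: int = 1) -> str:
    """Keep the sentences whose words are most frequent in the text."""
    sentences: List[str] = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()
    ]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in top)


sample = (
    "Neural networks learn patterns. "
    "Deep neural networks need data. "
    "The weather is nice today."
)
summary = extractive_summary(sample, max_sentences=1)
print(summary)
```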

Clustering

Cluster sentences based on semantic meaning using K-Means.

sentences = [
    "The cat sits on the mat.", "Dogs are great pets.", # Animals
    "Python is a language.", "Java is verbose."         # Coding
]

clusters = pipe.nlp.cluster(sentences)
# Returns: {0: ['The cat...', 'Dogs...'], 1: ['Python...', 'Java...']}
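
The grouping step can be sketched with a small pure-Python K-Means (Lloyd's algorithm) over toy 2-D vectors standing in for sentence embeddings. This is an illustrative sketch only: the `kmeans` helper, the farthest-first initialisation, and the toy vectors are assumptions made here, and how `pipe.nlp.cluster` works internally is not documented above.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def _sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iterations: int = 10) -> List[int]:
    """Return a cluster label for each point (Lloyd's algorithm)."""
    # Deterministic farthest-first initialisation instead of random seeds.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(_sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: _sq_dist(p, centroids[j])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Toy 2-D "embeddings": two animal-like and two code-like vectors.
vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
names = ["cat", "dog", "python", "java"]
labels = kmeans(vectors, k=2)

clusters: Dict[int, List[str]] = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)
print(clusters)
```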

🧩 Architecture

The NLQcat architecture follows a clean Layered Pattern:

  1. Core Layer (nlqcat.core): Contains the Pipeline orchestrator and RAG logic.
  2. Semantic Layer (nlqcat.semantic): Handles Embeddings (SentenceTransformers) and Similarity calculations.
  3. Vector Store Layer (nlqcat.vector_store): Adapters for different vector databases.
  4. Model Layer (nlqcat.models): Wrappers for LLMs (OpenAI, etc.).

Data flow (Mermaid diagram source):

flowchart TD
    U[User Query] --> P[Pipeline]

    P --> S[spaCy NLP<br/>Tokens / POS / NER]
    P --> E[Embedder<br/>Sentence Transformers]

    E --> V[(Vector Store)]
    V --> R[Retrieved Context]

    P --> L[LLM]

    R --> L
    S --> F[Entity / Metadata Filters]
    F --> V

    L --> A[Final Answer]
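
The last hop in the diagram, from retrieved context to the LLM, can be sketched as a prompt-assembly step. `build_prompt` and `EchoLLM` are hypothetical names used only for illustration, and the prompt wording is an assumption; only the `generate(prompt, **kwargs)` signature mirrors the LLMBase example in this README.

```python
from typing import List


class EchoLLM:
    """Stand-in for an LLM: returns the prompt it was given."""

    def generate(self, prompt: str, **kwargs) -> str:
        return prompt


def build_prompt(question: str, context_docs: List[str]) -> str:
    # Number the retrieved passages so the model can refer to them.
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(context_docs, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


retrieved = ["NLQcat combines linguistic NLP with semantic RAG."]
prompt = build_prompt("What does NLQcat combine?", retrieved)
answer = EchoLLM().generate(prompt)
print(answer)
```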

⚙️ Configuration

NLQcat respects standard environment variables.

Variable            Description
OPENAI_API_KEY      Required if using OpenAILLM without passing the key explicitly.
HUGGINGFACE_TOKEN   Required for some gated HuggingFace models (future support).
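
The fallback described for OPENAI_API_KEY can be sketched as follows. The helper `resolve_api_key` is hypothetical and only illustrates the "explicit key first, environment variable second" order; it is not how OpenAILLM is known to be implemented.

```python
import os
from typing import Optional


def resolve_api_key(explicit_key: Optional[str] = None) -> str:
    """Prefer an explicitly passed key, else fall back to OPENAI_API_KEY."""
    key = explicit_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY or pass a key explicitly.")
    return key


os.environ["OPENAI_API_KEY"] = "sk-example"  # placeholder value for the demo only
from_env = resolve_api_key()
from_arg = resolve_api_key("sk-explicit")
print(from_env, from_arg)
```

Keeping the key in the environment avoids hard-coding secrets in source files such as the Quick Start snippet.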

🧪 Advanced Concepts

Custom LLMs

You can plug in any LLM by inheriting from LLMBase.

from nlqcat.core.pipeline import Pipeline
from nlqcat.models.llm_base import LLMBase

class MyCustomLLM(LLMBase):
    def generate(self, prompt: str, **kwargs) -> str:
        return "This is a dummy response based on " + prompt

pipe = Pipeline(llm=MyCustomLLM())

Hybrid Filtering (Roadmap)

Future versions will allow using spaCy entities to automatically filter vector search results (e.g., "Show me documents about Elon Musk" -> Filter metadata person="Elon Musk").
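
Until that lands, the idea can be approximated by hand. The sketch below turns (text, label) entity pairs, the shape shown for doc.entities, into a metadata filter and applies it to plain dictionaries. The `person` metadata key, `entities_to_filter`, and `apply_filter` are hypothetical; they do not describe the planned API.

```python
from typing import Dict, List, Tuple

Entity = Tuple[str, str]  # (text, label)


def entities_to_filter(entities: List[Entity]) -> Dict[str, str]:
    """Build a metadata filter from the first PERSON entity, if any."""
    for text, label in entities:
        if label == "PERSON":
            return {"person": text}
    return {}


def apply_filter(
    docs: List[Dict[str, str]], metadata_filter: Dict[str, str]
) -> List[Dict[str, str]]:
    # Keep documents whose metadata matches every key in the filter.
    return [d for d in docs if all(d.get(k) == v for k, v in metadata_filter.items())]


docs = [
    {"text": "Musk announces a new rocket.", "person": "Elon Musk"},
    {"text": "Quarterly earnings rose.", "person": "Tim Cook"},
]
query_entities = [("Elon Musk", "PERSON")]
matches = apply_filter(docs, entities_to_filter(query_entities))
print([d["text"] for d in matches])
```

An empty filter keeps every document, so queries without a recognised person fall back to plain semantic search.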

❓ FAQ

Q: Can I use a different embedding model?
A: Yes! Modify the Embedder class or look out for the upcoming config update allowing custom model names in Pipeline.

Q: Is this thread-safe?
A: Pipeline is generally thread-safe, but be cautious with ChromaDB's SQLite backend in highly concurrent write scenarios.
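
One cautious way to handle concurrent writes is to serialize them behind a lock. The wrapper below is a sketch under that assumption: `SerializedWriter` and `FakePipeline` are hypothetical, and the stand-in only records documents so the example runs without NLQcat installed.

```python
import threading
from typing import List


class SerializedWriter:
    """Wraps any object exposing add_documents() so writes run one at a time."""

    def __init__(self, target) -> None:
        self._target = target
        self._lock = threading.Lock()

    def add_documents(self, docs: List[str]) -> None:
        with self._lock:
            self._target.add_documents(docs)


class FakePipeline:
    """Stand-in for a Pipeline; it only records what was added."""

    def __init__(self) -> None:
        self.docs: List[str] = []

    def add_documents(self, docs: List[str]) -> None:
        self.docs.extend(docs)


fake = FakePipeline()
writer = SerializedWriter(fake)
threads = [
    threading.Thread(target=writer.add_documents, args=([f"doc {i}"],))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(fake.docs))
```

Reads can usually stay outside the lock; only the write path needs to be funnelled through the wrapper.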

🗺️ Roadmap

  • v0.2.0: Integration with LangChain tools.
  • v0.3.0: Advanced RAG (HyDE, MMR Re-ranking).
  • v0.4.0: Cloud Deployment Blueprints (Docker, AWS Lambda).
  • Documentation: Sphinx/MkDocs site generation.

🤝 Contributing

We welcome contributions! Please follow these steps:

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature/amazing-feature).
  3. Commit your changes (git commit -m 'Add amazing feature').
  4. Push to the branch (git push origin feature/amazing-feature).
  5. Open a Pull Request.

Please ensure you run tests before submitting:

python -m pytest tests/

📄 License

Distributed under the MIT License. See LICENSE for more information.

👥 Credits

  • Author: Anirban Sarkar
  • Maintainer: AnirbansarkarS

Built with ❤️ by Anirban-QuantumCAT.
