Natural Language Question Answering Toolkit (Hybrid NLP + GenAI)
NLQcat
Natural Language Question Answering Toolkit
Bridging the gap between Linguistic Analysis, Semantic Search, and Generative AI.
📖 Overview
NLQcat is a production-ready, hybrid NLP + GenAI library designed to unify classic linguistic analysis (spaCy) with modern semantic search (Vector Databases) and Large Language Models (LLMs).
Unlike purely generative frameworks, NLQcat offers a grounded approach where linguistic structure (POS tagging, NER) informs and refines semantic retrieval, leading to more accurate and context-aware RAG (Retrieval-Augmented Generation) pipelines.
Whether you are building a local document Q&A bot, a complex semantic search engine, or an intelligent agent, NLQcat provides the modular building blocks to get you there fast.
✨ Features
- 🧠 Hybrid Intelligence: Seamlessly blends symbolic NLP (spaCy) with neural embeddings (SentenceTransformers).
- 🚀 Unified Pipeline: A single Pipeline class to manage ingestion, analysis, retrieval, and generation.
- 🔌 Plug-and-Play Vector Stores: Integrated support for ChromaDB, with extensible interfaces for FAISS and Pinecone.
- 🤖 LLM Agnostic: Built-in support for OpenAI GPT models, with a flexible LLMBase for easy integration of HuggingFace or local LLMs.
- 🔍 Deep Linguistic Analysis: Extract entities, linguistic tokens, and POS tags to filter or re-rank semantic search results.
- 🛠️ Production Ready: Type-hinted, modular architecture designed for scalability and maintainability.
📦 Installation
Install NLQcat via pip:
pip install nlqcat
Download the required spaCy model (default):
python -m spacy download en_core_web_sm
🚀 Quick Start
Get a RAG pipeline running in three steps:
from nlqcat.core.pipeline import Pipeline
# 1. Initialize Pipeline with Vector Store (ChromaDB default)
pipe = Pipeline(vector_store_type="chroma")
# 2. Add some knowledge
pipe.add_documents([
    "NLQcat combines linguistic NLP with semantic RAG.",
    "It supports ChromaDB, FAISS, and OpenAI integration."
])
# 3. Ask a question! (Retrieval Only)
result = pipe.query("What does NLQcat support?")
print(result['retrieved_docs'])
Want Generative Answers? Configure an LLM:
from nlqcat.models.openai_llm import OpenAILLM
# Initialize LLM
llm = OpenAILLM(api_key="your-openai-key")
# Attach to Pipeline
pipe = Pipeline(vector_store_type="chroma", llm=llm)
# Query
answer = pipe.query("Explain NLQcat's architecture.")['answer']
print(answer)
📚 Full Usage Guide
1. The Core Pipeline
The Pipeline class is the heart of NLQcat. It orchestrates the flow of data between the NLP analyzer, Vector Store, and LLM.
from nlqcat.core.pipeline import Pipeline
pipe = Pipeline(
    enable_spacy=True,           # Enable linguistic analysis
    vector_store_type="chroma",  # 'chroma', 'faiss', 'pinecone' or None
    vector_store_path="./db",    # Persistence path
    llm=my_llm_instance          # Optional LLM instance
)
2. Working with Vector Stores
NLQcat supports modular vector stores. If you need a specific configuration, instantiate the store directly or let the pipeline handle it.
Supported Stores:
- ChromaStore (Default, excellent for local dev & prod)
- FaissStore (Fast, in-memory)
- PineconeStore (Managed cloud vector DB)
# Automatic (Recommended)
pipe = Pipeline(vector_store_type="chroma", vector_store_path="./my_chroma_db")
# Manual
from nlqcat.vector_store.chroma_store import ChromaStore
store = ChromaStore(path="./custom_db")
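The adapter idea behind modular stores can be sketched without any external database. This is a minimal illustration, not NLQcat's actual interface: VectorStore, InMemoryStore, add, and search are hypothetical names, and the real base class may look different.

```python
import math
from typing import List, Protocol, Sequence, Tuple


class VectorStore(Protocol):
    """Hypothetical adapter interface; NLQcat's real base class may differ."""

    def add(self, texts: Sequence[str], vectors: Sequence[Sequence[float]]) -> None: ...

    def search(self, query: Sequence[float], k: int = 3) -> List[Tuple[str, float]]: ...


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Toy store that keeps (text, vector) pairs in a list and scans them all."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, List[float]]] = []

    def add(self, texts: Sequence[str], vectors: Sequence[Sequence[float]]) -> None:
        for text, vec in zip(texts, vectors):
            self._items.append((text, list(vec)))

    def search(self, query: Sequence[float], k: int = 3) -> List[Tuple[str, float]]:
        scored = [(text, _cosine(query, vec)) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # best match first
        return scored[:k]


store: VectorStore = InMemoryStore()
store.add(["doc about cats", "doc about code"], [[1.0, 0.0], [0.0, 1.0]])
top = store.search([0.9, 0.1], k=1)
print(top[0][0])  # doc about cats
```

Any class exposing the same two methods could be swapped in, which is the property a pluggable store layer relies on.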
3. Linguistic Analysis (NLP)
Access standard spaCy features conveniently through the unified NLP class.
# Initialize
pipe = Pipeline(enable_spacy=True)
doc = pipe.nlp.analyze("Apple is looking at buying U.K. startup for $1 billion")
# 1. Tokens & POS Tags
print(doc.tokens) # ['Apple', 'is', 'looking', ...]
print(doc.pos_tags) # [('Apple', 'PROPN'), ('is', 'AUX'), ...]
# 2. Named Entities (NER)
print(doc.entities)
# [('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')]
# 3. Dependency Parsing
for dep in doc.dependencies:
    print(f"{dep['text']} --[{dep['dep']}]--> {dep['head']}")
4. Semantic Features & GenAI
NLQcat makes complex semantic operations simple.
Embeddings & Similarity
Generate vector embeddings and calculate cosine similarity.
pipe = Pipeline()
# Generate Embeddings
text = "Artificial Intelligence is transforming the world."
emb = pipe.nlp.embed(text)
print(f"Dimensions: {emb.shape}")
# Calculate Similarity
score = pipe.nlp.similarity("I love coding", "Programming is my passion")
print(f"Similarity Score: {score:.4f}") # High score (e.g., 0.85)
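For readers who want to see what a cosine similarity score measures, here is a small pure-Python sketch. The vectors are toy values rather than real embeddings, and cosine_similarity is an illustrative helper, not a function exported by NLQcat.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1, orthogonal vectors score 0.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(f"{same_direction:.4f} {orthogonal:.4f}")
```

Because the score depends only on direction, two texts of very different length can still be rated as similar.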
Text Summarization
Built-in abstractive/extractive summarization (defaulting to simple heuristics or configurable models).
long_text = "Deep learning is part of a broader family of machine learning methods..."
summary = pipe.nlp.summarize(long_text)
print(summary)
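As a rough idea of what a "simple heuristic" for extractive summarization can look like, the sketch below scores sentences by average word frequency and keeps the top ones. It is an assumption for illustration only: summarize_extractive and split_sentences are hypothetical helpers, and NLQcat's own summarizer may work differently.

```python
import re
from collections import Counter
from typing import List


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize_extractive(text: str, max_sentences: int = 1) -> str:
    sentences = split_sentences(text)
    words = re.findall(r"[a-z']+", text.lower())
    # Crude stop-word filter: ignore words of three letters or fewer.
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


text = (
    "Deep learning models learn layered representations. "
    "The weather was nice. "
    "Deep learning models need large datasets to learn well."
)
summary = summarize_extractive(text, max_sentences=1)
print(summary)  # Deep learning models learn layered representations.
```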
Clustering
Cluster sentences based on semantic meaning using K-Means.
sentences = [
    "The cat sits on the mat.", "Dogs are great pets.",  # Animals
    "Python is a language.", "Java is verbose."          # Coding
]
clusters = pipe.nlp.cluster(sentences)
# Returns: {0: ['The cat...', 'Dogs...'], 1: ['Python...', 'Java...']}
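To show the grouping step in isolation, here is a compact K-Means sketch in pure Python. The 2-D vectors are hand-made stand-ins for sentence embeddings, and the deterministic farthest-point initialisation is a choice made for this example, not necessarily what pipe.nlp.cluster does.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def _sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: List[Vector]) -> List[float]:
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points: List[Vector], k: int, iterations: int = 20) -> List[int]:
    # Deterministic start: first point, then repeatedly the point farthest
    # from every centroid chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(_sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels: List[int] = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: _sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = _mean(members)
    return labels


sentences = [
    "The cat sits on the mat.", "Dogs are great pets.",
    "Python is a language.", "Java is verbose.",
]
# Toy embeddings (assumed): first axis ~ "animals", second axis ~ "coding".
vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

labels = kmeans(vectors, k=2)
clusters: Dict[int, List[str]] = {}
for sentence, label in zip(sentences, labels):
    clusters.setdefault(label, []).append(sentence)
print(clusters)
```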
🧩 Architecture
The NLQcat architecture follows a clean Layered Pattern:
- Core Layer (nlqcat.core): Contains the Pipeline orchestrator and RAG logic.
- Semantic Layer (nlqcat.semantic): Handles Embeddings (SentenceTransformers) and Similarity calculations.
- Vector Store Layer (nlqcat.vector_store): Adapters for different vector databases.
- Model Layer (nlqcat.models): Wrappers for LLMs (OpenAI, etc.).
flowchart TD
U[User Query] --> P[Pipeline]
P --> S[spaCy NLP<br/>Tokens / POS / NER]
P --> E[Embedder<br/>Sentence Transformers]
E --> V[(Vector Store)]
V --> R[Retrieved Context]
P --> L[LLM]
R --> L
S --> F[Entity / Metadata Filters]
F --> V
L --> A[Final Answer]
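The flow in the diagram can be mimicked end to end with toy parts. This is a sketch under stated assumptions, not NLQcat's code: ToyPipeline, embed, cosine, and echo_llm are hypothetical stand-ins (a bag-of-words counter instead of SentenceTransformers, a list instead of a vector store, a stub instead of an LLM). Only the result keys 'retrieved_docs' and 'answer' mirror the Quick Start.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Optional


def embed(text: str) -> Counter:
    # Stand-in for a real embedding: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyPipeline:
    def __init__(self, llm: Optional[Callable[[str], str]] = None) -> None:
        self.llm = llm
        self.docs: List[str] = []

    def add_documents(self, docs: List[str]) -> None:
        self.docs.extend(docs)

    def query(self, question: str, k: int = 1) -> Dict[str, object]:
        q = embed(question)
        # Retrieval: rank stored documents by similarity to the question.
        ranked = sorted(self.docs, key=lambda d: cosine(q, embed(d)), reverse=True)
        result: Dict[str, object] = {"retrieved_docs": ranked[:k]}
        if self.llm is not None:
            # Generation: hand the retrieved context plus the question to the LLM.
            context = "\n".join(ranked[:k])
            prompt = f"Context:\n{context}\n\nQuestion: {question}"
            result["answer"] = self.llm(prompt)
        return result


def echo_llm(prompt: str) -> str:
    return f"(stub answer based on {len(prompt)} prompt characters)"


pipe = ToyPipeline(llm=echo_llm)
pipe.add_documents([
    "NLQcat combines linguistic NLP with semantic RAG.",
    "It supports ChromaDB, FAISS, and OpenAI integration.",
])
result = pipe.query("Does it work with FAISS and OpenAI?")
print(result["retrieved_docs"])
```

Without an llm argument the same query call returns only the retrieved documents, matching the retrieval-only mode described earlier.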
⚙️ Configuration
NLQcat respects standard environment variables.
| Variable | Description |
|---|---|
| OPENAI_API_KEY | Required if using OpenAILLM without passing key explicitly. |
| HUGGINGFACE_TOKEN | Required for some gated HuggingFace models (future support). |
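The "explicit key, else environment variable" behaviour described in the table can be sketched as follows. The precedence shown is an assumption and resolve_api_key is a hypothetical helper; the library's actual lookup may differ.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(explicit_key: Optional[str] = None,
                    env: Mapping[str, str] = os.environ) -> str:
    """Prefer an explicitly passed key, then fall back to OPENAI_API_KEY."""
    key = explicit_key or env.get("OPENAI_API_KEY")
    if not key:
        raise ValueError("No API key: pass one explicitly or set OPENAI_API_KEY.")
    return key


# The env mapping is injectable so the example does not touch real settings.
print(resolve_api_key("explicit-key", env={}))                   # explicit-key
print(resolve_api_key(env={"OPENAI_API_KEY": "key-from-env"}))   # key-from-env
```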
🧪 Advanced Concepts
Custom LLMs
You can plug in any LLM by inheriting from LLMBase.
from nlqcat.core.pipeline import Pipeline
from nlqcat.models.llm_base import LLMBase
class MyCustomLLM(LLMBase):
    def generate(self, prompt: str, **kwargs) -> str:
        return "This is a dummy response based on " + prompt
pipe = Pipeline(llm=MyCustomLLM())
Hybrid Filtering (Roadmap)
Future versions will allow using spaCy entities to automatically filter vector search results (e.g., "Show me documents about Elon Musk" -> Filter metadata person="Elon Musk").
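Since this feature is not implemented yet, the following is only a sketch of how entity-driven filtering might work. LABEL_TO_FIELD, entities_to_filter, and apply_filter are hypothetical names, and the entities are hard-coded in the (text, label) shape shown for doc.entities rather than produced by spaCy.

```python
from typing import Dict, List, Tuple

Entity = Tuple[str, str]  # (text, label), the shape shown for doc.entities

# Hypothetical mapping from spaCy entity labels to metadata field names.
LABEL_TO_FIELD = {"PERSON": "person", "ORG": "org", "GPE": "place"}


def entities_to_filter(entities: List[Entity]) -> Dict[str, str]:
    """Turn recognised entities into an exact-match metadata filter."""
    return {LABEL_TO_FIELD[label]: text for text, label in entities if label in LABEL_TO_FIELD}


def apply_filter(docs: List[dict], metadata_filter: Dict[str, str]) -> List[dict]:
    """Keep only documents whose metadata matches every field in the filter."""
    return [
        d for d in docs
        if all(d["metadata"].get(field) == value for field, value in metadata_filter.items())
    ]


docs = [
    {"text": "Musk announced a new rocket.", "metadata": {"person": "Elon Musk"}},
    {"text": "Cook presented the new phone.", "metadata": {"person": "Tim Cook"}},
]
# Entities an NER step might return for "Show me documents about Elon Musk".
query_entities = [("Elon Musk", "PERSON")]
metadata_filter = entities_to_filter(query_entities)
matches = apply_filter(docs, metadata_filter)
print(metadata_filter, [d["text"] for d in matches])
```

In a full pipeline the filtered subset would then be ranked by vector similarity, so the filter narrows the candidates before semantic search runs.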
❓ FAQ
Q: Can I use a different embedding model?
A: Yes! Modify the Embedder class or look out for the upcoming config update allowing custom model names in Pipeline.
Q: Is this thread-safe?
A: Pipeline is generally thread-safe, but be cautious with ChromaDB's SQLite backend in highly concurrent write scenarios.
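One cautious pattern for concurrent writes is to funnel them through a single lock. The sketch below uses a plain list as a stand-in for the vector store; SerializedWriter is a hypothetical wrapper, and in real use the wrapped function would be something like pipe.add_documents.

```python
import threading
from typing import Callable, List


class SerializedWriter:
    """Wraps an add function so that only one write runs at a time."""

    def __init__(self, add_fn: Callable[[List[str]], None]) -> None:
        self._add_fn = add_fn
        self._lock = threading.Lock()

    def add_documents(self, docs: List[str]) -> None:
        with self._lock:
            self._add_fn(docs)


stored: List[str] = []  # stand-in for a vector store
writer = SerializedWriter(stored.extend)

# Eight threads each write a batch of 100 documents.
threads = [
    threading.Thread(target=writer.add_documents, args=([f"doc-{i}-{j}" for j in range(100)],))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(stored))  # 800
```

Reads can stay unlocked in this pattern; only the write path is serialized.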
🗺️ Roadmap
- v0.2.0: Integration with LangChain tools.
- v0.3.0: Advanced RAG (HyDE, MMR Re-ranking).
- v0.4.0: Cloud Deployment Blueprints (Docker, AWS Lambda).
- Documentation: Sphinx/MkDocs site generation.
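For context on the MMR item above: Maximal Marginal Relevance picks results that are relevant to the query while penalising similarity to results already chosen. The sketch below shows the standard greedy form on toy vectors; it is not NLQcat code (the feature is still on the roadmap), and mmr, cosine, and the lam parameter name are chosen for this example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query: Sequence[float], candidates: List[Sequence[float]],
        k: int, lam: float = 0.5) -> List[int]:
    """Return indices of k candidates chosen greedily by
    lam * relevance - (1 - lam) * max similarity to already selected items."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # 0: very relevant
    [1.0, 0.12],  # 1: near-duplicate of 0
    [0.6, 0.8],   # 2: less relevant, but different
]
diverse = mmr(query, candidates, k=2, lam=0.3)
relevance_only = mmr(query, candidates, k=2, lam=1.0)
print(diverse, relevance_only)  # [0, 2] [0, 1]
```

With lam=1.0 the ranking reduces to plain similarity search; lower values trade relevance for diversity.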
🤝 Contributing
We welcome contributions! Please follow these steps:
- Fork the repository.
- Create a feature branch (git checkout -b feature/amazing-feature).
- Commit your changes (git commit -m 'Add amazing feature').
- Push to the branch (git push origin feature/amazing-feature).
- Open a Pull Request.
Please ensure you run tests before submitting:
python -m pytest tests/
📄 License
Distributed under the MIT License. See LICENSE for more information.
👥 Credits
- Author: Anirban Sarkar
- Maintainer: AnirbansarkarS