Lightweight topic-aware document processing and retrieval
topic-rag — Topic-Enhanced Retrieval-Augmented Generation
Install it anywhere with:
pip install topic-rag
What it does
A standard TF-IDF RAG retriever ranks documents purely by lexical similarity (TF-IDF cosine similarity). topic-rag adds a second signal: it automatically discovers latent topics across your document collection and uses them to boost documents that are topically related to the query. A query about "neural networks" will score higher against documents that share that topic cluster, even if the exact words differ.
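One way to picture the second signal is as a weighted blend of a lexical score and a topic-overlap score. The sketch below is illustrative only: the `blend_score` function, the `alpha` weight, and the use of cosine similarity between topic distributions are assumptions made for this example, not topic-rag's documented scoring rule, and all numbers are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def blend_score(lexical_sim, query_topics, doc_topics, alpha=0.3):
    """Hypothetical blend: (1 - alpha) * lexical similarity + alpha * topic overlap."""
    return (1 - alpha) * lexical_sim + alpha * cosine(query_topics, doc_topics)


# Made-up inputs: doc_b shares fewer exact words with the query than doc_a,
# but its topic distribution is much closer to the query's.
query_topics = [0.9, 0.1]
doc_a = {"lexical_sim": 0.40, "topics": [0.1, 0.9]}
doc_b = {"lexical_sim": 0.30, "topics": [0.8, 0.2]}

score_a = blend_score(doc_a["lexical_sim"], query_topics, doc_a["topics"])
score_b = blend_score(doc_b["lexical_sim"], query_topics, doc_b["topics"])
print(f"doc_a: {score_a:.4f}  doc_b: {score_b:.4f}")
```

With these inputs doc_b ends up ranked above doc_a despite its lower lexical score, which is the effect described above. Setting `alpha=0` reduces the blend to plain lexical ranking.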
Quick Start
from topic_rag import DocumentProcessor, TopicEnhancedRAGRetriever
# Build a topic-aware corpus from your documents
processor = DocumentProcessor(n_topics=10)
corpus = processor.process_corpus(documents) # list of {id, text, title}
# Retrieve with topic enhancement
retriever = TopicEnhancedRAGRetriever(corpus)
results = retriever.retrieve("What is transfer learning?", k=5)
for r in results:
    print(f"[{r['score']:.4f}] {r['sentence']}")
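The Quick Start assumes a `documents` list already exists. As a rough sketch of one way to build it from raw strings, the helper below wraps each text in a dict with the `id`, `text`, and `title` keys mentioned in the comment above. The `make_documents` name, the string `doc-N` ids, and the default titles are hypothetical choices for this example; the value types topic-rag expects are not stated here.

```python
def make_documents(texts, titles=None):
    """Wrap raw strings as dicts with id, text, and title keys (assumed shape)."""
    titles = titles or [f"Document {i + 1}" for i in range(len(texts))]
    if len(titles) != len(texts):
        raise ValueError("titles and texts must have the same length")
    return [
        {"id": f"doc-{i}", "text": text, "title": title}
        for i, (text, title) in enumerate(zip(texts, titles))
    ]


# Placeholder content; replace with your own corpus.
documents = make_documents(
    [
        "Transfer learning reuses a pretrained model on a new task.",
        "Sourdough bread needs a long fermentation.",
    ],
    titles=["Transfer learning", "Baking notes"],
)
```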
Two Retrieval Modes
| Retriever | How it works |
|---|---|
| StandardRAGRetriever | Ranks by TF-IDF cosine similarity only |
| TopicEnhancedRAGRetriever | Combines TF-IDF similarity with latent topic overlap for context-aware ranking |
Use both side by side to compare standard vs topic-enhanced retrieval on your own data:
from topic_rag import DocumentProcessor, StandardRAGRetriever, TopicEnhancedRAGRetriever
processor = DocumentProcessor(n_topics=10)
corpus = processor.process_corpus(documents)
std_retriever = StandardRAGRetriever(corpus)
enh_retriever = TopicEnhancedRAGRetriever(corpus)
std_results = std_retriever.retrieve("my query", k=5)
enh_results = enh_retriever.retrieve("my query", k=5)
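To see where the two retrievers disagree, it can help to line up the ranks each one assigns to the same sentence. The sketch below assumes each result is a dict with a `sentence` key (as in the Quick Start) and that results are ordered best first; `compare_results` is a hypothetical helper, and the stand-in lists are made-up data that takes the place of real `retrieve()` output.

```python
def rank_by_sentence(results):
    """Map each sentence to its 1-based position, keeping the first occurrence."""
    ranks = {}
    for position, result in enumerate(results, start=1):
        ranks.setdefault(result["sentence"], position)
    return ranks


def compare_results(std_results, enh_results):
    """Return (sentence, standard_rank, enhanced_rank) rows; None means not retrieved."""
    std_rank = rank_by_sentence(std_results)
    enh_rank = rank_by_sentence(enh_results)
    # Enhanced order first, then anything only the standard retriever returned.
    sentences = list(dict.fromkeys(list(enh_rank) + list(std_rank)))
    return [(s, std_rank.get(s), enh_rank.get(s)) for s in sentences]


# Stand-in data shaped like the Quick Start results (scores are invented).
std_results = [{"score": 0.42, "sentence": "A"}, {"score": 0.31, "sentence": "B"}]
enh_results = [{"score": 0.55, "sentence": "B"}, {"score": 0.40, "sentence": "C"}]

for sentence, std_pos, enh_pos in compare_results(std_results, enh_results):
    print(f"{sentence!r}: standard={std_pos} enhanced={enh_pos}")
```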
Key Design Decisions
- No GPU required: uses TF-IDF and a lightweight NMF-based topic model (no PyTorch, no sentence-transformers)
- Minimal dependencies: only numpy and scikit-learn at the core
- Self-contained: no external API calls or downloads needed
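To illustrate why this stack runs comfortably on a CPU, here is a minimal sketch of TF-IDF plus NMF topic discovery using scikit-learn directly. It is not topic-rag's internal pipeline: the sample texts, the choice of two topics, and the NMF settings are assumptions made for this example.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up sample corpus.
texts = [
    "Neural networks learn layered representations from data.",
    "Transfer learning reuses a pretrained neural network on a new task.",
    "Sourdough bread needs flour, water, salt, and a long fermentation.",
    "Bake the bread in a hot oven until the crust is dark.",
]

# Lexical signal: sparse TF-IDF matrix of shape (n_docs, n_terms).
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(texts)

# Topic signal: NMF factorizes the TF-IDF matrix into non-negative
# document-topic weights of shape (n_docs, n_topics).
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
doc_topics = nmf.fit_transform(tfidf)

# Project a query into both spaces with the already fitted models.
query_vec = vectorizer.transform(["pretrained neural network"])
lexical_sim = cosine_similarity(query_vec, tfidf).ravel()
query_topics = nmf.transform(query_vec)

print(doc_topics.shape, query_topics.shape, lexical_sim.round(3))
```

Both models are fitted once on the corpus and reused for every query, so no external services or model downloads are involved.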
License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
See the LICENSE file for the full license text.
What AGPL-3.0 means
- Anyone can view, use, and modify the code
- If you run a modified version as a network service, you must make its source code available to the users of that service
- Software that incorporates this code must be released under AGPL-compatible terms, so it cannot be embedded in closed-source products
For commercial licensing enquiries, please contact the project maintainers.