Topic-Enhanced Retrieval-Augmented Generation Library
topic-rag — Topic-Enhanced Retrieval-Augmented Generation
Install it anywhere with:
pip install topic-rag
What it does
Standard RAG systems rank documents purely by text similarity between the query and each document (how close the words are). topic-rag adds a second signal: it discovers latent topics across your document collection and uses topic overlap to boost retrieval accuracy. A query about "neural networks" will score higher against documents that share that topic cluster, even if the exact words differ.
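The idea of blending a lexical score with a topic-overlap score can be sketched in a few lines of standard-library Python. This is only an illustration of the general technique, not topic-rag's actual implementation: the `blended_score` function, the `topic_weight` parameter, the plain bag-of-words vectors (standing in for TF-IDF), and the hand-assigned topic mixtures are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def bag_of_words(text):
    """Lowercased term counts; a simple stand-in for a real TF-IDF vector."""
    return Counter(text.lower().split())


def blended_score(query, doc, query_topics, doc_topics, topic_weight=0.3):
    """Mix lexical similarity with topic overlap (topic_weight is hypothetical)."""
    text_sim = cosine(bag_of_words(query), bag_of_words(doc))
    topic_sim = cosine(query_topics, doc_topics)
    return (1.0 - topic_weight) * text_sim + topic_weight * topic_sim


# Hand-assigned topic mixtures, for illustration only.
query = "neural networks"
query_topics = {"deep_learning": 0.9, "cooking": 0.1}
docs = {
    "d1": ("backpropagation trains deep models",
           {"deep_learning": 0.95, "cooking": 0.05}),
    "d2": ("slow roasted vegetables recipe",
           {"deep_learning": 0.05, "cooking": 0.95}),
}

scores = {
    doc_id: blended_score(query, text, query_topics, topics)
    for doc_id, (text, topics) in docs.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Neither document shares a word with the query, so the lexical score is zero for both; the topic term alone ranks `d1` first, which is the behaviour the paragraph above describes.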
Core usage
1. Basic retrieval
from topic_rag import DocumentProcessor, TopicEnhancedRAGRetriever, RAGEvaluator

# You supply `documents` (your corpus) and `relevant_doc_ids`
# (the ids of documents known to be relevant to the query).

# Build a topic-aware corpus from your documents
processor = DocumentProcessor(n_topics=10)
corpus = processor.process_corpus(documents)  # list of {id, text, title}

# Retrieve with topic enhancement
retriever = TopicEnhancedRAGRetriever(corpus)
results = retriever.retrieve("What is transfer learning?", k=5)

# Evaluate retrieval quality
evaluator = RAGEvaluator()
metrics = evaluator.evaluate_retrieval(results, relevant_doc_ids)
# → recall@5, precision@5, MRR, NDCG, hit_rate
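For readers unfamiliar with the metrics listed above, the following sketch shows one common way to compute them for a single query with binary relevance. It is not taken from topic-rag: the function name `retrieval_metrics` and the example ids are hypothetical, and the library may use different conventions (for example, for the NDCG normalisation or the precision denominator).

```python
import math


def retrieval_metrics(ranked_ids, relevant_ids, k=5):
    """Binary-relevance metrics for one query's ranked result list."""
    relevant = set(relevant_ids)
    top_k = ranked_ids[:k]
    hits = [1 if doc_id in relevant else 0 for doc_id in top_k]

    # Reciprocal rank of the first relevant result (0.0 if none in the top k).
    reciprocal_rank = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            reciprocal_rank = 1.0 / rank
            break

    # DCG discounts each hit by log2(rank + 1); IDCG is the best achievable DCG.
    dcg = sum(hit / math.log2(rank + 1) for rank, hit in enumerate(hits, start=1))
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))

    return {
        "recall": sum(hits) / len(relevant) if relevant else 0.0,
        "precision": sum(hits) / k,
        "mrr": reciprocal_rank,
        "ndcg": dcg / idcg if idcg > 0 else 0.0,
        "hit_rate": 1.0 if any(hits) else 0.0,
    }


# Made-up ids: relevant documents appear at ranks 2 and 4.
metrics = retrieval_metrics(["d3", "d1", "d7", "d2", "d9"], {"d1", "d2", "d4"}, k=5)
print(metrics)
```

MRR proper is the mean of the per-query reciprocal rank over all queries; the value returned here is the single-query term that would be averaged.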
2. Benchmarking against standard datasets
from topic_rag import EvaluationPipeline, ExperimentConfig
pipeline = EvaluationPipeline()
config = ExperimentConfig(
    dataset_name="squad_v2",  # ms_marco | natural_questions | hotpot_qa | trivia_qa
    max_documents=500,
    max_queries=100,
    n_topics=10
)
results = pipeline.run_single_experiment(config)
3. Statistical validation
from topic_rag import StatisticalAnalyzer
analyzer = StatisticalAnalyzer()
stats = analyzer.calculate_comparison_statistics(standard_results, enhanced_results)
# → paired t-test, Wilcoxon signed-rank, Cohen's d effect size, confidence intervals
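To make the paired statistics above concrete, here is a standard-library sketch of the paired t statistic and a paired-samples Cohen's d computed from per-query scores. It is an assumption-laden illustration, not the library's code: `paired_summary` and the score lists are made up, and which Cohen's d variant topic-rag reports is not stated in this description. Turning the t statistic into a p-value needs the t-distribution CDF, which `scipy.stats.ttest_rel` provides; `scipy.stats.wilcoxon` covers the signed-rank test.

```python
import math
import statistics


def paired_summary(baseline, enhanced):
    """Paired t statistic and Cohen's d for per-query score pairs."""
    if len(baseline) != len(enhanced) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [e - b for b, e in zip(baseline, enhanced)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return {
        "mean_diff": mean_diff,
        "t_statistic": mean_diff / (sd_diff / math.sqrt(n)),
        "cohens_d": mean_diff / sd_diff,  # paired-samples variant (d_z)
        "degrees_of_freedom": n - 1,
    }


# Made-up per-query recall values, for illustration only.
baseline_recall = [0.40, 0.55, 0.60, 0.35, 0.50]
enhanced_recall = [0.50, 0.60, 0.65, 0.50, 0.55]
summary = paired_summary(baseline_recall, enhanced_recall)
print(summary)
```

The pairing matters: both systems are scored on the same queries, so the test is run on the per-query differences rather than on two independent samples.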
4. Paper generation
from topic_rag import PaperGenerator
gen = PaperGenerator()
files = gen.generate_complete_paper({
    "title": "My RAG Study",
    "authors": ["Your Name"],
    "institution": "Your University",
    "results": experiment_results,
    "output_format": "LaTeX + PDF",  # or "Markdown"
    "sections": { "abstract": True, "methodology": True, ... }
})
Advantages over plain RAG
| | Standard RAG | topic-rag |
|---|---|---|
| Retrieval signal | TF-IDF similarity only | TF-IDF + latent topic overlap |
| Semantic grouping | None | Automatic topic discovery |
| Evaluation | Manual | Built-in (Recall, MRR, NDCG) |
| Statistical proof | None | t-test, Wilcoxon, effect sizes |
| Paper output | None | LaTeX + Markdown auto-generated |
| Datasets | Bring your own | MS MARCO, NQ, SQuAD, HotpotQA, TriviaQA built-in |
| Dependencies | Heavy (PyTorch, transformers) | Lightweight (numpy, scikit-learn, scipy) |
Key design decisions
- No GPU required — uses TF-IDF and a lightweight topic model (no PyTorch, no sentence-transformers)
- Self-contained — all benchmark datasets have built-in fallback data, so experiments run offline
- Research-ready — statistical tests and paper generation make it suitable for academic submission
- AGPL-3.0 — open source; any service built on it must also be open source
Who it's for
- Researchers benchmarking retrieval systems
- Engineers who want a lightweight RAG baseline without heavy ML infrastructure
- Anyone who needs reproducible, statistically validated RAG experiments with automatic paper output
License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
See the LICENSE file for the full license text.
What AGPL-3.0 means
- Anyone can view, use, and modify the code
- Any modified version used to provide a network service must release its source code
- Companies cannot embed this in proprietary software without open-sourcing their product
For commercial licensing enquiries, please contact the project maintainers.
File details
Details for the file topic_rag-1.0.1.tar.gz.
File metadata
- Download URL: topic_rag-1.0.1.tar.gz
- Upload date:
- Size: 45.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 38f6e3590bde108ae17be4bdbab984c9c9d9c8476d16de01e56283bfa1b20527 |
| MD5 | 3c9e3eddd25afc7a789d71885ff8bdd2 |
| BLAKE2b-256 | 2552744a7737d9ff20ce66a54c1e663bd6e16c629a10007ed95d5b14873982a2 |
File details
Details for the file topic_rag-1.0.1-py3-none-any.whl.
File metadata
- Download URL: topic_rag-1.0.1-py3-none-any.whl
- Upload date:
- Size: 42.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5000c4b5d47ede91fee84675322036ff8621fa40855b4e9494d2d1e1d8660214 |
| MD5 | 82af309ade19626e08e9d3d964d1adf6 |
| BLAKE2b-256 | f03c44c286703f8f8a37aa9239389fa342995ed7c38db95f29b917f6530deeeb |