
Topic-Enhanced Retrieval-Augmented Generation Library

topic-rag — Topic-Enhanced Retrieval-Augmented Generation

Install it anywhere with:

pip install topic-rag

What it does

Standard RAG systems retrieve documents purely by text similarity, that is, by how closely the words in the query match the words in each document. topic-rag adds a second signal: it discovers latent topics across your document collection and uses topic overlap to adjust retrieval scores, with the aim of improving retrieval accuracy. A query about "neural networks" will score higher against documents that share that topic cluster, even if the exact words differ.
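
To make the idea concrete, here is a minimal, standard-library-only sketch of blending a lexical score with a topic score. It is not topic-rag's implementation: the function names, the `alpha` weight, and the hand-written topic weights are all assumptions chosen for illustration (a real system would infer topic weights with a topic model rather than write them by hand).

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one sparse TF-IDF dict per text (raw counts times a smoothed IDF)."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def blended_scores(query, docs, query_topics, doc_topics, alpha=0.3):
    """Score each document as (1 - alpha) * lexical + alpha * topic similarity.

    `alpha` and the topic dicts are hypothetical inputs for this sketch.
    """
    # For simplicity the query is vectorised together with the documents.
    vectors = tfidf_vectors(docs + [query])
    query_vec, doc_vecs = vectors[-1], vectors[:-1]
    scores = []
    for doc_vec, topics in zip(doc_vecs, doc_topics):
        lexical = cosine(query_vec, doc_vec)
        topical = cosine(query_topics, topics)
        scores.append((1 - alpha) * lexical + alpha * topical)
    return scores


docs = [
    "deep neural networks learn layered representations",
    "backpropagation trains multilayer perceptrons",
    "sourdough bread needs a long fermentation",
]
# Hand-written topic weights, standing in for a topic model's output.
doc_topics = [
    {"ml": 0.9, "food": 0.1},
    {"ml": 0.8, "food": 0.2},
    {"ml": 0.05, "food": 0.95},
]
query = "neural networks"
query_topics = {"ml": 1.0}

scores = blended_scores(query, docs, query_topics, doc_topics)
lexical_only = blended_scores(query, docs, query_topics, doc_topics, alpha=0.0)
```

With `alpha=0.0` the second and third documents tie at zero because neither shares a word with the query. With the topic term switched on, the second document ranks above the third because it shares the query's topic.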

Core usage

1. Basic retrieval

from topic_rag import DocumentProcessor, TopicEnhancedRAGRetriever, RAGEvaluator

# Build a topic-aware corpus from your documents.
# `documents` is your own list of dicts with "id", "text", and "title" keys.
processor = DocumentProcessor(n_topics=10)
corpus = processor.process_corpus(documents)

# Retrieve with topic enhancement
retriever = TopicEnhancedRAGRetriever(corpus)
results = retriever.retrieve("What is transfer learning?", k=5)

# Evaluate retrieval quality
evaluator = RAGEvaluator()
metrics = evaluator.evaluate_retrieval(results, relevant_doc_ids)
# → recall@5, precision@5, MRR, NDCG, hit_rate
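
For readers unfamiliar with the metrics listed above, the following sketch shows how they are commonly defined for a single query with binary relevance. It is an independent illustration, not the code behind `RAGEvaluator`; the function name, the dictionary keys, and the document IDs are assumptions. MRR is the mean of the per-query reciprocal rank across all queries.

```python
import math


def retrieval_metrics(ranked_ids, relevant_ids, k=5):
    """Compute binary-relevance ranking metrics for one query.

    Assumes `ranked_ids` contains no duplicate document IDs.
    """
    relevant = set(relevant_ids)
    hits = [1 if doc_id in relevant else 0 for doc_id in ranked_ids[:k]]

    precision = sum(hits) / k
    recall = sum(hits) / len(relevant) if relevant else 0.0
    hit_rate = 1.0 if any(hits) else 0.0

    # Reciprocal rank of the first relevant document (0 if none in the top k).
    reciprocal_rank = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            reciprocal_rank = 1.0 / rank
            break

    # DCG with binary gains, normalised by the best achievable ordering.
    dcg = sum(hit / math.log2(rank + 1) for rank, hit in enumerate(hits, start=1))
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg else 0.0

    return {
        f"precision@{k}": precision,
        f"recall@{k}": recall,
        "reciprocal_rank": reciprocal_rank,
        f"ndcg@{k}": ndcg,
        "hit_rate": hit_rate,
    }


# Illustrative IDs only: two of the three relevant documents appear in the top 5.
metrics = retrieval_metrics(["d3", "d1", "d7", "d2", "d9"], {"d1", "d2", "d4"}, k=5)
```

Here the first relevant document appears at rank 2, so the reciprocal rank is 0.5, precision@5 is 2/5, and recall@5 is 2/3.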

2. Benchmarking against standard datasets

from topic_rag import EvaluationPipeline, ExperimentConfig

pipeline = EvaluationPipeline()
config = ExperimentConfig(
    dataset_name="squad_v2",   # ms_marco | natural_questions | hotpot_qa | trivia_qa
    max_documents=500,
    max_queries=100,
    n_topics=10
)
results = pipeline.run_single_experiment(config)

3. Statistical validation

from topic_rag import StatisticalAnalyzer

analyzer = StatisticalAnalyzer()
stats = analyzer.calculate_comparison_statistics(standard_results, enhanced_results)
# → paired t-test, Wilcoxon signed-rank, Cohen's d effect size, confidence intervals
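
As a rough guide to what a paired comparison involves, the sketch below computes the paired t statistic, Cohen's d for paired samples, and a bootstrap confidence interval using only the standard library. It is not `StatisticalAnalyzer`'s implementation: the function name, the bootstrap approach, and the per-query scores are assumptions made for illustration. P-values and the Wilcoxon signed-rank test are omitted here; `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` provide them.

```python
import math
import random
import statistics


def paired_comparison(baseline, enhanced, n_boot=2000, seed=0):
    """Compare two systems scored on the same queries (paired samples)."""
    if len(baseline) != len(enhanced) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")

    diffs = [e - b for b, e in zip(baseline, enhanced)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance")

    t_stat = mean_diff / (sd_diff / math.sqrt(n))  # paired t statistic, n - 1 d.o.f.
    cohens_d = mean_diff / sd_diff                 # effect size for paired samples

    # Percentile bootstrap: resample the differences and collect their means.
    rng = random.Random(seed)
    boot_means = sorted(
        statistics.mean(rng.choices(diffs, k=n)) for _ in range(n_boot)
    )
    ci_low = boot_means[int(0.025 * n_boot)]
    ci_high = boot_means[int(0.975 * n_boot) - 1]

    return {
        "mean_diff": mean_diff,
        "t_stat": t_stat,
        "cohens_d": cohens_d,
        "ci95": (ci_low, ci_high),
    }


# Made-up per-query scores, for illustration only (not measured results).
baseline_scores = [0.40, 0.55, 0.60, 0.35, 0.50, 0.45]
enhanced_scores = [0.50, 0.60, 0.70, 0.45, 0.55, 0.60]
stats_sketch = paired_comparison(baseline_scores, enhanced_scores)
```

The comparison is paired because both systems are scored on the same queries, so the test operates on the per-query differences rather than on two independent samples.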

4. Paper generation

from topic_rag import PaperGenerator

gen = PaperGenerator()
files = gen.generate_complete_paper({
    "title": "My RAG Study",
    "authors": ["Your Name"],
    "institution": "Your University",
    "results": experiment_results,
    "output_format": "LaTeX + PDF",   # or "Markdown"
    "sections": { "abstract": True, "methodology": True, ... }
})

Advantages over plain RAG

| Aspect | Standard RAG | topic-rag |
|---|---|---|
| Retrieval signal | TF-IDF similarity only | TF-IDF + latent topic overlap |
| Semantic grouping | None | Automatic topic discovery |
| Evaluation | Manual | Built-in (Recall, MRR, NDCG) |
| Statistical proof | None | t-test, Wilcoxon, effect sizes |
| Paper output | None | LaTeX + Markdown auto-generated |
| Datasets | Bring your own | MS MARCO, NQ, SQuAD, HotpotQA, TriviaQA built-in |
| Dependencies | Heavy (PyTorch, transformers) | Lightweight (numpy, scikit-learn, scipy) |

Key design decisions

  • No GPU required: uses TF-IDF and a lightweight topic model (no PyTorch, no sentence-transformers)
  • Self-contained: all benchmark datasets have built-in fallback data, so experiments run offline
  • Research-ready: built-in statistical tests and paper generation support academic write-ups
  • AGPL-3.0: open source; anyone who runs a modified version as a network service must make its source code available

Who it's for

  • Researchers benchmarking retrieval systems
  • Engineers who want a lightweight RAG baseline without heavy ML infrastructure
  • Anyone who needs reproducible, statistically validated RAG experiments with automatic paper output

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

See the LICENSE file for the full license text.

What AGPL-3.0 means

  • Anyone can view, use, and modify the code
  • Anyone who runs a modified version as a network service must offer that version's source code to the service's users
  • Proprietary software cannot incorporate this code unless the combined work is released under the AGPL

For commercial licensing enquiries, please contact the project maintainers.
