Lightweight topic-aware document processing and retrieval

topic-rag — Topic-Enhanced Retrieval-Augmented Generation

Install from PyPI:

pip install topic-rag

What it does

A baseline RAG retriever ranks documents purely by text similarity (here, TF-IDF cosine similarity). topic-rag adds a second signal: it automatically discovers latent topics across your document collection and uses them to boost the ranking of topically related documents. A query about "neural networks" scores higher against documents that share that topic cluster, even if the exact words differ.
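
To make the idea of a "second signal" concrete, here is a minimal sketch of one way to blend a text-similarity score with topic overlap. It is not topic-rag's actual scoring code: the function names, the linear mix, and the weight `alpha` are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def blended_score(text_sim, query_topics, doc_topics, alpha=0.3):
    """Mix a text-similarity score with topic overlap.

    alpha is a hypothetical weight: 0.0 means text similarity only,
    1.0 means topic overlap only.
    """
    topic_sim = cosine(query_topics, doc_topics)
    return (1.0 - alpha) * text_sim + alpha * topic_sim


# Two candidates with the same text similarity to the query (made-up numbers)
query_topics = [0.9, 0.1, 0.0]
doc_a_topics = [0.8, 0.2, 0.0]   # shares the query's dominant topic
doc_b_topics = [0.0, 0.1, 0.9]   # mostly about something else

score_a = blended_score(0.20, query_topics, doc_a_topics)
score_b = blended_score(0.20, query_topics, doc_b_topics)
print(f"doc A: {score_a:.4f}, doc B: {score_b:.4f}")  # doc A ranks above doc B
```

With equal text similarity, the document whose topic weights line up with the query's comes out ahead, which is the behaviour described above for queries whose exact words differ from the document's.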

Quick Start

from topic_rag import DocumentProcessor, TopicEnhancedRAGRetriever

# Build a topic-aware corpus from your documents.
# `documents` is your own list of dicts with the keys id, text, and title.
processor = DocumentProcessor(n_topics=10)
corpus = processor.process_corpus(documents)

# Retrieve with topic enhancement
retriever = TopicEnhancedRAGRetriever(corpus)
results = retriever.retrieve("What is transfer learning?", k=5)

for r in results:
    print(f"[{r['score']:.4f}] {r['sentence']}")
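
The Quick Start expects `documents` to already exist. As a rough sketch, a folder of plain-text files could be turned into that list as follows. The key names (id, text, title) come from the comment above; the helper `load_documents`, the use of the file stem as the id, and the title derived from the file name are assumptions for illustration, not part of topic-rag.

```python
from pathlib import Path


def load_documents(folder):
    """Turn every .txt file in `folder` into an {id, text, title} dict."""
    documents = []
    for path in sorted(Path(folder).glob("*.txt")):
        documents.append({
            "id": path.stem,                                  # e.g. "transfer_learning"
            "text": path.read_text(encoding="utf-8"),
            "title": path.stem.replace("_", " ").title(),     # e.g. "Transfer Learning"
        })
    return documents
```

The result of `load_documents("my_docs")` (a hypothetical folder name) can then be passed to `processor.process_corpus(...)`.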

Two Retrieval Modes

| Retriever | How it works |
|---|---|
| StandardRAGRetriever | Ranks by TF-IDF cosine similarity only |
| TopicEnhancedRAGRetriever | Combines TF-IDF similarity + latent topic overlap for better context-aware ranking |

Use both side by side to compare standard vs topic-enhanced retrieval on your own data:

from topic_rag import DocumentProcessor, StandardRAGRetriever, TopicEnhancedRAGRetriever

processor = DocumentProcessor(n_topics=10)
corpus = processor.process_corpus(documents)

std_retriever = StandardRAGRetriever(corpus)
enh_retriever = TopicEnhancedRAGRetriever(corpus)

std_results = std_retriever.retrieve("my query", k=5)
enh_results = enh_retriever.retrieve("my query", k=5)
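
One simple way to read the two result lists against each other is to track how each sentence's rank moves. The helper below is a sketch, not part of topic-rag; it assumes both retrievers return lists of dicts with a 'sentence' key, best match first, as in the Quick Start loop. The sample results are made up so the block runs on its own.

```python
def rank_changes(std_results, enh_results):
    """Report how each sentence's rank moved from standard to topic-enhanced.

    Returns (sentence, std_rank, enh_rank) tuples in topic-enhanced order.
    std_rank is None when the sentence is absent from the standard results.
    """
    std_rank = {}
    for i, r in enumerate(std_results, start=1):
        std_rank.setdefault(r["sentence"], i)  # keep the best rank on duplicates
    return [
        (r["sentence"], std_rank.get(r["sentence"]), i)
        for i, r in enumerate(enh_results, start=1)
    ]


# Hypothetical results, shaped like the Quick Start output
std_results = [
    {"score": 0.41, "sentence": "A"},
    {"score": 0.35, "sentence": "B"},
]
enh_results = [
    {"score": 0.52, "sentence": "B"},
    {"score": 0.47, "sentence": "C"},
]

for sentence, before, after in rank_changes(std_results, enh_results):
    moved = "new" if before is None else f"{before} -> {after}"
    print(f"{moved:>8}  {sentence}")
```

Sentences marked "new" appear only in the topic-enhanced results, which is usually where the topic signal is doing the most work.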

Key Design Decisions

  • No GPU required — uses TF-IDF and a lightweight NMF-based topic model (no PyTorch, no sentence-transformers)
  • Minimal dependencies — only numpy and scikit-learn at the core
  • Self-contained — no external API calls or downloads needed

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

See the LICENSE file for the full license text.

What AGPL-3.0 means

  • Anyone can view, use, and modify the code
  • Any modified version used to provide a network service must make its source code available to the users of that service
  • Companies cannot embed this in proprietary software without open-sourcing their product

For commercial licensing enquiries, please contact the project maintainers.

Download files

Source Distribution

topic_rag-1.0.3.tar.gz (44.3 kB)

Uploaded Source

Built Distribution

topic_rag-1.0.3-py3-none-any.whl (41.3 kB)

Uploaded Python 3

File details

Details for the file topic_rag-1.0.3.tar.gz.

File metadata

  • Download URL: topic_rag-1.0.3.tar.gz
  • Upload date:
  • Size: 44.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for topic_rag-1.0.3.tar.gz
Algorithm Hash digest
SHA256 a863fc9e05fc23014a606badb85c7a769c248ef0da2b331c1d2698afa12b2851
MD5 ea1f7168c3bd56355d6dcacbddd8fa42
BLAKE2b-256 f292455265c99e037d8ff557d6c8c8e1de3b1b944fd66cbf49417818e9438b62

File details

Details for the file topic_rag-1.0.3-py3-none-any.whl.

File metadata

  • Download URL: topic_rag-1.0.3-py3-none-any.whl
  • Upload date:
  • Size: 41.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for topic_rag-1.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 3931622acc34f80bbe140c282afb8e43b5f7b5d247bc50944387e0c253fc07ed
MD5 827d2b8bd77b86f222aaf5a211f5a132
BLAKE2b-256 da727df773c9241b630699e3fdec2969cd134188dc12c1bef670552d5ab5f29c
