Lightweight topic-aware document processing and retrieval

Project description

topic-rag — Topic-Enhanced Retrieval-Augmented Generation

Install from PyPI with pip:

pip install topic-rag

What it does

A baseline RAG retriever ranks documents purely by text similarity (here, TF-IDF cosine similarity). topic-rag adds a second signal: it automatically discovers latent topics across your document collection and uses the topic overlap between query and document to adjust the ranking. A query about "neural networks" will score higher against documents that share that topic cluster, even if the exact words differ.
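To make the idea concrete, here is a minimal sketch of how a text-similarity score and a topic-overlap score could be blended into a single ranking score. It is not topic-rag's actual scoring code: the function names `cosine` and `combined_score`, the mixing weight `alpha`, and the hand-written toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_score(query_text, doc_text, query_topics, doc_topics, alpha=0.7):
    """Blend text similarity with topic overlap.

    alpha is a hypothetical mixing weight: 1.0 uses text similarity only,
    0.0 uses topic overlap only.
    """
    text_sim = cosine(query_text, doc_text)
    topic_sim = cosine(query_topics, doc_topics)
    return alpha * text_sim + (1.0 - alpha) * topic_sim


# Toy data: term weights over a 4-word vocabulary, topic weights over 2 topics.
query_text = [1.0, 1.0, 0.0, 0.0]
query_topics = [0.9, 0.1]

doc_a_text = [0.0, 1.0, 0.0, 0.0]    # shares a word with the query, different topic
doc_a_topics = [0.1, 0.9]
doc_b_text = [0.0, 0.0, 1.0, 1.0]    # shares no words with the query, same topic
doc_b_topics = [0.9, 0.1]

for name, text_vec, topic_vec in [
    ("doc_a", doc_a_text, doc_a_topics),
    ("doc_b", doc_b_text, doc_b_topics),
]:
    text_only = cosine(query_text, text_vec)
    blended = combined_score(query_text, text_vec, query_topics, topic_vec)
    print(f"{name}: text-only={text_only:.3f} blended={blended:.3f}")
```

In this toy setup `doc_b` has no words in common with the query, so text similarity alone scores it at zero, while the blended score is positive because the topic vectors match.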

Quick Start

from topic_rag import DocumentProcessor, TopicEnhancedRAGRetriever

# `documents` is your own data: a list of dicts with the keys id, text, and title
# Build a topic-aware corpus from your documents
processor = DocumentProcessor(n_topics=10)
corpus = processor.process_corpus(documents)

# Retrieve with topic enhancement
retriever = TopicEnhancedRAGRetriever(corpus)
results = retriever.retrieve("What is transfer learning?", k=5)

for r in results:
    print(f"[{r['score']:.4f}] {r['sentence']}")
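The Quick Start expects `documents` to be a list of dicts with `id`, `text`, and `title` keys. One way to build such a list from a folder of plain-text files is sketched below. The helper `load_documents` is hypothetical and not part of topic-rag, and deriving `id` and `title` from the file name is an assumption you may want to change.

```python
from pathlib import Path


def load_documents(folder):
    """Read every .txt file in `folder` into a dict with id, text, and title keys.

    Hypothetical helper: id is the file name without extension, and title is
    a prettified version of the same name.
    """
    documents = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if not text:
            continue  # skip empty files
        documents.append(
            {
                "id": path.stem,
                "title": path.stem.replace("_", " ").title(),
                "text": text,
            }
        )
    return documents
```

The returned list can then be passed to `processor.process_corpus(documents)` as in the Quick Start.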

Two Retrieval Modes

Retriever                   How it works
StandardRAGRetriever        Ranks by TF-IDF cosine similarity only
TopicEnhancedRAGRetriever   Combines TF-IDF similarity + latent topic overlap for better context-aware ranking

Use both side by side to compare standard vs topic-enhanced retrieval on your own data:

from topic_rag import DocumentProcessor, StandardRAGRetriever, TopicEnhancedRAGRetriever

processor = DocumentProcessor(n_topics=10)
corpus = processor.process_corpus(documents)

std_retriever = StandardRAGRetriever(corpus)
enh_retriever = TopicEnhancedRAGRetriever(corpus)

std_results = std_retriever.retrieve("my query", k=5)
enh_results = enh_retriever.retrieve("my query", k=5)
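Once you have both result lists, a small helper can show how the ranking changed between the two retrievers. The sketch below assumes each result is a dict with a `sentence` key, as in the Quick Start output, and that both retrievers return the same shape. The helper `compare_rankings` is hypothetical, and the demo lists are made-up placeholders rather than real retriever output.

```python
def _first_ranks(results):
    """Map each sentence to its 1-based rank, keeping the first occurrence."""
    ranks = {}
    for position, result in enumerate(results, start=1):
        ranks.setdefault(result["sentence"], position)
    return ranks


def compare_rankings(std_results, enh_results):
    """Return one row per sentence: (sentence, std_rank, enh_rank).

    A rank of None means that retriever did not return the sentence.
    Rows follow the enhanced ranking first, then standard-only sentences.
    """
    std_rank = _first_ranks(std_results)
    enh_rank = _first_ranks(enh_results)
    sentences = list(dict.fromkeys(list(enh_rank) + list(std_rank)))
    return [(s, std_rank.get(s), enh_rank.get(s)) for s in sentences]


# Placeholder data standing in for retriever output (not real results).
std_demo = [{"sentence": "sentence A", "score": 0.0}, {"sentence": "sentence C", "score": 0.0}]
enh_demo = [{"sentence": "sentence B", "score": 0.0}, {"sentence": "sentence A", "score": 0.0}]

for sentence, std_pos, enh_pos in compare_rankings(std_demo, enh_demo):
    print(f"{sentence}: standard={std_pos} enhanced={enh_pos}")
```

With real data, call `compare_rankings(std_results, enh_results)` on the lists produced above.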

Key Design Decisions

  • No GPU required — uses TF-IDF and a lightweight NMF-based topic model (no PyTorch, no sentence-transformers)
  • Minimal dependencies — only numpy and scikit-learn at the core
  • Self-contained — no external API calls or downloads needed

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

See the LICENSE file for the full license text.

What AGPL-3.0 means

  • Anyone can view, use, and modify the code
  • If you run a modified version as a network service, you must offer its source code to the users of that service
  • Companies cannot embed this in proprietary software they distribute without releasing the combined work under AGPL-3.0

For commercial licensing enquiries, please contact the project maintainers.
