graphrag-tagger

A lightweight toolkit for extracting topics from PDFs and visualizing their connections using graphs.

Overview

graphrag-tagger automates topic extraction from PDF documents and builds graphs to visualize relationships between text segments. It offers a modular pipeline for processing text, applying topic modeling, refining results with an LLM, and constructing a graph-based representation of topic similarities.

Key Features

PDF Processing – Extracts text from PDFs efficiently.
Text Segmentation – Splits extracted text into manageable chunks.
Topic Modeling – Supports two methods:

  • Scikit-learn: Classic Latent Dirichlet Allocation (LDA) for topic extraction.
  • ktrain: A deep-learning-based approach with vocabulary filtering.

LLM-Powered Refinement – Uses a language model to clean and classify topics.
Graph Construction – Builds topic similarity graphs using network analysis.
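
The Text Segmentation feature can be pictured with a small sliding-window splitter. This is only a sketch under stated assumptions: the function name chunk_words is hypothetical, and it assumes sizes are counted in whitespace-separated words, which the project may or may not do (it could equally count characters or tokens).

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size words sharing chunk_overlap words.

    Hypothetical helper for illustration; not part of graphrag-tagger's API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "a b c d e f g h i j"
print(chunk_words(sample, chunk_size=4, chunk_overlap=1))
```

Overlapping windows keep a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated text.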

Core Dependencies

  • PyMuPDF – Extracts text from PDF files.
  • scikit-learn & ktrain – Performs topic modeling.
  • LLM Client – Enhances and refines extracted topics.
  • networkx – Constructs and analyzes graphs.

Installation

Ensure Python is installed, then build and install the package from the root of a local source checkout. The first command requires the build package (pip install build):

python -m build
pip install .

Usage

Extract Topics from PDFs

Run the topic extraction pipeline on a folder of PDFs:

python -m graphrag_tagger.tagger \
    --pdf_folder /path/to/pdfs \
    --output_folder /path/to/output \
    --chunk_size 512 \
    --chunk_overlap 25 \
    --n_features 512 \
    --min_df 2 \
    --max_df 0.95 \
    --llm_model ollama:phi4 \
    --model_choice sk
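
The --min_df, --max_df and --n_features values in the command above look like the document-frequency cutoffs used by scikit-learn's CountVectorizer, where an integer is an absolute document count and a float is a proportion of documents. Whether the tool forwards them to scikit-learn unchanged is an assumption here. The sketch below reproduces those cutoff semantics in plain Python; filter_vocabulary is a hypothetical name and the whitespace tokenizer is a simplification.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=2, max_df=0.95, n_features=512):
    """Return the terms that survive CountVectorizer-style frequency cutoffs.

    Hypothetical helper for illustration; not part of graphrag-tagger's API.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term in the corpus
    for tokens in tokenized:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    # Integer cutoff = absolute document count, float cutoff = proportion.
    min_count = min_df if isinstance(min_df, int) else min_df * n_docs
    max_count = max_df if isinstance(max_df, int) else max_df * n_docs

    kept = [t for t, df in doc_freq.items() if min_count <= df <= max_count]
    # Keep the n_features most frequent terms; ties are broken alphabetically.
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept[:n_features]


docs = ["graph topic model", "graph topic pdf", "graph chunk pdf", "graph llm"]
print(filter_vocabulary(docs))
```

In this toy corpus "graph" appears in every document, so max_df=0.95 drops it, while words seen in only one document fall below min_df=2.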

Build a Topic Similarity Graph

Generate a graph from the extracted topics:

python -m graphrag_tagger.build_graph \
    --input_folder /path/to/output \
    --output_folder /path/to/graph \
    --threshold_percentile 97.5
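
One plausible reading of --threshold_percentile is that only the strongest similarity scores become edges: with 97.5, roughly the top 2.5% of pairs. The sketch below illustrates that idea in plain Python. The helper names, the linear-interpolation percentile, and the choice of keeping weights at or above the cutoff are all assumptions, not the project's confirmed behaviour.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100), interpolating linearly between ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def keep_strong_edges(weighted_edges, threshold_percentile):
    """Keep (u, v, weight) edges whose weight is at or above the percentile cutoff.

    Hypothetical helper for illustration; not part of graphrag-tagger's API.
    """
    cutoff = percentile([w for _, _, w in weighted_edges], threshold_percentile)
    return [(u, v, w) for u, v, w in weighted_edges if w >= cutoff]


edges = [("a", "b", 0.1), ("a", "c", 0.2), ("b", "c", 0.3),
         ("c", "d", 0.4), ("d", "e", 0.5)]
print(keep_strong_edges(edges, 50))
```

A percentile cutoff adapts to each corpus: the number of edges scales with the number of pairs instead of depending on an absolute similarity value that may not transfer between document sets.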

How It Works

1️⃣ PDF Processing – Extracts raw text from documents.
2️⃣ Text Segmentation – Divides the text into structured chunks.
3️⃣ Topic Modeling – Uses either LDA or ktrain-based modeling to extract key topics.
4️⃣ LLM-Based Refinement – Cleans and classifies topics for better accuracy.
5️⃣ Graph Construction – Builds a network where:

  • Nodes represent text chunks.
  • Edges represent topic similarities.
  • The graph reveals clusters and connections between document sections.
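
The graph described in step 5 can be sketched as follows. The project lists networkx as its graph library; this example uses only the standard library so it stays self-contained. Jaccard similarity over topic sets, the min_similarity parameter, and all function names are assumptions made for illustration, since the similarity measure the tool actually uses is not stated here.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of topics two chunks have in common (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_graph(chunk_topics, min_similarity=0.5):
    """Map each chunk id to its neighbours, weighted by topic similarity."""
    graph = {chunk_id: {} for chunk_id in chunk_topics}  # nodes = text chunks
    for u, v in combinations(chunk_topics, 2):
        sim = jaccard(chunk_topics[u], chunk_topics[v])
        if sim >= min_similarity:  # edges = sufficiently similar topic sets
            graph[u][v] = sim
            graph[v][u] = sim
    return graph


def clusters(graph):
    """Return connected components, found with an iterative depth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(n for n in graph[node] if n not in component)
        seen |= component
        components.append(component)
    return components


chunk_topics = {
    "c1": {"finance", "risk"},
    "c2": {"finance", "risk", "audit"},
    "c3": {"biology"},
    "c4": {"biology", "genetics"},
}
print(clusters(build_graph(chunk_topics)))
```

Each connected component groups chunks that share topics directly or through intermediate chunks, which is one simple way to surface the clusters mentioned above.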

Contributing

Contributions are welcome! Feel free to submit issues or pull requests.
