graphrag-tagger

A lightweight toolkit for extracting topics from PDFs and visualizing their connections using graphs.

Overview

graphrag-tagger automates topic extraction from PDF documents and builds graphs to visualize relationships between text segments. It offers a modular pipeline for processing text, applying topic modeling, refining results with an LLM, and constructing a graph-based representation of topic similarities.

Key Features

✅ PDF Processing – Extracts text from PDFs efficiently.
✅ Text Segmentation – Splits extracted text into manageable chunks.
✅ Topic Modeling – Supports two methods:
    - Scikit-learn: Classic Latent Dirichlet Allocation (LDA) for topic extraction.
    - ktrain: A deep-learning-based approach with vocabulary filtering.
✅ LLM-Powered Refinement – Uses a language model to clean and classify topics.
✅ Graph Construction – Builds topic similarity graphs using network analysis.
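
The vocabulary filtering mentioned under Topic Modeling can be pictured as a document-frequency filter. The sketch below is illustrative only and is not taken from graphrag-tagger's source: the function name filter_vocabulary is hypothetical, tokenization is a plain whitespace split, and the thresholds follow the scikit-learn convention in which an integer min_df is an absolute document count and a float max_df is a proportion of documents.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=2, max_df=0.95):
    """Keep terms that appear in at least min_df documents and in at most
    max_df * len(docs) documents (hypothetical helper, not the package API)."""
    n_docs = len(docs)
    # Count the number of documents containing each term, not total occurrences.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    max_count = max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)


docs = [
    "graph of topics",
    "topics from pdf files",
    "graph from pdf text",
    "the graph links topics",
]
print(filter_vocabulary(docs, min_df=2, max_df=0.75))
```

Terms that occur in only one chunk carry little shared signal, and terms that occur in almost every chunk do not help separate topics, which is why both ends are trimmed.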

Core Dependencies

PyMuPDF – Extracts text from PDF files.
scikit-learn & ktrain – Performs topic modeling.
LLM Client – Enhances and refines extracted topics.
networkx – Constructs and analyzes graphs.

Installation

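Ensure you have Python installed. The released version can be installed from PyPI with pip install graphrag-tagger. To install from a local checkout instead, run the commands below from the project root; python -m build requires the build package and is only needed if you also want the sdist and wheel files written to dist/: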

python -m build
pip install .

Usage

Extract Topics from PDFs

Run the topic extraction pipeline on a folder of PDFs:

python -m graphrag_tagger.tagger \
    --pdf_folder /path/to/pdfs \
    --output_folder /path/to/output \
    --chunk_size 512 \
    --chunk_overlap 25 \
    --n_features 512 \
    --min_df 2 \
    --max_df 0.95 \
    --llm_model ollama:phi4 \
    --model_choice sk
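
To make the --chunk_size and --chunk_overlap options concrete, here is a minimal sketch of overlapping segmentation. It is an assumption-laden illustration rather than the package's code: the function name chunk_words is hypothetical, and it measures size and overlap in whitespace-separated words, whereas the tool may count characters or tokens.

```python
def chunk_words(text, chunk_size=512, chunk_overlap=25):
    """Split text into chunks of up to chunk_size words, where each chunk
    repeats the last chunk_overlap words of the previous one."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, chunk_overlap=1))
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated text.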

Build a Topic Similarity Graph

Generate a graph from the extracted topics:

python -m graphrag_tagger.build_graph \
    --input_folder /path/to/output \
    --output_folder /path/to/graph \
    --threshold_percentile 97.5
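
One plausible reading of --threshold_percentile is that only the most similar chunk pairs become edges. The sketch below illustrates that idea using only the standard library. It is not the package's implementation: the names cosine, percentile, and build_edges are hypothetical, and cosine similarity over per-chunk topic vectors is an assumed similarity measure.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def percentile(values, q):
    """Percentile q (0 to 100) of a non-empty list, with linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def build_edges(vectors, threshold_percentile=97.5):
    """Return (i, j, similarity) for pairs at or above the percentile cutoff."""
    sims = {
        (i, j): cosine(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }
    if not sims:
        return []
    cutoff = percentile(list(sims.values()), threshold_percentile)
    return [(i, j, s) for (i, j), s in sims.items() if s >= cutoff]


# Four toy chunks described by two topic weights each (made-up numbers).
vectors = [[1, 0], [1, 0.1], [0, 1], [0.1, 1]]
for i, j, s in build_edges(vectors, threshold_percentile=80):
    print(i, j, round(s, 3))
```

A percentile cutoff adapts to each corpus: a value of 97.5 keeps roughly the top 2.5% of pairwise similarities regardless of how similar the chunks are in absolute terms.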

How It Works

1️⃣ PDF Processing – Extracts raw text from documents.
2️⃣ Text Segmentation – Divides the text into structured chunks.
3️⃣ Topic Modeling – Uses either LDA or ktrain-based modeling to extract key topics.
4️⃣ LLM-Based Refinement – Cleans and classifies topics for better accuracy.
5️⃣ Graph Construction – Builds a network where:
    - Nodes represent text chunks.
    - Edges represent topic similarities.
    - The graph reveals clusters and connections between document sections.
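
As an illustration of how clusters can be read off such a graph, the sketch below groups chunk indices into connected components with a depth-first search. It is a standard-library stand-in, not the project's code: the function name find_clusters and the edge-list input format are assumptions. With networkx, the connected_components function serves the same purpose.

```python
from collections import defaultdict


def find_clusters(n_nodes, edges):
    """Group node indices 0..n_nodes-1 into connected components."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    clusters = []
    for start in range(n_nodes):
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


# Chunks 0, 1 and 3 are linked by topic similarity; 2 and 4 stand alone.
print(find_clusters(5, [(0, 1), (1, 3)]))
```

Chunks that end up in the same component share topics directly or through intermediate chunks, while isolated nodes had no similarity above the cutoff.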

Contributing

Contributions are welcome! Feel free to submit issues or pull requests.

Version 0.1.1, released Feb 23, 2025

Download files

Source Distribution

graphrag_tagger-0.1.1.tar.gz (16.9 kB view details)

Uploaded Feb 23, 2025 Source

Built Distribution

graphrag_tagger-0.1.1-py3-none-any.whl (20.1 kB view details)

Uploaded Feb 23, 2025 Python 3

File details

Details for the file graphrag_tagger-0.1.1.tar.gz.

File metadata

Download URL: graphrag_tagger-0.1.1.tar.gz
Upload date: Feb 23, 2025
Size: 16.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.13

File hashes

Hashes for graphrag_tagger-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`7de8162272a8583880fc58bf582be8ce4ea9d8bfe0e5d0e0ef78c5a559c30540`
MD5	`0c8702260ffce91af1f7ed008c4a1d92`
BLAKE2b-256	`176d0c1c9433d77a5001a178fff268743cfb4b4a6d6f29558b203d71270f0716`

File details

Details for the file graphrag_tagger-0.1.1-py3-none-any.whl.

File metadata

Download URL: graphrag_tagger-0.1.1-py3-none-any.whl
Upload date: Feb 23, 2025
Size: 20.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.13

File hashes

Hashes for graphrag_tagger-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a63541ab263c321cb0f2dcf6452aaa41f3f3e6158933ffaa0cbef895766a2241`
MD5	`0bdc9d4bc3e5b8bd74c8fe3280353349`
BLAKE2b-256	`271b3d70115d05dd7ad1054285c369acac9fbf6065f20cb8270fb3d4400b006a`
