graphrag-tagger
A lightweight toolkit for extracting topics from PDFs and visualizing their connections using graphs.
Overview
graphrag-tagger automates topic extraction from PDF documents and builds graphs to visualize relationships between text segments. It offers a modular pipeline for processing text, applying topic modeling, refining results with an LLM, and constructing a graph-based representation of topic similarities.
Key Features
✅ PDF Processing – Extracts text from PDFs efficiently.
✅ Text Segmentation – Splits extracted text into manageable chunks.
✅ Topic Modeling – Supports two methods:
- Scikit-learn: Classic Latent Dirichlet Allocation (LDA) for topic extraction.
- ktrain: A deep-learning-based approach with vocabulary filtering.
✅ LLM-Powered Refinement – Uses a language model to clean and classify topics.
✅ Graph Construction – Builds topic similarity graphs using network analysis.
Core Dependencies
- PyMuPDF – Extracts text from PDF files.
- scikit-learn & ktrain – Perform topic modeling.
- LLM Client – Enhances and refines extracted topics.
- networkx – Constructs and analyzes graphs.
Installation
Make sure Python and the PyPA build package are installed (python -m build requires it), then build and install the package locally:
python -m build
pip install .
Usage
Extract Topics from PDFs
Run the topic extraction pipeline on a folder of PDFs:
python -m graphrag_tagger.tagger \
--pdf_folder /path/to/pdfs \
--output_folder /path/to/output \
--chunk_size 512 \
--chunk_overlap 25 \
--n_features 512 \
--min_df 2 \
--max_df 0.95 \
--llm_model ollama:phi4 \
--model_choice sk
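To illustrate how --chunk_size and --chunk_overlap interact, here is a minimal sketch of overlapping segmentation. It is not the package's own code: the helper name chunk_text is hypothetical, and it assumes both values count whitespace-separated words, whereas the actual tool may count characters or tokens.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 25) -> list[str]:
    """Split text into overlapping chunks of whitespace-separated words.

    Hypothetical helper for illustration; the unit (words) is an assumption.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_text(sample, chunk_size=4, chunk_overlap=1):
        print(chunk)
```

With a chunk size of 4 and an overlap of 1, each chunk repeats the last word of the previous one, so content that straddles a boundary appears in both neighbours.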
Build a Topic Similarity Graph
Generate a graph from the extracted topics:
python -m graphrag_tagger.build_graph \
--input_folder /path/to/output \
--output_folder /path/to/graph \
--threshold_percentile 97.5
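One plausible reading of --threshold_percentile is that only the most similar chunk pairs become edges. The sketch below shows that idea under stated assumptions: each chunk carries a set of topic labels, similarity is the Jaccard index, and the cutoff is a linearly interpolated percentile over all pairwise scores. The function names and the similarity measure are hypothetical, and the sketch uses plain dictionaries instead of networkx to stay dependency-free.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of topics two chunks have in common (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def percentile(values: list[float], q: float) -> float:
    """Linearly interpolated percentile for q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values to rank")
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def build_edges(
    chunk_topics: dict[str, set[str]], threshold_percentile: float
) -> dict[tuple[str, str], float]:
    """Keep only chunk pairs whose similarity reaches the percentile cutoff."""
    scores = {
        (u, v): jaccard(chunk_topics[u], chunk_topics[v])
        for u, v in combinations(sorted(chunk_topics), 2)
    }
    cutoff = percentile(list(scores.values()), threshold_percentile)
    # Pairs with no shared topic never become edges, even if the cutoff is 0.
    return {pair: s for pair, s in scores.items() if s >= cutoff and s > 0}


if __name__ == "__main__":
    topics = {
        "c1": {"graphs", "pdf"},
        "c2": {"graphs", "pdf"},
        "c3": {"graphs", "llm"},
        "c4": {"finance"},
    }
    print(build_edges(topics, threshold_percentile=90))
```

A high percentile such as 97.5 keeps the graph sparse, which makes clusters easier to see. The resulting pairs could be loaded into networkx with Graph.add_weighted_edges_from if graph algorithms are needed.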
How It Works
1️⃣ PDF Processing – Extracts raw text from documents.
2️⃣ Text Segmentation – Divides the text into structured chunks.
3️⃣ Topic Modeling – Uses either LDA or ktrain-based modeling to extract key topics.
4️⃣ LLM-Based Refinement – Passes the extracted topics to a language model, which cleans and classifies them.
5️⃣ Graph Construction – Builds a network where:
- Nodes represent text chunks.
- Edges represent topic similarities.
The resulting graph reveals clusters and connections between document sections.
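The vocabulary filtering that feeds step 3 can be sketched as follows. This is an illustration, not the package's implementation: it assumes the --min_df, --max_df, and --n_features flags follow the semantics of scikit-learn's CountVectorizer (an integer min_df is an absolute document count, a float max_df is a proportion of documents, and the feature cap keeps the most frequent terms). The function name filter_vocabulary and the whitespace tokenizer are hypothetical.

```python
from collections import Counter


def filter_vocabulary(
    chunks: list[str], min_df: int = 2, max_df: float = 0.95, n_features: int = 512
) -> list[str]:
    """Return the terms that survive document-frequency filtering."""
    doc_freq: Counter[str] = Counter()   # number of chunks containing the term
    term_freq: Counter[str] = Counter()  # total occurrences across all chunks
    for chunk in chunks:
        tokens = chunk.lower().split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    max_count = max_df * len(chunks)
    kept = [t for t, df in doc_freq.items() if min_df <= df <= max_count]
    # Most frequent terms first; ties are broken alphabetically for stability.
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept[:n_features]


if __name__ == "__main__":
    chunks = [
        "the graph links topics",
        "the graph has nodes",
        "the topics form clusters",
        "the end",
    ]
    print(filter_vocabulary(chunks))
```

In this example "the" is dropped because it appears in every chunk, and words that occur in a single chunk are dropped as too rare, leaving only "graph" and "topics" as candidate terms for the topic model.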
Contributing
Contributions are welcome! Feel free to submit issues or pull requests.
Download files
File details
Details for the file graphrag_tagger-0.1.1.tar.gz.
File metadata
- Download URL: graphrag_tagger-0.1.1.tar.gz
- Upload date:
- Size: 16.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7de8162272a8583880fc58bf582be8ce4ea9d8bfe0e5d0e0ef78c5a559c30540 |
| MD5 | 0c8702260ffce91af1f7ed008c4a1d92 |
| BLAKE2b-256 | 176d0c1c9433d77a5001a178fff268743cfb4b4a6d6f29558b203d71270f0716 |
File details
Details for the file graphrag_tagger-0.1.1-py3-none-any.whl.
File metadata
- Download URL: graphrag_tagger-0.1.1-py3-none-any.whl
- Upload date:
- Size: 20.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a63541ab263c321cb0f2dcf6452aaa41f3f3e6158933ffaa0cbef895766a2241 |
| MD5 | 0bdc9d4bc3e5b8bd74c8fe3280353349 |
| BLAKE2b-256 | 271b3d70115d05dd7ad1054285c369acac9fbf6065f20cb8270fb3d4400b006a |