A Python SDK for graph document processing
GraphDoc - Knowledge Graph Enhanced Document Processing
GraphDoc (GraphRAG) is a document analysis and retrieval system that enhances traditional Retrieval-Augmented Generation (RAG) with knowledge graph capabilities. It's designed to process, index, and query complex document collections by combining vector-based retrieval with graph relationship context.
Features
- Document Processing: Extract structured text from PDFs, text files, and other document formats
- Knowledge Graph Construction: Automatically extract entities and relationships
- Timeline Analysis: Create chronological sequences of events from documents
- Graph-based Retrieval: Enhance document retrieval with graph relationships
- Batch Processing: Process large document collections efficiently
Installation
pip install -e .
For development:
pip install -e ".[test]"
Quick Start
Python API
from graphrag_doc.index.batch_indexer import GraphRAGIndexer
# Initialize the indexer
indexer = GraphRAGIndexer(working_dir="graphrag_index")
# Process documents
results = indexer.index_documents(
    folder_path="path/to/documents",
    file_extensions=[".txt", ".pdf", ".docx", ".md"],
    recursive=True,
    extract_metadata=True
)
# Print indexing results
for result in results:
    print(f"{result.filename}: {result.status} - {result.message}")
# Access document metadata
for result in results:
    if result.status == "Success" and result.metadata:
        print(f"File: {result.filename}")
        print(f"Size: {result.metadata.file_size_bytes} bytes")
        print(f"Word count: {result.metadata.word_count}")
        # For PDFs, additional metadata may be available
        if result.metadata.num_pages:
            print(f"Pages: {result.metadata.num_pages}")
# Use the indexed documents with the LightRAG query API
from lightrag import QueryParam
response = indexer.rag.query(
    "When did the event occur?",
    param=QueryParam(mode="mix")
)
print(response)
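For larger folders, a per-status summary can be easier to read than one line per file. The following is a minimal sketch, not part of GraphDoc: `IndexResult` is a hypothetical stand-in that mirrors only the `filename`, `status`, `message`, and `metadata` attributes used in the Quick Start, and the `"Failed"` status string in the demo is an assumption (only `"Success"` appears above).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class IndexResult:
    """Hypothetical stand-in for the objects returned by index_documents()."""
    filename: str
    status: str
    message: str = ""
    metadata: Optional[object] = None


def summarize(results: Iterable[IndexResult]) -> dict:
    """Count results per status and list every file that did not succeed."""
    results = list(results)
    counts = Counter(r.status for r in results)
    failed = [(r.filename, r.message) for r in results if r.status != "Success"]
    return {"total": len(results), "by_status": dict(counts), "failed": failed}


if __name__ == "__main__":
    demo = [
        IndexResult("a.txt", "Success", "indexed"),
        IndexResult("b.pdf", "Failed", "could not extract text"),  # assumed status value
    ]
    print(summarize(demo))
```

Because `summarize` only reads attributes, it should also accept the real result objects, provided they expose the same fields.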
Advanced Indexing Options
# Process specific file types
indexer.index_documents(
    folder_path="path/to/documents",
    file_extensions=[".pdf"],  # Only process PDF files
    recursive=True
)
# Non-recursive search (only files in the specified folder, not subfolders)
indexer.index_documents(
    folder_path="path/to/documents",
    recursive=False
)
# Skip metadata extraction for faster processing
indexer.index_documents(
    folder_path="path/to/documents",
    extract_metadata=False
)
# Customize batch size during initialization
indexer = GraphRAGIndexer(
    working_dir="graphrag_index",
    batch_size=50  # Process 50 documents per batch
)
Command Line Interface
Index documents using the CLI:
# Basic usage
python -m graphrag_doc.sdk.cli index path/to/documents
# Specify working directory and output file
python -m graphrag_doc.sdk.cli index path/to/documents --working-dir=my_index --output=results.json
# Specify file types
python -m graphrag_doc.sdk.cli index path/to/documents --file-types=".txt,.pdf,.docx,.md"
# Set batch size
python -m graphrag_doc.sdk.cli index path/to/documents --batch-size=50
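When the CLI is driven from another Python program, it can help to build the argument list in one place. This is a sketch under the assumption that the flags behave exactly as in the examples above; `build_index_command` is a hypothetical helper, not something the package provides.

```python
import sys
from typing import List, Optional, Sequence


def build_index_command(
    folder: str,
    working_dir: Optional[str] = None,
    output: Optional[str] = None,
    file_types: Optional[Sequence[str]] = None,
    batch_size: Optional[int] = None,
) -> List[str]:
    """Assemble the argv list for `python -m graphrag_doc.sdk.cli index ...`."""
    cmd = [sys.executable, "-m", "graphrag_doc.sdk.cli", "index", folder]
    if working_dir is not None:
        cmd.append(f"--working-dir={working_dir}")
    if output is not None:
        cmd.append(f"--output={output}")
    if file_types:
        # The CLI examples pass extensions as one comma-separated value.
        cmd.append("--file-types=" + ",".join(file_types))
    if batch_size is not None:
        cmd.append(f"--batch-size={batch_size}")
    return cmd


if __name__ == "__main__":
    # Only prints the command; nothing is executed here.
    print(build_index_command("path/to/documents", working_dir="my_index", batch_size=50))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once the package is installed; passing a list rather than a shell string avoids quoting problems with paths that contain spaces.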
Sample Script
The project includes a sample script for batch document processing:
python -m src.scripts.index_sample_documents path/to/documents --working-dir=my_index --output=results.json
Supported File Formats
- Text files (.txt)
- PDF documents (.pdf) - requires PyPDF2
- Word documents (.docx) - requires python-docx
- Markdown files (.md)
- Other formats with textract (optional)
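Since PDF and Word support depend on extra packages, it can be useful to check what is importable before indexing a mixed folder. The sketch below is an illustration only and does not reflect how GraphDoc itself detects its parsers; `supported_extensions` is a hypothetical helper. It relies on the fact that PyPDF2 is imported as `PyPDF2` and python-docx as `docx`.

```python
from importlib.util import find_spec

# Extension -> module that must be importable (None means no extra dependency).
REQUIRED_MODULE = {
    ".txt": None,
    ".md": None,
    ".pdf": "PyPDF2",
    ".docx": "docx",  # the python-docx distribution installs the `docx` module
}


def supported_extensions() -> dict:
    """Map each extension to True if its parser dependency can be imported."""
    return {
        ext: module is None or find_spec(module) is not None
        for ext, module in REQUIRED_MODULE.items()
    }


if __name__ == "__main__":
    for ext, ok in supported_extensions().items():
        print(f"{ext}: {'available' if ok else 'missing dependency'}")
```

The extensions reported as available can then be passed as `file_extensions` to `index_documents`.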
Configuration
GraphDoc uses environment variables for configuration:
- GRAPHDOC_NEO4J_URI: Neo4j database URI (default: bolt://localhost:7687)
- GRAPHDOC_NEO4J_USERNAME: Neo4j username
- GRAPHDOC_NEO4J_PASSWORD: Neo4j password
- OPENAI_API_KEY: OpenAI API key
- GRAPHDOC_OPENAI_MODEL: OpenAI model to use (default: gpt-4o-mini)
- GRAPHDOC_INDEX_DIR: Path to store indexes (default: graphrag_index)
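The variables and defaults listed above can be gathered into a single settings object, for example to log the effective configuration before a run. This is a minimal sketch: `GraphDocSettings` and `load_settings` are hypothetical names, and GraphDoc may read its environment differently. Variables without a documented default are returned as `None` when unset.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GraphDocSettings:
    """Hypothetical container for the documented environment variables."""
    neo4j_uri: str
    neo4j_username: Optional[str]
    neo4j_password: Optional[str]
    openai_api_key: Optional[str]
    openai_model: str
    index_dir: str


def load_settings(env: Mapping[str, str] = os.environ) -> GraphDocSettings:
    """Read settings from `env`, applying the defaults documented above."""
    return GraphDocSettings(
        neo4j_uri=env.get("GRAPHDOC_NEO4J_URI", "bolt://localhost:7687"),
        neo4j_username=env.get("GRAPHDOC_NEO4J_USERNAME"),
        neo4j_password=env.get("GRAPHDOC_NEO4J_PASSWORD"),
        openai_api_key=env.get("OPENAI_API_KEY"),
        openai_model=env.get("GRAPHDOC_OPENAI_MODEL", "gpt-4o-mini"),
        index_dir=env.get("GRAPHDOC_INDEX_DIR", "graphrag_index"),
    )


if __name__ == "__main__":
    settings = load_settings()
    # Avoid printing secrets; report only whether the API key is set.
    print(settings.neo4j_uri, settings.openai_model, settings.index_dir)
    print("OPENAI_API_KEY set:", settings.openai_api_key is not None)
```

Taking the environment as a parameter keeps the function easy to test with a plain dictionary.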
Development
Running Tests
pytest
Run a single test:
pytest tests/path_to_test.py::test_function_name
Run the indexer tests:
pytest tests/test_indexer.py
Code Style
flake8
pyright
License
MIT