A Python SDK for graph document processing

GraphDoc - Knowledge Graph Enhanced Document Processing

GraphDoc (GraphRAG) is a document analysis and retrieval system that enhances traditional Retrieval-Augmented Generation (RAG) with knowledge graph capabilities. It's designed to process, index, and query complex document collections by combining vector-based retrieval with graph relationship context.

Features

Document Processing: Extract structured text from PDFs, text files, and other document formats
Knowledge Graph Construction: Automatically extract entities and relationships
Timeline Analysis: Create chronological sequences of events from documents
Graph-based Retrieval: Enhance document retrieval with graph relationships
Batch Processing: Process large document collections efficiently

Installation

From a checkout of the source repository (editable install):

pip install -e .

For development, include the test extras:

pip install -e ".[test]"

Quick Start

Python API

from graphrag_doc.index.batch_indexer import GraphRAGIndexer

# Initialize the indexer
indexer = GraphRAGIndexer(working_dir="graphrag_index")

# Process documents
results = indexer.index_documents(
    folder_path="path/to/documents",
    file_extensions=[".txt", ".pdf", ".docx", ".md"],
    recursive=True,
    extract_metadata=True
)

# Print indexing results
for result in results:
    print(f"{result.filename}: {result.status} - {result.message}")

# Access document metadata
for result in results:
    if result.status == "Success" and result.metadata:
        print(f"File: {result.filename}")
        print(f"Size: {result.metadata.file_size_bytes} bytes")
        print(f"Word count: {result.metadata.word_count}")
        
        # For PDFs, additional metadata may be available
        if result.metadata.num_pages:
            print(f"Pages: {result.metadata.num_pages}")

# Use the indexed documents with the LightRAG query API
from lightrag import QueryParam
response = indexer.rag.query(
    "When did the event occur?",
    param=QueryParam(mode="mix")
)
print(response)
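When indexing a large folder, a per-status summary is often easier to scan than one line per file. The sketch below relies only on the `filename`, `status`, and `message` attributes read in the Quick Start. The `summarize_results` helper, the stand-in objects, and the `"Failed"` status value are assumptions made for illustration and are not part of the package.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_results(results):
    """Count results per status and list the files that did not succeed."""
    counts = Counter(r.status for r in results)
    failed = [r.filename for r in results if r.status != "Success"]
    return counts, failed


# Stand-in objects exposing the attributes used in the Quick Start.
# Real results would come from indexer.index_documents(...).
# The "Failed" status string is an assumption for this demo.
demo_results = [
    SimpleNamespace(filename="a.txt", status="Success", message="ok"),
    SimpleNamespace(filename="b.pdf", status="Failed", message="unreadable"),
    SimpleNamespace(filename="c.md", status="Success", message="ok"),
]

counts, failed = summarize_results(demo_results)
print(dict(counts))
print(failed)
```

Passing the list returned by `index_documents` instead of `demo_results` should work the same way, as long as the result objects expose those attributes.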

Advanced Indexing Options

# Process specific file types
indexer.index_documents(
    folder_path="path/to/documents",
    file_extensions=[".pdf"],  # Only process PDF files
    recursive=True
)

# Non-recursive search (only files in the specified folder, not subfolders)
indexer.index_documents(
    folder_path="path/to/documents",
    recursive=False
)

# Skip metadata extraction for faster processing
indexer.index_documents(
    folder_path="path/to/documents",
    extract_metadata=False
)

# Customize batch size during initialization
indexer = GraphRAGIndexer(
    working_dir="graphrag_index",
    batch_size=50  # Process 50 documents per batch
)
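To make the effect of `batch_size` concrete, the sketch below shows one common way to split a list of document paths into fixed-size batches. It illustrates the general technique only: the `iter_batches` helper is hypothetical, and the indexer's internal batching may differ.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 120 placeholder paths split into batches of 50.
paths = [f"doc_{i}.txt" for i in range(120)]
batches = list(iter_batches(paths, batch_size=50))
print([len(b) for b in batches])  # [50, 50, 20]
```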

Command Line Interface

Index documents using the CLI:

# Basic usage
python -m graphrag_doc.sdk.cli index path/to/documents

# Specify working directory and output file
python -m graphrag_doc.sdk.cli index path/to/documents --working-dir=my_index --output=results.json

# Specify file types
python -m graphrag_doc.sdk.cli index path/to/documents --file-types=".txt,.pdf,.docx,.md"

# Set batch size
python -m graphrag_doc.sdk.cli index path/to/documents --batch-size=50
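If the CLI needs to be driven from another Python program, the argument list can be assembled programmatically. The sketch below uses only the flags shown above (`--working-dir`, `--output`, `--file-types`, `--batch-size`). The `build_index_command` helper itself is hypothetical and not part of the package.

```python
import sys
from typing import List, Optional


def build_index_command(
    documents: str,
    working_dir: Optional[str] = None,
    output: Optional[str] = None,
    file_types: Optional[List[str]] = None,
    batch_size: Optional[int] = None,
) -> List[str]:
    """Build the argv list for `python -m graphrag_doc.sdk.cli index ...`."""
    cmd = [sys.executable, "-m", "graphrag_doc.sdk.cli", "index", documents]
    if working_dir is not None:
        cmd.append(f"--working-dir={working_dir}")
    if output is not None:
        cmd.append(f"--output={output}")
    if file_types:
        # The CLI examples pass extensions as one comma-separated value.
        cmd.append("--file-types=" + ",".join(file_types))
    if batch_size is not None:
        cmd.append(f"--batch-size={batch_size}")
    return cmd


cmd = build_index_command(
    "path/to/documents",
    working_dir="my_index",
    file_types=[".txt", ".pdf"],
    batch_size=50,
)
print(" ".join(cmd[1:]))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.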

Sample Script

The project includes a sample script for batch document processing:

python -m src.scripts.index_sample_documents path/to/documents --working-dir=my_index --output=results.json

Supported File Formats

Text files (.txt)
PDF documents (.pdf) - requires PyPDF2
Word documents (.docx) - requires python-docx
Markdown files (.md)
Other formats with textract (optional)
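Before starting a long indexing run, it can help to preview which files in a folder match the supported extensions. The following is a minimal sketch using only the standard library. The `find_candidate_files` helper is hypothetical, and its matching rules (case-insensitive suffix comparison) are an assumption that may not mirror how the indexer selects files.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def find_candidate_files(
    folder: str, extensions: Iterable[str], recursive: bool = True
) -> List[Path]:
    """Return files under folder whose suffix is in extensions (case-insensitive)."""
    wanted = {ext.lower() for ext in extensions}
    root = Path(folder)
    walker = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in walker if p.is_file() and p.suffix.lower() in wanted)


# Demo on a throwaway directory so the example is self-contained.
supported = [".txt", ".pdf", ".docx", ".md"]
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.txt", "b.PDF", "c.png", "sub/d.md"):
        (root / name).write_text("x")
    all_names = [p.name for p in find_candidate_files(tmp, supported)]
    top_names = [p.name for p in find_candidate_files(tmp, supported, recursive=False)]

print(all_names)
print(top_names)
```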

Configuration

GraphDoc uses environment variables for configuration:

GRAPHDOC_NEO4J_URI: Neo4j database URI (default: bolt://localhost:7687)
GRAPHDOC_NEO4J_USERNAME: Neo4j username
GRAPHDOC_NEO4J_PASSWORD: Neo4j password
OPENAI_API_KEY: OpenAI API key
GRAPHDOC_OPENAI_MODEL: OpenAI model to use (default: gpt-4o-mini)
GRAPHDOC_INDEX_DIR: Path to store indexes (default: graphrag_index)
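The variables above can be collected into a single settings object, which makes missing values easy to spot before indexing starts. This is a sketch under stated assumptions: the `GraphDocSettings` class and `load_settings` function are hypothetical names, and only the variable names and defaults come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GraphDocSettings:
    neo4j_uri: str
    neo4j_username: Optional[str]
    neo4j_password: Optional[str]
    openai_api_key: Optional[str]
    openai_model: str
    index_dir: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> GraphDocSettings:
    """Read the documented variables, falling back to the documented defaults."""
    if env is None:
        env = os.environ
    return GraphDocSettings(
        neo4j_uri=env.get("GRAPHDOC_NEO4J_URI", "bolt://localhost:7687"),
        neo4j_username=env.get("GRAPHDOC_NEO4J_USERNAME"),
        neo4j_password=env.get("GRAPHDOC_NEO4J_PASSWORD"),
        openai_api_key=env.get("OPENAI_API_KEY"),
        openai_model=env.get("GRAPHDOC_OPENAI_MODEL", "gpt-4o-mini"),
        index_dir=env.get("GRAPHDOC_INDEX_DIR", "graphrag_index"),
    )


# Passing an explicit mapping keeps the demo independent of the real environment.
settings = load_settings({"GRAPHDOC_INDEX_DIR": "my_index"})
print(settings.neo4j_uri, settings.openai_model, settings.index_dir)
```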

Development

Running Tests

pytest

Run a single test:

pytest tests/path_to_test.py::test_function_name

Run the indexer tests:

pytest tests/test_indexer.py

Code Style

flake8
pyright

License

MIT

Release history

0.0.2 (this version): Apr 6, 2025
0.0.1: Feb 19, 2025

Download files

Source Distribution

graphrag_doc-0.0.2.tar.gz (480.6 kB), uploaded Apr 6, 2025, Source

Built Distributions

graphrag_doc-0.0.2-py3-none-any.whl (15.7 kB), uploaded Apr 6, 2025, Python 3
graphrag_doc-0.0.2-py2.py3-none-any.whl (22.0 kB), uploaded Apr 6, 2025, Python 2, Python 3

File details

Details for the file graphrag_doc-0.0.2.tar.gz.

File metadata

Download URL: graphrag_doc-0.0.2.tar.gz
Upload date: Apr 6, 2025
Size: 480.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.4

File hashes

Hashes for graphrag_doc-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`30376436384dfddc52583de73187cf97bad7c725996b5221d08d57f695530502`
MD5	`f476b96a16b7a704cefa9811d674c460`
BLAKE2b-256	`467cff8e1aefdf26f597da9a6c3dc2a52d6a2bff1bd9c8673aedf04f45e42132`

File details

Details for the file graphrag_doc-0.0.2-py3-none-any.whl.

File metadata

Download URL: graphrag_doc-0.0.2-py3-none-any.whl
Upload date: Apr 6, 2025
Size: 15.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.4

File hashes

Hashes for graphrag_doc-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9af16cb54e9d93a6dd07b0dd10630d764a69124aa5e703d6d3c9bf6d2217ec30`
MD5	`6aaeec7a010709374354a059165a32f0`
BLAKE2b-256	`d4082e5025361542de98977032383bc9de8c63367730941f96125e8083d23635`

File details

Details for the file graphrag_doc-0.0.2-py2.py3-none-any.whl.

File metadata

Download URL: graphrag_doc-0.0.2-py2.py3-none-any.whl
Upload date: Apr 6, 2025
Size: 22.0 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.4

File hashes

Hashes for graphrag_doc-0.0.2-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`1bedd3fd11118900417a709a4372b4773ea89992e51317b060bf5376b3ce32d6`
MD5	`6b7e02c9a112e34c44fee283b4973858`
BLAKE2b-256	`69ac550238a90d3d8aa2b50401a379c8297a8fdd254e094af40167284a870317`
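A downloaded file can be checked against the SHA256 digests listed above using Python's standard `hashlib` module. The sketch below is a generic verification routine. The helper names `sha256_of` and `matches_published_hash` are chosen for illustration and are not part of this package.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with a published hex digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

For example, `matches_published_hash("graphrag_doc-0.0.2.tar.gz", "30376436384dfddc52583de73187cf97bad7c725996b5221d08d57f695530502")` should return `True` for an intact copy of the source distribution, assuming the file sits in the current directory.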
