A Python SDK for graph document processing

GraphDoc - Knowledge Graph Enhanced Document Processing

GraphDoc (GraphRAG) is a document analysis and retrieval system that enhances traditional Retrieval-Augmented Generation (RAG) with knowledge graph capabilities. It's designed to process, index, and query complex document collections by combining vector-based retrieval with graph relationship context.

Features

  • Document Processing: Extract structured text from PDFs, text files, and other document formats
  • Knowledge Graph Construction: Automatically extract entities and relationships
  • Timeline Analysis: Create chronological sequences of events from documents
  • Graph-based Retrieval: Enhance document retrieval with graph relationships
  • Batch Processing: Process large document collections efficiently

Installation

From the root of a source checkout:

pip install -e .

For development:

pip install -e ".[test]"

Quick Start

Python API

from graphrag_doc.index.batch_indexer import GraphRAGIndexer

# Initialize the indexer
indexer = GraphRAGIndexer(working_dir="graphrag_index")

# Process documents
results = indexer.index_documents(
    folder_path="path/to/documents",
    file_extensions=[".txt", ".pdf", ".docx", ".md"],
    recursive=True,
    extract_metadata=True
)

# Print indexing results
for result in results:
    print(f"{result.filename}: {result.status} - {result.message}")

# Access document metadata
for result in results:
    if result.status == "Success" and result.metadata:
        print(f"File: {result.filename}")
        print(f"Size: {result.metadata.file_size_bytes} bytes")
        print(f"Word count: {result.metadata.word_count}")
        
        # For PDFs, additional metadata may be available
        if result.metadata.num_pages:
            print(f"Pages: {result.metadata.num_pages}")

# Use the indexed documents with the LightRAG query API
from lightrag import QueryParam
response = indexer.rag.query(
    "When did the event occur?",
    param=QueryParam(mode="mix")
)
print(response)
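
For larger collections it can help to condense the per-file results into a short summary. The sketch below is not part of GraphDoc: FakeResult is a hypothetical stand-in that mirrors only the attributes used above (filename, status, message, metadata), and any status value other than "Success" is an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in exposing the attributes used in the Quick Start."""
    filename: str
    status: str
    message: str = ""
    metadata: Optional[Any] = None


def summarize(results: Iterable[Any]) -> Dict[str, Any]:
    """Count results per status and list the files that did not succeed."""
    results = list(results)
    counts = Counter(r.status for r in results)
    # "Success" is the status string checked in the Quick Start example.
    failed = [r.filename for r in results if r.status != "Success"]
    return {"counts": dict(counts), "failed": failed}


# Demo data; the "Failed" status and its message are made up for illustration.
demo_results = [
    FakeResult("report.pdf", "Success"),
    FakeResult("notes.txt", "Success"),
    FakeResult("broken.pdf", "Failed", "could not read file"),
]
print(summarize(demo_results))
```

The same function should work on the list returned by index_documents, provided the result objects expose filename and status as shown above.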

Advanced Indexing Options

# Process specific file types
indexer.index_documents(
    folder_path="path/to/documents",
    file_extensions=[".pdf"],  # Only process PDF files
    recursive=True
)

# Non-recursive search (only files in the specified folder, not subfolders)
indexer.index_documents(
    folder_path="path/to/documents",
    recursive=False
)

# Skip metadata extraction for faster processing
indexer.index_documents(
    folder_path="path/to/documents",
    extract_metadata=False
)

# Customize batch size during initialization
indexer = GraphRAGIndexer(
    working_dir="graphrag_index",
    batch_size=50  # Process 50 documents per batch
)
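
To make the options above concrete, here is a minimal standard-library sketch of what extension filtering, recursive versus non-recursive search, and batching could look like. It is an illustration only, not GraphRAGIndexer's actual implementation; find_documents and batched are hypothetical helper names.

```python
from pathlib import Path
from typing import Iterator, List, Sequence


def find_documents(folder: str, extensions: Sequence[str],
                   recursive: bool = True) -> List[Path]:
    """Return files under `folder` whose suffix is in `extensions`."""
    wanted = {ext.lower() for ext in extensions}
    root = Path(folder)
    # rglob descends into subfolders; glob only looks at the folder itself.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in candidates
                  if p.is_file() and p.suffix.lower() in wanted)


def batched(items: Sequence[Path], batch_size: int) -> Iterator[List[Path]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

With batch_size=50, a collection of 120 matching files would be split into batches of 50, 50, and 20.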

Command Line Interface

Index documents using the CLI:

# Basic usage
python -m graphrag_doc.sdk.cli index path/to/documents

# Specify working directory and output file
python -m graphrag_doc.sdk.cli index path/to/documents --working-dir=my_index --output=results.json

# Specify file types
python -m graphrag_doc.sdk.cli index path/to/documents --file-types=".txt,.pdf,.docx,.md"

# Set batch size
python -m graphrag_doc.sdk.cli index path/to/documents --batch-size=50
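
When the CLI needs to be driven from another Python program, the commands above can be assembled with subprocess. The sketch below uses only the flags shown in this section; build_index_command and run_index are hypothetical helper names, and how the CLI reports errors through its exit code is an assumption.

```python
import subprocess
import sys
from typing import List, Optional, Sequence


def build_index_command(
    documents: str,
    working_dir: Optional[str] = None,
    output: Optional[str] = None,
    file_types: Optional[Sequence[str]] = None,
    batch_size: Optional[int] = None,
) -> List[str]:
    """Build the argument list for `python -m graphrag_doc.sdk.cli index`."""
    cmd = [sys.executable, "-m", "graphrag_doc.sdk.cli", "index", documents]
    if working_dir is not None:
        cmd.append(f"--working-dir={working_dir}")
    if output is not None:
        cmd.append(f"--output={output}")
    if file_types:
        # The CLI example passes extensions as one comma-separated value.
        cmd.append("--file-types=" + ",".join(file_types))
    if batch_size is not None:
        cmd.append(f"--batch-size={batch_size}")
    return cmd


def run_index(documents: str, **options) -> int:
    """Run the indexing command and return its exit code."""
    completed = subprocess.run(build_index_command(documents, **options),
                               check=False)
    return completed.returncode
```

Passing the arguments as a list avoids shell quoting issues, which is why the extensions do not need the surrounding quotes used on the command line.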

Sample Script

The project includes a sample script for batch document processing:

python -m src.scripts.index_sample_documents path/to/documents --working-dir=my_index --output=results.json

Supported File Formats

  • Text files (.txt)
  • PDF documents (.pdf) - requires PyPDF2
  • Word documents (.docx) - requires python-docx
  • Markdown files (.md)
  • Other formats via textract (optional)
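
Because PDF and Word support depend on extra packages, it can be useful to check that they are importable before starting a long indexing run. The following is a small sketch, not a GraphDoc feature: missing_parsers is a hypothetical helper, and the table maps each extension to the module name the package installs (PyPDF2 for PyPDF2, docx for python-docx).

```python
from importlib.util import find_spec
from typing import Dict, Iterable, Mapping

# File suffix -> importable module name of the parser package listed above.
OPTIONAL_PARSERS: Dict[str, str] = {
    ".pdf": "PyPDF2",
    ".docx": "docx",  # python-docx is imported as "docx"
}


def missing_parsers(
    extensions: Iterable[str],
    parsers: Mapping[str, str] = OPTIONAL_PARSERS,
) -> Dict[str, str]:
    """Return {extension: module} for parser modules that are not installed."""
    missing = {}
    for ext in extensions:
        module = parsers.get(ext.lower())
        # find_spec returns None when a top-level module cannot be found.
        if module is not None and find_spec(module) is None:
            missing[ext.lower()] = module
    return missing


print(missing_parsers([".txt", ".pdf", ".docx", ".md"]))
```

An empty dictionary means every requested format that needs an optional package has one available in the current environment.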

Configuration

GraphDoc uses environment variables for configuration:

  • GRAPHDOC_NEO4J_URI: Neo4j database URI (default: bolt://localhost:7687)
  • GRAPHDOC_NEO4J_USERNAME: Neo4j username
  • GRAPHDOC_NEO4J_PASSWORD: Neo4j password
  • OPENAI_API_KEY: OpenAI API key
  • GRAPHDOC_OPENAI_MODEL: OpenAI model to use (default: gpt-4o-mini)
  • GRAPHDOC_INDEX_DIR: Path to store indexes (default: graphrag_index)
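
As an illustration of how these variables and their documented defaults fit together, the sketch below reads them into a small settings object. It is not GraphDoc's own configuration code; GraphDocSettings and load_settings are hypothetical names, and treating variables without a documented default as optional (None) is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GraphDocSettings:
    neo4j_uri: str
    neo4j_username: Optional[str]
    neo4j_password: Optional[str]
    openai_api_key: Optional[str]
    openai_model: str
    index_dir: str


def load_settings(env: Mapping[str, str] = os.environ) -> GraphDocSettings:
    """Read the variables listed above, applying the documented defaults."""
    return GraphDocSettings(
        neo4j_uri=env.get("GRAPHDOC_NEO4J_URI", "bolt://localhost:7687"),
        neo4j_username=env.get("GRAPHDOC_NEO4J_USERNAME"),
        neo4j_password=env.get("GRAPHDOC_NEO4J_PASSWORD"),
        openai_api_key=env.get("OPENAI_API_KEY"),
        openai_model=env.get("GRAPHDOC_OPENAI_MODEL", "gpt-4o-mini"),
        index_dir=env.get("GRAPHDOC_INDEX_DIR", "graphrag_index"),
    )


settings = load_settings()
print(settings.neo4j_uri, settings.openai_model, settings.index_dir)
```

Accepting the environment as a parameter keeps the function easy to test with a plain dictionary instead of the real process environment.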

Development

Running Tests

pytest

Run a single test:

pytest tests/path_to_test.py::test_function_name

Run the indexer tests:

pytest tests/test_indexer.py

Code Style

flake8
pyright

License

MIT
