Grafa
Knowledge Graph Generation Library
Documentation: https://codingmaster8.github.io/grafa/
Source Code: https://github.com/codingmaster8/grafa
What is Grafa?
Grafa is a comprehensive Python library for building, managing, and querying knowledge graphs. It provides an end-to-end solution for:
- Document Ingestion: Upload and process documents (text files, PDFs, etc.)
- Intelligent Chunking: Break documents into meaningful chunks using agentic chunking strategies
- Entity Extraction: Automatically extract entities and relationships from text using LLMs
- Knowledge Graph Construction: Build structured knowledge graphs in Neo4j
- Smart Search: Perform semantic, text-based, and hybrid searches across your knowledge base
- Deduplication: Automatically merge similar entities to maintain graph quality
Key Features
🚀 Easy Setup
- Schema-driven approach using YAML configuration
- Automatic Neo4j index creation (vector and text indexes)
- Built-in support for AWS S3 storage and local file storage
🧠 AI-Powered Processing
- LLM-based entity and relationship extraction
- Semantic similarity search using embeddings
- Intelligent entity deduplication and merging
🔍 Advanced Search Capabilities
- Semantic Search: Vector-based similarity search
- Text Search: Full-text search with fuzzy matching
- Hybrid Search: Combines semantic and text approaches
- Name Matching: Edit distance-based name matching
📊 Flexible Node Types
- Built-in node types: Documents, Chunks, Document History
- Custom node types defined via YAML schema
- Support for metadata, embeddings, and relationships
Installation
pip install grafa
Quick Start
1. Define Your Schema
Create a YAML file (schema.yaml) to define your knowledge graph structure:
database:
  name: "Business Concepts"
  description: "A knowledge graph for business concepts"
node_types:
  Person:
    description: "A person"
    fields:
      occupation:
        type: STRING
        description: "Occupation of the person"
    options:
      link_to_chunk: false
      embed: false
  Company:
    description: "A company"
    fields:
      description:
        type: STRING
        description: "Description of the company"
    options:
      link_to_chunk: false
      embed: false
  Concept:
    description: "A business concept"
    fields:
      description:
        type: STRING
        description: "Description of the concept"
    options:
      link_to_chunk: true
      semantic_search: true
      text_search: true
relationships:
  - from_type: Person
    to_type: Company
    relationship_type: WORKS_AT
    description: "A person works at a company"
  - from_type: Company
    to_type: Concept
    relationship_type: IS_RELATED_TO
    description: "A company is related to a concept"
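A relationship is only useful if both of its endpoints name a node type declared under node_types. The following is a minimal sketch of that consistency check, assuming the schema has already been parsed into a plain dictionary. `find_dangling_relationships` is a hypothetical helper written for illustration and is not part of Grafa's API; Grafa may perform its own validation differently.

```python
from typing import Any, Dict, List


def find_dangling_relationships(schema: Dict[str, Any]) -> List[str]:
    """Return one message per relationship endpoint that is not a declared node type."""
    declared = set(schema.get("node_types", {}))
    problems = []
    for rel in schema.get("relationships", []):
        for key in ("from_type", "to_type"):
            if rel.get(key) not in declared:
                problems.append(
                    f"{rel.get('relationship_type')}: {key}={rel.get(key)!r} "
                    "is not a declared node type"
                )
    return problems


# The Quick Start schema, reduced to the parts this check needs.
schema = {
    "node_types": {"Person": {}, "Company": {}, "Concept": {}},
    "relationships": [
        {"from_type": "Person", "to_type": "Company", "relationship_type": "WORKS_AT"},
        {"from_type": "Company", "to_type": "Concept", "relationship_type": "IS_RELATED_TO"},
    ],
}

print(find_dangling_relationships(schema))
```

Running a check like this before creating the client can surface typos in node type names early.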
2. Initialize the Client
from grafa import GrafaClient

# Create client from YAML schema
client = await GrafaClient.from_yaml(
    yaml_path="schema.yaml",
    db_name="my_knowledge_base"
)

# Or connect to existing database
client = await GrafaClient.create(db_name="existing_db")
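The snippets in this guide use `await` at the top level, which works in a Jupyter notebook or an asyncio REPL but not in a plain Python script. A minimal sketch of the usual wrapper is shown here. `fake_create_client` is a hypothetical stand-in so the example runs without Neo4j; in real code you would call `GrafaClient.from_yaml` or `GrafaClient.create` in its place.

```python
import asyncio


async def fake_create_client(db_name: str) -> dict:
    """Hypothetical stand-in for an async factory such as GrafaClient.create."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return {"db_name": db_name}


async def main() -> dict:
    # Inside a coroutine, `await` is valid in any Python script.
    client = await fake_create_client(db_name="existing_db")
    return client


if __name__ == "__main__":
    # asyncio.run starts an event loop, runs main() to completion, and closes the loop.
    print(asyncio.run(main()))
```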
3. Ingest Documents
# Upload and process a document
document, chunks, entities, relationships = await client.ingest_file(
    document_name="business_guide",
    document_path="path/to/document.txt",
    context="Business processes and concepts",
    author="John Doe",
    max_token_chunk_size=500,
    deduplication_similarity_threshold=0.6
)

print(f"Created {len(chunks)} chunks")
print(f"Extracted {sum(len(e) for e in entities)} entities")
4. Search Your Knowledge Base
# Semantic search
results = await client.similarity_search(
    query="What is revenue management?",
    node_types=["Concept"],
    search_mode="semantic",
    limit=10
)

# Hybrid search (semantic + text)
results = await client.similarity_search(
    query="company revenue strategies",
    search_mode="hybrid",
    semantic_threshold=0.7,
    text_threshold=0.5
)

# Knowledge base query (returns formatted context)
answer = await client.knowledgebase_query(
    query="How do we measure promotional effectiveness?",
    max_hops=2,
    return_formatted=True
)
print(answer)
Configuration
Environment Variables
Set these environment variables for database and storage configuration:
# Neo4j Configuration
export GRAFA_URI="neo4j+s://your-database.neo4j.io"
export GRAFA_USERNAME="neo4j"
export GRAFA_PASSWORD="your-password"
# Storage Configuration (choose one)
export GRAFA_S3_BUCKET="your-s3-bucket" # For S3 storage
export GRAFA_LOCAL_STORAGE_PATH="/local/path" # For local storage
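Before starting a long ingestion run, it can help to confirm that these variables are actually set. The sketch below reads them with the standard library only. `read_grafa_env` is a hypothetical helper, not a Grafa function, and preferring S3 when both storage variables are set is an assumption made for the example.

```python
import os
from typing import Dict, Mapping


def read_grafa_env(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the variables listed above and report which storage backend is configured."""
    required = ("GRAFA_URI", "GRAFA_USERNAME", "GRAFA_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")

    settings = {name: env[name] for name in required}
    # Assumption for this sketch: S3 wins if both storage variables are present.
    if env.get("GRAFA_S3_BUCKET"):
        settings["storage"] = "s3"
    elif env.get("GRAFA_LOCAL_STORAGE_PATH"):
        settings["storage"] = "local"
    else:
        raise RuntimeError("Set GRAFA_S3_BUCKET or GRAFA_LOCAL_STORAGE_PATH")
    return settings


# Placeholder values copied from the export lines above.
example_env = {
    "GRAFA_URI": "neo4j+s://your-database.neo4j.io",
    "GRAFA_USERNAME": "neo4j",
    "GRAFA_PASSWORD": "your-password",
    "GRAFA_LOCAL_STORAGE_PATH": "/local/path",
}
print(read_grafa_env(example_env)["storage"])
```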
Custom Configuration
from grafa import GrafaClient, GrafaConfig
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

# Create custom configuration
config = await GrafaConfig.create(
    embedding_model=OpenAIEmbeddings(model="text-embedding-3-small"),
    embedding_dimension=1536,
    llm=ChatOpenAI(model="gpt-4"),
    s3_bucket="my-documents-bucket"
)

client = await GrafaClient.create(
    db_name="my_db",
    grafa_config=config
)
Schema Definition
Node Types
Define custom node types with fields and options:
node_types:
  Product:
    description: "A product in our catalog"
    fields:
      price:
        type: FLOAT
        description: "Product price"
      category:
        type: STRING
        description: "Product category"
      features:
        type: LIST
        description: "List of product features"
    options:
      link_to_chunk: true # Link to source chunks
      semantic_search: true # Enable vector search
      text_search: true # Enable full-text search
      unique_name: true # Enforce unique names
Field Types
- STRING: Text fields
- INTEGER: Numeric integers
- FLOAT: Numeric floats
- BOOLEAN: True/false values
- LIST: Arrays of values
- DATETIME: Date and time values
Node Options
- link_to_chunk: Whether nodes link back to source chunks
- semantic_search: Enable vector-based semantic search
- text_search: Enable full-text search indexing
- unique_name: Enforce unique names for this node type
- embed: Whether to generate embeddings for this node type
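The field types and node options listed here can be checked mechanically before a schema is loaded. The sketch below treats both lists as exhaustive, which is an assumption, and works on an already-parsed dictionary. `check_node_type` is a hypothetical helper and does not reflect how Grafa validates schemas internally.

```python
from typing import Any, Dict, List

# Names taken from the Field Types and Node Options lists in this document.
FIELD_TYPES = {"STRING", "INTEGER", "FLOAT", "BOOLEAN", "LIST", "DATETIME"}
NODE_OPTIONS = {"link_to_chunk", "semantic_search", "text_search", "unique_name", "embed"}


def check_node_type(name: str, spec: Dict[str, Any]) -> List[str]:
    """Return a message for every unknown field type or option in one node type."""
    errors = []
    for field_name, field_spec in spec.get("fields", {}).items():
        if field_spec.get("type") not in FIELD_TYPES:
            errors.append(f"{name}.{field_name}: unknown field type {field_spec.get('type')!r}")
    for option in spec.get("options", {}):
        if option not in NODE_OPTIONS:
            errors.append(f"{name}: unknown option {option!r}")
    return errors


# The Product node type from the example above, as a parsed dictionary.
product = {
    "fields": {
        "price": {"type": "FLOAT"},
        "category": {"type": "STRING"},
        "features": {"type": "LIST"},
    },
    "options": {
        "link_to_chunk": True,
        "semantic_search": True,
        "text_search": True,
        "unique_name": True,
    },
}
print(check_node_type("Product", product))
```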
Advanced Features
Entity Deduplication
Grafa automatically deduplicates similar entities during ingestion:
# Configure deduplication thresholds
await client.ingest_file(
    document_name="document.txt",
    deduplication_similarity_threshold=0.8,  # Semantic similarity
    deduplication_text_threshold=0.6,        # Text similarity
    deduplication_word_edit_distance=3       # Name edit distance
)
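To make the name edit distance threshold concrete, here is a minimal sketch of Levenshtein distance, the most common edit distance, applied to two entity names. Whether Grafa uses exactly this metric, or applies it per word or to the whole name, is an assumption; `edit_distance` and `names_match` are illustrative helpers, not library functions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling row of the dynamic-programming table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def names_match(a: str, b: str, max_distance: int = 3) -> bool:
    """Treat two names as duplicates when their case-insensitive distance is small enough."""
    return edit_distance(a.lower(), b.lower()) <= max_distance


print(names_match("Acme Corp", "ACME Corp."))  # differs by one trailing character
print(names_match("Acme", "Globex"))
```

A larger threshold merges more aggressively, so short names are more likely to be merged by mistake than long ones at the same setting.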
Custom Chunking
Use different chunking strategies:
from grafa.document.chunking import agentic_chunking

# Create document first
document = await client.upload_file(
    document_name="guide.txt",
    document_path="path/to/guide.txt"
)

# Custom chunking with specific parameters
chunks = await client.chunk_document(
    document,
    max_token_chunk_size=800,
    verbose=True,
    output_language="en"
)
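To illustrate what a maximum chunk size does, the sketch below uses a deliberately naive technique: greedy fixed-size splitting with whitespace-separated words standing in for model tokens. This is not Grafa's agentic chunking, which chooses boundaries by meaning; `chunk_by_token_budget` is a hypothetical helper that only shows how a size cap bounds each chunk.

```python
from typing import List


def chunk_by_token_budget(text: str, max_tokens: int) -> List[str]:
    """Greedy chunker: split on whitespace and emit groups of at most max_tokens words."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


chunks = chunk_by_token_budget("one two three four five six seven", max_tokens=3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```

A fixed-size splitter like this can cut a sentence in half, which is the problem that meaning-aware chunking strategies aim to avoid.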
Search Modes
Different search strategies for different use cases:
# Pure semantic search (vector embeddings)
semantic_results = await client.similarity_search(
    query="machine learning algorithms",
    search_mode="semantic",
    semantic_threshold=0.75
)

# Pure text search (full-text index)
text_results = await client.similarity_search(
    query="revenue management strategies",
    search_mode="text",
    text_threshold=0.6
)

# Hybrid search (combines both)
hybrid_results = await client.similarity_search(
    query="customer segmentation",
    search_mode="hybrid",
    semantic_threshold=0.7,
    text_threshold=0.5
)

# Automatic mode (uses available indexes)
auto_results = await client.similarity_search(
    query="business metrics",
    search_mode="allowed"  # Default
)
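The two thresholds can be pictured as independent filters over two score lists. The sketch below computes cosine similarity for the semantic side and keeps any node that passes either threshold. Combining the two sides as a union is an assumption made for illustration; the source does not say how Grafa merges hybrid results, and all names, vectors, and scores here are invented toy data.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_filter(
    semantic_scores: Dict[str, float],
    text_scores: Dict[str, float],
    semantic_threshold: float,
    text_threshold: float,
) -> List[str]:
    """Keep names that pass either threshold (union of both filters; an assumption)."""
    kept = {name for name, score in semantic_scores.items() if score >= semantic_threshold}
    kept |= {name for name, score in text_scores.items() if score >= text_threshold}
    return sorted(kept)


# Toy two-dimensional embeddings and made-up text scores.
query_vector = [1.0, 0.0]
node_vectors = {
    "Revenue Management": [1.0, 0.0],
    "Customer Segmentation": [0.0, 1.0],
    "Pricing": [1.0, 2.0],
}
semantic_scores = {name: cosine_similarity(query_vector, vec) for name, vec in node_vectors.items()}
text_scores = {"Revenue Management": 0.2, "Customer Segmentation": 0.6, "Pricing": 0.1}

print(hybrid_filter(semantic_scores, text_scores, semantic_threshold=0.7, text_threshold=0.5))
```

Here "Revenue Management" passes on the semantic side, "Customer Segmentation" passes on the text side, and "Pricing" passes neither.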
Examples
The examples/ directory contains comprehensive examples:
- client.ipynb: Basic client usage
- graphrag.ipynb: Complete GraphRAG implementation
- search.ipynb: Advanced search examples
- chunking.ipynb: Document chunking strategies
- database_info.ipynb: Database schema exploration
Core Components
GrafaClient
The main interface for all operations (grafa/client.py):
- Document ingestion and processing
- Entity extraction and relationship building
- Search and retrieval operations
- Database management
Node Types
Built-in node types (grafa/models.py):
- GrafaDocument: Represents uploaded documents
- GrafaChunk: Document chunks with content and metadata
- GrafaDocumentHistory: Version history for documents
- GrafaDatabase: Database schema and configuration
Dynamic Models
Custom node types generated from YAML (grafa/dynamic_models.py):
- Runtime model creation from schema
- Automatic relationship validation
- Field type mapping and validation
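Runtime model creation can be pictured as mapping each schema field type to a Python type and building a class from the result. The sketch below does this with the standard library's `dataclasses.make_dataclass`. It is an illustration under stated assumptions: the type mapping, the mandatory `name` field, and the `build_model` helper are all hypothetical, and Grafa's own implementation in grafa/dynamic_models.py may differ substantially.

```python
from dataclasses import field, make_dataclass
from datetime import datetime
from typing import Any, Dict, Optional

# Assumed mapping from the documented schema field types to Python types.
TYPE_MAP = {
    "STRING": str,
    "INTEGER": int,
    "FLOAT": float,
    "BOOLEAN": bool,
    "LIST": list,
    "DATETIME": datetime,
}


def build_model(name: str, spec: Dict[str, Any]) -> type:
    """Create a dataclass with a required name plus one optional attribute per schema field."""
    model_fields = [("name", str)]  # assumption: every node carries a name
    for field_name, field_spec in spec.get("fields", {}).items():
        python_type = TYPE_MAP[field_spec["type"]]  # raises KeyError for unknown types
        model_fields.append((field_name, Optional[python_type], field(default=None)))
    return make_dataclass(name, model_fields)


Product = build_model(
    "Product",
    {"fields": {"price": {"type": "FLOAT"}, "category": {"type": "STRING"}}},
)
item = Product(name="Widget", price=9.99)
print(item)
```

Note that dataclasses record type annotations but do not enforce them at runtime, so real field validation would need an extra step or a validation library.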
Development
Setup environment
We use Hatch to manage the development environment and production build. Ensure it's installed on your system.
Run unit tests
You can run all the tests with:
hatch run test
Format the code
Execute the following command to apply linting and check typing:
hatch run lint
Publish a new version
Bump the version, create a commit, and create the associated tag with one of the following commands, depending on the release type:
hatch version patch
hatch version minor
hatch version major
Your default Git text editor will open so you can add information about the release.
When you push the tag to GitHub, the workflow automatically publishes the package to PyPI and creates a draft GitHub release.
Serve the documentation
You can serve the MkDocs documentation with:
hatch run docs-serve
It'll automatically watch for changes in your code.
Requirements
- Python 3.8+
- Neo4j database (local or cloud)
- OpenAI API key (for embeddings and LLM operations)
- AWS credentials (if using S3 storage)
License
This project is licensed under the terms of the MIT License.