Fast path finding in large knowledge graphs
GANDALF
Graph Analysis Navigator for Discovery And Link Finding
Features
- Compressed Sparse Row (CSR) graph representation for memory efficiency
- Bidirectional search for optimal performance
- O(1) property lookups via hash indexing
- Predicate filtering to reduce path explosion
- Batch property enrichment for fast results
- Diagnostic tools to understand path counts
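To make the first and third features concrete, here is a minimal sketch of a CSR layout with a hash index from identifiers to integer positions. It is an illustration only and not GANDALF's actual internals: `build_csr`, `neighbors`, the placeholder node IDs, and the integer predicate codes are all assumptions made for this example.

```python
def build_csr(num_nodes, edges):
    """Build CSR arrays from (src, predicate_id, dst) triples over integer node positions."""
    # Count outgoing edges per node.
    counts = [0] * num_nodes
    for src, _, _ in edges:
        counts[src] += 1
    # offsets[i]:offsets[i + 1] delimits node i's slice of the flat arrays.
    offsets = [0] * (num_nodes + 1)
    for i in range(num_nodes):
        offsets[i + 1] = offsets[i] + counts[i]
    targets = [0] * len(edges)
    predicates = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot for each node
    for src, pred, dst in edges:
        pos = cursor[src]
        targets[pos] = dst
        predicates[pos] = pred
        cursor[src] += 1
    return offsets, targets, predicates


def neighbors(offsets, targets, predicates, node):
    """Return (target, predicate_id) pairs for one node by slicing the flat arrays."""
    start, end = offsets[node], offsets[node + 1]
    return list(zip(targets[start:end], predicates[start:end]))


# Hash index: identifier -> dense integer position (placeholder IDs, not real CURIEs).
node_index = {"example:drug": 0, "example:gene_a": 1, "example:gene_b": 2, "example:disease": 3}

# (src, predicate_id, dst) triples; predicate IDs 0 and 1 are arbitrary codes.
edges = [(0, 0, 1), (0, 0, 2), (1, 1, 3), (2, 1, 3)]
offsets, targets, predicates = build_csr(len(node_index), edges)

print(neighbors(offsets, targets, predicates, node_index["example:drug"]))
```

All adjacency data lives in three flat arrays instead of one object per edge, which is where the memory savings of CSR come from. The dictionary lookup stays constant-time on average regardless of graph size.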
Installation
Recommended: Use a virtual environment
Some transitive dependencies (e.g., stringcase, pytest-logging) require a modern pip/setuptools to build correctly. Creating a virtual environment and upgrading the build tools inside it, as shown below, avoids these build failures without touching your system Python.
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Upgrade pip and setuptools (important for building dependencies)
pip install --upgrade pip setuptools wheel
# Install the package
pip install -e .
Alternative: Direct install (may fail on some systems)
If you have a recent pip/setuptools already, you can try:
pip install -e .
Quick Start
Unpacking a full Translator KGX archive
tar -xvf translator_kg.tar.zst
This will output two files: nodes.jsonl and edges.jsonl.
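Before building a graph it can help to look at a few records from each file. The helper below is a small stdlib-only sketch; `peek_jsonl` is a hypothetical name introduced for this example and is not part of GANDALF.

```python
import json
from itertools import islice


def peek_jsonl(path, n=3):
    """Return the first n records of a JSON Lines file without reading the rest."""
    with open(path, encoding="utf-8") as handle:
        # Skip blank lines so a trailing newline does not break parsing.
        lines = (line for line in handle if line.strip())
        return [json.loads(line) for line in islice(lines, n)]
```

For example, `peek_jsonl("nodes.jsonl")` returns the first three node records as dictionaries, so you can check which keys are present before running a full build.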
Build a graph from JSONL
from gandalf import build_graph_from_jsonl
# Build the graph, excluding ontology hierarchy edges
graph = build_graph_from_jsonl(
    edges_path="data/raw/edges.jsonl",
    nodes_path="data/raw/nodes.jsonl",
    excluded_predicates={'biolink:subclass_of'}
)
# Save for fast loading
graph.save("data/processed/graph_filtered.pkl")
Query paths
from gandalf import CSRGraph, find_paths
# Load graph (takes ~1-2 seconds)
graph = CSRGraph.load("data/processed/graph_filtered.pkl")
# Find paths
paths = find_paths(
    graph,
    start_id="CHEBI:45783",
    end_id="MONDO:0004979"
)
print(f"Found {len(paths)} paths")
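How `find_paths` works internally is not documented here. As a rough illustration of the bidirectional idea listed under Features, the sketch below enumerates simple paths of a fixed hop count by expanding from both ends and joining on the middle node. The function names, the `hops` parameter, and the toy node names are assumptions made for this example, not GANDALF's API.

```python
from collections import defaultdict


def expand(adj, origin, depth):
    """Return all simple walks of exactly `depth` edges from origin, as node tuples."""
    walks = [(origin,)]
    for _ in range(depth):
        walks = [w + (n,) for w in walks for n in adj.get(w[-1], ()) if n not in w]
    return walks


def find_paths_bidirectional(edges, start, end, hops):
    """Enumerate simple directed paths with exactly `hops` edges, meeting in the middle."""
    fwd, bwd = defaultdict(list), defaultdict(list)
    for s, t in edges:
        fwd[s].append(t)
        bwd[t].append(s)
    half = hops // 2
    forward = expand(fwd, start, hops - half)  # walks start -> middle
    backward = expand(bwd, end, half)          # walks end -> middle, over reversed edges
    by_middle = defaultdict(list)
    for w in backward:
        by_middle[w[-1]].append(w)
    paths = []
    for f in forward:
        for b in by_middle.get(f[-1], ()):
            tail = b[-2::-1]  # reverse the backward walk and drop the shared middle node
            if set(f).isdisjoint(tail):  # keep the joined path simple
                paths.append(f + tail)
    return paths


# Toy directed graph with placeholder node names.
toy_edges = [
    ("drug", "gene_a"), ("drug", "gene_b"),
    ("gene_a", "disease"), ("gene_b", "disease"),
    ("gene_a", "gene_b"),
]
two_hop = find_paths_bidirectional(toy_edges, "drug", "disease", hops=2)
three_hop = find_paths_bidirectional(toy_edges, "drug", "disease", hops=3)
print(len(two_hop), len(three_hop))
```

Each side only explores about half the hop count, so on high-degree nodes the two frontiers stay far smaller than a single search of the full depth from one end would.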
Filter by predicates
from gandalf import find_paths_filtered
# Only mechanistic relationships
paths = find_paths_filtered(
    graph,
    start_id="CHEBI:45783",
    end_id="MONDO:0004979",
    allowed_predicates={
        'biolink:treats',
        'biolink:affects',
        'biolink:has_metabolite'
    }
)
Architecture
The package uses a three-stage pipeline:
- Topology Search (fast) - Find all paths using indices only
- Filtering (medium) - Apply filtering rules, loading only the node or edge properties those rules need
- Enrichment (batch) - Load all properties for final paths only
This separation allows filtering millions of paths before expensive property lookups.
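The three stages can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not the package's implementation: the function names, the 2-hop limit, the placeholder node names, and the property values are all made up for this example.

```python
# Toy data: (subject, predicate, object) triples with placeholder node names.
EDGES = [
    ("drug", "biolink:affects", "gene_a"),
    ("drug", "biolink:related_to", "gene_b"),
    ("gene_a", "biolink:affects", "disease"),
    ("gene_b", "biolink:affects", "disease"),
]
NODE_PROPS = {
    "drug": {"name": "Example drug"},
    "gene_a": {"name": "Example gene A"},
    "gene_b": {"name": "Example gene B"},
    "disease": {"name": "Example disease"},
}


def topology_search(edges, start, end):
    """Stage 1: enumerate 2-hop paths as tuples of edge indices, touching no properties."""
    outgoing = {}
    for i, (subject, _, _) in enumerate(edges):
        outgoing.setdefault(subject, []).append(i)
    return [
        (i, j)
        for i in outgoing.get(start, [])
        for j in outgoing.get(edges[i][2], [])
        if edges[j][2] == end
    ]


def filter_paths(edges, paths, allowed_predicates):
    """Stage 2: keep paths whose every edge uses an allowed predicate."""
    return [p for p in paths if all(edges[i][1] in allowed_predicates for i in p)]


def enrich(edges, node_props, paths):
    """Stage 3: attach node properties, but only for paths that survived filtering."""
    results = []
    for p in paths:
        node_ids = [edges[p[0]][0]] + [edges[i][2] for i in p]
        results.append({
            "nodes": [{"id": n, **node_props[n]} for n in node_ids],
            "predicates": [edges[i][1] for i in p],
        })
    return results


candidates = topology_search(EDGES, "drug", "disease")
kept = filter_paths(EDGES, candidates, {"biolink:affects"})
results = enrich(EDGES, NODE_PROPS, kept)
print(len(candidates), len(kept), results[0]["nodes"][1]["name"])
```

Stage 1 works on edge indices only, so candidate paths stay cheap to hold in memory. The property dictionary is consulted in stage 3 alone, and only for the paths that passed the filter.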
Download files
File details
Details for the file gandalf_csr-0.1.2.tar.gz.
File metadata
- Download URL: gandalf_csr-0.1.2.tar.gz
- Upload date:
- Size: 75.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 590c79ef4f49e96a7f04ab608d64155dd3508d3756893831ca21a1bf9ea7f589 |
| MD5 | c32cd9f15c6b1e21403099969e16c14d |
| BLAKE2b-256 | 72ab0125c1f1c3a3329817ab11afd8e0c28d547f01627f2a94872e1c64d838cc |
File details
Details for the file gandalf_csr-0.1.2-py3-none-any.whl.
File metadata
- Download URL: gandalf_csr-0.1.2-py3-none-any.whl
- Upload date:
- Size: 75.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bc6a1dc06b33d1aa8f210707b4c68f35b6d4957b241035782bbad5ae4950da2c |
| MD5 | 3fa29a242bd92bc8d79620b1ccd49349 |
| BLAKE2b-256 | ff3c0baac116cdf853d967915218b863590b3c8c13ccf7cc0a125cf0e5141c2c |