Graph + Vector retrieval system for InterSystems IRIS with hybrid search, openCypher, and GraphQL APIs

IRIS Vector Graph

A knowledge graph system built on InterSystems IRIS that combines graph traversal, vector similarity search, and full-text search in a single database. Designed for biomedical research use cases like protein interaction networks and literature mining.

What is InterSystems IRIS?

IRIS is a multi-model database that supports SQL, objects, documents, and key-value storage. This project uses IRIS's embedded Python, SQL procedures, and native vector search capabilities to implement graph operations without external dependencies.

What This Does

Stores graph data (nodes, edges, properties) in IRIS SQL tables and provides:

  • Vector similarity search - Find semantically similar entities using embeddings
  • Graph traversal - Multi-hop path queries across relationships
  • Hybrid search - Combine vector similarity with keyword search using RRF (Reciprocal Rank Fusion)
  • Referential integrity - Foreign key constraints ensure data consistency across all graph entities
  • Graph analytics - PageRank, Connected Components, and other algorithms optimized with embedded Python
  • REST API - Access all features via HTTP endpoints

Built with embedded Python for flexibility and IRIS SQL procedures for performance.
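Reciprocal Rank Fusion merges two ranked result lists by summing 1/(c + rank) for each entity. A minimal sketch of the idea (an illustration only, not the kg_RRF_FUSE implementation; the IDs are hypothetical):

```python
def rrf_fuse(vector_hits, text_hits, c=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1/(c + rank)."""
    scores = {}
    for hits in (vector_hits, text_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# An entity ranked well in BOTH lists beats one that tops only a single list
fused = rrf_fuse(["P1", "P2", "P3"], ["P2", "P4", "P1"])
print(fused[0][0])  # P2
```

The constant c dampens the influence of top ranks; larger values flatten the contribution of each list.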

Architecture

┌─────────────────────────────────────────────────┐
│  REST API (Graph.KG.Service)                    │
│  - /kg/vectorSearch                             │
│  - /kg/hybridSearch                             │
│  - /kg/metaPath                                 │
└────────────────┬────────────────────────────────┘
                 │
┌────────────────▼────────────────────────────────┐
│  IRIS SQL Procedures (operators.sql)            │
│  - kg_KNN_VEC: Vector similarity                │
│  - kg_RRF_FUSE: Hybrid search                   │
│  - Text search with BM25                        │
└────────────────┬────────────────────────────────┘
                 │
┌────────────────▼────────────────────────────────┐
│  Embedded Python (Graph.KG.PyOps)               │
│  - Core engine (iris_vector_graph)              │
│  - NetworkX integration                         │
│  - Vector utilities                             │
└────────────────┬────────────────────────────────┘
                 │
┌────────────────▼────────────────────────────────┐
│  IRIS Tables                                    │
│  - nodes: Explicit node identity (PRIMARY KEY)  │
│  - rdf_edges: Relationships (FK to nodes)       │
│  - rdf_labels: Node types (FK to nodes)         │
│  - rdf_props: Properties (FK to nodes)          │
│  - kg_NodeEmbeddings: Vector embeddings (FK)    │
└─────────────────────────────────────────────────┘

Data is stored in RDF-style tables with explicit node identity and foreign key constraints for referential integrity. Graph operations are implemented via SQL procedures and embedded Python. HNSW vector indexing provides fast similarity search (requires IRIS 2025.3+ or ACORN-1 pre-release build).

Repository Structure

sql/
  schema.sql              # Table definitions
  operators.sql           # SQL procedures (requires IRIS 2025.3+)
  operators_fixed.sql     # Compatibility version for older IRIS
  migrations/             # Schema migrations (NodePK)
    001_add_nodepk_table.sql
    002_add_fk_constraints.sql

iris_vector_graph/   # Python engine
  engine.py               # Core search/traversal logic
  fusion.py               # RRF hybrid search
  vector_utils.py         # Vector operations

iris/src/                 # ObjectScript components
  Graph/KG/               # REST API and Python integration
    Service.cls           # REST API endpoints
    PyOps.cls             # Python integration
    Traversal.cls         # Graph operations
  PageRankEmbedded.cls    # Embedded Python graph analytics

scripts/
  ingest/networkx_loader.py         # Load data from files
  migrations/migrate_to_nodepk.py   # NodePK migration utility
  performance/                      # Benchmarking tools

docs/
  architecture/           # Design documentation
  performance/            # Performance benchmarks
  setup/                  # Installation guides
  api/REST_API.md         # API reference

Quick Start

Prerequisites

  • Docker (for running IRIS)
  • Python 3.8+ (we use UV for package management)
  • Basic familiarity with SQL

Installation

1. Install UV and dependencies:

curl -LsSf https://astral.sh/uv/install.sh | sh
git clone <repository-url>
cd iris-vector-graph
uv sync
source .venv/bin/activate

2. Start IRIS:

# Option A: ACORN-1 (pre-release build with HNSW optimization - fastest)
# Note: ACORN-1 is experimental and not yet in standard IRIS releases
docker-compose -f docker-compose.acorn.yml up -d

# Option B: Standard IRIS Community Edition (slower but stable)
# docker-compose up -d

3. Load schema:

# Connect to IRIS SQL terminal
docker exec -it iris-acorn-1 iris session iris

# Start the SQL shell:
Do $SYSTEM.SQL.Shell()

# In the SQL prompt:
\i sql/schema.sql
\i sql/operators.sql  # Use operators_fixed.sql if this fails
\i sql/migrations/001_add_nodepk_table.sql
\i sql/migrations/002_add_fk_constraints.sql
\i scripts/sample_data_768.sql

Note: If you have existing data, use the migration utility instead:

uv run python scripts/migrations/migrate_to_nodepk.py --validate-only  # Dry run
uv run python scripts/migrations/migrate_to_nodepk.py --execute         # Execute

4. Create REST endpoints:

# In IRIS terminal:
Do ##class(Graph.KG.Service).CreateWebApp("/kg")

5. Configure and test:

cp .env.sample .env
# Edit .env with connection details (defaults usually work)

# Run tests
uv run python tests/python/run_all_tests.py --quick

Usage Examples

Vector Similarity Search

Find entities with similar embeddings:

REST API:

curl -X POST http://localhost:52773/kg/vectorSearch \
  -H "Content-Type: application/json" \
  -d '{
    "vector": [0.1, 0.2, 0.3, ...],  # 768-dimensional embedding
    "k": 10,
    "label": "protein"
  }'

Python (using SQL procedure):

import iris, json, numpy as np

# (host, superserver port, namespace, username, password)
conn = iris.connect('localhost', 1973, 'USER', '_SYSTEM', 'SYS')
cursor = conn.cursor()
qvec = np.random.rand(768).tolist()  # replace with a real embedding
cursor.execute("CALL kg_KNN_VEC(?, ?, ?)", [json.dumps(qvec), 10, 'protein'])
print(cursor.fetchall())

Direct SQL (if VECTOR functions available):

-- Use when VECTOR/TO_VECTOR functions are available
SELECT TOP 10 id,
       VECTOR_COSINE(emb, TO_VECTOR('[0.1,0.2,0.3,...]')) AS similarity
FROM kg_NodeEmbeddings
ORDER BY similarity DESC;

Graph Traversal

Find multi-hop paths between entities:

REST API:

curl -X POST http://localhost:52773/kg/metaPath \
  -H "Content-Type: application/json" \
  -d '{
    "srcId": "DRUG:aspirin",
    "predicates": ["targets", "interacts_with", "associated_with"],
    "maxHops": 3,
    "dstLabel": "disease"
  }'

Python:

# Multi-hop graph traversal
cursor.execute("""
    SELECT e1.s as drug, e2.o_id as protein, e3.o_id as disease
    FROM rdf_edges e1
    JOIN rdf_edges e2 ON e1.o_id = e2.s
    JOIN rdf_edges e3 ON e2.o_id = e3.s
    WHERE e1.s = ?
      AND e1.p = 'targets'
      AND e2.p = 'interacts_with'
      AND e3.p = 'associated_with'
""", ['DRUG:aspirin'])

pathways = cursor.fetchall()
for drug, protein, disease in pathways:
    print(f"{drug} -> {protein} -> {disease}")

SQL (recursive CTE):

-- Find shortest paths between drug and disease
WITH RECURSIVE pathway(source, target, path, hops) AS (
  SELECT s, o_id, CAST(s || ' -> ' || o_id AS VARCHAR(1000)), 1
  FROM rdf_edges
  WHERE s = 'DRUG:aspirin'

  UNION ALL

  SELECT p.source, e.o_id, p.path || ' -> ' || e.o_id, p.hops + 1
  FROM pathway p
  JOIN rdf_edges e ON p.target = e.s
  WHERE p.hops < 4
    AND e.o_id LIKE 'DISEASE:%'
)
SELECT TOP 10 path, hops FROM pathway
WHERE target LIKE 'DISEASE:%'
ORDER BY hops;

Hybrid Search

Combine vector similarity with keyword matching:

REST API:

curl -X POST http://localhost:52773/kg/hybridSearch \
  -H "Content-Type: application/json" \
  -d '{
    "vector": [0.1, 0.2, ...],  # Cancer pathway embedding
    "text": "tumor suppressor DNA repair",
    "k": 15,
    "c": 60
  }'

Python:

# Hybrid search using RRF fusion of vector and text results
import json
import numpy as np

# Use the stored procedure for hybrid search
query_vector = np.random.rand(768).tolist()  # replace with a real embedding
cursor.execute("CALL kg_RRF_FUSE(?, ?, ?, ?, ?, ?)", [
    15,                             # k: final result count
    20,                             # k1: vector candidates
    20,                             # k2: text candidates
    60,                             # c: RRF smoothing constant
    json.dumps(query_vector),       # query vector as JSON
    'tumor suppressor DNA repair'   # text query
])

results = cursor.fetchall()
for entity_id, rrf_score, vs_score, bm25_score in results:
    print(f"{entity_id}: RRF={rrf_score:.3f}, Vector={vs_score:.3f}, Text={bm25_score:.3f}")

# Alternative: Manual text search in qualifiers
cursor.execute("""
    SELECT TOP 15 s, qualifiers
    FROM rdf_edges
    WHERE qualifiers LIKE '%tumor%' OR qualifiers LIKE '%suppressor%'
""")
text_results = cursor.fetchall()

for result in text_results:
    print(f"Entity: {result[0]}")
    print(f"  Qualifiers: {result[1]}")

Network Analysis

Find highly connected nodes:

Python:

# Find hub proteins (most connections)
cursor.execute("""
    SELECT TOP 20 s as protein, COUNT(*) as connections
    FROM rdf_edges
    WHERE p = 'interacts_with'
      AND s LIKE 'PROTEIN:%'
    GROUP BY s
    ORDER BY connections DESC
""")

hubs = cursor.fetchall()
print("Top protein interaction hubs:")
for protein, connections in hubs:
    print(f"  {protein}: {connections} interactions")

SQL:

-- Network clustering coefficient
SELECT
    node,
    connections,
    triangles,
    CASE WHEN connections > 1
         THEN 2.0 * triangles / (connections * (connections - 1))
         ELSE 0 END as clustering_coefficient
FROM (
    SELECT
        e1.s as node,
        COUNT(DISTINCT e1.o_id) as connections,
        COUNT(DISTINCT e2.o_id) as triangles
    FROM rdf_edges e1
    LEFT JOIN rdf_edges e2 ON e1.o_id = e2.s AND e2.o_id IN (
        SELECT o_id FROM rdf_edges WHERE s = e1.s
    )
    WHERE e1.p = 'interacts_with'
    GROUP BY e1.s
) stats
ORDER BY clustering_coefficient DESC;
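The coefficient formula in the SQL above, 2 * triangles / (k * (k - 1)), can be sanity-checked in plain Python (a minimal sketch over undirected edge pairs, independent of the IRIS schema):

```python
from itertools import combinations

def clustering_coefficient(edges, node):
    """Local clustering: fraction of a node's neighbour pairs that are linked."""
    adj = {}
    for s, o in edges:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    nbrs = adj.get(node, set())
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Count neighbour pairs that are themselves connected (triangles through node)
    triangles = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * triangles / (k * (k - 1))

# Triangle A-B-C plus pendant node D: only 1 of A's 3 neighbour pairs is linked
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]
print(clustering_coefficient(edges, "A"))  # 0.333...
```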

Complete Workflow Example

Drug target discovery:

Python:

def find_drug_targets(disease_name):
    """Find drugs targeting proteins associated with a disease."""

    # 1. Find disease-associated proteins
    cursor.execute("""
        SELECT DISTINCT o_id as protein
        FROM rdf_edges
        WHERE s = ? AND p = 'associated_with'
    """, [f"DISEASE:{disease_name}"])

    disease_proteins = [row[0] for row in cursor.fetchall()]

    # 2. Find drugs targeting these proteins
    targets = []
    for protein in disease_proteins:
        cursor.execute("""
            SELECT s as drug, qualifiers
            FROM rdf_edges
            WHERE o_id = ? AND p = 'targets'
        """, [protein])

        targets.extend(cursor.fetchall())

    return targets

# Usage
cancer_drugs = find_drug_targets('cancer')
print(f"Found {len(cancer_drugs)} potential drug-target relationships")

NetworkX integration:

import networkx as nx

# Export IRIS graph to NetworkX for analysis
G = nx.DiGraph()

cursor.execute("SELECT s, o_id, p FROM rdf_edges WHERE p = 'interacts_with'")
for source, target, relation in cursor.fetchall():
    G.add_edge(source, target, relation=relation)

# Network analysis
centrality = nx.betweenness_centrality(G)
communities = nx.community.greedy_modularity_communities(G)

print(f"Network has {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")
print(f"Found {len(communities)} protein communities")

Data Integrity & Graph Analytics

Referential Integrity with NodePK

The system enforces data consistency through foreign key constraints on an explicit nodes table:

Benefits:

  • Zero orphaned references - Cannot create edges, labels, properties, or embeddings for non-existent nodes
  • Data validation - FK constraints are checked on every insert/update operation
  • Performance - In benchmarks, FK constraints improved query performance by 64% (the query optimizer exploits them)
  • Migration support - Utility to safely migrate existing data to explicit node identity

Schema:

-- Central node registry
CREATE TABLE nodes (
    node_id VARCHAR(256) PRIMARY KEY,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- All graph tables reference nodes
ALTER TABLE rdf_edges ADD FOREIGN KEY (s) REFERENCES nodes(node_id);
ALTER TABLE rdf_edges ADD FOREIGN KEY (o_id) REFERENCES nodes(node_id);
ALTER TABLE rdf_labels ADD FOREIGN KEY (s) REFERENCES nodes(node_id);
ALTER TABLE rdf_props ADD FOREIGN KEY (s) REFERENCES nodes(node_id);
ALTER TABLE kg_NodeEmbeddings ADD FOREIGN KEY (id) REFERENCES nodes(node_id);
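Because every graph table references nodes, inserts must register nodes first. The behaviour can be illustrated portably with SQLite (an analogy only; IRIS DDL differs, and the table names here merely mirror the schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires FKs to be enabled
conn.execute("CREATE TABLE nodes (node_id TEXT PRIMARY KEY)")
conn.execute("""CREATE TABLE rdf_edges (
    s TEXT REFERENCES nodes(node_id),
    p TEXT,
    o_id TEXT REFERENCES nodes(node_id))""")

# An edge referencing unregistered nodes is rejected
try:
    conn.execute("INSERT INTO rdf_edges VALUES ('DRUG:aspirin', 'targets', 'PROTEIN:PTGS2')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# Register the nodes first, then the edge insert succeeds
conn.execute("INSERT INTO nodes VALUES ('DRUG:aspirin')")
conn.execute("INSERT INTO nodes VALUES ('PROTEIN:PTGS2')")
conn.execute("INSERT INTO rdf_edges VALUES ('DRUG:aspirin', 'targets', 'PROTEIN:PTGS2')")
print("edges:", conn.execute("SELECT COUNT(*) FROM rdf_edges").fetchone()[0])
```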

Migration:

# Validate existing data (dry run)
uv run python scripts/migrations/migrate_to_nodepk.py --validate-only

# Execute migration
uv run python scripts/migrations/migrate_to_nodepk.py --execute

# See detailed logs
uv run python scripts/migrations/migrate_to_nodepk.py --execute --verbose

Graph Analytics with Embedded Python

High-performance graph algorithms using IRIS embedded Python (runs in-process, 10-50x faster than client-side).

Setup Requirements:

  • InterSystems IRIS 2024.1+ with embedded Python support
  • Container Python Framework (CPF) configuration - see iris-embedded-python-template
  • Python packages installed in IRIS runtime environment (separate from client)
  • PageRankEmbedded.cls compiled in IRIS

Compilation (one-time setup):

// IRIS terminal - compile PageRankEmbedded class
do $system.OBJ.Load("/path/to/iris/src/PageRankEmbedded.cls", "ck")

PageRank Usage:

// IRIS terminal - compute PageRank on graph subset
set results = ##class(PageRankEmbedded).ComputePageRank("PROTEIN:%", 10, 0.85)
do results.%ToJSON()

// With convergence tracking
set results = ##class(PageRankEmbedded).ComputePageRankWithMetrics("PROTEIN:%", 10, 0.85, 0.0001)
do results.%ToJSON()

Python client:

# PageRank via embedded Python (in-process execution).
# Assumes ComputePageRank is projected as a SQL-invokable procedure;
# a [ SqlProc ] method is exposed by default as ClassName_MethodName.
cursor.execute("SELECT PageRankEmbedded_ComputePageRank('PROTEIN:%', 10, 0.85)")
pagerank_json = cursor.fetchone()[0]

import json
results = json.loads(pagerank_json)
for node in results[:10]:  # Top 10 by PageRank
    print(f"{node['nodeId']}: {node['pagerank']:.6f}")
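For reference, the damped power iteration that PageRank performs can be sketched in plain Python (an illustration of the algorithm, not the PageRankEmbedded source):

```python
def pagerank(edges, iterations=10, damping=0.85):
    """Power-iteration PageRank over (source, target) edge pairs."""
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for s, t in edges:
        out[s].append(t)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n, targets in out.items():
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: spread its rank uniformly
                for t in nodes:
                    nxt[t] += damping * rank[n] / len(nodes)
        rank = nxt
    return rank

ranks = pagerank([("A", "B"), ("B", "C"), ("C", "A"), ("D", "C")])
print(min(ranks, key=ranks.get))  # D has no inbound edges, so it ranks lowest
```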

Performance (1K nodes, 8863 edges):

  • PageRank (10 iterations): 5.31ms
  • Connected Components: 4.70ms
  • Shortest Path (BFS): 0.045ms per level
  • Degree Centrality: 72-89ms

Expected performance (100K nodes, 500K edges):

  • PageRank: 1-5 seconds (vs 50-60s client-side Python baseline)
  • Graph traversal: <2ms for 3-hop queries
  • Concurrent workload: ≥700 queries/second

See docs/performance/graph_analytics_roadmap.md for optimization roadmap and docs/performance/nodepk_benchmark_results.md for detailed benchmarks.

Performance

The system has been tested with biomedical datasets (STRING protein interactions, PubMed literature). Performance metrics:

With ACORN-1 (pre-release build with HNSW indexing):

  • Vector search: ~1.7ms (HNSW-optimized)
  • Node lookup: 0.292ms (PRIMARY KEY index)
  • Graph queries: ~0.09ms per hop
  • Bulk node insertion: 6,496 nodes/second
  • PageRank (1K nodes): 5.31ms
  • Concurrent queries: 702 queries/second
  • Handles 50K+ nodes, 500K+ edges tested

With standard IRIS Community Edition:

  • Vector search: ~5.8s (no HNSW optimization)
  • Graph queries: ~1ms average
  • Data ingestion: ~29 proteins/second
  • Still functional for development and moderate-scale datasets

NodePK Performance:

  • FK constraint overhead: none measured; benchmarked queries ran 64% faster with constraints in place (query optimizer benefits)
  • Node lookup: <1ms with PRIMARY KEY index
  • Graph traversal: 0.09ms per hop with FK validation
  • See docs/performance/nodepk_benchmark_results.md

See docs/performance/ for detailed benchmarks.

Use Cases

Designed for biomedical research but adaptable to other domains:

  • Protein-protein interaction networks
  • Drug-target relationship discovery
  • Literature mining and knowledge extraction
  • Multi-hop reasoning across heterogeneous data
  • Semantic search over structured knowledge

The vector search supports any 768-dimensional embeddings (e.g., from BioBERT, SapBERT, or general-purpose models).
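VECTOR_COSINE computes standard cosine similarity over those embeddings; for intuition, here is the same measure in plain Python (a reference sketch, not the IRIS implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```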

Development

Run tests:

uv run python tests/python/run_all_tests.py --quick
uv run python tests/python/test_iris_rest_api.py

Load your own data:

# TSV format: source\ttarget\trelationship_type
uv run python scripts/ingest/networkx_loader.py load data.tsv --format tsv
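The expected TSV layout can be sketched with a minimal parser (an illustration of the input format only, not networkx_loader.py itself):

```python
import csv, io

def parse_edges(tsv_text):
    """Parse 'source<TAB>target<TAB>relationship_type' lines into edge tuples."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [(s, t, rel) for s, t, rel in reader if s]

sample = (
    "DRUG:aspirin\tPROTEIN:PTGS2\ttargets\n"
    "PROTEIN:PTGS2\tDISEASE:inflammation\tassociated_with\n"
)
edges = parse_edges(sample)
print(len(edges))    # 2
print(edges[0][2])   # targets
```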

Performance testing:

uv run python scripts/performance/string_db_scale_test.py --max-proteins 10000

Documentation

Requirements

  • IRIS Database:
    • IRIS 2025.3+ for VECTOR functions (recommended)
    • ACORN-1 pre-release build for HNSW optimization (fastest, but experimental)
    • Standard Community Edition works but without HNSW indexing (slower vector search)
  • Python: 3.8+ (embedded in IRIS, also needed for client scripts)
  • Docker: For running IRIS container

Limitations

  • Vector search requires IRIS with VECTOR support (2025.3+ or ACORN-1)
  • HNSW indexing (major speedup) only available in ACORN-1 pre-release - not yet in standard IRIS
  • ACORN-1 is experimental and not recommended for production deployments
  • Graph traversal uses SQL recursive CTEs - performance degrades on very deep paths (>5 hops)
  • Text search uses simple BM25 implementation (not production-grade full-text)

License

See LICENSE file for details.
