Transpile Graph Query Language (openCypher) to Recursive SQL (Databricks)

These details have not been verified by PyPI

gsql2rsql - OpenCypher to Databricks SQL Transpiler

gsql2rsql transpiles OpenCypher graph queries to Databricks SQL, enabling graph analytics on Delta Lake without a dedicated graph database.

Project Status: This is a hobby/research project being developed towards production quality. While it handles complex queries and includes comprehensive tests, it's not yet at enterprise scale. Contributions welcome!

Warning: Not for OLTP or end-user queries. This transpiler is for internal analytics and exploration (data science, engineering, analysis), not for OLTP workloads. If you plan to expose transpiled queries to end users, implement validation, rate limiting, and security controls first.

Why This Project?

Inspiration: Microsoft's openCypherTranspiler

This project was inspired by Microsoft's openCypherTranspiler (now unmaintained) which transpiled OpenCypher to T-SQL (SQL Server).

Why a new transpiler? Two reasons:

1. Databricks SQL is fundamentally different from T-SQL: WITH RECURSIVE, higher-order functions (HOFs), and Delta Lake optimizations require different strategies.
2. Security-first architecture: gsql2rsql uses strict separation of concerns for correctness:
- Parser: Syntax only (no schema access)
- Planner: Semantics only (builds logical operators)
- Resolver: Validation only (schema checking, column resolution)
- Renderer: Code generation only (intentionally "dumb")

This separation makes the transpiler easier to audit, test, and trust.
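
To make the four-stage split concrete, here is a minimal toy sketch in plain Python. It is not gsql2rsql's code: every class and function name below is hypothetical, the "grammar" is a single regular expression for one-hop patterns, and the edge table is assumed to have src and dst columns. The point is only that each stage has one job and the renderer performs no checks of its own.

```python
import re
from dataclasses import dataclass

# Hypothetical toy types; none of these names come from gsql2rsql.
@dataclass(frozen=True)
class Pattern:          # parser output: syntax only
    src_var: str
    src_label: str
    rel_type: str
    dst_var: str
    dst_label: str

@dataclass(frozen=True)
class JoinPlan:         # planner output: logical operator, no table names yet
    left: tuple[str, str]
    edge: str
    right: tuple[str, str]

@dataclass(frozen=True)
class ResolvedPlan:     # resolver output: every name bound to a table
    left_table: str
    edge_table: str
    right_table: str
    plan: JoinPlan

PATTERN_RE = re.compile(r"\((\w+):(\w+)\)-\[:(\w+)\]->\((\w+):(\w+)\)")

def parse(text: str) -> Pattern:
    """Syntax only: no schema access."""
    m = PATTERN_RE.fullmatch(text.strip())
    if m is None:
        raise SyntaxError(f"unsupported pattern: {text!r}")
    return Pattern(*m.groups())

def plan(p: Pattern) -> JoinPlan:
    """Semantics only: build a logical operator."""
    return JoinPlan((p.src_var, p.src_label), p.rel_type, (p.dst_var, p.dst_label))

def resolve(jp: JoinPlan, schema: dict[str, str]) -> ResolvedPlan:
    """Validation only: every label and edge type must exist in the schema."""
    for name in (jp.left[1], jp.edge, jp.right[1]):
        if name not in schema:
            raise KeyError(f"unknown label or edge type: {name}")
    return ResolvedPlan(schema[jp.left[1]], schema[jp.edge], schema[jp.right[1]], jp)

def render(rp: ResolvedPlan) -> str:
    """Code generation only: no checks, just string assembly."""
    a, b = rp.plan.left[0], rp.plan.right[0]
    return (
        f"SELECT {a}.id AS {a}_id, {b}.id AS {b}_id\n"
        f"FROM {rp.left_table} AS {a}\n"
        f"JOIN {rp.edge_table} AS e ON e.src = {a}.id\n"
        f"JOIN {rp.right_table} AS {b} ON e.dst = {b}.id"
    )

# Hypothetical schema: label or edge type -> table name.
schema = {"Person": "people", "KNOWS": "knows"}
sql = render(resolve(plan(parse("(a:Person)-[:KNOWS]->(b:Person)")), schema))
print(sql)
```

Because the renderer receives only resolved names, a reviewer auditing for schema errors needs to read the resolver alone.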

The game-changer: Databricks recently added WITH RECURSIVE support, unlocking variable-leng

Databricks SQL Higher-Order Functions (HOFs)

Databricks SQL has native array manipulation via HOFs:

-- Transform array elements
SELECT transform(relationships, r -> r.amount) AS amounts
FROM fraud_paths

-- Filter complex conditions
SELECT filter(path, node -> node.risk_score > 0.8) AS risky_nodes
FROM customer_journeys

-- Aggregate with lambda
SELECT aggregate(
  transactions,
  0.0,
  (acc, t) -> acc + t.amount,
  acc -> acc
) AS total
FROM account_history

gsql2rsql leverages these HOFs for:

Path filtering: NONE(r IN relationships(path) WHERE r.suspicious)
Path aggregations: SUM(r IN rels WHERE r.amount > 1000)
Pattern matching: Complex nested conditions

This makes Cypher → SQL transpilation more natural
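
For readers less familiar with HOFs, the sketch below mirrors the three functions above in plain Python and shows one plausible way a Cypher NONE predicate could map onto the exists() higher-order function. The sample data, the helper name none_to_sql, and the array name path_edges are all hypothetical, and the SQL string is an assumption about a reasonable rendering, not gsql2rsql's actual output.

```python
from functools import reduce

# Hypothetical sample path: each relationship is a dict,
# mirroring a SQL array of structs.
rels = [
    {"amount": 500.0, "suspicious": False},
    {"amount": 1500.0, "suspicious": False},
    {"amount": 2500.0, "suspicious": False},
]

# transform(rels, r -> r.amount)
amounts = [r["amount"] for r in rels]

# filter(rels, r -> r.amount > 1000)
large = [r for r in rels if r["amount"] > 1000]

# aggregate(rels, 0.0, (acc, r) -> acc + r.amount, acc -> acc)
total = reduce(lambda acc, r: acc + r["amount"], rels, 0.0)

# Cypher NONE(r IN rels WHERE r.suspicious) is true when no element matches,
# which is the negation of an "exists" check over the array.
none_suspicious = not any(r["suspicious"] for r in rels)

def none_to_sql(array_expr: str, var: str, predicate: str) -> str:
    """Render NONE(var IN array WHERE predicate) as a negated exists() call."""
    return f"NOT exists({array_expr}, {var} -> {predicate})"

print(amounts, len(large), total, none_suspicious)
print(none_to_sql("path_edges", "r", "r.suspicious"))
```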

Why Graph Queries on Delta Lake?

Delta Lake (Single Source)
     ↓ OpenCypher (via gsql2rsql)
Databricks SQL
     ↓ Results

Advantages:

No duplication: Query source data directly
Real-time: Always fresh data
No sync: One less thing to break
Cost-effective: No second database
Unified governance: Single data platform

Billion-Scale Relationships: Triple Stores in Delta

The Problem with Graph Databases (OLTP) at Scale

When you have billions of relationships:

Memory limits: Graph must fit in RAM for good performance
Vertical scaling: Limited by single-server resources
Cost: Enterprise licenses + large EC2 instances = $$$$
Backup/Recovery: GBs of graph data, long backup windows
Version upgrades: Risky with large graphs

Triple Store in Delta Lake

Model relationships as triples in Delta:

-- Nodes table (entities)
CREATE TABLE nodes (
  node_id STRING,
  type STRING,          -- Person, Account, Merchant, etc.
  properties MAP<STRING, STRING>,
  timestamp TIMESTAMP
) USING DELTA;

-- Edges table (relationships)
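-- Option 1: Traditional partitioning (relationship_type + date)
-- Delta partitions by columns, not expressions, so the date is a generated
-- column (event_date is an illustrative name)
CREATE TABLE edges (
  src STRING,           -- Source node_id
  relationship_type STRING,  -- TRANSACTION, OWNS, LOCATED_AT, etc.
  dst STRING,           -- Destination node_id
  properties MAP<STRING, STRING>,
  timestamp TIMESTAMP,
  event_date DATE GENERATED ALWAYS AS (CAST(timestamp AS DATE))
) USING DELTA
PARTITIONED BY (relationship_type, event_date);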

-- Option 2: Liquid Clustering (DBR 13.3+, RECOMMENDED!)
-- Clusters data by the chosen keys; keys can be changed later without rewriting existing data
CREATE TABLE edges (
  src STRING,
  relationship_type STRING,
  dst STRING,
  properties MAP<STRING, STRING>,
  timestamp TIMESTAMP
) USING DELTA
CLUSTER BY (relationship_type, src);

-- For traditional partitioning, optimize with Z-ordering
-- (Z-order on data columns only; relationship_type is already a partition column)
OPTIMIZE edges ZORDER BY (src, dst);

Advantages:

Horizontal scale: Petabytes, billions of rows, no problem
Cost-effective: S3 storage ($0.0something/GB) vs RAM ($something+/GB)
Time travel: Delta Lake versioning = free audit trail
Schema evolution: Add properties without downtime
ACID guarantees: Delta Lake transactions
Liquid clustering: Clustering keys can evolve with query patterns without rewriting data

This is why the GraphContext API exists: when your graph fits this pattern (nodes + edges tables), you don't need dozens of lines of schema boilerplate, just two table paths.
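
To show what a variable-length traversal over such an edges table looks like in recursive SQL, here is a small self-contained sketch that uses SQLite (from Python's standard library) as a stand-in for Databricks. The sample rows are invented, the SQL is hand-written for illustration and is not the transpiler's output, and Databricks syntax and limits may differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE edges (src TEXT, relationship_type TEXT, dst TEXT);
INSERT INTO edges VALUES
  ('alice', 'TRANSACTION', 'bob'),
  ('bob',   'TRANSACTION', 'carol'),
  ('carol', 'TRANSACTION', 'dave'),
  ('dave',  'TRANSACTION', 'erin'),
  ('alice', 'OWNS',        'car1');
""")

# Hand-written equivalent of (a {id: 'alice'})-[:TRANSACTION*1..3]->(b)
sql = """
WITH RECURSIVE walk(dst, depth) AS (
    -- Base case: one hop from the source node
    SELECT dst, 1 FROM edges
    WHERE src = ? AND relationship_type = 'TRANSACTION'
  UNION ALL
    -- Recursive step: extend each path by one edge, up to depth 3
    SELECT e.dst, w.depth + 1
    FROM walk AS w
    JOIN edges AS e ON e.src = w.dst
    WHERE e.relationship_type = 'TRANSACTION' AND w.depth < 3
)
SELECT dst, depth FROM walk ORDER BY depth, dst
"""
rows = conn.execute(sql, ("alice",)).fetchall()
print(rows)  # [('bob', 1), ('carol', 2), ('dave', 3)]
```

The depth bound in the recursive step is what keeps the traversal finite; 'erin' sits four hops away and is therefore excluded, and the OWNS edge is never followed.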

LLMs + Transpilers: Enterprise Governance

The Problem: In enterprise environments, someone must be accountable for queries before execution — even with LLM text-to-query.

Why Transpilers Matter

1. Reviewability: Graph queries are 4-5 lines vs hundreds of SQL lines

// 5 lines in Cypher
MATCH (c:Customer)-[:TRANSACTION*1..3]->(m:Merchant)
WHERE m.risk_score > 0.9
RETURN c.id, COUNT(*) AS risky_tx
ORDER BY risky_tx DESC
LIMIT 100

vs 150+ lines of recursive SQL. Easier for humans to review and approve.

Transpilers turn LLM outputs into governable, auditable, human-reviewable queries.

Quick Start

Installation

pip install gsql2rsql
# Or from source:
git clone https://github.com/devmessias/gsql2rsql
cd gsql2rsql/python
uv pip install -e .

Simplified API: GraphContext (Recommended for Triple Stores)

Why Triple Stores + Delta Tables Scale: Delta Lake's horizontal scaling, Z-ordering, and liquid clustering make single triple store architectures incredibly efficient — even at billions of edges. No need for complex multi-table schemas when Delta can handle everything.

GraphContext API eliminates ~100 lines of boilerplate for the common case: graph stored as two Delta tables (nodes + edges).

from gsql2rsql import GraphContext

# 1. Create context (just 2 table paths!)
# Note: Table names without backticks - SQLRenderer adds them automatically
graph = GraphContext(
    nodes_table="catalog.fraud.nodes",
    edges_table="catalog.fraud.edges",
    extra_node_attrs={"name": str, "risk_score": float},
    extra_edge_attrs={"amount": float, "timestamp": str}
)

# 2. Set types (auto-discovered if spark session provided)
graph.set_types(
    node_types=["Person", "Account", "Merchant"],
    edge_types=["TRANSACTION", "OWNS", "LOCATED_AT"]
)

# 3. Query with inline filters (optimized!)
query = """
MATCH path = (origin:Person {id: 'alice'})-[:TRANSACTION*1..3]->(dest:Account)
WHERE dest.risk_score > 0.8
RETURN dest.id, dest.risk_score, length(path) AS depth
ORDER BY depth, dest.risk_score DESC
LIMIT 100
"""

sql = graph.transpile(query, optimize=True)  # Predicate pushdown enabled!

# 4. Execute on Databricks
# df = graph.execute(query)  # If spark session provided
# df.show()

# Optional: restrict schemas to the combinations that actually exist
graph = GraphContext(
    spark=spark,  # Required for discovery
    nodes_table="catalog.fraud.nodes",
    edges_table="catalog.fraud.edges",
    discover_edge_combinations=True  # Query DB for real combinations
)
# With 10 node types and 5 edge types there are 10 × 5 × 10 = 500 possible
# (source, edge, destination) schemas. If only 15 combinations exist,
# only 15 schemas are created (about 33x fewer).
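
The arithmetic behind edge-combination discovery can be checked with a few lines of plain Python. This is only an illustration of the counting argument: the type names and sample rows are hypothetical, and how GraphContext actually queries the tables is not shown here.

```python
from itertools import product

# Hypothetical type names: 10 node types and 5 edge types.
node_types = [f"N{i}" for i in range(10)]
edge_types = [f"E{i}" for i in range(5)]

# Every (source type, edge type, destination type) triple that could exist.
all_combinations = set(product(node_types, edge_types, node_types))

# Hypothetical edge rows as (src_type, edge_type, dst_type); in practice these
# would come from a distinct scan over the edges joined to the nodes.
edge_rows = [
    ("N0", "E0", "N1"),
    ("N0", "E0", "N1"),  # duplicates collapse into one combination
    ("N1", "E2", "N3"),
    ("N4", "E1", "N4"),
]
observed = set(edge_rows)

print(len(all_combinations))         # 500
print(len(observed))                 # 3
print(observed <= all_combinations)  # True
```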

Advanced: Manual Schema Setup (Full Control)

For multi-table schemas or when you need precise control over SQL table descriptors, use the manual setup:

Example: Find fraud networks using BFS (Breadth-First Search) up to depth 4, starting from a suspicious account and ignoring social relationships.

from gsql2rsql.parser.opencypher_parser import OpenCypherParser
from gsql2rsql.planner.logical_plan import LogicalPlan
from gsql2rsql.renderer.sql_renderer import SQLRenderer
from gsql2rsql.common.schema import NodeSchema, EdgeSchema, EntityProperty
from gsql2rsql.renderer.schema_provider import SimpleSQLSchemaProvider, SQLTableDescriptor

# 1. Define schema (SimpleSQLSchemaProvider)
schema = SimpleSQLSchemaProvider()

# Person node
person = NodeSchema(
    name="Person",
    properties=[
        EntityProperty(property_name="id", data_type=int),
        EntityProperty(property_name="name", data_type=str),
        EntityProperty(property_name="risk_score", data_type=float),
    ],
    node_id_property=EntityProperty(property_name="id", data_type=int)
)

schema.add_node(
    person,
    SQLTableDescriptor(
        table_name="fraud.person",  # schema.table (prefix with a catalog under Unity Catalog)
        node_id_columns=["id"],
    )
)

# Multiple edge types - we'll only query TRANSACAO_SUSPEITA (suspicious transaction)
# AMIGOS (friends) and FAMILIARES (family) are in the schema but ignored in the query
amigos = EdgeSchema(
    name="AMIGOS",
    source_node_id="Person",
    sink_node_id="Person",
    source_id_property=EntityProperty(property_name="person1_id", data_type=int),
    sink_id_property=EntityProperty(property_name="person2_id", data_type=int),
    properties=[]
)

familiares = EdgeSchema(
    name="FAMILIARES",
    source_node_id="Person",
    sink_node_id="Person",
    source_id_property=EntityProperty(property_name="person1_id", data_type=int),
    sink_id_property=EntityProperty(property_name="person2_id", data_type=int),
    properties=[]
)

transacao_suspeita = EdgeSchema(
    name="TRANSACAO_SUSPEITA",
    source_node_id="Person",
    sink_node_id="Person",
    source_id_property=EntityProperty(property_name="origem_id", data_type=int),
    sink_id_property=EntityProperty(property_name="destino_id", data_type=int),
    properties=[
        EntityProperty(property_name="valor", data_type=float),
        EntityProperty(property_name="timestamp", data_type=str),
    ]
)

schema.add_edge(
    amigos,
    SQLTableDescriptor(
        entity_id="Person@AMIGOS@Person",
        table_name="fraud.amigos",
    )
)

schema.add_edge(
    familiares,
    SQLTableDescriptor(
        entity_id="Person@FAMILIARES@Person",
        table_name="fraud.familiares",
    )
)

schema.add_edge(
    transacao_suspeita,
    SQLTableDescriptor(
        entity_id="Person@TRANSACAO_SUSPEITA@Person",
        table_name="fraud.transacao_suspeita",
    )
)

# 2. BFS Query: Find fraud network up to depth 4 from suspicious root account
# Only traverse TRANSACAO_SUSPEITA edges (ignore AMIGOS and FAMILIARES)
query = """
MATCH path = (origem:Person {id: 12345})-[:TRANSACAO_SUSPEITA*1..4]->(destino:Person)
RETURN
    origem.id AS origem_id,
    origem.name AS origem_name,
    destino.id AS destino_id,
    destino.name AS destino_name,
    destino.risk_score AS destino_risk_score,
    length(path) AS profundidade
ORDER BY profundidade, destino.risk_score DESC
LIMIT 100
"""

# 3. Transpile to SQL with WITH RECURSIVE (for BFS traversal)
parser = OpenCypherParser()
renderer = SQLRenderer(db_schema_provider=schema)

ast = parser.parse(query)
plan = LogicalPlan.process_query_tree(ast, schema)
plan.resolve(original_query=query)
sql = renderer.render_plan(plan)

print(sql)

# 4. Execute on Databricks
# df = spark.sql(sql)
# df.show(100, truncate=False)

Output: Databricks SQL with a WITH RECURSIVE CTE, JOINs, WHERE filters, ORDER BY, and LIMIT, ready to execute on Delta Lake.

Features

✅ Variable-length paths (*1..N) via WITH RECURSIVE
✅ Undirected relationships (-[:REL]-)
✅ Path functions (length(), nodes(), relationships())
✅ Aggregations (COUNT, SUM, COLLECT, etc.)
✅ Predicate pushdown (filters applied in DataSource before joins)
✅ Inline property filters ({name: 'Alice'} → optimized WHERE clauses)
✅ BFS source filter optimization (inline filters applied in base case)
✅ WITH clauses (multi-stage composition)
✅ UNION, OPTIONAL MATCH, CASE, DISTINCT
✅ GraphContext API (simplified setup for Triple Stores)

See full feature list.
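
As a rough illustration of why predicate pushdown matters, the sketch below compares filtering after a join with filtering before it, using plain Python lists as stand-ins for tables. The data and the join helper are hypothetical and say nothing about how gsql2rsql implements the optimization; they only show that both orders return the same rows while the pushed-down version joins against a much smaller input.

```python
# Hypothetical in-memory tables.
people = [{"id": i, "risk_score": i / 10} for i in range(10)]
edges = [{"src": i, "dst": (i + 1) % 10} for i in range(10)]

def join(left, right):
    """Nested-loop join of edges to destination people; returns joined rows."""
    return [(e, p) for e in left for p in right if e["dst"] == p["id"]]

# Filter after the join: every edge is joined first.
joined_all = join(edges, people)
late = [(e, p) for e, p in joined_all if p["risk_score"] > 0.8]

# Pushdown: filter people first, then join against the smaller input.
risky_people = [p for p in people if p["risk_score"] > 0.8]
early = join(edges, risky_people)

print(len(joined_all), len(risky_people), late == early)
```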

Documentation

Development

# Setup
uv sync --extra dev
uv pip install -e ".[dev]"

# Tests
make test-no-pyspark   # Fast (no Spark dependency)
make test-pyspark      # Full validation with PySpark

# Lint & Format
make lint
make format
make typecheck

See CONTRIBUTING.md for conventional commits and release process.

Requirements

Python 3.12+
Databricks Runtime 17.0+ (for WITH RECURSIVE)
PySpark (optional, only for development/testing)

Contributing

This is an open hobby project — contributions are very welcome!

Bugs: Open an issue
Features: Discuss in Discussions
PRs: Follow conventional commits

License

MIT License - see LICENSE.

Author

Bruno Messias LinkedIn | GitHub

Release history

Version	Released
0.10.0	Feb 24, 2026
0.10.0.dev202603200846 (pre-release)	Mar 20, 2026
0.9.7	Feb 22, 2026
0.9.6	Feb 18, 2026
0.9.5	Feb 18, 2026
0.9.4	Feb 7, 2026
0.9.3	Feb 3, 2026
0.9.2	Feb 3, 2026
0.9.1	Feb 3, 2026
0.9.0	Feb 2, 2026
0.8.2	Jan 31, 2026
0.8.1	Jan 31, 2026
0.8.0	Jan 26, 2026
0.7.3	Jan 24, 2026
0.7.2	Jan 24, 2026
0.7.1	Jan 24, 2026
0.7.0 (this version)	Jan 24, 2026
0.6.0	Jan 22, 2026
0.5.0	Jan 21, 2026
0.4.1	Jan 21, 2026
0.4.0	Jan 21, 2026
0.3.0	Jan 21, 2026
0.2.0	Jan 20, 2026
0.1.7	Jan 20, 2026
0.1.6	Jan 20, 2026
0.1.5	Jan 20, 2026
0.1.4	Jan 20, 2026

Download files

Source Distribution

gsql2rsql-0.7.0.tar.gz (2.5 MB view details)

Uploaded Jan 24, 2026 Source

Built Distribution

gsql2rsql-0.7.0-py3-none-any.whl (274.9 kB view details)

Uploaded Jan 24, 2026 Python 3

File details

Details for the file gsql2rsql-0.7.0.tar.gz.

File metadata

Download URL: gsql2rsql-0.7.0.tar.gz
Upload date: Jan 24, 2026
Size: 2.5 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for gsql2rsql-0.7.0.tar.gz
Algorithm	Hash digest
SHA256	`b1a84215fa9dacd4c4071a3726cdaddd429380bf1cc5e32cb59c8dcb6050e06c`
MD5	`d430d008f8c2694a06735aad973f2c52`
BLAKE2b-256	`85d19f57314eacb428d7c0689119e5aa4f5f8fa9962d768b865a8d3b71c4d1d7`

Provenance

The following attestation bundles were made for gsql2rsql-0.7.0.tar.gz:

Publisher: release.yml on devmessias/gsql2rsql

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gsql2rsql-0.7.0.tar.gz
- Subject digest: b1a84215fa9dacd4c4071a3726cdaddd429380bf1cc5e32cb59c8dcb6050e06c
- Sigstore transparency entry: 850005687
- Sigstore integration time: Jan 24, 2026
Source repository:
- Permalink: devmessias/gsql2rsql@139dde314bad25eaf99e8d335790838f8426c917
- Branch / Tag: refs/heads/main
- Owner: https://github.com/devmessias
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@139dde314bad25eaf99e8d335790838f8426c917
- Trigger Event: push

File details

Details for the file gsql2rsql-0.7.0-py3-none-any.whl.

File metadata

Download URL: gsql2rsql-0.7.0-py3-none-any.whl
Upload date: Jan 24, 2026
Size: 274.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for gsql2rsql-0.7.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e120d932eb2530c14b3c29b9a97ddc2b8204febc680ade014380c4cb2fdaaf29`
MD5	`67d8caefedf6e24ced9ed5c4c757b15e`
BLAKE2b-256	`c658ce068c464b96cb1c92b9808634c8b61d2bb17e8c8544422d0ecb678500b9`

Provenance

The following attestation bundles were made for gsql2rsql-0.7.0-py3-none-any.whl:

Publisher: release.yml on devmessias/gsql2rsql

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gsql2rsql-0.7.0-py3-none-any.whl
- Subject digest: e120d932eb2530c14b3c29b9a97ddc2b8204febc680ade014380c4cb2fdaaf29
- Sigstore transparency entry: 850005690
- Sigstore integration time: Jan 24, 2026
Source repository:
- Permalink: devmessias/gsql2rsql@139dde314bad25eaf99e8d335790838f8426c917
- Branch / Tag: refs/heads/main
- Owner: https://github.com/devmessias
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@139dde314bad25eaf99e8d335790838f8426c917
- Trigger Event: push
