Transpile Graph Query Language (openCypher) to Recursive SQL (Databricks)
gsql2rsql - OpenCypher to Databricks SQL Transpiler
gsql2rsql transpiles OpenCypher graph queries to Databricks SQL, enabling graph analytics on Delta Lake without a dedicated graph database.
Project Status: This is a hobby/research project being developed towards production quality. While it handles complex queries and includes comprehensive tests, it's not yet at enterprise scale. Contributions welcome!
Why This Project?
Inspiration: Microsoft's openCypherTranspiler
This project was inspired by Microsoft's openCypherTranspiler (now unmaintained) which transpiled OpenCypher to T-SQL (SQL Server).
Why a new transpiler? Two reasons:
- Databricks SQL is fundamentally different from T-SQL — WITH RECURSIVE, HOFs, and Delta Lake optimizations require different strategies
- Security-first architecture — gsql2rsql uses strict 4-phase separation of concerns for correctness:
- Parser: Syntax only (no schema access)
- Planner: Semantics only (builds logical operators)
- Resolver: Validation only (schema checking, column resolution)
- Renderer: Code generation only (intentionally "dumb" — no semantic decisions, just SQL generation)
This separation makes the transpiler easier to audit, test, and trust.
The game-changer: Databricks recently added WITH RECURSIVE support, unlocking variable-length path traversal (`*1..N`) in pure SQL.
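Conceptually, a recursive CTE computes the same thing as an iterative BFS: start from the anchor rows, then repeatedly join the frontier against the edge table, carrying the hop count, until the maximum depth is reached. A minimal Python sketch of what a `*1..4` traversal computes (the edge data and names here are illustrative, not taken from the project):

```python
# What WITH RECURSIVE computes for a pattern like (a)-[:REL*1..4]->(b):
# each iteration joins the current frontier to the edge list, exactly
# like the recursive member of the CTE joining back to the edge table.
def expand_paths(edges, start, max_depth):
    """edges: list of (src, dst); returns (node, depth) pairs reachable in 1..max_depth hops."""
    frontier = [(start, 0)]
    results = []
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for node, _ in frontier:
            for src, dst in edges:
                if src == node:
                    next_frontier.append((dst, depth))
        results.extend(next_frontier)
        frontier = next_frontier  # the CTE's "recursive step" input
    return results

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
print(expand_paths(edges, "a", 4))
# [('b', 1), ('c', 2), ('d', 3), ('e', 4)]
```

The SQL engine does this set-at-a-time rather than node-at-a-time, but the fixpoint semantics are the same.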
Databricks SQL Higher-Order Functions (HOFs)
Databricks SQL has native array manipulation via HOFs:
```sql
-- Transform array elements
SELECT transform(relationships, r -> r.amount) AS amounts
FROM fraud_paths;

-- Filter complex conditions
SELECT filter(path, node -> node.risk_score > 0.8) AS risky_nodes
FROM customer_journeys;

-- Aggregate with lambda
SELECT aggregate(
  transactions,
  0.0,
  (acc, t) -> acc + t.amount,
  acc -> acc
) AS total
FROM account_history;
```
gsql2rsql leverages these HOFs for:
- Path filtering: `NONE(r IN relationships(path) WHERE r.suspicious)`
- Path aggregations: `SUM(r IN rels WHERE r.amount > 1000)`
- Pattern matching: complex nested conditions

This makes Cypher → SQL transpilation more natural.
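The HOF semantics map directly onto familiar functional idioms. A small Python sketch of what `transform`, `filter`, and `aggregate` compute (sample data is illustrative):

```python
# Python equivalents of the three Databricks SQL HOFs used above.
relationships = [{"amount": 120.0}, {"amount": 45.0}]
path = [{"id": 1, "risk_score": 0.95}, {"id": 2, "risk_score": 0.3}]
transactions = [{"amount": 10.0}, {"amount": 5.5}]

# transform(arr, x -> expr)  ~  map / list comprehension
amounts = [r["amount"] for r in relationships]

# filter(arr, x -> pred)    ~  filtered comprehension
risky_nodes = [n for n in path if n["risk_score"] > 0.8]

# aggregate(arr, zero, merge, finish)  ~  a left fold with an initial value
total = 0.0
for t in transactions:
    total = total + t["amount"]

print(amounts, risky_nodes, total)
# [120.0, 45.0] [{'id': 1, 'risk_score': 0.95}] 15.5
```

Because these run inside the SQL engine over array columns, a path collected by a recursive CTE can be filtered and aggregated without exploding it back into rows.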
Why Graph Queries on Delta Lake?
```text
Delta Lake (Single Source)
    ↓  OpenCypher (via gsql2rsql)
Databricks SQL
    ↓  Results
```
Advantages:
- No duplication: Query source data directly
- Real-time: Always fresh data
- No sync: One less thing to break
- Cost-effective: No second database
- Unified governance: Single data platform
Billion-Scale Relationships: Triple Stores in Delta
The Problem with Graph Databases (OLTP) at Scale
When you have billions of relationships:
- Memory limits: Graph must fit in RAM for good performance
- Vertical scaling: Limited by single-server resources
- Cost: Enterprise licenses + large EC2 instances = $$$$
- Backup/Recovery: GBs of graph data, long backup windows
- Version upgrades: Risky with large graphs
Triple Store in Delta Lake
Model relationships as triples in Delta:
```sql
CREATE TABLE relationships (
  subject_id STRING,              -- Source entity
  predicate  STRING,              -- Relationship type
  object_id  STRING,              -- Target entity
  properties MAP<STRING, STRING>,
  timestamp  TIMESTAMP,
  _partition DATE GENERATED ALWAYS AS (DATE(timestamp))
) PARTITIONED BY (_partition);
```
Advantages:
- Horizontal scale: Petabytes, billions of rows, no problem
- Cost-effective: S3 storage ($0.023/GB) vs RAM ($10+/GB)
- Time travel: Delta Lake versioning = free audit trail
- Schema evolution: Add properties without downtime
- ACID guarantees: Delta Lake transactions
- Z-ordering: `OPTIMIZE relationships ZORDER BY (subject_id, predicate)` for fast lookups
- Liquid clustering: auto-optimize hot paths
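As a standalone illustration (not project code), traversal over such a triple table is just a self-join on `subject_id`/`object_id`. Here SQLite stands in for Delta Lake, using the column names from the DDL above with the property and timestamp columns omitted:

```python
import sqlite3

# Minimal triple store: a two-hop traversal is a single self-join.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relationships (subject_id TEXT, predicate TEXT, object_id TEXT)"
)
conn.executemany(
    "INSERT INTO relationships VALUES (?, ?, ?)",
    [("acct:1", "PAID", "acct:2"),
     ("acct:2", "PAID", "acct:3"),
     ("acct:1", "KNOWS", "acct:4")],
)

# Two-hop PAID chain: whom did acct:1's payees pay?
rows = conn.execute("""
    SELECT a.subject_id, b.object_id
    FROM relationships a
    JOIN relationships b ON a.object_id = b.subject_id
    WHERE a.predicate = 'PAID' AND b.predicate = 'PAID'
      AND a.subject_id = 'acct:1'
""").fetchall()
print(rows)  # [('acct:1', 'acct:3')]
```

On Databricks, the same join shape benefits from Z-ordering on `(subject_id, predicate)`, since each hop probes those two columns.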
LLMs + Transpilers: Enterprise Governance
The Problem: In enterprise environments, someone must be accountable for queries before execution — even with LLM text-to-query.
Why Transpilers Matter
1. Reviewability: Graph queries are 4-5 lines vs hundreds of SQL lines
```cypher
// 5 lines of Cypher
MATCH (c:Customer)-[:TRANSACTION*1..3]->(m:Merchant)
WHERE m.risk_score > 0.9
RETURN c.id, COUNT(*) AS risky_tx
ORDER BY risky_tx DESC
LIMIT 100
```
vs 150+ lines of recursive SQL. Easier for humans to review and approve.
Transpilers turn LLM outputs into governable, auditable, human-reviewable queries.
Quick Start
Installation
```bash
pip install gsql2rsql

# Or from source:
git clone https://github.com/devmessias/gsql2rsql
cd gsql2rsql/python
uv pip install -e .
```
Your First Query
Example: Find fraud networks using BFS (Breadth-First Search) up to depth 4, starting from a suspicious account and ignoring social relationships.
```python
from gsql2rsql.parser.opencypher_parser import OpenCypherParser
from gsql2rsql.planner.logical_plan import LogicalPlan
from gsql2rsql.renderer.sql_renderer import SQLRenderer
from gsql2rsql.common.schema import (
    SimpleGraphSchemaProvider, NodeSchema, EdgeSchema, EntityProperty,
)
from gsql2rsql.renderer.schema_provider import SimpleSQLSchemaProvider, SQLTableDescriptor

# 1. Define graph schema (for the logical planner)
graph_schema = SimpleGraphSchemaProvider()

# Person node
person = NodeSchema(
    name="Person",
    properties=[
        EntityProperty(property_name="id", data_type=int),
        EntityProperty(property_name="name", data_type=str),
        EntityProperty(property_name="risk_score", data_type=float),
    ],
    node_id_property=EntityProperty(property_name="id", data_type=int),
)
graph_schema.add_node(person)

# Multiple edge types - we'll only query TRANSACAO_SUSPEITA.
# AMIGOS and FAMILIARES are in the schema but ignored by the query.
amigos = EdgeSchema(
    name="AMIGOS",
    source_node_id="Person",
    sink_node_id="Person",
    source_id_property=EntityProperty(property_name="person1_id", data_type=int),
    sink_id_property=EntityProperty(property_name="person2_id", data_type=int),
    properties=[],
)
familiares = EdgeSchema(
    name="FAMILIARES",
    source_node_id="Person",
    sink_node_id="Person",
    source_id_property=EntityProperty(property_name="person1_id", data_type=int),
    sink_id_property=EntityProperty(property_name="person2_id", data_type=int),
    properties=[],
)
transacao_suspeita = EdgeSchema(
    name="TRANSACAO_SUSPEITA",
    source_node_id="Person",
    sink_node_id="Person",
    source_id_property=EntityProperty(property_name="origem_id", data_type=int),
    sink_id_property=EntityProperty(property_name="destino_id", data_type=int),
    properties=[
        EntityProperty(property_name="valor", data_type=float),
        EntityProperty(property_name="timestamp", data_type=str),
    ],
)
graph_schema.add_edge(amigos)
graph_schema.add_edge(familiares)
graph_schema.add_edge(transacao_suspeita)

# 2. Define SQL schema (maps to Delta tables)
sql_schema = SimpleSQLSchemaProvider()
sql_schema.add_node(
    person,
    SQLTableDescriptor(
        table_name="fraud.person",  # Databricks catalog.schema.table
        node_id_columns=["id"],
    ),
)
sql_schema.add_edge(
    amigos,
    SQLTableDescriptor(
        entity_id="Person@AMIGOS@Person",
        table_name="fraud.amigos",
    ),
)
sql_schema.add_edge(
    familiares,
    SQLTableDescriptor(
        entity_id="Person@FAMILIARES@Person",
        table_name="fraud.familiares",
    ),
)
sql_schema.add_edge(
    transacao_suspeita,
    SQLTableDescriptor(
        entity_id="Person@TRANSACAO_SUSPEITA@Person",
        table_name="fraud.transacao_suspeita",
    ),
)

# 3. BFS query: find the fraud network up to depth 4 from a suspicious root
# account, traversing only TRANSACAO_SUSPEITA edges (AMIGOS and FAMILIARES
# are ignored).
query = """
MATCH path = (origem:Person {id: 12345})-[:TRANSACAO_SUSPEITA*1..4]->(destino:Person)
RETURN
    origem.id AS origem_id,
    origem.name AS origem_name,
    destino.id AS destino_id,
    destino.name AS destino_name,
    destino.risk_score AS destino_risk_score,
    length(path) AS profundidade
ORDER BY profundidade, destino.risk_score DESC
LIMIT 100
"""

# 4. Transpile to SQL with WITH RECURSIVE (for the BFS traversal)
parser = OpenCypherParser()
renderer = SQLRenderer(db_schema_provider=sql_schema)

ast = parser.parse(query)
plan = LogicalPlan.process_query_tree(ast, graph_schema)
plan.resolve(original_query=query)
sql = renderer.render_plan(plan)
print(sql)

# 5. Execute on Databricks
# df = spark.sql(sql)
# df.show(100, truncate=False)
```
Output: Databricks SQL with JOINs, WHERE filters, ORDER BY, and LIMIT — ready to execute on Delta Lake.
Features
- ✅ Variable-length paths (`*1..N`) via `WITH RECURSIVE`
- ✅ Undirected relationships (`-[:REL]-`)
- ✅ Path functions (`length()`, `nodes()`, `relationships()`)
- ✅ Aggregations (`COUNT`, `SUM`, `COLLECT`, etc.)
- ✅ Filter pushdown (optimizes Delta scans)
- ✅ WITH clauses (multi-stage composition)
- ✅ UNION, OPTIONAL MATCH, CASE, DISTINCT
See full feature list.
Documentation
- 📘 Installation & Quick Start
- 🎯 Examples Gallery (69 queries)
- 🏗️ Architecture
- 🤝 Contributing
Development
```bash
# Setup
uv sync --extra dev
uv pip install -e ".[dev]"

# Tests
make test-no-pyspark  # Fast (no Spark dependency)
make test-pyspark     # Full validation with PySpark

# Lint & Format
make lint
make format
make typecheck
```
See CONTRIBUTING.md for conventional commits and release process.
Requirements
- Python 3.12+
- Databricks Runtime 15.0+ (for `WITH RECURSIVE`)
- PySpark (optional, only for development/testing)
See full limitations.
Contributing
This is an open hobby project — contributions are very welcome!
- Bugs: Open an issue
- Features: Discuss in Discussions
- PRs: Follow conventional commits
License
MIT License - see LICENSE.
Acknowledgments
- Microsoft's openCypherTranspiler (T-SQL) for inspiration
- OpenCypher community for the graph query language
- ANTLR for parser generation
- Databricks for Delta Lake + Spark SQL + `WITH RECURSIVE` support
Author
Bruno Messias LinkedIn | GitHub
Status: Active development | Version: 0.1.0 (Alpha) | Python: 3.12+