
Transpile Graph Query Language (openCypher) to Recursive SQL (Databricks)


gsql2rsql - OpenCypher to Databricks SQL Transpiler


gsql2rsql transpiles OpenCypher graph queries to Databricks SQL, enabling graph analytics on Delta Lake without a dedicated graph database.

Project Status: This is a hobby/research project being developed towards production quality. While it handles complex queries and includes comprehensive tests, it's not yet at enterprise scale. Contributions welcome!

Why This Project?

Inspiration: Microsoft's openCypherTranspiler

This project was inspired by Microsoft's openCypherTranspiler (now unmaintained), which transpiled OpenCypher to T-SQL (SQL Server).

Why a new transpiler? Two reasons:

  1. Databricks SQL is fundamentally different from T-SQL — WITH RECURSIVE, HOFs, and Delta Lake optimizations require different strategies
  2. Security-first architecture — gsql2rsql uses strict 4-phase separation of concerns for correctness:
    • Parser: Syntax only (no schema access)
    • Planner: Semantics only (builds logical operators)
    • Resolver: Validation only (schema checking, column resolution)
    • Renderer: Code generation only (intentionally "dumb" — no semantic decisions, just SQL generation)

This separation makes the transpiler easier to audit, test, and trust.

The game-changer: Databricks recently added WITH RECURSIVE support, unlocking variable-length path traversal (*1..N) in pure SQL.
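To make this concrete, here is a minimal sketch of the recursive-CTE pattern such a transpiler can emit for a bounded variable-length traversal. The table and column names (relationships, subject_id, object_id) are illustrative, not gsql2rsql's actual output:

-- Sketch: bounded traversal (*1..3) over an edge table
WITH RECURSIVE paths (src, dst, depth) AS (
  SELECT subject_id, object_id, 1
  FROM relationships
  UNION ALL
  SELECT p.src, r.object_id, p.depth + 1
  FROM paths p
  JOIN relationships r ON p.dst = r.subject_id
  WHERE p.depth < 3          -- upper bound on path length
)
SELECT * FROM paths;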

Databricks SQL Higher-Order Functions (HOFs)

Databricks SQL has native array manipulation via HOFs:

-- Transform array elements
SELECT transform(relationships, r -> r.amount) AS amounts
FROM fraud_paths

-- Filter complex conditions
SELECT filter(path, node -> node.risk_score > 0.8) AS risky_nodes
FROM customer_journeys

-- Aggregate with lambda
SELECT aggregate(
  transactions,
  0.0,
  (acc, t) -> acc + t.amount,
  acc -> acc
) AS total
FROM account_history

gsql2rsql leverages these HOFs for:

  • Path filtering: NONE(r IN relationships(path) WHERE r.suspicious)
  • Path aggregations: SUM(r IN rels WHERE r.amount > 1000)
  • Pattern matching: Complex nested conditions
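For example, a Cypher list predicate such as NONE(r IN relationships(path) WHERE r.suspicious) can map directly onto Databricks' exists/forall HOFs. A sketch, assuming the path's relationships have been collected into an array column named relationships:

-- NONE(r IN relationships(path) WHERE r.suspicious)
SELECT *
FROM fraud_paths
WHERE NOT exists(relationships, r -> r.suspicious);
-- equivalently: WHERE forall(relationships, r -> NOT r.suspicious)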

This makes Cypher → SQL transpilation much more natural.

Why Graph Queries on Delta Lake?

Delta Lake (Single Source)
     ↓ OpenCypher (via gsql2rsql)
Databricks SQL
     ↓ Results

Advantages:

  1. No duplication: Query source data directly
  2. Real-time: Always fresh data
  3. No sync: One less thing to break
  4. Cost-effective: No second database
  5. Unified governance: Single data platform

Billion-Scale Relationships: Triple Stores in Delta

The Problem with Graph Databases (OLTP) at Scale

When you have billions of relationships:

  • Memory limits: Graph must fit in RAM for good performance
  • Vertical scaling: Limited by single-server resources
  • Cost: Enterprise licenses + large EC2 instances = $$$$
  • Backup/Recovery: large graph snapshots mean long backup windows
  • Version upgrades: Risky with large graphs

Triple Store in Delta Lake

Model relationships as triples in Delta:

CREATE TABLE relationships (
  subject_id STRING,    -- Source entity
  predicate STRING,     -- Relationship type
  object_id STRING,     -- Target entity
  properties MAP<STRING, STRING>,
  timestamp TIMESTAMP,
  _partition DATE GENERATED ALWAYS AS (DATE(timestamp))
) PARTITIONED BY (_partition);

Advantages:

  1. Horizontal scale: Petabytes, billions of rows, no problem
  2. Cost-effective: S3 storage ($0.023/GB) vs RAM ($10+/GB)
  3. Time travel: Delta Lake versioning = free audit trail
  4. Schema evolution: Add properties without downtime
  5. ACID guarantees: Delta Lake transactions
  6. Z-ordering: OPTIMIZE table ZORDER BY (subject_id, predicate) for fast lookups
  7. Liquid clustering: Auto-optimize hot paths
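A fixed-length hop over this triple table is just a self-join. A sketch (the predicate value 'TRANSFERRED_TO' is a hypothetical relationship type):

-- Two-hop traversal as a self-join on the triple table
SELECT a.subject_id AS start_node, b.object_id AS end_node
FROM relationships a
JOIN relationships b
  ON a.object_id = b.subject_id
WHERE a.predicate = 'TRANSFERRED_TO'
  AND b.predicate = 'TRANSFERRED_TO';

Z-ordering or liquid clustering on (subject_id, predicate) keeps these join probes fast even at billions of rows.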

LLMs + Transpilers: Enterprise Governance

The Problem: In enterprise environments, someone must be accountable for queries before execution — even with LLM text-to-query.

Why Transpilers Matter

1. Reviewability: Graph queries are a handful of lines vs hundreds of lines of SQL

// 5 lines of Cypher
MATCH (c:Customer)-[:TRANSACTION*1..3]->(m:Merchant)
WHERE m.risk_score > 0.9
RETURN c.id, COUNT(*) AS risky_tx
ORDER BY risky_tx DESC
LIMIT 100

vs 150+ lines of recursive SQL. Easier for humans to review and approve.

Transpilers turn LLM outputs into governable, auditable, human-reviewable queries.

Quick Start

Installation

pip install gsql2rsql
# Or from source:
git clone https://github.com/devmessias/gsql2rsql
cd gsql2rsql/python
uv pip install -e .

Your First Query

from gsql2rsql.parser.opencypher_parser import OpenCypherParser
from gsql2rsql.planner.logical_plan import LogicalPlan
from gsql2rsql.renderer.sql_renderer import SQLRenderer
from gsql2rsql.planner.schema import DatabricksSchemaProvider, SimpleGraphSchemaProvider
from gsql2rsql.common.schema import NodeSchema, EdgeSchema, EntityProperty

# 1. Define schema (map graph to Delta tables)
schema = SimpleGraphSchemaProvider()

person = NodeSchema(
    name="Person",
    properties=[
        EntityProperty(property_name="id", data_type=int),
        EntityProperty(property_name="name", data_type=str),
        EntityProperty(property_name="age", data_type=int),
    ],
    node_id_property=EntityProperty(property_name="id", data_type=int)
)

company = NodeSchema(
    name="Company",
    properties=[
        EntityProperty(property_name="id", data_type=int),
        EntityProperty(property_name="name", data_type=str),
        EntityProperty(property_name="industry", data_type=str),
    ],
    node_id_property=EntityProperty(property_name="id", data_type=int)
)

works_at = EdgeSchema(
    name="WORKS_AT",
    source_node_id="Person",
    sink_node_id="Company",
    source_id_property=EntityProperty(property_name="person_id", data_type=int),
    sink_id_property=EntityProperty(property_name="company_id", data_type=int),
    properties=[EntityProperty(property_name="since", data_type=int)]
)

schema.add_node(person)
schema.add_node(company)
schema.add_edge(works_at)

# 2. Write Cypher query
query = """
MATCH (p:Person)-[:WORKS_AT]->(c:Company)
WHERE c.industry = 'Technology'
RETURN p.name, p.age, c.name AS company
ORDER BY p.age DESC
LIMIT 10
"""

# 3. Transpile to SQL
parser = OpenCypherParser()
schema_provider = DatabricksSchemaProvider(schema)
renderer = SQLRenderer(schema_provider)

ast = parser.parse(query)
plan = LogicalPlan.from_ast(ast, schema)
plan.resolve(query)
sql = renderer.render_plan(plan)

print(sql)

# 4. Execute on Databricks
# spark.sql(sql).show()

Output: Databricks SQL with JOINs, WHERE filters, ORDER BY, and LIMIT — ready to execute on Delta Lake.
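The generated SQL has roughly this shape (illustrative only; gsql2rsql's actual output differs in aliasing and structure):

-- Sketch of the transpiled query, assuming tables person, company, works_at
SELECT p.name, p.age, c.name AS company
FROM person p
JOIN works_at w ON p.id = w.person_id
JOIN company c ON w.company_id = c.id
WHERE c.industry = 'Technology'
ORDER BY p.age DESC
LIMIT 10;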

Features

  • Variable-length paths (*1..N) via WITH RECURSIVE
  • Undirected relationships (-[:REL]-)
  • Path functions (length(), nodes(), relationships())
  • Aggregations (COUNT, SUM, COLLECT, etc.)
  • Filter pushdown (optimizes Delta scans)
  • WITH clauses (multi-stage composition)
  • UNION, OPTIONAL MATCH, CASE, DISTINCT

See full feature list.

Documentation

Development

# Setup
uv sync --extra dev
uv pip install -e ".[dev]"

# Tests
make test-no-pyspark   # Fast (no Spark dependency)
make test-pyspark      # Full validation with PySpark

# Lint & Format
make lint
make format
make typecheck

See CONTRIBUTING.md for conventional commits and release process.

Requirements

  • Python 3.12+
  • Databricks Runtime 15.0+ (for WITH RECURSIVE)
  • PySpark (optional, only for development/testing)

See full limitations.

Contributing

This is an open hobby project — contributions are very welcome!

License

MIT License - see LICENSE.

Acknowledgments

Author

Bruno Messias LinkedIn | GitHub


Status: Active development | Version: 0.1.0 (Alpha) | Python: 3.12+
