
graph2sql


Graph-based schema analysis for text-to-SQL.

Build a schema graph (tables and fields as nodes, relationships as edges), rank the most relevant nodes for a natural language question using Personalized PageRank, and extract structured context ready to pass to any LLM for SQL generation.

No LLM included. Bring your own model.


How it works

flowchart LR
    A["🗣️ Natural language\nquestion"] --> C
    B["🗄️ Schema graph\ntables · columns · relationships"] --> C
    C["Personalized\nPageRank"] --> D["Ranked subgraph\ntop-k nodes + 1-hop neighbours"]
    D --> E["Structured\ncontext"]
    E --> F["Your LLM"]
    F --> G["✅ SQL"]

    style C fill:#6366f1,color:#fff,stroke:none
    style F fill:#10b981,color:#fff,stroke:none
    style G fill:#f59e0b,color:#fff,stroke:none

Instead of dumping the entire schema into an LLM prompt, graph2sql identifies which tables and relationships are most relevant to the question — reducing noise and improving SQL accuracy.
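The ranking step can be sketched in plain Python. This is an illustrative power-iteration Personalized PageRank over the quick-start schema, not the library's internal implementation — edges are treated as undirected here, which is an assumption:

```python
# Toy schema graph mirroring the quick-start example.
edges = [("orders", "users"), ("order_items", "orders"), ("order_items", "products")]
nodes = ["users", "orders", "products", "order_items"]

# Assumption: edges are treated as undirected for ranking.
neighbours = {n: set() for n in nodes}
for a, b in edges:
    neighbours[a].add(b)
    neighbours[b].add(a)

def personalized_pagerank(seeds, alpha=0.85, iters=50):
    # Personalization vector: restart probability mass only on seed nodes
    # (the nodes whose labels matched the query).
    p = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    r = dict(p)
    for _ in range(iters):
        nxt = {}
        for n in nodes:
            # Mass flowing in from neighbours, plus teleport back to seeds.
            inflow = sum(r[m] / len(neighbours[m]) for m in neighbours[n])
            nxt[n] = alpha * inflow + (1 - alpha) * p[n]
        r = nxt
    return r

# Seed on "orders", as if the query mentioned it. Structurally close tables
# (order_items, users) inherit score; distant ones (products) rank lower.
scores = personalized_pagerank({"orders"})
ranked = sorted(scores, key=scores.get, reverse=True)
```

The seed node keeps the highest score, and tables one hop away outrank those two hops away — which is exactly why the extracted subgraph stays small and relevant.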


Install

pip install graph2sql

Or from source:

git clone https://github.com/jw-open/graph2sql
cd graph2sql
pip install -e ".[dev]"

Quick start

from graph2sql import SchemaGraph

# Build the schema graph
graph = SchemaGraph()

graph.add_node("users",       "users",       content="id INT, name VARCHAR, email VARCHAR, country VARCHAR")
graph.add_node("orders",      "orders",      content="id INT, customer_id INT, total DECIMAL, created_at TIMESTAMP")
graph.add_node("products",    "products",    content="id INT, name VARCHAR, price DECIMAL, category VARCHAR")
graph.add_node("order_items", "order_items", content="id INT, order_id INT, product_id INT, quantity INT")

graph.add_edge("orders",      "users",    "belongs_to")
graph.add_edge("order_items", "orders",   "belongs_to")
graph.add_edge("order_items", "products", "references")

# Rank nodes for a natural language question
context = graph.rank("total revenue by customer last month", k=3)

# Pass context to your LLM
print(context)
# {
#   "nodes": [
#     {"label": "orders", "content": "...", "score": 0.312, ...},
#     {"label": "users",  "content": "...", "score": 0.198, ...},
#     ...
#   ],
#   "edges": [
#     {"from": "orders", "to": "users", "label": "belongs_to"},
#     ...
#   ]
# }

Load from an existing dict

graph = SchemaGraph.from_dict({
    "nodes": [
        {"id": "users",  "label": "users",  "content": "id, name, email"},
        {"id": "orders", "label": "orders", "content": "id, customer_id, total"},
    ],
    "edges": [
        {"from": "orders", "to": "users", "label": "belongs_to"}
    ]
})

context = graph.rank("how many orders per user")

Graph schema

Node

Field Type Required Description
id str yes Unique identifier
label str yes Table or column name — used for query token matching
content str no Column definitions, constraints, DDL, or notes
attributes dict no Typed metadata — see conventions below

Node attribute conventions

Use attributes to describe what a node represents. The algorithm also checks attribute string values for query token matching (useful for aliases).

# Table node
graph.add_node("orders", "orders",
    content="id INT PK, customer_id INT FK, total DECIMAL, created_at TIMESTAMP",
    attributes={
        "type": "table",
        "database": "mysql",          # mysql | postgres | sqlite | mongodb | etc.
        "schema": "public",           # schema/namespace if applicable
        "alias": "transactions",      # alternative names matched against queries
        "primary_key": "id",
    }
)

# Column node
graph.add_node("orders.total", "total",
    content="DECIMAL(10,2) — order grand total including tax",
    attributes={
        "type": "column",
        "table": "orders",
        "data_type": "DECIMAL",
        "nullable": "false",
        "alias": "revenue amount",    # matched against queries
    }
)

# View or virtual table
graph.add_node("monthly_revenue", "monthly_revenue",
    content="SELECT DATE_TRUNC('month', created_at), SUM(total) FROM orders GROUP BY 1",
    attributes={
        "type": "view",
        "database": "postgres",
    }
)

Recognised type values (convention, not enforced):

Value Meaning
"table" A physical database table
"column" A column within a table
"view" A database view or virtual table
"index" An index definition
"schema" A database schema/namespace grouping

Recognised database values (convention, not enforced): "mysql", "postgres", "sqlite", "mongodb", "bigquery", "snowflake", "redshift", "mssql", "oracle"

Any additional attributes are valid — they are stored as-is and passed through to the LLM context.

Edge

Field Type Description
"from" str Source node id
"to" str Target node id
"label" str Relationship type — see conventions below

Edge attributes

Edges support an optional attributes dict for richer relationship metadata:

# 1:N — one user has many orders
graph.add_edge("orders", "users", "belongs_to")
# or with attributes:
{
    "from": "orders",
    "to": "users",
    "label": "belongs_to",
    "attributes": {
        "cardinality": "many_to_one",   # one_to_one | one_to_many | many_to_one | many_to_many
        "on_delete": "CASCADE",
        "nullable": "false",
        "join": "orders.customer_id = users.id",   # hint for LLM SQL generation
    }
}

# M:N — orders ↔ products via order_items
{
    "from": "orders",
    "to": "products",
    "label": "many_to_many",
    "attributes": {
        "cardinality": "many_to_many",
        "via": "order_items",           # junction table
        "join": "orders.id = order_items.order_id AND order_items.product_id = products.id",
    }
}

cardinality values:

Value Meaning
"one_to_one" 1:1
"one_to_many" 1:N (parent → children)
"many_to_one" N:1 (child → parent, most FK relationships)
"many_to_many" M:N (requires a junction table)

Edge label conventions:

Label Meaning
"foreign_key" Standard FK relationship between tables
"belongs_to" Child → parent (N:1)
"has_many" Parent → children (1:N)
"many_to_many" M:N relationship (use attributes.via for junction table)
"one_to_one" 1:1 relationship
"column_of" Column node → its parent table
"references" Looser reference between any two nodes
"related_to" Semantic relationship (no strict FK)

The join attribute is especially useful — it gives the LLM the exact JOIN condition to use rather than inferring it.
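One way to put those hints to work (a sketch — the key names follow the edge examples above, but collecting them into the prompt is up to you):

```python
# Edge dicts as defined in the Edge attributes examples above.
edges = [
    {"from": "orders", "to": "users", "label": "belongs_to",
     "attributes": {"join": "orders.customer_id = users.id"}},
    {"from": "order_items", "to": "products", "label": "references",
     "attributes": {}},
]

# Collect explicit JOIN conditions; edges without a hint are skipped.
join_hints = [
    e["attributes"]["join"]
    for e in edges
    if e.get("attributes", {}).get("join")
]
```

Appending `join_hints` to the prompt gives the model concrete ON clauses instead of leaving it to guess column pairings.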


API reference

SchemaGraph

SchemaGraph()
SchemaGraph.from_dict(graph: dict) -> SchemaGraph
graph.add_node(id, label, content=None, attributes=None) -> SchemaGraph
graph.add_edge(from_id, to_id, label) -> SchemaGraph
graph.rank(query, k=3, alpha=0.85) -> dict
graph.to_dict() -> dict

personalized_page_rank

Low-level function used internally by SchemaGraph.rank().

from graph2sql import personalized_page_rank

result = personalized_page_rank(
    query="revenue by customer",
    graph={"nodes": [...], "edges": [...]},
    alpha=0.85,   # damping factor
    k=3,          # top-k seed nodes
)

Run the example

python examples/ecommerce.py

Run tests

pip install -e ".[dev]"
pytest tests/

Design principles

  • No LLM dependency — pure Python + numpy. Works with any model or no model at all.
  • No database connection required — pass schema definitions as strings in content.
  • Bring your own LLM — rank() returns a plain dict you can serialize and inject into any prompt.
  • Decoupled from infra — no FastAPI, MongoDB, Redis, or cloud dependencies.

Known limitations

Token matching is exact. The PPR algorithm matches query words against node labels using exact token overlap — it does not perform stemming, fuzzy matching, or semantic similarity.

For example, a query containing "customer" will not match a node labeled "users".

Workaround: use attributes for aliases.

Add alternative names as attribute values on the node. The algorithm also checks all attribute values for token matches:

graph.add_node(
    "users",
    "users",
    content="id, name, email",
    attributes={
        "alias": "customers",
        "also_known_as": "clients members",
    }
)

# Now "customers" and "clients" in a query will match this node
context = graph.rank("total revenue by customers")

Supported attribute patterns:

  • alias — primary alternative name
  • also_known_as — space-separated synonyms
  • related_to — domain terms associated with this table
  • associated_with — any custom terms relevant to queries

Any string attribute value is tokenized and matched — the key name is not significant.
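A sketch of how that matching behaves, assuming lowercase whitespace tokenization (the exact tokenizer is an implementation detail):

```python
def tokens(text):
    # Assumption: lowercase, whitespace-split, exact tokens only.
    return set(text.lower().split())

node = {
    "label": "users",
    "attributes": {"alias": "customers", "also_known_as": "clients members"},
}

def matches(query, node):
    # Query tokens are checked against the label and every attribute value.
    query_tokens = tokens(query)
    candidates = tokens(node["label"])
    for value in node["attributes"].values():
        candidates |= tokens(value)
    return bool(query_tokens & candidates)

hit = matches("total revenue by customers", node)   # matched via the alias
miss = matches("total revenue by customer", node)   # no stemming: "customer" != "customers"
```

Note the second case: because matching is exact, singular/plural variants must be listed explicitly among the aliases.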


License

MIT
