Graph-based schema analysis for text-to-SQL. Build schema graphs, rank relevant nodes with Personalized PageRank, and extract LLM-ready context. No LLM dependency — bring your own model.

These details have not been verified by PyPI

Project links

Repository

Project description

graph2sql

Graph-based schema analysis for text-to-SQL.

Build a schema graph (tables and fields as nodes, relationships as edges), rank the most relevant nodes for a natural language question using Personalized PageRank, and extract structured context ready to pass to any LLM for SQL generation.

No LLM included. Bring your own model.

How it works

flowchart LR
    A["🗣️ Natural language\nquestion"] --> C
    B["🗄️ Schema graph\ntables · columns · relationships"] --> C
    C["Personalized\nPageRank"] --> D["Ranked subgraph\ntop-k nodes + 1-hop neighbours"]
    D --> E["Structured\ncontext"]
    E --> F["Your LLM"]
    F --> G["✅ SQL"]

    style C fill:#6366f1,color:#fff,stroke:none
    style F fill:#10b981,color:#fff,stroke:none
    style G fill:#f59e0b,color:#fff,stroke:none

Instead of dumping the entire schema into an LLM prompt, graph2sql identifies which tables and relationships are most relevant to the question — reducing noise and improving SQL accuracy.

Install

pip install graph2sql

Or from source:

git clone https://github.com/jw-open/graph2sql
cd graph2sql
pip install -e ".[dev]"

Quick start

from graph2sql import SchemaGraph

# Build the schema graph
graph = SchemaGraph()

graph.add_node("users",       "users",       content="id INT, name VARCHAR, email VARCHAR, country VARCHAR")
graph.add_node("orders",      "orders",      content="id INT, customer_id INT, total DECIMAL, created_at TIMESTAMP")
graph.add_node("products",    "products",    content="id INT, name VARCHAR, price DECIMAL, category VARCHAR")
graph.add_node("order_items", "order_items", content="id INT, order_id INT, product_id INT, quantity INT")

graph.add_edge("orders",      "users",    "belongs_to")
graph.add_edge("order_items", "orders",   "belongs_to")
graph.add_edge("order_items", "products", "references")

# Rank nodes for a natural language question
context = graph.rank("total revenue by customer last month", k=3)

# Pass context to your LLM
print(context)
# {
#   "nodes": [
#     {"label": "orders", "content": "...", "score": 0.312, ...},
#     {"label": "users",  "content": "...", "score": 0.198, ...},
#     ...
#   ],
#   "edges": [
#     {"from": "orders", "to": "users", "label": "belongs_to"},
#     ...
#   ]
# }

Load from an existing dict

graph = SchemaGraph.from_dict({
    "nodes": [
        {"id": "users",  "label": "users",  "content": "id, name, email"},
        {"id": "orders", "label": "orders", "content": "id, customer_id, total"},
    ],
    "edges": [
        {"from": "orders", "to": "users", "label": "belongs_to"}
    ]
})

context = graph.rank("how many orders per user")

Graph schema

Node

Field	Type	Required	Description
`id`	str	yes	Unique identifier
`label`	str	yes	Table or column name — used for query token matching
`content`	str	no	Column definitions, constraints, DDL, or notes
`attributes`	dict	no	Typed metadata — see conventions below

Node attribute conventions

Use attributes to describe what a node represents. The algorithm also checks attribute string values for query token matching (useful for aliases).

# Table node
graph.add_node("orders", "orders",
    content="id INT PK, customer_id INT FK, total DECIMAL, created_at TIMESTAMP",
    attributes={
        "type": "table",
        "database": "mysql",          # mysql | postgres | sqlite | mongodb | etc.
        "schema": "public",           # schema/namespace if applicable
        "alias": "transactions",      # alternative names matched against queries
        "primary_key": "id",
    }
)

# Column node
graph.add_node("orders.total", "total",
    content="DECIMAL(10,2) — order grand total including tax",
    attributes={
        "type": "column",
        "table": "orders",
        "data_type": "DECIMAL",
        "nullable": "false",
        "alias": "revenue amount",    # matched against queries
    }
)

# View or virtual table
graph.add_node("monthly_revenue", "monthly_revenue",
    content="SELECT DATE_TRUNC('month', created_at), SUM(total) FROM orders GROUP BY 1",
    attributes={
        "type": "view",
        "database": "postgres",
    }
)

Recognised type values (convention, not enforced):

Value	Meaning
`"table"`	A physical database table
`"column"`	A column within a table
`"view"`	A database view or virtual table
`"index"`	An index definition
`"schema"`	A database schema/namespace grouping

Recognised database values (convention, not enforced): "mysql", "postgres", "sqlite", "mongodb", "bigquery", "snowflake", "redshift", "mssql", "oracle"

Any additional attributes are valid — they are stored as-is and passed through to the LLM context.

Edge

Field	Type	Description
`"from"`	str	Source node id
`"to"`	str	Target node id
`"label"`	str	Relationship type — see conventions below

Edge schema

Edges support an optional attributes dict for richer relationship metadata:

# 1:N — one user has many orders
graph.add_edge("orders", "users", "belongs_to")
# or with attributes:
{
    "from": "orders",
    "to": "users",
    "label": "belongs_to",
    "attributes": {
        "cardinality": "many_to_one",   # one_to_one | one_to_many | many_to_one | many_to_many
        "on_delete": "CASCADE",
        "nullable": "false",
        "join": "orders.customer_id = users.id",   # hint for LLM SQL generation
    }
}

# M:N — orders ↔ products via order_items
{
    "from": "orders",
    "to": "products",
    "label": "many_to_many",
    "attributes": {
        "cardinality": "many_to_many",
        "via": "order_items",           # junction table
        "join": "orders.id = order_items.order_id AND order_items.product_id = products.id",
    }
}

cardinality values:

Value	Meaning
`"one_to_one"`	1:1
`"one_to_many"`	1:N (parent → children)
`"many_to_one"`	N:1 (child → parent, most FK relationships)
`"many_to_many"`	M:N (requires a junction table)

Edge label conventions:

Label	Meaning
`"foreign_key"`	Standard FK relationship between tables
`"belongs_to"`	Child → parent (N:1)
`"has_many"`	Parent → children (1:N)
`"many_to_many"`	M:N relationship (use `attributes.via` for junction table)
`"one_to_one"`	1:1 relationship
`"column_of"`	Column node → its parent table
`"references"`	Looser reference between any two nodes
`"related_to"`	Semantic relationship (no strict FK)

The join attribute is especially useful — it gives the LLM the exact JOIN condition to use rather than inferring it.

API reference

`SchemaGraph`

SchemaGraph()
SchemaGraph.from_dict(graph: dict) -> SchemaGraph
graph.add_node(id, label, content=None, attributes=None) -> SchemaGraph
graph.add_edge(from_id, to_id, label) -> SchemaGraph
graph.rank(query, k=3, alpha=0.85) -> dict
graph.to_dict() -> dict

`personalized_page_rank`

Low-level function used internally by SchemaGraph.rank().

from graph2sql import personalized_page_rank

result = personalized_page_rank(
    query="revenue by customer",
    graph={"nodes": [...], "edges": [...]},
    alpha=0.85,   # damping factor
    k=3,          # top-k seed nodes
)

Run the example

python examples/ecommerce.py

Run tests

pip install -e ".[dev]"
pytest tests/

Design principles

No LLM dependency — pure Python + numpy. Works with any model or no model at all.
No database connection required — pass schema definitions as strings in content.
Bring your own LLM — rank() returns a plain dict you can serialize and inject into any prompt.
Decoupled from infra — no FastAPI, MongoDB, Redis, or cloud dependencies.

Known limitations

Token matching is exact. The PPR algorithm matches query words against node labels using exact token overlap — it does not perform stemming, fuzzy matching, or semantic similarity.

For example, a query containing "customer" will not match a node labeled "users".

Workaround: use attributes for aliases.

Add alternative names as attribute values on the node. The algorithm also checks all attribute values for token matches:

graph.add_node(
    "users",
    "users",
    content="id, name, email",
    attributes={
        "alias": "customers",
        "also_known_as": "clients members",
    }
)

# Now "customers" and "clients" in a query will match this node
context = graph.rank("total revenue by customers")

Supported attribute patterns:

alias — primary alternative name
also_known_as — space-separated synonyms
related_to — domain terms associated with this table
associated_with — any custom terms relevant to queries

Any string attribute value is tokenized and matched — the key name is not significant.

License

MIT

Project details

These details have not been verified by PyPI

Project links

Repository

Release history Release notifications | RSS feed

This version

0.1.0

Apr 6, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

graph2sql-0.1.0.tar.gz (16.9 kB view details)

Uploaded Apr 6, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

graph2sql-0.1.0-py3-none-any.whl (14.6 kB view details)

Uploaded Apr 6, 2026 Python 3

File details

Details for the file graph2sql-0.1.0.tar.gz.

File metadata

Download URL: graph2sql-0.1.0.tar.gz
Upload date: Apr 6, 2026
Size: 16.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for graph2sql-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`4162df391a1394a6213449110adc28b09f922671ff34b7d282c0a05c935eeaf3`
MD5	`936c36e7bf1c7d9460d219be239aa4be`
BLAKE2b-256	`6a09a7a1cbd0734a00f32100081fb7fd36299ab1e62f6ea674213d9a84f9ad20`

See more details on using hashes here.

File details

Details for the file graph2sql-0.1.0-py3-none-any.whl.

File metadata

Download URL: graph2sql-0.1.0-py3-none-any.whl
Upload date: Apr 6, 2026
Size: 14.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for graph2sql-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dde522743e6a8f531cc52cf41fea50e8bdb832aa3b74f94adcbd702cefb3510d`
MD5	`36a0db57f2b916506b16e3b7ea93e642`
BLAKE2b-256	`5f5e857a4a6ce050b76a6a46c4eb56c99f89366b41c2419d01e28ea403a499d1`

See more details on using hashes here.

graph2sql 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

graph2sql

How it works

Install

Quick start

Load from an existing dict

Graph schema

Node

Node attribute conventions

Edge

Edge schema

API reference

SchemaGraph

personalized_page_rank

Run the example

Run tests

Design principles

Known limitations

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`SchemaGraph`

`personalized_page_rank`