Graph-based schema analysis for text-to-SQL. Build schema graphs, rank relevant nodes with Personalized PageRank, and extract LLM-ready context. No LLM dependency — bring your own model.
Project description
graph2sql
Graph-based schema analysis for text-to-SQL.
Build a schema graph (tables and fields as nodes, relationships as edges), rank the most relevant nodes for a natural language question using Personalized PageRank, and extract structured context ready to pass to any LLM for SQL generation.
No LLM included. Bring your own model.
How it works
flowchart LR
A["🗣️ Natural language\nquestion"] --> C
B["🗄️ Schema graph\ntables · columns · relationships"] --> C
C["Personalized\nPageRank"] --> D["Ranked subgraph\ntop-k nodes + 1-hop neighbours"]
D --> E["Structured\ncontext"]
E --> F["Your LLM"]
F --> G["✅ SQL"]
style C fill:#6366f1,color:#fff,stroke:none
style F fill:#10b981,color:#fff,stroke:none
style G fill:#f59e0b,color:#fff,stroke:none
Instead of dumping the entire schema into an LLM prompt, graph2sql identifies which tables and relationships are most relevant to the question — reducing noise and improving SQL accuracy.
Install
pip install graph2sql
Or from source:
git clone https://github.com/jw-open/graph2sql
cd graph2sql
pip install -e ".[dev]"
Quick start
from graph2sql import SchemaGraph
# Build the schema graph
graph = SchemaGraph()
graph.add_node("users", "users", content="id INT, name VARCHAR, email VARCHAR, country VARCHAR")
graph.add_node("orders", "orders", content="id INT, customer_id INT, total DECIMAL, created_at TIMESTAMP")
graph.add_node("products", "products", content="id INT, name VARCHAR, price DECIMAL, category VARCHAR")
graph.add_node("order_items", "order_items", content="id INT, order_id INT, product_id INT, quantity INT")
graph.add_edge("orders", "users", "belongs_to")
graph.add_edge("order_items", "orders", "belongs_to")
graph.add_edge("order_items", "products", "references")
# Rank nodes for a natural language question
context = graph.rank("total revenue by customer last month", k=3)
# Pass context to your LLM
print(context)
# {
# "nodes": [
# {"label": "orders", "content": "...", "score": 0.312, ...},
# {"label": "users", "content": "...", "score": 0.198, ...},
# ...
# ],
# "edges": [
# {"from": "orders", "to": "users", "label": "belongs_to"},
# ...
# ]
# }
Load from an existing dict
graph = SchemaGraph.from_dict({
"nodes": [
{"id": "users", "label": "users", "content": "id, name, email"},
{"id": "orders", "label": "orders", "content": "id, customer_id, total"},
],
"edges": [
{"from": "orders", "to": "users", "label": "belongs_to"}
]
})
context = graph.rank("how many orders per user")
Graph schema
Node
| Field | Type | Required | Description |
|---|---|---|---|
id |
str | yes | Unique identifier |
label |
str | yes | Table or column name — used for query token matching |
content |
str | no | Column definitions, constraints, DDL, or notes |
attributes |
dict | no | Typed metadata — see conventions below |
Node attribute conventions
Use attributes to describe what a node represents. The algorithm also checks attribute string values for query token matching (useful for aliases).
# Table node
graph.add_node("orders", "orders",
content="id INT PK, customer_id INT FK, total DECIMAL, created_at TIMESTAMP",
attributes={
"type": "table",
"database": "mysql", # mysql | postgres | sqlite | mongodb | etc.
"schema": "public", # schema/namespace if applicable
"alias": "transactions", # alternative names matched against queries
"primary_key": "id",
}
)
# Column node
graph.add_node("orders.total", "total",
content="DECIMAL(10,2) — order grand total including tax",
attributes={
"type": "column",
"table": "orders",
"data_type": "DECIMAL",
"nullable": "false",
"alias": "revenue amount", # matched against queries
}
)
# View or virtual table
graph.add_node("monthly_revenue", "monthly_revenue",
content="SELECT DATE_TRUNC('month', created_at), SUM(total) FROM orders GROUP BY 1",
attributes={
"type": "view",
"database": "postgres",
}
)
Recognised type values (convention, not enforced):
| Value | Meaning |
|---|---|
"table" |
A physical database table |
"column" |
A column within a table |
"view" |
A database view or virtual table |
"index" |
An index definition |
"schema" |
A database schema/namespace grouping |
Recognised database values (convention, not enforced):
"mysql", "postgres", "sqlite", "mongodb", "bigquery", "snowflake", "redshift", "mssql", "oracle"
Any additional attributes are valid — they are stored as-is and passed through to the LLM context.
Edge
| Field | Type | Description |
|---|---|---|
"from" |
str | Source node id |
"to" |
str | Target node id |
"label" |
str | Relationship type — see conventions below |
Edge schema
Edges support an optional attributes dict for richer relationship metadata:
# 1:N — one user has many orders
graph.add_edge("orders", "users", "belongs_to")
# or with attributes:
{
"from": "orders",
"to": "users",
"label": "belongs_to",
"attributes": {
"cardinality": "many_to_one", # one_to_one | one_to_many | many_to_one | many_to_many
"on_delete": "CASCADE",
"nullable": "false",
"join": "orders.customer_id = users.id", # hint for LLM SQL generation
}
}
# M:N — orders ↔ products via order_items
{
"from": "orders",
"to": "products",
"label": "many_to_many",
"attributes": {
"cardinality": "many_to_many",
"via": "order_items", # junction table
"join": "orders.id = order_items.order_id AND order_items.product_id = products.id",
}
}
cardinality values:
| Value | Meaning |
|---|---|
"one_to_one" |
1:1 |
"one_to_many" |
1:N (parent → children) |
"many_to_one" |
N:1 (child → parent, most FK relationships) |
"many_to_many" |
M:N (requires a junction table) |
Edge label conventions:
| Label | Meaning |
|---|---|
"foreign_key" |
Standard FK relationship between tables |
"belongs_to" |
Child → parent (N:1) |
"has_many" |
Parent → children (1:N) |
"many_to_many" |
M:N relationship (use attributes.via for junction table) |
"one_to_one" |
1:1 relationship |
"column_of" |
Column node → its parent table |
"references" |
Looser reference between any two nodes |
"related_to" |
Semantic relationship (no strict FK) |
The join attribute is especially useful — it gives the LLM the exact JOIN condition to use rather than inferring it.
API reference
SchemaGraph
SchemaGraph()
SchemaGraph.from_dict(graph: dict) -> SchemaGraph
graph.add_node(id, label, content=None, attributes=None) -> SchemaGraph
graph.add_edge(from_id, to_id, label) -> SchemaGraph
graph.rank(query, k=3, alpha=0.85) -> dict
graph.to_dict() -> dict
personalized_page_rank
Low-level function used internally by SchemaGraph.rank().
from graph2sql import personalized_page_rank
result = personalized_page_rank(
query="revenue by customer",
graph={"nodes": [...], "edges": [...]},
alpha=0.85, # damping factor
k=3, # top-k seed nodes
)
Run the example
python examples/ecommerce.py
Run tests
pip install -e ".[dev]"
pytest tests/
Design principles
- No LLM dependency — pure Python + numpy. Works with any model or no model at all.
- No database connection required — pass schema definitions as strings in
content. - Bring your own LLM —
rank()returns a plain dict you can serialize and inject into any prompt. - Decoupled from infra — no FastAPI, MongoDB, Redis, or cloud dependencies.
Known limitations
Token matching is exact. The PPR algorithm matches query words against node labels using exact token overlap — it does not perform stemming, fuzzy matching, or semantic similarity.
For example, a query containing "customer" will not match a node labeled "users".
Workaround: use attributes for aliases.
Add alternative names as attribute values on the node. The algorithm also checks all attribute values for token matches:
graph.add_node(
"users",
"users",
content="id, name, email",
attributes={
"alias": "customers",
"also_known_as": "clients members",
}
)
# Now "customers" and "clients" in a query will match this node
context = graph.rank("total revenue by customers")
Supported attribute patterns:
alias— primary alternative namealso_known_as— space-separated synonymsrelated_to— domain terms associated with this tableassociated_with— any custom terms relevant to queries
Any string attribute value is tokenized and matched — the key name is not significant.
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file graph2sql-0.1.0.tar.gz.
File metadata
- Download URL: graph2sql-0.1.0.tar.gz
- Upload date:
- Size: 16.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4162df391a1394a6213449110adc28b09f922671ff34b7d282c0a05c935eeaf3
|
|
| MD5 |
936c36e7bf1c7d9460d219be239aa4be
|
|
| BLAKE2b-256 |
6a09a7a1cbd0734a00f32100081fb7fd36299ab1e62f6ea674213d9a84f9ad20
|
File details
Details for the file graph2sql-0.1.0-py3-none-any.whl.
File metadata
- Download URL: graph2sql-0.1.0-py3-none-any.whl
- Upload date:
- Size: 14.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
dde522743e6a8f531cc52cf41fea50e8bdb832aa3b74f94adcbd702cefb3510d
|
|
| MD5 |
36a0db57f2b916506b16e3b7ea93e642
|
|
| BLAKE2b-256 |
5f5e857a4a6ce050b76a6a46c4eb56c99f89366b41c2419d01e28ea403a499d1
|