Unified SQLAlchemy helpers for Apache AGE + pgvector + BM25/FTS + hybrid search.

age_search

A unified SQLAlchemy extension that combines:

  • Apache AGE → graph traversal (Cypher)
  • pgvector → semantic vector search (cosine, HNSW / IVFFLAT)
  • Postgres FTS → built-in full-text search
  • BM25 (pg_search / ParadeDB) → high-quality lexical ranking
  • Hybrid search → lexical + semantic fusion
  • Graph-constrained search → search + expand via graph topology

All inside one Postgres database, one SQLAlchemy session, one transaction model.

This package does not try to pretend graphs are tables. Instead, it gives you clean primitives that compose.


Why this exists

Most systems need all of the following at once:

  • semantic similarity (embeddings)
  • keyword relevance (BM25 / FTS)
  • graph structure (relationships, hops, communities)
  • transactional consistency
  • deployability (migrations, pooling, ORM)

Postgres already supports all of this, but the integration story is painful.

This package provides:

  • safe engine/session setup
  • sane defaults
  • index + migration helpers
  • ORM-friendly APIs
  • zero magic that fights SQLAlchemy internals

Core design principles

  1. Relational tables own the data
  2. AGE owns topology
  3. Vectors stay in tables
  4. Graph nodes reference table primary keys
  5. Hybrid search is explicit and debuggable
  6. Everything works under normal SQLAlchemy pooling

Installation

pip install -e .

Dependencies:

  • Python ≥ 3.10
  • SQLAlchemy ≥ 2.0
  • psycopg 3 (the psycopg package)
  • pgvector
  • Apache AGE installed server-side
  • Optional: pg_search (BM25)

Engine setup (IMPORTANT)

AGE requires per-connection initialization.

Always create your engine using:

from age_search import create_engine_all_in_one

engine = create_engine_all_in_one(
    DATABASE_URL,
    graph_name="knowledge_graph",
)

On each new connection, this automatically:

  • registers pgvector adapters
  • runs LOAD 'age'
  • sets search_path = ag_catalog, public

Because this setup runs on each new connection, it is safe under connection pooling.
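For intuition, here is a minimal sketch of what per-connection initialization can look like. The helper name init_age_connection and the fake connection are hypothetical and only illustrate the idea; this is not the package's implementation, and create_engine_all_in_one remains the entry point to use. With plain SQLAlchemy, a hook of this shape would typically be registered with event.listen(engine, "connect", init_age_connection). The sketch temporarily enables autocommit so that the SET is not undone by a later rollback.

```python
def init_age_connection(dbapi_conn, connection_record=None):
    """Run AGE session setup on one raw DBAPI connection.

    Hypothetical helper for illustration only. The two-argument
    signature matches what SQLAlchemy passes to a "connect" listener.
    """
    previous = dbapi_conn.autocommit
    dbapi_conn.autocommit = True  # keep SET from being rolled back later
    try:
        cur = dbapi_conn.cursor()
        try:
            # AGE has to be loaded in every new backend session.
            cur.execute("LOAD 'age'")
            # ag_catalog first, so cypher() and agtype resolve unqualified.
            cur.execute("SET search_path = ag_catalog, public")
        finally:
            cur.close()
    finally:
        dbapi_conn.autocommit = previous


# Stand-in objects so the sketch runs without a database.
class _FakeCursor:
    def __init__(self, log):
        self._log = log

    def execute(self, sql):
        self._log.append(sql)

    def close(self):
        pass


class FakeConnection:
    def __init__(self):
        self.autocommit = False
        self.statements = []

    def cursor(self):
        return _FakeCursor(self.statements)


conn = FakeConnection()
init_age_connection(conn)
print(conn.statements)
```

The point of doing this in a connection-level hook is that every connection the pool opens gets the same setup, no matter which session later checks it out.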

Canonical Doc model

This is the reference model used throughout the examples.

from sqlalchemy.orm import Mapped, mapped_column
from sqlalchemy import Integer, Text

from age_search import (
    Base,
    GraphNodeMixin,
    VectorMixin,
    FTSSearchMixin,
    BM25SearchMixin,
    GraphRelationship,
)

class Doc(
    Base,
    GraphNodeMixin,
    VectorMixin,
    FTSSearchMixin,
    BM25SearchMixin,
):
    __tablename__ = "docs"

    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    content: Mapped[str] = mapped_column(Text, nullable=False)

    # Graph configuration
    graph_label = "Doc"
    graph_id_field = "id"
    vertex_property_key = "id"

    # Vector configuration
    vector_dim = 1536   # cosine by default

    # FTS
    fts_config = "english"

    # BM25
    bm25_key_field = "id"
    bm25_default_field = "content"

    # Graph relationships
    related = GraphRelationship("RELATED_TO", target_label="Doc")

Database initialization (one-time)

Create extensions, graph, and indexes with a single call.

init_db.py

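import os

from sqlalchemy import create_engine
from age_search.migrations import install_all, InstallSpec
from models.doc import Doc

# Assumption: the connection URL is supplied via the environment.
DATABASE_URL = os.environ["DATABASE_URL"]

engine = create_engine(DATABASE_URL)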

install_all(
    engine,
    models=[Doc],
    spec=InstallSpec(
        graph_name="knowledge_graph",
        enable_fts=True,
        enable_bm25=True,        # requires pg_search extension
        vector_index="hnsw",     # or "ivfflat"
        analyze_after=True,
    ),
)

print("Database initialized.")

This sets up:

  • age, vector, (optional) pg_search extensions
  • AGE graph
  • FTS GIN index
  • BM25 index
  • pgvector cosine index (HNSW or IVFFLAT)
  • fresh planner statistics, by running ANALYZE

Optional: auto-sync graph vertices

Keep AGE graph nodes in sync with ORM rows automatically.

from age_search.hooks import install_graph_sync
from models.doc import Doc

install_graph_sync(Doc)

Behavior:

  • insert/update → MERGE (Doc {id})
  • delete → DETACH DELETE

Safe under normal ORM usage.
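As a rough illustration of the statements such a sync could issue, the sketch below builds the SQL that wraps a Cypher MERGE or DETACH DELETE in AGE's cypher() function. The function names merge_vertex_sql and delete_vertex_sql are hypothetical, and the exact statements the package sends are an assumption; only the cypher(...) AS (... agtype) calling convention comes from AGE itself. Identifiers are validated and the id is cast to int because the values are interpolated into the query text.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check_ident(name):
    """Reject anything that is not a plain identifier."""
    if not _IDENT.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def merge_vertex_sql(graph_name, label, row_id):
    """SQL that creates the vertex for a row if it does not exist yet."""
    graph, label = _check_ident(graph_name), _check_ident(label)
    return (
        f"SELECT * FROM cypher('{graph}', "
        f"$$ MERGE (n:{label} {{id: {int(row_id)}}}) RETURN n $$) AS (n agtype);"
    )


def delete_vertex_sql(graph_name, label, row_id):
    """SQL that removes the vertex for a row together with its edges."""
    graph, label = _check_ident(graph_name), _check_ident(label)
    return (
        f"SELECT * FROM cypher('{graph}', "
        f"$$ MATCH (n:{label} {{id: {int(row_id)}}}) DETACH DELETE n $$) AS (n agtype);"
    )


print(merge_vertex_sql("knowledge_graph", "Doc", 1))
print(delete_vertex_sql("knowledge_graph", "Doc", 1))
```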


Writing data

doc1 = Doc(content="Graph neural networks for fraud detection", embedding=vec1)
doc2 = Doc(content="Vector databases and hybrid search", embedding=vec2)

session.add_all([doc1, doc2])
session.commit()

If graph sync is enabled, vertices are created automatically.


Graph operations (AGE)

Create relationships

doc1.related.add(session, doc2)
session.commit()

Traverse neighbors

neighbors = doc1.related(session).limit(10).all()

Returns JSON-decoded AGE nodes, not ORM objects (by design).


Vector search (pgvector)

Cosine similarity is the default.

hits = Doc.vector_search(
    session,
    query_vec,
    k=20,
    distance="cosine",
)

Uses:

  • embedding <=> query_vec (pgvector's cosine distance operator; <-> is L2 distance)
  • the HNSW or IVFFLAT index, when the planner chooses it
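To make the ranking concrete: cosine distance is one minus cosine similarity, and a vector search returns the k rows with the smallest distance to the query. The sketch below does this by brute force in pure Python. The names cosine_distance and brute_force_top_k are illustrative only; an index such as HNSW approximates the same ordering without scanning every row.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_top_k(query_vec, rows, k):
    """rows is an iterable of (id, embedding); returns the k nearest ids."""
    scored = sorted((cosine_distance(query_vec, emb), rid) for rid, emb in rows)
    return [rid for _, rid in scored[:k]]


rows = [(1, [1.0, 0.0]), (2, [0.0, 1.0]), (3, [1.0, 1.0])]
print(brute_force_top_k([1.0, 0.1], rows, k=2))
```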

Full-text search (Postgres FTS)

hits = Doc.fts_search(
    session,
    "graph neural networks",
    k=20,
)

Uses:

  • tsvector
  • websearch_to_tsquery
  • GIN index

BM25 search (pg_search / ParadeDB)

rows = Doc.bm25_search(
    session,
    "graph neural networks",
    k=20,
    with_snippet=True,
)

Each row contains:

  • id
  • BM25 score
  • optional snippet

To return ORM objects:

docs = Doc.bm25_search_objects(session, "graph neural networks")
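For readers who want to see what a BM25 score is made of, here is a small pure-Python sketch of one common BM25 variant. It is not how pg_search computes its scores internally: the whitespace tokenizer, the function name bm25_scores, and the k1 and b defaults are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in docs (id -> text) against the query."""
    if not docs:
        return {}
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores[doc_id] = score
    return scores


docs = {
    1: "graph neural networks for fraud detection",
    2: "vector databases and hybrid search",
    3: "graph databases",
}
scores = bm25_scores("graph neural networks", docs)
print(sorted(scores, key=scores.get, reverse=True))
```

Document 1 matches all three query terms, document 3 matches only "graph", and document 2 matches none, so it scores zero.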

Hybrid search (lexical + semantic)

Simple hybrid (RRF)

from age_search import hybrid_search

results = hybrid_search(
    session,
    Doc,
    query_text="graph neural networks",
    query_vec=query_embedding,
    prefer_bm25=True,
)

This:

  1. runs BM25 (or FTS fallback)
  2. runs vector search
  3. fuses ranks via Reciprocal Rank Fusion
  4. returns ORM objects in fused order
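The fusion step can be sketched in a few lines. Reciprocal Rank Fusion gives each document a score of 1 / (k + rank) for every ranked list it appears in and sums those contributions. The function name reciprocal_rank_fusion, the constant k=60 (a conventional choice), and the tie-break by id are assumptions for illustration, not necessarily what hybrid_search uses.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


lexical_ids = [3, 1, 2]    # e.g. ids in BM25 order
semantic_ids = [1, 4, 3]   # e.g. ids in vector-distance order
print(reciprocal_rank_fusion([lexical_ids, semantic_ids]))
```

Because only ranks are used, the two result lists do not need comparable score scales, which is why RRF is a convenient default for mixing lexical and vector results.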

Typed hybrid results (scores + metadata)

from age_search.hybrid2 import hybrid_search_results

results = hybrid_search_results(
    session,
    Doc,
    query_text="graph neural networks",
    query_vec=query_embedding,
)

for r in results:
    print(
        r.id,
        r.rrf_score,
        r.bm25_score,
        r.semantic_rank,
        r.snippet,
    )

This is what you want for:

  • debugging
  • evals
  • ranking analysis
  • explainability

Graph-constrained search

Expand after search

from age_search import graph_expand_ids

seed_ids = [r.id for r in results]

expanded_ids = graph_expand_ids(
    session,
    graph_name="knowledge_graph",
    label="Doc",
    seed_ids=seed_ids,
    edge="RELATED_TO",
    hops=2,
)

You can then:

  • re-rank
  • fetch objects
  • or run another hybrid search inside this subset
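Conceptually, hop-limited expansion is a breadth-first search that stops after a fixed depth. The sketch below runs it over an in-memory edge list. The function name expand_ids is hypothetical, and two behaviors are assumptions that may differ from graph_expand_ids: edges are followed in both directions, and the seed ids are included in the result.

```python
from collections import defaultdict, deque


def expand_ids(edges, seed_ids, hops):
    """Return every id reachable from the seeds within `hops` edges."""
    adjacency = defaultdict(set)
    for a, b in edges:
        # Assumption: treat edges as undirected.
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set(seed_ids)
    frontier = deque((seed, 0) for seed in seed_ids)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


edges = [(1, 2), (2, 3), (3, 4), (5, 6)]
print(sorted(expand_ids(edges, seed_ids=[1], hops=2)))
```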

Hierarchical labels (taxonomy)

You typically want two layers:

  • Relational taxonomy tables (source of truth): fast filtering, constraints, auditing
  • AGE mirror (optional): traversal/reasoning (PARENT_OF, HAS_LABEL)

Relational taxonomy

Use the built-in Label model (adjacency list via parent_id):

from age_search import Base
from age_search.taxonomy import Label

For a document↔label join table, create it explicitly so your doc table name can be anything:

from age_search.taxonomy import make_doc_labels_table

doc_labels = make_doc_labels_table(Base.metadata, doc_table="docs")

To expand a subtree in pure SQL (recursive CTE):

from age_search.taxonomy import descendant_label_ids

ids = descendant_label_ids(session, root_label_id=42)
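The recursive CTE behind a subtree lookup can be tried without a Postgres server. The sketch below uses an in-memory SQLite database, which accepts the same WITH RECURSIVE shape. The table name labels, its columns, the function name subtree_label_ids, and the choice to include the root id in the result are all assumptions for illustration; the package's own query may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE labels (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES labels(id),
        name TEXT NOT NULL
    );
    INSERT INTO labels (id, parent_id, name) VALUES
        (42, NULL, 'ml'),
        (43, 42, 'graphs'),
        (44, 43, 'gnn'),
        (50, NULL, 'finance');
    """
)

SUBTREE_SQL = """
WITH RECURSIVE subtree(id) AS (
    SELECT id FROM labels WHERE id = ?           -- anchor: the root label
    UNION ALL
    SELECT l.id                                  -- step: children of rows found so far
    FROM labels AS l
    JOIN subtree AS s ON l.parent_id = s.id
)
SELECT id FROM subtree ORDER BY id
"""


def subtree_label_ids(connection, root_label_id):
    """Return the root id plus all descendant ids."""
    return [row[0] for row in connection.execute(SUBTREE_SQL, (root_label_id,))]


print(subtree_label_ids(conn, 42))
```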

AGE mirror (optional)

Mirror taxonomy into AGE:

  • (:Label {id, slug, name})
  • (:Label)-[:PARENT_OF]->(:Label)
  • (:Doc)-[:HAS_LABEL]->(:Label)

Then you can do graph-constrained hybrid search in one call:

from age_search import hybrid_search_results_in_label_subtree

results = hybrid_search_results_in_label_subtree(
    session,
    Doc,
    graph_name="knowledge_graph",
    root_label_id=42,
    query_text="graph neural networks",
    query_vec=query_embedding,
)

Weighted edges

You can attach properties (including a numeric weight) when creating a relationship:

# adds/updates relationship properties on the AGE edge
doc1.related.add(session, doc2, weight=0.8, props={"source": "cooccur"})
session.commit()

Community detection helpers (connected components)

For a simple baseline "community" definition, you can compute connected components from an AGE edge list:

from age_search.community import graph_connected_components

communities = graph_connected_components(
    session,
    graph_name="knowledge_graph",
    label="Doc",
    edge="RELATED_TO",
)
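Connected components over an edge list are commonly computed with a union-find structure. The sketch below shows that technique on plain Python data. The function name connected_components and the return shape (sorted lists of ids) are assumptions for illustration and not necessarily what graph_connected_components returns.

```python
def connected_components(nodes, edges):
    """Group node ids into components using union-find."""
    parent = {node: node for node in nodes}

    def find(x):
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


print(connected_components(nodes=[1, 2, 3, 4, 5, 6], edges=[(1, 2), (2, 3), (4, 5)]))
```

Isolated vertices form their own single-member component, which is why the node list is passed separately from the edges.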

Benchmark + eval harness

There’s a lightweight, dependency-free eval module (age_search.eval) with common IR metrics. You provide EvalCase objects and a search(case) -> ranked_ids function:

from age_search.eval import EvalCase, evaluate

cases = [
    EvalCase(name="q1", relevant_ids={1, 2, 3}),
]

report = evaluate(cases, search=lambda c: [1, 9, 2, 8], benchmark=True)
print(report)
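To show what such metrics measure, here are three standard ones computed by hand for the example above (ranked ids [1, 9, 2, 8], relevant ids {1, 2, 3}). The function names are illustrative; which metrics age_search.eval reports, and under what names, is not specified here.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant ids that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant result, or 0 if there is none."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


ranked = [1, 9, 2, 8]
relevant = {1, 2, 3}
print(precision_at_k(ranked, relevant, k=4))
print(recall_at_k(ranked, relevant, k=4))
print(reciprocal_rank(ranked, relevant))
```

Two of the four returned ids are relevant (precision 0.5), two of the three relevant ids were found (recall 2/3), and the first hit is at rank 1 (reciprocal rank 1.0).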

Development + release notes

  • CI runs ruff + pytest on PRs.
  • Releases: run the GitHub Actions workflow "Publish Python distribution to PyPI" (manual dispatch) to bump the version, tag, build, and publish.

Index strategies (cosine)

HNSW (default)

  • better speed/recall trade-off at query time
  • slower build, more memory
  • good general default

IVFFLAT

  • faster build
  • smaller memory footprint
  • requires ANALYZE
  • tune with:
SET ivfflat.probes = 10;

Switch via:

InstallSpec(vector_index="ivfflat")
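For reference, the two index types correspond to different pgvector DDL. The sketch below generates statements of that shape. The function name vector_index_ddl, the index naming scheme, and the lists=100 default are assumptions for illustration; the DDL that install_all actually emits may differ.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def vector_index_ddl(table, column, kind="hnsw", lists=100):
    """Build a CREATE INDEX statement for a pgvector cosine index."""
    for name in (table, column):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    index_name = f"{table}_{column}_{kind}_idx"
    if kind == "hnsw":
        return (
            f"CREATE INDEX IF NOT EXISTS {index_name} ON {table} "
            f"USING hnsw ({column} vector_cosine_ops);"
        )
    if kind == "ivfflat":
        # IVFFlat partitions vectors into `lists` clusters at build time.
        return (
            f"CREATE INDEX IF NOT EXISTS {index_name} ON {table} "
            f"USING ivfflat ({column} vector_cosine_ops) WITH (lists = {int(lists)});"
        )
    raise ValueError(f"unknown index kind: {kind!r}")


print(vector_index_ddl("docs", "embedding", "hnsw"))
print(vector_index_ddl("docs", "embedding", "ivfflat"))
```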

CLI (optional)

agegraph doctor
agegraph init --bm25 --vector-index hnsw
agegraph index --models-module your_app.models

Useful for:

  • ops
  • CI
  • smoke tests

Mental model summary

Layer         Technology    Role
Tables        SQLAlchemy    source of truth
Vectors       pgvector      semantic similarity
Lexical       FTS / BM25    keyword relevance
Graph         Apache AGE    topology
Fusion        RRF           hybrid ranking
Transactions  Postgres      consistency

Nothing is hidden. Everything composes.


What this is good for

  • RAG systems
  • knowledge graphs
  • recommendation engines
  • fraud / AML
  • search + reasoning
  • graph-aware retrieval
  • eval pipelines

What this deliberately does NOT do

  • pretend graphs are tables
  • auto-load graph neighbors via lazy ORM relationships
  • hide Cypher behind magic joins
  • force a specific embedding model
  • lock you into one search strategy
