
Corbin

Corbin is a graph-aware retrieval system for Notion-backed technical knowledge bases.

It turns structured Notion content into a retrieval-ready mirror with documents, chunks, metadata, aliases, and typed relationships, then exposes that knowledge through an API that can power search, grounded answers, and ChatGPT tool use.

Why Corbin

Most note systems are pleasant to write in but weak at retrieval once the knowledge base grows. Corbin keeps Notion as the authoring layer and builds a retrieval layer that is deterministic, inspectable, and easy to evolve.

The goal is not generic semantic search alone; it is to answer questions using:

  • chunk embeddings
  • exact identifiers
  • metadata filters
  • typed relationships
  • freshness and verification state
  • provenance back to the source note

Core idea

Corbin treats each Notion page as a source record that can become:

  • a document
  • one or more chunks
  • one or more graph edges
  • optional aliases and extracted entities

That makes it possible to combine semantic retrieval with structural expansion. A chunk about a CUDA fix can lead to the host it applies to, the service it affects, and the playbook that verifies it.

Architecture

Corbin is split into a few clear layers:

  1. Notion source layer
    Pull canonical databases, page metadata, and recursive block content.

  2. Sync and normalization layer
    Flatten Notion blocks into clean text, extract metadata, normalize relation properties, and compute content hashes.

  3. Indexing layer
    Chunk documents by structure, enrich chunks with compact headers, generate embeddings, and upsert into PostgreSQL with pgvector.

  4. Retrieval layer
    Run hybrid search across semantic similarity, full-text search, metadata filters, and graph expansion.

  5. Orchestration API
    Expose search and answer endpoints through FastAPI.

  6. Chat integration layer
    Present Corbin as tools through MCP so ChatGPT can call into the knowledge base directly.
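The content hashes in layer 2 drive change detection: if a page's normalized text hashes to the same value as last sync, re-chunking and re-embedding can be skipped. A rough sketch, assuming simplified block records and SHA-256 (the block shape and hash choice are illustrative, not Corbin's confirmed implementation):

```python
import hashlib

def flatten_blocks(blocks: list[dict]) -> str:
    """Join the plain-text content of simplified block dicts into one document."""
    return "\n".join(b["text"].strip() for b in blocks if b.get("text"))

def content_hash(text: str) -> str:
    """Stable SHA-256 over normalized text; an unchanged hash means skip re-indexing."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

blocks = [{"text": "Restart the service.  "}, {"text": "Check the logs."}]
text = flatten_blocks(blocks)
print(content_hash(text)[:12])  # compare against the stored hash from the last sync
```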

Planned stack

  • Python 3.12
  • FastAPI
  • PostgreSQL
  • pgvector
  • SQLAlchemy
  • Alembic
  • Pydantic
  • HTTPX
  • Notion API
  • uv
  • Docker Compose
  • MCP server for ChatGPT integration

Retrieval model

Corbin is designed around hybrid retrieval rather than embedding-only search. A query can be analyzed into intent, entities, and constraints, then resolved through several channels:

  • semantic chunk search
  • PostgreSQL full-text search
  • exact and fuzzy alias matching
  • metadata filters such as host, project, or status
  • graph expansion from related nodes and edges

The final answer should prefer verified, host-specific, and current documentation whenever possible.
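Corbin does not specify how results from these channels are merged, but a common baseline is reciprocal rank fusion, sketched here over hypothetical chunk IDs:

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge several ranked ID lists into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

semantic = ["c3", "c1", "c7"]  # e.g. from pgvector similarity
fulltext = ["c3", "c9", "c1"]  # e.g. from PostgreSQL full-text search
# Chunks found by both channels (c3, c1) outrank single-channel hits.
print(rrf([semantic, fulltext]))
```

Fusion by rank rather than by raw score avoids having to calibrate cosine similarity against full-text relevance scores, which live on incompatible scales.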

Example use cases

  • Find the exact playbook for rebuilding a service on a specific host.
  • Explain how a component, machine, and script are related.
  • Retrieve the most relevant troubleshooting note, then expand to nearby docs.
  • Answer a question in ChatGPT using private internal knowledge instead of generic recall.
  • Surface stale notes that need verification after infra changes.

Initial project layout

corbin/
├── pyproject.toml
├── README.md
├── .env.example
├── configs/
│   ├── app.yaml
│   ├── notion.yaml
│   ├── retrieval.yaml
│   └── chunking.yaml
├── src/
│   └── corbin/
│       ├── notion/
│       │   ├── client.py
│       │   ├── sync.py
│       │   ├── blocks.py
│       │   └── normalize.py
│       ├── indexing/
│       │   ├── chunker.py
│       │   ├── embed.py
│       │   ├── extract.py
│       │   └── upsert.py
│       ├── graph/
│       │   ├── entities.py
│       │   ├── relations.py
│       │   └── traversal.py
│       ├── retrieval/
│       │   ├── analyze.py
│       │   ├── hybrid.py
│       │   ├── rerank.py
│       │   └── answer.py
│       ├── db/
│       │   ├── models.py
│       │   ├── session.py
│       │   └── migrations/
│       ├── api/
│       │   └── main.py
│       └── app/
│           └── mcp_server.py
└── tests/

First milestones

  1. Sync one or two Notion databases into PostgreSQL.
  2. Chunk content and add embeddings.
  3. Capture relation properties as graph edges.
  4. Expose retrieval through FastAPI.
  5. Connect ChatGPT through MCP tools.

Design principles

  • Notion stays the authoring layer.
  • PostgreSQL is the retrieval mirror.
  • Retrieval must be inspectable and testable.
  • Chunking should follow structure before token count.
  • Relations are first-class signals, not just metadata.
  • Answers should always preserve provenance.
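The structure-before-token-count principle can be sketched as heading-first splitting with a size fallback; the heading pattern and character budget below are illustrative choices, not Corbin's actual chunker:

```python
import re

def chunk_by_structure(text: str, max_chars: int = 400) -> list[str]:
    """Split on markdown headings first, then fall back to a size budget."""
    # Zero-width split: keep each heading attached to its own section.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Fallback: pack paragraphs of an oversized section into the budget.
        buf = ""
        for para in section.split("\n\n"):
            if buf and len(buf) + len(para) + 2 > max_chars:
                chunks.append(buf)
                buf = para
            else:
                buf = f"{buf}\n\n{para}" if buf else para
        if buf:
            chunks.append(buf)
    return chunks

doc = "# Setup\nInstall deps.\n\n# Usage\nRun the server."
print(chunk_by_structure(doc))
```

Splitting on structure first keeps each chunk self-contained under one heading, which makes the compact chunk headers mentioned in the indexing layer cheap to derive.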

Status

Early scaffold. The first version focuses on reliable sync, clean normalization, and grounded retrieval before adding richer answer synthesis and write-back workflows.
