
Corbin

Corbin is a graph-aware retrieval system for Notion-backed technical knowledge bases.

It turns structured Notion content into a retrieval-ready mirror with documents, chunks, metadata, aliases, and typed relationships, then exposes that knowledge through an API that can power search, grounded answers, and ChatGPT tool use.

Why Corbin

Most note systems are pleasant to write in but weak at retrieval once the knowledge base grows. Corbin keeps Notion as the authoring layer and builds a retrieval layer that is deterministic, inspectable, and easy to evolve.

The goal is not generic semantic search alone. The goal is to answer questions using:

  • chunk embeddings
  • exact identifiers
  • metadata filters
  • typed relationships
  • freshness and verification state
  • provenance back to the source note

Core idea

Corbin treats each Notion page as a source record that can become:

  • a document
  • one or more chunks
  • one or more graph edges
  • optional aliases and extracted entities

That makes it possible to combine semantic retrieval with structural expansion. A chunk about a CUDA fix can lead to the host it applies to, the service it affects, and the playbook that verifies it.
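The record model above can be sketched with plain dataclasses. This is an illustrative shape, not Corbin's actual schema; all field and relation names here are hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    # A typed relationship between two pages, e.g. "applies_to" or "verified_by".
    source_id: str
    target_id: str
    relation: str


@dataclass
class SourceRecord:
    # One Notion page, mirrored into retrieval-ready parts.
    page_id: str
    title: str
    document: str  # flattened block content
    chunks: list[str] = field(default_factory=list)
    edges: list[Edge] = field(default_factory=list)
    aliases: list[str] = field(default_factory=list)


record = SourceRecord(
    page_id="abc123",
    title="CUDA driver fix",
    document="Steps to repair the CUDA driver after a kernel upgrade.",
    edges=[Edge("abc123", "host-gpu01", "applies_to")],
    aliases=["cuda fix", "driver repair"],
)
print(record.edges[0].relation)  # -> applies_to
```

Keeping edges and aliases alongside the chunks is what lets a semantic hit fan out into structural neighbors later.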

Architecture

Corbin is split into a few clear layers:

  1. Notion source layer
    Pull canonical databases, page metadata, and recursive block content.

  2. Sync and normalization layer
    Flatten Notion blocks into clean text, extract metadata, normalize relation properties, and compute content hashes.

  3. Indexing layer
    Chunk documents by structure, enrich chunks with compact headers, generate embeddings, and upsert into PostgreSQL with pgvector.

  4. Retrieval layer
    Run hybrid search across semantic similarity, full-text search, metadata filters, and graph expansion.

  5. Orchestration API
    Expose search and answer endpoints through FastAPI.

  6. Chat integration layer
    Present Corbin as tools through MCP so ChatGPT can call into the knowledge base directly.
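The content hashes computed in the sync layer exist so that unchanged pages can be skipped instead of re-embedded. A minimal stdlib sketch of that change check, with hypothetical function names:

```python
import hashlib


def content_hash(text: str) -> str:
    # Stable hash of the flattened, normalized page text; two syncs of an
    # unchanged page produce the same digest.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reindex(page_id: str, new_text: str, stored: dict[str, str]) -> bool:
    # Re-chunk, re-embed, and upsert only when the content actually changed.
    new_hash = content_hash(new_text)
    if stored.get(page_id) == new_hash:
        return False
    stored[page_id] = new_hash
    return True


hashes: dict[str, str] = {}
print(needs_reindex("p1", "GPU host playbook", hashes))  # first sync -> True
print(needs_reindex("p1", "GPU host playbook", hashes))  # unchanged -> False
```

In the real pipeline the stored hashes would live in the PostgreSQL mirror rather than an in-memory dict, but the decision logic is the same.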

Planned stack

  • Python 3.12
  • FastAPI
  • PostgreSQL
  • pgvector
  • SQLAlchemy
  • Alembic
  • Pydantic
  • HTTPX
  • Notion API
  • uv
  • Docker Compose
  • MCP server for ChatGPT integration

Retrieval model

Corbin is designed around hybrid retrieval rather than embedding-only search. A query can be analyzed into intent, entities, and constraints, then resolved through several channels:

  • semantic chunk search
  • PostgreSQL full-text search
  • exact and fuzzy alias matching
  • metadata filters such as host, project, or status
  • graph expansion from related nodes and edges

The final answer should prefer verified, host-specific, and current documentation whenever possible.
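The README does not specify how the channels are merged, so as one plausible sketch: reciprocal rank fusion is a common way to combine ranked lists from heterogeneous retrievers without calibrating their scores against each other. Chunk ids and channel names below are illustrative:

```python
from collections import defaultdict


def rrf_merge(channels: dict[str, list[str]], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: each channel contributes 1 / (k + rank) for
    # every chunk id it returns, so chunks ranked by several channels rise.
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in channels.values():
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


merged = rrf_merge({
    "semantic":  ["c2", "c7", "c1"],
    "full_text": ["c7", "c3"],
    "alias":     ["c7"],
})
print(merged[0])  # -> c7, the only chunk found by all three channels
```

Preferences like "verified, host-specific, and current" would then apply as a rerank or boost on top of the fused list, rather than inside any single channel.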

Example use cases

  • Find the exact playbook for rebuilding a service on a specific host.
  • Explain how a component, machine, and script are related.
  • Retrieve the most relevant troubleshooting note, then expand to nearby docs.
  • Answer a question in ChatGPT using private internal knowledge instead of generic recall.
  • Surface stale notes that need verification after infra changes.
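The "expand to nearby docs" case above amounts to a bounded traversal over typed edges. A stdlib sketch of that expansion, with an illustrative toy graph (node and relation names are made up for the example):

```python
from collections import deque


def expand(seed: str,
           edges: dict[str, list[tuple[str, str]]],
           max_hops: int = 2) -> set[str]:
    # Bounded breadth-first search: collect every node reachable from the
    # seed within max_hops, following (relation, neighbor) edges.
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for _relation, neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen - {seed}


graph = {
    "cuda-fix":      [("applies_to", "host-gpu01")],
    "host-gpu01":    [("runs", "inference-svc")],
    "inference-svc": [("verified_by", "playbook-7")],
}
print(sorted(expand("cuda-fix", graph)))  # -> ['host-gpu01', 'inference-svc']
```

The hop bound is what keeps expansion from dragging in the whole knowledge base; a real traversal would likely also filter by relation type.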

Initial project layout

corbin/
├── pyproject.toml
├── README.md
├── .env.example
├── configs/
│   ├── app.yaml
│   ├── notion.yaml
│   ├── retrieval.yaml
│   └── chunking.yaml
├── src/
│   └── corbin/
│       ├── notion/
│       │   ├── client.py
│       │   ├── sync.py
│       │   ├── blocks.py
│       │   └── normalize.py
│       ├── indexing/
│       │   ├── chunker.py
│       │   ├── embed.py
│       │   ├── extract.py
│       │   └── upsert.py
│       ├── graph/
│       │   ├── entities.py
│       │   ├── relations.py
│       │   └── traversal.py
│       ├── retrieval/
│       │   ├── analyze.py
│       │   ├── hybrid.py
│       │   ├── rerank.py
│       │   └── answer.py
│       ├── db/
│       │   ├── models.py
│       │   ├── session.py
│       │   └── migrations/
│       ├── api/
│       │   └── main.py
│       └── app/
│           └── mcp_server.py
└── tests/

First milestones

Phase 1

Sync one or two Notion databases into PostgreSQL.

Phase 2

Chunk content and add embeddings.

Phase 3

Capture relation properties as graph edges.

Phase 4

Expose retrieval through FastAPI.

Phase 5

Connect ChatGPT through MCP tools.

Design principles

  • Notion stays the authoring layer.
  • PostgreSQL is the retrieval mirror.
  • Retrieval must be inspectable and testable.
  • Chunking should follow structure before token count.
  • Relations are first-class signals, not just metadata.
  • Answers should always preserve provenance.
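"Structure before token count" can be made concrete with a small sketch: one chunk per section by default, falling back to a size budget only when a single section is oversized. This uses a word budget and plain heading/body pairs as stand-ins for Notion block structure:

```python
def chunk(sections: list[tuple[str, str]], budget: int = 120) -> list[str]:
    # Structure first: each section becomes one chunk, prefixed with its
    # heading as a compact header. Only a section that exceeds the budget
    # on its own gets split by size.
    chunks: list[str] = []
    for heading, body in sections:
        words = body.split()
        if len(words) <= budget:
            chunks.append(f"{heading}\n{body}")
        else:
            for i in range(0, len(words), budget):
                part = " ".join(words[i:i + budget])
                chunks.append(f"{heading}\n{part}")
    return chunks


doc = [
    ("Rebuild steps", "stop the service " * 2),   # 6 words -> one chunk
    ("Notes", "word " * 300),                      # 300 words -> three chunks
]
print(len(chunk(doc)))  # -> 4
```

Repeating the heading on every split part is one simple form of the chunk-header enrichment mentioned in the indexing layer: each chunk stays interpretable on its own.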

Status

Early scaffold. The first version focuses on reliable sync, clean normalization, and grounded retrieval before adding richer answer synthesis and write-back workflows.
