LLM-powered documentation and test generation for dbt Core projects
dbt-scribe
dbt-scribe analyses your dbt project and uses an LLM (Anthropic Claude, OpenAI, or
Google Gemini) to automatically generate model descriptions, column descriptions, and
data tests — following your project's conventions, never overwriting what already exists.
The problem
Writing thorough dbt documentation and tests is non-negotiable — but it is slow. A staging model with 15 columns takes 30–45 minutes to document properly when following strict conventions: English descriptions, two-tier docs blocks, named tests, shared column blocks, four-section mart template.
Existing tools don't fully solve this:
| Tool | Limitation |
|---|---|
| dbt-osmosis | Mechanical propagation, no LLM understanding |
| dbt-codegen | Generates empty boilerplate only |
| dbt Assist | Cloud-only, paid, not configurable |
| dbt-coverage | Measures coverage but generates nothing |
dbt-scribe fills the gap: LLM-powered generation, local, configurable per project,
compatible with dbt Core.
What it generates
Documentation
- Model descriptions (inline YAML or long-form docs blocks)
- Column descriptions for every undocumented column
- *__docs.md files following dbt's two-tier convention
- Four-section template for mart docs blocks (Description / Limitations / Business Stakeholder / Technical Stakeholder)
Tests
- Named generic tests in YAML: not_null, unique, accepted_values, relationships
- Column types are inferred automatically (primary key, foreign key, enum, timestamp, boolean, metric) to generate the right tests
- accepted_values lists are inferred from CASE WHEN and WHERE IN clauses in compiled SQL; a placeholder TODO is generated when values cannot be detected
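As a rough illustration of the accepted_values inference described above, here is a minimal sketch. It is not dbt-scribe's actual implementation: the function name `infer_accepted_values` and the regular expressions are hypothetical, and the sketch assumes that only string literals compared directly against the named column (in `IN (...)` lists and `WHEN col = '...'` branches) count as candidate values.

```python
import re

# Hypothetical patterns for illustration only; a real SQL parser would be more robust.
IN_LIST = re.compile(r"\b(\w+)\s+IN\s*\(([^)]*)\)", re.IGNORECASE)
CASE_WHEN_EQ = re.compile(r"\bWHEN\s+(\w+)\s*=\s*'([^']*)'", re.IGNORECASE)
STRING_LITERAL = re.compile(r"'([^']*)'")


def infer_accepted_values(compiled_sql: str, column: str) -> list[str]:
    """Collect string literals compared against `column`, or return a TODO placeholder."""
    values: list[str] = []
    # Literals from `column IN ('a', 'b')`
    for col, body in IN_LIST.findall(compiled_sql):
        if col.lower() == column.lower():
            values.extend(STRING_LITERAL.findall(body))
    # Literals from `WHEN column = 'a'`
    for col, literal in CASE_WHEN_EQ.findall(compiled_sql):
        if col.lower() == column.lower():
            values.append(literal)
    # De-duplicate while preserving first-seen order
    unique = list(dict.fromkeys(values))
    return unique or ["TODO"]
```

When nothing is detected, the sketch returns a single TODO entry, mirroring the placeholder behaviour mentioned above.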
Safe by default
- Only fills in what is missing — never overwrites existing descriptions or tests
- A {{ doc("...") }} reference is treated as a filled description and is preserved
- Use --force to regenerate everything, including existing content
- Use --dry-run to preview output without writing any files
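The fill-only-what-is-missing rule can be pictured with a small sketch. This is an assumption about the general approach, not the project's real writer code: `merge_descriptions` is a hypothetical helper, and columns are modelled as plain dictionaries shaped like dbt YAML column entries.

```python
from copy import deepcopy


def merge_descriptions(columns: list[dict], generated: dict[str, str], force: bool = False) -> list[dict]:
    """Return a copy of `columns` where only empty descriptions are filled in.

    A `{{ doc("...") }}` reference is a non-empty string, so it counts as
    filled and is left untouched unless `force` is True.
    """
    merged = deepcopy(columns)  # never mutate the caller's data
    for col in merged:
        current = (col.get("description") or "").strip()
        proposal = generated.get(col["name"])
        if proposal is None:
            continue  # nothing generated for this column
        if force or not current:
            col["description"] = proposal
    return merged
```

Passing `force=True` corresponds to the --force behaviour: every column with a generated proposal is overwritten.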
Requirements
- Python 3.11+
- dbt Core (any version that produces target/manifest.json)
- A supported dbt adapter: DuckDB, BigQuery, or PostgreSQL
- An API key for your chosen LLM provider
Installation
pip install dbt-scribe
Or install from source for local development:
git clone https://github.com/jeremy6680/dbt-scribe.git
cd dbt-scribe
pip install -e .
Quickstart
All commands must be run from the root of your dbt project (the directory
containing dbt_project.yml).
1. Compile your dbt project
dbt-scribe reads compiled SQL from target/manifest.json. Run this first and
any time your models change:
dbt compile
2. API key
dbt-scribe requires an API key for the LLM provider configured in dbt-scribe.yml
(default: Anthropic Claude).
Add the key to your shell profile so it is available in every session:
# Add to ~/.zprofile (Mac) or ~/.bashrc (Linux)
export ANTHROPIC_API_KEY=sk-ant-...
# Reload your shell profile
source ~/.zprofile
Mac note: Use ~/.zprofile, not ~/.zshrc. On Mac, terminal apps open as login shells and load ~/.zprofile first. Variables set only in ~/.zshrc may not be available inside virtual environments.
Other supported providers:
export OPENAI_API_KEY=sk-... # for provider: openai
export GOOGLE_API_KEY=... # for provider: google
3. Initialise the config
dbt-scribe init
This generates a dbt-scribe.yml at your project root. Open it and set your
preferred LLM provider, coverage thresholds, shared column names, and layer conventions.
Commit this file — it is part of your project.
4. Check current coverage
dbt-scribe audit --target models/
No LLM calls, nothing written. Shows doc and test coverage per model.
5. Preview generation (dry run)
# Documentation only
dbt-scribe docs --target models/ --dry-run
# Tests only
dbt-scribe tests --target models/ --dry-run
# Both in one pass
dbt-scribe generate --target models/ --dry-run
6. Generate for real
dbt-scribe generate --target models/
Commands
All commands must be run from the root of your dbt project
(the directory containing dbt_project.yml).
dbt-scribe init
Generates a dbt-scribe.yml configuration file at the project root.
dbt-scribe init [--force]
--force overwrites an existing dbt-scribe.yml.
dbt-scribe docs
Generates model and column descriptions. Writes inline YAML descriptions and
long-form *__docs.md docs blocks.
dbt-scribe docs --target <path> [--dry-run] [--force]
# Single model
dbt-scribe docs --target models/staging/spotify/stg_spotify__tracks.sql
# All models in a folder
dbt-scribe docs --target models/staging/
# Entire project
dbt-scribe docs --target models/
# Preview without writing
dbt-scribe docs --target models/ --dry-run
dbt-scribe tests
Generates named generic tests in YAML.
dbt-scribe tests --target <path> [--dry-run] [--force]
dbt-scribe generate
Generates documentation and tests in a single LLM call per model.
dbt-scribe generate --target <path> [--dry-run] [--force]
dbt-scribe audit
Reports documentation and test coverage per model. No generation, no LLM calls.
dbt-scribe audit --target <path>
Example output:
Audit summary
stg_spotify__tracks: doc coverage 100% (19/19), test coverage 0% (0/19)
stg_spotify__albums: doc coverage 60% (6/10), test coverage 0% (0/10)
int_music__unified: doc coverage 100% (14/14), test coverage 0% (0/14)
mrt_music__collection: doc coverage 100% (13/13), test coverage 0% (0/13)
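The figures in the example output are consistent with coverage being the share of columns that are documented (or tested) out of all columns in the model. The sketch below reproduces one summary line under that assumption; `coverage_line` is a hypothetical name and the rounding rule is a guess.

```python
def coverage_line(model: str, documented: int, tested: int, total: int) -> str:
    """Format one audit line from per-model column counts."""
    # Guard against models with no columns to avoid dividing by zero
    doc_pct = round(100 * documented / total) if total else 0
    test_pct = round(100 * tested / total) if total else 0
    return (
        f"{model}: doc coverage {doc_pct}% ({documented}/{total}), "
        f"test coverage {test_pct}% ({tested}/{total})"
    )
```

For instance, 6 documented columns out of 10 yields the 60% shown for stg_spotify__albums.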
Configuration (dbt-scribe.yml)
Generated by dbt-scribe init and versioned with your dbt project.
Key settings:
llm:
  provider: anthropic              # anthropic | openai | google
  model: claude-sonnet-4-20250514
  temperature: 0.2                 # Low for consistent, structured output
docs:
  language: en
  two_tier: true                   # Short desc in YAML, long desc in *__docs.md
  shared_columns:                  # These columns use shared docs blocks
    - _loaded_at
    - created_at
  mart_template: true              # Enforce four-section template for mart docs
tests:
  named_tests: true                # All generic tests use the name: key
  pk_patterns: ["_id$", "^id$"]
  enum_patterns: ["_status$", "_type$"]
coverage:
  min_doc_coverage: 80             # % threshold for audit / CI mode
  min_test_coverage: 60
conventions:
  layers:
    staging:
      prefixes: ["stg_", "base_"]
    intermediate:
      prefixes: ["int_"]
    marts:
      prefixes: []                 # No prefix: detected by exclusion
                                   # Override if your project uses e.g. ["mrt_"]
How it works
- Bootstrap: validates that dbt_project.yml, target/manifest.json, and dbt-scribe.yml are all present in the current directory
- Manifest parsing: reads compiled SQL (Jinja2-resolved), column lists, lineage, adapter type, and fully-qualified node names from target/manifest.json
- YAML parsing: reads existing .yml files to detect what is already documented
- Analysis: detects the layer (staging / intermediate / marts) and infers column types (pk, fk, enum, timestamp, boolean, metric, shared, text)
- Generation: calls the configured LLM with structured prompts; all responses are JSON for reliable parsing, with one call per model, not per column
- Writing: creates .yml files from scratch or merges non-destructively into existing ones; creates or appends to *__docs.md files
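The generation step relies on JSON responses, one per model. The README does not specify the response schema, so the sketch below invents one purely for illustration: a `description` string plus a `columns` mapping of column name to description. `parse_model_response` is a hypothetical helper, not part of dbt-scribe.

```python
import json


def parse_model_response(raw: str) -> dict:
    """Parse and lightly validate one per-model JSON response (hypothetical schema)."""
    payload = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object at the top level")
    description = payload.get("description", "")
    columns = payload.get("columns", {})
    if not isinstance(description, str) or not isinstance(columns, dict):
        raise ValueError("unexpected types for 'description' or 'columns'")
    # Normalise keys and values to strings before handing them to a writer
    return {
        "description": description,
        "columns": {str(name): str(text) for name, text in columns.items()},
    }
```

Validating the shape up front means a malformed response fails loudly before anything is written to disk.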
Why manifest.json and not the .sql files directly? dbt model files contain unresolved Jinja2 ({{ ref('...') }}, {{ var('...') }}, macros). target/manifest.json, produced by dbt compile, contains fully-resolved SQL, which makes it the only reliable source for column extraction and expression analysis.
Supported adapters
| Adapter | Status | Notes |
|---|---|---|
| DuckDB | ✅ Supported | Default for local / portfolio projects |
| BigQuery | ✅ Supported | Auto-detected from manifest metadata |
| PostgreSQL | ✅ Supported | Auto-detected from manifest metadata |
Adapter is auto-detected from manifest.json. You can override it in dbt-scribe.yml.
LLM providers
| Provider | Default model | Environment variable |
|---|---|---|
| anthropic (default) | claude-sonnet-4-20250514 | ANTHROPIC_API_KEY |
| openai | gpt-4o | OPENAI_API_KEY |
| google | gemini-2.5-pro | GOOGLE_API_KEY |
Only the key for your configured provider is required.
CI integration
Use dbt-scribe audit in your pipeline to enforce documentation and test coverage
thresholds. Set fail_on_threshold: true in dbt-scribe.yml to exit with code 1
when thresholds are not met:
# .github/workflows/dbt-quality.yml
- name: Check dbt documentation coverage
  run: |
    dbt compile
    dbt-scribe audit --target models/ --ci

# dbt-scribe.yml
coverage:
  min_doc_coverage: 80
  min_test_coverage: 60
  fail_on_threshold: true
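The threshold check amounts to a simple comparison. The sketch below assumes coverage is expressed as a percentage and that the process exits with code 1 when either threshold is missed and fail_on_threshold is enabled. Whether the real tool aggregates per model or across the project is not stated here, and `audit_exit_code` is a hypothetical name.

```python
def audit_exit_code(
    doc_coverage: float,
    test_coverage: float,
    min_doc_coverage: float = 80,
    min_test_coverage: float = 60,
    fail_on_threshold: bool = True,
) -> int:
    """Return 1 if a threshold is missed and failing is enabled, else 0."""
    below = doc_coverage < min_doc_coverage or test_coverage < min_test_coverage
    return 1 if (fail_on_threshold and below) else 0
```

A CI runner treats any non-zero exit code as a failed step, so the pipeline stops when coverage drops below the configured minimums.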
Project structure
dbt-scribe/
├── dbt_scribe/
│ ├── cli.py # Click entry point — all commands
│ ├── config.py # Pydantic config + provider resolution
│ ├── resolver.py # Resolves --target to a list of models
│ ├── analyzer.py # Layer detection + column type inference
│ ├── parsers/
│ │ ├── manifest_parser.py # Reads target/manifest.json
│ │ └── yaml_parser.py # Reads existing .yml files
│ ├── generators/
│ │ ├── base_generator.py # LLMProvider ABC + retry logic
│ │ ├── providers/ # anthropic | openai | google
│ │ ├── docs_generator.py
│ │ └── tests_generator.py
│ ├── writers/
│ │ ├── yaml_writer.py # Create from scratch or merge
│ │ └── docs_writer.py # Create or append *__docs.md
│ └── prompts/ # Jinja2 prompt templates per layer
└── tests/
└── fixtures/dbt_project/ # Minimal dbt project with pre-built manifest
The test suite uses checked-in fixtures and mocked LLM providers — CI requires no dbt installation, no warehouse connection, and no API keys.
Roadmap
| Phase | Status | Highlights |
|---|---|---|
| Phase 1 — MVP | ✅ Complete | docs, tests, generate, audit commands |
| Phase 2 — Portfolio-ready | 🔄 Planned | Singular SQL tests, LLM cache, ruamel.yaml migration, CI mode |
| Phase 3 — Open source | 📋 Backlog | PyPI publication, full documentation, dbt Slack announcement |
License
MIT — see LICENSE.
Author
Jeremy Marchandeau — web2data.org