Automated documentation generator for dbt projects using Google Gemini AI
DBT Autodoc Documentation
dbt-autodoc is an automated tool to generate and manage documentation for your dbt models using Google Gemini AI. It integrates with dbt-osmosis to synchronize your YAML files and ensures that your documentation is consistent, version-controlled, and easily maintainable.
🚀 Features
- Automated Generation: Uses Google Gemini AI to generate technical descriptions for tables and columns.
- YAML Synchronization: Keeps your `schema.yml` files in sync with your dbt models using `dbt-osmosis`.
- Caching & History: Stores descriptions in a database (`duckdb` or `postgres`) to prevent regenerating existing documentation and tracks changes over time.
- User Tracking: Logs who made changes to the documentation (based on environment variables or system user).
- Smart Updates: Respects human-written documentation and allows forcing re-generation via tags.
🛠️ Setup
- Install:
  pip install dbt-autodoc
- Configuration: When you run `dbt-autodoc` for the first time, it will automatically generate a `dbt-autodoc.yml` configuration file in your project root.
  Important: You should edit this file to provide context about your company (`company_context`), which significantly improves the AI's ability to generate accurate descriptions.
  Supported AI: Currently, this tool supports Google Gemini models (e.g., `gemini-2.5-flash`).
- Environment Variables (Optional): You can provide keys via environment variables (e.g. in a `.env` file) OR pass them as command-line arguments.
  GEMINI_API_KEY=your_api_key_here
  POSTGRES_URL=postgresql://user:pass@host:port/db (if using postgres)
  DBT_USER=your_username (optional, for tracking)
  Note: The tool attempts to load `.env` automatically. If that fails, it falls back to parsing `POSTGRES_URL` from the `.env` file directly.
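The `.env` fallback described in the note can be pictured with a small sketch. This is not the tool's actual parser: the function name `read_env_value` is hypothetical, and the handling of comments, an `export ` prefix, and quotes is an assumption about what a reasonable fallback would tolerate.

```python
from pathlib import Path
from typing import Optional


def read_env_value(env_path: str, key: str) -> Optional[str]:
    """Return the value of `key` from a .env-style file, or None if absent.

    Hypothetical helper: illustrates a manual fallback, not dbt-autodoc's code.
    """
    path = Path(env_path)
    if not path.is_file():
        return None
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Tolerate an optional "export " prefix (assumption).
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        # Split on the first "=" only, so URLs containing "=" stay intact.
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            value = value.strip()
            # Remove one layer of matching surrounding quotes, if present.
            if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
                value = value[1:-1]
            return value
    return None
```

Calling `read_env_value(".env", "POSTGRES_URL")` would return the connection string if the file defines it, and `None` otherwise, so the caller can decide whether to fall back to DuckDB or report a missing setting.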
📋 Recommended Workflow
For the best results, follow this workflow to build your documentation incrementally:
- Initial Setup: Run the tool once to generate `dbt-autodoc.yml`. Edit the file and fill in `company_context` with a detailed description of your business.
- Step 1: Generate Table Descriptions (SQL). Run the tool to generate descriptions for your models (tables/views) inside your SQL files.
  dbt-autodoc --generate-docs-config-ai
  Why? The AI uses the company context to describe what the table represents.
- Step 2: Review & Refine Tables. Open your `.sql` files. Review the generated `{{ config(description=...) }}`.
  - If it's good, leave it (or remove the `(ai_generated)` tag to lock it).
  - If it's bad, edit it manually.
  - Run the tool again to save your manual edits to the database "Source of Truth".
- Step 3: Generate Column Descriptions (YAML). Once table descriptions are solid, generate the column descriptions.
  dbt-autodoc --generate-docs-yml-ai
  Why? The AI now uses both the company context AND the specific table description to generate highly accurate column definitions.
- Step 4: Review & Refine Columns. Check the generated `schema.yml` (or `_*.yml`) files.
  - Refine definitions where necessary.
  - Rerun `dbt-autodoc --generate-docs-yml` to sync your manual changes to the database.
- Fast Track (Optional): Once you are comfortable with the tool, you can run both generations at once. The tool will automatically prioritize tables first, then columns.
  dbt-autodoc --generate-docs-config-ai --generate-docs-yml-ai
🗄️ Database Selection: DuckDB vs Postgres
- DuckDB (`db_type: duckdb`):
  - Best for: Individual developers, local testing, or single-user projects.
  - Pros: Zero setup, fast, simple file-based database (`docs_backup.duckdb`).
  - Cons: Cannot be easily shared concurrently between team members.
- Postgres (`db_type: postgres`):
  - Best for: Production environments and teams.
  - Pros: Centralized "Source of Truth". Multiple developers can run the tool and share the same cache/history. If one developer documents a model, others get it automatically without regenerating.
  - Cons: Requires a running Postgres instance.
📖 Usage & Arguments
Run the tool from the command line:
dbt-autodoc [ARGUMENTS]
Recommended: Run `dbt run` (or `dbt compile`) before running this tool to ensure your project manifest is up to date.
Database Maintenance
If you need to reset the database schema (e.g., during development or if the schema becomes corrupted), you can use the --cleanup-db flag.
Warning: This operation is destructive and will delete all cached descriptions and history.
dbt-autodoc --cleanup-db
Available Arguments
| Argument | Description |
|---|---|
| `--generate-docs-yml` | Sync Structure Only. Runs dbt-osmosis to update YAML files with new columns/models. Saves manual edits to the database. Use this to sync files without AI. |
| `--generate-docs-yml-ai` | Sync & Generate Columns. Runs dbt-osmosis, then scans `_*.yml` files. If a column description is missing, calls AI to generate it. |
| `--generate-docs-config` | Sync SQL Configs. Updates SQL files. Read-only mode for descriptions (doesn't generate new ones). Saves manual edits to the database. |
| `--generate-docs-config-ai` | Generate Table Descriptions. Scans `.sql` model files. If a table description is missing in the `{{ config() }}` block, calls AI to generate it. |
| `--show-prompt` | Debug Mode. Prints the exact prompt sent to the AI without saving the result. Useful for testing prompt engineering. |
| `--cleanup-yml` | Cleanup YAML. Deletes temporary `_*.yml` files generated by osmosis if needed. |
| `--cleanup-db` | Cleanup Database. Drops the `doc_cache` and `doc_cache_log` tables from the database. Useful for resetting the schema or cache. Irreversible. |
| `--gemini-api-key` | Overrides the API key from environment variables. |
| `--concurrency` | Sets the maximum number of concurrent AI/DB requests (default: 10). Can also be set in `dbt-autodoc.yml`. Note: concurrency is for postgres only. |
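To make the `--concurrency` setting concrete, here is a minimal sketch of what a cap on concurrent requests generally looks like in Python. It is an illustration only: `run_bounded` is a hypothetical name, and the use of a thread pool is an assumption, not a statement about how dbt-autodoc schedules its AI or database calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def run_bounded(tasks: Iterable[Callable[[], T]], concurrency: int = 10) -> List[T]:
    """Run zero-argument callables with at most `concurrency` in flight.

    Hypothetical helper; results are returned in the same order as `tasks`.
    """
    # The pool size is the cap: no more than `concurrency` tasks run at once.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda task: task(), tasks))
```

Each task here would wrap one request (for example, one description to generate). A lower cap trades throughput for gentler load on the API and the database.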
🧠 How It Works
The tool follows a strict logic flow to determine whether to keep, update, or generate a description.
1. Description Resolution Logic
For every Column (in YAML) or Table (in SQL), the script follows a strict hierarchy to decide what to do. The goal is to protect your manual work while automating the rest.
- Human Written (Highest Priority):
  - Definition: Any description that does not contain the `(ai_generated)` tag.
  - Behavior: The script assumes you wrote this manually and that it is the "Source of Truth".
  - Action: It will NOT use AI to generate another response. It will NOT overwrite your text. It effectively "locks" the description. It also saves this description to the database so that if you accidentally delete it later, it can be restored.
- Existing AI:
  - Definition: A description containing the `(ai_generated)` tag.
  - Behavior: The script considers this valid but "owned" by the machine.
  - Action: It preserves the existing AI generation.
  - How to Regenerate: If you want to regenerate an AI description, simply delete it from the file and run the script again with an AI flag (`--generate-docs-yml-ai` or `--generate-docs-config-ai`).
- Cache Restore:
  - Behavior:
    - If running WITHOUT AI (`--generate-docs-yml` or `--generate-docs-config`): The script attempts to restore missing descriptions from the `doc_cache` database. This protects against accidental deletion.
    - If running WITH AI (`--generate-docs-yml-ai` or `--generate-docs-config-ai`): The script assumes a missing description means you want to generate a new one, so it skips the cache restore (unless the cached version was human-written).
- Generate AI (Lowest Priority):
  - Definition: No description in file, and not restored from cache.
  - Action: Calls Google Gemini to generate a new description.
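The hierarchy above can be sketched as a single decision function. This is a reading of the rules as documented, not the tool's source: `resolve_description`, `CachedDescription`, the `generate` callback, and the returned source labels are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

AI_TAG = "(ai_generated)"


@dataclass
class CachedDescription:
    """Hypothetical view of one doc_cache row."""
    text: str
    is_human: bool


def resolve_description(
    current: Optional[str],
    cached: Optional[CachedDescription],
    use_ai: bool,
    generate: Callable[[], str],
) -> Tuple[str, str]:
    """Return (description, source) following the documented priority order."""
    text = (current or "").strip()
    # 1. Human written: any non-empty text without the tag is never overwritten.
    if text and AI_TAG not in text:
        return text, "human"
    # 2. Existing AI: tagged text is preserved as-is.
    if text:
        return text, "existing_ai"
    # 3. Cache restore: always without AI; with AI only for human-written entries.
    if cached is not None and (not use_ai or cached.is_human):
        return cached.text, "cache"
    # 4. Generate AI: only when an AI flag was passed.
    if use_ai:
        return f"{generate().strip()} {AI_TAG}", "generated"
    # Nothing in the file, nothing restorable, and AI is off.
    return "", "missing"
```

For instance, an empty description with a cached AI entry resolves to `"cache"` when `use_ai` is false and to `"generated"` when it is true, which is the difference between a sync run and a regenerate run.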
🔎 Examples
Example 1: Protecting Manual Work (Human Override)
- Scenario: The AI previously generated: `"Flag indicating if user is active (ai_generated)"`.
- Your Action: You decide this is too vague. You manually edit the YAML file to: `"Flag for users who have logged in within the last 30 days."` (Note: You removed the tag).
- Result: On the next run, the tool sees no tag. It marks it as Human Written. It updates the database with your new definition but will not call the AI or overwrite your text. Your manual definition is now safe.
Example 2: Forcing an AI Update (Regenerate)
- Scenario: The file has a description: `"Total value of orders (ai_generated)"` which you think is wrong.
- Your Action: You delete the description line (or make it empty).
- Result: Run `dbt-autodoc --generate-docs-yml-ai`. The tool sees the description is missing. It ignores the old cached AI value and calls Gemini to generate a fresh one.
- Final Output: `"Sum of gross merchandise value for completed orders (ai_generated)"`.
Example 3: Restoring Lost Documentation
- Scenario: You run a command that accidentally wipes descriptions, and you want them back without using AI.
- Action: Run `dbt-autodoc --generate-docs-yml` (no AI).
- Result: The tool checks `doc_cache` and restores your last known descriptions.
2. Special Tags
- (ai_generated): Automatically appended to all AI-generated descriptions. Identifies content that can be updated by the script.
3. Database & Caching
The tool maintains two tables in your database (duckdb local file or postgres):
doc_cache
Stores the current active description for every model and column.
- Purpose: Prevents regenerating documentation for unchanged models (saves money/time) and serves as a backup.
- Columns:
  - `dbt_project_name`
  - `dbt_profile_name`
  - `model_name`
  - `column_name`
  - `description`
  - `user_name`
  - `is_human`: Boolean flag indicating if the description was manually written (True) or AI-generated (False).
  - `updated_at`
doc_cache_log
An audit log of all changes made to descriptions.
- Purpose: Tracks who changed what and when. Useful for debugging or rolling back.
- Trigger: Written to whenever a description changes (e.g., `Old Value` -> `New Value`).
- Columns:
  - `dbt_project_name`
  - `dbt_profile_name`
  - `model_name`
  - `column_name`
  - `old_description`
  - `new_description`
  - `user_name`
  - `is_human`: Boolean flag representing the status of the new description.
  - `changed_at`
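As a rough illustration of how the two tables relate, the sketch below creates them and writes a log row whenever a description changes. It uses Python's built-in `sqlite3` purely as a stand-in so it runs without extra packages; the tool itself targets DuckDB or Postgres. The column types, the `save_description` helper, and the choice to identify a row by the four name columns are assumptions, not the tool's actual schema or code.

```python
import sqlite3
from datetime import datetime, timezone

# Column names come from the lists above; the types are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS doc_cache (
    dbt_project_name TEXT, dbt_profile_name TEXT, model_name TEXT, column_name TEXT,
    description TEXT, user_name TEXT, is_human BOOLEAN, updated_at TEXT
);
CREATE TABLE IF NOT EXISTS doc_cache_log (
    dbt_project_name TEXT, dbt_profile_name TEXT, model_name TEXT, column_name TEXT,
    old_description TEXT, new_description TEXT, user_name TEXT, is_human BOOLEAN,
    changed_at TEXT
);
"""

KEY_FILTER = (
    "dbt_project_name = ? AND dbt_profile_name = ? "
    "AND model_name = ? AND column_name = ?"
)


def save_description(conn, key, description, user_name, is_human):
    """Store the current description and log the change; return True if it changed.

    `key` is (project, profile, model, column). Hypothetical helper.
    """
    now = datetime.now(timezone.utc).isoformat()
    row = conn.execute(
        f"SELECT description FROM doc_cache WHERE {KEY_FILTER}", key
    ).fetchone()
    old = row[0] if row else None
    if row is not None and old == description:
        return False  # unchanged: nothing to cache or log
    if row is None:
        conn.execute(
            "INSERT INTO doc_cache VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (*key, description, user_name, is_human, now),
        )
    else:
        conn.execute(
            "UPDATE doc_cache SET description = ?, user_name = ?, is_human = ?, "
            f"updated_at = ? WHERE {KEY_FILTER}",
            (description, user_name, is_human, now, *key),
        )
    # Audit trail: one row per change, recording old and new text.
    conn.execute(
        "INSERT INTO doc_cache_log VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (*key, old, description, user_name, is_human, now),
    )
    conn.commit()
    return True
```

With this shape, `doc_cache` always holds one current row per model/column pair, while `doc_cache_log` grows by one row per edit, which is what makes restoring or auditing a description possible.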
4. User Tracking
The tool tracks which user is running the script to populate the user_name field in the logs. It resolves the user in this order:
- `DBT_USER` environment variable.
- `USER` environment variable.
- `USERNAME` environment variable.
- System logged-in user (`getpass.getuser()`).
- Fallback to `'unknown'`.
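The lookup order above might be written roughly as follows. This is a sketch under assumptions: `resolve_user` is a hypothetical name, the optional `env` parameter exists only to make the function easy to test, and treating blank values as unset is a guess about the tool's behavior.

```python
import getpass
import os
from typing import Mapping, Optional


def resolve_user(env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the user name in the documented order (hypothetical helper)."""
    env = os.environ if env is None else env
    # 1-3. Environment variables, checked in priority order.
    for name in ("DBT_USER", "USER", "USERNAME"):
        value = env.get(name, "").strip()
        if value:  # assumption: blank values count as unset
            return value
    # 4. System logged-in user.
    try:
        return getpass.getuser()
    except Exception:
        # 5. getuser() can fail when no login name is available.
        return "unknown"
```

Setting `DBT_USER` is the most predictable option in CI, where `USER` or `USERNAME` often resolve to a generic runner account.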
📝 Best Practices
- Run Structure Sync First: Run `dbt-autodoc --generate-docs-yml` frequently to keep your YAML files consistent with your SQL models without calling AI.
- Review AI Changes: AI descriptions include the `(ai_generated)` tag. You can leave them as is, or edit them. If you remove the tag, the script will treat them as human-written and protect them from future updates.
- Regenerate Poor AI Descriptions: If an AI description is poor, simply delete it (from the YAML or SQL file) and run with `--generate-docs-yml-ai` (for columns) or `--generate-docs-config-ai` (for tables) to generate a new one.
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Attribution
Brought to you by JustDataPlease.