SQLDown

Bidirectional markdown ↔ SQLite conversion - Load, query, and dump with ease.

Status

MVP Features Implemented:

  • ✅ Column limit validation with --max-columns flag
  • ✅ Top-N section extraction with --top-sections flag
  • ✅ Import with dynamic schema generation
  • ✅ Watch mode for auto-refresh
  • 🚧 Dump command (planned)
  • 🚧 Info command (planned)

See SPECIFICATION.md for the complete v0.1 design. See REVIEW.md for Papa Bear's detailed design review.

Philosophy

  • Bidirectional: Load markdown → SQLite, dump SQLite → markdown
  • Simple: Use sqlite3 for queries, Python for parsing
  • Smart: Dynamic schema, path-aware dumps, change detection
  • Database as authority: Enforces schema consistency

Quick Start (Future v0.1)

# Load markdown files into database
sqldown load ~/tasks

# Query with sqlite3
sqlite3 sqldown.db "SELECT * FROM docs WHERE status='active'"

# Dump back to markdown
sqldown dump -d sqldown.db -o ~/restored

# Get info
sqldown info -d sqldown.db

Import Command

python3 bin/import.py --db DATABASE --table TABLE --root PATH [OPTIONS]

What it does:

  • Scans markdown files recursively
  • Parses YAML frontmatter → columns
  • Extracts H2 sections → section_* columns
  • Creates schema dynamically based on discovered fields
  • Upserts into SQLite (idempotent - run multiple times safely)
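
Roughly, the parsing step amounts to splitting frontmatter from body and slicing the body at H2 headings. A minimal regex-based sketch, not the actual implementation (parse_markdown and the exact patterns are illustrative):

import re
from pathlib import Path

import yaml  # PyYAML, installed automatically by uv (see Requirements)

def parse_markdown(path: Path) -> dict:
    """Flatten one markdown file into a row: frontmatter keys plus section_* columns."""
    text = path.read_text(encoding="utf-8")
    row = {"_path": str(path), "body": text}  # full markdown is preserved in body

    # YAML frontmatter between the leading '---' fences becomes top-level columns.
    m = re.match(r"^---\n(.*?)\n---\n(.*)$", text, re.DOTALL)
    if m:
        row.update(yaml.safe_load(m.group(1)) or {})
        text = m.group(2)

    # The first H1 becomes the title.
    h1 = re.search(r"^# (.+)$", text, re.MULTILINE)
    if h1:
        row["title"] = h1.group(1).strip()

    # Each H2 section becomes a section_<slug> column.
    parts = re.split(r"^## (.+)$", text, flags=re.MULTILINE)
    for heading, content in zip(parts[1::2], parts[2::2]):
        slug = re.sub(r"\W+", "_", heading.strip().lower()).strip("_")
        row[f"section_{slug}"] = content.strip()
    return row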

Options:

  • --db PATH - Database file (created if it doesn't exist)
  • --table NAME - Table name (default: docs)
  • --root PATH - Directory containing markdown files
  • --pattern GLOB - File pattern (default: **/*.md)
  • --max-columns N - Maximum allowed columns (default: 1800, SQLite limit: 2000)
  • --top-sections N - Extract only top N most common sections (default: 20, 0=all)
  • --watch, -w - Watch for file changes and auto-update
  • --verbose, -v - Show progress

Using SQLite3

Once imported, use sqlite3 directly for all queries:

# List tables
sqlite3 cache.db ".tables"

# Show schema
sqlite3 cache.db ".schema tasks"

# Query
sqlite3 cache.db "SELECT title, status FROM tasks WHERE status='active' LIMIT 10"

# Aggregate
sqlite3 cache.db "SELECT status, COUNT(*) FROM tasks GROUP BY status"

# Complex queries
sqlite3 cache.db "
  SELECT project, COUNT(*) as active_count
  FROM tasks
  WHERE status='active'
  GROUP BY project
  ORDER BY active_count DESC
"

# Export to CSV
sqlite3 -csv cache.db "SELECT * FROM tasks WHERE status='active'" > active.csv

# Interactive mode
sqlite3 cache.db

Dynamic Schema Example

From this markdown:

---
status: active
project: agents
priority: high
---

# Add SQLite caching

## Objective
Create a cache layer...

## Implementation Plan
1. Parser
2. Schema

Creates these columns automatically:

  • Core: _id, _path, _sections, title, body, lead, file_modified
  • Frontmatter: status, project, priority
  • Sections: section_objective, section_implementation_plan

Real example: 87 tasks → 181 columns (no schema design needed!)
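
The schema mechanics come essentially for free from sqlite-utils (a listed dependency). A sketch, with row values taken from the example above and a hypothetical _id:

import sqlite_utils

db = sqlite_utils.Database("cache.db")
row = {
    "_id": "add-sqlite-caching/README.md",  # hypothetical primary key
    "title": "Add SQLite caching",
    "status": "active", "project": "agents", "priority": "high",
    "section_objective": "Create a cache layer...",
    "section_implementation_plan": "1. Parser\n2. Schema",
}
# alter=True adds any columns the table doesn't have yet (dynamic schema);
# replace=True turns repeat imports into upserts (idempotent re-runs).
db["docs"].insert(row, pk="_id", alter=True, replace=True)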

Column Limit Protection

SQLite caps tables at 2000 columns by default (SQLITE_MAX_COLUMN). With diverse markdown documents, you can easily hit this limit:

Problem: 5,225 tasks with diverse sections = 6,694 columns (💥 exceeds limit!)

Solution: Use --top-sections to extract only the most common sections:

# Extract only top 20 most common sections (default)
python3 bin/import.py --db cache.db --root ~/tasks --top-sections 20

# Result: 5,225 tasks → 116 columns ✅
# - 7 base columns (_id, _path, title, body, lead, _sections, file_modified)
# - 89 frontmatter columns (status, project, type, priority, etc.)
# - 20 section columns (overview, usage, objective, notes, next_steps, etc.)
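
The selection itself is just a frequency count over the whole collection. A sketch, assuming documents have already been flattened into dicts with section_* keys (top_sections is a hypothetical helper, not the tool's API):

from collections import Counter
from typing import Dict, Iterable, Set

def top_sections(rows: Iterable[Dict[str, str]], n: int = 20) -> Set[str]:
    """Return the names of the N most common section_* keys across all rows."""
    counts = Counter(key for row in rows for key in row if key.startswith("section_"))
    return {name for name, _ in counts.most_common(n)}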

What happens to other sections?

  • All content is preserved in the body field
  • You can still search across all sections using SQLite FTS5
  • Only the top N sections become queryable columns
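
FTS5 indexing is listed below as a future enhancement, but nothing stops you from bolting one on today with stock SQLite. A sketch using Python's standard sqlite3 module, assuming a tasks table with title and body columns:

import sqlite3

con = sqlite3.connect("cache.db")
# Build a standalone FTS5 index over titles and full markdown bodies.
con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS tasks_fts USING fts5(title, body)")
con.execute("DELETE FROM tasks_fts")  # rebuild from scratch so re-runs stay in sync
con.execute("INSERT INTO tasks_fts(title, body) SELECT title, body FROM tasks")
con.commit()

# Matches hit any section, including ones that were not promoted to columns.
for (title,) in con.execute("SELECT title FROM tasks_fts WHERE tasks_fts MATCH 'cache'"):
    print(title)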

Column limit validation:

# Check if your documents will fit before importing
python3 bin/import.py --db test.db --root ~/docs --verbose

# Output shows breakdown:
# 📊 Column breakdown:
#   - Base columns: 7
#   - Frontmatter columns: 89
#   - Section columns: 20
#   - Total: 116 (limit: 1800)

When approaching the limit (>90%):

  • Shows a warning but continues the import
  • Consider reducing --top-sections or increasing --max-columns

When exceeding the limit:

  • Stops before the import runs, so no partially built table is left behind
  • Shows breakdown and suggestions
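
The validation logic is deliberately simple. Roughly (validate_columns is a hypothetical helper, and the messages won't match the tool's output verbatim):

import sys

def validate_columns(column_names: set, max_columns: int = 1800) -> None:
    """Check the discovered schema before any CREATE TABLE runs."""
    total = len(column_names)
    if total > max_columns:
        print(f"Error: {total} columns exceeds the limit ({max_columns}); "
              "reduce --top-sections or raise --max-columns", file=sys.stderr)
        sys.exit(1)
    if total > 0.9 * max_columns:
        print(f"Warning: {total}/{max_columns} columns is over 90% of the limit")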

Multiple Collections

One database, multiple tables:

# Import different document types
python3 bin/import.py --db cache.db --table tasks --root ~/tasks
python3 bin/import.py --db cache.db --table notes --root ~/notes
python3 bin/import.py --db cache.db --table skills --root ~/.claude/skills

# Query them
sqlite3 cache.db "SELECT * FROM tasks WHERE status='active'"
sqlite3 cache.db "SELECT * FROM notes WHERE tags LIKE '%sqlite%'"

# Join across tables
sqlite3 cache.db "
  SELECT t.title as task, n.title as note
  FROM tasks t
  JOIN notes n ON n.tags LIKE '%' || t.project || '%'
  WHERE t.status='active'
"

Refresh Strategy

One-time import:

Import is idempotent - just run it again:

# Add this to cron or a git hook
python3 bin/import.py --db cache.db --table tasks --root ~/tasks

Watch mode (auto-refresh):

Use the --watch / -w flag to automatically update the cache when files change:

# Watch mode: import once, then auto-update on file changes
md-import --db cache.db --table tasks --root ~/tasks --watch

# Output:
# ✅ Imported 87 documents into cache.db:tasks
# 📋 Schema has 181 columns
#
# 👀 Watching /Users/admin/tasks for changes... (Ctrl-C to stop)
# [2025-01-15 10:23:45] Updated: AG-22_feat_add-configuration/README.md
# [2025-01-15 10:24:12] Added: AG-31_feat_new-feature/README.md

Watch mode is ideal for development workflows where you want the cache to stay in sync with your files.
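
A watch loop doesn't need a filesystem-events library. A simplified mtime-polling sketch (the real --watch flag may be implemented differently):

import time
from pathlib import Path
from typing import Dict

def watch(root: Path, interval: float = 1.0) -> None:
    """Poll file mtimes under root and report changed markdown files."""
    # Initial scan registers existing files: import once, then watch.
    seen: Dict[Path, float] = {f: f.stat().st_mtime for f in root.rglob("*.md")}
    while True:
        time.sleep(interval)
        for f in root.rglob("*.md"):
            mtime = f.stat().st_mtime
            if seen.get(f) != mtime:
                action = "Updated" if f in seen else "Added"
                seen[f] = mtime
                print(f"{action}: {f.relative_to(root)}")  # a real run would re-import here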

Common Queries

# Active tasks
sqlite3 cache.db "SELECT title FROM tasks WHERE status='active'"

# Recent updates
sqlite3 cache.db "SELECT title, updated FROM tasks ORDER BY updated DESC LIMIT 10"

# By project
sqlite3 cache.db "SELECT project, COUNT(*) FROM tasks GROUP BY project"

# Search content
sqlite3 cache.db "SELECT title FROM tasks WHERE body LIKE '%cache%'"

# High priority incomplete
sqlite3 cache.db "SELECT title FROM tasks WHERE priority='high' AND status != 'completed'"

Why Not Use query.py?

You could! But sqlite3 gives you:

  • Full SQL power (no wrapper limitations)
  • Standard tool everyone knows
  • CSV export, JSON mode, etc.
  • Interactive shell with history
  • Better performance (no Python overhead)

The import tool adds value (markdown parsing). Query wrappers don't.

Requirements

Prerequisites:

  • Python 3.7+ (includes sqlite3 module - standard library)
  • sqlite3 CLI (built-in on macOS 10.4+ and most Linux distributions)
  • uv - Python package installer and runner

Dependencies (handled automatically by uv):

The md-import script uses uv's inline dependency specification, so dependencies are automatically installed on first run:

  • sqlite-utils >= 3.30
  • click >= 8.0
  • PyYAML >= 6.0
  • pathspec >= 0.11.0

No manual pip install or virtual environment setup needed!
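
For reference, uv reads dependencies from an inline metadata block (PEP 723), so the top of md-import presumably looks something like this:

#!/usr/bin/env -S uv run
# /// script
# requires-python = ">=3.7"
# dependencies = [
#     "sqlite-utils>=3.30",
#     "click>=8.0",
#     "PyYAML>=6.0",
#     "pathspec>=0.11.0",
# ]
# ///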

Claude Code Integration

The SKILL.md file teaches Claude to:

  1. Use bin/import.py to create/refresh caches
  2. Use sqlite3 for all queries
  3. Start with simple queries, then get complex if needed
  4. Read full markdown files only when necessary (progressive disclosure)

License

MIT


Development Log

2025-01-13 - Design Session: Python Package Specification

Session Goals:

  • Design conversion from uv single-file script to proper Python package
  • Plan bidirectional markdown ↔ SQLite workflow
  • Define CLI interface and conventions

Key Decisions:

  1. Package Structure:

    • PyPI name: sqldown
    • CLI commands: load, dump, info
    • Modern packaging with pyproject.toml and src/ layout
  2. Command Design:

    • sqldown load - markdown → database (with --where for simple filtering)
    • sqldown dump - database → markdown (with full SQL --filter)
    • sqldown info - database stats and table summaries
    • All commands support --dry-run
  3. Smart Features:

    • Path-aware dumping (only dump files under target directory)
    • Change detection (only write files if content differs)
    • Frontmatter filtering during load (simple key=value in v0.1)
    • Automatic gitignore respect
    • Watch mode for file changes
  4. Configuration:

    • Default database: ./sqldown.db
    • Environment variable: SQLDOWN_DB
    • Tab completion for bash/zsh/fish
  5. Conventions:

    • Recommended frontmatter field: doctype (for document type classification)
    • Short flags: -d (db), -t (table), -o (output), -w (watch), -v (verbose)
    • Database as schema authority

Testing:

  • Imported 686 tasks from ~/tasks into test database
  • Hit SQLite's 2000 column limit (1927 section columns, 72 frontmatter fields)
  • Confirmed round-trip capability (body field contains full markdown)

Artifacts:

  • SPECIFICATION.md - Complete v0.1 design (627 lines)
  • REVIEW.md - Papa Bear's design review and recommendations
  • Test database analysis showing column distribution

Future Enhancements (v0.2+):

  • Full SQL WHERE during load (currently simple key=value)
  • Bidirectional watch mode (database changes → file updates)
  • JSON overflow columns for >2000 column scenarios
  • FTS5 full-text search
  • Configuration file support (.sqldown.toml)

Critical Items from Review:

  • Address 2000 column limit with JSON overflow in v0.1 (not deferred)
  • Add comprehensive error handling and recovery
  • Implement transaction batching for performance
  • Add SQL injection protection in filter clauses
  • Validate path traversal security in dump

Next Steps:

  • Phase 0: Address critical review items (column overflow, error handling)
  • Phase 1: Rename markdown-cache → sqldown throughout codebase
  • Phase 2: Create package structure (src/sqldown/, pyproject.toml)
  • Phase 3: Implement core load/dump functions
  • Phase 4: Implement CLI with Click
  • Phase 5: Add tests (>80% coverage)
  • Phase 6: Update documentation

2025-01-13 - Column Limit Protection Implemented

Session Goals:

  • Address Papa Bear's critical review finding: SQLite 2000 column limit
  • Implement MVP column limit validation
  • Add top-N section extraction feature

Implemented Features:

  1. Column Limit Validation:

    • Added --max-columns flag (default: 1800, leaving safety margin)
    • Pre-import validation scans all documents and counts unique columns
    • Shows detailed breakdown: base/frontmatter/section columns
    • Warning at 90% threshold
    • Stops import with error if limit exceeded (prevents database corruption)
  2. Top-N Section Extraction:

    • Added --top-sections flag (default: 20, 0=extract all)
    • Analyzes section frequency across entire document collection
    • Extracts only the N most common sections as queryable columns
    • All other sections preserved in body field
    • Reduces column count from 6,694 → 116 for a real-world task collection

Testing Results:

  • 5,225 task documents successfully imported
  • Without top-sections: Would create 6,694 columns (exceeds 2000 limit)
  • With top-20 sections: Creates 116 columns (well within limit)
  • Top extracted sections: overview, usage, objective, notes, next_steps, troubleshooting, installation, etc.

Validation:

  • ✅ Correctly stops import when limit exceeded
  • ✅ Shows warning when approaching limit (90% threshold)
  • ✅ Proper exit codes (1 on failure, 0 on success)
  • ✅ Detailed column breakdown in verbose mode
  • ✅ All content preserved in body field

Documentation Updated:

  • README.md: Added column limit protection section
  • IDEAS.md: Added section whitelisting and body sync concepts
  • Commit messages document rationale and implementation

Next Steps:

  • Update SPECIFICATION.md to reflect implemented features
  • Update SKILL.md with new CLI flags and patterns
  • Create taskmaster tasks for ongoing sqldown development
  • Consider: section whitelisting (explicit list vs. frequency-based)
