codearc
Mine a Python repo's git history to extract all distinct versions of every function and class into a DuckDB database.
Quick Start
# Run directly (no install needed)
uvx codearc --repo /path/to/repo --db output.duckdb --verbose
# Query the results
python -c "
import duckdb
conn = duckdb.connect('output.duckdb')
for row in conn.execute('SELECT qualname, kind, COUNT(*) as versions FROM symbol_versions GROUP BY 1, 2 ORDER BY 3 DESC LIMIT 10').fetchall():
    print(row)
"
Installation
Requires Python 3.12+ and uv.
# As a CLI tool
uv tool install codearc
codearc --repo /path/to/repo --db output.duckdb
# As a library
uv add codearc
From source
git clone https://github.com/drothermel/codearc.git
cd codearc
uv sync
uv run codearc --help
CLI Reference
codearc --repo PATH --db OUTPUT.duckdb [options]
| Option | Description |
|---|---|
| `--repo PATH` | Path to the git repository (required) |
| `--db PATH` | Path to output DuckDB file (required) |
| `--package-root PATH` | Package root for module path calculation |
| `--since-commit HASH` | Resume from a specific commit |
| `--since DATE` | Process commits after date (ISO format) |
| `--authors "a,b"` | Comma-separated author filter |
| `--no-merge/--include-merge` | Skip merge commits (default: skip) |
| `--ignore PATTERN` | Additional ignore patterns (repeatable) |
| `-v, --verbose` | Show mining statistics |
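The `--package-root` option affects how a file path becomes a module name. The following is a minimal sketch of one plausible mapping, not codearc's actual implementation: `path_to_module` is a hypothetical helper, and it assumes the module name is the path relative to the package root with separators replaced by dots.

```python
from pathlib import PurePosixPath


def path_to_module(file_path: str, package_root: str = "") -> str:
    """Convert a repo-relative .py path to a dotted module name (illustrative only)."""
    path = PurePosixPath(file_path)
    if package_root:
        # Drop the package root prefix, e.g. "src" in "src/pkg/mod.py".
        # Raises ValueError if the file is not under the package root.
        path = path.relative_to(PurePosixPath(package_root))
    parts = list(path.with_suffix("").parts)
    # Assumption: a package's __init__.py maps to the package itself.
    if parts and parts[-1] == "__init__":
        parts.pop()
    return ".".join(parts)


print(path_to_module("src/codearc/mining/miner.py", package_root="src"))
print(path_to_module("src/codearc/__init__.py", package_root="src"))
```

Under these assumptions the two calls print `codearc.mining.miner` and `codearc`.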
Examples
# Mine a repo with verbose output
codearc --repo ~/projects/mylib --db mylib.duckdb --verbose
# Filter by author and date
codearc --repo . --db output.duckdb --authors "Alice,Bob" --since 2024-01-01
# Resume from a specific commit
codearc --repo . --db output.duckdb --since-commit abc123
# Add custom ignore patterns
codearc --repo . --db output.duckdb --ignore "generated/*" --ignore "vendor/*"
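To make the `--ignore` example concrete, here is a rough sketch of glob-style matching using the standard library. Whether codearc uses `fnmatch` semantics is an assumption, and `is_ignored` is a hypothetical name chosen for illustration.

```python
from fnmatch import fnmatchcase


def is_ignored(file_path: str, patterns: list[str]) -> bool:
    """Return True if file_path matches any glob-style ignore pattern."""
    # fnmatchcase compares case-sensitively on every platform;
    # note that "*" also matches "/" in fnmatch-style globs.
    return any(fnmatchcase(file_path, pattern) for pattern in patterns)


patterns = ["generated/*", "vendor/*"]
for path in ["generated/api_pb2.py", "vendor/six.py", "src/app/main.py"]:
    print(path, is_ignored(path, patterns))
```

With these patterns the first two paths are ignored and `src/app/main.py` is kept.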
Features
- Symbol extraction - Extracts functions, classes, and methods using LibCST with accurate source positions and qualified names
- Version deduplication - Stores only distinct versions of each symbol (by content hash), avoiding redundant storage
- Git history traversal - Walks commit history with PyDriller, processing only modified Python files
- Crash recovery - Per-commit database writes with extraction state tracking for resumability
- Flexible filtering - Filter by author, date, ignore patterns; skip merge commits by default
- Encoding handling - Gracefully handles non-UTF8 files with fallback encodings
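As a rough illustration of the symbol extraction feature, the sketch below collects functions, classes, and methods with qualified names and line spans. It deliberately uses the standard library `ast` module as a stand-in for LibCST so it runs without extra dependencies; `extract_symbols` and the dictionary layout are hypothetical and do not reflect codearc's internal API.

```python
import ast

SOURCE = '''
def top():
    """Top-level function."""
    return 1

class Greeter:
    def hello(self, name):
        return f"hi {name}"
'''


def extract_symbols(source: str) -> list[dict]:
    """Collect functions, classes, and methods with qualified names."""
    symbols = []

    def visit(node: ast.AST, prefix: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                qualname = f"{prefix}.{child.name}" if prefix else child.name
                symbols.append({
                    "qualname": qualname,
                    "kind": "class" if isinstance(child, ast.ClassDef) else "function",
                    "start_line": child.lineno,
                    "end_line": child.end_lineno,
                    "docstring": ast.get_docstring(child),
                    "code": ast.get_source_segment(source, child),
                })
                # Nested definitions extend the qualified name.
                visit(child, qualname)
            else:
                visit(child, prefix)

    visit(ast.parse(source), "")
    return symbols


for sym in extract_symbols(SOURCE):
    print(sym["kind"], sym["qualname"], sym["start_line"], sym["end_line"])
```

Unlike LibCST, `ast` discards comments and formatting details, so a real lossless extractor would preserve more of the original source than this sketch does.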
Database Schema
The extracted data is stored in two tables:
symbol_versions - All distinct versions of symbols
- `version_key` - Unique identifier (`repo:module:qualname:kind:code_hash`)
- `symbol_key` - Symbol identifier without version (`repo:module:qualname:kind`)
- `repo_id`, `commit_hash`, `commit_time` - Git metadata
- `file_path`, `module`, `start_line`, `end_line` - Location info
- `kind` - "function" or "class"
- `qualname` - Qualified name (e.g., `ClassName.method_name`)
- `code`, `code_hash` - Exact source code and its hash
- `docstring` - Extracted docstring if present
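The key formats above suggest how deduplication by content hash could work. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and SHA-256 over the UTF-8 source text is a guess, since the hash algorithm is not specified here.

```python
import hashlib


def code_hash(code: str) -> str:
    """Hash the exact source text (SHA-256 is an assumption)."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()


def symbol_key(repo: str, module: str, qualname: str, kind: str) -> str:
    """Identify a symbol independent of its content."""
    return f"{repo}:{module}:{qualname}:{kind}"


def version_key(repo: str, module: str, qualname: str, kind: str, code: str) -> str:
    """Identify one distinct version of a symbol."""
    return f"{symbol_key(repo, module, qualname, kind)}:{code_hash(code)}"


# Three snapshots of one function across commits; the third reverts to the first.
snapshots = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return b + a\n",
    "def add(a, b):\n    return a + b\n",
]

versions: dict[str, str] = {}
for code in snapshots:
    key = version_key("myrepo", "pkg.mod", "add", "function", code)
    # Keep only the first occurrence of each distinct version.
    versions.setdefault(key, code)

print(len(versions))  # 2 distinct versions are stored
```

Because the key includes the content hash, a revert to earlier code maps back to an existing `version_key` rather than creating a new row.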
extraction_state - Tracks mining progress for resumability
- `repo_id`, `last_processed_commit`, `total_commits_processed`, etc.
Demo Scripts
Interactive demos to explore the library's capabilities:
| Script | Description |
|---|---|
| `scripts/demo_models.py` | Data models, ignore pattern matching, key generation |
| `scripts/demo_database.py` | Database operations, deduplication, extraction state |
| `scripts/demo_extractor.py` | LibCST parsing, symbol extraction with metadata |
| `scripts/demo_module_paths.py` | File path to module name conversion |
| `scripts/demo_miner.py` | End-to-end mining of a sample git repo |
Run any demo:
uv run python scripts/demo_miner.py
Development
Running Tests
uv run pytest tests/ -v
Project Structure
src/codearc/
├── cli.py # Typer CLI entrypoint
├── database.py # DuckDB schema + operations
├── utils.py # Hashing, module paths, encoding
├── extraction/ # Symbol extraction from source
│ ├── extract_symbols.py # Main extraction entry point
│ ├── symbol_extractor.py # LibCST visitor for symbols
│ └── docstring.py # Docstring extraction
├── mining/ # Git history mining
│ ├── miner.py # PyDriller git traversal
│ ├── mining_config.py # MiningConfig
│ ├── mining_stats.py # MiningStats
│ ├── symbol_version.py # SymbolVersion
│ ├── ignore_patterns.py # IgnorePatterns
│ └── encoding_config.py # EncodingConfig
└── models/ # Shared models
└── extracted_symbol.py # ExtractedSymbol, SymbolKind
scripts/ # Demo scripts
tests/ # Test suite (80 tests)
Dependencies
- PyDriller - Git repository mining
- LibCST - Lossless Python parsing
- DuckDB - Embedded analytics database
- Typer - CLI framework
- Rich - Terminal output formatting
- Pydantic - Data validation and models