Local-first codebase intelligence with semantic search, multi-hop research, and 12-language AST support

Sia Code

Local-first codebase search with semantic understanding and multi-hop code discovery.

Benchmark Results

89.9% Recall@5 on RepoEval benchmark (1,600 queries, 8 repositories)

  • 12.9 percentage points higher than cAST (77.0%)
  • Lexical-only search outperforms hybrid (BM25 > BM25+embeddings)
  • Publication-quality results with ±1.5% confidence interval

See docs/BENCHMARK_RESULTS.md for full analysis.
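
For readers unfamiliar with the metric, the following is a minimal sketch of how Recall@5 is commonly computed. It is not the project's benchmark harness: the function names and sample IDs are hypothetical, and the exact definition used for RepoEval is the one described in docs/BENCHMARK_RESULTS.md.

```python
from typing import Iterable, Sequence, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Iterable[str], k: int = 5) -> float:
    """Fraction of the relevant items that appear in the top-k ranked results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = relevant_set.intersection(ranked[:k])
    return len(hits) / len(relevant_set)


def mean_recall_at_k(
    runs: Sequence[Tuple[Sequence[str], Iterable[str]]], k: int = 5
) -> float:
    """Average recall@k over (ranked_results, relevant_items) pairs, one per query."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical example: two queries, one fully answered and one missed.
runs = [
    (["auth.py", "login.py", "utils.py"], ["login.py"]),
    (["db.py", "cache.py"], ["errors.py"]),
]
average = mean_recall_at_k(runs, k=5)
```

A score reported as 89.9% would correspond to an average of 0.899 over all benchmark queries under this kind of definition.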

Features

  • 89.9% Recall@5 - State-of-the-art code search performance on RepoEval benchmark
  • Lexical-First Search - BM25 + FTS5 optimized for code queries (outperforms semantic-only)
  • Multi-Hop Research - Automatically discover code relationships and call graphs
  • AST-Aware Chunking - Tree-sitter preserves function/class boundaries
  • Project Auto-Detection - Automatic language detection and indexing strategy
  • Tiered Search - Filter by project code, dependencies, or both
  • 12 Languages - Python, JS/TS, Go, Rust, Java, C/C++, C#, Ruby, PHP (full AST support)
  • Watch Mode - Auto-reindex on file changes with incremental updates
  • Portable Index - Usearch HNSW + SQLite FTS5 in .sia-code/ directory

Installation

# From PyPI (recommended)
pip install sia-code

# Or with uv
uv tool install sia-code

# Or from source
uv tool install git+https://github.com/DxTa/sia-code.git

# Try without installing (ephemeral run)
uvx sia-code --version
uvx sia-code search "authentication logic"

# Verify installation
sia-code --version

Quick Start

# Initialize and index
sia-code init
sia-code index .

# Search
sia-code search "authentication logic"           # Hybrid search (default: BM25 + semantic)
sia-code search --regex "def.*login"             # Lexical-only search (BM25)
sia-code search --semantic-only "handle errors"  # Semantic-only search

# Multi-hop research (discover relationships)
sia-code research "how does the API handle errors?"

# Check index health
sia-code status

Commands

Command                                   Description
sia-code init                             Initialize index in current directory
sia-code index .                          Index codebase
sia-code index --update                   Re-index changed files only
sia-code index --watch                    Auto-reindex on file changes
sia-code search "query"                   Hybrid search (BM25 + semantic)
sia-code search --regex "pattern"         Lexical-only search
sia-code search --semantic-only "query"   Semantic-only search
sia-code research "question"              Multi-hop code discovery
sia-code status                           Index health and staleness metrics
sia-code compact                          Remove stale chunks
sia-code memory list                      List timeline/changelogs/decisions
sia-code memory changelog                 Generate changelog from git
sia-code memory sync-git                  Import events from git history
sia-code config show                      Display configuration
sia-code interactive                      Live search mode

See docs/CLI_FEATURES.md for complete command reference with all options and examples.

Configuration

Recommended: Lexical-only search (best performance, no API key needed)

sia-code init
sia-code index .
# Without an API key, search runs lexical-only BM25 (89.9% Recall@5)

Optional: Hybrid search (adds semantic embeddings):

export OPENAI_API_KEY=sk-your-key-here
sia-code config set embedding.enabled true
sia-code config set search.vector_weight 0.0  # 0.0 = lexical-only (recommended); set 0.5 for hybrid
sia-code index --clean

Edit config at .sia-code/config.json to:

  • Set vector_weight (0.0 = lexical-only, 0.5 = hybrid, 1.0 = semantic-only)
  • Change embedding model (BAAI/bge-small-en-v1.5, openai-small)
  • Exclude patterns (node_modules/, __pycache__/, etc.)
  • Adjust chunk sizes (max_chunk_size, min_chunk_size)

View config: sia-code config show
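
To make the vector_weight values above concrete, here is a hedged sketch of one way a weight could blend a lexical ranking with a semantic one using Reciprocal Rank Fusion (the RRF named under How It Works). The fuse_rankings function, the constant k=60, and the way the weight is applied are all assumptions for illustration; Sia Code's actual fusion may differ.

```python
from typing import Dict, List, Sequence


def fuse_rankings(
    lexical: Sequence[str],
    semantic: Sequence[str],
    vector_weight: float = 0.0,
    k: int = 60,
) -> List[str]:
    """Weighted Reciprocal Rank Fusion over two ranked lists of chunk IDs.

    vector_weight = 0.0 uses only the lexical ranking, 1.0 only the semantic
    ranking, and values in between mix the two (hypothetical weighting scheme).
    """
    weighted_lists = ((lexical, 1.0 - vector_weight), (semantic, vector_weight))
    scores: Dict[str, float] = {}
    for ranking, weight in weighted_lists:
        if weight == 0.0:
            continue  # a zero-weight list contributes nothing, so skip it
        for rank, chunk_id in enumerate(ranking, start=1):
            # Standard RRF term 1 / (k + rank), scaled by the list's weight.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


lexical_hits = ["a", "b", "c"]
semantic_hits = ["c", "b", "d"]
lexical_only = fuse_rankings(lexical_hits, semantic_hits, vector_weight=0.0)
hybrid = fuse_rankings(lexical_hits, semantic_hits, vector_weight=0.5)
```

With vector_weight at 0.0 the fused order is simply the lexical order, which matches the recommended lexical-only setting.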

AI Summarization (optional, enhances git changelogs):

{
  "summarization": {
    "enabled": true,
    "model": "google/flan-t5-base",
    "max_commits": 20
  }
}

Output Formats

sia-code search "query" --format json            # JSON output
sia-code search "query" --format table           # Rich table
sia-code search "query" --format csv             # CSV for Excel
sia-code search "query" --output results.json    # Save to file

Supported Languages

Full AST Support (12): Python, JavaScript, TypeScript, JSX, TSX, Go, Rust, Java, C, C++, C#, Ruby, PHP

Recognized: Kotlin, Groovy, Swift, Bash, Vue, Svelte, and more (indexed as text)

Troubleshooting

Issue                   Solution
No API key warning      Normal - searches fall back to lexical mode
Index growing large     Run sia-code compact to remove stale chunks
Slow indexing           Use sia-code index --update for incremental re-indexing
Stale search results    Run sia-code index --clean to rebuild

How It Works

  1. Parse - Tree-sitter generates a language-agnostic AST for each file
  2. Chunk - AST-aware chunking preserves function/class boundaries (max 1200 chars)
  3. Index - Usearch HNSW (vectors) + SQLite FTS5 (lexical search with BM25)
  4. Store - Portable .sia-code/ directory (17-25 MB per repo)
  5. Search - Lexical-first (BM25) with optional hybrid fusion (RRF)
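
Steps 1 and 2 can be illustrated with a small sketch. It uses Python's built-in ast module as a stand-in for Tree-sitter, so it only handles Python source, and the line-based fallback for oversized definitions is an assumption rather than Sia Code's actual behavior. The 1200-character limit comes from step 2; the function names are hypothetical.

```python
import ast
from typing import List

MAX_CHUNK_CHARS = 1200  # chunk limit quoted in step 2


def split_by_lines(text: str, max_chars: int) -> List[str]:
    """Fallback: split text on line boundaries so pieces stay under max_chars.

    A single line longer than max_chars is kept whole in this sketch.
    """
    pieces, current = [], ""
    for line in text.splitlines(keepends=True):
        if current and len(current) + len(line) > max_chars:
            pieces.append(current)
            current = ""
        current += line
    if current:
        pieces.append(current)
    return pieces


def chunk_python_source(source: str, max_chars: int = MAX_CHUNK_CHARS) -> List[str]:
    """Return chunks aligned to top-level function and class boundaries.

    Other top-level statements (imports, constants) are skipped here for brevity.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: List[str] = []
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        # lineno is 1-based and end_lineno is inclusive (Python 3.8+).
        start = node.lineno - 1
        if node.decorator_list:
            # Keep decorators attached to the definition they decorate.
            start = min(d.lineno for d in node.decorator_list) - 1
        text = "\n".join(lines[start:node.end_lineno])
        chunks.extend(split_by_lines(text, max_chars))
    return chunks


sample = "def a():\n    return 1\n\nclass B:\n    def m(self):\n        return 2\n"
sample_chunks = chunk_python_source(sample)
```

Each chunk starts at a definition boundary, so a search hit maps back to a whole function or class instead of an arbitrary window of text.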

Key Innovation: Lexical-only search (BM25) outperforms hybrid (BM25+embeddings) for code queries because code contains precise identifiers that benefit from exact keyword matching.
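
The point about exact identifier matching can be tried directly with SQLite FTS5, the lexical engine named in step 3. This is a minimal sketch, assuming the Python build's SQLite includes FTS5 (common, but not guaranteed); the chunks table, its columns, and the sample rows are hypothetical and do not reflect Sia Code's real schema.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per code chunk, both columns full-text indexed.
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(path, body)")
conn.executemany(
    "INSERT INTO chunks (path, body) VALUES (?, ?)",
    [
        ("auth/login.py", "def login_user(username, password): return verify_password(username, password)"),
        ("auth/tokens.py", "def verify_token(token): return decode_jwt(token)"),
        ("utils/errors.py", "class ApiError(Exception): pass"),
    ],
)


def search(db: sqlite3.Connection, query: str, limit: int = 5) -> List[Tuple[str, float]]:
    """Return (path, bm25_score) pairs, best match first."""
    # Quote the query as a single FTS5 phrase so punctuation in identifiers
    # is not interpreted as query syntax.
    phrase = '"' + query.replace('"', '""') + '"'
    rows = db.execute(
        "SELECT path, bm25(chunks) FROM chunks "
        "WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT ?",
        (phrase, limit),
    )
    # bm25() returns lower values for better matches, hence ascending order.
    return [(path, score) for path, score in rows]


token_hits = search(conn, "verify_token")
login_hits = search(conn, "login")
```

Searching for the identifier verify_token returns only the chunk that defines it, even though another chunk contains the similar name verify_password.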

Documentation

Architecture & Implementation

Benchmark Results

Usage & Configuration

License

MIT
