Architecture extraction & codebase intelligence for the agentic era
Project description
archex
Architectural intelligence for code, with no LLM bill.
archex is a Python library and CLI that transforms any Git repository into structured architectural intelligence and token-budget-aware code context — fully local, fully deterministic, no API keys, no per-call cost. It serves two consumers from a single index: human architects receive an ArchProfile with module boundaries, dependency graphs, detected patterns, and interface surfaces; AI agents receive a ContextBundle with relevance-ranked, syntax-aligned code chunks assembled to fit within a specified token budget.
Features
- 8 language adapters — Python, TypeScript/JavaScript, Go, Rust, Java, Kotlin, C#, Swift (tree-sitter AST parsing), extensible via entry points
- 8 public APIs —
analyze(),query(),compare(),file_tree(),file_outline(),search_symbols(),get_symbol(),get_symbols_batch()+ token counting utilities - Hybrid retrieval — BM25F weighted-field keyword search + optional vector embeddings, merged via confidence-weighted Reciprocal Rank Fusion and adaptive Relative Score Fusion behind an AvgIDF fusion gate
- Intent-aware ranking — query intent classification (definition, architecture, usage, debugging, general) selects per-intent scoring-weight presets
- Cross-encoder reranking — opt-in
cross-encoder/ms-marco-MiniLMre-scoring of top candidates; local sentence-transformers, off by default, enabled viaIndexConfig(rerank=True) - Token budget assembly — AST-aware chunking, Personalized PageRank dependency-graph expansion, greedy bin-packing with configurable
ScoringWeights - Structural analysis — module detection (Louvain), pattern recognition (extensible
PatternRegistry), interface extraction - Cross-repo comparison — 6 architectural dimensions, no LLM required
- Delta indexing — surgical re-index via git diff when only a few files changed; mtime-based fallback for non-git sources; configurable
delta_threshold - Performance — cache-first query (skips parse on cache hit), delta indexing, parallel parsing, parallel compare, git-aware cache keys; warm queries under 150ms
- Extensibility — plugin APIs for language adapters, pattern detectors, chunkers, and scoring weights via entry points and protocols
- Security — input validation on git URLs/branches, FTS5 query escaping, cache key validation,
allow_pickle=Falsefor vector persistence - Pipeline observability — opt-in
PipelineTracewith step-level timing for retrieve, expand, score, assemble stages - Unified artifact pipeline —
produce_artifacts()single entry point for parse, import-resolve, chunk, edge-build - HTTP API server —
archex serveexposes/analyze,/query,/compare,/tree,/outline,/symbols,/symbol/{id}, and/benchmark/*endpoints over FastAPI, with an optional/dashboard - Query normalization — camelCase/snake_case splitting, bigram compound generation, architecture-intent synonym expansion
- Quality gates — CI-embeddable threshold checks for recall, precision, F1, MRR with latency warnings
- Expansion gating — weak BM25 seeds (below 10% of max) don't trigger graph expansion; score-relative file cutoff removes noise
- 35-task benchmark corpus — 16 self-referential + 19 external repo tasks across Python, Go, Rust, JS/TS, covering 5 difficulty categories
- LLM-free — every pipeline stage is local; no API keys, no per-query cost, no rate-limit dependencies
Installation
# CLI tool (system-wide)
uv tool install archex
# Project dependency
uv add archex
Extras
The core package handles all 8 languages, structural analysis, and BM25 retrieval with zero API calls. Extras add optional capabilities:
Agent integration:
uv tool install "archex[mcp]" # MCP server (Claude Code / Claude Desktop)
uv add "archex[langchain]" # LangChain retriever
uv add "archex[llamaindex]" # LlamaIndex retriever
uv add "archex[lsap]" # LSP enrichment (Python 3.12+, lsp-client)
Hybrid retrieval (vector embeddings + BM25):
uv add "archex[vector]" # ONNX local embeddings (Nomic Code) — no GPU required
uv add "archex[vector-fast]" # FastEmbed embeddings — fast, no GPU
uv add "archex[vector-torch]" # Torch-backed sentence-transformers — GPU-accelerated
Other:
uv add "archex[graph]" # igraph + leidenalg — faster Leiden module detection
uv add "archex[language-pack]" # Fallback tree-sitter grammars
uv add "archex[all]" # vector + graph + mcp + langchain + llamaindex + language-pack
Quick Start
Python API
from archex import analyze, query, compare
from archex.models import RepoSource
# Architectural analysis
profile = analyze(RepoSource(local_path="./my-project"))
for module in profile.module_map:
print(f"{module.name}: {len(module.files)} files")
for pattern in profile.pattern_catalog:
print(f"[{pattern.confidence:.0%}] {pattern.name}")
# Implementation context for an agent
bundle = query(
RepoSource(local_path="./my-project"),
"How does authentication work?",
token_budget=8192,
)
print(bundle.to_prompt(format="xml"))
# Query with custom scoring weights
from archex.models import ScoringWeights
bundle = query(
RepoSource(local_path="./my-project"),
"database connection pooling",
scoring_weights=ScoringWeights(relevance=0.8, structural=0.1, type_coverage=0.1),
)
# Cross-repo comparison
result = compare(
RepoSource(local_path="./project-a"),
RepoSource(local_path="./project-b"),
dimensions=["error_handling", "api_surface"],
)
Surgical Lookups
from archex.api import file_tree, file_outline, search_symbols, get_symbol
from archex.models import RepoSource
source = RepoSource(local_path="./my-project")
# Browse repository structure (~2,000 tokens vs 200,000+ for raw listing)
tree = file_tree(source, max_depth=3, language="python")
# Get symbol outline for a single file (~180 tokens vs 4,800 for full file)
outline = file_outline(source, "src/auth/middleware.py")
# Search symbols by name across the codebase
matches = search_symbols(source, "authenticate", kind="function", limit=10)
# Retrieve full source for a specific symbol by stable ID
symbol = get_symbol(source, "src/auth/middleware.py::authenticate#function")
# Batch retrieval for multiple symbols
from archex.api import get_symbols_batch
symbols = get_symbols_batch(source, [
"src/auth/middleware.py::authenticate#function",
"src/models/user.py::User#class",
])
Token Counting
from archex.api import get_file_token_count, get_files_token_count, get_repo_total_tokens
from archex.models import RepoSource
source = RepoSource(local_path="./my-project")
# Count tokens before deciding what to include in context
total = get_repo_total_tokens(source)
auth_tokens = get_file_token_count(source, "src/auth/middleware.py")
batch_tokens = get_files_token_count(source, ["src/auth/middleware.py", "src/models/user.py"])
CLI
# Analyze a local repo or remote URL
archex analyze ./my-project --format json
archex analyze https://github.com/org/repo --format markdown -l python --timing
# Query for implementation context
archex query ./my-project "How does auth work?" --budget 8192 --format xml
archex query ./my-project "connection pooling" --strategy hybrid --timing
archex query "How does auth work?" --budget 8192 # defaults to cwd
# Browse and search
archex tree ./my-project --depth 3 -l python
archex tree --depth 3 -l python # defaults to cwd
archex outline ./my-project src/auth/middleware.py
archex symbols ./my-project "authenticate" --kind function --limit 10
archex symbols "authenticate" --kind function --limit 10 # defaults to cwd
archex symbol ./my-project "src/auth/middleware.py::authenticate#function"
# Compare two repositories
archex compare ./project-a ./project-b --dimensions error_handling,api_surface --format markdown
# Serve the HTTP API (analyze/query/compare/tree/symbol + benchmark endpoints)
archex serve --host 127.0.0.1 --port 8080
# Repo-local lifecycle
archex init
archex index --format json
archex status
archex dogfood --task archex_query_pipeline
archex reset --force
# Manage the analysis cache
archex cache list
archex cache clean --max-age 168
archex cache info
# Benchmark retrieval strategies
archex benchmark run tasks.yaml --strategies bm25,hybrid --output results.json
archex benchmark report results.json --format markdown
archex benchmark delta ./my-project
Repo-Local Lifecycle
Use repo-local mode when archex is part of an agent or maintainer workflow for a checked-out repository:
cd ./my-project
archex init
archex index
archex status
archex query "Where is cache invalidation handled?" --budget 8192 --timing
archex dogfood --task archex_query_cache_lifecycle
archex init creates .archex/settings.toml, .archex/metadata.json, and .archex/dogfood/history/, then adds .archex/ to .gitignore. The entire .archex/ directory is generated local state: it holds the repo-local SQLite index, vector artifacts, and dogfood reports, and should stay out of source control.
Lifecycle commands default SOURCE to the current working directory. Explicit sources still take precedence:
archex index ../other-repo
archex query ../other-repo "How does query packing work?"
The same cwd default applies to local-read commands where a URL cannot be inferred: analyze, query, tree, and symbols.
archex status is a preflight check, not an indexing command. It reports whether the project is initialized, whether an index exists, whether the indexed commit matches HEAD, whether the working tree has changed since the indexed snapshot, and where the latest dogfood report lives. Use --strict when stale or dirty state should fail a script.
archex index explicitly builds or refreshes the repo-local index. Repeated runs use the cache only when both HEAD and the stored working-tree signature match; uncommitted source edits trigger delta or full reindexing.
archex reset --force deletes generated repo-local index/vector state while preserving settings. archex reset --all --force removes .archex/ entirely. Reset never touches the global ~/.archex/cache unless a future global reset command is added.
archex dogfood runs the self-task regression suite and writes .archex/dogfood/latest.json, .archex/dogfood/latest.md, and timestamped history. Dogfood is mandatory local validation for retrieval-sensitive changes and compares against .archex/dogfood/baseline.json when present or the committed benchmarks/dogfood_baseline.json. The gate is baseline-relative: existing weak tasks do not fail the build unless a metric regresses from the recorded baseline. The GitHub dogfood workflow is manual-only via workflow_dispatch for maintainers who want an uploaded report artifact.
Agent Integration
archex is designed to be called by coding agents. Four integration paths, ordered by depth:
Shell Out (any agent)
Any agent that can execute shell commands (Cursor, Claude Code, Copilot, custom frameworks) can use archex directly:
# Agent needs to understand a foreign codebase — one call, structured output
archex query https://github.com/encode/httpx "connection pooling" --budget 8000 --format xml
# Agent is already inside the target repo — no explicit . needed
archex status --strict
archex query "Where is cache invalidation handled?" --budget 8000 --format xml
# Agent needs the lay of the land before implementation
archex tree https://github.com/encode/httpx --depth 3
# Agent needs a specific function's source
archex symbol ./repo "src/pool.py::ConnectionPool#class"
MCP Server (Claude Code / Claude Desktop)
Add to your MCP configuration:
{
"mcpServers": {
"archex": {
"command": "archex",
"args": ["mcp"]
}
}
}
The agent discovers 8 tools automatically:
| Tool | What it does |
|---|---|
analyze_repo |
Full architectural profile (modules, patterns, interfaces) |
query_repo |
Token-budget context assembly for a natural language query |
compare_repos |
Structural comparison across architectural dimensions |
get_file_tree |
Annotated directory listing with language/symbol counts |
get_file_outline |
Symbol hierarchy for a single file (signatures only) |
search_symbols |
Fuzzy symbol search by name, kind, and language |
get_symbol |
Full source code for a symbol by its stable ID |
get_symbols_batch |
Batch retrieval of multiple symbols in one call |
Python API (agent frameworks)
# LangChain retriever
from archex.integrations.langchain import ArchexRetriever
retriever = ArchexRetriever(source=RepoSource(local_path="./repo"))
# LlamaIndex retriever
from archex.integrations.llamaindex import ArchexRetriever
retriever = ArchexRetriever(source=RepoSource(local_path="./repo"))
# Direct usage in any framework
from archex import query
from archex.models import RepoSource
bundle = query(
RepoSource(url="https://github.com/encode/httpx"),
"How does connection pooling work?",
token_budget=8000,
)
agent_context = bundle.to_prompt(format="xml")
HTTP API (any language)
Run archex serve to expose the same operations over HTTP — for agents and services that prefer REST over shelling out:
archex serve --port 8080
| Method & path | Operation |
|---|---|
GET /health |
Liveness check |
POST /analyze |
Architecture profile for a repo |
POST /query |
Token-budget context bundle for a question |
POST /compare |
Cross-repo structural comparison |
GET /tree |
Annotated file tree |
GET /outline |
Symbol outline for a file |
GET /symbols |
Symbol search |
GET /symbol/{symbol_id} |
Full source for a symbol by stable ID |
GET /benchmark/results |
Stored benchmark results |
GET /benchmark/summary |
Aggregate benchmark summary |
GET /benchmark/gate |
Quality-gate status |
FastAPI and uvicorn ship in the core install, so archex serve works out of the box. An optional /dashboard is served when its assets are present.
When to Use archex
archex gives AI agents structural priors about codebases they've never seen.
When an agent encounters a new codebase, it has two options: explore file-by-file (expensive, slow, high risk of missing context) or receive a pre-computed structural map (cheap, fast, complete). archex is the pre-computed map.
| Capability | archex | archex + LSAP | Claude Code | LSP |
|---|---|---|---|---|
| Cold-start codebase understanding | Yes — pre-computed structural map | Yes — structural + semantic | Slow — sequential file exploration | No — requires active session |
| Semantic type resolution | No — syntactic (tree-sitter) | Yes — LSP hover, references, definitions | Via LLM reasoning | Yes — compiler-grade |
| Token-budget context assembly | Yes — ranked, packed, dependency-expanded | Yes — enriched with type context | No — agent manually selects | No — not designed for this |
| Cross-repo structural comparison | Yes — 6 dimensions, no LLM | Yes | No | No |
| Real-time editing support | No | No | No | Yes |
| Offline / CI-embeddable | Yes | Partially — needs running language server | No | Partially |
| Works with any agent framework | Yes — CLI, MCP, Python API | Yes — async Python API | Claude-specific | Editor-specific |
LSAP Enrichment (opt-in)
archex symbols can be enriched with LSP type information via archex[lsap]. This is opt-in — the core pipeline remains syntactic-only.
from lsp_client import PyrightClient
from archex.api import get_symbol
from archex.integrations.lsap import LSAPEnrichedLookup
from archex.models import RepoSource
# Caller manages the language server lifecycle.
async with PyrightClient("./my-project") as client:
lookup = LSAPEnrichedLookup(lsp_client=client)
symbol = get_symbol(RepoSource(local_path="./my-project"), "src/repo.py::get_user#method")
enriched = await lookup.enrich_symbol(symbol)
print(enriched.lsap_enrichment.hover.type_signature)
print(f"{enriched.lsap_enrichment.reference_count} references")
MCP Co-Tool Configuration
archex and an LSP MCP server can run as sibling tools, giving agents both structural context and precision type operations:
{
"mcpServers": {
"archex": {
"command": "archex",
"args": ["mcp"]
},
"lsp": {
"command": "lsp-cli",
"args": ["--language", "python", "--root", "./my-project"]
}
}
}
Workflow: archex for structural context (modules, patterns, ranked chunks) → LSP MCP for precision follow-ups (type resolution, find references, go-to-definition).
Concrete workflows:
- Before agent edits this repo:
archex status --strict→archex indexif stale or dirty →archex query "Where should this change land?" - Before agent explores new repo:
archex query ./repo "auth system" --budget 8K→ context bundle → feed to agent's first prompt - Before a code review:
archex analyze ./pr-branch→ architectural impact summary - Before claiming retrieval improvement: run the relevant focused dogfood task locally, for example
archex dogfood --task archex_query_pipeline - Before landing retrieval-sensitive changes: run
archex dogfood . --format jsonlocally and inspect.archex/dogfood/latest.md; the GitHub dogfood workflow is manual-only for report artifacts - CI gate: standard lint, type, and test workflows run automatically; dogfood is a mandatory local gate, not an automatic CI gate
Benchmarks
Retrieval Quality
query() retrieval quality is measured against human-annotated expected files. The benchmark corpus is 35 tasks (16 self-referential + 19 external); the table below reports the BM25 run over the original 25-task subset (6 self-referential + 19 external), captured before the self-task corpus expansion. Token budget: 8,192. Strategy: BM25.
Perfect recall (1.00):
| Task | Recall | Precision | F1 | MRR |
|---|---|---|---|---|
| httpx_pooling | 1.00 | 0.38 | 0.55 | 1.00 |
| mini_redis_async | 1.00 | 0.50 | 0.67 | 0.50 |
| archex_pattern_detection | 1.00 | 0.25 | 0.40 | 1.00 |
| gin_routing | 1.00 | 0.50 | 0.67 | 1.00 |
| go_gin_middleware | 1.00 | 0.38 | 0.55 | 0.50 |
| requests_sessions | 1.00 | 0.43 | 0.60 | 1.00 |
High recall (0.67):
| Task | Recall | Precision | F1 | MRR |
|---|---|---|---|---|
| archex_adapter_registry | 0.67 | 0.25 | 0.36 | 1.00 |
| archex_graph_expansion | 0.67 | 0.33 | 0.44 | 1.00 |
| archex_scoring | 0.67 | 0.25 | 0.36 | 1.00 |
| click_decorators | 0.67 | 0.29 | 0.40 | 1.00 |
| express_error_handling | 0.67 | 0.25 | 0.36 | 1.00 |
| fastapi_routing | 0.67 | 0.40 | 0.50 | 1.00 |
| flask_blueprints | 0.67 | 0.40 | 0.50 | 0.50 |
| pydantic_validators | 0.67 | 0.25 | 0.36 | 0.17 |
Moderate recall (0.33-0.50):
| Task | Recall | Precision | F1 | MRR |
|---|---|---|---|---|
| pytest_fixtures | 0.50 | 0.14 | 0.22 | 1.00 |
| archex_delta_indexing | 0.33 | 0.12 | 0.18 | 1.00 |
| celery_task_dispatch | 0.33 | 0.14 | 0.20 | 0.25 |
| django_middleware | 0.33 | 0.17 | 0.22 | 1.00 |
| express_middleware | 0.33 | 0.12 | 0.18 | 0.25 |
| fastapi_dependency_injection | 0.33 | 0.25 | 0.29 | 1.00 |
| react_hooks | 0.33 | 0.12 | 0.18 | 0.14 |
| rust_tokio_runtime | 0.33 | 0.14 | 0.20 | 0.33 |
| sqlalchemy_sessions | 0.33 | 0.14 | 0.20 | 0.50 |
| archex_query_pipeline | 0.00 | 0.00 | 0.00 | 0.00 |
Zero recall:
| Task | Recall | Precision | F1 | MRR |
|---|---|---|---|---|
| django_orm_queries | 0.00 | 0.00 | 0.00 | 0.00 |
Summary (BM25, 25 tasks): Mean Recall: 0.58, Mean F1: 0.34, Mean MRR: 0.69, Mean nDCG: 0.52, Mean MAP: 0.40. Avg token savings: 40.2%.
By category:
| Category | Tasks | Recall | Precision | F1 | MRR |
|---|---|---|---|---|---|
| external-framework | 9 | 0.80 | 0.36 | 0.50 | 0.80 |
| architecture-broad | 3 | 0.67 | 0.26 | 0.38 | 0.83 |
| self | 6 | 0.56 | 0.20 | 0.29 | 0.83 |
| framework-semantic | 2 | 0.33 | 0.19 | 0.23 | 0.62 |
| external-large | 5 | 0.27 | 0.11 | 0.16 | 0.25 |
How to read this: Recall measures what fraction of expected files appear in the context bundle. Precision measures what fraction of returned chunks are from expected files (low by design — archex includes dependency-expanded context beyond the strict expected set). MRR measures how early the first relevant file appears in the ranked results (1.0 = first position). Reproduce locally with
archex benchmark run; per-task JSON is written tobenchmarks/results/(gitignored).Honest assessment: archex excels at mid-size framework repos (80% recall for external-framework tasks) where BM25 keyword matching aligns well with file content and query vocabulary. Self-referential tasks and architecture-broad queries also perform well (MRR 0.83). The primary weakness is very large codebases (django, react, sqlalchemy) where BM25 alone cannot disambiguate generic vocabulary across hundreds of files — external-large recall drops to 27%. Precision remains structurally low because the 8K token budget packs dependency-expanded chunks beyond the strict expected set.
Token Efficiency
Token savings measured across 10 open-source repositories spanning Python, JavaScript, TypeScript, Go, and Rust — from 35 files to 2,332 files:
query() — context retrieval (the operation agents use most):
| Repository | Language | Files | Repo Tokens | query() Raw | query() Output | query() Savings |
|---|---|---|---|---|---|---|
| flask | Python | 80 | 182K | 68K | 8,158 | 88.0% |
| requests | Python | 35 | 131K | 104K | 7,823 | 92.5% |
| fastapi | Python | 941 | 774K | 137K | 7,964 | 94.2% |
| django | Python | 2,332 | 7,196K | 179K | 8,128 | 95.5% |
| express | JS | 141 | 136K | 23K | 7,925 | 66.1% |
| got | TS | 77 | 198K | 72K | 7,959 | 88.9% |
| gin | Go | 99 | 190K | 92K | 5,670 | 93.9% |
| actix-web | Rust | 313 | 619K | 107K | 7,926 | 92.6% |
| oak | TS | 61 | 112K | 55K | 7,821 | 85.8% |
| httpx | Python | 57 | 172K | 70K | 7,975 | 88.6% |
Median query() savings: 90.8%. "query() Raw" = tokens an agent would read by loading every file matching the query's language filter.
What this measures and what it doesn't: Token savings measures compression ratio — how much less data archex sends compared to reading raw files. It does not measure retrieval quality. See the retrieval quality table above for recall, precision, and MRR.
Per-operation breakdown (all 10 repos):
| Operation | Savings Range | Median | What it replaces |
|---|---|---|---|
file_tree() |
94.9–99.1% | 98.3% | Reading entire repo to understand structure |
analyze() |
81.7–99.3% | 96.7% | Manual exploration of modules, patterns, interfaces |
compare() |
79.1–99.7% | 97.8% | Reading both repos end-to-end |
query() |
66.1–95.5% | 90.8% | Multi-file search and backtracking |
get_symbol() |
48.4–87.1% | 61.7% | Reading entire file to extract one function |
search_symbols() |
85.9–96.1% | 91.0% | Grepping + reading matching files |
Aggregate numbers: Summing raw tokens across all 8 operations gives a per-repo "total savings" of 84–99% (median 96.8%). This aggregate inflates the picture because it includes
compare()(which counts both repos' tokens) and treats all operations as a single atomic call. The per-operation breakdown above is more honest.
CLI Reference
archex init [source]
Initialize repo-local project state. source defaults to the current working directory.
| Option | Default | Description |
|---|---|---|
--force |
off | Rewrite existing project settings |
--reset |
off | Delete existing .archex state before init; requires force |
archex index [source]
Build or refresh the index without running a query. source defaults to the current working directory.
| Option | Default | Description |
|---|---|---|
--format |
text |
Output format: text, json |
archex status [source]
Inspect repo-local project state without indexing. source defaults to the current working directory.
| Option | Default | Description |
|---|---|---|
--strict |
off | Fail unless the index is fresh |
--format |
text |
Output format: text, json |
archex reset [source]
Delete repo-local generated state. source defaults to the current working directory.
| Option | Default | Description |
|---|---|---|
--force |
off | Required for destructive execution |
--all |
off | Remove the entire .archex/ directory |
archex dogfood [source]
Run self-benchmark dogfood tasks and gate against a stored baseline. source defaults to the current working directory.
| Option | Default | Description |
|---|---|---|
--task |
default self task set | Run one task by task id |
--all |
off | Run all self dogfood tasks |
--include-external |
off | Include non-self benchmark tasks |
--query-fusion |
off | Include the experimental query-fusion strategy |
--baseline |
auto | Baseline JSON path |
--format |
text |
Output format: text, json |
archex analyze [source]
Analyze a repository and produce an architecture profile.
| Option | Default | Description |
|---|---|---|
--format |
json |
Output format: json, markdown |
-l / --language |
all | Filter by language (repeatable) |
--timing |
off | Print timing breakdown to stderr |
source defaults to the current working directory for local use. Remote URLs remain explicit.
archex query [source] <question>
Query a repository and return a context bundle.
| Option | Default | Description |
|---|---|---|
--budget |
8192 |
Token budget for context assembly |
--format |
xml |
Output format: xml, json, markdown |
-l / --language |
all | Filter by language (repeatable) |
--strategy |
bm25 |
Retrieval strategy: bm25, hybrid |
--timing |
off | Print timing breakdown to stderr |
When one argument is provided, it is treated as the question and source defaults to the current working directory. Use an explicit source for another local repo or a remote URL.
archex compare <source_a> <source_b>
Compare two repositories across architectural dimensions.
| Option | Default | Description |
|---|---|---|
--dimensions |
all 6 | Comma-separated dimension list |
--format |
json |
Output format: json, markdown |
-l / --language |
all | Filter by language (repeatable) |
--timing |
off | Print timing breakdown to stderr |
Supported dimensions: api_surface, concurrency, configuration, error_handling, state_management, testing.
archex tree [source]
Display the annotated file tree of a repository.
| Option | Default | Description |
|---|---|---|
--depth |
5 |
Maximum directory depth |
-l / --language |
all | Filter to specific language |
--json |
off | Output as JSON |
--timing |
off | Print timing breakdown |
source defaults to the current working directory.
archex outline <source> <file_path>
Display the symbol outline for a single file (signatures, hierarchy, no source bodies).
| Option | Default | Description |
|---|---|---|
--json |
off | Output as JSON |
--timing |
off | Print timing breakdown |
archex symbols [source] <query>
Search symbols by name across a repository.
| Option | Default | Description |
|---|---|---|
--kind |
all | Filter by kind: function, class, etc. |
-l / --language |
all | Filter to specific language |
--limit |
20 |
Maximum results |
--json |
off | Output as JSON |
--timing |
off | Print timing breakdown |
When one argument is provided, it is treated as the symbol query and source defaults to the current working directory.
archex symbol <source> <symbol_id>
Retrieve the full source code for a symbol by its stable ID (e.g., src/auth.py::login#function).
| Option | Default | Description |
|---|---|---|
--json |
off | Output as JSON |
--timing |
off | Print timing breakdown |
archex mcp
Start the MCP (Model Context Protocol) server for agent integration.
Tools exposed: analyze_repo, query_repo, compare_repos, get_file_tree, get_file_outline, search_symbols, get_symbol, get_symbols_batch.
archex serve
Start the HTTP API server (FastAPI + uvicorn).
| Option | Default | Description |
|---|---|---|
--host |
127.0.0.1 |
Bind host |
--port |
8080 |
Bind port |
--reload |
off | Auto-reload for development |
Endpoints: /health, /analyze, /query, /compare, /tree, /outline, /symbols, /symbol/{id}, /benchmark/results, /benchmark/summary, /benchmark/gate, and an optional /dashboard.
archex cache <subcommand>
Manage the local analysis cache.
| Subcommand | Options | Description |
|---|---|---|
list |
--cache-dir |
List cached entries |
clean |
--max-age N (hours), --cache-dir |
Remove expired entries |
info |
--cache-dir |
Show cache summary |
archex benchmark <subcommand>
Benchmark retrieval strategies against real repositories.
| Subcommand | Description |
|---|---|
run |
Run benchmarks across strategies |
report |
Generate formatted reports from results |
triage |
Rank retrieval failures for a strategy |
readiness |
Non-blocking retrieval readiness report |
gate |
Check results against quality thresholds |
validate |
Validate benchmark task definitions |
baseline save |
Save current results as golden baseline |
baseline compare |
Compare current results against a saved baseline |
delta run |
Run delta indexing benchmarks |
delta gate |
Check delta results against thresholds |
delta report |
Generate delta benchmark reports |
Extensibility
archex supports plugin APIs via Python entry points and protocols.
Custom Language Adapters
Register a language adapter in your package's pyproject.toml:
[project.entry-points."archex.language_adapters"]
dart = "mypackage.adapters:DartAdapter"
The adapter class must implement the LanguageAdapter protocol (see archex.parse.adapters.base).
Custom Pattern Detectors
Register a pattern detector function:
[project.entry-points."archex.pattern_detectors"]
my_pattern = "mypackage.patterns:detect_my_pattern"
Detector signature: (list[ParsedFile], DependencyGraph) -> DetectedPattern | None.
Custom Chunkers
Implement the Chunker protocol and pass it to query():
from archex.pipeline.chunker import Chunker
class MyChunker:
def chunk_file(self, parsed_file, source): ...
def chunk_files(self, parsed_files, sources): ...
bundle = query(source, question, chunker=MyChunker())
Custom Scoring Weights
Adjust the retrieval ranking formula:
from archex.models import ScoringWeights
# Boost relevance, lower structural weight
weights = ScoringWeights(relevance=0.8, structural=0.1, type_coverage=0.1)
bundle = query(source, question, scoring_weights=weights)
Configuration
archex reads configuration from global defaults, ~/.archex/config.toml, repo-local .archex/settings.toml, ARCHEX_* environment variables, and explicit CLI/API arguments.
Precedence is:
defaults
-> ~/.archex/config.toml
-> <repo>/.archex/settings.toml
-> ARCHEX_* environment variables
-> explicit CLI/API arguments
# ~/.archex/config.toml
[default]
languages = ["python", "typescript"]
cache = true
cache_dir = "~/.archex/cache"
parallel = true
max_file_size = 10000000
delta_threshold = 0.5
Repo-local settings are created by archex init:
# .archex/settings.toml
[project]
mode = "local"
[index]
cache_dir = ".archex"
languages = []
vector = false
delta_threshold = 0.5
[dogfood]
tasks_dir = "benchmarks/tasks"
output_dir = ".archex/dogfood"
default_task_prefix = "archex_"
strategies = ["raw_files", "raw_grepped", "archex_query"]
| Field | Default | Description |
|---|---|---|
languages |
all supported | Languages to parse (list of strings) |
depth |
3 |
Module detection depth |
cache |
true |
Enable caching of parse/index results |
cache_dir |
~/.archex/cache |
Cache storage directory |
max_file_size |
10000000 (10 MB) |
Skip files larger than this (bytes) |
parallel |
true |
Enable parallel parsing and comparison |
delta_threshold |
0.5 |
If more than this fraction of files changed, full re-index instead of delta |
Environment variables override global and repo-local file config: ARCHEX_CACHE_DIR, ARCHEX_PARALLEL, ARCHEX_MAX_FILE_SIZE.
Language Support
| Language | Extensions | Symbols | Import Resolution |
|---|---|---|---|
| Python | .py |
Functions, classes, methods, types, constants, decorators | import, from...import, relative imports |
| TypeScript / JavaScript | .ts, .tsx, .js, .jsx |
Functions, classes, methods, types, interfaces, enums, constants | ES modules, CommonJS, type-only imports, re-exports |
| Go | .go |
Functions, methods (pointer/value receivers), structs, interfaces, constants | Package imports with alias support |
| Rust | .rs |
Functions, structs, enums, traits, impl blocks, constants, macros | use statements, pub/pub(crate) visibility |
| Java | .java |
Classes, interfaces, enums, methods, fields, annotations | Package imports, static imports, wildcard imports |
| Kotlin | .kt, .kts |
Classes, objects, functions, properties, extensions, companions | Package imports, alias imports |
| C# | .cs |
Classes, structs, interfaces, enums, methods, properties, events | using directives, namespace-qualified resolution |
| Swift | .swift |
Classes, structs, enums, protocols, actors, extensions, functions | import declarations |
Adapters are extensible via Python entry points — add a new language without modifying archex core.
Development
git clone https://github.com/Mathews-Tom/archex.git
cd archex
uv sync --all-extras
# Run tests (1,928 tests, 85% minimum coverage enforced)
uv run pytest
# Lint and format
uv run ruff check .
uv run ruff format .
# Type check (strict mode)
uv run pyright
License
Apache 2.0 — see LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file archex-0.6.1.tar.gz.
File metadata
- Download URL: archex-0.6.1.tar.gz
- Upload date:
- Size: 716.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.14
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
79d73b27e5fbb4e07cbfc04d8fec33729f83091a0b549d87648cd4b815f5c671
|
|
| MD5 |
4fd38394a90401b829f394284f0ff74d
|
|
| BLAKE2b-256 |
0c7bd68c71c8840692c2980a9d3b81b07e81064fcba6f63512976e388bdcabf7
|
File details
Details for the file archex-0.6.1-py3-none-any.whl.
File metadata
- Download URL: archex-0.6.1-py3-none-any.whl
- Upload date:
- Size: 246.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.14
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
98d398d335f43563102e9973e9f24b2bf5a2f178b01ea1db4aeaff2f1eaf5e7b
|
|
| MD5 |
eda7c18ea989b0252fe6af1aef327ab6
|
|
| BLAKE2b-256 |
8b903da82ff67f0eb30cb65770c6b2a9496d5ed9c4354381a028b4a1b5eea8e6
|