Python bindings for the big-code-analysis Rust library
big-code-analysis (Python bindings)
Python bindings for the big-code-analysis Rust library: compute
maintainability metrics for source code in ~20 languages using the
same tree-sitter parsers the Rust crate ships with.
Full documentation: the book's Python Bindings chapter covers the install matrix, batch / async / SARIF recipes, and the full error taxonomy. The README below is the quick reference shown on PyPI.
All nine phases of the Python bindings work (issues #265–#273;
parent #103) have landed. The crate now ships single-file
analysis, the never-raise batch entry point, the flatten_spaces
flat-record iterator, explicit metric selection (metrics=),
SARIF 2.1.0 rendering (to_sarif), the strict ruff / mypy /
pyright tooling gate, manylinux wheel CI on Linux x86_64 +
aarch64, the book's "Python Bindings" chapter, and the end-user
example set covered below. See the
CHANGELOG for the per-phase changes.
Runnable examples
big-code-analysis-py/examples/ is the canonical collection of
copy-paste recipes. Every file is executed under CI either via
tests/test_book_examples.py (the .py examples) or via
jupyter nbconvert --execute (the notebook), so a renamed kwarg
or removed function fails CI before the example can rot in the
docs.
| File | What it shows |
|---|---|
| quick_start.py | Single-file analysis + headline metric. Embedded by the book's Quick start. |
| batch_processing.py | analyze_batch + the AnalysisError discriminator. Embedded by Batch processing. |
| flat_records.py | flatten_spaces → sqlite for one file. Embedded by Flat-record iteration. |
| metric_selection.py | metrics= kwarg + dependency-pull behaviour. Embedded by Metric selection. |
| sarif_output.py | Minimal SARIF rendering. Embedded by SARIF output. |
| errors_taxonomy.py | The full exception map across the entry points. Embedded by Error handling. |
| async_patterns.py | asyncio.to_thread (canonical) vs the in-loop anti-pattern. Embedded by Async patterns. |
| cli_parity.py | Byte-for-byte parity smoke test vs bca metrics --output-format json. Wired into make py-test. |
| pipeline_db.py | Directory walk → analyze_batch → flatten_spaces → sqlite top-N, with a deliberately broken file to exercise the never-raise contract. |
| sarif_upload.py | SARIF emission tuned for GitHub Code Scanning (github/codeql-action/upload-sarif@v3). |
| jupyter_quickstart.ipynb | Pandas DataFrame + matplotlib cyclomatic.sum per function + top-N. Executed in CI via python-examples-nbconvert. |
Installation
Release 1.1.0 is listed on PyPI as a source distribution plus abi3 wheels for CPython 3.12+ on manylinux (glibc 2.28+) x86-64 and ARM64. For development, build locally via maturin:
cd big-code-analysis-py
uv venv .venv && source .venv/bin/activate
uv pip install -e ".[dev]" # pulls maturin, pytest, mypy, ruff, pyright
maturin develop
python -c "import big_code_analysis; print(big_code_analysis.__version__)"
Usage
import big_code_analysis as bca
# Analyse a file by path. The returned dict matches the JSON
# emitted by `bca metrics --output-format json` for the same
# file at the `FuncSpace` boundary — same field order, same
# numeric formatting, same shape. Language detection mirrors the
# CLI (path extension, then shebang, then emacs `-*- mode -*-`).
# Pass `exclude_tests=True` to mirror `bca metrics --exclude-tests`
# (prunes Rust `#[test]` / `#[cfg(test)]` subtrees before metric
# computation). Generated files (`@generated`, `DO NOT EDIT`,
# `GENERATED CODE` markers) are skipped by default, matching the
# CLI walker — `analyze` returns `None` for them; pass
# `skip_generated=False` to opt out. See `bca.analyze.__doc__`
# for the full parity contract.
result = bca.analyze("src/main.rs")
if result is not None:
    print(result["metrics"]["cognitive"]["sum"])
# Analyse a Rust file with `#[test]` subtrees pruned out — same
# result as `bca metrics --exclude-tests --output-format json`.
prod_only = bca.analyze("src/main.rs", exclude_tests=True)
# Non-UTF-8 paths raise `ValueError` by default so the `name`
# field is always a round-trippable identifier. Pass
# `allow_lossy_path=True` to opt into the CLI's U+FFFD
# substitution behaviour (see `bca.analyze.__doc__` and #316).
lossy = bca.analyze(weird_path, allow_lossy_path=True)
# Force analysis of files marked `@generated` (default skips them).
forced = bca.analyze("third_party/generated.pb.go", skip_generated=False)
# Analyse an in-memory snippet (str, bytes, or bytearray accepted).
metrics = bca.analyze_source("fn main() {}\n", "rust")
# Language detection helpers. `language_for_file` reads the file
# and runs the same detection pipeline as `analyze` — path
# extension first, then shebang / emacs-mode fallback (#318) —
# so an extension-less script with a `#!/usr/bin/env python`
# leading line resolves the same way it would for `analyze`. The
# file is read on every call (parity with `analyze`), so the path
# must exist; I/O failures raise the same typed `OSError` subclass
# `analyze` does (`FileNotFoundError`, `PermissionError`, …). If
# you only need the cheap extension lookup (`.py` → `python`) and
# do not want the file read, use
# `bca.language_extensions("python")` and match the extension
# yourself.
assert bca.language_for_file("path/to/real/foo.py") == "python"
# Extension-less script with a `#!/usr/bin/env python` first line
# would resolve to "python" too (the asymmetry #318 closed).
assert "python" in bca.supported_languages()
assert "py" in bca.language_extensions("python")
Selecting metrics
Pass metrics=[…] to compute only a subset of the metric suite.
metrics=None (the default) preserves today's "compute everything"
behaviour. Unrequested metrics are absent from the result dict
(not present with None placeholders).
import big_code_analysis as bca
# Compute only LoC and cyclomatic complexity.
result = bca.analyze("src/main.rs", metrics=["loc", "cyclomatic"])
assert result is not None
assert set(result["metrics"]) == {"loc", "cyclomatic"}
# Selecting a derived metric pulls its dependencies in automatically:
# `metrics=["mi"]` also computes loc, cyclomatic, and halstead.
mi_result = bca.analyze("src/main.rs", metrics=["mi"])
assert mi_result is not None
assert {"loc", "cyclomatic", "halstead", "mi"}.issubset(mi_result["metrics"])
# `bca.METRIC_NAMES` is a `tuple[str, ...]` enumerating every
# canonical name accepted by `metrics=` (alphabetised, lowercase).
assert "halstead" in bca.METRIC_NAMES
The same kwarg is honoured by bca.analyze_source and
bca.analyze_batch — the latter applies the selection uniformly to
every file in the batch. Validation runs before any file I/O: an
empty list or unknown name raises ValueError immediately and never
returns an AnalysisError slot for what is really a caller bug.
# Compute only `cyclomatic` and `cognitive` across a batch.
results = bca.analyze_batch(
    ["src/a.py", "src/b.rs"],
    metrics=["cyclomatic", "cognitive"],
)
Names are case-sensitive lowercase; passing an unknown name raises
ValueError with the canonical list in the error message. The
"exit" Metric-Display spelling is accepted as an alias for the
canonical JSON-key spelling "nexits"; both produce a "nexits"
key in the output. Duplicates are silently collapsed.
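The naming rules above can be mirrored in pure Python, for example to validate a configuration file before calling the bindings. The sketch below is not the library's implementation: `normalise_metric_names` is a hypothetical helper, and the three-name `canonical` tuple in the usage is a made-up stand-in for what would presumably be `bca.METRIC_NAMES` in real code.

```python
def normalise_metric_names(names, canonical):
    """Return canonical metric names, deduplicated in first-seen order.

    Hypothetical helper that follows the documented rules: names are
    case-sensitive, "exit" is an alias for "nexits", duplicates collapse,
    and an empty or unknown selection raises ValueError.
    """
    aliases = {"exit": "nexits"}
    selected = []
    for name in names:
        resolved = aliases.get(name, name)
        if resolved not in canonical:
            raise ValueError(
                f"unknown metric {name!r}; expected one of {sorted(canonical)}"
            )
        if resolved not in selected:
            selected.append(resolved)
    if not selected:
        raise ValueError("metrics must not be empty")
    return selected


# Usage with a made-up canonical tuple (stand-in for bca.METRIC_NAMES).
canonical = ("cyclomatic", "loc", "nexits")
print(normalise_metric_names(["loc", "exit", "loc"], canonical))
```

Running the check on the caller's side keeps a typo in a config file from reaching the analysis call at all, which matches the fail-fast posture described above.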
SARIF 2.1.0 output
bca.to_sarif(result, *, thresholds=None) renders an analysis
result (or an iterable of them) into a SARIF 2.1.0 JSON document
suitable for upload to GitHub Code Scanning or any other SARIF
consumer. The output is produced by the same Rust writer that
backs bca check -O sarif, so the schema URL, tool driver name /
version, and rule descriptions match the CLI byte-for-byte.
import big_code_analysis as bca
# Single file → SARIF with a finding for every function whose
# cyclomatic complexity strictly exceeds 15, plus a loc.lloc
# limit of 200.
sarif = bca.to_sarif(
    bca.analyze("src/main.py"),
    thresholds={"cyclomatic": 15, "loc.lloc": 200},
)
with open("metrics.sarif", "w", encoding="utf-8") as fh:
    fh.write(sarif)
# Batch input — AnalysisError entries are skipped silently because
# they represent files we couldn't analyse, not findings.
batch = bca.analyze_batch(["src/a.py", "src/b.rs", "src/c.cpp"])
sarif = bca.to_sarif(batch, thresholds={"cognitive": 20})
Accepted threshold names mirror the CLI's EXTRACTORS table in
big-code-analysis-cli/src/thresholds.rs — e.g. "cognitive",
"cyclomatic", "cyclomatic.modified", "halstead.volume",
"halstead.difficulty", "halstead.effort", "halstead.time",
"halstead.bugs", "loc.sloc",
"loc.ploc", "loc.lloc", "loc.cloc", "loc.blank", "nom",
"tokens", "nexits", "nargs", "mi.original", "mi.sei",
"mi.visual_studio", "abc", "wmc", "npm", "npa". An
unknown name raises ValueError listing the accepted set, so a
typo fails fast instead of silently producing an empty SARIF run.
thresholds=None (the default) and thresholds={} both produce
a well-formed SARIF document with empty results and rules
arrays. This matches the CLI's posture: there are no built-in
default thresholds; every check run supplies its own limits.
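Because the rendered document is JSON text, a consumer can inspect it with the standard library alone. A minimal sketch, assuming only the standard SARIF 2.1.0 layout (`runs[].results[].ruleId`); the `count_findings` helper is hypothetical, and the sample document is hand-written with made-up rule ids rather than real `to_sarif` output:

```python
import json
from collections import Counter


def count_findings(sarif_text):
    """Count SARIF results per ruleId across every run in the document."""
    document = json.loads(sarif_text)
    counts = Counter()
    for run in document.get("runs", []):
        for result in run.get("results", []):
            counts[result.get("ruleId", "<no ruleId>")] += 1
    return counts


# Hand-written stand-in document, not real to_sarif output.
sample = json.dumps({
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {"name": "example-tool", "rules": []}},
        "results": [
            {"ruleId": "cyclomatic", "message": {"text": "too complex"}},
            {"ruleId": "cyclomatic", "message": {"text": "too complex"}},
            {"ruleId": "loc.lloc", "message": {"text": "too long"}},
        ],
    }],
})
print(count_findings(sample))
```

An empty counter is the expected outcome when no thresholds are supplied, since the results array is then empty.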
Unit-level findings. to_sarif emits file-scope (unit-space)
findings for every metric whose JSON headline at the unit space
matches the CLI's per-space accessor (loc.*, halstead.*,
mi.*, nom, nargs, nexits, tokens, abc, wmc, npm,
npa). The three exceptions — cyclomatic, cyclomatic.modified,
cognitive — are skipped at the unit level because the JSON only
exposes the aggregate sum across children while the CLI's
per-space accessor returns just the unit's own scalar; emitting
findings from the aggregate would diverge from the CLI for parent
spaces. Unit findings carry logicalLocations: [{"fullyQualifiedName": "<file>"}]; nameless non-unit spaces (rare parse-failure case)
carry "<unnamed>" — both matching the CLI's function_token
placeholders.
Upload to GitHub Code Scanning
# .github/workflows/code-scanning.yml (excerpt)
- name: Compute metric SARIF
  run: |
    python - <<'PY'
    import big_code_analysis as bca
    with open("paths.txt", encoding="utf-8") as paths_fh:
        results = bca.analyze_batch(paths_fh.read().splitlines())
    with open("metrics.sarif", "w", encoding="utf-8") as fh:
        fh.write(bca.to_sarif(results, thresholds={"cyclomatic": 15}))
    PY
- name: Upload to Code Scanning
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: metrics.sarif
Batch processing
bca.analyze_batch(paths) runs the same analysis as bca.analyze
over every path in an iterable and never raises on per-file
errors: each result slot is either an analysis dict or a
bca.AnalysisError describing the failure. The list has the same
length as the input and preserves order one-to-one, so callers
can zip(inputs, results) without losing the pairing.
import big_code_analysis as bca
paths = ["src/a.py", "src/missing.py", "src/b.rs"]
results = bca.analyze_batch(paths)
for path, result in zip(paths, results):
    if isinstance(result, bca.AnalysisError):
        print(f"skipped {path}: ({result.error_kind}) {result.error}")
    else:
        process(result)
The pattern above keeps paths and results as separate
materialised sequences. If you want to drive analyze_batch from
a generator (e.g. glob.iglob('**/*.py')) for memory efficiency,
materialise it into a list first — otherwise
zip(generator, analyze_batch(generator)) yields nothing because
analyze_batch exhausts the generator before zip re-iterates it:
import glob
paths = list(glob.iglob("src/**/*.py", recursive=True))
results = bca.analyze_batch(paths)
# now zip(paths, results) works
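The pitfall is ordinary Python iterator behaviour and can be reproduced without the bindings. In the sketch below `consume_all` is a hypothetical stand-in that, like the description of `analyze_batch` above, reads its whole input and returns a list:

```python
def consume_all(paths):
    """Stand-in for a batch call: drains the iterable, returns a list."""
    return [f"result for {p}" for p in paths]


def path_generator():
    yield "src/a.py"
    yield "src/b.py"


# Anti-pattern: the same generator object is passed to both zip() and the
# batch call. The call drains it while zip's arguments are being evaluated,
# so zip finds its first iterable already exhausted.
gen = path_generator()
broken = list(zip(gen, consume_all(gen)))

# Fix: materialise once, then reuse the list.
paths = list(path_generator())
fixed = list(zip(paths, consume_all(paths)))

print(broken)
print(fixed)
```

The first `print` shows an empty list even though two results were computed; the second shows both pairs.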
bca.AnalysisError is a frozen value type with path: str,
error: str, and error_kind: Literal["UnsupportedLanguage", "ParseError", "IoError"]. It implements __eq__, __hash__,
and __repr__, so callers can put errors in a set to
deduplicate failures across runs. It is not an Exception
subclass — analyze_batch returns it, never raises it.
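To illustrate what value semantics buy, the sketch below uses a frozen dataclass as a stand-in with the three documented fields. `FakeAnalysisError` is hypothetical and only mimics the documented shape; real code would receive `bca.AnalysisError` instances from `analyze_batch`, and the error strings here are made up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeAnalysisError:
    """Stand-in with the documented fields; equality and hash by value."""
    path: str
    error: str
    error_kind: str


run_1 = [FakeAnalysisError("src/x.bin", "unsupported", "UnsupportedLanguage")]
run_2 = [
    FakeAnalysisError("src/x.bin", "unsupported", "UnsupportedLanguage"),
    FakeAnalysisError("src/gone.py", "No such file (os error 2)", "IoError"),
]

# Equal errors hash alike, so a set deduplicates failures across runs.
unique = set(run_1) | set(run_2)
by_kind = Counter(err.error_kind for err in unique)
print(len(unique), dict(by_kind))
```

Grouping by `error_kind` afterwards gives a quick summary of why files were skipped.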
analyze_batch only raises on programmer errors: TypeError
for a non-iterable paths argument (or a non-path element
inside), and ValueError for an empty metrics= list or an
unknown metric name. The metrics= selection (see
Selecting metrics above) applies uniformly
to every file in the batch; validation runs before the input
iterable's __iter__ so a bad selection aborts without invoking
any side effects.
Generators work — paths are consumed lazily. There is no
built-in parallelism; the recommended pattern is
concurrent.futures.ThreadPoolExecutor around bca.analyze for
parallel single-file calls. analyze_batch also runs with the
is_generated walker filter off so every input position
yields either a dict or an AnalysisError (never None).
Call bca.analyze(path) per-file with the default
skip_generated=True if you need the CLI walker's skip behaviour.
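A minimal sketch of the thread-pool pattern follows. The `analyze_parallel` helper and its `analyze` parameter are hypothetical: pass `bca.analyze` in real code. Whether threads give a real speed-up depends on the bindings releasing the GIL during analysis, which this README does not state, so treat the worker count as something to measure.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_parallel(paths, analyze, max_workers=4):
    """Run analyze(path) for each path on a thread pool.

    Results come back in input order. Exceptions are captured and
    returned in place so one bad file does not abort the whole run.
    """
    def safe(path):
        try:
            return analyze(path)
        except Exception as exc:  # keep the slot, report the failure
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, list(paths)))


# Usage with a stub in place of bca.analyze.
def stub_analyze(path):
    if path.endswith(".missing"):
        raise FileNotFoundError(path)
    return {"name": path}


results = analyze_parallel(["a.py", "b.missing", "c.rs"], stub_analyze)
print(results)
```

Unlike `analyze_batch`, this route goes through `bca.analyze`, so with the default `skip_generated=True` a slot may also hold `None` for a generated file.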
Flatten to records
bca.flatten_spaces(result) walks the nested FuncSpace tree in
pre-order and yields one flat, scalar-only dict per node — ready
for sqlite3.executemany, pandas.DataFrame.from_records, or any
other tabular consumer. Metric keys use the same dotted convention
as the CLI's CSV writer (cyclomatic.modified.sum,
halstead.volume, loc.lloc_average, …). Metric columns match
the CLI's CSV_HEADER set; the identity columns do not — CSV uses
space_name / space_kind and has no parent_name / depth,
while flat records use name / kind and add the parent / depth
pair. One metric also diverges: tokens.* flattens to the JSON
shape (tokens.tokens, tokens.tokens_average,
tokens.tokens_min, tokens.tokens_max), while CSV_HEADER renames
those columns to tokens.sum / .average / .min / .max.
Rename in the consumer if you need exact CSV alignment.
import sqlite3
import big_code_analysis as bca
result = bca.analyze("src/lib.rs")
if result is None: # generated/skipped file
raise SystemExit("nothing to analyze")
records = list(bca.flatten_spaces(result))
columns = sorted({k for r in records for k in r})
# flatten_spaces keys come from a bounded alphabet (`.`, `_`,
# ASCII alnum), so f-string quoting is safe here. Sanitize if you
# ever build records by hand.
with sqlite3.connect("metrics.db") as db:
    cols = ", ".join(f'"{c}"' for c in columns)
    qs = ", ".join("?" for _ in columns)
    db.execute(f"CREATE TABLE m ({cols})")
    db.executemany(
        f"INSERT INTO m ({cols}) VALUES ({qs})",
        [tuple(r.get(c) for c in columns) for r in records],
    )
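For the `tokens.*` divergence described earlier in this section, the rename can be done in the consumer before inserting. A small sketch, where `to_csv_columns` is a hypothetical helper and the mapping is taken from the four column pairs named above; the identity columns (`name` / `kind` versus `space_name` / `space_kind`) are deliberately left alone:

```python
# Flat-record key -> CSV_HEADER spelling, per the four pairs named above.
TOKENS_RENAMES = {
    "tokens.tokens": "tokens.sum",
    "tokens.tokens_average": "tokens.average",
    "tokens.tokens_min": "tokens.min",
    "tokens.tokens_max": "tokens.max",
}


def to_csv_columns(record):
    """Return a copy of a flat record with tokens.* keys renamed."""
    return {TOKENS_RENAMES.get(key, key): value for key, value in record.items()}


# Hand-written sample record (values are made up).
sample = {"name": "main", "tokens.tokens": 42.0, "tokens.tokens_max": 42.0}
print(to_csv_columns(sample))
```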
The iterator is lazy and single-use: it walks the input once
without materialising the whole list, and a second iteration is
empty. Records always carry path (the analyzed file, or None
for analyze_source), name, kind, start_line, end_line,
parent_name, and depth. Anonymous spaces (Rust closures, JS
function expressions / arrows) keep their name == "<anonymous>"
marker verbatim — flatten_spaces does not normalize. Missing
metric subtrees produce no keys (absent, not None), matching the
"Halstead disabled" edge case for metrics= selection.
parent_name alone cannot disambiguate same-named siblings nested
under different parents (e.g. two Inner classes under two
different outer classes both surface as parent_name == 'Inner'
for their own children). Pair with depth and source-order
position, or rebuild the qualified name in your consumer, if you
need a fully-qualified path.
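One way to rebuild the qualified name is to keep a stack of ancestor names while walking the records in the pre-order that `flatten_spaces` yields. The sketch below is an assumption-laden illustration: `qualified_names` is a hypothetical helper, the `"::"` separator is arbitrary, `depth` is assumed to grow by one per nesting level, and the sample records are hand-written.

```python
def qualified_names(records, sep="::"):
    """Rebuild a qualified name for each pre-order flat record.

    Assumes `depth` grows by one per nesting level; the root's depth is
    taken from the first record so either 0- or 1-based depths work.
    """
    stack = []
    names = []
    base = None
    for record in records:
        if base is None:
            base = record["depth"]
        level = record["depth"] - base
        del stack[level:]  # drop finished siblings and their children
        stack.append(record["name"])
        names.append(sep.join(stack))
    return names


# Hand-written records: two Inner classes under different outer classes.
records = [
    {"name": "f.py", "depth": 0},
    {"name": "A", "depth": 1},
    {"name": "Inner", "depth": 2},
    {"name": "m", "depth": 3},
    {"name": "B", "depth": 1},
    {"name": "Inner", "depth": 2},
    {"name": "m", "depth": 3},
]
for name in qualified_names(records):
    print(name)
```

The two `m` records now resolve to distinct names even though both report `parent_name == "Inner"`.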
Don't mutate the input result while iterating: the walker keeps
references into it, so mutations to not-yet-yielded subtrees will
be observed in later records.
flatten_spaces raises TypeError if the input is not a mapping;
callers must filter None returns from bca.analyze (e.g. when
skip_generated=True matched a generated file) before passing.
Errors
bca.analyze raises exceptions; bca.analyze_batch returns
bca.AnalysisError values inside the result list (never raised on
per-file failures — see the Batch processing section above).
Exception types raised by bca.analyze / bca.analyze_source:
- bca.UnsupportedLanguageError (subclass of ValueError): raised when a file extension is unrecognised, or when analyze_source(..., language="...") is passed an unknown language name.
- bca.ParseError (subclass of ValueError): raised when the underlying tree-sitter parser fails on the supplied source.
- ValueError: raised by bca.analyze when the path is not valid UTF-8 and the default strict policy is in effect; pass allow_lossy_path=True to mirror the CLI's U+FFFD substitution via Path::to_string_lossy and accept the resulting non-round-trippable name field (#316).
- OSError: bubbled up from the underlying file-system read. Dispatches to the canonical subclass (FileNotFoundError, PermissionError, IsADirectoryError, …) based on errno, with err.errno and err.filename populated.
Returned by bca.analyze_batch inside the result list:
- bca.AnalysisError: frozen value type with path: str, error: str, and error_kind: Literal["UnsupportedLanguage", "ParseError", "IoError"]. Not an Exception subclass. error_kind is a closed taxonomy: "IoError" covers both filesystem failures and the non-UTF-8 path case (kept at three kinds per the API contract); "ParseError" similarly covers internal JSON-serialisation failures of the resulting FuncSpace (rare; reserved upstream). The OS errno is preserved in the error string via Rust's "<msg> (os error <N>)" default formatting; parse with regex r"\(os error (\d+)\)$" if you need it for retry classification, or call bca.analyze per-file to get a typed OSError subclass instead.
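The errno extraction can be sketched with the regex quoted in the text. The helper names `os_errno` and `is_retryable` are hypothetical, and the set of errnos treated as transient is an example policy rather than anything the library prescribes:

```python
import errno
import re

OS_ERROR_RE = re.compile(r"\(os error (\d+)\)$")


def os_errno(error_text):
    """Return the OS errno embedded in an error string, or None."""
    match = OS_ERROR_RE.search(error_text)
    return int(match.group(1)) if match else None


def is_retryable(error_text):
    """Example policy (an assumption): retry on transient errnos only."""
    return os_errno(error_text) in {errno.EAGAIN, errno.EINTR, errno.EMFILE}


print(os_errno("No such file or directory (os error 2)"))
```

Strings without the trailing `(os error N)` suffix yield `None`, so non-I/O failures are never classified as retryable by this policy.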
Type checking
The package ships PEP 561 type stubs (py.typed + _native.pyi).
mypy --strict and pyright should both pass cleanly against
client code.
License
MPL-2.0 (matches the Rust library).