Python bindings for the big-code-analysis Rust library

These details have been verified by PyPI

Project links

Repository

GitHub Statistics

Maintainers

dekobon

These details have not been verified by PyPI

Project description

big-code-analysis (Python bindings)

Python bindings for the big-code-analysis Rust library — compute maintainability metrics for source code in ~20 languages using the same tree-sitter parsers the Rust crate ships with.

Full documentation: the book's Python Bindings chapter covers the install matrix, batch / async / SARIF recipes, and the full error taxonomy. The README below is the quick reference shown on PyPI.

All nine phases of the Python bindings work (issues #265–#273; parent #103) have landed. The crate now ships single-file analysis, the never-raise batch entry point, the flatten_spaces flat-record iterator, explicit metric selection (metrics=), SARIF 2.1.0 rendering (to_sarif), the strict ruff / mypy / pyright tooling gate, manylinux wheel CI on Linux x86_64 + aarch64, the book's "Python Bindings" chapter, and the end-user example set covered below. See the CHANGELOG for the per-phase changes.

Runnable examples

big-code-analysis-py/examples/ is the canonical collection of copy-paste recipes. Every file is executed under CI either via tests/test_book_examples.py (the .py examples) or via jupyter nbconvert --execute (the notebook), so a renamed kwarg or removed function fails CI before the example can rot in the docs.

File	What it shows
`quick_start.py`	Single-file analysis + headline metric. Embedded by the book's Quick start.
`batch_processing.py`	`analyze_batch` + the `AnalysisError` discriminator. Embedded by Batch processing.
`flat_records.py`	`flatten_spaces` → sqlite for one file. Embedded by Flat-record iteration.
`metric_selection.py`	`metrics=` kwarg + dependency-pull behaviour. Embedded by Metric selection.
`sarif_output.py`	Minimal SARIF rendering. Embedded by SARIF output.
`errors_taxonomy.py`	The full exception map across the entry points. Embedded by Error handling.
`async_patterns.py`	`asyncio.to_thread` (canonical) vs the in-loop anti-pattern. Embedded by Async patterns.
`cli_parity.py`	Byte-for-byte parity smoke test vs `bca metrics --output-format json`. Wired into `make py-test`.
`pipeline_db.py`	Directory walk → `analyze_batch` → `flatten_spaces` → sqlite top-N, with a deliberately broken file to exercise the never-raise contract.
`sarif_upload.py`	SARIF emission tuned for GitHub Code Scanning (`github/codeql-action/upload-sarif@v3`).
`jupyter_quickstart.ipynb`	Pandas DataFrame + matplotlib `cyclomatic.sum` per function + top-N. Executed in CI via `python-examples-nbconvert`.

Installation

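The package is published on PyPI as big-code-analysis, with a source distribution and manylinux wheels for Linux x86_64 and aarch64 (CPython 3.12+), so pip install big-code-analysis covers those platforms. For development, build locally via maturin: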

cd big-code-analysis-py
uv venv .venv && source .venv/bin/activate
uv pip install -e ".[dev]"  # pulls maturin, pytest, mypy, ruff, pyright
maturin develop
python -c "import big_code_analysis; print(big_code_analysis.__version__)"

Usage

import big_code_analysis as bca

# Analyse a file by path. The returned dict matches the JSON
# emitted by `bca metrics --output-format json` for the same
# file at the `FuncSpace` boundary — same field order, same
# numeric formatting, same shape. Language detection mirrors the
# CLI (path extension, then shebang, then emacs `-*- mode -*-`).
# Pass `exclude_tests=True` to mirror `bca metrics --exclude-tests`
# (prunes Rust `#[test]` / `#[cfg(test)]` subtrees before metric
# computation). Generated files (`@generated`, `DO NOT EDIT`,
# `GENERATED CODE` markers) are skipped by default, matching the
# CLI walker — `analyze` returns `None` for them; pass
# `skip_generated=False` to opt out. See `bca.analyze.__doc__`
# for the full parity contract.
result = bca.analyze("src/main.rs")
if result is not None:
    print(result["metrics"]["cognitive"]["sum"])

# Analyse a Rust file with `#[test]` subtrees pruned out — same
# result as `bca metrics --exclude-tests --output-format json`.
prod_only = bca.analyze("src/main.rs", exclude_tests=True)

# Non-UTF-8 paths raise `ValueError` by default so the `name`
# field is always a round-trippable identifier. Pass
# `allow_lossy_path=True` to opt into the CLI's U+FFFD
# substitution behaviour (see `bca.analyze.__doc__` and #316).
# `weird_path` is a placeholder for a path that is not valid UTF-8.
lossy = bca.analyze(weird_path, allow_lossy_path=True)

# Force analysis of files marked `@generated` (default skips them).
forced = bca.analyze("third_party/generated.pb.go", skip_generated=False)

# Analyse an in-memory snippet (str, bytes, or bytearray accepted).
metrics = bca.analyze_source("fn main() {}\n", "rust")

# Language detection helpers. `language_for_file` reads the file
# and runs the same detection pipeline as `analyze` — path
# extension first, then shebang / emacs-mode fallback (#318) —
# so an extension-less script with a `#!/usr/bin/env python`
# leading line resolves the same way it would for `analyze`. The
# file is read on every call (parity with `analyze`), so the path
# must exist; I/O failures raise the same typed `OSError` subclass
# `analyze` does (`FileNotFoundError`, `PermissionError`, …). If
# you only need the cheap extension lookup (`.py` → `python`) and
# do not want the file read, use
# `bca.language_extensions("python")` and match the extension
# yourself.
assert bca.language_for_file("path/to/real/foo.py") == "python"
# Extension-less script with a `#!/usr/bin/env python` first line
# would resolve to "python" too (the asymmetry #318 closed).
assert "python" in bca.supported_languages()
assert "py" in bca.language_extensions("python")

Selecting metrics

Pass metrics=[…] to compute only a subset of the metric suite. metrics=None (the default) preserves today's "compute everything" behaviour. Unrequested metrics are absent from the result dict (not present with None placeholders).

import big_code_analysis as bca

# Compute only LoC and cyclomatic complexity.
result = bca.analyze("src/main.rs", metrics=["loc", "cyclomatic"])
assert result is not None
assert set(result["metrics"]) == {"loc", "cyclomatic"}

# Selecting a derived metric pulls its dependencies in automatically:
# `metrics=["mi"]` also computes loc, cyclomatic, and halstead.
mi_result = bca.analyze("src/main.rs", metrics=["mi"])
assert mi_result is not None
assert {"loc", "cyclomatic", "halstead", "mi"}.issubset(mi_result["metrics"])

# `bca.METRIC_NAMES` is a `tuple[str, ...]` enumerating every
# canonical name accepted by `metrics=` (alphabetised, lowercase).
assert "halstead" in bca.METRIC_NAMES

The same kwarg is honoured by bca.analyze_source and bca.analyze_batch — the latter applies the selection uniformly to every file in the batch. Validation runs before any file I/O: an empty list or unknown name raises ValueError immediately and never returns an AnalysisError slot for what is really a caller bug.

# Compute only `cyclomatic` and `cognitive` across a batch.
results = bca.analyze_batch(
    ["src/a.py", "src/b.rs"],
    metrics=["cyclomatic", "cognitive"],
)

Names are case-sensitive lowercase; passing an unknown name raises ValueError with the canonical list in the error message. The "exit" Metric-Display spelling is accepted as an alias for the canonical JSON-key spelling "nexits"; both produce a "nexits" key in the output. Duplicates are silently collapsed.
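The naming rules above can be modelled in a few lines of pure Python, which may help when pre-validating user input (for example from a CLI flag) before calling the bindings. This is only an illustrative sketch and not the library's implementation: KNOWN is a hypothetical subset of bca.METRIC_NAMES, normalise_metrics is a made-up helper, and the order-preserving de-duplication is an assumption.

```python
# Illustrative model of the documented `metrics=` rules; not the library's code.
KNOWN = ("cognitive", "cyclomatic", "halstead", "loc", "mi", "nexits")  # hypothetical subset
ALIASES = {"exit": "nexits"}  # the documented alias


def normalise_metrics(names):
    """Return canonical metric names, rejecting empty or unknown input."""
    names = list(names)
    if not names:
        raise ValueError("metrics= must not be empty")
    canonical = []
    for name in names:
        resolved = ALIASES.get(name, name)  # case-sensitive lookup
        if resolved not in KNOWN:
            raise ValueError(f"unknown metric {name!r}; accepted: {', '.join(KNOWN)}")
        if resolved not in canonical:  # collapse duplicates silently
            canonical.append(resolved)
    return canonical


selection = normalise_metrics(["exit", "loc", "nexits", "loc"])
```

In real code the accepted names would come from bca.METRIC_NAMES rather than a hard-coded tuple.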

SARIF 2.1.0 output

bca.to_sarif(result, *, thresholds=None) renders an analysis result (or an iterable of them) into a SARIF 2.1.0 JSON document suitable for upload to GitHub Code Scanning or any other SARIF consumer. The output is produced by the same Rust writer that backs bca check -O sarif, so the schema URL, tool driver name / version, and rule descriptions match the CLI byte-for-byte.

import big_code_analysis as bca

# Single file → SARIF with a finding for every function whose
# cyclomatic complexity strictly exceeds 15 or whose logical line
# count (`loc.lloc`) strictly exceeds 200.
sarif = bca.to_sarif(
    bca.analyze("src/main.py"),
    thresholds={"cyclomatic": 15, "loc.lloc": 200},
)
with open("metrics.sarif", "w", encoding="utf-8") as fh:
    fh.write(sarif)

# Batch input — AnalysisError entries are skipped silently because
# they represent files we couldn't analyse, not findings.
batch = bca.analyze_batch(["src/a.py", "src/b.rs", "src/c.cpp"])
sarif = bca.to_sarif(batch, thresholds={"cognitive": 20})

Accepted threshold names mirror the CLI's EXTRACTORS table in big-code-analysis-cli/src/thresholds.rs — e.g. "cognitive", "cyclomatic", "cyclomatic.modified", "halstead.volume", "halstead.difficulty", "halstead.effort", "halstead.time", "halstead.bugs", "loc.sloc", "loc.ploc", "loc.lloc", "loc.cloc", "loc.blank", "nom", "tokens", "nexits", "nargs", "mi.original", "mi.sei", "mi.visual_studio", "abc", "wmc", "npm", "npa". An unknown name raises ValueError listing the accepted set, so a typo fails fast instead of silently producing an empty SARIF run.

thresholds=None (the default) and thresholds={} both produce a well-formed SARIF document with empty results and rules arrays. This matches the CLI's posture: there are no built-in default thresholds; every check run supplies its own limits.

Unit-level findings. to_sarif emits file-scope (unit-space) findings for every metric whose JSON headline at the unit space matches the CLI's per-space accessor (loc.*, halstead.*, mi.*, nom, nargs, nexits, tokens, abc, wmc, npm, npa). The three exceptions — cyclomatic, cyclomatic.modified, cognitive — are skipped at the unit level because the JSON only exposes the aggregate sum across children while the CLI's per-space accessor returns just the unit's own scalar; emitting findings from the aggregate would diverge from the CLI for parent spaces. Unit findings carry logicalLocations: [{"fullyQualifiedName": "<file>"}]; nameless non-unit spaces (rare parse-failure case) carry "<unnamed>" — both matching the CLI's function_token placeholders.

Upload to GitHub Code Scanning

# .github/workflows/code-scanning.yml (excerpt)
- name: Compute metric SARIF
  run: |
    python - <<'PY'
    import big_code_analysis as bca
    with open("paths.txt", encoding="utf-8") as paths_fh:
        results = bca.analyze_batch(paths_fh.read().splitlines())
    with open("metrics.sarif", "w", encoding="utf-8") as fh:
        fh.write(bca.to_sarif(results, thresholds={"cyclomatic": 15}))
    PY
- name: Upload to Code Scanning
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: metrics.sarif

Batch processing

bca.analyze_batch(paths) runs the same analysis as bca.analyze over every path in an iterable and never raises on per-file errors: each result slot is either an analysis dict or a bca.AnalysisError describing the failure. The list has the same length as the input and preserves order one-to-one, so callers can zip(inputs, results) without losing the pairing.

import big_code_analysis as bca

paths = ["src/a.py", "src/missing.py", "src/b.rs"]
results = bca.analyze_batch(paths)
for path, result in zip(paths, results):
    if isinstance(result, bca.AnalysisError):
        print(f"skipped {path}: ({result.error_kind}) {result.error}")
    else:
        process(result)

The pattern above keeps paths and results as separate materialised sequences. If your paths come from a generator (e.g. glob.iglob('**/*.py')), materialise it into a list before pairing inputs with results. Otherwise zip(generator, analyze_batch(generator)) yields nothing: Python evaluates analyze_batch(generator) first, which exhausts the generator, so zip finds no inputs left to pair:

import glob

paths = list(glob.iglob("src/**/*.py", recursive=True))
results = bca.analyze_batch(paths)
# now zip(paths, results) works

bca.AnalysisError is a frozen value type with path: str, error: str, and error_kind: Literal["UnsupportedLanguage", "ParseError", "IoError"]. It implements __eq__, __hash__, and __repr__, so callers can put errors in a set to deduplicate failures across runs. It is not an Exception subclass — analyze_batch returns it, never raises it.
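Because error values are hashable, de-duplicating and counting them takes only a set and a Counter. The sketch below uses StandInError, a hypothetical dataclass that mirrors the three documented fields, so that it runs without the bindings installed. The helper name summarise_failures and the sample messages are also illustrative.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class StandInError:
    """Hypothetical stand-in mirroring the documented AnalysisError fields."""
    path: str
    error: str
    error_kind: str


def summarise_failures(results, error_type):
    """De-duplicate error slots, then count them per error_kind."""
    unique = {r for r in results if isinstance(r, error_type)}
    return Counter(e.error_kind for e in unique)


# Hand-written sample: one success dict and three error slots (two identical).
results = [
    {"name": "src/a.py"},
    StandInError("src/missing.py", "No such file or directory (os error 2)", "IoError"),
    StandInError("src/missing.py", "No such file or directory (os error 2)", "IoError"),
    StandInError("notes.xyz", "unsupported language", "UnsupportedLanguage"),
]
counts = summarise_failures(results, StandInError)
```

With the real package the call would presumably look like summarise_failures(bca.analyze_batch(paths), bca.AnalysisError).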

analyze_batch only raises on programmer errors: TypeError for a non-iterable paths argument (or a non-path element inside), and ValueError for an empty metrics= list or an unknown metric name. The metrics= selection (see Selecting metrics above) applies uniformly to every file in the batch; validation runs before the input iterable's __iter__ is called, so a bad selection aborts before any of the iterable's side effects can fire.

Generators work — paths are consumed lazily. There is no built-in parallelism; the recommended pattern is concurrent.futures.ThreadPoolExecutor around bca.analyze for parallel single-file calls. analyze_batch also runs with the is_generated walker filter off so every input position yields either a dict or an AnalysisError (never None). Call bca.analyze(path) per-file with the default skip_generated=True if you need the CLI walker's skip behaviour.
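The thread-pool pattern mentioned above could be sketched as follows. The analysis callable is passed in as a parameter (analyze_one) so the example runs with a stub; in practice it would be bca.analyze. The helper name analyze_parallel, the default of four workers, and the choice to store exceptions in the result slots are all assumptions of this sketch. The README does not state whether the native call releases the GIL, so any speed-up is likewise an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_parallel(paths, analyze_one, max_workers=4):
    """Run analyze_one over paths on a thread pool, preserving input order.

    Exceptions are captured per path and stored in that path's slot,
    so one bad file does not abort the whole run.
    """
    def safe(path):
        try:
            return analyze_one(path)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(safe, paths))


def fake_analyze(path):
    """Stub standing in for bca.analyze in this demo."""
    if path.endswith(".missing"):
        raise FileNotFoundError(path)
    return {"name": path}


out = analyze_parallel(["a.py", "b.missing", "c.rs"], fake_analyze)
```

Each slot in out is either the analysis result or the exception raised for that path, which keeps the one-to-one pairing with the input list.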

Flatten to records

bca.flatten_spaces(result) walks the nested FuncSpace tree in pre-order and yields one flat, scalar-only dict per node — ready for sqlite3.executemany, pandas.DataFrame.from_records, or any other tabular consumer. Metric keys use the same dotted convention as the CLI's CSV writer (cyclomatic.modified.sum, halstead.volume, loc.lloc_average, …). Metric columns match the CLI's CSV_HEADER set; the identity columns do not — CSV uses space_name / space_kind and has no parent_name / depth, while flat records use name / kind and add the parent / depth pair. One metric also diverges: tokens.* flattens to the JSON shape (tokens.tokens, tokens.tokens_average, tokens.tokens_min, tokens.tokens_max), while CSV_HEADER renames those columns to tokens.sum / .average / .min / .max. Rename in the consumer if you need exact CSV alignment.

import sqlite3
import big_code_analysis as bca

result = bca.analyze("src/lib.rs")
if result is None:  # generated/skipped file
    raise SystemExit("nothing to analyze")
records = list(bca.flatten_spaces(result))
columns = sorted({k for r in records for k in r})
# flatten_spaces keys come from a bounded alphabet (`.`, `_`,
# ASCII alnum), so f-string quoting is safe here. Sanitize if you
# ever build records by hand.
with sqlite3.connect("metrics.db") as db:
    cols = ", ".join(f'"{c}"' for c in columns)
    qs = ", ".join("?" for _ in columns)
    db.execute(f"CREATE TABLE m ({cols})")
    db.executemany(
        f"INSERT INTO m ({cols}) VALUES ({qs})",
        [tuple(r.get(c) for c in columns) for r in records],
    )

The iterator is lazy and single-use: it walks the input once without materialising the whole list, and a second iteration is empty. Records always carry path (the analyzed file, or None for analyze_source), name, kind, start_line, end_line, parent_name, and depth. Anonymous spaces (Rust closures, JS function expressions / arrows) keep their name == "<anonymous>" marker verbatim — flatten_spaces does not normalize. Missing metric subtrees produce no keys (absent, not None), matching the "Halstead disabled" edge case for metrics= selection.
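Since absent metric subtrees produce no keys, a tabular writer has to tolerate records with different key sets. A minimal sketch with the standard csv module follows. The records are hand-written samples rather than real output, records_to_csv is a hypothetical helper, and the kind values and the root's parent_name of None are assumptions.

```python
import csv
import io

IDENTITY = ["path", "name", "kind", "start_line", "end_line", "parent_name", "depth"]


def records_to_csv(records, out):
    """Write flat records to `out`, leaving absent metric keys as empty cells."""
    records = list(records)  # the iterator is single-use, so materialise once
    metric_cols = sorted({k for r in records for k in r} - set(IDENTITY))
    columns = IDENTITY + metric_cols
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(records)
    return columns


# Hand-written sample records with different metric keys.
sample = [
    {"path": "a.rs", "name": "a.rs", "kind": "unit", "start_line": 1,
     "end_line": 9, "parent_name": None, "depth": 0, "halstead.volume": 12.5},
    {"path": "a.rs", "name": "main", "kind": "function", "start_line": 2,
     "end_line": 4, "parent_name": "a.rs", "depth": 1, "cyclomatic.sum": 1.0},
]
buffer = io.StringIO()
columns = records_to_csv(sample, buffer)
lines = buffer.getvalue().splitlines()
```

The identity columns come first in a fixed order and the metric columns follow alphabetically, so the header stays stable as long as the same metrics are present.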

parent_name alone cannot disambiguate same-named siblings nested under different parents (e.g. two Inner classes under two different outer classes both surface as parent_name == 'Inner' for their own children). Pair with depth and source-order position, or rebuild the qualified name in your consumer, if you need a fully-qualified path.
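One way to rebuild a qualified name in the consumer is to track the chain of ancestors with a stack while reading the records in their pre-order sequence. The sketch below assumes depth is 0 for the file-level record and grows by one per nesting level. The "::" separator, the helper name qualified_names, and the sample records are illustrative choices.

```python
def qualified_names(records):
    """Yield (qualified_name, record) pairs from pre-order flat records."""
    stack = []  # names from the root down to the current record
    for rec in records:
        depth = rec["depth"]
        del stack[depth:]  # discard names left over from earlier branches
        stack.append(rec["name"])
        yield "::".join(stack), rec


# Hand-written sample: two `Inner` classes under different outer classes.
sample = [
    {"name": "f.py", "depth": 0},
    {"name": "A", "depth": 1},
    {"name": "Inner", "depth": 2},
    {"name": "run", "depth": 3},
    {"name": "B", "depth": 1},
    {"name": "Inner", "depth": 2},
    {"name": "run", "depth": 3},
]
names = [qualified for qualified, _ in qualified_names(sample)]
```

The two run methods now resolve to distinct names (f.py::A::Inner::run and f.py::B::Inner::run) even though both report parent_name == 'Inner'.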

Don't mutate the input result while iterating: the walker keeps references into it, so mutations to not-yet-yielded subtrees will be observed in later records.

flatten_spaces raises TypeError if the input is not a mapping; callers must filter None returns from bca.analyze (e.g. when skip_generated=True matched a generated file) before passing.

Errors

bca.analyze raises exceptions; bca.analyze_batch returns bca.AnalysisError values inside the result list (never raised on per-file failures — see the Batch processing section above).

Exception types raised by bca.analyze / bca.analyze_source:

bca.UnsupportedLanguageError (subclass of ValueError) — raised when a file extension is unrecognised, or when analyze_source(..., language="...") is passed an unknown language name.
bca.ParseError (subclass of ValueError) — raised when the underlying tree-sitter parser fails on the supplied source.
ValueError — raised by bca.analyze when the path is not valid UTF-8 and the default strict policy is in effect; pass allow_lossy_path=True to mirror the CLI's U+FFFD substitution via Path::to_string_lossy and accept the resulting non-round-trippable name field (#316).
OSError — bubbled up from the underlying file-system read. Dispatches to the canonical subclass (FileNotFoundError, PermissionError, IsADirectoryError, …) based on errno, with err.errno and err.filename populated.
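Because UnsupportedLanguageError and ParseError both derive from ValueError, the order of except clauses matters: the specific types must be listed before the generic ValueError. The sketch below uses locally defined stand-in classes with the same inheritance so it runs without the bindings. The classify helper and its label strings are hypothetical.

```python
class UnsupportedLanguageError(ValueError):
    """Stand-in for bca.UnsupportedLanguageError."""


class ParseError(ValueError):
    """Stand-in for bca.ParseError."""


def classify(call):
    """Run `call` and map the documented exception types to a short label."""
    try:
        call()
    except UnsupportedLanguageError:  # subclasses first
        return "unsupported-language"
    except ParseError:
        return "parse-error"
    except ValueError:  # e.g. a non-UTF-8 path under the strict policy
        return "bad-path-or-argument"
    except FileNotFoundError:  # specific OSError subclass before OSError
        return "missing-file"
    except OSError as err:
        return f"io-error(errno={err.errno})"
    return "ok"


def raiser(exc):
    """Build a zero-argument callable that raises `exc` (demo helper)."""
    def _raise():
        raise exc
    return _raise


labels = [
    classify(raiser(ParseError("bad syntax"))),
    classify(raiser(ValueError("not UTF-8"))),
    classify(raiser(PermissionError(13, "denied"))),
]
```

With the real package, call would be something like lambda: bca.analyze(path), and the stand-in classes would be replaced by the bca exception types.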

Returned by bca.analyze_batch inside the result list:

bca.AnalysisError: frozen value type with path: str, error: str, and error_kind: Literal["UnsupportedLanguage", "ParseError", "IoError"]. Not an Exception subclass. error_kind is a closed taxonomy: "IoError" covers both filesystem failures and the non-UTF-8 path case (kept at three kinds per the API contract); "ParseError" similarly covers internal JSON-serialisation failures of the resulting FuncSpace (rare; reserved upstream). The OS errno is preserved in the error string via Rust's "<msg> (os error <N>)" default formatting. Parse it with the regex r"\(os error (\d+)\)$" if you need it for retry classification, or call bca.analyze per-file to get a typed OSError subclass instead.
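Extracting the errno from an error string of the "<msg> (os error <N>)" shape could look like the sketch below. The helper name errno_from_message is hypothetical, and the sample messages are hand-written rather than captured from the library.

```python
import re

# Matches the trailing "(os error <N>)" suffix of Rust's default io::Error text.
_OS_ERROR = re.compile(r"\(os error (\d+)\)$")


def errno_from_message(message):
    """Return the OS errno embedded in `message`, or None if there is none."""
    match = _OS_ERROR.search(message)
    return int(match.group(1)) if match else None


missing = errno_from_message("No such file or directory (os error 2)")
absent = errno_from_message("unsupported language")
```

The result can be compared against constants from the standard errno module (for example errno.ENOENT) when deciding whether a retry is worthwhile.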

Type checking

The package ships PEP 561 type stubs (py.typed + _native.pyi). mypy --strict and pyright should both pass cleanly against client code.

License

MPL-2.0 (matches the Rust library).

Release history

1.1.0

May 26, 2026

This version

1.0.0

May 25, 2026

Download files

Source Distribution

big_code_analysis-1.0.0.tar.gz (2.7 MB)

Uploaded May 25, 2026 (Source)

Built Distributions

big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_x86_64.whl (3.7 MB)

Uploaded May 25, 2026 (CPython 3.12+, manylinux: glibc 2.28+ x86-64)

big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_aarch64.whl (3.6 MB)

Uploaded May 25, 2026 (CPython 3.12+, manylinux: glibc 2.28+ ARM64)

File details

Details for the file big_code_analysis-1.0.0.tar.gz.

File metadata

Download URL: big_code_analysis-1.0.0.tar.gz
Upload date: May 25, 2026
Size: 2.7 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for big_code_analysis-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`8306a02e27bb3687c109055861c8d87c95a4f165995ca90c8c85877dd7317db2`
MD5	`61ed6dad5835b33b2810ec2eadda2110`
BLAKE2b-256	`8f4dd7daddc44dd39a864cae6360873e9769a595a9afe4e7b708a24dbba972a7`
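A downloaded file can be checked against the SHA256 digest listed above with the standard hashlib module. This is a minimal sketch: sha256_of is a hypothetical helper, and the commented-out comparison assumes the sdist was saved under its published file name in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest published for big_code_analysis-1.0.0.tar.gz.
EXPECTED = "8306a02e27bb3687c109055861c8d87c95a4f165995ca90c8c85877dd7317db2"
# Assumes the file sits in the current directory:
# assert sha256_of("big_code_analysis-1.0.0.tar.gz") == EXPECTED
```

Reading in chunks keeps memory use flat regardless of file size.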

Provenance

The following attestation bundles were made for big_code_analysis-1.0.0.tar.gz:

Publisher: python-wheels.yml on dekobon/big-code-analysis

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: big_code_analysis-1.0.0.tar.gz
- Subject digest: 8306a02e27bb3687c109055861c8d87c95a4f165995ca90c8c85877dd7317db2
- Sigstore transparency entry: 1629764139
- Sigstore integration time: May 25, 2026
Source repository:
- Permalink: dekobon/big-code-analysis@6f48a9bfa360e5d839e9c0ef57f7d49cf834892c
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/dekobon
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-wheels.yml@6f48a9bfa360e5d839e9c0ef57f7d49cf834892c
- Trigger Event: push

File details

Details for the file big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_x86_64.whl.

File metadata

Download URL: big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_x86_64.whl
Upload date: May 25, 2026
Size: 3.7 MB
Tags: CPython 3.12+, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`161f57242ecaeca6b7271c4884c2b1a75fa161e068503a31c217dcfc74b0c144`
MD5	`a05c16789a4438b9efbbd7ee454262b9`
BLAKE2b-256	`ec5fb666c9fe87b054a1084e5c65cb54201c1518b5b6d61bee8512d7107f0615`

Provenance

The following attestation bundles were made for big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_x86_64.whl:

Publisher: python-wheels.yml on dekobon/big-code-analysis

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_x86_64.whl
- Subject digest: 161f57242ecaeca6b7271c4884c2b1a75fa161e068503a31c217dcfc74b0c144
- Sigstore transparency entry: 1629764271
- Sigstore integration time: May 25, 2026
Source repository:
- Permalink: dekobon/big-code-analysis@6f48a9bfa360e5d839e9c0ef57f7d49cf834892c
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/dekobon
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-wheels.yml@6f48a9bfa360e5d839e9c0ef57f7d49cf834892c
- Trigger Event: push

File details

Details for the file big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_aarch64.whl.

File metadata

Download URL: big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_aarch64.whl
Upload date: May 25, 2026
Size: 3.6 MB
Tags: CPython 3.12+, manylinux: glibc 2.28+ ARM64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_aarch64.whl
Algorithm	Hash digest
SHA256	`2a352d618a4f8f032ec562122ac0197519957b61e345057d48c4e9cb6a543dcc`
MD5	`23bfd0bd8746f45e75a5c33464ed4f4e`
BLAKE2b-256	`4f56e11f765c8e358be478a081607cde786a3e554884d4c406dd76a44836d758`

Provenance

The following attestation bundles were made for big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_aarch64.whl:

Publisher: python-wheels.yml on dekobon/big-code-analysis

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_aarch64.whl
- Subject digest: 2a352d618a4f8f032ec562122ac0197519957b61e345057d48c4e9cb6a543dcc
- Sigstore transparency entry: 1629764180
- Sigstore integration time: May 25, 2026
Source repository:
- Permalink: dekobon/big-code-analysis@6f48a9bfa360e5d839e9c0ef57f7d49cf834892c
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/dekobon
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-wheels.yml@6f48a9bfa360e5d839e9c0ef57f7d49cf834892c
- Trigger Event: push

big-code-analysis 1.0.0
