Python bindings for the big-code-analysis Rust library

big-code-analysis (Python bindings)

Python bindings for the big-code-analysis Rust library — compute maintainability metrics for source code in ~20 languages using the same tree-sitter parsers the Rust crate ships with.

Full documentation: the book's Python Bindings chapter covers the install matrix, batch / async / SARIF recipes, and the full error taxonomy. The README below is the quick reference shown on PyPI.

All nine phases of the Python bindings work (issues #265–#273; parent #103) have landed. The crate now ships single-file analysis, the never-raise batch entry point, the flatten_spaces flat-record iterator, explicit metric selection (metrics=), SARIF 2.1.0 rendering (to_sarif), the strict ruff / mypy / pyright tooling gate, manylinux wheel CI on Linux x86_64 + aarch64, the book's "Python Bindings" chapter, and the end-user example set covered below. See the CHANGELOG for the per-phase changes.

Runnable examples

big-code-analysis-py/examples/ is the canonical collection of copy-paste recipes. Every file is executed under CI either via tests/test_book_examples.py (the .py examples) or via jupyter nbconvert --execute (the notebook), so a renamed kwarg or removed function fails CI before the example can rot in the docs.

| File | What it shows |
| --- | --- |
| quick_start.py | Single-file analysis + headline metric. Embedded by the book's Quick start. |
| batch_processing.py | analyze_batch + the AnalysisError discriminator. Embedded by Batch processing. |
| flat_records.py | flatten_spaces → sqlite for one file. Embedded by Flat-record iteration. |
| metric_selection.py | metrics= kwarg + dependency-pull behaviour. Embedded by Metric selection. |
| sarif_output.py | Minimal SARIF rendering. Embedded by SARIF output. |
| errors_taxonomy.py | The full exception map across the entry points. Embedded by Error handling. |
| async_patterns.py | asyncio.to_thread (canonical) vs the in-loop anti-pattern. Embedded by Async patterns. |
| cli_parity.py | Byte-for-byte parity smoke test vs bca metrics --output-format json. Wired into make py-test. |
| pipeline_db.py | Directory walk → analyze_batch → flatten_spaces → sqlite top-N, with a deliberately broken file to exercise the never-raise contract. |
| sarif_upload.py | SARIF emission tuned for GitHub Code Scanning (github/codeql-action/upload-sarif@v3). |
| jupyter_quickstart.ipynb | Pandas DataFrame + matplotlib cyclomatic.sum per function + top-N. Executed in CI via python-examples-nbconvert. |

Installation

Release 1.0.0 ships a source distribution and abi3 manylinux_2_28 wheels for CPython 3.12+ on x86_64 and aarch64. For development, build locally via maturin:

cd big-code-analysis-py
uv venv .venv && source .venv/bin/activate
uv pip install -e ".[dev]"  # pulls maturin, pytest, mypy, ruff, pyright
maturin develop
python -c "import big_code_analysis; print(big_code_analysis.__version__)"

Usage

import big_code_analysis as bca

# Analyse a file by path. The returned dict matches the JSON
# emitted by `bca metrics --output-format json` for the same
# file at the `FuncSpace` boundary — same field order, same
# numeric formatting, same shape. Language detection mirrors the
# CLI (path extension, then shebang, then emacs `-*- mode -*-`).
# Pass `exclude_tests=True` to mirror `bca metrics --exclude-tests`
# (prunes Rust `#[test]` / `#[cfg(test)]` subtrees before metric
# computation). Generated files (`@generated`, `DO NOT EDIT`,
# `GENERATED CODE` markers) are skipped by default, matching the
# CLI walker — `analyze` returns `None` for them; pass
# `skip_generated=False` to opt out. See `bca.analyze.__doc__`
# for the full parity contract.
result = bca.analyze("src/main.rs")
if result is not None:
    print(result["metrics"]["cognitive"]["sum"])

# Analyse a Rust file with `#[test]` subtrees pruned out — same
# result as `bca metrics --exclude-tests --output-format json`.
prod_only = bca.analyze("src/main.rs", exclude_tests=True)

# Non-UTF-8 paths raise `ValueError` by default so the `name`
# field is always a round-trippable identifier. Pass
# `allow_lossy_path=True` to opt into the CLI's U+FFFD
# substitution behaviour (see `bca.analyze.__doc__` and #316).
lossy = bca.analyze(weird_path, allow_lossy_path=True)

# Force analysis of files marked `@generated` (default skips them).
forced = bca.analyze("third_party/generated.pb.go", skip_generated=False)

# Analyse an in-memory snippet (str, bytes, or bytearray accepted).
metrics = bca.analyze_source("fn main() {}\n", "rust")

# Language detection helpers. `language_for_file` reads the file
# and runs the same detection pipeline as `analyze` — path
# extension first, then shebang / emacs-mode fallback (#318) —
# so an extension-less script with a `#!/usr/bin/env python`
# leading line resolves the same way it would for `analyze`. The
# file is read on every call (parity with `analyze`), so the path
# must exist; I/O failures raise the same typed `OSError` subclass
# `analyze` does (`FileNotFoundError`, `PermissionError`, …). If
# you only need the cheap extension lookup (`.py` → `python`) and
# do not want the file read, use
# `bca.language_extensions("python")` and match the extension
# yourself.
assert bca.language_for_file("path/to/real/foo.py") == "python"
# Extension-less script with a `#!/usr/bin/env python` first line
# would resolve to "python" too (the asymmetry #318 closed).
assert "python" in bca.supported_languages()
assert "py" in bca.language_extensions("python")

Selecting metrics

Pass metrics=[…] to compute only a subset of the metric suite. metrics=None (the default) computes every metric. Unrequested metrics are omitted from the result dict rather than included with None placeholders.

import big_code_analysis as bca

# Compute only LoC and cyclomatic complexity.
result = bca.analyze("src/main.rs", metrics=["loc", "cyclomatic"])
assert result is not None
assert set(result["metrics"]) == {"loc", "cyclomatic"}

# Selecting a derived metric pulls its dependencies in automatically:
# `metrics=["mi"]` also computes loc, cyclomatic, and halstead.
mi_result = bca.analyze("src/main.rs", metrics=["mi"])
assert mi_result is not None
assert {"loc", "cyclomatic", "halstead", "mi"}.issubset(mi_result["metrics"])

# `bca.METRIC_NAMES` is a `tuple[str, ...]` enumerating every
# canonical name accepted by `metrics=` (alphabetised, lowercase).
assert "halstead" in bca.METRIC_NAMES

The same kwarg is honoured by bca.analyze_source and bca.analyze_batch — the latter applies the selection uniformly to every file in the batch. Validation runs before any file I/O: an empty list or unknown name raises ValueError immediately and never returns an AnalysisError slot for what is really a caller bug.

# Compute only `cyclomatic` and `cognitive` across a batch.
results = bca.analyze_batch(
    ["src/a.py", "src/b.rs"],
    metrics=["cyclomatic", "cognitive"],
)

Names are case-sensitive lowercase; passing an unknown name raises ValueError with the canonical list in the error message. The "exit" Metric-Display spelling is accepted as an alias for the canonical JSON-key spelling "nexits"; both produce a "nexits" key in the output. Duplicates are silently collapsed.
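When metric names come from user configuration, it can help to apply the same rules on the caller's side before any analysis starts. The sketch below is only a client-side model of the rules described above, not the library's own validation code; the helper name normalize_metric_names is hypothetical, and the accepted names are passed in by the caller (for example bca.METRIC_NAMES).

```python
from typing import Iterable, Optional


def normalize_metric_names(
    requested: Iterable[str],
    accepted: Iterable[str],
    aliases: Optional[dict] = None,
) -> list:
    """Validate and de-duplicate metric names, modelling the documented rules."""
    # The "exit" -> "nexits" alias is the only one the text above mentions.
    alias_map = {"exit": "nexits"} if aliases is None else aliases
    accepted_set = set(accepted)
    names = list(requested)
    if not names:
        raise ValueError("metrics must not be empty")
    result = []
    for name in names:
        canonical = alias_map.get(name, name)
        # Names are case-sensitive, so "LOC" is rejected rather than lowered.
        if canonical not in accepted_set:
            listing = ", ".join(sorted(accepted_set))
            raise ValueError(f"unknown metric {name!r}; accepted: {listing}")
        if canonical not in result:  # collapse duplicates, keep first position
            result.append(canonical)
    return result
```

The returned list can then be passed as metrics= to any of the entry points. The order of the output here is simply first-seen order; the library's own ordering is not specified above.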

SARIF 2.1.0 output

bca.to_sarif(result, *, thresholds=None) renders an analysis result (or an iterable of them) into a SARIF 2.1.0 JSON document suitable for upload to GitHub Code Scanning or any other SARIF consumer. The output is produced by the same Rust writer that backs bca check -O sarif, so the schema URL, tool driver name / version, and rule descriptions match the CLI byte-for-byte.

import big_code_analysis as bca

# Single file → SARIF with a finding wherever cyclomatic complexity
# strictly exceeds 15 or loc.lloc exceeds 200.
result = bca.analyze("src/main.py")
assert result is not None  # None means the file was skipped as generated
sarif = bca.to_sarif(
    result,
    thresholds={"cyclomatic": 15, "loc.lloc": 200},
)
with open("metrics.sarif", "w", encoding="utf-8") as fh:
    fh.write(sarif)

# Batch input — AnalysisError entries are skipped silently because
# they represent files we couldn't analyse, not findings.
batch = bca.analyze_batch(["src/a.py", "src/b.rs", "src/c.cpp"])
sarif = bca.to_sarif(batch, thresholds={"cognitive": 20})

Accepted threshold names mirror the CLI's EXTRACTORS table in big-code-analysis-cli/src/thresholds.rs — e.g. "cognitive", "cyclomatic", "cyclomatic.modified", "halstead.volume", "halstead.difficulty", "halstead.effort", "halstead.time", "halstead.bugs", "loc.sloc", "loc.ploc", "loc.lloc", "loc.cloc", "loc.blank", "nom", "tokens", "nexits", "nargs", "mi.original", "mi.sei", "mi.visual_studio", "abc", "wmc", "npm", "npa". An unknown name raises ValueError listing the accepted set, so a typo fails fast instead of silently producing an empty SARIF run.

thresholds=None (the default) and thresholds={} both produce a well-formed SARIF document with empty results and rules arrays. This matches the CLI's posture: there are no built-in default thresholds; every check run supplies its own limits.

Unit-level findings. to_sarif emits file-scope (unit-space) findings for every metric whose JSON headline at the unit space matches the CLI's per-space accessor (loc.*, halstead.*, mi.*, nom, nargs, nexits, tokens, abc, wmc, npm, npa). The three exceptions — cyclomatic, cyclomatic.modified, cognitive — are skipped at the unit level because the JSON only exposes the aggregate sum across children while the CLI's per-space accessor returns just the unit's own scalar; emitting findings from the aggregate would diverge from the CLI for parent spaces. Unit findings carry logicalLocations: [{"fullyQualifiedName": "<file>"}]; nameless non-unit spaces (rare parse-failure case) carry "<unnamed>" — both matching the CLI's function_token placeholders.
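Before uploading a document it can be useful to tally the findings per rule. The sketch below relies only on the generic SARIF 2.1.0 layout (runs[].results[].ruleId) and the standard library. The helper name count_findings and the sample document are illustrative, and the rule ids in the sample are made up; the ids that to_sarif actually emits are not specified here.

```python
import json
from collections import Counter


def count_findings(sarif_text: str) -> Counter:
    """Tally SARIF results per ruleId across every run in the document."""
    doc = json.loads(sarif_text)
    counts: Counter = Counter()
    for run in doc.get("runs", []):
        for result in run.get("results", []):
            # ruleId is optional in SARIF, so fall back to a placeholder.
            counts[result.get("ruleId", "<no ruleId>")] += 1
    return counts


# Hand-written sample standing in for the string returned by to_sarif.
sample = json.dumps(
    {
        "version": "2.1.0",
        "runs": [
            {
                "tool": {"driver": {"name": "example", "rules": []}},
                "results": [
                    {"ruleId": "cyclomatic", "message": {"text": "too complex"}},
                    {"ruleId": "cyclomatic", "message": {"text": "too complex"}},
                    {"ruleId": "loc.lloc", "message": {"text": "too long"}},
                ],
            }
        ],
    }
)
findings = count_findings(sample)
```

An empty Counter is the expected outcome for a document produced with thresholds=None or thresholds={}, since the results array is empty in that case.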

Upload to GitHub Code Scanning

# .github/workflows/code-scanning.yml (excerpt)
- name: Compute metric SARIF
  run: |
    python - <<'PY'
    import big_code_analysis as bca
    with open("paths.txt", encoding="utf-8") as paths_fh:
        results = bca.analyze_batch(paths_fh.read().splitlines())
    with open("metrics.sarif", "w", encoding="utf-8") as fh:
        fh.write(bca.to_sarif(results, thresholds={"cyclomatic": 15}))
    PY
- name: Upload to Code Scanning
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: metrics.sarif

Batch processing

bca.analyze_batch(paths) runs the same analysis as bca.analyze over every path in an iterable and never raises on per-file errors: each result slot is either an analysis dict or a bca.AnalysisError describing the failure. The list has the same length as the input and preserves order one-to-one, so callers can zip(inputs, results) without losing the pairing.

import big_code_analysis as bca

paths = ["src/a.py", "src/missing.py", "src/b.rs"]
results = bca.analyze_batch(paths)
for path, result in zip(paths, results):
    if isinstance(result, bca.AnalysisError):
        print(f"skipped {path}: ({result.error_kind}) {result.error}")
    else:
        process(result)

The pattern above keeps paths and results as separate materialised sequences. If you want to drive analyze_batch from a generator (e.g. glob.iglob('**/*.py')) for memory efficiency, materialise it into a list first — otherwise zip(generator, analyze_batch(generator)) yields nothing because analyze_batch exhausts the generator before zip re-iterates it:

import glob

paths = list(glob.iglob("src/**/*.py", recursive=True))
results = bca.analyze_batch(paths)
# now zip(paths, results) works
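The pitfall can be reproduced without the library. In the sketch below, fake_batch is a hypothetical stand-in that, like analyze_batch, consumes its iterable and returns a list; everything else is plain Python.

```python
def fake_batch(paths):
    """Stand-in for analyze_batch: consumes the iterable, returns a list."""
    return [f"result for {p}" for p in paths]


def path_gen():
    """A one-shot generator of paths, similar in spirit to glob.iglob."""
    yield from ["a.py", "b.py"]


# Pitfall: the same generator plays both roles. fake_batch(gen) is
# evaluated as an argument before zip starts iterating, which exhausts
# gen, so zip finds its first iterable empty and stops immediately.
gen = path_gen()
broken = list(zip(gen, fake_batch(gen)))

# Fix: materialise once, then reuse the list for both roles.
paths = list(path_gen())
paired = list(zip(paths, fake_batch(paths)))
```

Here broken ends up empty while paired holds one (path, result) tuple per input.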

bca.AnalysisError is a frozen value type with path: str, error: str, and error_kind: Literal["UnsupportedLanguage", "ParseError", "IoError"]. It implements __eq__, __hash__, and __repr__, so callers can put errors in a set to deduplicate failures across runs. It is not an Exception subclass — analyze_batch returns it, never raises it.

analyze_batch only raises on programmer errors: TypeError for a non-iterable paths argument (or a non-path element inside), and ValueError for an empty metrics= list or an unknown metric name. The metrics= selection (see Selecting metrics above) applies uniformly to every file in the batch; validation runs before the input iterable's __iter__ so a bad selection aborts without invoking any side effects.

Generators work — paths are consumed lazily. There is no built-in parallelism; the recommended pattern is concurrent.futures.ThreadPoolExecutor around bca.analyze for parallel single-file calls. analyze_batch also runs with the is_generated walker filter off so every input position yields either a dict or an AnalysisError (never None). Call bca.analyze(path) per-file with the default skip_generated=True if you need the CLI walker's skip behaviour.
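The thread-pool pattern recommended above might look like the following minimal sketch. The helper name analyze_parallel and the max_workers default are assumptions made for illustration; the analysis function is passed in as a parameter, so bca.analyze can be supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable


def analyze_parallel(
    paths: Iterable[str],
    analyze: Callable[[str], Any],
    max_workers: int = 4,
) -> list:
    """Run `analyze` over `paths` on a thread pool, preserving input order.

    Exceptions are captured per path instead of propagating, so one bad
    file does not abort the whole run.
    """

    def safe(path: str) -> Any:
        try:
            return analyze(path)
        except Exception as exc:  # single-file calls raise on failure
            return exc

    path_list = list(paths)  # materialise so paths and results can be paired
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in submission order.
        return list(zip(path_list, pool.map(safe, path_list)))
```

A call such as analyze_parallel(paths, bca.analyze) would then return (path, outcome) pairs, where each outcome is a result dict, None for a skipped generated file, or the captured exception.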

Flatten to records

bca.flatten_spaces(result) walks the nested FuncSpace tree in pre-order and yields one flat, scalar-only dict per node — ready for sqlite3.executemany, pandas.DataFrame.from_records, or any other tabular consumer. Metric keys use the same dotted convention as the CLI's CSV writer (cyclomatic.modified.sum, halstead.volume, loc.lloc_average, …). Metric columns match the CLI's CSV_HEADER set; the identity columns do not — CSV uses space_name / space_kind and has no parent_name / depth, while flat records use name / kind and add the parent / depth pair. One metric also diverges: tokens.* flattens to the JSON shape (tokens.tokens, tokens.tokens_average, tokens.tokens_min, tokens.tokens_max), while CSV_HEADER renames those columns to tokens.sum / .average / .min / .max. Rename in the consumer if you need exact CSV alignment.

import sqlite3
import big_code_analysis as bca

result = bca.analyze("src/lib.rs")
if result is None:  # generated/skipped file
    raise SystemExit("nothing to analyze")
records = list(bca.flatten_spaces(result))
columns = sorted({k for r in records for k in r})
# flatten_spaces keys come from a bounded alphabet (`.`, `_`,
# ASCII alnum), so f-string quoting is safe here. Sanitize if you
# ever build records by hand.
with sqlite3.connect("metrics.db") as db:
    cols = ", ".join(f'"{c}"' for c in columns)
    qs = ", ".join("?" for _ in columns)
    db.execute(f"CREATE TABLE m ({cols})")
    db.executemany(
        f"INSERT INTO m ({cols}) VALUES ({qs})",
        [tuple(r.get(c) for c in columns) for r in records],
    )
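For consumers that need exact CSV alignment, the column differences described before the sqlite example can be applied per record. This is a sketch built only from the divergences named in that paragraph; the helper name to_csv_columns is hypothetical, and whether the remaining identity columns (path, start_line, end_line) match the CSV header is not stated, so they are passed through unchanged.

```python
# Renames taken from the divergences described in the text.
CSV_RENAMES = {
    "name": "space_name",
    "kind": "space_kind",
    "tokens.tokens": "tokens.sum",
    "tokens.tokens_average": "tokens.average",
    "tokens.tokens_min": "tokens.min",
    "tokens.tokens_max": "tokens.max",
}
# Identity columns that exist only in flat records, not in the CSV.
FLAT_ONLY = {"parent_name", "depth"}


def to_csv_columns(record: dict) -> dict:
    """Return a copy of a flat record keyed by the CLI's CSV column names."""
    return {
        CSV_RENAMES.get(key, key): value
        for key, value in record.items()
        if key not in FLAT_ONLY
    }
```

Applying it lazily, for example (to_csv_columns(r) for r in bca.flatten_spaces(result)), keeps the single-pass behaviour of the iterator.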

The iterator is lazy and single-use: it walks the input once without materialising the whole list, and a second iteration is empty. Records always carry path (the analyzed file, or None for analyze_source), name, kind, start_line, end_line, parent_name, and depth. Anonymous spaces (Rust closures, JS function expressions / arrows) keep their name == "<anonymous>" marker verbatim — flatten_spaces does not normalize. Missing metric subtrees produce no keys (absent, not None), matching the "Halstead disabled" edge case for metrics= selection.

parent_name alone cannot disambiguate same-named siblings nested under different parents (e.g. two Inner classes under two different outer classes both surface as parent_name == 'Inner' for their own children). Pair with depth and source-order position, or rebuild the qualified name in your consumer, if you need a fully-qualified path.
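One way to rebuild a qualified name in the consumer is to track the ancestor chain while walking the records, using the fact that they arrive in pre-order with a depth value. The sketch below assumes that depth grows by one per nesting level; the helper name with_qualified_names, the added qualified_name key, and the "::" separator are all illustrative choices.

```python
from typing import Iterable, Iterator


def with_qualified_names(records: Iterable[dict], sep: str = "::") -> Iterator[dict]:
    """Yield copies of pre-order records with an added `qualified_name` key."""
    stack = []   # names of the current node's ancestors, outermost first
    base = None  # depth of the first (root) record, whatever it starts at
    for record in records:
        if base is None:
            base = record["depth"]
        level = record["depth"] - base
        # Discard names left over from deeper nodes and earlier siblings.
        del stack[level:]
        stack.append(str(record["name"]))
        yield {**record, "qualified_name": sep.join(stack)}
```

With this, two same-named Inner classes under different outer classes yield distinct qualified names for their children, because the outer class name is part of the chain.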

Don't mutate the input result while iterating: the walker keeps references into it, so mutations to not-yet-yielded subtrees will be observed in later records.

flatten_spaces raises TypeError if the input is not a mapping; callers must filter None returns from bca.analyze (e.g. when skip_generated=True matched a generated file) before passing.

Errors

bca.analyze raises exceptions; bca.analyze_batch returns bca.AnalysisError values inside the result list (never raised on per-file failures — see the Batch processing section above).

Exception types raised by bca.analyze / bca.analyze_source:

  • bca.UnsupportedLanguageError (subclass of ValueError) — raised when a file extension is unrecognised, or when analyze_source(..., language="...") is passed an unknown language name.
  • bca.ParseError (subclass of ValueError) — raised when the underlying tree-sitter parser fails on the supplied source.
  • ValueError — raised by bca.analyze when the path is not valid UTF-8 and the default strict policy is in effect; pass allow_lossy_path=True to mirror the CLI's U+FFFD substitution via Path::to_string_lossy and accept the resulting non-round-trippable name field (#316).
  • OSError — bubbled up from the underlying file-system read. Dispatches to the canonical subclass (FileNotFoundError, PermissionError, IsADirectoryError, …) based on errno, with err.errno and err.filename populated.
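Because two of these exceptions subclass ValueError, the order of the checks matters when handling them. The sketch below uses local stand-in classes that mirror only the documented base classes, so it runs without the package installed; in real code the stand-ins would be replaced by bca.UnsupportedLanguageError and bca.ParseError. The category strings are made up for illustration.

```python
class UnsupportedLanguageError(ValueError):
    """Stand-in mirroring the documented base class of the real exception."""


class ParseError(ValueError):
    """Stand-in mirroring the documented base class of the real exception."""


def classify(exc: BaseException) -> str:
    """Map an exception from a single-file call to a coarse category."""
    # Test the ValueError subclasses first; a plain ValueError check placed
    # earlier would swallow them.
    if isinstance(exc, UnsupportedLanguageError):
        return "unsupported-language"
    if isinstance(exc, ParseError):
        return "parse-error"
    if isinstance(exc, ValueError):
        return "bad-argument"  # e.g. a non-UTF-8 path under the strict policy
    # The same rule applies to OSError and its subclasses.
    if isinstance(exc, FileNotFoundError):
        return "missing-file"
    if isinstance(exc, OSError):
        return "io-error"
    return "unexpected"
```

The same ordering applies to except clauses: list the specific subclasses before except ValueError and except OSError.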

Returned by bca.analyze_batch inside the result list:

  • bca.AnalysisError — frozen value type with path: str, error: str, and error_kind: Literal["UnsupportedLanguage", "ParseError", "IoError"]. Not an Exception subclass. error_kind is a closed taxonomy: "IoError" covers both filesystem failures and the non-UTF-8 path case (kept at three kinds per the API contract); "ParseError" similarly covers internal JSON-serialisation failures of the resulting FuncSpace (rare; reserved upstream). The OS errno is preserved in the error string via Rust's "<msg> (os error <N>)" default formatting — parse with regex r"\(os error (\d+)\)$" if you need it for retry classification, or call bca.analyze per-file to get a typed OSError subclass instead.
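The regex mentioned above can be wrapped in a small helper for retry classification. This is a sketch: the helper names and the choice of which errno values count as transient are assumptions, and the sample strings are hand-written in the "<msg> (os error <N>)" shape rather than captured from the library. On Windows the number Rust prints is a Win32 error code rather than a POSIX errno, so the comparison below is meaningful on Unix-like systems.

```python
import errno
import re
from typing import Optional

_OS_ERROR = re.compile(r"\(os error (\d+)\)$")

# Assumed set of errors worth retrying; adjust to your environment.
TRANSIENT = {errno.EINTR, errno.EAGAIN, errno.EMFILE}


def os_errno(error_text: str) -> Optional[int]:
    """Extract the trailing OS error number, or None if there is none."""
    match = _OS_ERROR.search(error_text)
    return int(match.group(1)) if match else None


def is_retryable(error_text: str) -> bool:
    """True when the error string carries an errno from the TRANSIENT set."""
    return os_errno(error_text) in TRANSIENT
```

For an AnalysisError value err with error_kind == "IoError", is_retryable(err.error) then decides whether the path goes back on the queue.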

Type checking

The package ships PEP 561 type stubs (py.typed + _native.pyi). mypy --strict and pyright should both pass cleanly against client code.

License

MPL-2.0 (matches the Rust library).

Download files

Source Distribution

big_code_analysis-1.0.0.tar.gz (2.7 MB view details)

Uploaded Source

Built Distributions

big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_x86_64.whl (3.7 MB view details)

Uploaded CPython 3.12+manylinux: glibc 2.28+ x86-64

big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_aarch64.whl (3.6 MB view details)

Uploaded CPython 3.12+manylinux: glibc 2.28+ ARM64

File details

Details for the file big_code_analysis-1.0.0.tar.gz.

File metadata

  • Download URL: big_code_analysis-1.0.0.tar.gz
  • Upload date:
  • Size: 2.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for big_code_analysis-1.0.0.tar.gz
Algorithm Hash digest
SHA256 8306a02e27bb3687c109055861c8d87c95a4f165995ca90c8c85877dd7317db2
MD5 61ed6dad5835b33b2810ec2eadda2110
BLAKE2b-256 8f4dd7daddc44dd39a864cae6360873e9769a595a9afe4e7b708a24dbba972a7

Provenance

The following attestation bundles were made for big_code_analysis-1.0.0.tar.gz:

Publisher: python-wheels.yml on dekobon/big-code-analysis

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_x86_64.whl.

File hashes

Hashes for big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 161f57242ecaeca6b7271c4884c2b1a75fa161e068503a31c217dcfc74b0c144
MD5 a05c16789a4438b9efbbd7ee454262b9
BLAKE2b-256 ec5fb666c9fe87b054a1084e5c65cb54201c1518b5b6d61bee8512d7107f0615

Provenance

The following attestation bundles were made for big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_x86_64.whl:

Publisher: python-wheels.yml on dekobon/big-code-analysis

File details

Details for the file big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_aarch64.whl.

File hashes

Hashes for big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 2a352d618a4f8f032ec562122ac0197519957b61e345057d48c4e9cb6a543dcc
MD5 23bfd0bd8746f45e75a5c33464ed4f4e
BLAKE2b-256 4f56e11f765c8e358be478a081607cde786a3e554884d4c406dd76a44836d758

Provenance

The following attestation bundles were made for big_code_analysis-1.0.0-cp312-abi3-manylinux_2_28_aarch64.whl:

Publisher: python-wheels.yml on dekobon/big-code-analysis
