Security- and compliance-first CLI for AI-assisted legacy code analysis
legacylens
legacylens ingests legacy/mainframe codebases and produces three co-equal outputs:
- Dependency graphs — call/data/include relationships across artifacts.
- Security & compliance findings — mapped to CWE / OWASP (extensible rule packs; see docs/RULES.md).
- Modern documentation — human-readable explanations of legacy logic.
It is built to run on-prem / air-gapped with bring-your-own LLM: point it at your own self-hosted models or your own LLM API keys. With a self-hosted model, no source code or telemetry leaves your environment.
Status: v1 feature-complete across batches B0–B7 (see REQUIREMENTS.md): scaffold, BYO-LLM gateway, ingestion/index, COBOL/JCL/PL-I parsing, dependency graph, CWE/OWASP security analysis, documentation, retrieval, and cost controls.
Installation
legacylens installs a single CLI command that works from cmd, PowerShell, or any
terminal on Windows, macOS, and Linux.
Recommended — pipx (puts legacylens on PATH for every shell, all OSes)
python -m pip install --user pipx
python -m pipx ensurepath # one-time: adds the CLI dir to PATH
pipx install legacylens # from PyPI (or: pipx install . from a clone)
Or use the bundled helper from a clone:
# Windows (PowerShell):
powershell -ExecutionPolicy Bypass -File scripts\install.ps1
# macOS / Linux:
bash scripts/install.sh
Open a new terminal afterwards (so PATH refreshes), then:
legacylens --help
legacylens doctor # report Python + dependency status
On first run, legacylens verifies its required libraries are present; if any are
missing (e.g. running from a clone without installing), it asks permission and
installs them with pip (or set LEGACYLENS_AUTO_INSTALL=1 to consent
non-interactively). legacylens doctor shows the same status on demand.
Alternative — pip
pip install legacylens # or: pip install . from a clone
This also creates the legacylens command. On Windows, if the shell can't find it,
your Python Scripts directory isn't on PATH — either use pipx (above), run
py -m legacylens ..., or add ...\PythonXX\Scripts to PATH. pipx avoids this
entirely.
Air-gapped / offline install
Build a self-contained wheel bundle on a networked machine, copy it to the air-gapped host, and install with no network:
bash scripts/build_offline_bundle.sh # Windows: scripts\build_offline_bundle.ps1
# copy dist/wheelhouse/ to the target host, then:
pip install --no-index --find-links wheelhouse legacylens
Development (editable, with tests)
python -m venv .venv
. .venv/Scripts/activate # Windows (Git Bash); in PowerShell: .venv\Scripts\Activate.ps1
# source .venv/bin/activate # Linux/macOS
pip install -e ".[dev]"
pytest
Quick start
legacylens init # scaffold an audit.yaml in the current dir
# edit audit.yaml — set project.root, languages, and your LLM provider(s)
legacylens index # discover & index sources (COBOL/JCL/PL-I)
legacylens analyze # parse + security/compliance analysis
legacylens graph # emit dependency graph (DOT/Mermaid/GraphML)
legacylens doc # generate documentation (Markdown + overview)
legacylens report # render findings (SARIF/JSON/HTML)
legacylens embed # build the semantic embedding index (BYO embeddings)
legacylens search "QUERY" # find the most relevant artifacts
Once an embedding index is built (legacylens embed), the LLM steps are
retrieval-augmented: analyze (security) and doc inject the most semantically
related artifacts into their prompts for cross-file reasoning (disable with
--no-rag).
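As a rough illustration of the retrieval step described above, the sketch below ranks artifacts by cosine similarity to a query embedding and keeps the top k. The vectors, artifact names, and the `top_k` helper are hypothetical; this README does not specify how legacylens stores or ranks embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """Return the k artifact names whose embeddings are closest to query_vec."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)
    return ranked[:k]

# Hypothetical toy embeddings; real ones come from your embedding model.
index = {
    "PAYROLL.cbl": [0.9, 0.1, 0.0],
    "NIGHTLY.jcl": [0.0, 1.0, 0.0],
    "TAXCALC.cbl": [0.8, 0.2, 0.1],
}
related = top_k([1.0, 0.0, 0.0], index, k=2)
```

In a retrieval-augmented prompt, the artifacts named in `related` would be the ones injected as extra context.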
Run legacylens --help for all commands and legacylens <cmd> --help for details.
Useful flags: --no-llm (on analyze/doc) runs fully deterministically with no model
calls; budget.max_tokens in config caps total LLM spend per run.
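One way to picture a per-run token cap is a small counter that refuses any call that would exceed the limit. This is a minimal sketch under that assumption; the `TokenBudget` class is hypothetical and does not describe how legacylens enforces budget.max_tokens internally.

```python
class TokenBudget:
    """Tracks cumulative token use against a per-run cap (hypothetical helper)."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used = 0

    def try_spend(self, tokens):
        """Record the spend and return True if it fits; otherwise leave the total unchanged."""
        if self.used + tokens > self.max_tokens:
            return False
        self.used += tokens
        return True

budget = TokenBudget(max_tokens=1000)
# The third request would push the total past the cap, so it is refused.
results = [budget.try_spend(n) for n in (400, 500, 200, 100)]
```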
Parse results are cached in the index (content-addressed), so unchanged files are
parsed once and reused across passes, commands, and runs — incremental by default
(parser.cache: true). For large estates, parse in parallel with -j <workers> (or
parser.workers): cache-miss files are grammar-parsed across a process pool to warm
the cache before analysis.
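The idea behind a content-addressed cache can be sketched as follows: the cache key is a hash of the file's bytes, so identical content is parsed once no matter how often it is seen. The `ParseCache` class and the `count_lines` stand-in parser are hypothetical, and the choice of SHA-256 is an assumption; the README does not say which hash legacylens uses.

```python
import hashlib

class ParseCache:
    """In-memory cache keyed by the SHA-256 of the source bytes (illustrative only)."""

    def __init__(self):
        self._store = {}
        self.misses = 0

    def get_or_parse(self, content: bytes, parse):
        key = hashlib.sha256(content).hexdigest()
        if key not in self._store:
            self.misses += 1  # only unseen content reaches the parser
            self._store[key] = parse(content)
        return self._store[key]

def count_lines(content: bytes) -> int:
    """Stand-in for a real parser."""
    return len(content.splitlines())

cache = ParseCache()
src = b"IDENTIFICATION DIVISION.\nPROGRAM-ID. DEMO.\n"
first = cache.get_or_parse(src, count_lines)
second = cache.get_or_parse(src, count_lines)  # same bytes: served from cache
third = cache.get_or_parse(src + b"STOP RUN.\n", count_lines)  # changed bytes: parsed again
```

Because the key depends only on content, renaming or touching a file without changing its bytes would not invalidate an entry in a scheme like this.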
Configuration
All behavior is driven by a single config file (audit.yaml by default; override
with -c/--config). Credentials are never stored in config — you name the
environment variable holding each provider's key. See the file generated by
legacylens init for the full, commented schema.
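The pattern of naming an environment variable instead of storing a key can be sketched like this. The `resolve_api_key` helper is hypothetical and only illustrates the lookup; api_key_env is the config field this README mentions.

```python
import os

def resolve_api_key(provider: dict, env=os.environ):
    """Look up the key named by api_key_env; return None for keyless (local) providers."""
    var = provider.get("api_key_env")
    if var is None:
        return None
    if var not in env:
        raise RuntimeError(f"environment variable {var} is not set")
    return env[var]

# A fake environment keeps the example self-contained.
fake_env = {"GEMINI_API_KEY": "example-key"}
key = resolve_api_key({"name": "gemini", "api_key_env": "GEMINI_API_KEY"}, env=fake_env)
local_key = resolve_api_key({"name": "ollama"}, env=fake_env)
```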
A fully-commented reference with every provider (local, OpenAI, Anthropic,
Gemini, and any OpenAI-compatible endpoint) is at
examples/audit.example.yaml — copy it, keep one
provider, point routing.default at it, and export that provider's API key env var.
Key principles:
- Air-gapped by default (air_gapped: true): the LLM gateway refuses any endpoint not explicitly listed under llm.providers.
- Bring-your-own models: OpenAI-compatible, Anthropic, or local servers (Ollama / vLLM / llama.cpp), with per-task model routing.
- Auditable: every run appends a structured trail to the configured audit log.
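To make the air-gapped principle concrete, here is a minimal sketch of an endpoint allowlist derived from the configured providers. Matching on host and port is an assumption, as are the helper names; the README only states that unlisted endpoints are refused.

```python
from urllib.parse import urlsplit

def allowed_hosts(providers):
    """Collect (host, port) pairs from the configured providers' base URLs."""
    hosts = set()
    for p in providers:
        parts = urlsplit(p["base_url"])
        hosts.add((parts.hostname, parts.port))
    return hosts

def check_endpoint(url, providers):
    """Raise PermissionError unless the URL's host and port match a configured provider."""
    parts = urlsplit(url)
    if (parts.hostname, parts.port) not in allowed_hosts(providers):
        raise PermissionError(f"endpoint not allowed in air-gapped mode: {url}")

providers = [{"name": "local", "base_url": "http://localhost:11434/v1"}]
check_endpoint("http://localhost:11434/v1/chat/completions", providers)  # passes silently
```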
Choosing your LLM provider
Steps to enable an LLM:
- Create llm_config.yaml next to your audit.yaml (copy examples/llm_config.example.yaml), and make sure audit.yaml has no llm: block.
- Fill in type, url, model, and key for your provider (table below). Prefer api_key_env: NAME instead of key: to keep the key in an env var.
- (Optional) Add embedding_model: to enable embed/search plus retrieval-augmented docs and security.
- Run with the LLM on, i.e. without --no-llm:
legacylens index
legacylens analyze # adds LLM advisory findings (flagged for review)
legacylens doc # fills in Purpose / Business-logic prose
legacylens embed # optional: build the embedding index for RAG
- Everything else (parsing, graph, CWE/OWASP + regulatory findings, all output formats) works the same with or without an LLM.
Easiest — a 4-line llm_config.yaml. Create it next to your audit.yaml (and
leave the llm: block out of audit.yaml); legacylens auto-detects it:
# llm_config.yaml (this file is git-ignored — it may hold your key)
type: openai_compatible
url: https://generativelanguage.googleapis.com/v1beta/openai
model: gemini-2.0-flash
key: PASTE_YOUR_KEY_HERE # or use `api_key_env: GEMINI_API_KEY` to keep it in an env var
That's it — run legacylens analyze / doc. Swap type/url/model for any
provider (see examples/llm_config.example.yaml).
Advanced — a full llm: block (multiple providers, per-task routing). Here the
API key is never in the config — you export it as an environment variable and name
it via api_key_env. Pick one provider block for llm.providers, point
routing.default at it, and export the key:
# PowerShell (Windows) — this session, or `setx NAME "value"` to persist:
$env:OPENAI_API_KEY = "sk-..."
# bash / Linux / macOS:
export OPENAI_API_KEY="sk-..."
| Provider | type | base_url | model (example) | key env |
|---|---|---|---|---|
| Local (Ollama / vLLM / llama.cpp) | local | http://localhost:11434/v1 | qwen2.5-coder:32b | none (offline) |
| OpenAI | openai_compatible | https://api.openai.com/v1 | gpt-4o-mini | OPENAI_API_KEY |
| Anthropic (Claude) | anthropic | https://api.anthropic.com | claude-sonnet-4-6 | ANTHROPIC_API_KEY |
| Google Gemini | openai_compatible | https://generativelanguage.googleapis.com/v1beta/openai | gemini-2.0-flash | GEMINI_API_KEY |
| Any OpenAI-compatible (Groq, Together, OpenRouter, LiteLLM…) | openai_compatible | your endpoint /v1 | your model id | your env var |
Example — Google Gemini (free key):
llm:
providers:
- name: gemini
type: openai_compatible
base_url: https://generativelanguage.googleapis.com/v1beta/openai
model: gemini-2.0-flash
api_key_env: GEMINI_API_KEY # export GEMINI_API_KEY=... ; not stored here
routing:
default: gemini
# optional — enables `embed`/`search` + retrieval-augmented docs & security:
# embeddings: { provider: gemini, model: text-embedding-004 }
Then run without --no-llm (e.g. legacylens analyze, legacylens doc) to get
LLM-assisted findings and documentation. The full set of provider blocks is in
examples/audit.example.yaml.
Note: a cloud provider means code leaves your environment (the cloud host is an allowed endpoint even with air_gapped: true). For a strict on-prem engagement, use a local model.
Development
pytest # run the test suite
Findings lifecycle & CI gating
legacylens supports a real audit/CI workflow around findings:
legacylens analyze --fail-on high # exit 6 if any non-suppressed finding >= high
legacylens suppress --list # list findings with their (line-independent) fingerprints
legacylens suppress <fingerprint> --reason "false positive" # accept / silence one
legacylens baseline # accept current findings as the baseline
legacylens diff # show findings new vs resolved since the baseline
legacylens analyze --fail-on high --new-only # gate only on findings new vs the baseline
- Suppressions (.legacylens/suppressions.json) mark false positives or accepted LLM-advisory findings; they're excluded from gating, shown struck-through in the HTML report, and marked in SARIF (suppressions).
- Baseline (.legacylens/baseline.json) lets you adopt legacylens on a large estate without drowning in pre-existing findings: gate only on what's new.
- Exit codes: 6 = gate failure (distinct from tool errors), so CI can tell a policy failure apart from a crash. Configure a default via findings.fail_on.
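The gating rules above can be sketched as a small function: skip suppressed findings, optionally skip anything already in the baseline, and fail with exit code 6 if what remains meets the severity threshold. The severity names other than high, the finding fields, and the `gate` helper are assumptions made for illustration, not legacylens's internal model.

```python
SEVERITY = {"low": 1, "medium": 2, "high": 3, "critical": 4}  # assumed ordering
GATE_FAILURE = 6  # exit code the README documents for a policy failure

def gate(findings, fail_on, suppressed=frozenset(), baseline=None):
    """Return 6 if any eligible finding meets the threshold, else 0."""
    for f in findings:
        if f["fingerprint"] in suppressed:
            continue  # suppressed findings never gate
        if baseline is not None and f["fingerprint"] in baseline:
            continue  # new-only mode: pre-existing findings do not gate
        if SEVERITY[f["severity"]] >= SEVERITY[fail_on]:
            return GATE_FAILURE
    return 0

findings = [
    {"fingerprint": "a", "severity": "high"},
    {"fingerprint": "b", "severity": "low"},
]
exit_code = gate(findings, fail_on="high")
```

A CI job can then treat 6 as a policy failure and any other nonzero code as a tool error.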
Fingerprints are line-independent, so a finding survives edits elsewhere in the file.
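A line-independent fingerprint can be built by hashing only properties that do not move when other lines are edited. The sketch below hashes a rule ID, file path, and whitespace-normalized snippet; those inputs and the `fingerprint` helper are assumptions, since the README does not list what legacylens actually hashes.

```python
import hashlib

def fingerprint(rule_id, path, snippet):
    """Hash the rule, file, and whitespace-normalized snippet; no line number is included."""
    normalized = " ".join(snippet.split())
    payload = "\x1f".join((rule_id, path, normalized))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

# The same statement, re-indented after an edit elsewhere, keeps its fingerprint.
before = fingerprint("RULE-1", "PAYROLL.cbl", "MOVE WS-PASSWORD TO OUT-REC")
after = fingerprint("RULE-1", "PAYROLL.cbl", "    MOVE  WS-PASSWORD   TO OUT-REC")
```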
Custom & regulatory compliance packs: add your own detection rules via YAML
(analysis.compliance.pack_paths) and map findings to regulatory controls with
built-in (pci-dss, nist-800-53) or custom frameworks
(analysis.compliance.frameworks / framework_paths). Findings carry controls
(e.g. PCI-DSS:8.6.2) in every output, and legacylens compliance summarizes by
control. See docs/RULES.md.
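Summarizing by control amounts to counting findings per control ID. The sketch below shows one way to do that, assuming each finding carries a list of control strings; the finding shape, the rule names, and the second control ID's format are hypothetical.

```python
from collections import Counter

def summarize_by_control(findings):
    """Count findings per control ID; a finding may map to several controls."""
    counts = Counter()
    for f in findings:
        counts.update(f.get("controls", []))
    return dict(counts)

findings = [
    {"rule": "hardcoded-credential", "controls": ["PCI-DSS:8.6.2", "NIST-800-53:IA-5"]},
    {"rule": "hardcoded-credential", "controls": ["PCI-DSS:8.6.2"]},
    {"rule": "unmapped-rule"},  # a finding with no control mapping is simply skipped
]
summary = summarize_by_control(findings)
```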
COBOL parser backend (client choice)
The COBOL parser backend is selectable in config under parser.backend:
- regex (default): pure-Python, zero-dependency line parser. Works everywhere, installs air-gapped with no extra steps.
- antlr (opt-in): grammar-based parser using ANTLR, for higher fidelity (the lexer understands string literals and tokens natively). It requires:
  - the runtime extra: pip install 'legacylens[antlr]', and
  - a one-time parser generation: python scripts/build_antlr.py (needs Java at build time only; see the script header).
parser:
backend: antlr # or: regex
fallback_to_regex: true # if antlr isn't generated/installed, use regex instead of failing
With fallback_to_regex: true (default), selecting antlr before generating it
simply logs a warning and uses the regex backend — runs never break. The ANTLR
grammar (src/legacylens/parsing/antlr/Cobol.g4) is a starter covering the
structural constructs legacylens needs; clients can extend it or substitute a mature
grammar (e.g. ProLeap's).
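The fallback behavior can be sketched as a small selection function: use the requested backend if it is available, otherwise warn and use regex, or fail if fallback is disabled. The `select_backend` helper is hypothetical, and checking only for the `antlr4` runtime package is a simplification; a real check would also need to confirm the generated parser exists.

```python
import importlib.util
import logging

log = logging.getLogger("backend-demo")

def antlr_runtime_available():
    """True if the antlr4 Python runtime package can be imported."""
    return importlib.util.find_spec("antlr4") is not None

def select_backend(requested, fallback_to_regex=True, antlr_ready=None):
    """Pick the parser backend, falling back to regex when antlr is unavailable."""
    if antlr_ready is None:
        antlr_ready = antlr_runtime_available()
    if requested != "antlr" or antlr_ready:
        return requested
    if fallback_to_regex:
        log.warning("antlr backend unavailable; falling back to regex")
        return "regex"
    raise RuntimeError("antlr backend requested but not available")

chosen = select_backend("antlr", fallback_to_regex=True, antlr_ready=False)
```

Passing `antlr_ready` explicitly keeps the example deterministic regardless of what is installed.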
Validation
Beyond the unit suite, legacylens has been run end-to-end against public COBOL,
JCL, and PL/I repositories (AWS CardDemo, IBM Bank-of-Z, and others). See
docs/VALIDATION.md for the test matrix, results, and the
issues that real-world testing surfaced and fixed.
License
MIT — see LICENSE.