LLM-assisted migration of SAS analytics, transformations, and reports to Databricks (PySpark, Spark SQL, DLT, Workflows).
sas2databricks 🦶
sas2databricks - track down your SAS and set it free in the Databricks lakehouse.
An open-source, LLM-assisted migration toolkit that converts SAS analytics, data transformations, and reports into Databricks (PySpark, Spark SQL, Delta Live Tables, and Workflows) - end to end.
Deterministic transpilers handle the patterns we understand. A GitHub Copilot-powered LLM layer (default Claude Opus 4.8, or Codex, or Auto) fills the gaps, resolves ambiguity, and explains every conversion.
Why this exists
Migrating SAS to Databricks is hard because SAS is not one language - it is a family of sub-languages (DATA step, PROC SQL, the macro facility, dozens of PROCs, formats/informats). Pure rules-based converters break on real-world code; pure LLM converters hallucinate and are unverifiable. sas2databricks combines both:
- A deterministic core parses SAS into an intermediate representation (IR) and transpiles every pattern it recognizes - fast, free, and 100% reproducible.
- An LLM orchestrator is invoked only for the residue (unknown PROCs, gnarly macros, business logic) with the model you choose, and its output is validated against the IR.
- Every line of generated code carries provenance (which SAS line it came from and whether it was rule-based or LLM-based) so reviewers can trust the result.
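The hybrid flow described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: `Emitted`, `rule_based`, `convert`, and `fake_llm` are hypothetical names, the only "rule" is a toy `%let` rewrite, and the IR validation step the project performs on LLM output is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Emitted:
    code: str      # generated Python
    sas_line: int  # 1-based line in the original SAS source
    engine: str    # "rule" or "llm"


def rule_based(stmt: str) -> Optional[str]:
    """Toy deterministic rule (assumption): only understands `%let name = value;`."""
    s = stmt.strip()
    if s.lower().startswith("%let ") and s.endswith(";"):
        name, _, value = s[5:-1].partition("=")
        return f"{name.strip()} = {value.strip()!r}"
    return None


def convert(sas_source: str, llm: Callable[[str], str]) -> list[Emitted]:
    """Try the rule first; hand only the residue to the LLM callback."""
    out = []
    for lineno, stmt in enumerate(sas_source.splitlines(), start=1):
        if not stmt.strip():
            continue
        code = rule_based(stmt)
        if code is not None:
            out.append(Emitted(code, lineno, "rule"))
        else:
            out.append(Emitted(llm(stmt), lineno, "llm"))
    return out


def fake_llm(stmt: str) -> str:
    """Stand-in for a real model call; just flags the statement for review."""
    return f"# TODO review: {stmt.strip()}"


result = convert("%let year = 2024;\nproc magic data=a; run;", fake_llm)
for item in result:
    print(item.sas_line, item.engine, item.code)
```

Each emitted item records its source line and engine, which is the kind of provenance a reviewer would filter on to inspect LLM-generated lines first.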
What it covers
| SAS capability | Target | Engine |
|---|---|---|
| PROC SQL | Spark SQL / PySpark | Deterministic (sqlglot) |
| DATA step (BY-group, RETAIN, arrays + DO loops, LAG, FIRST./LAST.) | PySpark | Deterministic + LLM |
| Macro facility (%MACRO, %LET, %IF/%DO/%ELSE, iterative %DO, macro vars) | Python/Jinja params | Deterministic |
| PROC MEANS / SUMMARY / FREQ / TABULATE (measures & aggregations) | PySpark / Spark SQL | Deterministic |
| PROC FORMAT (formats/informats) | PySpark UDF / mapping tables | Deterministic |
| PROC REPORT / PRINT | Databricks notebook viz / SQL | Deterministic + LLM |
| Statistical PROCs (REG, LOGISTIC, GLM, GENMOD) | Spark MLlib scaffold | Deterministic (review) |
| Descriptive PROCs (CORR, UNIVARIATE) | Spark stats helpers | Deterministic |
| Data-parity validation | validate notebook (row/schema/checksum diff) | Deterministic |
| Deployment packaging | Databricks Asset Bundle (databricks.yml) + Workflows job graph | Deterministic |
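To make the "row/schema/checksum diff" idea in the data-parity row concrete, here is a small pure-Python sketch. It is an assumption about what such a check might look like, not the generated validate notebook: `table_fingerprint` and `parity_report` are hypothetical helpers, and tables are plain in-memory lists rather than Spark DataFrames.

```python
import hashlib


def table_fingerprint(columns, rows):
    """Return (schema, row_count, checksum) for a small in-memory table."""
    schema = tuple(columns)
    digest = hashlib.sha256()
    # Sort the serialized rows so the checksum does not depend on row order.
    for row in sorted(repr(tuple(r)) for r in rows):
        digest.update(row.encode("utf-8"))
    return schema, len(rows), digest.hexdigest()


def parity_report(expected, actual):
    """Compare two (columns, rows) tables on schema, row count, and checksum."""
    e_schema, e_count, e_sum = table_fingerprint(*expected)
    a_schema, a_count, a_sum = table_fingerprint(*actual)
    return {
        "schema_match": e_schema == a_schema,
        "row_count_match": e_count == a_count,
        "checksum_match": e_sum == a_sum,
    }


# Same data, different row order: all three checks should agree.
sas_output = (["id", "total"], [(1, 10.0), (2, 5.5)])
spark_output = (["id", "total"], [(2, 5.5), (1, 10.0)])
report = parity_report(sas_output, spark_output)

# A dropped row shows up in both the row count and the checksum.
truncated = (["id", "total"], [(1, 10.0)])
bad_report = parity_report(sas_output, truncated)
print(report, bad_report)
```

Comparing an order-insensitive checksum alongside the row count helps distinguish "rows missing" from "values changed" when a conversion drifts.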
Three ways to use it
flowchart LR
SAS["SAS project (*.sas)"] --> PROJ
subgraph PROJ["Project orchestration (project.py)"]
PLAN["Plan files →<br/>flat or --bundle layout"]
end
PLAN -->|per .sas file| CORE
subgraph CORE["sas2databricks core (Python)"]
P[Parser] --> IR[(IR)]
IR --> T[Transpilers]
T -->|low confidence| L[LLM Orchestrator]
L -->|model: opus/codex/auto| T
T --> E[Emitters]
end
E --> OUT["PySpark / Spark SQL / DLT / Workflows / Validate / Bundle"]
OUT --> ASM["databricks.yml + src/ notebooks<br/>+ project report index (report_index.py)"]
CLI["CLI: s2db migrate"] --> PROJ
MCP["MCP server (tools for Copilot)"] --> PROJ
AGENT["VS Code Copilot agent + skill"] --> MCP
- CLI - `s2db migrate ./sas_project` migrates everything in one command (PySpark + Spark SQL + DLT). Narrow with `--target pyspark` or add `--bundle` for batch jobs.
- MCP server - exposes `parse_sas`, `convert_sas`, `validate_conversion`, `explain_conversion`, `migrate_project` as tools to any MCP client (incl. GitHub Copilot).
- VS Code Copilot agent + skill - the `@sas-migrator` agent orchestrates the migration interactively and lets you pick the model (Opus 4.8 default / Codex / Auto).
Quick start
# Install from PyPI
pip install sas2databricks
# ...or include the MCP server extra:
pip install "sas2databricks[mcp]"
# Convert a single SAS program to PySpark
s2db convert examples/sample1_proc_sql.sas --target pyspark
# Migrate a whole SAS project - ONE command emits PySpark + Spark SQL + DLT side by side,
# with a combined report index linking each format. No flags needed.
s2db migrate ./examples --out ./out
# narrow to one format: s2db migrate ./examples --target pyspark
# pick the model for gaps: s2db migrate ./examples --model opus-4.8
# Assemble deployable Databricks Asset Bundles (databricks.yml + src/ notebooks + reports/)
# With --target all you get one ready-to-deploy bundle per format under out/<target>/.
s2db migrate ./examples --bundle --html --out ./bundle
# then: cd bundle/pyspark && databricks bundle deploy -t dev
# Run the MCP server (stdio) so Copilot can call the tools
s2db mcp
From source (for development): `git clone https://github.com/navintkr/sas2databricks` then `pip install -e ".[dev,mcp]"`. The `examples/` SAS samples used above live in the repo.
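If you drive migrations from a Python script (for example in CI), the commands above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of the package; it only uses the flags shown in this section (`--target`, `--model`, `--bundle`, `--html`, `--out`) and builds the argument list without running anything.

```python
import shlex


def build_migrate_cmd(project_dir, out_dir=None, target=None, model=None,
                      bundle=False, html=False):
    """Build the argv list for `s2db migrate` from the flags shown above."""
    cmd = ["s2db", "migrate", project_dir]
    if target:
        cmd += ["--target", target]
    if model:
        cmd += ["--model", model]
    if bundle:
        cmd.append("--bundle")
    if html:
        cmd.append("--html")
    if out_dir:
        cmd += ["--out", out_dir]
    return cmd


cmd = build_migrate_cmd("./examples", out_dir="./bundle", bundle=True, html=True)
print(shlex.join(cmd))
```

Pass the resulting list to `subprocess.run(cmd, check=True)` once `s2db` is installed and on your `PATH`.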
Model selection
The LLM layer is provider-agnostic. Pick the model per run:
| Value | Meaning |
|---|---|
| `opus-4.8` | Default. Best reasoning for complex macros & business logic. |
| `codex` | Fast, code-focused conversions. |
| `auto` | Router: deterministic first; escalates only low-confidence nodes, and picks Opus for macros/business logic, Codex for mechanical rewrites. |
In the VS Code Copilot agent the model is selected via the agent's model picker; in the CLI
and MCP server it is the --model flag / model argument. See docs/model-selection.md.
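The `auto` behavior in the table can be read as a small routing function. The sketch below is an interpretation under stated assumptions: the `Node` class, the node kinds, and the `0.8` confidence threshold are all hypothetical, and the real router is described in docs/model-selection.md.

```python
from dataclasses import dataclass


@dataclass
class Node:
    kind: str          # hypothetical labels, e.g. "proc_sql", "macro", "business_logic"
    confidence: float  # how sure the deterministic transpiler is (0.0 to 1.0)


THRESHOLD = 0.8                                 # assumed cut-off, not a documented value
REASONING_KINDS = {"macro", "business_logic"}   # kinds routed to the stronger reasoner


def route(node: Node, model: str = "auto") -> str:
    """Decide which engine handles a node: deterministic first, LLM for the rest."""
    if node.confidence >= THRESHOLD:
        return "deterministic"
    if model != "auto":
        return model  # an explicit --model choice applies to every escalated node
    return "opus-4.8" if node.kind in REASONING_KINDS else "codex"


print(route(Node("proc_sql", 0.95)))         # high confidence stays rule-based
print(route(Node("macro", 0.40)))            # low-confidence macro
print(route(Node("proc_sql", 0.40)))         # low-confidence mechanical rewrite
print(route(Node("macro", 0.40), "codex"))   # explicit model overrides the router
```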
Project layout
src/sas2databricks/
├── parser/ # SAS → preprocess → macro expansion → step split
├── ir/ # Intermediate representation (engine-agnostic)
├── transpilers/ # IR builders per SAS construct (deterministic)
├── emitters/ # IR → PySpark / Spark SQL / DLT / Workflows / Validate / Bundle
├── llm/ # Model selection + orchestrator + pluggable providers
├── macros.py # %MACRO body (+ %IF/%DO control flow) → parameterized Python function
├── mcp/ # MCP server exposing the core as tools
├── project.py # Project migration: flat or deployable-bundle layout
├── report_index.py# Project-level report index (Markdown + HTML)
├── pipeline.py # End-to-end orchestration
└── cli.py # `s2db` command-line interface
See ARCHITECTURE.md for the full design and ROADMAP.md for what's planned.
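As a rough picture of the "preprocess → step split" stages listed for `parser/`, the sketch below strips block comments and splits a program at `run;` / `quit;` boundaries. It is a simplified assumption about the approach, not the package's parser: `strip_comments` and `split_steps` are hypothetical names, macro expansion is skipped, and string literals containing `run;` are not handled.

```python
import re

# A step ends at `run;` or `quit;` (case-insensitive).
STEP_END = re.compile(r"\b(run|quit)\s*;", re.IGNORECASE)


def strip_comments(source: str) -> str:
    """Remove /* ... */ block comments (simplified preprocessing)."""
    return re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)


def split_steps(source: str) -> list[str]:
    """Split SAS source into DATA/PROC steps at each step terminator."""
    steps = []
    start = 0
    for match in STEP_END.finditer(source):
        chunk = source[start:match.end()].strip()
        if chunk:
            steps.append(chunk)
        start = match.end()
    tail = source[start:].strip()
    if tail:
        steps.append(tail)  # trailing statements without a terminator
    return steps


program = """
/* monthly totals */
data work.a; set raw.a; run;
proc sql; create table t as select * from work.a; quit;
"""
steps = split_steps(strip_comments(program))
for step in steps:
    print(step)
```

Splitting into steps first lets each step be routed to the transpiler for its construct (DATA step, PROC SQL, and so on) independently.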
Status
v0.3.0 - real and growing. Deterministic transpilers (with tests) cover PROC SQL,
macro variables, %MACRO definitions/invocations with %IF/%DO/%ELSE control flow
and iterative %DO loops, PROC MEANS/FORMAT/REPORT, the DATA step (BY-group, RETAIN,
LAG/DIF, FIRST./LAST., MERGE, arrays + iterative DO loops), descriptive
stats (CORR/UNIVARIATE), and MLlib scaffolds for REG/LOGISTIC/GLM. Targets include
PySpark, Spark SQL, DLT (with expectations), Workflows, a data-parity validate notebook,
and a Databricks Asset Bundle (databricks.yml). s2db migrate --bundle assembles a
deployable bundle (notebooks + databricks.yml + a roll-up project report index). Real
LLM providers (Anthropic, Azure OpenAI) plug in behind LLMProvider, results render to an
HTML report, and CI runs ruff + mypy + pytest on Python 3.10–3.12. Contributions welcome -
see CONTRIBUTING.md.
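The phrase "plug in behind LLMProvider" suggests a small provider interface. The sketch below shows one way such a seam could look; it is an assumption, not the package's actual API. The `complete` method, the `LLMProviderLike` protocol, `EchoProvider`, and `fill_gap` are all hypothetical names chosen for illustration.

```python
from typing import Protocol


class LLMProviderLike(Protocol):
    """Hypothetical provider interface: one method that turns a prompt into text."""

    def complete(self, prompt: str) -> str:
        ...


class EchoProvider:
    """Offline stand-in provider, useful in tests where no API key is available."""

    def complete(self, prompt: str) -> str:
        return f"# unconverted: {prompt}"


def fill_gap(provider: LLMProviderLike, sas_fragment: str) -> str:
    """Send a fragment the deterministic core could not convert to the provider."""
    return provider.complete(sas_fragment)


print(fill_gap(EchoProvider(), "proc magic;"))
```

Because the orchestrator in this sketch depends only on the protocol, an Anthropic-backed or Azure OpenAI-backed class could be swapped in without touching the calling code.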
License
MIT © contributors. SAS and all related marks are trademarks of SAS Institute Inc. This project is independent and not affiliated with or endorsed by SAS Institute Inc.