Skip to main content

LLM-assisted migration of SAS analytics, transformations, and reports to Databricks (PySpark, Spark SQL, DLT, Workflows).

Project description

sas2databricks 🦶

sas2databricks - track down your SAS and set it free in the Databricks lakehouse.

An open-source, LLM-assisted migration toolkit that converts SAS analytics, data transformations, and reports into Databricks (PySpark, Spark SQL, Delta Live Tables, and Workflows) - end to end.

Deterministic transpilers handle the patterns we understand. A GitHub Copilot–powered LLM layer (default Claude Opus 4.8, or Codex, or Auto) fills the gaps, resolves ambiguity, and explains every conversion.

License: MIT Python 3.10+ status: beta


Why this exists

Migrating SAS to Databricks is hard because SAS is not one language - it is a family of sub-languages (DATA step, PROC SQL, the macro facility, dozens of PROCs, formats/informats). Pure rules-based converters break on real-world code; pure LLM converters hallucinate and are unverifiable. sas2databricks combines both:

  1. A deterministic core parses SAS into an intermediate representation (IR) and transpiles every pattern it recognizes - fast, free, and 100% reproducible.
  2. An LLM orchestrator is invoked only for the residue (unknown PROCs, gnarly macros, business logic) with the model you choose, and its output is validated against the IR.
  3. Every line of generated code carries provenance (which SAS line it came from and whether it was rule-based or LLM-based) so reviewers can trust the result.

What it covers

SAS capability Target Engine
PROC SQL Spark SQL / PySpark Deterministic (sqlglot)
DATA step (BY-group, RETAIN, arrays + DO loops, LAG, FIRST./LAST.) PySpark Deterministic + LLM
Macro facility (%MACRO, %LET, %IF/%DO/%ELSE, iterative %DO, macro vars) Python/Jinja params Deterministic
PROC MEANS / SUMMARY / FREQ / TABULATE (measures & aggregations) PySpark / Spark SQL Deterministic
PROC FORMAT (formats/informats) PySpark UDF / mapping tables Deterministic
PROC REPORT / PRINT Databricks notebook viz / SQL Deterministic + LLM
Statistical PROCs (REG, LOGISTIC, GLM, GENMOD) Spark MLlib scaffold Deterministic (review)
Descriptive PROCs (CORR, UNIVARIATE) Spark stats helpers Deterministic
Data-parity validation validate notebook (row/schema/checksum diff) Deterministic
Deployment packaging Databricks Asset Bundle (databricks.yml) + Workflows job graph Deterministic

Three ways to use it

flowchart LR
    SAS["SAS project (*.sas)"] --> PROJ

    subgraph PROJ["Project orchestration (project.py)"]
        PLAN["Plan files →<br/>flat or --bundle layout"]
    end

    PLAN -->|per .sas file| CORE
    subgraph CORE["sas2databricks core (Python)"]
        P[Parser] --> IR[(IR)]
        IR --> T[Transpilers]
        T -->|low confidence| L[LLM Orchestrator]
        L -->|model: opus/codex/auto| T
        T --> E[Emitters]
    end
    E --> OUT["PySpark / Spark SQL / DLT / Workflows / Validate / Bundle"]
    OUT --> ASM["databricks.yml + src/ notebooks<br/>+ project report index (report_index.py)"]

    CLI["CLI: s2db migrate"] --> PROJ
    MCP["MCP server (tools for Copilot)"] --> PROJ
    AGENT["VS Code Copilot agent + skill"] --> MCP
  1. CLI - s2db migrate ./sas_project --target pyspark --out ./databricks for batch jobs.
  2. MCP server - exposes parse_sas, convert_sas, validate_conversion, explain_conversion, migrate_project as tools to any MCP client (incl. GitHub Copilot).
  3. VS Code Copilot agent + skill - the @sas-migrator agent orchestrates the migration interactively and lets you pick the model (Opus 4.8 default / Codex / Auto).

Quick start

# Install (editable, with dev + mcp extras)
pip install -e ".[dev,mcp]"

# Convert a single SAS program to PySpark
s2db convert examples/sample1_proc_sql.sas --target pyspark

# Migrate an entire SAS project, choosing the LLM model for the gaps
s2db migrate ./examples --target dlt --model opus-4.8 --out ./out

# Assemble a deployable Databricks Asset Bundle (databricks.yml + src/ notebooks + reports/)
s2db migrate ./examples --target pyspark --bundle --html --out ./bundle
#   then:  cd bundle && databricks bundle deploy -t dev

# Run the MCP server (stdio) so Copilot can call the tools
s2db mcp

Model selection

The LLM layer is provider-agnostic. Pick the model per run:

Value Meaning
opus-4.8 Default. Best reasoning for complex macros & business logic.
codex Fast, code-focused conversions.
auto Router: deterministic first; escalates only low-confidence nodes, and picks Opus for macros/business logic, Codex for mechanical rewrites.

In the VS Code Copilot agent the model is selected via the agent's model picker; in the CLI and MCP server it is the --model flag / model argument. See docs/model-selection.md.

Project layout

src/sas2databricks/
├── parser/        # SAS → preprocess → macro expansion → step split
├── ir/            # Intermediate representation (engine-agnostic)
├── transpilers/   # IR builders per SAS construct (deterministic)
├── emitters/      # IR → PySpark / Spark SQL / DLT / Workflows / Validate / Bundle
├── llm/           # Model selection + orchestrator + pluggable providers
├── macros.py      # %MACRO body (+ %IF/%DO control flow) → parameterized Python function
├── mcp/           # MCP server exposing the core as tools
├── project.py     # Project migration: flat or deployable-bundle layout
├── report_index.py# Project-level report index (Markdown + HTML)
├── pipeline.py    # End-to-end orchestration
└── cli.py         # `s2db` command-line interface

See ARCHITECTURE.md for the full design and ROADMAP.md for what's planned.

Status

v0.3.0 - real and growing. Deterministic transpilers (with tests) cover PROC SQL, macro variables, %MACRO definitions/invocations with %IF/%DO/%ELSE control flow and iterative %DO loops, PROC MEANS/FORMAT/REPORT, the DATA step (BY-group, RETAIN, LAG/DIF, FIRST./LAST., MERGE, arrays + iterative DO loops), descriptive stats (CORR/UNIVARIATE), and MLlib scaffolds for REG/LOGISTIC/GLM. Targets include PySpark, Spark SQL, DLT (with expectations), Workflows, a data-parity validate notebook, and a Databricks Asset Bundle (databricks.yml). s2db migrate --bundle assembles a deployable bundle (notebooks + databricks.yml + a roll-up project report index). Real LLM providers (Anthropic, Azure OpenAI) plug in behind LLMProvider, results render to an HTML report, and CI runs ruff + mypy + pytest on Python 3.10–3.12. Contributions welcome - see CONTRIBUTING.md.

License

MIT © contributors. SAS and all related marks are trademarks of SAS Institute Inc. This project is independent and not affiliated with or endorsed by SAS Institute Inc.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sas2databricks-0.3.1.tar.gz (59.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sas2databricks-0.3.1-py3-none-any.whl (59.2 kB view details)

Uploaded Python 3

File details

Details for the file sas2databricks-0.3.1.tar.gz.

File metadata

  • Download URL: sas2databricks-0.3.1.tar.gz
  • Upload date:
  • Size: 59.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sas2databricks-0.3.1.tar.gz
Algorithm Hash digest
SHA256 f224080221a630f5898750046381b8665de2f4f799fa33d10dfa0a434bc99557
MD5 da8654511855c89d637990526fb324af
BLAKE2b-256 4d180c8cd3e52df998ccdd7f55ea606210319181d2b27816ff4330a542509312

See more details on using hashes here.

Provenance

The following attestation bundles were made for sas2databricks-0.3.1.tar.gz:

Publisher: publish.yml on navintkr/sas2databricks

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sas2databricks-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: sas2databricks-0.3.1-py3-none-any.whl
  • Upload date:
  • Size: 59.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sas2databricks-0.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 2c240fe50ced3c1c10fdff7bd9a938955a83eacf0ce2fcca02e6ab2939f0b8d2
MD5 e31a6fad133c994c63fe69d4ef42d938
BLAKE2b-256 b80461eb5df7b36c36cfceca300364c886b7798a549effaa1f3325aac02ab593

See more details on using hashes here.

Provenance

The following attestation bundles were made for sas2databricks-0.3.1-py3-none-any.whl:

Publisher: publish.yml on navintkr/sas2databricks

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page