
LLM-assisted migration of SAS analytics, transformations, and reports to Databricks (PySpark, Spark SQL, DLT, Workflows).

sas2databricks 🦶

sas2databricks - track down your SAS and set it free in the Databricks lakehouse.

An open-source, LLM-assisted migration toolkit that converts SAS analytics, data transformations, and reports into Databricks (PySpark, Spark SQL, Delta Live Tables, and Workflows) - end to end.

Deterministic transpilers handle the patterns we understand. A GitHub Copilot-powered LLM layer (Claude Opus 4.8 by default, with Codex or Auto as alternatives) fills the gaps, resolves ambiguity, and explains every conversion.

PyPI · License: MIT · Python 3.10+ · Status: beta


Why this exists

Migrating SAS to Databricks is hard because SAS is not one language - it is a family of sub-languages (DATA step, PROC SQL, the macro facility, dozens of PROCs, formats/informats). Pure rules-based converters break on real-world code; pure LLM converters hallucinate and are unverifiable. sas2databricks combines both:

  1. A deterministic core parses SAS into an intermediate representation (IR) and transpiles every pattern it recognizes - fast, free, and 100% reproducible.
  2. An LLM orchestrator is invoked only for the residue (unknown PROCs, gnarly macros, business logic) with the model you choose, and its output is validated against the IR.
  3. Every line of generated code carries provenance (which SAS line it came from and whether it was rule-based or LLM-based) so reviewers can trust the result.
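
The three steps above can be sketched in a few lines of plain Python. This is only an illustration of the rules-first, LLM-for-the-residue pattern with provenance attached; `Emitted`, `rule_let`, `stub_llm`, and `convert` are hypothetical names and not the package's actual API, and the IR validation step is left out.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Emitted:
    code: str      # generated target code
    sas_line: int  # SAS source line the code came from
    origin: str    # "rule" or "llm"


# Hypothetical deterministic rule: %LET name = value;  ->  Python assignment.
_LET = re.compile(r"%let\s+(\w+)\s*=\s*(.*?)\s*;", re.IGNORECASE)


def rule_let(stmt: str) -> Optional[str]:
    m = _LET.fullmatch(stmt.strip())
    if m is None:
        return None  # rule does not recognize this statement
    name, value = m.groups()
    return f"{name} = {value!r}"


def stub_llm(stmt: str) -> str:
    # Stand-in for a real model call; it only marks the statement for review.
    return f"# TODO(llm): {stmt}"


def convert(
    statements: list[tuple[int, str]],
    rules: list[Callable[[str], Optional[str]]],
    llm_fallback: Callable[[str], str],
) -> list[Emitted]:
    out = []
    for line_no, stmt in statements:
        for rule in rules:
            code = rule(stmt)
            if code is not None:
                out.append(Emitted(code, line_no, "rule"))
                break
        else:
            # No rule matched: this statement is the "residue" sent to the LLM.
            out.append(Emitted(llm_fallback(stmt), line_no, "llm"))
    return out


results = convert(
    [(1, "%let year = 2024;"), (2, "proc foo data=x; run;")],
    [rule_let],
    stub_llm,
)
for item in results:
    print(item.sas_line, item.origin, item.code)
```

Because each emitted fragment records its source line and origin, a reviewer can filter for `origin == "llm"` and focus attention there.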

What it covers

| SAS capability | Target | Engine |
|---|---|---|
| PROC SQL | Spark SQL / PySpark | Deterministic (sqlglot) |
| DATA step (BY-group, RETAIN, arrays + DO loops, LAG, FIRST./LAST.) | PySpark | Deterministic + LLM |
| Macro facility (%MACRO, %LET, %IF/%DO/%ELSE, iterative %DO, macro vars) | Python/Jinja params | Deterministic |
| PROC MEANS / SUMMARY / FREQ / TABULATE (measures & aggregations) | PySpark / Spark SQL | Deterministic |
| PROC FORMAT (formats/informats) | PySpark UDF / mapping tables | Deterministic |
| PROC REPORT / PRINT | Databricks notebook viz / SQL | Deterministic + LLM |
| Statistical PROCs (REG, LOGISTIC, GLM, GENMOD) | Spark MLlib scaffold | Deterministic (review) |
| Descriptive PROCs (CORR, UNIVARIATE) | Spark stats helpers | Deterministic |
| Data-parity validation | validate notebook (row/schema/checksum diff) | Deterministic |
| Deployment packaging | Databricks Asset Bundle (databricks.yml) + Workflows job graph | Deterministic |
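
To make the "row/schema/checksum diff" idea in the data-parity row concrete, here is a minimal sketch in plain Python. It is not the generated validate notebook (which targets Spark); `table_fingerprint` and `parity_diff` are hypothetical helpers, and the tables are assumed to be small lists of dicts.

```python
import hashlib


def table_fingerprint(rows: list[dict]) -> dict:
    """Summarize a table as row count, schema, and an order-insensitive checksum."""
    # Schema: every (column, Python type name) pair seen in the data.
    schema = sorted({(k, type(v).__name__) for row in rows for k, v in row.items()})
    checksum = 0
    for row in rows:
        canon = repr(sorted(row.items())).encode("utf-8")
        row_hash = int.from_bytes(hashlib.sha256(canon).digest()[:8], "big")
        # Summing modulo 2**64 ignores row order but still counts duplicates.
        checksum = (checksum + row_hash) % 2**64
    return {"rows": len(rows), "schema": schema, "checksum": checksum}


def parity_diff(sas_rows: list[dict], spark_rows: list[dict]) -> dict:
    """Return only the fingerprint fields that differ, as (sas, spark) pairs."""
    a, b = table_fingerprint(sas_rows), table_fingerprint(spark_rows)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


sas_out = [{"id": 1, "amt": 10.0}, {"id": 2, "amt": 5.5}]
spark_out = [{"id": 2, "amt": 5.5}, {"id": 1, "amt": 10.0}]  # same rows, other order
print(parity_diff(sas_out, spark_out))  # empty dict means the tables agree
```

An empty result means the row count, schema, and checksum all match; any key that does appear names the check that failed.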

Three ways to use it

flowchart LR
    SAS["SAS project (*.sas)"] --> PROJ

    subgraph PROJ["Project orchestration (project.py)"]
        PLAN["Plan files →<br/>flat or --bundle layout"]
    end

    PLAN -->|per .sas file| CORE
    subgraph CORE["sas2databricks core (Python)"]
        P[Parser] --> IR[(IR)]
        IR --> T[Transpilers]
        T -->|low confidence| L[LLM Orchestrator]
        L -->|model: opus/codex/auto| T
        T --> E[Emitters]
    end
    E --> OUT["PySpark / Spark SQL / DLT / Workflows / Validate / Bundle"]
    OUT --> ASM["databricks.yml + src/ notebooks<br/>+ project report index (report_index.py)"]

    CLI["CLI: s2db migrate"] --> PROJ
    MCP["MCP server (tools for Copilot)"] --> PROJ
    AGENT["VS Code Copilot agent + skill"] --> MCP
  1. CLI - s2db migrate ./sas_project migrates everything in one command (PySpark + Spark SQL + DLT). Narrow with --target pyspark or add --bundle for batch jobs.
  2. MCP server - exposes parse_sas, convert_sas, validate_conversion, explain_conversion, migrate_project as tools to any MCP client (incl. GitHub Copilot).
  3. VS Code Copilot agent + skill - the @sas-migrator agent orchestrates the migration interactively and lets you pick the model (Opus 4.8 default / Codex / Auto).

Quick start

# Install from PyPI
pip install sas2databricks
#   ...or include the MCP server extra:
pip install "sas2databricks[mcp]"

# Convert a single SAS program to PySpark
s2db convert examples/sample1_proc_sql.sas --target pyspark

# Migrate a whole SAS project - ONE command emits PySpark + Spark SQL + DLT side by side,
# with a combined report index linking each format. No flags needed.
s2db migrate ./examples --out ./out
#   narrow to one format:   s2db migrate ./examples --target pyspark
#   pick the model for gaps: s2db migrate ./examples --model opus-4.8

# Assemble deployable Databricks Asset Bundles (databricks.yml + src/ notebooks + reports/)
# With --target all you get one ready-to-deploy bundle per format under out/<target>/.
s2db migrate ./examples --bundle --html --out ./bundle
#   then:  cd bundle/pyspark && databricks bundle deploy -t dev

# Run the MCP server (stdio) so Copilot can call the tools
s2db mcp

From source (for development): git clone https://github.com/navintkr/sas2databricks then pip install -e ".[dev,mcp]". The examples/ SAS samples used above live in the repo.

Model selection

The LLM layer is provider-agnostic. Pick the model per run:

| Value | Meaning |
|---|---|
| opus-4.8 | Default. Best reasoning for complex macros & business logic. |
| codex | Fast, code-focused conversions. |
| auto | Router: deterministic first; escalates only low-confidence nodes, and picks Opus for macros/business logic, Codex for mechanical rewrites. |
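
The auto routing rule can be read as a small decision function. The sketch below is an assumption-laden illustration, not the package's router: `Node`, `route`, the `REASONING_KINDS` set, and the 0.8 threshold are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    kind: str          # e.g. "proc_sql", "macro", "business_logic" (hypothetical labels)
    confidence: float  # assumed deterministic-transpiler confidence in [0, 1]


# Assumption: these node kinds benefit from the stronger reasoning model.
REASONING_KINDS = {"macro", "business_logic"}


def route(node: Node, threshold: float = 0.8) -> str:
    """Deterministic first; escalate only low-confidence nodes to a model."""
    if node.confidence >= threshold:
        return "deterministic"
    return "opus-4.8" if node.kind in REASONING_KINDS else "codex"


for node in (Node("proc_sql", 0.95), Node("macro", 0.4), Node("proc_sql", 0.4)):
    print(node.kind, node.confidence, "->", route(node))
```

Keeping the rule this small makes the escalation policy easy to audit: no model is called for a node the deterministic path is confident about.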

In the VS Code Copilot agent the model is selected via the agent's model picker; in the CLI and MCP server it is the --model flag / model argument. See docs/model-selection.md.

Project layout

src/sas2databricks/
├── parser/        # SAS → preprocess → macro expansion → step split
├── ir/            # Intermediate representation (engine-agnostic)
├── transpilers/   # IR builders per SAS construct (deterministic)
├── emitters/      # IR → PySpark / Spark SQL / DLT / Workflows / Validate / Bundle
├── llm/           # Model selection + orchestrator + pluggable providers
├── macros.py      # %MACRO body (+ %IF/%DO control flow) → parameterized Python function
├── mcp/           # MCP server exposing the core as tools
├── project.py     # Project migration: flat or deployable-bundle layout
├── report_index.py  # Project-level report index (Markdown + HTML)
├── pipeline.py    # End-to-end orchestration
└── cli.py         # `s2db` command-line interface
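
As a rough picture of the "step split" stage listed under parser/, the following sketch cuts SAS source into DATA/PROC steps. It is a simplification written for this README, not the parser's real logic: `split_steps` is a hypothetical name, and the code ignores semicolons inside comments or string literals as well as macro expansion.

```python
import re

_STEP_START = re.compile(r"^\s*(data|proc)\b", re.IGNORECASE)
_STEP_END = re.compile(r"^\s*(run|quit)\s*$", re.IGNORECASE)


def split_steps(source: str) -> list[str]:
    """Group semicolon-terminated statements into DATA/PROC steps."""
    steps: list[list[str]] = []
    current: list[str] = []
    for stmt in source.split(";"):
        if not stmt.strip():
            continue
        # A new DATA/PROC statement implicitly closes an unterminated step.
        if _STEP_START.match(stmt) and current:
            steps.append(current)
            current = []
        current.append(stmt.strip())
        # RUN; or QUIT; explicitly closes the current step.
        if _STEP_END.match(stmt):
            steps.append(current)
            current = []
    if current:
        steps.append(current)
    return ["; ".join(step) + ";" for step in steps]


sample = """
proc sql;
  create table a as select * from b;
quit;
data c;
  set a;
run;
"""
for step in split_steps(sample):
    print(step)
```

Each returned string is one step that a per-construct transpiler could then handle on its own.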

See ARCHITECTURE.md for the full design and ROADMAP.md for what's planned.

Status

v0.3.0 - real and growing. Deterministic transpilers (with tests) cover PROC SQL, macro variables, %MACRO definitions/invocations with %IF/%DO/%ELSE control flow and iterative %DO loops, PROC MEANS/FORMAT/REPORT, the DATA step (BY-group, RETAIN, LAG/DIF, FIRST./LAST., MERGE, arrays + iterative DO loops), descriptive stats (CORR/UNIVARIATE), and MLlib scaffolds for REG/LOGISTIC/GLM.

Targets include PySpark, Spark SQL, DLT (with expectations), Workflows, a data-parity validate notebook, and a Databricks Asset Bundle (databricks.yml). s2db migrate --bundle assembles a deployable bundle (notebooks + databricks.yml + a roll-up project report index).

Real LLM providers (Anthropic, Azure OpenAI) plug in behind LLMProvider, results render to an HTML report, and CI runs ruff + mypy + pytest on Python 3.10–3.12. Contributions welcome - see CONTRIBUTING.md.

License

MIT © contributors. SAS and all related marks are trademarks of SAS Institute Inc. This project is independent and not affiliated with or endorsed by SAS Institute Inc.
