LLM-assisted migration of SAS analytics, transformations, and reports to Databricks (PySpark, Spark SQL, DLT, Workflows).

Maintainers

navintkr

Project description

sas2databricks 🦶

`sas2databricks` - track down your SAS and set it free in the Databricks lakehouse.

An open-source, LLM-assisted migration toolkit that converts SAS analytics, data transformations, and reports into Databricks (PySpark, Spark SQL, Delta Live Tables, and Workflows) - end to end.

Deterministic transpilers handle the patterns we understand. A GitHub Copilot-powered LLM layer (Claude Opus 4.8 by default, or Codex, or Auto) fills the gaps, resolves ambiguity, and explains every conversion.

Why this exists

Migrating SAS to Databricks is hard because SAS is not one language - it is a family of sub-languages (DATA step, PROC SQL, the macro facility, dozens of PROCs, formats/informats). Pure rules-based converters break on real-world code; pure LLM converters hallucinate and are unverifiable. sas2databricks combines both:

- A deterministic core parses SAS into an intermediate representation (IR) and transpiles every pattern it recognizes - fast, free, and 100% reproducible.
- An LLM orchestrator is invoked only for the residue (unknown PROCs, gnarly macros, business logic) with the model you choose, and its output is validated against the IR.
- Every line of generated code carries provenance (which SAS line it came from and whether it was rule-based or LLM-based) so reviewers can trust the result.
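The split described above can be sketched in a few lines. This is a minimal illustration, not the package's actual code: `EmittedLine`, `rule_transpile`, and `convert` are hypothetical names, the single `%LET` rule stands in for the whole deterministic core, and the IR validation step is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class EmittedLine:
    """One generated line plus its provenance (hypothetical shape)."""
    code: str
    sas_line: int  # 1-based line number in the SAS source
    origin: str    # "rule" or "llm"


def rule_transpile(sas_stmt: str) -> Optional[str]:
    """Hypothetical deterministic rule: handles `%let name = value;` only.

    Returns None for anything it does not recognize.
    """
    stmt = sas_stmt.strip().rstrip(";")
    if stmt.lower().startswith("%let "):
        name, _, value = stmt[5:].partition("=")
        return f"{name.strip()} = {value.strip()!r}"
    return None


def convert(sas_lines: list[str], llm: Callable[[str], str]) -> list[EmittedLine]:
    """Try the deterministic rule first; call the LLM only for the residue."""
    out = []
    for lineno, stmt in enumerate(sas_lines, start=1):
        code = rule_transpile(stmt)
        origin = "rule"
        if code is None:
            code = llm(stmt)
            origin = "llm"
        out.append(EmittedLine(code, lineno, origin))
    return out
```

Passing a stub such as `lambda s: "# TODO review: " + s` as `llm` makes the fallback path visible without calling any model; every returned line still records its SAS line number and origin.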

What it covers

| SAS capability | Target | Engine |
| --- | --- | --- |
| `PROC SQL` | Spark SQL / PySpark | Deterministic (sqlglot) |
| `DATA` step (BY-group, `RETAIN`, arrays + `DO` loops, `LAG`, `FIRST.`/`LAST.`) | PySpark | Deterministic + LLM |
| Macro facility (`%MACRO`, `%LET`, `%IF/%DO/%ELSE`, iterative `%DO`, macro vars) | Python/Jinja params | Deterministic |
| `PROC MEANS` / `SUMMARY` / `FREQ` / `TABULATE` (measures & aggregations) | PySpark / Spark SQL | Deterministic |
| `PROC FORMAT` (formats/informats) | PySpark UDF / mapping tables | Deterministic |
| `PROC REPORT` / `PRINT` | Databricks notebook viz / SQL | Deterministic + LLM |
| Statistical PROCs (`REG`, `LOGISTIC`, `GLM`, `GENMOD`) | Spark MLlib scaffold | Deterministic (review) |
| Descriptive PROCs (`CORR`, `UNIVARIATE`) | Spark stats helpers | Deterministic |
| Data-parity validation | `validate` notebook (row/schema/checksum diff) | Deterministic |
| Deployment packaging | Databricks Asset Bundle (`databricks.yml`) + Workflows job graph | Deterministic |
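To make the "row/schema/checksum diff" in the data-parity row concrete, here is a minimal pure-Python sketch of the three checks. It is an assumption about what such a comparison could look like, not the contents of the generated `validate` notebook; `schema_of`, `checksum`, and `parity_report` are hypothetical names, and rows are plain dicts rather than Spark DataFrames.

```python
import hashlib
from typing import Any

Row = dict[str, Any]


def schema_of(rows: list[Row]) -> dict[str, str]:
    """Column name -> Python type name, taken from the first row."""
    return {k: type(v).__name__ for k, v in rows[0].items()} if rows else {}


def checksum(rows: list[Row]) -> str:
    """Order-independent digest: hash each row, then hash the sorted row hashes."""
    row_hashes = sorted(
        hashlib.sha256(repr(sorted(r.items())).encode()).hexdigest() for r in rows
    )
    return hashlib.sha256("".join(row_hashes).encode()).hexdigest()


def parity_report(expected: list[Row], actual: list[Row]) -> dict[str, bool]:
    """Compare SAS output (expected) with converted output (actual)."""
    return {
        "row_count": len(expected) == len(actual),
        "schema": schema_of(expected) == schema_of(actual),
        "checksum": checksum(expected) == checksum(actual),
    }
```

Sorting the per-row hashes before combining them keeps the checksum stable when the two engines return rows in a different order, which is common when comparing SAS output with a distributed Spark job.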

Three ways to use it

flowchart LR
    SAS["SAS project (*.sas)"] --> PROJ

    subgraph PROJ["Project orchestration (project.py)"]
        PLAN["Plan files →<br/>flat or --bundle layout"]
    end

    PLAN -->|per .sas file| CORE
    subgraph CORE["sas2databricks core (Python)"]
        P[Parser] --> IR[(IR)]
        IR --> T[Transpilers]
        T -->|low confidence| L[LLM Orchestrator]
        L -->|model: opus/codex/auto| T
        T --> E[Emitters]
    end
    E --> OUT["PySpark / Spark SQL / DLT / Workflows / Validate / Bundle"]
    OUT --> ASM["databricks.yml + src/ notebooks<br/>+ project report index (report_index.py)"]

    CLI["CLI: s2db migrate"] --> PROJ
    MCP["MCP server (tools for Copilot)"] --> PROJ
    AGENT["VS Code Copilot agent + skill"] --> MCP

- CLI - `s2db migrate ./sas_project` migrates everything in one command (PySpark + Spark SQL + DLT). Narrow with `--target pyspark` or add `--bundle` for batch jobs.
- MCP server - exposes `parse_sas`, `convert_sas`, `validate_conversion`, `explain_conversion`, and `migrate_project` as tools to any MCP client (incl. GitHub Copilot).
- VS Code Copilot agent + skill - the `@sas-migrator` agent orchestrates the migration interactively and lets you pick the model (Opus 4.8 default / Codex / Auto).

Quick start

# Install from PyPI
pip install sas2databricks
#   ...or include the MCP server extra:
pip install "sas2databricks[mcp]"

# Convert a single SAS program to PySpark
s2db convert examples/sample1_proc_sql.sas --target pyspark

# Migrate a whole SAS project - ONE command emits PySpark + Spark SQL + DLT side by side,
# with a combined report index linking each format. No flags needed.
s2db migrate ./examples --out ./out
#   narrow to one format:   s2db migrate ./examples --target pyspark
#   pick the model for gaps: s2db migrate ./examples --model opus-4.8

# Assemble deployable Databricks Asset Bundles (databricks.yml + src/ notebooks + reports/)
# With --target all you get one ready-to-deploy bundle per format under out/<target>/.
s2db migrate ./examples --bundle --html --out ./bundle
#   then:  cd bundle/pyspark && databricks bundle deploy -t dev

# Run the MCP server (stdio) so Copilot can call the tools
s2db mcp

From source (for development): run `git clone https://github.com/navintkr/sas2databricks`, change into the cloned directory, then run `pip install -e ".[dev,mcp]"`. The `examples/` SAS samples used above live in the repo.
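When the CLI is driven from a script or CI job, it can help to build the argument list in one place. The helper below is a hypothetical sketch: `migrate_command` is not part of the package, and it only assembles the flags shown in the commands above (`--out`, `--target`, `--model`, `--bundle`, `--html`); whether every combination is accepted is an assumption to check against `s2db migrate --help`.

```python
from typing import Optional


def migrate_command(
    project_dir: str,
    out_dir: str,
    target: Optional[str] = None,  # e.g. "pyspark"; omit to emit every format
    model: Optional[str] = None,   # e.g. "opus-4.8", "codex", "auto"
    bundle: bool = False,
    html: bool = False,
) -> list[str]:
    """Build the argv for `s2db migrate` from the flags documented above."""
    argv = ["s2db", "migrate", project_dir, "--out", out_dir]
    if target:
        argv += ["--target", target]
    if model:
        argv += ["--model", model]
    if bundle:
        argv.append("--bundle")
    if html:
        argv.append("--html")
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)`, which avoids shell quoting issues with paths that contain spaces.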

Model selection

The LLM layer is provider-agnostic. Pick the model per run:

| Value | Meaning |
| --- | --- |
| `opus-4.8` | Default. Best reasoning for complex macros & business logic. |
| `codex` | Fast, code-focused conversions. |
| `auto` | Router: deterministic first; escalates only low-confidence nodes, and picks Opus for macros/business logic, Codex for mechanical rewrites. |

In the VS Code Copilot agent the model is selected via the agent's model picker; in the CLI and MCP server it is the `--model` flag / `model` argument. See `docs/model-selection.md`.
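The `auto` routing rule can be read as a small decision function. The sketch below is an interpretation of the table, not the router's real code: `route`, the node-kind labels, and the `0.8` confidence cut-off are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical cut-off: nodes at or above it keep their deterministic output.
CONFIDENCE_THRESHOLD = 0.8

# Hypothetical labels for nodes that need reasoning rather than rewriting.
REASONING_KINDS = {"macro", "business_logic"}


def route(node_kind: str, confidence: float) -> Optional[str]:
    """Return None to keep the deterministic result, else the model to call."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return None  # deterministic first: no LLM call at all
    if node_kind in REASONING_KINDS:
        return "opus-4.8"  # macros / business logic
    return "codex"         # mechanical rewrites
```

Returning `None` for confident nodes keeps the common path free of LLM calls, which matches the "deterministic first" behavior described for `auto`.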

Project layout

src/sas2databricks/
├── parser/        # SAS → preprocess → macro expansion → step split
├── ir/            # Intermediate representation (engine-agnostic)
├── transpilers/   # IR builders per SAS construct (deterministic)
├── emitters/      # IR → PySpark / Spark SQL / DLT / Workflows / Validate / Bundle
├── llm/           # Model selection + orchestrator + pluggable providers
├── macros.py      # %MACRO body (+ %IF/%DO control flow) → parameterized Python function
├── mcp/           # MCP server exposing the core as tools
├── project.py     # Project migration: flat or deployable-bundle layout
├── report_index.py# Project-level report index (Markdown + HTML)
├── pipeline.py    # End-to-end orchestration
└── cli.py         # `s2db` command-line interface

See ARCHITECTURE.md for the full design and ROADMAP.md for what's planned.
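As an illustration of what "`%MACRO` body → parameterized Python function" can mean, here is a hand-written example. The SAS macro in the comment and the `load_months` function are invented for this sketch; the code that `macros.py` actually emits may look different.

```python
# Hypothetical SAS input:
#
#   %macro load_months(year=, n=3);
#     %do m = 1 %to &n;
#       %if &m <= 9 %then %let mm = 0&m;
#       %else %let mm = &m;
#       /* ... reads sales_&year._&mm ... */
#     %end;
#   %mend;


def load_months(year: int, n: int = 3) -> list[str]:
    """Hand-written Python equivalent: returns the table names the macro touches."""
    tables = []
    for m in range(1, n + 1):               # %DO m = 1 %TO &n (upper bound inclusive)
        mm = f"0{m}" if m <= 9 else str(m)  # %IF ... %THEN ... %ELSE
        tables.append(f"sales_{year}_{mm}")  # &year. ends the reference before "_"
    return tables
```

Keyword parameters with defaults (`n=3`) map directly onto Python keyword arguments, and the inclusive `%DO ... %TO` bound becomes `range(1, n + 1)`.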

Status

v0.3.0 - real and growing.

Deterministic transpilers (with tests) cover `PROC SQL`, macro variables, `%MACRO` definitions/invocations with `%IF/%DO/%ELSE` control flow and iterative `%DO` loops, `PROC MEANS`/`FORMAT`/`REPORT`, the `DATA` step (BY-group, `RETAIN`, `LAG`/`DIF`, `FIRST.`/`LAST.`, `MERGE`, arrays + iterative `DO` loops), descriptive stats (`CORR`/`UNIVARIATE`), and MLlib scaffolds for `REG`/`LOGISTIC`/`GLM`.

Targets include PySpark, Spark SQL, DLT (with expectations), Workflows, a data-parity `validate` notebook, and a Databricks Asset Bundle (`databricks.yml`). `s2db migrate --bundle` assembles a deployable bundle (notebooks + `databricks.yml` + a roll-up project report index).

Real LLM providers (Anthropic, Azure OpenAI) plug in behind `LLMProvider`, results render to an HTML report, and CI runs ruff + mypy + pytest on Python 3.10–3.12. Contributions welcome - see CONTRIBUTING.md.
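For reviewers checking DATA step conversions, it helps to have a small reference for what `FIRST.`/`LAST.` mean in BY-group processing. The function below is a pure-Python sketch of those semantics, written for this README; `flag_first_last` is a hypothetical helper and is not the PySpark code the tool generates.

```python
from itertools import groupby
from operator import itemgetter
from typing import Any


def flag_first_last(rows: list[dict[str, Any]], by: str) -> list[dict[str, Any]]:
    """Add `first`/`last` flags per BY group.

    Rows must already be sorted by `by`, as SAS expects for BY-group processing.
    """
    out = []
    for _, group in groupby(rows, key=itemgetter(by)):
        members = list(group)
        for i, row in enumerate(members):
            out.append({**row, "first": i == 0, "last": i == len(members) - 1})
    return out
```

A group with a single row gets both flags set, which is an easy case to miss when the same logic is rewritten with window functions.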

License

MIT © contributors. SAS and all related marks are trademarks of SAS Institute Inc. This project is independent and not affiliated with or endorsed by SAS Institute Inc.

Release history

This version

0.4.0

Jun 17, 2026

0.3.1

Jun 17, 2026

Download files

Source Distribution

sas2databricks-0.4.0.tar.gz (61.3 kB)

Uploaded Jun 17, 2026 Source

Built Distribution

sas2databricks-0.4.0-py3-none-any.whl (61.0 kB)

Uploaded Jun 17, 2026 Python 3

File details

Details for the file sas2databricks-0.4.0.tar.gz.

File metadata

Download URL: sas2databricks-0.4.0.tar.gz
Upload date: Jun 17, 2026
Size: 61.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sas2databricks-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`967804fbcdef23623e923c31719bd8e4df08f5fd5fde4646b759e99d9c7026f1`
MD5	`8cf90d8268ccfbb78dc82b5596a984ea`
BLAKE2b-256	`ab16829123a5211f6b8bd6d30ae33434ccff0ec2d7b4aa4eebdf725fc3f146ce`
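
A downloaded file can be checked against the SHA256 digest listed above with the standard library alone. This is a generic sketch, not something shipped with the package; `sha256_of` and `matches` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """True when the file's SHA256 equals the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches(Path("sas2databricks-0.4.0.tar.gz"), "967804fbcdef23623e923c31719bd8e4df08f5fd5fde4646b759e99d9c7026f1")` should return `True` for an unmodified copy of the source distribution.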

Provenance

The following attestation bundles were made for sas2databricks-0.4.0.tar.gz:

Publisher: publish.yml on navintkr/sas2databricks

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sas2databricks-0.4.0.tar.gz
- Subject digest: 967804fbcdef23623e923c31719bd8e4df08f5fd5fde4646b759e99d9c7026f1
- Sigstore transparency entry: 1852902145
- Sigstore integration time: Jun 17, 2026
Source repository:
- Permalink: navintkr/sas2databricks@5be9cb44805bcdc3f3d3570ea0ba1b1c48ea1d00
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/navintkr
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5be9cb44805bcdc3f3d3570ea0ba1b1c48ea1d00
- Trigger Event: release

File details

Details for the file sas2databricks-0.4.0-py3-none-any.whl.

File metadata

Download URL: sas2databricks-0.4.0-py3-none-any.whl
Upload date: Jun 17, 2026
Size: 61.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sas2databricks-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`472f367872367633d452faad4c11cf3f1d2d9f7bc28c27eb5d75abee4c2ee8c0`
MD5	`d50d8f8dc7b9f41f59bbb82087c0c91b`
BLAKE2b-256	`dd983220e7193b1740a452398028de9241b719f986db9057133e4b9c0812801f`

Provenance

The following attestation bundles were made for sas2databricks-0.4.0-py3-none-any.whl:

Publisher: publish.yml on navintkr/sas2databricks

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sas2databricks-0.4.0-py3-none-any.whl
- Subject digest: 472f367872367633d452faad4c11cf3f1d2d9f7bc28c27eb5d75abee4c2ee8c0
- Sigstore transparency entry: 1852902384
- Sigstore integration time: Jun 17, 2026
Source repository:
- Permalink: navintkr/sas2databricks@5be9cb44805bcdc3f3d3570ea0ba1b1c48ea1d00
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/navintkr
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5be9cb44805bcdc3f3d3570ea0ba1b1c48ea1d00
- Trigger Event: release
