SQL dialect to Feldera SQL translator agent
Project description
Felderize — SQL to Feldera SQL Translator
felderize translates SQL from various dialects into valid Feldera SQL using LLM-based translation with optional compiler validation.
Dialects: Spark SQL is currently the only supported dialect. Support for additional dialects is planned.
Setup
cd python/felderize
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
Note:
pip install -e .is required before runningfelderize. It registers the package and CLI command.
The Feldera SQL compiler JAR (used only for --validate; requires Java 19–21 installed):
felderize downloads it for you. The first time you run a command with --validate
and no compiler configured, felderize fetches the latest
sql2dbsp-jar-with-dependencies-*.jar from
GitHub Releases into ~/.felderize/
and reuses it on later runs. To opt out (e.g. in CI or offline), set
FELDERIZE_AUTO_DOWNLOAD=0; validation is then skipped unless you point
FELDERA_COMPILER / --compiler at a JAR.
To fetch or update it explicitly:
felderize download-compiler
This saves the latest JAR to ~/.felderize/ and prints its path. Re-run it any
time to pick up a newer release; it reports whether you are already on the latest
one. felderize automatically uses the newest JAR cached in ~/.felderize/, so you
do not need to set FELDERA_COMPILER unless you want a specific JAR.
Requirement: felderize needs compiler v0.304.0 or newer — earlier releases lack SQL features felderize relies on (e.g.
div_null,MAKE_DATE).download-compileralways fetches the latest release, and felderize warns at validation time if the configured compiler is older than v0.304.0.
Note: Java 19, 20, or 21 must be installed and on your
PATHbefore running validation. Check withjava -version. Later versions (22+) are not supported.
Create a .env file in the python/felderize/ directory:
ANTHROPIC_API_KEY=your-key-here
FELDERA_COMPILER=~/.felderize/sql2dbsp-jar-with-dependencies-vX.Y.Z.jar
FELDERIZE_MODEL=claude-sonnet-4-6
ANTHROPIC_API_KEY and FELDERIZE_MODEL are required. FELDERA_COMPILER is optional: it is used only for validation, and when unset felderize auto-downloads (and caches) the compiler on first --validate. Set it to pin a specific JAR. You can also pass --compiler PATH and --model MODEL per command.
Note: felderize currently requires an Anthropic API key — only Claude models are supported.
Tip — use a large-context-window model. Each request bundles the full rule set (
felderize/skills/spark_skills.md), validated examples, and (on the retry pass) Feldera reference documentation on top of your schema and query — easily tens of thousands of tokens, and the--validaterepair and docs passes add more. Prefer a model with a large context window (e.g. a recent Claude Sonnet/Opus) so nothing is truncated and the model has the full rules in view; set it viaFELDERIZE_MODELor--model. If a program still doesn't fit, felderize stops with anerrorasking you to shorten the input (translate fewer views at a time, drop unused tables, or split it into smaller files).
Usage
Run a built-in example
# List available examples
felderize spark example
# Translate an example (validates by default)
felderize spark example simple
# Without compiler validation
felderize spark example simple --no-validate
# Log SQL submitted to the validator at each attempt
felderize spark example json --verbose
# Use a specific compiler binary
felderize spark example simple --compiler /path/to/sql-to-dbsp
# Output as JSON
felderize spark example simple --json-output
Available examples:
| Name | Description |
|---|---|
simple |
Date truncation, GROUP BY |
strings |
INITCAP, LPAD, NVL, CONCAT_WS |
arrays |
array_contains, size, element_at |
joins |
Null-safe equality (<=>) |
windows |
LAG, running SUM OVER |
aggregations |
COUNT DISTINCT, AVG, SUM, HAVING |
json |
get_json_object → PARSE_JSON + VARIANT access (combined file) |
topk |
ROW_NUMBER TopK, QUALIFY, datediff (combined file) |
dates |
to_date → PARSE_DATE, date_format → FORMAT_DATE/EXTRACT (combined file) |
arithmetic |
pmod, NULLIF division, subtraction (combined file) |
The JSON output contains:
{
"feldera_schema": "...", // translated DDL (CREATE TABLE statements)
"feldera_query": "...", // translated query (CREATE VIEW statements)
"unsupported": [...], // Spark features with no Feldera equivalent
"warnings": [...], // non-fatal notes (compiler repairs, validation result)
"explanations": [...], // explanations for translation decisions
"status": "success|unsupported|error"
}
status:
success— translated, and (with--validate) compiled cleanly.unsupported— translated, but some constructs have no Feldera equivalent and were emitted asCAST(NULL AS <type>)placeholders (listed inunsupported).error— the LLM response couldn't be parsed, or (with--validate) the output still failed to compile after the repair attempts. felderize always returns the best-effort SQL infeldera_query(salvaging it from the raw response when the reply wasn't valid JSON), with the compiler errors inwarnings— it may not compile, but it always attempts a translation.
Translate your own SQL
Each form below writes the translated, deployable Feldera SQL to a file. The file leads with a comment header recording the translation status and any unsupported constructs / warnings, so it is self-documenting; the status is also printed to stderr.
Separate schema and query files:
felderize spark translate schema.sql query.sql --validate -o out.sql
# → out.sql (the translated CREATE TABLE + CREATE VIEW)
Single combined file (CREATE TABLE and CREATE VIEW statements in one file):
felderize spark translate-file combined.sql --validate -o out.sql
# → out.sql
Multiple query files against a shared schema (batch — faster than a shell loop):
felderize spark translate-batch schema.sql queries/*.sql --validate --output-dir out/
# → out/<query>_feldera.sql, one per query
Structured JSON instead of a .sql file (for automation — parse with jq):
felderize spark translate schema.sql query.sql --validate --json-output > result.json
# → result.json with feldera_schema, feldera_query, status, unsupported, warnings
translate-batch processes all queries in a single process so doc and example
caches stay warm across queries. Omitting -o / --output-dir / --json-output
prints the result as readable sections to the terminal instead.
Note: Running without
--validateprints a warning — the output SQL has not been verified against the Feldera compiler.
All commands accept:
--validateto validate output against the Feldera compiler (opt-in;examplevalidates by default, use--no-validateto skip)--compiler PATHto specify the path to the Feldera compiler binary (overridesFELDERA_COMPILERenv var)--model MODELto specify the LLM model (overridesFELDERIZE_MODELenv var)--no-docsto disable Feldera SQL reference docs in the prompt--verboseto log the SQL submitted to the validator at each repair attempt--json-outputto output results as JSON (the structured machine interface)-o, --output PATH(translate/translate-file) to write the translated schema + views to a deployable.sqlfile; the status prints to stderr so stdout/the file stay clean. (translate-batchuses--output-dirinstead.)
Programmatic API (Python)
To call felderize from your own code instead of shelling out to the CLI, use the
single entry point translate_spark_to_feldera:
from felderize import translate_spark_to_feldera, Config, Status
cfg = Config.from_env() # reads ANTHROPIC_API_KEY, FELDERA_COMPILER, FELDERIZE_MODEL
result = translate_spark_to_feldera(
schema_sql, # Spark CREATE TABLE ... DDL (str)
query_sql, # Spark CREATE VIEW / SELECT ... (str)
cfg,
validate=True, # compile against the Feldera compiler and repair (default: False)
)
if result.status is Status.SUCCESS:
deploy(result.feldera_schema, result.feldera_query)
else:
# UNSUPPORTED -> NULL-placeholder views (see result.unsupported);
# ERROR -> best-effort SQL that did not compile.
review(result.unsupported, result.warnings)
TranslationResult exposes feldera_schema, feldera_query, status,
unsupported, warnings, explanations, and to_dict(). validate=False
skips the compiler (faster, but the output is not verified).
Runnable examples:
.venv/bin/python examples/api_usage.py # translate one schema + query
.venv/bin/python examples/translate_all_examples.py # translate all built-in examples + summary
Configuration
Environment variables (set in .env):
| Variable | Description | Default |
|---|---|---|
ANTHROPIC_API_KEY |
Anthropic API key | (required) |
FELDERIZE_MODEL |
LLM model to use (can also be set with --model) |
(required, set in .env) |
FELDERA_COMPILER |
Path to sql-to-dbsp compiler (can also be set with --compiler) |
(optional; auto-downloaded when unset) |
FELDERIZE_AUTO_DOWNLOAD |
Auto-download the compiler on first --validate when none is configured. Set to 0/false to disable. |
1 |
ANTHROPIC_BASE_URL |
Override Anthropic API base URL (for proxies or alternate endpoints) | (optional) |
Customizing translation
You can teach felderize your project-specific patterns by adding rules and examples.
Rules
Rules tell the LLM how to rewrite specific Spark constructs. Each .md file should start with a YAML frontmatter block with a name: and description: field, followed by plain-markdown bullet points:
---
name: my-project-rules
description: Project-specific Spark-to-Feldera rewrites.
---
- **[PROJ-HASH] Internal UDF `my_hash(col)`:** Rewrite as `MD5(CAST(col AS VARCHAR))`.
- **[PROJ-ID] `CUSTOM_ID` columns:** Always map to `BIGINT NOT NULL` in Feldera.
Note: Frontmatter is recommended. Files without it are still loaded but produce a warning.
Place .md files in one of these locations — all are loaded automatically, no flag needed:
| Location | Scope |
|---|---|
~/.felderize/rules/ |
All your projects (survives pip upgrades) |
.felderize/rules/ in your project dir |
This project only (commit to git) |
Or pass one or more files explicitly (repeatable):
felderize spark translate schema.sql query.sql --rules rules1.md --rules rules2.md
Examples
Examples are validated Spark → Feldera pairs shown to the LLM alongside the built-in ones. The more precise your examples, the better the translation quality for your specific SQL patterns.
Each .md file must start with a YAML frontmatter block. Use categories: to load the example only when those SQL constructs are detected in the query being translated. Omit categories: (but keep the frontmatter) to always include the example:
---
categories: [datetime]
---
### Example: Monthly revenue
**Spark SQL:**
```sql
SELECT date_trunc('MONTH', ts) AS month, SUM(amount) AS revenue
FROM sales GROUP BY date_trunc('MONTH', ts);
\```
**Feldera SQL:**
```sql
SELECT FLOOR(ts TO MONTH) AS month, SUM(amount) AS revenue
FROM sales GROUP BY FLOOR(ts TO MONTH);
\```
Note: The frontmatter block (
---) is required. Files without it are skipped.
Valid categories: aggregates, string, datetime, array, json, map, types.
Place .md files in one of these locations — loaded automatically, no flag needed:
| Location | Scope |
|---|---|
~/.felderize/examples/ |
All your projects (survives pip upgrades) |
.felderize/examples/ in your project dir |
This project only (commit to git) |
Or pass individual files or directories explicitly (repeatable, accepts both):
felderize spark translate schema.sql query.sql --examples ex1.md --examples my_examples/
How it works
felderize translates the whole program (schema + all views) in a single LLM call:
- Loads translation rules from the skill file (
felderize/skills/spark_skills.md). - Trims the schema to the tables the query actually references, then sends the Spark schema + query to the LLM with the rules and validated examples.
- Parses the translated Feldera SQL from the response. Constructs with no Feldera
equivalent are emitted as
CAST(NULL AS <type>)placeholders and listed inunsupported. - With
--validate, compiles the output against the Feldera compiler and repairs it using the compiler's error feedback for up to a few attempts. If that first pass still doesn't compile, it retries once more with relevant Feldera documentation added to the prompt (fromdocs.feldera.com/docs/sql/); use--no-docsto skip the documentation pass.
Support
Contact us at support@feldera.com for assistance with unsupported Spark SQL features.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file felderize-0.314.0.tar.gz.
File metadata
- Download URL: felderize-0.314.0.tar.gz
- Upload date:
- Size: 62.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
15b0349bcb5342f1ff4e76814db49ef1d5a81aa0a44682a8199a22d3cd34d3ae
|
|
| MD5 |
7a51a0d4fc5dc50f8eaa1efcd08464ad
|
|
| BLAKE2b-256 |
abdce4c4f3c21c161f0ad42973ea45bf5ae0e841839c9dcf092bda1970691289
|
Provenance
The following attestation bundles were made for felderize-0.314.0.tar.gz:
Publisher:
ci-post-release.yml on feldera/feldera
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
felderize-0.314.0.tar.gz -
Subject digest:
15b0349bcb5342f1ff4e76814db49ef1d5a81aa0a44682a8199a22d3cd34d3ae - Sigstore transparency entry: 1942023728
- Sigstore integration time:
-
Permalink:
feldera/feldera@ad8a97714d83671cf5fa06462ef204339d39c906 -
Branch / Tag:
refs/tags/v0.314.0 - Owner: https://github.com/feldera
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
ci-post-release.yml@ad8a97714d83671cf5fa06462ef204339d39c906 -
Trigger Event:
release
-
Statement type:
File details
Details for the file felderize-0.314.0-py3-none-any.whl.
File metadata
- Download URL: felderize-0.314.0-py3-none-any.whl
- Upload date:
- Size: 67.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3af6ae3272de829d4e58b38aea4c121836ca296d9ad718142d2fb7360b6a0e1b
|
|
| MD5 |
fab5f4e79ffbe3ba9cb2b75aed6cfed5
|
|
| BLAKE2b-256 |
27304b52a7c2edc53f99e656916a825a9e1ad89329ca26d1c093407e977c5272
|
Provenance
The following attestation bundles were made for felderize-0.314.0-py3-none-any.whl:
Publisher:
ci-post-release.yml on feldera/feldera
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
felderize-0.314.0-py3-none-any.whl -
Subject digest:
3af6ae3272de829d4e58b38aea4c121836ca296d9ad718142d2fb7360b6a0e1b - Sigstore transparency entry: 1942023772
- Sigstore integration time:
-
Permalink:
feldera/feldera@ad8a97714d83671cf5fa06462ef204339d39c906 -
Branch / Tag:
refs/tags/v0.314.0 - Owner: https://github.com/feldera
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
ci-post-release.yml@ad8a97714d83671cf5fa06462ef204339d39c906 -
Trigger Event:
release
-
Statement type: