LLM-powered code generation and evaluation plugin for Flyte

Code Generation and Evaluation Plugin

Generate code from natural-language prompts and validate it by running tests in an isolated sandbox. Works with any model that supports structured outputs (GPT-4, Claude, Gemini, etc. via LiteLLM) or directly with the Claude Agent SDK (Claude only).

Note: Only Python is supported today.

Installation

pip install flyteplugins-codegen

# For Agent mode (Claude-only)
pip install "flyteplugins-codegen[agent]"  # quoted so shells such as zsh do not expand the brackets

Quick start

import flyte
from flyte.io import File
from flyte.sandbox import sandbox_environment
from flyteplugins.codegen import AutoCoderAgent

agent = AutoCoderAgent(model="gpt-4.1", name="summarize-sales", resources=flyte.Resources(cpu=1, memory="1Gi"))

env = flyte.TaskEnvironment(
    name="my-env",
    secrets=[flyte.Secret(key="openai_key", as_env_var="OPENAI_API_KEY")],
    image=flyte.Image.from_debian_base().with_pip_packages(
        "flyteplugins-codegen",
    ),
    depends_on=[sandbox_environment],  # Required
)

@env.task
async def process_data(csv_file: File) -> tuple[float, int, int]:
    result = await agent.generate.aio(
        prompt="Read the CSV and compute total_revenue, total_units and row_count.",
        samples={"sales": csv_file},
        outputs={"total_revenue": float, "total_units": int, "row_count": int},
    )
    return await result.run.aio()

Two approaches

1. LiteLLM (default)

Uses structured-output LLM calls to generate code, detect packages, build sandbox images, run tests, diagnose failures and iterate. Works with any model that supports structured outputs (GPT-4, Claude, Gemini, etc. via LiteLLM).

agent = AutoCoderAgent(
    name="my-task",
    model="gpt-4.1",           # Any LiteLLM-compatible model
    max_iterations=10,         # Generate-test-fix iterations
)

result = await agent.generate.aio(
    prompt="...",
    samples={"input": my_file},
    outputs={"result": str},
)

How it works:

prompt + samples
    |
    v
[generate_plan] --> CodePlan
    |
    v
[generate_code] --> CodeSolution (dependencies + code)
    |
    v
[detect_packages] --> pip/system packages
    |
    v
[build_image] --> Sandbox image with deps
    |
    +-- skip_tests=True? --> return result (no tests)
    |
    v
[generate_tests] --> pytest suite
    |
    v
[execute_tests] --> pass? return result
    |                  |
    |                  fail
    v                  |
[diagnose_error] --> logic/environment/test_error
    |
    +-- logic error ---------> regenerate code with patch instructions
    +-- environment error ---> add packages, rebuild image
    +-- test error ----------> fix test expectations
    |
    v
  (repeat up to max_iterations)
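
The loop in the diagram can be sketched in plain Python. This is only an illustration under stated assumptions: `LoopState`, `generate_test_fix`, and the stub callables below are hypothetical stand-ins and are not the plugin's internals. The sketch shows the control flow only: run tests, diagnose the failure type, apply the matching fix, and stop after `max_iterations`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class LoopState:
    """Hypothetical container for what the loop mutates between attempts."""
    code: str
    tests: str
    packages: List[str] = field(default_factory=list)
    attempts: int = 0
    success: bool = False


def generate_test_fix(
    state: LoopState,
    run_tests: Callable[[LoopState], Tuple[bool, str]],
    diagnose: Callable[[str], str],
    fixers: Dict[str, Callable[[LoopState, str], None]],
    max_iterations: int = 10,
) -> LoopState:
    """Run tests, diagnose failures, apply the matching fix, and repeat."""
    for attempt in range(1, max_iterations + 1):
        state.attempts = attempt
        passed, output = run_tests(state)
        if passed:
            state.success = True
            return state
        error_type = diagnose(output)  # "logic", "environment" or "test_error"
        fixers[error_type](state, output)
    return state  # budget exhausted; success stays False


# --- Stubs standing in for LLM calls and sandbox runs (assumptions) ---
def _run_tests(state: LoopState) -> Tuple[bool, str]:
    if "pandas" not in state.packages:
        return False, "ModuleNotFoundError: No module named 'pandas'"
    if "return a + b" not in state.code:
        return False, "AssertionError: add(1, 2) != 3"
    return True, "1 passed"


def _diagnose(output: str) -> str:
    return "environment" if "ModuleNotFoundError" in output else "logic"


def _fix_logic(state: LoopState, output: str) -> None:
    state.code = "def add(a, b):\n    return a + b\n"


def _fix_environment(state: LoopState, output: str) -> None:
    state.packages.append("pandas")


def _fix_tests(state: LoopState, output: str) -> None:
    pass  # not exercised by the stubs above


final = generate_test_fix(
    LoopState(code="def add(a, b):\n    return a - b\n", tests="..."),
    run_tests=_run_tests,
    diagnose=_diagnose,
    fixers={"logic": _fix_logic, "environment": _fix_environment, "test_error": _fix_tests},
)
```

With these stubs the first attempt fails on the missing package, the second on the logic bug, and the third passes.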

2. Agent

Uses the Claude Agent SDK to autonomously generate, test and fix code. The agent has access to Bash, Read, Write and Edit tools and iterates on its own. Test execution is intercepted and run in an isolated Sandbox.

agent = AutoCoderAgent(
    name="my-task",
    model="claude-sonnet-4-5-20250929",
    backend="claude",        # Requires ANTHROPIC_API_KEY as a Flyte secret
)

result = await agent.generate.aio(
    prompt="...",
    samples={"input": my_file},
    outputs={"result": str},
)

Key differences from LiteLLM:

Agent runs autonomously (no structured retry loop)
Requires ANTHROPIC_API_KEY as a Flyte secret
Claude-only (not model agnostic)
Traces agent tool calls, reasoning and test results in the Flyte UI
Test commands are intercepted via hooks and run in isolated sandbox environments

API reference

`AutoCoderAgent`

Create an agent instance with configuration, then call generate() per task.

agent = AutoCoderAgent(name="my-agent", model="gpt-4.1")

# Sync
result = agent.generate(prompt="...")

# Async
result = await agent.generate.aio(prompt="...")

Constructor parameters (agent-level config):

Parameter	Type	Default	Description
`name`	`str`	`"auto-coder"`	Unique name for tracking and image naming
`model`	`str`	`"gpt-4.1"`	LiteLLM model identifier
`system_prompt`	`str`	`None`	Custom system prompt override
`api_key`	`str`	`None`	Env var name for LLM API key
`api_base`	`str`	`None`	Custom API base URL
`litellm_params`	`dict`	`None`	Extra LiteLLM params (temperature, max_tokens, etc.)
`base_packages`	`list[str]`	`None`	Always-install pip packages
`resources`	`flyte.Resources`	`None`	Resources for sandbox execution (default: cpu=1, 1Gi)
`image_config`	`ImageConfig`	`None`	Registry, registry_secret, python_version
`max_iterations`	`int`	`10`	Max generate-test-fix iterations (LiteLLM mode)
`max_sample_rows`	`int`	`100`	Rows to sample from data for context
`skip_tests`	`bool`	`False`	Skip test generation and execution (LiteLLM mode only)
`sandbox_retries`	`int`	`0`	Flyte task-level retries for each sandbox execution
`timeout`	`int`	`None`	Timeout in seconds for sandboxes
`env_vars`	`dict[str, str]`	`None`	Environment variables to pass to sandboxes
`secrets`	`list`	`None`	`flyte.Secret` objects to make available to sandboxes
`cache`	`str`	`"auto"`	CacheRequest for sandboxes: `"auto"`, `"override"`, or `"disable"`
`backend`	`str`	`"litellm"`	Execution backend: `"litellm"` or `"claude"`
`agent_max_turns`	`int`	`50`	Max turns when `backend="claude"`
`block_network`	`bool`	`False`	Block all outbound network access in sandboxes when set to `True`.

generate() parameters (per-call):

Parameter	Type	Default	Description
`prompt`	`str`	required	Natural-language task description
`schema`	`str`	`None`	Free-form context about data formats, structures, or schemas. Included verbatim in the LLM prompt.
`constraints`	`list[str]`	`None`	Natural-language constraints (e.g., `"quantity must be positive"`)
`samples`	`dict[str, File | pd.DataFrame]`	`None`	Sample data. Sampled for LLM context, converted to File inputs for the sandbox. Used as defaults at runtime.
`inputs`	`dict[str, type]`	`None`	Non-sample CLI argument types (e.g., `{"threshold": float}`). Sample entries are auto-added as File inputs. Supported: `str, int, float, bool, File, Dir`.
`outputs`	`dict[str, type]`	`None`	Output types. Supported: `str, int, float, bool, datetime, timedelta, File, Dir`.

`CodeGenEvalResult`

Returned by agent.generate(). Key fields:

result.success        # bool — did tests pass?
result.solution       # CodeSolution — generated code
result.tests          # str — generated test code
result.output         # str — test output
result.exit_code      # int — test exit code
result.error          # str | None — error message if failed
result.attempts       # int — number of iterations used
result.image          # str — built sandbox image with all deps
result.detected_packages        # list[str] — pip packages detected
result.detected_system_packages # list[str] — apt packages detected
result.generated_schemas        # dict[str, str] | None — Pandera schemas as code
result.data_context             # str | None — extracted data context
result.original_samples         # dict[str, File] | None — sample data as Files

`result.as_task()`

Create a reusable sandbox from the generated code:

task = result.as_task(name="run-on-data")

# Call with your declared inputs — returns a tuple of outputs
total_revenue, total_units, transaction_count = task(sales_csv=my_file)

# If samples were provided, they are injected as defaults — override as needed
total_revenue, total_units, transaction_count = task(threshold=0.5)  # samples used for data inputs

# With sandbox options
task = result.as_task(
    name="run-on-data",
    retries=3,
    timeout=600,
    env_vars={"API_URL": "https://..."},
)

The task runs the generated script in the built sandbox image. Inputs are passed as --name value CLI arguments. Outputs are read from /var/outputs/{name} files.
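
To make that contract concrete, here is a hand-written sketch of a script that follows it: inputs arrive as `--name value` arguments and each output is written to a file named after it. The real generated code will differ. The input names (`--sales`, `--threshold`), the `revenue` column, and the `output_dir` parameter (added so the sketch can run outside a sandbox) are all assumptions made for illustration.

```python
import argparse
import csv
from pathlib import Path
from typing import List


def main(argv: List[str], output_dir: str = "/var/outputs") -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--sales", required=True)                # File input: path to a CSV
    parser.add_argument("--threshold", type=float, default=0.0)  # primitive input
    args = parser.parse_args(argv)

    # Keep only rows whose revenue meets the threshold.
    with open(args.sales, newline="") as f:
        rows = [r for r in csv.DictReader(f) if float(r["revenue"]) >= args.threshold]

    # One file per declared output, named after the output.
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "total_revenue").write_text(str(sum(float(r["revenue"]) for r in rows)))
    (out / "row_count").write_text(str(len(rows)))
```

Inside a sandbox such a script would be invoked as `main(sys.argv[1:])`, for example with `--sales /path/to/sales.csv --threshold 0.5`.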

`result.run()`

One-shot execution using sample data as defaults:

# Sync
total_revenue, total_units, transaction_count = result.run()

# Async
total_revenue, total_units, transaction_count = await result.run.aio()

# Override specific inputs
total_revenue, total_units, transaction_count = result.run(threshold=0.5)

Data handling

When you pass samples, the plugin automatically:

Converts DataFrames to CSVs and uploads as File objects
Infers Pandera schemas — conservative type + nullability checks inferred from the sample data (no value constraints)
Applies natural-language constraints — if constraints are provided, each one is parsed by the LLM into a Pandera check (e.g., "quantity must be positive" → pa.Check.gt(0)) and added to the schema
Extracts comprehensive context — column stats, distributions, patterns, sample rows
Includes everything in the prompt — the serialized schemas and data context are injected into the LLM prompt so the generated code is aware of exact column types, nullability and validation rules

Pandera is used purely for prompt enrichment, not runtime validation. The generated code itself doesn't import Pandera — it just benefits from the LLM knowing the precise data structure. The schemas are also stored on result.generated_schemas for inspection.
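
As a rough picture of what "conservative type + nullability" inference means, the following stdlib-only sketch infers a type and a nullable flag per column from sample rows. It does not use Pandera and is not the plugin's implementation; `infer_column`, `infer_schema`, and the output format are hypothetical, and treating an empty string as a missing value is an assumption.

```python
from typing import Callable, Dict, List


def infer_column(values: List[str]) -> Dict[str, object]:
    """Pick the narrowest type that fits every non-empty value, plus nullability."""
    present = [v for v in values if v != ""]
    nullable = len(present) < len(values)

    def all_parse(cast: Callable[[str], object]) -> bool:
        try:
            for v in present:
                cast(v)
            return True
        except ValueError:
            return False

    if present and all_parse(int):
        dtype = "int"
    elif present and all_parse(float):
        dtype = "float"
    else:
        dtype = "str"  # fall back to the widest type
    return {"dtype": dtype, "nullable": nullable}


def infer_schema(rows: List[Dict[str, str]]) -> Dict[str, Dict[str, object]]:
    """Infer a schema for every column present in the first row."""
    columns = list(rows[0].keys()) if rows else []
    return {c: infer_column([r[c] for r in rows]) for c in columns}


sample_rows = [
    {"quantity": "3", "price": "9.99", "sku": "A1"},
    {"quantity": "", "price": "12", "sku": "B2"},
]
schema = infer_schema(sample_rows)
```

Note that no value constraints (ranges, allowed sets) are derived here, which mirrors the "no value constraints" behavior described above; those come only from user-supplied constraints.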

result = await agent.generate.aio(
    prompt="Clean and validate the data, remove duplicates",
    samples={"orders": orders_df, "products": products_file},
    constraints=["quantity must be positive", "price between 0 and 10000"],
    outputs={"cleaned_orders": File},
)

# Access generated schemas
print(result.generated_schemas)  # {"orders": "DataFrameSchema(...)", "products": "..."}

Configuration

Image configuration

agent = AutoCoderAgent(
    model="gpt-4.1",
    name="my-task",
    image_config=ImageConfig(
        registry="my-registry.io",
        registry_secret="registry-creds",
        python_version=(3, 12),
    ),
)

LiteLLM configuration

agent = AutoCoderAgent(
    name="my-task",
    model="anthropic/claude-sonnet-4-20250514",
    api_key="ANTHROPIC_API_KEY",     # env var name
    litellm_params={
        "temperature": 0.3,
        "max_tokens": 4000,
    },
)

Skipping tests

Set skip_tests=True to skip test generation and execution. The agent will still generate code, detect packages, and build the sandbox image, but won't generate or run tests. This is useful when you trust the LLM output or want faster turnaround.

agent = AutoCoderAgent(
    name="my-task",
    model="gpt-4.1",
    skip_tests=True,  # No test generation or execution
)

result = await agent.generate.aio(
    prompt="Parse JSON logs and extract error counts",
    samples={"logs": log_file},
    outputs={"error_count": int},
)

# result.as_task() and result.run() still work
error_count = await result.run.aio()

Note: skip_tests only applies to LiteLLM mode. In Agent mode, the agent autonomously decides when to test.

Environment setup

sandbox_environment must be listed as a dependency of your TaskEnvironment:

import flyte
from flyte.sandbox import sandbox_environment

env = flyte.TaskEnvironment(
    name="my-env",
    image=flyte.Image.auto(),
    depends_on=[sandbox_environment],  # Required
)

This allows dynamically-created sandboxes to be registered with Flyte.

Tip: Use one AutoCoderAgent per task. Each generate() call builds its own sandbox image and manages its own package/image state. Running multiple agents in the same task can cause resource contention and makes failures harder to diagnose.

Module structure

codegen/
├── __init__.py              # Public API: AutoCoderAgent, CodeGenEvalResult, types
├── auto_coder_agent.py      # AutoCoderAgent — config + generate() orchestrator
├── core/
│   └── types.py             # Pydantic models: CodeGenEvalResult, CodeSolution, CodePlan, etc.
├── data/
│   ├── extraction.py        # Extract context from DataFrames/Files (stats, patterns, samples)
│   └── schema.py            # Pandera schema inference, constraint parsing via LLM
├── execution/
│   ├── agent.py             # Claude Agent SDK path with hooks and sandbox test interception
│   ├── docker.py            # Image building (create_image_spec, incremental builds)
│   └── testing.py           # Test execution in sandboxes
└── generation/
    ├── llm.py               # LLM calls: plan, code, tests, diagnosis, fixes, verification
    └── prompts.py           # Prompt templates and constants

Data flow

User calls agent.generate(prompt, samples, outputs, ...)
│
├─ Data Processing (both paths)
│  ├─ Convert DataFrames → CSV Files
│  ├─ Infer Pandera schemas
│  ├─ Apply user constraints (LLM-parsed)
│  └─ Extract data context (stats, patterns, samples)
│
├─ LiteLLM Path (default)                 ├─ Agent Path (backend="claude")
│  ├─ generate_plan()                     │  ├─ Build prompt with all context
│  ├─ generate_code()                     │  ├─ Launch Claude agent with hooks:
│  ├─ detect_packages()                   │  │  ├─ PreToolUse: trace + classify commands
│  ├─ build_image()                       │  │  │  ├─ pytest → run in sandbox
│  ├─ execute_tests()                     │  │  │  ├─ safe (ls, cat, ...) → allow
│  ├─ diagnose_error() (if failed)        │  │  │  └─ denied (apt, pip, curl, ...) → block
│  ├─ fix code/tests/env                  │  │  ├─ PostToolUseFailure: trace errors
│  └─ repeat until pass or max_iterations │  │  └─ Stop: trace summary
│                                         │  ├─ Agent writes solution.py, tests.py, packages.txt
│                                         │  ├─ pytest intercepted → sandbox execution
│                                         │  └─ Agent iterates until tests pass
│
└─ Return CodeGenEvalResult
   ├─ .solution (code)
   ├─ .image (sandbox image with deps)
   ├─ .as_task() → reusable sandbox
   └─ .run() → execute on sample data

Error handling

The LiteLLM path classifies test failures into three types:

Type	Meaning	Action
`logic`	Bug in generated code	Regenerate code with specific patch instructions
`environment`	Missing package/dependency	Add package, rebuild image
`test_error`	Bug in generated test	Fix test expectations

If the same error persists after fixes, the plugin reclassifies it (logic <-> test_error) to try the other approach.
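
The reclassification rule can be sketched as a small pure function. How the plugin decides that an error is "the same" is not stated here, so the exact string comparison below is an assumption, and `reclassify` is a hypothetical name.

```python
from typing import Optional

# logic and test_error swap with each other; environment is left alone.
_SWAP = {"logic": "test_error", "test_error": "logic"}


def reclassify(error_type: str, error_message: str, previous_message: Optional[str]) -> str:
    """Swap logic <-> test_error when the same error repeats; otherwise keep the type."""
    if previous_message is not None and error_message == previous_message:
        return _SWAP.get(error_type, error_type)
    return error_type
```

For example, a `logic` diagnosis that produces an identical error twice in a row would be retried as a `test_error`, so the next fix targets the test instead of the code.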

Observability

LiteLLM path

Logs every iteration with attempt count, error type, and package changes
Tracks total input/output tokens across all LLM calls
Results include full conversation history for debugging

Agent path

Traces each tool call (name + input detail) via PreToolUse hook
Traces tool failures via PostToolUseFailure hook
Traces a summary when the agent finishes (total tool calls, tool distribution, final image/packages)
Classifies Bash commands as safe, denied, or pytest (intercepted for sandbox execution)
All traces appear in the Flyte UI under the task
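
The three-way Bash command classification can be illustrated with a short sketch. The allowlist and denylist contents, the default-deny fallback, and the rejection of shell metacharacters are all assumptions made for this example; the plugin's actual rules may differ.

```python
import shlex

SAFE = {"ls", "cat", "head", "pwd"}         # hypothetical allowlist
DENIED = {"apt", "apt-get", "pip", "curl"}  # hypothetical denylist
SHELL_METACHARACTERS = ";|&$`<>"            # chaining, substitution, redirection


def classify_command(command: str) -> str:
    """Return "pytest", "safe" or "denied" for a single shell command string."""
    # Refuse anything that could chain or substitute a second command.
    if any(ch in command for ch in SHELL_METACHARACTERS):
        return "denied"
    try:
        tokens = shlex.split(command)
    except ValueError:
        return "denied"  # unbalanced quotes: refuse rather than guess
    if not tokens:
        return "denied"
    program = tokens[0]
    if program == "pytest" or tokens[:3] == ["python", "-m", "pytest"]:
        return "pytest"  # would be intercepted and run in a sandbox
    if program in DENIED:
        return "denied"
    if program in SAFE:
        return "safe"
    return "denied"  # default-deny anything unrecognized
```

Defaulting to `"denied"` keeps the sketch conservative: a command has to be explicitly recognized before it is allowed or routed to the sandbox.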

Examples

See the examples/ directory:

example_csv_processing.py — Process CSVs with different schemas using LiteLLM. Shows batch processing with multiple CSV formats.
example_csv_processing_sync.py — Synchronous version of CSV processing. Shows agent.generate() and result.run() without async.
example_csv_processing_agent.py — CSV processing using Agent mode with backend="claude".
example_dataframe_analysis.py — DataFrame analysis with constraints, base_packages, and as_task() for reusable execution.
example_dataframe_analysis_agent.py — Same DataFrame analysis using Agent mode.
example_prompt_only.py — Log file analysis with schema, constraints, samples, and explicit inputs/outputs.
example_prompt_only_agent.py — Same log analysis using Agent mode.
example_multi_input.py — Multi-input data join with primitives (float, bool).
example_multi_input_agent.py — Same multi-input join using Agent mode.
example_durable_execution.py — Durable execution with injected failures, retries, and caching (LiteLLM mode).
example_durable_execution_agent.py — Same durable execution using Agent mode.
