
Tropiflo

A tool for agentic recursive model improvement.

Automatically evolve your ML code to maximize a KPI — locally, securely, and reproducibly.


Is Tropiflo for you?

Tropiflo is for you if:

  • You already have working ML code — not starting from scratch
  • You know your metric (KPI) — accuracy, RMSE, AUC, whatever you optimize for
  • You want the system to rewrite parts of your code — to improve that metric
  • You do NOT want AutoML SaaS, data upload, or black boxes — everything runs locally

If that's you, keep reading.


How Tropiflo Thinks

Here's what actually happens when you run Tropiflo:

  1. You mark a code block you want to evolve (e.g., your feature engineering)
  2. You define a KPI by printing it (e.g., print(f"KPI: {accuracy}"))
  3. Tropiflo runs your baseline and records the KPI
  4. Tropiflo proposes a hypothesis about how to improve the code
  5. Tropiflo modifies ONLY the marked block with the new approach
  6. Tropiflo executes your full project to test the hypothesis
  7. Tropiflo scores the new KPI and keeps the change if it's better
  8. Repeat — the system keeps evolving toward higher KPIs
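
In pseudocode, the loop looks roughly like this (a conceptual sketch; every function below is an illustrative stub, not Tropiflo's actual code):

import random

def run_project(code):
    # Stub: pretend to run the full project and return its KPI
    return random.random()

def propose_hypothesis(code):
    # Stub: pretend the AI suggests an improvement
    return "try a different model family"

def rewrite_block(code, hypothesis):
    # Stub: pretend to rewrite only the marked block
    return code + "  # " + hypothesis

best_code = "model = RandomForestClassifier()"
best_kpi = run_project(best_code)                      # step 3: baseline KPI

for _ in range(10):                                    # step 8: repeat
    hypothesis = propose_hypothesis(best_code)         # step 4
    candidate = rewrite_block(best_code, hypothesis)   # step 5: only the marked block changes
    kpi = run_project(candidate)                       # step 6: full project run
    if kpi > best_kpi:                                 # step 7: keep only improvements
        best_code, best_kpi = candidate, kpi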

What Tropiflo is NOT

  • Not AutoML — It doesn't just tune hyperparameters
  • Not parameter search — It's code evolution, not grid search
  • Not a black box — You see every change it makes to your code
  • Not a data platform — Your data never leaves your machine

Quickstart: See it work in 2 minutes

The fastest way to understand Tropiflo is to watch it improve a simple problem.

Step 1: Install

pip install tropiflo

Step 2: Mark Your Code

Create train.py and mark the block you want to evolve:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load your data
X = pd.read_csv("data/features.csv")
y = pd.read_csv("data/labels.csv").iloc[:, 0]  # first column as a Series, not a one-column DataFrame

# CO_DATASCIENTIST_BLOCK_START
# This is the block Tropiflo will evolve
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
preds = model.predict(X)
# CO_DATASCIENTIST_BLOCK_END

# Print your KPI
accuracy = accuracy_score(y, preds)
print(f"KPI: {accuracy:.4f}")

Step 3: Create config.yaml

Minimal configuration:

mode: local
entry_command: "python train.py"

With more options:

mode: local
entry_command: "python train.py"

# Run multiple experiments in parallel
parallel: 3

# Mount external data directory
data_volume: "/path/to/your/data"

# AI evolution (get API key from tropiflo.io)
api_key: "sk_your_token_here"

Step 4: Set Your API Key

Before running Tropiflo, you need an API key:

  1. Sign up at tropiflo.io
  2. Copy your API key (it starts with sk_ or is a JWT)
  3. Set it using the CLI:
tropiflo set-token --token YOUR_API_KEY

This saves the key locally so you don't need to re-enter it. You can also put it in your config.yaml:

api_key: "YOUR_API_KEY"

Both methods work. If both are set, config.yaml takes priority.
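
Conceptually, the resolution order is just this (a sketch for illustration, not Tropiflo's actual code):

def resolve_api_key(config: dict, cached_token: str | None) -> str | None:
    # config.yaml wins over the token saved by `tropiflo set-token`
    return config.get("api_key") or cached_token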

Step 5: Run

tropiflo run --config config.yaml

Providing Context: Two Options

There are two ways to tell Tropiflo about your problem. Pick whichever fits your workflow.

Option A: Direct Context in Config (no Q&A, non-interactive)

Add a user_context field to your config.yaml with everything Tropiflo needs to know:

entry_command: "python train.py"
mode: local

user_context: |
  This is a credit scoring model. The KPI is ROC-AUC on a time-based holdout.
  Class imbalance is ~20:1. Tree-based models preferred.
  Do not use target encoding without proper out-of-fold.
  Must run under 60 seconds.

When user_context is set, Tropiflo skips the Q&A entirely — it compresses your context into a concise optimization brief and starts evolving immediately. This is ideal for CI/CD pipelines, scripted runs, or when you already know exactly what constraints matter.

Option B: Interactive Q&A (guided preflight)

If you omit user_context, Tropiflo runs the interactive Q&A flow on the first run:

  1. Executes your baseline code and records the initial KPI
  2. Asks you 5 questions about your problem, preferences, and constraints

These questions look like this:

1. Would you like to explore feature engineering approaches?
2. Are you interested in testing different model families?
3. Should we implement a more robust training strategy?
4. Would you prefer a conservative or experimental approach?
5. Are there specific domain pitfalls we should prepare for?

Please answer each question:
Answer 1: _

Answer each question briefly (even "yes" or "no" is fine). Your answers guide the AI's hypothesis generation — they help Tropiflo understand what kinds of changes you're open to.

Your answers are cached so you won't be asked again on subsequent runs. To reuse cached answers:

tropiflo run --config config.yaml --use-cached-qa

Running non-interactively (CI/CD, scripts): You can pipe answers via stdin:

echo -e "yes\nyes\nno\nexperimental\nno special constraints" | tropiflo run --config config.yaml

What happens next (both options)

After context is set (either way), Tropiflo begins generating and testing hypotheses automatically. Both paths produce the same internal optimization brief that guides every hypothesis.

Step 6: Track Progress (Optional)

Track runs live in a local dashboard:

# Launch workflow + Streamlit tracking UI
tropiflo run --config config.yaml --dashboard

# Optional: choose a different dashboard port
tropiflo run --config config.yaml --dashboard --dashboard-port 8502

# Launch dashboard later (without starting a new workflow)
tropiflo dashboard

The dashboard opens at http://127.0.0.1:8501 by default and reads local artifacts from results/runs/.

What you'll see:

  • Baseline run with initial KPI
  • Evolution hypotheses being tested
  • Progress toward better KPIs
  • Results saved to results/runs/{memorable_name}/

Results: Traceable, Reproducible, Diffable

Every run is fully traceable and reproducible.

your_project/
└── results/
    └── runs/
        └── happy_panda_20260207_143025/    ← Memorable run name
            ├── timeline/                     ← Chronological history
            │   ├── 0001_kpi_0.8530_baseline/
            │   ├── 0002_kpi_0.8812_hypothesis_ensemble/
            │   └── 0003_kpi_0.9103_hypothesis_feature_eng/
            ├── by_performance/               ← Auto-sorted by KPI
            └── best → timeline/0003...       ← Symlink to best version

Key features:

  • timeline/ shows every hypothesis tested, in order
  • by_performance/ automatically sorts runs by KPI for easy comparison
  • best symlink always points to your best-performing version
  • Every checkpoint contains the full modified code + metadata
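
Because the KPI is encoded in each checkpoint's directory name, results are easy to inspect programmatically too. A minimal sketch, assuming only the naming pattern shown above (in practice the best symlink already points at the winner):

import os
import re

timeline = "results/runs/happy_panda_20260207_143025/timeline"

def kpi_of(name):
    # Extract the KPI from names like 0002_kpi_0.8812_hypothesis_ensemble
    m = re.search(r"_kpi_([0-9.]+)_", name)
    return float(m.group(1)) if m else float("-inf")

best = max(os.listdir(timeline), key=kpi_of)
print(best)  # e.g. 0003_kpi_0.9103_hypothesis_feature_eng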

Important Reassurances

Your code outside the block is never modified

Tropiflo only touches code between CO_DATASCIENTIST_BLOCK_START and CO_DATASCIENTIST_BLOCK_END. Everything else stays exactly as you wrote it.

If KPI doesn't improve, baseline is preserved

Tropiflo only keeps changes that improve your KPI. If a hypothesis performs worse, it's discarded and the previous best version is kept.

You can Ctrl+C at any time safely

Press Ctrl+C anytime to stop. Docker images and containers are cleaned up automatically. No manual cleanup needed.

All artifacts are local unless you opt in

Your data, code, and results stay on your machine. Nothing is uploaded unless you explicitly configure a cloud backend.


Configuration

Minimal Config (80% of users)

mode: local
entry_command: "python train.py"

Common Options

mode: local
entry_command: "python train.py"

# Parallelization
parallel: 3

# Data mounting (if data is outside your project)
data_volume: "/home/user/datasets"

# API key for AI-powered evolution
api_key: "sk_your_token_here"

Resource Control (Advanced)

mode: local
entry_command: "python train.py"
parallel: 4

# GPU configuration
enable_gpu: true           # Force GPU (auto-detected by default)
gpus_per_task: 1           # GPUs per container

# CPU and memory limits
cpus_per_task: 4.0         # CPU cores per container
memory_per_task: "8g"      # Memory per container

Cloud Backends (Optional)

Google Cloud Run
mode: gcloud
entry_command: "python train.py"
project_id: "your-gcp-project"
region: "us-central1"
data_volume: "gs://your-bucket"

See full GCloud setup guide below.

AWS ECS Fargate
mode: aws
entry_command: "python train.py"
aws:
  cluster: "my-cluster"
  task_definition: "my-task"
  region: "us-east-1"

See full AWS setup guide below.

Databricks
mode: databricks
entry_command: "python train.py"
databricks:
  volume_uri: "dbfs:/Volumes/my_catalog/my_schema/my_volume"
  timeout: "30m"
  job:
    tasks:
      - task_key: "t"
        existing_cluster_id: "your-cluster-id"

See full Databricks setup guide below.


Using Your Data

Once the quickstart example works, here's how to use YOUR data:

Method 1: Hardcoded Paths (Simplest)

Just put the full path in your code:

import pandas as pd

X = pd.read_csv("/full/path/to/your/data.csv")
# ... rest of your code

Method 2: Docker Volume Mounting (Recommended)

For data that lives outside your project:

Update config.yaml:

mode: local
entry_command: "python train.py"
data_volume: "/home/user/my_datasets"

Update your code:

import os
import pandas as pd

# Tropiflo automatically sets INPUT_URI to /data inside Docker
DATA_DIR = os.environ.get("INPUT_URI", "/data")
X = pd.read_csv(os.path.join(DATA_DIR, "train.csv"))
y = pd.read_csv(os.path.join(DATA_DIR, "labels.csv"))

# CO_DATASCIENTIST_BLOCK_START
# Your model code here
# CO_DATASCIENTIST_BLOCK_END

print(f"KPI: {score}")

What happens: Tropiflo mounts /home/user/my_datasets to /data inside the Docker container, so your code can access files like train.csv.
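
Conceptually, the mount is equivalent to running the container yourself like this (an analogy only; my_image is a placeholder, and Tropiflo builds and manages the container for you):

docker run -v /home/user/my_datasets:/data -e INPUT_URI=/data my_image python train.py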


Block Placement Rules

Block markers MUST be at top level (no indentation):

# ✅ CORRECT - No indentation before the comment
# CO_DATASCIENTIST_BLOCK_START
def my_model():
    return LinearRegression()
# CO_DATASCIENTIST_BLOCK_END

# ❌ WRONG - Inside a function (has tabs/spaces before comment)
def train():
    # CO_DATASCIENTIST_BLOCK_START  ← This will NOT be detected!
    model = train_model()
    # CO_DATASCIENTIST_BLOCK_END

Rule: Block markers must start at column 0 (no tabs or spaces before #).


Multi-File Projects

Tropiflo supports both single-file scripts and multi-file projects:

  • Single File: tropiflo run python my_script.py
  • Multi-File: Auto-detects run.sh, main.py, or run.py in your project root
  • Custom Entry Point: tropiflo run bash custom_script.sh

When you run Tropiflo on a multi-file project:

  1. Scanning: Scans all .py files for CO_DATASCIENTIST_BLOCK markers
  2. Selection: Each generation, randomly picks ONE file to evolve
  3. Evolution: The AI generates hypotheses and modifies the selected block
  4. Testing: Your entire project runs with the new code
  5. Checkpointing: Best results are saved as complete directories with all files

This means you can have complex multi-file ML pipelines where each file evolves independently but is tested as a complete system.
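
For example, an illustrative layout (file names are hypothetical) might look like:

my_project/
├── run.sh          ← entry point that runs the whole pipeline
├── features.py     ← contains one CO_DATASCIENTIST block
├── model.py        ← contains another CO_DATASCIENTIST block
└── evaluate.py     ← prints the final KPI

Each generation, either features.py or model.py may be picked for evolution, and run.sh is executed end-to-end to score the result.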


Deployment

Take your best checkpoint and create a production-ready project:

# Deploy best checkpoint from latest run
tropiflo deploy results/runs/happy_panda_20260207/best/

# Deploy specific version
tropiflo deploy results/runs/happy_panda_20260207/timeline/0003_kpi_0.9103_feature_eng/

# Custom output directory
tropiflo deploy results/runs/happy_panda_20260207/best/ --output-dir my_optimized_v2

What it does:

  1. Copies your entire original project (including data, configs, assets)
  2. Integrates the evolved code from the checkpoint
  3. Excludes Tropiflo artifacts (checkpoints, cache, etc.)
  4. Creates a deployment_info.json with checkpoint metadata

The result is a complete, standalone project ready to deploy to production.


Analysis Tools

Live Local Tracking Dashboard

Run with a live dashboard to monitor experiments as checkpoints are saved:

tropiflo run --config config.yaml --dashboard

Open the same dashboard anytime (even when no run is active):

# Reads ./results/runs by default
tropiflo dashboard

# Point to another project directory
tropiflo dashboard --working-directory /path/to/project

# Or pass an explicit results root and custom port
tropiflo dashboard --results-root /path/to/project/results/runs --dashboard-port 8502

Dashboard highlights:

  • KPI over time (all runs as points + running best line)
  • Baseline marker and best-so-far trajectory
  • Hypotheses table across the workflow
  • Diff viewer vs baseline per file
  • Stdout/stderr per checkpoint

If you run multiple workflows, select and compare them from the dashboard sidebar.
Data is loaded from local results/runs/ folders, so old and new runs appear together.

Plot KPI Progression

Visualize how your KPI improves over iterations:

# Basic usage
tropiflo plot-kpi --checkpoints-dir results/runs/happy_panda_20260207/

# With options
tropiflo plot-kpi \
  --checkpoints-dir results/runs/happy_panda_20260207/ \
  --max-iteration 350 \
  --title "AUC Training Progress" \
  --kpi-label "AUC" \
  --output my_kpi_plot.png

Generate PDF Code Diffs

Create professional PDF reports comparing two versions:

# Compare two Python files
tropiflo diff-pdf baseline.py improved.py

# With custom title
tropiflo diff-pdf \
  baseline.py \
  optimized.py \
  --output "optimization_report.pdf" \
  --title "XOR Problem Optimization Results"

Air-Gapped / Offline Deployment

Need to run Tropiflo in an environment without internet access?

Quick Setup (One-Time, Requires Internet)

# Run this once while connected to internet
tropiflo setup-airgap

# That's it! Now you can disconnect and work offline

What It Does

  1. Pulls the base Python Docker image (one-time download)
  2. Builds complete image with all your dependencies pre-installed
  3. Updates your config.yaml to use the pre-built image
  4. Everything runs locally - no internet required after setup

After Setup

# Disconnect from internet (or work in isolated environment)
tropiflo run --config config.yaml  # Works offline!

Perfect for:

  • Air-gapped production environments
  • Isolated VPC deployments
  • High-security environments
  • Offline development

Private/Self-Hosted Backend

If you run the backend on your own host (VPC, on-prem), point the CLI at it:

In config.yaml:

backend_url: "https://your-private-backend.example.com"
backend_url_dev: "http://localhost:8000"  # Optional, for dev mode

Or with environment variables:

export CO_DATASCIENTIST_CO_DATASCIENTIST_BACKEND_URL="https://your-private-backend.example.com"
export CO_DATASCIENTIST_CO_DATASCIENTIST_BACKEND_URL_DEV="http://localhost:8000"
export CO_DATASCIENTIST_DEV_MODE=true  # To force dev URL

If neither the YAML fields nor the environment variables are set, the client defaults to https://co-datascientist.io.


Resource Allocation (GPU, CPU, Memory)

Control how much hardware each Docker container gets.

GPU Configuration

Auto-detection (default):

# No configuration needed - GPUs auto-detected!
# If available: containers get GPU access
# If not available: containers run on CPU automatically

Manual control:

enable_gpu: false       # Force CPU-only (even if GPU available)
enable_gpu: true        # Force GPU (fails if not available)
gpus_per_task: 1        # Each container gets 1 GPU

CPU & Memory Limits

cpus_per_task: 4.0      # Each container gets 4 CPU cores
memory_per_task: "8g"   # Each container gets 8GB RAM

Common Scenarios

Single GPU Workstation:

entry_command: "python train.py"
parallel: 2
gpus_per_task: 1        # Each gets 1 GPU (total: 2 GPUs)
cpus_per_task: 4.0      # Each gets 4 cores (total: 8 cores)
memory_per_task: "8g"   # Each gets 8GB (total: 16GB)

Multi-GPU Server:

entry_command: "python train.py"
parallel: 8
gpus_per_task: 1        # Each gets 1 GPU (total: 8 GPUs)
cpus_per_task: 2.0      # Each gets 2 cores (total: 16 cores)
memory_per_task: "4g"   # Each gets 4GB (total: 32GB)

CPU-Only Machine:

entry_command: "python train.py"
parallel: 4
enable_gpu: false       # Force CPU mode
cpus_per_task: 2.0      # Each gets 2 cores (total: 8 cores)
memory_per_task: "2g"   # Each gets 2GB (total: 8GB)

Before vs After Example

Before (KPI ≈ 0.50):
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import numpy as np

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0))
])

pipeline.fit(X, y)
preds = pipeline.predict(X)
accuracy = accuracy_score(y, preds)
print(f'KPI: {accuracy:.4f}')

After (KPI = 1.00):

import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

class ChebyshevPolyExpansion(BaseEstimator, TransformerMixin):
    def __init__(self, degree=3):
        self.degree = degree
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = np.asarray(X)
        X_scaled = 2 * X - 1
        n_samples, n_features = X_scaled.shape
        features = []
        for f in range(n_features):
            x = X_scaled[:, f]
            T = np.empty((self.degree + 1, n_samples))
            T[0] = 1
            if self.degree >= 1:
                T[1] = x
            for d in range(2, self.degree + 1):
                T[d] = 2 * x * T[d - 1] - T[d - 2]
            features.append(T.T)
        return np.hstack(features)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

pipeline = Pipeline([
    ('cheb', ChebyshevPolyExpansion(degree=3)),
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0))
])

pipeline.fit(X, y)
preds = pipeline.predict(X)
accuracy = accuracy_score(y, preds)
print(f'KPI: {accuracy:.4f}')

Cloud Integrations

Google Cloud Run Jobs Integration

Execute your code at scale on Google Cloud infrastructure.

Prerequisites (One-Time, 5 Minutes)

  1. Install & authenticate gcloud CLI:
# Install gcloud CLI (if not installed)
# See: https://cloud.google.com/sdk/docs/install

# Authenticate
gcloud auth login
gcloud auth application-default login

# Set your project
gcloud config set project YOUR_PROJECT_ID
  2. Enable required APIs:
gcloud services enable artifactregistry.googleapis.com
gcloud services enable run.googleapis.com
  3. Create Artifact Registry repository:
gcloud artifacts repositories create co-datascientist-repo \
  --repository-format=docker \
  --location=us-central1 \
  --description="Docker images for Co-DataScientist"

Configuration

Minimal config.yaml for GCloud:

mode: gcloud
entry_command: "python train.py"
project_id: "your-gcp-project-id"

With options:

mode: gcloud
entry_command: "python train.py"
project_id: "your-gcp-project-id"

# Optional
region: "us-central1"
repo: "co-datascientist-repo"
parallel: 2
data_volume: "gs://your-bucket"
api_key: "sk_your_token"

What Happens

When you run tropiflo run --config config.yaml:

  1. Builds your Docker image locally
  2. Pushes to GCP Artifact Registry
  3. Creates & executes Cloud Run Job
  4. Retrieves results and KPIs
  5. Cleans up resources automatically

Cost efficient: Cleans up jobs and images automatically (configurable with cleanup_job and cleanup_remote_image)

Using Data from GCS

mode: gcloud
project_id: "my-project"
entry_command: "python train.py"
data_volume: "gs://my-data-bucket"

Your code accesses data at /data:

import os
DATA_DIR = os.environ.get("INPUT_URI", "/data")
df = pd.read_csv(os.path.join(DATA_DIR, "train.csv"))

Note: Your Cloud Run service account needs storage.objectViewer permission on the bucket.
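
One way to grant it (an example; substitute your own service account and bucket):

gsutil iam ch serviceAccount:YOUR_SERVICE_ACCOUNT@your-gcp-project.iam.gserviceaccount.com:roles/storage.objectViewer gs://my-data-bucket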

AWS ECS Fargate Integration

Execute and optimize your Python code at scale using AWS ECS Fargate.

Setup

  1. Prerequisites:

    • AWS account with ECS Fargate enabled
    • Authenticated AWS CLI: aws configure
    • An ECS cluster and task definition configured for your needs
  2. Create config.yaml:

mode: aws
entry_command: "python train.py"
aws:
  script_path: "/path/to/your/script.py"
  cluster: "my-cluster"
  task_definition: "my-job-taskdef"
  launch_type: "FARGATE"
  region: "us-east-1"
  network_configuration:
    subnets: ["subnet-abc123", "subnet-def456"]
    security_groups: ["sg-123456"]
    assign_public_ip: "ENABLED"
  timeout: 1800  # seconds
  3. Run:
tropiflo run --config config.yaml

Your code will be executed in AWS ECS Fargate containers, with results and KPIs retrieved automatically. Perfect for serverless compute scaling!

Databricks Integration

Run Tropiflo evolution on a Databricks cluster or serverless compute instead of local Docker containers. Your code is uploaded to Databricks storage and executed as a Spark Python task.

There are two compute options:

Option           | Config key                     | Best for
Existing cluster | existing_cluster_id            | Workspaces with classic compute (VPC configured)
Serverless       | environment_key + environments | New workspaces, trial accounts, or no VPC setup

If you're not sure which you have, try creating a cluster in the Databricks UI. If you see an error like "does not have any associated worker environments", your workspace only supports serverless — skip to Option B: Serverless.

Prerequisites

  1. Install the Databricks CLI (v2):
# Linux / macOS
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sudo sh

# Windows — download the installer from:
# https://docs.databricks.com/en/dev-tools/cli/install.html
  2. Authenticate with a Personal Access Token:

Generate a token in your Databricks workspace under Settings > Developer > Access tokens, then configure the CLI:

databricks configure
# Enter your workspace URL (e.g. https://dbc-xxxxx.cloud.databricks.com)
# Enter your access token

To find your workspace URL: log into Databricks and copy the URL from the browser address bar (everything before /?o=).

Verify it works:

databricks auth describe
  3. Create a Unity Catalog Volume (where Tropiflo stores project files):
# List available catalogs and schemas
databricks catalogs list
databricks schemas list <catalog_name>

# Check if you already have a volume
databricks volumes list <catalog_name> <schema_name>

If you need to create one:

databricks volumes create <catalog_name> <schema_name> tropiflo_volume MANAGED

Your volume URI will be: dbfs:/Volumes/<catalog_name>/<schema_name>/tropiflo_volume

Alternative storage options:

Storage type                       | volume_uri example                           | Best for
Unity Catalog Volume (recommended) | dbfs:/Volumes/my_catalog/my_schema/my_volume | Modern workspaces with Unity Catalog
Workspace Files                    | /Workspace/Users/you@company.com/tropiflo    | Workspaces where DBFS is restricted
Classic DBFS                       | dbfs:/FileStore/tropiflo                     | Legacy workspaces without Unity Catalog

Option A: Existing Cluster

Use this if your workspace has classic compute infrastructure (VPC configured).

Find your cluster ID:

databricks clusters list

Or in the Databricks UI: Compute > your cluster > JSON view.

config.yaml:

mode: databricks
entry_command: "python train.py"
api_key: "YOUR_API_KEY"

databricks:
  volume_uri: "dbfs:/Volumes/my_catalog/my_schema/tropiflo_volume"
  timeout: "30m"
  job:
    tasks:
      - task_key: "t"
        existing_cluster_id: "0324-151716-abc123"

Dependencies from requirements.txt are auto-detected and installed via Databricks task libraries.

Option B: Serverless (No Cluster Needed)

Use this if your workspace doesn't have classic compute, or you just want the simplest setup. Serverless compute is managed entirely by Databricks — no VPC, no cluster creation, no infrastructure to manage.

config.yaml:

mode: databricks
entry_command: "python train.py"
api_key: "YOUR_API_KEY"

databricks:
  volume_uri: "dbfs:/Volumes/my_catalog/my_schema/tropiflo_volume"
  timeout: "30m"

  job:
    tasks:
      - task_key: "t"
        environment_key: "default"
    environments:
      - environment_key: "default"
        spec:
          client: "1"
          dependencies:
            - "scikit-learn>=1.0.0"
            - "numpy"
            - "pandas"

Key differences from the existing-cluster config:

  • No existing_cluster_id — instead you set environment_key: "default" on the task
  • Dependencies are listed explicitly in environments[*].spec.dependencies (not auto-read from requirements.txt)
  • Compute is provisioned on-demand by Databricks — startup takes ~60-90 seconds per run

Complete Serverless Walkthrough

Here's a full end-to-end example that reads data from a Databricks Volume:

1. Upload your data to the volume:

# Create a data directory on the volume
databricks fs mkdir dbfs:/Volumes/workspace/default/tropiflo_volume/data

# Upload your CSV files
databricks fs cp features.csv dbfs:/Volumes/workspace/default/tropiflo_volume/data/features.csv
databricks fs cp labels.csv dbfs:/Volumes/workspace/default/tropiflo_volume/data/labels.csv

# Verify
databricks fs ls dbfs:/Volumes/workspace/default/tropiflo_volume/data/

2. Write train.py that reads from the volume:

import os
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# On Databricks, Unity Catalog Volumes are mounted at /Volumes/...
# The dbfs: prefix is stripped at runtime
DATA_DIR = "/Volumes/workspace/default/tropiflo_volume/data"

X = pd.read_csv(os.path.join(DATA_DIR, "features.csv"))
y = pd.read_csv(os.path.join(DATA_DIR, "labels.csv"))["y"]

# CO_DATASCIENTIST_BLOCK_START
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0))
])
pipeline.fit(X, y)
preds = pipeline.predict(X)
# CO_DATASCIENTIST_BLOCK_END

accuracy = accuracy_score(y, preds)
print(f"KPI: {accuracy:.4f}")

3. Write config.yaml:

mode: databricks
entry_command: "python train.py"
api_key: "YOUR_API_KEY"

databricks:
  volume_uri: "dbfs:/Volumes/workspace/default/tropiflo_volume"
  timeout: "30m"

  job:
    tasks:
      - task_key: "t"
        environment_key: "default"
    environments:
      - environment_key: "default"
        spec:
          client: "1"
          dependencies:
            - "scikit-learn>=1.0.0"
            - "numpy"
            - "pandas"

4. Write requirements.txt:

numpy
scikit-learn
pandas

5. Run:

tropiflo run --config config.yaml

Understanding data paths: In your config.yaml, the volume_uri uses the dbfs: prefix (dbfs:/Volumes/...) — this tells the Databricks CLI where to upload files. In your Python code, you use the runtime path without the dbfs: prefix (/Volumes/...) — this is how the filesystem is mounted inside the Databricks execution environment.
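
If you prefer to define the URI once, a small helper can derive the runtime path from the config form (a hypothetical convenience function, not part of Tropiflo):

def to_runtime_path(uri: str) -> str:
    # "dbfs:/Volumes/..." in config.yaml -> "/Volumes/..." at runtime
    return uri[len("dbfs:"):] if uri.startswith("dbfs:") else uri

DATA_DIR = to_runtime_path("dbfs:/Volumes/workspace/default/tropiflo_volume") + "/data"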

How It Works

When you run tropiflo run --config config.yaml:

  1. Your project is zipped and uploaded to {volume_uri}/runs/{run_id}/project.zip
  2. A launcher script is uploaded to {volume_uri}/runs/{run_id}/launcher.py
  3. A Databricks job is submitted that runs the launcher on your cluster (or serverless)
  4. The launcher extracts the project zip and runs your entry_command
  5. Tropiflo polls for completion and retrieves stdout/stderr/KPI
  6. If cleanup_remote_files: true, the run directory is deleted afterward
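
For intuition, step 4 amounts to roughly the sketch below (illustrative only; the run directory is hypothetical and the real launcher's details will differ):

import subprocess
import zipfile

RUN_DIR = "/Volumes/my_catalog/my_schema/my_volume/runs/example_run"  # hypothetical

# Extract the uploaded project, then run the configured entry_command
with zipfile.ZipFile(RUN_DIR + "/project.zip") as zf:
    zf.extractall("/tmp/project")

subprocess.run(["python", "train.py"], cwd="/tmp/project", check=True)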

Environment & Dependencies

Your code runs inside the Python environment of the Databricks cluster. There is no Docker container — packages, drivers, and hardware are whatever the cluster provides.

For existing clusters:

  • Base environment comes from the Databricks Runtime. Standard runtimes include numpy, pandas, scikit-learn, etc. ML Runtimes (e.g. 15.4 LTS ML) additionally include PyTorch, TensorFlow, XGBoost, and CUDA/cuDNN drivers.
  • requirements.txt is auto-detected — Tropiflo installs packages via task libraries.
  • For slow-to-install packages, pre-install them on the cluster via Compute > your cluster > Libraries > Install new.

For serverless:

  • List dependencies explicitly in environments[*].spec.dependencies in your config.
  • requirements.txt is not auto-read for serverless — you must list each dependency in the config.

A typical project layout:

my_project/
├── config.yaml
├── train.py
└── requirements.txt   ← auto-detected for existing clusters only

Full Config Reference

mode: databricks
entry_command: "python train.py"
api_key: "YOUR_API_KEY"

databricks:
  cli: "databricks"              # CLI binary name or path (default: "databricks")
  volume_uri: "dbfs:/Volumes/my_catalog/my_schema/my_volume"
  timeout: "30m"                 # max job runtime (supports s/m/h suffixes)
  cleanup_remote_files: true     # delete uploaded files after each run

  job:
    tasks:
      - task_key: "t"
        existing_cluster_id: "0324-151716-abc123"  # OR use environment_key for serverless

GPU Clusters

Databricks GPU support works out of the box — no Tropiflo configuration needed. Unlike local mode (which requires enable_gpu and gpus_per_task for Docker), Databricks mode runs directly on the cluster hardware with no container layer.

Setup: Just point existing_cluster_id to a GPU-enabled cluster:

databricks:
  volume_uri: "dbfs:/Volumes/my_catalog/my_schema/my_volume"
  timeout: "30m"
  job:
    tasks:
      - task_key: "t"
        existing_cluster_id: "0324-151716-gpu-cluster"

Your code sees GPUs automatically:

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")  # "Using: cuda"

Recommended cluster setup for GPU workloads:

  • Runtime: Use an ML Runtime (e.g. 15.4 LTS ML GPU) — it comes with CUDA, cuDNN, PyTorch, and TensorFlow pre-installed
  • Node type: Pick a GPU instance (e.g. g4dn.xlarge on AWS, Standard_NC6s_v3 on Azure, a2-highgpu-1g on GCP)
  • Single-node mode: Enable "Use as single node" under Advanced options — this ensures the driver node (where your code runs) has GPU access. On multi-node clusters, only the driver runs your script via spark_python_task, so the driver node must have the GPU

Accessing Data on Databricks

Unlike local mode (which mounts a data_volume into Docker), Databricks mode runs your code on a remote cluster or serverless compute. Your script must read data from locations the compute can access directly. There is no automatic INPUT_URI or /data mount.

Common data access patterns:

Method               | Path in Python                                      | Path in CLI / config
Unity Catalog Volume | /Volumes/catalog/schema/volume/file.csv             | dbfs:/Volumes/catalog/schema/volume/file.csv
Unity Catalog Table  | spark.table("catalog.schema.table")                 | N/A
S3                   | s3://bucket/path/file.csv                           | N/A
ADLS                 | abfss://container@account.dfs.core.windows.net/path | N/A
Classic DBFS         | /dbfs/FileStore/path/file.csv                       | dbfs:/FileStore/path/file.csv

Example — reading from a Unity Catalog Volume:

import pandas as pd

# Note: /Volumes/... (no dbfs: prefix) — this is the runtime mount path
df = pd.read_csv("/Volumes/my_catalog/my_schema/my_volume/data/train.csv")

Example — reading from a Unity Catalog table:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("my_catalog.my_schema.my_table").toPandas()

Tip: Keep large datasets out of your project directory. Tropiflo zips your entire project and uploads it for each run. If you have a data/ folder inside your project, it will be zipped and uploaded every time — slow and wasteful. Instead, store data on Volumes / tables / cloud storage and reference it by path in your code.

Using Workspace Paths

If Unity Catalog Volumes aren't available, you can store files directly in the Databricks Workspace filesystem:

databricks:
  volume_uri: "/Workspace/Users/you@company.com/tropiflo"

Tropiflo detects Workspace paths and automatically uses databricks workspace CLI commands (instead of databricks fs) for uploads. The Jobs API receives /Workspace/... paths, which don't require DBFS file privileges.

If you accidentally write dbfs:/Workspace/..., Tropiflo strips the dbfs: prefix and logs a warning. It's better to use the correct form from the start.

Troubleshooting

Current organization does not have any associated worker environments

This means your Databricks workspace doesn't have classic compute infrastructure (VPC) configured. You have two options:

  • Use serverless (recommended, no setup needed) — see Option B: Serverless above
  • Set up classic compute — requires an admin to configure VPC/network settings in the Databricks Account Console under Cloud Resources (create a credential configuration, storage configuration, and network configuration)

INSUFFICIENT_PERMISSIONS: User does not have permission SELECT on any file

This means the cluster has Unity Catalog enabled but the job references a dbfs:/ path. Solutions:

  • Best fix: Switch volume_uri to a Unity Catalog Volume: dbfs:/Volumes/<catalog>/<schema>/<volume>
  • Alternative: Use a Workspace path: /Workspace/Users/you@company.com/tropiflo
  • If you must use DBFS: Ask your workspace admin to grant SELECT on any file (not recommended — it's a broad privilege)

Error: No operations allowed on this path when running databricks fs ls dbfs:/Volumes

You can't list the bare /Volumes root. You need the full path including catalog, schema, and volume name:

# Wrong
databricks fs ls dbfs:/Volumes

# Correct
databricks fs ls dbfs:/Volumes/my_catalog/my_schema/my_volume/

Failed to validate python file ...

Check that:

  1. Your volume_uri points to a location the cluster can actually read
  2. The cluster is running and accessible (databricks clusters list)
  3. Your token has permission to submit jobs (databricks jobs list)

Windows-specific: databricks not found

Set the cli field to the full path or use databricks.exe:

databricks:
  cli: "databricks.exe"
  # or the full path:
  # cli: "C:\\Users\\you\\AppData\\Local\\Programs\\databricks\\databricks.exe"

Important Notes

  • Avoid input() or interactive prompts — Tropiflo needs to run your code automatically
  • Mark the parts you want to evolve — Use CO_DATASCIENTIST_BLOCK_START and CO_DATASCIENTIST_BLOCK_END
  • Add comments with context — Tropiflo understands your domain! Explain your problem, constraints, and ideas in comments near your code

Naming Note

"Co-DataScientist" is the internal engine behind Tropiflo.
You only interact with the Tropiflo CLI. If you see references to "Co-DataScientist" in code, logs, or config keys, that's the underlying system. They're the same product.


Need Help?

We'd love to chat: oz.kilim@tropiflo.io


Disclaimer: Tropiflo executes your scripts on your own machine. Make sure you trust the code you feed it!

Made by the Tropiflo team
