
A tool for agentic recursive model improvement


Introducing the Co-DataScientist


Kick back, relax, and tomorrow morning greet a shiny KPI you can parade at ML stand-up.


Why is everyone talking about the Co-DataScientist?

  • Idea Explosion — Launches a swarm of models, feature recipes & hyper-parameters you never knew existed.
  • Full-Map Exploration — Charts the entire optimization galaxy so you can stop guessing and start winning.
  • Hands-Free Mode — Hit run and the search party works through the night.
  • KPI Fanatic — Every evolutionary step is focused on improving your target metric.
  • Data Stays Home — Your training and testing data never leaves your server; everything runs locally.

Fast-track your ML pipelines from painful to excellent

Quickstart — 30-Second Setup

1. Install

pip install co-datascientist

2. Write a tiny script (e.g. xor.py). The only rule: print your KPI

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# XOR toy-set
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([0,1,1,0])

# CO_DATASCIENTIST_BLOCK_START

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(random_state=0))
])

pipe.fit(X, y)
acc = accuracy_score(y, pipe.predict(X))

# CO_DATASCIENTIST_BLOCK_END


print(f"KPI: {acc:.4f}")  # Tag your metric!

# comments
# This is the classic XOR problem — it's not linearly separable!
# A linear model like LogisticRegression can't solve it perfectly,
# because no straight line can separate the classes in 2D.
# This makes it a great test for feature engineering or non-linear models.

Sanity-check that the baseline runs in your environment (you may need to pip install scikit-learn first!):

python xor.py

3. Set your API Token (one time only!)

Before running any commands, you need to set your Co-DataScientist API token. You only need to do this once per machine.

co-datascientist set-token --token <YOUR_TOKEN>

4. Run the tool

Then run the co-datascientist!

co-datascientist run --script-path xor.py --parallel 3

Watch the accuracy improve. (Once you reach KPI = 1.0, you can stop the run.)

You will find the glowed-up code in the co_datascientist_checkpoints directory.

Yes, it's that simple

Try it on your toughest problem and see how your KPI improves.
Co-DataScientist helps you get better results—no matter how big your challenge.


Important Notes About Your Input Script

KPI Tagging

Co-DataScientist scans your stdout for the pattern KPI: <number> — that’s the metric it maximizes. Use anything: accuracy, F1, revenue per click, unicorns-per-second… you name it!
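Under the hood this is simple pattern matching on your script's stdout. The exact parser is internal to the tool, but the idea can be sketched like this (the function name and regex here are illustrative, not the tool's actual code):

```python
import re

def extract_kpi(stdout: str):
    """Pull the last 'KPI: <number>' occurrence out of captured stdout.

    Illustrative only -- Co-DataScientist's real parser may differ.
    """
    matches = re.findall(r"KPI:\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)", stdout)
    return float(matches[-1]) if matches else None

print(extract_kpi("Accuracy: 0.50\nKPI: 0.5000"))  # prints 0.5
```

The takeaway: as long as a `KPI: <number>` line reaches stdout, the metric is picked up; anything else you print is ignored.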


📁 Hardcode Your Data Paths

Important: Please hardcode any data file paths directly in your script.
For example, use data = np.loadtxt("full/path/to/my/my_data.csv") or similar.
Do not use input() or command-line arguments to specify file paths.
This ensures Co-DataScientist can run your script automatically without manual intervention.
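For illustration, the sketch below writes a tiny CSV to a temporary location and then loads it through a fixed path. In your real script, skip the setup step and simply hardcode the string literal pointing at your actual data file:

```python
import os
import tempfile
import numpy as np

# Create a tiny stand-in data file so this sketch runs anywhere.
# In your real script, omit this and hardcode your actual path instead.
path = os.path.join(tempfile.gettempdir(), "my_data.csv")
with open(path, "w") as f:
    f.write("0,0\n0,1\n1,0\n1,1\n")

data = np.loadtxt(path, delimiter=",")  # fixed path, no input() or argv
print(data.shape)  # -> (4, 2)
```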

🧬 Blocks to evolve

As you can see in the XOR example, Co-DataScientist uses # CO_DATASCIENTIST_BLOCK_START and # CO_DATASCIENTIST_BLOCK_END tags to identify the parts of the system you want it to improve. Make sure to tag the parts of your system you care about improving! This helps Co-DataScientist stay focused on its job.
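A minimal skeleton of the recommended layout (the least-squares model inside the block is just a placeholder showing where evolvable code goes):

```python
import numpy as np

# Fixed setup: data loading stays outside the evolvable block.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# CO_DATASCIENTIST_BLOCK_START
# Everything between the tags is fair game for the optimizer:
# model choice, feature engineering, hyper-parameters...
w = np.linalg.lstsq(X.astype(float), y.astype(float), rcond=None)[0]
preds = (X @ w > 0.5).astype(int)
# CO_DATASCIENTIST_BLOCK_END

acc = float((preds == y).mean())
print(f"KPI: {acc:.4f}")  # the KPI print stays outside the block
```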


One File Only: Self-Contained Scripts Required

Note: Co-DataScientist currently supports only scripts written as a single, self-contained Python file. Please put all your code in one .py file—multi-file projects are not supported (yet!). Everything your workflow needs should be in that one file.


Add Domain-Specific Notes for Best Results

After your code, add comments with any extra context, known issues, or ideas you have about your problem. This helps Co-DataScientist understand your goals and constraints: it genuinely UNDERSTANDS your problem, it's not just doing a blind search!

Other helpful stuff

Skip Q&A on Repeat Runs

For faster iterations, use cached answers from your previous run:

co-datascientist run --script-path xor.py --use-cached-qa

This skips the interactive questions and uses your previous answers, jumping straight to the optimization process.


📝 Before vs After

Before (KPI ≈ 0.50):

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import numpy as np

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0))
])

pipeline.fit(X, y)
preds = pipeline.predict(X)
accuracy = accuracy_score(y, preds)
print(f'Accuracy: {accuracy:.2f}')
print(f'KPI: {accuracy:.4f}')

After (KPI = 1.00):

import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from tqdm import tqdm

class ChebyshevPolyExpansion(BaseEstimator, TransformerMixin):
    def __init__(self, degree=3):
        self.degree = degree
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = np.asarray(X)
        X_scaled = 2 * X - 1
        n_samples, n_features = X_scaled.shape
        features = []
        for f in tqdm(range(n_features), desc='Chebyshev features'):
            x = X_scaled[:, f]
            T = np.empty((self.degree + 1, n_samples))
            T[0] = 1
            if self.degree >= 1:
                T[1] = x
            for d in range(2, self.degree + 1):
                T[d] = 2 * x * T[d - 1] - T[d - 2]
            features.append(T.T)
        return np.hstack(features)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

pipeline = Pipeline([
    ('cheb', ChebyshevPolyExpansion(degree=3)),
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0))
])

pipeline.fit(X, y)
preds = pipeline.predict(X)
accuracy = accuracy_score(y, preds)
print(f'Accuracy: {accuracy:.2f}')
print(f'KPI: {accuracy:.4f}')

We now support Databricks


Databricks setup

  1. Download the Databricks CLI package:
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sudo sh
  2. Get a Databricks token and check that the CLI works.
  3. Prepare a config file with all of your compute/environment requirements in databricks_config.yaml, example below:
# Databricks configuration for XOR demo
# (a top-level `databricks:` section enables the integration)
databricks:
  cli: "databricks"  # databricks CLI command (optional, defaults to "databricks")
  volume_uri: "dbfs:/Volumes/workspace/default/volume"  # DBFS volume URI for file uploads
  code_path: "dbfs:/Volumes/workspace/default/volume/xor.py"  # Specific code path (optional, overrides volume_uri + temp filename)
  timeout: "30m"  # Job timeout duration
  
  job:
    name: "run-<script-stem>-<timestamp>"  # Job name template (supports <script-stem> and <timestamp>)
    tasks:
      - task_key: "t"
        spark_python_task:
          python_file: "<remote_path>"  # Will be automatically replaced with actual remote path
        environment_key: "default"
    environments:
      - environment_key: "default"
        spec:
          client: "1"
          dependencies:
            - "scikit-learn>=1.0.0"
            - "numpy>=1.20.0"

Then run the co-datascientist with:

co-datascientist run --cloud-config databricks_config.yaml

Your optimized model checkpoints will now be saved in dbfs:/Volumes/workspace/default/volume/co-datascientist-checkpoints

☁️ Google Cloud Run Jobs Integration

Execute and optimize your Python code at scale using Google Cloud Run Jobs.

Setup

  1. Prerequisites:

    • Google Cloud project with Cloud Run enabled
    • Authenticated gcloud CLI: gcloud auth login
    • A Cloud Run Job template (e.g., test-job-clean)
  2. Create a config file (e.g., gcloud_config.yaml):

gcloud:
  enabled: true
  script_path: "/path/to/your/script.py"
  job_template: "your-job-name"
  region: "europe-west3"
  timeout: "30m"
  code_injection_method: "args"
  3. Run Co-DataScientist:
co-datascientist run --cloud-config gcloud_config.yaml --no-preflight

Your code will be executed in Google Cloud Run Jobs, with results and KPIs retrieved automatically. Perfect for scaling compute-intensive optimizations!

📖 See the complete demo: /demos/gcloud/


Analysis and Visualization Tools

Co-DataScientist includes built-in visualization tools to help you analyze your optimization results and compare different versions of your code.

Plot KPI Progression

Visualize how your KPI improves over iterations from checkpoint JSON files:

# Basic usage - plot KPI progression from checkpoints directory
co-datascientist plot-kpi --checkpoints-dir /path/to/co_datascientist_checkpoints

# Advanced usage with custom options
co-datascientist plot-kpi \
  --checkpoints-dir /path/to/checkpoints \
  --max-iteration 350 \
  --title "AUC Training Progress" \
  --kpi-label "AUC" \
  --output my_kpi_plot.png

Options:

  • --checkpoints-dir, -c: Directory containing checkpoint JSON files (required)
  • --max-iteration, -m: Maximum iteration to include in plot
  • --title, -t: Custom title for the plot
  • --output, -o: Output file path (auto-generated if not specified)
  • --kpi-label, -k: Label for the KPI metric (default: "RMSE")
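If you would rather post-process checkpoints yourself, each checkpoint is a JSON file. The sketch below fabricates a tiny checkpoints directory so it runs standalone; the field names (`iteration`, `kpi`) are assumptions about the schema, so open one of your own checkpoint files to confirm before adapting it:

```python
import glob
import json
import os
import tempfile

# Fabricate a small checkpoints directory so the sketch is self-contained.
# Point ckpt_dir at your real co_datascientist_checkpoints directory instead.
ckpt_dir = tempfile.mkdtemp()
for i, kpi in [(1, 0.5), (7, 0.75), (12, 1.0)]:
    with open(os.path.join(ckpt_dir, f"iteration_{i}.json"), "w") as f:
        json.dump({"iteration": i, "kpi": kpi}, f)

# Load every checkpoint and collect (iteration, kpi) pairs.
history = []
for path in glob.glob(os.path.join(ckpt_dir, "*.json")):
    with open(path) as f:
        ckpt = json.load(f)
    history.append((ckpt["iteration"], ckpt["kpi"]))
history.sort()

best_iter, best_kpi = max(history, key=lambda p: p[1])
print(f"best KPI {best_kpi} at iteration {best_iter}")
```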

Generate PDF Code Diffs

Create beautiful PDF reports comparing two versions of your Python code:

# Basic usage - compare two Python files
co-datascientist diff-pdf baseline.py improved.py

# Advanced usage with custom options
co-datascientist diff-pdf \
  baseline.py \
  optimized.py \
  --output "optimization_report.pdf" \
  --title "XOR Problem Optimization Results"

Options:

  • file1: Path to the baseline/original file (required)
  • file2: Path to the modified/new file (required)
  • --output, -o: Output PDF file path (auto-generated if not specified)
  • --title, -t: Custom title for the diff report
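The PDF rendering is the tool's own; if you just want a quick textual diff of two versions, Python's stdlib `difflib` covers the same idea (the one-line file contents below are placeholders):

```python
import difflib

# Stand-ins for the contents of baseline.py and an improved checkpoint.
baseline = ["clf = LogisticRegression(random_state=0)\n"]
improved = ["clf = RandomForestClassifier(n_estimators=10, random_state=0)\n"]

diff = "".join(difflib.unified_diff(
    baseline, improved,
    fromfile="baseline.py", tofile="improved.py",
))
print(diff)
```

In practice you would read the two files with `open(...).readlines()` and diff those lists instead.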

Example workflow:

# 1. Run optimization
co-datascientist run --script-path xor.py --parallel 3

# 2. Plot the KPI progression
co-datascientist plot-kpi --checkpoints-dir co_datascientist_checkpoints --title "XOR Optimization"

# 3. Compare best result with baseline
co-datascientist diff-pdf xor.py co_datascientist_checkpoints/best_iteration_50.py --title "XOR Improvements"

These tools help you understand your optimization journey and create professional reports showing the improvements Co-DataScientist achieved.

Need help?

We’d love to chat: oz.kilim@tropiflo.io


All set? Run your pipelines and track the results.

⚠️ Disclaimer: Co-DataScientist executes your scripts on your own machine. Make sure you trust the code you feed it!

Made by the Tropiflo team
