
Human-readable ML pipeline language with DSL, debugging, and visualization

Project description

🔥 PipelineScript - Human-Readable ML Pipeline Language

Transform machine learning pipelines from code into conversation.

Python 3.8+ License: MIT PyPI version


🚀 What is PipelineScript?

PipelineScript is a revolutionary Domain-Specific Language (DSL) that makes machine learning pipelines readable, debuggable, and accessible to everyone. No more nested code, complex APIs, or cryptic configurations.

Before PipelineScript:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Load data
data = pd.read_csv('data.csv')

# Clean
data = data.dropna()

# Encode categoricals
from sklearn.preprocessing import LabelEncoder
for col in data.select_dtypes(['object']).columns:
    data[col] = LabelEncoder().fit_transform(data[col])

# Split
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train
model = XGBClassifier()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

# Export
import pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

With PipelineScript:

load data.csv
clean missing
encode
split 80/20 --target target
scale
train xgboost
evaluate
export model.pkl

That's it. The same functionality, about 90% less code, and far more readable.


✨ Key Features

1. 🗣️ Human-Readable Syntax

Write ML pipelines like you'd describe them to a colleague:

load sales.csv
filter revenue > 1000
clean outliers
split 75/25 --target revenue
train xgboost
evaluate

2. ๐Ÿ› Interactive Debugging

Step through your pipeline like a regular program:

from pipelinescript import debug

debug("""
    load data.csv
    clean missing
    train xgboost
""")

Debugger commands:

  • step - Execute next step
  • break 3 - Set breakpoint at step 3
  • context - Show current data and model
  • inspect model - Inspect specific variable
  • continue - Run until completion

3. 📊 Built-in Visualization

Automatically visualize your pipeline structure:

from pipelinescript import run

run(script, visualize=True)

Generates ASCII or graphical pipeline diagrams showing data flow.

4. 🔗 Method Chaining API

Prefer Python? Use the fluent API:

from pipelinescript import Pipeline

result = (Pipeline()
    .load("data.csv")
    .clean_missing()
    .encode()
    .split(0.8, target="label")
    .train("xgboost")
    .evaluate()
    .export("model.pkl")
    .run())

5. ⚡ Quick Builders

Pre-built pipelines for common tasks:

from pipelinescript.pipeline import quick_classification

# One line for complete classification pipeline
result = quick_classification("data.csv", "label", "xgboost")

📦 Installation

pip install pipelinescript

Optional dependencies:

# For XGBoost models
pip install xgboost

# For visualization
pip install matplotlib

# For all features
pip install pipelinescript[full]

🎯 Quick Start

1. Create a Pipeline File (.psl)

my_pipeline.psl:

load iris.csv
clean missing
encode
split 80/20 --target species
train random_forest
evaluate
export iris_model.pkl

2. Run It

Command Line:

pipelinescript run my_pipeline.psl

Python:

from pipelinescript import run

result = run("my_pipeline.psl")

if result.success:
    print(f"✅ Accuracy: {result.context.metrics['accuracy']:.4f}")

That's it! Your model is trained, evaluated, and exported.


📖 Language Reference

Commands

Data Loading

load <filepath>              # Load data from file

Supported formats: CSV, Excel, JSON, Parquet
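
The loader presumably dispatches on file extension to pick the right reader. A minimal sketch of that idea — the `LOADERS` table and `pick_loader` helper are hypothetical illustrations, not PipelineScript's actual internals:

```python
from pathlib import Path

# Hypothetical mapping from file extension to a pandas-style reader name.
LOADERS = {
    ".csv": "read_csv",
    ".xlsx": "read_excel",
    ".json": "read_json",
    ".parquet": "read_parquet",
}

def pick_loader(filepath: str) -> str:
    """Return the reader name for a file, based on its extension."""
    suffix = Path(filepath).suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return LOADERS[suffix]
```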

Data Cleaning

clean missing                # Remove rows with missing values
clean duplicates             # Remove duplicate rows
clean outliers               # Remove statistical outliers (IQR method)
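
For reference, the IQR method keeps values inside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A pure-Python sketch of that rule (shown for intuition; not PipelineScript's internal code, which operates on DataFrames):

```python
def iqr_bounds(values):
    """Compute the (lower, upper) keep-range using the 1.5*IQR rule."""
    s = sorted(values)

    def quantile(q):
        # Linear interpolation between the two closest ranks.
        pos = q * (len(s) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def remove_outliers(values):
    """Drop values falling outside the IQR keep-range."""
    lo, hi = iqr_bounds(values)
    return [v for v in values if lo <= v <= hi]
```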

Data Transformation

encode                       # Encode categorical variables
scale                        # Scale numeric features (StandardScaler)
filter <condition>           # Filter rows (e.g., "age > 18")
select <col1> <col2> ...     # Select specific columns

Train/Test Split

split 80/20                  # Split data 80% train, 20% test
split 0.8 --target label     # Split with specific target column
split 75/25 --target price   # Custom ratio with target
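
The split command accepts either an `A/B` ratio or a plain fraction. A sketch of how such an argument could be normalized into a train fraction (`parse_split` is a hypothetical helper, not the library's actual parser):

```python
def parse_split(arg: str) -> float:
    """Normalize '80/20', '75/25', or '0.8' into a train fraction."""
    if "/" in arg:
        train, test = (float(p) for p in arg.split("/"))
        return train / (train + test)
    return float(arg)
```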

Model Training

train xgboost                # XGBoost (requires xgboost package)
train random_forest          # Random Forest
train logistic               # Logistic Regression
train linear                 # Linear Regression
train auto                   # Auto-select based on task
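
`train auto` selects a model family from the task. One plausible heuristic — a sketch only, not PipelineScript's documented behavior — is to treat a non-numeric or low-cardinality target as classification:

```python
def infer_task(target_values) -> str:
    """Guess 'classification' or 'regression' from the target column."""
    distinct = set(target_values)
    if any(isinstance(v, str) for v in distinct):
        return "classification"
    # Few distinct numeric values usually means class labels.
    if len(distinct) <= max(2, int(0.05 * len(target_values))):
        return "classification"
    return "regression"
```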

Evaluation

predict                      # Make predictions on test set
evaluate                     # Compute evaluation metrics

Model Export/Import

export model.pkl             # Save model to file
save model.pkl               # Alias for export
import model.pkl             # Load model from file

Options

Options use --flag or -f syntax:

split 80/20 --target revenue
train xgboost --n_estimators 100
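
A sketch of how `--flag value` tokens might be separated from positional arguments — the `split_options` helper is hypothetical, shown only to make the option syntax concrete:

```python
def split_options(tokens):
    """Separate positional args from --flag/-f options."""
    args, options = [], {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("-"):
            key = tok.lstrip("-")
            # Consume the following token as the option's value.
            options[key] = tokens[i + 1]
            i += 2
        else:
            args.append(tok)
            i += 1
    return args, options
```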

Comments

Use # for comments:

# Load and prepare data
load data.csv
clean missing  # Remove nulls

# Train model
train xgboost

🔥 Examples

Example 1: Basic Classification

load titanic.csv
clean missing
encode
split 80/20 --target survived
train random_forest
evaluate
export titanic_model.pkl

Example 2: Regression with Preprocessing

load housing.csv
clean outliers
select bedrooms bathrooms sqft price
scale
split 75/25 --target price
train linear
evaluate

Example 3: XGBoost with Feature Selection

load sales.csv
filter revenue > 1000
select date product revenue region
clean missing
encode
split 80/20 --target revenue
train xgboost
evaluate
export sales_model.pkl

Example 4: Interactive Debugging

from pipelinescript import debug

script = """
load data.csv
clean missing
split 80/20 --target label
train xgboost
evaluate
"""

result = debug(script)

# In debugger:
# (pdb) step           # Execute next step
# (pdb) context        # Show current state
# (pdb) inspect model  # Look at model
# (pdb) continue       # Run to completion

Example 5: Python API

from pipelinescript import Pipeline

# Method chaining
pipeline = (Pipeline()
    .load("data.csv")
    .clean_missing()
    .clean_outliers()
    .encode()
    .scale()
    .split(0.8, target="label")
    .train_xgboost()
    .evaluate()
    .export("model.pkl")
)

# Execute
result = pipeline.run()

# Show results
if result.success:
    print(f"Duration: {result.duration:.2f}s")
    print(f"Metrics: {result.context.metrics}")

Example 6: Quick Builders

from pipelinescript.pipeline import (
    quick_classification,
    quick_regression,
    quick_train
)

# Classification in one line
result = quick_classification("iris.csv", "species", "xgboost")

# Regression in one line
result = quick_regression("housing.csv", "price", "random_forest")

# Train and export in one line
result = quick_train("data.csv", "target", "model.pkl")

🎨 Visualization

ASCII Pipeline Diagram

from pipelinescript import run

run(script, visualize=True)

Output:

โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
    ๐Ÿ“Š PIPELINE VISUALIZATION
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

    START
      โ”‚
      โ–ผ
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    โ”‚ LOAD data.csv โ”‚
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
      โ”‚
      โ–ผ
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    โ”‚ CLEAN missing โ”‚
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
      โ”‚
      โ–ผ
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    โ”‚ TRAIN xgboost โ”‚
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
      โ”‚
      โ–ผ
    END

Graphical Pipeline (with matplotlib)

from pipelinescript import parse
from pipelinescript.visualizer import PipelineVisualizer

ast = parse(script)
visualizer = PipelineVisualizer()
visualizer.visualize_pipeline(ast, save_path="pipeline.png")

Generates a beautiful flowchart visualization.


๐Ÿ› Interactive Debugging

PipelineScript includes a powerful interactive debugger inspired by Python's pdb:

from pipelinescript import debug

debug("""
    load data.csv
    clean missing
    split 80/20 --target label
    train xgboost
    evaluate
""")

Debugger Commands

| Command | Alias | Description |
|---------|-------|-------------|
| run | r | Run until completion/breakpoint |
| step | s, next, n | Execute next step |
| continue | c, cont | Continue execution |
| break <n> | b | Set breakpoint at step n |
| clear <n> | | Clear breakpoint |
| list | l, ls | List all steps |
| context | ctx, vars | Show execution context |
| inspect <var> | i, p | Inspect variable |
| restart | | Restart from beginning |
| quit | q, exit | Quit debugger |

Example Debugging Session

(pdb) list
Pipeline Steps:
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
   โ†’ 1. load
     2. clean
     3. split
     4. train
     5. evaluate
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

(pdb) break 4
๐Ÿ”ด Breakpoint set at step 4

(pdb) run
โ–ถ๏ธ  Step 1: load
   Loaded 150 rows from iris.csv

โ–ถ๏ธ  Step 2: clean
   Removed 0 rows with missing values

โ–ถ๏ธ  Step 3: split
   Split data: 120 train, 30 test (80/20)

๐Ÿ”ด Breakpoint at step 4

(pdb) context
๐Ÿ“Š Execution Context:
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
  data: DataFrame (150, 5)
    columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
  X_train: (120, 4)
  X_test: (30, 4)

  Recent log entries:
    โ€ข Loaded 150 rows from iris.csv
    โ€ข Removed 0 rows with missing values
    โ€ข Split data: 120 train, 30 test (80/20)
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

(pdb) step
โ–ถ๏ธ  Step 4: train
   Trained XGBClassifier

(pdb) inspect model
model: XGBClassifier
  Value: XGBClassifier(...)

(pdb) continue
โ–ถ๏ธ  Step 5: evaluate
   Accuracy: 0.9667

โœ… Pipeline execution completed!

๐Ÿ—๏ธ Architecture

PipelineScript consists of five core components:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚          PipelineScript Engine              โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                             โ”‚
โ”‚  1. Parser     โ†’  Lexical analysis & AST   โ”‚
โ”‚  2. Compiler   โ†’  AST to executable steps  โ”‚
โ”‚  3. Executor   โ†’  Step execution engine    โ”‚
โ”‚  4. Debugger   โ†’  Interactive debugging    โ”‚
โ”‚  5. Visualizer โ†’  Pipeline visualization   โ”‚
โ”‚                                             โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

1. Parser (parser.py)

  • Lexical analysis (tokenization)
  • Syntax parsing
  • AST generation

2. Compiler (compiler.py)

  • Compiles AST into executable steps
  • Integrates with sklearn, xgboost
  • Handles data transformations

3. Executor (executor.py)

  • Executes compiled steps
  • Manages execution context
  • Handles errors and logging

4. Debugger (debugger.py)

  • Interactive step-through execution
  • Breakpoints and inspection
  • Context visualization

5. Visualizer (visualizer.py)

  • ASCII pipeline diagrams
  • Graphical visualizations
  • DAG export

🎯 Use Cases

1. Rapid Prototyping

Test different models and preprocessing strategies in minutes:

load data.csv
clean missing
split 80/20 --target label
train xgboost
evaluate

2. Teaching & Learning

Perfect for teaching ML concepts without drowning in code:

# Clear, readable steps students can understand
load iris.csv
split 70/30 --target species
train random_forest
evaluate

3. Reproducible Research

Pipeline scripts are version-controllable and self-documenting:

# research_pipeline.psl
load experiment_data.csv
clean outliers
split 80/20 --target outcome
train xgboost
evaluate

4. Automated ML

Easily generate and test multiple pipelines programmatically:

models = ['xgboost', 'random_forest', 'logistic']

for model in models:
    pipeline = Pipeline().load("data.csv").clean_missing()
    pipeline.split(0.8, target="label").train(model).evaluate()
    result = pipeline.run()
    print(f"{model}: {result.context.metrics['accuracy']}")

5. Production Pipelines

Export trained pipelines as standalone Python scripts or containers.
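
Assuming `export model.pkl` writes an ordinary pickle, as the "Before PipelineScript" example does, the exported artifact can be reloaded in a serving process with the standard library alone. A sketch with a stand-in object in place of a trained model:

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; a real export would hold a fitted estimator.
model = {"name": "xgboost", "n_features": 4}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, in a serving process:
with open(path, "rb") as f:
    restored = pickle.load(f)
```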


🔬 Advanced Usage

Custom Preprocessing

from pipelinescript import Pipeline

pipeline = Pipeline()
pipeline.load("data.csv")

# Custom filtering
pipeline.filter("age > 18 and income < 100000")

# Select features
pipeline.select("age", "income", "education")

# Continue pipeline
pipeline.clean_missing().encode().scale()
pipeline.split(0.8, target="default").train("xgboost")

result = pipeline.run()

Accessing Context

result = pipeline.run()

if result.success:
    # Access data
    print(result.context.data.head())
    
    # Access model
    model = result.context.model
    
    # Access metrics
    print(result.context.metrics)
    
    # Access predictions
    predictions = result.context.predictions
    
    # Access log
    for entry in result.context.log:
        print(entry)

Extending PipelineScript

Add custom commands by extending the compiler:

from pipelinescript.compiler import PipelineCompiler, CompiledStep
from pipelinescript.parser import ASTNode

class CustomCompiler(PipelineCompiler):
    def __init__(self):
        super().__init__()
        self.commands['my_command'] = self._compile_my_command
    
    def _compile_my_command(self, node: ASTNode):
        def custom_step(context):
            # Your custom logic
            return context
        
        return CompiledStep('my_command', custom_step, [], {}, node.line)

🚧 Roadmap

  • v0.2.0: GPU support (RAPIDS, cuML)
  • v0.3.0: Deep learning models (PyTorch, TensorFlow)
  • v0.4.0: AutoML integration
  • v0.5.0: Distributed training (Ray, Dask)
  • v0.6.0: Model serving integration
  • v0.7.0: Pipeline scheduling and monitoring
  • v1.0.0: Production-ready feature complete

๐Ÿค Contributing

Contributions welcome! Areas needing help:

  1. Additional model types (SVM, KNN, etc.)
  2. More preprocessing options
  3. Better visualizations
  4. Documentation improvements
  5. Test coverage

See CONTRIBUTING.md for guidelines.


📄 License

MIT License - see LICENSE file.


๐Ÿ™ Acknowledgments

PipelineScript was inspired by:

  • SQL's declarative simplicity
  • UNIX pipes' composability
  • scikit-learn's consistent API
  • The need for ML democratization

📊 Comparison

| Feature | PipelineScript | Sklearn | Keras | MLflow |
|---------|----------------|---------|-------|--------|
| Human-readable syntax | ✅ | ❌ | ❌ | ❌ |
| Interactive debugging | ✅ | ❌ | ❌ | ❌ |
| Built-in visualization | ✅ | ❌ | ✅ | ✅ |
| One-line pipelines | ✅ | ❌ | ❌ | ❌ |
| No code required | ✅ | ❌ | ❌ | ❌ |
| Production ready | 🚧 | ✅ | ✅ | ✅ |

🎓 Examples & Tutorials

See the examples/ directory for:

  • simple_classification.psl - Basic classification
  • xgboost_pipeline.psl - XGBoost example
  • regression.psl - Regression pipeline
  • python_examples.py - Python API examples
  • iris.csv - Sample dataset

📞 Support


🌟 Star History

If you find PipelineScript useful, please star the repo! ⭐


🔥 Built with ❤️ by Idriss Bado

Making machine learning pipelines human again.

Project details


Download files

Download the file for your platform.

Source Distribution

pipelinescript-0.1.3.tar.gz (31.2 kB)

Uploaded Source

Built Distribution


pipelinescript-0.1.3-py3-none-any.whl (25.8 kB)

Uploaded Python 3

File details

Details for the file pipelinescript-0.1.3.tar.gz.

File metadata

  • Download URL: pipelinescript-0.1.3.tar.gz
  • Upload date:
  • Size: 31.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for pipelinescript-0.1.3.tar.gz
Algorithm Hash digest
SHA256 64b9618bf2a4431d60337842c15dea935307f0b334866ae945f6057bafe9125e
MD5 b756d598b5c3d2d2804d10fd796944ac
BLAKE2b-256 efff00f82011b7cd6f05c4fc8777f103b4936db78c997a72fea58dd2936c0f80


File details

Details for the file pipelinescript-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: pipelinescript-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 25.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for pipelinescript-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 9356bedd25006ad754c5b4f42264ec638602b99b8bf16de5f734cbea653406c0
MD5 a78ac094738b91011094688163959821
BLAKE2b-256 393c64796b77df43484d315ff737ae42b015f49f823d9c8014443bf1625aa3ec

