Human-readable ML pipeline language with DSL, debugging, and visualization
# PipelineScript - Human-Readable ML Pipeline Language

Transform machine learning pipelines from code into conversation.
## What is PipelineScript?
PipelineScript is a revolutionary Domain-Specific Language (DSL) that makes machine learning pipelines readable, debuggable, and accessible to everyone. No more nested code, complex APIs, or cryptic configurations.
**Before PipelineScript:**

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Load data
data = pd.read_csv('data.csv')

# Clean
data = data.dropna()

# Encode categoricals
for col in data.select_dtypes(['object']).columns:
    data[col] = LabelEncoder().fit_transform(data[col])

# Split
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train
model = XGBClassifier()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

# Export
import pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
```
**With PipelineScript:**

```
load data.csv
clean missing
encode
split 80/20 --target target
scale
train xgboost
evaluate
export model.pkl
```
That's it. The same functionality in roughly a tenth of the code, and far more readable.
## Key Features
### 1. Human-Readable Syntax

Write ML pipelines like you'd describe them to a colleague:

```
load sales.csv
filter revenue > 1000
clean outliers
split 75/25 --target revenue
train xgboost
evaluate
```
### 2. Interactive Debugging

Step through your pipeline like a regular program:

```python
from pipelinescript import debug

debug("""
load data.csv
clean missing
train xgboost
""")
```

Debugger commands:

- `step` - Execute next step
- `break 3` - Set breakpoint at step 3
- `context` - Show current data and model
- `inspect model` - Inspect specific variable
- `continue` - Run until completion
### 3. Built-in Visualization

Automatically visualize your pipeline structure:

```python
from pipelinescript import run

run(script, visualize=True)
```

Generates ASCII or graphical pipeline diagrams showing data flow.
### 4. Method Chaining API

Prefer Python? Use the fluent API:

```python
from pipelinescript import Pipeline

result = (Pipeline()
    .load("data.csv")
    .clean_missing()
    .encode()
    .split(0.8, target="label")
    .train("xgboost")
    .evaluate()
    .export("model.pkl")
    .run())
```
### 5. Quick Builders

Pre-built pipelines for common tasks:

```python
from pipelinescript.pipeline import quick_classification

# One line for a complete classification pipeline
result = quick_classification("data.csv", "label", "xgboost")
```
## Installation

```bash
pip install pipelinescript
```

Optional dependencies:

```bash
# For XGBoost models
pip install xgboost

# For visualization
pip install matplotlib

# For all features
pip install pipelinescript[full]
```
## Quick Start

### 1. Create a Pipeline File (.psl)

`my_pipeline.psl`:

```
load iris.csv
clean missing
encode
split 80/20 --target species
train random_forest
evaluate
export iris_model.pkl
```
### 2. Run It

Command line:

```bash
pipelinescript run my_pipeline.psl
```

Python:

```python
from pipelinescript import run

result = run("my_pipeline.psl")
if result.success:
    print(f"Accuracy: {result.context.metrics['accuracy']:.4f}")
```

That's it! Your model is trained, evaluated, and exported.
## Language Reference

### Commands

#### Data Loading

```
load <filepath>    # Load data from file
```

Supported formats: CSV, Excel, JSON, Parquet
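A multi-format loader like this is often a simple extension-to-reader dispatch. The sketch below shows the idea; `load_data` and the `_READERS` table are illustrative names, not PipelineScript's actual internals:

```python
from pathlib import Path

import pandas as pd

# Map file extensions to pandas readers (illustrative; the real
# implementation may differ).
_READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".xlsx": pd.read_excel,
    ".parquet": pd.read_parquet,
}

def load_data(path: str) -> pd.DataFrame:
    """Pick a reader by file extension, or fail with a clear error."""
    reader = _READERS.get(Path(path).suffix.lower())
    if reader is None:
        raise ValueError(f"Unsupported format: {path}")
    return reader(path)
```

Dispatching on the suffix keeps `load` a single command while leaving room to register new formats later.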
#### Data Cleaning

```
clean missing      # Remove rows with missing values
clean duplicates   # Remove duplicate rows
clean outliers     # Remove statistical outliers (IQR method)
```
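The `clean outliers` step names the IQR method; a minimal sketch of that standard rule, assuming the usual 1.5×IQR fences (`clean_outliers_iqr` is an illustrative helper, not the library's API):

```python
import pandas as pd

def clean_outliers_iqr(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Drop rows where any numeric column falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = df.select_dtypes("number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return df[~outside.any(axis=1)]

df = pd.DataFrame({"x": [1, 2, 3, 4, 1000]})
print(len(clean_outliers_iqr(df)))  # 4: the extreme row is dropped
```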
#### Data Transformation

```
encode                    # Encode categorical variables
scale                     # Scale numeric features (StandardScaler)
filter <condition>        # Filter rows (e.g., "age > 18")
select <col1> <col2> ...  # Select specific columns
```
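The `filter <condition>` expressions map naturally onto `pandas.DataFrame.query`, which accepts the same comparison syntax. A sketch under that assumption (`apply_filter` is an illustrative name):

```python
import pandas as pd

def apply_filter(df: pd.DataFrame, condition: str) -> pd.DataFrame:
    """Keep only rows matching a condition string such as 'age > 18'."""
    # DataFrame.query understands comparisons plus 'and'/'or' combinators.
    return df.query(condition)

df = pd.DataFrame({"age": [15, 25, 40], "income": [0, 50_000, 120_000]})
print(len(apply_filter(df, "age > 18 and income < 100000")))  # 1
```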
#### Train/Test Split

```
split 80/20                 # Split data 80% train, 20% test
split 0.8 --target label    # Split with specific target column
split 75/25 --target price  # Custom ratio with target
```
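Since `split` accepts both the `80/20` and bare-fraction `0.8` forms, the ratio normalization can be sketched as follows (`parse_split_ratio` is a hypothetical helper, not the actual parser):

```python
def parse_split_ratio(spec: str) -> float:
    """Normalize '80/20', '75/25', or a bare fraction like '0.8' to a train fraction."""
    if "/" in spec:
        train, test = (float(part) for part in spec.split("/"))
        return train / (train + test)
    return float(spec)

print(parse_split_ratio("80/20"))  # 0.8
print(parse_split_ratio("75/25"))  # 0.75
```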
#### Model Training

```
train xgboost        # XGBoost (requires xgboost package)
train random_forest  # Random Forest
train logistic       # Logistic Regression
train linear         # Linear Regression
train auto           # Auto-select based on task
```
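`train auto` needs some heuristic to tell classification from regression. One plausible heuristic (illustrative only; PipelineScript's actual rule may differ) looks at the target column's dtype and cardinality:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def infer_task(y: pd.Series, max_classes: int = 20) -> str:
    """Guess the task type: non-numeric or low-cardinality targets -> classification."""
    if not is_numeric_dtype(y) or y.nunique() <= max_classes:
        return "classification"
    return "regression"

print(infer_task(pd.Series(["setosa", "versicolor", "virginica"])))  # classification
print(infer_task(pd.Series(range(1000)) * 1.5))                      # regression
```

The `max_classes` cutoff is an assumed threshold; a continuous numeric target with many distinct values is treated as regression.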
#### Evaluation

```
predict    # Make predictions on test set
evaluate   # Compute evaluation metrics
```
#### Model Export/Import

```
export model.pkl   # Save model to file
save model.pkl     # Alias for export
import model.pkl   # Load model from file
```
### Options

Options use `--flag` or `-f` syntax:

```
split 80/20 --target revenue
train xgboost --n_estimators 100
```
### Comments

Use `#` for comments:

```
# Load and prepare data
load data.csv
clean missing  # Remove nulls

# Train model
train xgboost
```
## Examples

### Example 1: Basic Classification

```
load titanic.csv
clean missing
encode
split 80/20 --target survived
train random_forest
evaluate
export titanic_model.pkl
```
### Example 2: Regression with Preprocessing

```
load housing.csv
clean outliers
select bedrooms bathrooms sqft price
scale
split 75/25 --target price
train linear
evaluate
```
### Example 3: XGBoost with Feature Selection

```
load sales.csv
filter revenue > 1000
select date product revenue region
clean missing
encode
split 80/20 --target revenue
train xgboost
evaluate
export sales_model.pkl
```
### Example 4: Interactive Debugging

```python
from pipelinescript import debug

script = """
load data.csv
clean missing
split 80/20 --target label
train xgboost
evaluate
"""

result = debug(script)

# In debugger:
# (pdb) step           # Execute next step
# (pdb) context        # Show current state
# (pdb) inspect model  # Look at model
# (pdb) continue       # Run to completion
```
### Example 5: Python API

```python
from pipelinescript import Pipeline

# Method chaining
pipeline = (Pipeline()
    .load("data.csv")
    .clean_missing()
    .clean_outliers()
    .encode()
    .scale()
    .split(0.8, target="label")
    .train_xgboost()
    .evaluate()
    .export("model.pkl")
)

# Execute
result = pipeline.run()

# Show results
if result.success:
    print(f"Duration: {result.duration:.2f}s")
    print(f"Metrics: {result.context.metrics}")
```
### Example 6: Quick Builders

```python
from pipelinescript.pipeline import (
    quick_classification,
    quick_regression,
    quick_train
)

# Classification in one line
result = quick_classification("iris.csv", "species", "xgboost")

# Regression in one line
result = quick_regression("housing.csv", "price", "random_forest")

# Train and export in one line
result = quick_train("data.csv", "target", "model.pkl")
```
## Visualization

### ASCII Pipeline Diagram

```python
from pipelinescript import run

run(script, visualize=True)
```

Output:

```
────────────────────────────────────────────────
  PIPELINE VISUALIZATION
────────────────────────────────────────────────

START
  │
  ▼
┌───────────────┐
│ LOAD data.csv │
└───────────────┘
  │
  ▼
┌───────────────┐
│ CLEAN missing │
└───────────────┘
  │
  ▼
┌───────────────┐
│ TRAIN xgboost │
└───────────────┘
  │
  ▼
END
```
### Graphical Pipeline (with matplotlib)

```python
from pipelinescript import parse
from pipelinescript.visualizer import PipelineVisualizer

ast = parse(script)
visualizer = PipelineVisualizer()
visualizer.visualize_pipeline(ast, save_path="pipeline.png")
```

Generates a flowchart visualization.
## Interactive Debugging

PipelineScript includes an interactive debugger inspired by Python's pdb:

```python
from pipelinescript import debug

debug("""
load data.csv
clean missing
split 80/20 --target label
train xgboost
evaluate
""")
```
### Debugger Commands

| Command | Alias | Description |
|---|---|---|
| `run` | `r` | Run until completion/breakpoint |
| `step` | `s`, `next`, `n` | Execute next step |
| `continue` | `c`, `cont` | Continue execution |
| `break <n>` | `b` | Set breakpoint at step n |
| `clear <n>` | | Clear breakpoint |
| `list` | `l`, `ls` | List all steps |
| `context` | `ctx`, `vars` | Show execution context |
| `inspect <var>` | `i`, `p` | Inspect variable |
| `restart` | | Restart from beginning |
| `quit` | `q`, `exit` | Quit debugger |
### Example Debugging Session

```
(pdb) list
Pipeline Steps:
──────────────────────────────────────────────
  1. load
  2. clean
  3. split
  4. train
  5. evaluate
──────────────────────────────────────────────

(pdb) break 4
Breakpoint set at step 4

(pdb) run
Step 1: load
  Loaded 150 rows from iris.csv
Step 2: clean
  Removed 0 rows with missing values
Step 3: split
  Split data: 120 train, 30 test (80/20)
Breakpoint at step 4

(pdb) context
Execution Context:
──────────────────────────────────────────────
  data: DataFrame (150, 5)
  columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
  X_train: (120, 4)
  X_test: (30, 4)
  Recent log entries:
    • Loaded 150 rows from iris.csv
    • Removed 0 rows with missing values
    • Split data: 120 train, 30 test (80/20)
──────────────────────────────────────────────

(pdb) step
Step 4: train
  Trained XGBClassifier

(pdb) inspect model
model: XGBClassifier
  Value: XGBClassifier(...)

(pdb) continue
Step 5: evaluate
  Accuracy: 0.9667

Pipeline execution completed!
```
## Architecture

PipelineScript consists of five core components:

```
┌──────────────────────────────────────────────┐
│            PipelineScript Engine             │
├──────────────────────────────────────────────┤
│                                              │
│  1. Parser      → Lexical analysis & AST     │
│  2. Compiler    → AST to executable steps    │
│  3. Executor    → Step execution engine      │
│  4. Debugger    → Interactive debugging      │
│  5. Visualizer  → Pipeline visualization     │
│                                              │
└──────────────────────────────────────────────┘
```
### 1. Parser (`parser.py`)

- Lexical analysis (tokenization)
- Syntax parsing
- AST generation
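As a rough illustration of the parser stage (a simplified sketch, not the real `parser.py`): each statement strips `#` comments, tokenizes on whitespace, and separates `--key value` options from positional arguments:

```python
from dataclasses import dataclass, field

@dataclass
class ASTNode:
    """Illustrative AST node: one parsed statement of the DSL."""
    command: str
    args: list = field(default_factory=list)
    options: dict = field(default_factory=dict)
    line: int = 0

def parse_line(text: str, line_no: int = 1) -> ASTNode:
    """Tokenize one statement into command, positional args, and --key options."""
    tokens = text.split("#", 1)[0].split()  # drop trailing comment first
    command, rest = tokens[0], tokens[1:]
    args, options = [], {}
    i = 0
    while i < len(rest):
        if rest[i].startswith("--"):
            # Assume every --flag is followed by exactly one value token.
            options[rest[i][2:]] = rest[i + 1]
            i += 2
        else:
            args.append(rest[i])
            i += 1
    return ASTNode(command, args, options, line_no)

node = parse_line("split 80/20 --target revenue  # hold out 20%")
print(node.command, node.args, node.options)
# split ['80/20'] {'target': 'revenue'}
```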
### 2. Compiler (`compiler.py`)

- Compiles AST into executable steps
- Integrates with sklearn, xgboost
- Handles data transformations

### 3. Executor (`executor.py`)

- Executes compiled steps
- Manages execution context
- Handles errors and logging

### 4. Debugger (`debugger.py`)

- Interactive step-through execution
- Breakpoints and inspection
- Context visualization

### 5. Visualizer (`visualizer.py`)

- ASCII pipeline diagrams
- Graphical visualizations
- DAG export
## Use Cases

### 1. Rapid Prototyping

Test different models and preprocessing strategies in minutes:

```
load data.csv
clean missing
split 80/20 --target label
train xgboost
evaluate
```

### 2. Teaching & Learning

Perfect for teaching ML concepts without drowning in code:

```
# Clear, readable steps students can understand
load iris.csv
split 70/30 --target species
train random_forest
evaluate
```
### 3. Reproducible Research

Pipeline scripts are version-controllable and self-documenting:

```
# research_pipeline.psl
load experiment_data.csv
clean outliers
split 80/20 --target outcome
train xgboost
evaluate
```
### 4. Automated ML

Easily generate and test multiple pipelines programmatically:

```python
models = ['xgboost', 'random_forest', 'logistic']

for model in models:
    pipeline = Pipeline().load("data.csv").clean_missing()
    pipeline.split(0.8, target="label").train(model).evaluate()
    result = pipeline.run()
    print(f"{model}: {result.context.metrics['accuracy']}")
```
### 5. Production Pipelines

Export trained pipelines as standalone Python scripts or containers.
## Advanced Usage

### Custom Preprocessing

```python
from pipelinescript import Pipeline

pipeline = Pipeline()
pipeline.load("data.csv")

# Custom filtering
pipeline.filter("age > 18 and income < 100000")

# Select features
pipeline.select("age", "income", "education")

# Continue pipeline
pipeline.clean_missing().encode().scale()
pipeline.split(0.8, target="default").train("xgboost")

result = pipeline.run()
```
### Accessing Context

```python
result = pipeline.run()

if result.success:
    # Access data
    print(result.context.data.head())

    # Access model
    model = result.context.model

    # Access metrics
    print(result.context.metrics)

    # Access predictions
    predictions = result.context.predictions

    # Access log
    for entry in result.context.log:
        print(entry)
```
### Extending PipelineScript

Add custom commands by extending the compiler:

```python
from pipelinescript.compiler import PipelineCompiler, CompiledStep
from pipelinescript.parser import ASTNode

class CustomCompiler(PipelineCompiler):
    def __init__(self):
        super().__init__()
        self.commands['my_command'] = self._compile_my_command

    def _compile_my_command(self, node: ASTNode):
        def custom_step(context):
            # Your custom logic
            return context
        return CompiledStep('my_command', custom_step, [], {}, node.line)
```
## Roadmap

- **v0.2.0**: GPU support (RAPIDS, cuML)
- **v0.3.0**: Deep learning models (PyTorch, TensorFlow)
- **v0.4.0**: AutoML integration
- **v0.5.0**: Distributed training (Ray, Dask)
- **v0.6.0**: Model serving integration
- **v0.7.0**: Pipeline scheduling and monitoring
- **v1.0.0**: Production-ready, feature complete
## Contributing
Contributions welcome! Areas needing help:
- Additional model types (SVM, KNN, etc.)
- More preprocessing options
- Better visualizations
- Documentation improvements
- Test coverage
See CONTRIBUTING.md for guidelines.
## License
MIT License - see LICENSE file.
## Acknowledgments
PipelineScript was inspired by:
- SQL's declarative simplicity
- UNIX pipes' composability
- scikit-learn's consistent API
- The need for ML democratization
## Comparison

| Feature | PipelineScript | Sklearn | Keras | MLflow |
|---|---|---|---|---|
| Human-readable syntax | ✅ | ❌ | ❌ | ❌ |
| Interactive debugging | ✅ | ❌ | ❌ | ❌ |
| Built-in visualization | ✅ | ❌ | ❌ | ❌ |
| One-line pipelines | ✅ | ❌ | ❌ | ❌ |
| No code required | ✅ | ❌ | ❌ | ❌ |
| Production ready | 🚧 | ✅ | ✅ | ✅ |
## Examples & Tutorials

See the examples/ directory for:

- `simple_classification.psl` - Basic classification
- `xgboost_pipeline.psl` - XGBoost example
- `regression.psl` - Regression pipeline
- `python_examples.py` - Python API examples
- `iris.csv` - Sample dataset
## Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: idrissbadoolivier@gmail.com
## Star History

If you find PipelineScript useful, please star the repo! ⭐

Built with ❤️ by Idriss Bado

*Making machine learning pipelines human again.*
## File details

Details for the file pipelinescript-0.1.3.tar.gz.

### File metadata

- Download URL: pipelinescript-0.1.3.tar.gz
- Upload date:
- Size: 31.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `64b9618bf2a4431d60337842c15dea935307f0b334866ae945f6057bafe9125e` |
| MD5 | `b756d598b5c3d2d2804d10fd796944ac` |
| BLAKE2b-256 | `efff00f82011b7cd6f05c4fc8777f103b4936db78c997a72fea58dd2936c0f80` |
## File details

Details for the file pipelinescript-0.1.3-py3-none-any.whl.

### File metadata

- Download URL: pipelinescript-0.1.3-py3-none-any.whl
- Upload date:
- Size: 25.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `9356bedd25006ad754c5b4f42264ec638602b99b8bf16de5f734cbea653406c0` |
| MD5 | `a78ac094738b91011094688163959821` |
| BLAKE2b-256 | `393c64796b77df43484d315ff737ae42b015f49f823d9c8014443bf1625aa3ec` |