mini-sisso

High-performance SISSO implementation with Rust backend.

mini-sisso is a lightweight and user-friendly Python implementation of the SISSO (Sure Independence Screening and Sparsifying Operator) symbolic regression algorithm. It offers full compatibility with the scikit-learn ecosystem for discovering interpretable mathematical models from data.

Inheriting the advanced exploration capabilities of the original C++/Fortran-based implementations, mini-sisso delivers the following features in a more modern and accessible package, powered by a blazing-fast Rust backend:

  • 🚀 Easy Adoption: Simple pip install. The default CPU version has minimal dependencies (NumPy/SciPy), ensuring a hassle-free setup.
  • 🦀 High-Performance Rust Backend:
    • The computationally intensive exhaustive search is implemented in Rust, delivering performance far superior to pure Python implementations while remaining completely transparent to the user.
  • 🧠 Memory Efficiency & Fast Exploration:
    • A "recipe-based" architecture dramatically reduces memory consumption during Feature Expansion.
    • The "Level-wise SIS" feature (toggleable) speeds up exploration by pruning unpromising features early.
  • ⚖️ Balanced Feature Selection:
    • Implements a "Split Selection" strategy during SIS. This ensures that unary operators (like sin, exp) are not "crowded out" by the overwhelming number of binary combinations, preserving feature diversity.
  • 🤝 Full scikit-learn Compatibility: Seamlessly integrates with powerful tools like GridSearchCV and Pipeline, in addition to the standard fit()/predict() interface.
  • ⚡ Optional GPU Support: Achieve significant speedups with GPU acceleration by installing the optional PyTorch backend.

📥 Installation

CPU Version (Default, Recommended)

Installs the lightweight CPU version from PyPI. It includes the optimized Rust backend.

pip install mini-sisso

GPU Version (Optional)

To enable GPU acceleration with the PyTorch backend, install with the [gpu] option.

pip install "mini-sisso[gpu]"

🚀 Quick Start

Discover a mathematical model from your data in just a few lines of code.

import pandas as pd
import numpy as np
from mini_sisso.model import MiniSisso

# 1. Prepare data
np.random.seed(42) # Set seed for reproducibility
X_df = pd.DataFrame(np.random.rand(100, 2) * [2, 3], columns=["feature_A", "feature_B"])
# True equation: y = 2*sin(feature_A) + feature_B^2 + noise
y_series = pd.Series(2 * np.sin(X_df["feature_A"]) + X_df["feature_B"]**2 + np.random.randn(100) * 0.1)

# 2. Instantiate the Model (Full Hyperparameter List)
# Uncomment the parameters you need to change.
model = MiniSisso(
    # --- Control the fundamental search space ---
    n_expansion=2,                      # Depth of feature expansion (deeper finds more complex equations)
    operators=["+", "sin", "pow2"],     # List of operators for feature expansion
    
    # --- Select the main search strategy ---
    so_method="exhaustive",             # Model search strategy ('exhaustive', 'lasso', 'lightgbm')
    
    # --- Detailed settings for each strategy (selection_params) ---
    selection_params={
        # -- Parameters for "exhaustive" method --
        'n_term': 2,                    # Maximum number of terms in the discovered equation
        'n_sis_features': 10,           # Number of SIS candidates for each term
        
        # -- Parameters for "lasso" method --
        # 'alpha': 0.01,                # Regularization strength for Lasso
        
        # -- Parameters for "lightgbm" method --
        # 'n_features_to_select': 20,   # Number of features to select with LightGBM
        # 'lightgbm_params': {'n_estimators': 100, 'random_state': 42}, # Parameters for the LightGBM model itself
        
        # -- Optional preprocessing filters for "lasso"/"lightgbm" --
        # 'n_global_sis_features': 200, # Number of candidates to pre-screen based on correlation with target
        # 'collinearity_filter': 'mi',  # Method to calculate correlation between candidates ('mi' or 'dcor')
        # 'collinearity_threshold': 0.9, # Correlation threshold for the above filter
    },
    
    # --- Control computational efficiency ---
    use_levelwise_sis=True,             # Use staged search for speed (strongly recommended)
    n_level_sis_features=50,            # Number of promising features to keep at each expansion level
    
    # --- Select the execution environment ---
    # device="cuda",                      # Specify 'cuda' to use GPU
)

# 3. Fit the model
model.fit(X_df, y_series)

# 4. Check the results
print("\n--- Fit Results ---")
print(f"Discovered Equation: {model.equation_}")
print(f"Training RMSE: {model.rmse_:.4f}")
print(f"Training R2 Score: {model.r2_:.4f}")

# 5. Make predictions
print("\n--- Prediction ---")
X_test_df = pd.DataFrame(np.array([[0.5, 1.0], [1.0, 2.0]]), columns=["feature_A", "feature_B"])
predictions = model.predict(X_test_df)
print(f"Predictions for new data ([0.5, 1.0], [1.0, 2.0]): {predictions}")

Example Output:

Using NumPy/SciPy backend for CPU execution.
*** Starting Level-wise Recipe Generation (Level-wise SIS: ON, k_per_level=50) ***
Level 1: Generated 5, selected top 5. Total promising: 7. Time: 0.00s
Level 2: Generated 30, selected top 30. Total promising: 37. Time: 0.00s
***************** Starting SISSO Regressor (NumPy/SciPy Backend, Method: exhaustive) *****************

===== Searching for 1-term models =====
...
===== Searching for 2-term models =====
...
Best 2-term model: RMSE=0.092124, Eq: +0.998492 * ^2(feature_B) +1.971237 * sin(feature_A) +0.030610
Time: 0.01 seconds

==================================================
SISSO fitting finished. Total time: 0.02s
==================================================

Best Model Found (2 terms):
  RMSE: 0.092124
  R2:   0.998806
  Equation: +0.998492 * ^2(feature_B) +1.971237 * sin(feature_A) +0.030610

--- Fit Results ---
Discovered Equation: +0.998492 * ^2(feature_B) +1.971237 * sin(feature_A) +0.030610
Training RMSE: 0.0921
Training R2 Score: 0.9988

--- Predictions ---
Predictions for new data ([0.5, 1.0], [1.0, 2.0]): [2.0016012 5.6796584]

🛠️ Usage Guide: Controlling the Search with Hyperparameters

The mini-sisso search process follows this workflow, with each step controlled by hyperparameters.

Workflow Overview

  1. Feature Expansion: Generates a large number of candidate features based on operators and n_expansion.
    • This process is made efficient by use_levelwise_sis=True and n_level_sis_features.
  2. [Optional] Preprocessing Filters: A set of filters to prune candidate features when using lasso or lightgbm. (Configured in selection_params).
    • Global SIS: Removes features with low correlation to the target y.
    • Collinearity Filter: Removes highly correlated features from each other.
  3. Model Search (Sparsifying Operator): The final model is discovered from the pruned candidates using the strategy specified by so_method.

Main Hyperparameters

so_method: The Three Model Search Strategies

The so_method parameter determines the core search approach.

1. so_method="exhaustive" (Default)

The classic SISSO approach. It uses iterative SIS and an exhaustive search powered by Rust to find the optimal model. Best for finding simple, interpretable models.

# Exhaustively search for models up to 3 terms
model = MiniSisso(
    so_method="exhaustive",
    selection_params={
        'n_term': 3,          # Max number of terms to search for
        'n_sis_features': 15  # Number of candidates to add to the pool at each SIS step
    }
)

2. so_method="lasso"

Uses Lasso regression as a feature selector to build a model quickly. Effective for large feature spaces.

# Select features using Lasso
model = MiniSisso(
    so_method="lasso",
    selection_params={
        'alpha': 0.01 # Regularization parameter for Lasso
    }
)

3. so_method="lightgbm"

Uses LightGBM as a feature selector. Excels at capturing non-linear relationships.

# Select top 20 features using LightGBM
model = MiniSisso(
    so_method="lightgbm",
    selection_params={
        'n_features_to_select': 20
    }
)

selection_params: Detailed Control for Each Strategy

The selection_params dictionary allows you to apply preprocessing filters and fine-tune each so_method.

Preprocessing Filters (for lasso/lightgbm)
  • n_global_sis_features: Pre-screens candidates by removing those with low correlation to the target y.
  • collinearity_filter: Removes highly correlated features to stabilize Lasso/LightGBM. Can be 'mi' (Mutual Information) or 'dcor' (Distance Correlation); the cutoff is set via collinearity_threshold.

# Before running LightGBM, pre-screen to the top 200 features,
# then remove pairs with an MI score > 0.9
model = MiniSisso(
    so_method='lightgbm',
    selection_params={
        'n_global_sis_features': 200,
        'collinearity_filter': 'mi',
        'collinearity_threshold': 0.9,
        'n_features_to_select': 20
    }
)

Expert Settings (for lightgbm)

You can also pass hyperparameters directly to the underlying LightGBM model.

model = MiniSisso(
    so_method='lightgbm',
    selection_params={
        'n_features_to_select': 20,
        'lightgbm_params': {
            'n_estimators': 200,         # Number of trees
            'num_leaves': 31,            # Max number of leaves in one tree
            'learning_rate': 0.05,       # Learning rate
            'colsample_bytree': 0.8,     # Fraction of features to be considered for each tree
            'subsample': 0.8,            # Fraction of data to be used for each tree
            'reg_alpha': 0.1,            # L1 regularization
            'reg_lambda': 0.1,           # L2 regularization
            'random_state': 42,
            'n_jobs': -1,
            'verbosity': -1,
        }
    }
)

Other Key Parameters

  • use_levelwise_sis (bool, default=True): Strongly recommended. Speeds up feature generation and saves memory.
  • n_level_sis_features (int, default=50): Number of features to keep at each stage when use_levelwise_sis=True.
  • device (str, default="cpu"): Set to "cuda" to use the GPU backend.
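
These knobs combine freely. A minimal sketch using the parameters above (device="cuda" assumes the optional [gpu] install):

from mini_sisso.model import MiniSisso

model = MiniSisso(
    n_expansion=3,                # deeper expansion -> larger candidate space
    operators=["+", "*", "pow2"],
    use_levelwise_sis=True,       # prune unpromising features at each level
    n_level_sis_features=100,     # keep more candidates per level for a wider search
    # device="cuda",              # uncomment to run on the GPU backend
)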

Available Operators

Specify the operators argument as a list of strings.

Operator  Description
'+'       Addition (a + b)
'-'       Subtraction (a - b)
'*'       Multiplication (a * b)
'/'       Division (a / b)
'sin'     Sine (sin(a))
'cos'     Cosine (cos(a))
'exp'     Exponential (e^a)
'log'     Natural logarithm (ln(a))
'sqrt'    Square root (sqrt(a))
'pow2'    Square (a^2)
'pow3'    Cube (a^3)
'inv'     Reciprocal (1/a)
'|-|'     Absolute difference (|a - b|)
'cbrt'    Cube root (a^(1/3))
'abs'     Absolute value (|a|)
'scd'     Standard Cauchy distribution (1 / (π * (1 + a^2)))
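
For example, a search restricted to a hand-picked operator set (a sketch; operator strings as listed in the table above):

model = MiniSisso(
    n_expansion=2,
    operators=["+", "/", "log", "sqrt", "|-|"],
    selection_params={'n_term': 2},
)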

🤝 scikit-learn Ecosystem Integration

mini-sisso inherits BaseEstimator and RegressorMixin from scikit-learn, allowing it to seamlessly integrate with the powerful tools provided by scikit-learn.
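
For example, cross-validation works out of the box. A minimal sketch using scikit-learn's cross_val_score (reusing X_df and y_series from the Quick Start):

from sklearn.model_selection import cross_val_score
from mini_sisso.model import MiniSisso

model = MiniSisso(n_expansion=2, operators=["+", "sin", "pow2"],
                  selection_params={'n_term': 2})
scores = cross_val_score(model, X_df, y_series, cv=3,
                         scoring='neg_root_mean_squared_error')
print(f"Mean CV RMSE: {-scores.mean():.4f}")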

Detailed Pipeline Usage

Pipeline is a tool for connecting multiple processing steps and treating them as a single estimator.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from mini_sisso.model import MiniSisso

# Pipeline definition
# Note: scaling transforms the input features, so the formula MiniSisso discovers
# is expressed in scaled units, which can hurt interpretability. StandardScaler is
# therefore generally not recommended for MiniSisso; it appears here only to
# demonstrate how Pipeline works technically.
pipeline = Pipeline([
    # Step 1: standardization, registered under the name 'scaler'
    ('scaler', StandardScaler()),  # usually unnecessary for MiniSisso
    # Step 2: symbolic regression, registered under the name 'sisso'
    ('sisso', MiniSisso(n_expansion=2, selection_params={'n_term': 2},
                        operators=["+", "sin", "pow2"]))
])

# Train the entire pipeline: X -> scaler.fit_transform -> sisso.fit
pipeline.fit(X_df, y_series)

# Predict using the pipeline: X -> scaler.transform -> sisso.predict
predictions = pipeline.predict(X_df)

# You can also access and change parameters for each step of the pipeline.
# Example: Changing the number of SISSO terms after training
# pipeline.set_params(sisso__selection_params={'n_term': 3})
print(f"Number of terms in the SISSO step of the pipeline: {pipeline.named_steps['sisso'].selection_params['n_term']}")

Advanced GridSearchCV Usage

GridSearchCV can automatically find the best combination of hyperparameters, including the so_method itself. The __ (double underscore) syntax allows you to search nested parameters within selection_params.

from sklearn.model_selection import GridSearchCV

# Define a list of parameter grids to search over
param_grid = [
    # Case 1: Search patterns for exhaustive method
    {
        'so_method': ['exhaustive'],
        'selection_params': [
            {'n_term': 2, 'n_sis_features': 10},
            {'n_term': 3, 'n_sis_features': 15}
        ]
    },
    # Case 2: Search patterns for lasso method
    {
        'so_method': ['lasso'],
        'selection_params': [
            {'alpha': 0.01, 'collinearity_filter': 'mi'},
            {'alpha': 0.005}
        ]
    },
    # Case 3: Search patterns for lightgbm method
    {
        'so_method': ['lightgbm'],
        # Grid values below are illustrative examples
        'selection_params__n_features_to_select': [10, 20],
        'selection_params__lightgbm_params__n_estimators': [100, 200],
    }
]

grid_search = GridSearchCV(
    MiniSisso(n_expansion=2, operators=['+', 'sin', 'pow2']),
    param_grid, cv=3, scoring='neg_root_mean_squared_error', n_jobs=-1, verbose=1
)

print("Starting GridSearchCV to find the best method and parameters...")
grid_search.fit(X_df, y_series)

print(f"\nBest search method and params: {grid_search.best_params_}")
print(f"Equation from the best model: {grid_search.best_estimator_.equation_}")

⚙️ API Reference

MiniSisso

class MiniSisso(BaseEstimator, RegressorMixin):
    def __init__(self, n_expansion: int = 2, operators: list = None,
                 so_method: str = "exhaustive", selection_params: dict = None,
                 use_levelwise_sis: bool = True, n_level_sis_features: int = 50,
                 device: str = "cpu"):

Parameters

  • n_expansion (int, default=2): Max level of feature expansion.
  • operators (list[str], required): List of operators for feature generation.
  • so_method (str, default="exhaustive"): Model search strategy ("exhaustive", "lasso", "lightgbm").
  • selection_params (dict, optional): Dictionary of detailed parameters for the selected so_method and preprocessing filters.
  • use_levelwise_sis (bool, default=True): Toggles the level-wise SIS feature.
  • n_level_sis_features (int, default=50): Number of features to keep at each level if use_levelwise_sis=True.
  • device (str, default="cpu"): Computation device ("cpu" or "cuda").

fit(X, y)

Fits the model to the training data.

Parameters

  • X (array-like or pd.DataFrame): The feature data, shape (n_samples, n_features).
  • y (array-like or pd.Series): The target variable data, shape (n_samples,).

Returns

  • self: The fitted MiniSisso instance.
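
Since fit() returns self, calls can be chained in the usual scikit-learn style (a one-line sketch):

equation = MiniSisso(n_expansion=2, operators=["+", "pow2"]).fit(X_df, y_series).equation_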

predict(X)

Makes predictions using the fitted model.

Parameters

  • X (array-like or pd.DataFrame): The data to make predictions on.

Returns

  • np.ndarray: A NumPy array of the predictions.

score(X, y)

Returns the coefficient of determination (R² score) of the prediction.

Parameters

  • X (array-like or pd.DataFrame): The feature data.
  • y (array-like or pd.Series): The true target variable data.

Returns

  • float: The R² score.

Fitted Attributes

After calling fit(), you can access the following attributes:

  • model.equation_ (str): The best mathematical model found.
  • model.rmse_ (float): The RMSE of the best model on the training data.
  • model.r2_ (float): The R2 score of the best model on the training data.
  • model.coef_ (np.ndarray): The coefficients for each term in the best model.
  • model.intercept_ (float): The intercept of the best model.
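
A short sketch inspecting a fitted model (attribute names as documented above):

model.fit(X_df, y_series)
print(model.equation_)                # best model as a human-readable string
print(model.coef_, model.intercept_)  # per-term coefficients and intercept
print(model.score(X_df, y_series))    # R² score on the training data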

📜 License

This project is licensed under the MIT License.

🙏 Acknowledgements

This library was greatly inspired by the original SISSO algorithm paper and is built upon the fantastic open-source projects NumPy, SciPy, Pandas, scikit-learn, and PyTorch.
