A framework for machine learning and deep learning, covering regression, classification, and time series analysis

Welcome to LeCrapaud

An all-in-one machine learning framework

🚀 Introduction

LeCrapaud is a high-level Python library for end-to-end machine learning workflows on tabular data, with a focus on financial and stock datasets. It provides a simple API to handle feature engineering, model selection, training, and prediction, all in a reproducible and modular way.

✨ Key Features

  • 🧩 Modular pipeline: Feature engineering, preprocessing, selection, and modeling as independent steps
  • 🤖 Automated model selection and hyperparameter optimization
  • 📊 Easy integration with pandas DataFrames
  • 🔬 Supports both regression and classification tasks
  • 🛠️ Simple API for both full pipeline and step-by-step usage
  • 📦 Ready for production and research workflows

⚡ Quick Start

Install the package

pip install lecrapaud

How it works

This package provides a high-level API to manage experiments for feature engineering, model selection, and prediction on tabular data (e.g. stock data).

Typical workflow

from lecrapaud import LeCrapaud

# Create a new experiment with data
experiment = LeCrapaud(
    data=your_dataframe,
    target_numbers=[1, 2],
    target_clf=[2],  # TARGET_2 is classification
    columns_drop=[...],
    columns_date=[...],
    # ... other config options
)

# Train the model
experiment.fit(your_dataframe)

# Make predictions
predictions, reg_scores, clf_scores = experiment.predict(new_data)

# Load existing experiment by ID
experiment = LeCrapaud(id=123)

# Or get best experiment by name
best_exp = LeCrapaud.get_best_experiment_by_name('my_experiment')

Database Configuration (Required)

LeCrapaud requires access to a MySQL database to store experiments and results. You can configure the database by:

  • Passing a valid MySQL URI to the constructor:
    experiment = LeCrapaud(uri="mysql+pymysql://user:password@host:port/dbname", data=df, ...)
    
  • OR setting environment variables:
    • DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, DB_NAME
    • Or set DB_URI directly with your full connection string.

If neither is provided, database operations will not work.
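
As a minimal sketch of the environment-variable route (placeholder credentials; set these before constructing LeCrapaud):

import os

# Option A: individual connection settings
os.environ["DB_USER"] = "user"
os.environ["DB_PASSWORD"] = "password"
os.environ["DB_HOST"] = "localhost"
os.environ["DB_PORT"] = "3306"
os.environ["DB_NAME"] = "dbname"

# Option B: a single full connection string instead
os.environ["DB_URI"] = "mysql+pymysql://user:password@localhost:3306/dbname"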

Using OpenAI Embeddings (Optional)

If you want to use the columns_pca embedding feature (for advanced feature engineering), you must set the OPENAI_API_KEY environment variable with your OpenAI API key:

export OPENAI_API_KEY=sk-...

If this variable is not set, features relying on OpenAI embeddings will not be available.
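
For illustration only (the column name DESCRIPTION is hypothetical), a text column routed through columns_pca would need the key to be available before the experiment is created:

import os

os.environ["OPENAI_API_KEY"] = "sk-..."  # or export it in your shell as shown above

experiment = LeCrapaud(
    data=your_dataframe,
    experiment_name="with_text_embeddings",
    columns_pca=["DESCRIPTION"],  # text column handled by the embedding feature
    # ... other config options
)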

Experiment Context Arguments

The experiment context is a dictionary containing all configuration parameters for your ML pipeline. Parameters are stored in the experiment's database record and automatically retrieved when loading an existing experiment.

Required Parameters

  • data (DataFrame): Input dataset (required for new experiments only). Example: pd.DataFrame(...)
  • experiment_name (str): Unique name for the experiment. Example: 'stock_prediction'
  • date_column (str): Name of the date column (required for time series). Example: 'DATE'
  • group_column (str): Name of the group column (required for panel data). Example: 'STOCK'
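
A minimal construction using only the required parameters might look like this (a sketch; the column names match the examples below, and df is your pandas DataFrame):

experiment = LeCrapaud(
    data=df,
    experiment_name="stock_prediction",
    date_column="DATE",    # required for time series data
    group_column="STOCK",  # required for panel data
)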

Feature Engineering Parameters

  • columns_drop (list, default []): Columns to drop during feature engineering
  • columns_boolean (list, default []): Columns to convert to boolean features
  • columns_date (list, default []): Date columns for cyclic encoding
  • columns_te_groupby (list, default []): Group-by columns for target encoding
  • columns_te_target (list, default []): Target columns for target encoding

Preprocessing Parameters

  • time_series (bool, default False): Whether the data is a time series
  • val_size (float, default 0.2): Validation set size (fraction)
  • test_size (float, default 0.2): Test set size (fraction)
  • columns_pca (list, default []): Columns for PCA transformation
  • pca_temporal (list, default []): Temporal PCA config (e.g., lag features)
  • pca_cross_sectional (list, default []): Cross-sectional PCA config (e.g., market regime)
  • columns_onehot (list, default []): Columns for one-hot encoding
  • columns_binary (list, default []): Columns for binary encoding
  • columns_ordinal (list, default []): Columns for ordinal encoding
  • columns_frequency (list, default []): Columns for frequency encoding

Feature Selection Parameters

  • percentile (float, default 20): Percentage of features to keep per selection method
  • corr_threshold (float, default 80): Maximum correlation threshold (%) between features
  • max_features (int, default 50): Maximum number of final features
  • max_p_value_categorical (float, default 0.05): Maximum p-value for categorical feature selection (chi-squared test)

Model Selection Parameters

  • target_numbers (list, default []): List of target indices to predict
  • target_clf (list, default []): Classification target indices
  • models_idx (list, default []): Model indices or names to use (e.g., [1, 'xgb', 'lgb'])
  • max_timesteps (int, default 120): Maximum timesteps for recurrent models
  • perform_hyperopt (bool, default True): Whether to perform hyperparameter optimization
  • number_of_trials (int, default 20): Number of hyperopt trials
  • perform_crossval (bool, default False): Whether to use cross-validation during hyperopt
  • plot (bool, default True): Whether to generate plots
  • preserve_model (bool, default True): Whether to save the best model
  • target_clf_thresholds (dict, default {}): Classification thresholds per target

Example Context Configuration

from datetime import datetime

context = {
    # Required parameters
    "experiment_name": f"stock_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    "date_column": "DATE",
    "group_column": "STOCK",
    
    # Feature selection
    "corr_threshold": 80,
    "max_features": 20,
    "percentile": 20,
    "max_p_value_categorical": 0.05,
    
    # Feature engineering
    "columns_drop": ["SECURITY", "ISIN", "ID"],
    "columns_boolean": [],
    "columns_date": ["DATE"],
    "columns_te_groupby": [["SECTOR", "DATE"]],
    "columns_te_target": ["RET", "VOLUME"],
    
    # Preprocessing
    "time_series": True,
    "val_size": 0.2,
    "test_size": 0.2,
    "pca_temporal": [
        # Old format (still supported)
        # {"name": "LAST_20_RET", "columns": [f"RET_-{i}" for i in range(1, 21)]},
        # New simplified format - automatically creates lag columns
        {"name": "LAST_20_RET", "column": "RET", "lags": 20},
        {"name": "LAST_10_VOL", "column": "VOLUME", "lags": 10},
    ],
    "pca_cross_sectional": [
        {
            "name": "MARKET_REGIME",
            "index": "DATE",
            "columns": "STOCK",
            "value": "RET",
        }
    ],
    "columns_onehot": ["BUY_SIGNAL"],
    "columns_binary": ["SECTOR", "LOCATION"],
    "columns_ordinal": ["STOCK"],
    
    # Model selection
    "target_numbers": [1, 2, 3],
    "target_clf": [1],
    "models_idx": ["xgb", "lgb", "catboost"],
    "max_timesteps": 120,
    "perform_hyperopt": True,
    "number_of_trials": 50,
    "perform_crossval": True,
    "plot": True,
    "preserve_model": True,
    "target_clf_thresholds": {1: {"precision": 0.80}},
}

# Create experiment with the new unified API
experiment = LeCrapaud(data=your_dataframe, **context)

Important Notes

  1. Context Persistence: All context parameters are saved in the database when creating an experiment and automatically restored when loading it.

  2. Parameter Precedence: When loading an existing experiment, the stored context takes precedence over any parameters passed to the constructor.

  3. PCA Time Series:

    • For time series data, both pca_cross_sectional and pca_temporal automatically use an expanding window approach with periodic refresh (default: every 90 days) to prevent data leakage (a standalone sketch of this idea appears after these notes).
    • The system fits PCA only on historical data (lookback window of 365 days by default) and avoids look-ahead bias.
    • For panel data (e.g., multiple stocks), lag features are created per group when using the simplified pca_temporal format.
    • Missing PCA values are handled with forward-fill followed by zero-fill to ensure compatibility with downstream models.
  4. PCA Temporal Simplified Format:

    • Instead of manually listing lag columns: {"name": "LAST_20_RET", "columns": ["RET_-1", "RET_-2", ..., "RET_-20"]}
    • Use the simplified format: {"name": "LAST_20_RET", "column": "RET", "lags": 20}
    • The system automatically creates the lag columns, handling panel data correctly with group_column.
  5. OpenAI Embeddings: If using columns_pca with text columns, ensure OPENAI_API_KEY is set as an environment variable.

  6. Model Indices: The models_idx parameter accepts both integer indices and string names (e.g., 'xgb', 'lgb', 'catboost').
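
To make the leakage-avoidance idea in note 3 concrete, here is a minimal, self-contained sketch. It is not LeCrapaud's internal implementation; the helper name rolling_pca_scores, the window lengths, and the column names are illustrative:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def rolling_pca_scores(df, feature_cols, date_col="DATE",
                       lookback_days=365, refresh_days=90, n_components=1):
    # Fit PCA on trailing history only and score forward until the next refresh,
    # so no row is transformed by a model that has seen its future.
    df = df.sort_values(date_col)  # date_col must be a datetime column
    dates = df[date_col]
    scores = pd.Series(np.nan, index=df.index)
    refresh = dates.min() + pd.Timedelta(days=refresh_days)
    pca = None
    while refresh <= dates.max() + pd.Timedelta(days=refresh_days):
        hist_mask = (dates < refresh) & (dates >= refresh - pd.Timedelta(days=lookback_days))
        apply_mask = (dates >= refresh) & (dates < refresh + pd.Timedelta(days=refresh_days))
        if hist_mask.sum() >= n_components:  # refit on history at each refresh date
            pca = PCA(n_components=n_components).fit(df.loc[hist_mask, feature_cols])
        if pca is not None and apply_mask.any():
            scores.loc[apply_mask] = pca.transform(df.loc[apply_mask, feature_cols])[:, 0]
        refresh += pd.Timedelta(days=refresh_days)
    # forward-fill then zero-fill, mirroring the handling described above
    return scores.ffill().fillna(0.0)

Each block of rows is transformed only by a PCA fitted on data strictly before it, which is what removes look-ahead bias.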

Modular usage with sklearn-compatible components

You can also use individual pipeline components:

from lecrapaud import FeatureEngineering, FeaturePreprocessor, FeatureSelector

# Create components with experiment context
feature_eng = FeatureEngineering(experiment=experiment)
feature_prep = FeaturePreprocessor(experiment=experiment)
feature_sel = FeatureSelector(experiment=experiment, target_number=1)

# Use sklearn fit/transform pattern
feature_eng.fit(data)
data_eng = feature_eng.get_data()

feature_prep.fit(data_eng)
data_preprocessed = feature_prep.transform(data_eng)

feature_sel.fit(data_preprocessed)

# Or use in sklearn Pipeline
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('feature_eng', FeatureEngineering(experiment=experiment)),
    ('feature_prep', FeaturePreprocessor(experiment=experiment))
])
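
Assuming the components follow the standard sklearn fit/transform contract as shown above, the composed pipeline can then be used like any other sklearn pipeline:

pipeline.fit(data)
data_ready = pipeline.transform(data)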

⚠️ Using Alembic in Your Project (Important for Integrators)

If you use Alembic for migrations in your own project and you share the same database with LeCrapaud, you must ensure that Alembic does not attempt to drop or modify LeCrapaud tables (those prefixed with {LECRAPAUD_TABLE_PREFIX}_).

By default, Alembic's autogenerate feature will propose to drop any table that exists in the database but is not present in your project's models. To prevent this, add the following filter to your env.py:

def include_object(object, name, type_, reflected, compare_to):
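    # LECRAPAUD_TABLE_PREFIX is a placeholder for the actual table prefix used by LeCrapaud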
    if type_ == "table" and name.startswith(f"{LECRAPAUD_TABLE_PREFIX}_"):
        return False  # Ignore LeCrapaud tables
    return True

context.configure(
    # ... other options ...
    include_object=include_object,
)

This will ensure that Alembic ignores all tables created by LeCrapaud when generating migrations for your own project.


🤝 Contributing

Reminders for GitHub usage

  1. Create the GitHub repository
$ brew install gh
$ gh auth login
$ gh repo create
  2. Initialize git and push a first commit to the remote repository
$ git init
$ git add .
$ git commit -m 'first commit'
$ git remote add origin <YOUR_REPO_URL>
$ git push -u origin master
  3. Use conventional commits
    https://www.conventionalcommits.org/en/v1.0.0/#summary
  4. Create a virtual environment
$ pip install virtualenv
$ python -m venv .venv
$ source .venv/bin/activate
  5. Install dependencies
$ make install
  6. Deactivate the virtualenv (if needed)
$ deactivate

Pierre Gallet © 2025
