Framework for machine and deep learning, with regression, classification and time series analysis

Welcome to LeCrapaud

An all-in-one machine learning framework

🚀 Introduction

LeCrapaud is a high-level Python library for end-to-end machine learning workflows on tabular data, with a focus on financial and stock datasets. It provides a simple API to handle feature engineering, model selection, training, and prediction, all in a reproducible and modular way.

✨ Key Features

  • 🧩 Modular pipeline: Feature engineering, preprocessing, selection, and modeling as independent steps
  • 🤖 Automated model selection and hyperparameter optimization
  • 📊 Easy integration with pandas DataFrames
  • 🔬 Supports both regression and classification tasks
  • 🛠️ Simple API for both full pipeline and step-by-step usage
  • 📦 Ready for production and research workflows

⚡ Quick Start

Install the package

pip install lecrapaud

How it works

This package provides a high-level API to manage experiments for feature engineering, model selection, and prediction on tabular data (e.g. stock data).

Typical workflow

from lecrapaud import LeCrapaud

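# 1. Create the main app (the URI format is described under "Database Configuration")
uri = "mysql+pymysql://user:password@host:port/dbname"
app = LeCrapaud(uri=uri)

# 2. Define your experiment context (see "Experiment Context Arguments" for all options)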
context = {
    "data": your_dataframe,
    "columns_drop": [...],
    "columns_date": [...],
    # ... other config options
}

# 3. Create an experiment
experiment = app.create_experiment(**context)

# 4. Run the full training pipeline
experiment.train(your_dataframe)

# 5. Make predictions on new data
predictions = experiment.predict(new_data)

Database Configuration (Required)

LeCrapaud requires access to a MySQL database to store experiments and results. You must either:

  • Pass a valid MySQL URI to the LeCrapaud constructor:
    app = LeCrapaud(uri="mysql+pymysql://user:password@host:port/dbname")
    
  • OR set the following environment variables before using the package:
    • DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, DB_NAME
    • Or set DB_URI directly with your full connection string.

If neither is provided, database operations will not work.
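As a minimal sketch of the two environment-based options above, the helper below returns DB_URI when it is set and otherwise composes a URI from the five DB_* variables. The function name `build_db_uri`, the precedence of DB_URI over the individual variables, and the percent-encoding of credentials are assumptions made for illustration; the package may resolve these settings differently.

```python
import os
from urllib.parse import quote_plus


def build_db_uri(env=None):
    """Return DB_URI if set, else compose a MySQL URI from the DB_* variables."""
    env = os.environ if env is None else env
    if env.get("DB_URI"):
        # Assumption: a full connection string wins over the individual parts.
        return env["DB_URI"]
    required = ["DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT", "DB_NAME"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing database settings: {', '.join(missing)}")
    # Percent-encode credentials so characters such as '@' or '/' stay valid.
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASSWORD"])
    return (
        f"mysql+pymysql://{user}:{password}"
        f"@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"
    )
```

The result can then be passed explicitly, for example `LeCrapaud(uri=build_db_uri())`, which makes a missing setting fail early with a readable message.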

Using OpenAI Embeddings (Optional)

If you want to use the columns_pca embedding feature (for advanced feature engineering), you must set the OPENAI_API_KEY environment variable with your OpenAI API key:

export OPENAI_API_KEY=sk-...

If this variable is not set, features relying on OpenAI embeddings will not be available.
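A small guard like the following can catch a missing key before an experiment starts. It is only a sketch: `check_openai_key` is a hypothetical helper, not part of LeCrapaud, and it is deliberately conservative in treating any non-empty columns_pca as needing the key.

```python
import os


def check_openai_key(context, env=None):
    """Raise early if columns_pca is configured but OPENAI_API_KEY is missing."""
    env = os.environ if env is None else env
    # Assumption: any non-empty columns_pca list may trigger an embedding call.
    if context.get("columns_pca") and not env.get("OPENAI_API_KEY"):
        raise RuntimeError("columns_pca is set but OPENAI_API_KEY is not defined")
```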

Experiment Context Arguments

The experiment context is a dictionary containing all configuration parameters for your ML pipeline. Parameters are stored in the experiment's database record and automatically retrieved when loading an existing experiment.

Required Parameters

Parameter | Type | Description | Example
data | DataFrame | Input dataset (required for new experiments only) | pd.DataFrame(...)
experiment_name | str | Unique name for the experiment | 'stock_prediction'
date_column | str | Name of the date column (required for time series) | 'DATE'
group_column | str | Name of the group column (required for panel data) | 'STOCK'

Feature Engineering Parameters

Parameter | Type | Default | Description
columns_drop | list | [] | Columns to drop during feature engineering
columns_boolean | list | [] | Columns to convert to boolean features
columns_date | list | [] | Date columns for cyclic encoding
columns_te_groupby | list | [] | Groupby columns for target encoding
columns_te_target | list | [] | Target columns for target encoding
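To illustrate what cyclic encoding of a date column means in general, the sketch below maps month and weekday onto the unit circle, so that December and January end up close together. Which date components LeCrapaud actually encodes, and the output names used here, are assumptions; this is not the package's implementation.

```python
import math
from datetime import date


def cyclic_encode(value, period):
    """Map a periodic value onto the unit circle and return (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def encode_date(d):
    """Encode month and weekday so values on either side of the wrap stay close."""
    month_sin, month_cos = cyclic_encode(d.month - 1, 12)
    weekday_sin, weekday_cos = cyclic_encode(d.weekday(), 7)
    return {
        "month_sin": month_sin,
        "month_cos": month_cos,
        "weekday_sin": weekday_sin,
        "weekday_cos": weekday_cos,
    }
```

Using both sine and cosine matters: a single sine value would give the same encoding to two different months.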

Preprocessing Parameters

Parameter | Type | Default | Description
time_series | bool | False | Whether data is time series
val_size | float | 0.2 | Validation set size (fraction)
test_size | float | 0.2 | Test set size (fraction)
columns_pca | list | [] | Columns for PCA transformation
pca_temporal | list | [] | Temporal PCA config (e.g., lag features)
pca_cross_sectional | list | [] | Cross-sectional PCA config (e.g., market regime)
columns_onehot | list | [] | Columns for one-hot encoding
columns_binary | list | [] | Columns for binary encoding
columns_ordinal | list | [] | Columns for ordinal encoding
columns_frequency | list | [] | Columns for frequency encoding
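The val_size and test_size fractions can be pictured with the sketch below, which cuts rows that are already sorted by date into train, validation and test slices without shuffling. That LeCrapaud splits time series chronologically in exactly this way is an assumption; `chronological_split` is a hypothetical helper.

```python
def chronological_split(rows, val_size=0.2, test_size=0.2):
    """Split date-sorted rows into (train, val, test) without shuffling."""
    if not 0 <= val_size + test_size < 1:
        raise ValueError("val_size + test_size must be in [0, 1)")
    n = len(rows)
    n_val = int(round(n * val_size))
    n_test = int(round(n * test_size))
    n_train = n - n_val - n_test
    # The most recent rows go to the test set, the ones before it to validation.
    return (
        rows[:n_train],
        rows[n_train:n_train + n_val],
        rows[n_train + n_val:],
    )
```

With the defaults, 60% of the rows are used for training and the two most recent 20% slices are held out.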

Feature Selection Parameters

Parameter | Type | Default | Description
percentile | float | 20 | Percentage of features to keep per selection method
corr_threshold | float | 80 | Maximum correlation threshold (%) between features
max_features | int | 50 | Maximum number of final features
max_p_value_categorical | float | 0.05 | Maximum p-value for categorical feature selection (Chi2)
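The effect of corr_threshold and max_features can be sketched as a greedy filter: a feature is kept only if its absolute Pearson correlation with every feature kept so far stays at or below the threshold. The greedy order (dictionary order here) and the helper names are assumptions for illustration; the package may rank features before filtering.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, corr_threshold=80, max_features=50):
    """Keep features whose |correlation| with each kept feature is <= threshold (%)."""
    limit = corr_threshold / 100
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= limit for k in kept):
            kept.append(name)
        if len(kept) == max_features:
            break
    return kept
```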

Model Selection Parameters

Parameter | Type | Default | Description
target_numbers | list | [] | List of target indices to predict
target_clf | list | [] | Classification target indices
models_idx | list | [] | Model indices or names to use (e.g., [1, 'xgb', 'lgb'])
max_timesteps | int | 120 | Maximum timesteps for recurrent models
perform_hyperopt | bool | True | Whether to perform hyperparameter optimization
number_of_trials | int | 20 | Number of hyperopt trials
perform_crossval | bool | False | Whether to use cross-validation during hyperopt
plot | bool | True | Whether to generate plots
preserve_model | bool | True | Whether to save the best model
target_clf_thresholds | dict | {} | Classification thresholds per target
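One plausible reading of a target_clf_thresholds entry such as {1: {"precision": 0.80}} is "for target 1, pick a probability cut-off that reaches at least 80% precision". The sketch below implements that reading on validation data; both the interpretation and the function `threshold_for_precision` are assumptions, not documented LeCrapaud behavior.

```python
def threshold_for_precision(y_true, y_prob, min_precision):
    """Return the lowest cut-off whose precision reaches min_precision, or None."""
    for threshold in sorted(set(y_prob)):
        predicted = [p >= threshold for p in y_prob]
        n_predicted = sum(predicted)
        true_positives = sum(
            1 for hit, label in zip(predicted, y_true) if hit and label == 1
        )
        if n_predicted and true_positives / n_predicted >= min_precision:
            return threshold
    return None
```

Taking the lowest qualifying cut-off keeps as many positive predictions as possible while still meeting the precision target.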

Example Context Configuration

from datetime import datetime

context = {
    # Required parameters
    "experiment_name": f"stock_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    "date_column": "DATE",
    "group_column": "STOCK",
    
    # Feature selection
    "corr_threshold": 80,
    "max_features": 20,
    "percentile": 20,
    "max_p_value_categorical": 0.05,
    
    # Feature engineering
    "columns_drop": ["SECURITY", "ISIN", "ID"],
    "columns_boolean": [],
    "columns_date": ["DATE"],
    "columns_te_groupby": [["SECTOR", "DATE"]],
    "columns_te_target": ["RET", "VOLUME"],
    
    # Preprocessing
    "time_series": True,
    "val_size": 0.2,
    "test_size": 0.2,
    "pca_temporal": [
        # Old format (still supported)
        # {"name": "LAST_20_RET", "columns": [f"RET_-{i}" for i in range(1, 21)]},
        # New simplified format - automatically creates lag columns
        {"name": "LAST_20_RET", "column": "RET", "lags": 20},
        {"name": "LAST_10_VOL", "column": "VOLUME", "lags": 10},
    ],
    "pca_cross_sectional": [
        {
            "name": "MARKET_REGIME",
            "index": "DATE",
            "columns": "STOCK",
            "value": "RET",
        }
    ],
    "columns_onehot": ["BUY_SIGNAL"],
    "columns_binary": ["SECTOR", "LOCATION"],
    "columns_ordinal": ["STOCK"],
    
    # Model selection
    "target_numbers": [1, 2, 3],
    "target_clf": [1],
    "models_idx": ["xgb", "lgb", "catboost"],
    "max_timesteps": 120,
    "perform_hyperopt": True,
    "number_of_trials": 50,
    "perform_crossval": True,
    "plot": True,
    "preserve_model": True,
    "target_clf_thresholds": {1: {"precision": 0.80}},
}

# Create experiment
experiment = app.create_experiment(data=your_dataframe, **context)

Important Notes

  1. Context Persistence: All context parameters are saved in the database when creating an experiment and automatically restored when loading it.

  2. Parameter Precedence: When loading an existing experiment, the stored context takes precedence over any parameters passed to the constructor.

  3. PCA Time Series:

    • For time series data, both pca_cross_sectional and pca_temporal automatically use an expanding window approach with periodic refresh (default: every 90 days) to prevent data leakage.
    • The system fits PCA only on historical data (lookback window of 365 days by default) and avoids look-ahead bias.
    • For panel data (e.g., multiple stocks), lag features are created per group when using the simplified pca_temporal format.
    • Missing PCA values are handled with forward-fill followed by zero-fill to ensure compatibility with downstream models.

  4. PCA Temporal Simplified Format:

    • Instead of manually listing lag columns: {"name": "LAST_20_RET", "columns": ["RET_-1", "RET_-2", ..., "RET_-20"]}
    • Use the simplified format: {"name": "LAST_20_RET", "column": "RET", "lags": 20}
    • The system automatically creates the lag columns, handling panel data correctly with group_column.

  5. OpenAI Embeddings: If using columns_pca with text columns, ensure OPENAI_API_KEY is set as an environment variable.

  6. Model Indices: The models_idx parameter accepts both integer indices and string names (e.g., 'xgb', 'lgb', 'catboost').
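The per-group lag creation described in notes 3 and 4 can be sketched as follows. The code groups rows by group_column, sorts each group by date, and adds columns named like RET_-1, RET_-2, and so on, matching the naming used in the old pca_temporal format. The function `add_lags` and its use of None for unavailable lags are assumptions for illustration; LeCrapaud's own implementation is not shown here.

```python
from collections import defaultdict


def add_lags(rows, column, lags, group_column, date_column):
    """Return copies of rows with `column` lagged 1..lags steps within each group."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[group_column]].append(row)
    out = []
    for group_rows in by_group.values():
        group_rows.sort(key=lambda r: r[date_column])
        for i, row in enumerate(group_rows):
            new_row = dict(row)
            for lag in range(1, lags + 1):
                # None marks lags that reach before the group's first observation.
                new_row[f"{column}_-{lag}"] = (
                    group_rows[i - lag][column] if i >= lag else None
                )
            out.append(new_row)
    return out
```

Grouping before shifting is what keeps one stock's history from leaking into another's lag columns. With pandas, the same idea is usually written as `df.groupby("STOCK")["RET"].shift(lag)`.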

Modular usage

You can also use each step independently:

data_eng = experiment.feature_engineering(data)
train, val, test = experiment.preprocess_feature(data_eng)
features = experiment.feature_selection(train)
std_data, reshaped_data = experiment.preprocess_model(train, val, test)
experiment.model_selection(std_data, reshaped_data)

⚠️ Using Alembic in Your Project (Important for Integrators)

If you use Alembic for migrations in your own project and you share the same database with LeCrapaud, you must ensure that Alembic does not attempt to drop or modify LeCrapaud tables (those prefixed with {LECRAPAUD_TABLE_PREFIX}_).

By default, Alembic's autogenerate feature will propose to drop any table that exists in the database but is not present in your project's models. To prevent this, add the following filter to your env.py:

# LECRAPAUD_TABLE_PREFIX must be defined or imported in env.py before this filter runs.
def include_object(object, name, type_, reflected, compare_to):
    if type_ == "table" and name.startswith(f"{LECRAPAUD_TABLE_PREFIX}_"):
        return False  # Ignore LeCrapaud tables
    return True

context.configure(
    # ... other options ...
    include_object=include_object,
)

This will ensure that Alembic ignores all tables created by LeCrapaud when generating migrations for your own project.


🤝 Contributing

Reminders for GitHub usage

  1. Create the GitHub repository
$ brew install gh
$ gh auth login
$ gh repo create
  2. Initialize git and push the first commit to the remote repository
$ git init
$ git add .
$ git commit -m 'first commit'
$ git remote add origin <YOUR_REPO_URL>
$ git push -u origin master
  3. Use conventional commits
    https://www.conventionalcommits.org/en/v1.0.0/#summary
  4. Create the environment
$ pip install virtualenv
$ python -m venv .venv
$ source .venv/bin/activate
  5. Install dependencies
$ make install
  6. Deactivate the virtualenv (if needed)
$ deactivate

Pierre Gallet © 2025
