Skip to main content

Framework for machine and deep learning, with regression, classification and time series analysis

Project description

crapaud

Welcome to LeCrapaud

An all-in-one machine learning framework

GitHub stars PyPI version Python versions License codecov

🚀 Introduction

LeCrapaud is a high-level Python library for end-to-end machine learning workflows on tabular data, with a focus on financial and stock datasets. It provides a simple API to handle feature engineering, model selection, training, and prediction, all in a reproducible and modular way.

✨ Key Features

  • 🧩 Modular pipeline: Feature engineering, preprocessing, selection, and modeling as independent steps
  • 🤖 Automated model selection and hyperparameter optimization
  • 📊 Easy integration with pandas DataFrames
  • 🔬 Supports both regression and classification tasks
  • 🛠️ Simple API for both full pipeline and step-by-step usage
  • 📦 Ready for production and research workflows

⚡ Quick Start

Install the package

pip install lecrapaud

How it works

This package provides a high-level API to manage experiments for feature engineering, model selection, and prediction on tabular data (e.g. stock data).

Typical workflow

from lecrapaud import LeCrapaud

# 1. Create the main app
app = LeCrapaud(uri=uri)

# 2. Define your experiment context (see your notebook or api.py for all options)
context = {
    "data": your_dataframe,
    "columns_drop": [...],
    "columns_date": [...],
    # ... other config options
}

# 3. Create an experiment
experiment = app.create_experiment(**context)

# 4. Run the full training pipeline
experiment.train(your_dataframe)

# 5. Make predictions on new data
predictions = experiment.predict(new_data)

Database Configuration (Required)

LeCrapaud requires access to a MySQL database to store experiments and results. You must either:

  • Pass a valid MySQL URI to the LeCrapaud constructor:
    app = LeCrapaud(uri="mysql+pymysql://user:password@host:port/dbname")
    
  • OR set the following environment variables before using the package:
    • DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, DB_NAME
    • Or set DB_URI directly with your full connection string.

If neither is provided, database operations will not work.

Using OpenAI Embeddings (Optional)

If you want to use the columns_pca embedding feature (for advanced feature engineering), you must set the OPENAI_API_KEY environment variable with your OpenAI API key:

export OPENAI_API_KEY=sk-...

If this variable is not set, features relying on OpenAI embeddings will not be available.

Experiment Context Arguments

The experiment context is a dictionary containing all configuration parameters for your ML pipeline. Parameters are stored in the experiment's database record and automatically retrieved when loading an existing experiment.

Required Parameters

Parameter Type Description Example
data DataFrame Input dataset (required for new experiments only) pd.DataFrame(...)
experiment_name str Unique name for the experiment 'stock_prediction'
date_column str Name of the date column (required for time series) 'DATE'
group_column str Name of the group column (required for panel data) 'STOCK'

Feature Engineering Parameters

Parameter Type Default Description
columns_drop list [] Columns to drop during feature engineering
columns_boolean list [] Columns to convert to boolean features
columns_date list [] Date columns for cyclic encoding
columns_te_groupby list [] Groupby columns for target encoding
columns_te_target list [] Target columns for target encoding

Preprocessing Parameters

Parameter Type Default Description
time_series bool False Whether data is time series
val_size float 0.2 Validation set size (fraction)
test_size float 0.2 Test set size (fraction)
columns_pca list [] Columns for PCA transformation
pca_temporal list [] Temporal PCA config (e.g., lag features)
pca_cross_sectional list [] Cross-sectional PCA config (e.g., market regime)
columns_onehot list [] Columns for one-hot encoding
columns_binary list [] Columns for binary encoding
columns_ordinal list [] Columns for ordinal encoding
columns_frequency list [] Columns for frequency encoding

Feature Selection Parameters

Parameter Type Default Description
percentile float 20 Percentage of features to keep per selection method
corr_threshold float 80 Maximum correlation threshold (%) between features
max_features int 50 Maximum number of final features
max_p_value_categorical float 0.05 Maximum p-value for categorical feature selection (Chi2)

Model Selection Parameters

Parameter Type Default Description
target_numbers list [] List of target indices to predict
target_clf list [] Classification target indices
models_idx list [] Model indices or names to use (e.g., [1, 'xgb', 'lgb'])
max_timesteps int 120 Maximum timesteps for recurrent models
perform_hyperopt bool True Whether to perform hyperparameter optimization
number_of_trials int 20 Number of hyperopt trials
perform_crossval bool False Whether to use cross-validation during hyperopt
plot bool True Whether to generate plots
preserve_model bool True Whether to save the best model
target_clf_thresholds dict {} Classification thresholds per target

Example Context Configuration

context = {
    # Required parameters
    "experiment_name": f"stock_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    "date_column": "DATE",
    "group_column": "STOCK",
    
    # Feature selection
    "corr_threshold": 80,
    "max_features": 20,
    "percentile": 20,
    "max_p_value_categorical": 0.05,
    
    # Feature engineering
    "columns_drop": ["SECURITY", "ISIN", "ID"],
    "columns_boolean": [],
    "columns_date": ["DATE"],
    "columns_te_groupby": [["SECTOR", "DATE"]],
    "columns_te_target": ["RET", "VOLUME"],
    
    # Preprocessing
    "time_series": True,
    "val_size": 0.2,
    "test_size": 0.2,
    "pca_temporal": [
        {"name": "LAST_20_RET", "columns": [f"RET_-{i}" for i in range(1, 21)]},
    ],
    "pca_cross_sectional": [
        {
            "name": "MARKET_REGIME",
            "index": "DATE",
            "columns": "STOCK",
            "value": "RET",
        }
    ],
    "columns_onehot": ["BUY_SIGNAL"],
    "columns_binary": ["SECTOR", "LOCATION"],
    "columns_ordinal": ["STOCK"],
    
    # Model selection
    "target_numbers": [1, 2, 3],
    "target_clf": [1],
    "models_idx": ["xgb", "lgb", "catboost"],
    "max_timesteps": 120,
    "perform_hyperopt": True,
    "number_of_trials": 50,
    "perform_crossval": True,
    "plot": True,
    "preserve_model": True,
    "target_clf_thresholds": {1: {"precision": 0.80}},
}

# Create experiment
experiment = app.create_experiment(data=your_dataframe, **context)

Important Notes

  1. Context Persistence: All context parameters are saved in the database when creating an experiment and automatically restored when loading it.

  2. Parameter Precedence: When loading an existing experiment, the stored context takes precedence over any parameters passed to the constructor.

  3. PCA Time Series: For time series data with pca_cross_sectional where index equals date_column, the system automatically uses an expanding window approach to prevent data leakage.

  4. OpenAI Embeddings: If using columns_pca with text columns, ensure OPENAI_API_KEY is set as an environment variable.

  5. Model Indices: The models_idx parameter accepts both integer indices and string names (e.g., 'xgb', 'lgb', 'catboost').

Modular usage

You can also use each step independently:

data_eng = experiment.feature_engineering(data)
train, val, test = experiment.preprocess_feature(data_eng)
features = experiment.feature_selection(train)
std_data, reshaped_data = experiment.preprocess_model(train, val, test)
experiment.model_selection(std_data, reshaped_data)

⚠️ Using Alembic in Your Project (Important for Integrators)

If you use Alembic for migrations in your own project and you share the same database with LeCrapaud, you must ensure that Alembic does not attempt to drop or modify LeCrapaud tables (those prefixed with {LECRAPAUD_TABLE_PREFIX}_).

By default, Alembic's autogenerate feature will propose to drop any table that exists in the database but is not present in your project's models. To prevent this, add the following filter to your env.py:

def include_object(object, name, type_, reflected, compare_to):
    if type_ == "table" and name.startswith(f"{LECRAPAUD_TABLE_PREFIX}_"):
        return False  # Ignore LeCrapaud tables
    return True

context.configure(
    # ... other options ...
    include_object=include_object,
)

This will ensure that Alembic ignores all tables created by LeCrapaud when generating migrations for your own project.


🤝 Contributing

Reminders for Github usage

  1. Creating Github repository
$ brew install gh
$ gh auth login
$ gh repo create
  1. Initializing git and first commit to distant repository
$ git init
$ git add .
$ git commit -m 'first commit'
$ git remote add origin <YOUR_REPO_URL>
$ git push -u origin master
  1. Use conventional commits
    https://www.conventionalcommits.org/en/v1.0.0/#summary

  2. Create environment

$ pip install virtualenv
$ python -m venv .venv
$ source .venv/bin/activate
  1. Install dependencies
$ make install
  1. Deactivate virtualenv (if needed)
$ deactivate

Pierre Gallet © 2025

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lecrapaud-0.20.2.tar.gz (90.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lecrapaud-0.20.2-py3-none-any.whl (108.9 kB view details)

Uploaded Python 3

File details

Details for the file lecrapaud-0.20.2.tar.gz.

File metadata

  • Download URL: lecrapaud-0.20.2.tar.gz
  • Upload date:
  • Size: 90.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for lecrapaud-0.20.2.tar.gz
Algorithm Hash digest
SHA256 cc572e795e153e949b3879730810dc688b4ad8837279703b0f514175c96ca47a
MD5 c659497a74fd17af73c37de625cad40f
BLAKE2b-256 0bf1b8cd61a58b8a7a70d3e44bd94caf4e2c217f661bf69da4c4af141427e66e

See more details on using hashes here.

File details

Details for the file lecrapaud-0.20.2-py3-none-any.whl.

File metadata

  • Download URL: lecrapaud-0.20.2-py3-none-any.whl
  • Upload date:
  • Size: 108.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for lecrapaud-0.20.2-py3-none-any.whl
Algorithm Hash digest
SHA256 376582fe5214f30e824d667e1d4693e730066bb5baf362e80ef6b1e285be7433
MD5 a4edc3f0f41b1ea8170a81988ec058ec
BLAKE2b-256 9db099821f7a40de26d16073030bd3c6ac6db47416af868279606d57a1de3e8c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page