A framework for machine learning and deep learning, covering regression, classification, and time series analysis

Welcome to LeCrapaud

An all-in-one machine learning framework


🚀 Introduction

LeCrapaud is a high-level Python library for end-to-end machine learning workflows on tabular data, with a focus on financial and stock datasets. It provides a simple API to handle feature engineering, model selection, training, and prediction, all in a reproducible and modular way.

✨ Key Features

  • 🧩 Modular pipeline: Feature engineering, preprocessing, selection, and modeling as independent steps
  • 🤖 Automated model selection and hyperparameter optimization
  • 📊 Easy integration with pandas DataFrames
  • 🔬 Supports both regression and classification tasks
  • 🛠️ Simple API for both full pipeline and step-by-step usage
  • 📦 Ready for production and research workflows

⚡ Quick Start

Install the package

pip install lecrapaud

How it works

This package provides a high-level API to manage experiments for feature engineering, model selection, and prediction on tabular data (e.g. stock data).

Typical workflow

from lecrapaud import LeCrapaud

# 1. Create the main app
app = LeCrapaud(uri="mysql+pymysql://user:password@host:port/dbname")

# 2. Define your experiment context (see your notebook or api.py for all options)
context = {
    "data": your_dataframe,
    "columns_drop": [...],
    "columns_date": [...],
    # ... other config options
}

# 3. Create an experiment
experiment = app.create_experiment(**context)

# 4. Run the full training pipeline
experiment.train(your_dataframe)

# 5. Make predictions on new data
predictions = experiment.predict(new_data)
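For a concrete starting point, `your_dataframe` is an ordinary pandas DataFrame. A toy panel dataset with the kinds of columns used throughout this README (the names are illustrative, not required by the API) might look like:

```python
import pandas as pd

# A minimal panel dataset: one row per (date, stock) pair.
# Column names mirror the examples in this README; adapt to your data.
your_dataframe = pd.DataFrame({
    "DATE": pd.to_datetime(["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"]),
    "STOCK": ["AAA", "BBB", "AAA", "BBB"],
    "RET": [0.011, -0.004, 0.007, 0.002],        # daily return
    "VOLUME": [120_000, 98_000, 135_000, 87_000],
})

print(your_dataframe.shape)  # (4, 4)
```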

Database Configuration (Required)

LeCrapaud requires access to a MySQL database to store experiments and results. You must either:

  • Pass a valid MySQL URI to the LeCrapaud constructor:
    app = LeCrapaud(uri="mysql+pymysql://user:password@host:port/dbname")
    
  • OR set the following environment variables before using the package:
    • DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, DB_NAME
    • Or set DB_URI directly with your full connection string.

If neither is provided, database operations will not work.
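As a sketch of what the environment-variable route amounts to (the exact internal logic is LeCrapaud's; this just shows how the five variables compose into a SQLAlchemy-style URI, with `DB_URI` taking priority when set — `uri_from_env` is a hypothetical helper, not part of the package):

```python
import os

def uri_from_env(env=os.environ):
    """Build a MySQL connection URI from the documented environment variables."""
    direct = env.get("DB_URI")
    if direct:
        # A full connection string wins over the individual pieces
        return direct
    return (
        f"mysql+pymysql://{env['DB_USER']}:{env['DB_PASSWORD']}"
        f"@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"
    )

env = {"DB_USER": "user", "DB_PASSWORD": "pw", "DB_HOST": "localhost",
       "DB_PORT": "3306", "DB_NAME": "lecrapaud"}
print(uri_from_env(env))  # mysql+pymysql://user:pw@localhost:3306/lecrapaud
```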

Using OpenAI Embeddings (Optional)

If you want to use the columns_pca embedding feature (for advanced feature engineering), you must set the OPENAI_API_KEY environment variable with your OpenAI API key:

export OPENAI_API_KEY=sk-...

If this variable is not set, features relying on OpenAI embeddings will not be available.
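If you want your pipeline to degrade gracefully when the key is absent, one defensive pattern is to only request `columns_pca` when the variable is present (purely illustrative; `maybe_enable_pca` is not part of LeCrapaud's API):

```python
import os

def maybe_enable_pca(context, text_columns, env=os.environ):
    """Only request OpenAI-backed columns_pca when an API key is present."""
    if env.get("OPENAI_API_KEY"):
        context = {**context, "columns_pca": text_columns}
    return context

# With no key in the environment, the feature simply stays off
ctx = maybe_enable_pca({"experiment_name": "demo"}, ["DESCRIPTION"], env={})
print("columns_pca" in ctx)  # False
```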

Experiment Context Arguments

The experiment context is a dictionary containing all configuration parameters for your ML pipeline. Parameters are stored in the experiment's database record and automatically retrieved when loading an existing experiment.

Required Parameters

| Parameter | Type | Description | Example |
|---|---|---|---|
| data | DataFrame | Input dataset (required for new experiments only) | pd.DataFrame(...) |
| experiment_name | str | Unique name for the experiment | 'stock_prediction' |
| date_column | str | Name of the date column (required for time series) | 'DATE' |
| group_column | str | Name of the group column (required for panel data) | 'STOCK' |

Feature Engineering Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| columns_drop | list | [] | Columns to drop during feature engineering |
| columns_boolean | list | [] | Columns to convert to boolean features |
| columns_date | list | [] | Date columns for cyclic encoding |
| columns_te_groupby | list | [] | Groupby columns for target encoding |
| columns_te_target | list | [] | Target columns for target encoding |
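Cyclic encoding of date columns typically maps a periodic component (month, weekday, hour) onto the unit circle, so that December and January end up adjacent instead of 11 units apart. A sketch of the idea (LeCrapaud's own transform may differ in detail):

```python
import math

def cyclic_encode(value, period):
    """Map a periodic value (e.g. month 1-12) to (sin, cos) coordinates."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

jan = cyclic_encode(1, 12)
dec = cyclic_encode(12, 12)
# December and January are close on the circle, unlike raw integers 12 vs 1
dist = math.dist(jan, dec)
print(round(dist, 3))
```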

Preprocessing Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| time_series | bool | False | Whether data is time series |
| val_size | float | 0.2 | Validation set size (fraction) |
| test_size | float | 0.2 | Test set size (fraction) |
| columns_pca | list | [] | Columns for PCA transformation |
| pca_temporal | list | [] | Temporal PCA config (e.g., lag features) |
| pca_cross_sectional | list | [] | Cross-sectional PCA config (e.g., market regime) |
| columns_onehot | list | [] | Columns for one-hot encoding |
| columns_binary | list | [] | Columns for binary encoding |
| columns_ordinal | list | [] | Columns for ordinal encoding |
| columns_frequency | list | [] | Columns for frequency encoding |

Feature Selection Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| percentile | float | 20 | Percentage of features to keep per selection method |
| corr_threshold | float | 80 | Maximum correlation threshold (%) between features |
| max_features | int | 50 | Maximum number of final features |
| max_p_value_categorical | float | 0.05 | Maximum p-value for categorical feature selection (Chi2) |
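The `corr_threshold` step can be pictured as dropping one column from every highly correlated pair. A minimal pandas sketch with a 0.80 threshold (the library's actual selection logic may combine this with the other criteria above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # perfectly correlated with "a"
    "c": [5, 3, 8, 1, 9],    # weakly correlated with "a"
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.80).any()]
print(to_drop)  # ['b']
```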

Model Selection Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| target_numbers | list | [] | List of target indices to predict |
| target_clf | list | [] | Classification target indices |
| models_idx | list | [] | Model indices or names to use (e.g., [1, 'xgb', 'lgb']) |
| max_timesteps | int | 120 | Maximum timesteps for recurrent models |
| perform_hyperopt | bool | True | Whether to perform hyperparameter optimization |
| number_of_trials | int | 20 | Number of hyperopt trials |
| perform_crossval | bool | False | Whether to use cross-validation during hyperopt |
| plot | bool | True | Whether to generate plots |
| preserve_model | bool | True | Whether to save the best model |
| target_clf_thresholds | dict | {} | Classification thresholds per target |
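One plausible reading of `target_clf_thresholds` (e.g. `{1: {"precision": 0.80}}`) is choosing the lowest decision cutoff whose precision meets the target. A self-contained sketch of that idea (`threshold_for_precision` is a hypothetical helper, not LeCrapaud's internal code):

```python
def threshold_for_precision(y_true, scores, target=0.80):
    """Return the smallest score cutoff whose precision reaches the target."""
    # Try each observed score as a cutoff, lowest first
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, y_true))
        fp = sum(p and not y for p, y in zip(preds, y_true))
        if tp and tp / (tp + fp) >= target:
            return t
    return None  # no cutoff achieves the target precision

t = threshold_for_precision([0, 0, 1, 1, 1], [0.1, 0.4, 0.35, 0.8, 0.9], target=0.80)
print(t)  # 0.8
```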

Example Context Configuration

context = {
    # Required parameters
    "experiment_name": f"stock_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    "date_column": "DATE",
    "group_column": "STOCK",
    
    # Feature selection
    "corr_threshold": 80,
    "max_features": 20,
    "percentile": 20,
    "max_p_value_categorical": 0.05,
    
    # Feature engineering
    "columns_drop": ["SECURITY", "ISIN", "ID"],
    "columns_boolean": [],
    "columns_date": ["DATE"],
    "columns_te_groupby": [["SECTOR", "DATE"]],
    "columns_te_target": ["RET", "VOLUME"],
    
    # Preprocessing
    "time_series": True,
    "val_size": 0.2,
    "test_size": 0.2,
    "pca_temporal": [
        {"name": "LAST_20_RET", "columns": [f"RET_-{i}" for i in range(1, 21)]},
    ],
    "pca_cross_sectional": [
        {
            "name": "MARKET_REGIME",
            "index": "DATE",
            "columns": "STOCK",
            "value": "RET",
        }
    ],
    "columns_onehot": ["BUY_SIGNAL"],
    "columns_binary": ["SECTOR", "LOCATION"],
    "columns_ordinal": ["STOCK"],
    
    # Model selection
    "target_numbers": [1, 2, 3],
    "target_clf": [1],
    "models_idx": ["xgb", "lgb", "catboost"],
    "max_timesteps": 120,
    "perform_hyperopt": True,
    "number_of_trials": 50,
    "perform_crossval": True,
    "plot": True,
    "preserve_model": True,
    "target_clf_thresholds": {1: {"precision": 0.80}},
}

# Create experiment
experiment = app.create_experiment(data=your_dataframe, **context)

Important Notes

  1. Context Persistence: All context parameters are saved in the database when creating an experiment and automatically restored when loading it.

  2. Parameter Precedence: When loading an existing experiment, the stored context takes precedence over any parameters passed to the constructor.

  3. PCA Time Series: For time series data with pca_cross_sectional where index equals date_column, the system automatically uses an expanding window approach to prevent data leakage.

  4. OpenAI Embeddings: If using columns_pca with text columns, ensure OPENAI_API_KEY is set as an environment variable.

  5. Model Indices: The models_idx parameter accepts both integer indices and string names (e.g., 'xgb', 'lgb', 'catboost').
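The expanding-window behaviour in note 3 can be illustrated with plain pandas: each row's statistic uses only strictly earlier rows, so nothing from the future leaks into the feature (a conceptual sketch, not the library's implementation):

```python
import pandas as pd

ret = pd.Series([1.0, 2.0, 3.0, 4.0], name="RET")

# Mean of all PAST observations at each step; shift(1) excludes the current row
past_mean = ret.expanding().mean().shift(1)
print(past_mean.tolist())  # [nan, 1.0, 1.5, 2.0]
```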

Modular usage

You can also use each step independently:

data_eng = experiment.feature_engineering(data)
train, val, test = experiment.preprocess_feature(data_eng)
features = experiment.feature_selection(train)
std_data, reshaped_data = experiment.preprocess_model(train, val, test)
experiment.model_selection(std_data, reshaped_data)

⚠️ Using Alembic in Your Project (Important for Integrators)

If you use Alembic for migrations in your own project and you share the same database with LeCrapaud, you must ensure that Alembic does not attempt to drop or modify LeCrapaud tables (those prefixed with {LECRAPAUD_TABLE_PREFIX}_).

By default, Alembic's autogenerate feature will propose to drop any table that exists in the database but is not present in your project's models. To prevent this, add the following filter to your env.py:

# LECRAPAUD_TABLE_PREFIX is the table prefix configured for your LeCrapaud installation
def include_object(object, name, type_, reflected, compare_to):
    if type_ == "table" and name.startswith(f"{LECRAPAUD_TABLE_PREFIX}_"):
        return False  # Ignore LeCrapaud tables
    return True

context.configure(
    # ... other options ...
    include_object=include_object,
)

This will ensure that Alembic ignores all tables created by LeCrapaud when generating migrations for your own project.


🤝 Contributing

Reminders for Github usage

  1. Create the GitHub repository
$ brew install gh
$ gh auth login
$ gh repo create
  2. Initialize git and push a first commit to the remote repository
$ git init
$ git add .
$ git commit -m 'first commit'
$ git remote add origin <YOUR_REPO_URL>
$ git push -u origin master
  3. Use conventional commits
    https://www.conventionalcommits.org/en/v1.0.0/#summary

  4. Create a virtual environment
$ pip install virtualenv
$ python -m venv .venv
$ source .venv/bin/activate
  5. Install dependencies
$ make install
  6. Deactivate the virtualenv (when needed)
$ deactivate

Pierre Gallet © 2025
