essential data transformers and model estimators for ML and data science competitions

Centimators

Centimators: essential data transformers and model estimators for ML and data science competitions

centimators is an open-source Python library built on scikit-learn, Keras, and narwhals. It is designed for building and sharing dataframe-agnostic (pandas/Polars), multi-framework (JAX/TensorFlow/PyTorch), sklearn-style (fit/transform/predict) transformers, meta-estimators, and machine learning models for data science competitions such as Numerai, Kaggle, and the CrowdCent Challenge.

centimators makes heavy use of advanced scikit-learn concepts such as metadata routing. Familiarity with these concepts is recommended for optimal use of the library. You can learn more about metadata routing in the scikit-learn documentation.

Documentation is available at https://crowdcent.github.io/centimators/.

Installation

Recommended (using uv):

uv add centimators

Or, using pip:

pip install centimators

Quick Start

centimators transformers and estimators are dataframe-agnostic, powered by narwhals, so the same transformer works with both pandas and Polars DataFrames. Here's an example with RankTransformer, which computes the normalized rank of each feature across tickers within each date.

First, let's define some common data:

import pandas as pd
import polars as pl
# Create sample OHLCV data for two stocks over four trading days
data = {
    'date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', 
             '2021-01-03', '2021-01-03', '2021-01-04', '2021-01-04'],
    'ticker': ['AAPL', 'MSFT', 'AAPL', 'MSFT', 'AAPL', 'MSFT', 'AAPL', 'MSFT'],
    'open': [150.0, 280.0, 151.0, 282.0, 152.0, 283.0, 153.0, 284.0],    # Opening prices
    'high': [152.0, 282.0, 153.0, 284.0, 154.0, 285.0, 155.0, 286.0],    # Daily highs
    'low': [149.0, 278.0, 150.0, 280.0, 151.0, 281.0, 152.0, 282.0],     # Daily lows
    'close': [151.0, 281.0, 152.0, 283.0, 153.0, 284.0, 154.0, 285.0],   # Closing prices
    'volume': [1000000, 800000, 1200000, 900000, 1100000, 850000, 1050000, 820000]  # Trading volume
}

# Create both Pandas and Polars DataFrames
df_pd = pd.DataFrame(data)
df_pl = pl.DataFrame(data)

# Define the OHLCV features we want to transform
feature_cols = ['volume', 'close']

Now, let's use the transformer:

from centimators.feature_transformers import RankTransformer

transformer = RankTransformer(feature_names=feature_cols)
result_pd = transformer.fit_transform(df_pd[feature_cols], date_series=df_pd['date'])
result_pl = transformer.fit_transform(df_pl[feature_cols], date_series=df_pl['date'])

Both result_pd (from Pandas) and result_pl (from Polars) will contain the same transformed data in their native DataFrame formats. You may find significant performance gains using Polars for certain operations.
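To make the idea of a per-date normalized rank concrete, here is a minimal sketch in plain pandas. It is not the RankTransformer implementation, and the library's exact normalization and tie handling may differ. It only assumes that ranks are computed across tickers within each date and scaled into (0, 1].

```python
import pandas as pd

# Two tickers on two dates (a subset of the sample data above).
df = pd.DataFrame(
    {
        "date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
        "ticker": ["AAPL", "MSFT", "AAPL", "MSFT"],
        "close": [151.0, 281.0, 152.0, 283.0],
        "volume": [1000000, 800000, 1200000, 900000],
    }
)

# Rank each feature across tickers within a date. pct=True divides the rank
# by the group size, so values fall in (0, 1].
ranked = df.groupby("date")[["close", "volume"]].rank(pct=True)
print(ranked)
```

With two tickers per date, the larger value in each date receives 1.0 and the smaller receives 0.5.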

Advanced Pipeline

centimators transformers are designed to work within scikit-learn Pipelines and rely on scikit-learn's metadata routing. Routing lets you pass auxiliary data, such as date or ticker series, through the pipeline to only those transformers that need it, while chaining multiple transformers together. Beyond building more complex feature pipelines, this makes the whole feature pipeline available to cross-validation, hyperparameter tuning, and model selection. For example, with a regressor as the final step, you could search jointly over lag windows, moving-average windows, and model hyperparameters during training.
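As a rough illustration of searching over pipeline hyperparameters, the sketch below uses only built-in scikit-learn components on synthetic data. PolynomialFeatures and Ridge are stand-ins chosen for this example and are not centimators classes. The same "<step>__<param>" naming would apply to parameters such as lag or moving-average windows.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: the target depends on the square of the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=60)

# make_pipeline names each step after its lowercased class name.
pipe = make_pipeline(PolynomialFeatures(), Ridge())

# Search feature-engineering and model parameters jointly.
param_grid = {
    "polynomialfeatures__degree": [1, 2, 3],
    "ridge__alpha": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

For time-ordered data, a time-aware splitter such as sklearn.model_selection.TimeSeriesSplit would usually be a better choice for cv than the default K-fold split.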

from sklearn import set_config
from sklearn.pipeline import make_pipeline
from centimators.feature_transformers import (
    LogReturnTransformer,
    RankTransformer,
    LagTransformer,
    MovingAverageTransformer
)

# Enable metadata routing globally
set_config(enable_metadata_routing=True)

# Define individual transformers with their parameters
log_return_transformer = LogReturnTransformer().set_transform_request(
    ticker_series=True
)
ranker = RankTransformer().set_transform_request(date_series=True)
lag_windows = [0, 5, 10, 15]
lagger = LagTransformer(windows=lag_windows).set_transform_request(
    ticker_series=True
)
ma_windows = [5, 10, 20, 40]
ma_transformer = MovingAverageTransformer(
    windows=ma_windows
).set_transform_request(ticker_series=True)

# Create the pipeline
pipeline = make_pipeline(
    log_return_transformer, ranker, lagger, ma_transformer
)

Explanation:

  • set_config(enable_metadata_routing=True) turns on scikit-learn's metadata routing.
  • set_transform_request(metadata_name=True) on each transformer tells the pipeline that this transformer expects metadata_name (e.g., date_series).
  • When pipeline.fit_transform(X, date_series=dates, ticker_series=tickers) is called:
    • The date_series is automatically passed to RankTransformer.
    • The ticker_series is automatically passed to LagTransformer, MovingAverageTransformer, and LogReturnTransformer.
    • The output of LogReturnTransformer is passed to RankTransformer.
    • The output of RankTransformer is passed to LagTransformer.
    • The output of LagTransformer is passed to MovingAverageTransformer.

This allows for complex data transformations where different steps require different auxiliary information, all managed cleanly by the pipeline.

# Now you can use this pipeline with your data
# (df_pl is the Polars DataFrame defined in the Quick Start)
feature_names = ['open', 'high', 'low', 'close']
transformed_df = pipeline.fit_transform(
    df_pl[feature_names],
    date_series=df_pl["date"],
    ticker_series=df_pl["ticker"],
)

A sample of the output for a single ticker and a single initial feature shows how the close price in a cross-sectional dataset is transformed into a log return, ranked (between 0 and 1) by date, then lagged and smoothed (moving-average windows) by ticker.
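The same sequence of steps can be sketched in plain pandas for the close column. This is an illustration of the idea only, not the centimators implementation. It assumes rows are sorted by date within each ticker, and it uses a lag of 1 and a 2-day window (both chosen for this example) because the sample data covers only four days.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02",
                 "2021-01-03", "2021-01-03", "2021-01-04", "2021-01-04"],
        "ticker": ["AAPL", "MSFT"] * 4,
        "close": [151.0, 281.0, 152.0, 283.0, 153.0, 284.0, 154.0, 285.0],
    }
)

# 1. Log return per ticker: log(close_t / close_{t-1}); first day is NaN.
df["log_ret"] = np.log(df["close"] / df.groupby("ticker")["close"].shift(1))
# 2. Cross-sectional rank per date, scaled into (0, 1].
df["rank"] = df.groupby("date")["log_ret"].rank(pct=True)
# 3. Lag the rank by one day within each ticker.
df["rank_lag1"] = df.groupby("ticker")["rank"].shift(1)
# 4. Two-day moving average of the lagged rank within each ticker.
df["rank_lag1_ma2"] = df.groupby("ticker")["rank_lag1"].transform(
    lambda s: s.rolling(2).mean()
)
print(df)
```

Each step consumes the previous step's output. The rank groups by date, while the log return, lag, and moving average group by ticker, which is why the pipeline needs both date_series and ticker_series.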
