essential data transformers and model estimators for ML and data science competitions

Centimators

Centimators: essential data transformers and model estimators for ML and data science competitions

centimators is an open-source Python library built on scikit-learn, Keras, and narwhals. It is designed for building and sharing dataframe-agnostic (pandas/polars), multi-framework (jax/tf/pytorch), sklearn-style (fit/transform/predict) transformers, meta-estimators, and machine learning models for data science competitions like Numerai, Kaggle, and the CrowdCent Challenge.

centimators makes heavy use of advanced scikit-learn concepts such as metadata routing. Familiarity with these concepts is recommended for optimal use of the library. You can learn more about metadata routing in the scikit-learn documentation.

Documentation is available at https://crowdcent.github.io/centimators/.

Installation

Recommended (using uv):

uv add centimators

Or, using pip:

pip install centimators

Quick Start

centimators transformers and estimators are dataframe-agnostic, powered by narwhals. You can use the same transformer (like RankTransformer) with both Pandas and Polars DataFrames. Note that some transformers currently support only Polars.

First, let's define some common data:

from centimators.feature_transformers import RankTransformer

# 1. Define your data
data = {
    'date': ['2021-01-01', '2021-01-01', '2021-01-02'],
    'feature1': [3, 1, 2],     # On 2021-01-01: 1 ranks 1st, 3 ranks 2nd
    'feature2': [30, 20, 10],  # On 2021-01-01: 20 ranks 1st, 30 ranks 2nd
}
feature_cols = ['feature1', 'feature2']

2. With Pandas:

import pandas as pd

df_pd = pd.DataFrame(data)
transformer = RankTransformer(feature_names=feature_cols)
result_pd = transformer.fit_transform(df_pd[feature_cols], date_series=df_pd['date'])

3. With Polars:

import polars as pl

df_pl = pl.DataFrame(data)
# The same transformer instance can be used, or a new one initialized
result_pl = transformer.fit_transform(df_pl[feature_cols], date_series=df_pl['date'])

Expected Output:

Both result_pd (from Pandas) and result_pl (from Polars) will contain the same transformed data. For the example data, the output would be (shown in pandas form; the Polars result holds the same values but has no index column):

   feature1_rank  feature2_rank
0            1.0            1.0  # Corresponds to original (feature1=3, feature2=30) on 2021-01-01
1            0.5            0.5  # Corresponds to original (feature1=1, feature2=20) on 2021-01-01
2            1.0            1.0  # Corresponds to original (feature1=2, feature2=10) on 2021-01-02

This transformer calculates the normalized rank of each feature within its date group. By default, higher original values receive higher ranks, and in the output above each rank is scaled by the group size: in a group of two, the larger value maps to 1.0 and the smaller to 0.5, while a value that is alone in its group maps to 1.0.
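To make the arithmetic behind the expected output concrete, here is a minimal pure-Python sketch of a grouped, normalized rank. It is not the RankTransformer implementation: the helper name normalized_group_rank is hypothetical, and the tie rule used here (tied values share the highest rank) is an assumption that may differ from the library's behavior.

```python
from collections import defaultdict


def normalized_group_rank(values, groups):
    """Rank each value within its group, then divide by the group size."""
    # Collect the values that belong to each group.
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)

    ranks = []
    for value, group in zip(values, groups):
        members = by_group[group]
        # Rank = number of group members less than or equal to this value
        # (assumption: tied values share the highest rank).
        rank = sum(1 for other in members if other <= value)
        ranks.append(rank / len(members))
    return ranks


dates = ["2021-01-01", "2021-01-01", "2021-01-02"]
feature1 = [3, 1, 2]
feature2 = [30, 20, 10]

print(normalized_group_rank(feature1, dates))  # [1.0, 0.5, 1.0]
print(normalized_group_rank(feature2, dates))  # [1.0, 0.5, 1.0]
```

On the example data this reproduces the feature1_rank and feature2_rank columns shown above.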

Pipeline with Metadata Routing

centimators transformers are designed to work within scikit-learn Pipelines and rely on scikit-learn's metadata routing. Routing lets you pass auxiliary data, such as a date or ticker series, through the pipeline to the specific transformers that need it while chaining multiple transformers together. This is useful for building more complex feature pipelines, and it also makes cross-validation, hyperparameter tuning, and model selection easier. For example, with a regressor at the end of the pipeline, you could search over combinations of lags, moving average windows, and model hyperparameters during training.

Here's an example using RankTransformer and LagTransformer:

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn import set_config
from centimators.feature_transformers import RankTransformer, LagTransformer

# 1. Enable metadata routing globally (once per session)
set_config(enable_metadata_routing=True)

# 2. Sample Data (Pandas OR Polars DataFrame)
data = {
    'date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03', '2023-01-03']),
    'ticker': ['A', 'B', 'A', 'B', 'A', 'B'],
    'price': [10, 20, 11, 21, 12, 22],
    'volume': [100, 200, 110, 210, 120, 220]
}
df = pd.DataFrame(data)

X = df[['price', 'volume']]
dates = df['date']
tickers = df['ticker']

# 3. Instantiate transformers and request metadata
# RankTransformer needs 'date_series'
rank_transformer = RankTransformer().set_transform_request(date_series=True)

# LagTransformer needs 'ticker_series'
lag_transformer = LagTransformer(windows=[0, 1, 2, 3, 4]).set_transform_request(ticker_series=True)

# 4. Create the pipeline
pipeline = make_pipeline(
    rank_transformer,
    lag_transformer,
)

# 5. Fit and transform
# The metadata (dates, tickers) is passed to fit_transform
transformed_data = pipeline.fit_transform(X, date_series=dates, ticker_series=tickers)

Explanation:

  • set_config(enable_metadata_routing=True) turns on scikit-learn's metadata routing.
  • set_transform_request(metadata_name=True) on each transformer tells the pipeline to pass the metadata named metadata_name (e.g., date_series) to that transformer.
  • When pipeline.fit_transform(X, date_series=dates, ticker_series=tickers) is called:
    • The date_series is automatically passed to RankTransformer.
    • The ticker_series is automatically passed to LagTransformer.
    • The output of RankTransformer (ranked features) becomes the input to LagTransformer.
  • The LagTransformer then computes lagged values for the ranked features.

This allows for complex data transformations where different steps require different auxiliary information, all managed cleanly by the pipeline.
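The hyperparameter search mentioned earlier in this section can be sketched with stock scikit-learn components. This example deliberately uses PolynomialFeatures and Ridge as stand-ins rather than centimators classes, and the synthetic data is made up for illustration. The point is the naming convention: make_pipeline names each step after its lowercased class name, so a LagTransformer step would presumably be addressed with a key such as lagtransformer__windows.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=60)

pipeline = make_pipeline(PolynomialFeatures(), Ridge())

# Parameters are addressed as "<step name>__<parameter name>".
param_grid = {
    "polynomialfeatures__degree": [1, 2, 3],
    "ridge__alpha": [0.1, 1.0, 10.0],
}

# Search jointly over the feature step and the model step.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

For time-ordered competition data, a splitter such as scikit-learn's TimeSeriesSplit may be a better choice for cv than the default folds used here.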
