Python package implementing ML feature engineering and pre-processing for polars or pandas dataframes.

Feature engineering on polars and pandas dataframes for machine learning!


tubular implements pre-processing steps for tabular data commonly used in machine learning pipelines.

The transformers are compatible with scikit-learn Pipelines. Each has a transform method to apply the pre-processing step to data and a fit method to learn the relevant information from the data, if applicable.
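The fit/transform contract described above can be sketched in plain Python. `ToyMeanImputer` is a hypothetical stand-in written for this illustration and is not part of tubular; it works on a dict of lists rather than a dataframe so that the example has no dependencies.

```python
from statistics import mean


class ToyMeanImputer:
    """Hypothetical stand-in for a transformer; not part of tubular."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, data):
        # Learn one fill value per column, ignoring missing entries.
        self.impute_values_ = {
            col: mean(v for v in data[col] if v is not None)
            for col in self.columns
        }
        return self  # returning self lets fit and transform be chained

    def transform(self, data):
        # Apply the learned values; columns not listed pass through unchanged.
        out = dict(data)
        for col in self.columns:
            fill = self.impute_values_[col]
            out[col] = [fill if v is None else v for v in data[col]]
        return out


data = {"a": [1, 5], "b": [10, None]}
imputed = ToyMeanImputer(columns=["b"]).fit(data).transform(data)
print(imputed)
```

A step that needs no learned state would simply return `self` from `fit` without storing anything.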

The transformers in tubular are written in narwhals, so they are agnostic to whether the input is a pandas or polars dataframe and use the corresponding (pandas/polars) API under the hood.

There are a variety of transformers to assist with:

  • capping
  • dates
  • imputation
  • mapping
  • categorical encoding
  • numeric operations

Here is a simple example of applying capping to two columns:

import polars as pl
from tubular.capping import CappingTransformer

transformer = CappingTransformer(
    capping_values={"a": [10, 20], "b": [1, 3]},
)

test_df = pl.DataFrame({"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]})

transformer.transform(test_df)
# ->
# shape: (4, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 10  ┆ 3   ┆ 1   │
# │ 15  ┆ 2   ┆ 2   │
# │ 18  ┆ 3   ┆ 3   │
# │ 20  ┆ 1   ┆ 4   │
# └─────┴─────┴─────┘
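To make the arithmetic behind that output explicit, here is a minimal dependency-free sketch of the same capping. It assumes each entry in `capping_values` is read as `[lower, upper]`, which is consistent with the output shown; the `cap` helper is hypothetical and this is not how tubular implements the step.

```python
def cap(values, lower, upper):
    """Clip each value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


capping_values = {"a": [10, 20], "b": [1, 3]}
data = {"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]}

# Columns without an entry in capping_values are left unchanged.
capped = {
    col: cap(vals, *capping_values[col]) if col in capping_values else vals
    for col, vals in data.items()
}
print(capped)
```

Column `c` has no capping values, so it is passed through untouched, just as in the dataframe output.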

tubular also supports saving transformers and pipelines to JSON and loading them back (goodbye .pkls!), as demonstrated below:

import polars as pl
from tubular.imputers import MeanImputer, MedianImputer
from sklearn.pipeline import Pipeline
from tubular.pipeline import dump_pipeline_to_json, load_pipeline_from_json

# Create a simple dataframe

df = pl.DataFrame({"a": [1, 5], "b": [10, None]})

# Add imputers
median_imputer = MedianImputer(columns=["b"])
mean_imputer = MeanImputer(columns=["b"])

# Create and fit the pipeline
original_pipeline = Pipeline(
    [("MedianImputer", median_imputer), ("MeanImputer", mean_imputer)]
)
original_pipeline = original_pipeline.fit(df)

# Dumping the pipeline to JSON
pipeline_json = dump_pipeline_to_json(original_pipeline)
pipeline_json

# Printed value:
# ->
# {
#     'MedianImputer': {
#         'tubular_version': '2.6.1',
#         'classname': 'MedianImputer',
#         'init': {
#             'columns': ['b'],
#             'copy': False,
#             'verbose': False,
#             'return_native': True,
#             'weights_column': None
#         },
#         'fit': {
#             'impute_values_': {'b': 10.0}
#         }
#     },
#     'MeanImputer': {
#         'tubular_version': '2.6.1',
#         'classname': 'MeanImputer',
#         'init': {
#             'columns': ['b'],
#             'copy': False,
#             'verbose': False,
#             'return_native': True,
#             'weights_column': None
#         },
#         'fit': {
#             'impute_values_': {'b': 10.0}
#         }
#     }
# }

# Load the pipeline from JSON
pipeline = load_pipeline_from_json(pipeline_json)

# Verify the reconstructed pipeline
print(pipeline)

# Printed value:
# Pipeline(steps=[('MedianImputer', MedianImputer(columns=['b'])),
#                 ('MeanImputer', MeanImputer(columns=['b']))])
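The general idea of splitting a step into constructor arguments and fitted state, then rebuilding it from JSON, can be sketched with the standard library alone. Everything here (`ToyImputer`, `to_dict`, `from_dict`, `REGISTRY`) is hypothetical and chosen for illustration; it mirrors the `classname`/`init`/`fit` layout printed above but is not tubular's implementation.

```python
import json


class ToyImputer:
    """Hypothetical transformer used only to illustrate the round trip."""

    def __init__(self, columns):
        self.columns = columns

    def to_dict(self):
        # Keep constructor arguments separate from state learned during fit.
        return {
            "classname": type(self).__name__,
            "init": {"columns": self.columns},
            "fit": {"impute_values_": getattr(self, "impute_values_", {})},
        }

    @classmethod
    def from_dict(cls, payload):
        obj = cls(**payload["init"])
        obj.impute_values_ = payload["fit"]["impute_values_"]
        return obj


# Hypothetical registry mapping class names back to classes.
REGISTRY = {"ToyImputer": ToyImputer}

fitted = ToyImputer(columns=["b"])
fitted.impute_values_ = {"b": 10.0}  # pretend fit() has already run

text = json.dumps({"step": fitted.to_dict()})
restored = {
    name: REGISTRY[spec["classname"]].from_dict(spec)
    for name, spec in json.loads(text).items()
}
print(restored["step"].impute_values_)
```

Unlike a pickle, the serialised form is human-readable text and only classes listed in the registry can be reconstructed from it.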

We are currently rolling out support for polars LazyFrames. Track our progress below:

transformer                         polars_compatible  pandas_compatible  jsonable  lazyframe_compatible
AggregateColumnsOverRowTransformer  yes                yes                yes       yes
AggregateRowsOverColumnTransformer  yes                yes                yes       yes
ArbitraryImputer                    yes                yes                yes       yes
BetweenDatesTransformer             yes                yes                yes       yes
CappingTransformer                  yes                yes                yes       yes
ColumnDtypeSetter                   yes                yes                yes       yes
CompareTwoColumnsTransformer        yes                yes                yes       yes
DateDifferenceTransformer           yes                yes                yes       yes
DatetimeComponentExtractor          yes                yes                yes       yes
DatetimeInfoExtractor               yes                yes                yes       yes
DatetimeSinusoidCalculator          yes                yes                yes       yes
DifferenceTransformer               yes                yes                yes       yes
ExtractStringComponentsTransformer  yes                yes                yes       yes
GroupRareLevelsTransformer          yes                yes                yes       yes
LowerCaseTransformer                yes                yes                yes       yes
MappingTransformer                  yes                yes                yes       yes
MeanImputer                         yes                yes                yes       yes
MeanResponseTransformer             yes                yes                yes       yes
MedianImputer                       yes                yes                yes       yes
ModeImputer                         yes                yes                yes       yes
NullIndicator                       yes                yes                yes       yes
OneDKmeansTransformer               yes                yes                yes       no
OneHotEncodingTransformer           yes                yes                yes       yes
OutOfRangeNullTransformer           yes                yes                yes       yes
RatioTransformer                    yes                yes                yes       yes
RemoveCharactersTransformer         yes                yes                yes       yes
RenameColumnsTransformer            yes                yes                yes       yes
SetValueTransformer                 yes                yes                yes       yes
StringContainsTransformer           yes                yes                yes       yes
ToDatetimeTransformer               yes                yes                yes       yes
WhenThenOtherwiseTransformer        yes                yes                yes       yes

Installation

The easiest way to get tubular is to install it directly from PyPI:

pip install tubular
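After installing, one way to confirm which version is present is to query the package metadata from Python. This is a small sketch using only the standard library; `installed_version` is a hypothetical helper name, not something tubular provides.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if tubular is installed, otherwise None.
print(installed_version("tubular"))
```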

Documentation

The documentation for tubular can be found on readthedocs.

Instructions for building the docs locally can be found in docs/README.

Examples

We use doctest to keep the usage examples in the transformer docstrings valid, so please see these to get started!
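For readers unfamiliar with doctest, the sketch below shows how examples embedded in a docstring are executed and checked. The `cap` function is a hypothetical example written for this illustration, not one of tubular's docstrings.

```python
import doctest


def cap(value, lower, upper):
    """Clip value into the closed interval [lower, upper].

    >>> cap(25, 10, 20)
    20
    >>> cap(1, 10, 20)
    10
    >>> cap(15, 10, 20)
    15
    """
    return min(max(value, lower), upper)


# Parse the examples out of the docstring and run them against the function.
test = doctest.DocTestParser().get_doctest(
    cap.__doc__, {"cap": cap}, "cap", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results)
```

If the function's behaviour drifts from what the docstring claims, the failed count becomes non-zero, which is what keeps documented examples honest.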

Issues

For bugs and feature requests please open an issue.

Build and test

This project uses pytest as its test framework. To build the package locally and run the tests, follow the steps below.

First, clone the repo and move to the root directory:

git clone https://github.com/azukds/tubular.git
cd tubular

Next, install tubular and the development dependencies:

pip install . -r requirements-dev.txt

Finally, run the test suite with pytest:

pytest

Contribute

tubular is under active development, and we're super excited if you're interested in contributing!

See the CONTRIBUTING file for the full details of our working practices.
