Ensemble dataset generator for tabular data prediction and modeling projects.

EnsembleSet

EnsembleSet generates dataset ensembles by applying a randomized sequence of feature engineering methods to a randomized subset of input features.

1. Installation

Install the pre-release alpha from PyPI with:

pip install ensembleset

2. Usage

See the example usage notebook.

Initialize a DataSet class instance (from ensembleset.dataset), passing in the label name and the training DataFrame. Optionally, include a test DataFrame, a list of any string features, and the path where EnsembleSet should write its data. Then call make_datasets() to generate an ensemble of datasets, specifying:

  1. The number of individual datasets to generate.
  2. The fraction of features to randomly select for each feature engineering step.
  3. The number of feature engineering steps to run.

import ensembleset.dataset as ds

data_ensemble=ds.DataSet(
    label='label_column_name',                       # Required
    train_data=train_df,                             # Required
    test_data=test_df,                               # Optional, defaults to None
    string_features=['string_feature_column_names'], # Optional, defaults to None
    data_directory='path/to/ensembleset/data'        # Optional, defaults to ./data
)

data_ensemble.make_datasets(
    n_datasets=10,         # Required
    fraction_features=0.1, # Required
    n_steps=5              # Required
)

The above call to make_datasets() generates 10 different datasets. Each one is built by running a random sequence of 5 feature engineering techniques, with each technique applied to a randomly selected 10% of the features. The feature selection is re-drawn after each feature engineering step. If a test set is provided, each step is also applied to it in a way that minimizes data leakage (e.g. Gaussian KDE is fit on the training data only and then applied to both the training and testing data).
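
The sampling loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not EnsembleSet's actual implementation: the helper names (log_features, sum_features, make_dataset), the dict-of-lists data layout, and the rule of selecting at least one feature per step are all hypothetical.

```python
import math
import random


def log_features(data, selected, rng):
    """Hypothetical step: add a log feature for each selected column."""
    base = rng.choice([2, math.e, 10])  # step parameter drawn at random
    return {
        f"{name}_log": [math.log(abs(v) + 1.0, base) for v in data[name]]
        for name in selected
    }


def sum_features(data, selected, rng):
    """Hypothetical step: add one feature with the row-wise sum of the selection."""
    n_rows = len(next(iter(data.values())))
    name = "sum_" + "_".join(selected)
    return {name: [sum(data[col][i] for col in selected) for i in range(n_rows)]}


def make_dataset(data, fraction_features, n_steps, seed=None):
    """Apply n_steps randomly chosen steps, each to a fresh random feature subset."""
    rng = random.Random(seed)
    data = dict(data)  # shallow copy so the caller's dict is not modified
    methods = [log_features, sum_features]
    for _ in range(n_steps):
        # Re-draw the subset from the current columns, which include
        # any features created by earlier steps.
        n_select = max(1, round(fraction_features * len(data)))
        selected = rng.sample(sorted(data), n_select)
        method = rng.choice(methods)
        data.update(method(data, selected, rng))
    return data


features = {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0], "c": [7.0, 8.0, 9.0]}
# Ten datasets, each from its own random sequence of five steps.
ensemble = [make_dataset(features, 0.5, 5, seed=i) for i in range(10)]
```

Each member of the ensemble keeps the original columns and gains a different set of engineered ones, which is the source of diversity between the generated datasets.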

By default, generated datasets will be saved to HDF5 in data/dataset.h5 using the following structure:

dataset.h5
├── train
│   ├── labels
│   ├── 1
│   ├── .
│   ├── .
│   ├── .
│   └── n
│
└── test
    ├── labels
    ├── 1
    ├── .
    ├── .
    ├── .
    └── n
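
To read the generated datasets back, something like the following sketch may work. It assumes that labels and the numbered members are stored as plain HDF5 datasets readable with h5py, which the layout above suggests but does not confirm (a file written through pandas would need pandas.read_hdf instead). The helper names read_split and load_ensemble are hypothetical and not part of EnsembleSet.

```python
def read_split(group):
    """Return (labels, datasets) for one split laid out as in the tree above.

    Works with any mapping whose values support slicing, such as an
    h5py.Group holding HDF5 datasets.
    """
    labels = group["labels"][:]
    datasets = {key: group[key][:] for key in group if key != "labels"}
    return labels, datasets


def load_ensemble(path="data/dataset.h5"):
    """Read every split that has labels (assumes plain HDF5 datasets)."""
    import h5py  # third-party; imported here so read_split has no hard dependency

    with h5py.File(path, "r") as hdf:
        return {
            split: read_split(hdf[split])
            for split in hdf
            if "labels" in hdf[split]
        }
```

Calling load_ensemble() would then return a dictionary keyed by 'train' (and 'test' when present), each holding the labels and a dictionary of the numbered datasets.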

3. Feature engineering

The currently implemented pool of feature engineering methods is:

  1. One-hot encoding for string features
  2. Ordinal encoding for string features
  3. Log features with base 2, e or 10
  4. Ratio features
  5. Exponential features with base 2 or e
  6. Sum features with 2, 3 or 4 addends
  7. Difference features with 2, 3 or 4 subtrahends
  8. Polynomial features with degree 2 or 3
  9. Spline features with degree 2, 3 or 4
  10. Quantized features using randomly selected k-bins
  11. Smoothed features with Gaussian kernel density estimation

Major feature engineering parameters are also randomly selected for each step.
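
As an illustration of a randomly parameterized step that avoids data leakage, the sketch below quantizes one feature with a randomly drawn number of bins, computing the bin edges from the training data only and reusing them for the test data. The quantile binning strategy, the range of bin counts, and the function names (fit_bin_edges, quantize) are assumptions made for this example and may differ from what EnsembleSet does internally.

```python
import bisect
import random
import statistics


def fit_bin_edges(train_values, n_bins):
    """Compute the n_bins - 1 quantile cut points from training data only."""
    return statistics.quantiles(train_values, n=n_bins)


def quantize(values, edges):
    """Map each value to the index of its bin, from 0 to len(edges)."""
    return [bisect.bisect_right(edges, v) for v in values]


rng = random.Random(42)
n_bins = rng.randint(2, 8)  # step parameter drawn at random (assumed range)

train_feature = [0.3, 1.2, 2.5, 2.9, 4.1, 5.6, 7.8, 9.0]
test_feature = [-1.0, 3.3, 12.0]

edges = fit_bin_edges(train_feature, n_bins)    # fit on training data only
train_binned = quantize(train_feature, edges)   # apply to training data
test_binned = quantize(test_feature, edges)     # reuse the same edges for test data
```

Because the edges never see the test values, test points outside the training range simply fall into the first or last bin.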
