Download and preprocess your dataset from any source, all in one place.

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Operating System
- OS Independent
Programming Language
- Python :: 3.10

Project description

Datati: Modern (tabular) datasets require modern solutions

Dataset to model, in one go! datati is a small library to streamline tabular dataset loading and preprocessing. The goal of this library is to minimize the boring boilerplate code that separates choosing a dataset to work on, and actually getting it ready to train a classification model . datati provides simple interfaces to load, preprocess, and encode a dataset for training your model of choice:

from datati.dataset import Dataset
from datati.models.trees import ContinuousTreeModeler

# load dataset
dataset = Dataset("mstz/adult", config="income", split="train", load_from="huggingface",
                  target_feature="over_threshold")
tree_dataset = ContinuousTreeModeler().process(dataset).to_array()

x, y = tree_dataset[:, :-1], tree_dataset[:, -1]

This snippet allows us to load a dataset (Dataset("mstz/adult")) in a desired configuration (config="income") and split (split="train"") of choice. Then we use a Modeler object to map the initial Dataset into an encoding suitable for Decision Tree induction (modeler.process(dataset)).

datati builds on top of the huggingface hub, providing an interface to integrate it with common preprocessing pipelines.

Quickstart

pip install

What datasets are available?

datati allows you to load huggingface (load_from="huggingface""), or local (load_from="local"") datasets, whether they are numpy.arrays, pandas.DataFrames, or pyarrow.ArrowDatasets.

What can I do with a dataset?

Most operations have no side-effects, that is, they yield a new Dataset object, rather than modifying the existing one. Extending pandas.DataFrame, all operations supported on a pandas.DataFrame are also supported on Dataset instances. Methods yielding a pandas.DataFrame have been overwritten to yield a Dataset instead.

Dunders Dataset implements most dunder methods. A dataset d can be both copied (copy.copy(d)) and deepcopied (copy.deepcopy(d)), it can be checked for equality, and hashed (hash(d)).

Conversion to/from other formats Datasets can be directly exported to:

pandas.DataFrame (dataset.to_pandas())
numpy.array (dataset.to_array())
list (dataset.to_list())

Model-specific encoding

NOTE As of now datati is aimed exclusively at single-output tabular classifiers, hence string/object features are treated as categorical.

The Modeler class (datati.models.Modeler) implements a minimal interface to map a dataset for processing for the algorithm of choice. Currently, datati implements:

	Algorithm	Info`
`ContinuousTreeModeler`	Decision tree	Categorical features are encoded through target encoding.
`OneHotTreeModeler`	Decision tree	Categorical features are encoded through one-hot encoding.
`SBRLModeler`	SBRL	All features are binned, then binarized.
`CorelsModeler`	CORELS	All features are binned, then binarized.

All implemented Modelers leave a trace of their own transformations by enriching the transformed Dataset with transformation-specific mappings:

from pprint import pprint

dataset = Dataset("mstz/adult", config="income", split="train", load_from="huggingface")

# preprocess dataset for decision tree classification
dataset.target_feature = "over_threshold"
modeler = ContinuousTreeModeler()
tree_dataset = modeler.process(dataset)

pprint(tree_dataset.bins_encoding_dictionaries)
# {}
pprint(tree_dataset.one_hot_encoding_dictionaries)
# {}
pprint(tree_dataset.target_encoding_dictionaries)
 #{'marital_status': {'Divorced': array([0.12109962]),
 #                    'Married-AF-spouse': array([0.00057328]),
 #                    'Married-civ-spouse': array([0.25382872]),
 #                    'Married-spouse-absent': array([0.01165679]),
 #                    'Never-married': array([0.3155797]),
 #                    'Separated': array([0.02942863]),
 #                    'Widowed': array([0.02855505])},
 # 'native_country': {'?': array([0.01321285]),
 #                    'Cambodia': array([0.00035489]),
 #                    'Canada': array([0.00232044]),
 #                    'China': array([0.00174715]),
 #                    'Columbia': array([0.00185635]),
 #                    'Cuba': array([0.00204745]),
 # ...

Similarly, when applying one-hot encoding, dataset.one_hot_encoding_dictionary will hold the encoding indexes:

from pprint import pprint

dataset = Dataset("mstz/adult", config="income", split="train", load_from="huggingface")

# preprocess dataset for decision tree classification
dataset.target_feature = "over_threshold"
modeler = ContinuousTreeModeler()
tree_dataset = modeler.process(dataset)

pprint(tree_dataset.one_hot_encoding_dictionary)
# {'marital_status': {'Divorced': 0,
#                     'Married-AF-spouse': 1,
#                     'Married-civ-spouse': 2,
#                     'Married-spouse-absent': 3,
#                     'Never-married': 4,
#                     'Separated': 5,
#                     'Widowed': 6},
#  'native_country': {'?': 0,
#                     'Cambodia': 1,
#                     'Canada': 2,
#                     'China': 3,
#                     'Columbia': 4,
#                     'Cuba': 5,
# ...

Encoding Legos

Some lower-level modelers can be composed as small building blocks to provide the desired result. Currently, datati implements:

	Info
`TargetModeler`	Categorical features are encoded through target encoding.
`OneHotModeler`	Categorical features are encoded through one-hot encoding.
`BinModeler`	Numerical features are discretized into bins.
`BinaryModeler`	All features are first binned, then each bin is transformed into a boolean feature.
`BoolToIntModeler`	Booleans are mapped to `{0, 1}`.
`IntToBoolModeler`	Integers are mapped to booleans.
`SBRLModeler`	All features are binned, then binarized.
`CorelsModeler`	All features are binned, then binarized.

Similarly to scikit-learn pipelines, you can implement your own Modeler by combining existing modelers through a pipeline, as it is done in the ContinuousTreeModeler. Here, we first target-encode categorical variables, then transform booleans into integers:

class ContinuousTreeModeler(NumericModeler):
    """Model data as continuous, mapping boolean features to 0/1, and categorical features to target encoders."""
    def __init__(self):
        super().__init__()
        self.pipeline = Pipeline(TargetModeler(), BoolToIntModeler(guess_booleans=True))

    def process(self, dataset: Dataset, **kwargs) -> Dataset:
        """Adapt the given `dataset` to be fed to the model of choice.

        Args:
           dataset: The dataset to process.
           **kwargs: Keyword arguments.

        Returns:
           The processed dataset.
        """
        return self.pipeline(dataset, **kwargs)

Run on your own (local) dataset

Local or remote doesn't matter, datati can be integrated to work on your own local dataset. All it takes, is to specify that we're working on a local (local=True) dataset:

from datati.dataset import Dataset

dataset = Dataset("./adult", load_from="local")

Project details

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Operating System
- OS Independent
Programming Language
- Python :: 3.10

Release history Release notifications | RSS feed

This version

0.1.0

Jun 19, 2023

0.0.26

Jun 19, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datati-0.1.0.tar.gz (48.0 kB view hashes)

Uploaded Jun 19, 2023 Source

Built Distribution

datati-0.1.0-py3-none-any.whl (50.9 kB view hashes)

Uploaded Jun 19, 2023 Python 3

Hashes for datati-0.1.0.tar.gz

Hashes for datati-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`96863d5caa4674ff00c6b25a40f936765a3e85e838bdb867dd7113f01e844452`
MD5	`2aa49540853c578256da13a6d49472ce`
BLAKE2b-256	`d0ffaf186d5892afcdd3c24f09ada9e3138e14c07a76d65a513f4a3fbd8c6fe4`

Hashes for datati-0.1.0-py3-none-any.whl

Hashes for datati-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9c14c637cef5e303f3aaabafb9c887ac77b836182e66cb2ad287e4c3ce2e79d0`
MD5	`b0b47834d998f1b301e2bce62d78a94a`
BLAKE2b-256	`fa79908acf34010c325de06eebeeb5a0d73c307f1e5336be5ec3469f0d786a55`