Download and preprocess your dataset from any source, all in one place.
Datati: Modern (tabular) datasets require modern solutions
Dataset to model, in one go!
datati is a small library to streamline tabular dataset loading and preprocessing.
The goal of this library is to minimize the boring boilerplate code that separates choosing a dataset to work on
from actually getting it ready to train a classification model.
datati provides simple interfaces to load, preprocess, and encode a dataset for training your model of choice:
from datati.dataset import Dataset
from datati.models.trees import ContinuousTreeModeler
# load dataset
dataset = Dataset("mstz/adult", config="income", split="train", load_from="huggingface",
                  target_feature="over_threshold")
tree_dataset = ContinuousTreeModeler().process(dataset).to_array()
x, y = tree_dataset[:, :-1], tree_dataset[:, -1]
This snippet loads a dataset (Dataset("mstz/adult")) in the desired configuration
(config="income") and split (split="train") of choice.
Then we use a Modeler object to map the initial Dataset into an encoding suitable for decision tree induction
(ContinuousTreeModeler().process(dataset)).
datati builds on top of the Hugging Face Hub, providing an interface to integrate it with common preprocessing
pipelines.
Quickstart
pip install datati
What datasets are available?
datati allows you to load huggingface (load_from="huggingface") or local (load_from="local") datasets,
whether they are numpy.arrays, pandas.DataFrames, or pyarrow.ArrowDatasets.
What can I do with a dataset?
Most operations have no side effects: they yield a new Dataset object rather than modifying the
existing one.
Dataset extends pandas.DataFrame, so all operations supported on a pandas.DataFrame are also supported on
Dataset instances.
Methods that would yield a pandas.DataFrame have been overridden to yield a Dataset instead.
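For instance (a minimal sketch, reusing the adult dataset loaded above and assuming its age column):

# familiar DataFrame operations work and yield a new Dataset,
# leaving the original untouched
adults_only = dataset[dataset["age"] >= 30]
print(type(adults_only))  # datati.dataset.Dataset, not pandas.DataFrame
print(len(dataset))       # unchanged: filtering had no side effect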
Dunders
Dataset implements most dunder methods.
A dataset d can be both copied (copy.copy(d)) and deep-copied (copy.deepcopy(d)); it can also be checked for equality
and hashed (hash(d)).
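For instance (a minimal sketch, reusing the dataset loaded above):

import copy

# standard copy protocol
shallow = copy.copy(dataset)
deep = copy.deepcopy(dataset)

# equality checks and hashing
print(deep == dataset)
print(hash(dataset))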
Conversion to/from other formats
Datasets can be directly exported to:
- pandas.DataFrame (dataset.to_pandas())
- numpy.array (dataset.to_array())
- list (dataset.to_list())
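For example:

df = dataset.to_pandas()    # pandas.DataFrame
array = dataset.to_array()  # numpy.array
rows = dataset.to_list()    # list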
Model-specific encoding
NOTE: As of now, datati is aimed exclusively at single-output tabular classifiers, hence string/object features are treated as categorical.
The Modeler class (datati.models.Modeler) implements a minimal interface to preprocess a dataset for the
algorithm of choice.
Currently, datati implements:
| Modeler | Algorithm | Info |
|---|---|---|
| ContinuousTreeModeler | Decision tree | Categorical features are encoded through target encoding. |
| OneHotTreeModeler | Decision tree | Categorical features are encoded through one-hot encoding. |
| SBRLModeler | SBRL | All features are binned, then binarized. |
| CorelsModeler | CORELS | All features are binned, then binarized. |
All implemented Modelers leave a trace of their own transformations by enriching the transformed Dataset with
transformation-specific mappings:
from pprint import pprint

from datati.dataset import Dataset
from datati.models.trees import ContinuousTreeModeler

dataset = Dataset("mstz/adult", config="income", split="train", load_from="huggingface")

# preprocess dataset for decision tree classification
dataset.target_feature = "over_threshold"
modeler = ContinuousTreeModeler()
tree_dataset = modeler.process(dataset)
pprint(tree_dataset.bins_encoding_dictionaries)
# {}
pprint(tree_dataset.one_hot_encoding_dictionaries)
# {}
pprint(tree_dataset.target_encoding_dictionaries)
#{'marital_status': {'Divorced': array([0.12109962]),
# 'Married-AF-spouse': array([0.00057328]),
# 'Married-civ-spouse': array([0.25382872]),
# 'Married-spouse-absent': array([0.01165679]),
# 'Never-married': array([0.3155797]),
# 'Separated': array([0.02942863]),
# 'Widowed': array([0.02855505])},
# 'native_country': {'?': array([0.01321285]),
# 'Cambodia': array([0.00035489]),
# 'Canada': array([0.00232044]),
# 'China': array([0.00174715]),
# 'Columbia': array([0.00185635]),
# 'Cuba': array([0.00204745]),
# ...
Similarly, when applying one-hot encoding, tree_dataset.one_hot_encoding_dictionaries will hold the encoding indexes:
from pprint import pprint

from datati.dataset import Dataset
from datati.models.trees import OneHotTreeModeler

dataset = Dataset("mstz/adult", config="income", split="train", load_from="huggingface")

# preprocess dataset for decision tree classification with one-hot encoding
dataset.target_feature = "over_threshold"
modeler = OneHotTreeModeler()
tree_dataset = modeler.process(dataset)

pprint(tree_dataset.one_hot_encoding_dictionaries)
# {'marital_status': {'Divorced': 0,
# 'Married-AF-spouse': 1,
# 'Married-civ-spouse': 2,
# 'Married-spouse-absent': 3,
# 'Never-married': 4,
# 'Separated': 5,
# 'Widowed': 6},
# 'native_country': {'?': 0,
# 'Cambodia': 1,
# 'Canada': 2,
# 'China': 3,
# 'Columbia': 4,
# 'Cuba': 5,
# ...
Encoding Legos
Some lower-level modelers can be composed as small building blocks to provide the desired result.
Currently, datati implements:
| Modeler | Info |
|---|---|
| TargetModeler | Categorical features are encoded through target encoding. |
| OneHotModeler | Categorical features are encoded through one-hot encoding. |
| BinModeler | Numerical features are discretized into bins. |
| BinaryModeler | All features are first binned, then each bin is transformed into a boolean feature. |
| BoolToIntModeler | Booleans are mapped to {0, 1}. |
| IntToBoolModeler | Integers are mapped to booleans. |
| SBRLModeler | All features are binned, then binarized. |
| CorelsModeler | All features are binned, then binarized. |
Similarly to scikit-learn pipelines, you can implement your own Modeler by combining existing modelers through a
Pipeline, as is done in the ContinuousTreeModeler.
Here, we first target-encode categorical variables, then transform booleans into integers:
class ContinuousTreeModeler(NumericModeler):
    """Model data as continuous, mapping boolean features to 0/1, and categorical features to target encoders."""

    def __init__(self):
        super().__init__()
        self.pipeline = Pipeline(TargetModeler(), BoolToIntModeler(guess_booleans=True))

    def process(self, dataset: Dataset, **kwargs) -> Dataset:
        """Adapt the given `dataset` to be fed to the model of choice.

        Args:
            dataset: The dataset to process.
            **kwargs: Keyword arguments.

        Returns:
            The processed dataset.
        """
        return self.pipeline(dataset, **kwargs)
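As another sketch, the same pattern composes any of the blocks above; for instance, a hypothetical
BinnedOneHotModeler could bin numerical features and then one-hot encode the result (the import paths
below are assumptions, mirroring the ContinuousTreeModeler example):

from datati.dataset import Dataset
# NOTE: import paths are assumptions, mirroring the example above
from datati.models import Modeler, Pipeline, BinModeler, OneHotModeler

class BinnedOneHotModeler(Modeler):
    """Discretize numerical features into bins, then one-hot encode the resulting categories."""

    def __init__(self):
        super().__init__()
        self.pipeline = Pipeline(BinModeler(), OneHotModeler())

    def process(self, dataset: Dataset, **kwargs) -> Dataset:
        """Adapt the given `dataset` to be fed to the model of choice."""
        return self.pipeline(dataset, **kwargs)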
Run on your own (local) dataset
Local or remote doesn't matter: datati also works on your own local dataset.
All it takes is to specify that we're working on a local dataset (load_from="local"):
from datati.dataset import Dataset
dataset = Dataset("./adult", load_from="local")