Project description
DAZER (DAtaset siZe Effect estimatoR)
Example
import dazer
import seaborn as sns
target_column = 'y'
df = sns.load_dataset('penguins')
df = df.dropna()
df[target_column] = df['species'] == 'Adelie'
subsampler = dazer.Subsampler(df, ['body_mass_g', 'y'], .07, True)
df_test = subsampler.extract_test(.2, random_state=2)
df_train = subsampler.subsample(.4, random_state=3)
y_test = df_test[target_column]
X_test = df_test.drop([target_column], axis=1)
y_train = df_train[target_column]
X_train = df_train.drop([target_column], axis=1)
classifier = dazer.Classifier(X_train, y_train, X_test, y_test)
model, evaluation = classifier.train_test('rf', scoring='f1')
print(evaluation)
Class Subsampler
The 'Subsampler' class draws subsamples containing a given proportion of the data. While doing so, it can preserve the distribution of values in selected columns (columns_keep_ratio) up to a configurable allowed_deviation.
Additionally, it can extract a test dataset; samples in this test set are excluded from all subsequent subsamples. A quick check of the preserved ratios is sketched after the subsampling example below.
setup & generate test data
import dazer
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import pandas as pd
X, y = make_classification(n_samples=100, n_features=10, random_state=444)
df = pd.DataFrame(X)
df = df.join(pd.Series(y, name='label'))
subsample
subsampler = dazer.Subsampler(df, columns_keep_ratio=['label'], allowed_deviation=.2)
df_test = subsampler.extract_test(test_size=.2, random_state=101)
df_train_1 = subsampler.subsample(subsample_factor=.1, random_state=101)
df_train_2 = subsampler.subsample(subsample_factor=.2, random_state=101)
df_train_3 = subsampler.subsample(subsample_factor=.3, random_state=101)
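To check the effect of columns_keep_ratio described above, you can compare the label distribution of the full dataset with that of the extracted test set and a subsample. This is a minimal sketch using plain pandas only; no additional dazer API is assumed.
# Compare the positive-label ratio across the full data, the test set and a subsample.
# With columns_keep_ratio=['label'], the subsampled ratios should stay close to the original.
for name, frame in [('full', df), ('test', df_test), ('subsample 0.1', df_train_1)]:
    print(f"{name}: n={len(frame)}, positive label ratio={frame['label'].mean():.3f}")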
Class Classifier
The class 'Classifier' contains wrappers for a number of classification models. Currently supported models are:
- 'rf' (Random Forest)
- 'xgb' (XGBoost)
- 'mlp' (Multi-layer Perceptron)
- 'gnb' (Gaussian Naive Bayes)
- 'svc' (Support Vector Classification)
prepare data for training and testing
target_column = 'label'  # column created above from make_classification
target_value = 1  # treat class 1 as the positive class
y_test = df_test[target_column] == target_value
X_test = df_test.drop([target_column], axis=1)
y_train = df_train_1[target_column] == target_value
X_train = df_train_1.drop([target_column], axis=1)
model training and evaluation (example: Random Forest)
classifier = dazer.Classifier(X_train, y_train, X_test, y_test)
model, evaluation = classifier.train_test('rf', scoring='f1')
For the possible scoring options, refer to https://scikit-learn.org/stable/modules/model_evaluation.html.
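Since dazer is built to estimate how dataset size affects model performance, a typical workflow trains the same model on subsamples of increasing size and compares the scores. The sketch below only reuses the calls shown above; the exact structure of the returned evaluation object depends on the dazer version.
# Train the same model type on increasingly large subsamples and collect the evaluations.
results = {}
for factor in [.1, .2, .3]:
    df_train = subsampler.subsample(subsample_factor=factor, random_state=101)
    y_train = df_train[target_column] == target_value
    X_train = df_train.drop([target_column], axis=1)
    classifier = dazer.Classifier(X_train, y_train, X_test, y_test)
    model, evaluation = classifier.train_test('rf', scoring='f1')
    results[factor] = evaluation  # compare scores across subsample sizes
print(results)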
model training and evaluation (Multi-layer Perceptron)
classifier = dazer.Classifier(X_train, y_train, X_test, y_test)
model, evaluation = classifier.train_test('mlp', scoring='f1', param_grid={'solver': 'lbfgs', 'hidden_layer_sizes': (10, 5), 'random_state': 101, 'alpha': 1e-5})
model training and evaluation (Support Vector Classification with rbf kernel)
classifier = dazer.Classifier(X_train, y_train, X_test, y_test)
model, evaluation = classifier.train_test('svc', scoring='f1', param_grid={'kernel': 'rbf', 'C': 1, 'gamma': 2, 'random_state': 101})
save model immediately as .joblib object
classifier = dazer.Classifier(X_train, y_train, X_test, y_test)
model, evaluation = classifier.train_test('rf', model_path='models/model_1.joblib', scoring='f1')
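The saved model can later be restored with joblib and used for prediction. This assumes the returned model follows the usual scikit-learn estimator interface.
import joblib
# Load the persisted model and predict on the held-out test data
# (assumes a scikit-learn-compatible estimator with a predict method).
loaded_model = joblib.load('models/model_1.joblib')
y_pred = loaded_model.predict(X_test)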
Class Regressor
The regression models are implemented analogously to the classification models of the 'Classifier' class above. Currently supported models are:
- 'rf' (Random Forest)
- 'xgb' (XGBoost)
- 'mlp' (Multi-layer Perceptron)
- 'svr' (Support Vector Regression)
imports and generate test data
import dazer
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import pandas as pd
X, y = make_regression(n_samples=1000, n_features=10, noise=1, random_state=444)
target_column = 'y'
df = pd.DataFrame(X)
df = df.join(pd.Series(y, name=target_column))
prepare data for training and testing with dazer
subsampler = dazer.Subsampler(df, columns_keep_ratio=[target_column], allowed_deviation=.2)
df_test = subsampler.extract_test(test_size=.2, random_state=101)
df_train = subsampler.subsample(subsample_factor=.1, random_state=101)
y_test = df_test[target_column]
X_test = df_test.drop([target_column], axis=1)
y_train = df_train[target_column]
X_train = df_train.drop([target_column], axis=1)
model training and evaluation (example: Random Forest)
regressor = dazer.Regressor(X_train, y_train, X_test, y_test)
model, evaluation = regressor.train_test('rf', scoring='max_error')
For the possible scoring options, refer to https://scikit-learn.org/stable/modules/model_evaluation.html.
save model immediately as .joblib object
regressor = dazer.Regressor(X_train, y_train, X_test, y_test)
model, evaluation = regressor.train_test('rf', model_path='models/model_1.joblib', scoring='max_error')
Utils
Useful high-level wrappers that combine the dazer functionality described above.
test_dict, train_dict = dazer.subsample_iterative(df, columns_keep_ratio=[], allowed_deviation=.2, test_size=.2, random_states=[101, 102, 103, 104, 105], attempts=10000, ratios=[.2, .4, .6, .8, 1])
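As a usage sketch, the wrapper can be called directly on a dataframe (here reusing the regression dataframe and target_column from above) to create test sets and subsampled training sets for several random states and ratios in one go. The exact key structure of the returned dictionaries depends on the dazer version, so the sketch only inspects them.
# Hypothetical usage sketch of the wrapper shown above.
test_dict, train_dict = dazer.subsample_iterative(
    df, columns_keep_ratio=[target_column], allowed_deviation=.2, test_size=.2,
    random_states=[101, 102, 103, 104, 105], attempts=10000, ratios=[.2, .4, .6, .8, 1])
print(list(test_dict.keys()))
print(list(train_dict.keys()))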
Run unittests
python3 -m unittest discover tests
File details
Details for the file dazer-0.2.3.tar.gz
File metadata
- Download URL: dazer-0.2.3.tar.gz
- Upload date:
- Size: 13.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0f544d04f4f06699fdbe06a2a97e4fc8e2b6c9f7f37f5171fab407bb426bd3f1
MD5 | 3cfa9fed5ba2622d3f8193e12e5aa624
BLAKE2b-256 | 335ba536405c948445aadeeb24781feb38e39c08bdaa2bc804f65bea6a7881f6
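To verify a downloaded archive against the digests above, you can compute the hash locally; this sketch assumes dazer-0.2.3.tar.gz sits in the current directory.
import hashlib
# Compare the local file's SHA256 digest with the published value.
with open('dazer-0.2.3.tar.gz', 'rb') as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print(digest == '0f544d04f4f06699fdbe06a2a97e4fc8e2b6c9f7f37f5171fab407bb426bd3f1')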