DAtaset siZe Effect estimatoR
DAZER (DAtaset siZe Effect estimatoR)
Class Subsampler
The 'Subsampler' class draws subsamples of a given proportion from the data. While doing so, it can preserve the distribution of values in selected features (columns_keep_ratio).
Additionally, it can extract a test dataset. Samples in the test dataset are excluded from all subsequent subsamples.
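To illustrate what preserving the distribution of a column means, here is a minimal plain-Python sketch of stratified subsampling. It is not dazer's implementation: the helper names (stratified_subsample, ratios, max_relative_deviation) are hypothetical, and treating the allowed deviation as a relative change in each group's share is an assumption made for this example only.

```python
import random
from collections import Counter, defaultdict


def stratified_subsample(rows, key, factor, seed=None):
    """Draw about `factor` of `rows`, sampling each group of `key` separately."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        # Round per group so every group keeps roughly its original share.
        n_take = round(len(members) * factor)
        sample.extend(rng.sample(members, n_take))
    return sample


def ratios(rows, key):
    """Relative frequency of each value of `key`."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def max_relative_deviation(original, subsample, key):
    """Largest relative change in any group's share between the two sets.

    Assumption: deviation is measured relative to the original share.
    """
    before, after = ratios(original, key), ratios(subsample, key)
    return max(abs(after.get(v, 0.0) - p) / p for v, p in before.items())


# Toy data: 70 rows with label 0 and 30 rows with label 1.
data = [{"id": i, "label": 0 if i < 70 else 1} for i in range(100)]
subset = stratified_subsample(data, key="label", factor=0.2, seed=101)
deviation = max_relative_deviation(data, subset, key="label")
```

With the toy data, the subsample keeps the 70/30 split of the label column, so the measured deviation stays well below a threshold such as 0.2.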
setup & generate test data
import dazer
import pandas as pd
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=10, random_state=444)
df = pd.DataFrame(X)
# join returns a new DataFrame, so assign the result to keep the 'label' column
df = df.join(pd.Series(y, name='label'))
subsample
subsampler = dazer.Subsampler(df, columns_keep_ratio=['label'], allowed_deviation=.2)
df_test = subsampler.extract_test()
df_test = subsampler.extract_test(test_size=.2, random_state=101)
df_train_1 = subsampler.subsample(subsample_factor=.1, random_state=101)
df_train_2 = subsampler.subsample(subsample_factor=.2, random_state=101)
df_train_3 = subsampler.subsample(subsample_factor=.3, random_state=101)
Class Classifier
prepare data for training and testing
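# assumed values for this example: 'label' is the target column, class 1 is the positive class
target_column = 'label'
target_value = 1
y_test = df_test[target_column] == target_value
X_test = df_test.drop([target_column], axis=1)
y_train = df_train_1[target_column] == target_value
X_train = df_train_1.drop([target_column], axis=1)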
model training and evaluation (Random Forest)
classifier = dazer.Classifier(X_train, y_train, X_test, y_test)
model, evaluation = classifier.train_test('rf', random_state=101, scoring='f1')
model training and evaluation (Multi-layer Perceptron)
classifier = dazer.Classifier(X_train, y_train, X_test, y_test)
model, evaluation = classifier.train_test('mlp', random_state=101, scoring='f1', param_model={'solver': 'lbfgs', 'hidden_layer_sizes': (10, 5), 'random_state': 101, 'alpha': 1e-5})
model training and evaluation (Support Vector Classification with rbf kernel)
classifier = dazer.Classifier(X_train, y_train, X_test, y_test)
model, evaluation = classifier.train_test('svc', random_state=101, scoring='f1', param_model={'kernel': 'rbf', 'C': 1, 'gamma': 2, 'random_state': 101})
available models:
- 'rf' (Random Forest)
- 'xgb' (XGBoost)
- 'mlp' (Multi-layer Perceptron)
- 'gnb' (Gaussian Naive Bayes)
- 'svc' (Support Vector Classification)
save model immediately as .joblib object
classifier = dazer.Classifier(X_train, y_train, X_test, y_test)
model, evaluation = classifier.train_test('rf', random_state=101, model_path='models/model_1.joblib', scoring='f1')
Utils
High-level wrappers that combine the dazer functionality.
test_dict, train_dict = dazer.subsample_iterative(df, columns_keep_ratio=[], allowed_deviation=.2, test_size=.2, random_states=[101, 102, 103, 104, 105], attempts=10000, ratios=[.2, .4, .6, .8, 1])
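Once a model has been trained and scored for each ratio and random state, the dataset size effect can be read from how the score changes with the ratio. The sketch below shows one possible way to summarise such results. The structure of the dictionaries returned by dazer is not described here, so the sketch works on a hypothetical mapping from ratio to a list of F1 scores; the numbers are placeholders for illustration, not measured results, and the function name summarise is not part of dazer.

```python
from statistics import mean, stdev

# Hypothetical F1 scores: one list per training-set ratio, one entry per
# random state. Placeholder values for illustration only, not real results.
scores_by_ratio = {
    0.2: [0.61, 0.58, 0.64],
    0.6: [0.72, 0.70, 0.74],
    1.0: [0.78, 0.77, 0.79],
}


def summarise(scores):
    """Return (ratio, mean, standard deviation) tuples sorted by ratio."""
    return [
        (ratio, mean(values), stdev(values))
        for ratio, values in sorted(scores.items())
    ]


summary = summarise(scores_by_ratio)
for ratio, avg, spread in summary:
    print(f"ratio={ratio:.1f}  mean F1={avg:.3f}  sd={spread:.3f}")
```

Averaging over several random states separates the effect of the training-set size from the noise introduced by any single subsample.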
Run unittests
python3 -m unittest discover tests