Data Science Library
WOLTA DOCUMENTATION
Wolta is designed to simplify frequently used processes involving Pandas, NumPy and Scikit-Learn in machine learning.
Currently there are five modules inside the library: 'data_tools', 'model_tools', 'progressive_tools', 'feature_tools' and 'visual_tools'.
Installation
pip install wolta
Data Tools
Data Tools was designed for manipulating the data.
load_by_parts
Returns: pandas dataframe
Parameters:
-
paths, python list
-
strategy, {'default', 'efficient'}, by default, 'default'
-
deleted_columns, python string list, by default, None
-
print_description, boolean, by default, False
-
shuffle, boolean , by default, False
-
encoding, string, by default, 'utf-8'
- paths holds the locations of data files.
- If strategy is 'default', then the column datatypes are assigned the maximum size (64 bits).
- If strategy is 'efficient', then each column is examined to detect its min and max values. Based on these, the smallest sufficient datatype is assigned to each column.
- deleted_columns holds the names of the columns that will be deleted directly from each sub dataframe.
- If print_description is True, then how many paths have been read is printed out in the console.
from wolta.data_tools import load_by_parts
import glob
paths = glob.glob('path/to/dir/*.csv')
df = load_by_parts(paths)
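The idea behind the 'efficient' strategy can be sketched in plain Python. This is only an illustration of picking the narrowest integer type from a column's min and max values; the helper name `smallest_int_type` and the list of candidate types are assumptions, not Wolta's actual implementation.

```python
# Illustrative sketch only: not taken from the Wolta source code.
# Candidate signed integer types, ordered from narrowest to widest.
INT_BOUNDS = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]


def smallest_int_type(min_val, max_val):
    """Return the narrowest signed integer type that holds [min_val, max_val]."""
    for name, low, high in INT_BOUNDS:
        if low <= min_val and max_val <= high:
            return name
    raise OverflowError("range does not fit in 64 bits")


print(smallest_int_type(0, 100))         # int8
print(smallest_int_type(-40000, 40000))  # int32
```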
col_types
Returns: python string list
Parameters:
-
df, pandas dataframe
-
print_columns, boolean, by default, False
-
df holds the dataframe; the datatype of each of its columns is returned.
-
If print_columns is True, then 'class name: datatype' is printed out for each column.
import pandas as pd
from wolta.data_tools import col_types
df = pd.read_csv('data.csv')
types = col_types(df)
make_numerics
Returns:
-
pandas dataframe column which has int64 data inside it
-
If space_requested is True, the dictionary used in the mapping is also returned
Parameters:
-
column, pandas dataframe column
-
space_requested, boolean, by default, False
import pandas as pd
from wolta.data_tools import make_numerics
df = pd.read_csv('data.csv')
df['output'] = make_numerics(df['output'])
make_null
Returns: pandas dataframe or numpy array
Parameters:
-
matrix (pandas dataframe or numpy array)
-
replace, string, the text which will be converted to null
-
type, {'df', 'np'}, by default 'df'
seek_null
Returns: boolean
Parameters:
-
df, pandas dataframe
-
print_columns, boolean, False by default
find_deflection
Returns:
-
differences, 1D numpy array, if arr parameter is True
-
average, average difference between predictions and actual values, if avg is True
-
amount of succeeded predictions, the number of predictions within the acceptable range, if gap is not None
-
indexes of succeeded predictions in y_pred, if success_indexes is True
Parameters:
-
y_test
-
y_pred
-
arr, True by default
-
avg, False by default
-
gap, None by default, if it is not None then it must be positive
-
gap_type, how the gap value is used to build the range in which a prediction counts as 'successful'
| value | meaning |
| --- | --- |
| exact | prediction and actual value must be the same |
| num | success requires 'actual - gap <= prediction <= actual + gap' |
| num+ | success requires 'actual <= prediction <= actual + gap' |
| num- | success requires 'actual - gap <= prediction <= actual' |
| per | success requires 'actual * (100 - gap) / 100 <= prediction <= actual * (100 + gap) / 100' |
| per+ | success requires 'actual <= prediction <= actual * (100 + gap) / 100' |
| per- | success requires 'actual * (100 - gap) / 100 <= prediction <= actual' |
- dif_type, indicates the way of calculation of difference between prediction and actual value. 'f-i' by default.
| value | meaning |
| --- | --- |
| f-i | difference = prediction - actual |
| i-f | difference = actual - prediction |
| abs | absolute value of the difference |
-
avg_w_abs, True by default, whether the average difference is calculated from the absolute values of the difference array
-
success_indexes, False by default
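The gap_type rules can be expressed as a single range check. The sketch below is a hypothetical helper (`is_successful` is not part of Wolta) that mirrors the gap_type table for one prediction at a time.

```python
# Hypothetical helper mirroring the gap_type table; not Wolta's own code.
def is_successful(actual, prediction, gap, gap_type):
    """Return True if the prediction falls in the accepted range."""
    if gap_type == "exact":
        return prediction == actual
    if gap_type.startswith("num"):
        # Absolute tolerance around the actual value.
        low, high = actual - gap, actual + gap
    elif gap_type.startswith("per"):
        # Percentage tolerance around the actual value.
        low = actual * (100 - gap) / 100
        high = actual * (100 + gap) / 100
    else:
        raise ValueError(f"unknown gap_type: {gap_type}")
    # A trailing '+' or '-' keeps only the upper or lower half of the range.
    if gap_type.endswith("+"):
        low = actual
    elif gap_type.endswith("-"):
        high = actual
    return low <= prediction <= high


print(is_successful(100, 105, 10, "num"))   # True
print(is_successful(100, 95, 10, "num+"))   # False
print(is_successful(200, 190, 10, "per-"))  # True
```

Note that with a negative actual value the percentage formulas produce a lower bound that is larger than the upper bound, so no prediction can satisfy them.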
list_deletings
It cleans dataframe from features that should be deleted.
Returns:
-
pandas dataframe
-
string list, the names of the features that might be deleted; it is returned only if return_extra is True
Parameters:
-
df, pandas dataframe
-
extra, list of string, the list of the feature names that will be deleted directly, by default None
-
del_null, whether to delete features that have too many null values, by default True
-
null_tolerance, sets the limit to null_tolerance% of the total sample amount, by default 20
-
del_single, whether to delete features that have a single value, by default True
-
del_almost_single, whether to delete features that have a nearly single value, by default False
-
almost_tolerance, sets the limit to almost_tolerance% of the total sample amount, by default 50
-
suggest_extra, checks whether string features have too many unique values and points them out, by default True
-
return_extra, if both suggest_extra and this are True, returns the list of feature names that might be deleted, by default False
-
unique_tolerance, sets the limit to unique_tolerance% of the total sample amount, by default 10
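To make the percentage-based limits concrete, here is a minimal sketch of how a null_tolerance check could work. The function name `columns_over_null_limit`, the dict-of-lists table and the strict "greater than" comparison are all assumptions made for illustration; Wolta's actual rule may differ.

```python
# Illustrative sketch: the comparison rule (strictly greater than) is assumed.
def columns_over_null_limit(table, null_tolerance=20):
    """Return column names whose null count exceeds null_tolerance% of the rows."""
    n_rows = len(next(iter(table.values())))
    limit = n_rows * null_tolerance / 100
    return [
        name
        for name, values in table.items()
        if sum(value is None for value in values) > limit
    ]


table = {
    "a": [1, None, None, 4, 5],  # 2 nulls out of 5 rows (40%)
    "b": [1, 2, 3, 4, None],     # 1 null out of 5 rows (20%)
}
print(columns_over_null_limit(table))  # ['a']
```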
extract_float
If a column holds float/int data mixed with string symbols such as dollar signs and commas, this function converts it to purely numeric data.
Returns: pandas dataframe column
Parameters:
-
pandas dataframe column
-
symbols, the list of things that will be deleted
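The cleaning step can be pictured with a small pure-Python sketch. `strip_symbols` is a hypothetical stand-in that works on a plain list of strings rather than a pandas column, so it only illustrates the idea.

```python
# Hypothetical stand-in for the idea behind extract_float; not Wolta's code.
def strip_symbols(values, symbols):
    """Remove every symbol from each value, then convert the rest to float."""
    cleaned = []
    for value in values:
        text = str(value)
        for symbol in symbols:
            text = text.replace(symbol, "")
        cleaned.append(float(text))
    return cleaned


prices = ["$1,250.50", "$3,000", "$12"]
print(strip_symbols(prices, ["$", ","]))  # [1250.5, 3000.0, 12.0]
```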
transform_data
Returns:
-
transformed X
-
transformed y
-
if strategy ends with 'm', amin_x
-
if strategy ends with 'm', amin_y
Parameters:
-
X
-
y
-
strategy, {'log', 'log-m', 'log2', 'log2-m', 'log10', 'log10-m', 'sqrt', 'sqrt-m', 'cbrt'}, by default 'log-m'
If you are concerned about situations like sqrt(0) or log(0), use the strategies that end with 'm' (meaning manipulated).
transform_pred
This function makes predictions realistic by handling the back-transformation.
Returns: back-transformed y
Parameters:
-
y_pred
-
strategy, {'log', 'log-m', 'log2', 'log2-m', 'log10', 'log10-m', 'sqrt', 'sqrt-m', 'cbrt'}, by default 'log-m'
-
amin_y, int, by default 0. amin_y is the minimum y value returned by transform_data when the data was manipulated; it is required if a strategy ending with 'm' was selected.
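The round trip between transform_data and transform_pred can be sketched as follows. The exact manipulation behind the 'm' strategies is not documented here, so this sketch assumes the values are shifted by their minimum plus one before the logarithm is taken; `forward_log_m` and `backward_log_m` are hypothetical names.

```python
import math


# Assumption: 'log-m' shifts values so the smallest one becomes 1 before log.
def forward_log_m(values):
    """Shift by the minimum, apply the natural log, and return the minimum too."""
    amin = min(values)
    transformed = [math.log(v - amin + 1) for v in values]
    return transformed, amin


def backward_log_m(transformed, amin):
    """Invert forward_log_m using the stored minimum."""
    return [math.exp(t) + amin - 1 for t in transformed]


y = [0.0, 5.0, 20.0]
y_log, amin_y = forward_log_m(y)      # log(0) never occurs thanks to the shift
y_back = backward_log_m(y_log, amin_y)
print(y_back)
```

Storing the minimum at transform time is what makes the back-transformation possible, which is why amin_y has to be passed along.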
make_categorical
It places regression outputs into three sub-classes according to mean and standard deviation. Normal distribution is required.
Returns:
- y
the other outputs are returned only if 'normal-extra' is selected
-
the min value of array
-
the max value of array
-
the standard deviation of array
-
the mean value of array
-
the result of mean - standard deviation
-
the result of mean + standard deviation
Parameters:
-
y
-
strategy, {'normal', 'normal-extra'}, by default, 'normal'
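A minimal sketch of the three-way split around the mean is shown below. The label numbering (0, 1, 2), the inclusive middle band and the use of the population standard deviation are assumptions for illustration; `to_three_classes` is not a Wolta function.

```python
import statistics


# Assumed labels: 0 = below mean - std, 1 = within the band, 2 = above mean + std.
def to_three_classes(y):
    """Bucket regression targets by their distance from the mean."""
    mean = statistics.mean(y)
    std = statistics.pstdev(y)
    low, high = mean - std, mean + std
    labels = []
    for value in y:
        if value < low:
            labels.append(0)
        elif value > high:
            labels.append(2)
        else:
            labels.append(1)
    return labels, low, high


labels, low, high = to_three_classes([1, 5, 5, 5, 9])
print(labels)  # [0, 1, 1, 1, 2]
```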
state_sum
It gives information about features.
Returns: dictionary whose keys are feature names (if the get_dict param is True)
Parameters:
-
df, pandas dataframe
-
requested, array of string
| value | meaning |
| --- |-----------------------------|
| min | minimum value for a feature |
| max | maximum value for a feature |
| width | difference between max and min |
| std | standard deviation |
| mean | mean |
| med | median |
| var | variance |
-
only, list of string, it gets results for these features only, by default None. If it is None, the function gets results for all features
-
exclude, list of string, it gets results for all features except these, by default None.
-
get_dict, by default False
-
verbose, by default True
create_chunks
Parameters:
-
path, string
-
sample_amount, int, sample amount for each chunk
-
target_dir, string, directory path to save chunks, by default, None
-
print_description, boolean, shows the progress in console or not, by default, False
-
chunk_name, string, general name for chunks, by default, 'part'
from wolta.data_tools import create_chunks
create_chunks('whole_data.csv', 1000000)
unique_amounts
Returns: dictionary with <string, int> form, <column name, unique value amount>
Parameters:
-
df, pandas dataframe
-
strategy, python string list, by default, None, holds the requested column names
-
print_dict, boolean, by default, False
import pandas as pd
from wolta.data_tools import unique_amounts
df = pd.read_csv('data.csv')
amounts = unique_amounts(df)
scale_x
Returns:
-
X_train
-
X_test
Parameters:
-
X_train
-
X_test
It applies standard scaling.
import pandas as pd
from sklearn.model_selection import train_test_split
from wolta.data_tools import scale_x
df = pd.read_csv('data.csv')
y = df['output']
del df['output']
X = df
del df
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test = scale_x(X_train, X_test)
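Standard scaling itself is easy to show on a single feature. The sketch below assumes the usual convention of computing the mean and standard deviation on the training data only and reusing them for the test data; whether scale_x does exactly this is not stated, and `standard_scale` is a hypothetical name.

```python
import statistics


# Assumption: statistics come from the training column only.
def standard_scale(train, test):
    """Scale one feature column to zero mean and unit variance."""
    mean = statistics.fmean(train)
    std = statistics.pstdev(train)  # population std; must be non-zero

    def scale(column):
        return [(value - mean) / std for value in column]

    return scale(train), scale(test)


train_scaled, test_scaled = standard_scale([1.0, 2.0, 3.0], [2.0])
print(train_scaled)
print(test_scaled)  # [0.0]
```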
is_normal
Returns: Boolean
Returns True if the data has a normal distribution, otherwise False.
Parameter: y
examine_floats
Returns: list of the column names that satisfy the requested condition
Parameters:
-
df, pandas dataframe
-
float_columns, string list, names of the columns that hold float data
-
get, {'float', 'int'}, by default, 'float'.
-
If get is 'float', then it returns the names of the columns that hold values with a fractional part other than .0
-
If get is 'int', then it returns the names of the columns whose values are all whole numbers ending in .0
import pandas as pd
from wolta.data_tools import examine_floats
df = pd.read_csv('data.csv')
float_cols = ['col1', 'col2', 'col3']
float_cols = examine_floats(df, float_cols)
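The distinction between the two `get` modes comes down to checking whether every value in a float column is a whole number. A small sketch on plain lists, with the hypothetical helper `has_only_whole_values`:

```python
# Illustrative sketch on plain lists; not Wolta's implementation.
def has_only_whole_values(values):
    """Return True if every float in the column ends in .0."""
    return all(float(value).is_integer() for value in values)


columns = {
    "age": [21.0, 35.0, 40.0],   # could safely be stored as int
    "price": [9.99, 12.5, 3.0],  # genuinely needs float
}
int_like = [name for name, vals in columns.items() if has_only_whole_values(vals)]
float_like = [name for name, vals in columns.items() if not has_only_whole_values(vals)]
print(int_like)    # ['age']
print(float_like)  # ['price']
```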
calculate_min_max
Returns:
-
columns, list of string, names of the columns that hold int or float data
-
types, list of string, datatypes of these columns
-
max_val, list of int & float, holds the maximum value for each column
-
min_val, list of int & float, holds the minimum value for each column
Parameters:
-
paths, list of string, paths of dataframes
-
deleted_columns, list of string, initially excluded column names, by default, None
import glob
from wolta.data_tools import calculate_min_max
paths = glob.glob('path/to/dir/*.csv')
columns, types, max_val, min_val = calculate_min_max(paths)
calculate_bounds
Returns: list of string, name of types with optimum sizes.
Parameters:
-
gen_types, list of string, holds the default datatypes for each column
-
min_val, list of int & float, holds the minimum value for each column
-
max_val, list of int & float, holds the maximum value for each column
import glob
from wolta.data_tools import calculate_bounds, calculate_min_max
paths = glob.glob('path/to/dir/*.csv')
columns, types, max_val, min_val = calculate_min_max(paths)
types = calculate_bounds(types, min_val, max_val)
Model Tools
Model Tools was designed to get results from models.
get_score
Returns: dictionary with <string, float> form, <score type, value>, also prints out the result by default
Parameters:
-
y_test, 1D numpy array
-
y_pred, 1D numpy array
-
metrics, list of string, this list can only contain values from the tables below:
For 'clf':
| value | full name |
| --- |----------------|
| acc | accuracy score |
| f1 | f1 score |
| hamming | hamming loss |
| jaccard | jaccard score |
| log | log loss |
| mcc | matthews corrcoef |
| precision | precision score |
| recall | recall score |
| zol | zero one loss |
by default, ['acc']
For 'reg':
| value | full name |
|---------|------------------------------------|
| var | explained variance |
| max | max error |
| abs | neg mean absolute error |
| sq | neg mean squared error |
| rsq | neg root mean squared error |
| log | neg mean squared log error |
| rlog | neg root mean squared log error |
| medabs | neg median absolute error |
| poisson | neg mean poisson deviance |
| gamma | neg mean gamma deviance |
| per | neg mean absolute percentage error |
| d2abs | d2 absolute error score |
| d2pin | d2 pinball score |
| d2twe | d2 tweedie score |
by default, ['sq']
-
average, string, {'weighted', 'micro', 'macro', 'binary', 'samples'}, by default, 'weighted'
-
algo_type, {'clf', 'reg'}, 'clf' by default
import numpy as np
from wolta.model_tools import get_score
y_test = np.load('test.npy')
y_pred = np.load('pred.npy')
scores = get_score(y_test, y_pred, ['acc', 'precision'])
get_supported_metrics
It returns the string list of possible score names for get_score function
from wolta.model_tools import get_supported_metrics
print(get_supported_metrics())
get_avg_options
It returns the string list of possible average values for get_score function
from wolta.model_tools import get_avg_options
print(get_avg_options())
compare_models
Returns: nothing, just prints out the results
Parameters:
-
algo_type, {'clf', 'reg'}
-
algorithms, list of string, if the first element is 'all' then it gets results for every algorithm.
for 'clf':
| value | full name |
|-------|-----------|
| cat | catboost |
| ada | adaboost |
| dtr | decision tree |
| raf | random forest |
| lbm | lightgbm |
| ext | extra tree |
| log | logistic regression |
| knn | knn |
| gnb | gaussian nb |
| rdg | ridge |
| bnb | bernoulli nb |
| svc | svc |
| per | perceptron |
| mnb | multinomial nb |
for 'reg':
| value | full name |
|-------|-------------------|
| cat | catboost |
| ada | adaboost |
| dtr | decision tree |
| raf | random forest |
| lbm | lightgbm |
| ext | extra tree |
| lin | linear regression |
| knn | knn |
| svr | svr |
-
metrics, list of string, its values must be acceptable for get_score method
-
X_train
-
y_train
-
X_test
-
y_test
do_combinations
Returns: list of the int lists
Parameters:
-
indexes, list of int
-
min_item, int, it is the minimum amount of index inside a combination
-
max_item, int, it is the maximum amount of index inside a combination
It creates a list of all possible combinations that contain between min_item and max_item indexes (inclusive).
from wolta.model_tools import do_combinations
combinations = do_combinations([0, 1, 2], 1, 3)
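The same kind of result can be produced with the standard library, which helps to see what the function is expected to return. This is a sketch with a hypothetical name (`index_combinations`); the ordering of Wolta's own output is not documented and may differ.

```python
from itertools import combinations


# Sketch of the described behaviour using itertools; ordering is an assumption.
def index_combinations(indexes, min_item, max_item):
    """List every combination whose size is between min_item and max_item."""
    result = []
    for size in range(min_item, max_item + 1):
        result.extend(list(combo) for combo in combinations(indexes, size))
    return result


print(index_combinations([0, 1, 2], 1, 3))
# [[0], [1], [2], [0, 1], [0, 2], [1, 2], [0, 1, 2]]
```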
do_voting
Returns: list of 1D numpy arrays
Parameters:
-
y_pred_list, list of 1D numpy arrays
-
combinations, list of int lists, it holds the indexes from y_pred_list for each combination
-
strategy, {'avg', 'mode'}, by default, 'avg'
If 'avg' is selected, this function sums the arrays, divides the sum by the number of arrays and finally casts the result to int.
If 'mode' is selected, the predictions for every sample are collected and their mode is taken, one sample at a time.
import numpy as np
from wolta.model_tools import do_voting, do_combinations
y_pred_1 = np.load('one.npy')
y_pred_2 = np.load('two.npy')
y_pred_3 = np.load('three.npy')
y_preds = [y_pred_1, y_pred_2, y_pred_3]
combinations = do_combinations([0, 1, 2], 1, 3)
results = do_voting(y_preds, combinations)
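The two voting strategies can be sketched on plain lists for a single combination of predictions. `vote_avg` and `vote_mode` are hypothetical names, and the tie-breaking rule in the mode case (first value seen wins) is an assumption.

```python
from collections import Counter


# Sketch for one combination of prediction lists; not Wolta's own code.
def vote_avg(preds):
    """Sum the predictions per sample, divide by their count, cast to int."""
    n = len(preds)
    return [int(sum(sample) / n) for sample in zip(*preds)]


def vote_mode(preds):
    """Pick the most frequent prediction per sample."""
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*preds)]


preds = [[0, 1, 2], [0, 1, 1], [1, 1, 2]]
print(vote_avg(preds))   # [0, 1, 1]
print(vote_mode(preds))  # [0, 1, 2]
```

The last sample shows how the two strategies can disagree: the average 5/3 is truncated to 1, while the mode is 2.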
WelkinClassification
The Welkin Classification has a very basic idea. It calculates the min and max values of each feature for every class from the training data. Then, during prediction, it checks the classes one by one: each input feature that falls within the detected range adds to that class's score. The class with the highest score becomes the prediction. This describes the 'travel' strategy. With the 'limit' strategy, as soon as m of the features have values within those ranges, that class becomes the answer and the other possible answers aren't investigated. This strategy is recommended for speed. At this point the order in which features are investigated becomes crucial, so they can be reordered with the 'priority' parameter.
Parameters:
-
strategy, {'travel', 'limit'}, by default, 'travel'
-
priority, list of feature indexes, by default, None
-
limit, integer, by default, None
This class has these methods:
-
fit(X_train, y_train)
-
predict(X_test), returns y_pred
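The 'travel' strategy described above is simple enough to re-create as a toy. The class below is a hypothetical re-implementation on plain lists (`RangeScoreClassifier` is not a Wolta name) and it ignores the 'limit' strategy and the 'priority' parameter.

```python
# Toy re-creation of the 'travel' strategy; an illustration, not Wolta's code.
class RangeScoreClassifier:
    def fit(self, X, y):
        """Store the [min, max] range of every feature for each class."""
        self.ranges_ = {}
        for row, label in zip(X, y):
            bounds = self.ranges_.setdefault(label, [[v, v] for v in row])
            for i, value in enumerate(row):
                bounds[i][0] = min(bounds[i][0], value)
                bounds[i][1] = max(bounds[i][1], value)
        return self

    def predict(self, X):
        """Score each class by how many features fall inside its ranges."""
        predictions = []
        for row in X:
            scores = {
                label: sum(low <= value <= high
                           for value, (low, high) in zip(row, bounds))
                for label, bounds in self.ranges_.items()
            }
            # The class with the highest score becomes the prediction.
            predictions.append(max(scores, key=scores.get))
        return predictions


X_train = [[1, 10], [2, 12], [8, 30], [9, 35]]
y_train = [0, 0, 1, 1]
model = RangeScoreClassifier().fit(X_train, y_train)
print(model.predict([[1.5, 11], [8.5, 32]]))  # [0, 1]
```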
DistRegressor
This regression approach provides a hybrid solution for problems whose output space spans a wide range, relying on the normal distribution.
Parameters:
-
verbose, boolean, by default, True
-
clf_model, classification model class, by default, None (if it is None, CatBoostClassifier is used with default parameters, except that iterations is set to 20)
-
clf_params, parameters for classification model in dict form, by default, None
-
reg_model, regression model, by default, None (if it is None, CatBoostRegressor is used with default parameters, except that iterations is set to 20)
-
reg_params, parameters for regression model in dict form, by default, None
This class has these methods:
-
fit(X_train, y_train)
-
predict(X_test), returns y_pred
examine_time
It calculates the fitting time for a model and also returns the trained model.
Returns:
-
int
-
model
Parameters:
-
model
-
X_train
-
y_train
from wolta.model_tools import examine_time
from sklearn.ensemble import RandomForestClassifier
import numpy as np
X_train = np.load('x.npy')
y_train = np.load('y.npy')
model = RandomForestClassifier(random_state=42)
consumed, model = examine_time(model, X_train, y_train)
Progressive Tools
This module has been designed for progressive sampling.
make_run
It was designed to use one loaded numpy array for all sampling trials.
Returns:
-
list of int, percentage logging
-
list of dictionaries, metrics logging
Parameters:
-
model_class
-
X_train
-
y_train
-
X_test
-
y_test
-
init_per, int, by default, 1, inclusive starting percentage
-
limit_per, int, by default, 100, inclusive ending percentage
-
increment, int, by default, 1
-
metrics, list of string, the values must be recognizable by model_tools.get_score(), by default, ['acc']
-
average, string, the value must be recognizable by model_tools.get_score(), by default, 'weighted'
-
params, dictionary, the parameters used to initialize the model, if it has any, by default, None
from wolta.progressive_tools import make_run
from sklearn.ensemble import RandomForestClassifier
import numpy as np
X_train = np.load('x_train.npy')
y_train = np.load('y_train.npy')
X_test = np.load('x_test.npy')
y_test = np.load('y_test.npy')
percentage_log, metrics_log = make_run(RandomForestClassifier, X_train, y_train, X_test, y_test)
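The percentage parameters define a schedule of growing training subsets. The sketch below only shows that schedule; the rounding rule (integer division) and the helper name `sampling_schedule` are assumptions, and how make_run picks the rows for each subset is not documented here.

```python
# Sketch of the percentage schedule only; rounding rule is an assumption.
def sampling_schedule(n_samples, init_per=1, limit_per=100, increment=1):
    """Return (percentage, subset size) pairs from init_per to limit_per inclusive."""
    schedule = []
    for per in range(init_per, limit_per + 1, increment):
        schedule.append((per, n_samples * per // 100))
    return schedule


print(sampling_schedule(200, init_per=25, limit_per=100, increment=25))
# [(25, 50), (50, 100), (75, 150), (100, 200)]
```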
get_best
Returns:
-
int, best percentage
-
float, best score
Parameters:
-
percentage_log, list of int
-
metrics_log, list of dictionary
-
requested_metrics, string
from wolta.progressive_tools import make_run, get_best
from sklearn.ensemble import RandomForestClassifier
import numpy as np
X_train = np.load('x_train.npy')
y_train = np.load('y_train.npy')
X_test = np.load('x_test.npy')
y_test = np.load('y_test.npy')
percentage_log, metrics_log = make_run(RandomForestClassifier, X_train, y_train, X_test, y_test)
best_per, best_score = get_best(percentage_log, metrics_log, 'acc')
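Given the two logs, finding the best run is a matter of locating the highest score and reading the percentage at the same position. A minimal sketch, assuming a higher score is better and using the hypothetical name `best_entry`:

```python
# Assumption: the best run is the one with the highest value of the metric.
def best_entry(percentage_log, metrics_log, requested_metric):
    """Return the (percentage, score) pair with the highest requested metric."""
    best_index = max(
        range(len(metrics_log)),
        key=lambda i: metrics_log[i][requested_metric],
    )
    return percentage_log[best_index], metrics_log[best_index][requested_metric]


percentages = [10, 20, 30]
metrics = [{"acc": 0.80}, {"acc": 0.90}, {"acc": 0.85}]
print(best_entry(percentages, metrics, "acc"))  # (20, 0.9)
```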
path_chain
Unlike make_run, it loads train data from different files every time.
Returns: list of dictionary, metrics logging
Parameters:
-
paths, list of string
-
model_class
-
X_test
-
y_test
-
output_column, string
-
metrics, list of string, the values must be recognizable by model_tools.get_score(), by default, ['acc']
-
average, string, the value must be recognizable by model_tools.get_score(), by default, 'weighted'
-
params, dictionary, the parameters used to initialize the model, if it has any, by default, None
from wolta.progressive_tools import path_chain
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import glob
X_test = np.load('x_test.npy')
y_test = np.load('y_test.npy')
paths = glob.glob('path/to/dir/*.csv')
metrics_log = path_chain(paths, RandomForestClassifier, X_test, y_test, 'output')
Feature Tools
This module is designed for manipulating features in datasets.
quest_selection
Prints out suggestions about what feature(s) can be deleted with less loss or maximum gain.
The algorithm works in two steps. First, it removes one feature at a time and compares the resulting accuracy with the all-features case. If the new accuracy is better than the all-features one, or their difference is less than or equal to flag_one_tol, the feature passes to the second step.
The second step runs 'trials' times: each time it builds a combination from a random number of the passed features and removes them together. If the new accuracy is better than the all-features one, or their difference is less than or equal to fin_tol, the combination becomes a suggestion.
Parameters:
-
model_class
-
X_train
-
y_train
-
X_test
-
y_test
-
features, list of string, holds column names for X.
-
flag_one_tol, float
-
fin_tol, float
-
params, dictionary, the parameters used to initialize the model, if it has any, by default, None
-
normal_acc, float, by default, None. If it is None, it is calculated first
-
trials, int, by default, 100
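The first step of the algorithm can be sketched without any model by passing in a scoring callable. Everything here is hypothetical: `flag_candidates`, the `evaluate` callable and the toy importance values stand in for training a real model and measuring its accuracy.

```python
# Sketch of step one only; 'evaluate' stands in for training and scoring a model.
def flag_candidates(features, evaluate, normal_acc, flag_one_tol):
    """Flag features whose removal keeps accuracy within the tolerance."""
    flagged = []
    for feature in features:
        remaining = [f for f in features if f != feature]
        acc = evaluate(remaining)
        if acc > normal_acc or normal_acc - acc <= flag_one_tol:
            flagged.append(feature)
    return flagged


# Toy scoring function: each feature contributes a fixed amount of accuracy.
importance = {"a": 0.3, "b": 0.01, "c": 0.0}


def evaluate(feature_subset):
    return 0.5 + sum(importance[f] for f in feature_subset)


features = ["a", "b", "c"]
normal_acc = evaluate(features)
print(flag_candidates(features, evaluate, normal_acc, flag_one_tol=0.02))
# ['b', 'c']
```

The flagged features would then feed the second step, where random groups of them are removed together and checked against fin_tol.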
Visual Tools
This module is designed for visualizing the inputs and results.
make_table
Returns: string, Markdown code for the table
Parameters:
-
mode, 'sheet' or 'nx2'
-
inputs, it is used if 'sheet' is selected, arrays of list of rows or columns
-
inputs_type, indicates the type of arrays in inputs, the value must be 'row' or 'column'
-
first_column, it is used if 'nx2' is selected. It indicates the first column
-
filler, it is used if 'nx2' is selected. The second column is filled with it