Wolta is designed to simplify frequently used Machine Learning processes that involve Pandas, Numpy and Scikit-Learn.

Currently there are five modules inside the library: 'data_tools', 'model_tools', 'progressive_tools', 'feature_tools' and 'visual_tools'.


pip install wolta

Data Tools

Data Tools was designed for manipulating the data.


Returns: pandas dataframe


  • paths, python list

  • strategy, {'default', 'efficient'}, by default, 'default'

  • deleted_columns, python string list, by default, None

  • print_description, boolean, by default, False

  • shuffle, boolean , by default, False

  • encoding, string, by default, 'utf-8'

  1. paths holds the locations of the data files.

  2. If strategy is 'default', then the datatypes of the columns are assigned with the maximum width (64 bits).

  3. If strategy is 'efficient', then each column is examined and its min and max values are detected. Based on that information, the smallest sufficient datatype is assigned to each column.

  4. deleted_columns holds the names of the columns that will be deleted directly from each sub dataframe.

  5. If print_description is True, then the number of paths that have been read is printed to the console.
from wolta.data_tools import load_by_parts

import glob

paths = glob.glob('path/to/dir/*.csv')

df = load_by_parts(paths)


Returns: python string list


  • df, pandas dataframe

  • print_columns, boolean, by default, False

  1. df holds the dataframe; the datatype of each of its columns is returned.

  2. If print_columns is True, then 'class name: datatype' is printed out for each column.

import pandas as pd

from wolta.data_tools import col_types

df = pd.read_csv('data.csv')

types = col_types(df)



  1. pandas dataframe column that holds int64 data

  2. If space_requested is True, the dictionary that was used in the mapping


  • column, pandas dataframe column

  • space_requested, boolean, by default, False

import pandas as pd

from wolta.data_tools import make_numerics

df = pd.read_csv('data.csv')

df['output'] = make_numerics(df['output'])
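The mapping idea behind this function can be sketched in plain Python. The helper `encode_labels` below is hypothetical and is not part of Wolta; it assumes codes are assigned in order of first appearance, which may differ from what `make_numerics` actually does.

```python
def encode_labels(values):
    """Map each distinct label to an integer code (assumed: first-seen order)."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


codes, mapping = encode_labels(['cat', 'dog', 'cat', 'bird'])
print(codes)    # [0, 1, 0, 2]
print(mapping)  # {'cat': 0, 'dog': 1, 'bird': 2}
```

Keeping the dictionary makes it possible to translate predicted integers back to the original labels later.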


Returns: pandas dataframe or numpy array


  • matrix (pandas dataframe or numpy array)

  • replace, string, the text which will be converted to null

  • type, {'df', 'np'}, by default 'df'


Returns: boolean


  • df, pandas dataframe

  • print_columns, boolean, False by default



  1. differences, 1D numpy array, if arr parameter is True

  2. average, average difference between predictions and actual values, if avg is True

  3. amount of succeeded predictions, the number of predictions that fall within the acceptable range, if gap is not None

  4. indexes of succeeded predictions in y_pred, if success_indexes is True


  • y_test

  • y_pred

  • arr, True by default

  • avg, False by default

  • gap, None by default, if it is not None then it must be positive

  • gap_type, determines how the gap value is used to build the range in which a prediction counts as 'successful'

| value | meaning |
| --- | --- |
| exact | prediction and actual value must be the same |
| num | success requires 'actual - gap <= prediction <= actual + gap' |
| num+ | success requires 'actual <= prediction <= actual + gap' |
| num- | success requires 'actual - gap <= prediction <= actual' |
| per | success requires 'actual * (100 - gap) / 100 <= prediction <= actual * (100 + gap) / 100' |
| per+ | success requires 'actual <= prediction <= actual * (100 + gap) / 100' |
| per- | success requires 'actual * (100 - gap) / 100 <= prediction <= actual' |

  • dif_type, determines how the difference between prediction and actual value is calculated. 'f-i' by default.

| value | meaning |
| --- | --- |
| f-i | difference = prediction - actual |
| i-f | difference = actual - prediction |
| abs | absolute value of the difference |

  • avg_w_abs, True by default, determines whether the average difference is calculated from the absolute values of the difference array

  • success_indexes, False by default
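The acceptance ranges listed for gap_type can be checked with a few lines of plain Python. This is only a sketch of the rules in the table; `is_successful` is a hypothetical name, not a Wolta function, and the percentage rules assume non-negative actual values.

```python
def is_successful(actual, prediction, gap, gap_type='num'):
    """Return True when the prediction falls inside the accepted range."""
    if gap_type == 'exact':
        return prediction == actual
    if gap_type.startswith('num'):
        # Absolute tolerance around the actual value.
        low, high = actual - gap, actual + gap
    elif gap_type.startswith('per'):
        # Percentage tolerance around the actual value.
        low = actual * (100 - gap) / 100
        high = actual * (100 + gap) / 100
    else:
        raise ValueError(f'unknown gap_type: {gap_type}')
    if gap_type.endswith('+'):
        low = actual   # only overestimates are accepted
    elif gap_type.endswith('-'):
        high = actual  # only underestimates are accepted
    return low <= prediction <= high


print(is_successful(100, 104, 5, 'num'))   # True
print(is_successful(100, 96, 5, 'num+'))   # False
print(is_successful(200, 215, 10, 'per'))  # True
```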


It removes the features that should be deleted from the dataframe.


  1. pandas dataframe

  2. string list, the names of the features that might be deleted, it is returned only if return_extra is True


  1. df, pandas dataframe

  2. extra, list of string, the list of the feature names that will be deleted directly, by default None

  3. del_null, whether to delete features that have too many null values, by default True

  4. null_tolerance, sets the limit to null_tolerance% of the whole sample amount, by default 20

  5. del_single, whether to delete features that have a single value, by default True

  6. del_almost_single, whether to delete features that have nearly a single value, by default False

  7. almost_tolerance, sets the limit to almost_tolerance% of the whole sample amount, by default 50

  8. suggest_extra, checks whether string features have too many unique values and points them out, by default True

  9. return_extra, if both suggest_extra and this are True, the list of feature names that might be deleted is also returned, by default False

  10. unique_tolerance, sets the limit to unique_tolerance% of the whole sample amount, by default 10
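The null and single-value checks described by these parameters can be illustrated without any library. The sketch below is not Wolta's implementation: `suggest_drops` is a hypothetical helper, columns are plain lists with None as the null marker, and a strict "greater than" comparison against the limit is an assumption.

```python
def suggest_drops(columns, null_tolerance=20, del_single=True):
    """Return names of columns with too many nulls or a single distinct value.

    `columns` maps a column name to a list of values; None marks a null.
    """
    to_drop = []
    for name, values in columns.items():
        # Limit expressed as a percentage of the sample amount.
        null_limit = len(values) * null_tolerance / 100
        null_count = sum(value is None for value in values)
        distinct = {value for value in values if value is not None}
        if null_count > null_limit:
            to_drop.append(name)
        elif del_single and len(distinct) <= 1:
            to_drop.append(name)
    return to_drop


data = {
    'id': [1, 2, 3, 4, 5],
    'mostly_null': [None, None, 7, None, None],
    'constant': ['a', 'a', 'a', 'a', 'a'],
}
print(suggest_drops(data))  # ['mostly_null', 'constant']
```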


If a column holds float/int data that is stored as text because of symbols such as dollar signs and commas, this function removes those symbols and converts the column to numeric values.

Returns: pandas dataframe column


  1. pandas dataframe column

  2. symbols, the list of things that will be deleted
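As a rough illustration of the idea, the following sketch removes the given symbols from each value and converts the result to float. `strip_symbols` is a hypothetical helper working on a plain list, not the Wolta function itself.

```python
def strip_symbols(values, symbols):
    """Delete every symbol from each value, then convert it to float."""
    cleaned = []
    for value in values:
        text = str(value)
        for symbol in symbols:
            text = text.replace(symbol, '')
        cleaned.append(float(text))
    return cleaned


prices = strip_symbols(['$1,200.50', '$35', '$7,000'], ['$', ','])
print(prices)  # [1200.5, 35.0, 7000.0]
```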



  1. transformed X

  2. transformed y

  3. if strategy ends with 'm', amin_x

  4. if strategy ends with 'm', amin_y


  • X

  • y

  • strategy, {'log', 'log-m', 'log2', 'log2-m', 'log10', 'log10-m', 'sqrt', 'sqrt-m', 'cbrt'}, by default 'log-m'

If you are concerned about situations like sqrt(0) or log(0), use a strategy which ends with 'm' (it stands for manipulated).


This function makes predictions realistic again by handling the back-transformation.

Returns: back-transformed y


  • y_pred

  • strategy, {'log', 'log-m', 'log2', 'log2-m', 'log10', 'log10-m', 'sqrt', 'sqrt-m', 'cbrt'}, by default 'log-m'

  • amin_y, int, by default 0. amin_y is the min y value that was used in transform_data if the data was manipulated; it is required if a strategy which ends with 'm' was selected.
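To show why the minimum value has to be carried from the transformation to the back-transformation, here is a small round-trip sketch for a 'log-m' style strategy. The exact shift Wolta applies is not stated in this document; the version below assumes the data is shifted so that its minimum becomes 1 before the logarithm is taken, and both function names are hypothetical.

```python
import math


def log_m_transform(y):
    """Shift y so its minimum becomes 1, then take the natural log (assumed rule)."""
    amin_y = min(y)
    transformed = [math.log(value - amin_y + 1) for value in y]
    return transformed, amin_y


def log_m_inverse(y_transformed, amin_y):
    """Undo log_m_transform; needs the same amin_y that the transform returned."""
    return [math.exp(value) + amin_y - 1 for value in y_transformed]


y = [-3.0, 0.0, 4.0]
transformed, amin_y = log_m_transform(y)
restored = log_m_inverse(transformed, amin_y)
print(amin_y)  # -3.0
```

Without the shift, log(0) and the logarithm of negative values would be undefined, which is the situation the 'm' strategies are meant to avoid.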


It places regression outputs into three sub-classes according to mean and standard deviation. Normal distribution is required.


  1. y

the other outputs are returned only if 'normal-extra' is selected

  1. the min value of array

  2. the max value of array

  3. the standard deviation of array

  4. the mean value of array

  5. the result of mean - standard deviation

  6. the result of mean + standard deviation


  • y

  • strategy, {'normal', 'normal-extra'}, by default, 'normal'
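A minimal sketch of the three-way split follows. It assumes that values below mean - std get label 0, values above mean + std get label 2 and everything in between gets label 1; the label numbering, the use of the population standard deviation and the name `to_three_classes` are all assumptions made for illustration.

```python
import statistics


def to_three_classes(y):
    """Label each value by its position relative to mean - std and mean + std."""
    mean = statistics.mean(y)
    std = statistics.pstdev(y)
    low, high = mean - std, mean + std
    labels = []
    for value in y:
        if value < low:
            labels.append(0)
        elif value > high:
            labels.append(2)
        else:
            labels.append(1)
    return labels, low, high


labels, low, high = to_three_classes([1, 5, 5, 5, 9])
print(labels)  # [0, 1, 1, 1, 2]
```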


It gives information about features.

Returns: dictionary whose keys are feature names (if the get_dict parameter is True)


  • df, pandas dataframe

  • requested, array of string

| value | meaning |
| --- | --- |
| min | minimum value for a feature |
| max | maximum value for a feature |
| width | difference between max and min |
| std | standard deviation |
| mean | mean |
| med | median |
| var | variance |

  • only, list of string, results are produced for these features only, by default None. If it is None, the function produces results for all features

  • exclude, list of string, results are produced for all features except these, by default None.

  • get_dict, by default False

  • verbose, by default True



  • path, string

  • sample_amount, int, sample amount for each chunk

  • target_dir, string, directory path to save chunks, by default, None

  • print_description, boolean, shows the progress in console or not, by default, False

  • chunk_name, string, general name for chunks, by default, 'part'

from wolta.data_tools import create_chunks

create_chunks('whole_data.csv', 1000000)
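The chunking idea itself can be sketched with the standard csv module. This is not how Wolta implements it; `split_csv` and `write_chunk` are hypothetical helpers, and the '<chunk_name>-<index>.csv' file naming is an assumption.

```python
import csv
import os


def write_chunk(target_dir, chunk_name, index, header, rows):
    """Write one chunk file (header plus rows) and return its path."""
    chunk_path = os.path.join(target_dir, f'{chunk_name}-{index}.csv')
    with open(chunk_path, 'w', newline='') as target:
        writer = csv.writer(target)
        writer.writerow(header)
        writer.writerows(rows)
    return chunk_path


def split_csv(path, sample_amount, target_dir, chunk_name='part'):
    """Split a CSV file into chunk files of at most sample_amount rows each."""
    os.makedirs(target_dir, exist_ok=True)
    written = []
    with open(path, newline='') as source:
        reader = csv.reader(source)
        header = next(reader)  # every chunk repeats the header
        rows = []
        for row in reader:
            rows.append(row)
            if len(rows) == sample_amount:
                written.append(write_chunk(target_dir, chunk_name, len(written), header, rows))
                rows = []
        if rows:  # the last chunk may be smaller
            written.append(write_chunk(target_dir, chunk_name, len(written), header, rows))
    return written
```

Reading row by row keeps memory usage bounded by the chunk size rather than by the size of the whole file.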


Returns: dictionary with <string, int> form, <column name, unique value amount>


  1. df, pandas dataframe

  2. strategy, python string list, by default, None, it holds the requested column names

  3. print_dict, boolean, by default, False

import pandas as pd

from wolta.data_tools import unique_amounts

df = pd.read_csv('data.csv')

amounts = unique_amounts(df)



  1. X_train

  2. X_test


  1. X_train

  2. X_test

It applies standard scaling.

import pandas as pd

from sklearn.model_selection import train_test_split

from wolta.data_tools import scale_x

df = pd.read_csv('data.csv')

y = df['output']

del df['output']

X = df

del df

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train, X_test = scale_x(X_train, X_test)
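For reference, standard scaling subtracts the mean and divides by the standard deviation. The sketch below does this for a single column and, as is the usual practice, computes both statistics on the training data only; whether `scale_x` works exactly this way is an assumption, and `standard_scale` is a hypothetical helper.

```python
import statistics


def standard_scale(train_column, test_column):
    """Scale both columns with the mean and std of the training column."""
    mean = statistics.mean(train_column)
    std = statistics.pstdev(train_column) or 1.0  # avoid division by zero

    def scale(column):
        return [(value - mean) / std for value in column]

    return scale(train_column), scale(test_column)


scaled_train, scaled_test = standard_scale([1.0, 2.0, 3.0], [2.0, 5.0])
```

After scaling, the training column has mean 0 and standard deviation 1, while the test column is expressed in the same units.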


Returns: Boolean

If the data has a normal distribution it returns True, otherwise it returns False.

Parameter: y


Returns: list of the column names which satisfy the requested condition


  • df, pandas dataframe

  • float_columns, string list, names of the columns which hold float data

  • get, {'float', 'int'}, by default, 'float'.

  1. If get is 'float', then it returns the names of the columns whose values have a fractional part other than .0

  2. If get is 'int', then it returns the names of the columns which hold only whole numbers stored with .0

import pandas as pd

from wolta.data_tools import examine_floats

df = pd.read_csv('data.csv')

float_cols = ['col1', 'col2', 'col3']

float_cols = examine_floats(df, float_cols)
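The distinction between the two cases can be sketched in plain Python: a float column is "integer-like" when every value is a whole number. The helper `split_float_columns` below is hypothetical and works on plain lists rather than on a dataframe.

```python
def split_float_columns(columns):
    """Separate truly fractional columns from columns that only hold x.0 values.

    `columns` maps a column name to a list of float values.
    """
    true_floats, integer_like = [], []
    for name, values in columns.items():
        if all(float(value).is_integer() for value in values):
            integer_like.append(name)
        else:
            true_floats.append(name)
    return true_floats, integer_like


true_floats, integer_like = split_float_columns({
    'price': [1.5, 2.0, 3.25],
    'count': [1.0, 2.0, 3.0],
})
print(true_floats)   # ['price']
print(integer_like)  # ['count']
```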



  1. columns, list of string, names of the columns which hold int or float data

  2. types, list of string, datatypes of these columns

  3. max_val, list of int & float, holds the maximum value for each column

  4. min_val, list of int & float, holds the minimum value for each column


  • paths, list of string, paths of dataframes

  • deleted_columns, list of string, initially excluded column names, by default, None

import glob

from wolta.data_tools import calculate_min_max

paths = glob.glob('path/to/dir/*.csv')

columns, types, max_val, min_val = calculate_min_max(paths)


Returns: list of string, name of types with optimum sizes.


  • gen_types, list of string, holds the default datatypes for each column

  • min_val, list of int & float, holds the minimum value for each column

  • max_val, list of int & float, holds the maximum value for each column

import glob

from wolta.data_tools import calculate_bounds, calculate_min_max

paths = glob.glob('path/to/dir/*.csv')

columns, types, max_val, min_val = calculate_min_max(paths)

types = calculate_bounds(types, min_val, max_val)
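The idea of picking the smallest sufficient type from the min and max values can be sketched as follows for signed integers. This is an illustration only: `smallest_int_type` is a hypothetical helper, and how `calculate_bounds` treats floats or unsigned types is not covered here.

```python
def smallest_int_type(min_val, max_val):
    """Return the name of the narrowest signed integer type that fits the range."""
    for bits in (8, 16, 32, 64):
        low = -(2 ** (bits - 1))
        high = 2 ** (bits - 1) - 1
        if low <= min_val and max_val <= high:
            return f'int{bits}'
    raise ValueError('value range does not fit in int64')


print(smallest_int_type(0, 100))     # int8
print(smallest_int_type(-200, 200))  # int16
print(smallest_int_type(0, 70000))   # int32
```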

Model Tools

Model Tools was designed to get some results on models.


Returns: dictionary with <string, float> form, <score type, value>, also prints out the result by default


  • y_test, 1D numpy array

  • y_pred, 1D numpy array

  • metrics, list of string, this list can only contain the values from the tables below:

For 'clf':

| value | full name |
| --- | --- |
| acc | accuracy score |
| f1 | f1 score |
| hamming | hamming loss |
| jaccard | jaccard score |
| log | log loss |
| mcc | matthews corrcoef |
| precision | precision score |
| recall | recall score |
| zol | zero one loss |

by default, ['acc']

For 'reg':

| value | full name |
| --- | --- |
| var | explained variance |
| max | max error |
| abs | neg mean absolute error |
| sq | neg mean squared error |
| rsq | neg root mean squared error |
| log | neg mean squared log error |
| rlog | neg root mean squared log error |
| medabs | neg median absolute error |
| poisson | neg mean poisson deviance |
| gamma | neg mean gamma deviance |
| per | neg mean absolute percentage error |
| d2abs | d2 absolute error score |
| d2pin | d2 pinball score |
| d2twe | d2 tweedie score |

by default, ['sq']

  • average, string, {'weighted', 'micro', 'macro', 'binary', 'samples'}, by default, 'weighted'

  • algo_type, {'clf', 'reg'}, 'clf' by default

import numpy as np

from wolta.model_tools import get_score

y_test = np.load('test.npy')

y_pred = np.load('pred.npy')

scores = get_score(y_test, y_pred, ['acc', 'precision'])


It returns the string list of possible score names for get_score function

from wolta.model_tools import get_supported_metrics



It returns the string list of possible average values for get_score function

from wolta.model_tools import get_avg_options



Returns: nothing, just prints out the results


  • algo_type, {'clf', 'reg'}

  • algorithms, list of string, if the first element is 'all' then it gets results for every algorithm.

for 'clf':

| value | full name |
| --- | --- |
| cat | catboost |
| ada | adaboost |
| dtr | decision tree |
| raf | random forest |
| lbm | lightgbm |
| ext | extra tree |
| log | logistic regression |
| knn | knn |
| gnb | gaussian nb |
| rdg | ridge |
| bnb | bernoulli nb |
| svc | svc |
| per | perceptron |
| mnb | multinomial nb |

for 'reg':

| value | full name |
| --- | --- |
| cat | catboost |
| ada | adaboost |
| dtr | decision tree |
| raf | random forest |
| lbm | lightgbm |
| ext | extra tree |
| lin | linear regression |
| knn | knn |
| svr | svr |

  • metrics, list of string, its values must be acceptable for get_score method

  • X_train

  • y_train

  • X_test

  • y_test


Returns: list of the int lists


  • indexes, list of int

  • min_item, int, it is the minimum amount of index inside a combination

  • max_item, int, it is the maximum amount of index inside a combination

It creates a list of all possible combinations that contain between min_item and max_item indexes (both inclusive).

from wolta.model_tools import do_combinations

combinations = do_combinations([0, 1, 2], 1, 3)
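The same kind of list can be produced with itertools from the standard library. This sketch only illustrates the idea; `index_combinations` is a hypothetical name and the ordering of Wolta's own output may differ.

```python
from itertools import combinations


def index_combinations(indexes, min_item, max_item):
    """List every combination holding between min_item and max_item indexes."""
    result = []
    for size in range(min_item, max_item + 1):
        result.extend(list(combo) for combo in combinations(indexes, size))
    return result


print(index_combinations([0, 1, 2], 1, 3))
# [[0], [1], [2], [0, 1], [0, 2], [1, 2], [0, 1, 2]]
```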


Returns: list of 1D numpy arrays


  • y_pred_list, list of 1D numpy arrays

  • combinations, list of int lists, it holds the indexes from y_pred_list for each combination

  • strategy, {'avg', 'mode'}, by default, 'avg'

If 'avg' is selected, this function sums the prediction arrays, divides the sum by the number of arrays and finally casts the whole result to int.

If 'mode' is selected, the predictions for each sample are collected and their mode is found sample by sample.

import numpy as np

from wolta.model_tools import do_voting, do_combinations

y_pred_1 = np.load('one.npy')

y_pred_2 = np.load('two.npy')

y_pred_3 = np.load('three.npy')

y_preds = [y_pred_1, y_pred_2, y_pred_3]

combinations = do_combinations([0, 1, 2], 1, 3)

results = do_voting(y_preds, combinations)
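Both strategies can be sketched for a single combination with plain lists. The helper `vote` is hypothetical; truncating the average with int() mirrors the "cast to int" step described above, but whether Wolta truncates or rounds is an assumption.

```python
from statistics import mode


def vote(predictions, strategy='avg'):
    """Combine equally long prediction lists into one list of predictions."""
    if strategy == 'avg':
        # Average per sample, then truncate to int.
        return [int(sum(sample) / len(sample)) for sample in zip(*predictions)]
    if strategy == 'mode':
        # Most frequent prediction per sample.
        return [mode(sample) for sample in zip(*predictions)]
    raise ValueError(f'unknown strategy: {strategy}')


preds = [[0, 1, 2], [0, 1, 1], [1, 1, 2]]
print(vote(preds, 'avg'))   # [0, 1, 1]
print(vote(preds, 'mode'))  # [0, 1, 2]
```

The third sample shows the difference: the average of 2, 1 and 2 truncates to 1, while the mode is 2.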


The Welkin Classification has a very basic idea. It calculates the min and max values of each feature for every class according to the training data. Then, in the prediction process, it checks the classes one by one; if the input features are between the detected ranges, it gives a score. The class which has the biggest score becomes the prediction. Of course, this is for the 'travel' strategy. If the strategy is 'limit', then as soon as m of the features have values between those ranges, that class becomes the answer and the other possible answers aren't investigated. This strategy is recommended for speed. At this point, the feature investigation order becomes crucial, so the features can be reordered with the 'priority' parameter.


  • strategy, {'travel', 'limit'}, by default, 'travel'

  • priority, list of feature indexes, by default, None

  • limit, integer, by default, None

This class has the following methods:

  • fit(X_train, y_train)

  • predict(X_test), returns y_pred
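The 'travel' strategy can be approximated with a small range-based classifier. The class below is a hypothetical sketch, not Wolta's implementation: it assumes one point is scored for every feature that falls inside the class range, and ties go to the class seen first during fitting.

```python
class RangeClassifier:
    """Scores each class by how many features fall inside its training range."""

    def fit(self, X, y):
        # ranges_[label] holds one [min, max] pair per feature.
        self.ranges_ = {}
        for row, label in zip(X, y):
            if label not in self.ranges_:
                self.ranges_[label] = [[value, value] for value in row]
            else:
                for bounds, value in zip(self.ranges_[label], row):
                    bounds[0] = min(bounds[0], value)
                    bounds[1] = max(bounds[1], value)
        return self

    def predict(self, X):
        predictions = []
        for row in X:
            scores = {
                label: sum(low <= value <= high
                           for (low, high), value in zip(ranges, row))
                for label, ranges in self.ranges_.items()
            }
            predictions.append(max(scores, key=scores.get))
        return predictions


X_train = [[1, 10], [2, 12], [8, 30], [9, 35]]
y_train = ['a', 'a', 'b', 'b']
model = RangeClassifier().fit(X_train, y_train)
print(model.predict([[1.5, 11], [8.5, 31]]))  # ['a', 'b']
```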


This regression approach relies on the normal distribution to provide a hybrid solution for problems whose output space covers a wide range.


  • verbose, boolean, by default, True

  • clf_model, classification model class, by default, None (If it is None, CatBoostClassifier is used with default parameters, except that iterations is set to 20)

  • clf_params, parameters for classification model in dict form, by default, None

  • reg_model, regression model, by default, None (If it is None, CatBoostRegressor is used with default parameters, except that iterations is set to 20)

  • reg_params, parameters for regression model in dict form, by default, None

This class has the following methods:

  • fit(X_train, y_train)

  • predict(X_test), returns y_pred


It calculates the fitting time for a model and also returns the trained model.


  1. int

  2. model


  • model

  • X_train

  • y_train

from wolta.model_tools import examine_time

from sklearn.ensemble import RandomForestClassifier

import numpy as np

X_train = np.load('x.npy')

y_train = np.load('y.npy')

model = RandomForestClassifier(random_state=42)

consumed, model = examine_time(model, X_train, y_train)

Progressive Tools

This module has been designed for progressive sampling.


It was designed to use one loaded numpy array for all sampling trials.


  1. list of int, percentage logging

  2. list of dictionaries, metrics logging


  • model_class

  • X_train

  • y_train

  • X_test

  • y_test

  • init_per, int, by default, 1, inclusive starting percentage

  • limit_per, int, by default, 100, inclusive ending percentage

  • increment, int, by default, 1

  • metrics, list of string, the values must be recognizable for model_tools.get_score(), by default, ['acc']

  • average, string, the value must be recognizable for model_tools.get_score(), by default, 'weighted'

  • params, dictionary, if the model has parameters, they are passed here, by default, None

from wolta.progressive_tools import make_run

from sklearn.ensemble import RandomForestClassifier

import numpy as np

X_train = np.load('x_train.npy')

y_train = np.load('y_train.npy')

X_test = np.load('x_test.npy')

y_test = np.load('y_test.npy')

percentage_log, metrics_log = make_run(RandomForestClassifier, X_train, y_train, X_test, y_test)
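Progressive sampling trains the same model on growing portions of the training data and logs the score at each step. The sketch below shows that loop with plain lists and a tiny stand-in model; `MajorityModel` and `progressive_run` are hypothetical, only accuracy is computed, and taking the first rows of the training data for each percentage is an assumption.

```python
from collections import Counter


class MajorityModel:
    """Tiny stand-in model: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def progressive_run(model_class, X_train, y_train, X_test, y_test,
                    init_per=1, limit_per=100, increment=1):
    """Fit a fresh model on the first per% of the training data for each step."""
    percentage_log, metrics_log = [], []
    for per in range(init_per, limit_per + 1, increment):
        size = max(1, len(X_train) * per // 100)
        model = model_class()
        model.fit(X_train[:size], y_train[:size])
        y_pred = model.predict(X_test)
        correct = sum(p == t for p, t in zip(y_pred, y_test))
        percentage_log.append(per)
        metrics_log.append({'acc': correct / len(y_test)})
    return percentage_log, metrics_log


X_train = [[i] for i in range(10)]
y_train = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
pers, logs = progressive_run(MajorityModel, X_train, y_train, [[0], [1]], [1, 1],
                             init_per=30, limit_per=100, increment=70)
print(pers)  # [30, 100]
print(logs)  # [{'acc': 0.0}, {'acc': 1.0}]
```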



  1. int, best percentage

  2. float, best score


  • percentage_log, list of int

  • metrics_log, list of dictionary

  • requested_metrics, string

from wolta.progressive_tools import make_run, get_best

from sklearn.ensemble import RandomForestClassifier

import numpy as np

X_train = np.load('x_train.npy')

y_train = np.load('y_train.npy')

X_test = np.load('x_test.npy')

y_test = np.load('y_test.npy')

percentage_log, metrics_log = make_run(RandomForestClassifier, X_train, y_train, X_test, y_test)

best_per, best_score = get_best(percentage_log, metrics_log, 'acc')
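Selecting the best step from the two logs amounts to finding the highest score and the percentage at the same position. The helper below is a hypothetical sketch which assumes that a higher score is better, as it is for accuracy.

```python
def best_percentage(percentage_log, metrics_log, requested_metric):
    """Return the percentage and score of the best-scoring step."""
    scores = [entry[requested_metric] for entry in metrics_log]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return percentage_log[best_index], scores[best_index]


result = best_percentage([10, 20, 30],
                         [{'acc': 0.7}, {'acc': 0.9}, {'acc': 0.8}],
                         'acc')
print(result)  # (20, 0.9)
```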


Unlike make_run, it loads train data from different files every time.

Returns: list of dictionary, metrics logging


  • paths, list of string

  • model_class

  • X_test

  • y_test

  • output_column, string

  • metrics, list of string, the values must be recognizable for model_tools.get_score(), by default, ['acc']

  • average, string, the value must be recognizable for model_tools.get_score(), by default, 'weighted'

  • params, dictionary, if the model has parameters, they are passed here, by default, None

from wolta.progressive_tools import path_chain

from sklearn.ensemble import RandomForestClassifier

import numpy as np

import glob

X_test = np.load('x_test.npy')

y_test = np.load('y_test.npy')

paths = glob.glob('path/to/dir/*.csv')

metrics_log = path_chain(paths, RandomForestClassifier, X_test, y_test, 'output')

Feature Tools

This module is designed for manipulating features in datasets.


Prints out suggestions about which feature(s) can be deleted with the least loss or the maximum gain.

The algorithm works in two steps. First, it removes one feature at a time and compares the new accuracy with the accuracy of the whole-feature case. If the new accuracy is better than the whole-feature one, or their difference is less than or equal to flag_one_tol, the feature passes to the second step.

The second step runs 'trials' times: each time it creates a combination from a random number of the passed features and removes them at the same time. If the new accuracy is better than the whole-feature one, or their difference is less than or equal to fin_tol, the combination becomes a suggestion.


  • model_class

  • X_train

  • y_train

  • X_test

  • y_test

  • features, list of string, holds column names for X.

  • flag_one_tol, float

  • fin_tol, float

  • params, dictionary, if the model has parameters, they are passed here, by default, None

  • normal_acc, float, by default, None. If it is None, it is calculated before anything else

  • trials, int, by default, 100
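The first step of the algorithm can be sketched independently of any model. In the code below, `evaluate` is a hypothetical callback that trains and scores a model on a given feature subset and returns its accuracy; it stands in for the training code and is not part of Wolta.

```python
def first_step_candidates(evaluate, features, flag_one_tol):
    """Return the features whose removal costs at most flag_one_tol accuracy."""
    normal_acc = evaluate(features)  # accuracy with all features
    candidates = []
    for feature in features:
        remaining = [name for name in features if name != feature]
        acc = evaluate(remaining)
        if acc > normal_acc or normal_acc - acc <= flag_one_tol:
            candidates.append(feature)
    return candidates


# Toy scoring rule used only for this example: each feature adds a fixed gain.
GAINS = {'a': 20, 'b': 10, 'c': 0}


def toy_evaluate(subset):
    return (50 + sum(GAINS[name] for name in subset)) / 100


print(first_step_candidates(toy_evaluate, ['a', 'b', 'c'], 0.05))  # ['c']
```

Only the features that survive this step are combined randomly in the second step, which keeps the number of expensive training runs manageable.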

Visual Tools

This module is designed for visualizing the inputs and results.


Returns: string, md code for table


  • mode, 'sheet' or 'nx2'

  • inputs, it is used if 'sheet' is selected, list of arrays that hold the rows or the columns

  • inputs_type, indicates the type of arrays in inputs, the value must be 'row' or 'column'

  • first_column, it is used if 'nx2' is selected. It indicates the first column

  • filler, it is used if 'nx2' is selected. The second column is filled with it
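The 'nx2' mode can be pictured as a two-column markdown table whose second column is filled with one repeated value. The function below is a hypothetical sketch of that output format; the header names are assumptions and Wolta's generated markup may look different.

```python
def nx2_table(first_column, filler='', headers=('name', 'value')):
    """Build markdown code for a two-column table with a repeated filler."""
    lines = [f'| {headers[0]} | {headers[1]} |', '| --- | --- |']
    lines.extend(f'| {item} | {filler} |' for item in first_column)
    return '\n'.join(lines)


table = nx2_table(['acc', 'f1'], filler='-')
print(table)
# | name | value |
# | --- | --- |
# | acc | - |
# | f1 | - |
```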

