Data Science Library
WOLTA DOCUMENTATION
Wolta is designed to simplify the frequently used Pandas, Numpy and Scikit-Learn workflows in Machine Learning.
Currently there are three modules inside the library: 'data_tools', 'model_tools' and 'progressive_tools'.
Installation
pip install wolta
Data Tools
Data Tools was designed for manipulating data.
load_by_parts
Returns: pandas dataframe
Parameters:
- paths, python list
- strategy, {'default', 'efficient'}, by default, 'default'
- deleted_columns, python string list, by default, None
- print_description, boolean, by default, False
- shuffle, boolean, by default, False

- paths holds the locations of the data files.
- If strategy is 'default', then the datatypes of the columns are assigned with the maximum byte size (64).
- If strategy is 'efficient', then each column is examined and its min-max values are detected. Based on this info, the smallest sufficient datatype is assigned to each column.
- deleted_columns holds the names of the columns that are dropped directly from each sub-dataframe.
- If print_description is True, then the number of paths read so far is printed to the console.
from wolta.data_tools import load_by_parts
import glob
paths = glob.glob('path/to/dir/*.csv')
df = load_by_parts(paths)
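For intuition, the byte-minimizing idea behind the 'efficient' strategy can be sketched in plain numpy. This is an illustrative assumption of how a minimal integer dtype can be chosen from an observed range, not Wolta's internal code:

```python
import numpy as np

def minimal_int_dtype(min_val, max_val):
    # Pick the smallest integer dtype whose range covers the observed min/max.
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= min_val and max_val <= info.max:
            return dtype
    return np.int64

minimal_int_dtype(0, 200)  # values up to 200 need int16, since int8 tops out at 127
```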
col_types
Returns: python string list
Parameters:
- df, pandas dataframe
- print_columns, boolean, by default, False

- df holds the dataframe, and the datatype of each of its columns is returned.
- If print_columns is True, then 'column name: datatype' is printed out for each column.
import pandas as pd
from wolta.data_tools import col_types
df = pd.read_csv('data.csv')
types = col_types(df)
make_numerics
Returns: pandas dataframe column which has int64 data inside it
Parameter: column, pandas dataframe column
import pandas as pd
from wolta.data_tools import make_numerics
df = pd.read_csv('data.csv')
df['output'] = make_numerics(df['output'])
create_chunks
Parameters:
- path, string
- sample_amount, int, sample amount for each chunk
- target_dir, string, directory path to save chunks, by default, None
- print_description, boolean, shows the progress in the console or not, by default, False
- chunk_name, string, general name for chunks, by default, 'part'
from wolta.data_tools import create_chunks
create_chunks('whole_data.csv', 1000000)
unique_amounts
Returns: dictionary with <string, int> form, <column name, unique value amount>
Parameters:
- df, pandas dataframe
- strategy, python string list, by default, None, holds the requested column names
- print_dict, boolean, by default, False
import pandas as pd
from wolta.data_tools import unique_amounts
df = pd.read_csv('data.csv')
amounts = unique_amounts(df)
scale_x
Returns:
- X_train
- X_test
Parameters:
- X_train
- X_test
It applies standard scaling.
import pandas as pd
from sklearn.model_selection import train_test_split
from wolta.data_tools import scale_x
df = pd.read_csv('data.csv')
y = df['output']
del df['output']
X = df
del df
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test = scale_x(X_train, X_test)
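For reference, the standard-scaling step can be sketched in plain numpy. Wolta itself presumably wraps scikit-learn's StandardScaler, so treat this as an illustration of the transform, not its implementation:

```python
import numpy as np

def scale_pair(X_train, X_test):
    # Standard scaling: fit the mean and std on the training split only,
    # then apply the same transform to the test split to avoid leakage.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    return (X_train - mean) / std, (X_test - mean) / std
```

Fitting the statistics on the training split only is the important part: the test split must be scaled with the training mean and std.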
examine_floats
Returns: list of the column names that satisfy the requested condition
Parameters:
- df, pandas dataframe
- float_columns, string list, names of the columns which hold float data
- get, {'float', 'int'}, by default, 'float'

- If get is 'float', then it returns the names of the columns which hold genuinely fractional values rather than only .0 ones.
- If get is 'int', then it returns the names of the columns which hold int values stored with .0.
import pandas as pd
from wolta.data_tools import examine_floats
df = pd.read_csv('data.csv')
float_cols = ['col1', 'col2', 'col3']
float_cols = examine_floats(df, float_cols)
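As an illustration of the distinction, a whole-number check over a float column can be written as follows. This is an assumption of the criterion examine_floats uses, not its actual code:

```python
import numpy as np

def is_integer_valued(values):
    # True when every float in the column is a whole number (e.g. 3.0),
    # i.e. the column could safely be stored as an integer type.
    arr = np.asarray(values, dtype=float)
    return bool(np.all(arr == np.floor(arr)))

is_integer_valued([1.0, 2.0, 3.0])  # whole numbers only
is_integer_valued([1.0, 2.5])      # contains a fractional value
```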
calculate_min_max
Returns:
- columns, list of string, column names which hold int or float data
- types, list of string, datatypes of these columns
- max_val, list of int & float, holds the maximum value for each column
- min_val, list of int & float, holds the minimum value for each column
Parameters:
- paths, list of string, paths of dataframes
- deleted_columns, list of string, initially excluded column names, by default, None
import glob
from wolta.data_tools import calculate_min_max
paths = glob.glob('path/to/dir/*.csv')
columns, types, max_val, min_val = calculate_min_max(paths)
calculate_bounds
Returns: list of string, names of the types with optimum sizes.
Parameters:
- gen_types, list of string, holds the default datatypes for each column
- min_val, list of int & float, holds the minimum value for each column
- max_val, list of int & float, holds the maximum value for each column
import glob
from wolta.data_tools import calculate_bounds, calculate_min_max
paths = glob.glob('path/to/dir/*.csv')
columns, types, max_val, min_val = calculate_min_max(paths)
types = calculate_bounds(types, min_val, max_val)
Model Tools
Model Tools was designed to get results from models.
get_score
Returns: dictionary with <string, float> form, <score type, value>, also prints out the result by default
Parameters:
- y_test, 1D numpy array
- y_pred, 1D numpy array
- metrics, list of string, this list can only have values from the table below:
| value | full name |
| --- |----------------|
| acc | accuracy score |
| f1 | f1 score |
| hamming | hamming loss |
| jaccard | jaccard score |
| log | log loss |
| mcc | matthews corrcoef |
| precision | precision score |
| recall | recall score |
| zol | zero one loss |
by default, ['acc']
- average, string, {'weighted', 'micro', 'macro', 'binary', 'samples'}, by default, 'weighted'
import numpy as np
from wolta.model_tools import get_score
y_test = np.load('test.npy')
y_pred = np.load('pred.npy')
scores = get_score(y_test, y_pred, ['acc', 'precision'])
get_supported_metrics
It returns the string list of possible score names for the get_score function.
from wolta.model_tools import get_supported_metrics
print(get_supported_metrics())
get_avg_options
It returns the string list of possible average values for the get_score function.
from wolta.model_tools import get_avg_options
print(get_avg_options())
do_combinations
Returns: list of int lists
Parameters:
- indexes, list of int
- min_item, int, the minimum amount of indexes inside a combination
- max_item, int, the maximum amount of indexes inside a combination
It creates a list of all possible combinations whose sizes satisfy min_item <= size <= max_item.
from wolta.model_tools import do_combinations
combinations = do_combinations([0, 1, 2], 1, 3)
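The same enumeration can be reproduced with the standard library, which also shows what the output looks like. Assuming do_combinations groups combinations by size, the result for the example above would be:

```python
from itertools import combinations

def all_combinations(indexes, min_item, max_item):
    # Enumerate every combination of the indexes whose length is
    # between min_item and max_item (both inclusive).
    result = []
    for size in range(min_item, max_item + 1):
        result.extend(list(combo) for combo in combinations(indexes, size))
    return result

all_combinations([0, 1, 2], 1, 3)
# [[0], [1], [2], [0, 1], [0, 2], [1, 2], [0, 1, 2]]
```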
do_voting
Returns: list of 1D numpy arrays
Parameters:
- y_pred_list, list of 1D numpy arrays
- combinations, list of int lists, it holds the indexes from y_pred_list for each combination
For each combination, this function sums the selected prediction arrays, divides the result by the number of arrays, and casts the whole array to int.
import numpy as np
from wolta.model_tools import do_voting, do_combinations
y_pred_1 = np.load('one.npy')
y_pred_2 = np.load('two.npy')
y_pred_3 = np.load('three.npy')
y_preds = [y_pred_1, y_pred_2, y_pred_3]
combinations = do_combinations([0, 1, 2], 1, 3)
results = do_voting(y_preds, combinations)
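For intuition, the averaging step described above can be shown directly with numpy for a single combination. The prediction arrays here are hypothetical, made up for the demo:

```python
import numpy as np

# Three hypothetical label predictions for the same five samples.
preds = [np.array([0, 1, 1, 0, 1]),
         np.array([0, 1, 0, 0, 1]),
         np.array([1, 1, 1, 0, 1])]

# Sum the arrays, divide by their count, then cast to int,
# which is the voting rule the function applies per combination.
voted = (sum(preds) / len(preds)).astype(int)
print(voted)  # [0 1 0 0 1]
```

Note that casting to int truncates, so a label needs a clear majority of the summed votes to survive the cast.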
examine_time
It calculates the fitting time for a model and also returns the trained model.
Returns:
- int
- model
Parameters:
- model
- X_train
- y_train
from wolta.model_tools import examine_time
from sklearn.ensemble import RandomForestClassifier
import numpy as np
X_train = np.load('x.npy')
y_train = np.load('y.npy')
model = RandomForestClassifier(random_state=42)
consumed, model = examine_time(model, X_train, y_train)
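The timing itself presumably amounts to something like the following sketch (not Wolta's actual implementation; the `_DummyModel` stand-in is invented for the demo):

```python
import time

class _DummyModel:
    # Minimal stand-in for an sklearn estimator, just for this demo.
    def fit(self, X, y):
        self.is_fitted = True

def timed_fit(model, X_train, y_train):
    # Measure wall-clock fitting time in seconds and
    # return it together with the trained model.
    start = time.time()
    model.fit(X_train, y_train)
    return int(time.time() - start), model

elapsed, trained = timed_fit(_DummyModel(), None, None)
```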
Progressive Tools
This module has been designed for progressive sampling.
make_run
It was designed to use one loaded numpy array for all sampling trials.
Returns:
- list of int, percentage logging
- list of dictionaries, metrics logging
Parameters:
- model_class
- X_train
- y_train
- X_test
- y_test
- init_per, int, by default, 1, inclusive starting percentage
- limit_per, int, by default, 100, inclusive ending percentage
- increment, int, by default, 1
- metrics, list of string, the values must be recognizable for model_tools.get_score(), by default, ['acc']
- average, string, the value must be recognizable for model_tools.get_score(), by default, 'weighted'
- params, dictionary, if the model has parameters, they are initialized here, by default, None
from wolta.progressive_tools import make_run
from sklearn.ensemble import RandomForestClassifier
import numpy as np
X_train = np.load('x_train.npy')
y_train = np.load('y_train.npy')
X_test = np.load('x_test.npy')
y_test = np.load('y_test.npy')
percentage_log, metrics_log = make_run(RandomForestClassifier, X_train, y_train, X_test, y_test)
get_best
Returns:
- int, best percentage
- float, best score
Parameters:
- percentage_log, list of int
- metrics_log, list of dictionary
- requested_metrics, string
from wolta.progressive_tools import make_run, get_best
from sklearn.ensemble import RandomForestClassifier
import numpy as np
X_train = np.load('x_train.npy')
y_train = np.load('y_train.npy')
X_test = np.load('x_test.npy')
y_test = np.load('y_test.npy')
percentage_log, metrics_log = make_run(RandomForestClassifier, X_train, y_train, X_test, y_test)
best_per, best_score = get_best(percentage_log, metrics_log, 'acc')
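Conceptually, the selection boils down to an argmax over the logged metric values. A sketch of that presumed rule, with made-up log data:

```python
def best_of(percentage_log, metrics_log, metric):
    # Return the percentage whose logged value for the given metric
    # is highest, together with that value.
    best_i = max(range(len(metrics_log)), key=lambda i: metrics_log[i][metric])
    return percentage_log[best_i], metrics_log[best_i][metric]

best_of([10, 20, 30],
        [{'acc': 0.7}, {'acc': 0.9}, {'acc': 0.8}],
        'acc')
# (20, 0.9)
```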
path_chain
Unlike make_run, it loads train data from different files every time.
Returns: list of dictionary, metrics logging
Parameters:
- paths, list of string
- model_class
- X_test
- y_test
- output_column, string
- metrics, list of string, the values must be recognizable for model_tools.get_score(), by default, ['acc']
- average, string, the value must be recognizable for model_tools.get_score(), by default, 'weighted'
- params, dictionary, if the model has parameters, they are initialized here, by default, None
from wolta.progressive_tools import path_chain
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import glob
X_test = np.load('x_test.npy')
y_test = np.load('y_test.npy')
paths = glob.glob('path/to/dir/*.csv')
metrics_log = path_chain(paths, RandomForestClassifier, X_test, y_test, 'output')