Cuoco is a tool for automatic data preprocessing. The name "cuoco" is Italian for "chef".
Project description
CUOCO
Cuoco is a tool for automatic processing of data.
Example
Import the library
import cuoco
from cuoco import dataPipeline
Use the dataPipeline
dataPipeline.readJson('/content/biostats.csv', '/content/jsonTESTFILE.json')
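The second argument is the JSON configuration file described under Documentation. As a minimal sketch, such a file could be written from Python as shown here. The key names come from the documentation; the flat layout, the nesting of balance_method and y_col under balance_params, the list form of normalize, and every example value (paths, column names) are assumptions that the text does not confirm.

```python
import json
import os
import tempfile

# Hypothetical configuration: key names follow the documentation,
# but the exact nesting and value types are assumptions.
config = {
    "input_format": "csv",
    "output_format": "parquet",
    "new_fileName": "biostats_clean",      # example value
    "new_file_route": "/content/output",   # example value
    "header": "yes",
    "separator": ",",
    "num_nans": "mean",
    "str_nans": "yes",
    "caps": "lower",
    "normalize_method": "min_max",
    "normalize": ["Age", "Weight"],        # hypothetical column names
    "balance_data": "yes",
    "balance_params": {
        "balance_method": "random",
        "y_col": "Sex",                    # hypothetical target column
    },
}

# Write the configuration to a temporary location.
config_path = os.path.join(tempfile.gettempdir(), "cuoco_config.json")
with open(config_path, "w", encoding="utf-8") as fh:
    json.dump(config, fh, indent=2)

# The file could then be passed to the pipeline as in the example above:
# dataPipeline.readJson('/content/biostats.csv', config_path)
```

Building the configuration as a dictionary and serializing it with json.dump avoids hand-written quoting mistakes in the JSON file.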
Documentation
How it works: Cuoco reads a JSON file created by the user and automatically applies the data-processing functions it specifies to the target dataset. The JSON has the following keys:
- input_format: format of the input dataset. Can be csv, parquet, orc or txt
- output_format: format of the resulting dataset. Can be csv, parquet, orc or txt
- new_fileName: name of the new dataset the DataChef will write
- new_file_route: path where the new data file will be stored
- header: whether your dataset has a header. Can be yes or none
- separator: the separator used in your dataset. Only applies to the csv and txt formats.
- num_nans: how to handle numerical NaNs (including empty values). Can be:
  - drop: drop rows that contain NaNs
  - yes: leave rows that contain NaNs unchanged
  - mean: fill NaNs with the mean of the column
  - median: fill NaNs with the median of the column
  - mode: fill NaNs with the mode of the column
- str_nans: how to handle string NaNs (including empty values). Can be:
  - yes: keep columns that contain NaNs
  - no: drop columns that contain NaNs
- caps: how to handle strings that mix uppercase and lowercase letters. Can be:
  - no: leave strings unchanged
  - upper: convert all values in string columns to uppercase
  - lower: convert all values in string columns to lowercase
- normalize_method: method used to normalize numerical columns. Can be:
  - no: do not normalize
  - max_abs: divide by the maximum absolute value
  - min_max: min-max normalization
  - z_score: z-score normalization
- normalize:
  - the names of the columns you want to normalize
  - Note: if your dataset does not have a header, write the columns to normalize as numbers; if it has a header, write the column names between ""
- balance_data: whether to balance your data (recommended for AI datasets). Can be:
  - yes
  - no
- Inside balance_params there are two items:
  - balance_method: oversampling method to use. Can be:
    - random: random oversampling
    - smote: oversampling with the SMOTE technique
  - y_col: the column of the dataset to use as the target for balancing
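To make the three normalize_method options concrete, here is a minimal plain-Python sketch of the standard formulas they are named after. It is an illustration only, not Cuoco's implementation: the function names are hypothetical, and the use of the population standard deviation for z_score and the handling of constant columns are assumptions.

```python
from statistics import mean, pstdev


def max_abs(values):
    """Divide by the largest absolute value, so results lie in [-1, 1]."""
    peak = max(abs(v) for v in values)
    return [v / peak for v in values] if peak else list(values)


def min_max(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # Assumption: a constant column is mapped to all zeros.
    return [(v - lo) / span for v in values] if span else [0.0 for _ in values]


def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    # Assumption: a constant column is mapped to all zeros.
    return [(v - mu) / sigma for v in values] if sigma else [0.0 for _ in values]


column = [2.0, 4.0, 6.0, 8.0]
print(max_abs(column))  # [0.25, 0.5, 0.75, 1.0]
print(min_max(column))
print(z_score(column))
```

max_abs preserves the sign and the position of zero, min_max fixes the output range to [0, 1], and z_score centers the column at 0 with unit spread.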
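The random option of balance_method refers to random oversampling: rows of the under-represented classes are duplicated at random until every class is as large as the biggest one. The sketch below shows the idea in plain Python. It is not Cuoco's code; the function name, the list-of-dicts row format, and the example column names are hypothetical. SMOTE differs in that it synthesizes new rows by interpolating between neighbouring minority samples, and is not shown here.

```python
import random
from collections import Counter


def random_oversample(rows, y_col, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes
    have as many rows as the largest class."""
    if not rows:
        return []
    rng = random.Random(seed)  # fixed seed for reproducibility

    # Group rows by the value of the target column.
    by_class = {}
    for row in rows:
        by_class.setdefault(row[y_col], []).append(row)

    target = max(len(group) for group in by_class.values())
    balanced = list(rows)
    for group in by_class.values():
        # Sample with replacement to fill the gap (k may be 0).
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


rows = [
    {"Sex": "F", "Age": 41},
    {"Sex": "F", "Age": 35},
    {"Sex": "F", "Age": 29},
    {"Sex": "M", "Age": 52},
]
balanced = random_oversample(rows, "Sex")
print(Counter(r["Sex"] for r in balanced))  # both classes now have 3 rows
```

Here "Sex" plays the role of y_col, the column whose class counts are equalized.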
Download files
Source Distribution: cuoco-0.1.4.tar.gz (17.9 kB)
Built Distribution: cuoco-0.1.4-py3-none-any.whl (18.8 kB)