Data processing module implemented with numpy
Project description
carefree-data
carefree-data implements a data processing module with numpy.
Update 2021.02.04
carefree-data now uses datatable as its backend, which significantly improves performance on file inputs!
Why carefree-data?
carefree-data is a data processing module which is capable of handling 'dirty' and 'messy' datasets.
For tabular datasets, carefree-data is able to:
- Elegantly deal with data pre-processing.
  - A Recognizer to recognize whether a column is STRING, NUMERICAL or CATEGORICAL.
  - A Converter to convert a column into a friendly format (["one", "two"] -> [0, 1]).
  - A Processor to further process columns (OneHot, Normalize, MinMax, ...).
  - And all the transforms can be inverted! (See tests\unittests\test_tabular.py -> test_recover_labels & test_recover_features).
  - And these procedures are all completed AUTOMATICALLY!
- Handle datasets saved in files (.txt, .csv).
  - For .txt, " " will be the default delimiter.
  - For .csv, "," will be the default delimiter, and the first row will be skipped by default.
  - delimiter, label index, skip first can be set manually.
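To make the convert / process / recover idea above concrete, here is a minimal sketch in plain Python. It does not use the cfdata API: the helper names convert, recover, normalize and denormalize are hypothetical and only illustrate how a categorical column can be mapped to integer codes, how a numerical column can be normalized, and how both steps can be inverted.

```python
# Illustrative sketch only: these helpers are hypothetical, NOT part of cfdata.

def convert(column):
    """Map string categories to integer codes, in first-seen order."""
    categories = list(dict.fromkeys(column))  # unique values, order preserved
    mapping = {category: i for i, category in enumerate(categories)}
    return [mapping[value] for value in column], categories

def recover(codes, categories):
    """Inverse of convert: map integer codes back to the original strings."""
    return [categories[code] for code in codes]

def normalize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    std = std if std > 0 else 1.0  # avoid dividing by zero on constant columns
    return [(v - mean) / std for v in values], (mean, std)

def denormalize(normalized, stats):
    """Inverse of normalize, using the statistics stored at fit time."""
    mean, std = stats
    return [v * std + mean for v in normalized]

codes, categories = convert(["one", "two", "one"])
restored_labels = recover(codes, categories)

normalized, stats = normalize([1.0, 2.0, 3.0])
restored_values = denormalize(normalized, stats)
```

The key design point is that each forward transform returns whatever state the inverse needs (the category list, or the mean and standard deviation), which is what makes recovering labels and features possible.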
Pandas-free
There is one more thing we'd like to mention: carefree-data is 'Pandas-free'. Pandas is an open source library providing easy-to-use data structures for structured datasets. Although it is widely used across well-known Machine Learning and Deep Learning libraries, we decided not to depend on it, for the reasons listed below:
- carefree-data wants to have full control over the data, and Pandas is not flexible enough.
- carefree-data needs higher performance. Pandas is fast, but not as fast as pure numpy (and sometimes cython) code on some critical code paths.
- Pandas provides many powerful functions, but carefree-data doesn't need that much, which means Pandas is a little 'heavy' for carefree-data.
In short, Pandas is a more general library, and that's why we've written our own code to cover our needs instead of relying on it directly.
Currently carefree-data only supports tabular datasets.
Installation
carefree-data requires Python 3.8 or higher.
pip install carefree-data
or
git clone https://github.com/carefree0910/carefree-data.git
cd carefree-data
pip install -e .
Basic Usage
Get scikit-learn datasets
from cfdata.tabular import TabularDataset
iris = TabularDataset.iris()
Read from array / dataset
from cfdata.tabular import TabularData, TabularDataset
iris = TabularDataset.iris()
x, y = iris.xy
assert TabularData().read(x, y) == TabularData.from_dataset(iris)
Read from file
from cfdata.tabular import TabularData
file = "/path/to/your/file"
data = TabularData().read(file)
assert data.processed == data.transform(file)
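The file defaults described earlier (" " for .txt, "," for .csv with the first row skipped) can be sketched with the standard library alone. This is not how cfdata reads files internally: default_read_options and read_rows are hypothetical helpers, and not skipping the first row of a .txt file is an assumption, since only the .csv behaviour is stated.

```python
import csv
import io
import os

# Illustrative sketch only: these helpers are hypothetical, NOT part of cfdata.

def default_read_options(path):
    """Pick a delimiter and header policy from the file extension."""
    extension = os.path.splitext(path)[1].lower()
    if extension == ".csv":
        return {"delimiter": ",", "skip_first": True}
    if extension == ".txt":
        # Assumption: .txt files are read without skipping the first row.
        return {"delimiter": " ", "skip_first": False}
    raise ValueError(f"unsupported file type: {extension!r}")

def read_rows(stream, delimiter, skip_first):
    """Split each line on the delimiter, optionally dropping the first row."""
    rows = list(csv.reader(stream, delimiter=delimiter))
    return rows[1:] if skip_first else rows

# In-memory streams stand in for real files here.
csv_rows = read_rows(io.StringIO("a,b\n1,2\n3,4\n"), **default_read_options("data.csv"))
txt_rows = read_rows(io.StringIO("1 2\n3 4\n"), **default_read_options("data.txt"))
```

Both calls yield the same two data rows as lists of strings; overriding the returned options mirrors setting the delimiter or the skip-first behaviour manually.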
License
carefree-data is MIT licensed, as found in the LICENSE file.
Download files
Source Distribution
File details
Details for the file carefree-data-0.2.9.tar.gz.
File metadata
- Download URL: carefree-data-0.2.9.tar.gz
- Upload date:
- Size: 35.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.6.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 28a24125b6efedd10eeab466a3bb65833046835798db23c44ab177eb8df7e79e |
| MD5 | 812c539ad338d13fcf1b77317f8e75e1 |
| BLAKE2b-256 | 7ea4f518261e4b61d105dd22db20e45dbb9935fbf33c682645bc4f75bb62da04 |