Data processing module implemented with numpy
carefree-data
carefree-data implements a data processing module with numpy.
Update 2021.02.04
carefree-data now uses datatable as its backend, which significantly improves performance on file inputs!
Why carefree-data?
carefree-data is a data processing module capable of handling 'dirty' and 'messy' datasets.
For tabular datasets, carefree-data is able to:
- Elegantly deal with data pre-processing.
  - A Recognizer to recognize whether a column is STRING, NUMERICAL or CATEGORICAL.
  - A Converter to convert a column into a friendly format (["one", "two"] -> [0, 1]).
  - A Processor to further process columns (OneHot, Normalize, MinMax, ...).
  - And all the transforms can be inverted! (See tests\unittests\test_tabular.py -> test_recover_labels & test_recover_features.)
  - And these procedures are all completed AUTOMATICALLY!
- Handle datasets saved in files (.txt, .csv).
  - For .txt, " " will be the default delimiter.
  - For .csv, "," will be the default delimiter, and the first row will be skipped by default.
  - delimiter, label index and skip first can be set manually.
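To make the recognize / convert / process / invert pipeline more concrete, here is a minimal numpy sketch of the idea. It is not carefree-data's implementation or API: the function names (recognize, convert, recover, min_max, min_max_inverse) and the rule used to tell CATEGORICAL from STRING are assumptions made purely for illustration.

```python
import numpy as np


def recognize(column):
    """Label a column NUMERICAL, CATEGORICAL or STRING (illustrative heuristic)."""
    try:
        np.asarray(column, dtype=np.float64)
        return "NUMERICAL"
    except ValueError:
        # Assumption: a non-numeric column with few distinct values is categorical.
        n_unique = len(set(column))
        return "CATEGORICAL" if n_unique <= max(2, len(column) // 2) else "STRING"


def convert(column):
    """Map each category to an integer code, in order of first appearance."""
    mapping = {}
    for value in column:
        mapping.setdefault(value, len(mapping))
    codes = np.array([mapping[value] for value in column], dtype=np.int64)
    return codes, mapping


def recover(codes, mapping):
    """Invert convert(): turn integer codes back into the original categories."""
    inverse = {code: value for value, code in mapping.items()}
    return [inverse[int(code)] for code in codes]


def min_max(values):
    """Scale values into [0, 1] and return the statistics needed to undo it."""
    values = np.asarray(values, dtype=np.float64)
    low, high = values.min(), values.max()
    scale = high - low if high > low else 1.0  # avoid dividing by zero
    return (values - low) / scale, (low, scale)


def min_max_inverse(scaled, stats):
    """Invert min_max() using the stored (low, scale) statistics."""
    low, scale = stats
    return scaled * scale + low


labels = ["one", "two", "one"]
codes, mapping = convert(labels)      # codes -> [0, 1, 0]
restored = recover(codes, mapping)    # back to ["one", "two", "one"]

ages = [20.0, 30.0, 40.0]
scaled, stats = min_max(ages)         # scaled -> [0.0, 0.5, 1.0]
ages_back = min_max_inverse(scaled, stats)
```

Keeping the mapping and the scaling statistics next to the transformed data is what makes each step invertible in this sketch.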
Pandas-free
There is one more thing we'd like to mention: carefree-data is 'Pandas-free'. Pandas is an open source library providing easy-to-use data structures for structured datasets. Although it is widely used in almost every well-known Machine Learning and Deep Learning module, we finally decided to move away from it, for the reasons listed below:
- carefree-data wants to have full control over the data, and Pandas is not flexible enough.
- carefree-data needs higher performance. Pandas is fast, but not as fast as pure numpy (and sometimes cython) code on some critical code paths.
- Pandas provides many powerful functions, but carefree-data doesn't need that many, which means Pandas is a little 'heavy' for carefree-data.
In short, Pandas is a more general library, and that's why we've written our own code to cover our needs instead of directly utilizing it.
Currently carefree-data only supports tabular datasets.
Installation
carefree-data requires Python 3.8 or higher.
pip install carefree-data
or
git clone https://github.com/carefree0910/carefree-data.git
cd carefree-data
pip install -e .
Basic Usages
Get scikit-learn datasets
from cfdata.tabular import TabularDataset
iris = TabularDataset.iris()
Read from array / dataset
from cfdata.tabular import *
iris = TabularDataset.iris()
x, y = iris.xy
assert TabularData().read(x, y) == TabularData.from_dataset(iris)
Read from file
from cfdata.tabular import TabularData
file = "/path/to/your/file"
data = TabularData().read(file)
assert data.processed == data.transform(file)
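The file defaults described earlier (" " for .txt, "," for .csv with the first row skipped) can be sketched with the standard library alone. This is a hypothetical helper, not how carefree-data reads files: the name read_xy, its parameters, the choice of the last column as the default label, and not skipping the first row of a .txt file are all assumptions.

```python
import csv
import os
import tempfile

# Assumed defaults per extension: (delimiter, skip_first).
# The text only says .csv skips its first row; False for .txt is an assumption.
DEFAULTS = {".txt": (" ", False), ".csv": (",", True)}


def read_xy(path, delimiter=None, skip_first=None, label_index=-1):
    """Read a delimited file and split each row into features and a label."""
    extension = os.path.splitext(path)[1].lower()
    default_delimiter, default_skip = DEFAULTS[extension]
    # Explicit arguments override the extension-based defaults.
    delimiter = default_delimiter if delimiter is None else delimiter
    skip_first = default_skip if skip_first is None else skip_first
    with open(path, newline="") as f:
        rows = [row for row in csv.reader(f, delimiter=delimiter) if row]
    if skip_first:
        rows = rows[1:]
    features, labels = [], []
    for row in rows:
        idx = label_index % len(row)  # normalize negative indices
        features.append(row[:idx] + row[idx + 1:])
        labels.append(row[idx])
    return features, labels


with tempfile.TemporaryDirectory() as tmp:
    # .csv: comma-delimited, header row skipped, label in the last column.
    csv_path = os.path.join(tmp, "toy.csv")
    with open(csv_path, "w") as f:
        f.write("f1,f2,label\n1,2,x\n3,4,y\n")
    csv_x, csv_y = read_xy(csv_path)

    # .txt: space-delimited, no header, label set manually to the first column.
    txt_path = os.path.join(tmp, "toy.txt")
    with open(txt_path, "w") as f:
        f.write("x 1 2\ny 3 4\n")
    txt_x, txt_y = read_xy(txt_path, label_index=0)
```

Values stay as strings here; deciding whether a column is numerical or categorical is left to a later recognition step.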
License
carefree-data is MIT licensed, as found in the LICENSE file.