common_datasets

common-datasets: common machine learning datasets

This package provides an unofficial collection of datasets widely used in the evaluation of machine learning techniques, mainly small and imbalanced datasets for binary classification, multiclass classification and regression. The datasets are provided in the usual sklearn.datasets format, with missing data imputed and categorical and ordinal features encoded. The authors of this repository do not own any licenses for the datasets; the goal of the project is to provide a standardized collection of datasets for research purposes.

PLEASE DO NOT CITE OR REFER TO THIS PACKAGE IN ANY FORM!

If you use data through this repository, please cite the original works publishing and specifying these datasets:

@article{keel,
  author={Alcala-Fdez, J. and Fernandez, A. and Luengo, J. and Derrac, J. and Garcia, S.
          and Sanchez, L. and Herrera, F.},
  title={KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms
          and Experimental Analysis Framework},
  journal={Journal of Multiple-Valued Logic and Soft Computing},
  volume={17},
  number={2-3},
  year={2011},
  pages={255-287}}

@misc{uci,
  author = "Dua, Dheeru and Karra Taniskidou, Efi",
  year = "2017",
  title = "{UCI} Machine Learning Repository",
  url = "http://archive.ics.uci.edu/ml",
  institution = "University of California, Irvine, School of Information and Computer Sciences"}

@article{krnn,
  author={X. J. Zhang and Z. Tari and M. Cheriet},
  title={{KRNN}: k {Rare-class Nearest Neighbor} classification},
  journal={Pattern Recognition},
  year={2017},
  volume={62},
  number={2},
  pages={33--44}
  }

Each dataset carries a citation key that refers either to its publisher or to a relevant publication in which the dataset has been used in the given configuration. For example:

# binary classification
>>> import common_datasets.binary_classification as binclas

>>> dataset = binclas.load_abalone19()
>>> dataset['citation_key']
'keel'
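
When several datasets are used in one study, the citation keys can be gathered in a single pass to see which of the works above need to be cited. The following is a minimal sketch that relies only on the `citation_key` entry shown above; `collect_citation_keys` and the `load_fake_*` stand-in loaders are hypothetical names introduced for illustration, so the snippet runs without the package installed.

```python
def collect_citation_keys(loaders):
    """Return the sorted, de-duplicated citation keys of the datasets
    produced by the given loader callables."""
    keys = set()
    for loader in loaders:
        dataset = loader()
        keys.add(dataset['citation_key'])
    return sorted(keys)


# Hypothetical stand-in loaders (not part of the package); each returns a
# dictionary carrying a 'citation_key' entry like the real loaders do.
def load_fake_a():
    return {'data': [[0.1], [0.2]], 'target': [0, 1], 'citation_key': 'keel'}


def load_fake_b():
    return {'data': [[1.0], [2.0]], 'target': [1, 0], 'citation_key': 'uci'}


def load_fake_c():
    return {'data': [[3.0], [4.0]], 'target': [0, 0], 'citation_key': 'keel'}


print(collect_citation_keys([load_fake_a, load_fake_b, load_fake_c]))
```

With the real package, the list of stand-ins would presumably be replaced by the loaders actually used in the study, for example `[binclas.load_abalone19]`.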

Introduction

The package contains 119 binary classification, 23 multiclass classification and 23 regression datasets.

Installation

The package can be cloned from GitHub in the usual way, and the latest stable version is also available in the PyPI repository:

pip install common_datasets

Use cases

Loading a dataset

# binary classification
import common_datasets.binary_classification as binclas

dataset = binclas.load_abalone19()

# multiclass classification
import common_datasets.multiclass_classification as multclas

dataset = multclas.load_abalone()

# regression
from common_datasets import regression

dataset = regression.load_treasury()
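
Because the collection focuses on imbalanced classification problems, a common first step after loading is to check the class distribution. The sketch below assumes that the returned dictionary exposes the labels under a `target` key, as in the usual sklearn.datasets layout; that key, the helper name `imbalance_ratio` and the stand-in dataset are assumptions made for illustration, not confirmed details of the package.

```python
from collections import Counter


def imbalance_ratio(dataset):
    """Ratio of the majority class size to the minority class size.

    Assumes the labels are stored under dataset['target'] (sklearn-style).
    """
    counts = Counter(dataset['target'])
    return max(counts.values()) / min(counts.values())


# Hypothetical stand-in for a loaded binary classification dataset:
# 8 records of class 0 and 2 records of class 1.
fake_dataset = {
    'data': [[float(i)] for i in range(10)],
    'target': [0] * 8 + [1] * 2,
}

print(imbalance_ratio(fake_dataset))
```

With the package installed, `fake_dataset` would be replaced by the result of a loader such as `binclas.load_abalone19()`, provided the `target` assumption holds.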

Querying all dataset loaders and loading a dataset

# binary classification
import common_datasets.binary_classification as binclas

data_loaders = binclas.get_data_loaders()

dataset_0 = data_loaders[0]()

# multiclass classification
import common_datasets.multiclass_classification as multclas

data_loaders = multclas.get_data_loaders()

dataset_0 = data_loaders[0]()

# regression
from common_datasets import regression

data_loaders = regression.get_data_loaders()

dataset_0 = data_loaders[0]()
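
Since the loaders are plain callables, the whole list can be iterated to build a quick overview of the collection. The following sketch assumes that the feature matrix is stored under a `data` key (sklearn-style) and supports `len()` and indexing; `summarize` and `make_fake_loader` are hypothetical helpers introduced here so the example is self-contained.

```python
def summarize(loaders):
    """Return (loader name, number of records, number of features) for
    each loader. Assumes the features live under dataset['data']."""
    rows = []
    for loader in loaders:
        data = loader()['data']
        rows.append((loader.__name__, len(data), len(data[0])))
    return rows


def make_fake_loader(name, n_records, n_features):
    """Build a hypothetical stand-in loader (not part of the package)."""
    def loader():
        return {
            'data': [[0.0] * n_features for _ in range(n_records)],
            'target': [0] * n_records,
        }
    loader.__name__ = name
    return loader


fake_loaders = [
    make_fake_loader('load_small', 2, 3),
    make_fake_loader('load_large', 4, 1),
]

for name, n_records, n_features in summarize(fake_loaders):
    print(name, n_records, n_features)
```

With the package installed, `fake_loaders` would presumably be replaced by the result of `binclas.get_data_loaders()`.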

Querying the loaders of the 5 smallest datasets by total number of records

# binary classification
import common_datasets.binary_classification as binclas

data_loaders = binclas.get_filtered_data_loaders(n_smallest=5, sorting='n')

dataset_0 = data_loaders[0]()

# multiclass classification
import common_datasets.multiclass_classification as multclas

data_loaders = multclas.get_filtered_data_loaders(n_smallest=5, sorting='n')

dataset_0 = data_loaders[0]()

# regression
from common_datasets import regression

data_loaders = regression.get_filtered_data_loaders(n_smallest=5, sorting='n')

dataset_0 = data_loaders[0]()
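
To make the meaning of "the n smallest datasets by number of records" concrete, the selection can be reproduced by hand: load each dataset, sort by record count and keep the first few. This is only an illustrative sketch, not the package's implementation (which may rely on precomputed summaries rather than loading every dataset). The `data` key, `smallest_loaders` and `make_fake_loader` are assumptions or hypothetical names.

```python
def smallest_loaders(loaders, n_smallest):
    """Return the n_smallest loaders whose datasets have the fewest records.

    Every dataset is loaded once to measure its size; ties keep their
    original order because Python's sort is stable.
    """
    sizes = {loader: len(loader()['data']) for loader in loaders}
    return sorted(loaders, key=lambda loader: sizes[loader])[:n_smallest]


def make_fake_loader(name, n_records):
    """Build a hypothetical stand-in loader (not part of the package)."""
    def loader():
        return {'data': [[0.0] for _ in range(n_records)],
                'target': [0] * n_records}
    loader.__name__ = name
    return loader


fake_loaders = [
    make_fake_loader('load_five', 5),
    make_fake_loader('load_two', 2),
    make_fake_loader('load_three', 3),
]

selected = smallest_loaders(fake_loaders, n_smallest=2)
print([loader.__name__ for loader in selected])
```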

Documentation
