Machine Learning dataset loaders

Project description

Machine learning dataset loaders for testing and examples

Loaders for various machine learning datasets for testing and example scripts. Previously in thinc.extra.datasets.

Setup and installation

The package can be installed via pip:

pip install ml-datasets

Loaders

Loaders can be imported directly or used via their string name (which is useful if they're set via command line arguments). Some loaders may take arguments – see the source for details.

# Import directly
from ml_datasets import imdb
train_data, dev_data = imdb()
# Load via registry
from ml_datasets import loaders
imdb_loader = loaders.get("imdb")
train_data, dev_data = imdb_loader()
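
Because loaders can be looked up by string name, they fit naturally into command-line scripts. The following is a minimal sketch of that pattern; the argparse wiring is illustrative and not part of the package, and it assumes the loader returns a (train_data, dev_data) pair of lists as in the examples above:

import argparse

from ml_datasets import loaders

parser = argparse.ArgumentParser()
parser.add_argument("--dataset", default="imdb", help="Loader name, e.g. imdb or dbpedia")
args = parser.parse_args()

# Look up the loader by the name passed on the command line
load_dataset = loaders.get(args.dataset)
train_data, dev_data = load_dataset()
print(f"Loaded {len(train_data)} training and {len(dev_data)} dev instances")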

Available loaders

NLP datasets

| ID / Function | Description | NLP task |
| --- | --- | --- |
| imdb | IMDB sentiment dataset | Binary classification: sentiment analysis |
| dbpedia | DBPedia ontology dataset | Multi-class, single-label classification |
| cmu | CMU movie genres dataset | Multi-class, multi-label classification |
| quora_questions | Duplicate Quora questions dataset | Detecting duplicate questions |
| reuters | Reuters dataset (texts not included) | Multi-class, multi-label classification |
| snli | Stanford Natural Language Inference corpus | Recognizing textual entailment |
| stack_exchange | Stack Exchange dataset | Question answering |
| ud_ancora_pos_tags | Universal Dependencies Spanish AnCora corpus | POS tagging |
| ud_ewtb_pos_tags | Universal Dependencies English EWT corpus | POS tagging |
| wikiner | WikiNER data | Named entity recognition |

Other ML datasets

| ID / Function | Description | ML task |
| --- | --- | --- |
| mnist | MNIST data | Image recognition |

Dataset details

IMDB

Each instance contains the text of a movie review, and a sentiment expressed as 0 or 1.

import ml_datasets

train_data, dev_data = ml_datasets.imdb()
for text, annot in train_data[0:5]:
    print(f"Review: {text}")
    print(f"Sentiment: {annot}")

| Property | Training | Dev |
| --- | --- | --- |
| # Instances | 25000 | 25000 |
| Label values | {0, 1} | {0, 1} |
| Labels per instance | Single | Single |
| Label distribution | Balanced (50/50) | Balanced (50/50) |
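
A quick way to sanity-check the balanced 50/50 split is to count the labels. This is a small sketch, assuming the (text, annot) tuple format shown in the example above:

from collections import Counter

import ml_datasets

train_data, dev_data = ml_datasets.imdb()
# Count how often each sentiment label occurs in the training set
label_counts = Counter(annot for text, annot in train_data)
print(label_counts)  # expected: roughly 12500 each for labels 0 and 1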

DBPedia

Each instance contains an ontological description, and a classification into one of 14 distinct labels.

import ml_datasets

train_data, dev_data = ml_datasets.dbpedia()
for text, annot in train_data[0:5]:
    print(f"Text: {text}")
    print(f"Category: {annot}")

| Property | Training | Dev |
| --- | --- | --- |
| # Instances | 560000 | 70000 |
| Label values | 1-14 | 1-14 |
| Labels per instance | Single | Single |
| Label distribution | Balanced | Balanced |

CMU

Each instance contains a movie description, and a classification into a list of appropriate genres.

import ml_datasets

train_data, dev_data = ml_datasets.cmu()
for text, annot in train_data[0:5]:
    print(f"Text: {text}")
    print(f"Genres: {annot}")

| Property | Training | Dev |
| --- | --- | --- |
| # Instances | 41793 | 0 |
| Label values | 363 different genres | - |
| Labels per instance | Multiple | - |
| Label distribution | Imbalanced: 147 labels have fewer than 20 examples, while Drama occurs more than 19000 times | - |
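
Because each instance carries a list of genres, inspecting the label distribution means iterating over every genre of every instance. A sketch of how the imbalance could be checked, assuming annot is a list of genre strings as the example above suggests:

from collections import Counter

import ml_datasets

train_data, dev_data = ml_datasets.cmu()
# Flatten the per-instance genre lists and count each genre
genre_counts = Counter(genre for text, genres in train_data for genre in genres)
print(genre_counts.most_common(5))  # dominated by frequent genres such as Drama
rare = [g for g, n in genre_counts.items() if n < 20]
print(f"{len(rare)} genres have fewer than 20 examples")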

Quora

import ml_datasets

train_data, dev_data = ml_datasets.quora_questions()
for questions, annot in train_data[0:50]:
    q1, q2 = questions
    print(f"Question 1: {q1}")
    print(f"Question 2: {q2}")
    print(f"Similarity: {annot}")

Each instance contains two Quora questions, and a label indicating whether or not they are duplicates (0: no, 1: yes). The ground-truth labels contain some noise and are not guaranteed to be perfect.

| Property | Training | Dev |
| --- | --- | --- |
| # Instances | 363859 | 40429 |
| Label values | {0, 1} | {0, 1} |
| Labels per instance | Single | Single |
| Label distribution | Imbalanced: 63% label 0 | Imbalanced: 63% label 0 |

Registering loaders

Loaders can be registered externally using the loaders registry as a decorator. For example:

import ml_datasets

@ml_datasets.loaders("my_custom_loader")
def my_custom_loader():
    # load_some_data() is a placeholder for your own loading logic
    return load_some_data()

assert "my_custom_loader" in ml_datasets.loaders

Download files

Download the file for your platform.

Source Distribution

ml_datasets-0.2.0.tar.gz (13.0 kB)

Uploaded Source

Built Distribution

ml_datasets-0.2.0-py3-none-any.whl (15.9 kB)

Uploaded Python 3

File details

Details for the file ml_datasets-0.2.0.tar.gz.

File metadata

  • Download URL: ml_datasets-0.2.0.tar.gz
  • Size: 13.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.7.9

File hashes

Hashes for ml_datasets-0.2.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 3f9c8901f8d6be3dab5b23ec3a6c01e619a60d0184696b1030cde2e3086943f1 |
| MD5 | da3d4bf661213c6f6edac48a6c599639 |
| BLAKE2b-256 | 3ca8149700bd6087fbffdbe85d32a7587f497cf45c432864d0000eef6bad1020 |

File details

Details for the file ml_datasets-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: ml_datasets-0.2.0-py3-none-any.whl
  • Size: 15.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.7.9

File hashes

Hashes for ml_datasets-0.2.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 5adf087a2a8ff67ddbfc297f3bd7dd69a88d5c7f8f95d21cc1e96fef5a10ad3a |
| MD5 | 57af26a2844b672b69ac7095090c55b4 |
| BLAKE2b-256 | 5104caa6c271b2dac193b9699745f67a7841eec38442329e0590e50b1938b831 |
