Machine Learning dataset loaders
Loaders for various machine learning datasets, for use in testing and example scripts. Previously in thinc.extra.datasets.
Setup and installation
The package can be installed via pip:
pip install ml-datasets
Loaders
Loaders can be imported directly or used via their string name (which is useful if they're set via command line arguments). Some loaders may take arguments – see the source for details.
# Import directly
from ml_datasets import imdb
train_data, dev_data = imdb()
# Load via registry
from ml_datasets import loaders
imdb_loader = loaders.get("imdb")
train_data, dev_data = imdb_loader()
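Because loaders can be retrieved by string name, wiring a loader to a command-line argument is straightforward. The following is a minimal sketch; the script layout and argument name are illustrative, and it assumes the chosen loader returns a (train, dev) pair of sequences, as imdb does.
# Minimal sketch: pick a dataset loader from a command-line argument.
import argparse
from ml_datasets import loaders

parser = argparse.ArgumentParser()
parser.add_argument("--dataset", default="imdb", help="string name of a registered loader")
args = parser.parse_args()

# Look the loader up by name and call it (assumes a (train, dev) return value)
load_data = loaders.get(args.dataset)
train_data, dev_data = load_data()
print(f"Loaded {len(train_data)} training and {len(dev_data)} dev instances")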
Available loaders
NLP datasets
ID / Function | Description | NLP task | From URL |
---|---|---|---|
imdb | IMDB sentiment dataset | Binary classification: sentiment analysis | ✓ |
dbpedia | DBPedia ontology dataset | Multi-class single-label classification | ✓ |
cmu | CMU movie genres dataset | Multi-class, multi-label classification | ✓ |
quora_questions | Duplicate Quora questions dataset | Detecting duplicate questions | ✓ |
reuters | Reuters dataset (texts not included) | Multi-class multi-label classification | ✓ |
snli | Stanford Natural Language Inference corpus | Recognizing textual entailment | ✓ |
stack_exchange | Stack Exchange dataset | Question answering | |
ud_ancora_pos_tags | Universal Dependencies Spanish AnCora corpus | POS tagging | ✓ |
ud_ewtb_pos_tags | Universal Dependencies English EWT corpus | POS tagging | ✓ |
wikiner | WikiNER data | Named entity recognition | |
Other ML datasets
ID / Function | Description | ML task | From URL |
---|---|---|---|
mnist | MNIST data | Image recognition | ✓ |
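Unlike the NLP loaders above, mnist returns arrays of images and labels rather than (text, annotation) pairs. A small sketch, assuming the loader returns ((train_X, train_Y), (dev_X, dev_Y)); check the loader source if the exact format differs.
# Sketch of loading MNIST. The return format below is an assumption --
# consult the loader source if it differs.
import ml_datasets

(train_X, train_Y), (dev_X, dev_Y) = ml_datasets.mnist()
print(f"Training images: {train_X.shape}, labels: {train_Y.shape}")
print(f"Dev images: {dev_X.shape}, labels: {dev_Y.shape}")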
Dataset details
IMDB
Each instance contains the text of a movie review, and a sentiment expressed as 0 or 1.
import ml_datasets

train_data, dev_data = ml_datasets.imdb()
for text, annot in train_data[0:5]:
    print(f"Review: {text}")
    print(f"Sentiment: {annot}")
- Download URL: http://ai.stanford.edu/~amaas/data/sentiment/
- Citation: Andrew L. Maas et al., 2011
Property | Training | Dev |
---|---|---|
# Instances | 25000 | 25000 |
Label values | {0, 1} | {0, 1} |
Labels per instance | Single | Single |
Label distribution | Balanced (50/50) | Balanced (50/50) |
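Since each annotation is a plain 0 or 1, the 50/50 split reported above is easy to check directly, for example:
# Count sentiment labels to confirm the balanced 50/50 split.
from collections import Counter
import ml_datasets

train_data, dev_data = ml_datasets.imdb()
print(Counter(annot for _, annot in train_data))  # expect roughly 12500 of each label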
DBPedia
Each instance contains an ontological description and its classification into one of 14 distinct categories.
import ml_datasets

train_data, dev_data = ml_datasets.dbpedia()
for text, annot in train_data[0:5]:
    print(f"Text: {text}")
    print(f"Category: {annot}")
- Download URL: Via fast.ai
- Original citation: Xiang Zhang et al., 2015
Property | Training | Dev |
---|---|---|
# Instances | 560000 | 70000 |
Label values | 1-14 | 1-14 |
Labels per instance | Single | Single |
Label distribution | Balanced | Balanced |
CMU
Each instance contains a movie description and the list of genres it is classified under.
import ml_datasets

train_data, dev_data = ml_datasets.cmu()
for text, annot in train_data[0:5]:
    print(f"Text: {text}")
    print(f"Genres: {annot}")
- Download URL: http://www.cs.cmu.edu/~ark/personas/
- Original citation: David Bamman et al., 2013
Property | Training | Dev |
---|---|---|
# Instances | 41793 | 0 |
Label values | 363 different genres | - |
Labels per instance | Multiple | - |
Label distribution | Imbalanced: 147 labels with fewer than 20 examples, while "Drama" occurs more than 19000 times | - |
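Because each annotation is a list of genres rather than a single label, counting genre occurrences is a quick way to see the multi-label structure and the imbalance noted above. A small sketch:
# Tally genre occurrences across the multi-label annotations.
from collections import Counter
import ml_datasets

train_data, _ = ml_datasets.cmu()
genre_counts = Counter(genre for _, genres in train_data for genre in genres)
print(f"Distinct genres: {len(genre_counts)}")  # 363 according to the table above
print(genre_counts.most_common(5))              # "Drama" should appear near the top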
Quora
import ml_datasets

train_data, dev_data = ml_datasets.quora_questions()
for questions, annot in train_data[0:50]:
    q1, q2 = questions
    print(f"Question 1: {q1}")
    print(f"Question 2: {q2}")
    print(f"Similarity: {annot}")
Each instance contains two Quora questions, and a label indicating whether or not they are duplicates (0: no, 1: yes).
The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect.
- Download URL: http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
- Original citation: Kornél Csernai et al., 2017
Property | Training | Dev |
---|---|---|
# Instances | 363859 | 40429 |
Label values | {0, 1} | {0, 1} |
Labels per instance | Single | Single |
Label distribution | Imbalanced: 63% label 0 | Imbalanced: 63% label 0 |
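The imbalance reported above can be checked the same way as for IMDB; a short sketch:
# Compute the share of non-duplicate (label 0) pairs in each split.
import ml_datasets

train_data, dev_data = ml_datasets.quora_questions()
for name, split in [("train", train_data), ("dev", dev_data)]:
    zeros = sum(1 for _, annot in split if annot == 0)
    print(f"{name}: {zeros / len(split):.0%} label 0")  # expect about 63%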
Registering loaders
Loaders can be registered externally using the loaders registry as a decorator. For example:
import ml_datasets

@ml_datasets.loaders("my_custom_loader")
def my_custom_loader():
    # load_some_data is a placeholder for your own data-loading logic
    return load_some_data()

assert "my_custom_loader" in ml_datasets.loaders
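Once registered, the custom loader can be retrieved by its string name just like the built-in ones:
# The new loader is now available through the registry.
my_loader = ml_datasets.loaders.get("my_custom_loader")
data = my_loader()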