Machine Learning dataset loaders

Loaders for various machine learning datasets, for use in testing and example scripts. Previously in `thinc.extra.datasets`.
Setup and installation
The package can be installed via pip:

```bash
pip install ml-datasets
```
Loaders
Loaders can be imported directly or used via their string name (which is useful if they're set via command-line arguments). Some loaders may take arguments – see the source for details.

```python
# Import directly
from ml_datasets import imdb
train_data, dev_data = imdb()

# Load via registry
from ml_datasets import loaders
imdb_loader = loaders.get("imdb")
train_data, dev_data = imdb_loader()
```
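Because loaders can be looked up by string name, the dataset can be selected at runtime, e.g. from a command-line flag. A minimal sketch of that pattern – the `argparse` wiring and the stand-in `LOADERS` dict are illustrative, not part of the package (in the real package this role is played by the `loaders` registry):

```python
import argparse

# Stand-in registry mapping names to loader functions. In the real package,
# ml_datasets.loaders plays this role.
LOADERS = {
    "imdb": lambda: (["train example"], ["dev example"]),
    "dbpedia": lambda: (["train example"], ["dev example"]),
}

def load_from_args(argv):
    """Pick a dataset loader based on a --dataset command-line flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", choices=sorted(LOADERS), default="imdb")
    args = parser.parse_args(argv)
    loader = LOADERS[args.dataset]
    return loader()

train_data, dev_data = load_from_args(["--dataset", "imdb"])
```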
Available loaders
NLP datasets
| ID / Function | Description | NLP task | From URL |
| --- | --- | --- | --- |
| `imdb` | IMDB sentiment dataset | Binary classification: sentiment analysis | ✓ |
| `dbpedia` | DBPedia ontology dataset | Multi-class single-label classification | ✓ |
| `cmu` | CMU movie genres dataset | Multi-class, multi-label classification | ✓ |
| `quora_questions` | Duplicate Quora questions dataset | Detecting duplicate questions | ✓ |
| `reuters` | Reuters dataset (texts not included) | Multi-class multi-label classification | ✓ |
| `snli` | Stanford Natural Language Inference corpus | Recognizing textual entailment | ✓ |
| `stack_exchange` | Stack Exchange dataset | Question answering | |
| `ud_ancora_pos_tags` | Universal Dependencies Spanish AnCora corpus | POS tagging | ✓ |
| `ud_ewtb_pos_tags` | Universal Dependencies English EWT corpus | POS tagging | ✓ |
| `wikiner` | WikiNER data | Named entity recognition | |
Other ML datasets
| ID / Function | Description | ML task | From URL |
| --- | --- | --- | --- |
| `mnist` | MNIST data | Image recognition | ✓ |
Dataset details
IMDB
Each instance contains the text of a movie review, and a sentiment expressed as `0` or `1`.

```python
import ml_datasets

train_data, dev_data = ml_datasets.imdb()
for text, annot in train_data[0:5]:
    print(f"Review: {text}")
    print(f"Sentiment: {annot}")
```
- Download URL: http://ai.stanford.edu/~amaas/data/sentiment/
- Citation: Andrew L. Maas et al., 2011
| Property | Training | Dev |
| --- | --- | --- |
| # Instances | 25000 | 25000 |
| Label values | {0, 1} | {0, 1} |
| Labels per instance | Single | Single |
| Label distribution | Balanced (50/50) | Balanced (50/50) |
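A balance check like the one in the table can be reproduced by counting labels with `collections.Counter`. This sketch runs on a small stand-in list of (text, label) pairs rather than the downloaded data:

```python
from collections import Counter

# Stand-in for the (text, label) pairs returned by ml_datasets.imdb()
train_data = [("great film", 1), ("terrible plot", 0),
              ("loved it", 1), ("fell asleep", 0)]

counts = Counter(label for _, label in train_data)
total = sum(counts.values())
distribution = {label: n / total for label, n in counts.items()}
print(distribution)  # balanced: 50% of each label
```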
DBPedia
Each instance contains an ontological description, and a classification into one of 14 distinct labels.

```python
import ml_datasets

train_data, dev_data = ml_datasets.dbpedia()
for text, annot in train_data[0:5]:
    print(f"Text: {text}")
    print(f"Category: {annot}")
```
- Download URL: Via fast.ai
- Original citation: Xiang Zhang et al., 2015
| Property | Training | Dev |
| --- | --- | --- |
| # Instances | 560000 | 70000 |
| Label values | 1-14 | 1-14 |
| Labels per instance | Single | Single |
| Label distribution | Balanced | Balanced |
CMU
Each instance contains a movie description, and a classification into a list of appropriate genres.
```python
import ml_datasets

train_data, dev_data = ml_datasets.cmu()
for text, annot in train_data[0:5]:
    print(f"Text: {text}")
    print(f"Genres: {annot}")
```
- Download URL: http://www.cs.cmu.edu/~ark/personas/
- Original citation: David Bamman et al., 2013
| Property | Training | Dev |
| --- | --- | --- |
| # Instances | 41793 | 0 |
| Label values | 363 different genres | - |
| Labels per instance | Multiple | - |
| Label distribution | Imbalanced: 147 labels have fewer than 20 examples, while Drama occurs more than 19000 times | - |
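Multi-label annotations like these genre lists are often converted to fixed-length binary vectors before training. A minimal pure-Python sketch of that encoding – the instances and genre names here are invented for illustration:

```python
# Each instance maps a text to a list of genres (multi-label).
instances = [
    ("movie about space", ["Drama", "Sci-Fi"]),
    ("courtroom thriller", ["Drama", "Thriller"]),
]

# Build a stable label vocabulary, then multi-hot encode each instance.
labels = sorted({g for _, genres in instances for g in genres})

def multi_hot(genres):
    """Return a 0/1 vector with one slot per known label."""
    return [1 if label in genres else 0 for label in labels]

vectors = [multi_hot(genres) for _, genres in instances]
print(labels)   # ['Drama', 'Sci-Fi', 'Thriller']
print(vectors)  # [[1, 1, 0], [1, 0, 1]]
```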
Quora
```python
import ml_datasets

train_data, dev_data = ml_datasets.quora_questions()
for questions, annot in train_data[0:50]:
    q1, q2 = questions
    print(f"Question 1: {q1}")
    print(f"Question 2: {q2}")
    print(f"Similarity: {annot}")
```
Each instance contains two Quora questions, and a label indicating whether or not they are duplicates (`0`: no, `1`: yes). The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect.
- Download URL: http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
- Original citation: Kornél Csernai et al., 2017
| Property | Training | Dev |
| --- | --- | --- |
| # Instances | 363859 | 40429 |
| Label values | {0, 1} | {0, 1} |
| Labels per instance | Single | Single |
| Label distribution | Imbalanced: 63% label 0 | Imbalanced: 63% label 0 |
Registering loaders
Loaders can be registered externally using the `loaders` registry as a decorator. For example:

```python
@ml_datasets.loaders("my_custom_loader")
def my_custom_loader():
    return load_some_data()

assert "my_custom_loader" in ml_datasets.loaders
```
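The decorator mechanics above can be mimicked with a small stand-in class for illustration. This `Registry` is not the package's implementation, just a sketch of the pattern: calling the registry with a name returns a decorator that stores the function under that name.

```python
class Registry:
    """Minimal stand-in for a string-keyed function registry."""

    def __init__(self):
        self._funcs = {}

    def __call__(self, name):
        # registry("name") returns a decorator that registers the function.
        def decorator(func):
            self._funcs[name] = func
            return func
        return decorator

    def __contains__(self, name):
        return name in self._funcs

    def get(self, name):
        return self._funcs[name]

loaders = Registry()

@loaders("my_custom_loader")
def my_custom_loader():
    return ["some", "data"]

assert "my_custom_loader" in loaders
print(loaders.get("my_custom_loader")())  # ['some', 'data']
```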