Machine Learning dataset loaders

Machine learning dataset loaders for testing and examples

Loaders for various machine learning datasets, intended for tests and example scripts. These loaders were previously available in thinc.extra.datasets.

Setup and installation

The package can be installed via pip:

pip install ml-datasets

Loaders

Loaders can be imported directly or used via their string name (which is useful if they're set via command line arguments). Some loaders may take arguments – see the source for details.

# Import directly
from ml_datasets import imdb
train_data, dev_data = imdb()

# Load via registry
from ml_datasets import loaders
imdb_loader = loaders.get("imdb")
train_data, dev_data = imdb_loader()
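
Because a loader can be looked up by its string name, a script can take the dataset as a command line argument. The following is a minimal sketch of that pattern rather than anything shipped with the package: `parse_args`, `load_by_name` and the `--dataset` flag are hypothetical names chosen for illustration. The lookup function is passed in as an argument so that the sketch runs offline with a stand-in registry.

```python
import argparse


def parse_args(argv=None):
    """Parse a hypothetical --dataset flag naming the loader to use."""
    parser = argparse.ArgumentParser(description="Load a dataset by name.")
    parser.add_argument("--dataset", default="imdb", help="String name of the loader")
    return parser.parse_args(argv)


def load_by_name(name, get_loader):
    """Look up a loader with `get_loader(name)` and call it without arguments."""
    loader = get_loader(name)
    return loader()


if __name__ == "__main__":
    # Stand-in registry so the example runs without downloading anything.
    fake_registry = {"toy": lambda: ([("good film", 1)], [("bad film", 0)])}
    args = parse_args(["--dataset", "toy"])
    train_data, dev_data = load_by_name(args.dataset, fake_registry.__getitem__)
    print(len(train_data), len(dev_data))
```

With the package installed, passing `loaders.get` (imported from `ml_datasets` as shown above) as `get_loader` should resolve the real loaders instead of the stand-in.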

Available loaders

NLP datasets

ID / Function	Description	NLP task	From URL
`imdb`	IMDB sentiment dataset	Binary classification: sentiment analysis	✓
`dbpedia`	DBPedia ontology dataset	Multi-class single-label classification	✓
`cmu`	CMU movie genres dataset	Multi-class, multi-label classification	✓
`quora_questions`	Duplicate Quora questions dataset	Detecting duplicate questions	✓
`reuters`	Reuters dataset (texts not included)	Multi-class multi-label classification	✓
`snli`	Stanford Natural Language Inference corpus	Recognizing textual entailment	✓
`stack_exchange`	Stack Exchange dataset	Question Answering
`ud_ancora_pos_tags`	Universal Dependencies Spanish AnCora corpus	POS tagging	✓
`ud_ewtb_pos_tags`	Universal Dependencies English EWT corpus	POS tagging	✓
`wikiner`	WikiNER data	Named entity recognition

Other ML datasets

ID / Function	Description	ML task	From URL
`mnist`	MNIST data	Image recognition	✓

Dataset details

IMDB

Each instance contains the text of a movie review, and a sentiment expressed as 0 or 1.

import ml_datasets

# Print the first five reviews with their sentiment labels
train_data, dev_data = ml_datasets.imdb()
for text, annot in train_data[0:5]:
    print(f"Review: {text}")
    print(f"Sentiment: {annot}")

Download URL: http://ai.stanford.edu/~amaas/data/sentiment/
Citation: Andrew L. Maas et al., 2011

Property	Training	Dev
# Instances	25000	25000
Label values	{`0`, `1`}	{`0`, `1`}
Labels per instance	Single	Single
Label distribution	Balanced (50/50)	Balanced (50/50)
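
The label distribution in the table can be checked on any split by counting labels. This is a small sketch, assuming each instance is a `(text, label)` pair as in the loop above; `label_distribution` is a hypothetical helper and the sample data is a toy stand-in, not real IMDB reviews.

```python
from collections import Counter


def label_distribution(data):
    """Return {label: fraction} for an iterable of (text, label) pairs."""
    counts = Counter(label for _, label in data)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Toy stand-in for the loader output, so the example runs without a download.
sample = [("Loved it", 1), ("Hated it", 0), ("Great cast", 1), ("Dull plot", 0)]
print(label_distribution(sample))
```

The same helper should work for the other single-label datasets described below.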

DBPedia

Each instance contains an ontological description and a classification into one of 14 distinct labels.

import ml_datasets

# Print the first five descriptions with their category labels
train_data, dev_data = ml_datasets.dbpedia()
for text, annot in train_data[0:5]:
    print(f"Text: {text}")
    print(f"Category: {annot}")

Download URL: Via fast.ai
Original citation: Xiang Zhang et al., 2015

Property	Training	Dev
# Instances	560000	70000
Label values	`1`-`14`	`1`-`14`
Labels per instance	Single	Single
Label distribution	Balanced	Balanced
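
Many training setups expect class indices that start at 0, whereas the labels here run from `1` to `14`. The sketch below shows one way to shift them. It assumes each label is either an integer or a numeric string, which this page does not specify; `to_zero_based` is a hypothetical helper and the sample texts are invented.

```python
NUM_CLASSES = 14  # from the table above: labels run from 1 to 14


def to_zero_based(label, num_classes=NUM_CLASSES):
    """Map a label in 1..num_classes (int or numeric string) to 0..num_classes-1."""
    index = int(label) - 1
    if not 0 <= index < num_classes:
        raise ValueError(f"Label {label!r} is outside 1..{num_classes}")
    return index


# Toy stand-in for the loader output; both label types are shown on purpose.
sample = [("A company based in Berlin.", "1"), ("A river in Peru.", 14)]
converted = [(text, to_zero_based(label)) for text, label in sample]
print(converted)
```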

CMU

Each instance contains a movie description, and a classification into a list of appropriate genres.

import ml_datasets

# Print the first five movie descriptions with their genre lists
train_data, dev_data = ml_datasets.cmu()
for text, annot in train_data[0:5]:
    print(f"Text: {text}")
    print(f"Genres: {annot}")

Download URL: http://www.cs.cmu.edu/~ark/personas/
Original citation: David Bamman et al., 2013

Property	Training	Dev
# Instances	41793	0
Label values	363 different genres	-
Labels per instance	Multiple	-
Label distribution	Imbalanced: 147 labels with fewer than 20 examples, while `Drama` occurs more than 19000 times	-
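
With this much imbalance, it can help to find the rare genres before training. The following sketch counts genre occurrences and lists those below a threshold. It assumes each annotation is a list of genre strings, as the description above suggests; `genre_counts` and `rare_genres` are hypothetical helpers and the sample data is invented.

```python
from collections import Counter


def genre_counts(data):
    """Count how often each genre occurs across (text, genres) pairs."""
    counts = Counter()
    for _, genres in data:
        counts.update(genres)
    return counts


def rare_genres(counts, min_examples=20):
    """Return the genres that occur fewer than `min_examples` times, sorted."""
    return sorted(genre for genre, n in counts.items() if n < min_examples)


# Toy stand-in for the loader output, so the example runs without a download.
sample = [("A detective story.", ["Drama", "Crime"]), ("A family saga.", ["Drama"])]
counts = genre_counts(sample)
print(rare_genres(counts, min_examples=2))
```

The default threshold of 20 mirrors the figure quoted in the table; whether to drop or merge the rare genres is a modelling decision left to the user.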

Quora

Each instance contains two Quora questions and a label indicating whether or not they are duplicates (0: no, 1: yes). The ground-truth labels are noisy and are not guaranteed to be correct.

import ml_datasets

# Print the first 50 question pairs with their duplicate labels
train_data, dev_data = ml_datasets.quora_questions()
for questions, annot in train_data[0:50]:
    q1, q2 = questions
    print(f"Question 1: {q1}")
    print(f"Question 2: {q2}")
    print(f"Similarity: {annot}")

Download URL: http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
Original citation: Kornél Csernai et al., 2017

Property	Training	Dev
# Instances	363859	40429
Label values	{`0`, `1`}	{`0`, `1`}
Labels per instance	Single	Single
Label distribution	Imbalanced: 63% label `0`	Imbalanced: 63% label `0`
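
Since about 63% of the instances carry label `0`, always predicting `0` would already score roughly 63% accuracy on the dev set, which is a useful floor when judging a model. The sketch below computes that majority-class baseline. It assumes each instance is a `((q1, q2), label)` pair as in the loop above; `majority_baseline_accuracy` is a hypothetical helper and the sample data is invented.

```python
from collections import Counter


def majority_baseline_accuracy(train_data, dev_data):
    """Accuracy on dev_data when always predicting train_data's most common label."""
    label_counts = Counter(label for _, label in train_data)
    majority_label, _ = label_counts.most_common(1)[0]
    hits = sum(1 for _, label in dev_data if label == majority_label)
    return hits / len(dev_data)


# Toy stand-in for the loader output, so the example runs without a download.
toy_train = [(("a?", "b?"), 0), (("c?", "d?"), 0), (("e?", "e??"), 1)]
toy_dev = [(("f?", "g?"), 0), (("h?", "h??"), 1)]
print(majority_baseline_accuracy(toy_train, toy_dev))
```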

Registering loaders

Loaders can be registered externally using the loaders registry as a decorator. For example:

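import ml_datasets

# Register the function under the string name "my_custom_loader"
@ml_datasets.loaders("my_custom_loader")
def my_custom_loader():
    # load_some_data() is a placeholder for your own loading code
    return load_some_data()

assert "my_custom_loader" in ml_datasets.loaders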
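
As an illustration of what the body of a custom loader might look like, the sketch below returns a `(train_data, dev_data)` pair in the same shape as the built-in loaders. Everything in it is an assumption made for the example: `split_train_dev` and `my_toy_loader` are hypothetical names, the data is generated in memory, and the 80/20 split is an arbitrary choice.

```python
import random


def split_train_dev(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of `examples` deterministically and split it into (train, dev)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


def my_toy_loader():
    """Hypothetical loader returning (train_data, dev_data) of (text, label) pairs."""
    examples = [(f"example text {i}", i % 2) for i in range(10)]
    return split_train_dev(examples)


train_data, dev_data = my_toy_loader()
print(len(train_data), len(dev_data))
```

To make such a function available by string name, it can be decorated with `@ml_datasets.loaders("my_toy_loader")` in the same way as the example above.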

Project details

Release history

Version	Release date
0.2.0 (this version)	Jan 31, 2021
0.2.0a0 (pre-release)	Sep 17, 2020
0.1.6	Jan 23, 2020
0.1.5	Jan 21, 2020
0.1.4	Jan 15, 2020
0.1.3	Jan 9, 2020
0.1.2	Jan 8, 2020
0.1.1	Jan 7, 2020
0.1.0	Jan 7, 2020
0.0.3	Dec 28, 2019
0.0.2	Dec 28, 2019
0.0.1	Dec 28, 2019
