Tools for assessing the difficulty of datasets for machine learning models

Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks

Authors: Ed Collins, Nikolai Rozanov, Bingbing Zhang

Contact: contact@wluper.com

In the paper of the same name, we describe how we used an evolutionary algorithm to discover which statistics of a text classification dataset most accurately represent how difficult that dataset is likely to be for machine learning models to learn. This Python package calculates the difficulty measure presented in that paper.

Installation

This package can be installed with pip:

pip3 install edm

The code requires Python 3 and NumPy.

It is recommended that you install this code in a virtualenv:

$ mkdir myvirtualenv/
$ virtualenv -p python3 myvirtualenv/
$ source myvirtualenv/bin/activate
(myvirtualenv) $ pip3 install edm

Running

To calculate the difficulty of a text classification dataset, you will need to provide two lists: one of sentences and one of labels. The two lists must be the same length, i.e. every sentence has exactly one label. Each sentence should be an untokenized string and each label a string.

>>> sents, labels = your_own_loading_function(PATH_TO_DATA_FILE)
>>> sents
["this is a positive sentence", "this is a negative sentence", ...]
>>> labels
["positive", "negative", ...]
>>> len(sents) == len(labels)
True
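
The format requirements above can be checked before the report is generated. The following is a minimal sketch only: `check_dataset` is a hypothetical helper written for this illustration and is not part of the edm package.

```python
def check_dataset(sents, labels):
    """Raise ValueError if the inputs do not match the expected format.

    Hypothetical helper, not part of edm: it checks that both lists have
    the same length and that every sentence and label is a string.
    """
    if len(sents) != len(labels):
        raise ValueError(
            f"{len(sents)} sentences but {len(labels)} labels; "
            "every sentence needs a label"
        )
    for index, (sent, label) in enumerate(zip(sents, labels)):
        if not isinstance(sent, str):
            # e.g. a tokenized sentence passed as a list of words
            raise ValueError(f"sentence {index} is not a string: {sent!r}")
        if not isinstance(label, str):
            # e.g. an integer class id that should be converted with str()
            raise ValueError(f"label {index} is not a string: {label!r}")
```

Calling `check_dataset(sents, labels)` returns nothing when the data is well formed and raises an error that names the first offending item otherwise.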

This package does not load data files (e.g. CSV files) into memory; you will need to do this separately.
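
As one possible way to do the loading yourself, the sketch below reads a CSV file with Python's standard `csv` module. The function name `load_csv_dataset` and the column names `text` and `label` are assumptions made for this example; adjust them to match your own file.

```python
import csv


def load_csv_dataset(path, text_column="text", label_column="label"):
    """Read a CSV file with a header row into two parallel lists of strings.

    Hypothetical loader, not part of edm. The default column names are
    assumptions about the layout of your file.
    """
    sents, labels = [], []
    # newline="" lets the csv module handle line breaks inside quoted fields
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            sents.append(row[text_column])
            labels.append(row[label_column])
    return sents, labels
```

The two returned lists are in the same order as the rows of the file, so `sents[i]` is always paired with `labels[i]`.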

Once you have loaded your dataset into memory, you can receive a "difficulty report" by running the code as follows:

from edm import report

sents, labels = your_own_loading_function(PATH_TO_DATA_FILE)

print(report.get_difficulty_report(sents, labels))

Note that if your dataset is very large, counting its words may take several minutes. For example, the Amazon Reviews dataset from "Character-level Convolutional Networks for Text Classification" (Xiang Zhang, Junbo Zhao and Yann LeCun, 2015), which contains 3.6 million Amazon reviews, takes approximately 15 minutes to process before the difficulty report is created. A loading bar is displayed while the words are counted.
