ShallowLearn

A collection of supervised learning models based on shallow neural network approaches (e.g., word2vec and fastText) with some additional exclusive features. Written in Python and fully compatible with scikit-learn.

Discussion group for users and developers: https://groups.google.com/d/forum/shallowlearn

Getting Started

Install the latest version:

pip install cython
pip install shallowlearn

Import models from shallowlearn.models; they implement the standard scikit-learn methods for supervised learning, e.g., fit(X, y) and predict(X).

Data is raw text: each sample in the iterable X is a list of tokens (the words of a document), while each element in the iterable y (corresponding to an element in X) is either a single label or, for a multi-label training set, a list of labels. X and y must have the same length.
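As a minimal sketch of this input format, the snippet below builds X and y from raw strings. The helper name tokenize and the sample data are illustrative assumptions, not part of ShallowLearn; any tokenizer that yields lists of tokens should serve the same purpose.

```python
# Illustrative only: `tokenize` and the sample documents are assumptions,
# not part of the ShallowLearn API.
def tokenize(text):
    """Lowercase a raw string and split it on whitespace."""
    return text.lower().split()

raw_docs = ["I am tall", "You are fat", "We are tall and fat"]

# Each sample in X is a list of tokens.
X = [tokenize(doc) for doc in raw_docs]

# Single-label training set: one label per sample.
y_single = ["yes", "no", "yes"]

# Multi-label training set: a list of labels per sample.
y_multi = [["tall"], ["fat"], ["tall", "fat"]]

# y must have the same number of elements as X.
assert len(X) == len(y_single) == len(y_multi)
```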

Models

GensimFastText

A supervised learning model based on the fastText algorithm [1]. The code is mostly taken and rewritten from Gensim, so it takes advantage of Gensim's optimizations (e.g., Cython) and support.

It is possible to choose the Softmax loss function (default) or one of its two “approximations”: Hierarchical Softmax and Negative Sampling. It is also possible to load pre-trained word vectors at initialization by passing a Gensim Word2Vec or a ShallowLearn LabeledWord2Vec instance (the latter is retrievable from a GensimFastText model through the attribute classifier).

Constructor argument names are a mix of those of Gensim and those of fastText (see the class docstring).

>>> from shallowlearn.models import GensimFastText
>>> clf = GensimFastText(size=100, min_count=0, loss='hs', iter=3, seed=66)
>>> clf.fit([('i', 'am', 'tall'), ('you', 'are', 'fat')], ['yes', 'no'])
>>> clf.predict([('tall', 'am', 'i')])
['yes']
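
To make the default Softmax loss more concrete, here is a pure-Python sketch of the prediction step in a fastText-style classifier: the word vectors of a document are averaged and scored against one vector per label, and the scores are normalized with a full softmax. This is not the Gensim or ShallowLearn implementation; the function names and the toy 2-dimensional vectors are made up for illustration. Hierarchical Softmax and Negative Sampling replace the full normalization below with cheaper approximations during training.

```python
import math

def average(vectors):
    """Element-wise mean of equally sized vectors (the document representation)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings; all values are invented for this example.
word_vectors = {"i": [0.1, 0.3], "am": [0.0, 0.2], "tall": [0.9, -0.1]}
label_vectors = {"yes": [1.0, 0.0], "no": [-1.0, 0.0]}

doc = ["tall", "am", "i"]
hidden = average([word_vectors[w] for w in doc])

# One dot product per label, then a softmax over all labels.
labels = sorted(label_vectors)
scores = [sum(h * v for h, v in zip(hidden, label_vectors[lab])) for lab in labels]
probs = dict(zip(labels, softmax(scores)))
predicted = max(probs, key=probs.get)
```

Because the document vector is an average, word order does not affect the prediction, which is why the example above classifies ('tall', 'am', 'i') the same way as ('i', 'am', 'tall').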

FastText

The supervised algorithm of fastText implemented in fastText.py, which exposes an interface to the original C++ code. The current advantages of this class over GensimFastText are the subwords and the n-gram features implemented via the hashing trick. The constructor arguments are equivalent to those of the original supervised model, except for input_file, output and label_prefix.

WARNING: The only way of loading datasets in fastText.py is through the filesystem (as of version 0.8.0), so data passed to fit(X, y) will be written to temporary files on disk.
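
The following sketch illustrates what writing such a temporary file could look like, using the plain-text format expected by fastText's supervised mode (labels prefixed with __label__, followed by the tokens, one sample per line). The helper write_fasttext_file is hypothetical and is not ShallowLearn's actual code.

```python
import os
import tempfile

LABEL_PREFIX = "__label__"  # fastText's default label prefix

def write_fasttext_file(X, y):
    """Hypothetical helper: write samples to a temp file and return its path."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as f:
        for tokens, labels in zip(X, y):
            # Accept a single label or a list of labels.
            if isinstance(labels, str):
                labels = [labels]
            prefixed = " ".join(LABEL_PREFIX + label for label in labels)
            f.write(prefixed + " " + " ".join(tokens) + "\n")
        return f.name

path = write_fasttext_file(
    [("i", "am", "tall"), ("you", "are", "fat")], ["yes", "no"]
)
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
os.remove(path)  # clean up the temporary file
```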

>>> from shallowlearn.models import FastText
>>> clf = FastText(dim=100, min_count=0, loss='hs', epoch=3, bucket=5, word_ngrams=2)
>>> clf.fit([('i', 'am', 'tall'), ('you', 'are', 'fat')], ['yes', 'no'])
>>> clf.predict([('tall', 'am', 'i')])
['yes']
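
The bucket and word_ngrams arguments above relate to the hashing trick: instead of storing a vocabulary of n-grams, each n-gram is hashed into one of a fixed number of buckets. The sketch below shows the idea only. The function hashed_ngram_ids is an assumption for illustration, and zlib.crc32 is a stand-in, since fastText uses its own hashing scheme.

```python
import zlib

def hashed_ngram_ids(tokens, n=2, buckets=5):
    """Map each word n-gram to a bucket index in the range [0, buckets)."""
    ids = []
    for i in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[i:i + n])
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        ids.append(zlib.crc32(ngram.encode("utf-8")) % buckets)
    return ids

# Two bigrams ("i am", "am tall") are mapped to two bucket indices.
ids = hashed_ngram_ids(["i", "am", "tall"], n=2, buckets=5)
```

With few buckets, distinct n-grams can collide and share a vector; a larger bucket count trades memory for fewer collisions.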

DeepInverseRegression

TODO: Based on https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.score

DeepAveragingNetworks

TODO: Based on https://github.com/miyyer/dan

Exclusive Features

Persistence

Any model can be serialized and de-serialized with the two methods save and load. They overload the SaveLoad interface of Gensim, so it is possible to control how much disk space a saved model uses, instead of simply pickling the objects.

>>> from shallowlearn.models import GensimFastText
>>> clf = GensimFastText(size=100, min_count=0, loss='hs', iter=3, seed=66)
>>> clf.save('./model')
>>> loaded = GensimFastText.load('./model')

Benchmarks

The script scripts/document_classification_20newsgroups.py is based on a scikit-learn example in which text classifiers are compared on a reference dataset; we added our models to the comparison. The current results, although still preliminary, are comparable with those of the other approaches, and our models achieve the best performance in speed.

Results as of release 0.0.4, with the chi2_select option set to 80%. The times take into account the tf-idf vectorization in the “classic” classifiers, and the I/O operations for the training of fastText.py. The evaluation measure is macro F1.
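
For readers unfamiliar with the measure, macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. The sketch below computes it in plain Python; the function macro_f1 and the toy labels are illustrative and unrelated to the benchmark data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is positive because
        # every label occurs at least once in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)

# Toy labels, made up for this example.
y_true = ["yes", "yes", "no", "no"]
y_pred = ["yes", "no", "no", "no"]
score = macro_f1(y_true, y_pred)
```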

References

Release history

0.0.5: Dec 30, 2016
0.0.4: Nov 6, 2016 (this version)
0.0.3: Oct 27, 2016
0.0.2: Oct 14, 2016
