scikit-learn wrappers for Python fastText
Project description
scikit-learn wrappers for Python fastText.
>>> from skift import FirstColFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]
1 Installation
Dependencies:
numpy
scipy
scikit-learn
The fasttext Python package
pip install skift
2 Configuration
Because fasttext reads input data from files, skift has to dump the input data into temporary files for fasttext to use. A dedicated folder is created for those files on the filesystem. By default, this storage is allocated in the system temporary storage location (i.e. /tmp on *nix systems). To override this default location, use the SKIFT_TEMP_DIR environment variable:
export SKIFT_TEMP_DIR=/path/to/desired/temp/folder
NOTE: The directory will be created if it does not already exist.
3 Features
Adheres to the scikit-learn classifier API, including predict_proba.
Also caters to the common use case of pandas.DataFrame inputs.
Enables easy stacking of fastText with other types of scikit-learn-compliant classifiers.
Pickle-able classifier objects.
Built around the official fasttext Python package.
Pure python.
Supports Python 3.5+.
Fully tested.
4 Wrappers
fastText works only on text data, which means that it will only use a single column from a dataset which might contain many feature columns of different types. As such, a common use case is to have the fastText classifier use a single column as input, ignoring other columns. This is especially true when fastText is to be used as one of several classifiers in a stacking classifier, with other classifiers using non-textual features.
skift includes several scikit-learn-compatible wrappers (for the official fastText Python package) which cater to these use cases.
NOTICE: Any additional keyword arguments provided to the classifier constructor, besides those required, will be forwarded to the fastText.train_supervised method on every call to fit.
4.1 Standard wrappers
These wrappers do not make additional assumptions on input besides those commonly made by scikit-learn classifies; i.e. that input is a 2d ndarray object and such.
FirstColFtClassifier - An sklearn classifier adapter for fasttext that takes the first column of input ndarray objects as input.
>>> from skift import FirstColFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]
IdxBasedFtClassifier - An sklearn classifier adapter for fasttext that takes input by column index. This is set on object construction by providing the input_ix parameter to the constructor.
>>> from skift import IdxBasedFtClassifier
>>> df = pandas.DataFrame([[5, 'woof', 0], [83, 'meow', 1]], columns=['count', 'txt', 'lbl'])
>>> sk_clf = IdxBasedFtClassifier(input_ix=1, lr=0.4, epoch=6)
>>> sk_clf.fit(df[['count', 'txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]
4.2 pandas-dependent wrappers
These wrappers assume the X parameter given to fit, predict, and predict_proba methods is a pandas.DataFrame object:
FirstObjFtClassifier - An sklearn adapter for fasttext using the first column of dtype == object as input.
>>> from skift import FirstObjFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstObjFtClassifier(lr=0.2)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]
ColLblBasedFtClassifier - An sklearn adapter for fasttext taking input by column label. This is set on object construction by providing the input_col_lbl parameter to the constructor.
>>> from skift import ColLblBasedFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = ColLblBasedFtClassifier(input_col_lbl='txt', epoch=8)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]
5 Contributing
Package author and current maintainer is Shay Palachy (shay.palachy@gmail.com); You are more than welcome to approach him for help. Contributions are very welcomed.
5.1 Installing for development
Clone:
git clone git@github.com:shaypal5/skift.git
Install in development mode, including test dependencies:
cd skift
pip install -e '.[test]'
To also install fasttext, see instructions in the Installation section.
5.2 Running the tests
To run the tests use:
cd skift
pytest
5.3 Adding documentation
The project is documented using the numpy docstring conventions, which were chosen as they are perhaps the most widely-spread conventions that are both supported by common tools such as Sphinx and result in human-readable docstrings. When documenting code you add to this project, follow these conventions.
Additionally, if you update this README.rst file, use python setup.py checkdocs to validate it compiles.
6 Credits
Created by Shay Palachy (shay.palachy@gmail.com).
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for skift-0.0.16-py2.py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | c0c93ee45c250644b79fc4978054f39300dcef2960397ff0cd5fcd81cf1ce606 |
|
MD5 | 6cacb9c84e5dcd4f6a1a9566edad97e0 |
|
BLAKE2b-256 | bd3e24dfaae2f52bc415b886dcc68e598279e5b85f8fe9c82fa2540b7dbfb915 |