skift

scikit-learn wrappers for Python fastText

>>> import pandas
>>> from skift import FirstColFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]

1 Installation

Dependencies:

numpy
scipy
scikit-learn
The fasttext Python package

pip install skift

2 Configuration

Because fasttext reads input data from files, skift has to dump the input data into temporary files for fasttext to use. A dedicated folder is created for those files on the filesystem. By default, this folder is created in the system temporary storage location (e.g. /tmp on *nix systems). To override this default location, set the SKIFT_TEMP_DIR environment variable:

export SKIFT_TEMP_DIR=/path/to/desired/temp/folder

NOTE: The directory will be created if it does not already exist.
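
The behaviour described above can be sketched in a few lines of Python. This is an illustration only, not skift's actual implementation: the helper names resolve_temp_dir and dump_for_fasttext are made up for this example. The sketch relies only on the SKIFT_TEMP_DIR variable described in this section and on fastText's supervised input format, in which each line starts with a `__label__<label>` prefix.

```python
import os
import tempfile


def resolve_temp_dir():
    """Return the folder for temporary files, creating it if needed.

    Hypothetical helper: mirrors the documented behaviour, not skift's code.
    """
    # Fall back to the system temporary location when the variable is unset.
    base = os.environ.get("SKIFT_TEMP_DIR", tempfile.gettempdir())
    # The directory is created if it does not already exist.
    os.makedirs(base, exist_ok=True)
    return base


def dump_for_fasttext(texts, labels, directory):
    """Write one '__label__<label> <text>' line per sample; return the path."""
    fd, path = tempfile.mkstemp(suffix=".txt", dir=directory, text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for text, label in zip(texts, labels):
            handle.write("__label__{} {}\n".format(label, text))
    return path
```

Calling dump_for_fasttext(["woof", "meow"], [0, 1], resolve_temp_dir()) would leave a two-line training file in the resolved folder, which a fastText training call could then read.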

3 Features

Adheres to the scikit-learn classifier API, including predict_proba.
Also caters to the common use case of pandas.DataFrame inputs.
Enables easy stacking of fastText with other types of scikit-learn-compliant classifiers.
Pickle-able classifier objects.
Built around the official fasttext Python package.
Pure Python.
Supports Python 3.5+.
Fully tested on Linux, OSX and Windows operating systems.

4 Wrappers

fastText works only on text data, which means that it will only use a single column from a dataset which might contain many feature columns of different types. As such, a common use case is to have the fastText classifier use a single column as input, ignoring other columns. This is especially true when fastText is to be used as one of several classifiers in a stacking classifier, with other classifiers using non-textual features.

skift includes several scikit-learn-compatible wrappers (for the official fastText Python package) which cater to these use cases.

NOTICE: Any additional keyword arguments provided to the classifier constructor, besides those required, will be forwarded to the fasttext.train_supervised method on every call to fit.
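
To make the forwarding rule concrete, here is a minimal sketch of the pattern. It assumes nothing about skift's internals: KwargsForwardingClassifier and fake_train_supervised are hypothetical names, and the training function is a stand-in so that the example runs without fasttext installed.

```python
class KwargsForwardingClassifier:
    """Toy stand-in showing how constructor kwargs can reach the trainer."""

    def __init__(self, train_func, **kwargs):
        # train_func is a hypothetical injection point, used only so this
        # sketch runs without the real fasttext package.
        self.train_func = train_func
        self.kwargs = kwargs
        self.model = None

    def fit(self, input_path):
        # The stored kwargs are passed along unchanged on every call to fit.
        self.model = self.train_func(input=input_path, **self.kwargs)
        return self


def fake_train_supervised(input, **kwargs):
    """Stand-in for the real training function; records what it received."""
    return {"input": input, "params": dict(kwargs)}


clf = KwargsForwardingClassifier(fake_train_supervised, lr=0.3, epoch=10)
clf.fit("train.txt")
print(clf.model["params"])  # {'lr': 0.3, 'epoch': 10}
```

With the real wrappers, the same idea means that options such as lr or epoch given at construction time take effect each time fit is called.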

4.1 Standard wrappers

These wrappers make no assumptions about their input beyond those commonly made by scikit-learn classifiers, i.e. that the input is a 2D ndarray object and the like.

FirstColFtClassifier - An sklearn classifier adapter for fasttext that takes the first column of input ndarray objects as input.

>>> import pandas
>>> from skift import FirstColFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]

IdxBasedFtClassifier - An sklearn classifier adapter for fasttext that takes input by column index. This is set on object construction by providing the input_ix parameter to the constructor.

>>> import pandas
>>> from skift import IdxBasedFtClassifier
>>> df = pandas.DataFrame([[5, 'woof', 0], [83, 'meow', 1]], columns=['count', 'txt', 'lbl'])
>>> sk_clf = IdxBasedFtClassifier(input_ix=1, lr=0.4, epoch=6)
>>> sk_clf.fit(df[['count', 'txt']], df['lbl'])
>>> sk_clf.predict([[7, 'woof']])
[0]
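
As a rough illustration of index-based selection (not the wrapper's actual code; select_text_column is a name made up for this sketch), picking the text column out of row-oriented input could look like this:

```python
def select_text_column(rows, input_ix):
    """Return the value found at position input_ix in every row."""
    return [row[input_ix] for row in rows]


# Rows laid out as [count, txt], so the text sits at index 1.
train_rows = [[5, "woof"], [83, "meow"]]
print(select_text_column(train_rows, 1))  # ['woof', 'meow']

# A row with the same layout yields its text value in the same way.
print(select_text_column([[7, "woof"]], 1))  # ['woof']
```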

4.2 pandas-dependent wrappers

These wrappers assume the X parameter given to fit, predict, and predict_proba methods is a pandas.DataFrame object:

FirstObjFtClassifier - An sklearn adapter for fasttext using the first column of dtype == object as input.

>>> import pandas
>>> from skift import FirstObjFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstObjFtClassifier(lr=0.2)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))
[0]

ColLblBasedFtClassifier - An sklearn adapter for fasttext taking input by column label. This is set on object construction by providing the input_col_lbl parameter to the constructor.

>>> import pandas
>>> from skift import ColLblBasedFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = ColLblBasedFtClassifier(input_col_lbl='txt', epoch=8)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))
[0]

SeriesFtClassifier - An sklearn adapter for fasttext taking a Pandas Series as input.

>>> import pandas
>>> from skift import SeriesFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = SeriesFtClassifier(input_col_lbl='txt', epoch=8)
>>> sk_clf.fit(df['txt'], df['lbl'])
>>> sk_clf.predict(['woof'])
>>> sk_clf.predict(df['txt'])

4.3 Hyperparameter auto-tuning

A validation set can be passed to fit() in order to auto-tune the hyperparameters.

First, to adjust the auto-tune settings, the corresponding keyword arguments can be passed to the constructor (if none are passed the default settings are used):

>>> import pandas
>>> from skift import SeriesFtClassifier
>>> df_train = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> df_val = pandas.DataFrame([['woof woof', 0], ['meow meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = SeriesFtClassifier(input_col_lbl='txt', epoch=8, autotuneDuration=5)

Then, the validation dataframe (or series, in this case, since we constructed a SeriesFtClassifier) and label column should be provided to the fit() method:

>>> sk_clf.fit(df_train['txt'], df_train['lbl'], X_validation=df_val['txt'], y_validation=df_val['lbl'])

Or simply by position:

>>> sk_clf.fit(df_train['txt'], df_train['lbl'], df_val['txt'], df_val['lbl'])
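
One way to picture what happens with the validation data is the sketch below. It is an assumption about the general mechanism rather than skift's actual code: build_train_kwargs is a hypothetical helper, and the file names are placeholders. The option names autotuneValidationFile and autotuneDuration are the ones exposed by the fasttext Python package.

```python
def build_train_kwargs(train_path, validation_path=None, **kwargs):
    """Assemble keyword arguments for a fastText-style training call."""
    train_kwargs = dict(kwargs)
    train_kwargs["input"] = train_path
    if validation_path is not None:
        # Supplying a validation file is what switches auto-tuning on.
        train_kwargs["autotuneValidationFile"] = validation_path
    return train_kwargs


args = build_train_kwargs("train.txt", "val.txt", epoch=8, autotuneDuration=5)
print(sorted(args))
# ['autotuneDuration', 'autotuneValidationFile', 'epoch', 'input']
```

Without a validation path the auto-tune file option is simply left out, so training would proceed with the given hyperparameters as-is.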

5 Contributing

The package author and current maintainer is Shay Palachy (shay.palachy@gmail.com); you are more than welcome to approach him for help. Contributions are very welcome.

5.1 Installing for development

Clone:

git clone git@github.com:shaypal5/skift.git

Install in development mode, including test dependencies:

cd skift
pip install -e '.[test]'

To also install fasttext, see instructions in the Installation section.

5.2 Running the tests

To run the tests use:

cd skift
pytest

5.3 Adding documentation

The project is documented using the numpy docstring conventions, which were chosen as they are perhaps the most widespread conventions that are both supported by common tools such as Sphinx and result in human-readable docstrings. When documenting code you add to this project, follow these conventions.
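
For contributors unfamiliar with the numpy conventions, a docstring in that style looks roughly like the following. The function count_tokens is a made-up example and is not part of skift.

```python
def count_tokens(text, sep=None):
    """Count the tokens in a string.

    Parameters
    ----------
    text : str
        The string to split into tokens.
    sep : str, optional
        The delimiter to split on. By default, any whitespace is used.

    Returns
    -------
    int
        The number of tokens found in ``text``.

    Examples
    --------
    >>> count_tokens("woof woof meow")
    3
    """
    return len(text.split(sep))
```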

Additionally, if you update this README.rst file, use python setup.py checkdocs to validate that it compiles.

6 Credits

Created by Shay Palachy (shay.palachy@gmail.com).

Contributions:

Dimid Duchovny (https://github.com/dimidd) contributed the SeriesFtClassifier class and the hyperparameter auto-tuning capability.

Fixes: uniaz, crouffer, amirzamli and sgt.

Release history

0.0.23 - Feb 14, 2022 (this version)
0.0.22 - Jan 20, 2022
0.0.21 - Dec 13, 2021
0.0.19 - Aug 27, 2020
0.0.18 - Aug 4, 2020
0.0.17 - Jan 15, 2020
0.0.16 - Jul 13, 2019
0.0.12 - Jan 29, 2019
0.0.11 - Mar 15, 2018
0.0.10 - Feb 26, 2018
0.0.9 - Feb 22, 2018
0.0.8 - Feb 22, 2018
0.0.7 - Feb 19, 2018
0.0.6 - Feb 12, 2018
0.0.5 - Feb 12, 2018
0.0.4 - Feb 4, 2018
0.0.3 - Feb 4, 2018
0.0.1 - Feb 3, 2018

Download files

Source Distribution

skift-0.0.23.tar.gz (296.4 kB view details)

Uploaded Feb 14, 2022 Source

Built Distribution

skift-0.0.23-py2.py3-none-any.whl (12.8 kB view details)

Uploaded Feb 14, 2022 Python 2, Python 3

File details

Details for the file skift-0.0.23.tar.gz.

File metadata

Download URL: skift-0.0.23.tar.gz
Upload date: Feb 14, 2022
Size: 296.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.0 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.10.2

File hashes

Hashes for skift-0.0.23.tar.gz
Algorithm	Hash digest
SHA256	`36793fdcb27545915f16e6ecd8679626047b477a564b94f7c175dea781f0c60d`
MD5	`feab5d6756314107f210b87e5607cd8f`
BLAKE2b-256	`c17e0ff4af5b00ecd60fcd144c8048da79d7bebfc73a866c9fcced7053f189c2`

File details

Details for the file skift-0.0.23-py2.py3-none-any.whl.

File metadata

Download URL: skift-0.0.23-py2.py3-none-any.whl
Upload date: Feb 14, 2022
Size: 12.8 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.0 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.10.2

File hashes

Hashes for skift-0.0.23-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`ab849db4f028a3e4cd6c9e1e328fe767bcb51fb0b55fdc8d05b09fcc3bba266a`
MD5	`11f6ca78f14bdf8c7d47fa35eb7d3ee7`
BLAKE2b-256	`8ec074b3bf2a2257ceb6de5ae1271237e8e315437236868c46f0a5403897ea54`
