
Text Classification Library for Keras

Project description

A high-level text classification library implementing various well-established models, with a clean and extensible interface for implementing custom architectures.

Quick start

Install

pip install text-classification-keras[full]

The [full] extra additionally installs TensorFlow, spaCy, and Deep Plots. Choose it if you want to get started right away.
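
If you already have TensorFlow installed and only want the core library, installing without the extra should suffice:

pip install text-classification-keras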

Usage

from texcla import experiment, data
from texcla.models import TokenModelFactory, YoonKimCNN
from texcla.preprocessing import FastTextWikiTokenizer

# input text
X = ['some random text', 'another random text lala', 'peter', ...]

# input labels
y = ['a', 'b', 'a', ...]

# use the tokenizer that matches the pre-trained embeddings
tokenizer = FastTextWikiTokenizer()

# preprocess data (once)
experiment.setup_data(X, y, tokenizer, 'data.bin', max_len=100)

# load data
ds = data.Dataset.load('data.bin')

# construct base
factory = TokenModelFactory(
    ds.num_classes, ds.tokenizer.token_index, max_tokens=100,
    embedding_type='fasttext.wiki.simple', embedding_dims=300)

# choose a model
word_encoder_model = YoonKimCNN()

# build a model
model = factory.build_model(
    token_encoder_model=word_encoder_model, trainable_embeddings=False)

# use experiment.train as a wrapper around Keras' fit()
experiment.train(x=ds.X, y=ds.y, validation_split=0.1, model=model,
    word_encoder_model=word_encoder_model)
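
The model returned by build_model is a regular Keras model, so after training the usual Keras workflow should apply. A minimal sketch, assuming ds.X is already padded to max_len (the file name is arbitrary):

# score a few of the preprocessed inputs;
# the output shape should be (5, ds.num_classes)
probabilities = model.predict(ds.X[:5])

# persist the trained model with standard Keras tooling
model.save('yoon_kim_cnn.h5')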

Check out more examples.

API Documentation

https://jfilter.github.io/text-classification-keras/

Advanced

Embeddings

Choose a pre-trained word embedding by setting the embedding_type and the corresponding embedding dimensions. Set embedding_type=None to initialize the word embeddings randomly (but make sure to set trainable_embeddings=True so you actually train the embeddings).

factory = TokenModelFactory(embedding_type='fasttext.wiki.simple', embedding_dims=300)
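
For example, a minimal sketch of randomly initialized, trainable embeddings, reusing the ds object from the quick start:

from texcla.models import TokenModelFactory, YoonKimCNN

factory = TokenModelFactory(
    ds.num_classes, ds.tokenizer.token_index, max_tokens=100,
    embedding_type=None, embedding_dims=300)
model = factory.build_model(
    token_encoder_model=YoonKimCNN(), trainable_embeddings=True)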

FastText

Several pre-trained FastText embeddings are included. For now, only the word embeddings are included, not the n-gram features. All embeddings have 300 dimensions.

GloVe

The GloVe embeddings are a predecessor to FastText; in general, prefer the FastText embeddings over GloVe. The dimensions of the pre-trained GloVe embeddings vary.
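
For instance, with the 100-dimensional GloVe embeddings also used in the model examples below (note, as an assumption here, that embedding_dims must match the chosen embedding file):

factory = TokenModelFactory(
    ds.num_classes, ds.tokenizer.token_index, max_tokens=100,
    embedding_type='glove.6B.100d', embedding_dims=100)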

Tokenization

  • To work on the token (word) level, use a TokenTokenizer, e.g. TwokenizeTokenizer or SpacyTokenizer.
  • To work on the token and sentence level, use SpacySentenceTokenizer.
  • To create a custom Tokenizer, extend Tokenizer and implement the token_generator method (see the sketch below).
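
A minimal sketch of a custom tokenizer; the import path and the exact signature and yield format of token_generator are assumptions here, so check the Tokenizer base class before relying on them:

from texcla.preprocessing import Tokenizer

class WhitespaceTokenizer(Tokenizer):
    # hypothetical tokenizer that simply splits on whitespace
    def token_generator(self, texts, **kwargs):
        for i, text in enumerate(texts):
            for token in text.lower().split():
                # assumed contract: yield (document index, token) pairs
                yield i, token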

Spacy

You may use spaCy for the tokenization. See the spaCy instructions on how to download a model for your target language, e.g. for English:

python -m spacy download en
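
Once the model is downloaded, the spaCy tokenizer plugs into the same preprocessing step as in the quick start (X and y as defined there; the output path is arbitrary):

from texcla import experiment
from texcla.preprocessing import SpacyTokenizer

tokenizer = SpacyTokenizer()
experiment.setup_data(X, y, tokenizer, 'data_spacy.bin', max_len=100)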

Models

Token-based Models

When working on the token level, use TokenModelFactory.

from texcla.models import TokenModelFactory, YoonKimCNN

factory = TokenModelFactory(tokenizer.num_classes, tokenizer.token_index,
    max_tokens=100, embedding_type='glove.6B.100d')
word_encoder_model = YoonKimCNN()
model = factory.build_model(token_encoder_model=word_encoder_model)

Currently supported token encoders include, among others, YoonKimCNN, AttentionRNN, and AveragingEncoder; see the API documentation for the full list.

TokenModelFactory.build_model encodes the input tokens with the provided word encoder and classifies the result via a Dense layer.
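
Any of the bundled encoders can be dropped into the same factory; for example, an attention-based RNN instead of the CNN (reusing the factory from above):

from texcla.models import AttentionRNN

model = factory.build_model(token_encoder_model=AttentionRNN())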

Sentence-based Models

When working on the sentence level, use SentenceModelFactory.

# Pad max sentences per doc to 500 and max words per sentence to 200.
# Can also use `max_sents=None` to allow variable sized max_sents per mini-batch.

from texcla.models import SentenceModelFactory, AttentionRNN

# 10 is the number of classes
factory = SentenceModelFactory(10, tokenizer.token_index, max_sents=500,
    max_tokens=200, embedding_type='glove.6B.100d')
word_encoder_model = AttentionRNN()
sentence_encoder_model = AttentionRNN()

# Allows you to compose arbitrary word encoders followed by sentence encoder.
model = factory.build_model(word_encoder_model, sentence_encoder_model)

  • Hierarchical attention networks (HANs) can be built by composing two attention-based RNN models. This is useful when a document is very long.
  • For smaller documents, a reasonable way to encode a sentence is to average the words within it. This can be done by passing token_encoder_model=AveragingEncoder() (see the sketch below).
  • Mix and match encoders as you see fit for your problem.

SentenceModelFactory.build_model creates a tiered model: the words within each sentence are first encoded using word_encoder_model, and the resulting per-sentence encodings are then encoded using sentence_encoder_model.
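
A minimal sketch of the averaging setup mentioned in the list above, assuming AveragingEncoder lives in texcla.models like the other encoders:

from texcla.models import SentenceModelFactory, AveragingEncoder, AttentionRNN

factory = SentenceModelFactory(10, tokenizer.token_index, max_sents=500,
    max_tokens=200, embedding_type='glove.6B.100d')

# average the words within each sentence, then attend over the sentence encodings
model = factory.build_model(AveragingEncoder(), AttentionRNN())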

Contributing

If you have a question, have found a bug, or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve code quality.

Acknowledgements

Built upon the work by Raghavendra Kotikalapudi: keras-text.

Citation

If you find Text Classification Keras useful for an academic publication, then please use the following BibTeX to cite it:

@misc{raghakotfiltertexclakeras,
    title={Text Classification Keras},
    author={Raghavendra Kotikalapudi and Johannes Filter and contributors},
    year={2018},
    publisher={GitHub},
    howpublished={\url{https://github.com/jfilter/text-classification-keras}},
}

License

MIT.
