Text Classification Library for Keras
Text Classification Keras
A high-level text classification library implementing various well-established models, with a clean and extensible interface for implementing custom architectures.
Quick start
Install
pip install text-classification-keras[full]
The [full] extra will additionally install TensorFlow, spaCy, and Deep Plots. Choose this if you want to get started right away.
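Without the extra, only the core library is installed; TensorFlow, spaCy, and Deep Plots then have to be installed separately:
pip install text-classification-keras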
Usage
from texcla import experiment, data
from texcla.models import TokenModelFactory, YoonKimCNN
from texcla.preprocessing import FastTextWikiTokenizer
# input text
X = ['some random text', 'another random text lala', 'peter', ...]
# input labels
y = ['a', 'b', 'a', ...]
# use the special tokenizer used for constructing the embeddings
tokenizer = FastTextWikiTokenizer()
# preprocess data (once)
experiment.setup_data(X, y, tokenizer, 'data.bin', max_len=100)
# load data
ds = data.Dataset.load('data.bin')
# construct base
factory = TokenModelFactory(
    ds.num_classes, ds.tokenizer.token_index, max_tokens=100,
    embedding_type='fasttext.wiki.simple', embedding_dims=300)
# choose a model
word_encoder_model = YoonKimCNN()
# build a model
model = factory.build_model(
    token_encoder_model=word_encoder_model, trainable_embeddings=False)
# use experiment.train as wrapper for Keras.fit()
experiment.train(x=ds.X, y=ds.y, validation_split=0.1, model=model,
                 word_encoder_model=word_encoder_model)
Check out more examples.
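Since experiment.train just wraps Keras's fit() and factory.build_model returns a regular Keras model, the usual Keras workflow applies after training. A minimal sketch (the file name and the use of the standalone keras package are assumptions):
# persist the trained model with plain Keras
model.save('yoon_kim_cnn.h5')
# note: when reloading via keras.models.load_model, the custom encoder layers
# may have to be passed through `custom_objects`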
API Documentation
https://github.io/jfilter/text-classification-keras/
Advanced
Embeddings
Choose a pre-trained word embedding by setting the embedding_type and the corresponding embedding dimensions. Set embedding_type=None to initialize the word embeddings randomly (but make sure to set trainable_embeddings=True so you actually train the embeddings).
factory = TokenModelFactory(embedding_type='fasttext.wiki.simple', embedding_dims=300)
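Conversely, to train embeddings from scratch, combine embedding_type=None with trainable_embeddings=True. A sketch reusing the quick-start arguments:
factory = TokenModelFactory(
    ds.num_classes, ds.tokenizer.token_index, max_tokens=100,
    embedding_type=None, embedding_dims=300)
model = factory.build_model(
    token_encoder_model=YoonKimCNN(), trainable_embeddings=True)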
FastText
Several pre-trained FastText embeddings are included. For now, we only have the word embeddings and not the n-gram features. All embeddings have 300 dimensions.
- English Vectors: e.g. fasttext.wn.1M.300d, check out all available embeddings
- Multilang Vectors: in the format fasttext.cc.LANG_CODE, e.g. fasttext.cc.en
- Wikipedia Vectors: in the format fasttext.wiki.LANG_CODE, e.g. fasttext.wiki.en
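For instance, to plug in the English Common Crawl vectors named above (a sketch; the remaining arguments mirror the quick start):
factory = TokenModelFactory(
    ds.num_classes, ds.tokenizer.token_index, max_tokens=100,
    embedding_type='fasttext.cc.en', embedding_dims=300)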
GloVe
The GloVe embeddings are a predecessor to FastText. In general, choose FastText embeddings over GloVe. The dimensions of the pre-trained embeddings vary.
- English Vectors: e.g. glove.6B.50d, check out all available embeddings
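Remember to set embedding_dims to match the chosen vectors, e.g. 50 for glove.6B.50d (a sketch, other arguments as above):
factory = TokenModelFactory(
    ds.num_classes, ds.tokenizer.token_index, max_tokens=100,
    embedding_type='glove.6B.50d', embedding_dims=50)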
Tokenization
- To work on token (or word) level, use a TokenTokenizer, e.g. TwokenizeTokenizer or SpacyTokenizer.
- To work on token and sentence level, use SpacySentenceTokenizer.
- To create a custom tokenizer, extend Tokenizer and implement the token_generator method (see the sketch below).
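A custom tokenizer might look roughly like this. This is a minimal sketch: the import path of the Tokenizer base class and the exact token_generator signature and yield format are assumptions, so check the library source before relying on them.
from texcla.preprocessing import Tokenizer

class WhitespaceTokenizer(Tokenizer):
    """Hypothetical tokenizer that simply splits texts on whitespace."""

    def token_generator(self, texts, **kwargs):
        # assumed contract: yield (document_index, token) pairs
        for i, text in enumerate(texts):
            for token in text.split():
                yield i, token.lower()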
Spacy
You may use spaCy for the tokenization. See the spaCy instructions on how to download a model for your target language. E.g. for English:
python -m spacy download en
Models
Token-based Models
When working on token level, use TokenModelFactory.
from texcla.models import TokenModelFactory, YoonKimCNN
factory = TokenModelFactory(tokenizer.num_classes, tokenizer.token_index,
                            max_tokens=100, embedding_type='glove.6B.100d')
word_encoder_model = YoonKimCNN()
model = factory.build_model(token_encoder_model=word_encoder_model)
Currently supported models include YoonKimCNN and AttentionRNN, among others.
TokenModelFactory.build_model uses the provided word encoder, whose output is then classified via a Dense layer.
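Any supported encoder can be dropped in for YoonKimCNN. For instance, assuming the attention-based RNN used below for sentence models also works on token level, a sketch reusing the factory above would be:
word_encoder_model = AttentionRNN()
model = factory.build_model(token_encoder_model=word_encoder_model)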
Sentence-based Models
When working on sentence level, use SentenceModelFactory.
# Pad max sentences per doc to 500 and max words per sentence to 200.
# Can also use `max_sents=None` to allow variable sized max_sents per mini-batch.
factory = SentenceModelFactory(10, tokenizer.token_index, max_sents=500,
                               max_tokens=200, embedding_type='glove.6B.100d')
word_encoder_model = AttentionRNN()
sentence_encoder_model = AttentionRNN()
# Allows you to compose arbitrary word encoders followed by sentence encoder.
model = factory.build_model(word_encoder_model, sentence_encoder_model)
- Hierarchical attention networks (HANs) can be built by composing two attention-based RNN models. This is useful when a document is very large.
- For smaller documents, a reasonable way to encode sentences is to average the words within them. This can be done by using token_encoder_model=AveragingEncoder().
- Mix and match encoders as you see fit for your problem.
SentenceModelFactory.build_model creates a tiered model in which the words within a sentence are first encoded using word_encoder_model. All such encodings per sentence are then encoded using sentence_encoder_model.
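For example, the averaging composition mentioned above could be built like this (a sketch; it assumes SentenceModelFactory, AveragingEncoder, and AttentionRNN are all importable from texcla.models like the other model classes):
from texcla.models import SentenceModelFactory, AveragingEncoder, AttentionRNN

factory = SentenceModelFactory(10, tokenizer.token_index, max_sents=500,
                               max_tokens=200, embedding_type='glove.6B.100d')
# average the words within each sentence, then attend over the sentence encodings
model = factory.build_model(AveragingEncoder(), AttentionRNN())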
Related
- https://github.com/brightmart/text_classification
- https://github.com/allenai/allennlp
- https://github.com/facebookresearch/pytext
- https://docs.fast.ai/text.html
- https://github.com/dkpro/dkpro-tc
Contributing
If you have a question, found a bug or want to propose a new feature, have a look at the issues page.
Pull requests are especially welcome when they fix bugs or improve the code quality.
Acknowledgements
Built upon the work by Raghavendra Kotikalapudi: keras-text.
Citation
If you find Text Classification Keras useful for an academic publication, then please use the following BibTeX to cite it:
@misc{raghakotfiltertexclakeras,
  title={Text Classification Keras},
  author={Raghavendra Kotikalapudi and Johannes Filter and contributors},
  year={2018},
  publisher={GitHub},
  howpublished={\url{https://github.com/jfilter/text-classification-keras}},
}
License
MIT.