Simple and powerful state-of-the-art NLP framework with pre-trained word2vec and BERT embeddings.
Kashgari
Simple and powerful NLP framework, build your own state-of-the-art model in 5 minutes.
Kashgari is:
- Human-friendly framework. Kashgari's code is simple, well documented, and tested, which makes it easy to understand and modify.
- Powerful and simple NLP library. Kashgari lets you apply state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech (PoS) tagging, and text classification.
- A Keras NLP framework. Kashgari builds directly on Keras, making it easy to train your own models and experiment with new approaches using different embeddings and model structures.
Feature List
- Embedding support
  - Classic word2vec embedding
  - BERT embedding
- Text Classification Models
  - CNN Classification Model
  - CNN LSTM Classification Model
  - Bidirectional LSTM Classification Model
- Text Labeling Models (NER, PoS)
  - Bidirectional LSTM Labeling Model
  - Bidirectional LSTM CRF Labeling Model
  - CNN LSTM Labeling Model
- Model Training
- Model Evaluation
- GPU Support
- Model Customization
Roadmap
- ELMo Embedding
- Pre-trained models
- More model structures
Tutorials
Quick start
Requirements and Installation
The project is based on Keras 2.2.0+ and Python 3.6+, because it is 2019 and type hints are cool.
pip install kashgari
# CPU
pip install tensorflow
# GPU
pip install tensorflow-gpu
Example Usage
Let's run a text classification task on the SMP 2017 ECDT Task 1 dataset with a CNN LSTM model.
>>> from kashgari.corpus import SMP2017ECDTClassificationCorpus
>>> from kashgari.tasks.classification import CNNLSTMModel
>>> x_data, y_data = SMP2017ECDTClassificationCorpus.get_classification_data()
>>> x_data[0]
['你', '知', '道', '我', '几', '岁']
>>> y_data[0]
'chat'
# available classification models: `CNNModel`, `BLSTMModel`, `CNNLSTMModel`
>>> classifier = CNNLSTMModel()
>>> classifier.fit(x_data, y_data)
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 10) 0
_________________________________________________________________
embedding_1 (Embedding) (None, 10, 100) 87500
_________________________________________________________________
conv1d_1 (Conv1D) (None, 10, 32) 9632
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 5, 32) 0
_________________________________________________________________
lstm_1 (LSTM) (None, 100) 53200
_________________________________________________________________
dense_1 (Dense) (None, 32) 3232
=================================================================
Total params: 153,564
Trainable params: 153,564
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
1/35 [..............................] - ETA: 32s - loss: 3.4652 - acc: 0.0469
...
>>> x_test, y_test = SMP2017ECDTClassificationCorpus.get_classification_data('test')
>>> classifier.evaluate(x_test, y_test)
precision recall f1-score support
calc 0.75 0.75 0.75 8
chat 0.83 0.86 0.85 154
contacts 0.54 0.70 0.61 10
cookbook 0.97 0.94 0.95 89
datetime 0.67 0.67 0.67 6
email 1.00 0.88 0.93 8
epg 0.61 0.56 0.58 36
flight 1.00 0.90 0.95 21
...
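The parameter counts in the model summary above follow directly from the layer shapes. A quick sanity check in plain Python (the layer sizes are read off the summary; the Conv1D kernel size of 3 is inferred from its parameter count, since the summary does not state it):

```python
# Recompute the parameter counts printed in the model summary above.
vocab_size, embed_dim = 875, 100      # embedding: 875 * 100 = 87,500
conv_filters, kernel_size = 32, 3     # Conv1D over the 100-dim embeddings
lstm_units, n_classes = 100, 32

embedding_params = vocab_size * embed_dim
conv_params = conv_filters * (kernel_size * embed_dim) + conv_filters
# An LSTM has 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (conv_filters + lstm_units) + lstm_units)
dense_params = lstm_units * n_classes + n_classes

total = embedding_params + conv_params + lstm_params + dense_params
print(embedding_params, conv_params, lstm_params, dense_params, total)
# 87500 9632 53200 3232 153564
```

Every number matches the summary, including the 153,564 total.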
Run with BERT Embedding
from kashgari.embeddings import BERTEmbedding
from kashgari.tasks.classification import CNNLSTMModel
from kashgari.corpus import SMP2017ECDTClassificationCorpus
bert_embedding = BERTEmbedding('bert-base-chinese', sequence_length=30)
model = CNNLSTMModel(bert_embedding)
train_x, train_y = SMP2017ECDTClassificationCorpus.get_classification_data()
model.fit(train_x, train_y)
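The `sequence_length=30` argument fixes the model's input length, which presumably means longer sentences are truncated and shorter ones padded. A plain-Python sketch of that kind of preprocessing (the `pad_or_truncate` helper and `<PAD>` token are illustrative, not Kashgari's actual internals):

```python
def pad_or_truncate(tokens, sequence_length=30, pad_token='<PAD>'):
    """Force a token list to a fixed length, as a fixed-size model input requires."""
    if len(tokens) >= sequence_length:
        return tokens[:sequence_length]
    return tokens + [pad_token] * (sequence_length - len(tokens))

print(pad_or_truncate(['你', '知', '道'], sequence_length=5))
# ['你', '知', '道', '<PAD>', '<PAD>']
```

Pick a `sequence_length` that covers most sentences in your corpus; anything beyond it is simply cut off.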
Run with Word2vec Embedding
from kashgari.embeddings import WordEmbeddings
from kashgari.tasks.classification import CNNLSTMModel
from kashgari.corpus import SMP2017ECDTClassificationCorpus
embedding = WordEmbeddings('sgns.weibo.bigram', sequence_length=30)
model = CNNLSTMModel(embedding)
train_x, train_y = SMP2017ECDTClassificationCorpus.get_classification_data()
model.fit(train_x, train_y)
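The bundled corpora above all return data in the same simple shape, so training on your own corpus only requires producing tokenized sentences plus labels: one label string per sentence for classification, and one aligned tag sequence per sentence for the labeling models. A stdlib-only sketch of that format (the sample sentences, labels, and NER tags here are made up for illustration):

```python
# Classification: one token list per sentence, one label per sentence,
# matching the (x_data, y_data) shape returned by the bundled corpora.
x_train = [list('你知道我几岁'), list('今天天气怎么样')]
y_train = ['chat', 'weather']
assert len(x_train) == len(y_train)

# Sequence labeling (NER/PoS): one tag per token, so each tag sequence
# must be exactly as long as its sentence.
ner_x = [['我', '在', '北', '京']]
ner_y = [['O', 'O', 'B-LOC', 'I-LOC']]
for tokens, tags in zip(ner_x, ner_y):
    assert len(tokens) == len(tags)
```

Lists in this shape can then be passed to `model.fit(x_train, y_train)` exactly as in the examples above.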
Reference
This library was inspired by and references the following frameworks and papers.