Simple and powerful state-of-the-art NLP framework with pre-trained word2vec and BERT embeddings.

Project description

Kashgari


Simple and powerful NLP framework: build your own state-of-the-art model in 5 minutes.

Kashgari is:

  • A human-friendly framework. Kashgari's code is simple, well documented and tested, which makes it very easy to understand and modify.
  • A powerful and simple NLP library. Kashgari lets you apply state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS) and text classification.
  • A Keras NLP framework. Kashgari builds directly on Keras, making it easy to train your own models and experiment with new approaches using different embeddings and model structures.

Feature List

  • Embedding support
    • Classic word2vec embedding
    • BERT embedding
  • Text Classification Models
    • CNN Classification Model
    • CNN LSTM Classification Model
    • Bidirectional LSTM Classification Model
  • Text Labeling Models (NER, PoS)
    • Bidirectional LSTM Labeling Model
    • Bidirectional LSTM CRF Labeling Model
    • CNN LSTM Labeling Model
  • Model Training
  • Model Evaluation
  • GPU Support
  • Customize Model
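
The labeling models listed above consume sequence data as parallel lists of tokens and tags (a BIO-style scheme for NER). A minimal sketch of that data shape, independent of the library itself — the tokens and tags here are made-up examples, not from a real corpus:

```python
# Each sentence is a list of tokens, paired with one tag per token (BIO scheme).
x_data = [
    ['Kashgar', 'is', 'a', 'city', 'in', 'Xinjiang'],
]
y_data = [
    ['B-LOC', 'O', 'O', 'O', 'O', 'B-LOC'],
]

# Sanity check: every token must have exactly one tag.
for tokens, tags in zip(x_data, y_data):
    assert len(tokens) == len(tags)
```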

Roadmap

  • ELMo Embedding
  • Pre-trained models
  • More model structures

Tutorials

Quick start

Requirements and Installation

The project is based on Keras 2.2.0+ and Python 3.6+, because it is 2019 and type hints are cool.

pip install kashgari
# CPU
pip install tensorflow
# GPU
pip install tensorflow-gpu 

Example Usage

Let's run a text classification task with a CNN LSTM model on SMP 2017 ECDT Task 1.

>>> from kashgari.corpus import SMP2017ECDTClassificationCorpus
>>> from kashgari.tasks.classification import CNNLSTMModel

>>> x_data, y_data = SMP2017ECDTClassificationCorpus.get_classification_data()
>>> x_data[0]
['你', '知', '道', '我', '几', '岁']
>>> y_data[0]
'chat'

# Provided classification models: `CNNModel`, `BLSTMModel`, `CNNLSTMModel`
>>> classifier = CNNLSTMModel()
>>> classifier.fit(x_data, y_data)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 10)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 10, 100)           87500     
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 10, 32)            9632      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 5, 32)             0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 32)                3232      
=================================================================
Total params: 153,564
Trainable params: 153,564
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
 1/35 [..............................] - ETA: 32s - loss: 3.4652 - acc: 0.0469

... 

>>> x_test, y_test = SMP2017ECDTClassificationCorpus.get_classification_data('test')
>>> classifier.evaluate(x_test, y_test)
              precision    recall  f1-score   support

        calc       0.75      0.75      0.75         8
        chat       0.83      0.86      0.85       154
    contacts       0.54      0.70      0.61        10
    cookbook       0.97      0.94      0.95        89
    datetime       0.67      0.67      0.67         6
       email       1.00      0.88      0.93         8
         epg       0.61      0.56      0.58        36
      flight       1.00      0.90      0.95        21
...
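
The corpus helper above simply returns tokenized sentences and their labels as plain Python lists, so training on your own data means building the same structures — a sketch assuming `fit` accepts lists exactly as in the quick-start example (the second sample here is invented for illustration):

```python
# Classification data in the same shape as SMP2017ECDTClassificationCorpus:
# each sample is a list of tokens (characters, for Chinese) plus one label.
x_data = [
    ['你', '知', '道', '我', '几', '岁'],
    ['帮', '我', '查', '航', '班'],
]
y_data = ['chat', 'flight']

# One label per sample.
assert len(x_data) == len(y_data)
# classifier.fit(x_data, y_data)  # then train exactly as shown above
```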

Run with BERT Embedding

from kashgari.embeddings import BERTEmbedding
from kashgari.tasks.classification import CNNLSTMModel
from kashgari.corpus import SMP2017ECDTClassificationCorpus

bert_embedding = BERTEmbedding('bert-base-chinese', sequence_length=30)                                   
model = CNNLSTMModel(bert_embedding)

train_x, train_y = SMP2017ECDTClassificationCorpus.get_classification_data()
model.fit(train_x, train_y)

Run with Word2vec Embedding

from kashgari.embeddings import WordEmbeddings
from kashgari.tasks.classification import CNNLSTMModel
from kashgari.corpus import SMP2017ECDTClassificationCorpus

word_embedding = WordEmbeddings('sgns.weibo.bigram', sequence_length=30)
model = CNNLSTMModel(word_embedding)
train_x, train_y = SMP2017ECDTClassificationCorpus.get_classification_data()
model.fit(train_x, train_y)
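
The SMP2017 corpus ships pre-tokenized into characters; your own Chinese text needs the same shape before it can be fed to `fit`. A minimal character tokenizer (an illustration, not part of Kashgari's API):

```python
def char_tokenize(text):
    """Split a string into a list of characters, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]

tokens = char_tokenize('你知道 我几岁')
# tokens == ['你', '知', '道', '我', '几', '岁']
```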

Reference

This library is inspired by and references the following frameworks and papers.
