Natural Language Processing toolkit with support for tokenization, sentence splitting, lemmatization, tagging, and parsing for more than 60 languages
NLP-Cube
Setup:
Before running the server, you need the model weights. There are two ways to get them:
- Download data in order to train the model yourself
- Download already existing model weights
Installing dyNET:
- Make sure you have Mercurial, Python, pip, and CMake installed (you can also check the steps documented here)
- Install Intel's MKL library
- Install dyNET by using the installation steps from the manual installation page. More specifically, you should use:

```bash
pip install cython
mkdir dynet-base
cd dynet-base
git clone https://github.com/clab/dynet.git
hg clone https://bitbucket.org/eigen/eigen -r 2355b22  # -r NUM specifies a known working revision
cd dynet
mkdir build
cd build
cmake .. -DEIGEN3_INCLUDE_DIR=/path/to/eigen -DMKL_ROOT=/opt/intel/mkl -DPYTHON=`which python2`
make -j 2  # replace 2 with the number of available cores
make install
cd python
python2 ../../setup.py build --build-dir=.. --skip-build install
```
Training the lemmatizer:
Use the following command to train your lemmatizer:

```bash
python2 cube/main.py --train=lemmatizer \
    --train-file=corpus/ud_treebanks/UD_Romanian/ro-ud-train.conllu \
    --dev-file=corpus/ud_treebanks/UD_Romanian/ro-ud-dev.conllu \
    --embeddings=corpus/wiki.ro.vec \
    --store=corpus/trained_models/ro/lemma/lemma \
    --test-file=corpus/ud_test/gold/conll17-ud-test-2017-05-09/ro.conllu \
    --batch-size=1000
```
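The `--train-file` and `--dev-file` arguments point to UD treebanks in CoNLL-U format: ten tab-separated columns per token line, `#` comment lines, and a blank line between sentences. As a format illustration only (this reader is not part of NLP-Cube), a minimal CoNLL-U parser might look like:

```python
# Minimal CoNLL-U reader (illustration, not NLP-Cube code):
# yields one sentence at a time as a list of dicts keyed by
# the standard ten CoNLL-U columns.

COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]

def read_conllu(lines):
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):   # sentence-level comments
            continue
        if not line:               # blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(COLUMNS, line.split("\t"))))
    if sentence:
        yield sentence

sample = """# sent_id = 1
1\tAna\tAna\tPROPN\t_\t_\t2\tnsubj\t_\t_
2\tare\tavea\tVERB\t_\t_\t0\troot\t_\t_
"""
sentences = list(read_conllu(sample.splitlines()))
print(sentences[0][1]["lemma"])  # -> avea
```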
Running the server:
Use the following command to run the server locally:

```bash
python2 cube/main.py --start-server \
    --model-tokenization=corpus/trained_models/ro/tokenizer \
    --model-parsing=corpus/trained_models/ro/parser \
    --model-lemmatization=corpus/trained_models/ro/lemma \
    --embeddings=corpus/wiki.ro.vec \
    --server-port=8080
```
Current status
- we treat word and character embeddings in a similar fashion
- we tested with character encodings only (the feature cutoff is set at 100)
TODO
- provide training examples
- add word embeddings
- find a good network architecture for POS tagging
- prepare a neural-based language pipeline
- pre-train models using universal dependencies
- add a parser
Parser architecture
# ----------------- --------------------------
# |word embeddings|---- ------|morphological embeddings|
# ----------------- | | --------------------------
# | |
# --------------
# |concatenate |
# --------------
# |
# ----------------
# |bdlstm_1_layer|
# ----------------
# |
# ----------------
# |bdlstm_2_layer|
# ----------------
# |-----------------------------------------------------------------
# ---------------- |
# |bdlstm_3_layer| |
# ---------------- |
# | |
# --------------------------------------------- ---------------------------------------------
# | | | | | | | |
# | | | | | | | |
# --------- ----------- ---------- ------------ --------- ----------- ---------- ------------
# |to_link| |from_link| |to_label| |from_label| |to_link| |from_link| |to_label| |from_label|
# --------- ----------- ---------- ------------ --------- ----------- ---------- ------------
# | | | | | | | |
# -------------- --------------- ------------------ -------------------
# |softmax link| |softmax label| |aux softmax link| |aux softmax label|
# -------------- --------------- ------------------ -------------------
#
#
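The diagram above can be sketched as a toy forward pass. This is a shape-level illustration only, not the actual implementation: the dimensions are made up, the weights are random, and a plain bidirectional tanh RNN stands in for each BDLSTM layer. Note how the auxiliary heads branch off after the second layer while the main heads read the third.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bi_rnn(x, hidden):
    """Stand-in for a bidirectional LSTM layer: a plain tanh RNN run
    in both directions, with the two hidden sequences concatenated."""
    d = x.shape[1]
    def run(seq, W):
        h, out = np.zeros(hidden), []
        for t in seq:
            h = np.tanh(np.concatenate([t, h]) @ W)
            out.append(h)
        return np.stack(out)
    Wf = rng.normal(scale=0.1, size=(d + hidden, hidden))
    Wb = rng.normal(scale=0.1, size=(d + hidden, hidden))
    return np.concatenate([run(x, Wf), run(x[::-1], Wb)[::-1]], axis=1)

T, WORD, MORPH, HID, LABELS = 5, 8, 4, 6, 3        # toy sizes
word = rng.normal(size=(T, WORD))                  # word embeddings
morph = rng.normal(size=(T, MORPH))                # morphological embeddings

x = np.concatenate([word, morph], axis=1)          # concatenate
h1 = bi_rnn(x, HID)                                # bdlstm_1_layer
h2 = bi_rnn(h1, HID)                               # bdlstm_2_layer (aux heads branch here)
h3 = bi_rnn(h2, HID)                               # bdlstm_3_layer (main heads)

def heads(h):
    # four projections per branch: to/from link over positions,
    # to/from label over dependency labels
    proj = lambda d: softmax(h @ rng.normal(scale=0.1, size=(h.shape[1], d)))
    return {"to_link": proj(T), "from_link": proj(T),
            "to_label": proj(LABELS), "from_label": proj(LABELS)}

main, aux = heads(h3), heads(h2)
print(main["to_link"].shape, aux["to_label"].shape)  # (5, 5) (5, 3)
```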
Tagger architecture
# ----------------- ----------------------
# |word embeddings|---- ------|character embeddings|
# ----------------- | | ----------------------
# | |
# --------------
# |tanh_1_layer|
# --------------
# |
# ----------------
# |bdlstm_1_layer|
# ----------------
# |
# --------------
# |tanh_2_layer|-------------------
# -------------- |
# | |
# ---------------- -------------------
# |bdlstm_2_layer| |aux_softmax_layer|
# ---------------- -------------------
# |
# ---------------
# |softmax_layer|
# ---------------
#
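As with the parser, the tagger diagram can be sketched as a toy forward pass under the same caveats: invented dimensions, random weights, and a plain bidirectional tanh RNN standing in for each BDLSTM layer. The point is the wiring: tanh layers between the recurrent layers, with the auxiliary softmax branching off the second tanh layer.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bi_rnn(x, hidden):
    # stand-in for a bidirectional LSTM: plain tanh RNN in both directions
    d = x.shape[1]
    def run(seq, W):
        h, out = np.zeros(hidden), []
        for t in seq:
            h = np.tanh(np.concatenate([t, h]) @ W)
            out.append(h)
        return np.stack(out)
    Wf = rng.normal(scale=0.1, size=(d + hidden, hidden))
    Wb = rng.normal(scale=0.1, size=(d + hidden, hidden))
    return np.concatenate([run(x, Wf), run(x[::-1], Wb)[::-1]], axis=1)

def tanh_layer(x, out_dim):
    return np.tanh(x @ rng.normal(scale=0.1, size=(x.shape[1], out_dim)))

T, WORD, CHAR, HID, TAGS = 5, 8, 4, 6, 17          # toy sizes
word = rng.normal(size=(T, WORD))                  # word embeddings
char = rng.normal(size=(T, CHAR))                  # character embeddings

x = np.concatenate([word, char], axis=1)
h = tanh_layer(x, HID)                             # tanh_1_layer
h = bi_rnn(h, HID)                                 # bdlstm_1_layer
h = tanh_layer(h, HID)                             # tanh_2_layer
aux = softmax(h @ rng.normal(scale=0.1, size=(HID, TAGS)))       # aux_softmax_layer
h = bi_rnn(h, HID)                                 # bdlstm_2_layer
tags = softmax(h @ rng.normal(scale=0.1, size=(2 * HID, TAGS)))  # softmax_layer
```

In training, the auxiliary softmax would receive its own loss term to push useful gradients into the lower layers; at prediction time only the final softmax is read.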