ETNLP: Embedding Toolkit for NLP Tasks

Introduction
More about ETNLP
Installation and How to Use
Download Resources

I. ETNLP: A Toolkit for Extraction, Evaluation and Visualization of Pre-trained Word Embeddings

A glimpse of ETNLP:

Github: https://github.com/vietnlp/etnlp
Video: https://vimeo.com/317599106

II. More about ETNLP:

1. Embedding Evaluator:

Compares the quality of embedding models on the word analogy task.

Input: a pre-trained embedding vector file (word2vec format) and a word analogy file.
Output: (1) the quality of the embedding model, measured by the MAP/P@10 score; (2) paired t-tests showing the significance level of the differences between word embeddings.
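
The evaluation described above can be sketched in plain Python. This is a minimal illustration, not ETNLP's own code: the function names, the toy embedding and the scoring details are assumptions. Each question has a single correct answer, so average precision is taken as the reciprocal rank of that answer within the top k, and P@k is read as "the answer appears in the top k"; ETNLP's exact definitions may differ.

```python
import io
import math


def load_word2vec_text(handle):
    """Parse word2vec text format: a 'vocab_size dim' header, then 'word v1 ... vN'."""
    _vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(vectors, a, b, c):
    """Rank every word except a, b, c by cosine similarity to (b - a + c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)


def evaluate(vectors, questions, k=10):
    """Return (mean reciprocal rank within the top k, fraction of questions hit in the top k)."""
    rr_sum, hits = 0.0, 0
    for a, b, c, expected in questions:
        top_k = rank_candidates(vectors, a, b, c)[:k]
        if expected in top_k:
            hits += 1
            rr_sum += 1.0 / (top_k.index(expected) + 1)
    n = len(questions)
    return rr_sum / n, hits / n


# Hypothetical 2-dimensional embedding, made up for this example only.
TOY = """5 2
man 1.0 0.0
woman 1.0 1.0
king 3.0 0.0
queen 3.0 1.0
apple 0.0 -2.0
"""

toy_vectors = load_word2vec_text(io.StringIO(TOY))
map_score, p_at_k = evaluate(toy_vectors, [("man", "woman", "king", "queen")])
print(map_score, p_at_k)
```

The sketch assumes every word of a question is present in the embedding; a real evaluator would have to decide how to count questions that contain out-of-vocabulary words.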

1.1. Note: The word analogy list is created by:

Adopting the English list: selecting suitable categories and translating them into the target language (here, Vietnamese).
Removing categories that are inappropriate for the target language (for Vietnamese: categories 6, 10, 11, and 14).
Adding custom categories suited to the target language (e.g., cities and their zones in Vietnam for Vietnamese).

Since most of this process is automated, it can be applied to other languages as well.

1.2. Selected categories for Vietnamese:

capital-common-countries

capital-world

currency: E.g., Algeria | dinar | Angola | kwanza

city-in-zone (Vietnam's cities and their zones)

family (e.g., boy | girl | brother | sister)

gram1-adjective-to-adverb (NOT USED)

gram2-opposite (e.g., acceptable | unacceptable | aware | unaware)

gram3-comparative (e.g., bad | worse | big | bigger)

gram4-superlative (e.g., bad | worst | big | biggest)

gram5-present-participle (NOT USED)

gram6-nationality-adjective-nguoi-tieng (e.g., Albania | Albanian | Argentina | Argentinean)

gram7-past-tense (NOT USED)

gram8-plural-cac-nhung (e.g., banana | bananas | bird | birds) (NOT USED)

gram9-plural-verbs (NOT USED)
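
Dropping the categories marked NOT USED can be automated with a small filter. The sketch below assumes the analogy file follows the layout of the original English list, where a line starting with ":" opens a category and the following lines hold four words each; whether ETNLP's files use exactly this layout is an assumption, and `filter_categories` is a hypothetical helper, not part of the toolkit.

```python
def filter_categories(lines, unused):
    """Yield the lines of an analogy file, dropping whole sections whose name is in `unused`."""
    keep = True
    for line in lines:
        if line.startswith(":"):
            # A header such as ': gram7-past-tense' opens a new category section.
            keep = line[1:].strip() not in unused
        if keep:
            yield line


# Categories marked NOT USED in the list above.
UNUSED = {
    "gram1-adjective-to-adverb",
    "gram5-present-participle",
    "gram7-past-tense",
    "gram8-plural-cac-nhung",
    "gram9-plural-verbs",
}

# Illustrative sample in the assumed layout.
sample = [
    ": family",
    "boy girl brother sister",
    ": gram7-past-tense",
    "dancing danced decreasing decreased",
    ": gram3-comparative",
    "bad worse big bigger",
]

kept = list(filter_categories(sample, UNUSED))
print("\n".join(kept))
```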

1.3. Evaluation results (in detail)

Analogy: Word Analogy Task
NER (w): NER task with hyper-parameters selected from the best F1 on validation set.
NER (w.o): NER task without selecting hyper-parameters from the validation set.

Model	NER.w	NER.w.o	Analogy
BiLC3 + w2v	89.01	89.41	0.4796
BiLC3 + Bert_Base	88.26	89.91	0.4609
BiLC3 + w2v_c2v	89.46	89.46	0.4796
BiLC3 + fastText	89.65	89.84	0.4970
BiLC3 + Elmo	89.67	90.84	0.4999
BiLC3 + MULTI_WC_F_E_B	91.09	91.75	0.4906
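
The paired t-test listed among the Evaluator's outputs compares two embeddings on the same set of questions. A minimal sketch of the test statistic follows, using only the standard library. The per-question scores are made-up placeholders rather than ETNLP results, and `paired_t_statistic` is a hypothetical name.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for two paired samples of equal length."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder per-question scores for two embeddings (not real results).
scores_model_a = [1.0, 0.5, 1.0, 0.0, 1.0]
scores_model_b = [0.5, 0.5, 0.0, 0.0, 0.5]

t_value, dof = paired_t_statistic(scores_model_a, scores_model_b)
print(round(t_value, 3), dof)
```

Turning t into a p-value requires the CDF of Student's t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` computes both values when SciPy is available.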

2. Embedding Extractor: extracts embedding vectors for other tasks.

Input: (1) a list of input embeddings, (2) a vocabulary file.
Output: the embedding vectors of the words in the given vocabulary file, in .txt format, i.e., each line contains the embedding of one word. The file is then compressed in .gz format. This format is widely used in existing NLP toolkits (e.g., Reimers and Gurevych [1]).

Extra options:

-input-c2v: character embedding file
solveoov:1: resolves out-of-vocabulary (OOV) words for the 1st embedding. For more than one embedding, list their indices, e.g., solveoov:1:2.
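
The basic extraction step (without the extra options) can be sketched as follows. This is an illustration under stated assumptions, not ETNLP's implementation: `extract_embeddings` is a hypothetical helper, the embedding has more than one dimension, and words missing from the embedding are simply reported instead of being resolved as solveoov would.

```python
import gzip
import os
import tempfile


def extract_embeddings(embedding_lines, vocab, out_path):
    """Write 'word v1 ... vN' lines for vocabulary words to a gzip file; return the missing words."""
    wanted = set(vocab)
    found = set()
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for line in embedding_lines:
            parts = line.rstrip("\n").split(" ")
            if len(parts) <= 2:
                continue  # assumed word2vec header ('vocab_size dim') or a blank line
            if parts[0] in wanted and parts[0] not in found:
                out.write(line.rstrip("\n") + "\n")
                found.add(parts[0])
    return sorted(wanted - found)


# Toy embedding in word2vec text format; the values are placeholders.
toy_embedding = ["3 3", "ha_noi 0.1 0.2 0.3", "viet_nam 0.4 0.5 0.6", "song 0.7 0.8 0.9"]

out_path = os.path.join(tempfile.mkdtemp(), "extracted.txt.gz")
missing = extract_embeddings(toy_embedding, ["viet_nam", "song", "bien"], out_path)

with gzip.open(out_path, "rt", encoding="utf-8") as handle:
    extracted = handle.read().splitlines()

print(extracted)
print(missing)
```

The output keeps the order of the embedding file rather than the order of the vocabulary file, which avoids loading the whole embedding into memory.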

[1] Nils Reimers and Iryna Gurevych, Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging, 2017, http://arxiv.org/abs/1707.09861, arXiv.

3. Visualizer: explores the embedding space and compares different embeddings.

Screenshot: viewing multiple embeddings side by side

Screenshot: viewing each embedding interactively

III. Installation and How to use ETNLP

1. Installation:

From source:

cd src/codes/

python setup.py install

From pip:

pip install etnlp

2. Examples

cd src/examples

python test1_etnlp_preprocessing.py

python test2_etnlp_extractor.py

python test3_etnlp_evaluator.py

python test4_etnlp_visualizer.py

3. Visualization

Side-by-side visualization:

sh src/codes/04.run_etnlp_visualizer_sbs.sh

Interactive visualization:

sh src/codes/04.run_etnlp_visualizer_inter.sh

IV. Available Lexical Resources

1. Word Analogy List for Vietnamese

Word Analogy List	Download Link (NER Task)	Download Link (General)
Vietnamese (This work)	Link1	[Link1]
English (Mikolov et al. [2])	[Link2]	[Link2]
Portuguese (Hartmann et al. [3])	[Link3]	Link3

2. Multiple pre-trained embedding models for Vietnamese

Training data: Vietnamese Wikipedia

# of sentences	# of tokenized words
6,685,621	114,997,587

Download Pre-trained Embeddings:
(Note: The MULTI_WC_F_E_B is the concatenation of four embeddings: W2V_C2V, fastText, ELMO, and Bert_Base.)

Embedding Model	Download Link (NER Task)	Download Link (AIVIVN SentiTask)	Download Link (General)
w2v	Link1 (dim=300)	[Link1]	[Link1]
w2v_c2v	Link2 (dim=300)	[Link2]	[Link2]
fastText	Link3 (dim=300)	[Link3]	[Link3]
Elmo	Link4 (dim=1024)	Link4 (dim=1024)	[Link4]
Bert_base	Link5 (dim=768)	[Link5]	[Link5]
MULTI_WC_F_E_B	Link6 (dim=2392)	[Link6]	[Link6]
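
The dimension of MULTI_WC_F_E_B in the table follows from concatenating its four components: 300 + 300 + 1024 + 768 = 2392. The sketch below checks this with placeholder vectors. The concatenation order is an assumption read off the model name, and `concatenate` is a hypothetical helper.

```python
# Dimensions taken from the table above.
DIMS = {"w2v_c2v": 300, "fastText": 300, "Elmo": 1024, "Bert_base": 768}


def concatenate(per_model_vectors, order):
    """Join one word's vectors from several models into a single list, in the given order."""
    combined = []
    for name in order:
        combined.extend(per_model_vectors[name])
    return combined


# Zero vectors stand in for the real embeddings of a single word.
word_vectors = {name: [0.0] * dim for name, dim in DIMS.items()}
multi = concatenate(word_vectors, ["w2v_c2v", "fastText", "Elmo", "Bert_base"])
print(len(multi))  # 2392
```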

Release history

0.1.1 (this version): Apr 16, 2019
0.1.0: Apr 16, 2019

Download files

Source distribution: none available for this release.
Built distribution: ETNLP-0.1.1-py3.6.egg (72.3 kB, Egg), uploaded Apr 16, 2019.

File details

Details for the file ETNLP-0.1.1-py3.6.egg.

File metadata

Download URL: ETNLP-0.1.1-py3.6.egg
Upload date: Apr 16, 2019
Size: 72.3 kB
Tags: Egg
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.20.1 setuptools/40.6.2 requests-toolbelt/0.9.1 tqdm/4.29.1 CPython/3.6.7

File hashes

Hashes for ETNLP-0.1.1-py3.6.egg
Algorithm	Hash digest
SHA256	`bd1ec09d025d7a3a4e21a222a8a947fbc34fef4f433d1dccef65b32641973d75`
MD5	`4c536ac4bdaba2f2ed0dd6dafe5d2901`
BLAKE2b-256	`32c8f5bef9d60790c0c4ec767150c33fb4e3af57e4a5aa8acffed0881d1ee978`
