embedding-as-service
One-stop solution to encode sentences into fixed-length vectors using various embedding techniques
• Inspired by bert-as-service
What is it • Installation • Getting Started • Supported Embeddings • API • Tutorials
What is it?
Encoding (embedding) is the upstream task of converting any input, whether text, image, audio, video, or transactional data, into a fixed-length vector. Embeddings are widely used in NLP, and researchers have proposed many embedding models in recent years; some of the best known are BERT, XLNet, and word2vec. The goal of this repo is to build a one-stop solution for all available embedding techniques. We are starting with popular text embeddings and aim to add techniques for image, audio, and video inputs later.
Finally, embedding-as-service
helps you encode any given text into a fixed-length vector using the supported embeddings and models.
Installation
Install embedding-as-service via pip:
pip install embedding-as-service
Note that the code MUST be run on Python >= 3.6 with TensorFlow >= 1.10 (one-point-ten). Again, this module does not support Python 2!
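Before installing, you can sanity-check your environment. A minimal sketch (the TensorFlow check is optional and skipped gracefully if TensorFlow is not yet installed):

```python
import sys

# embedding-as-service requires Python >= 3.6.
assert sys.version_info >= (3, 6), "embedding-as-service requires Python >= 3.6"

# TensorFlow >= 1.10 is also required; report what is installed, if anything.
try:
    import tensorflow as tf
    print("TensorFlow", tf.__version__)
except ImportError:
    print("TensorFlow not installed; run: pip install 'tensorflow>=1.10'")
```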
Getting Started
1. Initialise the encoder using a supported embedding and model from here
>>> from embedding_as_service.text.encode import Encoder
>>> en = Encoder(embedding='xlnet', model='xlnet_base_cased', download=True)
2. Get sentence token embeddings
>>> vector = en.encode(texts=['hello aman', 'how are you?'])
array([[[ 1.7049843 , 0. , 1.3486509 , ..., -1.3647075 ,
0.6958289 , 1.8013777 ],
...
[ 0.4913215 , 0.60877025, 0.73050433, ..., -0.64490885,
0.8525057 , 0.3080206 ]]], dtype=float32)
>>> vector.shape
(2, 128, 768)
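The returned array has shape (batch_size, max_seq_length, embedding_dim): 2 sentences, each padded to 128 tokens, each token mapped to a 768-dimensional xlnet_base_cased vector. A minimal NumPy sketch with random stand-in values (real embeddings require the model download) makes the indexing semantics concrete:

```python
import numpy as np

# Stand-in for the encoder output above: 2 sentences, padded to 128 tokens,
# 768-dimensional embeddings (random values for illustration only).
batch_size, max_seq_length, embedding_dim = 2, 128, 768
vector = np.random.rand(batch_size, max_seq_length, embedding_dim).astype(np.float32)

print(vector.shape)     # (2, 128, 768)
token_vec = vector[0, 3]  # embedding of the 4th token of the 1st sentence
print(token_vec.shape)  # (768,)
```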
3. Use a pooling strategy; click here for more.
>>> vector = en.encode(texts=['hello aman', 'how are you?'], pooling='mean')
array([[-0.33547154, 0.34566957, 1.1954105 , ..., 0.33702594,
1.0317835 , -0.785943 ],
[-0.3439088 , 0.36881036, 1.0612687 , ..., 0.28851607,
1.1107115 , -0.6253736 ]], dtype=float32)
>>> vector.shape
(2, 768)
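With pooling='mean' the token axis is collapsed, leaving one vector per sentence instead of one per token. Assuming mean pooling is a plain average over the token axis (a common convention; random stand-in values again), the shape change can be sketched with NumPy:

```python
import numpy as np

# Token-level output as in step 2: (batch_size, seq_len, embedding_dim).
token_vectors = np.random.rand(2, 128, 768).astype(np.float32)

# pooling='mean' averages across the token axis (axis=1),
# producing one fixed-length vector per sentence.
pooled = token_vectors.mean(axis=1)
print(pooled.shape)  # (2, 768)
```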
4. Use custom max_seq_length
>>> vectors = en.encode(texts=['hello aman', 'how are you?'], max_seq_length=256)
array([[ 0.48388457, -0.01327741, -0.76577514, ..., -0.54265064,
-0.5564591 , 0.6454179 ],
[ 0.53209245, 0.00526248, -0.71091074, ..., -0.5171917 ,
-0.40458363, 0.6779779 ]], dtype=float32)
>>> vectors.shape
(2, 256, 768)
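max_seq_length sets the token length every sentence is normalised to, which is why the second axis above is now 256. A hypothetical sketch of the usual pad-or-truncate scheme (the library's actual pad token id and truncation rule may differ):

```python
def pad_or_truncate(token_ids, max_seq_length, pad_id=0):
    """Illustrative only: force a token id list to exactly max_seq_length."""
    if len(token_ids) > max_seq_length:
        return token_ids[:max_seq_length]   # truncate long sequences
    # pad short sequences with a pad id (assumed 0 here)
    return token_ids + [pad_id] * (max_seq_length - len(token_ids))

print(pad_or_truncate([101, 7592, 102], 8))  # [101, 7592, 102, 0, 0, 0, 0, 0]
```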
Using Tokenizer
Check Embedding Meta
Supported Embeddings and Models
Here is the list of supported embeddings and their respective models:
| Index | Embedding | Model | Embedding dimensions | Paper |
|---|---|---|---|---|
| 1. | xlnet | xlnet_large_cased | 1024 | link |
| | | xlnet_base_cased | 768 | |
| 2. | bert | bert_base_uncased | 768 | link |
| | | bert_base_cased | 768 | |
| | | bert_multi_cased | 768 | |
| | | bert_large_uncased | 1024 | |
| | | bert_large_cased | 1024 | |
| 3. | elmo | elmo_bi_lm | 512 | link |
| 4. | ulmfit | ulmfit_forward | 300 | link |
| | | ulmfit_backward | 300 | |
| 5. | use | use_dan | 512 | link |
| | | use_transformer_large | 512 | |
| | | use_transformer_lite | 512 | |
| 6. | word2vec | google_news_300 | 300 | link |
| 7. | fasttext | wiki_news_300 | 300 | link |
| | | wiki_news_300_sub | 300 | |
| | | common_crawl_300 | 300 | |
| | | common_crawl_300_sub | 300 | |
| 8. | glove | twitter_200 | 200 | link |
| | | twitter_100 | 100 | |
| | | twitter_50 | 50 | |
| | | twitter_25 | 25 | |
| | | wiki_300 | 300 | |
| | | wiki_200 | 200 | |
| | | wiki_100 | 100 | |
| | | wiki_50 | 50 | |
| | | crawl_42B_300 | 300 | |
| | | crawl_840B_300 | 300 | |
Hashes for embedding_as_service-0.0.7.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | cc01295e6160359c13ea5a65f4e25111a0a1ac53b6eb4faf2be8dda2e7c3c539 |
| MD5 | 696078ec3f99ee5e429c4cacf36e27b6 |
| BLAKE2b-256 | 49ac28b083f4248e8b0c8bfaa1f54d16e80fc456db1abee5b7900a7b4676f33d |

Hashes for embedding_as_service-0.0.7-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 60926264d9f9b9e4bdca57c9cc76407b287ee5034684da863425c7a1608a92a9 |
| MD5 | 21d0b88f8dda6ee7d37328df0953bf56 |
| BLAKE2b-256 | e698299305274abeec0c9315350f4b2d3b9401d77131c6790a0c733173825c67 |