embedding-as-service: one-stop solution to encode sentence to vectors using various embedding methods
embedding-as-service
One-stop solution to encode sentences into fixed-length vectors using various embedding techniques
• Inspired by bert-as-service
What is it • Installation • Getting Started • Supported Embeddings • API
What is it
Encoding/embedding is an upstream task that converts any input (text, image, audio, video, or transactional data) into a fixed-length vector. Embeddings are quite popular in NLP, and researchers have proposed many embedding models in recent years; some of the best known are BERT, XLNet, and word2vec. The goal of this repo is to build a one-stop solution for all available embedding techniques. We are starting with popular text embeddings, and later we aim to add techniques for image, audio, and video inputs as well.
embedding-as-service
helps you encode any given text into a fixed-length vector using the supported embeddings and models.
💾 Installation
You can use embedding-as-service either as a module, or run it as a server and handle queries by installing the client package embedding-as-service-client.
Using embedding-as-service as a module
Install embedding-as-service via pip.
$ pip install embedding-as-service
Note that the code MUST run on Python >= 3.6; the module does not support Python 2.
Using embedding-as-service as a server
Here you also need to install the client module embedding-as-service-client:
$ pip install embedding-as-service # server
$ pip install embedding-as-service-client # client
The client module does not need to run on Python 3.6; it supports both Python 2 and Python 3.
⚡ ️Getting Started
1. Initialise the encoder using a supported embedding and model from here
If using embedding-as-service as a module
>>> from embedding_as_service.text.encode import Encoder
>>> en = Encoder(embedding='bert', model='bert_base_cased')
If using embedding-as-service as a server
# start the server by providing embedding, model, port, max_seq_length[default=256], num_workers[default=4]
$ embedding-as-service-start --embedding bert --model bert_base_cased --port 8080 --max_seq_length 256
>>> from embedding_as_service_client import EmbeddingClient
>>> en = EmbeddingClient(host=<host_server_ip>, port=<host_port>)
2. Get sentence token embeddings
>>> vecs = en.encode(texts=['hello aman', 'how are you?'])
>>> vecs
array([[[ 1.7049843 ,  0.        ,  1.3486509 , ..., -1.3647075 ,
          0.6958289 ,  1.8013777 ],
        ...
        [ 0.4913215 ,  0.60877025,  0.73050433, ..., -0.64490885,
          0.8525057 ,  0.3080206 ]]], dtype=float32)
>>> vecs.shape
(2, 128, 768) # batch x max_sequence_length x embedding_size
3. Use a pooling strategy (click here for more)
Supported Pooling Methods

Strategy | Description
---|---
`None` | No pooling at all; useful when you want word embeddings instead of a sentence embedding. This results in a `[max_seq_len, embedding_size]` encode matrix for a sequence.
`reduce_mean` | Take the average of all token embeddings.
`reduce_min` | Take the minimum of all token embeddings.
`reduce_max` | Take the maximum of all token embeddings.
`reduce_mean_max` | Do `reduce_mean` and `reduce_max` separately, then concatenate them.
`first_token` | Get the token embedding of the first token of a sentence.
`last_token` | Get the token embedding of the last token of a sentence.
>>> vecs = en.encode(texts=['hello aman', 'how are you?'], pooling='reduce_mean')
>>> vecs
array([[-0.33547154,  0.34566957,  1.1954105 , ...,  0.33702594,
         1.0317835 , -0.785943  ],
       [-0.3439088 ,  0.36881036,  1.0612687 , ...,  0.28851607,
         1.1107115 , -0.6253736 ]], dtype=float32)
>>> vecs.shape
(2, 768) # batch x embedding_size
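The pooling strategies above all reduce the token axis of the `(batch, max_seq_length, embedding_size)` output. A minimal NumPy sketch of how they could work (an illustration, not the library's actual implementation):

```python
import numpy as np

def pool(vecs, strategy):
    """Reduce token-level vectors (batch, seq, emb) per the named strategy."""
    if strategy is None:
        return vecs                                    # keep word embeddings: (batch, seq, emb)
    if strategy == 'reduce_mean':
        return vecs.mean(axis=1)                       # (batch, emb)
    if strategy == 'reduce_min':
        return vecs.min(axis=1)
    if strategy == 'reduce_max':
        return vecs.max(axis=1)
    if strategy == 'reduce_mean_max':
        # mean and max separately, then concatenated: (batch, 2 * emb)
        return np.concatenate([vecs.mean(axis=1), vecs.max(axis=1)], axis=1)
    if strategy == 'first_token':
        return vecs[:, 0, :]                           # embedding of the first token
    if strategy == 'last_token':
        return vecs[:, -1, :]                          # embedding of the last token
    raise ValueError(f'unknown pooling strategy: {strategy}')

vecs = np.random.rand(2, 128, 768).astype(np.float32)
print(pool(vecs, 'reduce_mean').shape)      # (2, 768)
print(pool(vecs, 'reduce_mean_max').shape)  # (2, 1536)
```

Note that `reduce_mean_max` doubles the embedding size, since it concatenates two pooled vectors.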
4. Use a custom max_seq_length (default is 128)
>>> en = Encoder(embedding='bert', model='bert_base_cased', max_seq_length=256)
>>> vecs = en.encode(texts=['hello aman', 'how are you?'])
>>> vecs
array([[ 0.48388457, -0.01327741, -0.76577514, ..., -0.54265064,
        -0.5564591 ,  0.6454179 ],
       [ 0.53209245,  0.00526248, -0.71091074, ..., -0.5171917 ,
        -0.40458363,  0.6779779 ]], dtype=float32)
>>> vecs.shape
(2, 256, 768) # batch x max_sequence_length x embedding_size
5. Show embedding tokens
>>> en.tokenize(texts=['hello aman', 'how are you?'])
[['_hello', '_aman'], ['_how', '_are', '_you', '?']]
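In the tokenizer output above, the leading `_` marks the start of a new word (SentencePiece-style subword tokenization). A small hypothetical helper (not part of the library) can glue the subword pieces back into words:

```python
def detokenize(tokens):
    """Rebuild words from subword tokens, where a leading '_' starts a new word."""
    words = []
    for tok in tokens:
        if tok.startswith('_'):
            words.append(tok[1:])   # '_' prefix: start a new word
        elif words:
            words[-1] += tok        # continuation piece: glue to the previous word
        else:
            words.append(tok)       # continuation with no preceding word
    return words

print(detokenize(['_how', '_are', '_you', '?']))  # ['how', 'are', 'you?']
```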
6. Using your own tokenizer
>>> texts = ['hello aman!', 'how are you']
# a naive whitespace tokenizer
>>> tokens = [s.split() for s in texts]
>>> vecs = en.encode(tokens, is_tokenized=True)
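When you pass your own tokens, each sequence still has to fit the encoder's `max_seq_length`. A sketch of the usual convention (an assumption about typical behaviour, not the library's internals): shorter sequences are padded and longer ones truncated.

```python
def pad_tokens(tokens, max_seq_length=128, pad='<pad>'):
    """Truncate to max_seq_length, then right-pad with a pad token."""
    return tokens[:max_seq_length] + [pad] * max(0, max_seq_length - len(tokens))

padded = pad_tokens(['hello', 'aman!'])
print(len(padded))  # 128
```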
📋 API
- class `embedding_as_service.text.encoder.Encoder`

Argument | Type | Default | Description
---|---|---|---
`embedding` | str | Required | Embedding method to be used; check the Embedding column here
`model` | str | Required | Model to be used for the chosen embedding; check the Model column here
`max_seq_length` | int | 128 | Maximum sequence length
- def `embedding_as_service.text.encoder.Encoder.encode`

Argument | Type | Default | Description
---|---|---|---
`texts` | List[str] or List[List[str]] | Required | List of sentences, or list of lists of sentence tokens when `is_tokenized=True`
`pooling` | str | (Optional) | Pooling method to apply; here are the available methods
`is_tokenized` | bool | False | Set to True when tokens are passed for encoding
`batch_size` | int | 128 | Maximum number of sequences handled by the encoder at once; larger batches are partitioned into smaller ones
- def `embedding_as_service.text.encoder.Encoder.tokenize`

Argument | Type | Default | Description
---|---|---|---
`texts` | List[str] | Required | List of sentences
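The `batch_size` behaviour described above (large inputs partitioned into smaller batches) can be sketched in a few lines; this is an illustration of the idea, not the library's code:

```python
def partition(texts, batch_size=128):
    """Split a list of texts into consecutive chunks of at most batch_size."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

batches = partition([f'sentence {i}' for i in range(300)], batch_size=128)
print([len(b) for b in batches])  # [128, 128, 44]
```

Each chunk would then be encoded in its own forward pass, keeping memory use bounded regardless of input size.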
✅ Supported Embeddings and Models
Here is the list of supported embeddings and their respective models.
| Embedding | Model | Embedding dimensions | Paper
---|---|---|---|---
:one: | albert | albert_base | 768 | Read Paper :bookmark:
| | albert_large | 1024 |
| | albert_xlarge | 2048 |
| | albert_xxlarge | 4096 |
:two: | xlnet | xlnet_large_cased | 1024 | Read Paper :bookmark:
| | xlnet_base_cased | 768 |
:three: | bert | bert_base_uncased | 768 | Read Paper :bookmark:
| | bert_base_cased | 768 |
| | bert_multi_cased | 768 |
| | bert_large_uncased | 1024 |
| | bert_large_cased | 1024 |
:four: | elmo | elmo_bi_lm | 512 | Read Paper :bookmark:
:five: | ulmfit | ulmfit_forward | 300 | Read Paper :bookmark:
| | ulmfit_backward | 300 |
:six: | use | use_dan | 512 | Read Paper :bookmark:
| | use_transformer_large | 512 |
| | use_transformer_lite | 512 |
:seven: | word2vec | google_news_300 | 300 | Read Paper :bookmark:
:eight: | fasttext | wiki_news_300 | 300 | Read Paper :bookmark:
| | wiki_news_300_sub | 300 |
| | common_crawl_300 | 300 |
| | common_crawl_300_sub | 300 |
:nine: | glove | twitter_200 | 200 | Read Paper :bookmark:
| | twitter_100 | 100 |
| | twitter_50 | 50 |
| | twitter_25 | 25 |
| | wiki_300 | 300 |
| | wiki_200 | 200 |
| | wiki_100 | 100 |
| | wiki_50 | 50 |
| | crawl_42B_300 | 300 |
| | crawl_840B_300 | 300 |
Credits
This software uses the following open source packages:
Contributors ✨
Thanks goes to these wonderful people (emoji key):
- Aman Srivastava 💻 📖 🚇
- Ashutosh Singh 💻 📖 🚇
- Chirag Jain 💻 📖 🚇
- MrPranav101 💻 📖 🚇
- Dhaval Taunk 💻 📖 🚇
This project follows the all-contributors specification. Contributions of any kind welcome!
Please read the contribution guidelines first.
Citing
If you use embedding-as-service in a scientific publication, we would appreciate a reference to the following BibTeX entry:
@misc{aman2019embeddingservice,
title={embedding-as-service},
author={Srivastava, Aman},
howpublished={\url{https://github.com/amansrivastava17/embedding-as-service}},
year={2019}
}