
embedding-as-service: a one-stop solution to encode sentences into vectors using various embedding methods


embedding-as-service

One-stop solution to encode sentences into fixed-length vectors using various embedding techniques
• Inspired by bert-as-service



What is it

Encoding/embedding is an upstream task that maps inputs such as text, images, audio, video, or transactional data to fixed-length vectors. Embeddings are widely used in NLP, and researchers have proposed many embedding models in recent years; some of the best known are BERT, XLNet, and word2vec. The goal of this repo is to build a one-stop solution for all available embedding techniques. We are starting with popular text embeddings, and later we aim to add techniques for image, audio, and video inputs as well.

embedding-as-service helps you encode any given text into a fixed-length vector using the supported embeddings and models.
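As a conceptual illustration of what "fixed-length" means (this toy code is not part of embedding-as-service), here is a naive hashing-based encoder that maps texts of any length to vectors of one fixed size:

```python
# Toy illustration only: map arbitrary-length text to a fixed-length
# vector by hashing tokens into buckets and counting hits per bucket.
import hashlib
import numpy as np

def toy_encode(text: str, dim: int = 8) -> np.ndarray:
    vec = np.zeros(dim, dtype=np.float32)
    for token in text.split():
        # stable hash of the token, reduced to a bucket index
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

print(toy_encode("hello aman").shape)          # (8,)
print(toy_encode("how are you today").shape)   # (8,) regardless of length
```

Real embedding models replace this hashing trick with learned representations, but the contract is the same: arbitrary-length input, fixed-size output.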

💾 Installation


You can use embedding-as-service as a module, or run it as a server and handle queries by installing the client package embedding-as-service-client.

Using embedding-as-service as module

Install embedding-as-service via pip.

$ pip install embedding-as-service

Note that the code MUST run on Python >= 3.6; the module does not support Python 2.

Using embedding-as-service as a server

Here you also need to install the client module embedding-as-service-client:

$ pip install embedding-as-service # server
$ pip install embedding-as-service-client # client

The client module does not require Python 3.6; it supports both Python 2 and Python 3.

⚡ Getting Started


1. Initialise the encoder using a supported embedding and model from here

If using embedding-as-service as a module

>>> from embedding_as_service.text.encode import Encoder  
>>> en = Encoder(embedding='bert', model='bert_base_cased')  

If using embedding-as-service as a server

# start the server by providing embedding, model, port, max_seq_length [default=256], num_workers [default=4]
$ embedding-as-service-start --embedding bert --model bert_base_cased --port 8080 --max_seq_length 256
>>> from embedding_as_service_client import EmbeddingClient
>>> en = EmbeddingClient(host=<host_server_ip>, port=<host_port>)

2. Get token embeddings for sentences

>>> vecs = en.encode(texts=['hello aman', 'how are you?'])  
>>> vecs  
array([[[ 1.7049843 ,  0.        ,  1.3486509 , ..., -1.3647075 ,  
 0.6958289 ,  1.8013777 ], ... [ 0.4913215 ,  0.60877025,  0.73050433, ..., -0.64490885, 0.8525057 ,  0.3080206 ]]], dtype=float32)  
>>> vecs.shape  
(2, 128, 768) # batch x max_sequence_length x embedding_size  

3. Use a pooling strategy (click here for more)

Supported Pooling Methods

| Strategy | Description |
| --- | --- |
| None | No pooling at all; useful when you want word embeddings instead of a sentence embedding. This results in a [max_seq_len, embedding_size] encode matrix for a sequence. |
| reduce_mean | Take the average of all token embeddings. |
| reduce_min | Take the minimum of all token embeddings. |
| reduce_max | Take the maximum of all token embeddings. |
| reduce_mean_max | Do reduce_mean and reduce_max separately, then concatenate them. |
| first_token | Get the embedding of the first token of a sentence. |
| last_token | Get the embedding of the last token of a sentence. |
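The pooling strategies above can be sketched in plain NumPy on a dummy [max_seq_len, embedding_size] token matrix (an illustration only; the library applies these internally):

```python
import numpy as np

# dummy token matrix: 3 tokens, embedding_size = 2
tokens = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 0.0]])

reduce_mean = tokens.mean(axis=0)   # average over tokens -> [3., 2.]
reduce_min = tokens.min(axis=0)     # elementwise minimum -> [1., 0.]
reduce_max = tokens.max(axis=0)     # elementwise maximum -> [5., 4.]
# concat of mean and max -> length 2 * embedding_size
reduce_mean_max = np.concatenate([reduce_mean, reduce_max])
first_token = tokens[0]             # -> [1., 2.]
last_token = tokens[-1]             # -> [5., 0.]

print(reduce_mean_max)              # [3. 2. 5. 4.]
```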
>>> vecs = en.encode(texts=['hello aman', 'how are you?'], pooling='reduce_mean')  
>>> vecs  
array([[-0.33547154,  0.34566957,  1.1954105 , ...,  0.33702594,  
 1.0317835 , -0.785943  ], [-0.3439088 ,  0.36881036,  1.0612687 , ...,  0.28851607, 1.1107115 , -0.6253736 ]], dtype=float32)  

>>> vecs.shape  
(2, 768) # batch x embedding_size  

4. Use a custom max_seq_length (default is 128)

>>> en = Encoder(embedding='bert', model='bert_base_cased', max_seq_length=256)  
>>> vecs = en.encode(texts=['hello aman', 'how are you?'])  
>>> vecs  
array([[ 0.48388457, -0.01327741, -0.76577514, ..., -0.54265064,  
 -0.5564591 ,  0.6454179 ], [ 0.53209245,  0.00526248, -0.71091074, ..., -0.5171917 , -0.40458363,  0.6779779 ]], dtype=float32)  

>>> vecs.shape  
(2, 256, 768) # batch x max_sequence_length x embedding_size  

5. Show embedding tokens

>>> en.tokenize(texts=['hello aman', 'how are you?'])  
[['_hello', '_aman'], ['_how', '_are', '_you', '?']]  

6. Use your own tokenizer

>>> texts = ['hello aman!', 'how are you']  

# a naive whitespace tokenizer  
>>> tokens = [s.split() for s in texts]  
>>> vecs = en.encode(tokens, is_tokenized=True)  

📋 API


  1. class embedding_as_service.text.encode.Encoder

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| embedding | str | Required | Embedding method to be used; check the Embedding column here |
| model | str | Required | Model to be used for the chosen embedding; check the Model column here |
| max_seq_length | int | 128 | Maximum sequence length |

  2. def embedding_as_service.text.encode.Encoder.encode

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| texts | List[str] or List[List[str]] | Required | List of sentences, or list of lists of sentence tokens when is_tokenized=True |
| pooling | str | (Optional) | Pooling method to apply; here are the available methods |
| is_tokenized | bool | False | Set to True when tokens are passed for encoding |
| batch_size | int | 128 | Maximum number of sequences handled by the encoder at once; larger batches are partitioned into smaller batches |

  3. def embedding_as_service.text.encode.Encoder.tokenize

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| texts | List[str] | Required | List of sentences |
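To illustrate the batch_size behaviour described above, here is a hypothetical chunk helper (our own sketch, not an embedding-as-service API) showing how a list of texts can be partitioned into smaller batches:

```python
from typing import Iterator, List

def chunk(texts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive batches of at most batch_size texts."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

texts = ["sentence %d" % i for i in range(5)]
batches = list(chunk(texts, batch_size=2))
print([len(b) for b in batches])  # [2, 2, 1]
```

The encoder does this partitioning internally, so you can pass more than batch_size texts in a single encode call.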

✅ Supported Embeddings and Models


Here is the list of supported embeddings and their respective models.

| Embedding | Model | Embedding dimensions | Paper |
| --- | --- | --- | --- |
| :one: albert | albert_base | 768 | Read Paper :bookmark: |
| | albert_large | 1024 | |
| | albert_xlarge | 2048 | |
| | albert_xxlarge | 4096 | |
| :two: xlnet | xlnet_large_cased | 1024 | Read Paper :bookmark: |
| | xlnet_base_cased | 768 | |
| :three: bert | bert_base_uncased | 768 | Read Paper :bookmark: |
| | bert_base_cased | 768 | |
| | bert_multi_cased | 768 | |
| | bert_large_uncased | 1024 | |
| | bert_large_cased | 1024 | |
| :four: elmo | elmo_bi_lm | 512 | Read Paper :bookmark: |
| :five: ulmfit | ulmfit_forward | 300 | Read Paper :bookmark: |
| | ulmfit_backward | 300 | |
| :six: use | use_dan | 512 | Read Paper :bookmark: |
| | use_transformer_large | 512 | |
| | use_transformer_lite | 512 | |
| :seven: word2vec | google_news_300 | 300 | Read Paper :bookmark: |
| :eight: fasttext | wiki_news_300 | 300 | Read Paper :bookmark: |
| | wiki_news_300_sub | 300 | |
| | common_crawl_300 | 300 | |
| | common_crawl_300_sub | 300 | |
| :nine: glove | twitter_200 | 200 | Read Paper :bookmark: |
| | twitter_100 | 100 | |
| | twitter_50 | 50 | |
| | twitter_25 | 25 | |
| | wiki_300 | 300 | |
| | wiki_200 | 200 | |
| | wiki_100 | 100 | |
| | wiki_50 | 50 | |
| | crawl_42B_300 | 300 | |
| | crawl_840B_300 | 300 | |
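For convenience, the dimensions in the table can be captured in a small lookup, e.g. to pre-allocate result arrays. This is a partial, hand-copied sketch, not a library API:

```python
# (embedding, model) -> embedding dimension, copied from the table above
# (partial: add the remaining rows as needed)
EMBEDDING_DIMS = {
    ("bert", "bert_base_cased"): 768,
    ("bert", "bert_large_cased"): 1024,
    ("albert", "albert_xxlarge"): 4096,
    ("xlnet", "xlnet_large_cased"): 1024,
    ("word2vec", "google_news_300"): 300,
    ("glove", "twitter_200"): 200,
}

def embedding_dim(embedding: str, model: str) -> int:
    """Return the output dimension for a supported (embedding, model) pair."""
    return EMBEDDING_DIMS[(embedding, model)]

print(embedding_dim("bert", "bert_base_cased"))  # 768
```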


Contributors ✨

Thanks go to these wonderful people (emoji key):

  • Aman Srivastava: 💻 📖 🚇
  • Ashutosh Singh: 💻 📖 🚇
  • Chirag Jain: 💻 📖 🚇
  • MrPranav101: 💻 📖 🚇
  • Dhaval Taunk: 💻 📖 🚇

This project follows the all-contributors specification. Contributions of any kind welcome!

Please read the contribution guidelines first.

Citing


If you use embedding-as-service in a scientific publication, we would appreciate citations using the following BibTeX entry:

@misc{aman2019embeddingservice,
  title={embedding-as-service},
  author={Srivastava, Aman},
  howpublished={\url{https://github.com/amansrivastava17/embedding-as-service}},
  year={2019}
}
