embedding-as-service: one-stop solution to encode sentences into vectors using various embedding methods


embedding-as-service

One-stop solution to encode sentences into fixed-length vectors using various embedding techniques
• Inspired by bert-as-service


What is it

Encoding (embedding) is an upstream task that converts inputs such as text, images, audio, video, or transactional data into fixed-length vectors. Embeddings are widely used in NLP, and researchers have proposed many embedding models in recent years; well-known examples include bert, xlnet, and word2vec. The goal of this repo is to build a one-stop solution for all available embedding techniques. We are starting with popular text embeddings and aim to add techniques for image, audio, and video inputs later.

embedding-as-service helps you encode any given text into a fixed-length vector using the supported embeddings and models.

💾 Installation


You can use embedding-as-service as a module, or run it as a server and handle queries by installing the client package embedding-as-service-client.

Using `embedding-as-service` as module

Install embedding-as-service via pip.

$ pip install embedding-as-service

Note that the code MUST run on Python >= 3.6. The module does not support Python 2!
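
The version requirement above can be checked before installing. A minimal sketch using only the standard library; `MIN_SERVER_PYTHON` and `supports_server` are hypothetical names introduced for illustration and are not part of embedding-as-service:

```python
import sys
from typing import Sequence

# Minimum version stated in the note above.
MIN_SERVER_PYTHON = (3, 6)


def supports_server(version_info: Sequence[int] = sys.version_info) -> bool:
    """Return True if the (major, minor) version meets the stated minimum."""
    return tuple(version_info[:2]) >= MIN_SERVER_PYTHON


if supports_server():
    print("This interpreter meets the Python >= 3.6 requirement.")
else:
    print("embedding-as-service needs Python >= 3.6.")
```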

Using `embedding-as-service` as a server

Here you also need to install the client module, embedding-as-service-client.

$ pip install embedding-as-service # server
$ pip install embedding-as-service-client # client

The client module does not need Python 3.6; it supports both Python 2 and Python 3.

⚡ ️Getting Started


1. Initialise the encoder with a supported embedding and model (see Supported Embeddings and Models)

If using embedding-as-service as a module

>>> from embedding_as_service.text.encode import Encoder  
>>> en = Encoder(embedding='bert', model='bert_base_cased', max_seq_length=256)

If using embedding-as-service as a server

# start the server by providing embedding, model, port, max_seq_length[default=256], num_workers[default=4]
$ embedding-as-service-start --embedding bert --model bert_base_cased --port 8080 --max_seq_length 256

>>> from embedding_as_service_client import EmbeddingClient
>>> en = EmbeddingClient(host=<host_server_ip>, port=<host_port>)

2. Get token embeddings for sentences

>>> vecs = en.encode(texts=['hello aman', 'how are you?'])  
>>> vecs  
array([[[ 1.7049843 ,  0.        ,  1.3486509 , ..., -1.3647075 ,
          0.6958289 ,  1.8013777 ],
        ...
        [ 0.4913215 ,  0.60877025,  0.73050433, ..., -0.64490885,
          0.8525057 ,  0.3080206 ]]], dtype=float32)
>>> vecs.shape  
(2, 128, 768) # batch x max_sequence_length x embedding_size
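
The three-dimensional shape above follows from every sentence occupying the same number of token positions. A minimal sketch of that idea in plain Python, assuming zero-vector padding and a toy 4-dimensional embedding; both are assumptions for illustration, and `toy_embed` and `pad_to_length` are hypothetical helpers, not functions of embedding-as-service:

```python
from typing import List

EMBEDDING_SIZE = 4  # toy size for illustration; real models use e.g. 768


def toy_embed(token: str) -> List[float]:
    """Hypothetical stand-in for a model: derive a small vector from the token."""
    return [float((len(token) + i) % 5) for i in range(EMBEDDING_SIZE)]


def pad_to_length(tokens: List[str], max_seq_length: int) -> List[List[float]]:
    """Embed tokens, truncate to max_seq_length, then pad with zero vectors."""
    vectors = [toy_embed(t) for t in tokens[:max_seq_length]]
    padding = [[0.0] * EMBEDDING_SIZE for _ in range(max_seq_length - len(vectors))]
    return vectors + padding


batch = [["hello", "aman"], ["how", "are", "you", "?"]]
encoded = [pad_to_length(tokens, max_seq_length=6) for tokens in batch]

# Every sentence now has the same shape: batch x max_seq_length x embedding_size
shape = (len(encoded), len(encoded[0]), len(encoded[0][0]))
print(shape)  # (2, 6, 4)
```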

3. Use a pooling strategy

Supported Pooling Methods

Strategy	Description
`None`	no pooling at all; useful when you want word embeddings instead of a sentence embedding. This results in a `[max_seq_len, embedding_size]` encoding matrix for a sequence.
`reduce_mean`	take the average of all token embeddings
`reduce_min`	take the minimum of all token embeddings
`reduce_max`	take the maximum of all token embeddings
`reduce_mean_max`	do `reduce_mean` and `reduce_max` separately and then concatenate the results
`first_token`	get the token embedding of the first token of a sentence
`last_token`	get the token embedding of the last token of a sentence

>>> vecs = en.encode(texts=['hello aman', 'how are you?'], pooling='reduce_mean')  
>>> vecs  
array([[-0.33547154,  0.34566957,  1.1954105 , ...,  0.33702594,
         1.0317835 , -0.785943  ],
       [-0.3439088 ,  0.36881036,  1.0612687 , ...,  0.28851607,
         1.1107115 , -0.6253736 ]], dtype=float32)

>>> vecs.shape  
(2, 768) # batch x embedding_size
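
To make the pooling strategies concrete, here is a minimal sketch that applies each one to a toy token matrix in plain Python. It follows the descriptions in the table only; the function names mirror the strategy names, but this is not the library's implementation, and how the library treats padded positions is not covered here (the sketch assumes there is no padding).

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]  # one row per token, one column per embedding dimension


def reduce_mean(m: Matrix) -> List[float]:
    """Average each embedding dimension over all tokens."""
    return [sum(col) / len(col) for col in zip(*m)]


def reduce_min(m: Matrix) -> List[float]:
    """Minimum of each embedding dimension over all tokens."""
    return [min(col) for col in zip(*m)]


def reduce_max(m: Matrix) -> List[float]:
    """Maximum of each embedding dimension over all tokens."""
    return [max(col) for col in zip(*m)]


def reduce_mean_max(m: Matrix) -> List[float]:
    """Concatenate the mean-pooled and max-pooled vectors."""
    return reduce_mean(m) + reduce_max(m)


def first_token(m: Matrix) -> List[float]:
    return list(m[0])


def last_token(m: Matrix) -> List[float]:
    return list(m[-1])


POOLING: Dict[str, Callable[[Matrix], List[float]]] = {
    "reduce_mean": reduce_mean,
    "reduce_min": reduce_min,
    "reduce_max": reduce_max,
    "reduce_mean_max": reduce_mean_max,
    "first_token": first_token,
    "last_token": last_token,
}

# Three tokens with a toy embedding size of 2.
tokens = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]
pooled = {name: fn(tokens) for name, fn in POOLING.items()}
print(pooled["reduce_mean"])      # [3.0, 2.0]
print(pooled["reduce_mean_max"])  # [3.0, 2.0, 5.0, 4.0]
```

Note that in this sketch `reduce_mean_max` returns a vector twice as long as the others, because two pooled vectors are concatenated.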

4. Show embedding tokens

>>> en.tokenize(texts=['hello aman', 'how are you?'])  
[['_hello', '_aman'], ['_how', '_are', '_you', '?']]

5. Using your own tokenizer

>>> texts = ['hello aman!', 'how are you']  

# a naive whitespace tokenizer  
>>> tokens = [s.split() for s in texts]  
>>> vecs = en.encode(tokens, is_tokenized=True)
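
A whitespace split keeps punctuation attached to words ('aman!' stays one token). As a slightly less naive alternative, the sketch below uses a regular expression to separate punctuation; `regex_tokenize` is a hypothetical helper written for this example, and it only needs to produce the same list-of-token-lists structure that `is_tokenized=True` expects.

```python
import re
from typing import List

# Runs of word characters, or any single character that is neither
# a word character nor whitespace (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def regex_tokenize(texts: List[str]) -> List[List[str]]:
    """Split each sentence into word and punctuation tokens."""
    return [TOKEN_PATTERN.findall(text) for text in texts]


texts = ['hello aman!', 'how are you']
tokens = regex_tokenize(texts)
print(tokens)  # [['hello', 'aman', '!'], ['how', 'are', 'you']]
```

The resulting `tokens` can then be passed to `en.encode(tokens, is_tokenized=True)` exactly as in the snippet above.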

📋 API


class embedding_as_service.text.encode.Encoder

Argument	Type	Default	Description
`embedding`	str	Required	Embedding method to use; see the `Embedding` column under Supported Embeddings and Models
`model`	str	Required	Model to use for the chosen embedding; see the `Model` column under Supported Embeddings and Models
`max_seq_length`	int	128	Maximum sequence length

def embedding_as_service.text.encode.Encoder.encode

Argument	Type	Default	Description
`texts`	List[str] or List[List[str]]	Required	List of sentences, or list of lists of sentence tokens when `is_tokenized=True`
`pooling`	str	(Optional)	Pooling method to apply; see Supported Pooling Methods
`is_tokenized`	bool	`False`	Set to True when tokens are passed for encoding
`batch_size`	int	`128`	Maximum number of sequences handled by the encoder at once; a larger input is partitioned into smaller batches.
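
The `batch_size` behaviour described above, where a large input is split into smaller batches, can be sketched as follows. This is an illustration of the partitioning idea only; `partition` is a hypothetical helper and not the library's internal code.

```python
from typing import Iterator, List, Sequence


def partition(texts: Sequence[str], batch_size: int = 128) -> Iterator[List[str]]:
    """Yield consecutive slices of texts with at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


sentences = [f"sentence {i}" for i in range(300)]
batch_lengths = [len(batch) for batch in partition(sentences, batch_size=128)]
print(batch_lengths)  # [128, 128, 44]
```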

def embedding_as_service.text.encode.Encoder.tokenize

Argument	Type	Default	Description
`texts`	List[str]	Required	List of sentences

✅ Supported Embeddings and Models


Here is the list of supported embeddings and their respective models.

#	Embedding	Model	Embedding dimensions	Paper
1	albert	`albert_base`	768	Read Paper
		`albert_large`	1024
		`albert_xlarge`	2048
		`albert_xxlarge`	4096
2	xlnet	`xlnet_large_cased`	1024	Read Paper
		`xlnet_base_cased`	768
3	bert	`bert_base_uncased`	768	Read Paper
		`bert_base_cased`	768
		`bert_multi_cased`	768
		`bert_large_uncased`	1024
		`bert_large_cased`	1024
4	elmo	`elmo_bi_lm`	512	Read Paper
5	ulmfit	`ulmfit_forward`	300	Read Paper
		`ulmfit_backward`	300
6	use	`use_dan`	512	Read Paper
		`use_transformer_large`	512
		`use_transformer_lite`	512
7	word2vec	`google_news_300`	300	Read Paper
8	fasttext	`wiki_news_300`	300	Read Paper
		`wiki_news_300_sub`	300
		`common_crawl_300`	300
		`common_crawl_300_sub`	300
9	glove	`twitter_200`	200	Read Paper
		`twitter_100`	100
		`twitter_50`	50
		`twitter_25`	25
		`wiki_300`	300
		`wiki_200`	200
		`wiki_100`	100
		`wiki_50`	50
		`crawl_42B_300`	300
		`crawl_840B_300`	300
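
The dimensions in the table determine the size of the arrays returned by `encode`. The sketch below copies a few rows into a lookup and predicts the output shape for a given pooling choice. `EMBEDDING_DIMS` and `expected_shape` are hypothetical names for this example, and the doubling for `reduce_mean_max` is an assumption drawn from its description (mean and max are concatenated), not a documented guarantee.

```python
from typing import Dict, Optional, Tuple

# A few (embedding, model) pairs copied from the table above.
EMBEDDING_DIMS: Dict[Tuple[str, str], int] = {
    ("bert", "bert_base_cased"): 768,
    ("bert", "bert_large_cased"): 1024,
    ("albert", "albert_xxlarge"): 4096,
    ("word2vec", "google_news_300"): 300,
    ("glove", "twitter_25"): 25,
}


def expected_shape(
    embedding: str,
    model: str,
    batch: int,
    max_seq_length: int = 128,
    pooling: Optional[str] = None,
) -> Tuple[int, ...]:
    """Predict the output shape for a batch of sentences."""
    key = (embedding, model)
    if key not in EMBEDDING_DIMS:
        raise ValueError(f"unknown embedding/model pair: {key}")
    dim = EMBEDDING_DIMS[key]
    if pooling is None:
        # No pooling: one vector per token position.
        return (batch, max_seq_length, dim)
    if pooling == "reduce_mean_max":
        # Assumption: concatenating mean and max doubles the vector length.
        return (batch, 2 * dim)
    return (batch, dim)


print(expected_shape("bert", "bert_base_cased", batch=2))                         # (2, 128, 768)
print(expected_shape("bert", "bert_base_cased", batch=2, pooling="reduce_mean"))  # (2, 768)
```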

Credits

This software uses the following open source packages:

Contributors ✨

Thanks goes to these wonderful people (emoji key):

MrPranav101 💻 📖 🚇
Aman Srivastava 💻 📖 🚇
Chirag Jain 💻 📖 🚇
Ashutosh Singh 💻 📖 🚇
Dhaval Taunk 💻 📖 🚇
Alec Koumjian 🐛
Pradeesh 🐛

This project follows the all-contributors specification. Contributions of any kind welcome!

Please read the contribution guidelines first.

Citing


If you use embedding-as-service in a scientific publication, we would appreciate a reference to the following BibTeX entry:

@misc{aman2019embeddingservice,
  title={embedding-as-service},
  author={Srivastava, Aman},
  howpublished={\url{https://github.com/amansrivastava17/embedding-as-service}},
  year={2019}
}
