High-level API for creating sentence and token embeddings
embedders
With embedders, you can easily convert your text into sentence- or token-level embeddings within a few lines of code. Use cases for this include similarity search between texts, information extraction such as named entity recognition, or basic text classification.
How to install
You can install this library either by running pip install embedders, or by cloning this repository and running pip install -r requirements.txt inside the cloned directory.
This library uses spaCy for tokenization; before computing token embeddings, please download the spaCy language model for your language (for example en_core_web_sm for English).
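As a rough sketch of that download step, the snippet below builds spaCy's standard command python -m spacy download <model> and only runs it when asked to. The model name en_core_web_sm is the one used in the token-embedding example in this README; the helper names spacy_download_command and download_spacy_model are hypothetical and not part of embedders.

```python
import subprocess
import sys
from typing import List


def spacy_download_command(model_name: str) -> List[str]:
    """Build the command line for spaCy's own model downloader."""
    # sys.executable keeps the download inside the active Python environment.
    return [sys.executable, "-m", "spacy", "download", model_name]


def download_spacy_model(model_name: str = "en_core_web_sm") -> None:
    """Run the download; requires spaCy to be installed and network access."""
    subprocess.run(spacy_download_command(model_name), check=True)


if __name__ == "__main__":
    # Only print the command here, so the sketch itself has no side effects.
    print(" ".join(spacy_download_command("en_core_web_sm")))
```

Call download_spacy_model() once per environment; afterwards the model name can be passed to the token embedders.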
Caution: This library is currently tested on Python 3 versions up to Python 3.9. If your installation runs into issues, please contact us.
Example
Calculating sentence embeddings
from embedders.classification.contextual import TransformerSentenceEmbedder
from embedders.classification.reduce import PCASentenceReducer
corpus = [
    "I went to Cologne in 2009",
    "My favorite number is 41",
    # ...
]
embedder = TransformerSentenceEmbedder("bert-base-cased")
embeddings = embedder.encode(corpus) # contains a list of shape [num_texts, embedding_dimension]
# if the dimension is too large, you can also apply dimensionality reduction
reducer = PCASentenceReducer(embedder)
embeddings_reduced = reducer.fit_transform(corpus)
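Similarity search, one of the use cases mentioned in the introduction, typically compares such sentence vectors with cosine similarity. The following is a minimal sketch that uses small hand-written vectors in place of real embedder output; cosine_similarity and most_similar are hypothetical helpers, not part of embedders.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(
    query: Sequence[float], embeddings: Sequence[Sequence[float]]
) -> Tuple[int, float]:
    """Return the index and score of the embedding closest to the query."""
    scores = [cosine_similarity(query, emb) for emb in embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy stand-ins for embeddings of shape [num_texts, embedding_dimension].
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
best_index, best_score = most_similar([0.0, 2.0, 0.0], toy_embeddings)
```

With real data, the query vector would come from encoding the query text with the same embedder that produced the corpus embeddings.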
Calculating token embeddings
from embedders.extraction.count_based import CharacterTokenEmbedder
from embedders.extraction.reduce import PCATokenReducer
corpus = [
    "I went to Cologne in 2009",
    "My favorite number is 41",
    # ...
]
embedder = CharacterTokenEmbedder("en_core_web_sm")
embeddings = embedder.encode(corpus) # contains a list of ragged shape [num_texts, num_tokens (text-specific), embedding_dimension]
# if the dimension is too large, you can also apply dimensionality reduction
reducer = PCATokenReducer(embedder)
embeddings_reduced = reducer.fit_transform(corpus)
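Because token embeddings are ragged (each text has its own number of tokens), operations that expect a 2D matrix, such as fitting a PCA, usually need the tokens flattened first and regrouped afterwards. The sketch below shows only that bookkeeping on toy data; flatten_ragged and regroup are hypothetical helpers, and this is not necessarily how PCATokenReducer works internally.

```python
from typing import List, Tuple

Vector = List[float]


def flatten_ragged(ragged: List[List[Vector]]) -> Tuple[List[Vector], List[int]]:
    """Turn [num_texts, num_tokens, dim] into [total_tokens, dim] plus lengths."""
    lengths = [len(tokens) for tokens in ragged]
    flat = [vec for tokens in ragged for vec in tokens]
    return flat, lengths


def regroup(flat: List[Vector], lengths: List[int]) -> List[List[Vector]]:
    """Inverse of flatten_ragged: cut the flat list back into per-text groups."""
    grouped: List[List[Vector]] = []
    start = 0
    for n in lengths:
        grouped.append(flat[start:start + n])
        start += n
    return grouped


# Toy token embeddings: the first text has 3 tokens, the second has 1.
toy = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8]],
]
flat, lengths = flatten_ragged(toy)
restored = regroup(flat, lengths)
```

Any per-token transformation can be applied to flat between the two calls, as long as it keeps the number and order of rows.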
How to contribute
Currently, the best way to contribute is by opening issues for the kinds of transformations you would like to see and by starring this repository :-)
Built Distribution
Hashes for embedders-0.0.4-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 4a73faacd8c5238d454de35bbdfbb769a318064becdef6a5973668e2752cc637
MD5 | 912d1d2df70cf122311dd077b6c22f22
BLAKE2b-256 | aea0a877a1a001db5a5d219d872ba9c911fc7d8660a6981f3dbc7fff9b517ccc