High-level API for creating sentence and token embeddings

⚗️ embedders

With embedders, you can easily convert your texts into sentence- or token-level embeddings within a few lines of code. Use cases for this include similarity search between texts, information extraction such as named entity recognition, or basic text classification.

Prerequisites

This library uses spaCy for tokenization, so please download the spaCy language model that matches the language of your texts before embedding them (for example, en_core_web_sm for English).

Installation

You can install this library either by running $ pip install embedders, or by cloning this repository and running $ pip install -r requirements.txt inside the cloned directory.

A sample installation would be:

$ conda create --name embedders python=3.9
$ conda activate embedders
$ pip install embedders
$ python -m spacy download en_core_web_sm

Usage

Once you have installed the package, you can apply the embedders with a few lines of code, at either the sentence level or the token level.

Sentence embeddings

"Wow, what a cool tool!" is embedded to

[
    2.453, 8.325, ..., 3.863
]
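
One of the use cases mentioned above is similarity search between texts. As a minimal sketch of how sentence vectors like the one shown could be compared, the following uses cosine similarity on made-up three-dimensional vectors. The helper names `cosine_similarity` and `most_similar` are hypothetical and not part of embedders, and real embeddings would have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Return the index of the candidate vector with the highest cosine similarity."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up vectors standing in for real sentence embeddings.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]

# Compare the first vector against the remaining ones.
best = most_similar(embeddings[0], embeddings[1:])
```

Here `best` is an index into `embeddings[1:]`, so the matching text is the one at position `best + 1` in the original list.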

Currently, we provide the following sentence embeddings:

Path	Name	Embeds documents using ...
embedders.classification.contextual	HuggingFaceSentenceEmbedder	large, pre-trained transformers from https://huggingface.co
embedders.classification.contextual	OpenAISentenceEmbedder	large, pre-trained transformers from https://openai.com
embedders.classification.contextual	CohereSentenceEmbedder	large, pre-trained transformers from https://cohere.com
embedders.classification.count_based	BagOfCharsSentenceEmbedder	plain Bag of Characters approach
embedders.classification.count_based	BagOfWordsSentenceEmbedder	plain Bag of Words approach
embedders.classification.count_based	TfidfSentenceEmbedder	Term Frequency - Inverse Document Frequency (TF-IDF) approach
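
To give a feel for what the count-based approaches in the table compute, here is a small from-scratch sketch of Bag of Words and TF-IDF vectors. It is an illustration only, not the library's implementation: the function names, the regex tokenizer, and the unsmoothed `log(N / df)` weighting are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(corpus):
    """Return (vocabulary, count_vectors) with one count vector per text."""
    docs = [Counter(tokenize(text)) for text in corpus]
    vocabulary = sorted(set().union(*docs))
    vectors = [[doc[word] for word in vocabulary] for doc in docs]
    return vocabulary, vectors


def tfidf(corpus):
    """Weight raw counts by log(N / document_frequency) for each word."""
    vocabulary, counts = bag_of_words(corpus)
    n_docs = len(corpus)
    # Number of texts in which each vocabulary word occurs at least once.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]
    idf = [math.log(n_docs / df) for df in doc_freq]
    vectors = [[tf * weight for tf, weight in zip(row, idf)] for row in counts]
    return vocabulary, vectors


corpus = [
    "I went to Cologne in 2009",
    "My favorite number is 41",
]
vocabulary, count_vectors = bag_of_words(corpus)
_, tfidf_vectors = tfidf(corpus)
```

Each text becomes one vector whose length equals the vocabulary size. Words that occur in every text receive a TF-IDF weight of zero under this unsmoothed formula.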

Token embeddings

"Wow, what a cool tool!" is embedded to

[
    [8.453, 1.853, ...],
    [3.623, 2.023, ...],
    [1.906, 9.604, ...],
    [7.306, 2.325, ...],
    [6.630, 1.643, ...],
    [3.023, 4.974, ...]
]

Currently, we provide the following token embeddings:

Path	Name	Embeds documents using ...
embedders.extraction.contextual	TransformerTokenEmbedder	large, pre-trained transformers from https://huggingface.co
embedders.extraction.count_based	BagOfCharsTokenEmbedder	plain Bag of Characters approach
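
The difference from sentence embeddings is that each token gets its own vector. The sketch below illustrates this with a simple Bag of Characters over the lowercase ASCII alphabet. It is not the library's implementation: the regex tokenizer is a stand-in for spaCy, and the function names and the 26-letter vocabulary are assumptions for this example.

```python
import re
import string
from collections import Counter

# Assumed character vocabulary for this sketch: lowercase ASCII letters only.
ALPHABET = string.ascii_lowercase


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def bag_of_chars(token):
    """Count how often each letter of ALPHABET occurs in the token."""
    counts = Counter(token.lower())
    return [counts[ch] for ch in ALPHABET]


def embed_tokens(text):
    """Return one character-count vector per token of the text."""
    return [bag_of_chars(token) for token in simple_tokenize(text)]


tokens = simple_tokenize("Wow, what a cool tool!")
matrix = embed_tokens("Wow, what a cool tool!")
```

The result is a list with one row per token, so its length depends on the tokenizer used. With this simple tokenizer, punctuation marks are separate tokens and map to all-zero rows.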

Choose the embedding category that fits your task. To use an embedder, pick one of the available classes and apply it to your text corpus as follows (shown for sentence embeddings; token embeddings work the same way):

from embedders.classification.contextual import TransformerSentenceEmbedder
from embedders.classification.reduce import PCASentenceReducer

corpus = [
    "I went to Cologne in 2009",
    "My favorite number is 41",
    # ...
]

embedder = TransformerSentenceEmbedder("bert-base-cased")
embeddings = embedder.fit_transform(corpus) # contains a list of shape [num_texts, embedding_dimension]

Sometimes you want to reduce the dimensionality of the embeddings you receive. To do so, you can wrap your embedder in a dimensionality reduction technique.

# if the dimension is too large, you can also apply dimensionality reduction
reducer = PCASentenceReducer(embedder)
embeddings_reduced = reducer.fit_transform(corpus)
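
To illustrate the wrapping idea, here is a self-contained sketch of a reducer that calls a wrapped embedder and then projects the result onto its top principal components. `ToyEmbedder` and `ToyPCAReducer` are hypothetical classes written for this example, not the library's own, and the sketch assumes NumPy is installed.

```python
import numpy as np


class ToyEmbedder:
    """Hypothetical embedder: maps each text to four simple character statistics."""

    def fit_transform(self, corpus):
        return [
            [
                len(text),
                text.count(" "),
                sum(ch.isdigit() for ch in text),
                sum(ch.isupper() for ch in text),
            ]
            for text in corpus
        ]


class ToyPCAReducer:
    """Hypothetical wrapper: embeds texts, then keeps the top principal components."""

    def __init__(self, embedder, n_components=2):
        self.embedder = embedder
        self.n_components = n_components

    def fit_transform(self, corpus):
        X = np.asarray(self.embedder.fit_transform(corpus), dtype=float)
        # PCA operates on mean-centered data.
        self.mean_ = X.mean(axis=0)
        X_centered = X - self.mean_
        # Rows of vt are the principal directions, ordered by explained variance.
        _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return (X_centered @ self.components_.T).tolist()


corpus = [
    "I went to Cologne in 2009",
    "My favorite number is 41",
    "Wow, what a cool tool!",
]
reducer = ToyPCAReducer(ToyEmbedder(), n_components=2)
reduced = reducer.fit_transform(corpus)  # shape [num_texts, n_components]
```

Because the reducer exposes the same `fit_transform` method as the embedder it wraps, calling code does not need to change when reduction is added.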

Currently, we provide the following dimensionality reductions:

Path	Name	Description
embedders.classification.reduce	PCASentenceReducer	Wraps an embedder in a principal component analysis to reduce the dimensionality
embedders.extraction.reduce	PCATokenReducer	Wraps an embedder in a principal component analysis to reduce the dimensionality

Pre-trained embedders

With the growing availability of large, pre-trained models such as those provided by 🤗 Hugging Face, embedding complex sentences in a wide variety of languages and domains has become much more practical. If you want to use a transformer model, pass the configuration string of the respective model, and the matching model is pulled automatically from the 🤗 Hugging Face Hub.

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

1. Fork the Project
2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
3. Commit your Changes (git commit -m 'Add some AmazingFeature')
4. Push to the Branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

And please don't forget to leave a ⭐ if you like the work!

License

Distributed under the Apache 2.0 License. See LICENSE.txt for more information.

Contact

This library is developed and maintained by kern.ai. If you want to provide us with feedback or have some questions, don't hesitate to contact us. We're super happy to help ✌️

Project details

Release history

Version	Release date
0.1.8 (this version)	Aug 14, 2023
0.1.7	Aug 11, 2023
0.1.6	Aug 8, 2023
0.1.5	Jul 20, 2023
0.1.4	May 14, 2023
0.1.3	May 14, 2023
0.1.2	May 14, 2023
0.1.1	May 14, 2023
0.1.0	May 14, 2023
0.0.19	Oct 30, 2022
0.0.18	Oct 20, 2022
0.0.17	Sep 14, 2022
0.0.16	Aug 22, 2022
0.0.15	Aug 11, 2022
0.0.14	Jun 24, 2022
0.0.13	Jun 24, 2022
0.0.12	Jun 9, 2022
0.0.11	May 24, 2022
0.0.10	May 16, 2022
0.0.9	May 16, 2022
0.0.8	May 6, 2022
0.0.7	May 5, 2022
0.0.6	May 1, 2022
0.0.5	May 1, 2022
0.0.4	Apr 28, 2022
0.0.3	Apr 28, 2022
0.0.2	Apr 28, 2022
0.0.1	Apr 27, 2022

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

embedders-0.1.8-py2.py3-none-any.whl (24.3 kB)

File metadata

Download URL: embedders-0.1.8-py2.py3-none-any.whl
Upload date: Aug 14, 2023
Size: 24.3 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.10.3

File hashes

Hashes for embedders-0.1.8-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`ed955ef95592380ca980e5991ec07c35ba7c1850469354bd19096ee6595a1fb2`
MD5	`f512c2074ccfde4818884b9dc1789bd6`
BLAKE2b-256	`0e6f603db1e11518dc8b90bd1d69edd18bda67f8745418abd25f03adfae6486b`