YASEM - Yet Another Splade|Sparse Embedder - A simple and efficient library for SPLADE embeddings
YASEM (Yet Another Splade|Sparse Embedder)
YASEM is a simple and efficient library for running SPLADE (Sparse Lexical and Expansion Model) models and creating sparse vectors. It provides a straightforward interface, inspired by SentenceTransformers, for easy integration into your projects.
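As a rough illustration of what a SPLADE-style sparse vector is, the sketch below pools per-token vocabulary logits into one vector by applying log(1 + ReLU(x)) and then taking the maximum over tokens, which is the pooling used by the max-pooled SPLADE variants. It is not YASEM's implementation: `splade_pool`, the four-entry vocabulary, and the logit values are made up for this example, and a real model produces logits over its full vocabulary.

```python
import math


def splade_pool(token_logits, attention_mask):
    """Collapse per-token vocabulary logits into one sparse vector.

    token_logits: one row per input token, one column per vocabulary entry.
    attention_mask: 1 for real tokens, 0 for padding.
    """
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for row, keep in zip(token_logits, attention_mask):
        if not keep:
            continue  # padding tokens contribute nothing
        for j, logit in enumerate(row):
            # log(1 + ReLU(x)) keeps weights non-negative and dampens large logits
            weight = math.log1p(max(logit, 0.0))
            pooled[j] = max(pooled[j], weight)
    return pooled


# Toy input: 3 token positions (the last one is padding), vocabulary of 4 entries.
logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.0, -3.0, 1.0, -0.2],
    [9.0, 9.0, 9.0, 9.0],  # padding row, masked out below
]
mask = [1, 1, 0]

vector = splade_pool(logits, mask)
nonzero = {j: round(w, 4) for j, w in enumerate(vector) if w > 0}
print(nonzero)
# {0: 1.0986, 2: 0.6931, 3: 0.4055}
```

Most entries end up exactly zero because negative logits are clipped, which is what makes the resulting vector sparse.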
Why YASEM?
- Simplicity: YASEM focuses on providing a clean and simple implementation of SPLADE without unnecessary complexity.
- Efficiency: Generate sparse embeddings quickly and easily.
- Flexibility: Works with both NumPy and PyTorch backends.
- Convenience: Includes helpful utilities like get_token_values for inspecting feature representations.
Installation
You can install YASEM using pip:
pip install yasem
Quick Start
Here's a simple example of how to use YASEM:
from yasem import SpladeEmbedder
# Initialize the embedder
embedder = SpladeEmbedder("naver/splade-v3")
# Prepare some sentences
sentences = [
"Hello, my dog is cute",
"Hello, my cat is cute",
"Hello, I like a ramen",
"Hello, I like a sushi",
]
# Generate embeddings
embeddings = embedder.encode(sentences)
# Or return a scipy.sparse CSR matrix instead:
# embeddings = embedder.encode(sentences, convert_to_csr_matrix=True)
# Compute similarity
similarity = embedder.similarity(embeddings, embeddings)
print(similarity)
# [[148.62903569 106.88184372 18.86930016 22.87525314]
# [106.88184372 122.79656474 17.45339064 21.44758757]
# [ 18.86930016 17.45339064 61.00272733 40.92700849]
# [ 22.87525314 21.44758757 40.92700849 73.98511539]]
# Inspect token values for the first sentence
token_values = embedder.get_token_values(embeddings[0])
print(token_values)
# {'hello': 6.89453125, 'dog': 6.48828125, 'cute': 4.6015625,
# 'message': 2.38671875, 'greeting': 2.259765625,
# ...
# Inspect token values for the fourth sentence
token_values = embedder.get_token_values(embeddings[3])
print(token_values)
# {'##shi': 3.63671875, 'su': 3.470703125, 'eat': 3.25,
# 'hello': 2.73046875, 'you': 2.435546875, 'like': 2.26953125, 'taste': 1.8203125,
# ...
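To make the token inspection above more concrete, here is a minimal sketch of how a sparse vector can be turned into a token-to-weight mapping, assuming each vector dimension corresponds to one vocabulary entry. The helper `token_values_from_vector`, the five-token vocabulary, and the weights are hypothetical and only mirror the shape of the output shown above; YASEM's own `get_token_values` may work differently internally.

```python
def token_values_from_vector(vector, id_to_token):
    """Map each nonzero dimension to its token, highest weight first."""
    pairs = [(id_to_token[i], w) for i, w in enumerate(vector) if w > 0]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return dict(pairs)  # dicts preserve insertion order, so the ranking is kept


# Toy vocabulary and weights, invented for this example.
id_to_token = ["hello", "dog", "cat", "cute", "greeting"]
vector = [6.9, 6.5, 0.0, 4.6, 2.3]

print(token_values_from_vector(vector, id_to_token))
# {'hello': 6.9, 'dog': 6.5, 'cute': 4.6, 'greeting': 2.3}
```

Tokens with zero weight ("cat" here) are dropped, so the mapping lists only the terms that actually contribute to the embedding.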
rank API
# Rank documents based on query
query = "What programming language is best for machine learning?"
documents = [
"Python is widely used in machine learning due to its extensive libraries like TensorFlow and PyTorch",
"JavaScript is primarily used for web development and front-end applications",
"SQL is essential for database management and data manipulation"
]
# Get ranked results with relevance scores
results = embedder.rank(query, documents)
print(results)
# [
# {'corpus_id': 0, 'score': 12.453}, # Python/ML document ranks highest
# {'corpus_id': 2, 'score': 5.234},
# {'corpus_id': 1, 'score': 3.123}
# ]
# Get ranked results including document text
results = embedder.rank(query, documents, return_documents=True)
print(results)
# [
# {
# 'corpus_id': 0,
# 'score': 12.453,
# 'text': 'Python is widely used in machine learning due to its extensive libraries like TensorFlow and PyTorch'
# },
# {
# 'corpus_id': 2,
# 'score': 5.234,
# 'text': 'SQL is essential for database management and data manipulation'
# },
# ...
# ]
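The ranking above can be pictured as scoring each document against the query with a dot product and sorting by score. The sketch below shows that idea on hand-written sparse vectors stored as `{token: weight}` dicts. The functions `sparse_dot` and `rank_documents` and all weights are assumptions for illustration; only the result keys (`corpus_id`, `score`, `text`) are taken from the output shown above.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum((w * b[t] for t, w in a.items() if t in b), 0.0)


def rank_documents(query_vec, doc_vecs, documents=None):
    """Return documents sorted by descending score, optionally with their text."""
    results = []
    for i, doc_vec in enumerate(doc_vecs):
        entry = {"corpus_id": i, "score": sparse_dot(query_vec, doc_vec)}
        if documents is not None:
            entry["text"] = documents[i]
        results.append(entry)
    results.sort(key=lambda r: r["score"], reverse=True)
    return results


# Toy sparse vectors, invented for this example.
query_vec = {"python": 2.0, "learning": 1.5}
doc_vecs = [
    {"python": 1.0, "learning": 2.0},
    {"javascript": 3.0},
    {"sql": 2.0, "learning": 0.5},
]
documents = ["Python for ML", "JavaScript for the web", "SQL for data"]

results = rank_documents(query_vec, doc_vecs)
print(results)
# [{'corpus_id': 0, 'score': 5.0}, {'corpus_id': 2, 'score': 0.75}, {'corpus_id': 1, 'score': 0.0}]
```

Passing `documents` as the third argument adds a `text` field to each entry, similar in spirit to `return_documents=True`.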
Features
- Easy-to-use API inspired by SentenceTransformers
- Support for both NumPy and scipy.sparse.csr_matrix
- Efficient dot product similarity computation
- Utility function to inspect token values in embeddings
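To show why a CSR matrix suits embeddings that are mostly zeros, the following sketch stores a small dense array as a `scipy.sparse.csr_matrix` and computes pairwise dot-product similarity on the sparse form. The array values are invented, and this uses NumPy and SciPy directly rather than YASEM, so treat it as an illustration of the data structure, not of the library's internals.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Three toy embeddings over a six-entry vocabulary; half the entries are zero.
dense = np.array([
    [6.9, 6.5, 0.0, 4.6, 0.0, 0.0],
    [6.7, 0.0, 6.1, 4.4, 0.0, 0.0],
    [2.7, 0.0, 0.0, 0.0, 3.5, 3.6],
])

# CSR keeps only the nonzero values plus their column indices and row offsets.
sparse = csr_matrix(dense)
print(sparse.nnz, "of", dense.size, "entries stored")
# 9 of 18 entries stored

# Pairwise dot-product similarity computed on the sparse representation.
similarity = (sparse @ sparse.T).toarray()

# The sparse result matches the dense computation.
assert np.allclose(similarity, dense @ dense.T)
print(similarity.round(2))
```

With a real vocabulary of tens of thousands of entries and only a few hundred nonzero weights per text, the storage saving is far larger than in this toy case.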
License
This project is licensed under the MIT License. See the LICENSE file for the full license text. Copyright (c) 2024 Yuichi Tateno (@hotchpotch)
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Acknowledgements
This library is inspired by the SPLADE model and aims to provide a simple interface for its usage. Special thanks to the authors of the original SPLADE paper and the developers of the model.
Download files
File details
Details for the file yasem-0.4.0.tar.gz.
File metadata
- Download URL: yasem-0.4.0.tar.gz
- Upload date:
- Size: 9.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.4.30
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bc65df2740562ce5c34c4fc4c435bd42e2865f4185bbf1798be9555dd37bd24e |
| MD5 | 8102433a491849e33f8ab7191b306522 |
| BLAKE2b-256 | 4378333a71c2e9af2619b639a8b0a99951120c9b69a6007cb07b7f623d86ee79 |
File details
Details for the file yasem-0.4.0-py3-none-any.whl.
File metadata
- Download URL: yasem-0.4.0-py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.4.30
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9ffc9cb10323ce700cafa601c6e687668cf4abbb50e5d29a34df93fe2ebb4532 |
| MD5 | a4aee32dad93ef40f75a231015eca481 |
| BLAKE2b-256 | d0fe5abca4d20b6cdea352cabfd59516e06d14237e7b89611d6f9b651b968db2 |