YASEM - Yet Another Splade|Sparse Embedder - A simple and efficient library for SPLADE embeddings

YASEM (Yet Another Splade|Sparse Embedder)

YASEM is a simple and efficient library for running SPLADE (Sparse Lexical and Expansion Model for Information Retrieval) models and creating sparse vectors. Its interface is inspired by SentenceTransformers, so it is easy to integrate into your projects.
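
For readers new to SPLADE, the sketch below shows the pooling step commonly used by SPLADE models: apply log(1 + ReLU(x)) to the masked-language-model logits, then take the maximum over token positions to get one weight per vocabulary entry. It uses plain NumPy with made-up logits; `splade_max_pool` is a hypothetical helper for illustration, not part of YASEM, and YASEM's internals may differ.

```python
import numpy as np

def splade_max_pool(logits, attention_mask):
    """Illustrative SPLADE-style pooling (not YASEM's code).

    logits: array of shape (seq_len, vocab_size)
    attention_mask: array of shape (seq_len,), 1 for real tokens, 0 for padding
    """
    activated = np.log1p(np.maximum(logits, 0.0))    # log(1 + ReLU(x))
    activated = activated * attention_mask[:, None]  # zero out padding rows
    return activated.max(axis=0)                     # max over token positions

# Toy input: 3 token positions (the last is padding), vocabulary of 4 terms.
logits = np.array([
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 0.0, 3.0],
    [9.0,  9.0, 9.0, 9.0],  # padding row, ignored through the mask
])
mask = np.array([1, 1, 0])

vec = splade_max_pool(logits, mask)
print(vec)  # most entries of a real SPLADE vector are zero
```

Negative logits become exact zeros, which is what makes the resulting vectors sparse.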

Why YASEM?

  • Simplicity: YASEM focuses on providing a clean and simple implementation of SPLADE without unnecessary complexity.
  • Efficiency: Generate sparse embeddings quickly and easily.
  • Flexibility: Works with both NumPy and PyTorch backends.
  • Convenience: Includes helpful utilities like get_token_values for inspecting feature representations.

Installation

You can install YASEM using pip:

pip install yasem

Quick Start

Here's a simple example of how to use YASEM:

from yasem import SpladeEmbedder

# Initialize the embedder
embedder = SpladeEmbedder("naver/splade-v3")

# Prepare some sentences
sentences = [
    "Hello, my dog is cute",
    "Hello, my cat is cute",
    "Hello, I like a ramen",
    "Hello, I like a sushi",
]

# Generate embeddings
embeddings = embedder.encode(sentences)
# Or return a scipy.sparse.csr_matrix instead
# embeddings = embedder.encode(sentences, convert_to_csr_matrix=True)

# Compute similarity
similarity = embedder.similarity(embeddings, embeddings)
print(similarity)
# [[148.62903569 106.88184372  18.86930016  22.87525314]
#  [106.88184372 122.79656474  17.45339064  21.44758757]
#  [ 18.86930016  17.45339064  61.00272733  40.92700849]
#  [ 22.87525314  21.44758757  40.92700849  73.98511539]]


# Inspect token values for the first sentence
token_values = embedder.get_token_values(embeddings[0])
print(token_values)
# {'hello': 6.89453125, 'dog': 6.48828125, 'cute': 4.6015625,
#  'message': 2.38671875, 'greeting': 2.259765625,
#    ...

# Inspect token values for the fourth sentence
token_values = embedder.get_token_values(embeddings[3])
print(token_values)
# {'##shi': 3.63671875, 'su': 3.470703125, 'eat': 3.25,
#  'hello': 2.73046875, 'you': 2.435546875, 'like': 2.26953125, 'taste': 1.8203125,
#    ...
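
To make the output above easier to read, here is a minimal sketch of the idea behind inspecting token values: take the nonzero positions of a sparse vector, look up each position in the tokenizer vocabulary, and sort by weight. The function `token_values_sketch` and the five-word vocabulary are hypothetical and only for illustration; this is not YASEM's implementation of get_token_values.

```python
import numpy as np

def token_values_sketch(vector, id_to_token):
    """Map nonzero vector positions to tokens, highest weight first."""
    nonzero = np.flatnonzero(vector)
    pairs = [(id_to_token[i], float(vector[i])) for i in nonzero]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return dict(pairs)  # dicts keep insertion order in Python 3.7+

# Toy vocabulary and vector; a real model has tens of thousands of entries.
toy_vocab = ["hello", "dog", "cat", "cute", "greeting"]
toy_vector = np.array([6.5, 6.0, 0.0, 4.5, 2.25])

result = token_values_sketch(toy_vector, toy_vocab)
print(result)
```

Terms such as 'greeting' or 'eat' in the real output come from SPLADE's term expansion: the model assigns weight to vocabulary entries that never appear in the input text.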

Rank API

# Rank documents against a query (reuses the embedder from Quick Start)
query = "What programming language is best for machine learning?"
documents = [
    "Python is widely used in machine learning due to its extensive libraries like TensorFlow and PyTorch",
    "JavaScript is primarily used for web development and front-end applications",
    "SQL is essential for database management and data manipulation",
]

# Get ranked results with relevance scores
results = embedder.rank(query, documents)
print(results)
# [
#   {'corpus_id': 0, 'score': 12.453},  # Python/ML document ranks highest
#   {'corpus_id': 2, 'score': 5.234},
#   {'corpus_id': 1, 'score': 3.123}
# ]

# Get ranked results including document text
results = embedder.rank(query, documents, return_documents=True)
print(results)
# [
#   {
#     'corpus_id': 0,
#     'score': 12.453,
#     'text': 'Python is widely used in machine learning due to its extensive libraries like TensorFlow and PyTorch'
#   },
#   {
#     'corpus_id': 2,
#     'score': 5.234,
#     'text': 'SQL is essential for database management and data manipulation'
#   },
#   ...
# ]
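
The result shape above can be reproduced with a few lines of NumPy. The sketch below scores each document by its dot product with the query vector and sorts in descending order. `rank_sketch` and the toy vectors are hypothetical; the code only mirrors the output format shown above and is not YASEM's rank implementation.

```python
import numpy as np

def rank_sketch(query_vec, doc_matrix, documents=None):
    """Rank documents by dot-product score, highest first."""
    scores = doc_matrix @ query_vec             # one score per document
    order = np.argsort(-scores, kind="stable")  # descending by score
    results = []
    for idx in order:
        item = {"corpus_id": int(idx), "score": float(scores[idx])}
        if documents is not None:
            item["text"] = documents[idx]       # optionally attach the text
        results.append(item)
    return results

# Toy 3-term vocabulary; real SPLADE vectors are much wider and sparser.
query_vec = np.array([1.0, 0.0, 2.0])
doc_matrix = np.array([
    [0.0, 3.0, 0.5],
    [2.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
])

ranked = rank_sketch(query_vec, doc_matrix)
print(ranked)
```

Here `corpus_id` is the document's position in the input list, so results can be mapped back to the original documents even when the text is not returned.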

Features

  • Easy-to-use API inspired by SentenceTransformers
  • Support for both NumPy and scipy.sparse.csr_matrix
  • Efficient dot product similarity computation
  • Utility function to inspect token values in embeddings

License

This project is licensed under the MIT License. See the LICENSE file for the full license text. Copyright (c) 2024 Yuichi Tateno (@hotchpotch)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgements

This library is inspired by the SPLADE model and aims to provide a simple interface for its usage. Special thanks to the authors of the original SPLADE paper and the developers of the model.

Download files

Source Distribution

yasem-0.4.1.tar.gz (9.1 kB view details)

Uploaded Source

Built Distribution

yasem-0.4.1-py3-none-any.whl (7.3 kB view details)

Uploaded Python 3

File details

Details for the file yasem-0.4.1.tar.gz.

File metadata

  • Download URL: yasem-0.4.1.tar.gz
  • Upload date:
  • Size: 9.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.4.30

File hashes

Hashes for yasem-0.4.1.tar.gz
Algorithm Hash digest
SHA256 db1b57feb4d8f4ca013954c2ea2167f4de2f66cc652ee0de0ce59cd24172eb1b
MD5 2166784a8a0cd7216ed87f2a9da633bc
BLAKE2b-256 5c3739186e0ee0f8a9acb50e7bec1058d7291e7fac827719f8ed9fdb7e27429d

File details

Details for the file yasem-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: yasem-0.4.1-py3-none-any.whl
  • Upload date:
  • Size: 7.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.4.30

File hashes

Hashes for yasem-0.4.1-py3-none-any.whl
Algorithm Hash digest
SHA256 86a59e6251ab82f029ff7f44cb4affe841f3f7d956b7e2fc24ffcd33e1b6df02
MD5 e2a659d6738e8fa6ae171495705cb41c
BLAKE2b-256 0d580c4c68b33a9a5773b62b272f6b3dc392cab01b25a2f18e92d2409be4c6c9
