# Endee Model

A Python library for generating sparse text embeddings using the BM25 algorithm. Designed for integration with vector databases to enable efficient keyword-based search alongside dense embeddings.
## Installation

```shell
pip install endee-model
```
## Quick Start

```python
from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")

documents = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning enables computers to learn from data",
]

for embedding in model.embed(documents):
    print(embedding.as_dict())  # {token_id: weight, ...}
```
## Usage

### Embed Documents

```python
from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")
documents = ["first document text", "second document text"]

# Returns a generator; iterate to get SparseEmbedding objects
for embedding in model.embed(documents, batch_size=256):
    sparse_dict = embedding.as_dict()    # {int: float}
    sparse_obj = embedding.as_object()   # {'indices': array, 'values': array}
```
### Embed Queries

```python
query = "search query text"

for embedding in model.query_embed(query):
    print(embedding.as_dict())
```
### Count Tokens

```python
count = model.token_count("some text here")
print(f"Token count: {count}")
```
### Work with SparseEmbedding Directly

```python
from endee_model import SparseEmbedding

# Create from a {token_id: weight} dictionary
embedding = SparseEmbedding.from_dict({100: 0.5, 200: 0.8, 300: 1.2})

embedding.as_dict()    # {100: 0.5, 200: 0.8, 300: 1.2}
embedding.as_object()  # {'indices': array([100, 200, 300]), 'values': array([0.5, 0.8, 1.2])}
```
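To illustrate how the two views relate, here is a minimal sketch of the dict-to-object conversion, using plain Python lists in place of arrays (`dict_to_object` is a hypothetical helper for illustration, not part of the library):

```python
def dict_to_object(weights):
    """Convert a {token_id: weight} dict into parallel indices/values lists,
    sorted by token id so the two representations line up deterministically."""
    items = sorted(weights.items())
    return {
        "indices": [token_id for token_id, _ in items],
        "values": [weight for _, weight in items],
    }

obj = dict_to_object({300: 1.2, 100: 0.5, 200: 0.8})
print(obj)  # {'indices': [100, 200, 300], 'values': [0.5, 0.8, 1.2]}
```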
## Configuration

### SparseModel Parameters

| Parameter | Default | Description |
|---|---|---|
| `model_name` | required | Model identifier (use `"endee/bm25"`) |
| `cache_dir` | `None` | Custom cache directory (see Cache) |
| `k` | `1.2` | BM25 saturation parameter; controls term frequency saturation |
| `b` | `0.75` | Length normalization factor (0 = none, 1 = full) |
| `language` | `"english"` | Language for the Snowball stemmer |
| `max_token_len` | `40` | Tokens longer than this are discarded |
| `disable_stemmer` | `False` | Skip stemming (supports more languages via NLTK stopwords only) |
```python
model = SparseModel(
    model_name="endee/bm25",
    k=1.5,
    b=0.8,
    language="english",
)
```
## Available Languages

```python
from endee_model.sparse.bm25 import bm25_languages

print(bm25_languages())  # List of supported Snowball stemmer languages
```
## Cache

NLTK resources and model files are cached locally. The cache location is resolved in this order:

1. `cache_dir` argument passed to `SparseModel`
2. `ENDEE_CACHE_PATH` environment variable
3. Default: `{system_tmp}/endee_cache`

```shell
export ENDEE_CACHE_PATH=/path/to/custom/cache
```
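The resolution order above can be sketched as follows (a simplified illustration; `resolve_cache_dir` is a hypothetical helper, not the library's actual internals):

```python
import os
import tempfile

def resolve_cache_dir(cache_dir=None):
    """Resolve the cache directory: explicit argument, then env var, then default."""
    if cache_dir is not None:
        # 1. cache_dir argument passed explicitly
        return cache_dir
    env_path = os.environ.get("ENDEE_CACHE_PATH")
    if env_path:
        # 2. ENDEE_CACHE_PATH environment variable
        return env_path
    # 3. Default: {system_tmp}/endee_cache
    return os.path.join(tempfile.gettempdir(), "endee_cache")
```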
How It Works
-
Normalization — punctuation is stripped, stopwords removed, oversized tokens discarded
-
Stemming — tokens are reduced to stems using the Snowball stemmer (optional)
-
BM25 weights — term-frequency weights are computed using the BM25 TF formula:
tf_weight = tf * (k + 1) / (tf + k * (1 - b + b * (doc_len / avg_len)))
Note: BM25 IDF weighting must be applied on the vector index side. This library outputs TF weights only.
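The TF formula can be checked numerically. A standalone sketch with the default `k=1.2`, `b=0.75` (an illustration of the formula, not the library's code):

```python
def bm25_tf_weight(tf, doc_len, avg_len, k=1.2, b=0.75):
    """BM25 term-frequency weight: grows sublinearly in tf and
    saturates toward k + 1 for an average-length document."""
    return tf * (k + 1) / (tf + k * (1 - b + b * (doc_len / avg_len)))

# For a document of average length, doubling tf does not double the weight:
for tf in (1, 2, 4, 100):
    print(tf, round(bm25_tf_weight(tf, doc_len=100, avg_len=100.0), 3))
# 1 1.0
# 2 1.375
# 4 1.692
# 100 2.174   (approaching the k + 1 = 2.2 ceiling)
```

Note that `b` controls the length penalty: with the same `tf`, a document twice the average length gets a smaller weight than an average-length one.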
## License

MIT License