
Sentence embeddings (WISSE) with an SBERT-like API for downstream NLP applications.

Project description

WISSE — Sentence embeddings

Sentence embeddings via entropy-weighted series (TF-IDF–weighted word embeddings). No language or knowledge resources required. Python 3.8+.

SBERT-like API: encode() and similarity(). The default model keys (wisse-fasttext-300, wisse-idf-en) point to the Hugging Face repo; once the paper’s Wikipedia FastText and TF-IDF assets are uploaded there (a one-time step, see Uploading assets), they are downloaded automatically on first use to ~/.wisse. Until then, pass local paths (e.g. after downloading from MEGA).


Quick start

With assets on Hugging Face (after one-time upload of the paper assets):

from wisse import SentenceEmbedding

model = SentenceEmbedding()  # downloads to ~/.wisse on first use

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)  # shape (3, 300)
sim = model.similarity(embeddings, embeddings)

With local paths (e.g. after downloading from MEGA):

model = SentenceEmbedding(
    model_name_or_path="/path/to/indexed_fasttext/",
    idf_name_or_path="/path/to/idf-en.pkl",
)
embeddings = model.encode(["First sentence.", "Second sentence."])

Similarity: "cosine", "dot", "euclidean", "manhattan".

from wisse import similarity

s = similarity(embeddings, embeddings, similarity_fn="cosine")

Installation

pip install wisse_sentence

Or, from a clone of the repository:

pip install -e .

Requirements: Python ≥3.8, numpy, scikit-learn, gensim, joblib, requests.


Default models (Hugging Face)

The package expects the Wikipedia-trained FastText (300d) and TF-IDF weights to live in the Hugging Face repo. Once those files are uploaded (see Uploading assets), they are downloaded automatically on first use to ~/.wisse (or $WISSE_HOME).

  • Registry keys: wisse-fasttext-300, wisse-idf-en (and optionally wisse-glove-300).
  • Override via environment variables: WISSE_HF_REPO and WISSE_HF_REPO_TYPE (alternate repo), or WISSE_FASTTEXT_URL, WISSE_IDF_URL, WISSE_EMBEDDING_URL (direct URLs); see the sketch below.
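
For example, a minimal sketch of redirecting asset resolution to another repo via the variables above. The repo id is hypothetical, and the exact point at which wisse reads these variables is an assumption, so set them before constructing the model:

import os

# Point asset resolution at a different Hugging Face repo (assumed to be read
# by wisse's download module when the model is constructed).
os.environ["WISSE_HF_REPO"] = "my-org/my-wisse-assets"  # hypothetical repo id
os.environ["WISSE_HF_REPO_TYPE"] = "dataset"

from wisse import SentenceEmbedding

model = SentenceEmbedding()  # resolves wisse-fasttext-300 / wisse-idf-en against the repo above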

Repository layout

Clean Python package layout:

sentence_embedding/
├── wisse/                 # Package
│   ├── __init__.py
│   ├── wisse.py           # Core: TF-IDF weighting, vector_space, keyed2indexed
│   ├── model.py           # SentenceEmbedding (SBERT-like API)
│   ├── similarity.py      # Pairwise similarity helpers
│   ├── download.py        # HF registry and autodownload
│   └── cli.py             # wisse-encode, keyed2indexed entry points
├── tests/                 # Pytest suite
├── hf_model/              # Hugging Face model card (README.md) and push script
├── setup.py
├── pyproject.toml
├── requirements.txt
├── run_tests.py
├── keyed2indexed.py      # Standalone script (or use CLI after install)
├── LICENSE
├── .gitignore
└── README.md

CLI

After pip install:

wisse-encode --input sentences.txt --output vectors.npy
wisse-encode --input sentences.txt --output out.npy --model wisse-fasttext-300 --idf wisse-idf-en
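
The file written by --output can be loaded back with NumPy for downstream use. A small sketch, assuming wisse-encode writes a standard .npy array, as the extension suggests:

import numpy as np

# One row per input sentence; 300 columns with the default FastText model.
vectors = np.load("vectors.npy")
print(vectors.shape)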

keyed2indexed --input model.bin --output output_indexed
keyed2indexed --input model.vec --txt --output output_indexed

wisse-train --wikipedia en --idf-out idf-en.pkl --embeddings-out fasttext-300-indexed
wisse-train --corpus-dir ./my_texts --document-unit paragraph --idf-out idf.pkl

Train IDF + FastText (new operating mode): build both from a directory of plain text files or from Wikipedia (via Hugging Face). This produces a WISSE-ready IDF pickle and indexed FastText embeddings; see the usage sketch after the option list. Wikipedia mode requires pip install ".[train]" (which adds datasets).

  • --corpus-dir PATH — directory of plain text files, or
  • --wikipedia LANG — e.g. en, es (downloads from HF wikimedia/wikipedia).
  • --document-unit article|paragraph — one doc per file/article vs per paragraph.
  • --idf-out, --embeddings-out — explicit output paths (defaults: idf-<lang>.pkl, fasttext-300-indexed).
  • --binary-out PATH — optionally save FastText in Word2Vec binary format.
  • --dim, --window, --min-count, --epochs — paper defaults (300, 5, 5, 5), all configurable.
  • --cap-articles, --cap-tokens — optional cap with efficient random sampling; default for Wikipedia: 500k articles / 100M tokens.
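
The training outputs can then be passed straight to SentenceEmbedding as local paths, using the same constructor arguments as in the Quick start (the file names below are the defaults listed above):

from wisse import SentenceEmbedding

# Artifacts produced by, e.g.:
#   wisse-train --wikipedia en --idf-out idf-en.pkl --embeddings-out fasttext-300-indexed
model = SentenceEmbedding(
    model_name_or_path="fasttext-300-indexed/",
    idf_name_or_path="idf-en.pkl",
)
embeddings = model.encode(["A sentence embedded with freshly trained weights."])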

From repo without installing:

python keyed2indexed.py --input model.bin --output output_indexed

Low-level usage

Convert word2vec to indexed format and use the WISSE combiner:

import wisse
from gensim.models.keyedvectors import KeyedVectors

# Load pretrained word2vec/FastText vectors and convert them to WISSE's
# on-disk indexed format (one-time step).
kv = KeyedVectors.load_word2vec_format("/path/to/embeddings.bin", binary=True)
wisse.keyed2indexed(kv, "/path/to/output_dir/")

# Dictionary-like lookup over the indexed vectors.
embedding = wisse.vector_space("/path/to/output_dir/")
# embedding["word"] → array

import pickle

# TF-IDF vectorizer trained offline (e.g. with wisse-train).
with open("/path/to/tfidf.pkl", "rb") as f:
    vectorizer = pickle.load(f)

# Weight each word vector by its TF-IDF score and sum the weighted vectors
# into a single sentence vector.
w = wisse.wisse(embedding, vectorizer=vectorizer, tf_tfidf=True, combiner="sum", generate=True)
vec = w.infer_sentence("this is a sentence")
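
Continuing the snippet above, per-sentence vectors can be stacked into a matrix; this sketch assumes infer_sentence returns a 1-D NumPy array, as vec is used above:

import numpy as np

# (n_sentences, dim) matrix built with the low-level combiner.
sentences = ["this is a sentence", "this is another sentence"]
matrix = np.vstack([w.infer_sentence(s) for s in sentences])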

Testing

pip install -e ".[dev]"
pytest tests/ -v
# or
python run_tests.py

Test                                     Description                                           Needs real pretrained assets?
test_00_install                          Package and public API import                         No
test_01_toy_tfidf_fasttext_and_helpers   Toy TF-IDF FastText + all helpers (synthetic data)    No
test_02_paper_wikipedia_assets           Paper’s Wikipedia FastText + TF-IDF                   Yes (see below)
test_new_user_full_workflow              New user: mocked download, toy sentences              No (uses mocks)
test_encode_similarity                   encode/similarity and low-level API (synthetic)       No

The test that uses the real pretrained models is optional: test_02_paper_wikipedia_assets runs only when the assets are available:

  • Set WISSE_PAPER_FASTTEXT_DIR and WISSE_PAPER_IDF_PATH to local paths (e.g. after downloading from the MEGA links below), or
  • Set WISSE_TEST_HF_REGISTRY=1 to use the Hugging Face registry (only works after the paper assets have been uploaded to the HF repo).

Uploading assets to Hugging Face

For SentenceEmbedding() to work with defaults (no local paths), the paper’s FastText and TF-IDF files must be on the Hub. There are no synthetic “minimal” defaults — only the real pretrained assets are useful.

  1. Get the assets: Download from MEGA (links in Pretrained assets): indexed FastText and the TF-IDF pickle. Optionally pack the FastText directory as fasttext-300-indexed.tar.gz.
  2. Upload: Use hf_model/upload_assets_to_hf.py with local paths, or the Hub UI/CLI (a minimal alternative sketch follows this list). Full steps: hf_model/PUSH_TO_HF.md.
  3. Repo: huggingface.co/datasets/iarroyof/wisse-models
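
As an alternative to the repo's script, a minimal sketch using huggingface_hub directly. The repo id comes from step 3; the local file names are assumptions based on step 1:

from huggingface_hub import HfApi

api = HfApi()  # assumes you are logged in (huggingface-cli login)

# Upload the packed FastText archive and the TF-IDF pickle to the dataset repo.
for filename in ["fasttext-300-indexed.tar.gz", "idf-en.pkl"]:
    api.upload_file(
        path_or_fileobj=filename,
        path_in_repo=filename,
        repo_id="iarroyof/wisse-models",
        repo_type="dataset",
    )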

Pretrained assets (manual download)

If you prefer not to use the Hub, download and extract manually, then pass paths to SentenceEmbedding(...) or wisse.vector_space(path):


Citation

@article{arroyo2017unsupervised,
  title={Unsupervised Sentence Representations as Word Information Series: Revisiting TF--IDF},
  author={Arroyo-Fern{\'a}ndez, Ignacio and M{\'e}ndez-Cruz, Carlos-Francisco and Sierra, Gerardo and Torres-Moreno, Juan-Manuel and Sidorov, Grigori},
  journal={arXiv preprint arXiv:1710.06524},
  year={2017}
}
