
Sentence embeddings (WISSE) with an SBERT-like API for downstream NLP applications.

Project description

WISSE — Sentence embeddings

Sentence embeddings via entropy-weighted series (TF-IDF–weighted word embeddings). No language or knowledge resources required. Python 3.8+.

SBERT-like API: encode() and similarity(). The default model keys (wisse-fasttext-300, wisse-idf-en) point to the Hugging Face repo; once the paper’s Wikipedia FastText and TF-IDF assets are uploaded there (a one-time step, see Uploading assets), they are downloaded automatically on first use to ~/.wisse. Until then, pass local paths (e.g. after downloading from MEGA).


Quick start

With assets on Hugging Face (after one-time upload of the paper assets):

from wisse import SentenceEmbedding

model = SentenceEmbedding()  # downloads to ~/.wisse on first use

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)  # shape (3, 300)
sim = model.similarity(embeddings, embeddings)

With local paths (e.g. after downloading from MEGA):

model = SentenceEmbedding(
    model_name_or_path="/path/to/indexed_fasttext/",
    idf_name_or_path="/path/to/idf-en.pkl",
)
embeddings = model.encode(["First sentence.", "Second sentence."])

Similarity: "cosine", "dot", "euclidean", "manhattan".

from wisse import similarity

s = similarity(embeddings, embeddings, similarity_fn="cosine")

Installation

pip install wisse_sentence

Or, from a clone of the repository:

pip install -e .

Requirements: Python ≥3.8, numpy, scikit-learn, gensim, joblib, requests.


Default models (Hugging Face)

The package expects the Wikipedia-trained FastText (300d) and TF-IDF weights to live in the Hugging Face repo. Once those files are uploaded (see Uploading assets), they are downloaded automatically on first use to ~/.wisse (or $WISSE_HOME).

  • Registry keys: wisse-fasttext-300, wisse-idf-en (and optionally wisse-glove-300).
  • Override via environment variables: WISSE_HF_REPO and WISSE_HF_REPO_TYPE (alternate repo), or WISSE_FASTTEXT_URL, WISSE_IDF_URL, WISSE_EMBEDDING_URL (direct URLs); see the sketch below.
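
For example, a minimal sketch of redirecting asset resolution to another repo via the variables above. The repo id is hypothetical, and the exact point at which wisse reads these variables is an assumption, so set them before constructing the model:

import os

# Point asset resolution at a different Hugging Face repo (assumed to be read
# by wisse's download module when the model is constructed).
os.environ["WISSE_HF_REPO"] = "my-org/my-wisse-assets"  # hypothetical repo id
os.environ["WISSE_HF_REPO_TYPE"] = "dataset"

from wisse import SentenceEmbedding

model = SentenceEmbedding()  # resolves wisse-fasttext-300 / wisse-idf-en against the repo above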

Repository layout

Clean Python package layout:

sentence_embedding/
├── wisse/                 # Package
│   ├── __init__.py
│   ├── wisse.py           # Core: TF-IDF weighting, vector_space, keyed2indexed
│   ├── model.py           # SentenceEmbedding (SBERT-like API)
│   ├── similarity.py      # Pairwise similarity helpers
│   ├── download.py        # HF registry and autodownload
│   └── cli.py             # wisse-encode, keyed2indexed entry points
├── tests/                 # Pytest suite
├── hf_model/              # Hugging Face model card (README.md) and push script
├── setup.py
├── pyproject.toml
├── requirements.txt
├── run_tests.py
├── keyed2indexed.py      # Standalone script (or use CLI after install)
├── LICENSE
├── .gitignore
└── README.md

CLI

After pip install:

wisse-encode --input sentences.txt --output vectors.npy
wisse-encode --input sentences.txt --output out.npy --model wisse-fasttext-300 --idf wisse-idf-en
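
The file written by --output can be loaded back with NumPy for downstream use. A small sketch, assuming wisse-encode writes a standard .npy array, as the extension suggests:

import numpy as np

# One row per input sentence; 300 columns with the default FastText model.
vectors = np.load("vectors.npy")
print(vectors.shape)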

keyed2indexed --input model.bin --output output_indexed
keyed2indexed --input model.vec --txt --output output_indexed

wisse-train --wikipedia en --idf-out idf-en.pkl --embeddings-out fasttext-300-indexed
wisse-train --corpus-dir ./my_texts --document-unit paragraph --idf-out idf.pkl

Train IDF + FastText (new operating mode): build both from a directory of plain text files or from Wikipedia (via Hugging Face). This produces a WISSE-ready IDF pickle and indexed FastText embeddings; see the usage sketch after the option list. Wikipedia mode requires pip install ".[train]" (which adds datasets).

  • --corpus-dir PATH — directory of plain text files, or
  • --wikipedia LANG — e.g. en, es (downloads from HF wikimedia/wikipedia).
  • --document-unit article|paragraph — one doc per file/article vs per paragraph.
  • --idf-out, --embeddings-out — explicit output paths (defaults: idf-<lang>.pkl, fasttext-300-indexed).
  • --binary-out PATH — optionally save FastText in Word2Vec binary format.
  • --dim, --window, --min-count, --epochs — paper defaults (300, 5, 5, 5), all configurable.
  • --cap-articles, --cap-tokens — optional cap with efficient random sampling; default for Wikipedia: 500k articles / 100M tokens.
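
The training outputs can then be passed straight to SentenceEmbedding as local paths, using the same constructor arguments as in the Quick start (the file names below are the defaults listed above):

from wisse import SentenceEmbedding

# Artifacts produced by, e.g.:
#   wisse-train --wikipedia en --idf-out idf-en.pkl --embeddings-out fasttext-300-indexed
model = SentenceEmbedding(
    model_name_or_path="fasttext-300-indexed/",
    idf_name_or_path="idf-en.pkl",
)
embeddings = model.encode(["A sentence embedded with freshly trained weights."])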

From repo without installing:

python keyed2indexed.py --input model.bin --output output_indexed

Low-level usage

Convert word2vec to indexed format and use the WISSE combiner:

import wisse
from gensim.models.keyedvectors import KeyedVectors

# Load pretrained word2vec/FastText vectors and convert them to WISSE's
# on-disk indexed format (one-time step).
kv = KeyedVectors.load_word2vec_format("/path/to/embeddings.bin", binary=True)
wisse.keyed2indexed(kv, "/path/to/output_dir/")

# Dictionary-like lookup over the indexed vectors.
embedding = wisse.vector_space("/path/to/output_dir/")
# embedding["word"] → array

import pickle

# TF-IDF vectorizer trained offline (e.g. with wisse-train).
with open("/path/to/tfidf.pkl", "rb") as f:
    vectorizer = pickle.load(f)

# Weight each word vector by its TF-IDF score and sum the weighted vectors
# into a single sentence vector.
w = wisse.wisse(embedding, vectorizer=vectorizer, tf_tfidf=True, combiner="sum", generate=True)
vec = w.infer_sentence("this is a sentence")
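
Continuing the snippet above, per-sentence vectors can be stacked into a matrix; this sketch assumes infer_sentence returns a 1-D NumPy array, as vec is used above:

import numpy as np

# (n_sentences, dim) matrix built with the low-level combiner.
sentences = ["this is a sentence", "this is another sentence"]
matrix = np.vstack([w.infer_sentence(s) for s in sentences])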

Testing

pip install -e ".[dev]"
pytest tests/ -v
# or
python run_tests.py

Test                                     Description                                           Needs real pretrained assets?
test_00_install                          Package and public API import                         No
test_01_toy_tfidf_fasttext_and_helpers   Toy TF-IDF FastText + all helpers (synthetic data)    No
test_02_paper_wikipedia_assets           Paper’s Wikipedia FastText + TF-IDF                   Yes (see below)
test_new_user_full_workflow              New user: mocked download, toy sentences              No (uses mocks)
test_encode_similarity                   encode/similarity and low-level API (synthetic)       No

The test that uses the real pretrained models is optional: test_02_paper_wikipedia_assets runs only when the assets are available:

  • Set WISSE_PAPER_FASTTEXT_DIR and WISSE_PAPER_IDF_PATH to local paths (e.g. after downloading from the MEGA links below), or
  • Set WISSE_TEST_HF_REGISTRY=1 to use the Hugging Face registry (only works after the paper assets have been uploaded to the HF repo).

Uploading assets to Hugging Face

For SentenceEmbedding() to work with defaults (no local paths), the paper’s FastText and TF-IDF files must be on the Hub. There are no synthetic “minimal” defaults — only the real pretrained assets are useful.

  1. Get the assets: Download from MEGA (links in Pretrained assets): indexed FastText and the TF-IDF pickle. Optionally pack the FastText directory as fasttext-300-indexed.tar.gz.
  2. Upload: Use hf_model/upload_assets_to_hf.py with local paths, or the Hub UI/CLI (a minimal alternative sketch follows this list). Full steps: hf_model/PUSH_TO_HF.md.
  3. Repo: huggingface.co/datasets/iarroyof/wisse-models
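
As an alternative to the repo's script, a minimal sketch using huggingface_hub directly. The repo id comes from step 3; the local file names are assumptions based on step 1:

from huggingface_hub import HfApi

api = HfApi()  # assumes you are logged in (huggingface-cli login)

# Upload the packed FastText archive and the TF-IDF pickle to the dataset repo.
for filename in ["fasttext-300-indexed.tar.gz", "idf-en.pkl"]:
    api.upload_file(
        path_or_fileobj=filename,
        path_in_repo=filename,
        repo_id="iarroyof/wisse-models",
        repo_type="dataset",
    )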

Pretrained assets (manual download)

If you prefer not to use the Hub, download and extract manually, then pass paths to SentenceEmbedding(...) or wisse.vector_space(path):


Citation

@article{arroyo2017unsupervised,
  title={Unsupervised Sentence Representations as Word Information Series: Revisiting TF--IDF},
  author={Arroyo-Fern{\'a}ndez, Ignacio and M{\'e}ndez-Cruz, Carlos-Francisco and Sierra, Gerardo and Torres-Moreno, Juan-Manuel and Sidorov, Grigori},
  journal={arXiv preprint arXiv:1710.06524},
  year={2017}
}
