
Few-shot fine-tune an LLM to emit a latent into a bespoke geometry you define with text — and drop the result into SetFit as a Sentence-Transformer body.

langset — read text, answer in a vector

langset turns a language model into your own bespoke embedding space, few-shot. Bolt a tiny vector head onto a pretrained LLM, describe the axis you want in words, and it learns to read text and emit a latent into a geometry defined by those descriptions — using the LLM's world knowledge to do the reading. The latent lives in the model's own space (it's your embedding, not a re-projected off-the-shelf one), and it drops straight into SetFit as a Sentence-Transformer body.

The one idea

The target_text is the geometry. Whatever your target descriptions describe becomes the axis your latent space measures — and nothing else. Describe instrumentation and the space clusters by instrumentation; describe vocals and emotion and it clusters by vocals and emotion. You don't discover the geometry, you define it — and you re-steer it by rewriting the target text, no model changes.
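To make the re-steering concrete, here is a minimal sketch with illustrative data. Both target descriptions are hypothetical, written only for this example. The same input is paired with two different descriptions, one per axis:

```python
# Hypothetical rows: one input_text, two candidate axes.
review = "chopped vocal ghosts drifting over vinyl crackle and the hiss of a city at 3am"

# Axis 1: describe instrumentation, so the space clusters by instrumentation.
rows_by_instrumentation = [
    {"input_text": review,
     "target_text": "sampled vocal chops, vinyl crackle, shuffled drums, deep sub bass"},
]

# Axis 2: describe vocals and emotion, so the space clusters by vocals and emotion.
rows_by_vocals_and_emotion = [
    {"input_text": review,
     "target_text": "pitched, ghostly vocals; wistful and restless"},
]

# Same inputs and same columns; only the descriptions (and so the geometry) differ.
same_inputs = [r["input_text"] for r in rows_by_instrumentation] == \
              [r["input_text"] for r in rows_by_vocals_and_emotion]
print(same_inputs)
```

Training on one list or the other is the whole re-steering step; the model code and the input side of the dataset stay the same.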

What makes langset different:

  • 🎯 Latent out, not a label. You define the output space; the model answers in a vector. Retrieval, "find similar", ranking, clustering — classification is just one thing you can do downstream.
  • 🧭 You design the axis in words. The target text defines the geometry. Point it at the signal you care about (and at something the input text can't trivially regenerate, or you're just distilling a text encoder).
  • 🧠 World knowledge does the work. It's a generative LLM, so it generalizes from hundreds of examples, not millions — it reads the input rather than pattern-matching surface tokens.
  • 🪞 Your own embedding. The latent lives in the model's own hidden space; the geometry comes from a self-contrastive objective against your target text — no external encoder in the loop.

Install

pip install langset

Usage

A langset dataset is rows of input_text / target_text pairs. Pick an LLM backbone; langset trains the mapping.

from langset import LangSetModel, Trainer, TrainingArguments

rows = [  # what you'll have at inference -> a description that DEFINES where it should land
    {"input_text": "an hour-long track of detuned riffs that never break stride, moving at the pace of continental drift",
     "target_text": "glacial detuned doom-metal, sludgy and hypnotic, buried roared vocals"},
    {"input_text": "chopped vocal ghosts drifting over vinyl crackle and the hiss of a city at 3am",
     "target_text": "crackly nocturnal UK garage, pitched vocal ghosts, wistful and restless"},
    # ...
]

model = LangSetModel.from_pretrained("HuggingFaceTB/SmolLM2-135M")   # any HF causal LM
Trainer(model, TrainingArguments(), train_dataset=rows).train()

z = model.encode(["a wall of downtuned fuzz that buries the vocals under sheer volume"])
print(z.shape)   # (1, 576)  — a latent in the backbone's own space

See examples/sounds_like/ for the full reference task (album review → "how it sounds" latent).
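Once you have latents, "find similar" is a nearest-neighbour lookup. The sketch below ranks a small corpus by cosine similarity using NumPy. It assumes model.encode returns something array-like of shape (n, dim), as the printed shape above suggests. The vectors here are hand-made stand-ins, not real langset output, and rank_by_cosine is a hypothetical helper, not part of langset.

```python
import numpy as np

def rank_by_cosine(query, corpus):
    """Return corpus row indices, most similar first, with their cosine scores."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity of each corpus row to the query
    order = np.argsort(-scores)         # descending by similarity
    return order, scores[order]

# Stand-in latents; in practice: corpus = np.asarray(model.encode(corpus_texts))
corpus = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.9, 0.1, 0.0]])
query = np.array([0.8, 0.2, 0.0])

order, scores = rank_by_cosine(query, corpus)
print(order.tolist())
```

The same ranking works for retrieval and for building a similarity matrix to cluster on.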

How it works

  1. Self-contrastive. For each row, emit(input_text) is trained to match emit(target_text), both emitted through the model into its own space, against in-batch negatives. The target text defines where each item lands; the negatives force different items apart (so the space can't collapse).
  2. Grounding aux. A light reconstruction term makes the latent also decode the target text, tying it to the words. A light uniformity term keeps the space spread on the sphere.
  3. Collapse-aware selection. langset early-stops on held-out input↔target retrieval and reconstruction, with a hard penalty on any collapse of the geometry — never on the training loss (which collapse can game).
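The steps above can be sketched in NumPy using standard formulations: an InfoNCE-style in-batch contrastive loss, a pairwise Gaussian-potential uniformity term, and held-out retrieval accuracy as a selection signal. This is an illustration under assumptions, not langset's implementation. The temperature, the uniformity scale t, and every function name are hypothetical, and the reconstruction term is left out because it needs the decoder.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(z_in, z_tgt, temperature=0.07):
    """In-batch contrastive loss: row i of z_in should match row i of z_tgt."""
    logits = l2_normalize(z_in) @ l2_normalize(z_tgt).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))               # positives sit on the diagonal

def uniformity(z, t=2.0):
    """Log mean Gaussian potential over distinct pairs; lower means more spread out."""
    z = l2_normalize(z)
    sq_dists = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    upper = np.triu_indices(len(z), k=1)                     # each pair once, no self-pairs
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))

def retrieval_at_1(z_in, z_tgt):
    """Fraction of inputs whose nearest target (by cosine) is their own."""
    sims = l2_normalize(z_in) @ l2_normalize(z_tgt).T
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(z_in))))

rng = np.random.default_rng(0)
z_tgt = rng.normal(size=(8, 16))                       # stand-in target latents
z_in = z_tgt + 0.01 * rng.normal(size=(8, 16))         # inputs that land near their targets
z_flat = np.ones((8, 16)) + 1e-3 * rng.normal(size=(8, 16))   # a collapsed space

print(info_nce(z_in, z_tgt), info_nce(z_flat, z_flat))
print(uniformity(z_in), uniformity(z_flat))
print(retrieval_at_1(z_in, z_tgt))
```

On the collapsed batch the contrastive loss sits near log(batch size) and the uniformity value near zero, which is the kind of signal a collapse penalty can key on.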

Dataset contract

column      | meaning
input_text  | what you'll have at inference (a name, a query, a review)
target_text | a description of the same item that defines where it lands (the geometry)

Trainer accepts a datasets.Dataset or list[dict]; use column_mapping to rename your columns.
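If you would rather normalize your data before handing it over, the contract is simple enough to enforce in plain Python. The sketch below renames two source columns and drops rows that are missing either side. to_langset_rows and the review / sound column names are hypothetical; the exact shape of column_mapping is not shown here, so this version does the rename itself instead.

```python
def to_langset_rows(records, input_key, target_key):
    """Rename two source columns to input_text / target_text; drop incomplete rows."""
    rows = []
    for rec in records:
        inp = str(rec.get(input_key) or "").strip()
        tgt = str(rec.get(target_key) or "").strip()
        if inp and tgt:                       # both sides of the contract must be present
            rows.append({"input_text": inp, "target_text": tgt})
    return rows

records = [
    {"review": "a wall of downtuned fuzz", "sound": "sludgy, detuned, buried vocals"},
    {"review": "no description yet", "sound": ""},        # dropped: empty target
    {"sound": "orphan description"},                       # dropped: no input
]

rows = to_langset_rows(records, input_key="review", target_key="sound")
print(rows)
```

The resulting list[dict] is the same shape as the rows in the Usage example.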

Using with SetFit

The name is the chain: lang·set·fit — a language model emits into the set geometry (langset, usable on its own), which then fits a classifier. model.as_sentence_transformer() is a drop-in SetFit model_body, so you can train a few-shot classifier directly on your bespoke geometry.

The clean distinction: SetFit answers with a label; langset answers with a latent.

                        | reach for SetFit        | reach for langset
your answer is          | a label (fixed classes) | a point in a space: retrieval, "find similar", ranking, clustering
you define the target by | enumerating classes     | a description of the geometry ("how it sounds")
your input              | text to classify        | text or an identifier (leans on the LLM's world knowledge)

  • Use SetFit alone for plain few-shot classification; you won't beat it by bolting on langset.
  • Use langset when the answer is a geometry, not a label (you'll retrieve / rank / cluster in it).
  • Use langset → SetFit when a task-shaped body helps the classifier.

pip install "langset[setfit]"      # pins the verified composition window (below)

from sklearn.linear_model import LogisticRegression
from setfit import SetFitModel

# langset_model is a trained LangSetModel (for example, `model` from the Usage section)
clf = SetFitModel(model_body=langset_model.as_sentence_transformer(),
                  model_head=LogisticRegression(max_iter=2000),    # direct construction needs an explicit head
                  labels=[...])
clf.fit(x_train, y_train, num_epochs=1)        # frozen body + head: the robust path
clf.predict(["..."])

Dependency alignment. SetFit's pins are loose, so versions matter:

install         | transformers / torch | Python    | use
langset         | latest (≥4.41)       | 3.10+     | modern backbones incl. Qwen3; no SetFit
langset[setfit] | 4.46.x / <2.5        | 3.10–3.12 | verified SetFit composition

SetFit imports transformers.training_args.default_logdir (removed after 4.46), and 4.46 + torch≥2.5 trips a torch.distributed.tensor bug — hence the cap. Use the frozen-body SetFitModel.fit/predict path above; the full setfit.Trainer (fine-tunes the body) is fragile in this window.
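A small guard can catch a mismatched environment before SetFit fails on import. The sketch below encodes the window from the table (transformers 4.46.x with torch below 2.5) as a pure function. in_setfit_window and parse_major_minor are hypothetical helpers, not part of langset or SetFit.

```python
import re

def parse_major_minor(version):
    """'4.46.3' -> (4, 46); '2.4.1+cu121' -> (2, 4)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))

def in_setfit_window(transformers_version, torch_version):
    """True when both versions fall inside the verified langset[setfit] window."""
    return (parse_major_minor(transformers_version) == (4, 46)
            and parse_major_minor(torch_version) < (2, 5))

print(in_setfit_window("4.46.3", "2.4.1+cu121"))   # inside the window
print(in_setfit_window("4.47.0", "2.4.1"))         # transformers too new
print(in_setfit_window("4.46.1", "2.5.0"))         # torch too new

# With both packages installed, the real versions can be read like this:
#   from importlib.metadata import version
#   in_setfit_window(version("transformers"), version("torch"))
```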

Status

v0.2 — the core engine, validated on a real task (album review → "how it sounds" latent) with a downstream SetFit composition. Apache-2.0.
