Few-shot fine-tune an LLM to emit a latent into a bespoke, specialized geometry — and drop the result into SetFit as a Sentence-Transformer body.
Project description
langset — read text, answer in a vector
langset turns a language model into your own bespoke embedding space, few-shot. Bolt a tiny vector head onto a pretrained LLM, point it at a target geometry you define, and it learns to read text and emit a latent into that space — using the LLM's world knowledge to do the work. From ~480 examples it builds a specialized "how it sounds" geometry that beats the encoder it was bootstrapped from on 4/5 held-out axes — while the same architecture with a randomly-initialized backbone sits at chance. That gap is the world knowledge, and it's the whole point.
What makes langset different:
- 🎯 Latent out, not a label. You define the output space; the model answers in a vector, not a class. The geometry can be anything a description can seed — "how it sounds", "how it behaves", "what it's like".
- 🧠 World knowledge does the work. It's a generative LLM, so it generalizes from hundreds of examples, not millions. Swap in a random-init backbone and it barely moves off chance — the pretrained knowledge is the engine.
- 🌱 Bootstrap, then specialize. Seed the target geometry from any off-the-shelf encoder (no contrastive pairs, no labels), then an EMA self-target drifts it off that seed into your own task-shaped space — it stops being "text similarity" and becomes its own thing (measurably: it leaves the seed's cone).
- 🔀 Input-agnostic & non-circular. Many input views can map to the same point, and you point the target at a signal the input text can't regenerate — so it's not just distilling a text encoder.
- ✅ Honest selection. Optional per-row geometry labels are eval-only probes; langset early-stops on their held-out kNN-purity with a collapse guard — never on the training loss (which collapse can game).
Install
pip install langset
Usage
A langset dataset is rows of input_text → target_text (+ optional eval-only geometry labels). You pick the
LLM backbone and the bootstrap encoder; langset trains the mapping and specializes the geometry.
from langset import LangSetModel, Trainer, TrainingArguments
rows = [ # a review SNIPPET (what you'll have at inference) -> a description that defines where it should land
{"input_text": "an hour-long track of detuned riffs that never break stride, moving at the pace of continental drift",
"target_text": "glacial detuned doom-metal, sludgy and hypnotic", "mood": "heavy"},
{"input_text": "chopped vocal ghosts drifting over vinyl crackle and the hiss of a city at 3am",
"target_text": "crackly nocturnal UK garage, pitched vocal ghosts", "mood": "calm"},
# ...
]
model = LangSetModel.from_pretrained(
llm_model="HuggingFaceTB/SmolLM2-135M", # any HF causal LM (this is what the examples validate on)
bootstrap_model="sentence-transformers/all-MiniLM-L6-v2", # seeds the target geometry
)
Trainer(model, TrainingArguments(), train_dataset=rows).train() # 'mood' auto-detected as an eval-only label
z = model.encode(["a wall of downtuned fuzz that buries the vocals under sheer volume"]) # review snippet -> latent
print(z.shape) # (1, 384)
See examples/sounds_like/ for the full reference task (album review → "how it
sounds" latent, 481 albums): reproduced through this API at held-out kNN-purity 0.60, beats-bootstrap
4/5.
How it works
- Bootstrap. Targets = the bootstrap encoder's embedding of
target_text. No pairs, no labels. - Contrastive fit. InfoNCE (in-batch negatives) trains the LLM emitter to hit its own target and separate from others. A small cosine anchor optionally keeps it tethered to the seed.
- Specialize. An EMA self-target drifts the geometry off the seed into the model's own arrangement (set
lam_anchorlow to let it go). Emergent structure falls out — never trained, never labeled.
Validation / early-stop (the part that bites)
Two traps langset refuses to fall into:
- Never select on training loss — InfoNCE + EMA can minimize it by collapsing the geometry.
- Never score retrieval against the frozen bootstrap targets — the model specializes away from them.
So it selects on held-out geometry in the current space: input-view↔target-view retrieval + a collapse guard by default; held-out kNN-purity (+ beats-bootstrap) when rows carry geometry labels. Early-stop = patience + restore-best.
Dataset contract
| column | meaning |
|---|---|
input_text |
what you'll have at inference (a name, query, review) |
target_text |
a description of the same item defining where it lands (seeds the geometry) |
| anything else | optional eval-only geometry labels (kNN-purity at validation; never trained on) |
Trainer accepts a datasets.Dataset or list[dict]; use column_mapping to rename your columns.
Using with SetFit
The name is the chain: lang·set·fit — a language model emits into the set geometry (langset, usable on
its own), which then fits a classifier. model.as_sentence_transformer() is a drop-in
SetFit model_body, so you can train a few-shot classifier directly
on the specialized geometry. On the sounds-like example, a genre classifier on the langset body beats one on
raw MiniLM, 0.240 vs 0.205.
When to reach for which
The clean distinction: SetFit answers with a label; langset answers with a latent. SetFit is a few-shot classifier; langset is a few-shot bespoke embedding space — classification is just one thing you can do downstream of a latent.
| reach for SetFit | reach for langset | |
|---|---|---|
| your answer is | a label (fixed classes) | a point in a space — retrieval, "find similar", ranking, clustering |
| you define the target by | enumerating classes | a description of the geometry ("how it sounds") |
| your input | text to classify | text or an identifier (a name) — leans on the LLM's world knowledge |
| plain few-shot classification | ✅ simpler, faster, proven | not its job |
- Use SetFit alone for plain few-shot classification — it directly optimizes class separation; you won't beat it by bolting on langset.
- Use langset when the answer is a geometry, not a label (you'll retrieve / rank / cluster in it).
- Use langset → SetFit when a task-shaped body helps the classifier (demo: genre 0.240 vs 0.205 — modest, so treat it as "plausibly helps", not a slam-dunk).
pip install "langset[setfit]" # pins the verified composition window (below)
from sklearn.linear_model import LogisticRegression
from setfit import SetFitModel
clf = SetFitModel(model_body=langset_model.as_sentence_transformer(),
model_head=LogisticRegression(max_iter=2000), # direct construction needs an explicit head
labels=[...])
clf.fit(x_train, y_train, num_epochs=1) # frozen body + head — the robust path
clf.predict(["..."])
Dependency alignment. SetFit's pins are loose, so versions matter:
| install | transformers / torch | Python | use |
|---|---|---|---|
langset |
latest (≥4.41) | 3.10+ | modern backbones incl. Qwen3; no SetFit |
langset[setfit] |
4.46.x / <2.5 | 3.10–3.12 | verified SetFit composition |
SetFit imports transformers.training_args.default_logdir (removed after 4.46), and 4.46 + torch≥2.5 trips a
torch.distributed.tensor bug — hence the cap. Use the frozen-body SetFitModel.fit/predict path above; the
full setfit.Trainer (fine-tunes the body) is fragile in this window. Qwen3 + SetFit can't share one env until
SetFit drops that import.
Status
v0.1 — the core engine, validated on a real task (album review → "how it sounds" latent), with a downstream classifier composition (see above). No trust/hallucination layer yet (intentionally out of v1). Apache-2.0.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file langset-0.1.0.tar.gz.
File metadata
- Download URL: langset-0.1.0.tar.gz
- Upload date:
- Size: 1.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b619e1a16cedbbfaafe2c29222400a0e2f678d1a184e52888bf276e420944d50
|
|
| MD5 |
e6e173a10ab59bbfd005d8334bd084eb
|
|
| BLAKE2b-256 |
7e4783276e8258f59218be776eba89f653be27689a335f26c934e5c62e7648ef
|
File details
Details for the file langset-0.1.0-py3-none-any.whl.
File metadata
- Download URL: langset-0.1.0-py3-none-any.whl
- Upload date:
- Size: 19.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
874700f89aadc15fa420ad51d1e6ebb6d816b7db5d9cdb929264a39be0e4da3c
|
|
| MD5 |
fa70994f367b5892a9b3c8f9502e1162
|
|
| BLAKE2b-256 |
cc0d0e64e51718fdd5ce01a48c9d7bdb3628216a95fb48b6ee49f2ad0fc7848a
|