semtax
Zero-shot hierarchical taxonomy classification using semantic embeddings.
pip install semtax
Requires Python 3.9+. Key dependencies: sentence-transformers, numpy, tqdm.
No training data. No API keys. No labeled examples. Point it at text, get back a taxonomy match.
What is this?
semtax classifies free-text descriptions against a hierarchical taxonomy — instantly, locally, without any setup beyond the install.
The core idea: taxonomy nodes already contain rich semantic information in their labels and definitions. By embedding both the input text and the taxonomy nodes into the same vector space, we can find the best match using cosine similarity. No supervised learning required.
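To make the matching step concrete, here is a minimal sketch of nearest-node lookup by cosine similarity. It is not semtax's internal code: the function names `cosine_similarity` and `best_match` are hypothetical, and the three-dimensional vectors are toy stand-ins for real sentence embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query_vec, node_vecs):
    """Return (label, score) for the node vector closest to the query."""
    scored = {label: cosine_similarity(query_vec, vec)
              for label, vec in node_vecs.items()}
    label = max(scored, key=scored.get)
    return label, scored[label]


# Toy vectors standing in for real embeddings (illustrative values only).
taxonomy = {
    "Toner cartridges and supplies": [0.9, 0.1, 0.0],
    "Janitorial services": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "toner cartridges for laser printer"

label, score = best_match(query, taxonomy)
print(label, round(score, 3))
```

With real embeddings the vectors have hundreds of dimensions, but the comparison itself is the same.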
The gap this fills: most existing tools either require labeled training data (HiClass, scikit-learn pipelines) or charge per API call (Qvalia, Classifast). semtax requires neither.
Who is this for?
- Procurement and finance analysts who need to categorise spend data against UNSPSC without manual tagging.
- Developers building spend analytics pipelines, vendor classification tools, or procurement automation systems.
If you have a list of item descriptions and need structured taxonomy codes attached to them, this is the tool.
Quickstart
from semtax import SemTax
classifier = SemTax()
result = classifier.classify("toner cartridges for laser printer")
print(result.segment.label) # Office Equipment and Accessories and Supplies
print(result.class_.label) # Toner cartridges and supplies (class_ — class is a Python reserved word)
print(result.class_.confidence) # 0.8341
print(result.commodity.label) # Laser toner cartridges
print(result.match_level) # commodity
Batch classification
descriptions = [
    "laptop battery replacement",
    "janitorial cleaning services",
    "annual software license renewal",
    "server rack unit 2U",
]
results = classifier.classify(descriptions)
for r in results:
    print(f"{r.description:<40} → {r.class_.label} ({r.class_.confidence:.2f})")
Import from a file
All three import methods auto-detect the description column, or accept column= explicitly. They return a DataFrame with your original data plus classification columns appended.
# CSV
output = classifier.classify_csv("spend_data.csv")
# Excel
output = classifier.classify_excel("spend_data.xlsx")
# JSON — accepts a list of strings or a list of dicts
output = classifier.classify_json("spend_data.json")
# Non-standard column name
output = classifier.classify_csv("spend_data.csv", column="line_item")
output = classifier.classify_excel("spend_data.xlsx", column="line_item", sheet_name="Q1")
The output DataFrame includes all your original columns plus: segment_code, segment_label, segment_confidence, family_code, family_label, family_confidence, class_code, class_label, class_confidence, commodity_code, commodity_label, commodity_confidence, commodity_populated, match_level, flags.
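As one way to use these columns downstream, the sketch below reads an exported CSV with the standard library and picks out rows that may need manual review. The in-memory sample, the `REVIEW_BELOW` cutoff, and the assumption that `match_level` holds the string `class` when drill-down stops early are all illustrative, not documented behaviour.

```python
import csv
import io

# Illustrative sample shaped like a subset of the output columns.
sample = io.StringIO(
    "description,class_confidence,match_level\n"
    "toner cartridges for laser printer,0.8341,commodity\n"
    "IT hardware and software maintenance services,0.61,class\n"
)

REVIEW_BELOW = 0.70  # hypothetical review cutoff, not a semtax default

rows = list(csv.DictReader(sample))

# Keep rows with a weak class match or without a commodity-level match.
needs_review = [
    row for row in rows
    if float(row["class_confidence"]) < REVIEW_BELOW
    or row["match_level"] != "commodity"
]

for row in needs_review:
    print(row["description"], row["class_confidence"])
```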
Export results
results = classifier.classify(descriptions)
classifier.to_csv(results, "output.csv") # no pandas required
classifier.to_excel(results, "output.xlsx")
classifier.to_json(results, "output.json") # writes file
json_str = classifier.to_json(results) # returns string if no path
df = classifier.to_dataframe(results) # pandas DataFrame
How it works
Classification runs two paths in parallel and reconciles them:
Path 1 — top-down: Segment → Family → Class, drilling down the hierarchy at each level.
Path 2 — flat class search: cosine similarity against all ~900 classes directly, giving a strong semantic anchor without the noise of 157k commodity-level comparisons.
Both paths are reconciled at the class level. If they agree, confidence is high. If they disagree, the result is flagged. Once a class is matched above the confidence threshold, commodity drill-down searches only the 20-50 commodities within that class — a tractable scope where fine-grained distinctions are reliable.
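The reconciliation described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the library's implementation: the node names, the `paths_disagree` flag, and the choice to keep the flat-search winner on disagreement are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Reconciled:
    class_code: str
    confidence: float
    flags: list = field(default_factory=list)


def top_down(scores, children):
    """Greedy descent: pick the best-scoring child at each hierarchy level."""
    node = "root"
    while node in children:
        node = max(children[node], key=lambda child: scores[child])
    return node


def flat_best(class_scores):
    """Best class by similarity across all classes, ignoring the hierarchy."""
    return max(class_scores, key=class_scores.get)


def reconcile(scores, children, class_scores):
    """Compare both paths; flag the result when they disagree."""
    td = top_down(scores, children)
    flat = flat_best(class_scores)
    if td == flat:
        return Reconciled(flat, class_scores[flat])
    # Assumption: keep the flat winner and attach a hypothetical flag.
    return Reconciled(flat, class_scores[flat], flags=["paths_disagree"])


# Toy hierarchy with placeholder node names and made-up similarity scores.
children = {
    "root": ["seg_a", "seg_b"],
    "seg_a": ["fam_a1"],
    "seg_b": ["fam_b1"],
    "fam_a1": ["cls_a1x", "cls_a1y"],
    "fam_b1": ["cls_b1x"],
}
scores = {"seg_a": 0.70, "seg_b": 0.40, "fam_a1": 0.65, "fam_b1": 0.35,
          "cls_a1x": 0.82, "cls_a1y": 0.51, "cls_b1x": 0.30}
class_scores = {k: v for k, v in scores.items() if k.startswith("cls_")}

agree = reconcile(scores, children, class_scores)
disagree = reconcile(scores, children, {**class_scores, "cls_b1x": 0.90})
print(agree)
print(disagree)
```

The second call simulates a case where the flat search prefers a class outside the segment chosen by the top-down path, which is the situation the flags are meant to surface.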
Taxonomy embeddings are cached on first use at ~/.semtax/cache/. Subsequent runs load from disk in ~1 second.
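A compute-once, load-afterwards cache of this kind might look like the sketch below. The JSON file format, the key derivation, and the names `cache_path` and `load_or_compute` are assumptions for illustration; the actual on-disk layout under ~/.semtax/cache/ is not documented here.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_path(cache_dir, model_name, taxonomy_version):
    """Derive a file name from the model and taxonomy so caches never mix."""
    key = hashlib.sha256(f"{model_name}|{taxonomy_version}".encode()).hexdigest()[:16]
    return Path(cache_dir) / f"{key}.json"


def load_or_compute(cache_dir, model_name, taxonomy_version, labels, embed):
    """Return {label: vector}, embedding only when no cache file exists."""
    path = cache_path(cache_dir, model_name, taxonomy_version)
    if path.exists():
        return json.loads(path.read_text())
    vectors = dict(zip(labels, embed(labels)))
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(vectors))
    return vectors


calls = []  # records how often the (fake) embedder is actually invoked


def fake_embed(texts):
    calls.append(len(texts))
    return [[float(len(t)), 1.0] for t in texts]


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_compute(tmp, "demo-model", "v1", ["a", "bb"], fake_embed)
    second = load_or_compute(tmp, "demo-model", "v1", ["a", "bb"], fake_embed)

print(calls)
```

Keying the file on both the model and the taxonomy version means switching embedding models triggers a fresh computation instead of reusing incompatible vectors.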
Confidence and ambiguity flags
Every result includes a confidence score at each level. Results that are uncertain are flagged:
result = classifier.classify("IT hardware and software maintenance services")
print(result.flags)
# ['composite_heuristic', 'multi_segment_spread']
print(result.class_.confidence) # 0.61
print(result.commodity.populated) # False — stopped at class level
| Flag | Meaning |
|---|---|
| `low_confidence` | Best class match scored below threshold |
| `margin_too_small` | Top-1 and top-2 class scores are too close to call |
| `multi_segment_spread` | Top matches span multiple segments — ambiguous input |
| `composite_heuristic` | Input likely describes multiple distinct items |
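For intuition, three of these flags can be derived from a ranked list of class candidates roughly as sketched below. The function `ambiguity_flags`, the 0.05 margin value, and the rule that any second segment in the top-k counts as spread are assumptions; `composite_heuristic` is left out because its detection logic is not described here.

```python
def ambiguity_flags(ranked, class_threshold=0.55, margin_threshold=0.05,
                    top_k_spread=5):
    """ranked: list of (class_code, segment_code, score), best match first."""
    flags = []
    # Best score is below the confidence threshold.
    if ranked[0][2] < class_threshold:
        flags.append("low_confidence")
    # Top-1 and top-2 are too close to separate.
    if len(ranked) > 1 and ranked[0][2] - ranked[1][2] < margin_threshold:
        flags.append("margin_too_small")
    # The top-k candidates belong to more than one segment.
    segments = {segment for _, segment, _ in ranked[:top_k_spread]}
    if len(segments) > 1:
        flags.append("multi_segment_spread")
    return flags


# Placeholder codes and made-up scores.
ambiguous = [("c1", "s1", 0.61), ("c2", "s2", 0.59), ("c3", "s1", 0.40)]
clear = [("c1", "s1", 0.90), ("c2", "s1", 0.50)]

print(ambiguity_flags(ambiguous))
print(ambiguity_flags(clear))
```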
Configuring thresholds
classifier = SemTax(
    class_confidence_threshold=0.55,      # flag low_confidence below this
    commodity_confidence_threshold=0.68,  # stop at class level below this
)
Custom embedding models
# Higher accuracy, slower — still local
classifier = SemTax(
    embedding_model="sentence-transformers/all-mpnet-base-v2"
)
# OpenAI embeddings
import openai

def openai_embed(texts):
    resp = openai.embeddings.create(input=texts, model="text-embedding-3-small")
    return [r.embedding for r in resp.data]

classifier = SemTax(embedding_model=openai_embed)
Any callable with signature (list[str]) -> list[list[float]] works.
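To illustrate the expected shape without any network access or model download, here is a toy embedder that satisfies the same signature. The name `hashing_embed` and the feature-hashing approach are assumptions chosen for illustration; the vectors carry no semantic meaning, so this is only useful for wiring tests, not for real classification.

```python
import hashlib
import math


def hashing_embed(texts, dim=64):
    """Map each text to a unit-length bag-of-words vector via feature hashing."""
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for token in text.lower().split():
            digest = hashlib.md5(token.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % dim
            vec[index] += 1.0
        # Normalise so dot products equal cosine similarity; guard empty input.
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


out = hashing_embed(["laser toner", "toner laser", ""])
print(len(out), len(out[0]))
```

Because it takes a list of strings and returns one list of floats per string, it could be passed as `embedding_model=hashing_embed` in the same way as `openai_embed` above.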
What's available to import
from semtax import (
    SemTax,                     # the classifier
    AmbiguityConfig,            # fine-grained threshold configuration
    ClassificationResult,       # return type from classify(), useful for type hints
    LevelResult,                # segment / family / class / commodity result type
    FLAG_LOW_CONFIDENCE,        # filter results by flag; these are plain strings, use `in result.flags`
    FLAG_MARGIN_TOO_SMALL,
    FLAG_MULTI_SEGMENT_SPREAD,
    FLAG_COMPOSITE_HEURISTIC,
)
AmbiguityConfig
For control beyond the two threshold kwargs:
from semtax import SemTax, AmbiguityConfig
config = AmbiguityConfig(
    class_confidence_threshold=0.55,
    commodity_confidence_threshold=0.68,
    margin_threshold=0.08,  # stricter margin requirement
    top_k_spread=3,         # check top-3 classes for segment spread (default 5)
)
classifier = SemTax(config=config)
Filtering by flag
from semtax import FLAG_COMPOSITE_HEURISTIC

results = classifier.classify(descriptions)
# Only keep clean, high-confidence results
clean = [r for r in results if not r.flags]
# Find everything that looks composite
composite = [r for r in results if FLAG_COMPOSITE_HEURISTIC in r.flags]
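Beyond filtering, it can help to summarise how often each flag fires across a batch. The sketch below uses `SimpleNamespace` stand-ins that model only the `.flags` attribute, so it runs without semtax installed; the descriptions and flag assignments are made up for illustration.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-ins for classification results; only .description and .flags are modelled.
batch = [
    SimpleNamespace(description="toner cartridges", flags=[]),
    SimpleNamespace(
        description="IT hardware and software maintenance services",
        flags=["composite_heuristic", "multi_segment_spread"],
    ),
    SimpleNamespace(description="misc supplies", flags=["low_confidence"]),
]

# How often each flag occurs across the batch.
flag_counts = Counter(flag for r in batch for flag in r.flags)

# Fraction of results with no flags at all.
clean_share = sum(1 for r in batch if not r.flags) / len(batch)

print(flag_counts.most_common())
print(f"{clean_share:.0%} clean")
```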
Telemetry
semtax collects anonymous usage data (batch sizes, taxonomy matched, model used — never description text) via PostHog. Opt out any time:
# In code
SemTax(telemetry=False)
# Or with an environment variable
SEMTAX_DISABLE_TELEMETRY=1 python your_script.py
Roadmap
| Version | Feature |
|---|---|
| V1 | UNSPSC classification, hybrid search, confidence scoring, ambiguity flags |
| V2 | CWE (cybersecurity weakness classification) |
| V2 | NAICS (industry/vendor classification) |
| V2 | LLM enrichment layer for low-confidence items |
| V2 | Custom taxonomy support (bring your own CSV) |
| V3 | arXiv subject categories |
| V3 | CPV (EU public procurement) |
Why not just use an LLM?
- Cost at scale: Classifying 50k rows through an LLM API is expensive. Local embeddings cost nothing.
- Speed: Batch embedding classification is orders of magnitude faster than LLM inference.
- No data leaving your environment: Sensitive procurement or financial data often can't touch external APIs.
- Deterministic output: LLMs hallucinate codes and format output inconsistently. semtax returns clean, structured results every time.
LLMs are reasoning engines, not classification infrastructure.
License
MIT