semtax

Zero-shot hierarchical taxonomy classification using semantic embeddings.

pip install semtax

Requires Python 3.9+. Key dependencies: sentence-transformers, numpy, tqdm.

No training data. No API keys. No labeled examples. Point it at text, get back a taxonomy match.

What is this?

semtax classifies free-text descriptions against a hierarchical taxonomy — instantly, locally, without any setup beyond the install.

The core idea: taxonomy nodes already contain rich semantic information in their labels and definitions. By embedding both the input text and the taxonomy nodes into the same vector space, we can find the best match using cosine similarity. No supervised learning required.
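As a rough illustration of that idea, the sketch below scores one query vector against a few taxonomy vectors with cosine similarity and picks the highest. The three-dimensional vectors and the labels are made-up stand-ins for real sentence embeddings, and `cosine` is a hypothetical helper, not part of semtax.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings"; real models produce hundreds of dimensions.
taxonomy = {
    "Toner cartridges and supplies": [0.9, 0.1, 0.0],
    "Janitorial services": [0.0, 0.2, 0.9],
    "Computer batteries": [0.3, 0.9, 0.1],
}
# Stand-in for the embedding of "toner cartridges for laser printer".
query = [0.8, 0.2, 0.1]

scores = {label: cosine(query, vec) for label, vec in taxonomy.items()}
best_label = max(scores, key=scores.get)
print(best_label, round(scores[best_label], 3))
```

The same comparison works unchanged with real embeddings; only the vectors get longer.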

The gap this fills: most existing tools either require labeled training data (HiClass, scikit-learn pipelines) or charge per API call (Qvalia, Classifast). semtax requires neither.

Who is this for?

semtax is for procurement and finance analysts who need to categorise spend data against UNSPSC without manual tagging, and for developers building spend analytics pipelines, vendor classification tools, or procurement automation systems. If you have a list of item descriptions and need structured taxonomy codes attached to them, this is the tool.

Quickstart

from semtax import SemTax

classifier = SemTax()

result = classifier.classify("toner cartridges for laser printer")

print(result.segment.label)      # Office Equipment and Accessories and Supplies
print(result.class_.label)       # Toner cartridges and supplies  (class_ — class is a Python reserved word)
print(result.class_.confidence)  # 0.8341
print(result.commodity.label)    # Laser toner cartridges
print(result.match_level)        # commodity

Batch classification

descriptions = [
    "laptop battery replacement",
    "janitorial cleaning services",
    "annual software license renewal",
    "server rack unit 2U",
]

results = classifier.classify(descriptions)

for r in results:
    print(f"{r.description:<40} → {r.class_.label} ({r.class_.confidence:.2f})")

Import from a file

All three import methods auto-detect the description column, or accept column= explicitly. They return a DataFrame with your original data plus classification columns appended.

# CSV
output = classifier.classify_csv("spend_data.csv")

# Excel
output = classifier.classify_excel("spend_data.xlsx")

# JSON — accepts a list of strings or a list of dicts
output = classifier.classify_json("spend_data.json")

# Non-standard column name
output = classifier.classify_csv("spend_data.csv", column="line_item")
output = classifier.classify_excel("spend_data.xlsx", column="line_item", sheet_name="Q1")

The output DataFrame includes all your original columns plus: segment_code, segment_label, segment_confidence, family_code, family_label, family_confidence, class_code, class_label, class_confidence, commodity_code, commodity_label, commodity_confidence, commodity_populated, match_level, flags.
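One way to use these columns is to split rows into an auto-accept set and a manual review queue. The sketch below works on plain dicts keyed by the column names listed above. `needs_review` is a hypothetical helper, the 0.55 cut-off is only an example, and both the `"class"` value for match_level and the exact shape of the flags column (list or comma-separated string) are assumptions.

```python
def needs_review(row, min_class_confidence=0.55):
    """Return True when a classified row should be checked by a person."""
    flags = row["flags"]
    if isinstance(flags, str):
        # Assumption: flags may arrive as a comma-separated string.
        flags = [f.strip() for f in flags.split(",") if f.strip()]
    return (
        bool(flags)
        or row["match_level"] != "commodity"
        or float(row["class_confidence"]) < min_class_confidence
    )

# Illustrative rows using a subset of the output columns.
rows = [
    {"class_label": "Toner cartridges and supplies", "class_confidence": 0.83,
     "match_level": "commodity", "flags": []},
    {"class_label": "(some class)", "class_confidence": 0.61,
     "match_level": "class", "flags": ["composite_heuristic", "multi_segment_spread"]},
]

review_queue = [r for r in rows if needs_review(r)]
auto_accept = [r for r in rows if not needs_review(r)]
print(len(auto_accept), "accepted,", len(review_queue), "to review")
```

With a pandas DataFrame, the same function could be applied row-wise, for example `output[output.apply(needs_review, axis=1)]`.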

Export results

results = classifier.classify(descriptions)

classifier.to_csv(results, "output.csv")           # no pandas required
classifier.to_excel(results, "output.xlsx")
classifier.to_json(results, "output.json")         # writes file
json_str = classifier.to_json(results)             # returns string if no path
df = classifier.to_dataframe(results)              # pandas DataFrame

How it works

Classification runs two paths in parallel and reconciles them:

Path 1 — top-down: Segment → Family → Class, drilling down the hierarchy at each level.

Path 2 — flat class search: cosine similarity against all ~900 classes directly, giving a strong semantic anchor without the noise of 157k commodity-level comparisons.

Both paths are reconciled at the class level. If they agree, confidence is high. If they disagree, the result is flagged. Once a class is matched above the confidence threshold, commodity drill-down searches only the 20-50 commodities within that class — a tractable scope where fine-grained distinctions are reliable.
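The two-path idea can be sketched on a toy hierarchy with precomputed similarity scores. Everything here is illustrative: the labels, the scores, and the function names are hypothetical, and returning the flat-search class when the paths disagree is an assumption, since the text above only says that disagreements are flagged.

```python
# Toy hierarchy: segment -> family -> classes. Labels are illustrative only.
HIERARCHY = {
    "Office equipment": {
        "Printer supplies": ["Toner cartridges", "Ink ribbons"],
        "Desk supplies": ["Staplers"],
    },
    "Cleaning": {
        "Janitorial": ["Floor cleaning services"],
    },
}

# Stand-in cosine similarities between one input text and every node.
SIM = {
    "Office equipment": 0.62, "Cleaning": 0.20,
    "Printer supplies": 0.70, "Desk supplies": 0.35, "Janitorial": 0.15,
    "Toner cartridges": 0.83, "Ink ribbons": 0.51, "Staplers": 0.22,
    "Floor cleaning services": 0.10,
}

def best(labels, sim):
    """Pick the label with the highest similarity score."""
    return max(labels, key=lambda label: sim[label])

def top_down_class(sim):
    """Path 1: choose the best segment, then family, then class."""
    segment = best(HIERARCHY, sim)
    family = best(HIERARCHY[segment], sim)
    return best(HIERARCHY[segment][family], sim)

def flat_class(sim):
    """Path 2: compare against every class directly, ignoring the hierarchy."""
    all_classes = [c for families in HIERARCHY.values()
                   for classes in families.values() for c in classes]
    return best(all_classes, sim)

def reconcile(sim):
    """Compare both paths at the class level (assumed policy: keep the flat match)."""
    td, flat = top_down_class(sim), flat_class(sim)
    return {"class": flat, "confidence": sim[flat], "paths_agree": td == flat}

result = reconcile(SIM)
print(result)
```

Restricting the final commodity search to the children of the matched class follows the same pattern as `top_down_class`, one level further down.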

Taxonomy embeddings are cached on first use at ~/.semtax/cache/. Subsequent runs load from disk in ~1 second.
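A compute-once, load-afterwards cache of this kind might look like the sketch below. It is not semtax's actual cache format: the JSON file, the `load_or_build` helper, and the fake embedder are all assumptions, and a temporary directory stands in for ~/.semtax/cache/.

```python
import json
import tempfile
from pathlib import Path

def load_or_build(cache_file, labels, embed):
    """Return (vectors, cache_hit). Embeds the labels only when no cache file exists."""
    if cache_file.exists():
        return json.loads(cache_file.read_text()), True
    vectors = dict(zip(labels, embed(labels)))
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(vectors))
    return vectors, False

def fake_embed(texts):
    """Placeholder embedder so the example runs without a model."""
    return [[float(len(t)), 1.0] for t in texts]

cache_file = Path(tempfile.mkdtemp()) / "cache" / "taxonomy.json"
labels = ["Toner cartridges", "Staplers"]

first, first_hit = load_or_build(cache_file, labels, fake_embed)     # computes and writes
second, second_hit = load_or_build(cache_file, labels, fake_embed)   # reads from disk
print(first_hit, second_hit)
```

A real cache would also need to be keyed by embedding model and taxonomy version, so that switching models does not reuse stale vectors.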

Confidence and ambiguity flags

Every result includes a confidence score at each level. Results that are uncertain are flagged:

result = classifier.classify("IT hardware and software maintenance services")

print(result.flags)
# ['composite_heuristic', 'multi_segment_spread']

print(result.class_.confidence)   # 0.61
print(result.commodity.populated) # False — stopped at class level

Flag	Meaning
`low_confidence`	Best class match scored below threshold
`margin_too_small`	Top-1 and top-2 class scores are too close to call
`multi_segment_spread`	Top matches span multiple segments — ambiguous input
`composite_heuristic`	Input likely describes multiple distinct items
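To make the first three flags concrete, here is a minimal sketch of how they could be derived from a ranked list of class matches. It is not semtax's implementation: `ambiguity_flags` is a hypothetical function, the 0.05 margin default is an assumption, and treating any spread across more than one segment as ambiguous is a guess at the rule. `composite_heuristic` is left out because the rule behind it is not described here.

```python
def ambiguity_flags(ranked, class_threshold=0.55, margin_threshold=0.05, top_k_spread=5):
    """Derive flags from matches sorted by score, highest first.

    ranked: list of (class_label, segment_label, score) tuples.
    """
    flags = []
    top_score = ranked[0][2]
    if top_score < class_threshold:
        flags.append("low_confidence")
    if len(ranked) > 1 and top_score - ranked[1][2] < margin_threshold:
        flags.append("margin_too_small")
    # Assumed rule: more than one distinct segment among the top-k matches.
    segments = {segment for _, segment, _ in ranked[:top_k_spread]}
    if len(segments) > 1:
        flags.append("multi_segment_spread")
    return flags

# Illustrative matches: two close scores in different segments.
close_call = [
    ("Computer maintenance", "IT services", 0.61),
    ("Software licences", "Software", 0.58),
]
clear_win = [
    ("Toner cartridges", "Office equipment", 0.90),
    ("Ink ribbons", "Office equipment", 0.50),
]
print(ambiguity_flags(close_call))
print(ambiguity_flags(clear_win))
```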

Configuring thresholds

classifier = SemTax(
    class_confidence_threshold=0.55,     # flag low_confidence below this
    commodity_confidence_threshold=0.68, # stop at class level below this
)

Custom embedding models

# Higher accuracy, slower — still local
classifier = SemTax(
    embedding_model="sentence-transformers/all-mpnet-base-v2"
)

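# OpenAI embeddings (needs the openai package and an OPENAI_API_KEY; descriptions are sent to the OpenAI API)
import openai

def openai_embed(texts):
    # One request for the whole batch; returns one vector per input text
    resp = openai.embeddings.create(input=texts, model="text-embedding-3-small")
    return [r.embedding for r in resp.data]

classifier = SemTax(embedding_model=openai_embed)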

Any callable with signature (list[str]) -> list[list[float]] works.
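To show the expected shape without any model or network access, here is a toy embedder that hashes tokens into a fixed-size, unit-length vector. `hashed_bow_embed` is a hypothetical example and carries no real semantic signal, so it is only useful for wiring tests; the point is that it takes a list of strings and returns one list of floats per string.

```python
import hashlib
import math

def hashed_bow_embed(texts, dim=64):
    """Toy bag-of-words embedder: (list[str]) -> list[list[float]]."""
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for token in text.lower().split():
            # Stable hash, so the same token always lands in the same bucket.
            bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
            vec[bucket] += 1.0
        # Normalise to unit length; empty inputs stay as zero vectors.
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors

vectors = hashed_bow_embed(["laptop battery replacement", "server rack unit 2U"])
print(len(vectors), len(vectors[0]))
```

Based on the signature above, it would be passed the same way as the OpenAI function: `SemTax(embedding_model=hashed_bow_embed)`.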

What's available to import

from semtax import (
    SemTax,               # the classifier
    AmbiguityConfig,      # fine-grained threshold configuration
    ClassificationResult, # return type from classify() — useful for type hints
    LevelResult,          # segment / family / class / commodity result type
    FLAG_LOW_CONFIDENCE,        # filter results by flag — these are plain strings, use `in result.flags`
    FLAG_MARGIN_TOO_SMALL,
    FLAG_MULTI_SEGMENT_SPREAD,
    FLAG_COMPOSITE_HEURISTIC,
)

AmbiguityConfig

For control beyond the two threshold kwargs:

from semtax import SemTax, AmbiguityConfig

config = AmbiguityConfig(
    class_confidence_threshold=0.55,
    commodity_confidence_threshold=0.68,
    margin_threshold=0.08,   # stricter margin requirement
    top_k_spread=3,          # check top-3 classes for segment spread (default 5)
)

classifier = SemTax(config=config)

Filtering by flag

from semtax import FLAG_COMPOSITE_HEURISTIC

results = classifier.classify(descriptions)

# Only keep clean, high-confidence results
clean = [r for r in results if not r.flags]

# Find everything that looks composite
composite = [r for r in results if FLAG_COMPOSITE_HEURISTIC in r.flags]

Telemetry

semtax collects anonymous usage data (batch sizes, taxonomy matched, model used; never description text) via PostHog. Opt out at any time, either with a constructor argument or with an environment variable:

SemTax(telemetry=False)

SEMTAX_DISABLE_TELEMETRY=1 python your_script.py

Roadmap

Version	Feature
V1	UNSPSC classification, hybrid search, confidence scoring, ambiguity flags
V2	CWE (cybersecurity weakness classification)
V2	NAICS (industry/vendor classification)
V2	LLM enrichment layer for low-confidence items
V2	Custom taxonomy support (bring your own CSV)
V3	arXiv subject categories
V3	CPV (EU public procurement)

Why not just use an LLM?

Cost at scale: Classifying 50k rows through an LLM API is expensive. Local embeddings cost nothing.
Speed: Batch embedding classification is orders of magnitude faster than LLM inference.
No data leaving your environment: Sensitive procurement or financial data often can't touch external APIs.
Deterministic output: LLMs hallucinate codes and format output inconsistently. semtax returns clean, structured results every time.

LLMs are reasoning engines, not classification infrastructure.

License

MIT

Release history

Version	Release date
0.1.1	Apr 7, 2026
0.1.0	Mar 26, 2026
Download files

File	Type	Size	Uploaded	SHA256
semtax-0.1.1.tar.gz	Source distribution	6.0 MB	Apr 7, 2026	`12822cebb250444514aaee6e828ce99fb4ab3c654324794c77e962c7d17b0f22`
semtax-0.1.1-py3-none-any.whl	Built distribution (Python 3)	6.1 MB	Apr 7, 2026	`10ceb491fd627cf8948b300174ad2c04e2e31ee32dcdb243f535025ebf8d093f`

Both files were uploaded with twine/6.1.0 on CPython/3.13.7 using Trusted Publishing, from the publish.yml workflow of getfounded/semtax at tag v0.1.1 (commit 5407022680ccc74c224e20a9b3dcd36feb361144).