Density Yields Features - discover structure in embedding spaces
DYF - Outlier Classification
Fast outlier classification using PCA-based locality-sensitive hashing (LSH). It sorts the items of an embedding space into three types:
- Dense: Items in well-populated semantic buckets (the majority)
- Diaspora: Items from sparse buckets that regroup under a second, recovery PCA (the global structure misplaced them)
- Orphan: Truly unique items with no semantic neighbors
Installation
pip install dyf
For full features (embedding generation, LLM labeling):
pip install dyf[full]
Quick Start
Fast Classification (Rust-accelerated)
import numpy as np
from dyf import OutlierClassifier
# Create embeddings (e.g., from sentence-transformers)
embeddings = np.random.randn(10000, 384).astype(np.float32)
# Classify outliers (benchmark: ~60ms for 60K samples, see Performance)
classifier = OutlierClassifier(embedding_dim=384)
classifier.fit(embeddings)
# Get results
print(classifier.report())
diaspora = classifier.get_diaspora() # Indices of diaspora items
orphans = classifier.get_orphans() # Indices of orphan items
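Assuming `get_diaspora()` and `get_orphans()` return integer index arrays into the rows passed to `fit` (as the comments above suggest), the results can be used with ordinary NumPy indexing. The sketch below uses hand-written stand-in index arrays rather than real classifier output, and `dense_mask` is a hypothetical name introduced for illustration.

```python
import numpy as np

embeddings = np.random.default_rng(0).standard_normal((1000, 384)).astype(np.float32)

# Stand-ins for classifier.get_diaspora() / classifier.get_orphans();
# real output depends on the data and is not reproduced here.
diaspora = np.array([3, 17, 256])
orphans = np.array([42, 999])

# Select the rows flagged as diaspora or orphan.
diaspora_vectors = embeddings[diaspora]
orphan_vectors = embeddings[orphans]

# Everything that is neither diaspora nor orphan is dense.
dense_mask = np.ones(len(embeddings), dtype=bool)
dense_mask[diaspora] = False
dense_mask[orphans] = False
dense_vectors = embeddings[dense_mask]

print(diaspora_vectors.shape, orphan_vectors.shape, dense_vectors.shape)
```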
Full-Featured Usage (with embeddings & labeling)
from dyf import OutlierClassifierFull, EmbedderConfig, LabelerConfig
# texts: your documents as strings; categories: optional labels, one per text
# From raw texts with built-in TF-IDF embeddings
classifier = OutlierClassifierFull.from_texts(
    texts=texts,
    categories=categories,  # Optional category labels
    embedding_dim=128
)
# Or use sentence-transformers
embeddings = EmbedderConfig.MEDIUM.embed(texts) # all-mpnet-base-v2
classifier = OutlierClassifierFull(embedding_dim=768)
classifier.fit(embeddings, categories=categories, texts=texts)
# Get detailed report
print(classifier.report())
# Label buckets with local LLM
labels = classifier.label_buckets(**LabelerConfig.MEDIUM.as_kwargs())
print(labels['dense'][1234]['label']) # "Reinforcement Learning"
# Or use keyword extraction (no LLM required)
labels = classifier.label_buckets_keywords()
Performance
| Implementation | 60K samples (384d) | Per sample |
|---|---|---|
| DYF (Rust) | ~60ms | 1.0 µs |
| Pure Python | ~230ms | 3.8 µs |
The Rust implementation is roughly 3.8x faster than the pure Python/sklearn baseline (230 ms vs. 60 ms).
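The per-sample and speedup figures follow directly from the wall-clock times in the table. The short check below reproduces that arithmetic; the variable names are illustrative only.

```python
# Reproduce the per-sample and speedup figures from the table above.
n_samples = 60_000
rust_ms, python_ms = 60.0, 230.0  # approximate wall-clock times from the table

rust_us_per_sample = rust_ms * 1000.0 / n_samples      # ms -> µs, then per sample
python_us_per_sample = python_ms * 1000.0 / n_samples
speedup = python_ms / rust_ms

print(f"Rust:    {rust_us_per_sample:.1f} µs/sample")    # 1.0
print(f"Python:  {python_us_per_sample:.1f} µs/sample")  # 3.8
print(f"Speedup: {speedup:.1f}x")                        # 3.8x
```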
API Reference
OutlierClassifier (Fast)
OutlierClassifier(
    embedding_dim: int,
    initial_bits: int = 14,          # Bits for initial PCA LSH
    recovery_bits: int = 8,          # Bits for recovery PCA
    dense_threshold: int = 10,       # Min bucket size for "dense"
    intra_outlier_std: float = 2.0,  # Std threshold for intra-bucket outliers
    recovery_cluster_min: int = 3,   # Min cluster size for "recovered"
    seed: int = 31
)
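To get a feel for the bit parameters, it can help to count buckets. The sketch below assumes each hash bit is a single sign bit, so `b` bits address at most `2**b` buckets; this is the usual LSH convention, not something the package documents, and `max_buckets` and `mean_occupancy` are hypothetical helpers.

```python
# Assumption: each hash bit splits the space in two, so b bits address 2**b buckets.
def max_buckets(bits: int) -> int:
    return 2 ** bits

def mean_occupancy(n_samples: int, bits: int) -> float:
    """Average items per bucket if items were spread evenly (real data rarely is)."""
    return n_samples / max_buckets(bits)

print(max_buckets(14))                       # 16384 buckets at the default initial_bits
print(max_buckets(8))                        # 256 buckets at the default recovery_bits
print(round(mean_occupancy(60_000, 14), 2))  # 3.66
```

Under this assumption, 60K items spread evenly over 16,384 buckets would average fewer than four per bucket, below `dense_threshold=10`, so a bucket would become dense only where embeddings concentrate.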
Methods:
- fit(embeddings): Fit on a NumPy array of shape (n_samples, embedding_dim)
- fit_arrow(arrow_array): Fit on a PyArrow FixedSizeListArray (zero-copy)
- get_diaspora(): Get indices of diaspora items
- get_orphans(): Get indices of orphan items
- get_statuses(): Get status for all items
- report(): Get classification report
EmbedderConfig Presets
| Name | Model | Dimensions | Size |
|---|---|---|---|
| TFIDF | TF-IDF + SVD | 128 | 0 MB |
| LOW | all-MiniLM-L6-v2 | 384 | 80 MB |
| MEDIUM | all-mpnet-base-v2 | 768 | 420 MB |
| HIGH | bge-large-en-v1.5 | 1024 | 1.3 GB |
LabelerConfig Presets
| Name | Model | Parameters |
|---|---|---|
| KEYWORDS | TF-IDF keywords | - |
| LOW | phi3:mini | 3.8B |
| MEDIUM | qwen2.5:7b | 7B |
| HIGH | qwen2.5:14b | 14B |
Algorithm
Two-stage PCA-based LSH outlier classification:
- Stage 1: Random hash → bucket centroids → PCA on centroids → re-hash
- Outlier Detection: Sparse buckets + intra-bucket distance outliers
- Stage 2: Recovery PCA on outliers → diaspora vs orphan
The key insight: outliers from global PCA often share structure at coarser resolution.
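The stages above can be sketched in plain NumPy. This is a minimal illustration of the idea, not the package's Rust implementation: it assumes sign-of-projection hashing, PCA via SVD, and a "mean + k·std" rule for intra-bucket outliers, and it uses smaller default bit counts than the library so that toy data fills its buckets. The helpers `sign_hash`, `pca_directions`, and `classify` are hypothetical names.

```python
import numpy as np

def sign_hash(x, directions):
    """Map each row of x to an integer bucket id from the signs of its projections."""
    bits = (x @ directions.T) > 0                          # (n, n_bits) booleans
    weights = 1 << np.arange(directions.shape[0], dtype=np.int64)
    return bits.astype(np.int64) @ weights                 # (n,) bucket ids

def pca_directions(x, n_bits):
    """Return up to n_bits leading principal directions of x (rows of Vt from an SVD)."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_bits]

def classify(x, initial_bits=6, recovery_bits=3, dense_threshold=10,
             intra_outlier_std=2.0, recovery_cluster_min=3, seed=31):
    rng = np.random.default_rng(seed)
    xc = x - x.mean(axis=0)

    # Stage 1: random hash -> bucket centroids -> PCA on centroids -> re-hash.
    rough = sign_hash(xc, rng.standard_normal((initial_bits, x.shape[1])))
    centroids = np.stack([xc[rough == b].mean(axis=0) for b in np.unique(rough)])
    buckets = sign_hash(xc, pca_directions(centroids, initial_bits))

    # Outlier detection: sparse buckets plus intra-bucket distance outliers.
    _, inverse, counts = np.unique(buckets, return_inverse=True, return_counts=True)
    outlier = counts[inverse] < dense_threshold
    for b in np.flatnonzero(counts >= dense_threshold):
        members = np.flatnonzero(inverse == b)
        dist = np.linalg.norm(xc[members] - xc[members].mean(axis=0), axis=1)
        outlier[members] |= dist > dist.mean() + intra_outlier_std * dist.std()

    # Stage 2: recovery PCA on the outliers only, with fewer bits (coarser buckets).
    status = np.full(x.shape[0], "dense", dtype="U8")
    idx = np.flatnonzero(outlier)
    if idx.size:
        out = x[idx] - x[idx].mean(axis=0)
        rec = sign_hash(out, pca_directions(out, recovery_bits))
        _, rec_inverse, rec_counts = np.unique(rec, return_inverse=True, return_counts=True)
        recovered = rec_counts[rec_inverse] >= recovery_cluster_min
        status[idx] = np.where(recovered, "diaspora", "orphan")
    return status

# Toy data: three tight clusters plus a handful of widely scattered points.
rng = np.random.default_rng(0)
centers = rng.standard_normal((3, 16)) * 5.0
clustered = np.concatenate([c + 0.05 * rng.standard_normal((200, 16)) for c in centers])
scattered = rng.standard_normal((20, 16)) * 20.0
data = np.concatenate([clustered, scattered]).astype(np.float32)

status = classify(data)
labels, label_counts = np.unique(status, return_counts=True)
print(dict(zip(labels.tolist(), label_counts.tolist())))
```

Because the recovery stage hashes with fewer bits, its buckets are larger, which is how outliers that were split apart by the global hash can land together again.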
License
MIT