Density Yields Features - discover structure in embedding spaces

DYF - Outlier Classification

Fast outlier classification using PCA-based locality-sensitive hashing (LSH). It sorts the items in an embedding space into three types:

  • Dense: Items in well-populated semantic buckets (the majority)
  • Diaspora: Items in sparse buckets that find a community under recovery PCA (they were misplaced by the global structure)
  • Orphan: Truly unique items with no semantic neighbors

Installation

pip install dyf

For full features (embedding generation, LLM labeling):

pip install "dyf[full]"

Quick Start

Fast Classification (Rust-accelerated)

import numpy as np
from dyf import OutlierClassifier

# Stand-in embeddings; in practice these come from a model (e.g., sentence-transformers)
embeddings = np.random.randn(10000, 384).astype(np.float32)

# Classify outliers (benchmarked at ~60 ms for 60K samples, see Performance)
classifier = OutlierClassifier(embedding_dim=384)
classifier.fit(embeddings)

# Get results
print(classifier.report())
diaspora = classifier.get_diaspora()  # Indices of diaspora items
orphans = classifier.get_orphans()    # Indices of orphan items
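
The two index lists can be combined into one status label per item. The following is a minimal sketch, assuming get_diaspora() and get_orphans() return integer indices into the fitted array and that every other item is dense. The hard-coded index arrays are hypothetical stand-ins so the snippet runs without dyf installed.

```python
import numpy as np

n_items = 10
diaspora = np.array([2, 7])  # hypothetical stand-in for classifier.get_diaspora()
orphans = np.array([5])      # hypothetical stand-in for classifier.get_orphans()

# Start with everything labeled dense, then overwrite the two outlier groups.
status = np.full(n_items, "dense", dtype=object)
status[diaspora] = "diaspora"
status[orphans] = "orphan"

print(list(status))
```

The same index arrays can select rows from the original embeddings or texts, for example embeddings[orphans].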

Full-Featured Usage (with embeddings & labeling)

from dyf import OutlierClassifierFull, EmbedderConfig, LabelerConfig

# `documents` is your list of raw text strings; `categories` holds optional category labels.

# From raw texts with built-in TF-IDF embeddings
classifier = OutlierClassifierFull.from_texts(
    texts=documents,
    categories=categories,  # Optional category labels
    embedding_dim=128
)

# Or use sentence-transformers
embeddings = EmbedderConfig.MEDIUM.embed(documents)  # all-mpnet-base-v2 (768 dimensions)
classifier = OutlierClassifierFull(embedding_dim=768)
classifier.fit(embeddings, categories=categories, texts=documents)

# Get detailed report
print(classifier.report())

# Label buckets with local LLM
labels = classifier.label_buckets(**LabelerConfig.MEDIUM.as_kwargs())
print(labels['dense'][1234]['label'])  # e.g. "Reinforcement Learning" for bucket 1234

# Or use keyword extraction (no LLM required)
labels = classifier.label_buckets_keywords()

Performance

Implementation   60K samples (384d)   Per sample
DYF (Rust)       ~60 ms               1.0 µs
Pure Python      ~230 ms              3.8 µs

The Rust implementation is about 3.8x faster than the pure Python/sklearn version (~230 ms vs. ~60 ms on 60K samples).
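
The per-sample and speedup figures follow directly from the two timings. This is only a worked check of the arithmetic using the numbers quoted in this section, not a benchmark.

```python
n_samples = 60_000
timings_ms = {"DYF (Rust)": 60.0, "Pure Python": 230.0}  # figures quoted in this section

# Convert total milliseconds to microseconds per sample.
per_sample_us = {name: ms * 1000.0 / n_samples for name, ms in timings_ms.items()}
speedup = timings_ms["Pure Python"] / timings_ms["DYF (Rust)"]

for name, us in per_sample_us.items():
    print(f"{name}: {us:.1f} µs per sample")
print(f"speedup: {speedup:.1f}x")
```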

API Reference

OutlierClassifier (Fast)

OutlierClassifier(
    embedding_dim: int,
    initial_bits: int = 14,       # Bits for initial PCA LSH
    recovery_bits: int = 8,       # Bits for recovery PCA
    dense_threshold: int = 10,    # Min bucket size for "dense"
    intra_outlier_std: float = 2.0,   # Std threshold for intra-bucket outliers
    recovery_cluster_min: int = 3,    # Min cluster size for "recovered"
    seed: int = 31
)
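
To get a feel for the default bit counts, it helps to work out how many buckets they imply. The sketch below assumes each hash bit is one binary split, so n bits give at most 2**n buckets. That is the usual LSH convention, not something confirmed against the dyf source. The 60K sample count is borrowed from the Performance section.

```python
initial_bits = 14
recovery_bits = 8
dense_threshold = 10
n_samples = 60_000

# Assumption: one binary split per bit, so at most 2**bits distinct buckets.
initial_buckets = 2 ** initial_bits
recovery_buckets = 2 ** recovery_bits
mean_occupancy = n_samples / initial_buckets  # average if items were spread evenly

print(f"initial buckets:  {initial_buckets}")
print(f"recovery buckets: {recovery_buckets}")
print(f"mean occupancy at 60K samples: {mean_occupancy:.2f} (dense_threshold={dense_threshold})")
```

Real embeddings are rarely spread evenly, so actual bucket sizes will vary widely around that average.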

Methods:

  • fit(embeddings) - Fit on numpy array (n_samples, embedding_dim)
  • fit_arrow(arrow_array) - Fit on PyArrow FixedSizeListArray (zero-copy)
  • get_diaspora() - Get indices of diaspora items
  • get_orphans() - Get indices of orphan items
  • get_statuses() - Get status for all items
  • report() - Get classification report

EmbedderConfig Presets

Name     Model               Dimensions   Size
TFIDF    TF-IDF + SVD        128          0 MB
LOW      all-MiniLM-L6-v2    384          80 MB
MEDIUM   all-mpnet-base-v2   768          420 MB
HIGH     bge-large-en-v1.5   1024         1.3 GB

LabelerConfig Presets

Name       Model             Parameters
KEYWORDS   TF-IDF keywords   -
LOW        phi3:mini         3.8B
MEDIUM     qwen2.5:7b        7B
HIGH       qwen2.5:14b       14B

Algorithm

Two-stage PCA-based LSH outlier classification:

  1. Stage 1: Random hash → bucket centroids → PCA on centroids → re-hash
  2. Outlier Detection: Sparse buckets + intra-bucket distance outliers
  3. Stage 2: Recovery PCA on outliers → diaspora vs orphan
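
Steps 1 and 2 can be sketched in plain NumPy. This is an illustrative reading of the outline above, not dyf's Rust implementation. The helper names (sign_hash, bucket_centroids, stage_one, find_outliers) are hypothetical, and the sign-of-projection hashing and the "mean + k·std" distance rule are assumptions about how the steps might be realized.

```python
import numpy as np

def sign_hash(X, directions):
    """Pack the signs of X @ directions.T into one integer bucket code per row."""
    bits = (X @ directions.T) > 0
    weights = 1 << np.arange(directions.shape[0])  # bit i contributes 2**i
    return bits.astype(np.int64) @ weights

def bucket_centroids(X, codes):
    """Return (centroid per bucket, bucket index per row, bucket sizes)."""
    _, inverse, counts = np.unique(codes, return_inverse=True, return_counts=True)
    sums = np.zeros((len(counts), X.shape[1]))
    np.add.at(sums, inverse, X)  # accumulate each row into its bucket
    return sums / counts[:, None], inverse, counts

def stage_one(X, n_bits, rng):
    """Random hash -> bucket centroids -> PCA on centroids -> re-hash."""
    centered = X - X.mean(axis=0)
    random_dirs = rng.standard_normal((n_bits, X.shape[1]))
    centroids, _, _ = bucket_centroids(centered, sign_hash(centered, random_dirs))
    # Rows of vt are the principal directions of the centroids.
    _, _, vt = np.linalg.svd(centroids - centroids.mean(axis=0), full_matrices=False)
    return sign_hash(centered, vt[:n_bits])

def find_outliers(X, codes, dense_threshold=10, intra_outlier_std=2.0):
    """Flag items in sparse buckets or unusually far from their bucket centroid."""
    centroids, inverse, counts = bucket_centroids(X, codes)
    dist = np.linalg.norm(X - centroids[inverse], axis=1)
    mean = np.bincount(inverse, weights=dist) / counts
    var = np.bincount(inverse, weights=dist**2) / counts - mean**2
    std = np.sqrt(np.maximum(var, 0.0))
    sparse = counts[inverse] < dense_threshold
    far = dist > (mean + intra_outlier_std * std)[inverse]
    return sparse | far

rng = np.random.default_rng(31)
X = rng.standard_normal((2000, 32))
codes = stage_one(X, n_bits=6, rng=rng)
outliers = find_outliers(X, codes)
print(f"{len(np.unique(codes))} buckets, {int(outliers.sum())} candidate outliers")
```

The demo uses 6 bits on 2,000 random vectors so the buckets stay reasonably populated; the library defaults (14 bits) target much larger inputs.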

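The key insight: items that look like outliers under the global PCA hash often share structure at a coarser resolution, which is consistent with the recovery stage defaulting to fewer bits (recovery_bits=8 vs. initial_bits=14).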
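
A minimal sketch of that recovery idea: fit a fresh PCA on the outliers alone, hash them with fewer bits, and treat any outlier that lands in a sufficiently populated recovery bucket as diaspora. The function name recover and the exact rule (bucket size >= recovery_cluster_min) are assumptions for illustration, not dyf's confirmed behavior.

```python
import numpy as np

def recover(X_outliers, recovery_bits=8, recovery_cluster_min=3):
    """Label each outlier 'diaspora' or 'orphan' via a coarser PCA sign hash."""
    centered = X_outliers - X_outliers.mean(axis=0)
    # PCA fitted on the outliers only; rows of vt are principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    dirs = vt[:recovery_bits]
    bits = (centered @ dirs.T) > 0
    codes = bits.astype(np.int64) @ (1 << np.arange(dirs.shape[0]))
    _, inverse, counts = np.unique(codes, return_inverse=True, return_counts=True)
    # Outliers that share a recovery bucket with enough others found a community.
    return np.where(counts[inverse] >= recovery_cluster_min, "diaspora", "orphan")

rng = np.random.default_rng(31)
cluster = rng.standard_normal((30, 16)) * 0.05 + 3.0  # tight group of similar outliers
loners = rng.standard_normal((10, 16)) * 5.0          # widely scattered outliers
statuses = recover(np.vstack([cluster, loners]))
print({s: int((statuses == s).sum()) for s in ("diaspora", "orphan")})
```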

License

MIT
