Fast outlier classification using PCA-based LSH

DYF - Outlier Classification

Fast outlier classification using PCA-based LSH. Identifies three types of items in embedding spaces:

  • Dense: Items in well-populated semantic buckets (the majority)
  • Diaspora: Items in sparse buckets that find a community once recovery PCA is applied (they were misplaced by the global structure)
  • Orphan: Truly unique items with no semantic neighbors

Installation

pip install dyf

For full features (embedding generation, LLM labeling):

pip install dyf[full]

Quick Start

Fast Classification (Rust-accelerated)

import numpy as np
from dyf import OutlierClassifier

# Create embeddings (e.g., from sentence-transformers)
embeddings = np.random.randn(10000, 384).astype(np.float32)

# Classify outliers (benchmarked at ~60ms for 60K samples at 384d; see Performance)
classifier = OutlierClassifier(embedding_dim=384)
classifier.fit(embeddings)

# Get results
print(classifier.report())
diaspora = classifier.get_diaspora()  # Indices of diaspora items
orphans = classifier.get_orphans()    # Indices of orphan items
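
A minimal sketch of how the returned indices might be used, assuming get_diaspora() and get_orphans() return integer indices into the rows passed to fit(). The helper dense_indices and the stand-in values are hypothetical, not part of dyf, and treating "everything else" as dense is an assumption based on the three categories listed above.

```python
import numpy as np

def dense_indices(n_items, diaspora_idx, orphan_idx):
    """Indices not listed as diaspora or orphan (assumed here to be the dense items)."""
    is_dense = np.ones(n_items, dtype=bool)
    is_dense[np.asarray(diaspora_idx, dtype=np.int64)] = False
    is_dense[np.asarray(orphan_idx, dtype=np.int64)] = False
    return np.flatnonzero(is_dense)

# Stand-in values; in practice these would come from get_diaspora() / get_orphans()
embeddings = np.arange(12, dtype=np.float32).reshape(6, 2)
diaspora = [1, 4]
orphans = [5]

dense = dense_indices(len(embeddings), diaspora, orphans)
orphan_vectors = embeddings[orphans]  # rows belonging to the orphan items
```

Index arrays like these can be used directly to slice the original embeddings or any parallel list of texts.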

Full-Featured Usage (with embeddings & labeling)

from dyf import OutlierClassifierFull, EmbedderConfig, LabelerConfig

# `texts` is a list of strings; `categories` is an optional list with one label per text

# From raw texts with built-in TF-IDF embeddings
classifier = OutlierClassifierFull.from_texts(
    texts=texts,
    categories=categories,  # Optional category labels
    embedding_dim=128
)

# Or use sentence-transformers
embeddings = EmbedderConfig.MEDIUM.embed(texts)  # all-mpnet-base-v2
classifier = OutlierClassifierFull(embedding_dim=768)
classifier.fit(embeddings, categories=categories, texts=texts)

# Get detailed report
print(classifier.report())

# Label buckets with local LLM
labels = classifier.label_buckets(**LabelerConfig.MEDIUM.as_kwargs())
print(labels['dense'][1234]['label'])  # "Reinforcement Learning"

# Or use keyword extraction (no LLM required)
labels = classifier.label_buckets_keywords()

Performance

| Implementation | 60K samples (384d) | Per sample |
|---|---|---|
| DYF (Rust) | ~60ms | 1.0 µs |
| Pure Python | ~230ms | 3.8 µs |

The Rust implementation is about 3.8x faster than the pure Python/sklearn implementation (~230ms vs ~60ms on 60K samples).
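
The per-sample figures and the speedup follow directly from the totals reported above. The short check below only reproduces that arithmetic; it does not run any benchmark, and the variable names are illustrative.

```python
# Totals as reported in the Performance section (milliseconds for 60K samples)
timings_ms = {"DYF (Rust)": 60.0, "Pure Python": 230.0}
n_samples = 60_000

# Convert each total to microseconds per sample
per_sample_us = {name: ms * 1000 / n_samples for name, ms in timings_ms.items()}

# Ratio of the two totals
speedup = timings_ms["Pure Python"] / timings_ms["DYF (Rust)"]

for name, us in per_sample_us.items():
    print(f"{name}: {us:.1f} µs per sample")
print(f"speedup: {speedup:.1f}x")
```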

API Reference

OutlierClassifier (Fast)

OutlierClassifier(
    embedding_dim: int,
    initial_bits: int = 14,       # Bits for initial PCA LSH
    recovery_bits: int = 8,       # Bits for recovery PCA
    dense_threshold: int = 10,    # Min bucket size for "dense"
    intra_outlier_std: float = 2.0,   # Std threshold for intra-bucket outliers
    recovery_cluster_min: int = 3,    # Min cluster size for "recovered"
    seed: int = 31
)
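
As a rough sizing aid: a hash of b bits can address at most 2^b buckets, so the bit counts bound how finely the space is partitioned. The sketch below works through that arithmetic for the defaults above; how DYF lays out its buckets internally is not documented here, so treat this as an upper-bound estimate rather than a description of the library.

```python
initial_bits = 14
recovery_bits = 8
dense_threshold = 10
n_samples = 60_000

# Upper bound on the number of distinct buckets at each stage
max_initial_buckets = 2 ** initial_bits    # 16384
max_recovery_buckets = 2 ** recovery_bits  # 256

# Average occupancy if samples were spread evenly over every possible bucket
uniform_occupancy = n_samples / max_initial_buckets

print(max_initial_buckets, max_recovery_buckets)
print(f"uniform occupancy: {uniform_occupancy:.2f} items per bucket")
print("below dense_threshold:", uniform_occupancy < dense_threshold)
```

Under a perfectly even spread, occupancy would sit below the default dense_threshold, so dense buckets presumably arise because real embeddings concentrate in a small fraction of the possible buckets.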

Methods:

  • fit(embeddings) - Fit on numpy array (n_samples, embedding_dim)
  • fit_arrow(arrow_array) - Fit on PyArrow FixedSizeListArray (zero-copy)
  • get_diaspora() - Get indices of diaspora items
  • get_orphans() - Get indices of orphan items
  • get_statuses() - Get status for all items
  • report() - Get classification report

EmbedderConfig Presets

| Name | Model | Dimensions | Size |
|---|---|---|---|
| TFIDF | TF-IDF + SVD | 128 | 0 MB |
| LOW | all-MiniLM-L6-v2 | 384 | 80 MB |
| MEDIUM | all-mpnet-base-v2 | 768 | 420 MB |
| HIGH | bge-large-en-v1.5 | 1024 | 1.3 GB |
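
The preset dimensions determine both the embedding_dim to pass to the classifier and the memory taken by the embedding matrix. The sketch below estimates the float32 footprint from the dimensions in the table; PRESET_DIMS and embedding_megabytes are hypothetical helpers written for illustration, not part of dyf.

```python
# Dimensions copied from the EmbedderConfig preset table
PRESET_DIMS = {"TFIDF": 128, "LOW": 384, "MEDIUM": 768, "HIGH": 1024}

def embedding_megabytes(n_samples, preset, bytes_per_value=4):
    """Size of an (n_samples, dim) float32 matrix in megabytes (1 MB = 1e6 bytes)."""
    return n_samples * PRESET_DIMS[preset] * bytes_per_value / 1e6

for preset in PRESET_DIMS:
    print(f"{preset}: {embedding_megabytes(60_000, preset):.1f} MB for 60K samples")
```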

LabelerConfig Presets

| Name | Model | Parameters |
|---|---|---|
| KEYWORDS | TF-IDF keywords | - |
| LOW | phi3:mini | 3.8B |
| MEDIUM | qwen2.5:7b | 7B |
| HIGH | qwen2.5:14b | 14B |

Algorithm

Two-stage PCA-based LSH outlier classification:

  1. Stage 1: Random hash → bucket centroids → PCA on centroids → re-hash
  2. Outlier Detection: Sparse buckets + intra-bucket distance outliers
  3. Stage 2: Recovery PCA on outliers → diaspora vs orphan
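
The first two steps above can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not DYF's Rust implementation: it uses sign-of-projection hashing, takes PCA directions from an SVD of the bucket centroids, and flags only sparse buckets (the intra-bucket distance check is omitted). The function names hash_bits, pca_lsh_buckets, and sparse_mask are hypothetical.

```python
import numpy as np

def hash_bits(x, directions):
    """Sign-of-projection hash: one bit per direction, packed into an integer id."""
    bits = (x @ directions.T) > 0                       # (n_samples, n_bits) booleans
    weights = 1 << np.arange(directions.shape[0], dtype=np.int64)
    return bits.astype(np.int64) @ weights              # (n_samples,) bucket ids

def pca_lsh_buckets(embeddings, n_bits=8, seed=31):
    """Random hash -> bucket centroids -> PCA on centroids -> re-hash."""
    rng = np.random.default_rng(seed)
    x = embeddings - embeddings.mean(axis=0)

    # Random hyperplane hash to get a first, unstructured bucketing
    random_dirs = rng.standard_normal((n_bits, x.shape[1]))
    random_ids = hash_bits(x, random_dirs)

    # Centroid of every occupied bucket
    _, inverse = np.unique(random_ids, return_inverse=True)
    sums = np.zeros((inverse.max() + 1, x.shape[1]))
    np.add.at(sums, inverse, x)
    centroids = sums / np.bincount(inverse)[:, None]

    # PCA on the centroids: the top right-singular vectors are the principal axes
    centered = centroids - centroids.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)

    # Re-hash every item against the principal axes
    return hash_bits(x, vt[:n_bits])

def sparse_mask(bucket_ids, dense_threshold=10):
    """True for items whose bucket holds fewer than dense_threshold items."""
    _, inverse, counts = np.unique(bucket_ids, return_inverse=True, return_counts=True)
    return counts[inverse] < dense_threshold
```

Items where sparse_mask is True are the candidates that Stage 2 would then try to recover.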

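The key insight: items flagged as outliers under the global PCA hash often share structure that only shows up at a coarser resolution. The defaults reflect this: recovery_bits is 8, against 14 for initial_bits.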
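
The recovery stage can be sketched the same way. The version below is an assumption about how "Recovery PCA on outliers" might work, not DYF's actual code: it fits PCA on the outliers alone, hashes them with fewer bits, and calls an item diaspora when its recovery bucket reaches recovery_cluster_min members. The function name recover is hypothetical.

```python
import numpy as np

def recover(outliers, recovery_bits=8, recovery_cluster_min=3):
    """Label each outlier 'diaspora' or 'orphan' using a coarse PCA hash."""
    x = outliers - outliers.mean(axis=0)

    # PCA fitted on the outliers only, so the axes follow their own structure
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    dirs = vt[:recovery_bits]

    # Sign-of-projection hash, packed into one integer id per item
    bits = (x @ dirs.T) > 0
    ids = bits.astype(np.int64) @ (1 << np.arange(dirs.shape[0], dtype=np.int64))

    # Outliers that share a recovery bucket with enough others are "recovered"
    _, inverse, counts = np.unique(ids, return_inverse=True, return_counts=True)
    recovered = counts[inverse] >= recovery_cluster_min
    return np.where(recovered, "diaspora", "orphan")
```

Fewer bits means larger buckets, which gives scattered outliers a chance to land together even when the 14-bit hash separated them.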

License

MIT
