Fast outlier classification using PCA-based LSH

DYF - Outlier Classification

Fast outlier classification using PCA-based LSH. Identifies three types of items in embedding spaces:

  • Dense: Items in well-populated semantic buckets (the majority)
  • Diaspora: Items in sparse buckets that regroup with similar items under recovery PCA (they were misplaced by the global structure)
  • Orphan: Truly unique items with no semantic neighbors

Installation

pip install dyf

For full features (embedding generation, LLM labeling):

pip install dyf[full]

Quick Start

Fast Classification (Rust-accelerated)

import numpy as np
from dyf import OutlierClassifier

# Create embeddings (e.g., from sentence-transformers)
embeddings = np.random.randn(10000, 384).astype(np.float32)

# Classify outliers (benchmarked at ~60ms for 60K samples; see Performance)
classifier = OutlierClassifier(embedding_dim=384)
classifier.fit(embeddings)

# Get results
print(classifier.report())
diaspora = classifier.get_diaspora()  # Indices of diaspora items
orphans = classifier.get_orphans()    # Indices of orphan items

Full-Featured Usage (with embeddings & labeling)

from dyf import OutlierClassifierFull, EmbedderConfig, LabelerConfig

# `texts` and `categories` are supplied by you (not defined in this snippet)
# From raw texts with built-in TF-IDF embeddings
classifier = OutlierClassifierFull.from_texts(
    texts=texts,
    categories=categories,  # Optional category labels
    embedding_dim=128
)

# Or use sentence-transformers
embeddings = EmbedderConfig.MEDIUM.embed(texts)  # all-mpnet-base-v2
classifier = OutlierClassifierFull(embedding_dim=768)
classifier.fit(embeddings, categories=categories, texts=texts)

# Get detailed report
print(classifier.report())

# Label buckets with local LLM
labels = classifier.label_buckets(**LabelerConfig.MEDIUM.as_kwargs())
print(labels['dense'][1234]['label'])  # "Reinforcement Learning"

# Or use keyword extraction (no LLM required)
labels = classifier.label_buckets_keywords()

Performance

Implementation   60K samples (384d)   Per sample
DYF (Rust)       ~60ms                1.0 µs
Pure Python      ~230ms               3.8 µs

3.8x faster than pure Python/sklearn.
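
The per-sample and speedup figures follow directly from the totals in the table. A small arithmetic check, using only the numbers quoted above (the helper name `per_sample_us` is made up for this illustration):

```python
# Totals quoted in the Performance table (60K samples, 384d).
N_SAMPLES = 60_000
timings_ms = {"DYF (Rust)": 60.0, "Pure Python": 230.0}


def per_sample_us(total_ms: float, n_samples: int) -> float:
    """Convert a total runtime in milliseconds to microseconds per sample."""
    return total_ms * 1000.0 / n_samples


per_sample = {name: per_sample_us(ms, N_SAMPLES) for name, ms in timings_ms.items()}
speedup = timings_ms["Pure Python"] / timings_ms["DYF (Rust)"]

for name, us in per_sample.items():
    print(f"{name}: {us:.1f} us per sample")
print(f"speedup: {speedup:.1f}x")
```

The timings themselves are the approximate values from the table, so the derived figures are approximate as well.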

API Reference

OutlierClassifier (Fast)

OutlierClassifier(
    embedding_dim: int,
    initial_bits: int = 14,       # Bits for initial PCA LSH
    recovery_bits: int = 8,       # Bits for recovery PCA
    dense_threshold: int = 10,    # Min bucket size for "dense"
    intra_outlier_std: float = 2.0,   # Std threshold for intra-bucket outliers
    recovery_cluster_min: int = 3,    # Min cluster size for "recovered"
    seed: int = 31
)
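
To get a feel for the bit settings, it can help to estimate how many buckets they imply. The sketch below assumes each hash bit is one binary split, so `b` bits give `2**b` buckets; that is the usual convention for sign-based LSH, but it is an assumption here, not something this README states. The helper `bucket_stats` is hypothetical.

```python
from typing import Tuple


def bucket_stats(n_samples: int, n_bits: int) -> Tuple[int, float]:
    """Return (number of buckets, average items per bucket) for a b-bit hash.

    Assumes one binary split per bit, i.e. 2**n_bits possible buckets.
    """
    n_buckets = 2 ** n_bits
    return n_buckets, n_samples / n_buckets


# Default initial_bits=14 applied to the 60K-sample benchmark size.
initial_buckets, initial_avg = bucket_stats(60_000, 14)
# Default recovery_bits=8 (applied to the outliers only, whose count varies).
recovery_buckets, _ = bucket_stats(60_000, 8)

print(f"initial:  {initial_buckets} buckets, {initial_avg:.2f} items per bucket on average")
print(f"recovery: {recovery_buckets} buckets")
```

Under that assumption, uniformly spread data would average fewer items per bucket than the default `dense_threshold` of 10, so buckets only become dense where the embeddings actually cluster.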

Methods:

  • fit(embeddings) - Fit on numpy array (n_samples, embedding_dim)
  • fit_arrow(arrow_array) - Fit on PyArrow FixedSizeListArray (zero-copy)
  • get_diaspora() - Get indices of diaspora items
  • get_orphans() - Get indices of orphan items
  • get_statuses() - Get status for all items
  • report() - Get classification report

EmbedderConfig Presets

Name     Model               Dimensions   Size
TFIDF    TF-IDF + SVD        128          0 MB
LOW      all-MiniLM-L6-v2    384          80 MB
MEDIUM   all-mpnet-base-v2   768          420 MB
HIGH     bge-large-en-v1.5   1024         1.3 GB

LabelerConfig Presets

Name       Model             Parameters
KEYWORDS   TF-IDF keywords   -
LOW        phi3:mini         3.8B
MEDIUM     qwen2.5:7b        7B
HIGH       qwen2.5:14b       14B

Algorithm

Two-stage PCA-based LSH outlier classification:

  1. Stage 1: Random hash → bucket centroids → PCA on centroids → re-hash
  2. Outlier Detection: Sparse buckets + intra-bucket distance outliers
  3. Stage 2: Recovery PCA on outliers → diaspora vs orphan
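
Stage 1 and the sparse-bucket part of outlier detection can be sketched in NumPy as follows. This is one reading of the one-line description above, not the library's Rust implementation: the centering step, the unweighted centroids, the sign-bit hashing, and the names `sign_hash` and `stage1_bucket_ids` are all assumptions made for illustration.

```python
import numpy as np


def sign_hash(x: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Pack the signs of x's projections onto each direction into one integer id per row."""
    bits = (x @ directions.T > 0).astype(np.int64)
    weights = 1 << np.arange(directions.shape[0], dtype=np.int64)
    return bits @ weights


def stage1_bucket_ids(x: np.ndarray, n_bits: int, rng: np.random.Generator) -> np.ndarray:
    """Random hash -> bucket centroids -> PCA on centroids -> re-hash."""
    centered = x - x.mean(axis=0)
    # Rough buckets from random hyperplanes.
    random_dirs = rng.standard_normal((n_bits, x.shape[1]))
    _, inverse = np.unique(sign_hash(centered, random_dirs), return_inverse=True)
    # Centroid of each non-empty rough bucket.
    sums = np.zeros((inverse.max() + 1, x.shape[1]))
    np.add.at(sums, inverse, centered)
    centroids = sums / np.bincount(inverse)[:, None]
    # PCA on the centroids: rows of vt are principal directions, strongest first.
    _, _, vt = np.linalg.svd(centroids - centroids.mean(axis=0), full_matrices=False)
    # Re-hash every item with the leading principal directions.
    return sign_hash(centered, vt[:n_bits])


rng = np.random.default_rng(31)
x = rng.standard_normal((1000, 16))  # toy data, not real embeddings
bucket_ids = stage1_bucket_ids(x, n_bits=6, rng=rng)

# Sparse-bucket check: items whose bucket holds fewer than dense_threshold members.
sizes = np.bincount(bucket_ids, minlength=2 ** 6)
dense_threshold = 10
is_sparse = sizes[bucket_ids] < dense_threshold
print(f"{int(is_sparse.sum())} of {len(x)} items sit in sparse buckets")
```

The intra-bucket distance check from step 2 is left out to keep the sketch short.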

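The key insight: items flagged as outliers under the global PCA often still share structure with one another, and that structure shows up at a coarser resolution (recovery PCA uses 8 bits by default, versus 14 for the initial hash).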
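
A minimal sketch of the recovery idea, assuming it works by fitting a second PCA on the outliers alone, hashing them with fewer bits, and treating outliers whose recovery bucket reaches `recovery_cluster_min` members as diaspora. The function `classify_outliers` and its internals are hypothetical; only the parameter names and defaults come from the API reference.

```python
import numpy as np


def classify_outliers(
    outliers: np.ndarray,
    recovery_bits: int = 8,
    recovery_cluster_min: int = 3,
) -> np.ndarray:
    """Label each outlier row 'diaspora' or 'orphan' via a coarse PCA sign hash."""
    centered = outliers - outliers.mean(axis=0)
    # PCA fitted on the outliers only; rows of vt are principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    dirs = vt[:recovery_bits]
    bits = (centered @ dirs.T > 0).astype(np.int64)
    ids = bits @ (1 << np.arange(dirs.shape[0], dtype=np.int64))
    # Size of the recovery bucket each outlier landed in.
    _, inverse, counts = np.unique(ids, return_inverse=True, return_counts=True)
    recovered = counts[inverse] >= recovery_cluster_min
    return np.where(recovered, "diaspora", "orphan")


rng = np.random.default_rng(31)
group = np.tile(rng.standard_normal(32), (4, 1))  # four identical rows
loners = rng.standard_normal((20, 32))            # unrelated toy points
labels = classify_outliers(np.vstack([group, loners]))

# Identical rows always share a recovery bucket, so the group of four is recovered.
print(labels[:4])
```

How the remaining toy points are labeled depends on the random draw, which is why the sketch only inspects the group of identical rows.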

License

MIT
