Fast outlier classification using PCA-based LSH

DYF - Outlier Classification

Fast outlier classification using PCA-based LSH. Identifies three types of items in embedding spaces:

  • Dense: Items in well-populated semantic buckets (the majority)
  • Diaspora: Items from sparse buckets that regroup into clusters under a second, recovery PCA (they were misplaced by the global structure)
  • Orphan: Truly unique items with no semantic neighbors

Installation

pip install dyf

For full features (embedding generation, LLM labeling):

pip install dyf[full]

Quick Start

Fast Classification (Rust-accelerated)

import numpy as np
from dyf import OutlierClassifier

# Create embeddings (e.g., from sentence-transformers)
embeddings = np.random.randn(10000, 384).astype(np.float32)

# Classify outliers (benchmarked at ~60 ms for 60K samples at 384d; see Performance)
classifier = OutlierClassifier(embedding_dim=384)
classifier.fit(embeddings)

# Get results
print(classifier.report())
diaspora = classifier.get_diaspora()  # Indices of diaspora items
orphans = classifier.get_orphans()    # Indices of orphan items
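
The index arrays returned above can be used to split a dataset into the three groups. The following is a minimal sketch, assuming that `get_diaspora()` and `get_orphans()` return integer indices into the fitted array (as the comments above indicate) and that every remaining item counts as dense. `split_by_status` is a hypothetical helper, not part of dyf, and the index values are made-up placeholders.

```python
def split_by_status(n_items, diaspora_idx, orphan_idx):
    """Return (dense, diaspora, orphan) index lists that together cover range(n_items).

    Assumption: any item that is neither diaspora nor orphan is dense.
    """
    diaspora = sorted({int(i) for i in diaspora_idx})
    orphan = sorted({int(i) for i in orphan_idx})
    outliers = set(diaspora) | set(orphan)
    dense = [i for i in range(n_items) if i not in outliers]
    return dense, diaspora, orphan


# Placeholder indices standing in for classifier.get_diaspora() / get_orphans()
dense_idx, diaspora_idx, orphan_idx = split_by_status(10, [2, 7], [5])
print(dense_idx, diaspora_idx, orphan_idx)
```

With real data, the three lists can index back into the original texts or embeddings to inspect each group separately.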

Full-Featured Usage (with embeddings & labeling)

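from dyf import OutlierClassifierFull, EmbedderConfig, LabelerConfig

# Inputs (not shown): `texts` is a list of strings, one per document;
# `categories` is an optional list of labels aligned with `texts`.

# From raw texts with built-in TF-IDF embeddings
classifier = OutlierClassifierFull.from_texts(
    texts=texts,
    categories=categories,  # Optional category labels
    embedding_dim=128
)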

# Or use sentence-transformers
embeddings = EmbedderConfig.MEDIUM.embed(texts)  # all-mpnet-base-v2
classifier = OutlierClassifierFull(embedding_dim=768)
classifier.fit(embeddings, categories=categories, texts=texts)

# Get detailed report
print(classifier.report())

# Label buckets with local LLM
labels = classifier.label_buckets(**LabelerConfig.MEDIUM.as_kwargs())
print(labels['dense'][1234]['label'])  # e.g. "Reinforcement Learning"

# Or use keyword extraction (no LLM required)
labels = classifier.label_buckets_keywords()

Performance

Implementation   60K samples (384d)   Per sample
DYF (Rust)       ~60 ms               1.0 µs
Pure Python      ~230 ms              3.8 µs

The Rust implementation is about 3.8x faster than the pure Python/sklearn baseline (~60 ms vs ~230 ms on 60K samples).
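
The per-sample and speedup figures follow directly from the two total timings. A small worked check, using only the numbers quoted above (`per_sample_us` is a hypothetical helper introduced for illustration):

```python
def per_sample_us(total_ms, n_samples):
    """Convert a total runtime in milliseconds to microseconds per sample."""
    return total_ms * 1000.0 / n_samples


rust_us = per_sample_us(60, 60_000)      # DYF (Rust): ~60 ms total
python_us = per_sample_us(230, 60_000)   # Pure Python: ~230 ms total
speedup = 230 / 60

print(f"{rust_us:.1f} µs, {python_us:.1f} µs, {speedup:.1f}x")
```

Since the totals are approximate, the derived values are rounded estimates rather than exact measurements.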

API Reference

OutlierClassifier (Fast)

OutlierClassifier(
    embedding_dim: int,
    initial_bits: int = 14,       # Bits for initial PCA LSH
    recovery_bits: int = 8,       # Bits for recovery PCA
    dense_threshold: int = 10,    # Min bucket size for "dense"
    intra_outlier_std: float = 2.0,   # Std threshold for intra-bucket outliers
    recovery_cluster_min: int = 3,    # Min cluster size for "recovered"
    seed: int = 31
)
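
To make the two thresholds concrete, here is a minimal sketch of how `dense_threshold` and `intra_outlier_std` could be read from the comments above. It is an interpretation and not dyf's actual implementation: it assumes a bucket is "dense" when it holds at least `dense_threshold` items, and that an intra-bucket outlier lies more than `intra_outlier_std` standard deviations above the mean distance to the bucket centroid. Both function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def sparse_items(bucket_ids, dense_threshold=10):
    """Indices of items whose bucket holds fewer than dense_threshold items."""
    sizes = Counter(bucket_ids)
    return [i for i, b in enumerate(bucket_ids) if sizes[b] < dense_threshold]


def intra_bucket_outliers(distances, intra_outlier_std=2.0):
    """Indices whose centroid distance exceeds mean + intra_outlier_std * std."""
    cutoff = mean(distances) + intra_outlier_std * pstdev(distances)
    return [i for i, d in enumerate(distances) if d > cutoff]


# Bucket 0 has 12 items (dense); buckets 1 and 2 have 3 and 1 items (sparse).
bucket_ids = [0] * 12 + [1] * 3 + [2]
print(sparse_items(bucket_ids))

# Nine items close to the centroid and one far away.
distances = [1.0] * 9 + [10.0]
print(intra_bucket_outliers(distances))
```

Raising `dense_threshold` or lowering `intra_outlier_std` would, under this reading, send more items to the recovery stage.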

Methods:

  • fit(embeddings) - Fit on numpy array (n_samples, embedding_dim)
  • fit_arrow(arrow_array) - Fit on PyArrow FixedSizeListArray (zero-copy)
  • get_diaspora() - Get indices of diaspora items
  • get_orphans() - Get indices of orphan items
  • get_statuses() - Get status for all items
  • report() - Get classification report

EmbedderConfig Presets

Name     Model               Dimensions   Size
TFIDF    TF-IDF + SVD        128          0 MB
LOW      all-MiniLM-L6-v2    384          80 MB
MEDIUM   all-mpnet-base-v2   768          420 MB
HIGH     bge-large-en-v1.5   1024         1.3 GB
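
The classifier's `embedding_dim` has to match the dimensionality of the chosen preset (for example, 768 for MEDIUM in the usage example earlier). A small sketch of keeping the two in sync, with the dimensions copied from the preset table; `PRESET_DIMS` and `embedding_dim_for` are hypothetical names, not part of dyf:

```python
# Dimensions copied from the EmbedderConfig preset table.
PRESET_DIMS = {"TFIDF": 128, "LOW": 384, "MEDIUM": 768, "HIGH": 1024}


def embedding_dim_for(preset_name):
    """Look up the embedding_dim that corresponds to a preset name."""
    try:
        return PRESET_DIMS[preset_name.upper()]
    except KeyError:
        raise ValueError(f"unknown preset: {preset_name!r}") from None


print(embedding_dim_for("medium"))
```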

LabelerConfig Presets

Name       Model             Parameters
KEYWORDS   TF-IDF keywords   -
LOW        phi3:mini         3.8B
MEDIUM     qwen2.5:7b        7B
HIGH       qwen2.5:14b       14B

Algorithm

Two-stage PCA-based LSH outlier classification:

  1. Stage 1: Random hash → bucket centroids → PCA on centroids → re-hash
  2. Outlier Detection: Sparse buckets + intra-bucket distance outliers
  3. Stage 2: Recovery PCA on outliers → diaspora vs orphan

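The key insight: items that look like outliers under the global PCA hash often share structure at a coarser resolution. The defaults reflect this: the recovery stage uses fewer bits (recovery_bits=8) than the initial stage (initial_bits=14).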
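
The three steps above can be sketched in NumPy as follows. This is one possible reading of the outline, not dyf's Rust implementation: all helper names are hypothetical, the bit widths and thresholds are shrunk to suit a small toy dataset, and the intra-bucket distance test is omitted, so only sparse buckets are treated as outliers.

```python
import numpy as np


def sign_hash(points, directions):
    """Pack the sign pattern of points @ directions.T into integer bucket ids."""
    bits = (points @ directions.T) > 0
    return bits.astype(np.int64) @ (1 << np.arange(bits.shape[1], dtype=np.int64))


def pca_directions(data, n_bits):
    """Return (mean, top principal directions) of data, at most n_bits rows."""
    mean = data.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_bits]


def classify(x, initial_bits=8, recovery_bits=3, dense_threshold=5,
             recovery_cluster_min=3, seed=31):
    rng = np.random.default_rng(seed)

    # Stage 1: random hash -> bucket centroids -> PCA on centroids -> re-hash.
    planes = rng.standard_normal((initial_bits, x.shape[1]))
    rand_ids = sign_hash(x, planes)
    centroids = np.stack([x[rand_ids == b].mean(axis=0)
                          for b in np.unique(rand_ids)])
    mean, dirs = pca_directions(centroids, initial_bits)
    bucket_ids = sign_hash(x - mean, dirs)

    # Outlier detection: items in buckets smaller than dense_threshold.
    _, inverse, counts = np.unique(bucket_ids, return_inverse=True,
                                   return_counts=True)
    idx = np.flatnonzero(counts[inverse] < dense_threshold)

    status = np.full(len(x), "dense", dtype=object)
    if len(idx) >= 2:
        # Stage 2: recovery PCA fitted on the outliers only, with fewer bits.
        mean2, dirs2 = pca_directions(x[idx], recovery_bits)
        rec_ids = sign_hash(x[idx] - mean2, dirs2)
        _, inv2, counts2 = np.unique(rec_ids, return_inverse=True,
                                     return_counts=True)
        recovered = counts2[inv2] >= recovery_cluster_min
        status[idx[recovered]] = "diaspora"
        status[idx[~recovered]] = "orphan"
    else:
        status[idx] = "orphan"
    return status


x = np.random.default_rng(0).standard_normal((2000, 16)).astype(np.float32)
status = classify(x)
print({s: int((status == s).sum()) for s in ("dense", "diaspora", "orphan")})
```

Because the recovery hash uses fewer bits, its buckets are coarser, which gives scattered outliers a chance to land together and be labeled diaspora rather than orphan.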

License

MIT
