Fast outlier classification using PCA-based LSH (Rust core)

DYF - Outlier Classification

Fast outlier classification using PCA-based locality-sensitive hashing (LSH). It assigns each item in an embedding space to one of three classes:

  • Dense: Items in well-populated semantic buckets
  • Diaspora: Sparse items that find community via recovery PCA
  • Orphan: Truly unique items with no semantic neighbors
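
To make the bucketing idea concrete, here is a minimal NumPy sketch of PCA-based LSH. It is an illustration under stated assumptions, not dyf's Rust implementation: the helper name pca_lsh_codes, the sign-bit hashing rule, and the toy data are all hypothetical.

```python
import numpy as np

def pca_lsh_codes(embeddings: np.ndarray, n_bits: int) -> np.ndarray:
    """Hash each row to an integer built from the signs of its top PCA projections."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projections = centered @ vt[:n_bits].T            # (n_samples, <= n_bits)
    bits = (projections > 0).astype(np.int64)         # one sign bit per component
    weights = 2 ** np.arange(bits.shape[1], dtype=np.int64)
    return bits @ weights                             # pack the bits into one integer

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16)).astype(np.float32)  # toy data (assumption)

codes = pca_lsh_codes(X, n_bits=4)
bucket_sizes = np.bincount(codes, minlength=2 ** 4)     # items per bucket
print(bucket_sizes)
```

Items that share a code fall into the same bucket, so a well-populated bucket corresponds to the "dense" case and a nearly empty one to a candidate outlier.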

Installation

pip install dyf

Quick Start

import numpy as np
from dyf import OutlierClassifier

# Stand-in random embeddings; in practice these would come from a model such as sentence-transformers
embeddings = np.random.randn(10000, 384).astype(np.float32)

# Classify outliers
classifier = OutlierClassifier(embedding_dim=384)
classifier.fit(embeddings)

# Get results
print(classifier.report())
diaspora = classifier.get_diaspora()  # Indices of diaspora items
orphans = classifier.get_orphans()    # Indices of orphan items

Performance

About 60 ms for 60K embeddings (384 dimensions), 3.8x faster than pure Python/sklearn.
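
Timings like this depend on hardware, so it can help to measure on your own machine. The following is a small, hedged timing sketch using only the standard library; best_time_ms is a hypothetical helper, and the workload is a stand-in so the snippet runs without dyf installed.

```python
import time

def best_time_ms(fn, *args, repeats: int = 5) -> float:
    """Return the fastest wall-clock time of fn(*args) over several runs, in ms."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best * 1000.0

# Stand-in workload; replace with the real call when dyf is installed.
elapsed = best_time_ms(sum, range(100_000))
print(f"{elapsed:.3f} ms")

# With dyf installed, the measured call might look like this (not run here):
# best_time_ms(lambda e: OutlierClassifier(embedding_dim=384).fit(e), embeddings)
```

Taking the minimum over a few repeats reduces noise from other processes; it says nothing about how the 60 ms figure above was measured.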

API

OutlierClassifier

OutlierClassifier(
    embedding_dim: int,
    initial_bits: int = 14,      # Bits for initial PCA LSH
    recovery_bits: int = 8,       # Bits for recovery PCA
    dense_threshold: int = 10,    # Min bucket size for "dense"
    intra_outlier_std: float = 2.0,  # Std threshold for intra-bucket outliers
    recovery_cluster_min: int = 3,   # Min cluster size for "recovered"
    seed: int = 31
)
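
The sketch below shows one plausible reading of how these parameters could interact. It is an assumption based only on the parameter comments above, not a description of dyf's internals: the function names are hypothetical, sign-bit hashing stands in for the real hash, and intra_outlier_std is left out entirely.

```python
import numpy as np

def sign_codes(x: np.ndarray, n_bits: int) -> np.ndarray:
    """Integer bucket codes from the signs of the top PCA projections."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    bits = (centered @ vt[:n_bits].T > 0).astype(np.int64)
    return bits @ (2 ** np.arange(bits.shape[1], dtype=np.int64))

def classify(embeddings, initial_bits=14, recovery_bits=8,
             dense_threshold=10, recovery_cluster_min=3):
    """Assumed two-pass scheme: dense buckets first, then a recovery pass."""
    statuses = np.full(len(embeddings), "orphan", dtype=object)

    # Pass 1: items in buckets with at least dense_threshold members are dense.
    codes = sign_codes(embeddings, initial_bits)
    _, inverse, counts = np.unique(codes, return_inverse=True, return_counts=True)
    dense = counts[inverse] >= dense_threshold
    statuses[dense] = "dense"

    # Pass 2: refit PCA on the sparse items only; those that now share a
    # bucket with enough others are treated as recovered (diaspora).
    sparse_idx = np.flatnonzero(~dense)
    if len(sparse_idx) > 0:
        rcodes = sign_codes(embeddings[sparse_idx], recovery_bits)
        _, rinv, rcounts = np.unique(rcodes, return_inverse=True, return_counts=True)
        recovered = rcounts[rinv] >= recovery_cluster_min
        statuses[sparse_idx[recovered]] = "diaspora"
    return statuses

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 16)).astype(np.float32)  # toy data (assumption)
labels, label_counts = np.unique(classify(X, initial_bits=6, recovery_bits=4),
                                 return_counts=True)
print(dict(zip(labels, label_counts)))
```

Under this reading, raising dense_threshold pushes more items into the recovery pass, and raising recovery_cluster_min turns more of those into orphans.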

Methods:

  • fit(embeddings) - Fit on a NumPy array of shape (n_samples, embedding_dim)
  • fit_arrow(arrow_array) - Fit on a PyArrow FixedSizeListArray (zero-copy)
  • get_diaspora() - Get indices of diaspora items
  • get_orphans() - Get indices of orphan items
  • get_statuses() - Get status for all items
  • report() - Get classification report

License

MIT
