Fast outlier classification using PCA-based LSH (Rust core)
DYF - Outlier Classification
Fast outlier classification using PCA-based LSH. Identifies three types of items in embedding spaces:
- Dense: Items in well-populated semantic buckets
- Diaspora: Sparse items that find community via recovery PCA
- Orphan: Truly unique items with no semantic neighbors
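The bucketing idea behind the "Dense" category can be sketched in plain NumPy. This is an illustrative sketch only, not the library's Rust implementation: the helper names `pca_lsh_codes` and `dense_mask` are hypothetical, and it assumes that each hash bit is the sign of one PCA projection and that a bucket is "dense" once it holds at least a threshold number of items.

```python
import numpy as np


def pca_lsh_codes(embeddings: np.ndarray, n_bits: int) -> np.ndarray:
    """Hash each row to an integer built from the signs of its top PCA projections."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projections = centered @ vt[:n_bits].T        # shape (n_samples, n_bits)
    bits = (projections > 0).astype(np.int64)     # one sign bit per component
    weights = 1 << np.arange(n_bits, dtype=np.int64)
    return bits @ weights                         # pack the bits into one integer code


def dense_mask(codes: np.ndarray, dense_threshold: int) -> np.ndarray:
    """True for items whose bucket holds at least `dense_threshold` items."""
    _, inverse, counts = np.unique(codes, return_inverse=True, return_counts=True)
    return counts[inverse] >= dense_threshold


rng = np.random.default_rng(31)
embeddings = rng.standard_normal((2000, 32)).astype(np.float32)
codes = pca_lsh_codes(embeddings, n_bits=4)
mask = dense_mask(codes, dense_threshold=10)
print(f"{mask.sum()} of {len(mask)} items fall in dense buckets")
```

Items left outside dense buckets would then be candidates for the diaspora/orphan split; how the library performs that recovery step is not shown here.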
Installation
pip install dyf
Quick Start
import numpy as np
from dyf import OutlierClassifier
# Create embeddings (e.g., from sentence-transformers)
embeddings = np.random.randn(10000, 384).astype(np.float32)
# Classify outliers
classifier = OutlierClassifier(embedding_dim=384)
classifier.fit(embeddings)
# Get results
print(classifier.report())
diaspora = classifier.get_diaspora() # Indices of diaspora items
orphans = classifier.get_orphans() # Indices of orphan items
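Once the index lists are available, they can be used to slice the original embeddings. The sketch below is a hypothetical helper (`split_by_indices` is not part of dyf) and assumes that the returned values are integer row indices and that every item that is neither diaspora nor orphan counts as dense. Hand-written index lists stand in for the classifier output.

```python
import numpy as np


def split_by_indices(embeddings, diaspora_idx, orphan_idx):
    """Return (dense, diaspora, orphan) row subsets of `embeddings`."""
    diaspora_idx = np.asarray(diaspora_idx, dtype=np.intp)
    orphan_idx = np.asarray(orphan_idx, dtype=np.intp)
    # Assumption: anything not listed as diaspora or orphan is dense.
    is_dense = np.ones(len(embeddings), dtype=bool)
    is_dense[diaspora_idx] = False
    is_dense[orphan_idx] = False
    return embeddings[is_dense], embeddings[diaspora_idx], embeddings[orphan_idx]


# Stand-in data; in practice pass classifier.get_diaspora() and classifier.get_orphans().
embeddings = np.arange(12, dtype=np.float32).reshape(6, 2)
dense, diaspora, orphans = split_by_indices(embeddings, [1, 4], [5])
print(dense.shape, diaspora.shape, orphans.shape)
```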
Performance
About 60 ms for 60K embeddings (384 dimensions), roughly 3.8x faster than a pure Python/sklearn implementation.
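To check a timing like this on your own hardware, a small harness such as the following may help. It is a generic sketch, not part of dyf: `median_runtime_ms` is a hypothetical helper, and the lambda is a stand-in workload to be replaced with a call such as fitting the classifier on your embeddings.

```python
import statistics
import time


def median_runtime_ms(fn, repeats=5):
    """Run `fn` several times and return the median wall-clock time in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; swap in the call you want to measure.
elapsed = median_runtime_ms(lambda: sum(range(100_000)))
print(f"median: {elapsed:.2f} ms")
```

Taking the median over a few repeats reduces the influence of one-off effects such as cold caches.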
API
OutlierClassifier
OutlierClassifier(
    embedding_dim: int,
    initial_bits: int = 14,          # Bits for initial PCA LSH
    recovery_bits: int = 8,          # Bits for recovery PCA
    dense_threshold: int = 10,       # Min bucket size for "dense"
    intra_outlier_std: float = 2.0,  # Std threshold for intra-bucket outliers
    recovery_cluster_min: int = 3,   # Min cluster size for "recovered"
    seed: int = 31
)
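To give a feel for what a standard-deviation threshold such as intra_outlier_std could mean, here is a minimal sketch. It assumes, without confirmation from the library, that an intra-bucket outlier is an item whose distance to the bucket centroid exceeds the mean distance by more than the given number of standard deviations. The helper `intra_bucket_outliers` is hypothetical.

```python
import numpy as np


def intra_bucket_outliers(bucket: np.ndarray, n_std: float = 2.0) -> np.ndarray:
    """Flag rows lying more than `n_std` standard deviations beyond the mean centroid distance."""
    centroid = bucket.mean(axis=0)
    dist = np.linalg.norm(bucket - centroid, axis=1)
    cutoff = dist.mean() + n_std * dist.std()
    return dist > cutoff


rng = np.random.default_rng(31)
bucket = rng.standard_normal((50, 8)) * 0.1  # a tight cluster of 50 items
bucket[0] += 5.0                             # push one item far from the rest
flags = intra_bucket_outliers(bucket, n_std=2.0)
print(np.flatnonzero(flags))
```

Separately, if each bit splits the space once, initial_bits = 14 allows up to 2**14 = 16,384 initial buckets and recovery_bits = 8 allows up to 2**8 = 256 recovery buckets.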
Methods:
- fit(embeddings): Fit on a NumPy array of shape (n_samples, embedding_dim)
- fit_arrow(arrow_array): Fit on a PyArrow FixedSizeListArray (zero-copy)
- get_diaspora(): Get indices of diaspora items
- get_orphans(): Get indices of orphan items
- get_statuses(): Get status for all items
- report(): Get classification report
License
MIT
Download files
File details
Details for the file dyf_rs-0.1.0-cp311-cp311-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: dyf_rs-0.1.0-cp311-cp311-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 13.0 MB
- Tags: CPython 3.11, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8084e0116a40d7b86520b78ecc6f4f7938684d4234a8231fa4a8f0dcfe1fc0c9 |
| MD5 | d4549921386bd191f9388a72630231a5 |
| BLAKE2b-256 | 33bb3e4f8009b61076d5ee1a32a344681dee2af68f300e8e2fa2dc14be8f0b31 |
File details
Details for the file dyf_rs-0.1.0-cp311-cp311-macosx_11_0_arm64.whl.
File metadata
- Download URL: dyf_rs-0.1.0-cp311-cp311-macosx_11_0_arm64.whl
- Upload date:
- Size: 643.3 kB
- Tags: CPython 3.11, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fc6c27a02ab8f3632d3ec228b8cb2d3c607a09fc85b4feac37513db24db21fa9 |
| MD5 | ae9cbdefa14718861e341b826cb639c4 |
| BLAKE2b-256 | dba892a99cafa4cc19da0e7e17399a5941b2d3f9e8402954f24be9a5a81125fb |
File details
Details for the file dyf_rs-0.1.0-cp311-cp311-macosx_10_12_x86_64.whl.
File metadata
- Download URL: dyf_rs-0.1.0-cp311-cp311-macosx_10_12_x86_64.whl
- Upload date:
- Size: 680.7 kB
- Tags: CPython 3.11, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 22841f12e9afe691a22a8f3d5d874f027c88201fb4de4adbff4147416d3d0657 |
| MD5 | 7ef36f0c90c214210da09e5a07c57394 |
| BLAKE2b-256 | b19a87a735e5d2528efc104a7b31cbe94851b42f2c6ca7a22a62bad036a10075 |