Density Yields Features - discover structure in embedding spaces
DYF - Density Yields Features
Discover structure in embedding spaces. DYF uses density-based locality-sensitive hashing (LSH) to reveal the natural organization of your data, sorting every item into one of three classes:
- Dense: Core items in well-populated semantic regions
- Bridge: Transitional items connecting different clusters
- Orphan: Unique items with no semantic neighbors
What it does
DYF transforms raw embeddings into navigable semantic maps. Instead of just clustering, it reveals the topology: which regions are dense, which items bridge between concepts, and which are truly unique.
Use cases:
- Semantic navigation: Find paths between concepts
- Structure discovery: Understand how your data organizes itself
- Anomaly detection: Identify orphans and bridges
- Index building: Pre-compute structure for fast queries
Installation
pip install dyf
For serialization (save/load indexes):
pip install "dyf[io]"
For full features (embedding generation, LLM labeling):
pip install "dyf[full]"
Quick Start
Discover Structure
import numpy as np
from dyf import DensityClassifier
# Your embeddings (e.g., from sentence-transformers)
embeddings = np.random.randn(10000, 384).astype(np.float32)
# Find structure
classifier = DensityClassifier(embedding_dim=384)
classifier.fit(embeddings)
# What did we find?
print(classifier.report())
# Illustrative output (exact counts depend on your data):
# Corpus: 10000 items
# Dense: 9500 (95.0%)
# Bridge: 450 (4.5%)
# Orphan: 50 (0.5%)
# Get indices
bridges = classifier.get_bridge() # Transitional items
orphans = classifier.get_orphans() # Unique items
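The index arrays can be folded into a single per-item label, which is handy for the anomaly-detection use case. The sketch below uses small hand-written index lists in place of real classifier output. It assumes the getters return integer positions into the original embeddings array, and that every item that is neither bridge nor orphan is dense. `label_items` is a hypothetical helper, not part of dyf.

```python
def label_items(n_items, bridge_idx, orphan_idx):
    """Return one label per item; items not listed as bridge/orphan are treated as dense."""
    labels = ["dense"] * n_items
    for i in bridge_idx:
        labels[i] = "bridge"
    for i in orphan_idx:
        labels[i] = "orphan"
    return labels


# Stand-ins for the texts behind the embeddings and for the output of
# classifier.get_bridge() / classifier.get_orphans() (assumed to be index lists).
texts = ["intro to ML", "deep learning", "ML for biology", "grocery list", "neural nets"]
bridges = [2]
orphans = [3]

labels = label_items(len(texts), bridges, orphans)
to_review = [texts[i] for i in orphans]  # candidates for manual anomaly review

print(labels)
print(to_review)
```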
Save & Load Pre-computed Indexes
from dyf import save_index, PrecomputedIndex
# Save (includes embeddings + metadata)
save_index(classifier, 'index.safetensors', embeddings,
           metadata={'model': 'all-MiniLM-L6-v2', 'created': '2026-01-12'})
# Load (no dyf-rs dependency needed!)
index = PrecomputedIndex.load('index.safetensors')
print(index.version) # Check what version created this
print(index.metadata) # All metadata
dense_items = index.get_dense()
bucket_5 = index.get_bucket(5)
Full-Featured Usage
from dyf import DensityClassifierFull, EmbedderConfig, LabelerConfig
# From raw texts (`documents` and `categories` are your own data, defined elsewhere)
classifier = DensityClassifierFull.from_texts(
    texts=documents,
    categories=categories,
)
# Label clusters with LLM
labels = classifier.label_buckets(**LabelerConfig.MEDIUM.as_kwargs())
print(labels['dense'][1234]['label'])  # e.g. "Machine Learning Papers"
How It Works
Two-stage PCA-based LSH (an initial pass, then a recovery pass over whatever the first pass left sparse):
- Initial bucketing: PCA projections hash items into semantic buckets (resolution set by initial_bits)
- Density check: Items in buckets with fewer than dense_threshold members are candidates for reclassification
- Recovery stage: A coarser PCA hash (recovery_bits) looks for structure among those sparse items
- Classification: Dense (core), Bridge (recovered), Orphan (truly unique)
The key insight: items that look like outliers globally often share structure at a coarser resolution. Bridges are these "misplaced" items; they connect different semantic regions.
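The stages above can be sketched in plain Python. This is a toy illustration under stated assumptions, not DYF's actual implementation. It assumes one sign bit per principal direction, a recovery PCA refit on the sparse items only, and the same threshold in both stages. PCA is done with power iteration so the example needs only the standard library. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def pca_directions(rows, k, iters=200):
    """Mean and top-k principal directions (power iteration with deflation)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(x[a] * x[b] for x in centered) / n for b in range(d)]
           for a in range(d)]
    dirs = []
    for c in range(k):
        v = [1.0 / (j + c + 1) for j in range(d)]  # fixed, non-zero start vector
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        # Deflate: remove the found component so the next pass finds the next one.
        lam = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
        dirs.append(v)
    return mean, dirs


def bucket_ids(rows, mean, dirs):
    """One sign bit per direction, packed into an integer bucket id."""
    ids = []
    for r in rows:
        code = 0
        for v in dirs:
            proj = sum((r[j] - mean[j]) * v[j] for j in range(len(v)))
            code = (code << 1) | int(proj > 0)
        ids.append(code)
    return ids


def classify(rows, initial_bits, recovery_bits, dense_threshold):
    """Label each row 'dense', 'bridge', or 'orphan' (assumed two-stage scheme)."""
    mean, dirs = pca_directions(rows, initial_bits)
    ids = bucket_ids(rows, mean, dirs)
    sizes = Counter(ids)
    labels = ["dense" if sizes[b] >= dense_threshold else None for b in ids]
    sparse = [i for i, lab in enumerate(labels) if lab is None]
    if sparse:
        # Recovery: re-hash only the sparse items at a coarser resolution.
        sub = [rows[i] for i in sparse]
        mean2, dirs2 = pca_directions(sub, recovery_bits)
        sizes2 = Counter(ids2 := bucket_ids(sub, mean2, dirs2))
        for i, b in zip(sparse, ids2):
            labels[i] = "bridge" if sizes2[b] >= dense_threshold else "orphan"
    return labels


def blob(center, n=100, dim=4):
    """n points drawn around (center, ..., center)."""
    return [[random.gauss(center, 0.3) for _ in range(dim)] for _ in range(n)]


random.seed(0)
# Toy data: two tight 4-d blobs plus a few scattered points.
scattered = [[random.uniform(-6, 6) for _ in range(4)] for _ in range(20)]
rows = blob(-2.0) + blob(2.0) + scattered
labels = classify(rows, initial_bits=4, recovery_bits=2, dense_threshold=10)
print(Counter(labels))
```

The bit counts here are far smaller than the library defaults because the toy data has only 4 dimensions and 220 points.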
Performance
| Dataset | Time | Per item |
|---|---|---|
| 60K embeddings (384d) | ~60ms | 1.0 µs |
Rust-accelerated via PyO3. ~4x faster than pure Python.
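The per-item figure follows directly from the table row. The quick check below also includes a linear extrapolation to one million items. That extrapolation is an assumption: only a single data point is reported here, so how the cost scales with corpus size is not confirmed.

```python
n_items = 60_000
total_seconds = 60e-3  # ~60 ms, from the table above

per_item_us = total_seconds / n_items * 1e6
items_per_second = n_items / total_seconds

# Hypothetical: time for 1M embeddings if cost scaled linearly with item count.
est_seconds_1m = 1_000_000 * total_seconds / n_items

print(f"{per_item_us:.1f} µs per item")
print(f"{items_per_second:,.0f} items/s")
print(f"{est_seconds_1m:.1f} s for 1M items (linear extrapolation)")
```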
API
DensityClassifier
DensityClassifier(
    embedding_dim: int,
    initial_bits: int = 14,      # LSH resolution
    recovery_bits: int = 8,      # Coarser recovery resolution
    dense_threshold: int = 10,   # Min bucket size for "dense"
    seed: int = 31
)
# Methods
classifier.fit(embeddings)
classifier.get_dense() # Dense item indices
classifier.get_bridge() # Bridge item indices
classifier.get_orphans() # Orphan item indices
classifier.get_bucket_id(idx) # Which bucket is item in?
classifier.report() # Summary statistics
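To get a feel for the defaults, it helps to relate bit counts to bucket counts. The sketch below assumes each bit is one binary split, so `b` bits address at most `2**b` buckets. That is how sign-bit LSH usually works, but it is an assumption about DYF's internals rather than something documented here. `max_buckets` and `mean_occupancy` are hypothetical helpers.

```python
def max_buckets(bits):
    """Upper bound on the number of distinct buckets for a given bit count."""
    return 2 ** bits


def mean_occupancy(n_items, bits):
    """Average items per bucket if items were spread evenly over all buckets."""
    return n_items / max_buckets(bits)


for name, bits in [("initial_bits", 14), ("recovery_bits", 8)]:
    print(name, max_buckets(bits), round(mean_occupancy(10_000, bits), 2))
```

Under that assumption, 10,000 evenly spread items average well under one item per bucket at 14 bits, far below dense_threshold=10, so a bucket only counts as dense where the data is genuinely concentrated. At 8 bits the same items average about 39 per bucket, which is why a coarser pass can recover structure among sparse items.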
Index Serialization
from dyf import save_index, load_index, PrecomputedIndex
# Save fitted classifier
save_index(classifier, 'index.safetensors', embeddings, metadata={...})
# Load as dict
data = load_index('index.safetensors')
data, metadata = load_index('index.safetensors', include_metadata=True)
# Load as object (recommended)
index = PrecomputedIndex.load('index.safetensors')
index.get_dense()
index.get_bucket(5)
index.metadata
index.version
Documentation
Full documentation and API reference at dyf.io.
License
MIT
Download files
File details
Details for the file dyf-0.8.0.tar.gz.
File metadata
- Download URL: dyf-0.8.0.tar.gz
- Upload date:
- Size: 225.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 85be35e5533a5aa532936847e52668435707c4b068ba14578619ca3731473989 |
| MD5 | 08f17c8dfcd3d280a76956b6a8ff9069 |
| BLAKE2b-256 | 65ef7532ab346efdd9417fe48c2fb7468622785f7775dc2aa1268072a8d87c08 |
File details
Details for the file dyf-0.8.0-py3-none-any.whl.
File metadata
- Download URL: dyf-0.8.0-py3-none-any.whl
- Upload date:
- Size: 178.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b7417c1a49417d12fb0fc6a838cf9931a12c9332c27fc54d32410fdc1f227adf |
| MD5 | dfa2422cac5facc460b121fa60d63a57 |
| BLAKE2b-256 | 9cdb8fb04b72925a0e8df7632b4f1d5f221926e3bf14fba0edf23a54c02e2a9a |