DYF - Density Yields Features
Discover structure in embedding spaces. DYF uses density-based LSH to reveal the natural organization of your data:
- Dense: Core items in well-populated semantic regions
- Bridge: Transitional items connecting different clusters
- Orphan: Unique items with no semantic neighbors
What it does
DYF transforms raw embeddings into navigable semantic maps. Instead of just clustering, it reveals the topology - which regions are dense, which items bridge between concepts, and which are truly unique.
Use cases:
- Semantic navigation: Find paths between concepts
- Structure discovery: Understand how your data organizes itself
- Anomaly detection: Identify orphans and bridges (see the sketch after this list)
- Index building: Pre-compute structure for fast queries
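For example, anomaly detection falls out of the classification directly. A minimal sketch using the DensityClassifier API introduced in the Quick Start below (the orphan/bridge handling here is illustrative, not prescribed by the library):

```python
import numpy as np
from dyf import DensityClassifier

embeddings = np.random.randn(10000, 384).astype(np.float32)

classifier = DensityClassifier(embedding_dim=384)
classifier.fit(embeddings)

# Orphans have no semantic neighbors: strong anomaly candidates.
orphan_idx = classifier.get_orphans()
anomalies = embeddings[orphan_idx]

# Bridges sit between clusters: flag for review rather than rejection.
bridge_idx = classifier.get_bridge()
print(f"{len(orphan_idx)} orphans, {len(bridge_idx)} bridges flagged")
```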
Installation

```bash
pip install dyf
```

For serialization (save/load indexes):

```bash
pip install dyf[io]
```

For full features (embedding generation, LLM labeling):

```bash
pip install dyf[full]
```
Quick Start
Discover Structure
```python
import numpy as np
from dyf import DensityClassifier

# Your embeddings (e.g., from sentence-transformers)
embeddings = np.random.randn(10000, 384).astype(np.float32)

# Find structure
classifier = DensityClassifier(embedding_dim=384)
classifier.fit(embeddings)

# What did we find?
print(classifier.report())
# Corpus: 10000 items
#   Dense:  9500 (95.0%)
#   Bridge:  450 (4.5%)
#   Orphan:   50 (0.5%)

# Get indices
bridges = classifier.get_bridge()   # Transitional items
orphans = classifier.get_orphans()  # Unique items
```
Build & Search Indexes
```python
from dyf import build_dyf_tree, write_lazy_index, LazyIndex

# Build tree from embeddings
tree = build_dyf_tree(embeddings, max_depth=4, num_bits=3, min_leaf_size=8)

# Write to disk (mmap-friendly, zero startup cost)
write_lazy_index(
    tree, embeddings, "index.dyf",
    quantization="float16", compression="zstd",
    stored_fields={"title": titles},
    metadata={"model": "nomic-embed-text-v1.5"},
)

# Search (instant open, LRU-cached leaf access)
with LazyIndex("index.dyf") as idx:
    result = idx.search(query_embedding, k=10, nprobe=3)
    print(result.indices, result.scores)
    print(result.fields["title"])  # stored fields returned with results
```
Adaptive Probing
Queries near decision boundaries automatically probe more leaves:
```python
from dyf import LazyIndex, AdaptiveProbeConfig

with LazyIndex("index.dyf") as idx:
    # Auto mode: margin-based probe count (default thresholds)
    result = idx.search(query, k=10, nprobe="auto", return_routing=True)
    print(result.routing["adaptive_nprobe"])  # how many leaves were probed

    # Custom thresholds
    cfg = AdaptiveProbeConfig(margin_lo=0.005, margin_hi=0.2,
                              min_probes=1, max_probes=8)
    result = idx.search(query, k=10, nprobe=cfg)
```
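Conceptually, adaptive probing maps the routing margin (how decisively the query falls on one side of a split) to a probe count: a small margin means the query is near a boundary and more leaves are probed. A rough sketch of that mapping, using the AdaptiveProbeConfig parameter names from above (this is not DYF's internal code, and the linear interpolation between thresholds is an assumption):

```python
def adaptive_nprobe(margin: float, margin_lo: float = 0.005,
                    margin_hi: float = 0.2,
                    min_probes: int = 1, max_probes: int = 8) -> int:
    """Map a routing margin to a probe count (illustrative sketch only).

    Small margin -> query sits near a decision boundary -> probe more leaves.
    Large margin -> routing is confident -> probe fewer leaves.
    """
    if margin <= margin_lo:
        return max_probes
    if margin >= margin_hi:
        return min_probes
    t = (margin - margin_lo) / (margin_hi - margin_lo)  # 0 at lo, 1 at hi
    return round(max_probes - t * (max_probes - min_probes))
```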
Full-Featured Usage
```python
from dyf import DensityClassifierFull, EmbedderConfig, LabelerConfig

# From raw texts
classifier = DensityClassifierFull.from_texts(
    texts=documents,
    categories=categories,
)

# Label clusters with LLM
labels = classifier.label_buckets(**LabelerConfig.MEDIUM.as_kwargs())
print(labels['dense'][1234]['label'])  # "Machine Learning Papers"
```
How It Works
Two-stage PCA-based LSH (a fine initial pass plus a coarser recovery pass):
1. Initial bucketing: PCA projections create semantic buckets
2. Density check: items in sparse buckets become candidates for reclassification
3. Recovery stage: a coarser PCA finds structure among the sparse items
4. Classification: Dense (core), Bridge (recovered), Orphan (truly unique)
The key insight: items that appear as outliers globally often share structure at coarser resolution. Bridges are these "misplaced" items - they connect different semantic regions.
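A compact numpy sketch of the two-stage idea (conceptual only: random projections stand in for fitted PCA directions, and sign-bit hashing stands in for DYF's actual bucketing):

```python
import numpy as np

def two_stage_density_lsh(X, initial_bits=14, recovery_bits=8,
                          dense_threshold=10, seed=31):
    """Conceptual two-stage density LSH; not DYF's implementation."""
    rng = np.random.default_rng(seed)

    def bucket_ids(Z, bits):
        # Sign-bit hash: project onto random directions, pack bits into an int.
        P = rng.standard_normal((Z.shape[1], bits))
        return ((Z @ P) > 0) @ (1 << np.arange(bits))

    # Stage 1: fine bucketing; items in well-populated buckets are "dense".
    _, inv, counts = np.unique(bucket_ids(X, initial_bits),
                               return_inverse=True, return_counts=True)
    dense = counts[inv] >= dense_threshold

    # Stage 2: re-bucket only the sparse remainder at coarser resolution.
    sparse_idx = np.flatnonzero(~dense)
    _, inv2, counts2 = np.unique(bucket_ids(X[sparse_idx], recovery_bits),
                                 return_inverse=True, return_counts=True)
    recovered = counts2[inv2] >= 2  # shares a coarse bucket with another item

    labels = np.where(dense, "dense", "orphan").astype(object)
    labels[sparse_idx[recovered]] = "bridge"  # structure found at coarse scale
    return labels
```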
Performance
| Dataset | Time | Per item |
|---|---|---|
| 60K embeddings (384d) | ~60ms | 1.0 µs |
Rust-accelerated via PyO3. ~4x faster than pure Python.
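To get a comparable measurement on your own hardware (timings vary with CPU and data; the random embeddings here are placeholders):

```python
import time
import numpy as np
from dyf import DensityClassifier

embeddings = np.random.randn(60_000, 384).astype(np.float32)
classifier = DensityClassifier(embedding_dim=384)

t0 = time.perf_counter()
classifier.fit(embeddings)
elapsed = time.perf_counter() - t0
print(f"{elapsed * 1e3:.1f} ms total, "
      f"{elapsed / len(embeddings) * 1e6:.2f} µs per item")
```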
API
DensityClassifier
```python
DensityClassifier(
    embedding_dim: int,
    initial_bits: int = 14,     # LSH resolution
    recovery_bits: int = 8,     # Coarser recovery resolution
    dense_threshold: int = 10,  # Min bucket size for "dense"
    seed: int = 31,
)

# Methods
classifier.fit(embeddings)
classifier.get_dense()         # Dense item indices
classifier.get_bridge()        # Bridge item indices
classifier.get_orphans()       # Orphan item indices
classifier.get_bucket_id(idx)  # Which bucket is item in?
classifier.report()            # Summary statistics
```
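A small usage sketch for get_bucket_id, grouping fitted items back into their buckets (assumes classifier has been fit on embeddings as in the Quick Start; the grouping loop itself is illustrative):

```python
from collections import defaultdict

buckets = defaultdict(list)
for i in range(len(embeddings)):
    buckets[classifier.get_bucket_id(i)].append(i)

largest = max(buckets.values(), key=len)
print(f"{len(buckets)} buckets; largest holds {len(largest)} items")
```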
LazyIndex
```python
from dyf import LazyIndex

with LazyIndex("index.dyf") as idx:
    # Search with fixed or adaptive probing
    result = idx.search(query, k=10, nprobe=3)       # fixed
    result = idx.search(query, k=10, nprobe="auto")  # adaptive

    # Inspect index structure
    idx.tree_summary        # metadata, dims, leaf count
    idx.total_items         # total indexed items
    idx.stored_field_names  # available stored fields

    # Extract all data
    data = idx.extract_all_fields()
    data['embeddings']  # (n, d) float32
    data['fields']      # {field_name: array}
```
Documentation
- How It Works — the algorithm, metrics, and Dense/Bridge/Orphan explained
- Getting Started — code recipes and examples
- API Reference — full documentation for all classes and functions
License
MIT