DYF - Density Yields Features

Interactive Demo

50,000 Wikipedia articles clustered by semantic similarity. Bright lines show density-based bridges connecting clusters. Try the interactive demo →

Discover structure in embedding spaces. DYF uses density-based LSH to reveal the natural organization of your data:

  • Dense: Core items in well-populated semantic regions
  • Bridge: Transitional items connecting different clusters
  • Orphan: Unique items with no semantic neighbors

What it does

DYF transforms raw embeddings into navigable semantic maps. Instead of just clustering, it reveals the topology: which regions are dense, which items bridge between concepts, and which are truly unique.

Use cases:

  • Semantic navigation: Find paths between concepts
  • Structure discovery: Understand how your data organizes itself
  • Anomaly detection: Identify orphans and bridges
  • Index building: Pre-compute structure for fast queries

Installation

pip install dyf

For serialization (save/load indexes):

pip install dyf[io]

For full features (embedding generation, LLM labeling):

pip install dyf[full]

Quick Start

Discover Structure

import numpy as np
from dyf import DensityClassifier

# Your embeddings (e.g., from sentence-transformers)
embeddings = np.random.randn(10000, 384).astype(np.float32)

# Find structure
classifier = DensityClassifier(embedding_dim=384)
classifier.fit(embeddings)

# What did we find?
print(classifier.report())
# Corpus: 10000 items
#   Dense: 9500 (95.0%)
#   Bridge: 450 (4.5%)
#   Orphan: 50 (0.5%)

# Get indices
bridges = classifier.get_bridge()  # Transitional items
orphans = classifier.get_orphans() # Unique items

Save & Load Pre-computed Indexes

from dyf import save_index, PrecomputedIndex

# Save (includes embeddings + metadata)
save_index(classifier, 'index.safetensors', embeddings,
           metadata={'model': 'all-MiniLM-L6-v2', 'created': '2026-01-12'})

# Load (no dyf-rs dependency needed!)
index = PrecomputedIndex.load('index.safetensors')
print(index.version)  # Check what version created this
print(index.metadata)  # All metadata

dense_items = index.get_dense()
bucket_5 = index.get_bucket(5)

Full-Featured Usage

from dyf import DensityClassifierFull, EmbedderConfig, LabelerConfig

# From raw texts
classifier = DensityClassifierFull.from_texts(
    texts=documents,
    categories=categories,
)

# Label clusters with LLM
labels = classifier.label_buckets(**LabelerConfig.MEDIUM.as_kwargs())
print(labels['dense'][1234]['label'])  # "Machine Learning Papers"

How It Works

Two-stage PCA-based LSH:

  1. Initial bucketing (stage 1): fine-grained PCA projections hash items into semantic buckets
  2. Density check: items in sparse buckets become candidates for reclassification
  3. Recovery (stage 2): a coarser PCA hash looks for shared structure among the sparse items
  4. Classification: Dense (core), Bridge (recovered at the coarser resolution), Orphan (truly unique)

The key insight: items that appear as outliers globally often share structure at coarser resolution. Bridges are these "misplaced" items - they connect different semantic regions.
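The two stages above can be sketched in plain NumPy. This is a minimal illustration of the idea (PCA-sign hashing at a fine resolution, then re-hashing sparse items at a coarse one), not DYF's actual implementation; the function names, the threshold for "recovered", and the use of SVD for PCA are all assumptions made here for clarity.

```python
import numpy as np

def pca_sign_hash(X, components, n_bits):
    """Project onto the top n_bits PCA directions and binarize by sign."""
    proj = X @ components[:n_bits].T                 # (n, n_bits)
    bits = (proj > 0).astype(np.uint64)              # one sign bit per direction
    return (bits << np.arange(n_bits, dtype=np.uint64)).sum(axis=1)

def classify(X, initial_bits=14, recovery_bits=8, dense_threshold=10):
    Xc = X - X.mean(axis=0)
    # PCA axes via SVD (rows of Vt are the principal directions)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Stage 1: fine-resolution buckets
    fine = pca_sign_hash(Xc, Vt, initial_bits)
    _, inv, counts = np.unique(fine, return_inverse=True, return_counts=True)
    dense_mask = counts[inv] >= dense_threshold

    labels = np.full(len(X), "orphan", dtype=object)
    labels[dense_mask] = "dense"

    # Stage 2: re-hash only the sparse items at coarser resolution;
    # items that land together at this scale are "bridges"
    sparse_idx = np.flatnonzero(~dense_mask)
    if len(sparse_idx):
        coarse = pca_sign_hash(Xc[sparse_idx], Vt, recovery_bits)
        _, inv2, counts2 = np.unique(coarse, return_inverse=True, return_counts=True)
        labels[sparse_idx[counts2[inv2] >= 2]] = "bridge"
    return labels

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 32)).astype(np.float32)
labels = classify(X)
print({k: int((labels == k).sum()) for k in ("dense", "bridge", "orphan")})
```

On random Gaussian data most items fall into singleton fine buckets and get recovered at the coarse scale; real embedding corpora, with genuine cluster structure, produce the dense-heavy splits shown in the Quick Start.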

Performance

Dataset                  Time     Per item
60K embeddings (384d)    ~60 ms   ~1.0 µs

The core is Rust-accelerated via PyO3, roughly 4× faster than an equivalent pure-Python implementation.

API

DensityClassifier

DensityClassifier(
    embedding_dim: int,
    initial_bits: int = 14,      # LSH resolution
    recovery_bits: int = 8,      # Coarser recovery resolution
    dense_threshold: int = 10,   # Min bucket size for "dense"
    seed: int = 31
)

# Methods
classifier.fit(embeddings)
classifier.get_dense()           # Dense item indices
classifier.get_bridge()          # Bridge item indices
classifier.get_orphans()         # Orphan item indices
classifier.get_bucket_id(idx)    # Which bucket is item in?
classifier.report()              # Summary statistics

Index Serialization

from dyf import save_index, load_index, PrecomputedIndex

# Save fitted classifier
save_index(classifier, 'index.safetensors', embeddings, metadata={...})

# Load as dict
data = load_index('index.safetensors')
data, metadata = load_index('index.safetensors', include_metadata=True)

# Load as object (recommended)
index = PrecomputedIndex.load('index.safetensors')
index.get_dense()
index.get_bucket(5)
index.metadata
index.version

License

MIT
