
DYF - Density Yields Features


Discover structure in embedding spaces. DYF uses density-based LSH to reveal the natural organization of your data:

  • Dense: Core items in well-populated semantic regions
  • Bridge: Transitional items connecting different clusters
  • Orphan: Unique items with no semantic neighbors

What it does

DYF transforms raw embeddings into navigable semantic maps. Instead of just clustering, it reveals the topology - which regions are dense, which items bridge between concepts, and which are truly unique.

Use cases:

  • Semantic navigation: Find paths between concepts
  • Structure discovery: Understand how your data organizes itself
  • Anomaly detection: Identify orphans and bridges
  • Index building: Pre-compute structure for fast queries

Installation

pip install dyf

For serialization (save/load indexes):

pip install dyf[io]

For full features (embedding generation, LLM labeling):

pip install dyf[full]

Quick Start

Discover Structure

import numpy as np
from dyf import DensityClassifier

# Your embeddings (e.g., from sentence-transformers)
embeddings = np.random.randn(10000, 384).astype(np.float32)

# Find structure
classifier = DensityClassifier(embedding_dim=384)
classifier.fit(embeddings)

# What did we find?
print(classifier.report())
# Corpus: 10000 items
#   Dense: 9500 (95.0%)
#   Bridge: 450 (4.5%)
#   Orphan: 50 (0.5%)

# Get indices
bridges = classifier.get_bridge()  # Transitional items
orphans = classifier.get_orphans() # Unique items

Build & Search Indexes

from dyf import build_dyf_tree, write_lazy_index, LazyIndex

# Build tree from embeddings
tree = build_dyf_tree(embeddings, max_depth=4, num_bits=3, min_leaf_size=8)

# Write to disk (mmap-friendly, zero startup cost)
write_lazy_index(tree, embeddings, "index.dyf",
                 quantization="float16", compression="zstd",
                 stored_fields={"title": titles},
                 metadata={"model": "nomic-embed-text-v1.5"})

# Search (instant open, LRU-cached leaf access)
with LazyIndex("index.dyf") as idx:
    result = idx.search(query_embedding, k=10, nprobe=3)
    print(result.indices, result.scores)
    print(result.fields["title"])  # stored fields returned with results

Adaptive Probing

Queries near decision boundaries automatically probe more leaves:

from dyf import LazyIndex, AdaptiveProbeConfig

with LazyIndex("index.dyf") as idx:
    # Auto mode: margin-based probe count (default thresholds)
    result = idx.search(query, k=10, nprobe="auto", return_routing=True)
    print(result.routing["adaptive_nprobe"])  # how many leaves were probed

    # Custom thresholds
    cfg = AdaptiveProbeConfig(margin_lo=0.005, margin_hi=0.2,
                              min_probes=1, max_probes=8)
    result = idx.search(query, k=10, nprobe=cfg)

Full-Featured Usage

from dyf import DensityClassifierFull, EmbedderConfig, LabelerConfig

# From raw texts
classifier = DensityClassifierFull.from_texts(
    texts=documents,
    categories=categories,
)

# Label clusters with LLM
labels = classifier.label_buckets(**LabelerConfig.MEDIUM.as_kwargs())
print(labels['dense'][1234]['label'])  # "Machine Learning Papers"

How It Works

Two-stage PCA-based LSH (fine initial bucketing, then a coarser recovery pass):

  1. Initial bucketing: PCA projections create semantic buckets
  2. Density check: Items in sparse buckets are candidates for reclassification
  3. Recovery stage: Coarser PCA finds structure among sparse items
  4. Classification: Dense (core), Bridge (recovered), Orphan (truly unique)

The key insight: items that appear as outliers globally often share structure at coarser resolution. Bridges are these "misplaced" items - they connect different semantic regions.
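The steps above can be sketched in plain numpy. This is an illustrative toy, not dyf's implementation: pca_lsh_codes and classify are hypothetical helpers, the real library runs Rust-accelerated code with different defaults, and the random Gaussian data used here carries little dense structure:

```python
import numpy as np

def pca_lsh_codes(X, n_bits):
    """Bucket rows of X by the sign pattern of their projections
    onto the top n_bits principal directions."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    bits = (Xc @ Vt[:n_bits].T > 0).astype(np.int64)
    return bits @ (1 << np.arange(n_bits))  # pack sign bits into a bucket id

def classify(X, initial_bits=10, recovery_bits=4, dense_threshold=10):
    # Stage 1: fine-grained bucketing; well-populated buckets are "dense"
    codes = pca_lsh_codes(X, initial_bits)
    _, inv, counts = np.unique(codes, return_inverse=True, return_counts=True)
    dense = counts[inv] >= dense_threshold

    labels = np.where(dense, "dense", "orphan").astype(object)
    sparse_idx = np.flatnonzero(~dense)
    if len(sparse_idx):
        # Stage 2 (recovery): re-hash only the sparse items at coarser
        # resolution; those that regroup become "bridge"
        rcodes = pca_lsh_codes(X[sparse_idx], recovery_bits)
        _, rinv, rcounts = np.unique(rcodes, return_inverse=True, return_counts=True)
        labels[sparse_idx[rcounts[rinv] >= dense_threshold]] = "bridge"
    return labels

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 64)).astype(np.float32)
labels = classify(X)
print({k: int((labels == k).sum()) for k in ("dense", "bridge", "orphan")})
```

On real embeddings, the first stage isolates dense cores; the coarser recovery pass then separates bridges (sparse items that regroup at lower resolution) from true orphans.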

Performance

Dataset                 Time    Per item
60K embeddings (384d)   ~60ms   1.0 µs

Rust-accelerated via PyO3. ~4x faster than pure Python.

API

DensityClassifier

DensityClassifier(
    embedding_dim: int,
    initial_bits: int = 14,      # LSH resolution
    recovery_bits: int = 8,      # Coarser recovery resolution
    dense_threshold: int = 10,   # Min bucket size for "dense"
    seed: int = 31
)

# Methods
classifier.fit(embeddings)
classifier.get_dense()           # Dense item indices
classifier.get_bridge()          # Bridge item indices
classifier.get_orphans()         # Orphan item indices
classifier.get_bucket_id(idx)    # Which bucket is item in?
classifier.report()              # Summary statistics

LazyIndex

from dyf import LazyIndex

with LazyIndex("index.dyf") as idx:
    # Search with fixed or adaptive probing
    result = idx.search(query, k=10, nprobe=3)       # fixed
    result = idx.search(query, k=10, nprobe="auto")   # adaptive

    # Inspect index structure
    idx.tree_summary          # metadata, dims, leaf count
    idx.total_items           # total indexed items
    idx.stored_field_names    # available stored fields

    # Extract all data
    data = idx.extract_all_fields()
    data['embeddings']        # (n, d) float32
    data['fields']            # {field_name: array}

Documentation

License

MIT


Download files


Source Distribution

dyf-0.8.1.tar.gz (237.1 kB)

Uploaded: Source

Built Distribution


dyf-0.8.1-py3-none-any.whl (187.8 kB)

Uploaded: Python 3

File details

Details for the file dyf-0.8.1.tar.gz.

File metadata

  • Download URL: dyf-0.8.1.tar.gz
  • Upload date:
  • Size: 237.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dyf-0.8.1.tar.gz
Algorithm Hash digest
SHA256 e0d1c7e03a7f4706f92002705c3cd79507ebe22498a8054aeec60f83d6c2bd6b
MD5 137ae16239fbaf6ef3b8fc9267aa43e6
BLAKE2b-256 7e7a1309b3789f43d9a55c95da3bb71bf979bf03de954ea9d509d3720341fbe1


Provenance

The following attestation bundles were made for dyf-0.8.1.tar.gz:

Publisher: publish.yml on jdonaldson/dyf

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file dyf-0.8.1-py3-none-any.whl.

File metadata

  • Download URL: dyf-0.8.1-py3-none-any.whl
  • Upload date:
  • Size: 187.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dyf-0.8.1-py3-none-any.whl
Algorithm Hash digest
SHA256 9aa6bc245bea4a39dceed6f53d6cb620f01cfd23ef5030185e9dc5f29ab7b5f8
MD5 24e8f67b8faa1372a9a2b9f1c2d0a027
BLAKE2b-256 ffc3f513e991cd76905ef535d557132edf2c09989fe657a5bbf28fcc461913ce


Provenance

The following attestation bundles were made for dyf-0.8.1-py3-none-any.whl:

Publisher: publish.yml on jdonaldson/dyf

