
DYF - Density Yields Features


Discover structure in embedding spaces. DYF uses density-based LSH to reveal the natural organization of your data:

  • Dense: Core items in well-populated semantic regions
  • Bridge: Transitional items connecting different clusters
  • Orphan: Unique items with no semantic neighbors

What it does

DYF transforms raw embeddings into navigable semantic maps. Instead of just clustering, it reveals the topology - which regions are dense, which items bridge between concepts, and which are truly unique.

Use cases:

  • Semantic navigation: Find paths between concepts
  • Structure discovery: Understand how your data organizes itself
  • Anomaly detection: Identify orphans and bridges
  • Index building: Pre-compute structure for fast queries

Installation

pip install dyf

For serialization (save/load indexes):

pip install dyf[io]

For full features (embedding generation, LLM labeling):

pip install dyf[full]

Quick Start

Discover Structure

import numpy as np
from dyf import DensityClassifier

# Your embeddings (e.g., from sentence-transformers)
embeddings = np.random.randn(10000, 384).astype(np.float32)

# Find structure
classifier = DensityClassifier(embedding_dim=384)
classifier.fit(embeddings)

# What did we find?
print(classifier.report())
# Corpus: 10000 items
#   Dense: 9500 (95.0%)
#   Bridge: 450 (4.5%)
#   Orphan: 50 (0.5%)

# Get indices
bridges = classifier.get_bridge()  # Transitional items
orphans = classifier.get_orphans() # Unique items

Build & Search Indexes

from dyf import build_dyf_tree, write_lazy_index, LazyIndex

# Build tree from embeddings
tree = build_dyf_tree(embeddings, max_depth=4, num_bits=3, min_leaf_size=8)

# Write to disk (mmap-friendly, zero startup cost)
write_lazy_index(tree, embeddings, "index.dyf",
                 quantization="float16", compression="zstd",
                 stored_fields={"title": titles},
                 metadata={"model": "nomic-embed-text-v1.5"})

# Search (instant open, LRU-cached leaf access)
with LazyIndex("index.dyf") as idx:
    result = idx.search(query_embedding, k=10, nprobe=3)
    print(result.indices, result.scores)
    print(result.fields["title"])  # stored fields returned with results

Adaptive Probing

Queries near decision boundaries automatically probe more leaves:

from dyf import LazyIndex, AdaptiveProbeConfig

with LazyIndex("index.dyf") as idx:
    # Auto mode: margin-based probe count (default thresholds)
    result = idx.search(query, k=10, nprobe="auto", return_routing=True)
    print(result.routing["adaptive_nprobe"])  # how many leaves were probed

    # Custom thresholds
    cfg = AdaptiveProbeConfig(margin_lo=0.005, margin_hi=0.2,
                              min_probes=1, max_probes=8)
    result = idx.search(query, k=10, nprobe=cfg)

Full-Featured Usage

from dyf import DensityClassifierFull, EmbedderConfig, LabelerConfig

# From raw texts
classifier = DensityClassifierFull.from_texts(
    texts=documents,
    categories=categories,
)

# Label clusters with LLM
labels = classifier.label_buckets(**LabelerConfig.MEDIUM.as_kwargs())
print(labels['dense'][1234]['label'])  # "Machine Learning Papers"

How It Works

Two-stage PCA-based LSH (fine initial bucketing, then a coarser recovery pass):

  1. Initial bucketing: PCA projections create semantic buckets
  2. Density check: Items in sparse buckets are candidates for reclassification
  3. Recovery stage: Coarser PCA finds structure among sparse items
  4. Classification: Dense (core), Bridge (recovered), Orphan (truly unique)

The key insight: items that appear as outliers globally often share structure at coarser resolution. Bridges are these "misplaced" items - they connect different semantic regions.
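The steps above can be sketched in plain numpy. This is an illustrative toy, not dyf's implementation: pca_lsh_codes and classify are hypothetical helpers, the real library runs Rust-accelerated code with different defaults, and the random Gaussian data used here carries little dense structure:

```python
import numpy as np

def pca_lsh_codes(X, n_bits):
    """Bucket rows of X by the sign pattern of their projections
    onto the top n_bits principal directions."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    bits = (Xc @ Vt[:n_bits].T > 0).astype(np.int64)
    return bits @ (1 << np.arange(n_bits))  # pack sign bits into a bucket id

def classify(X, initial_bits=10, recovery_bits=4, dense_threshold=10):
    # Stage 1: fine-grained bucketing; well-populated buckets are "dense"
    codes = pca_lsh_codes(X, initial_bits)
    _, inv, counts = np.unique(codes, return_inverse=True, return_counts=True)
    dense = counts[inv] >= dense_threshold

    labels = np.where(dense, "dense", "orphan").astype(object)
    sparse_idx = np.flatnonzero(~dense)
    if len(sparse_idx):
        # Stage 2 (recovery): re-hash only the sparse items at coarser
        # resolution; those that regroup become "bridge"
        rcodes = pca_lsh_codes(X[sparse_idx], recovery_bits)
        _, rinv, rcounts = np.unique(rcodes, return_inverse=True, return_counts=True)
        labels[sparse_idx[rcounts[rinv] >= dense_threshold]] = "bridge"
    return labels

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 64)).astype(np.float32)
labels = classify(X)
print({k: int((labels == k).sum()) for k in ("dense", "bridge", "orphan")})
```

On real embeddings, the first stage isolates dense cores; the coarser recovery pass then separates bridges (sparse items that regroup at lower resolution) from true orphans.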

Performance

Dataset                 Time    Per item
60K embeddings (384d)   ~60ms   1.0 µs

Rust-accelerated via PyO3. ~4x faster than pure Python.

API

DensityClassifier

DensityClassifier(
    embedding_dim: int,
    initial_bits: int = 14,      # LSH resolution
    recovery_bits: int = 8,      # Coarser recovery resolution
    dense_threshold: int = 10,   # Min bucket size for "dense"
    seed: int = 31
)

# Methods
classifier.fit(embeddings)
classifier.get_dense()           # Dense item indices
classifier.get_bridge()          # Bridge item indices
classifier.get_orphans()         # Orphan item indices
classifier.get_bucket_id(idx)    # Which bucket is item in?
classifier.report()              # Summary statistics

LazyIndex

from dyf import LazyIndex

with LazyIndex("index.dyf") as idx:
    # Search with fixed or adaptive probing
    result = idx.search(query, k=10, nprobe=3)       # fixed
    result = idx.search(query, k=10, nprobe="auto")   # adaptive

    # Inspect index structure
    idx.tree_summary          # metadata, dims, leaf count
    idx.total_items           # total indexed items
    idx.stored_field_names    # available stored fields

    # Extract all data
    data = idx.extract_all_fields()
    data['embeddings']        # (n, d) float32
    data['fields']            # {field_name: array}

Documentation

License

MIT


Download files


Source Distribution

dyf-0.8.1.tar.gz (237.1 kB)

Uploaded: Source

Built Distribution


dyf-0.8.1-py3-none-any.whl (187.8 kB)

Uploaded: Python 3

File details

Details for the file dyf-0.8.1.tar.gz.

File metadata

  • Download URL: dyf-0.8.1.tar.gz
  • Upload date:
  • Size: 237.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dyf-0.8.1.tar.gz
Algorithm Hash digest
SHA256 e0d1c7e03a7f4706f92002705c3cd79507ebe22498a8054aeec60f83d6c2bd6b
MD5 137ae16239fbaf6ef3b8fc9267aa43e6
BLAKE2b-256 7e7a1309b3789f43d9a55c95da3bb71bf979bf03de954ea9d509d3720341fbe1


Provenance

The following attestation bundles were made for dyf-0.8.1.tar.gz:

Publisher: publish.yml on jdonaldson/dyf

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file dyf-0.8.1-py3-none-any.whl.

File metadata

  • Download URL: dyf-0.8.1-py3-none-any.whl
  • Upload date:
  • Size: 187.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dyf-0.8.1-py3-none-any.whl
Algorithm Hash digest
SHA256 9aa6bc245bea4a39dceed6f53d6cb620f01cfd23ef5030185e9dc5f29ab7b5f8
MD5 24e8f67b8faa1372a9a2b9f1c2d0a027
BLAKE2b-256 ffc3f513e991cd76905ef535d557132edf2c09989fe657a5bbf28fcc461913ce


Provenance

The following attestation bundles were made for dyf-0.8.1-py3-none-any.whl:

Publisher: publish.yml on jdonaldson/dyf

