Tools to work with embeddings

Project description

imbed

Tools to work with embeddings, easily an flexibily.

To install: pip install imbed

Introduction

As we all know, though RAG (Retrieval Augumented Generation) is hyper-popular at the moment, the R part, though around for decades (mainly under the names "information retrieval" (IR), "search", "indexing",...), has a lot to contribute towards the success, or failure, of the effort. The many characteristics of the retrieval part need to be tuned to align with the final generation and business objectives. There's still a lot of science to do.

So the last thing we want is to be slowed down by pedestrian aspects of the process. We want to be agile in getting data prepared and analyzed, so we spend more time doign science, and iterate our models quickly.

There are two major aspects the imbed wishes to contribute two that.

search: getting from raw data to an iterface where we can search the information effectively
visualize: exploring the data visually (which requires yet another kind of embedding, to 2D or 3D vectors)

What we're looking for here is a setup where with minimal configuration (not code), we can make pipelines where we can point to the original data, enter a few parameters, wait, and get a "search controller" (that is, an object that has all the methods we need to do retrieval stuff). Here's an example of the kind of interface we'd like to target.

raw_docs = mk_text_store(doc_src_uri)  # the store used will depend on the source and format of where the docs are stored
segments = mk_segments_store(raw_docs, ...)  # will not copy any data over, but will give a key-value view of chunked (split) docs
search_ctrl = mk_search_controller(vectorDB, embedder, ...)
search_ctrl.fit(segments, doc_src_uri, ...)
search_ctrl.save(...)

Basic Usage

Text Segmentation

from imbed.segmentation_util import fixed_step_chunker

# Create chunks of text with a specific size
text = "This is a sample text that will be divided into smaller chunks for processing."
chunks = list(fixed_step_chunker(text.split(), chk_size=3))
print(chunks)
# Output: [['This', 'is', 'a'], ['sample', 'text', 'that'], ['will', 'be', 'divided'], ['into', 'smaller', 'chunks'], ['for', 'processing.']]

# Create overlapping chunks with a step size
overlapping_chunks = list(fixed_step_chunker(text.split(), chk_size=4, chk_step=2))
print(overlapping_chunks)
# Output: [['This', 'is', 'a', 'sample'], ['a', 'sample', 'text', 'that'], ...]

Working with Embeddings

import numpy as np
from imbed.util import cosine_similarity, planar_embeddings, transpose_iterable

# Create some example embeddings
embeddings = {
    "doc1": np.array([0.1, 0.2, 0.3]),
    "doc2": np.array([0.2, 0.3, 0.4]),
    "doc3": np.array([0.9, 0.8, 0.7])
}

# Calculate cosine similarity between embeddings
similarity = cosine_similarity(embeddings["doc1"], embeddings["doc2"])
print(f"Similarity between doc1 and doc2: {similarity:.3f}")

# Project embeddings to 2D for visualization
planar_coords = planar_embeddings(embeddings)
print("2D coordinates for visualization:")
for doc_id, coords in planar_coords.items():
    print(f"  {doc_id}: {coords}")

# Get x, y coordinates separately for plotting
x_values, y_values = transpose_iterable(planar_coords.values())

Creating a Search System

from imbed.segmentation_util import SegmentStore

# Example document store
docs = {
    "doc1": "This is the first document about artificial intelligence.",
    "doc2": "The second document discusses neural networks and deep learning.",
    "doc3": "Document three covers natural language processing."
}

# Create segment keys (doc_id, start_position, end_position)
segment_keys = [
    ("doc1", 0, len(docs["doc1"])),
    ("doc2", 0, 27),  # First half
    ("doc2", 28, len(docs["doc2"])),  # Second half
    ("doc3", 0, len(docs["doc3"]))
]

# Create a segment store
segment_store = SegmentStore(docs, segment_keys)

# Get a segment
print(segment_store[("doc2", 28, len(docs["doc2"]))])
# Output: "neural networks and deep learning."

# Iterate over all segments
for key in segment_store:
    segment = segment_store[key]
    print(f"{key}: {segment[:20]}...")

Storage Utilities

import os
from imbed.util import extension_based_wrap
from dol import Files

# Create a directory for storing data
os.makedirs("./data_store", exist_ok=True)

# Create a store that handles encoding/decoding based on file extensions
store = extension_based_wrap(Files("./data_store"))

# Store different types of data with appropriate extensions
store["config.json"] = {"model": "text-embedding-3-small", "batch_size": 32}
store["embeddings.npy"] = np.random.random((10, 128))

# The data is automatically encoded/decoded based on file extension
config = store["config.json"]  # Decoded from JSON automatically
embeddings = store["embeddings.npy"]  # Loaded as numpy array automatically

# Check available codec mappings
from imbed.util import get_codec_mappings
print("Available codecs:", list(get_codec_mappings()[0].keys()))

Advanced Pipeline

For more complex use cases, imbed enables a configuration-driven pipeline approach:

# Example of the configuration-driven pipeline (conceptual)
raw_docs = mk_text_store("s3://my-bucket/documents/")
segments = mk_segments_store(raw_docs, chunk_size=512, overlap=128)
search_ctrl = mk_search_controller(vector_db="faiss", embedder="text-embedding-3-small")
search_ctrl.fit(segments)
search_ctrl.save("./search_model")

# Search using the controller
results = search_ctrl.search("How does machine learning work?")

Working with Embeddings and Visualization in imbed

The imbed package provides powerful tools for working with embeddings, particularly for visualizing high-dimensional data and identifying meaningful clusters. Here are examples showing how to use the planarization and clusterization modules.

Planarization for Embedding Visualization

Embedding models typically produce high-dimensional vectors (e.g., 384 or 1536 dimensions) that can't be directly visualized. The planarization module helps project these vectors to 2D space for visualization purposes.

import numpy as np
import matplotlib.pyplot as plt
from imbed.components.planarization import planarizers, umap_planarizer

# Create some sample high-dimensional embeddings
np.random.seed(42)
embeddings = np.random.randn(100, 128)  # 100 documents with 128-dimensional embeddings

# Project embeddings to 2D using UMAP (great for preserving local relationships)
planar_points = umap_planarizer(
    embeddings,
    n_neighbors=15,
    min_dist=0.1,
    random_state=42
)

# Convert to separate x and y coordinates for plotting
x_coords, y_coords = zip(*planar_points)

# Create a simple scatter plot
plt.figure(figsize=(10, 8))
plt.scatter(x_coords, y_coords, alpha=0.7)
plt.title("Document Embeddings Visualization using UMAP")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.grid(alpha=0.3)
plt.show()

# Available planarization algorithms
print(f"Available planarization methods: {list(planarizers.keys())}")

The planarization module offers multiple techniques for dimensionality reduction:

umap_planarizer: Great for preserving both local and global relationships
tsne_planarizer: Good for preserving local neighborhood relationships
pca_planarizer: Linear projection that preserves global variance
force_directed_planarizer: Physics-based approach for visualization

Each algorithm has different strengths - UMAP is generally excellent for embedding visualization, while t-SNE is better for highlighting local clusters.

Clusterization for Content Organization

After projecting embeddings to 2D, you can cluster them to identify groups of related documents:

import numpy as np
import matplotlib.pyplot as plt
from imbed.components.planarization import umap_planarizer
from imbed.components.clusterization import kmeans_clusterer, hierarchical_clusterer, clusterers

# Create some sample embeddings
np.random.seed(42)
# Create 3 distinct groups of embeddings
group1 = np.random.randn(30, 128) + np.array([2.0] * 128)
group2 = np.random.randn(40, 128) - np.array([2.0] * 128)
group3 = np.random.randn(30, 128) + np.array([0.0] * 128)
embeddings = np.vstack([group1, group2, group3])

# First project to 2D for visualization
planar_points = umap_planarizer(embeddings, random_state=42)
x_coords, y_coords = zip(*planar_points)

# Apply clustering to the original high-dimensional embeddings
cluster_ids = kmeans_clusterer(embeddings, n_clusters=3)

# Visualize the clusters
plt.figure(figsize=(10, 8))
scatter = plt.scatter(x_coords, y_coords, c=cluster_ids, cmap='viridis', alpha=0.7)
plt.colorbar(scatter, label='Cluster ID')
plt.title("Document Clusters Visualization")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.grid(alpha=0.3)
plt.show()

# Try a different clustering algorithm
hierarchical_clusters = hierarchical_clusterer(embeddings, n_clusters=3, linkage='ward')

# Compare clustering results
agreement = sum(1 for a, b in zip(cluster_ids, hierarchical_clusters) if a == b) / len(cluster_ids)
print(f"Agreement between kmeans and hierarchical clustering: {agreement:.2%}")

# Available clustering algorithms
print(f"Available clustering methods: {list(clusterers.keys())}")

Labeling clusters

A useful AI-based tool to label clusters.

from imbed import cluster_labeler

import pandas as pd
import numpy as np
from typing import Callable, Union

# Sample data with text segments and cluster indices
data = {
    'segment': [
        "Machine learning models can be trained on large datasets to identify patterns.",
        "Neural networks are a subset of machine learning algorithms inspired by the human brain.",
        "Deep learning is a type of neural network with multiple hidden layers.",
        "Python is a versatile programming language used in data science and web development.",
        "JavaScript is primarily used for web development and creating interactive websites.",
        "HTML and CSS are markup languages used to structure and style web pages.",
        "SQL is a query language designed for managing and manipulating databases.",
        "NoSQL databases like MongoDB store data in flexible, JSON-like documents."
    ],
    'cluster_idx': [0, 0, 0, 1, 1, 1, 2, 2]  # 3 clusters
}

# Create the dataframe
df = pd.DataFrame(data)

# You can test with:
labels = cluster_labeler(df, context="Technical documentation")
labels

{0: 'Neural Networks Overview',
1: 'Web Development Language Comparisons',
2: 'Database Management Comparisons'}

Why This Matters for Embedding Visualization

Both planarization and clusterization are essential for making sense of embeddings:

Dimensionality Reduction: High-dimensional embeddings can't be directly visualized. Planarization techniques reduce them to 2D or 3D for plotting while preserving meaningful relationships.

Pattern Discovery: Clustering helps identify natural groupings within your data, revealing thematic structures that might not be obvious.

Content Organization: You can use clusters to automatically organize documents by topic, identify outliers, or create faceted navigation systems.

Relevance Evaluation: Visualizing embeddings lets you assess whether your embedding model is capturing meaningful semantic relationships.

Iterative Refinement: Visual inspection of embeddings and clusters helps you iterate on your data preparation, segmentation, and model selection strategies.

The imbed package makes these powerful techniques accessible through a simple, unified interface, allowing you to focus on the analysis rather than implementation details.

Project details

Release history Release notifications | RSS feed

This version

0.0.24

May 16, 2026

0.0.23

Oct 2, 2025

0.0.22

Aug 22, 2025

0.0.21

Jul 20, 2025

0.0.20

Jul 20, 2025

0.0.19

Jul 9, 2025

0.0.18

Jul 9, 2025

0.0.17

Jul 9, 2025

0.0.16

Jul 1, 2025

0.0.15

Jul 1, 2025

0.0.14

Jun 28, 2025

0.0.13

Jun 24, 2025

0.0.12

Mar 28, 2025

0.0.11

Mar 25, 2025

0.0.10

Mar 25, 2025

0.0.9

Mar 24, 2025

0.0.8

Mar 21, 2025

0.0.7

Mar 21, 2025

0.0.6

Mar 18, 2025

0.0.5

Mar 18, 2025

0.0.4

Dec 27, 2024

0.0.3

Aug 29, 2024

0.0.2

Dec 13, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

imbed-0.0.24.tar.gz (74.8 kB view details)

Uploaded May 16, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

imbed-0.0.24-py3-none-any.whl (75.5 kB view details)

Uploaded May 16, 2026 Python 3

File details

Details for the file imbed-0.0.24.tar.gz.

File metadata

Download URL: imbed-0.0.24.tar.gz
Upload date: May 16, 2026
Size: 74.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for imbed-0.0.24.tar.gz
Algorithm	Hash digest
SHA256	`e429bfbd19da087777b060e2c6f95c6684c51fa531d4b7d757c8f60e06251e5d`
MD5	`2554494eefc95df4c73259ae6b101e09`
BLAKE2b-256	`0394de02b816305addc58f5bcce60dbcbb9e2fee15770b5ed0365f6bcbcd9d41`

See more details on using hashes here.

File details

Details for the file imbed-0.0.24-py3-none-any.whl.

File metadata

Download URL: imbed-0.0.24-py3-none-any.whl
Upload date: May 16, 2026
Size: 75.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for imbed-0.0.24-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c4cbab952d4a92465a1ea9ddc8440039e16c7d0001de92907b478361b588dcb3`
MD5	`ea1b7505527c5eecf3a240fbfa6a049a`
BLAKE2b-256	`c34b89d3ca405e32f959b89fe6ad2be94da642bae665544a686adc39fa6df235`

See more details on using hashes here.

imbed 0.0.24

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

imbed

Introduction

Basic Usage

Text Segmentation

Working with Embeddings

Creating a Search System

Storage Utilities

Advanced Pipeline

Working with Embeddings and Visualization in imbed

Planarization for Embedding Visualization

Clusterization for Content Organization

Labeling clusters

Why This Matters for Embedding Visualization

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes