Embeddings Flow -- Tools for workflows involving semantic embeddings

ef (Embedding Flow)

Lightweight framework for embedding pipelines

ef is a simple, composable framework for building and running embedding pipelines. It provides:

✅ Works out-of-the-box (zero configuration, built-in components)
✅ Component registries as mapping stores (easy discovery)
✅ Automatic pipeline composition (via DAG)
✅ Flexible storage backends (memory, files, custom)
✅ Plugin system (add production components when needed)

Installation

# Basic installation (works immediately with built-in components)
pip install ef

# With full functionality (dol, meshed, larder)
# Quoting the extra keeps shells such as zsh from expanding the brackets
pip install "ef[full]"

# With imbed integration (production components)
pip install "ef[imbed]"

Quick Start

from ef import Project

# Create project (works immediately!)
project = Project.create('my_project', backend='memory')

# Add data
project.add_source('doc1', 'First document about AI')
project.add_source('doc2', 'Second document about ML')

# List available components
print(project.list_components())
# {
#   'embedders': ['simple', 'char_counts'],
#   'planarizers': ['simple_2d', 'normalize_2d'],
#   'clusterers': ['simple_kmeans', 'threshold'],
#   'segmenters': ['identity', 'lines', 'sentences']
# }

# Create pipeline
project.create_pipeline(
    'analysis',
    embedder='simple',
    planarizer='simple_2d',
    clusterer='simple_kmeans',
    n_clusters=2
)

# Run pipeline (persists all results automatically)
results = project.run_pipeline('analysis')

# Access persisted data
print(f"Segments: {len(project.segments)}")
print(f"Embeddings: {len(project.embeddings)}")
print(f"Clusters: {dict(project.clusters)}")

# Get project summary
print(project.summary())

Core Concepts

1. Component Registries (Mapping Stores)

Components are stored in registries that behave like dictionaries:

# Access components like a dict
embedder = project.embedders['simple']
vectors = embedder({'text1': 'Sample text'})

# List all components
print(list(project.embedders.keys()))

# Get component metadata
meta = project.embedders.get_metadata('simple')
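
To make the "registry behaves like a dictionary" idea concrete, here is a minimal sketch of a mapping-style registry with a `register` decorator. It is not ef's implementation: `ToyRegistry` and `length_embedder` are hypothetical names, and the only assumption carried over from the examples above is that an embedder takes a `{key: text}` mapping and returns a `{key: vector}` mapping.

```python
from collections.abc import Mapping


class ToyRegistry(Mapping):
    """Hypothetical read-only mapping from component names to callables."""

    def __init__(self):
        self._components = {}
        self._metadata = {}

    def register(self, name, **metadata):
        """Return a decorator that stores the decorated function under `name`."""
        def decorator(func):
            self._components[name] = func
            self._metadata[name] = metadata
            return func
        return decorator

    def get_metadata(self, name):
        """Return a copy of the keyword metadata given at registration."""
        return dict(self._metadata[name])

    # The three methods Mapping requires; keys(), items(), `in`, etc. come for free.
    def __getitem__(self, name):
        return self._components[name]

    def __iter__(self):
        return iter(self._components)

    def __len__(self):
        return len(self._components)


embedders = ToyRegistry()


@embedders.register('length', dimension=1)
def length_embedder(segments):
    # Toy embedder: a one-dimensional vector holding the text length
    return {key: [float(len(text))] for key, text in segments.items()}


vectors = embedders['length']({'text1': 'Sample text'})
print(list(embedders.keys()))             # ['length']
print(vectors)                            # {'text1': [11.0]}
print(embedders.get_metadata('length'))   # {'dimension': 1}
```

Subclassing `Mapping` means that only three methods have to be written; lookup, iteration and membership tests then work exactly as they do on a `dict`.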

2. Mall Pattern (Store of Stores)

Each project has a "mall" - separate stores for each data type:

# Access different stores
project.segments['doc1'] = 'text'
project.embeddings['doc1'] = [1.0, 2.0, 3.0]
project.clusters['doc1'] = 0

# All stores use MutableMapping interface
for key, value in project.embeddings.items():
    print(f"{key}: {value}")
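
The store-of-stores idea can be illustrated without ef at all. The sketch below uses plain dicts as stand-ins for the stores; `keys_missing_from` is a hypothetical helper, not part of ef, that shows why a shared mapping interface is convenient: the same code works whatever backs each store.

```python
from collections.abc import MutableMapping

# A "mall" here is simply a dict whose values are themselves mutable mappings.
mall = {
    'segments': {},
    'embeddings': {},
    'clusters': {},
}

mall['segments']['doc1'] = 'text'
mall['embeddings']['doc1'] = [1.0, 2.0, 3.0]
mall['clusters']['doc1'] = 0

# Every store satisfies the MutableMapping interface (plain dicts do).
all_mutable = all(isinstance(store, MutableMapping) for store in mall.values())


def keys_missing_from(mall, source, target):
    """Keys present in the `source` store but not yet in the `target` store."""
    return sorted(set(mall[source]) - set(mall[target]))


mall['segments']['doc2'] = 'more text'
pending = keys_missing_from(mall, 'segments', 'embeddings')
print(all_mutable)  # True
print(pending)      # ['doc2']
```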

3. Pipeline Assembly

Pipelines are assembled automatically from components:

# Create pipeline by naming components
project.create_pipeline(
    'my_pipeline',
    segmenter='lines',      # Optional: split text
    embedder='simple',      # Required: embed segments
    planarizer='simple_2d', # Optional: reduce dimensions
    clusterer='simple_kmeans',  # Optional: cluster
    n_clusters=5  # Pass parameters to components
)

# Run with automatic persistence
results = project.run_pipeline('my_pipeline')
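
As a rough illustration of what assembling a pipeline from named stages involves, the sketch below chains stage functions in order and skips the optional ones that were not supplied. It is a simplified linear chain under stated assumptions, not ef's DAG-based assembly; `assemble`, `lines_segmenter` and `length_embedder` are hypothetical names.

```python
def lines_segmenter(sources):
    """Split each source text into one segment per non-empty line."""
    return {
        f"{key}_{i}": line
        for key, text in sources.items()
        for i, line in enumerate(text.splitlines())
        if line.strip()
    }


def length_embedder(segments):
    """Toy embedder: [character count, word count] for each segment."""
    return {
        key: [float(len(text)), float(len(text.split()))]
        for key, text in segments.items()
    }


def assemble(*steps):
    """Chain the given steps in order, skipping any that are None."""
    active = [step for step in steps if step is not None]

    def pipeline(data):
        for step in active:
            data = step(data)
        return data

    return pipeline


# Segmenter and embedder supplied; the optional third stage is left out.
pipeline = assemble(lines_segmenter, length_embedder, None)
result = pipeline({'doc1': 'First line\nSecond longer line'})
print(result)
# {'doc1_0': [10.0, 2.0], 'doc1_1': [18.0, 3.0]}
```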

4. Flexible Storage

Choose storage backend based on needs:

# In-memory (fast, temporary)
project = Project.create('test', backend='memory')

# File-based (persistent)
project = Project.create('prod', backend='files', root_dir='/data')

# Custom (bring your own store)
from ef.storage import mk_project_mall
mall = mk_project_mall('custom', backend='files')
mall['embeddings'] = MyCustomStore()  # MyCustomStore: your own class; any MutableMapping works
project = Project('custom', mall=mall)
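
For the custom case, any class that implements the five `MutableMapping` methods can serve as a store. The sketch below is one hypothetical possibility, a store that writes each value to its own JSON file; `JsonFileStore` is an illustrative name, not an ef class, and it assumes keys are simple filename-safe strings and values are JSON-serializable.

```python
import json
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


class JsonFileStore(MutableMapping):
    """Hypothetical store keeping each value as <key>.json in one directory."""

    def __init__(self, root_dir):
        self.root = Path(root_dir)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        # Assumes keys contain no path separators
        return self.root / f"{key}.json"

    def __getitem__(self, key):
        path = self._path(key)
        if not path.exists():
            raise KeyError(key)
        return json.loads(path.read_text())

    def __setitem__(self, key, value):
        self._path(key).write_text(json.dumps(value))

    def __delitem__(self, key):
        path = self._path(key)
        if not path.exists():
            raise KeyError(key)
        path.unlink()

    def __iter__(self):
        return (path.stem for path in self.root.glob("*.json"))

    def __len__(self):
        return sum(1 for _ in self.root.glob("*.json"))


store = JsonFileStore(tempfile.mkdtemp())
store['doc1'] = [1.0, 2.0, 3.0]
print(store['doc1'])  # [1.0, 2.0, 3.0]
print(len(store))     # 1
```

Because the class derives from `MutableMapping`, methods such as `items()`, `update()` and `pop()` are provided automatically on top of the five written here.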

Plugin System

Built-in Components (Always Available)

ef comes with simple implementations that need nothing beyond the core install (no optional dependencies):

# Automatically registered on import
from ef import Project
project = Project.create('test')

# Has built-in components:
# - Embedders: simple, char_counts
# - Planarizers: simple_2d, normalize_2d
# - Clusterers: simple_kmeans, threshold
# - Segmenters: identity, lines, sentences

Adding Production Components

Use plugins to add real ML implementations:

from ef import Project
from ef.plugins import imbed  # Requires: pip install ef[imbed]

project = Project.create('production')
imbed.register(project)

# Now has production components:
# - OpenAI embedders
# - UMAP planarization
# - scikit-learn clustering
# - And more...

print(list(project.embedders.keys()))
# ['simple', 'char_counts', 'openai-small', 'openai-large', ...]

Writing Your Own Plugin

# my_plugin.py
def register(project):
    """Add custom components to project."""
    
    @project.embedders.register('my_embedder', dimension=768)
    def my_embedder(segments):
        # Your implementation (compute is a placeholder for your own text-to-vector function)
        return {key: compute(text) for key, text in segments.items()}

# Use it
from ef import Project
import my_plugin

project = Project.create('custom')
my_plugin.register(project)
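
The plugin example leaves `compute` undefined. One possible stand-in for experimenting is sketched below: a toy hashed bag-of-words vector built with the standard library only. This is an assumption made for illustration, not something ef or imbed provides, and the default `dimension=8` is arbitrary (the example above registers 768).

```python
import hashlib
import math


def compute(text, dimension=8):
    """Toy hashed bag-of-words vector, L2-normalized (all zeros for empty text)."""
    vector = [0.0] * dimension
    for word in text.lower().split():
        # Hash each word to a stable bucket index in [0, dimension)
        digest = hashlib.sha256(word.encode('utf-8')).digest()
        index = int.from_bytes(digest[:4], 'big') % dimension
        vector[index] += 1.0
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


def my_embedder(segments):
    # Same shape as the plugin's embedder: {key: text} in, {key: vector} out
    return {key: compute(text) for key, text in segments.items()}


vectors = my_embedder({'doc1': 'First document about AI', 'empty': ''})
print(len(vectors['doc1']))  # 8
```

A cryptographic hash is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, which would make the vectors differ between runs.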

Advanced Usage

Multiple Projects

from ef import Projects

# Manage multiple projects
projects = Projects(root_dir='/data')

# Create projects
proj1 = projects.create_project('research', backend='files')
proj2 = projects.create_project('production', backend='files')

# Access later
existing = projects['research']

# List all
print(list(projects.keys()))

Quick Embed (No Pipeline)

# For one-off embeddings
embeddings = project.quick_embed(
    'Some text to embed',
    embedder='simple'
)

Custom Components

# Register your own component
@project.embedders.register('custom', dimension=512)
def custom_embedder(segments):
    # my_model is a placeholder for your own model call (text -> vector of length 512)
    return {k: my_model(v) for k, v in segments.items()}

# Use in pipeline
project.create_pipeline('custom_pipe', embedder='custom')

Architecture

ef is a lightweight interface layer that optionally imports imbed for the heavy implementations (Option 1 of the project's design plan):

┌─────────────────────────────────────┐
│  ef (lightweight interface layer)   │
│  - ComponentRegistry                │
│  - Project/Projects                 │
│  - Mall pattern                     │
│  - Pipeline assembly                │
│  - Built-in toy components          │
└──────────────┬──────────────────────┘
               │ imports (optional)
               ↓
┌─────────────────────────────────────┐
│  imbed (heavy implementation)       │
│  - Real embedders (OpenAI, etc.)    │
│  - Real planarizers (UMAP)          │
│  - Real clusterers (sklearn)        │
│  - Dataset classes                  │
│  - All utilities                    │
└─────────────────────────────────────┘

Design Principles

Works immediately: Built-in components require no setup
Mapping everywhere: All stores use MutableMapping interface
Composable: Mix and match components easily
Discoverable: .list_components(), .list_pipelines()
Flexible: Swap storage backends without code changes
Extensible: Plugin system for adding functionality
Progressive enhancement: Start simple, add complexity as needed

Dependencies

Required (minimal):

Python 3.10+
numpy

Optional (recommended):

dol>=0.2.38 - Better storage abstraction
meshed>=0.1.20 - Automatic DAG composition
larder>=0.1.6 - Automatic caching

Plugin dependencies:

imbed>=0.1 - Production ML components

Install optional dependencies:

pip install "ef[full]"     # Install dol, meshed, larder
pip install "ef[imbed]"    # Install imbed + full dependencies

Development

# Clone repository
git clone https://github.com/thorwhalen/ef.git
cd ef

# Install in development mode
pip install -e .

# Run tests (if available)
pytest

Examples

See the imbed_refactored/ directory for detailed examples:

imbed_refactored.py - Core patterns and complete demo
advanced_example.py - Real ML integrations (OpenAI, UMAP, sklearn)
persistence_examples.py - Pipeline sharing and caching

Comparison with imbed

Feature	ef	imbed
Purpose	Lightweight interface framework	Production ML implementations
Dependencies	numpy (+ optional)	openai, umap, sklearn, datasets, etc.
Out-of-box	✓ Works immediately	Requires configuration
Components	Toy implementations	Production implementations
Use case	Prototyping, learning, interfaces	Production ML pipelines

Use together:

from ef import Project
from ef.plugins import imbed

project = Project.create('best_of_both')
imbed.register(project)  # Add production power to clean interface

License

MIT

Contributing

Contributions welcome! Please:

Write tests for new features
Follow existing code style
Update documentation
Submit PRs to main branch

Project details

Release history

0.1.1 (this version): Oct 31, 2025
0.0.6: Jun 15, 2025
0.0.5: May 17, 2025
0.0.4: Oct 10, 2022
0.0.3: Oct 4, 2022
0.0.2: Jan 6, 2021

Download files

Source Distribution

ef-0.1.1.tar.gz (19.3 kB)

Uploaded Oct 31, 2025 Source

Built Distribution

ef-0.1.1-py3-none-any.whl (19.5 kB)

Uploaded Oct 31, 2025 Python 3

File details

Details for the file ef-0.1.1.tar.gz.

File metadata

Download URL: ef-0.1.1.tar.gz
Upload date: Oct 31, 2025
Size: 19.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for ef-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`c4f0b0cab41570e5be3fa3dddae9ad41b00f743dd813f4bc935dac030776f8ab`
MD5	`329698e7e2103a80857cd218ed7f927a`
BLAKE2b-256	`e3b4c7215ff9983f3d8591897c2f13c1c32efaf4a534e4aaa0c4dab74d396213`

File details

Details for the file ef-0.1.1-py3-none-any.whl.

File metadata

Download URL: ef-0.1.1-py3-none-any.whl
Upload date: Oct 31, 2025
Size: 19.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for ef-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c49fbd1c7ccaf3a6b79d1b5ebc141d877455ce8d653cdd8d89f65f1acd989312`
MD5	`78764689d440ff261cbf637cf2a43b4a`
BLAKE2b-256	`294c7703ebcb8a561cd772bf649cf884a3c4e593677d50d737a06c5ccc4fd5b6`
