Embeddings Flow -- Tools for workflows involving semantic embeddings
ef (Embedding Flow)
Lightweight framework for embedding pipelines
ef is a simple, composable framework for building and running embedding pipelines. It provides:
- ✅ Works out-of-the-box (zero configuration, built-in components)
- ✅ Component registries as mapping stores (easy discovery)
- ✅ Automatic pipeline composition (via DAG)
- ✅ Flexible storage backends (memory, files, custom)
- ✅ Plugin system (add production components when needed)
Installation
# Basic installation (works immediately with built-in components)
pip install ef
# With full functionality (dol, meshed, larder)
# Quotes keep shells such as zsh from expanding the brackets
pip install "ef[full]"
# With imbed integration (production components)
pip install "ef[imbed]"
Quick Start
from ef import Project
# Create project (works immediately!)
project = Project.create('my_project', backend='memory')
# Add data
project.add_source('doc1', 'First document about AI')
project.add_source('doc2', 'Second document about ML')
# List available components
print(project.list_components())
# {
# 'embedders': ['simple', 'char_counts'],
# 'planarizers': ['simple_2d', 'normalize_2d'],
# 'clusterers': ['simple_kmeans', 'threshold'],
# 'segmenters': ['identity', 'lines', 'sentences']
# }
# Create pipeline
project.create_pipeline(
    'analysis',
    embedder='simple',
    planarizer='simple_2d',
    clusterer='simple_kmeans',
    n_clusters=2
)
# Run pipeline (persists all results automatically)
results = project.run_pipeline('analysis')
# Access persisted data
print(f"Segments: {len(project.segments)}")
print(f"Embeddings: {len(project.embeddings)}")
print(f"Clusters: {dict(project.clusters)}")
# Get project summary
print(project.summary())
Core Concepts
1. Component Registries (Mapping Stores)
Components are stored in registries that behave like dictionaries:
# Access components like a dict
embedder = project.embedders['simple']
vectors = embedder({'text1': 'Sample text'})
# List all components
print(list(project.embedders.keys()))
# Get component metadata
meta = project.embedders.get_metadata('simple')
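To illustrate what "behaves like a dictionary" can mean here, the following is a minimal sketch of a registry built on `collections.abc.Mapping`. The class name `SketchRegistry` and its internals are assumptions made for illustration and are not ef's actual implementation; only the `register(name, **metadata)` decorator shape and `get_metadata` mirror calls shown in this README.

```python
from collections.abc import Mapping


class SketchRegistry(Mapping):
    """Hypothetical stand-in for a component registry (not ef's real class)."""

    def __init__(self):
        self._components = {}
        self._metadata = {}

    def register(self, name, **metadata):
        """Return a decorator that stores the function under `name`."""
        def decorator(func):
            self._components[name] = func
            self._metadata[name] = metadata
            return func
        return decorator

    def get_metadata(self, name):
        return dict(self._metadata[name])

    # Mapping interface: these three methods also provide keys(), items(), `in`
    def __getitem__(self, name):
        return self._components[name]

    def __iter__(self):
        return iter(self._components)

    def __len__(self):
        return len(self._components)


embedders = SketchRegistry()


@embedders.register('length', dimension=1)
def length_embedder(segments):
    # Toy embedding: one number per segment (its character count)
    return {key: [float(len(text))] for key, text in segments.items()}


vectors = embedders['length']({'text1': 'Sample text'})
print(list(embedders.keys()), embedders.get_metadata('length'), vectors)
```

Because the registry is read through the `Mapping` interface, discovery needs no special API: `list(registry)`, `name in registry`, and `registry.items()` all work as they would on a dict.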
2. Mall Pattern (Store of Stores)
Each project has a "mall" - separate stores for each data type:
# Access different stores
project.segments['doc1'] = 'text'
project.embeddings['doc1'] = [1.0, 2.0, 3.0]
project.clusters['doc1'] = 0
# All stores use the MutableMapping interface
for key, value in project.embeddings.items():
    print(f"{key}: {value}")
3. Pipeline Assembly
Pipelines are assembled automatically from components:
# Create pipeline by naming components
project.create_pipeline(
    'my_pipeline',
    segmenter='lines',          # Optional: split text
    embedder='simple',          # Required: embed segments
    planarizer='simple_2d',     # Optional: reduce dimensions
    clusterer='simple_kmeans',  # Optional: cluster
    n_clusters=5                # Pass parameters to components
)
# Run with automatic persistence
results = project.run_pipeline('my_pipeline')
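As a rough illustration of automatic composition via a DAG, the sketch below wires plain functions together by matching parameter names to the names of other functions, then runs them in dependency order. It uses only the standard library (`graphlib`, `inspect`), not meshed, and the stage functions and `run_dag` helper are hypothetical; how ef assembles pipelines internally may differ.

```python
from graphlib import TopologicalSorter
from inspect import signature


def segments(sources):
    # Split each source into lines, keyed by "<source key>:<line index>"
    return {
        f"{key}:{i}": line
        for key, text in sources.items()
        for i, line in enumerate(text.splitlines())
    }


def embeddings(segments):
    # Toy embedding: [character count, word count]
    return {k: [float(len(t)), float(len(t.split()))] for k, t in segments.items()}


def clusters(embeddings, threshold=10.0):
    # Toy clustering: 0 for short segments, 1 for long ones
    return {k: int(v[0] >= threshold) for k, v in embeddings.items()}


def run_dag(funcs, **inputs):
    """Run functions in dependency order, wiring them by parameter name."""
    by_name = {f.__name__: f for f in funcs}
    # A function depends on every parameter that another function produces
    graph = {
        name: {p for p in signature(f).parameters if p in by_name}
        for name, f in by_name.items()
    }
    values = dict(inputs)
    for name in TopologicalSorter(graph).static_order():
        func = by_name[name]
        kwargs = {p: values[p] for p in signature(func).parameters if p in values}
        values[name] = func(**kwargs)
    return values


results = run_dag(
    [clusters, embeddings, segments],  # listing order does not matter
    sources={'doc1': 'short\na much longer line of text'},
)
print(results['clusters'])
```

Parameters that no function produces and that are not passed as inputs (here `threshold`) fall back to their defaults, which is one simple way to let pipeline-level keyword arguments reach individual stages.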
4. Flexible Storage
Choose storage backend based on needs:
# In-memory (fast, temporary)
project = Project.create('test', backend='memory')
# File-based (persistent)
project = Project.create('prod', backend='files', root_dir='/data')
# Custom (bring your own store)
from ef.storage import mk_project_mall
mall = mk_project_mall('custom', backend='files')
mall['embeddings'] = MyCustomStore()  # MyCustomStore is a placeholder: any MutableMapping works
project = Project('custom', mall=mall)
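For the "bring your own store" case, any class that implements the five `MutableMapping` methods should qualify. The sketch below is one hypothetical example, `JsonDirStore`, which keeps one JSON file per key in a directory. The class name and file layout are assumptions for illustration and are not part of ef.

```python
import json
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


class JsonDirStore(MutableMapping):
    """Hypothetical store: one JSON file per key inside a directory.

    Keys are used as file names, so they must not contain path separators.
    """

    def __init__(self, root_dir):
        self.root = Path(root_dir)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        return self.root / f"{key}.json"

    def __getitem__(self, key):
        path = self._path(key)
        if not path.exists():
            raise KeyError(key)
        return json.loads(path.read_text(encoding='utf-8'))

    def __setitem__(self, key, value):
        self._path(key).write_text(json.dumps(value), encoding='utf-8')

    def __delitem__(self, key):
        path = self._path(key)
        if not path.exists():
            raise KeyError(key)
        path.unlink()

    def __iter__(self):
        return (p.stem for p in sorted(self.root.glob('*.json')))

    def __len__(self):
        return sum(1 for _ in self.root.glob('*.json'))


store = JsonDirStore(tempfile.mkdtemp())
store['doc1'] = [1.0, 2.0, 3.0]
store['doc2'] = [4.0, 5.0, 6.0]
print(dict(store))
```

An instance like this could then take the place of `MyCustomStore()` in the snippet above, since code written against the mapping interface does not need to know where the values live.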
Plugin System
Built-in Components (Always Available)
ef comes with simple implementations that work without dependencies:
# Automatically registered on import
from ef import Project
project = Project.create('test')
# Has built-in components:
# - Embedders: simple, char_counts
# - Planarizers: simple_2d, normalize_2d
# - Clusterers: simple_kmeans, threshold
# - Segmenters: identity, lines, sentences
Adding Production Components
Use plugins to add real ML implementations:
from ef import Project
from ef.plugins import imbed # Requires: pip install ef[imbed]
project = Project.create('production')
imbed.register(project)
# Now has production components:
# - OpenAI embedders
# - UMAP planarization
# - scikit-learn clustering
# - And more...
print(list(project.embedders.keys()))
# ['simple', 'char_counts', 'openai-small', 'openai-large', ...]
Writing Your Own Plugin
# my_plugin.py
def register(project):
    """Add custom components to project."""

    @project.embedders.register('my_embedder', dimension=768)
    def my_embedder(segments):
        # Your implementation: `compute` is a placeholder for your own
        # text-to-vector function
        return {key: compute(text) for key, text in segments.items()}
# Use it
from ef import Project
import my_plugin
project = Project.create('custom')
my_plugin.register(project)
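The plugin example leaves `compute` undefined. As one possible stand-in, the sketch below defines a deterministic hashed bag-of-words `compute` using only the standard library, and an embedder that follows the same contract as above (a mapping of keys to text in, a mapping of keys to vectors out). The hashing scheme and the `dimension=8` default are assumptions chosen for illustration, not something ef prescribes.

```python
import hashlib


def compute(text, dimension=8):
    """Hypothetical `compute`: hashed bag-of-words vector of fixed length."""
    vector = [0.0] * dimension
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode('utf-8')).digest()
        # Map each token to a stable bucket and count occurrences
        index = int.from_bytes(digest[:4], 'big') % dimension
        vector[index] += 1.0
    return vector


def my_embedder(segments):
    # Same contract as the plugin example: {key: text} -> {key: vector}
    return {key: compute(text) for key, text in segments.items()}


vectors = my_embedder({
    'doc1': 'First document about AI',
    'doc2': 'Second document about ML',
})
for key, vector in vectors.items():
    print(key, vector)
```

A function like `my_embedder` can be tested on its own, without a project, before it is registered inside a plugin's `register(project)` function.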
Advanced Usage
Multiple Projects
from ef import Projects
# Manage multiple projects
projects = Projects(root_dir='/data')
# Create projects
proj1 = projects.create_project('research', backend='files')
proj2 = projects.create_project('production', backend='files')
# Access later
existing = projects['research']
# List all
print(list(projects.keys()))
Quick Embed (No Pipeline)
# For one-off embeddings
embeddings = project.quick_embed(
    'Some text to embed',
    embedder='simple'
)
Custom Components
# Register your own component
@project.embedders.register('custom', dimension=512)
def custom_embedder(segments):
    # `my_model` is a placeholder for your own text-to-vector callable
    return {k: my_model(v) for k, v in segments.items()}
# Use in pipeline
project.create_pipeline('custom_pipe', embedder='custom')
Architecture
ef follows Option 1 from the design plan: a lightweight interface layer that optionally imports imbed for the heavy implementations:
┌─────────────────────────────────────┐
│  ef (lightweight interface layer)   │
│  - ComponentRegistry                │
│  - Project/Projects                 │
│  - Mall pattern                     │
│  - Pipeline assembly                │
│  - Built-in toy components          │
└──────────────┬──────────────────────┘
               │ imports (optional)
               ↓
┌─────────────────────────────────────┐
│  imbed (heavy implementation)       │
│  - Real embedders (OpenAI, etc.)    │
│  - Real planarizers (UMAP)          │
│  - Real clusterers (sklearn)        │
│  - Dataset classes                  │
│  - All utilities                    │
└─────────────────────────────────────┘
Design Principles
- Works immediately: Built-in components require no setup
- Mapping everywhere: All stores use the MutableMapping interface
- Composable: Mix and match components easily
- Discoverable: .list_components(), .list_pipelines()
- Flexible: Swap storage backends without code changes
- Extensible: Plugin system for adding functionality
- Progressive enhancement: Start simple, add complexity as needed
Dependencies
Required (minimal):
- Python 3.10+
- numpy
Optional (recommended):
- dol>=0.2.38: Better storage abstraction
- meshed>=0.1.20: Automatic DAG composition
- larder>=0.1.6: Automatic caching
Plugin dependencies:
- imbed>=0.1: Production ML components
Install optional dependencies:
pip install "ef[full]"   # Install dol, meshed, larder
pip install "ef[imbed]"  # Install imbed + full dependencies
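To see which of the optional packages are present in the current environment, a small check along these lines may help. It relies only on the standard library's `importlib.util.find_spec`; the helper name `installed_optional_packages` is hypothetical, and the package list is taken from the dependency lists above.

```python
from importlib.util import find_spec

# Optional and plugin packages named in this README
OPTIONAL_PACKAGES = ('dol', 'meshed', 'larder', 'imbed')


def installed_optional_packages(names=OPTIONAL_PACKAGES):
    """Map each package name to True if it can be imported here."""
    return {name: find_spec(name) is not None for name in names}


status = installed_optional_packages()
for name, available in status.items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

This only reports whether a package is importable; it does not verify the minimum versions listed above.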
Development
# Clone repository
git clone https://github.com/thorwhalen/ef.git
cd ef
# Install in development mode
pip install -e .
# Run tests (if available)
pytest
Examples
See the imbed_refactored/ directory for detailed examples:
- imbed_refactored.py: Core patterns and complete demo
- advanced_example.py: Real ML integrations (OpenAI, UMAP, sklearn)
- persistence_examples.py: Pipeline sharing and caching
Comparison with imbed
| Feature | ef | imbed |
|---|---|---|
| Purpose | Lightweight interface framework | Production ML implementations |
| Dependencies | numpy (+ optional) | openai, umap, sklearn, datasets, etc. |
| Out-of-box | ✓ Works immediately | Requires configuration |
| Components | Toy implementations | Production implementations |
| Use case | Prototyping, learning, interfaces | Production ML pipelines |
Use together:
from ef import Project
from ef.plugins import imbed
project = Project.create('best_of_both')
imbed.register(project) # Add production power to clean interface
License
MIT
Contributing
Contributions welcome! Please:
- Write tests for new features
- Follow existing code style
- Update documentation
- Submit PRs to main branch
Links
- GitHub: https://github.com/thorwhalen/ef
- imbed: https://github.com/thorwhalen/imbed (production components)
- dol: https://github.com/i2mint/dol (storage layer)
- meshed: https://github.com/i2mint/meshed (DAG composition)