Embeddings Flow -- Tools for workflows involving semantic embeddings

ef (Embedding Flow)

Lightweight framework for embedding pipelines

ef is a simple, composable framework for building and running embedding pipelines. It provides:

  • ✅ Works out-of-the-box (zero configuration, built-in components)
  • ✅ Component registries as mapping stores (easy discovery)
  • ✅ Automatic pipeline composition (via DAG)
  • ✅ Flexible storage backends (memory, files, custom)
  • ✅ Plugin system (add production components when needed)

Installation

# Basic installation (works immediately with built-in components)
pip install ef

# With full functionality (dol, meshed, larder)
# The quotes stop shells such as zsh from treating the brackets as a glob pattern
pip install "ef[full]"

# With imbed integration (production components)
pip install "ef[imbed]"

Quick Start

from ef import Project

# Create project (works immediately!)
project = Project.create('my_project', backend='memory')

# Add data
project.add_source('doc1', 'First document about AI')
project.add_source('doc2', 'Second document about ML')

# List available components
print(project.list_components())
# {
#   'embedders': ['simple', 'char_counts'],
#   'planarizers': ['simple_2d', 'normalize_2d'],
#   'clusterers': ['simple_kmeans', 'threshold'],
#   'segmenters': ['identity', 'lines', 'sentences']
# }

# Create pipeline
project.create_pipeline(
    'analysis',
    embedder='simple',
    planarizer='simple_2d',
    clusterer='simple_kmeans',
    n_clusters=2
)

# Run pipeline (persists all results automatically)
results = project.run_pipeline('analysis')

# Access persisted data
print(f"Segments: {len(project.segments)}")
print(f"Embeddings: {len(project.embeddings)}")
print(f"Clusters: {dict(project.clusters)}")

# Get project summary
print(project.summary())

Core Concepts

1. Component Registries (Mapping Stores)

Components are stored in registries that behave like dictionaries:

# Access components like a dict
embedder = project.embedders['simple']
vectors = embedder({'text1': 'Sample text'})

# List all components
print(list(project.embedders.keys()))

# Get component metadata
meta = project.embedders.get_metadata('simple')
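
To make the "registry as a mapping" idea concrete, here is a minimal sketch of how such a registry could be built on `collections.abc.Mapping`. It is not ef's actual implementation: the class body, the `length` component, and the stored metadata are assumptions made for illustration, and only the `register` and `get_metadata` names are taken from the examples in this README.

```python
from collections.abc import Mapping


class ComponentRegistry(Mapping):
    """Illustrative registry: maps component names to callables, plus metadata."""

    def __init__(self):
        self._components = {}
        self._metadata = {}

    def register(self, name, **metadata):
        """Return a decorator that stores the decorated function under `name`."""
        def decorator(func):
            self._components[name] = func
            self._metadata[name] = metadata
            return func
        return decorator

    def get_metadata(self, name):
        """Return a copy of the keyword arguments given at registration time."""
        return dict(self._metadata[name])

    # The three methods below are all Mapping needs; keys(), items(),
    # `in`, and get() are derived from them automatically.
    def __getitem__(self, name):
        return self._components[name]

    def __iter__(self):
        return iter(self._components)

    def __len__(self):
        return len(self._components)


embedders = ComponentRegistry()


@embedders.register('length', dimension=1)
def length_embedder(segments):
    """Hypothetical toy embedder: a one-dimensional vector holding the text length."""
    return {key: [float(len(text))] for key, text in segments.items()}


vectors = embedders['length']({'text1': 'Sample text'})
print(vectors)
print(list(embedders.keys()), embedders.get_metadata('length'))
```

Because the registry is a read-only `Mapping`, components can only be added through `register`, which keeps each callable and its metadata in step.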

2. Mall Pattern (Store of Stores)

Each project has a "mall" - separate stores for each data type:

# Access different stores
project.segments['doc1'] = 'text'
project.embeddings['doc1'] = [1.0, 2.0, 3.0]
project.clusters['doc1'] = 0

# All stores use the MutableMapping interface
for key, value in project.embeddings.items():
    print(f"{key}: {value}")
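
The mall idea can be sketched with plain dictionaries: a mapping whose values are themselves mutable mappings, one per data type. The `make_mall` helper and its parameters below are hypothetical and only illustrate the pattern; ef's own mall construction may differ.

```python
from collections.abc import MutableMapping


def make_mall(store_names=('segments', 'embeddings', 'clusters'), store_factory=dict):
    """Build a store of stores: one fresh mutable mapping per data type."""
    return {name: store_factory() for name in store_names}


mall = make_mall()
mall['segments']['doc1'] = 'text'
mall['embeddings']['doc1'] = [1.0, 2.0, 3.0]
mall['clusters']['doc1'] = 0

# Every store honors the same interface, so code that reads or writes
# one store works unchanged on any other.
all_mutable = all(isinstance(store, MutableMapping) for store in mall.values())

# Collect everything known about one key across all stores.
record = {name: store.get('doc1') for name, store in mall.items()}
print(all_mutable, record)
```

Passing a different `store_factory` (any callable returning a `MutableMapping`) is what would let the same code run against memory, files, or a custom backend.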

3. Pipeline Assembly

Pipelines are assembled automatically from components:

# Create pipeline by naming components
project.create_pipeline(
    'my_pipeline',
    segmenter='lines',      # Optional: split text
    embedder='simple',      # Required: embed segments
    planarizer='simple_2d', # Optional: reduce dimensions
    clusterer='simple_kmeans',  # Optional: cluster
    n_clusters=5  # Pass parameters to components
)

# Run with automatic persistence
results = project.run_pipeline('my_pipeline')
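
As a rough illustration of DAG-based assembly, the sketch below orders steps by their declared inputs using the standard library's `graphlib.TopologicalSorter`. This is an assumption about the general technique, not ef's or meshed's code; the `run_pipeline` function, the step format, and both toy components are hypothetical.

```python
from graphlib import TopologicalSorter


def lines_segmenter(sources):
    """Split each source into non-empty lines keyed as '<source>:<index>'."""
    return {
        f"{key}:{i}": line
        for key, text in sources.items()
        for i, line in enumerate(ln for ln in text.splitlines() if ln.strip())
    }


def length_embedder(segments):
    """Toy embedder: a one-dimensional vector holding the segment length."""
    return {key: [float(len(text))] for key, text in segments.items()}


def run_pipeline(steps, inputs):
    """Run `steps`, a dict of output name -> (function, input name), in dependency order."""
    # Each output depends on exactly one input in this simplified sketch.
    graph = {output: {source} for output, (_, source) in steps.items()}
    results = dict(inputs)
    for name in TopologicalSorter(graph).static_order():
        if name in steps:  # names absent from `steps` are raw inputs
            func, source = steps[name]
            results[name] = func(results[source])
    return results


# Steps are listed out of order on purpose; the sorter fixes the order.
steps = {
    'embeddings': (length_embedder, 'segments'),
    'segments': (lines_segmenter, 'sources'),
}
results = run_pipeline(steps, {'sources': {'doc1': 'ab\ncde'}})
print(results['embeddings'])
```

An optional stage is handled by simply leaving it out of `steps`: nothing downstream is scheduled unless something declares it as an input.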

4. Flexible Storage

Choose storage backend based on needs:

# In-memory (fast, temporary)
project = Project.create('test', backend='memory')

# File-based (persistent)
project = Project.create('prod', backend='files', root_dir='/data')

# Custom (bring your own store)
from ef.storage import mk_project_mall
mall = mk_project_mall('custom', backend='files')
# MyCustomStore is a placeholder for your own class; any MutableMapping works
mall['embeddings'] = MyCustomStore()
project = Project('custom', mall=mall)
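
A custom store only has to implement the five `MutableMapping` methods. The sketch below is one hypothetical example, a store that keeps each value as a JSON file in a directory; the `JsonDirStore` name and its file layout are assumptions for illustration and are not part of ef.

```python
import json
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


class JsonDirStore(MutableMapping):
    """Illustrative store: each key becomes '<key>.json' under root_dir."""

    def __init__(self, root_dir):
        self.root = Path(root_dir)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        # Keys are assumed to be simple names without path separators.
        return self.root / f"{key}.json"

    def __getitem__(self, key):
        path = self._path(key)
        if not path.exists():
            raise KeyError(key)
        return json.loads(path.read_text())

    def __setitem__(self, key, value):
        self._path(key).write_text(json.dumps(value))

    def __delitem__(self, key):
        path = self._path(key)
        if not path.exists():
            raise KeyError(key)
        path.unlink()

    def __iter__(self):
        return (p.stem for p in sorted(self.root.glob('*.json')))

    def __len__(self):
        return sum(1 for _ in self.root.glob('*.json'))


with tempfile.TemporaryDirectory() as tmp:
    store = JsonDirStore(tmp)
    store['doc1'] = [1.0, 2.0, 3.0]

    # A second instance pointed at the same directory sees the same data.
    reopened = JsonDirStore(tmp)
    loaded = reopened['doc1']
    keys = list(reopened)

print(keys, loaded)
```

Since the class satisfies `MutableMapping`, an instance could be dropped in wherever the snippet above uses `MyCustomStore()`.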

Plugin System

Built-in Components (Always Available)

ef comes with simple implementations that work without dependencies:

# Automatically registered on import
from ef import Project
project = Project.create('test')

# Has built-in components:
# - Embedders: simple, char_counts
# - Planarizers: simple_2d, normalize_2d
# - Clusterers: simple_kmeans, threshold
# - Segmenters: identity, lines, sentences
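
To show what a dependency-free toy embedder can look like, here is a sketch that counts letter occurrences. It is written for illustration only; ef's own `char_counts` embedder may use a different alphabet, normalization, or output type, and the `char_count_embedder` name is hypothetical.

```python
import string

# One vector dimension per lowercase ASCII letter.
ALPHABET = string.ascii_lowercase


def char_count_embedder(segments):
    """Map each segment to a 26-dimensional vector of letter counts."""
    vectors = {}
    for key, text in segments.items():
        lowered = text.lower()
        vectors[key] = [float(lowered.count(ch)) for ch in ALPHABET]
    return vectors


vectors = char_count_embedder({'doc1': 'Abba'})
print(vectors['doc1'][:2])
```

Vectors like these carry no semantic meaning, but they are deterministic and have a fixed dimension, which is enough to exercise a pipeline end to end before swapping in a real model.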

Adding Production Components

Use plugins to add real ML implementations:

from ef import Project
from ef.plugins import imbed  # Requires: pip install ef[imbed]

project = Project.create('production')
imbed.register(project)

# Now has production components:
# - OpenAI embedders
# - UMAP planarization
# - scikit-learn clustering
# - And more...

print(list(project.embedders.keys()))
# ['simple', 'char_counts', 'openai-small', 'openai-large', ...]

Writing Your Own Plugin

# my_plugin.py
def register(project):
    """Add custom components to project."""
    
    @project.embedders.register('my_embedder', dimension=768)
    def my_embedder(segments):
        # Your implementation: `compute` is a placeholder for any function
        # that maps a text to a vector
        return {key: compute(text) for key, text in segments.items()}

# Use it
from ef import Project
import my_plugin

project = Project.create('custom')
my_plugin.register(project)
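
The plugin snippet leaves `compute` undefined. One dependency-free stand-in, offered purely as an assumption for experimentation, is a hashed bag-of-words vector: each token is hashed into one of a fixed number of buckets. The default dimension of 8 is chosen for readability here and would have to match whatever dimension is declared at registration.

```python
import hashlib


def compute(text, dimension=8):
    """Hashed bag-of-words: each token increments one of `dimension` buckets."""
    vector = [0.0] * dimension
    for token in text.lower().split():
        # A cryptographic hash keeps the bucket stable across runs,
        # unlike Python's built-in hash() for strings.
        digest = hashlib.sha256(token.encode('utf-8')).digest()
        bucket = int.from_bytes(digest[:4], 'big') % dimension
        vector[bucket] += 1.0
    return vector


def my_embedder(segments):
    """Same shape as the plugin embedder: a mapping of texts in, a mapping of vectors out."""
    return {key: compute(text) for key, text in segments.items()}


vectors = my_embedder({'a': 'hello world hello', 'b': ''})
print(vectors['a'])
```

Different tokens can collide in the same bucket, so this is only suitable for testing the plumbing of a plugin, not for real similarity search.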

Advanced Usage

Multiple Projects

from ef import Projects

# Manage multiple projects
projects = Projects(root_dir='/data')

# Create projects
proj1 = projects.create_project('research', backend='files')
proj2 = projects.create_project('production', backend='files')

# Access later
existing = projects['research']

# List all
print(list(projects.keys()))

Quick Embed (No Pipeline)

# For one-off embeddings
embeddings = project.quick_embed(
    'Some text to embed',
    embedder='simple'
)

Custom Components

# Register your own component
# (`my_model` is a placeholder for your own text-to-vector function)
@project.embedders.register('custom', dimension=512)
def custom_embedder(segments):
    return {k: my_model(v) for k, v in segments.items()}

# Use in pipeline
project.create_pipeline('custom_pipe', embedder='custom')

Architecture

ef follows Option 1 from the design plan, in which a lightweight interface layer optionally imports imbed for the heavy implementations:

┌─────────────────────────────────────┐
│  ef (lightweight interface layer)   │
│  - ComponentRegistry                │
│  - Project/Projects                 │
│  - Mall pattern                     │
│  - Pipeline assembly                │
│  - Built-in toy components          │
└──────────────┬──────────────────────┘
               │ imports (optional)
               ↓
┌─────────────────────────────────────┐
│  imbed (heavy implementation)       │
│  - Real embedders (OpenAI, etc.)    │
│  - Real planarizers (UMAP)          │
│  - Real clusterers (sklearn)        │
│  - Dataset classes                  │
│  - All utilities                    │
└─────────────────────────────────────┘

Design Principles

  1. Works immediately: Built-in components require no setup
  2. Mapping everywhere: All stores use the MutableMapping interface
  3. Composable: Mix and match components easily
  4. Discoverable: Inspect what is available with .list_components() and .list_pipelines()
  5. Flexible: Swap storage backends without code changes
  6. Extensible: Plugin system for adding functionality
  7. Progressive enhancement: Start simple, add complexity as needed

Dependencies

Required (minimal):

  • Python 3.10+
  • numpy

Optional (recommended):

  • dol>=0.2.38 - Better storage abstraction
  • meshed>=0.1.20 - Automatic DAG composition
  • larder>=0.1.6 - Automatic caching

Plugin dependencies:

  • imbed>=0.1 - Production ML components

Install optional dependencies:

pip install "ef[full]"     # Install dol, meshed, larder
pip install "ef[imbed]"    # Install imbed + full dependencies

Development

# Clone repository
git clone https://github.com/thorwhalen/ef.git
cd ef

# Install in development mode
pip install -e .

# Run tests (if available)
pytest

Examples

See the imbed_refactored/ directory for detailed examples:

  • imbed_refactored.py - Core patterns and complete demo
  • advanced_example.py - Real ML integrations (OpenAI, UMAP, sklearn)
  • persistence_examples.py - Pipeline sharing and caching

Comparison with imbed

| Feature      | ef                                | imbed                                 |
|--------------|-----------------------------------|---------------------------------------|
| Purpose      | Lightweight interface framework   | Production ML implementations         |
| Dependencies | numpy (+ optional)                | openai, umap, sklearn, datasets, etc. |
| Out-of-box   | ✓ Works immediately               | Requires configuration                |
| Components   | Toy implementations               | Production implementations            |
| Use case     | Prototyping, learning, interfaces | Production ML pipelines               |

Use together:

from ef import Project
from ef.plugins import imbed

project = Project.create('best_of_both')
imbed.register(project)  # Add production power to clean interface

License

MIT

Contributing

Contributions welcome! Please:

  1. Write tests for new features
  2. Follow existing code style
  3. Update documentation
  4. Submit PRs to main branch
