Embeddings Flow -- Tools for workflows involving semantic embeddings

ef (Embedding Flow)

Lightweight framework for embedding pipelines

ef is a simple, composable framework for building and running embedding pipelines. It provides:

  • ✅ Works out-of-the-box (zero configuration, built-in components)
  • ✅ Component registries as mapping stores (easy discovery)
  • ✅ Automatic pipeline composition (via DAG)
  • ✅ Flexible storage backends (memory, files, custom)
  • ✅ Plugin system (add production components when needed)

Installation

# Basic installation (works immediately with built-in components)
pip install ef

# With full functionality (dol, meshed, larder)
# Quote the extras so shells such as zsh do not treat [] as a glob pattern
pip install "ef[full]"

# With imbed integration (production components)
pip install "ef[imbed]"

Quick Start

from ef import Project

# Create project (works immediately!)
project = Project.create('my_project', backend='memory')

# Add data
project.add_source('doc1', 'First document about AI')
project.add_source('doc2', 'Second document about ML')

# List available components
print(project.list_components())
# {
#   'embedders': ['simple', 'char_counts'],
#   'planarizers': ['simple_2d', 'normalize_2d'],
#   'clusterers': ['simple_kmeans', 'threshold'],
#   'segmenters': ['identity', 'lines', 'sentences']
# }

# Create pipeline
project.create_pipeline(
    'analysis',
    embedder='simple',
    planarizer='simple_2d',
    clusterer='simple_kmeans',
    n_clusters=2
)

# Run pipeline (persists all results automatically)
results = project.run_pipeline('analysis')

# Access persisted data
print(f"Segments: {len(project.segments)}")
print(f"Embeddings: {len(project.embeddings)}")
print(f"Clusters: {dict(project.clusters)}")

# Get project summary
print(project.summary())

Core Concepts

1. Component Registries (Mapping Stores)

Components are stored in registries that behave like dictionaries:

# Access components like a dict
embedder = project.embedders['simple']
vectors = embedder({'text1': 'Sample text'})

# List all components
print(list(project.embedders.keys()))

# Get component metadata
meta = project.embedders.get_metadata('simple')
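
To make the "registry as a mapping" idea concrete, here is a minimal, self-contained sketch of how such a registry could be built on collections.abc.Mapping. It is not ef's actual implementation: the class name ToyRegistry, its internals, and the length_embedder component are assumptions made for illustration. Only the register(name, **metadata) decorator shape and get_metadata(name) mirror the calls shown in this README.

```python
from collections.abc import Mapping


class ToyRegistry(Mapping):
    """Hypothetical read-only mapping of name -> component, plus metadata."""

    def __init__(self):
        self._components = {}
        self._metadata = {}

    def register(self, name, **metadata):
        """Return a decorator that stores the decorated function under `name`."""
        def decorator(func):
            self._components[name] = func
            self._metadata[name] = metadata
            return func
        return decorator

    def get_metadata(self, name):
        """Return a copy of the metadata given at registration time."""
        return dict(self._metadata[name])

    # Implementing these three methods is enough for Mapping to supply
    # keys(), values(), items(), get() and the `in` operator.
    def __getitem__(self, name):
        return self._components[name]

    def __iter__(self):
        return iter(self._components)

    def __len__(self):
        return len(self._components)


embedders = ToyRegistry()


@embedders.register('length', dimension=1)
def length_embedder(segments):
    """Toy embedder: a one-dimensional vector holding the text length."""
    return {key: [float(len(text))] for key, text in segments.items()}


vectors = embedders['length']({'text1': 'Sample text'})
print(list(embedders.keys()))           # names of registered components
print(embedders.get_metadata('length'))
print(vectors)
```

Because the registry is an ordinary mapping, discovery needs no special API: iterating, membership tests, and keys() all work as they do on a dict.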

2. Mall Pattern (Store of Stores)

Each project has a "mall" - separate stores for each data type:

# Access different stores
project.segments['doc1'] = 'text'
project.embeddings['doc1'] = [1.0, 2.0, 3.0]
project.clusters['doc1'] = 0

# All stores use MutableMapping interface
for key, value in project.embeddings.items():
    print(f"{key}: {value}")
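
The "store of stores" idea can be sketched in plain Python as a dict whose values are themselves mutable mappings. This is an illustrative assumption, not ef's code: make_toy_mall and describe are hypothetical helpers, and the store names are the three used in the snippet above.

```python
from collections.abc import MutableMapping


def make_toy_mall(store_factory=dict):
    """Hypothetical helper: build one store per data kind.

    `store_factory` can be any callable returning a MutableMapping,
    which is what lets a backend be swapped without touching callers.
    """
    kinds = ('segments', 'embeddings', 'clusters')
    return {kind: store_factory() for kind in kinds}


def describe(mall):
    """Count the items in each store using only the mapping interface."""
    return {kind: len(store) for kind, store in mall.items()}


mall = make_toy_mall()
mall['segments']['doc1'] = 'text'
mall['embeddings']['doc1'] = [1.0, 2.0, 3.0]
mall['clusters']['doc1'] = 0

# Every store satisfies the same interface, whatever its backend.
assert all(isinstance(store, MutableMapping) for store in mall.values())
print(describe(mall))
```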

3. Pipeline Assembly

Pipelines are assembled automatically from components:

# Create pipeline by naming components
project.create_pipeline(
    'my_pipeline',
    segmenter='lines',      # Optional: split text
    embedder='simple',      # Required: embed segments
    planarizer='simple_2d', # Optional: reduce dimensions
    clusterer='simple_kmeans',  # Optional: cluster
    n_clusters=5  # Pass parameters to components
)

# Run with automatic persistence
results = project.run_pipeline('my_pipeline')
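
As a rough mental model of what assembly does, the sketch below chains optional stages around a required embedder. It is a deliberately simplified linear composition, not the DAG-based mechanism ef uses, and every name in it (assemble, the toy components, the "<key>_<index>" segment keys, clustering on the raw embeddings) is an assumption made for illustration. Parameters are bound to a component with functools.partial, which is one plausible way to "pass parameters to components".

```python
from functools import partial


def toy_lines_segmenter(sources):
    """Split each source into lines, keyed as '<source key>_<line index>'."""
    return {
        f"{key}_{i}": line
        for key, text in sources.items()
        for i, line in enumerate(text.splitlines())
    }


def toy_embedder(segments):
    """Embed each segment as [character count, space count]."""
    return {k: [float(len(t)), float(t.count(' '))] for k, t in segments.items()}


def toy_threshold_clusterer(embeddings, threshold):
    """Assign cluster 1 when the first coordinate reaches the threshold, else 0."""
    return {k: int(v[0] >= threshold) for k, v in embeddings.items()}


def assemble(embedder, segmenter=None, planarizer=None, clusterer=None):
    """Return a function that runs the given stages in a fixed order."""
    def run(sources):
        results = {}
        # Without a segmenter, each source is treated as a single segment.
        results['segments'] = segmenter(sources) if segmenter else dict(sources)
        results['embeddings'] = embedder(results['segments'])
        if planarizer is not None:
            results['planar_embeddings'] = planarizer(results['embeddings'])
        if clusterer is not None:
            results['clusters'] = clusterer(results['embeddings'])
        return results
    return run


run = assemble(
    embedder=toy_embedder,
    segmenter=toy_lines_segmenter,
    clusterer=partial(toy_threshold_clusterer, threshold=10.0),
)
results = run({'doc1': 'short\na much longer line'})
print(results['clusters'])
```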

4. Flexible Storage

Choose storage backend based on needs:

# In-memory (fast, temporary)
project = Project.create('test', backend='memory')

# File-based (persistent)
project = Project.create('prod', backend='files', root_dir='/data')

# Custom (bring your own store)
from ef.storage import mk_project_mall
mall = mk_project_mall('custom', backend='files')
mall['embeddings'] = MyCustomStore()  # MyCustomStore is your own class; any MutableMapping works
project = Project('custom', mall=mall)
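
Since any MutableMapping can serve as a store, a custom backend only has to implement five methods. The following is a minimal sketch of one possibility, a store that keeps one JSON file per key in a directory. JsonDirStore is a hypothetical class written for this example and is not part of ef; it assumes keys are valid file names and values are JSON-serializable.

```python
import json
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


class JsonDirStore(MutableMapping):
    """Hypothetical store: one JSON file per key inside a directory."""

    def __init__(self, root_dir):
        self.root = Path(root_dir)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        # Assumes `key` is a plain file name (no path separators).
        return self.root / f"{key}.json"

    def __getitem__(self, key):
        path = self._path(key)
        if not path.exists():
            raise KeyError(key)
        return json.loads(path.read_text())

    def __setitem__(self, key, value):
        self._path(key).write_text(json.dumps(value))

    def __delitem__(self, key):
        path = self._path(key)
        if not path.exists():
            raise KeyError(key)
        path.unlink()

    def __iter__(self):
        return (p.stem for p in self.root.glob('*.json'))

    def __len__(self):
        return sum(1 for _ in self.root.glob('*.json'))


store = JsonDirStore(tempfile.mkdtemp())
store['doc1'] = [1.0, 2.0, 3.0]
print(store['doc1'], list(store), len(store))
```

An instance of such a class could then stand in for MyCustomStore() in the snippet above.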

Plugin System

Built-in Components (Always Available)

ef comes with simple implementations that work without dependencies:

# Automatically registered on import
from ef import Project
project = Project.create('test')

# Has built-in components:
# - Embedders: simple, char_counts
# - Planarizers: simple_2d, normalize_2d
# - Clusterers: simple_kmeans, threshold
# - Segmenters: identity, lines, sentences

Adding Production Components

Use plugins to add real ML implementations:

from ef import Project
from ef.plugins import imbed  # Requires: pip install ef[imbed]

project = Project.create('production')
imbed.register(project)

# Now has production components:
# - OpenAI embedders
# - UMAP planarization
# - scikit-learn clustering
# - And more...

print(list(project.embedders.keys()))
# ['simple', 'char_counts', 'openai-small', 'openai-large', ...]

Writing Your Own Plugin

# my_plugin.py
def register(project):
    """Add custom components to project."""
    
    @project.embedders.register('my_embedder', dimension=768)
    def my_embedder(segments):
        # Your implementation: compute() is a placeholder for your own
        # text -> vector function; it is not provided by ef
        return {key: compute(text) for key, text in segments.items()}

# Use it
from ef import Project
import my_plugin

project = Project.create('custom')
my_plugin.register(project)
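
The compute() call in the plugin above is left to the plugin author. As one hedged illustration of what could go there, the sketch below uses the hashing trick (a bag-of-words count vector indexed by a stable hash of each token) with only the standard library. The function body, the default dimension of 8, and the choice of SHA-256 are assumptions for this example; a real plugin would more likely call a model.

```python
import hashlib


def compute(text, dimension=8):
    """Hypothetical stand-in for a model: hashed bag-of-words counts."""
    vector = [0.0] * dimension
    for token in text.lower().split():
        # A cryptographic digest gives the same bucket on every run,
        # unlike the built-in hash(), which is salted per process for str.
        digest = hashlib.sha256(token.encode('utf-8')).digest()
        index = int.from_bytes(digest[:4], 'big') % dimension
        vector[index] += 1.0
    return vector


def my_embedder(segments):
    """Same shape as the plugin's embedder: mapping in, mapping out."""
    return {key: compute(text) for key, text in segments.items()}


vectors = my_embedder({'doc1': 'hello world hello'})
print(vectors['doc1'])
```

The important part is the contract rather than the arithmetic: the embedder takes a mapping of keys to text and returns a mapping of the same keys to fixed-length vectors.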

Advanced Usage

Multiple Projects

from ef import Projects

# Manage multiple projects
projects = Projects(root_dir='/data')

# Create projects
proj1 = projects.create_project('research', backend='files')
proj2 = projects.create_project('production', backend='files')

# Access later
existing = projects['research']

# List all
print(list(projects.keys()))

Quick Embed (No Pipeline)

# For one-off embeddings
embeddings = project.quick_embed(
    'Some text to embed',
    embedder='simple'
)

Custom Components

# Register your own component
@project.embedders.register('custom', dimension=512)
def custom_embedder(segments):
    return {k: my_model(v) for k, v in segments.items()}

# Use in pipeline
project.create_pipeline('custom_pipe', embedder='custom')

Architecture

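ef follows Option 1 from the design plan, in which a lightweight interface layer (ef) optionally imports the heavy implementations from imbed: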

┌─────────────────────────────────────┐
│  ef (lightweight interface layer)   │
│  - ComponentRegistry                │
│  - Project/Projects                 │
│  - Mall pattern                     │
│  - Pipeline assembly                │
│  - Built-in toy components          │
└──────────────┬──────────────────────┘
               │ imports (optional)
               ↓
┌─────────────────────────────────────┐
│  imbed (heavy implementation)       │
│  - Real embedders (OpenAI, etc.)    │
│  - Real planarizers (UMAP)          │
│  - Real clusterers (sklearn)        │
│  - Dataset classes                  │
│  - All utilities                    │
└─────────────────────────────────────┘

Design Principles

  1. Works immediately: Built-in components require no setup
  2. Mapping everywhere: All stores use MutableMapping interface
  3. Composable: Mix and match components easily
  4. Discoverable: Inspect what is available with .list_components() and .list_pipelines()
  5. Flexible: Swap storage backends without code changes
  6. Extensible: Plugin system for adding functionality
  7. Progressive enhancement: Start simple, add complexity as needed

Dependencies

Required (minimal):

  • Python 3.10+
  • numpy

Optional (recommended):

  • dol>=0.2.38 - Better storage abstraction
  • meshed>=0.1.20 - Automatic DAG composition
  • larder>=0.1.6 - Automatic caching

Plugin dependencies:

  • imbed>=0.1 - Production ML components

Install optional dependencies:

pip install "ef[full]"     # Install dol, meshed, larder
pip install "ef[imbed]"    # Install imbed + full dependencies

Development

# Clone repository
git clone https://github.com/thorwhalen/ef.git
cd ef

# Install in development mode
pip install -e .

# Run tests (if available)
pytest

Examples

See the imbed_refactored/ directory for detailed examples:

  • imbed_refactored.py - Core patterns and complete demo
  • advanced_example.py - Real ML integrations (OpenAI, UMAP, sklearn)
  • persistence_examples.py - Pipeline sharing and caching

Comparison with imbed

Feature       ef                                 imbed
Purpose       Lightweight interface framework    Production ML implementations
Dependencies  numpy (+ optional)                 openai, umap, sklearn, datasets, etc.
Out-of-box    ✓ Works immediately                Requires configuration
Components    Toy implementations                Production implementations
Use case      Prototyping, learning, interfaces  Production ML pipelines

Use together:

from ef import Project
from ef.plugins import imbed

project = Project.create('best_of_both')
imbed.register(project)  # Add production power to clean interface

License

MIT

Contributing

Contributions welcome! Please:

  1. Write tests for new features
  2. Follow existing code style
  3. Update documentation
  4. Submit PRs to main branch
