Flow Matching Mixture of Experts (FM-MoE)

These details have not been verified by PyPI

Project links

Project description

Flow Matching Mixture of Experts (FM-MoE)

A PyTorch implementation of Flow Matching Mixture of Experts, a novel neural network architecture that integrates continuous normalizing flows with sparse mixture of experts routing for enhanced representational capacity in deep learning models.

Abstract

Mixture of Experts (MoE) architectures have demonstrated remarkable success in scaling neural networks by activating only a subset of parameters per input. However, traditional MoE implementations rely on standard multi-layer perceptrons (MLPs) as expert networks, which may limit the expressiveness of learned transformations. This work introduces Flow Matching Mixture of Experts (FM-MoE), a framework that replaces conventional MLP experts with flow matching networks. Each expert learns a continuous transformation through an ordinary differential equation (ODE), enabling more expressive feature mappings while maintaining the computational efficiency of sparse expert selection.

Installation

pip install mixture-of-flows

Usage

import torch
from mixture_of_flows import FlowMatchingMoE, FlowMatchingMoEConfig

# Define configuration
config = FlowMatchingMoEConfig(
    input_dim=512,
    hidden_dim=2048,
    num_experts=8,
    num_selected=2,
    flow_steps=10,
)

# Create model
model = FlowMatchingMoE(config)

# Forward pass
x = torch.randn(4, 128, 512)  # (batch_size, seq_len, input_dim)
output, aux_loss = model(x)

print(f"Output shape: {output.shape}")  # (4, 128, 512)
print(f"Auxiliary loss: {aux_loss.item():.6f}")

Introduction

Mixture of Experts (MoE) models partition the input space among specialized sub-networks (experts), with a gating mechanism that routes each input to a subset of experts. This approach enables conditional computation, where only a fraction of model parameters are activated for any given input, allowing for significant scaling of model capacity without proportional increases in computational cost (Shazeer et al., 2017; Fedus et al., 2022).

Flow Matching is a simulation-free approach to training continuous normalizing flows (CNFs). Unlike traditional normalizing flows that require explicit invertibility constraints or simulation during training, flow matching learns a velocity field that transports samples from a source distribution to a target distribution along optimal transport paths (Lipman et al., 2023).

Motivation

Traditional MoE architectures employ feed-forward networks as experts, which apply fixed, discrete transformations to inputs. By replacing these with flow matching networks, FM-MoE enables each expert to learn a continuous, time-dependent transformation governed by:

dx/dt = v_theta(x, t), where t in [0, 1]

This formulation provides several theoretical and practical advantages:

Enhanced Expressiveness: Continuous transformations can represent more complex mappings than discrete layer compositions
Smooth Interpolation: The ODE formulation ensures smooth transitions in the learned representation space
Flexible Computation: Integration steps can be adjusted at inference time for accuracy-efficiency trade-offs

Architecture

                                    +------------------+
                                    |   Expert 0       |
                                    | (Flow Matching)  |
                                    +--------+---------+
                                             |
+----------+    +----------+    +------------------+    +----------+
|  Input   | -> |  Router  | -> |   Expert 1       | -> | Weighted |
|  Tokens  |    | (Top-k)  |    | (Flow Matching)  |    |   Sum    | -> Output
+----------+    +----------+    +--------+---------+    +----------+
                    |                    |
                    |           +------------------+
                    +---------> |   Expert N-1     |
                                | (Flow Matching)  |
                                +------------------+

Component Architecture

1. Router Network

The router computes expert assignment probabilities and selects the top-k experts for each token:

                    +-------------------------+
                    |     Linear(d -> E)      |
      Input  -----> |          |              | -----> Top-k Selection
     (B,L,d)        |     Softmax             |        Probabilities
                    +-------------------------+

2. Flow Matching Expert

Each expert implements a velocity field network integrated via Euler method:

    +-------+     +---------------+     +-------+
    | x(t)  | --> | Concat with   | --> |  MLP  | --> v(x,t)
    +-------+     | time embed    |     +-------+
                  +---------------+
         |                                   |
         |        x(t+dt) = x(t) + v*dt      |
         +-----------------------------------+
                  (Euler Integration)

3. Flow Matching Expert Internal Structure

+------------------------------------------------------------------+
|                    Flow Matching Expert                           |
+------------------------------------------------------------------+
|                                                                   |
|   Input (d)                                                       |
|      |                                                            |
|      v                                                            |
|   +------------------+    +----------------------+                |
|   | Sinusoidal Time  |    |                      |                |
|   | Embedding        |--->|  Concatenate         |                |
|   | (t -> d_time)    |    |  [x; time_embed]     |                |
|   +------------------+    +----------+-----------+                |
|                                      |                            |
|                                      v                            |
|                          +-----------------------+                |
|                          | Linear(d+d_time, h)   |                |
|                          +-----------------------+                |
|                                      |                            |
|                                      v                            |
|                          +-----------------------+                |
|                          | SiLU + LayerNorm      |                |
|                          +-----------------------+                |
|                                      |                            |
|                                      v                            |
|                          +-----------------------+                |
|                          | Linear(h, h)          |                |
|                          +-----------------------+                |
|                                      |                            |
|                                      v                            |
|                          +-----------------------+                |
|                          | SiLU + LayerNorm      |                |
|                          +-----------------------+                |
|                                      |                            |
|                                      v                            |
|                          +-----------------------+                |
|                          | Linear(h, d)          |                |
|                          | (zero-initialized)    |                |
|                          +-----------------------+                |
|                                      |                            |
|                                      v                            |
|                              Velocity v(x,t)                      |
|                                                                   |
+------------------------------------------------------------------+

Mathematical Formulation

Flow Matching Fundamentals

Flow matching learns a time-dependent velocity field v_theta(x, t) that defines an ordinary differential equation:

dx/dt = v_theta(x, t),    t in [0, 1]

The transformation from input x_0 to output x_1 is obtained by integrating this ODE:

x_1 = x_0 + integral from 0 to 1 of v_theta(x(t), t) dt

Numerical Integration

We employ the Euler method for numerical integration with N steps:

x_{n+1} = x_n + v_theta(x_n, t_n) * dt

where:
  - dt = 1/N (step size)
  - t_n = n * dt (time at step n)
  - n in {0, 1, ..., N-1}

Expert Routing

Given input x in R^d, the router computes expert probabilities:

p(e|x) = softmax(W_r * x)_e,    e in {1, ..., E}

The top-k experts are selected based on these probabilities:

S(x) = argmax_k { p(e|x) : e in {1, ..., E} }

Mixture Computation

The final output combines selected expert outputs:

y = sum over e in S(x) of [ p_norm(e|x) * f_e(x) ]

where:
  - f_e(x) is the flow transformation of expert e
  - p_norm(e|x) = p(e|x) / sum over e' in S(x) of p(e'|x)

Auxiliary Losses

Load Balancing Loss

Encourages uniform expert utilization:

L_balance = E * sum from e=1 to E of (p_bar_e - 1/E)^2

where p_bar_e = (1/BL) * sum over b,l of p(e|x_{b,l})

Router Z-Loss

Prevents large logit values for training stability:

L_z = (1/BL) * sum over b,l of (log(sum over e of exp(z_{b,l,e})))^2

Time Embedding

Sinusoidal positional encoding for continuous time:

PE(t, 2i) = sin(t / 10000^(2i/d))
PE(t, 2i+1) = cos(t / 10000^(2i/d))

More Examples

Integration with Transformer

import torch
import torch.nn as nn
from mixture_of_flows import FlowMatchingMoE, FlowMatchingMoEConfig

class TransformerBlockWithFMMoE(nn.Module):
    """Transformer block using FM-MoE instead of standard FFN."""
    
    def __init__(self, d_model: int, n_heads: int, num_experts: int = 8):
        super().__init__()
        
        # Multi-head attention
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        
        # FM-MoE replaces the feed-forward network
        self.moe = FlowMatchingMoE(FlowMatchingMoEConfig(
            input_dim=d_model,
            hidden_dim=d_model * 4,
            num_experts=num_experts,
            num_selected=2,
            flow_steps=10,
        ))
        self.norm2 = nn.LayerNorm(d_model)
        
    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # Self-attention with residual
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        
        # FM-MoE with residual
        moe_out, aux_loss = self.moe(x)
        x = self.norm2(x + moe_out)
        
        return x, aux_loss

Expert Usage Analysis

# Get expert usage statistics
stats = model.get_expert_usage_stats(x)

print(f"Expert probabilities: {stats['expert_probs']}")
print(f"Expert selections: {stats['expert_selections']}")
print(f"Balance score: {stats['balance_score']:.4f}")  # 1.0 = perfect balance

Visualization

from mixture_of_flows.main import visualize_flow_matching_moe

# Generate comprehensive visualization
visualize_flow_matching_moe(
    model=model,
    x=x,
    save_path="fm_moe_analysis.png",
    show_flow_trajectory=True,
)

API Reference

FlowMatchingMoEConfig

Configuration dataclass for FM-MoE hyperparameters.

Parameter	Type	Default	Description
`input_dim`	int	required	Dimensionality of input features
`hidden_dim`	int	required	Hidden dimension for flow transformations
`num_experts`	int	8	Number of flow matching experts
`num_selected`	int	2	Number of top-k experts per token
`flow_steps`	int	10	Euler integration steps
`time_embed_dim`	int	64	Time embedding dimension
`use_router_aux_loss`	bool	True	Enable auxiliary losses
`router_z_loss_coef`	float	0.001	Z-loss coefficient
`load_balance_loss_coef`	float	0.01	Load balancing coefficient
`expert_capacity_factor`	float	1.25	Expert capacity multiplier
`dropout`	float	0.0	Dropout probability

FlowMatchingMoE

Main module implementing the FM-MoE layer.

Methods

forward(x, return_aux_loss=True)

Input: x - Tensor of shape (batch_size, seq_len, input_dim)
Output: Tuple of (output, aux_loss) where output has same shape as input

get_expert_usage_stats(x)

Input: x - Tensor of shape (batch_size, seq_len, input_dim)
Output: Dictionary containing expert_probs, expert_selections, balance_score

FlowMatchingExpert

Individual flow matching expert network.

Methods

forward(x, t)

Computes velocity field at position x and time t

flow_transform(x, steps=10)

Applies complete flow transformation via Euler integration

Configuration

Recommended Configurations

Small Model (Research/Prototyping)

config = FlowMatchingMoEConfig(
    input_dim=256,
    hidden_dim=512,
    num_experts=4,
    num_selected=1,
    flow_steps=5,
)

Base Model

config = FlowMatchingMoEConfig(
    input_dim=512,
    hidden_dim=2048,
    num_experts=8,
    num_selected=2,
    flow_steps=10,
)

Large Model

config = FlowMatchingMoEConfig(
    input_dim=1024,
    hidden_dim=4096,
    num_experts=16,
    num_selected=2,
    flow_steps=15,
)

Hyperparameter Guidelines

Parameter	Low	Medium	High	Notes
`flow_steps`	5	10	20	More steps = higher accuracy, more compute
`num_experts`	4	8	16	Scale with model capacity
`num_selected`	1	2	4	Trade-off: efficiency vs. expressiveness
`load_balance_loss_coef`	0.001	0.01	0.1	Increase if experts collapse

Experimental Results

Computational Complexity

Component	Time Complexity	Space Complexity
Router	O(B * L * E)	O(E * d)
Per Expert	O(B * L * N * d * h)	O(d * h)
Total	O(B * L * k * N * d * h)	O(E * d * h)

Where: B = batch size, L = sequence length, E = num experts, k = num selected, N = flow steps, d = input dim, h = hidden dim

Memory Efficiency

Compared to dense models with equivalent expressiveness, FM-MoE achieves:

Active Parameters: k/E fraction of total parameters per forward pass
Memory Scaling: Sub-linear with number of experts due to sparse activation

Citation

If you use this implementation in your research, please cite:

@software{gomez2026fmmoe,
  author       = {Gomez, Kye},
  title        = {Flow Matching Mixture of Experts: Continuous Normalizing Flows for Sparse Expert Networks},
  year         = {2026},
  publisher    = {The Swarm Corporation},
  url          = {https://github.com/The-Swarm-Corporation/MoF}
}

Related Works

@article{lipman2023flow,
  title={Flow Matching for Generative Modeling},
  author={Lipman, Yaron and Chen, Ricky TQ and Ben-Hamu, Heli and Nickel, Maximilian and Le, Matt},
  journal={International Conference on Learning Representations},
  year={2023}
}

@article{fedus2022switch,
  title={Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity},
  author={Fedus, William and Zoph, Barret and Shazeer, Noam},
  journal={Journal of Machine Learning Research},
  volume={23},
  number={120},
  pages={1--39},
  year={2022}
}

@article{shazeer2017outrageously,
  title={Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer},
  author={Shazeer, Noam and Mirhoseini, Azalia and Maziarz, Krzysztof and Davis, Andy and Le, Quoc and Hinton, Geoffrey and Dean, Jeff},
  journal={International Conference on Learning Representations},
  year={2017}
}

@article{chen2018neural,
  title={Neural Ordinary Differential Equations},
  author={Chen, Ricky TQ and Rubanova, Yulia and Bettencourt, Jesse and Duvenaud, David K},
  journal={Advances in Neural Information Processing Systems},
  volume={31},
  year={2018}
}

License

This project is licensed under the MIT License. See LICENSE for details.

Acknowledgments

This implementation builds upon foundational work in mixture of experts architectures and continuous normalizing flows. We acknowledge the contributions of the broader research community in advancing these methodologies.

Author

Kye Gomez

The Swarm Corporation

GitHub

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.2.0

Feb 4, 2026

0.1.0

Feb 4, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mixture_of_flows-0.2.0.tar.gz (18.7 kB view details)

Uploaded Feb 4, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

mixture_of_flows-0.2.0-py3-none-any.whl (15.4 kB view details)

Uploaded Feb 4, 2026 Python 3

File details

Details for the file mixture_of_flows-0.2.0.tar.gz.

File metadata

Download URL: mixture_of_flows-0.2.0.tar.gz
Upload date: Feb 4, 2026
Size: 18.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.12.3 Darwin/24.5.0

File hashes

Hashes for mixture_of_flows-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`9c11fa20a3ce24206117d3f6a212706119c1ea37b6ba6d8aeb61ff2dc6a4de58`
MD5	`b0bb96d68fb7014b931864b479d09f2e`
BLAKE2b-256	`251e4e6824629d27223b59d36711dbb8499b920ea356ba4d6b9d4932cdeedea4`

See more details on using hashes here.

File details

Details for the file mixture_of_flows-0.2.0-py3-none-any.whl.

File metadata

Download URL: mixture_of_flows-0.2.0-py3-none-any.whl
Upload date: Feb 4, 2026
Size: 15.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.12.3 Darwin/24.5.0

File hashes

Hashes for mixture_of_flows-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7ef8fc2a94e1e1e9b409c9d1eaa1ebbd1ea71eaffaeb73599087f41bf41a2873`
MD5	`8bfc30aba33f4cd3f2a0ba26a876a5cb`
BLAKE2b-256	`dca5f92912ad3a06a5b28032a572869f28473562e24af8a49a56f9a7d61a7260`

See more details on using hashes here.

mixture-of-flows 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Flow Matching Mixture of Experts (FM-MoE)

Abstract

Installation

Usage

Introduction

Motivation

Architecture

Component Architecture

1. Router Network

2. Flow Matching Expert

3. Flow Matching Expert Internal Structure

Mathematical Formulation

Flow Matching Fundamentals

Numerical Integration

Expert Routing

Mixture Computation

Auxiliary Losses

Load Balancing Loss

Router Z-Loss

Time Embedding

More Examples

Integration with Transformer

Expert Usage Analysis

Visualization

API Reference

FlowMatchingMoEConfig

FlowMatchingMoE

Methods

FlowMatchingExpert

Methods

Configuration

Recommended Configurations

Small Model (Research/Prototyping)

Base Model

Large Model

Hyperparameter Guidelines

Experimental Results

Computational Complexity

Memory Efficiency

Citation

Related Works

License

Acknowledgments

Author

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes