Core model and inference for FluxFlow text-to-image generation
FluxFlow Core
Smaller, Faster, More Expressive: Text-to-Image Generation with Bezier Activation Functions
🚧 Project Status
Training In Progress: FluxFlow models are currently in Week 1-4 of systematic validation.
Status:
- ✅ Architecture implemented and tested (including v0.8.0 pillar-attention)
- 🔄 VAE training in progress (Bezier + ReLU baselines)
- ⏳ Flow training pending VAE completion
- ⏳ Empirical benchmarks pending training completion
- 📅 Expected completion: Late February 2026
All performance claims below are theoretical targets - empirical validation underway.
FluxFlow is a novel approach to text-to-image generation that targets 2-3× smaller models with equivalent or superior quality compared to standard architectures. The key innovation is the use of Cubic Bezier activation functions, which provide 3rd-degree polynomial expressiveness, enabling each neuron to learn complex, smooth non-linear transformations.
Core Philosophy
Inspired by Kolmogorov-Arnold Networks (KAN), FluxFlow extends the concept of learnable activation functions to large-scale generative models. While KAN uses B-splines, FluxFlow employs Cubic Bezier curves with three distinct control point generation strategies.
Bezier Activations: Three Approaches
FluxFlow employs three Bezier activation strategies, each suited for different architectural needs:
1. Input-Based (BezierActivation) - Most Common
Control points derived directly from input channels via 5× channel expansion pattern.
- Implementation: Previous layer outputs 5× channels, BezierActivation reduces to 1×
- Parameters: 0 learnable parameters in activation (cost shifted to previous layer)
- Usage: VAE encoder/decoder, convolutional layers
- Pattern:
Conv2d(C, 5C) → BezierActivation() → C outputs
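The mechanics of this split-and-reduce pattern can be sketched as follows. This is a minimal illustration, not the library's actual implementation; the `sigmoid`/`silu` pre-activations mirror the defaults described later in this README:

```python
import torch
import torch.nn.functional as F

def bezier_activation(x5: torch.Tensor) -> torch.Tensor:
    """Minimal sketch of the input-based pattern (not the library's code).

    The previous layer emits 5*C channels, interpreted as
    [t, p0, p1, p2, p3]; the activation returns C channels.
    """
    t, p0, p1, p2, p3 = x5.chunk(5, dim=1)
    t = torch.sigmoid(t)  # curve parameter bounded to [0, 1]
    p0, p1, p2, p3 = [F.silu(p) for p in (p0, p1, p2, p3)]
    u = 1.0 - t
    return u**3 * p0 + 3 * u**2 * t * p1 + 3 * u * t**2 * p2 + t**3 * p3

x = torch.randn(2, 5 * 8, 16, 16)  # (batch, 5*C, H, W)
assert bezier_activation(x).shape == (2, 8, 16, 16)  # reduced back to C channels
```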
2. Trainable (TrainableBezier) - Specialized Layers
Learnable control points for per-channel transformations.
- Implementation: 4 learnable parameters per output dimension
- Parameters: 4×D learnable parameters (minimal overhead)
- Usage: VAE latent bottleneck (mu/logvar), RGB output layer
- Pattern:
Linear(C, C) → TrainableBezier(C) → C outputs
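A hedged sketch of what such a layer could look like (hypothetical class name; initializing the control points at 0, 1/3, 2/3, 1 makes the curve start near the identity map on sigmoid outputs):

```python
import torch
from torch import nn

class TrainableBezierSketch(nn.Module):
    """Illustrative stand-in for TrainableBezier: 4 learnable control
    points per channel, i.e. 4*D parameters total."""
    def __init__(self, dim: int):
        super().__init__()
        init = torch.tensor([0.0, 1 / 3, 2 / 3, 1.0])  # identity-like curve
        self.ctrl = nn.Parameter(init.repeat(dim, 1))  # shape (D, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = torch.sigmoid(x)                       # map input into [0, 1]
        p0, p1, p2, p3 = self.ctrl.unbind(dim=-1)  # each (D,), broadcasts over batch
        u = 1.0 - t
        return u**3 * p0 + 3 * u**2 * t * p1 + 3 * u * t**2 * p2 + t**3 * p3

layer = TrainableBezierSketch(256)
assert sum(p.numel() for p in layer.parameters()) == 1024  # 4*D for D=256
```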
3. Pillar-Based - Transformer MLPs
Control points generated by 4 independent depth-3 MLP networks for maximum expressiveness.
- Parameters: 4×(depth=3)×D² (substantial overhead for expressiveness)
- Usage: Flow transformer MLP layers only
- See BEZIER_ACTIVATIONS.md#pillar-based-bezier for implementation details
Unified Formula (All Approaches):
B(t) = (1-t)³·p₀ + 3(1-t)²·t·p₁ + 3(1-t)·t²·p₂ + t³·p₃
What differs: how (t, p₀, p₁, p₂, p₃) are obtained (from input, learned, or computed by MLPs).
Smoothness: C² continuous (continuous up to second derivative), providing smooth gradients unlike ReLU's discontinuous derivative.
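Two useful properties follow directly from the formula and are easy to check numerically: the curve interpolates its endpoint control points (B(0) = p₀, B(1) = p₃), and evenly spaced control points recover the identity B(t) = t:

```python
def cubic_bezier(t: float, p0: float, p1: float, p2: float, p3: float) -> float:
    """B(t) = (1-t)^3*p0 + 3(1-t)^2*t*p1 + 3(1-t)*t^2*p2 + t^3*p3"""
    u = 1.0 - t
    return u**3 * p0 + 3 * u**2 * t * p1 + 3 * u * t**2 * p2 + t**3 * p3

# Endpoint interpolation: the curve starts at p0 and ends at p3
assert cubic_bezier(0.0, 2.0, -1.0, 5.0, 7.0) == 2.0
assert cubic_bezier(1.0, 2.0, -1.0, 5.0, 7.0) == 7.0

# Linear precision: evenly spaced control points give the identity map
assert abs(cubic_bezier(0.25, 0.0, 1 / 3, 2 / 3, 1.0) - 0.25) < 1e-12
```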
Expected Benefits (empirical validation in progress):
- Smaller models: 2-2.5× fewer parameters target for equivalent quality
- Faster inference: 38% speedup target through layer reduction
- Better gradients: Smooth C² continuous gradients reduce vanishing gradient issues
- Adaptive: Each approach provides different expressiveness-cost trade-offs
Installation
Production Install
pip install fluxflow
What gets installed:
- `fluxflow` - Core model architectures and inference pipeline: flow matching models, VAE, and text encoders
- Note: Does NOT include training tools (use `fluxflow-training` for that)
- Note: Does NOT include UI (use `fluxflow-ui` or `fluxflow-comfyui` for that)
Package available on PyPI: fluxflow v0.8.1
Development Install
git clone https://github.com/danny-mio/fluxflow-core.git
cd fluxflow-core
pip install -e ".[dev]"
System Requirements
Minimum Requirements
- Python: 3.10 or later
- CPU: Modern x86_64 processor
- RAM: 16 GB minimum, 32 GB recommended
- Storage: 10 GB for package and dependencies
GPU Requirements (Optional but Recommended)
For Training
- GPU: NVIDIA GPU with CUDA support
- VRAM: 24 GB minimum (NVIDIA RTX 3090, A5000, or better)
- CUDA: 11.8 or later
- cuDNN: 8.6 or later
- Recommended: NVIDIA A6000 (48GB) or A100 (40GB/80GB)
For Inference
- GPU: NVIDIA GPU with CUDA support
- VRAM: 8 GB minimum, 12 GB recommended
- CUDA: 11.8 or later
- Recommended: NVIDIA RTX 3060 (12GB) or better
CPU-Only Mode
- Supported for inference (slower)
- Requires 32 GB RAM
- Not recommended for training (very slow)
Apple Silicon (MPS)
- Supported on M1/M2/M3 with macOS 12.3+
- Good performance for inference
- Training supported but slower than CUDA
Dependency Notes
- numpy: Version 2.x not yet supported (use numpy<2.0)
- torch: CUDA 11.8 or 12.1 builds recommended
- transformers: 4.30.0+ required for text encoding
Key Features
- Bezier Activations: Learnable 3rd-degree (cubic) polynomial activation functions
- Compact VAE: Variational autoencoder with 25M params (encoder) + 30M params (decoder)
- Flow-based Diffusion: 150M param transformer with rotary embeddings
- Text Conditioning: DistilBERT-based encoder (~71M params total: ~66M backbone + Bezier projection layers)
- Note: Current implementation uses pre-trained DistilBERT as a temporary solution. Future versions will feature a custom Bezier-based text encoder for full end-to-end training and multimodal support.
- Adaptive Architecture: Different activation strategies per component (Bezier for generative, LeakyReLU for discriminative)
Quick Start
High-Level API (Recommended)
from fluxflow.models import FluxFlowPipeline
# Load from checkpoint directory (standard training output)
pipeline = FluxFlowPipeline.from_pretrained("path/to/checkpoint_dir/")
# Or load from a single checkpoint file
# pipeline = FluxFlowPipeline.from_pretrained("path/to/checkpoint.safetensors")
# Generate image with Diffusers-style API
image = pipeline(
    prompt="a beautiful sunset over mountains",
    num_inference_steps=50,
    guidance_scale=7.5,
    height=512,
    width=512,
).images[0]
image.save("output.png")
Advanced Usage
from fluxflow.models import FluxFlowPipeline
import torch
# Load with specific settings
pipeline = FluxFlowPipeline.from_pretrained(
    "path/to/checkpoint.safetensors",
    torch_dtype=torch.float16,
    device="cuda",
)
# Generate with more control
result = pipeline(
    prompt="a serene mountain landscape at dawn",
    negative_prompt="blurry, low quality",
    num_inference_steps=50,
    guidance_scale=7.5,
    height=768,
    width=768,
    num_images_per_prompt=4,
    generator=torch.Generator().manual_seed(42),
)
# Save all generated images
for i, img in enumerate(result.images):
    img.save(f"output_{i}.png")
Model Versions
| Version | Description | Status |
|---|---|---|
| 0.8.0 | Pillar-attention (FiLM + cross-attn on pillars) | Current |
| 0.7.0 | Context-enhanced flow transformer | Stable |
| 0.6.0 | Default stable | Stable |
| 0.3.0 | Legacy | Legacy |
- Default model version: `0.6.0` (set by `FluxFlowConfig.model.model_version`)
- v0.8.0 checkpoints require `load_versioned_checkpoint()`; see docs/MIGRATION.md
- For versioned checkpoints, use `load_versioned_checkpoint()` and set `model_version` when saving
Classifier-Free Guidance (CFG)
Available since v0.3.0: FluxFlow supports Classifier-Free Guidance for enhanced generation control.
What is CFG?
CFG improves generation quality by amplifying the influence of text conditioning. It works by:
- Running two forward passes: one with text, one without
- Interpolating between conditional and unconditional predictions
- Producing images that more strongly follow the text prompt
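The three steps above reduce to a simple extrapolation from the unconditional toward the conditional prediction. This is the standard CFG combination used across diffusion pipelines; FluxFlow's exact implementation may differ in detail:

```python
import torch

def cfg_combine(uncond: torch.Tensor, cond: torch.Tensor, scale: float) -> torch.Tensor:
    """uncond + scale * (cond - uncond): scale=1.0 reduces to the plain
    conditional prediction, i.e. no guidance amplification."""
    return uncond + scale * (cond - uncond)

uncond = torch.zeros(1, 4)
cond = torch.ones(1, 4)
assert torch.equal(cfg_combine(uncond, cond, 1.0), cond)        # scale 1: conditional only
assert torch.equal(cfg_combine(uncond, cond, 7.5), 7.5 * cond)  # higher scales amplify
```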
Using CFG
from fluxflow.models import FluxFlowPipeline
pipeline = FluxFlowPipeline.from_pretrained("path/to/checkpoint.safetensors")
# Generate with CFG (requires model trained with cfg_dropout_prob > 0)
image = pipeline(
    prompt="a photorealistic portrait of a cat",
    negative_prompt="blurry, distorted, low quality",  # Optional
    num_inference_steps=50,
    guidance_scale=5.0,  # Recommended: 3.0-7.0 for balanced results
    height=512,
    width=512,
).images[0]
Guidance Scale Guidelines
- 1.0: No guidance (standard generation)
- 3.0-7.0: Moderate guidance (RECOMMENDED - balanced quality/creativity)
- 7.0-15.0: Strong guidance (may oversaturate or lose diversity)
Important: CFG requires models trained with cfg_dropout_prob > 0 (typically 0.10-0.15). See fluxflow-training for training details.
Low-Level API
For more control, use the base FluxPipeline:
import torch
from fluxflow.models import FluxPipeline, BertTextEncoder
from transformers import AutoTokenizer
# Load components manually
pipeline = FluxPipeline.from_pretrained("path/to/checkpoint.safetensors")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
text_encoder = BertTextEncoder(embed_dim=1024) # Must match text_embedding_dim in training config (default: 1024)
# Encode text
text = "a beautiful sunset"
tokens = tokenizer(text, return_tensors="pt", padding="max_length", max_length=512)
text_embeddings = text_encoder(tokens["input_ids"])
# Manual forward pass (requires implementing sampling loop)
# See fluxflow-training for complete examples
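For orientation, a manual sampling loop for a flow-matching model typically looks like the Euler integrator below. This is a hypothetical sketch: the real FluxFlowProcessor call signature and time convention may differ, so treat it as pseudocode-with-types rather than the supported API:

```python
import torch

def euler_flow_sample(model, text_emb, latent_shape, steps: int = 50):
    """Integrate dx/dt = v(x, t) from noise (t=0) toward data (t=1).

    Assumes model(x, t, text_emb) -> velocity; this interface is an
    assumption, not the actual FluxFlowProcessor signature.
    """
    x = torch.randn(latent_shape)              # start from Gaussian noise
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for i in range(steps):
        t = ts[i].expand(latent_shape[0])      # per-sample timestep
        v = model(x, t, text_emb)              # predicted velocity field
        x = x + (ts[i + 1] - ts[i]) * v        # Euler step
    return x

# Smoke test with a dummy "model" that predicts zero velocity
out = euler_flow_sample(lambda x, t, e: torch.zeros_like(x), None, (1, 4, 8, 8), steps=4)
assert out.shape == (1, 4, 8, 8)
```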
Package Contents
- `fluxflow.models` - Model architectures (VAE, Flow, Encoders, Discriminators)
  - `activations` - BezierActivation, TrainableBezier
  - `vae` - FluxCompressor (encoder) and FluxExpander (decoder)
  - `flow` - FluxFlowProcessor (diffusion transformer)
  - `encoders` - BertTextEncoder
  - `discriminators` - PatchDiscriminator (for GAN training)
  - `conditioning` - SPADE, FiLM, Gated conditioning modules
- `fluxflow.utils` - Utilities for I/O, visualization, and logging
- `fluxflow.config` - Configuration management
- `fluxflow.types` - Type definitions and protocols
- `fluxflow.exceptions` - Custom exception classes
Why Bezier Activations?
Mathematical Foundation
Traditional activations provide a single fixed transformation:
- ReLU: max(0, x) - piecewise linear, with zero gradient for all negative inputs (roughly half of activations receive no gradient signal)
- GELU/SiLU: Fixed smooth curves, no adaptability
Bezier activations provide a learnable manifold:
- 4 control points per dimension (p₀, p₁, p₂, p₃)
- Smooth interpolation via cubic Bezier curves
- Adaptive transformations: Each dimension can follow a different cubic curve
- TrainableBezier: Optional 4×D learnable parameters for per-dimension optimization
Performance Targets
⚠️ Training In Progress: The metrics below are theoretical targets based on architecture analysis and parameter counting. Empirical measurements will be added to this table upon training completion.
| Metric | ReLU Baseline (Target) | Bezier FluxFlow (Target) | Expected Improvement |
|---|---|---|---|
| Parameters | 500M | 183M | 2.7× smaller |
| Inference time (A100, 512², 50 steps) | 1.82s | 1.12s | 38% faster |
| Training memory (batch=2) | 10.2GB | 4.1GB | 60% reduction |
| FID (COCO val) | 15.2±0.3 | ≤15.0 | Equivalent quality |
Status:
- VAE training: 🔄 In progress
- Flow training: ⏳ Pending VAE completion
- Baseline comparison: ⏳ Pending both completions
- Empirical results: 📊 Will be published to MODEL_ZOO.md
Strategic Activation Placement
FluxFlow uses different activations based on component purpose:
Bezier activations (high expressiveness needed):
- VAE encoder/decoder: Complex image↔latent mappings
- Flow transformer: Core generative model
- Text encoder: Semantic embedding space
LeakyReLU (memory efficiency critical):
- GAN discriminator: Binary classification, 2× forward passes per batch
- Saves 126 MB per batch vs Bezier
ReLU (simple transformations):
- SPADE normalization: Affine scale/shift operations
API Comparison
| Feature | FluxFlowPipeline | FluxPipeline |
|---|---|---|
| Type | DiffusionPipeline | nn.Module |
| Input | Text prompts | Pre-encoded embeddings |
| Inference | Full iterative denoising | Single forward pass |
| Guidance | Classifier-free (automatic) | Manual implementation |
| Scheduler | Built-in (DPMSolver++) | None |
| Output | PIL Images / numpy | Tensor |
| Use case | Production inference | Training / Custom pipelines |
When to use which:
- FluxFlowPipeline: Text-to-image generation, production use, Diffusers ecosystem
- FluxPipeline: Training, fine-tuning, custom inference loops, research
Model Architecture Overview
Total Parameters: ~183M (default config: vae_dim=128, feat_dim=128)
| Component | Parameters | Activation Type | Purpose |
|---|---|---|---|
| FluxCompressor | 12.6M | BezierActivation | Image → latent encoding |
| FluxExpander | 94.0M | BezierActivation | Latent → image decoding |
| FluxFlowProcessor | 5.4M | BezierActivation | Diffusion transformer |
| BertTextEncoder | 71.0M | BezierActivation (projection) | Text → embedding |
| PatchDiscriminator | 45.1M | LeakyReLU | GAN training only |
Note: FluxExpander is asymmetrically larger due to progressive upsampling with SPADE conditioning layers.
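As a quick sanity check on the table, the inference-time components (everything except the training-only PatchDiscriminator) sum to the quoted ~183M total:

```python
# Parameter counts (in millions) from the component table above
components = {
    "FluxCompressor": 12.6,
    "FluxExpander": 94.0,
    "FluxFlowProcessor": 5.4,
    "BertTextEncoder": 71.0,
}
total_m = sum(components.values())
assert abs(total_m - 183.0) < 0.1  # matches "Total Parameters: ~183M"
```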
Technical Details
Bezier Activation Types
1. Input-Based BezierActivation
Channel expansion pattern (5C → C channel reduction):
# Previous layer outputs 5× channels
nn.Conv2d(in_ch, out_ch * 5, kernel_size=3, padding=1)
# BezierActivation splits into [t, p0, p1, p2, p3] and reduces to out_ch
BezierActivation(t_pre_activation="sigmoid", p_preactivation="silu")
Parameters: 0 learnable (but the previous layer needs 5× weights)
Use: VAE encoder/decoder, convolutional layers
2. TrainableBezier
Fixed learnable control points (dimension-preserving):
# Standard dimension mapping
nn.Linear(latent_dim, latent_dim)
# Add 4×D learnable parameters
TrainableBezier((latent_dim,), channel_only=True)
Parameters: 4×D learnable (e.g., 1024 params for D=256)
Use: VAE latent bottleneck (mu/logvar), RGB output layer
3. Pillar-Based
Context-dependent control points from deep MLPs:
# 4 separate depth-3 MLP networks
p0 = pillarLayer(d_model, d_model, depth=3, activation=nn.SiLU())
p1 = pillarLayer(d_model, d_model, depth=3, activation=nn.SiLU())
p2 = pillarLayer(d_model, d_model, depth=3, activation=nn.SiLU())
p3 = pillarLayer(d_model, d_model, depth=3, activation=nn.SiLU())
# Generate control points from gated input
g = torch.sigmoid(img_seq)
# Concatenate and apply Bezier
output = BezierActivation(torch.cat([img_seq, p0(g), p1(g), p2(g), p3(g)], dim=-1))
Parameters: 4×(depth=3)×D² (e.g., ~198K params for D=128)
Use: Flow transformer MLP layers
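The ~198K figure is consistent with counting weights and biases for 4 pillars of depth-3 D×D linear layers (an assumption about what the count includes):

```python
def pillar_params(d_model: int, depth: int = 3, pillars: int = 4) -> int:
    # Each linear layer contributes D*D weights + D biases
    return pillars * depth * (d_model * d_model + d_model)

assert pillar_params(128) == 198_144  # ~198K for d_model=128, as quoted above
```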
Pre-activation parameters (for Input-Based and Pillar-Based):
- `t_pre_activation`: Transform input t (sigmoid, silu, tanh, or None)
- `p_preactivation`: Transform control points (sigmoid, silu, tanh, or None)
Current FluxFlow Configuration
VAE Encoder/Decoder: Input-Based BezierActivation
- Pattern: `ConvTranspose2d(C, 5C) → BezierActivation()` / `Conv2d(C, 5C) → BezierActivation()`
- Rationale: 0 activation params, smooth gradients for image↔latent mapping
VAE Latent (mu/logvar): TrainableBezier
- Pattern: `Linear(D, D) → TrainableBezier(D)`
- Rationale: Per-channel learned curves for latent distribution (1024 params for D=256)
VAE RGB Output: TrainableBezier
- Pattern: `Conv2d(C, 3, ...) → TrainableBezier(3)`
- Rationale: Learned per-channel color correction (12 params)
Flow Transformer: Pillar-Based BezierActivation
- Control point generation: `4 × pillarLayer(d_model, d_model, depth=3)`
- Gating: `sigmoid(img_seq)` bounds inputs to [0, 1] before pillar processing
- Final activation: `BezierActivation(concat([img_seq, p0, p1, p2, p3]))`
- Rationale: Highly expressive context-dependent activations per token (~198K params per block for d_model=128)
Text Encoder: Input-Based BezierActivation
- GELU alternative for BERT-like architectures
- Learns optimal text→latent space mapping
Discriminator: LeakyReLU
- Memory efficiency - called 2× per batch (generator+real)
SPADE Blocks: ReLU
- Simple affine transformations don't benefit from Bezier complexity
Future Directions
Custom Text Encoder
The current implementation uses pre-trained DistilBERT as a practical starting point. Future development will create a custom text encoder built entirely with Bezier activations, enabling:
- True end-to-end Bezier-based training
- Better semantic alignment with the generative model
- Reduced dependency on external pre-trained models
- Foundation for multimodal extensions
Multimodal Extensions
With a custom Bezier text encoder, FluxFlow can be extended to:
- Text + Image → Image: Conditioning on reference images
- Video generation: Temporal consistency via Bezier transformations
- 3D synthesis: Extending the architecture to volumetric data
Performance Optimizations
- JIT compilation: Already implemented (10-20% speedup available)
- Mixed precision: fp16/bf16 training and inference
- Quantization: 8-bit/4-bit inference for edge devices
- Knowledge distillation: Bezier→fixed activation distillation for mobile deployment
Links
- GitHub Repository
- Architecture Details
- Bezier Activations Guide
- References & Acknowledgments
- Training Tools
- Web UI
- ComfyUI Plugin
Acknowledgments
FluxFlow was inspired by Kolmogorov-Arnold Networks (KAN) [Liu et al., 2024], extending learnable activation functions to generative models with dynamic parameter generation.
Special thanks to:
- COCO 2017 [cocodataset.org] & Open Images [Google] - Mixed captions used for testing and validation
- TTI-2M Dataset [HuggingFace] - 2M image-text pairs for large-scale training experiments
- SPADE [Park et al., 2019] - Spatial conditioning mechanism
- FiLM [Perez et al., 2018] - Feature-wise modulation
For complete references, see REFERENCES.md.
Citation
If you use FluxFlow in your research, please cite:
@software{fluxflow2025,
  title  = {FluxFlow: Efficient Text-to-Image Generation with Bezier Activation Functions},
  author = {FluxFlow Contributors},
  year   = {2025},
  note   = {Inspired by Kolmogorov-Arnold Networks (KAN)},
  url    = {https://github.com/danny-mio/fluxflow-core}
}
Key References:
@article{liu2024kan,
  title   = {KAN: Kolmogorov-Arnold Networks},
  author  = {Liu, Ziming and Wang, Yixuan and Vaidya, Sachin and others},
  journal = {arXiv preprint arXiv:2404.19756},
  year    = {2024}
}
License
MIT License - see LICENSE file for details.