
Core model and inference for FluxFlow text-to-image generation


FluxFlow Core

Smaller, Faster, More Expressive: Text-to-Image Generation with Bezier Activation Functions

🚧 Project Status

Training In Progress: FluxFlow models are currently in Weeks 1-4 of systematic validation.

Status:

  • ✅ Architecture implemented and tested (including v0.8.0 pillar-attention)
  • 🔄 VAE training in progress (Bezier + ReLU baselines)
  • ⏳ Flow training pending VAE completion
  • ⏳ Empirical benchmarks pending training completion
  • 📅 Expected completion: Late February 2026

All performance claims below are theoretical targets; empirical validation is underway.


FluxFlow is a novel approach to text-to-image generation that targets 2-3× smaller models with equivalent or superior quality compared to standard architectures. The key innovation is the use of Cubic Bezier activation functions, which provide 3rd-degree polynomial expressiveness, enabling each neuron to learn complex, smooth non-linear transformations.

Core Philosophy

Inspired by Kolmogorov-Arnold Networks (KAN), FluxFlow extends the concept of learnable activation functions to large-scale generative models. While KAN uses B-splines, FluxFlow employs Cubic Bezier curves with three distinct control point generation strategies.

Bezier Activations: Three Approaches

FluxFlow employs three Bezier activation strategies, each suited for different architectural needs:

1. Input-Based (BezierActivation) - Most Common

Control points derived directly from input channels via 5× channel expansion pattern.

  • Implementation: Previous layer outputs 5× channels, BezierActivation reduces to 1×
  • Parameters: 0 learnable parameters in activation (cost shifted to previous layer)
  • Usage: VAE encoder/decoder, convolutional layers
  • Pattern: Conv2d(C, 5C) → BezierActivation() → C outputs

2. Trainable (TrainableBezier) - Specialized Layers

Learnable control points for per-channel transformations.

  • Implementation: 4 learnable parameters per output dimension
  • Parameters: 4×D learnable parameters (minimal overhead)
  • Usage: VAE latent bottleneck (mu/logvar), RGB output layer
  • Pattern: Linear(C, C) → TrainableBezier(C) → C outputs

3. Pillar-Based - Transformer MLPs

Control points generated by 4 independent depth-3 MLP networks for maximum expressiveness.

Unified Formula (All Approaches):

B(t) = (1-t)³·p₀ + 3(1-t)²·t·p₁ + 3(1-t)·t²·p₂ + t³·p₃

What differs: how (t, p₀, p₁, p₂, p₃) are obtained (from input, learned, or computed by MLPs).

Smoothness: C² continuous (continuous up to second derivative), providing smooth gradients unlike ReLU's discontinuous derivative.
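As a sanity check, the formula can be evaluated directly in a few lines of Python (a standalone sketch, independent of the fluxflow package):

```python
def cubic_bezier(t, p0, p1, p2, p3):
    """Evaluate B(t) exactly as written above."""
    u = 1.0 - t
    return u**3 * p0 + 3 * u**2 * t * p1 + 3 * u * t**2 * p2 + t**3 * p3

# The curve interpolates its outer control points: B(0) = p0 and B(1) = p3
print(cubic_bezier(0.0, -1.0, 0.5, 2.0, 3.0))  # -1.0
print(cubic_bezier(1.0, -1.0, 0.5, 2.0, 3.0))  # 3.0
```

The inner control points p₁ and p₂ shape the curve between those endpoints, which is what makes the learned transformations expressive.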

Expected Benefits (empirical validation in progress):

  • Smaller models: 2-2.5× fewer parameters target for equivalent quality
  • Faster inference: 38% speedup target through layer reduction
  • Better gradients: Smooth C² continuous gradients reduce vanishing gradient issues
  • Adaptive: Each approach provides different expressiveness-cost trade-offs

Installation

Production Install

pip install fluxflow

What gets installed:

  • fluxflow - Core model architectures and inference pipeline
  • Flow matching models, VAE, and text encoders
  • Note: Does NOT include training tools (use fluxflow-training for that)
  • Note: Does NOT include UI (use fluxflow-ui or fluxflow-comfyui for that)

Package available on PyPI: fluxflow v0.8.0

Development Install

git clone https://github.com/danny-mio/fluxflow-core.git
cd fluxflow-core
pip install -e ".[dev]"

System Requirements

Minimum Requirements

  • Python: 3.10 or later
  • CPU: Modern x86_64 processor
  • RAM: 16 GB minimum, 32 GB recommended
  • Storage: 10 GB for package and dependencies

GPU Requirements (Optional but Recommended)

For Training

  • GPU: NVIDIA GPU with CUDA support
  • VRAM: 24 GB minimum (NVIDIA RTX 3090, A5000, or better)
  • CUDA: 11.8 or later
  • cuDNN: 8.6 or later
  • Recommended: NVIDIA A6000 (48GB) or A100 (40GB/80GB)

For Inference

  • GPU: NVIDIA GPU with CUDA support
  • VRAM: 8 GB minimum, 12 GB recommended
  • CUDA: 11.8 or later
  • Recommended: NVIDIA RTX 3060 (12GB) or better

CPU-Only Mode

  • Supported for inference (slower)
  • Requires 32 GB RAM
  • Not recommended for training (very slow)

Apple Silicon (MPS)

  • Supported on M1/M2/M3 with macOS 12.3+
  • Good performance for inference
  • Training supported but slower than CUDA

Dependency Notes

  • numpy: Version 2.x not yet supported (use numpy<2.0)
  • torch: CUDA 11.8 or 12.1 builds recommended
  • transformers: 4.30.0+ required for text encoding

Key Features

  • Bezier Activations: Learnable 3rd-degree (cubic) polynomial activation functions
  • Compact VAE: Variational autoencoder with 25M params (encoder) + 30M params (decoder)
  • Flow-based Diffusion: 150M param transformer with rotary embeddings
  • Text Conditioning: DistilBERT-based encoder (~71M params total: ~66M backbone + Bezier projection layers)
    • Note: Current implementation uses pre-trained DistilBERT as a temporary solution. Future versions will feature a custom Bezier-based text encoder for full end-to-end training and multimodal support.
  • Adaptive Architecture: Different activation strategies per component (Bezier for generative, LeakyReLU for discriminative)

Quick Start

High-Level API (Recommended)

from fluxflow.models import FluxFlowPipeline

# Load from checkpoint directory (standard training output)
pipeline = FluxFlowPipeline.from_pretrained("path/to/checkpoint_dir/")

# Or load from a single checkpoint file
# pipeline = FluxFlowPipeline.from_pretrained("path/to/checkpoint.safetensors")

# Generate image with Diffusers-style API
image = pipeline(
    prompt="a beautiful sunset over mountains",
    num_inference_steps=50,
    guidance_scale=7.5,
    height=512,
    width=512,
).images[0]

image.save("output.png")

Advanced Usage

from fluxflow.models import FluxFlowPipeline
import torch

# Load with specific settings
pipeline = FluxFlowPipeline.from_pretrained(
    "path/to/checkpoint.safetensors",
    torch_dtype=torch.float16,
    device="cuda",
)

# Generate with more control
result = pipeline(
    prompt="a serene mountain landscape at dawn",
    negative_prompt="blurry, low quality",
    num_inference_steps=50,
    guidance_scale=7.5,
    height=768,
    width=768,
    num_images_per_prompt=4,
    generator=torch.Generator().manual_seed(42),
)

# Save all generated images
for i, img in enumerate(result.images):
    img.save(f"output_{i}.png")

Model Versions

Version  Description                                      Status
0.8.0    Pillar-attention (FiLM + cross-attn on pillars)  Current
0.7.0    Context-enhanced flow transformer                Stable
0.6.0    Default stable                                   Stable
0.3.0    Legacy                                           Legacy

  • Default model version: 0.6.0 (set by FluxFlowConfig.model.model_version)
  • Versioned checkpoints (v0.8.0 and later) require load_versioned_checkpoint(); set model_version when saving (see docs/MIGRATION.md)

Classifier-Free Guidance (CFG)

Available since v0.3.0: FluxFlow supports Classifier-Free Guidance for enhanced generation control.

What is CFG?

CFG improves generation quality by amplifying the influence of text conditioning. It works by:

  1. Running two forward passes: one with the text prompt, one without
  2. Extrapolating from the unconditional prediction toward the conditional one (past it when guidance_scale > 1)
  3. Producing images that follow the text prompt more strongly
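The combination step is the standard classifier-free guidance update used across diffusion pipelines; whether FluxFlow's scheduler applies exactly this linear form is an assumption, but it illustrates the mechanism:

```python
def cfg_combine(uncond, cond, guidance_scale):
    """Standard CFG update: move from the unconditional prediction toward
    the conditional one; scales > 1 extrapolate past it."""
    return uncond + guidance_scale * (cond - uncond)

print(cfg_combine(1.0, 3.0, 1.0))  # 3.0  (scale 1.0 recovers the conditional prediction)
print(cfg_combine(1.0, 3.0, 5.0))  # 11.0 (scale 5.0 pushes well past it)
```

In practice uncond and cond are the model's predictions (tensors), and the same elementwise formula applies.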

Using CFG

from fluxflow.models import FluxFlowPipeline

pipeline = FluxFlowPipeline.from_pretrained("path/to/checkpoint.safetensors")

# Generate with CFG (requires model trained with cfg_dropout_prob > 0)
image = pipeline(
    prompt="a photorealistic portrait of a cat",
    negative_prompt="blurry, distorted, low quality",  # Optional
    num_inference_steps=50,
    guidance_scale=5.0,  # Recommended: 3.0-7.0 for balanced results
    height=512,
    width=512,
).images[0]

Guidance Scale Guidelines

  • 1.0: No guidance (standard generation)
  • 3.0-7.0: Moderate guidance (RECOMMENDED - balanced quality/creativity)
  • 7.0-15.0: Strong guidance (may oversaturate or lose diversity)

Important: CFG requires models trained with cfg_dropout_prob > 0 (typically 0.10-0.15). See fluxflow-training for training details.

Low-Level API

For more control, use the base FluxPipeline:

import torch
from fluxflow.models import FluxPipeline, BertTextEncoder
from transformers import AutoTokenizer

# Load components manually
pipeline = FluxPipeline.from_pretrained("path/to/checkpoint.safetensors")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
text_encoder = BertTextEncoder(embed_dim=1024)  # Must match text_embedding_dim in training config (default: 1024)

# Encode text
text = "a beautiful sunset"
tokens = tokenizer(text, return_tensors="pt", padding="max_length", max_length=512)
text_embeddings = text_encoder(tokens["input_ids"])

# Manual forward pass (requires implementing sampling loop)
# See fluxflow-training for complete examples

Package Contents

  • fluxflow.models - Model architectures (VAE, Flow, Encoders, Discriminators)
    • activations - BezierActivation, TrainableBezier
    • vae - FluxCompressor (encoder) and FluxExpander (decoder)
    • flow - FluxFlowProcessor (diffusion transformer)
    • encoders - BertTextEncoder
    • discriminators - PatchDiscriminator (for GAN training)
    • conditioning - SPADE, FiLM, Gated conditioning modules
  • fluxflow.utils - Utilities for I/O, visualization, and logging
  • fluxflow.config - Configuration management
  • fluxflow.types - Type definitions and protocols
  • fluxflow.exceptions - Custom exception classes

Why Bezier Activations?

Mathematical Foundation

Traditional activations provide a single fixed transformation:

  • ReLU: max(0, x) - piecewise linear, zero gradient for all negative inputs
  • GELU/SiLU: Fixed smooth curves, no adaptability

Bezier activations provide a learnable manifold:

  • 4 control points per dimension (p₀, p₁, p₂, p₃)
  • Smooth interpolation via cubic Bezier curves
  • Adaptive transformations: Each dimension can follow a different cubic curve
  • TrainableBezier: Optional 4×D learnable parameters for per-dimension optimization

Performance Targets

⚠️ Training In Progress: The metrics below are theoretical targets based on architecture analysis and parameter counting. Empirical measurements will be added to this table upon training completion.

Metric                                 ReLU Baseline (Target)  Bezier FluxFlow (Target)  Expected Improvement
Parameters                             500M                    183M                      2.7× smaller
Inference time (A100, 512², 50 steps)  1.82s                   1.12s                     38% faster
Training memory (batch=2)              10.2GB                  4.1GB                     60% reduction
FID (COCO val)                         15.2±0.3                ≤15.0                     Equivalent quality

Status:

  • VAE training: 🔄 In progress
  • Flow training: ⏳ Pending VAE completion
  • Baseline comparison: ⏳ Pending both completions
  • Empirical results: 📊 Will be published to MODEL_ZOO.md

Strategic Activation Placement

FluxFlow uses different activations based on component purpose:

Bezier activations (high expressiveness needed):

  • VAE encoder/decoder: Complex image↔latent mappings
  • Flow transformer: Core generative model
  • Text encoder: Semantic embedding space

LeakyReLU (memory efficiency critical):

  • GAN discriminator: Binary classification, 2× forward passes per batch
  • Saves 126 MB per batch vs Bezier

ReLU (simple transformations):

  • SPADE normalization: Affine scale/shift operations

API Comparison

Feature    FluxFlowPipeline             FluxPipeline
Type       DiffusionPipeline            nn.Module
Input      Text prompts                 Pre-encoded embeddings
Inference  Full iterative denoising     Single forward pass
Guidance   Classifier-free (automatic)  Manual implementation
Scheduler  Built-in (DPMSolver++)       None
Output     PIL Images / numpy           Tensor
Use case   Production inference         Training / Custom pipelines

When to use which:

  • FluxFlowPipeline: Text-to-image generation, production use, Diffusers ecosystem
  • FluxPipeline: Training, fine-tuning, custom inference loops, research

Model Architecture Overview

Total Parameters: ~183M (default config: vae_dim=128, feat_dim=128)

Component           Parameters  Activation Type                Purpose
FluxCompressor      12.6M       BezierActivation               Image → latent encoding
FluxExpander        94.0M       BezierActivation               Latent → image decoding
FluxFlowProcessor   5.4M        BezierActivation               Diffusion transformer
BertTextEncoder     71.0M       BezierActivation (projection)  Text → embedding
PatchDiscriminator  45.1M       LeakyReLU                      GAN training only

Note: FluxExpander is asymmetrically larger due to progressive upsampling with SPADE conditioning layers.

Technical Details

Bezier Activation Types

1. Input-Based BezierActivation

Channel expansion pattern (5→1 dimension reduction):

# Previous layer outputs 5× channels
nn.Conv2d(in_ch, out_ch * 5, kernel_size=3, padding=1)
# BezierActivation splits into [t, p0, p1, p2, p3] and reduces to out_ch
BezierActivation(t_pre_activation="sigmoid", p_preactivation="silu")

Parameters: 0 learnable (but previous layer needs 5× weights)
Use: VAE encoder/decoder, convolutional layers
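The split-and-reduce step can be sketched functionally as follows. This is a minimal stand-in written for illustration, not the fluxflow implementation; the sigmoid/SiLU choices simply mirror the pre-activation defaults shown above:

```python
import torch
import torch.nn.functional as F

def input_based_bezier(x):
    """x has 5*C channels from the preceding conv; split into
    [t, p0, p1, p2, p3] and reduce to C channels via the cubic Bezier."""
    t, p0, p1, p2, p3 = x.chunk(5, dim=1)
    t = torch.sigmoid(t)  # t_pre_activation="sigmoid": bound t to [0, 1]
    p0, p1, p2, p3 = [F.silu(p) for p in (p0, p1, p2, p3)]  # p_preactivation="silu"
    u = 1.0 - t
    return u**3 * p0 + 3 * u**2 * t * p1 + 3 * u * t**2 * p2 + t**3 * p3

x = torch.randn(2, 5 * 16, 8, 8)  # e.g. the output of Conv2d(16, 80, 3, padding=1)
print(input_based_bezier(x).shape)  # torch.Size([2, 16, 8, 8])
```

Note how the activation itself has no parameters: all learnable capacity lives in the 5×-wide preceding layer.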

2. TrainableBezier

Fixed learnable control points (dimension-preserving):

# Standard dimension mapping
nn.Linear(latent_dim, latent_dim)
# Add 4×D learnable parameters
TrainableBezier((latent_dim,), channel_only=True)

Parameters: 4×D learnable (e.g., 1024 params for D=256)
Use: VAE latent bottleneck (mu/logvar), RGB output layer
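A minimal sketch of this layer type (an illustration, not the fluxflow class; deriving t from the input with a sigmoid is an assumption here):

```python
import torch
import torch.nn as nn

class TrainableBezierSketch(nn.Module):
    """4 learnable control points per channel (4*D parameters),
    applied elementwise along the last dimension."""
    def __init__(self, dim):
        super().__init__()
        self.p = nn.Parameter(torch.randn(4, dim) * 0.1)  # rows are p0..p3

    def forward(self, x):
        t = torch.sigmoid(x)  # assumed: curve parameter taken from the input
        p0, p1, p2, p3 = self.p
        u = 1.0 - t
        return u**3 * p0 + 3 * u**2 * t * p1 + 3 * u * t**2 * p2 + t**3 * p3

act = TrainableBezierSketch(256)
print(sum(p.numel() for p in act.parameters()))  # 1024, matching 4×D for D=256
```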

3. Pillar-Based

Context-dependent control points from deep MLPs:

# 4 separate depth-3 MLP networks generate the control points
p0 = pillarLayer(d_model, d_model, depth=3, activation=nn.SiLU())
p1 = pillarLayer(d_model, d_model, depth=3, activation=nn.SiLU())
p2 = pillarLayer(d_model, d_model, depth=3, activation=nn.SiLU())
p3 = pillarLayer(d_model, d_model, depth=3, activation=nn.SiLU())
bezier = BezierActivation(t_pre_activation="sigmoid", p_preactivation="silu")
# Gate the input to [0, 1] before pillar processing
g = torch.sigmoid(img_seq)
# Concatenate [t, p0, p1, p2, p3] along channels and apply the Bezier
output = bezier(torch.cat([img_seq, p0(g), p1(g), p2(g), p3(g)], dim=-1))

Parameters: 4×(depth=3)×D² (e.g., 198K params for D=128)
Use: Flow transformer MLP layers
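The quoted counts follow from simple arithmetic, assuming each depth-3 pillar MLP is three D→D linear layers with biases:

```python
def pillar_params(d_model, depth=3, n_pillars=4):
    """Weights plus biases for n_pillars independent D->D MLPs."""
    per_layer = d_model * d_model + d_model
    return n_pillars * depth * per_layer

print(pillar_params(128))  # 198144 (~198K, as quoted for D=128)
print(4 * 256)             # 1024 TrainableBezier parameters for D=256
```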

Pre-activation parameters (for Input-Based and Pillar-Based):

  • t_pre_activation: Transform input t (sigmoid, silu, tanh, or None)
  • p_preactivation: Transform control points (sigmoid, silu, tanh, or None)

Current FluxFlow Configuration

VAE Encoder/Decoder: Input-Based BezierActivation

  • Pattern: ConvTranspose2d(C, 5C) → BezierActivation() → Conv2d(C, 5C) → BezierActivation()
  • Rationale: 0 activation params, smooth gradients for image↔latent mapping

VAE Latent (mu/logvar): TrainableBezier

  • Pattern: Linear(D, D) → TrainableBezier(D)
  • Rationale: Per-channel learned curves for latent distribution (1024 params for D=256)

VAE RGB Output: TrainableBezier

  • Pattern: Conv2d(C, 3, ...) → TrainableBezier(3)
  • Rationale: Learned per-channel color correction (12 params)

Flow Transformer: Pillar-Based BezierActivation

  • Control point generation: 4 × pillarLayer(d_model, d_model, depth=3)
  • Gating: sigmoid(img_seq) bounds inputs to [0,1] before pillar processing
  • Final activation: BezierActivation(concat([img_seq, p0, p1, p2, p3]))
  • Rationale: Highly expressive context-dependent activations per token (~198K params per block for d_model=128)

Text Encoder: Input-Based BezierActivation

  • GELU alternative for BERT-like architectures
  • Learns optimal text→latent space mapping

Discriminator: LeakyReLU

  • Memory efficiency - called 2× per batch (generator+real)

SPADE Blocks: ReLU

  • Simple affine transformations don't benefit from Bezier complexity

Future Directions

Custom Text Encoder

The current implementation uses pre-trained DistilBERT as a practical starting point. Future development will create a custom text encoder built entirely with Bezier activations, enabling:

  • True end-to-end Bezier-based training
  • Better semantic alignment with the generative model
  • Reduced dependency on external pre-trained models
  • Foundation for multimodal extensions

Multimodal Extensions

With a custom Bezier text encoder, FluxFlow can be extended to:

  • Text + Image → Image: Conditioning on reference images
  • Video generation: Temporal consistency via Bezier transformations
  • 3D synthesis: Extending the architecture to volumetric data

Performance Optimizations

  • JIT compilation: Already implemented (10-20% speedup available)
  • Mixed precision: fp16/bf16 training and inference
  • Quantization: 8-bit/4-bit inference for edge devices
  • Knowledge distillation: Bezier→fixed activation distillation for mobile deployment

Acknowledgments

FluxFlow was inspired by Kolmogorov-Arnold Networks (KAN) [Liu et al., 2024], extending learnable activation functions to generative models with dynamic parameter generation.

For complete references, see REFERENCES.md.

Citation

If you use FluxFlow in your research, please cite:

@software{fluxflow2025,
  title = {FluxFlow: Efficient Text-to-Image Generation with Bezier Activation Functions},
  author = {FluxFlow Contributors},
  year = {2025},
  note = {Inspired by Kolmogorov-Arnold Networks (KAN)},
  url = {https://github.com/danny-mio/fluxflow-core}
}

Key References:

@article{liu2024kan,
  title={KAN: Kolmogorov-Arnold Networks},
  author={Liu, Ziming and Wang, Yixuan and Vaidya, Sachin and others},
  journal={arXiv preprint arXiv:2404.19756},
  year={2024}
}

License

MIT License - see LICENSE file for details.
