Skip to main content

A high-performance Vision-Language-Action (VLA) model fine-tuning library optimized for Tesla T4 hardware.

Project description

FastVLA: High-Performance VLA Fine-Tuning for Everyone

GitHub Stars PyPI Version Hugging Face Colab Kaggle

Stop training VLAs on H100s. I just brought OpenVLA to the T4.

FastVLA is a high-performance library built to democratize Vision-Language-Action (VLA) models. By integrating Unsloth-inspired 4-bit kernels, Custom Triton Action Heads, and Memory-Efficient QLoRA, we enable fine-tuning 7B+ robotics policies on a single, free Tesla T4 (15GB).


Why FastVLA?

VLA models are usually gated behind $40k GPUs. OpenVLA (7B) in FP16 takes ~28GB VRAM—impossible for gradients even on a 3090. FastVLA reduces memory consumption by 70%.

  • 2x Faster Training: Specialized Triton kernels for vision-action fusion.
  • 70% VRAM Savings: Train OpenVLA-7B with only 6.3 GB of VRAM (leaving >8GB for activations/gradients).
  • Convergent Quality: 4-bit QLoRA verified to match FP16 convergence on real robotics datasets.
  • Edge-Optimized: Built for hobbyists, researchers, and robots running on NVIDIA Jetson / T4.

Benchmark: OpenVLA-7B on Tesla T4 (15GB)

We fine-tuned OpenVLA-7B on the standard lerobot/pusht_image dataset (Real-world block pushing).

Feature Standard HF LoRA¹ FastVLA (4-bit) Improvement
VRAM Usage ~15 GB (LoRA-only, no grad) 6.31 GB (Total Peak) 2.4x Less
Throughput 2.8s / step 1.42s / step 2.0x Faster
Model Size 14.6 GB (FP16) 4.3 GB (4-bit) 70% Savings
Status CUDA OOM for Training Steady Convergence Verified

¹ Standard HuggingFace LoRA results estimated; often impossible to run without 4-bit optimization on T4.


Case Study: The "Wall" vs The "Fast"

Before FastVLA, training VLAs on T4 was a nightmare of crashes and slow iterations. Below is a comparison against the original SmolVLA-Offline-Finetuning logs:

Metric Baseline (SmolVLA 1.7B) FastVLA (OpenVLA 7B) Difference
Step Latency 8.35s / step 1.42s / step 6x Faster
Model Scale 1.7 Billion Parameters 7.3 Billion Parameters 4.3x Larger
Stability Crashed (4/4 runs) 100% Stable (2000+ steps) Finalist

Bottom Line: FastVLA is 6x faster while training a 4x larger model on the exact same hardware. This is the power of custom Triton kernels and memory-mapped quantization.


FastVLA Architecture

FastVLA isn't just a wrapper; it's a systems-reengineering of the VLA pipeline.

graph LR
    IMG[Image Input] --> SIG[SigLIP Encoder]
    TXT[Query/Prompt] --> LLM[Llama-2-7B / SmolVLA-1.7B]
    SIG --> PROJ[Fusion Projector]
    PROJ --> LLM
    LLM --> TRITON[Fused Triton Action Head]
    TRITON --> ACT[Action Tensor]
    
    style TRITON fill:#f96,stroke:#333Category,stroke-width:4px
    style LLM fill:#dfd,stroke:#333

Performance Features

  • Triton Action Kernels: Fused Linear → ReLU → Linear → Tanh layers with gradient checkpointing.
  • Auto-Quantization: One-click 4-bit / 8-bit loading with FastVLA.from_pretrained().
  • VLA-Specific Collators: Efficient image packing and action binning (256 bins) for robotics policies.
  • SmolVLA Support: Specifically optimized for the 1.7B "SmolVLA"—the perfect base for real-time edge robotics.

Quick Start

1. Install with uv (Recommended)

git clone https://github.com/BouajilaHamza/FastVLA.git
cd FastVLA
uv sync

2. Fine-Tune on PushT

uv run scripts/finetune_pusht.py --steps 2000 --batch 1 --lr 1e-4

3. Usage Example

from fastvla import FastVLAModel

# Load OpenVLA-7B in 4-bit with PEFT
model = FastVLAModel.from_pretrained(
    "openvla-7b",
    load_in_4bit=True,
    use_peft=True
)

# Predict next robot action
action = model.predict(image, "push the t-shaped block")

Objective Evaluation for ETH Zurich

FastVLA demonstrates a Systems Engineering mindset:

  1. Resource Optimization: Bringing massive models to constrained hardware.
  2. Custom Kernels: Proof of ability to write GPU-accelerated backends with Triton.
  3. Robotics Focus: Bridging the gap between SOTA AI and real-time control constraints.

Roadmap & Community

  • Unsloth v2 Integration: Direct patching for vision encoders.
  • Jetson Orin Support: Real-time inference kernels.
  • Multi-Camera Fusing: Optimized packing for 3+ camera setups.

Star the repo to support democratized robotics!


License

Apache-2.0. Created by the FastVLA Team. ics! ⭐


📜 License

Apache-2.0. Created by the FastVLA Team.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fastvla-0.1.1.tar.gz (55.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

fastvla-0.1.1-py3-none-any.whl (32.5 kB view details)

Uploaded Python 3

File details

Details for the file fastvla-0.1.1.tar.gz.

File metadata

  • Download URL: fastvla-0.1.1.tar.gz
  • Upload date:
  • Size: 55.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.17

File hashes

Hashes for fastvla-0.1.1.tar.gz
Algorithm Hash digest
SHA256 52d1eca3b66616f605bf40ce2cb2a66b97f775db41cd1a8a059c66e503d98798
MD5 a0099e7c78257905bb9b51b7d1dd445c
BLAKE2b-256 b1cdb5fdd618450a813ab883c6d4c26509806cd1311b2ee2cf072637ce72bc3f

See more details on using hashes here.

File details

Details for the file fastvla-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: fastvla-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 32.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.17

File hashes

Hashes for fastvla-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 c53abc1a68e150c9e43255de41f4c73e9be4153d38bde0cdab50c336ebe6ac0b
MD5 ff915c71a39df60ba9d047d956aabba3
BLAKE2b-256 3b14a266130b80b3ccce21e9f88c74d6db2df79f66401a2142ec53333ae81a1a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page