Open-source G1 humanoid VLA with video foundation model backbone
The Idea
A child watches a ball roll off a table and reaches out to catch it. She doesn't look at a photograph — she sees the motion. The arc, the acceleration, the moment it leaves the edge. She predicts the future from the flow of time.
Every robot today is blind to this. State-of-the-art Vision-Language-Action models look at the world through frozen snapshots. They see where things are, but not where things are going. It's like trying to catch that ball with your eyes closed between blinks.
Neon's insight is one sentence:
Video foundation models already understand motion — we just connect them to robot bodies.
Models like Qwen2.5-Omni and Cosmos-Reason2 have watched millions of hours of video. They've learned that cups fall when pushed, that doors swing on hinges, that hands reach before they grasp. This temporal understanding — physics, dynamics, cause and effect — is exactly what a robot needs. It's sitting there, pre-trained, waiting.
So we do something radical in its simplicity. We take a 7-billion-parameter video model, freeze it entirely, and train a tiny action decoder on top — just 6 million parameters, 0.08% of the total — that translates the video model's rich temporal understanding into 29 joint commands for a humanoid body, 16 timesteps into the future.
The video model sees. The decoder acts.
pip install neon-vla
How It Works
graph LR
CAM["📹 Camera"] --> VB["Video Backbone<br/><b>7B frozen</b><br/>Qwen2.5-Omni / Cosmos"]
MIC["🎤 Voice"] --> VB
PROP["🦾 Joints"] --> PE["Proprio Encoder"]
LIDAR["📡 LiDAR"] --> LE["PointCloud Encoder"]
EEF["🤲 EEF State"] --> EE["EEF Encoder"]
VB --> FUS["Feature Fusion"]
PE --> FUS
LE --> FUS
EE --> FUS
FUS --> AH["Action Heads<br/><b>~6M trainable</b><br/>Parameter Golf v2"]
AH --> ACT["🤖 29 DoF × 16 steps"]
AH --> SPEECH["🔊 Speech Out"]
style VB fill:#0097a7,color:#fff,stroke:#0097a7
style AH fill:#e65100,color:#fff,stroke:#e65100
style FUS fill:#333,color:#fff
Full architecture diagram — all 6 input modalities
graph TD
subgraph "Inputs (6 modalities)"
CAM["📹 Camera Frames"]
VID["🎬 Video Frames"]
MIC["🎤 Audio (16kHz)"]
TXT["📝 Language"]
PROP["🦾 Joint States (29 DoF)"]
LID["📡 LiDAR Point Cloud (N×4)"]
EEF["🤲 EEF State (14 DoF)"]
end
subgraph "Neon VLA"
VB["Video Backbone<br/>Qwen2.5-Omni / Cosmos-Reason2<br/><i>frozen, 3-7B</i>"]
AE["Whisper Audio Encoder<br/><i>frozen, 39M</i>"]
PE["Proprio Encoder<br/><i>trainable MLP</i>"]
LE["PointCloud Encoder<br/><i>trainable PointNet-style</i>"]
EE["EEF Encoder<br/><i>trainable MLP</i>"]
FUS["Feature Fusion<br/>Linear + ReLU²"]
AH["Action Heads<br/>Parameter Golf v2<br/><i>trainable, ~6M</i>"]
SH["Speech Head<br/>PersonaPlex TTS"]
end
subgraph "Outputs"
ARM["Arms (14 DoF)"]
LOCO["Locomotion (vx, vy, ω)"]
HEAD["Head (2 DoF)"]
VOICE["🔊 Voice"]
end
CAM --> VB
VID --> VB
TXT --> VB
MIC --> AE
PROP --> PE
LID --> LE
EEF --> EE
VB --> FUS
AE --> FUS
PE --> FUS
LE --> FUS
EE --> FUS
FUS --> AH
FUS --> SH
AH --> ARM
AH --> LOCO
AH --> HEAD
SH --> VOICE
style VB fill:#0097a7,color:#fff
style AH fill:#e65100,color:#fff
style FUS fill:#333,color:#fff
Why video models, not image models?
| | Traditional VLAs | Neon |
|---|---|---|
| Vision | Single frame (photograph) | Temporal sequence (video) |
| Physics | None — must learn from scratch | Cosmos-Reason2 — pre-trained on physical world |
| Prediction | 1 action at a time | 16-step action chunking (anticipates the future) |
| Audio | Separate pipeline | Native — Qwen2.5-Omni hears and speaks |
| Spatial | No depth | LiDAR point clouds → PointNet encoder |
| Trainable params | Billions | ~6M (0.08% of total) |
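The last row is the whole recipe: freeze every backbone parameter and train only a small head on top. A minimal PyTorch sketch of that idea (illustrative only; the module names and feature dimension are assumptions, not the actual NeonVLA code):
import torch
import torch.nn as nn

class TinyActionDecoder(nn.Module):
    """Maps pooled backbone features to a chunk of future joint commands."""
    def __init__(self, feat_dim=3584, horizon=16, dof=29, hidden=512):
        super().__init__()
        self.horizon, self.dof = horizon, dof
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, horizon * dof),
        )

    def forward(self, features):                                    # features: (B, feat_dim)
        return self.net(features).view(-1, self.horizon, self.dof)  # (B, 16, 29)

# video_backbone stands in for a loaded Qwen2.5-Omni / Cosmos model
# for p in video_backbone.parameters():
#     p.requires_grad = False                                       # backbone stays frozen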
Quick Start
As a VLA model
from neon.model.neon_vla import NeonVLA, NeonConfig
model = NeonVLA(NeonConfig(control_mode="arms_only"))
model.load_backbone()
# Full omni-modal prediction
output = model.predict(
image=camera_frame, # 📹 what the robot sees
instruction="Pick up the red cup", # 📝 what you want
proprioception=joint_states, # 🦾 where the robot is
audio=voice_waveform, # 🎤 spoken command (16kHz)
lidar=point_cloud, # 📡 spatial awareness (N×4)
eef_state=ee_positions, # 🤲 hand positions (14-DOF)
speak=True, # 🔊 robot narrates its action
)
output.actions # → (16, 17) — 16 timesteps × 17 joints
output.upper_body # → (16, 14) — arm positions
output.locomotion # → (16, 3) — velocity commands (vx, vy, ω)
output.speech_path # → "/tmp/neon_speech_xyz.wav"
As a strands-robots policy (plug-and-play)
# Direct usage
from neon import NeonPolicy
policy = NeonPolicy(host="192.168.123.10", port=8300)
actions = policy.get_actions_sync(obs, "pick up the red cup")
# Via strands-robots (auto-discovered on install)
from strands_robots.policies import create_policy
policy = create_policy("neon", host="robot-ip", port=8300)
# Smart resolution from HuggingFace model ID
policy = create_policy("cagataydev/neon-g1-v1-dev")
Run the inference server
# On the robot (Jetson Orin / any CUDA machine)
neon-serve --model cagataydev/neon-g1-v1-dev --port 8300
# The server accepts ALL modalities via HTTP:
curl -X POST http://robot:8300/predict \
-H "Content-Type: application/json" \
-d '{"image_base64": "...", "instruction": "pick up the cup", "proprioception": [...]}'
# Check what modalities the model supports:
curl http://robot:8300/health
# → {"modalities": {"camera": true, "audio": true, "lidar": false, ...}}
The Action Decoder — Parameter Golf v2
Our decoder heads come from a competition to build the smallest working language model. Every trick matters when your entire trainable model fits in 25 megabytes:
| Technique | What | Why it matters |
|---|---|---|
| ReLU² | max(0, x)² | Smoother than GELU, cheaper than SiLU |
| RMSNorm | x / √(mean(x²)) | Half the cost of LayerNorm |
| Soft-Capping | c · tanh(x/c) | Never kills gradients at boundaries |
| Residual Scales | h + α·h_skip | Learned α — network decides backbone trust |
| U-Net Skip | Layer 0 → last layer | Gradient highway through deep decoders |
| β₁ = 0.85 | Lower Adam momentum | Faster adaptation to shifting distributions |
| Grad Clip 0.3 | Tight clipping | Prevents divergence in small heads |
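To make the table concrete, here is a rough sketch of how the first four tricks could compose inside one decoder block (illustrative PyTorch, not the actual action_heads.py; the hidden size and cap value are assumptions):
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # x / sqrt(mean(x^2)): no mean subtraction, so cheaper than LayerNorm
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

def relu2(x):
    return torch.relu(x) ** 2            # ReLU²: smooth near zero, cheap to compute

def soft_cap(x, cap=30.0):
    return cap * torch.tanh(x / cap)     # bounds activations without zeroing gradients

class GolfBlock(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.ff = nn.Linear(dim, dim)
        self.alpha = nn.Parameter(torch.ones(1))     # learned residual scale

    def forward(self, h, h_skip):
        h = h + self.alpha * h_skip                  # residual scale on the U-Net-style skip
        return soft_cap(self.ff(relu2(self.norm(h))))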
Data Soup — A Thousand Bodies, Unified
A robot built for a specific body usually needs data from that body. We break this with relative actions — displacements in the gripper's local frame. The same reaching motion produces the same numbers whether performed by a Franka, an SO-100, or our G1 humanoid.
# Same physical motion = same numbers, any robot
rel_xyz = prev_rotm.T @ (curr_xyz - prev_xyz) # Position delta
rel_rot = rotm2euler(prev_rotm.T @ curr_rotm) # Rotation delta
action = [rel_xyz, rel_rot, gripper_state] # Cross-embodiment!
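Executing such an action on a different arm is the inverse composition with that arm's current end-effector pose. A hedged sketch (the xyz Euler convention and function name are assumptions):
import numpy as np
from scipy.spatial.transform import Rotation as R

def apply_relative_action(curr_xyz, curr_rotm, rel_xyz, rel_euler):
    """Replay a frame-local displacement on any embodiment's current EEF pose."""
    next_xyz = curr_xyz + curr_rotm @ np.asarray(rel_xyz)               # local delta back to world
    next_rotm = curr_rotm @ R.from_euler("xyz", rel_euler).as_matrix()  # compose rotation delta
    return next_xyz, next_rotm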
Seven source types, one training stream:
graph LR
LR["🤖 LeRobot<br/>Bridge + DROID"] --> MIX
AG["🦾 Agibot-World<br/>Bimanual 1M+"] --> MIX
COS["🌌 Cosmos DreamGen<br/>Synthetic"] --> MIX
S4D["📸 Stereo4D<br/>Kitchen depth"] --> MIX
VC["🗣️ Voice Commands<br/>50K instructions"] --> MIX
TEL["🎮 G1 Teleoperation<br/>LiDAR + EEF + Audio"] --> MIX
DR["💭 GR00T-Dreams<br/>Humanoid demos"] --> MIX
MIX["Data Soup 🥣"] --> TRAIN["NeonTrainer<br/>All 6 modalities"]
style MIX fill:#e65100,color:#fff,stroke:#e65100
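How the seven sources get interleaved is an implementation detail of data_soup.py; the simplest mental model is weighted sampling per training step. A sketch of that idea only (the source interface here is hypothetical):
import random

def mix_step(sources, weights):
    """Pick a source in proportion to its weight, then draw one episode from it."""
    source = random.choices(sources, weights=weights, k=1)[0]
    return source.sample_episode()   # hypothetical: each source yields episodes in a shared schema

# e.g. weight real G1 teleoperation more heavily than synthetic Cosmos clips
# episode = mix_step([lerobot, agibot, cosmos, teleop], weights=[2, 2, 1, 4])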
G1 Humanoid — 29 Degrees of Freedom
graph TD
G1["Unitree G1<br/>29 DoF"] --> LA["Left Arm · 7"]
G1 --> RA["Right Arm · 7"]
G1 --> T["Torso · 1"]
G1 --> H["Head · 2"]
G1 --> LL["Left Leg · 6"]
G1 --> RL["Right Leg · 6"]
style G1 fill:#e65100,color:#fff,stroke:#e65100
style LA fill:#00695c,color:#fff
style RA fill:#00695c,color:#fff
style LL fill:#0097a7,color:#fff
style RL fill:#0097a7,color:#fff
| Mode | Joints | Use Case |
|---|---|---|
| arms_only | 14 arms + 3 loco = 17 | Tabletop manipulation |
| upper_body | + 3 head/torso = 20 | Manipulation + gaze tracking |
| whole_body | All 29 | Full locomotion + manipulation |
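The mode is chosen when the model is constructed. A small usage sketch (arms_only appears in Quick Start; the other two string values follow the table above and are assumed here):
from neon.model.neon_vla import NeonVLA, NeonConfig

# whole_body drives all 29 joints, so actions come back as (16, 29)
model = NeonVLA(NeonConfig(control_mode="whole_body"))
model.load_backbone()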
Training
Eight presets, from a single consumer GPU to a cloud A100:
| Config | Backbone | Mode | GPU | Notes |
|---|---|---|---|---|
| edge_3b | Qwen2.5-Omni-3B | arms | RTX 3090 / L4 | Edge deployment |
| default_arms_only | Qwen2.5-Omni-7B | arms | A100 40GB | Standard |
| default_wholebody | Qwen2.5-Omni-7B | whole | A100 80GB | Full body |
| cosmos_physics | Cosmos-Reason2-8B | arms | A100 40GB | Physics-heavy |
| large_arms | Qwen2.5-Omni-7B | arms | A100 40GB | ~44M heads (GR00T-scale) |
| large_cosmos | Cosmos-Reason2-8B | arms | A100 40GB | Physics + large heads |
| large_wholebody | Qwen2.5-Omni-7B | whole | A100 80GB | 29 DoF + large heads |
| g1_omnimodal | Qwen2.5-Omni-7B | whole | A100 40GB+ | All sensors: LiDAR + EEF + audio |
# Train on HuggingFace Jobs (recommended)
hf jobs uv run --flavor a100-large --secrets HF_TOKEN --timeout 8h -- \
python -m neon.training.train --backbone Qwen/Qwen2.5-Omni-7B --mode arms_only
# Or locally
from neon.training.config import default_arms_only_config
from neon.training.train import train
train(default_arms_only_config())
By The Numbers
| Metric | Value |
|---|---|
| Backbone | 3-7B params (frozen) |
| Decoder | ~6M params (trainable) — 0.08% |
| Action Space | 29 DoF, 16-step chunking |
| Input Modalities | 6 (camera, video, audio, LiDAR, EEF, proprioception) |
| Latency | 50ms on Jetson Orin |
| Data Sources | 7 types, cross-embodiment |
| Training Presets | 8 configs (edge to omni-modal) |
| Tests | 168 passing (CPU, no GPU needed) |
| License | MIT |
Project Structure
neon/
├── neon/
│ ├── model/
│ │ ├── neon_vla.py # Complete VLA pipeline + PointCloudEncoder + EEFEncoder
│ │ ├── action_heads.py # Parameter Golf v2 decoders (ReLU², soft-cap)
│ │ ├── video_backbone.py # Qwen / Cosmos adapter (3B-8B)
│ │ └── audio.py # Whisper encoder + PersonaPlex TTS
│ ├── data/
│ │ ├── action_space.py # G1 29-DoF joint definitions + normalization
│ │ ├── data_soup.py # 7-source data mixing (NeonEpisode w/ all modalities)
│ │ └── relative_actions.py # Cosmos-style relative EE actions
│ ├── training/
│ │ ├── config.py # TrainConfig + 8 presets (incl. g1_omnimodal)
│ │ └── train.py # NeonTrainer (omni-modal collation + loss)
│ ├── inference/
│ │ ├── server.py # HTTP inference server (all 6 modalities)
│ │ └── g1_controller.py # Unitree SDK interface
│ ├── streams/
│ │ ├── channels.py # Typed data channels (Camera, Joint, LiDAR, Audio, Text, ToolCall)
│ │ ├── recorder.py # StreamRecorder → LeRobot dataset
│ │ └── session.py # StreamSession — full robot loop
│ ├── dashboard/
│ │ └── bridge.py # WebSocket dashboard (camera, joints, LiDAR viz)
│ └── policy.py # NeonPolicy — strands-robots integration (HTTP/ZMQ)
├── tests/ # 168 tests (all CPU, no GPU needed)
├── paper/ # LaTeX white paper + soul manifesto
├── video/ # Remotion explainer video source
└── docs/ # MkDocs Material site
strands-robots Integration
Neon ships as a first-class strands-robots policy. On pip install neon-vla, it auto-registers and is immediately discoverable:
from strands_robots.policies import create_policy
# The NeonPolicy bridges VLA inference (5-10 Hz) to robot control (50 Hz)
# via RTC action queue with temporal blending
policy = create_policy("neon", host="192.168.123.10", port=8300)
# Full omni-modal observation
obs = {
"observation.images.front": camera_frame, # (H, W, 3) uint8
"observation.state": joint_positions, # (17,) float32
"observation.audio": voice_waveform, # (16000,) float32
"observation.lidar": point_cloud, # (4096, 4) float32
"observation.eef_state": ee_state, # (14,) float32
}
actions = policy.get_actions_sync(obs, "pick up the red cup")
Three blend schedules: linear, step, exponential. The policy auto-discovers server capabilities via /health.
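A linear blend between the tail of the previous chunk and the head of a fresh one might look roughly like this (a sketch of the idea, not the strands-robots RTC queue itself):
import numpy as np

def linear_blend(old_chunk, new_chunk, overlap=4):
    """Cross-fade the first `overlap` steps of a new action chunk into the old one."""
    w = np.linspace(0.0, 1.0, overlap)[:, None]    # 0 = keep old, 1 = trust new
    blended = new_chunk.copy()
    blended[:overlap] = (1 - w) * old_chunk[-overlap:] + w * new_chunk[:overlap]
    return blended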
Video
The explainer video is built with Remotion — code-as-video, version-controlled, reproducible:
cd video
npm install
npx remotion studio # Preview in browser
npx remotion render src/index.ts NeonExplainer out/neon-explainer.mp4
Papers
| Document | Pages | What |
|---|---|---|
| paper/neon.tex | 6 | Full technical report — math, proofs, pseudocode |
| paper/soul.tex | 1 | The Soul of Neon — "Teaching Robots to See Time" |
PDFs attached to every GitHub release.
Related Work
- GR00T N1 — Architecture reference for humanoid VLA
- GR00T-WholeBodyControl — RL whole-body policies (learnings adopted in Neon)
- Cosmos-Predict2.5 — Relative actions, world model reasoning
- OmniVLA — Omni-modal VLA reference
- MicroGPT Parameter Golf — Source of action head optimizations
- Strands Agents — Agent framework for robot integration
- strands-robots — Robot SDK (NeonPolicy integrates via entry-point)
Citation
@software{neon2026,
title = {Neon: Open-Source Vision-Language-Action Model for Humanoid Whole-Body Control},
author = {Cali, Cagatay},
year = {2026},
url = {https://github.com/cagataycali/neon},
license = {MIT}
}