
Open-source G1 humanoid VLA with video foundation model backbone


Neon — Teaching robots to see time

PyPI · License: MIT · Tests · Python 3.10+ · Docs


The Idea

A child watches a ball roll off a table and reaches out to catch it. She doesn't look at a photograph — she sees the motion. The arc, the acceleration, the moment it leaves the edge. She predicts the future from the flow of time.

Every robot today is blind to this. State-of-the-art Vision-Language-Action models look at the world through frozen snapshots. They see where things are, but not where things are going. It's like trying to catch that ball while glimpsing it only in split-second snapshots.

Neon's insight is one sentence:

Video foundation models already understand motion — we just connect them to robot bodies.

Models like Qwen2.5-Omni and Cosmos-Reason2 have watched millions of hours of video. They've learned that cups fall when pushed, that doors swing on hinges, that hands reach before they grasp. This temporal understanding — physics, dynamics, cause and effect — is exactly what a robot needs. It's sitting there, pre-trained, waiting.

So we do something radical in its simplicity. We take a 7-billion-parameter video model, freeze it entirely, and train a tiny action decoder on top — just 6 million parameters, 0.08% of the total — that translates the video model's rich temporal understanding into 29 joint commands for a humanoid body, 16 timesteps into the future.

The video model sees. The decoder acts.
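
In code, the recipe is just "freeze everything, bolt on a small head." A minimal PyTorch sketch, assuming a pooled backbone feature vector; the class, dimensions, and stand-in backbone below are illustrative, not Neon's actual API:

import torch
import torch.nn as nn

class TinyActionDecoder(nn.Module):
    """Maps pooled video features to a 16-step chunk of 29 joint commands."""
    def __init__(self, feat_dim=3584, horizon=16, dof=29, hidden=512):
        super().__init__()
        self.horizon, self.dof = horizon, dof
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, horizon * dof),
        )

    def forward(self, feats):                                     # feats: (B, feat_dim)
        return self.net(feats).view(-1, self.horizon, self.dof)  # (B, 16, 29)

backbone = nn.Linear(1024, 3584)      # stand-in for the frozen 7B video model
for p in backbone.parameters():
    p.requires_grad = False           # freeze: the backbone never receives gradients

decoder = TinyActionDecoder()
optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)  # only the tiny head trains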

pip install neon-vla

How It Works

graph LR
    CAM["📹 Camera"] --> VB["Video Backbone<br/><b>7B frozen</b><br/>Qwen2.5-Omni / Cosmos"]
    MIC["🎤 Voice"] --> VB
    PROP["🦾 Joints"] --> PE["Proprio Encoder"]
    LIDAR["📡 LiDAR"] --> LE["PointCloud Encoder"]
    EEF["🤲 EEF State"] --> EE["EEF Encoder"]
    VB --> FUS["Feature Fusion"]
    PE --> FUS
    LE --> FUS
    EE --> FUS
    FUS --> AH["Action Heads<br/><b>~6M trainable</b><br/>Parameter Golf v2"]
    AH --> ACT["🤖 29 DoF × 16 steps"]
    AH --> SPEECH["🔊 Speech Out"]
    
    style VB fill:#0097a7,color:#fff,stroke:#0097a7
    style AH fill:#e65100,color:#fff,stroke:#e65100
    style FUS fill:#333,color:#fff
Full architecture diagram — all 6 input modalities
graph TD
    subgraph "Inputs (6 modalities)"
        CAM["📹 Camera Frames"]
        VID["🎬 Video Frames"]
        MIC["🎤 Audio (16kHz)"]
        TXT["📝 Language"]
        PROP["🦾 Joint States (29 DoF)"]
        LID["📡 LiDAR Point Cloud (N×4)"]
        EEF["🤲 EEF State (14 DoF)"]
    end

    subgraph "Neon VLA"
        VB["Video Backbone<br/>Qwen2.5-Omni / Cosmos-Reason2<br/><i>frozen, 3-7B</i>"]
        AE["Whisper Audio Encoder<br/><i>frozen, 39M</i>"]
        PE["Proprio Encoder<br/><i>trainable MLP</i>"]
        LE["PointCloud Encoder<br/><i>trainable PointNet-style</i>"]
        EE["EEF Encoder<br/><i>trainable MLP</i>"]
        FUS["Feature Fusion<br/>Linear + ReLU²"]
        AH["Action Heads<br/>Parameter Golf v2<br/><i>trainable, ~6M</i>"]
        SH["Speech Head<br/>PersonaPlex TTS"]
    end

    subgraph "Outputs"
        ARM["Arms (14 DoF)"]
        LOCO["Locomotion (vx, vy, ω)"]
        HEAD["Head (2 DoF)"]
        VOICE["🔊 Voice"]
    end

    CAM --> VB
    VID --> VB
    TXT --> VB
    MIC --> AE
    PROP --> PE
    LID --> LE
    EEF --> EE
    VB --> FUS
    AE --> FUS
    PE --> FUS
    LE --> FUS
    EE --> FUS
    FUS --> AH
    FUS --> SH
    AH --> ARM
    AH --> LOCO
    AH --> HEAD
    SH --> VOICE

    style VB fill:#0097a7,color:#fff
    style AH fill:#e65100,color:#fff
    style FUS fill:#333,color:#fff

Why video models, not image models?

| | Traditional VLAs | Neon |
|---|---|---|
| Vision | Single frame (photograph) | Temporal sequence (video) |
| Physics | None; must learn from scratch | Cosmos-Reason2, pre-trained on the physical world |
| Prediction | 1 action at a time | 16-step action chunking (anticipates the future) |
| Audio | Separate pipeline | Native: Qwen2.5-Omni hears and speaks |
| Spatial | No depth | LiDAR point clouds → PointNet encoder |
| Trainable params | Billions | ~6M (0.08% of total) |

Quick Start

As a VLA model

from neon.model.neon_vla import NeonVLA, NeonConfig

model = NeonVLA(NeonConfig(control_mode="arms_only"))
model.load_backbone()

# Full omni-modal prediction
output = model.predict(
    image=camera_frame,                     # 📹 what the robot sees
    instruction="Pick up the red cup",      # 📝 what you want
    proprioception=joint_states,            # 🦾 where the robot is
    audio=voice_waveform,                   # 🎤 spoken command (16kHz)
    lidar=point_cloud,                      # 📡 spatial awareness (N×4)
    eef_state=ee_positions,                 # 🤲 hand positions (14-DOF)
    speak=True,                             # 🔊 robot narrates its action
)

output.actions      # → (16, 17): 16 timesteps × 17 action dims (14 arm joints + 3 locomotion)
output.upper_body   # → (16, 14): arm joint positions
output.locomotion   # → (16, 3):  velocity commands (vx, vy, ω)
output.speech_path  # → "/tmp/neon_speech_xyz.wav"

As a strands-robots policy (plug-and-play)

# Direct usage
from neon import NeonPolicy
policy = NeonPolicy(host="192.168.123.10", port=8300)
actions = policy.get_actions_sync(obs, "pick up the red cup")

# Via strands-robots (auto-discovered on install)
from strands_robots.policies import create_policy
policy = create_policy("neon", host="robot-ip", port=8300)

# Smart resolution from HuggingFace model ID
policy = create_policy("cagataydev/neon-g1-v1-dev")

Run the inference server

# On the robot (Jetson Orin / any CUDA machine)
neon-serve --model cagataydev/neon-g1-v1-dev --port 8300

# The server accepts ALL modalities via HTTP:
curl -X POST http://robot:8300/predict \
  -H "Content-Type: application/json" \
  -d '{"image_base64": "...", "instruction": "pick up the cup", "proprioception": [...]}'

# Check what modalities the model supports:
curl http://robot:8300/health
# → {"modalities": {"camera": true, "audio": true, "lidar": false, ...}}

The Action Decoder — Parameter Golf v2

Our decoder heads come from a competition to build the smallest working language model. Every trick matters when your entire trainable model fits in 25 megabytes:

| Technique | What | Why it matters |
|---|---|---|
| ReLU² | max(0, x)² | Smoother than GELU, cheaper than SiLU |
| RMSNorm | x / √(mean(x²)) | Half the cost of LayerNorm |
| Soft-Capping | c · tanh(x/c) | Never kills gradients at boundaries |
| Residual Scales | h + α·h_skip | Learned α: the network decides how much to trust the backbone |
| U-Net Skip | Layer 0 → last layer | Gradient highway through deep decoders |
| β₁ = 0.85 | Lower Adam momentum | Faster adaptation to shifting distributions |
| Grad Clip 0.3 | Tight clipping | Prevents divergence in small heads |
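
For reference, three of these tricks in plain PyTorch, written straight from the formulas above (a sketch, not Neon's actual modules):

import torch

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    """ReLU²: max(0, x)², smooth near zero and cheap to compute."""
    return torch.relu(x).pow(2)

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """RMSNorm: divide by the root-mean-square; no mean subtraction, so cheaper than LayerNorm."""
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def soft_cap(x: torch.Tensor, c: float = 30.0) -> torch.Tensor:
    """Soft-capping: c · tanh(x/c) keeps outputs in (-c, c) with a nonzero gradient everywhere."""
    return c * torch.tanh(x / c)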

Data Soup — A Thousand Bodies, Unified

A robot built for a specific body usually needs data from that body. We break this with relative actions — displacements in the gripper's local frame. The same reaching motion produces the same numbers whether performed by a Franka, an SO-100, or our G1 humanoid.

import numpy as np
from scipy.spatial.transform import Rotation  # stands in for the repo's rotm2euler helper

# Same physical motion = same numbers, any robot
rel_xyz = prev_rotm.T @ (curr_xyz - prev_xyz)                            # position delta in the previous EEF frame
rel_rot = Rotation.from_matrix(prev_rotm.T @ curr_rotm).as_euler("xyz")  # rotation delta as Euler angles
action  = np.concatenate([rel_xyz, rel_rot, [gripper_state]])            # cross-embodiment action vector
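
Replaying that action on another body inverts the same transform against the current gripper pose (same frame conventions and the same scipy stand-in as above):

# The stored deltas live in the previous gripper frame, so rotate them back
# out through the *current* pose and compose the rotation delta.
next_xyz  = curr_xyz + curr_rotm @ rel_xyz
next_rotm = curr_rotm @ Rotation.from_euler("xyz", rel_rot).as_matrix()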

Seven source types, one training stream:

graph LR
    LR["🤖 LeRobot<br/>Bridge + DROID"] --> MIX
    AG["🦾 Agibot-World<br/>Bimanual 1M+"] --> MIX
    COS["🌌 Cosmos DreamGen<br/>Synthetic"] --> MIX
    S4D["📸 Stereo4D<br/>Kitchen depth"] --> MIX
    VC["🗣️ Voice Commands<br/>50K instructions"] --> MIX
    TEL["🎮 G1 Teleoperation<br/>LiDAR + EEF + Audio"] --> MIX
    DR["💭 GR00T-Dreams<br/>Humanoid demos"] --> MIX

    MIX["Data Soup 🥣"] --> TRAIN["NeonTrainer<br/>All 6 modalities"]

    style MIX fill:#e65100,color:#fff,stroke:#e65100

G1 Humanoid — 29 Degrees of Freedom

graph TD
    G1["Unitree G1<br/>29 DoF"] --> LA["Left Arm · 7"]
    G1 --> RA["Right Arm · 7"]
    G1 --> T["Torso · 1"]
    G1 --> H["Head · 2"]
    G1 --> LL["Left Leg · 6"]
    G1 --> RL["Right Leg · 6"]

    style G1 fill:#e65100,color:#fff,stroke:#e65100
    style LA fill:#00695c,color:#fff
    style RA fill:#00695c,color:#fff
    style LL fill:#0097a7,color:#fff
    style RL fill:#0097a7,color:#fff
| Mode | Joints | Use Case |
|---|---|---|
| arms_only | 14 arms + 3 loco = 17 | Tabletop manipulation |
| upper_body | + 3 head/torso = 20 | Manipulation + gaze tracking |
| whole_body | All 29 | Full locomotion + manipulation |
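
As a sanity check on those joint counts, the breakdown in Python (the dicts follow the diagram above but are illustrative, not Neon's API):

# 29 DoF total, grouped as in the diagram above.
G1_GROUPS = {"left_arm": 7, "right_arm": 7, "torso": 1,
             "head": 2, "left_leg": 6, "right_leg": 6}
assert sum(G1_GROUPS.values()) == 29

MODE_DIMS = {
    "arms_only":  7 + 7 + 3,                # both arms + (vx, vy, ω) locomotion = 17
    "upper_body": 7 + 7 + 3 + 2 + 1,        # + head (2) and torso (1) = 20
    "whole_body": sum(G1_GROUPS.values()),  # all 29 joints directly
}
assert MODE_DIMS == {"arms_only": 17, "upper_body": 20, "whole_body": 29}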

Training

Eight presets, from a single consumer GPU to a cloud A100:

| Config | Backbone | Mode | GPU | Notes |
|---|---|---|---|---|
| edge_3b | Qwen2.5-Omni-3B | arms | RTX 3090 / L4 | Edge deployment |
| default_arms_only | Qwen2.5-Omni-7B | arms | A100 40GB | Standard |
| default_wholebody | Qwen2.5-Omni-7B | whole | A100 80GB | Full body |
| cosmos_physics | Cosmos-Reason2-8B | arms | A100 40GB | Physics-heavy |
| large_arms | Qwen2.5-Omni-7B | arms | A100 40GB | ~44M heads (GR00T-scale) |
| large_cosmos | Cosmos-Reason2-8B | arms | A100 40GB | Physics + large heads |
| large_wholebody | Qwen2.5-Omni-7B | whole | A100 80GB | 29 DoF + large heads |
| g1_omnimodal | Qwen2.5-Omni-7B | whole | A100 40GB+ | All sensors: LiDAR + EEF + audio |

# Train on HuggingFace Jobs (recommended)
hf jobs uv run --flavor a100-large --secrets HF_TOKEN --timeout 8h -- \
    python -m neon.training.train --backbone Qwen/Qwen2.5-Omni-7B --mode arms_only

# Or locally
from neon.training.config import default_arms_only_config
from neon.training.train import train
train(default_arms_only_config())

By The Numbers

| Metric | Value |
|---|---|
| Backbone | 3-7B params (frozen) |
| Decoder | ~6M params (trainable), 0.08% of total |
| Action Space | 29 DoF, 16-step chunking |
| Input Modalities | 6 (camera, video, audio, LiDAR, EEF, proprioception) |
| Latency | 50 ms on Jetson Orin |
| Data Sources | 7 types, cross-embodiment |
| Training Presets | 8 configs (edge to omni-modal) |
| Tests | 168 passing (CPU, no GPU needed) |
| License | MIT |

Project Structure

neon/
├── neon/
│   ├── model/
│   │   ├── neon_vla.py          # Complete VLA pipeline + PointCloudEncoder + EEFEncoder
│   │   ├── action_heads.py      # Parameter Golf v2 decoders (ReLU², soft-cap)
│   │   ├── video_backbone.py    # Qwen / Cosmos adapter (3B-8B)
│   │   └── audio.py             # Whisper encoder + PersonaPlex TTS
│   ├── data/
│   │   ├── action_space.py      # G1 29-DoF joint definitions + normalization
│   │   ├── data_soup.py         # 7-source data mixing (NeonEpisode w/ all modalities)
│   │   └── relative_actions.py  # Cosmos-style relative EE actions
│   ├── training/
│   │   ├── config.py            # TrainConfig + 8 presets (incl. g1_omnimodal)
│   │   └── train.py             # NeonTrainer (omni-modal collation + loss)
│   ├── inference/
│   │   ├── server.py            # HTTP inference server (all 6 modalities)
│   │   └── g1_controller.py     # Unitree SDK interface
│   ├── streams/
│   │   ├── channels.py          # Typed data channels (Camera, Joint, LiDAR, Audio, Text, ToolCall)
│   │   ├── recorder.py          # StreamRecorder → LeRobot dataset
│   │   └── session.py           # StreamSession — full robot loop
│   ├── dashboard/
│   │   └── bridge.py            # WebSocket dashboard (camera, joints, LiDAR viz)
│   └── policy.py                # NeonPolicy — strands-robots integration (HTTP/ZMQ)
├── tests/                       # 168 tests (all CPU, no GPU needed)
├── paper/                       # LaTeX white paper + soul manifesto
├── video/                       # Remotion explainer video source
└── docs/                        # MkDocs Material site

strands-robots Integration

Neon ships as a first-class strands-robots policy. On pip install neon-vla, it auto-registers and is immediately discoverable:

from strands_robots.policies import create_policy

# The NeonPolicy bridges VLA inference (5-10 Hz) to robot control (50 Hz)
# via RTC action queue with temporal blending
policy = create_policy("neon", host="192.168.123.10", port=8300)

# Full omni-modal observation
obs = {
    "observation.images.front": camera_frame,   # (H, W, 3) uint8
    "observation.state": joint_positions,        # (17,) float32
    "observation.audio": voice_waveform,         # (16000,) float32
    "observation.lidar": point_cloud,            # (4096, 4) float32
    "observation.eef_state": ee_state,           # (14,) float32
}
actions = policy.get_actions_sync(obs, "pick up the red cup")

Three blend schedules: linear, step, exponential. The policy auto-discovers server capabilities via /health.
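
For intuition, blending the overlap of an old and a new chunk might look like this (a sketch of the idea only; NeonPolicy's actual RTC queue internals may differ):

import numpy as np

def blend_chunks(old_chunk, new_chunk, schedule="linear"):
    """Blend the overlapping timesteps of two (T, D) action chunks."""
    T = min(len(old_chunk), len(new_chunk))
    t = np.linspace(0.0, 1.0, T)[:, None]
    if schedule == "linear":
        w = t                              # ramp smoothly from old to new
    elif schedule == "step":
        w = (t >= 0.5).astype(float)       # hard switch halfway through the overlap
    else:                                  # "exponential"
        w = 1.0 - np.exp(-5.0 * t)         # trust the new chunk quickly
    return (1.0 - w) * old_chunk[:T] + w * new_chunk[:T]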


Video

The explainer video is built with Remotion — code-as-video, version-controlled, reproducible:

cd video
npm install
npx remotion studio          # Preview in browser
npx remotion render src/index.ts NeonExplainer out/neon-explainer.mp4

Papers

| Document | Pages | What |
|---|---|---|
| paper/neon.tex | 6 | Full technical report: math, proofs, pseudocode |
| paper/soul.tex | 1 | The Soul of Neon, "Teaching Robots to See Time" |

PDFs attached to every GitHub release.


Related Work


Citation

@software{neon2026,
  title   = {Neon: Open-Source Vision-Language-Action Model for Humanoid Whole-Body Control},
  author  = {Cali, Cagatay},
  year    = {2026},
  url     = {https://github.com/cagataycali/neon},
  license = {MIT}
}


"The difference between seeing a photograph and watching a video
is the difference between knowing and understanding."


One idea. An invitation.

📖 Docs · 📦 PyPI · 📄 Papers
