
Open-source G1 humanoid VLA with video foundation model backbone


Neon — Teaching robots to see time

PyPI · License: MIT · Tests · Python 3.10+ · Docs


The Idea

A child watches a ball roll off a table and reaches out to catch it. She doesn't look at a photograph — she sees the motion. The arc, the acceleration, the moment it leaves the edge. She predicts the future from the flow of time.

Every robot today is blind to this. State-of-the-art Vision-Language-Action models look at the world through frozen snapshots. They see where things are, but not where things are going. It's like trying to catch that ball while glimpsing it only in split-second snapshots.

Neon's insight is one sentence:

Video foundation models already understand motion — we just connect them to robot bodies.

Models like Qwen2.5-Omni and Cosmos-Reason2 have watched millions of hours of video. They've learned that cups fall when pushed, that doors swing on hinges, that hands reach before they grasp. This temporal understanding — physics, dynamics, cause and effect — is exactly what a robot needs. It's sitting there, pre-trained, waiting.

So we do something radical in its simplicity. We take a 7-billion-parameter video model, freeze it entirely, and train a tiny action decoder on top — just 6 million parameters, 0.08% of the total — that translates the video model's rich temporal understanding into 29 joint commands for a humanoid body, 16 timesteps into the future.

The video model sees. The decoder acts.
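
In code, the recipe is just "freeze everything, bolt on a small head." A minimal PyTorch sketch, assuming a pooled backbone feature vector; the class, dimensions, and stand-in backbone below are illustrative, not Neon's actual API:

import torch
import torch.nn as nn

class TinyActionDecoder(nn.Module):
    """Maps pooled video features to a 16-step chunk of 29 joint commands."""
    def __init__(self, feat_dim=3584, horizon=16, dof=29, hidden=512):
        super().__init__()
        self.horizon, self.dof = horizon, dof
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, horizon * dof),
        )

    def forward(self, feats):                                     # feats: (B, feat_dim)
        return self.net(feats).view(-1, self.horizon, self.dof)  # (B, 16, 29)

backbone = nn.Linear(1024, 3584)      # stand-in for the frozen 7B video model
for p in backbone.parameters():
    p.requires_grad = False           # freeze: the backbone never receives gradients

decoder = TinyActionDecoder()
optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)  # only the tiny head trains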

pip install neon-vla

How It Works

graph LR
    CAM["📹 Camera"] --> VB["Video Backbone<br/><b>7B frozen</b><br/>Qwen2.5-Omni / Cosmos"]
    MIC["🎤 Voice"] --> VB
    PROP["🦾 Joints"] --> PE["Proprio Encoder"]
    LIDAR["📡 LiDAR"] --> LE["PointCloud Encoder"]
    EEF["🤲 EEF State"] --> EE["EEF Encoder"]
    VB --> FUS["Feature Fusion"]
    PE --> FUS
    LE --> FUS
    EE --> FUS
    FUS --> AH["Action Heads<br/><b>~6M trainable</b><br/>Parameter Golf v2"]
    AH --> ACT["🤖 29 DoF × 16 steps"]
    AH --> SPEECH["🔊 Speech Out"]
    
    style VB fill:#0097a7,color:#fff,stroke:#0097a7
    style AH fill:#e65100,color:#fff,stroke:#e65100
    style FUS fill:#333,color:#fff
Full architecture diagram — all 6 input modalities
graph TD
    subgraph "Inputs (6 modalities)"
        CAM["📹 Camera Frames"]
        VID["🎬 Video Frames"]
        MIC["🎤 Audio (16kHz)"]
        TXT["📝 Language"]
        PROP["🦾 Joint States (29 DoF)"]
        LID["📡 LiDAR Point Cloud (N×4)"]
        EEF["🤲 EEF State (14 DoF)"]
    end

    subgraph "Neon VLA"
        VB["Video Backbone<br/>Qwen2.5-Omni / Cosmos-Reason2<br/><i>frozen, 3-7B</i>"]
        AE["Whisper Audio Encoder<br/><i>frozen, 39M</i>"]
        PE["Proprio Encoder<br/><i>trainable MLP</i>"]
        LE["PointCloud Encoder<br/><i>trainable PointNet-style</i>"]
        EE["EEF Encoder<br/><i>trainable MLP</i>"]
        FUS["Feature Fusion<br/>Linear + ReLU²"]
        AH["Action Heads<br/>Parameter Golf v2<br/><i>trainable, ~6M</i>"]
        SH["Speech Head<br/>PersonaPlex TTS"]
    end

    subgraph "Outputs"
        ARM["Arms (14 DoF)"]
        LOCO["Locomotion (vx, vy, ω)"]
        HEAD["Head (2 DoF)"]
        VOICE["🔊 Voice"]
    end

    CAM --> VB
    VID --> VB
    TXT --> VB
    MIC --> AE
    PROP --> PE
    LID --> LE
    EEF --> EE
    VB --> FUS
    AE --> FUS
    PE --> FUS
    LE --> FUS
    EE --> FUS
    FUS --> AH
    FUS --> SH
    AH --> ARM
    AH --> LOCO
    AH --> HEAD
    SH --> VOICE

    style VB fill:#0097a7,color:#fff
    style AH fill:#e65100,color:#fff
    style FUS fill:#333,color:#fff

Why video models, not image models?

| | Traditional VLAs | Neon |
|---|---|---|
| Vision | Single frame (photograph) | Temporal sequence (video) |
| Physics | None; must learn from scratch | Cosmos-Reason2, pre-trained on the physical world |
| Prediction | 1 action at a time | 16-step action chunking (anticipates the future) |
| Audio | Separate pipeline | Native: Qwen2.5-Omni hears and speaks |
| Spatial | No depth | LiDAR point clouds → PointNet encoder |
| Trainable params | Billions | ~6M (0.08% of total) |

Quick Start

As a VLA model

from neon.model.neon_vla import NeonVLA, NeonConfig

model = NeonVLA(NeonConfig(control_mode="arms_only"))
model.load_backbone()

# Full omni-modal prediction
output = model.predict(
    image=camera_frame,                     # 📹 what the robot sees
    instruction="Pick up the red cup",      # 📝 what you want
    proprioception=joint_states,            # 🦾 where the robot is
    audio=voice_waveform,                   # 🎤 spoken command (16kHz)
    lidar=point_cloud,                      # 📡 spatial awareness (N×4)
    eef_state=ee_positions,                 # 🤲 hand positions (14-DOF)
    speak=True,                             # 🔊 robot narrates its action
)

output.actions      # → (16, 17): 16 timesteps × 17 action dims (14 arm joints + 3 locomotion)
output.upper_body   # → (16, 14): arm joint positions
output.locomotion   # → (16, 3):  velocity commands (vx, vy, ω)
output.speech_path  # → "/tmp/neon_speech_xyz.wav"

As a strands-robots policy (plug-and-play)

# Direct usage
from neon import NeonPolicy
policy = NeonPolicy(host="192.168.123.10", port=8300)
actions = policy.get_actions_sync(obs, "pick up the red cup")

# Via strands-robots (auto-discovered on install)
from strands_robots.policies import create_policy
policy = create_policy("neon", host="robot-ip", port=8300)

# Smart resolution from HuggingFace model ID
policy = create_policy("cagataydev/neon-g1-v1-dev")

Run the inference server

# On the robot (Jetson Orin / any CUDA machine)
neon-serve --model cagataydev/neon-g1-v1-dev --port 8300

# The server accepts ALL modalities via HTTP:
curl -X POST http://robot:8300/predict \
  -H "Content-Type: application/json" \
  -d '{"image_base64": "...", "instruction": "pick up the cup", "proprioception": [...]}'

# Check what modalities the model supports:
curl http://robot:8300/health
# → {"modalities": {"camera": true, "audio": true, "lidar": false, ...}}

The Action Decoder — Parameter Golf v2

Our decoder heads come from a competition to build the smallest working language model. Every trick matters when your entire trainable model fits in 25 megabytes:

| Technique | What | Why it matters |
|---|---|---|
| ReLU² | max(0, x)² | Smoother than GELU, cheaper than SiLU |
| RMSNorm | x / √(mean(x²)) | Half the cost of LayerNorm |
| Soft-Capping | c · tanh(x/c) | Never kills gradients at boundaries |
| Residual Scales | h + α·h_skip | Learned α: the network decides how much to trust the backbone |
| U-Net Skip | Layer 0 → last layer | Gradient highway through deep decoders |
| β₁ = 0.85 | Lower Adam momentum | Faster adaptation to shifting distributions |
| Grad Clip 0.3 | Tight clipping | Prevents divergence in small heads |
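
For reference, three of these tricks in plain PyTorch, written straight from the formulas above (a sketch, not Neon's actual modules):

import torch

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    """ReLU²: max(0, x)², smooth near zero and cheap to compute."""
    return torch.relu(x).pow(2)

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """RMSNorm: divide by the root-mean-square; no mean subtraction, so cheaper than LayerNorm."""
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def soft_cap(x: torch.Tensor, c: float = 30.0) -> torch.Tensor:
    """Soft-capping: c · tanh(x/c) keeps outputs in (-c, c) with a nonzero gradient everywhere."""
    return c * torch.tanh(x / c)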

Data Soup — A Thousand Bodies, Unified

A robot built for a specific body usually needs data from that body. We break this with relative actions — displacements in the gripper's local frame. The same reaching motion produces the same numbers whether performed by a Franka, an SO-100, or our G1 humanoid.

import numpy as np
from scipy.spatial.transform import Rotation  # stands in for the repo's rotm2euler helper

# Same physical motion = same numbers, any robot
rel_xyz = prev_rotm.T @ (curr_xyz - prev_xyz)                            # position delta in the previous EEF frame
rel_rot = Rotation.from_matrix(prev_rotm.T @ curr_rotm).as_euler("xyz")  # rotation delta as Euler angles
action  = np.concatenate([rel_xyz, rel_rot, [gripper_state]])            # cross-embodiment action vector
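
Replaying that action on another body inverts the same transform against the current gripper pose (same frame conventions and the same scipy stand-in as above):

# The stored deltas live in the previous gripper frame, so rotate them back
# out through the *current* pose and compose the rotation delta.
next_xyz  = curr_xyz + curr_rotm @ rel_xyz
next_rotm = curr_rotm @ Rotation.from_euler("xyz", rel_rot).as_matrix()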

Seven source types, one training stream:

graph LR
    LR["🤖 LeRobot<br/>Bridge + DROID"] --> MIX
    AG["🦾 Agibot-World<br/>Bimanual 1M+"] --> MIX
    COS["🌌 Cosmos DreamGen<br/>Synthetic"] --> MIX
    S4D["📸 Stereo4D<br/>Kitchen depth"] --> MIX
    VC["🗣️ Voice Commands<br/>50K instructions"] --> MIX
    TEL["🎮 G1 Teleoperation<br/>LiDAR + EEF + Audio"] --> MIX
    DR["💭 GR00T-Dreams<br/>Humanoid demos"] --> MIX

    MIX["Data Soup 🥣"] --> TRAIN["NeonTrainer<br/>All 6 modalities"]

    style MIX fill:#e65100,color:#fff,stroke:#e65100

G1 Humanoid — 29 Degrees of Freedom

graph TD
    G1["Unitree G1<br/>29 DoF"] --> LA["Left Arm · 7"]
    G1 --> RA["Right Arm · 7"]
    G1 --> T["Torso · 1"]
    G1 --> H["Head · 2"]
    G1 --> LL["Left Leg · 6"]
    G1 --> RL["Right Leg · 6"]

    style G1 fill:#e65100,color:#fff,stroke:#e65100
    style LA fill:#00695c,color:#fff
    style RA fill:#00695c,color:#fff
    style LL fill:#0097a7,color:#fff
    style RL fill:#0097a7,color:#fff
| Mode | Joints | Use Case |
|---|---|---|
| arms_only | 14 arms + 3 loco = 17 | Tabletop manipulation |
| upper_body | + 3 head/torso = 20 | Manipulation + gaze tracking |
| whole_body | All 29 | Full locomotion + manipulation |
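
As a sanity check on those joint counts, the breakdown in Python (the dicts follow the diagram above but are illustrative, not Neon's API):

# 29 DoF total, grouped as in the diagram above.
G1_GROUPS = {"left_arm": 7, "right_arm": 7, "torso": 1,
             "head": 2, "left_leg": 6, "right_leg": 6}
assert sum(G1_GROUPS.values()) == 29

MODE_DIMS = {
    "arms_only":  7 + 7 + 3,                # both arms + (vx, vy, ω) locomotion = 17
    "upper_body": 7 + 7 + 3 + 2 + 1,        # + head (2) and torso (1) = 20
    "whole_body": sum(G1_GROUPS.values()),  # all 29 joints directly
}
assert MODE_DIMS == {"arms_only": 17, "upper_body": 20, "whole_body": 29}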

Training

Eight presets, from a single consumer GPU to a cloud A100:

| Config | Backbone | Mode | GPU | Notes |
|---|---|---|---|---|
| edge_3b | Qwen2.5-Omni-3B | arms | RTX 3090 / L4 | Edge deployment |
| default_arms_only | Qwen2.5-Omni-7B | arms | A100 40GB | Standard |
| default_wholebody | Qwen2.5-Omni-7B | whole | A100 80GB | Full body |
| cosmos_physics | Cosmos-Reason2-8B | arms | A100 40GB | Physics-heavy |
| large_arms | Qwen2.5-Omni-7B | arms | A100 40GB | ~44M heads (GR00T-scale) |
| large_cosmos | Cosmos-Reason2-8B | arms | A100 40GB | Physics + large heads |
| large_wholebody | Qwen2.5-Omni-7B | whole | A100 80GB | 29 DoF + large heads |
| g1_omnimodal | Qwen2.5-Omni-7B | whole | A100 40GB+ | All sensors: LiDAR + EEF + audio |

# Train on HuggingFace Jobs (recommended)
hf jobs uv run --flavor a100-large --secrets HF_TOKEN --timeout 8h -- \
    python -m neon.training.train --backbone Qwen/Qwen2.5-Omni-7B --mode arms_only

# Or locally
from neon.training.config import default_arms_only_config
from neon.training.train import train
train(default_arms_only_config())

By The Numbers

| Metric | Value |
|---|---|
| Backbone | 3-7B params (frozen) |
| Decoder | ~6M params (trainable), 0.08% of total |
| Action Space | 29 DoF, 16-step chunking |
| Input Modalities | 6 (camera, video, audio, LiDAR, EEF, proprioception) |
| Latency | 50 ms on Jetson Orin |
| Data Sources | 7 types, cross-embodiment |
| Training Presets | 8 configs (edge to omni-modal) |
| Tests | 168 passing (CPU, no GPU needed) |
| License | MIT |

Project Structure

neon/
├── neon/
│   ├── model/
│   │   ├── neon_vla.py          # Complete VLA pipeline + PointCloudEncoder + EEFEncoder
│   │   ├── action_heads.py      # Parameter Golf v2 decoders (ReLU², soft-cap)
│   │   ├── video_backbone.py    # Qwen / Cosmos adapter (3B-8B)
│   │   └── audio.py             # Whisper encoder + PersonaPlex TTS
│   ├── data/
│   │   ├── action_space.py      # G1 29-DoF joint definitions + normalization
│   │   ├── data_soup.py         # 7-source data mixing (NeonEpisode w/ all modalities)
│   │   └── relative_actions.py  # Cosmos-style relative EE actions
│   ├── training/
│   │   ├── config.py            # TrainConfig + 8 presets (incl. g1_omnimodal)
│   │   └── train.py             # NeonTrainer (omni-modal collation + loss)
│   ├── inference/
│   │   ├── server.py            # HTTP inference server (all 6 modalities)
│   │   └── g1_controller.py     # Unitree SDK interface
│   ├── streams/
│   │   ├── channels.py          # Typed data channels (Camera, Joint, LiDAR, Audio, Text, ToolCall)
│   │   ├── recorder.py          # StreamRecorder → LeRobot dataset
│   │   └── session.py           # StreamSession — full robot loop
│   ├── dashboard/
│   │   └── bridge.py            # WebSocket dashboard (camera, joints, LiDAR viz)
│   └── policy.py                # NeonPolicy — strands-robots integration (HTTP/ZMQ)
├── tests/                       # 168 tests (all CPU, no GPU needed)
├── paper/                       # LaTeX white paper + soul manifesto
├── video/                       # Remotion explainer video source
└── docs/                        # MkDocs Material site

strands-robots Integration

Neon ships as a first-class strands-robots policy. On pip install neon-vla, it auto-registers and is immediately discoverable:

from strands_robots.policies import create_policy

# The NeonPolicy bridges VLA inference (5-10 Hz) to robot control (50 Hz)
# via RTC action queue with temporal blending
policy = create_policy("neon", host="192.168.123.10", port=8300)

# Full omni-modal observation
obs = {
    "observation.images.front": camera_frame,   # (H, W, 3) uint8
    "observation.state": joint_positions,        # (17,) float32
    "observation.audio": voice_waveform,         # (16000,) float32
    "observation.lidar": point_cloud,            # (4096, 4) float32
    "observation.eef_state": ee_state,           # (14,) float32
}
actions = policy.get_actions_sync(obs, "pick up the red cup")

Three blend schedules: linear, step, exponential. The policy auto-discovers server capabilities via /health.
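
For intuition, blending the overlap of an old and a new chunk might look like this (a sketch of the idea only; NeonPolicy's actual RTC queue internals may differ):

import numpy as np

def blend_chunks(old_chunk, new_chunk, schedule="linear"):
    """Blend the overlapping timesteps of two (T, D) action chunks."""
    T = min(len(old_chunk), len(new_chunk))
    t = np.linspace(0.0, 1.0, T)[:, None]
    if schedule == "linear":
        w = t                              # ramp smoothly from old to new
    elif schedule == "step":
        w = (t >= 0.5).astype(float)       # hard switch halfway through the overlap
    else:                                  # "exponential"
        w = 1.0 - np.exp(-5.0 * t)         # trust the new chunk quickly
    return (1.0 - w) * old_chunk[:T] + w * new_chunk[:T]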


Video

The explainer video is built with Remotion — code-as-video, version-controlled, reproducible:

cd video
npm install
npx remotion studio          # Preview in browser
npx remotion render src/index.ts NeonExplainer out/neon-explainer.mp4

Papers

| Document | Pages | What |
|---|---|---|
| paper/neon.tex | 6 | Full technical report: math, proofs, pseudocode |
| paper/soul.tex | 1 | The Soul of Neon, "Teaching Robots to See Time" |

PDFs attached to every GitHub release.


Related Work


Citation

@software{neon2026,
  title   = {Neon: Open-Source Vision-Language-Action Model for Humanoid Whole-Body Control},
  author  = {Cali, Cagatay},
  year    = {2026},
  url     = {https://github.com/cagataycali/neon},
  license = {MIT}
}


"The difference between seeing a photograph and watching a video
is the difference between knowing and understanding."


One idea. An invitation.

📖 Docs · 📦 PyPI · 📄 Papers
