AxisGuide

Grounding Robot Action Coordinate System in RGB Observations for Robust Visuomotor Manipulation

Jiyun Jang, Yujin Sung*, Woosung Joung*, Daewon Chae, Sangwon Lee, Sohwi Kim,
Jinkyu Kim†, Jungbeom Lee†

Korea University · University of Michigan · KT R&D center · Kakao Mobility
*Equal contribution · †Co-corresponding authors

Robotics: Science and Systems (RSS) 2026

AxisGuide explicitly visualizes the robot base-frame action coordinate system in image space, helping visuomotor policies generalize to unseen object locations.

Abstract

Visuomotor manipulation policies trained via large-scale behavior cloning have achieved strong semantic scene understanding, yet often fail to reliably execute correct low-level actions under distribution shifts. We argue that this gap arises from insufficient action understanding: the inability to interpret the robot's base-frame action coordinate system in image space.

AxisGuide is a lightweight guidance method that renders the robot base-frame axes in each camera view and augments RGB observations with cue channels that explicitly visualize the meaning of +x, +y, and +z motions. Across LIBERO simulation and real-world experiments, AxisGuide improves task success and robustness under distribution shift.

Motivation

Current visuomotor policies often understand what to do, but struggle with how to execute the correct low-level action. In robot manipulation, actions are usually defined in the robot base frame, while observations are RGB images. Without an explicit visual reference, the policy must infer how base-frame motions such as +x, +y, and +z appear in the image.

Conventional Policy

Learns image-to-action correspondences implicitly and can fail when the target object moves to unseen locations.

AxisGuide

Provides explicit image-space cues for the robot action coordinate system, improving action grounding and generalization.

Method Overview

Given camera intrinsics, extrinsics, and the end-effector pose, AxisGuide projects unit robot base-frame translations onto the image plane. The projected +x, +y, and +z directions are rendered as cue channels and concatenated with RGB observations. This keeps the original RGB image intact while giving the policy a clean visual reference for action-coordinate semantics.
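The projection step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation. It assumes a pinhole camera without lens distortion, a 4x4 extrinsic matrix `T_cam_base` that maps base-frame points into the camera frame, axes anchored at the end-effector position, one cue channel per axis, and a hypothetical `axis_len` scale (in metres) so the drawn axes stay inside the image. All function and parameter names are made up for this example.

```python
import numpy as np

def project_points(points_base, K, T_cam_base):
    """Project Nx3 base-frame points to pixel coordinates (pinhole model).

    Assumes every point lies in front of the camera (positive depth).
    """
    pts_h = np.hstack([points_base, np.ones((len(points_base), 1))])
    pts_cam = (T_cam_base @ pts_h.T).T[:, :3]   # base frame -> camera frame
    uv = (K @ pts_cam.T).T                      # camera frame -> image plane
    return uv[:, :2] / uv[:, 2:3]               # perspective divide

def draw_segment(channel, p0, p1, value=1.0):
    """Rasterize the segment p0 -> p1 into a 2D array by dense sampling."""
    h, w = channel.shape
    n = int(np.ceil(np.linalg.norm(p1 - p0))) + 1
    for t in np.linspace(0.0, 1.0, n):
        u, v = np.rint(p0 + t * (p1 - p0)).astype(int)
        if 0 <= u < w and 0 <= v < h:           # skip pixels outside the image
            channel[v, u] = value

def axis_cue_channels(rgb, K, T_cam_base, ee_pos_base, axis_len=0.1):
    """Return an HxWx6 array: the unchanged RGB plus +x, +y, +z cue channels."""
    h, w, _ = rgb.shape
    # Endpoints of the scaled base-frame axes, anchored at the end effector.
    endpoints = ee_pos_base + axis_len * np.eye(3)
    uv = project_points(np.vstack([ee_pos_base, endpoints]), K, T_cam_base)
    cues = np.zeros((h, w, 3), dtype=np.float32)
    for i in range(3):                          # i = 0, 1, 2 -> +x, +y, +z
        draw_segment(cues[:, :, i], uv[0], uv[i + 1])
    # RGB values are only cast to float; the cues are appended as extra channels.
    return np.concatenate([rgb.astype(np.float32), cues], axis=-1)
```

A policy consuming this observation would need its first convolution or patch-embedding layer widened from 3 to 6 input channels. How the actual method wires the cue channels into the policy is not specified here, so treat that as an assumption as well.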

Key Results

+13.33 percentage points

LIBERO Novel Object Position

AxisGuide improves success from 52.38% to 65.71% in simulation.

+19.88 percentage points

Real-World Novel Object Position

AxisGuide improves success from 30.12% to 50.00% on the Pick Up (Pear) task.

+5.41 ms

Low Runtime Overhead

AxisGuide adds only about 5.41 ms (roughly 0.005 seconds) of latency per inference on SmolVLA.

Qualitative Results

LIBERO simulation: reaching unseen object locations.

Real-world manipulation with AxisGuide.

Comparison with Visual Cue Baselines

Existing visual cue methods provide spatial or temporal guidance, but they do not explicitly represent the robot action coordinate system. AxisGuide directly shows where the x, y, and z action directions lie in the image, enabling the policy to infer how to move toward a target under novel object positions.

Method               Success rate
SmolVLA              52.38%
SmolVLA + TraceVLA   51.90%
SmolVLA + AimBot     52.38%
SmolVLA + AxisGuide  65.71%

BibTeX

@inproceedings{jang2026axisguide,
  title     = {AxisGuide: Grounding Robot Action Coordinate System in RGB Observations for Robust Visuomotor Manipulation},
  author    = {Jang, Jiyun and Sung, Yujin and Joung, Woosung and Chae, Daewon and Lee, Sangwon and Kim, Sohwi and Kim, Jinkyu and Lee, Jungbeom},
  booktitle = {Robotics: Science and Systems},
  year      = {2026}
}