HOT: Learning Generalizable Hand-Object Tracking from Synthetic Demonstrations

1HKUST      2Wuhan University      3Shanghai AI Laboratory      4ETH Zurich      5Shanghai Jiao Tong University      6HKUST (GZ)

Abstract

We present a system for learning generalizable hand-object tracking controllers purely from synthetic data, without requiring any human demonstrations.

Our approach makes two key contributions: (1) HOP, a Hand-Object Planner that synthesizes diverse hand-object trajectories; and (2) HOT, a Hand-Object Tracker that achieves synthetic-to-physical transfer through reinforcement learning and interaction imitation learning, yielding a generalizable controller conditioned on target hand-object states.

Our method extends to diverse object shapes and hand morphologies. Through extensive evaluations, we show that our approach enables dexterous hands to track challenging, long-horizon sequences, including object rearrangement and agile in-hand reorientation. These results represent a significant step toward scalable foundation controllers for manipulation that learn entirely from synthetic data, breaking the data bottleneck that has long constrained progress in dexterous manipulation.

Method Overview

Our system learns generalizable hand-object tracking from synthetic data. (a) HOP synthesizes manipulation trajectories for meta-skills. (b) HOT is trained through a two-stage teacher-student framework using reinforcement learning with a unified HOI imitation reward, enabling robust tracking of the target HOI trajectories. (c) At inference, the system can accept high-level waypoints from language models, generative models, or human data, which HOP converts into trajectories for HOT to track, enabling diverse applications.
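The unified HOI imitation reward mentioned above can be illustrated with a standard exponential-tracking formulation (in the style of DeepMimic-like trackers). The term structure, weights, and scales below are assumptions for illustration, not the paper's actual reward:

```python
import numpy as np

def hoi_imitation_reward(hand_q, hand_q_ref, obj_pos, obj_pos_ref,
                         obj_quat, obj_quat_ref,
                         w_hand=0.4, w_pos=0.3, w_rot=0.3,
                         k_hand=5.0, k_pos=20.0, k_rot=2.0):
    """Illustrative unified hand-object imitation reward: exponential
    tracking terms for hand joint angles and object pose, combined into
    one scalar. All weights/scales here are placeholders."""
    # mean-squared hand joint-angle error
    e_hand = np.mean((hand_q - hand_q_ref) ** 2)
    # squared object position error
    e_pos = np.sum((obj_pos - obj_pos_ref) ** 2)
    # squared geodesic angle between unit quaternions
    dot = np.clip(abs(np.dot(obj_quat, obj_quat_ref)), 0.0, 1.0)
    e_rot = (2.0 * np.arccos(dot)) ** 2
    return (w_hand * np.exp(-k_hand * e_hand)
            + w_pos * np.exp(-k_pos * e_pos)
            + w_rot * np.exp(-k_rot * e_rot))
```

With perfect tracking (all errors zero) each exponential term is 1, so the reward equals the sum of the weights; any tracking error decays the corresponding term toward zero.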

HOP Pipeline

HOP synthesizes manipulation trajectories from grasp poses generated by force-closure optimization and refined by RL. Its grammar-based approach supports eight composable meta-skills, offering multi-source parameter control via randomization, LLM/VLM instructions, or human demonstrations. The system naturally generalizes across diverse hands and objects for scalable data coverage.
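The grammar-based composition above can be sketched as a toy transition grammar over meta-skills. The paper names eight composable meta-skills but does not list them here, so the skill names and allowed transitions below are placeholders:

```python
import random

# Placeholder names for eight composable meta-skills (assumed, for illustration).
META_SKILLS = ["reach", "grasp", "lift", "move", "reorient",
               "place", "release", "retract"]

# Toy grammar: which meta-skill may follow which. The real system's
# grammar and skill parameters are not specified on this page.
NEXT = {
    "reach": ["grasp"],
    "grasp": ["lift"],
    "lift": ["move", "reorient"],
    "move": ["reorient", "place"],
    "reorient": ["move", "place"],
    "place": ["release"],
    "release": ["retract"],
    "retract": [],
}

def sample_skill_sequence(max_len=6, seed=0):
    """Sample a grammatically valid meta-skill sequence starting from 'reach'.

    Each step draws a successor allowed by the grammar, so every adjacent
    pair in the result is a legal transition."""
    rng = random.Random(seed)
    seq = ["reach"]
    while len(seq) < max_len and NEXT[seq[-1]]:
        seq.append(rng.choice(NEXT[seq[-1]]))
    return seq
```

Skill parameters (waypoints, target poses) could then be filled in from any of the sources the pipeline supports: random sampling, LLM/VLM instructions, or human demonstrations.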

HOP Synthetic Trajectories / HOT Tracking Results

Zero-Shot Tracking on GRAB

Tracking Long Sequences

VLM Planning + Our Method

Place the bottles into the basket

Different Dexterous Hand

Shadow Hand

Allegro Hand

Video

BibTeX

@article{wang2025hot,
  title={Learning Generalizable Hand-Object Tracking from Synthetic Demonstrations},
  author={Wang, Yinhuai and Yu, Runyi and Tsui, Hok Wai and Lin, Xiaoyi and Zhang, Hui and Zhao, Qihan and Fan, Ke and Li, Miao and Song, Jie and Wang, Jingbo and Chen, Qifeng and Tan, Ping},
  journal={arXiv preprint arXiv:2512.19583},
  year={2025}
}