
History-Aware Visuomotor Policy Learning via Point Tracking

[Paper] [Project Page]

Authors: Jingjing Chen, Hongjie Fang, Chenxi Wang, Shiquan Wang, Cewu Lu

[Teaser figure]

πŸ›« Getting Started

πŸ’» Installation

Please follow the installation guide to set up the rise conda environment and its dependencies, as well as the real-robot environment. Also, remember to adjust the constant parameters in dataset/constants.py and utils/constants.py according to your own environment.

Make sure that TRANS_MIN/MAX and WORKSPACE_MIN/MAX are correctly set in camera coordinates, or you may obtain meaningless output. We recommend expanding TRANS_MIN/MAX by 0.15 to 0.3 meters on each side of the actual translation range to accommodate spatial data augmentation. You can follow command_train.sh to visualize the data and check the parameters.
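For reference, a minimal sketch of what these constants might look like (the names TRANS_MIN/MAX and WORKSPACE_MIN/MAX come from the repository; the numeric values below are placeholders that you must measure for your own camera coordinates):

```python
# Illustrative values only -- measure the actual ranges for your own setup.
import numpy as np

# End-effector translation range, expressed in CAMERA coordinates (meters),
# expanded by roughly 0.15-0.3 m on each side of the measured range to leave
# room for the random translations used in spatial data augmentation.
TRANS_MIN = np.array([-0.55, -0.45, 0.30])
TRANS_MAX = np.array([ 0.55,  0.45, 1.10])

# Workspace bounds used to crop the observed point cloud, also in camera coordinates (meters).
WORKSPACE_MIN = np.array([-0.40, -0.30, 0.35])
WORKSPACE_MAX = np.array([ 0.40,  0.30, 1.00])
```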

πŸ“· Calibration

Please calibrate the camera(s) with the robot before data collection and evaluation to ensure correct spatial transformations between the camera(s) and the robot. Refer to the calibration guide for more details.
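The calibration files shipped with the sample data (extrinsics.npy, intrinsics.npy, tcp.npy; see the layout in the next section) are plain NumPy arrays. Here is a hedged sketch of how they might be used, assuming a 4x4 homogeneous extrinsic matrix and a 3x3 pinhole intrinsic matrix; the actual shapes and frame conventions depend on your calibration pipeline:

```python
import numpy as np

calib_dir = "collect_cups/calib/1700000000"          # placeholder calib timestamp
extrinsics = np.load(f"{calib_dir}/extrinsics.npy")  # camera -> marker (assumed 4x4 homogeneous)
intrinsics = np.load(f"{calib_dir}/intrinsics.npy")  # assumed 3x3 pinhole intrinsics

def deproject(u, v, depth_m):
    """Back-project a pixel (u, v) with depth in meters into the camera frame."""
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    return np.array([(u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m])

def camera_to_marker(points_cam):
    """Transform Nx3 camera-frame points into the calibration marker frame."""
    homo = np.concatenate([points_cam, np.ones((len(points_cam), 1))], axis=1)
    return (extrinsics @ homo.T).T[:, :3]
```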

πŸ›’οΈ Data Collection

We follow the data collection process described in the RH20T paper. We provide sample data for each task on Google Drive and Baidu Netdisk (code: 643b). You may need to adjust dataset/realworld.py to accommodate different data formats. The sample data are organized as follows:

collect_cups
|-- calib/
|   |-- [calib timestamp 1]/
|   |   |-- extrinsics.npy             # extrinsics (camera to marker)
|   |   |-- intrinsics.npy             # intrinsics
|   |   `-- tcp.npy                    # tcp pose of calibration
|   `-- [calib timestamp 2]/           # similar calib structure
`-- train/
    |-- [episode identifier 1]
    |   |-- metadata.json              # metadata
    |   |-- timestamp.txt              # calib timestamp  
    |   |-- cam_[serial_number 1]/    
    |   |   |-- color                  # RGB
    |   |   |   |-- [timestamp 1].png
    |   |   |   |-- [timestamp 2].png
    |   |   |   |-- ...
    |   |   |   `-- [timestamp T].png
    |   |   |-- depth                  # depth
    |   |   |   |-- [timestamp 1].png
    |   |   |   |-- [timestamp 2].png
    |   |   |   |-- ...
    |   |   |   `-- [timestamp T].png
    |   |   |-- tcp                    # tcp
    |   |   |   |-- [timestamp 1].npy
    |   |   |   |-- [timestamp 2].npy
    |   |   |   |-- ...
    |   |   |   `-- [timestamp T].npy
    |   |   `-- gripper_command        # gripper command
    |   |       |-- [timestamp 1].npy
    |   |       |-- [timestamp 2].npy
    |   |       |-- ...
    |   |       `-- [timestamp T].npy
    |   `-- cam_[serial_number 2]/     # similar camera structure
    `-- [episode identifier 2]         # similar episode structure
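To make the layout concrete, here is a minimal loading sketch (not part of the released code; it assumes that color, depth, tcp, and gripper_command share the same set of timestamps, as in the sample data):

```python
import os
import cv2
import numpy as np

def iter_frames(episode_dir, serial_number):
    """Yield (timestamp, rgb, depth, tcp, gripper_command) for one camera of one episode."""
    cam_dir = os.path.join(episode_dir, f"cam_{serial_number}")
    # Lexicographic sort; switch to a numeric key if your timestamps differ in length.
    timestamps = sorted(f[:-4] for f in os.listdir(os.path.join(cam_dir, "color")) if f.endswith(".png"))
    for ts in timestamps:
        rgb = cv2.cvtColor(cv2.imread(os.path.join(cam_dir, "color", f"{ts}.png")), cv2.COLOR_BGR2RGB)
        depth = cv2.imread(os.path.join(cam_dir, "depth", f"{ts}.png"), cv2.IMREAD_UNCHANGED)
        tcp = np.load(os.path.join(cam_dir, "tcp", f"{ts}.npy"))
        gripper = np.load(os.path.join(cam_dir, "gripper_command", f"{ts}.npy"))
        yield ts, rgb, depth, tcp, gripper
```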

🎯 Data Preprocessing

First preprocess the dataset with create_tapip3d_npz.sh to obtain the unified-format output_data.npz, then run offline_tracking.sh to perform offline tracking and store the tracking results for faster training.

cd utils/tapip3d/
bash create_tapip3d_npz.sh

Here are the argument explanations for data preprocessing:

  • --color_dir: the RGB image directory path.
  • --depth_dir: the depth image directory path.
  • --output: the output npz file path.
  • --depth_scale: the depth image scale factor, e.g., 1000.0 if depth is in millimeters.
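Before moving on, it can help to sanity-check the unified output_data.npz; here is a minimal inspection sketch (the key names inside the archive are not documented here, so treat the printout as whatever create_tapip3d_npz.sh actually writes):

```python
import numpy as np

data = np.load("output_data.npz")
print("keys:", list(data.keys()))
for key in data.keys():
    print(f"{key}: shape={data[key].shape}, dtype={data[key].dtype}")
# Things worth verifying: the frame count matches the episode, RGB and depth are
# aligned, and depth values look sensible after applying --depth_scale.
```

Then run offline tracking: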
bash offline_tracking.sh

Here are the argument explanations for offline tracking:

  • --input: the input video (.mp4) or npz file path.
  • --sam2_checkpoint: the SAM2 model checkpoint path.
  • --sam2_config: the SAM2 config file path.
  • --tapip3d_checkpoint: the TAPIP3D model checkpoint path.
  • --output_dir: the output directory.
  • --target_resolution: the target resolution for input video or npz data.
  • --num_targets: the number of targets to segment and track.
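offline_tracking.sh forwards these arguments to the tracking entry point. If you prefer to batch over many episodes from Python, a hedged sketch along these lines works (the entry-point name offline_tracking.py and all paths below are assumptions; check the shell script for the real script and checkpoint locations):

```python
import subprocess

args = {
    "--input": "output_data.npz",
    "--sam2_checkpoint": "checkpoints/sam2.pt",        # placeholder path
    "--sam2_config": "configs/sam2.yaml",              # placeholder path
    "--tapip3d_checkpoint": "checkpoints/tapip3d.pth", # placeholder path
    "--output_dir": "tracking_results/",
    "--target_resolution": "480",                      # placeholder; may expect height and width
    "--num_targets": "2",
}
cmd = ["python", "offline_tracking.py"]  # hypothetical entry point wrapped by offline_tracking.sh
for flag, value in args.items():
    cmd += [flag, value]
subprocess.run(cmd, check=True)
```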

πŸ§‘πŸ»β€πŸ’» Training

Modify the arguments in command_train.sh, then

conda activate rise
bash command_train.sh

You can modify the training code, data loading, and model architecture to adapt to new environment settings.

Here are the argument explanations in the training process:

  • --aug: set to enable spatial transformation augmentations, including random translation and rotation. It is recommended to add this flag for better workspace generalization.
  • --aug_jitter: set to enable color jitter augmentations. It is recommended to add this flag for better color generalization, but it will slow down the training process.
  • --num_action [Na]: the action horizon (chunk size).
  • --voxel_size [Sv]: the size of a volume element (voxel) in the voxel grid.
  • --obs_feature_dim [Do]: the feature dimension of the observation.
  • --hidden_dim [Dh]: the hidden dimension in the transformer.
  • --nheads [Nh]: the number of heads in the transformer.
  • --num_encoder_layers [Ne]: the number of encoder layers in the transformer.
  • --num_decoder_layers [Nd]: the number of decoder layers in the transformer.
  • --dim_feedforward [Dff]: the feedforward dimension in the transformer.
  • --dropout [Pd]: the dropout ratio.
  • --ckpt_dir [ckpt_dir]: the checkpoint directory.
  • --resume_ckpt [ckpt_path]: from which checkpoint to resume training.
  • --resume_epoch [epoch]: from which epoch to resume training.
  • --lr [lr]: the learning rate.
  • --batch_size [bs]: the batch size. It is recommended to use a large batch size to stabilize the training, such as 120 or 240.
  • --num_epochs [Nep]: the epoch number.
  • --save_epochs [Nsep]: how often to automatically save a checkpoint, measured in epochs.
  • --num_workers [Nw]: the number of workers in dataloader.
  • --seed [seed]: the seed number.
  • --num_targets: number of targets to use.

The training script uses a distributed data parallel training framework, and you might need to adjust some settings (like master_addr and master_port) according to your own training server.
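For reference, the rendezvous settings boil down to the standard PyTorch distributed environment variables. The following is a generic DDP initialization sketch, not a copy of the HistRISE training code, and the repository's launcher may handle this differently:

```python
import os
import torch
import torch.distributed as dist

def setup_distributed():
    """Initialize one process of a DDP run; MASTER_ADDR/MASTER_PORT must match your server."""
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl")      # rank/world size are read from the launcher's env
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

# The policy would then be wrapped as:
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```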

πŸ€– Evaluation

Here we provide sample real-world evaluation code based on our hardware (Flexiv Rizon 4 robotic arm, Dahuan AG-95 gripper, Intel RealSense camera). For other hardware settings, please follow the deployment guide to modify the evaluation script.

At the beginning of the evaluation, you need to annotate the objects of interest on the screen: the interface asks you to select or mark the target objects, and the tracker then tracks them in real time, providing target localization for the subsequent robot actions.
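The annotation step is typically just a few clicks on the first camera frame, which are then used as point prompts for SAM2. Here is a minimal OpenCV sketch of the idea (an illustration only, not the repository's actual interface):

```python
import cv2

def annotate_targets(frame, num_targets):
    """Collect one positive click per target object on the first camera frame."""
    clicks = []

    def on_mouse(event, x, y, flags, param):
        if event == cv2.EVENT_LBUTTONDOWN and len(clicks) < num_targets:
            clicks.append((x, y))

    cv2.namedWindow("annotate")
    cv2.setMouseCallback("annotate", on_mouse)
    while len(clicks) < num_targets:
        vis = frame.copy()
        for (x, y) in clicks:
            cv2.circle(vis, (x, y), 5, (0, 255, 0), -1)
        cv2.imshow("annotate", vis)
        if cv2.waitKey(30) == 27:  # Esc aborts early
            break
    cv2.destroyWindow("annotate")
    return clicks  # point prompts for SAM2 segmentation
```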

Modify the arguments in command_eval.sh, then

conda activate rise
bash command_eval.sh

Here are the argument explanations in the evaluation process:

  • --ckpt [ckpt_path]: the checkpoint to be evaluated.
  • --calib [calib_dir]: the calibration directory.
  • --num_inference_step [Ni]: how often to perform a policy inference, measured in steps.
  • --max_steps [Nstep]: maximum steps for an evaluation.
  • --vis: set to enable Open3D visualization after every inference. This visualization is blocking and will prevent the evaluation process from continuing while the window is open.
  • --discretize_rotation: set to discretize the rotation process to prevent the robot from rotating too fast.
  • --ensemble_mode [mode]: the temporal ensemble mode (a sketch of these modes follows this list).
    • [mode] = "new": use the newest predicted action in this step.
    • [mode] = "old": use the oldest predicted action in this step.
    • [mode] = "avg": use the average of the predicted actions in this step.
    • [mode] = "act": use the aggregated weighted average of predicted actions in this step. The weights are set following ACT.
    • [mode] = "hato": use the aggregated weighted average of predicted actions in this step. The weights are set following HATO.
  • The other arguments remain the same as in the training script.
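To make the ensemble modes concrete, here is a minimal sketch of temporal ensembling over overlapping action chunks. The exponential weighting follows the spirit of the original ACT implementation; the weights used for the "hato" mode differ and are omitted here:

```python
import numpy as np

def temporal_ensemble(chunks, step, mode="act", k=0.01):
    """
    chunks: list of (start_step, actions) pairs ordered oldest first, where `actions`
            has shape (Na, action_dim) and was predicted at environment step `start_step`.
    Returns the ensembled action to execute at `step`.
    """
    # Every prediction whose chunk covers the current step, oldest first.
    preds = np.stack([
        actions[step - start]
        for start, actions in chunks
        if 0 <= step - start < len(actions)
    ])
    if mode == "new":
        return preds[-1]               # newest prediction only
    if mode == "old":
        return preds[0]                # oldest prediction only
    if mode == "avg":
        return preds.mean(axis=0)      # uniform average
    if mode == "act":
        # ACT-style decay: index 0 is the oldest prediction and receives the largest weight.
        weights = np.exp(-k * np.arange(len(preds)))
        weights /= weights.sum()
        return (weights[:, None] * preds).sum(axis=0)
    raise ValueError(f"unsupported mode: {mode}")   # "hato" uses its own weighting scheme
```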

βš™οΈ History-Aware Visuomotor Policy Architecture

[Architecture overview figure]

First, we use SAM2 to detect and segment all task-related objects. From each object, we sample a set of points and track them over time with an off-the-shelf point tracker. A track encoder then compresses these tracks into compact tokens by first embedding each patch with an MLP and applying cross-attention with temporal encodings. Finally, the track tokens from all points of an object are aggregated into a history-aware object token. Together with the original observation tokens and other tokens, these are fed into the transformer backbone of the visuomotor policy, which then outputs history-aware robot actions.
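The track encoder described above can be pictured with the following minimal PyTorch sketch. Layer sizes, the patching scheme (one timestep per patch here), the learned temporal encoding, and the mean aggregation over points are illustrative assumptions; the actual module lives in the repository's model code:

```python
import torch
import torch.nn as nn

class TrackEncoder(nn.Module):
    """Compress per-point 3D tracks into a single history-aware object token (illustrative)."""

    def __init__(self, track_dim=3, hidden_dim=256, num_heads=8, max_len=128):
        super().__init__()
        self.patch_mlp = nn.Sequential(                     # embed each track patch with an MLP
            nn.Linear(track_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        self.temporal_pos = nn.Parameter(torch.randn(max_len, hidden_dim) * 0.02)  # temporal encoding
        self.query = nn.Parameter(torch.randn(1, hidden_dim) * 0.02)               # one query per track
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, tracks):
        # tracks: (B, P, T, 3) -- B objects, P sampled points per object, T tracked timesteps.
        B, P, T, _ = tracks.shape
        x = self.patch_mlp(tracks.reshape(B * P, T, -1))    # (B*P, T, D) patch embeddings
        x = x + self.temporal_pos[:T]                       # add temporal encodings
        q = self.query.expand(B * P, 1, -1)                 # (B*P, 1, D) learned query
        track_tokens, _ = self.cross_attn(q, x, x)          # cross-attention over the time axis
        track_tokens = track_tokens.reshape(B, P, -1)       # one compact token per tracked point
        return track_tokens.mean(dim=1)                     # aggregate points into an object token
```

For example, TrackEncoder()(torch.randn(2, 32, 64, 3)) yields a (2, 256) tensor: one history-aware token for each of two objects, given 32 sampled points tracked over 64 frames. These object tokens are then fed to the policy transformer together with the observation tokens.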

πŸ™ Acknowledgement

  • Our codebase is built upon RISE.
  • Our diffusion module is adapted from Diffusion Policy. This part is under MIT License.
  • Our transformer module is adapted from ACT, which uses DETR in its implementation. The DETR part is under the Apache 2.0 License.
  • Our Minkowski ResNet observation encoder is adapted from the examples of the MinkowskiEngine repository. This part is under MIT License.
  • Our temporal ensemble implementation is inspired by the recent HATO project.

✍️ Citation

@article{chen2025history,
    title   = {History-Aware Visuomotor Policy Learning via Point Tracking},
    author  = {Chen, Jingjing and Fang, Hongjie and Wang, Chenxi and Wang, Shiquan and Lu, Cewu},
    journal = {arXiv preprint arXiv:2509.17141},
    year    = {2025}
}

πŸ“ƒ License

HistRISE by Jingjing Chen, Hongjie Fang, Chenxi Wang, Shiquan Wang and Cewu Lu is licensed under CC BY-NC-SA 4.0
