RA-Touch: Retrieval-Augmented Touch Understanding with Enriched Visual Data
Yoorhim Cho*, Hongyeob Kim*, Semin Kim, Youjia Zhang, Yunseok Choi, Sungeun Hong†
* Denotes equal contribution
Sungkyunkwan University
We introduce RA-Touch, a retrieval-augmented framework that improves visuo-tactile perception by leveraging visual data enriched with tactile semantics. We carefully recaption a large-scale visual dataset with tactile-focused descriptions, yielding ImageNet-T, which enables the model to access tactile semantics typically absent from conventional visual datasets.
Full Abstract
Visuo-tactile perception aims to understand an object’s tactile properties, such as texture, softness, and rigidity. However, the field remains underexplored because collecting tactile data is costly and labor-intensive. We observe that visually distinct objects can exhibit similar surface textures or material properties. For example, a leather sofa and a leather jacket have different appearances but share similar tactile properties. This implies that tactile understanding can be guided by material cues in visual data, even without direct tactile supervision. In this paper, we introduce RA-Touch, a retrieval-augmented framework that improves visuo-tactile perception by leveraging visual data enriched with tactile semantics. We carefully recaption a large-scale visual dataset with tactile-focused descriptions, enabling the model to access tactile semantics typically absent from conventional visual datasets. A key challenge remains in effectively utilizing these tactile-aware external descriptions. RA-Touch addresses this by retrieving visual-textual representations aligned with tactile inputs and integrating them to focus on relevant textural and material properties. By outperforming prior methods on the TVL benchmark, our method demonstrates the potential of retrieval-based visual reuse for tactile understanding.

To set up the environment and install the package:

```bash
conda create -n ra-touch python=3.10 -y
conda activate ra-touch

git clone https://github.com/AIM-SKKU/RA-Touch.git
cd RA-Touch
pip install -r requirements.txt
pip install -e .
```

Alternatively, set up the environment with uv:

```bash
uv sync
source .venv/bin/activate
```

Download Pre-trained Weights: Please download the required pre-trained weights from the TVL repository:
- TVL Encoder weights (e.g., `tvl_enc_vitb.pth`, `tvl_enc_vits.pth`, `tvl_enc_vittiny.pth`)
- TVL-LLaMA weights (e.g., `tvl_llama_vitb.pth`, `tvl_llama_vits.pth`, `tvl_llama_vittiny.pth`)

Place these weights in the `./weights/` directory.
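As a reference, a minimal placement sketch for the ViT-Base checkpoints (assuming the files were downloaded into the repository root; adjust the source paths to wherever you saved them):

```bash
# Create the directory expected by the training and evaluation scripts
mkdir -p ./weights

# Move the downloaded TVL checkpoints into place (ViT-Base shown here)
mv tvl_enc_vitb.pth tvl_llama_vitb.pth ./weights/
```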
We used torch 2.5.1 for training and evaluation on A6000 GPUs.
To train the tactile-guided retriever:
```bash
bash scripts/train_retriever.sh <gpu_ids> <vit_type> <epochs> <batch_size> <port>
# Example: bash scripts/train_retriever.sh 0,1 base 60 256 23500
```

Arguments:
- `gpu_ids`: Comma-separated GPU IDs (e.g., "0,1,2,3")
- `vit_type`: Vision Transformer type ("tiny", "small", "base") - default: "base"
- `epochs`: Number of training epochs - default: 60
- `batch_size`: Training batch size - default: 256
- `port`: Distributed training port - default: 23500
Requirements:
- Pre-trained tactile encoder weights (e.g., `./weights/tvl_enc_vitb.pth`)
- Training data configuration in `configs/finetune-data-config.yaml`
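Before launching, a quick sanity check of the prerequisites listed above can save a failed run (a minimal sketch; swap in the encoder weights matching your chosen ViT size):

```bash
# Confirm the retriever training prerequisites are in place
ls ./weights/tvl_enc_vitb.pth           # pre-trained tactile encoder (ViT-Base)
ls configs/finetune-data-config.yaml    # training data configuration
```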
To train the full RA-Touch model with the texture-aware integrator:

```bash
bash scripts/train_ra_touch.sh <gpu_ids> <vit_type> <retriever_weight> <topk> <external_dataset> <retrieval_method> <port>
# Example: bash scripts/train_ra_touch.sh 0,1,2,3 base ./output/retriever_checkpoint.pth 5 imgnet_t_150k txt2txt 1113
```

Arguments:
- `gpu_ids`: Comma-separated GPU IDs
- `vit_type`: Vision Transformer type ("tiny", "small", "base")
- `retriever_weight`: Path to the trained retriever checkpoint
- `topk`: Number of top-k retrieved samples - default: 5
- `external_dataset`: External dataset for retrieval ("imgnet_t_10k", "imgnet_t_50k", "imgnet_t_100k", "imgnet_t_150k")
- `retrieval_method`: Retrieval method - default: "txt2txt"
- `port`: Distributed training port - default: 1113
Requirements:
- LLaMA-2 model in the `./llama-2/` directory
- Pre-trained TVL-LLaMA weights (e.g., `./weights/tvl_llama_vitb.pth`)
- ImageNet-T embeddings (e.g., `./data/embeddings/imagenet_t_150k_embeddings.npz`)
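The expected layout, following the example paths above, can be verified with a short check (a minimal sketch; filenames differ if you use another ViT size or embedding subset):

```bash
# Confirm the RA-Touch training prerequisites are in place
ls ./llama-2/                                          # LLaMA-2 model directory
ls ./weights/tvl_llama_vitb.pth                        # pre-trained TVL-LLaMA weights (ViT-Base)
ls ./data/embeddings/imagenet_t_150k_embeddings.npz    # ImageNet-T retrieval embeddings
```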
To evaluate the trained RA-Touch model:
```bash
bash scripts/eval_ra_touch.sh <gpu_ids> <vit_type> <retriever_ckpt> <ra_touch_ckpt> <topk> <external_dataset> <retrieval_method> <port>
# Example: bash scripts/eval_ra_touch.sh 0 base ./output/retriever_checkpoint.pth ./output/ra_touch_checkpoint.pth 5 imgnet_t_150k txt2txt 1113
```

Requirements:
- OpenAI API key in `scripts/openai_key.txt` for GPT evaluation
- TVL dataset path configured in the script
- Trained retriever and RA-Touch model checkpoints
We follow the same paired t-test as TVL. After obtaining your OpenAI API key, please place it in `scripts/openai_key.txt`.
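For example, the key can be written to the file the script reads (a minimal sketch; replace the placeholder with your actual key):

```bash
# Store the OpenAI API key where the evaluation script expects it
echo "sk-your-api-key" > scripts/openai_key.txt
```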
The ImageNet-T dataset is hosted on Hugging Face. To use it, download it either through the web interface or with git:
```bash
# install git-lfs
sudo apt install git-lfs
git lfs install

# clone the dataset
git clone https://huggingface.co/datasets/yoorhim/ImageNet-T

# or you can download the zip files manually from here:
# https://huggingface.co/datasets/yoorhim/ImageNet-T/tree/main
```
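If you grab the zip archives manually instead, a minimal extraction sketch (archive names vary; use whatever files are listed on the Hugging Face page, and the target directory here is only an assumed example, so adjust it to your data path):

```bash
# Extract all downloaded ImageNet-T archives into a single dataset directory
mkdir -p ./data/ImageNet-T
for f in *.zip; do
    unzip -q "$f" -d ./data/ImageNet-T/
done
```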
TBU
- Training Code
- Release ImageNet-T Dataset
This repository is built on top of the TVL repository.
```bibtex
@inproceedings{cho2025ra,
  title={RA-Touch: Retrieval-Augmented Touch Understanding with Enriched Visual Data},
  author={Cho, Yoorhim and Kim, Hongyeob and Kim, Semin and Zhang, Youjia and Choi, Yunseok and Hong, Sungeun},
  booktitle={ACM Multimedia 2025},
  year={2025}
}
```