
X-Capture: An Open-Source Portable Device for Multi-Sensory Learning

Samuel Clarke, Suzannah Wistreich, Yanjie Ze, Jiajun Wu
Stanford University
International Conference on Computer Vision (ICCV) 2025

Abstract

Understanding objects through multiple sensory modalities is fundamental to human perception, enabling cross-sensory integration and richer comprehension. For AI and robotic systems to replicate this ability, access to diverse, high-quality multi-sensory data is critical. Existing datasets are often limited by their focus on controlled environments, simulated objects, or restricted modality pairings. We introduce X-Capture, an open-source, portable, and cost-effective device for real-world multi-sensory data collection, capable of capturing correlated RGBD images, tactile readings, and impact audio. With a build cost under $1,000, X-Capture democratizes the creation of multi-sensory datasets, requiring only consumer-grade tools for assembly. Using X-Capture, we curate a sample dataset of 3,600 total points on 600 everyday objects from diverse, real-world environments, offering both richness and variety. Our experiments demonstrate the value of both the quantity and the sensory breadth of our data for both pretraining and fine-tuning multi-modal representations for object-centric tasks such as cross-sensory retrieval and reconstruction. X-Capture lays the groundwork for advancing human-like sensory representations in AI, emphasizing scalability, accessibility, and real-world applicability.

Dataset

About X-Capture

The X-Capture dataset contains multi-sensory data collected from 600 real-world objects in nine in-the-wild environments. We provide RGB-D, acoustic, tactile, and 3D data. Each object has six recorded collection points, covering diverse locations on the object.
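For reference, below is a minimal sketch of iterating over the data after unzipping. The directory layout and file names here are assumptions for illustration, not the dataset's documented structure; check the README that ships with the archive.

# Sketch of walking a per-object, per-point layout. All paths and file
# names below are hypothetical -- adjust to the actual archive structure.
from pathlib import Path

DATA_ROOT = Path("XCapture_data")  # assumed unzip location

for obj_dir in sorted(DATA_ROOT.iterdir()):
    if not obj_dir.is_dir():
        continue
    # Each object has six collection points; assume one subfolder per point.
    for point_dir in sorted(obj_dir.glob("point_*")):
        rgb = point_dir / "rgb.png"      # RGB image (name assumed)
        depth = point_dir / "depth.png"  # depth map (name assumed)
        touch = point_dir / "touch.png"  # tactile reading (name assumed)
        audio = point_dir / "audio.wav"  # impact audio (name assumed)
        present = [p.name for p in (rgb, depth, touch, audio) if p.exists()]
        print(obj_dir.name, point_dir.name, present)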

Dataset Download

We host our dataset on Hugging Face. Use the following command to download the full dataset (~50GB uncompressed):

wget https://huggingface.co/datasets/swistreich/XCapture/resolve/main/XCapture_data.zip -O XCapture_data.zip

Alternatively, the dataset can be browsed directly on Hugging Face at https://huggingface.co/datasets/swistreich/XCapture.
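The archive can also be fetched programmatically. Below is a minimal sketch using the huggingface_hub Python client; the repository and file names are taken from the wget command above.

# Download the dataset archive via the huggingface_hub client.
from huggingface_hub import hf_hub_download

zip_path = hf_hub_download(
    repo_id="swistreich/XCapture",
    filename="XCapture_data.zip",
    repo_type="dataset",
)
print("Downloaded to:", zip_path)  # local path to the cached zip file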

Explore Samples

We include some samples from the X-Capture dataset below. The interactive viewer on the project page covers five example objects (cat brush, computer speaker, metal box, insulated cup, storage bowl); selecting an object and one of its six collection points shows the RGB image, tactile reading, depth map, and impact audio recorded at that point, with the target point marked on the RGB and depth views.

Hardware

We develop custom hardware that fits vision, touch, and audio sensing into a single handheld package. The video below breaks down the device hardware and shows it in action. A full release of our hardware designs is coming soon (contact Samuel Clarke for more information).

Results

X-to-2D/3D Generation. In the interactive demo on the project page, a single sensory input (RGB, audio, or touch) from one of four example objects (glass tube, bottle opener, watch, key fob) is passed through our encoder, and the resulting embedding conditions Shap-E for 3D generation and Stable Diffusion for 2D image generation.

Zero-Shot Audio-Based Detection. A second demo combines an input image with an impact-audio clip to detect the matching object in the image, zero-shot.
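For intuition, here is a minimal sketch of cross-sensory retrieval in a shared embedding space, one of the object-centric tasks named in the abstract. This is not our released code: the embedding dimension is arbitrary, the random vectors are hypothetical stand-ins for outputs of trained modality encoders, and cosine similarity over unit-normalized embeddings is one standard retrieval choice.

# Sketch of cross-sensory retrieval: rank gallery embeddings (e.g. from RGB)
# by cosine similarity to a query embedding (e.g. from impact audio).
import numpy as np

def cosine_retrieve(query_emb: np.ndarray, gallery_embs: np.ndarray, k: int = 5):
    """Return indices of the k gallery embeddings most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                   # cosine similarity to each gallery item
    return np.argsort(-sims)[:k]   # top-k indices by descending similarity

# Usage with hypothetical embeddings: retrieve RGB views matching an audio query.
rng = np.random.default_rng(0)
audio_query = rng.normal(size=128)         # stand-in for an audio embedding
rgb_gallery = rng.normal(size=(600, 128))  # stand-in for 600 RGB embeddings
print(cosine_retrieve(audio_query, rgb_gallery))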

Video

BibTeX

@inproceedings{clarke2025xcapture,
    title={X-Capture: An Open-Source Portable Device for Multi-Sensory Learning},
    author={Samuel Clarke and Suzannah Wistreich and Yanjie Ze and Jiajun Wu},
    booktitle={International Conference on Computer Vision (ICCV)},
    year={2025},
    eprint={2504.02318},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2504.02318},
}