CVPR 2026

SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild

Patrick Rim (1,2), Kevin Harris (1), Braden Copple (1), Shangchen Han (1), Xu Xie (1), Ivan Shugurov (1),
Sizhe An (1), He Wen (1), Alex Wong (2), Tomas Hodan (1), Kun He (1)

(1) Meta Reality Labs | (2) Yale University

SHOW3D teaser: in-the-wild hand–object interactions with 3D annotations

SHOW3D is the first dataset of in-the-wild hand–object interactions with accurate 3D annotations as well as text descriptions, captured with a novel mobile multi-camera rig across diverse indoor and outdoor scenes. Overlays show 3D annotations projected onto egocentric images (hands in red and blue, object in green).

01

TL;DR

SHOW3D is the first large-scale dataset of people using their hands to manipulate objects in the wild — outdoors, on the move, across everyday environments — captured with studio-grade 3D ground truth: full hand meshes, 6DoF object poses, and natural-language action captions.

Why it matters for robotics & physical AI

Dexterous manipulation is the central bottleneck for physical AI — the robots and embodied agents that are the next frontier of the field. But nearly all hand–object data is recorded in controlled studios, so policies trained on it stumble the moment they meet the real world. SHOW3D delivers realistic, accurately labeled human demonstrations at scale, giving manipulation and world-model policies the in-the-wild grounding they need to generalize from the lab to reality.

02

Contributions

Capture System
A mobile data capture system: The first capture setup for recording and automatically generating 3D annotations of in-the-wild hand–object interactions. Our lightweight backpack-style rig enables accurate ground-truth hand and object pose annotation while users move freely in diverse environments.
Ground Truth
Ego-exo 3D ground truth: A pipeline to annotate 3D hand and object poses via ego-exo fusion with automatic filtering via confidence estimates. We rigorously evaluate quality, demonstrating sub-centimeter median accuracy.
Dataset
Large and diverse dataset: SHOW3D comprises 4.3M frames of synchronized multi-view images with comprehensive annotations: 3D hand poses and MANO meshes, 6DoF object poses with 3D models, and action-level text captions.
Utility
Demonstrated utility: Models trained on SHOW3D transfer to other datasets better than models trained on those datasets transfer to SHOW3D. Text-conditioned forecasting experiments confirm the value of our semantic annotations.

03

Dataset

4.3M Total Frames
38 Subjects
21 Objects
30+ Locations
| Dataset | Ego / Exo | Frames | Hand Pose | Obj Pose | Objects | Subjects | Environment | Annotation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SHOW3D (ours) | 2 / 8 | 4.3M | Both / Mesh | | 21 | 38 | In-the-wild | Mono + optimization |
| GigaHands | 0 / 51 | 3.7M | Both / Mesh | | 417 | 51 | Studio | RGB + optimization |
| ARCTIC | 1 / 8 | 2.1M | Both / Mesh | | 11 | 10 | Studio | MoCap |
| HOT3D | 3 / 0 | 1.7M | Both / Mesh | | 33 | 19 | Studio | MoCap |
| TACO | 1 / 12 | 363K | Both / Mesh | | 196 | 14 | Studio | MoCap |
| OakInk2 | 1 / 3 | 993K | Both / Mesh | | 75 | 9 | Studio | MoCap |
| HOI4D | 2 / 0 | 2.4M | Both / Mesh | | 800 | 4 | Studio | RGB-D + manual |
| HO-Cap | 1 / 9 | 699K | Both / Mesh | | 64 | 9 | Studio | RGB-D + optimization |
| HOGraspNet | 0 / 4 | 1.5M | Single / Mesh | | 30 | 99 | Studio | RGB-D + optimization |
| Ego-Exo4D | 2 / 4 | >100M | Both / Skel. | | Many | 740 | In-the-wild | RGB + manual |
| AssemblyHands | 4 / 8 | 203K | Both / Skel. | | Many | 20 | Studio | RGB + manual |
| EgoDex | 1 / 0 | 90M | Both / Skel. | | Many | | Studio-like | Apple Vision Pro API |

Comparison with existing hand–object interaction datasets. SHOW3D uniquely combines in-the-wild capture with dense, accurate 3D annotations for both hands and objects.

SHOW3D dataset statistics

Dataset statistics. Participant distribution (left) and number of recordings per object (right).

Cross-dataset UMAP feature embedding

Cross-dataset feature diversity. UMAP embeddings show that SHOW3D (pink) spans a broad visual manifold, bridging compact studio dataset clusters (GigaHands, HOT3D, ARCTIC).

04

Mobile Capture System

We design a portable, backpack-style multi-camera capture rig weighing roughly eight kilograms. Eight monochrome fisheye cameras are rigidly mounted in a half-dome configuration. Participants also wear a Meta Quest 3 headset providing two additional egocentric views. All ten cameras are hardware-synchronized at 60 Hz and precisely calibrated into a shared 3D reference frame, enabling capture in diverse environments—including outdoors—without restricting natural range of motion.

SHOW3D mobile multi-camera capture rig

Our mobile multi-camera capture rig. Left: hardware layout with five MoCap cameras (red), eight exocentric fisheye cameras in a half-dome configuration (green), and two egocentric cameras on the Meta Quest 3 headset (blue). Right: the rig in use during in-the-wild capture sessions.

05

Ego-Exo Annotation Pipeline

To generate accurate 3D annotations, we develop an ego-exo fusion pipeline. For hand pose, we fuse 2D keypoint predictions from Sapiens and InterNet via RANSAC-based triangulation across all ten cameras, then fit personalized hand meshes via Inverse Kinematics. For object pose, we extend CNOS, FoundPose, and GoTrack with multi-view gPnP to estimate and refine 6DoF object poses. Both components produce confidence estimates used for automated quality filtering.

SHOW3D ego-exo annotation pipeline

Ego-exo annotation pipeline. (a) Multi-view fisheye input from ego and exo cameras. (b) Hand keypoints fused from Sapiens and InterNet, fitted via Inverse Kinematics. (c) CAD-based 6DoF object pose via CNOS → FoundPose → GoTrack. (d) 3D annotations projected into ego cameras for downstream model training.
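The multi-view triangulation step for a single hand keypoint can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes ideal pinhole projection matrices (the actual rig uses fisheye cameras), it enumerates every camera pair as a minimal sample instead of drawing random samples as classic RANSAC does, and it uses the inlier fraction as a stand-in for the pipeline's confidence estimate. All function names and the 3-pixel inlier threshold are hypothetical.

```python
import itertools
import numpy as np


def triangulate_dlt(proj_mats, points_2d):
    """Linear (DLT) triangulation of one 3D point from two or more views."""
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    # The homogeneous solution is the right singular vector with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]


def reprojection_errors(proj_mats, points_2d, X):
    """Pixel distance between each 2D observation and the projection of X."""
    proj = np.einsum("vij,j->vi", proj_mats, np.append(X, 1.0))
    uv = proj[:, :2] / proj[:, 2:3]
    return np.linalg.norm(uv - points_2d, axis=1)


def ransac_triangulate(proj_mats, points_2d, inlier_px=3.0):
    """Use every camera pair as a minimal sample; keep the largest consensus set.

    inlier_px is a hypothetical threshold chosen for illustration only.
    """
    proj_mats = np.asarray(proj_mats, dtype=float)
    points_2d = np.asarray(points_2d, dtype=float)
    best = np.zeros(len(proj_mats), dtype=bool)
    for i, j in itertools.combinations(range(len(proj_mats)), 2):
        X = triangulate_dlt(proj_mats[[i, j]], points_2d[[i, j]])
        inliers = reprojection_errors(proj_mats, points_2d, X) < inlier_px
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 2:
        return None, best, 0.0
    # Refit on all inlier views; the inlier fraction serves as a crude confidence.
    X = triangulate_dlt(proj_mats[best], points_2d[best])
    return X, best, float(best.mean())


# Synthetic demo: five pinhole cameras along the x-axis, one corrupted detection.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
centers = [np.array([0.2 * k, 0.0, 0.0]) for k in range(5)]
P_all = np.stack([K @ np.hstack([np.eye(3), -c[:, None]]) for c in centers])
X_true = np.array([0.1, -0.05, 2.0])
proj = P_all @ np.append(X_true, 1.0)
uv = proj[:, :2] / proj[:, 2:3]
uv[4] += np.array([40.0, -25.0])  # simulate one bad 2D keypoint prediction
X_est, inlier_mask, conf = ransac_triangulate(P_all, uv)
print(X_est, inlier_mask, conf)
```

In this toy setup the corrupted view is rejected and the point is recovered from the four remaining views. A frame whose keypoints have low inlier fractions would be a natural candidate for automated filtering.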

06

Experiments

3D Hand Pose Estimation

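We train UmeTrack on different dataset combinations and evaluate cross-dataset generalization. Models trained only on studio datasets (UmeTrack, HOT3D) perform substantially worse when evaluated on in-the-wild SHOW3D data (16.4 to 22.2 mm MKPE, versus 15.5 mm when training on SHOW3D alone). Adding SHOW3D to the training set improves performance on the SHOW3D and HOT3D test sets, while results on the UmeTrack test set remain essentially unchanged (9.6 mm vs. 9.5 mm).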

| # | Training set | Test set | MKPE (mm) ↓ |
| --- | --- | --- | --- |
| 1 | UmeTrack | SHOW3D | 22.2 |
| 2 | HOT3D | SHOW3D | 19.6 |
| 3 | UmeTrack + HOT3D | SHOW3D | 16.4 |
| 4 | SHOW3D | SHOW3D | 15.5 |
| 5 | UmeTrack + HOT3D + SHOW3D | SHOW3D | 14.3 |
| 6 | HOT3D | HOT3D | 14.0 |
| 7 | UmeTrack + HOT3D | HOT3D | 12.7 |
| 8 | UmeTrack + HOT3D + SHOW3D | HOT3D | 12.3 |
| 9 | UmeTrack | UmeTrack | 9.7 |
| 10 | UmeTrack + HOT3D | UmeTrack | 9.5 |
| 11 | UmeTrack + HOT3D + SHOW3D | UmeTrack | 9.6 |

3D hand pose estimation (MKPE, mm, lower is better). Models trained on existing studio datasets generalize noticeably worse to SHOW3D, highlighting the increased difficulty of in-the-wild data. Adding SHOW3D improves performance on both SHOW3D and HOT3D test sets.
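For readers unfamiliar with the metric, the sketch below shows one way to compute it, assuming MKPE denotes mean keypoint position error: the Euclidean distance between predicted and ground-truth 3D keypoints, averaged over keypoints and frames. The function name, the optional validity mask, the use of 21 keypoints per hand, and the absence of any root or scale alignment are assumptions made for illustration; the page does not specify the evaluation protocol.

```python
import numpy as np


def mkpe_mm(pred, gt, valid=None):
    """Mean keypoint position error in millimetres.

    pred, gt: arrays of shape (num_frames, num_keypoints, 3), in mm.
    valid:    optional boolean mask of shape (num_frames, num_keypoints)
              marking which keypoints have ground truth (hypothetical).
    """
    err = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)
    if valid is not None:
        err = err[np.asarray(valid, dtype=bool)]
    return float(err.mean())


# Toy check: every predicted keypoint is offset by (3, 4, 0) mm,
# so each per-keypoint error is 5 mm.
gt = np.zeros((2, 21, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
example_mkpe = mkpe_mm(pred, gt)
print(example_mkpe)
```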

Hand–Object Interaction Field Estimation

We evaluate cross-dataset generalization of the InterField model between SHOW3D and HOT3D. Training on SHOW3D and testing on HOT3D outperforms the reverse by a wide margin, confirming that in-the-wild data captures a broader distribution of interaction patterns.

| Train set | Test set | ADE (mm) ↓ | ACC (m/s²) ↓ |
| --- | --- | --- | --- |
| SHOW3D | HOT3D | 14.70 | 4.05 |
| HOT3D | HOT3D | 11.29 | 3.21 |
| HOT3D + SHOW3D | HOT3D | 8.80 | 2.16 |
| HOT3D | SHOW3D | 22.57 | 5.61 |
| SHOW3D | SHOW3D | 13.82 | 3.79 |
| SHOW3D + HOT3D | SHOW3D | 13.50 | 3.84 |

Interaction field estimation cross-dataset evaluation. Training on SHOW3D achieves 14.70 mm ADE on HOT3D, versus 22.57 mm (+54%) in the reverse direction — demonstrating that SHOW3D's broader environmental distribution enables better generalization.
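The +54% figure in the caption can be reproduced from the table values. The short sketch below does that arithmetic and also computes the relative change from adding SHOW3D to HOT3D training; the dictionary layout and the helper name are illustrative choices, and the numbers are copied from the table above.

```python
# (train set, test set) -> (ADE in mm, ACC in m/s^2), copied from the table.
results = {
    ("SHOW3D", "HOT3D"): (14.70, 4.05),
    ("HOT3D", "HOT3D"): (11.29, 3.21),
    ("HOT3D + SHOW3D", "HOT3D"): (8.80, 2.16),
    ("HOT3D", "SHOW3D"): (22.57, 5.61),
    ("SHOW3D", "SHOW3D"): (13.82, 3.79),
    ("SHOW3D + HOT3D", "SHOW3D"): (13.50, 3.84),
}


def relative_change_pct(value, baseline):
    """Signed percentage change of value relative to baseline."""
    return 100.0 * (value - baseline) / baseline


# Cross-dataset gap: HOT3D -> SHOW3D ADE relative to SHOW3D -> HOT3D ADE.
cross_gap = relative_change_pct(
    results[("HOT3D", "SHOW3D")][0], results[("SHOW3D", "HOT3D")][0]
)

# Effect of adding SHOW3D to HOT3D training, evaluated on HOT3D.
joint_gain = relative_change_pct(
    results[("HOT3D + SHOW3D", "HOT3D")][0], results[("HOT3D", "HOT3D")][0]
)

print(f"cross-dataset ADE gap: {cross_gap:+.1f}%")
print(f"ADE change on HOT3D from adding SHOW3D: {joint_gain:+.1f}%")
```

By the same arithmetic, adding SHOW3D to HOT3D training lowers ADE on the HOT3D test set by roughly 22% relative to training on HOT3D alone.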

Text-Driven 6DoF Object Pose Forecasting

We evaluate whether natural language descriptions improve future object pose prediction. Text conditioning reduces mean forecasting error at both prediction horizons and lowers the error for nearly every object, confirming the value of SHOW3D's semantic annotations.

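| Horizon | Condition | aria | bowl | cansoup | dinotoy | dumbbell | juice | keyboard | milk | mouse | mug | mustard | vase | waffles | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 30 frames | w/o text | 58.5 | 43.1 | 28.3 | 45.7 | 89.7 | 30.3 | 51.0 | 31.9 | 26.6 | 32.6 | 57.5 | 24.2 | 36.1 | 42.7 |
| 30 frames | w/ text | 49.0 | 37.5 | 25.0 | 34.0 | 73.8 | 19.3 | 37.3 | 19.5 | 14.5 | 21.6 | 16.3 | 20.3 | 27.0 | 30.4 |
| 60 frames | w/o text | 63.6 | 40.5 | 29.0 | 46.4 | 95.9 | 33.1 | 66.5 | 22.6 | 27.5 | 37.4 | 77.7 | 27.3 | 39.9 | 46.7 |
| 60 frames | w/ text | 57.0 | 32.7 | 28.4 | 37.5 | 89.0 | 21.1 | 34.6 | 25.2 | 24.7 | 24.4 | 19.0 | 26.3 | 35.1 | 35.0 |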

Text-driven 6DoF object pose forecasting (average translation error, mm, lower is better). Text conditioning reduces mean error by 29% at 30 frames and 25% at 60 frames, with consistent improvements across nearly all objects.
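The reported reductions can be checked directly against the per-object errors. The sketch below recomputes the mean error and the relative reduction at each horizon and lists any object for which text conditioning increases the error. The per-object values are read from the table above, assuming each entry carries one decimal place; variable names are illustrative.

```python
objects = ["aria", "bowl", "cansoup", "dinotoy", "dumbbell", "juice", "keyboard",
           "milk", "mouse", "mug", "mustard", "vase", "waffles"]

# Average translation error in mm, keyed by (horizon in frames, text used?).
errors_mm = {
    (30, False): [58.5, 43.1, 28.3, 45.7, 89.7, 30.3, 51.0,
                  31.9, 26.6, 32.6, 57.5, 24.2, 36.1],
    (30, True):  [49.0, 37.5, 25.0, 34.0, 73.8, 19.3, 37.3,
                  19.5, 14.5, 21.6, 16.3, 20.3, 27.0],
    (60, False): [63.6, 40.5, 29.0, 46.4, 95.9, 33.1, 66.5,
                  22.6, 27.5, 37.4, 77.7, 27.3, 39.9],
    (60, True):  [57.0, 32.7, 28.4, 37.5, 89.0, 21.1, 34.6,
                  25.2, 24.7, 24.4, 19.0, 26.3, 35.1],
}

reductions = {}  # horizon -> relative reduction of the mean error, in percent
worse = {}       # horizon -> objects whose error increases with text

for horizon in (30, 60):
    base = errors_mm[(horizon, False)]
    text = errors_mm[(horizon, True)]
    mean_base = sum(base) / len(base)
    mean_text = sum(text) / len(text)
    reductions[horizon] = 100.0 * (mean_base - mean_text) / mean_base
    worse[horizon] = [name for name, b, t in zip(objects, base, text) if t > b]
    print(f"{horizon} frames: {mean_base:.1f} -> {mean_text:.1f} mm "
          f"({reductions[horizon]:.0f}% lower), worse with text: {worse[horizon]}")
```

With these values the only regression is milk at the 60-frame horizon (22.6 mm without text, 25.2 mm with text), which matches the caption's "nearly all objects".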

Ground Truth Visualizations

Ground truth examples from SHOW3D

More ground truth examples from SHOW3D

Ground truth examples. Hand pose (red and blue), object pose (green), and text captions across diverse in-the-wild environments. Our pipeline achieves sub-centimeter median accuracy validated against independent gold-standard references.

07

BibTeX

@article{rim2026show3d,
  title   = {SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild},
  author  = {Rim, Patrick and Harris, Kevin and Copple, Braden and Han, Shangchen and
             Xie, Xu and Shugurov, Ivan and An, Sizhe and Wen, He and
             Wong, Alex and Hodan, Tomas and others},
  journal = {arXiv preprint arXiv:2603.28760},
  year    = {2026}
}