ESPADA: Execution Speedup via Spatially Aware Demonstration Downsampling

Byungju Kim1,2,*, Jinu Pahk1,2,*, Chungwoo Lee1,*, Jaejoon Kim1,2,*, Jangha Lee1,2,*,
Theo Taeyeong Kim1,2, Kyuhwan Shim2, Junki Lee1,2, Byoung-Tak Zhang1,2

1 Tommoro Robotics,  2 Seoul National University

* Equal contribution.

TL;DR

ESPADA accelerates slow imitation-learning policies by using a VLM–LLM pipeline with 3D gripper–object relations to semantically segment demonstrations, speeding up only the non-critical phases. This yields approximately a 2× execution speedup while maintaining or improving task success, making robot control more efficient.

Abstract

Behavior-cloning-based visuomotor policies enable precise manipulation but often inherit the slow, cautious tempo of human demonstrations, limiting practical deployment. Prior acceleration methods, however, mainly rely on statistical or heuristic cues that ignore task semantics and can fail across diverse manipulation settings. We present ESPADA, a semantically and spatially aware framework that segments demonstrations using a VLM–LLM pipeline with 3D gripper–object relations, enabling aggressive downsampling only in non-critical segments while preserving precision-critical phases, without requiring extra data, architectural modifications, or retraining. To scale from a single annotated episode to the full dataset, ESPADA propagates segment labels via Dynamic Time Warping (DTW) on dynamics-only features. Across both simulation and real-world experiments with ACT and DP baselines, ESPADA achieves approximately a 2× speed-up while maintaining success rates, narrowing the gap between human demonstrations and efficient robot control.

Method

ESPADA extracts grounded object tracks using Grounded-SAM2 and Video Depth Anything, computes 3D gripper–object distances, and feeds these structured cues to a VLM that produces scene summaries. An LLM then uses the cues to segment each trajectory into precision-critical and casual spans. Only casual spans are aggressively downsampled, using a replicate-before-downsample scheme that preserves geometric consistency. Dataset-wide labels are propagated with banded Dynamic Time Warping on proprioception–action features.
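To make the spatial cue concrete, the sketch below shows one way the per-timestep gripper–object distance \( r_t \) could be computed from an object mask and a metric depth map. The helper names, the centroid-based distance, and the camera-frame convention are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def backproject(mask: np.ndarray, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift masked pixels to a 3D point cloud in the camera frame.

    mask:  HxW boolean object mask (e.g., from Grounded-SAM2)
    depth: HxW metric depth map (e.g., from Video Depth Anything)
    K:     3x3 camera intrinsics
    """
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)

def gripper_object_distance(mask, depth, K, gripper_xyz) -> float:
    """r_t: Euclidean distance from the gripper position (camera frame,
    from proprioception + extrinsics) to the object's 3D centroid."""
    points = backproject(mask, depth, K)
    return float(np.linalg.norm(points.mean(axis=0) - gripper_xyz))
```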

Module 1: Context- and Spatial-Aware Segmentation via VLM → LLM.
Module 2: Banded DTW Label Transfer and Segment-wise Downsampling.
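A minimal numpy sketch of Module 2, assuming per-timestep proprioception–action feature vectors and per-timestep segment labels on a single annotated reference episode. The Sakoe-Chiba band and the simple keep-every-k-th downsampling below are stand-ins for the paper's banded DTW and replicate-before-downsample scheme, not the exact implementation.

```python
import numpy as np

def banded_dtw_path(ref: np.ndarray, query: np.ndarray, band: int):
    """Sakoe-Chiba banded DTW over per-timestep feature vectors.
    `band` must be at least |len(ref) - len(query)| for a path to exist.
    Returns the warping path as (ref_index, query_index) pairs."""
    n, m = len(ref), len(query)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = np.linalg.norm(ref[i - 1] - query[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], n, m  # backtrack from (n, m) to (0, 0)
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)], key=lambda ij: cost[ij])
    return path[::-1]

def transfer_labels(ref_labels, path, query_len):
    """Copy segment labels from the annotated episode along the warping path."""
    labels = [None] * query_len
    for i, j in path:
        labels[j] = ref_labels[i]
    return labels

def downsample_casual(frames, labels, factor):
    """Segment-wise downsampling: keep every precision-critical frame,
    keep every `factor`-th frame inside casual spans."""
    kept, run = [], 0
    for frame, label in zip(frames, labels):
        if label == "precision":
            kept.append(frame)
            run = 0
        else:
            if run % factor == 0:
                kept.append(frame)
            run += 1
    return kept
```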

Results

Our experimental evaluation is guided by the following research questions:

RQ1. Success Rate

Does ESPADA achieve higher success rates than baselines across diverse manipulation tasks, even under more aggressive acceleration settings?

RQ2. Segmentation Accuracy

How accurately does ESPADA distinguish precision-critical from casual segments compared to entropy-based segmentation methods?

RQ3. Ablation Study

What are the respective roles of the 3D gripper–object distance \( r_t \) and VLM-generated scene descriptions in improving segmentation quality?

Real-world Experiments

ESPADA achieves up to a 2.4× speedup while maintaining or improving success rates across four tasks: Sort, Pen-in-Cup, Kitchenware handling, and Conveyor transfer. It preserves precision-critical phases, avoids over-acceleration failures, and consistently outperforms entropy-based methods.

Conveyor — ACT Original vs ESPADA (2× / 4×)

Kitchenware — ACT Original vs ESPADA (2× / 4×)

Simulation (ALOHA + Bigym)

In ALOHA-sim tasks (Insertion, Transfer Cube), ESPADA outperforms DemoSpeedup, achieving substantially higher IoU with ground-truth segment boundaries (e.g., 0.5166 vs. 0.0224 in the Insertion ablation). In Bigym, ESPADA achieves comparable success while providing stable acceleration even under noisy observations.
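The paper's exact IoU definition is not reproduced here; a common choice, sketched below, is the temporal IoU between predicted and ground-truth precision-critical masks over an episode's timesteps.

```python
import numpy as np

def segment_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Temporal IoU between boolean per-timestep precision masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 1.0
```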