ESPADA extracts grounded object tracks using Grounded-SAM2 and Video Depth Anything,
computes 3D gripper–object distances, and feeds structured cues into a VLM for scene summaries.
An LLM uses these cues to segment each trajectory into precision or casual spans.
Only casual spans are aggressively downsampled using replicate-before-downsample with geometric
consistency.
Dataset-wide labels are propagated using banded Dynamic Time Warping on proprioception–action features.
Context- and Spatial-Aware Segmentation via VLM → LLM
Object tracking with interactive keyframe seeding
First, we obtain open-vocabulary tracks from demonstration videos using Grounded-SAM2. In addition to text prompts, users can provide sparse keyframe annotations (boxes or point-groups) via a lightweight UI. We maintain a label↔id mapping across keyframes and perform IoU-based association to propagate user labels to SAM2 track IDs. During propagation, we use a keep-alive strategy (bbox carry-over for short outages) and periodic re-detection with Grounding DINO, reconnecting lost tracks via a score that mixes IoU and color-histogram similarity. This reduces track fragmentation and preserves object identity across occlusions.
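A minimal sketch of the reconnection score is shown below; the box/histogram field names, the 0.5 mixing weight, and the histogram-similarity form are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def color_hist(patch, bins=16):
    """Normalized per-channel color histogram of an HxWx3 uint8 crop."""
    hists = [np.histogram(patch[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
    h = np.concatenate(hists).astype(np.float64)
    return h / (h.sum() + 1e-9)

def reassociation_score(lost_track, detection, frame, w_iou=0.5):
    """Blend IoU of the carried-over box with appearance similarity of the crops.

    `lost_track['last_box']` / `['last_hist']` and `detection['box']` are assumed
    bookkeeping fields kept by the tracker, not a real library API.
    """
    geo = iou(lost_track["last_box"], detection["box"])
    x1, y1, x2, y2 = [int(v) for v in detection["box"]]
    # L1 distance between normalized histograms, mapped into a [0, 1] similarity.
    app = 1.0 - 0.5 * np.abs(color_hist(frame[y1:y2, x1:x2]) - lost_track["last_hist"]).sum()
    return w_iou * geo + (1.0 - w_iou) * app
```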
To bootstrap object grounding, we first sample ~10 representative frames from episode 0 and feed them into InternVL 3.5 to obtain a compact natural-language description of the overall task. For the same frames, we apply Grounding DINO v2 to detect and segment task-relevant entities such as left_gripper, right_gripper, and target objects (e.g., yellow_cup). If bounding box predictions fail for some frames, we allow lightweight manual correction (bounding box only) through the UI. The corrected boxes serve as anchors for SAM2, which then propagates object masks and bounding boxes consistently across the entire episode. This hybrid strategy (automatic detection + sparse manual fallback + SAM2 propagation) ensures that every frame receives reliable per-object segmentation, even under occlusion or detector failure.
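The following orchestration sketch summarizes the hybrid seeding flow. `detect`, `correct_ui`, and `propagate` are hypothetical callables standing in for Grounding DINO, the annotation UI, and SAM2 video propagation; the real interfaces differ.

```python
import numpy as np

def seed_and_propagate(frames, object_prompts, detect, correct_ui, propagate, num_keyframes=10):
    """Automatic detection on sampled keyframes, sparse manual fallback, then propagation."""
    keyframes = np.linspace(0, len(frames) - 1, num_keyframes).astype(int).tolist()
    anchors = {}                                        # frame index -> {label: box}
    for idx in keyframes:
        boxes = detect(frames[idx], object_prompts)     # {label: box or None}
        missing = [lbl for lbl, box in boxes.items() if box is None]
        if missing:
            boxes.update(correct_ui(frames[idx], missing))   # bounding-box-only correction
        anchors[idx] = {lbl: box for lbl, box in boxes.items() if box is not None}
    # SAM2 consumes the anchor boxes as prompts and propagates per-object masks
    # across the whole episode while a stable label <-> track-id mapping is kept.
    return propagate(frames, anchors)
```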
Depth estimation and 3D back-projection
We estimate per-frame depths with VDA/DA2 (metric or relative; optionally scaled by a factor \( z_{\mathrm{scale}} \)). Since the previous step provides the pixel coordinates \( (u, v) \) of each object of interest, we can recover its 3D position in the camera coordinate frame from the corresponding depth \( Z \) via standard back-projection:
\[
\mathbf{p} = Z K^{-1}[u, v, 1]^{\top}
\]
This yields a center_3d for each tracked mask. We then compute frame-wise gripper–object
distances,
\[
r_t(g,o)=\|\mathbf{p}^{(g)}_{t}-\mathbf{p}^{(o)}_{t}\|_2
\]
for \( g\in\{\text{gripper\_left},\text{gripper\_right}\} \) and task-relevant objects \( o \). For multi-view sequences, we build per-camera relations_3d from the set of \( r_t(g,o) \) values, preferring the head camera if present; otherwise we select the camera with the most valid relations in a given frame. We rely on temporal trends in \( r_t \) rather than absolute scale, avoiding the need for extrinsics.
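A minimal NumPy sketch of the back-projection and distance computation; the dictionary layout for tracked centers is an assumption used for illustration.

```python
import numpy as np

def backproject(u, v, Z, K):
    """Camera-frame 3D point p = Z * K^{-1} [u, v, 1]^T."""
    return Z * (np.linalg.inv(K) @ np.array([u, v, 1.0]))

def gripper_object_distances(centers_3d, grippers=("gripper_left", "gripper_right")):
    """Frame-wise r_t(g, o) from a dict {label: (T, 3) array of center_3d}."""
    relations = {}
    for g in grippers:
        if g not in centers_3d:
            continue
        for o, p_o in centers_3d.items():
            if o in grippers:
                continue
            # Euclidean distance per frame, shape (T,)
            relations[(g, o)] = np.linalg.norm(centers_3d[g] - p_o, axis=-1)
    return relations
```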
LLM-Based Segmentation Conditioned on VLM Summaries
From the sampled frames (typically 4–8) and their structured 3D cues, we query a VLM (InternVL) for a strict-JSON, chronologically ordered episode summary, which we attach to the LLM prompt as a task descriptor. Conditioned on this descriptor, we perform LLM-based segmentation as follows. The LLM receives (i) a JSONL stream with frame-wise center_3d and relations_3d for the full episode, and (ii) the VLM summary descriptor. It outputs non-overlapping, inclusive index ranges labeled precision or casual.
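For concreteness, the expected output has roughly the following shape (field names and values are illustrative, not the exact schema), together with a sanity check of the range constraints:

```python
# Illustrative shape of the strict-JSON segmentation output (field names assumed).
example_output = {
    "segments": [
        {"start": 0,   "end": 143, "label": "casual",    "confidence": 0.86},
        {"start": 144, "end": 297, "label": "precision", "confidence": 0.91},
        {"start": 298, "end": 511, "label": "casual",    "confidence": 0.78},
    ]
}

def check_segments(segments, num_frames):
    """Verify chronologically ordered, non-overlapping, inclusive index ranges."""
    prev_end = -1
    for seg in segments:
        assert seg["label"] in ("precision", "casual")
        assert prev_end < seg["start"] <= seg["end"] < num_frames
        prev_end = seg["end"]

check_segments(example_output["segments"], num_frames=512)
```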
We encode policy hints to favor robust, human-like chunks:
- Intent criteria. Sustained near-contact plateaus and low-variance micro-adjustments
\(\Rightarrow\) precision; long approach/retreat or persistent far separation \(\Rightarrow\)
casual.
- Stability. Minimum segment length \( L_{\min}=8 \); merge same-label segments across
gaps shorter than \( G_{\min}=5 \); require \( \ge 3 \) consecutive frames to switch labels
(hysteresis); ignore micro-oscillations shorter than \( L_{\mathrm{micro}}=6 \).
- Parsimony. Prefer 3–4 segments unless strong evidence suggests otherwise.
Because the model may leave small gaps when confidence is low, we run a deterministic coverage
completion pass: fill gaps by extending the nearest high-confidence neighbor that best matches
the
local \( r_t \) trend, then re-apply the stability rules. The final set provides full
frame coverage with per-segment confidence.
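A simplified sketch of this post-processing is given below. It covers gap merging, coverage completion by extending the higher-confidence neighbor, and a minimum-length rule; the hysteresis, micro-oscillation, and \( r_t \)-trend checks are omitted, and the segment dictionary fields follow the illustrative schema above.

```python
def enforce_stability(segments, num_frames, l_min=8, g_min=5):
    """Merge same-label segments across short gaps, fill remaining gaps, enforce min length."""
    segs = sorted(segments, key=lambda s: s["start"])
    merged = [dict(segs[0])]
    for seg in segs[1:]:
        prev = merged[-1]
        gap = seg["start"] - prev["end"] - 1
        if seg["label"] == prev["label"] and gap < g_min:
            prev["end"] = seg["end"]                     # merge across a short gap
        else:
            merged.append(dict(seg))
    # Coverage completion: extend the higher-confidence neighbor into each gap.
    merged[0]["start"] = 0
    for a, b in zip(merged, merged[1:]):
        if b["start"] > a["end"] + 1:
            cut = b["start"] if a.get("confidence", 0) >= b.get("confidence", 0) else a["end"] + 1
            a["end"], b["start"] = cut - 1, cut
    merged[-1]["end"] = num_frames - 1
    # Absorb segments shorter than l_min into their predecessor.
    out = [merged[0]]
    for seg in merged[1:]:
        if seg["end"] - seg["start"] + 1 < l_min:
            out[-1]["end"] = seg["end"]
        else:
            out.append(seg)
    return out
```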
Finally, demonstrations often span thousands of frames, easily exceeding LLM context limits, so we apply token-budgeted sampling and JSON slimming. We compute the maximum feasible sample count \( K \) by binary search over the measured per-frame JSON length and select \( K \) evenly spaced indices, ensuring trajectory-wide coverage under a fixed character budget. We further compact prompts by float rounding and whitespace-free JSON serialization, reducing token overhead by ~30–40% without changing semantics.
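A sketch of the budgeting logic under these assumptions (a character budget as a proxy for tokens; helper names are illustrative):

```python
import json

def slim(frame_record, ndigits=3):
    """Round floats and serialize without whitespace to shrink the prompt."""
    rounded = json.loads(json.dumps(frame_record),
                         parse_float=lambda x: round(float(x), ndigits))
    return json.dumps(rounded, separators=(",", ":"))

def fits(frame_records, k, char_budget):
    """Total slimmed length of k evenly spaced frames vs. the character budget."""
    n = len(frame_records)
    idx = [round(i * (n - 1) / max(k - 1, 1)) for i in range(k)]
    total = sum(len(slim(frame_records[i])) + 1 for i in idx)  # +1 per JSONL newline
    return total <= char_budget, idx

def budgeted_indices(frame_records, char_budget):
    """Binary-search the largest K whose K evenly spaced, slimmed frames fit the budget."""
    lo, hi, best = 1, len(frame_records), [0]
    while lo <= hi:
        mid = (lo + hi) // 2
        ok, idx = fits(frame_records, mid, char_budget)
        if ok:
            best, lo = idx, mid + 1
        else:
            hi = mid - 1
    return best
```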
Banded DTW Label Transfer from Episode-0
For datasets where only episode 0 is labeled, we propagate its segment labels (precision/casual) to the remaining episodes via banded Dynamic Time Warping (DTW).
Proprioceptive DTW Alignment
From each episode we build a per-frame feature vector using only proprioception and actions. Concretely, we concatenate z-scored features to form \( \phi_t \in \mathbb{R}^{D} \):
\[
\phi_t = \big[\, a_t,\ \Delta a_t,\ v_t,\ \Delta v_t,\ \|a_t\|,\ \|v_t\|,\ \|\Delta a_t\|,\ \|\Delta q_t\|,\ \|\Delta v_t\|,\ \angle(a_t,\, a_t+\Delta a_t),\ \angle(v_t,\, v_t+\Delta v_t) \,\big]
\]
where \( a_t \) are actions, \( q_t \) are joint positions, \( v_t \)
are joint velocities if available (otherwise we use \( \Delta q_t \) as a proxy), and
\( \angle(\cdot,\cdot) \) is the angle between successive vectors.
Given episode 0 features \( X_0 \) and target features \( X_k \) for episode
\( k \), we run DTW with a Sakoe–Chiba band of half-width \( b \). This yields an alignment path which
we convert into a monotone index map.
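As a concrete sketch of this alignment step (the band half-width and helper names are illustrative), the following NumPy snippet computes a Sakoe–Chiba-banded DTW path between two feature matrices and converts it into a monotone index map:

```python
import numpy as np

def banded_dtw_path(X0, Xk, band=64):
    """DTW with a Sakoe-Chiba band of half-width `band` frames; returns the warping path."""
    n, m = len(X0), len(Xk)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        center = int(round(i * m / n))          # band follows the length-scaled diagonal
        for j in range(max(1, center - band), min(m, center + band) + 1):
            cost = np.linalg.norm(X0[i - 1] - Xk[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to (0, 0).
    path, i, j = [], n, m
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.append((0, 0))
    return path[::-1]

def monotone_index_map(path, n):
    """Map each episode-0 frame to its matched target frame, enforcing monotonicity."""
    mapping = np.zeros(n, dtype=int)
    for i, j in path:
        mapping[i] = j
    return np.maximum.accumulate(mapping)
```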
Segment-wise Label Transfer and Refinement
For each labeled episode-0 segment, we map its endpoints through the DTW index map to obtain the target span, then snap both ends within a local window of \( \pm W \) frames (default \( W=12 \)) by minimizing the \( \ell_2 \) distance between short mean-pooled feature summaries. Mapped segments are sorted and trimmed to remove overlaps while preserving order. The banded DTW runtime is near-linear in sequence length, allowing fast transfers on CPU.
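A sketch of the endpoint snapping under the same assumptions (the pooling width is an illustrative choice):

```python
import numpy as np

def snap_endpoint(X_k, t0, ref_summary, W=12, pool=5):
    """Snap a mapped boundary t0 to the frame within +/-W whose local mean-pooled
    feature summary is closest (L2) to the episode-0 reference summary."""
    best_t, best_d = t0, np.inf
    for t in range(max(0, t0 - W), min(len(X_k), t0 + W + 1)):
        lo, hi = max(0, t - pool // 2), min(len(X_k), t + pool // 2 + 1)
        d = np.linalg.norm(X_k[lo:hi].mean(axis=0) - ref_summary)
        if d < best_d:
            best_t, best_d = t, d
    return best_t
```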
Segment-wise Downsampling and Dataset Compilation
Given the final segmentation, we construct an acceleration-aware dataset by applying
replicate-before-downsample with a larger downsampling factor for casual spans and a smaller
downsampling factor for precision spans.
Replicate-before-downsample
To maintain full state coverage under temporal compression, we adopt a replicate-before-downsample strategy. For a segment \( [s,e] \) and downsampling factor \( N \), we create \( N \) replicas with offsets \( m \in \{0,\dots,N-1\} \), each retaining the frames whose indices are congruent to \( m \) modulo \( N \). Taking the union across \( m \) recovers the original support, thereby preserving full state diversity in the downsampled dataset.
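A minimal sketch of the replication logic (the index-list representation of a segment is illustrative):

```python
def replicate_before_downsample(start, end, factor):
    """Return `factor` index lists for segment [start, end]; their union is the full segment."""
    return [
        [t for t in range(start, end + 1) if (t - start) % factor == m]
        for m in range(factor)
    ]

# The union across offsets recovers the original support (full state coverage).
reps = replicate_before_downsample(10, 29, 4)
assert sorted(set().union(*reps)) == list(range(10, 30))
```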
Geometric Consistency for Chunked Policies
Temporal acceleration alters the per-chunk spatial displacement, undermining the chunk horizon \( K \) at which the policy has been optimized to perform best. To maintain geometric fidelity under accelerated demonstrations, we adopt the geometry-consistent downsampling scheme and rescale the effective chunk horizon \( K' \) so that its spatial displacement remains consistent with the original:
\[
\sum_{k=0}^{K'-1}\|\Delta\mathbf{x}_{t+k}\| \approx \sum_{k=0}^{K-1}\|\Delta\mathbf{x}_{t+k}\|
\]
where \( \mathbf{x}_t \) denotes the end-effector pose. In practice, \( K' \approx \frac{1}{2} K \) performs well.
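Assuming the original and downsampled trajectories are already aligned at the chunk start, a sketch of choosing \( K' \) by matching cumulative end-effector displacement:

```python
import numpy as np

def rescaled_horizon(x_orig_chunk, x_ds_from_t):
    """Pick K' so the accelerated chunk covers roughly the same end-effector
    path length as the original K-step chunk.

    x_orig_chunk: (K, D) original poses starting at t;
    x_ds_from_t: downsampled poses starting at the frame aligned with t.
    """
    target = np.linalg.norm(np.diff(x_orig_chunk, axis=0), axis=1).sum()
    steps = np.linalg.norm(np.diff(x_ds_from_t, axis=0), axis=1)
    covered = np.cumsum(steps)
    k_prime = int(np.searchsorted(covered, target)) + 1
    return min(max(k_prime, 1), len(steps))
```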
Gripper Event Precision Forcing
We apply a gripper-event precision-forcing step to safeguard contact-rich phases from being over-accelerated. For each trajectory, we detect gripper movements from changes in the normalized gripper command \( g_t \), marking a frame as a candidate event if \( |g_{t+4} - g_t| \ge 0.03 \). All marked frames are then clustered along the temporal axis using DBSCAN. For each cluster, we take the minimum and maximum frame indices, pad them by two frames on both sides, and override the corresponding window to precision on top of the base LLM segmentation.
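A minimal sketch of this forcing pass; the DBSCAN radius `eps` is an assumed value, while the \( \delta=4 \) offset, 0.03 threshold, and 2-frame padding follow the description above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def force_precision_windows(g, labels, delta=4, thresh=0.03, eps=8, pad=2):
    """Mark frames around gripper open/close events as 'precision'.

    `g`: normalized gripper command per frame; `labels`: per-frame base
    segmentation ('precision'/'casual').
    """
    g = np.asarray(g, dtype=float)
    cand = np.where(np.abs(g[delta:] - g[:-delta]) >= thresh)[0]   # candidate event frames
    if len(cand) == 0:
        return labels
    clusters = DBSCAN(eps=eps, min_samples=1).fit_predict(cand.reshape(-1, 1))
    labels = list(labels)
    for c in np.unique(clusters):
        members = cand[clusters == c]
        lo = max(0, int(members.min()) - pad)
        hi = min(len(labels) - 1, int(members.max()) + pad)
        for t in range(lo, hi + 1):
            labels[t] = "precision"                                 # override the base label
    return labels
```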