ESPADA extracts grounded object tracks using Grounded-SAM2 and Video Depth Anything,
computes 3D gripper–object distances, and feeds structured cues into a VLM for scene summaries.
An LLM uses these cues to segment each trajectory into precision or casual spans.
Only casual spans are aggressively downsampled using replicate-before-downsample with geometric
consistency.
Dataset-wide labels are propagated using banded Dynamic Time Warping on proprioception–action features.
Context- and Spatial-Aware Segmentation via VLM → LLM
Object tracking with interactive keyframe seeding
First, we obtain open-vocabulary tracks from demonstration videos using Grounded-SAM2. In addition to text prompts, users can provide sparse keyframe annotations (boxes or point-groups) via a lightweight UI. We maintain a label↔id mapping across keyframes and perform IoU-based association to propagate user labels to SAM2 track IDs. During propagation, we use a keep-alive strategy (bbox carry-over for short outages) and periodic re-detection with Grounding DINO, reconnecting lost tracks via a score that mixes IoU and color-histogram similarity. This reduces track fragmentation and preserves object identity across occlusions.
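A minimal sketch of the reconnection score is shown below; the box/histogram field names, the 0.5 mixing weight, and the histogram-similarity form are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def color_hist(patch, bins=16):
    """Normalized per-channel color histogram of an HxWx3 uint8 crop."""
    hists = [np.histogram(patch[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
    h = np.concatenate(hists).astype(np.float64)
    return h / (h.sum() + 1e-9)

def reassociation_score(lost_track, detection, frame, w_iou=0.5):
    """Blend IoU of the carried-over box with appearance similarity of the crops.

    `lost_track['last_box']` / `['last_hist']` and `detection['box']` are assumed
    bookkeeping fields kept by the tracker, not a real library API.
    """
    geo = iou(lost_track["last_box"], detection["box"])
    x1, y1, x2, y2 = [int(v) for v in detection["box"]]
    # L1 distance between normalized histograms, mapped into a [0, 1] similarity.
    app = 1.0 - 0.5 * np.abs(color_hist(frame[y1:y2, x1:x2]) - lost_track["last_hist"]).sum()
    return w_iou * geo + (1.0 - w_iou) * app
```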
To bootstrap object grounding, we first sample ~10 representative frames from episode 0 and feed them into InternVL 3.5 to obtain a compact natural-language description of the overall task. For the same frames, we apply Grounding DINO v2 to detect and segment task-relevant entities such as left_gripper, right_gripper, and target objects (e.g., yellow_cup). If bounding box predictions fail for some frames, we allow lightweight manual correction (bounding box only) through the UI. The corrected boxes serve as anchors for SAM2, which then propagates object masks and bounding boxes consistently across the entire episode. This hybrid strategy (automatic detection + sparse manual fallback + SAM2 propagation) ensures that every frame receives reliable per-object segmentation, even under occlusion or detector failure.
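The following orchestration sketch summarizes the hybrid seeding flow. `detect`, `correct_ui`, and `propagate` are hypothetical callables standing in for Grounding DINO, the annotation UI, and SAM2 video propagation; the real interfaces differ.

```python
import numpy as np

def seed_and_propagate(frames, object_prompts, detect, correct_ui, propagate, num_keyframes=10):
    """Automatic detection on sampled keyframes, sparse manual fallback, then propagation."""
    keyframes = np.linspace(0, len(frames) - 1, num_keyframes).astype(int).tolist()
    anchors = {}                                        # frame index -> {label: box}
    for idx in keyframes:
        boxes = detect(frames[idx], object_prompts)     # {label: box or None}
        missing = [lbl for lbl, box in boxes.items() if box is None]
        if missing:
            boxes.update(correct_ui(frames[idx], missing))   # bounding-box-only correction
        anchors[idx] = {lbl: box for lbl, box in boxes.items() if box is not None}
    # SAM2 consumes the anchor boxes as prompts and propagates per-object masks
    # across the whole episode while a stable label <-> track-id mapping is kept.
    return propagate(frames, anchors)
```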
Depth estimation and 3D back-projection
We estimate per-frame depths with VDA/DA2 (metric or relative; optionally scaled by a factor \( z_{\mathrm{scale}} \)). Since the previous step provides the pixel coordinates \( (u, v) \) of each object of interest, we can recover its 3D position in the camera coordinate frame from the corresponding depth \( Z \) via standard back-projection:
\[
\mathbf{p} = Z K^{-1}[u, v, 1]^{\top}
\]
This yields a center_3d for each tracked mask. We then compute frame-wise gripper–object
distances,
\[
r_t(g,o)=\|\mathbf{p}^{(g)}_{t}-\mathbf{p}^{(o)}_{t}\|_2
\]
for \( g\in\{\text{gripper\_left},\text{gripper\_right}\} \) and task-relevant objects \( o \). For multi-view sequences, we build per-camera relations_3d from the set of \( r_t(g,o) \) values, preferring the head camera if present; otherwise we select the camera with the most valid relations in a given frame. We rely on temporal trends in \( r_t \) rather than absolute scale, avoiding the need for extrinsics.
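A minimal NumPy sketch of the back-projection and distance computation; the dictionary layout for tracked centers is an assumption used for illustration.

```python
import numpy as np

def backproject(u, v, Z, K):
    """Camera-frame 3D point p = Z * K^{-1} [u, v, 1]^T."""
    return Z * (np.linalg.inv(K) @ np.array([u, v, 1.0]))

def gripper_object_distances(centers_3d, grippers=("gripper_left", "gripper_right")):
    """Frame-wise r_t(g, o) from a dict {label: (T, 3) array of center_3d}."""
    relations = {}
    for g in grippers:
        if g not in centers_3d:
            continue
        for o, p_o in centers_3d.items():
            if o in grippers:
                continue
            # Euclidean distance per frame, shape (T,)
            relations[(g, o)] = np.linalg.norm(centers_3d[g] - p_o, axis=-1)
    return relations
```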
LLM-Based Segmentation Conditioned on VLM Summaries
From the sampled frames (typically 4–8) and their structured 3D cues, we query a VLM (InternVL) for a strict-JSON, chronologically ordered episode summary, which we attach to the LLM prompt as a task descriptor. Conditioned on this descriptor, we perform LLM-based segmentation as follows. The LLM receives (i) a JSONL stream with frame-wise center_3d and relations_3d for the full episode, and (ii) the VLM summary descriptor. It outputs non-overlapping, inclusive index ranges labeled precision or casual.
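For concreteness, the expected output has roughly the following shape (field names and values are illustrative, not the exact schema), together with a sanity check of the range constraints:

```python
# Illustrative shape of the strict-JSON segmentation output (field names assumed).
example_output = {
    "segments": [
        {"start": 0,   "end": 143, "label": "casual",    "confidence": 0.86},
        {"start": 144, "end": 297, "label": "precision", "confidence": 0.91},
        {"start": 298, "end": 511, "label": "casual",    "confidence": 0.78},
    ]
}

def check_segments(segments, num_frames):
    """Verify chronologically ordered, non-overlapping, inclusive index ranges."""
    prev_end = -1
    for seg in segments:
        assert seg["label"] in ("precision", "casual")
        assert prev_end < seg["start"] <= seg["end"] < num_frames
        prev_end = seg["end"]

check_segments(example_output["segments"], num_frames=512)
```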
We encode policy hints to favor robust, human-like chunks:
- Intent criteria. Sustained near-contact plateaus and low-variance micro-adjustments
\(\Rightarrow\) precision; long approach/retreat or persistent far separation \(\Rightarrow\)
casual.
- Stability. Minimum segment length \( L_{\min}=8 \); merge same-label segments across
gaps shorter than \( G_{\min}=5 \); require \( \ge 3 \) consecutive frames to switch labels
(hysteresis); ignore micro-oscillations shorter than \( L_{\mathrm{micro}}=6 \).
- Parsimony. Prefer 3–4 segments unless strong evidence suggests otherwise.
Because the model may leave small gaps when confidence is low, we run a deterministic coverage
completion pass: fill gaps by extending the nearest high-confidence neighbor that best matches
the
local \( r_t \) trend, then re-apply the stability rules. The final set provides full
frame coverage with per-segment confidence.
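A simplified sketch of this post-processing is given below. It covers gap merging, coverage completion by extending the higher-confidence neighbor, and a minimum-length rule; the hysteresis, micro-oscillation, and \( r_t \)-trend checks are omitted, and the segment dictionary fields follow the illustrative schema above.

```python
def enforce_stability(segments, num_frames, l_min=8, g_min=5):
    """Merge same-label segments across short gaps, fill remaining gaps, enforce min length."""
    segs = sorted(segments, key=lambda s: s["start"])
    merged = [dict(segs[0])]
    for seg in segs[1:]:
        prev = merged[-1]
        gap = seg["start"] - prev["end"] - 1
        if seg["label"] == prev["label"] and gap < g_min:
            prev["end"] = seg["end"]                     # merge across a short gap
        else:
            merged.append(dict(seg))
    # Coverage completion: extend the higher-confidence neighbor into each gap.
    merged[0]["start"] = 0
    for a, b in zip(merged, merged[1:]):
        if b["start"] > a["end"] + 1:
            cut = b["start"] if a.get("confidence", 0) >= b.get("confidence", 0) else a["end"] + 1
            a["end"], b["start"] = cut - 1, cut
    merged[-1]["end"] = num_frames - 1
    # Absorb segments shorter than l_min into their predecessor.
    out = [merged[0]]
    for seg in merged[1:]:
        if seg["end"] - seg["start"] + 1 < l_min:
            out[-1]["end"] = seg["end"]
        else:
            out.append(seg)
    return out
```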
Finally, demonstrations often span thousands of frames, easily exceeding LLM context limits, so we apply token-budgeted sampling and JSON slimming. We compute the maximum feasible sample count \( K \) by binary search over the measured per-frame JSON length and select \( K \) evenly spaced indices, ensuring trajectory-wide coverage under a fixed character budget. We further compact prompts by float rounding and whitespace-free JSON serialization, reducing token overhead by ~30–40% without changing semantics.
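A sketch of the budgeting logic under these assumptions (a character budget as a proxy for tokens; helper names are illustrative):

```python
import json

def slim(frame_record, ndigits=3):
    """Round floats and serialize without whitespace to shrink the prompt."""
    rounded = json.loads(json.dumps(frame_record),
                         parse_float=lambda x: round(float(x), ndigits))
    return json.dumps(rounded, separators=(",", ":"))

def fits(frame_records, k, char_budget):
    """Total slimmed length of k evenly spaced frames vs. the character budget."""
    n = len(frame_records)
    idx = [round(i * (n - 1) / max(k - 1, 1)) for i in range(k)]
    total = sum(len(slim(frame_records[i])) + 1 for i in idx)  # +1 per JSONL newline
    return total <= char_budget, idx

def budgeted_indices(frame_records, char_budget):
    """Binary-search the largest K whose K evenly spaced, slimmed frames fit the budget."""
    lo, hi, best = 1, len(frame_records), [0]
    while lo <= hi:
        mid = (lo + hi) // 2
        ok, idx = fits(frame_records, mid, char_budget)
        if ok:
            best, lo = idx, mid + 1
        else:
            hi = mid - 1
    return best
```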
Banded DTW Label Transfer from Episode-0
For datasets where only episode 0 is labeled, we propagate its segment labels (precision/casual) to the remaining episodes via banded Dynamic Time Warping (DTW).
Proprioceptive DTW Alignment
From each episode we build a per-frame feature vector using only proprioception and actions. Concretely, we concatenate z-scored features to form \( \phi_t \in \mathbb{R}^{D} \):
\[
\phi_t = \big[\, a_t,\ \Delta a_t,\ v_t,\ \Delta v_t,\ \|a_t\|,\ \|v_t\|,\ \|\Delta a_t\|,\ \|\Delta q_t\|,\ \|\Delta v_t\|,\ \angle(a_t,\, a_t+\Delta a_t),\ \angle(v_t,\, v_t+\Delta v_t) \,\big]
\]
where \( a_t \) are actions, \( q_t \) are joint positions, \( v_t \)
are joint velocities if available (otherwise we use \( \Delta q_t \) as a proxy), and
\( \angle(\cdot,\cdot) \) is the angle between successive vectors.
Given episode 0 features \( X_0 \) and target features \( X_k \) for episode
\( k \), we run DTW with a Sakoe–Chiba band of half-width \( b \). This yields an alignment path which
we convert into a monotone index map.
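As a concrete sketch of this alignment step (the band half-width and helper names are illustrative), the following NumPy snippet computes a Sakoe–Chiba-banded DTW path between two feature matrices and converts it into a monotone index map:

```python
import numpy as np

def banded_dtw_path(X0, Xk, band=64):
    """DTW with a Sakoe-Chiba band of half-width `band` frames; returns the warping path."""
    n, m = len(X0), len(Xk)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        center = int(round(i * m / n))          # band follows the length-scaled diagonal
        for j in range(max(1, center - band), min(m, center + band) + 1):
            cost = np.linalg.norm(X0[i - 1] - Xk[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to (0, 0).
    path, i, j = [], n, m
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.append((0, 0))
    return path[::-1]

def monotone_index_map(path, n):
    """Map each episode-0 frame to its matched target frame, enforcing monotonicity."""
    mapping = np.zeros(n, dtype=int)
    for i, j in path:
        mapping[i] = j
    return np.maximum.accumulate(mapping)
```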
Segment-wise Label Transfer and Refinement
For each labeled episode-0 segment, we map its endpoints through the DTW index map to obtain the target span, then snap both ends within a local window of \( \pm W \) frames (default \( W=12 \)) by minimizing the \( \ell_2 \) distance between short mean-pooled feature summaries. Mapped segments are sorted and trimmed to remove overlaps while preserving order. The banded DTW runtime is near-linear in sequence length, allowing fast transfers on CPU.
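A sketch of the endpoint snapping under the same assumptions (the pooling width is an illustrative choice):

```python
import numpy as np

def snap_endpoint(X_k, t0, ref_summary, W=12, pool=5):
    """Snap a mapped boundary t0 to the frame within +/-W whose local mean-pooled
    feature summary is closest (L2) to the episode-0 reference summary."""
    best_t, best_d = t0, np.inf
    for t in range(max(0, t0 - W), min(len(X_k), t0 + W + 1)):
        lo, hi = max(0, t - pool // 2), min(len(X_k), t + pool // 2 + 1)
        d = np.linalg.norm(X_k[lo:hi].mean(axis=0) - ref_summary)
        if d < best_d:
            best_t, best_d = t, d
    return best_t
```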
Segment-wise Downsampling and Dataset Compilation
Given the final segmentation, we construct an acceleration-aware dataset by applying
replicate-before-downsample with a larger downsampling factor for casual spans and a smaller
downsampling factor for precision spans.
Replicate-before-downsample
To maintain full state coverage under temporal compression, we adopt a replicate-before-downsample strategy. For a segment \( [s,e] \) and downsampling factor \( N \), we create \( N \) replicas with offsets \( m \in \{0,\dots,N-1\} \), each retaining the frames whose indices are congruent to \( m \) modulo \( N \). Taking the union across \( m \) recovers the original support, thereby preserving full state diversity in the downsampled dataset.
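A minimal sketch of the replication logic (the index-list representation of a segment is illustrative):

```python
def replicate_before_downsample(start, end, factor):
    """Return `factor` index lists for segment [start, end]; their union is the full segment."""
    return [
        [t for t in range(start, end + 1) if (t - start) % factor == m]
        for m in range(factor)
    ]

# The union across offsets recovers the original support (full state coverage).
reps = replicate_before_downsample(10, 29, 4)
assert sorted(set().union(*reps)) == list(range(10, 30))
```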
Geometric Consistency for Chunked Policies
Temporal acceleration alters the per-chunk spatial displacement, undermining the chunk horizon \( K \) at which the policy has been optimized to perform best. To maintain geometric fidelity under accelerated demonstrations, we adopt the geometry-consistent downsampling scheme and rescale the effective chunk horizon \( K' \) so that its spatial displacement remains consistent with the original:
\[
\sum_{k=0}^{K'-1}\|\Delta\mathbf{x}_{t+k}\| \approx \sum_{k=0}^{K-1}\|\Delta\mathbf{x}_{t+k}\|
\]
where \( \mathbf{x}_t \) denotes the end-effector pose. In practice, \( K' \approx \frac{1}{2} K \) performs well.
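Assuming the original and downsampled trajectories are already aligned at the chunk start, a sketch of choosing \( K' \) by matching cumulative end-effector displacement:

```python
import numpy as np

def rescaled_horizon(x_orig_chunk, x_ds_from_t):
    """Pick K' so the accelerated chunk covers roughly the same end-effector
    path length as the original K-step chunk.

    x_orig_chunk: (K, D) original poses starting at t;
    x_ds_from_t: downsampled poses starting at the frame aligned with t.
    """
    target = np.linalg.norm(np.diff(x_orig_chunk, axis=0), axis=1).sum()
    steps = np.linalg.norm(np.diff(x_ds_from_t, axis=0), axis=1)
    covered = np.cumsum(steps)
    k_prime = int(np.searchsorted(covered, target)) + 1
    return min(max(k_prime, 1), len(steps))
```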
Gripper Event Precision Forcing
We apply a gripper-event precision-forcing step to safeguard contact-rich phases from being over-accelerated. For each trajectory, we detect gripper movements from changes in the normalized gripper command \( g_t \), marking a frame as a candidate event if \( |g_{t+4} - g_t| \ge 0.03 \). All marked frames are then clustered along the temporal axis using DBSCAN. For each cluster, we take the minimum and maximum frame indices, pad them by two frames on both sides, and override the corresponding window to precision on top of the base LLM segmentation.
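A minimal sketch of this forcing pass; the DBSCAN radius `eps` is an assumed value, while the \( \delta=4 \) offset, 0.03 threshold, and 2-frame padding follow the description above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def force_precision_windows(g, labels, delta=4, thresh=0.03, eps=8, pad=2):
    """Mark frames around gripper open/close events as 'precision'.

    `g`: normalized gripper command per frame; `labels`: per-frame base
    segmentation ('precision'/'casual').
    """
    g = np.asarray(g, dtype=float)
    cand = np.where(np.abs(g[delta:] - g[:-delta]) >= thresh)[0]   # candidate event frames
    if len(cand) == 0:
        return labels
    clusters = DBSCAN(eps=eps, min_samples=1).fit_predict(cand.reshape(-1, 1))
    labels = list(labels)
    for c in np.unique(clusters):
        members = cand[clusters == c]
        lo = max(0, int(members.min()) - pad)
        hi = min(len(labels) - 1, int(members.max()) + pad)
        for t in range(lo, hi + 1):
            labels[t] = "precision"                                 # override the base label
    return labels
```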