Abstract
We introduce a framework for collecting extended egocentric video sequences using standard smartphone hardware. Alongside the framework, we release 200 hours of long-form egocentric data with persistent tracking and open-source our video processing infrastructure, STERA. The contribution aims to democratize robotics data collection by enabling hour-plus egocentric trajectories on ubiquitous mobile devices, and to support Vision Language Action (VLA) model development with standardized, training-ready data formats.
Contributions
STERA infrastructure
Open-source pipeline for processing long-horizon egocentric video captured on commodity smartphones.
200h dataset
Long-form egocentric trajectories with persistent tracking, released for VLA model development.
Training-ready format
Standardized outputs designed to drop into existing robotics and VLA training pipelines.
Commodity hardware
Hour-plus capture sessions using devices people already carry, with no specialized rig required.
Why it matters
Egocentric video collection has historically been bottlenecked by specialized capture hardware and short session lengths. By targeting commodity smartphones and hour-plus horizons, this work lowers the barrier to building large, diverse datasets that match how people actually move through the world. Such datasets are a precondition for general-purpose Vision Language Action models.