EGO1: An All-in-One Headset for Egocentric Robot-Learning Data

GI Ego1 worn on a head, three-quarter view showing the stereo camera module, strap, and GoPro-compatible mount — Ego1 (EGO1-HS): a stereo head-mounted egocentric capture device. Drop it on, start the session.

Ego1 is our first egocentric capture headset for embodied intelligence, a device that effortlessly turns first person activities, especially manipulations, into training data for foundational models. We purposely co-designed the hardware and the perception stack around a single goal: capturing first-person manipulation at scale. The hardware carries two wide-angle cameras and a 200 Hz IMU — the cameras share a synchronized exposure, and the camera and IMU streams ride a hardware-synced clock — and every unit is individually calibrated for its own intrinsics and extrinsics.

Those decisions are what the perception stack runs on: a calibrated stereo baseline gives every reconstruction true metric scale that a single camera cannot recover, and the shared hardware clock puts every frame and IMU sample on one timeline, so nothing has to be resampled before reconstruction. That is what lets the stack turn Ego1's video and IMU streams into the things robot learning actually needs: visual-inertial SLAM (VI-SLAM) recovers a metric-scale head trajectory, hand detection plus stereo triangulation gives 21-keypoint 3D hand tracking for both hands, and stereo matching produces dense metric depth of the scene.

The Ego1 hardware

Ego1 is built around a simple goal: capture a clean, hardware-synced, egocentric stream that the perception pipeline can reconstruct from accurately, and do it untethered, for hours, with nothing else in the loop.

Ego1 front view showing the stereo camera pair — Front view: the two 1080p cameras sit 63 mm apart, the average human inter-pupillary distance.

Several design decisions follow directly from that goal. The two cameras sit 63 mm apart, the average human inter-pupillary distance, so the stereo baseline mirrors human binocular vision and gives downstream triangulation a good baseline to work from. Each camera is a wide 125° H × 80° V fisheye, which keeps the wearer's hands in frame even at the edges of natural reach, where narrower lenses lose them. A 9-DoF IMU at 200 Hz is hardware-synced to the cameras, sharing their exposure and clocks, so inertial and visual data land on the same timeline without software guesswork. And the whole thing records on its own: set it to always recording and capture starts the moment it powers on. No phone, no tablet, no laptop required.

To better serve stereo SLAM and depth reconstruction, each headset is calibrated individually. The calibration provides both intrinsics — each camera's focal length, principal point, and fisheye distortion — and extrinsics — the rotation and baseline between the two cameras. Assembly tolerances make these vary slightly from one unit to the next, so the calibration is unique to each device and travels with it — giving the perception stack the true metric scale it needs to reconstruct depth and motion in real-world units.

Spec	Ego1 (EGO1-HS)
Cameras	2× 1080p @ 30 fps, color
Field of view	125° H × 80° V, 148° diagonal per camera (fisheye)
Stereo baseline (IPD)	63 mm center-to-center
IMU	9-DoF (accel + gyro + mag) @ 200 Hz
Synchronization	Hardware-synced exposure and clocks across cameras and IMU
Encoding	H.265 / HEVC, MP4 container
Battery	Integrated, 6–9 hours continuous recording
Storage / offload	User-supplied microSD; Wi-Fi offload
Mount / form factor	GoPro-compatible, pitch-adjustable; fabric headband

The headset records standalone, but a data-collection team that wants tighter control can use the Ego Data Center app, which pairs over Wi-Fi for start/stop, live preview, disk telemetry, and a recording browser. We will publish the UMI Protocol, the compact, self-describing format that Ego1 streams and stores its synchronized camera and IMU data in, under the Apache License for teams who want to build their own tooling.

Ego Data Center companion app showing live preview, recording controls, and disk telemetry — Ego Data Center: recording control and disk/upload status (left), and a live stereo preview with IMU readout (right).

Computing the head trajectory

Metric stereo trajectory recovered.

With synced, calibrated stereo and IMU streaming off the device, we show how our perception stack recovers a highly accurate, metric head trajectory.

The head trajectory is where the camera moved through space over the whole recording. It matters because a head-mounted camera never holds still: the hand motion and the scene in the video are entangled with the wearer's own egomotion, while robot policies are trained on actions in a fixed reference frame. Recovering the trajectory lets us cancel that egomotion, so the hands and the scene settle into one fixed frame at true metric scale rather than drifting with the head. That turns free-form human activity into clean, world-anchored demonstrations a robot policy can learn from.^[1]1.Egocentric robot-learning systems depend on this: EgoMimic (Kareer et al., "EgoMimic: Scaling Imitation Learning via Egocentric Video," 2024) uses Project Aria's visual-inertial SLAM to transform hand-action trajectories into a stable world frame for co-training with robot data.

To recover it we run VI-SLAM with ORB-SLAM3,^[2]2.Campos et al., "ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM," IEEE Transactions on Robotics, 2021. fusing the synced stereo and IMU on each headset's own calibration. We first tried a filter-based VIO (OpenVINS), but on this rotation-dominant, short-baseline footage it diverged: a filter linearizes each measurement once, so when the head mostly turns rather than translates, the cues that pin metric scale go weak, early errors get locked in, and the estimate drifts off. ORB-SLAM3 instead bundle-adjusts a sliding window of keyframes, which stays robust in exactly that regime: the calibrated stereo baseline fixes true metric scale a single moving camera can only recover up to an unknown factor, and loop closure pulls accumulated drift back out whenever the wearer revisits a place. It also handles Ego1's wide fisheye lenses natively (Kannala-Brandt).^[3]3.Kannala and Brandt, "A Generic Camera Model and Calibration Method for Conventional, Wide-Angle, and Fish-Eye Lenses," IEEE TPAMI, 2006 — the fisheye projection model (KB4, coefficients k1–k4) we calibrate to.

The trajectory comes out at true real-world scale and stays stable: on a 41-second clip it tracked the full 4.85 m path and its rotation agreed with the IMU at corr 0.996; on a separate walk-around that returns to its starting point, loop closure brought the end-to-end drift to 0.33 cm, closing the loop to within a few millimeters. Per-device calibration is what makes the stereo pair rectify exactly, and that exact rectification is what makes the recovered scale truly metric.

Tracking human hands

Reconstructed 21-keypoint hand tracks overlaid on the egocentric video.

If the trajectory tells us where the camera went, this pipeline tells us what the hands did — for manipulation data, the signal that matters most. We locate both hands every frame with MediaPipe Hands,^[4]4.Zhang et al., "MediaPipe Hands: On-device Real-time Hand Tracking," CVPR Workshop on Computer Vision for AR/VR, 2020. then reconstruct all 21 keypoints of each with HaMeR, a vision transformer that regresses the hand's pose and shape from a single crop.^[5]5.Pavlakos et al., "Reconstructing Hands in 3D with Transformers" (HaMeR), CVPR, 2024. Built on the MANO parametric hand model.^[6]6.Romero et al., "Embodied Hands: Modeling and Capturing Hands and Bodies Together," ACM TOG (SIGGRAPH Asia), 2017. The MANO parametric hand model.

The catch is that these are off-the-shelf models, trained on third-person, centered, fully visible hands — everything egocentric video isn't. Hands get cropped at the frame edge, turn their backs to the camera, hide behind grasped objects, and smear under fast head turns, and it shows up as missing hands: the baseline found a given hand in only 75.6% of frames, and the left hand in just 53.7%.

You might reach for a bigger pose model, but that's not where the problem is. Since HaMeR always returns a hand once it has a box, a missing hand can't be a posing failure — it means the detector never located the hand in the first place. So the fix was detection. Two changes closed most of the gap — MediaPipe's VIDEO mode, which carries a hand's region of interest forward between frames, and a re-detect pass that crops and upsamples the last-seen region into the close-up the detector expects — and a speed-adaptive One-Euro filter^[7]7.Casiez et al., "1€ Filter: A Simple Speed-based Low-pass Filter for Noisy Input in Interactive Systems," CHI, 2012. smooths what's left. Per-hand coverage rose from 75.6% to 93.1%, at least one hand appears in 99.2% of frames, and jitter roughly halved. With clean keypoints in both the left and right frames, stereo finishes the job: triangulating the matching points against the calibrated baseline lifts each hand to metric 3D in real millimeters, which the head trajectory then places in the scene's world frame.

What we didn't do matters too: every hand above is a real detection, not an interpolated guess. We fill only short gaps (15 frames or fewer, little motion), never extrapolate a long absence, and tag each keypoint with its source (detected, redetect, or interp) so the data can be filtered. When a hand genuinely leaves the frame, we leave the gap empty rather than invent a pose.

Reconstructing depth

The rectified stereo pair (top) and the per-frame SGBM depth it produces (bottom), temporally smoothed.

The same calibrated stereo earns one more output: per-frame metric depth of the whole scene. A stereo-matching stage estimates disparity — the horizontal shift of each pixel between the rectified left/right views — and converts it to depth through the rectified-stereo relation depth = focal × baseline / disparity. Because the baseline was measured for this specific headset, the result is in real millimeters: a monocular network only infers scale from learned priors, whereas a known baseline is the scale. That baseline — 63 mm — also sets where the depth is trustworthy: a fraction-of-a-pixel matching error propagates as ΔZ ≈ Z² / (f · B) · Δd, so uncertainty grows with the square of distance. For Ego1's calibrated focal length (~883 px) and a half-pixel error:

Distance	Depth uncertainty (ΔZ)
0.5 m	~2 mm
1.0 m	~9 mm
2.0 m	~36 mm

Millimeters at arm's length, centimeters across the room — a 16× jump in error for a 4× jump in distance. The short baseline is sharpest exactly where the hands do their work, the 0.3–0.8 m manipulation range, and goes soft in the far scene, which is context the policy never acts on. The same IPD spacing that mirrors human binocular vision gives human-like depth acuity: crisp up close, approximate in the distance.

Computing that depth comes down to stereo matching, where the calibration pays off once more. We use Semi-Global Matching — OpenCV's block-based SGBM^[8]8.Hirschmüller, "Stereo Processing by Semiglobal Matching and Mutual Information," IEEE TPAMI, 2008 (orig. CVPR, 2005). OpenCV's StereoSGBM is a block-based variant. — which matches each pixel along the rectified epipolar line and converts the disparity to metric depth through the relation above, with no learned model or checkpoint to maintain. It commits a disparity only where the match is unambiguous, leaving textureless walls and occlusions as holes that an edge-aware, image-guided filter then fills. Each frame becomes a 16-bit-millimeter depth map that drops straight into ROS (16UC1), LeRobot depth features, or Open3D for RGB-D reconstruction.

Collecting data at scale

Ego1 is an egocentric stereo headset built for one job: collecting manipulation data at the scale robotics foundation models demand. You drop it on and it records hardware-synced stereo and IMU on its own; per-device calibration anchors the metric scale, and the perception stack turns that raw footage into what robot learning actually consumes — a head trajectory, 3D hand tracking for both hands, and dense metric depth. Ordinary first-person activity becomes the large, diverse, real-world data that robot policies are trained on.

If you're collecting manipulation data at scale and want to talk, book a demo or reach us at hello@gilabs.xyz. The UMI Protocol will be open-sourced under the Apache License, and device firmware ships through our public GitHub release channel.