EGO1: An All-in-One Headset for Egocentric Robot-Learning Data

By General Intelligence Labs Team

GI Ego1 worn on a head, three-quarter view showing the stereo camera module, strap, and GoPro-compatible mount
Ego1 (EGO1-HS): a stereo head-mounted egocentric capture device. Drop it on, start the session.

Ego1 is our first egocentric capture headset for embodied intelligence, a device that effortlessly turns first person activities, especially manipulations, into training data for foundational models. We purposely co-designed the hardware and the perception stack around a single goal: capturing first-person manipulation at scale. The hardware carries two wide-angle cameras and a 200 Hz IMU — the cameras share a synchronized exposure, and the camera and IMU streams ride a hardware-synced clock — and every unit is individually calibrated for its own intrinsics and extrinsics.

Those decisions are what the perception stack runs on: a calibrated stereo baseline gives every reconstruction true metric scale that a single camera cannot recover, and the shared hardware clock puts every frame and IMU sample on one timeline, so nothing has to be resampled before reconstruction. That is what lets the stack turn Ego1's video and IMU streams into the things robot learning actually needs: visual-inertial SLAM (VI-SLAM) recovers a metric-scale head trajectory, hand detection plus stereo triangulation gives 21-keypoint 3D hand tracking for both hands, and stereo matching produces dense metric depth of the scene.

The Ego1 hardware

Ego1 is built around a simple goal: capture a clean, hardware-synced, egocentric stream that the perception pipeline can reconstruct from accurately, and do it untethered, for hours, with nothing else in the loop.

Ego1 front view showing the stereo camera pair
Front view: the two 1080p cameras sit 63 mm apart, the average human inter-pupillary distance.

Several design decisions follow directly from that goal. The two cameras sit 63 mm apart, the average human inter-pupillary distance, so the stereo baseline mirrors human binocular vision and gives downstream triangulation a good baseline to work from. Each camera is a wide 150° H × 80° V fisheye, which keeps the wearer's hands in frame even at the edges of natural reach, where narrower lenses lose them. A 9-DoF IMU at 200 Hz is hardware-synced to the cameras, sharing their exposure and clocks, so inertial and visual data land on the same timeline without software guesswork. And the whole thing records on its own: set it to always recording and capture starts the moment it powers on. No phone, no tablet, no laptop required.

To better serve stereo SLAM and depth reconstruction, each headset is calibrated individually. The calibration provides both intrinsics — each camera's focal length, principal point, and fisheye distortion — and extrinsics — the rotation and baseline between the two cameras. Assembly tolerances make these vary slightly from one unit to the next, so the calibration is unique to each device and travels with it — giving the perception stack the true metric scale it needs to reconstruct depth and motion in real-world units.

SpecEgo1 (EGO1-HS)
Cameras2× 1080p @ 30 fps, color
Field of view150° H × 80° V per camera (fisheye)
Stereo baseline (IPD)63 mm center-to-center
IMU9-DoF (accel + gyro + mag) @ 200 Hz
SynchronizationHardware-synced exposure and clocks across cameras and IMU
EncodingH.265 / HEVC, MP4 container
BatteryIntegrated, 6–9 hours continuous recording
Storage / offloadUser-supplied microSD; Wi-Fi offload
Mount / form factorGoPro-compatible, pitch-adjustable; fabric headband

The headset records standalone, but a data-collection team that wants tighter control can use the Ego Data Center app, which pairs over Wi-Fi for start/stop, live preview, disk telemetry, and a recording browser. We will publish the UMI Protocol, the compact, self-describing format that Ego1 streams and stores its synchronized camera and IMU data in, under the Apache License for teams who want to build their own tooling.

Ego Data Center companion app showing live preview, recording controls, and disk telemetry
Ego Data Center: recording control and disk/upload status (left), and a live stereo preview with IMU readout (right).

Computing the head trajectory

Metric stereo trajectory recovered.

With synced, calibrated stereo and IMU streaming off the device, we show how our perception stack recovers a highly accurate, metric head trajectory.

The head trajectory is where the camera moved through space over the whole recording. It matters because a head-mounted camera never holds still: the hand motion and the scene in the video are entangled with the wearer's own egomotion, while robot policies are trained on actions in a fixed reference frame. Recovering the trajectory lets us cancel that egomotion, so the hands and the scene settle into one fixed frame at true metric scale rather than drifting with the head. That turns free-form human activity into clean, world-anchored demonstrations a robot policy can learn from.[1]

To recover it we run VI-SLAM with ORB-SLAM3,[2] fusing the synced stereo and IMU on each headset's own calibration. We first tried a filter-based VIO (OpenVINS), but on this rotation-dominant, short-baseline footage it diverged: a filter linearizes each measurement once, so when the head mostly turns rather than translates, the cues that pin metric scale go weak, early errors get locked in, and the estimate drifts off. ORB-SLAM3 instead bundle-adjusts a sliding window of keyframes, which stays robust in exactly that regime: the calibrated stereo baseline fixes true metric scale a single moving camera can only recover up to an unknown factor, and loop closure pulls accumulated drift back out whenever the wearer revisits a place. It also handles Ego1's wide fisheye lenses natively (Kannala-Brandt).[3]

The trajectory comes out at true real-world scale and stays stable: on a 41-second clip it tracked the full 4.85 m path and its rotation agreed with the IMU at corr 0.996; on a separate walk-around that returns to its starting point, loop closure brought the end-to-end drift to 0.33 cm, closing the loop to within a few millimeters. Per-device calibration is what makes the stereo pair rectify exactly, and that exact rectification is what makes the recovered scale truly metric.

Tracking human hands

Reconstructed 21-keypoint hand tracks overlaid on the egocentric video.

If the trajectory tells us where the camera went, this pipeline tells us what the hands did — for manipulation data, the signal that matters most. We locate both hands every frame with MediaPipe Hands,[4] then reconstruct all 21 keypoints of each with HaMeR, a vision transformer that regresses the hand's pose and shape from a single crop.[5][6]

The catch is that these are off-the-shelf models, trained on third-person, centered, fully visible hands — everything egocentric video isn't. Hands get cropped at the frame edge, turn their backs to the camera, hide behind grasped objects, and smear under fast head turns, and it shows up as missing hands: the baseline found a given hand in only 75.6% of frames, and the left hand in just 53.7%.

You might reach for a bigger pose model, but that's not where the problem is. Since HaMeR always returns a hand once it has a box, a missing hand can't be a posing failure — it means the detector never located the hand in the first place. So the fix was detection. Two changes closed most of the gap — MediaPipe's VIDEO mode, which carries a hand's region of interest forward between frames, and a re-detect pass that crops and upsamples the last-seen region into the close-up the detector expects — and a speed-adaptive One-Euro filter[7] smooths what's left. Per-hand coverage rose from 75.6% to 93.1%, at least one hand appears in 99.2% of frames, and jitter roughly halved. With clean keypoints in both the left and right frames, stereo finishes the job: triangulating the matching points against the calibrated baseline lifts each hand to metric 3D in real millimeters, which the head trajectory then places in the scene's world frame.

What we didn't do matters too: every hand above is a real detection, not an interpolated guess. We fill only short gaps (15 frames or fewer, little motion), never extrapolate a long absence, and tag each keypoint with its source (detected, redetect, or interp) so the data can be filtered. When a hand genuinely leaves the frame, we leave the gap empty rather than invent a pose.

Reconstructing depth

The rectified stereo pair (top) and the per-frame SGBM depth it produces (bottom), temporally smoothed.

The same calibrated stereo earns one more output: per-frame metric depth of the whole scene. A stereo-matching stage estimates disparity — the horizontal shift of each pixel between the rectified left/right views — and converts it to depth through the rectified-stereo relation depth = focal × baseline / disparity. Because the baseline was measured for this specific headset, the result is in real millimeters: a monocular network only infers scale from learned priors, whereas a known baseline is the scale. That baseline — 63 mm — also sets where the depth is trustworthy: a fraction-of-a-pixel matching error propagates as ΔZ ≈ Z² / (f · B) · Δd, so uncertainty grows with the square of distance. For Ego1's calibrated focal length (~883 px) and a half-pixel error:

DistanceDepth uncertainty (ΔZ)
0.5 m~2 mm
1.0 m~9 mm
2.0 m~36 mm

Millimeters at arm's length, centimeters across the room — a 16× jump in error for a 4× jump in distance. The short baseline is sharpest exactly where the hands do their work, the 0.3–0.8 m manipulation range, and goes soft in the far scene, which is context the policy never acts on. The same IPD spacing that mirrors human binocular vision gives human-like depth acuity: crisp up close, approximate in the distance.

Computing that depth comes down to stereo matching, where the calibration pays off once more. We use Semi-Global Matching — OpenCV's block-based SGBM[8] — which matches each pixel along the rectified epipolar line and converts the disparity to metric depth through the relation above, with no learned model or checkpoint to maintain. It commits a disparity only where the match is unambiguous, leaving textureless walls and occlusions as holes that an edge-aware, image-guided filter then fills. Each frame becomes a 16-bit-millimeter depth map that drops straight into ROS (16UC1), LeRobot depth features, or Open3D for RGB-D reconstruction.

Collecting data at scale

Ego1 is an egocentric stereo headset built for one job: collecting manipulation data at the scale robotics foundation models demand. You drop it on and it records hardware-synced stereo and IMU on its own; per-device calibration anchors the metric scale, and the perception stack turns that raw footage into what robot learning actually consumes — a head trajectory, 3D hand tracking for both hands, and dense metric depth. Ordinary first-person activity becomes the large, diverse, real-world data that robot policies are trained on.

If you're collecting manipulation data at scale and want to talk, book a demo or reach us at hello@gilabs.xyz. The UMI Protocol will be open-sourced under the Apache License, and device firmware ships through our public GitHub release channel.