EgoHDM: An Online Egocentric-Inertial Human Motion Capture, Localization, and Dense Mapping System

¹The Hong Kong University of Science and Technology (Guangzhou)
²ETH Zurich
³ETH AI Center

SIGGRAPH Asia 2024 (TOG)


*Indicates Equal Contribution

^Corresponding Author

EgoHDM is an online egocentric-inertial motion capture system that delivers near-real-time localization and dense scene mapping, improving human motion estimation across varied terrain.

Abstract

We present EgoHDM, an online egocentric-inertial human motion capture (mocap), localization, and dense mapping system. Our system uses 6 inertial measurement units (IMUs) and a commodity head-mounted RGB camera. EgoHDM is the first human mocap system that offers dense scene mapping in near real-time. Further, it is fast and robust to initialize and fully closes the loop between physically plausible, map-aware global human motion estimation and mocap-aware 3D scene reconstruction. Our key idea is to bidirectionally integrate camera localization and mapping information with inertial human motion capture. To achieve this, we design a tightly coupled mocap-aware dense bundle adjustment and a physics-based body pose correction module that leverages a local body-centric elevation map. The latter introduces a novel terrain-aware contact PD controller, which enables characters to make physical contact with the given local elevation map, thereby reducing floating and penetration artifacts. We demonstrate the performance of our system on established synthetic and real-world benchmarks. The results show that our method reduces human localization, camera pose, and mapping error by 41%, 71%, and 46%, respectively, compared to the state of the art. Our qualitative evaluations on newly captured data further demonstrate that EgoHDM covers challenging scenarios on non-flat terrain, including stepping up stairs, and outdoor scenes in the wild.

Video

Method Overview

The inputs to EgoHDM are real-time acceleration and orientation measurements from six body-worn IMUs and monocular egocentric RGB images. We first initialize the system (VIM Initialization) by finding a similarity transform \(\mathbf{T}_{hc}\) that aligns the inertial and camera frames, recovering accurate metric scale from body shape constraints. After initialization, the mocap-aware dense bundle adjustment (MDBA) jointly optimizes camera poses and the depth images of keyframes by integrating inertial human motion constraints into RGB-based SLAM. We then construct and maintain a consistent, dense 3D map via global BA and loop closing; to reduce the influence of depth noise on the global map, we employ covariance-guided volumetric fusion. Next, we create a local body-centric elevation map of fixed resolution by projecting the global map along the direction of gravity. Lastly, in the map-aware inertial mocap module, we refine the poses predicted by a learning-based inertial pose estimator with a physics-based correction module that leverages the elevation map to establish foot-to-ground contact. The corrected poses are fed back to the MDBA, fully closing the loop between inertial pose estimation and SLAM-based mapping. The sketches below illustrate three of these steps.
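Initialization hinges on a scaled alignment between trajectories expressed in the inertial (human) frame and the camera frame. As a rough illustration only, the closed-form Umeyama alignment below recovers a similarity transform (scale, rotation, translation) from corresponding 3D points; it is a generic stand-in, not the paper's VIM Initialization, which additionally exploits body shape constraints to pin down the metric scale. The function name umeyama_sim3 is our own.

import numpy as np

def umeyama_sim3(src, dst):
    """Return (s, R, t) such that dst ~= s * R @ src + t; src, dst: (N, 3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)              # cross-covariance of the two point sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # keep R a proper rotation
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / xs.var(0).sum()   # optimal scale
    t = mu_d - s * R @ mu_s
    return s, R, t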
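The local body-centric elevation map can be pictured as a height grid rasterized from the global map. Below is a minimal sketch, assuming a z-up world with gravity along -z, a 2 m half-extent window, a 5 cm cell size, and a max-height reduction per cell; all of these parameters are illustrative, not the paper's.

import numpy as np

def local_elevation_map(points, root_xy, half_extent=2.0, resolution=0.05):
    """points: (N, 3) global map points (z up); root_xy: (2,) body position."""
    cells = int(2 * half_extent / resolution)
    height = np.full((cells, cells), -np.inf)   # -inf marks unobserved cells

    # Keep only points inside the square window centered on the body.
    local = points[:, :2] - root_xy
    inside = np.all(np.abs(local) < half_extent, axis=1)
    local, z = local[inside], points[inside, 2]

    # Rasterize: each cell keeps the highest point that falls into it.
    idx = np.floor((local + half_extent) / resolution).astype(int)
    np.maximum.at(height, (idx[:, 0], idx[:, 1]), z)
    return height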
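Finally, the terrain-aware contact PD controller can be understood as a servo that, whenever a foot is predicted to be in contact, drives its height toward the terrain height beneath it, counteracting floating and penetration. The sketch below applies a vertical PD force using the map from the previous sketch; the gains, the vertical-only force, and the terrain_height helper are illustrative assumptions, whereas the paper's correction module operates within a full physics-based body model.

import numpy as np

def terrain_height(height_map, root_xy, foot_xy, half_extent=2.0, resolution=0.05):
    """Look up the elevation-map cell underneath the foot."""
    idx = np.floor((foot_xy - root_xy + half_extent) / resolution).astype(int)
    idx = np.clip(idx, 0, height_map.shape[0] - 1)
    return height_map[idx[0], idx[1]]

def contact_pd_force(foot_pos, foot_vel, ground_z, in_contact, kp=800.0, kd=40.0):
    """Vertical PD force pulling a contacting foot onto the local terrain."""
    if not in_contact or not np.isfinite(ground_z):
        return 0.0                       # no contact or unobserved terrain
    # Positive error means the foot penetrates the ground (push up);
    # negative error means it floats above the ground (pull down).
    return kp * (ground_z - foot_pos[2]) - kd * foot_vel[2]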

BibTeX

@article{liu2024egohdm,
  title={EgoHDM: An Online Egocentric-Inertial Human Motion Capture, Localization, and Dense Mapping System},
  author={Liu, Bonan and Yin, Handi and Kaufmann, Manuel and He, Jinhao and Christen, Sammy and Song, Jie and Hui, Pan},
  journal={arXiv preprint arXiv:2409.00343},
  year={2024}
}