Generating accurate and plausible full-body avatar motion has been a persistent challenge for immersive mixed-reality experiences. Existing solutions built on Head-Mounted Devices (HMDs) typically work from limited input signals, such as the 6-DoF (six degrees of freedom) poses of the head and hands. Recent methods generate impressive full-body motion from head and hand signals, but they share a common limitation: they assume the hands are always fully visible. That assumption holds when hands are tracked with motion controllers, but it breaks down in many mixed-reality experiences where hand tracking relies on egocentric sensors, because the HMD’s restricted field of view leaves the hands only partially visible.
Researchers from the Microsoft Mixed Reality & AI Lab in Cambridge, UK, have introduced HMD-NeMo (HMD Neural Motion Model), a unified neural network that generates plausible and accurate full-body motion even when the hands are only partially visible. HMD-NeMo runs online and in real time, making it suitable for dynamic mixed-reality scenarios.
At the core of HMD-NeMo lies a spatiotemporal encoder featuring novel temporally adaptable mask tokens (TAMT). These tokens play a crucial role in encouraging plausible motion in the absence of hand observations. The approach incorporates recurrent neural networks to capture temporal information efficiently and a transformer to model complex relations between different input signal components.
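To make the mask-token idea concrete, here is a minimal sketch in Python with NumPy. It is not the paper's implementation: the feature size, the weight matrices `W_in`, `W_h`, and `W_mask`, the function `run_hand_stream`, and the rule that derives the next mask token from the latest recurrent state are all assumptions chosen for illustration. The sketch only shows the general pattern of substituting a time-varying token whenever a hand observation is missing.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT = 8  # hypothetical feature size for one input component (e.g. one hand)

# Hypothetical parameters; random here, learned in a real model.
W_in = rng.normal(scale=0.1, size=(FEAT, FEAT))    # input -> state
W_h = rng.normal(scale=0.1, size=(FEAT, FEAT))     # state -> state
W_mask = rng.normal(scale=0.1, size=(FEAT, FEAT))  # state -> next mask token


def run_hand_stream(features, visible):
    """Run a toy recurrent encoder over one hand's feature stream.

    features: (T, FEAT) float array of per-frame hand features.
    visible:  (T,) bool array, False where the hand is out of view.
    Returns the inputs actually fed to the encoder and the recurrent states.
    """
    h = np.zeros(FEAT)           # recurrent state
    mask_token = np.zeros(FEAT)  # initial mask token (assumed zero)
    used = np.empty_like(features)
    states = np.empty_like(features)
    for t in range(features.shape[0]):
        # Substitute the current mask token when the hand is not observed.
        x = features[t] if visible[t] else mask_token
        used[t] = x
        h = np.tanh(W_in @ x + W_h @ h)
        states[t] = h
        # Adapt the mask token from the latest state, so a missing frame
        # is filled with something that depends on the recent past.
        mask_token = W_mask @ h
    return used, states


features = rng.normal(size=(5, FEAT))
visible = np.array([True, True, False, False, True])
used, states = run_hand_stream(features, visible)
```

In this toy version, the frames where the hand is out of view receive different tokens from one another, because each token is recomputed from the state of the preceding frame rather than being a single fixed placeholder.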
The paper outlines two scenarios considered for evaluation: Motion Controllers (MC), where hands are tracked with motion controllers, and Hand Tracking (HT), where hands are tracked via egocentric hand-tracking sensors. HMD-NeMo proves to be the first approach capable of handling both scenarios within a unified framework. In the HT scenario, where hands may be partially or entirely out of the field of view, the temporally adaptable mask tokens demonstrate their effectiveness in maintaining temporal coherence.
The proposed method is trained using a loss function that considers data accuracy, smoothness, and auxiliary tasks for human pose reconstruction in SE(3). The experiments involve extensive evaluations on the AMASS dataset, a large collection of human motion sequences converted into 3D human meshes. Metrics such as mean per-joint position error (MPJPE) and mean per-joint velocity error (MPJVE) are employed to assess the performance of HMD-NeMo.
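The two metrics named above can be sketched in a few lines of Python with NumPy. This follows the common definitions (average Euclidean distance between predicted and ground-truth joint positions, and the same for frame-to-frame joint velocities). The function names, the `(frames, joints, 3)` array layout, and the `fps` parameter are assumptions for illustration; the paper's exact units and evaluation protocol may differ.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (frames, joints, 3) in the same units.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def mpjve(pred, gt, fps=1.0):
    """Mean per-joint velocity error.

    Velocities are finite differences between consecutive frames,
    scaled by the (assumed) frame rate `fps`.
    """
    pred_vel = np.diff(pred, axis=0) * fps
    gt_vel = np.diff(gt, axis=0) * fps
    return float(np.linalg.norm(pred_vel - gt_vel, axis=-1).mean())


# Toy example: every predicted joint is offset by (3, 4, 0) in every frame.
gt = np.zeros((3, 2, 3))
pred = gt + np.array([3.0, 4.0, 0.0])

position_error = mpjpe(pred, gt)  # offset length is 5 for every joint
velocity_error = mpjve(pred, gt)  # constant offset, so velocities match
```

The toy example shows why both metrics are reported: a constant offset produces a large position error but zero velocity error, while jittery predictions could score well on position and poorly on velocity.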
Comparisons with state-of-the-art approaches in the motion controller scenario reveal that HMD-NeMo achieves superior accuracy and smoother motion generation. Furthermore, the model’s generalizability is demonstrated through cross-dataset evaluations, outperforming existing methods on multiple datasets.
Ablation studies delve into the impact of different components, including the effectiveness of the TAMT module in handling missing hand observations. The study shows that HMD-NeMo’s design choices, such as the spatiotemporal encoder, contribute significantly to its success.
In conclusion, HMD-NeMo represents a significant step forward in addressing the challenges of generating full-body avatar motion in mixed-reality scenarios. Its versatility in handling both motion controller and hand tracking scenarios, coupled with its impressive performance metrics, positions it as a pioneering solution in the field.
Check out the Paper. All credit for this research goes to the researchers of this project.
Pragati Jhunjhunwala is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a tech enthusiast with a keen interest in software and data science applications, and she is always reading about developments in different fields of AI and ML.