The steady development of intelligent systems that replicate and comprehend human behavior has driven significant advances in the complementary fields of Computer Vision and Artificial Intelligence (AI). Machine learning models are gaining immense popularity as they bridge the gap between reality and virtuality. Although 3D human body modeling has received a great deal of attention in computer vision, the acoustic side, producing 3D spatial audio from speech and body motion, remains largely unexplored. The focus has long been on the visual fidelity of artificial representations of the human body.
Human perception is multi-modal: it incorporates both auditory and visual cues into our comprehension of the environment. To create a sense of presence and immersion in a 3D world, it is essential to simulate 3D sound that corresponds accurately with the visual scene. To address these challenges, a team of researchers from Shanghai AI Laboratory and Meta Reality Labs Research has introduced a model that produces accurate 3D spatial audio representations for entire human bodies.
The team has shared that the proposed technique uses head-mounted microphones and human body pose data to synthesize 3D spatial sound precisely. The case study focuses on a telepresence scenario in augmented and virtual reality (AR/VR) in which users communicate through full-body avatars. The inputs are egocentric audio from head-mounted microphones and the body pose data that is used to animate the avatar.
Current methods for sound spatialization presume that the sound source is known and that it has been captured cleanly and without interference. The proposed approach circumvents these limitations by using body pose data to train a multi-modal network that distinguishes between the sources of different sounds and produces precisely spatialized signals. The input consists of audio from seven head-mounted microphones and the subject's body pose; the output is the sound field surrounding the body.
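To make the input/output relationship concrete, the sketch below mocks up the data shapes involved. Nearly everything here is an assumption for illustration (the 48 kHz sample rate, the joint count, and the fixed mixing matrix standing in for the learned network); only the seven input microphones and the 345-channel target array come from the article.

```python
import numpy as np

# Hypothetical shapes, for illustration only:
# 7-channel egocentric audio from head-mounted mics, plus 3D body pose.
num_mics = 7          # head-mounted microphones (per the article)
samples = 48_000      # one second at an assumed 48 kHz sample rate
num_joints = 23       # assumed skeleton size; the actual rig may differ

rng = np.random.default_rng(0)
mic_audio = rng.standard_normal((num_mics, samples))   # input audio
body_pose = rng.standard_normal((num_joints, 3))       # input pose (x, y, z)

# The real model fuses both modalities with a learned multi-modal network.
# Here a fixed mixing matrix merely stands in for it, mapping the 7 mic
# channels to 345 output channels (the size of the dataset's mic array).
num_targets = 345
mixing = rng.standard_normal((num_targets, num_mics)) / num_mics
sound_field = mixing @ mic_audio                        # (345, samples)
print(sound_field.shape)  # (345, 48000)
```

The pose tensor is unused in this stand-in; in the actual system it is what lets the network disambiguate where on the body each sound originates.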
The team has conducted an empirical evaluation demonstrating that, when trained with a suitable loss function, the model reliably produces sound fields resulting from body movements. The code and dataset have been released publicly, promoting openness, reproducibility, and further development in this field. The GitHub repository can be accessed at https://github.com/facebookresearch/SoundingBodies.
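The article does not specify which loss the authors found suitable, but losses for audio synthesis commonly combine a time-domain term with a spectral-magnitude term. The sketch below is a minimal, hypothetical example of such a combined loss, not the paper's actual objective; the window size, hop, and weighting are all illustrative choices.

```python
import numpy as np

def stft_mag(x, win=512, hop=256):
    """Magnitude STFT of a 1-D signal using a Hann window (minimal version)."""
    window = np.hanning(win)
    frames = [x[i:i + win] * window for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def spatial_audio_loss(pred, target, alpha=1.0):
    """Time-domain L2 plus spectral-magnitude L1, averaged over channels.

    pred/target: arrays of shape (channels, samples). A hypothetical
    objective in the spirit of common audio-synthesis losses.
    """
    time_term = np.mean((pred - target) ** 2)
    spec_term = np.mean([np.abs(stft_mag(p) - stft_mag(t)).mean()
                         for p, t in zip(pred, target)])
    return time_term + alpha * spec_term

rng = np.random.default_rng(1)
a = rng.standard_normal((2, 2048))
print(spatial_audio_loss(a, a))  # 0.0 for identical signals
```

The spectral term is what typically keeps synthesized audio perceptually close to the reference even when waveforms are slightly misaligned in time.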
The primary contributions of the work have been summarized by the team as follows.
- A unique technique has been introduced that uses head-mounted microphones and body poses to render realistic 3D sound fields for human bodies.
- A comprehensive empirical evaluation has been shared that highlights the importance of body pose and a well-thought-out loss function.
- The team has shared a new dataset they have produced that combines multi-view human body data with spatial audio recordings from a 345-microphone array.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical-thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.