Thanks to the rapid recent progress of face generation and manipulation tools, the identity or attributes shown in a face video can now be altered with remarkable ease. This has many striking, legitimate applications in producing entertaining videos, movies, and other media. However, these methods can also be used maliciously, posing a serious threat to society's sense of security and trust. Consequently, detecting video face forgeries has recently become a popular research topic.
To date, one effective line of research attempts to distinguish real from fake images by looking for “spatial” artifacts in the generated frames (such as checkerboard patterns, unnatural textures, and other artifacts characteristic of the generative model). These techniques achieve remarkable results on spatially linked artifacts, but they neglect the temporal coherence of a video and miss “temporal” artifacts such as flickering and discontinuity in video face forgeries. Recent studies have noted this problem and attempt to address it by exploiting temporal cues.
The resulting models can recognize unnatural artifacts at the temporal level, but they fall short at detecting spatial artifacts. In this research, the authors aim to capture both spatial and temporal artifacts to identify general video face forgeries. In principle, a spatiotemporal network (a 3D ConvNet) can search for both kinds of artifacts. However, the authors find that naive training makes it rely too readily on spatial artifacts while disregarding temporal ones when reaching a decision, leading to poor generalization. The reason is that spatial artifacts are typically more visible than temporal incoherence, so a 3D convolutional network latches onto them more easily.
The challenge, therefore, is to make the spatiotemporal network capable of capturing both temporal and spatial artifacts. In this study, researchers from the University of Science and Technology of China, Microsoft Research Asia, and the Hefei Comprehensive National Science Center propose a novel training strategy called AltFreezing to achieve this. The key idea is to alternately freeze the spatial- and temporal-related weights during training. Specifically, the spatiotemporal network is built from 3D resblocks that combine spatial convolutions with a kernel size of 1 × Kh × Kw and temporal convolutions with a kernel size of Kt × 1 × 1. These spatial and temporal convolutional kernels capture spatial- and temporal-level features, respectively. To capture both spatial and temporal artifacts, the AltFreezing strategy forces the two sets of weights to be updated alternately.
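The alternating-freeze idea can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' code: the class and function names are hypothetical, and the real network stacks many such factorized resblocks with a tuned alternation schedule.

```python
import torch
import torch.nn as nn

class FactorizedBlock3D(nn.Module):
    """Toy 3D resblock split into a spatial conv (1 x Kh x Kw) and a
    temporal conv (Kt x 1 x 1), as described in the article.
    Hypothetical sketch; the paper's exact architecture may differ."""
    def __init__(self, ch: int = 8):
        super().__init__()
        self.spatial = nn.Conv3d(ch, ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(ch, ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):
        # Input x has shape (N, C, T, H, W); both convs preserve the shape.
        return x + self.temporal(torch.relu(self.spatial(x)))

def set_freeze(block: FactorizedBlock3D, freeze_spatial: bool) -> None:
    """AltFreezing core: freeze one group of weights so only the other
    group receives gradient updates in the current phase."""
    for p in block.spatial.parameters():
        p.requires_grad = not freeze_spatial
    for p in block.temporal.parameters():
        p.requires_grad = freeze_spatial

block = FactorizedBlock3D()
for step in range(4):
    # Alternate every step here for illustration; the actual
    # alternation ratio is a training hyperparameter.
    set_freeze(block, freeze_spatial=(step % 2 == 0))
```

Freezing the spatial kernels forces the loss to be reduced through the temporal kernels (and vice versa), which is how the method keeps the network from leaning solely on the more conspicuous spatial artifacts.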
Additionally, they provide a set of video-level fake-data augmentation methods for generating forged training clips. These techniques fall into two categories. The first produces fake clips containing only temporal artifacts, by randomly repeating and dropping frames of real clips. The second blends a region from one real clip into another real clip, yielding clips with only spatial artifacts. These are the first video-level augmentation techniques to produce fake videos constrained to either spatial or temporal artifacts, and they help the spatiotemporal model capture both kinds. With the two techniques above, the method achieves state-of-the-art performance in various challenging face forgery detection scenarios, including generalization to unseen forgeries and robustness to diverse perturbations. The authors also offer a thorough analysis of their methodology to confirm its efficacy.
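The two augmentation families can be sketched roughly as follows in NumPy. The function names and the exact repeat/drop scheme are illustrative assumptions; the paper's augmentations are more elaborate (e.g. the blended region would follow a face mask rather than a fixed box).

```python
import numpy as np

def temporal_fake(clip: np.ndarray, rng=None) -> np.ndarray:
    """Fake clip with ONLY temporal artifacts: drop one random frame and
    repeat another, producing flicker/discontinuity while every individual
    frame stays pristine. clip has shape (T, H, W, C)."""
    if rng is None:
        rng = np.random.default_rng()
    t = clip.shape[0]
    drop = int(rng.integers(0, t))            # frame to remove
    keep = [i for i in range(t) if i != drop]
    rep = int(rng.choice(keep))               # frame to duplicate
    idx = []
    for i in keep:
        idx.append(i)
        if i == rep:
            idx.append(i)                     # duplicated frame
    return clip[idx]                          # same length T as the input

def spatial_fake(clip_a: np.ndarray, clip_b: np.ndarray, box) -> np.ndarray:
    """Fake clip with ONLY spatial artifacts: paste a region from one real
    clip into another, identically on every frame, so the result is
    temporally coherent but has a spatial blending boundary."""
    y0, y1, x0, x1 = box
    out = clip_a.copy()
    out[:, y0:y1, x0:x1] = clip_b[:, y0:y1, x0:x1]
    return out
```

Training on clips that contain exactly one artifact type gives the network supervision for both failure modes, complementing the AltFreezing schedule.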
The following are their three key contributions.
• They propose exploring both spatial and temporal artifacts for detecting video face forgeries, and introduce a new training strategy called AltFreezing to accomplish this.
• They offer video-level fake-data augmentation techniques that encourage the model to capture a broader spectrum of forgeries.
• Extensive tests on five benchmark datasets, including evaluations of the proposed approach across manipulations and datasets, show it achieves new state-of-the-art performance.
Check out the Paper and Github. All credit for this research goes to the researchers on this project.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.