VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Wang, Limin; Huang, Bingkun; Zhao, Zhiyu; Tong, Zhan; He, Yinan; Wang, Yi; Wang, Yali; Qiao, Yu

Computer Science > Computer Vision and Pattern Recognition

arXiv:2303.16727 (cs)

[Submitted on 29 Mar 2023 (v1), last revised 18 Apr 2023 (this version, v2)]

Abstract: Scale is the primary factor for building a powerful foundation model that generalizes well to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that the video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is already very efficient due to the high masking ratio in the encoder, masking the decoder further reduces the overall computational cost. This enables the efficient pre-training of billion-level models on video. We also use a progressive training paradigm that involves an initial pre-training on a diverse multi-sourced unlabeled dataset, followed by a post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance on the Kinetics datasets (90.0% on K400 and 89.9% on K600) and the Something-Something datasets (68.7% on V1 and 77.0% on V2). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating their effectiveness as general video representation learners. The code and models are available at this https URL.
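The dual masking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `dual_masks`, the uniform random sampling, and the ratios used below are all assumptions chosen for illustration, since the abstract does not specify the masking patterns or ratios. The sketch only shows how the encoder and the decoder can each work on a different, smaller subset of the token indices.

```python
import random


def dual_masks(num_tokens, encoder_mask_ratio, decoder_mask_ratio, seed=0):
    """Split token indices into an encoder subset and a decoder subset.

    Hypothetical helper: uniform random sampling stands in for whatever
    masking patterns the paper actually uses.
    """
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)

    # Encoder mask: only the tokens that survive it are fed to the encoder.
    num_visible = round(num_tokens * (1 - encoder_mask_ratio))
    encoder_visible = sorted(indices[:num_visible])

    # Decoder mask: of the tokens hidden from the encoder, only a share
    # is kept as reconstruction targets, which shrinks the decoder input.
    hidden = indices[num_visible:]
    num_targets = round(len(hidden) * (1 - decoder_mask_ratio))
    decoder_targets = sorted(hidden[:num_targets])

    return encoder_visible, decoder_targets


# Illustrative ratios only (assumed, not taken from the abstract).
encoder_visible, decoder_targets = dual_masks(100, 0.9, 0.5)
print(len(encoder_visible), len(decoder_targets))
```

With 100 tokens and these assumed ratios, the encoder receives 10 tokens and the decoder reconstructs 45 of the remaining 90, rather than all 90, which is where the extra saving over encoder-only masking would come from.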

Comments:	CVPR 2023 camera-ready version
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2303.16727 [cs.CV]
	(or arXiv:2303.16727v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2303.16727

Submission history

From: Limin Wang
[v1] Wed, 29 Mar 2023 14:28:41 UTC (239 KB)
[v2] Tue, 18 Apr 2023 11:46:41 UTC (239 KB)
