Self-supervised pre-training has recently demonstrated success on large-scale multimodal data, and state-of-the-art contrastive learning methods often enforce feature consistency across cross-modality inputs, such as video/audio or video/text pairs. However, even on commonly adopted instructional videos (i.e., HowTo100M [1]), such cross-modality alignment (CMA) provides only weak and noisy supervision, since two modalities can be semantically misaligned even when they are temporally aligned. We primarily study the conflicts between the gradients g_va and g_vt produced by the video-audio and video-text CMA losses, which indicate these often poor alignments, and discuss how to harmonize them.
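To make the conflict measurable, the sketch below (PyTorch assumed) computes two InfoNCE-style CMA losses and the cosine similarity between their gradients with respect to shared parameters; a negative cosine, i.e., an angle of more than 90 degrees, signals a conflict. The info_nce and gradient_cosine helpers and all shapes are illustrative assumptions, not the VATT codebase.

```python
# Illustrative sketch only: measuring the conflict between the gradients
# g_va (video-audio loss) and g_vt (video-text loss). The InfoNCE loss and
# the choice of shared parameters are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings, [B, D] each."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def gradient_cosine(loss_va, loss_vt, shared_params):
    """Cosine similarity between the gradients of two CMA losses w.r.t.
    the shared (e.g. video-branch) parameters; < 0 means they conflict."""
    def flat_grad(loss):
        grads = torch.autograd.grad(loss, shared_params, retain_graph=True,
                                    allow_unused=True)
        return torch.cat([(g if g is not None else torch.zeros_like(p)).reshape(-1)
                          for g, p in zip(grads, shared_params)])
    return F.cosine_similarity(flat_grad(loss_va), flat_grad(loss_vt), dim=0)
```

Given batched embeddings v, a, t from the shared backbone, gradient_cosine(info_nce(v, a), info_nce(v, t), shared_params) yields the conflict measure for the current step.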
A natural conjecture follows: well-aligned video-text-audio triplets should exhibit higher cosine similarity between g_va and g_vt, while misaligned triplets should exhibit lower, even negative, similarity. We then propose to harmonize such gradients during pre-training via two techniques: (i) cross-modality gradient realignment, which modifies conflicting CMA loss gradients through geometric steps (sketched below); and (ii) gradient-based curriculum learning, which treats gradient conflict as an indicator of sample noisiness and prioritizes training on less noisy triplets (a toy sketch follows the results summary).
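The text above describes realignment only as modifying the gradients geometrically; the sketch below assumes a PCGrad-style projection, in which each gradient loses its component along the other whenever the two conflict. This is one plausible instantiation, not necessarily the paper's exact rule.

```python
import torch

def realign(g_va: torch.Tensor, g_vt: torch.Tensor, eps: float = 1e-12):
    """De-conflict two flattened gradient vectors: if their dot product is
    negative (angle > 90 degrees), project each one onto the normal plane
    of the other; otherwise return both unchanged."""
    dot = torch.dot(g_va, g_vt)
    if dot >= 0:
        return g_va, g_vt
    g_va_proj = g_va - dot / (g_vt.pow(2).sum() + eps) * g_vt
    g_vt_proj = g_vt - dot / (g_va.pow(2).sum() + eps) * g_va
    return g_va_proj, g_vt_proj
```

The realigned gradients can then be summed and written back into the shared parameters before the optimizer step, as in standard gradient-surgery implementations.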
Applying these gradient harmonization techniques to pre-training VATT on the HowTo100M dataset consistently improves its performance on different downstream tasks.
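For technique (ii), computing exact per-triplet gradient cosines through the full network is expensive. The toy sketch below therefore probes only gradients with respect to the video embeddings under a simple pairwise alignment loss, and keeps a growing fraction of the least-conflicting triplets per batch; the indicator, the probe point, and the 50-to-100 percent schedule are all illustrative assumptions rather than the paper's recipe.

```python
import torch
import torch.nn.functional as F

def per_triplet_conflict(v, a, t):
    """Per-triplet cosine similarity between the gradients, w.r.t. the video
    embeddings v [B, D], of a simple video-audio and video-text alignment
    loss (1 - cosine). Low or negative values flag noisy triplets."""
    v = v.detach().requires_grad_(True)
    loss_va = (1 - F.cosine_similarity(v, a.detach())).sum()
    loss_vt = (1 - F.cosine_similarity(v, t.detach())).sum()
    g_va, = torch.autograd.grad(loss_va, v, retain_graph=True)
    g_vt, = torch.autograd.grad(loss_vt, v)
    return F.cosine_similarity(g_va, g_vt, dim=-1)  # shape [B]

def curriculum_weights(conflict, step, total_steps):
    """0/1 weights that keep the least-conflicting fraction of the batch,
    growing linearly from 50% to 100% of the triplets over training."""
    keep = 0.5 + 0.5 * min(step / total_steps, 1.0)
    k = max(1, int(keep * conflict.numel()))
    threshold = conflict.topk(k).values[-1]
    return (conflict >= threshold).float()
```

The resulting weights can multiply the per-triplet CMA losses, so early updates are driven mainly by triplets whose modality gradients already agree.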