
Adapt and explore: Multimodal mixup for representation learning

Published: 01 May 2024

Abstract

Research on general multimodal systems has gained significant attention due to the proliferation of multimodal data in the real world. Despite the remarkable performance achieved by existing multimodal representation learning schemes, missing modalities remain a persistent issue, limiting the overall applicability of multimodal systems. To address this issue, we propose a novel approach named M3ixup (Multi-Modal Mixup), which leverages the mixup strategy to improve unimodal and multimodal representation learning while simultaneously increasing robustness against missing modalities. First, we adopt a productive multimodal learning scheme that models representations with modality-specific and joint-modality encoders. This general scheme makes the proposed approach transferable to various multimodal learning scenarios, including supervised, unsupervised, and reinforcement learning. Then, unimodal input and manifold mixup are used to enhance the modality-specific encoders so that they capture intra-modal dynamics. Next, we present multimodal mixup, which mixes different modalities and generates mixed multimodal representations in adapting and exploring steps. The adapting step bridges the large information gap between unimodal and multimodal representations in the joint space during alignment, while the exploring step further captures inter-modal dynamics and exploits the non-linear relationships among different modalities. The mixed views are then aligned with the original multimodal representations through contrastive learning. Additionally, we extend the mixup strategy to the loss function of multimodal contrastive learning in two steps to improve the alignment between mixed and original views. Extensive experiments on public datasets across various multimodal learning scenarios demonstrate the superiority of the proposed M3ixup. The code is available at https://github.com/RH-Lin/m3ixup.
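
To make the mixup-then-align idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation; see the linked repository for that) of mixing two modality-specific representations with a Beta-sampled weight and aligning the mixed view with the joint multimodal representation via an InfoNCE-style contrastive loss. The encoder outputs z_text, z_audio, z_joint, the Beta(alpha, alpha) prior, and the temperature are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: mixup of unimodal representations followed by
# contrastive alignment with the joint representation. All names, the
# Beta(alpha, alpha) prior, and the InfoNCE form are assumptions.
import torch
import torch.nn.functional as F
from torch.distributions import Beta


def multimodal_mixup(z_a: torch.Tensor, z_b: torch.Tensor, alpha: float = 1.0):
    """Interpolate two modality-specific representations with a Beta-sampled weight."""
    lam = Beta(alpha, alpha).sample().to(z_a.device)
    return lam * z_a + (1.0 - lam) * z_b, lam


def info_nce(query: torch.Tensor, key: torch.Tensor, temperature: float = 0.1):
    """Standard InfoNCE: matching (query_i, key_i) pairs are positives;
    all other keys in the batch act as negatives."""
    q = F.normalize(query, dim=-1)
    k = F.normalize(key, dim=-1)
    logits = q @ k.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)


def mixup_align_step(z_text, z_audio, z_joint, alpha=1.0, temperature=0.1):
    """One hypothetical training step: mix two unimodal representations and
    align the mixed view with the joint multimodal representation."""
    z_mix, _ = multimodal_mixup(z_text, z_audio, alpha)
    return info_nce(z_mix, z_joint, temperature)
```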

Highlights

Introducing the mixup strategy into multimodal representation learning.
Conducting multimodal mixup through adapting and exploring steps.
Mixing negative samples in multimodal contrastive learning (see the sketch after this list).
Improving robustness against missing modalities and the performance of multimodal representations.
Transferable to various multimodal learning scenarios.
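
One highlighted idea is mixing negative samples inside the contrastive objective. Below is a minimal, hypothetical sketch of one way to append mixed negatives to an InfoNCE-style loss; it is not the paper's exact two-step loss formulation, and the shuffle-based pairing, mixing weight, and temperature are assumptions.

```python
# Sketch of adding mixed negatives to an InfoNCE-style loss. The exact
# two-step formulation in the paper may differ; shapes and the shuffle-based
# pairing below are assumptions chosen for illustration.
import torch
import torch.nn.functional as F


def info_nce_with_mixed_negatives(query, keys, lam=0.5, temperature=0.1):
    """query, keys: (B, D). Positives are the matching keys; extra negatives
    are convex combinations of randomly paired keys from the same batch."""
    q = F.normalize(query, dim=-1)
    k = F.normalize(keys, dim=-1)

    # Build synthetic negatives by mixing each key with a shuffled partner.
    perm = torch.randperm(k.size(0), device=k.device)
    mixed_neg = F.normalize(lam * k + (1.0 - lam) * k[perm], dim=-1)

    logits_pos = q @ k.t() / temperature           # (B, B): diagonal = positives
    logits_mix = q @ mixed_neg.t() / temperature   # (B, B): all treated as negatives

    logits = torch.cat([logits_pos, logits_mix], dim=1)
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)
```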



Published In

Information Fusion, Volume 105, Issue C, May 2024, 619 pages

Publisher

Elsevier Science Publishers B. V., Netherlands

Publication History

Published: 01 May 2024

Author Tags

  1. Unimodal and multimodal mixup
  2. Multimodal representation learning
  3. Multimodal contrastive learning
  4. Missing modality issues
  5. Various multimodal learning scenarios

Qualifiers

  • Research-article
