
Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval

Published: 17 November 2022

Abstract

As an increasingly popular task in multimedia information retrieval, video moment retrieval (VMR) aims to localize the target moment in an untrimmed video according to a given language query. Most previous methods depend heavily on numerous manual annotations (i.e., moment boundaries), which are extremely expensive to acquire in practice. In addition, because of the domain gap between different datasets, directly applying these pre-trained models to an unseen domain leads to a significant performance drop. In this paper, we focus on a novel task: cross-domain VMR, where a fully-annotated dataset is available in one domain (the "source domain") while the domain of interest (the "target domain") contains only unannotated data. To the best of our knowledge, this is the first study of cross-domain VMR. To address this new task, we propose a novel Multi-Modal Cross-Domain Alignment (MMCDA) network that transfers annotation knowledge from the source domain to the target domain. However, the domain discrepancy between the source and target domains and the semantic gap between videos and queries mean that a model trained only on the source domain still degrades on the target domain. To solve this problem, we develop three novel modules: (i) a domain alignment module aligns the feature distributions of each modality across the two domains; (ii) a cross-modal alignment module maps video and query features into a joint embedding space and aligns the feature distributions of the two modalities in the target domain; and (iii) a specific alignment module captures the fine-grained similarity between each frame and the given query for precise localization. By jointly training these three modules, MMCDA learns domain-invariant and semantically aligned cross-modal representations. Extensive experiments on three challenging benchmarks (ActivityNet Captions, Charades-STA, and TACoS) show that our cross-domain MMCDA outperforms all state-of-the-art single-domain methods, improving performance by more than 7% in representative cases, which demonstrates its effectiveness.
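To make the three alignment objectives above concrete, the minimal PyTorch sketch below illustrates one plausible way to realize them. It is not the authors' implementation: the encoder shapes, the linear-kernel MMD used for domain alignment, the InfoNCE-style contrastive loss used for cross-modal alignment, and the cosine frame-query scores used for specific alignment are all assumptions chosen to mirror the abstract's description.

# Illustrative reconstruction of the three alignment losses (NOT the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def mmd_loss(source_feats, target_feats):
    """Linear-kernel Maximum Mean Discrepancy between two feature batches
    (a common distribution-alignment criterion; the paper's exact choice may differ)."""
    return (source_feats.mean(dim=0) - target_feats.mean(dim=0)).pow(2).sum()


class MMCDASketch(nn.Module):
    """Hypothetical skeleton: video/query projections plus the three alignment modules."""

    def __init__(self, video_dim=4096, query_dim=300, hidden_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden_dim)   # e.g., C3D frame features
        self.query_proj = nn.Linear(query_dim, hidden_dim)   # e.g., GloVe word embeddings

    def encode(self, video, query):
        # video: (B, T, video_dim), query: (B, L, query_dim) -> both projected to hidden_dim
        return self.video_proj(video), self.query_proj(query)

    def domain_alignment_loss(self, src_v, tgt_v, src_q, tgt_q):
        # (i) Align source/target feature distributions within each modality.
        return mmd_loss(src_v.mean(1), tgt_v.mean(1)) + mmd_loss(src_q.mean(1), tgt_q.mean(1))

    def cross_modal_alignment_loss(self, tgt_v, tgt_q):
        # (ii) Pull matched target-domain video/query pairs together in the joint space
        # (an InfoNCE-style contrastive objective is assumed here).
        v = F.normalize(tgt_v.mean(1), dim=-1)   # pool frames -> (B, D)
        q = F.normalize(tgt_q.mean(1), dim=-1)   # pool words  -> (B, D)
        logits = v @ q.t() / 0.07                # temperature is an assumed hyperparameter
        labels = torch.arange(v.size(0), device=v.device)
        return F.cross_entropy(logits, labels)

    def specific_alignment_scores(self, v, q):
        # (iii) Fine-grained frame-query similarity used to localize the moment.
        q_global = F.normalize(q.mean(1), dim=-1)              # (B, D)
        frames = F.normalize(v, dim=-1)                         # (B, T, D)
        return torch.einsum("btd,bd->bt", frames, q_global)     # (B, T) per-frame scores


# Usage sketch with random tensors standing in for extracted features.
model = MMCDASketch()
src_v, tgt_v = torch.randn(8, 64, 4096), torch.randn(8, 64, 4096)
src_q, tgt_q = torch.randn(8, 20, 300), torch.randn(8, 20, 300)
sv, sq = model.encode(src_v, src_q)
tv, tq = model.encode(tgt_v, tgt_q)
loss = model.domain_alignment_loss(sv, tv, sq, tq) + model.cross_modal_alignment_loss(tv, tq)

In a complete training loop one would also add the supervised localization loss on the annotated source domain and weight the three alignment terms; those weights are hyperparameters not specified in the abstract.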



Published In

IEEE Transactions on Multimedia, Volume 25, 2023, 8932 pages

Publisher

IEEE Press
