chapter

Challenges and applications in multimodal machine learning

Published: 01 October 2018

References

[1]
D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. 2016. Deep speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pp. 173--182. 23, 25
[2]
C.-N. Anagnostopoulos, T. Iliou, and I. Giannoukos. 2015. Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011. Artificial Intelligence Review, 43(2):155--177. 23
[3]
G. Andrew, R. Arora, J. Bilmes, and K. Livescu. 2013. Deep canonical correlation analysis. In International Conference on Machine Learning, pp. 1247--1255. 25, 32
[4]
L. Anne Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell. 2016. Deep compositional captioning: Describing novel object categories without paired training data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1--10. 37
[5]
S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2425--2433. 23, 27
[6]
R. Arora and K. Livescu. 2013. Multi-view cca-based acoustic features for phonetic recognition across speakers and domains. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 7135--7139. IEEE. 34
[7]
P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli. 2010. Multimodal fusion for multimedia analysis: a survey. Multimedia systems, 16(6):345--379. 21
[8]
D. Bahdanau, K. Cho, and Y. Bengio. 2014. Neural Machine Translation By Jointly Learning To Align and Translate. ICLR. 29
[9]
T. Baltrušaitis, C. Ahuja, and L.-P. Morency. 2017. Multimodal machine learning: A survey and taxonomy. arXiv preprint arXiv:1705.09406. 20, 21, 38
[10]
L. W. Barsalou. 2008. Grounded cognition. Annu. Rev. Psychol., 59:617--645. 36
[11]
Y. Bengio, A. Courville, and P. Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798--1828. 23, 26, 27, 28
[12]
J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, et al. 2010. Vizwiz: nearly real-time answers to visual questions. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, pp. 333--342. ACM. 22
[13]
A. Blum and T. Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pp. 92--100. ACM. 33, 34
[14]
H. Bourlard and S. Dupont. 1996. A new ASR approach based on independent processing and recombination of partial frequency bands. In Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, volume 1, pp. 426--429. IEEE. 21
[15]
M. Brand, N. Oliver, and A. Pentland. 1997. Coupled hidden markov models for complex action recognition. In Computer vision and pattern recognition, 1997. proceedings., 1997 ieee computer society conference on, pp. 994--999. IEEE. 21
[16]
M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios. 2010. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 3594--3601. IEEE. 31
[17]
E. Bruni, G. Boleda, M. Baroni, and N.-K. Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 136--145. Association for Computational Linguistics. 36
[18]
E. Bruni, N.-K. Tran, and M. Baroni. 2014. Multimodal distributional semantics. J. Artif. Intell. Res. (JAIR), 49: 1--47. 36
[19]
Y. Cao, M. Long, J. Wang, Q. Yang, and P. S. Yu. 2016. Deep visual-semantic hashing for cross-modal retrieval. In KDD, pp. 1445--1454. 27, 31
[20]
J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, et al. 2005. The ami meeting corpus: A pre-announcement. In International Workshop on Machine Learning for Multimodal Interaction, pp. 28--39. Springer. 22
[21]
X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325. 30
[22]
C. M. Christoudias, K. Saenko, L.-P. Morency, and T. Darrell. 2006. Co-adaptation of audio-visual speech and gesture classifiers. In Proceedings of the 8th international conference on Multimodal interfaces, pp. 84--91. ACM. 33
[23]
C. M. Christoudias, R. Urtasun, and T. Darrell. 2008. Multi-view learning in the presence of view disagreement. In UAI. 33
[24]
P. Cosi, E. M. Caldognetto, K. Vagges, G. A. Mian, and M. Contolini. 1994. Bimodal recognition experiments with recurrent neural networks. In Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Conference on, volume 2, pp. II-553. IEEE. 30
[25]
F. De la Torre and J. F. Cohn. 2011. Facial expression analysis. In Visual analysis of humans, pp. 377--409. Springer. 22
[26]
S. K. D'Mello and J. Kory. 2015. A review and meta-analysis of multimodal affect detection systems. ACM Computing Surveys (CSUR), 47(3): 43. 22, 25, 26
[27]
G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Rapantzikos, G. Skoumas, and Y. Avrithis. 2013. Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE Transactions on Multimedia, 15(7): 1553--1568. 22
[28]
A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. 2009. Describing objects by their attributes. In IEEE Conference on Computer Vision and Pattern Recognition, 2009., pp. 1778--1785. IEEE. 37
[29]
F. Feng, X. Wang, and R. Li. 2014. Cross-modal retrieval with correspondence autoencoder. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 7--16. ACM. 32
[30]
F. Feng, R. Li, and X. Wang. 2015. Deep correspondence restricted boltzmann machine for cross-modal retrieval. Neurocomputing, 154: 50--60. 32
[31]
Y. Feng and M. Lapata. 2010. Visual information in semantic representation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 91--99. Association for Computational Linguistics. 36
[32]
A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. 2013. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems, pp. 2121--2129. 25, 27, 30, 34, 35, 37
[33]
X. Glorot and Y. Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249--256. 28
[34]
M. Gurban, J.-P. Thiran, T. Drugman, and T. Dutoit. 2008. Dynamic modality weighting for multi-stream hmms in audio-visual speech recognition. In Proceedings of the 10th international conference on Multimodal interfaces, pp. 237--240. ACM. 21
[35]
D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural computation, 16(12): 2639--2664. 32
[36]
G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. 2012. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine. 23, 25
[37]
G. E. Hinton and R. S. Zemel. 1994. Autoencoders, minimum description length and helmholtz free energy. In Advances in neural information processing systems, pp. 3--10. 28
[38]
G. E. Hinton, S. Osindero, and Y.-W. Teh. 2006. A fast learning algorithm for deep belief nets. Neural computation, 18(7): 1527--1554. 28
[39]
S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8): 1735--1780. 29
[40]
M. Hodosh, P. Young, and J. Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47: 853--899. 22
[41]
H. Hotelling. 1936. Relations between two sets of variates. Biometrika, 28(3/4):321--377. 32
[42]
J. Huang and B. Kingsbury. 2013. Audio-visual deep learning for noise robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 7596--7599. IEEE. 28
[43]
A. Jameson and P. O. Kristensson. 2017. Understanding and supporting modality choices. In The Handbook of Multimodal-Multisensor Interfaces, pp. 201--238. Association for Computing Machinery and Morgan & Claypool. 21
[44]
Q.-y. Jiang and W.-j. Li. 2017. Deep Cross-Modal Hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 31
[45]
X. Jiang, F. Wu, Y. Zhang, S. Tang, W. Lu, and Y. Zhuang. 2015. The classification of multi-modal data with hidden conditional random field. Pattern Recognition Letters, 51: 63--69. 31
[46]
B. H. Juang and L. R. Rabiner. 1991. Hidden markov models for speech recognition. Technometrics, 33(3): 251--272. 21
[47]
S. E. Kahou, X. Bouthillier, P. Lamblin, C. Gulcehre, V. Michalski, K. Konda, S. Jean, P. Froumenty, Y. Dauphin, N. Boulanger-Lewandowski, et al. 2016. Emonets: Multimodal deep learning approaches for emotion recognition in video. Journal on Multimodal User Interfaces, 10(2): 99--111. 27
[48]
M. M. Khapra, A. Kumaran, and P. Bhattacharyya. 2010. Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 420--428. Association for Computational Linguistics. 37
[49]
D. Kiela and L. Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In EMNLP, pp. 36--45. 36
[50]
D. Kiela and S. Clark. 2015. Multi-and cross-modal semantics beyond vision: Grounding in auditory perception. Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2461--2470. 36
[51]
D. Kiela, L. Bulat, and S. Clark. 2015. Grounding semantics in olfactory perception. In ACL (2), pp. 231--236. 36
[52]
Y. Kim, H. Lee, and E. M. Provost. 2013. Deep learning for robust feature generation in audiovisual emotion recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 3687--3691. IEEE. 27, 28
[53]
R. Kiros, R. Salakhutdinov, and R. S. Zemel. 2015. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. TACL. 27, 30
[54]
B. Klein, G. Lev, G. Sadeh, and L. Wolf. 2015. Fisher Vectors Derived from Hybrid Gaussian-Laplacian Mixture Models for Image Annotation. In CVPR. 32
[55]
C. Kong, D. Lin, M. Bansal, R. Urtasun, and S. Fidler. 2014. What are you talking about? text-to-image coreference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3558--3565. 36
[56]
A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097--1105. 23, 25
[57]
S. Kumar and R. Udupa. 2011. Learning hash functions for cross-view similarity search. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, p. 1360. 31
[58]
P. L. Lai and C. Fyfe. 2000. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(05): 365--377. 32
[59]
A. Lazaridou, E. Bruni, and M. Baroni. 2014. Is this a wampimuk? cross-modal mapping between distributional semantics and the visual world. In ACL (1), pp. 1403--1414. 37
[60]
A. Levin, P. Viola, and Y. Freund. 2003. Unsupervised improvement of visual detectors using cotraining. In ICCV. 33
[61]
Y. Li, S. Wang, Q. Tian, and X. Ding. 2015. A survey of recent advances in visual feature detection. Neurocomputing, 149: 736--751. 23
[62]
R. Lienhart. 1999. Comparison of automatic shot boundary detection algorithms. In Storage and Retrieval for Image and Video Databases (SPIE), pp. 290--301. 22
[63]
M. M. Louwerse. 2011. Symbol interdependency in symbolic and embodied cognition. Topics in Cognitive Science, 3(2): 273--302. 36
[64]
D. G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2): 91--110. 23
[65]
B. Mahasseni and S. Todorovic. 2016. Regularizing long short term memory with 3d human-skeleton sequences for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3054--3062. 34, 35
[66]
H. McGurk and J. MacDonald. 1976. Hearing lips and seeing voices. Nature, 264(5588): 746--748. 21
[67]
G. McKeown, M. F. Valstar, R. Cowie, and M. Pantic. 2010. The semaine corpus of emotionally coloured character interactions. In Multimedia and Expo (ICME), 2010 IEEE International Conference on, pp. 1079--1084. IEEE. 22
[68]
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111--3119. 25, 35
[69]
S. Moon, S. Kim, and H. Wang. 2015. Multimodal Transfer Deep Learning for Audio-Visual Recognition. NIPS Workshops. 34
[70]
Y. Mroueh, E. Marcheret, and V. Goel. 2015. Deep multimodal learning for audio-visual speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 2130--2134. IEEE. 27
[71]
P. Nakov and H. T. Ng. 2012. Improving statistical machine translation for a resource-poor language using related resource-rich languages. Journal of Artificial Intelligence Research, 44: 179--222. 34, 37
[72]
J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. 2011. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 689--696. 21, 26, 27, 28, 34
[73]
M. A. Nicolaou, H. Gunes, and M. Pantic. 2011. Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. IEEE Transactions on Affective Computing, 2(2): 92--105. 27, 30
[74]
W. Ouyang, X. Chu, and X. Wang. 2014. Multi-source deep learning for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2329--2336. 26, 27, 29
[75]
M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell. 2009. Zero-shot learning with semantic output codes. In Advances in neural information processing systems, pp. 1410--1418. 34, 37
[76]
Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. 2016. Jointly modeling embedding and translation to bridge video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4594--4602. 27, 30
[77]
B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pp. 2641--2649. 36
[78]
S. S. Rajagopalan, L.-P. Morency, T. Baltrušaitis, and R. Goecke. 2016. Extending long short-term memory for multi-view structured learning. In European Conference on Computer Vision, pp. 338--353. Springer. 27, 30
[79]
J. Rajendran, M. M. Khapra, S. Chandar, and B. Ravindran. 2015. Bridge Correlational Neural Networks for Multilingual Multimodal Representation Learning. In NAACL. 34, 37
[80]
N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos. 2010. A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM international conference on Multimedia, pp. 251--260. ACM. 32
[81]
M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. Pinkal. 2013. Grounding Action Descriptions in Videos. TACL. ISSN 2307--387X. 36
[82]
R. Salakhutdinov and G. Hinton. 2009. Deep boltzmann machines. In Artificial Intelligence and Statistics, pp. 448--455. 28
[83]
M. E. Sargin, Y. Yemez, E. Erzin, and A. M. Tekalp. 2007. Audiovisual synchronization and fusion using canonical correlation analysis. IEEE Transactions on Multimedia, 9(7): 1396--1403. 32
[84]
A. Sarkar. 2001. Applying co-training methods to statistical parsing. In Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies, pp. 1--8. Association for Computational Linguistics. 33
[85]
B. Schuller, M. Valstar, F. Eyben, G. McKeown, R. Cowie, and M. Pantic. 2011. Avec 2011--the first international audio/visual emotion challenge. Affective Computing and Intelligent Interaction, pp. 415--424. 22
[86]
E. Shutova, D. Kiela, and J. Maillard. 2016. Black holes and white rabbits: Metaphor identification with visual features. In HLT-NAACL, pp. 160--170. 34, 36
[87]
C. Silberer and M. Lapata. 2012. Grounded models of semantic representation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1423--1433. Association for Computational Linguistics. 36
[88]
C. Silberer and M. Lapata. 2014. Learning grounded meaning representations with autoencoders. In ACL (1), pp. 721--732. 27, 28
[89]
M. Slaney and M. Covell. 2001. Facesync: A linear operator for measuring synchronization of video facial images and audio tracks. In Advances in Neural Information Processing Systems, pp. 814--820. 32
[90]
C. G. Snoek and M. Worring. 2005. Multimodal video indexing: A review of the state-of-the-art. Multimedia Tools and Applications, volume 25, pp. 5--35. Springer. 21, 22
[91]
R. Socher and L. Fei-Fei. 2010. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 966--973. IEEE. 37
[92]
R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. 2013. Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems, pp. 935--943. 34, 37
[93]
R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2: 207--218. 30
[94]
N. Srivastava and R. Salakhutdinov. 2012a. Learning representations for multimodal data with deep belief nets. In International conference on machine learning workshop. 28
[95]
N. Srivastava and R. R. Salakhutdinov. 2012b. Multimodal learning with deep boltzmann machines. In Advances in neural information processing systems, pp. 2222--2230. 23, 27, 29, 34
[96]
N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1): 1929--1958. 28
[97]
H.-I. Suk, S.-W. Lee, D. Shen, Alzheimer's Disease Neuroimaging Initiative, et al. 2014. Hierarchical feature representation and multimodal fusion with deep learning for AD/MCI diagnosis. NeuroImage, 101: 569--582. 29
[98]
G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou. 2016. Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 5200--5204. IEEE. 25
[99]
M. Valstar, B. Schuller, K. Smith, F. Eyben, B. Jiang, S. Bilakhia, S. Schnieder, R. Cowie, and M. Pantic. 2013. Avec 2013: the continuous audio/visual emotion and depression recognition challenge. In Proceedings of the 3rd ACM international workshop on Audio/visual emotion challenge, pp. 3--10. ACM. 22
[100]
I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun. 2016. Order-Embeddings of Images and Language. In ICLR. 25, 27, 31
[101]
S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. 2015. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. NAACL. 29
[102]
D. Wang, P. Cui, M. Ou, and W. Zhu. 2015a. Deep multimodal hashing with orthogonal regularization. In IJCAI, pp. 2291--2297. 26, 28
[103]
J. Wang, H. T. Shen, J. Song, and J. Ji. 2014. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927. 31
[104]
L. Wang, Y. Li, and S. Lazebnik. 2016. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5005--5013. 31
[105]
W. Wang, R. Arora, K. Livescu, and J. Bilmes. 2015b. On deep multi-view representation learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1083--1092. 27, 32
[106]
J. Weston, S. Bengio, and N. Usunier. 2011. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, volume 11, pp. 2764--2770. 30
[107]
D. Wu and L. Shao. 2014. Multimodal dynamic networks for gesture recognition. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 945--948. ACM. 29
[108]
Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, and X. Xue. 2014. Exploring inter-feature and inter-class relationships with deep neural networks for video classification. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 167--176. ACM. 27
[109]
R. Xu, C. Xiong, W. Chen, and J. J. Corso. 2015. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In AAAI, volume 5, p. 6. 27, 30
[110]
P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2: 67--78. 31
[111]
H. Yu and J. M. Siskind. 2013. Grounded language learning from video described with sentences. In ACL (1), pp. 53--63. 36
[112]
B. P. Yuhas, M. H. Goldstein, and T. J. Sejnowski. 1989. Integration of acoustic and visual speech signals using neural networks. IEEE Communications Magazine, 27(11): 65--71. 21
[113]
D. Zhang and W.-J. Li. 2014. Large-scale supervised multimodal hashing with semantic correlation maximization. In AAAI, volume 1, p. 7. 32
[114]
H. Zhang, Z. Hu, Y. Deng, M. Sachan, Z. Yan, and E. P. Xing. 2016. Learning concept taxonomies from multi-modal data. arXiv preprint arXiv:1606.09239. 31

Published In

ACM Books
The Handbook of Multimodal-Multisensor Interfaces: Signal Processing, Architectures, and Detection of Emotion and Cognition - Volume 2
October 2018
2034 pages
ISBN: 9781970001716
DOI: 10.1145/3107990

Publisher

Association for Computing Machinery and Morgan & Claypool
