chapter

Deep learning for multisensorial and multimodal interaction

Published: 01 October 2018


    Published In

    The Handbook of Multimodal-Multisensor Interfaces: Signal Processing, Architectures, and Detection of Emotion and Cognition - Volume 2
    October 2018
    2034 pages
    ISBN: 9781970001716
    DOI: 10.1145/3107990

    Publisher

    Association for Computing Machinery and Morgan & Claypool

    Appears in

    ACM Books
