Open Access. BY-NC-ND 3.0 license. Published by De Gruyter Open Access, March 31, 2010.

Soft missing-feature mask generation for Robot Audition

  • Toru Takahashi, Kazuhiro Nakadai, Kazunori Komatani, Tetsuya Ogata and Hiroshi G. Okuno

Abstract

This paper describes an improvement in automatic speech recognition (ASR) for robot audition by introducing Missing Feature Theory (MFT) based on soft missing feature masks (MFMs) to realize natural human-robot interaction. In an everyday environment, a robot’s microphones capture various sounds besides the user’s utterances. Although sound-source separation is an effective way to enhance the user’s utterances, it inevitably produces errors due to reflection and reverberation. MFT is able to cope with these errors. First, MFMs are generated based on the reliability of time-frequency components. Then ASR weights the time-frequency components according to the MFMs. We propose a new method to automatically generate soft MFMs, which consist of continuous values from 0 to 1 based on a sigmoid function. The proposed MFM generation was implemented for HRP-2 using HARK, our open-source robot audition software. Preliminary results show that the soft MFM outperformed a hard (binary) MFM in recognizing three simultaneous utterances. In a human-robot interaction task, the minimum interval required between two adjacent loudspeakers was reduced from 60 degrees to 30 degrees by using soft MFMs.
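
The difference between a hard and a soft mask can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the sigmoid's parameters or say how reliability is computed, so the names `threshold` and `slope`, their default values, the reliability scores, and the `masked_score` weighting are all assumptions made for illustration.

```python
import math


def hard_mask(reliability, threshold=0.5):
    """Binary mask: 1.0 if the component is judged reliable, else 0.0.

    `threshold` is an illustrative value, not taken from the paper.
    """
    return 1.0 if reliability > threshold else 0.0


def soft_mask(reliability, threshold=0.5, slope=10.0):
    """Sigmoid mask with values strictly between 0 and 1.

    `threshold` shifts the sigmoid and `slope` sets its steepness;
    both are hypothetical parameters chosen for this sketch.
    """
    return 1.0 / (1.0 + math.exp(-slope * (reliability - threshold)))


def masked_score(log_likelihoods, mask):
    """Weight per-component log-likelihoods by the mask and sum them.

    This is one simple way a recognizer could use a mask; the actual
    decoder weighting in the paper may differ.
    """
    return sum(m * ll for m, ll in zip(mask, log_likelihoods))


# Hypothetical reliability scores for four time-frequency components.
reliabilities = [0.05, 0.45, 0.55, 0.95]
hard = [hard_mask(r) for r in reliabilities]
soft = [soft_mask(r) for r in reliabilities]

# Hypothetical per-component log-likelihoods from an acoustic model.
log_likelihoods = [-1.0, -2.0, -3.0, -4.0]
hard_score = masked_score(log_likelihoods, hard)
soft_score = masked_score(log_likelihoods, soft)
```

In this sketch the components with reliabilities 0.45 and 0.55 receive opposite hard-mask values although they are nearly equally reliable, whereas their soft-mask values stay close to each other. That graded behaviour near the threshold is the property a soft mask is meant to provide.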

Received: 2010-2-21
Accepted: 2010-3-19
Published Online: 2010-3-31
Published in Print: 2010-3-1

© Toru Takahashi et al.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.

Downloaded on 4.1.2025 from https://www.degruyter.com/document/doi/10.2478/s13230-010-0005-1/html