Lexical Acquisition from Audio-Visual Streams Using a Multimodal Recurrent State-Space Model (IEEE Conference Publication)