Bilingual Speech Recognition by Estimating Speaker Geometry from Video Data

Tapia, Luis Sanchez; Gomez, Antonio; Esparza, Mario; Jatla, Venkatesh; Pattichis, Marios; Celedón-Pattichis, Sylvia; LópezLeiva, Carlos

arXiv:2112.13463 (cs)

[Submitted on 26 Dec 2021]

Abstract: Speech recognition is very challenging in student learning environments, which are characterized by significant cross-talk and background noise. To address this problem, we present a bilingual speech recognition system that uses an interactive video analysis system to estimate the 3D speaker geometry for realistic audio simulations. We demonstrate the use of our system in generating a complex audio dataset that contains significant cross-talk and background noise and approximates real-life classroom recordings. We then test our proposed system with real-life recordings.
In estimating the distance of the speakers from the microphone, our interactive video analysis system obtained an average error rate of 10.83%, compared to 33.12% for a baseline approach. Our proposed system gave an accuracy of 27.92%, which is 1.5% better than Google Speech-to-text on the same dataset. Across 9 important keywords, our approach gave an average sensitivity of 38%, compared to 24% for Google Speech-to-text, while both methods maintained high average specificity (90% and 92%).
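To make the keyword metrics concrete, the following is a minimal sketch of how sensitivity and specificity for a single keyword could be computed from paired reference and recognized transcripts. The abstract does not describe the authors' evaluation protocol, so the per-utterance counting rule, the function name `keyword_sensitivity_specificity`, and the toy transcripts are all assumptions made for illustration.

```python
from typing import Iterable, Tuple


def keyword_sensitivity_specificity(
    pairs: Iterable[Tuple[str, str]], keyword: str
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for one keyword.

    Each pair is (reference_transcript, hypothesis_transcript) for one
    utterance. An utterance is a positive when the keyword appears in the
    reference. This per-utterance rule is an assumption, not the paper's
    documented protocol.
    """
    tp = fn = fp = tn = 0
    kw = keyword.lower()
    for reference, hypothesis in pairs:
        in_ref = kw in reference.lower().split()
        in_hyp = kw in hypothesis.lower().split()
        if in_ref and in_hyp:
            tp += 1  # keyword spoken and recognized
        elif in_ref:
            fn += 1  # keyword spoken but missed
        elif in_hyp:
            fp += 1  # keyword recognized but never spoken
        else:
            tn += 1  # keyword absent from both
    # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up toy transcripts, not data from the paper.
toy_pairs = [
    ("move the sprite", "move the sprite"),      # true positive
    ("change the sprite", "change the bright"),  # false negative
    ("run the code", "run the code"),            # true negative
    ("open the file", "open the sprite"),        # false positive
    ("save the file", "save the file"),          # true negative
]
sens, spec = keyword_sensitivity_specificity(toy_pairs, "sprite")
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Averaging these two values over a keyword list would give figures comparable in form to the averages reported above, although the exact numbers depend on how utterances are segmented and matched.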

Comments:	11 pages, 6 figures
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Cite as:	arXiv:2112.13463 [cs.SD]
	(or arXiv:2112.13463v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2112.13463
Journal reference:	The 19th International Conference on Computer Analysis of Images and Patterns (CAIP), 2021

Submission history

From: Luis Armando Sanchez Tapia
[v1] Sun, 26 Dec 2021 23:29:56 UTC (32,508 KB)
