Multi-View Self-Attention Based Transformer for Speaker Recognition

Wang, Rui; Ao, Junyi; Zhou, Long; Liu, Shujie; Wei, Zhihua; Ko, Tom; Li, Qing; Zhang, Yu

arXiv:2110.05036 (eess)

[Submitted on 11 Oct 2021 (v1), last revised 27 Jan 2022 (this version, v2)]

Abstract: Initially developed for natural language processing (NLP), the Transformer model is now widely used for speech processing tasks such as speaker recognition, owing to its powerful sequence modeling capabilities. However, conventional self-attention mechanisms were originally designed for modeling textual sequences and do not take the characteristics of speech and speaker modeling into account. Moreover, different Transformer variants for speaker recognition have not been well studied. In this work, we propose a novel multi-view self-attention mechanism and present an empirical study of different Transformer variants, with and without the proposed attention mechanism, for speaker recognition. Specifically, to balance the ability to capture global dependencies against the ability to model locality, we propose a multi-view self-attention mechanism for the speaker Transformer, in which different attention heads can attend to different ranges of the receptive field. Furthermore, we introduce and compare five Transformer variants with different network architectures, embedding locations, and pooling methods for learning speaker embeddings. Experimental results on the VoxCeleb1 and VoxCeleb2 datasets show that the proposed multi-view self-attention mechanism improves speaker recognition performance, and that the proposed speaker Transformer network attains excellent results compared with state-of-the-art models.
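The idea that different attention heads attend to different ranges of the receptive field can be illustrated with a minimal sketch. The abstract does not say how each head's range is restricted, so everything below is an assumption made for illustration: a fixed band mask per head, the hypothetical helper names `band_mask`, `masked_attention`, and `multi_view_attention`, the example window sizes, and identity query/key/value projections (a real model would use learned projections). It is not the authors' implementation.

```python
import math


def band_mask(seq_len, window):
    """mask[i][j] is True when frame i may attend to frame j.

    window=None gives a global view (no restriction); otherwise only
    frames with |i - j| <= window are visible. (Assumed masking scheme.)
    """
    return [[window is None or abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention for a single head with a boolean mask."""
    d = len(q[0])
    outputs, weights = [], []
    for i, qi in enumerate(q):
        # Masked-out positions get a score of -inf, so their weight is 0.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                  if mask[i][j] else -math.inf
                  for j, kj in enumerate(k)]
        top = max(scores)  # finite: each frame can always see itself
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        outputs.append([sum(w[j] * v[j][c] for j in range(len(v)))
                        for c in range(len(v[0]))])
    return outputs, weights


def multi_view_attention(x, windows):
    """Run one head per window size on x and concatenate the head outputs.

    Identity projections are used for brevity (an assumption).
    """
    per_head = [masked_attention(x, x, x, band_mask(len(x), w))
                for w in windows]
    outputs = [sum((head_out[i] for head_out, _ in per_head), [])
               for i in range(len(x))]
    weights = [head_w for _, head_w in per_head]
    return outputs, weights


# Six toy frames of dimension 2; two local heads and one global head.
x = [[0.1 * t, 1.0 - 0.1 * t] for t in range(6)]
outputs, weights = multi_view_attention(x, windows=[1, 2, None])
```

In this sketch the head with `window=1` places zero weight on frames more than one step away, while the `window=None` head spreads its weight over the whole utterance, so the concatenated output mixes local and global views of the same frames.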

Comments:	Paper to appear at ICASSP 2022
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
Cite as:	arXiv:2110.05036 [eess.AS]
	(or arXiv:2110.05036v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2110.05036

Submission history

From: Rui Wang
[v1] Mon, 11 Oct 2021 07:03:23 UTC (316 KB)
[v2] Thu, 27 Jan 2022 07:10:10 UTC (310 KB)
