Chapter

Building and Designing Expressive Speech Synthesis

Published: 02 October 2021

References

[1]
A. Abdolrahmani, R. Kuber, and S. M. Branham. 2018. “Siri talks at you”: An empirical investigation of voice-activated personal assistant (VAPA) usage by individuals who are blind. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility. 249–258.
[2]
J. Adell, A. Bonafonte, and D. Escudero. 2007. Filled pauses in speech synthesis: Towards conversational speech. In International Conference on Text, Speech and Dialogue. Springer, 358–365.
[3]
K. Akuzawa, Y. Iwasawa, and Y. Matsuo. 2018. Expressive speech synthesis via modeling expressions with variational autoencoder. In Interspeech 2018. ISCA, 3067–3071. http://www.isca-speech.org/archive/Interspeech_2018/abstracts/1113.html.
[4]
S. Andersson, K. Georgila, D. Traum, M. Aylett, and R. A. Clark. 2010. Prediction and realisation of conversational characteristics by utilising spontaneous speech for unit selection. In Speech Prosody 2010-Fifth International Conference.
[5]
S. Andrist, M. Ziadee, H. Boukaram, B. Mutlu, and M. Sakr. 2015. Effects of culture on the credibility of robot speech. In Proceedings of the 10th Annual International Conference on Human–Robot Interaction, HRI’15. ACM/IEEE, 157–164.
[6]
D. Antos, C. M. De Melo, J. Gratch, and B. J. Grosz. 2011. The influence of emotion expression on perceptions of trustworthiness in negotiation. In Proceedings of the 25th AAAI Conference on Artificial Intelligence.
[7]
P. Arias, C. Soladie, O. Bouafif, A. Robel, R. Seguier, and J.-J. Aucouturier. 2018. Realistic transformation of facial and vocal smiles in real-time audiovisual streams. IEEE Transactions on Affective Computing.
[8]
S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou. 2018. Neural voice cloning with a few samples. In Advances in Neural Information Processing Systems. 10019–10029.
[9]
A. Aubin, A. Cervone, O. Watts, and S. King. 2019. Improving speech synthesis with discourse relations. In INTERSPEECH. 4470–4474.
[10]
M. P. Aylett and D. A. Braude. 2018. Designing speech interaction for the Sony Xperia Ear and Oakley Radar Pace smartglasses. In Proceedings of the 20th International Conference on Human–Computer Interaction with Mobile Devices and Services Adjunct. 379–384.
[11]
M. P. Aylett, B. Potard, and C. J. Pidcock. 2013. Expressive speech synthesis: Synthesising ambiguity. In Eighth ISCA Workshop on Speech Synthesis.
[12]
M. P. Aylett, P. O. Kristensson, S. Whittaker, and Y. Vazquez-Alvarez. 2014. None of a CHInd: Relationship counselling for HCI and speech technology. In Proceedings of the Extended Abstracts of the 32nd Annual ACM Conference on Human Factors in Computing Systems – CHI EA'14. ACM Press, Toronto, Ontario, Canada, 749–760. ISBN: 978-1-4503-2474-8. http://dl.acm.org/citation.cfm?doid=2559206.2578868.
[13]
M. P. Aylett, A. Vinciarelli, and M. Wester. 2017. Speech synthesis for the generation of artificial personality. IEEE Transactions on Affective Computing.
[14]
M. P. Aylett, B. R. Cowan, and L. Clark. 2019a. Siri, Echo and performance: You have to suffer darling. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, alt08.
[15]
M. P. Aylett, S. J. Sutton, and Y. Vazquez-Alvarez. 2019b. The right kind of unnatural: Designing a robot voice. In Proceedings of the 1st International Conference on Conversational User Interfaces. 1–2.
[16]
W. A. Bainbridge, J. W. Hart, E. S. Kim, and B. Scassellati. 2011. The benefits of interactions with physically present robots over video-displayed agents. Int. J. Soc. Rob. 3, 1, 41–52.
[17]
R. Barthes. 1977. Image–Music–Text. Macmillan.
[18]
T. Baumann and D. Schlangen. 2012. Inpro_iSS: A component for just-in-time incremental speech synthesis. In Proceedings of the ACL 2012 System Demonstrations. 103–108.
[19]
G. M. Begany, N. Sa, and X. Yuan. 2016. Factors affecting user perception of a spoken language vs. textual search interface: A content analysis. Interact. Comput. 28, 2, 170–180.
[20]
H. Bishop, N. Coupland, and P. Garrett. 2005. Conceptual accent evaluation: Thirty years of accent prejudice in the UK. Acta Linguist. Hafniensia 37, 1, 131–154.
[21]
A. W. Black and K. Tokuda. 2005. The Blizzard Challenge-2005: Evaluating corpus-based speech synthesis on common datasets. In Ninth European Conference on Speech Communication and Technology.
[22]
B. Bollepalli, L. Juvela, and P. Alku. 2019. Lombard speech synthesis using transfer learning in a Tacotron text-to-speech system. In Interspeech. 2833–2837.
[23]
D. A. Braude, M. P. Aylett, C. Laoide-Kemp, S. Ashby, K. M. Scott, B. O. Raghallaigh, A. Braudo, A. Brouwer, and A. Stan. 2019. All together now: The living audio dataset. In Interspeech. 1521–1525.
[24]
P. Bremner, A. G. Pipe, C. Melhuish, M. Fraser, and S. Subramanian. 2011. The effects of robot-performed co-verbal gesture on listener behaviour. In 2011 11th IEEE-RAS International Conference on Humanoid Robots. IEEE, 458–465.
[25]
M. Bretan, G. Hoffman, and G. Weinberg. 2015. Emotionally expressive dynamic physical behaviors in robots. Int. J. Hum. Comput. Stud. 78, 1–16.
[26]
J. Cambre and C. Kulkarni. 2019. One voice fits all? Social implications and research challenges of designing voices for smart devices. Proc. ACM Hum. Comput. Interact. 3, CSCW, 1–19.
[27]
D. Cameron. 2001. Working with Spoken Discourse. Sage.
[28]
S. Campanella and P. Belin. 2007. Integrating face and voice in person perception. Trends Cogn. Sci. 11, 12, 535–543.
[29]
A. C. Cargile and H. Giles. 1997. Understanding language attitudes: Exploring listener affect and identity. Lang. Commun. 17, 3, 195–217.
[30]
L. Clark, A. Ofemile, S. Adolphs, and T. Rodden. 2016. A multimodal approach to assessing user experiences with agent helpers. ACM Trans. Interact. Intell. Syst. 6, 4, 29.
[31]
L. Clark, P. Doyle, D. Garaialde, E. Gilmartin, S. Schlögl, J. Edlund, M. Aylett, J. Cabral, C. Munteanu, and J. Edwards. 2019a. The state of speech in HCI: Trends, themes and challenges. Interact. Comput. 31, 4, 349–371.
[32]
L. Clark, N. Pantidi, O. Cooney, P. Doyle, D. Garaialde, J. Edwards, B. Spillane, E. Gilmartin, C. Murad, C. Munteanu, V. Wade, and B. R. Cowan. 2019b. What makes a good conversation? Challenges in designing truly conversational agents. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–12.
[33]
M. Cooke, C. Mayo, and C. Valentini-Botinhao. 2013. Intelligibility-enhancing speech modifications: The Hurricane Challenge. In Interspeech. 3552–3556.
[34]
E. Corbett and A. Weber. 2016. What can I say? Addressing user experience challenges of a mobile voice user interface for accessibility. In Proceedings of the 18th International Conference on Human–Computer Interaction with Mobile Devices and Services. 72–82.
[35]
N. Coupland and H. Bishop. 2007. Ideologised values for British accents. J. Socioling. 11, 1, 74–93.
[36]
B. R. Cowan, N. Pantidi, D. Coyle, K. Morrissey, P. Clarke, S. Al-Shehri, D. Earley, and N. Bandeira. 2017. What can I help you with?: Infrequent users’ experiences of intelligent personal assistants. In Proceedings of the 19th International Conference on Human–Computer Interaction with Mobile Devices and Services. ACM, 43.
[37]
B. R. Cowan, P. Doyle, J. Edwards, D. Garaialde, A. Hayes-Brady, H. P. Branigan, J. A. Cabral, and L. Clark. 2019. What’s in an accent? The impact of accented synthetic speech on lexical choice in human–machine dialogue. In Proceedings of the 1st International Conference on Conversational User Interfaces, CUI’19. Association for Computing Machinery, New York, NY. ISBN: 9781450371872.
[38]
A. Cruttenden. 1997. Intonation. Cambridge University Press.
[39]
D. Crystal. 1997. A Dictionary of Linguistics and Phonetics. Blackwell, UK.
[40]
D. Crystal. 2011. A Dictionary of Linguistics and Phonetics, Vol. 30. John Wiley & Sons.
[41]
N. Dahlbäck, S. Swamy, C. Nass, F. Arvidsson, and J. Skågeby. 2001. Spoken interaction with computers in a native or non-native language—Same or different. In Proceedings of Interact. 294–301.
[42]
N. Dahlbäck, Q. Wang, C. Nass, and J. Alwin. 2007. Similarity is more important than expertise: Accent effects in speech interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1553–1556.
[43]
A. Danielescu. 2020. Eschewing gender stereotypes in voice assistants to promote inclusion. In Proceedings of the 2nd Conference on Conversational User Interfaces. 1–3.
[44]
C. De Looze, S. Scherer, B. Vaughan, and N. Campbell. 2014. Investigating automatic measurements of prosodic accommodation and its dynamics in social interaction. Speech Commun. 58, 11–34.
[45]
J. de Wit, A. Brandse, E. Krahmer, and P. Vogt. 2020. Varied human-like gestures for social robots: Investigating the effects on children’s engagement and language learning. In Proceedings of the 2020 ACM/IEEE International Conference on Human–Robot Interaction. 359–367.
[46]
D. DeVault, R. Artstein, G. Benn, T. Dey, E. Fast, A. Gainer, K. Georgila, J. Gratch, A. Hartholt, M. Lhommet, G. Lucas, S. Marsella, F. Morbini, A. Nazarian, S. Scherer, G. Stratou, A. Suri, D. Traum, R. Wood, Y. Xu, A. Rizzo, and L.-P. Morency. 2014. SimSensei Kiosk: A virtual human interviewer for healthcare decision support. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems. 1061–1068.
[47]
P. R. Doyle, J. Edwards, O. Dumbleton, L. Clark, and B. R. Cowan. 2019. Mapping perceptions of humanness in intelligent personal assistant interaction. In Proceedings of the 21st International Conference on Human–Computer Interaction with Mobile Devices and Services, MobileHCI’19. Association for Computing Machinery, New York, NY. ISBN: 9781450368254.
[48]
P. R. Doyle, L. Clark, and B. R. Cowan. 2021. What do we see in them? Identifying dimensions of partner models for speech interfaces using a psycholexical approach. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI ’21). Association for Computing Machinery, New York, NY, Article 244, 1–14.
[49]
J. Edwards, H. Liu, T. Zhou, S. J. J. Gould, L. Clark, P. Doyle, and B. R. Cowan. 2019. Multitasking with Alexa: How using intelligent personal assistants impacts language-based primary task performance. In Proceedings of the 1st International Conference on Conversational User Interfaces, CUI’19. Association for Computing Machinery, New York, NY. ISBN: 9781450371872.
[50]
K. El Haddad, H. Cakmak, A. Moinet, S. Dupont, and T. Dutoit. 2015a. An HMM approach for synthesizing amused speech with a controllable intensity of smile. In 2015 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT). IEEE, 7–11.
[51]
K. El Haddad, S. Dupont, N. D’Alessandro, and T. Dutoit. 2015b. An HMM-based speech–smile synthesis system: An approach for amusement synthesis. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Vol. 5. IEEE, 1–6.
[52]
K. El Haddad, I. Torre, E. Gilmartin, H. Çakmak, S. Dupont, T. Dutoit, and N. Campbell. 2017. Introducing AmuS: The amused speech database. In N. Camelin, Y. Estève, and C. Martín-Vide (Eds.), Proceedings of Statistical Language and Speech Processing Conference. Springer International Publishing, 229–240. ISBN: 978-3-319-68456-7.
[53]
D. Erickson. 2005. Expressive speech: Production, perception and application to speech synthesis. Acoust. Sci. Technol. 26, 4, 317–325.
[54]
T. Fong, I. Nourbakhsh, and K. Dautenhahn. 2003. A survey of socially interactive robots. Rob. Auton. Syst. 42, 3–4, 143–166.
[55]
P. Gangamohan, S. R. Kadiri, and B. Yegnanarayana. 2016. Analysis of emotional speech—A review. In Toward Robotic Socially Believable Behaving Systems—Volume I. Springer, 205–238.
[56]
K. Georgila, A. W. Black, K. Sagae, and D. R. Traum. 2012. Practical evaluation of human and synthesized speech for virtual human dialogue systems. In LREC. 3519–3526.
[57]
E. Goffman. 2005. Interaction Ritual: Essays in Face-to-Face Behavior. AldineTransaction.
[58]
A. Govender and S. King. 2018. Using pupillometry to measure the cognitive load of synthetic speech. System 50, 100.
[59]
D. Govind and S. M. Prasanna. 2013. Expressive speech synthesis: A review. Int. J. Speech Technol. 16, 2, 237–260.
[60]
S. Heo, M. Annett, B. J. Lafreniere, T. Grossman, and G. W. Fitzmaurice. 2017. No need to stop what you’re doing: Exploring no-handed smartwatch interaction. In Graphics Interface. 107–114.
[61]
Z. Hodari, O. Watts, and S. King. 2019. Using generative modelling to produce varied intonation for speech synthesis. arXiv preprint arXiv:1906.04233.
[62]
G. Hofer, K. Richmond, and R. Clark. 2005. Informed blending of databases for emotional speech synthesis. In Proc. Interspeech.
[63]
A. Hughes, P. Trudgill, and D. Watt. 2013. English Accents and Dialects: An Introduction to Social and Regional Varieties of English in the British Isles. Routledge.
[64]
A. J. Hunt and A. W. Black. 1996. Unit selection in a concatenative speech synthesis system using a large speech database. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, Vol. 1. IEEE, 373–376.
[65]
A. Ikeno and J. H. Hansen. 2007. The effect of listener accent background on accent perception and comprehension. EURASIP J. Audio Speech Music Process. 2007, 1, 076030. ISSN: 1687-4722.
[66]
J. Jung, S. Lee, J. Hong, E. Youn, and G. Lee. 2020. Voice+tactile: Augmenting in-vehicle voice user interface with tactile touchpad interaction. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–12.
[67]
R. G. Kamiloğlu, A. H. Fischer, and D. A. Sauter. 2020. Good vibrations: A review of vocal expressions of positive emotions. Psychon. Bull. Rev. 27, 2, 237–265.
[68]
T. Kawahara. 2019. Spoken dialogue system for a human-like conversational robot ERICA. In 9th International Workshop on Spoken Dialogue System Technology. Springer, 65–75.
[69]
T. Kenter, V. Wan, C.-A. Chan, R. Clark, and J. Vit. 2019. CHiVE: Varying prosody in speech synthesis with a linguistically driven dynamic hierarchical conditional variational network. In International Conference on Machine Learning. 3331–3340.
[70]
C. D. Kidd and C. Breazeal. 2004. Effect of a robot on user perceptions. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), (IEEE Cat. No. 04CH37566). Vol. 4. IEEE, 3559–3564.
[71]
K. D. Kinzler, K. H. Corriveau, and P. L. Harris. 2011. Children's selective trust in native-accented speakers. Dev. Sci. 14, 1, 106–111.
[72]
P. Kirschthaler, M. Porcheron, and J. E. Fischer. 2020. What can I say? Effects of discoverability in VUIs on task performance and user experience. In Proceedings of the 2nd International Conference on Conversational User Interfaces, CUI’20. Association for Computing Machinery, New York, NY. ISBN: 978-1-4503-7544-3/20/07.
[73]
D. H. Klatt. 1980. Software for a cascade/parallel formant synthesizer. J. Acoust. Soc. Am. 67, 3, 971–995.
[74]
J. Kominek and A. W. Black. 2004. The CMU Arctic speech databases. In Fifth ISCA Workshop on Speech Synthesis.
[75]
S. G. Koolagudi and K. S. Rao. 2012. Emotion recognition from speech: A review. Int. J. Speech Technol. 15, 2, 99–117.
[76]
H. Kose-Bagci, E. Ferrari, K. Dautenhahn, D. S. Syrdal, and C. L. Nehaniv. 2009. Effects of embodiment and gestures on social interaction in drumming games with a humanoid robot. Adv. Rob. 23, 14, 1951–1996.
[77]
E. Krahmer and M. Swerts. 2007. The effects of visual beats on prosodic prominence: Acoustic analyses, auditory perception and visual perception. J. Mem. Lang. 57, 3, 396–414.
[78]
D. R. Large, L. Clark, A. Quandt, G. Burnett, and L. Skrypchuk. 2017. Steering the conversation: A linguistic exploration of natural language interactions with a digital assistant during simulated driving. Appl. Ergon. 63, 53–61. ISSN: 00036870. https://linkinghub.elsevier.com/retrieve/pii/S0003687017300790.
[79]
L. Leahu, M. Cohn, and W. March. 2013. How categories come to matter. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems—CHI'13. ACM Press, Paris, France, 3331. ISBN: 978-1-4503-1899-0. http://dl.acm.org/citation.cfm?doid=2470654.2466455.
[80]
K. A. Lenzo and A. W. Black. 2000. Diphone collection and synthesis. In Sixth International Conference on Spoken Language Processing.
[81]
J. Li. 2015. The benefit of being physically present: A survey of experimental works comparing copresent robots, telepresent robots and virtual agents. Int. J. Hum. Comput. Stud. 77, 23–37.
[82]
J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, R. Barra-Chicote, A. Moinet, and V. Aggarwal. 2018. Towards achieving robust universal neural vocoding. arXiv preprint arXiv:1811.06292.
[83]
E. Luger and A. Sellen. 2016. Like having a really bad PA: The gulf between user expectation and experience of conversational agents. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 5286–5297.
[84]
H.-T. Luong and J. Yamagishi. 2020. Nautilus: A versatile voice cloning system. arXiv preprint arXiv:2005.11004.
[85]
F. Marelli, B. Schnell, H. Bourlard, T. Dutoit, and P. N. Garner. 2019. An end-to-end network to synthesize intonation using a generalized command response model. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7040–7044.
[86]
C. McGinn and I. Torre. 2019. Can you tell the robot by the voice? An exploratory study on the role of voice in the perception of robots. In 2019 14th ACM/IEEE International Conference on Human–Robot Interaction (HRI). IEEE, 211–221.
[87]
H. McGurk and J. MacDonald. 1976. Hearing lips and seeing voices. Nature 264, 5588, 746–748.
[88]
I. Medhi, S. N. Gautama, and K. Toyama. 2009. A comparison of mobile money-transfer UIs for non-literate and semi-literate users. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1741–1750.
[89]
J. Mendelson and M. P. Aylett. 2017. Beyond the listening test: An interactive approach to TTS evaluation. In Interspeech. 249–253.
[90]
R. K. Moore. 2017. Is spoken language all-or-nothing? Implications for future speech-based human–machine interaction. In Dialogues with Social Robots. Springer, 281–291.
[91]
R. K. Moore, H. Li, and S.-H. Liao. 2016. Progress and prospects for spoken language technology: What ordinary people think. In Interspeech 2016. 3007–3011. http://www.isca-speech.org/archive/Interspeech_2016/abstracts/0874.html.
[92]
J. Mumm and B. Mutlu. 2011. Human–robot proxemics: Physical and psychological distancing in human–robot interaction. In Proceedings of the 6th ACM/IEEE International Conference on Human–Robot Interaction. 331–338.
[93]
C. Nass and K. M. Lee. 2000. Does computer-generated speech manifest personality? An experimental test of similarity-attraction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI’00. ACM, New York, NY, 329–336. ISBN: 978-1-58113-216-8.
[94]
C. Nass and K. M. Lee. 2001. Does computer-synthesized speech manifest personality? Experimental tests of recognition, similarity-attraction, and consistency-attraction. J. Exp. Psychol. Appl. 7, 3, 171.
[95]
C. I. Nass and S. Brave. 2005. Wired for Speech: How Voice Activates and Advances the Human–Computer Relationship. MIT Press, Cambridge, MA.
[96]
C. Nass, J. Steuer, and E. R. Tauber. 1994. Computers are social actors. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 72–78.
[97]
A. Niculescu, G. M. White, S. S. Lan, R. U. Waloejo, and Y. Kawaguchi. 2008. Impact of English regional accents on user acceptance of voice user interfaces. In Proceedings of the 5th Nordic Conference on Human–Computer Interaction: Building Bridges, NordiCHI’08. Association for Computing Machinery, New York, NY, 523–526. ISBN: 9781595937049.
[98]
J. J. Ohala. 1983. Cross-language use of pitch: An ethological view. Phonetica 40, 1, 1–18.
[99]
A. V. D. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
[100]
A. Ortony, G. Clore, and A. Collins. 1988. The Cognitive Structure of Emotions. Cambridge University Press, Cambridge.
[101]
A. L. Paugh. 2005. Multilingual play: Children’s code-switching, role play, and agency in Dominica, West Indies. Lang. Soc. 34, 1, 63–86.
[102]
E. Pincus, K. Georgila, and D. Traum. 2015. Which synthetic voice should I choose for an evocative task? In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue. 105–113.
[103]
W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller. 2017. Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654.
[104]
J. F. Pitrelli, R. Bakis, E. M. Eide, R. Fernandez, W. Hamza, and M. A. Picheny. 2006. The IBM expressive text-to-speech synthesis system for American English. IEEE Trans. Audio Speech Lang. Process. 14, 4, 1099–1108.
[105]
M. Porcheron, J. E. Fischer, and S. Sharples. 2017. “Do animals have accents?”: Talking with agents in multi-party conversation. In Proceedings of the 20th ACM Conference on Computer-Supported Cooperative Work & Social Computing, CSCW’17. ACM, New York, NY, 207–219.
[106]
M. Porcheron, J. E. Fischer, S. Reeves, and S. Sharples. 2018. Voice interfaces in everyday life. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 640.
[107]
B. Potard, M. P. Aylett, and D. A. Braude. 2016. Cross modal evaluation of high quality emotional speech synthesis with the Virtual Human Toolkit. In International Conference on Intelligent Virtual Agents. Springer, 190–197.
[108]
N. Prateek, M. Łajszczak, R. Barra-Chicote, T. Drugman, J. Lorenzo-Trueba, T. Merritt, S. Ronanki, and T. Wood. 2019. In other news: A bi-style text-to-speech model for synthesizing newscaster voice with limited data. arXiv preprint arXiv:1904.02790.
[109]
S. Ramakrishnan. 2012. Recognition of emotion from speech: A review. Speech Enhancement, Modeling and Recognition–Algorithms and Applications 7, 121–137.
[110]
G. Reyes-Cruz, J. E. Fischer, and S. Reeves. 2020. Reframing disability as competency: Unpacking everyday technology practices of people with visual impairments. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI’20. ACM, New York, NY, 1–13. ISBN: 9781450367080.
[111]
D. C. Rubin and J. M. Talarico. 2009. A comparison of dimensional models of emotion: Evidence from emotions, prototypical events, autobiographical memories, and words. Memory 17, 8, 802–808.
[112]
E. B. Ryan and H. Giles. 1982. An integrative perspective for the study of attitudes towards language variation. In Attitudes Towards Language Variation: Social and Applied Contexts. Edward Arnold, London, 1–19.
[113]
M. Salem, F. Eyssel, K. Rohlfing, S. Kopp, and F. Joublin. 2013. To err is human (-like): Effects of robot gesture on perceived anthropomorphism and likability. Int. J. Soc. Rob. 5, 3, 313–323.
[114]
D. Sato, S. Zhu, M. Kobayashi, H. Takagi, and C. Asakawa. 2011. Sasayaki: Augmented voice web browsing experience. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2769–2778.
[115]
S. Sayago, B. B. Neves, and B. R. Cowan. 2019. Voice assistants and older people: Some open issues. In Proceedings of the 1st International Conference on Conversational User Interfaces, CUI’19. ACM, New York, NY. ISBN: 9781450371872.
[116]
M. Schröder. 2001. Emotional speech synthesis: A review. In Proceedings Eurospeech 01. 561–564.
[117]
M. Schröder. 2004. Dimensional emotion representation as a basis for speech synthesis with non-extreme emotions. In Proceedings Workshop on Affective Dialogue Systems. 209–220.
[118]
M. Schröder. 2009. Expressive speech synthesis: Past, present, and possible futures. In Affective Information Processing. Springer, 111–126.
[119]
M. Schröder, E. Bevacqua, R. Cowie, F. Eyben, H. Gunes, D. Heylen, M. Ter Maat, G. McKeown, S. Pammi, M. Pantic, C. Pelachaud, B. Schuller, E. de Sevin, M. Valstar, and M. Wöllmer. 2011. Building autonomous sensitive artificial listeners. IEEE Trans. Affective Comput. 3, 2, 165–183.
[120]
R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous. 2018. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. arXiv preprint arXiv:1803.09047.
[121]
V. Srinivasan and R. Murphy. 2011. A survey of social gaze. In Proceedings of the 6th ACM/IEEE International Conference on Human–Robot Interaction. 253–254.
[122]
G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, and Y. Wu. 2020. Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6264–6268.
[123]
S. Sundaram and S. Narayanan. 2007. Automatic acoustic synthesis of human-like laughter. J. Acoust. Soc. Am. 121, 1, 527–535.
[124]
S. Sutton. 2020. Gender ambiguous, not genderless: Designing gender in voice user interfaces (VUIs) with sensitivity. In Proceedings of the 2nd International Conference on Conversational User Interfaces, CUI’20. Association for Computing Machinery, New York, NY.
[125]
M. Swerts and E. Krahmer. 2010. Visual prosody of newsreaders: Effects of information structure, emotional content and intended audience on facial expressions. J. Phon. 38, 2, 197–206.
[126]
É. Székely. 2015. Expressive Speech Synthesis in Human Interaction. Ph.D. thesis, University College Dublin.
[127]
É. Székely, G. E. Henter, J. Beskow, and J. Gustafson. 2019. Spontaneous conversational speech synthesis from found data. In Interspeech.
[128]
V. C. Tartter and D. Braun. 1994. Hearing smiles and frowns in normal and whisper registers. J. Acoust. Soc. Am. 96, 4, 2101–2107.
[129]
P. Taylor and A. Isard. 1997. SSML: A speech synthesis markup language. Speech Commun. 21, 1–2, 123–133.
[130]
M. Theune, K. Meijs, D. Heylen, and R. Ordelman. 2006. Generating expressive speech for storytelling applications. IEEE Trans. Audio Speech Lang. Process. 14, 4, 1137–1144.
[131]
K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. 2000. Speech parameter generation algorithms for HMM-based speech synthesis. In 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), Vol. 3. IEEE, 1315–1318.
[132]
I. Torre and S. Le Maguer. 2020. Should robots have accents? In Proceedings of the 29th International Workshop on Robot and Human Interactive Communication, RO-MAN'20. IEEE.
[133]
I. Torre, J. Goslin, and L. White. 2015. Investing in accents: How does experience mediate trust attributions to different voices? In Proceedings of the 18th International Congress of Phonetic Sciences (ICPhS 2015).
[134]
I. Torre, E. Carrigan, K. McCabe, R. McDonnell, and N. Harte. 2018. Survival at the museum: A cooperation experiment with emotionally expressive virtual characters. In Proceedings of the 2018 on International Conference on Multimodal Interaction. ACM, 423–427.
[135]
I. Torre, J. Goslin, and L. White. 2020. If your device could smile: People trust happy-sounding artificial agents more. Comput. Hum. Behav. 105, 106215.
[136]
J. Trouvain and M. Schröder. 2004. How (not) to add laughter to synthetic speech. In Tutorial and Research Workshop on Affective Dialogue Systems. Springer, 229–232.
[137]
G. R. Tucker. 1999. A Global Perspective on Bilingualism and Bilingual Education. ERIC Digest.
[138]
P. Wagner, J. Beskow, S. Betz, J. Edlund, J. Gustafson, G. E. Henter, S. Le Maguer, Z. Malisz, É. Székely, C. Tånnander, and J. Voße. 2019. Speech synthesis evaluation—State-of-the-art assessment and suggestion for a novel research program. In Proceedings of the 10th Speech Synthesis Workshop (SSW10).
[139]
J. Wainer, D. J. Feil-Seifer, D. A. Shell, and M. J. Mataric. 2006. The role of physical embodiment in human–robot interaction. In ROMAN 2006—The 15th IEEE International Symposium on Robot and Human Interactive Communication. IEEE, 117–122.
[140]
Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous. 2017. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.
[141]
M. West, R. Kraut, and H. Chew. 2019. I'd blush if I could: Closing gender divides in digital skills through education. https://unesdoc.unesco.org/ark:/48223/pf0000367416.
[142]
M. Wester, M. Aylett, M. Tomalin, and R. Dall. 2015. Artificial personality and disfluency. In Interspeech 2015.
[143]
M. Wester, D. A. Braude, B. Potard, M. P. Aylett, and F. Shaw. 2017. Real-time reactive speech synthesis: Incorporating interruptions. In Interspeech. 3996–4000.
[144]
A. Williams, J. Cambre, I. Bicking, A. Wallin, J. Tsai, and J. Kaye. 2020. Toward voice-assisted browsers: A preliminary study with Firefox Voice. In Proceedings of the 2nd International Conference on Conversational User Interfaces, CUI’20. Association for Computing Machinery, New York, NY. ISBN: 978-1-4503-7544-3/20/07.
[145]
Z. Wu, Z. Xie, and S. King. 2019. The Blizzard Challenge 2019. In Blizzard Challenge Workshop. http://www.festvox.org/blizzard/bc2019/blizzard2019_overview_paper.pdf.
[146]
Y. Wu, J. Edwards, O. Cooney, A. Bleakley, P. R. Doyle, L. Clark, D. Rough, and B. R. Cowan. 2020a. Mental workload and language production in non-native speaker IPA interaction. In Proceedings of the 2nd Conference on Conversational User Interfaces, CUI’20. Association for Computing Machinery, New York, NY. ISBN: 9781450375443.
[147]
Y. Wu, D. Rough, A. Bleakley, J. Edwards, O. Cooney, P. R. Doyle, L. Clark, and B. R. Cowan. 2020b. See what I’m saying? Comparing intelligent personal assistant use for native and non-native language speakers. In Proceedings of the 22nd International Conference on Human–Computer Interaction with Mobile Devices and Services, Mobile HCI’20. Association for Computing Machinery, New York, NY.
[148]
J. Yamagishi, K. Ogata, Y. Nakano, J. Isogai, and T. Kobayashi. 2006. HSMM-based model adaptation algorithms for average-voice-based speech synthesis. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, Vol. 1. IEEE, I–I.
[149]
J. Yamagishi, C. Veaux, and K. MacDonald. 2019. CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92).
[150]
H. Zen, A. Senior, and M. Schuster. 2013. Statistical parametric speech synthesis using deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 7962–7966.
[151]
H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu. 2019. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. arXiv preprint arXiv:1904.02882.
[152]
Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Z. Chen, R. Skerry-Ryan, Y. Jia, A. Rosenberg, and B. Ramabhadran. 2019. Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning. arXiv preprint arXiv:1907.04448.
[153]
X. Zhu and L. Xue. 2020. Building a controllable expressive speech synthesis system with multiple emotion strengths. Cognit. Syst. Res. 59, 151–159.

Cited By

  • (2024) Enhancing Trust towards the Police through Interaction with Virtual Agents - Investigating the Ingroup Effect with Mixed-Cultural Individuals. In Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, 1–9. DOI: 10.1145/3652988.3673944.
  • (2024) Using Speech Agents for Mood Logging within Blended Mental Healthcare: Mental Healthcare Practitioners' Perspectives. In ACM Conversational User Interfaces 2024, 1–11. DOI: 10.1145/3640794.3665540.
  • (2024) What a Laugh! – Effects of Voice and Laughter on a Social Robot's Humorous Appeal and Recipients' Transportation and Emotions in Humorous Robotic Storytelling. In 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 2131–2138. DOI: 10.1109/RO-MAN60168.2024.10731342.


Published In

ACM Books
The Handbook on Socially Interactive Agents: 20 years of Research on Embodied Conversational Agents, Intelligent Virtual Agents, and Social Robotics Volume 1: Methods, Behavior, Cognition
September 2021
538 pages
ISBN: 9781450387200
DOI: 10.1145/3477322

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 October 2021

Qualifiers

  • Chapter

Appears in

ACM Books
