Research Article
DOI: 10.1145/2858036.2858416

Simplified Audio Production in Asynchronous Voice-Based Discussions

Published: 07 May 2016

Abstract

Voice communication adds nuance and expressivity to virtual discussions, but its one-shot nature tends to discourage collaborators from using it. Recent advances in live, time-aligned speech transcription, however, have made text-based voice editing far more accessible. We introduce SimpleSpeech, an easy-to-use platform for asynchronous audio communication (AAC) with lightweight tools for inserting content, adjusting pauses, and correcting transcript errors. Qualitative and quantitative results suggest that novice audio producers, such as high school students, experience lower mental workload when producing audio messages with SimpleSpeech than when recording without editing support. We also analyzed the linguistic formality of SimpleSpeech messages and found that it occupies a middle ground between oral and written media. Our findings on editable voice messages suggest new implications for the design and use cases of AAC systems.

Supplementary Material

suppl.mov (pn1888-file3.mp4)
Supplemental video


    Published In

    CHI '16: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems
    May 2016
    6108 pages
    ISBN:9781450333627
    DOI:10.1145/2858036

    Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. asynchronous audio communication
    2. speech editing
    3. transcription-based editing

    Qualifiers

    • Research-article

    Conference

    CHI '16: CHI Conference on Human Factors in Computing Systems
    May 7-12, 2016
    San Jose, California, USA

    Acceptance Rates

    CHI '16 Paper Acceptance Rate: 565 of 2,435 submissions, 23%
    Overall Acceptance Rate: 6,199 of 26,314 submissions, 24%
