DOI: 10.1145/3635636.3656192
Research article | Open access

VideoMap: Supporting Video Exploration, Brainstorming, and Prototyping in the Latent Space

Published: 23 June 2024

Abstract

Video editing is a creative and complex endeavor, and we believe there is potential to reimagine the video editing interface to better support its creative and exploratory nature. We take inspiration from latent space exploration tools that help users find patterns and connections within complex datasets. We present VideoMap, a proof-of-concept video editing interface that operates on video frames projected onto a latent space. We support intuitive navigation through map-inspired navigational elements and facilitate transitioning between different latent spaces through swappable lenses. We built three VideoMap components to support editors in three common video tasks. In a user study with both professionals and non-professionals, editors found that VideoMap helps reduce grunt work, offers a user-friendly experience, provides an inspirational way of editing, and effectively supports the exploratory nature of video editing. We further demonstrate the versatility of VideoMap by implementing three extended applications. For interactive examples, we invite you to visit our project page: https://chuanenlin.com/videomap.
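The abstract does not specify how frames are mapped into the latent space, so as a rough illustration of the general idea — embedding each frame as a high-dimensional feature vector and reducing it to 2D "map" coordinates — here is a minimal sketch. It uses synthetic vectors as stand-ins for real image-encoder features and a hand-rolled PCA projection rather than whatever embedding model and reduction method the authors actually used (methods such as UMAP or t-SNE are common alternatives); the function name `project_frames_to_map` is hypothetical.

```python
import numpy as np

def project_frames_to_map(frame_embeddings: np.ndarray) -> np.ndarray:
    """Project (n_frames, d) frame embeddings onto 2D map coordinates via PCA.

    The top-2 right singular vectors of the mean-centered data span the
    plane of maximum variance; projecting onto them gives each frame an
    (x, y) position for plotting on a map-like canvas.
    """
    centered = frame_embeddings - frame_embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Stand-in for embeddings of 100 frames with 512-dim features.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 512))
coords = project_frames_to_map(embeddings)
print(coords.shape)  # (100, 2)
```

Swapping the projection (e.g. for UMAP) or the embedding source would correspond to the "swappable lenses" idea: the same frames land at different map positions depending on which latent space is active.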



      Published In

C&C '24: Proceedings of the 16th Conference on Creativity & Cognition
June 2024, 718 pages
ISBN: 9798400704857
DOI: 10.1145/3635636
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. latent space visualization
      2. video editing interface

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

C&C '24: Creativity and Cognition
June 23-26, 2024
Chicago, IL, USA

      Acceptance Rates

      Overall Acceptance Rate 108 of 371 submissions, 29%
