DOI: 10.1145/3640543.3645162
Research article | Open access

Snapper: Accelerating Bounding Box Annotation in Object Detection Tasks with Find-and-Snap Tooling

Published: 05 April 2024

Abstract

Object detection tasks are central to the development of datasets and algorithms in computer vision and machine learning. Despite this centrality, object detection remains tedious and time-consuming due to the fine-grained interactions required to draw precise annotations. In this paper, we introduce Snapper, an interactive and intelligent annotation tool that intercepts bounding box annotations as they are drawn and “snaps” them to nearby object edges in real time. Through a mixed-design user study with 18 full-time annotators, we compare Snapper’s annotation mode to alternative modes of annotation and find that Snapper enables participants to complete object detection tasks 39% more quickly without diminishing annotation quality. Further, we find that participants perceive Snapper as interactively intuitive, trustworthy, and helpful. We conclude by discussing the implications of our findings for augmenting annotators’ conventions for drawing annotations in practice.
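The abstract describes the find-and-snap interaction but not its implementation. As a purely illustrative sketch — not the paper's actual Snapper algorithm — the Python below snaps each side of a user-drawn box to the strongest nearby image-gradient response within a small tolerance. The function name `snap_box`, the `tol` parameter, and the gradient-energy heuristic are all assumptions made for illustration.

```python
# A minimal, hypothetical sketch of edge-based box snapping -- NOT the paper's
# actual Snapper model. Assumes a grayscale image as a 2D NumPy array and a
# user-drawn box (x0, y0, x1, y1); each side of the box is nudged to the
# strongest image-gradient response within a search window of `tol` pixels.
import numpy as np

def snap_box(image, box, tol=8):
    """Snap each side of `box` to the strongest nearby edge within `tol` px."""
    gy, gx = np.gradient(image.astype(float))  # vertical / horizontal gradients
    x0, y0, x1, y1 = box
    h, w = image.shape

    def strongest(candidates, energy):
        # Pick the candidate coordinate with the highest summed gradient energy.
        return candidates[int(np.argmax([energy(c) for c in candidates]))]

    left  = list(range(max(0, x0 - tol), min(w, x0 + tol + 1)))
    right = list(range(max(0, x1 - tol), min(w, x1 + tol + 1)))
    top   = list(range(max(0, y0 - tol), min(h, y0 + tol + 1)))
    bot   = list(range(max(0, y1 - tol), min(h, y1 + tol + 1)))

    # Vertical box sides respond to horizontal gradients, and vice versa.
    x0 = strongest(left,  lambda x: float(np.abs(gx[y0:y1, x]).sum()))
    x1 = strongest(right, lambda x: float(np.abs(gx[y0:y1, x]).sum()))
    y0 = strongest(top,   lambda y: float(np.abs(gy[y, x0:x1]).sum()))
    y1 = strongest(bot,   lambda y: float(np.abs(gy[y, x0:x1]).sum()))
    return x0, y0, x1, y1

# Usage: a bright square whose true box is (20, 30, 60, 70); a sloppy
# user-drawn box snaps to within a pixel or two of the square's edges.
img = np.zeros((100, 100))
img[30:70, 20:60] = 1.0
print(snap_box(img, (17, 34, 64, 66)))
```

A production tool like the one studied here would presumably replace this gradient heuristic with a learned edge or object model and run the adjustment at interactive latency as each box is released.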


Cited By

  • (2024) Estimating the Semantic Density of Visual Media. Proceedings of the 32nd ACM International Conference on Multimedia, 4601–4609. DOI: 10.1145/3664647.3681594. Online publication date: 28-Oct-2024.

Information

Published In

IUI '24: Proceedings of the 29th International Conference on Intelligent User Interfaces
March 2024
955 pages
ISBN:9798400705083
DOI:10.1145/3640543
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 05 April 2024


Author Tags

  1. Assisted annotation
  2. Annotator productivity
  3. Object detection

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

IUI '24

Acceptance Rates

Overall acceptance rate: 746 of 2,811 submissions (27%).


Article Metrics

  • Downloads (last 12 months): 259
  • Downloads (last 6 weeks): 66

Reflects downloads up to 06 Nov 2024.

