Understanding Expert Disagreement in Medical Data Analysis through Structured Adjudication

Published: 07 November 2019

Abstract

Expert disagreement is pervasive in clinical decision making, and collective adjudication is a useful approach for resolving divergent assessments. Prior work shows that expert disagreement can arise from diverse factors, including expert background, the quality and presentation of data, and guideline clarity. In this work, we study how these factors predict initial discrepancies in the context of medical time series analysis, why certain disagreements persist after adjudication, and how adjudication impacts clinical decisions. Results from a case study with 36 experts and 4,543 adjudicated cases in a sleep stage classification task show that these factors contribute to both initial disagreement and resolvability, each in a distinct way. We provide evidence suggesting that structured adjudication can lead to significant revisions in treatment-relevant clinical parameters. Our work demonstrates how structured adjudication can support consensus and facilitate a deep understanding of expert disagreement in medical data analysis.

Published In

Proceedings of the ACM on Human-Computer Interaction  Volume 3, Issue CSCW
November 2019
5026 pages
EISSN:2573-0142
DOI:10.1145/3371885
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Published in PACMHCI Volume 3, Issue CSCW

Author Tags

  1. adjudication
  2. ambiguity
  3. disagreement
  4. medical time series
