research-article

A Case Study of Data Quality in Text Mining Clinical Progress Notes

Authors: Donald J. Berndt, James A. McCart, Dezon K. Finch, Stephen L. Luther

ACM Transactions on Management Information Systems (TMIS), Volume 6, Issue 1

Article No.: 1, Pages 1 - 21

https://doi.org/10.1145/2669368

Published: 03 April 2015

Abstract

Text analytic methods are often aimed at extracting useful information from the vast array of unstructured, free-format text documents that are created by almost all organizational processes. The success of any text mining application rests on the quality of the underlying data being analyzed, including both predictive features and outcome labels. In this case study, focused experiments on data quality are used to assess the robustness of Statistical Text Mining (STM) algorithms when applied to clinical progress notes. In particular, the experiments consider the impacts of task complexity (by removing signals), training set size, and target outcome quality. While this research is conducted using a dataset drawn from the medical domain, the data quality issues explored are of more general interest.
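
The three data quality manipulations named in the abstract (removing signals, varying training set size, and degrading outcome labels) could be sketched roughly as follows. This is a minimal illustration in Python, not the authors' experimental code: the function names, the uniform random label flipping, the binary 0/1 labels, and the whitespace tokenization are all assumptions made for the example.

```python
import random


def flip_labels(labels, noise_rate, seed=0):
    """Return a copy of binary (0/1) labels with a fixed fraction flipped.

    Assumption: outcome label noise is simulated by flipping labels
    chosen uniformly at random.
    """
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


def subsample(examples, size, seed=0):
    """Draw a smaller training set without replacement."""
    rng = random.Random(seed)
    return rng.sample(examples, size)


def remove_signals(text, signal_terms):
    """Drop whitespace-delimited tokens found in signal_terms.

    Assumption: "removing signals" is approximated by deleting strongly
    predictive terms; matching is case-insensitive and ignores punctuation
    handling for brevity.
    """
    return " ".join(t for t in text.split() if t.lower() not in signal_terms)
```

A robustness experiment in this style would train the same classifier on each degraded variant of the training data and compare its performance on an untouched test set.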

Published In

ACM Transactions on Management Information Systems Volume 6, Issue 1

April 2015

111 pages

ISSN:2158-656X

EISSN:2158-6578

DOI:10.1145/2742819

Editor:
Alexander Tuzhilin
New York University, USA

Copyright © 2015 ACM.

© 2015 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 April 2015

Accepted: 01 June 2014

Revised: 01 April 2014

Received: 01 November 2012

Published in TMIS Volume 6, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

Veterans Healthcare Administration Health Services Research & Development
