A Case Study of Data Quality in Text Mining Clinical Progress Notes

Published: 03 April 2015

Abstract

Text analytic methods are often aimed at extracting useful information from the vast array of unstructured, free format text documents that are created by almost all organizational processes. The success of any text mining application rests on the quality of the underlying data being analyzed, including both predictive features and outcome labels. In this case study, some focused experiments regarding data quality are used to assess the robustness of Statistical Text Mining (STM) algorithms when applied to clinical progress notes. In particular, the experiments consider the impacts of task complexity (by removing signals), training set size, and target outcome quality. While this research is conducted using a dataset drawn from the medical domain, the data quality issues explored are of more general interest.

Published In

ACM Transactions on Management Information Systems, Volume 6, Issue 1
April 2015
111 pages
ISSN: 2158-656X
EISSN: 2158-6578
DOI: 10.1145/2742819
© 2015 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 April 2015
Accepted: 01 June 2014
Revised: 01 April 2014
Received: 01 November 2012
Published in TMIS Volume 6, Issue 1

Author Tags

  1. machine learning
  2. clinical progress notes
  3. data quality
  4. electronic health records
  5. feature selection
  6. health informatics
  7. noisy text analysis
  8. predictive model quality
  9. text mining

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Veterans Healthcare Administration Health Services Research & Development
