@Article{info:doi/10.2196/54601,
  author   = "Trevena, William and Zhong, Xiang and Alvarado, Michelle and Semenov, Alexander and Oktay, Alp and Devlin, Devin and Gohil, Aarya Yogesh and Chittimouju, Sai Harsha",
  title    = "Using Large Language Models to Detect and Understand Drug Discontinuation Events in Web-Based Forums: Development and Validation Study",
  journal  = "J Med Internet Res",
  year     = "2025",
  month    = "Jan",
  day      = "30",
  volume   = "27",
  pages    = "e54601",
  keywords = "natural language processing; large language models; ChatGPT; drug discontinuation events; zero-shot classification; artificial intelligence; AI",
  abstract = "Background: The implementation of large language models (LLMs), such as BART (Bidirectional and Auto-Regressive Transformers) and GPT-4, has revolutionized the extraction of insights from unstructured text. These advancements have expanded into health care, allowing analysis of social media for public health insights. However, the detection of drug discontinuation events (DDEs) remains underexplored. Identifying DDEs is crucial for understanding medication adherence and patient outcomes. Objective: The aim of this study is to provide a flexible framework for investigating various clinical research questions in data-sparse environments. We provide an example of the utility of this framework by identifying DDEs and their root causes in a publicly accessible web-based forum, MedHelp, and by releasing the first open-source DDE datasets to aid further research in this domain. Methods: We used several LLMs, including GPT-4 Turbo, GPT-4o, DeBERTa (Decoding-Enhanced Bidirectional Encoder Representations from Transformer with Disentangled Attention), and BART, among others, to detect and determine the root causes of DDEs in user comments posted on MedHelp. Our study design included the use of zero-shot classification, which allows these models to make predictions without task-specific training. We split user comments into sentences and applied different classification strategies to assess the performance of these models in identifying DDEs and their root causes. Results: Among the selected models, GPT-4o performed best at determining the root causes of DDEs, misclassifying only 12.9\% of root causes (Hamming loss). Among the open-source models tested, BART demonstrated the best performance in detecting DDEs, achieving an F1-score of 0.86, a false positive rate of 2.8\%, and a false negative rate of 6.5\%, all without any fine-tuning. The dataset included 10.7\% (107/1000) DDEs, emphasizing the models' robustness in an imbalanced data context. Conclusions: This study demonstrated the effectiveness of open- and closed-source LLMs, such as GPT-4o and BART, for detecting DDEs and their root causes from publicly accessible data through zero-shot classification. The robust and scalable framework we propose can aid researchers in addressing data-sparse clinical research questions. The release of open-access DDE datasets has the potential to stimulate further research and novel discoveries in this field.",
  issn     = "1438-8871",
  doi      = "10.2196/54601",
  url      = "https://www.jmir.org/2025/1/e54601",
  pmid     = "39883487"
}
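
The zero-shot DDE detection described in the abstract maps naturally onto the Hugging Face zero-shot-classification pipeline, which uses a BART model fine-tuned on MNLI. The sketch below is a minimal illustration under that assumption, not the authors' released code; the candidate labels and the example sentences are hypothetical, not the label set used in the study.

# Minimal sketch of sentence-level zero-shot DDE detection, assuming the
# Hugging Face transformers pipeline with facebook/bart-large-mnli.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Per the abstract, user comments are split into sentences and each sentence
# is classified independently. These example sentences are hypothetical.
sentences = [
    "I stopped taking the medication after two weeks because of the nausea.",
    "My doctor increased my dose last month.",
]

# Assumed label set for illustration; the study's actual labels may differ.
labels = ["drug discontinuation event", "no drug discontinuation event"]

for sentence in sentences:
    result = classifier(sentence, candidate_labels=labels)
    # The pipeline returns labels sorted by descending score.
    top_label, top_score = result["labels"][0], result["scores"][0]
    print(f"{sentence!r} -> {top_label} ({top_score:.2f})")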