%0 Journal Article %@ 1947-2579 %I JMIR Publications %V 17 %N %P e71720 %T Understanding Patient Perceptions of Bacterial Vaginosis Treatments: Mixed Methods Sentiment Analysis Study of Online Drug Review Forums %A Watkins,Eren %A Cimino,Andrea N %A Culbertson,Curtis %A Raymaker,Juliana %A Amico,Jennifer R %+ Medical Affairs & Outcomes Research, Organon & Co, 30 Hudson St, Jersey City, NJ, 07302, United States, 1 404 670 4920, eren.watkins@organon.com %K bacterial vaginosis %K bacterial vaginosis symptoms %K patient experiences %K bacterial vaginosis treatment %K sentiment analysis %K natural language processing %K qualitative %D 2025 %7 10.10.2025 %9 Original Paper %J Online J Public Health Inform %G English %X Background: Bacterial vaginosis (BV) is the most common cause of vaginal discharge in people of childbearing age in the United States. More information about what patients do and do not like about the most common BV products, and the extent to which these products reduce BV symptoms, is important for understanding patients’ health and the current treatment landscape for BV. Objective: Using data from online drug review forums, this study’s objectives were to (1) quantitatively characterize the patient voice via sentiments (positive to negative) and emotions about the three most common Food and Drug Administration (FDA)–approved treatments for BV—oral metronidazole (OM), vaginal metronidazole (VM), and vaginal clindamycin (VC)—and (2) qualitatively summarize themes characterizing the patient-perceived impact of BV and BV products. Methods: Data for this mixed methods descriptive study came from 1645 users’ reviews of BV products posted on WebMD.com and Drugs.com. Reviewer attributes, reviewer-submitted star ratings, and sentiment analysis (SA) using word-emotion association were analyzed with descriptive statistics and bivariate associations. A traditional qualitative analysis using qualitative description was also performed. Results: Most reviewers were female (n=629, 99.4%), between the ages of 18 and 44 years, and reported using BV products for less than 1 month, though qualitative results suggested that most experienced recurrent BV infections. Quantitative results revealed reviewers’ preference for vaginal products. The mean star rating for VC was significantly higher than those for OM and VM. VC reviews also had the highest proportion of positive emotion words compared with OM and VM. Qualitative results for VC supported the quantitative findings: favorable themes related to perceptions of value, effectiveness in alleviating symptoms, and minimal side effects. Additionally, despite some concerns related to the cost of VC, reviewers said they would use the medication again. Other qualitative findings supported medical education campaigns on BV treatment for patients and providers. Conclusions: Overall, people want a BV treatment that is easy to use, quickly alleviates symptoms, and has minimal side effects. Patients use product reviews to inform their decision-making about BV treatment, ask and seek answers to health-related questions, and share their experiences, presenting a unique opportunity for comprehensive patient education through clinical encounters or public health outreach efforts.
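As a concrete illustration of the word-emotion association approach named in the record above, the minimal sketch below tallies emotion-category hits per review against a lexicon. The mini lexicon and sample reviews are invented stand-ins; the study itself used an NRC-style word-emotion lexicon and real WebMD.com and Drugs.com reviews.

```python
# Minimal sketch of word-emotion association tallying; toy data only.
from collections import Counter
import re

EMOTION_LEXICON = {  # tiny illustrative subset; the real NRC lexicon maps thousands of words
    "relief": {"positive", "joy"},
    "effective": {"positive", "trust"},
    "burning": {"negative", "fear"},
    "nausea": {"negative", "disgust"},
    "recommend": {"positive", "trust"},
    "worse": {"negative", "sadness"},
}

def emotion_profile(review: str) -> Counter:
    """Count emotion-category hits for every lexicon word in a review."""
    tokens = re.findall(r"[a-z']+", review.lower())
    counts = Counter()
    for tok in tokens:
        counts.update(EMOTION_LEXICON.get(tok, ()))
    return counts

reviews = [  # hypothetical reviews, not drawn from the study
    "Instant relief, very effective, would recommend.",
    "Burning and nausea, symptoms got worse.",
]
for r in reviews:
    profile = emotion_profile(r)
    total = sum(profile.values()) or 1
    # Proportion of positive emotion words, the quantity compared across products above
    print(profile, "positive share:", round(profile["positive"] / total, 2))
```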
%M 41072007 %R 10.2196/71720 %U https://ojphi.jmir.org/2025/1/e71720 %U https://doi.org/10.2196/71720 %U http://www.ncbi.nlm.nih.gov/pubmed/41072007 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e79156 %T Mentalizing Without a Mind: Psychotherapeutic Potential of Generative AI %A Yirmiya,Karen %A Fonagy,Peter %+ Clinical, Educational and Health Psychology, University College London, 1-19 Torrington Pl, London, WC1E 7HB, United Kingdom, 44 7934597389, k.yirmiya@ucl.ac.uk %K generative artificial intelligence %K psychotherapy %K mentalization %K epistemic trust %K reflective functioning %D 2025 %7 10.10.2025 %9 Viewpoint %J J Med Internet Res %G English %X This paper explores the integration of generative artificial intelligence (AI) into psychotherapeutic practice through the lens of mentalization theory, with a particular focus on epistemic trust—a critical relational mechanism that facilitates psychological change. We critically examine AI’s capability to replicate core therapeutic components, such as empathy, embodied mentalizing, biobehavioral synchrony, and reciprocal mentalizing. Although current AI systems, especially large language models, demonstrate significant potential in simulating emotional responsiveness, cognitive empathy, and therapeutic dialogue, fundamental limitations persist. AI’s inherent lack of genuine emotional presence, reciprocal intentionality, and affective commitment constrains its ability to foster authentic epistemic trust and meaningful therapeutic relationships. Additionally, we outline significant risks, notably for individuals with complex trauma or relational vulnerabilities, highlighting concerns regarding pseudo-empathy, mistaking phenomenal experience for objective reality (psychic equivalence), fruitless ungrounded pursuit of social understanding (hypermentalization), and epistemic exploitation of individuals in whom artificial understanding by AI triggers excessive credulity. Nonetheless, we propose ethically informed pathways for integrating AI to enhance clinical practice, therapist training, and client care, particularly in augmenting human capacities within group and adjunctive therapy contexts. Paradoxically, AI could support psychotherapists in improving their capacity to mentalize, deepening their understanding of their clients, and providing such understanding within the moral constraints that normally govern their work. This paper calls for careful ethical regulation (similar to that limiting genetic manipulation), interdisciplinary research, and clinician involvement in shaping future AI-based psychotherapeutic models, emphasizing that AI’s role should complement rather than replace the irreplaceable relational core of psychotherapy. %M 41071597 %R 10.2196/79156 %U https://www.jmir.org/2025/1/e79156 %U https://doi.org/10.2196/79156 %U http://www.ncbi.nlm.nih.gov/pubmed/41071597 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 13 %N %P e77837 %T Large Language Model–Enhanced Drug Repositioning Knowledge Extraction via Long Chain-of-Thought: Development and Evaluation Study %A Kang,Hongyu %A Li,Jiao %A Hou,Li %A Xu,Xiaowei %A Zheng,Si %A Li,Qin %K drug repositioning %K knowledge extraction %K large language models %K long chain-of-thought %K reinforcement learning %D 2025 %7 7.10.2025 %9 %J JMIR Med Inform %G English %X Background: Drug repositioning is a pivotal strategy in pharmaceutical research, offering accelerated and cost-effective therapeutic discovery.
However, biomedical information relevant to drug repositioning is often complex, dispersed, and underutilized due to limitations in traditional extraction methods, such as reliance on annotated data and poor generalizability. Large language models (LLMs) show promise but face challenges such as hallucinations and interpretability issues. Objective: This study proposed long chain-of-thought for drug repositioning knowledge extraction (LCoDR-KE), a lightweight and domain-specific framework to enhance LLMs’ accuracy and adaptability in extracting structured biomedical knowledge for drug repositioning. Methods: A domain-specific schema defined 11 entities (eg, drug, disease) and 18 relationships (eg, treats, is biomarker of). Following the established schema, we automatically annotated 10,000 PubMed abstracts via chain-of-thought prompt engineering. A total of 1000 expert-validated abstracts were curated into a high-quality, specialized drug repositioning corpus, while the remaining entries were used for model training. Then, the proposed LCoDR-KE framework combined supervised fine-tuning of the Qwen2.5-7B-Instruct model with reinforcement learning and dual-reward mechanisms. Performance was evaluated against state-of-the-art models (eg, conditional random fields, Bidirectional Encoder Representations From Transformers, BioBERT, Qwen2.5, DeepSeek-R1, OpenBioLLM-70B, and model variants) using precision, recall, and F1-score. In addition, the convergence of the training method was assessed by analyzing performance progression across iteration steps. Results: LCoDR-KE achieved an entity F1 of 81.46% (eg, drug 95.83%, disease 90.52%) and a triplet F1 of 69.04%, outperforming traditional models and rivaling larger LLMs (DeepSeek-R1: entity F1=84.64%, triplet F1=69.02%). Ablation studies confirmed the contributions of supervised fine-tuning (8.61% and 20.70% F1 drops if removed) and reinforcement learning (6.09% and 14.09% F1 drops if removed). The training process demonstrated stable convergence, validated through iterative performance monitoring. Qualitative analysis of the model’s chain-of-thought outputs showed that LCoDR-KE performed structured and schema-aware reasoning by validating entity types, rejecting incompatible relations, enforcing constraints, and generating compliant JSON. Error analysis revealed 4 main types of mistakes and challenges for further improvement. Conclusions: LCoDR-KE enhances LLMs’ domain-specific adaptability for drug repositioning by offering an open-source drug repositioning corpus and a long chain-of-thought framework based on a lightweight LLM. This framework supports drug discovery and knowledge reasoning while providing scalable, interpretable solutions applicable to broader biomedical knowledge extraction tasks.
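The record above describes schema-aware reasoning that rejects incompatible relations and emits compliant JSON. The sketch below illustrates only that validation step, under assumed names: the entity and relation types are a small invented subset of the paper's 11-entity/18-relation schema, the prompt wording is hypothetical, and a hard-coded string stands in for the LLM call.

```python
# Hypothetical sketch of schema-constrained triplet validation in the spirit
# of LCoDR-KE; not the paper's actual schema, prompts, or fine-tuned model.
import json

ENTITY_TYPES = {"Drug", "Disease", "Gene", "Biomarker"}   # invented subset
RELATION_TYPES = {"treats", "is_biomarker_of", "targets"}  # invented subset

PROMPT = (
    "Think step by step: list candidate entities, check each against the "
    "allowed types {ents}, then output only schema-compliant triplets as "
    "JSON with keys head, head_type, relation, tail, tail_type."
).format(ents=sorted(ENTITY_TYPES))

def validate(raw_json: str) -> list:
    """Keep only triplets whose entity and relation types obey the schema."""
    triplets = json.loads(raw_json)
    return [
        t for t in triplets
        if t.get("head_type") in ENTITY_TYPES
        and t.get("tail_type") in ENTITY_TYPES
        and t.get("relation") in RELATION_TYPES
    ]

# Mocked model output standing in for an actual LLM call:
mock_response = json.dumps([
    {"head": "metformin", "head_type": "Drug",
     "relation": "treats", "tail": "type 2 diabetes", "tail_type": "Disease"},
    {"head": "HbA1c", "head_type": "Lab",  # type outside the schema -> rejected
     "relation": "treats", "tail": "x", "tail_type": "Disease"},
])
print(validate(mock_response))  # only the schema-compliant triplet survives
```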
%R 10.2196/77837 %U https://medinform.jmir.org/2025/1/e77837 %U https://doi.org/10.2196/77837 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e81769 %T Critical Limitations in Systematic Reviews of Large Language Models in Health Care %A Weizman,Zvi %K letter %K large language models %K AI %K health care %K review %K LLM %K clinical %K artificial intelligence %K digital health %D 2025 %7 24.9.2025 %9 %J J Med Internet Res %G English %X %R 10.2196/81769 %U https://www.jmir.org/2025/1/e81769 %U https://doi.org/10.2196/81769 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e82729 %T Author's Reply: Critical Limitations in Systematic Reviews of Large Language Models in Health Care %A Python,Andre %A Li,HongYi %A Fu,Jun-Fen %K large language model %K LLM %K clinical %K artificial intelligence %K AI %K digital health %K LLM review %K review %K letter %D 2025 %7 24.9.2025 %9 %J J Med Internet Res %G English %X %R 10.2196/82729 %U https://www.jmir.org/2025/1/e82729 %U https://doi.org/10.2196/82729 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e69408 %T Diabetic Foot Ulcer Classification Models Using Artificial Intelligence and Machine Learning Techniques: Systematic Review %A Silva,Manuel Alberto %A Hamilton,Emma J %A Russell,David A %A Game,Fran %A Wang,Sheila C %A Baptista,Sofia %A Monteiro-Soares,Matilde %+ , USF Sanus Carandá, ULS Braga, Praça Cândido Costa Pires, piso 1, Braga, 4715-402, Portugal, 351 253201530, manuelalbertorsilva@gmail.com %K artificial intelligence %K diabetic foot %K classification %K machine learning %K prognosis %D 2025 %7 24.9.2025 %9 Review %J J Med Internet Res %G English %X Background: Diabetes-related foot ulceration (DFU) is a common complication of diabetes, with a significant impact on survival, health care costs, and health-related quality of life. The prognosis of DFU varies widely among individuals. The International Working Group on the Diabetic Foot recently updated their guidelines on how to classify ulcers using “classical” classification and scoring systems. No system was recommended for individual prognostication, and the group considered that more detail in ulcer characterization was needed and that machine learning (ML)–based models may be the solution. Despite advances in the field, no assessment of the available evidence had been conducted. Objective: This study aimed to identify and collect available evidence assessing the ability of ML-based models to predict clinical outcomes in people with DFU. Methods: We searched the MEDLINE database (PubMed), Scopus, Web of Science, and IEEE Xplore for papers published up to July 2023. Studies were eligible if they were anterograde analytical studies that examined the prognostic abilities of ML models in predicting clinical outcomes in a population that included at least 80% of adults with DFU. The literature was screened independently by 2 investigators (MMS and DAR or EH in the first phase, and MMS and MAS in the second phase) for eligibility criteria, and data were extracted. The risk of bias was evaluated using the Quality In Prognosis Studies tool and the Prediction model Risk Of Bias Assessment Tool by 2 investigators (MMS and MAS) independently. A narrative synthesis was conducted. Results: We retrieved a total of 2412 references after removing duplicates, of which 167 were subjected to full-text screening. Two references were added from searching relevant studies’ lists of references.
A total of 11 studies, comprising 13 papers, were included, focusing on 3 outcomes: wound healing, lower extremity amputation, and mortality. Overall, 55 predictive models were created using mostly clinical characteristics, random forest as the developing method, and area under the receiver operating characteristic curve (AUROC) as a discrimination accuracy measure. AUROC varied from 0.56 to 0.94, with the majority of the models reporting an AUROC equal to or greater than 0.8 but lacking 95% CIs. All studies were found to have a high risk of bias, mainly due to a lack of uniform variable definitions, outcome definitions and follow-up periods, insufficient sample sizes, and inadequate handling of missing data. Conclusions: We identified several ML-based models predicting clinical outcomes with good discriminatory ability in people with DFU. Because the studies focused on model development and internal validation, proposed several models each without selecting the “best one,” and used nonexplainable techniques, the practical use of this type of model is clearly impaired. Future studies externally validating explainable models are needed so that ML models can become a reality in DFU care. Trial Registration: PROSPERO CRD42022308248; https://www.crd.york.ac.uk/PROSPERO/view/CRD42022308248 %M 40991939 %R 10.2196/69408 %U https://www.jmir.org/2025/1/e69408 %U https://doi.org/10.2196/69408 %U http://www.ncbi.nlm.nih.gov/pubmed/40991939 %0 Journal Article %@ 2369-1999 %I JMIR Publications %V 11 %N %P e71102 %T Understanding Cancer Survivorship Care Needs Using Amazon Reviews: Content Analysis, Algorithm Development, and Validation Study %A Wang,Liwei %A Lu,Qiuhao %A Li,Rui %A Harrison,Taylor B %A Jia,Heling %A Huang,Ming %A Dowst,Heidi %A Zhang,Rui %A Badr,Hoda %A Fan,Jungwei W %A Liu,Hongfang %K real-world data %K cancer research %K cancer survivorship care %K natural language processing %K annotation %K baseline models %K deep learning %K large language model %D 2025 %7 23.9.2025 %9 %J JMIR Cancer %G English %X Background: Complementary therapies are being increasingly used by cancer survivors. As a channel for customers to share their feelings, outcomes, and perceived knowledge about the products purchased from e-commerce platforms, Amazon consumer reviews are a valuable real-world data source for understanding cancer survivorship care needs. Objective: In this study, we aimed to highlight the potential of using Amazon consumer reviews as a novel source for identifying cancer survivorship care needs, particularly related to symptom self-management. Specifically, we present a publicly available, manually annotated corpus derived from Amazon reviews of health-related products and develop baseline natural language processing models using deep learning and a large language model (LLM) to demonstrate the usability of this dataset. Methods: We preprocessed the Amazon review dataset to identify sentences with cancer mentions through a rule-based method and conducted content analysis, including text feature analysis, sentiment analysis, topic modeling, and cancer type and symptom association analysis. We then designed an annotation guideline, targeting survivorship-relevant constructs. A total of 159 reviews were annotated, and baseline models were developed based on deep learning and an LLM for named entity recognition and text classification tasks.
Results: A total of 4703 sentences containing positive cancer mentions were identified, drawn from 3349 reviews associated with 2589 distinct products. The topics identified through topic modeling revealed meaningful insights into cancer symptom management and survivorship experiences. Examples included discussions of green tea use during chemotherapy, cancer prevention strategies, and product recommendations for breast cancer. The top 15 symptoms in reviews were also identified, with pain being the most frequent, followed by inflammation and fatigue. The annotation labels were designed to capture cancer types, indicated symptoms, and symptom management outcomes. The resulting annotation corpus contains 2067 labels from 159 Amazon reviews. It is publicly accessible, together with the annotation guideline, through the Open Health Natural Language Processing (OHNLP) GitHub. Our baseline model, Bert-base-cased, achieved the highest weighted average F1-score (66.92%) for named entity recognition, and the LLM gpt4-1106-preview-chat achieved the highest F1-scores for text classification: 66.67% for “Harmful outcome,” 88.46% for “Favorable outcome,” and 73.33% for “Ambiguous outcome.” Conclusions: Our results demonstrate the potential of Amazon consumer reviews as a novel data source for identifying persistent symptoms, concerns, and self-management strategies among cancer survivors. This corpus, along with the baseline natural language processing models developed for named entity recognition and text classification, lays the groundwork for future methodological advancements in cancer survivorship research. Importantly, insights from this study could be evaluated against established clinical guidelines for symptom management in cancer survivorship care. By revealing the feasibility of using consumer-generated data for mining survivorship-related experiences, this study offers a promising foundation for future research and argumentation analysis aimed at improving long-term outcomes and support for cancer survivors. %R 10.2196/71102 %U https://cancer.jmir.org/2025/1/e71102 %U https://doi.org/10.2196/71102 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e69752 %T Comparative Evaluation of a Medical Large Language Model in Answering Real-World Radiation Oncology Questions: Multicenter Observational Study %A Dennstädt,Fabio %A Schmerder,Max %A Riggenbach,Elena %A Mose,Lucas %A Bryjova,Katarina %A Bachmann,Nicolas %A Mackeprang,Paul-Henry %A Ahmadsei,Maiwand %A Sinovcic,Dubravko %A Windisch,Paul %A Zwahlen,Daniel %A Rogers,Susanne %A Riesterer,Oliver %A Maffei,Martin %A Gkika,Eleni %A Haddad,Hathal %A Peeken,Jan %A Putora,Paul Martin %A Glatzer,Markus %A Putz,Florian %A Hoefler,Daniel %A Christ,Sebastian M %A Filchenko,Irina %A Hastings,Janna %A Gaio,Roberto %A Chiang,Lawrence %A Aebersold,Daniel M %A Cihoric,Nikola %+ Inselspital, Department of Radiation Oncology, Bern University Hospital, University of Bern, Freiburgstrasse 18, Bern, 3010, Switzerland, 41 764228338, nikola.cihoric@gmail.com %K large language models %K natural language processing %K artificial intelligence %K radiation oncology %K Llama-3 %K benchmarking %K evaluation %D 2025 %7 23.9.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Large language models (LLMs) hold promise for supporting clinical tasks, particularly in data-driven and technical disciplines such as radiation oncology.
While prior studies have evaluated LLMs in examination-style settings, their performance in real-life clinical scenarios remains unclear. In the future, LLMs might be used as general AI assistants to answer questions arising in clinical practice. It is unclear how well a modern LLM, locally executed within the infrastructure of a hospital, would answer such questions compared with clinical experts. Objective: This study aimed to assess the performance of a locally deployed, state-of-the-art medical LLM in answering real-world clinical questions in radiation oncology compared with clinical experts. The aim was to evaluate the overall quality of answers, as well as the potential harmfulness of the answers if used for clinical decision-making. Methods: Physicians from 10 departments of European hospitals collected questions arising in the clinical practice of radiation oncology. Fifty of these questions were answered by 3 senior radiation oncology experts with at least 10 years of work experience, as well as by the LLM Llama3-OpenBioLLM-70B (Ankit Pal and Malaikannan Sankarasubbu). In a blinded review, physicians rated the overall answer quality on a 5-point Likert scale (quality), assessed whether an answer might be potentially harmful if used for clinical decision-making (harmfulness), and determined if responses were from an expert or the LLM (recognizability). Comparisons between clinical experts and LLMs were then made for quality, harmfulness, and recognizability. Results: There were no significant differences in answer quality between the LLM and the clinical experts (mean scores of 3.38 vs 3.63; median 4.00, IQR 3.00-4.00 vs median 3.67, IQR 3.33-4.00; P=.26; Wilcoxon signed rank test). The answers were deemed potentially harmful in 13% of cases for the clinical experts compared with 16% of cases for the LLM (P=.63; Fisher exact test). Physicians correctly identified whether an answer was given by a clinical expert or an LLM in 78% and 72% of cases, respectively. Conclusions: A state-of-the-art medical LLM can answer real-life questions from the clinical practice of radiation oncology about as well as clinical experts with regard to overall quality and potential harmfulness. Such LLMs can already be deployed within the local hospital environment at an affordable cost. While LLMs may not yet be ready for clinical implementation as general AI assistants, the technology continues to improve at a rapid pace. Evaluation studies based on real-life situations are important to better understand the weaknesses and limitations of LLMs in clinical practice. Such studies are also crucial to define when the technology is ready for clinical implementation. Furthermore, education for health care professionals on generative AI is needed to ensure responsible clinical implementation of this transformative technology.
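To make the reported statistics concrete, here is a small re-creation of the two hypothesis tests named in the record above, with invented numbers in place of the study's blinded ratings; it assumes scipy is installed and mirrors only the form of the analysis, not its data.

```python
# Illustrative paired comparison of expert vs LLM answers; toy values only.
from scipy.stats import wilcoxon, fisher_exact

# Hypothetical paired mean Likert quality ratings per question (expert vs LLM)
expert = [4.0, 3.7, 3.3, 4.3, 3.0, 4.0, 3.7, 3.3]
llm    = [3.7, 4.0, 3.0, 4.0, 3.3, 3.7, 3.0, 3.7]
stat, p = wilcoxon(expert, llm)  # paired, nonparametric, as in the study
print(f"Wilcoxon signed-rank: W={stat:.1f}, p={p:.2f}")

# Harmful answer counts as a 2x2 table: [harmful, not harmful]
table = [[6, 44],   # experts: roughly the 13% of 50 answers reported above
         [8, 42]]   # LLM: roughly the 16% of 50 answers reported above
odds, p = fisher_exact(table)
print(f"Fisher exact: OR={odds:.2f}, p={p:.2f}")
```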
%M 40986858 %R 10.2196/69752 %U https://www.jmir.org/2025/1/e69752 %U https://doi.org/10.2196/69752 %U http://www.ncbi.nlm.nih.gov/pubmed/40986858 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e73603 %T Large Language Models’ Clinical Decision-Making on When to Perform a Kidney Biopsy: Comparative Study %A Toal,Michael %A Hill,Christopher %A Quinn,Michael %A O'Neill,Ciaran %A Maxwell,Alexander P %K kidney biopsy %K renal biopsy %K nephrology %K chronic kidney disease %K kidney failure %K proteinuria %K hematuria %K glomerulonephritis %K machine learning %K large language models %K artificial intelligence %K decision support %D 2025 %7 18.9.2025 %9 %J J Med Internet Res %G English %X Background: Artificial intelligence (AI) and large language models (LLMs) are increasing in sophistication and are being integrated into many disciplines. The potential for LLMs to augment clinical decision-making is an evolving area of research. Objective: This study compared the responses of over 1000 kidney specialist physicians (nephrologists) with the outputs of commonly used LLMs using a questionnaire determining when a kidney biopsy should be performed. Methods: This research group administered a large online questionnaire to nephrologists to determine when a kidney biopsy should be performed. The questionnaire was co-designed with patient input, refined through multiple iterations, and piloted locally before international dissemination. It was the largest international study in the field and demonstrated variation among human clinicians in biopsy propensity relating to human factors such as sex and age, as well as systemic factors such as country, job seniority, and technical proficiency. The same questions were put to both human doctors and LLMs in an identical order in a single session. Eight commonly used LLMs were interrogated: ChatGPT-3.5, Mistral Hugging Face, Perplexity, Microsoft Copilot, Llama 2, GPT-4, MedLM, and Claude 3. The most common response given by clinicians (human mode) for each question was taken as the baseline for comparison. Questionnaire responses on the indications and contraindications for biopsy generated a score (0-44) reflecting biopsy propensity, in which a higher score was used as a surrogate marker for an increased tolerance of potential associated risks. Results: The ability of LLMs to reproduce human expert consensus varied widely, with some models demonstrating a balanced approach to risk in a similar manner to humans, while other models reported outputs at either end of the spectrum for risk tolerance. In terms of agreement with the human mode, ChatGPT-3.5 and GPT-4 (OpenAI) had the highest levels of alignment, agreeing with it on 6 out of 11 questions. The total biopsy propensity score generated from the human mode was 23 out of 44. Both OpenAI models produced similar propensity scores between 22 and 24. However, Llama 2 and Microsoft Copilot also scored within this range but aligned with the human consensus on only 2 out of 11 questions. The most risk-averse model in this study was MedLM, with a propensity score of 11, and the least risk-averse model was Claude 3, with a score of 34. Conclusions: The outputs of LLMs demonstrated a modest ability to replicate human clinical decision-making in this study; however, performance varied widely between models. Questions with more uniform human responses produced LLM outputs with higher alignment, whereas questions with lower human consensus showed poorer output alignment.
This may limit the practical use of LLMs in real-world clinical practice. %R 10.2196/73603 %U https://www.jmir.org/2025/1/e73603 %U https://doi.org/10.2196/73603 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e76745 %T Decoding HIV Discourse on Social Media: Large-Scale Analysis of 191,972 Tweets Using Machine Learning, Topic Modeling, and Temporal Analysis %A Zhan,Xiangming %A Song,Meijia %A Shrader,Cho Hee %A Forbes,Chad E %A Algarin,Angel B %K HIV prevention %K social media %K topic modeling %K temporal analysis %K machine learning %K public health informatics %D 2025 %7 29.8.2025 %9 %J J Med Internet Res %G English %X Background: HIV remains a global challenge, with stigma, financial constraints, and psychosocial barriers preventing people living with HIV from accessing health care services, driving them to seek information and support on social media. Despite the growing role of digital platforms in health communication, existing research often narrowly focuses on specific HIV-related topics rather than offering a broader landscape of thematic patterns. In addition, much of the existing research lacks large-scale analysis and predominantly predates COVID-19 and the platform’s transition to X (formerly known as Twitter), limiting our understanding of the comprehensive, dynamic, and postpandemic HIV-related discourse. Objective: This study aims to (1) observe the dominant themes in current HIV-related social media discourse, (2) explore similarities and differences between theory-driven (eg, literature-informed predetermined categories) and data-driven themes (eg, unsupervised Latent Dirichlet Allocation [LDA] without previous categorization), and (3) examine how emotional responses and temporal patterns influence the dissemination of HIV-related content. Methods: We analyzed 191,972 tweets collected between June 2023 and August 2024 using an integrated analytical framework. This approach combined (1) supervised machine learning for text classification, (2) comparative topic modeling with both theory-driven and data-driven LDA to identify thematic patterns, (3) sentiment analysis using VADER (Valence Aware Dictionary and sEntiment Reasoner) and the NRC Emotion Lexicon to examine emotional dimensions, and (4) temporal trend analysis to track engagement patterns. Results: Theory-driven themes revealed that information and education content constituted the majority of HIV-related discourse (120,985/191,972, 63.02%), followed by opinions and commentary (23,863/191,972, 12.43%), and personal experiences and stories (19,672/191,972, 10.25%). The data-driven approach identified 8 distinct themes, some of which shared similarities with aspects from the theory-driven approach, while others were unique. Temporal analysis revealed 2 different engagement patterns: official awareness campaigns like World AIDS Day generated delayed peak engagement through top-down information sharing, while community-driven events like National HIV Testing Day showed immediate user engagement through peer-to-peer interactions. Conclusions: HIV-related social media discourse on X reflects the dominance of informational content, the emergence of prevention as a distinct thematic focus, and the varying effectiveness of different timing patterns in HIV-related messaging. These findings suggest that effective HIV communication strategies can integrate medical information with community perspectives, maintain balanced content focus, and strategically time messages to maximize engagement.
These insights provide valuable guidance for developing digital outreach strategies that better connect health care services with vulnerable populations in the post–COVID-19 pandemic era. %R 10.2196/76745 %U https://www.jmir.org/2025/1/e76745 %U https://doi.org/10.2196/76745 %0 Journal Article %@ 1947-2579 %I JMIR Publications %V 17 %N %P e66798 %T Forecasting Neonatal Mortality in Ethiopia to Assess Progress Toward National and International Reduction Targets Using Classical Techniques and Deep Learning: Time-Series Forecasting Study %A Kebede,Shimels Derso %K neonatal mortality %K time series %K forecasting %K deep learning %K machine learning %K health sector transformation plan %K deep learning model %K Ethiopia %D 2025 %7 25.8.2025 %9 %J Online J Public Health Inform %G English %X Background: Neonatal disease and its outcomes are important indicators for a responsive health care system and encompass the effects of socioeconomic and environmental factors on newborns and mothers. Ethiopia is working to reduce neonatal mortality to 12 or fewer deaths per 1000 live births by 2030, the Sustainable Development Goal target, and to 21 per 1000 live births by 2025 as part of the second Ethiopian Health Sector Transformation Plan. Objective: This study aimed to compare the performance of classical time-series models with that of deep learning models and to forecast the neonatal mortality rate in Ethiopia to verify whether Ethiopia will achieve national and international targets. Methods: Data were extracted from the official World Bank database. Classical time-series models, such as autoregressive integrated moving average (ARIMA) and double exponential smoothing, and neural network-based models, such as multilayer perceptron, convolutional neural network (CNN), and long short-term memory, were applied to forecast neonatal mortality rates from 2021 to 2030 in Ethiopia. During model building, the first 21 years of data (from 1990 to 2010) were used for training, and the remaining 10 years of data were used to test model performance. Model performance was evaluated using R², mean absolute percentage error (MAPE), and root mean squared error (RMSE). Finally, the best model was used to forecast the neonatal mortality rate over the next 10 years from 2021 to 2030, with a 95% prediction interval (PI). Results: The results showed that the double exponential smoothing model was the best, with a maximum R² of 99.94% and minimum MAPE and RMSE of 0.002 and 0.0748, respectively. The worst performing of the 5 models was the CNN, with an R² of 93.71% and a maximum RMSE of 0.79. Neonatal mortality in Ethiopia is forecasted to be 23.20 (PI 22.20-24.40) per 1000 live births in 2025 and 19.80 (PI 17.10-22.80) per 1000 live births in 2030. Conclusions: This study revealed that national and international targets for neonatal mortality cannot be realized if the current trend continues. This highlights the need for urgent interventions to strengthen the health system, accelerate the decline in neonatal mortality, and foster collaborative efforts with concerned stakeholders for improved and responsive neonatal and child health services in order to achieve these targets.
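For readers who want to reproduce the general approach of the forecasting record above, the sketch below fits Holt's double exponential smoothing (the best-performing method reported) on an invented declining series with the same 21/10 train/test split. It assumes numpy and statsmodels are installed; none of the numbers are the study's data.

```python
# Minimal double (Holt) exponential smoothing sketch on synthetic data.
import numpy as np
from statsmodels.tsa.holtwinters import Holt

# Synthetic neonatal mortality rate per 1000 live births, 1990-2020
years = np.arange(1990, 2021)
rate = 60 * np.exp(-0.03 * (years - 1990)) \
    + np.random.default_rng(0).normal(0, 0.3, years.size)

train, test = rate[:21], rate[21:]        # 1990-2010 train, 2011-2020 test
model = Holt(train).fit(optimized=True)   # level + trend smoothing parameters
pred = model.forecast(len(test))

rmse = float(np.sqrt(np.mean((test - pred) ** 2)))
mape = float(np.mean(np.abs((test - pred) / test)))
print(f"RMSE={rmse:.3f}  MAPE={mape:.4f}")

# Refit on the full series and forecast the next decade, as the study did
print("2021-2030 forecast:", np.round(Holt(rate).fit().forecast(10), 1))
```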
%R 10.2196/66798 %U https://ojphi.jmir.org/2025/1/e66798 %U https://doi.org/10.2196/66798 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e66100 %T AI and Machine Learning Terminology in Medicine, Psychology, and Social Sciences: Tutorial and Practical Recommendations %A Cao,Bo %A Greiner,Russell %A Greenshaw,Andrew %A Sui,Jie %K artificial intelligence %K machine learning %K terminology %K medicine %K psychology %K social sciences %K prediction %K regression %K deep learning %K tutorial %K prospective prediction %K validation %D 2025 %7 18.8.2025 %9 %J J Med Internet Res %G English %X Recent applications of artificial intelligence (AI) and machine learning in medicine, psychology, and social sciences have led to common terminological confusions. In this paper, we review emerging evidence from systematic reviews documenting widespread misuse of key terms, particularly “prediction” being applied to studies merely demonstrating association or retrospective analysis. We clarify when “prediction” should be used and recommend using “prospective prediction” for future prediction; explain validation procedures essential for model generalizability; discuss overfitting and generalization in machine learning and traditional regression methods; clarify relationships between features, independent variables, predictors, risk factors, and causal factors; and clarify the hierarchical relationship between AI, machine learning, deep learning, large language models, and generative AI. We provide evidence-based recommendations for terminology use that can facilitate clearer communication among researchers from different disciplines and between the research community and the public, ultimately advancing the rigorous application of AI in medicine, psychology, and social sciences. %R 10.2196/66100 %U https://www.jmir.org/2025/1/e66100 %U https://doi.org/10.2196/66100 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e72637 %T Effects of a Theory- and Evidence-Based, Motivational Interviewing–Oriented Artificial Intelligence Digital Assistant on Vaccine Attitudes: A Randomized Controlled Trial %A Li,Yan %A Li,Mengqi %A Yorke,Janelle %A Bressington,Daniel %A Chung,Joyce %A Xie,Yao-Jie %A Yang,Lin %A He,Mengting %A Sun,Tsz-Ching %A Leung,Angela Y M %K attitude %K motivational interviewing %K artificial intelligence %K chatbot %K vaccine hesitancy %K COVID-19 %D 2025 %7 8.8.2025 %9 %J J Med Internet Res %G English %X Background: Attitude-targeted interventions are important approaches for promoting vaccination. Educational approaches alone cannot effectively cultivate positive vaccine attitudes. Artificial intelligence (AI)–driven chatbots and motivational interviewing (MI) techniques show promise in improving vaccine attitudes and facilitating readiness for vaccination. Objective: This study aimed to evaluate the effectiveness of a theory- and evidence-based, MI-oriented AI digital assistant in improving COVID-19 vaccine attitudes among adults in Hong Kong. Methods: This parallel, 2-armed randomized controlled trial was conducted from October 2022 to June 2024. Hong Kong adults (N=177) who were vaccine-hesitant were randomly assigned to 2 study groups. The intervention group (n=91) interacted with the AI digital assistant over 5 weeks, including receiving a web-based education program comprising 5 educational modules and communicating with an AI-driven chatbot equipped with MI techniques.
The control group (n=86) received WhatsApp (Meta) messages directing them to government websites for COVID-19 vaccine information and knowledge, with the same dosage as the intervention group. The primary outcome was vaccine hesitancy. Secondary outcomes included vaccine readiness, confidence, trust in government, and health literacy. Outcomes were measured at baseline, postintervention, 3-month, and 6-month follow-up. Focus group interviews were conducted postintervention. Intervention effects were analyzed using the generalized estimating equation model. Interview data were content analyzed. Results: Decreases in vaccine hesitancy were observed, but no statistically significant time-by-group interaction effects were found. The intervention showed significant time-by-group interaction effects on vaccine readiness (P=.04), confidence (P=.02), and trust in government (P=.04). Significant between-group differences with medium effect sizes were identified postintervention for vaccine readiness (Cohen d=0.52) and trust in government (Cohen d=0.54). Increases in vaccine-related health literacy were observed, and a significant time effect was found (P=.01). In total, 3 categories were summarized from the interview data: (1) improved vaccine literacy, confidence, and trust in government; (2) hesitancy varied while readiness improved; and (3) facilitators, barriers, and recommended modifications to the intervention. Conclusions: The intervention indicated promising, significant effects on vaccine readiness, while the effects on vaccine hesitancy require further confirmation. The qualitative findings, however, further consolidate the significant effects on participants’ attitudes toward vaccines. The findings provide novel evidence to encourage the adoption and refinement of an MI-oriented AI digital assistant in vaccine promotion. Trial Registration: ClinicalTrials.gov NCT05531058; https://clinicaltrials.gov/study/NCT05531058 %R 10.2196/72637 %U https://www.jmir.org/2025/1/e72637 %U https://doi.org/10.2196/72637 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e70087 %T Public Medical Appeals and Government Online Responses: Big Data Analysis Based on Chinese Digital Governance Platforms %A Li,Hebin %A Liu,Zhihan %A Zhang,Ziyan %A Ping,Lu %A Gu,Wenxin %A Yao,Yuan %K medical appeals %K government online responses %K digital governance %K content analysis %K co-word analysis %K logistic regression %D 2025 %7 6.8.2025 %9 %J J Med Internet Res %G English %X Background: In the era of internet-based governance, online public appeals—particularly those related to health care—have emerged as a crucial channel through which citizens articulate their needs and concerns. Objective: This study aims to investigate the thematic structure, emotional tone, and underlying logic of governmental responses related to public medical appeals in China. Methods: We collected messages posted on the “Message Board for Leaders” hosted by People’s Daily Online between January 2022 and November 2023 to identify valid medical appeals for analysis. (1) Key themes of public appeals were identified using the term frequency–inverse document frequency (TF-IDF) model for feature word extraction, followed by hierarchical cluster analysis. (2) Sentiment classification was conducted using supervised machine learning, with additional validation through sentiment scores derived from a lexicon-based approach.
(3) A binary logistic regression model was employed to examine the influence of textual, transactional, and macro-environmental factors on the likelihood of receiving a government response. Robustness was tested using a probit model. Results: From a total of 404,428 online appeals, 8864 valid public medical messages were retained after filtering. These primarily concerned pandemic control, fertility policies, health care institutions, and insurance issues. Negative sentiment predominated across message types, accounting for 3328 out of 3877 (85.84%) complaints/help-seeking messages, 1666 out of 2381 (69.97%) consultation messages, and 1710 out of 2606 (65.62%) suggestions. Regression analysis revealed that textual features, issue complexity, and benefit attribution were not significantly associated with government responsiveness. Specifically, for textual features, taking the epidemic issue as the reference category in the appeal theme, the P values were as follows: fertility issue (P=.63), hospital issue (P=.63), security issue (P=.72), and other issues (P=.34). Other textual features included appeal content (P=.80), appeal sentiment (P=.64), and appeal title (P=.55). Regarding the difficulty of resolving incidents, with low difficulty as the reference category, the P values were moderate difficulty (P=.59) and high difficulty (P=.96). For benefit attribution, using individual interest as the reference, collective interest (P=.25) was not statistically significant. By contrast, macro-level factors—specifically internet penetration, education, economic development, and labor union strength—had significant effects. Compared with areas with lower levels, higher internet penetration (odds ratio 1.577-9.930, P=.004 to <.001), education (odds ratio 2.497, P<.001), and gross domestic product (odds ratio 2.599, P<.001) were associated with increased responsiveness. Conversely, medium (odds ratio 0.565, P<.001) and high (odds ratio 0.116, P<.001) levels of labor union development were linked to lower response odds. Conclusions: Public medical appeals exhibit 5 defining characteristics: urgency induced by pandemic conditions, connections to fertility policy reforms, tensions between the efficacy and costs of medical services, challenges related to cross-regional insurance coverage, and a predominance of negative sentiment. The findings indicate that textual features and issue-specific content exert limited influence on government responsiveness, likely due to the politically sensitive and complex nature of health care–related topics. Instead, macro-level environmental factors emerge as key determinants. These insights can inform the optimization of response mechanisms on digital health platforms and offer valuable theoretical and empirical contributions to the advancement of health information dissemination and digital governance within the health care sector.
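As a schematic of the responsiveness model described in the record above, the following fits a binary logistic regression and reports exponentiated coefficients as odds ratios. Everything here is simulated under stated assumptions (numpy and statsmodels available; invented binary predictors); it mirrors the form of the analysis, not the paper's actual coding scheme or data.

```python
# Hedged sketch: binary logistic regression with odds ratios on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 2000
internet = rng.integers(0, 2, n)    # 1 = high internet penetration region (hypothetical)
education = rng.integers(0, 2, n)   # 1 = high education level region (hypothetical)
negative = rng.integers(0, 2, n)    # 1 = negative-sentiment appeal (hypothetical)

# Simulate responses where macro-level factors matter and sentiment does not,
# echoing the pattern of findings reported above
logit = -0.5 + 0.9 * internet + 0.9 * education + 0.0 * negative
responded = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = sm.add_constant(np.column_stack([internet, education, negative]))
fit = sm.Logit(responded, X).fit(disp=False)
for name, coef in zip(["const", "internet", "education", "negative"], fit.params):
    print(f"{name:9s} OR={np.exp(coef):.2f}")  # exp(coef) = odds ratio
```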
%R 10.2196/70087 %U https://www.jmir.org/2025/1/e70087 %U https://doi.org/10.2196/70087 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e70381 %T Artificial Intelligence in Health Promotion and Disease Reduction: Rapid Review %A Yousefi,Farzaneh %A Naye,Florian %A Ouellet,Steven %A Yameogo,Achille-Roghemrazangba %A Sasseville,Maxime %A Bergeron,Frédéric %A Ozkan,Marianne %A Cousineau,Martin %A Amil,Samira %A Rhéaume,Caroline %A Gagnon,Marie-Pierre %K artificial intelligence %K health promotion %K disease reduction %K AI in health %K rapid review %K SWOT analysis %D 2025 %7 1.8.2025 %9 %J J Med Internet Res %G English %X Background: Chronic diseases represent a significant global burden of mortality, exacerbated by behavioral risk factors. Artificial intelligence (AI) has transformed health promotion and disease reduction through improved early detection, encouragement of healthy lifestyle modifications, and mitigation of the economic strain on health systems. Objective: The aim of this study is to investigate how AI contributes to health promotion and disease reduction among Organization for Economic Co-operation and Development countries. Methods: We conducted a rapid review of the literature to identify the latest evidence on how AI is used in health promotion and disease reduction. We applied comprehensive search strategies formulated for MEDLINE (OVID) and CINAHL to locate studies published between 2019 and 2024. A pair of reviewers independently applied the inclusion and exclusion criteria to screen the titles and abstracts, assess the full texts, and extract the data. We synthesized extracted data from the study characteristics, intervention characteristics, and intervention purpose using structured narrative summaries of main themes, giving a portrait of the current scope of available AI initiatives used in promoting healthy activities and preventing disease. Results: We included 22 studies in this review (out of 3442 publications screened), nearly half of which were conducted in the United States (10/22, 45%). Most focused on health promotion by targeting lifestyle dimensions, such as dietary behavior (10/22, 45%), smoking cessation (6/22, 27%), physical activity (4/22, 18%), and mental health (3/22, 14%). Three studies targeted disease reduction related to metabolic health (eg, obesity, diabetes, hypertension). Most initiatives were AI-powered mobile apps. Overall, positive results were reported for process outcomes (eg, acceptability, engagement), cognitive and behavioral outcomes (eg, confidence, step count), and health outcomes (eg, glycemia, blood pressure). We categorized the challenges, benefits, and suggestions identified in the studies using a Strengths, Weaknesses, Opportunities, and Threats (SWOT) analysis to inform future developments. Key recommendations include conducting further investigations, taking into account the needs of end users, improving the technical aspects of the technology, and allocating resources. Conclusions: These findings offer critical insights into the effective implementation of AI for health promotion and disease prevention, potentially guiding policymakers and health care practitioners in optimizing the use of AI technologies in supporting health promotion and disease reduction.
Trial Registration: OSF Registries osf.io/e9v6x; https://osf.io/e9v6x/ %R 10.2196/70381 %U https://www.jmir.org/2025/1/e70381 %U https://doi.org/10.2196/70381 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e75849 %T Harm Reduction Strategies for Thoughtful Use of Large Language Models in the Medical Domain: Perspectives for Patients and Clinicians %A Moëll,Birger %A Sand Aronsson,Fredrik %K conversational AI %K risk mitigation %K health care innovation %K assistive technology %K verification protocols %K governance frameworks %K bias awareness %K regulatory compliance %K human-in-the-loop %K trustworthiness %K artificial intelligence %D 2025 %7 25.7.2025 %9 %J J Med Internet Res %G English %X The integration of large language models (LLMs) into health care presents significant risks to patients and clinicians that are inadequately addressed by current guidance. This paper adapts harm reduction principles from public health to medical LLMs, proposing a structured framework for mitigating these domain-specific risks while maximizing ethical utility. We outline tailored strategies for patients, emphasizing critical health literacy and output verification, and for clinicians, enforcing “human-in-the-loop” validation and bias-aware workflows. Key innovations include developing thoughtful use protocols that position LLMs as assistive tools requiring mandatory verification, establishing actionable institutional policies with risk-stratified deployment guidelines and patient disclaimers, and critically analyzing underaddressed regulatory, equity, and safety challenges. This research moves beyond theory to offer a practical roadmap, enabling stakeholders to ethically harness LLMs, balance innovation with accountability, and preserve core medical values: patient safety, equity, and trust in high-stakes health care settings. %R 10.2196/75849 %U https://www.jmir.org/2025/1/e75849 %U https://doi.org/10.2196/75849 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e70080 %T Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study %A Yang,Han %A Li,Mingchen %A Zhou,Huixue %A Xiao,Yongkang %A Fang,Qian %A Zhou,Shuang %A Zhang,Rui %K large language models %K ensemble learning %K medical question answering %K healthcare AI %K GPT-4 %D 2025 %7 14.7.2025 %9 %J J Med Internet Res %G English %X Background: Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, including medical question-answering (QA). However, individual LLMs often exhibit varying performance across different medical QA datasets. We benchmarked individual zero-shot LLMs (GPT-4, Llama2-13B, Vicuna-13B, MedLlama-13B, and MedAlpaca-13B) to assess their baseline performance. Within the benchmark, GPT-4 achieved the best accuracy (71%) on MedMCQA (a medical multiple-choice question answering dataset), Vicuna-13B achieved 89.5% on PubMedQA (a dataset for biomedical question answering), and MedAlpaca-13B achieved the best accuracy among all (70%), showing the potential for better performance across different tasks and highlighting the need for strategies that can harness their collective strengths. Ensemble learning methods, combining multiple models to improve overall accuracy and reliability, offer a promising approach to address this challenge. Objective: To develop and evaluate efficient ensemble learning approaches, we focus on improving performance across 3 medical QA datasets through our 2 proposed ensemble strategies.
Methods: Our study uses 3 medical QA datasets: PubMedQA (1000 manually labeled questions and 11,269 test questions, each answered with yes, no, or maybe), MedQA-USMLE (Medical Question Answering dataset based on the United States Medical Licensing Examination; 12,724 English board-style questions; 1272 test questions with 5 options), and MedMCQA (182,822 training/4183 test questions, 4-option multiple choice). We introduced the LLM-Synergy framework, consisting of 2 ensemble methods: (1) a Boosting-based Weighted Majority Vote ensemble, refining decision-making by adaptively weighting each LLM, and (2) a Cluster-based Dynamic Model Selection ensemble, dynamically selecting optimal LLMs for each query based on question-context embeddings and clustering. Results: Both ensemble methods outperformed individual LLMs across all 3 datasets. Specifically, compared with the best individual LLM, the Boosting-based Weighted Majority Vote achieved accuracies of 35.84% on MedMCQA (+3.81%), 96.21% on PubMedQA (+0.64%), and 37.26% (tie) on MedQA-USMLE. The Cluster-based Dynamic Model Selection yielded even higher accuracies of 38.01% (+5.98%) for MedMCQA, 96.36% (+1.09%) for PubMedQA, and 38.13% (+0.87%) for MedQA-USMLE. Conclusions: The LLM-Synergy framework, using 2 ensemble methods, represents a significant advancement in leveraging LLMs for medical QA tasks. By effectively combining the strengths of diverse LLMs, this framework provides a flexible and efficient strategy adaptable to current and future challenges in biomedical informatics. %R 10.2196/70080 %U https://www.jmir.org/2025/1/e70080 %U https://doi.org/10.2196/70080 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e71916 %T Implementing Large Language Models in Health Care: Clinician-Focused Review With Interactive Guideline %A Li,HongYi %A Fu,Jun-Fen %A Python,Andre %+ Center for Data Science, Zhejiang University, 866 Yuhangtang Road, Hangzhou, China, 86 13262579007, python.andre@gmail.com %K large language model %K LLM %K clinical %K artificial intelligence %K AI %K digital health %K LLM review %D 2025 %7 11.7.2025 %9 Review %J J Med Internet Res %G English %X Background: Large language models (LLMs) can generate outputs understandable by humans, such as answers to medical questions and radiology reports. With the rapid development of LLMs, clinicians face a growing challenge in determining the most suitable algorithms to support their work. Objective: We aimed to provide clinicians and other health care practitioners with systematic guidance in selecting an LLM that is relevant and appropriate to their needs and facilitate the integration process of LLMs in health care. Methods: We conducted a literature search of full-text publications in English on clinical applications of LLMs published between January 1, 2022, and March 31, 2025, on PubMed, ScienceDirect, Scopus, and IEEE Xplore. We excluded papers from journals below a set citation threshold, as well as papers that did not focus on LLMs, were not research based, or did not involve clinical applications. We also conducted a literature search on arXiv within the same period and included papers on the clinical applications of innovative multimodal LLMs. This led to a total of 270 studies. Results: We collected 330 LLMs and recorded their application frequency in clinical tasks and the frequency with which they performed best in their context.
On the basis of a 5-stage clinical workflow, we found that stages 2, 3, and 4 are the key stages, involving numerous clinical subtasks and LLMs. However, the diversity of LLMs that may perform optimally in each context remains limited. GPT-3.5 and GPT-4 were the most versatile models in the 5-stage clinical workflow, applied to 52% (29/56) and 71% (40/56) of the clinical subtasks, respectively, and they performed best in 29% (16/56) and 54% (30/56) of the clinical subtasks, respectively. General-purpose LLMs may not perform well in specialized areas as they often require lightweight prompt engineering methods or fine-tuning techniques based on specific datasets to improve model performance. Most LLMs with multimodal abilities are closed-source models and, therefore, lack transparency, model customization, and fine-tuning for specific clinical tasks; they may also pose challenges regarding data protection and privacy, which are common requirements in clinical settings. Conclusions: In this review, we found that LLMs may help clinicians in a variety of clinical tasks. However, we did not find evidence of generalist clinical LLMs successfully applicable to a wide range of clinical tasks. Therefore, their clinical deployment remains challenging. On the basis of this review, we propose an interactive online guideline for clinicians to select suitable LLMs by clinical task. With a clinical perspective and free of unnecessary technical jargon, this guideline may be used as a reference to successfully apply LLMs in clinical settings. %M 40644686 %R 10.2196/71916 %U https://www.jmir.org/2025/1/e71916 %U https://doi.org/10.2196/71916 %U http://www.ncbi.nlm.nih.gov/pubmed/40644686 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e75347 %T Psychometric Evaluation of Large Language Model Embeddings for Personality Trait Prediction %A Maharjan,Julina %A Jin,Ruoming %A Zhu,Jianfeng %A Kenne,Deric %K psychology %K Big Five personality %K artificial intelligence %K large language models %K embeddings %K deep learning %K social media %K Reddit %D 2025 %7 8.7.2025 %9 %J J Med Internet Res %G English %X Background: Recent advancements in large language models (LLMs) have generated significant interest in their potential for assessing psychological constructs, particularly personality traits. While prior research has explored LLMs’ capabilities in zero-shot or few-shot personality inference, few studies have systematically evaluated LLM embeddings within a psychometric validity framework or examined their correlations with linguistic and emotional markers. Additionally, the comparative efficacy of LLM embeddings against traditional feature engineering methods remains underexplored, leaving gaps in understanding their scalability and interpretability for computational personality assessment. Objective: This study evaluates LLM embeddings for personality trait prediction through 4 key analyses: (1) performance comparison with zero-shot methods on PANDORA Reddit data, (2) psychometric validation and correlation with LIWC (Linguistic Inquiry and Word Count) and emotion features, (3) benchmarking against traditional feature engineering approaches, and (4) assessment of model size effects (OpenAI vs BERT vs RoBERTa). We aim to establish LLM embeddings as a psychometrically valid and efficient alternative for personality assessment. Methods: We conducted a multistage analysis using 1 million Reddit posts from the PANDORA Big Five personality dataset.
First, we generated text embeddings using 3 LLM architectures (RoBERTa, BERT, and OpenAI) and trained a custom bidirectional long short-term memory model for personality prediction. We compared this approach against zero-shot inference using prompt-based methods. Second, we extracted psycholinguistic features (LIWC categories and National Research Council emotions) and performed feature engineering to evaluate potential performance enhancements. Third, we assessed the psychometric validity of LLM embeddings: reliability using Cronbach α and convergent validity by examining correlations between embeddings and established linguistic markers. Finally, we performed traditional feature engineering on static psycholinguistic features to assess performance under different settings. Results: Models trained on LLM embeddings using simple deep learning techniques significantly outperformed zero-shot approaches, by 45% on average across all personality traits. Although psychometric validation tests indicate moderate reliability, with an average Cronbach α of 0.63, correlation analyses revealed strong associations with key linguistic or emotional markers; openness correlates highly with social (r=0.53), conscientiousness with linguistic (r=0.46), extraversion with social (r=0.41), agreeableness with pronoun usage (r=0.40), and neuroticism with politics-related text (r=0.63). Despite advanced feature engineering on linguistic features, performance did not improve, suggesting that LLM embeddings inherently capture key linguistic features. Furthermore, our analyses demonstrated gains from larger model sizes, with a computational cost trade-off. Conclusions: Our findings demonstrate that LLM embeddings offer a robust alternative to zero-shot methods in personality trait analysis, capturing key linguistic patterns without requiring extensive feature engineering. The correlations with established psycholinguistic markers and the trade-off between performance and computational cost provide guidance for future computational linguistic work targeting LLMs for personality assessment. Further research should explore fine-tuning strategies to enhance psychometric validity. %R 10.2196/75347 %U https://www.jmir.org/2025/1/e75347 %U https://doi.org/10.2196/75347 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 4 %N %P e56891 %T Generative AI in Medicine: Pioneering Progress or Perpetuating Historical Inaccuracies? Cross-Sectional Study Evaluating Implicit Bias %A Sutera,Philip %A Bhatia,Rohini %A Lin,Timothy %A Chang,Leslie %A Brown,Andrea %A Jagsi,Reshma %K Artificial Intelligence %K generative artificial intelligence %K workforce diversity %K bias %K historical inequity %K social inequity %K implicit bias %K AI bias %D 2025 %7 24.6.2025 %9 %J JMIR AI %G English %X Background: Generative artificial intelligence (gAI) models, such as DALL-E 2, are promising tools that can generate novel images or artwork based on text input. However, caution is warranted, as these tools generate information based on historical data and are thus at risk of propagating past learned inequities. Women in medicine have routinely been underrepresented in academic and clinical medicine, and the stereotype of a male physician persists. Objective: The primary objective is to evaluate implicit bias among gAI across medical specialties. Methods: To evaluate for potential implicit bias, 100 photographs for each medical specialty were generated using the gAI platform DALL-E 2.
For each specialty, DALL-E2 was queried with “An American [specialty name].” Our primary endpoint was to compare the gender distribution of gAI photos to the current distribution in the United States. Our secondary endpoint included evaluating the racial distribution. gAI photos were classified according to perceived gender and race based on a unanimous consensus among a diverse group of medical residents. The proportion of gAI women subjects was compared for each medical specialty to the most recent Association of American Medical Colleges report for physician workforce and active residents using χ² analysis. Results: A total of 1900 photos across 19 medical specialties were generated. Compared to physician workforce data, AI significantly overrepresented women in 7/19 specialties and underrepresented women in 6/19 specialties. Women were significantly underrepresented compared to the physician workforce by 18%, 18%, and 27% in internal medicine, family medicine, and pediatrics, respectively. Compared to current residents, AI significantly underrepresented women in 12/19 specialties, ranging from 10% to 36%. Additionally, women represented <50% of the demographic for 17/19 specialties by gAI. Conclusions: gAI created a sample population of physicians that underrepresented women when compared to both the resident and active physician workforce. Steps must be taken to curate training datasets so that they represent the diversity of the incoming physician workforce. %R 10.2196/56891 %U https://ai.jmir.org/2025/1/e56891 %U https://doi.org/10.2196/56891 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e70450 %T Large Language Model–Assisted Risk-of-Bias Assessment in Randomized Controlled Trials Using the Revised Risk-of-Bias Tool: Evaluation Study %A Huang,Jiajie %A Lai,Honghao %A Zhao,Weilong %A Xia,Danni %A Bai,Chunyang %A Sun,Mingyao %A Liu,Jianing %A Liu,Jiayi %A Pan,Bei %A Tian,Jinhui %A Ge,Long %+ Department of Health Policy and Management, School of Public Health, Lanzhou University, No. 222 South Tianshui Road, Lanzhou, 730000, China, 86 13893192463, gelong2009@163.com %K systematic review %K large language models %K risk of bias 2 %K artificial intelligence %K efficiency %D 2025 %7 24.6.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: The revised Risk-of-Bias tool (RoB2) overcomes the limitations of its predecessor but introduces new implementation challenges. Studies demonstrate low interrater reliability and substantial time requirements for RoB2 implementation. Large language models (LLMs) may assist in RoB2 implementation, although their effectiveness remains uncertain. Objective: This study aims to evaluate the accuracy of LLMs in RoB2 assessments to explore their potential as research assistants for bias evaluation. Methods: We systematically searched the Cochrane Library (through October 2023) for reviews using RoB2, categorized by the effect of interest: adhering to intervention or assignment to intervention. From 86 eligible reviews of randomized controlled trials (covering 1399 RCTs), we randomly selected 46 RCTs (23 per category). In addition, 3 experienced reviewers independently assessed all 46 RCTs using RoB2, recording assessment time for each trial. Reviewer judgments were reconciled through consensus. Furthermore, 6 RCTs (3 from each category) were randomly selected for prompt development and optimization. The remaining 40 trials established the internal validation standard, while Cochrane Reviews judgments served as external validation.
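The validation split above sets up agreement analyses against two reference standards; a minimal sketch of how accuracy and Cohen κ are typically computed for such comparisons (hypothetical judgments, not the study data):

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical overall risk-of-bias judgments on the same trials
llm      = ["High", "Low", "Some concerns", "High", "Low"]
reviewer = ["High", "Low", "High",          "High", "Low"]

print("accuracy:", accuracy_score(reviewer, llm))
print("kappa:", round(cohen_kappa_score(reviewer, llm), 2))
```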
Primary outcomes were extracted as reported in corresponding Cochrane Reviews. We calculated accuracy rates, Cohen κ, and time differentials. Results: We identified significant differences between Cochrane and reviewer judgments, particularly in domains 1, 4, and 5, likely due to different standards in assessing randomization and blinding. Among the 20 articles focusing on adhering, 18 Cochrane Reviews and 19 reviewer judgments classified them as “High risk,” while assignment-focused RCTs showed more heterogeneous risk distribution. Compared with Cochrane Reviews, LLMs demonstrated accuracy rates of 57.5% and 70% for overall (assignment) and overall (adhering), respectively. When compared with reviewer judgments, LLMs’ accuracy rates were 65% and 70% for these domains. The average accuracy rates for the remaining 6 domains were 65.2% (95% CI 57.6-72.7) against Cochrane Reviews and 74.2% (95% CI 64.7-83.9) against reviewers. At the signaling question level, LLMs achieved 83.2% average accuracy (95% CI 77.5-88.9), with accuracy exceeding 70% for most questions except 2.4 (assignment), 2.5 (assignment), 3.3, and 3.4. When domain judgments were derived from LLM-generated signaling questions using the RoB2 algorithm rather than direct LLM domain judgments, accuracy improved substantially for Domain 2 (adhering; from 55% to 95%) and overall (adhering; from 70% to 90%). LLMs demonstrated high consistency between iterations (average 85.2%, 95% CI 85.15-88.79) and completed assessments in 1.9 minutes versus 31.5 minutes for human reviewers (mean difference 29.6, 95% CI 25.6-33.6 minutes). Conclusions: LLMs achieved commendable accuracy when guided by structured prompts, particularly when processing methodological details through structured reasoning. While not replacing human assessment, LLMs demonstrate strong potential for assisting RoB2 evaluations. Larger studies with improved prompting could enhance performance. %M 40554779 %R 10.2196/70450 %U https://www.jmir.org/2025/1/e70450 %U https://doi.org/10.2196/70450 %U http://www.ncbi.nlm.nih.gov/pubmed/40554779 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e72034 %T Assessment of Large Language Model Performance on Medical School Essay-Style Concept Appraisal Questions: Exploratory Study %A Mehta,Seysha %A Haddad,Eliot N %A Burke,Indira Bhavsar %A Majors,Alana K %A Maeda,Rie %A Burke,Sean M %A Deshpande,Abhishek %A Nowacki,Amy S %A Lindenmeyer,Christina C %A Mehta,Neil %K essay-type questions %K large language models %K generative AI %K Microsoft Copilot %K artificial intelligence %D 2025 %7 16.6.2025 %9 %J JMIR Med Educ %G English %X Bing Chat (subsequently renamed Microsoft Copilot)—a ChatGPT 4.0–based large language model—demonstrated comparable performance to medical students in answering essay-style concept appraisals, while assessors struggled to differentiate artificial intelligence (AI) responses from human responses. These results highlight the need to prepare students and educators for a future world of AI by fostering reflective learning practices and critical thinking.
%R 10.2196/72034 %U https://mededu.jmir.org/2025/1/e72034 %U https://doi.org/10.2196/72034 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e73052 %T Predictive Performance of Machine Learning for Suicide in Adolescents: Systematic Review and Meta-Analysis %A Liu,Lingjiang %A Li,Zhiyuan %A Hu,Yaxin %A Li,Chunyou %A He,Shuhan %A Zhang,Shibei %A Gao,Jie %A Zhu,Huaiyi %A Huang,Guoping %+ Sichuan Mental Health Center, Department of Psychiatry, The Third Hospital of Mianyang, No. 190, East Section of Jiannan Road, Mianyang, 621000, China, 86 18030990990, xyhuanggp@126.com %K machine learning %K predictive model %K meta-analysis %K suicide prediction %K adolescent mental health %K suicide prevention %D 2025 %7 16.6.2025 %9 Review %J J Med Internet Res %G English %X Background: In the context of escalating global mental health challenges, adolescent suicide has become a critical public health concern. In current clinical practices, considerable challenges are encountered in the early identification of suicide risk, as traditional assessment tools demonstrate limited predictive accuracy. Recent advancements in machine learning (ML) present promising solutions for risk prediction. However, comprehensive evaluations of their efficacy in adolescent populations remain insufficient. Objective: This study systematically assessed the performance of ML-based prediction models across various suicide-related behaviors in adolescents, aiming to establish an evidence-based foundation for the development of clinically applicable risk assessment tools. Methods: This review assessed ML for predicting adolescent suicide–related behaviors. PubMed, Embase, Cochrane, and Web of Science databases were rigorously searched until April 20, 2024, and the risk of bias of the included multivariate prediction models was assessed. The c-index was used as the primary outcome measure to conduct a meta-analysis on nonsuicidal self-injury (NSSI), suicidal ideation, suicide attempts, suicide attempts combined with suicidal ideation, and suicide attempts combined with NSSI, evaluating their accuracy in the validation set. Results: A total of 42 studies published from 2018 to 2024 were included, encompassing 104 distinct ML models and 1,408,375 adolescents aged 11 to 20 years. The pooled area under the receiver operating characteristic curve values for ML models in predicting NSSI, suicidal ideation, suicide attempts, suicide attempts combined with suicidal ideation, and suicide attempts combined with NSSI were 0.79 (95% CI 0.72-0.86), 0.77 (95% CI 0.71-0.83), 0.84 (95% CI 0.83-0.86), 0.82 (95% CI 0.79-0.84), and 0.75 (95% CI 0.73-0.76), respectively. The ML models demonstrated the highest pooled sensitivity for suicide attempt prediction, with a value of 0.80 (95% CI 0.75-0.84), and the highest pooled specificity for NSSI prediction, with a value of 0.96 (95% CI 0.94-0.99). Conclusions: Our findings suggest that ML techniques exhibit promising predictive performance for forecasting suicide risk in adolescents, particularly in predicting suicide attempts. Notably, ensemble methods, such as random forest and extreme gradient boosting, showed superior performance across multiple outcome types. However, this study has several limitations, including the predominance of internal validation methods employed in the included literature, with few studies employing external validation, which may limit the generalizability of the results.
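As background on the pooled values reported above, study-level estimates are commonly combined with a DerSimonian-Laird random-effects model; a compact sketch under the simplifying assumption that effects are pooled directly on the AUROC scale with known within-study variances (illustrative numbers, not the review's data):

```python
import numpy as np

def dersimonian_laird(y, v):
    """Random-effects pooling of effect estimates y with within-study variances v."""
    y, v = np.asarray(y, dtype=float), np.asarray(v, dtype=float)
    w = 1.0 / v                                 # fixed-effect weights
    y_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fixed) ** 2)          # Cochran Q statistic
    tau2 = max(0.0, (q - (len(y) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1.0 / (v + tau2)                     # random-effects weights
    pooled = np.sum(w_re * y) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se

# Hypothetical study-level AUROCs and within-study variances
print(dersimonian_laird([0.81, 0.77, 0.86, 0.79], [0.0004, 0.0009, 0.0003, 0.0006]))
```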
Future research should incorporate larger and more diverse datasets and conduct external validation to improve the prediction capability of these models, ultimately contributing to the development of ML-based adolescent suicide risk prediction tools. %M 40522723 %R 10.2196/73052 %U https://www.jmir.org/2025/1/e73052 %U https://doi.org/10.2196/73052 %U http://www.ncbi.nlm.nih.gov/pubmed/40522723 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e68872 %T The Machine Learning Models in Major Cardiovascular Adverse Events Prediction Based on Coronary Computed Tomography Angiography: Systematic Review %A Ma,Yuchen %A Li,Mohan %A Wu,Huiqun %+ Department of Medical Informatics, Medical School of Nantong University, Chongchuan District, 19 Qixiu Road, Nantong, 226001, China, 86 513 85051891, wuhuiqun@ntu.edu.cn %K coronary computed tomography angiography %K machine learning %K major adverse cardiovascular events %K radiomics %K plaque %D 2025 %7 13.6.2025 %9 Review %J J Med Internet Res %G English %X Background: Coronary computed tomography angiography (CCTA) has emerged as the first-line noninvasive imaging test for patients at high risk of coronary artery disease (CAD). When combined with machine learning (ML), it provides more valid evidence in diagnosing major adverse cardiovascular events (MACEs). Radiomics provides informative multidimensional features that can help identify high-risk populations and can improve the diagnostic performance of CCTA. However, its role in predicting MACEs remains highly debated. Objective: We evaluated the diagnostic value of ML models constructed using radiomic features extracted from CCTA in predicting MACEs, and compared the performance of different learning algorithms and models, thereby providing clinical recommendations for the diagnosis, treatment, and prognosis of MACEs. Methods: We comprehensively searched 5 online databases, Cochrane Library, Web of Science, Elsevier, CNKI, and PubMed, up to September 10, 2024, for original studies that used ML models among patients who underwent CCTA to predict MACEs and that reported related clinical outcomes and endpoints. Risk of bias in the ML models was assessed by the Prediction Model Risk of Bias Assessment Tool, while the radiomics quality score (RQS) was used to evaluate the methodological quality of the radiomics prediction model development and validation. We also followed the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) guidelines to ensure transparency of ML models included. Meta-analysis was performed using Meta-DiSc software (version 1.4), which included the I² score and Cochran Q test, along with StataMP 17 (StataCorp) to assess heterogeneity and publication bias. Due to the high heterogeneity observed, subgroup analysis was conducted based on different model groups. Results: Ten studies were included in the analysis, 5 (50%) of which differentiated between training and testing groups; the training sets comprised 17 models and the testing sets 26 models. The pooled area under the receiver operating characteristic (AUROC) curve for ML models predicting MACEs was 0.7879 in the training set and 0.7981 in the testing set. Logistic regression (LR), the most commonly used algorithm, achieved an AUROC of 0.8229 in the testing group and 0.7983 in the training group.
Non-LR models yielded AUROCs of 0.7390 in the testing set and 0.7648 in the training set, while the random forest (RF) models reached an AUROC of 0.8444 in the training group. Conclusions: Study limitations included a limited number of studies, high heterogeneity, and the types of included studies. The performance of ML models for predicting MACEs was found to be superior to that of general models based on basic feature extraction and integration from CCTA. Specifically, LR-based ML diagnostic models demonstrated significant clinical potential, particularly when combined with clinical features, and are worth further validation through more clinical trials. Trial Registration: PROSPERO CRD42024596364; https://www.crd.york.ac.uk/PROSPERO/view/CRD42024596364 %M 40513092 %R 10.2196/68872 %U https://www.jmir.org/2025/1/e68872 %U https://doi.org/10.2196/68872 %U http://www.ncbi.nlm.nih.gov/pubmed/40513092 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e73233 %T Enhancing the Accuracy of Human Phenotype Ontology Identification: Comparative Evaluation of Multimodal Large Language Models %A Zhong,Wei %A Sun,Mingyue %A Yao,Shun %A Liu,YiFan %A Peng,Dingchuan %A Liu,Yan %A Yang,Kai %A Gao,HuiMin %A Yan,HuiHui %A Hao,WenJing %A Yan,YouSheng %A Yin,ChengHong %K multimodal large language models %K ChatGPT %K rare diseases %K human phenotype ontology %K open-source LLMs %K large language model %D 2025 %7 2.6.2025 %9 %J J Med Internet Res %G English %X Background: Identifying Human Phenotype Ontology (HPO) terms is crucial for diagnosing and managing rare diseases. However, clinicians, especially junior physicians, often face challenges due to the complexity of describing patient phenotypes accurately. Traditional manual search methods using HPO databases are time-consuming and prone to errors. Objective: The aim of the study is to investigate whether the use of multimodal large language models (MLLMs) can improve the accuracy of junior physicians in identifying HPO terms from patient images related to rare diseases. Methods: In total, 20 junior physicians from 10 specialties participated. Each physician evaluated 27 patient images sourced from publicly available literature, with phenotypes relevant to rare diseases listed in the Chinese Rare Disease Catalogue. The study was divided into 2 groups: the manual search group relied on the Chinese Human Phenotype Ontology website, while the MLLM-assisted group used an electronic questionnaire that included HPO terms preidentified by ChatGPT-4o as prompts, followed by a search using the Chinese Human Phenotype Ontology. The primary outcome was the accuracy of HPO identification, defined as the proportion of correctly identified HPO terms compared to a standard set determined by an expert panel. Additionally, the accuracy of outputs from ChatGPT-4o and 2 open-source MLLMs (Llama3.2:11b and Llama3.2:90b) was evaluated using the same criteria, with hallucinations for each model documented separately. Furthermore, participating physicians completed an additional electronic questionnaire regarding their rare disease background to identify factors affecting their ability to accurately describe patient images using standardized HPO terms. Results: A total of 270 descriptions were evaluated per group. The MLLM-assisted group achieved a significantly higher accuracy rate of 67.4% (182/270) compared to 20.4% (55/270) in the manual group (relative risk 3.31, 95% CI 2.58‐4.25; P<.001). 
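The relative risk just quoted can be reproduced from the reported counts with the standard log-RR confidence interval; a worked sketch using those figures:

```python
import math

a, n1 = 182, 270  # correct HPO descriptions, MLLM-assisted group
c, n2 = 55, 270   # correct HPO descriptions, manual search group

rr = (a / n1) / (c / n2)
se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
lo = rr * math.exp(-1.96 * se_log_rr)
hi = rr * math.exp(1.96 * se_log_rr)
print(f"RR={rr:.2f} (95% CI {lo:.2f}-{hi:.2f})")  # RR=3.31 (95% CI 2.58-4.25)
```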
The MLLM-assisted group demonstrated consistent performance across departments, whereas the manual group exhibited greater variability. Among standalone MLLMs, ChatGPT-4o achieved an accuracy of 48% (13/27), while the open-source models Llama3.2:11b and Llama3.2:90b achieved 15% (4/27) and 18% (5/27), respectively. However, MLLMs exhibited a high hallucination rate, frequently generating HPO terms with incorrect IDs or entirely fabricated content. Specifically, ChatGPT-4o, Llama3.2:11b, and Llama3.2:90b generated incorrect IDs in 57.3% (67/117), 98% (62/63), and 82% (46/56) of cases, respectively, and fabricated terms in 34.2% (40/117), 41% (26/63), and 32% (18/56) of cases, respectively. Additionally, a survey on the rare disease knowledge of junior physicians suggests that participation in rare disease and genetic disease training may enhance the performance of some physicians. Conclusions: The integration of MLLMs into clinical workflows significantly enhances the accuracy of HPO identification by junior physicians, offering promising potential to improve the diagnosis of rare diseases and standardize phenotype descriptions in medical research. However, the notable hallucination rate observed in MLLMs underscores the necessity for further refinement and rigorous validation before widespread adoption in clinical practice. %R 10.2196/73233 %U https://www.jmir.org/2025/1/e73233 %U https://doi.org/10.2196/73233 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e73918 %T Evaluating User Interactions and Adoption Patterns of Generative AI in Health Care Occupations Using Claude: Cross-Sectional Study %A Alain,Gabriel %A Crick,James %A Snead,Ella %A Quatman-Yates,Catherine C %A Quatman,Carmen E %K communication %K humans %K artificial intelligence %K language %K machine learning %K self-management %K information-seeking behavior %K outcome assessment/health care %K referral and consultation %K patient care %K workflow %K patient participation %K health literacy %D 2025 %7 30.5.2025 %9 %J J Med Internet Res %G English %X Background: Generative artificial intelligence (GenAI) systems like Anthropic’s Claude and OpenAI’s ChatGPT are rapidly being adopted in various sectors, including health care, offering potential benefits for clinical support, administrative efficiency, and patient information access. However, real-world adoption patterns and the extent to which GenAI is used for health care–related tasks remain poorly understood and distinct from performance benchmarks in controlled settings. Understanding these organic usage patterns is key for assessing GenAI’s impact on health care delivery and patient-provider dynamics. Objective: This study aimed to quantify the real-world frequency and scope of health care–related tasks performed using Anthropic’s Claude GenAI. We sought to (1) measure the proportion of Claude interactions related to health care tasks versus other domains; (2) identify specific health care occupations (as per O*NET classifications) with high associated interaction volumes; (3) assess the breadth of task adoption within roles using a “digital adoption rate”; and (4) interpret these findings considering the inherent ambiguity regarding user identity (ie, professionals vs public) in the dataset. 
Methods: We performed a cross-sectional analysis of more than 4 million anonymized user conversations with Claude (ie, including both free and pro subscribers) from December 2024 to January 2025, using a publicly available dataset from Anthropic’s Economic Index research. Interactions were preclassified by Anthropic’s proprietary Clio model into standardized occupational tasks mapped to the US Department of Labor’s O*NET database. The dataset did not allow differentiation between health care professionals and the general public as users. We focused on interactions mapped to O*NET Healthcare Practitioners and Technical Occupations. Main outcomes included the proportion of interactions per health care occupation, proportion of overall health care interaction versus other categories, and the digital adoption rate (ie, distinct tasks performed via GenAI divided by the total possible tasks per occupation). Results: Health care–related tasks accounted for 2.58% of total analyzed GenAI conversations, significantly lower than domains such as computing (37.22%). Within health care, interaction frequency varied notably by role. Occupations emphasizing patient education and guidance exhibited the highest proportion, including dietitians and nutritionists (6.61% of health care conversations), nurse practitioners (5.63%), music therapists (4.54%), and clinical nurse specialists (4.53%). Digital adoption rates (task breadth) ranged widely across top health care roles (13.33%‐65%), averaging 16.92%, below the global average (21.13%). Tasks associated with medical records and health information technicians had the highest adoption rate (65.0%). Conclusions: GenAI tools are being adopted for a measurable subset of health care–related tasks, with usage concentrated in specific, often patient-facing roles. The critical limitation of user anonymity prevents definitive conclusions regarding whether usage primarily reflects patient information–seeking behavior (potentially driven by access needs) or professional workflow assistance. This ambiguity necessitates caution when interpreting current GenAI adoption. Our findings emphasize the urgent need for strategies addressing potential impacts on clinical workflows, patient decision-making, information quality, and health equity. Future research must aim to differentiate user types, while stakeholders should develop targeted guidance for both safe patient use and responsible professional integration. %R 10.2196/73918 %U https://www.jmir.org/2025/1/e73918 %U https://doi.org/10.2196/73918 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e70179 %T Attitudes Toward AI Usage in Patient Health Care: Evidence From a Population Survey Vignette Experiment %A Kühne,Simon %A Jacobsen,Jannes %A Legewie,Nicolas %A Dollmann,Jörg %+ , Institute of Sociology, University of Münster, Schlossplatz 2, Münster, 48149, Germany, 49 251 830, nlegewie@uni-muenster.de %K artificial intelligence %K trust %K public attitudes %K patient health care %K survey research %K vignette experiment %D 2025 %7 27.5.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: The integration of artificial intelligence (AI) holds substantial potential to alter diagnostics and treatment in health care settings. However, public attitudes toward AI, including trust and risk perception, are key to its ethical and effective adoption. 
Despite growing interest, empirical research on the factors shaping public support for AI in health care (particularly in large-scale, representative contexts) remains limited. Objective: This study aimed to investigate public attitudes toward AI in patient health care, focusing on how AI attributes (autonomy, costs, reliability, and transparency) shape perceptions of support, risk, and personalized care. In addition, it examines the moderating role of sociodemographic characteristics (gender, age, educational level, migration background, and subjective health status) in these evaluations. Our study offers novel insights into the relative importance of AI system characteristics for public attitudes and acceptance. Methods: We conducted a factorial vignette experiment with a probability-based survey of 3030 participants from Germany’s general population. Respondents were presented with hypothetical scenarios involving AI applications in diagnosis and treatment in a hospital setting. Linear regression models assessed the relative influence of AI attributes on the dependent variables (support, risk perception, and personalized care), with additional subgroup analyses to explore heterogeneity by sociodemographic characteristics. Results: Mean values between 4.2 and 4.4 on a 1-7 scale indicate a generally neutral to slightly negative stance toward AI integration in terms of general support, risk perception, and personalized care expectations, with responses spanning the full scale from strong support to strong opposition. Among the 4 dimensions, reliability emerges as the most influential factor (percentage of explained variance [EV] of up to 10.5%). Respondents expect AI to not only prevent errors but also exceed current reliability standards while strongly disapproving of nontraceable systems (transparency is another important factor, percentage of EV of up to 4%). Costs and autonomy play a comparatively minor role (percentage of EVs of up to 1.5% and 1.3%), with preferences favoring collaborative AI systems over autonomous ones, and higher costs generally leading to rejection. Heterogeneity analysis reveals limited sociodemographic differences, with education and migration background influencing attitudes toward transparency and autonomy, and gender differences primarily affecting cost-related perceptions. Overall, attitudes do not substantially differ between AI applications in diagnosis versus treatment. Conclusions: Our study fills a critical research gap by identifying the key factors that shape public trust and acceptance of AI in health care, particularly reliability, transparency, and patient-centered approaches. Our findings provide evidence-based recommendations for policy makers, health care providers, and AI developers to enhance trust and accountability, key concerns often overlooked in system development and real-world applications. The study highlights the need for targeted policy and educational initiatives to support the responsible integration of AI in patient care. 
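A note on the "percentage of explained variance" figures above: one simple estimator (an assumption here, not necessarily the authors' exact method) is the drop in R² when a vignette dimension is removed from the regression. A sketch on simulated data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 3000
names = ["autonomy", "costs", "reliability", "transparency"]
X = rng.integers(0, 2, size=(n, 4)).astype(float)  # simulated vignette dimensions
support = 4 + 1.2 * X[:, 2] + 0.6 * X[:, 3] + rng.normal(size=n)

def r2(features):
    return LinearRegression().fit(features, support).score(features, support)

full = r2(X)
for j, name in enumerate(names):
    delta = full - r2(np.delete(X, j, axis=1))  # ΔR² attributable to this dimension
    print(f"{name}: ~{100 * delta:.1f}% of variance")
```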
%M 40424613 %R 10.2196/70179 %U https://www.jmir.org/2025/1/e70179 %U https://doi.org/10.2196/70179 %U http://www.ncbi.nlm.nih.gov/pubmed/40424613 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e70068 %T Predictive Models Using Machine Learning to Identify Fetal Growth Restriction in Patients With Preeclampsia: Development and Evaluation Study %A Hua,Qing %A Yang,Fengchun %A Zhou,Yadan %A Shi,Fenglian %A You,Xiaoyan %A Guo,Jing %A Li,Li %+ Department of Obstetrics and Gynecology, Zhengzhou Central Hospital Affiliated to Zhengzhou University, No. 16, Tongbai North Road, Zhongyuan District, Zhengzhou, 450007, China, 86 13683816225, zzsylili@zzu.edu.cn %K machine learning %K fetal growth restriction %K preeclampsia %K random forest %K Shapley additive explanations %D 2025 %7 27.5.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Fetal growth restriction (FGR) is a common complication of preeclampsia. FGR in patients with preeclampsia increases the risk of neonatal-perinatal mortality and morbidity. However, previous prediction methods for FGR are class-biased or clinically unexplainable, which makes it difficult to apply to clinical practice, leading to a relative delay in intervention and a lack of effective treatments. Objective: The study aims to develop an auxiliary diagnostic model based on machine learning (ML) to predict the occurrence of FGR in patients with preeclampsia. Methods: This study used a retrospective case-control approach to analyze 38 features, including the basic medical history and peripheral blood laboratory test results of pregnant patients with preeclampsia, either complicated or not complicated by FGR. ML models were constructed to evaluate the predictive value of maternal parameter changes on preeclampsia combined with FGR. Multiple algorithms were tested, including logistic regression, light gradient boosting, random forest (RF), extreme gradient boosting, multilayer perceptron, naive Bayes, and support vector machine. The model performance was identified by the area under the curve (AUC) and other evaluation indexes. The Shapley additive explanations (SHAP) method was adopted to rank the feature importance and explain the final model for clinical application. Results: The RF model performed best in discriminative ability among the 7 ML models. After reducing features according to importance rank, an explainable final RF model was established with 9 features, including urinary protein quantification, gestational week of delivery, umbilical artery systolic-to-diastolic ratio, amniotic fluid index, triglyceride, D-dimer, weight, height, and maximum systolic pressure. The model could accurately predict FGR for 513 patients with preeclampsia (149 with FGR and 364 without FGR) in the training and testing dataset (AUC 0.83, SD 0.03) using 5-fold cross-validation, which was closely validated for 103 patients with preeclampsia (n=45 with FGR and n=58 without FGR) in an external dataset (AUC 0.82, SD 0.048). On the whole, urinary protein quantification, umbilical artery systolic-to-diastolic ratio, and gestational week of delivery exhibited the highest contributions to the model performance (c=0.45, 0.34, and 0.33) based on SHAP analysis. For specific individual patients, SHAP results reveal the protective and risk factors to develop FGR for interpreting the model’s clinical significance. Finally, the model has been translated into a convenient web page tool to facilitate its use in clinical settings. 
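A minimal sketch of the random forest plus SHAP workflow this abstract describes, with synthetic features standing in for the 9 clinical variables (the shape of shap's output varies across library versions, as noted in the comments):

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 9 clinical features in the final model
X, y = make_classification(n_samples=600, n_features=9, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
sv = explainer.shap_values(X)
# Older shap versions return a list per class; newer ones a 3D array
positive = sv[1] if isinstance(sv, list) else sv[..., 1]
print(np.abs(positive).mean(axis=0).round(3))  # mean |SHAP| per feature
```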
Conclusions: The study successfully developed a model that accurately predicts FGR development in patients with preeclampsia. The SHAP method captures highly relevant risk factors for model interpretation, alleviating concerns about the “black box” problem of ML techniques. %M 40424611 %R 10.2196/70068 %U https://www.jmir.org/2025/1/e70068 %U https://doi.org/10.2196/70068 %U http://www.ncbi.nlm.nih.gov/pubmed/40424611 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 9 %N %P e69838 %T A Data-Driven Approach to Assessing Hepatitis B Mother-to-Child Transmission Risk Prediction Model: Machine Learning Perspective %A Nguyen Tien,Dung %A Thi Thu Bui,Huong %A Hoang Thi Ngoc,Tram %A Thi Pham,Thuy %A Trung Nguyen,Dac %A Nguyen Thi Thu,Huyen %A Thu Hang Vu,Thi %A Lan Anh Luong,Thi %A Thu Hoang,Lan %A Cam Tu,Ho %A Körber,Nina %A Bauer,Tanja %A Khanh Ho,Lam %+ Department of Microbiology, Thai Nguyen University of Medicine and Pharmacy, 284 Lương Ngọc Quyến, Thái Nguyên, 250000, Vietnam, 84 912916863, huongbuithithu@tnmc.edu.vn %K chronic hepatitis B virus infection %K liver %K pregnant women %K cord blood %K PBMCs (peripheral blood mononuclear cells) %K ID3 (Iterative Dichotomiser 3) %K CART (classification and regression trees) %D 2025 %7 23.5.2025 %9 Original Paper %J JMIR Form Res %G English %X Background: Hepatitis B virus (HBV) can be transmitted from mother to child either through transplacental infection or via blood-to-blood contact during or immediately after delivery. Early and accurate risk assessments are essential for guiding clinical decisions and implementing effective preventive measures. Data mining techniques are powerful tools for identifying key predictors in medical diagnostics. Objective: This study aims to develop a robust predictive model for mother-to-child transmission (MTCT) of HBV using decision tree algorithms, specifically Iterative Dichotomiser 3 (ID3) and classification and regression trees (CART). The study identifies clinically and paraclinically relevant predictors, particularly hepatitis B e antigen (HBeAg) status and peripheral blood mononuclear cell (PBMC) concentration, for effective risk stratification and prevention. Additionally, we will assess the model’s reliability and generalizability through cross-validation with various training-test split ratios, aiming to enhance its applicability in clinical settings and inform improved preventive strategies against HBV MTCT. Methods: This study used decision tree algorithms—ID3 and CART—on a data set of 60 hepatitis B surface antigen (HBsAg)–positive pregnant women. Samples were collected either before or at the time of delivery, enabling the inclusion of patients who were undiagnosed or had limited access to treatment. We analyzed both clinical and paraclinical parameters, with a particular focus on HBeAg status and PBMC concentration. Additional biochemical markers were evaluated for their potential contributory or inhibitory effects on MTCT risk. The predictive models were validated using multiple training-test split ratios to ensure robustness and generalizability. Results: Our analysis showed that 20 out of 48 (based on a split ratio of 0.8 from a total of 60 cases, 42%) to 27 out of 57 (based on a split ratio of 0.95 from a total of 60 cases, 47%) training cases with HBeAg-positive status were associated with a significant risk of MTCT of HBV (χ²=21.16, df=8; P=.007).
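For orientation, CART is what scikit-learn's DecisionTreeClassifier implements, and an entropy criterion approximates ID3-style information-gain splits; a toy sketch with hypothetical HBeAg/PBMC features, not the study data:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical rows: [HBeAg positive (0/1), PBMC concentration (x10^6 cells/mL)]
X = [[1, 5.2], [1, 9.1], [0, 7.0], [0, 9.5], [1, 4.0], [0, 6.1]]
y = [1, 1, 0, 0, 1, 0]  # 1 = mother-to-child transmission

cart = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
cart.fit(X, y)
print(export_text(cart, feature_names=["HBeAg", "PBMC"]))
```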
Among HBeAg-negative women, those with PBMC concentrations ≥8 × 10⁶ cells/mL exhibited a low risk of MTCT, whereas individuals with PBMC concentrations <8 × 10⁶ cells/mL demonstrated a negligible risk. Across all training-test split ratios, the decision tree models consistently identified HBeAg status and PBMC concentration as the most influential predictors, underscoring their robustness and critical role in MTCT risk stratification. Conclusions: This study demonstrates that decision tree models are effective tools for stratifying the risk of MTCT of HBV by integrating key clinical and paraclinical markers. Among these, HBeAg status and PBMC concentration emerged as the most critical predictors. While the analysis focused on untreated patients, it provides a strong foundation for future investigations involving treated populations. These findings offer actionable insights to support the development of more targeted and effective HBV MTCT prevention strategies. %M 40409750 %R 10.2196/69838 %U https://formative.jmir.org/2025/1/e69838 %U https://doi.org/10.2196/69838 %U http://www.ncbi.nlm.nih.gov/pubmed/40409750 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e71654 %T Early Predictive Accuracy of Machine Learning for Hemorrhagic Transformation in Acute Ischemic Stroke: Systematic Review and Meta-Analysis %A Wang,Benqiao %A Jiang,Bohao %A Liu,Dan %A Zhu,Ruixia %+ Department of Neurology, First Hospital of China Medical University, No 155 Nanjing Bei Street, Heping District, Shenyang, 110001, China, 86 13504035936, zrx_200626313@163.com %K machine learning %K hemorrhagic transformation %K acute ischemic stroke %K predictive model %K meta-analysis %D 2025 %7 23.5.2025 %9 Review %J J Med Internet Res %G English %X Background: Hemorrhagic transformation (HT) is commonly detected in acute ischemic stroke (AIS) and often leads to poor outcomes. Currently, there is no ideal tool for early prediction of HT risk. Recently, machine learning has gained traction in stroke management, prompting the exploration of predictive models for HT. However, systematic evidence on these models is lacking. Objective: In this study, we assessed the predictive capability of machine learning models for HT risk in AIS, aiming to inform the development of HT prediction tools. Methods: We conducted a thorough search of medical databases, such as Web of Science, Embase, Cochrane, and PubMed up until March 2025. The risk of bias was determined through the Prediction Model Risk of Bias Assessment Tool (PROBAST). Subgroup analysis was performed based on treatment backgrounds, diagnostic criteria, and types of HT. Results: A total of 83 eligible articles were included, containing 106 models and 88,197 patients with AIS, including 9323 HT cases. There were 104 validation sets with a pooled c-index of 0.832 (95% CI 0.814-0.849), sensitivity of 0.82 (95% CI 0.79-0.84), and specificity of 0.78 (95% CI 0.74-0.81). Subgroup analysis indicated that the combined model achieved superior prediction accuracy. Moreover, we also analyzed the predictive performance of 6 mature models. Conclusions: Currently, although several prediction methods for HT have been developed, their predictive values are not satisfactory. Fortunately, our findings suggest that machine learning methods, particularly those combining clinical features and radiomics, hold promise for improving predictive accuracy. Our meta-analysis may provide evidence-based guidance for the subsequent development of more efficient clinical predictive models for HT.
Trial Registration: PROSPERO CRD42024498997; https://www.crd.york.ac.uk/PROSPERO/view/CRD42024498997 %M 40408765 %R 10.2196/71654 %U https://www.jmir.org/2025/1/e71654 %U https://doi.org/10.2196/71654 %U http://www.ncbi.nlm.nih.gov/pubmed/40408765 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e72197 %T Youth Perspectives on Generative AI and Its Use in Health Care %A Schaaff,Christian %A Bains,Manvir %A Davis,Sophie %A Amalraj,Trinity %A Frank,Abby %A Waselewski,Marika %A Chang,Tammy %A Wong,Andrew %K generative artificial intelligence %K medical informatics %K adolescent health %K health technology %K young adult %D 2025 %7 21.5.2025 %9 %J J Med Internet Res %G English %X A nationwide survey of youth aged 14 to 24 years on generative artificial intelligence (GAI) found that many youths are wary about the use of GAI in health care, suggesting that health professionals should acknowledge concerns about AI health tools and address them with adolescent patients as they become more pervasive. %R 10.2196/72197 %U https://jmir.org/2025/1/e72197 %U https://doi.org/10.2196/72197 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 12 %N %P e69709 %T A Comparison of Responses from Human Therapists and Large Language Model–Based Chatbots to Assess Therapeutic Communication: Mixed Methods Study %A Scholich,Till %A Barr,Maya %A Wiltsey Stirman,Shannon %A Raj,Shriti %+ Institute for Human-Centered AI, Stanford University, 353 Jane Stanford Way, Stanford, CA, 94305, United States, 1 650 723 2300, tills@umich.edu %K mental health %K large language models %K artificial intelligence therapy %K AI therapy %K large language model %K LLM %K therapists %K artificial intelligence %K AI %D 2025 %7 21.5.2025 %9 Original Paper %J JMIR Ment Health %G English %X Background: Consumers are increasingly using large language model–based chatbots to seek mental health advice or intervention due to ease of access and limited availability of mental health professionals. However, their suitability and safety for mental health applications remain underexplored, particularly in comparison to professional therapeutic practices. Objective: This study aimed to evaluate how general-purpose chatbots respond to mental health scenarios and compare their responses to those provided by licensed therapists. Specifically, we sought to identify chatbots’ strengths and limitations, as well as the ethical and practical considerations necessary for their use in mental health care. Methods: We conducted a mixed methods study to compare responses from chatbots and licensed therapists to scripted mental health scenarios. We created 2 fictional scenarios and prompted 3 chatbots to create 6 interaction logs. We recruited 17 therapists and conducted study sessions that consisted of 3 activities. First, therapists responded to the 2 scenarios using a Qualtrics form. Second, therapists went through the 6 interaction logs using a think-aloud procedure to highlight their thoughts about the chatbots’ responses. Finally, we conducted a semistructured interview to explore subjective opinions on the use of chatbots for supporting mental health. The study sessions were analyzed using thematic analysis. The interaction logs from chatbot and therapist responses were coded using the Multitheoretical List of Therapeutic Interventions codes and then compared to each other. Results: We identified 7 themes describing the strengths and limitations of the chatbots as compared to therapists. 
These include elements of good therapy in chatbot responses, conversational style of chatbots, insufficient inquiry and feedback seeking by chatbots, chatbot interventions, client engagement, chatbots’ responses to crisis situations, and considerations for chatbot-based therapy. In the use of Multitheoretical List of Therapeutic Interventions codes, we found that therapists evoked more elaboration (Mann-Whitney U=9; P=.001) and used more self-disclosure (U=45.5; P=.37) as compared to the chatbots. The chatbots used affirming (U=28; P=.045) and reassuring (U=23; P=.02) language more often than the therapists. The chatbots also used psychoeducation (U=22.5; P=.02) and suggestions (U=12.5; P=.003) more often than the therapists. Conclusions: Our study demonstrates the unsuitability of general-purpose chatbots to safely engage in mental health conversations, particularly in crisis situations. While chatbots display elements of good therapy, such as validation and reassurance, overuse of directive advice without sufficient inquiry and use of generic interventions make them unsuitable as therapeutic agents. Careful research and evaluation will be necessary to determine the impact of chatbot interactions and to identify the most appropriate use cases related to mental health. %M 40397927 %R 10.2196/69709 %U https://mental.jmir.org/2025/1/e69709 %U https://doi.org/10.2196/69709 %U http://www.ncbi.nlm.nih.gov/pubmed/40397927 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e73601 %T Identifying Disinformation on the Extended Impacts of COVID-19: Methodological Investigation Using a Fuzzy Ranking Ensemble of Natural Language Processing Models %A Chen,Jian-An %A Chung,Wu-Chun %A Hung,Che-Lun %A Wu,Chun-Ying %+ , Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, No 155, Sec 2, Linong St Beitou Dist, Taipei, 112, Taiwan, 886 2 2826 7349, clhung@nycu.edu.tw %K misinformation %K COVID-19 %K ensemble models %K fuzzy ranks %K language model %D 2025 %7 21.5.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: During the COVID-19 pandemic, the continuous spread of misinformation on the internet posed an ongoing threat to public trust and understanding of epidemic prevention policies. Although the pandemic is now under control, information regarding the risks of long-term COVID-19 effects and reinfection still needs to be integrated into COVID-19 policies. Objective: This study aims to develop a robust and generalizable deep learning framework for detecting misinformation related to the prolonged impacts of COVID-19 by integrating pretrained language models (PLMs) with an innovative fuzzy rank-based ensemble approach. Methods: A comprehensive dataset comprising 566 genuine and 2361 fake samples was curated from reliable open sources and processed using advanced techniques. The dataset was randomly split using the scikit-learn package to facilitate both training and evaluation. Deep learning models were trained for 20 epochs on a Tesla T4 for hierarchical attention networks (HANs) and an RTX A5000 (for the other models). To enhance performance, we implemented an ensemble learning strategy that incorporated a reparameterized Gompertz function, which assigned fuzzy ranks based on each model’s prediction confidence for each test case. 
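A hedged sketch of fuzzy rank-based fusion of classifier confidences: the study's reparameterized Gompertz function is its own, but one published variant maps a confidence c to the fuzzy rank 1 − exp(−exp(−2c)), sums the ranks across models, and picks the class with the lowest fused score:

```python
import numpy as np

def fuzzy_rank(conf):
    """Gompertz-style fuzzy rank: higher confidence maps closer to 0 (assumed form)."""
    return 1.0 - np.exp(-np.exp(-2.0 * conf))

# Hypothetical softmax confidences for one test case: 3 models x 2 classes
confidences = np.array([
    [0.91, 0.09],
    [0.75, 0.25],
    [0.60, 0.40],
])
fused = fuzzy_rank(confidences).sum(axis=0)       # fuse ranks across models
print("predicted class:", int(np.argmin(fused)))  # lowest fused rank wins
```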
This method effectively fused outputs from state-of-the-art PLMs such as robustly optimized bidirectional encoder representations from transformers pretraining approach (RoBERTa), decoding-enhanced bidirectional encoder representations from transformers with disentangled attention (DeBERTa), and XLNet. Results: After training on the dataset, various classification methods were evaluated on the test set, including the fuzzy rank-based method and state-of-the-art large language models. Experimental results reveal that language models, particularly XLNet, outperform traditional approaches that combine term frequency–inverse document frequency features with support vector machine or utilize deep models like HAN. The evaluation metrics—including accuracy, precision, recall, F1-score, and area under the curve (AUC)—indicated a clear performance advantage for models that had a larger number of parameters. However, this study also highlights that model architecture, training procedures, and optimization techniques are critical determinants of classification effectiveness. XLNet’s permutation language modeling approach enhances bidirectional context understanding, allowing it to surpass even larger models in the bidirectional encoder representations from transformers (BERT) series despite having relatively fewer parameters. Notably, the fuzzy rank-based ensemble method, which combines multiple language models, achieved impressive results on the test set, with an accuracy of 93.52%, a precision of 94.65%, an F1-score of 96.03%, and an AUC of 97.15%. Conclusions: The fusion of ensemble learning with PLMs and the Gompertz function, employing fuzzy rank-based methodology, introduces a novel prediction approach with prospects for enhancing accuracy and reliability. Additionally, the experimental results imply that training solely on textual content can yield high prediction accuracy, thereby providing valuable insights into the optimization of fake news detection systems. These findings not only aid in detecting misinformation but also have broader implications for the application of advanced deep learning techniques in public health policy and communication. %M 40397945 %R 10.2196/73601 %U https://www.jmir.org/2025/1/e73601 %U https://doi.org/10.2196/73601 %U http://www.ncbi.nlm.nih.gov/pubmed/40397945 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e70537 %T Early Prediction of Mortality Risk in Acute Respiratory Distress Syndrome: Systematic Review and Meta-Analysis %A Tan,Ruimin %A Ge,Chen %A Li,Zhe %A Yan,Yating %A Guo,He %A Song,Wenjing %A Zhu,Qiong %A Du,Quansheng %+ Critical Care Department, Hebei General Hospital, No. 348 Heping West Road, Xinhua District, Shijiazhuang, 050051, China, 86 13230163769, dqs888@126.com %K acute respiratory distress syndrome %K mortality %K machine learning %K systematic review %K meta-analysis %D 2025 %7 20.5.2025 %9 Review %J J Med Internet Res %G English %X Background: Acute respiratory distress syndrome (ARDS) is a life-threatening condition associated with high mortality rates. Despite advancements in critical care, reliable early prediction methods for ARDS-related mortality remain elusive. Accurate risk assessment is crucial for timely intervention and improved patient outcomes. Machine learning (ML) techniques have emerged as promising tools for mortality prediction in patients with ARDS, leveraging complex clinical datasets to identify key prognostic factors. However, the efficacy of ML-based models remains uncertain. 
This systematic review aims to assess the value of ML models in the early prediction of ARDS mortality risk and to provide evidence supporting the development of simplified, clinically applicable ML-based scoring tools for prognosis. Objective: This study systematically reviewed available literature on ML-based ARDS mortality prediction models, primarily aiming to evaluate the predictive performance of these models and compare their efficacy with conventional scoring systems. It also sought to identify limitations and provide insights for improving future ML-based prediction tools. Methods: A comprehensive literature search was conducted across PubMed, Web of Science, the Cochrane Library, and Embase, covering publications from inception to April 27, 2024. Studies developing or validating ML-based ARDS mortality predicting models were considered for inclusion. The methodological quality and risk of bias were assessed using the Prediction Model Risk of Bias Assessment Tool (PROBAST). Subgroup analyses were performed to explore heterogeneity in model performance based on dataset characteristics and validation approaches. Results: In total, 21 studies involving a total of 31,291 patients with ARDS were included. The meta-analysis demonstrated that ML models achieved relatively high predictive performance. In the training datasets, the pooled concordance index (C-index) was 0.84 (95% CI 0.81-0.86), while for in-hospital mortality prediction, the pooled C-index was 0.83 (95% CI 0.81-0.86). In the external validation datasets, the pooled C-index was 0.81 (95% CI 0.78-0.84), and the corresponding value for in-hospital mortality prediction was 0.80 (95% CI 0.77-0.84). ML models outperformed traditional scoring tools, which demonstrated lower predictive performance. The pooled area under the receiver operating characteristic curve (ROC-AUC) for standard scoring systems was 0.7 (95% CI 0.67-0.72). Specifically, 2 widely used clinical scoring systems, the Sequential Organ Failure Assessment (SOFA) and Simplified Acute Physiology Score II (SAPS-II), demonstrated ROC-AUCs of 0.64 (95% CI 0.62-0.67) and 0.70 (95% CI 0.66-0.74), respectively. Conclusions: ML-based models exhibited superior predictive accuracy over conventional scoring tools, suggesting their potential use in early ARDS mortality risk assessment. However, further research is needed to refine these models, improve their interpretability, and enhance their clinical applicability. Future efforts should focus on developing simplified, efficient, and user-friendly ML-based prediction tools that integrate seamlessly into clinical workflows. Such advancements may facilitate the early identification of high-risk patients, enabling timely interventions and personalized, risk-based prevention strategies to improve ARDS outcomes. 
%M 40392588 %R 10.2196/70537 %U https://www.jmir.org/2025/1/e70537 %U https://doi.org/10.2196/70537 %U http://www.ncbi.nlm.nih.gov/pubmed/40392588 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 13 %N %P e66917 %T Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study %A Omar,Mahmud %A Agbareia,Reem %A Glicksberg,Benjamin S %A Nadkarni,Girish N %A Klang,Eyal %K safe AI %K artificial intelligence %K AI %K algorithm %K large language model %K LLM %K natural language processing %K NLP %K deep learning %D 2025 %7 16.5.2025 %9 %J JMIR Med Inform %G English %X Background: The capabilities of large language models (LLMs) to self-assess their own confidence in answering questions within the biomedical realm remain underexplored. Objective: This study evaluates the confidence levels of 12 LLMs across 5 medical specialties to assess LLMs’ ability to accurately judge their own responses. Methods: We used 1965 multiple-choice questions that assessed clinical knowledge in the following areas: internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and to also provide their confidence for the correct answers (score: range 0%‐100%). We calculated the correlation between each model’s mean confidence score for correct answers and the overall accuracy of each model across all questions. The confidence scores for correct and incorrect answers were also analyzed to determine the mean difference in confidence, using 2-sample, 2-tailed t tests. Results: The correlation between the mean confidence scores for correct answers and model accuracy was inverse and statistically significant (r=−0.40; P=.001), indicating that worse-performing models exhibited paradoxically higher confidence. For instance, a top-performing model—GPT-4o—had a mean accuracy of 74% (SD 9.4%), with a mean confidence of 63% (SD 8.3%), whereas a low-performing model—Qwen2-7B—showed a mean accuracy of 46% (SD 10.5%) but a mean confidence of 76% (SD 11.7%). The mean difference in confidence between correct and incorrect responses was low for all models, ranging from 0.6% to 5.4%, with GPT-4o having the highest mean difference (5.4%, SD 2.3%; P=.003). Conclusions: Better-performing LLMs show more aligned overall confidence levels. However, even the most accurate models still show minimal variation in confidence between right and wrong answers. This may limit their safe use in clinical settings. Addressing overconfidence could involve refining calibration methods, performing domain-specific fine-tuning, and involving human oversight when decisions carry high risks. Further research is needed to improve these strategies before broader clinical adoption of LLMs. 
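The two analyses in the abstract above, the confidence-accuracy correlation across models and the correct-versus-incorrect confidence gap within a model, reduce to a few lines; a sketch with hypothetical numbers, not the benchmark data:

```python
import numpy as np
from scipy import stats

# Hypothetical per-model pairs: (accuracy, mean confidence on correct answers)
models = np.array([(0.74, 0.63), (0.46, 0.76), (0.61, 0.70),
                   (0.55, 0.72), (0.68, 0.66)])
r, p = stats.pearsonr(models[:, 0], models[:, 1])
print(f"accuracy vs confidence: r={r:.2f}, P={p:.3f}")

# Within one model: confidence on correct vs incorrect answers (0-100 scale)
correct = np.array([80.0, 75.0, 90.0, 70.0, 85.0])
wrong = np.array([78.0, 72.0, 88.0, 69.0, 80.0])
t, p = stats.ttest_ind(correct, wrong)
print(f"mean difference: {correct.mean() - wrong.mean():.1f} points, P={p:.2f}")
```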
%R 10.2196/66917 %U https://medinform.jmir.org/2025/1/e66917 %U https://doi.org/10.2196/66917 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e70733 %T Identification of Online Health Information Using Large Pretrained Language Models: Mixed Methods Study %A Tan,Dongmei %A Huang,Yi %A Liu,Ming %A Li,Ziyu %A Wu,Xiaoqian %A Huang,Cheng %+ , College of Medical Informatics, Chongqing Medical University, 1 Yixueyuan Road Yuzhong District, Chongqing, 400016, China, 1 023 6848 0060, huangcheng@cqmu.edu.cn %K large pretrained language models %K online health information %K information identification %K text similarity analysis %K performance evaluation %K ChatGPT %K text generation %K latent Dirichlet allocation %K artificial intelligence %D 2025 %7 14.5.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Online health information is widely available, but a substantial portion of it is inaccurate or misleading, including exaggerated, incomplete, or unverified claims. Such misinformation can significantly influence public health decisions and pose serious challenges to health care systems. With advances in artificial intelligence and natural language processing, pretrained large language models (LLMs) have shown promise in identifying and distinguishing misleading health information, although their effectiveness in this area remains underexplored. Objective: This study aimed to evaluate the performance of 4 mainstream LLMs (ChatGPT-3.5, ChatGPT-4, Ernie Bot, and iFLYTEK Spark) in the identification of online health information, providing empirical evidence for their practical application in this field. Methods: Web scraping was used to collect data from rumor-refuting websites, resulting in 2708 samples of online health information, including both true and false claims. The 4 LLMs’ application programming interfaces were used for authenticity verification, with expert results as benchmarks. Model performance was evaluated using semantic similarity, accuracy, recall, F1-score, content analysis, and credibility. Results: This study found that the 4 models performed well in identifying online health information. Among them, ChatGPT-4 achieved the highest accuracy at 87.27%, followed by Ernie Bot at 87.25%, iFLYTEK Spark at 87%, and ChatGPT-3.5 at 81.82%. Furthermore, text length and semantic similarity analysis showed that Ernie Bot had the highest similarity to expert texts, whereas ChatGPT-4 showed good overall consistency in its explanations. In addition, the credibility assessment results indicated that ChatGPT-4 provided the most reliable evaluations. Further analysis suggested that the highest misjudgment probabilities with respect to the LLMs occurred within the topics of food and maternal-infant nutrition management and nutritional science and food controversies. Overall, the research suggests that LLMs have potential in online health information identification; however, their understanding of certain specialized health topics may require further improvement. Conclusions: The results demonstrate that, while these models show potential in providing assistance, their performance varies significantly in terms of accuracy, semantic understanding, and cultural adaptability. The principal findings highlight the models’ ability to generate accessible and context-aware explanations; however, they fall short in areas requiring specialized medical knowledge or updated data, particularly for emerging health issues and context-sensitive scenarios. 
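The semantic similarity analysis mentioned above is typically a cosine similarity between text representations; a minimal TF-IDF sketch (the study's exact representation is not specified here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

expert = "This claim is false: vitamin megadoses do not cure influenza."
model_output = "The statement is inaccurate; large vitamin doses cannot cure flu."

vectors = TfidfVectorizer().fit_transform([expert, model_output])
print(f"cosine similarity: {cosine_similarity(vectors[0], vectors[1])[0, 0]:.2f}")
```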
Significant discrepancies were observed in the models’ ability to distinguish scientifically verified knowledge from popular misconceptions and in their stability when processing complex linguistic and cultural contexts. These challenges reveal the importance of refining training methodologies to improve the models’ reliability and adaptability. Future research should focus on enhancing the models’ capability to manage nuanced health topics and diverse cultural and linguistic nuances, thereby facilitating their broader adoption as reliable tools for online health information identification. %M 40367512 %R 10.2196/70733 %U https://www.jmir.org/2025/1/e70733 %U https://doi.org/10.2196/70733 %U http://www.ncbi.nlm.nih.gov/pubmed/40367512 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e67253 %T Large Language Models and Artificial Neural Networks for Assessing 1-Year Mortality in Patients With Myocardial Infarction: Analysis From the Medical Information Mart for Intensive Care IV (MIMIC-IV) Database %A Shi,Boqun %A Chen,Liangguo %A Pang,Shuo %A Wang,Yue %A Wang,Shen %A Li,Fadong %A Zhao,Wenxin %A Guo,Pengrong %A Zhang,Leli %A Fan,Chu %A Zou,Yi %A Wu,Xiaofan %+ Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, No. 2, Anzhen Road, Chaoyang District, Beijing, Beijing, China, 86 01084005591, drwuxf@163.com %K artificial neural network %K large language model %K myocardial infarction %K prediction model %K risk assessment %D 2025 %7 12.5.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Accurate mortality risk prediction is crucial for effective cardiovascular risk management. Recent advancements in artificial intelligence (AI) have demonstrated potential in this specific medical field. Qwen-2 and Llama-3 are high-performance, open-source large language models (LLMs) available online. An artificial neural network (ANN) algorithm derived from the SWEDEHEART (Swedish Web System for Enhancement and Development of Evidence-Based Care in Heart Disease Evaluated According to Recommended Therapies) registry, termed SWEDEHEART-AI, can predict patient prognosis following acute myocardial infarction (AMI). Objective: This study aims to evaluate the 3 models mentioned above in predicting 1-year all-cause mortality in critically ill patients with AMI. Methods: The Medical Information Mart for Intensive Care IV (MIMIC-IV) database is a publicly available data set in critical care medicine. We included 2758 patients who were first admitted for AMI and discharged alive. SWEDEHEART-AI calculated the mortality rate based on each patient’s 21 clinical variables. Qwen-2 and Llama-3 analyzed the content of patients’ discharge records and directly provided a 1-decimal value between 0 and 1 to represent 1-year death risk probabilities. The patients’ actual mortality was verified using follow-up data. The predictive performance of the 3 models was assessed and compared using the Harrell C-statistic (C-index), the area under the receiver operating characteristic curve (AUROC), calibration plots, Kaplan-Meier curves, and decision curve analysis. Results: SWEDEHEART-AI demonstrated strong discrimination in predicting 1-year all-cause mortality in patients with AMI, with a higher C-index than Qwen-2 and Llama-3 (C-index 0.72, 95% CI 0.69-0.74 vs C-index 0.65, 0.62-0.67 vs C-index 0.56, 95% CI 0.53-0.58, respectively; all P<.001 for both comparisons). SWEDEHEART-AI also showed high and consistent AUROC in the time-dependent ROC curve. 
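For reference, Harrell C-index comparisons like those above can be computed with the lifelines package; a toy sketch with hypothetical follow-up data:

```python
from lifelines.utils import concordance_index

# Hypothetical 1-year follow-up: time (days), event (1 = died), predicted risk
times = [365, 120, 365, 40, 365, 300]
events = [0, 1, 0, 1, 0, 1]
risk = [0.10, 0.80, 0.20, 0.90, 0.15, 0.40]

# concordance_index expects higher scores for longer survival, so negate risk
print(f"C-index: {concordance_index(times, [-r for r in risk], events):.2f}")
```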
The death rates calculated by SWEDEHEART-AI were positively correlated with actual mortality, and the 3 risk classes derived from this model showed clear differentiation in the Kaplan-Meier curve (P<.001). Calibration plots indicated that SWEDEHEART-AI tended to overestimate mortality risk, with an observed-to-expected ratio of 0.478. Compared with the LLMs, SWEDEHEART-AI demonstrated positive and greater net benefits at risk thresholds below 19%. Conclusions: SWEDEHEART-AI, a trained ANN model, demonstrated the best performance, with strong discrimination and clinical utility in predicting 1-year all-cause mortality in patients with AMI from an intensive care cohort. Among the LLMs, Qwen-2 outperformed Llama-3 and showed moderate predictive value. Qwen-2 and SWEDEHEART-AI exhibited comparable classification effectiveness. The future integration of LLMs into clinical decision support systems holds promise for accurate risk stratification in patients with AMI; however, further research is needed to optimize LLM performance and address calibration issues across diverse patient populations. %M 40354652 %R 10.2196/67253 %U https://www.jmir.org/2025/1/e67253 %U https://doi.org/10.2196/67253 %U http://www.ncbi.nlm.nih.gov/pubmed/40354652 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e67830 %T Comparing Artificial Intelligence–Generated and Clinician-Created Personalized Self-Management Guidance for Patients With Knee Osteoarthritis: Blinded Observational Study %A Du,Kai %A Li,Ao %A Zuo,Qi-Heng %A Zhang,Chen-Yu %A Guo,Ren %A Chen,Ping %A Du,Wei-Shuai %A Li,Shu-Ming %+ , Beijing Hospital of Traditional Chinese Medicine, 23 Meishuguan Houjie, Dongcheng District, Beijing, 100010, China, 86 13810986862, lishuming@bjzhongyi.com %K artificial intelligence in health care %K large language models %K knee osteoarthritis %K self-management %K personalized medicine %K patient education %K artificial intelligence %K AI-generated %K knee %K osteoarthritis %K observational study %K GPT-4 %K ChatGPT %K LLMs %K orthopedics %D 2025 %7 7.5.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Knee osteoarthritis is a prevalent, chronic musculoskeletal disorder that impairs mobility and quality of life. Personalized patient education aims to improve self-management and adherence; yet, its delivery is often limited by time constraints, clinician workload, and the heterogeneity of patient needs. Recent advances in large language models offer potential solutions. GPT-4 (OpenAI), distinguished by its long-context reasoning and adoption in clinical artificial intelligence research, emerged as a leading candidate for personalized health communication. However, its application in generating condition-specific educational guidance remains underexplored, and concerns about misinformation, personalization limits, and ethical oversight remain. Objective: We evaluated GPT-4’s ability to generate individualized self-management guidance for patients with knee osteoarthritis in comparison with clinician-created content. Methods: This 2-phase, double-blind, observational study used data from 50 patients previously enrolled in a registered randomized trial. In phase 1, 2 orthopedic clinicians each generated personalized education materials for 25 patient profiles using anonymized clinical data, including history, symptoms, and lifestyle. In phase 2, the same datasets were processed by GPT-4 using standardized prompts. 
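The phase-2 generation step just described can be sketched with the OpenAI Python SDK. The prompt wording, model name, and temperature below are assumptions for illustration; the abstract does not publish the study's standardized prompt or parameters.

```python
# Hypothetical reconstruction of the phase-2 generation step. Uses the
# OpenAI Python SDK (v1.x); requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def generate_guidance(profile: str) -> str:
    """Generate self-management guidance from an anonymized patient profile."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are a clinician writing personalized "
                        "self-management guidance for knee osteoarthritis."},
            {"role": "user", "content": profile},
        ],
        temperature=0.2,  # illustrative; the study's setting is not reported
    )
    return response.choices[0].message.content

print(generate_guidance("67-year-old, BMI 31, bilateral knee pain, sedentary."))
```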
All content was anonymized and evaluated by 2 independent, blinded clinical experts using validated scoring systems. Evaluation criteria included efficiency, readability (Flesch-Kincaid, Gunning Fog, Coleman-Liau, and Simple Measure of Gobbledygook), accuracy, personalization, comprehensiveness, and safety. Disagreements between reviewers were resolved through consensus or third-party adjudication. Results: GPT-4 outperformed clinicians in content generation speed (530.03 vs 37.29 words per min, P<.001). GPT-4’s readability was better on the Flesch-Kincaid (mean 11.56, SD 1.08 vs mean 12.67, SD 0.95), Gunning Fog (mean 12.47, SD 1.36 vs mean 14.56, SD 0.93), and Simple Measure of Gobbledygook (mean 13.33, SD 1.00 vs mean 13.81, SD 0.69) indices (all P<.001), though GPT-4 scored slightly higher on the Coleman-Liau Index (mean 15.90, SD 1.03 vs mean 15.15, SD 0.91). GPT-4 also outperformed clinicians in accuracy (mean 5.31, SD 1.73 vs mean 4.76, SD 1.10; P=.05), personalization (mean 54.32, SD 6.21 vs mean 33.20, SD 5.40; P<.001), comprehensiveness (mean 51.74, SD 6.47 vs mean 35.26, SD 6.66; P<.001), and safety (median 61, IQR 58-66 vs median 50, IQR 47-55.25; P<.001). Conclusions: GPT-4 could generate personalized self-management guidance for knee osteoarthritis with greater efficiency, accuracy, personalization, comprehensiveness, and safety than clinician-generated content, as assessed using standardized, guideline-aligned evaluation frameworks. These findings underscore the potential of large language models to support scalable, high-quality patient education in chronic disease management. The observed lexical complexity suggests the need to refine outputs for populations with limited health literacy. As an exploratory, single-center study, these results warrant confirmation in larger, multicenter cohorts with diverse demographic profiles. Future implementation should be guided by ethical and operational safeguards, including data privacy, transparency, and the delineation of clinical responsibility. Hybrid models integrating artificial intelligence–generated content with clinician oversight may offer a pragmatic path forward. %M 40332991 %R 10.2196/67830 %U https://www.jmir.org/2025/1/e67830 %U https://doi.org/10.2196/67830 %U http://www.ncbi.nlm.nih.gov/pubmed/40332991 %0 Journal Article %@ 2369-1999 %I JMIR Publications %V 11 %N %P e62833 %T Association Between Risk Factors and Major Cancers: Explainable Machine Learning Approach %A Huang,Xiayuan %A Ren,Shushun %A Mao,Xinyue %A Chen,Sirui %A Chen,Elle %A He,Yuqi %A Jiang,Yun %K electronic health record %K EHR %K cancer risk modeling %K risk factor analysis %K explainable machine learning %K machine learning %K ML %K risk factor %K major cancers %K monitoring %K cancer risk %K breast cancer %K colorectal cancer %K lung cancer %K prostate cancer %K cancer patients %K clinical decision-making %D 2025 %7 2.5.2025 %9 %J JMIR Cancer %G English %X Background: Cancer is a life-threatening disease and a leading cause of death worldwide, with an estimated 611,000 deaths and over 2 million new cases in the United States in 2024. The rising incidence of major cancers, including among younger individuals, highlights the need for early screening and monitoring of risk factors to manage and decrease cancer risk. Objective: This study aimed to leverage explainable machine learning models to identify and analyze the key risk factors associated with breast, colorectal, lung, and prostate cancers. 
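For context on the readability results reported above for the knee osteoarthritis study, the Flesch-Kincaid grade level is a closed-form formula over word, sentence, and syllable counts. The sketch below uses a naive vowel-group syllable counter, so it only approximates published calculators.

```python
# The Flesch-Kincaid grade level is a simple formula:
# 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59.
# Naive syllable counting (vowel groups) is an approximation; published
# tools use dictionaries and more careful heuristics.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

sample = ("Apply an ice pack to the knee for fifteen minutes after exercise. "
          "Gentle daily walking helps maintain joint mobility.")
print(round(flesch_kincaid_grade(sample), 1))
```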
By uncovering significant associations between risk factors and these major cancer types, we sought to enhance the understanding of cancer diagnosis risk profiles. Our goal was to facilitate more precise screening, early detection, and personalized prevention strategies, ultimately contributing to better patient outcomes and promoting health equity. Methods: Deidentified electronic health record data from Medical Information Mart for Intensive Care (MIMIC)–III were used to identify patients with 4 types of cancer who had longitudinal hospital visits prior to their diagnosis. Their records were matched and combined with those of patients without cancer diagnoses using propensity scores based on demographic factors. Three models (penalized logistic regression, random forest, and multilayer perceptron [MLP]) were applied to rank risk factors for each cancer type, with feature importance analysis for the random forest and MLP models. Rank-biased overlap was used to compare the similarity of the ranked risk factors across cancer types. Results: Our framework evaluated the prediction performance of explainable machine learning models, with the MLP model demonstrating the best performance. It achieved an area under the receiver operating characteristic curve of 0.78 for breast cancer (n=58), 0.76 for colorectal cancer (n=140), 0.84 for lung cancer (n=398), and 0.78 for prostate cancer (n=104), outperforming other baseline models (P<.001). In addition to demographic risk factors, the most prominent nontraditional risk factors overlapped across models and cancer types, including hyperlipidemia (odds ratio [OR] 1.14, 95% CI 1.11‐1.17; P<.01), diabetes (OR 1.34, 95% CI 1.29‐1.39; P<.01), depressive disorders (OR 1.11, 95% CI 1.06‐1.16; P<.01), heart diseases (OR 1.42, 95% CI 1.32‐1.52; P<.01), and anemia (OR 1.22, 95% CI 1.14‐1.30; P<.01). The similarity analysis indicated that lung cancer had a risk factor pattern distinct from the other cancer types. Conclusions: The study’s findings demonstrated the effectiveness of explainable ML models in assessing nontraditional risk factors for major cancers and highlighted the importance of considering unique risk profiles for different cancer types. Moreover, this research served as a hypothesis-generating foundation, providing preliminary results for future investigation into cancer diagnosis risk analysis and management. Furthermore, expanding collaboration with clinical experts for external validation would be essential to refine model outputs, integrate findings into practice, and enhance their impact on patient care and cancer prevention efforts. 
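The propensity-score matching step in the cancer risk-factor study above can be illustrated as follows: fit a logistic model of case status on demographics, then pair each case with the control whose score is nearest. The data are synthetic, and 1-nearest-neighbor matching with replacement is an assumption; the abstract does not specify the matching algorithm.

```python
# Sketch of propensity-score matching: estimate each patient's probability
# of being a cancer case from demographics, then match each case to the
# control with the nearest score (with replacement, for simplicity).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(40, 90, 500),          # age
                     rng.integers(0, 2, 500)])           # sex
y = rng.integers(0, 2, 500)                              # 1 = cancer case

propensity = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
cases, controls = np.where(y == 1)[0], np.where(y == 0)[0]

# 1-nearest-neighbor matching on the propensity score.
nn = NearestNeighbors(n_neighbors=1).fit(propensity[controls].reshape(-1, 1))
_, idx = nn.kneighbors(propensity[cases].reshape(-1, 1))
matched_controls = controls[idx.ravel()]
print(len(cases), "cases matched to", len(set(matched_controls)), "unique controls")
```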
%R 10.2196/62833 %U https://cancer.jmir.org/2025/1/e62833 %U https://doi.org/10.2196/62833 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e67525 %T Development of a Predictive Model for Metabolic Syndrome Using Noninvasive Data and its Cardiovascular Disease Risk Assessments: Multicohort Validation Study %A Park,Jin-Hyun %A Jeong,Inyong %A Ko,Gang-Jee %A Jeong,Seogsong %A Lee,Hwamin %+ Korea University College of Medicine, 73 Goryeodae-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea, 82 1063205109, hwamin@korea.ac.kr %K metabolic syndrome prediction %K noninvasive data %K clinical interpretable model %K body composition data %K early intervention %D 2025 %7 2.5.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Metabolic syndrome is a cluster of metabolic abnormalities, including obesity, hypertension, dyslipidemia, and insulin resistance, that significantly increase the risk of cardiovascular disease (CVD) and other chronic conditions. Its global prevalence is rising, particularly in aging and urban populations. Traditional screening methods rely on laboratory tests and specialized assessments, which may not be readily accessible in routine primary care and community settings. Limited resources, time constraints, and inconsistent screening practices hinder early identification and intervention. Developing a noninvasive and scalable predictive model could enhance accessibility and improve early detection. Objective: This study aimed to develop and validate a predictive model for metabolic syndrome using noninvasive body composition data. Additionally, we evaluated the model’s ability to predict long-term CVD risk, supporting its application in clinical and public health settings for early intervention and preventive strategies. Methods: We developed a machine learning–based predictive model using noninvasive data from two nationally representative cohorts: the Korea National Health and Nutrition Examination Survey (KNHANES) and the Korean Genome and Epidemiology Study. The model was trained using dual-energy x-ray absorptiometry data from KNHANES (2008-2011) and validated internally with bioelectrical impedance analysis data from KNHANES 2022. External validation was conducted using Korean Genome and Epidemiology Study follow-up datasets. Five machine learning algorithms were compared, and the best-performing model was selected based on the area under the receiver operating characteristic curve. Cox proportional hazards regression was used to assess the model’s ability to predict long-term CVD risk. Results: The model demonstrated strong predictive performance across validation cohorts. Area under the receiver operating characteristic curve values for metabolic syndrome prediction ranged from 0.8338 to 0.8447 in internal validation, 0.8066 to 0.8138 in external validation 1, and 0.8039 to 0.8123 in external validation 2. The model’s predictions were significantly associated with future cardiovascular risk, with Cox regression analysis indicating that individuals classified as having metabolic syndrome had a 1.51-fold higher risk of developing CVD (hazard ratio 1.51, 95% CI 1.32-1.73; P<.001). The ability to predict long-term CVD risk highlights the potential utility of this model for guiding early interventions. Conclusions: This study developed a noninvasive predictive model for metabolic syndrome with strong performance across diverse validation cohorts. 
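The Cox proportional hazards step that yielded the hazard ratio of 1.51 in the metabolic syndrome study above can be sketched with the lifelines library. The data frame, column names, and covariates below are illustrative, not the study's variables.

```python
# Sketch of the Cox proportional hazards step, using lifelines.
# Toy data: real analyses need far more events per covariate.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "years_followed": [5.0, 3.2, 8.1, 6.4, 2.7, 7.5],
    "cvd_event":      [1,   0,   1,   0,   1,   0],     # 1 = incident CVD
    "metsyn_pred":    [1,   1,   1,   0,   0,   0],     # model-predicted MetS
    "age":            [61,  55,  68,  59,  63,  52],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="years_followed", event_col="cvd_event")
cph.print_summary()          # hazard ratios appear as exp(coef)
```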
By enabling early risk identification without laboratory tests, the model enhances accessibility in primary care and large-scale screenings. Its ability to predict long-term CVD risk supports proactive intervention strategies, potentially reducing the burden of cardiometabolic diseases. Further research should refine the model with additional clinical factors and broader population validation to maximize its clinical impact. %M 40315452 %R 10.2196/67525 %U https://www.jmir.org/2025/1/e67525 %U https://doi.org/10.2196/67525 %U http://www.ncbi.nlm.nih.gov/pubmed/40315452 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e70703 %T AI in Home Care—Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study %A Pérez-Esteve,Clara %A Guilabert,Mercedes %A Matarredona,Valerie %A Srulovici,Einav %A Tella,Susanna %A Strametz,Reinhard %A Mira,José Joaquín %+ Fundación para el Fomento de la Investigación Sanitaria y Biomédica de la Comunitat Valenciana, Centro de Salud Hospital-Plá, Hermanos López de Osaba s/n, Alicante, 03013, Spain, 34 966658984, jose.mira@umh.es %K large language models %K older adults %K informal caregiver %K error prevention %K patient safety %K ChatGPT %K Microsoft Copilot %K training %K health literacy %D 2025 %7 28.4.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: The aging of the population represents an achievement for society but also poses significant challenges for governments, health care systems, and caregivers. Elevated rates of functional limitations among older adults, primarily caused by chronic conditions, necessitate adequate and safe care, including care delivered in home settings. Traditionally, informal caregiver training has relied on verbal and written instructions. However, the advent of digital resources has introduced videos and interactive platforms, offering more accessible and effective training. Large language models (LLMs) have emerged as potential tools for personalized information delivery. While LLMs exhibit the capacity to mimic clinical reasoning and support decision-making, their potential to serve as alternatives to evidence-based professional instruction remains unexplored. Objective: We aimed to evaluate the appropriateness of home care instructions generated by LLMs (including GPTs) in comparison to a professional gold standard. Furthermore, we sought to identify specific domains where LLMs show the most promise and where improvements are necessary to optimize their reliability for caregiver training. Methods: An observational, comparative case study evaluated 3 LLMs—GPT-3.5, GPT-4o, and Microsoft Copilot—in 10 home care scenarios. A rubric assessed the models against a reference standard (gold standard) created by health care professionals. Independent reviewers evaluated variables including specificity, clarity, and self-efficacy. In addition to comparing each LLM to the gold standard, the models were also compared against each other across all study domains to identify relative strengths and weaknesses. Statistical analyses compared LLM performance to the gold standard to ensure consistency and validity, as well as to analyze differences between LLMs across all evaluated domains. Results: The study revealed that while no LLM achieved the precision of the professional gold standard, GPT-4o outperformed GPT-3.5 and Copilot in specificity (4.6 vs 3.7 and 3.6), clarity (4.8 vs 4.1 and 3.9), and self-efficacy (4.6 vs 3.8 and 3.4). 
However, the models exhibited significant limitations, with GPT-4o and Copilot omitting relevant details in 60% (6/10) of the cases, and GPT-3.5 doing so in 80% (8/10). When compared to the gold standard, only 10% (2/20) of GPT-4o responses were rated as equally specific, 20% (4/20) included comparable practical advice, and just 5% (1/20) provided a justification as detailed as professional guidance. Furthermore, error frequency did not differ significantly across models (P=.65), though Copilot had the highest rate of incorrect information (20%, 2/10 vs 10%, 1/10 for GPT-4o and 0%, 0/10 for GPT-3.5). Conclusions: LLMs, particularly the subscription-based GPT-4o, show potential as tools for training informal caregivers by providing tailored guidance and reducing errors. Although not yet surpassing professional instruction quality, these models offer a flexible and accessible alternative that could enhance home safety and care quality. Further research is necessary to address limitations and optimize their performance. Future implementation of LLMs may alleviate health care system burdens by reducing common caregiver errors. %M 40294407 %R 10.2196/70703 %U https://www.jmir.org/2025/1/e70703 %U https://doi.org/10.2196/70703 %U http://www.ncbi.nlm.nih.gov/pubmed/40294407 %0 Journal Article %@ 2561-7605 %I JMIR Publications %V 8 %N %P e66723 %T Development of a Longitudinal Model for Disability Prediction in Older Adults in China: Analysis of CHARLS Data (2015-2020) %A Chu,Jingjing %A Li,Ying %A Wang,Xinyi %A Xu,Qun %A Xu,Zherong %K disability %K prediction model %K older adults %K China Health and Retirement Longitudinal Study %K CHARLS %K medical resources allocation %D 2025 %7 17.4.2025 %9 %J JMIR Aging %G English %X Background: Disability profoundly affects older adults’ quality of life and imposes considerable burdens on health care systems in China’s aging society. Timely predictive models are essential for early intervention. Objective: We aimed to build effective predictive models of disability for early intervention and management in older adults in China, integrating physical, cognitive, physiological, and psychological factors. Methods: Data from the China Health and Retirement Longitudinal Study (CHARLS), spanning from 2015 to 2020 and involving 2450 older individuals initially in good health, were analyzed. The dataset was randomly divided into a training set (70% of the data) and a testing set (30%). LASSO regression with 10-fold cross-validation identified key predictors, which were then used to develop an Extreme Gradient Boosting (XGBoost) model. Model performance was evaluated using receiver operating characteristic curves, calibration curves, and clinical decision and impact curves. Variable contributions were interpreted using SHapley Additive exPlanations (SHAP) values. Results: LASSO regression was used to evaluate 36 potential predictors, resulting in a model incorporating 9 key variables: age, hand grip strength, standing balance, the 5-repetition chair stand test (CS-5), pain, depression, cognition, respiratory function, and comorbidities. The XGBoost model demonstrated an area under the curve of 0.846 (95% CI 0.825‐0.866) for the training set and 0.698 (95% CI 0.654‐0.743) for the testing set. Calibration curves demonstrated reliable predictive accuracy, with mean absolute errors of 0.001 and 0.011 for the training and testing sets, respectively. Clinical decision and impact curves demonstrated significant utility across risk thresholds. 
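The CHARLS modeling pipeline above (LASSO screening with 10-fold cross-validation, XGBoost classification, SHAP attribution) can be sketched end to end on synthetic data. Hyperparameters below are placeholders; the study's settings are not given in the abstract.

```python
# Sketch of the pipeline described above: LASSO (with CV) to screen
# predictors, XGBoost for classification, SHAP for attribution.
import numpy as np
from sklearn.linear_model import LassoCV
from xgboost import XGBClassifier
import shap

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 36))                  # 36 candidate predictors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=400) > 0).astype(int)

# Keep predictors with nonzero LASSO coefficients (10-fold CV).
lasso = LassoCV(cv=10).fit(X, y)
keep = np.flatnonzero(lasso.coef_)
print("selected predictors:", keep)

model = XGBClassifier(n_estimators=200, max_depth=3).fit(X[:, keep], y)
shap_values = shap.TreeExplainer(model).shap_values(X[:, keep])
print("mean |SHAP| per predictor:", np.abs(shap_values).mean(axis=0).round(3))
```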
SHAP analysis identified pain, respiratory function, and age as top predictors, highlighting their substantial roles in disability risk. Hand grip and the CS-5 also significantly influenced the model. A web-based application was developed for personalized risk assessment and decision-making. Conclusion: A reliable predictive model for 5-year disability risk in Chinese older adults was developed and validated. This model enables the identification of high-risk individuals, supports early interventions, and optimizes resource allocation. Future efforts will focus on updating the model with new CHARLS data and validating it with external datasets. %R 10.2196/66723 %U https://aging.jmir.org/2025/1/e66723 %U https://doi.org/10.2196/66723 %0 Journal Article %@ 2369-1999 %I JMIR Publications %V 11 %N %P e67914 %T Evaluation of Large Language Models in Tailoring Educational Content for Cancer Survivors and Their Caregivers: Quality Analysis %A Liu,Darren %A Hu,Xiao %A Xiao,Canhua %A Bai,Jinbing %A Barandouzi,Zahra A %A Lee,Stephanie %A Webster,Caitlin %A Brock,La-Urshalar %A Lee,Lindsay %A Bold,Delgersuren %A Lin,Yufen %K large language models %K GPT-4 %K cancer survivors %K caregivers %K education %K health equity %D 2025 %7 7.4.2025 %9 %J JMIR Cancer %G English %X Background: Cancer survivors and their caregivers, particularly those from disadvantaged backgrounds with limited health literacy or racial and ethnic minorities facing language barriers, are at a disproportionately higher risk of experiencing symptom burdens from cancer and its treatments. Large language models (LLMs) offer a promising avenue for generating concise, linguistically appropriate, and accessible educational materials tailored to these populations. However, there is limited research evaluating how effectively LLMs perform in creating targeted content for individuals with diverse literacy and language needs. Objective: This study aimed to evaluate the overall performance of LLMs in generating tailored educational content for cancer survivors and their caregivers with limited health literacy or language barriers, compare the performances of 3 Generative Pretrained Transformer (GPT) models (ie, GPT-3.5 Turbo, GPT-4, and GPT-4 Turbo; OpenAI), and examine how different prompting approaches influence the quality of the generated content. Methods: We selected 30 topics from national guidelines on cancer care and education. GPT-3.5 Turbo, GPT-4, and GPT-4 Turbo were used to generate tailored content of up to 250 words at a 6th-grade reading level, with translations into Spanish and Chinese for each topic. Two distinct prompting approaches (textual and bulleted) were applied and evaluated. Nine oncology experts evaluated 360 generated responses based on predetermined criteria: word limit, reading level, and quality assessment (ie, clarity, accuracy, relevance, completeness, and comprehensibility). ANOVA (analysis of variance) or chi-square analyses were used to compare differences among the various GPT models and prompts. Results: Overall, LLMs showed excellent performance in tailoring educational content, with 74.2% (267/360) adhering to the specified word limit and achieving an average quality assessment score of 8.933 out of 10. However, LLMs showed moderate performance in reading level, with 41.1% (148/360) of content failing to meet the sixth-grade reading level. LLMs demonstrated strong translation capabilities, achieving an accuracy of 96.7% (87/90) for Spanish and 81.1% (73/90) for Chinese translations. 
Common errors included imprecise scopes, inaccuracies in definitions, and content that lacked actionable recommendations. The more advanced GPT-4 family models showed better overall performance compared to GPT-3.5 Turbo. Prompting GPTs to produce bulleted-format content tended to result in better educational content than textual-format content. Conclusions: All 3 LLMs demonstrated high potential for delivering multilingual, concise, and low health literacy educational content for cancer survivors and caregivers who face limited literacy or language barriers. GPT-4 family models were notably more robust. While further refinement is required to ensure simpler reading levels and fully comprehensive information, these findings highlight LLMs as an emerging tool for bridging gaps in cancer education and advancing health equity. Future research should integrate expert feedback, additional prompt engineering strategies, and specialized training data to optimize content accuracy and accessibility. International Registered Report Identifier (IRRID): RR2-10.2196/48499 %R 10.2196/67914 %U https://cancer.jmir.org/2025/1/e67914 %U https://doi.org/10.2196/67914 %0 Journal Article %@ 2369-1999 %I JMIR Publications %V 11 %N %P e69663 %T Identifying Adverse Events in Outpatients With Prostate Cancer Using Pharmaceutical Care Records in Community Pharmacies: Application of Named Entity Recognition %A Yanagisawa,Yuki %A Watabe,Satoshi %A Yokoyama,Sakura %A Sayama,Kyoko %A Kizaki,Hayato %A Tsuchiya,Masami %A Imai,Shungo %A Someya,Mitsuhiro %A Taniguchi,Ryoo %A Yada,Shuntaro %A Aramaki,Eiji %A Hori,Satoko %+ , Division of Drug Informatics, Keio University Faculty of Pharmacy, 1-5-30 Shibakoen, Minato-ku, Tokyo, 105-8512, Japan, 81 3 5400 2650, satokoh@keio.jp %K natural language processing %K pharmaceutical care records %K androgen receptor axis-targeting agents %K adverse events %K outpatient care %D 2025 %7 11.3.2025 %9 Original Paper %J JMIR Cancer %G English %X Background: Androgen receptor axis-targeting agents (ARATs) have become key drugs for patients with castration-resistant prostate cancer (CRPC). ARATs are taken long term in outpatient settings, and effective adverse event (AE) monitoring can help prolong treatment duration for patients with CRPC. Despite the importance of monitoring, few studies have identified which AEs can be captured and assessed in community pharmacies, where pharmacists in Japan dispense medications, provide counseling, and monitor potential AEs for outpatients prescribed ARATs. Therefore, we anticipated that a named entity recognition (NER) system might be used to extract AEs recorded in pharmaceutical care records generated by community pharmacists. Objective: This study aimed to evaluate whether an NER system can effectively and systematically identify AEs in outpatients undergoing ARAT therapy by reviewing pharmaceutical care records generated by community pharmacists, focusing on assessment notes, which often contain detailed records of AEs. Additionally, the study sought to determine whether outpatient pharmacotherapy monitoring can be enhanced by using NER to systematically collect AEs from pharmaceutical care records. Methods: We used an NER system based on the widely used Japanese medical term extraction system MedNER-CR-JA, which uses Bidirectional Encoder Representations from Transformers (BERT). 
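The NER step this record describes builds on a BERT token-classification model. The generic sketch below uses the Hugging Face pipeline API; "example/medical-ner-ja" is a hypothetical model id, not MedNER-CR-JA itself, whose distribution is not described in the abstract.

```python
# Generic illustration of BERT-based token classification of the kind the
# study builds on. The model id is a placeholder; substitute a real
# Japanese medical NER checkpoint before running.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="example/medical-ner-ja",       # hypothetical model id
    aggregation_strategy="simple",        # merge word pieces into entity spans
)

note = "服用後に強い倦怠感と吐き気の訴えあり。"  # "strong fatigue and nausea after dosing"
for entity in ner(note):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
```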
To evaluate its performance on pharmaceutical care records written by community pharmacists, the NER system was first applied to 1008 assessment notes in records related to anticancer drug prescriptions. Three researchers with pharmaceutical expertise compared the system output with notes manually annotated with symptom tags according to the annotation guidelines and evaluated the performance of the NER system on the assessment notes in the pharmaceutical care records. The system was then applied to 2193 assessment notes for patients prescribed ARATs. Results: The F1-score for exact matches of all symptom tags between the NER system and annotators was 0.72, confirming that the NER system has sufficient performance for application to pharmaceutical care records. The NER system automatically assigned 1900 symptom tags for the 2193 assessment notes from patients prescribed ARATs; 623 tags (32.8%) were positive symptom tags (symptoms present), while 1067 tags (56.2%) were negative symptom tags (symptoms absent). Positive symptom tags included ARAT-related AEs such as “pain,” “skin disorders,” “fatigue,” and “gastrointestinal symptoms.” Many other symptoms were classified as serious AEs. Furthermore, differences in symptom tag profiles reflecting pharmacists’ AE monitoring were observed between androgen synthesis inhibition and androgen receptor signaling inhibition. Conclusions: The NER system successfully extracted AEs from pharmaceutical care records of patients prescribed ARATs, demonstrating its potential to systematically track the presence and absence of AEs in outpatients. The analysis of a large volume of pharmaceutical care records using the NER system showed that community pharmacists not only detect potential AEs but also actively monitor the absence of severe AEs, offering valuable insights for the continuous improvement of patient safety management. %M 40068144 %R 10.2196/69663 %U https://cancer.jmir.org/2025/1/e69663 %U https://doi.org/10.2196/69663 %U http://www.ncbi.nlm.nih.gov/pubmed/40068144 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e56774 %T Reporting Quality of AI Intervention in Randomized Controlled Trials in Primary Care: Systematic Review and Meta-Epidemiological Study %A Zhong,Jinjia %A Zhu,Ting %A Huang,Yafang %+ , School of General Practice and Continuing Education, Capital Medical University, 4th Fl, Jieping Building, Capital Medical University, No.10 You An Men Wai Xi Tou Tiao, Fengtai district, Beijing, 100069, China, 86 18810673886, yafang@ccmu.edu.cn %K artificial intelligence %K randomized controlled trial %K reporting quality %K primary care %K meta-epidemiological study %D 2025 %7 25.2.2025 %9 Review %J J Med Internet Res %G English %X Background: The surge in artificial intelligence (AI) interventions in primary care trials lacks a study on reporting quality. Objective: This study aimed to systematically evaluate the reporting quality of both published randomized controlled trials (RCTs) and protocols for RCTs that investigated AI interventions in primary care. Methods: PubMed, Embase, Cochrane Library, MEDLINE, Web of Science, and CINAHL databases were searched for RCTs and protocols on AI interventions in primary care until November 2024. Eligible studies were published RCTs or full protocols for RCTs exploring AI interventions in primary care. 
The reporting quality was assessed using CONSORT-AI (Consolidated Standards of Reporting Trials–Artificial Intelligence) and SPIRIT-AI (Standard Protocol Items: Recommendations for Interventional Trials–Artificial Intelligence) checklists, focusing on AI intervention–related items. Results: A total of 11,711 records were identified. In total, 19 published RCTs and 21 RCT protocols for 35 trials were included. The overall proportion of adequately reported items was 65% (172/266; 95% CI 59%-70%) and 68% (214/315; 95% CI 62%-73%) for RCTs and protocols, respectively. The percentage of RCTs and protocols that reported a specific item ranged from 11% (2/19) to 100% (19/19) and from 10% (2/21) to 100% (21/21), respectively. The reporting of RCTs and protocols exhibited similar characteristics and trends. Both lacked transparency and completeness in three respects: they did not provide adequate information regarding the input data, did not mention methods for identifying and analyzing performance errors, and did not state whether and how the AI intervention and its code could be accessed. Conclusions: The reporting quality could be improved in both RCTs and protocols. This study helps promote the transparent and complete reporting of trials with AI interventions in primary care. %M 39998876 %R 10.2196/56774 %U https://www.jmir.org/2025/1/e56774 %U https://doi.org/10.2196/56774 %U http://www.ncbi.nlm.nih.gov/pubmed/39998876 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e63476 %T Discovering Time-Varying Public Interest for COVID-19 Case Prediction in South Korea Using Search Engine Queries: Infodemiology Study %A Ahn,Seong-Ho %A Yim,Kwangil %A Won,Hyun-Sik %A Kim,Kang-Min %A Jeong,Dong-Hwa %+ Department of Artificial Intelligence, The Catholic University of Korea, Jibong-Ro 43 3-1, Bucheon-Si, Republic of Korea, 82 2 2164 5564, kangmin89@catholic.ac.kr %K COVID-19 %K confirmed case prediction %K search engine queries %K query expansion %K word embedding %K public health %K case prediction %K South Korea %K search engine %K infodemiology %K infodemiology study %K policy %K lifestyle %K machine learning %K machine learning techniques %K utilization %K temporal variation %K novel framework %K temporal %K web-based search %K temporal semantics %K prediction model %K model %D 2024 %7 16.12.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: The number of confirmed COVID-19 cases is a crucial indicator of policies and lifestyles. Previous studies have attempted to forecast cases using machine learning techniques that use past case counts and search engine queries predetermined by experts. However, they have limitations in reflecting temporal variations in queries associated with pandemic dynamics. Objective: This study aims to propose a novel framework to extract keywords highly associated with COVID-19, considering their temporal occurrence. We aim to extract relevant keywords based on pandemic variations using query expansion. Additionally, we examine time-delayed web-based search behavior related to public interest in COVID-19 and adjust for better prediction performance. Methods: To capture temporal semantics regarding COVID-19, word embedding models were trained on a news corpus, and the top 100 words related to “Corona” were extracted over 4-month windows. Time-lagged cross-correlation was applied to select optimal time lags correlated to confirmed cases from the expanded queries. 
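The time-lagged cross-correlation step just described amounts to shifting each query series against the case series and keeping the lag with the strongest Pearson correlation. The sketch below uses synthetic series with a built-in 9-day lead; the 14-day search window is an assumption.

```python
# Sketch of time-lagged cross-correlation: shift a query-volume series
# against case counts and keep the lag with the highest Pearson correlation.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
cases = pd.Series(np.cumsum(rng.normal(5, 2, 200)))
queries = cases.shift(-9) + rng.normal(0, 3, 200)   # interest leads cases by 9 days

def best_lag(query: pd.Series, cases: pd.Series, max_lag: int = 14):
    corrs = {lag: cases.corr(query.shift(lag)) for lag in range(max_lag + 1)}
    return max(corrs, key=lambda k: corrs[k]), corrs

lag, corrs = best_lag(queries, cases)
print("optimal lag:", lag, "days; r =", round(corrs[lag], 3))
```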
Subsequently, ElasticNet regression models were trained after reducing the feature dimensions using principal component analysis of the time-lagged features to predict future daily case counts. Results: Our approach successfully extracted relevant keywords depending on the pandemic phase, encompassing keywords directly related to COVID-19, such as its symptoms, and its societal impact. Specifically, during the first outbreak, keywords directly linked to COVID-19 and past infectious disease outbreaks similar to those of COVID-19 exhibited a high positive correlation. In the second phase of the pandemic, as community infections emerged, keywords related to the government’s pandemic control policies were frequently observed with a high positive correlation. In the third phase of the pandemic, during the delta variant outbreak, keywords such as “economic crisis” and “anxiety” appeared, reflecting public fatigue. Consequently, prediction models trained by the extracted queries over 4-month windows outperformed previous methods for most predictions 1-14 days ahead. Notably, our approach showed significantly higher Pearson correlation coefficients than models based solely on the number of past cases for predictions 9-11 days ahead (P=.02, P<.01, and P<.01), in contrast to heuristic- and symptom-based query sets. Conclusions: This study proposes a novel COVID-19 case-prediction model that automatically extracts relevant queries over time using word embedding. The model outperformed previous methods that relied on static symptom-based or heuristic queries, even without prior expert knowledge. The results demonstrate the capability of our approach to track temporal shifts in public interest regarding changes in the pandemic. %M 39680913 %R 10.2196/63476 %U https://www.jmir.org/2024/1/e63476 %U https://doi.org/10.2196/63476 %U http://www.ncbi.nlm.nih.gov/pubmed/39680913 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e55856 %T Screening for Depression and Anxiety Using a Nonverbal Working Memory Task in a Sample of Older Brazilians: Observational Study of Preliminary Artificial Intelligence Model Transferability %A Georgescu,Alexandra Livia %A Cummins,Nicholas %A Molimpakis,Emilia %A Giacomazzi,Eduardo %A Rodrigues Marczyk,Joana %A Goria,Stefano %K depression %K anxiety %K Brazil %K machine learning %K n-back %K working memory %K artificial intelligence %K gerontology %K older adults %K mental health %K AI %K transferability %K detection %K screening %K questionnaire %K longitudinal study %D 2024 %7 12.12.2024 %9 %J JMIR Form Res %G English %X Background: Anxiety and depression represent prevalent yet frequently undetected mental health concerns within the older population. The challenge of identifying these conditions presents an opportunity for artificial intelligence (AI)–driven, remotely available, tools capable of screening and monitoring mental health. A critical criterion for such tools is their cultural adaptability to ensure effectiveness across diverse populations. Objective: This study aims to illustrate the preliminary transferability of two established AI models designed to detect high depression and anxiety symptom scores. The models were initially trained on data from a nonverbal working memory game (1- and 2-back tasks) in a dataset by thymia, a company that develops AI solutions for mental health and well-being assessments, encompassing over 6000 participants from the United Kingdom, United States, Mexico, Spain, and Indonesia. 
We seek to validate the models’ performance by applying them to a new dataset comprising older Brazilian adults, thereby exploring their transferability and generalizability across different demographics and cultures. Methods: A total of 69 Brazilian participants aged 51-92 years were recruited with the help of Laços Saúde, a company specializing in nurse-led, holistic home care. Participants received a link to the thymia dashboard every Monday and Thursday for 6 months. The dashboard had a set of activities assigned to them that would take 10-15 minutes to complete, which included a 5-minute game with two levels of the n-back tasks. Two Random Forest models trained on thymia data to classify depression and anxiety based on thresholds defined by scores of the Patient Health Questionnaire (8 items) (PHQ-8) ≥10 and those of the Generalized Anxiety Disorder Assessment (7 items) (GAD-7) ≥10, respectively, were subsequently tested on the Laços Saúde patient cohort. Results: The depression classification model exhibited robust performance, achieving an area under the receiver operating characteristic curve (AUC) of 0.78, a specificity of 0.69, and a sensitivity of 0.72. The anxiety classification model showed an initial AUC of 0.63, with a specificity of 0.58 and a sensitivity of 0.64. This performance surpassed a benchmark model using only age and gender, which had AUCs of 0.47 for PHQ-8 and 0.53 for GAD-7. After recomputing the AUC scores on a cross-sectional subset of the data (the first n-back game session), we found AUCs of 0.79 for PHQ-8 and 0.76 for GAD-7. Conclusions: This study successfully demonstrates the preliminary transferability of two AI models trained on a nonverbal working memory task, one for depression and the other for anxiety classification, to a novel sample of older Brazilian adults. Future research could seek to replicate these findings in larger samples and other cultural contexts. Trial Registration: ISRCTN Registry ISRCTN90727704; https://www.isrctn.com/ISRCTN90727704 %R 10.2196/55856 %U https://formative.jmir.org/2024/1/e55856 %U https://doi.org/10.2196/55856 %0 Journal Article %@ 2563-3570 %I JMIR Publications %V 5 %N %P e62747 %T Eco-Evolutionary Drivers of Vibrio parahaemolyticus Sequence Type 3 Expansion: Retrospective Machine Learning Approach %A Campbell,Amy Marie %A Hauton,Chris %A van Aerle,Ronny %A Martinez-Urtaza,Jaime %+ Department of Genetics and Microbiology, Autonomous University of Barcelona, Facultat de Biociènces, oficina C3/109, Campus de la UAB, Bellaterra, Barcelona, 08193, Spain, 34 93 581 2729, jaime.martinez.urtaza@uab.cat %K pathogen expansion %K climate change %K machine learning %K ecology %K evolution %K vibrio parahaemolyticus %K sequencing %K sequence type 3 %K VpST3 %K genomics %D 2024 %7 28.11.2024 %9 Original Paper %J JMIR Bioinform Biotech %G English %X Background: Environmentally sensitive pathogens exhibit ecological and evolutionary responses to climate change that result in the emergence and global expansion of well-adapted variants. It is imperative to understand the mechanisms that facilitate pathogen emergence and expansion, as well as the drivers behind the mechanisms, to understand and prepare for future pandemic expansions. Objective: The unique, rapid, global expansion of a clonal complex of Vibrio parahaemolyticus (a marine bacterium causing gastroenteritis infections) named Vibrio parahaemolyticus sequence type 3 (VpST3) provides an opportunity to explore the eco-evolutionary drivers of pathogen expansion. 
Methods: The global expansion of VpST3 was reconstructed using VpST3 genomes, which were then classified using metrics characterizing the stages of this expansion process, indicative of emergence and establishment. We used machine learning, specifically a random forest classifier, to test a range of ecological and evolutionary drivers for their potential in predicting VpST3 expansion dynamics. Results: We identified a range of evolutionary features, including mutations in the core genome and accessory gene presence, associated with expansion dynamics. A range of random forest classifier approaches were tested to predict expansion classification metrics for each genome. The highest predictive accuracies (ranging from 0.722 to 0.967) were achieved for models using a combined eco-evolutionary approach. While population structure and the difference between introduced and established isolates could be predicted with high accuracy, our model reported multiple false positives when predicting the success of an introduced isolate, suggesting potential limiting factors not represented in our eco-evolutionary features. Regional models produced for 2 countries reporting the most VpST3 genomes had varying success, reflecting the impacts of class imbalance. Conclusions: These novel insights into evolutionary features and ecological conditions related to the stages of VpST3 expansion showcase the potential of machine learning models using genomic data and will contribute to the future understanding of the eco-evolutionary pathways of climate-sensitive pathogens. %M 39607996 %R 10.2196/62747 %U https://bioinform.jmir.org/2024/1/e62747 %U https://doi.org/10.2196/62747 %U http://www.ncbi.nlm.nih.gov/pubmed/39607996 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e64844 %T Comparative Analysis of Diagnostic Performance: Differential Diagnosis Lists by LLaMA3 Versus LLaMA2 for Case Reports %A Hirosawa,Takanobu %A Harada,Yukinori %A Tokumasu,Kazuki %A Shiraishi,Tatsuya %A Suzuki,Tomoharu %A Shimizu,Taro %+ Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, 880 Kitakobayashi, Mibu-cho, Shimotsuga, 321-0293, Japan, 81 0282861111, hirosawa@dokkyomed.ac.jp %K artificial intelligence %K clinical decision support system %K generative artificial intelligence %K large language models %K natural language processing %K NLP %K AI %K clinical decision making %K decision support %K decision making %K LLM: diagnostic %K case report %K diagnosis %K generative AI %K LLaMA %D 2024 %7 19.11.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: Generative artificial intelligence (AI), particularly in the form of large language models, has rapidly developed. The LLaMA series is popular and was recently updated from LLaMA2 to LLaMA3. However, the impacts of the update on diagnostic performance have not been well documented. Objective: We conducted a comparative evaluation of the diagnostic performance in differential diagnosis lists generated by LLaMA3 and LLaMA2 for case reports. Methods: We analyzed case reports published in the American Journal of Case Reports from 2022 to 2023. After excluding nondiagnostic and pediatric cases, we input the remaining cases into LLaMA3 and LLaMA2 using the same prompt and the same adjustable parameters. Diagnostic performance was defined by whether the differential diagnosis lists included the final diagnosis. 
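The differential-diagnosis prompting step in the LLaMA study above can be sketched with the Hugging Face text-generation pipeline, which accepts chat-style messages in recent transformers releases. The checkpoint is gated on the Hub, and the prompt wording and decoding settings are assumptions; the study's exact prompt and adjustable parameters are not reproduced in the abstract.

```python
# Hedged reconstruction of prompting an instruct LLaMA3 checkpoint for a
# top-10 differential diagnosis list. Requires approved access to the
# gated model on the Hugging Face Hub and a recent transformers release.
from transformers import pipeline

generator = pipeline("text-generation",
                     model="meta-llama/Meta-Llama-3-8B-Instruct")

case = "A 64-year-old man with fever, a new murmur, and splinter hemorrhages."
messages = [
    {"role": "system",
     "content": "List the 10 most likely diagnoses, most likely first."},
    {"role": "user", "content": case},
]
out = generator(messages, max_new_tokens=300, do_sample=False)
print(out[0]["generated_text"][-1]["content"])   # the assistant's reply
```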
Multiple physicians independently evaluated whether the final diagnosis was included in the top 10 differentials generated by LLaMA3 and LLaMA2. Results: In our comparative evaluation of the diagnostic performance between LLaMA3 and LLaMA2, we analyzed differential diagnosis lists for 392 case reports. The final diagnosis was included in the top 10 differentials generated by LLaMA3 in 79.6% (312/392) of the cases, compared to 49.7% (195/392) for LLaMA2, indicating a statistically significant improvement (P<.001). Additionally, LLaMA3 showed higher performance in including the final diagnosis in the top 5 differentials, observed in 63% (247/392) of cases, compared to LLaMA2’s 38% (149/392; P<.001). Furthermore, the top diagnosis was accurately identified by LLaMA3 in 33.9% (133/392) of cases, significantly higher than the 22.7% (89/392) achieved by LLaMA2 (P<.001). The analysis across various medical specialties revealed variations in diagnostic performance, with LLaMA3 consistently outperforming LLaMA2. Conclusions: The results reveal that the LLaMA3 model significantly outperforms LLaMA2 in diagnostic performance, with a higher percentage of case reports having the final diagnosis listed within the top 10, top 5, and as the top diagnosis. Overall diagnostic performance improved almost 1.5 times from LLaMA2 to LLaMA3. These findings support the rapid development and continuous refinement of generative AI systems to enhance diagnostic processes in medicine. However, these findings should be carefully interpreted for clinical application, as generative AI, including the LLaMA series, has not been approved for medical applications such as AI-enhanced diagnostics. %M 39561356 %R 10.2196/64844 %U https://formative.jmir.org/2024/1/e64844 %U https://doi.org/10.2196/64844 %U http://www.ncbi.nlm.nih.gov/pubmed/39561356 %0 Journal Article %@ 2369-2960 %I JMIR Publications %V 10 %N %P e60373 %T Optimizing a Classification Model to Evaluate Individual Susceptibility in Noise-Induced Hearing Loss: Cross-Sectional Study %A Li,Shiyuan %A Yu,Xiao %A Ma,Xinrong %A Wang,Ying %A Guo,Junjie %A Wang,Jiping %A Shen,Wenxin %A Dong,Hongyu %A Salvi,Richard %A Wang,Hui %A Yin,Shankai %K noise-induced hearing loss %K susceptible %K resistance %K machine learning algorithms %K linear regression %K extended high frequencies %K phenotypic characteristics %K genetic heterogeneity %D 2024 %7 14.11.2024 %9 %J JMIR Public Health Surveill %G English %X Background: Noise-induced hearing loss (NIHL), one of the leading causes of hearing loss in young adults, is a major health care problem that has negative social and economic consequences. It is commonly recognized that susceptibility varies widely among individuals who are exposed to similar noise. An objective method is, therefore, needed to identify workers who are extremely sensitive to occupational noise exposure to prevent them from developing severe NIHL. Objective: This study aims to determine an optimal model for detecting individuals susceptible or resistant to NIHL and further explore phenotypic traits uniquely associated with their susceptibility profiles. Methods: Cross-sectional data on hearing loss caused by occupational noise were collected from 2015 to 2021 at shipyards in Shanghai, China. Six methods summarized from the literature were applied to evaluate their performance in classifying participants as susceptible or resistant to NIHL. 
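Because the LLaMA3-LLaMA2 comparison above is paired (both models saw the same 392 cases), a McNemar test is a natural significance check. The abstract does not name the test used, and the discordant-pair counts below are invented, chosen only to be consistent with the reported 312/392 and 195/392 top-10 hit rates.

```python
# McNemar test on hypothetical paired outcomes consistent with the reported
# totals: 312 LLaMA3 hits and 195 LLaMA2 hits across the same 392 cases.
from statsmodels.stats.contingency_tables import mcnemar

#          LLaMA2 hit  LLaMA2 miss
table = [[190,        122],        # LLaMA3 hit  (190 + 122 = 312)
         [  5,         75]]        # LLaMA3 miss (190 + 5   = 195 LLaMA2 hits)

print(mcnemar(table, exact=False, correction=True))  # statistic and P value
```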
A machine learning (ML)–based diagnostic model using frequencies from 0.25 to 12 kHz was developed to determine the most reliable frequencies, considering accuracy and area under the curve. An optimal method with the most reliable frequencies was then constructed to detect individuals who were susceptible versus resistant to NIHL. Phenotypic characteristics such as age, exposure time, cumulative noise exposure, and hearing thresholds (HTs) were explored to identify these groups. Results: A total of 6276 participants (median age 41, IQR 33‐47 years; n=5372, 85.6% men) were included in the analysis. The ML-based NIHL diagnostic model with misclassified subjects showed the best performance for identifying workers in the NIHL-susceptible group (NIHL-SG) and NIHL-resistant group (NIHL-RG). The mean HTs at 4 and 12.5 kHz showed the highest predictive value for detecting those in the NIHL-SG and NIHL-RG (accuracy=0.78 and area under the curve=0.81). Individuals in the NIHL-SG selected by the optimized model were younger than those in the NIHL-RG (median 28, IQR 25‐31 years vs median 35, IQR 32‐39 years; P<.001), with a shorter duration of noise exposure (median 5, IQR 2‐8 years vs median 8, IQR 4‐12 years; P<.001) and lower cumulative noise exposure (median 90, IQR 86‐92 dBA-years vs median 92.2, IQR 89.2‐94.7 dBA-years; P<.001) but greater HTs (4 and 12.5 kHz; median 58.8, IQR 53.8‐63.8 dB HL vs median 8.8, IQR 7.5‐11.3 dB HL; P<.001). Conclusions: An ML-based NIHL diagnostic model with misclassified subjects using the mean HTs of 4 and 12.5 kHz was the most reliable method for identifying individuals susceptible or resistant to NIHL. However, further studies are needed to determine the genetic factors that govern NIHL susceptibility. Trial Registration: Chinese Clinical Trial Registry ChiCTR-RPC-17012580; https://www.chictr.org.cn/showprojEN.html?proj=21399 %R 10.2196/60373 %U https://publichealth.jmir.org/2024/1/e60373 %U https://doi.org/10.2196/60373 %0 Journal Article %@ 1929-0748 %I JMIR Publications %V 13 %N %P e53447 %T Using a Device-Free Wi-Fi Sensing System to Assess Daily Activities and Mobility in Low-Income Older Adults: Protocol for a Feasibility Study %A Chung,Jane %A Pretzer-Aboff,Ingrid %A Parsons,Pamela %A Falls,Katherine %A Bulut,Eyuphan %+ Nell Hodgson Woodruff School of Nursing, Emory University, 1520 Clifton Road NE, Atlanta, GA, 30322, United States, 1 4047277980, jane.chung@emory.edu %K Wi-Fi sensing %K dementia %K mild cognitive impairment %K older adults %K health disparities %K in-home activities %K mobility %K machine learning %D 2024 %7 12.11.2024 %9 Protocol %J JMIR Res Protoc %G English %X Background: Older adults belonging to racial or ethnic minorities with low socioeconomic status are at an elevated risk of developing dementia, but resources for assessing functional decline and detecting cognitive impairment are limited. Cognitive impairment affects the ability to perform daily activities and mobility behaviors. Traditional assessment methods have drawbacks, so smart home technologies (SmHT) have emerged to offer objective, high-frequency, and remote monitoring. However, these technologies usually rely on motion sensors that cannot identify specific activity types. This group often lacks access to these technologies due to limited resources and technology experience. There is a need to develop new sensing technology that is discreet, affordable, and requires minimal user engagement to characterize and quantify various in-home activities. 
Furthermore, it is essential to explore the feasibility of developing machine learning (ML) algorithms for SmHT through collaborations between clinical researchers and engineers and involving minority, low-income older adults for novel sensor development. Objective: This study aims to examine the feasibility of developing a novel channel state information–based device-free, low-cost Wi-Fi sensing system, and associated ML algorithms for localizing and recognizing different patterns of in-home activities and mobility in residents of low-income senior housing with and without mild cognitive impairment. Methods: This feasibility study was conducted in collaboration with a wellness care group, which serves the healthy aging needs of low-income housing residents. Prior to this feasibility study, we conducted a pilot study to collect channel state information data from several activity scenarios (eg, sitting, walking, and preparing meals) using the proposed Wi-Fi sensing system continuously over a week in apartments of low-income housing residents. These activities were videotaped to generate ground truth annotations to test the accuracy of the ML algorithms derived from the proposed system. Using qualitative individual interviews, we explored the acceptability of the Wi-Fi sensing system and implementation barriers in the low-income housing setting. We use the same study protocol for the proposed feasibility study. Results: The Wi-Fi sensing system deployment began in November 2022, with participant recruitment starting in July 2023. Preliminary results will be available in the summer of 2025. Preliminary results are focused on the feasibility of developing ML models for Wi-Fi sensing–based activity and mobility assessment, community-based recruitment and data collection, ground truth, and older adults’ Wi-Fi sensing technology acceptance. Conclusions: This feasibility study can make a contribution to SmHT science and ML capabilities for early detection of cognitive decline among socially vulnerable older adults. Currently, sensing devices are not readily available to this population due to cost and information barriers. Our sensing device has the potential to identify individuals at risk for cognitive decline by assessing their level of physical function by tracking their in-home activities and mobility behaviors, at a low cost. 
International Registered Report Identifier (IRRID): DERR1-10.2196/53447 %M 39531268 %R 10.2196/53447 %U https://www.researchprotocols.org/2024/1/e53447 %U https://doi.org/10.2196/53447 %U http://www.ncbi.nlm.nih.gov/pubmed/39531268 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e58413 %T Development and Validation of Deep Learning–Based Infectivity Prediction in Pulmonary Tuberculosis Through Chest Radiography: Retrospective Study %A Chung,Wou young %A Yoon,Jinsik %A Yoon,Dukyong %A Kim,Songsoo %A Kim,Yujeong %A Park,Ji Eun %A Kang,Young Ae %+ Department of Internal Medicine, Yonsei University College of Medicine, 50-1 Yonsei-ro, Seodaemun-gu, Seoul, 03722, Republic of Korea, 82 2 2228 1954, mdkang@yuhs.ac %K pulmonary tuberculosis %K chest radiography %K artificial intelligence %K tuberculosis %K TB %K smear %K smear test %K culture test %K diagnosis %K treatment %K deep learning %K CXR %K PTB %K management %K cost effective %K asymptomatic infection %K diagnostic tools %K infectivity %K AI tool %K cohort %D 2024 %7 7.11.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Pulmonary tuberculosis (PTB) poses a global health challenge owing to the time-intensive nature of traditional diagnostic tests such as smear and culture tests, which can require hours to weeks to yield results. Objective: This study aimed to use artificial intelligence (AI)–based chest radiography (CXR) to evaluate the infectivity of patients with PTB more quickly and accurately compared with traditional methods such as smear and culture tests. Methods: We used DenseNet121 and visualization techniques such as gradient-weighted class activation mapping and local interpretable model-agnostic explanations to demonstrate the decision-making process of the model. We analyzed 36,142 CXR images of 4492 patients with PTB obtained from Severance Hospital, focusing specifically on the lung region through segmentation and cropping with TransUNet. We used data from 2004 to 2020 to train the model, data from 2021 for testing, and data from 2022 to 2023 for internal validation. In addition, we used 1978 CXR images of 299 patients with PTB obtained from Yongin Severance Hospital for external validation. Results: In the internal validation, the model achieved an accuracy of 73.27%, an area under the receiver operating characteristic curve of 0.79, and an area under the precision-recall curve of 0.77. In the external validation, it exhibited an accuracy of 70.29%, an area under the receiver operating characteristic curve of 0.77, and an area under the precision-recall curve of 0.8. In addition, gradient-weighted class activation mapping and local interpretable model-agnostic explanations provided insights into the decision-making process of the AI model. Conclusions: This proposed AI tool offers a rapid and accurate alternative for evaluating PTB infectivity through CXR, with significant implications for enhancing screening efficiency by evaluating infectivity before sputum test results in clinical settings, compared with traditional smear and culture tests. 
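The backbone setup of the tuberculosis infectivity model above can be sketched with torchvision: DenseNet121 with its 1000-class head swapped for a single logit. The weights, input size, and sigmoid output are standard choices for illustration; the TransUNet lung cropping, augmentation, and training loop are omitted.

```python
# Sketch of adapting torchvision's DenseNet121 to a binary infectivity output.
import torch
from torchvision import models

model = models.densenet121(weights="IMAGENET1K_V1")
model.classifier = torch.nn.Linear(model.classifier.in_features, 1)

cxr_batch = torch.randn(4, 3, 224, 224)          # stand-in for cropped CXRs
logits = model(cxr_batch)
probs = torch.sigmoid(logits)                    # P(infectious) per image
print(probs.shape)                               # torch.Size([4, 1])
```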
%M 39509691 %R 10.2196/58413 %U https://www.jmir.org/2024/1/e58413 %U https://doi.org/10.2196/58413 %U http://www.ncbi.nlm.nih.gov/pubmed/39509691 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e54246 %T A New Natural Language Processing–Inspired Methodology (Detection, Initial Characterization, and Semantic Characterization) to Investigate Temporal Shifts (Drifts) in Health Care Data: Quantitative Study %A Paiva,Bruno %A Gonçalves,Marcos André %A da Rocha,Leonardo Chaves Dutra %A Marcolino,Milena Soriano %A Lana,Fernanda Cristina Barbosa %A Souza-Silva,Maira Viana Rego %A Almeida,Jussara M %A Pereira,Polianna Delfino %A de Andrade,Claudio Moisés Valiense %A Gomes,Angélica Gomides dos Reis %A Ferreira,Maria Angélica Pires %A Bartolazzi,Frederico %A Sacioto,Manuela Furtado %A Boscato,Ana Paula %A Guimarães-Júnior,Milton Henriques %A dos Reis,Priscilla Pereira %A Costa,Felício Roberto %A Jorge,Alzira de Oliveira %A Coelho,Laryssa Reis %A Carneiro,Marcelo %A Sales,Thaís Lorenna Souza %A Araújo,Silvia Ferreira %A Silveira,Daniel Vitório %A Ruschel,Karen Brasil %A Santos,Fernanda Caldeira Veloso %A Cenci,Evelin Paola de Almeida %A Menezes,Luanna Silva Monteiro %A Anschau,Fernando %A Bicalho,Maria Aparecida Camargos %A Manenti,Euler Roberto Fernandes %A Finger,Renan Goulart %A Ponce,Daniela %A de Aguiar,Filipe Carrilho %A Marques,Luiza Margoto %A de Castro,Luís César %A Vietta,Giovanna Grünewald %A Godoy,Mariana Frizzo de %A Vilaça,Mariana do Nascimento %A Morais,Vivian Costa %+ Computer Science Department, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil, Street Daniel de Carvalho, 1846, apto 201, Belo Horizonte, 30431310, Brazil, 55 31999710134, angelfire7@gmail.com %K health care %K machine learning %K data drifts %K temporal drifts %D 2024 %7 28.10.2024 %9 Original Paper %J JMIR Med Inform %G English %X Background: Proper analysis and interpretation of health care data can significantly improve patient outcomes by enhancing services and revealing the impacts of new technologies and treatments. Understanding the substantial impact of temporal shifts in these data is crucial. For example, COVID-19 vaccination initially lowered the mean age of at-risk patients and later changed the characteristics of those who died. This highlights the importance of understanding these shifts for assessing factors that affect patient outcomes. Objective: This study aims to propose detection, initial characterization, and semantic characterization (DIS), a new methodology for analyzing changes in health outcomes and variables over time while discovering contextual changes for outcomes in large volumes of data. Methods: The DIS methodology involves 3 steps: detection, initial characterization, and semantic characterization. Detection uses metrics such as Jensen-Shannon divergence to identify significant data drifts. Initial characterization offers a global analysis of changes in data distribution and predictive feature significance over time. Semantic characterization uses natural language processing–inspired techniques to understand the local context of these changes, helping identify factors driving changes in patient outcomes. By integrating the outcomes from these 3 steps, our results can identify specific factors (eg, interventions and modifications in health care practices) that drive changes in patient outcomes. DIS was applied to the Brazilian COVID-19 Registry and the Medical Information Mart for Intensive Care, version IV (MIMIC-IV) data sets. 
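The detection step of the DIS methodology above can be illustrated with SciPy: histogram a variable in two time windows and compare the distributions with Jensen-Shannon divergence. Note that scipy.spatial.distance.jensenshannon returns the JS distance, the square root of the divergence, hence the squaring; the drift threshold is left as an assumption.

```python
# Sketch of the DIS detection step: compare a variable's distribution in
# two time windows via Jensen-Shannon divergence. Data are synthetic.
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(3)
age_early = rng.normal(55, 12, 5000)    # e.g., patient age early in the study
age_late = rng.normal(62, 12, 5000)     # later window, after a shift

bins = np.histogram_bin_edges(np.concatenate([age_early, age_late]), bins=30)
p, _ = np.histogram(age_early, bins=bins, density=True)
q, _ = np.histogram(age_late, bins=bins, density=True)

js_div = jensenshannon(p, q, base=2) ** 2   # square the returned distance
print("JS divergence:", round(js_div, 3), "- flag drift above a tuned threshold")
```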
Results: Our approach allowed us to (1) identify drifts effectively, especially using metrics such as the Jensen-Shannon divergence, and (2) uncover reasons for the decline in overall mortality in both the COVID-19 and MIMIC-IV data sets, as well as changes in the cooccurrence between different diseases and this particular outcome. Factors such as vaccination during the COVID-19 pandemic and reduced iatrogenic events and cancer-related deaths in MIMIC-IV were highlighted. The methodology also pinpointed shifts in patient demographics and disease patterns, providing insights into the evolving health care landscape during the study period. Conclusions: We developed a novel methodology combining machine learning and natural language processing techniques to detect, characterize, and understand temporal shifts in health care data. This understanding can enhance predictive algorithms, improve patient outcomes, and optimize health care resource allocation. Our methodology can be applied to a variety of scenarios beyond those discussed in this paper. %M 39467275 %R 10.2196/54246 %U https://medinform.jmir.org/2024/1/e54246 %U https://doi.org/10.2196/54246 %U http://www.ncbi.nlm.nih.gov/pubmed/39467275 %0 Journal Article %@ 2369-2960 %I JMIR Publications %V 10 %N %P e58358 %T AI Governance: A Challenge for Public Health %A Wagner,Jennifer K %A Doerr,Megan %A Schmit,Cason D %K artificial intelligence %K legislation and jurisprudence %K harm reduction %K social determinants of health %K one health %K AI %K invisible algorithms %K modern life %K public health %K engagement %K AI governance %K traditional regulation %K soft law %D 2024 %7 30.9.2024 %9 %J JMIR Public Health Surveill %G English %X The rapid evolution of artificial intelligence (AI) is structuralizing social, political, and economic determinants of health into the invisible algorithms that shape all facets of modern life. Nevertheless, AI holds immense potential as a public health tool, enabling beneficial objectives such as precision public health and medicine. Developing an AI governance framework that can maximize the benefits and minimize the risks of AI is a significant challenge. The benefits of public health engagement in AI governance could be extensive. Here, we describe how several public health concepts can enhance AI governance. Specifically, we explain how (1) harm reduction can provide a framework for navigating the governance debate between traditional regulation and “soft law” approaches; (2) a public health understanding of social determinants of health is crucial to optimally weigh the potential risks and benefits of AI; (3) public health ethics provides a toolset for guiding governance decisions where individual interests intersect with collective interests; and (4) a One Health approach can improve AI governance effectiveness while advancing public health outcomes. Public health theories, perspectives, and innovations could substantially enrich and improve AI governance, creating a more equitable and socially beneficial path for AI development. 
%R 10.2196/58358 %U https://publichealth.jmir.org/2024/1/e58358 %U https://doi.org/10.2196/58358 %0 Journal Article %@ 2369-2960 %I JMIR Publications %V 10 %N %P e57437 %T Personality and Health-Related Quality of Life of Older Chinese Adults: Cross-Sectional Study and Moderated Mediation Model Analysis %A Dong,Xing-Xuan %A Huang,Yueqing %A Miao,Yi-Fan %A Hu,Hui-Hui %A Pan,Chen-Wei %A Zhang,Tianyang %A Wu,Yibo %K personality %K health-related quality of life %K older adults %K sleep quality %K quality of life %K old %K older %K Chinese %K China %K mechanisms %K psychology %K behavior %K analysis %K hypothesis %K neuroticism %K mediation analysis %K health care providers %K aging %D 2024 %7 12.9.2024 %9 %J JMIR Public Health Surveill %G English %X Background: Personality has an impact on the health-related quality of life (HRQoL) of older adults. However, the relationship and mechanisms of the 2 variables are controversial, and few studies have been conducted on older adults. Objective: The aim of this study was to explore the relationship between personality and HRQoL and the mediating and moderating roles of sleep quality and place of residence in this relationship. Methods: A total of 4123 adults aged 60 years or older were drawn from the Psychology and Behavior Investigation of Chinese Residents survey. Participants were asked to complete the Big Five Inventory, the Brief version of the Pittsburgh Sleep Quality Index, and EQ-5D-5L. A backpropagation neural network was used to explore the order of factors contributing to HRQoL. Path analysis was performed to evaluate the mediation hypothesis. Results: As of August 31, 2022, we had enrolled 4123 adults aged 60 years or older. Neuroticism and extraversion were strong influencing factors of HRQoL (normalized importance >50%). The results of the mediation analysis suggested that neuroticism and extraversion may diminish and enhance, respectively, HRQoL (index: β=−.262, P<.001; visual analog scale: β=−.193, P<.001) by increasing and decreasing, respectively, scores on the brief version of the Pittsburgh Sleep Quality Index (neuroticism: β=.17, P<.001; extraversion: β=−.069, P<.001). The multigroup analysis suggested a significant moderating effect of the place of residence (EQ-5D-5L index: P<.001; EQ-5D-5L visual analog scale: P<.001). No significant direct effect was observed between extraversion and EQ-5D-5L index in urban older residents (β=.037, P=.73). Conclusions: This study sheds light on the potential mechanisms of personality and HRQoL among older Chinese adults and can help health care providers and relevant departments take reasonable measures to promote healthy aging. %R 10.2196/57437 %U https://publichealth.jmir.org/2024/1/e57437 %U https://doi.org/10.2196/57437 %0 Journal Article %@ 2563-6316 %I JMIR Publications %V 5 %N %P e56993 %T Machine Learning–Based Hyperglycemia Prediction: Enhancing Risk Assessment in a Cohort of Undiagnosed Individuals %A Oyebola,Kolapo %A Ligali,Funmilayo %A Owoloye,Afolabi %A Erinwusi,Blessing %A Alo,Yetunde %A Musa,Adesola Z %A Aina,Oluwagbemiga %A Salako,Babatunde %K hyperglycemia %K diabetes %K machine learning %K hypertension %K random forest %D 2024 %7 11.9.2024 %9 %J JMIRx Med %G English %X Background: Noncommunicable diseases continue to pose a substantial health challenge globally, with hyperglycemia serving as a prominent indicator of diabetes. 
Objective: This study employed machine learning algorithms to predict hyperglycemia in a cohort of asymptomatic individuals and to identify crucial predictors contributing to early risk identification. Methods: The dataset included an extensive array of clinical and demographic data obtained from 195 asymptomatic adults residing in a suburban community in Nigeria. The study conducted a thorough comparison of multiple machine learning algorithms to ascertain the most effective model for predicting hyperglycemia. Moreover, we explored feature importance to pinpoint correlates of high blood glucose levels within the cohort. Results: Elevated blood pressure and prehypertension were recorded in 8 (4.1%) and 18 (9.2%) of the 195 participants, respectively. A total of 41 (21%) participants presented with hypertension, of whom 34 (83%) were female. However, sex adjustment showed that 34 of 118 (28.8%) female participants and 7 of 77 (9%) male participants had hypertension. Age-based analysis revealed an inverse relationship between normotension and age (r=−0.88; P=.02). Conversely, hypertension increased with age (r=0.53; P=.27), peaking at 50 to 59 years. Of the 195 participants, isolated systolic hypertension and isolated diastolic hypertension were recorded in 16 (8.2%) and 15 (7.7%) participants, respectively, with female participants recording a higher prevalence of isolated systolic hypertension (11/16, 69%) and male participants reporting a higher prevalence of isolated diastolic hypertension (11/15, 73%). Following class rebalancing, the random forest classifier gave the best performance (accuracy score 0.89; receiver operating characteristic–area under the curve score 0.89; F1-score 0.89) of the 26 model classifiers. The feature selection model identified uric acid and age as important variables associated with hyperglycemia. Conclusions: The random forest classifier identified significant clinical correlates associated with hyperglycemia, offering valuable insights for the early detection of diabetes and informing the design and deployment of therapeutic interventions. However, to achieve a more comprehensive understanding of each feature’s contribution to blood glucose levels, modeling additional relevant clinical features in larger datasets could be beneficial. %R 10.2196/56993 %U https://xmed.jmir.org/2024/1/e56993 %U https://doi.org/10.2196/56993 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e60501 %T Prompt Engineering Paradigms for Medical Applications: Scoping Review %A Zaghir,Jamil %A Naguib,Marco %A Bjelogrlic,Mina %A Névéol,Aurélie %A Tannier,Xavier %A Lovis,Christian %+ Department of Radiology and Medical Informatics, University of Geneva, Chemin des Mines, 9, Geneva, 1202, Switzerland, 41 022 379 08 18, Jamil.Zaghir@unige.ch %K prompt engineering %K prompt design %K prompt learning %K prompt tuning %K large language models %K LLMs %K scoping review %K clinical natural language processing %K natural language processing %K NLP %K medical texts %K medical application %K medical applications %K clinical practice %K privacy %K medicine %K computer science %K medical informatics %D 2024 %7 10.9.2024 %9 Review %J J Med Internet Res %G English %X Background: Prompt engineering, focusing on crafting effective prompts to large language models (LLMs), has garnered attention for its capability to harness the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and technical language. 
Clinical natural language processing applications must navigate complex language and ensure privacy compliance. Prompt engineering offers a novel approach by designing tailored prompts to guide models in extracting clinically relevant information from complex medical texts. Despite its promise, the efficacy of prompt engineering in the medical domain remains to be fully explored. Objective: The aim of the study is to review research efforts and technical approaches in prompt engineering for medical applications as well as provide an overview of opportunities and challenges for clinical practice. Methods: Databases indexing the fields of medicine, computer science, and medical informatics were queried in order to identify relevant published papers. Since prompt engineering is an emerging field, preprint databases were also considered. Multiple data elements were extracted, such as the prompt paradigm, the involved LLMs, the languages of the study, the domain of the topic, the baselines, and several learning, design, and architecture strategies specific to prompt engineering. We included studies that apply prompt engineering–based methods to the medical domain, published between 2022 and 2024, and covering multiple prompt paradigms such as prompt learning (PL), prompt tuning (PT), and prompt design (PD). Results: We included 114 recent prompt engineering studies. Among the 3 prompt paradigms, we have observed that PD is the most prevalent (78 papers). In 12 papers, PD, PL, and PT terms were used interchangeably. While ChatGPT is the most commonly used LLM, we have identified 7 studies using this LLM on a sensitive clinical data set. Chain-of-thought, present in 17 studies, emerges as the most frequent PD technique. While PL and PT papers typically provide a baseline for evaluating prompt-based approaches, 61% (48/78) of the PD studies did not report any nonprompt-related baseline. Finally, we individually examined the key items of prompt engineering–specific information reported across papers and found that many studies neglect to mention them explicitly, posing a challenge for advancing prompt engineering research. Conclusions: In addition to reporting on trends and the scientific landscape of prompt engineering, we provide reporting guidelines for future studies to help advance research in the medical field. We also disclose tables and figures summarizing medical prompt engineering papers available and hope that future contributions will leverage these existing works to better advance the field. 
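For readers unfamiliar with the prompt design (PD) paradigm and the chain-of-thought technique this review identifies as most frequent, the sketch below assembles a few-shot chain-of-thought extraction prompt as plain text. The task, fields, and worked example are invented for illustration and are not drawn from any reviewed study.

```python
# Illustrative chain-of-thought prompt template for a clinical extraction
# task (the task, fields, and worked example are hypothetical).
FEW_SHOT_EXAMPLE = """Note: "Pt reports chest pain for 2 days, denies fever."
Reasoning: The note mentions chest pain with a 2-day duration and explicitly
negates fever, so fever must be recorded as absent.
Answer: {"symptoms": [{"name": "chest pain", "duration_days": 2},
                      {"name": "fever", "present": false}]}"""

def build_prompt(note_text: str) -> str:
    # The "Reasoning:" suffix invites the model to show intermediate steps
    # before the structured answer, which is the chain-of-thought idea.
    return (
        "You are extracting symptoms from clinical notes.\n"
        "Think step by step, then output JSON only.\n\n"
        f"{FEW_SHOT_EXAMPLE}\n\n"
        f'Note: "{note_text}"\n'
        "Reasoning:"
    )

print(build_prompt("Pt with productive cough x5 days, low-grade fever."))
```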
%M 39255030 %R 10.2196/60501 %U https://www.jmir.org/2024/1/e60501 %U https://doi.org/10.2196/60501 %U http://www.ncbi.nlm.nih.gov/pubmed/39255030 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e53562 %T Automated Behavioral Coding to Enhance the Effectiveness of Motivational Interviewing in a Chat-Based Suicide Prevention Helpline: Secondary Analysis of a Clinical Trial %A Pellemans,Mathijs %A Salmi,Salim %A Mérelle,Saskia %A Janssen,Wilco %A van der Mei,Rob %+ Department of Mathematics, Vrije Universiteit Amsterdam, De Boelelaan 1111, Amsterdam, 1081 HV, Netherlands, 31 20 5987700, m.j.pellemans@vu.nl %K motivational interviewing %K behavioral coding %K suicide prevention %K artificial intelligence %K effectiveness %K counseling %K support tool %K online help %K mental health %D 2024 %7 1.8.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: With the rise of computer science and artificial intelligence, analyzing large data sets promises enormous potential in gaining insights for developing and improving evidence-based health interventions. One such intervention is the counseling strategy motivational interviewing (MI), which has been found effective in improving a wide range of health-related behaviors. Despite the simplicity of its principles, MI can be a challenging skill to learn and requires expertise to apply effectively. Objective: This study aims to investigate the performance of artificial intelligence models in classifying MI behavior and explore the feasibility of using these models in online helplines for mental health as an automated support tool for counselors in clinical practice. Methods: We used a coded data set of 253 MI counseling chat sessions from the 113 Suicide Prevention helpline. With 23,982 messages coded with the MI Sequential Code for Observing Process Exchanges codebook, we trained and evaluated 4 machine learning models and 1 deep learning model to classify client- and counselor MI behavior based on language use. Results: The deep learning model BERTje outperformed all machine learning models, accurately predicting counselor behavior (accuracy=0.72, area under the curve [AUC]=0.95, Cohen κ=0.69). It differentiated MI congruent and incongruent counselor behavior (AUC=0.92, κ=0.65) and evocative and nonevocative language (AUC=0.92, κ=0.66). For client behavior, the model achieved an accuracy of 0.70 (AUC=0.89, κ=0.55). The model’s interpretable predictions discerned client change talk and sustain talk, counselor affirmations, and reflection types, facilitating valuable counselor feedback. Conclusions: The results of this study demonstrate that artificial intelligence techniques can accurately classify MI behavior, indicating their potential as a valuable tool for enhancing MI proficiency in online helplines for mental health. Provided that the data set size is sufficiently large with enough training samples for each behavioral code, these methods can be trained and applied to other domains and languages, offering a scalable and cost-effective way to evaluate MI adherence, accelerate behavioral coding, and provide therapists with personalized, quick, and objective feedback. 
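The classification setup described above, fine-tuning a Dutch BERT model (BERTje) to label utterances with behavioral codes, can be sketched with the Hugging Face transformers library. The checkpoint below is the publicly released BERTje base model; the label set is hypothetical, and the classification head would still need fine-tuning on coded messages before its outputs are meaningful.

```python
# Sketch of utterance classification with a Dutch BERT model (transformers).
# The label set is illustrative; the head must be fine-tuned on coded data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["change_talk", "sustain_talk", "neutral"]  # hypothetical codes
name = "GroNLP/bert-base-dutch-cased"                # public BERTje base
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=len(LABELS)  # new, untrained classification head
)
model.eval()

message = "Ik wil echt stoppen met deze gedachten."  # example client message
inputs = tokenizer(message, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
print({lab: round(float(p), 3) for lab, p in zip(LABELS, probs)})
```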
%M 39088244 %R 10.2196/53562 %U https://www.jmir.org/2024/1/e53562 %U https://doi.org/10.2196/53562 %U http://www.ncbi.nlm.nih.gov/pubmed/39088244 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e56930 %T Roles, Users, Benefits, and Limitations of Chatbots in Health Care: Rapid Review %A Laymouna,Moustafa %A Ma,Yuanchao %A Lessard,David %A Schuster,Tibor %A Engler,Kim %A Lebouché,Bertrand %+ Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, D02.4110 – Glen Site, 1001 Decarie Blvd, Montreal, QC, H4A 3J1, Canada, 1 514 843 2090, bertrand.lebouche@mcgill.ca %K chatbot %K conversational agent %K conversational assistant %K user-computer interface %K digital health %K mobile health %K electronic health %K telehealth %K artificial intelligence %K AI %K health information technology %D 2024 %7 23.7.2024 %9 Review %J J Med Internet Res %G English %X Background: Chatbots, or conversational agents, have emerged as significant tools in health care, driven by advancements in artificial intelligence and digital technology. These programs are designed to simulate human conversations, addressing various health care needs. However, no comprehensive synthesis of health care chatbots’ roles, users, benefits, and limitations is available to inform future research and application in the field. Objective: This review aims to describe health care chatbots’ characteristics, focusing on their diverse roles in the health care pathway, user groups, benefits, and limitations. Methods: A rapid review of published literature from 2017 to 2023 was performed with a search strategy developed in collaboration with a health sciences librarian and implemented in the MEDLINE and Embase databases. Primary research studies reporting on chatbot roles or benefits in health care were included. Two reviewers dual-screened the search results. Extracted data on chatbot roles, users, benefits, and limitations were subjected to content analysis. Results: The review categorized chatbot roles into 2 themes: delivery of remote health services, including patient support, care management, education, skills building, and health behavior promotion, and provision of administrative assistance to health care providers. User groups spanned across patients with chronic conditions as well as patients with cancer; individuals focused on lifestyle improvements; and various demographic groups such as women, families, and older adults. Professionals and students in health care also emerged as significant users, alongside groups seeking mental health support, behavioral change, and educational enhancement. The benefits of health care chatbots were also classified into 2 themes: improvement of health care quality and efficiency and cost-effectiveness in health care delivery. The identified limitations encompassed ethical challenges, medicolegal and safety concerns, technical difficulties, user experience issues, and societal and economic impacts. Conclusions: Health care chatbots offer a wide spectrum of applications, potentially impacting various aspects of health care. While they are promising tools for improving health care efficiency and quality, their integration into the health care system must be approached with consideration of their limitations to ensure optimal, safe, and equitable use. 
%M 39042446 %R 10.2196/56930 %U https://www.jmir.org/2024/1/e56930 %U https://doi.org/10.2196/56930 %U http://www.ncbi.nlm.nih.gov/pubmed/39042446 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e58158 %T Evaluating and Enhancing Large Language Models’ Performance in Domain-Specific Medicine: Development and Usability Study With DocOA %A Chen,Xi %A Wang,Li %A You,MingKe %A Liu,WeiZhi %A Fu,Yu %A Xu,Jie %A Zhang,Shaoting %A Chen,Gang %A Li,Kang %A Li,Jian %+ Sports Medicine Center, West China Hospital, Sichuan University, No. 37, Guoxue Alley, Wuhou District, Chengdu, 610041, China, 86 18980601388, lijian_sportsmed@163.com %K large language model %K retrieval-augmented generation %K domain-specific benchmark framework %K osteoarthritis management %D 2024 %7 22.7.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: The efficacy of large language models (LLMs) in domain-specific medicine, particularly for managing complex diseases such as osteoarthritis (OA), remains largely unexplored. Objective: This study focused on evaluating and enhancing the clinical capabilities and explainability of LLMs in specific domains, using OA management as a case study. Methods: A domain-specific benchmark framework was developed to evaluate LLMs across a spectrum from domain-specific knowledge to clinical applications in real-world clinical scenarios. DocOA, a specialized LLM designed for OA management integrating retrieval-augmented generation and instructional prompts, was developed. It can identify the clinical evidence upon which its answers are based through retrieval-augmented generation, thereby demonstrating the explainability of those answers. The study compared the performance of GPT-3.5, GPT-4, and a specialized assistant, DocOA, using objective and human evaluations. Results: Results showed that general LLMs such as GPT-3.5 and GPT-4 were less effective in the specialized domain of OA management, particularly in providing personalized treatment recommendations. However, DocOA showed significant improvements. Conclusions: This study introduces a novel benchmark framework that assesses the domain-specific abilities of LLMs in multiple aspects, highlights the limitations of generalized LLMs in clinical contexts, and demonstrates the potential of tailored approaches for developing domain-specific medical LLMs. 
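A retrieval-augmented generation pipeline of the kind DocOA uses can be sketched as follows. TF-IDF retrieval here stands in for whatever retriever the system actually uses, the evidence passages are invented placeholders, and the final LLM call is left abstract.

```python
# Skeleton of retrieval-augmented generation: retrieve top-k evidence
# passages, then prepend them to an instructional prompt. TF-IDF stands in
# for the (unspecified) production retriever; passages are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

EVIDENCE = [
    "Exercise therapy reduces pain in knee osteoarthritis.",
    "NSAIDs provide short-term pain relief but carry GI risks.",
    "Weight loss of 5% or more improves function in knee OA.",
]

vectorizer = TfidfVectorizer().fit(EVIDENCE)
doc_matrix = vectorizer.transform(EVIDENCE)

def retrieve(query: str, k: int = 2) -> list[str]:
    sims = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    return [EVIDENCE[i] for i in sims.argsort()[::-1][:k]]

def build_rag_prompt(question: str) -> str:
    context = "\n".join(f"- {p}" for p in retrieve(question))
    return (
        "Answer using ONLY the evidence below, and cite it.\n"
        f"Evidence:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_rag_prompt("What first-line options help knee osteoarthritis?"))
# The assembled prompt would then be sent to an LLM of choice; returning the
# retrieved passages alongside the answer is what makes answers explainable.
```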
%M 38833165 %R 10.2196/58158 %U https://www.jmir.org/2024/1/e58158 %U https://doi.org/10.2196/58158 %U http://www.ncbi.nlm.nih.gov/pubmed/38833165 %0 Journal Article %@ 2369-1999 %I JMIR Publications %V 10 %N %P e43070 %T Artificial Intelligence–Based Co-Facilitator (AICF) for Detecting and Monitoring Group Cohesion Outcomes in Web-Based Cancer Support Groups: Single-Arm Trial Study %A Leung,Yvonne W %A Wouterloot,Elise %A Adikari,Achini %A Hong,Jinny %A Asokan,Veenaajaa %A Duan,Lauren %A Lam,Claire %A Kim,Carlina %A Chan,Kai P %A De Silva,Daswin %A Trachtenberg,Lianne %A Rennie,Heather %A Wong,Jiahui %A Esplen,Mary Jane %+ de Souza Institute, University Health Network, de Souza Institute c/o Toronto General Hospital, 200 Elizabeth St RFE 3-440, Toronto, ON, M5G 2C4, Canada, 1 647 299 1360, yw.leung@utoronto.ca %K group cohesion %K LIWC %K online support group %K natural language processing %K NLP %K emotion analysis %K machine learning %K sentiment analysis %K emotion detection %K integrating human knowledge %K emotion lining %K cancer %K oncology %K support group %K artificial intelligence %K AI %K therapy %K online therapist %K emotion %K affect %K speech tagging %K speech tag %K topic modeling %K named entity recognition %K spoken language processing %K focus group %K corpus %K language %K linguistic %D 2024 %7 22.7.2024 %9 Original Paper %J JMIR Cancer %G English %X Background: Commonly offered as supportive care, therapist-led online support groups (OSGs) are a cost-effective way to provide support to individuals affected by cancer. One important indicator of a successful OSG session is group cohesion; however, monitoring group cohesion can be challenging due to the lack of nonverbal cues and in-person interactions in text-based OSGs. The Artificial Intelligence–based Co-Facilitator (AICF) was designed to contextually identify therapeutic outcomes from conversations and produce real-time analytics. Objective: The aim of this study was to develop a method to train and evaluate AICF’s capacity to monitor group cohesion. Methods: AICF used a text classification approach to extract the mentions of group cohesion within conversations. A sample of data was annotated by human scorers, which was used as the training data to build the classification model. The annotations were further supported by finding contextually similar group cohesion expressions using word embedding models as well. AICF performance was also compared against the natural language processing software Linguistic Inquiry Word Count (LIWC). Results: AICF was trained on 80,000 messages obtained from Cancer Chat Canada. We tested AICF on 34,048 messages. Human experts scored 6797 (20%) of the messages to evaluate the ability of AICF to classify group cohesion. Results showed that machine learning algorithms combined with human input could detect group cohesion, a clinically meaningful indicator of effective OSGs. After retraining with human input, AICF reached an F1-score of 0.82. AICF performed slightly better at identifying group cohesion compared to LIWC. Conclusions: AICF has the potential to assist therapists by detecting discord in the group amenable to real-time intervention. Overall, AICF presents a unique opportunity to strengthen patient-centered care in web-based settings by attending to individual needs. 
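One step in the method above is finding contextually similar group cohesion expressions with word embedding models. A minimal sketch with pretrained GloVe vectors follows; the seed terms, the specific embedding set, and the similarity cutoff are illustrative assumptions, not the study's actual lexicon.

```python
# Sketch of expanding annotated seed terms with contextually similar words
# via pretrained embeddings (seed terms and model choice are illustrative).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small public embedding set
seed_terms = ["support", "together", "belonging"]  # hypothetical seeds

expanded = {}
for term in seed_terms:
    # Keep the 5 nearest neighbors above a loose similarity cutoff.
    expanded[term] = [w for w, sim in vectors.most_similar(term, topn=5)
                      if sim > 0.6]
print(expanded)
```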
International Registered Report Identifier (IRRID): RR2-10.2196/21453 %M 39037754 %R 10.2196/43070 %U https://cancer.jmir.org/2024/1/e43070 %U https://doi.org/10.2196/43070 %U http://www.ncbi.nlm.nih.gov/pubmed/39037754 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e51327 %T Public Perceptions and Discussions of the US Food and Drug Administration's JUUL Ban Policy on Twitter: Observational Study %A Liu,Pinxin %A Lou,Xubin %A Xie,Zidian %A Shang,Ce %A Li,Dongmei %+ Department of Clinical and Translational Research, University of Rochester Medical Center, 265 Crittenden Boulevard CU 420708, Rochester, NY, 14642-0708, United States, 1 5852767285, Dongmei_Li@urmc.rochester.edu %K e-cigarettes %K JUUL %K Twitter %K deep learning %K FDA %K Food and Drug Administration %K vape %K vaping %K smoking %K social media %K regulation %D 2024 %7 11.7.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: On June 23, 2022, the US Food and Drug Administration announced a JUUL ban policy to ban all vaping and electronic cigarette products sold by Juul Labs. Objective: This study aims to understand public perceptions and discussions of this policy using Twitter (subsequently rebranded as X) data. Methods: Using the Twitter streaming application programming interface, 17,007 tweets potentially related to the JUUL ban policy were collected between June 22, 2022, and July 25, 2022. Based on 2600 hand-coded tweets, a deep learning model (RoBERTa) was trained to classify all tweets into propolicy, antipolicy, neutral, and irrelevant categories. A deep learning model (M3 model) was used to estimate basic demographics (such as age and gender) of Twitter users. Furthermore, major topics were identified using latent Dirichlet allocation modeling. A logistic regression model was used to examine the association of different Twitter users with their attitudes toward the policy. Results: Among 10,480 tweets related to the JUUL ban policy, there were similar proportions of propolicy and antipolicy tweets (n=2777, 26.5% vs n=2666, 25.4%). Major propolicy topics included “JUUL causes youth addiction,” “market surge of JUUL,” and “health effects of JUUL.” In contrast, major antipolicy topics included “cigarette should be banned instead of JUUL,” “against the irrational policy,” and “emotional catharsis.” Twitter users older than 29 years were more likely to be propolicy (have a positive attitude toward the JUUL ban policy) than those younger than 29 years. Conclusions: Our study showed that public responses to the JUUL ban policy differed depending on the demographic characteristics of Twitter users. Our findings could provide valuable information to the Food and Drug Administration for future electronic cigarette and other tobacco product regulations. 
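The latent Dirichlet allocation step used above to surface topics can be sketched with scikit-learn; the toy tweets, the two-topic setting, and the preprocessing are illustrative, not the study's corpus or configuration.

```python
# Sketch of latent Dirichlet allocation topic modeling on short texts
# (the toy corpus and the number of topics are illustrative).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "ban juul to protect youth from addiction",
    "youth addiction to vaping keeps rising",
    "ban cigarettes instead of vaping products",
    "cigarettes are worse than vaping ban them",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Print the highest-weight terms per topic.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[::-1][:4]]
    print(f"Topic {i}: {top}")
```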
%M 38990633 %R 10.2196/51327 %U https://formative.jmir.org/2024/1/e51327 %U https://doi.org/10.2196/51327 %U http://www.ncbi.nlm.nih.gov/pubmed/38990633 %0 Journal Article %@ 2373-6658 %I JMIR Publications %V 8 %N %P e48378 %T A Random Forest Algorithm for Assessing Risk Factors Associated With Chronic Kidney Disease: Observational Study %A Liu,Pei %A Liu,Yijun %A Liu,Hao %A Xiong,Linping %A Mei,Changlin %A Yuan,Lei %+ Department of Health Management, Second Military Medical University, No.800 Xiangyin Road, Yangpu District, Shanghai, China, Shanghai, 200433, China, 86 15026929271, yuanleigz@163.com %K chronic kidney disease %K random forest model %K risk factors %K assessment %D 2024 %7 3.6.2024 %9 Original Paper %J Asian Pac Isl Nurs J %G English %X Background: The prevalence and mortality rate of chronic kidney disease (CKD) are increasing year by year, and it has become a global public health issue. The economic burden caused by CKD is increasing at a rate of 1% per year. CKD is highly prevalent and costly to treat, yet it often remains undetected. Therefore, early detection and intervention are vital means to mitigate the treatment burden on patients and decrease disease progression. Objective: In this study, we investigated the advantages of using the random forest (RF) algorithm for assessing risk factors associated with CKD. Methods: We included 40,686 people with complete screening records who underwent screening between January 1, 2015, and December 22, 2020, in Jing’an District, Shanghai, China. We grouped the participants into those with and those without CKD by staging based on the glomerular filtration rate and grouping based on albuminuria. Using a logistic regression model, we determined the relationship between CKD and risk factors. The RF machine learning algorithm was used to score the predictive variables and rank them based on their importance to construct a prediction model. Results: The logistic regression model revealed that gender, older age, obesity, an abnormal estimated glomerular filtration rate, retirement status, and participation in urban employee medical insurance were significantly associated with the risk of CKD. On RF algorithm–based screening, the top 4 factors influencing CKD were age, albuminuria, working status, and urinary albumin-creatinine ratio. The RF model achieved an area under the receiver operating characteristic curve of 93.15%. Conclusions: Our findings reveal that the RF algorithm has significant predictive value for assessing risk factors associated with CKD and allows the screening of individuals with risk factors. This has crucial implications for early intervention and prevention of CKD. 
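The core workflow above, fitting a random forest and ranking predictors by importance, can be sketched as follows. The data are synthetic placeholders whose feature names merely echo the abstract; they are not the study's screening records.

```python
# Sketch of random forest risk-factor ranking (synthetic placeholder data;
# feature names echo the abstract but the values are simulated).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.integers(20, 90, n),  # age
    rng.random(n),            # albuminuria marker
    rng.integers(0, 2, n),    # working status
    rng.random(n),            # urinary albumin-creatinine ratio
])
# Simulated outcome loosely driven by age and albuminuria.
y = ((X[:, 0] > 60) & (X[:, 1] > 0.5)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("AUROC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
for name, imp in zip(["age", "albuminuria", "working", "uACR"],
                     rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```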
%M 38830204 %R 10.2196/48378 %U https://apinj.jmir.org/2024/1/e48378 %U https://doi.org/10.2196/48378 %U http://www.ncbi.nlm.nih.gov/pubmed/38830204 %0 Journal Article %@ 2369-2960 %I JMIR Publications %V 10 %N %P e52691 %T Exploring the Association Between Structural Racism and Mental Health: Geospatial and Machine Learning Analysis %A Mohebbi,Fahimeh %A Forati,Amir Masoud %A Torres,Lucas %A deRoon-Cassini,Terri A %A Harris,Jennifer %A Tomas,Carissa W %A Mantsch,John R %A Ghose,Rina %+ Department of Pharmacology & Toxicology, Medical College of Wisconsin, 8701 Watertown Plank Rd, Milwaukee, WI, 53226, United States, 1 4149558861, jomantsch@mcw.edu %K machine learning %K geospatial %K racial disparities %K social determinant of health %K structural racism %K mental health %K health disparities %K deep learning %D 2024 %7 3.5.2024 %9 Original Paper %J JMIR Public Health Surveill %G English %X Background: Structural racism produces mental health disparities. While studies have examined the impact of individual factors such as poverty and education, the collective contribution of these elements, as manifestations of structural racism, has been less explored. Milwaukee County, Wisconsin, with its racial and socioeconomic diversity, provides a unique context for this multifactorial investigation. Objective: This research aimed to delineate the association between structural racism and mental health disparities in Milwaukee County, using a combination of geospatial and deep learning techniques. We used secondary data sets where all data were aggregated and anonymized before being released by federal agencies. Methods: We compiled 217 georeferenced explanatory variables across domains, initially deliberately excluding race-based factors to focus on nonracial determinants. This approach was designed to reveal the underlying patterns of risk factors contributing to poor mental health, subsequently reintegrating race to assess the effects of racism quantitatively. The variable selection combined tree-based methods (random forest) and conventional techniques, supported by variance inflation factor and Pearson correlation analysis for multicollinearity mitigation. The geographically weighted random forest model was used to investigate spatial heterogeneity and dependence. Self-organizing maps, combined with K-means clustering, were used to analyze data from Milwaukee communities, focusing on quantifying the impact of structural racism on the prevalence of poor mental health. Results: While 12 influential factors collectively accounted for 95.11% of the variability in mental health across communities, the top 6 factors—smoking, poverty, insufficient sleep, lack of health insurance, employment, and age—were particularly impactful. Predominantly African American neighborhoods were disproportionately affected, being 2.23 times more likely to encounter high-risk clusters for poor mental health. Conclusions: The findings demonstrate that structural racism shapes mental health disparities, with Black community members disproportionately impacted. The multifaceted methodological approach underscores the value of integrating geospatial analysis and deep learning to understand complex social determinants of mental health. These insights highlight the need for targeted interventions, addressing both individual and systemic factors to mitigate mental health disparities rooted in structural racism. 
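The multicollinearity screening step mentioned in the methods above can be sketched with statsmodels' variance inflation factor; the variable names, simulated data, and the common cutoff of 10 are illustrative assumptions.

```python
# Sketch of multicollinearity screening with variance inflation factors
# (variable names, data, and the cutoff of 10 are illustrative).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "poverty": rng.random(500),
    "smoking": rng.random(500),
    "insufficient_sleep": rng.random(500),
})
df["uninsured"] = 0.9 * df["poverty"] + 0.1 * rng.random(500)  # collinear

# Add an intercept column, then compute a VIF per explanatory variable.
X = sm.add_constant(df).to_numpy()
vifs = {col: variance_inflation_factor(X, i + 1)
        for i, col in enumerate(df.columns)}
keep = [col for col, v in vifs.items() if v < 10]  # drop high-VIF features
print(vifs, "->", keep)
```

On this simulated data, "poverty" and "uninsured" receive large VIFs and would be flagged, which is the kind of pruning the study's variable selection performs before modeling.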
%M 38701436 %R 10.2196/52691 %U https://publichealth.jmir.org/2024/1/e52691 %U https://doi.org/10.2196/52691 %U http://www.ncbi.nlm.nih.gov/pubmed/38701436 %0 Journal Article %@ 1947-2579 %I JMIR Publications %V 16 %N %P e50201 %T Applying Machine Learning Techniques to Implementation Science %A Huguet,Nathalie %A Chen,Jinying %A Parikh,Ravi B %A Marino,Miguel %A Flocke,Susan A %A Likumahuwa-Ackman,Sonja %A Bekelman,Justin %A DeVoe,Jennifer E %+ Department of Family Medicine, Oregon Health & Science University, 3181 SW Sam Jackson Park Road, Portland, OR, 97239, United States, 1 503 494 4404, huguetn@ohsu.edu %K implementation science %K machine learning %K implementation strategies %K techniques %K implementation %K prediction %K adaptation %K acceptance %K challenges %K scientist %D 2024 %7 22.4.2024 %9 Viewpoint %J Online J Public Health Inform %G English %X Machine learning (ML) approaches could expand the usefulness and application of implementation science methods in clinical medicine and public health settings. The aim of this viewpoint is to introduce a roadmap for applying ML techniques to address implementation science questions, such as predicting what will work best, for whom, under what circumstances, and with what predicted level of support, and what and when adaptation or deimplementation are needed. We describe how ML approaches could be used and discuss challenges that implementation scientists and methodologists will need to consider when using ML throughout the stages of implementation. %M 38648094 %R 10.2196/50201 %U https://ojphi.jmir.org/2024/1/e50201 %U https://doi.org/10.2196/50201 %U http://www.ncbi.nlm.nih.gov/pubmed/38648094 %0 Journal Article %@ 1947-2579 %I JMIR Publications %V 16 %N %P e50771 %T Machine Learning for Prediction of Tuberculosis Detection: Case Study of Trained African Giant Pouched Rats %A Jonathan,Joan %A Barakabitze,Alcardo Alex %A Fast,Cynthia D %A Cox,Christophe %+ Department of Informatics and Information Technology, Sokoine University of Agriculture, PO Box 3038, Morogoro, United Republic of Tanzania, 255 763 630 054, joanjonathan@sua.ac.tz %K machine learning %K African giant pouched rat %K diagnosis %K tuberculosis %K health care %D 2024 %7 16.4.2024 %9 Original Paper %J Online J Public Health Inform %G English %X Background: Technological advancement has led to the growth and rapid increase of tuberculosis (TB) medical data generated from different health care areas, including diagnosis. Prioritizing better adoption and acceptance of innovative diagnostic technology to reduce the spread of TB significantly benefits developing countries. Trained TB-detection rats are used in Tanzania and Ethiopia for operational research to complement other TB diagnostic tools. This technology has increased new TB case detection owing to its speed, cost-effectiveness, and sensitivity. Objective: During the TB detection process, rats produce vast amounts of data, providing an opportunity to identify interesting patterns that influence TB detection performance. This study aimed to use machine learning (ML) techniques to develop models that predict whether a rat will hit (indicate the presence of TB within) a sample. The goal was to improve the diagnostic accuracy and performance of TB detection involving rats. 
Methods: APOPO (Anti-Persoonsmijnen Ontmijnende Product Ontwikkeling) Center in Morogoro provided data for this study from 2012 to 2019, and 366,441 observations were used to build predictive models using ML techniques, including decision tree, random forest, naïve Bayes, support vector machine, and k-nearest neighbor, by incorporating a variety of variables, such as the diagnostic results from partner health clinics using methods endorsed by the World Health Organization (WHO). Results: The support vector machine technique yielded the highest accuracy of 83.39% for prediction compared to other ML techniques used. Furthermore, this study found that the inclusion of variables related to whether the sample contained TB or not increased the performance accuracy of the predictive model. Conclusions: The inclusion of variables related to the diagnostic results of TB samples may improve the detection performance of the trained rats. The study results may be of importance to TB-detection rat trainers and TB decision-makers, as these findings may prompt them to take action to maintain the usefulness of the technology and increase the TB detection performance of trained rats. %M 38625737 %R 10.2196/50771 %U https://ojphi.jmir.org/2024/1/e50771 %U https://doi.org/10.2196/50771 %U http://www.ncbi.nlm.nih.gov/pubmed/38625737 %0 Journal Article %@ 1947-2579 %I JMIR Publications %V 15 %N %P e52782 %T Machine Learning Model for Predicting Mortality Risk in Patients With Complex Chronic Conditions: Retrospective Analysis %A Hernández Guillamet,Guillem %A Morancho Pallaruelo,Ariadna Ning %A Miró Mezquita,Laura %A Miralles,Ramón %A Mas,Miquel Àngel %A Ulldemolins Papaseit,María José %A Estrada Cuxart,Oriol %A López Seguí,Francesc %+ Chair in ICT and Health, Centre for Health and Social Care Research (CESS), University of Vic - Central University of Catalonia (UVic-UCC), Carrer Miquel Martí i Pol, 1, Vic, 08500, Spain, 1 938863342, francesc.lopez.segui@gmail.com %K machine learning %K mortality prediction %K chronicity %K chronic %K complex %K artificial intelligence %K complexity %K health data %K predict %K prediction %K predictive %K mortality %K death %K classification %K algorithm %K algorithms %K mortality risk %K risk prediction %D 2023 %7 28.12.2023 %9 Original Paper %J Online J Public Health Inform %G English %X Background: The health care system is undergoing a shift toward a more patient-centered approach for individuals with chronic and complex conditions, which presents a series of challenges, such as predicting hospital needs and optimizing resources. At the same time, the exponential increase in health data availability has made it possible to apply advanced statistics and artificial intelligence techniques to develop decision-support systems and improve resource planning, diagnosis, and patient screening. These methods are key to automating the analysis of large volumes of medical data and reducing professional workloads. Objective: This article aims to present a machine learning model and a case study in a cohort of patients with highly complex conditions. The objective was to predict mortality within the following 4 years and early mortality within 6 months of diagnosis. The method used easily accessible variables and health care resource utilization information. Methods: A classification algorithm was selected among 6 models implemented and evaluated using a stratified cross-validation strategy with k=10 and a 70/30 train-test split. 
The evaluation metrics used included accuracy, recall, precision, F1-score, and area under the receiver operating characteristic (AUROC) curve. Results: The model predicted patient death with an 87% accuracy, recall of 87%, precision of 82%, F1-score of 84%, and area under the curve (AUC) of 0.88 using the best model, the Extreme Gradient Boosting (XGBoost) classifier. The results were worse when predicting premature deaths (within 6 months of diagnosis), with an 83% accuracy (recall=55%, precision=64%, F1-score=57%, and AUC=0.88) using the Gradient Boosting (GRBoost) classifier. Conclusions: This study showcases encouraging outcomes in forecasting mortality among patients with intricate and persistent health conditions. The employed variables are conveniently accessible, and the incorporation of patients’ health care resource utilization information, which has not been used by current state-of-the-art approaches, displays promising predictive power. The proposed prediction model is designed to efficiently identify cases that need customized care and proactively anticipate the demand for critical resources by health care providers. %M 38223690 %R 10.2196/52782 %U https://ojphi.jmir.org/2023/1/e52782 %U https://doi.org/10.2196/52782 %U http://www.ncbi.nlm.nih.gov/pubmed/38223690 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e52091 %T The Impact of Generative Conversational Artificial Intelligence on the Lesbian, Gay, Bisexual, Transgender, and Queer Community: Scoping Review %A Bragazzi,Nicola Luigi %A Crapanzano,Andrea %A Converti,Manlio %A Zerbetto,Riccardo %A Khamisy-Farah,Rola %+ Laboratory for Industrial and Applied Mathematics, Department of Mathematics and Statistics, York University, 4700 Keele Street, Toronto, ON, M3J 1P3, Canada, 1 416 736 2100, robertobragazzi@gmail.com %K generative conversational artificial intelligence %K chatbot %K lesbian, gay, bisexual, transgender, and queer community %K LGBTQ %K scoping review %K mobile phone %D 2023 %7 6.12.2023 %9 Review %J J Med Internet Res %G English %X Background: Despite recent significant strides toward acceptance, inclusion, and equality, members of the lesbian, gay, bisexual, transgender, and queer (LGBTQ) community still face alarming mental health disparities, being almost 3 times more likely to experience depression, anxiety, and suicidal thoughts than their heterosexual counterparts. These unique psychological challenges are due to discrimination, stigmatization, and identity-related struggles and can potentially benefit from generative conversational artificial intelligence (AI). As the latest advancement in AI, conversational agents and chatbots can imitate human conversation and support mental health, fostering diversity and inclusivity, combating stigma, and countering discrimination. In contrast, if not properly designed, they can perpetuate exclusion and inequities. Objective: This study aims to examine the impact of generative conversational AI on the LGBTQ community. Methods: This study was designed as a scoping review. Four electronic scholarly databases (Scopus, Embase, Web of Science, and MEDLINE via PubMed) and gray literature (Google Scholar) were consulted from inception without any language restrictions. Original studies focusing on the LGBTQ community or counselors working with this community exposed to chatbots and AI-enhanced internet-based platforms and exploring the feasibility, acceptance, or effectiveness of AI-enhanced tools were deemed eligible. 
The findings were reported in accordance with the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews). Results: Seven applications (HIVST-Chatbot, TelePrEP Navigator, Amanda Selfie, Crisis Contact Simulator, REALbot, Tough Talks, and Queer AI) were included and reviewed. The chatbots and internet-based assistants identified served various purposes: (1) to identify LGBTQ individuals at risk of suicide or contracting HIV or other sexually transmitted infections, (2) to provide resources to LGBTQ youth from underserved areas, (3) to facilitate HIV status disclosure to sex partners, and (4) to develop training role-play personas encompassing the diverse experiences and intersecting identities of LGBTQ youth to educate counselors. The use of generative conversational AI for the LGBTQ community is still in its early stages. Initial studies have found that deploying chatbots is feasible and well received, with high ratings for usability and user satisfaction. However, there is room for improvement in terms of the content provided and making conversations more engaging and interactive. Many of these studies used small sample sizes and short-term interventions measuring limited outcomes. Conclusions: Generative conversational AI holds promise, but further development and formal evaluation are needed, including studies with larger samples, longer interventions, and randomized trials to compare different content, delivery methods, and dissemination platforms. In addition, a focus on engagement with behavioral objectives is essential to advance this field. The findings have broad practical implications, highlighting that AI’s impact spans various aspects of people’s lives. Assessing AI’s impact on diverse communities and adopting diversity-aware and intersectional approaches can help shape AI’s positive impact on society as a whole. %M 37864350 %R 10.2196/52091 %U https://www.jmir.org/2023/1/e52091 %U https://doi.org/10.2196/52091 %U http://www.ncbi.nlm.nih.gov/pubmed/37864350 %0 Journal Article %@ 2369-2960 %I JMIR Publications %V 9 %N %P e46898 %T Application of Machine Learning Prediction of Individual SARS-CoV-2 Vaccination and Infection Status to the French Serosurveillance Survey From March 2020 to 2022: Cross-Sectional Study %A Bougeard,Stéphanie %A Huneau-Salaun,Adeline %A Attia,Mikael %A Richard,Jean-Baptiste %A Demeret,Caroline %A Platon,Johnny %A Allain,Virginie %A Le Vu,Stéphane %A Goyard,Sophie %A Gillon,Véronique %A Bernard-Stoecklin,Sibylle %A Crescenzo-Chaigne,Bernadette %A Jones,Gabrielle %A Rose,Nicolas %A van der Werf,Sylvie %A Lantz,Olivier %A Rose,Thierry %A Noël,Harold %+ Epidemiology, Health and Welfare, Laboratory of Ploufragan-Plouzané-Niort, French Agency for Food, Environmental, Occupational Health & Safety, BP 53 - Technopole Saint Brieuc Armor, Ploufragan, 22440, France, 33 296010150, stephanie.bougeard@anses.fr %K SARS-CoV-2 %K serological surveillance %K infection %K vaccination %K machine learning %K seroprevalence %K blood testing %K immunity %K survey %K vaccine response %K French population %K prediction %D 2023 %7 28.11.2023 %9 Original Paper %J JMIR Public Health Surveill %G English %X Background: The seroprevalence of SARS-CoV-2 infection in the French population was estimated with a representative, repeated cross-sectional survey based on residual sera from routine blood testing. 
These data contained no information on infection or vaccination status, thus limiting the ability to detail changes observed in the immunity level of the population over time. Objective: Our aim is to predict the infected or vaccinated status of individuals in the French serosurveillance survey based only on the results of serological assays. Reference data on longitudinal serological profiles of seronegative, infected, and vaccinated individuals from another French cohort were used to build the predictive model. Methods: A model of individual SARS-CoV-2 vaccination or infection status, obtained from a machine learning procedure based on 3 complementary serological assays, was proposed. This model was applied to the French nationwide serosurveillance survey from March 2020 to March 2022 to estimate the proportions of the population that were negative, infected, vaccinated, or infected and vaccinated. Results: From February 2021 to March 2022, the estimated percentage of infected and unvaccinated individuals in France increased from 7.5% to 16.8%. During this period, the estimated percentage increased from 3.6% to 45.2% for vaccinated and uninfected individuals and from 2.1% to 29.1% for vaccinated and infected individuals. The decrease in the seronegative population can be largely attributed to vaccination. Conclusions: Combining results from the serosurveillance survey with more complete data from another longitudinal cohort completes the information retrieved from serosurveillance while keeping its protocol simple and easy to implement. %M 38015594 %R 10.2196/46898 %U https://publichealth.jmir.org/2023/1/e46898 %U https://doi.org/10.2196/46898 %U http://www.ncbi.nlm.nih.gov/pubmed/38015594 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 7 %N %P e47762 %T The Readability and Quality of Web-Based Patient Information on Nasopharyngeal Carcinoma: Quantitative Content Analysis %A Tan,Denise Jia Yun %A Ko,Tsz Ki %A Fan,Ka Siu %+ Department of Surgery, Royal Stoke University Hospital, Newcastle Rd, Stoke on Trent, ST4 6QG, United Kingdom, 44 7378977812, tszkiko95@gmail.com %K nasopharyngeal cancer %K internet information %K readability %K Journal of the American Medical Association %K JAMA %K DISCERN %K artificial intelligence %K AI %D 2023 %7 27.11.2023 %9 Original Paper %J JMIR Form Res %G English %X Background: Nasopharyngeal carcinoma (NPC) is a rare disease that is strongly associated with exposure to the Epstein-Barr virus and is characterized by the formation of malignant cells in nasopharynx tissues. Early diagnosis of NPC is often difficult owing to the location of initial tumor sites and the nonspecificity of initial symptoms, resulting in a higher frequency of advanced-stage diagnoses and a poorer prognosis. Access to high-quality, readable information could improve the early detection of the disease and provide support to patients during disease management. Objective: This study aims to assess the quality and readability of publicly available web-based information in the English language about NPC, using the most popular search engines. Methods: Key terms relevant to NPC were searched across 3 of the most popular internet search engines: Google, Yahoo, and Bing. The top 25 results from each search engine were included in the analysis. Websites that contained text written in languages other than English, required paywall access, targeted medical professionals, or included nontext content were excluded. 
Readability for each website was assessed using the Flesch Reading Ease score and the Flesch-Kincaid grade level. Website quality was assessed using the Journal of the American Medical Association (JAMA) and DISCERN tools as well as the presence of a Health on the Net Foundation seal. Results: Overall, 57 suitable websites were included in this study; 26% (15/57) of the websites were academic. The mean JAMA and DISCERN scores of all websites were 2.80 (IQR 3) and 57.60 (IQR 19), respectively, with medians of 3 (IQR 2-4) and 61 (IQR 49-68). Health care industry websites (n=3) had the highest mean JAMA score of 4 (SD 0). Academic websites (15/57, 26%) had the highest mean DISCERN score of 77.5. The Health on the Net Foundation seal was present on only 1 website, which also achieved a JAMA score of 3 and a DISCERN score of 50. Significant differences were observed between the JAMA score of hospital websites and the scores of industry websites (P=.04), news service websites (P=.048), and charity and nongovernmental organization websites (P=.03). Despite being a vital source for patients, general practitioner websites were found to have significantly lower JAMA scores compared with charity websites (P=.05). The overall mean readability scores reflected an average reading age of 14.3 (SD 1.1) years. Conclusions: The results of this study suggest an inconsistent and suboptimal quality of information related to NPC on the internet. On average, websites presented readability challenges, as written information about NPC was above the recommended reading level of sixth grade. As such, web-based information requires improvement in both quality and accessibility, and health care providers should be selective about information recommended to patients, ensuring that it is reliable and readable. %M 38010802 %R 10.2196/47762 %U https://formative.jmir.org/2023/1/e47762 %U https://doi.org/10.2196/47762 %U http://www.ncbi.nlm.nih.gov/pubmed/38010802 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e37719 %T Identification of Key Influencers for Secondary Distribution of HIV Self-Testing Kits Among Chinese Men Who Have Sex With Men: Development of an Ensemble Machine Learning Approach %A Jing,Fengshi %A Ye,Yang %A Zhou,Yi %A Ni,Yuxin %A Yan,Xumeng %A Lu,Ying %A Ong,Jason %A Tucker,Joseph D %A Wu,Dan %A Xiong,Yuan %A Xu,Chen %A He,Xi %A Huang,Shanzi %A Li,Xiaofeng %A Jiang,Hongbo %A Wang,Cheng %A Dai,Wencan %A Huang,Liqun %A Mei,Wenhua %A Cheng,Weibin %A Zhang,Qingpeng %A Tang,Weiming %+ Institute for Healthcare Artificial Intelligence Application, Guangdong Second Provincial General Hospital, 466 Xingangzhong Road, Guangzhou, 510317, China, 86 15920567132, weiming_tang@med.unc.edu %K HIV self-testing %K machine learning %K MSM %K men who have sex with men %K secondary distribution %K key influencers identification %D 2023 %7 23.11.2023 %9 Original Paper %J J Med Internet Res %G English %X Background: HIV self-testing (HIVST) has been rapidly scaled up and additional strategies further expand testing uptake. Secondary distribution involves people (defined as “indexes”) applying for multiple kits and subsequently sharing them with people (defined as “alters”) in their social networks. However, identifying key influencers is difficult. Objective: This study aimed to develop an innovative ensemble machine learning approach to identify key influencers among Chinese men who have sex with men (MSM) for secondary distribution of HIVST kits. 
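The ensemble step of this approach, combining the 4 base learners named in the Methods that follow, can be sketched with scikit-learn's soft-voting combinator; the synthetic data are placeholders, and soft voting is one plausible combination scheme rather than the authors' confirmed algorithm.

```python
# Sketch of combining 4 base classifiers into a soft-voting ensemble
# (synthetic data; soft voting is one plausible combination scheme).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True)),  # probability=True enables soft vote
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("rf", RandomForestClassifier(random_state=0)),
    ],
    voting="soft",  # average predicted probabilities across the 4 models
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```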
Methods: We defined 3 types of key influencers: (1) key distributors who can distribute more kits, (2) key promoters who can contribute to finding first-time testing alters, and (3) key detectors who can help to find positive alters. Four machine learning models (logistic regression, support vector machine, decision tree, and random forest) were trained to identify key influencers. An ensemble learning algorithm was adopted to combine these 4 models. For comparison with our machine learning models, self-evaluated leadership scales were used as the human identification approach. Four metrics for performance evaluation, including accuracy, precision, recall, and F1-score, were used to evaluate the machine learning models and the human identification approach. Simulation experiments were carried out to validate our approach. Results: We included 309 indexes (our sample size) who were eligible and applied for multiple test kits; they distributed these kits to 269 alters. We compared the performance of the machine learning classification and ensemble learning models with that of the human identification approach based on self-evaluated leadership scales in terms of the 2 nearest cutoffs. Our approach outperformed human identification (based on the cutoff of the self-reported scales), exceeding its accuracy by an average of 11.0%; our approach could distribute 18.2% (95% CI 9.9%-26.5%) more kits and find 13.6% (95% CI 1.9%-25.3%) more first-time testing alters and 12.0% (95% CI –14.7% to 38.7%) more positive-testing alters. Our approach could also increase the simulated intervention’s efficiency by 17.7% (95% CI –3.5% to 38.8%) compared to that of human identification. Conclusions: We built machine learning models to identify key influencers among Chinese MSM who were more likely to engage in secondary distribution of HIVST kits. Trial Registration: Chinese Clinical Trial Registry (ChiCTR) ChiCTR1900025433; https://www.chictr.org.cn/showproj.html?proj=42001 %M 37995110 %R 10.2196/37719 %U https://www.jmir.org/2023/1/e37719 %U https://doi.org/10.2196/37719 %U http://www.ncbi.nlm.nih.gov/pubmed/37995110 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 7 %N %P e44420 %T Patient Journey Toward a Diagnosis of Light Chain Amyloidosis in a National Sample: Cross-Sectional Web-Based Study %A Dou,Xuelin %A Liu,Yang %A Liao,Aijun %A Zhong,Yuping %A Fu,Rong %A Liu,Lihong %A Cui,Canchan %A Wang,Xiaohong %A Lu,Jin %+ Hematology Department, Peking University People's Hospital, 11 Xizhimen South Street, Beijing, 100044, China, 86 13311491805, jin1lu@sina.com.cn %K systemic light chain amyloidosis %K AL amyloidosis %K rare disease %K big data %K network analysis %K machine model %K natural language processing %K web-based %D 2023 %7 2.11.2023 %9 Original Paper %J JMIR Form Res %G English %X Background: Systemic light chain (AL) amyloidosis is a rare and multisystem disease associated with increased morbidity and a poor prognosis. Delayed diagnoses are common due to the heterogeneity of the symptoms. However, real-world insights from Chinese patients with AL amyloidosis have not been investigated. Objective: This study aimed to describe the journey to an AL amyloidosis diagnosis and to build an in-depth understanding of the diagnostic process from the perspective of both clinicians and patients to obtain a correct and timely diagnosis. Methods: Publicly available disease-related content from social media platforms between January 2008 and April 2021 was searched. 
After performing data collection steps with a machine model, a series of disease-related posts was extracted. Natural language processing was used to identify the relevance of variables, followed by further manual evaluation and analysis. Results: A total of 2204 valid posts related to AL amyloidosis were included in this study, of which 1968 were posted on haodf.com. Of these posts, 1284 were posted by men (median age 57, IQR 46-67 years); 1459 posts mentioned renal-related symptoms, followed by heart (n=833), liver (n=491), and stomach (n=368) symptoms. Furthermore, 1502 posts mentioned symptoms related to 2 or more organs. Symptoms for AL amyloidosis most frequently mentioned by suspected patients were nonspecific weakness (n=252), edema (n=196), hypertrophy (n=168), and swelling (n=140). Multiple physician visits were common, and nephrologists (n=265) and hematologists (n=214) were the most frequently visited specialists by suspected patients for initial consultation. Additionally, interhospital referrals were commonly seen and were concentrated in tertiary hospitals. Conclusions: Chinese patients with AL amyloidosis experienced referrals during their journey toward accurate diagnosis. Increasing awareness of the disease and early referral to a specialized center with expertise may reduce delayed diagnosis and improve patient management. %M 37917132 %R 10.2196/44420 %U https://formative.jmir.org/2023/1/e44420 %U https://doi.org/10.2196/44420 %U http://www.ncbi.nlm.nih.gov/pubmed/37917132 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e45085 %T Influenza Epidemic Trend Surveillance and Prediction Based on Search Engine Data: Deep Learning Model Study %A Yang,Liuyang %A Zhang,Ting %A Han,Xuan %A Yang,Jiao %A Sun,Yanxia %A Ma,Libing %A Chen,Jialong %A Li,Yanming %A Lai,Shengjie %A Li,Wei %A Feng,Luzhao %A Yang,Weizhong %+ School of Population Medicine and Public Health, Chinese Academy of Medical Sciences & Peking Union Medical College, 9 Dong Dan San Tiao, Dongcheng District, Beijing, 100730, China, 86 010 65120552, yangweizhong@cams.cn %K early warning %K epidemic intelligence %K infectious disease %K influenza-like illness %K surveillance %D 2023 %7 17.10.2023 %9 Original Paper %J J Med Internet Res %G English %X Background: Influenza outbreaks pose a significant threat to global public health. Traditional surveillance systems and simple algorithms often struggle to predict influenza outbreaks in an accurate and timely manner. Big data and modern technology have offered new modalities for disease surveillance and prediction. Influenza-like illness can serve as a valuable surveillance tool for emerging respiratory infectious diseases like influenza and COVID-19, especially when reported case data may not fully reflect the actual epidemic curve. Objective: This study aimed to develop a predictive model for influenza outbreaks by combining Baidu search query data with traditional virological surveillance data. The goal was to improve early detection and preparedness for influenza outbreaks in both northern and southern China, providing evidence for supplementing modern intelligence epidemic surveillance methods. Methods: We collected virological data from the National Influenza Surveillance Network and Baidu search query data from January 2011 to July 2018, totaling 3,691,865 and 1,563,361 samples, respectively. Relevant search terms related to influenza were identified and analyzed for their correlation with influenza-positive rates using Pearson correlation analysis.
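The Pearson correlation step just described, relating a search term's query volume to the influenza-positive rate, can be computed directly; the weekly series below are invented for illustration:

```python
import numpy as np
from scipy.stats import pearsonr

# Invented weekly series: one search term's volume and the influenza-positive rate.
search_volume = np.array([120, 150, 180, 260, 310, 280, 200, 160])
positive_rate = np.array([0.04, 0.05, 0.07, 0.11, 0.14, 0.12, 0.08, 0.06])

r, p = pearsonr(search_volume, positive_rate)
print(f"Pearson r={r:.2f}, P={p:.3f}")
```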
A distributed lag nonlinear model was used to assess the lag correlation of the search terms with influenza activity. Subsequently, a predictive model based on the gated recurrent unit and multiple attention mechanisms was developed to forecast the influenza-positive trend. Results: This study revealed a high correlation between specific Baidu search terms and influenza-positive rates in both northern and southern China, except for 1 term. The search terms were categorized into 4 groups: essential facts on influenza, influenza symptoms, influenza treatment and medicine, and influenza prevention, all of which showed correlation with the influenza-positive rate. The influenza prevention and influenza symptom groups had a lag correlation of 1.4-3.2 and 5.0-8.0 days, respectively. The Baidu search terms could help predict the influenza-positive rate 14-22 days in advance in southern China but interfered with influenza surveillance in northern China. Conclusions: Complementing traditional disease surveillance systems with information from web-based data sources can aid in detecting warning signs of influenza outbreaks earlier. However, supplementation of modern surveillance with search engine information should be approached cautiously. This approach provides valuable insights for digital epidemiology and has the potential for broader application in respiratory infectious disease surveillance. Further research should explore the optimization and customization of search terms for different regions and languages to improve the accuracy of influenza prediction models. %M 37847532 %R 10.2196/45085 %U https://www.jmir.org/2023/1/e45085 %U https://doi.org/10.2196/45085 %U http://www.ncbi.nlm.nih.gov/pubmed/37847532 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 10 %N %P e49359 %T Identifying Rare Circumstances Preceding Female Firearm Suicides: Validating A Large Language Model Approach %A Zhou,Weipeng %A Prater,Laura C %A Goldstein,Evan V %A Mooney,Stephen J %+ Department of Epidemiology, School of Public Health, University of Washington, Hans Rosling Center for Population Health, 3980 15th Ave NE, Seattle, WA, 98195, United States, 1 206 685 1643, sjm2186@uw.edu %K female firearm suicide %K large language model %K document classification %K suicide prevention %K suicide %K firearm suicide %K machine learning %K mental health for women %K violent death %K mental health %K language models %K women %K female %K depression %K suicidal %D 2023 %7 17.10.2023 %9 Short Paper %J JMIR Ment Health %G English %X Background: Firearm suicide has been more prevalent among males, but age-adjusted female firearm suicide rates increased by 20% from 2010 to 2020, outpacing the rate increase among males by about 8 percentage points, and female firearm suicide may have different contributing circumstances. In the United States, the National Violent Death Reporting System (NVDRS) is a comprehensive source of data on violent deaths and includes unstructured incident narrative reports from coroners or medical examiners and law enforcement. Conventional natural language processing approaches have been used to identify common circumstances preceding female firearm suicide deaths but failed to identify rarer circumstances due to insufficient training data. Objective: This study aimed to leverage a large language model approach to identify infrequent circumstances preceding female firearm suicide in the unstructured coroners or medical examiners and law enforcement narrative reports available in the NVDRS. 
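For the influenza study's forecasting step, the abstract names a gated recurrent unit with attention mechanisms but does not give the architecture; below is a minimal single-head additive-attention sketch in PyTorch, with input dimensions assumed for illustration:

```python
import torch
import torch.nn as nn

class GRUAttentionForecaster(nn.Module):
    """Minimal GRU encoder with additive attention over time steps."""

    def __init__(self, n_features: int, hidden_size: int = 32):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden_size, batch_first=True)
        self.attn = nn.Linear(hidden_size, 1)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_features) search-term and surveillance signals
        h, _ = self.gru(x)                      # (batch, time, hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        context = (w * h).sum(dim=1)            # attention-weighted summary
        return self.head(context).squeeze(-1)   # forecast positive rate

model = GRUAttentionForecaster(n_features=8)
dummy = torch.randn(4, 12, 8)                   # 4 sequences of 12 weeks
print(model(dummy).shape)                       # torch.Size([4])
```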
Methods: We used the narrative reports of 1462 female firearm suicide decedents in the NVDRS from 2014 to 2018. The reports were written in English. We coded 9 infrequent circumstances preceding female firearm suicides. We experimented with predicting those circumstances by leveraging a large language model approach in a yes/no question-answer format. We measured the prediction accuracy with F1-score (ranging from 0 to 1). F1-score is the harmonic mean of precision (positive predictive value) and recall (true positive rate or sensitivity). Results: Our large language model outperformed a conventional support vector machine–supervised machine learning approach by a wide margin. Compared to the support vector machine model, which had F1-scores less than 0.2 for most infrequent circumstances, our large language model approach achieved an F1-score of over 0.6 for 4 circumstances and 0.8 for 2 circumstances. Conclusions: The use of a large language model approach shows promise. Researchers interested in using natural language processing to identify infrequent circumstances in narrative report data may benefit from large language models. %M 37847549 %R 10.2196/49359 %U https://mental.jmir.org/2023/1/e49359 %U https://doi.org/10.2196/49359 %U http://www.ncbi.nlm.nih.gov/pubmed/37847549 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e42758 %T Conversational AI and Vaccine Communication: Systematic Review of the Evidence %A Passanante,Aly %A Pertwee,Ed %A Lin,Leesa %A Lee,Kristi Yoonsup %A Wu,Joseph T %A Larson,Heidi J %+ Department of Infectious Disease Epidemiology, London School of Hygiene & Tropical Medicine, Keppel Street, London, WC1E 7HT, United Kingdom, 44 2076368636, aly.passanante@lshtm.ac.uk %K chatbots %K artificial intelligence %K conversational AI %K vaccine communication %K vaccine hesitancy %K conversational agent %K COVID-19 %K vaccine information %K health information %D 2023 %7 3.10.2023 %9 Review %J J Med Internet Res %G English %X Background: Since the mid-2010s, use of conversational artificial intelligence (AI; chatbots) in health care has expanded significantly, especially in the context of increased burdens on health systems and restrictions on in-person consultations with health care providers during the COVID-19 pandemic. One emerging use for conversational AI is to capture evolving questions and communicate information about vaccines and vaccination. Objective: The objective of this systematic review was to examine documented uses and evidence on the effectiveness of conversational AI for vaccine communication. Methods: This systematic review was conducted following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. PubMed, Web of Science, PsycINFO, MEDLINE, Scopus, CINAHL Complete, Cochrane Library, Embase, Epistemonikos, Global Health, Global Index Medicus, Academic Search Complete, and the University of London library database were searched for papers on the use of conversational AI for vaccine communication. The inclusion criteria were studies that included (1) documented instances of conversational AI being used for the purpose of vaccine communication and (2) evaluation data on the impact and effectiveness of the intervention. Results: After duplicates were removed, the review identified 496 unique records, which were then screened by title and abstract, of which 38 were identified for full-text review. Seven fit the inclusion criteria and were assessed and summarized in the findings of this review. 
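The firearm suicide study above frames each infrequent circumstance as a yes/no question to a large language model and scores predictions with F1. A sketch of that framing follows; ask_llm is a hypothetical placeholder, not a real API, and would need to be replaced with an actual model call:

```python
from sklearn.metrics import f1_score

def build_prompt(narrative: str, circumstance: str) -> str:
    """Frame circumstance coding as a yes/no question about one narrative."""
    return (
        f"Narrative: {narrative}\n"
        f"Question: Does this narrative indicate {circumstance}? "
        "Answer yes or no."
    )

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for a large language model call."""
    raise NotImplementedError("replace with a real model endpoint")

def score_circumstance(narratives, gold_labels, circumstance):
    """F1 (harmonic mean of precision and recall) for one circumstance."""
    preds = [
        ask_llm(build_prompt(n, circumstance)).strip().lower().startswith("yes")
        for n in narratives
    ]
    return f1_score(gold_labels, preds)
```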
Overall, vaccine chatbots deployed to date have been relatively simple in their design and have mainly been used to provide factual information to users in response to their questions about vaccines. Additionally, chatbots have been used for vaccination scheduling, appointment reminders, debunking misinformation, and, in some cases, for vaccine counseling and persuasion. Available evidence suggests that chatbots can have a positive effect on vaccine attitudes; however, studies were typically exploratory in nature, and some lacked a control group or had very small sample sizes. Conclusions: The review found evidence of potential benefits from conversational AI for vaccine communication. Factors that may contribute to the effectiveness of vaccine chatbots include their ability to provide credible and personalized information in real time, the familiarity and accessibility of the chatbot platform, and the extent to which interactions with the chatbot feel “natural” to users. However, evaluations have focused on the short-term, direct effects of chatbots on their users. The potential longer-term and societal impacts of conversational AI have yet to be analyzed. In addition, existing studies do not adequately address how ethics apply in the field of conversational AI around vaccines. In a context where further digitalization of vaccine communication can be anticipated, additional high-quality research will be required across all these areas. %M 37788057 %R 10.2196/42758 %U https://www.jmir.org/2023/1/e42758 %U https://doi.org/10.2196/42758 %U http://www.ncbi.nlm.nih.gov/pubmed/37788057 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 7 %N %P e49898 %T Parkinson Disease Recognition Using a Gamified Website: Machine Learning Development and Usability Study %A Parab,Shubham %A Boster,Jerry %A Washington,Peter %+ Department of Information & Computer Sciences, University of Hawaii at Manoa, 2500 Campus Rd, Honolulu, HI, 96822, United States, 1 1 512 680 0926, pyw@hawaii.edu %K Parkinson disease %K digital health %K machine learning %K remote screening %K accessible screening %D 2023 %7 29.9.2023 %9 Original Paper %J JMIR Form Res %G English %X Background: Parkinson disease (PD) affects millions globally, causing motor function impairments. Early detection is vital, and diverse data sources aid diagnosis. We focus on lower arm movements during keyboard and trackpad or touchscreen interactions, which serve as reliable indicators of PD. Previous works explore keyboard tapping and unstructured device monitoring; we attempt to further these works with structured tests taking into account 2D hand movement in addition to finger tapping. Our feasibility study uses keystroke and mouse movement data from a remotely conducted, structured, web-based test combined with self-reported PD status to create a predictive model for detecting the presence of PD. Objective: Analysis of finger tapping speed and accuracy through keyboard input and analysis of 2D hand movement through mouse input allowed differentiation between participants with and without PD. This comparative analysis enables us to establish clear distinctions between the two groups and explore the feasibility of using motor behavior to predict the presence of the disease. Methods: Participants were recruited via email by the Hawaii Parkinson Association (HPA) and directed to a web application for the tests. The 2023 HPA symposium was also used as a forum to recruit participants and spread information about our study. 
The application recorded participant demographics, including age, gender, and race, as well as PD status. We conducted a series of tests to assess finger tapping, using on-screen prompts to request key presses of constant and random keys. Response times, accuracy, and unintended movements resulting in accidental presses were recorded. Participants performed a hand movement test consisting of tracing straight and curved on-screen ribbons using a trackpad or mouse, allowing us to evaluate stability and precision of 2D hand movement. From this tracing, the test collected and stored insights concerning lower arm motor movement. Results: Our formative study included 31 participants, 18 without PD and 13 with PD, and analyzed their lower arm movement data collected from keyboards and computer mice. From the data set, we extracted 28 features and evaluated their significance using an extra trees classifier. A random forest model was trained using the 6 most important features identified by the extra trees classifier. These selected features provided insights into precision and movement speed derived from keyboard tapping and mouse tracing tests. This final model achieved an average F1-score of 0.7311 (SD 0.1663) and an average accuracy of 0.7429 (SD 0.1400) over 20 runs for predicting the presence of PD. Conclusions: This preliminary feasibility study suggests the possibility of using technology-based limb movement data to predict the presence of PD, demonstrating the practicality of implementing this approach in a cost-effective and accessible manner. In addition, this study demonstrates that structured mouse movement tests can be used in combination with finger tapping to detect PD. %M 37773607 %R 10.2196/49898 %U https://formative.jmir.org/2023/1/e49898 %U https://doi.org/10.2196/49898 %U http://www.ncbi.nlm.nih.gov/pubmed/37773607 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e45019 %T Hot Topic Recognition of Health Rumors Based on Anti-Rumor Articles on the WeChat Official Account Platform: Topic Modeling %A Li,Ziyu %A Wu,Xiaoqian %A Xu,Lin %A Liu,Ming %A Huang,Cheng %+ Chongqing Medical University, College of Medical Informatics, No.1 Medical College Road, Yuzhong District, Chongqing, 400016, China, 86 023 6848 0060, huangcheng@cqmu.edu.cn %K topic model %K health rumors %K social media %K WeChat official account %K content analysis %K public health %K machine learning %K Twitter %K social network %K misinformation %K users %K public health %K disease %K diet %D 2023 %7 21.9.2023 %9 Original Paper %J J Med Internet Res %G English %X Background: Social networks have become one of the main channels for obtaining health information. However, they have also become a source of health-related misinformation, which seriously threatens the public’s physical and mental health. Governance of health-related misinformation can be implemented through topic identification of rumors on social networks. However, little attention has been paid to studying the types and routes of dissemination of health rumors on the internet, especially rumors regarding health-related information in Chinese social media. Objective: This study aims to explore the types of health-related misinformation favored by WeChat public platform users and their prevalence trends and to analyze the modeling results of the text by using the Latent Dirichlet Allocation model. Methods: We used a web crawler tool to capture health rumor–dispelling articles on WeChat rumor-dispelling public accounts.
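The Parkinson study's pipeline, ranking 28 features with an extra trees classifier and training a random forest on the top 6, can be sketched as follows; the data here are random placeholders, not the study's keystroke and mouse features:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(31, 28))      # 31 participants x 28 extracted features
y = rng.integers(0, 2, size=31)    # placeholder PD labels

# Rank features with an extra trees classifier; keep the 6 most important.
ranker = ExtraTreesClassifier(n_estimators=500, random_state=0).fit(X, y)
top6 = np.argsort(ranker.feature_importances_)[-6:]

# Train a random forest on the selected features; estimate F1 by cross-validation.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(rf, X[:, top6], y, cv=5, scoring="f1")
print(f"F1 {scores.mean():.3f} (SD {scores.std():.3f})")
```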
We collected information from health-debunking articles posted between January 1, 2016, and August 31, 2022. Following word segmentation of the collected text, a document topic generation model called Latent Dirichlet Allocation was used to identify and generalize the most common topics. The proportion distribution of the themes was calculated, and the negative impact of various health rumors in different periods was analyzed. Additionally, the prevalence of health rumors was analyzed by the number of health rumors generated at each time point. Results: We collected 9366 rumor-refuting articles from January 1, 2016, to August 31, 2022, from WeChat official accounts. Through topic modeling, we divided the health rumors into 8 topics, that is, rumors on prevention and treatment of infectious diseases (1284/9366, 13.71%), disease therapy and its effects (1037/9366, 11.07%), food safety (1243/9366, 13.27%), cancer and its causes (946/9366, 10.10%), regimen and disease (1540/9366, 16.44%), transmission (914/9366, 9.76%), healthy diet (1068/9366, 11.40%), and nutrition and health (1334/9366, 14.24%). Furthermore, we summarized the 8 topics under 4 themes, that is, public health, disease, diet and health, and spread of rumors. Conclusions: Our study shows that topic modeling can provide analysis and insights into health rumor governance. The rumor development trends showed that most rumors were on public health, disease, and diet and health problems. Governments still need to implement relevant and comprehensive rumor management strategies based on the rumors prevalent in their countries and formulate appropriate policies. Apart from regulating the content disseminated on social media platforms, the quality of national health education should also be improved. Governance of social networks should be clearly implemented, as these rapidly developed platforms come with privacy issues. Both disseminators and receivers of information should maintain a realistic attitude and disseminate health information correctly. In addition, we recommend that sentiment analysis–related studies be conducted to verify the impact of health rumor–related topics. %M 37733396 %R 10.2196/45019 %U https://www.jmir.org/2023/1/e45019 %U https://doi.org/10.2196/45019 %U http://www.ncbi.nlm.nih.gov/pubmed/37733396 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e46523 %T Bot or Not? Detecting and Managing Participant Deception When Conducting Digital Research Remotely: Case Study of a Randomized Controlled Trial %A Loebenberg,Gemma %A Oldham,Melissa %A Brown,Jamie %A Dinu,Larisa %A Michie,Susan %A Field,Matt %A Greaves,Felix %A Garnett,Claire %+ UCL Tobacco and Alcohol Research Group, University College London, 1-19 Torrington Place, London, WC1E 7HB, United Kingdom, 44 20 7679 8781, gemma.loebenberg@ucl.ac.uk %K artificial intelligence %K false information %K mHealth applications %K participant deception %K participant %K recruit %K research subject %K web-based studies %D 2023 %7 14.9.2023 %9 Original Paper %J J Med Internet Res %G English %X Background: Evaluating digital interventions using remote methods enables the recruitment of large numbers of participants relatively conveniently and cheaply compared with in-person methods. However, conducting research remotely based on participant self-report with little verification is open to automated “bots” and participant deception.
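A pipeline like the rumor study's, Chinese word segmentation followed by an 8-topic Latent Dirichlet Allocation, might look as follows; the study's exact tooling is not stated, so jieba and scikit-learn are assumptions, and the two toy posts are invented:

```python
import jieba
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

posts = [
    "网传喝柠檬水可以预防癌症，专家辟谣称缺乏证据。",
    "吃隔夜菜会致癌的说法并不准确，关键在于储存条件。",
]  # invented rumor-debunking snippets; the study used 9366 articles

# Segment Chinese text into space-joined tokens so CountVectorizer can split it.
segmented = [" ".join(jieba.lcut(p)) for p in posts]

vectorizer = CountVectorizer(max_features=5000)
dtm = vectorizer.fit_transform(segmented)

# 8 topics, matching the number reported in the study.
lda = LatentDirichletAllocation(n_components=8, random_state=0).fit(dtm)
vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", " ".join(vocab[i] for i in topic.argsort()[-5:]))
```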
Objective: This paper uses a case study of a remotely conducted trial of an alcohol reduction app to highlight and discuss (1) the issues with participant deception affecting remote research trials with financial compensation; and (2) the importance of rigorous data management to detect and address these issues. Methods: We recruited participants on the internet from July 2020 to March 2022 for a randomized controlled trial (n=5602) evaluating the effectiveness of an alcohol reduction app, Drink Less. Follow-up occurred at 3 time points, with financial compensation offered (up to £36 [US $39.23]). Address authentication and telephone verification were used to detect 2 kinds of deception: “bots,” that is, automated responses generated in clusters; and manual participant deception, that is, participants providing false information. Results: Of the 1142 participants who enrolled in the first 2 months of recruitment, 75.6% (n=863) were identified as bots during data screening. As a result, a CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) was added, and after this, no more bots were identified. Manual participant deception occurred throughout the study. Of the 5956 participants (excluding bots) who enrolled in the study, 298 (5%) were identified as false participants. The number of false participants decreased from 110 in November 2020 to a negligible level by February 2022, including several months with none. The decline occurred after we added further screening questions such as attention checks, removed the prominence of financial compensation from social media advertising, and added an additional requirement to provide a mobile phone number for identity verification. Conclusions: Data management protocols are necessary to detect automated bots and manual participant deception in remotely conducted trials. Bots and manual deception can be minimized by adding a CAPTCHA, attention checks, a requirement to provide a phone number for identity verification, and not prominently advertising financial compensation on social media. Trial Registration: ISRCTN Number ISRCTN64052601; https://doi.org/10.1186/ISRCTN64052601 %M 37707943 %R 10.2196/46523 %U https://www.jmir.org/2023/1/e46523 %U https://doi.org/10.2196/46523 %U http://www.ncbi.nlm.nih.gov/pubmed/37707943 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 7 %N %P e42756 %T Identification of Risk Groups for and Factors Affecting Metabolic Syndrome in South Korean Single-Person Households Using Latent Class Analysis and Machine Learning Techniques: Secondary Analysis Study %A Lee,Ji-Soo %A Lee,Soo-Kyoung %+ Big Data Convergence and Open Sharing System, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, 08826, Republic of Korea, 82 2 889 5710, soo1005s@gmail.com %K latent class analysis %K machine learning %K metabolic syndrome %K risk factor %K single-person households %D 2023 %7 12.9.2023 %9 Original Paper %J JMIR Form Res %G English %X Background: The rapid increase of single-person households in South Korea is leading to an increase in the incidence of metabolic syndrome, which causes cardiovascular and cerebrovascular diseases, due to lifestyle changes. It is necessary to analyze the complex effects of metabolic syndrome risk factors in South Korean single-person households, which differ from one household to another, considering the diversity of single-person households.
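Screening rules of the kind the Drink Less trial describes can be expressed as simple data checks. The heuristic below (burst sign-ups, reused phone numbers) is illustrative only and is not the trial's actual verification protocol; the column names are assumptions:

```python
import pandas as pd

def flag_suspicious(enrolments: pd.DataFrame) -> pd.DataFrame:
    """Toy screen: flag burst sign-ups and reused phone numbers.

    Expects columns 'timestamp' (datetime64), 'phone', and 'postcode'.
    Illustrative only; not the trial's actual verification protocol.
    """
    df = enrolments.sort_values("timestamp").copy()
    # Sign-ups arriving within a minute of the previous one suggest automation.
    df["burst"] = df["timestamp"].diff().dt.total_seconds().lt(60)
    # The same phone number across records suggests manual deception.
    df["dup_phone"] = df.duplicated("phone", keep=False)
    df["flagged"] = df["burst"] | df["dup_phone"]
    return df
```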
Objective: This study aimed to identify the factors affecting metabolic syndrome in single-person households using machine learning techniques and categorically characterize the risk factors through latent class analysis (LCA). Methods: This cross-sectional study included 10-year secondary data obtained from the National Health and Nutrition Examination Survey (2009-2018). We selected 1371 participants belonging to single-person households. Data were analyzed using SPSS (version 25.0; IBM Corp), Mplus (version 8.0; Muthen & Muthen), and Python (version 3.0; Plone & Python). We applied 4 machine learning algorithms (logistic regression, decision tree, random forest, and extreme gradient boost) to identify important factors and then applied LCA to categorize the risk groups of metabolic syndromes in single-person households. Results: Through LCA, participants were classified into 4 groups (group 1: intense physical activity in early adulthood, group 2: hypertension among middle-aged female respondents, group 3: smoking and drinking among middle-aged male respondents, and group 4: obesity and abdominal obesity among middle-aged respondents). In addition, age, BMI, obesity, subjective body shape recognition, alcohol consumption, smoking, binge drinking frequency, and job type were investigated as common factors that affect metabolic syndrome in single-person households through machine learning techniques. Group 4 was the most susceptible and at-risk group for metabolic syndrome (odds ratio 17.67, 95% CI 14.5-25.3; P<.001), and obesity and abdominal obesity were the most influential risk factors for metabolic syndrome. Conclusions: This study identified risk groups and factors affecting metabolic syndrome in single-person households through machine learning techniques and LCA. Through these findings, customized interventions for each generational risk factor for metabolic syndrome can be implemented, leading to the prevention of metabolic syndrome, which causes cardiovascular and cerebrovascular diseases. In conclusion, this study contributes to the prevention of metabolic syndrome in single-person households by providing new insights and priority groups for the development of customized interventions using classification. %M 37698907 %R 10.2196/42756 %U https://formative.jmir.org/2023/1/e42756 %U https://doi.org/10.2196/42756 %U http://www.ncbi.nlm.nih.gov/pubmed/37698907 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48254 %T Assessing Health Students' Attitudes and Usage of ChatGPT in Jordan: Validation Study %A Sallam,Malik %A Salim,Nesreen A %A Barakat,Muna %A Al-Mahzoum,Kholoud %A Al-Tammemi,Ala'a B %A Malaeb,Diana %A Hallit,Rabih %A Hallit,Souheil %+ Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Queen Rania Al-Abdullah Street-Aljubeiha, Amman, 11942, Jordan, 962 0791845186, malik.sallam@ju.edu.jo %K artificial intelligence %K machine learning %K education %K technology %K healthcare %K survey %K opinion %K knowledge %K practices %K KAP %D 2023 %7 5.9.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT is a conversational large language model that has the potential to revolutionize knowledge acquisition. However, the impact of this technology on the quality of education is still unknown considering the risks and concerns surrounding ChatGPT use. Therefore, it is necessary to assess the usability and acceptability of this promising tool. 
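Group-level risk in the metabolic syndrome study is reported as an odds ratio with a 95% CI; one standard way to obtain such an estimate is a logistic regression on a group-membership indicator, sketched here with simulated data rather than the survey sample:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1371                                     # sample size in the study
group4 = rng.integers(0, 2, size=n)          # 1 = member of risk group 4
# Simulated outcome loosely tied to group membership (placeholder only).
mets = (1.5 * group4 + rng.normal(size=n) > 1.0).astype(int)

X = sm.add_constant(group4.astype(float))
fit = sm.Logit(mets, X).fit(disp=0)
or_est = np.exp(fit.params[1])               # odds ratio for group membership
ci_low, ci_high = np.exp(fit.conf_int()[1])  # 95% CI on the OR scale
print(f"OR {or_est:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```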
Because ChatGPT is an innovative technology, the intention to use it can be studied in the context of the technology acceptance model (TAM). Objective: This study aimed to develop and validate a TAM-based survey instrument called TAME-ChatGPT (Technology Acceptance Model Edited to Assess ChatGPT Adoption) that could be employed to examine the successful integration and use of ChatGPT in health care education. Methods: The survey tool was created based on the TAM framework. It comprised 13 items for participants who had heard of ChatGPT but had not used it and 23 items for participants who had used ChatGPT. Using a convenience sampling approach, we circulated the survey link electronically among university students between February and March 2023. Exploratory factor analysis (EFA) was used to assess the construct validity of the survey instrument. Results: The final sample comprised 458 respondents, most of them undergraduate students (n=442, 96.5%). Only 109 (23.8%) respondents had heard of ChatGPT prior to participation and only 55 (11.3%) self-reported ChatGPT use before the study. EFA of the attitude and usage scales showed significant Bartlett tests of sphericity (P<.001) and adequate Kaiser-Meyer-Olkin measures (0.823 for the attitude scale and 0.702 for the usage scale), confirming the factorability of the correlation matrices. The EFA showed that 3 constructs explained a cumulative total of 69.3% variance in the attitude scale, and these subscales represented perceived risks, attitude to technology/social influence, and anxiety. For the ChatGPT usage scale, EFA showed that 4 constructs explained a cumulative total of 72% variance in the data and comprised perceived usefulness, perceived risks, perceived ease of use, and behavior/cognitive factors. All the ChatGPT attitude and usage subscales showed good reliability, with Cronbach α values >.78. Conclusions: The TAME-ChatGPT demonstrated good reliability, validity, and usefulness in assessing health care students’ attitudes toward ChatGPT. The findings highlighted the importance of considering risk perceptions, usefulness, ease of use, attitudes toward technology, and behavioral factors when adopting ChatGPT as a tool in health care education. This information can aid the stakeholders in creating strategies to support the optimal and ethical use of ChatGPT and to identify the potential challenges hindering its successful implementation. Future research is recommended to guide the effective adoption of ChatGPT in health care education. %M 37578934 %R 10.2196/48254 %U https://mededu.jmir.org/2023/1/e48254 %U https://doi.org/10.2196/48254 %U http://www.ncbi.nlm.nih.gov/pubmed/37578934 %0 Journal Article %@ 1947-2579 %I JMIR Publications %V 15 %N %P e50934 %T Framework for Classifying Explainable Artificial Intelligence (XAI) Algorithms in Clinical Medicine %A Gniadek,Thomas %A Kang,Jason %A Theparee,Talent %A Krive,Jacob %+ Department of Biomedical and Health Information Sciences, University of Illinois at Chicago, 1919 W Taylor St 233 AHSB, MC-530, Chicago, IL, 60612, United States, 1 312 996 1445, krive@uic.edu %K explainable artificial intelligence %K XAI %K artificial intelligence %K AI %K AI medicine %K pathology informatics %K radiology informatics %D 2023 %7 1.9.2023 %9 Viewpoint %J Online J Public Health Inform %G English %X Artificial intelligence (AI) applied to medicine offers immense promise alongside safety and regulatory concerns.
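The EFA workflow reported for TAME-ChatGPT (Bartlett test of sphericity, Kaiser-Meyer-Olkin measure, factor extraction) can be reproduced with the factor_analyzer package; the data below are random placeholders standing in for the 13 attitude items:

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity,
    calculate_kmo,
)

rng = np.random.default_rng(0)
# Placeholder responses: 458 respondents x 13 attitude items.
items = pd.DataFrame(rng.normal(size=(458, 13)))

chi2, p = calculate_bartlett_sphericity(items)   # factorability check
kmo_per_item, kmo_total = calculate_kmo(items)   # sampling adequacy
print(f"Bartlett chi2={chi2:.1f}, P={p:.3f}; KMO={kmo_total:.3f}")

fa = FactorAnalyzer(n_factors=3, rotation="varimax")
fa.fit(items)
print(fa.loadings_.round(2))                     # item-factor loadings
```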
Traditional AI produces a core algorithm result, typically without a measure of statistical confidence or an explanation of its biological-theoretical basis. Efforts are underway to develop explainable AI (XAI) algorithms that not only produce a result but also an explanation to support that result. Here we present a framework for classifying XAI algorithms applied to clinical medicine: An algorithm’s clinical scope is defined by whether the core algorithm output leads to observations (eg, tests, imaging, clinical evaluation), interventions (eg, procedures, medications), diagnoses, or prognostication. Explanations are classified by whether they provide empiric statistical information, association with a historical population or populations, or association with an established disease mechanism or mechanisms. XAI implementations can be classified based on whether algorithm training and validation took into account the actions of health care providers in response to the insights and explanations provided or whether training was performed using only the core algorithm output as the end point. Finally, communication modalities used to convey an XAI explanation can be used to classify algorithms and may affect clinical outcomes. This framework can be used when designing, evaluating, and comparing XAI algorithms applied to medicine. %M 38046562 %R 10.2196/50934 %U https://ojphi.jmir.org/2023/1/e50934 %U https://doi.org/10.2196/50934 %U http://www.ncbi.nlm.nih.gov/pubmed/38046562 %0 Journal Article %@ 1947-2579 %I JMIR Publications %V 14 %N 1 %P e12851 %T Roles of Health Literacy in Relation to Social Determinants of Health and Recommendations for Informatics-Based Interventions: Systematic Review %D 2022 %7 ..2022 %9 %J Online J Public Health Inform %G English %X Objective: There is a low rate of online patient portal utilization in the U.S. This study aimed to utilize a machine learning approach to predict access to online medical records through a patient portal. Methods: This is a cross-sectional predictive machine learning algorithm-based study of Health Information National Trends datasets (Cycles 1 and 2; 2017-2018 samples). Survey respondents were U.S. adults (≥18 years old). The primary outcome was a binary variable indicating that the patient had or had not accessed online medical records in the previous 12 months. We analyzed a subset of independent variables using k-means clustering with replicate samples. A cross-validated random forest-based algorithm was utilized to select features for a Cycle 1 split training sample. A logistic regression and an evolved decision tree were trained on the rest of the Cycle 1 training sample. The Cycle 1 test sample and Cycle 2 data were used to benchmark algorithm performance. Results: Lack of access to online systems was less of a barrier to online medical records in 2018 (14%) compared to 2017 (26%). Patients accessed medical records to refill medicines and message primary care providers more frequently in 2018 (45%) than in 2017 (25%). Discussion: Privacy concerns, portal knowledge, and conversations between primary care providers and patients predict portal access. Conclusion: Methods described here may be employed to personalize methods of patient engagement during new patient registration.
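The patient portal study's methods (k-means clustering with replicates, then tree-based classification) can be sketched with scikit-learn; a plain CART tree stands in for the paper's evolved decision tree, and the data are simulated rather than survey responses:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 12))        # placeholder survey-derived features
y = rng.integers(0, 2, size=3000)      # 1 = accessed records in past 12 months

# k-means with many random restarts, a simple analogue of replicate samples.
clusters = KMeans(n_clusters=4, n_init=25, random_state=0).fit_predict(X)
print(np.bincount(clusters))           # cluster sizes

# A plain CART tree stands in for the paper's evolved decision tree.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(max_depth=4).fit(X_tr, y_tr)
print(f"held-out accuracy: {tree.score(X_te, y_te):.3f}")
```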
%M 36685053 %R 10.5210/ojphi.v14i1.12851 %U %U https://doi.org/10.5210/ojphi.v14i1.12851 %U http://www.ncbi.nlm.nih.gov/pubmed/36685053 %0 Journal Article %@ 1947-2579 %I JMIR Publications %V 9 %N 1 %P e7605 %T Roles of Health Literacy in Relation to Social Determinants of Health and Recommendations for Informatics-Based Interventions: Systematic Review %D 2017 %7 ..2017 %9 %J Online J Public Health Inform %G English %X Objective: To explain the utility of using an automated syndromic surveillance program with advanced natural language processing (NLP) to improve clinical quality measures reporting for influenza immunization. Introduction: Clinical quality measures (CQMs) are tools that help measure and track the quality of health care services. Measuring and reporting CQMs helps to ensure that our health care system is delivering effective, safe, efficient, patient-centered, equitable, and timely care. The CQM for influenza immunization measures the percentage of patients aged 6 months and older seen for a visit between October 1 and March 31 who received (or reported previous receipt of) an influenza immunization. The Centers for Disease Control and Prevention recommends that everyone 6 months of age and older receive an influenza immunization every season, which can reduce influenza-related morbidity, mortality, and hospitalizations. Methods: Patients at a large academic medical center who had a visit to an affiliated outpatient clinic during June 1-8, 2016 were initially identified using their electronic medical record (EMR). The 2,543 patients who were selected did not have documentation of influenza immunization in a discrete field of the EMR. All free text notes for these patients between August 1, 2015 and March 31, 2016 were retrieved and analyzed using the sophisticated NLP built within Geographic Utilization of Artificial Intelligence in Real-Time for Disease Identification and Alert Notification (GUARDIAN), a syndromic surveillance program, to identify any mention of influenza immunization. The goal was to identify additional cases that met the CQM measure for influenza immunization and to distinguish documented exceptions. The patients with influenza immunization mentioned were further categorized by the GUARDIAN NLP into Received, Recommended, Refused, Allergic, and Unavailable. If more than one category was applicable for a patient, they were independently counted in their respective categories. A descriptive analysis was conducted, along with manual review of a sample of cases per category. Results: For the 2,543 patients who did not have influenza immunization documentation in a discrete field of the EMR, a total of 78,642 free text notes were processed using GUARDIAN. Four hundred fifty-three (17.8%) patients had some mention of influenza immunization within the notes, which could potentially be utilized to meet the CQM influenza immunization requirement. Twenty-two percent (n=101) of patients mentioned already having received the immunization, while 34.7% (n=157) refused it during the study time frame. There were 27 patients with a mention of influenza immunization who could not be differentiated into a specific category. The number of patients placed into a single category of influenza immunization was 351 (77.5%), while 75 (16.6%) were classified into more than one category. See Table 1. Conclusions: Using GUARDIAN’s NLP can identify additional patients who may meet the CQM measure for influenza immunization or who may be exempt.
This tool can be used to improve CQM reporting and improve overall influenza immunization coverage by using it to alert providers. Next steps involve further refinement of influenza immunization categories, automating the process of using the NLP to identify and report additional cases, as well as using the NLP for other CQMs. Table 1. Categorization of influenza immunization documentation within free text notes of 453 patients using NLP %R 10.5210/ojphi.v9i1.7605 %U %U https://doi.org/10.5210/ojphi.v9i1.7605 %0 Journal Article %@ 1947-2579 %I JMIR Publications %V 9 %N 1 %P e7650 %T Roles of Health Literacy in Relation to Social Determinants of Health and Recommendations for Informatics-Based Interventions: Systematic Review %D 2017 %7 ..2017 %9 %J Online J Public Health Inform %G English %X Objective: To evaluate prediction of laboratory diagnosis of acute respiratory infection (ARI) from participatory data using machine learning models. Introduction: ARIs have epidemic and pandemic potential. Prediction of the presence of ARIs from individual signs and symptoms in existing studies has been based on clinically sourced data [1]. Clinical data generally represent the most severe cases, and those from locations with access to health care institutions. Thus, the viral information that comes from clinical sampling is insufficient to either capture disease incidence in general populations or its predictability from symptoms. Participatory data — information that individuals today can produce on their own — enabled by the ubiquity of digital tools, can help fill this gap by providing self-reported data from the community. Internet-based participatory efforts such as Flu Near You [2] have augmented existing ARI surveillance through early and widespread detection of outbreaks and public health trends. Methods: The GoViral platform [3] was established to obtain self-reported symptoms and diagnostic specimens from the community (Table 1 summarizes participation detail). Participants from the states with the most data (MA, NY, CT, NH, and CA) were included. Age, gender, zip code, and vaccination status were requested from each participant. Participants submitted saliva and nasal swab specimens and reported symptoms from: fever, cough, sore throat, shortness of breath, chills, fatigue, body aches, headache, nausea, and diarrhea. Pathogens were confirmed via RT-PCR on a GenMark respiratory panel assay (full virus list reported previously [3]). Observations with missing, invalid, or equivocal lab tests were removed. Table 2 summarizes the binary features. Age categories were ≤20, >20 and <40, and ≥40, to represent young, middle-aged, and old. Missing age and gender values were imputed based on overall distributions. Three machine learning algorithms — Support Vector Machines (SVMs) [4], Random Forests (RFs) [5], and Logistic Regression (LR) — were considered. Both individual features and their combinations were assessed. The outcome was the presence (1) or absence (0) of laboratory diagnosis of ARI. Results: Ten-fold cross-validation was repeated ten times. Evaluation metrics used were positive predictive value (PPV), negative predictive value (NPV), sensitivity, and specificity [6]. LR and SVMs yielded the best PPV of 0.64 (standard deviation ±0.08) with cough and fever as predictors. The best sensitivity of 0.59 (±0.14) was from LR using cough, fever, and sore throat. RFs had the best NPV and specificity of 0.62 (±0.15) and 0.83 (±0.10), respectively, with the CDC ILI symptom profile of fever and (cough or sore throat).
Adding demographics and vaccination status did not improve performance of the classifiers. Results are consistent with studies using clinically sourced data: cough and fever together were found to be the best predictors of flu-like illness [1]. Because our data include mildly infectious and asymptomatic cases, the classifier sensitivity and PPV are low compared to results from clinical data. Conclusions: Evidence of fever and cough together is a good predictor of ARI in the community, but clinical data may overestimate this due to sampling bias. Integration of participatory data can not only improve population health by actively engaging the general public [2] but also improve the scope of studies solely based on clinically sourced surveillance data. Table 1. Details of included participants. Table 2. Coding of binary features %R 10.5210/ojphi.v9i1.7650 %U %U https://doi.org/10.5210/ojphi.v9i1.7650
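The ARI study's evaluation, 10-fold cross-validation repeated 10 times with PPV, NPV, sensitivity, and specificity, can be sketched as follows; the binary symptom matrix and labels are simulated rather than GoViral data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 10)).astype(float)    # binary symptom flags
logits = X[:, 0] + X[:, 1] - 1.0 + rng.normal(scale=0.5, size=500)
y = (logits > 0).astype(int)                            # synthetic ARI labels

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
rows = []
for train, test in cv.split(X, y):
    model = LogisticRegression().fit(X[train], y[train])
    tn, fp, fn, tp = confusion_matrix(y[test], model.predict(X[test])).ravel()
    rows.append((tp / (tp + fp), tn / (tn + fn),        # PPV, NPV
                 tp / (tp + fn), tn / (tn + fp)))       # sensitivity, specificity
ppv, npv, sens, spec = np.mean(rows, axis=0)
print(f"PPV {ppv:.2f}  NPV {npv:.2f}  sensitivity {sens:.2f}  specificity {spec:.2f}")
```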