TY - JOUR
T1 - Assessment of Large Language Models (LLMs) in decision-making support for gynecologic oncology
AU - Gumilar, Khanisyah Erza
AU - Indraprasta, Birama R.
AU - Faridzi, Ach Salman
AU - Wibowo, Bagus M.
AU - Herlambang, Aditya
AU - Rahestyningtyas, Eccita
AU - Irawan, Budi
AU - Tambunan, Zulkarnain
AU - Bustomi, Ahmad Fadhli
AU - Brahmantara, Bagus Ngurah
AU - Yu, Zih Ying
AU - Hsu, Yu Cheng
AU - Pramuditya, Herlangga
AU - Putra, Very Great E.
AU - Nugroho, Hari
AU - Mulawardhana, Pungky
AU - Tjokroprawiro, Brahmana A.
AU - Hedianto, Tri
AU - Ibrahim, Ibrahim H.
AU - Huang, Jingshan
AU - Li, Dongqi
AU - Lu, Chien Hsing
AU - Yang, Jer Yen
AU - Liao, Li Na
AU - Tan, Ming
N1 - Publisher Copyright: © 2024 The Authors
PY - 2024/12
Y1 - 2024/12
N2 - Objective: This study investigated the ability of Large Language Models (LLMs) to provide accurate and consistent answers by focusing on their performance in complex gynecologic cancer cases. Background: LLMs are advancing rapidly and require a thorough evaluation to ensure that they can be safely and effectively used in clinical decision-making. Such evaluations are essential for confirming LLM reliability and accuracy in supporting medical professionals in casework. Study design: We assessed three prominent LLMs—ChatGPT-4 (CG-4), Gemini Advanced (GemAdv), and Copilot—evaluating their accuracy, consistency, and overall performance. Fifteen clinical vignettes of varying difficulty and five open-ended questions based on real patient cases were used. The responses were coded, randomized, and evaluated blindly by six expert gynecologic oncologists using a 5-point Likert scale for relevance, clarity, depth, focus, and coherence. Results: GemAdv demonstrated superior accuracy (81.87 %) compared to both CG-4 (61.60 %) and Copilot (70.67 %) across all difficulty levels. GemAdv consistently provided correct answers more frequently (>60 % every day during the testing period). Although CG-4 showed a slight advantage in adhering to the National Comprehensive Cancer Network (NCCN) treatment guidelines, GemAdv excelled in the depth and focus of the answers provided, which are crucial aspects of clinical decision-making. Conclusion: LLMs, especially GemAdv, show potential in supporting clinical practice by providing accurate, consistent, and relevant information for gynecologic cancer. However, further refinement is needed for more complex scenarios. This study highlights the promise of LLMs in gynecologic oncology, emphasizing the need for ongoing development and rigorous evaluation to maximize their clinical utility and reliability.
KW - Accuracy
KW - Artificial intelligence
KW - Consistency
KW - Gynecologic cancer
KW - Large Language Models
UR - http://www.scopus.com/inward/record.url?scp=85208582771&partnerID=8YFLogxK
U2 - 10.1016/j.csbj.2024.10.050
DO - 10.1016/j.csbj.2024.10.050
M3 - Article
AN - SCOPUS:85208582771
SN - 2001-0370
VL - 23
SP - 4019
EP - 4026
JO - Computational and Structural Biotechnology Journal
JF - Computational and Structural Biotechnology Journal
ER -