
Vol 3, No 3 (2025)
Editorial
From Resource-Limited to Research-Rich: Unlocking the Scientific Potential of Developing Nations
Zuhair Dahham Hammood
For too long, the scientific narrative has been dominated by voices from wealthier nations. While their contributions are invaluable, the imbalance has left a vast reservoir of untapped knowledge and innovation in the developing world. Today, the time has come to shift the paradigm—from viewing developing countries as mere recipients of scientific progress to recognizing them as active producers of valuable, context-specific knowledge.
From Resource-Limited to Research-Rich is not a rhetorical flourish—it is a vision, a goal, and a challenge. It reflects a belief that scientific excellence is not the exclusive property of nations with abundant financial resources, but rather, a pursuit driven by curiosity, commitment, and community.
Developing countries, despite limited infrastructure and funding, are home to some of the most pressing health challenges—from endemic infectious diseases and rising non-communicable burdens to unique environmental and sociopolitical contexts. These challenges demand local insight, homegrown data, and context-sensitive solutions. The answers will not come from imported models alone. They must arise from within [1].
In this transformation, medical journals have a profound responsibility—not just as gatekeepers of knowledge, but as platforms for empowerment. Barw Medical Journal stands committed to this mission: to provide a voice to researchers working under constraints, to mentor and guide early-career scientists, and to uphold the integrity and quality of regional scholarship.
Success stories are already emerging. Across Africa, Asia, the Middle East, and Latin America, we are witnessing a rise in high-quality research led by local scientists. These efforts, often fueled by personal passion more than institutional support, prove that scientific ingenuity thrives even where resources are scarce [2].
However, more must be done. Governments must prioritize funding for health research. International agencies must listen more and dictate less. And academic partnerships must be based on equity, not extraction.
The path from resource-limited to research-rich is not paved overnight. It requires intentional investment, strategic collaboration, and relentless belief in the intellectual power of every nation. As we look ahead, let us remember: the next breakthrough in global health may very well come from a modest lab, in a hospital like ours, led by minds that simply needed a chance to be heard.
At Barw Medical Journal, we are here to amplify those voices.
Original Articles

The Effect of Clinical Knee Measurement in Children with Genu Varus
Kamal Jamil, Chong YT, Ahmad Fazly Abd Rasid, Abdul Halim Abdul Rashid, Lawand Ahmed
Abstract
Introduction
Children with genu varus need frequent assessment and follow-up, which may require several radiographs. This study investigates the effectiveness of clinical assessment of genu varus in comparison with radiological assessment.
Methods
This study examined the relationship between clinical and radiographic assessments of genu varus (bow leg) in children, focusing on the intercondylar distance (ICD) and the clinical tibiofemoral angle (cTFA) as clinical measures, compared with the mechanical tibiofemoral angle (mTFA) obtained via scanogram, the radiographic gold standard for assessing lower limb deformity. Clinical measurements (ICD and cTFA) were gathered along with the mTFA from scanogram radiographs. Reliability was tested between two observers, and Spearman’s correlation coefficient was used to evaluate the relationships between the clinical and radiographic measurements.
Results
The study involved 36 children with an average age of 6.3 years. There was good intra-rater reliability for observer 1 (ICC 0.87) and excellent intra-rater reliability for observer 2 (ICC 0.97), with excellent inter-observer agreement (ICC 0.97). Positive correlations were found between cTFA and mTFA (r² = 0.67, p < 0.001), between ICD and cTFA (r² = 0.53, p < 0.001), and between ICD and mTFA (r² = 0.62, p < 0.001).
Conclusion
This study supports the idea that clinical methods may be sufficient for evaluation, minimizing radiation exposure and offering a reliable alternative to radiography.
Introduction
Genu varus, also known as bow-leggedness, is defined as any separation of the medial surfaces of the knees when the medial malleoli are in contact and the patient is standing in the anatomical position [1]. The prevalence of genu varus ranges from 11.4% to 14.5% [2,3], and it is more prevalent in boys than in girls [2]. Genu varus may be physiological or pathological. Multiple methods aid in the screening and diagnosis of genu varus, including clinical and radiological approaches. Clinical methods such as intercondylar distance (ICD) and tibiofemoral angle measurement have been used to screen for and assess the degree of genu varus. However, imaging modalities such as a long-leg AP radiograph or scanogram are considered the gold standard for assessing lower limb deformity.
Many studies on genu varus in children have utilized either clinical or radiological lower limb measurements to describe tibiofemoral angle progression in normal children, normal ranges of knee angle in relation to age, and the timing of the transition from varus to valgus in different populations and ethnic groups [4-10]. A recent systematic review proposed that children above the age of 18 months with genu varus should be closely monitored clinically using ICD or cTFA, with an ICD of more than 4 cm warranting investigation for a pathologic cause [11]. However, the reliability of these clinical measures has not been confirmed.
Hence, serial assessment might be needed to manage children with genu varus. Clinical methods of assessment are preferable because, unlike radiographs, they involve no radiation exposure, but they may be inaccurate or unreliable [12]. We therefore aimed to determine the correlation between the radiological and clinical assessments.
Methods
Study design and setting
This was a single-center cohort study conducted in the orthopaedic clinic of a tertiary hospital. Children aged 1 to 17 years who were diagnosed with genu varus by orthopaedic specialists and had a long-leg radiograph performed were included. We excluded children with a previous history of lower limb fracture, knee swelling, tumour, or contracture. Consent was obtained from the parents before enrolment in the study. This study was conducted in accordance with the Declaration of Helsinki and was approved by the Universiti Kebangsaan Malaysia Institutional Ethical Committee (JEP-2020-194).
Procedure
Baseline data such as age, gender, weight/height, and underlying diagnosis were recorded. The knee intercondylar distance was measured with a measuring tape, with the child standing and both medial malleoli touching. The centre of each medial femoral condyle was identified by palpating the most prominent part of the distal femur, and the distance between the condyles was measured in centimetres as the intercondylar distance, following the method described by Heath et al [13]. The clinical tibiofemoral angle (cTFA) was measured with a goniometer, following the method described by Arazi et al [14]. With the child standing, the anterior superior iliac spine, centre of the patella, and midpoint of the ankle joint were marked with a pen. After marking the tibiofemoral axis, the angle was measured and recorded in degrees. Figure 1 illustrates the method of measurement on a patient.
A standardized long-leg anterior-posterior radiograph (scanogram) of the lower limbs was obtained from the hospital radiological database. The angle formed between the mechanical axis of the femur and the mechanical axis of the tibia was recorded as the mechanical tibiofemoral angle (mTFA). The mTFA was determined from the digital X-ray using the measuring tool of the Medweb software (Medweb, Inc, San Francisco, CA). In bilateral cases, the limb with the worse angle was chosen for analysis.
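Conceptually, the mTFA is the frontal-plane angle between the femoral mechanical axis (hip centre to knee centre) and the tibial mechanical axis (knee centre to ankle centre). The sketch below is a minimal geometric illustration of that definition, not the Medweb tool's actual implementation; the landmark coordinates are hypothetical.

```python
import numpy as np

def mechanical_tfa(hip, knee, ankle):
    """Frontal-plane angle (degrees) between the femoral mechanical axis
    (hip centre -> knee centre) and the tibial mechanical axis
    (knee centre -> ankle centre)."""
    femur = np.asarray(knee, float) - np.asarray(hip, float)
    tibia = np.asarray(ankle, float) - np.asarray(knee, float)
    cos_a = femur @ tibia / (np.linalg.norm(femur) * np.linalg.norm(tibia))
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

# Hypothetical (x, y) landmark coordinates, in millimetres, from a scanogram:
print(round(mechanical_tfa(hip=(0, 0), knee=(35, 420), ankle=(0, 800)), 1))  # ~10.0
```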
The clinical and radiological measurements were performed by a single researcher (CYT), who was trained on the measurement technique. For the radiographic measurements, a prior intra- and inter-observer reliability study was performed on 10 radiographs by two main researchers (CYT and KJ) on the same children at two different intervals.
Data analysis
The intra- and inter-observer reliability of tibiofemoral angle measurement was assessed using intraclass correlation coefficients (ICCs), with 95% confidence intervals to gauge the precision of the ICCs [15]. Correlations between the clinical tibiofemoral angle (cTFA), mechanical tibiofemoral angle (mTFA), and intercondylar distance (ICD) were tested using Spearman’s correlation test. Differences between cTFA and mTFA were investigated using a paired-sample t-test and Bland–Altman 95% limits of agreement. All statistical analyses were performed using SPSS (v24, IBM, NY, USA). Statistical significance was set at p < 0.05.
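For readers who wish to reproduce this analysis pipeline outside SPSS, the sketch below shows the corresponding tests in Python using simulated stand-in data (the study's data are not reproduced here); ICCs can be computed with a dedicated routine (e.g., pingouin's intraclass_corr).

```python
import numpy as np
from scipy import stats

# Simulated stand-in data (degrees / cm); for illustration only.
rng = np.random.default_rng(0)
mtfa = rng.normal(15, 5, 36)                # radiographic (scanogram) angle
ctfa = mtfa - 4.7 + rng.normal(0, 1.2, 36)  # clinical goniometer angle
icd = 0.3 * mtfa + rng.normal(0, 1.0, 36)   # intercondylar distance

# Spearman's correlations between clinical and radiographic measures
print(stats.spearmanr(ctfa, mtfa))
print(stats.spearmanr(icd, ctfa))
print(stats.spearmanr(icd, mtfa))

# Paired-sample t-test for the systematic cTFA - mTFA difference
print(stats.ttest_rel(ctfa, mtfa))

# Bland-Altman 95% limits of agreement: mean difference +/- 1.96 x SD
diff = ctfa - mtfa
print(diff.mean() - 1.96 * diff.std(ddof=1),
      diff.mean() + 1.96 * diff.std(ddof=1))
```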
Results
Thirty-six children with a mean age of 6.3 years were included. Thirty-two were Malay (88.9%), three were Indian (8.3%), and one was Chinese (2.8%). Twenty-two children were male (61%) and 14 were female (39%). There were five unilateral and 31 bilateral cases of genu varus. Eleven children had Blount disease, 13 had rickets, and the remaining 12 were managed as physiological genu varus.
The reliability study performed between the two observers for the tibiofemoral angle measurements revealed good intra-rater reliability for observer 1 (ICC 0.87) and excellent intra-rater reliability for observer 2 (ICC 0.97). Excellent inter-observer agreement (ICC 0.97) was also shown.
All 36 children (mean age 6.6 ± 5.7 years) were examined in the standing position. The association between the radiological mTFA and the clinical cTFA measurements was assessed. Our findings revealed a moderate correlation between cTFA and mTFA (r² = 0.67, p < 0.001) (Figure 2).
Subsequently, the associations between ICD and cTFA and between ICD and mTFA were assessed. We found moderate positive correlations between ICD and cTFA (r² = 0.53, p < 0.001) and between ICD and mTFA (r² = 0.62, p < 0.001) (Figure 3).
The paired t-test revealed a mean difference of -4.67 degrees between cTFA and mTFA. The difference was statistically significant (p < 0.001). The 95% limits of agreement ranged from -7.02 degrees to -2.34 degrees (Figure 4).
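As a worked check, the Bland–Altman limits follow the usual form; back-calculating the spread of the differences from the reported interval (an inference from the published numbers, not a value stated in the paper) gives

\[
\mathrm{LoA} = \bar{d} \pm 1.96\, s_d, \qquad
\bar{d} = -4.67^\circ, \qquad
s_d \approx \frac{-2.34 - (-7.02)}{2 \times 1.96} \approx 1.19^\circ,
\]

which reproduces the reported limits of approximately \(-7.02^\circ\) and \(-2.34^\circ\).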
Discussion
We examined the correlation between clinical and radiographic TFA measurements of the lower extremities in 36 children with genu varus who had been referred to our centre. We found a significant correlation between the radiological mTFA and the clinical TFA. This result parallels other studies [16,17]. Navali et al concluded that goniometer measurement appears to be a valid alternative to the mechanical axis on a full-leg radiograph for determining frontal-plane knee alignment [17]. Kraus et al also concluded that knee alignment assessed clinically by goniometer or measured on a knee radiograph correlates with the angle measured on the full-limb radiograph [17]. However, both studies were carried out in adult populations with knee osteoarthritis. Our study determined the correlation between radiological and clinical TFA specifically in a paediatric population with genu varus.
Another significant finding of this study is that ICD has a moderate correlation with both cTFA and mTFA. Several correlation studies on ICD have been reported. Saini et al found a fair degree of correlation between ICD and the tibiofemoral angle (TFA) measured clinically by goniometer [8]. A similar relationship between ICD and TFA was seen in other studies [6-11]. This suggests that the two measurements can complement each other in monitoring genu varus. The importance of ICD measurement has been highlighted by other authors. Cahuzac et al in 1995 established normal values for the varus profile of the legs in normal children between 10 and 16 years of age, whereby an ICD of more than 5 cm is considered abnormal [18]. This is supported by other investigators [14-19]. For younger children aged at least 18 months, an ICD of 4 cm should be closely monitored [11].
The different degrees of correlation in the various studies might be influenced by differing methods of measurement. Mathew et al found the clinical measurement using ICD to have minimal intra-observer variability [6]. However, a standardized method of measurement and positioning of the patient is important to obtain consistent findings. Obtaining a proper standing radiograph in a young child can prove challenging, so other measures, such as footprints drawn on the floor, have been suggested [11].
We also found that the difference between cTFA and mTFA measurements was significant. mTFA consistently produced a higher value, with a mean difference of around 5 degrees, indicating that the angles were not interchangeable between the two techniques. However, as mentioned earlier, the two measurements correlated with each other. This means that although the clinical method is not as accurate as the radiographic measurement (mTFA), it can still show a similar trend of deformity and is therefore useful for monitoring change or progression.
There were some limitations to our study. Firstly, our sample was relatively small, with a wide age range (1-17 years). Secondly, we performed an observer reliability study only for the radiographic measurement; the clinical measurements were done by a single researcher, who was trained to perform them following the standard protocol.
Conclusion
Clinical measurements of the tibiofemoral angle and ICD show good correlation with the radiological measurement when performed with the child in the standing position. Therefore, for monitoring purposes or serial alignment assessment, these methods are adequate.
Declarations
Conflicts of interest: The author(s) have no conflicts of interest to disclose.
Ethical approval: The study's ethical approval was obtained from the Universiti Kebangsaan Malaysia Institutional Ethical Committee (JEP-2020-194).
Patient consent (participation and publication): Verbal informed consent was obtained from patients for publication.
Source of Funding: Universiti Kebangsaan Malaysia
Role of Funder: The funder remained independent, refraining from involvement in data collection, analysis, or result formulation, ensuring unbiased research free from external influence.
Acknowledgements: None to be declared.
Authors' contributions: KJ and CYT conceptualized and designed the study, drafted the initial manuscript, and reviewed and revised the manuscript. CYT designed the data collection instruments, collected data and carried out the initial analyses. AFAR and AHAR coordinated and supervised data collection, and critically reviewed the manuscript for important intellectual content. All authors approved the final manuscript as submitted and agree to be accountable for all aspects of the work.
Use of AI: AI was not used in the drafting of the manuscript, the production of graphical elements, or the collection and analysis of data.
Data availability statement: Not applicable.

Exploring Large Language Models Integration in the Histopathologic Diagnosis of Skin Diseases: A Comparative Study
Talar Sabir Ahmed, Rawa M. Ali, Ari M. Abdullah, Hadeel A. Yasseen, Ronak S. Ahmed, Ameer M....
Abstract
Introduction
The exact manner in which large language models (LLMs) will be integrated into pathology is not yet fully comprehended. This study examines the accuracy, benefits, biases, and limitations of LLMs in diagnosing dermatologic conditions within pathology.
Methods
A pathologist compiled 60 real histopathology case scenarios of skin conditions from a hospital database. Two other pathologists reviewed each patient’s demographics, clinical details, histopathology findings, and original diagnosis. These cases were presented to ChatGPT-3.5, Gemini, and an external pathologist. Each response was classified as complete agreement, partial agreement, or no agreement with the original pathologist’s diagnosis.
Results
ChatGPT-3.5 had 29 (48.4%) complete agreements, 14 (23.3%) partial agreements, and 17 (28.3%) no agreements. Gemini showed 20 (33%) complete agreement, 9 (15%) partial agreement, and 31 (52%) no agreement responses. Additionally, the external pathologist had 36 (60%) complete agreement, 17 (28%) partial agreement, and 7 (12%) no agreement responses relative to the original pathologists’ diagnoses. Significant differences in diagnostic agreement were found between the LLMs and the pathologist (p < 0.001).
Conclusion
In certain instances, ChatGPT-3.5 and Gemini may provide an accurate diagnosis of skin pathologies when presented with relevant patient history and descriptions of histopathological reports. However, their overall performance is insufficient for reliable use in real-life clinical settings.
Introduction
The healthcare sector is undergoing significant transformation with the emergence of large language models (LLMs), which have the potential to revolutionize patient care and outcomes. In November 2022, OpenAI introduced a natural language model called Chat Generative Pre-Trained Transformer (ChatGPT). It is renowned for its ability to generate responses that approximate human interaction in various tasks. Gemini, developed by Google, is a text-based AI conversational tool that utilizes machine learning and natural language understanding to address complex inquiries. These models generate new data by identifying structures and patterns from existing data, demonstrating their versatility in producing content across different domains. Generative LLMs rely on sophisticated deep learning methodologies and neural network architectures to scrutinize, comprehend, and produce content that closely resembles human-created outputs. Both ChatGPT and Gemini have gained global recognition for their unprecedented ability to emulate human conversation and cognitive abilities [1-3].
ChatGPT offers a notable advantage in medical decision-making due to its proficiency in analyzing complex medical data. It is a valuable resource for healthcare professionals, providing quick insights derived from patient records, medical research, and clinical guidelines [1,4]. Moreover, ChatGPT can play a crucial role in the differential diagnostic process by synthesizing information from symptoms, medical history, and risk factors, and comprehensively processing this data to present a range of potential medical diagnoses, thereby assisting medical practitioners in their assessments. This has the potential to improve diagnostic accuracy and reduce instances of misdiagnosis or delays [4].
The integration of ChatGPT and Gemini into the medical decision-making landscape has generated interest from various medical specialties. Multiple disciplines have published articles highlighting the significance and potential applications of ChatGPT and Gemini in their respective fields [2,5]. Despite the growing use of these models in diagnostics, patient management, preventive medicine, and genomic analysis across medicine, the integration of LLMs in dermatology remains limited. This study emphasizes the exploration of large language models, highlighting their less common yet promising role in advancing dermatologic diagnostics and patient care [6].
This study aims to explore the role of LLMs and their decision-making capabilities in the field of pathology, specifically for dermatologic conditions. It focuses on ChatGPT-3.5 and Gemini and compares their accuracy and concordance with the diagnoses of human pathologists. The study also investigates the potential advantages, biases, and constraints of integrating LLM tools into pathology decision-making processes.
Methods
Case Selection
A pathologist selected 60 real case scenarios, with half being neoplastic conditions and the other half non-neoplastic, from a hospital’s medical database. The cases involved patients who had undergone biopsy and histopathological examination for skin conditions. The records included information on age, sex, and the chief complaint of the patients, in addition to a detailed description of the histopathology reports (clinical and microscopic description without the diagnosis).
Consensus Diagnosis
Two additional board-certified pathologists reviewed each case, reaching a collaborative consensus diagnosis through a meticulous review of clinical and microscopic descriptions. This process ensured diagnostic accuracy and reliability while minimizing individual biases.
Eligibility Criteria
The study included cases that had complete and relevant histopathological reports and comprehensive patient demographic information. Specifically, cases were included if they provided a definitive diagnosis in the histopathological report and contained detailed patient data such as age, gender, and clinical history. Cases were excluded if the histopathological report was incomplete, lacked critical patient information, or if the diagnosis could not be definitively made based solely on the textual description.
Sampling Method
The selection process involved a systematic review of available cases from the hospital's medical database to ensure a representative sample of different dermatologic diagnoses. A random sampling method was employed to minimize selection bias and to ensure the sample was representative of the broader population of dermatologic conditions within the database. The selected cases spanned a range of common and less common dermatologic conditions, enhancing the generalizability of the study’s findings.
Evaluation by AI Systems and External Pathologist
In March 2023, these cases were evaluated using two LLM systems, namely ChatGPT-3.5 and Gemini. In addition, an external board-certified pathologist was tested similarly to the AI systems, receiving only the necessary histopathology report descriptions (without histopathological images) to ensure a fair comparison between the LLM systems and the external pathologist.
Pathologists’ Experience
The pathologists involved in the study had a minimum of eight years of experience in their respective specialties, handling an average of 30 cases per month. This level of experience ensured deep familiarity with a wide range of case scenarios. Crucially, the pathologists conducting the assessments were fully informed of the study design, including the comparative analysis with the AI systems. Their expertise and understanding were vital in upholding the integrity and reliability of the diagnostic evaluations throughout the study.
AI Prompting Strategy
The LLM systems were initially greeted with a prompt saying “Hello,” followed by a standardized inquiry: “Please provide the most accurate diagnoses from the texts that will be given below.” Each case was individually presented by copy-pasting it from a Word document and requesting each system to provide a diagnosis based on the information presented. The first response of each system was documented. If no diagnosis was given, the prompt was repeated as follows: “Please, based on the histopathological report information given above, provide the most likely disease that causes it,” until a diagnosis was obtained. In some cases, after a diagnosis was provided, an additional question was asked to specify the histologic subtype of the condition (e.g., if the diagnosis was “seborrheic keratosis,” the system was asked to specify the histologic subtype). The board-certified external pathologist was tested with the same questions and asked to provide the correct diagnosis.
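The protocol above was executed manually through the chat interfaces. Purely as an illustrative sketch, the same repeat-until-diagnosis loop could be scripted against an LLM API as follows; the OpenAI Python client, the model name, and the crude stopping test are assumptions for illustration, not the study's tooling.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FIRST_PROMPT = ("Please provide the most accurate diagnoses "
                "from the texts that will be given below.")
RETRY_PROMPT = ("Please, based on the histopathological report information "
                "given above, provide the most likely disease that causes it.")

def diagnose(case_text: str, max_retries: int = 3) -> str:
    """Present one case scenario and re-prompt until a diagnosis is returned."""
    messages = [{"role": "user", "content": "Hello"},
                {"role": "user", "content": f"{FIRST_PROMPT}\n\n{case_text}"}]
    reply = ""
    for _ in range(max_retries):
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo", messages=messages,
        ).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        if "diagnos" in reply.lower():   # crude, illustrative stopping test
            break
        messages.append({"role": "user", "content": RETRY_PROMPT})
    return reply
```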
Response Categorization
The responses from both systems and the external pathologist were categorized into three subtypes: complete agreement with the original diagnosis by the human pathologists, partial agreement, or no agreement. The criteria for categorizing agreement levels are based on the distinction between general and specific diagnostic classifications. For instance, when the original diagnosis provides a detailed type and subtype (e.g., "seborrheic keratosis, irritated type"), an AI tool's or external pathologist's response was classified as "complete agreement" if it accurately identified both the general diagnosis ("seborrheic keratosis") and the specific subtype ("irritated type"). This classification acknowledges that accurate identification of both components reflects a thorough understanding of, and alignment with, the original diagnosis. Conversely, a response was categorized as "partial agreement" if it correctly identified the general diagnosis but inaccurately specified the subtype. Finally, a response was classified as "no agreement" when both the general diagnosis and the subtype were incorrect. These classification criteria draw upon established methodologies in diagnostic agreement studies, emphasizing the distinction between levels of agreement based on the precision and correctness of diagnostic outputs [7].
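The three-level scheme reduces to a simple decision rule on two judgments, the general diagnosis and the subtype, as the hypothetical helper below illustrates.

```python
def categorize(general_correct: bool, subtype_correct: bool) -> str:
    """Map one response onto the study's three agreement levels."""
    if general_correct and subtype_correct:
        return "complete agreement"   # e.g. "seborrheic keratosis, irritated type"
    if general_correct:
        return "partial agreement"    # right diagnosis, wrong or missing subtype
    return "no agreement"             # general diagnosis itself is incorrect
```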
Data Processing and Statistical Analysis
The acquired data were first entered into Microsoft Excel 2019 and then transferred to Statistical Package for the Social Sciences (SPSS) 27.0 and DATAtab for further analysis. Fleiss' kappa was used to measure agreement among ChatGPT, Gemini, and the external pathologist. Additionally, chi-square tests were applied to investigate associations between the two LLMs and the external pathologist. Significance was defined as a p-value of < 0.05. A literature review was performed for the study, selectively considering papers from reputable journals while excluding those from predatory sources based on established criteria [8].
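As an illustrative sketch of this analysis (using hypothetical ratings rather than the study data), Fleiss' kappa and the chi-square test can be computed in Python with statsmodels and SciPy; the 3 × 3 contingency table used in the chi-square step is taken from Table 3.

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: one row per case, one column per rater
# (ChatGPT, Gemini, external pathologist);
# 0 = complete, 1 = partial, 2 = no agreement with the original diagnosis.
rng = np.random.default_rng(1)
ratings = rng.integers(0, 3, size=(60, 3))

counts, _ = aggregate_raters(ratings)          # cases x categories count table
print("Fleiss' kappa:", fleiss_kappa(counts))

# Chi-square test of association between ChatGPT's and the external
# pathologist's agreement levels; the 3 x 3 counts are from Table 3.
table3 = np.array([[19, 8, 2],
                   [6, 8, 0],
                   [11, 1, 5]])
chi2, p, dof, _ = chi2_contingency(table3)
print(chi2, p, dof)
```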
Results
ChatGPT-3.5 provided 29 (48.4%) complete agreement, 14 (23.3%) partial agreement, and 17 (28.3%) no agreement responses for the scenarios presented. In contrast, Gemini offered 20 (33%), 9 (15%), and 31 (52%) complete, partial, and no agreement responses, respectively, for the same scenarios. Moreover, the external pathologist provided 36 (60%) complete agreement, 17 (28%) partial agreement, and 7 (12%) no agreement responses (Table 1). The complete details of the scenarios, including the original pathologists’ diagnoses and those of ChatGPT, Gemini, and the external pathologist, are available in Supplement 1.
Table 1. Classification of the 60 cases and agreement of each evaluator with the original diagnosis.

Variables | Frequency/percentage
Pathological classification
  Neoplastic | 30 (50%)
  Non-neoplastic | 30 (50%)
Neoplastic
  Benign | 19 (31.7%)
  Malignant | 11 (18.3%)
Non-neoplastic
  Dermatosis | 9 (15%)
  Infectious, pilosebaceous | 2 (3.3%)
  Connective tissue disease | 2 (3.3%)
  Infectious | 2 (3.3%)
  Granulomatous | 2 (3.3%)
  Vascular | 2 (3.3%)
  Epidermal maturation/keratinization disorder | 2 (3.3%)
  Dermatosis, pilosebaceous | 2 (3.3%)
  Pilosebaceous | 2 (3.3%)
  Panniculitis | 1 (1.7%)
  Dermatosis, infectious | 1 (1.7%)
  Dermatosis, pigmentation disorder | 1 (1.7%)
  Granulomatous, panniculitis | 1 (1.7%)
  Bullous | 1 (1.7%)
External pathologist
  Complete agreement | 36 (60%)
  Partial agreement | 17 (28%)
  No agreement | 7 (12%)
ChatGPT
  Complete agreement | 29 (48.4%)
  Partial agreement | 14 (23.3%)
  No agreement | 17 (28.3%)
Gemini
  Complete agreement | 20 (33%)
  Partial agreement | 9 (15%)
  No agreement | 31 (52%)
The agreement among ChatGPT, Gemini, and the external pathologist was assessed using Fleiss' kappa, which was statistically significant (p < 0.001) and indicated slight to moderate agreement with respect to the original diagnosis made by the pathologists. Of the 29 cases in which ChatGPT completely agreed with the original diagnosis, only 12 (41.4%) also received complete agreement from both Gemini and the external pathologist (Table 2).
Table 2. Agreement of Gemini and the external pathologist within each ChatGPT agreement category (measurement of agreement, Fleiss' kappa = 0.25; significance level < 0.001). Columns give ChatGPT's agreement with the original diagnosis; percentages are column percentages.

Gemini | External pathologist | ChatGPT: complete | ChatGPT: partial | ChatGPT: no agreement
Complete agreement | Complete agreement | 12 (41.4%) | 1 (7.1%) | 2 (11.8%)
Complete agreement | Partial agreement | 3 (10.4%) | 1 (7.1%) | 0 (0.0%)
Complete agreement | No agreement | 1 (3.4%) | 0 (0.0%) | 0 (0.0%)
Partial agreement | Complete agreement | 2 (7%) | 3 (21.4%) | 0 (0.0%)
Partial agreement | Partial agreement | 1 (3.4%) | 3 (21.4%) | 0 (0.0%)
No agreement | Complete agreement | 5 (17.2%) | 2 (14.4%) | 9 (53%)
Total | | 29 | 14 | 17
When the external pathologist was used as the reference, the external pathologist showed complete agreement with the original diagnosis in 36 cases. Among these, ChatGPT achieved complete agreement in 19 cases (52.7%), while Gemini achieved complete agreement in 15 cases (41.7%). The external pathologist showed no agreement with the original diagnosis in only 7 cases; among these, ChatGPT showed no agreement in 5 cases (71.4%) and Gemini in 6 cases (85.7%). Statistical analysis indicated significant differences in agreement levels between the AI tools (ChatGPT and Gemini) and the external pathologist, with a p-value of < 0.001 (Table 3).
Table 3. Agreement of ChatGPT and Gemini with the original diagnosis, stratified by the external pathologist's agreement (columns; percentages are column percentages).

AI tool | Agreement level | External pathologist: complete | External pathologist: partial | External pathologist: no agreement | P-value
ChatGPT | Complete agreement | 19 (52.7%) | 8 (47.1%) | 2 (28.6%) | <0.001
ChatGPT | Partial agreement | 6 (16.7%) | 8 (47.1%) | 0 (0%) |
ChatGPT | No agreement | 11 (30.6%) | 1 (5.8%) | 5 (71.4%) |
Gemini | Complete agreement | 15 (41.7%) | 4 (23.5%) | 1 (14.3%) | <0.001
Gemini | Partial agreement | 5 (13.9%) | 4 (23.5%) | 0 (0%) |
Gemini | No agreement | 16 (44.4%) | 9 (53%) | 6 (85.7%) |
Total | | 36 (100%) | 17 (100%) | 7 (100%) |
In addition, the agreement of ChatGPT and Gemini with the external pathologist was assessed separately for non-neoplastic and neoplastic cases. Statistical analysis revealed significant differences in agreement levels between the LLMs and the external pathologist, with a p-value of < 0.001, highlighting the statistically significant disparity in agreement rates between the AI tools and the external pathologist (Tables 4 and 5).
Table 4. Agreement of ChatGPT and Gemini with the original diagnosis for non-neoplastic cases, stratified by the external pathologist's agreement (columns; percentages are column percentages).

AI tool | Agreement level | External pathologist: complete | External pathologist: partial | External pathologist: no agreement | P-value
ChatGPT | Complete agreement | 11 (61.1%) | 2 (40%) | 4 (57.1%) | <0.001
ChatGPT | Partial agreement | 3 (16.7%) | 3 (60%) | 1 (14.3%) |
ChatGPT | No agreement | 4 (22.2%) | 0 (0%) | 2 (28.6%) |
Gemini | Complete agreement | 9 (50%) | 1 (20%) | 1 (14.3%) | <0.001
Gemini | Partial agreement | 7 (38.9%) | 4 (80%) | 4 (57.1%) |
Gemini | No agreement | 2 (11.1%) | 0 (0%) | 2 (28.6%) |
Total non-neoplastic cases | | 18 (100%) | 5 (100%) | 7 (100%) |
Table 5. Agreement of ChatGPT and Gemini with the original diagnosis for neoplastic cases, stratified by the external pathologist's agreement (columns; percentages are column percentages).

AI tool | Agreement level | External pathologist: complete | External pathologist: partial | External pathologist: no agreement | P-value
ChatGPT | Complete agreement | 8 (44.4%) | 0 (0%) | 4 (40%) | <0.001
ChatGPT | Partial agreement | 8 (44.4%) | 2 (100%) | 0 (0%) |
ChatGPT | No agreement | 2 (11.1%) | 0 (0%) | 6 (60%) |
Gemini | Complete agreement | 6 (33.3%) | 0 (0%) | 3 (30%) | <0.001
Gemini | Partial agreement | 9 (50%) | 2 (100%) | 5 (50%) |
Gemini | No agreement | 3 (16.7%) | 0 (0%) | 2 (20%) |
Total neoplastic cases | | 18 (100%) | 2 (100%) | 10 (100%) |
Discussion
Although language models in some form have existed for decades, LLMs have only recently garnered substantial attention in the public sphere. The increased focus on LLMs in the medical field has led to speculation about the potential replacement of doctors by these systems. However, LLMs are more likely to serve as a complementary tool, aiding clinicians in efficiently processing data and making clinical decisions. This is substantiated by the fact that LLMs can "learn" from extensive collections of medical data, and modern systems are also noted for their self-correcting capabilities. As electronic medical records become more prevalent, there is a growing reservoir of stored patient data. While having access to more data is undoubtedly advantageous, scanning through patient charts can be challenging. Algorithms have been developed to sift through patient notes and detect individuals with specific risk factors, diagnoses, or outcomes. This capability is particularly valuable because, in theory, an LLM system could be developed to review and extract data from medical charts, including pathology reports, and promptly identify patients at highest risk for conditions that could cause significant morbidity or mortality if missed by the physician [6,9].
The field of pathology is no exception to the adoption of LLMs and the utilization of these technological advancements. Various studies in recent years have assessed LLMs' accuracy, potential uses, and associated limitations. For instance, a study by Vaidyanathaiyer et al. evaluated ChatGPT's proficiency in pathology through thirty clinical case scenarios, evenly distributed across three primary subcategories: hematology, histopathology, and clinical pathology, with ten cases from each. The researchers reported that ChatGPT received a high grade of "A" on nearly three-quarters of the questions and "B" grades on the remaining ones. They found that ChatGPT demonstrated moderate proficiency in these subcategories, excelling in rapid data analysis and providing fundamental insights, though it had limitations in generating thorough and elaborate information [10]. Furthermore, Passby et al. demonstrated the capacity of ChatGPT to address multiple-choice inquiries in the Specialty Certificate Examination of dermatology, with ChatGPT-4 outperforming ChatGPT-3.5, scoring 90% versus 63%, respectively, against an approximate passing score of 70% [11]. In an investigation by Delsoz et al., twenty corneal pathologies with their respective case descriptions were provided to ChatGPT-3.5 and ChatGPT-4; ChatGPT-4 performed better, correctly answering 85% of the questions, whereas ChatGPT-3.5 answered only 60% correctly [12]. The current study found that ChatGPT-3.5 performed similarly in the percentage of correct responses. However, this study further evaluated the LLM responses and found that nearly 23.3% and 15% of ChatGPT and Gemini answers, respectively, were fair but still had inaccuracies. This highlights areas where these systems can improve, as they sometimes answer almost, but not fully, correctly. For instance, when a histopathology report of squamous cell carcinoma in situ was given to ChatGPT-3.5, it answered with squamous cell carcinoma; on further prompting, the system favored an invasive squamous cell carcinoma over an in-situ one, even when it was suggested that an in-situ lesion might be more appropriate for that scenario. Similarly, in the case of guttate psoriasis, Gemini answered with only "psoriasis" and did not specify the type, while ChatGPT-3.5 responded with "psoriasis vulgaris". In a study by Rahsepar et al. on pulmonary malignancies, Google Bard (the former name of Gemini) provided 9.2% partially correct answers, similar to Gemini's 15% partially correct responses in this study. However, ChatGPT-3.5 answered 17.5% of lung cancer questions incorrectly, whereas in the present study, ChatGPT-3.5's incorrect answers were nearly twice as frequent. This may be due to ChatGPT's broader access to data and medical information on lung cancer compared with the dermatological conditions tested in this study, highlighting the limitations and risks of relying on these systems for rarer diseases [13].
Although existing language models have access to extensive medical data, they often lack a nuanced understanding of individual diseases or specific patient cases. They have not undergone specialized training for medical tasks, relying solely on the provided data and information. The unclear methodology behind the LLM's diagnostic process leads to skepticism regarding the reliability of LLM-generated diagnoses. Consequently, their ability to accurately diagnose complex or unique cases may be limited, as demonstrated in the current study on skin histopathology cases. Notably, in a few cases, LLMs declined to provide a diagnosis on the initial prompt, citing concerns about giving medical advice, and only issued a diagnosis after repeated prompting with the same scenario. Despite their ability to offer insights based on existing knowledge, LLMs may lack a complete understanding of the intricate details and visual indicators crucial for pathologists' diagnosis. In the current study, the pathologist initially examined the histopathology slides and then provided the report to the AI systems. Another issue is that preserving the integrity of LLMs and safeguarding the confidentiality of associated data from unauthorized access is critical, particularly in scenarios involving sensitive patient information [14,15]. The case scenarios in this study did not include specific patient identifiers. Additionally, failure to evolve the LLM tools utilized in the pathological assessment alongside advancements in clinical practice and treatment poses the risk of stagnation and adherence to outdated methodologies. Although it is possible to manually update LLM algorithms to align with new protocols, their efficacy depends heavily on the availability of pertinent data, which might not be readily accessible during transitional periods. Such adaptations could introduce errors, particularly in pathology, through misclassifications of entities as classification and staging systems undergo revisions. Another concern is automation bias, which refers to the tendency of clinicians to regard LLM-based predictions as flawless or to adhere to them without questioning their validity. This bias often emerges soon after exposure to new technology and may stem from concerns about the legal consequences of disregarding an algorithm's output. Research across various fields has shown that automation bias can reduce clinician accuracy, affecting areas such as electrocardiogram interpretation and dermatologic diagnoses. Clinicians at all proficiency levels, including experts, are susceptible to this phenomenon [3,14-16].
Artificial intelligence more broadly has numerous applications in the medical field, with various technologies being developed at an unprecedented pace. For example, in the field of epilepsy, Empatica has created a wearable monitor called Embrace, which detects the onset of seizures in patients with epilepsy and notifies designated family members or trusted physicians. This innovation enhances safety, facilitates early management of such cases, and received FDA approval six years ago [17]. Additionally, one of the earliest such uses was the detection of atrial fibrillation: the AliveCor mobile application, which facilitates ECG monitoring and atrial fibrillation detection using a mobile phone, was FDA-approved. Recent findings from the REHEARSE-AF study indicated that traditional care methods are less effective at detecting atrial fibrillation in ambulatory individuals than remote ECG monitoring using Kardia [17,18]. Another example is the artificial immune recognition system, which has demonstrated remarkable accuracy in diagnosing tuberculosis using support vector machine classifiers. These advanced systems significantly outperform traditional methods, making them a robust tool for identifying tuberculosis cases with high reliability and underscoring the potential of such models to enhance diagnostic processes in infectious diseases [19]. These advancements across various medical disciplines render the application of LLMs in histopathological diagnostics increasingly viable and anticipated for future clinical implementation. This progress motivates further research by scientists and numerous companies, as the focus has shifted from whether LLMs will be used in pathology to when and how these models will be utilized.
One limitation of this study is that the aforementioned LLM systems were not evaluated for their ability and accuracy in directly reaching a diagnosis from histopathological images. Instead, the study relied on providing necessary information from the histopathological reports in text form, which imposes practical constraints and still requires an expert pathologist. Future studies focusing on both histopathological images and texts are necessary to further evaluate the comprehensive capabilities of LLM tools in this domain.
Conclusion
In certain instances, ChatGPT-3.5 and Gemini may provide an accurate diagnosis of skin conditions when provided with pertinent patient history and descriptions of histopathological reports. Specifically, Gemini showed higher accuracy in diagnosing non-neoplastic cases, while ChatGPT-3.5 demonstrated better performance in neoplastic cases. However, despite these strengths, the overall performance of both models is insufficient for reliable use in real-life clinical settings.
Declarations
Conflicts of interest: The author(s) have no conflicts of interest to disclose.
Ethical approval: Not applicable.
Patient consent (participation and publication): Not applicable.
Funding: The present study received no financial support.
Acknowledgements: None to be declared.
Authors' contributions: RMA and AMA were significant contributors to the conception of the study and the literature search for related studies. DSH and SHM involved in the literature review, study design, and manuscript writing. TSA, HAY, RSA, and AMS were involved in the literature review, the study's design, the critical revision of the manuscript, and data collection. RMA and DSH confirm the authenticity of all the raw data. All authors approved the final version of the manuscript.
Use of AI: ChatGPT-3.5 was used to assist in language editing and improving the clarity of the introduction section. All content was reviewed and verified by the authors. Authors are fully responsible for the entire content of their manuscript.
Data availability statement: Not applicable.
Early View Articles
Annotations on Indeterminate Cytology of Thyroid Nodules in Thyroidology: Novi Sub Sole?
Ilker Sengul, Demet Sengul
Letter to the Editor
Dear Editor,
Indeterminate cytology (IC) remains the most challenging issue for health professionals working in thyroidology, the thyroidologists [1-4]. We read with great interest the article by Ali et al. [5] entitled "Clinicopathological Features of Indeterminate Thyroid Nodules: A Single-center Cross-sectional Study," published in the 3rd volume of Barw Medical Journal. This study addresses a challenging and crucial issue by examining the characteristics and malignancy rates of thyroid nodules with IC, the most controversial category of The Bethesda System for Reporting Thyroid Cytopathology (TBSRTC). The authors evaluated the clinicopathological features of thyroid nodules with TBSRTC Category III in a single-center cross-sectional study [5].
One of the strengths of the article is its focus on the challenges in managing IC. Ali and colleagues [5] thoroughly examine comprehensive data, including demographic details, medical history, laboratory tests, preoperative imaging, cytologic evaluation, and histopathological diagnosis. The results indicate a notable malignancy rate in TBSRTC Category III. Furthermore, the study points out that patients with malignancy tended to be younger, while benign nodules were significantly larger than malignant ones. The study also found a significant association between malignant nodules and Thyroid Imaging Reporting and Data System (TI-RADS) categories 4 and 5, and between benign nodules and TI-RADS categories 2 and 3, findings that align with some existing literature and provide valuable insights into the clinical assessment of IC.
However, several limitations of the study warrant consideration. Firstly, its single-center and retrospective design may limit the generalizability of the findings to diverse populations and settings. As the authors acknowledge, the retrospective data collection might have resulted in missing crucial information. While TI-RADS scoring was provided, more specific ultrasound features of the thyroid nodules would have been beneficial. Of note, does including or excluding noninvasive follicular thyroid neoplasm with papillary-like nuclear features (NIFTP), considered a low-risk entity in the current understanding, affect and/or alter the overall results, the assessment of diagnostic performance, and the study outcome(s) [2-4]? Furthermore, which caliber of needle was utilized throughout the study, with or without local and/or topical anesthetic agent(s), and would the use of thicker or finer needles to obtain cytologic samples, with or without local and/or topical anesthesia, alter the outcome(s) of this study [2]? Moreover, which edition of TBSRTC was used for the work? We would stress that the up-to-date 3rd edition of TBSRTC [3], with its novel and crucial subdivisions of Category III, might affect the study's relevant outcome(s) [3,4]. Another point of attention is the relatively short data collection period relative to the publication date. Finally, while the discussion section compares the findings with various studies in the literature, a more in-depth exploration of the methodological differences and potential discrepancies in results could have been provided. For instance, the conflicting views in the literature regarding the relationship between nodule size and malignancy risk could have been further contrasted with the study's findings. The authors also acknowledge the small sample size as a limitation. For future research, multi-center and prospective studies with detailed imaging, such as elastography and contrast-enhanced sonography, and investigations into the role of molecular markers in thyroid nodules with Category III could improve diagnostic accuracy and potentially reduce unnecessary surgical interventions.
In conclusion, this study significantly contributes to the evaluation of IC in thyroidology despite its limitations. However, considering the noted limitations, further research with more comprehensive and methodologically robust studies in this area is warranted. This issue merits further investigation.
Sincerely,