MLHC 2024 Proceedings

https://proceedings.mlr.press/v252/

MLHC 2024 Abstracts

Poster Session A (Research Track)

Paper ID: 13

Network-Assisted Mediation Analysis with High-Dimensional Neuroimaging Mediators

Baoyi Shi; Ying Liu; Shanghong Xie; Xi Zhu; Yuanjia Wang

Mediation analysis is a widely used statistical approach to estimate the causal pathways through which an exposure affects an outcome via intermediate variables, i.e., mediators. In many applications, high-dimensional correlated biomarkers are potential mediators, posing challenges to standard mediation analysis approaches. However, some of these biomarkers, such as neuroimaging measures across brain regions, often exhibit hierarchical network structures that can be leveraged to advance mediation analysis. In this paper, we aim to study how brain cortical thickness, characterized by a star-shaped hierarchical network structure, mediates the effect of maternal smoking on children's cognitive abilities within the adolescent brain cognitive development (ABCD) study. We propose a network-assisted mediation analysis approach based on a conditional Gaussian graphical model to account for the star-shaped network structure of neuroimaging mediators. Within our framework, the joint indirect effect of these mediators is decomposed into the indirect effect through hub mediators and the indirect effects solely through each leaf mediator. This decomposition provides mediator-specific insights and informs efficient intervention designs. Additionally, after accounting for hub mediators, the indirect effects solely through each leaf mediator can be identified and evaluated individually, thereby addressing the challenges of high-dimensional correlated mediators. In our study, our proposed approach identifies a brain region as a significant leaf mediator, a finding that existing approaches cannot discover.

Paper ID: 131

Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models

Hye Sun Yun; David Pogrebitskiy; Iain James Marshall; Byron C Wallace

Meta-analyses statistically aggregate the findings of different randomized controlled trials (RCTs) to assess treatment effectiveness. Because this yields robust estimates of treatment effectiveness, results from meta-analyses are considered the strongest form of evidence. However, rigorous evidence syntheses are time-consuming and labor-intensive, requiring manual extraction of data from individual trials to be synthesized. Ideally, language technologies would permit fully automatic meta-analysis, on demand. This requires accurately extracting numerical results from individual trials, which has been beyond the capabilities of natural language processing (NLP) models to date. In this work, we evaluate whether modern large language models (LLMs) can reliably perform this task. We annotate (and release) a modest but granular evaluation dataset of clinical trial reports with numerical findings attached to interventions, comparators, and outcomes. Using this dataset, we evaluate the performance of seven LLMs applied zero-shot for the task of conditionally extracting numerical findings from trial reports. We find that massive LLMs that can accommodate lengthy inputs are tantalizingly close to realizing fully automatic meta-analysis, especially for dichotomous (binary) outcomes (e.g., mortality). However, LLMs---including ones trained on biomedical texts---perform poorly when the outcome measures are complex and tallying the results requires inference. This work charts a path toward fully automatic meta-analysis of RCTs via LLMs, while also highlighting the limitations of existing models for this aim.

Paper ID: 139

MALADE: Orchestration of LLM-powered Agents with Retrieval Augmented Generation for Pharmacovigilance

Jihye Choi; Nils Palumbo; Prasad Chalasani; Matthew M. Engelhard; Somesh Jha; Anivarya Kumar; David Page

In the era of Large Language Models (LLMs), given their remarkable text understanding and generation abilities, there is an unprecedented opportunity to develop new, LLM-based methods for trustworthy medical knowledge synthesis, extraction, and summarization. This paper focuses on the problem of Pharmacovigilance (PhV), where the significance and challenges lie in identifying Adverse Drug Events (ADEs) from diverse text sources, such as medical literature, clinical notes, and drug labels. Unfortunately, this task is hindered by factors including variations in the terminologies of drugs and outcomes, and ADE descriptions often being buried in large amounts of narrative text. We present MALADE, the first effective collaborative multi-agent system powered by LLM with Retrieval Augmented Generation for ADE extraction from drug label data. This technique involves augmenting a query to an LLM with relevant information extracted from text resources and instructing the LLM to compose a response consistent with the augmented data. MALADE is a general LLM-agnostic architecture, and its unique capabilities are: (1) leveraging a variety of external sources, such as medical literature, drug labels, and FDA tools (e.g., OpenFDA drug information API), (2) extracting drug-outcome association in a structured format along with the strength of the association, and (3) providing explanations for established associations. Instantiated with GPT-4 Turbo or GPT-4o, and FDA drug label data, MALADE demonstrates its efficacy with an Area Under ROC Curve of 0.90 against the OMOP Ground Truth table of ADEs. Our implementation leverages the Langroid multi-agent LLM framework and can be found at https://github.com/jihyechoi77/malade.

Paper ID: 142

Multimodal Sleep Apnea Detection with Missing or Noisy Modalities

Hamed Fayyaz; Niharika S. D'Souza; Rahmatollah Beheshti

Polysomnography (PSG) is a type of sleep study that records multimodal physiological signals and is widely used for purposes such as sleep staging and respiratory event detection. Conventional machine learning methods assume that each sleep study is associated with a fixed set of observed modalities and that all modalities are available for each sample. However, noisy and missing modalities are a common issue in real-world clinical settings. In this study, we propose a comprehensive pipeline aiming to compensate for the missing or noisy modalities when performing sleep apnea detection. Unlike other existing studies, our proposed model works with any combination of available modalities. Our experiments show that the proposed model outperforms other state-of-the-art approaches in sleep apnea detection using various subsets of available data and different levels of noise, and maintains its high performance (AUROC$>$0.9) even in the presence of high levels of noise or missingness. This is especially relevant in settings where the level of noise and missingness is high (such as pediatric or outside-of-clinic scenarios).

Paper ID: 150

FineRadScore: A Radiology Report Line-by-Line Evaluation Technique Generating Corrections with Severity Scores

Alyssa Huang; Oishi Banerjee; Kay Wu; Eduardo Pontes Reis; Pranav Rajpurkar

The current gold standard for evaluating generated chest x-ray (CXR) reports is through radiologist annotations. However, this process can be extremely time-consuming and costly, especially when evaluating large numbers of reports. In this work, we present FineRadScore, a Large Language Model (LLM)-based automated evaluation metric for generated CXR reports. Given a candidate report and a ground-truth report, FineRadScore gives the minimum number of line-by-line corrections required to go from the candidate to the ground-truth report. Additionally, FineRadScore provides an error severity rating with each correction and generates comments explaining why the correction was needed. We demonstrate that FineRadScore's corrections and error severity scores align with radiologist opinions. We also show that, when used to judge the quality of the report as a whole, FineRadScore aligns with radiologists as well as current state-of-the-art automated CXR evaluation metrics. Finally, we analyze FineRadScore's shortcomings to provide suggestions for future improvements.

Paper ID: 151

Direct Preference Optimization for Suppressing Hallucinated Prior Exams in Radiology Report Generation

Oishi Banerjee; Hong-Yu Zhou; Kay Wu; Subathra Adithan; Stephen Kwak; Pranav Rajpurkar

Recent advances in generative vision-language models (VLMs) have exciting potential implications for AI in radiology, yet VLMs are also known to produce hallucinations, nonsensical text, and other unwanted behaviors that can waste clinicians' time and cause patient harm. Drawing on recent work on direct preference optimization (DPO), we propose a simple method for modifying the behavior of pretrained VLMs performing radiology report generation by suppressing unwanted types of generations. We apply our method to the prevention of hallucinations of prior exams, addressing a long-established problem behavior in models performing chest X-ray report generation. Across our experiments, we find that DPO fine-tuning achieves a 3.2-4.8x reduction in lines hallucinating prior exams while maintaining model performance on clinical accuracy metrics. Our work is, to the best of our knowledge, the first work to apply DPO to medical VLMs, providing a data- and compute- efficient way to suppress problem behaviors while maintaining overall clinical accuracy.

Paper ID: 155

Bidirectional Generative Pre-training for Improving Healthcare Time-series Representation Learning

Ziyang Song; Qincheng Lu; He Zhu; David L. Buckeridge; Yue Li

Learning time-series representations for discriminative tasks, such as classification and regression, has been a long-standing challenge in the healthcare domain. Current pre-training methods are limited in either unidirectional next-token prediction or randomly masked token prediction. We propose a novel architecture called Bidirectional Timely Generative Pre-trained Transformer (BiTimelyGPT), which pre-trains on biosignals and longitudinal clinical records by both next-token and previous-token prediction in alternating transformer layers. This pre-training task preserves original distribution and data shapes of the time-series. Additionally, the full-rank forward and backward attention matrices exhibit more expressive representation capabilities. Using biosignals and longitudinal clinical records, BiTimelyGPT demonstrates superior performance in predicting neurological functionality, disease diagnosis, and physiological signs. By visualizing the attention heatmap, we observe that the pre-trained BiTimelyGPT can identify discriminative segments from biosignal time-series sequences, even more so after fine-tuning on the task.

Paper ID: 160

Risk stratification through class-conditional conformal estimation: A strategy that improves the rule-out performance of MACE in the prehospital setting

Juan Jose Garcia; Nikhil Sarin; Rebecca R. Kitzmiller; Ashok Krishnamurthy; Jessica K. Zègre-Hemsey

Accurate risk stratification of clinical scores is important to mitigate adverse outcomes in patient care. In this study we explore whether class-conditional conformal estimation can yield better risk stratification cutoffs, as measured by rule-out and rule-in performance. In the binary setting, the cutoffs are chosen to theoretically bound the false positive rate (FPR) and the false negative rate (FNR). We showcase rule-out performance improvements for the task of 30-day major adverse cardiac event (MACE) prediction in the prehospital setting over standard of care HEART and HEAR algorithms. Further, we observe the theoretical bounds materialize 96\% and 77\% of the time for FPR and FNR respectively across multiple datasets. Improving risk score accuracy is important since inaccurate stratification can lead to significant negative patient outcomes. For instance, in the case of MACE prediction, better rule-out performance translates into less delay of time dependent therapies that restore bloodflow to the compromised myocardium, thereby reducing morbidity and mortality.

Paper ID: 161

Selective Fine-tuning on LLM-labeled Data May Reduce Reliance on Human Annotation: A Case Study Using Schedule-of-Event Table Detection

Bhawesh Kumar; Jonathan Amar; Eric Yang; Nan Li; Yugang jia

Large Language Models (LLMs) have demonstrated their efficacy across a broad spectrum of tasks in healthcare applications. However, often LLMs need to be fine-tuned on task specific expert-annotated data to achieve optimal performance, which can be expensive and time consuming. In this study, we fine-tune PaLM-2 with parameter efficient fine-tuning (PEFT) using noisy labels obtained from Gemini-pro 1.0 for the detection of Schedule-of-Event (SoE) tables, which specify care plan in clinical trial protocols. We introduce a filtering mechanism to select high-confidence labels for this table classification task, thereby reducing the noise in the auto-generated labels. We find that the fine-tuned PaLM-2 with filtered labels outperforms Gemini Pro 1.0 and other LLMs on this task and achieves performance close to PaLM-2 fine-tuned on non-expert human annotations. Our results show that leveraging LLM-generated labels, coupled with strategic filtering can be a viable and cost-effective strategy for improving LLM performance on specialized tasks, especially in domains where expert annotations are scarce, expensive, or time-consuming to obtain.

Paper ID: 163

PRECISe : Prototype-Reservation for Explainable classification under Imbalanced and Scarce-Data Settings

Vaibhav Ganatra; Drishti Goel

Deep learning models used for medical image classification tasks are often constrained by the limited amount of training data along with severe class imbalance. Despite these problems, models should be explainable to enable human trust in the models' decisions to ensure wider adoption in high risk situations. In this paper, we propose PRECISe, an explainable-by-design model meticulously constructed to concurrently address all three challenges. Evaluation on 2 imbalanced medical image datasets reveals that PRECISe outperforms the current state-of-the-art methods on data efficient generalization to minority classes, achieving an accuracy of ~87% in detecting pneumonia in chest x-rays upon training on <60 images only. Additionally, a case study is presented to highlight the model's ability to produce easily interpretable predictions, reinforcing its practical utility and reliability for medical imaging tasks.

Paper ID: 164

DOSSIER: Fact Checking in Electronic Health Records while Preserving Patient Privacy

Haoran Zhang; Supriya Nagesh; Milind Shyani; Nina Mishra

Given a particular claim about a specific document, the fact checking problem is to determine if the claim is true and, if so, provide corroborating evidence. The problem is motivated by contexts where a document is too lengthy to quickly read and find an answer. This paper focuses on electronic health records, or a medical dossier, where a physician has a pointed claim to make about the record. Prior methods that rely on directly prompting an LLM may suffer from hallucinations and violate privacy constraints. We present a system, DOSSIER, that verifies claims related to the tabular data within a document. For a clinical record, the tables include timestamped vital signs, medications, and labs. DOSSIER weaves together methods for tagging medical entities within a claim, converting natural language to SQL, and utilizing biomedical knowledge graphs, in order to identify rows across multiple tables that prove the answer. A distinguishing and desirable characteristic of DOSSIER is that no private medical records are shared with an LLM. An extensive experimental evaluation is conducted over a large corpus of medical records demonstrating improved accuracy over five baselines. Our methods provide hope that physicians can privately, quickly, and accurately fact check a claim in an evidence-based fashion.

Paper ID: 174

LLMSYN: Generating Synthetic Electronic Health Records Without Patient-Level Data

Yijie Hao; Huan He; Joyce C. Ho

Recent advancements in large language models (LLMs) have shown promise in tasks like question answering, text summarization, and code generation. However, their effectiveness within the healthcare sector remains uncertain. This study investigates LLMs’ potential in generating synthetic Electronic Health Records (EHRs) by assessing their ability to produce structured data. Unfortunately, our preliminary results indicate that employing LLMs directly resulted in poor statistical similarity and utility. Feeding real-world dataset to LLMs could mitigate this issue, but privacy concerns were raised when uploading pa- tients’ information to the LLM API. To address these challenges and unleash the potential of LLMs in health data science, we present a new generation pipeline called LLMSYN. This pipeline utilizes only high-level statistical information from datasets and publicly available medical knowledge. The results demonstrate that the generated EHRs by LLMSYN ex- hibit improved statistical similarity and utility in downstream tasks, achieving predictive performance comparable to training with real data, while presenting minimal privacy risks. Our findings suggest that LLMSYN offers a promising approach to enhance the utility of LLM models in synthetic structured EHR generation.

Paper ID: 179

Predicting Long-Term Allograft Survival in Liver Transplant Recipients

Xiang Gao; Michael Cooper; Maryam Naghibzadeh; Amirhossein Azhie; Mamatha Bhat; Rahul Krishnan

Liver allograft failure occurs in approximately 20% of liver transplant recipients within five years post-transplant, leading to mortality or the need for retransplantation. Providing an accurate and interpretable model for individualized risk estimation of graft failure is essential for improving post-transplant care. To this end, we introduce the Model for Allograft Survival (MAS), a simple linear risk score that outperforms other advanced survival models. Using longitudinal patient follow-up data from the United States (U.S.), we develop our models on 82,959 liver transplant recipients and conduct multi-site evaluations on 11 regions. Additionally, by testing on a separate non-U.S. cohort, we explore the out-of-distribution generalization performance of various models without additional fine-tuning, a crucial property for clinical deployment. We find that the most complex models are also the ones most vulnerable to distribution shifts despite achieving the best in-distribution performance. Our findings not only provide a strong risk score for predicting long-term graft failure but also suggest that the routine machine learning pipeline with only in-distribution held-out validation could create harmful consequences for patients at deployment.

Paper ID: 180

Decision-Focused Model-based Reinforcement Learning for Reward Transfer

Abhishek Sharma; Sonali Parbhoo; Omer Gottesman; Finale Doshi-Velez

Model-based reinforcement learning (MBRL) provides a way to learn a transition model of the environment, which can then be used to plan personalized policies for different patient cohorts, and to understand the dynamics involved in the decision-making process. However, standard MBRL algorithms are either sensitive to changes in the reward function or achieve suboptimal performance on the task when the transition model is restricted. Motivated by the need to use simple and interpretable models in critical domains such as healthcare, we propose a novel robust decision-focused (RDF) algorithm that learns a transition model that achieves high returns while being robust to changes in the reward function. We demonstrate our RDF algorithm can be used with several model classes and planning algorithms. We also provide theoretical and empirical envidence, on variety of simulators and real patient data, that RDF can learn simple yet effective models that can be used to plan personalized policies.

Paper ID: 182

Localising the Seizure Onset Zone from Single-Pulse Electrical Stimulation Responses with a CNN Transformer

Jamie Norris; Aswin Chari; Dorien van Blooijs; Gerald K. Cooray; Karl Friston; Martin M Tisdall; Richard E Rosch

Epilepsy is one of the most common neurological disorders, often requiring surgical intervention when medication fails to control seizures. For effective surgical outcomes, precise localisation of the epileptogenic focus - often approximated through the Seizure Onset Zone (SOZ) - is critical yet remains a challenge. Active probing through electrical stimulation is already standard clinical practice for identifying epileptogenic areas. Our study advances the application of deep learning for SOZ localisation using Single-Pulse Electrical Stimulation (SPES) responses, with two key contributions. Firstly, we implement an existing deep learning model to compare two SPES analysis paradigms: divergent and convergent. These paradigms evaluate outward and inward effective connections, respectively. We assess the generalisability of these models to unseen patients and electrode placements using held-out test sets. Our findings reveal a notable improvement in moving from a divergent (AUROC: 0.574) to a convergent approach (AUROC: 0.666), marking the first application of the latter in this context. Secondly, we demonstrate the efficacy of CNN Transformers with cross-channel attention in handling heterogeneous electrode placements, increasing the AUROC to 0.730. These findings represent a significant step in modelling patient-specific intracranial EEG electrode placements in SPES. Future work will explore integrating these models into clinical decision-making processes to bridge the gap between deep learning research and practical healthcare applications.

Paper ID: 184

XDT-CXR: Investigating Cross-Disease Transferability in Zero-Shot Binary Classification of Chest X-Rays

Umaima Rahman; Abhishek Basu; Muhammad Uzair Khattak; Aniq Ur Rahman

This study explores the concept of cross-disease transferability (XDT) in medical imaging, focusing on the potential of binary classifiers trained on one disease to perform zero-shot classification on another disease affecting the same organ. Utilizing chest X-rays (CXR) as the primary modality, we investigate whether a model trained on one pulmonary disease can make predictions about another novel pulmonary disease, a scenario with significant implications for medical settings with limited data on emerging diseases. The XDT framework leverages the embedding space of a vision encoder, which, through kernel transformation, aids in distinguishing between diseased and non-diseased classes in the latent space. This capability is especially beneficial in resource-limited environments or in regions with low prevalence of certain diseases, where conventional diagnostic practices may fail. However, the XDT framework is currently limited to binary classification, determining only the presence or absence of a disease rather than differentiating among multiple diseases. This limitation underscores the supplementary role of XDT to traditional diagnostic tests in clinical settings. Furthermore, results show that XDT-CXR as a framework is able to make better predictions comapred to other zero-shot learning (ZSL) baselines.

Paper ID: 192

Can Large Language Models Provide Emergency Medical Help Where There Is No Ambulance? A Comparative Study on Large Language Model Understanding of Emergency Medical Scenarios in Resource-Constrained Settings

Paulina Boadiwaa Mensah; Nana Serwaa Agyeman Quao

The capabilities of Large Language Models (LLMs) have advanced since their populariza- tion a few years ago. The healthcare sector operates on, and generates a large volume of data annually and thus, there is a growing focus on the applications of LLMs within this sector. There are a few medicine-oriented evaluation datasets and benchmarks for assess- ing the performance of various LLMs in clinical scenarios; however, there is a paucity of information on the real-world usefulness of LLMs in context-specific scenarios in resource- constrained settings. In this study, 16 iterations of a decision support tool for medical emergencies using 4 distinct generalized LLMs were constructed, alongside a combination of 4 Prompt Engineering techniques: In-Context Learning with 5-shot prompting (5SP), chain-of-thought prompting (CoT), self-questioning prompting (SQP), and a stacking of self-questioning prompting and chain-of-thought (SQCT). In total 428 model responses were quantitatively and qualitatively evaluated by 22 clinicians familiar with the medi- cal scenarios and background contexts. Our study highlights the benefits of In-Context Learning with few-shot prompting, and the utility of the relatively novel self-questioning prompting technique. We also demonstrate the benefits of combining various prompting techniques to elicit the best performance of LLMs in providing contextually applicable health information. We also highlight the need for continuous human expert verification in the development and deployment of LLM-based health applications, especially in use cases where context is paramount.

Paper ID: 194

Leveraging LLMs for Multimodal Medical Time Series Analysis

Nimeesha Chan; Felix Parker; William C Bennett; Tianyi Wu; Mung Yao Jia; James Fackler MD; Kimia Ghobadi

The complexity and heterogeneity of data in many real-world applications pose significant challenges for traditional machine learning and signal processing techniques. For instance, in medicine, effective analysis of diverse physiological signals is crucial for patient monitoring and clinical decision-making and yet highly challenging. We introduce MedTsLLM, a general multimodal large language model (LLM) framework that effectively integrates time series data and rich contextual information in the form of text to analyze physiological signals, performing three tasks with clinical relevance: semantic segmentation, boundary detection, and anomaly detection in time series. These critical tasks enable deeper analysis of physiological signals and can provide actionable insights for clinicians. We utilize a reprogramming layer to align embeddings of time series patches with a pretrained LLM's embedding space and make effective use of raw time series, in conjunction with textual context. Given the multivariate nature of medical datasets, we develop methods to handle multiple covariates. We additionally tailor the text prompt to include patient-specific information. Our model outperforms state-of-the-art baselines, including deep learning models, other LLMs, and clinical methods across multiple medical domains, specifically electrocardiograms and respiratory waveforms. MedTsLLM presents a promising step towards harnessing the power of LLMs for medical time series analysis that can elevate data-driven tools for clinicians and improve patient outcomes.

Poster Session B (Research Track)

Paper ID: 156

Event-Based Contrastive Learning for Medical Time Series

Nassim Oufattole; Hyewon Jeong; Matthew B.A. McDermott; Aparna Balagopalan; Bryan Jangeesingh; Marzyeh Ghassemi; Collin Stultz

In clinical practice, one often needs to identify whether a patient is at high risk of adverse outcomes after some key medical event. For example, quantifying the risk of adverse outcomes after an acute cardiovascular event helps healthcare providers identify those patients at the highest risk of poor outcomes; i.e., patients who benefit from invasive therapies that can lower their risk. Assessing the risk of adverse outcomes, however, is challenging due to the complexity, variability, and heterogeneity of longitudinal medical data, especially for individuals suffering from chronic diseases like heart failure. In this paper, we introduce Event-Based Contrastive Learning (EBCL) - a method for learning embeddings of heterogeneous patient data that preserves temporal information before and after key index events. We demonstrate that EBCL can be used to construct models that yield improved performance on important downstream tasks relative to other pretraining methods. We develop and test the method using a cohort of heart failure patients obtained from a large hospital network and the publicly available MIMIC-IV dataset consisting of patients in an intensive care unit at a large tertiary care center. On both cohorts, EBCL pretraining yields models that are performant with respect to a number of downstream tasks, including mortality, hospital readmission, and length of stay. In addition, unsupervised EBCL embeddings effectively cluster heart failure patients into subgroups with distinct outcomes, thereby providing information that helps identify new heart failure phenotypes. The contrastive framework around the index event can be adapted to a wide array of time-series datasets and provides information that can be used to guide personalized care.

Paper ID: 195

MedAutoCorrect Image-Conditioned Autocorrection in Medical Reporting

Arnold Caleb Asiimwe; Didac Suris Coll-Vinent; Pranav Rajpurkar; Carl Vondrick

n medical reporting, the accuracy of radiological reports, whether generated by humans or machine learning algorithms, is critical. We tackle a new task in this paper: image- conditioned autocorrection of inaccuracies within these reports. Using the MIMIC-CXR dataset, we first intentionally introduce a diverse range of errors into reports. Subsequently, we propose a two-stage framework capable of pinpointing these errors and then making corrections, simulating an autocorrection process. This method aims to address the short- comings of existing automated medical reporting systems, like factual errors and incorrect conclusions, enhancing report reliability in vital healthcare applications. Importantly, our approach could serve as a guardrail, ensuring the accuracy and trustworthiness of automated report generation. Experiments on established datasets and state of the art report generation models validate this method’s potential in correcting medical reporting errors.

Paper ID: 21

Multinomial belief networks for healthcare data

Hylke Cornelis Donker; Dorien Neijzen; Johann de Jong; Gerton Lunter

Healthcare data from patient or population cohorts are often characterized by sparsity, high missingness and relatively small sample sizes. In addition, being able to quantify uncertainty is often important in a medical context. To address these analytical requirements we propose a deep generative Bayesian model for multinomial count data. We develop a collapsed Gibbs sampling procedure that takes advantage of a series of augmentation relations, inspired by the Zhou--Cong--Chen model. We visualise the model's ability to identify coherent substructures in the data using a dataset of handwritten digits. We then apply it to a large experimental dataset of DNA mutations in cancer and show that we can identify biologically meaningful clusters of mutational signatures in a fully data-driven way.

Paper ID: 25

Benchmarking Reliability of Deep Learning Models for Pathological Gait Classification

Abhishek Jaiswal; Nisheeth Srivastava

Early detection of neurodegenerative disorders is an important open problem, since early diagnosis and treatment may yield a better prognosis. Researchers have recently sought to leverage advances in machine learning algorithms to detect symptoms of altered gait, possibly corresponding to the emergence of neurodegenerative etiologies. However, while several claims of positive and accurate detection have been made in the recent literature, using a variety of sensors and algorithms, solutions are far from being realized in practice. This paper analyzes existing approaches to identify gaps inhibiting translation. Using a set of experiments across three Kinect-simulated and one real Parkinson's patient datasets, we highlight possible sources of errors and generalization failures in these approaches. Based on these observations, we propose our strong baseline called Asynchronous Multi-Stream Graph Convolutional Network (AMS-GCN) that can reliably differentiate multiple categories of pathological gaits across datasets.

Paper ID: 26

General-Purpose Retrieval-Enhanced Medical Prediction Model Using Near-Infinite History

Junu Kim; Chaeeun Shim; Bosco Seong Kyu Yang; Chami Im; Sung Yoon Lim; Han-Gil Jeong; Edward Choi

Machine learning (ML) has recently shown promising results in medical predictions using electronic health records (EHRs). However, since ML models typically have a limited capability in terms of input sizes, selecting specific medical events from EHRs for use as input is necessary. This selection process, often relying on expert opinion, can cause bottlenecks in development. We propose Retrieval-Enhanced Medical prediction model (REMed) to address such challenges. REMed can essentially evaluate unlimited medical events, select the relevant ones, and make predictions. This allows for an unrestricted input size, eliminating the need for manual event selection. We verified these properties through experiments involving 27 clinical prediction tasks across four independent cohorts, where REMed outperformed the baselines. Notably, we found that the preferences of REMed align closely with those of medical experts. We expect our approach to significantly expedite the development of EHR prediction models by minimizing clinicians' need for manual involvement.

Paper ID: 29

The Data Addition Dilemma

Judy Hanwen Shen; Inioluwa Deborah Raji; Irene Y. Chen

In many machine learning for healthcare tasks, standard datasets are constructed by amassing data across many, often fundamentally dissimilar, sources. But when does adding more data help, and when does it hinder progress on desired model outcomes in real-world settings? We identify this situation as the Data Addition Dilemma, demonstrating that adding training data in this multi-source scaling context can at times result in reduced overall accuracy, uncertain fairness outcomes and reduced worst-subgroup performance. We find that this possibly arises from an empirically observed trade-off between model performance improvements due to data scaling and model deterioration from distribution shift. We thus establish baseline strategies for navigating this dilemma, introducing distribution shift heuristics to guide decision-making for which data sources to add in order to yield the expected model performance improvements. We conclude with a discussion of the required considerations for data collection and suggestions for studying data composition and scale in the age of increasingly larger models.

Paper ID: 3

Needles in Needle Stacks: Meaningful Clinical Information Buried in Noisy Sensor Data

Sujay Nagaraj; Andrew J Goodwin; Dmytro Lopushanskyy; Sebastian David Goodfellow; Danny Eytan; Hadrian Balaci; Robert Greer; Anand Jayarajan; Azadeh Assadi; Mjaye Leslie Mazwi; Anna Goldenberg

Central Venous Lines (C-Lines) and Arterial Lines (A-Lines) are routinely used in the Critical Care Unit (CCU) for blood sampling, medication administration, and high-frequency blood pressure measurement. Judiciously accessing these lines is important, as over-utilization is associated with significant in-hospital morbidity and mortality. Documenting the frequency of line-access is an important step in reducing these adverse outcomes. Unfortunately, the current gold-standard for documentation is manual and subject to error, omission, and bias. The high-frequency blood pressure waveform data from sensors in these lines are often noisy and full of artifacts. Standard approaches in signal processing remove noise artifacts before meaningful analysis. However, from bedside observations, we characterized a *distinct* artifact that occurs during each instance of C-Line or A-Line use. These artifacts are buried amongst physiological waveform and extraneous noise. We focus on Machine Learning (ML) models that can detect these artifacts from waveform data in real-time - finding needles in needle stacks, in order to automate the documentation of line-access. We built and evaluated ML classifiers running in real-time at a major children's hospital to achieve this goal. We demonstrate the utility of these tools for reducing documentation burden, increasing available information for bedside clinicians, and informing unit-level initiatives to improve patient safety.

Paper ID: 34

Early Prediction of Causes (not Effects) in Healthcare by Long-Term Clinical Time Series Forecasting

Michael Staniek; Marius Fracarolli; Michael Hagmann; Stefan Riezler

Machine learning for early syndrome diagnosis aims to solve the intricate task of predicting a ground truth label that most often is the outcome (effect) of a medical consensus definition applied to observed clinical measurements (causes), given clinical measurements observed several hours before. Instead of focusing on the prediction of the future effect, we propose to directly predict the causes via time series forecasting (TSF) of clinical variables and determine the effect by applying the gold standard consensus definition to the forecasted values. This method has the invaluable advantage of being straightforwardly interpretable to clinical practitioners, and because model training does not rely on a particular label anymore, the forecasted data can be used to predict any consensus-based label. We exemplify our method by means of long-term TSF with Transformer models, with a focus on accurate prediction of sparse clinical variables involved in the SOFA-based Sepsis-3 definition and the new Simplified Acute Physiology Score (SAPS-II) definition. Our experiments are conducted on two datasets and show that contrary to recent proposals which advocate set function encoders for time series and direct multi-step decoders, best results are achieved by a combination of standard dense encoders with iterative multi-step decoders. The key for success of iterative multi-step decoding can be attributed to its ability to capture cross-variate dependencies and to a student forcing training strategy that teaches the model to rely on its own previous time step predictions for the next time step prediction.

Paper ID: 38

FairEHR-CLP: Towards Fairness-Aware Clinical Predictions with Contrastive Learning in Multimodal Electronic Health Records

Yuqing Wang; Malvika Pillai; Yun Zhao; Catherine M Curtin; Tina Hernandez-Boussard

In the high-stakes realm of healthcare, ensuring fairness in predictive models is crucial. Electronic Health Records (EHRs) have become integral to medical decision-making, yet existing methods for enhancing model fairness restrict themselves to unimodal data and fail to address the multifaceted social biases intertwined with demographic factors in EHRs. To mitigate these biases, we present $\textit{FairEHR-CLP}$: a general framework for $\textbf{Fair}$ness-aware Clinical $\textbf{P}$redictions with $\textbf{C}$ontrastive $\textbf{L}$earning in $\textbf{EHR}$s. FairEHR-CLP operates through a two-stage process, utilizing patient demographics, longitudinal data, and clinical notes. First, synthetic counterparts are generated for each patient, allowing for diverse demographic identities while preserving essential health information. Second, fairness-aware predictions employ contrastive learning to align patient representations across sensitive attributes, jointly optimized with an MLP classifier with a softmax layer for clinical classification tasks. Acknowledging the unique challenges in EHRs, such as varying group sizes and class imbalance, we introduce a novel fairness metric to effectively measure error rate disparities across subgroups. Extensive experiments on three diverse EHR datasets on three tasks demonstrate the effectiveness of FairEHR-CLP in terms of fairness and utility compared with competitive baselines. FairEHR-CLP represents an advancement towards ensuring both accuracy and equity in predictive healthcare models.

Paper ID: 44

Minimax Risk Classifiers for Mislabeled Data: a Study on Patient Outcome Prediction Tasks

Lucia Filippozzi; Santiago Mazuelas; Iñigo Urteaga

Healthcare datasets are often impacted by incorrect or mislabeled data, due to imperfect annotations, data collection issues, ambiguity, and subjective interpretations. Incorrectly classified data, referred to as "noisy labels", can significantly degrade the performance of supervised learning models. Namely, noisy labels hinder the algorithm's ability to accurately capture the true underlying patterns from observed data. More importantly, evaluating the performance of a classifier when only noisy test labels are available is a significant complication. We hereby tackle the challenge of trusting the labelling process both in training and testing, as noisy patient outcome labels in healthcare raise methodological and ethical considerations. We propose a novel adaptation of Minimax Risk Classifiers (MRCs) for data subject to noisy labels, both in training and evaluation. We show that the upper bound of the MRC's expected loss can serve as a useful estimator for the classifier's performance, especially in situations where clean test data is not available. We demonstrate the benefits of the proposed methodology in healthcare tasks where patient outcomes are predicted from mislabeled data. The proposed technique is accurate and stable, avoiding overly optimistic assessments of prediction error, a significantly harmful burden in patient outcome prediction tasks in healthcare.

Paper ID: 45

NeRF-US: Removing Ultrasound Imaging Artifacts from Neural Radiance Fields in the Wild

Rishit Dagli; Atsuhiro Hibi; Rahul Krishnan; Pascal N Tyrrell

Current methods for performing 3D reconstruction and novel view synthesis (NVS) in ultrasound imaging data often face severe artifacts when training NeRF-based approaches. The artifacts produced by current approaches differ from NeRF floaters in general scenes because of the unique nature of ultrasound capture. Furthermore, existing models fail to produce reasonable 3D reconstructions when ultrasound data is captured or obtained casually in uncontrolled environments, which is common in clinical settings. Consequently, existing reconstruction and NVS methods struggle to handle ultrasound motion, fail to capture intricate details, and cannot model transparent and reflective surfaces. In this work, we introduced NeRF-US, which incorporates 3D-geometry guidance for border probability and scattering density into NeRF training, while also utilizing ultrasound-specific rendering over traditional volume rendering. These 3D priors are learned through a diffusion model. Through experiments conducted on our new "Ultrasound in the Wild" dataset, we observed accurate, clinically plausible, artifact-free reconstructions.

Paper ID: 53

G-Transformer: Counterfactual Outcome Prediction under Dynamic and Time-varying Treatment Regimes

Hong Xiong; Feng Wu; Leon Deng; Megan Su; Li-wei H. Lehman

In the context of medical decision making, counterfactual prediction enables clinicians to predict treatment outcomes of interest under alternative courses of therapeutic actions given observed patient history. Prior machine learning approaches for counterfactual predictions under time-varying treatments focus on static time-varying treatment regimes where treatments do not depend on previous covariate history. In this work, we present G-Transformer, a Transformer-based framework supporting g-computation for counterfactual prediction under dynamic and time-varying treatment strategies. G-Transfomer captures complex, long-range dependencies in time-varying covariates using a Transformer architecture. G-Transformer estimates the conditional distribution of relevant covariates given covariate and treatment history at each time point using an encoder architecture, then produces Monte Carlo estimates of counterfactual outcomes by simulating forward patient trajectories under treatment strategies of interest. We evaluate G-Transformer extensively using two simulated longitudinal datasets from mechanistic models, and a real-world sepsis ICU dataset from MIMIC-IV. G-Transformer outperforms both classical and state-of-the-art counterfactual prediction models in these settings. To the best of our knowledge, this is the first Transformer-based architecture for counterfactual outcome prediction under dynamic and time-varying treatment strategies.

Paper ID: 58

Semi-Supervised Generative Models for Disease Trajectories: A Case Study on Systemic Sclerosis

Cécile Trottet; Manuel Schürch; Ahmed Allam; Imon Shoumitra Barua; Liubov Petelytska; Oliver Distler; Anna-Maria Hoffmann-Vold; Michael Krauthammer

We propose a deep generative approach using latent temporal processes for modeling and holistically analyzing complex disease trajectories, with a particular focus on Systemic Sclerosis (SSc). We aim to learn temporal latent representations of the underlying generative process that explain the observed patient disease trajectories in an interpretable and comprehensive way. To enhance the interpretability of these latent temporal processes, we develop a semi-supervised approach for disentangling the latent space using established medical knowledge. By combining the generative approach with medical definitions of different characteristics of SSc, we facilitate the discovery of new aspects of the disease. We show that the learned temporal latent processes can be utilized for further data analysis and clinical hypothesis testing, including finding similar patients and clustering SSc patient trajectories into novel sub-types. Moreover, our method enables personalized online monitoring and prediction of multivariate time series with uncertainty quantification.

Paper ID: 64

Mixed Type Multimorbidity Variational Autoencoder: A Deep Generative Model for Multimorbidity Analysis

Woojung Kim; Paul A. Jenkins; Christopher Yau

This paper introduces the Mixed Type Multimorbidity Variational Autoencoder ($\text{M}^{3}$VAE), a deep probabilistic generative model developed for supervised dimensionality reduction in the context of multimorbidity analysis. The model is designed to overcome the limitations of purely supervised or unsupervised approaches in this field. $\text{M}^{3}$VAE focuses on identifying latent representations of mixed-type health-related attributes essential for predicting patient survival outcomes. It integrates datasets with multiple modalities (by which we mean data of multiple types), encompassing health measurements, demographic details, and (potentially censored) survival outcomes. A key feature of $\text{M}^{3}$VAE is its ability to reconstruct latent representations that exhibit clustering patterns, thereby revealing important patterns in disease co-occurrence. This functionality provides insights for understanding and predicting health outcomes. The efficacy of $\text{M}^{3}$VAE has been demonstrated through experiments with both synthetic and real-world electronic health record data, showing its capability in identifying interpretable morbidity groupings related to future survival outcomes.

Paper ID: 66

A Comprehensive View of Personalized Federated Learning on Heterogeneous Clinical Datasets

Fatemeh Tavakoli; D. B. Emerson; Sana Ayromlou; John Taylor Jewell; Amrit Krishnan; Yuchong Zhang; Amol Verma; Fahad Razak

Federated learning (FL) is increasingly being recognized as a key approach to overcoming the data silos that so frequently obstruct the training and deployment of machine-learning models in clinical settings. This work contributes to a growing body of FL research specifically focused on clinical applications along three important directions. First, we expand the FLamby benchmark (du Terrail et al., 2022a) to include a comprehensive evaluation of personalized FL methods and demonstrate substantive performance improvements over the original results. Next, we advocate for a comprehensive checkpointing and evaluation framework for FL to reflect practical settings and provide multiple comparison baselines. To this end, an open-source library aimed at making FL experimentation simpler and more reproducible is released. Finally, we propose an important ablation of PerFCL (Zhang et al., 2022). This ablation results in a natural extension of FENDA (Kim et al., 2016) to the FL setting. Experiments conducted on the FLamby benchmark and GEMINI datasets (Verma et al., 2017) show that the proposed approach is robust to heterogeneous clinical data and often outperforms existing global and personalized FL techniques, including PerFCL.

Paper ID: 68

Predictive Powered Inference for Healthcare; Relating Optical Coherence Tomography Scans to Multiple Sclerosis Disease Progression

Jacob Schultz; Jerry L Prince; Bruno Michel Jedynak

Predictive power inference (PPI and PPI++) is a recently developed statistical method for computing confidence intervals and tests. It combines observations with machine-learning predictions. We use this technique to measure the association between the thickness of retinal layers and the time from the onset of Multiple Sclerosis (MS) symptoms. Further, we correlate the former with the Expanded Disability Status Scale, a measure of the progression of MS. In both cases, the confidence intervals provided with PPI++ improve upon standard statistical methodology, showing the advantage of PPI++ for answering inference problems in healthcare.

Paper ID: 69

A LUPI distillation-based approach: Application to predicting Proximal Junctional Kyphosis

Yun Chao Lin; Andrea Clark-Sevilla; Rohith Ravindranath; Fthimnir Hassan; Justin Reyes; Joseph Lombardi; Lawrence G. Lenke; Ansaf Salleb-Aouissi

We propose a learning algorithm called XGBoost+, a modified version of the extreme gradient boosting algorithm (XGBoost). The new algorithm utilizes privileged information (PI), data collected after inference time. XGBoost+ incorporates PI into a distillation framework for XGBoost. We also evaluate our proposed method on a real-world clinical dataset about Proximal Junctional Kyphosis (PJK). Our approach outperforms vanilla XGBoost, SVM, and SVM+ on various datasets. Our approach showcases the advantage of using privileged information to improve the performance of machine learning models in healthcare, where data after inference time can be leveraged to build better models.

Paper ID: 8

To which reference class do you belong? Measuring racial fairness of reference classes with normative modeling

Saige Rutherford; Thomas Wolfers; charlotte fraza; Nathaniel G. Harnett; Christian Beckmann; Henricus G. Ruhe; Andre Marquand

Reference classes in healthcare establish healthy norms, such as pediatric growth charts of height and weight, and are used to chart deviations from these norms which represent potential clinical risk. How the demographics of the reference class influence clinical interpretation of deviations is unknown. Using normative modeling, a method for building reference classes, we evaluate the fairness (racial bias) in reference models of structural brain images that are widely used in psychiatry and neurology. We test whether including “race” in the model creates fairer models. We predict self-reported race using the deviation scores from three different reference class normative models to better understand bias in an integrated, multivariate sense. Across all these tasks, we uncover racial disparities that are not easily addressed with existing data or commonly used modeling techniques. Our work suggests that deviations from the norm could be due to demographic mismatch with the reference class, and assigning clinical meaning to these deviations should be done with caution. Our approach also suggests that acquiring more representative samples is an urgent research priority.

Paper ID: 82

CORE-BEHRT: A Carefully Optimized and Rigorously Evaluated BEHRT

Mikkel Fruelund Odgaard; Kiril Vadimovic Klein; Martin Sillesen; Sanne Møller Thysen; Espen Jimenez-Solem; Mads Nielsen

The widespread adoption of Electronic Health Records (EHR) has significantly increased the amount of available healthcare data. This has allowed models inspired by Natural Language Processing (NLP) and Computer Vision, which scale exceptionally well, to be used in EHR research. Particularly, BERT-based models have surged in popularity following the release of BEHRT and Med-BERT. Subsequent models have largely built on these foundations despite the fundamental design choices of these pioneering models remaining underexplored. Through incremental optimization, we study BERT-based EHR modeling and isolate the sources of improvement for key design choices, giving us insights into the effect of data representation, individual technical components, and training procedure. Evaluating this across a set of generic tasks (death, pain treatment, and general infection), we showed that improving data representation can increase the average downstream performance from 0.785 to 0.797 AUROC ($p < 10^{−7}$), primarily when including medication and timestamps. Improving the architecture and training protocol on top of this increased average downstream performance to 0.801 AUROC ($p < 10^{−7}$). We then demonstrated the consistency of our optimization through a rigorous evaluation across 25 diverse clinical prediction tasks. We observed significant performance increases in 17 out of 25 tasks and improvements in 24 tasks, highlighting the generalizability of our results. Our findings provide a strong foundation for future work and aim to increase the trustworthiness of BERT-based EHR models.

Paper ID: 91

Beyond Clinical Trials: Using Real World Evidence to Investigate Heterogeneous, Time-Varying Treatment Effects

Isabel Chien; Cliff Wong; Zelalem Gero; Jaspreet Bagga; Risa Ueno; Richard E. Turner; Roshanthi K Weerasinghe; Brian Piening; Tristan Naumann; carlo bifulco; Hoifung Poon; Javier Gonzalez

Randomized controlled trials (RCTs), though essential for evaluating the efficacy of novel treatments, are costly and time-intensive. Due to strict eligibility criteria, RCTs may not adequately represent diverse patient populations, leading to equity issues and limited generalizability. Additionally, conventional trial analysis methods are limited by strict assumptions and biases. Real-world evidence (RWE) offers a promising avenue to explore treatment effects beyond trial settings, addressing gaps in representation and providing additional insights into patient outcomes over time. We introduce TRIALSCOPE-X and TRIALSCOPE-XL, machine learning pipelines designed to analyze treatment outcomes using RWE by mitigating biases that arise from observational data and addressing the limitations of conventional methods. We estimate causal, time-varying treatment effects across heterogeneous patient populations and varied timeframes. Preliminary results investigating the treatment benefit of Keytruda, a widely-used cancer immunotherapy drug, demonstrate the utility of our methods in evaluating treatment outcomes under novel settings and uncovering potential disparities. Our findings highlight the potential of RWE-based analysis to provide data-driven insights that inform evidence-based medicine and shape more inclusive and comprehensive clinical research, supplementing traditional clinical trial findings.

Poster Session C (Clinical Track)

Paper ID: 10

Development of EHR-based Neurodevelopmental Surveillance Dashboard for Clinicians

De Rong Loh; Elliot D. Hill; Ann Marie Navar; Geraldine Dawson; Matthew M. Engelhard

**Background/Motivation** We have developed predictive models for autism and attention-deficit/hyperactivity disorder (ADHD) from an early age. Silent deployment of these models within electronic health records (EHR) is a critical step towards using them in practice. Despite growing interest$^1$, few such models have been effectively integrated into clinical practice. Translating machine learning models into clinical use is a Duke institutional priority. Following development and retrospective evaluation, the goal is to deploy these models via a user-friendly web application, which will support clinicians with predictive insights to enhance their clinical decision-making process. Early identification of children who require early intervention can lead to improved autism and ADHD outcomes. This abstract delves into the design considerations and developmental steps during the implementation process of this dashboard. **Design Considerations** We emphasize a user-centered approach, tailored to meet the needs of clinicians screening for children with high likelihood of diagnosis and providing access to individualized information. The interface is designed to be minimalistic, intuitive and interactive using StreamLit, an open-source Python framework. Clinicians can effectively navigate the dashboard and interact with the widgets for enhanced user experience and data visualization. We also recognize the importance of inclusive language, adopting neurodiversity-affirming communication practices$^2$ to ensure all users feel respected. For example, we use the term “likelihood” instead of “risk” to convey the probability of diagnosis in a non-stigmatizing manner. Finally, the dashboard is hosted within Protected Analytics Computing Environment (PACE), ensuring robust security for handling protected health information. Deployment within PACE also enables data processing and extraction with short latency while upholding privacy standards, and will facilitate subsequent integration with Epic for clinical use. **Dashboard Interface** The landing page features a “Dashboard Overview”, which includes the “Model Facts” label developed by the Duke Institute for Health Innovation$^3$. This label serves as a structured guideline aimed at ensuring clinicians understand the appropriate methods and instances for incorporating model output into clinical decision-making, as well as when not to do so. The major sections encompass the summary of the model and dataset, uses and directions, warnings, and other information. Further information regarding the evaluation metrics and models’ test performance are provided in the “Model Performance” page. The “All Patients” page presents a default display of all eligible patients within the Duke University Health System (DUHS) EHR. Clinicians can effectively navigate and prioritize patient care by ordering them based on predicted diagnosis likelihood and filtering them according to patient characteristics, such as diagnosis status and age. Additionally, a convenient string matching-based search feature expedites locating specific patient names. By clicking on a patient Medical Record Number (MRN), clinicians are directed to the “Patient Lookup” page, providing a comprehensive individualized view. Here, they can access (a) a patient snapshot (e.g. demographics, comorbidities, clinical encounters), (b) predicted probabilities of diagnosis at various timepoints, and (c) patient-specific likelihood over time versus population and subgroup distributions (i.e., year of birth, sex, and race). Entering the patient MRN directly will also navigate to the same page. Note that we have removed the EHR data pipeline and simulated with synthetic patient data to ensure patient confidentiality. For access to repository containing the dashboard, please visit: https://github.com/engelhard-lab/aap-dashboard **Future Work** Our next steps include implementing explainability tools for clinicians to understand model outputs better as well as conducting focus groups to iteratively refine the dashboard. We will also transition to prospective deployment and validation of the models for real-life clinical settings. These advancements will further the utility and reliability of our dashboard in improving patient care and outcomes. **References** 1. Goldstein BA, Navar AM, Pencina MJ, Ioannidis JPA. Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review. J Am Med Inform Assoc. 2017;24(1):198-208. doi:10.1093/jamia/ocw042 2. Brandsen S. Inclusive language in autism research. Presented at: December 1, 2023. 3. Sendak MP, Gao M, Brajer N, Balu S. Presenting machine learning model information to clinical end users with model facts labels. Npj Digit Med. 2020;3(1):41. doi:10.1038/s41746-020-0253-3

https://openreview.net/pdf/dd5f1a5404be50d7d7a56d02e24c09813022c49e.pdf

Paper ID: 103

Evaluating stimulus-based heart rate variability assessment in chronic constipation using a machine learning approach

Jihong Chen; Lijun Liu; Jennifer Yu; Jan D. Huizinga

Autonomic dysfunction plays an important role in the pathophysiology of chronic constipation. However, this has been severely understudied due to the heterogenicity and complexity of the clinical data. The autonomic nervous system (ANS) is divided into sympathetic and parasympathetic branches, providing essential control of the gastrointestinal tract, and allowing for normal digestion and expulsion of stool. Heart rate variability (HRV) is a non-invasive method to assess autonomic function using a series of heartbeat intervals derived from electrocardiography data. In order to comprehensively assess autonomic function in patients, we developed a stimulus-based HRV protocol involving active standing, deep breathing, and stomach distention, separately stimulating the sympathetic and parasympathetic branches of the ANS. Due to the presence of multiple interventions and HRV variables, we employed logistic regression to determine the value of each test intervention in order to improve the understanding of pathophysiology in chronic constipation.

https://openreview.net/pdf/bed5a489f90b3fba2f9217a85f6135d20c6117ca.pdf

Paper ID: 104

Development of a Fair Machine Learning Model to Predict Survival after Radical Cystectomy for Bladder Cancer

Samuel Carbunaru; Yassamin Neshatvar; Hyungrok Do; Katie S Murray; Rajesh Ranganath; Madhur Nayan

Background. Outcome-prediction models based on machine learning (ML) methods are being increasingly developed and adopted in healthcare. However, these models may be prone to bias and may demonstrate variable performance in population subgroups. This is particularly concerning in healthcare where model bias may perpetuate existing disparities in care for vulnerable populations. Bladder cancer disproportionately affects white males in the United States; the epidemiology of this disease may present a challenge to develop models that perform well for females and minority populations. In this study, we develop a model to predict survival after radical cystectomy, the current standard treatment for muscle-invasive bladder cancer and evaluate for potential unfairness in sex and race subgroups. We then use and compare unfairness mitigation techniques to improve model fairness. Methods. We used the National Cancer Database to identify patients undergoing a radical cystectomy for muscle-invasive bladder cancer between 2004-2016. We extracted demographic, clinical, and disease-specific factors for these patients and used the race/ethnicity data to categorize patients into National Institute of Health (NIH) subgroups of ‘White’, ‘Black, ‘Hispanic’, and ‘Asian’. We split the sample into training and test sets (80%/20%). We trained and compared various ML classification algorithms (Random Forest, Decision Tree, XGboost, and Logistic Regression) to predict overall survival at 5 years after cystectomy. The primary model performance metric was the F1 score. We evaluated model performance in the overall test set and subgroups based on sex and NIH race. The primary metric for model fairness was the equalized odds ratio (eOR), as we prioritized a model that minimized disparities and satisfied equality of true positive and false positive error rates across sensitive groups; an eOR of 1 indicates equal rates across groups. We compared three unfairness mitigating techniques to improve the eOR: correlation remover, exponentiated gradient, and threshold optimizer. Results. We identified 16,481 patients that met inclusion criteria; 3,800 (23.1%) were female; 15,080 (91.5%) were ‘White’, 832 (5.0%) were ‘Black’, 373 (2.3%) were ‘Hispanic’, and 196 (1.2%) were ‘Asian’. The 5-year mortality rate was 74.6% (12,290 deaths). The best naive model was XGBoost which had an F1 score of 0.860 (95% confidence interval (CI) 0.849 - 0.869)) and eOR of 0.619. This model performed best in Black males (F1 0.907 (95% CI 0.859 - 0.947)) and worst in Asian females (F1 0.824 (95% CI 0.571 - 0.947)). All unfairness mitigation techniques increased the eOR, with correlation remover showing the highest increase and resulting in a final eOR of 0.750. This mitigated model had F1 scores of 0.861 (95% CI 0.851 - 0.870), 0.904 (95% CI 0.855 - 0.946), and 0.824 (95% CI 0.571 - 0.947) in the full, Black male, and Asian female test sets, respectively. Conclusion. We developed a ML model to predict survival after radical cystectomy for bladder cancer and found that a naive model exhibited bias, as certain subgroups of patients, based on sex and race, had inferior performance compared to others. Using unfairness mitigation techniques to minimize disparities between groups, we were able to improve model fairness, as measured by the eOR. Our study demonstrates the importance of evaluating for potential model unfairness and the application of unfairness mitigation techniques to avoid disparities in healthcare arising from biased models. We deploy the first fair ML model predicting survival after radical cystectomy at https://nayanlab.shinyapps.io/fair_cystectomy_survival/

https://openreview.net/pdf/b5f33c91f63e5cff442c9206b418c47968a3434e.pdf

Paper ID: 105

Enhanced AFib Classification: Integrating CNN Features and R-R Intervals in CatBoost

David Maslove; Nooshin Maghsoodi; Stephanie Sibley; Sarah Nassar; Sophia Mannina; Shamel Addas; Purang Abolmaesumi; Parvin Mousavi

Atrial fibrillation (AFib), the most common cardiac arrhythmia encountered in intensive care units (ICUs), presents a significant health burden due to its associated risks of stroke, thromboembolic events, and mortality. Accurate and prompt detection of AFib in the high-stakes environment of the ICU is critical for timely intervention that may prevent poor outcomes. Traditional methods for AFib detection rely on manual interpretation of electrocardiograms (ECGs) or the application of basic machine learning techniques. The manual interpretation is time-consuming and subject to inter-observer variability, while machine learning methods often require extensive, well-annotated datasets and face deployment challenges in the complex ICU environment. The ICU setting poses unique challenges due to the critical nature of patient conditions and the corresponding need for immediate and accurate diagnostic information. Unlike non-ICU settings, where researchers may have access to large datasets for developing and testing algorithms, the ICU presents a more limited, yet high-velocity, high-variability data environment, which complicates the development of robust detection models. Most of the existing literature on AFib detection using advanced computational methods focuses on non-ICU settings, taking advantage of larger, cleaner datasets, and does not address the nuances of critical care. Our study seeks to overcome these challenges and enhance AFib detection in ICU settings where data limitations are a significant concern. Despite the rise of foundation models and transformers in recent literature, convolutional neural networks (CNNs) continue to demonstrate high efficacy in processing ECG signals. Our work utilizes public, non-ICU ECG datasets to train CNN, which serve as feature extractors adept at identifying subtle ECG signal patterns. These CNN-derived features, along with R-R interval analysis, provide a comprehensive feature set that feeds into a secondary classifier. This dual-stage approach harnesses the strengths of deep learning and signal processing, informing our strategy to address the challenge of AFib detection in critical care.

https://openreview.net/pdf/ac134a7e7e5ce908c6a8f219fc22363175df6639.pdf

Paper ID: 107

Predict non-Metastatic Castration Resistant Prostate Cancer Patients Prognosis Using Deep Learning Survival Analysis

CHUNYANG LI

Background. Prostate cancer (PC) is the most common cancer in and the second-leading cause of cancer death among men. While a majority of men with PC have early, less aggressive disease, a proportion ultimately develops a more advanced form associated with resistance to hormone therapy, referred to as Non-metastatic castration resistant prostate cancer (nmCRPC). The prognosis of nmCRPC patients varies, and early identification of high-risk patients could help clinicians adjust treatment plans and thus prolongs patients’ survival. The Veteran Healthcare Administration is the largest integrated healthcare system in the United States. It has the largest cohort of nmCRPC patients. We sought to develop an automatic patients’ prognosis prediction model to identify high risk patients, thus facilitate closer surveillance and offer better treatment plans. Electronic Healthcare Record (EHR) contains a lot of time series information, although it remains unclear how to best take advantage of it. Feature engineering time series data is commonly used to transform time series data into panel data, since most models are not able to fit time series information directly. Some deep learning methods can process time series data automatically. We aim to investigate if time series information helps improve prediction accuracy, and if deep learning offers advantage processing time series information. Methods. Using data from the Department of Veterans Affairs (VA) Cancer Registry System and VA Corporate Data Warehouse, we identified a nationwide cohort of 12,819 patients diagnosed with prostate cancer from January 1, 2006 through December 31, 2019 who later developed nmCRPC. 16.5%, and 29.8% of patients had metastasis or died at one year and two-year landmark, respectively. We used static and time-series features for time-to-event prediction. Static features included age and body mass index (BMI) at nmCRPC, race/ethnicity, Gleason score, time from prostate cancer diagnosis to nmCRPC, and Charlson comorbidity index (CCI) 6 months prior to nmCRPC diagnosis date. Time-series features included treatment, PSA, number of days from prostate cancer diagnosis, and nmCRPC status indicator. Our hypothesis was that disease response as encoded by the variation of PSA with treatment would provide information for the model as to how aggressive and/or refractory to treatment at any time point. We compared Cox with Elastic net model, SurvTrace, a deep survival model that used attention mechanism but was only able to fit static features, and WTTE-RNN, a deep survival model that was developed to predict engine failures and used time series features. Raw longitudinal features are used for WTTE-RNN model. For Cox with Elastic net model and SurvTRACE, longitudinal PSA values are summarized into static features as follows: minimum, maximum, median of PSA values and slopes. Treatments are summarized as treatment type and duration. We also took out the time series (TS) features and compared how the models perform. The outcome of interest was time to metastatsis or death from the initiation of first-line treatment after nmCRPC date or landmark date if treatment was not initiated within 3 months of nmCRPC date. Model performances were evaluated using AUROC and Brier Score (BS) with 10-fold cross validation. Results. SurvTrace with temporal information offered the best discrimination and calibration. Summarized temporal information improves performance for both Cox and SurvTrace, though the improvement was more for SurvTrace compared to the improvement for Cox model. WTTE was able to intrinsically make use of TS EHR information and outperform models without TS information. Conclusion. We demonstrate the value of EHR time series information in improving prediction in deep learning models, even those that do not intrinsically support time series information like SurvTrace. We demonstrate a novel use of WTTE-RNN for prediction using EHR data, a model, commonly used in engineering but that, to our knowledge, has never been used in EHR data. We hypothesize that with feature-engineered TS information, SurvTrace was able to outperform WTTE, possibly thanks to its leveraging attention mechanisms. Prediction using DL on EHR data should leverage its time series nature, either via models that can intrinsically utilize TS information such as RNN based models, or through careful engineering of TS features.

https://openreview.net/pdf/a2435853852c7bf105c7f8fe074a659d1f89064a.pdf

Paper ID: 109

Leveraging Irregular Time Series Intervals of Physiological Data for Inpatient Mortality Prediction

Allan Pang; Owen Ashby Johnson; Marc de Kamps; Dr Alwyn Kotzé; Geoff Hall

Background. Routinely collected inpatient physiological data can be used to determine a patient's physiological state to correlate with clinical outcomes. The sequential nature of physiological data makes Recurrent Neural Network (RNN) architectures an obvious choice for encoding this data and learning trajectories. These techniques are helpful in downstream predictions of outcomes such as inpatient mortality. A limitation of standard RNNs is the assumption of regularly spaced observations, which require pre-processing techniques such as time-boxing, potentially resulting in information loss. In the reality of the clinical environment, sampling rates between observations naturally differ, and the intervals between them can inform a patient’s state. Recently, Che et al. proposed an adapted Gated Recurrent Unit cell that learns a decay function (GRU-D) to handle differing sampling rates between variables in a multivariate time series [4]. By adapting RNNs to handle irregular multivariate time series, we may offer predictive performance advantages in a real-time risk monitoring system. Methods. To demonstrate that adapting an irregular time-series approach improves downstream event prediction, we apply adapted RNN architectures, including GRU-D, to two real-world datasets using physiological data with different sampling frequencies. Surgical Analytics (SA) contains physiological data from general wards (Level 1) and is a proprietary repository of surgical health records from Leeds NHS Teaching Hospital Trust, a major teaching hospital trust in the UK. MIMIC-IV is an open-access dataset containing critical care (Level 2/3) physiological data, typically with higher sampling and event rates. These datasets represent a range of levels of care and sampling rates, mirroring the diversity of real-world conditions required for these techniques to be deployed in the clinical environment. We trained models to differentiate episodes with favourable outcomes (discharge from unit/hospital) against poor outcomes (death/escalation of care) with a 24-hour warning. Baseline models include RNN/GRU/LSTMs that have processed data using timebox techniques and feed-forward imputation. We conducted a 5-fold stratified cross-validation and error analysis of the models in an early-warning setting. Results. Allowing for irregular time series improves prediction compared to standard time-box processing. In SA, statistically significant improvements occur at both time-step and sequence-level predictions. At the time-step level, these include AUROC (0.715 ±0.035 vs. 0.634 ±0.010), AUPRC (0.087 ±0.019 vs. 0.030 ±0.005), Recall (0.492 ±0.026 vs. 0.172 ±0.02), and Balanced Accuracy (0.683 ±0.007 vs. 0.573 ± 0.005). Sequence level predictions improved in Recall (0.447 ±0.024 vs. 0.105 ±0.012), Balanced accuracy (0.708 ±0.012 vs. 0.552 ±0.006) and F1 score (0.558 ±0.02 vs. 0.188 ±0.02). In MIMIC-IV, there are performance gains in AUROC (0.739 ±0.008 vs. 0.730 ±0.021), AUPRC (0.122 ±0.005 vs. 0.092 ±0.031), Recall (0.471 ±0.047 vs 0.349 ±0.125), and Balanced Accuracy (0.665 ±0.011 vs 0.627 ± 0.025). Error analysis demonstrates that irregular interval models, including GRU-D, can identify periods of instability outside the labelled 24-hour window, inferring similar sequences occurring within 24 hours of death occur outside of this period. The clinical utility of such a model could be used to highlight physiological instability and identify those in need of clinical intervention, similar to proprietary deterioration indices or early warning scoring systems. Conclusion. Our findings, which challenge conventional time-box pre-processing, underscore the importance of time intervals between observations in understanding a patient’s clinical state and can be leveraged in an RNN architecture. The performance gains observed in critical care and general ward environments demonstrate the versatility and robustness of this method and can be adapted to other outcomes. The ability of these models to identify physiological temporal patterns associated with poor outcomes holds the potential for improving patient outcomes. We are planning further research to explore periods of instability, which could provide insights into the confounding effect of ongoing treatments.

https://openreview.net/pdf/50d77f322a6949aa6a8444bd01fced63826419f9.pdf

Paper ID: 113

Understanding Localised Tumour Response Patterns in High-Grade Serous Ovarian Cancer Across Multiple Disease Sites Using Serial Medical Imaging Registration

Ionut-Gabriel Funingana; Ines Prata Machado; Thisanaporn Mungmeeprued; Zeyu Gao; Thomas Buddenkotte; Golnar K. Mahani; Bevis Drury; Marika A.V. Reinius; Ramona Woitek; Evis Sala; James D. Brenton; Mireia Crispin-Ortuzar

Background. High-grade serous ovarian carcinoma (HGSOC) is a highly heterogeneous disease that typically presents at an advanced, metastatic state. Neoadjuvant chemotherapy (NACT) followed by delayed primary surgery (DPS) is becoming the most frequent treatment strategy for advanced HGSOC. In current clinical practice, tumour response assessment is usually based on tumour size change on serial contrast-enhanced computerised tomography (CE-CT) scan images. Although Response Evaluation Criteria in Solid Tumours (RECIST 1.1) is the current clinical guideline to assess size change of solid tumours after therapeutic treatment, it is based on measurements of lesion diameters, and therefore is not able to capture the three-dimensional complexity of tumours. Additionally, only a few tumours are tracked by radiologists, whereas many other metastatic tumours are excluded or overlooked in the clinical evaluation process. The new generation of clinical trials for ovarian cancer requires stronger, more sensitive response metrics that can provide more robust discrimination. The aim of this study is to test the feasibility of co-registering CE-CT scans before and after treatment to enable local monitoring of volumetric disease progression at each disease site, with a particular emphasis on addressing the challenges posed by highly complex deformations. The three-dimensional deformation produced by the model could be used in combination with clinical data in an integrated prognostic model. Experiments and Methods. This study utilises data obtained from a prospective observational study conducted at Addenbrooke’s Hospital comprising 198 abdominopelvic CE-CT scans from a total of 99 patients (51 of whom were responders according to RECIST 1.1 criteria). These patients were diagnosed histopathologically with HGSOC and underwent NACT prior to DPS. Baseline scans were acquired between 0 and 14 weeks before initiation of NACT and post-treatment scans were acquired for response assessment after 1.6–5.8 months of treatment. All cancer lesions were segmented semi-automatically by a board-certified radiologist with ten years of experience in clinical imaging, using Microsoft Radiomics (project InnerEye; Microsoft, Redmond, WA, USA): omentum (24.4%), right upper quadrant (5.1%), left upper quadrant (1.1%), epigastrium (0.1%), mesentery (0.3%), right paracolic gutter (0.8%), left paracolic gutter (0.3%), ovaries & pelvis (66.4%), infrarenal abdominal lymph nodes (0.8%), suprarenal abdominal lymph nodes (0.2%), inguinal lymph nodes (0.2%), and supradiaphragmatic lymph nodes (0.1%). The percentages in parentheses represent the proportion of each lesion’s volume relative to the total volume of cancer lesions. An unsupervised deformable image registration algorithm was used to register sets of CE-CT images acquired before and after treatment and deformation vector fields (DVFs) were automatically generated for each disease site as shown in Figure 1. Results. To evaluate the registration performance, we calculate the Dice Similarity Coefficient (DSC) between the post-treatment scan and the registered pre-treatment scan at different disease sites and the vertebrae. The algorithm achieved a mean DSC of 0.77 ± 0.10 for the vertebrae and a mean DSC of 0.63 ± 0.25 for the largest disease site in non-responders according to RECIST 1.1 criteria. Conclusion. Preliminary results show that CE-CT imaging co-registration is feasible and could enable a more accurate assessment of tumour volume change in a three-dimensional space, simultaneous tracking of multiple disease sites and detection of necrotic or density changes that can be visually missed or overlooked. From a clinical perspective, automated registration of longitudinal imaging data is a prerequisite for exploiting the full potential of standard-of-care CT images for treatment response assessment in HGSOC patients. Here, we reported the registration performance for non-responders according to the RECIST 1.1 criteria, as this pipeline is particularly relevant to tracking the potential disease sites composed of resistant tumour cells. In future work, DVFs will be integrated with clinical data, including stage and age at diagnosis, performance status as evaluated by the Eastern Cooperative Oncology Group (ECOG) scale, outcome of surgery, and validated genomic biomarkers, including Homologous Recombination Deficiency (HRD) and the pathogenic variants in BRCA1/2 in a Cox Proportional-Hazard model to predict overall survival.

https://openreview.net/pdf/3244714d97b0cce2c4eb54524032897c2db5610f.pdf

Paper ID: 114

Collaboration for trANslational Artificial Intelligence tRIals: Project CANAIRI

Melissa Danielle McCradden; Xiaoxuan Liu; Judy Gichoya; Mark Sendak; Lauren Oakden-Rayner; Carolyn Semmler; Lauren Erdman; Ismail Akrout; Mjaye Leslie Mazwi; Anton van der Vegt; Ian Stedman; Antonios Perperidis; Alex John London; James A Anderson; Mandy Rickard

The governance surrounding the clinical translation of artificial intelligence (AI) products is coalescing around best practices, but a major gap concerns the bridging of proof-of-concept work and clinical trials. The ‘silent trial’ (also known as silent evaluation, shadow trial, shadow evaluation) refers to when an AI system runs in real-time in the intended clinical environment without affecting patient care[1-4]. The silent trial enables collection of on-the-ground evidence of performance while integrating data security considerations, assessing operational feasibility, validating deployment strategy and testing workflow integration. Ethically, the silent trial offers a means to generate clinically relevant evidence without risk to patients - this step thereby enables healthcare settings to trial an AI tool prior to making the decision to integrate (or not) on the basis of relevant, local evidence [5]. Many AI-enabled institutions have long recognized the value of these trials [3]. However, generally, silent trials have primarily been viewed narrowly as technical checks on the model’s performance[1]. Our group advocates for a widening of the current view on the value of these trials toward one that is socio technical in nature, operationalized through a set of best practices as recommendations for healthcare institutions[6,7]. As a first step, we use the term ‘translational trial’ to widen the scope of practices to include human factors, implementation science, operational/systems integration, social license, legal and ethical, economics, environmental, and regulatory considerations. We propose that - like the canary in the coal mine - universally adopting translational trials for AI integration can prevent known AI failure modes[8,9] and mitigate many current and emerging risks. Translational trials are proposed as a key mechanism for organizational accountability while promoting innovation and AI advancement[10].

https://openreview.net/pdf/1517d08b235f43cdc87fe0acc525c769b0848323.pdf

Paper ID: 115

The Medical Algorithmic Audit Playbook: Key considerations based on the application of a collaborative safety monitoring framework in the UK National Health Service

Aditya Uday Kale; Alastair Denniston; Xiaoxuan Liu

Background Regulators, health providers, and government are looking for efficient, scalable approaches to post-market monitoring of Artificial Intelligence (AI) Literature demonstrates that models are prone to poor generalizability (variable performance across sites) and decline in performance over time. The Medical Algorithmic Audit (MAA) [Liu et al 2022, Lancet Digital Health] is a safety monitoring framework for algorithmic error auditing and failure mode detection for AI medical devices. This paper outlines practical considerations for implementation of the MAA based on experiences of conducting these audits in the context of three AI medical devices in a large teaching hospital in the UK. These three use cases are: 1) Skin Cancer detection and triage, 2) Autonomous CXR reporting and 3) Breast screening. Findings: The Medical Algorithmic Audit (MAA) Figure1: Bow-tie analysis of AI implementation. Three example threats (left) and consequences (rights) are illustrated, with relevant preventative and mitigative barriers. The MAA aims to guide auditors through a five-stage approach to identify risks and potential failure modes, and design appropriate mitigations to ensure patient safety. Here we outline a few of the key challenges in implementation of this audit framework. The first is the availability of information in the mapping stage to identify risks and known vulnerabilities of the system. Engagement with AI vendors is important for delivery of the MAA, however commercial sensitivity is likely to be a barrier, particularly when considering training dates, algorithm architecture and thresholds. We are currently working to understand whether a lack of these details will limit identification of safety issues, or implementation of control measures. For the skin and CXR use cases mentioned above, we had no insight into the algorithmic architecture and threshold selection. However for the breast screening AI, several thresholds have been approved for use by regulators making this information vital. The second challenge we highlight is during the testing phase. Subgroup testing is a complex process and we aimed to identify the performance for predefined subgroups. For the skin AI audit we selected age, ethnicity, sex and skin type as the main subgroups based on previous literature. This process has been more complex for the CXR AI and breast screening devices. Work is needed to understand the role of clustering to identify hidden subgroups of interest (and failure modes). It is important to strike a balance between ensuring that only necessary data is collected, whilst ensuring that enough data is captured to identify and mitigate algorithmic biases. Discussion Further work involves understanding how algorithmic auditing can be done in a feasible manner including how often audits should be undertaken, how this can be integrated into existing governance structures and job plans, and how expertise from university hospitals can be disseminated to smaller less digitally mature hospitals otherwise lacking the infrastructure. If implemented appropriately within local governance processes, algorithmic auditing may support safer and more equitable use of AIaMD through early identification of unsafe technologies and planning of risk mitigations.

https://openreview.net/pdf/8e0b3c562dc3ef33a5f269c9a65a21aff3d48100.pdf

Paper ID: 118

Craniosynostosis Classification using Dynamic Graph Neural Network On 3d Photographs

John Phillips; Jaryd Hunter; Mélissa Roy; Pouria Mashouri; Michael Brudno; Devin Singh; Noah Stancati; Sam Osia; Rakshita Kathuria

Background. Craniosynostosis is the premature fusion of cranial sutures leading to craniofacial deformity. It is a rare condition with an estimated prevalence of 5.9 per 10,000 live births$^{i}$. Diagnosis prior to 3-4 months of age allows for less invasive endoscopic interventions$^{ii}$. Currently, patients are seen first come first serve where most referrals are non-synostosis. This can lead to delays in interventions for patients who receive a diagnosis resulting in more invasive treatment options. Using a three-dimensional (3d) photogrammetric image capture system (3dMD$^{iii}$) we have trained a model to predict three synostosis diagnoses, plagiocephaly and patients with normal head shapes, in the hopes of accelerating access to care for patients with likely diagnoses. Methods. Data collection 3d scans were taken of patients less than one year old visiting the clinic in the Plastic and Reconstructive Surgery Clinical Department at The Hospital for Sick Children in Toronto, Canada. Patients with a diagnosis of sagittal, metopic, or unicoronal synostosis, and patients with plagiocephaly or normal head shapes (without a craniofacial diagnosis) were collected. All patients were fitted with a stocking cap prior to capturing a stereo photogram. Each scan was manually cropped to below the mandible, and other anomalous captured points outside of the patient’s head were removed. Only scans before any cranial interventions (including helmeting) were used. This resulted in a dataset of 856 scans over 715 patients. For each scan, only the vertex information was kept, and all surface information was removed. The vertices were all normalized such that the origin is at the central point and the largest vector has a length of 1. The normalized point clouds were then cropped from the brow to the nape to remove any facial features. Model training A Dynamic Graph Convolutional Neural Network$^{iv}$ (DGCNN) model was trained using fivefold cross validation. To convert the normalized scans to graphs that can be convolved upon a random sample of m vertices is selected prior to prediction. After sampling m points a set of edges between points is created using the k nearest neighbours algorithm. This random sampling process is repeated each time the scan is seen in the training set. For model training and evaluation fivefold cross-validation was used and each patient was randomly assigned to one of the five folds. At training time three data augmentations were applied, first the plane to crop away the facial features was randomly shifted, between 0 and 0.15 in normalized Euclidean distance, along the plane’s normal vector. This resulted in further removal of the lower portions of the head. Then with probability of 50% a random point was selected and all points within 0.15 by Euclidean distance to that point were prevented from being sampled. Adam optimizer was used with the OneCycle1 learning rate scheduler, starting at learning rate of 0.0001, and maximizing at a learning rate of 0.01. Results. Our model achieved an accuracy of 75.4% with mean AUROC of 92%. This resulted in the five classes sagittal, metopic, unicoronal, plagiocephaly, and normal achieving precision of 87.3%, 62.9%, 72.0%, 84.4%, and 64.3%, respectively. The model also achieved recall of 80.1%, 84.2%, 79.8%, 71.1%, and 67.3%, respectively. The rate of non-accelerated synostosis patients was evaluated by binarizing the 5 classes to synostosis and non-synostosis. Considering synostosis the positive class, results in an accuracy of 84.7%, precision of 80.7%, and recall of 88.9%. Conclusion and future directions. We trained a DGCNN model on a cohort of patients visiting the Clinic at The Hospital for Sick Children. Our model achieved strong performance, which may be useful in accelerating visit times for patients with a high certainty of a diagnosis, increasing frequency of less invasive endoscopic interventions. Collecting 3d images still requires an initial visit to the hospital, so we aim to supplement our data collection with images captured from mobile devices. Evaluating performance on a more portable data capture method may reduce unnecessary visits to the Craniofacial clinic.

https://openreview.net/pdf/73b3003f8fc566bc02f498a6f4e0cfcbdc58af51.pdf

Paper ID: 122

Assessing the Medical Segment Anything Model semantic segmentation of the liver in laparoscopic surgery – accuracy and usability testing

Martins P; AS Soares; Sophia Bano

Background. Medical Segment Anything Model (MedSAM) is an artificial intelligence algorithm trained to segment medical images. MedSAM has been trained on 1570263 images, of which 27095 (1.7%) are laparoscopic images. Part of the usability of this algorithm relies on faster segmentation. Performance of MedSAM in this setting has not been adequately tested. We aimed to evaluate the performance of the MedSAM for semantic segmentation of liver done by residents compared with ground truth expert segmentation on the Dresden Surgical Anatomy Dataset (DSAD) and the usability in terms of reduction in time for annotation per image. Methods. The Dresden Surgical Anatomy Dataset (DSAD) contains expert semantic segmentation for 1023 laparoscopic liver images, from 20 patients. In this dataset, groundtruth masks were created using a polygon-based tool and the results were reviewed by expert surgeons in minimally invasive surgery. The data used to train MedSAM did not include the data of the DSAD. In the present submission, general surgery residents followed the annotation instructions for DSAD and created masks for liver images using MedSAM bounding box annotation. These segmentation masks were compared using the Dice Similarity Coefficient (DSC), which measures the intersection over the union of the masks. In an exploratory analysis, segmentation was also performed using an open source watershed tool (Pixel Annotation Tool) and timed for a randomized subset of images. Comparison of DSC between annotators was calculated using the Kruskal Wallis test, with a p value less than or equal to 0.05 considered statistically significant. Results. In total 3359 frames were annotated by 3 surgery residents. Semantic segmentation using MedSAM achieved a median DSC of 0.95, with an interquartile range (IQR) of 0.08. Per resident median DSC and IQR were: 0.95 (0.89 - 0.98), 0.95 (0.89-0.97) and 0.96 (0.91-0.98). There were statistically significant differences (p value < 0.01), due to annotator number 3 achieving higher accuracy. In the exploratory analysis, the same set of 295 randomized liver images was annotated by two residents. Segmentation using MedSAM achieved an average time per image of 25,4 seconds against an average time per image of 63 seconds with segmentation using the watershed method. This represents a 59.7% reduction in annotation time. Conclusion. Semantic segmentation performed by residents using MedSAM achieved a median DSC of 0.95 for liver semantic segmentation in laparoscopic view, when compared with expert segmentation on the DSAD. While this is an excellent result, inter annotator variability still exists. Future work should be done to identify determinants of the best segmentation strategies using MedSAM. Segmentation using the bounding box method with MedSAM was performed in a significantly shorter average time per image when compared with watershed segmentation (59.7% time reduction)

https://openreview.net/pdf/3656e43bec2aa64287d88364672a46abd3734a46.pdf

Paper ID: 124

Diagnosing Disparities: Assessing the Impact of Patient Demographic Descriptors on GPT-3.5 Turbo's Clinical Reasoning Performance on Multiple-Choice Questions

Zachary M. Cross

Diagnosing Disparities: Assessing the Impact of Patient Demographic Descriptors on GPT-3.5 Turbo's Clinical Reasoning Performance on Multiple-Choice Questions Author: Zachary M. Cross, BA Affiliation: Department of Medical Education, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA Background: Large language models (LLMs) are rapidly emerging as powerful tools with the potential to transform healthcare through applications such as clinical decision support, patient communication, medical education, and research.1,2 Recent studies have demonstrated the impressive capabilities of LLMs in comprehending clinical knowledge and achieving high performance on medical licensing examinations.3–5 However, concerns have been raised about potential biases encoded within these models, questioning their equitable use in practice.6,7 Further research is needed to understand whether such biases could maintain or exacerbate existing health disparities among diverse patient populations. Additionally, while prior research has investigated the presence of gender and racial/ethnic biases in LLMs8–10, the impact of other social determinants of health, such as socioeconomic status, health insurance coverage, and sexual orientation, on LLM performance on clinical reasoning tasks remains largely unexplored. This pilot study aims to systematically evaluate GPT-3.5 Turbo's performance on clinical vignette-based multiple-choice questions, specifically investigating whether varied patient demographic descriptors influenced the model's clinical reasoning and answer accuracy. Methods: A dataset of 745 clinical vignette-based multiple-choice questions (MCQs) was curated from publicly available USMLE sample items spanning 2018-2024 across Steps 1, 2CK, and 3.11–13 Vignettes were extracted, preprocessed, and categorized by patient age and gender. Demographic descriptors related to race/ethnicity, sexual orientation, socioeconomic status, and other factors were systematically varied across versions of the dataset using placeholder substitutions. Parallel processing scripts from OpenAI were adapted for asynchronous chat completion calls to evaluate gpt-3.5-turbo-0125.14 The model selected the best answer choice and provided reasoning for each MCQ, with outputs in JSON format using parameters: temperature=1, top_p=1, response_format=json, max_tokens=4096. Response accuracy was compared between the demographic-free baseline dataset and each modified dataset containing demographic descriptors using independent t-tests. An ANOVA assessed performance differences across all groups to identify impacts of demographic variables. Results: Across all demographic descriptor categories, including race/ethnicity, sexual orientation, socioeconomic status, and others, statistical analysis revealed no significant differences in the performance of GPT-3.5 Turbo when evaluated on clinical vignette MCQs containing varied patient demographics (p > 0.05 for all comparisons). The model achieved an overall accuracy of 63.67% on the baseline dataset without demographic information. This score is consistent with the performance of GPT-3.5 from prior studies.3,4,15 These findings suggest that, within the limitations of this pilot study, GPT-3.5 Turbo's clinical reasoning abilities, as assessed by multiple-choice question accuracy, were not significantly influenced by the inclusion of patient demographic information. Conclusion: Despite the limitations inherent in this pilot study's design, our findings suggest GPT-3.5 Turbo's performance on clinical vignette-based MCQs remains largely unaffected by the inclusion of diverse patient demographic information. While these results are encouraging, they should motivate further investigation into the potential for biases in more advanced models like GPT-4, as well as exploration of alternative input modifications that may influence LLM performance on clinical reasoning tasks. Further, establishing robust evaluation metrics beyond MCQ accuracy is crucial to comprehensively assess clinical reasoning skills and potential biases in LLM outputs within real-world clinical workflows. Ultimately, continued research and vigilant oversight are essential to ensure the equitable and responsible integration of AI within healthcare.

https://openreview.net/pdf/28c6c1055949738f2a68a78f604c503c59e5b2eb.pdf

Paper ID: 126

Implementing Artificial Intelligence in Clinical Medical Education using a Case-Based Curriculum

Prithi Chakrapani; Rafael Schulman; Ali Razavi; Celine Hardy; Ankita Saxena; Sriram S. Narsipur

Background:Artificial Intelligence (AI) and Machine Learning (ML) based algorithms are increasingly influential in medicine, offering revolutionary diagnostic and treatment tools, while also raising ethical or legal questions. Despite rapid advancements, there is little consensus on the best approach to teach AI to medical students. Here, we present a case-based format to impart practical AI knowledge. In addition to delivering content to clinicians-in-training, we also surveyed participant’s current knowledge about AI and ML. Our project aims to improve medical student understanding of AI and ML in medicine and gather data on their AI knowledge breadth. Methods:We developed a case-based curriculum on AI in medicine, each case targeting specific learning objectives in clinical knowledge, technical aspects, and ethical/legal issues. Focusing on medical scenarios involving AI or ML, our cases address computer vision and machine learning for grading bladder cancer and diabetic retinopathy, large language models, neural networks for coronary artery disease risk estimation, and image processing in pulmonary embolism determination. Each case elicits related ethical and legal discussion. We evaluated the curriculum's efficacy through pre and post-session surveys. Results:Thus far, we have evaluated this content through an extracurricular event and a ‘Transition to Residency’ (TiR) course for fourth-year medical students. Both sessions involved a presentation, Q&A session, and discussion after each AI clinical use case. In both programs, preliminary findings show that students gained a more holistic appreciation of AI's clinical application. Extracurricular Event:30 medical students registered for the session, ranging from first to fourth year of medical school. The pre-session survey showed that only 56.3% of students found the ethics of AI important to learn in medical education, which rose to 93.3% in the post-session survey. Survey results also indicated a rise in student interest in AI technical knowledge, from 68.8% pre-session to 80.0% post-session. Before the session, all surveyed student participants agreed on the importance of AI clinical-best practices in medical education. Transition to Residency :140 fourth year medical students attended this session, and 101 students completed the pre-session survey and 81 students completed the post session survey. In the pre-session survey, 20.8% of students rated the importance of AI literacy in medical education as a 5/5, rising to 42% in the post session survey. Between the pre and post session survey, there was a 22% increase in students reporting familiarity with convolutional neural networks, 21% increase in familiarity with training and testing sets, and a 12.7% increase in familiarity with supervised and unsupervised learning methods. 90.1% of students reported learning about the ethics of existing tools was important, and 93.8% of students reported that clinical best practices were important to include in medical education. Conclusion:This case-based content is a practical response to a national need in medical education. Better explanation and use of AI technologies in medical education can improve physicians familiarity with advanced tools in their practice. Our clinical case-based format was able to show an increased appreciation and interest for incorporation of these topics into medical education. Additionally we will continue to gather data on how well the information is understood and to refine our methods to boost AI literacy and ease with clinical AI tools. Moreover, we hope that incorporating practical AI content into the curriculum will nurture involvement of the medical community with the emerging AI field, thus encouraging the development of innovative medical AI applications and their judicious adoption.

https://openreview.net/pdf/c09a633c41f75e7658970668a84705651bcee10f.pdf

Paper ID: 129

Development of a computational model of SBS scores in critically ill children

Sayantika Roy; Barbara Pejic; Sarah Wu; Feiyang Huang; Sidharth Raghavan; Jake Samuels Hoffmann; Jessica LaRosa; Kristen M. Brown; Nicholas J. Durr; Sapna R. Kudchadkar; James Fackler MD

Development of a computational model of SBS scores in critically ill children Sayantika Roy1; Barbara Pejic 2; Sarah Wu3; Feiyang Huang, BS4; Sidharth Raghavan2; Jake Hoffmann3; Jessica LaRosa5, MD; Kristen Brown6, DNP, CRNP, CPNP-AC, CHSE-A, FAAN; Nicholas J. Durr, PhD2; Sapna Ravi Kudchadkar5, MD, PhD; James Fackler5, MD 1University of Rochester School of Medicine and Dentistry 2Department of Biomedical Engineering, Johns Hopkins University 3Department of Computer Engineering, Johns Hopkins University 4Weill Cornell Medicine, Tri-Institutional Training Program in Computational Biology and Medicine 5Johns Hopkins University School of Medicine, Anesthesiology and Critical Care Medicine 6Johns Hopkins School of Nursing Background. Standardized evaluation of sedation levels in critically ill children is essential in sedation management. 250,000 children receive critical care in the US annually, over 30% of whom receive sedative medications at some point during their stay. However, only 58% of those are optimally sedated, with 32% oversedated and 10% undersedated. Improper sedation increases risk of adverse effects, with over-sedation leading to prolonged respiratory support and withdrawal syndrome, while under-sedation increases patients' physical and physiological stress, often leading to self-extubation and the need for restraints. Current standards of care use sedation-agitation scales such as the State Behavioral Scale (SBS) to assess patient sedation level, but their subjective nature leads to high variability, highlighting the need for more precise evaluation methods to guide dosage administration in the pediatric intensive care unit (PICU). The objective of this study is to investigate whether vitals data alone can predict nurse-reported SBS scores. We hypothesize that a computational model incorporating heart rate (HR), respiratory rate (HR), and oxygen saturation (SpO2) changes will accurately predict nurse-reported SBS scores. This is the first step towards continuous automated monitoring of sedation level using vitals data in pediatric patients. Methods. A retrospective electronic medical record (EMR) dataset from 415 pediatric patients receiving continuous sedation with at least one documented SBS score at Johns Hopkins PICU was acquired. To adjust for class imbalance biased towards SBS 0 (awake and able to calm), 126 subjects with at least one instance of extreme SBS scores ( -2 or +2) were selected. Time-series data at 1 minute resolution was acquired for the entire patient duration of stay. 173 normalized time series features were extracted from HR, RR, and SpO2 waveforms 15 minutes prior to each documented SBS instance. To adjust for patient demographics, we included age and ventilation status as additional features. We developed 2 ordinal models under two experimental settings: three-class (SBS: {-3, -2, -1}, {0}, {+1, +2}), and five-class (SBS: {-3, -2}, {-1}, {0}, {+1}, {+2}). To adjust for sample imbalance, weighted sampling was implemented on the training dataset using one linear layer of weights. Ten repeats of ten-fold cross-validation were performed, and accuracy, F1-score, sensitivity, specificity and Cohen’s Kappa were computed to evaluate model performance. Results. Our three-class model predicted SBS with 44.05 ± 5.34% accuracy, 44.65 ± 6.36% F1-score, 44.05 ± 5.34% sensitivity, 84.57 ± 2.86% specificity, and 12.94 ± 6.84% Cohen’s Kappa. Our five-class model had 25.45 ± 6.07% accuracy, 31.86 ± 6.78% F1-score, 25.54 ± 6.86% sensitivity, 81.46 ± 3.16% specificity, and 5.54 ± 6.09% Cohen’s Kappa. The three-class model had a higher performance than five-class, with a more precise distinction of deep sedation (SBS < 0) and agitation (SBS > 0). Conclusion. Both models, particularly the three-class, indicate a correlation between vitals and SBS, but the low resolution of EMR data is insufficient in prediction. Future efforts to improve model performance include collection of higher-resolution time-series vitals with standardized frequent SBS scoring. Though further investigation is necessary, our future efforts aim to create a robust sedation assessment algorithm with the goal of improving real-time sedation assessment.

https://openreview.net/pdf/aa57197d04af263075db16cbc97e2e1fe8332c80.pdf

Paper ID: 130

Optimizing Operating Room Scheduling and Ensuring Fairness in Managing Waitlists for Joint Replacements

Aazad Abbas; Cari Whyne; Elias Boutros Khalil

Background: Inefficiencies in surgical services are a leading contributor to rising costs and increasing wait times in Canadian healthcare. Perioperative care comprises up to 48% of hospital budgets (operating room (OR) cost >$34/minute). Despite a steady increase in spending, wait times for surgery among most OECD countries, including Canada, have been increasing. Lack of timely surgical care can have a direct deleterious impact on patients; 19% of patients awaiting total hip arthroplasty report a quality of life “worse than death”. Delays are increasing with the rapidly growing demand for surgery, further exacerbated due to the COVID-19 pandemic. Currently, in many institutions, primary total hip and knee arthroplasties (THA and TKA) are scheduled from a surgeon’s waitlist on a first come first served basis, without consideration of patients’ severity (pain and function). The aim of this project was to determine if machine learning (ML) scheduling could increase OR throughput and ensure patient fairness with respect to access for THA and TKA surgery. Methods: A predict-then-optimize scheduling pipeline was first employed to improve prediction of duration of surgery for THA and TKA. Data was collected from all primary and revision THA and TKAs performed at a single institution from 2012 to 2022 (REB #4899). The features obtained from the data collected included age, gender, co-morbidities, and joint specific features such as arthritis severity, deformity and implant type. Using these features, various machine learning models were trained for each procedure type, and the best performing predictive model used in the optimization. To create the schedules, a multi-objective optimization was employed consisting of three objectives: utilization, wait time, and preoperative patient reported outcome measures (PROMs) (Figure 1). The utilization score was based on maximizing OR utilization during the regularly scheduled OR hours with a penalty for overtime. The wait time was based on the cumulative duration of a patient on the surgical waitlist and the preoperative PROM used the Western Ontario and McMaster Universities Arthritis Index (WOMAC) scores. The score for each objective was scaled between 0 and 1 to allow for comparability among objectives. The scaled objective values were then multiplied by importance weights, set by the scheduler to reflect their scheduling priorities, and summed to get the final score of the schedule. Using software written in Python, simulations of the scheduling process were performed. Results: A multi-layer perceptron (MLP) model was found to yield the best performance, with surgical duration predictions 10% better than a surgery surgeon specific prediction currently used at our institution (based on the durations of the surgeon’s last twelve cases of each specific procedure). The optimized scheduler was able to perform 1740 hip and knee replacements that were performed from 2021-2022 in 66 fewer OR days, which represents a 13% decrease in resource utilization. Using a fixed pool of 500 patients, the utilization of OR time achieved was 97% of the maximum when only contributing 40% to this component objective (30% of the component objectives were assigned to preoperative PROMs and to time on waitlist) (Figure 2). The average WOMAC score and wait time of the patients scheduled increased 34% and 8% respectively, when contributing 30% each to the overall objective compared to 0%. Reducing these contributions to only 15%, led to only small decreases in the WOMAC and weeks on waitlist measures. Additionally, a simulated scheduling of 20,880 patients, resampling from the 1740 cases, was conducted to ensure no bias in wait time for various protected attributes such as gender, age, or BMI. There were no significant differences in patient wait times across these attributes. Conclusion: By leveraging ML predictions and multi-objective optimization of elective scheduling of THA and TKA procedures, the efficiency of OR schedules can be increased. Moreover, this can be done in a fair way, ensuring no bias against patients based on their age, gender or BMI. The multi objective scheduling optimization requires a trade-off between the relative weights of the different objectives, which may vary depending on the priorities of specific institutions. This allows for a powerful tool to ensure surgical care is equitable and fair for all populations.

https://openreview.net/pdf/2440e570244bc6ad03ff77271ba04bc040be5673.pdf

Paper ID: 134

Integrated Predictive Modeling of COVID-19 Severity through Advanced Machine Learning and Deep Learning Algorithms: Particulate Matter Exposure

Sophia Kwon; Ziqi Zhao; George Crowley; Anna Nolan

Background. COVID-19 was the third leading cause of death in the US between March 2020 and October 2021, and devastated densely populated cities like Milan, Wuhan and New York City (NYC) by overwhelming hospitals with high caseloads of resource-depleting disease. Understanding the complexity of COVID-19 was also challenging due to the large amount of data that was generated, limiting the ability of traditional analytic methods. This has prompted a rigorous scientific inquiry into predictive methodologies that can effectively identify risk factors of COVID-19 related critical illness. Machine learning facilitates exploring risk factors that predate disease presentation that could prime an immune response and worsen outcomes. We examine the impact of particulate matter (PM), which constitutes a significant component of air pollution, and has been studied as a vector of transmission and risk factor for worsened lung outcomes. In particular, the fluctuations in PM2.5 may induce severe lung injury, exacerbation of chronic obstructive pulmonary disease, and even immunotoxicity, indicating the complicated correlations and mutual effects between PM and disease outcomes. Methods. Extraction of demographic characteristics, vital signs, and laboratory value data was performed from the electronic medical record (EMR) using structured query language (SQL). A cohort of 14,726 individuals was identified who had a hospital encounter at our institution during the period from March 1st, 2020, to April 26th, 2021, were diagnosed with COVID-19, and consented to participate in this study. We defined two outcomes: 1) moderate COVID-19 as never being admitted to the ICU and survived; and 2) severe COVID-19 as either being admitted to the ICU or died. Environmental exposure data, specifically concerning particulate matter and other pollutants, was obtained from the Environmental Protection Agency (EPA) and linked with the patient data based on zip code using Pollution-Associated Risk Geospatial Analysis SITE (Pargasite) to acquire an annual and monthly mean level of exposure to pollutants, specifically PM2.5, ozone, nitrogen oxide, and carbon monoxide. Preprocessing: Data was split into 80% training and 20% testing randomly. Class imbalance was addressed by SMOTE and class weight compute function on sklearn utils. Processing missing value with mean imputation and multivariate imputation by chained equations (MICE) was performed in R to handle any missing values. Feature selection occurred through stepwise regression of 739 features. Hierarchical models and ensemble methods include random forests or gradient-boosting trees, and supervised machine learning algorithms with inherent feature interaction capabilities including decision trees and neural networks were integrated in Python. Results. The study included all hospital admissions of individuals aged 18 years or older (n=14726), of which N=11,888 had moderate COVID-19, and N=2,838 patients had severe COVID-19. Average annual PM2.5 concentration in 2019 was significantly less in those with moderate COVID-19, mean(SD) 7.98(0.57) vs 8.01(0.57) µg/m3 who had severe COVID-19, p=0.011. There was no significant difference between any of the air pollutant levels prior to 2019 in moderate vs severe COVID-19. Age, neutrophil/ lymphocyte levels, BMI, and pollutant levels are in the top 30 features with an AUC of 0.93. Logistic regression shows risk of severe COVID is increased by 73% if male, and by 13% for every 1µg/m3 increase of PM2.5 in 2019. In evaluating different machine learning algorithms, the Decision Tree model achieved the highest accuracy (0.9586), closely followed by Random Forest (0.9569). Gaussian Naive Bayes displayed a comparatively lower accuracy (0.7807), possibly due to its assumption of feature independence, which is often violated in complex clinical datasets. The Neural Network and K-Nearest Neighbors models demonstrated moderate accuracy, underscoring the need for careful model selection and hyperparameter tuning tailored to the dataset's characteristics. Conclusion. Our comprehensive methodology encompasses the extraction and integration of a vast dataset, comprising demographic information, clinical parameters, and features of the exposome (PM exposure levels) from 14,726 COVID-19 patients. The results indicate that even small increases in PM2.5 levels are associated with a higher risk of severe COVID-19 outcomes, highlighting the broader implications of air quality on public health. This study underscores the potential of integrating environmental health data with clinical indicators to forge robust predictive tools, paving the way for more targeted and effective public health responses. The novelty of this research lies in its integrative approach, combining medical expertise with data science to predict COVID-19 severity and inform clinical decisions.

https://openreview.net

Paper ID: 147

An Automated Standardized Patient for Medical Student Training

Elliot Levi; Ali Razavi; Christopher Dunham; Rafael Schulman

Background. Standardized patients are an integral part of medical student training. Interaction with standardized patient actors offers medical students the opportunity to practice bedside manner, history taking, physical examination, and clinical reasoning skills prior to real patient interactions in a clinical setting. Additionally, associating symptoms with specific medical conditions may be more easily retained when learned through patient interaction, leading to better long-term retention. Regardless of this utility, standardized patients are only used occasionally; the cost of employing and training real life actors is prohibitive to frequent use. Moreover, with the current framework of standardized patients, students only interact with a narrow range of patient personalities and presentations, which encompass factors such as how the patient communicates and reports symptoms, what the patient forgets to mention, and general deviations from the classical presentations of the diagnosis. The advent of transformer-based large language models has made it possible for convincing automated conversational agents. Moreover, prompt engineering allows for rapid modification and customization of model behavior. Here, we introduce “disease scripts”, which are prompts that instruct large language models to answer as a standardized patient would. Each script provides generic instructions for standardized patient behavior, as well as disease-specific information to allow the model to role-play a patient with a specific condition. Methods. We developed a workflow that intakes a text file as a “disease script”, provides this as context to ChatGPT through the ChatGPT API, and allows a user to conversate with the pre-prompted model. We wrote “disease scripts” for 4 conditions: Multiple Sclerosis, Stroke, Parkinson’s Disease, and Cluster Headaches. Clinical information for each of these conditions was adopted from standard study preparation material for the United States Medical Licensing Examination Step 1 exam, which is administered to all second-year medical students in the United States. This information was integrated with fictional personal data for the standardized patient. In sum, each script included, demographics, mood, symptoms, level of health literacy, and physical exam findings. Additionally, each prompt contains instructions developed and optimized to guide the model to behave appropriately as a standardized patient. Instructions were optimized by iteratively conversating with the model, noting weaknesses in dialogue, and adjusting the script accordingly. Versions of each prompt were stored for later reference. Additionally, we experimented with both GTP-3.5 and GTP-4 to assess performance and tried varied temperature settings. Results. We show that with appropriate prompting, large language models can successfully play the role of the patient, incorporating both general context instructions and specific patient information. However, without clear and specific prompting, we found that the model can misunderstand instruction and create unintended responses. For example, we sometimes observed the model to play the role of the student or divulge too much information, thereby revealing the diagnosis. For example, when asked about any family history of neurological disorders, the patient at times responded that they have no family history of multiple sclerosis, thereby revealing the disease entity that should have been elucidated by probing questioning by the student. Conclusion. Supplementing in-person standardized patient actors with automated patients will increase access to a valuable educational tool. Large language models can effectively play the role of patients with given conditions. To achieve this, simple and effective prompt engineering, rather than model training, is sufficient to effect the desired model behavior. Automated standardized patients have the potential to improve several aspects of medical education, including patient interaction skills and bedside manner. While we explicitly explored the usage of these models as an adjunct to live standardized patients, many other potential applications exist. Board examinations for medical licensure are currently administered in a multiple-choice question format that does not adequately simulate or assess the crucial information-gathering skills necessary for competent medical practice. Implementation of automated standardized patients could potentially measure competency with regards to this skill set that is notoriously difficult to evaluate, thereby improving the quality of licensure exams. Future projects will require continued improvement in the quality of prompt engineering, accurately assess the validity of automated responses, and potentially provide constructive feedback to student users.

https://openreview.net/pdf/0683bfd28e102c6bfb0fb027a6d518f492763363.pdf

Paper ID: 152

Beyond Single Metrics: A Time-Series Hybrid Machine Learning Approach to Optimize ICU Discharge Decisions

Bita Behrouzi; Tina Behrouzi; Anna Goldenberg

Background. Accurate and timely discharge decisions in the Intensive Care Unit (ICU) are vital for patient outcomes and resource management. Identifying optimal discharge timing is crucial to prevent premature discharges, which could result in readmissions or death, and to avoid unnecessarily prolonged ICU stays. Traditional models often assess static metrics such as length of stay, readmission rate, or mortality in isolation, potentially overlooking the complex, evolving clinical parameters of ICU patients. Recognizing these gaps, our research develops a machine learning model that integrates these essential metrics with time-series data. This approach provides real-time, comprehensive assessments of discharge readiness, enhancing decision-making in the fast-paced ICU setting and potentially improving patient care and operational efficiency. Methods. We implemented a hybrid machine learning model using a Long Short-Term Memory (LSTM) network, processing time-series data from the Medical Information Mart for Intensive Care (MIMIC-IV) database. The LSTM model, updated every 6 hours, initially predicted the mortality rate using a task-specific multilayer perceptron (MLP). If the mortality risk was below 50%, it proceeded to predict the time to discharge and readmission rate with task-specific MLPs. An attention mechanism enhanced the model by prioritizing significant past data features. For time-to-discharge predictions, a discretized loss calculation divided the predictions into twelve 6-hour intervals to fine-tune accuracy. In the testing phase, the penalty mechanism refined the model’s predictive accuracy by applying penalties when a patient's actual discharge occurred later than predicted, enabling real-time adjustments based on observed discrepancies. Model robustness was validated through 5-fold cross-validation, assessing mortality and readmission with the Area Under the Curve (AUC) and accuracy; time to discharge was evaluated with Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). Results. We trained the hybrid LSTM-based machine learning model using a dataset comprising 47,649 ICU stays from the MIMIC-IV database, which encompasses detailed adult patient information including demographics, vital signs, medications, and procedures. The data was characterized by a mortality rate of 7%, a readmission rate of 12%, and an average length of stay of 2.68 days. Employing 5-fold cross-validation, the LSTM model achieved an AUC of 0.95 for mortality predictions and 0.91 for readmission, demonstrating its strong predictive performance for crucial patient outcomes. The models’ accuracy rates were 91% for mortality and 89% for readmission. The predictions for time to discharge showed an MAE of 23.2 hours and an RMSE of 26.8 hours, reflecting the model’s precision in estimating discharge timings. Moreover, the integration of a penalty mechanism for deviations in predicted discharge time resulted in an average error reduction of 8%. The penalty mechanism’s success in reducing individual-level prediction error underlines the importance of continuously adapting the model with new patients’ data, ensuring tailored accuracy and continued relevance in a dynamic clinical setting. Conclusion. This study introduces a hybrid LSTM-based machine learning model that optimizes ICU discharge decisions through dynamic predictions. The model uniquely predicts three critical metrics—time to discharge, readmission rates, and mortality—with high accuracy, enhancing discharge decision-making. Its capacity for dynamic discharge flagging, based on predefined criteria for time to discharge, mortality, and readmission rates, marks a significant advancement in managing ICU resources and patient care. Distinguished by its integration of an LSTM network with task-specific MLPs and an attention mechanism, the model improves responsiveness and accuracy in real-time ICU settings. In addition, the integration of a penalty mechanism during testing allows for continuous refinement of predictions across diverse patient populations, promptly addressing any discrepancies. This personalized online learning technique adapts to deviations in patient outcomes from previously trained populations, increasing the model’s utility and reliability. Future developments aimed at interpreting the attention module to identify critical decision-making junctures and important features will further enhance clinical decision support. The adaptability of the model to individual patient data and temporal variations underscores its significant potential to improve patient outcomes and optimize ICU resource management, making it a transformative tool in healthcare settings.

https://openreview.net/pdf/10bfd3422852acecd59f3b98af5ccc1f6d14512c.pdf

Paper ID: 153

Assessing real-time mobile-friendly computer vision models in hip fracture detection on ultrasound imaging

Nicholas J. Yee; Yohannes Soenjaya; Michael Raymond Hardisty; Christine Demore; Mansur Halai; Cari Whyne

Background. Falls in the elderly population can have serious medical consequences, such as hip fractures, and the standard of care is to visit an emergency department (ED) and obtain x-ray imaging. However, among older adults presenting to EDs with acute hip trauma, only 25.7% are reported to have a hip fracture. While some patients who fall minimally benefit from an ED visit, others may experience delayed time-to-surgery as further hospital transfers may be required for orthopaedic management or due to long ED wait times leading to higher risks for morbidities and mortality. Developing an improved protocol for on-site hip trauma triaging may reduce the time-to-surgery and improve healthcare access and outcomes, especially for older patients living in rural and remote communities. Point-of-care ultrasound imaging (US) is a portable and affordable diagnostic imaging tool that has been shown to be useful in identifying hip fractures, but its current use is dependent on the limited availability of trained sonographers. Advancements in computer vision presents an opportunity to reduce operator dependence for point-of-care US by automating image interpretation, thereby improving hip trauma assessments outside the ED in pre-hospital environments. By reducing the US operator dependence, it could equip healthcare professional with essential on-demand medical imaging. This study aims to develop mobile-friendly computer vision models that can automate hip fracture detection on US images with the goal to enable naïve musculoskeletal US operators, such as emergency medical services and nurses, with imaging as part of their triaging assessments. Currently, there are no published studies evaluating real-time classification models in hip fracture detection on ultrasound images. Methods. Both limbs of eight porcine cadavers were imaged along anatomical regions of interests with US pre- and post-hip fracture using a Philips EPIQ 7G Ultrasound System (Koninklijke Philips N.V., Amsterdam, Netherlands). The image frames of the cadavers were split into training (6), validation (1), and test (1) datasets. The training set was augmented by rotation, translation, scaling, horizontal flipping, resolution reduction, perspective changing, adjusting brightness, contrast, and saturation, and applying Gaussian noise. The mobile-friendly models were pre-trained on ImageNet: MobileNetV3 (Small and Large) and EfficientNet-Lite (0, 1, and 2). Comparison was made against standard ImageNet pre-trained vision models: ResNet-18 and Vision Transformers (ViT-B and -L). Acquired US images were labelled as no bone, intact bone, or fractured bone (Fig. 1). Evaluation included accuracy, precision, recall, and specificity. Results. The 1682 training, 363 validation, and 306 testing images each contained ~25% no bone, ~50% intact bone, and ~25% fractured bone images. EfficientNet-Lite1 (4 million parameters) had the highest performance with 89% accuracy, 89% precision, 92% recall, and 98% specificity. The reference models generally had lower performance: Resnet-18 (11 million parameters) with 92% accuracy, 91% precision, 92% recall, and 92% specificity; and the best vision transformer model, ViT-L (303 million parameters), with 83% accuracy, 93% precision, 83% recall, and 80% specificity. Conclusion. This pre-clinical study evaluated mobile-friendly computer vision models in identifying hip fractures from porcine US images and demonstrated that, given our dataset size, the smaller models had sufficient capacity to adequately learn US image features to characterize fractures. EfficientNet-Lite1, a mobile-friendly computer vision model, had the highest performance compared to the reference models suggesting the potential for real-time fracture detection models that could be powered by mobile devices for hip trauma assessments in communities. Future clinical studies may benefit from transfer learning from these models towards the goal of enabling real-time point-of-care imaging of hip fractures by emergency medical services and nurses.

https://openreview.net/pdf/958a64103ec3716d1df222f0df9215dd60e3fce4.pdf

Paper ID: 154

Bridging the Gap Between Research and Clinical Practice: A System for Deploying Machine Learning Models in the Operating Room

Amin Madani; Spencer Gable-Cook; Jun Ma; Pouria Mashouri; Michael Brudno; BO WANG; Stephanie Williams; Jimmy Qiu; Caterina Masino; Haochi Zhang; Clare E Mcelcheran

Background. Surgical complications are extremely common and one of healthcare’s biggest sources of morbidity, mortality, and costs. Most adverse events are preventable and related to decisions at the time of surgery, suggesting that there is a need for quality-improvement initiatives aiming to enhance surgical team performance. Artificial intelligence (AI) models have shown the ability to provide surgeons with real-time guidance and decision support and ultimately improve patient outcomes. For example, computer vision AI algorithms can make inferences on live video streams of image-guided procedures (e.g. laparoscopy) to assist with the surgery. Despite multiple AI solutions that have achieved acceptable benchmarks, the translation into the operating room (OR) has been limited.1,2 There exist several reasons for this lack of adoption, including the need for high computational power to ensure real-time rendering of surgical video, requiring the processing of video frames in less than 85 ms.3 OR computers do not commonly have the specialized hardware to attain the required processing times. To mitigate this issue, Zhang et al4 created a cloud-based solution that runs model inference on a private server. However, the cloud-based system exhibited a latency of 100 ms. We propose a universal plug-and-play solution that runs any AI model locally to ensure that the network bottleneck is eliminated. Methods This project implements a full pipeline for overlaying AI output on the live sterile field monitor (“monitor”). It employs GoNoGoNet, a deep learning model trained to identify the safe and dangerous zones (green “Go” zone and red “No-Go” zone, respectively) for dissection during laparoscopic cholecystectomies (LC)5 (Figure 1). The output of the scope tower is fed into a capture device, which is connected into the workstation. The workstation performs inference on the current video frame, combines the model output with the original frame, and transmits the result to the monitor. Surgeons and other clinicians interact with the system using the mobile application (Figure 2). The workstation is built in a compact, small form factor 15 L case to minimize the device footprint in the OR. A NVIDIA GeForce RTX 4070 GPU is used to ensure that the workstation can perform real-time inference on both in-house models trained within the University Health Network (UHN) and external open-source models.6,7 Based on semi-structured interviews of surgeons, a mobile application was developed to provide a wireless user-friendly experience for managing user and procedure specific settings on the workstation, such as model selection, threshold prediction, overlay colour, and transparency. During the procedure, the mobile app enables users to toggle between preselected AI models and the original video. A central management service facilitates communication between the mobile app and workstation as well as storage of user and model information. Results The system has been tested in the lab using a Storz scope (Image 1 Hub) with an input frame resolution of 720x486 pixels. An Epiphan AV.io SDI to USB capture device transmitted the scope video into the workstation at 60 frames per second. The capture device’s output video resolution was matched to the closest resolution of the input source, 720x480 pixels, and the output frames were upscaled to the monitor resolution of 1920x1080 pixels through the system. The average time to process and display a frame was 5.7 ms. The mean latency between the scope video and the video output on the monitor was 32 ms. The system has been utilized in two live LC procedures at Toronto Western Hospital. Conclusion This project represents an interdisciplinary effort among various teams within UHN: AI scientists from the UHN AI Hub and UHN Data Team, engineers with the UHN Techna Institute and UHN Medical Engineering, and surgeons affiliated with the UHN organization. This unique partnership ensures expertise in all aspects of translating AI models from bench to bedside, including training of the AI models, optimizing inference times of the models, interfacing the models to interact with the workstation and mobile app, and integrating the hardware to minimize latency between the scope video and the video displayed on the monitors. More testing in different ORs across UHN is necessary before integrating the system into a surgeon’s daily workflow.

https://openreview.net/pdf/030e2912eb4b80a72d97a82008dd6ae778eab3c6.pdf

Paper ID: 169

Traumatic Brain Injury Intervention Prediction

Mike Matos

Blunt traumatic brain injuries (TBI) in North America, primarily caused by falls and motor vehicle accidents, necessitate extensive healthcare resources due to the complexity and potential long-term disabilities associated with these injuries. At a level 1 trauma center, most TBI cases are initially classified as non-intervention; however, some progress to require surgical or invasive measures, justifying a period of observational triage. To enhance decision-making and resource utilization, we applied a machine learning model using an ExtraTrees Classifier to predict the need for intervention based on a retrospective cohort study of 200 non-penetrating TBI patients from a trauma registry. We extracted and processed various clinical, demographic, and imaging data, including the use of Term Frequency-Inverse Document Frequency (TF-IDF) on narrative CT scan reports. The model displayed promising results, with a balanced dataset showing an AUC of 0.875, Precision of 0.882, and an F1 Score of 0.882 in testing, underscoring its potential in clinical settings. As we expand our dataset towards the target of 2,000 records, and plan to incorporate CT image features at 500 records, we anticipate further improvements in performance, aiming to refine predictive accuracy for better managing TBI cases.

https://openreview.net/pdf/a3a29079ab1fd3e90c7fb7e9d93b0911c7daa20a.pdf

Paper ID: 23

General Practice Eventograms (GPEGs): Predicting Emergency Hospital Admission with Human-Understandable EHR Representations

Benjamin Post; Roman Klapaukh; Stephen J. Brett; Aldo A. Faisal

General Practice Eventograms (GPEGs): Predicting Emergency Hospital Admission with Human-Understandable EHR Representations Benjamin Post1,2, Roman Klapaukh3, Stephen Brett1,4, Aldo Faisal 1,2,5 1UKRI Center in AI for Healthcare 2Dept of Bioeng & Dept of Computing, 3Dept of Primary Care, University of Melbourne, 4Dept of Surgery & Cancer, Imperial College London; 5Institute of Artificial & Human Intelligence, University of Bayreuth Background. The prediction of emergency hospital admission from primary care electronic health records (EHRs), has primarily used features such as patient demographics, medical history and physiological parameters (Hippisley-Cox 2013, Donnan 2008, Rahimian 2018, Liley 2021). Recording all of these variables for each patient in a population is neither practical nor necessary, so these data suffer high levels of missingness. We harness this presence or absence of EHR data to capture a novel temporal conceptualisation of primary care activity. We then demonstrate how this simplified data representation can be incorporated into high-performance emergency hospital admission predictors. Methods. We performed a retrospective observational study using the Secure Anonymised Information Linkage (SAIL) Databank (Ford 2009, Jones 2014, Lyons 2009), which contains billions of person-based health records and covers over 80% of the population of the country of Wales. We identified all patients aged 18-100 with at least 1 recorded primary care EHR data point between the years 2016-2017. These data are organized according to the Read Code Thesaurus (Chisholm 1990) which contains clinical terms organised by chapters. By associating each Read code with its chapter, we structured a patient’s health record into 12 time series channels based on the total Read codes per chapter per day for the study period. To visualize this time series activity, for each patient we created the General Practice Eventogram (GPEG), where each channel's daily activity was visualized as a single pixel, with brightness indicating activity levels (Figure 1). These GPEGs were then combined with a patient’s demographic data to predict emergency hospital admission or death within 3 months of the end of the study period (January - March 2018). The study population was divided into training and test cohorts using a 90:10 split and we compared 2 prediction models: a vanilla neural network (VNN) (3 dense-dropout layers) and a convolutional neural network (CNN) (3 convolutional blocks + 3 dense-dropout layers). All models were trained using Keras (version 2.10.0) with Tensorflow (version 2.10.0) in Python (version 3.10.10). This study was approved by the SAIL independent Information Governance Review Panel (IGRP) (ref 1323). Results. A total of 2,118,444 patients met inclusion criteria, with 80,208 patients meeting the outcome criteria (population outcome rate of 3.8%). Individuals that underwent emergency hospital admission or death) were older (72 years [Q1 54, Q3 83] vs 49 years [Q1 34, Q3 64]) and slightly more likely to be female (52.6\% vs 53.7\%). The CNN outperformed the VNN in all performance metrics: AUROC 0.81 +/- 0.01 vs 0.77 +/- 0.01, Recall 0.69 +/- 0.00 vs 0.63 +/- 0.03, AUPRC 0.19 +/- 0.01 vs 0.12 +/- 0.02 (CNN vs VNN). Conclusion. We have demonstrated a novel conceptualisation of primary care electronic health records as multivariable time series data, encapsulated in the General Practice Eventogram (GPEG). This Morse Code of Health can be successfully harnessed to predict short-term risk of emergency hospital admission and death.

https://openreview.net/pdf/965f3c1dfa3f7f188255d39d6d0b7baa0ab16fd8.pdf

Paper ID: 32

Development and Evaluation of a Vision-Pose Tracking Based Beighton Score Tool for Generalized Joint Hypermobility in Individuals with Suspected Ehlers-Danlos Syndromes

Nimish Mittal; Andrea Sabo; AMOL DESHPANDE; Babak Taati

$\textbf{Background.}$ Generalized joint hypermobility (GJH) describes the wide-spread presence of hypermobility in the body, usually affecting upper and lower extremities and torso. While GJH may represent a normal genetic or physiologic trait with minimal impact on an individual’s quality of life, it can also be linked to inherited connective tissue disorders (CTD) such as Marfan Syndrome, osteogenesis imperfecta, and Ehlers Danlos Syndromes (EDS). Currently, the presence of GJH is assessed using the Beighton score, whereby a clinician performs a physical examination and assigns a score out of 9 tested joints, one for each of the joints assessed that meet the clinical criteria for hypermobility. Individuals who score 4 or higher on the Beighton exam may be referred to a specialized clinic for formal diagnosis of a CTD if necessary. However, these specialized CTD clinics often note large discrepancies between the Beighton score assigned by the referring clinician, and the score assigned by the CTD specialists, with the referring clinician reporting significantly higher scores (leading to unnecessary referrals). In this work, we propose a method for screening for the presence of GJH more objectively using consumer grade smartphone videos. $\textbf{Methods.}$ A total of 225 adults (91.8% female, average age 33.9 ± 9.9 years) referred to a specialized EDS clinic were recruited for this study. During a single clinical visit, participants were asked to perform the Beighton maneuvers twice: the first in which the clinician assessed their joint hypermobility through a physical examination, and the second in which the participant was verbally instructed to perform the Beighton assessment while a tripod-mounted smartphone camera was used to record their movements. The system was developed on the first 100 individuals, and validated on the remaining 125 individuals. The proposed system relies on the use of human pose-estimation libraries to first extract the locations of keypoints in the videos. Five off-the-shelf body keypoint pose-estimation libraries and two hand keypoint libraries were investigated for their ability to track joints in this hypermobile population. Ultimately, the MoveNet-Thunder library was used to track keypoints of the body, and a custom domain-specific model with a MobileNet-v2 backbone (finetuned on data collected as part of this study) was used to track the keypoints of the hand. The tracked keypoints were used to geometrically estimate the angles of hyperextension. Thresholds for separating positive and negative hypermobility findings for each joint were selected empirically by varying the cut-off and selecting the value at which 80% of the true positives (as assessed by the clinician) were recalled. The hypermobility of each joint was assessed independently and then summed to yield a final Beighton score prediction out of 9. In instances where the hypermobility of a joint could not be determined (either because patients experienced pain and could not perform specific Beighton maneuvers or where the vision-based system could not confidently determine if a joint satisfied hypermobility criteria) a random forest (RF) model was used to impute missing values. The RF model used the algorithm's predictions of hypermobility of the other eight joints of the Beighton exam (as calculated from the video) as input and returned a binary decision on the remaining joint's hypermobility. The total algorithm-predicted Beighton score was compared to the clinician-assessed Beighton score and the algorithm was evaluated for its ability to screen-out individuals without GJH (Beighton score < 4). $\textbf{Results.}$ The training set included 48.9% of people with a positive overall Beighton assessment (hypermobility in 4 or more out of 9 joints), while the test set had 30.3%. The system screened out 24.5% of the training set and 20.5% of the test set as not having GJH (ie. predicted Beighton score < 4), while recalling 91.3% and 97.3% of the true positives, respectively. The consistency of the system between the training and test sets suggests that it generalizes well to unseen individuals. The system was tuned to be with a focus on sensitivity to avoid screening out individuals with GJH. As such, the per-joint sensitivity (recall) on the test set ranged from 0.700 to 0.954, while the precision ranged from 0.226 to 0.798. The system was also tuned for computational efficiency in preparation for deployment to a larger population through a smartphone application, achieving an average processing speed of 13s per video on a machine without a GPU. $\textbf{Conclusion.}$ The proposed system can objectively identify individuals with possible syndromic GJH, while screening out those without GJH during the referral process, reducing the burden on specialized EDS clinics while providing early diagnostic triage. Future research will focus on deploying the tool as a smartphone application.

https://openreview.net/pdf/e3d480e484a35d23d9c823cc5fb3d56e2e9dcd0c.pdf

Paper ID: 37

Comparing physiotherapist and computer vision-based algorithm prediction of future falls and impact of algorithm output on physiotherapist predictions

Yashoda Sharma; Vida Adeli; Babak Taati; Kara K Patterson; Andrea Iaboni

Background. Artificial Intelligence (AI) has the potential to support and improve clinical decision-making. For example, AI algorithms show promise in being able to identify “at-risk populations” and can regularly calculate patient risk over time. This can alert clinicians of changes in patient status, ultimately encouraging early intervention. However, prior to implementing AI into clinical practice, it is imperative to ensure its’ safety. One issue with AI in clinical decision-making is the possibility for clinicians to overly depend on its’ results, forgoing their own clinical judgement or other evidence (i.e., automation bias). This can lead to inappropriate clinical decisions and harmful patient outcomes. Thus, it remains unclear whether the risks of AI algorithms in clinical decision-making outweigh the benefits. The purpose of my project is to evaluate the role of a computer-vision based AI-falls risk predictive algorithm in a physiotherapist’s falls risk assessment of long-term care residents with dementia. My aims were to: 1) Compare the accuracy of a physiotherapist’s falls risk prediction to AI-falls risk predictive algorithm and; 2) Determine the impact of the AI-falls risk prediction on the physiotherapist’s falls risk assessment. Methods. This study is using an existing dataset of gait videos, falls data of long-term care residents, and a recently developed AI based falls risk predictive algorithm.4 The algorithm is a Multi-Layer Perceptron (MLP) network trained with a leave-one-subject-out cross-validation framework. It was trained on a dataset of 54 people (~4700 gait videos) and externally validated on a dataset of 15 people (~1300 gait videos). The input features included in the final algorithm were gait measures (cadence and estimated margin of stability), STRATIFY (i.e., clinical falls risk outcome measure) and dosage of antipsychotic medications. For this study, Canadian registered physiotherapists who have clinical experience working with older adults were recruited to participate in a virtual observational study via Microsoft Teams. This study had two parts. In Part 1, therapists predicted the chance of a resident falling within 4 weeks based on the residents’ gait videos and clinical information (i.e., STRATIFY, and antipsychotic medication dosage). Therapists rated the likelihood of a fall on a scale from 0-100%. In Part 2, therapists reviewed the output of the falls risk predictive algorithm described above. Using the algorithm output, therapist's had the opportunity to revise their falls risk predictions from Part 1 based on this new information. The predictions made by the physiotherapist and AI-falls risk predictive algorithm were compared to the residents’ actual future fall events to determine differences in prediction accuracy, sensitivity, and specificity between groups. Preliminary Results. Therapists (n=44) had a mean age of 42.6 (10.0) years, and they were predominantly female (90.9%). Overall, therapists were less accurate in their predictions (65.3%) compared to the algorithm (86.7%). Therapists' predictions had a similar sensitivity to the algorithm (66.9% and 66.7% respectively), but a lower specificity compared to the algorithm (64.9% and 91.7% respectively) before using the algorithm output to inform their predictions. After reviewing the output of the algorithm, there was a minimal increase in therapists’ accuracy (67.8%) and no change in the specificity of their prediction (65%). However, the sensitivity of therapist predictions increased to 78.9%. Conclusion. Preliminary results from this study show that the AI falls risk predictive algorithm was more accurate and specific than physiotherapists in predicting a future fall based on video observation of gait. While there was little impact of the algorithm on the accuracy of the physiotherapist prediction, the algorithm helped therapists identify some fallers they would have otherwise missed (i.e. increased their sensitivity). Overall, this study is the necessary first step in identifying whether AI supports or hinders a physiotherapist’s decision-making regarding patient falls risk and will identify the impact of AI on physiotherapist decision-making. Future steps include performing a logistic regression to statistically compare fall predictions between therapists and the algorithm using clinical demographic variables as independent variables.  References. Adeli V, Korhani N, Sabo A, Mehdizadeh S, Mansfield A, Flint A, Iaboni A, Taati B. Ambient Monitoring of Gait and Machine Learning Models for Dynamic and Short-Term Falls Risk Assessment in People With Dementia. IEEE J Biomed Health Inform. 2023 Jul;27(7):3599-3609. doi: 10.1109/JBHI.2023.3267039. Epub 2023 Jun 30. PMID: 37058371.

https://openreview.net

Paper ID: 43

Exploring the limitations of available healthcare datasets: a synthesis of five systematic reviews.

Joseph E. Alderman; Elinor Laws; Joanne Palmer; Jaspret Gill; Rubeta N Matin; Alastair Denniston; Xiaoxuan Liu

Exploring the limitations of available healthcare datasets: a synthesis of five systematic reviews. Joseph Alderman MBChB FRCA, Elinor Laws BSc MBBCh, Joanne Palmer PhD, Jaspret Gill MSc, Rubeta Matin PhD FRCP, Alastair Denniston FRCOphth PhD, and Xiaoxuan Liu MBChB PhD. Background. Artificial intelligence (AI) health technologies have the potential to transform healthcare, but a growing corpus of literature demonstrates their ability to cause or contribute to health inequity. (1) Limitations intrinsic to the way healthcare datasets are recorded can be readily encoded within AI technologies. This risks systematising, amplifying and automating pre-existing societal biases - worsening social injustice and causing harm to patients. (2) Methods. We have conducted five systematic reviews to identify and assess datasets relating to mammography, heart failure, Covid-19, skin cancer (3) and ophthalmic imaging. (4) We identified datasets through structured searches of MEDLINE and Google Dataset Search, and extracted data relating to each datasets’ composition (‘who’ is included and ‘how’ they are represented), and the documentation accompanying each dataset. We present a summary of these reviews below. Results. Across all five disease areas we reviewed 268 datasets representing 76 countries. The content and format of dataset documentation was inconsistent, often omitting important demographic attributes (click here to view Figure). Much of the world was excluded from datasets, particularly countries in the global south. We identified two patterns of dataset availability: open access (designed to be freely available without restriction) and managed access (requiring registration, requests, approval, contracts etc). We omitted datasets which were inaccessible from these reviews and this abstract. - Mammography: The 11 datasets identified represent 6 countries and contain over 2 million mammographic images. The most commonly cited datasets were curated and digitized from film images in the UK and US - locations which now predominantly use digital imaging. No datasets included definitions for demographic attributes: for example, it was unclear what was meant by ‘sex’ or ‘gender’. - Heart Failure: The 20 datasets identified represent 7 different countries and include a range of data types from echocardiography to biochemistry. There was almost equal representation of ‘female’ and ‘male’ individuals. Socioeconomic status was reported by eight datasets and quantified inconsistently which limits comparability. - Covid-19: The 192 included datasets represent 72 countries. We found frequent data ‘remixing’ across datasets, which risks the same data being present in both training and test sets. (5) Age and other key metadata often were not reported, even when pediatric chest X-rays were included. Many datasets shared similar names, and did not have unique identifiers. Under half of datasets described how Covid-19 was diagnosed - vital context given shifting laboratory assays and clinical definitions during the pandemic. - Skin cancer: The 21 included datasets represent 16 countries and contain 106,950 skin cancer images. Thirteen of the included datasets were hosted on a single open access repository. (6) Despite skin tone or Race/Ethnicity data being essential context for skin cancer datasets, less than 3% of images included skin tone labels (such as Fitzpatrick skin type). - Ophthalmology: The 94 included datasets represent 24 countries and contain 507,724 ophthalmic images. Datasets included detailed technical information around imaging parameters and devices but contained limited reporting of individuals’ demographic attributes. Disease populations were unevenly represented; there was disproportionate representation of glaucoma, diabetic retinopathy and age-related macular degeneration in comparison to other eye diseases. Conclusion. We highlight substantial limitations in many datasets used for AI development, including widespread absence of data on demographic attributes. These attributes have complex relationships with health - associations with adverse outcomes often reflect the systematization of discrimination and oppression rather than biological causality (7), and their redaction from datasets does not protect individuals against harm driven by algorithmic bias. (8) Transparent reporting of datasets’ composition and limitations can empower AI developers to select data most befitting their purpose, enabling technologies which are safe and effective for everyone, not just the privileged few. References. 1: Chen IY et al. Annu Rev Biomed Data Sci. 2021. 2: Ibrahim H et al. Lancet Digital Health. 2021. 3: Wen D et al. Lancet Digit Health. 2022. 4: Khan SM et al. Lancet Digit Health. 2021. 5: Garcia Santa Cruz et al. Med Image Anal. 2021. 6: ISIC, International Skin Imaging Collaboration. 7: Crenshaw K. Univ Chic Leg Forum. 1989. 8: Obermeyer Z et al. Science. 2019.

https://openreview.net/pdf/df2e3bc209774367f3e6f4bec4e135f8fd49606d.pdf

Paper ID: 46

Evaluation of a Digital Phenotype for the Early Recognition of Pediatric Sepsis

Noah Prizant; Shems Saleh; William Ratliff; Marshall Nichols; Mike Revoir; Michael Gao; Mark Sendak; Suresh Balu; Emily Greenwald; Emily C Sterrett

Background. Sepsis is a life-threatening response to infection that results in significant morbidity and mortality among children worldwide. Between 2004-2012, sepsis accounted for an estimated 3.1% of all pediatric hospitalizations in the U.S. with a mortality rate of 8.2%1. Given the disease burden and high mortality of the condition, international standardized guidelines have been developed by organizations such as the Surviving Sepsis Campaign to improve patient outcomes. Their recommendations include implementing systematic screening for timely recognition of sepsis as well as starting antibiotic within 1 hour of recognition2. However, sepsis can be very difficult to recognize in children due to varied presentations and a constellation of nonspecific signs and symptoms. Additionally, many operational barriers exist in a clinical setting that lead to delayed recognition of pediatric sepsis. At Duke Children’s Hospital, an estimated 38% of children receive timely recognition (blood culture in <1 hour) while 28% receive timely treatment for sepsis (antibiotics in <1 hour). Implementing changes in the clinical setting to shorten time to recognition and time to treatment is of critical importance to reduce mortality and negative health outcomes. Our primary objective in this study was to develop and validate a digital phenotype to identify pediatric patients in real-time who are at high risk of sepsis. Methods. Our retrospective cohort consisted of 28,399 pediatric hospitalizations at Duke Children’s Hospital between 11/1/2016 – 04/30/2023. Patients that were discharged from the emergency department were not included in this study. All data was acquired from the electronic health record (EHR). The primary outcome of this study was the Duke Pediatric Sepsis Phenotype (DPSP), which consists of the intersection between (1) a previously described retrospective informatics-based definition of sepsis requiring 4 days of antibiotics (Full Weiss definition3), which we modified to be usable in real-time (Real-Time Weiss); and (2) the Duke Children’s Trigger Tool (TT), a local consensus-based phenotype to direct empiric antibiotic use which was developed by our local multi-specialty team. Phenotype performance was assessed with sensitivity and positive predictive value (PPV) for retrospective definitions of sepsis (Full Weiss and ICD-10 codes) and compared to standalone Real-Time Weiss (RT) and TT phenotypes. To determine the DPSP’s real-time applicability, we also evaluated its sensitivity and PPV when its components (RT, TT) were time-constrained. Hospital length-of-stay (LOS), mortality and demographic information (age, sex, race, ethnicity) were characterized for the cohort of patients meeting DPSP. Results. In our retrospective cohort of 28,399 hospitalizations, 3,104 patients (10.9%) met criteria for the DPSP. The DPSP had sensitivity of 0.79 and PPV of 0.18 for ICD codes for sepsis. Sensitivity for DPSP was lower than TT and RT standalone phenotypes (TT 0.92, RT 0.84), however PPV for DPSP was higher (TT 0.09, RT 0.15). For the Full Weiss definition of sepsis, the DPSP had sensitivity of 0.95 and PPV of 0.30. Sensitivity for DPSP was lower than TT and RT standalone phenotypes (TT 0.95, RT 1.00) and PPV for DPSP was higher (TT 0.13, RT 0.25). Performance of DPSP with time-constrained components showed a slight decrease in sensitivity and increase in PPV (ICD: 0.76 Sens, 0.20 PPV; FW: 0.90 Sens 0.30 PPV), suggesting that DPSP is effective at predicting retrospective definitions of sepsis in real-time. Patients meeting DPSP had significantly longer mean hospital length-of-stay (23.48 vs 6.72 days, p < 0.001) and mortality (5.67% vs 0.74%, p < 0.001) than the full retrospective cohort. Patients meeting DPSP had a lower mean age (6.99 vs. 7.79, p < 0.001), were more likely to be male (54% vs. 52%, p < 0.05) and were more likely to be Black/African American than the full retrospective cohort (36% vs. 31%, p < 0.001). Conclusion. The Duke Pediatric Sepsis Phenotype (DPSP) can accurately identify patients in real-time who meet retrospective definitions of sepsis. Patients fulfilling the DPSP have significantly higher mortality, longer hospital length-of-stay, and are more likely to be Black/African American. Next steps for this study include prospective validation, clinical adjudications to confirm clinical relevance and deployment at bedside. Validation of DPSP lays the groundwork for future efforts to train machine learning models.

https://openreview.net/pdf/cb67fc01418db30decb6015ef8e8fff4161c2634.pdf

Paper ID: 47

Implementing An Informatics-Driven Notification System for Patients With High-Risk Conditions Presenting With Fever In The Pediatric Emergency Department

Noah Prizant; Shems Saleh; William Ratliff; Marshall Nichols; Mike Revoir; Matt Gardner; Michael Gao; Mark Sendak; Suresh Balu; Emily Greenwald; Emily C Sterrett

Background: Patients with high-risk conditions (HRC) presenting to the pediatric emergency department (ED) with fever are at greatly increased risk for developing systemic infection or sepsis.1 Consensus guidelines for these patients emphasize timely evaluation and administration of antibiotics, ideally within 1 hour of presentation.2,3 Longer time-to-antibiotics (TTA) has been associated with poorer outcomes in high-risk patients.4,5 From 2016-2023, 42% of patients admitted to Duke Children’s Hospital who met HRC+Fever received antibiotics within 1 hour of presentation. Objective: In this study, our primary objective was to implement an informatics-driven system to immediately identify patients with high-risk conditions who presented to the Duke Pediatric ED with a fever to reduce TTA. Methods: The prospective study cohort included all patients with a high-risk condition who presented with a fever to the Duke Pediatric ED from 1/1/24-7/1/24. Live data was obtained every 15 minutes from the electronic health record (EHR). High-risk conditions included active chemotherapy (1+ dose of chemotherapy in 6 months prior to encounter), followed by transplant team (on the solid organ transplant “list” at time of admission) and sickle cell disease (prior encounter or problem list contained at least one sickle cell anemia diagnosis). Fever was defined as a temperature of >38oC measured in the ED or by chief complaint of fever. For each patient who met the HRC+fever phenotype, if antibiotics had not yet been given, an automated notification page was sent to the ED charge nurse and ED pharmacy to coordinate prompt evaluation and treatment. Silent validation was performed from 1/1/24-2/25-24 in which notifications were generated but not sent to clinicians. HRC+Fever notifications went live to clinicians in the Duke Pediatric ED on 2/26/24. Reminder pages went live on 4/13/24, in which an additional notification was sent if a patient had not yet received antibiotics after 45 minutes. We tracked time from ED admission to antibiotics and notification to antibiotics via run charts. Results: From 1/1/24-7/1/24, 123 total patients met the HRC+Fever phenotype and generated an alert. An additional 32 reminder pages were generated after 4/13/24. 87.8% of silent trial alerts and 78.0% of live in ED alerts were actionable; non-actionable alerts occurred when the patient did not receive antibiotics or antibiotics were administered before the alert. Median ED admission to antibiotic administration time did not change from silent trial to live in ED (75 minutes), however median notification to antibiotic administration time decreased from 114 to 58 minutes (Table 1). Antibiotic compliance (patients receiving antibiotics <1 hour from admission) increased from 38.9 to 40.3%, while notification compliance (patients receiving antibiotics <1 hour from notification) increased from 5.5% to 53.7%. Run charts (Figure 1) showed special cause variation in cases of significantly delayed antibiotic administration; this corresponds to instances where consult teams (transplant, infectious disease) were deferred to for antibiotic decision-making. Conclusion: We successfully implemented an informatics-based notification system to identify high-risk patients presenting with fever in the Pediatric ED. While overall admission-antibiotics time and compliance did not significantly change, notification-antibiotic time greatly decreased and compliance increased, suggesting that notifications were effective in reducing time to antibiotic administration. We will continue to track impact metrics and investigate additional workflow improvements as HRC+Fever notifications continue.

https://openreview.net/pdf/92ef17e37796a8dbb54b59dd2e4a15de14d91ff2.pdf

Paper ID: 62

AI Governance in Healthcare in a Learning Health System: Case Study of a Canadian Hospital System

Mark Sendak; Jee Young Kim; Alifia Hasan; Jacqueline K Kueper; Mohamed Abdalla; Benjamin A Fine

Background. The potential benefits of incorporating Artificial Intelligence (AI) into healthcare are well-documented. Previous studies have shown that integrating AI into healthcare systems can improve operational efficiency and enhance patient care. Despite widespread enthusiasm among healthcare organizations for AI integration, adoption rates remain low due to the absence of standardized governance systems for AI technologies. This study aimed to address this gap by establishing an AI governance framework within the Learning Health System (LHS) framework of a multi-site hospital system in Canada to ensure the safe, effective, and equitable adoption of AI in healthcare settings. The findings of this research illustrate the necessary steps and resources required to implement a robust AI governance system within a healthcare organization. Methods. Establishing an AI governance system required active engagement of various stakeholders across the organization and a series of research activities. In-depth interviews were conducted with 12 stakeholders from operational, technical, and clinical domains to understand the current landscape of AI adoption, identify future AI governance needs, and assess the capability of the organization. Additionally, two journey mapping sessions were conducted to identify stakeholders engaged in AI governance. Insights from interviews and journey mapping were used to formulate an initial set of recommendations across four domains: process, people, technology, and operation. To refine these recommendations, six senior executives who are integral to digital health governance were recruited. Initially, they completed a survey that contained questions related to key decisions made in implementing AI governance across four domains of the recommendations. Subsequently, a series of four design thinking workshops were conducted, each centering on one of the recommendation domains. During these workshops, the executives contributed their perspectives and insights. These collaborative efforts led to the refinement and finalization of the initial set of recommendations. Results. The finalized recommendations described the process, people, technology, and operation required for the establishment of an effective AI governance system. Using these recommendations, drafted governance documents were subsequently approved as policy by senior executives. Within the process domain, recommendations delineated the scope, key decision points, risk assessment protocols, and bounds of enforcement spanning the entire AI lifecycle. In the people domain, recommendations elucidated the composition and structure of the AI governance committee, specifying required expertise, member responsibilities, and protocols for committee membership updates. In the technology domain, recommendations outlined necessary technical capabilities and infrastructure prerequisites to support AI integration, along with guidelines for their management within the organization. In the operation domain, recommendations described the operationalization of the AI governance system within the organization, along with proposed metrics for assessing its successful implementation. These comprehensive recommendations provided a robust foundation upon which the organization developed governance documents, ensuring a structured approach to AI governance for policy endorsement and implementation. Conclusion. The current research highlights the need for a centralized and standardized AI governance framework to ensure the responsible integration of AI in healthcare. While the recommendations were tailored to a particular health system context, the methods employed in establishing the AI governance system and the overarching insights encapsulated in the final recommendations hold broad applicability through an LHS lens. We believe that this research lays the groundwork for their potential adaptation and implementation across diverse healthcare settings. Such generalizability not only facilitates broader uptake but also promotes the development of more robust and universally applicable AI governance practices in healthcare.

https://openreview.net/pdf/ca0c78275fb354678ede7744595e27e58ab0935a.pdf

Paper ID: 65

CGM-GPT: A Transformer Based Glucose Prediction Model to Predict Glucose Trajectories at Different Time Horizons

Mansur E. Shomali; Junjie Luo; Abhimanyu Kumbara; Anand K. Iyer; Guodong Gao

Introduction Accurate glucose value prediction and the subsequent, automated coaching based on these predictions are important to and can help improve the self-management of diabetes. We have previously shown that by combining dense continuous glucose monitoring (CGM) sensor data with medication, education, diet, activity, and lab data (MEDAL) from a digital health solution, we can accurately predict binary outcome variables, such as whether the glucose time in range (TIR) will be above or below a certain threshold. While this binary outcome prediction is valuable for population-level glucose management over longer time horizons (e.g., 70-90 days from a baseline period), having accurate glucose prediction at much shorter intervals – such as 30 minutes, 60 minutes, and 2 hours – can be essential for person-level, real-time glucose management. Large Language Models (LLMs) with transformer architectures have proven to be adept at predicting the next word, allowing us to construct entire sentences, paragraphs, or long form text with appropriate prompts. In this study, we aimed to construct a “Large Glucose Model” (LGM) using a transformer architecture to predict the next glucose value, thereby providing glucose trajectories over 30 minutes, 60 minutes, and 2-hour intervals, which we refer to as CGM-GPT. Additionally, we compared the accuracy of our CGM-GPT models to that of other deep learning-based models reported in the current literature for the same time horizons in question. Methods We evaluated real-world CGM data from a digital health platform for 617 individuals with type 1 (T1D) and type 2 (T2D) diabetes. This dataset accounted for over 17 million CGM entries, covering approximately 59,000 patient-days (equivalent to 161.7 patient-years). The dataset was down-sampled to 10%, and it was further split into a held-in sample and a held-out sample with a ratio of 9:1. We constructed two different GPT models: The first GPT model (GPT1) used only T1D population data in the training set, and the second GPT model (GPT2) used only data from the T2D population in the training set. Each of these models was used to predict glucose trajectories for both T1D and T2D populations at 30 minute, 60 minute, and 2-hour time horizons. Model accuracy was evaluated by calculating the root mean square (RMSE) (mg/dL) at these time intervals. Results For the GPT1 model, the held-out sample RMSE (mg/dL) for predicting T1D-only glucose trajectories at 30 minutes, 60 minutes, and 2-hours were 12.8, 23.5, and 40.1, respectively. Similarly, for the GPT2 model, the held-out sample RMSE for predicting T2D-only glucose trajectories at the same time intervals were 10.4, 17.5, and 27.4, respectively. Interestingly, we also used the GPT2 model to predict glucose trajectories for the T1D-only held-out sample set. The RMSE for this prediction at 30 minutes, 60 minutes, and 2-hours were 13.0, 23.5, and 39.4, respectively. Notably, the GPT2 model, which was trained on T2D population, produced similar RMSE scores when used to predict T1D glucose trajectories at the same time horizons as the GPT1 model, hinting at the possibility of further generalizing such models in the future. Comparing our RMSE scores for the GPT1 model (used to predict T1D population glucose trajectories) to state-of-the art scores from the current literature, we found that our RMSE scores were considerably lower than for the state of the art models, with the current literature average for the 30-minute and 60-minute RMSE scores being 18 and 30 mg/dL, respectively. Conclusion Novel transformer-based glucose prediction models can be highly accurate in predicting glucose trajectories at 30 minute, 60 minute and 2-hour time horizons for both T1D and T2D populations. Interestingly, the T2D-only trained GPT model can also be used to accurately predict T1D-only population glucose trajectories. Our GPT1 model achieved considerably lower RMSE scores when compared to those from current literature. Notably, none of the models in the current literature – based on deep learning architectures for glucose predictions – used the T2D population for training and predicting glucose trajectories at the specified time horizons. Additionally, our GPT1 and GPT2 models were the first to predict glucose trajectories at the 2-hour time horizons , which the state-of-the-art model did not do. In the future, we aim to enhance our GPT models by incorporating MEDAL data into the training sets and further investigating the breadth of applying the GPT2 model to predict glucose trajectories in a T1D population.

https://openreview.net/pdf/7c5450a53826011abc9845915b99caae9da65731.pdf

Paper ID: 77

Natural language processing for sedation state classification during procedural sedation

Jack Li; Mohammad Goudarzi-Rad; Blair E. Warren; Babak Taati; Aaron Conway

Natural language processing for sedation state classification during procedural sedation Jack Li1, Mohammad Goudarzi Rad1, Blair Warren2, Sebastian Mafeld2, and Babak Taati3,5, Aaron Conway4 1 Lawrence S. Bloomberg Faculty of Nursing, University of Toronto, 2 Joint Department of Medical Imaging, UHN, 3 KITE Research Institute - Toronto Rehabilitation Institute, UHN, 4 School of Nursing, Queensland University of Technology Background. An ambient clinical intelligence system could use natural language processing (NLP) to automatically document when pain is reported by patients or when clinician’s provide verbal instructions to patients because their movement is interfering with the procedure. Figure 1 provides an overview of how classifications of sentences from audio recordings during procedural sedation could be integrated into clinical documentation systems. Classifications could be automated completely if predictions are sufficiently accurate, while others could be presented to clinicians as a ‘prompt’ to confirm or refute the occurrence of a ‘problem’ state, such as pain or movement. The aim of this study was to determine the accuracy of NLP pipelines for classification of sedation states. Methods. A prospective observational study was conducted. Patients undergoing elective procedures in the interventional radiology suite at a large academic hospital who were 18 years or older and able to provide informed consent were eligible. Audio recordings collected during procedures were pre-processed to filter out background noise before being transcribed using the 1550 million parameter Whisper model (version large-v3). Transcribed sentences were labelled into categories of “pain”, “assessment”, “movement”, “breathing”, and “other”. Two members of the research team independently annotated each sentence by accepting or re-assigning initial labels generated from few shot in-context learning using the GPT-3.5 model, accessed through the OpenAI application programming interface. Twenty percent of the data was assigned to the testing set with the remaining data split into 80:20 for training and validation. Sentences were distributed to ensure data from the same procedure were included only in either the training, validation or test set. Three NLP pipelines were evaluated using the spaCy NLP python library. Pipelines included a simple Bag-of-Words (BOW) model, an ensemble that combined a linear BOW model and a “token-to-vector” (Tok2Vec) component and the most complex pipeline consisted of a transformer-based architecture, using the RoBERTa pre-trained model. Results. Of 119 patients who were screened for eligibility, 4 were ineligible, 30 patients declined to participate, and 3 were excluded due to changes to procedure, resulting in a total of 82 participants included in the analysis. The majority of participants were male (68%) with a median age of 68, and a majority of participants reported to be White (European descent). A total of 10,434 sentences were used for training, 3,3375 for validation, and 2,127 for testing with similar distribution of sedation state labels between training and test sets, and higher number of movement labels for sentences in the validation set. The BOW approach achieved an Area under the ROC curve (AUC-ROC) of 0.9, F1 score of 0.7, precision of 0.83, and recall of 0.66. The BOW and Tok2Vec combination pipeline achieved an Area under the AUC-ROC of 0.96, F1 score of 0.79, precision of 0.83, and recall of 0.77. The RoBERTa transformer pipeline achieved an AUC-ROC of 0.97, F1 score of 0.87, precision of 0.86, and recall of 0.89. Few shot in-context learning with classifications from the GPT-3.5 model produced an F1 score of 0.65, precision of 0.57, and recall of 0.93. The RoBERTa model achieved an F1 score of 0.81, a precision of 0.85, and a recall of 0.77 for the ‘Pain’ label. For the ‘Movement’ label, predictions from the RoBERTa model on the test set produced and an F1 score of 0.79, a precision of 0.82, and a recall of 0.78. These results were superior to the other models for these labels. Conclusion. Automating sedation state assessments using NLP would allow for more timely documentation of the care received by sedated patients. Downstream applications can also be generated from the classifications, including for example real-time visualizations of sedation state, which may facilitate improved communication of the adequacy of the sedation between clinicians, who may be performing supervision remotely. Additional research is required to evaluate the performance and computing requirements needed for an ambient clinical intelligence system that incorporates the best performing transformer-based NLP pipeline with real-time data. Figure 1. Flow diagram of proposed automated sedation state assessment system using natural language processing for sentence classification

https://openreview.net/pdf/0f933ad428f7974a75d2e67ae758e334e7f63aac.pdf

Paper ID: 80

Learning patient representations of coded disease sequences for prediction in primary care electronic health records

Thomas Beaney; Sneha Jha; Asem Alaa; Alexander Smith; Jonathan Clarke; Thomas Woodcock; Azeem Majeed; Paul Aylin; Mauricio Barahona

Background. Multimorbidity, defined as the co-occurrence of two or more long-term conditions, is a mounting challenge facing health systems worldwide. With access to large electronic health records (EHRs), there is growing interest in learning representations (embeddings) of patients, which can be used to characterise patterns of disease in multimorbidity for a variety of downstream tasks, such as risk prediction, stratification, and clustering. However, it remains unclear whether using increasingly complex methods, such as transformer architectures, which utilise sequences of diseases, rather than co-occurrence alone, produce embeddings with substantively better predictive content than standard epidemiological methods. In this study, we create unsupervised patient representations based on structured diagnostic codes, using a selection of methods derived from natural language processing, and test how these perform for predicting a range of clinically relevant tasks. Methods. We used a large EHR dataset from the Clinical Practice Research Datalink, a nationally representative sample of patients registered to primary care in England, linked to secondary care data from Hospital Episode Statistics, including all patients with multimorbidity aged ≥18 years, registered on 1st January 2015. 9,462 diagnostic codes (Medcodes) were clinically categorised into 212 diseases relevant to multimorbidity, and for each patient, we created two time-ordered sequences, containing either their history of Medcodes or diseases developed before 2015. For each input sequence, we generated unsupervised patient embeddings by applying: i) co-occurrence-based methods: Latent Dirichlet Allocation (LDA) and Distributed Bag of Words (DBOW), and ii) sequence-based transformers. We tested two existing transformer architectures bespoke for EHR data: Med-BERT and BEHRT and developed a new architecture, EHR-BERT, which incorporates embeddings for age, gender, ethnicity, socioeconomic deprivation and calendar year. The predictive content of the embeddings generated by each method and input sequence was compared using a logistic classifier, incorporating age, gender, ethnicity and deprivation as covariates. We also compared with a standard epidemiological approach of inputting each disease into the classifier as i) binary disease indicators; ii) counts of each disease, or iii) use of sociodemographic covariates alone, without the embeddings. Prediction of six outcomes were evaluated over the subsequent one year (2015), on data not seen during training of the embeddings: i) mortality in patients aged 60+ years; ii) any emergency department (ED) attendance; iii) any emergency hospital admission; and new diagnoses of iv) hypertension, v) diabetes and vi) depression. Results. 6,286,233 patients were included, with a mean age of 53.8 years, of whom 53.1% were female, 86.2% were of White ethnicity and with an even spread across deprivation deciles. Embeddings generally performed well at predicting mortality, but poorly at predicting any ED attendance over 1 year (Figure). Across all endpoints, embeddings from EHR-BERT performed best, with higher ROC-AUC scores, followed by BEHRT. Use of simple binary disease indicators performed better than embeddings from DBOW or LDA. AUC scores for predicting new diagnoses of hypertension, diabetes or depression were lower (range 0.683-0.758), but the best performance was found from embeddings generated by transformers. Little difference was found if using Medcodes or diseases as inputs. Conclusion. Unsupervised patient representations generated by transformers using disease sequences have richer predictive content for a range of downstream clinical endpoints compared to methods generated by co-occurrence alone. The better performance of these embeddings, even without fine-tuning, indicates the potential of these approaches for learning multi-purpose representations which could be useful in future for segmentation and risk stratification. Pre-print available at: https://www.medrxiv.org/content/10.1101/2023.11.16.23298640v2

https://openreview.net/pdf/b25868ea3cf31c52a4924afbbb51d11242b1167a.pdf

Paper ID: 83

Model-Controller Framework for Precision Alerting Using Machine Learning Models

Emily A Balczewski; Karandeep Singh; Brian Patterson; Erkin Otles

Background. Risk prediction models can offer valuable and timely information to clinicians through alerts. To maximize the benefits (i.e., improved patient outcomes) without incurring heavy costs (e.g., alert fatigue, unnecessary interventions), alerts should be surfaced to the right person(s) at the right time in the right format in the right channel with the right information. [1] Risk prediction models may generate accurate predictions, but model implementers generally employ static decision thresholds that do not adapt alerting behavior to the broader care environment. While there are examples of more complicated logic for alerting behavior in current use [2], this logic is often implemented and evaluated ad hoc. We address this gap by introducing the concept of a controller that tailors model alerting behavior by considering model outputs in the context of environmental and patient factors. Methods. We formalize the definition of a controller c(Xt, Wt, Y ̂t), which receives information about patients Xt (e.g., diagnostic codes), information about the environment Wt (e.g., bed availability), risk predictions Y ̂t from a model f(Xt) (e.g., probability of developing outcome) to determine which patients receive an alert at time t. We outline several potential use cases of a controller in Table 1. Additionally, we are actively developing a controller implementation. Example. Suppose we have a model to predict the likelihood from 0-1 that 500 hospitalized patients will develop sepsis in the next 8 hours. Imagine that the inherent risk of sepsis for 25 hematology/oncology (HO) patients is higher than for the general inpatient population due to both patient (e.g., immunosuppression) and environmental (e.g., staffing model) factors. However, this inherent risk does not translate to higher model scores due to limited data availability for this subpopulation. When a single alert threshold (>0.80) is imposed, the model's sensitivity is poor for HO patients. When two thresholds are created for HO (>0.70) and non-HO (>0.80) patients, respectively, the sensitivity and positive predictive value (PPV) of the model greatly improves with minimal increase in the overall alert burden (Figure 1). Conclusion. The controller in our simulated example uses only one patient variable to modify alerting behavior but can easily incorporate many heterogeneous patient and environmental variables. The formal definition of a controller establishes a necessary foundation for evaluating model-driven alerts. We welcome comments on this framework and the controller software implementation we are developing.

https://openreview.net/pdf/2d10c0496b2a3a0c6f38cf5f810f55cd8798a1c8.pdf

Paper ID: 92

Artificial Intelligence to Predict Postoperative Health-Related Quality of Life for Adolescent Idiopathic Scoliosis Patients

Dusan Kovacevic

Background. Adolescent idiopathic scoliosis (AIS) has a large impact on health-related quality of life (HRQoL) for patients including poor psychosocial functioning, psychological distress, and pain. Surgical management for AIS is common; however, there is limited consensus on preoperative and intraoperative strategies to optimize HRQoL outcomes following surgery. Accurate prediction of postoperative outcomes can guide operative planning, ultimately leading to improved HRQoL. This feasibility study aimed to generate machine learning models (MLMs) using preoperative and intraoperative variables to accurately predict postoperative HRQoL outcomes following AIS surgery. Methods. A prospective, longitudinal, multicenter database was queried for AIS patients of Lenke 1 or 5 classification with minimum two-year follow-up. MLMs were generated using various preoperative and intraoperative factors to predict the difference in Scoliosis Research Society-22 (SRS-22) questionnaire scores from preoperative assessment to two-year follow-up. MLMs were compared to a model that estimates the mean score by evaluating the coefficient of determination (R2) and the number of times the prediction was within a predesignated value of the actual score (i.e. buffer accuracy). Results. A total of 1,417 patients were included. The stochastic gradient descent (SGD) model had the highest R2 for all SRS-22 scores (0.31–0.64). For 0.5-buffer accuracy, the linear regression model performed best for the satisfaction (66.2%), self-image (70.1%), pain (65.7%), and total SRS-22 scores (80.9%), while the SGD model performed best for the mental health (54.9%) and general function SRS-22 scores (79.9%). The SGD model had the highest 1-buffer accuracy across all SRS-22 scores (87.4%–97.2%). All MLMs, except for the AdaBoost model, outperformed the mean estimates on all accuracy metrics across each outcome. Conclusion. MLMs accurately predicted the difference in HRQoL outcomes for AIS patients using preoperative and intraoperative factors. Findings provide key insights into the feasibility of implementing MLMs to guide operative planning and counsel patients on expected outcomes of surgical management. Future work should aim to optimize these factors to ultimately maximize patient outcomes.

https://openreview.net/pdf/cbe86956c7d5e99ba09b33a60a8aed11080f5322.pdf

Paper ID: 93

Silent Prospective Deployment and Evaluation of a Machine Learning System to Predict Emergency Department Visits During Cancer Treatment

Robert C. Grant; Benjamin Grant; Viet Tran; Tran Truong; Tirth Patel; Jiang Chen He; Muammar M. Kabir

Background: Patients receiving active treatment for cancer require frequent visits to the emergency department (ED), which burdens the healthcare system. Our objective is to use the long-term data in electronic health records (EHR) to deploy and test a previously built warning system that can generate accurate predictions of patients at risk of ED visits. This system would guide clinicians to initiate early personalized measures to prevent ED visits, which will conserve resources and improve the quality of life for people with cancer. Methods: Machine learning systems were trained and tested using longitudinal retrospective EHR data from people receiving intravenous medical treatments primarily for gastrointestinal cancers at the Princess Margaret Cancer Centre. The target was whether or not a patient visited the ED within 30 days after each treatment. Treatment sessions that were followed by an ED visit the same or following day were excluded from training to ensure the system focuses on detecting early warning signs rather than imminent ED visits. Models included tree-based approaches and neural networks. Hyperparameters were tuned using Bayesian procedures and probabilities were calibrated with isotonic regression. A temporal split was used to create a held-out retrospective test cohort. For prospective clinical validation, the best model was deployed in silent mode among people receiving medical treatment for gastrointestinal cancer using an internally developed software platform ‘MIRA’ that provides custom data-driven pipelines for both retrospective analysis and prospective clinical integration. Of note, the retrospective and prospective cohorts coincide with a shift between EHR vendors. Within MIRA, patients’ EHR data with treatments scheduled the following day are extracted from the EHR system and passed to the model for analysis. Results: In the retrospective cohort, 1,997 patients received 24,350 treatments from January 1st 2014 to December 31st 2019, with ED visits within 30 days following 2,219 (9.1%) of treatments. The best model was the extreme gradient boosting tree, which achieved an area under the receiver operating characteristic curve (AUROC) of 0.677 and area under the precision-recall curve (AUPRC) of 0.186 in the held-out test set. Although the evaluation of the system through prospective silent deployment is ongoing, here we report on the first two weeks of treatments between March 1st, 2024 to June 30th, 2024, with adequate 30-day follow up by July 30th, 2024. During this period, 489 patients received 2,630 treatments, with ED visits within 30-days following 232 (8.8%) treatments. The system trained in the retrospective data achieved an AUROC of 0.689 (confidence interval: 0.645-0.732) and an AUPRC of 0.211 (confidence interval: 0.203-0.233). At a 10% alarm frequency before treatments, alarms had a positive predictive value of 0.281 and outcome-level sensitivity of 0.693. Conclusion: Our system predicted ED visits among people receiving medical treatment for cancer during a prospective silent deployment. These results suggest that the system should be implemented within the clinical workflow and paired with interventions to prevent ED visits.

https://openreview.net/pdf/859690ffb5d766482c23608e047801a88a69b574.pdf

Paper ID: 95

Change in machine learning model performance upon retraining after deployment into clinical practice: The real-world effect of model predictions on clinician actions, outcome labels, and the potential for a feedback loop

Michael Colacci; George Alexandru Adam; Chloé Pou-Prom; Anna Goldenberg; Amol Verma; Muhammad Mamdani

Background: Simulation studies have suggested that machine learned (ML) clinical decision support systems may deteriorate in performance after deployment because they influence the data used to later update the model (termed a feedback loop). To our knowledge, no real-world studies have evaluated feedback loops in deployed clinical ML tools. Feedback loops could potentially affect all clinical ML tools that undergo regular updates. Thus, evaluations and potential solutions are urgently needed. Methods: CHARTwatch is an ML early warning system that predicts patient deterioration (defined as death or transfer to ICU) and has been deployed at St. Michael’s Hospital in Toronto, Canada, since October 2020. To evaluate the influence of CHARTwatch alerts on clinician behaviors and outcomes, we conducted a prospective cohort study of CHARTwatch high risk alerts between July 1st, 2022, and July 1st, 2023. We interviewed physicians on the patient’s care team in real-time immediately prior to the delivery of each patient deterioration alert, to determine how the alert changed their perception of patient risk. Responses were linked to subsequent clinical actions measured in the electronic medical record. Alerts were grouped into three categories: (1) Predicted: when the clinician predicted the high-risk classification prior to receiving the ML alert, (2) Agree: when the clinician did not independently predict the high-risk classification prior to receiving the ML alert, but agreed with the patient being high-risk for deterioration after being informed of the ML alert, and (3) Disagree: when clinicians both did not predict the high-risk classification and disagreed with the ML alert. Interview responses were used to estimate the likelihood of averting the predicted outcome attributable to CHARTwatch (“recovery probability”), by comparing patient outcome rates when clinicians did and did not modify behavior following a CHARTwatch alert. Finally, we evaluated the change in model performance due to feedback loops across a range of recovery and recall probabilities, ML model types and baseline model performances. Results: A total of 100 clinician interviews were completed. In 36% (36/100) of alerts, clinicians correctly predicted the alarm in advance of receiving the alert. In 30% (30/100) of alerts, clinicians did not anticipate the alert but agreed with the high-risk classification, and in 34% (34/100) of alerts, clinicians both did not predict the alert and disagreed with the high-risk classification. Clinicians were more likely to take a new predefined clinical action when they agreed with but did not anticipate an alert (78%), compared to when they disagreed (47%) or when they independently predicted the alert (25%, p=0.0001). CHARTwatch performance upon retraining was modelled using a range of model recall values (including the observed value of 0.7) as well as a range of recovery probabilities (including the observed rate of 0.37). Under current model parameters a feedback loop was present upon retraining that significantly worsened model performance, however the absolute magnitude of this change was small (<2% change in AUC). Varying model type was not consistently associated with the presence or severity of a feedback loop. Conclusion: In this prospective evaluation of an ML tool for inpatient deterioration, we determined that clinician agreement with an ML prediction was strongly associated with subsequent clinical responses. We identified that factors that increase the probability of an ML prediction altering a clinical outcome, such as the baseline event rate, model performance, and recovery probability, all increase the likelihood of a feedback loop. Finally, we determined that given the specific parameters of the CHARTwatch early warning system, a feedback loop upon model retraining is likely, but that the magnitude of this change is small. Given that specific model and clinical parameters influence the likelihood of a feedback loop, similar evaluations should be undertaken for all clinical ML models prior to retraining.

https://openreview.net

Paper ID: 96

Deciding When to Stop Immune Checkpoint Inhibitors: Development of an Individualized Treatment Rule for Patients with Advanced Melanoma

Arpan Sahoo; Mathilde Amiot; Stéphane Dalle; Bastien Oriano; Lebbe; Raphael Porcher; François Grolleau

$\textbf{Background.}$ Treatment of advanced melanoma often involves immune checkpoint inhibitors (ICIs), e.g., anti-PD1 agents. While ICIs have been shown to improve the overall survival of patients, it is still unclear when we should stop them, with respect to the patient’s individual characteristics and response to treatment. Typically, ICIs are discontinued for patients who experience toxicity or advance to progressive disease. However, certain patients may also benefit from "elective" ICI discontinuation, as prolonged ICI therapy increases the potential for adverse effects and incurs additional medical expenses. Defining better rules and guidelines for stopping ICIs may assist clinicians in maximizing the benefits of treatment while minimizing the harms. With a sequential decision-making paradigm in mind, we used reinforcement learning (RL) to estimate an optimal individualized treatment rule for stopping ICIs in patients with melanoma. $\textbf{Methods.}$ We obtained data from MelBase, a French multicenter cohort of adults with unresectable stage III/IV melanoma. Patients were included if they were started on ICIs between Mar. 2015 and Dec. 2021 ($n$ = 1,017). Each patient was characterized by a set of baseline features: sex, age, ECOG performance status, AJCC tumor stage, serum LDH level, presence of brain or liver metastases, and BRAF mutation status. Patients were also characterized by features at four time points after treatment initiation (6, 12, 18, and 24 months): ECOG status, LDH, tumor response to treatment (partial response, complete response, or stable disease), and treatment decision (stop or continue ICIs). We applied a recursive $Q$-learning procedure (a type of RL algorithm) to estimate the optimal sequential treatment decisions for maximizing a patient’s survival probability at 48 months. At each time point $t$ (24, 18, 12, and 6 months, respectively), a random forest (RF) regression model was trained to regress 48-month survival on patient information available at time $t$ (hyperparameters fine-tuned via grid search with 5-fold cross-validation). We used each model to select the decisions at time $t$ that would optimize patient survival, which then informed the next model to be trained. This process produced, for each time point, a distinct RF model and a corresponding list of optimal treatment decisions for the patients. This represents a dynamic treatment regime, which consists of a different rule for ICI discontinuation at each time point. For clinical interpretability, we then aimed to consolidate these rules into a unique rule for ICI discontinuation, represented by a decision tree. This rule is stationary, i.e., it applies the same criteria at any time point. $\textbf{Results.}$ 1,017 patients were included in this study. Under our $Q$-learning procedure, we trained four (time point specific) RF regression models to estimate survival probabilities at 48 months and infer the optimal treatment decision at each time point for each patient (i.e., a dynamic treatment regime). Based on these results, we devised a single treatment rule for ICI discontinuation, in the form of a decision tree classifier (training samples being patients with features such as LDH and ECOG; and classification labels being the optimal treatment decisions assigned by the main $Q$-learning procedure). As a simplification, the tree only considers the patient’s baseline and current characteristics, making it applicable to any time of decision (i.e., a stationary rule). For this tree, the accuracy was 94%, the AUC was 98%, and the F1 score was 95%, suggesting that our estimated optimal dynamic treatment regime can be approximated well by a simple stationary rule. At any point in a patient’s care, clinicians can read the (easily visualized) decision tree to get a recommendation about whether to stop ICI therapy. To provide a few examples, the tree recommends stopping ICIs if the baseline tumor stage is IIIB/IIIC and current LDH is ≤ 0.7; or if the baseline tumor stage is IVB/IVC, the tumor is currently in complete response (CR), current ECOG status is 0, and baseline age is ≤ 26; or if the baseline tumor stage is IVB/IVC, current ECOG status is ≥ 1, and baseline age is ≥ 75. This seems to represent a mix of treatment strategies, such as stopping treatment if the patient is already faring well (e.g., young age, tumor in CR, low ECOG, low LDH, etc.), or if the patient is elderly or very sick and may be harmed by further treatment. $\textbf{Conclusion.}$ By arriving at a simple decision tree after employing a complex RL approach, we created a practical and interpretable rule for stopping ICIs for patients with advanced melanoma. The next steps are (1) to conduct prospective testing and refinement of the simplified treatment rule and (2) to evaluate its effectiveness in actual clinical practice compared to standard care.

https://openreview.net/pdf/0a9c0584c0dbc2da0253ff9e53f69eb64322037e.pdf

Paper ID: 97

Evaluating sociodemographic bias in the implementation of a machine learned model to predict patient deterioration

Michael Colacci; Chloé Pou-Prom; Amol Verma; Muhammad Mamdani

Background: Machine learned (ML) tools may have unequal performance across patient subgroups and perpetuate or amplify historical biases. Additionally, clinicians may utilize ML tools differently between patient populations. This study investigates the performance of a machine learned early warning system for patient deterioration and its association with clinical care and outcomes across patient sociodemographic subgroups, to determine whether there was bias either intrinsic to the tool or in its application. Methods: This was a pre-planned, retrospective analysis of a prospective, non-randomized, controlled study evaluating the impact of a machine learned early warning system (ML-EWS) among patients hospitalized on the general internal medicine (GIM) service at an academic hospital. The impact of the ML-EWS on outcomes among GIM patients at St. Michael’s Hospital (University of Toronto) in Ontario, Canada was compared before model deployment (control period: November 1, 2016 to September 30, 2020) and after deployment (intervention period: October 1, 2020 to November 1, 2022). CHARTwatch performance was assessed according to sociodemographic subgroup: age (< 60 years, 60-79 years, >80 years), sex, homelessness, neighborhood material resources and neighborhood racialized and newcomer population. The two primary outcomes were the sensitivity and specificity of the model between subgroups. Subgroups were also compared using the ratio of balanced error rates (BERR of <0.8 or >1.25 considered to be a meaningful difference). Propensity score matching with overlap weights was used to match patients in the control and intervention periods overall and within subgroups. Secondary outcomes included clinician actions and patient outcomes between patient subgroups after CHARTwatch implementation. Results: 12,877 patients were included in the analysis, including 9,079 in the control period and 3,798 in the intervention period. 42.2% of patients were female, 35.6% were less than 60 years old and 24.2% were over 80 years old, 13.3% were experiencing homelessness, 30.9% lived in the lowest quintile of neighborhood material resources, and 36.9% lived in the quintile with the highest neighborhood racialized and newcomer populations. The overall rate of ICU transfer or death among all patients was 8.0% (N=720). The overall accuracy of the model was 0.84, precision was 0.29, recall was 0.69, AUC was 0.85, F1 score was 0.41 and Brier score was 0.74. Model sensitivity was 70% (95% Confidence Interval [CI] 66-73%) and did not significantly differ between subgroups. Model specificity was 85% (95%CI 84-86%), with a slight increase among patients <60 years old (88%, 95%CI 89-90%), but otherwise no significant differences between sociodemogaphic subgroups. The BER ratio was similar for all patient subgroups, ranging from 0.96-1.17. Clinician actions following CHARTwatch implementation were similar among all sociodemographic subgroups. Non-palliative death was decreased in the overall cohort (2.2 vs. 1.6%, p=0.03), as well as the subgroup of patients over 80 years old (4.3% vs. 2.5%, p=0.02) and not experiencing homelessness (2.3% vs. 1.7%, p=0.046). Conclusion We identified that an ML model for inpatient deterioration performed generally well and consistently among all measured sociodemographic subgroups, and that clinical actions following a high-risk notification did not appear to differ significantly by patient subgroup. An increase in palliative deaths was observed for men and patients in the highest quintile of neighbourhood racialized and newcomer population who received CHARTwatch high-risk alerts, though it is uncertain if this difference represents bias in model utilization or appropriate clinical care.

https://openreview.net

Paper ID: 98

Deep Learning-Based Analysis of Colonic Sonography for Accurate Bowel Wall Segmentation and Motility Assessment

Jihong Chen; Krish Patel; Amer Hussain; Jan D. Huizinga

Background: Sonographic imaging plays a pivotal role in evaluating gastrointestinal motility, yet accurately segmenting the bowel wall while misinterpreting gas and luminal content remains challenging. Conventional machine learning has aided in colonic activity segmentation and frequency analysis; however, it often inaccurately differentiates between anatomical and non-anatomical features. This study introduces a deep learning approach that identifies bowel walls and assesses motility, thereby addressing current diagnostic limitations and improving comprehensive data analysis for colonic diameter changes. Methods: The method consists of two primary phases: data preparation and real-time application. Initially, a dataset comprising 9,816 ultrasound images of 640x640 pixels was curated from patients and volunteers. These images underwent extensive preprocessing and augmentation before being used to train a Unet model explicitly designed for binary segmentation of colon walls. The model training incorporated advanced convolutional neural network (CNN) techniques, enhancing its ability to segment the colon walls accurately. The second phase utilizes this trained model to analyze real-time ultrasound video data, performing complex analyses such as spatiotemporal mapping and wavelet analysis to extract detailed insights into colon behaviour. Results: The Unet model demonstrated exceptional accuracy, with an Intersection over Union (IoU) score of 93.69% on the validation set, significantly outperforming traditional segmentation methods. Furthermore, the model was rigorously tested on a new test set of 50 diverse videos, achieving robust segmentation accuracy. The model effectively facilitated detailed colon examinations from sonographies, providing critical insights through various computational analyses, including producing diameter mapping, Fast Fourier transform frequency analysis, spatiotemporal mapping, wavelet frequency analysis, and breathing displacement. Conclusion: This study significantly advances the field of gastrointestinal imaging by integrating deep learning into ultrasound analysis. Our methodology not only overcomes the inherent challenges posed by traditional segmentation techniques but also provides a powerful tool for non-invasive, real-time analysis of colon health, potentially transforming clinical practices and enhancing patient outcomes. Figures: The accompanying spatiotemporal maps [A) Machine learning B) Deep learning] present the changes in colonic diameter (red – contraction, blue – relaxation) over time from the same colonic sonography (machine learning interprets gas and content as bowel wall). C) Frequency FFT plot. The graphs serve as visual validations of our method's efficacy.

https://openreview.net/pdf/1fc87ccd0c6eed79452624b8afcddac845c2a575.pdf

Paper ID: 99

Evaluation of Balancing Techniques for Clinical Datasets and AI Diagnosis

Ohm Sharma; Robi Polikar

Background. Alzheimer’s disease (AD) is one of many incurable illnesses with no definitive method of diagnosis and has been a focus of machine learning diagnostics. Clinicians currently rely on many diagnostic tools, such as memory and behavioral tests, to evaluate symptoms of AD, including dementia. Previously published work in this lab has shown the viability of various machine learning models in predicting the level of dementia in AD patients by using a clock drawing test (CDT) dataset. However, a constant concern with training these models is the lack of sufficient balanced data to train and test the models. As such, this study analyzes different oversampling and undersampling techniques on a small biased dataset of digital CDTs and their effect on model metrics. While these methods have had success in large datasets, we hypothesize that these sophisticated techniques will not significantly increase the model performance due to the extreme imbalance and size of the dataset. Methods. The model of choice for these experiments was XGBoost, a tree-based gradient-boosting ensemble model. The model was trained and tested with five-fold cross-validation. We also performed hyperparameter tuning for optimal results for each model and dataset. The AD dataset was preprocessed using a variety of oversampling and undersampling techniques before training. The model was then evaluated using several figures of merit, including precision, recall, accuracy, and F1 score. Results. We found that balancing techniques did not increase performance metrics in a meaningful way. The average change across all methods in accuracy and minority class F1 score was -0.08 and 0.154 respectively. Additionally, the average recall for the minority class was 0.31, indicating that the model could not efficiently learn the minority class. Conclusion. We show that these balancing techniques are inadequate for an extremely small imbalanced dataset, such as this CDT dataset. Rebalancing provided minimal improvement in the model performance even after hyperparameter tuning. These results illustrate the limitations of current rebalancing methods in modeling rare disease diagnoses, where the amount of data from disease cohorts can be limited. Future directions include collaborative clinical research to create larger datasets for optimal training of diagnostic models.

https://openreview.net/pdf/95c9d9ab4dc3d86db7c70f3f07151d822222c9ca.pdf