*Purpose: Machine learning systems increasingly show potential for achieving clinical expert-level performance, and there are growing efforts to deploy these systems in radiology practice. Accordingly, there is increasing recognition of the need to monitor model performance after deployment in actual clinical workflow: good performance on retrospective trial datasets does not guarantee acceptable future performance in routine real-world practice, owing to factors such as source-data drift and changes in disease prevalence. End-users relying on AI models therefore need the ability to regularly and conveniently evaluate model performance against required specifications. Only then can they be informed of performance degradation and trigger the necessary model removal, revision, or refresh in a timely manner. To this end, we designed and implemented a performance monitoring framework for a deep learning-based chest radiograph (CXR) classification model. Our approach enables collection and analysis of longitudinal model performance data, providing better visibility and greater reliability for end-users.
*Methods: We deployed a CXR classification system to triage studies with urgent findings at a major public hospital’s radiology department. The system accepts CXRs from PACS as inputs and generates positive or negative inferences on the 14 labels specified in the publicly available CheXpert dataset. Our performance monitoring system first collates lists of CXRs forwarded to the AI system in 24-hour periods, then randomly samples 20 CXRs from each period for evaluation against ground truth labels. These labels were obtained by manual annotation of the sampled CXRs with the aid of their accompanying radiology reports. We computed performance metrics including accuracy, sensitivity, and specificity over time and visualized the results on a Shewhart control chart. To alleviate the burden of manual ground truth annotation, we also implemented the publicly available CheXpert natural language processing (NLP) labeler to generate image labels efficiently, and compared both the manually and automatically generated labels with the AI model predictions. After implementation, we surveyed end-user radiologists to assess their perceptions of the impact of our performance monitoring tools on their workflow and on the reliability of the deployed AI system.
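The daily sampling, metric computation, and control-chart steps described above can be sketched in a few functions. This is a minimal illustration, not the deployed implementation; the function names, the binary (prediction, ground truth) pair representation, and the choice of 3-sigma limits are assumptions for the sketch (Shewhart charts conventionally use 3-sigma limits, but the abstract does not state the limits used).

```python
import random
from statistics import mean, stdev

def sample_daily_worklist(worklist, n=20, seed=None):
    """Randomly sample up to n studies from one 24-hour worklist for audit.
    `worklist` is a list of study identifiers; names here are illustrative."""
    rng = random.Random(seed)
    return rng.sample(worklist, min(n, len(worklist)))

def daily_metrics(pairs):
    """Compute accuracy, sensitivity, and specificity for one label from
    (prediction, ground_truth) boolean pairs. Returns None for a metric
    whose denominator is zero (e.g. no positives sampled that day)."""
    tp = sum(1 for p, t in pairs if p and t)
    tn = sum(1 for p, t in pairs if not p and not t)
    fp = sum(1 for p, t in pairs if p and not t)
    fn = sum(1 for p, t in pairs if not p and t)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return accuracy, sensitivity, specificity

def shewhart_limits(series, k=3.0):
    """Center line and k-sigma control limits for a daily metric series.
    A point outside (lcl, ucl) would flag possible model deviation."""
    m, s = mean(series), stdev(series)
    return m - k * s, m, m + k * s
```

In this scheme, each day contributes one point per metric to the chart, and an out-of-control signal would prompt a closer audit of the model.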
*Results: Our performance monitoring framework integrated with the existing healthcare IT and imaging infrastructure at our radiology department. The graphical dashboard in our framework provided synoptic visual feedback on the deployed model's performance over time for quality control and audits. End-users expressed increased reassurance and confidence operating alongside the AI model, knowing that a performance monitoring system is in place. Although labels generated by automated NLP labelling are not yet as accurate as manual annotations, they enable more efficient and passive performance monitoring over a higher volume of images.
*Conclusions: We demonstrate a real-world implementation of a performance monitoring framework that enables radiologists to regularly track deep learning model performance over time. This approach overcomes important barriers to the acceptance and adoption of AI into radiology workflow. As NLP technology advances, we expect further reduction in the manual annotation burden and greater efficiency in longitudinal performance monitoring. Because our framework can serve as a sentinel for model deviation from specifications, scaling it across a variety of use cases has implications for safer AI deployments in radiology.
This research is supported by core funding from: I2R
Grant Reference no. : N/A