12 Plagues of AI in Healthcare

 

Problem 3. Sample Size: Looking at the World Through a Keyhole


A concern inherent to most analytical solutions is sample size. A model created and trained on a limited sample may show inflated performance when tested against a held-out control sample, because a small sample allows results that arise merely by chance. When such a model is then introduced into a different, clinical environment, its accuracy can drop and the model can fail outright. Accounting for sample size is therefore paramount to acquiring reliable results.
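These chance effects can be made concrete with a small simulation. Assuming a hypothetical model whose true per-case accuracy is 70%, the accuracy actually observed on a small control sample can wander far from 70% purely by chance; the numbers below are illustrative only.

```python
import random

random.seed(0)

TRUE_ACCURACY = 0.70  # hypothetical true per-case accuracy of a model

def apparent_accuracy(n_cases):
    """Evaluate the model once on n_cases independent cases."""
    correct = sum(random.random() < TRUE_ACCURACY for _ in range(n_cases))
    return correct / n_cases

def accuracy_spread(n_cases, n_trials=2000):
    """Range of apparent accuracies seen across repeated evaluations."""
    results = [apparent_accuracy(n_cases) for _ in range(n_trials)]
    return min(results), max(results)

for n in (30, 300, 3000):
    low, high = accuracy_spread(n)
    print(f"n={n:4d}: apparent accuracy ranges from {low:.2f} to {high:.2f}")
```

With only 30 evaluation cases, the observed accuracy swings widely around the true value; with thousands of cases the range narrows, which is why small-sample results so often fail to replicate in the clinic.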

In an age of increasingly available data, ML algorithms must be trained on adequately sized and diverse datasets. While this will likely require the harmonization of data from multiple centers, it is imperative if we are to believe our models are robust enough to consistently provide reliable results at different time points and in a number of different environments. A recent systematic review assessed 62 comprehensive studies detailing new ML models for the diagnosis or prognosis of COVID-19 using chest x-rays and/or computed tomography images and found that a common problem among the studies was the limited sample size used to train the models, with other reviews noting similar observations. Among the 62 models assessed, more than 50% of the diagnosis-focused studies (19/32) used fewer than 2,000 data points. Unlike rare brain tumors, for which small amounts of data may take years to collect, COVID-19 has produced more than 2 million cases to date. It is therefore unsurprising that a model built to capture a complex biological phenomenon from only 2,000 data points can produce unreliable results in its intended environment.

As in all aspects of medical research, newer ML models must be trained on large and diverse datasets to provide reliable results. A clear solution is an active, ongoing effort to acquire more data from a variety of sources. Unlike physical medical devices, ML models can be continually updated and improved to provide more reliable and accurate performance. As more data become available, a model can be retrained on larger samples and redeployed for the best performance in the field. Ultimately, this may require pulling a model from clinical practice, retraining and testing it on the newly available data, and then returning it to the environment of interest. However, with the increased demand for data, one must consider the possibility that a modeler may acquire and include data for which there is no patient consent. Once new data are incorporated and the model is deployed in the field, it is very difficult to identify which data were used. A possible solution to these consent concerns is to embed in each image a certificate attesting to patient consent, for example through non-fungible tokens (NFTs).
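A minimal way to sketch the consent-certificate idea, without the full NFT machinery, is to bind each image's content hash to a consent attestation. The record below is illustrative only; the field names and identifiers are assumptions rather than any established standard.

```python
import hashlib
import json

def consent_certificate(image_bytes, patient_id, consent_form_id):
    """Bind an image to a consent attestation via its content hash.

    Illustrative only: a real deployment would anchor this record on a
    ledger (e.g., mint it as an NFT) so it travels with the image.
    """
    return {
        "image_sha256": hashlib.sha256(image_bytes).hexdigest(),
        "patient_id": patient_id,
        "consent_form_id": consent_form_id,
    }

# Hypothetical image bytes and identifiers
record = consent_certificate(b"<pixel data>", "P-001", "CF-2023-017")
print(json.dumps(record, indent=2))
```

Because the certificate references the image by cryptographic hash, any later question of "was this image consented?" can be answered by recomputing the hash and looking up the attestation, even after the image has passed through a training pipeline.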

Problem 4. Discriminatory Biases: Rubbish in, Rubbish Out

Perhaps the most commonly discussed problem concerning AI technologies is their potential to exacerbate current societal biases in clinical practice. Even more alarmingly, models developed on historical data may perpetuate historical biases against a variety of patients, reintroducing prejudices that practice has since moved past.

Issues of discriminatory bias may manifest as a model demonstrating high performance on a single sample of patient data but failing on different subsets of individuals. For instance, an increasing focus of ML-based applications has been algorithms capable of assisting dermatologists in the diagnosis and treatment of diseases of the skin. While racial inequalities in healthcare delivery are increasingly documented across the world, recent work suggests that a major source of these inequalities is the lack of representation of race and skin tone in medical textbooks. Medical schools have begun to address these concerns by updating textbook images for greater inclusivity, yet deep learning (DL) image-based classifier algorithms often continue to be trained on low-quality datasets that commonly contain unidimensional data (e.g., mostly images of lighter skin). In turn, these algorithms perform better on the image types they were trained on and propagate the biases represented in the original datasets. This cycle ultimately creates the potential for profound failures in specific groups of people. In a study examining three commercially available facial recognition technologies (Microsoft, Face++, and IBM) through intersectional analyses of gender and race, gender-classification error rates were as high as 34.7% for darker-skinned females compared to 0.8% for lighter-skinned males. Even if some diseases are more common in specific races or genders, such as melanoma in non-Hispanic white persons, patients with a variety of skin types should be included so that all may share in the potential benefits of these algorithms. While these concerns may be relatively obvious, even the location of the image acquisition center can bias a model's performance, because the demographic makeup of the surrounding community shapes the images on which the model is trained and tested. When the model is moved to a different environment, its performance can differ drastically with the new demographics encountered.
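A first line of defense against such failures is to audit error rates per subgroup rather than in aggregate. The sketch below does this for hypothetical (true label, predicted label, subgroup) records; the counts are made up to echo the kind of disparity described above, not taken from the cited study.

```python
from collections import defaultdict

def error_rates_by_group(records):
    """records: iterable of (y_true, y_pred, group); returns error rate per group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for y_true, y_pred, group in records:
        totals[group] += 1
        errors[group] += int(y_true != y_pred)
    return {group: errors[group] / totals[group] for group in totals}

# Hypothetical audit: 100 cases per subgroup, illustrative error counts
audit = (
    [("f", "m", "darker_skinned_female")] * 34
    + [("f", "f", "darker_skinned_female")] * 66
    + [("m", "f", "lighter_skinned_male")] * 1
    + [("m", "m", "lighter_skinned_male")] * 99
)
print(error_rates_by_group(audit))
# {'darker_skinned_female': 0.34, 'lighter_skinned_male': 0.01}
```

An aggregate accuracy of 82.5% would mask this 34-fold gap entirely, which is why disaggregated reporting matters before deployment.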

Patient-related factors must be considered more carefully when building datasets for model training. By building models on data from a variety of sites, with greater awareness of the specific populations included, we may begin to mitigate the potential biases in our results. DL algorithms remain relatively nascent, and the growing focus on classification performance across gender and race gives us impetus to ensure that DL algorithms can be used to mitigate, rather than widen, healthcare disparities based on demographics.
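In practice, that awareness can start with a simple tally of how an assembled dataset is distributed across sites and skin types before any training occurs. The metadata schema below ("site", "skin_tone", the Fitzpatrick-style bands) is an assumption for illustration, not a standard.

```python
from collections import Counter

def audit_composition(metadata, keys=("site", "skin_tone")):
    """Count how dataset records are distributed over each demographic key."""
    return {key: Counter(record[key] for record in metadata) for key in keys}

# Hypothetical per-image metadata from two acquisition sites
metadata = [
    {"site": "A", "skin_tone": "I-II"},
    {"site": "A", "skin_tone": "I-II"},
    {"site": "A", "skin_tone": "III-IV"},
    {"site": "B", "skin_tone": "V-VI"},
]
counts = audit_composition(metadata)
for key, tally in counts.items():
    print(key, dict(tally))
```

A skewed tally at this stage (e.g., one site or skin-tone band dominating) is an early warning that the trained model's performance will not transfer to underrepresented groups.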