We have come a long way in human healthcare, from increasing average life expectancy to controlling infectious diseases like AIDS and tuberculosis. Despite the pharmacological and technological advances that contributed to this success, outbreaks of viruses such as SARS and, more recently, the coronavirus behind COVID-19 serve as reminders that the field needs to progress beyond its current state. How can the current generation’s tech front-runner, artificial intelligence (AI), break the status quo and address shortcomings in this field? Ideally, it can help discover viable treatments and develop drugs at a much faster pace.
AI adoption has grown rapidly over the years, with more than a 270% increase in its implementation across different enterprises in just the last four years. It has extensive use cases in diverse fields and vast potential to reimagine how problems are solved. From entertainment, aerospace, and electronics to retail and agriculture, its increasing pervasiveness impacts around USD 50 to 600 billion in each area. In the healthcare and medical sector, McKinsey Global estimates that AI will be worth USD 200 to 300 billion in the 2020s. But even with such promise, AI’s penetration into the healthcare and pharma space has been slower, because this space poses a unique set of challenges.
Healthcare and pharma demand high accuracy from AI model predictions, while maintaining generalizability across different populations. Both factors are essential because patient safety is critical and each patient is inherently different. AI model predictions are driven largely by the data used to train the models. Variability in experiments and poor translatability from animals to humans are very common during pharmaceutical development. Small and sparse datasets, which are especially prevalent in the drug development domain, make it even harder for AI models to acquire sufficient knowledge. Models trained on limited datasets can suffer from skewed learning, overfitting, and poor optimization, all of which directly affect their accuracy and generalizability. To put it in simple terms, because little historical data is available for AI to learn from, models cannot properly distinguish the nuances of diverse populations and assign undue significance to spurious correlations. Small and sparse datasets have therefore become one of the major obstacles to AI gaining traction in the space.
AI, and machine learning in general, have matured over the years, and several approaches have evolved to deal with such data. Some methods that can address the challenges of healthcare data are:
- Handle undesired variability by using alternative metrics to evaluate the AI model’s performance.
- Reduce skewed learning by up-sampling or down-sampling the data until the classes are balanced.
- Avoid overfitting by using regularization – constraining model complexity during training so that models generalize better.
- Handle errors stemming from poor optimization:
  - Cross-validation (k-fold) – repeatedly resampling the dataset into training and test folds to assess model performance
  - Transfer learning – acquiring knowledge from a larger data source and applying it to a related problem
  - Ensemble learning techniques – combining multiple models to increase accuracy
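A few of the techniques listed above can be sketched in plain Python. This is a minimal illustration, not a production recipe: the function names (`balanced_accuracy`, `upsample_minority`, `k_fold_indices`) and the toy data are assumptions made up for this example. It shows one alternative metric (balanced accuracy), random up-sampling of the smaller class, and k-fold splitting.

```python
import random

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall: an alternative to plain accuracy for skewed classes."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

def upsample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs so each sample is held out exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Toy, made-up data: 8 "negative" and 2 "positive" samples.
rows = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2

for train_idx, test_idx in k_fold_indices(len(rows), k=5):
    # Up-sample inside each training fold only, so duplicates never leak into the test fold.
    train_rows, train_labels = upsample_minority(
        [rows[i] for i in train_idx], [labels[i] for i in train_idx]
    )
    # A real model would be fit on train_rows here and scored on the test fold.
```

Resampling is done inside each training fold rather than before splitting; otherwise copies of the same sample could land in both the training and test folds and inflate the measured performance.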
Although transfer learning and ensemble learning have made a noticeable impact in domains with smaller datasets for a few years now, they do not seem to be adopted as widely as the other techniques mentioned above. That is starting to change: both techniques are gaining momentum, as exemplified by their prominent adoption in the medical image analysis space.
Transfer learning follows a straightforward two-stage paradigm: first train an AI model on a large, general dataset (for example, images collected from the internet), and then teach the model how to interpret chest X-ray images to detect Stage II lung cancer. In this way, a large data source provides a training base that “transfers” its knowledge to the specialized use case of chest X-ray interpretation. A model of this type was developed recently by Google and has outperformed six human radiologists with an average of 8 years of medical experience.
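The two-stage idea can be illustrated with a deliberately tiny sketch. It is only an analogy: real imaging models reuse pretrained neural network layers, whereas this example uses a one-variable linear model on synthetic data. All names (`make_data`, `train_linear`, `mse`) and all numbers are assumptions chosen for illustration.

```python
import random

def make_data(n, slope, intercept, rng, noise=0.1):
    """Synthetic (x, y) pairs drawn from a noisy line; purely illustrative."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, slope * x + intercept + rng.gauss(0.0, noise)))
    return data

def train_linear(data, w=0.0, b=0.0, lr=0.05, epochs=50):
    """Fit y ~ w*x + b by batch gradient descent on MSE, starting from the given (w, b)."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(data, w, b):
    """Mean squared error of the line (w, b) on the given data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
source = make_data(1000, 3.0, 1.0, rng)        # large, general "source" task
target_train = make_data(8, 3.2, 1.5, rng)     # small, related "target" task
target_test = make_data(200, 3.2, 1.5, rng)

# Stage 1: learn from the large source dataset.
w_pre, b_pre = train_linear(source, epochs=200)
# Stage 2: fine-tune briefly on the small target dataset, starting from stage 1.
w_ft, b_ft = train_linear(target_train, w_pre, b_pre, epochs=5)
# Baseline: the same brief training on the target data, but from scratch.
w_scratch, b_scratch = train_linear(target_train, epochs=5)

print("fine-tuned test MSE:  ", mse(target_test, w_ft, b_ft))
print("from-scratch test MSE:", mse(target_test, w_scratch, b_scratch))
```

Because the source and target tasks are related, the pretrained parameters start close to a good solution, so a handful of updates on eight target samples gets further than the same updates from a cold start.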
Ensemble learning is a method of combining multiple ML models to achieve better accuracy than any single model. There are several approaches:
- Bagging (Bootstrap AGGregatING) – Combine models that are trained on randomly drawn subsets of the dataset (e.g., Random Forests).
- Boosting – Build models sequentially, with each new weak model focusing on the errors of the previous ones (e.g., AdaBoost).
- Stacking – Use predictions from one or more models as inputs to a different learning algorithm. There are variations on these stacked models such as generalized, weighted, blended and Frankenstein. The choice among these variations depends on the type and amount of data.
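As a rough sketch of one of these variations, the example below builds a simplified weighted ensemble: three toy base models are combined by a weighted average, with weights proportional to the inverse of each model's error on held-out data. This is an assumed, simplified reading of "weighted stacking" and is not taken from any production system; fuller stacking would train a meta-model on out-of-fold predictions. All function names and the synthetic data are hypothetical.

```python
import random

def fit_mean(train):
    """Base model 1: always predict the mean training target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Base model 2: ordinary least-squares line through the training points."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sum((x - mx) * (y - my) for x, y in train) / sxx
    return lambda x: my + slope * (x - mx)

def fit_nearest(train):
    """Base model 3: return the target of the nearest training point (1-NN)."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    """Mean squared error of a model (a callable) on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def weighted_stack(models, holdout, eps=1e-12):
    """Average base models with weights proportional to 1 / hold-out MSE."""
    inv = [1.0 / (mse(m, holdout) + eps) for m in models]
    total = sum(inv)
    weights = [v / total for v in inv]

    def ensemble(x):
        return sum(w * m(x) for w, m in zip(weights, models))

    return ensemble, weights

# Small synthetic dataset (made up): a noisy line sampled at 20 points.
rng = random.Random(0)
points = [(i / 19, 2.0 * (i / 19) + 1.0 + rng.gauss(0.0, 0.1)) for i in range(20)]
train, holdout = points[0::2], points[1::2]

models = [fit_mean(train), fit_line(train), fit_nearest(train)]
ensemble, weights = weighted_stack(models, holdout)
print("weights (mean, line, 1-NN):", [round(w, 3) for w in weights])
```

Since the weights are chosen on the hold-out set, an honest estimate of the ensemble's accuracy needs a third, untouched test set; with very small datasets, cross-validation is the usual substitute.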
In my time developing deep ML models for BIOiSIM at VeriSIM Life (VSL), I have had plenty of first-hand experience applying the aforementioned techniques to ML challenges involving small datasets. BIOiSIM is our revolutionary product, developed in-house here at VSL, that enables simulation and prediction of drug disposition in human and animal bodies. Because pharmaceutical data is so sparse, it is crucial for the integrated AI models to efficiently mitigate the issues inherent to small datasets. In the process, I have had good success with stacking models on such data, specifically weighted-stacked ensemble models.
By incorporating these approaches and more into our AI models, we have observed considerable improvements in the prediction accuracy of BIOiSIM. More accurate predictions mean identifying more drug failures at earlier phases of drug development, which is vital to our mission to drastically reduce drug attrition for pharmaceutical companies. Through the use of intricately designed ML tools, BIOiSIM will prove to be an extremely effective tool in the pharmaceutical industry for creating cures for diseases from COVID-19 to cancer at a much faster pace and with greater accuracy. We are now at a juncture in time when technologies like AI can help us go beyond our current limitations in biological knowledge to truly revolutionize human healthcare.