Synthetic Data in Drug Development - What It Is and How It Relates to AI-Informed Approaches

In recent years, demand for the high-quality data used to train and test artificial intelligence (AI) models has exploded. However, an array of regulatory requirements and privacy rules is making that data more complex, expensive and time-consuming to gather.

As a result, interest in using synthetic data to power AI models has intensified across many sectors, especially those that are highly regulated or that depend on hard-to-obtain real-world data (RWD). In drug development, researchers are investigating the viability of synthetic data as an alternative to RWD in some AI-informed approaches. Let’s take a closer look at how the use of synthetic data is impacting the drug development landscape.

What is synthetic data?

In contrast to RWD (data collected from electronic health records, medical claims, product or disease registries, or other real-world sources), synthetic data is information that is artificially generated by computer algorithms or statistical methods to simulate real-world data. Synthetic data sets can include numerical, binary, categorical or unstructured data, and can be used in lieu of real-world data sets to train or validate machine learning (ML) models.

Synthetic data can be created in a variety of ways, including random sampling from a distribution, agent-based modeling, and AI-supported generative models.
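As a minimal sketch of the simplest of these approaches, random sampling from a distribution, the Python snippet below fits a normal distribution to a small set of measurements and draws new values from it. The function name `fit_and_sample`, the choice of a normal distribution, and the example numbers are illustrative assumptions, not a description of any particular platform or data set.

```python
import random
import statistics


def fit_and_sample(real_values, n_samples, seed=0):
    """Fit a normal distribution to real_values and draw synthetic values.

    Assumption: the variable is roughly normally distributed. Real
    variables often are not, and a different distribution may fit better.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n_samples)]


# Hypothetical body weights in kg (made-up numbers, not patient data)
real_weights_kg = [62.0, 71.5, 80.2, 68.4, 90.1, 75.3, 66.8, 84.0]
synthetic_weights_kg = fit_and_sample(real_weights_kg, n_samples=1000)
```

The synthetic values share the mean and spread of the originals but do not copy any single record. Sampling one variable at a time like this ignores correlations between variables, which is one reason agent-based and generative approaches are also used.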

Synthetic data is still considered an emerging field of science, which means many of the methodologies used to create synthetic data sets are still being actively developed and tested.

What are some of the benefits of synthetic data?

Synthetic data offers a range of important benefits.

  • Protects sensitive data - Many of the privacy and ethical constraints that govern the handling of personal data such as medical records can be eased by using synthetic data, which mimics real data without containing records that identify real individuals.
  • Inexpensively produced - Traditional data collection methods are resource-intensive, making them slow and costly to implement. Compared to RWD, synthetic data is inexpensive to produce and maintain, with significantly lower costs for data collection and storage. 
  • Easy to use - RWD requires specific measures to maintain privacy, filter errors, and convert data from variable formats. By contrast, synthetic data is typically uniform, can be automatically labeled, and is simpler to generate and use.
  • Better scalability - Huge amounts of data are needed to train predictive models. Synthetic data can be used to supplement or generate missing or rare data that would otherwise hold back an RWD effort. For example, if you’re training an AI system to diagnose a rare medical condition using imaging but lack enough RWD to train the system adequately, synthetic data offers a workaround that can get you the data sets you need faster and at a significantly lower cost.
  • Mitigates bias - Synthetic data has been shown to increase diversity, balance and variety in data sets, helping to address some of the biases that tend to be present in RWD and that can degrade the quality of a neural network. Synthetic data sets that better reflect the underlying population also help reduce the risk of discriminatory outcomes.
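To make the rare-data and balancing points in the list above more concrete, here is a minimal Python sketch that creates extra records for an under-represented group by interpolating between pairs of existing records. It is a simplified illustration in the spirit of oversampling techniques such as SMOTE, not an implementation of any specific method; the function name `interpolate_minority` and the feature values are hypothetical.

```python
import random


def interpolate_minority(records, n_new, seed=0):
    """Create n_new synthetic records by blending random pairs of records.

    Each record is a tuple of numeric features. A new record lies on the
    straight line between two randomly chosen existing records.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(records, 2)  # two distinct existing records
        t = rng.random()               # blend factor in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Hypothetical rare-condition cases as (age in years, biomarker level)
rare_cases = [(54.0, 1.8), (61.0, 2.4), (47.0, 1.5), (58.0, 2.1)]
new_cases = interpolate_minority(rare_cases, n_new=20)
augmented_cases = rare_cases + new_cases
```

Interpolated records stay within the range of the observed cases, so this kind of augmentation can rebalance a data set but cannot add information the original records lack.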

Some applications for synthetic data in drug development

There are many real and potential applications for synthetic data in healthcare, and in drug discovery specifically. Because health data is so strictly regulated, synthetic data gives researchers a way to obtain vital information without accessing actual patient records. This is different from data masking techniques, which still present a range of privacy-related complications.

Some possible applications for synthetic data use in drug development include:

  • Artificial patients - Virtual or artificial patients are derived from real patient data sets, but do not include any traceable, identifiable real-patient data. This protects sensitive health data while allowing trial operations to move forward.
  • Synthetic control arms - Instead of relying solely on data collected from real patients assigned to the control arm of a clinical study (for example, those receiving placebo), researchers can use synthetic control arms to reduce or eliminate the need for control participants, fill in data gaps, reduce delays, lower trial costs, and ultimately help bring drugs from bench to bedside faster.
  • Hypothesis testing - Synthetic data can be used to test a hypothesis quickly and inexpensively. It can also be used to help validate mathematical models of drug efficacy and safety.
  • Training new AI and ML models - Training on synthetic data allows researchers to obtain and use data from different sources quickly and cost-efficiently so they can develop more robust and intelligent models.
  • Cross-border collaborations - Multi-population research is often hampered by concerns around patient data security. Synthetic data could give potential partners and collaborators easier, ethical access to data for inter-organizational, cross-border and international research.

What’s next for synthetic data?

The broader role synthetic data will play in drug development remains to be seen, though many experts agree its use in AI-informed approaches is likely to grow rapidly. Gartner has estimated that by 2030, synthetic data will overtake real data in training AI models. For widespread adoption, researchers will have to work alongside regulators and policymakers to develop and adapt clinical-quality measures and evaluation metrics for synthetic data and its practical use.

VeriSIM Life has developed its own sophisticated computational platform that leverages advanced AI and ML techniques to improve drug discovery and development by greatly reducing the time and money it takes to bring a drug to market. Contact us to learn more about BIOiSIM™ and how our AI-enabled platform helps de-risk R&D decisions.