A quote often attributed to Einstein runs, “If I had an hour to save the world, I would spend 55 minutes defining the problem.” In that spirit, we begin by defining what we mean by clinical data science (CDS), at least for the purposes of this blog series. CDS is the application of data science to medical and therapeutic decision-making, including research applications. Simple, right? So far, so good?
Well, now comes the hard part: defining “data science” itself, since even seasoned professionals can be inconsistent with terminology. Is a given example best characterized as data science, or data analytics, or perhaps business intelligence? Do you need a data visualization expert? Perhaps we should just give everything to a statistician and hold our breath? These distinctions are more than purely academic, as finding the right person with the right skillset can seriously impact your outcome.
Before going any further, it is worth noting that, for historical reasons as much as anything, many of these job titles overlap, and many who hold them could have almost identical skillsets. The definitions and distinctions offered here are only one of many possible frameworks, but they seem to reflect the most common usage and perhaps the early signs of a convergent system. To help unpack these concepts, let’s examine them through the lens of when a given question would likely be addressed.
- A statistician typically looks at data through a hindsight lens. They will generally try to answer questions about what happened and frame the answer as an estimate of a true underlying value.
- A data visualization expert creates dashboards and similar tools to help decision-makers interpret data more quickly and easily.
- A business intelligence expert will try to pull information from multiple sources, combine it with a knowledge of business operations, and make decisions largely based on the business’s own interests.
- A data analyst will typically analyze past data, often with some statistical inference techniques, to make predictions and decisions about future events.
- A data scientist uses real-time, or near real-time, data to make decisions through largely algorithmic techniques.
As you can see, the difference between the definitions of data analyst and data scientist is subtle. The key to understanding it is that a data scientist’s approach will tend to adapt much faster than a data analyst’s. In essence, the data analyst may be examining trends that are a month, a quarter, a year, or more old. This is not to say that such analyses are not useful: one of the key business metrics is year-over-year performance, which is exclusively in the purview of a posteriori analysis.
The advent and evolution of clinical data science as a discipline opens up some exciting possibilities for clinical trials. A greater emphasis on adaptive trials and real-world evidence can increase the speed of trials and the validity of results. In fact, the data show that in some disease areas, especially those with a strong genetic component (such as cancers or autoimmune diseases), an adaptive clinical trial can often decrease time, costs, and risks to subjects and sponsor alike (Pallmann, Philip. Adaptive Designs in Clinical Trials: Why Use Them, and How to Run and Report Them. 2018). Umbrella, basket, and platform trial designs have all become more common in the past decade as:
- knowledge of molecular genetics has increased,
- the cost and difficulty of molecular techniques have decreased,
- and the complexity of interim analysis has decreased.
A knowledgeable clinical data scientist could even conceivably program a majority of interim analyses to run repeatedly (using an appropriate correction for multiplicity), which would decrease the downtime that can hinder adaptive designs. This approach could also end trials that have become unproductive or unnecessary earlier than a traditional approach would.
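To make the idea of a preprogrammed, multiplicity-corrected interim analysis a little more concrete, here is a minimal sketch in Python. It assumes a two-arm trial with a binary outcome and uses a plain Bonferroni split of the overall alpha across the planned looks. Bonferroni is chosen only because it is simple and conservative; a real trial would more likely use a group-sequential boundary specified in the protocol. The function names and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (events_a + events_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No variation at all (everyone or no one had the event): no evidence of a difference.
        return 1.0
    z = (events_a / n_a - events_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def interim_decision(p_value, planned_looks, overall_alpha=0.05):
    """Bonferroni correction: each look is tested at overall_alpha / planned_looks."""
    threshold = overall_alpha / planned_looks
    return "stop" if p_value < threshold else "continue"


# Hypothetical interim data: 30/100 events on treatment vs 60/100 on control,
# at one of five planned looks.
p = two_proportion_p_value(30, 100, 60, 100)
print(interim_decision(p, planned_looks=5))
```

Because the threshold is fixed in advance, the same script can be rerun at every look without anyone deciding after the fact how strict to be, which is the point of programming the analysis up front.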
Almost all parties at all levels in clinical trials can benefit from the use of data science. Industry sponsors could see the most direct benefits in both cost and time reduction. Traditionally, a large pharmaceutical company may value data science mostly for its time savings. In contrast, smaller start-ups and midsize companies might see value primarily on the side of cost savings. A sponsor can reduce the resources required to bring a product to market by applying a more adaptive design that supports more frequent interim analyses. Such a design avoids the traditional overhead of scheduling a database freeze, completely resolving queries, and having a small army of statisticians and programmers spend weeks on the analysis, only to then have to present the data to a DSMB or similar body.
A clinical data scientist could theoretically preprogram all the necessary analyses and use any number of machine learning techniques to mitigate many unforeseeable circumstances, such as missing or erroneous data, outliers, or noncompliance. Machine learning, coupled with Natural Language Processing and search engine spiders, a.k.a. web crawlers, could conceivably enable sites and sponsors to monitor web forums, messages within EHR systems, and many more systems for serious adverse events (SAEs) or underreported adverse events. Similarly, a clinical data scientist could use emerging technologies to gather and process data to a degree that even five years ago would have seemed impossible. While technology continues to evolve at a substantial pace, it seems likely that humans will always be part of the process; therefore, it is unlikely that overhead can ever be reduced to zero, but data science can greatly reduce it.
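As a very rough illustration of the text-monitoring idea, the sketch below screens free-text messages for terms on a watch list. This is plain keyword matching, a crude stand-in for real Natural Language Processing, and the watch list itself is hypothetical; a production system would more plausibly map text to a coded dictionary such as MedDRA and route every hit to a human reviewer.

```python
import re

# Hypothetical watch list, for illustration only.
AE_TERMS = ["rash", "nausea", "dizziness", "chest pain", "hospitalized"]

# One case-insensitive pattern that matches any watch-list term as a whole word or phrase.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in AE_TERMS) + r")\b",
    re.IGNORECASE,
)


def flag_possible_adverse_events(messages):
    """Return (message_index, sorted_matched_terms) for each message with a hit."""
    flagged = []
    for index, text in enumerate(messages):
        terms = sorted({match.lower() for match in _PATTERN.findall(text)})
        if terms:
            flagged.append((index, terms))
    return flagged


# Made-up example messages.
messages = [
    "Felt fine after dose 2.",
    "Bad Nausea and a rash since Tuesday",
    "Was hospitalized overnight for chest pain",
]
print(flag_possible_adverse_events(messages))
```

A screen like this only surfaces candidates. Deciding whether a flagged message describes a reportable event remains a clinical judgment, which fits the point that humans stay part of the process.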
CROs and SMOs tend to take a more business intelligence approach to clinical data science. If site A sees five times as many lung cancer patients as site B, then it should reasonably follow, all things being equal, that site A should get the phase III trial for a novel targeted lung therapy. This is an oversimplified example, but it illustrates the point that data can and should inform business decisions wherever possible. If a CRO notices a sharp uptick in the number of queries for a given site, it might arrange for retraining at the next monitor visit to increase compliance.
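The query-uptick example can be sketched in a few lines. The snippet below flags a site when its latest query count sits well above its own history, measured in standard deviations. The two-standard-deviation threshold, the function name, and the counts are all assumptions made for illustration, not an established monitoring rule.

```python
from statistics import mean, stdev


def query_uptick(history, latest, z_threshold=2.0):
    """Return True if `latest` is more than z_threshold standard deviations
    above the mean of the site's previous query counts."""
    if len(history) < 2:
        return False  # not enough history to estimate variability
    mu = mean(history)
    sd = stdev(history)
    if sd == 0:
        return latest > mu  # perfectly flat history: any increase stands out
    return (latest - mu) / sd > z_threshold


# Hypothetical monthly query counts for one site.
previous_months = [10, 12, 11, 9, 13]
print(query_uptick(previous_months, latest=25))
print(query_uptick(previous_months, latest=12))
```

Comparing each site against its own baseline, rather than against a fixed number, keeps a busy, high-enrolling site from being flagged simply for being busy.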
Regulatory agencies and quality assurance departments can utilize data science to make risk-based monitoring programs more effective and to distribute routine audits more efficiently. The US FDA’s BIMO (Bioresearch Monitoring) program already uses some algorithmic approaches to determine where to send inspectors. The next logical evolution would likely be to incorporate machine learning to make the algorithm more capable. Perhaps it would incorporate Natural Language Processing to see trends in FDA 3500 forms (for safety reporting) or even social media groups discussing research experiences.
Research sites, too, can utilize data science methods to increase their own efficiency. Let’s be frank: we’ve all at least heard of sites that will apply to join every study even remotely applicable to their practice. It’s understandable. In the purely academic realm, the term of art is “publish or perish,” and if you don’t stay in the black, then you won’t be a site for very long. This shotgun approach to research participation can have unintended consequences, however. Even something as simple as completing a study application form can take hours, contracts can take days, and study start-up can take weeks. All this time is, as they say, money. A data-driven approach has the potential to guard against this tendency.