In the previous post, I described the difference between efficacy and effectiveness, an increasingly important concept in clinical research and healthcare. After stressing the importance of effectiveness research to health policy planning and patient decision-making, I summarized seven criteria for identifying effectiveness studies. Finally, I asked whether these criteria could be re-purposed beyond a medical intervention to inform how we measure the effectiveness of software systems used to conduct clinical trials.
Is it possible to assess clinical trial software through the lens of effectiveness, as opposed to just efficacy?
I believe that it’s not only possible, but crucial. Why? We all want to reduce the time and cost it takes to deliver safe, effective drugs to those who need them. But if we don’t scrutinize our tools for doing so, we risk letting the status quo impede our progress. When lives are on the line, we can’t afford to let any inefficiency stand.
In this post, I adapt the criteria for effectiveness studies in clinical research into a methodology for evaluating the effectiveness of clinical research software. I limit the scope of adaptation to electronic data capture (EDC) systems, but I suspect that a similar methodology could be developed for CTMS, IVR, eTMF and other complementary technologies. If I open a field of inquiry, or even just broaden one that exists, I’ll consider it time well spent.
To start, we need to translate some key concepts. While “subject,” “intervention,” and “outcome measures” are well understood concepts within clinical research, what are their analogues for studying the impact of an EDC system? Let’s consider the following parallels:
- Subject = any individual (e.g. site investigator, study coordinator, data manager, patient) contributing time, knowledge, skill, or information to a study
- Intervention = the EDC software used by any contributor to a clinical study
- Outcomes = broadly, measures of the speed and quality of tasks performed by a study contributor
With these adapted terms in mind, we can turn to the seven criteria for distinguishing effectiveness from efficacy proposed by Dr. Gartlehner et al. and ask how each might inform an evaluation of your EDC software.
Criterion 1: Populations in Primary Care: “For effectiveness trials, settings should reflect the initial care facilities available to a diverse population with the condition of interest.”
Criterion 2: Less Stringent Eligibility Criteria: “[E]ligibility criteria must allow the source population to reflect the heterogeneity of external populations: the full spectrum of the human population, their comorbidities, variable compliance rates, and use of other medications (or other therapies, such as psychotherapies, or complementary and alternative medications).”
In clinical research, these two criteria ensure that the intervention is applied to the most diverse population possible, as defined by an indication that’s usually diagnosed, treated and managed in a primary care setting. By analogy, in the evaluation of EDC, we should imagine our population as all individuals contributing to a study (per the “less stringent eligibility” criterion) and doing so in a way typical for their role (per the “primary care” criterion). Specifically, we need to consider the impact on:
- A study coordinator or nurse, entering data or responding to queries at their workstation, or providing e-source data on a tablet while engaging with a patient
- A site investigator, reviewing and signing casebooks at their workstation
- A data manager, building eCRFs, querying data, and producing data management reports at their workstation or on a mobile device
- A patient, reporting outcomes via one or more devices
- A clinical research associate, researching queries and verifying source data at their workstation or on a mobile device
- A clinical trial manager or lead sponsor representative, tracking enrollment and top-line data management at their workstation or on a mobile device
The more contributor types your electronic data capture system serves, the greater your opportunity to measure (and gain) efficiency. For example, if the contributor responsible for site education can’t get ready access to real-time query reports, she can’t bring the most needed education to bear. If a patient has to learn how to use an unfamiliar device to provide reports, she is more likely to grow frustrated and neglect the task.
Granted, there are plenty of efficacy measures you could quantify regardless of user, from installation cost to uptime. But unless those measures impact a broad array of users, any gains in task speed and quality will have minimal effect on overall trial execution. So before investing or continuing to invest in your EDC system, consider whether it maximizes the contribution of every user type listed above.
Criterion 3: Health Outcomes: “Efficacy studies, especially phase III clinical trials, commonly use objective or subjective outcomes (e.g., symptom scores, laboratory data, or time to disease recurrence) to determine intermediate (surrogate) outcomes … Health outcomes, relevant to the condition of interest, should be the principal outcome measures in effectiveness studies. Intermediate outcomes are adequate only if empirical evidence verifies that the effect of the intervention on an intermediate endpoint predicts and fully captures the net effect on a health outcome.”
Again, for clinical research, this criterion ensures that measurement isn’t limited to markers that predict (sometimes weakly) a general improvement in health, but considers those improvements directly. An anticoagulant may inhibit one or more clotting factors, but if patients who are prescribed the therapy are just as likely to suffer a pulmonary embolism as those who are not, the treatment falls short of an important aim. In the same vein, beware of treating an EDC’s technical virtues as the only determinants of its value.
Compatibility with existing IT infrastructure and uptime do matter, but if a data manager needs to click through multiple screens to query a single field, data quality and time to lock are profoundly undermined. Meanwhile, regulatory agencies award no points for installing a system in five days as opposed to seven.
So what are the outcomes that matter, beyond the obvious one of getting better data faster? I’m not aware of any comprehensive, gold-standard list. But if you were to gather representative contributors from the groups listed in the prior section, I imagine they’d propose these “endpoints” for EDC:
Clinical Trial Managers and Sponsor Representatives:
- Increase in the number of phases and therapeutic areas covered
- Reduction in the number of separate technologies for “adjacent” processes, such as subject randomization and the collection of patient-reported outcomes (PRO)
- Reduction in “per study” cost (unaffected by enrollment numbers and number of sites activated)
- Reduction in incremental cost for additional studies
Note that these operational outcomes are in addition to the requirement to comply with applicable regulations (21 CFR Part 11, ICH GCP, Annex 11, etc.).
Data Managers:
- Reduction in time to collaboratively build out the protocol
- Reduction in time to design, test, and deploy eCRFs
- Reduction in time from event to data entry
- Reduction in rate of queries per form completed
- Increase in the number and relevance of data management reports available
- Increase in the number and relevance of enrollment reports available
- Increase in the number and relevance of source document verification reports available
- Reduction in time required to produce reports
Study Coordinators and Site Investigators:
- Increase in protocol compliance
- Decrease in lag between event and data entry
- Reduction in mean “session time” per form completed or casebook signed (i.e. spending less time, from log in to log out, on the same volume of data entry or approval)
- Reduction in rate of queries per form completed
- Reduction in average time accessing and interpreting queries
- Reduction in time spent with study monitors
Clinical Research Associates:
- Reduction in average time accessing and interpreting queries
- Reduction in average time required to verify source data
- Reduction in time required to learn and complete reporting tasks
- Reduction in effort required to adhere to reporting schedules
- Increased engagement and compliance with protocol
Of course, none of these outcomes depend solely on factors intrinsic to the EDC system. But that’s precisely what makes them effectiveness outcomes: they assess how much an intervention, in the context of related variables, directly impacts an “ultimate goal”. In this case, that goal is the rapid collection, cleaning, and verification of reliable data. The closer the relation between a particular metric and that ultimate goal, the more time and attention you should spend measuring that metric.
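Many of the endpoints above reduce to simple arithmetic over data the EDC system already records. Here is a minimal sketch for two of them, queries per form completed and mean session time per form, using an invented record layout and invented numbers:

```python
# Hypothetical per-form records: (queries raised, session seconds) for each
# completed form. The layout and values are invented for illustration.
forms = [
    (2, 540),
    (0, 380),
    (1, 420),
    (3, 610),
]

query_rate = sum(q for q, _ in forms) / len(forms)    # queries per form completed
mean_session = sum(s for _, s in forms) / len(forms)  # mean session seconds per form
```

Tracked study over study, even metrics this simple can reveal whether a system is trending toward or away from the “ultimate goal.”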
Criterion 4: Long Study Duration, Clinically Relevant Treatment Modalities: “In effectiveness trials, study durations should mimic a minimum length of treatment in a clinical setting to allow the assessment of health outcomes. Treatment modalities should reflect clinical relevance (e.g., no fixed-dose designs; equivalent dosages for head-to-head comparisons). Diagnosis should rely on diagnostic standards that practicing physicians use.”
The benefits of our most celebrated antiplatelet drugs weren’t established in a matter of weeks. Or months. Or even years. A single Phase III study of ticagrelor in combination with aspirin took more than half a decade. Similarly, evaluating EDC systems from this perspective might require assessing a system’s impact on every stage of the trial lifecycle, from initial database build to database lock. You’d probably want to do this across multiple builds and multiple database locks, in order to see how the system speeds (or delays) key processes across a variety of representative studies.
A comprehensive, reliable, and repeatable method of comparing EDC systems along these lines may be a pipe dream. However, more anecdotal evidence may be available to you. If you have access to “metadata” on a dozen or so studies with sufficiently similar protocols, half using one EDC system and half using another, you may have the rudiments of a retrospective study. For each study, collect:
- Time from system launch to first form published
- Time from system launch to last form published
- Time from “first patient first visit” to first form completed
- Rate of queries per field completed at 25% enrolled
- Rate of queries per field completed at 50% enrolled
- Rate of queries per field completed at 75% enrolled
- Time from “last patient last visit” to last form completed
- Time from last form completed to database lock
With these values in hand, you may be able to apply something like the Mann-Whitney U test to see whether the two EDC “cohorts” differed significantly along any of these measures.
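Such a comparison is easy to sketch in code. The function below is a minimal, hand-rolled version of the U statistic (in practice you’d reach for a vetted implementation such as scipy.stats.mannwhitneyu), applied to invented “time to database lock” values for two hypothetical EDC cohorts:

```python
def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic: count, over all (x, y) pairs, how often
    x falls below y; ties count one half."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x < y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Days from last form completed to database lock -- all values invented.
cohort_a = [38, 45, 52, 41, 60, 47]  # studies run on EDC system A
cohort_b = [55, 62, 49, 70, 58, 66]  # studies run on EDC system B

u_stat = mann_whitney_u(cohort_a, cohort_b)
# Compare u_stat against tabulated critical values (or a normal
# approximation) to judge whether the cohorts differ significantly.
```

A rank-based test suits this setting because lock times are rarely normally distributed, and a dozen studies is far too small a sample to assume otherwise.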
Experienced data managers could easily add another dozen measures to this list, just as real statisticians could propose a better methodology. I only want to suggest that a data-driven comparison of how two systems impact a study is possible. Making it a reality would constitute one more step toward a better evaluation.
Criterion 5: Assessment of adverse events: “[U]sing an extensive objective adverse events scale is often not feasible in daily clinical practice because of time constraints and practical considerations. Therefore, adverse events assessments in effectiveness trials could be limited to critical issues based on experiences from prior trials.”
A patient nicks their finger while opening a blister pack. Do you run a blood panel? Of course not. And while the example is extreme, it points to a real phenomenon: bringing so much scrutiny to minor AEs that a trial becomes almost impossible to administer, while actual patient safety remains unaffected. Of course, patient safety is always the first priority. And collection of safety data is required. But each hour spent on AE scoring and minute description is an hour no longer available to care for the patient or to collect data on efficacy and serious AEs.
Turning to software assessment, are you measuring every page’s load time for web-based applications? Or are you measuring “session times” and (better still) average time to query close? Are you giving too much weight to password strength criteria or change frequency, and not enough to backup/DR processes? The devil may be in the details, but in the attempt to root him out, don’t blind yourself to the ultimate aims your software is meant to serve.
Criterion 6: Adequate Sample Size To Assess a Minimally Important Difference From a Patient Perspective: “The sample size of an effectiveness trial should be sufficient to detect at least a minimally important difference on a health-related quality of life scale.”
Criterion 7: Intention-to-treat (ITT) analysis: “[S]tatistical analyses in efficacy trials frequently exclude patients with protocol deviations. In clinical practice, however, factors such as compliance, adverse events, drug regimens, co-morbidities, concomitant treatments, or costs all can alter efficacy. A ‘completers only’ analysis would not take these factors adequately into account.”
As with the first two, I’m bundling these criteria. They aren’t synonymous, but their intent is similar: to enlarge the pool of data. In simple terms, these criteria tell us to “randomize as many subjects as necessary to infer even small differences in outcome, and include them all in the analysis.”
Turning to EDC evaluation, consider these two sets of data:
- mean “session time per form completed” among the 20 study coordinators who entered data for a Phase Ib
- mean “time to query resolution” for all 200 sites contributing to a Phase III
There are two significant differences between these sets, one obvious and one less so. The obvious one: provided at least one contributor at each site received EDC sign-in credentials, a sample of 200 sites is simply larger than a sample of 20 coordinators. Less obviously, while the first set restricts the source of data to those who actually interacted with the system, the second draws no such line: anyone at any site who was given sign-in credentials is a source of data. Why should this be so? Because every behavior, from entering data, to signing in without entering data, to never signing in at all, provides clues to a system’s impact; in this case, to its adoption rate.
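By analogy with intention-to-treat analysis, the denominator should include everyone who could have used the system, not just those who did. A minimal sketch, with an invented record layout and invented counts:

```python
# One record per credentialed user at the participating sites.
# Layout and numbers are hypothetical, for illustration only.
credentialed_users = [
    {"sign_ins": 12, "forms_entered": 40},
    {"sign_ins": 3,  "forms_entered": 0},   # signed in, never entered data
    {"sign_ins": 0,  "forms_entered": 0},   # credentialed, never signed in
    {"sign_ins": 8,  "forms_entered": 25},
]

total = len(credentialed_users)
entered_data = sum(1 for rec in credentialed_users if rec["forms_entered"] > 0)
ever_signed_in = sum(1 for rec in credentialed_users if rec["sign_ins"] > 0)

adoption_rate = entered_data / total    # denominator includes non-participants
sign_in_rate = ever_signed_in / total
```

A “completers only” version of this calculation, dropping the users who never signed in, would flatter the system and hide exactly the adoption problem we want to detect.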
As an industry, we are years–maybe decades–away from valid and reliable measures of our technology’s impact on the effectiveness of trial conduct. The endeavor is probably in its infancy. But what better time to ask the big questions? Let us know yours.