Whitepaper
Pioneering the Science of AI Evaluation
Introduction
"Pajama time" is now commonplace across all clinical specialties
time spent after clinical hours to catch up on clinical documentation
of physicians report burnout
Ambient documentation systems (or “AI scribes”) reduce documentation burden by drafting clinical notes based on raw audio of the clinical encounter. Clinicians edit and verify these notes before signing them in the Electronic Health Record (EHR). However, evaluation of these systems, and the quality of the documentation they draft, is complicated by the free-form nature of generated text and the various uses of clinical documentation.
This whitepaper describes our process for evaluating AI-generated documentation at Abridge, quantitative performance measures for many of our key systems, and our process for conducting holistic clinician-in-the-loop studies prior to green-lighting updated AI components into production. These evaluations have informed the development and deployment of our AI systems; systems that already process hundreds of thousands of clinical encounters every week, across dozens of health systems. By sharing our approach, we hope to foster greater transparency and facilitate dialogue among stakeholders in this fast-moving space.
Executive summary
Commitment and expertise
As a clinician-led company with roots in academia, we are committed to building trustworthy AI systems, and have the expertise to do so.
Quality and monitoring process
For assessing the quality of complex free-text documents, human judgment remains the gold standard, and our evaluation process reflects the need to keep humans in the loop.
Performance
Our systems already outperform off-the-shelf clinical and open models. In this report, we provide some deep dives on particular aspects of performance we have found critical to having an impact with our health system partners.
Looking ahead
Evaluation of generative AI remains a rapidly developing area, with novel ideas arising every week.
Quality and monitoring
At the highest level, we can think of our clinical documentation engine as comprising two primary components: a world-class, medically tailored speech recognition system and a note-generation system that transforms raw transcripts into drafted clinical notes.
The first component is our Automatic Speech Recognition (ASR) system, which takes raw clinical audio and produces a transcript of the encounter.
The second component is our note-generation system, which uses the transcript to produce a draft of the clinical documentation.
Each component uses a variety of underlying models. These are evaluated both individually and in an end-to-end fashion. We take a deliberate approach to evaluating and releasing upgrades to our core components:
Model development guided by automated metrics and clinician spot-checks
Automated metrics as a screening tool for model development
We rely on many automated metrics to guide early model development, using a large internal benchmark dataset containing clinical audio, gold standard transcripts, human-written reference notes, and rich metadata on patient characteristics.
For our automatic speech recognition system, we use canonical metrics, including word error rate and medically tailored metrics (e.g., recall of medical terms), alongside more targeted analyses (e.g., capture of newly minted medication names). For our note-generation system, we compute automated metrics of quality that compare AI-generated and human-written summaries, including variants of precision and recall (e.g., of medical concepts), which serve as proxies for factuality and completeness. Throughout, we perform stratified analyses to assess performance across diverse patient subpopulations.
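To make this concrete, the short Python sketch below computes concept-level precision and recall between an AI-generated note and a human-written reference, a simplified version of the proxy metrics described above. The extract_concepts helper is a hypothetical stand-in for a clinical concept extraction model, not a description of our production pipeline.

    # Illustrative sketch (not our production pipeline): concept-level precision
    # and recall between an AI-generated note and a human-written reference note.
    # `extract_concepts` is a hypothetical helper that maps free text to a set of
    # normalized medical concept identifiers.
    def concept_precision_recall(generated_note: str, reference_note: str, extract_concepts):
        gen = set(extract_concepts(generated_note))
        ref = set(extract_concepts(reference_note))
        overlap = gen & ref
        # Precision acts as a proxy for factuality: how many generated concepts are
        # supported by the reference. Recall acts as a proxy for completeness: how
        # many reference concepts the generated note captured.
        precision = len(overlap) / len(gen) if gen else 1.0
        recall = len(overlap) / len(ref) if ref else 1.0
        return precision, recall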
Clinician-driven spot-checks
Throughout model development, Abridge clinicians spot-check notes generated on a curated set of encounters covering a range of clinical scenarios, providing both a coarse signal on difficult-to-measure aspects of quality and helping our modeling team develop intuition. This feedback is particularly important when a new model upgrade is meant to address a more subjective concern, such as incorporating certain stylistic preferences into note generation.
Validation via blinded head-to-head trials adjudicated by licensed clinicians
While automated metrics and clinician spot-checks are helpful to guide development of upgrades to our overall system, these metrics capture only a portion of the many salient dimensions of quality. Certifying a new system for deployment requires more than informal spot-checks.
Before deploying models, we perform blinded, head-to-head evaluations with licensed clinicians as evaluators.
To this end, we developed a software platform that presents notes side-by-side (one from the current system, the other from a candidate system), with clinical reviewers blinded to the system that authored each.
To allow for early stopping when the results are conclusive, we use anytime-valid sequential hypothesis testing¹, providing guarantees on the false positive rate that hold regardless of how long we run the trial. This approach provides strong statistical evidence for whether a proposed system is better (or worse) than its predecessor.
1. Ramdas A. Foundations of Large-Scale Sequential Experimentation. In: KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. doi:10.1145/3292500.3332282
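As a simplified illustration of the statistical idea (not our exact production procedure), the Python sketch below monitors a stream of blinded pairwise preferences with a likelihood-ratio test martingale, stopping as soon as the accumulated evidence crosses 1/alpha; by Ville's inequality, this keeps the false positive rate below alpha no matter when the trial stops.

    # Minimal sketch of an anytime-valid sequential test for blinded A/B note
    # comparisons (illustrative only; our production analysis differs in details).
    # Each observation is 1 if the reviewer preferred the candidate system's note,
    # 0 if they preferred the current system's (ties excluded for simplicity).
    def sequential_preference_test(preferences, alpha=0.05, p_alt=0.65):
        """Reject H0: P(prefer candidate) <= 0.5 once the e-process exceeds 1/alpha.

        The e-process is a likelihood-ratio martingale against a fixed alternative
        p_alt > 0.5 (0.65 here is an arbitrary illustrative choice); by Ville's
        inequality the type-I error is at most alpha regardless of when we stop.
        """
        e_value = 1.0
        for n, x in enumerate(preferences, start=1):
            e_value *= (p_alt / 0.5) if x == 1 else ((1.0 - p_alt) / 0.5)
            if e_value >= 1.0 / alpha:
                return True, n  # conclusive: candidate preferred
        return False, len(preferences)  # evidence not (yet) conclusive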
Staged release process
Even once our models survive a rigorous blinded test of clinician judgment, we take a careful staged-release approach to productionizing our models, verifying performance in vivo on selected cohorts prior to approving a broader release.
The first stage is an “alpha” release with limited rollouts to specially trained early adopters. These clinicians are in frequent contact with Abridge staff and are trained to be especially vigilant about editing and comprehensive in their feedback. Only once we are satisfied that these users are better served by the improved model do we consider rolling out to wider audiences.
At every stage of release, including full deployment, we continually collect both active feedback in the form of comments and star ratings, and passive feedback in the form of edits required to finalize the note, similar to the ongoing metrics we track during deployment as discussed below.
Ongoing post-deployment monitoring
First, we capture the edits made to the AI-generated note before it is finalized and shared in the medical record.
Clinician-generated edits are inherently scalable as a feedback mechanism, as editing is already a natural workflow. These edits provide us with a rich signal for modeling purposes and enable the computation of high-level metrics to gauge how much editing is required. In addition, users have the option to provide quantitative ratings of note quality within the note-editing interface.
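As a rough illustration of the kind of edit-based signal this enables (a simplification of the metrics we actually track), the Python sketch below estimates how much of an AI-drafted note survives unchanged into the note the clinician ultimately signs.

    import difflib

    # Illustrative sketch: the fraction of an AI-drafted note that survives into
    # the finalized note. A lower retention fraction indicates heavier editing.
    def draft_retention(draft_note: str, final_note: str) -> float:
        draft_words = draft_note.split()
        final_words = final_note.split()
        matcher = difflib.SequenceMatcher(a=draft_words, b=final_words, autojunk=False)
        matched = sum(block.size for block in matcher.get_matching_blocks())
        return matched / len(draft_words) if draft_words else 1.0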
These scalable quantitative metrics not only allow us to track ongoing real-world performance, but also provide us with sample sizes required to assess performance across diverse patient and provider subpopulations. Stratified evaluations require high-quality metadata (e.g., on patient demographics) that are not reliably captured in conversations, but can in many cases be pulled (with varying degrees of accuracy²) via integration with the EHR.
We also develop model-driven proxies for metadata that may not be captured in the medical record. For example, Abridge documents visits in a variety of languages, including visits that span multiple languages. Performing relevant language-stratified analyses requires that we can accurately identify which languages are spoken in each encounter.
2. Johnson JA, Moore B, Hwang EK, Hickner A, Yeo H. The accuracy of race & ethnicity data in US-based healthcare databases: A systematic review. The American Journal of Surgery. 463-70. doi:10.1016/j.amjsurg.2023.05.011.
Qualitative feedback for uncovering blind spots
Clinicians frequently provide free-text feedback, which plays an equally important role. Given the diversity of clinical encounters and open-endedness of note-drafting, blind spots are inevitable. Ensuring that concerns expressed in qualitative feedback are heard by model developers is essential for finding these blind spots and developing new tests to catch them in the future.
To illustrate, consider the following real-world examples of issues that have been identified through open-ended feedback:
Does the documentation correctly spell medications that the models have not encountered previously?
Given an infant patient, a parent, and a clinician, does the documentation appropriately distinguish between the parent and the (non-speaking) patient?
If a patient misstates their own diagnosis, and a clinician corrects them, is the conclusion appropriately reflected in the final note?
Customer feedback helps to identify these types of problems, and tracking of problem categories provides another mechanism for evaluating the efficacy of deployed patches.
Qualitative feedback is also useful for eliciting stylistic preferences—one clinician may prefer that certain elements appear in the History of Present Illness (HPI) section, while another may prefer them in the Assessment & Plan section.
Audits via dedicated tools
Ongoing audits of note quality are essential for building and maintaining trust. However, audits can be inefficient—to evaluate the appropriateness of a single note sentence, an evaluator must parse a long transcript of a conversation that they have never seen before. To aid in efficient review, we outfit all of our note viewers with a tool for pinpointing the relevant transcript excerpt.
This same feature is available to our clinician users, helping them to verify the output of our system during the editing process.
Note viewer for evaluating an individual clinical note, with the associated transcript. Highlighting a section of the generated note (left) surfaces relevant sections of the transcript (right), streamlining the process of validating factuality.
Evaluation of AI-driven ambient documentation is an evolving challenge, requiring more than simply determining the right set of metrics, or doing a single evaluation at a given point in time.
Equally important is the process by which human feedback is incorporated into model development, auditing, and the construction of novel evaluation criteria and test cases.
Performance
In this section we provide selected deep dives on particular aspects of performance we have found critical to having an impact with our health system partners. While these examples do not cover every dimension of performance that we monitor, each example illustrates how targeted evaluation helps drive clinically meaningful improvements both over current off-the-shelf models, and in our own system over time.
Measuring the clinical fidelity of automatic speech recognition
One of the core components of our product is our Automatic Speech Recognition (ASR) system, which converts patient-clinician conversations into a transcript; this transcript, in turn, constitutes the key input to our note-generation system.
Accurate transcription is critical to ensuring the quality of the final output, as well as an accurate record of the conversation that the clinician can reference later.
Our ASR models take the audio from a conversation as input, and produce several secondary outputs in addition to the transcript, including alignment of text to audio timestamps, and assignment of phrases to the relevant speaker (a process known as “diarization”). Here, we focus on our evaluation of the main output, the transcript itself.
Curating diverse data for evaluation
In order to appropriately evaluate our transcription model, we need diverse data on clinical conversations, including audio and gold standard transcripts. To that end, we use several benchmark datasets, including:
A large internal benchmark: over 10,000 hours of clinical conversations with associated audio, gold standard reference transcripts, and human annotations.
Curated challenge sets, designed by our team to pose difficulties and probe weaknesses of clinical ASR models, including clinical conversations laden with new medication names.
Public corpora, such as the Librispeech benchmark dataset.
To evaluate the quality of a transcript produced by an ASR system, we typically compare it to a gold standard reference transcript, written by a human listener.
Word error rate (WER) is calculated as the minimum number of word-level edits (substitutions, deletions, and insertions) to transform the reference transcript into the generated transcript, divided by the length of the reference transcript.
This calculation is illustrated in Example 1, where the transcript missed three words, and misunderstood one word (taken vs taking). Here, the minimum number of edits to correct the transcript is four: Adding back the three missing words, and correcting the misunderstood word. WER is calculated by dividing the number of edits by the length of the reference transcript, which contains 24 words, for a WER of 4 / 24 = 0.167.
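For reference, the WER calculation described above can be implemented as a standard word-level Levenshtein distance; the following minimal Python sketch mirrors that arithmetic (it omits the text normalization applied in practice).

    # Minimal word error rate (WER) sketch: word-level Levenshtein distance between
    # a reference transcript and a generated transcript, divided by reference length.
    def word_error_rate(reference: str, generated: str) -> float:
        ref, hyp = reference.split(), generated.split()
        # dp[i][j] = minimum edits to transform ref[:i] into hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i  # deletions
        for j in range(len(hyp) + 1):
            dp[0][j] = j  # insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
        return dp[len(ref)][len(hyp)] / len(ref) if ref else 0.0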
However, as shown in the example above, not all errors are created equal. The patient mentions feeling pain in their back and the use of Advil. In general, correctly transcribing these medical terms is a top priority for an ASR system specialized for clinical text.
Medical term recall rate (MTR) is a complementary metric that tracks these more clinically meaningful errors: the fraction of medical terms in the reference transcript that are captured in the generated transcript. We continue with the example above to illustrate how this metric is calculated.
In this case, the words “feeling,” “pain,” “back,” and “Advil” are all medical terms, and since the generated transcript captured three out of four terms, the MTR would be 75%.
Example 1 (continued): MTR calculation with a simple example of a reference and generated transcript.
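A corresponding Python sketch for medical term recall is shown below. The medical_terms helper is a hypothetical stand-in for the terminology matching used in practice, and the sketch handles only single-word terms for simplicity.

    # Illustrative medical term recall (MTR) sketch: the fraction of medical terms
    # in the reference transcript that also appear in the generated transcript.
    # `medical_terms` is a hypothetical helper returning the medical terms found in
    # a piece of text (production systems use curated clinical vocabularies).
    def medical_term_recall(reference: str, generated: str, medical_terms) -> float:
        ref_terms = set(medical_terms(reference))
        gen_words = set(generated.lower().split())
        captured = {t for t in ref_terms if t.lower() in gen_words}
        return len(captured) / len(ref_terms) if ref_terms else 1.0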
Since medical conversations have substantial overlap with generic conversations, we benchmark our ASR models on public datasets to ensure that we have comparable performance in the general conversation setting. On generic speech corpora (e.g., the public Librispeech benchmark dataset), our ASR system performs comparably to state-of-the-art ASR models like Whisper v3 from OpenAI.
The true value of our speech recognition system is revealed on medical conversation transcription, where we significantly outperform off-the-shelf ASR models that are specifically designed for medical conversations (e.g., Google Medical Conversations).
For instance, on one of our internal medical conversation benchmarks, we compare our system against several public ASR models. Relative to Google Medical Conversation ASR, our system achieves a slightly higher rate of medical term recall, and it exhibits high medical term recall overall. Relative to OpenAI's Whisper v3 model, our system achieves substantially higher rates of medical term recall. On a curated challenge dataset laden with new medication names, our system demonstrates an 81% relative reduction in error on new medications compared to Google Medical Conversation ASR.
Assessing and improving performance across multilingual settings
In this section, we consider real-world performance of our end-to-end system as judged by our users, and illustrate how we assess performance across a diverse set of users who speak multiple languages.
Internal multilingual evaluation of our ASR and note-generation system
We start by discussing the evaluations we use to benchmark our multilingual performance internally, which have formed the basis for the model improvements driving the real-world multilingual performance described at the end of this deep dive.
First, our approved set of languages is based on the observed performance of our ASR system on audio and associated gold standard transcripts from a variety of languages. One of the datasets we use in evaluation is FLEURS, a parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark. Here we have access to gold standard transcripts, which allows us to establish ASR performance by measuring WER.
Measured on this dataset, our WER for three of the non-English languages that we support is even lower than our WER for English, and our performance is competitive with the current state-of-the-art, on average, across the remaining languages we support.
For our note-generation system, we compare the performance of our system using the original English transcript against performance using a translated version of the transcript (e.g., translated into Spanish). Here, we observe comparable performance, even when the transcript is not in English.
The recall rate of relevant medical terms in the final note, using non-English transcripts, is >80% of the recall rate on the original English transcripts.
Multilingual performance in the wild
The primary metric for the end-to-end evaluation of our system is quantitative ratings of note quality (or “star ratings”) that users provide within the note editing interface, giving a rating from 1 to 5. This feedback lends itself to straightforward metrics (e.g., tracking the average star rating over time), and tracking performance across categories of encounters.
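As a simplified illustration (with hypothetical field names, not our actual schema), language-stratified tracking of these ratings can be sketched in Python as follows.

    from collections import defaultdict

    # Illustrative sketch: average star ratings stratified by encounter language.
    # Each rating record is assumed to carry a 1-5 score and a language label
    # (field names here are hypothetical).
    def average_ratings_by_language(ratings):
        totals, counts = defaultdict(float), defaultdict(int)
        for r in ratings:
            totals[r["language"]] += r["stars"]
            counts[r["language"]] += 1
        return {lang: totals[lang] / counts[lang] for lang in totals}

    # Example usage with toy data:
    # average_ratings_by_language([
    #     {"language": "en", "stars": 5}, {"language": "es", "stars": 4},
    # ])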
For reference, over the last three months (May-July), we collected tens of thousands of star ratings, with an average rating of 4.3 out of 5 for English-language encounters. Meanwhile, our performance in our most frequent non-English languages is nearly as strong over the same time period.
Moreover, ratings for non-English encounters have continued to improve over time. For example, our ratings for Spanish-language encounters have risen from an average of 3.7 during the month of February to a current average of 4.1 over the last month.
Conclusion
Evaluation is not just a set of guardrails, but a compass. Continuous evaluation of our product is designed not just to catch or prevent issues, but to drive improvement at scale. As feedback from the field continues to suggest avenues for improving our product, our evaluations expand to cover more aspects of quality beyond the basic requirements of correctness. Rapid product improvements (in response to feedback) create a virtuous cycle: when users see that their feedback is taken seriously, they are more likely to provide feedback in the future.
In the early days of building our product, users would give feedback about missing problems in the problem-based Assessment & Plan.
Once we shipped improvements to our system that more reliably captured the relevant problems, feedback tended to focus more on how we presented that information, with clinicians in different specialties wanting different ordering of problems.
Rigorous evaluation is not a one-time exercise on a static dataset, but a continuous and ever-evolving process that makes large-scale improvement possible.
Dozens of health systems trust Abridge to produce high-quality documentation, and we take that trust seriously. Setting standards for the measurement of note quality is not a unilateral effort performed by a single research team, but an exercise done in partnership with our users and the broader community. We intend to continue participating in the broader conversation on how to evaluate systems like ours, and hope to work toward a common industry framework for apples-to-apples comparisons.