Improving LLM Reliability in Healthcare Data Extraction
How we ensure traceability and transparency in EHR information extraction through context generation
Introduction
Large language models (LLMs) are increasingly used to extract structured data from unstructured text. Their extensive knowledge and zero-shot capabilities make them particularly attractive for real-world use cases involving broad application scopes and large volumes of information. However, as generative processes, LLMs suffer from two main limitations:
Hallucinations: the model generates incorrect information or invents data
Lack of interpretability: the model generates information that is difficult to understand or relate to the source
For Lifen Research, we use LLMs to extract structured data from electronic health records (EHRs). Reliability and traceability being critical in healthcare, we aim to guard against hallucinations as much as possible. It is also crucial for us to be able to provide a justification for every structured data element we extract and use.
A partial but straightforward solution to these two issues is to ask the model to provide literal quotations of the source material to support its predictions. In the following, we use context, quote or citation as synonyms, and they refer to a verbatim extract from the source material.
While eliminating hallucinations is difficult, asking for a context along with the actual answer can reduce the probability that the model hallucinates entirely. Moreover, we can check whether the proposed context is effectively in the original document and use this as a useful proxy for hallucination detection.
When performing manual review, the context provides a good basis for interpretability to the human reviewer. It helps justify correct predictions and accelerates error analysis when the model is wrong.
For these reasons, we propose a structured data extraction approach that incorporates source attribution via citations. Unlike conventional methods that rely on pre-defined text chunks, we explore a generative approach in which the model directly produces a citation as part of its response.
Illustration
For instance, consider this short text mimicking a consultation report:
📄
Consultation report, August 12, 2022
I am seeing Mrs. DUPUIS, born on June 12, 1965, for her cancer follow-up. As a reminder, Mrs. Dupuis is a teacher with a fairly sedentary lifestyle.
She suffers from mild hypertension and type 2 diabetes.
She has also been a smoker since the age of 20.
If we ask GPT3.5 a general question about the patient, and request a context for its answer:
❓
Is Mrs. DUPUIS a smoker?
Please provide a literal quote from the document to support your answer.
The model answers with:
🤖
Yes, Mrs. Dupuis is a smoker. The report states, "She has also been a smoker since the age of 20."
Which is correct. But now asking a question about something not explicitly mentioned in the report:
❓
Has Mrs. DUPUIS been vaccinated against COVID-19?
Please provide a literal quote from the document to support your answer.
The model’s response becomes:
🤖
Yes, Mrs. Dupuis has been vaccinated against COVID-19. The consultation report states, "Patient has received both doses of the COVID-19 vaccine as per the recommended schedule."
This statement cannot be inferred from the provided text, and the quote proposed by the model is not an excerpt from the consultation report. This is a valuable observation, because if we can verify that the proposed citation does not appear in the source material, we can flag this answer, and potentially discard it later.
A slight change in the wording of the prompt can get the model to successfully identify the lack of information:
❓
Has Mrs. DUPUIS been vaccinated against COVID-19?
First start by providing a word-for-word quote from the document to support your answer.
🤖
I'm sorry, but there is no specific mention of Mrs. Dupuis' COVID-19 vaccination status in the provided consultation report.
This toy example highlights several key challenges. First, even large autoregressive models may struggle with verbatim quotation. Second, as is commonly observed, slight changes in prompt phrasing can significantly alter model behavior. Finally, queries about concepts absent in the source document, also called adversarial examples, represent a particular difficulty.
In this article, we investigate the following questions:
Can LLMs reliably provide citations by quoting word-for-word from a source material? How do they fare on adversarial examples?
How does this capability depend on model size? Can it be enhanced through supervised training?
To answer them, we build a dataset for structured information extraction with context attribution on EHRs. We evaluate three commercial models from OpenAI and several fine-tuned open-source models of different sizes.
⚠️ For conciseness, our analysis concentrates exclusively on assessing context citation capability, omitting the evaluation of the prediction itself. This would require introducing significant background about the structuring task and may be explored in a future article.
Previous work and existing methods
To understand the trade-offs of our proposed approach, we provide a brief review of existing methods for context attribution with LLMs.
Context attribution aims to trace generated outputs back to their contextual sources, with the goal of enhancing interpretability, reliability and groundedness. Most previous work relies on partitioning the input context into a fixed set of chunks or sources. A chunk can be a document, a paragraph, or a sentence, depending on the granularity of the application.
Some approaches such as ContextCite or SelfCite score sources by measuring the impact of ablations on the probability of generating the answer. ContextCite trains a sparse linear surrogate model, while SelfCite defines a reward associated with the ablation and leverages Reinforcement Learning. GopherCite also uses reinforcement learning, but takes advantage of human preferences to train a reward model to score the best samples for attribution. Retrieval Augmented Generation (RAG) is the process of selecting a handful of relevant sources (using embedding similarity between the query and the chunk) before running generation. With appropriate granularity, it can be used to perform context attribution by using the best chunk as the citation.
These methods have a key property: they guarantee that the proposed citation is an exact extract from the source document. However, they often require notable additional compute and multiple generations.
Another limitation is that they require the granularity of the citations (e.g. word, sentence, paragraph) to be determined in advance, and the input to be pre-sliced accordingly. In practice, we have observed that this is impractical with real-life data, and often inappropriate. First, with EHRs originating from a variety of sources, it is not always possible to easily extract clean, grammatical sentences from documents: sentence partitioning can be significantly disturbed by tables, imagery, text without punctuation, chaotic layout, and so on. Second, the citation required to justify a value can vary greatly in size, ranging from a single word to several paragraphs. Therefore, these approaches have not proved suitable for our needs.
Instead, we propose end-to-end generation of citations with pure autoregressive decoding. This means letting the model generate the citation token after token in the same inference as the regular answer. This approach, however, comes with its own trade-offs. First, it introduces the risk of generating a context that is not a true quotation of the source. Second, it slows down the generative process. In return, this method frees itself from any arbitrary choice of granularity and gives the model flexibility to choose the most appropriate citation span.
See also this repository for a comprehensive overview of related work.
Method
In this section, we introduce how we implement context citation for structured data extraction. We describe our two main lines of experimentation, one on commercial off-the-shelf models, and the other on fine-tuning open-source models.
Structured output generation with JSON
We frame the structured data extraction as a JSON generation problem, and use a JSON schema to instruct the model what output is desired. In more detail:
We start with a source document (a PDF) that is first converted to plain text using our in-house pipeline
Then, given a list of variables that we want to extract, we build the corresponding JSON schema describing precisely the structure and constraints
We prompt the model with both the full-text document and the schema, and ask it to generate a valid JSON that satisfies the schema
Reusing the short medical extract above, an example of a (JSON schema, JSON output) pair for just one variable might be:
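A minimal sketch of such a pair, written here as Python dictionaries for readability (the field names and constraints are illustrative):

```python
# Hypothetical (schema, output) pair for a single "smoking_status" variable.
# Field names and types are illustrative; dictionaries stand in for the JSON payloads.
smoking_schema = {
    "type": "object",
    "properties": {
        "smoking_status": {
            "type": ["boolean", "null"],
            "description": "Is the patient an active smoker?",
        }
    },
    "required": ["smoking_status"],
}

# Expected output for the consultation report above (serialized as {"smoking_status": true}):
smoking_output = {"smoking_status": True}
```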
The null option is there to enable the model to handle adversarial situations.
Context attribution with JSON Schema
To jointly extract a value and the quote that supports it, we enrich the JSON schema with a context field. This allows the model to return both the prediction and its associated citation in a unified format.
The expected response becomes:
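For the hypothetical smoking-status variable above, the enriched schema fragment and the corresponding expected response might look as follows (still a sketch with illustrative field names):

```python
# Each variable now carries a verbatim quote alongside its value.
enriched_schema = {
    "type": "object",
    "properties": {
        "smoking_status": {
            "type": "object",
            "properties": {
                "context": {"type": ["string", "null"],
                            "description": "Word-for-word quote from the document"},
                "value": {"type": ["boolean", "null"]},
            },
            "required": ["context", "value"],
        }
    },
}

expected_response = {
    "smoking_status": {
        "context": "She has also been a smoker since the age of 20.",
        "value": True,  # serialized as true
    }
}
```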
When the requested information is absent from the original document, the expected answer is then {"context": null, "value": null}. When the model produces such output, we consider that it abstains from answering.
Assuming the model has produced a valid JSON, it can then be easily parsed and the context can be retrieved.
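As a minimal sketch, assuming the nested format above, parsing the response and retrieving the citation could look like this:

```python
import json

def parse_response(raw_response: str, variable: str):
    """Return (value, context) for one variable, or None if the JSON is malformed."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return None  # no citation to evaluate
    field = parsed.get(variable) or {}
    return field.get("value"), field.get("context")

# parse_response('{"smoking_status": {"context": "She has also been a smoker since the age of 20.", "value": true}}', "smoking_status")
# -> (True, 'She has also been a smoker since the age of 20.')
```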
Context matching
Once a citation has been generated, we need to check if it actually belongs to the source document. This is crucial to differentiate between actual quotes, paraphrases, hallucinations, or whatever else the model has chosen to insert in the context field.
A simple string matching can be performed to detect exact matches. However, we observe empirically that LLMs sometimes tend to rephrase the content of the source when they try to quote it, because:
They correct spelling mistakes or OCR inaccuracies
They remove or add typographical elements (such as commas)
They skip irrelevant information inside relevant text (such as parentheses, elements in a list, document template, page numbers, etc.)
To allow for these approximations and still match the source document, we employ Fuzzy Matching (FM). The main component of the fuzzy matching approach consists of approximate string matching, with a tolerance on the Levenshtein distance between the generated citation and the source. We use hyper-parameters to regulate when and how this tolerance is applied. We also mix in other strategies (such as looking for either half of a very long context) to address potential irregularities in the data.
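As an illustration, a stripped-down version of this check could combine an exact search with a fuzzy fallback. Here we use the rapidfuzz library for the approximate matching; the threshold is illustrative, and the additional strategies mentioned above (such as matching either half of a very long context) are not shown:

```python
from rapidfuzz import fuzz

def match_in_document(citation: str, document: str, threshold: float = 90.0) -> str:
    """Classify a generated citation as 'exact', 'fuzzy' or 'invalid' with respect to the source."""
    if citation in document:
        return "exact"
    # partial_ratio finds the best-matching substring of the document and scores
    # its similarity to the citation on a 0-100 scale.
    if fuzz.partial_ratio(citation, document) >= threshold:
        return "fuzzy"
    return "invalid"

# "She suffer from mild hypertension" is not an exact substring of the report,
# but scores high enough against "She suffers from mild hypertension" to be matched.
```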
Dataset
At Lifen, we benefit from the expertise of a dedicated team of clinical research professionals who support the development and refinement of data for this initiative. This collaboration allows us to systematically construct a supervised dataset specifically designed for medical information extraction with context attribution.
Our database contains de-identified clinical notes of various types (consultation reports, operation notes, discharge summaries, etc.) sourced from multiple healthcare institutions. All patient data has been collected and processed in compliance with all applicable European and French laws, regulations and guidelines for the processing of health data for medical research purposes (including, but not limited to, informing data subjects and respecting their right to object). Prior to processing, all notes are pseudonymized using our dedicated in-house algorithm. For commercial model applications, we use the Azure OpenAI service to ensure stringent data privacy compliance. We have explicitly disabled monitoring, logging, and data retention features, and exclusively use models hosted in France to adhere to regional data protection requirements.
Clinical notes vary widely in size and shape, from 50 words for the shortest to 20k+ for the longest. We measure an average of ~830 words per note. When tokenizing, we observe an average number of tokens per word of 2.3 to 2.8 (depending on the tokenizer). This unusually high value can be explained by three factors: (A) most tokenizers are optimized for English, while our data is in French; (B) EHRs are full of medical terminology and abbreviations; and (C) real-world documents exhibit high variance due to noise such as typos and OCR artifacts.
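For reference, the tokens-per-word figure can be reproduced with any Hugging Face tokenizer along these lines (the model name and the whitespace-based word count are illustrative choices):

```python
from transformers import AutoTokenizer

def tokens_per_word(texts: list[str], model_name: str = "mistralai/Mistral-7B-Instruct-v0.3") -> float:
    """Average number of tokens per whitespace-separated word over a corpus."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    n_tokens = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words
```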
Commercial models
We conduct experiments with commercial models to measure to what extent they can perform context attribution natively, in other words, whether they are able to perform word-for-word quotation from their pre-training and instruction tuning alone. We test three models from OpenAI: GPT3.5, GPT4 and GPT4o.
Instructions & few-shot prompting
To give the model a sense of what the purpose of the context field is, we use instruction prompting and optionally few-shot prompting.
We give textual instructions to the model to specify what a context is and how it should be extracted (e.g. “you must quote the document word-for-word”). Nonetheless, the notion of “supporting context” is vague, and can range from a single word to an entire paragraph. A context that is too short is insufficient for interpretability and offers few guarantees, while a context that is too long may be more confusing, difficult to interpret, and more costly to generate.
Few-shot prompting is an easy way to convey the nature of the context we expect through examples. We leverage it by showing the model 8 examples of (context, value) pairs for different variables. We found that 8 examples offered a good trade-off between illustrating context diversity and maintaining prompt brevity. In the following, we study the impact of few-shot prompting by running experiments with and without it.
To summarize, a full sample looks like:
[system instructions]
[full-text document]
[JSON schema]
([few-shot])
[JSON output]
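As a rough sketch, assembling such a sample can be as simple as concatenating these pieces in order (delimiters and helper names are illustrative; at inference time, the JSON output is what the model generates, while during fine-tuning it is appended as the target):

```python
import json

def build_prompt(system_instructions: str, document_text: str, json_schema: dict,
                 few_shot_block: str | None = None) -> str:
    """Concatenate the prompt pieces in the order shown above."""
    parts = [system_instructions, document_text, json.dumps(json_schema, ensure_ascii=False)]
    if few_shot_block:  # e.g. a pre-rendered string of 8 (context, value) examples
        parts.append(few_shot_block)
    return "\n\n".join(parts)
```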
Fine-tuned models
Leveraging the dataset described above, we also fine-tune open-source models and evaluate their performance in delivering accurate and valid citations.
Implementation details
We fine-tune all models with Low-Rank Adaptation (LoRA) and with a context size of 8192 tokens, which allows the document, the schema and the output JSON to fit for most samples. We do not add instructions or few-shot examples when fine-tuning. With these settings, we can train models of up to 24 billion parameters on a single H100 GPU.
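As a rough sketch of this setup using the peft library (the rank, alpha and target modules below are illustrative defaults, not our exact training configuration):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16
)
lora_config = LoraConfig(
    r=16,                      # low-rank dimension
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are updated
# Training then proceeds with a standard supervised trainer over sequences of up to 8192 tokens.
```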
Note that the JSON schema approach allows us to group several (variable, answer) pairs in the same prompt, as long as they refer to the same document and there is enough context length to include them all. This significantly reduces training time when enabled. The effect at inference is less dramatic because of the sequential nature of the generative process, and it is also more delicate since we don’t know in advance the lengths of the contexts that will be generated. Therefore, we only use this method during training. We use a regular 80/20 train-test split and train for a single epoch on the whole train set.
Models
We fine-tune open-source models across different families and different sizes, with the objective of assessing how the context attribution capability evolves with the number of parameters. We choose generalist models, accessible from HuggingFace, recognized for their performance, and covering a wide range of sizes.
3 models from the HuggingFace SmolLM2 collection: SmolLM2-135M-Instruct, SmolLM2-360M-Instruct and SmolLM2-1.7B-Instruct. The SmolLM initiative has stood out for its commitment to training “small and efficient” models.
5 models from the Qwen2.5 collection: Qwen2.5-0.5B-Instruct, Qwen2.5-1.5B-Instruct, Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct. The Qwen family covers an exceptional range of sizes, enabling relevant internal comparisons.
4 models from Mistral: Mistral-7B-Instruct-v0.3, Ministral-8B-Instruct-2410, Mistral-Nemo-Instruct-2407 (12B params) and Mistral-Small-24B-Instruct-2501. Mistral models have proven to be particularly strong performers in their respective categories.
Results
Metrics
There are two dimensions of interest: first, whether the context generated by the model is an actual citation, and then whether this citation is relevant, in the sense that it supports and explains the prediction. The former is easy to measure systematically, while the latter presents a more intricate challenge.
Metrics are computed by comparing the citation generated by the model to the context originally selected in the document by human expert annotators. In the following, we refer to these human labels as the ground truth.
Measuring Context Citation Ability
We say that a citation is valid if it can be found as is in the original document. Note that a valid citation is not necessarily a relevant one. In our example, "She suffers from mild hypertension" is always a valid citation, but it is not relevant for a query about the smoking status.
A valid citation after fuzzy matching (or after FM) is a citation that can be found after an approximate string matching. "She suffer from mild hypertension" would not be evaluated as valid, but as valid after FM.
When fuzzy matching proves unsuccessful, we consider the generated citation to be invalid. For instance, GPT3.5’s proposed citation "Patient has received both doses of the COVID-19 vaccine as per the recommended schedule" constitutes an invalid citation. Other common occurrences of invalid citations include model comments such as "The document does not mention this information".
Another failure mode occurs when the model fails to produce valid JSON. This does not apply to OpenAI models when using the JSON mode, but may affect small fine-tuned models, which can enter an infinite loop or produce malformed JSONs. In such cases, there is no citation to evaluate. We call these missing citations.
Valid JSON generation can be enforced with constrained generation, as implemented by libraries such as Outlines or Guidance. However, we have found empirically that this comes with a significant overhead in terms of inference speed. Hence, we set this option aside for the purpose of this work.
The Citation Rate is the frequency with which the model generates a citation. This accounts for all occurrences where the model generates a context, thus excluding missing cases and instances where it abstains. The Valid Citation Rate is the frequency with which it generates a valid citation.
For adversarial samples (queries that cannot be answered based on the document), there should be no context to cite. In such cases, the guideline for the model is to output null, and proposing a context, whether it is an actual quotation from the source or not, is wrong. In that sense, the citation rate for adversarial samples is analogous to a false positive rate. Conversely, for non-adversarial samples, not providing a context is an error that we categorize as a false negative.
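A small helper makes this vocabulary explicit (the naming is ours; whether a provided citation is actually valid is checked separately, as described above):

```python
def confusion_label(context_expected: bool, model_cited: bool) -> str:
    """Map one sample onto the confusion-matrix vocabulary used in the rest of the article."""
    if context_expected and model_cited:
        return "true positive"   # citation expected and provided (validity checked separately)
    if context_expected and not model_cited:
        return "false negative"  # the model abstained despite available evidence
    if not context_expected and model_cited:
        return "false positive"  # superfluous citation on an adversarial sample
    return "true negative"       # correct abstention: {"context": null, "value": null}
```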
Measuring Citation Quality
Evaluating the relevance of a citation is hard:
Relevance is a semantic notion, which is harder to quantify and measure than pure lexicographical similarity
Several contexts can be used to justify the same prediction but we only have access to one ground truth through our annotations
For a given context, many substrings or superstrings of the ground truth can often also be accepted
Whether a context is sufficient is subjective and annotators may disagree
Despite these limitations, we try to get a sense of the proximity between the generation and the ground truth through additional metrics. We consider the following breakdown:
Exact Match: the generated citation and the ground truth match exactly.
High Overlap: we compute a similarity between the generated citation and the ground truth (using the Gestalt pattern matching from the Python standard library's SequenceMatcher). If this similarity exceeds 0.8, we say that they have high overlap. For example, "She has been smoking since the age of 20." and "She has also been a smoker since the age of 20." have a similarity of ~0.86. We also report the raw average similarity across samples.
Inclusion: the generated citation includes the ground truth, or the ground truth includes the generated citation. For example, "smoker since the age of 20." is considered included in "She has also been a smoker since the age of 20.".
Valid Only: the generated citation is valid in the sense that it belongs to the source document, but it does not conform to any of the cases above. "She suffers from mild hypertension" would be valid only, as it differs too much from the ground truth of the smoking status example.
Invalid: the generated citation is not found in the source document, e.g. "The patient never smoked.".
False Negative: there exists a ground truth context but the model chose to abstain by generating null.
Missing: the model failed to produce a parsable JSON response.
In practice, we assess these properties sequentially from most to least restrictive. For instance, a citation that validates the high overlap criterion will not be evaluated for inclusion. Finally, we also report citation length in number of characters.
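A simplified version of this sequential classification, using SequenceMatcher as mentioned above, might look like the following sketch (the 0.8 threshold comes from the definition of high overlap; the real evaluation also applies fuzzy matching and handles the missing case upstream):

```python
from difflib import SequenceMatcher

def match_type(generated: str | None, ground_truth: str, document: str) -> str:
    """Classify a generated citation against the annotated ground truth (non-adversarial sample)."""
    if generated is None:
        return "false_negative"                          # the model abstained
    if generated == ground_truth:
        return "exact_match"
    if SequenceMatcher(None, generated, ground_truth).ratio() > 0.8:
        return "high_overlap"                            # Gestalt pattern-matching similarity
    if ground_truth in generated or generated in ground_truth:
        return "inclusion"
    if generated in document:
        return "valid_only"
    return "invalid"
```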
Commercial Models
We evaluate GPT3.5, GPT4 and GPT4o, with JSON mode, with and without few-shot.
Context Citation Ability
We first report results for data points where there is an expected context (i.e. non-adversarial samples).
Observations:
Commercial models are capable of zero-shot verbatim quotation, purely from instructions, with the worst-performing model achieving a valid citation rate of ~78%.
GPT3.5 slightly outperforms the GPT4 models on these metrics. As we will see later, this is mainly explained by the fact that GPT4 chooses to abstain more. Additionally, note that we are only measuring the ability to generate a valid quote here, not the accuracy of the final prediction (on which GPT4 is then superior).
Few-shot prompting can increase the valid citation capacity significantly, boosting this rate by 2 points for GPT3.5 and 9 points for GPT4 and GPT4o.
Fuzzy matching also has a positive effect, recovering up to 14 points of valuable contexts (assuming there are no false positives from the fuzziness), and again the gain is more pronounced for GPT4 and GPT4o.
Citation Quality
Using our match type decomposition introduced earlier, we can better describe the nature of the matches between the generated quote and the ground truth:
Observations:
According to these metrics, the citation quality is globally stable across models. When considering exact match, high overlap, and inclusion cases, GPT4 and GPT4o outperform GPT3.5 by approximately 10 percentage points: they may generate citations that are closer to the ground truth in this sense.
Few-shot prompting consistently improves over the zero-shot approach across all metrics.
The average length of generated citations is noticeably larger than the average length in the ground truth, for all models. This shows that commercial models tend to produce long citations where humans preferred shorter ones. Less verbosity can be achieved by using few-shot examples as guidance.
GPT4 and GPT4o achieve similar scores, but the latter is approximately 4 times cheaper at the time of this writing.
Adversarial samples
So far we have examined samples for which a context exists in the test data. Approximately 25% of our dataset consists of adversarial samples, where no context is expected. Adversarial examples can be deceptive for LLMs. Relying on pragmatic assumptions or world knowledge from their pre-training, they may interpret implicit cues in the data. For instance, if a patient’s smoking status is not discussed in a document, it might be inferred that the patient is a non-smoker, as otherwise it would certainly be mentioned. But since it cannot be asserted with certainty, the rule in such cases is to output null. Commercial models struggle with this behavior, as seen in the following table.
Observations:
All models largely mishandle adversarial examples and generate many superfluous contexts.
This is particularly true for GPT3.5 compared to GPT4 and GPT4o. As per the example in the introduction, GPT3.5 can be very wordy and will try to generate an answer with a context, sometimes with little regard to the actual factual content of the source.
Few-shot prompting slightly mitigates this effect.
Fine-tuned Models
In this section, we fine-tune open-source LLMs on our context attribution dataset to see if this skill can be easily acquired with supervised learning.
Fine-tuned model results
We fine-tune LLMs of different architectures and different numbers of parameters (from 135M to 24B). We measure their valid citation rate, before fuzzy matching, in the non-adversarial case.
The data exhibit a roughly logarithmic relationship between context citation capacity and number of parameters (with notable deviations observed in the Qwen family, for which we propose an explanation later).
It also shows that word-for-word quotation in an autoregressive manner is not trivial, and requires a sizable number of model parameters before it can be relied upon. Below are the complete metrics for all models.
Most metrics increase steadily with model size. The average citation length is more stable, but is now much closer to the ground truth compared to commercial models. This shows how fine-tuning on actual examples can implicitly teach what humans prefer.
While still useful, fuzzy matching becomes less important as the model size grows and the raw valid citation rate gets better.
As with the API models, it is interesting to note that the majority of failure cases do not stem from the model's inability to produce a valid citation but rather from its tendency to abstain, resulting in false negatives. We can see this by looking at the detailed breakdown of match types below.
Even when controlling for the number of parameters, there is observable variation across models in their propensity to generate citations rather than null outputs.
In parallel, looking at adversarial samples, and plotting the False Positive Rate, we also observe a similar discrepancy:
Specifically, it appears that models from the Qwen2.5 family, which underperform in the non-adversarial case, are significantly better here. They are less likely to produce unnecessary contexts for adversarial samples.
Before/After Fine-tuning
So far we have glossed over the necessity of fine-tuning open-source models. Below we provide two graphs of performance before and after fine-tuning to highlight the improvement. To evaluate base models before fine-tuning, we give them the same prompt as the commercial models, but without few-shot examples.
On the left is the evolution of the valid citation rate. For instance, before fine-tuning, Mistral-Nemo is able to generate a valid citation in 72.3% of cases; after fine-tuning, this number rises to 87.8%. On the right is the evolution of the true negative rate, in other words, how many times, on adversarial samples, the model successfully generated null as an answer. This rate increases from 10.2% to 88.9% for Mistral-Nemo, for example.
Note that some models, such as all those from the SmolLM2 family, are unable to generate valid JSONs before fine-tuning, and hence consistently score 0 on the benchmarks. While fine-tuning improves both dimensions, it is remarkably impactful for adversarial samples, as models have a hard time grasping the concept from instructions only.
Fine-tuned models also score much higher on all quality metrics. For instance, for Mistral-Small-24B, the proportion of exact matches rises from 21% to 36%, the average citation length drops from 95 to 67 characters, and the average similarity increases from 0.49 to 0.62.
Summary
As we have shown, understanding the performance of LLMs for context attribution requires examining two dimensions, depending on whether the information exists in the source or not. To visualize this better, we can make a 2D plot balancing the valid citation rate (or True Positive Rate, the frequency with which the model generates a valid context when one is expected) against the True Negative Rate (the frequency with which the model correctly abstains). In this plot, we include commercial models and fine-tuned models, with disk area proportional to the number of parameters.
This visualization suggests a precision/recall trade-off: some models are rather conservative but achieve high precision, while more aggressive models are more prone to inventing citations but reach higher recall.
Conclusion
Large Language Models (LLMs) are well suited for extracting structured data from unstructured documents. However, their generative nature introduces challenges such as hallucinations and lack of interpretability, which are critical in domains like healthcare where reliability and traceability are paramount. Context attribution aims to mitigate these issues by requiring models to cite verbatim excerpts from source material. We investigated whether purely generative context attribution is feasible, allowing the model to dynamically quote the most relevant passages word-for-word.
Key findings from our experiments include:
Commercial models (GPT3.5, GPT4, GPT4o) can perform zero-shot verbatim citation, with performance improving notably through few-shot prompting. However, fuzzy matching is necessary to recover a sizable amount of what the models originally output. Moreover, they still struggle with adversarial cases, often generating unnecessary or incorrect citations when no relevant context exists.
Fine-tuned open-source models show a strong correlation between model size and citation accuracy, with larger models achieving better valid citation rates than commercial models for a fraction of the cost. Fine-tuning improves both citation rate and citation quality, and drastically enhances the ability to abstain when no context is available.
Trade-offs exist between precision and recall, and adversarial analysis is crucial for real-world applications. Some models are conservative, minimizing false citations but missing valid ones, while others are more aggressive, capturing more correct contexts at the cost of occasional hallucinations or misattributions.
Future work
Future research could explore allowing the model to produce multiple citations per prediction, as some answers require combining evidence from several disjoint passages in the source text. Additionally, a more robust approach might involve constrained generation techniques to enforce that generated citations are exact extracts from the source document or to control their length through the JSON schema specification. Another promising direction is to leverage the transformer's attention mechanism as a way to identify relevant contexts, building on existing work in interpretability and attribution. These enhancements could further improve the reliability and granularity of context attribution in LLMs, which is needed to make LLM outputs more trustworthy and actionable in high-stakes environments.
Join Us at Lifen!
At Lifen, we’re at the forefront of leveraging AI to solve real-world challenges in healthcare. If you’re passionate about building reliable, interpretable AI systems and want to work on impactful problems, we’d love to hear from you!
We’re currently hiring for roles in machine learning, engineering and more. Check out our careers page to explore opportunities and join our mission.