Developing a Standardized Framework for Curating Oncology Datasets Generated By Manual Abstraction and Artificial Intelligence


The widespread uptake of electronic health records (EHRs) has made the creation of custom, real-world datasets for research more feasible. As a result, multiple research datasets with overlapping populations are often generated using different methodologies and are frequently siloed within and between research groups, which limits how widely the data can be used. Currently, there is no standard for collating and evaluating such data. Using existing lung oncology datasets, we developed an approach to determine optimal methods of combining and curating clinical data from different sources.


Two separate study datasets containing data for lung cancer patients diagnosed and/or treated within Princess Margaret Cancer Centre (PM, Toronto) were investigated. Study 1 manually abstracted clinical data for 1,990 patients first seen at PM between 2014 and 2016; Study 2 leveraged the artificial intelligence engine DARWEN™ to extract clinical data directly from EHRs for 4,466 patients diagnosed between 2014 and 2018. Each dataset was individually assessed for internal consistency, and the overlapping population (Test Group, n=1,892) was then compared to identify, investigate, and resolve differences. Patterns of data extraction performance were evaluated to define optimal methods for combining datasets and informing future data collection. Herein, epidermal growth factor receptor (EGFR) mutation status is used as an illustrative example.
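The comparison of an overlapping population can be illustrated with a minimal sketch. It is not the study's actual pipeline: the function name `compare_overlap`, the field names (`sex`, `egfr_positive`), the patient identifiers, and the toy records are all hypothetical, and it assumes each dataset is keyed by a shared patient identifier. It computes a per-field agreement rate for patients present in both datasets and lists the discrepancies that would go forward for manual adjudication.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def compare_overlap(
    study1: Dict[str, Record],
    study2: Dict[str, Record],
    fields: List[str],
) -> Tuple[Dict[str, float], List[Tuple[str, str, object, object]]]:
    """Compare two datasets on the patients they share.

    Returns (agreement rate per field, list of discrepancies), where each
    discrepancy is (patient_id, field, study1_value, study2_value).
    """
    # Only patients present in both datasets form the overlapping group.
    shared_ids = sorted(study1.keys() & study2.keys())
    agreement: Dict[str, float] = {}
    discrepancies: List[Tuple[str, str, object, object]] = []

    for field in fields:
        matches = 0
        for pid in shared_ids:
            a = study1[pid].get(field)
            b = study2[pid].get(field)
            if a == b:
                matches += 1
            else:
                # Disagreements are kept for later review.
                discrepancies.append((pid, field, a, b))
        agreement[field] = matches / len(shared_ids) if shared_ids else float("nan")

    return agreement, discrepancies


# Hypothetical toy records (not study data).
manual = {
    "P001": {"sex": "F", "egfr_positive": True},
    "P002": {"sex": "M", "egfr_positive": False},
    "P003": {"sex": "F", "egfr_positive": False},
}
automated = {
    "P002": {"sex": "M", "egfr_positive": True},
    "P003": {"sex": "F", "egfr_positive": False},
    "P004": {"sex": "M", "egfr_positive": False},
}

agreement, discrepancies = compare_overlap(manual, automated, ["sex", "egfr_positive"])
print(agreement)
print(discrepancies)
```

In this toy example only P002 and P003 appear in both datasets, so the single disagreement on P002 is the one record that would be sent for chart review.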


Studies 1 and 2 had similar distributions of clinicodemographic data and similar frequencies of EGFR mutations. The Test Group had 100% agreement for date of birth and >99% agreement for sex, with all discrepancies resulting from human error in Study 1. The Test Group had 98% agreement for EGFR positivity and 98-99% agreement for specific exon mutations. Of the 106 disagreements for specific mutations, 50% (n=53) were due to human error in Study 1. Study 2 prioritized specificity over sensitivity for biomarker extraction, resulting in more false negatives (25% of errors, n=26). As DARWEN™ only extracted EGFR data from pathology reports, 18% (n=19) of discrepancies were due to lack of access to relevant information captured elsewhere in patients’ EHRs. Adjudicators could not resolve the remaining 7% of disagreements (n=8).


By comparing overlapping datasets, the strengths and weaknesses of each study design and extraction methodology were identified. This process demonstrated the effectiveness of artificial intelligence for extracting accurate patient-level clinicodemographic and mutation status data from EHRs, and the value of targeted manual chart review. Our approach provides a roadmap for leveraging existing clinical datasets to their fullest potential, which is relevant across diverse data extraction methods and study designs.