Validation of Scalable, Automated Data Extraction in an Advanced Lung Cancer Patient Population



Manual extraction from electronic health records (EHRs) is currently the standard approach for accessing real-world healthcare data, but it can be time-consuming and challenging to maintain over time. Automated data extraction using natural language processing (NLP) is emerging as a viable method of extracting data from structured and unstructured fields of EHRs. While the speed of NLP-based data extraction is established, some question the validity of the extracted data. This study compares the accuracy of, and concordance between, manual and NLP-extracted data from EHRs of patients with advanced lung cancer (aLC).


EHRs of 1209 patients with aLC were screened using the AI engine, DARWEN™, to identify a subset of 333 patients diagnosed and treated with systemic therapy at Princess Margaret Cancer Centre in Toronto between January 2015 and December 2017. Full feature models were run on all 333 patients to extract data from EHRs, from which 100 patients were randomly selected for manual data extraction by two trained abstractors to validate against NLP-extracted data. An expert adjudicator reviewed inconsistencies between manual and NLP-extracted results and was referenced as the gold standard when calculating accuracy and concordance.
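The validation step described above can be illustrated with a minimal sketch. The abstract does not state how accuracy and concordance were calculated, so this example assumes simple percent agreement: accuracy is the share of patients whose extracted value matches the adjudicated gold standard, and concordance is the share of patients on whom manual and NLP extraction agree. The function names and the toy histology values are hypothetical and are not study data.

```python
from typing import Hashable, Sequence


def accuracy(extracted: Sequence[Hashable], gold: Sequence[Hashable]) -> float:
    """Fraction of patients whose extracted value equals the gold-standard value."""
    if len(extracted) != len(gold):
        raise ValueError("inputs must cover the same patients in the same order")
    if not gold:
        raise ValueError("at least one patient is required")
    matches = sum(e == g for e, g in zip(extracted, gold))
    return matches / len(gold)


def concordance(manual: Sequence[Hashable], nlp: Sequence[Hashable]) -> float:
    """Fraction of patients on whom the two extraction methods agree.

    Assumption: plain percent agreement, computed the same way as accuracy
    but comparing the two methods with each other instead of with the gold standard.
    """
    return accuracy(manual, nlp)


# Hypothetical values for one feature (histologic subtype) in five patients.
gold = ["adeno", "squamous", "adeno", "adeno", "small cell"]
manual = ["adeno", "squamous", "adeno", "squamous", "small cell"]
nlp = ["adeno", "squamous", "adeno", "adeno", "adeno"]

manual_accuracy = accuracy(manual, gold)   # manual differs from gold for one patient
nlp_accuracy = accuracy(nlp, gold)         # NLP differs from gold for one patient
agreement = concordance(manual, nlp)       # the two methods differ for two patients

print(f"manual accuracy: {manual_accuracy:.2f}")
print(f"NLP accuracy:    {nlp_accuracy:.2f}")
print(f"concordance:     {agreement:.2f}")
```

In this toy case both methods reach the same accuracy but disagree with each other on two patients, which is why accuracy against an adjudicated reference and concordance between methods are worth reporting separately.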


NLP-extracted data from EHRs proved to be accurate and concordant with manually extracted data (Table 1). Features with lower syntactic and semantic variation, such as patient demographics (i.e., age and sex), characteristics (i.e., histologic subtype and comorbid conditions), and treatment details, were reported with high accuracy and concordance. These tend to be the features on which manual reviewers would agree. Conversely, features with richer syntactic and semantic variation, which require deeper clinical interpretation, had slightly lower accuracy by NLP extraction and, typically, by manual review as well. Because biomarker testing and reporting are documented in varying ways, extracting these data can be challenging. While NLP detection of biomarker testing was highly accurate and concordant, detection of results was more variable. NLP outperformed manual extraction in identifying metastatic sites, with the exception of lung and lymph node metastases; this exception was due to analogous terms used in radiology reports that were not included in the variable definitions used to train DARWEN™.


The use of NLP technology in oncology provides an opportunity for real-world evidence studies at a larger scale than ever before. NLP was not only faster than manual extraction but, for many features, also more accurate than the traditional manual approach, demonstrating that modern NLP techniques offer a scalable alternative to manual extraction.