Developing a Data and Analytics Platform to Enable a Breast Cancer Learning Health System at a Regional Cancer Center

CONCLUSION: This study describes how data warehousing combined with NLP can be used to create a prospective data and analytics platform to enable a learning health system. Although the upfront time investment required to create the platform was considerable, now that it has been developed, daily data processing is completed automatically in less than an hour.

For routinely collected data to be effective in enabling a learning health system, they generally must be integrated, timely, meaningful, high-quality, and actionable.5 These conditions can be difficult to meet. For example, in the context of regional cancer centers, a single patient’s record may be split among all the different health care organizations where they have received care. Regional data repositories designed to address this challenge often experience delays in receiving and cleaning data and may be missing important clinical,3 social determinant,4,7 or patient-reported outcome data.5 Indeed, important information about a patient’s condition is often recorded only in the form of free-text clinical documentation (such as consult notes and radiology reports), which historically needed to be abstracted via manual review.3 Furthermore, clinical documentation standards often vary, which can introduce data quality issues.8-10 Privacy requirements or inadequate data governance frameworks can also impede access to these data, limiting their actionability.11,12

Despite the challenges, a number of efforts have been undertaken to create the necessary data and analytics infrastructure to enable a learning health system in oncology.13-18 Many of these have used a data warehousing or data lake approach, in which data are automatically transferred on a regular schedule from disparate sources to a centralized data repository that is designed for analytics.19-26 When properly designed, data warehouses and data lakes can significantly improve the timeliness and actionability of data, although they require significant investment in time and resources to create.

Although data integration through warehouses and lakes can make data more timely and actionable, these systems are not typically equipped to facilitate the analysis of free-text documentation, where so much clinically meaningful information resides. This has spurred recent interest in natural language processing (NLP), a branch of artificial intelligence, for the extraction of structured data from clinical text. Although NLP rarely achieves perfect accuracy, research suggests that it can achieve performance similar to manual chart abstraction.27-36 There is increasing interest in blending NLP with traditional data warehousing to create a comprehensive data set for research and quality improvement, although to date, relatively few institutions have developed these systems,37-39 and to the best of our knowledge, none currently exists for breast cancer.

The Juravinski Cancer Center (JCC) is a regional cancer center serving a catchment area of approximately 2.5 million people and receiving referrals from 10 community hospitals. Data integration challenges at this center were similar to those faced by other regional cancer centers. Patient records were distributed across six clinical information systems, and a significant amount of information was stored exclusively as free text in clinical notes and radiology reports. Medical records of referred patients were often incomplete or transferred as text files with no structure. Although a regional viewer was available for clinicians to see records in other hospitals’ EHRs, these records were view-only and not available in a repository for research or quality improvement purposes.40 As a result of these factors, making use of routinely collected health data was a difficult and time-consuming task.

This article documents the establishment of a data and analytics platform for a breast cancer learning health system at the JCC in Ontario, Canada. This platform automatically extracts patient data—including social determinants and patient-reported outcomes—from disparate clinical information systems on a nightly basis, uses NLP to extract structured data from free-text documentation, and integrates them into a single up-to-date, longitudinal, prospective data model.

Stakeholder Engagement

One of the primary barriers to the kinds of quality improvement initiatives enabled through learning health systems can be stakeholder resistance.41,42 To help address this barrier from the outset, we set about engaging stakeholders around the hospital to contribute to shaping the vision and mission of the data and analytics platform. These stakeholders included clinicians, managers, quality improvement teams, data analysts, researchers, information technologists, privacy officers, and executive sponsors. This engagement extended to helping determine which data elements would be included in the platform since we reasoned that buy-in would be highest if all stakeholder groups could make effective use of the platform to meet their own goals. Stakeholders were invited to provide input into a briefing note that outlined the vision and mission for the platform, along with proposed methods, privacy/security protections, and data elements. Key stakeholders were asked to add their names to the briefing note once their input had been included, and the note was then reviewed and approved by the hospital’s data and analytics governance committee.

Data Sources

Once a list of data elements had been identified by our stakeholders, we set about identifying where these data resided in our hospital’s informatics environment. With help from our information technology and decision support departments, we identified that data of interest originated in six distinct clinical information systems.

MEDITECH: the hospital’s primary EHR, which stored clinical documentation and data on patient demographics and encounters.
Hamilton Regional Laboratory Medicine Program: a regional laboratory information system that stored pathology reports.
PowerScribe: a regional radiology reporting platform that stored medical imaging reports.
MOSAIQ: an oncology clinical information system that stored data on radiation planning.
Oncology Patient Information System: a province-wide clinical information system that stored data on systemic therapy.43
Your Symptoms Matter: a province-wide electronic information system that stored data on patient-reported outcome measures via the Edmonton Symptom Assessment System.44

We then mapped the data flows between these distinct systems and identified that copies of the data of interest from PowerScribe and the Hamilton Regional Laboratory Medicine Program were stored in MEDITECH and that copies of data from the Oncology Patient Information System and Your Symptoms Matter were stored in a MOSAIQ data mart. We examined the copied data to verify that they were complete and usable for our purposes.

In addition to these six systems, we also identified that information on social determinants of health could be acquired from the Ontario Marginalization Index,45 a Canadian deprivation-based index similar to the Multidimensional Deprivation Index developed by the US Census Bureau.46

Architecture and Data Flows

Architecture was developed in consultation with our hospital’s information technology department. The hospital was in the process of developing a data warehouse for operational and financial analytics, so we elected to use the same design patterns for both systems to minimize operational overhead and provide the option of merging the resources in the future. We thus adopted Microsoft SQL Server for primary data storage, Microsoft SQL Server Integration Services (SSIS) for developing and managing extract/transform/load jobs, T-SQL for stored procedures, and ER/Studio for data modeling. The NLP software DARWEN was run on Docker. A deidentified copy of the data was stored in a separate research informatics environment using PostgreSQL. The flow of data is illustrated in Figure 1, and a detailed description of data handling is provided in the Data Supplement.

To maximize the timeliness of data while minimizing the chance of performance degradation on clinical systems, the extract/transform/load jobs were programmed to automatically run every night at 2:30 am. Our data engineer used Microsoft SSIS’ change data capture features to extract only new or modified data each night.
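The nightly incremental-extraction logic can be sketched as follows. This is a simplified Python stand-in for the SSIS change data capture mechanism, which tracks changes in the source tables themselves; the field names are illustrative, not the platform's actual schema.

```python
from datetime import datetime

def extract_incremental(source_rows, last_run):
    """Return only rows created or modified since the previous nightly run.

    `source_rows` is a list of dicts carrying a `modified_at` timestamp; a real
    change-data-capture feed would supply these deltas from the source system.
    """
    return [row for row in source_rows if row["modified_at"] > last_run]

rows = [
    {"patient_id": 1, "modified_at": datetime(2022, 5, 1, 14, 0)},
    {"patient_id": 2, "modified_at": datetime(2022, 5, 2, 9, 30)},
]
last_run = datetime(2022, 5, 2, 2, 30)  # previous 2:30 AM run
delta = extract_incremental(rows, last_run)  # only patient 2 changed since then
```

Processing only the delta each night is what keeps the load window short and avoids re-reading full tables from clinical systems.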

Since data in the data and analytics platform could be used for clinical and quality improvement purposes, it included personal health information. In consultation with our privacy office, we created a deidentified version of this database to be used for research purposes. The intent of this approach was to improve efficiency and reliability by creating a rigorous deidentification procedure up-front, rather than requiring an analyst to deidentify data on a project-by-project basis in the future. To deidentify the database, we removed direct identifiers (name, health card number, etc) and modified quasi-identifiers (eg, data elements that contained specific dates were modified to use days from diagnosis).
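The date-shifting treatment of quasi-identifiers can be illustrated with a minimal sketch; the field names here are hypothetical, not the platform's actual schema.

```python
from datetime import date

DIRECT_IDENTIFIERS = {"name", "health_card_number"}  # illustrative field names

def deidentify(record, diagnosis_date):
    """Drop direct identifiers; convert absolute dates to days from diagnosis."""
    clean = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # direct identifiers are removed outright
        if isinstance(value, date):
            # quasi-identifying dates become relative offsets
            clean[field + "_days_from_dx"] = (value - diagnosis_date).days
        else:
            clean[field] = value
    return clean

record = {
    "name": "Jane Doe",
    "health_card_number": "0000-000-000",
    "surgery_date": date(2020, 3, 15),
    "er_status": "positive",
}
out = deidentify(record, diagnosis_date=date(2020, 1, 10))
# out retains er_status and a relative surgery offset, but no identifiers
```

Centralizing this logic in one pipeline step is what spares each downstream analyst from re-deriving a deidentification procedure per project.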

System performance for data extraction, transformation, and loading was monitored and evaluated using system logs recorded by Microsoft SSIS.


We used NLP for data extraction in two scenarios: first, when no structured data were available for a particular data element (eg, comorbidities) for any of our patients; second, when structured data were available for some patients, but not all. The second scenario occurred because some patients received all their care at our regional cancer center, whereas others were referred only after being diagnosed or having received some treatment at a referring hospital. For patients who received all their care at our cancer center, we had structured data on estrogen receptor status, progesterone receptor status, and human epidermal growth factor receptor 2 (HER2) status from synoptic reports. However, for patients referred after their diagnosis, we used NLP to extract these data from free-text clinical documentation. In these cases, we elected to use structured data from synoptic reports when they were available and to fill in the gaps with NLP when they were not.
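The fill-in-the-gaps approach amounts to a simple coalescing rule, sketched below with illustrative field names.

```python
def resolve_receptor_status(synoptic_value, nlp_value):
    """Prefer structured synoptic-report data; fall back to NLP extraction."""
    return synoptic_value if synoptic_value is not None else nlp_value

patients = [
    {"id": "A", "er_synoptic": "positive", "er_nlp": "positive"},
    {"id": "B", "er_synoptic": None, "er_nlp": "negative"},  # referred patient
]
resolved = {
    p["id"]: resolve_receptor_status(p["er_synoptic"], p["er_nlp"])
    for p in patients
}
```

Patient A keeps the synoptic value; patient B, whose pathology workup happened at a referring hospital, gets the NLP-derived value.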

We used DARWEN, a commercially available medical NLP engine, to extract structured data from unstructured clinical documentation. DARWEN uses a proprietary combination of linguistic rules–based algorithms and deep learning models to perform data extraction. Its operations and performance have been described previously.27,28

For data elements where structured data were not available for any patients, ground truth for model development and evaluation was established through manual chart review. Chart abstraction rules were drafted by a clinical expert and refined in collaboration with the JCC’s Breast Cancer Disease Site Group, which included medical, radiation, and surgical oncologists. Two trained chart reviewers manually abstracted 200 randomly selected charts; 100 were used for model training, 50 were reserved for validation, and 50 were held back for final testing. In addition to this approach, we were able to conduct a further test for estrogen receptor status, progesterone receptor status, and HER2 status using structured data from JCC pathologists’ synoptic reports as ground truth. This approach allowed for a very large test set for performance estimation in cases where NLP was used to fill in the gaps for patients whose pathology workup was performed at a referring hospital.

To ensure that the data produced by NLP were well tuned to our local data and thus of sufficiently high quality to be used in research studies, we evaluated its performance by comparing it against manual chart abstraction for the held-out test set (n = 50). F1 score, the harmonic mean of sensitivity and positive predictive value, was used as the primary evaluation metric. Secondary metrics included sensitivity (recall), specificity, positive predictive value (precision), negative predictive value, and overall accuracy. We also conducted a detailed manual error analysis for the variables with the lowest performance.
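As a minimal sketch, these metrics can all be computed from a 2x2 confusion matrix; the counts below are illustrative, not study results.

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Compute the study's evaluation metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)           # precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 is the harmonic mean of PPV (precision) and sensitivity (recall)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {"f1": f1, "sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv, "accuracy": accuracy}

# Illustrative counts for one binary variable on a 50-chart test set
m = evaluation_metrics(tp=8, fp=2, tn=38, fn=2)
```

Note that with imbalanced labels (as with rare complications), accuracy can stay high while F1 drops sharply, which is why F1 served as the primary metric.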

Ongoing Quality Assurance

Performance decay of artificial intelligence systems because of data drift or concept drift is a growing concern in health care.47,48 To control for potential performance decay of our NLP models, we launched an ongoing quality assurance program, in which a random selection of charts is reviewed semiannually by two oncologists. Results of the chart review are compared against NLP output, and any deviation from baseline NLP performance is flagged for follow-up by our technical team. In addition, system logs are reviewed regularly to identify any failures with extract/transform/load operations. To ensure that this process can be carried out efficiently, we have created a browser-based chart abstraction tool that allows chart abstractors to simultaneously review clinical documentation while filling out a standardized chart abstraction form in a single browser window (the screenshot is included in the Data Supplement).
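The comparison against baseline NLP performance might be sketched as follows; the tolerance threshold and variable names are illustrative assumptions, not values from our quality assurance program.

```python
def flag_performance_drift(baseline_f1, current_f1, tolerance=0.05):
    """Flag variables whose F1 on the semiannual QA sample fell below baseline.

    `tolerance` is a hypothetical allowance for sampling noise in the small
    QA sample; any larger drop is routed to the technical team for follow-up.
    """
    return {
        var: (baseline_f1[var], current_f1.get(var, 0.0))
        for var in baseline_f1
        if baseline_f1[var] - current_f1.get(var, 0.0) > tolerance
    }

baseline = {"er_status": 1.00, "atrial_fibrillation": 0.80}
current = {"er_status": 0.99, "atrial_fibrillation": 0.62}
flags = flag_performance_drift(baseline, current)
```

Here the small dip for ER status falls within tolerance, while the larger drop for atrial fibrillation would be flagged as possible data or concept drift.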

This work culminated in an automated, longitudinal, prospective data and analytics platform that provides access to integrated, timely, meaningful, high-quality, and actionable data for research, quality improvement, and other learning health system activities. The platform contains 141 data elements for 7,019 patients with newly diagnosed breast cancer who received care at the JCC from January 1, 2014, to June 3, 2022.

Data elements in the platform are organized into tables on the basis of subject areas, which include both data elements originating from structured databases and those extracted with NLP (Fig 2). A detailed data dictionary is included in the Data Supplement. All data elements can be joined at the patient level using PatientID as a key. Data in the platform are longitudinal, which allows for the visualization and analysis of patients’ entire care journey as a timeline (Fig 3). Data in the platform can be accessed through dashboards built in the organization’s enterprise business intelligence tool (Tableau) or using analysis tools such as R, Python, and SAS.
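The patient-level join on PatientID can be illustrated with a small in-memory SQL sketch. The table and column names are hypothetical, and SQLite is used here only to keep the example self-contained; the platform itself runs on Microsoft SQL Server.

```python
import sqlite3

# In-memory database standing in for the platform's subject-area tables
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE demographics (PatientID INTEGER, age INTEGER)")
cur.execute("CREATE TABLE pathology (PatientID INTEGER, er_status TEXT)")
cur.execute("INSERT INTO demographics VALUES (1, 62), (2, 55)")
cur.execute("INSERT INTO pathology VALUES (1, 'positive')")

# LEFT JOIN keeps every patient, even those without a pathology row yet
cur.execute("""
    SELECT d.PatientID, d.age, p.er_status
    FROM demographics d
    LEFT JOIN pathology p ON p.PatientID = d.PatientID
    ORDER BY d.PatientID
""")
rows = cur.fetchall()
```

A left join preserves patients whose data are still incomplete, which matters for a prospective platform where records accrue over time.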

System Performance

The database is updated daily at 2:30 am to minimize impact on source systems and our hospital network. Over the month of May 2022, the average runtime of daily update jobs was 56 minutes. Extract/transform/load jobs were consistent at approximately 40 minutes, with the primary source of variability coming from daily NLP processing. This variability was driven by clinic schedules, with daily patient volumes ranging from 80 to 300. The initial NLP run used to populate the database with records from over 7,000 patients took 24 hours. The extract/transform/load and NLP jobs were run on a server with a four-core Intel Xeon Gold 6248 CPU @ 2.50 GHz and 16 GB of RAM.

NLP Performance

NLP performance is described in Table 1, and the distribution of labels for variables extracted by NLP is reported in the Data Supplement. An F1 score of 1.0 was achieved for 19 variables, and a further 16 variables achieved an F1 score > 0.95. These results are consistent with previous validation studies of the DARWEN NLP engine.27,28

The lowest F1 score was for detecting venous thromboembolism (0.57), which in this case was related to a lower positive predictive value (precision), a result not entirely unexpected given the rarity of this complication (there were only two cases in our test data set). Our manual error analysis of these false-positive cases found that the NLP had missed negating clauses (eg, “no evidence of venous thromboembolism”).
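For illustration, a naive version of the negation handling that these false positives called for might look like the following sketch. Real clinical negation detection (eg, NegEx-style algorithms, or the deep learning models DARWEN combines with its rules) is considerably more robust than this.

```python
import re

# A few common negation cues seen in clinical notes (illustrative, not exhaustive)
NEGATION_CUES = r"\b(no evidence of|no|denies|negative for|without)\b"

def mentions_without_negation(text, concept):
    """Naive check: does `concept` appear in a sentence without a preceding
    negation cue? Missing this check is exactly what produced the
    false-positive venous thromboembolism detections."""
    for sentence in re.split(r"[.;\n]", text.lower()):
        if concept in sentence:
            prefix = sentence.split(concept)[0]
            if not re.search(NEGATION_CUES, prefix):
                return True  # an apparently affirmative mention
    return False

note = "Doppler ultrasound performed. No evidence of venous thromboembolism."
found = mentions_without_negation(note, "venous thromboembolism")
```

On this example, the cue "no evidence of" precedes the concept, so no affirmative mention is reported.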

The other lower F1 scores were primarily related to detecting comorbidities, specifically atrial fibrillation (0.80), chronic obstructive pulmonary disorder (0.80), and stroke (0.86). For all three of these conditions, this was driven by lower sensitivity, indicating that the NLP missed some cases. On manual error analysis, we noted that all the missed cases occurred when patients had four or five comorbidities. In these cases, the NLP successfully detected three or four of the comorbidities, but missed a fourth or fifth.

In addition, we completed the first cycle of semiannual quality assurance before publication. This activity identified an anomaly in our radiation planning data. Before launch, we conducted extensive tests on our extract/transform/load jobs to ensure that they were stable and performed source to target verification to ensure that data in the platform were an accurate representation of data in the source systems. However, during routine quality monitoring, we noted that some older radiation data were changing in ways that did not make clinical sense. On investigation, we discovered that although our platform’s extract/transform/load jobs were operating correctly, data in the source system table we were pulling from were unstable because of an error in their code. We were able to resolve the issue by working with our IT team to identify alternate tables within the source system that were unaffected by the error.

This study highlights the importance of ongoing quality assurance of artificial intelligence deployments in health care. Although extensive testing before launch can catch most defects, issues such as concept drift, data drift, and the instability in the radiation planning data that our quality assurance activities uncovered can only be detected through ongoing monitoring. Although ongoing quality assurance requires resources, without this kind of monitoring, our platform would have gone on faithfully reproducing erroneous data from a defective source system table.

This project was performed at a single center with its own unique challenges with respect to clinical informatics and regional data sharing. Thus, both extract/transform/load operations and NLP models would require adaptation and tuning to be deployed at another center. Similarly, like other regional cancer centers, the JCC typically cares for patients with more advanced disease than referring hospitals, so the cohort in our platform may not be representative of national or global populations. Patient-reported outcome data are collected on a voluntary basis, with adoption at approximately 70% and some disruption associated with the COVID-19 pandemic. Our hospital system did not collect individual-level socioeconomic data, so data on marginalization are based on neighborhood-level estimates, although our ability to link to the most granular census data (dissemination areas) minimizes the risk of ecological fallacy when using the index as an individual-level proxy.49

We created an automated, longitudinal, prospective data and analytics platform for breast cancer at a regional cancer center. This platform combines principles of data warehousing with NLP to provide the integrated, timely, meaningful, high-quality, and actionable data required to establish a learning health system.


Supported by a grant from Roche Canada.