The rapid emergence of ‘big data’ is driving the need for fast, accurate, cost-effective collection, distillation, and analysis of previously inaccessible data, opening new avenues for discovery across every sector, including scientific research. This data availability, combined with Artificial Intelligence (AI) and machine learning, has fueled a bold vision for healthcare. There is much optimism that the tools are within reach to deliver precision medicine and highly personalized care, solving many challenges faced by industry, payers, regulators, clinicians, and patients.
Unfortunately, this vision remains unrealized, since the underlying information required to conduct this caliber of data analytics still depends on manual extraction of clinical variables from patient charts and research databases. Researchers today are drowning in data – dispersed across thousands of files, reports, patient records and decades of completed studies – yet that data cannot be accessed and abstracted in an economically viable manner.
Under current research practice, the cost of data abstraction rises in direct proportion to the number of patients and variables examined, since researchers must manually gather each data point from the population cohort.
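To make that scaling concrete, consider a rough back-of-the-envelope model. The per-variable time and hourly rate below are illustrative assumptions, not figures from any study:

```python
# Back-of-the-envelope cost model for manual chart abstraction.
# All constants are illustrative assumptions, not measured figures.
def abstraction_cost(n_patients: int, n_variables: int,
                     minutes_per_variable: float = 3.0,
                     hourly_rate: float = 45.0) -> float:
    """Cost grows linearly in both patients and variables."""
    hours = n_patients * n_variables * minutes_per_variable / 60
    return hours * hourly_rate

# A hypothetical 1,400-patient lung cancer cohort with 50 variables per chart:
print(f"${abstraction_cost(1400, 50):,.0f}")  # -> $157,500
```

Doubling either the cohort size or the variable count doubles the cost, which is exactly the economics that stall large retrospective studies.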
This reality becomes vividly clear when you consider the electronic health record of a single lung cancer patient. Compiling the patient’s clinical notes and reports to obtain the necessary biomarkers, lab and imaging reports, metastasis locations, treatment records and outcomes typically means reviewing hundreds of pages. The resulting cost of manual data collection is significant, making it difficult for researchers to secure funding to conduct such time-consuming and costly chart reviews – for the patients who are the most vulnerable and face the highest mortality.
Even when such data gathering is approved for investigation, researchers face numerous challenges. Among them, manual data abstraction is vulnerable to human error in the interpretation of diverse data inputs and formats. For example, manual abstractors tasked with collecting patient diagnostic scores such as ECOG might find that a single patient’s record lists different ECOG scores at different stages of treatment. The collected data could therefore show great variability in ECOG values, owing to the varying interpretations of the abstractors, resulting in conflicting findings. Even if the ECOG value of interest were defined as the score at stage IV, deciding whether to take the score recorded before or after the date stage IV was established can complicate the extraction.
Similarly, patient records may contain inconsistent information, such as conflicting statements about a patient’s smoking history – a key variable in a lung cancer cohort. Manual abstractors may never realize that these inconsistencies exist; if they do identify them, they must make judgement calls about which data points to record, often without that variability being documented or validated.
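Both ambiguities above become tractable once the abstraction policy is written down as executable logic. The sketch below is purely illustrative: the ‘first score on or after the stage IV date’ policy and the data layout are assumptions made for the example, not Pentavere’s actual rules:

```python
from datetime import date
from typing import Optional

def ecog_at_stage_iv(scores: list[tuple[date, int]],
                     stage_iv_date: date) -> Optional[int]:
    """One explicit policy: take the first ECOG score recorded on or
    after the stage IV date. Encoding the choice as code removes
    abstractor-to-abstractor variation."""
    eligible = sorted((d, s) for d, s in scores if d >= stage_iv_date)
    return eligible[0][1] if eligible else None

def smoking_status(mentions: list[str]) -> str:
    """Flag conflicting smoking-history statements for adjudication
    instead of silently picking one."""
    distinct = set(mentions)
    if len(distinct) == 1:
        return distinct.pop()
    return "CONFLICT: " + ", ".join(sorted(distinct))

scores = [(date(2023, 1, 10), 1), (date(2023, 6, 2), 2)]
print(ecog_at_stage_iv(scores, stage_iv_date=date(2023, 5, 1)))  # -> 2
print(smoking_status(["never smoker", "former smoker"]))
# -> CONFLICT: former smoker, never smoker
```

The judgement call is made once, explicitly, and applied identically to every chart; conflicts are surfaced rather than silently resolved.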
The alternative: Applying a data curation engine
As an alternative to the cost, time and inaccuracies that result from manual chart reviews, imagine the potential transformation of current research processes if a health technology provider could introduce a data curation engine that extracts real-world evidence from medical records in a way that is research-grade and, ultimately, regulatory-grade once regulators decide on the scope and applicability of such evidence.
Pentavere Research Group uses both traditional and cutting-edge NLP and AI methodologies to label, extract and curate variables from structured and unstructured clinical texts, producing high-quality row-column datasets. Pentavere is working with health agencies to design the framework and methodologies to gather, synthesize and amalgamate their data inputs. This collaborative process will enable regulators to process vast quantities of disparate data and produce datasets that can be considered regulatory-grade.
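To illustrate what ‘unstructured text in, row-column data out’ means in practice, here is a deliberately simplified sketch using regular expressions. The note text and patterns are invented for the example; Pentavere’s actual NLP stack is not public and is certainly more sophisticated:

```python
import re

# Invented example notes; real clinical text is far messier.
NOTES = {
    "patient_001": "Dx: stage IV NSCLC. ECOG 2. EGFR mutation positive.",
    "patient_002": "Stage III adenocarcinoma, ECOG 1, EGFR wild-type.",
}

def extract_row(patient_id: str, note: str) -> dict:
    """Map one unstructured note to one structured row."""
    ecog = re.search(r"ECOG\s*(\d)", note)
    stage = re.search(r"stage\s*(IV|III|II|I)", note, re.IGNORECASE)
    egfr = re.search(r"EGFR\s+(mutation positive|wild-type)", note)
    return {
        "patient_id": patient_id,
        "ecog": int(ecog.group(1)) if ecog else None,
        "stage": stage.group(1).upper() if stage else None,
        "egfr": egfr.group(1) if egfr else None,
    }

rows = [extract_row(pid, note) for pid, note in NOTES.items()]
# rows is now an analyzable row-column dataset:
# [{'patient_id': 'patient_001', 'ecog': 2, 'stage': 'IV',
#   'egfr': 'mutation positive'}, ...]
```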
The process begins by establishing comprehensive protocols for data handling, including data and statistical management plans and quality assurance processes. Since it is critical that client research projects attain research-grade data that can earn regulatory-grade acknowledgement, an important part of the approach is designing natural language processing algorithms for data collection. The ‘gold standard rules’ that define exactly which features will be abstracted are generated in collaboration with the client to ensure maximum consistency in data collection.
Pentavere then conducts a data discovery process with the client and their clinical inputs to refine the data collection rules and algorithms that the engine will apply. These precise, customized criteria govern which data variables are extracted, and the rules are explicitly documented for transparency and for consideration by the research team, policymakers, and regulatory bodies.
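A documented rule of this kind might take a shape like the following; the fields are assumptions about what such a specification could contain, not Pentavere’s actual schema:

```python
# Hypothetical shape of a documented 'gold standard rule'.
ECOG_RULE = {
    "variable": "ecog_at_stage_iv",
    "definition": "First ECOG score recorded on or after the stage IV "
                  "diagnosis date",
    "allowed_values": [0, 1, 2, 3, 4],
    "source_documents": ["oncology consult notes", "progress notes"],
    "tie_breaking": "earliest qualifying score wins",
    "version": "1.2",
    "approved_by": ["client clinical lead", "Pentavere data science"],
}
```

Because the rule lives in a versioned artifact rather than in an abstractor’s head, a reviewer can see precisely how every value in the dataset was defined.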
This curation process enables meticulous review and auditing of the abstracted data, ensuring the accuracy of findings and giving researchers confidence in the final datasets. The transparency of the process enables any necessary regulatory reviews, including adjudication and validation of the data, in a manner that is impossible with current manual data extraction and curation.
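Auditability of this kind typically means storing provenance alongside every extracted value, so a reviewer can trace each data point back to the exact text that produced it. A minimal, hypothetical record might look like this:

```python
# Hypothetical provenance record; the field names are illustrative.
extraction_record = {
    "patient_id": "patient_001",
    "variable": "ecog_at_stage_iv",
    "value": 2,
    "rule_version": "1.2",                      # which rule produced it
    "source_document": "oncology_consult.txt",  # where it came from
    "source_span": "ECOG 2",                    # the exact supporting text
    "char_offsets": (31, 37),                   # illustrative location
}
# An adjudicator can jump straight from the value 2 to the sentence that
# produced it, which a hand-keyed spreadsheet cell cannot offer.
```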
As an example, Pentavere performed this activity for the University Health Network, successfully aggregating more than 1,400 records for lung cancer patients, including 600 individuals who received treatment at other institutions. In addition to generating unique insights into treatments and outcomes in this vulnerable patient population, the process identified potential areas for improvement in clinical documentation, as well as opportunities for improved data generation.
Although the practice of data analytics typically elicits much discussion regarding the protection of patient data and confidentiality – a paramount concern in healthcare – the model applied by Pentavere effectively eliminates these challenges. At no point does Pentavere remove the data from client systems. As a technology company, Pentavere only provides the technology and expertise to scale the curation of a client’s data on the client’s own premises and servers. It deploys its tools behind the client agency’s firewall and serves as the agent for that agency, curating and extracting their data onsite.
The natural next step is to take real-world data curation beyond the walls of a single healthcare provider or institution and generate mass-scale data aggregation across multiple providers, enhancing the richness and detail of population data without compromising data privacy. This ability to compile greater volumes of data is invaluable when one considers the previous example of lung cancer patients. Despite lung cancer being the leading cause of cancer-related death, researchers often lack access to adequate data to conduct the highest quality of investigation for this population cohort.
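One architecture consistent with this goal is federated aggregation: each site runs the curation engine locally, behind its own firewall, and shares only aggregate statistics, never patient-level records. A minimal sketch, assuming each site has already produced a curated local dataset (function and field names are hypothetical):

```python
from collections import Counter

def local_summary(curated_rows: list[dict]) -> Counter:
    """Runs inside each site's firewall; emits only aggregate counts."""
    return Counter(row["stage"] for row in curated_rows if row["stage"])

def pooled_summary(site_summaries: list[Counter]) -> Counter:
    """The central step sees counts only, never patient-level data."""
    total = Counter()
    for summary in site_summaries:
        total += summary
    return total

site_a = local_summary([{"stage": "IV"}, {"stage": "III"}, {"stage": "IV"}])
site_b = local_summary([{"stage": "IV"}, {"stage": "II"}])
print(pooled_summary([site_a, site_b]))
# -> Counter({'IV': 3, 'III': 1, 'II': 1})
```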
To help resolve such challenges, Pentavere is partnering with multiple healthcare stakeholders to enable data aggregation at a larger scope and scale. One collaboration is with the Canadian Personalized Health Innovation Network (CPHIN), a pan-Canadian health network focused on generating real-world evidence from across the country for use by researchers, clinicians and government.
By introducing Pentavere’s data enrichment engine to CPHIN sites across the country, the network can aggregate data on a scale that has not been possible in the past, given the complexity and significant expense of creating large datasets. Now the Canadian healthcare system can come together to use the power of data to accelerate system transformation and improve health outcomes for all Canadians.
Transforming clinician-patient interactions
While Pentavere’s approach to data curation enables more cost-effective research techniques and supports the creation of readily verified, auditable data for regulatory decision-making and approvals, perhaps the most powerful benefit of this technology lies in its ability to transform the way clinicians can serve their patients.
Clinicians today are constantly faced with the scenario in which a patient asks, “What are the likely outcomes of this treatment for patients like me?” While the physician can offer their best opinion, based on their years of experience, recall of similar cases, and reference to clinical trial data, they are unable to answer the patient’s question with data specific to patients who share the same clinical variables. With Pentavere’s approach, clinicians could access specific, tailored data that enhances both patient engagement and the effectiveness of treatment.
In summary, there is great potential in the application of real-world data curation. It can dramatically lower the costs of healthcare data collection and enhance the scope, scale and specificity of data available for study, which will enable more precise, accurate and high-quality research.
Ultimately, this technique can place more relevant, meaningful information in the hands of dedicated clinicians as they remain steadfastly focused on providing the best level of care to their patients.