Phenotypes, Data Standards, and Data Quality


Using EHR data for clinical research requires not only a comprehensive understanding of syntactic and semantic interoperability, but also valid approaches for identifying clinical conditions. This necessitates collaboration among clinicians, EHR experts, and informaticians in developing valid algorithms to identify clinical conditions that meet the needs of research planning and protocols. For example, a recorded diagnosis of diabetes does not by itself indicate that a patient has diabetes; it could instead reflect a suspicion of diabetes that must be documented in order to order the appropriate tests. There are many valid ways to identify such a patient (2 diagnoses separated by 3 months, a diagnosis coupled with a prescription for a diabetes medication, or a diagnosis by an endocrinologist), and understanding the pros and cons of those approaches is necessary to use EHRs effectively for constructing phenotypes. We hope to accomplish the following:

  • Develop phenotype definitions. We will work with the Demonstration Projects, the NIH, and the investigator community to identify phenotypes of interest across projects, develop a library of computable definitions and algorithms to enable phenotyping for the most common and important conditions, and develop and test phenotype algorithms that can be used within and across projects. These definitions will be based on existing literature or developed de novo in collaboration with project teams. Because computable definitions require specific and precise definitions, we will use standard data elements from public repositories (e.g., caDSR, USHIK) that are most likely to be collected in health care settings; i.e., those linked to standard EHR profiles and/or used for meaningful use or required reporting. Where data elements are not yet standardized, we will initiate and steward the development process through the HL7 Clinical Interoperability Council and make the data elements available in public data element registries. We will build on our experience in identifying and reviewing phenotype definitions for Mini-Sentinel, our work in EHRs to identify infectious diseases and other conditions, and the body of computable definitions currently under development for the MURDOCK study. Our work with FDA in developing definitions for phenotypes and outcomes will add an important dimension to the adoption and sustainability of the work of the Collaboratory. Importantly, as phenotype definitions are developed, the common data model underlying the Distributed Research Network will be expanded as necessary to include the required data elements.

  • Identify data validation best practices. We will identify best practices in EHR data use and disseminate those practices to participating clinical systems. Appropriate use of EHRs for clinical research requires comprehensive data characterization and data quality assessment that enable investigators to match a research question to data of appropriate quality. These in turn require input from clinicians, data experts, and informaticians who together can identify data capture issues, relationships between data capture and use, and within-system variation. We will leverage ongoing research in information quality assessment and electronic indicators of information quality (1K99LM011128), as well as statistical approaches developed and employed by Mini-Sentinel and the Observational Medical Outcomes Partnership. We will consult with each project regarding best practices for identifying and addressing data quality issues before they become a barrier to valid research inferences.

  • Store generalizable definitions and best practices in an accessible format. We will work with standards organizations to iteratively improve data collection in health systems so that it contributes to the learning health system. As the identification of phenotypes becomes clearer, organizations will be motivated to improve their internal vocabularies and discipline at the point of care. Doing so will allow them not only to contribute more to the national research agenda, but also to provide more accurate knowledge for their own quality efforts and for planning population and individualized management, which will increasingly be driven by matching generalized risk and specific modifiable risks to particular therapies or preventive measures.

  • Use standards organizations to move these measures into practice. Standards permit disparate groups to exchange data without one-on-one agreements, and also obviate the need for each group to discover and develop its own structures, policies, agreements, and processes. We will develop a suite of standards that encompasses what is required and appropriate for the collaborating center. In some cases, we will identify existing standards for CC use. In other cases, we will build on preliminary work to create the required standard. Most importantly, we will formalize these standards through existing ANSI-accredited Standards Developing Organizations (SDOs). Standards identified or created will include HL7 v2.n and the Clinical Document Architecture, decision-support standards, data element standards derived through domain analysis models, phenotype algorithms, registry standards, biobank standards, and others. Key to making these standards usable will be the development of implementation guides that will be balloted and approved through existing SDOs. These implementation guides will define what standards are to be used, what data elements will be exchanged, and what format and coding systems will be used in those exchanges.
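
To make the idea of a computable phenotype definition concrete, the diabetes example from the introduction can be sketched in Python. This is a minimal illustration under stated assumptions, not the Collaboratory's algorithm: the `Encounter` structure, its field names, the rule functions, and the reading of "separated by 3 months" as at least 90 days are all hypothetical choices made for this sketch.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List


@dataclass
class Encounter:
    """One coded event in a patient's record (hypothetical structure)."""
    when: date
    diabetes_dx: bool = False      # a diabetes diagnosis code was recorded
    diabetes_rx: bool = False      # a diabetes medication was prescribed
    provider_specialty: str = ""   # e.g. "endocrinology"


def two_diagnoses_apart(encounters: List[Encounter], min_gap_days: int = 90) -> bool:
    """Rule 1: two diagnoses at least ~3 months apart (90 days is an assumption)."""
    dx_dates = sorted(e.when for e in encounters if e.diabetes_dx)
    return len(dx_dates) >= 2 and dx_dates[-1] - dx_dates[0] >= timedelta(days=min_gap_days)


def diagnosis_with_medication(encounters: List[Encounter]) -> bool:
    """Rule 2: a diagnosis plus a prescription for a diabetes medication."""
    return any(e.diabetes_dx for e in encounters) and any(e.diabetes_rx for e in encounters)


def diagnosis_by_endocrinologist(encounters: List[Encounter]) -> bool:
    """Rule 3: a diagnosis recorded by an endocrinologist."""
    return any(e.diabetes_dx and e.provider_specialty == "endocrinology" for e in encounters)


def classify(encounters: List[Encounter]) -> Dict[str, bool]:
    """Report which candidate definitions a patient satisfies."""
    return {
        "two_diagnoses_apart": two_diagnoses_apart(encounters),
        "diagnosis_with_medication": diagnosis_with_medication(encounters),
        "diagnosis_by_endocrinologist": diagnosis_by_endocrinologist(encounters),
    }


# A single diagnosis, e.g. one documented only so that tests could be ordered.
suspected = [Encounter(date(2012, 1, 10), diabetes_dx=True)]

# Two diagnoses more than 90 days apart, the second with a prescription.
confirmed = [
    Encounter(date(2012, 1, 10), diabetes_dx=True),
    Encounter(date(2012, 5, 2), diabetes_dx=True, diabetes_rx=True),
]

print(classify(suspected))
print(classify(confirmed))
```

Reporting each rule separately, rather than a single yes/no answer, keeps the trade-offs between definitions visible, so a project can choose the rule (or combination) that suits its protocol.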

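The data characterization described under data validation best practices can likewise be illustrated with a small sketch. It shows two common kinds of checks, completeness of a field and plausibility of its values, and compares completeness across sites as a simple view of within-system variation. The record layout, the `hba1c` field name, the site labels, and the plausible range used here are illustrative assumptions, not values taken from the Collaboratory, Mini-Sentinel, or OMOP.

```python
from typing import Dict, List, Optional

# A record maps field names to values; None marks a missing value (assumed layout).
Record = Dict[str, Optional[float]]


def completeness(records: List[Record], field: str) -> float:
    """Fraction of records in which the field is present and not None."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(field) is not None)
    return present / len(records)


def out_of_range(records: List[Record], field: str, low: float, high: float) -> List[Record]:
    """Records whose value for the field falls outside an assumed plausible range."""
    flagged = []
    for r in records:
        value = r.get(field)
        if value is not None and not (low <= value <= high):
            flagged.append(r)
    return flagged


def completeness_by_site(records_by_site: Dict[str, List[Record]], field: str) -> Dict[str, float]:
    """Completeness per site, to expose variation in data capture."""
    return {site: completeness(recs, field) for site, recs in records_by_site.items()}


# Hypothetical data: site A has one missing and one implausible value.
site_a = [{"hba1c": 6.8}, {"hba1c": None}, {"hba1c": 71.0}]
site_b = [{"hba1c": 7.2}, {"hba1c": 5.9}]

print(completeness_by_site({"A": site_a, "B": site_b}, "hba1c"))
print(out_of_range(site_a, "hba1c", low=3.0, high=20.0))
```

Checks like these only surface candidates for review; deciding whether a flagged value reflects a data capture problem still requires the clinical and informatics input described above.
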
Collaboratory Core and Working Group Members

Core/Working Group   Members   Administrative Contacts
Phenotypes, Data Standards, and Data Quality Core
Ed Hammond, Co-Chair
Meredith Nahm, Co-Chair
Rachel Richesson, Co-Chair
Jerry Sheehan, NIH Representative

Alan Bauck
Denise Cifelli
Greg Simon
John Dickerson
Jennifer Robinson
Chris Helker
Cindy Kluchar

Rosemary Madigan
Renee Pridgen
Shelley Rusincovitch

Michelle Smerek
Sandi McDanel

Clairellen Miller

Naomi Pratt

Michelle Smerek