0.1 Introduction

To adequately assess the data quality of a clinical dataset, the following overarching attributes about the data as a whole should be taken into account:

  1. Cohort definition: the population that is being studied, as characterized by questions such as:

    • Adult or pediatric population?
    • What are the inclusion criteria?
    • What are the exclusion criteria?

An example of a cohort definition is “All the adult HIV-positive patients that were admitted to the ICU at MSK”.

  2. Timeline: the timeframe that the study covers. Extending the example above, the study could examine the clinical outcomes of HIV-positive patients 1 year after their first HIV-related hospitalization.

Cohort Definition and Timeline are the minimum knowledge required for a data-driven research study, and confirmation of both should be the first step in the QA process.

0.2 Principles

0.2.1 Datatype QA Matrix

Once the preliminary research study characterization is established, the Framework takes the R class of each field in the dataset. Each R class is associated with a unique data quality pipeline that governs the permitted values within the given field. For the purposes of this brief demonstration, rules are applied to a limited set of datatypes: identifier, category, text, float, integer, date, and datetime. In reality, many additional datatypes may exist, such as time for a timestamp without a date, or large integers that could be considered the equivalent of bigint in SQL.

In addition to the datatypes, there are two types of flags:

  • Hard Flag: a clinically impossible value. For example, a birthdate that takes place in the future is clinically impossible.
  • Soft Flag: a clinically improbable value, such as an extremely high white blood cell count of 30000.

Matrix that maps the datatype of a field in a dataframe to its R object class (r_class). Each field has a single datatype, which is constrained to a specific R object class and carries a set of additional rules (rule). A soft_flag marks a clinically improbable value that requires confirmation that it is not due to an error such as a typo. A hard_flag marks a clinically impossible value that must be corrected, such as a date that occurs in the future. Note that a matrix such as this one requires regular refinement to accommodate nuances. For example, a dataset may include a field such as date of next appointment, which legitimately contains future dates. In that case, a new datatype that allows future dates may need to be introduced to avoid hard flags on acceptable values.
| datatype   | r_class                 | rule | soft_flag | hard_flag |
| identifier | factor                  | value in established valueset | | value not in valueset |
| category   | factor                  | value in established constraints | | value not in constraints |
| text       | character               | | | |
| float      | c("numeric", "double")  | only numeric characters with maximum 1 decimal point | greater or less than 2.5 standard deviations from the mean | |
| integer    | integer                 | only numeric characters with 0 decimal point | greater or less than 2.5 standard deviations from the mean | |
| date       | Date                    | maximum 8 and minimum 4 numeric characters, maximum 2 punctuation characters | greater or less than 2.5 standard deviations from the mean | future date |
| datetime   | POSIXct                 | maximum 6 numeric characters, maximum 2 punctuation characters | | times greater than or equal to 24:00 |
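
As a rough illustration of how the datatype-to-class part of this matrix could be held in code, the sketch below stores it as a data frame and checks a vector against the expected class. The names qa_matrix, expected_class, and has_expected_class are hypothetical and not part of the Framework. The rule and flag columns are omitted for brevity, and float is simplified to the single class "numeric".

```r
# Hypothetical encoding of the datatype -> r_class columns of the matrix.
# Assumption: float is reduced to "numeric" (class() reports "numeric" for doubles).
qa_matrix <- data.frame(
  datatype = c("identifier", "category", "text", "float",
               "integer", "date", "datetime"),
  r_class  = c("factor", "factor", "character", "numeric",
               "integer", "Date", "POSIXct"),
  stringsAsFactors = FALSE
)

# Look up the R class that a field of a given datatype is expected to have.
expected_class <- function(datatype) {
  qa_matrix$r_class[match(datatype, qa_matrix$datatype)]
}

# Check whether a vector already has the class required by its datatype.
has_expected_class <- function(x, datatype) {
  inherits(x, expected_class(datatype))
}

has_expected_class(as.Date("2020-03-05"), "date")    # TRUE
has_expected_class(c("MALE", "FEMALE"), "category")  # FALSE: character, not factor
```

A field that fails this check, such as a category read in as character, would need to be coerced before its datatype pipeline is run.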


It is important to note that the category datatype requires special handling. Each field of this datatype is defined by a vector of permissible values, also known as a valueset. A prerequisite for a quality process on this datatype is therefore knowing the valueset that constrains the range of possible values. From a maintenance perspective, category fields require iterative updates as new allowable values are added to the valueset.
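
The valueset check and its iterative maintenance can be sketched as follows. This is a minimal illustration under stated assumptions: the function flag_not_in_valueset and the sample gender vector are hypothetical, and missing values are deliberately left unflagged.

```r
# Hypothetical valueset for a category field.
gender_valueset <- c("FEMALE", "MALE")

# Return the distinct values that fall outside the valueset (hard flags).
# Assumption: NA is treated as missing and is not flagged here.
flag_not_in_valueset <- function(x, valueset) {
  observed <- unique(as.character(x[!is.na(x)]))
  setdiff(observed, valueset)
}

gender <- c("MALE", "FEMALE", "male", NA, "UNKNOWN")
flag_not_in_valueset(gender, gender_valueset)  # "male" "UNKNOWN"

# A value confirmed as legitimate on review is appended to the valueset,
# so that it is no longer flagged on the next run.
gender_valueset <- union(gender_valueset, "UNKNOWN")
flag_not_in_valueset(gender, gender_valueset)  # "male"
```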

The purpose of applying these fundamental rules is to confirm the data integrity before moving forward with more complex data quality rules. For example, a more complex rule may be one where the date of death must be preceded by a date of birth. However, it is the foundational rule on the datatype date that first confirms that all of the date of death and date of birth data are in a parseable format, that values falling outside the 95th percentile in either direction have been reviewed and confirmed, and that particular attention has been paid to any date of death or date of birth that occurs in the future.
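
The ordering described above, foundational checks first and cross-field rules second, can be sketched as below. The patients data frame, its column names, and the assumed YYYY-MM-DD input format are all hypothetical.

```r
# Hypothetical records; column names and values are illustrative only.
patients <- data.frame(
  person_id     = 1:3,
  date_of_birth = c("1950-02-11", "1988-07-30", "2001-12-01"),
  date_of_death = c("2019-05-04", "1980-01-15", NA),
  stringsAsFactors = FALSE
)

# Stage 1 (foundational): every non-missing value must parse as a date.
# Assumption: source dates are written as YYYY-MM-DD.
parse_strict <- function(x) as.Date(x, format = "%Y-%m-%d")
birth <- parse_strict(patients$date_of_birth)
death <- parse_strict(patients$date_of_death)
unparseable <- (is.na(birth) & !is.na(patients$date_of_birth)) |
  (is.na(death) & !is.na(patients$date_of_death))

# Stage 2 (complex): death may not precede birth. This comparison is only
# meaningful once stage 1 has shown that both columns hold real dates.
death_before_birth <- !is.na(birth) & !is.na(death) & death < birth

patients$person_id[unparseable]         # integer(0)
patients$person_id[death_before_birth]  # 2
```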

0.2.2 Surveying Source Data

Following the matrix above, each field in the clinical dataset should be surveyed and assigned one of the datatypes. Mapping a source data field to a datatype can be demonstrated using the 100 sample records from the Condition Occurrence table in Polyester, a database of synthetic clinical data generated by Synthea and ETL’d into the OMOP Common Data Model.

## tibble [1,205 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ person_id           : num [1:1205] 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender              : chr [1:1205] "MALE" "MALE" "MALE" "MALE" ...
##  $ condition_start_date: Date[1:1205], format: "2020-03-05" "2016-04-13" ...
##  $ condition_end_date  : Date[1:1205], format: "2020-03-19" "2016-11-10" ...
##  $ condition_source    : chr [1:1205] "Fatigue" "Otitis media" "Loss of taste" "Streptococcal sore throat" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   person_id = col_double(),
##   ..   gender = col_character(),
##   ..   condition_start_date = col_date(format = ""),
##   ..   condition_end_date = col_date(format = ""),
##   ..   condition_source = col_character()
##   .. )

The survey above serves as a guide for assigning the appropriate datatype to each column. A few notes illustrate the challenges faced when applying data quality rules to real-world clinical data:

  1. Though the gender field data was read into the R environment as a character class, the result of our survey indicates that it should be of the factor class and thus, be assigned the category datatype.
  2. Another similar example that is debatable is whether or not to consider condition_source a text or category datatype. This would depend on whether this field came from an abstraction or from a structured EHR data capture. Here, I have chosen to treat it as a category.
Map between the fields in the sample of 100 records from the Condition Occurrence table in Polyester and the assigned datatype.
| field                | datatype   |
| person_id            | identifier |
| gender               | category   |
| condition_start_date | date       |
| condition_end_date   | date       |
| condition_source     | category   |
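
To make the survey and the resulting class change concrete, the sketch below surveys the class of each column in a small stand-in data frame and then coerces the identifier and category fields to factor. The three rows are hypothetical, not taken from Polyester, and field_map is an illustrative name.

```r
# Hypothetical stand-in for a few rows of the sample data.
condition_occurrence <- data.frame(
  person_id            = c(1, 1, 2),
  gender               = c("MALE", "MALE", "FEMALE"),
  condition_start_date = as.Date(c("2020-03-05", "2016-04-13", "2019-01-02")),
  condition_end_date   = as.Date(c("2020-03-19", "2016-11-10", NA)),
  condition_source     = c("Fatigue", "Otitis media", "Cough"),
  stringsAsFactors = FALSE
)

# Survey: the class each column was read in as.
vapply(condition_occurrence, function(col) class(col)[1], character(1))

# Assigned datatypes, mirroring the field-to-datatype map.
field_map <- c(
  person_id = "identifier", gender = "category",
  condition_start_date = "date", condition_end_date = "date",
  condition_source = "category"
)

# Coerce identifier and category fields to factor, the class the matrix expects.
to_factor <- names(field_map)[field_map %in% c("identifier", "category")]
condition_occurrence[to_factor] <- lapply(condition_occurrence[to_factor], factor)
vapply(condition_occurrence, function(col) class(col)[1], character(1))
```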


0.3 Application

To apply the data quality rules, the fields can be grouped by assigned datatype, and each grouping can be sent to its respective data quality pipeline as defined by the rule in the datatype matrix.

## $category
## [1] "gender"           "condition_source"
## 
## $date
## [1] "condition_start_date" "condition_end_date"  
## 
## $identifier
## [1] "person_id"
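
One way to produce a grouping of this shape in base R is split(), followed by a dispatch on the datatype name. This is a sketch only: the pipelines list and its placeholder functions are hypothetical stand-ins for the real data quality pipelines.

```r
# Field-to-datatype map, as assigned in the survey.
field_map <- data.frame(
  field    = c("person_id", "gender", "condition_start_date",
               "condition_end_date", "condition_source"),
  datatype = c("identifier", "category", "date", "date", "category"),
  stringsAsFactors = FALSE
)

# Group field names by datatype; list names come out in alphabetical order.
fields_by_datatype <- split(field_map$field, field_map$datatype)
fields_by_datatype$category  # "gender" "condition_source"

# Hypothetical dispatch: each datatype name selects its own pipeline function.
pipelines <- list(
  category   = function(fields) sprintf("valueset check on %d field(s)", length(fields)),
  date       = function(fields) sprintf("future-date check on %d field(s)", length(fields)),
  identifier = function(fields) sprintf("completeness check on %d field(s)", length(fields))
)
results <- Map(function(datatype, fields) pipelines[[datatype]](fields),
               names(fields_by_datatype), fields_by_datatype)
results$date  # "future-date check on 2 field(s)"
```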

0.3.1 Datatype: Identifier

0.3.1.1 Hard Flag

Fields of datatype identifier should have a number of unique values equal to the expected sample size. The programmatic rule confirms that all of the expected unique identifiers are found in the data. In this example, the unique identifiers are the sequence from 1 to 100 (1:100).

## $person_id
## [1] "54"

0.3.1.2 Results

Condition Occurrence failed the hard flag that dictates that every value in the predefined valueset must be found in the identifier field. In this case, the person_id 54 is missing from the data, triggering a hard flag to ensure that 54 is brought back into the source dataset.
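
The identifier check can be sketched with setdiff(). The person_id vector below is a hypothetical reconstruction in which 54 is absent. The comparison is done on character values, which is an assumption made so that the result is printed in the same form as the output above.

```r
# Hypothetical identifier column: ids 1..100 are expected, 54 never appears.
person_id    <- rep(setdiff(1:100, 54), times = 2)
expected_ids <- 1:100

# Hard flag: expected identifiers that are not found in the data.
missing_ids <- setdiff(as.character(expected_ids),
                       as.character(unique(person_id)))
missing_ids  # "54"

# Companion check: observed identifiers that are outside the expected set.
unexpected_ids <- setdiff(as.character(unique(person_id)),
                          as.character(expected_ids))
unexpected_ids  # character(0)
```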

0.3.2 Datatype: Date

0.3.2.1 Hard Flag

Fields of datatype date should not have any dates that occur in the future.

## $condition_start_date
## Date of length 0
## 
## $condition_end_date
## Date of length 0

0.3.2.2 Results

Condition Occurrence passed the hard flag on the date datatype: neither date field contains a date that occurs after today.
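
A minimal sketch of the future-date check is shown below. The function name future_dates and the sample dates are hypothetical. The today argument defaults to Sys.Date() and is exposed only so that the check can be reproduced against a fixed reference date.

```r
# Return the values of a Date vector that fall after the reference date.
# Missing dates are ignored rather than flagged.
future_dates <- function(x, today = Sys.Date()) {
  x[!is.na(x) & x > today]
}

# Hypothetical date fields.
dates <- list(
  condition_start_date = as.Date(c("2020-03-05", "2016-04-13")),
  condition_end_date   = as.Date(c("2020-03-19", NA))
)

# Each element is a Date vector of length 0 when no hard flag is triggered.
lapply(dates, future_dates)

# A deliberately implausible value shows what a triggered flag returns.
future_dates(as.Date(c("2020-03-05", "2999-01-01")))
```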

0.3.2.3 Soft Flag

## $condition_start_date
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "1928-09-08" "2010-08-14" "2017-10-07" "2010-01-19" "2020-03-06" "2020-12-19" 
## 
## $condition_end_date
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "1945-04-12" "2015-05-11" "2020-03-10" "2016-03-20" "2020-03-28" "2020-11-29" 
##         NA's 
##        "372"

0.3.2.4 Results

Condition Occurrence passed the soft flag that dictates that all the values outside the 95th percentile should be reviewed for accuracy.
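
As an illustration of a soft flag on dates, the sketch below uses the threshold given in the datatype matrix, 2.5 standard deviations from the mean, computed on the dates converted to day counts. The function soft_flag_dates and the sample vector are hypothetical. A percentile-based cutoff could be substituted by replacing the threshold logic with quantile().

```r
# Return the dates lying more than n_sd standard deviations from the mean.
soft_flag_dates <- function(x, n_sd = 2.5) {
  days   <- as.numeric(x)            # days since 1970-01-01
  centre <- mean(days, na.rm = TRUE)
  spread <- sd(days, na.rm = TRUE)
  x[!is.na(x) & abs(days - centre) > n_sd * spread]
}

# Hypothetical field: 20 consecutive days in March 2020 plus one early date.
condition_start_date <- c(
  seq(as.Date("2020-03-01"), by = "day", length.out = 20),
  as.Date("1928-09-08")
)

summary(condition_start_date)          # overview used for manual review
soft_flag_dates(condition_start_date)  # "1928-09-08"
```

Flagged values are candidates for review, not errors: a condition that started in 1928 may be correct for an elderly patient.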

0.3.3 Datatype: Category

The category datatype requires the most user feedback in this framework, since it depends on a set of predetermined constraints that will ultimately determine the degree of flagging that occurs. In this sample data, there are 2 fields of this datatype: gender and condition_source. Each will require a vector of constraints for the QA process.

0.3.3.1 Hard Flag

Fields of datatype category should fall within their constraints.

## $gender
## [1] "FEMALE" "MALE"  
## 
## $condition_source
##   [1] "Acute bacterial sinusitis"                          
##   [2] "Acute bronchitis"                                   
##   [3] "Acute deep venous thrombosis"                       
##   [4] "Acute pulmonary embolism"                           
##   [5] "Acute respiratory distress syndrome"                
##   [6] "Acute respiratory failure"                          
##   [7] "Acute viral pharyngitis"                            
##   [8] "Alzheimer's disease"                                
##   [9] "Anemia"                                             
##  [10] "Appendicitis"                                       
##  [11] "Atrial fibrillation"                                
##  [12] "Blighted ovum"                                      
##  [13] "Carcinoma in situ of prostate"                      
##  [14] "Cardiac arrest"                                     
##  [15] "Cerebrovascular accident"                           
##  [16] "Childhood asthma"                                   
##  [17] "Chill"                                              
##  [18] "Chronic congestive heart failure"                   
##  [19] "Chronic intractable migraine without aura"          
##  [20] "Chronic kidney disease stage 1"                     
##  [21] "Chronic low back pain"                              
##  [22] "Chronic neck pain"                                  
##  [23] "Chronic pain"                                       
##  [24] "Chronic sinusitis"                                  
##  [25] "Closed fracture of hip"                             
##  [26] "Complete miscarriage"                               
##  [27] "Concussion with loss of consciousness"              
##  [28] "Concussion with no loss of consciousness"           
##  [29] "Contact dermatitis"                                 
##  [30] "Coronary arteriosclerosis"                          
##  [31] "Cough"                                              
##  [32] "Diarrhea symptom"                                   
##  [33] "Disease caused by 2019-nCoV"                        
##  [34] "Disorder of kidney due to diabetes mellitus"        
##  [35] "Drug overdose"                                      
##  [36] "Dyspnea"                                            
##  [37] "Eclampsia in pregnancy"                             
##  [38] "Emphysematous bronchitis"                           
##  [39] "Epidermal burn of skin"                             
##  [40] "Epilepsy"                                           
##  [41] "Escherichia coli urinary tract infection"           
##  [42] "Essential hypertension"                             
##  [43] "Facial laceration"                                  
##  [44] "Familial Alzheimer's disease of early onset"        
##  [45] "Fatigue"                                            
##  [46] "Fever"                                              
##  [47] "Fibromyalgia"                                       
##  [48] "Fracture of ankle"                                  
##  [49] "Fracture of clavicle"                               
##  [50] "Fracture of forearm"                                
##  [51] "Fracture of rib"                                    
##  [52] "Headache"                                           
##  [53] "Heart failure"                                      
##  [54] "Hemoptysis"                                         
##  [55] "Hyperglycemia"                                      
##  [56] "Hyperlipidemia"                                     
##  [57] "Hypertriglyceridemia"                               
##  [58] "Hypoxemia"                                          
##  [59] "Idiopathic atrophic hypothyroidism"                 
##  [60] "Impacted molars"                                    
##  [61] "Injury of kidney"                                   
##  [62] "Injury of tendon of the rotator cuff of shoulder"   
##  [63] "Joint pain"                                         
##  [64] "Laceration of foot"                                 
##  [65] "Laceration of forearm"                              
##  [66] "Laceration of hand"                                 
##  [67] "Laceration of thigh"                                
##  [68] "Localized, primary osteoarthritis of the hand"      
##  [69] "Loss of taste"                                      
##  [70] "Major depressive disorder"                          
##  [71] "Malignant tumor of breast"                          
##  [72] "Malignant tumor of colon"                           
##  [73] "Metabolic syndrome X"                               
##  [74] "Miscarriage in first trimester"                     
##  [75] "Muscle pain"                                        
##  [76] "Myocardial infarction"                              
##  [77] "Nasal congestion"                                   
##  [78] "Nausea"                                             
##  [79] "Neoplasm of prostate"                               
##  [80] "Neuropathy due to type 2 diabetes mellitus"         
##  [81] "Non-small cell carcinoma of lung, TNM stage 1"      
##  [82] "Non-small cell lung cancer"                         
##  [83] "Normal pregnancy"                                   
##  [84] "Opioid abuse"                                       
##  [85] "Osteoarthritis of hip"                              
##  [86] "Osteoarthritis of knee"                             
##  [87] "Osteoporosis"                                       
##  [88] "Otitis media"                                       
##  [89] "Partial thickness burn"                             
##  [90] "Pathological fracture due to osteoporosis"          
##  [91] "Perennial allergic rhinitis"                        
##  [92] "Perennial allergic rhinitis with seasonal variation"
##  [93] "Pneumonia"                                          
##  [94] "Polyp of colon"                                     
##  [95] "Pre-eclampsia"                                      
##  [96] "Prediabetes"                                        
##  [97] "Pulmonary emphysema"                                
##  [98] "Recurrent rectal polyp"                             
##  [99] "Respiratory distress"                               
## [100] "Retinopathy due to type 2 diabetes mellitus"        
## [101] "Rupture of appendix"                                
## [102] "Seizure disorder"                                   
## [103] "Sepsis caused by virus"                             
## [104] "Septic shock"                                       
## [105] "Sinusitis"                                          
## [106] "Sore throat symptom"                                
## [107] "Sprain of ankle"                                    
## [108] "Sprain of wrist"                                    
## [109] "Sputum finding"                                     
## [110] "Streptococcal sore throat"                          
## [111] "Tubal pregnancy"                                    
## [112] "Type 2 diabetes mellitus"                           
## [113] "Viral sinusitis"                                    
## [114] "Vomiting symptom"                                   
## [115] "Wheezing"                                           
## [116] "Whiplash injury to neck"

## $condition_source
##   [1] "Disease caused by 2019-nCoV"                        
##   [2] "Laceration of thigh"                                
##   [3] "Diarrhea symptom"                                   
##   [4] "Nausea"                                             
##   [5] "Vomiting symptom"                                   
##   [6] "Chronic sinusitis"                                  
##   [7] "Polyp of colon"                                     
##   [8] "Osteoarthritis of knee"                             
##   [9] "Partial thickness burn"                             
##  [10] "Laceration of forearm"                              
##  [11] "Myocardial infarction"                              
##  [12] "Essential hypertension"                             
##  [13] "Whiplash injury to neck"                            
##  [14] "Sputum finding"                                     
##  [15] "Chill"                                              
##  [16] "Concussion with loss of consciousness"              
##  [17] "Concussion with no loss of consciousness"           
##  [18] "Seizure disorder"                                   
##  [19] "Nasal congestion"                                   
##  [20] "Prediabetes"                                        
##  [21] "Sore throat symptom"                                
##  [22] "Pneumonia"                                          
##  [23] "Acute pulmonary embolism"                           
##  [24] "Hypoxemia"                                          
##  [25] "Laceration of foot"                                 
##  [26] "Anemia"                                             
##  [27] "Respiratory distress"                               
##  [28] "Sinusitis"                                          
##  [29] "Miscarriage in first trimester"                     
##  [30] "Dyspnea"                                            
##  [31] "Wheezing"                                           
##  [32] "Escherichia coli urinary tract infection"           
##  [33] "Osteoporosis"                                       
##  [34] "Cerebrovascular accident"                           
##  [35] "Sprain of wrist"                                    
##  [36] "Pulmonary emphysema"                                
##  [37] "Opioid abuse"                                       
##  [38] "Normal pregnancy"                                   
##  [39] "Complete miscarriage"                               
##  [40] "Sprain of ankle"                                    
##  [41] "Blighted ovum"                                      
##  [42] "Drug overdose"                                      
##  [43] "Chronic intractable migraine without aura"          
##  [44] "Chronic pain"                                       
##  [45] "Contact dermatitis"                                 
##  [46] "Chronic low back pain"                              
##  [47] "Impacted molars"                                    
##  [48] "Chronic neck pain"                                  
##  [49] "Alzheimer's disease"                                
##  [50] "Osteoarthritis of hip"                              
##  [51] "Chronic congestive heart failure"                   
##  [52] "Eclampsia in pregnancy"                             
##  [53] "Muscle pain"                                        
##  [54] "Sepsis caused by virus"                             
##  [55] "Joint pain"                                         
##  [56] "Appendicitis"                                       
##  [57] "Hyperlipidemia"                                     
##  [58] "Rupture of appendix"                                
##  [59] "Localized, primary osteoarthritis of the hand"      
##  [60] "Atrial fibrillation"                                
##  [61] "Fracture of ankle"                                  
##  [62] "Cardiac arrest"                                     
##  [63] "Tubal pregnancy"                                    
##  [64] "Perennial allergic rhinitis with seasonal variation"
##  [65] "Recurrent rectal polyp"                             
##  [66] "Malignant tumor of colon"                           
##  [67] "Laceration of hand"                                 
##  [68] "Neoplasm of prostate"                               
##  [69] "Type 2 diabetes mellitus"                           
##  [70] "Carcinoma in situ of prostate"                      
##  [71] "Disorder of kidney due to diabetes mellitus"        
##  [72] "Metabolic syndrome X"                               
##  [73] "Hypertriglyceridemia"                               
##  [74] "Emphysematous bronchitis"                           
##  [75] "Chronic kidney disease stage 1"                     
##  [76] "Acute respiratory failure"                          
##  [77] "Acute deep venous thrombosis"                       
##  [78] "Fracture of forearm"                                
##  [79] "Non-small cell carcinoma of lung, TNM stage 1"      
##  [80] "Major depressive disorder"                          
##  [81] "Non-small cell lung cancer"                         
##  [82] "Idiopathic atrophic hypothyroidism"                 
##  [83] "Headache"                                           
##  [84] "Childhood asthma"                                   
##  [85] "Fibromyalgia"                                       
##  [86] "Coronary arteriosclerosis"                          
##  [87] "Retinopathy due to type 2 diabetes mellitus"        
##  [88] "Neuropathy due to type 2 diabetes mellitus"         
##  [89] "Hyperglycemia"                                      
##  [90] "Familial Alzheimer's disease of early onset"        
##  [91] "Epidermal burn of skin"                             
##  [92] "Epilepsy"                                           
##  [93] "Hemoptysis"                                         
##  [94] "Perennial allergic rhinitis"                        
##  [95] "Acute respiratory distress syndrome"                
##  [96] "Septic shock"                                       
##  [97] "Heart failure"                                      
##  [98] "Injury of kidney"                                   
##  [99] "Injury of tendon of the rotator cuff of shoulder"   
## [100] "Facial laceration"                                  
## [101] "Fracture of rib"                                    
## [102] "Pre-eclampsia"                                      
## [103] "Pathological fracture due to osteoporosis"          
## [104] "Closed fracture of hip"                             
## [105] "Malignant tumor of breast"                          
## [106] "Fracture of clavicle"