COVID-19 Data Harmonization
🏆 I received the 2020 NIAID Merit Award from the office of Dr. Anthony Fauci for my work on this project!
Overview
Background
As the COVID-19 pandemic raged on, hospitals around the world generated massive amounts of clinical patient data. The National Institutes of Health (NIH) received these international datasets for lifesaving research purposes in sporadic, non-standard deliveries. Initially, our team of bioinformaticians and developers set up ad-hoc data processing solutions.
Clinicians and researchers were scrambling to get their hands on the valuable data, to help diagnose patients and understand the inflammatory disorders that COVID-19 causes. The influx of data necessitated a standardized data processing pipeline, and a central source to access and analyze it.
Role
As part of the COVID Data Management Task Force under Dr. Fauci, I acted as a liaison between developers, analysts, and clinicians, gathering stakeholder needs. I also functioned as a developer and data analyst.
Goal
Craft a seamless clinical data access and analysis experience, minimizing manual human error, and enabling healthcare practitioners to focus on lifesaving COVID-19 research.
Solution
Automated data cleaning, validation, upload, and visualization pipeline.
Problem
Nurses in our data validation team had to scan through each questionnaire and manually enter patient data into our COVID-19 research database, LabKey.
Manual entry caused errors, and cost nurses and healthcare IT valuable time.
Current State
Nurses in our data validation team had to scan through each questionnaire and manually enter patient data into our COVID-19 research database, LabKey.
Manual entry caused errors, and cost nurses and healthcare IT valuable time.
Task
Design an automated data workflow to
Aggregate all questionnaires into one file
Provide nurses with one source to review all received data
Automate the upload to LabKey
Each data type needed to be pulled from the questionnaires into separate tables, corresponding to LabKey’s table structure.
-
Demographics (Ethnicity, Age, Sex…)
Lifestyle (Smoker, Occupation…)
Phenotype (Fever, Pneumonia…)
Treatments (Antiviral, Antibiotic…)
Comorbidity (Pre-existing conditions)
Respiratory Measures (Respiration rate…)
Supplemental Oxygen (Ventilator, Canula…)
Hospital Timeline (Date of admit/discharge…)
Clinical Lab Tests (Triglycerides, ANC…)
What I did
As team lead, I combined modules from earlier parsing infrastructure into a launcher script that outlined each data type that needed to be pulled from the questionnaires. I conducted weekly code reviews, communicated with nurses to identify pain points, and healthcare IT professionals to facilitate upload to LabKey.
Team members filled in their designated parsing modules. I took on Supplemental Oxygen, Respiratory Measures, Hospital Timeline, and Clinical Labs. I also handled additional data validation and error flagging.
Gathering Feedback
After speaking with the nurses and iterating over the workflow design, I synthesized a list of goals for our team to accomplish.
Decrease manual parsing time for each batch
Flag errors or inconsistencies
Develop re-runnable solution that can be easily modified
Process
We had two weeks to complete the parsing workflow and met every morning to exchange questions and ideas. I ensured that each of the 9 parsing modules were independent of each other to accommodate future changes in questionnaire format or user feedback. The workflow was written in Python.
Twice a week, I’d show the data validation team prototype output files and gain valuable feedback on date formatting and test value conversions.
I’d work with the validation team to incorporate novel terms into LabKey’s controlled dictionary and deal with edge cases such as missing data, duplicate entries, conflicting demographic information and more.
Eventually, we had a solid workflow down, which could be run from the command line in Terminal. I drafted up a detailed user manual and sent out the scripts to the team of nurses.
Final Solution
Step 1
Nurse launches script in Terminal, pointing the Python code to the directory in which the delivery of questionnaire spreadsheets reside, and the desired output directory.
Step 2
Code runs through all questionnaires in the directory. Let’s look at the first one… a module maps the site’s patient ID to the internal NIH patient ID using a lookup table stored in LabKey.
Step 3
Original language is translated to English using a Google Translate API for Python.
Step 4
Code pulls all fields from the questionnaire into a multi-tabbed Excel spreadsheet. Each tab stores a different data type (like Demographics or Phenotypes).
Dates are converted from International (dd/mm/yyyy) to US format (yyyy-mm-dd).
Step 5
Step 4 is repeated for all questionnaires, compiling all data into one spreadsheet.
Step 6
Nurses review the spreadsheet’s processing notes and make necessary changes.
Step 7
Nurse can drop the file into a folder on the production server. This automatically pushes the processed delivery into LabKey.
Final Outcome
All researchers at NIH can easily access the latest processed + vetted hospital data on LabKey.
Outcomes
The data validation team reported that they saved several hours of manual review per delivery.
The output Excel file has a tab for each data type pulled from the questionnaires. The format of the tables exactly matches LabKey, so the nurses can easily bulk upload once their final QC is complete.
Future Improvements
Many nurses weren’t well versed with Terminal or the command line, so even typing out a simple command was difficult. Given more time, I’d like to implement a Graphical User Interface with a simple button to point the software to the delivery folder, and output the processed_questionnaires.xlsx file.
Lessons Learned
Being patient with the software development cycle and constantly seeking feedback proved rewarding. The modular design helped prevent errors from impacting the whole script and were easily isolated. Applying user centered thinking allowed nurses to focus on a small subset of errors, rather than waste precious time entering all the data.