Programmatically generating annotations for de-identification of clinical data

I. Guclu

Master's thesis, 2021.

Clinical records may contain protected health information (PHI), which is privacy-sensitive. PHI in unstructured medical records must be annotated and replaced before the data can be shared for research purposes. Machine learning models are quick to implement and can achieve competitive results (micro-averaged F1-scores of 0.88 on a Dutch radiology dataset and 0.87 on the English i2b2 dataset). However, developing such models requires training data. In this project, we applied weak supervision to annotate and collect training data for de-identification of medical records. Automating this process is essential because manual annotation is laborious and repetitive. We used the two human-annotated datasets, 'removed' their gold annotations, and weakly tagged PHI instances in the medical records. We then unified the output labels using two different aggregation models: aggregation at the token level (Snorkel) and sequential labeling (Skweak). The aggregated output was used to train a discriminative end model, which achieves competitive results on the Dutch dataset (micro-averaged F1-score: 0.76), whereas performance on the English dataset is suboptimal (micro-averaged F1-score: 0.49). The results indicate that for structured PHI tags we approach the results obtained with human annotations, but more complex entities still need further attention.
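As a rough illustration of the weak-supervision idea described above, the sketch below applies a few heuristic labeling functions to a tokenized sentence and combines their votes per token. Everything in it is an assumption made for illustration: the labeling functions (`lf_date`, `lf_title_name`, `lf_capitalized`), the label names, and the `aggregate` helper are hypothetical, and a plain majority vote stands in for the aggregation step. The thesis uses Snorkel and Skweak for aggregation; their label models are not reproduced here.

```python
import re

ABSTAIN = None  # a labeling function returns this when it has no opinion

DATE_PATTERN = re.compile(r"^\d{1,2}[-/]\d{1,2}[-/]\d{2,4}$")
TITLES = {"Dr.", "Mr.", "Mrs."}


def lf_date(tokens):
    """Hypothetical rule: tokens shaped like 12-03-2020 are dates."""
    return ["DATE" if DATE_PATTERN.match(t) else ABSTAIN for t in tokens]


def lf_title_name(tokens):
    """Hypothetical rule: the token after a title such as 'Dr.' is a name."""
    labels = [ABSTAIN] * len(tokens)
    for i in range(1, len(tokens)):
        if tokens[i - 1] in TITLES:
            labels[i] = "NAME"
    return labels


def lf_capitalized(tokens):
    """Hypothetical noisy rule: capitalized non-initial tokens are names."""
    return [
        "NAME" if i > 0 and tok[0].isupper() else ABSTAIN
        for i, tok in enumerate(tokens)
    ]


LABELING_FUNCTIONS = [lf_date, lf_title_name, lf_capitalized]


def aggregate(tokens, labeling_functions):
    """Majority vote per token; 'O' (no PHI) when every function abstains.

    Ties go to the label proposed by the earliest labeling function.
    """
    all_votes = [lf(tokens) for lf in labeling_functions]
    result = []
    for i in range(len(tokens)):
        counts = {}
        for votes in all_votes:
            label = votes[i]
            if label is not ABSTAIN:
                counts[label] = counts.get(label, 0) + 1
        # max() returns the first key with the highest count, and dicts
        # keep insertion order, so ties resolve deterministically.
        result.append(max(counts, key=counts.get) if counts else "O")
    return result


tokens = "Dr. Jansen saw the patient on 12-03-2020 .".split()
print(aggregate(tokens, LABELING_FUNCTIONS))
# ['O', 'NAME', 'O', 'O', 'O', 'O', 'DATE', 'O']
```

The aggregated tags could then serve as noisy training labels for an end model. A majority vote treats every labeling function as equally reliable, which is the main simplification compared with a learned aggregation model.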