...

Info

Scrubber v3.0 now uses Apache cTAKES to provide parallel concept extraction during de-identification. Apache cTAKES graciously invited us to port the Scrubber de-identification pipeline to the Apache-hosted codebase. The 2.X maintenance version will remain available. The publication describing this work has been accepted with minor revisions; this site will be updated shortly to reflect the described methods and results.

McMurry* AJ, Fitch* B, Savova G, Kohane IS, Reis BY. “Improved de-identification of physician notes through integrative modeling of both identifying and non-identifying medical text.” BMC Medical Informatics and Decision Making. Accepted with minor revisions, Jan 2013.

...

Distinguishing pertinent clinical facts from sensitive patient identifiers in free-text clinical narratives is a difficult classification task.
One reason is that variations in physician writing styles have limited how broadly NLP algorithms can be used in multi-site studies.
Another reason is that hospital IRBs differ in how they weigh privacy risk against research benefit.
As a result, relatively few physician notes are used in research studies despite the wealth of high-quality clinical phenotypes they contain.

Approach

The HMS Scrubber builds on years of community progress in de-identification and NLP. 
In 2006, Beckwith developed and validated a rule-based system to de-identify pathology reports.
This widely accessed de-id program performed well in the pathology setting and was approved by four IRBs at Harvard teaching hospitals. 

...

This new machine learning approach analyzes similarities and differences between physician notes, medical dictionaries, and medical journal publications.
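
As a rough illustration of comparing note text against the other two sources, the sketch below derives two per-token signals: how common a token is in a corpus of journal publications, and whether it appears in a medical dictionary. It is a minimal sketch under stated assumptions, not Scrubber's actual implementation. The function names, the feature names, and the add-one smoothing are all hypothetical choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequencies(documents):
    """Count token occurrences across a list of documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


def token_features(note, publication_counts, dictionary_terms):
    """Build hypothetical per-token features for one physician note.

    Assumption for this sketch: tokens that are frequent in journal
    publications or present in a medical dictionary are less likely to be
    patient identifiers than tokens seen only in notes.
    """
    pub_total = sum(publication_counts.values())
    features = []
    for token in tokenize(note):
        # Add-one smoothing so unseen tokens get a finite log frequency.
        pub_freq = (publication_counts[token] + 1) / (pub_total + 1)
        features.append({
            "token": token,
            "log_pub_freq": math.log(pub_freq),
            "in_dictionary": token in dictionary_terms,
        })
    return features
```

In this sketch, a name such as "smith" would typically receive a low publication frequency and no dictionary match, while a term such as "diabetes" would score higher on both. Features of this kind could then be passed to any standard classifier.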