OVERVIEW

This database was created to store hundreds of thousands of human and machine annotations.
Here, "human" annotations are those created by an expert reviewer using a program such as Protégé, whereas "machine" annotations are labeled automatically. Parsed medical dictionaries and journal publications are also stored in this database.

LIST OF TABLES

Table Name                 Used for Scrubbing   Medical Concepts (UMLS)   Used for Publication Analysis
-------------------------  ------------------   -----------------------   -----------------------------
Feature_matrix_test        YES                  YES
Feature_matrix_train       YES                  YES
Human_annotations_test     YES
Human_annotations_train    YES
Machine_annotations_test   YES
Machine_annotations_train  YES
Lookup_dictionary          YES                  YES
Lookup_term_frequency      YES
Lookup_umls                YES
Pubs_authors                                                              YES
Pubs_keywords                                                             YES
Pubs                                                                      YES
Pubs_refs                                                                 YES

TABLE DESCRIPTIONS

Feature_matrix_*

Stores the feature matrix built from the Machine_annotations_* and Human_annotations_* tables. This is the rolled-up feature set used for classification.
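The roll-up described above could look roughly like the sketch below. The actual schema is not documented on this page, so every column name (token_id, annotation_type, label), the two annotation type values, and the assumption of at most one human label per token are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical schema: the real column names are not given on this page.
SCHEMA = """
CREATE TABLE machine_annotations_train (token_id INTEGER, annotation_type TEXT);
CREATE TABLE human_annotations_train (token_id INTEGER, label TEXT);
"""

# One row per token: a count per (hypothetical) machine annotation type,
# plus the human label when one exists. In SQLite a comparison evaluates
# to 0 or 1, so SUM(...) counts the matching annotations.
ROLLUP = """
CREATE TABLE feature_matrix_train AS
SELECT m.token_id,
       SUM(m.annotation_type = 'dictionary_name') AS n_dictionary_name,
       SUM(m.annotation_type = 'umls_term') AS n_umls_term,
       h.label AS human_label
FROM machine_annotations_train AS m
LEFT JOIN human_annotations_train AS h ON h.token_id = m.token_id
GROUP BY m.token_id, h.label
"""


def build_feature_matrix(machine_rows, human_rows):
    """Load sample annotations into an in-memory database and roll them up."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO machine_annotations_train VALUES (?, ?)", machine_rows
    )
    conn.executemany(
        "INSERT INTO human_annotations_train VALUES (?, ?)", human_rows
    )
    conn.execute(ROLLUP)
    rows = conn.execute(
        "SELECT token_id, n_dictionary_name, n_umls_term, human_label "
        "FROM feature_matrix_train ORDER BY token_id"
    ).fetchall()
    conn.close()
    return rows
```

Tokens without a human annotation keep a NULL label because of the LEFT JOIN, which is one possible way to keep unlabeled rows available for classification.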

Human_annotations_*

Stores all annotations created by humans as part of a manual annotation effort.

Machine_annotations_*

Stores all annotations created by the UIMA pipeline.

Lookup_dictionary

Contains names from the 1990 US census that are used in

Lookup_term_frequency

Contains term frequency calculated across a random selection of 10,000 open access medical publications. Raw open access publications are available for free through NIH/NLM.
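As an illustration of how such term frequencies might be computed, here is a minimal sketch. The tokenization rule (lowercased runs of letters and digits) and the choice of raw occurrence counts rather than per-document counts are assumptions; this page does not say how the stored values were actually derived.

```python
import re
from collections import Counter


def term_frequency(documents):
    """Count how often each term occurs across a collection of texts."""
    counts = Counter()
    for text in documents:
        # Assumed tokenization: lowercase, then runs of letters and digits.
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts
```

Calling term_frequency on a list of publication texts returns a Counter, so counts.most_common(20) would list the twenty most frequent terms.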

Lookup_umls

Contains terms from the UMLS subset that was used by Scrubber.

This DOES NOT include the UMLS CUIDs due to licensing restrictions.

Medical Vocabularies

Vocabulary   # Concepts
----------   ----------
COSTAR            3,461
HL7V2.5           5,020
HL7V3.0           8,062
ICD10CM         102,048
ICD10PCS        253,708
ICD9CM           40,491
LOINC           327,181
MESH            739,161
RXNORM          437,307
SNOMEDCT      1,170,855
