The short forms of medical concepts or expressions (i.e., acronyms/abbreviations) are prevalent in clinical documentation. Given the limited number of potential short forms, they are also highly ambiguous. Resolving the ambiguity of short forms is essential in clinical natural language processing (NLP). However, one prerequisite for resolving the ambiguity of short forms is a sense inventory. This paper outlines our process of identifying 141 potential short forms with randomly sampled phrases from a large clinical corpus. We assessed various features for their ability to disambiguate medical and non-medical usages. We identified 68% of our short forms as primarily serving medical usages, whereas 12% had non-medical usages. The remaining 19% showed alternating usage based upon case form. Our short forms had an average of 3.58 senses. Usages could be distinguished using basic trigram/bigram/line information. Our initial findings will be applicable to automatic usage/sense resolution.
Given the limited number of potential short forms, one short form can have multiple senses. Resolving the high ambiguity of short forms, a special case of word sense disambiguation (WSD), is essential in clinical NLP systems. There are some unique characteristics of short forms in the clinical domain. First, the appearance of short forms in clinical documents is very different from other domains (1). They are seldom defined in the text but can have diverse meanings in different settings. For example, “CA” can mean “California”, “Cancer”, or “Calcium” without any clear definition in clinical narratives. “DT” is frequently used as “Discharge time” in discharge summaries and as “Diphtheria-tetanus double vaccination” in immunization reports. Second, “one sense per discourse” (2), a common understanding in the general English WSD community, may not hold here (3). For example, “PT” can stand for “Patient” in the subjective section and for “Prothrombin time” in the assessment section. Additionally, besides multiple clinical meanings, short forms in clinical texts can also have non-medical English usages. For instance, the short form “US” in “The US shows a lesion” and in “The patient retired from US army in 2011” has two different meanings: the first is “Ultrasound”, and the second is “United States”.
WSD in the clinical domain has been studied using rule-based and machine learning (ML) approaches (1, 4, 5). Various textual features (such as POS tags or n-grams) have been inspected (5, 6). However, those textual features may not perform well in documents that are semi-structured (e.g., containing tables) or not grammatically well formed (7), such as Emergency Department (ED) notes. Document metadata features are also crucial, but they vary across EHR systems and institutions. Additionally, expert or distributed semantic features (e.g., SNOMED concepts or topic models) have been utilized, but they depend heavily on the comprehensiveness of knowledge resources (8) and the availability of a large collection of documents in the corresponding domain. Research (5, 9) has focused on developing an individual classifier per short form to achieve reasonable performance, rather than one generalized classifier for all short forms. However, a prerequisite for WSD is the creation of a sense inventory, which catalogs the associated meanings of those ambiguous words (10).
As described, short forms can have both medical and non-medical usages in clinical documents. Existing clinical terminology resources may not capture their non-medical usages. For example, the non-medical usage of “ICE” in “ICE number”, meaning “In Case of an Emergency phone number”, is not included in existing clinical terminology resources. Additionally, some short forms have localized usages. For example, “MEA” can stand for “MayoExpertAdvisor”, a decision support system at Mayo Clinic, or for “Minnesota Education Assembly”. In this paper, leveraging a large collection of clinical notes in our clinical data warehouse, we investigated the usages of short forms in clinical documents, assessed potential features that can be utilized for identifying varying usages, and assembled high-frequency usage patterns for WSD tasks.
For this study, we used a corpus consisting of 40 million clinical notes from 1995 to 2011 retrieved from the Mayo clinical data warehouse. The corpus includes all notes generated from a diverse range of medical specialties. The corpus contained 13.9 billion word tokens in total, corresponding to 61.8 million unique word tokens. For each word token, we extracted all corresponding text lines. We used multiple publicly available resources to aggressively identify medical and non-medical usages.
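
The line extraction and the bigram/trigram context features mentioned in this section can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `collect_contexts`, the regular-expression tokenizer, and the case-sensitive matching are all assumptions made for illustration. For each candidate short form it gathers the text lines that contain it and counts the bigrams and trigrams around each occurrence.

```python
import re
from collections import Counter, defaultdict

# Assumption: a simple alphanumeric tokenizer; the tokenization used in the study is not specified.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def collect_contexts(lines, short_forms):
    """Hypothetical helper: map each short form to its lines and n-gram contexts.

    Matching is case-sensitive, so "US" and "us" are treated as different tokens.
    """
    targets = set(short_forms)
    contexts = defaultdict(
        lambda: {"lines": [], "bigrams": Counter(), "trigrams": Counter()}
    )
    for line in lines:
        tokens = TOKEN_RE.findall(line)
        recorded = set()  # short forms whose line has already been stored
        for i, tok in enumerate(tokens):
            if tok not in targets:
                continue
            entry = contexts[tok]
            if tok not in recorded:
                entry["lines"].append(line)
                recorded.add(tok)
            # Bigrams: the short form paired with its left and right neighbors.
            if i > 0:
                entry["bigrams"][(tokens[i - 1], tok)] += 1
            if i + 1 < len(tokens):
                entry["bigrams"][(tok, tokens[i + 1])] += 1
            # Trigram: the short form with one neighbor on each side.
            if 0 < i < len(tokens) - 1:
                entry["trigrams"][(tokens[i - 1], tok, tokens[i + 1])] += 1
    return contexts


if __name__ == "__main__":
    sample = [
        "The US shows a lesion",
        "The patient retired from US army in 2011",
    ]
    result = collect_contexts(sample, ["US"])
    for trigram, count in result["US"]["trigrams"].items():
        print(trigram, count)
```

On the two “US” sentences quoted in the text, the trigram around the ultrasound usage is ("The", "US", "shows") and the one around the “United States” usage is ("from", "US", "army"). This shows how a narrow window of neighboring words might already separate the two usages.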