Last update: July 2021.
This document explains how this Shiny visualization tool works.
This tool shows a table of the concepts associated with a particular target concept selected by the user. The concepts in the table are ordered by decreasing strength of association.
The data used in this application has been precomputed using the biomedical text content from Medline and PubMed Central (PMC) as well as the UMLS Metathesaurus and PubTator Central.
The associations between concepts may change depending on which dataset is selected by the user. For example, abstracts are shorter and usually focus only on the most important concepts, whereas full papers contain more details and therefore more specific concepts.
The PMI association value (see below) is calculated based on how often any two concepts appear together in the data and how often they appear individually. The fact that two concepts "appear together" can be interpreted at different levels, so for every dataset two levels of co-occurrence are proposed to the user:
Of course, with the former option many more co-occurrences are counted. Some of the concept pairs might be only remotely related (if at all), especially in the case of full articles. On the other hand, the latter option is restrictive and tends to capture only the clearest cases of relations between concepts, so it might be more accurate but it might also miss some relations.
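To make the difference between a broad and a restrictive level of co-occurrence concrete, here is a minimal Python sketch. It assumes that the broad level counts two concepts as co-occurring when they appear anywhere in the same document, and that the restrictive level requires them to appear in the same sentence. The function name, the input layout (a document as a list of sentences, a sentence as a set of concept identifiers) and the choice to count each pair at most once per document are assumptions made for the illustration, not a description of how the tool was actually precomputed.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count concept pairs at two levels (illustrative assumption).

    documents: list of documents; each document is a list of sentences;
    each sentence is a set of concept identifiers.
    Returns (by_document, by_sentence): number of documents in which
    each pair co-occurs at the given level.
    """
    by_document = Counter()
    by_sentence = Counter()
    for sentences in documents:
        # Broad level: any two concepts found in the same document.
        doc_concepts = set().union(*sentences)
        by_document.update(combinations(sorted(doc_concepts), 2))
        # Restrictive level: the two concepts must share a sentence.
        sentence_pairs = set()
        for concepts in sentences:
            sentence_pairs.update(combinations(sorted(concepts), 2))
        by_sentence.update(sentence_pairs)
    return by_document, by_sentence
```

With two toy documents, `[[{"A", "B"}, {"C"}], [{"A"}, {"B", "C"}]]`, the pair `("A", "C")` is counted at the document level in both documents but never at the sentence level, which is the kind of remote relation the restrictive option filters out.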
The user can select a target concept from a predefined list of neurodegenerative disease (ND) concepts.
Technical note: a free choice of target concept is not possible because of the volume of data and the intensive precomputations required. Thus this list is currently limited to the set of diseases which appear as descendants of the concept "Neurodegenerative Disorders" in either MeSH or UMLS. It might be possible to extend the list with more concepts in the future. The interface might also be improved in the future with a search box in order to make the selection more user-friendly.
Pointwise Mutual Information (PMI) is the default measure used as an indicator of the strength of the association between two concepts. This value is based on how often two concepts \(A\) and \(B\) occur together with respect to how often each of them occurs on its own. In other words, it does not only take into account how often two concepts are found together (the joint probability \(p(A,B)\), which would be biased towards frequent concepts); it also makes this relative to the frequency of each concept (\(p(A)\) and \(p(B)\)). This way a rare concept \(A\) might be found to be strongly associated with a frequent concept \(B\) if \(B\) almost always appears when \(A\) does (high conditional probability \(p(B|A)\)), even though \(A\) usually does not appear when \(B\) does (low conditional probability \(p(A|B)\)).
The PMI value has no predefined bounds: its minimum and maximum depend on the probabilities of the concepts \(A\) and \(B\).
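As an illustration, the standard definition of PMI is \(\mathrm{PMI}(A,B) = \log \frac{p(A,B)}{p(A)\,p(B)}\). The minimal Python sketch below estimates it from raw counts. The function name, the argument names and the use of a base-2 logarithm are assumptions made for the example; they are not taken from the tool's implementation.

```python
import math

def pmi(joint_freq, freq_a, freq_b, total):
    """Pointwise Mutual Information estimated from counts.

    joint_freq: number of units (e.g. sentences) containing both A and B
    freq_a, freq_b: number of units containing A, respectively B
    total: total number of units
    The base-2 logarithm is an assumption of this sketch.
    """
    if joint_freq == 0:
        # A and B never co-occur: PMI tends to minus infinity.
        return float("-inf")
    p_ab = joint_freq / total
    p_a = freq_a / total
    p_b = freq_b / total
    return math.log2(p_ab / (p_a * p_b))
```

For instance, with 1000 units, a rare concept \(A\) seen twice and always together with a concept \(B\) seen 500 times, `pmi(2, 2, 500, 1000)` is approximately 1. When \(A\) only ever appears with \(B\), the value reduces to \(-\log_2 p(B)\) whatever the frequency of \(A\), which illustrates why the bounds depend on the probabilities of the two concepts.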
The tool also proposes several other ways to measure the association between concepts, presented below. It is recommended to choose the measure by trial and error based on the top results shown: one measure might suit the desired goal for a particular target, while another measure works better in a different context.
A few other less standard options are also proposed.
Technical note: the MI value is calculated based on the 2x2 contingency table corresponding to the presence or absence of each of the two concepts, i.e. four cases are considered: neither A nor B is present, only A is present, only B is present, or both are present.
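The four cases of the contingency table can be sketched as follows, assuming that MI denotes the usual mutual information between the two binary variables "A is present" and "B is present". The function name, the arguments and the base-2 logarithm are assumptions made for the illustration.

```python
import math

def mutual_information(joint_freq, freq_a, freq_b, total):
    """Mutual information over the 2x2 presence/absence table (sketch)."""
    # Counts for the four cases; keys are (A present?, B present?).
    cells = {
        (1, 1): joint_freq,                             # both present
        (1, 0): freq_a - joint_freq,                    # only A present
        (0, 1): freq_b - joint_freq,                    # only B present
        (0, 0): total - freq_a - freq_b + joint_freq,   # neither present
    }
    p_a = {1: freq_a / total, 0: 1 - freq_a / total}
    p_b = {1: freq_b / total, 0: 1 - freq_b / total}
    mi = 0.0
    for (a, b), count in cells.items():
        if count > 0:  # empty cells contribute nothing to the sum
            p_ab = count / total
            mi += p_ab * math.log2(p_ab / (p_a[a] * p_b[b]))
    return mi
```

Unlike PMI, which only looks at the "both present" cell, this sum also rewards pairs whose absences coincide. It is close to 0 when the two concepts are independent and reaches 1 bit when two concepts that each occur in half of the units always appear together.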
The user can select a minimum joint frequency, i.e. the minimum number of times two concepts must appear together to be selected. This makes it possible to filter out cases where two concepts rarely occur together (even if they have a high association value), and consequently to push pairs which may have a slightly lower association value up to the top of the table. This is useful because there are many rare concepts which always appear accompanied by the target concept, but which are often too specific to be considered an important indicator of the target concept. Some rare concepts may also appear with the target by chance, which is less likely for more frequent concepts. The more often a co-occurrence event happens, the more confident one can be that the association value is meaningful.
The frequency threshold can also be seen as a way to adjust the level of generality of the observed concepts: increasing the threshold shows relationships involving high-level concepts, while decreasing it leads to more specific relationships. Importantly, the appropriate level of generality may depend on the target concept, i.e. different concepts may require different threshold values.
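The effect of the threshold can be sketched in a few lines of Python. The rows reuse values from the ALS examples in this document (abstracts only, by sentence); the function name, the dictionary layout and the assumption that the threshold is inclusive are illustrative choices, not a description of the tool's code.

```python
# A few rows taken from the ALS examples in this document.
rows = [
    {"term": "AZD-7295", "jointFreq": 1, "pmi": 12.58517},
    {"term": "Radicava", "jointFreq": 13, "pmi": 12.37871},
    {"term": "Other motor neuron disease", "jointFreq": 50, "pmi": 11.96223},
    {"term": "IGFALS gene", "jointFreq": 1086, "pmi": 11.29005},
]

def top_concepts(rows, min_joint_freq, measure="pmi"):
    """Keep pairs seen together at least min_joint_freq times
    (inclusive threshold: an assumption), strongest association first."""
    kept = [r for r in rows if r["jointFreq"] >= min_joint_freq]
    return sorted(kept, key=lambda r: r[measure], reverse=True)
```

With a threshold of 0 the rare concept "AZD-7295" comes first; with a threshold of 10 it disappears and "Radicava" takes the top position; with a threshold of 100 only "IGFALS gene" remains among these four rows.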
The concepts shown in the results table can be filtered by semantic category/group.
The concepts which do not belong to any of the selected semantic types are filtered out from the results, causing the other concepts to be pushed up to the top.
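As a small illustration of this filter, the sketch below keeps only the rows whose group is among the selected ones. The rows reuse values from the ALS examples in this document; the function name and the data layout are hypothetical.

```python
# Rows taken from the ALS examples in this document.
concepts = [
    {"term": "Radicava", "group": "CHEM", "pmi": 12.37871},
    {"term": "IGFALS gene", "group": "GENE", "pmi": 11.29005},
    {"term": "C9orf72 gene", "group": "GENE", "pmi": 10.90755},
    {"term": "Lower motor neuron", "group": "ANAT", "pmi": 10.86020},
]

def filter_by_group(concepts, selected_groups):
    """Drop concepts whose semantic group is not selected;
    the relative order of the remaining rows is preserved."""
    selected = set(selected_groups)
    return [c for c in concepts if c["group"] in selected]
```

Selecting only `{"GENE"}` removes "Radicava" and "Lower motor neuron", so "IGFALS gene" moves to the top of the remaining rows.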
The table on the right side of the tool shows the concepts related to the target, ordered by decreasing association score (PMI by default), after applying the selected filters (see above).
In the following examples, the "abstracts only" and "by sentence" options are selected for the dataset. The "Amyotrophic Lateral Sclerosis (ALS)" concept is used as target, and the default PMI is used as association measure.
The following table shows the top 5 concepts obtained with the default minimum frequency of 10:
concept | term | group | jointFreq | pmi | rank |
---|---|---|---|---|---|
C4475575 | Radicava | CHEM | 13 | 12.37871 | 18.0 |
C0154683 | Other motor neuron disease | DISO | 50 | 11.96223 | 22.0 |
C0678179 | Rilutek | CHEM | 11 | 11.58517 | 32.5 |
C3686938 | Progressive motor neuron disease | DISO | 42 | 11.46969 | 41.0 |
C1456383 | IGFALS gene | GENE | 1086 | 11.29005 | 43.0 |
For example, it can be seen that "Other motor neuron disease" belongs to the disorder ("DISO") category. It appears together with ALS 50 times in total, which represents a large proportion of the occurrences of the CUI "Other motor neuron disease" but only a small proportion of the occurrences of the target ALS, which is much more frequent.
The minimum frequency can be used to adjust the results to the desired level of generality. For example, decreasing the minimum to zero gives the following top 5 concepts:
concept | term | group | jointFreq | pmi | rank |
---|---|---|---|---|---|
C4519182 | AZD-7295 | CHEM | 1 | 12.58517 | 8.5 |
C0154758 | Inflammatory and toxic neuropathy | DISO | 1 | 12.58517 | 8.5 |
C0154684 | Other anterior horn cell diseases | DISO | 1 | 12.58517 | 8.5 |
C0154754 | Hereditary and idiopathic neuropathy, unspecified | DISO | 2 | 12.58517 | 8.5 |
C2317805 | multifactorial amyotrophic lateral sclerosis | DISO | 1 | 12.58517 | 8.5 |
This results in many rare concepts (only 1 or 2 occurrences) which have a high PMI with the target ALS because the latter always appears when they do. However, this is hardly meaningful, especially in the case of concepts which appear only once.
Note: the rank of these concepts is identical because they have exactly the same PMI with the target. In such cases where several concepts are tied, the rank is the average of the ranks that would be obtained by ordering the tied concepts arbitrarily.
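The tie rule described in this note can be sketched as follows. The function name is hypothetical and the code is only an illustration of average ranking, not the tool's own implementation.

```python
def average_ranks(scores):
    """Rank scores in decreasing order (rank 1 = highest score).
    Tied scores all receive the mean of the ranks they span."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend 'end' to the last position holding the same score.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        # Positions are 0-based, ranks are 1-based.
        mean_rank = ((pos + 1) + (end + 1)) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks
```

Under this rule, 16 concepts tied for first place would occupy ranks 1 to 16 and each receive the rank (1 + 16) / 2 = 8.5, which is consistent with the rank of 8.5 shown in the table above.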
It is also possible to do the opposite, that is, to observe only the most general related concepts by increasing the minimum frequency, for example to 100 as shown below:
concept | term | group | jointFreq | pmi | rank |
---|---|---|---|---|---|
C1456383 | IGFALS gene | GENE | 1086 | 11.29005 | 43 |
C4024896 | Motor neuron atrophy | DISO | 525 | 11.05528 | 50 |
C0154682 | Lateral Sclerosis | DISO | 292 | 10.93107 | 58 |
C1428691 | C9orf72 gene | GENE | 1382 | 10.90755 | 60 |
C0524459 | Lower motor neuron | ANAT | 893 | 10.86020 | 62 |
With the rare concepts discarded, the results now show frequent concepts as the strongest associations. One may notice that the top PMI values are not very far from the ones observed before without any filtering. This is often the case when the target concept is very frequent, since many other concepts have a fairly strong association with it, i.e. there are only small variations in association strength (PMI) across a very large set of concepts. The filtering options help visualize the most relevant relations for a particular goal.
The top 5 concepts below are obtained by selecting only "Genes and Molecular Sequences" in the semantic filter, with a minimum frequency of 100:
concept | term | group | jointFreq | pmi | rank |
---|---|---|---|---|---|
C1456383 | IGFALS gene | GENE | 1086 | 11.29005 | 43 |
C1428691 | C9orf72 gene | GENE | 1382 | 10.90755 | 60 |
C1420588 | TARDBP gene | GENE | 2325 | 10.55432 | 89 |
C1421305 | UBQLN2 gene | GENE | 124 | 10.39725 | 99 |
C1420306 | SOD1 gene | GENE | 4541 | 10.34614 | 100 |
Note that this filter leaves only around 40 genetic concepts in the table (thanks to the minimum frequency filter). This way one can observe the difference in association strength between different concepts: while the top concepts have a high PMI (above 10 in the table above), the concepts found at the bottom of the table have a PMI close to 0 or even slightly negative. This means that they are not especially associated with the target, and it can be seen that most of them are indeed generic terms such as "Genes", "Alleles", "DNA sequence".