This document explains how this Shiny visualization tool works. It is recommended to understand the "Top Associated Concepts with a Target Concept" tool (see documentation) before starting with this one.
This tool is intended to contrast two datasets in order to emphasize how they differ the most. It presents the user with a list of concepts associated with a chosen target concept, ordered according to the difference in association strength between two selected datasets, with the concepts with the highest difference shown at the top. Thus the user is able to visualize the concepts which have a very high association strength in the first dataset but a very low one in the second one.
This comparison is meant to detect concepts which are likely to have a meaningful relationship with the target concept, but whose relationship has been "overlooked" in the literature. The method relies on the assumption that well-studied concept relationships are strongly associated even when considering only abstracts at the sentence level, which corresponds to the most restrictive type of literature data available. By contrast, the full articles by paper are the least restrictive in terms of volume and diversity of relationships captured. Therefore it is assumed that two concepts which have a strong association in the latter (full articles) but are absent or have a weak association in the former (abstracts) are a good candidate for an overlooked relationship: the relationship is not well studied, otherwise it would be found in the abstracts, and it is significant, otherwise it would not have a strong association in the full articles.
Essentially, this tool attempts to simulate the process of a human expert going through the list of top associated concepts from the full articles (as can be done with the previous tool) and extracting the relationships considered non-obvious and potentially candidates for further investigation.
The frequency filters `min joint freq in dataset 1` and `max joint frequency in dataset 2` play an important role in the selection of the associated concepts shown in the results. Intuitively, the former allows the user to adjust the generality level while the latter is used to adjust the contrast level. The tool is designed to let the user find the "sweet spot" for these parameters so that the output contains exploitable concepts, i.e. points to relationships which are relevant, non-obvious and good candidates for further study.
This tool uses the same type of data as the "Top Associated Concepts with a Target Concept" tool (please refer to its documentation for details about the data). However, while the previous tool was based on a single dataset, the present one compares two datasets against each other. The comparison is not symmetrical, because the system selects concepts which satisfy these two conditions: a strong association with the target concept in the first dataset, and a weak association (or none at all) in the second dataset.
The tool is intended to be used with a high-coverage first dataset and a restrictive second dataset. By construction, the data obtained from `abstracts only` contains fewer but more reliable co-occurrences between concepts, whereas the `articles only` data contains more but less reliable co-occurrences and the `abstracts+articles` data contains all the co-occurrences. Also by construction, the level `by sentence` is much more restrictive and more accurate than the level `by paper`, since the former captures only co-occurrences within a sentence while the latter also considers co-occurrences of concepts which appear far away from each other in a document.
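The difference between the two levels can be illustrated with a minimal sketch. The toy corpus, the concept names and both function names are hypothetical, and the sketch assumes that a joint frequency is simply the number of sentences (or papers) in which two concepts appear together; the actual counting used to build the datasets may differ.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: each paper is a list of sentences,
# and each sentence is the set of concept identifiers found in it.
papers = [
    [{"ALS", "SOD1"}, {"riluzole"}],
    [{"ALS"}, {"SOD1", "riluzole"}],
]

def joint_freq_by_sentence(papers):
    """Count concept pairs that appear together in the same sentence."""
    counts = Counter()
    for paper in papers:
        for sentence in paper:
            for pair in combinations(sorted(sentence), 2):
                counts[pair] += 1
    return counts

def joint_freq_by_paper(papers):
    """Count concept pairs that appear anywhere in the same paper."""
    counts = Counter()
    for paper in papers:
        concepts = set().union(*paper)  # all concepts of the paper
        for pair in combinations(sorted(concepts), 2):
            counts[pair] += 1
    return counts

by_sentence = joint_freq_by_sentence(papers)
by_paper = joint_freq_by_paper(papers)
print(by_sentence[("ALS", "riluzole")])  # 0: never in the same sentence
print(by_paper[("ALS", "riluzole")])     # 2: together in both papers
```

In this toy example the pair ALS/riluzole is invisible at the sentence level but counted twice at the paper level, which is the kind of gap the tool is designed to bring out.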
The user can select a target concept among a predefined list of ND concepts.
See the documentation of the previous tool for details.
The order of the associated concepts in the results table depends on two options: the association measure (MI or PMI), which is explained in the documentation of the previous tool (PMI, the default, is recommended), and the `method to rank CUIs`, which can be either `Basic Contrast` (the default), `Relative Diff Rank` or `Absolute Diff Rank`.
With the `Basic Contrast` option enabled, the results table shows the concepts ordered by their association measure in the first dataset: first the concepts of the reference view (dataset 1) are ranked, then the concepts whose joint frequency in the mask view (dataset 2) is higher than the maximum parameter are filtered out (see Frequency Filters below). This means that it shows the same list as the previous tool for the reference view, except that the filtering is based on the mask view.
This view can be simpler to interpret because it directly follows the order of the top associated concepts in the first dataset, while letting the user analyze the differences between the two datasets. In this view, the `max joint frequency in dataset 2` (see Frequency Filters below) is a crucial parameter: if set to a very high value, the list contains all the concepts in dataset 1, but if set to a low value (especially at the default 0) the concepts which have a higher frequency in dataset 2 are filtered out. In this case the effect is very similar to using the `Diff relative rank 1 vs 2` method, since it keeps only the concepts which have a low association (or are absent) in dataset 2 at the top of the list.
In both datasets, the associated concepts are ranked by their association measure (PMI by default) and their rank (relative or absolute) is calculated. Then the concepts are ordered by the difference between their two ranks (dataset 1 minus dataset 2), from the lowest (most negative) difference at the top to the highest difference at the bottom of the ranking.
This method satisfies the main objective, since the most negative difference is obtained for concepts which have a high association in dataset 1 (rank close to the top) and a low one in dataset 2 (rank close to the bottom). It orders the full list of concepts by how much their rank differs in the two datasets, so even if no concept satisfies the two conditions perfectly, the top concepts will be the closest ones to the goal. For example, a concept with a high association in dataset 1 but an average association in dataset 2 appears higher than one with an average rank in both.
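The rank-difference ordering can be sketched as follows. This is only an illustration under stated assumptions, not the tool's implementation: the function names and the PMI values are hypothetical, tied scores are assumed to share their average rank (consistent with the half ranks visible in the example tables), and concepts absent from dataset 2 are assumed to receive the worst rank there.

```python
def average_ranks(scores):
    """Rank concepts by descending score; tied scores share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to the end of the run of concepts tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2  # average of 1-based positions i+1..j+1
        for concept in ordered[i:j + 1]:
            ranks[concept] = shared
        i = j + 1
    return ranks

def rank_difference_order(pmi_1, pmi_2, worst_rank_2):
    """Order the concepts of dataset 1 by (rank in dataset 1 - rank in dataset 2)."""
    ranks_1 = average_ranks(pmi_1)
    ranks_2 = average_ranks(pmi_2)
    # Assumption: a concept missing from dataset 2 gets the worst rank there.
    diff = {c: ranks_1[c] - ranks_2.get(c, worst_rank_2) for c in pmi_1}
    return sorted(diff, key=diff.get), diff

# Hypothetical PMI values; concept "A" does not appear in dataset 2.
pmi_1 = {"A": 15.0, "B": 14.5, "C": 14.0, "D": 12.0, "E": 10.0}
pmi_2 = {"C": 14.0, "D": 13.0, "B": 9.0, "E": 8.0}

order, diff = rank_difference_order(pmi_1, pmi_2, worst_rank_2=5)
print(order)  # ['A', 'B', 'E', 'C', 'D']
```

Here "A" (top of dataset 1, absent from dataset 2) comes first, and "B" (high in dataset 1, lower in dataset 2) is placed above "C" and "D", which are ranked at least as well in dataset 2 as in dataset 1.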
Technical notes:
The options `min joint freq in dataset 1` and `max joint frequency in dataset 2` provide the user with the ability to adjust the list by filtering out concepts which do not satisfy the two conditions.
The minimum frequency in the first dataset is used in the same way as in the previous tool. The user can adjust this parameter to filter out the least frequently associated concepts, which can happen by chance and are often too specific (see also the documentation of the previous tool).
The maximum frequency in the second dataset is proposed in this tool so that the user can also filter out concepts which appear too frequently in the second dataset (since the goal is to find concepts which are poorly or not at all associated with the target in the second dataset). By default the threshold is set to zero, which means that only concepts which don't appear at all in the second dataset are presented. But this threshold can be relaxed by setting a higher value, thus letting concepts which exist in the second dataset appear in the results. The effect differs depending on the ranking method:
With the `Basic Contrast` method, the `max joint frequency in dataset 2` is the only way to make the second dataset affect which concepts are shown: with a low threshold (especially the default value 0), the concepts which have a high association in dataset 2 are filtered out, but with a very high threshold they are shown as well, so the contrast between the two datasets is not clearly visible.
With the `Diff rank 1 vs 2` method, even a high `max joint frequency in dataset 2` value is unlikely to show any additional concept at the top of the list, because their difference in relative rank is not as high as that of the concepts with a lower frequency in dataset 2. However this can happen either if the target concept has few associated concepts, or if the `min joint freq in dataset 1` is set to a high value. In this case, which helps find general/frequent concepts, the `max joint frequency in dataset 2` value is used to avoid (or not) the concepts which do not have a high difference between the two datasets.
Overall the two thresholds `min joint freq in dataset 1` and `max joint frequency in dataset 2` play an important role in the selection of the associated concepts shown in the results. Intuitively, the former allows the user to adjust the generality level while the latter is used to adjust the contrast level. These two thresholds should be adjusted depending on the target concept (in particular its frequency and number of associated concepts) and the desired outcome (level of generality and contrast).
Please refer to the documentation of the previous tool. You can also check the usage tips.
In the following examples the default `articles only`, `by paper` and `abstracts only`, `by paper` options are selected as dataset 1 and 2 respectively. The "Amyotrophic Lateral Sclerosis (ALS)" concept is used as target, and the default PMI is used as association measure.
With the default parameters the following top 5 concepts are shown to the user:
concept | term | group | jointFreq.ref | jointFreq.mask | pmi.ref | pmi.mask | rank.ref | rank.mask | rank |
---|---|---|---|---|---|---|---|---|---|
C3255210 | Azolen | CHEM | 12 | 0 | 15.43535 | NA | 1384.5 | 23165 | 1 |
C0381157 | Autologen | CHEM | 20 | 0 | 15.41333 | NA | 1387.0 | 23165 | 2 |
C3498115 | entorhinal white matter | ANAT | 85 | 0 | 15.40563 | NA | 1388.0 | 23165 | 3 |
C0718950 | Biomox | CHEM | 12 | 0 | 15.32844 | NA | 1397.0 | 23165 | 4 |
C0613678 | UTEN | CHEM | 101 | 0 | 15.26653 | NA | 1410.0 | 23165 | 5 |
These concepts are quite specific, and the reason for their fairly strong association with ALS is not immediately clear. It is possible that some are explained by chance or by some bias in the literature, but in general an expert would have to perform a deeper analysis in order to understand the nature of the association.
It can be observed that `jointFreq.mask`, the joint frequency in dataset 2, is always 0, which indicates that these concepts never appear together with the target in the second dataset. As a result their association is the lowest possible in dataset 2, and since it is high in dataset 1 they obtain a high difference.
With the ranking method `Basic contrast`, results are strongly affected by the `max joint freq in dataset 2` parameter: if the threshold is high then the method simply shows the top concepts in dataset 1, whereas if the threshold is low (especially 0) then it tends to show the same results as the alternative ranking method `Diff rank 1 vs 2`, as shown in the following examples.
concept | term | group | jointFreq.ref | jointFreq.mask | pmi.ref | pmi.mask | rank.ref | rank.mask | rank |
---|---|---|---|---|---|---|---|---|---|
C4475575 | Radicava | CHEM | 50 | 8 | 15.55083 | 14.20717 | 688.5 | 33 | 1.5 |
C3280587 | AMYOTROPHIC LATERAL SCLEROSIS 16, JUVENILE | DISO | 12 | 1 | 15.55083 | 12.62221 | 688.5 | 204 | 1.5 |
C0668601 | SOD1 G93A protein | CHEM | 36 | 4 | 15.51130 | 14.20717 | 1377.0 | 33 | 3.0 |
C4522181 | Brachial Amyotrophic Diplegia | DISO | 13 | 10 | 15.44391 | 13.20717 | 1382.0 | 129 | 4.0 |
C3255210 | Azolen | CHEM | 12 | 0 | 15.43535 | NA | 1384.5 | 23165 | 6.5 |
This example, where `max joint freq in dataset 2` is set to 10, shows that some concepts have a high association in both datasets, as opposed to the ones obtained with `max joint freq in dataset 2` set to 0 or with the `Diff rank 1 vs 2` method (see below). Naturally the goal of the tool is to filter out these concepts, but the `basic contrast` method can be used to observe how such well-studied concepts are progressively discarded when starting from a high `max joint freq in dataset 2` value and then decreasing it.
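The progressive filtering can be reproduced on the five rows of the table above with a short sketch. The function name is hypothetical, and the sketch assumes that a concept is kept when its joint frequency in dataset 2 is lower than or equal to the threshold, which is consistent with the row whose `jointFreq.mask` is 10 being shown when the threshold is 10.

```python
# Rows copied from the table above: (term, jointFreq.ref, jointFreq.mask, pmi.ref)
rows = [
    ("Radicava", 50, 8, 15.55083),
    ("AMYOTROPHIC LATERAL SCLEROSIS 16, JUVENILE", 12, 1, 15.55083),
    ("SOD1 G93A protein", 36, 4, 15.51130),
    ("Brachial Amyotrophic Diplegia", 13, 10, 15.44391),
    ("Azolen", 12, 0, 15.43535),
]

def basic_contrast(rows, max_joint_freq_2):
    """Order by association in dataset 1, then drop concepts too frequent in dataset 2."""
    ordered = sorted(rows, key=lambda r: r[3], reverse=True)
    return [r[0] for r in ordered if r[2] <= max_joint_freq_2]

# Decrease the threshold step by step and watch concepts disappear.
for threshold in (10, 5, 0):
    print(threshold, basic_contrast(rows, threshold))
```

With a threshold of 10 all five concepts are kept, with 5 only three remain (joint frequencies 1, 4 and 0 in dataset 2), and with 0 only Azolen is left.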
concept | term | group | jointFreq.ref | jointFreq.mask | pmi.ref | pmi.mask | rank.ref | rank.mask | rank |
---|---|---|---|---|---|---|---|---|---|
C3255210 | Azolen | CHEM | 12 | 0 | 15.43535 | NA | 1384.5 | 23165 | 1 |
C0381157 | Autologen | CHEM | 20 | 0 | 15.41333 | NA | 1387.0 | 23165 | 2 |
C3498115 | entorhinal white matter | ANAT | 85 | 0 | 15.40563 | NA | 1388.0 | 23165 | 3 |
C0718950 | Biomox | CHEM | 12 | 0 | 15.32844 | NA | 1397.0 | 23165 | 4 |
C0613678 | UTEN | CHEM | 101 | 0 | 15.26653 | NA | 1410.0 | 23165 | 5 |
In this example the `max joint freq in dataset 2` threshold is 0 (default), but this is not the reason why the top results are made of concepts which do not appear in the second dataset. Indeed, changing this parameter to a high value (for instance 99999) shows the exact same top concepts. In fact, the `max joint freq in dataset 2` threshold often has little effect with the `Diff rank 1 vs 2` ranking method, unless the `min joint freq in dataset 1` threshold is high or the target concept is not frequent. This is because this ranking method favours concepts which do not appear at all in dataset 2, since these concepts have a very high relative rank difference as long as they are ranked at the top in dataset 1.
With the default `Diff rank 1 vs 2` method, one should set a high value for both parameters `min joint freq in dataset 1` and `max joint freq in dataset 2` in order to see some concepts which appear in dataset 2 at the top. The first discards rare concepts in dataset 1, while the second allows frequent concepts in dataset 2. As seen above, without any constraints the top concepts with `Diff rank 1 vs 2` tend to be quite rare and specific. Increasing `min joint freq in dataset 1` removes such specific concepts from the top of the list and replaces them with more frequent and therefore more general concepts. Such concepts are more likely to also appear in dataset 2, as shown in the following example:
concept | term | group | jointFreq.ref | jointFreq.mask | pmi.ref | pmi.mask | rank.ref | rank.mask | rank |
---|---|---|---|---|---|---|---|---|---|
C1308189 | ERF protein, human | CHEM | 415 | 0 | 12.54866 | NA | 10418 | 23165.0 | 1 |
C1333360 | ERF gene | GENE | 420 | 0 | 12.30001 | NA | 11954 | 23165.0 | 2 |
C1446539 | tegafur-uracil | CHEM | 480 | 2 | 14.05151 | 4.303291 | 3181 | 14284.5 | 3 |
C0043102 | Weil Disease | DISO | 648 | 1 | 13.42155 | 4.294284 | 5608 | 14324.5 | 4 |
C0085196 | Oxidopamine | CHEM | 676 | 9 | 12.95100 | 3.917538 | 8025 | 15910.0 | 5 |
It can be observed that the difference in relative rank is lower than in the previous examples, due to the removal of the concepts with the highest difference. The top two concepts are still concepts which do not appear in dataset 2, but the third and fourth have a couple of co-occurrences. Still, their relative rank in dataset 2 (abstracts) is very low as expected (this is why they are ranked at the top), which could be an indication that their association with the target is not established in the literature. Of course, it is also possible that their high association in the full articles is due to some artefact.
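As a quick check, the order of the table above can be recomputed from its own `rank.ref` and `rank.mask` columns. This sketch assumes that the final `rank` column is obtained by sorting on `rank.ref` minus `rank.mask`, from the most negative difference upwards; the tool may apply further normalisation that is not visible in the table.

```python
# Rows copied from the table above: (term, rank.ref, rank.mask, rank)
rows = [
    ("ERF protein, human", 10418, 23165.0, 1),
    ("ERF gene", 11954, 23165.0, 2),
    ("tegafur-uracil", 3181, 14284.5, 3),
    ("Weil Disease", 5608, 14324.5, 4),
    ("Oxidopamine", 8025, 15910.0, 5),
]

# Rank difference: dataset 1 (reference) minus dataset 2 (mask).
diffs = {term: rank_ref - rank_mask for term, rank_ref, rank_mask, _ in rows}

# Most negative difference first.
recomputed = sorted(diffs, key=diffs.get)
for position, term in enumerate(recomputed, start=1):
    print(position, term, diffs[term])
```

The recomputed order matches the `rank` column, with differences ranging from -12747.0 for the first row to -7885.0 for the fifth. For comparison, the top concept of the first example (Azolen) has a difference of 1384.5 - 23165 = -21780.5.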
Finally one can use the semantic filters in order to focus on a specific type of concept. The following example is obtained by selecting only the "Genes and Molecular Sequences" in the semantic filter:
concept | term | group | jointFreq.ref | jointFreq.mask | pmi.ref | pmi.mask | rank.ref | rank.mask | rank |
---|---|---|---|---|---|---|---|---|---|
C1333360 | ERF gene | GENE | 420 | 0 | 12.30001 | NA | 11954 | 23165.0 | 2 |
C1412727 | BACE1 gene | GENE | 547 | 3 | 12.56508 | 4.326569 | 9692 | 14178.0 | 8 |
C1415888 | IFI30 gene | GENE | 960 | 5 | 14.34985 | 6.660279 | 2919 | 5979.5 | 9 |
C1414477 | ETV5 gene | GENE | 825 | 5 | 12.27221 | 4.475175 | 12060 | 13591.0 | 15 |
C0599295 | ERG gene | GENE | 959 | 7 | 11.86224 | 3.576386 | 15856 | 17258.0 | 17 |