User Guide: Contrast two Datasets given a Target Concept

This document explains how this Shiny visualization tool works. It is recommended to understand the "Top Associated Concepts with a Target Concept" tool (see documentation) before starting with this one.

Summary

This tool is intended to contrast two datasets in order to emphasize how they differ the most. It presents the user with a list of concepts associated with a chosen target concept, ordered according to the difference in association strength between two selected datasets, with the concepts with the highest difference shown at the top. Thus the user is able to visualize the concepts which have a very high association strength in the first dataset but a very low one in the second one.

This comparison is meant to detect concepts which are susceptible to have a meaningful relationship with the target concept, but this relationship has been "overlooked" in the literature. The method relies on the assumption that concepts relationships which are well studied are strongly associated even if considering only abstracts at the sentence level, which correspond to the most restrictive type of literature data available. By contrast, the full articles by paper are the least retrictive in terms of volume and diversity of relationships captured. Therefore it is assumed that two concepts which have a strong association in the latter (full articles) but are absent or have a weak association in the former (abstracts) are a good candidate as an overlooked relationship: the relationship is not well studied otherwise it would be found in the abstracts, and it is significant otherwise it would not have a strong association in the full articles.

Essentially, this tool attempts to simulate the process of a human expert going through the list of top associated concepts from the full articles (as can be done with the previous tool) and extracting the relationships considered non-obvious and potentially candidates for further investigation.

The frequency filters min joint freq in dataset 1 and max joint frequency in dataset 2 play an important role in the selection of the associated concepts shown in the results. Intuitively, the former allows the user to adjust the generality level while the latter is used to adjust the contrast level. The tool is designed to let the user find the "sweet spot" for these parameters so that the output contains exploitable concepts, i.e. points to relationships which are relevant, non-obvious and good candidates for further study.

Data

This tool uses the same type of data as the "Top Associated Concepts with a Target Concept" tool, please refer to its documentation for details about the data). However, while the previous tool was based on a single dataset, the present one is based on comparing two datasets against each other. The comparison is not symetrical, because the system selects concepts which satisfy these two conditions:

The concept has a high association strength with the target in the first dataset,
The concept has a low association strength or is not associated at all with the target in the second dataset.

The tool is intended to be used with a high-coverage first dataset and a restrictive second dataset. By construction, the data obtained from abstracts only contains less but more reliable co-occurrences between concepts, whereas the articles only data contains more but less reliable co-occurences and the abstracts+articles contains all the co-occurrences. Also by contruction, the level by sentence is much more restrictive and more accurate than the level by paper, since the former captures only co-occurences in a sentence while the latter also considers co-occurrences of concepts which appear far away from each other in a document.

Target Selection

The user can select a target concept among a predefined list of ND concepts.

See the documentation of the previous tool for details.

Ranking Method

The order of the associated concepts in the results table depends on two options: the association measure (MI or PMI), which is explained in the documentation of the previous tool (PMI, the default, is recommended), and the method to rank CUIs, which can be either Basic Contrast (the default), Relative Diff Rank or Absolute Diff Rank.

Basic Contrast

With this option enabled, the results table shows the concepts ordered by their association measure in the first data: first the reference view concepts are ranked, then the concepts which have a joint frequency in the mask view higher than the maximum parameter are filtered out (see Frequency Filters below). This means that it shows the same list as the previous tool for the reference view, except that the filtering is based on the mask view.

This view can be simpler to interpret because it directly follows the order of the top associated concepts in the first dataset, while letting the user analyze the differences between the two datasets. In this view, the max joint frequency in dataset 2 (see Frequency Filters below) is a crucial parameter: if set to a very high value, the list contains all the concepts in dataset 1, but if set to a low value (especially at the default 0) the concepts which have a higher frequency in dataset 2 are filtered out. In this case the effect is very similar to using the Diff relative rank 1 vs 2 method, since it keeps only the concepts which have a low association (or are absent) in dataset 2 at the top of list.

By Difference in Relative/Absolute Rank

In both datasets, the associated concepts are ranked by their association measure (PMI by default) and their rank (relative or absolute) is calculated. Then the concepts are ordered by the difference between their two ranks (dataset 1 minus dataset 2) from lowest difference at the top to highest difference at the bottom of the ranking.

This method satisfies the main objective since the highest possible difference happens for concepts which have a high association in dataset 1 and a low one in dataset 2. It orders the full list of concepts by how much their rank differs in the two datasets, so even if there is no concept which satisfies the two conditions perfectly the top concepts will be the closest ones to the goal. For example, a concept with a high association in dataset 1 but an average association in dataset 2 appears higher than one with an average rank in both.

Technical notes:

The difference is not calculated based on the association value because these values are not reliably comparable across datasets due to their different size.
The absolute rank depends on the number of concepts in the view, whereas the relative rank doesn't since it's normalized by the size of the view. This implies that the relative and absolute versions produce different rankings.

Frequency Filters

The options min joint freq in dataset 1 and max joint frequency in dataset 2 provide the user with the ability to adjust the list by filtering out concepts which do not satisfy the two conditions.

The minimum frequency in the first dataset is used in the same way as in the previous tool. The user can adjust this parameter to filter out the least frequently associated concepts, which can happen by chance and are often too specific (see also the documentation of the previous tool).

The maximum frequency in the second dataset is proposed in this tool so that the user can also filter out concepts which appear too frequently in the second dataset (since the goal is to find concepts which are poorly or not at all associated with the target in the second dataset). By default the threshold is set to zero, which means that only concepts which don't appear at all in the second dataset are presented. But this threshold can be relaxed by setting a higher value, thus letting concepts which exist in the second dataset appear in the results. The effect differs depending on the ranking method:

With the default Basic Contrast method, the max joint frequency in dataset 2 is the only way to make the second dataset affect which concepts are shown: with a low threshold (especially the default value 0), the concepts which a high association in dataset 2 are filtered out, but with a very high threshold they are shown as well so the contrast between the two datasets is not clearly visible.
With the Diff rank 1 vs 2 method, even a high max joint frequency in dataset 2 value is unlikely to show any additional concept at the top of the list because their difference in relative rank is not as high as the ones with a lower frequency in dataset 2. However this can happen either if the target concept has few associated concepts, or if the min joint freq in dataset 1 is set to a high value. In this case which helps finding general/frequent concepts, the max joint frequency in dataset 2 value is used to avoid (or not) the concepts which do not have a high difference between the two datasets.

Overall the two thresholds min joint freq in dataset 1 and max joint frequency in dataset 2 play an important role in the selection of the associated concepts shown in the results. Intuitively, the former allows the user to adjust the generality level while the latter is used to adjust the contrast level. These two thresholds should be adjusted depending on the target concept (in particular its frequency and number of associated concepts) and the desired outcome (level of generality and contrast).

Filtering by semantic type and other options

Please refer to the documentation of the previous tool. You can also check the usage tips.

Examples

In the following examples the default articles only,by paper and abstracts only,by paper options are selected as dataset 1 and 2 respectively. The "Amyotrophic Lateral Sclerosis (ALS)" concept is used as target, and the default PMI is used as association measure.

With the default parameters the following top 5 concepts are shown to the user:

Top 5 concepts for ALS (default parameters)
concept	term	group	jointFreq.ref	pmi.ref	pmi.mask	rank.ref	rank.mask	rank
C3255210	Azolen	CHEM	12	15.43535	NA	1384.5	23165	1
C0381157	Autologen	CHEM	20	15.41333	NA	1387.0	23165	2
C3498115	entorhinal white matter	ANAT	85	15.40563	NA	1388.0	23165	3
C0718950	Biomox	CHEM	12	15.32844	NA	1397.0	23165	4
C0613678	UTEN	CHEM	101	15.26653	NA	1410.0	23165	5

These concepts are quite specific and their quite strong association with ALS is not immediately clear. It is possible that some are explained by chance or by some bias in the literature, but in general an expert would have to perform a deeper analysis in order to understand the nature of the association.

It can be observed that jointFreq.mask, the joint frequency in dataset 2, is always 0, which indicates that these concepts do not appear at all together in the second dataset. As a result their association is is the lowest possible in dataset 2, and since it is high in dataset 1 they obtain a high difference.

Ranking Methods and Thresholds

With the ranking method Basic contrast, results are strongly affected by the max joint freq in dataset 2 parameter: if the threshold is high then the method simply shows the top concepts in dataset 1, whereas if the threshold is low (especially 0) then it tends to show the same results as the alternative ranking method Diff rank 1 vs 2, as shown in the following examples.

Top 5 concepts for ALS, Basic Contrast with max joint freq 2 = 10
concept	term	group	jointFreq.ref	jointFreq.mask	pmi.ref	pmi.mask	rank.ref	rank.mask	rank
C4475575	Radicava	CHEM	50	8	15.55083	14.20717	688.5	33	1.5
C3280587	AMYOTROPHIC LATERAL SCLEROSIS 16, JUVENILE	DISO	12	1	15.55083	12.62221	688.5	204	1.5
C0668601	SOD1 G93A protein	CHEM	36	4	15.51130	14.20717	1377.0	33	3.0
C4522181	Brachial Amyotrophic Diplegia	DISO	13	10	15.44391	13.20717	1382.0	129	4.0
C3255210	Azolen	CHEM	12	0	15.43535	NA	1384.5	23165	6.5

This example where max joint freq in dataset 2 is set to 10 shows that some concepts have a high association in both datasets, as opposed to the ones obtained with max joint freq in dataset 2 set to 0 or the Diff rank 1 vs 2 method (see below). Naturally the goal of the tool is to filter out these concepts, but the basic contrast method can be used to observe how such well-studied concepts are progressively discarded when starting from a high max joint freq in dataset 2 value and then decreasing it.

Top 5 concepts for ALS, Absolute Diff Rank with max joint freq = 0
concept	term	group	jointFreq.ref	pmi.ref	pmi.mask	rank.ref	rank.mask	rank
C3255210	Azolen	CHEM	12	15.43535	NA	1384.5	23165	1
C0381157	Autologen	CHEM	20	15.41333	NA	1387.0	23165	2
C3498115	entorhinal white matter	ANAT	85	15.40563	NA	1388.0	23165	3
C0718950	Biomox	CHEM	12	15.32844	NA	1397.0	23165	4
C0613678	UTEN	CHEM	101	15.26653	NA	1410.0	23165	5

In this example the max joint freq in dataset 2 threshold is 0 (default) but this is not the reason why the top results are made of concepts which do not appear in the second dataset. Indeed, changing this parameter to a high value (for instance 99999) shows the exact same top concepts. In fact, the max joint freq in dataset 2 threshold often has little effect with the Diff rank 1 vs 2 ranking method, unless the min joint freq in dataset 1 threshold is high or the target concept is not frequent. This is because this ranking method favours concepts which do not appear at all in dataset 2, since these concepts have a very high relative rank difference as long as they are ranked at the top in dataset 1.

With the default Diff rank 1 vs 2 method, one should set a high value for both parameters min joint freq in dataset 1 and max joint freq in dataset 2 in order to see some concepts which appear in dataset 2 at the top. The first discards rare concepts in dataset 1, while the second allows frequent concepts in dataset 2. As seen above, without any contraints the top concepts with Diff rank 1 vs 2 tend to be quite rare and specific. Increasing min joint freq in dataset 1 removes such specific concepts from the top of the list and replaces them with more frequent and therefore more general concepts. Such concepts are more likely to also appear in dataset 2, as shown in the following example:

Top 5 concepts for ALS with ranking method 'Diff absolute rank 1 vs 2', 'min joint freq in dataset 1' set to 400 and 'max joint freq in dataset 2' set to 9999
concept	term	group	jointFreq.ref	jointFreq.mask	pmi.ref	pmi.mask	rank.ref	rank.mask	rank
C1308189	ERF protein, human	CHEM	415	0	12.54866	NA	10418	23165.0	1
C1333360	ERF gene	GENE	420	0	12.30001	NA	11954	23165.0	2
C1446539	tegafur-uracil	CHEM	480	2	14.05151	4.303291	3181	14284.5	3
C0043102	Weil Disease	DISO	648	1	13.42155	4.294284	5608	14324.5	4
C0085196	Oxidopamine	CHEM	676	9	12.95100	3.917538	8025	15910.0	5

It can be observed that the difference in relative rank is lower than in the previous examples, due to the removal of the concepts with the highest difference. The top two concepts are still concepts which do not appear in dataset 2, but the third and fourth have a couple co-occurrences. Still their relative rank in dataset 2 (abstracts) is very low as expected (this is why they are ranked at the top), which could be an indication that their association with the target is not established in the literature. Of course, it is also possible that their high association in the full articles is due to some artefact.

Finally one can use the semantic filters in order to focus on a specific type of concept. The following example is obtained by selecting only the "Genes and Molecular Sequences" in the semantic filter:

Top 5 concepts in the 'Genes and Molecular Sequences' category for ALS (with ranking method 'Diff absolute rank 1 vs 2', 'min joint freq in dataset 1' set to 200 and 'max joint freq in dataset 2' set to 9999)
concept	term	group	jointFreq.ref	jointFreq.mask	pmi.ref	pmi.mask	rank.ref	rank.mask	rank
C1333360	ERF gene	GENE	420	0	12.30001	NA	11954	23165.0	2
C1412727	BACE1 gene	GENE	547	3	12.56508	4.326569	9692	14178.0	8
C1415888	IFI30 gene	GENE	960	5	14.34985	6.660279	2919	5979.5	9
C1414477	ETV5 gene	GENE	825	5	12.27221	4.475175	12060	13591.0	15
C0599295	ERG gene	GENE	959	7	11.86224	3.576386	15856	17258.0	17