User Guide: Top Associated Concepts with a Target Concept

This document explains how this Shiny visualization tool works.

Summary

This tool shows a table of the concepts associated with a particular target concept selected by the user. The concepts in the table are ordered by decreasing strength of association.

The user can select variants of the data used as a basis for calculating the associations: abstracts and/or full articles, co-occurrences counted by sentence or by paper.
The "strength of association" is measured in the data by Pointwise Mutual Information (PMI), a statistical measure based on how often two concepts occur together with respect to how often each of them occurs on its own.
The user can refine the search of associated concepts:
- by setting a minimum frequency threshold, which removes the least "important" concepts
- by filtering the semantic types of the related concepts, for instance in order to visualize only the top genetic concepts.

Data

The data used in this application has been precomputed using the biomedical text content from Medline and PubMed Central (PMC) as well as the UMLS metathesaurus. Three variants of the data are proposed to the user:

"abstracts only": the Medline abstracts
"articles only": the PMC full articles
"abstracts+articles": both the Medline abstracts and PMC articles (duplicate abstracts which appear in both are discarded)

The associations between concepts may change depending on which dataset is selected by the user. For example abstracts are shorter and usually focus only on the most important concepts, whereas full papers contain more details and therefore more specific concepts.

The PMI association value (see below) is calculated based on how often any two concepts appear together in the data and how often they appear individually. The fact that two concepts "appear together" can be interpreted at different levels, so for every dataset two levels of co-occurrence are proposed to the user:

"by paper" means that every pair of concepts found in the same document (abstract or article) is counted as a co-occurrence.
"by sentence" means that a pair of concepts is counted as a co-occurrence only if the two concepts appear in the same sentence.

Of course, with the former option many more co-occurrences are counted. Some of the concept pairs might be only remotely related (if at all), especially in the case of full articles. On the other hand, the latter option is restrictive and tends to capture only the clearest cases of relations between concepts, so it might be more accurate but it might also miss some relations.
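
The difference between the two levels can be illustrated with a minimal Python sketch. It is not the code used to precompute the data: the input format (documents as lists of sentences, sentences as sets of concept labels), the function name and the choice to count a pair at most once per unit are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(documents, level):
    """Count concept pairs at the chosen level ("paper" or "sentence").

    `documents` is a list of documents; each document is a list of sentences;
    each sentence is a set of concept labels (a hypothetical input format).
    """
    counts = Counter()
    for sentences in documents:
        if level == "paper":
            units = [set().union(*sentences)]  # the whole document is one unit
        else:
            units = sentences                  # every sentence is its own unit
        for unit in units:
            # A pair is counted at most once per unit (an assumption of this sketch).
            for pair in combinations(sorted(unit), 2):
                counts[pair] += 1
    return counts

# Toy corpus: two documents with made-up content.
docs = [
    [{"ALS", "C9orf72 gene"}, {"Rilutek"}],
    [{"ALS", "Rilutek"}],
]
by_paper = count_cooccurrences(docs, "paper")
by_sentence = count_cooccurrences(docs, "sentence")
print(by_paper[("ALS", "Rilutek")], by_sentence[("ALS", "Rilutek")])
```

In this toy corpus the pair ("ALS", "Rilutek") is counted in both documents "by paper" but only in the second one "by sentence", because in the first document the two concepts are in different sentences.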

Target Selection

The user can select a target concept among a predefined list of ND concepts. The list of available concepts can be filtered by ND group using the checkboxes above the target selection box. The list of available concepts is also organized by ND group, with the group abbreviation shown before the specific concept.

There are several special cases available for selection: for every ND group X there is a choice "X (all concepts)", and there are also two global choices "ALL NDs target concepts" and "ALL NDs target concepts except group 'Other'". These special cases are made of the union of several target concepts, which means that a co-occurrence with a concept Y is counted every time any of the considered target concepts appears together with Y. This is intended to show for example the concepts most associated with Alzheimer's Disease (AD) defined in a broad sense, including concepts such as "presenile dementia" and "early onset of AD".
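
One possible reading of this union, sketched in Python: a unit (sentence or paper) contributes one co-occurrence with Y as soon as it contains Y and at least one concept of the group. The function name, the toy data and the once-per-unit counting are assumptions of this sketch, not a description of the actual precomputation.

```python
def joint_count(units, targets, other):
    """Number of units in which `other` appears with at least one of `targets`."""
    return sum(1 for unit in units if other in unit and not targets.isdisjoint(unit))

# Toy units (sentences or papers) with made-up content.
units = [
    {"Alzheimer's Disease", "tau"},
    {"presenile dementia", "tau"},
    {"Alzheimer's Disease", "presenile dementia", "tau"},
    {"tau"},
    {"Alzheimer's Disease"},
]
single = joint_count(units, {"Alzheimer's Disease"}, "tau")
group = joint_count(units, {"Alzheimer's Disease", "presenile dementia"}, "tau")
print(single, group)
```

With the single target only two units count, whereas the group target also captures the unit where "tau" appears with "presenile dementia" alone.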

Technical note: a free choice of target concept is not possible because of the volume of data and the intensive precomputations required. Thus this list is currently limited to the initial target ND concepts, but it would be possible to extend it with more concepts in the future. The interface might also be improved in the future with a search box in order to make the selection more user-friendly.

Association Measure

Pointwise Mutual Information (PMI) is the default measure used as an indicator of the strength of the association between two concepts. This value is based on how often two concepts \(A\) and \(B\) occur together with respect to how often each of them occurs on its own. In other words, it does not only take into account how often two concepts are found together (the joint probability \(p(A,B)\), which would be biased towards frequent concepts); it also relates this joint probability to the frequency of each concept (\(p(A)\) and \(p(B)\)). This way a rare concept \(A\) might be found to be strongly associated with a frequent concept \(B\) if \(B\) almost always appears when \(A\) does (high conditional probability \(p(B|A)\)), even though \(A\) usually does not appear when \(B\) does (low conditional probability \(p(A|B)\)).
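
For reference, the standard definition of PMI can be written as follows. The base-2 logarithm is an assumption: this guide does not state the base, but base 2 is consistent with the values shown in the Examples section.

\[
\mathrm{PMI}(A,B) = \log_2 \frac{p(A,B)}{p(A)\,p(B)} = \log_2 \frac{p(B|A)}{p(B)} = \log_2 \frac{p(A|B)}{p(A)}
\]

The second and third forms make the role of the conditional probabilities explicit: PMI compares how likely \(B\) is when \(A\) is present with how likely \(B\) is in general, and symmetrically for \(A\).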

The PMI value has no predefined bounds: its minimum and maximum depend on the probabilities of the concepts \(A\) and \(B\).

A high positive value denotes a high association, i.e. \(A\) and \(B\) tend to "attract each other".
A value of zero (or close to zero) denotes the absence of interdependency, i.e. \(A\) and \(B\) appear together only by chance.
A negative value denotes a negative association, i.e. \(A\) and \(B\) tend to "repel each other".
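
The three cases can be reproduced with a small Python sketch. It assumes base-2 logarithms and probabilities estimated as simple relative frequencies; the function name and the counts are made up for illustration.

```python
import math

def pmi(joint_count, count_a, count_b, total):
    """PMI in bits from raw counts; `total` is the number of units (sentences or papers)."""
    if joint_count == 0:
        return float("-inf")  # the pair never co-occurs
    p_ab = joint_count / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log2(p_ab / (p_a * p_b))

# Hypothetical counts over 1000 units: A occurs 50 times, B occurs 100 times,
# so 5 co-occurrences are expected if A and B are independent.
attract = pmi(joint_count=40, count_a=50, count_b=100, total=1000)     # more often than chance
independent = pmi(joint_count=5, count_a=50, count_b=100, total=1000)  # as often as chance
repel = pmi(joint_count=1, count_a=50, count_b=100, total=1000)        # less often than chance
print(attract, independent, repel)
```

The first value is positive, the second is zero up to floating-point rounding, and the third is negative.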

The tool also offers several other ways to measure the association between concepts, presented below. It is recommended to choose the measure by trial and error based on the visible top results: one measure might be suitable for the desired goal for a particular target while another measure works better in a different context.

Normalized Pointwise Mutual Information (NPMI). In text data PMI is often considered biased towards rare events. NPMI is a variant of PMI known to reduce this bias, and it has the important advantage of being normalized: the value is between -1 and 1 (0 indicates independence and 1 indicates that the two concepts always appear together), making it easier to interpret.
PMI^2 and PMI^3 are two other variants of PMI meant to give more importance to more frequent events. As a result these measures focus on the most general concepts, PMI^3 even more strongly than PMI^2.
Mutual Information (MI). MI is closely related to PMI but it is more complex. It does not only reflect how much two concepts tend to appear together, it also takes into account how much they don't. This means that MI can be high also if two concepts tend to "avoid each other". While this could potentially be useful in general, it seems inadequate in the present application due to the very high number of concepts in the data. The MI results are also harder to interpret since there can be different reasons for a value to be high or low. For these reasons we do not recommend using it, but the option is available in the tool.
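
The PMI variants can be compared with the following Python sketch. It uses the definitions commonly found in the literature (NPMI as PMI divided by \(-\log p(A,B)\), and PMI^k with the joint probability raised to the power k); this guide does not state the exact formulas implemented by the tool, so these are assumptions, as are the function name and the probabilities used.

```python
import math

def pmi_family(p_ab, p_a, p_b):
    """Return PMI, NPMI, PMI^2 and PMI^3 (base-2 logs) for one concept pair."""
    pmi = math.log2(p_ab / (p_a * p_b))
    return {
        "PMI": pmi,
        # NPMI divides PMI by -log2 p(A,B), the PMI of a pair that always co-occurs.
        "NPMI": pmi / -math.log2(p_ab),
        # PMI^k raises the joint probability to the power k before taking the ratio.
        "PMI2": math.log2(p_ab ** 2 / (p_a * p_b)),
        "PMI3": math.log2(p_ab ** 3 / (p_a * p_b)),
    }

# Made-up probabilities: a rare pair that always co-occurs, and a frequent pair
# that co-occurs in half of the occurrences of each concept.
rare = pmi_family(p_ab=0.001, p_a=0.001, p_b=0.001)
frequent = pmi_family(p_ab=0.05, p_a=0.1, p_b=0.1)
print(rare)
print(frequent)
```

With these numbers the rare pair has the higher PMI while the frequent pair has the higher PMI^3, which illustrates how the choice of measure shifts the ranking towards more general concepts.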

Technical note: the MI value is calculated based on the 2x2 contingency table corresponding to the presence or absence of each of the two concepts, i.e. four cases are considered: neither A nor B is present, only A is present, only B is present, or both are present.
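
A minimal Python sketch of this calculation is given below, using the standard definition of mutual information over the four cells. The function name, the base-2 logarithm and the counts are assumptions for illustration; the tool's own implementation may differ in its details.

```python
import math

def mutual_information(n_ab, n_a, n_b, n_total):
    """MI in bits from counts: joint count, count of A, count of B, number of units."""
    # The four cells of the 2x2 table: (A present?, B present?) -> count
    cells = {
        (True, True): n_ab,
        (True, False): n_a - n_ab,
        (False, True): n_b - n_ab,
        (False, False): n_total - n_a - n_b + n_ab,
    }
    marg_a = {True: n_a / n_total, False: 1 - n_a / n_total}
    marg_b = {True: n_b / n_total, False: 1 - n_b / n_total}
    mi = 0.0
    for (a, b), count in cells.items():
        if count == 0:
            continue  # an empty cell contributes nothing (0 * log 0 is taken as 0)
        p = count / n_total
        mi += p * math.log2(p / (marg_a[a] * marg_b[b]))
    return mi

# Made-up counts over 1000 units.
together = mutual_information(n_ab=500, n_a=500, n_b=500, n_total=1000)  # always together
avoid = mutual_information(n_ab=0, n_a=500, n_b=500, n_total=1000)       # never together
independent = mutual_information(n_ab=5, n_a=50, n_b=100, n_total=1000)  # chance level
print(together, avoid, independent)
```

The first two cases give the same MI value even though the concepts attract each other in one and avoid each other in the other, which is why MI results are harder to interpret than PMI.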

Joint Frequency Thresholding

The user can select a minimum joint frequency, i.e. the minimum number of times two concepts must appear together to be selected. This makes it possible to filter out cases where two concepts rarely occur together (even if they have a high association value), and consequently push pairs which may have a slightly lower association value up to the top of the table. This is useful because there are many rare concepts which always appear accompanied by the target concept, but which are often too specific to be considered an important indicator of the target concept. Some rare concepts may also appear by chance with the target, as opposed to more frequent concepts. The more often a co-occurrence event happens, the more one can be confident that the association value is meaningful.

The frequency threshold can also be seen as a way to adjust the level of generality of the observed concepts: increasing the threshold shows relationships involving high-level concepts, while decreasing it leads to more specific relationships. Importantly, finding the desired level of generality may depend on the target concept, i.e. different concepts may require different threshold values.
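
The effect of the threshold can be sketched in Python as a filter followed by a sort. The rows are copied from the tables in the Examples section of this guide (target ALS, group concept omitted); the function name and the data layout are assumptions, and the threshold is taken as inclusive because a concept with joint frequency 10 is listed at minimum frequency 10.

```python
# (CUI, firstTerm, jointFreq, PMI) values copied from the Examples section.
ROWS = [
    {"CUI": "C4475575", "firstTerm": "Radicava", "jointFreq": 8, "PMI": 10.99814},
    {"CUI": "C0154683", "firstTerm": "Other motor neuron disease", "jointFreq": 44, "PMI": 10.413179},
    {"CUI": "C0678179", "firstTerm": "Rilutek", "jointFreq": 10, "PMI": 9.927752},
    {"CUI": "C4024896", "firstTerm": "Motor neuron atrophy", "jointFreq": 493, "PMI": 9.504793},
    {"CUI": "C1428691", "firstTerm": "C9orf72 gene", "jointFreq": 1026, "PMI": 9.321036},
]

def top_concepts(rows, min_joint_freq, n=5):
    """Keep rows with jointFreq >= min_joint_freq, ordered by decreasing PMI."""
    kept = [r for r in rows if r["jointFreq"] >= min_joint_freq]
    return sorted(kept, key=lambda r: r["PMI"], reverse=True)[:n]

for threshold in (0, 10, 100):
    print(threshold, [r["firstTerm"] for r in top_concepts(ROWS, threshold)])
```

Raising the threshold from 0 to 10 and then to 100 removes "Radicava" and then the two next rarest concepts, so the top of the list moves towards more frequent, more general concepts.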

Filtering by Semantic Type

The concepts shown in the results table can be filtered by semantic category. This feature relies on UMLS semantic types, which involve two levels of classification. By default the semantic filtering is disabled; the user can enable it by selecting one of the two available levels as "Granularity of the semantic categories". This action makes the selection box "Filter by semantic types" appear, where the user can select the desired categories of concepts. The concepts which do not belong to any of the selected semantic types are filtered out from the results, causing the other concepts to be pushed up to the top.

Usability tip: when the "detailed" granularity is selected, there are around 50 to 60 semantic types available in the selection box. The "select all" and "deselect all" buttons (at the top of the selection list) are provided for conveniently eliminating or keeping only a few types.
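
The filtering logic can be sketched in Python as follows. The rows reuse CUIs and coarse categories shown in the Examples section; the function name is hypothetical, and treating ";" in coarseCatId as a separator between several categories is an assumption based on values such as "CHEM;CHEM".

```python
# CUI and coarseCatId values copied from the Examples section.
ROWS = [
    {"CUI": "C0154683", "coarseCatId": "DISO"},
    {"CUI": "C0678179", "coarseCatId": "CHEM;CHEM"},
    {"CUI": "C1515501", "coarseCatId": "ANAT"},
    {"CUI": "C1428691", "coarseCatId": "GENE"},
]

def filter_by_category(rows, selected):
    """Keep rows having at least one category in `selected` (a set of category ids)."""
    return [r for r in rows if selected.intersection(r["coarseCatId"].split(";"))]

genes_only = filter_by_category(ROWS, {"GENE"})
chem_or_diso = filter_by_category(ROWS, {"CHEM", "DISO"})
print([r["CUI"] for r in genes_only], [r["CUI"] for r in chem_or_diso])
```

Since the rows keep their original order, a concept that passes the filter moves up in the table simply because the rows above it have been removed.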

Viewing the Results Table

The table on the right side of the tool shows the concepts related to the target, ordered by decreasing association score (PMI by default), after applying the selected filters (see above).

Every concept is identified by a CUI (Concept Unique Identifier). The CUI links to the UMLS page describing this concept in detail (this requires a UTS account).
- See Usage Tips below about creating a UTS account.
- There is a technical glitch which prevents redirecting the user to the UMLS concept page if they were not previously signed in with their UTS account: the UTS website login page appears, but after login the user is redirected to the general UMLS page instead of the specific concept page. In this case the user has to click a second time on the CUI link, and this time the correct page appears. This happens only the first time the user clicks a link, since they will already be signed in afterward.
The "View Options" checkboxes can be used to control the appearance and content of the table:
- "Show all terms": show the full list of terms for every concept instead of only the first term (default).
- "Show concepts groups": include the rows corresponding to special "group concepts" (see Target Selection above) in the list.
- "Show both PMI and MI": add a column showing the value of the other measure (the one that is not currently selected).
- "Show conditional probabilties": show the conditional probability columns.
The table itself offers various controls for convenience: a search box at the top right, the number of rows at the bottom left, buttons to iterate through pages at the bottom right (the latter can be used to visualize concepts with low PMI value at the end of the table).

Usage Tips

The user's browser can be used to open several tabs or windows in order to see this document and manipulate the app at the same time.
It is also possible to open several tabs or windows of the app itself in order to compare what happens in different configurations.
It is possible to select some rows in the results table and copy/paste them in an external document. Normally this is supposed to preserve the formatting, but this probably depends on the software used.
Access to the UMLS Terminology Services (UTS) (when clicking on a CUI link in the results table) requires the user to have a UTS account. Creating a UTS account is free (and it can be done through Google or Facebook authentication), but this is not an automatic process so the user may have to wait for the account to be validated. Once validated, the user just has to sign in in order to browse the concepts in the UMLS metathesaurus.

Examples

In the following examples the "abstracts only" and "by sentence" options are selected for the dataset. The "Amyotrophic Lateral Sclerosis (ALS)" concept is used as target, and the default PMI is used as association measure.

The following table shows the top 5 concepts obtained with the default minimum frequency 10:

Top 5 concepts associated with target ALS at min. frequency 10
CUI	firstTerm	coarseCatId	jointFreq	probCuiGivenTarget	probTargetGivenCui	PMI
group.ALS	NA	NA	73631	1.0000000	0.8423925	10.750706
C0154683	Other motor neuron disease	DISO	44	0.0005976	0.6666667	10.413179
C0678179	Rilutek	CHEM;CHEM	10	0.0001358	0.4761905	9.927752
C3686938	Progressive motor neuron disease	DISO	34	0.0004618	0.4473684	9.837676
C1515501	9p21.2	ANAT	10	0.0001358	0.4000000	9.676213

The first row is the special group concept ALS. By construction, this group, which includes the specific concept ALS, is strongly associated with it: the probCuiGivenTarget conditional probability is one, meaning that whenever the target appears the group concept also appears (by definition of the group); the probTargetGivenCui is 0.84, meaning that the target appears 84% of the time when the group concept appears (this is expected, ALS being the main concept in the group). Note that special "group concepts" can be filtered out using the View Options.

Starting from the second row, regular concepts appear. For example, "Other motor neuron disease" belongs to the disorder ("DISO") category. It appears together with ALS 44 times in total, which represents 66.7% of the occurrences of the CUI "Other motor neuron disease" but only 0.06% of the occurrences of the target ALS, which is much more frequent.
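
The relation between the joint frequency and the two conditional probabilities can be checked with a short Python sketch. The individual frequencies are not shown in the table; here they are inferred from it (73631 for the target, since the group row has probCuiGivenTarget equal to 1, and 66 and 21 for the two concepts from jointFreq divided by probTargetGivenCui), so they should be read as assumptions. The function name is hypothetical.

```python
def conditional_probs(joint_freq, target_freq, cui_freq):
    """Return (probCuiGivenTarget, probTargetGivenCui) from raw frequencies."""
    return joint_freq / target_freq, joint_freq / cui_freq

TARGET_FREQ = 73631  # inferred from the group row (jointFreq with probCuiGivenTarget = 1)

# "Other motor neuron disease": jointFreq 44, inferred frequency 66
other_mnd = conditional_probs(44, TARGET_FREQ, 66)
# "Rilutek": jointFreq 10, inferred frequency 21
rilutek = conditional_probs(10, TARGET_FREQ, 21)
print(other_mnd, rilutek)
```

Up to rounding, these ratios match the probCuiGivenTarget and probTargetGivenCui columns of the table for these two rows.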

Filtering with Minimum Joint Frequency

The minimum frequency can be used to adjust the results to the desired level of generality. For example decreasing the minimum to zero gives the following top 5 concepts:

Top 5 concepts associated with target ALS at min. frequency 0
CUI	firstTerm	coarseCatId	jointFreq	probCuiGivenTarget	probTargetGivenCui	PMI
C4475575	Radicava	CHEM;CHEM	8	0.0001086	1	10.99814
C0154754	Hereditary and idiopathic neuropathy, unspecified	DISO	2	0.0000272	1	10.99814
C0154758	Inflammatory and toxic neuropathy	DISO	1	0.0000136	1	10.99814
C2317803	acquired amyotrophic lateral sclerosis	DISO	1	0.0000136	1	10.99814
C1328482	Drug-induced myasthenic syndrome	DISO	1	0.0000136	1	10.99814

This results in many rare concepts which have a high PMI with the target ALS because the latter always appears when they do (probTargetGivenCui is 1). However this is hardly meaningful, especially in the case of concepts which appear only once. It is also possible to do the opposite, that is to observe only the most general related concepts by increasing the minimum frequency, for example at 100 below:

Top 5 concepts associated with target ALS at min. frequency 100
CUI	firstTerm	coarseCatId	jointFreq	probCuiGivenTarget	probTargetGivenCui	PMI
group.ALS	NA	NA	73631	1.0000000	0.8423925	10.750706
C4024896	Motor neuron atrophy	DISO	493	0.0066955	0.3551873	9.504793
C0154682	Lateral Sclerosis	DISO	258	0.0035040	0.3185185	9.347590
C1428691	C9orf72 gene	GENE	1026	0.0139343	0.3127095	9.321036
C3498531	semiannular sulcus	ANAT	545	0.0074018	0.2917559	9.220975

With the rare concepts discarded, the results now show frequent concepts as strongest associations: "Motor neuron atrophy", "lateral sclerosis", "C9orf72 gene", "semiannular sulcus".
One may notice that the top PMI values are not very far from the ones observed before without any filtering. This is often the case when the target concept is very frequent, since many other concepts have a quite strong association with it, i.e. there are only small variations in association power (PMI) across a very large set of concepts. The filtering options help visualize the most relevant relations for a particular goal.
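
The PMI values in these tables can be cross-checked against the probTargetGivenCui column with a short Python sketch. It assumes that PMI is computed with a base-2 logarithm as log2(p(target | cui) / p(target)); under that assumption the rows with probTargetGivenCui equal to 1 all have PMI = -log2 p(target), which gives an estimate of p(target). The function and variable names are hypothetical.

```python
import math

# PMI shared by the rows whose probTargetGivenCui is 1 (min. frequency 0 table);
# for those rows PMI = -log2 p(target), so p(target) can be recovered from it.
P_TARGET = 2 ** -10.99814

def pmi_from_conditional(prob_target_given_cui, p_target=P_TARGET):
    """PMI in bits, written as log2( p(target | cui) / p(target) )."""
    return math.log2(prob_target_given_cui / p_target)

# (probTargetGivenCui, PMI) pairs copied from the tables in this guide.
ROWS = {
    "group.ALS": (0.8423925, 10.750706),
    "Other motor neuron disease": (0.6666667, 10.413179),
    "Rilutek": (0.4761905, 9.927752),
    "C9orf72 gene": (0.3127095, 9.321036),
    "TARDBP gene": (0.2591061, 9.049756),
}
errors = {name: abs(pmi_from_conditional(p) - pmi) for name, (p, pmi) in ROWS.items()}
print(max(errors.values()))
```

The recomputed values agree with the tables to within rounding. This also explains why the top PMI values stay close to each other: for a fixed target, PMI only varies with the logarithm of probTargetGivenCui, and its maximum is reached when that probability is 1.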

Filtering with Semantic Types

The top 5 concepts below are obtained by selecting the coarse semantic granularity and then enabling only the "Genes and Molecular Sequences" category in the semantic filter:

Top 5 concepts in the 'Genes and Molecular Sequences' category for target ALS (min. frequency 100)
CUI	firstTerm	coarseCatId	jointFreq	probCuiGivenTarget	probTargetGivenCui	PMI
C1428691	C9orf72 gene	GENE	1026	0.0139343	0.3127095	9.321036
C1420588	TARDBP gene	GENE	1942	0.0263748	0.2591061	9.049756
C1421305	UBQLN2 gene	GENE	106	0.0014396	0.2494118	8.994742
C1538299	ATXN2 gene	GENE	150	0.0020372	0.2423263	8.953164
C1456383	IGFALS gene	GENE	445	0.0060437	0.2372068	8.922358

With a min. frequency of 100, this filter leaves only 33 genetic concepts in the table. This way one can observe the difference in association strength between different concepts: while the top concepts have a PMI close to 9 and co-occur with the target in 20 to 30% of their occurrences (based on probTargetGivenCui), the concepts found at the bottom of the table have a PMI close to 0 or even slightly negative. This means that they are not especially associated with the target, and it can be seen that most of them are indeed generic terms such as "Genes", "Alleles", "DNA sequence".