3D Embedding

From CheS-Mapper Wiki
Jump to: navigation, search

What is 3D Embedding?

3D Embedding is the process of computing coordinates in 3D space for each compound based on the compound feature values. In other words, embedding algorithms convert the n-dimensional feature space (n = the number of features) to 3 dimensions. More generally, this process is called dimensionality reduction.

What is a good 3D Embedding?

A good embedding is achieved if the distance in 3D space reflects the feature values for each compound: the distance in 3D space should correlate to the distance of the feature values. However, a loss-free transformation (of the n-dimensional feature space to 3D space) is not always feasible, especially if the dataset is large, and many un-correlated features are used.

Embedding quality - global

CheS-Mapper measures how well the 3D positions reflect the features values of compounds. When the viewer starts, it provides the global embedding quality (on the top right of the screen). Additionally it may give a warning if the embedding quality is not good.

The overall embedding quality is computed by comparing feature values and compound coordinates. In more detail, it is the correlation between the feature value distance matrix and the 3d-positions distance matrix. The feature value distance matrix is computed using the same distance measure as the Embedding algorithm (by default, this is the Euclidean distance, see below). The 3d-positions matrix is computed using Euclidean distance.

Pearson's correlation coefficient ([1]) is computed. It ranges from 1 (perfect correlation), to 0 (no correlation) to -1 (negative correlation). I.e., the better the embedding, the closer the correlation values to 1. We use a conservative categorization of embedding quality (>=0.8 very good, >= 0.6 good, >= 0.4 weak, >= 0.2 very weak, else poor) provided by Evans, J.D.: Straightforward Statistics for the Behavioral Sciences. Duxbury Press, Pacific Grove (1996). A warning is given to the user if the embedding quality is below 0.6.

If there are to many features with to diverse values, it may be impossible to compute a good overall 3D embedding at all.

Embedding stress - local

The Embedding stress for each compound is provided as a compound feature, and can be selected in the viewer with the dropdown menu on the bottom left. The closer the stress value to 0, the lower is the embedding stress for the particular compound, and the better does the 3d-distance to the other compounds in the dataset correlate to the feature value distance.

Note that the highlighting colors may be missleading: the compound with the highest embedding error will always be shown in red, even if the whole dataset including this compound is almost perfectly embedded.

The Embedding stress is computed analogous to the global embedding quality, but it uses only the distances of all other compounds to this particular compound: Embedding-stress(compound) = 1 - Pearson's correlation (feature-distances-to-compound, 3d-position-distances-to-compound). (In other words, the global embedding quality is the correlation of two matrixes, the local embedding stress is 1 - the correlation of two arrays.)

What if the Embedding Quality is not good?

This cannot be avoided in some cases. This means that CheS-Mapper provides a compressed, over-simplified view on the feature space. However, you can compute the nearest neigbors for (high-stress) compounds.

Embedding Algorithms

Algorithm Linear Local Deterministic Independent of R Configurable distance mesure
PCA 3D Embedder (WEKA) Yes Yes Yes
PCA 3D Embedder (R) Yes Yes
Sammon 3D Embedder (R) Yes Yes
SMACOF 3D Embedder (R) Yes
TSNE 3D Embedder (R) Yes
Linear
The algorithm runtime is very fast (Non-linear algorithms might take long on large datasets).
Local
The embedding method tries to conserve local-structures within your data, the overall global 3d-embedding will be worse.
Deterministic
The algorithm contains no random element, the result is equal for each run of the algorithm. No entry: the user can specify a Random seed value to initialize the algorithm.
Independent of R
For some algorithms the R statistical software has to be installed on your machine.
Configurable distance measure
Some algorithms allow to select different distance measures (see below).

Distance measures

By default, Euclidean distance is (implicitly) used as distance measure by most algorithms. Other distance/similarity measures (like e.g. Tanimoto similarity) may be more suitable for nominal features.

Which embedding algorithm to select?

The embedding algorithm should be selected according to the number of compounds in the dataset (see Image below).

If the embedding quality is poor try to:

  • use a different algorithm (if Sammon/Smacof take too long, try to reduce the number of iterations)
  • select Sammon embedding with different distance measures (e.g. Tanimoto similartiy may be suitable for nominal feature values).
  • remove outlier compounds (Often outliers are spatially clearly seperated from the remaining dataset. Moreover, you can identify "problem compounds" via the compound feature Embedding stress.)
  • reduce the number of features

Select embedding.png

Details

PCA with R
There are two variants to perform PCA, the method prcomp is used because it can be applied to datasets with more features than compounds.