3D Embedding 2

From CheS-Mapper Wiki
Jump to: navigation, search

Introduction

3D Embedding is the process of computing coordinates in 3D space for each compound based on the compound feature values. A good embedding is achieved if the distance in 3D space reflects the feature values for each compound: the distance in 3D space should correlate to the distance of the feature values. In other words, embedding algorithms convert the n-dimensional feature space (n = the number of features) to 3 dimensions.

Embedding Algorithms

Algorithm Linear Local Deterministic Independent of R Configurable distance mesure
PCA 3D Embedder (WEKA) Yes Yes Yes
PCA 3D Embedder (R) Yes Yes
Sammon 3D Embedder (R) Yes Yes
SMACOF 3D Embedder (R) Yes
TSNE 3D Embedder (R) Yes
Linear
The algorithm runtime is very fast (Non-linear algorithms might take long on large datasets).
Local
The embedding method tries to conserve local-structures within your data, the overall global 3d-embedding will be worse.
Deterministic
The algorithm contains no random element, the result is equal for each run of the algorithm. No entry: the user can specify a Random seed value to initialize the algorithm.
Independent of R
For some algorithms the R statistical software has to be installed on your machine.
Configurable distance measure
Some algorithms allow to select different distance measures (see below).


Distance measures

By default, Euclidean distance is (implicitly) used as distance measure by most algorithms. Other distance/similarity measures (like e.g. Tanimoto similarity) may be more suitable for nominal features.

Embedding quality - global

When the viewer starts, it provides the global embedding quality (on the top right of the screen). Additionally it may give a warning if the embedding quality is not good.

The overall embedding quality is computed by comparing feature values and compound coordinates. In more detail, it is the correlation between the feature value distance matrix and the 3d-positions distance matrix. The feature value distance matrix is computed using the same distance measure as the Embedding algorithm (by default, this is the Euclidean distance). The 3d-positions matrix is computed using Euclidean distance.

The correlation between both matrices is computed using the CCC (perfect correlation: 1.0, no correlation: 0.0) (Concordance correlation coefficient).

The better the embedding, the closer the correlation value to 1. If the there are too many features with too diverse values, it may be impossible to compute a good overall 3D embedding at all.

Embedding stress - local

The Embedding stress for each compound is provided as a compound feature, and can be selected in the viewer with the dropdown menu on the bottom left. The closer the stress value to 0, the lower is the embedding stress for the particular compound, and the better does the 3d-distance to the other compounds in the dataset correlate to the feature value distance.

Note that the highlighting colors may be missleading: the compound with the highest embedding error will always be shown in red, even if the whole dataset including this compound is almost perfectly embedded.

The Embedding stress is computed analogous to the global embedding CCC, but it uses only the distances of all other compounds to this particular compound: Embedding-stress(compound) = 1 - CCC(feature-distances-to-compound, 3d-position-distances-to-compound). (In other words, the global embedding quality is the correlation of two matrixes, the local embedding stress is 1 - the correlation of two arrays.)

Which embedding algorithm to select?

The embedding algorithm should be selected according to the number of compounds in the dataset (see Image below).

If the embedding quality is poor try to:

  • use a different algorithm (if Sammon/Smacof take too long, try to reduce the number of iterations)
  • select Sammon embedding with different distance measures (e.g. Tanimoto similartiy may be suitable for nominal feature values).
  • remove outlier compounds (Often outliers are spatially clearly seperated from the remaining dataset. Moreover, you can identify "problem compounds" via the compound feature Embedding stress.)
  • reduce the number of features

Select embedding.png

Details

PCA with R
There are two variants to perform PCA, the method prcomp is used because it can be applied to datasets with more features than compounds.