Working with the Wizard

From CheS-Mapper Wiki

Load Dataset

  • To load a dataset directly from the internet, copy the dataset link into the text field provided and press the Load Dataset button.

Create 3D Structures

Extract Features

Cluster Dataset


Algorithm | Cluster approach | Num Clusters Variable | Various Distance Functions | Deterministic | Random restarts | Independent of R
SimpleKMeans (WEKA) Centroid Yes Yes
k-Means - Cascade (WEKA) Centroid Yes Yes Yes Yes
FarthestFirst (WEKA) Centroid Yes
Expectation Maximization (WEKA) Distribution Yes* Yes
Cobweb (WEKA) Connectivity Yes Yes
Hierarchical (WEKA) Connectivity Yes Yes Yes
k-Means (R) Centroid Yes
k-Means - Cascade (R) Centroid Yes Yes
Hierarchical (R) Connectivity Yes
Hierarchical - Dynamic Tree Cut (R) Connectivity Yes Yes

Embed into 3D Space


Algorithm | Linear | Local | Deterministic | Runs without R
PCA 3D Embedder (WEKA) Yes Yes Yes
PCA 3D Embedder (R) Yes Yes
Sammon 3D Embedder (R) Yes
SMACOF 3D Embedder (R) Yes
TSNE 3D Embedder (R) Yes

Embedding quality

The R² (coefficient of determination) is provided as a measure of the overall (global) embedding quality. It indicates how well the computed 3D positions reflect the feature values.


R² measures the correlation between the Euclidean distance matrix computed from the compound feature values and the Euclidean distance matrix computed from the compound 3D positions. The better the embedding, the closer the R² value is to 1. If there are too many features with too diverse values, it may be impossible to compute a good overall 3D embedding at all.
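
The comparison of the two distance matrices can be sketched in Python. This is a minimal illustration, not CheS-Mapper's actual code: it assumes that R² is the squared Pearson correlation over all pairwise distances, and the function names and example values are hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distances for all unordered pairs of points."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def embedding_r_squared(features, positions):
    """Assumed definition: squared correlation between the feature-space
    distances and the 3D-space distances of the same compound pairs."""
    d_features = pairwise_distances(features)
    d_positions = pairwise_distances(positions)
    return pearson(d_features, d_positions) ** 2


# Hypothetical example: four compounds with five feature values each,
# and one 3D position per compound.
features = [
    (0.0, 1.0, 0.0, 2.0, 1.0),
    (1.0, 1.0, 0.0, 2.0, 0.0),
    (4.0, 0.0, 3.0, 0.0, 1.0),
    (5.0, 0.0, 3.0, 1.0, 1.0),
]
positions = [
    (0.0, 0.0, 0.0),
    (1.0, 0.5, 0.0),
    (4.0, 3.0, 1.0),
    (5.0, 3.0, 0.5),
]
print(embedding_r_squared(features, positions))
```

Under this assumed definition, an embedding that only rescales all distances by a constant factor still reaches a value of 1, because the correlation ignores the overall scale.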

Which embedding algorithm to choose?

The embedding algorithm should be selected according to the number of compounds in the dataset (see image below). If the embedding quality is poor, try a different algorithm. If Sammon or SMACOF takes too long, try reducing the number of iterations. Reducing the number of features may also yield a better embedding.

[Image: Select embedding.png]


PCA with R
R offers two functions for performing PCA, prcomp and princomp. The prcomp function is used because it can be applied to datasets with more features than compounds.
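
Why PCA remains feasible with more features than compounds can be illustrated with a small, dependency-free Python sketch. It is not what prcomp does internally (prcomp works through a singular value decomposition of the centered data matrix); as a swapped-in technique, the sketch instead takes the eigenvectors of the compound-by-compound matrix of dot products, whose size depends only on the number of compounds. Centering without scaling, the power iteration, and all names and values are assumptions made for illustration.

```python
import math


def center_columns(rows):
    """Subtract each feature's mean so that every column sums to zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def gram_matrix(rows):
    """Compound-by-compound dot products (n x n, whatever the feature count)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def top_eigenpairs(matrix, k, iterations=200):
    """Largest k eigenpairs of a symmetric positive semi-definite matrix,
    found by power iteration with deflation."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for c in range(k):
        # Fixed, non-constant start vector (an arbitrary illustrative choice).
        vec = [1.0 / (i + 1 + c) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _ in range(iterations):
            nxt = mat_vec(work, vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:  # no variance left to extract
                break
            vec = [x / norm for x in nxt]
        value = sum(v * w for v, w in zip(vec, mat_vec(work, vec)))
        pairs.append((value, vec))
        # Deflation: remove the component just found from the matrix.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return pairs


def pca_scores(rows, components=3):
    """Coordinates of each compound on the leading principal components."""
    centered = center_columns(rows)
    pairs = top_eigenpairs(gram_matrix(centered), components)
    return [[math.sqrt(max(value, 0.0)) * vec[i] for value, vec in pairs]
            for i in range(len(rows))]


# Hypothetical dataset: 4 compounds described by 6 features each.
compounds = [
    [14.0, 7.0, 2.0, 5.0, 2.0, 1.0],
    [14.0, 3.0, 0.0, 5.0, 0.0, 0.0],
    [6.0, 7.0, 0.0, 1.0, 2.0, 0.0],
    [6.0, 3.0, 2.0, 1.0, 0.0, 1.0],
]
for position in pca_scores(compounds):
    print(position)
```

With only four compounds the centered data has at most three independent directions, so in this toy case the three components reproduce all pairwise distances between the compounds; larger datasets will generally lose some information in 3D.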

Align Compounds

This is a preliminary version; more documentation will be added soon.