# Working with the Wizard

## Load Dataset

- To **directly load a dataset from the internet**, copy the dataset link into the text field provided and press the **Load Dataset** button.

## Create 3D Structures

## Extract Features

## Cluster Dataset

### Overview

Algorithm | Cluster approach | Num Clusters Variable | Various Distance Functions | Deterministic | Random restarts | Independent of R |
---|---|---|---|---|---|---|
SimpleKMeans (WEKA) | Centroid | Yes | Yes | | | |
k-Means - Cascade (WEKA) | Centroid | Yes | Yes | Yes | Yes | |
FarthestFirst (WEKA) | Centroid | Yes | | | | |
Expectation Maximization (WEKA) | Distribution | Yes* | Yes | | | |
Cobweb (WEKA) | Connectivity | Yes | Yes | | | |
Hierarchical (WEKA) | Connectivity | Yes | Yes | Yes | | |
k-Means (R) | Centroid | Yes | | | | |
k-Means - Cascade (R) | Centroid | Yes | Yes | | | |
Hierarchical (R) | Connectivity | Yes | | | | |
Hierarchical - Dynamic Tree Cut (R) | Connectivity | Yes | Yes |

## Embed into 3D Space

### Overview

Algorithm | Linear | Local | Deterministic | Runs without R |
---|---|---|---|---|
PCA 3D Embedder (WEKA) | Yes | Yes | Yes | |
PCA 3D Embedder (R) | Yes | Yes | | |
Sammon 3D Embedder (R) | Yes | | | |
SMACOF 3D Embedder (R) | Yes | | | |
TSNE 3D Embedder (R) | Yes |

### Embedding quality

R² (the coefficient of determination) is provided as a measure of the overall (global) embedding quality. It indicates how well the computed 3D positions reflect the feature values.

R² measures the correlation between the Euclidean distance matrix based on the compound feature values and the Euclidean distance matrix based on the compound 3D positions. The better the embedding, the closer the R² value is to 1. If there are too many features with too diverse values, it may be impossible to compute a good overall 3D embedding at all.
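The comparison of the two distance matrices can be illustrated with a minimal Python sketch. It assumes that R² is the squared Pearson correlation between the pairwise Euclidean distances in feature space and those in 3D space; the exact formula used by the wizard is not stated here, and the function names (`pairwise_distances`, `pearson`, `embedding_r_squared`) are hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def embedding_r_squared(features, positions):
    """Assumed definition: squared correlation of the two distance lists."""
    d_features = pairwise_distances(features)    # distances in feature space
    d_positions = pairwise_distances(positions)  # distances in 3D space
    return pearson(d_features, d_positions) ** 2


# Made-up example: four compounds with four feature values each.
features = [(0, 0, 0, 0), (1, 0, 0, 0), (0, 2, 0, 0), (0, 0, 3, 0)]
# A 3D embedding that preserves every pairwise distance.
good_positions = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3)]
# A 3D embedding that squeezes all compounds onto one axis.
poor_positions = [(0, 0, 0), (3, 0, 0), (1, 0, 0), (2, 0, 0)]

r2_good = embedding_r_squared(features, good_positions)
r2_poor = embedding_r_squared(features, poor_positions)
```

In this sketch the distance-preserving embedding yields a value of (numerically) 1, while the embedding that distorts the distances yields a lower value.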

### Which embedding algorithm to choose?

The embedding algorithm should be selected according to the number of compounds in the dataset (see image below). If the embedding quality is poor, try a different algorithm. If Sammon or SMACOF takes too long, try reducing the number of iterations. Reducing the number of features may also help to obtain a better embedding.

### Details

- PCA with R
  - There are two ways to perform PCA in R; the method **prcomp** is used because it can be applied to datasets with more features than compounds.

## Align Compounds

**This is a preliminary version; more documentation will be added soon...**