Tutorial - Mapping a dataset with integrated features
This tutorial is outdated. Check out Tutorial - Mapping a dataset with integrated features - v2.
You will use the CheS-Mapper application to visualize a dataset using the molecular properties stored in the dataset.
The dataset endpoint is Caco-2 permeation, which, according to the authors, is correlated with simple molecular properties (http://pubs.acs.org/doi/suppl/10.1021/ci049884m). The authors provide 100 structurally diverse compounds (we joined the 77 training compounds and 23 test compounds into a single dataset). Five numeric features are available in the dataset. One of them is the actual endpoint, Caco-2 permeability (logPapp); the name of this endpoint feature in the dataset is caco2. The remaining four features are molecular descriptors: experimental distribution coefficient (logD), high charged polar surface area (HCPSA), radius of gyration (rgyr), and fraction of rotatable bonds (fROTB). The authors use these four descriptors to build a QSAR model that predicts Caco-2 permeation.
You will use CheS-Mapper to verify that the properties are correlated with the endpoint. You will also easily detect a compound that is not described well by the authors' QSAR model.
CheS-Mapper Wizard Settings
When the program is started (Run CheS-Mapper), the CheS-Mapper wizard shows up. It will guide you through the mapping process. A copy of the dataset is available here: http://opentox.informatik.uni-freiburg.de/ches-mapper/data/caco2.sdf. To load the dataset into the application, copy the dataset link into the text field in the first wizard step and click 'Load dataset'. After just a few seconds, the dataset properties indicate that 100 compounds have been loaded successfully, including 7 properties and 3D structural data. Hence, skip 3D structure generation in the second wizard step (by selecting 'No 3D Structure Generation').
In the third wizard step, double-click on the feature node labeled 'Included in Dataset' and select the four properties mentioned above (HCPSA, fROTB, logD, and rgyr). To do so, select each feature and click the 'Add feature' button: the feature will then be listed in the 'Selected Features' list at the top right. Take care not to select the actual endpoint caco2.
We will use the default settings for wizard steps four, five, and six, so you can press the 'Start mapping' button right away. This will apply default clustering (k-means with random restarts and different values of k), default embedding (Principal Component Analysis), and no cluster alignment (as the compounds are structurally too diverse).
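If you want to inspect the dataset outside of CheS-Mapper, note that it is an SD file: each record ends with a line containing `$$$$`, and each property is stored after the structure block as a data item introduced by a `> <name>` header line. The following Python function is a minimal sketch of how such properties could be read. It is not part of CheS-Mapper, the function name is hypothetical, and the sample record and its values are invented for illustration only.

```python
def read_sdf_properties(sdf_text):
    """Return one dict {property name: value string} per SDF record."""
    records = []
    props = {}
    name = None        # data item currently being read, if any
    in_data = False    # True once the structure block ("M  END") is over
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":              # end of the record
            records.append(props)
            props, name, in_data = {}, None, False
        elif line.startswith("M  END"):         # end of the structure block
            in_data = True
        elif in_data and line.startswith(">"):  # data header, e.g. ">  <logD>"
            start, end = line.find("<"), line.rfind(">")
            name = line[start + 1:end] if 0 <= start < end else None
        elif name is not None:
            if line.strip():                    # value line of the data item
                previous = props.get(name, "")
                props[name] = (previous + "\n" + line.strip()).strip()
            else:                               # a blank line ends the item
                name = None
    return records


# Invented record with an empty structure block (illustration only).
sample = """hypothetical-compound
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
>  <caco2>
-5.0

>  <logD>
1.2

$$$$
"""

records = read_sdf_properties(sample)
```

The values are returned as strings; converting them with `float()` is left to the caller, because not every property in an SD file has to be numeric.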
When the mapping process has finished, the CheS-Mapper Viewer shows up.
The dataset was separated by the cluster algorithm into two clusters.
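The default clustering is described in the wizard as k-means with random restarts. The following Python code is a minimal sketch of that idea for a fixed number of clusters k. It is not CheS-Mapper's implementation: how CheS-Mapper chooses among different values of k is not covered here, and the function names and sample points are hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)),
                key=lambda c: squared_distance(p, centers[c]))
            for p in points]


def kmeans(points, k, restarts=10, max_iterations=100, seed=0):
    """Run k-means from several random starts and keep the best labels."""
    rng = random.Random(seed)   # fixed seed keeps the sketch reproducible
    best_cost, best_labels = None, None
    for _ in range(restarts):
        centers = rng.sample(points, k)   # k distinct points as start centers
        for _ in range(max_iterations):
            labels = assign(points, centers)
            new_centers = []
            for c in range(k):
                members = [p for p, l in zip(points, labels) if l == c]
                if members:   # move the center to the mean of its members
                    new_centers.append(
                        tuple(sum(x) / len(members) for x in zip(*members)))
                else:         # an empty cluster keeps its old center
                    new_centers.append(centers[c])
            if new_centers == centers:    # converged
                break
            centers = new_centers
        labels = assign(points, centers)
        # Cost: sum of squared distances of the points to their centers.
        cost = sum(squared_distance(p, centers[l])
                   for p, l in zip(points, labels))
        if best_cost is None or cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels


# Two clearly separated, invented groups of points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
labels = kmeans(points, k=2)
```

Random restarts matter because a single k-means run can end in a poor local optimum; keeping the run with the lowest cost reduces that risk.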
You can now highlight the features that were used for embedding and clustering (by selecting the feature in the drop-down menu at the bottom left). The figure below shows the dataset with logD selected. As logD was selected in the third wizard step, it was part of the input from which the 3D position of each compound was calculated. As a result, compounds with similar logD values tend to be close to each other. The same holds for the other three features.
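To see why compounds with similar feature values end up close to each other, it helps to look at what Principal Component Analysis does: it projects the feature vectors onto the directions of largest variance. The following Python code is a simplified sketch that computes only the first principal component by power iteration. It is not CheS-Mapper's implementation (which embeds into 3D). The data is only centered, not rescaled, which is an assumption of this sketch, and the function name and sample rows are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered features.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # normalize. This fails if the start vector happens to be orthogonal
    # to the leading direction, which this sketch does not guard against.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:       # all features are constant
            break
        v = [x / norm for x in w]
    # The score of a row is its coordinate along the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Invented rows in which the second feature is twice the first.
rows = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
direction, scores = first_principal_component(rows)
```

Rows with similar feature values receive similar scores, which is the property the viewer relies on when it places similar compounds near each other.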
Highlighting the endpoint
Keep in mind that the endpoint was not used as input for the embedding algorithm. Selecting the endpoint (see figure below) shows that compounds that are close to each other tend to have a similar caco2 value. We can conclude that the endpoint is indeed correlated with the feature values.
Additionally, the selected cluster contains most of the compounds with high caco2 values, and only a few with medium or low values. This is visualized by the red color of the cluster compounds. Use the mouse to hover over cluster 1: the chart in the bottom right corner of the screen will highlight the feature values of the compounds of this cluster. You will see how it differs from the histogram of the complete dataset.
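One simple numeric counterpart to this visual impression is the Pearson correlation coefficient between a single feature and the endpoint. The following Python function is a minimal sketch. The function name is hypothetical and the values are invented, not taken from the Caco-2 dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Raises ZeroDivisionError if one of the lists is constant.
    return cov / (spread_x * spread_y)


# Invented feature and endpoint values (illustration only).
feature = [0.5, 1.0, 1.5, 2.0, 2.5]
endpoint = [-6.0, -5.6, -5.1, -4.9, -4.4]
r = pearson(feature, endpoint)
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates none. Note that this checks one feature at a time, whereas the embedding combines all selected features.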
Another method to analyse the compound feature values of the different clusters is to use the superimpose method:
- You can superimpose the cluster compounds by clicking the check-box on the left side. Now, the compounds of each cluster are drawn on top of each other. The compound with the median feature value of each cluster is drawn solid.
- Activate the labeling of feature values by clicking on the label-checkbox on the bottom-left. This gives you the interval of the feature values, as well as the median value.
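The interval and median shown by these labels can be sketched as follows. This is an illustration, not CheS-Mapper code: the function name is hypothetical and the endpoint values and cluster labels are invented.

```python
from statistics import median


def summarize_by_cluster(values, labels):
    """Return {cluster label: min, median and max of its values}."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: {"min": min(vs), "median": median(vs), "max": max(vs)}
            for label, vs in groups.items()}


# Invented endpoint values and cluster assignments (illustration only).
values = [-4.5, -4.7, -6.3, -5.9, -6.1, -6.0]
labels = [1, 1, 1, 2, 2, 2]
summary = summarize_by_cluster(values, labels)
```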
Disable labeling and superimposition. You can now zoom into cluster 1 by clicking on a compound of this cluster. With caco2 highlighted, you will identify two compounds that are colored blue. Accordingly, these compounds have a smaller endpoint value. Hover with the mouse cursor over the compound that has the darker blue color and is located further away from the cluster borders. The compound's feature values will be shown on the right side of the screen.
This compound is pirenzepine, which has a relatively low endpoint value of -6.35. Still, it was assigned to this cluster by the clustering algorithm and was embedded in 3D by the PCA close to other compounds that have a higher endpoint value. Hence, you can conclude that for pirenzepine, the relationship between feature values and endpoint is slightly different compared to most of the other compounds.
Indeed, according to the article, pirenzepine is the training compound with the highest prediction error of the authors' QSAR model.
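The visual search for a compound whose endpoint value does not fit its cluster can be mimicked numerically by looking for the value that lies furthest from the cluster median. The following Python code is a minimal sketch of that idea, not a feature of CheS-Mapper. The function name is hypothetical, and apart from the pirenzepine value of -6.35 quoted above, the compound names and values are invented.

```python
from statistics import median


def most_deviating(names, values):
    """Return the (name, value) pair furthest from the median value."""
    center = median(values)
    return max(zip(names, values), key=lambda pair: abs(pair[1] - center))


# Invented cluster members, plus the pirenzepine value from the text.
names = ["compound A", "compound B", "compound C", "pirenzepine"]
values = [-4.4, -4.6, -4.5, -6.35]
outlier = most_deviating(names, values)
```

The median is used instead of the mean so that the deviating compound itself has little influence on the reference value.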