Tutorial - Mapping a dataset with integrated features - v2
We will use the CheS-Mapper application to visualize a dataset using the molecular properties stored in the dataset.
The dataset endpoint is Caco-2 permeation, which is, according to the authors, correlated to simple molecular properties (http://pubs.acs.org/doi/suppl/10.1021/ci049884m). The authors provide 100 structurally diverse compounds (we joined the 77 training compounds and 23 test compounds into a single dataset). Five numeric features are available in the dataset. One of the features is the actual endpoint, Caco-2 permeability (feature name in the dataset: caco2). The remaining four molecular descriptors are: experimental distribution coefficient (logD), high charged polar surface area (HCPSA), radius of gyration (rgyr), and fraction of rotatable bonds (fROTB). The authors use these four features to build a QSAR model to predict Caco-2 permeation.
We will verify that the properties are correlated with the endpoint, and detect a compound that is not described well by the authors' QSAR model.
CheS-Mapper Wizard Settings
The wizard for configuring the Chemical-Space Mapping opens when the software starts (Run CheS-Mapper). A copy of the dataset is available here: http://opentox.informatik.uni-freiburg.de/ches-mapper/data/caco2.sdf. To load the dataset directly into the application (without downloading the file first), press 'Open file' in the first wizard step and copy the dataset link into the text field. The dataset properties indicate that 100 compounds have been loaded successfully, including 7 properties and 3D structural data. Hence, skip 3D structure generation in the second wizard step (by selecting 'No 3D Structure Generation'). In the third wizard step, double-click on the feature node with the label 'Included in Dataset', and select the four properties mentioned above (HCPSA, fROTB, logD, and rgyr). To do so, select each feature and click the 'Add feature' button: the feature will then be listed in the 'Selected Features' list on the top right. Take care not to select the actual endpoint caco2. We will use the default settings for wizard steps four, five, and six, so you can press the 'Start mapping' button right away. This will apply no clustering, the default embedding (Principal Component Analysis), and no cluster alignment (as the compounds are too structurally diverse).
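For readers who want to inspect the same features outside the application, the feature selection of the third wizard step can be sketched in a few lines of Python. This is a minimal sketch and not part of CheS-Mapper: the function names parse_sdf_properties and feature_matrix are hypothetical, the parser only reads the "> <name>" data fields of an SDF file and ignores the structure block, and the two records in SAMPLE are made-up placeholders, not compounds from the caco2 dataset.

```python
# Hypothetical helper: collect the data fields ("> <name>" blocks) of each SDF record.
def parse_sdf_properties(sdf_text):
    records = []
    for block in sdf_text.split("$$$$"):  # "$$$$" terminates each record
        lines = block.splitlines()
        props = {}
        i = 0
        while i < len(lines):
            line = lines[i].strip()
            # A data-field header looks like "> <logD>".
            if line.startswith(">") and "<" in line and ">" in line[1:]:
                name = line[line.index("<") + 1 : line.rindex(">")]
                values = []
                i += 1
                # The value runs until the next blank line.
                while i < len(lines) and lines[i].strip():
                    values.append(lines[i].strip())
                    i += 1
                props[name] = "\n".join(values)
            i += 1
        if props:  # blocks without any data field are skipped
            records.append(props)
    return records


# The four descriptors used for the embedding; the endpoint caco2 is left out on purpose.
DESCRIPTORS = ["HCPSA", "fROTB", "logD", "rgyr"]


def feature_matrix(records, names):
    """Return one row of floats per record, in the order given by names."""
    return [[float(rec[name]) for name in names] for rec in records]


# Made-up placeholder records (NOT real values from the dataset).
SAMPLE = """toy_a
  placeholder

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <caco2>
-4.5

> <HCPSA>
10.0

> <fROTB>
0.2

> <logD>
1.5

> <rgyr>
3.0

$$$$
toy_b
  placeholder

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <caco2>
-6.0

> <HCPSA>
80.0

> <fROTB>
0.4

> <logD>
-1.0

> <rgyr>
4.5

$$$$
"""

records = parse_sdf_properties(SAMPLE)
matrix = feature_matrix(records, DESCRIPTORS)
print(len(records), "records;", "first row:", matrix[0])
```

To apply the sketch to the real data, the downloaded caco2.sdf file could be read into a string and passed to parse_sdf_properties in place of SAMPLE.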
When the mapping process has finished, the CheS-Mapper Viewer shows up.
3D viewer organization
The dataset is mapped into 3D space and located in the center. At the top left, a compound list is available to select compounds. At the top right, the dataset properties are shown, including the embedding quality. The embedding quality is excellent, as the PCA had to reduce the number of dimensions by only 1 (from 4 molecular descriptors to 3 dimensions). Furthermore, the mean feature values for the whole dataset are listed (± standard deviation).
We highlight a feature that has been used for embedding and clustering by selecting it in the bottom-left drop-down menu (or by clicking on the feature in the list on the right-hand side). The figure below shows the dataset with logD selected. As logD was selected in the third wizard step, it was employed for the dimensionality reduction into 3D space, and compounds with similar values are close to each other. The same holds for the other three features. Note that the compound list is sorted according to the feature value.
Highlighting the endpoint (Inspecting the activity landscape)
To highlight the endpoint (see screenshot below), we changed the depiction of compounds to Dots. Keep in mind that the endpoint was not used as input for the embedding algorithm. Selecting the endpoint shows the activity landscape for this dataset. Overall, compounds that are close to each other tend to have a similar caco2 value. We can conclude that the endpoint is indeed correlated to the feature values.
However, the landscape is not entirely smooth. In the screenshot, we have selected the compound pirenzepine, which attracts our attention because it is colored blue (due to a relatively low endpoint value of -6.35, see histogram at the bottom right), in contrast to its neighbors.
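The visual impression can be complemented by a simple numerical check, for example the Pearson correlation between one descriptor and the endpoint. The following is a minimal sketch under stated assumptions: the pearson function is written here for illustration, and the value lists are made-up toy numbers chosen to be perfectly linear, not values from the caco2 dataset. CheS-Mapper itself is not involved.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / (spread_x * spread_y)


# Toy values (NOT from the dataset): the endpoint rises with the first
# descriptor and falls with the second one.
toy_endpoint = [-6.0, -5.5, -5.0, -4.5]
toy_rising = [0.0, 1.0, 2.0, 3.0]
toy_falling = [30.0, 20.0, 10.0, 0.0]

r_rising = pearson(toy_rising, toy_endpoint)
r_falling = pearson(toy_falling, toy_endpoint)
print(round(r_rising, 3), round(r_falling, 3))
```

A coefficient close to +1 or -1 indicates a strong linear relationship between a descriptor and the endpoint, while a value near 0 indicates none. Real descriptors will typically fall somewhere in between.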
Inspect neighbors of a compound
By selecting the compound pirenzepine and pressing ALT + d, we can detect its neighbors, i.e., the compounds that are most similar to this compound based on the feature values. The compound list is now sorted according to the distance to pirenzepine and allows us to select the nearest neighbors (select the top compound in the list, hold down shift, and click on the 10th compound).
By clicking the small filter button (next to the compound list), all unselected compounds are filtered out. This allows an easier comparison of a small group of compounds. In the screenshot below, we have highlighted the endpoint again, switched the depiction to Balls&Sticks, and enabled sphere highlighting.
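The idea behind such a neighbor search can be sketched as follows: scale each feature, compute the Euclidean distance from the query compound to every other compound, and sort. This is a minimal sketch only. The z-score scaling and the Euclidean distance are assumptions made for illustration and may differ from what CheS-Mapper computes internally; the function names and the toy compounds "a" to "d" are hypothetical.

```python
import math


def zscore_columns(rows):
    """Scale every feature column to mean 0 and standard deviation 1."""
    scaled_cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        # A constant column carries no information; map it to zeros.
        scaled_cols.append([(v - mean) / std if std else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def nearest_neighbors(names, rows, query, k):
    """Return the names of the k compounds closest to the query compound."""
    scaled = zscore_columns(rows)
    query_row = scaled[names.index(query)]
    distances = [
        (math.dist(query_row, row), name)
        for name, row in zip(names, scaled)
        if name != query
    ]
    distances.sort()  # smallest distance first
    return [name for _, name in distances[:k]]


# Toy compounds with two made-up features each (NOT from the dataset).
toy_names = ["a", "b", "c", "d"]
toy_rows = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.3]]

neighbors = nearest_neighbors(toy_names, toy_rows, "a", 2)
print(neighbors)
```

Scaling matters here because the four descriptors live on very different numeric ranges; without it, the feature with the largest range would dominate the distance.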
Automatically detect activity cliffs
Pirenzepine is part of an activity cliff in the activity landscape. For large datasets, where no perfect 3D embedding is feasible, it is harder to detect activity cliffs visually. CheS-Mapper can compute cliffs automatically. Select Activity Cliffs... in the Edit menu, select caco2 as the endpoint, and press OK. It is no surprise that the compound we already discovered has the highest Activity Cliff value.
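To illustrate what an activity cliff score measures, the sketch below computes a simple pairwise index in the spirit of the Structure-Activity Landscape Index (SALI): the absolute endpoint difference of two compounds divided by their distance in feature space. Whether CheS-Mapper uses exactly this formula is an assumption that the tutorial does not confirm. The function names, the eps parameter, and the toy data are hypothetical, and the features are assumed to be already scaled to comparable ranges.

```python
import math
from itertools import combinations


def cliff_pairs(names, features, activity, eps=1e-9):
    """Score every compound pair: large activity gap at small distance = cliff."""
    pairs = []
    for i, j in combinations(range(len(names)), 2):
        distance = math.dist(features[i], features[j])
        # eps avoids a division by zero for identical feature vectors.
        score = abs(activity[i] - activity[j]) / (distance + eps)
        pairs.append((score, names[i], names[j]))
    pairs.sort(reverse=True)  # steepest cliff first
    return pairs


def cliff_score_per_compound(pairs):
    """Assign each compound the highest score of any pair it takes part in."""
    best = {}
    for score, first, second in pairs:
        for name in (first, second):
            best[name] = max(best.get(name, 0.0), score)
    return best


# Toy data (NOT from the dataset): "a" and "b" are close in feature space
# but differ strongly in activity, so they form the steepest cliff.
toy_names = ["a", "b", "c"]
toy_features = [[0.0, 0.0], [0.1, 0.0], [3.0, 4.0]]
toy_activity = [-4.0, -6.0, -4.2]

pairs = cliff_pairs(toy_names, toy_features, toy_activity)
scores = cliff_score_per_compound(pairs)
print(pairs[0][1], pairs[0][2], round(pairs[0][0], 2))
```

Taking the maximum over all pairs is only one way to turn pairwise scores into a per-compound value; averaging over the nearest neighbors would be another reasonable choice.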