Tutorial - Structural clustering using OpenBabel fingerprints - v2

From CheS-Mapper Wiki


You will use CheS-Mapper to cluster a dataset according to structural properties of the compounds.

The dataset consists of 467 COX-2 inhibitors that have been tested for selective inhibition of the human enzyme cyclooxygenase-2 (COX-2) (see http://pubs.acs.org/doi/abs/10.1021/ci034143r). The experimentally derived activity of each compound is stored in the dataset as an 'IC_50' value (half maximal inhibitory concentration). The inhibitors are structurally very similar, as they have to fit the active site of the COX-2 enzyme.

We will use MACCS keys for the 3D embedding and investigate whether there is a relationship between the structural properties encoded in the MACCS keys and the activity.


To follow this tutorial in detail, you need to have Open Babel (version ≥ 2.3) and R running on your machine. The tools are freely available for Linux, Windows and Mac, and can be installed in just a few minutes.

Without Open Babel / R: You can do this tutorial without Open Babel and R, but this will produce a slightly different embedding and clustering result.

CheS-Mapper Wizard Settings

Run CheS-Mapper to start the CheS-Mapper wizard.

  • Step 1: Load Dataset To directly load the dataset into the application (without downloading the file first), press 'Open file' in the first wizard step and copy the dataset link into the text-field (http://opentox.informatik.uni-freiburg.de/ches-mapper/data/cox2.sdf).
  • Step 2: Create 3D Structures The 3D structure generation can be skipped (select 'No 3D Structure Generation'), as we have precomputed the 3D structures with Open Babel.
  • Step 3: Extract Features Double-click on 'Match SMARTS Lists' within 'Structural Fragments', select the MACCS list, and hit the 'Add feature' button. Click on the 'Settings for fragments' link, and make sure that the min-frequency is set to 10 and Open Babel is selected. You can pre-compute the fragment matching by clicking the 'Load feature values' button: 97 fragments are found.
  • Step 4: Cluster Dataset Click on 'Advanced >>' and select the 'Hierarchical - Dynamic Tree Cut' method. As dissimilarity measure, select 'Tanimoto' similarity (which is appropriate for structural fragments, as it ignores fragments that neither of the two compared compounds matches).
  • Step 5: Embed into 3D Space Click on 'Advanced >>' and select the 'Sammon 3D Embedder' method. Again, as dissimilarity measure, select 'Tanimoto' similarity.
  • Step 6: Align Compounds Select Maximum Common Subgraph Aligner.
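
The Tanimoto measure selected in steps 4 and 5 can be illustrated with a minimal Python sketch. It is not CheS-Mapper code: the function names and the two example fingerprints are hypothetical, and the handling of two all-zero fingerprints is an assumed convention. The sketch shows why fragments that match neither compound have no influence on the result.

```python
def tanimoto_similarity(a, b):
    """Tanimoto (Jaccard) similarity of two equal-length 0/1 fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # fragments matched by both
    either = sum(1 for x, y in zip(a, b) if x or y)   # fragments matched by at least one
    # Assumed convention: two all-zero fingerprints are treated as identical.
    return 1.0 if either == 0 else both / either


def tanimoto_distance(a, b):
    """Dissimilarity derived from the Tanimoto similarity."""
    return 1.0 - tanimoto_similarity(a, b)


# Hypothetical fingerprints over six fragments
fp_a = [1, 1, 0, 0, 0, 0]
fp_b = [1, 0, 1, 0, 0, 0]

sim = tanimoto_similarity(fp_a, fp_b)  # 1 shared fragment / 3 fragments present in either

# Appending 50 fragments that neither compound matches does not change the value
sim_padded = tanimoto_similarity(fp_a + [0] * 50, fp_b + [0] * 50)

print(sim, sim_padded)
```

A measure that also counted shared absences (for example, simple matching) would report the two padded fingerprints as almost identical, which is rarely meaningful for sparse fragment features.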

Hit the 'Start mapping' button to start the viewer.

Without Open Babel / R: The MACCS features can be matched with CDK; for clustering and embedding, use the default settings in the '<< Simple' wizard view.

CheS-Mapper Viewer

Investigate the cluster result

When the mapping process is finished, the viewer shows that there are 7 clusters of different sizes (39 - 110). Click on a cluster-compound to zoom into a cluster. Use the slider (at the bottom left) to decrease the compound size (move the handle to the left): the compounds will overlap less. Press the 'X' (next to the compound list at the top left) to zoom out again.


Select single compounds

Select All compounds in the cluster list (at the top left). This will enable single compound selection. When hovering the mouse over compounds, you can now select single compounds (instead of the entire cluster). Click a compound to zoom in. Hit the 'X' button twice to zoom out again.

A single compound can be selected

Highlight structural fragments

The feature list (on the right side) shows the mean feature values of the compounds in the dataset. For example, the feature CSN does not match 208 compounds, and matches 208 compounds in the dataset. Click on the feature to highlight its values. Compounds that match this fragment will be highlighted in red. The SMARTS code of this fragment is shown above the chart in the bottom right ([#6]-[#16]-[#7]). It encodes a fragment of three atoms (carbon - sulfur - nitrogen). Double-click on the SMARTS code [#6]-[#16]-[#7] to open a visualization of this fragment using http://smartsview.de.

Fragment CSN selected.
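
To make the SMARTS code [#6]-[#16]-[#7] more concrete, the following Python sketch translates its atomic-number primitives into element symbols. It is deliberately not a SMARTS parser: it only reads [#n] atoms, ignores bond symbols, and uses a small hypothetical lookup table that covers just the elements needed here. For real substructure matching, a cheminformatics toolkit such as Open Babel or CDK is required.

```python
import re

# Small lookup table; only the elements needed for this illustration
ELEMENT_BY_ATOMIC_NUMBER = {6: "C", 7: "N", 8: "O", 16: "S"}


def describe_simple_smarts(smarts):
    """Return element symbols for a linear SMARTS made only of [#n] atoms.

    Only atomic-number primitives such as [#6] are read; bond symbols
    and all other SMARTS syntax are ignored.
    """
    numbers = [int(n) for n in re.findall(r"\[#(\d+)\]", smarts)]
    return [ELEMENT_BY_ATOMIC_NUMBER[n] for n in numbers]


atoms = describe_simple_smarts("[#6]-[#16]-[#7]")
print("".join(atoms))  # CSN
```

The atomic numbers 6, 16 and 7 correspond to carbon, sulfur and nitrogen, which is where the feature name CSN comes from.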

Select a single compound that matches this fragment (either by selecting a cluster first, or by selecting All compounds first, before clicking on the compound). You will note that the matching fragment is highlighted in orange. Enable Highlighting > Sphere highlighting to indicate the feature value with spheres instead of changing the atom color.

Fragment CSN selected.

Determine important feature values for a cluster

Select an arbitrary cluster. You will note that the feature list at the right changes its ordering. The features that are most specific to the selected cluster are now at the top, which makes it easier to find out what distinguishes this cluster from the rest of the dataset. In the example below, all compounds in cluster 3 match the top fragment (in the feature list for this cluster), but most compounds in the overall dataset do not match this fragment.

Sorting of features according to specificity
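
The idea of sorting features by specificity can be sketched in a few lines of Python. This section does not state which criterion CheS-Mapper uses, so the score below (the absolute difference between the match fraction inside the cluster and the match fraction in the whole dataset) is only an assumed stand-in, and the function name and data are hypothetical.

```python
def rank_features_by_specificity(fingerprints, cluster_indices):
    """Order feature indices by |cluster match fraction - dataset match fraction|."""
    n_features = len(fingerprints[0])
    n_all = len(fingerprints)
    n_cluster = len(cluster_indices)
    scores = []
    for f in range(n_features):
        frac_all = sum(fp[f] for fp in fingerprints) / n_all
        frac_cluster = sum(fingerprints[i][f] for i in cluster_indices) / n_cluster
        scores.append((abs(frac_cluster - frac_all), f))
    # Largest difference first; ties keep the lower feature index first
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [f for _, f in scores]


# Hypothetical data: 6 compounds, 3 fragment features (1 = match)
fingerprints = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
]
cluster = [0, 1]  # indices of the compounds in the selected cluster

ranking = rank_features_by_specificity(fingerprints, cluster)
print(ranking)
```

In this toy example, feature 1 is matched by every compound in the cluster but by only a third of the dataset, so it ranks first. Feature 0 is matched by all compounds and therefore says nothing about the cluster, so it ranks last.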

Superimpose compounds

Edit > superimpose compounds will move compounds within each cluster to the same position in 3D space (to the center of each cluster). As these compounds are aligned according to common substructures, they are overlaid based on the maximum common subgraph (MCS) in each cluster. Select 'MCS' in the drop-down menu on the bottom left to highlight the MCS.


Embedding stress and filtering

When zooming out to the entire dataset, the embedding quality is shown at the top right: it is 'moderate' with a Concordance Correlation Coefficient (CCC) of 0.64 (more on Embedding quality). This means that the distance in 3D space does not perfectly reflect the Tanimoto distance for each compound pair. To inspect compounds with high embedding stress, click on '3D embedding' in the feature list (or select 'Embedding stress' in the drop-down menu on the bottom left).

Embedding stress highlighted.
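
The following Python sketch shows how a concordance correlation coefficient between feature-space distances and 3D distances could be computed. It uses Lin's standard definition of the CCC. Whether CheS-Mapper normalizes the distances beforehand or computes the value in exactly this way is not stated here, so treat this as an illustration under that assumption; all names and numbers are hypothetical.

```python
import math
from itertools import combinations


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient of two equal-length sequences.

    Assumes the inputs are not both constant with equal means
    (the denominator would be zero in that case).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((v - mean_x) ** 2 for v in x) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


def pairwise_euclidean(points):
    """Euclidean distances for all compound pairs, in combinations() order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


# Hypothetical example with three compounds: pairs (0,1), (0,2), (1,2)
feature_distances = [0.2, 0.8, 0.7]  # e.g. Tanimoto distances
coordinates_3d = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.3, 0.6, 0.0)]
embedded_distances = pairwise_euclidean(coordinates_3d)

ccc = concordance_correlation(feature_distances, embedded_distances)
print(round(ccc, 3))
```

Unlike the Pearson correlation, the CCC also penalizes a shift or a change of scale between the two sets of distances, so it only reaches 1 when the embedded distances reproduce the original distances exactly.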

When a compound has low embedding stress, the compounds that are most similar to it (based on feature values) are also nearby in 3D space. For compounds with high embedding stress, this is not guaranteed. To identify the nearest neighbors of such a compound, select it and press 'Alt+D' (or: 'Edit > Distance to..'). This shows the Tanimoto distance of all other compounds to this compound.

Distance to a compound.

You can filter the view to show only a selection of compounds. With the Tanimoto distance feature selected, click on 'All compounds' and select the first compound in the list while holding down the Control key (this will prevent the viewer from zooming in). Holding down the Shift key and selecting a compound further down in the list will select multiple compounds (the selected compound and its nearest neighbors). Now press the small filter icon next to the compound list to remove all other compounds from the viewer. You can remove the filter by pressing the 'X' button at the top left.

Distance to a compound.

Highlight endpoint

Select the feature 'IC50_mM'. As this feature is log-distributed (see chart at the bottom right), select Highlighting > Highlight colors... and enable log-highlighting. Also reverse the color gradient by pressing the '<->' button. Now active compounds (with low feature values) will be colored in red. Click on the blue 'high' button and select a green color. Inactive compounds will now be colored green. In order to get a better overview of the endpoint value distribution in the feature space, select 'Dots' at the bottom left. You will note that the compounds are separated quite well in 3D space according to the endpoint value (even though this feature was not used for embedding). Moreover, the clusters differ largely in mean endpoint value as well (see mean IC50 values in the cluster list at the top left). This indicates that the selected structural fragments are correlated with the endpoint.

Endpoint values highlighted.
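
The effect of log-highlighting can be sketched in Python as follows. This is not the viewer's implementation: the function names, the plain linear RGB blend and the example IC50 values are assumptions chosen for illustration.

```python
import math


def log_scaled_position(value, vmin, vmax):
    """Position of value in [0, 1] on a logarithmic scale between vmin and vmax."""
    low, high = math.log10(vmin), math.log10(vmax)
    return (math.log10(value) - low) / (high - low)


def interpolate_color(t, low_rgb, high_rgb):
    """Linear blend between two RGB colors for t in [0, 1]."""
    return tuple(round(l + t * (h - l)) for l, h in zip(low_rgb, high_rgb))


RED = (255, 0, 0)    # low IC50 values: active compounds
GREEN = (0, 255, 0)  # high IC50 values: inactive compounds

ic50_values = [0.001, 0.1, 10.0]  # hypothetical values spanning four orders of magnitude
vmin, vmax = min(ic50_values), max(ic50_values)

for value in ic50_values:
    t = log_scaled_position(value, vmin, vmax)
    print(value, interpolate_color(t, RED, GREEN))
```

On a linear scale, the middle value 0.1 would sit at about 1% of the range and be almost indistinguishable from the most active compound. On the logarithmic scale it falls halfway between the two extremes, which is why log-highlighting is useful for log-distributed endpoints.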

Computing activity cliffs

Activity cliffs are pairs of compounds that are similar based on their feature values, but dissimilar based on their endpoint values. You can compute mean SALI values via Edit > Activity cliffs.. (use IC50_mM as the endpoint, and compute the index based on the log-transformed values). A special case of this definition is a pair of compounds with identical structural features. In this case, 61 cliffs are found. These cliffs include compounds that cannot be distinguished based on their features, but have different endpoint values. This indicates that not all of the structurally very similar COX-2 inhibitors can be distinguished with the MACCS keys, and that more features are required to better model the structure-activity relationship.

Activity cliffs.
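
The Structure-Activity Landscape Index (SALI) of a compound pair is commonly defined as the absolute activity difference divided by one minus the similarity. The Python sketch below applies this definition to log-transformed IC50 values with Tanimoto similarity. How CheS-Mapper aggregates pair values into mean SALI values is not described here, and the compound data, the function names and the treatment of identical pairs with equal activity are assumptions.

```python
import math
from itertools import combinations


def tanimoto_similarity(a, b):
    """Tanimoto similarity of two equal-length 0/1 fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 if either == 0 else both / either


def sali(activity_i, activity_j, similarity):
    """SALI = |A_i - A_j| / (1 - similarity) for one compound pair."""
    diff = abs(activity_i - activity_j)
    if similarity >= 1.0:
        # Identical features: the index is unbounded if the activities differ.
        # Returning 0.0 for equal activities is an assumed convention.
        return math.inf if diff > 0 else 0.0
    return diff / (1.0 - similarity)


# Hypothetical compounds: (name, fingerprint, IC50)
compounds = [
    ("A", [1, 1, 0, 1], 0.01),
    ("B", [1, 1, 0, 1], 1.0),  # same fingerprint as A, different activity
    ("C", [1, 0, 1, 0], 1.0),
]

for (name_i, fp_i, ic50_i), (name_j, fp_j, ic50_j) in combinations(compounds, 2):
    similarity = tanimoto_similarity(fp_i, fp_j)
    value = sali(math.log10(ic50_i), math.log10(ic50_j), similarity)
    print(name_i, name_j, similarity, value)
```

The pair A and B reproduces the special case described above: the fingerprints are identical, so the denominator is zero and any difference in activity yields an unbounded index.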