Tutorial - Structural clustering using OpenBabel fingerprints
This tutorial is outdated. Check out Tutorial - Structural clustering using OpenBabel fingerprints - v2.
You will use CheS-Mapper to embed a small dataset according to structural properties of the compounds.
The dataset consists of 34 polybrominated diphenyl ethers (PBDEs) with the experimentally measured endpoint 'Vapor pressure'. The dataset was employed to build and validate QSPR models (see http://onlinelibrary.wiley.com/doi/10.1002/qsar.200860183/abstract). The authors computed physicochemical descriptors with the Dragon software. They determined the feature T(O...Br) to be the most significant descriptor with respect to the endpoint. This feature is the sum of the topological distances between oxygen and bromine atoms, where a topological distance is the number of bonds on the shortest path between two atoms.
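To make the descriptor concrete, here is a minimal sketch that computes a T(O...Br)-style value on a hand-built heavy-atom graph of a brominated diphenyl ether. It is not the Dragon implementation; the helper names (`diphenyl_ether`, `t_o_br`) and the atom numbering are assumptions made for this illustration.

```python
from collections import deque


def diphenyl_ether(bromine_carbons):
    """Build a heavy-atom graph of a brominated diphenyl ether.

    Hypothetical numbering for this sketch: atom 0 is the ether oxygen,
    atoms 1-6 form ring A (atom 1 is bonded to O), atoms 7-12 form ring B
    (atom 7 is bonded to O). Each entry in bromine_carbons is the index of
    a ring carbon that carries a bromine atom.
    """
    elements = ["O"] + ["C"] * 12
    bonds = [(0, 1), (0, 7)]
    for first in (1, 7):
        ring = list(range(first, first + 6))
        bonds += [(ring[i], ring[(i + 1) % 6]) for i in range(6)]
    for carbon in bromine_carbons:
        elements.append("Br")
        bonds.append((carbon, len(elements) - 1))
    return elements, bonds


def bond_distances(adjacency, source):
    """Breadth-first search: number of bonds from source to every atom."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in distance:
                distance[neighbour] = distance[atom] + 1
                queue.append(neighbour)
    return distance


def t_o_br(elements, bonds):
    """Sum of topological distances over all oxygen/bromine atom pairs."""
    adjacency = {i: [] for i in range(len(elements))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    total = 0
    for oxygen in (i for i, e in enumerate(elements) if e == "O"):
        distance = bond_distances(adjacency, oxygen)
        total += sum(distance[i] for i, e in enumerate(elements) if e == "Br")
    return total


# Ortho (carbon 2), meta (carbon 3) and para (carbon 4) bromine on ring A.
for carbon in (2, 3, 4):
    print(carbon, t_o_br(*diphenyl_ether([carbon])))
```

In this sketch an ortho bromine contributes 3, a meta bromine 4 and a para bromine 5, so the value grows both with the number of bromine atoms and with their distance from the ether oxygen.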
You will use the CheS-Mapper program to verify that the endpoint depends on the structure, more precisely on the distance between bromine and the ether group (i.e. the oxygen atom).
For this tutorial, you need to have OpenBabel (version 2.3 or later) running on your machine. It is freely available for Linux, Windows and Mac, and can be installed in just a few minutes.
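If you are unsure which version is installed, a small script along the following lines can check it. This is a sketch under assumptions: it presumes that the `obabel` command-line tool is on your PATH and that its `-V` option prints a banner starting with "Open Babel" followed by the version number. The function names are hypothetical, and CheS-Mapper itself may locate OpenBabel differently.

```python
import re
import shutil
import subprocess


def parse_openbabel_version(text):
    """Return (major, minor) parsed from a version banner, or None."""
    match = re.search(r"Open Babel (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_openbabel_version(executable="obabel"):
    """Ask the command-line tool for its version; None if it is not found."""
    if shutil.which(executable) is None:
        return None
    result = subprocess.run([executable, "-V"], capture_output=True, text=True)
    # Look at both streams in case the banner is not written to stdout.
    return parse_openbabel_version(result.stdout + result.stderr)


def is_recent_enough(version, minimum=(2, 3)):
    """True if the parsed version meets the minimum required by this tutorial."""
    return version is not None and version >= minimum
```

Calling `is_recent_enough(installed_openbabel_version())` returns False both when the tool is missing and when it is older than 2.3.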
CheS-Mapper Wizard Settings
Run CheS-Mapper to start the CheS-Mapper wizard.
- Step 1: Load Dataset To load the dataset, copy and paste the dataset link into the text field and press 'Load dataset' (http://opentox.informatik.uni-freiburg.de/ches-mapper/data/PBDE_LogVP.ob3d.sdf).
- Step 2: Create 3D Structures The 3D structure generation can be skipped (Select 'No 3D Structure Generation'), as we have precomputed the 3D structures with OpenBabel.
- Step 3: Extract Features Double-click on 'Structural Fragments', select 'OpenBabel Linear Fragments' and hit the 'Add feature' button. This feature set contains all linear fragments with a size of up to 7 atoms. Click the 'Settings for fragments' link and decrease the minimum frequency to 1. This makes sense because the dataset is small and contains structurally very similar compounds. Make sure that the option 'Skip fragments that match all compounds' is enabled. This omits features that have the same value for every compound and therefore cannot help to distinguish compounds. You can pre-compute the linear fragments by clicking the 'Load feature values' button: 15 fragments are found, each of them containing bromine (a bromine atom is encoded as [#35]). Do not select the actual target endpoint (which is 'Included in Dataset' as 'Vapor Pressure (measured)').
- Step 4: Cluster Dataset Use default settings (CascadeSimpleKMeans).
- Step 5: Embed into 3D Space Use default settings (PCA 3D Embedder (WEKA)).
- Step 6: Align Compounds Select Maximum Common Subgraph Aligner.
Hit the 'Start mapping' button to start the viewer.
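The two fragment settings from Step 3 (minimum frequency and 'Skip fragments that match all compounds') can be sketched as a simple filter. This is only an illustration of the idea, not CheS-Mapper's code: the function name `select_fragments` and the match table are hypothetical, and the compound indices are made-up placeholders rather than values from the PBDE dataset.

```python
def select_fragments(matches, n_compounds, min_frequency=1, skip_match_all=True):
    """Keep the fragments that are useful as features.

    matches maps a fragment (SMARTS string) to the set of compound indices
    it occurs in. A fragment is dropped if it occurs in fewer than
    min_frequency compounds, or (optionally) if it occurs in every compound,
    because such a feature has the same value for the whole dataset.
    """
    selected = {}
    for fragment, compounds in matches.items():
        frequency = len(compounds)
        if frequency < min_frequency:
            continue
        if skip_match_all and frequency == n_compounds:
            continue
        selected[fragment] = compounds
    return selected


# Hypothetical match table for four compounds (indices 0-3).
matches = {
    "[#6]:[#6]-[#8]": {0, 1, 2, 3},         # occurs everywhere: uninformative
    "[#35]-[#6]:[#6]-[#8]": {0, 2},
    "[#35]-[#6]:[#6]:[#6]-[#8]": {1},
}

for fragment in select_fragments(matches, n_compounds=4, min_frequency=1):
    print(fragment)
```

With a minimum frequency of 1, the fragment that occurs in only one compound is kept. Raising the threshold to 2 would drop it, which is why a low threshold suits a small dataset of very similar compounds.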
Investigate the cluster result
When the mapping process is finished, the Viewer shows that there are three clusters of different sizes. You can rotate the view (hold mouse button down and drag the mouse) to see that the clusters are clearly separated in space.
Highlight the endpoint
Use the drop-down menu at the bottom left to select the dataset endpoint 'Vapor Pressure (measured)' (it is at the end of the drop-down list). This draws the compounds in a color gradient that ranges from red to blue, with each compound colored according to its endpoint value. You will see that compounds with similar endpoint values are close to each other in 3D space. This is evidence that the endpoint depends on the structure, as only structural fragments have been used as input for the 3D embedding.
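The coloring can be pictured as a linear interpolation between two colors. The following sketch shows one way to do this. It assumes that red marks the highest endpoint value and blue the lowest, which the description above does not specify, and it is not taken from the CheS-Mapper source. The function name is hypothetical.

```python
def endpoint_to_rgb(value, lowest, highest):
    """Map an endpoint value to an (R, G, B) color between blue and red.

    Assumption for this sketch: lowest -> pure blue, highest -> pure red.
    Values outside the range are clamped to the nearest end.
    """
    if highest == lowest:
        fraction = 0.5  # all values equal: use the middle of the gradient
    else:
        fraction = (value - lowest) / (highest - lowest)
    fraction = min(1.0, max(0.0, fraction))
    red = int(255 * fraction)
    blue = 255 - red
    return red, 0, blue


# Example with made-up endpoint values.
values = [-7.5, -4.0, -1.2]
for v in values:
    print(v, endpoint_to_rgb(v, min(values), max(values)))
```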
Move the mouse cursor over cluster 3. The chart at the bottom right then highlights the endpoint values of the compounds in this cluster in comparison to the whole dataset. You can see that this cluster contains most of the compounds with high endpoint values.
Highlight structural features
You can investigate the dataset properties further by selecting the structural fragments that have been used for embedding and clustering (again with the drop-down menu at the bottom left). Select the structural feature '[#35]-[#6]:[#6]:[#6]-[#8]'. This is a SMARTS string (equal to Br-c:c:c-O) that corresponds to a linear sequence of 5 atoms: bromine, three aromatic carbons, and oxygen. The compounds are colored red if they contain the fragment and blue if the fragment does not match. Additionally, the atoms that match this sequence are highlighted in each compound (in orange, see Figure below). You will notice that the matching compounds generally have a higher endpoint value. All compounds in cluster 3 (the cluster including the compounds with the highest endpoint values) also match the fragment Br-c:c-O (one aromatic carbon fewer). This indicates that the endpoint depends on the structural fragments, which supports the findings described in the article.
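To illustrate what matching a linear fragment means, the sketch below searches a small molecular graph for a path of atoms and bonds. It is a simplified stand-in, not OpenBabel's SMARTS matcher: it handles only unbranched patterns, writes aromatic carbons as `c`, single bonds as `-` and aromatic bonds as `:`, and uses a hypothetical helper `bromophenoxy` that builds just one ring of a PBDE together with the ether oxygen.

```python
def bromophenoxy(bromine_carbons):
    """One aromatic ring attached to the ether oxygen (hypothetical helper).

    Atom 0 is O, atoms 1-6 are aromatic carbons (atom 1 is bonded to O).
    Each entry in bromine_carbons is a ring carbon that carries a bromine.
    """
    atoms = ["O"] + ["c"] * 6
    bonds = {frozenset((0, 1)): "-"}
    for i in range(1, 7):
        bonds[frozenset((i, i % 6 + 1))] = ":"
    for carbon in bromine_carbons:
        atoms.append("Br")
        bonds[frozenset((carbon, len(atoms) - 1))] = "-"
    return atoms, bonds


def find_linear_fragment(atoms, bonds, pattern_atoms, pattern_bonds):
    """Return the atom indices of the first matching path, or None."""
    neighbours = {i: [] for i in range(len(atoms))}
    for bond, kind in bonds.items():
        a, b = tuple(bond)
        neighbours[a].append((b, kind))
        neighbours[b].append((a, kind))

    def extend(path):
        position = len(path)
        if position == len(pattern_atoms):
            return path
        for atom, kind in neighbours[path[-1]]:
            if (atom not in path
                    and kind == pattern_bonds[position - 1]
                    and atoms[atom] == pattern_atoms[position]):
                found = extend(path + [atom])
                if found:
                    return found
        return None

    for start, symbol in enumerate(atoms):
        if symbol == pattern_atoms[0]:
            found = extend([start])
            if found:
                return found
    return None


# Br-c:c:c-O and the shorter Br-c:c-O as atom and bond sequences.
long_pattern = (["Br", "c", "c", "c", "O"], ["-", ":", ":", "-"])
short_pattern = (["Br", "c", "c", "O"], ["-", ":", "-"])

for carbon in (2, 3, 4):  # ortho, meta, para to the oxygen
    atoms, bonds = bromophenoxy([carbon])
    print(carbon,
          find_linear_fragment(atoms, bonds, *long_pattern),
          find_linear_fragment(atoms, bonds, *short_pattern))
```

In this simplified ring, Br-c:c-O matches only a bromine in ortho position to the oxygen and Br-c:c:c-O matches only a bromine in meta position, so each fragment encodes a specific bromine-to-oxygen distance.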
Superimpose compounds to show structural (dis-)similarities
You can use superimposing to compare the structure of compounds within a cluster (enable the 'Superimpose' check-box on the left). Click on the superimposed cluster 3 to zoom into the cluster. The compounds are aligned according to the Maximum Common Substructure of each cluster. Move the mouse cursor over a compound in the compound list at the top left of the screen: only the selected compound is drawn solid.