KNIME Node Tutorial - Visual validation of a QSAR prediction
This is a tutorial for the CheS-Mapper node for KNIME. A general description of the CheS-Mapper node for KNIME can be found here: http://tech.knime.org/book/ches-mapper-node-for-knime-trusted-extension.
We will use KNIME to build and apply a QSAR model for acute fish toxicity, and the CheS-Mapper node to visually validate the prediction. The original dataset can be downloaded from the web pages of the US Environmental Protection Agency (EPA) (see: http://www.epa.gov/ncct/dsstox/sdf_epafhm.html). We will load the dataset and predict the endpoint 'LC50_mmol' with the KNIME workbench. The CheS-Mapper node for KNIME will then be used to visualize the prediction result directly within KNIME.
To simplify this tutorial, we have already computed OpenBabel descriptors for the dataset using CheS-Mapper. We also applied some cosmetic pre-processing to the original dataset: we created 3D structures with OpenBabel, removed 37 (of 617) compounds that have no measured value for LC50_mmol, removed a single extreme outlier compound (strychnine hemisulphate salt) to make the visualization more compact, and computed the logarithm of the endpoint LC50_mmol.
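The pre-processing steps above can be sketched in a few lines of Python. This is only an illustration, not the script that was used to prepare the dataset: the compound names and LC50 values are hypothetical, the helper name preprocess is made up for this example, and the base-10 logarithm is an assumption (the tutorial does not state which base was used).

```python
import math

# Hypothetical records: (compound name, LC50 in mmol, or None if not measured).
records = [
    ("compound A", 0.5),
    ("compound B", None),   # no measured value, will be dropped
    ("compound C", 12.0),   # treated as the outlier in this example
    ("compound D", 0.004),
]


def preprocess(rows, outliers=()):
    """Drop rows without a measurement or listed as outliers, then add the log value."""
    kept = []
    for name, lc50 in rows:
        if lc50 is None or name in outliers:
            continue
        # Assumption: base-10 logarithm; the source only says "LOG value".
        kept.append((name, lc50, math.log10(lc50)))
    return kept


cleaned = preprocess(records, outliers={"compound C"})
for name, lc50, log_lc50 in cleaned:
    print(f"{name}: LC50_mmol={lc50} LC50_mmol_log={log_lc50:.3f}")

# The compound counts reported in the tutorial are consistent: 617 - 37 - 1 = 579.
print(617 - 37 - 1)
```

The final line checks the arithmetic behind the 579 compounds that the SDF Reader reports later in the tutorial.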
Install CheS-Mapper extension within KNIME
The CheS-Mapper extension is a Trusted Community Contribution and can be installed easily (Menu -> Help -> Install New Software...), as documented here: http://tech.knime.org/community.
Add SDF Reader node
Start your new KNIME workflow by adding an SDF Reader node (in Chemistry -> I/O). Double-click on the node, go to the File selection tab, and copy and paste the dataset URL into the text field: http://opentox.informatik.uni-freiburg.de/ches-mapper/data/EPAFHM_v4b_617_15Feb2008.ob3d.missingRemoved.outlierRemoved.obFeatures.log.sdf. Go to the Property handling tab and click Scan files. This will load the 579 compounds in the file and list 30 compound properties (starting with DSSTox_RID).
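To get a feeling for what the property scan reports, the following minimal Python sketch collects the data items of each record in an SDF string. It is not how the SDF Reader node is implemented; the function name read_sdf_properties and the sample record (including its property values) are hypothetical. It relies only on the SDF conventions that a data item starts with a "> <NAME>" header line, ends with a blank line, and that records are separated by "$$$$".

```python
def read_sdf_properties(text):
    """Return one dict (data item name -> value) per record in an SDF string."""
    records = []
    for block in text.split("$$$$"):
        if not block.strip():
            continue  # skip the empty remainder after the last separator
        props = {}
        lines = block.splitlines()
        i = 0
        while i < len(lines):
            line = lines[i]
            if line.startswith(">") and "<" in line:
                # Header line such as "> <DSSTox_RID>": take the text in brackets.
                name = line[line.index("<") + 1 : line.rindex(">")]
                values = []
                i += 1
                # The value runs until the next blank line.
                while i < len(lines) and lines[i].strip():
                    values.append(lines[i].strip())
                    i += 1
                props[name] = "\n".join(values)
            i += 1
        records.append(props)
    return records


# Hypothetical single-record SDF with two data items.
sample = """\
methane
  hypothetical

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <DSSTox_RID>
1

> <LC50_mmol_log>
-0.301

$$$$
"""

records = read_sdf_properties(sample)
print(len(records), "record(s); properties:", sorted(records[0]))
```

Applied to the tutorial dataset, a parser like this should find the same properties that Scan files lists, although the node itself may treat edge cases differently.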
Add Linear Regression (Learner) node
To build a simple regression model, add the Linear Regression (Learner) node (from Statistics -> Regression) and connect the first output port of the reader node to the first input port of the regression node. Double-click on the regression node: you will notice that the correct target property, LC50_mmol_log, is already selected (because it is the last property in the SDF file). Remove all non-OpenBabel descriptors from the values used for regression: in the Include panel, select the first property DSSTox_RID, scroll down to ExcessToxicityIndex, hold down the Shift key and click on ExcessToxicityIndex, then click the << remove button. This leaves 12 OpenBabel descriptors as input variables for the regression (starting with OB:abonds).
Add Regression (Predictor) node
To make the actual predictions, you will need the Regression (Predictor) node (from Statistics -> Regression). Connect the output port of the regression learner to the input port of the regression predictor, and the first output port of the SDF Reader node to the second input port of the predictor node. Note that this predicts the training dataset itself; the model's predictions will probably be less accurate on unseen data.
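The learner/predictor pair can be illustrated outside KNIME with a small pure-Python sketch of ordinary least squares. This is not the implementation used by the KNIME nodes: the function names (solve, fit, predict, rmse) and all descriptor and endpoint values are hypothetical, and only two descriptors are used instead of twelve. The sketch fits the model on a training set and then reports the error both on that same training set (as the workflow above does) and on a few held-out compounds.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(xs, ys):
    """Return [intercept, w1, w2, ...] minimizing the squared error (normal equations)."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, x):
    """Apply the fitted linear model to one descriptor vector."""
    return coef[0] + sum(w * v for w, v in zip(coef[1:], x))


def rmse(coef, xs, ys):
    """Root mean squared error of the model on the given data."""
    return (sum((predict(coef, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)) ** 0.5


# Hypothetical descriptor vectors and log endpoint values.
train_x = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
train_y = [-0.2, 0.4, -0.9, 0.1, -0.6]
held_out_x = [(1.5, 3.0), (4.5, 2.0)]
held_out_y = [-0.8, 0.9]

coef = fit(train_x, train_y)
print("training RMSE:", rmse(coef, train_x, train_y))
print("held-out RMSE:", rmse(coef, held_out_x, held_out_y))
```

Because the coefficients are chosen to minimize the error on the training set, the training error is usually an optimistic estimate; a separate test set or cross-validation gives a more realistic picture.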
Add CheS-Mapper node
Finally, add the CheS-Mapper node from the Community Nodes and connect the output port of the predictor node to the input port of the CheS-Mapper node. Execute the CheS-Mapper node (right-click -> Execute); the lights of all nodes should turn green. Right-click on the CheS-Mapper node and select the second view: Start CheS-Mapper wizard. Configure CheS-Mapper to use only the regression input variables for the chemical space embedding: go to step 3, Extract features, and remove all non-OpenBabel descriptors (those above OB:abonds and below OB:TPSA) from the Selected Features panel (again, you can hold down the Shift key to select multiple features at once). Press Start mapping to start the CheS-Mapper visualization.
Visual validation using CheS-Mapper
WARNING: The visual validation described here does not yet use CheS-Mapper 2.0 functionality; it will be updated soon.
When the viewer starts, you will see the 579 compounds embedded into 3D space. The embedding is based on the same OpenBabel descriptors that were used for the linear regression model, so compounds with similar feature values are close to each other. Note that the actual endpoint values (LC50_mmol(_log)) have not been used for the embedding. Select the endpoint feature LC50_mmol in the drop-down menu at the bottom left, and enable Log highlighting in the Highlighting menu. Even though this feature was not used for the 3D embedding, compound positions in 3D space reflect the endpoint values quite well (Figure 1). This indicates that the OpenBabel feature values are correlated with the endpoint values.
In the center of the compound cloud, you will notice a compound (1-Benzylpyridinium 3-sulfonate) that is rather inactive (drawn in red due to a relatively high endpoint value of 9.67 mmol), but is close to more active compounds (drawn in blue). Click on the compound to zoom in (Figure 2).
As linear regression tends to work better when predicting normally distributed data, the endpoint has been converted using the logarithm, and the result is stored in the dataset. Hence, the actual target value is LC50_mmol_log, and the predicted values are LC50_mmol_log (prediction). You can highlight both features at once by enabling Highlight features with spheres and Highlight two features in the Highlighting menu (you should uncheck Log highlighting). Now first select LC50_mmol_log (prediction) in the drop-down menu at the bottom left, and then select LC50_mmol_log. This will show the actual value LC50_mmol_log in the inner circle, and the predicted value LC50_mmol_log (prediction) in the outer one. Thus you can easily detect compounds with high prediction errors, such as 1-Benzylpyridinium 3-sulfonate. You can then investigate possible reasons for the prediction error, such as measurement errors during the experiment, insignificant features, or inadequate prediction algorithms.
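The same comparison of actual and predicted values can also be made numerically, for example to decide which compounds to inspect first in the viewer. The following Python sketch ranks compounds by their absolute prediction error; it is not part of CheS-Mapper, and the function name rank_by_error, the compound names, and all values are hypothetical.

```python
# Hypothetical (name, LC50_mmol_log, LC50_mmol_log (prediction)) triples.
results = [
    ("compound A", -0.30, -0.10),
    ("compound B", 0.99, -0.85),
    ("compound C", -1.20, -1.05),
]


def rank_by_error(rows):
    """Return (name, absolute error) pairs, largest error first."""
    errors = [(name, abs(actual - predicted)) for name, actual, predicted in rows]
    return sorted(errors, key=lambda item: item[1], reverse=True)


for name, error in rank_by_error(results):
    print(f"{name}: absolute error {error:.2f}")
```

Compounds at the top of such a list are the ones whose two circles differ most in color in the viewer.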