Supported Dataset Formats and Size

From CheS-Mapper Wiki


We recommend CheS-Mapper for small and medium-sized datasets of up to 1500 compounds. We have also successfully tested CheS-Mapper with datasets of more than 8000 compounds, although pre-processing may take a while on datasets of that size (see Algorithm Runtimes).

Main memory usage

Moreover, as the software displays the structures of all dataset compounds in 3D space, large datasets require a lot of main memory. If pre-processing fails with OutOfMemoryError: Java heap space while loading the 3D library Jmol, the memory available to CheS-Mapper can be increased with the -Xmx parameter when starting CheS-Mapper from the command line (as described here; the Windows executable version uses up to 8 gigabytes of main memory or 80% of the available main memory on your machine).
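As a minimal sketch of the command-line start described above, the following Python snippet assembles a java command with a larger maximum heap. -Xmx is the standard JVM option for the maximum heap size; the jar file name ches-mapper.jar and the helper build_launch_command are placeholders chosen for illustration, not names taken from the CheS-Mapper distribution.

```python
def build_launch_command(heap_gb, jar_path="ches-mapper.jar"):
    """Return a java command line that caps the JVM heap at heap_gb gigabytes.

    jar_path is a hypothetical placeholder; replace it with the path of the
    CheS-Mapper jar file you actually downloaded.
    """
    if not isinstance(heap_gb, int) or heap_gb < 1:
        raise ValueError("heap_gb must be a positive integer")
    # -Xmx sets the maximum Java heap size; the 'g' suffix means gigabytes.
    return ["java", f"-Xmx{heap_gb}g", "-jar", jar_path]


command = build_launch_command(4)
print(" ".join(command))  # java -Xmx4g -jar ches-mapper.jar
```

The resulting list can be passed to subprocess.run, or the printed line can be typed into a terminal directly.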

The following table shows the approximate main memory usage of CheS-Mapper in gigabytes for large datasets. This is only a rough estimate; it depends on the compounds, the selected features, and the selected algorithms. It was measured with CheS-Mapper v2.1 on a subset of the Tox21 dataset, using the 64-bit version of OpenJDK 1.7.0_55; 14 PC descriptors were calculated with Open Babel and used for PCA embedding, and clustering was disabled.

Number of compounds   Default version (GB)   Big data mode (GB)
2000                  1.0                    0.8
4000                  1.8                    1.1
8000                  3.2                    1.7
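For dataset sizes between the measured rows, a rough figure can be obtained by linear interpolation. The sketch below assumes that memory usage grows roughly linearly between two measurements, which the table suggests but does not guarantee; the function name estimate_memory_gb is made up for this example, and the result inherits all the caveats of the measurements themselves.

```python
# Measured values from the table above:
# (number of compounds, default version in GB, big data mode in GB).
MEASUREMENTS = [(2000, 1.0, 0.8), (4000, 1.8, 1.1), (8000, 3.2, 1.7)]


def estimate_memory_gb(num_compounds, big_data_mode=False):
    """Linearly interpolate the table; refuse to extrapolate beyond it."""
    column = 2 if big_data_mode else 1
    if not MEASUREMENTS[0][0] <= num_compounds <= MEASUREMENTS[-1][0]:
        raise ValueError("outside the measured range of 2000 to 8000 compounds")
    # Find the two neighbouring measurements that enclose num_compounds.
    for low, high in zip(MEASUREMENTS, MEASUREMENTS[1:]):
        if low[0] <= num_compounds <= high[0]:
            fraction = (num_compounds - low[0]) / (high[0] - low[0])
            return low[column] + fraction * (high[column] - low[column])


print(round(estimate_memory_gb(3000), 2))
print(round(estimate_memory_gb(6000, big_data_mode=True), 2))
```

Such an estimate can help when picking a heap size, but it is best treated as a lower bound and given some headroom.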

Big data mode

Big data mode can be enabled in the first wizard step (disable Show compound structures in 3D). In this mode, compounds are rendered only as dots in 3D space, and the compound structure is available only as a 2D depiction (when selecting a compound with the mouse). This reduces the amount of memory the software requires and makes CheS-Mapper somewhat faster.

Formats (with corresponding file ending)

CheS-Mapper tries to detect the file type from the file extension. For example, if you are using a SMILES file as input, the file name should end with '.smi'.
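The idea of extension-based detection can be illustrated with a small lookup. This is only a sketch of the general technique, not CheS-Mapper's actual detection code; the mapping contains just a handful of the extensions listed on this page, and the name guess_format is invented for the example.

```python
from pathlib import Path

# A few extensions taken from the format lists on this page; not exhaustive.
FORMATS_BY_EXTENSION = {
    "smi": "SMILES",
    "csv": "CSV",
    "sdf": "MDL Structure-data file",
    "mol2": "Mol2 (Sybyl)",
    "cml": "Chemical Markup Language",
    "hin": "HyperChem HIN",
}


def guess_format(filename):
    """Return a format name for the file extension, or None if unknown."""
    # Compare case-insensitively and without the leading dot.
    extension = Path(filename).suffix.lstrip(".").lower()
    return FORMATS_BY_EXTENSION.get(extension)


print(guess_format("fhm.smi"))
print(guess_format("notes.txt"))
```

A file with a missing or misleading extension cannot be recognised this way, which is why the file name should carry the correct ending.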

All files supported by CDK

CTX (ctx), PubChem Substances ASN (asn), Mol2 (Sybyl) (mol2), MDL Molfile V2000 (mol), Gaussian94, CrystClust (crystclust), PubChem Compound XML (xml), IUPAC-NIST Chemical Identifier (XML) (inchi), Gaussian92, PolyMorph Predictor (Cerius) (pmp), Crystallographic Interchange Format (cif), PubChem Substance XML (xml), Gaussian 2003, Gaussian95, Q-Chem (qc), Jaguar (j), Aces2, Ghemical Quantum/Molecular Mechanics Model (gpr), MoSS Output Format (mossoutput), MOPAC 2002 (mop), MDL Molfile (mol), CAChe MolStruct (cache), MDL Reaction format (rxn), Ghemical Simplified Protein Model, Gaussian90, MOPAC7 (mop), MDL Structure-data file (sdf), MOPAC 97 (mop), Protein Brookhaven Database (PDB) (pdb), ZMatrix, VASP, ADF, PubChem Compound ASN (asn), IUPAC-NIST Chemical Identifier (Plain Text), MDL Mol/SDF V3000, Spartan Quantum Mechanics Program, PubChem Substances XML (xml), NWChem (nw), HyperChem HIN (hin), PubChem Compounds XML (xml), GAMESS log file (gam), Dalton, ABINIT, ShelXL (ins), MDL RXN V3000 (rxn), Chemical Markup Language (cml), Symyx Rgroup query files (mol), Gaussian98, CDK OWL (N3) (n3), MOPAC 93 (mop)

Additional formats

  • SMILES file format (.smi) : Each line has to contain a SMILES string. Optionally, the SMILES string can be followed by a whitespace character and a name (or an ID) for the compound.

Example (first entries from the EPA-FHM dataset):

C1=CC(C=O)=CC(OC)=C1OCCCCCC 4-(Hexyloxy)-m-anisaldehyde
C1(OC)=C([N+]([O-])=O)C(C=O)=CC(Br)=C1O 5-Bromo-2-nitrovanillin
CCCCCCCCOC(=O)C1=CC=CC(C(=O)OCCCCCCCC)=C1 Di-n-octylisophthalate
C1=CC(Cl)=CC=C1OC2=C([N+](=O)[O-])C=CC=C2 p-Chlorophenyl-o-nitrophenyl ether
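To make the line layout concrete, the following sketch splits a .smi line into its SMILES string and optional name. It assumes that everything after the first run of whitespace belongs to the name, which is how the example lines above read; parse_smi_line is a helper written for this page, not part of CheS-Mapper.

```python
def parse_smi_line(line):
    """Split one .smi line into (smiles, name); name is None if absent."""
    # Split on the first run of whitespace only, so names may contain spaces.
    parts = line.strip().split(None, 1)
    if not parts:
        return None  # blank line
    smiles = parts[0]
    name = parts[1] if len(parts) > 1 else None
    return smiles, name


example = "C1=CC(Cl)=CC=C1OC2=C([N+](=O)[O-])C=CC=C2 p-Chlorophenyl-o-nitrophenyl ether"
smiles, name = parse_smi_line(example)
print(smiles)
print(name)
```

Splitting only once keeps multi-word names such as the one in the last example line intact.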

  • Comma-separated values file (.csv) : The first column has to contain SMILES or InChI. Either a comma or a semicolon can be used as the column separator. Quoting the values is recommended, because values can themselves contain commas or semicolons (InChI strings, for example, contain both). This format can be easily exported from Microsoft Excel or LibreOffice.


"C1=CC(Cl)=CC=C1OC2=C([N+](=O)[O-])C=CC=C2","7,69E-03","p-Chlorophenyl-o-nitrophenyl ether"