About ZINC subsets

From DISI
Revision as of 16:49, 28 March 2012 by Frodo

Some information about ZINC subsets

Molecular formats

Molecules are available in four formats: isomeric SMILES, mol2, SDF and flexibase. Each molecule is represented by a single form at pH 7, called the reference (ref for short). Additional representations (protonation variants and tautomers) are available in three incremental subsets that augment this single representative:

  • medium pH (6 to 8), probably desirable for most proteins
  • high pH (8 to 9.5), e.g. for docking to metals
  • low pH (4.5 to 6), e.g. for docking to a positively charged binding site, which in our experience is rare
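
As a rough illustration of how the three ranges relate, the sketch below maps a target pH to the incremental subsets whose range includes it. The labels `low`, `medium` and `high` and the function name are illustrative assumptions, not identifiers used by ZINC; the boundary values are taken from the list above and are treated as inclusive, so a pH of exactly 6 or 8 falls into two ranges.

```python
# pH ranges copied from the list above. The keys are illustrative labels,
# not official ZINC subset names.
PH_SUBSETS = {
    "low": (4.5, 6.0),
    "medium": (6.0, 8.0),
    "high": (8.0, 9.5),
}


def subsets_covering(ph):
    """Return the labels of the incremental subsets whose range includes ph."""
    return [name for name, (lo, hi) in PH_SUBSETS.items() if lo <= ph <= hi]


if __name__ == "__main__":
    for ph in (5.0, 7.0, 8.0, 9.0):
        print(ph, subsets_covering(ph))
```

A pH outside all three ranges returns an empty list, which in this sketch simply means that only the reference form applies.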


Subsets are broken into tranches for easier download

  • mol2 files: 700 MB uncompressed (CD size), about 127 MB compressed, about 120,000 molecules each on average.
  • SDF files: about 542 MB uncompressed, 90 MB compressed, about 120,000 molecules each on average.
  • Flexibase files: about 90 MB uncompressed, 37 MB compressed, about 2,000 hierarchies per file on average.
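
The averages above can be used to estimate how many files a subset spans and how much disk space it needs. The following is a minimal sketch under the assumption that every file has the average size quoted above; real tranches vary, and the last file of a subset is usually only partly full, so the result leans high. The function name and the example count of 1,000,000 molecules are hypothetical.

```python
import math

# Average per-file figures quoted in the list above (sizes in MB).
# "items" is molecules for mol2 and SDF, and hierarchies for flexibase.
TRANCHE_STATS = {
    "mol2": {"items": 120_000, "compressed_mb": 127, "uncompressed_mb": 700},
    "sdf": {"items": 120_000, "compressed_mb": 90, "uncompressed_mb": 542},
    "flexibase": {"items": 2_000, "compressed_mb": 37, "uncompressed_mb": 90},
}


def estimate_download(n_items, fmt):
    """Estimate file count and total sizes, assuming average-sized files."""
    stats = TRANCHE_STATS[fmt]
    n_files = math.ceil(n_items / stats["items"])
    return {
        "files": n_files,
        "compressed_mb": n_files * stats["compressed_mb"],
        "uncompressed_mb": n_files * stats["uncompressed_mb"],
    }


if __name__ == "__main__":
    # Hypothetical subset of one million molecules in mol2 format.
    print(estimate_download(1_000_000, "mol2"))
```

For the hypothetical one-million-molecule subset this gives 9 mol2 files, roughly 1.1 GB to download and 6.3 GB once uncompressed.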

Shell scripts simplify download

For all but the smallest subsets, we recommend using a shell script to download the database. For most docking projects, we recommend the Usual subset, which contains the relevant forms between pH 6 and 8. For docking to metalloenzymes, we recommend the Metals subset, which includes high pH forms. Note that these sets overlap, so you do not want to download both Metals and All. For chemoinformatics projects, you may just want a single representation at pH 7.
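
For readers who prefer not to use a shell script, the loop such a script performs can be sketched in Python. This is not the script distributed with ZINC: the function name is hypothetical and the list of tranche URLs must be supplied by the user (the URL in the usage comment is a placeholder). The sketch skips files that already exist, so an interrupted run can be restarted; it does not verify that an existing file is complete.

```python
import os
import urllib.request
from urllib.parse import urlparse


def download_tranches(urls, dest_dir=".", fetch=urllib.request.urlretrieve):
    """Download each URL into dest_dir, skipping files that already exist.

    `fetch` takes (url, target_path); it defaults to urlretrieve and can be
    replaced, for example to add retries or to test without network access.
    Returns the list of paths that were newly downloaded.
    """
    os.makedirs(dest_dir, exist_ok=True)
    downloaded = []
    for url in urls:
        # Use the last path component of the URL as the local file name.
        name = os.path.basename(urlparse(url).path)
        target = os.path.join(dest_dir, name)
        if os.path.exists(target):
            continue  # already present, e.g. from an earlier run
        fetch(url, target)
        downloaded.append(target)
    return downloaded


# Usage (placeholder URL, replace with the tranche URLs for your subset):
# download_tranches(["http://example.org/tranche_1.mol2.gz"], dest_dir="zinc")
```

Because the subsets overlap, passing the URL lists of two overlapping subsets to the same call would fetch the shared molecules twice under different file names; choose one subset per project instead.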