About ZINC subsets

From DISI
Revision as of 20:30, 2 December 2011 by Frodo (talk | contribs)

Some information about ZINC subsets

Molecular formats

Molecules are available in four formats: isomeric SMILES, mol2, SDF, and flexibase. Each molecule is represented by a single form at pH 7 (called the reference, or ref for short). Additional representations (protonation variants and tautomers) are available in three incremental subsets that augment this single representative:

  • medium pH (6 to 8), probably desirable for most proteins
  • high pH (8 to 9.5), e.g. for docking to metals
  • low pH (4.5 to 6), e.g. for docking to a positively charged binding site, which in our experience is rare.
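As a rough illustration of how the pH ranges above relate to one another, the following Python sketch looks up which incremental subsets cover a given pH. The labels "low", "medium", and "high" and the function name are hypothetical, chosen only for this example, and treating the range boundaries as inclusive is an assumption (the ranges listed above share the endpoints 6 and 8).

```python
# Inclusive pH ranges taken from the list above.
# The labels are hypothetical names used only in this sketch.
PH_SUBSETS = [
    ("low", 4.5, 6.0),
    ("medium", 6.0, 8.0),
    ("high", 8.0, 9.5),
]


def subsets_covering(ph):
    """Return the labels of the incremental subsets whose range includes ph."""
    return [name for name, lo, hi in PH_SUBSETS if lo <= ph <= hi]


if __name__ == "__main__":
    for ph in (5.0, 7.0, 8.0, 9.0):
        print(ph, subsets_covering(ph))
```

Because the boundaries are treated as inclusive here, a pH of exactly 6 or 8 is reported as covered by two subsets.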


Subsets are broken into tranches for easier download

  • mol2 files: 100 MB uncompressed, about 23 MB compressed, about 20,000 molecules each on average.
  • SDF files: about 90 MB uncompressed, 16 MB compressed, about 20,000 molecules each on average.
  • Flexibase files: about 90 MB uncompressed, 37 MB compressed, about 2,000 hierarchies per file on average.
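The averages above can be used for a back-of-the-envelope estimate of how many files a subset splits into and how much disk space it needs. The Python sketch below is only that: the function and table names are hypothetical, and the per-file figures are the approximate averages quoted in the list, so real tranches will differ.

```python
import math

# Approximate per-file averages quoted above:
# (entries per file, MB uncompressed, MB compressed).
# For flexibase the entries are hierarchies, not molecules.
TRANCHE_STATS = {
    "mol2": (20_000, 100, 23),
    "sdf": (20_000, 90, 16),
    "flexibase": (2_000, 90, 37),
}


def estimate_download(fmt, n_entries):
    """Return (number of files, MB compressed, MB uncompressed) for n_entries."""
    per_file, mb_raw, mb_compressed = TRANCHE_STATS[fmt]
    n_files = math.ceil(n_entries / per_file)
    return n_files, n_files * mb_compressed, n_files * mb_raw


if __name__ == "__main__":
    files, compressed, raw = estimate_download("mol2", 1_000_000)
    print(f"{files} files, ~{compressed} MB to download, ~{raw} MB on disk")
```

For example, a subset of one million molecules in mol2 format would come to roughly 50 files, about 1,150 MB compressed and about 5,000 MB uncompressed under these averages.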

Shell scripts simplify download

For all but the smallest subsets, we recommend using a shell script to download the database. For most docking projects, we recommend the Usual ligands subset: relevant forms between pH 6 and 8. For docking to metalloenzymes, we recommend the Metals subset, which includes high pH forms. Note that these sets overlap, so you do not want to download both Metals and All. For cheminformatics projects, you may just want a single representation at pH 7.
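The recommendations above can be summarized as a small lookup, sketched here in Python. The project keys and subset labels ("usual", "metals", "all", "reference") are hypothetical names invented for this example, not identifiers used by the download scripts.

```python
# Hypothetical labels for the subsets named in the text above.
RECOMMENDED = {
    "docking": "usual",              # relevant forms between pH 6 and 8
    "metalloenzyme": "metals",       # includes high pH forms
    "cheminformatics": "reference",  # single representation at pH 7
}


def recommend_subset(project):
    """Return the subset label suggested above for a given project type."""
    return RECOMMENDED[project]


def redundant_selection(selected):
    """Return True if both Metals and All are selected, since they overlap."""
    return {"metals", "all"} <= set(selected)


if __name__ == "__main__":
    print(recommend_subset("metalloenzyme"))
    print(redundant_selection(["metals", "all"]))
```

A check like `redundant_selection` is one way to catch the overlapping Metals and All download before any files are fetched.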