ZINC:Errata

From DISI
Revision as of 21:50, 19 December 2007 by JohnIrwin (talk | contribs)

Here are errata as reported for ZINC:


  • For SIGMA propiophenone P51605, ZINC has entry 1671385, and the ring in it does not show as aromatic.
    • FIXED on 12/18/07
  • Many molecules are reported under ZINC01278699. Sorry about this case; it will be removed in the next version.
  • The following pairs are not identical; they are different protonation states of hydroxamic acids: ZINC03817650 and ZINC04628541; ZINC01548784 and ZINC03820719. (PipelinePilot appears to have a problem interpreting the mol2 files; I rechecked everything with the SDF files.)
  • I downloaded the Asinex and Sigma-Aldrich databases from version 7 of ZINC in both SMILES and MOL2 formats. For both databases, the archives contain different sets of molecules: some molecules are present in the multi-mol2 file but not in the SMILES file, and vice versa. Is this possible, or did I make an error in the comparison?

No, you are quite correct. I just did:

>  zmore sial_p0.smi.gz | awk '{print $2}' | sort -u > smiles_codes  
>  zcat sial_p0.?.mol2.gz | grep ZINC | sort -u > mol2_p0_codes
>  wc -l smiles_codes mol2_p0_codes
114763 smiles_codes
112069 mol2_p0_codes
> diff smiles_codes  mol2_p0_codes  |wc -l
4265

I agree that the mol2 and SMILES files for Sigma-Aldrich in ZINC version 7 differ by a little over 2,500 entries, a little over 2% of the library. (The unique-code counts above, 114,763 versus 112,069, differ by 2,694.)
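The `diff | wc -l` figure above also counts diff's own change-marker lines, so it overstates the number of differing entries. A minimal Python sketch of the same comparison using set differences is given here. It assumes that the ZINC code is the second whitespace-separated field of each SMILES line and that each mol2 molecule name line consists only of the ZINC code. The function names and the sample IDs are hypothetical placeholders, not real ZINC entries.

```python
from typing import Iterable, Set


def smiles_codes(lines: Iterable[str]) -> Set[str]:
    """Collect the second field of each line, like awk '{print $2}' | sort -u."""
    codes = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            codes.add(fields[1])
    return codes


def mol2_codes(lines: Iterable[str]) -> Set[str]:
    """Collect every line containing 'ZINC', like grep ZINC | sort -u.

    Assumption: such lines hold only the code, so stripping whitespace is enough.
    """
    return {line.strip() for line in lines if "ZINC" in line}


# Toy input with made-up IDs. For real files one could pass, for example,
# gzip.open("sial_p0.smi.gz", "rt") as the iterable of lines.
smi_lines = ["CCO ZINC00000001\n", "c1ccccc1 ZINC00000002\n", "CCN ZINC00000003\n"]
mol2_lines = [
    "@<TRIPOS>MOLECULE\n", "ZINC00000002\n",
    "@<TRIPOS>MOLECULE\n", "ZINC00000004\n",
]

smi = smiles_codes(smi_lines)
mol2 = mol2_codes(mol2_lines)
only_smi = smi - mol2    # codes present only in the SMILES file
only_mol2 = mol2 - smi   # codes present only in the mol2 files

print(sorted(only_smi))   # ['ZINC00000001', 'ZINC00000003']
print(sorted(only_mol2))  # ['ZINC00000004']
```

Reporting the two directions separately shows how many codes are missing from each format, which a single diff line count cannot.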


  • There appears to be an issue with isomerism in the ZINC database that affects DUD. An example is compounds 3165371 and 4460991, which are E/Z isomers around a double bond. If you ask for the SMILES strings, they match the 2D depictions shown on the search results page (not surprising, really, as I imagine the 2D depictions are generated directly from the SMILES). However, if you ask for these compounds as SDF files, you get the correct isomer for 4460991 but two structures for 3165371, both of which are wrong compared to the SMILES: one is identical to 4460991, and the other has the exocyclic double bond flipped.

Both of these compounds are in the DUD data set. However, since the SDF data available for 3165371 are incorrect, the DUD data set has ended up with duplicate structures.

We believe there are several cases of this in DUD (and ZINC). E/Z isomerism has been a sore point for us ever since we got started. We are aware of this problem, and aim to fix as many of these problems as possible in the next release. I think the SDF should be taken as authoritative over the SMILES, if they disagree. However, we have not yet been able to make a complete study of this.
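One possible way to audit for cases like this is sketched below. It assumes that, for every record, a canonical isomeric SMILES string has already been derived from both the catalog SMILES and the SDF structure with the same cheminformatics toolkit, so that plain string comparison is meaningful. The function name `audit_stereo`, the IDs, and the toy difluoroethene strings are hypothetical and are not the actual ZINC structures discussed above.

```python
from collections import defaultdict


def audit_stereo(records):
    """Flag SMILES/SDF disagreements and structures shared by several IDs.

    records: iterable of (compound_id, smiles_key, sdf_key) tuples, where both
    keys are canonical isomeric SMILES from the same toolkit (an assumption).
    An ID may appear more than once if its SDF holds several structures.
    """
    ids_by_sdf_key = defaultdict(set)
    mismatched = set()
    for compound_id, smiles_key, sdf_key in records:
        if smiles_key != sdf_key:
            mismatched.add(compound_id)
        ids_by_sdf_key[sdf_key].add(compound_id)
    # A structure key claimed by more than one ID is a duplicate structure.
    duplicates = {
        key: sorted(ids) for key, ids in ids_by_sdf_key.items() if len(ids) > 1
    }
    return sorted(mismatched), duplicates


# Toy data: ID_A is listed as the Z isomer, but its SDF holds the E isomer,
# which is the structure that ID_B legitimately has.
records = [
    ("ID_A", "F/C=C\\F", "F/C=C/F"),
    ("ID_B", "F/C=C/F", "F/C=C/F"),
]

mismatched, duplicates = audit_stereo(records)
print(mismatched)  # ['ID_A']
print(duplicates)  # {'F/C=C/F': ['ID_A', 'ID_B']}
```

The sketch only reports disagreements; deciding whether the SDF or the SMILES is correct for a given compound still requires checking against the vendor's data.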