Latest revision as of 20:35, 4 January 2019

DUD-E
Paradigm: Docking
Developer: Shoichet Lab
Stable Version: DUDE
Programming Language: Python, C-Shell
Dependencies: JChem, ChemAxon

This is the wiki page for DUD-E (or, as we like to call it, "DUDE"), a Directory of Useful Decoys - Enhanced. DUD-E is on the web at http://dude.docking.org.

This page contains documentation and an FAQ. It may also be used for posting errors and omissions in DUD-E and for commenting on the database's design and usefulness.

About

DUD-E is an enhanced and rebuilt version of DUD. DUD-E is designed to help benchmark molecular docking programs by providing challenging decoys. It contains:

  • 22,886 active compounds and their affinities against 102 targets, an average of 224 ligands per target.
  • 50 decoys for each active having similar physico-chemical properties but dissimilar 2-D topology.

Differences between DUDE and original DUD:

Feature | DUD-E | Original DUD
Number of targets | 102 | 40
Number of ligands per target | 100 to 600, 224 avg. | 11 to 475, 98 avg.
Decoys per ligand | 50 | 33
Physical properties matched | Same as in DUD, plus net molecular charge | Molecular weight, calculated logP, H-bond donors and acceptors, number of rotatable bonds
Fingerprint and dissimilarity criteria | ECFP4, 25% most dissimilar | CACTVS default, 0.7 maximum (in retrospect, far too generous)
Clustering to downweight highly similar ligands | Yes | No
Literature references and affinities | Yes, via ChEMBL | No
Decoy maker available on-line for arbitrary active sets? | Yes | No

Cite DUD-E

To cite DUD-E, please reference Mysinger MM, Carchia M, Irwin JJ, Shoichet BK, J. Med. Chem., 2012, Jul 5, DOI 10.1021/jm300687e. You may also wish to cite the original version of DUD: Huang, Shoichet and Irwin, J. Med. Chem., 2006, 49(23), 6789-6801, DOI 10.1021/jm0608356.

Detailed file documentation

General comments

First, you may not need any of these files. They are provided in an attempt to be completely transparent, but the "all" files linked on the DUDE website contain the files you should need: receptor, crystal ligand, actives (isomeric SMILES, mol2 and SDF) and decoys (same three formats).

File by file, folder by folder explanations

  • Folders like P29274, P30543, or P46616: these are SwissProt codes. Each directory contains preparation files specific to that individual code, which is often species-specific.
  • Docking, docking_auto
  • dudgen_clustered, dudgen_ecfp4 - The clustered sets live in dudgen_clustered while the full raw sets are in dudgen_ecfp4. Inside, the ligands.charge file contains the mapping from ChEMBL ids to unique property sets, and the search/decoys.*.picked files contain the actual decoys for that protonation set.

The file formats are as follows:

  • ligands.charge - gives the unique protonation states of the input ligands.
    Format: one ligand protonation form per line
    SMILES input_id protonation_id mwt logp rb hba hbd charge
  • search/
    • decoys.<protonation_id>.picked - contains the matched decoys for each unique ligand protonation state
      Format: ligand protomer and then 50 matched decoys
      first line: ligand SMILES input_id protonation_id
      following lines: SMILES ZINC_ID ZINC_Protonation_ID
  • actives_* including: actives_combined.ism, actives_final.ism, actives_murcko_1.ism, actives_murcko_1_30_nM.ism, actives_murcko_enumeration.ism, actives_nM_chembl.ism, actives_nM_combined.ism, actives_scaffolds.ism, actives_trimmed.txt, actives_final.mol2.gz, actives_final.sdf.gz
  • common_scaffolds.ism
  • crystal_ligand.mol2
  • decoys_*, including: decoys_final.ism, decoys_scaffolds.ism, decoys_tabbed.ism, decoys_to_scaffolds.ism, decoys_final.mol2.gz, decoys_final.sdf.gz
  • inactives_*, including: inactives_combined.ism, inactives_nM_chembl.ism, inactives_nM_combined.ism
  • marginal_* including: marginal_actives_combined.ism, marginal_actives_nM_chembl.ism, marginal_actives_nM_combined.ism, marginal_inactives_combined.ism, marginal_inactives_nM_chembl.ism, marginal_inactives_nM_combined.ism
  • pdb_analyze.txt pdb_blessed.txt
  • receptor.pdb
  • scaffold_count.txt
  • subset_decoys.py in the target directory can convert the full dudgen_ecfp4 set into a dudgen_clustered type set given a list of molregno ids.
  • uniprot.txt
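
The ligands.charge and decoys.<protonation_id>.picked formats described above could be read with something like the following. This is a minimal sketch, not part of the DUD-E tooling: the function and class names are made up for illustration, and it assumes whitespace-separated fields, numeric property columns, and that the first line of a .picked file may or may not begin with the literal word "ligand".

```python
from typing import NamedTuple


class LigandProtomer(NamedTuple):
    """One line of ligands.charge (field names follow the format line above)."""
    smiles: str
    input_id: str
    protonation_id: str
    mwt: float
    logp: float
    rb: int
    hba: int
    hbd: int
    charge: int


def parse_ligands_charge(lines):
    """Parse ligands.charge lines into LigandProtomer records.

    Assumption: nine whitespace-separated fields per line; other lines are skipped.
    """
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 9:
            continue
        smiles, input_id, prot_id = fields[:3]
        mwt, logp = float(fields[3]), float(fields[4])
        # Counts and charge are read via float() so that "2" and "2.0" both work.
        rb, hba, hbd, charge = (int(float(x)) for x in fields[5:])
        records.append(
            LigandProtomer(smiles, input_id, prot_id, mwt, logp, rb, hba, hbd, charge)
        )
    return records


def parse_picked(lines):
    """Parse a decoys.<protonation_id>.picked file.

    Returns (ligand, decoys): ligand is (SMILES, input_id, protonation_id) and
    decoys is a list of (SMILES, ZINC_ID, ZINC_Protonation_ID) tuples.
    """
    rows = [line.split() for line in lines if line.strip()]
    first = rows[0]
    # Assumption: tolerate an optional leading "ligand" label on the first line.
    if first and first[0] == "ligand":
        first = first[1:]
    ligand = tuple(first[:3])
    decoys = [tuple(row[:3]) for row in rows[1:]]
    return ligand, decoys
```

With a real file, one would call parse_picked(open(path)) and expect around 50 entries in the returned decoy list for each ligand protomer.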


FAQ

DUD-E is a research tool which we have tried to make as useful and as correct as we know how. We will endeavor to put right any problems promptly, as best we can. We have also answered some questions below:

Q1. How duplicates are removed

In the paper, you said "We then remove duplicate decoys from the ligand set by sorting decoys from least to most duplicated and assigned each decoy to the protonated ligand which has the least number of decoys already assigned." I don't know which molecules you considered as duplicates, and how to define and calculate the similarity. Please explain!

A1. Duplicate removal procedure

Decoys are uniquely identified by their ZINC protonation ids. We ensure that a particular protonation id (prot_id) is only assigned to one ligand of a given target.

  • 0) start after filtering for the 25% most dissimilar decoys
  • 1) map each prot_id to the number of different ligands it could be assigned to
  • 2) sort from the prot_ids that hit the fewest ligands to those that hit the most
  • 3) loop over that sorted list, starting with the most constrained decoys
  • 4) assign each prot_id to the ligand with the fewest other prot_ids so far

The effect is to spread the decoys as evenly as possible among the ligands they could belong to.
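
The steps above can be sketched in a few lines of Python. This is an illustration of the described procedure, not the code used to build DUD-E: the function name and the input layout (a dict from ligand id to candidate prot_ids) are assumptions, ties are broken alphabetically for reproducibility, and the limit of 50 decoys per ligand is not modelled.

```python
from collections import defaultdict


def assign_decoys(candidates):
    """Assign each decoy prot_id to exactly one ligand.

    candidates: dict mapping ligand id -> iterable of decoy prot_ids that
    survived the dissimilarity filter for that ligand (step 0).
    Returns a dict mapping ligand id -> list of assigned prot_ids.
    """
    # Step 1: for each prot_id, collect the ligands it could be assigned to.
    ligands_for = defaultdict(list)
    for ligand, prot_ids in candidates.items():
        for prot_id in prot_ids:
            if ligand not in ligands_for[prot_id]:
                ligands_for[prot_id].append(ligand)

    # Step 2: sort prot_ids from those that hit the fewest ligands to the most.
    ordered = sorted(ligands_for, key=lambda p: (len(ligands_for[p]), p))

    # Steps 3 and 4: most constrained decoys first, each going to the
    # candidate ligand that currently holds the fewest decoys.
    assigned = {ligand: [] for ligand in candidates}
    for prot_id in ordered:
        target = min(ligands_for[prot_id], key=lambda lig: (len(assigned[lig]), lig))
        assigned[target].append(prot_id)
    return assigned
```

Processing the most constrained decoys first matters: a decoy that fits only one ligand has no alternative home, so it is placed before the flexible decoys are used to even out the counts.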

Q2. Weird residue labeling in receptor.pdb files

I have noticed some weird residue labeling in the PDB files on the DUDE website, what is that all about? e.g.

  • comt: Residues #168/9 are labeled as DIC/ASZ respectively. In the original PDB file they are labeled as ASP/ASN.
  • cp2c9: Residues 58/61 are labeled as NMA/ACE. The original PDB has different labeling (I couldn't figure out the mapping).
  • jak2: Residues 160/1/2 are labeled as PTR which is labeled as HETATM in the original PDB file. Also, there are missing residues between PRO156 and PTR160 that appear in the original PDB file (GLN, ASP, LYS, GLU).
  • pde5a: Residues 653/4 are labeled as HIZ/DID where the original PDB file contains HIS/ASP.

A2. PDB receptor file residue labeling

These are the actual preparations we used in DOCK itself. As explained in the paper, many of the targets have been enhanced to help docking performance. The non-standard residue names most typically indicate changed partial charges, which are used for metal coordination and dipole "tarting". See docking/grids/prot.table.ambcrg.ambH in that target's directory for how each residue is being interpreted.

These are not errors, and represent over 6 months of work to improve DUD-E for DOCK. Some of these improvements are likely transferable to other docking programs (waters retained, ASN flips, etc.). The original PDB code is provided if trying to translate this information to another docking protocol is too much of a hassle.

The funky names help to make our manual intervention explicit and transparent.
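
If you want to see which residues in a receptor.pdb carry such non-standard names before passing the file to another program, a small scan along these lines may help. It is a sketch only: the function name is hypothetical, it relies on the fixed-column PDB layout (residue name in columns 18-20, chain in column 22, residue number in columns 23-26), and it flags anything outside the 20 standard amino acid names, so retained waters and cofactors will be listed too.

```python
# The 20 standard amino acid residue names.
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def nonstandard_residues(pdb_lines):
    """Return (chain, residue_number, residue_name) for each residue whose
    name is not a standard amino acid, in order of first appearance."""
    found = []
    seen = set()
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        name = line[17:20].strip()    # columns 18-20: residue name
        chain = line[21:22]           # column 22: chain identifier
        number = line[22:26].strip()  # columns 23-26: residue sequence number
        key = (chain, number, name)
        if name not in STANDARD_RESIDUES and key not in seen:
            seen.add(key)
            found.append(key)
    return found
```

For the comt example in Q2, one would expect this to report residue 168 as DIC and residue 169 as ASZ; the meaning of each name can then be looked up in prot.table.ambcrg.ambH.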

Q3. More on duplicate IDs

In a number of decoy sets there are duplicate ids. So, for example, the decoys in try1 contain

Cc1ccc(cc1Cl)NC(=O)C(=O)NN/C(=C\C(=O)Nc2ccc(cc2)NC(=O)C)/C C02343192
Cc1ccc(cc1Cl)NC(=O)C(=O)NN/C(=C/C(=O)Nc2ccc(cc2)NC(=O)C)/C C02343192

Where the same id refers to the different stereochemistry. My question is why was this design decision taken (to have both in the decoys file)? Perhaps a single entry with undefined stereochemistry at the differing position might be an alternative, or if two entries have different stereochemistry because they can be synthesised that way a suffix in the id would have been helpful "C02343192_1", "C02343192_2". Taking the representation in ZINC (in this case the bottom one) is also another possibility - but then where does the top one come from?

A3. Answers on duplicate IDs

The uniqueness of decoys is enforced at the level of protonation ids. At the time this set was generated, there were two protomer ids for that ZINC substance id.

decoys.F91123300.picked:
Cc1ccc(cc1Cl)NC(=O)C(=O)NN/C(=C\C(=O)Nc2ccc(cc2)NC(=O)C)/C C02343192       P94844827
Cc1ccc(cc1Cl)NC(=O)C(=O)NN/C(=C/C(=O)Nc2ccc(cc2)NC(=O)C)/C C02343192       P98798844

Only the second protomer remains now, so this was an error in ZINC that was later fixed. I expect these kinds of source errors to show up in both ChEMBL and ZINC.

Q3A. more about dups

I understand your protonation explanation -- but what is unclear to me is that the difference in my two examples is stereochemical and not regarding protonation state (so why do these structures have different protomer ids?). Your explanation applies to the following example decoys in decoys_final.ism in the try1 directory:

C[C@H](CNC(=O)[C@H]1C2=C(CCC2)[NH2+]N1)Oc3cccc(c3)C[NH+](C)Cc4cscn4 C39363328
C[C@H](CNC(=O)[C@H]1C2=C(CCC2)NN1)Oc3cccc(c3)C[NH+](C)Cc4cscn4 C39363328

These have different protonation states of which only the bottom one is found in ZINC (http://zinc.docking.org/substance/39363328). Will these entries be cleaned up from the dataset (and possibly only the molecule in its physiological protonation state be kept) ? Or is this "by design" ? Isn't there the risk of having an "analog bias" effect if you keep both these molecules? I have attached the full list of duplicates found in the current decoy sets, just in case you are interested in following this further. (long file...)

A3A. more about dups answered

For the decoys, we only ask ZINC for the single best representative at pH 7.05. So when it gives more than one protonation state, this is an error in ZINC itself, as the molecule had multiple versions of this "reference" state. The ones with the stereochemistry change are only that much more broken.

In the grand scheme, using your list there are only 1521 of these duplicates out of 1.4 million decoys (~ 0.1%), which actually means that ZINC has a much better error rate than in the past. My inclination is to leave it alone, but possibly document the duplicates as arising from errors in ZINC. If one wanted to fix this, the intermediate data is usually there in the raw ligand sets to fill in these decoys with others that don't have this problem (in the .filtered files).
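
For anyone who wants to locate these cases in their own copy, the duplicate ids discussed in Q3 and Q3A can be found by grouping the lines of a decoy file by id. The sketch below is illustrative only: the function name is made up, and it assumes each line starts with a SMILES string followed by the ZINC id, as in the examples quoted above.

```python
from collections import defaultdict


def find_duplicate_ids(lines):
    """Return {zinc_id: sorted list of distinct SMILES} for every id that
    appears with more than one distinct SMILES string."""
    smiles_by_id = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        smiles, zinc_id = fields[0], fields[1]
        smiles_by_id[zinc_id].add(smiles)
    return {
        zinc_id: sorted(smiles)
        for zinc_id, smiles in smiles_by_id.items()
        if len(smiles) > 1
    }
```

Applied to the try1 decoys, this would be expected to report C02343192 with its two stereoisomeric SMILES. As a check on the figure quoted above, 1521 / 1,400,000 is about 0.11%.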

Q4. The receptor structure is not the same as in the PDB. Why?

I noticed that in the ace receptor.pdb structure some residues are missing compared to the original 3bkl.pdb. I am sure this won't influence docking results but nevertheless: is there a reason for this or just an artifact which happens in all big databases?

A4. Why receptor.pdb is not the same as the PDB file

We know you can go to the PDB yourself and get the original structure. receptor.pdb shows you exactly, atom by atom, the model that we used for docking. In some cases, we have made judgement calls to remove solvent, or even atoms that are too far from the binding site to affect docking. For instance, we often remove atoms further than 25 Å away from the binding site. You are welcome to go back to the PDB to get the original structure.
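
To give a feel for what such a distance cut does, here is a minimal sketch. It is not the preparation protocol used for DUD-E: the function name is hypothetical, and it assumes the binding site is represented by a list of reference coordinates (for example the crystal ligand atoms) and that an atom is kept if it lies within the cutoff of any of them.

```python
import math


def atoms_near_site(receptor_atoms, site_atoms, cutoff=25.0):
    """Keep receptor atoms within `cutoff` angstroms of any site atom.

    Both arguments are sequences of (x, y, z) coordinate tuples.
    """
    return [
        atom
        for atom in receptor_atoms
        if any(math.dist(atom, ref) <= cutoff for ref in site_atoms)
    ]
```

Comparing the atoms kept by such a filter against the original PDB entry is one way to see why residues far from the site, as in the ace example, may be absent from receptor.pdb.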

Community Feedback

This space is reserved for community users to comment or discuss DUD-E. Please feel free to use this resource to communicate with other DUD-E users. You may also write to the authors, and we may then adapt our correspondence into an entry on this page.


END OF FAQ