Generating decoys (Reed's way)

Latest revision as of 18:59, 30 July 2024

Written by Reed Stein on April 3, 2018.

updated 5/3/2019

updated 8/15/2019

updated 3/6/2020

updated 5/18/2020

This pipeline will generate property-matched decoys for a set of ligand SMILES. To build ligands yourself, see "ligand prep" in:

   http://wiki.docking.org/index.php/DOCK_3.7_tutorial_%28Anat%29

All scripts for this tutorial can be found in:

   /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/

Before running any scripts, make sure to source the current version of Python

  source /nfs/soft/python/envs/complete/current/env.csh

Additionally, JChem needs to be sourced in your ~/.cshrc file with the command:

  source /nfs/soft/jchem/current/env.csh

If the scripts below still do not run, also source the DOCK environment:

  source /nfs/soft/dock/versions/dock37/DOCK-3.7-trunk/env.csh

Querying ZINC for Protomers

This procedure generates decoys for your input ligands by searching through 3D conformers that are already built in ZINC. This procedure is advised if you want decoys to be charge-matched to ligands.

Step 1) Setting up directories for Protomers

Before starting, you need a SMILES file with the format (SMILES first, unique ID second):

  S(Nc1c(O)cc(C(=O)O)cc1)(c2c(scc2)C(=O)O)(=O)=O 116
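
Because the ID column must be unique, it can help to check the input file before running the pipeline. The following is a minimal Python 3 sketch, not part of Reed's scripts; the function name `read_smiles_lines` is hypothetical, and it assumes whitespace-separated "SMILES ID" lines as in the example above.

```python
from collections import Counter


def read_smiles_lines(lines):
    """Parse 'SMILES ID' lines and reject malformed lines or duplicate IDs."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 'SMILES ID', got {len(fields)} fields"
            )
        records.append((fields[0], fields[1]))
    # Count how often each ID appears; any ID seen more than once is an error.
    counts = Counter(lig_id for _, lig_id in records)
    duplicates = sorted(lig_id for lig_id, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate ligand IDs: {duplicates}")
    return records


example = ["S(Nc1c(O)cc(C(=O)O)cc1)(c2c(scc2)C(=O)O)(=O)=O 116"]
print(read_smiles_lines(example))
```

To check a real file, pass the open file object, for example `read_smiles_lines(open("ligands.smi"))`, where the file name is a placeholder.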

You also need an input file named "decoy_generation.in" with the following lines:

   PROTONATE YES
   MWT 0 125
   LOGP 0 3.6
   RB 0 5
   HBA 0 4
   HBD 0 3
   CHARGE 0 2
   LIGAND TC RANGE 0.0 0.35
   MINIMUM DECOYS PER LIGAND 20
   DECOYS PER LIGAND 50
   MAXIMUM TC BETWEEN DECOYS 0.8
   TANIMOTO YES


If your input ligand SMILES file is already protonated as you want it, set "PROTONATE NO".
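
To inspect or sanity-check a "decoy_generation.in" file, the keyword lines can be read into a dictionary. This is only a sketch of one way to parse the layout shown above, not how the pipeline scripts read the file; `parse_decoy_config` and its helpers are hypothetical names, and it assumes every line is a keyword (possibly several words) followed by numbers or YES/NO.

```python
def _is_value(token):
    """True if the token looks like a setting value rather than a keyword."""
    if token in ("YES", "NO"):
        return True
    try:
        float(token)
        return True
    except ValueError:
        return False


def _convert(token):
    """Turn YES/NO into booleans and numeric tokens into int or float."""
    if token in ("YES", "NO"):
        return token == "YES"
    try:
        return int(token)
    except ValueError:
        return float(token)


def parse_decoy_config(text):
    """Split each line into a multi-word keyword and its trailing values."""
    config = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # Walk back from the end while tokens look like values,
        # always leaving at least one token for the keyword.
        split_at = len(tokens)
        while split_at > 1 and _is_value(tokens[split_at - 1]):
            split_at -= 1
        key = " ".join(tokens[:split_at])
        values = [_convert(t) for t in tokens[split_at:]]
        config[key] = values[0] if len(values) == 1 else values
    return config


sample = """PROTONATE YES
MWT 0 125
LIGAND TC RANGE 0.0 0.35
MINIMUM DECOYS PER LIGAND 20
"""
print(parse_decoy_config(sample))
```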

If you want your input ligand SMILES protonated, only protomer SMILES with unique properties will be kept for generating decoys. Therefore, if you have one ligand that exists in 4 tautomers, all of which have identical molecular weight, cLogP, # rotatable bonds, # H-bond acceptors and donors, and net charge, only one will be maintained for decoy matching. This doesn't apply if you set "PROTONATE NO".
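
The effect of keeping only protomers with unique properties can be illustrated as follows. This is a sketch under assumptions: the dictionary keys and property values are made up for illustration, and keeping the first protomer seen is a choice of this example, since the page does not say which one the script keeps.

```python
PROPERTY_KEYS = ("mwt", "logp", "rb", "hba", "hbd", "charge")


def unique_by_properties(protomers):
    """Keep one protomer per distinct property tuple (first one seen wins)."""
    seen = set()
    kept = []
    for protomer in protomers:
        signature = tuple(protomer[key] for key in PROPERTY_KEYS)
        if signature not in seen:
            seen.add(signature)
            kept.append(protomer)
    return kept


# Two hypothetical tautomers with identical properties and one charged form.
protomers = [
    {"id": "116_a", "mwt": 343.3, "logp": 1.9, "rb": 5, "hba": 7, "hbd": 4, "charge": 0},
    {"id": "116_b", "mwt": 343.3, "logp": 1.9, "rb": 5, "hba": 7, "hbd": 4, "charge": 0},
    {"id": "116_c", "mwt": 342.3, "logp": 1.9, "rb": 5, "hba": 7, "hbd": 3, "charge": -1},
]
print([p["id"] for p in unique_by_properties(protomers)])
```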

This file specifies that for each ligand protomer, at least 20 decoys will be retrieved with the following properties:

    - within +/- 125 Daltons
    - within +/- 3.6 logP
    - within +/- 5 rotatable bonds
    - within +/- 4 hydrogen bond acceptors
    - within +/- 3 hydrogen bond donors
    - within +/- 2 charge
    - Tanimoto coefficient (Tc) to the ligand of 0.35 or less
    - minimum 20 decoys per ligand protomer, if available
    - preferred 50 decoys per ligand protomer, if available
    - the maximum TC between decoy molecules should be 0.8
    - "TANIMOTO" refers to whether a Tanimoto calculation should be performed - see step 3 for when this is necessary

These values are arbitrary; for each property you can set your own minimum and maximum amount by which decoys may differ from the ligands.
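
The property windows can be read as a simple per-property test. The sketch below is an illustration rather than the pipeline's own code: it assumes each pair of numbers is the minimum and maximum absolute difference allowed between ligand and decoy, with inclusive bounds, and the names `within_windows`, `WINDOWS`, and the example property values are hypothetical.

```python
# Windows taken from the example "decoy_generation.in" above:
# property -> (minimum difference, maximum difference).
WINDOWS = {
    "mwt": (0, 125),
    "logp": (0, 3.6),
    "rb": (0, 5),
    "hba": (0, 4),
    "hbd": (0, 3),
    "charge": (0, 2),
}


def within_windows(ligand, decoy, windows=WINDOWS):
    """True if every property difference falls inside its window."""
    for prop, (low, high) in windows.items():
        diff = abs(ligand[prop] - decoy[prop])
        if not (low <= diff <= high):
            return False
    return True


ligand = {"mwt": 343.0, "logp": 2.0, "rb": 5, "hba": 7, "hbd": 4, "charge": -1}
close_decoy = {"mwt": 400.0, "logp": 3.0, "rb": 6, "hba": 5, "hbd": 2, "charge": -1}
heavy_decoy = {"mwt": 500.0, "logp": 3.0, "rb": 6, "hba": 5, "hbd": 2, "charge": -1}

print(within_windows(ligand, close_decoy))
print(within_windows(ligand, heavy_decoy))
```

Setting a nonzero minimum (for example "MWT 10 125") would, under this reading, exclude decoys that are too close to the ligand in that property.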

Once you have created this file, run the following command to create the decoy generation directory:

  python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0000_protonate_setup_dirs.py {SMILES_FILE} {NEW_DIR_NAME}

Provide a directory name that you want in place of {NEW_DIR_NAME}. This will create the directory with subdirectories named "ligand_${number}" for each of the ligands in the SMILES file you input.

Step 2) Retrieving protomer decoys from ZINC15

If you have edited the "decoy_generation.in" file which is now located in {NEW_DIR_NAME} as you want, you can run the following command:

   python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0001_qsub_generate_decoys.py {NEW_DIR_NAME}

This should take 15 minutes to an hour, depending on how many ligands you input.

Step 3) Assigning accepted protomer decoys to each ligand protomer

We can now assign the property-matched decoys to the ligand protomers. Make sure the "decoy_generation.in" file from before is still in {NEW_DIR_NAME}.

To filter the decoys, run the following command:

   python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0002_qsub_filter_decoys.py {NEW_DIR_NAME}

This will run on the queue. A log file called "FILTER_DECOYS.log" will be generated in {NEW_DIR_NAME} with information and any errors.

If you don't get enough decoys, modify the "decoy_generation.in" file by changing "MAXIMUM TC BETWEEN DECOYS", "MINIMUM DECOYS PER LIGAND", etc. To avoid rerunning the time-consuming Tanimoto calculation between all decoys, add or change this line in the "decoy_generation.in" file:

   TANIMOTO NO

If you set Tanimoto to "NO", make sure that your {NEW_DIR_NAME} still has the original files:

   "test_ligdecoy_smiles.smi"
   "cluster_head.list"

Otherwise, this step will not run.

If these original files still remain, this will skip the Tanimoto calculation step, and filter property matched decoys based on the new parameters in the "decoy_generation.in" file.
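
Before re-running with "TANIMOTO NO", a quick check that both files are still present can save a failed queue job. This is a minimal sketch, not part of the pipeline; `missing_rerun_files` is a hypothetical helper, and the temporary directory only stands in for {NEW_DIR_NAME} in the demonstration.

```python
import tempfile
from pathlib import Path

# File names listed above as required when TANIMOTO is set to NO.
REQUIRED_FILES = ("test_ligdecoy_smiles.smi", "cluster_head.list")


def missing_rerun_files(decoy_dir):
    """Return the required files that are absent from the directory."""
    directory = Path(decoy_dir)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]


# Demonstration on a throwaway directory that holds only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "cluster_head.list").write_text("")
    demo_missing = missing_rerun_files(tmp)

print(demo_missing)
```

An empty list means both files are present.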

If this has completed successfully, you should see files in your {NEW_DIR_NAME} with the format "{LIGAND_ID}_final_property_matched_decoys.txt". These files contain the ligands and their properties, as well as property-matched decoys that have been assigned to them. These files have the format "SMILES", "ZINC ID", "logP", "#Rotatable Bonds", "# Hydrogen Bond Donors", "# Hydrogen Bond Acceptors", "Charge", "Protomer SMILES", and "Tanimoto Coefficient to Ligand".

There should also be files with the format "{LIGAND_ID}_replacements.txt", which include extra property-matched decoys that were assigned to that ligand.
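
To see which ligands received decoys, the ligand IDs can be read back from the output file names. The sketch below relies only on the file-name pattern given above and does not parse the file contents; `matched_ligand_ids` is a hypothetical helper and the temporary directory stands in for {NEW_DIR_NAME}.

```python
import tempfile
from pathlib import Path

SUFFIX = "_final_property_matched_decoys.txt"


def matched_ligand_ids(decoy_dir):
    """List ligand IDs that have a final property-matched decoy file."""
    paths = Path(decoy_dir).glob("*" + SUFFIX)
    # Strip the fixed suffix from each file name to recover the ligand ID.
    return sorted(path.name[: -len(SUFFIX)] for path in paths)


# Demonstration with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("116" + SUFFIX, "42" + SUFFIX, "116_replacements.txt"):
        (Path(tmp) / name).write_text("")
    demo_ids = matched_ligand_ids(tmp)

print(demo_ids)
```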

If you still cannot get enough decoys for your ligands, consider reducing the number of ligands you have by clustering, for example, or using the SMILES decoy generation below, which is not limited to only molecules that are already built in ZINC15.

Step 4) Copying decoy .db2.gz files into your directories

To copy property-matched decoys into your own directory of choice, run the following command:

   python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0003_copy_decoys_to_new_dir.py {NEW_DIR_NAME} {COPY_TO_DIR}

where {COPY_TO_DIR} is a new directory that will be created to hold your copied decoys. Two subdirectories will be created in it:

    "ligands" - this includes the input ligands for which there are X number property matched decoys (these are all ligands with "{LIGAND_ID}_final_property_matched_decoys.txt" files in {NEW_DIR_NAME})
    "decoys" - this will include the decoy .db2.gz files for docking and "decoys.smi" which contains all the SMILES strings for property matched decoys

IMPORTANT: It is possible that not all of your ligand protomers were matched to property-matched decoys. The "ligands.smi" file in {COPY_TO_DIR} will not include the unmatched ones. Make sure you do not dock them if you calculate enrichment values.
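
One way to find the ligands that were left out is to compare IDs between your input SMILES file and "ligands.smi". This is a sketch under assumptions that should be checked against your own files: it assumes "ligands.smi" uses the same whitespace-separated "SMILES ID" layout as the input and that IDs are carried over unchanged. `dropped_ligand_ids` is a hypothetical helper.

```python
def _ids(lines):
    """Collect the second column (the ID) from 'SMILES ID' lines."""
    ids = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            ids.add(fields[1])
    return ids


def dropped_ligand_ids(input_lines, matched_lines):
    """IDs present in the input file but absent from ligands.smi."""
    return sorted(_ids(input_lines) - _ids(matched_lines))


# Hypothetical example: ligand 117 received no property-matched decoys.
input_lines = ["CCO 116", "CCN 117", "CCC 118"]
matched_lines = ["CCO 116", "CCC 118"]
print(dropped_ligand_ids(input_lines, matched_lines))
```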

Querying ZINC for SMILES

This procedure generates decoys for your input ligand SMILES by finding decoy SMILES in ZINC that are property-matched. This procedure will provide decoy SMILES that you can build yourself into 3D models for docking. If you would like to query ZINC for decoy SMILES so that you can build decoys yourself or if your ligands are >400 Da, start here. If not, go to "Querying ZINC for Protomers" to generate decoys that already have 3D models.

Step 1) Setting up SMILES directory

Before starting, you need a SMILES file with the format (SMILES first, ID second):

  S(Nc1c(O)cc(C(=O)O)cc1)(c2c(scc2)C(=O)O)(=O)=O 116

You also need an input file named "decoy_generation.in" with the following lines:

   SMILES YES
   PROTONATE YES
   MWT 0 125
   LOGP 0 3.6
   RB 0 5
   HBA 0 4
   HBD 0 3
   CHARGE 0 2
   LIGAND TC RANGE 0.0 0.35
   MINIMUM DECOYS PER LIGAND 20
   DECOYS PER LIGAND 50
   MAXIMUM TC BETWEEN DECOYS 0.8
   TANIMOTO YES
   GENERATE DECOYS 750
   

If your input ligand SMILES file is already protonated as you want it, set "PROTONATE NO". "SMILES YES" tells the script that you want to query ZINC for SMILES rather than built protomers.

This file specifies that for each ligand protomer, {MINIMUM DECOYS PER LIGAND} to {DECOYS PER LIGAND} decoys will be retrieved with the following properties:

    - within +/- 125 Daltons
    - within +/- 3.6 logP
    - within +/- 5 rotatable bonds
    - within +/- 4 hydrogen bond acceptors
    - within +/- 3 hydrogen bond donors
    - within +/- 2 charge
    - Tanimoto coefficient (Tc) to the ligand of 0.35 or less

"GENERATE DECOYS" specifies how many potential decoys you want to check for property matching with your ligands. A smaller number results in faster decoy generation, but a smaller pool of potential decoys to compare your ligand against. A larger number results in slower decoy generation, and greater likelihood of property-matched decoys for all your ligands.

As with protomers:

    - "MINIMUM DECOYS PER LIGAND" is the minimum number of decoys you want for each ligand protomer
    - "DECOYS PER LIGAND" is your preferred number of decoys for each ligand protomer
    - "MAXIMUM TC BETWEEN DECOYS" is the maximum Tc allowed between decoys (the lower the value, the more dissimilar your decoys will be)
    - "TANIMOTO" sets whether the full ligand-decoy Tc matrix should be calculated; this must be done at least once, so do not set it to "NO" unless you are re-running step 3
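
To make the Tc settings concrete, the sketch below computes a Tanimoto coefficient on fingerprint bit sets and applies a "maximum Tc between decoys" cutoff with a simple greedy pass. It illustrates the idea only: the bit sets are toy data, the greedy selection is a technique chosen for this example, and nothing here is taken from the actual filtering script, whose fingerprints and selection method are not described on this page.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pick_diverse(candidates, max_tc, limit):
    """Greedily keep candidates whose Tc to every kept decoy is <= max_tc."""
    chosen = []
    for name, bits in candidates:
        if all(tanimoto(bits, kept_bits) <= max_tc for _, kept_bits in chosen):
            chosen.append((name, bits))
        if len(chosen) == limit:
            break
    return [name for name, _ in chosen]


# Toy fingerprints: d2 is nearly identical to d1, d3 shares nothing with either.
candidates = [
    ("d1", {1, 2, 3, 4}),
    ("d2", {1, 2, 3, 4, 5}),
    ("d3", {7, 8, 9}),
]
print(tanimoto({1, 2, 3, 4}, {1, 2, 3, 4, 5}))
print(pick_diverse(candidates, max_tc=0.5, limit=50))
```

A lower `max_tc` rejects more near-duplicates, which is why lowering "MAXIMUM TC BETWEEN DECOYS" gives more dissimilar decoys but can leave fewer of them per ligand.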


These values are arbitrary; for each property you can set your own minimum and maximum amount by which decoys may differ from the ligands.

Once you have created this file, run the following command to create the decoy generation directory:

  python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0000_protonate_setup_dirs.py {SMILES_FILE} {NEW_DIR_NAME}

Provide a directory name that you want in place of {NEW_DIR_NAME}. This will create the directory with subdirectories named "ligand_${number}" for each of the ligands in the SMILES file you input.

Step 2) Retrieving SMILES decoys from ZINC15

If you have edited the "decoy_generation.in" file which is now located in {NEW_DIR_NAME} as you want, you can run the following command:

   python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0001_qsub_generate_decoys.py {NEW_DIR_NAME}

Jobs will run for 15 minutes to 1-2 hours depending on how many ligands you input.

Note: A new bug in Reed's original script was fixed by Brendan Hall (July 30, 2024). Run this script instead:

   python /nfs/home/bwhall61/work/decoy_gen_improvement/0001_qsub_generate_decoys.py {NEW_DIR_NAME}

Step 3) Assigning decoys to ligands

To assign property matched decoys to your ligand protomers, run the following command:

   python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0002_qsub_filter_decoys.py {NEW_DIR_NAME}

Note: Brendan Hall also fixed the errors in the script above. Run this instead:

   cd {NEW_DIR_NAME}
   source /nfs/home/bwhall61/.python_envs/pulp/bin/activate
   python /nfs/home/bwhall61/work/decoy_gen_improvement/filter_decoys.py

This may take some time to run, so running it in a screen session is recommended.


This will run on the queue. As with "Querying ZINC for Protomers":

If you don't get enough decoys, modify the "decoy_generation.in" file by changing "MAXIMUM TC BETWEEN DECOYS", "MINIMUM DECOYS PER LIGAND", etc. To avoid rerunning the time-consuming Tanimoto calculation between all ligands and decoys, add or change this line in the "decoy_generation.in" file:

   TANIMOTO NO

If you set Tanimoto to "NO", make sure that your {NEW_DIR_NAME} still has the original files:

   "test_ligdecoy_smiles.smi"
   "cluster_head.list"

Otherwise, this step will not run.

If these original files still remain, this will skip the Tanimoto calculation step, and filter property matched decoys based on the new parameters in the "decoy_generation.in" file.

If this has completed successfully, you should see files in your {NEW_DIR_NAME} with the format "{LIGAND_ID}_final_property_matched_decoys.txt". These files have the format "SMILES", "ZINC ID", "logP", "#Rotatable Bonds", "# Hydrogen Bond Donors", "# Hydrogen Bond Acceptors", "Charge", "Protomer ID", and "Tanimoto Coefficient to Ligand".

These files contain the ligands and their properties, as well as property-matched decoys that have been assigned to them. There should also be files with the format "{LIGAND_ID}_replacements.txt", which include extra property-matched decoys that were assigned to that ligand.

Step 4) Setting up ligand/decoy directories for building SMILES

If you have queried ZINC for SMILES, you need to build the decoys yourself. To write the SMILES file, run the following command:

  python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0003b_write_out_ligands_decoys.py {NEW_DIR_NAME} {COPY_TO_DIR}

This will create {COPY_TO_DIR} with two subdirectories, "ligands" and "decoys" as well as SMILES files for:

   ligands.smi - this includes the input ligands for which there are X number property matched decoys (these are all ligands with "{LIGAND_ID}_final_property_matched_decoys.txt" files in {NEW_DIR_NAME})
   decoys.smi  - this includes the canonicalized property-matched decoy SMILES
   decoy_protomers.smi - this includes the actual property-matched decoy protomer SMILES

SMILES for decoys can now be built.

For decoy building, use the following command:

   setenv DOCKBASE /nfs/soft/dock/versions/dock37/DOCK-3.7-trunk
   source /nfs/soft/dock/versions/dock37/DOCK-3.7-trunk/env.csh
   ${DOCKBASE}/ligand/generate/build_database_ligand.sh -H $ph decoy_protomers.smi --pre-tautomerized --no-db

If not all decoys successfully build, more property matched decoys can be taken from the "{LIGAND_ID}_replacements.txt" files. Additionally, you can build decoys without the --pre-tautomerized flag:

   ${DOCKBASE}/ligand/generate/build_database_ligand.sh -H $ph decoys.smi --no-db

This will produce all protomers of each decoy, including the property-matched decoy protomer.

Visualizing Decoy Properties

Visualizing property distributions

To visualize the distributions of molecular properties of matched decoys relative to the ligands, run the following command:

   python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0004_plot_properties.py {NEW_DIR_NAME}

There will be 6 images in {NEW_DIR_NAME} for molecular weight, logP, number of rotatable bonds, number of hydrogen bond donors, number of hydrogen bond acceptors, and net charge of ligands and decoys.

Visualizing decoy Tanimotos to ligands

To visualize how different the matched decoys are to the input ligands, run the following command:

  python /mnt/nfs/home/rstein/zzz.scripts/new_DUDE_SCRIPTS/0005_plot_tanimoto_to_lig.py {NEW_DIR_NAME}

There will be a box and whisker plot image in {NEW_DIR_NAME} showing the Tanimotos calculated between each ligand and all decoys.
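
If you want the numbers behind such a plot, the quantities a box-and-whisker plot summarizes can be computed directly. This is a sketch using Python's standard library on made-up Tc values; it is not the plotting script's code, `tc_summary` is a hypothetical helper, and the quartile convention used by `statistics.quantiles` may differ from the one used in the plot.

```python
import statistics


def tc_summary(tcs):
    """Minimum, quartiles, median, and maximum of a list of Tc values."""
    q1, _, q3 = statistics.quantiles(tcs, n=4)
    return {
        "min": min(tcs),
        "q1": q1,
        "median": statistics.median(tcs),
        "q3": q3,
        "max": max(tcs),
    }


# Hypothetical Tc values between one ligand and its decoys.
tcs = [0.10, 0.15, 0.20, 0.25, 0.30]
print(tc_summary(tcs))
```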