Dockopt (pydock3 script): Difference between revisions

From DISI
 
(40 intermediate revisions by 4 users not shown)
Line 1: Line 1:
''dockopt'' allows the generation of many different docking configurations which are then evaluated & analyzed in parallel using a specified job scheduler (e.g. Slurm).
''dockopt'' allows the generation of many different docking configurations which are then evaluated & analyzed in parallel using a specified job scheduler (e.g. Slurm). If you are a Shoichet Lab user, please see a special section for you, below.


The name "dockopt", aside from being an uncreative rehash of the name "blastermaster", derives from the notion of a literal dockopt, i.e., the person in charge of a dock who manages freight logistics and bosses around numerous dockworkers. In this analogy, a single dockworker corresponds to the processing of a single docking configuration.
To use DOCK 3.8, you must first license it and install it.
[[DOCK 3.8:How to install pydock3]]


== ''init'' ==
== Note for UCSF Shoichet Lab members ==


First you need to create the file structure for your dockopt job. To do so, simply type
''pydock3'' is already installed on the following clusters. You can source the provided Python environment scripts to expose the ''pydock3'' executable and declare default environmental variables:


pydock3 dockopt - init
=== Wynton ===


By default, the job directory is named ''dockopt_job''. To specify a different name, type
  source /wynton/group/bks/soft/python_envs/env.sh


  pydock3 dockopt - init <JOB_DIR_NAME>
=== Gimel ===
 
Only nodes other than ''gimel'' itself are supported, e.g., ''gimel2''.
 
ssh gimel2
source /nfs/soft/ian/env.sh
 
== Subcommand: ''new'' ==
 
Prepare rec.pdb and xtal-lig.pdb as described in Bender, 2021 (https://pubmed.ncbi.nlm.nih.gov/34561691/),
or download pre-prepared sample files from dudez2022.docking.org.
 
Be sure that you are in the directory containing the required input files:
* ''rec.pdb'' or ''rec.crg.pdb''
* ''xtal-lig.pdb''
* ''actives.tgz''
* ''decoys.tgz''
 
Note the inclusion of ''actives.tgz'' and ''decoys.tgz''. Each of these is a tarball of DB2 files, which may be optionally gzipped (with extension .db2.gz). Each tarball represents a binary class for the binary classification task that the docking model is trained on. The positive class (actives) contains the molecules that you want the docking model to preferentially assign ''favorable'' docking scores to. The negative class (decoys) contains the molecules that you want the docking model to preferentially assign ''unfavorable'' scores to. The most common strategy is to use a set of known ligands as the actives and a larger set of property-matched decoys for the decoys (a decoy-to-active ratio of 50:1 is standard), but other strategies are supported. For example, to create a docking model that preferentially assigns favorable scores to agonists over antagonists, a set of agonists can be used as the actives and a set of antagonists can be used for the decoys.
 
You need to build the molecules yourself (see: https://tldr.docking.org/start/build3d38). Each tarball should contain only DB2 files. For example, if one has a directory ''actives/'' containing only DB2 files to use as the actives, then ''actives.tgz'' should be created as follows:
 
cd actives/
tar -czf actives.tgz *.db2*
 
Similarly, for a directory ''decoys/'' containing only DB2 files to use as decoys, ''decoys.tgz'' should be created as follows:
 
cd decoys/
tar -czf decoys.tgz *.db2*
 
To create the file structure for your dockopt job, simply type
 
  pydock3 dockopt - new
 
By default, the job directory is named ''dockopt_job''. To specify a different name, use the "--job_dir_path" flag. E.g.:
 
pydock3 dockopt - new --job_dir_path=dockopt_job_2


The job directory contains two sub-directories:  
The job directory contains two sub-directories:  
Line 17: Line 54:
# ''retrodock_jobs'': individual retrodock jobs for each docking configuration
# ''retrodock_jobs'': individual retrodock jobs for each docking configuration


The key difference between the working directories of ''blastermaster'' and ''dockopt'' is that the working directory of ''dockopt'' may contain multiple variants of the blaster files (prefixed by a number, e.g. "1_box"). These variant files are used to create the different docking configurations specified by the multi-valued entries of ''dockopt_config.yaml''. They are created efficiently, such that the same variant used in multiple docking configurations is not created more than once.  
The key difference between the working directories of ''blastermaster'' and ''dockopt'' is that the working directory of ''dockopt'' may contain multiple variants of the blaster files (suffixed by a number, e.g. "box_1"). These variant files are used to create the different docking configurations specified by the multi-valued entries of ''dockopt_config.yaml''. They are created efficiently, such that the same variant used in multiple docking configurations is not created more than once.  


If your current working directory contains any of the following files, then they will be automatically copied into the working directory within the created job directory. This feature is intended to simplify the process of configuring the dockopt job.
If your current working directory contains any of the following files, then they will be automatically copied into the working directory within the created job directory. This feature is intended to simplify the process of configuring the dockopt job.


* ''rec.pdb''
* ''rec.pdb''
* ''rec.crg.pdb''
* ''xtal-lig.pdb''
* ''xtal-lig.pdb''
* ''rec.crg.pdb''
* ''reduce_wwPDB_het_dict.txt''
* ''reduce_wwPDB_het_dict.txt''
* ''filt.params''
* ''filt.params''
Line 34: Line 71:


Only the following are required. Default versions / generated versions of the others will be used instead if they are not detected.
Only the following are required. Default versions / generated versions of the others will be used instead if they are not detected.
* ''rec.pdb''
* ''rec.pdb'' or ''rec.crg.pdb''.  Either is required, but not both.  If both are present, only ''rec.crg.pdb'' is used.
* ''xtal-lig.pdb''
* ''xtal-lig.pdb''


Line 54: Line 91:
== Environmental variables ==
== Environmental variables ==


Designate where the short cache and long cache should be located. E.g.:
=== TMPDIR ===
 
Designate where temporary job files should be placed. E.g.:
 
export TMPDIR=/scratch
 
==== Note for UCSF researchers ====
 
On the Wynton cluster, ''/scratch'' only exists on development nodes (not log nodes). Therefore, we recommend running on development nodes (see: https://wynton.ucsf.edu/hpc/get-started/development-prototyping.html). E.g.:
 
ssh dev1
export TMPDIR=/scratch


export SHRTCACHE=/dev/shm
If a log node must be used, then ''/wynton/scratch'' may be used:
export LONGCACHE=/dev/shm


SHRTCACHE: temporary storage for job files
ssh log1
export TMPDIR=/wynton/scratch


LONGCACHE: long-term storage for files shared between jobs
=== job scheduler environmental variables ===


In order for ''dockopt'' to know which scheduler it should use, please configure the following environmental variables according to which one of the job schedulers you have.
In order for ''dockopt'' to know which scheduler it should use, please configure the following environmental variables according to which one of the job schedulers you have.


=== Slurm ===
==== Slurm ====


E.g., on the UCSF Shoichet Lab Gimel cluster (on any node other than 'gimel' itself, such as 'gimel5'):
E.g., on the UCSF Shoichet Lab Gimel cluster (on any node other than 'gimel' itself, such as 'gimel2'):


  export SBATCH_EXEC=/usr/bin/sbatch
  export SBATCH_EXEC=/usr/bin/sbatch
  export SQUEUE_EXEC=/usr/bin/squeue
  export SQUEUE_EXEC=/usr/bin/squeue


=== SGE ===
==== SGE ====


E.g., on the UCSF Wynton cluster:
On most clusters using SGE the following should be correct:


  export QSTAT_EXEC=/opt/sge/bin/lx-amd64/qstat
  export QSTAT_EXEC=/opt/sge/bin/lx-amd64/qstat
  export QSUB_EXEC=/opt/sge/bin/lx-amd64/qsub
  export QSUB_EXEC=/opt/sge/bin/lx-amd64/qsub
export SGE_SETTINGS=/opt/sge/default/common/settings.sh
===== Note for UCSF researchers =====


The following is necessary on the UCSF Wynton cluster:
The following is necessary on the UCSF Wynton cluster:


export QSTAT_EXEC=/opt/sge/bin/lx-amd64/qstat
export QSUB_EXEC=/opt/sge/bin/lx-amd64/qsub
  export SGE_SETTINGS=/opt/sge/wynton/common/settings.sh
  export SGE_SETTINGS=/opt/sge/wynton/common/settings.sh


On most clusters, this will probably be:
== Subcommand: ''run'' ==
export SGE_SETTINGS=/opt/sge/default/common/settings.sh
 
== ''run'' ==


Once your job has been configured to your liking, navigate to the job directory and run ''dockopt'':
Once your job has been configured to your liking, navigate to the job directory and run ''dockopt'':
  cd <JOB_DIR_NAME>
  cd <JOB_DIR_NAME>
  pydock3 dockopt - run <JOB_SCHEDULER_NAME>
  pydock3 dockopt - run <JOB_SCHEDULER_NAME> [--retrodock_job_timeout_minutes=None] [--retrodock_job_max_reattempts=0] [--extra_submission_cmd_params_str=None] [--export_decoys_mol2=False]


where <JOB_SCHEDULER_NAME> is one of:
where <JOB_SCHEDULER_NAME> is one of:
Line 97: Line 146:
* ''slurm''
* ''slurm''


This will execute the many dockopt subroutines in sequence, except for the retrodock jobs run on each docking configuration, which are run in parallel via the scheduler. The state of the program will be printed to standard output as it runs.
This will execute the many dockopt subroutines in sequence. Once this is done, the retrodock jobs for all created docking configurations are run in parallel via the scheduler. The state of the program will be printed to standard output as it runs.
 
You can also set the following flags to adjust retrodock job submission behavior. This example show the default values:
pydock3 dockopt - run <JOB_SCHEDULER_NAME> --retrodock_job_max_reattempts=0 --retrodock_job_timeout_minutes=None


Once the dockopt job is complete, the following files will be generated in the job directory:
Once the dockopt job is complete, the following files will be generated in the job directory:
* ''dockopt_job_report.pdf'': contains (1) roc.png of best retrodock job, (2) box plots of enrichment for every multi-valued config parameter, and (3) heatmaps of enrichment for every pair of multi-valued config parameters
* ''report.html'': contains (1) a histogram of the performance of all tested docking configurations compared against a distribution of the performance of a random classifier, so as to show whether the test docking configurations are significantly better than ones that can be produced by a random classifier. This is necessary due to the fact that many configurations are being tested. Hence, a Bonferroni correction is applied to the significance threshold, dividing p=0.01 by the number of tested configurations. (2) ROC, charge, and energy plots of the top docking configurations, comparing actives vs. decoys, (3) box plots of enrichment for every multi-valued config parameter, and (4) heatmaps of enrichment for every pair of multi-valued config parameters.
* ''dockopt_job_results.csv'': enrichment metrics for each docking configuration
* ''results.csv'': parameter values, criterion values, and other information about each docking configuration.


In addition, the best retrodock job will be copied to its own sub-directory ''best_retrodock_job/''.  
In addition, some number of the best retrodock jobs will be copied to their own sub-directory ''best_retrodock_jobs/''.  


Within each retrodock job directory, there are the following files and sub-directories:
Within each sub-directory of ''best_retrodock_jobs/'', there are:
* ''working/'': intermediate files
* ''dockfiles/'': parameters files and ''INDOCK'' for given docking configuration
* ''dockfiles/'': parameters files and INDOCK for given docking configuration
* ''output/'': contains:  
* ''output/'': contains:  
** joblist
** sub-directories ''actives/'' (containing ''OUTDOCK'' and ''test.mol2'' files) and ''decoys/'' (containing just ''OUTDOCK'')
** sub-directories ''1/'' for actives and ''2/'' for decoys (each containing OUTDOCK and test.mol2 files)
* plot images (e.g., ''roc.png'')
** log files for the retrodock jobs
 
* ''retrodock_job_results.csv'': data loaded from OUTDOCK files for both actives and decoys
'''Note:''' by default, a mol2 file is exported only for actives, not for decoys, in order to prevent disk space issues.
* ''roc.png'': the ROC enrichment curve (log-scaled x-axis) for given docking configuration

Latest revision as of 19:08, 18 July 2024

dockopt allows the generation of many different docking configurations which are then evaluated & analyzed in parallel using a specified job scheduler (e.g. Slurm). If you are a Shoichet Lab user, see the note for UCSF Shoichet Lab members below.

To use DOCK 3.8, you must first license it and install it (see: DOCK 3.8:How to install pydock3).

Note for UCSF Shoichet Lab members

pydock3 is already installed on the following clusters. You can source the provided Python environment scripts to expose the pydock3 executable and declare default environmental variables:

Wynton

 source /wynton/group/bks/soft/python_envs/env.sh

Gimel

Only nodes other than gimel itself are supported, e.g., gimel2.

ssh gimel2
source /nfs/soft/ian/env.sh

Subcommand: new

Prepare rec.pdb and xtal-lig.pdb as described in Bender, 2021 (https://pubmed.ncbi.nlm.nih.gov/34561691/), or download pre-prepared sample files from dudez2022.docking.org.

Be sure that you are in the directory containing the required input files:

  • rec.pdb or rec.crg.pdb
  • xtal-lig.pdb
  • actives.tgz
  • decoys.tgz
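
The four inputs listed above can be checked before creating the job. The following is a minimal sketch, not part of pydock3: check_dockopt_inputs is a hypothetical helper name, and it only tests that the files exist, not that their contents are valid.

```shell
# Sketch: verify that the required dockopt inputs exist in a directory.
# check_dockopt_inputs is a hypothetical helper, not part of pydock3.
check_dockopt_inputs() {
    dir="${1:-.}"
    missing=0
    # Either receptor file satisfies the receptor requirement.
    if [ ! -f "$dir/rec.pdb" ] && [ ! -f "$dir/rec.crg.pdb" ]; then
        echo "missing: rec.pdb or rec.crg.pdb" >&2
        missing=1
    fi
    for f in xtal-lig.pdb actives.tgz decoys.tgz; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    # Returns 0 when everything is present, 1 otherwise.
    return "$missing"
}
```

Calling check_dockopt_inputs with no argument checks the current directory; a non-zero return status means at least one input was not found.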

Note the inclusion of actives.tgz and decoys.tgz. Each of these is a tarball of DB2 files, which may be optionally gzipped (with extension .db2.gz). Each tarball represents a binary class for the binary classification task that the docking model is trained on. The positive class (actives) contains the molecules that you want the docking model to preferentially assign favorable docking scores to. The negative class (decoys) contains the molecules that you want the docking model to preferentially assign unfavorable scores to. The most common strategy is to use a set of known ligands as the actives and a larger set of property-matched decoys for the decoys (a decoy-to-active ratio of 50:1 is standard), but other strategies are supported. For example, to create a docking model that preferentially assigns favorable scores to agonists over antagonists, a set of agonists can be used as the actives and a set of antagonists can be used for the decoys.

You need to build the molecules yourself (see: https://tldr.docking.org/start/build3d38). Each tarball should contain only DB2 files. For example, if one has a directory actives/ containing only DB2 files to use as the actives, then actives.tgz should be created as follows:

cd actives/
tar -czf actives.tgz *.db2*

Similarly, for a directory decoys/ containing only DB2 files to use as decoys, decoys.tgz should be created as follows:

cd decoys/
tar -czf decoys.tgz *.db2*
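
Since each tarball should contain only DB2 files, it may be worth inspecting the archives after creating them. The sketch below uses only standard tar and grep; non_db2_entries and count_db2 are hypothetical helper names, and the check is based on file extensions alone.

```shell
# Sketch: inspect a tarball created as described above.
# non_db2_entries and count_db2 are hypothetical helpers, not part of pydock3.

# Print every archive entry that is neither a directory nor a .db2 / .db2.gz file.
# Prints nothing when the tarball contains only DB2 files.
non_db2_entries() {
    tar -tzf "$1" | grep -v '/$' | grep -v -E '\.db2(\.gz)?$' || true
}

# Print the number of .db2 / .db2.gz entries in a tarball.
count_db2() {
    tar -tzf "$1" | grep -c -E '\.db2(\.gz)?$' || true
}
```

Comparing the output of count_db2 for decoys.tgz and actives.tgz gives a quick view of the decoy-to-active ratio discussed above.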

To create the file structure for your dockopt job, simply type

pydock3 dockopt - new

By default, the job directory is named dockopt_job. To specify a different name, use the "--job_dir_path" flag. E.g.:

pydock3 dockopt - new --job_dir_path=dockopt_job_2

The job directory contains two sub-directories:

  1. working: input files, intermediate blaster files, sub-directories for individual blastermaster subroutines
  2. retrodock_jobs: individual retrodock jobs for each docking configuration

The key difference between the working directories of blastermaster and dockopt is that the working directory of dockopt may contain multiple variants of the blaster files (suffixed by a number, e.g. "box_1"). These variant files are used to create the different docking configurations specified by the multi-valued entries of dockopt_config.yaml. They are created efficiently, such that the same variant used in multiple docking configurations is not created more than once.

If your current working directory contains any of the following files, then they will be automatically copied into the working directory within the created job directory. This feature is intended to simplify the process of configuring the dockopt job.

  • rec.pdb
  • rec.crg.pdb
  • xtal-lig.pdb
  • reduce_wwPDB_het_dict.txt
  • filt.params
  • radii
  • amb.crg.oxt
  • vdw.siz
  • delphi.def
  • vdw.parms.amb.mindock
  • prot.table.ambcrg.ambH

Only the following are required. Default versions / generated versions of the others will be used instead if they are not detected.

  • rec.pdb or rec.crg.pdb. Either is required, but not both. If both are present, only rec.crg.pdb is used.
  • xtal-lig.pdb

If you would like to use files not present in your current working directory, copy them into your job's working directory, e.g.:

cp <FILE_PATH> <JOB_DIR_NAME>/working/

Finally, configure the dockopt_config.yaml file in the job directory to your specifications. The parameters in this file govern the behavior of dockopt.

Note: The dockopt_config.yaml file differs from the blastermaster_config.yaml file in that every parameter of the former may accept either a single value or a list of comma-separated values, which indicates a pool of values to attempt for that parameter. Multiple such multi-valued parameters may be provided, and all unique resultant docking configurations will be attempted.

Single-valued YAML line format:

distance_to_surface: 1.0

Multi-valued YAML line format:

distance_to_surface: [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]
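
To get a feel for how quickly multi-valued parameters multiply, the following sketch enumerates the cross product of two value pools. It is an illustration only: some_other_parameter is a hypothetical name, not a real dockopt_config.yaml key, and dockopt itself keeps only unique resultant configurations, so the real count may be lower.

```shell
# Sketch: two multi-valued parameters expand into the cross product of their values.
# "some_other_parameter" is a hypothetical name used only for illustration.
distance_to_surface="1.0 1.1 1.2"
some_other_parameter="a b"

n=0
for d in $distance_to_surface; do
    for p in $some_other_parameter; do
        n=$((n + 1))
        echo "config $n: distance_to_surface=$d some_other_parameter=$p"
    done
done
echo "total configurations: $n"
```

Three values for one parameter and two for another yield six combinations here; each additional multi-valued parameter multiplies the number of retrodock jobs submitted to the scheduler.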

Environmental variables

TMPDIR

Designate where temporary job files should be placed. E.g.:

export TMPDIR=/scratch

Note for UCSF researchers

On the Wynton cluster, /scratch only exists on development nodes (not log nodes). Therefore, we recommend running on development nodes (see: https://wynton.ucsf.edu/hpc/get-started/development-prototyping.html). E.g.:

ssh dev1
export TMPDIR=/scratch

If a log node must be used, then /wynton/scratch may be used:

ssh log1
export TMPDIR=/wynton/scratch
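
If the same setup script is shared between node types, the choice of TMPDIR can be made conditional. This is a minimal sketch under the assumption that the first existing, writable candidate is the right one; pick_tmpdir is a hypothetical helper name.

```shell
# Sketch: print the first candidate directory that exists and is writable.
# pick_tmpdir is a hypothetical helper; the candidate order is up to the caller.
pick_tmpdir() {
    for candidate in "$@"; do
        if [ -d "$candidate" ] && [ -w "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    # No usable candidate was found.
    return 1
}

# Example use (left commented out so that sourcing this file has no side effects):
# TMPDIR=$(pick_tmpdir /scratch /wynton/scratch) && export TMPDIR
```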

Job scheduler environmental variables

In order for dockopt to know which scheduler it should use, configure the environmental variables below that correspond to the job scheduler available on your cluster.

Slurm

E.g., on the UCSF Shoichet Lab Gimel cluster (on any node other than 'gimel' itself, such as 'gimel2'):

export SBATCH_EXEC=/usr/bin/sbatch
export SQUEUE_EXEC=/usr/bin/squeue

SGE

On most clusters using SGE the following should be correct:

export QSTAT_EXEC=/opt/sge/bin/lx-amd64/qstat
export QSUB_EXEC=/opt/sge/bin/lx-amd64/qsub
export SGE_SETTINGS=/opt/sge/default/common/settings.sh

Note for UCSF researchers

The following is necessary on the UCSF Wynton cluster:

export QSTAT_EXEC=/opt/sge/bin/lx-amd64/qstat
export QSUB_EXEC=/opt/sge/bin/lx-amd64/qsub
export SGE_SETTINGS=/opt/sge/wynton/common/settings.sh
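
A wrong path in one of the scheduler variables above may only surface once jobs are submitted, so a quick check beforehand can help. The sketch below is not part of pydock3: check_exec_vars is a hypothetical helper that tests whether each named, exported variable points at an executable file. SGE_SETTINGS is a settings script rather than an executable, so it is not covered by this check.

```shell
# Sketch: confirm that each named environment variable points at an executable file.
# check_exec_vars is a hypothetical helper, not part of pydock3.
check_exec_vars() {
    status=0
    for name in "$@"; do
        # printenv only sees exported variables, which is what child processes inherit.
        exe=$(printenv "$name" || true)
        if [ -z "$exe" ]; then
            echo "$name is not set" >&2
            status=1
        elif [ ! -x "$exe" ]; then
            echo "$name=$exe is not executable" >&2
            status=1
        fi
    done
    return "$status"
}
```

For Slurm this would be called with SBATCH_EXEC and SQUEUE_EXEC as arguments; for SGE, with QSTAT_EXEC and QSUB_EXEC.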

Subcommand: run

Once your job has been configured to your liking, navigate to the job directory and run dockopt:

cd <JOB_DIR_NAME>
pydock3 dockopt - run <JOB_SCHEDULER_NAME> [--retrodock_job_timeout_minutes=None] [--retrodock_job_max_reattempts=0] [--extra_submission_cmd_params_str=None] [--export_decoys_mol2=False]

where <JOB_SCHEDULER_NAME> is one of:

  • sge
  • slurm

This will execute the many dockopt subroutines in sequence. Once this is done, the retrodock jobs for all created docking configurations are run in parallel via the scheduler. The state of the program will be printed to standard output as it runs.

Once the dockopt job is complete, the following files will be generated in the job directory:

  • report.html: contains:
    • a histogram of the performance of all tested docking configurations, compared against the distribution of performance of a random classifier. This shows whether the tested configurations are significantly better than what a random classifier could produce. Because many configurations are tested, a Bonferroni correction is applied to the significance threshold: p=0.01 is divided by the number of tested configurations.
    • ROC, charge, and energy plots of the top docking configurations, comparing actives vs. decoys
    • box plots of enrichment for every multi-valued config parameter
    • heatmaps of enrichment for every pair of multi-valued config parameters
  • results.csv: parameter values, criterion values, and other information about each docking configuration.
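
The Bonferroni correction mentioned for report.html is simple arithmetic: the base threshold of 0.01 divided by the number of tested configurations. The sketch below works through one case; n_configs=20 is an arbitrary example value, and the calculation is an illustration rather than dockopt's own code.

```shell
# Sketch: Bonferroni-corrected threshold = base threshold / number of tested configurations.
# n_configs=20 is an arbitrary example value.
n_configs=20
threshold=$(awk -v alpha=0.01 -v n="$n_configs" 'BEGIN { printf "%.6g", alpha / n }')
echo "per-configuration significance threshold: $threshold"
```

With 20 configurations the per-configuration threshold is 0.0005, so the more configurations a job tests, the stronger the enrichment a configuration must show to count as significant.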

In addition, some number of the best retrodock jobs will be copied to their own sub-directory best_retrodock_jobs/.

Within each sub-directory of best_retrodock_jobs/, there are:

  • dockfiles/: parameter files and INDOCK for the given docking configuration
  • output/: contains:
    • sub-directories actives/ (containing OUTDOCK and test.mol2 files) and decoys/ (containing just OUTDOCK)
  • plot images (e.g., roc.png)

Note: by default, a mol2 file is exported only for actives, not for decoys, in order to prevent disk space issues.
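
After a run, the layout described above can be checked with a short loop. This is a sketch under assumptions: summarize_best_jobs is a hypothetical helper, and it assumes that INDOCK and OUTDOCK sit directly inside dockfiles/, output/actives/ and output/decoys/ respectively, which may not match every pydock3 version.

```shell
# Sketch: report, for each copied job, whether the expected files are present.
# summarize_best_jobs is a hypothetical helper; the exact file locations are assumed.
summarize_best_jobs() {
    root="${1:-best_retrodock_jobs}"
    for job in "$root"/*/; do
        # Skip the unexpanded pattern when the directory has no sub-directories.
        [ -d "$job" ] || continue
        job="${job%/}"
        for f in dockfiles/INDOCK output/actives/OUTDOCK output/decoys/OUTDOCK; do
            if [ -e "$job/$f" ]; then state=found; else state=missing; fi
            echo "$(basename "$job") $f $state"
        done
    done
}
```

Run from the job directory with no argument, it prints one line per job and file, which makes incomplete retrodock jobs easy to spot.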