How to dock in DOCK3.8

Latest revision as of 22:28, 1 December 2022

IMPORTANT - UPDATED DOCUMENTATION

https://wiki.docking.org/index.php/SUBDOCK_DOCK3.8

OLD DOCUMENTATION

How to dock in DOCK 3.8.0

Related page: Docking Analysis with DOCK 3.8

http://wiki.docking.org/index.php/Docking_Analysis_in_DOCK3.8

How to install DOCK 3.8

Checkpointing & Restartability in DOCK3.8

DOCK 3.8 can be interrupted safely and restarted, which allows more flexibility when submitting docking jobs.

For example, you could set QSUB_ARGS="-l s_rt=00:05:00 -l h_rt=00:07:00" (or SBATCH_ARGS="--time=00:07:00") so that each docking job runs for only about 5 minutes before being interrupted. The new subdock.bash script allows the same set of jobs to be submitted multiple times until they are all complete. A more pragmatic choice might be "-l s_rt=00:28:00 -l h_rt=00:30:00", which gets the benefit of faster scheduling in the short.q on Wynton. Another advantage is that a job can be interrupted at any time on AWS, and it will checkpoint and be restartable.

Submitting Jobs/Running the Script

New subdock scripts are here:

$DOCKBASE/docking/submit/sge/subdock.bash
$DOCKBASE/docking/submit/slurm/subdock.bash

subdock.bash takes its arguments as environment variables, which must be set before the script is run.

Required Arguments

INPUT_SOURCE

INPUT_SOURCE should be either:

a) A directory containing one or more db2.tgz files OR

b) A text file containing a list of paths to db2.tgz files

A db2.tgz file should be a tarred + gzipped archive (tar -czf archive.tgz) that contains one or more db2 or db2.gz files.

A job will be launched for each db2.tgz file in INPUT_SOURCE.
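
As a rough illustration of the two accepted forms of INPUT_SOURCE, the sketch below lists the db2.tgz files that each form would yield, so the number of jobs can be counted before submitting. The helper name list_job_inputs is hypothetical, and the assumption that a directory is searched only at its top level is mine; subdock.bash may resolve its input differently.

```shell
#!/usr/bin/env bash
# list_job_inputs SOURCE: print one db2.tgz path per line.
# Hypothetical helper for illustration only; not part of subdock.bash.
list_job_inputs() {
    local src=$1
    if [ -d "$src" ]; then
        # case (a): a directory containing db2.tgz files (top level only: an assumption)
        find "$src" -maxdepth 1 -type f -name '*.db2.tgz' | sort
    elif [ -f "$src" ]; then
        # case (b): a text file listing db2.tgz paths, one per line; blank lines are skipped
        grep -v '^[[:space:]]*$' "$src"
    else
        echo "INPUT_SOURCE is neither a directory nor a file: $src" >&2
        return 1
    fi
}

# Demo in a throwaway directory with placeholder files
demo_dir=$(mktemp -d)
touch "$demo_dir/a.db2.tgz" "$demo_dir/b.db2.tgz" "$demo_dir/notes.txt"
list_job_inputs "$demo_dir" > "$demo_dir/list.in"
echo "directory source: $(list_job_inputs "$demo_dir" | wc -l) job(s)"
echo "list-file source: $(list_job_inputs "$demo_dir/list.in" | wc -l) job(s)"
```

Both forms report the same two archives here, since one job corresponds to one db2.tgz file.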

EXPORT_DEST

A directory on the NFS where you would like your docking output to end up. If the directory does not exist, the script will try to create it.

DOCKEXEC

An NFS path to a DOCK binary executable (NOT a wrapper script).

DOCKFILES

An NFS path to the dockfiles (INDOCK, spheres, receptor files, grids, etc.) being used for this docking run. Note that INDOCK is expected to be part of these files.

Optional Arguments

SHRTCACHE

The directory DOCK will perform its work in. Files saved to this directory will be deleted once the docking job has concluded. By default this is /scratch. If /scratch is not available, change this to something else.

LONGCACHE

The directory where DOCK will store files that are shared between multiple docking jobs. Files saved to this directory (dockexec and dockfiles) will persist on compute nodes until they are deleted by hand or by an automated culling process. By default this directory is /scratch.

SBATCH_ARGS

Additional arguments to provide to slurm's sbatch, if using the slurm version of subdock.bash.

QSUB_ARGS

Additional arguments to provide to sge's qsub, if using the sge version of subdock.bash.

SHRTCACHE_USE_ENV

At script runtime, SHRTCACHE is set to the value of another environment variable, whose name is the value of SHRTCACHE_USE_ENV. This is useful, for example, if the scheduler sets up a directory for your job to perform work in, e.g., Wynton's TMPDIR.
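
The lookup described for SHRTCACHE_USE_ENV can be sketched with bash indirect expansion. This is only an illustration of the idea, not the code the subdock or rundock scripts actually use; the function name resolve_shrtcache and the example directory /scratch/job.12345 are made up, and the /scratch fallback is the default stated above.

```shell
#!/usr/bin/env bash
# resolve_shrtcache: print the directory that would be used as SHRTCACHE.
# Illustrative sketch only; the real scripts may implement this differently.
resolve_shrtcache() {
    if [ -n "${SHRTCACHE_USE_ENV:-}" ]; then
        # ${!name} expands to the value of the variable whose name is stored in $name
        printf '%s\n' "${!SHRTCACHE_USE_ENV}"
    else
        printf '%s\n' "${SHRTCACHE:-/scratch}"
    fi
}

# /scratch/job.12345 stands in for a directory the scheduler would provide
demo_cache=$(
    TMPDIR=/scratch/job.12345
    SHRTCACHE_USE_ENV=TMPDIR
    resolve_shrtcache
)
echo "$demo_cache"
```

Because the variable is looked up by name when the job runs, each job picks up its own scheduler-provided directory rather than a path fixed at submission time.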

Examples

BKS Example

export INPUT_SOURCE=/path/to/example.in
export EXPORT_DEST=/path/to/output
export DOCKEXEC=$DOCKBASE/docking/DOCK/bin/dock64
export DOCKFILES=/path/to/dockfiles.example
export SHRTCACHE=/dev/shm
export LONGCACHE=/tmp
export SBATCH_ARGS="--time=02:00:00"

bash $DOCKBASE/docking/submit/slurm/subdock.bash

Wynton Example

export INPUT_SOURCE=/path/to/example.in
export EXPORT_DEST=/path/to/output
export DOCKEXEC=$DOCKBASE/docking/DOCK/bin/dock64
export DOCKFILES=/path/to/dockfiles.example
export SHRTCACHE_USE_ENV=TMPDIR
export LONGCACHE=/scratch
export QSUB_ARGS="-l s_rt=00:28:00 -l h_rt=00:30:00"

bash $DOCKBASE/docking/submit/sge/subdock.bash
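
Since everything is passed through the environment, a forgotten export is easy to miss. The following is a minimal pre-flight sketch that checks the four required variables before submitting; check_required_vars is a hypothetical helper, not something shipped with DOCK.

```shell
#!/usr/bin/env bash
# check_required_vars: succeed only if all four required variables are non-empty.
# Hypothetical helper for illustration; subdock.bash performs its own checks.
check_required_vars() {
    local name missing=0
    for name in INPUT_SOURCE EXPORT_DEST DOCKEXEC DOCKFILES; do
        # ${!name:-} reads the variable named by $name, treating unset as empty
        if [ -z "${!name:-}" ]; then
            echo "missing required variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Demo: with nothing set, the check fails
(
    unset INPUT_SOURCE EXPORT_DEST DOCKEXEC DOCKFILES
    check_required_vars 2>/dev/null || echo "not ready to submit"
)
```

In practice it could be run as check_required_vars && bash $DOCKBASE/docking/submit/sge/subdock.bash.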

Note on using SHRTCACHE and LONGCACHE

Avoid using global networked directories for SHRTCACHE: that is, directories prefixed with /nfs/ on BKS and directories starting with /wynton/ on Wynton. SHRTCACHE is used for writing out logs and job output in real time, and a high-latency network disk is inappropriate for this task. If many jobs run in parallel, there is a good chance that the network disk will be overloaded by all the jobs sending write requests simultaneously. I feel the difference between RAM, network directories and local directories is misunderstood, so I'll explain it with an analogy:

Imagine you just finished washing your clothes and you need to put them away.

Writing data to RAM (or /dev/shm) is like dropping them in a basket next to the washing machine.

Writing data to your local disk is like walking to your closet to put them away.

Writing data to a network disk is like walking to your neighbor's house and putting them away in his closet.

Now, networked disks don't perform much worse than local disks (your neighbor is just next door, after all), but what if everyone in the neighborhood put their stuff away in your neighbor's closet? For one, there would be a line of people out his front door, so it would take you far longer to put away your clothes. Your neighbor would also probably not be happy with this arrangement, just as the Wynton admins will not be happy if you set SHRTCACHE/LONGCACHE to a /wynton directory.
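
The rule of thumb above can be turned into a small check. The sketch below only tests the two path prefixes named in this note (/nfs/ and /wynton/); the function name is_networked_path and the example paths are illustrative, and other clusters will have different prefixes.

```shell
#!/usr/bin/env bash
# is_networked_path DIR: true if DIR starts with one of the prefixes named in this note.
# Prefix matching only; it does not inspect how the filesystem is actually mounted.
is_networked_path() {
    case "$1" in
        /nfs/*|/wynton/*) return 0 ;;
        *) return 1 ;;
    esac
}

for dir in /dev/shm /scratch /wynton/home/someuser/cache /nfs/home/someuser/cache; do
    if is_networked_path "$dir"; then
        echo "$dir: networked, avoid for SHRTCACHE/LONGCACHE"
    else
        echo "$dir: not under /nfs/ or /wynton/"
    fi
done
```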

Developer Example: Building your own db2.tgz files & Submitting jobs

What you need:

  1. One or more db2(.gz) files
  2. One or more NFS directories to store:
    1. docking input/output
    2. dockfiles
    3. dock executable
  3. subdock & rundock scripts

Create a list of all the db2 files you want to run docking against. The example below is merely a suggestion; make the list any way you please, so long as each entry is a *full* path (relative to your working directory) to a db2 (or db2.gz) file.

find $MY_DB2_SOURCE -type f -name "*.db2*" > my_db2_list

Split this list into reasonably sized chunks. Our standard is 5000, but you can make them as large or small as you like. Do be careful about making the chunks larger: 5000 db2s is already quite heavy.

>> split -a 3 --lines=5000 my_db2_list db2_chunk.

>> ls
db2_chunk.aaa
db2_chunk.aab
db2_chunk.aac
...
my_db2_list

Create a db2.tgz archive from each of these lists.

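# "???" matches only the three-letter split suffixes, so archives from an earlier run are not re-archived
for db2_chunk in db2_chunk.???; do
    tar -czf "$db2_chunk.db2.tgz" --files-from "$db2_chunk"
done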
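
To see the list, split and archive steps working together, the sketch below runs them on five placeholder files in a temporary directory and then counts the db2 entries in each resulting archive. The file names, the chunk size of 2 and the helper count_archive_members are all made up for the demo; GNU split and tar are assumed, as in the commands above.

```shell
#!/usr/bin/env bash
# count_archive_members ARCHIVE: number of db2 or db2.gz entries inside a db2.tgz
count_archive_members() {
    tar -tzf "$1" | grep -c -E '\.db2(\.gz)?$'
}

demo=$(mktemp -d)
(
    cd "$demo" || exit 1
    mkdir src
    # five placeholder files standing in for real db2 files
    for n in 1 2 3 4 5; do echo "placeholder $n" > "src/mol$n.db2"; done
    find src -type f -name '*.db2*' | sort > my_db2_list
    split -a 3 --lines=2 my_db2_list db2_chunk.    # tiny chunks, for the demo only
    for chunk in db2_chunk.???; do
        tar -czf "$chunk.db2.tgz" --files-from "$chunk"
    done
)

for archive in "$demo"/db2_chunk.*.db2.tgz; do
    echo "$(basename "$archive"): $(count_archive_members "$archive") db2 file(s)"
done
```

A check like this is cheap to run on a real set of archives before queueing thousands of jobs against them.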


If you already have premade db2.tgz files, for example from the zinc22 3D archive, start the tutorial here.

Create a list of every db2.tgz archive. Each job will evaluate one db2.tgz archive. Again, this example is a suggestion applicable only if you've been following the tutorial up to this point.

find . -type f -name "db2_chunk.*.db2.tgz" > job_input_list

Now all you need to do is specify your docking parameters and launch the jobs. Set INPUT_SOURCE to be the job_input_list created in the previous step.

SGE

export DOCKEXEC=<dock executable path>
export DOCKFILES=<dockfiles path>
export EXPORT_DEST=<output directory path>
# optional arguments for the job controller. Note that these arguments are examples and not the only configuration recommended
export QSUB_ARGS="-l s_rt=00:28:00 -l h_rt=00:30:00 -l mem_free=2G"

export INPUT_SOURCE=job_input_list

bash <scripts directory>/sge/subdock.bash

SLURM

export DOCKEXEC=<dock executable path>
export DOCKFILES=<dockfiles path>
export EXPORT_DEST=<output directory path>
export SBATCH_ARGS="--time=00:30:00 --mem-per-cpu=2G"

export INPUT_SOURCE=job_input_list

bash <scripts directory>/slurm/subdock.bash

Large Docking Jobs

If your list of db2.tgz files is very large you may want to further split it. Each db2.tgz file in the job_input_list represents a job submitted to the queue, and often there is a limit on how many jobs can be queued at once.

In order to avoid this problem, we will need an automatic solution to split up our job_input_list and submit batches of jobs only when there is space left in the queue.

For example, imagine we want to submit in batches of 10,000 and limit total jobs to 50,000 (with a package size of 5000 this is 250M molecules in the queue maximum, submitting 50M at a time). The example below shows how to do this with Slurm.

#!/bin/bash
### submit_all_slurm.bash

BINDIR=$(dirname "$0")

BATCH_SIZE=10000
MAX_QUEUED=50000

# the script is more portable if we provide the various parameters as arguments instead of hard-coding
INPUT_LIST=$1
BASE_EXPORT_DEST=$2 # this is the directory where further subdirectories will be created that contain docking job results
export DOCKEXEC=$3
export DOCKFILES=$4

# we can use our EXPORT_DEST as staging grounds for our input
mkdir -p "$BASE_EXPORT_DEST/input"

# -d gives numeric suffixes (job_input.000, job_input.001, ...); -n would expect a chunk count, not a file name
split --lines=$BATCH_SIZE -a 3 -d "$INPUT_LIST" "$BASE_EXPORT_DEST/input/job_input."

export SBATCH_ARGS="--time=00:30:00 --mem-per-cpu=2G -J dock"

for job_input in "$BASE_EXPORT_DEST"/input/job_input.*; do

    export INPUT_SOURCE=$job_input
    input_num=${job_input##*.} # get the suffix of the split filename (text after the last '.')

    export EXPORT_DEST=$BASE_EXPORT_DEST/$input_num

    # loop until there is room in the queue for another batch
    while true; do

        # counts how many jobs in total are pending or running for this user
        njobs=$(squeue -u "$(whoami)" -h -t pending,running -r | wc -l)
        # if you want to instead set a limit on how many *dock* jobs are pending or running you would just run the command through a filter
        # njobs=$(squeue -u "$(whoami)" -h -t pending,running -r | grep "dock" | wc -l)

        if [ "$njobs" -lt $((MAX_QUEUED-BATCH_SIZE)) ]; then
            break
        fi

        sleep 10
    done

    # the slurm subdock and rundock scripts need to live next to this script in a directory named "slurm"
    bash "$BINDIR/slurm/subdock.bash"
done

This script will run until all jobs are submitted, so for very large jobs you may want to keep the process alive in a screen session.
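
The queue arithmetic quoted for this example, and the threshold the submission script waits on, can be checked in isolation. The sketch below uses the same numbers as the text; PACKAGE_SIZE and the helper has_queue_space are names introduced here for illustration.

```shell
#!/usr/bin/env bash
BATCH_SIZE=10000     # jobs submitted per batch
MAX_QUEUED=50000     # cap on jobs pending or running
PACKAGE_SIZE=5000    # molecules per db2.tgz package, as in the text

echo "molecules per batch:      $((BATCH_SIZE * PACKAGE_SIZE))"
echo "molecules queued at most: $((MAX_QUEUED * PACKAGE_SIZE))"

# has_queue_space NJOBS: same comparison as the submission script,
# true when fewer than MAX_QUEUED - BATCH_SIZE jobs are currently queued
has_queue_space() {
    [ "$1" -lt $((MAX_QUEUED - BATCH_SIZE)) ]
}

for njobs in 0 39999 40000 50000; do
    if has_queue_space "$njobs"; then
        echo "$njobs queued: submit next batch"
    else
        echo "$njobs queued: wait"
    fi
done
```

Note that the strict comparison makes the script wait at exactly 40,000 queued jobs, so the queue never quite reaches the 50,000 cap.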

Tip: Using Wynton's $TMPDIR

Doing this with SHRTCACHE_USE_ENV

Before you run subdock, simply export the SHRTCACHE_USE_ENV option.

export SHRTCACHE_USE_ENV=TMPDIR

This will cause the script to use the $TMPDIR variable for SHRTCACHE.

Example: Running a lot of docking jobs

  • 1. set up sdi files
mkdir sdi
export sdi=sdi
ls /wynton/group/bks/zinc-22/H19/H19P0??/*.db2.tgz > $sdi/h19p0.in
ls /wynton/group/bks/zinc-22/H19/H19P1??/*.db2.tgz > $sdi/h19p1.in
ls /wynton/group/bks/zinc-22/H19/H19P2??/*.db2.tgz > $sdi/h19p2.in
ls /wynton/group/bks/zinc-22/H19/H19P3??/*.db2.tgz > $sdi/h19p3.in
and so on
  • 2. set up INDOCK and dockfiles. Rename dockfiles to dockfiles.$indockhash. On some nodes, the shasum command is called sha1sum instead. Ultimately, giving the dockfiles directory a unique name is the key point.

Note: As of 3/19/2021, this is no longer necessary

bash
indockhash=$(cat INDOCK | shasum | awk '{print substr($1, 1, 12)}')
  • 3. super script:
export DOCKBASE=/wynton/group/bks/work/jji/DOCK
export DOCKFILES=$WORKDIR/dockfiles.21751f1bb16b
export DOCKEXEC=$DOCKBASE/docking/DOCK/bin/dock64
#export SHRTCACHE=/dev/shm # default
export SHRTCACHE=/scratch
export LONGCACHE=/scratch
export QSUB_ARGS="-l s_rt=00:28:00 -l h_rt=00:30:00 -l mem_free=2G"

for i in  sdi/*.in  ; do
    export k=$(basename $i .in)
    echo k $k
    export INPUT_SOURCE=$PWD/$i
    export EXPORT_DEST=$PWD/output/$k
    $DOCKBASE/docking/submit/sge/subdock.bash
done

  • 3a. to run for the first time
sh super
  • 4. how to restart (iterate until complete)
sh super
  • 5. check which output is valid (and which is broken or incomplete)
  • 6. extract all blazing fast
  • 7. extract mol2

more soon, under active development, Jan 28.
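
Step 2 mentions that the hashing tool is named shasum on some nodes and sha1sum on others. A small wrapper can hide that difference; the sketch below is one way to do it, with sha1_of_file and indock_hash as made-up helper names and an empty placeholder file standing in for a real INDOCK.

```shell
#!/usr/bin/env bash
# sha1_of_file FILE: print the SHA-1 hex digest using whichever tool is installed
sha1_of_file() {
    if command -v sha1sum >/dev/null 2>&1; then
        sha1sum < "$1" | awk '{print $1}'
    elif command -v shasum >/dev/null 2>&1; then
        shasum < "$1" | awk '{print $1}'    # shasum defaults to SHA-1
    else
        echo "neither sha1sum nor shasum found" >&2
        return 1
    fi
}

# indock_hash FILE: first 12 hex characters of the digest, as in step 2
indock_hash() {
    local digest
    digest=$(sha1_of_file "$1") || return 1
    printf '%s\n' "${digest:0:12}"
}

demo=$(mktemp -d)
: > "$demo/INDOCK"    # empty placeholder, not a real INDOCK
echo "dockfiles.$(indock_hash "$demo/INDOCK")"
```

With a real INDOCK the printed name changes whenever the file contents change, which is what makes the directory name unique.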

Appendix: Docking mono-cations of ZINC22 with DOCK3.8 on Wynton

Added by Ying 3/10/2021

To use: copy and paste each code section into a terminal. Remember to change the paths wherever they are labelled CHANGE this.

  • set up the folder to run docking.

Path to my example: /wynton/home/shoichetlab/yingyang/work/5HT-5a/10_AL-dock/zinc22_3d_build_3-10-2021

 mkdir zinc22_3d_build_3-10-2021
 cd zinc22_3d_build_3-10-2021
  • copy INDOCK into dockfiles folder, and transfer to the created folder
 cp INDOCK dockfiles
 scp -r INDOCK dockfiles dt2.wynton.ucsf.edu:/path_to_created_folder
  • get sdi of monocations of already built ZINC22 (<= H26 heavy atom count)

Modify to your own need...

mkdir sdi

foreach i (`seq 4 1 26`)
  set hac = `printf "H%02d" $i `
  echo $i $hac
  
  touch sdi/${hac}.sdi
  # CHANGE this: to your need
  foreach tgz (`ls /wynton/group/bks/zinc-22*/${hac}/${hac}[PM]???/*-O*.db2.tgz`)
    ls $tgz
    echo $tgz >> sdi/${hac}.sdi
  end
end
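
The loop above is written for csh. For readers working in bash, a roughly equivalent sketch follows. It takes the ZINC root directory as a parameter instead of hard-coding the Wynton path, and the demo runs against a throwaway directory tree with invented file names; hac_labels and build_sdi are hypothetical helper names.

```shell
#!/usr/bin/env bash
# hac_labels FIRST LAST: print zero-padded heavy-atom-count labels (H04, H05, ...)
hac_labels() {
    local i
    for i in $(seq "$1" "$2"); do
        printf 'H%02d\n' "$i"
    done
}

# build_sdi ZINC_ROOT OUT_DIR FIRST LAST: write one OUT_DIR/Hnn.sdi list per label
build_sdi() {
    local zinc_root=$1 out_dir=$2 first=$3 last=$4 hac tgz
    mkdir -p "$out_dir"
    for hac in $(hac_labels "$first" "$last"); do
        : > "$out_dir/$hac.sdi"
        # same file pattern as the csh loop; CHANGE this to your need
        for tgz in "$zinc_root"/"$hac"/"$hac"[PM]???/*-O*.db2.tgz; do
            if [ -e "$tgz" ]; then
                echo "$tgz" >> "$out_dir/$hac.sdi"
            fi
        done
    done
    return 0
}

# Demo on a fake tree; the file names are placeholders, not real ZINC22 packages
demo=$(mktemp -d)
mkdir -p "$demo/zinc/H04/H04P010" "$demo/zinc/H05/H05M020" "$demo/zinc/H05/H05P000"
touch "$demo/zinc/H04/H04P010/x-O-1.db2.tgz" \
      "$demo/zinc/H05/H05M020/y-O-2.db2.tgz" \
      "$demo/zinc/H05/H05P000/z-N-1.db2.tgz"
build_sdi "$demo/zinc" "$demo/sdi" 4 5
wc -l "$demo"/sdi/*.sdi
```

On Wynton the first argument would be the real ZINC22 location, and the range 4 26 reproduces the csh loop.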

  • rename the dockfiles directory

Note: As of 3/19/2021 this step is no longer necessary

 indockhash=$(cat INDOCK | sha1sum | awk '{print substr($1, 1, 12)}')
 mv dockfiles dockfiles.${indockhash}
  • write and run the super_run.sh
cat <<EOF > super_run.sh
export DOCKBASE=/wynton/group/bks/soft/DOCK-3.8.0.1
export DOCKEXEC=\$DOCKBASE/docking/DOCK/bin/dock64

# CHANGE here: path to the previously renamed dockfiles.\${indockhash}
### Note: as of 3/19/2021 renaming your dockfiles is no longer necessary
export DOCKFILES=/wynton/group/bks/work/yingyang/5HT-5a/10_AL-dock/zinc22_3d_build_3-10-2021/dockfiles.${indockhash}
export SHRTCACHE=/scratch
export LONGCACHE=/scratch
export QSUB_ARGS="-l s_rt=00:28:00 -l h_rt=00:30:00 -l mem_free=2G"

for i in  sdi/*.sdi  ; do
    export k=\$(basename \$i .sdi)
    echo k \$k
    export INPUT_SOURCE=$PWD/\$i
    export EXPORT_DEST=$PWD/output/\$k
    \$DOCKBASE/docking/submit/sge/subdock.bash
done
EOF

bash super_run.sh

  • keep submitting the super_run script until all db2s have been docked.

After all docking jobs finish, check the output. If there are no unexpected errors, a while loop can be used to resubmit automatically.

while true
do
  export jobN=$(qstat | grep -c 'rundock')
  if [[ $jobN -gt 0 ]] 
  then
    sleep 60
  else 
    bash super_run.sh
  fi
done

When no new jobs are being submitted, use Ctrl+C to exit the while loop.
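
A variant of this loop that stops by itself after a fixed number of resubmission rounds is sketched below. The cap on rounds is an addition of this sketch, not part of the original workflow. The commands used to count and submit jobs are passed in by name, so the demo can run with stand-in functions; on Wynton they would wrap qstat | grep -c 'rundock' and bash super_run.sh.

```shell
#!/usr/bin/env bash
# resubmit_until_idle COUNT_CMD SUBMIT_CMD MAX_ROUNDS [POLL_SECONDS]
# Whenever COUNT_CMD reports 0 jobs, run SUBMIT_CMD; stop after MAX_ROUNDS submissions.
resubmit_until_idle() {
    local count_cmd=$1 submit_cmd=$2 max_rounds=$3 poll=${4:-60}
    local round=0
    while [ "$round" -lt "$max_rounds" ]; do
        if [ "$("$count_cmd")" -gt 0 ]; then
            sleep "$poll"            # jobs still queued or running: wait
        else
            "$submit_cmd"            # queue is empty: submit again
            round=$((round + 1))
        fi
    done
}

# Stand-ins for the demo: a fake queue that drains by one job per poll
demo=$(mktemp -d)
echo 2 > "$demo/pending"
count_stub() {
    local n
    n=$(cat "$demo/pending")
    if [ "$n" -gt 0 ]; then echo $((n - 1)) > "$demo/pending"; fi
    echo "$n"
}
submit_stub() { echo "submitted" >> "$demo/log"; }

resubmit_until_idle count_stub submit_stub 2 0
echo "rounds submitted: $(wc -l < "$demo/log")"
```

The number of rounds to allow depends on how many restarts the jobs typically need, so it is left as a parameter.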

  • extract scores from output.
cat << EOF > qsub_extract.csh
#\$ -S /bin/csh
#\$ -cwd
#\$ -pe smp 1
#\$ -l mem_free=100G
#\$ -l scratch=100G
#\$ -l h_rt=50:00:00
#\$ -j yes
#\$ -o extract_all.out

hostname
date

setenv DOCKBASE /wynton/group/bks/soft/DOCK-3.8.0.1

setenv dir_in $PWD

if ! (-d \$TMPDIR ) then
    if (-d /scratch ) then
        setenv TMPDIR /scratch/\$USER
    else
        setenv TMPDIR /tmp/\$USER
    endif
    mkdir -p \$TMPDIR
endif
pushd \$TMPDIR

ls -d \${dir_in}/output/*/*/ > dirlist

python \$DOCKBASE/analysis/extract_all_blazing_fast.py \
dirlist extract_all.txt -30

mv extract_all.* \$dir_in

popd

echo '---job info---'
qstat -j \$JOB_ID
echo '---complete---'
EOF

qsub qsub_extract.csh

Another way is to run the commands directly on the login node (not recommended, since sorting uses a large amount of memory).

ls -d output/*/*/ > dirlist
python $DOCKBASE/analysis/extract_all_blazing_fast.py dirlist extract_all.txt -20
  • get poses in parallel
set score_file = $PWD/extract_all.sort.uniq.txt
set score_name = ${score_file:t:r}
set fileprefix = 'tmp_'
set number_per_file = 5000

set workdir  = $PWD/${score_name}_poses
mkdir -p $workdir 
cd $workdir

split --lines=$number_per_file --suffix-length=4 \
-d $score_file ${fileprefix}

set num  = ` ls ${fileprefix}* | wc -l `
echo "Number of score files to process:" $num

cat << EOF > qsub_poses.csh
#\$ -S /bin/csh
#\$ -cwd
#\$ -j yes
#\$ -pe smp 1
#\$ -l mem_free=5G
#\$ -l scratch=20G
#\$ -l h_rt=25:00:00
#\$ -t 1-$num

hostname
date

setenv DOCKBASE /wynton/group/bks/soft/DOCK-3.8.0.1

set list = \` ls \$PWD/${fileprefix}* \` 
set MOL = "\${list[\$SGE_TASK_ID]}"
set name = \${MOL:t:r}

python2 \$DOCKBASE/analysis/getposes_blazing_faster.py \
"" \${MOL} $number_per_file poses_\${name}.mol2 test.mol2.gz

EOF

qsub qsub_poses.csh
cd ../
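
The pose extraction above relies on two things: split producing numbered chunk files, and each array task picking its chunk by a 1-based task ID. The bash sketch below demonstrates both on a 12-line placeholder file with a chunk size of 5. The file names and the helper pick_chunk are illustrative; the real workflow does the selection in csh with ${list[$SGE_TASK_ID]}.

```shell
#!/usr/bin/env bash
demo=$(mktemp -d)
seq 1 12 > "$demo/scores.txt"    # 12 placeholder lines standing in for score records

# same split options as above, with a smaller chunk size for the demo
split --lines=5 --suffix-length=4 -d "$demo/scores.txt" "$demo/tmp_"

# pick_chunk DIR PREFIX TASK_ID: the chunk file a 1-based array task would process
pick_chunk() {
    local files=("$1"/"$2"*)     # glob results are sorted, so tmp_0000 comes first
    printf '%s\n' "${files[$(( $3 - 1 ))]}"
}

num=$(ls "$demo"/tmp_* | wc -l)
echo "Number of score files to process: $num"
for task in 1 2 3; do
    chunk=$(pick_chunk "$demo" tmp_ "$task")
    echo "task $task -> $(basename "$chunk") ($(wc -l < "$chunk") lines)"
done
```

The last chunk is shorter than the others whenever the score file length is not a multiple of the chunk size.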
 
  • Post-processing...