How to dock in DOCK3.8: Difference between revisions

From DISI
Jump to navigation Jump to search
No edit summary
Line 46: Line 46:
==== DOCKFILES ====
==== DOCKFILES ====


An NFS path to the dockfiles (INDOCK, spheres, receptor files, grids, etc.) being used for this docking run.  
An NFS path to the dockfiles (INDOCK, spheres, receptor files, grids, etc.) being used for this docking run. Note that INDOCK is expected to be part of these files. The job script will also recognize tarred dockfiles, e.g with a .tgz or tar.gz extension and handle them appropriately.


<s>The dockfiles directory should be named uniquely, to avoid confusion with other dockfiles other users may be running.</s>
<s>The dockfiles directory should be named uniquely, to avoid confusion with other dockfiles other users may be running.</s>

Revision as of 22:48, 19 March 2021

How to dock in DOCK 3.8.0

Differences from DOCK.3.7

DOCK 3.8.0 can be interrupted safely and restarted, which allows more flexibility when submitting docking jobs.

For example, you could set QSUB_ARGS="-l s_rt=00:05:00 -l h_rt=00:07:00" (or SBATCH_ARGS="--time=00:07:00") so that each docking job will only run for 5 minutes before being interrupted. The new subdock.bash script allows submitting the same set of jobs multiple times until they are all complete. A more pragmatic choice might be "-l s_rt=00:28:00 -l h_rt=00:30:00" to get the benefit of faster scheduling on wynton in the short.q. Another advantage is that the job can be interrupted at any time on AWS and it will checkpoint and be restartable.

Running the Script

New subdock scripts are here:

$DOCKBASE/docking/submit/sge/subdock.bash $DOCKBASE/docking/submit/slurm/subdock.bash

subdock.bash requires a number of environmental variables to be passed as arguments.

Required Arguments

INPUT_SOURCE

INPUT_SOURCE should be either:

a) A directory containing one or more db2.tgz files OR

b) A text file containing a list of paths to db2.tgz files

A db2.tgz file should be a tarred + gzipped archive (tar -czf archive.tgz) that contains one or more db2 or db2.gz files.

A job will be launched for each db2.tgz file in INPUT_SOURCE.

EXPORT_DEST

A directory on the NFS where you would like your docking output to end up. If the directory does not exist, the script will try to create it.

DOCKEXEC

An NFS path to a DOCK binary executable (NOT a wrapper script).

IMPORTANT: You should append the executable's compile time stamp to the end of it's name, e.g dock64.20210302. This will avoid any confusion of this executable with other versions of DOCK floating around.

Update 3/19/2021: No longer necessary to do this! Name your dock executable whatever you want, different versions are now distinguished from the computed shasum of the executable. No more effort on your part.

DOCKFILES

An NFS path to the dockfiles (INDOCK, spheres, receptor files, grids, etc.) being used for this docking run. Note that INDOCK is expected to be part of these files. The job script will also recognize tarred dockfiles, e.g with a .tgz or tar.gz extension and handle them appropriately.

The dockfiles directory should be named uniquely, to avoid confusion with other dockfiles other users may be running.

Update 3/19/2021: No longer necessary! dockfiles are now identified via shasum, like the dock executable.

Optional Arguments

SHRTCACHE

The directory DOCK will perform it's work in. Files saved to this directory will be deleted once the docking job has concluded. By default this is /scratch. If /scratch is not available, change this to something else.

LONGCACHE

The directory DOCK will store files that are shared between multiple docking jobs. Files saved to this directory (dockexec and dockfiles) will persist until they are deleted. By default this directory is /scratch.

Beware of using the default SHRTCACHE or LONGCACHE settings on large clusters.

SBATCH_ARGS

Additional arguments to provide to slurm's sbatch, if using the slurm version of subdock.bash.

QSUB_ARGS

Additional arguments to provide to sge's qsub, if using the sge version of subdock.bash

Examples

BKS Example

export INPUT_SOURCE=example.in
export OUTPUT_DEST=output
export DOCKEXEC=$DOCKBASE/docking/DOCK/bin/dock64
export DOCKFILES=dockfiles.example
export SHRTCACHE=/dev/shm
export LONGCACHE=/tmp
export SBATCH_ARGS="--time=02:00:00"

$DOCKBASE/docking/submit/slurm/subdock.bash

Wynton Example

export INPUT_SOURCE=example.in
export OUTPUT_DEST=output
export DOCKEXEC=$DOCKBASE/docking/DOCK/bin/dock64
export DOCKFILES=dockfiles.example
export SHRTCACHE=/scratch
export LONGCACHE=/scratch
export QSUB_ARGS="-l s_rt=00:28:00 -l h_rt=00:30:00"

$DOCKBASE/docking/submit/sge/subdock.bash

Example: Running a lot of docking jobs

  • 1. set up sdi files
mkdir sdi
export sdi=sdi
ls /wynton/group/bks/zinc-22/H19/H19P0??/*.db2.tgz > $sdi/h19p0.in
ls /wynton/group/bks/zinc-22/H19/H19P1??/*.db2.tgz > $sdi/h19p1.in
ls /wynton/group/bks/zinc-22/H19/H19P2??/*.db2.tgz > $sdi/h19p2.in
ls /wynton/group/bks/zinc-22/H19/H19P3??/*.db2.tgz > $sdi/h19p3.in
and so on
  • 2. set up INDOCK and dockfiles. rename dockfiles to dockfiles.$indockhash. On some nodes, the shasum command is called by sha1sum. Ultimately, renaming the dockfiles to a unique dockfiles is key.
bash
indockhash=$(cat INDOCK | shasum | awk '{print substr($1, 1, 12)}')
  • 3. super script:
export DOCKBASE=/wynton/group/bks/work/jji/DOCK
export DOCKFILES=$WORKDIR/dockfiles.21751f1bb16b
export DOCKEXEC=$DOCKBASE/docking/DOCK/bin/dock64
#export SHRTCACHE=/dev/shm # default
export SHRTCACHE=/scratch
export LONGCACHE=/scratch
export QSUB_ARGS="-l s_rt=00:28:00 -l h_rt=00:30:00 -l mem_free=2G"

for i in  sdi/*.in  ; do
        export k=$(basename $i .in)
	echo k $k
	export INPUT_SOURCE=$PWD/$i
	export EXPORT_DEST=$PWD/output/$k
	$DOCKBASE/docking/submit/sge/subdock.bash
done

  1. 3a. to run for first time
sh super
  1. 4. how to restart (to make sure complete, iterate until complete)
sh super
  1. 5. check which output is valid (and broken or incomplete output)
  1. 6. extract all blazing fast
  1. 7. extract mol2

more soon, under active development, Jan 28.

Appendix: Docking mono-cations of ZINC22 with DOCK3.8 on Wynton

Added by Ying 3/10/2021

To use: copy and paste the code section into terminal. Note to change the path where labelled with CHANGE this

  • set up the folder to run docking.

Path to my example: /wynton/home/shoichetlab/yingyang/work/5HT-5a/10_AL-dock/zinc22_3d_build_3-10-2021

 mkdir zinc22_3d_build_3-10-2021
 cd zinc22_3d_build_3-10-2021
  • copy INDOCK into dockfiles folder, and transfer to the created folder
 cp INDOCK dockfiles
 scp -r INDOCK dockfiles dt2.wynton.ucsf.edu:/path_to_created_folder
  • get sdi of monocations of already built ZINC22 (<= H26 heavy atom count)

Modify to your own need...

mkdir sdi

foreach i (`seq 4 1 26`)
  set hac = `printf "H%02d" $i `
  echo $i $hac
  
  touch sdi/${hac}.sdi
  # CHANGE this: to your need
  foreach tgz (`ls /wynton/group/bks/zinc-22*/${hac}/${hac}[PM]???/*-O*.db2.tgz`)
    ls $tgz
    echo $tgz >> sdi/${hac}.sdi
  end
end

  • rename the dockfiles directory
 indockhash=$(cat INDOCK | sha1sum | awk '{print substr($1, 1, 12)}')
 mv dockfiles dockfiles.${indockhash}
  • write and run the super_run.sh
cat <<EOF > super_run.sh
export DOCKBASE=/wynton/group/bks/soft/DOCK-3.8.0.1
export DOCKEXEC=\$DOCKBASE/docking/DOCK/bin/dock64

# CHANGE here: path to the previously renamed dockfiles.\${indockhash}
export DOCKFILES=/wynton/group/bks/work/yingyang/5HT-5a/10_AL-dock/zinc22_3d_build_3-10-2021/dockfiles.${indockhash}
export SHRTCACHE=/scratch
export LONGCACHE=/scratch
export QSUB_ARGS="-l s_rt=00:28:00 -l h_rt=00:30:00 -l mem_free=2G"

for i in  sdi/*.sdi  ; do
    export k=\$(basename \$i .sdi)
    echo k \$k
    export INPUT_SOURCE=$PWD/\$i
    export EXPORT_DEST=$PWD/output/\$k
    \$DOCKBASE/docking/submit/sge/subdock.bash
done
EOF

bash super_run.sh

  • keep submitting the super_run script until all db2s have been docked.

After all docking jobs finish, check the output. If no weird error, we can use a while loop to restart.

while true
do
  export jobN=$(qstat | grep -c 'rundock')
  if [[ $jobN -gt 0 ]] 
  then
    sleep 60
  else 
    bash super_run.sh
  fi
done

When no new job is going to be submitted, use Ctrl+c to exit the while loop.

  • extract scores from output.
cat << EOF > qsub_extract.csh
#\$ -S /bin/csh
#\$ -cwd
#\$ -pe smp 1
#\$ -l mem_free=100G
#\$ -l scratch=100G
#\$ -l h_rt=50:00:00
#\$ -j yes
#\$ -o extract_all.out

hostname
date

setenv DOCKBASE /wynton/group/bks/soft/DOCK-3.8.0.1

setenv dir_in $PWD

if ! (-d \$TMPDIR ) then
    if (-d /scratch ) then
        setenv TMPDIR /scratch/\$USER
    else
        setenv TMPDIR /tmp/\$USER
    endif
    mkdir -p \$TMPDIR
endif
pushd \$TMPDIR

ls -d \${dir_in}/output/*/*/ > dirlist

python \$DOCKBASE/analysis/extract_all_blazing_fast.py \
dirlist extract_all.txt -30

mv extract_all.* \$dir_in

popd

echo '---job info---'
qstat -j \$JOB_ID
echo '---complete---'
EOF

qsub qsub_extract.csh

Another way is to run the command from the login node (Not recommended since sorting utilizes large memory)

ls -d output/*/*/ > dirlist
python $DOCKBASE/analysis/extract_all_blazing_fast.py dirlist extract_all.txt -20
  • get poses in parallel
set score_file = $PWD/extract_all.sort.uniq.txt
set score_name = ${score_file:t:r}
set fileprefix = 'tmp_'
set number_per_file = 5000

set workdir  = $PWD/${score_name}_poses
mkdir -p $workdir 
cd $workdir

split --lines=$number_per_file --suffix-length=4 \
-d $score_file ${fileprefix}

set num  = ` ls ${fileprefix}* | wc -l `
echo "Number of score files to process:" $num

cat << EOF > qsub_poses.csh
#\$ -S /bin/csh
#\$ -cwd
#\$ -j yes
#\$ -pe smp 1
#\$ -l mem_free=5G
#\$ -l scratch=20G
#\$ -l h_rt=25:00:00
#\$ -t 1-$num

hostname
date

setenv DOCKBASE /wynton/group/bks/soft/DOCK-3.8.0.1

set list = \` ls \$PWD/${fileprefix}* \` 
set MOL = "\${list[\$SGE_TASK_ID]}"
set name = \${MOL:t:r}

python2 $DOCKBASE/analysis/getposes_blazing_faster.py \
"" \${MOL} $number_per_file poses_\${name}.mol2 test.mol2.gz

EOF

qsub qsub_poses.csh
cd ../
 
  • Post-processing...