Building The 3D Pipeline ZINC22

Introduction

The 3D pipeline is a collection of scripts and software packages that enable the massively parallel creation of dockable 3D molecules.

The setup for the pipeline can be somewhat complicated, so we are working on a containerized version with a separate wiki page @ http://wiki.docking.org/index.php/ZINC22_3D_Pipeline_Container (WIP).

Requirements

  • Chemaxon License (email John for license details)
  • Openeye License (email John for license details)
  • Corina License (comes packaged with the executable, so you will need to provide the executable)
  • SGE or SLURM queueing system installed on your cluster
  • A networked file system installed on your cluster

Setup Instructions

  1. Clone the master branch of the zinc-3d-build-3 repository (https://github.com/btingle/zinc-3d-build-3).
    1. You can also grab these scripts from the DOCK repository (https://github.com/docking-org/DOCK.git) @ DOCK/ligand/submit.
    2. Wynton Users: The most up-to-date DOCK version is hosted @ /wynton/group/bks/soft/DOCK-current.
  2. BKS Users: Copy the software distribution from our cluster at /nfs/home/xyz/soft/*.tar.gz to the $HOME/soft directory of the user that will be running the script.
  3. Others: Copy the software distribution from our cluster, minus corina, which you will need to supply yourself.
    1. When this script is submitted to a machine for the first time, it will copy and install the necessary software from $HOME/soft to local storage.
    2. Because of this, it is important that your $HOME be global if you are running the script unmodified.
    3. If you don't have a global home, or otherwise want your software to reside somewhere else, you can specify an alternative global directory to copy the software from by exporting SOFT_HOME prior to running submit-all-jobs.bash (see below).
  4. Important: Most of the required software stays static; however, certain items are updated occasionally. Specifically, we maintain a version of the DOCK toolset stripped down to just the essentials for 3D building. On Wynton, a link is maintained that points to the latest stable version @ /wynton/group/bks/soft/DOCK.3D/DOCK.3.8.current.3d.tar.gz. On BKS, this link is located @ /nfs/soft/dock/versions/dock38/DOCK.3D/DOCK.3.8.current.3d.tar.gz. You can switch the DOCK version you use with the DOCK_VERSION environment variable (see the sections below). By default, if DOCK_VERSION is not specified, the script will try to use DOCK.3.8.current.3d.
  5. Copy your licenses into your $SOFT_HOME. Copy your corina distribution (as a tar.gz) into $SOFT_HOME/soft if not already present.
    1. This script assumes that the jchem and openeye licenses will be named ".jchem-license.cxl" and ".oe-license.txt" respectively. Corina is assumed to be a tarball named "corina.tar.gz" which contains a single directory, "corina", with executables etc.
    2. BKS Users: You can copy these licenses from /nfs/home/xyz to your own $SOFT_HOME.

Running The Script

FYI: You MUST be in the pipeline bin directory (zinc-3d-build-3) when running the submit-all-jobs script.

You should also run this command in a screen, as it needs to persist until all jobs are submitted.
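
For instance, a typical way to launch is from a named screen session (a minimal sketch; the session name and checkout path are placeholders):

screen -S 3d-build                 # persistent session that survives SSH disconnects
cd /path/to/zinc-3d-build-3        # the submit script must be run from this directory
# export the variables described below, then run the submit script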

General Example

export INPUT_FILE=$YOUR_INPUT_FILE.smi
export OUTPUT_DEST=$SOME_GLOBAL_DIRECTORY
export DOCK_VERSION=$DOCK_VERSION
# if you don't have a global home directory (but some other global directory available), specify an alternate SOFT_HOME
# export SOFT_HOME=$YOUR_SOFT_HOME
export LINES_PER_BATCH=50000
export LINES_PER_JOB=50
export SHRTCACHE=/dev/shm
export LONGCACHE=/tmp
./submit-all-jobs-<slurm/sge>.bash

BKS Example

export INPUT_FILE=example.smi
export OUTPUT_DEST=/nfs/exb/zinc22/tarballs
# if you want to run a more recent/experimental branch of the pipeline
# export DOCK_VERSION=DOCK.3.8.4.0
export SBATCH_ARGS="--time=02:00:00"
export LINES_PER_BATCH=50000
export LINES_PER_JOB=50
export SHRTCACHE=/dev/shm
export LONGCACHE=/dev/shm
./submit-all-jobs-slurm.bash

Wynton Example

export INPUT_FILE=example.smi
export OUTPUT_DEST=/wynton/group/bks/zinc22
# if you want to run a more recent/experimental branch of the pipeline
# export DOCK_VERSION=DOCK.3.8.4.0
export QSUB_ARGS="-l s_rt=00:28:30 -l h_rt=00:30:00 -r y"
export LINES_PER_BATCH=50000
export LINES_PER_JOB=50
export LONGCACHE=/scratch
export SHRTCACHE=/scratch
./submit-all-jobs-sge.bash

Cori Example

export INPUT_FILE=example.smi
export OUTPUT_DEST=$SCRATCH/zinc22
# if you want to run a more recent/experimental branch of the pipeline
# export DOCK_VERSION=DOCK.3.8.4.0
export SBATCH_ARGS="--cpus-per-task=1 --time=02:00:00 --requeue -q shared -C haswell"
export LINES_PER_BATCH=20000
export LINES_PER_JOB=50
export MAX_BATCHES=10
# $SCRATCH on Cori can be used for debugging, but production batch jobs should take advantage of a burst buffer allocation
# this requires some custom modification of the script
# export SHRTCACHE=$SCRATCH
# export LONGCACHE=$SCRATCH
./submit-all-jobs-slurm.bash

Resubmission

If your building jobs have finished (or timed out) and you want to continue processing whatever has not been processed yet, just run submit-all-jobs-slurm/sge again (with the same environment variables). The submit-all script will detect which entries haven't finished and resubmit them.

Script Arguments

INPUT_FILE

The input .smi file to be built. This file should contain only two columns of data: (SMILES, NAME) with no header.
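
For illustration, a minimal input file might look like the following (the SMILES and names here are arbitrary, hypothetical examples):

CCO EX-0000001
CC(=O)Oc1ccccc1C(=O)O EX-0000002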

OUTPUT_DEST

The base directory where output will be stored. The script will create a sub-directory here named $INPUT_FILE.batch-3d.d.

Within this output directory there are 3 sub-directories:

  1. in
  2. log
  3. out

The 'in' directory contains the input file split into fragments and sub-fragments. By default, the script first splits the input file into batches of 50000 lines, then splits those batches into sub-batches of 50 lines. Each individual job works on one of these sub-batches; each array batch job works on one of the batches of 50000. All of the other directories alongside 'in' share the same directory structure.

The 'log' directory contains log messages from the jobs. If you are re-submitting a file, be aware that log messages from previous runs on this file will be overwritten.

The 'out' directory contains the tar.gz output from each job. The tarballs should contain a number of 3D molecule formats for each molecule in the batch, including 1 or more db2.gz files.
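
Putting this together, for INPUT_FILE=example.smi the output area looks roughly like the following (a sketch; the file names inside each sub-directory are illustrative):

$OUTPUT_DEST/example.smi.batch-3d.d/
        in/     # split input: batches of LINES_PER_BATCH lines, sub-batches of LINES_PER_JOB lines
        log/    # per-job log messages, overwritten on re-submission
        out/    # one tar.gz of 3D output per sub-batch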

SHRTCACHE

The base working directory for the script. By default it is /dev/shm.

LONGCACHE

The base directory for persistent files that are shared between jobs (i.e., where software is installed). By default it is /tmp.

LINES_PER_BATCH

How many lines of the source .smi file should be processed per array batch job; the default is 50000.

LINES_PER_JOB

How many lines of the batch .smi file should be processed per array task; the default is 50.
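
With the defaults, each array batch therefore contains 50000 / 50 = 1000 tasks, so a 1,000,000-line input file would be split into 20 batches of 1000 jobs each.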

SBATCH_ARGS

Additional arguments for the sbatch command. It is recommended to set a --time limit, as build jobs will save progress & terminate if they are still running two minutes before the --time limit. --requeue allows jobs that reach the time limit to run again later, utilizing their saved progress.

QSUB_ARGS

Additional arguments for the qsub command. Similar to SLURM, it is recommended to set a time limit, but you will need to manually specify both s_rt & h_rt. In the example, we set s_rt a minute and thirty seconds before h_rt: s_rt is the point where jobs will save progress and terminate; h_rt is when they will be forcibly terminated, even if they've not finished saving. -r y allows jobs that reach the time limit to run again later, utilizing their saved progress.

MAX_BATCHES

Each batch job will contain LINES_PER_BATCH/LINES_PER_JOB jobs, and there will be a maximum of MAX_BATCHES batches submitted at any given time. By default this value is 25, which corresponds to 25,000 queued jobs at any given time if there are 1000 jobs per batch.

SOFT_HOME

Where software tarballs for the pipeline are stored. By default this is $HOME. It is expected that there is a directory named "soft" within $SOFT_HOME that contains the actual software. The licenses to run this software are stored directly in $SOFT_HOME. By default, your $SOFT_HOME should contain the following:

licenses:
$SOFT_HOME/.jchem-license.cxl
$SOFT_HOME/.oe-license.txt

soft:
$SOFT_HOME/soft/DOCK.version.tar.gz
$SOFT_HOME/soft/corina.tar.gz
$SOFT_HOME/soft/lig_build_py3-3.7.1.tar.gz
$SOFT_HOME/soft/lib.tar.gz
$SOFT_HOME/soft/jchem-19.15.tar.gz
$SOFT_HOME/soft/openbabel-install.tar.gz
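
As a quick sanity check before submitting, you can verify the expected files exist (a sketch assuming the default names above):

SOFT_HOME=${SOFT_HOME:-$HOME}
for f in .jchem-license.cxl .oe-license.txt soft/corina.tar.gz soft/lib.tar.gz; do
        [ -e "$SOFT_HOME/$f" ] || echo "missing: $SOFT_HOME/$f"
done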

DOCK_VERSION

Specifies which tarball of DOCK should be used for ligand building. The variable value should not include the tar.gz extension. By default this is DOCK.3.8.current.3d, which you can find @ /wynton/group/bks/soft/DOCK.3D/DOCK.3.8.current.3d.tar.gz on Wynton, and @ /nfs/soft/dock/versions/dock38/DOCK.3D/DOCK.3.8.current.3d.tar.gz on BKS. Link one of these into your $SOFT_HOME/soft directory to make sure your DOCK version stays up to date.
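
For example, on Wynton (assuming the default $SOFT_HOME of $HOME):

ln -s /wynton/group/bks/soft/DOCK.3D/DOCK.3.8.current.3d.tar.gz $HOME/soft/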

CORINA_MAX_CONFS

Usually, when creating 3D embeddings from the jchem-generated protomers, corina generates 1 conformation, equivalent to mc=1 on the command line. You can change this mc value by exporting CORINA_MAX_CONFS=<N> prior to running build-3d.bash or submit-all-jobs.bash.
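
For example, to have corina emit up to 3 conformations per protomer (equivalent to mc=3):

export CORINA_MAX_CONFS=3
./submit-all-jobs-slurm.bash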

Repatriation

At BKS, we store the tarred output of the pipeline @ /nfs/exb/zinc22/tarballs. We currently use the following command to repatriate output from other clusters to ours:

### migrate_output.bash

for output in $OUTPUT_DEST/*.batch-3d.d; do
        echo "starting rsync on $output to $MIGRATE_USER@files2.docking.org"
        sshpass -f $PW_FILE rsync -arv $output/out $MIGRATE_USER@files2.docking.org:/nfs/exb/zinc22/tarballs/$(basename $output).out
done

sshpass is optional here but preferable for convenience's sake. Since files2.docking.org is only visible within the UCSF network, any clusters outside will need to maintain a network tunnel when rsyncing.
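
One way to tunnel is to route rsync through an SSH jump host via its -e option; in this sketch, jump.example.org is a placeholder for whatever UCSF-reachable gateway you can log in to:

# jump.example.org is a placeholder; substitute your own gateway host
sshpass -f $PW_FILE rsync -arv -e "ssh -J $MIGRATE_USER@jump.example.org" \
        $output/out $MIGRATE_USER@files2.docking.org:/nfs/exb/zinc22/tarballs/$(basename $output).out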

Errors

Sometimes an output tarball will have few or no entries within. Certain molecule types [elaborate] will fail to be built, and often these molecules get bunched together (i.e., if the input file is sorted by SMILES). Additionally, a small percentage of all molecules may fail to be processed by corina or amsol. If neither of these explains what is causing your missing entries, check that tarball's corresponding log entry for more info.
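
To spot suspiciously sparse tarballs, a loop like the following can help (a sketch; point it at your batch's out directory):

for t in $(find $OUTPUT_DEST/example.smi.batch-3d.d/out -name "*.tar.gz"); do
        n=$(tar -tzf "$t" | wc -l)
        [ "$n" -lt 2 ] && echo "$t has only $n entries"
done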

Additional Notes

This script was designed for a 64-bit architecture. You will likely run into library issues trying to run it on 32-bit machines. If that's all you have, you can try to swap out libg2c.so.0* in lib.tar.gz with a 32-bit version, but I cannot help you beyond that.

It is safe to re-run the same file multiple times: the script takes care of not re-running any jobs that have already completed successfully. This is only the case if that file's corresponding batch-3d.d output directory has not been moved or deleted.

For example, if one of your nodes went down and caused a bunch of jobs to fail, it would be safe to re-run ./submit-all-jobs.bash to re-submit those jobs (assuming there are no jobs for that file currently queued or running).


back to ZINC22:Building_3D