Building The 3D Pipeline ZINC22: Difference between revisions

The setup for the pipeline can be somewhat complicated, so we are working on a containerized version with a separate wiki page @ http://wiki.docking.org/index.php/ZINC22_3D_Pipeline_Container (WIP).



Revision as of 22:50, 15 August 2022

Introduction

The 3D pipeline is a collection of scripts and software packages that enable the massively parallel creation of dockable 3D molecules.

License Requirements

  • Chemaxon License
  • Openeye License
  • Corina License (comes packaged with the executable, so you will need to provide the executable)

BKS Cluster Installation

Step 1: Link Software & Licenses

ln -s /nfs/soft/dock/versions/dock38/pipeline_3D_ligands/soft ~/soft
cp /nfs/soft/dock/versions/dock38/pipeline_3D_ligands/licenses/.* ~

Step 2: Download Submission Scripts

$ git clone https://github.com/docking-org/zinc-3d-build-3.git
Cloning into 'zinc-3d-build-3'...
remote: Enumerating objects: 205, done.
remote: Counting objects: 100% (205/205), done.
remote: Compressing objects: 100% (129/129), done.
remote: Total 205 (delta 118), reused 156 (delta 72), pack-reused 0
Receiving objects: 100% (205/205), 70.75 KiB | 0 bytes/s, done.
Resolving deltas: 100% (118/118), done.

If you have set up the submission scripts before and have not built ligands in a while, run "git pull" inside the zinc-3d-build-3 directory to pick up any updates.

IMPORTANT: when running submission scripts, make sure you are logged in to gimel5; otherwise you won't be able to use Slurm.

Wynton Cluster Installation

Nothing has been prepared yet on the Wynton cluster to simplify the installation procedure. However, all that is needed is the software, the licenses, and the submission scripts. You can use scp or another utility to copy the contents of your $HOME/soft directory on BKS to your $HOME/soft directory on Wynton, and to copy the licenses from $HOME/.*license* on BKS to your $HOME on Wynton.

Submission scripts can be downloaded the same way on Wynton, but you will need to be logged in to a dev node to use git (ssh dev3).

SGE submission scripts may be out of date and may not work. Use them with caution, or try out Wynton's new Slurm system.

Script Arguments

Main submission scripts are named submit-all-jobs-slurm.bash and submit-all-jobs-sge.bash. These scripts use environment variables as arguments instead of usual command line ones.

For example, in bash you would set one of these arguments like so:

export INPUT_FILE=example.smi

or on csh:

setenv INPUT_FILE example.smi

Set these variables before running the script.

Required Arguments

INPUT_FILE

The input .smi file to be built. This file should contain only two columns of data: (SMILES, NAME) with no header.
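A quick pre-flight check of the two-column requirement can save a failed submission. The sketch below is not part of the pipeline; the helper name count_bad_smi_lines and the demonstration molecule names are made up for illustration. It counts lines that do not have exactly two whitespace-separated columns.

```shell
# Hypothetical helper (not part of zinc-3d-build-3): count the lines of a
# .smi file that do not have exactly two whitespace-separated columns.
count_bad_smi_lines() {
    awk 'NF != 2 { bad++ } END { print bad + 0 }' "$1"
}

# Small demonstration file; the names are invented for this example.
demo_smi=$(mktemp)
printf '%s\n' 'CCO ethanol_demo' 'c1ccccc1 benzene_demo' 'SMILES NAME EXTRA' > "$demo_smi"

bad_lines=$(count_bad_smi_lines "$demo_smi")
echo "lines without exactly two columns: $bad_lines"
rm -f "$demo_smi"
```

Note that a two-column header line such as "SMILES NAME" would pass this check, so it still needs to be removed by hand.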

OUTPUT_DEST

The base directory where output is stored. The script will create a sub-directory here named $INPUT_FILE.batch-3d.d.

Within this output directory there are 3 sub-directories:

  1. in
  2. log
  3. out

The in directory contains the input file split into batches and sub-batches. By default the script first splits the input file into batches of 50000 lines, then splits each batch into sub-batches of 50 lines. Each array batch job works on one batch of 50000 lines, and each individual job within it works on one sub-batch. The directories alongside in (log and out) share the same directory structure.

The log directory contains log messages from the jobs. If you are re-submitting a file, be aware that log messages from previous runs on this file will be overwritten.

The out directory contains tar.gz output from each job. The tarballs should contain a number of 3D molecule formats for each molecule in the batch, including one or more db2.gz files.
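The two-level split described for the in directory can be illustrated with the standard split utility. This is only a sketch under assumptions: the submission scripts may implement the split differently, the batch_ and job_ file prefixes are invented here, and the batch sizes are scaled down from the defaults (50000 and 50) so the result is easy to inspect.

```shell
# Scaled-down stand-ins for LINES_PER_BATCH (default 50000) and
# LINES_PER_JOB (default 50).
lines_per_batch=6
lines_per_job=2

work=$(mktemp -d)
seq 1 14 | sed 's/^/C mol_/' > "$work/example.smi"   # 14 dummy "SMILES NAME" lines

# First level: one file per batch (hypothetical batch_ prefix).
mkdir "$work/batches"
split -l "$lines_per_batch" "$work/example.smi" "$work/batches/batch_"

# Second level: one directory per batch, holding its sub-batches
# (hypothetical job_ prefix).
for batch in "$work"/batches/batch_*; do
    sub_dir="$work/in/$(basename "$batch")"
    mkdir -p "$sub_dir"
    split -l "$lines_per_job" "$batch" "$sub_dir/job_"
done

n_batches=$(ls "$work/batches" | wc -l | tr -d ' ')
n_jobs=$(find "$work/in" -type f | wc -l | tr -d ' ')
echo "$n_batches batches, $n_jobs sub-batches"
rm -rf "$work"
```

With 14 input lines this yields batches of 6, 6, and 2 lines, which split further into 3, 3, and 1 sub-batches.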

Optional Arguments

SHRTCACHE

The base working directory for the script. The default is /scratch.

LONGCACHE

The base directory for persistent files that are shared between jobs (i.e., where software is installed). The default is /scratch.

LINES_PER_BATCH

How many lines of the source .smi file should be processed per array batch job. The default is 50000.

LINES_PER_JOB

How many lines of the batch .smi file should be processed per array task. The default is 50.

SBATCH_ARGS

Additional arguments for the sbatch command. It is recommended to set a --time limit, as build jobs will save progress & terminate if they are still running two minutes before the --time limit.

QSUB_ARGS

Additional arguments for the qsub command. Similar to slurm, it is recommended to set a time limit, but you will need to manually specify both s_rt & h_rt. In the example, we set s_rt to be a minute and thirty seconds before h_rt. s_rt is the point where jobs will save progress and terminate, h_rt is when they will be forcibly terminated, even if they've not finished saving.
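Deriving s_rt from h_rt by hand is easy to get wrong, so the subtraction can be scripted. The helper below is a sketch, not part of the pipeline: the function name s_rt_from_h_rt is invented, and the 90-second margin simply mirrors the minute and thirty seconds mentioned above.

```shell
# Hypothetical helper: given h_rt as HH:MM:SS and a margin in seconds,
# print the matching s_rt in the same format.
s_rt_from_h_rt() {
    echo "$1" | awk -F: -v margin="$2" '{
        t = $1 * 3600 + $2 * 60 + $3 - margin
        if (t < 0) t = 0
        printf "%02d:%02d:%02d\n", int(t / 3600), int((t % 3600) / 60), t % 60
    }'
}

h_rt=00:30:00
s_rt=$(s_rt_from_h_rt "$h_rt" 90)

# Same shape as the QSUB_ARGS value used in the Wynton example.
export QSUB_ARGS="-l s_rt=$s_rt -l h_rt=$h_rt -r y"
echo "$QSUB_ARGS"
```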

MAX_BATCHES

Each batch job will contain LINES_PER_BATCH/LINES_PER_JOB jobs, and there will be a maximum of MAX_BATCHES batches submitted at any given time. By default this value is 25, which corresponds to 25,000 queued jobs at any given time if there are 1000 jobs per batch.

The submit-all script will block until fewer than MAX_BATCHES job arrays are in the queue. TODO: block until fewer than MAX_BATCHES total jobs are running or in the queue.
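The queue-size arithmetic above can be reproduced in the shell when choosing values for a new cluster. This is a sketch using the documented defaults; rounding the jobs per batch up for a partial final sub-batch is an assumption about how the pipeline behaves.

```shell
# Documented defaults for LINES_PER_BATCH, LINES_PER_JOB, and MAX_BATCHES.
lines_per_batch=50000
lines_per_job=50
max_batches=25

# Jobs per batch, rounded up so a partial final sub-batch still counts
# (assumed behaviour, not confirmed from the scripts).
jobs_per_batch=$(( (lines_per_batch + lines_per_job - 1) / lines_per_job ))

# Upper bound on jobs queued at once.
max_queued_jobs=$(( jobs_per_batch * max_batches ))

echo "jobs per batch:  $jobs_per_batch"
echo "max queued jobs: $max_queued_jobs"
```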

SOFT_HOME

Where software tarballs for the pipeline are stored. By default this is $HOME. If your $HOME directory is not on a network file system and you are submitting jobs to other nodes, this variable will need to be set. A directory named "soft" is expected within $SOFT_HOME, containing the actual software. The licenses to run this software are stored directly in $SOFT_HOME.

DOCK_VERSION

Which DOCK distribution should be used for 3D building. Typically the default value will point towards the most recent stable distribution, e.g., DOCK_VERSION=DOCK.3.8.4.1.3d, corresponding to the tarball DOCK.3.8.4.1.3d.tar.gz in $SOFT_HOME/soft.

PYENV_VERSION

Which python distribution should be used for 3D building. By default this is lig_build_py3-3.7, corresponding to $SOFT_HOME/soft/lig_build_py3-3.7.tar.gz which is a very heavy python environment that has been in use for a while. Attempts to re-create or modify this environment should be done with an expectation of frustration.
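Since DOCK_VERSION and PYENV_VERSION both name tarballs under $SOFT_HOME/soft, a missing tarball can be caught before submitting. The sketch below is not part of the pipeline: the helper name have_tarball is invented, and it runs against a throwaway directory standing in for SOFT_HOME rather than a real installation.

```shell
# Hypothetical helper: succeed if <soft_home>/soft/<name>.tar.gz exists.
have_tarball() {
    [ -f "$1/soft/$2.tar.gz" ]
}

# Throwaway directory standing in for SOFT_HOME.
demo_home=$(mktemp -d)
mkdir "$demo_home/soft"
: > "$demo_home/soft/DOCK.3.8.4.1.3d.tar.gz"   # empty placeholder, not a real tarball

status=""
for name in DOCK.3.8.4.1.3d lig_build_py3-3.7; do
    if have_tarball "$demo_home" "$name"; then
        status="$status $name=found"
    else
        status="$status $name=missing"
    fi
done
echo "$status"
rm -rf "$demo_home"
```

Against a real installation, the first argument would be "$SOFT_HOME" (or "$HOME" when SOFT_HOME is unset) and the names would come from DOCK_VERSION and PYENV_VERSION.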

CORINA_VERSION

Which CORINA distribution should be used for 3D building. This needs to be updated when the current CORINA license expires. For example, we recently updated from just "corina" to "corina-2025".

CORINA_MAX_CONFS

Usually, when creating 3D embeddings from the JChem-generated protomers, corina generates one conformation, equivalent to mc=1 on the command line. You can change this mc value by exporting CORINA_MAX_CONFS=<N> prior to running build-3d.bash or submit-all-jobs.bash.

Examples

Minimal Example

export INPUT_FILE=example.smi
export OUTPUT_DEST=/nfs/exb/zinc22/tarballs
bash submit-all-jobs-slurm.bash

BKS Example

export INPUT_FILE=example.smi
export OUTPUT_DEST=/nfs/exb/zinc22/tarballs
export SBATCH_ARGS="--time=02:00:00"
export LINES_PER_BATCH=50000
export LINES_PER_JOB=50
export MAX_BATCHES=10
export SHRTCACHE=/dev/shm
export LONGCACHE=/dev/shm
bash submit-all-jobs-slurm.bash

Wynton Example

export INPUT_FILE=example.smi
export OUTPUT_DEST=/wynton/group/bks/zinc22
export QSUB_ARGS="-l s_rt=00:28:30 -l h_rt=00:30:00 -r y"
export LINES_PER_BATCH=50000
export LINES_PER_JOB=50
export MAX_BATCHES=15
export LONGCACHE=/scratch
export SHRTCACHE=/scratch
bash submit-all-jobs-sge.bash

Resubmission

If your build jobs have finished (or timed out) and you want to continue processing whatever has not been processed yet, just run submit-all-jobs-slurm.bash or submit-all-jobs-sge.bash again with the same environment variables. The submit-all script will detect which entries haven't finished and resubmit them.

Repatriation

At BKS, we currently store the tarred output of the pipeline at /nfs/exb/zinc22/tarballs. We use the following script to repatriate output from other clusters to our cluster:

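### migrate_output.bash
# Expects OUTPUT_DEST, MIGRATE_USER, and PW_FILE (a file holding the password) to be set.

for output in "$OUTPUT_DEST"/*.batch-3d.d; do
        echo "starting rsync on $output to $MIGRATE_USER@files2.docking.org"
        # Copy this input file's out directory into <name>.batch-3d.d.out on the BKS side
        sshpass -f "$PW_FILE" rsync -arv "$output/out" "$MIGRATE_USER@files2.docking.org:/nfs/exb/zinc22/tarballs/$(basename "$output").out"
done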

sshpass is optional here but preferable for convenience's sake. Since files2.docking.org is only visible within the UCSF network, any clusters outside will need to maintain a network tunnel when rsyncing.

Errors

Sometimes an output tarball will have few or no entries. Certain molecule types will fail to be built, and often these molecules get bunched together (e.g., if the input file is sorted by SMILES). Additionally, a small percentage of all molecules may fail to be processed by corina or amsol. If neither of these explains your missing entries, check that tarball's corresponding log entry for more info.
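Sparse tarballs can be located by counting the entries in each one. The sketch below is not part of the pipeline: the helper name count_tarball_entries, the demonstration file names, and the threshold of 2 are all invented for illustration. Because each molecule contributes several files, the entry count is only a rough proxy for how many molecules were built.

```shell
# Hypothetical helper: print "<entry count> <path>" for every .tar.gz
# found under the given directory.
count_tarball_entries() {
    find "$1" -type f -name '*.tar.gz' | sort | while IFS= read -r tarball; do
        n=$(tar -tzf "$tarball" | wc -l | tr -d ' ')
        echo "$n $tarball"
    done
}

# Demonstration data: one tarball with two entries, one with a single entry.
demo=$(mktemp -d)
mkdir -p "$demo/src/full" "$demo/src/sparse" "$demo/out"
: > "$demo/src/full/a.db2.gz"
: > "$demo/src/full/b.db2.gz"
: > "$demo/src/sparse/c.db2.gz"
tar -czf "$demo/out/job_full.tar.gz" -C "$demo/src/full" a.db2.gz b.db2.gz
tar -czf "$demo/out/job_sparse.tar.gz" -C "$demo/src/sparse" c.db2.gz

# Report tarballs with fewer than 2 entries (arbitrary threshold).
sparse=$(count_tarball_entries "$demo/out" | awk -v min=2 '$1 < min { print $2 }')
sparse_name=$(basename "$sparse")
echo "sparse tarball: $sparse_name"
rm -rf "$demo"
```

On real output, the directory argument would be the out directory inside the relevant batch-3d.d directory, and the flagged paths indicate which log entries to inspect.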

Additional Notes

It is safe to re-run the same file multiple times. The script makes sure not to re-run any jobs that have already completed successfully. This only holds if that file's corresponding batch-3d.d output directory has not been moved or deleted.

For example, if one of your nodes went down and caused a bunch of jobs to fail, it would be safe to re-run submit-all-jobs-slurm.bash or submit-all-jobs-sge.bash to re-submit those jobs (assuming there are no jobs for that file currently queued or running).
