Building The 3D Pipeline ZINC22


Latest revision as of 22:46, 4 June 2024

Introduction

The 3D pipeline is a collection of scripts and software packages that enable the massively parallel creation of dockable 3D molecules.

EZ Setup

BKS Cluster

source /nfs/soft/dock/versions/dock38/pipeline_3D_ligands/env.(sh|csh)

Source env.sh under bash or env.csh under csh. This environment sets most of the required variables for you and adds the submission scripts to your PATH, so submission can be as simple as:

bash

export INPUT_FILE=$HOME/myligands.smi
export OUTPUT_DEST=$HOME/myoutput
submit-all-jobs-slurm.bash

csh

setenv INPUT_FILE $HOME/myligands.smi
setenv OUTPUT_DEST $HOME/myoutput
submit-all-jobs-slurm.bash

Wynton Cluster

source /wynton/group/bks/soft/pipeline_3D_ligands/env.(sh|csh)

As in the BKS example, this environment sets most of the required variables for you.

bash

export INPUT_FILE=$HOME/myligands.smi
export OUTPUT_DEST=$HOME/myoutput
submit-all-jobs-sge.bash

csh

setenv INPUT_FILE $HOME/myligands.smi
setenv OUTPUT_DEST $HOME/myoutput
submit-all-jobs-sge.bash

Repackaging Output For Docking

The output of the 3D pipeline scripts will be a number of tar.gz files with roughly LINES_PER_JOB molecules contained per package.

It is standard practice to repackage these smaller packages into larger packages for docking, as 50 molecules do not take long to process with DOCK.

See this wiki page for how to do this: Repackaging_DB2_DOCK38

Script Arguments

The main submission scripts are named submit-all-jobs-slurm.bash and submit-all-jobs-sge.bash. These scripts take their arguments as environment variables rather than as command-line options.

For example, in bash you would set one of these arguments before running the script like so:

export INPUT_FILE=$PWD/example.smi

or in csh:

setenv INPUT_FILE $PWD/example.smi

Required Arguments

INPUT_FILE

The input .smi file to be built. This file should contain only two columns of data: (SMILES, NAME) with no header.
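
A quick format check before submission can catch malformed input early. The sketch below is not part of the pipeline: check_smi is a hypothetical helper, and it assumes the two columns are separated by whitespace.

```shell
# check_smi (hypothetical helper): print how many lines of a .smi file
# do NOT have exactly two whitespace-separated fields (SMILES, NAME).
# Blank lines are counted as malformed.
check_smi() {
    awk 'NF != 2 { bad++ } END { print bad + 0 }' "$1"
}

# Example use:
#   check_smi "$INPUT_FILE"    # prints 0 when every line has two columns
```

A non-zero count means some lines have missing or extra columns and should be fixed before the file is submitted.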

OUTPUT_DEST

The base directory for output to be stored. The script will create a sub-directory here named $INPUT_FILE.batch-3d.d

Within this output directory there are 3 sub-directories:

  1. in
  2. log
  3. out

The 'in' directory contains the input file split into fragments and sub-fragments. By default the script first splits the input file into batches of 50000 lines, then splits those batches into sub-batches of 50 lines. Each array batch job works on one of the batches of 50000, and each individual job within it works on one of the sub-batches. The 'log' and 'out' directories share the same directory structure as 'in'.

The 'log' directory contains log messages from the jobs. If you are re-submitting a file, be aware that log messages from previous runs on this file will be overwritten.

The 'out' directory contains the tar.gz output from each job. The tarballs should contain a number of 3D molecule formats for each molecule in the batch, including one or more db2.gz files.
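
The two-level split of the input described above can be illustrated with the standard split utility. This is only a sketch of the idea: split_input is a hypothetical function, and the file and directory names it creates are illustrative assumptions, not the pipeline's real layout.

```shell
# split_input (hypothetical): two-level split of an input file.
#   $1 = input file, $2 = destination directory,
#   $3 = lines per batch, $4 = lines per job
split_input() {
    local input=$1 dest=$2 per_batch=$3 per_job=$4
    mkdir -p "$dest"
    # First level: batches of $per_batch lines (batch.aa, batch.ab, ...).
    split -l "$per_batch" "$input" "$dest/batch."
    # Second level: each batch file is split into sub-batches of $per_job lines.
    for batch in "$dest"/batch.*; do
        [ -f "$batch" ] || continue      # skip directories from earlier runs
        mkdir -p "$batch.d"
        split -l "$per_job" "$batch" "$batch.d/job."
    done
}

# Example use with the documented defaults:
#   split_input "$INPUT_FILE" /tmp/demo_in 50000 50
```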

SOFT_HOME

Where software tarballs for the pipeline are stored. Symbolic links should be maintained in this directory according to the rules described in the "Software Options" section of this page. If you're sourcing a premade environment, you do not need to set this value.

LICENSE_HOME

Where software licenses are stored. Our licensed software currently includes jchem and openeye; their licenses must be named .jchem-license.cxl and .oe-license.txt respectively. If you're sourcing a premade environment, you do not need to set this value.

Script Arguments

SHRTCACHE

The base working directory for the script. By default it is /scratch.

LONGCACHE

The base directory for persistent files that are shared between jobs (i.e. where software is installed). By default it is /scratch.

CORINA_MAX_CONFS

How many nitrogen flapping configurations of each protomer corina should generate. By default only one is generated.

pH_LEVEL

Sets the pH at which the compound(s) are generated. Default is 7.4.


Note: when adding a new variable, it must also be added to the 'optional_vars' line of submit-all.bash.

Omega Arguments

These parameters correspond to torsion driving parameters described in the omega manual: https://docs.eyesopen.com/applications/omega/omega/omega_opt_params.html#torsion-driving-parameters

If you'd like to know more about how these parameters function, cross reference with the manual page.

OMEGA_MAX_CONFS

Maximum number of conformations OMEGA will generate. Default is 600.

OMEGA_ENERGY_WINDOW

Torsion energy window. If set to zero, OMEGA will use an alternative rotatable-bond-dependent window method instead. Default is 12.

OMEGA_TORLIB

Torsion library: either GubaV21 or Original. Default is Original.

OMEGA_FF

Force field used by OMEGA. For the available values, see https://docs.eyesopen.com/toolkits/cpp/oefftk/OEFFConstants/OEMMFFSheffieldFFType.html#OEFF::OEMMFFSheffieldFFType::MMFF94Smod

Default is MMFF94Smod.

OMEGA_RMSD

Sets the RMSD threshold for clustering and filtering conformations. If zero, OMEGA will use an alternative rotatable-bond-dependent method instead. Default is 0.5.

Job Submission Arguments

SUBMIT_MODE

The job submission method: SGE, SLURM, or TEST_LOCAL. This is set automatically if you use the wrapper script for the corresponding job controller, e.g. submit-all-jobs-slurm.bash. TEST_LOCAL bypasses the job controller and runs the first input chunk in your shell.

LINES_PER_BATCH

How many lines of the source .smi file should be processed per array batch job. Default is 50000.

LINES_PER_JOB

How many lines of the batch .smi file should be processed per array task. Default is 50.

MAX_BATCHES

Each batch job will contain LINES_PER_BATCH/LINES_PER_JOB jobs, and there will be a maximum of MAX_BATCHES batches submitted at any given time. By default this value is 25, which corresponds to 25,000 queued jobs at any given time if there are 1000 jobs per batch.

The submit-all script will block until fewer than MAX_BATCHES job arrays are in the queue. TODO: block until fewer than MAX_BATCHES total jobs are running or in the queue.
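
The arithmetic behind these limits can be checked with a few lines of shell. This is a sketch rather than pipeline code: jobs_per_batch and max_queued are hypothetical helpers, and rounding up for a final partial sub-batch is an assumption.

```shell
# jobs_per_batch LINES_PER_BATCH LINES_PER_JOB
# Rounds up so that a final partial sub-batch still counts as one job.
jobs_per_batch() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# max_queued LINES_PER_BATCH LINES_PER_JOB MAX_BATCHES
max_queued() {
    echo $(( $(jobs_per_batch "$1" "$2") * $3 ))
}

# With the documented defaults:
jobs_per_batch 50000 50      # prints 1000
max_queued 50000 50 25       # prints 25000
```

Lowering LINES_PER_BATCH or MAX_BATCHES reduces the number of tasks queued at once, which can help on clusters with per-user queue limits.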

SBATCH_ARGS

Additional arguments for the sbatch command. It is recommended to set a --time limit, as build jobs will save progress & terminate if they are still running two minutes before the --time limit.

QSUB_ARGS

Additional arguments for the qsub command. As with SLURM, it is recommended to set a time limit, but you will need to specify both s_rt and h_rt manually. In the example, we set s_rt to be one minute and thirty seconds before h_rt. s_rt is the point at which jobs save progress and terminate; h_rt is when they are forcibly terminated, even if they have not finished saving.
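
To avoid working out s_rt by hand, it can be derived from h_rt. The sketch below assumes HH:MM:SS time strings; soft_limit is a hypothetical helper and not part of the pipeline.

```shell
# soft_limit H_RT MARGIN_SECONDS
# Print the s_rt value that lies MARGIN_SECONDS before H_RT (HH:MM:SS).
soft_limit() {
    echo "$1" | awk -F: -v margin="$2" '{
        t = $1 * 3600 + $2 * 60 + $3 - margin
        printf "%02d:%02d:%02d\n", int(t / 3600), int((t % 3600) / 60), t % 60
    }'
}

h_rt=00:30:00
QSUB_ARGS="-l s_rt=$(soft_limit "$h_rt" 90) -l h_rt=$h_rt -r y"
export QSUB_ARGS
echo "$QSUB_ARGS"    # -l s_rt=00:28:30 -l h_rt=00:30:00 -r y
```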

Software Options

All software variables will be set automatically if there exists a symbolic link in $SOFT_HOME matching the software variable's name, for example:

dock-latest -> DOCK.3.8.4.3d.tar.gz
jchem-latest -> jchem-19.15_r1.tar.gz
pyenv-latest -> lig_build_py3-3.7.1.tar.gz

They may also be set manually, in which case the value is expected to be a path to a tar.gz file.
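
Maintaining these links is ordinary symlink management. A minimal sketch, assuming the tarballs already sit in $SOFT_HOME; link_latest is a hypothetical helper, not a pipeline command.

```shell
# link_latest LINK_NAME TARBALL
# Create or replace a relative symlink inside $SOFT_HOME.
link_latest() {
    ln -sfn "$2" "$SOFT_HOME/$1"
}

# Example use, with the tarball names from the listing above:
#   link_latest dock-latest  DOCK.3.8.4.3d.tar.gz
#   link_latest jchem-latest jchem-19.15_r1.tar.gz
#
# Setting a software variable manually instead of relying on a link:
#   export DOCK_VERSION=$SOFT_HOME/DOCK.3.8.4.3d.tar.gz
```

Because the link is replaced in place, pointing it at a newer tarball changes the version used by subsequent submissions without editing any environment file.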

We use the following software:

  • DOCK_VERSION
  • JCHEM_VERSION
  • PYENV_VERSION
  • CORINA_VERSION
  • OPENBABEL_VERSION
  • EXTRALIBS_VERSION
  • JAVA_VERSION

Note on EXTRALIBS: run the pipeline with an empty EXTRALIBS package (but all other software accounted for) and see which shared libraries are reported as missing in the error log. Locate the missing libraries and add them to the EXTRALIBS package; its contents are added to LD_LIBRARY_PATH.

Examples

Minimal Example

export INPUT_FILE=$PWD/example.smi
export OUTPUT_DEST=$PWD
bash submit-all-jobs-slurm.bash

BKS Example - limit time to 2 hours, change batch size variables. Slurm tasks should automatically save progress when reaching their time limit.

export INPUT_FILE=$PWD/example.smi
export OUTPUT_DEST=$PWD/ligand_building
export SBATCH_ARGS="--time=02:00:00"
export LINES_PER_BATCH=20000
export LINES_PER_JOB=25
export MAX_BATCHES=15
bash submit-all-jobs-slurm.bash

Wynton Example - limit time to 30 minutes, with a soft limit 1 minute 30 seconds before the hard limit. The interrupt generated by the soft limit signals the job to save progress for any resubmission and exit.

export INPUT_FILE=$PWD/example.smi
export OUTPUT_DEST=$PWD/ligand_building
export QSUB_ARGS="-l s_rt=00:28:30 -l h_rt=00:30:00 -r y"
bash submit-all-jobs-sge.bash

Resubmission

If your build jobs have finished (or timed out) and you want to continue processing whatever has not been processed yet, just run submit-all-jobs-slurm.bash or submit-all-jobs-sge.bash again with the same environment arguments. The submit-all script will detect which entries haven't finished and resubmit them.

Repatriation

At BKS, we currently store the tarred output of the pipeline at /nfs/exb/zinc22/tarballs. We use the following script to repatriate output from other clusters to our cluster:

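### migrate_output.bash
# Expects OUTPUT_DEST, MIGRATE_USER and PW_FILE to be set in the environment.

for output in "$OUTPUT_DEST"/*.batch-3d.d; do
        echo "starting rsync on $output to $MIGRATE_USER@files2.docking.org"
        # Copy this batch's 'out' directory to <name>.out on the BKS file server.
        sshpass -f "$PW_FILE" rsync -arv "$output/out" "$MIGRATE_USER@files2.docking.org:/nfs/exb/zinc22/tarballs/$(basename "$output").out"
done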

sshpass is optional here but preferable for convenience's sake. Since files2.docking.org is only visible within the UCSF network, any clusters outside will need to maintain a network tunnel when rsyncing.

Errors

Sometimes an output tarball will have few or no entries. Certain molecule types will fail to build, and these molecules often end up bunched together (e.g. if the input file is sorted by SMILES). Additionally, a small percentage of all molecules may fail to be processed by corina or amsol. If neither of these explains your missing entries, check that tarball's corresponding log entry for more information.
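
One way to find sparse tarballs is to count the db2.gz members in each. The sketch below is not part of the pipeline: count_db2 is a hypothetical helper, and it assumes the members of interest have names ending in db2.gz.

```shell
# count_db2 TARBALL
# Print the number of members whose names end in db2.gz.
count_db2() {
    tar -tzf "$1" | grep -c 'db2\.gz$' || true   # grep exits 1 on a zero count
}

# Example use: list output tarballs, sparsest first.
#   find "$OUTPUT_DEST" -path '*.batch-3d.d/out/*' -name '*.tar.gz' |
#   while read -r t; do
#       printf '%s\t%s\n' "$(count_db2 "$t")" "$t"
#   done | sort -n | head
```

Tarballs at the top of that listing are the ones whose log entries are worth checking first.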

Additional Notes

It is safe to re-run the same file multiple times: the script makes sure not to re-run any jobs that have already completed successfully. This only holds if that file's corresponding batch-3d.d output directory has not been moved or deleted.

For example, if one of your nodes went down and caused a number of jobs to fail, it would be safe to re-run the same submit-all-jobs script to re-submit those jobs, assuming no jobs for that file are currently queued or running.

back to ZINC22:Building_3D