Building The 3D Pipeline ZINC22

Introduction

The 3D pipeline is a collection of scripts and software packages that enable the massively parallel creation of dockable 3D molecules.

Requirements

ChemAxon license (email John for license details)
OpenEye license (email John for license details)
Corina License (comes packaged with the executable, so you will need to provide the executable)
SGE or SLURM queueing system installed on your cluster
A networked file system installed on your cluster

Setup Instructions

Clone the appropriate branch of the zinc-3d-build-3 repository (github link) to your working directory
Secure copy the software distribution from our cluster at (scp link)
Extract the software distribution to the $HOME/soft directory of the user that will be running the script

The first time this script is submitted to a machine, it copies the software distribution from $HOME/soft to local storage and extracts it there.
Because of this, your $HOME must be global (visible from every node) if you are running the script unmodified.
If you don't have a global home, you can specify an alternative global directory to copy the software from by exporting SOFT_HOME before running submit-all-jobs.bash (see below).

Copy your licenses into the same directory you extracted the software to. Copy your corina distribution (as a tar.gz) into this same folder.

The script assumes that the jchem and openeye licenses are named ".jchem-license.cxl" and ".oe-license.txt" respectively, and that the corina distribution is named "corina.tar.gz".
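The file names above can be checked before submitting anything. The following is a minimal sketch, not part of the pipeline itself: check_soft_dir is a hypothetical helper, and it assumes the licenses and corina tarball sit directly in the software directory ($HOME/soft, or SOFT_HOME if you export it).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the pipeline): report which of the
# expected license/corina files are missing from the software directory.
check_soft_dir() {
    local dir="$1" missing=0 f
    for f in .jchem-license.cxl .oe-license.txt corina.tar.gz; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every expected file was found.
    [ "$missing" -eq 0 ]
}

# Usage: check SOFT_HOME if it is exported, otherwise the default location.
# check_soft_dir "${SOFT_HOME:-$HOME/soft}" && echo "setup looks complete"
```

The function only tests that the files exist; it does not verify that the licenses are valid.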

Running The Script

export INPUT_FILE=$YOUR_INPUT_FILE.smi
export OUTPUT_DEST=$SOME_GLOBAL_DIRECTORY
# if no global home:
# export SOFT_HOME=$YOUR_SOFT_HOME
./submit-all-jobs.bash

BKS Example

export INPUT_FILE
export OUTPUT_FILE
export SBATCH_ARGS="--time=00:45:00 --requeue"
export LINES_PER_BATCH=50000
export LINES_PER_JOB=50
export TEMPDIR=/dev/shm
./submit-all-jobs-slurm.bash

Input And Output

INPUT_FILE

The input .smi file to be built. This file should contain only two columns of data: (SMILES, NAME) with no header.
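A quick sanity check of the two-column layout can catch malformed input before any jobs are queued. This sketch assumes the columns are separated by whitespace, which the description above does not state explicitly. check_smi is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every line of a .smi file that does not have
# exactly two whitespace-separated fields, and fail if any such line exists.
# Assumption: SMILES and NAME are separated by spaces or tabs.
check_smi() {
    awk 'NF != 2 { printf "line %d: expected 2 columns, found %d\n", NR, NF; bad = 1 }
         END { exit bad }' "$1"
}

# Usage:
# check_smi "$INPUT_FILE" && echo "input layout looks fine"
```

This does not detect a header line whose layout happens to have two fields, and it does not validate the SMILES strings themselves.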

OUTPUT_DEST

The base directory for output to be stored. The script will create a sub-directory here named $INPUT_FILE.batch-3d.d

Within this output directory there are 3 sub-directories:

in
log
out

'in' contains the input file split into batches and sub-batches. By default the script first splits the input file into batches of 50000 lines, then splits each batch into sub-batches of 50 lines. Each individual job works on one sub-batch, and each array job works on one batch of 50000. The other directories alongside 'in' share the same directory structure.
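To estimate how many array jobs and individual jobs a given input will produce, the split arithmetic can be sketched as below. This is an illustration only: count_jobs and ceil_div are hypothetical helpers, and the sketch assumes that a final partial batch is split into sub-batches the same way as a full one. The defaults mirror the batch sizes of 50000 and 50 described above.

```shell
#!/usr/bin/env bash
# ceil_div A B: integer ceiling of A / B.
ceil_div() { echo $(( ($1 + $2 - 1) / $2 )); }

# count_jobs TOTAL_LINES [LINES_PER_BATCH] [LINES_PER_JOB]
# Hypothetical helper: estimate the number of batches (array jobs) and
# sub-batches (individual jobs) for an input of TOTAL_LINES lines.
count_jobs() {
    local total=$1 per_batch=${2:-50000} per_job=${3:-50}
    local full=$(( total / per_batch ))   # number of complete batches
    local rest=$(( total % per_batch ))   # lines left for a partial batch
    local batches jobs
    batches=$(ceil_div "$total" "$per_batch")
    jobs=$(( full * $(ceil_div "$per_batch" "$per_job") + $(ceil_div "$rest" "$per_job") ))
    echo "$batches batches, $jobs jobs"
}

# Usage:
# count_jobs "$(wc -l < "$INPUT_FILE")"
```

For instance, an input of 120000 lines with the default sizes gives two full batches of 1000 jobs each plus one partial batch of 400 jobs.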

'log' contains log messages from the jobs. If you re-submit a file, be aware that log messages from previous runs on that file will be overwritten.

'out' contains the tar.gz output from each job. Each tarball should contain several 3D molecule formats for each molecule in that job's sub-batch, including one or more db2.gz files.

Errors

Sometimes an output tarball will have few or no entries. Certain molecule types fail to build, and these molecules often end up bunched together (e.g. if the input file is sorted by SMILES). Additionally, a small percentage of all molecules may fail to be processed by corina or amsol. If neither of these explains your missing entries, check the log entry corresponding to that tarball for more information.
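One way to locate sparse tarballs is to count the db2.gz entries in each one. The sketch below is not part of the pipeline. It assumes the output files end in .tar.gz and that dockable entries end in db2.gz, as described in the output section. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# count_db2 TARBALL: number of archive entries whose name ends in db2.gz.
# grep -c exits non-zero when the count is 0, so "|| true" keeps that from
# being treated as an error.
count_db2() {
    tar -tzf "$1" | grep -c 'db2\.gz$' || true
}

# list_sparse_tarballs OUT_DIR [MIN]
# Print "<count> <path>" for each tarball under OUT_DIR that holds fewer
# than MIN db2.gz entries (MIN defaults to 1, i.e. tarballs with none).
list_sparse_tarballs() {
    local dir=$1 min=${2:-1} t n
    find "$dir" -type f -name '*.tar.gz' | sort | while IFS= read -r t; do
        n=$(count_db2 "$t")
        if [ "$n" -lt "$min" ]; then
            echo "$n $t"
        fi
    done
}

# Usage (the path is a placeholder for your own output directory):
# list_sparse_tarballs /path/to/your-file.smi.batch-3d.d/out 10
```

Each reported path can then be matched against the file at the same relative location under the log directory.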

Additional Notes

This script was designed for a 64-bit architecture. You will likely run into library issues if you try to run it on 32-bit machines. If that's all you have, you can try swapping out libg2c.so.0* in lib.tar.gz for a 32-bit version, but I cannot help you beyond that.

It is safe to re-run the same file multiple times: the script does not re-run any jobs that have already completed successfully. This holds only if that file's corresponding batch-3d.d output directory has not been moved or deleted.

For example, if one of your nodes went down and caused hundreds of jobs to fail, it would be safe to re-run ./submit-all-jobs.bash to re-submit those jobs, assuming no jobs for that file are currently queued or running.
