New 3D Building On Wynton

I have revamped the script for the second time, this time configuring it to use SGE on Wynton.
The script lives in ~/zinc-3d-build-3 on jji@wynton.
I had to reinstall and reconfigure some of the software, as it was not working properly on the Wynton cluster.
This software has been installed in various places in $HOME.

The script results and the log files are organized in a similar fashion, explained below.

There is one script of interest for running jobs: submit-all-jobs.bash. This script takes a source SMILES file and an output destination.
The script then submits a number of jobs to build 3D ligand data and saves the results in an organized fashion to the output destination.

Each job submitted by the script works on a group of 100 substances. A group of 10,000 substances, or 100 jobs, is called a "batch".
Each group of 100 SMILES read in by the script is assigned a batch no. based on its position in the source file.

ex:

smiles | ZINC ID | line no. | batch no.
=======================================
CCAA   | ZINC000 | 0        | 0
...
CCZZ   | ZINCaaa | 10,000   | 1
...
CCXX   | ZINCbbb | 20,000   | 2
...
CCYY   | ZINCccc | 30,000   | 3


Basically, BATCH_NO = LINE_NO / 10000 (integer division, with line numbers starting at 0).
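
The mapping from line number to batch and job can be sketched as follows. This is a minimal illustration, assuming line numbers start at 0 and integer division is intended, as in the table above. The function names are hypothetical and are not taken from submit-all-jobs.bash.

```python
JOB_SIZE = 100        # substances handled by one job
BATCH_SIZE = 10_000   # substances per batch, i.e. 100 jobs


def batch_no(line_no: int) -> int:
    """Batch number for a 0-based line number (integer division)."""
    return line_no // BATCH_SIZE


def job_no_in_batch(line_no: int) -> int:
    """Index (0-99) of the job within its batch that handles this line."""
    return (line_no % BATCH_SIZE) // JOB_SIZE
```

Under these assumptions, line 10,000 falls in batch 1 and is handled by the first job of that batch.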

Each job saves its results tarball to /wynton/scratch/jji/$SRC_FILENAME/$BATCH_ID/$END_ID.tar.gz
Each job saves its stdout and stderr logs to /wynton/home/shoichetlab/jji/zinc-3d-build-3/logs/$SRC_FILENAME/$BATCH_ID/$END_ID.*
These directories can be reconfigured by setting the environment variables OUTPUT_DEST and LOG_BASE_DIR, respectively, before running the submit-all-jobs script.

$SRC_FILENAME is the filename of the source file this group of jobs was run from.
$BATCH_ID is the batch no. of the SMILES.
$END_ID is the line no. of the last substance in the job.
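
For illustration, the path layout described above can be reproduced like this. It is a sketch only: the helper name job_paths is hypothetical, and it assumes OUTPUT_DEST and LOG_BASE_DIR correspond to the two root directories quoted above.

```python
import posixpath

# Roots copied from the paths quoted above. Treating them as the default
# values of OUTPUT_DEST and LOG_BASE_DIR is an assumption.
DEFAULT_OUTPUT_DEST = "/wynton/scratch/jji"
DEFAULT_LOG_BASE_DIR = "/wynton/home/shoichetlab/jji/zinc-3d-build-3/logs"


def job_paths(src_filename, batch_id, end_id,
              output_dest=DEFAULT_OUTPUT_DEST,
              log_base_dir=DEFAULT_LOG_BASE_DIR):
    """Return (results tarball path, log path prefix) for one job."""
    tarball = posixpath.join(
        output_dest, src_filename, str(batch_id), f"{end_id}.tar.gz")
    # stdout and stderr share this prefix and differ only in their extension.
    log_prefix = posixpath.join(
        log_base_dir, src_filename, str(batch_id), str(end_id))
    return tarball, log_prefix
```

For a hypothetical source file named example.smi, job_paths("example.smi", 1, 10099) would give the tarball path /wynton/scratch/jji/example.smi/1/10099.tar.gz.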

Revised 3D Building On Wynton

The batch size has been changed to 50K, and batches are now submitted in arrays of 1000 instead of one by one.
Batches are now identified alphabetically instead of numerically, e.g. aaa instead of 0, aab instead of 1, etc.
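
The alphabetical naming can be sketched as a base-26 conversion. This assumes fixed-width, three-letter labels that continue in the same order as the two examples given (aaa, aab, ...); only those two values are confirmed above, and the function name is hypothetical.

```python
import string


def batch_label(index: int, width: int = 3) -> str:
    """Convert a 0-based batch index to a fixed-width lowercase label."""
    if index < 0 or index >= 26 ** width:
        raise ValueError(f"index {index} does not fit in {width} letters")
    letters = []
    for _ in range(width):
        # Peel off the least significant base-26 digit each iteration.
        index, remainder = divmod(index, 26)
        letters.append(string.ascii_lowercase[remainder])
    # Digits were collected least significant first, so reverse them.
    return "".join(reversed(letters))
```

Under this assumption, index 26 would map to aba, and three letters cover 17,576 batches.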
Logs are saved to local /scratch while the job runs, and are moved to $OUTPUT_DIR/log once the job has completed. Previously, logs were streamed to NFS, which caused a lot of I/O strain.
Result tarballs are saved to $OUTPUT_DIR/out.
Input batches are saved to $OUTPUT_DIR/in.
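
The log staging pattern described above (write locally, move once at the end) looks roughly like this. This is a generic sketch, not the actual job script: the function name, the log file name job.log, and the scratch_root parameter are all assumptions made for illustration.

```python
import os
import shutil
import tempfile


def run_with_staged_log(job, output_dir, scratch_root=None, log_name="job.log"):
    """Run job(log_file) with the log on local scratch, then move it
    to output_dir/log once the job has finished."""
    log_dir = os.path.join(output_dir, "log")
    os.makedirs(log_dir, exist_ok=True)
    # Private working directory on local scratch (the system temp
    # directory is used when scratch_root is None).
    work_dir = tempfile.mkdtemp(dir=scratch_root)
    local_log = os.path.join(work_dir, log_name)
    try:
        with open(local_log, "w") as log_file:
            job(log_file)
    finally:
        # One transfer at the end instead of many small writes to
        # shared storage while the job is running.
        final_log = shutil.move(local_log, os.path.join(log_dir, log_name))
        shutil.rmtree(work_dir, ignore_errors=True)
    return final_log
```

The move sits in a finally block so that the log of a failed job still reaches the output directory.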