How to do parallel search of smi files on the cluster

This tutorial shows how to run a parallel search of smi files on the cluster. The files and scripts can be found in /nfs/home/jizhou/ex7/2D/test-parallel on gimel.compbio.ucsf.edu. Indexing and parallel computing are used to speed up the search. The performance of qsub depends on the workload of the whole cluster, but searching with qsub generally scales well.

Create a folder with the following files and scripts:

SUBMIT.sh
input.txt
search_smi.sh
merge.sh

SUBMIT.sh

SUBMIT.sh contains the bash code for qsub. It specifies the qsub command, the parameters for qsub, the input file, the function script, and the parameters for that function. An example is shown below.

#!/bin/bash

# Comments sit above the command: text after a trailing backslash would break the line continuation.
#   qsub-mr            the qsub command
#   -l 5               the number of lines handled by each task, here 5
#   -N test            the job name (the jobs appear as test-map in the qstat output)
#   input.txt          the file listing the input file paths
#   ./search_smi.sh    the search function to run
#   -q "..."           parameter for search_smi.sh: the query to search for
/nfs/soft/tools/utils/qsub-slice/qsub-mr \
    -l 5 \
    -N test \
    input.txt \
    ./search_smi.sh \
    -q "CS(=O)(=O)CCNCc1ccccc1"
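The -l option controls how the input list is divided among tasks. The sketch below estimates how many tasks a given input would produce. It assumes that qsub-mr creates one task per chunk of lines, with a final partial chunk when the line count is not a multiple of the chunk size; the function name tasks_needed is hypothetical and not part of qsub-mr.

```shell
# Estimate the number of tasks for a given number of input lines.
# Assumption: one task per chunk of lines, rounding up for a partial last chunk.
tasks_needed() {
    local total_lines=$1 lines_per_task=$2
    echo $(( (total_lines + lines_per_task - 1) / lines_per_task ))
}

# Example: 20 input lines with -l 5
tasks_needed 20 5
```

On the cluster, the first argument could be taken from the input file, for example with wc -l < input.txt.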

input.txt

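input.txt lists the paths of the input files, one per line. An example is shown below. You can generate this file with ls *.smi > input.txt; note that this writes names relative to the current directory, whereas the example uses absolute paths.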

/nfs/home/jizhou/ex7/2D/CD/CDAA.smi
/nfs/home/jizhou/ex7/2D/CD/CDAB.smi
/nfs/home/jizhou/ex7/2D/CD/CDAC.smi
/nfs/home/jizhou/ex7/2D/CD/CDAD.smi
/nfs/home/jizhou/ex7/2D/CD/CDAE.smi
...
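Since the example above uses absolute paths, a small helper can produce the same form for any directory. This is a minimal sketch; the function name list_smi_files is hypothetical, and it only looks at files directly inside the given directory.

```shell
# Print the absolute path of every .smi file in a directory, one per line.
list_smi_files() {
    local dir f
    # Resolve the directory to an absolute path; fail if it does not exist
    dir=$(cd "$1" && pwd) || return 1
    for f in "$dir"/*.smi; do
        # Skip the unexpanded pattern when no .smi file is present
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage, with the directory from the example above:
# list_smi_files /nfs/home/jizhou/ex7/2D/CD > input.txt
```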

search_smi.sh

search_smi.sh is the search function run by qsub. Its core is mol2img_trial, located in "/nfs/home/jizhou/work/Projects/smi_index/dotmatics/", which generates an index for the smi file to speed up searching. search_smi.sh requires an input query for the search. An example is shown below:

-q "CS(=O)(=O)CCNCc1ccccc1"
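The query must be quoted, because a SMILES string contains characters such as ( and ) that the shell would otherwise interpret. The contents of search_smi.sh are not shown here, so the following is only a hypothetical sketch of how a script could read a -q option with getopts; parse_query is an illustrative name, not a function from the actual script.

```shell
# Hypothetical sketch: read the value of a -q option and print it.
parse_query() {
    local OPTIND=1 opt query=""
    while getopts "q:" opt; do
        case "$opt" in
            q) query="$OPTARG" ;;
            *) return 1 ;;
        esac
    done
    # A query is required
    if [ -z "$query" ]; then
        echo "usage: -q SMILES" >&2
        return 1
    fi
    printf '%s\n' "$query"
}

# The quotes keep the SMILES string as a single, unmodified argument
parse_query -q "CS(=O)(=O)CCNCc1ccccc1"
```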

Run SUBMIT.sh

Run SUBMIT.sh to submit the job to the cluster. The job runs in the background. When it finishes, a new directory named outputs is created in the current folder, and the results are stored in outputs/. You can use qstat to check the status of your jobs. For more information, please refer to the qstat documentation.

qstat                         # check the status of jobs; an example is shown below

-bash-4.1$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
6511305 1.25000 test-map   jizhou       r     07/19/2018 10:42:43 all.q@n-5-29.cluster.ucsf.bksl     1 1
6511305 0.75000 test-map   jizhou       r     07/19/2018 10:42:43 all.q@n-9-20.cluster.ucsf.bksl     1 2
6511305 0.58333 test-map   jizhou       r     07/19/2018 10:42:43 all.q@n-1-132.cluster.ucsf.bks     1 3
6511305 0.50000 test-map   jizhou       r     07/19/2018 10:42:43 all.q@n-9-21.cluster.ucsf.bksl     1 4
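To know when it is safe to run the merge step, the qstat listing can be filtered by job name. The sketch below assumes the column layout shown above, where the job name is the third field; count_tasks and wait_for_job are hypothetical helper names, and the 30-second polling interval is an arbitrary choice.

```shell
# Count the tasks with a given job name in qstat output read from stdin.
# Assumption: the job name is the third whitespace-separated field.
count_tasks() {
    awk -v name="$1" '$3 == name { n++ } END { print n + 0 }'
}

# Poll qstat until no task with the given name is listed any more.
wait_for_job() {
    while [ "$(qstat | count_tasks "$1")" -gt 0 ]; do
        sleep 30
    done
}

# Usage on the cluster:
# qstat | count_tasks test-map
# wait_for_job test-map
```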

merge.sh

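When all jobs are completed, run merge.sh to check the outputs. Each output line holds a SMILES string and a ZINC ID, followed by a numeric score after the | character. Sample outputs are shown below: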

CS(=O)(=O)CCNCc1ccncc1 ZINC000037491283|70.6
CS(=O)(=O)CCNCc1ccc(O)cc1 ZINC000037740328|70.6
CS(=O)(=O)CCNCCOc1ccccc1 ZINC000048777006|70.6
CS(=O)(=O)CCNCc1ccccc1 ZINC000037491280|100.0
CS(=O)(=O)CCNCCc1ccccc1 ZINC000037491281|75.0
...
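The sample lines above are not ordered by the number after the | character. Assuming that number is a similarity score where higher is better (the exact query match scores 100.0 in the sample), the results can be ranked with sort. The function name sort_by_score is hypothetical, and piping from merge.sh assumes it prints its results to standard output.

```shell
# Sort result lines by the numeric value after "|", highest first.
# Assumption: every line has the form "SMILES ID|score".
sort_by_score() {
    LC_ALL=C sort -t '|' -k2,2nr
}

# Usage, assuming merge.sh writes the merged lines to standard output:
# ./merge.sh | sort_by_score | head
```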

Clean up

To clean up, run /nfs/soft/tools/utils/qsub-slice/qsub-mr --clean. The outputs directory and its files will be removed.

/nfs/soft/tools/utils/qsub-slice/qsub-mr --clean
