ELC: Difference between revisions

From DISI
Jump to navigation Jump to search
No edit summary
No edit summary
 
Line 2: Line 2:


Eddie's Legacy Code (ELC) consists an RMSD script, a rank script, several scripts to process DockBlaster's job folders, [[Eplop|a PLOP wrapper]], and [[MPose|a script to run single mode DOCK]].
Eddie's Legacy Code (ELC) consists an RMSD script, a rank script, several scripts to process DockBlaster's job folders, [[Eplop|a PLOP wrapper]], and [[MPose|a script to run single mode DOCK]].
{{TOCright}}


==RMSD Script==
==RMSD Script==

Latest revision as of 12:37, 21 March 2014

Overview

Eddie's Legacy Code (ELC) consists an RMSD script, a rank script, several scripts to process DockBlaster's job folders, a PLOP wrapper, and a script to run single mode DOCK.

RMSD Script

The RMSD script is a wrapper around the FORTRAN program rmsd. When the FORTRAN program accepts only PDB input with atoms ordered properly across the two inputs, the wrapper performs necessary format transform using OEChem and makes sure the ordering of atoms is correct.

To run the script, invoke it with two molecule files in whatever format that is accepted by OEChem:

~ycao/eplop/rmsd molecule1.mol2 molecule2.sdf

The first argument can be a multi-molecule MOL2 file. In this case, each entry will be compared to the molecule referenced by the second argument.

Note:

  1. This script cannot detect problmes. For example, it is possible that if the molecules are not valid, a result will still be printed and no error message woudl appear.
  2. This script utilizes convert.py and file2file.py to perform format conversion. Make sure you have set up the PATH correctly if you encounter problem running this script.

Rank Script

The rank script takes two (or four, or six, or any even number) of score files, and calculate rank of the molecule(s) in the first file when compared against those in the second file.

The score file is a simple space- or tab-delimited file, whose first column is the compound ID, and the following coumns are scores. For example:

C02953194         -8.610           1.379         -33.057           7.692
C13059374         -8.129           0.155         -28.957          -9.227
C09926947         -9.664          -0.753         -32.827         -21.917
C13043202         -6.366           1.365         -25.151         -10.238
C08517892         -0.529           4.565         -10.501          22.875
C13218443          0.920           4.319          -4.919          -2.689
C13305090          0.702           5.530          -7.720          -0.875
C13305085         -6.405           2.963         -27.119          -3.966

Such a format is used by score.py, the rescoring wrapper, and eplop, the PLOP wrapper.

As an example, to rank the ligand against the decoys using rescoring scores, you can run

cd /raid0/tmp/dockgui/06/24676/
python ~ycao/mpose/rank.py run.ligand/1.A/rescore.out run.decoys/1.A/rescore.out

This will compare ligand in the first file against decoys in the second file. Because there is only one row (and therefore one ligand) in the first score file, the output will consist of one row; because the score files consists of four scores, the output will have four ranks per row. In this example, the output is:

0.0198  0.0198  0.0198  0.0198

When the same molecule ID appears multiple times in the first score file, only the best score (being the smallest in value) will be considered. This is useful when you have multiple poses and therefore multiple scores for one molecule.

When scores of different types are stored in separate files, you are allowed to put more score files. For example, in the case of DockBlaster, DOCK scores are in a.energy, and rescoring results are in rescore.out, then you can run the following:

cd /raid0/tmp/dockgui/06/24676/
python ~/mpose/rank.py run.ligand/1.A/a.energy run.ligand/1.A/rescore.out run.decoys/1.A/a.energy run.decoys/1.A/rescore.out

The script will do a join by compound name before calculating the ranks. The above will give you an output of one row (corresponding to one ligand) and five columns (corresponding to five scores per molecule):

0.0000  0.0198  0.0198  0.0198  0.0198

Note that you must put the score files for the ligands before the score files for the decoys, and the number of score files for the ligands must be the same as that for decoys.

When scores of the same types are stored in separate files, you might want to perform concatenation before calculating ranks. Use the -c option in this case.

Note: The definition of rank is rank / (number of decoys + number of ligands), where rank starts with 0.

Scripts to Process DockBlaster Job Folders

The scripts are located under ~ycao/dockenv/etc

A few scripts have been written to apply the SIMD-style calculation to a series of DockBlaster job folders. SIMD stands for Single Instruction, Multiple Data. It means to apply the same processing logic to multiple data. In the context of processing DockBlaster job folders, it means to process all the folders using the same script or program, either sequentially or in parallel with the help of qsub.

The scripts below depends on the availability of an index file, which contains of a list of DockBlaster job IDs. Each line of the index file starts with one job ID, while the remaining content of the line will be ignored. Therefore, the MEMO generated by running auto_dock.csh works perfectly for this purpose.

foralljobs.py

This script takes an index file and a command, and will invoke the command in each of the jobs whose IDs are listed in the index file. For example:

foralljobs.py DockBl_ID_MEMO_astex pwd

will run pwd in each of the job folders for my astex DockBlaster jobs.

The command to run, given as the second argument, can be more complicated. I used the following command to back up job folders:

foralljobs.py ~/work/astex2/list 'pwd; A=`pwd`; B=`basename $A`; cd ..; tar -cjf ~/data/astex2/$B.tar.bz2 $B --exclude $B/run.similar'

To process jobs in parallel, just invoke the processing script with qsub:

foralljobs.py DockBl_ID_MEMO_astex 'qsub myprocessing.csh'

foralldirs.py

Similar to foralljobs.py, this script also takes an index file and a command. Instead of running the command in each job folder, it runs the command under each parametered run folder, such as run.decoys/1.A. To demonstrate the use, and at the same time see what folders are traversed, run the following command:

foralljobs.py DockBl_ID_MEMO_astex pwd

As a concrete example, you can perform rescoring with:

foralljobs.py DockBl_ID_MEMO_astex 'score.py a.new.mol2 > rescore.out'

This script allows you to consider only the run.ligand/* folders:

foralljobs.py --ligand DockBl_ID_MEMO_astex pwd

or the run.decoys/* folders only:

foralljobs.py --decoys DockBl_ID_MEMO_astex pwd

As a concrete example, you can perform enrichment analysis using rescoring result by first create the following simple script named /tmp/process.sh:

DIR=`pwd`
DIRN=`basename $DIR`
SISTER=../../run.decoys/$DIRN
~/mpose/rank.py rescore.out $SISTER/rescore.out

and then run

foralljobs.py --ligand DockBl_ID_MEMO_astex 'bash /tmp/process.sh'

In case of DUD multi-ligands-multi-decoys experiments, two extra run folders are created, namely run.dudact and run.duddec. This script can also traverse them:

foralljobs.py --dud DockBl_ID_MEMO_astex pwd

NOTE: This script will check each traversed folder, and make sure DOCK has finished in that folder. Specifically, it makes sure the OUTDOCK file ends with a line starting with elapsed time (sec):.

eplop

eplop, or Easy PLOP, is a wrapper script for PLOP.

MPose

MPose, or Multiple Pose, is a script to run single mode DOCK.