MUD - Michael's Utilities for Docking

Latest revision as of 00:57, 11 March 2014

What's in MUD?

  • Tools to start, check, and restart dock jobs
  • Tools to combine, enrich, plot, and view docking results

Setting up MUD

  • For convenience, point a shell variable to the base mud directory to save typing
set mud=~mysinger/code/mud/trunk
  • If you use MUD a lot, you can add this to your ~/.login
  • Then simply run commands like this:
$mud/submit.csh
$mud/check.py -h
  • Use -h or --help to get full help information for the .py (python) scripts
  • The .csh scripts will automatically print usage information if misused
  • The scripts automatically use their invocation path to find other scripts and libraries they depend on.

Job Control

Main Workflow

For a quick summary of what to do first see SGE_Cluster_Docking. For a detailed look at how to get the details right see How to run and analyze a DOCK run by hand.

  • Submit a parallel job to the cluster
$mud/submit.csh

Uses 'dirlist' to determine which directories to run. Similar to startdockbksX, but also indicates job submission by touching a submitted file in each directory.

  • Check parallel job status
$mud/check.py

Indicates the status of unfinished (or unsubmitted) jobs. Note that it simply returns nothing if everything is finished.

  • Restart all failed subjobs
$mud/restart.py

This works even if some subjobs are still running. Occasionally, however, jobs can fail with no detectable remnants. To force those jobs to restart you can use the -f option, but beware that this will also restart all subjobs that are still running.

Specialized Commands

  • Submit job to the local machine
$mud/sublocal.csh
  • Submit a single directory to the cluster
qsub $mud/runsge.csh
  • Submit a single directory to the local machine
$mud/runsubdir.csh
  • Remove docking output leaving only input - will DELETE even completed jobs
$mud/clean.py
  • Restart single directory
$mud/restartdir.py

Job Analysis

  • Enrichment plots are sensitive to how consistently molecules are treated and to proper accounting for all docked molecules. The combine script accounts for all docked molecules by detecting bumped-out, no-match, and timed-out molecules.

To achieve consistency, you have two options:
  1. Write coordinates for all molecules (what I use). In INDOCK, set number_save to 50000 or something high enough to capture all dockable hierarchies. DOCK output is now gzipped so this is cheaper in disk space than it used to be.
  2. Do not check for broken molecules. Use the -b option when running combine.py.

Combining Parallel Jobs

  • Merge all parallel jobs into a single set of unique scores.
$mud/combine.py

This combine step carefully accounts for all docked molecules, which gives more informative enrichment plots.

  • Options:

Use -b or --broken to skip finding broken molecules. Use -d or --done to indicate that all subjobs are complete, for the case where you did not submit with a MUD submission script. Use -p or --prefix if your output files are named something other than test. Use --box if your box file is not at ../../grids/box relative to your subjob directories.

  • Creates:
  1. combine.scores - fully processed scores, using the best one for each id
  2. combine.raw - contains all scores as scraped from DOCK output
  3. combine.broken - broken molecules and the reason they failed
  4. combine.zeroes - important sanity check

format of combine.scores:

<id> <shape> <elect> <VdW> <polar solv> <apolar solv> <total> <subdir>

The .zeroes file is a sanity check because it lists the number of molecules followed by the number of zeroes in each scoring column. Past experience has shown that when DOCK fails randomly and silently, it often generates a large number of zero scores. If this happens, simply re-running the job will give better results.
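The zero-count check described above can be sketched in a few lines of Python. This is only an illustration, not the code combine.py uses: the helper name count_zeroes is hypothetical, the sample rows are made up, and it assumes that combine.scores holds one molecule per line with whitespace-separated fields in the column order shown above.

```python
# Hypothetical helper, not part of MUD: count zero scores per column
# in text laid out like combine.scores (assumed whitespace-separated).
SCORE_COLUMNS = ["shape", "elect", "VdW", "polar solv", "apolar solv", "total"]


def count_zeroes(lines):
    """Return (number of molecules, {column name: number of zero scores})."""
    zeroes = {name: 0 for name in SCORE_COLUMNS}
    molecules = 0
    for line in lines:
        fields = line.split()
        # Expect: id, six score columns, subdir.
        if len(fields) != len(SCORE_COLUMNS) + 2:
            continue  # skip blank or malformed lines
        molecules += 1
        for name, value in zip(SCORE_COLUMNS, fields[1:7]):
            if float(value) == 0.0:
                zeroes[name] += 1
    return molecules, zeroes


# Made-up rows for illustration only.
sample = [
    "C00000001 -10.0 -5.5 -20.1 3.2 -1.0 -33.4 sub01",
    "C00000002 0.0 0.0 0.0 0.0 0.0 0.0 sub02",
]
molecules, zeroes = count_zeroes(sample)
print(molecules, zeroes)
```

A large zero count relative to the number of molecules is the warning sign mentioned above.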

Computing Enrichments

  • Compute enrichment starting from the combined scores.
$mud/enrich.py -s -l LIGAND_FILE
< or >
$mud/enrich.py -l LIGAND_FILE -d DECOY_FILE

Generates both enrichment and ROC curves, both for the ligands against all molecules and for the ligands versus just the decoys. It will try to run combine if it has not been run yet, but will do so only with defaults for every option.

  • Input:

Use -l to specify the ligand identifier file and -d to specify the decoy identifier file.

The identifier files simply contain an id for each known ligand (or decoy) that matches the one in the docking databases. The script is smart enough to match "ZINC12345678" to "C12345678", so either form is acceptable.
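One way such id matching could work is to reduce both forms to a common key before comparing. The sketch below is an assumption about the rule, not the logic enrich.py actually uses; normalize_id and load_ids are hypothetical names, and the file is assumed to hold one id per line.

```python
def normalize_id(identifier):
    """Map 'ZINC12345678' and 'C12345678' to the same key (assumed rule)."""
    identifier = identifier.strip()
    if identifier.startswith("ZINC"):
        # Drop the leading 'ZIN' so that only the short 'C...' form remains.
        return identifier[3:]
    return identifier


def load_ids(lines):
    """Build a set of normalized ids, assuming one id per line."""
    return {normalize_id(line) for line in lines if line.strip()}


ligand_ids = load_ids(["ZINC12345678\n", "C00000042\n", "\n"])
print(sorted(ligand_ids))
```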

  • Options:

Use -s or --skip-own-curves to skip consideration of decoys and thus generation of _own curves. Use -f to force combine to run again.

  • Creates:
  1. enrich.txt - Enrichment curve for ligands versus all molecules
  2. roc.txt - ROC curve for ligands versus all molecules
  3. enrich_own.txt - Enrichment curve for ligands versus only the decoys
  4. roc_own.txt - ROC curve for ligands versus only the decoys

_own files are not generated if the -s option is used.

format for output files:

#AUC 50.00  LogAUC 0.00
<x> <y>
<x> <y>
 ...

AUC is area under the curve and the random expectation value is 50%. LogAUC is the area between the log curve and the log random curve, so the random expectation value is 0%. <y> is always "% ligands found", and <x> is either "% database searched" for enrichment plots or "% non-ligands found" for ROC plots.
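To make the file layout and the 50% random expectation concrete, here is a minimal Python sketch that reads text in this format and integrates the curve with the trapezoid rule. The function names are hypothetical, the sample points are made up, and both the percent scale (0 to 100 on each axis) and the trapezoid rule are assumptions; the source does not say how enrich.py integrates.

```python
def parse_curve(lines):
    """Split a curve file into its '#' header values and its (x, y) points."""
    header = {}
    points = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header tokens alternate name, value: AUC 50.00 LogAUC 0.00
            tokens = line[1:].split()
            for name, value in zip(tokens[0::2], tokens[1::2]):
                header[name] = float(value)
        else:
            x, y = line.split()[:2]
            points.append((float(x), float(y)))
    return header, points


def trapezoid_auc(points):
    """Area under the curve in percent, assuming x and y both run 0 to 100."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / 100.0


# Made-up diagonal (random) curve for illustration only.
sample = ["#AUC 50.00  LogAUC 0.00", "0 0", "50 50", "100 100"]
header, points = parse_curve(sample)
auc = trapezoid_auc(points)
print(header, auc)
```

For the diagonal curve the area comes out at 50, matching the random expectation stated above.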

Plotting Enrichments

Easily plot enrichment and ROC curves from one or more jobs.

$mud/plots.py -i . -l New_Run -i ../old_run_dir -l Old_Run -t AmpC
< or >
$mud/plots.py -i .

Generates plots with one curve for each -i input_directory.

  • Options:

Use -s or --skip-own-curves to skip _own curves, especially if they don't exist because enrich.py was run with -s. You can either label each -i INDIR with a -l LABEL, or use no -l options to get the default labels based on parent directory names. Use -t TITLE to change the plot title and filename. Use -o to specify a different output directory. Use -n to get normal instead of semi-log plots (and AUC in place of LogAUC).

  • Creates:
  1. [title_]enrich.png
  2. [title_]roc.png
  3. [title_]enrich_own.png
  4. [title_]roc_own.png

The various graphs have the same meaning as their respective curves from #Computing Enrichments. [title_] is optional and exists when a custom title is given with the -t option.

Computing Energy Histograms

  • Compute energy distributions starting from the combined scores.
$mud/energies.py -s -l LIGAND_FILE
< or >
$mud/energies.py -l LIGAND_FILE -d DECOY_FILE

Generates the energy distributions for the ligands, decoys, and all the other molecules.

  • Input:

Use -l to specify the ligand identifier file and -d to specify the decoy identifier file.

The identifier files simply contain an id for each known ligand (or decoy) that matches the one in the docking databases. The script is smart enough to match "ZINC12345678" to "C12345678", so either form is acceptable.

  • Options:

Use -s or --skip-own-curves to skip consideration of decoys.

  • Creates:
  1. counts.txt - Energy distributions

format for output:

number_of_sections number_of_bins min_energy_threshold max_energy_threshold
##### section_name
bin_upper_edge1 count_below_edge1
...
bin_upper_edgeN count_below_edgeN
ABOVE count_above_last_edge

The sections are for ligands, decoys (optional), and others. The bins and counts define the energy histogram. The bins are finely spaced here in order to have more resolution when combined with other runs, whose energy ranges may be different.
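A reader for this layout might look like the Python sketch below. It is an illustration under stated assumptions, not the parser eplots.py uses: parse_counts is a hypothetical name, the sample data is made up, each section is assumed to start with a literal ##### marker followed by its name, and counts are assumed to be integers.

```python
def parse_counts(lines):
    """Parse text laid out like counts.txt into (header, sections)."""
    rows = [line.split() for line in lines if line.strip()]
    first = rows[0]
    header = {
        "sections": int(first[0]),
        "bins": int(first[1]),
        "min_energy": float(first[2]),
        "max_energy": float(first[3]),
    }
    sections = {}
    current = None
    for fields in rows[1:]:
        if fields[0] == "#####":
            # Start of a new section, e.g. ligands, decoys, or others.
            current = {"bins": [], "above": None}
            sections[" ".join(fields[1:])] = current
        elif fields[0] == "ABOVE":
            current["above"] = int(fields[1])
        else:
            # (bin upper edge, count) pair
            current["bins"].append((float(fields[0]), int(fields[1])))
    return header, sections


# Made-up data for illustration only.
sample = [
    "1 2 -60.0 -20.0",
    "##### ligands",
    "-40.0 3",
    "-20.0 7",
    "ABOVE 1",
]
header, sections = parse_counts(sample)
print(header, sections)
```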

Plotting Energy Histograms

Easily plot energy histograms from one or more jobs.

$mud/eplots.py -i . -l New_Run -i ../old_run_dir -l Old_Run -t AmpC
< or >
$mud/eplots.py -i .

Generates plots with energy distributions for each -i input_directory.

  • Options:

You can either label each -i INDIR with a -l LABEL, or use no -l options to get the default labels based on parent directory names. Use -t TITLE to change the plot title and filename. Use -o to specify a different output directory.

  • Creates:
  1. [title_]counts.png

Visualizing Molecule by Molecule Results

Create a DOCK 4, 5, 6 style PDB file for use in Chimera's ViewDOCK.

$mud/topdock.py -o topdock.pdb
  • Options:

Use -o to specify an output file besides stdout. Use -t NUMBER to choose how many top-scoring molecules to include.

→ Back to Tutorials