MUD - Michael's Utilities for Docking

From DISI
Revision as of 01:44, 5 December 2009 by Mysinger (talk | contribs) (Add energy histogram programs)

What's in MUD?

  • Tools to start, check, and restart dock jobs
  • Tools to combine, enrich, plot, and view docking results

Setting up MUD

  • For convenience, point a shell variable to the base mud directory to save typing
set mud=~mysinger/code/mud/trunk
  • If you use MUD a lot, you can add this to your ~/.login
  • Then simply run commands like this:
$mud/submit.csh
$mud/check.py -h
  • Use -h or --help to get full help information for the .py (python) scripts
  • The .csh scripts automatically print usage information when invoked incorrectly
  • The scripts automatically use their invocation path to find other scripts and libraries they depend on.
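
The last point can be illustrated with a small sketch. It assumes a script derives the MUD directory from the path it was invoked with; sibling_path is a hypothetical helper for illustration, not a function taken from MUD.

```python
import os


def sibling_path(invoked_as, name):
    """Return the path of `name` in the same directory as the invoked script.

    `invoked_as` stands in for sys.argv[0]; both argument names are
    illustrative assumptions.
    """
    base_dir = os.path.dirname(os.path.abspath(invoked_as))
    return os.path.join(base_dir, name)


# A script started as /opt/mud/check.py would look for helpers next to itself.
print(sibling_path("/opt/mud/check.py", "restart.py"))
```

This is why setting the mud variable is only a convenience: the scripts do not need it to locate each other.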

Job Control

Main Workflow

  • Submit a parallel job to the cluster
$mud/submit.csh

Uses 'dirlist' to determine which directories to run. Similar to startdockbksX, but also indicates job submission by touching a submitted file in each directory.
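
The bookkeeping described above could look roughly like the following sketch. It assumes that dirlist holds one subdirectory per line, that relative entries are resolved against the directory containing dirlist, and that the marker file is literally named submitted. None of these details are confirmed on this page, and mark_submitted is a hypothetical name.

```python
import os


def mark_submitted(dirlist_path, marker="submitted"):
    """Touch a marker file in every directory named in the dirlist file."""
    base = os.path.dirname(os.path.abspath(dirlist_path))
    marked = []
    with open(dirlist_path) as handle:
        for line in handle:
            subdir = line.strip()
            if not subdir:
                continue  # skip blank lines
            marker_path = os.path.join(base, subdir, marker)
            with open(marker_path, "a"):
                pass  # create the file if it does not exist yet
            os.utime(marker_path, None)  # refresh the timestamp, like `touch`
            marked.append(marker_path)
    return marked
```

A status checker can then tell unsubmitted directories apart from submitted ones simply by testing whether the marker file exists.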

  • Check parallel job status
$mud/check.py

Indicates the status of unfinished (or unsubmitted) jobs. Note that it simply returns nothing if everything is finished.

  • Restart all failed subjobs
$mud/restart.py

This works even if some subjobs are still running. Occasionally, however, jobs can fail with no detectable remnants. To force those jobs to restart you can use the -f option, but beware that this will also restart all subjobs that are still running.

Specialized Commands

  • Submit job to the local machine
$mud/sublocal.csh
  • Submit a single directory to the cluster
qsub $mud/runsge.csh
  • Submit a single directory to the local machine
$mud/runsubdir.csh
  • Remove docking output, leaving only the input. This will DELETE the output of completed jobs as well.
$mud/clean.py
  • Restart single directory
$mud/restartdir.py

Job Analysis

  • Enrichment plots are only meaningful if every docked molecule is treated consistently and accounted for. The combine script accounts for all docked molecules by detecting those that were bumped out, found no match, or timed out.

To achieve consistency, you have two options:
  1. Write coordinates for all molecules (what I use). In INDOCK, set number_save to 50000 or something high enough to capture all dockable hierarchies. DOCK output is now gzipped, so this is cheaper in disk space than it used to be.
  2. Do not check for broken molecules. Use the -b option when running combine.py.

Combining Parallel Jobs

  • Merge all parallel jobs into a single set of unique scores.
$mud/combine.py

The combine step carefully accounts for all docked molecules, which makes the enrichment plots more informative.

  • Options:

Use -b or --broken to skip finding broken molecules.
Use -d or --done to indicate that all subjobs are complete, for the case where you did not submit with a MUD submission script.
Use -p or --prefix if your output files are named something other than test.
Use --box if your box file is not at ../../grids/box relative to your subjob directories.

  • Creates:
  1. combine.scores - fully processed scores, using the best one for each id
  2. combine.raw - contains all scores as scraped from DOCK output
  3. combine.broken - broken molecules and the reason they failed
  4. combine.zeroes - important sanity check

format of combine.scores:

<id> <shape> <elect> <VdW> <polar solv> <apolar solv> <total> <subdir>
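
As a rough illustration of this format and of "using the best one for each id", the sketch below parses such lines and keeps one record per id. It assumes whitespace-separated columns and that the best score is the lowest total; the function names and the sample values are made up for illustration.

```python
def parse_score_line(line):
    """Split one combine.scores-style line into a dictionary."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError("expected 8 columns, got %d" % len(fields))
    shape, elect, vdw, polar, apolar, total = (float(v) for v in fields[1:7])
    return {
        "id": fields[0], "shape": shape, "elect": elect, "vdw": vdw,
        "polar_solv": polar, "apolar_solv": apolar, "total": total,
        "subdir": fields[7],
    }


def best_per_id(lines):
    """Keep the record with the lowest total for each id (an assumption)."""
    best = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        record = parse_score_line(line)
        kept = best.get(record["id"])
        if kept is None or record["total"] < kept["total"]:
            best[record["id"]] = record
    return best


# Made-up example lines, not real docking output.
sample = [
    "C00000001 0.0 -10.5 -20.0 5.0 1.0 -24.5 dir01",
    "C00000001 0.0 -12.0 -22.0 4.0 1.0 -29.0 dir02",
    "C00000002 0.0 -3.0 -15.0 2.0 0.5 -15.5 dir01",
]
for mol_id, record in sorted(best_per_id(sample).items()):
    print(mol_id, record["total"], record["subdir"])
```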

The .zeroes file is a sanity check because it lists the number of molecules followed by the number of zeroes in each scoring column. Past experience has shown that when DOCK fails randomly and silently, it often generates a large number of zero scores. If this happens, simply re-running the job will give better results.
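
The same kind of check can be sketched directly against combine.scores-style lines. The exact layout of combine.zeroes is not shown here, so this only recomputes comparable numbers; count_zeroes is a hypothetical helper and the sample lines are made up.

```python
def count_zeroes(lines, n_score_columns=6):
    """Return (number of molecules, list of zero counts per scoring column)."""
    molecules = 0
    zeroes = [0] * n_score_columns
    for line in lines:
        fields = line.split()
        if len(fields) < 1 + n_score_columns:
            continue  # skip blank or truncated lines
        molecules += 1
        for index, value in enumerate(fields[1:1 + n_score_columns]):
            if float(value) == 0.0:
                zeroes[index] += 1
    return molecules, zeroes


# Made-up example: the second molecule has suspicious all-zero scores.
sample = [
    "C00000001 0.0 -10.5 -20.0 5.0 1.0 -24.5 dir01",
    "C00000002 0.0 0.0 0.0 0.0 0.0 0.0 dir01",
]
print(count_zeroes(sample))
```

A column whose zero count approaches the number of molecules is the pattern that suggests a silent DOCK failure.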

Computing Enrichments

  • Compute enrichment starting from the combined scores.
$mud/enrich.py -s -l LIGAND_FILE
< or >
$mud/enrich.py -l LIGAND_FILE -d DECOY_FILE

Generates both enrichment and ROC curves, for the ligands against all molecules and for the ligands against just the decoys. If combine has not been run yet, enrich.py will try to run it, but only with the default value for every option.

  • Input:

Use -l to specify the ligand identifier file and -d to specify the decoy identifier file.

The identifier files simply contain an id for each known ligand that matches the one in the docking databases. The script is smart enough to match "ZINC12345678" to "C12345678", so either form is acceptable.
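
One simple way to get this kind of matching is to reduce both spellings to a common form before comparing. The sketch below is only an assumption about how it could be done; normalize_id is a hypothetical name and the actual logic in enrich.py may differ.

```python
def normalize_id(identifier):
    """Map 'ZINC12345678' and 'C12345678' to the same short form."""
    ident = identifier.strip()
    if ident.upper().startswith("ZINC"):
        return "C" + ident[4:]  # drop the 'ZIN' part of the prefix
    return ident


# Ids from a ligand file and from docking output compare equal once normalized.
ligand_ids = {normalize_id(x) for x in ["ZINC12345678", "C00000042"]}
print(normalize_id("C12345678") in ligand_ids)
```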

  • Options:

Use -s or --skip-own-curves to skip consideration of decoys and thus generation of _own curves. Use -f to force combine to run again.

  • Creates:
  1. enrich.txt - Enrichment curve for ligands versus all molecules
  2. roc.txt - ROC curve for ligands versus all molecules
  3. enrich_own.txt - Enrichment curve for ligands versus only the decoys
  4. roc_own.txt - ROC curve for ligands versus only the decoys

The _own files are not generated if the -s option is used.

format for output files:

#AUC 50.00  LogAUC 0.00
<x> <y>
<x> <y>
 ...

AUC is area under the curve and the random expectation value is 50%. LogAUC is the area between the log curve and the log random curve, so the random expectation value is 0%. <y> is always "% ligands found", and <x> is either "% database searched" for enrichment plots or "% non-ligands found" for ROC plots.
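
A minimal sketch of how the two header values could be computed from curve points given in percent. AUC uses the trapezoid rule. For LogAUC, the lower x cutoff of 0.1% and the normalization by the width of the log axis are assumptions that are not stated on this page, so enrich.py may differ in detail. The function names are illustrative.

```python
import math


def auc(points):
    """Trapezoid-rule area under (x, y) points in percent, returned in percent."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / 100.0


def semilog_area(points, lower):
    """Area under a piecewise-linear curve on a log10 x axis, in percent."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x1 <= lower or x1 == x0:
            continue  # segment lies left of the cutoff or is vertical
        slope = (y1 - y0) / (x1 - x0)
        if x0 < lower:  # clip the segment at the assumed lower cutoff
            y0 += slope * (lower - x0)
            x0 = lower
        intercept = y0 - slope * x0
        # Exact integral of (intercept + slope * x) with respect to log10(x).
        total += intercept * math.log10(x1 / x0) + slope * (x1 - x0) / math.log(10)
    return total / math.log10(100.0 / lower)


def log_auc(points, lower=0.1):
    """Semi-log area of the curve minus that of the random line y = x."""
    random_line = [(lower, lower), (100.0, 100.0)]
    return semilog_area(points, lower) - semilog_area(random_line, lower)


random_curve = [(0.0, 0.0), (100.0, 100.0)]
print("#AUC %.2f  LogAUC %.2f" % (auc(random_curve), log_auc(random_curve)))
```

Under these assumptions a random curve reproduces the expectation values quoted above, 50% for AUC and 0% for LogAUC.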

Plotting Enrichments

Easily plot enrichment and ROC curves from one or more jobs.

$mud/plots.py -i . -l New_Run -i ../old_run_dir -l Old_Run -t AmpC
< or >
$mud/plots.py -i .

Generates plots with one curve for each -i input_directory.

  • Options:

Use -s or --skip-own-curves to skip _own curves, especially if they don't exist because enrich.py was run with -s.
You can either label each -i INDIR with a -l LABEL, or use no -l options to get the default labels based on parent directory names.
Use -t TITLE to change the plot title and filename.
Use -o to specify a different output directory.
Use -n to get normal instead of semi-log plots (and AUC in place of LogAUC).

  • Creates:
  1. [title_]enrich.png
  2. [title_]roc.png
  3. [title_]enrich_own.png
  4. [title_]roc_own.png

The various graphs have the same meaning as their respective curves from the Computing Enrichments section. The [title_] prefix is optional and is present only when a custom title is given with the -t option.

Computing Energy Histograms

  • Compute energy distributions starting from the combined scores.
$mud/energies.py -s -l LIGAND_FILE
< or >
$mud/energies.py -l LIGAND_FILE -d DECOY_FILE

Generates the energy distributions for the ligands, decoys, and all the other molecules.

  • Input:

Use -l to specify the ligand identifier file and -d to specify the decoy identifier file.

The identifier files simply contain an id for each known ligand that matches the one in the docking databases. The script is smart enough to match "ZINC12345678" to "C12345678", so either form is acceptable.

  • Options:

Use -s or --skip-own-curves to skip consideration of decoys.

  • Creates:
  1. counts.txt - Energy distributions

format for output:

number_of_sections number_of_bins min_energy_threshold max_energy_threshold
##### section_name
bin_upper_edge1 count_below_edge1
...
bin_upper_edgeN count_below_edgeN
ABOVE count_above_last_edge

The sections are for ligands, decoys (optional), and others. The bins and counts define the energy histogram. The bins are finely spaced here in order to have more resolution when combined with other runs, whose energy ranges may be different.
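
A file in this layout could be read back with something like the sketch below. It assumes whitespace-separated fields, integer counts, and that every line starting with # opens a new section; parse_counts is a hypothetical helper and the sample values are made up.

```python
def parse_counts(lines):
    """Parse counts.txt-style lines into (header info, per-section histograms)."""
    rows = [line.strip() for line in lines if line.strip()]
    header = rows[0].split()
    info = {
        "sections": int(header[0]),
        "bins": int(header[1]),
        "min_energy": float(header[2]),
        "max_energy": float(header[3]),
    }
    histograms = {}
    current = None
    for row in rows[1:]:
        if row.startswith("#"):
            current = row.lstrip("#").strip()  # section name after the hashes
            histograms[current] = {"bins": [], "above": 0}
        elif current is None:
            raise ValueError("data line found before any section header")
        else:
            label, count = row.split()
            if label == "ABOVE":
                histograms[current]["above"] = int(count)
            else:
                histograms[current]["bins"].append((float(label), int(count)))
    return info, histograms


# Made-up example with two sections and two bins each.
sample = [
    "2 2 -100.0 0.0",
    "##### ligands", "-50.0 3", "0.0 5", "ABOVE 1",
    "##### others", "-50.0 10", "0.0 20", "ABOVE 4",
]
info, histograms = parse_counts(sample)
print(info["sections"], sorted(histograms))
```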

Plotting Energy Histograms

Easily plot energy histograms from one or more jobs.

$mud/eplots.py -i . -l New_Run -i ../old_run_dir -l Old_Run -t AmpC
< or >
$mud/eplots.py -i .

Generates plots with energy distributions for each -i input_directory.

  • Options:

You can either label each -i INDIR with a -l LABEL, or use no -l options to get the default labels based on parent directory names. Use -t TITLE to change the plot title and filename. Use -o to specify a different output directory.

  • Creates:
  1. [title_]counts.png

Visualizing Molecule by Molecule Results

Create a DOCK 4/5/6-style PDB file for use in Chimera's ViewDOCK.

$mud/topdock.py -o topdock.pdb
  • Options:

Use -o to write to an output file instead of stdout. Use -t NUMBER to set how many of the top-scoring molecules are written.
