Difference between revisions of "MUD - Michael's Utilities for Docking"
Revision as of 10:00, 29 October 2008
What's in MUD?
- Tools to start, check, and restart dock jobs
- Tools to combine, enrich, plot, and view docking results
Setting up MUD
- For convenience, point a shell variable to the base mud directory to save typing
- If you use MUD a lot, you can add this to your ~/.login
- Then simply run commands like this:
$mud/submit.csh
$mud/check.py -h
- Use -h or --help to get full help information for the .py (python) scripts
- The .csh scripts will automatically print usage information if misused
- The scripts automatically use their invocation path to find other scripts and libraries they depend on.
- Submit a parallel job to the cluster
Uses 'dirlist' to determine which directories to run. Similar to startdockbksX, but also indicates job submission by touching a submitted file in each directory.
- Check parallel job status
Indicates the status of unfinished (or unsubmitted) jobs. Note that it simply returns nothing if everything is finished.
- Restart all failed subjobs
This works even if some subjobs are still running. Occasionally, however, jobs can fail with no detectable remnants. To force those jobs to restart you can use the -f option, but beware that this will also restart all subjobs that are still running.
More specialized commands
- Submit a single directory to the cluster
- Submit a single directory to the local machine
- Remove docking output leaving only input - will DELETE even completed jobs
- Restart single directory
- Enrichment plots are only meaningful if every docked molecule is treated consistently and accounted for. The combine script accounts for all docked molecules by detecting those that were bumped out, not matched, or timed out.
To achieve consistency, you have two options:
1. Write coordinates for all molecules (what I use)
In INDOCK, set number_save to 100000 or something high enough to capture all dockable hierarchies. DOCK output is now gzipped so this is cheaper than it used to be.
2. Do not check for broken molecules
Use the -b option when running combine.py
Combining Parallel Jobs
- Merge all parallel jobs into a single set of unique scores.
The combine step carefully accounts for all docked molecules, which makes the resulting enrichment plots more informative.
Use -b or --broken to skip finding broken molecules.
Use -d or --done to indicate that all subjobs are complete, for the case where you did not submit with a MUD submission script.
Use -p or --prefix if your output files are named something other than test.
Use --box if your box file is not at ../../grids/box relative to your subjob directories.
- combine.scores - fully processed scores, using the best one for each id
- combine.raw - contains all scores as scraped from DOCK output
- combine.broken - broken molecules and the reason they failed
- combine.zeroes - important sanity check
format of combine.scores:
<id> <shape> <elect> <VdW> <polar solv> <apolar solv> <total> <subdir>
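The "best one for each id" reduction can be sketched as follows. This is only an illustration, not the code in combine.py: it assumes the input lines already follow the eight-column layout shown above, and that "best" means the lowest (most negative) total score, which the page does not state. The function name `best_scores` and the sample records are made up.

```python
def best_scores(lines):
    """Keep one record per molecule id: the one with the lowest total score.

    Assumes eight whitespace-separated columns:
    id, shape, elect, VdW, polar solv, apolar solv, total, subdir.
    """
    best = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 8:
            continue  # skip blank or malformed lines
        mol_id = fields[0]
        total = float(fields[6])
        # Assumption: a lower total score is a better score.
        if mol_id not in best or total < float(best[mol_id][6]):
            best[mol_id] = fields
    return best


# Made-up records: the same id docked in two subdirectories.
raw = [
    "C00000001 10 -5.0 -20.0 3.0 1.0 -21.0 dir01",
    "C00000001 12 -6.0 -25.0 3.5 1.0 -26.5 dir02",
    "C00000002 8 -1.0 -10.0 2.0 0.5 -8.5 dir01",
]
best = best_scores(raw)
print(best["C00000001"][7])  # subdir of the record that was kept
```

Keeping the whole record rather than just the total preserves the subdir column, so the best pose can still be traced back to the subjob that produced it.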
The .zeroes file is a sanity check because it lists the number of molecules followed by the number of zeroes in each scoring column. Past experience has shown that when DOCK fails randomly and silently, it often generates a large number of zero scores. If this happens, simply re-running the job will give better results.
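A check of this kind can be reproduced by hand on combine.scores. The sketch below is an assumption about how such a count might be done, not the actual MUD implementation; the function name `count_zeroes` and the sample records are hypothetical.

```python
SCORE_COLUMNS = ["shape", "elect", "VdW", "polar solv", "apolar solv", "total"]


def count_zeroes(lines):
    """Return (number of molecules, zero count for each scoring column)."""
    n_molecules = 0
    zeroes = [0] * len(SCORE_COLUMNS)
    for line in lines:
        fields = line.split()
        if len(fields) != 8:
            continue  # skip blank or malformed lines
        n_molecules += 1
        # Columns 1..6 hold the scores; column 0 is the id, column 7 the subdir.
        for i, value in enumerate(fields[1:7]):
            if float(value) == 0.0:
                zeroes[i] += 1
    return n_molecules, zeroes


# Made-up records: the second one looks like a silent DOCK failure.
scores = [
    "C00000001 12 -6.0 -25.0 3.5 1.0 -26.5 dir02",
    "C00000002 0 0.0 0.0 0.0 0.0 0.0 dir01",
]
n_molecules, zeroes = count_zeroes(scores)
print(n_molecules, dict(zip(SCORE_COLUMNS, zeroes)))
```

A zero count that is a large fraction of the molecule count in any column is the pattern described above and suggests re-running the job.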
- Compute enrichment starting from the combined scores.
$mud/enrich.py -l LIGAND_FILE -d DECOY_FILE
< or >
$mud/enrich.py -s -l LIGAND_FILE
Generates both enrichment and ROC curves, once for the ligands against all molecules and once for the ligands against just the decoys. It will try to run combine if it has not been run yet, but only with the default value for every option.
Input: Use -l to specify the ligand identifier file and -d to specify the decoy identifier file.
The identifier files simply contain one id for each known ligand, matching the id used in the docking databases. The script is smart enough to match "ZINC12345678" to "C12345678", so either form is acceptable.
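One way such id matching could work is to reduce both forms to a common key before comparing. The sketch below is a guess at the idea, not the logic in enrich.py; the helper name `normalize_id` is hypothetical, and it assumes the only difference between the two forms is the leading "ZIN".

```python
def normalize_id(mol_id):
    """Reduce "ZINC12345678" and "C12345678" to the same key."""
    mol_id = mol_id.strip()
    if mol_id.startswith("ZINC"):
        return mol_id[3:]  # drop "ZIN", keep "C" plus the digits
    return mol_id


# Made-up ligand file contents mixing both forms.
ligand_ids = {normalize_id(line) for line in ["ZINC12345678", "C00000042\n"]}
print("C12345678" in ligand_ids)
```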
Use -s or --skip-own-curves to skip consideration of decoys and thus generation of _own curves. Use -f to force combine to run again.
- enrich.txt - Enrichment curve for ligands versus all molecules
- roc.txt - ROC curve for ligands versus all molecules
- enrich_own.txt - Enrichment curve for ligands versus only the decoys
- roc_own.txt - ROC curve for ligands versus only the decoys
The _own files are not generated if the -s option is used.
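To make the difference between the two curve types concrete, here is a minimal sketch that builds both from a ranked list. It is not taken from enrich.py: the function name `curves` and the input convention (one boolean per molecule, best score first, True for a known ligand) are assumptions, and the list must contain at least one ligand and one non-ligand.

```python
def curves(ranked_is_ligand):
    """Build enrichment and ROC points (in percent) from a ranked list.

    ranked_is_ligand: booleans ordered from best to worst score,
    True where the molecule is a known ligand.
    """
    n_total = len(ranked_is_ligand)
    n_ligands = sum(ranked_is_ligand)
    n_others = n_total - n_ligands
    enrich = [(0.0, 0.0)]
    roc = [(0.0, 0.0)]
    found = 0   # ligands seen so far
    others = 0  # non-ligands seen so far
    for rank, is_ligand in enumerate(ranked_is_ligand, start=1):
        if is_ligand:
            found += 1
        else:
            others += 1
        y = 100.0 * found / n_ligands                   # % ligands found
        enrich.append((100.0 * rank / n_total, y))      # x = % database searched
        roc.append((100.0 * others / n_others, y))      # x = % non-ligands found
    return enrich, roc


enrich, roc = curves([True, False, True, False])
print(enrich)
print(roc)
```

For the "_own" variants, the same construction would be applied to a ranked list restricted to the ligands and their decoys.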
format for output files:
#AUC 50.00 LogAUC 0.00
<x> <y>
<x> <y>
...
AUC is area under the curve and the random expectation value is 50%. LogAUC is the area between the log curve and the log random curve, so the random expectation value is 0%. <y> is always "% ligands found", and <x> is either "% database searched" for enrichment plots or "% non-ligands found" for ROC plots.
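The two summary numbers can be sketched as follows for a curve given as (x, y) points in percent. This is an illustration under stated assumptions, not the formula used by enrich.py: the curve is treated as piecewise linear, and the log integration starts at x = 0.1%, a lower bound the page does not specify (the parameter `lam` is hypothetical). The log area is normalized by the width of the log interval so that the result is also a percentage.

```python
import math


def auc(points):
    """Trapezoidal area under the curve, in percent (random curve gives 50)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / 100.0


def log_auc(points, lam=0.1):
    """Area between the log curve and the log random curve, in percent.

    Integrates y over ln(x) from x = lam to x = 100, subtracts the same
    integral for the random curve y = x, and divides by ln(100 / lam).
    lam = 0.1 is an assumed lower bound, not a value from the source.
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x1 <= lam or x1 == x0:
            continue  # segment lies left of the bound, or is vertical
        slope = (y1 - y0) / (x1 - x0)
        if x0 < lam:
            y0 += slope * (lam - x0)  # clip the segment at x = lam
            x0 = lam
        intercept = y0 - slope * x0
        # Exact integral of (intercept + slope * x) / x from x0 to x1.
        area += intercept * math.log(x1 / x0) + slope * (x1 - x0)
    random_area = 100.0 - lam  # integral of x / x from lam to 100
    return (area - random_area) / math.log(100.0 / lam)


random_curve = [(0.0, 0.0), (100.0, 100.0)]
perfect_curve = [(0.0, 0.0), (0.0, 100.0), (100.0, 100.0)]
print(auc(random_curve), log_auc(random_curve))
print(auc(perfect_curve), log_auc(perfect_curve))
```

With these choices the random curve reproduces the expectation values quoted above, an AUC of 50% and a LogAUC of 0%, which is why the header line of an output file for a random ranking would read "#AUC 50.00 LogAUC 0.00".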
Easily plot enrichment and ROC curves from one or more jobs.