Docking Analysis in DOCK3.8

Latest revision as of 04:34, 3 March 2023

Location of new scripts/Install Instructions

You can retrieve these scripts from the public "docktop" repository on GitHub.

git clone https://github.com/docking-org/docktop.git

Python 3.8+

Conda Environment

The simplest way to get Python 3.8+ is to install it via conda.

conda create -n py311 python=3.11
conda activate py311

No other packages are required!

Manual Install

On Wynton you can use the version installed at /wynton/group/bks/soft/python-versions/python-3.8-install

If you want to install python3.8 on your own, try the following:

wget https://www.python.org/ftp/python/3.8.8/Python-3.8.8.tgz

# MY_SOFT is the directory you want to install to
tar -C $MY_SOFT -xzf Python-3.8.8.tgz
pushd $MY_SOFT/Python-3.8.8
# install into its own prefix so that it matches the PATH entry below
./configure --prefix=$MY_SOFT/python-3.8-install
make && make install
popd

# add the new python 3.8 executable to your path to use
export PATH=$PATH:$MY_SOFT/python-3.8-install/bin

# optional: clean up the extracted source tree and the downloaded tarball
# rm -r $MY_SOFT/Python-3.8.8
# rm Python-3.8.8.tgz
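
Whichever route you take, it can be worth confirming that the interpreter you plan to use is 3.8 or newer before running the scripts. The following is a minimal sketch; the function name version_at_least_3_8 is made up for illustration and is not part of docktop.

```shell
# Succeeds if a "major.minor[.patch]" version string is 3.8 or newer.
version_at_least_3_8() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # text up to the next dot, if any
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 8 ]; }
}

# "python3 --version" prints e.g. "Python 3.11.4"; the second field is the version.
# version_at_least_3_8 "$(python3 --version | cut -d' ' -f2)" && echo "python is new enough"
```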

top_poses.py

Description

Main pose retrieval algorithm, runs on multiple processes.

Input can be a directory or a file. If input is a directory, the script will use a recursive find command to locate all test.mol2.gz* files residing in the directory structure.

If input is a file, each line in the file should map to a valid pose file, e.g:

/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0000/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0001/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0002/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0003/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0004/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0005/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0006/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0007/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0008/test.mol2.gz
/wynton/group/bks/work/yingyang/5HT-1d/04_LSD/run_dock_es1.5_ld0.3/docked_chunks/chunk0009/test.mol2.gz
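
A list file like the one above can be generated with find rather than written by hand. This is a minimal sketch, not the script's own file discovery code; the function name make_pose_list and the file name poses.list are placeholders.

```shell
# make_pose_list <search root> <output list file>
# Writes one pose-file path per line, sorted, for every test.mol2.gz* file
# found anywhere under the search root.
make_pose_list() {
    find "$1" -type f -name 'test.mol2.gz*' | sort > "$2"
}

# Example (placeholder paths):
# make_pose_list /path/to/docked_chunks poses.list
```

Building the list yourself is mainly useful when you want to restrict a run to a subset of chunks, since you can edit or filter the list before passing it as the input.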

Output prefix sets where the top N poses are written when the script finishes. For example, a prefix of /scratch/top_poses produces /scratch/top_poses.mol2.gz, along with a human-readable .scores file.

Usage

usage: top_poses.py [-h] [-n NPOSES] [-o OUTPREFIX] [-j NPROCESSES] [--id-file INPUT_ID_FILE]
                    [--verbose] [--quiet] [--log-interval LOG_INTERVAL]
                    [--find-min-size FIND_MIN_SIZE]
                    dockresults_path

Retrieve the top N poses from docking results

positional arguments:
  dockresults_path      Can be either a directory containing docking results, or a file where each
                        line points to a docking results file.

optional arguments:
  -h, --help            show this help message and exit
  -n NPOSES             How many top poses to retrieve, default of 150000
  -o OUTPREFIX          Output file prefix. Each run will produce two files, a mol2.gz containing
                        pose data, and a .scores file containing relevant score information.
                        Default is "top_poses"
  -j NPROCESSES         How many processes should be dedicated to this run, default is 2. If your
                        files are spread across multiple disks, increasing this number will
                        improve performance.
  --id-file INPUT_ID_FILE
                        Only retrieve poses matching ids specified in an external file.
  --verbose             write verbose logs to stdout
  --quiet               write minimum logs to stdout
  --log-interval LOG_INTERVAL
                        number of poses between log statements. Ignored if --quiet enabled
  --find-min-size FIND_MIN_SIZE
                        filter out test.mol2.gz* files below a minimum bytes size
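
To illustrate how these options combine, here is a hedged sketch of a small wrapper function. It assumes the repository was cloned into $HOME/docktop and that top_poses.py sits at its top level; neither is confirmed here, so adjust the path to your checkout. The names TOP_POSES, PYTHON and retrieve_top are hypothetical.

```shell
# Assumed locations; adjust to match your own checkout and interpreter.
TOP_POSES="${TOP_POSES:-$HOME/docktop/top_poses.py}"
PYTHON="${PYTHON:-python3}"

# retrieve_top <dockresults_path> <output prefix> <number of poses>
# Uses only flags listed in the usage text above, with the default -j 2.
retrieve_top() {
    "$PYTHON" "$TOP_POSES" -n "$3" -o "$2" -j 2 "$1"
}

# Example (placeholder paths): keep the best 10000 poses from a docking run.
# retrieve_top /path/to/docked_chunks /scratch/top_poses 10000
```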

Note on Parallel Processing

By default, this script allocates two extra threads (-j 2) to read in files. This lets the main thread sort poses uninterrupted while the others handle the grunt work of reading and annotating files. Increasing the number of reader threads beyond two does not guarantee better performance, but depending on the filesystem(s) your docking poses live on, it can help. For example, on Wynton it can be helpful to allocate up to 8 extra threads for reading files because of how its filesystem works. On the BKS cluster, going beyond two reader threads will have a negligible (or even negative) impact unless your files happen to be striped across multiple servers.
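
Since the best -j value depends on the filesystem, one way to choose it is to time a small retrieval at a few settings. The sketch below is only an illustration: the script location, the names TOP_POSES, PYTHON and bench_readers, and the bench_j output prefix are all assumptions.

```shell
# Assumed locations; adjust to match your own checkout and interpreter.
TOP_POSES="${TOP_POSES:-$HOME/docktop/top_poses.py}"
PYTHON="${PYTHON:-python3}"

# bench_readers <list file or directory> <reader counts...>
# Runs a small retrieval (-n 1000) for each -j value and prints elapsed seconds.
bench_readers() {
    input=$1
    shift
    for j in "$@"; do
        start=$(date +%s)
        "$PYTHON" "$TOP_POSES" -n 1000 -j "$j" -o "bench_j$j" --quiet "$input" > /dev/null
        echo "j=$j elapsed=$(( $(date +%s) - start ))s"
    done
}

# Example: bench_readers subset.list 2 4 8
```

Bear in mind that the first run may warm filesystem caches, so later runs can look faster than they really are; repeating the comparison in a different order helps guard against that.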

Checking Logs

If everything went smoothly, your log should end with a string of text that looks like this:

received all input!
joining threads...
done processing! writing out...
299900 / 300000

You may also see a message that looks like this:

short timeout reached while retrieving pose... trying again! curr=...

This just indicates slowness in the file reading, and is common to see at the beginning of a log or when the filesystem is under high load.
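
If you have many logs to look through, the check described above can be scripted. This is a rough heuristic sketch that assumes you redirected the script's stdout to a file; the function name check_top_poses_log is hypothetical. It only looks for the final-stage message shown above, so it cannot tell whether the write itself finished.

```shell
# check_top_poses_log <log file>
# Prints "complete" if the final-stage message is present, else "incomplete",
# followed by how many slow-read retry messages were logged.
check_top_poses_log() {
    if grep -qF 'done processing! writing out...' "$1"; then
        echo "complete"
    else
        echo "incomplete"
    fi
    echo "timeouts=$(grep -cF 'short timeout reached while retrieving pose' "$1")"
}

# Example (placeholder file name): check_top_poses_log top_poses.log
```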