Running ChemSTEP
Revision as of 22:37, 3 September 2025
last update: August 26 2025 katie
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs].
At a high level, ChemSTEP is an iterative process run in between rounds of docking. The general procedure is as follows: build molecules, dock molecules, convert scores for ChemSTEP, run ChemSTEP. In this case, the first round of building (the "seed set") has already been done.
What you (the user) need: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP
1. Source ChemSTEP virtual environment on Wynton
source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate
2. Copy InfiniSee seed-set SDI file into your docking directory
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.
cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_00_seed_set.wynton.sdi .
3. Dock the seed set against your receptor of interest using DOCK 3.8 (docking directions taken from docs.docking.org). Run this as you would a normal large-scale docking (LSD) campaign.
export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]
export DOCKFILES=[path to your dockfiles]
export INPUT_FOLDER=[the folder containing your .sdi file(s)]
export OUTPUT_FOLDER=[where you want the output]
/wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory (containing your docking files and output folder) while logged into a dev node:
cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .
python3 get_scores.py 0
This script expects a directory named "output*" within the CWD. If your docking output follows a different naming convention, edit get_scores.py and change the path. The output will be a file named "scores_round_0.txt". For iterative rounds, pass increasing numbers on the command line. For example, when docking the first round of prioritized molecules, pass 1 to produce scores_round_1.txt.
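Before running get_scores.py, it can help to confirm that the working directory looks the way the script expects. The sketch below is not part of ChemSTEP: `find_output_dirs` and `scores_filename` are hypothetical helper names, and the code relies only on the two conventions stated above (a directory matching "output*" in the CWD, and an output file named scores_round_N.txt).

```python
from pathlib import Path


def find_output_dirs(cwd="."):
    """Return sorted names of directories in cwd that match 'output*'."""
    return sorted(p.name for p in Path(cwd).glob("output*") if p.is_dir())


def scores_filename(round_number):
    """Name of the text file get_scores.py is described as writing."""
    if round_number < 0:
        raise ValueError("round number must be 0 or greater")
    return f"scores_round_{round_number}.txt"


if __name__ == "__main__":
    dirs = find_output_dirs()
    if dirs:
        print("Found docking output:", ", ".join(dirs))
    else:
        print("No 'output*' directory here; edit the path in get_scores.py.")
    print("Round 0 should produce", scores_filename(0))
```

Run it from the base docking directory. If no matching directory is reported, adjust the path inside get_scores.py as described above.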
4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP. Requires the ChemSTEP venv.
cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .
python convert_scores_to_npy.py 0
This script expects a txt file with molecule IDs and DOCK scores. Use the same number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores.
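"Line-matched" means that entry i of the indices array belongs to entry i of the scores array, so the two files must have the same length. A minimal sanity check, assuming only that both files load with `numpy.load`, might look like the following. `check_round_arrays` is a hypothetical helper, not something shipped with ChemSTEP.

```python
import os

import numpy as np


def check_round_arrays(scores_path, indices_path):
    """Load a scores/indices pair and confirm both have the same length."""
    scores = np.load(scores_path)
    indices = np.load(indices_path)
    if scores.shape[0] != indices.shape[0]:
        raise ValueError(
            f"{scores.shape[0]} scores but {indices.shape[0]} indices"
        )
    return scores.shape[0]


if __name__ == "__main__":
    pair = ("scores_round_0.npy", "indices_round_0.npy")
    if all(os.path.exists(name) for name in pair):
        n = check_round_arrays(*pair)
        print(f"{n} molecules with matched scores and indices")
```

This only checks lengths. It cannot tell whether the rows were shuffled relative to each other, so it is a quick guard rather than a full validation.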
5. Enter or create a directory in which to run ChemSTEP. Copy in the files needed to initiate ChemSTEP: params.txt, run_chemstep.py, launch_chemstep_as_job.sh, and your score and indices NumPy files.
cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .
cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .
cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .
6. Edit the params.txt file. This ONLY needs to be edited for the initial round of ChemSTEP. The parameters set here are carried through all subsequent rounds of ChemSTEP.
seed_indices_file: /absolute/path/to/your/indices_round_0.npy
seed_scores_file: /absolute/path/to/your/scores_round_0.npy
hit_pprop: 5
n_docked_per_round: 10000000
max_beacons: 150
max_n_rounds: 250
Be sure that this file reflects your score and indices files for round zero (the seed set). Define your desired pProp, number of beacons, and number to prioritize per round. There should be no need to edit run_chemstep.py or the SGE wrapper script for the FIRST round of XReal docking. If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library. We strongly suggest running ChemSTEP as a job array with 64 CPU slots requested (specified in wrapper).
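A quick pre-submission check of params.txt can catch a missing key or a relative path before the job is queued. The sketch below assumes the file holds one `key: value` pair per line, which is an assumption about the layout rather than a documented ChemSTEP format. `read_params` and `find_problems` are hypothetical helpers; the key names are the six shown above.

```python
import os

# Keys listed in the params.txt example on this page.
REQUIRED_KEYS = (
    "seed_indices_file", "seed_scores_file", "hit_pprop",
    "n_docked_per_round", "max_beacons", "max_n_rounds",
)


def read_params(text):
    """Parse 'key: value' lines into a dict (assumed one pair per line)."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        params[key.strip()] = value.strip()
    return params


def find_problems(params):
    """Return readable problems; an empty list means nothing obvious."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in params]
    for key in ("seed_indices_file", "seed_scores_file"):
        path = params.get(key)
        if path is not None and not os.path.isabs(path):
            problems.append(f"{key} is not an absolute path: {path}")
    return problems


if __name__ == "__main__":
    if os.path.exists("params.txt"):
        with open("params.txt") as handle:
            for problem in find_problems(read_params(handle.read())):
                print(problem)
```

The check is deliberately shallow: it does not verify that the seed files exist or that the numeric values are sensible for your target.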
7. Run ChemSTEP with the following command:
qsub launch_chemstep_as_job.sh
When finished, there will be a smi_round_1.smi inside of output/complete_info. These molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below:
8. Build prioritized molecules with DOCK 3.8 (taken from docs.docking.org).
source /wynton/group/bks/soft/DOCK-3.8.5/env.sh
python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi
When building has completed, write an SDI file containing the complete path to each built bundle, then dock the bundles as in step 3. Be sure to change the INDOCK file to save only poses that meet your pProp score threshold (output by ChemSTEP)! Retrieve the docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. If you are following along as a tutorial, you should now have scores_round_1.npy and indices_round_1.npy (from the FIRST round of ChemSTEP prioritization). Copy these into your ChemSTEP working directory.
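Writing the SDI file by hand is tedious for large builds, so a short script can collect the bundle paths. The sketch below makes two assumptions that should be checked against your own building_output directory: that an SDI file is plain text with one absolute path per line, and that built bundles can be found with the glob pattern `*.tgz`. `write_sdi` is a hypothetical helper, and the output name `round_1.sdi` in the usage note is only an example.

```python
from pathlib import Path


def write_sdi(build_dir, sdi_path, pattern="*.tgz"):
    """Write absolute paths of built bundles under build_dir, one per line.

    The default pattern is an assumption; change it to match your bundles.
    Returns the number of paths written.
    """
    bundles = sorted(
        p.resolve() for p in Path(build_dir).rglob(pattern) if p.is_file()
    )
    with open(sdi_path, "w") as handle:
        for bundle in bundles:
            handle.write(f"{bundle}\n")
    return len(bundles)
```

For example, `write_sdi("building_output", "round_1.sdi")` would write every matching bundle under building_output into round_1.sdi. If it returns 0, the pattern does not match how your bundles are named.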
9. Set up for iterative rounds of ChemSTEP
cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .
cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .
10. Run ChemSTEP
qsub launch_chemstep_iterative.sh [round number]
For the first iterative round, the round number is 2, and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_*.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon's distance to all previous beacons.