Running ChemSTEP in the 13 billion space

Revision as of 21:42, 18 September 2025 by Pseemann (talk | contribs)
  • Written by Philipp Seemann with the kind assistance of Katie Holland and Joseph Pepe (08/28/2025)
  • This tutorial should work on Wynton with the provided scripts and public environments; you might need to catch tiny typos. I apologize.

This is meant to be a simple, hands-on, step-by-step guide for running ChemSTEP on Wynton in the 13 billion space. For exploring the trillion space, please refer to: https://wiki.docking.org/index.php?title=Running_ChemSTEP

The general workflow is:

  |--- Simply DOCK the seed set (you can use any docking method)
  |
  |--- Run ChemSTEP 
  |
  |--- Build molecules from 1st round of ChemSTEP
  |
  |--- DOCK molecules from 1st round of ChemSTEP
  |
  |--- Run ChemSTEP the 2nd time
  |
  |--- Build molecules from 2nd round of ChemSTEP
  |
  |--- DOCK molecules from 2nd round of ChemSTEP
  |
  |--- Run ChemSTEP the 3rd time
  .
  .
  .
  .
  .
  |--- Repeat until your recovery rate no longer improves

Here is a recommended folder layout after several rounds of ChemSTEP; the initial layout follows further below. It is meant as an overview of the folders we will create iteratively, so that the building, docking, and ChemSTEP rounds do not get mixed up. (You can change this however you want. Once you get the hang of it, you can be creative and find a layout that works better for you!) This tutorial will lead you through round_0 and round_1. After that, it should be clear which steps you need to repeat over and over again. This tutorial is written for Wynton. Please run the scripts from a dev node.

Folder layout:

  CHEMSTEP_PROJECT_FOLDER/
  |
  |----run_initial_and_iterative_chemstep/
  |
  |----round_0/
  |
  |----building_1/ 
  |
  |----round_1/ 
  |
  |----building_2/ 
  |
  |----round_2/
  |
  |....

Explanation of folders

run_initial_and_iterative_chemstep/

This is where we actually run ChemSTEP for every new round. After every new round of docking, all scores_round_*.npy and indices_round_*.npy files are copied here.

round_0/

Docking of the seed set. This folder contains your dockfiles, and here we generate the initial indices and scores .npy files, which will be copied to the run_initial_and_iterative_chemstep/ folder later on.

building_1/

This folder is for building the molecules in the first .smi file, run_initial_and_iterative_chemstep/output/complete_info/smi_round_1.smi.

round_1/

Docking of the molecules from the building_1 folder. This folder again contains your dockfiles, with an adjusted INDOCK and a new .sdi file. After extracting the scores, it yields the indices and scores .npy files of round_1 for the second round of ChemSTEP.

building_2/

After running ChemSTEP from run_initial_and_iterative_chemstep the second time, we will have a new .smi file, which we will need to build.

round_2/

If you have read carefully to this point, you will know what comes next. We dock the molecules from building_2 here, extract the scores, and generate the .npy files. We then copy those to our run_initial_and_iterative_chemstep folder to generate the next smi_round_*.smi.
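
The layout described above can also be created in one go. The following is a minimal sketch that only assumes the folder names used in this tutorial; the helper name `make_layout` and its arguments are made up for illustration and are not part of ChemSTEP or the shared scripts.

```python
from pathlib import Path


def make_layout(project_dir, n_rounds):
    """Create the tutorial's folders for rounds 0..n_rounds.

    Hypothetical helper: only the folder names come from this tutorial.
    """
    project = Path(project_dir)
    # ChemSTEP itself always runs from this one folder; round_0 holds the seed docking.
    names = ["run_initial_and_iterative_chemstep", "round_0"]
    for i in range(1, n_rounds + 1):
        # Every later round is built first (building_i), then docked (round_i).
        names += [f"building_{i}", f"round_{i}"]
    created = []
    for name in names:
        path = project / name
        path.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
        created.append(path)
    return created
```

Calling `make_layout("CHEMSTEP_PROJECT_FOLDER", 2)` from your work directory would create the folders up to round_2. Creating them step by step, as in the rest of this tutorial, works just as well.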


This may all still be confusing, but we will start easy. It will make sense, just like any other Shoichet Lab tutorial. I promise.

First round of docking

Here are the first folders that you will need to set up for this tutorial

  CHEMSTEP_PROJECT_FOLDER/
  |
  |
  |----round_0/ # docking of the seeds set
  |
  |
  |----run_initial_and_iterative_chemstep/ # all scores_round_*.npy and indices_round_*.npy files will be copied here

First, we make a traditional DOCKing directory in your work directory (be sure not to be in your home directory, because we will dock and build, and that needs DISKSPACE!). The first step for ChemSTEP is basically just docking the 13B_seed_set_built.sdi, so submit it like a traditional LSD screen (with whatever submission style or script you prefer). The commands follow the folder layout shown above. What you need to bring yourself are your dockfiles and your submission script.

  mkdir CHEMSTEP_PROJECT_FOLDER/
  cd CHEMSTEP_PROJECT_FOLDER/
  mkdir round_0
  cd round_0
  cp -r path/to/your/dockfiles . #check your INDOCK parameters
  mkdir sdi
  cd sdi
  # this path is the 13M seed set; the path of the 130k seed set is listed further below
  cp /wynton/group/bks/work/bwhall61/mor_chemstep/DOCK/13M/seed/docking/bundle_paths.sdi .
  cd ..
  # cp your favourite submission script, or whatever you use for submitting jobs;
  # if you use SymDOCK, make sure the right executable is set in the submission
  # script

Submit the docking job with your method of preference.

Things to consider for the size of the seed set

Ideally, we want enough molecules to score in our desired pProp region, so that they can be considered as beacons and virtual hits. On Wynton, there is a 130k seed set, a 13M set, and somewhere also a 1.3M set.

  /wynton/group/bks/work/shared/kholland/chemstep_13B/13B_seed_set_built.sdi #130k 
  /wynton/group/bks/work/bwhall61/mor_chemstep/DOCK/13M/seed/docking/bundle_paths.sdi #13M

Our aim is to dock a small chunk that is still big enough to represent the library and that contains enough beacons chosen by ChemSTEP (so probably somewhere between 50 and 100 molecules in the desired score range).
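
Once docking has finished and the scores have been extracted (see the next section), a quick count can tell you whether the seed set was large enough. This is a minimal sketch, assuming lines in the "ID score" format shown further below and a score cutoff that you choose yourself; the function name `count_at_or_below` is made up for illustration, and the cutoff is not a ChemSTEP default.

```python
def count_at_or_below(lines, cutoff):
    """Count molecules whose docking score is at or below `cutoff`.

    `lines` is any iterable of "MOLECULE_ID SCORE" strings, for example an
    open scores_round_0.txt file. More negative scores are better.
    """
    count = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        try:
            score = float(fields[1])
        except ValueError:
            continue  # second field is not a number
        if score <= cutoff:
            count += 1
    return count
```

For a real run you would pass the open file, for example `with open("scores_round_0.txt") as handle: print(count_at_or_below(handle, -30.0))`. If the count is far below the 50 to 100 molecules mentioned above, a larger seed set is probably the better choice.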

When the docking finishes:

Source the right environment now:

  source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate

We will now extract scores and Molecule IDs, so we run get_scores.py

  #cd into your round_0 directory
  
  cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .
  python get_scores.py 0
  
  #0 for initial round of chemstep

When get_scores.py runs successfully, we see scores_round_0.txt in our folder. Check your scores_round_0.txt file for output. It should look like this:

  MOL12457028547 -29.32
  MOL12457032486 -32.39
  ...

The get_scores.py script expects a certain output folder structure. If you do not see any output in your .txt file, open the script (with vim, for example) and adjust the paths. If you use the copied version, it should be:

  /output/*/*/OUTDOCK.*
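
Before editing the script, it can help to check whether your docking output matches this pattern at all. The sketch below assumes the pattern is meant relative to the round directory (so `round_0/output/*/*/OUTDOCK.*`); this is an assumption, and `find_outdock_files` is a hypothetical helper, not part of get_scores.py.

```python
import glob
import os


def find_outdock_files(round_dir, pattern="output/*/*/OUTDOCK.*"):
    """Return the sorted OUTDOCK paths under `round_dir` that match `pattern`.

    The default pattern mirrors the one quoted above, taken relative to
    the round directory (an assumption about how the script is used).
    """
    return sorted(glob.glob(os.path.join(round_dir, pattern)))
```

If `find_outdock_files(".")` returns an empty list when called from your round_0 directory, your submission script writes its output somewhere else, and the path in get_scores.py needs to be adjusted accordingly.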

Now we translate scores_round_0.txt into indices_round_0.npy and scores_round_0.npy

  #cd into your round_0 directory
  
  cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .
  python convert_scores_to_npy.py 0
  
  #0 for initial round of chemstep

Now we should find indices_round_0.npy and scores_round_0.npy in our round_0 directory.

First round of ChemSTEP. EXCITING!!!

Now we set up ChemSTEP:

  #now cd into your CHEMSTEP_PROJECT_FOLDER
  
  mkdir run_initial_and_iterative_chemstep
  cp round_0/*_round_0.npy run_initial_and_iterative_chemstep/
  cd run_initial_and_iterative_chemstep/

Now the two .npy files should be in our run_initial_and_iterative_chemstep folder, ready for running ChemSTEP. We will now set up our initial submission script, which is slightly different from the iterative one used in the following rounds.

  #cd into your folder for running chemstep - 
  #run_initial_and_iterative_chemstep/ 
  
  cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job_initial.sh .
  cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_initial.py .

We stay in this folder and copy over a params.txt file, and then adjust it to your liking

  #we are still in run_initial_and_iterative_chemstep
  
  cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .

Adjust your params file - with vim if you like.

  vim params.txt
  #params.txt should then contain the following entries:
  seed_indices_file: path/to/your/run_initial_and_iterative_chemstep/indices_round_0.npy 
  
  seed_scores_file: path/to/your/run_initial_and_iterative_chemstep/scores_round_0.npy
  
  hit_pprop: 6 #may change depending on library size, here 13B
  
  n_docked_per_round: 1000000 #reasonable for 13B
  
  max_beacons: 100 #reasonable for 13B
  
  max_n_rounds: 250 # recommended
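
A typo in params.txt is easy to miss, so a quick check before submitting can save a failed job. This is a minimal sketch that only mirrors the "key: value #comment" layout shown above; how ChemSTEP itself reads the file may differ, and both `parse_params` and `missing_keys` are hypothetical helpers.

```python
# Keys as they appear in the params.txt example of this tutorial.
EXPECTED_KEYS = (
    "seed_indices_file",
    "seed_scores_file",
    "hit_pprop",
    "n_docked_per_round",
    "max_beacons",
    "max_n_rounds",
)


def parse_params(text):
    """Parse "key: value  # comment" lines into a dict of strings."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue  # blank line or not a key/value pair
        key, value = line.split(":", 1)
        params[key.strip()] = value.strip()
    return params


def missing_keys(params):
    """Return the expected keys that are absent or empty, in order."""
    return [key for key in EXPECTED_KEYS if not params.get(key)]
```

With `params = parse_params(open("params.txt").read())`, an empty result from `missing_keys(params)` means all six entries are present. Checking that the two .npy paths actually exist (for example with `os.path.exists`) is a sensible next step.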
  

Now there should be two scripts, two .npy files, and one params.txt file in your run_initial_and_iterative_chemstep folder. Now we submit our first round of ChemSTEP with:

  qsub launch_chemstep_as_job_initial.sh
  
  # this will queue a job
  # the job will split into many jobs (e.g. 600), which will run for a while
  # the initial job keeps running for a while after those jobs finish, and should
  # generate output in output/complete_info/; there should be a .smi file when it
  # finishes successfully

Two notes here:

First, we will never touch the params.txt file again. From my understanding, we need to copy each following scores_round_*.npy and indices_round_*.npy into the same directory (in our case, run_initial_and_iterative_chemstep)

Second, I had issues with getting Wynton to execute python for every submitted job. I hope this is fixed for you as well now.

First round of building:

When ChemSTEP finishes, we can go to our CHEMSTEP_PROJECT_FOLDER and brace ourselves for building.

  #cd into your CHEMSTEP_PROJECT_FOLDER
  
  mkdir building_1
  cd building_1
  cp ../run_initial_and_iterative_chemstep/output/complete_info/smi_round_1.smi .
  
  #if there is no smi file, something went wrong

We now source the building environment, prepare the job, submit the job, and wait until the building is completed.

source environment:

  source /wynton/group/bks/soft/DOCK-3.8.5/env.sh

prepare job

  python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi
  
  #note: smi_round_*.smi for the following rounds must be adjusted. 

submit job

  qsub building_array_job.sh

When this is finished, check for failed jobs and resubmit them. Some will always fail; just try to keep their number low. Once the molecules are built, we can proceed with docking them. Now we generate an .sdi file from our first building round.

  #cd into your building_1 folder
  
  find /wynton/group/bks/work/pseemann/CHEMSTEP_PROJECT_FOLDER/building_1/building_output/ -type f -name "*.tgz" > round1.sdi
  
  #example path; adjust it to your own project folder
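
If you prefer to stay in Python, the same .sdi file can be written with a few lines. This sketch assumes, like the find command above, that every built bundle is a .tgz file somewhere under building_output; the helper name `write_sdi` is made up for illustration.

```python
from pathlib import Path


def write_sdi(building_output, sdi_path):
    """Write the absolute path of every .tgz under `building_output` to `sdi_path`.

    Returns the number of bundles found, which is handy for spotting
    failed building jobs.
    """
    bundles = sorted(
        str(path.resolve())
        for path in Path(building_output).rglob("*.tgz")
        if path.is_file()
    )
    # One bundle path per line, matching the output of the find command.
    Path(sdi_path).write_text("".join(bundle + "\n" for bundle in bundles))
    return len(bundles)
```

Called as `write_sdi("building_output", "round1.sdi")` from the building_1 folder, it returns how many bundles ended up in the file, which you can compare against the number of building jobs you submitted.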

Next round of docking:

We proceed with docking the freshly built compounds, so we make a directory called round_1 with dockfiles, and copy our new .sdi file there, too.

  #cd into your CHEMSTEP_PROJECT_FOLDER
  
  mkdir round_1
  cd round_1
  mkdir sdi
  cd sdi
  cp ../../building_1/*.sdi .
  cd ..
  cp -r ../round_0/dockfiles .
  vim dockfiles/INDOCK 
  
  # adjust the score maximum in the INDOCK to your chosen pprop (recommended)
  # submit with your favourite submission script just as you did for round_0
  # so sh your_way_of_submitting.sh

As before, we will now run get_scores.py

source environment

  source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate

Copy scripts and execute them

  #cd into your round_1 directory
  
  cp ../round_0/get_scores.py .
  python get_scores.py 1
  
  # wait until this finishes; always adjust the argument after
  # python get_scores.py so you do not confuse your scores files with other rounds


Copy the conversion script and execute it

  #still in your round_1 directory
  #copy the right script: the IDs have changed, so the convert script had to be adjusted
  
  cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/64_char_convert_scores_to_npy.py .
  python 64_char_convert_scores_to_npy.py 1
  
  # wait until this finishes; always adjust the argument after
  # python 64_char_convert_scores_to_npy.py to match the round
  # (this script was adjusted to fit the output of round_1)

Next round of ChemSTEP:

  #cd into your run_initial_and_iterative_chemstep directory
  
  cp ../round_1/scores_round_1.npy .
  cp ../round_1/indices_round_1.npy .

Copy over the iterative scripts, run_chemstep_iteratively.py and launch_chemstep_as_job_iteratively.sh. These are slightly different from the initial ones, and from here on we will only use these scripts. Be sure to use the right scripts, and always provide the right arguments to them. This is (hopefully) the most confusing step in this ChemSTEP tutorial.

  #still in run_initial_and_iterative_chemstep directory
  
  cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job_iteratively.sh .
  cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iteratively.py .

Submit the 2nd round of ChemSTEP:

  qsub launch_chemstep_as_job_iteratively.sh 2 
  
  #Give the number of the ChemSTEP round you are running: if you are using
  #scores_round_1.npy and indices_round_1.npy, you provide 2 as an argument,
  #like in the example above.
  #If you run your 3rd round of ChemSTEP and provide indices_round_2.npy and
  #scores_round_2.npy, you would do qsub launch_chemstep_as_job_iteratively.sh 3
  #and so on...
  #If no argument is provided, the job will queue but then quit after a few
  #seconds to minutes.
  #You can check this in the chemstep_submission.log file.
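
Because the argument is always one higher than the newest pair of .npy files, it can be derived from the folder contents rather than remembered. This is a minimal sketch of that rule, assuming the file names used in this tutorial; `next_chemstep_round` is a hypothetical helper and not part of the shared scripts.

```python
import re
from pathlib import Path


def next_chemstep_round(run_dir):
    """Return the round argument for launch_chemstep_as_job_iteratively.sh.

    Rule taken from the comments above: the argument is N + 1, where N is
    the highest round for which both scores_round_N.npy and
    indices_round_N.npy are present in `run_dir`.
    """
    rounds = {"scores": set(), "indices": set()}
    for path in Path(run_dir).iterdir():
        match = re.fullmatch(r"(scores|indices)_round_(\d+)\.npy", path.name)
        if match:
            rounds[match.group(1)].add(int(match.group(2)))
    complete = rounds["scores"] & rounds["indices"]  # rounds with both files
    if not complete:
        raise FileNotFoundError("no matching scores/indices .npy pair found")
    return max(complete) + 1
```

A result of 1 means only the round_0 files are present, in which case the initial submission script (without an argument) is the one to use. A round with only one of the two files is ignored, which usually points to a forgotten copy step.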

Now we wait until the jobs finish, and brace ourselves for the next time-consuming round of building.

Next round of building:

  #so in your CHEMSTEP_PROJECT_FOLDER
  
  mkdir building_2
  cd building_2
  cp ../run_initial_and_iterative_chemstep/output/complete_info/smi_round_2.smi .

source environment

  source /wynton/group/bks/soft/DOCK-3.8.5/env.sh

prepare job (always adjust smi_round_*.smi here)

  python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_2.smi

submit the job

  qsub building_array_job.sh

Now proceed as shown before: resubmit failed jobs, make your .sdi file, and proceed with docking (a new folder round_2 is recommended).

Now we have come full circle. The next step is to make your round_2 docking directory with the updated .sdi file from the 2nd round of building. After that round of docking, you extract scores and IDs, convert them, feed them to ChemSTEP, and repeat this until you no longer see an increase in recovery rate. You always need to update the arguments you pass to the Python and submission scripts to match the current round of ChemSTEP, building, extracting, etc.
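
The stopping rule can be written down explicitly. How the recovery rate is computed is not covered in this tutorial, so the sketch below simply assumes you have collected one recovery value per round yourself; the function name `has_plateaued` and the parameters `min_gain` and `patience` are made up for illustration.

```python
def has_plateaued(recovery_by_round, min_gain=0.0, patience=1):
    """Return True if each of the last `patience` rounds gained at most `min_gain`.

    `recovery_by_round` is a list with one recovery value per finished
    round, oldest first (values supplied by you, not by this function).
    """
    if len(recovery_by_round) <= patience:
        return False  # not enough rounds to judge yet
    recent = recovery_by_round[-(patience + 1):]
    # Gain of every round relative to the round before it.
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(gain <= min_gain for gain in gains)
```

With `patience=2`, for example, the loop only stops after two consecutive rounds without improvement, which guards against stopping on a single unlucky round.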

Notes:

Slight changes here so far compared to the trillion space:

-Added a workflow chart

-Added a suggested directory structure

-Adjusted the extract scripts for 13B

-Separated the initial and iterative scripts for submitting and running ChemSTEP; the iterative submission now only works when an argument is passed for the round of submission

-Added a shared folder for the 13B scripts, in case others want to use them as well


  /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/