DISI - User contributions [en]

Running ChemSTEP in the 13 billion space

2025-09-23T17:30:19Z

Pseemann:

*Written by Philipp Seemann with the kind assistance of Katie Holland and Joseph Pepe (08/28/2025)
*This tutorial should work on Wynton with the provided scripts and public environments; you might need to catch tiny typos. I apologize.

This is meant to be a simple, hands-on, step-by-step guide for ChemSTEP on Wynton in the 13 billion space. For exploring the trillion space, please refer to: [[Running ChemSTEP|https://wiki.docking.org/index.php?title=Running_ChemSTEP]]

'''The general workflow is:'''
|--- Simply DOCK the seeds set (you can use any docking method)
|
|--- Run ChemSTEP
|
|--- Build molecules from 1st round of ChemSTEP
|
|--- DOCK molecules from 1st round of ChemSTEP
|
|--- Run ChemSTEP the 2nd time
|
|--- Build molecules from 2nd round of ChemSTEP
|
|--- DOCK molecules from 2nd round of ChemSTEP
|
|--- Run ChemSTEP the 3rd time
.
.
.
.
.
|--- Repeat until your Recovery rate does not improve

Here is a recommended folder layout after several rounds of ChemSTEP; the initial layout follows further below. It is only meant as an overview of the folders we create iteratively, so that building, docking, and ChemSTEP rounds do not get mixed up. (You can change this however you want. After you get the hang of it, you can be creative and find a layout that works better for you!) This tutorial leads you through round_0 and round_1. After that, it should be clear which steps you need to repeat over and over again. This tutorial is written for Wynton. Please run the scripts from a dev node.

'''Folder layout:'''
CHEMSTEP_PROJECT_FOLDER/
|
|----run_initial_and_iterative_chemstep/
|
|----round_0/
|
|----building_1/
|
|----round_1/
|
|----building_2/
|
|----round_2/
|
|....

'''Explanation of folders'''
'''run_initial_and_iterative_chemstep/'''

This is where we actually run ChemSTEP for every new round. All scores_round_*.npy and indices_round_*.npy files are copied here after every new round of docking.

'''round_0/'''

Docking of the seeds set. This contains your dockfiles, and here we generate the initial indices and scores .npy files, which will be copied to the run_initial_and_iterative_chemstep/ folder later on.

'''building_1/'''

This folder is for building of the first .smi file from run_initial_and_iterative_chemstep/output/complete_info/smi_round_1.smi

'''round_1/'''

Docking of the molecules from the building_1 folder. This again contains your dockfiles, with an adjusted INDOCK and a new .sdi file. After extracting the scores, it yields the indices and scores .npy files of round_1 for the second round of ChemSTEP.

'''building_2/'''

After running ChemSTEP from run_initial_and_iterative_chemstep the second time, we will have a new .smi file, which we will need to build.

'''round_2/'''

If you have read carefully to this point, you will know what comes next. We dock the molecules from building_2 here, extract the scores, and generate the .npy files. Those are copied to our run_initial_and_iterative_chemstep folder to generate the next smi_round_*.smi.
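
The naming pattern described above can be written down as a small helper. This is only a sketch of the layout recommended in this tutorial, not part of ChemSTEP itself; the function name round_layout and the dictionary keys are made up for illustration.

```python
from pathlib import Path


def round_layout(project, n):
    """Return the paths this tutorial's layout uses for ChemSTEP round n (n >= 1)."""
    if n < 1:
        raise ValueError("round 0 is the seeds set docking; building starts at round 1")
    project = Path(project)
    chemstep = project / "run_initial_and_iterative_chemstep"
    return {
        # .smi written by ChemSTEP, input for building_n
        "smi_from_chemstep": chemstep / "output" / "complete_info" / f"smi_round_{n}.smi",
        "building_dir": project / f"building_{n}",
        "docking_dir": project / f"round_{n}",
        # .npy files copied back to the ChemSTEP folder after docking round_n
        "scores_npy": chemstep / f"scores_round_{n}.npy",
        "indices_npy": chemstep / f"indices_round_{n}.npy",
    }


for name, path in round_layout("CHEMSTEP_PROJECT_FOLDER", 2).items():
    print(f"{name:18s} {path}")
```

Calling it with n=3, 4, ... gives the folder and file names for the following rounds.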

This is probably all still confusing, but we will start easy. It will make sense, just like any other Shoichet Lab tutorial. I promise.

'''First round of docking'''

Here are the first folders that you will need to set up for this tutorial

CHEMSTEP_PROJECT_FOLDER/
|
|
|----round_0/ # docking of the seeds set
|
|
|----run_initial_and_iterative_chemstep/ #all scores_round_*.npy and indices_round_*.npy files will be copied here

First, we make a traditional DOCKing directory in your work directory (be sure not to be in your home directory, because docking and building need a lot of disk space). The first step for ChemSTEP is basically just docking a seeds set, so submit it like your usual LSD screen (whatever submission style or script you prefer). The commands below copy the 13M bundle_paths.sdi; the 130k 13B_seed_set_built.sdi is listed further down. The commands follow the folder layout shown above. What you need to bring yourself are your dockfiles and your submission script.

mkdir CHEMSTEP_PROJECT_FOLDER/
cd CHEMSTEP_PROJECT_FOLDER/
mkdir round_0
cd round_0
cp -r path/to/your/dockfiles . #check your INDOCK parameters
mkdir sdi
cd sdi
cp /wynton/group/bks/work/bwhall61/mor_chemstep/DOCK/13M/seed/docking/bundle_paths.sdi .
cd ..
# cp your favourite submission script (or whatever you use for submitting jobs) here
# if you use SymDOCK, make sure the right executable is set in the submission
# script

Submit the docking job with your method of preference.

'''Things to consider for the size of the seeds set'''

Ideally, enough molecules of the seeds set score in our desired pProp region to be considered as beacons and virtual hits. On Wynton, there is the 130k seeds set, a 13M set, and somewhere also a 1.3M set.

/wynton/group/bks/work/shared/kholland/chemstep_13B/13B_seed_set_built.sdi #130k
/wynton/group/bks/work/bwhall61/mor_chemstep/DOCK/13M/seed/docking/bundle_paths.sdi #13M

Our aim is to dock a small chunk that is still big enough to represent the library, and among those molecules we want enough beacons chosen by ChemSTEP (so probably something between 50 and 100 molecules in the desired score range).
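
To get a feeling for the numbers, here is a small worked calculation. It assumes that pProp is the negative log10 of the library fraction (so pProp 6 means the top 1e-6 of the library); please check this against the ChemSTEP documentation. The values are the ones used in the params.txt further down in this tutorial, and the helper name hits_at_pprop is made up.

```python
def hits_at_pprop(library_size, pprop):
    """Molecules in the top 10**-pprop fraction of the library (assumed pProp definition)."""
    return library_size / 10 ** pprop


library_size = 13_000_000_000     # the 13 billion space
hit_pprop = 6                     # from params.txt
n_docked_per_round = 1_000_000    # from params.txt
max_n_rounds = 250                # from params.txt

virtual_hits = hits_at_pprop(library_size, hit_pprop)
max_docked = n_docked_per_round * max_n_rounds

print(f"virtual hits at pProp {hit_pprop}: {virtual_hits:,.0f}")
print(f"upper bound on docked molecules: {max_docked:,} "
      f"({max_docked / library_size:.1%} of the library)")
```

So even with all 250 rounds, only a small fraction of the 13 billion molecules is ever built and docked.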

'''When the docking finishes:'''

Source the right environment now:

source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate

We will now extract scores and Molecule IDs, so we run get_scores.py

#cd into your round_0 directory

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .
python get_scores.py 0

#0 for initial round of chemstep

When get_scores.py runs successfully, we see scores_round_0.txt in our folder. Check your scores_round_0.txt file for output. It should look like this:

MOL12457028547 -29.32
MOL12457032486 -32.39
...

The get_scores.py script expects a certain output folder structure. If you do not see any output in your .txt file, vim into the script and adjust the paths. If you use the copied version, it should be:
/output/*/*/OUTDOCK.*
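
Before converting, a quick sanity check of the scores file can save a failed ChemSTEP run. The following is a minimal sketch that only assumes the two-column "ID score" format shown above; read_scores is a made-up helper and not one of the tutorial scripts.

```python
import os
import tempfile


def read_scores(path):
    """Parse 'MOL_ID score' lines; return (records, line numbers of malformed lines)."""
    records, malformed = [], []
    with open(path) as handle:
        for line_no, line in enumerate(handle, start=1):
            parts = line.split()
            if not parts:
                continue  # ignore blank lines
            if len(parts) != 2:
                malformed.append(line_no)
                continue
            try:
                records.append((parts[0], float(parts[1])))
            except ValueError:  # second column is not a number
                malformed.append(line_no)
    return records, malformed


# Demo on a temporary file holding the two example lines from above
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "scores_round_0.txt")
    with open(demo, "w") as handle:
        handle.write("MOL12457028547 -29.32\nMOL12457032486 -32.39\n")
    records, malformed = read_scores(demo)
    best_id, best_score = min(records, key=lambda record: record[1])
    print(f"{len(records)} scores, {len(malformed)} malformed lines")
    print(f"best (most negative) score: {best_id} {best_score}")
```

On your own data, replace the demo path with scores_round_0.txt in your round_0 folder. An empty result usually means the OUTDOCK path pattern in get_scores.py does not match your output folders.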

Now we translate scores_round_0.txt into indices_round_0.npy and scores_round_0.npy

#cd into your round_0 directory

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .
python convert_scores_to_npy.py 0

#0 for initial round of chemstep

Now we should find indices_round_0.npy and scores_round_0.npy in our round_0 directory.

'''First round of ChemSTEP. EXCITING!!!'''

'''Now we set up ChemSTEP:'''

#now cd into your CHEMSTEP_PROJECT_FOLDER

mkdir run_initial_and_iterative_chemstep
cp round_0/*_round_0.npy run_initial_and_iterative_chemstep/
cd run_initial_and_iterative_chemstep/

Now the two .npy files should be in our run_initial_and_iterative_chemstep folder, ready for running ChemSTEP.
We will now set up our initial submission script, which will be slightly different from the iterative one for the following rounds.

#cd into your folder for running chemstep -
#run_initial_and_iterative_chemstep/

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job_initial.sh .
cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_initial.py .

We stay in this folder and copy over a params.txt file, and then adjust it to your liking

#we are still in run_initial_and_iterative_chemstep

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .

Adjust your params file - with vim if you like.
vim params.txt

seed_indices_file: path/to/your/run_initial_and_iterative_chemstep/indices_round_0.npy

seed_scores_file: path/to/your/run_initial_and_iterative_chemstep/scores_round_0.npy

hit_pprop: 6 #may change depending on library size, here 13B

n_docked_per_round: 1000000 #reasonable for 13B

max_beacons: 100 #reasonable for 13B

max_n_rounds: 250 # recommended
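
A typo in params.txt is easy to miss, so it can help to check the file before submitting. This sketch assumes the simple "key: value #comment" format shown above and only knows the six keys listed here; the real params.txt may contain more. read_params and check_params are made-up helpers.

```python
import os
import tempfile

REQUIRED_KEYS = (
    "seed_indices_file", "seed_scores_file", "hit_pprop",
    "n_docked_per_round", "max_beacons", "max_n_rounds",
)


def read_params(path):
    """Read 'key: value' lines; everything after '#' is treated as a comment."""
    params = {}
    with open(path) as handle:
        for line in handle:
            line = line.split("#", 1)[0].strip()
            if ":" not in line:
                continue  # blank or comment-only line
            key, value = line.split(":", 1)
            params[key.strip()] = value.strip()
    return params


def check_params(params):
    """Return a list of human-readable problems (empty list means OK)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in params]
    for key in ("seed_indices_file", "seed_scores_file"):
        if key in params and not os.path.isfile(params[key]):
            problems.append(f"{key} not found: {params[key]}")
    return problems


# Demo with a temporary params.txt that forgets the scores file
with tempfile.TemporaryDirectory() as tmp:
    indices = os.path.join(tmp, "indices_round_0.npy")
    open(indices, "w").close()
    params_file = os.path.join(tmp, "params.txt")
    with open(params_file, "w") as handle:
        handle.write(f"seed_indices_file: {indices}\n")
        handle.write("hit_pprop: 6 #may change depending on library size\n")
    for problem in check_params(read_params(params_file)):
        print(problem)
```

Note that a path containing a "#" would be cut off by this simple comment handling.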

Now there should be two scripts, two .npy files, and one params.txt file in your run_initial_and_iterative_chemstep folder.
Now we submit our first round of ChemSTEP with:

qsub launch_chemstep_as_job_initial.sh

# this will queue a job
# the job will split into many jobs (e.g. 600), which will run for a while
# the initial job keeps running for a while after those jobs finish, and should generate
# output in output/complete_info/; on success there should be a .smi file

'''Two notes here:'''

First, we will never touch the params.txt file again. From my understanding, we need to copy each following scores_round_*.npy and indices_round_*.npy into the same directory (in our case, run_initial_and_iterative_chemstep)

Second, I had issues with getting Wynton to execute python for every submitted job. I hope this is fixed for you as well now.

'''First round of building:'''

When ChemSTEP finishes, we can go to our CHEMSTEP_PROJECT_FOLDER and brace ourselves for building.

#cd into your CHEMSTEP_PROJECT_FOLDER

mkdir building_1
cd building_1
cp ../run_initial_and_iterative_chemstep/output/complete_info/smi_round_1.smi .

#if there is no smi file, something went wrong

We now source the building environment, prepare the job, submit the job, and wait until the building is completed.

source environment:
source /wynton/group/bks/soft/DOCK-3.8.5/env.sh

prepare job

python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi

#note: smi_round_*.smi for the following rounds must be adjusted.

submit job
qsub building_array_job.sh

When this is finished, check for failed jobs and resubmit them. Some will always fail, just try to keep their number low.
When the molecules are built, we can proceed with docking them.
Now we generate an .sdi file from our first building round.

#cd into your building_1 folder

find /wynton/group/bks/work/pseemann/CHEMSTEP_PROJECT_FOLDER/building_1/building_output/ -type f -name "*.tgz" > round1.sdi

#this is an example; adjust the path to your own building_1 folder
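
If you prefer Python over find, the same .sdi file can be written with pathlib. This is a minimal sketch of the find command above (all regular files ending in .tgz, one absolute path per line); write_sdi is a made-up helper, and the subfolder names in the demo are invented.

```python
import tempfile
from pathlib import Path


def write_sdi(building_output, sdi_path):
    """Write the absolute path of every *.tgz file below building_output, one per line."""
    bundles = sorted(
        str(path.resolve())
        for path in Path(building_output).rglob("*.tgz")
        if path.is_file()
    )
    Path(sdi_path).write_text("".join(f"{bundle}\n" for bundle in bundles))
    return len(bundles)


# Demo on a throwaway directory with two fake bundles and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "building_output"
    (out / "1").mkdir(parents=True)
    (out / "2").mkdir()
    (out / "1" / "bundle.tgz").touch()
    (out / "2" / "bundle.tgz").touch()
    (out / "2" / "notes.log").touch()
    n_bundles = write_sdi(out, Path(tmp) / "round1.sdi")
    print(n_bundles, "bundles written to round1.sdi")
```

The returned count is a quick way to compare the number of built bundles with the number of submitted building jobs.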

'''Next round of docking: '''

We proceed with docking the freshly built compounds, so we make a directory called round_1, with dockfiles, and copy our new SDI file here, too

#cd into your CHEMSTEP_PROJECT_FOLDER

mkdir round_1
cd round_1
mkdir sdi
cd sdi
cp ../../building_1/*.sdi .
cd ..
cp -r ../round_0/dockfiles .
vim dockfiles/INDOCK

# adjust the score maximum in the INDOCK to your chosen pprop (recommended)
# submit with your favourite submission script just as you did for round_0
# so sh your_way_of_submitting.sh

As before, we will now run get_scores.py

source environment
source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate

Copy scripts and execute them
#cd into your round_1 directory

cp ../round_0/get_scores.py .
python get_scores.py 1

# wait until this finishes, always adjust the argument after
# python get_scores.py to not confuse your scores files with other rounds

Copy the conversion script and execute it
#still in your round_1 directory
#copy the right script: the molecule IDs have changed, so the convert script had to be adjusted

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/64_char_convert_scores_to_npy.py .
python 64_char_convert_scores_to_npy.py 1

# Wait until this finishes; always adjust the argument after
# python 64_char_convert_scores_to_npy.py to match the round
# This script was adjusted to fit the output of round_1

'''Next round of ChemSTEP:'''

#cd into your run_initial_and_iterative_chemstep directory

cp ../round_1/scores_round_1.npy .
cp ../round_1/indices_round_1.npy .

Copy over the iterative run_chemstep_iteratively.py and launch_chemstep_as_job_iteratively.sh.
These are slightly different from the initial ones, but from here on we will only use these scripts. Be sure to use the right scripts and to always provide the right arguments to them. This is (hopefully) the most confusing step in this ChemSTEP tutorial.

#still in run_initial_and_iterative_chemstep directory

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job_iteratively.sh .
cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iteratively.py .

'''Submit the 2nd round of ChemSTEP:'''

qsub launch_chemstep_as_job_iteratively.sh 2

#Give the number of the ChemSTEP round you are about to run. If you are using scores_round_1.npy and indices_round_1.npy, you provide 2 as the
#argument, as in the example above.
#For your 3rd round of ChemSTEP, with indices_round_2.npy and
#scores_round_2.npy, you would run qsub launch_chemstep_as_job_iteratively.sh 3
#and so on...
#If no argument is provided, the job will queue but quit after a few seconds to minutes.
#You can check this in the chemstep_submission.log file.
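
Since the argument is always one higher than the latest docked round, it can be derived from the .npy files in the ChemSTEP folder. The sketch below assumes that the files of all earlier rounds are still in that folder; docked_rounds and launch_command are made-up helpers and only print the command, they do not submit anything.

```python
import re
import tempfile
from pathlib import Path


def docked_rounds(chemstep_dir):
    """Rounds for which both scores_round_N.npy and indices_round_N.npy exist."""
    found = {"scores": set(), "indices": set()}
    for path in Path(chemstep_dir).glob("*_round_*.npy"):
        match = re.fullmatch(r"(scores|indices)_round_(\d+)\.npy", path.name)
        if match:
            found[match.group(1)].add(int(match.group(2)))
    return found["scores"] & found["indices"]


def launch_command(chemstep_dir):
    """Return the qsub command line for the next ChemSTEP round."""
    rounds = docked_rounds(chemstep_dir)
    if not rounds:
        raise FileNotFoundError("no complete scores/indices pair found")
    latest = max(rounds)
    missing = sorted(set(range(latest + 1)) - rounds)
    if missing:
        raise FileNotFoundError(f"rounds without both .npy files: {missing}")
    if latest == 0:
        return "qsub launch_chemstep_as_job_initial.sh"
    return f"qsub launch_chemstep_as_job_iteratively.sh {latest + 1}"


# Demo: empty placeholder files for round 0 and round 1
with tempfile.TemporaryDirectory() as tmp:
    for n in (0, 1):
        (Path(tmp) / f"scores_round_{n}.npy").touch()
        (Path(tmp) / f"indices_round_{n}.npy").touch()
    print(launch_command(tmp))
```

The error for missing rounds catches the case where only one of the two .npy files was copied over.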

Now we wait until the jobs finish, and brace ourselves for the next time-consuming round of building.

'''Next round of building:'''

#so in your CHEMSTEP_PROJECT_FOLDER

mkdir building_2
cd building_2
cp ../run_initial_and_iterative_chemstep/output/complete_info/smi_round_2.smi .

'''source environment'''
source /wynton/group/bks/soft/DOCK-3.8.5/env.sh

'''prepare job (always adjust smi_round.smi here)'''

python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_2.smi

'''submit the job'''
qsub building_array_job.sh

Now, as shown before: resubmit failed jobs, make your .sdi file, and proceed with docking (a new folder round_2 is recommended).

Now we have come full circle. The next step is to make the round_2 docking directory with the updated .sdi file from the 2nd round of building. After that round of docking, you extract scores and IDs, convert them, feed them to ChemSTEP, and repeat this until you no longer see an increase in recovery rate. Always update the arguments you pass to the Python and submission scripts to match the current round of ChemSTEP, building, extracting, etc.

'''Check your beacons'''

In the first rounds, one might want to check the beacons that were chosen by ChemSTEP. To do so, go to your run_initial_and_iterative_chemstep/ folder (the folder where you run ChemSTEP). Then make a list of IDs:

cat chemstep_algo.log | grep "with score" | awk '{print $3}' > list.txt
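
The same extraction can be done in Python, for example if you want to post-process the IDs directly. This sketch mirrors the command above (lines containing "with score", third whitespace-separated field); the log lines in the demo are invented, since the real format of chemstep_algo.log is not shown here.

```python
def beacon_ids(log_lines):
    """Return the third whitespace-separated field of every line containing 'with score'."""
    ids = []
    for line in log_lines:
        if "with score" not in line:
            continue
        fields = line.split()
        if len(fields) >= 3:  # unlike awk, skip lines that are too short
            ids.append(fields[2])
    return ids


# Invented example lines, only to show the mechanics
demo_log = [
    "Selected beacon MOL0000egWAbN with score -41.2",
    "Round 1 finished",
    "Selected beacon MOL0000deAfVb with score -40.7",
]
for mol_id in beacon_ids(demo_log):
    print(mol_id)
```

On the real log you would pass an open file instead, e.g. beacon_ids(open("chemstep_algo.log")).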

Copy this file to a new folder. Then source a Python environment.

source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate

Copy over this script and run it

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/beacon_to_smiles.py .
python beacon_to_smiles.py /wynton/group/bks/work/shared/kholland/chemstep_13B/boltz_fplib.pickle list.txt
#python beacon_to_smiles.py path/to/fingerprint_library list.txt

'''All scripts and files for this tutorial are present in'''

/wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/

ChemSTEP and how to convince it to pick the good stuff

2025-09-19T19:27:19Z

Pseemann:

Recently, some Lab members encountered an unusual and dominating enrichment of unreasonable molecules after several rounds of ChemSTEP. This might be a workaround to train ChemSTEP in the beginning to pick molecules as beacons not only by score, but also influenced by interaction filtering or visual inspection. This can be crucial in the first round of ChemSTEP, because the automated beacon selection can lead to an exclusive domination of cheating molecules - at least in the current version.

'''First step'''

Docking the seeds set as described either here [[Running_ChemSTEP]] or here [[Running_ChemSTEP_in_the_13_billion_space|ChemSTEP_in_the_13_billion_space]]. This is happening in your round_0 before running ChemSTEP.

'''Filtering - inspection - your own prioritization'''

What we will basically do is IFP-filtering - or any other filtering method you want to do - to not only rely on score. I am assuming here that you will do IFP, too, but you can also just do visual inspection and pick 100 molecules you like, to force ChemSTEP to like them, too (so to use them as beacons for your first round of ChemSTEP). For Lab members, IFP-filtering is best described in the Read-the-DOCK-docs: https://docs.docking.org/filtering.html#interaction-and-novelty-filtering
If you want a reference for files, folder structures, etc., you might want to have a look here on Wynton:

cd /wynton/group/bks/work/pseemann/2_IFP_CHEMSTEP

'''Making a list of names'''

Make a list of the molecule IDs that you find reasonable - so from your list of molecules which passed your filtering/visual-inspection/whatever-method-you-used. Save it as filtered_molecules.txt, for example.

Note: Between the 'trillion' and 'billion' version of ChemSTEP the beginning of the MOL IDs might be different, but that should not be an issue for this approach.

'''example list'''

MOL0000egWAbN
MOL0000deAfVb
MOL00006qQdBI
MOL0000besVZV
MOL0000blhcqH
MOL00008TJKq8
MOL0000emHoK9
MOL0000e3b5Xz
MOL00008K8dFx
MOL0000byZUhT

We go back to our docking folder for the seeds set (round_0) and run get_scores.py (see [[Running_ChemSTEP]] or here [[Running_ChemSTEP_in_the_13_billion_space|ChemSTEP_in_the_13_billion_space]]). This will yield a scores_round_0.txt file, which we will now tweak with artificial scores. I usually make a folder like this:

mkdir round_0_filtering
cd round_0_filtering
cp /path/to/filtered_molecules.txt . # your file with the names of molecules you want to be picked as beacons
cp /path/to/scores_round_0.txt . # your file from the get_scores.py script in your round_0 seed set docking folder

Now you need a script to just set artificial scores. I went for good stuff 0 and bad stuff 100. You could also do -40 and 100 or whatever you like. I do IFP every round since my target is a bit special. But I think only doing it in the first round can also suffice.

An example script to do so is here:

'''score_correction.py'''

import sys

scores_file = sys.argv[1]
ids_file = sys.argv[2]
out_file = sys.argv[3]

with open(ids_file) as f:
    id_set = set(line.strip() for line in f if line.strip())

print("IDs loaded:", id_set)

with open(scores_file) as f:
    lines = f.readlines()

with open(out_file, "w") as out:
    for line in lines:
        parts = line.strip().split()
        if len(parts) != 2:
            continue
        mol_id, score = parts
        mol_id = mol_id.strip()
        if mol_id in id_set:
            out.write(f"{mol_id} 0\n") #set good stuff to 0 you can adjust
        else:
            out.write(f"{mol_id} 100\n") #set bad stuff to 100 you can adjust

The script takes three arguments: the scores file, the file with your selected molecule IDs, and the name of the output file.

### We run this now in the prepared round_0_filtering folder which contains your scores_round_0.txt and filtered_molecules.txt
python score_correction.py scores_round_0.txt filtered_molecules.txt out.txt

The out.txt should now contain all the previous IDs, each with its artificial score.

To check how many molecules were set to the good score of 0:
grep -c " 0$" out.txt

'''file example of out.txt '''

MOL00004n1zC2 100
MOL00004n204O 100
MOL00004n21qu 100
MOL00004n242A 100
MOL00004n24Zm 0
MOL00004n28dM 100
MOL00004n2Anc 100
MOL00004n2AvB 100
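
It is also worth checking that every ID from filtered_molecules.txt actually made it into out.txt; an ID that does not appear in scores_round_0.txt is silently dropped by score_correction.py. A minimal sketch of such a check (summarize is a made-up helper, the demo uses a few lines from the examples above):

```python
from collections import Counter


def summarize(out_lines, wanted_ids):
    """Count the artificial scores and list wanted IDs that never appear in out.txt."""
    counts = Counter()
    seen = set()
    for line in out_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # same rule as in score_correction.py
        mol_id, score = parts
        counts[score] += 1
        seen.add(mol_id)
    missing = sorted(set(wanted_ids) - seen)
    return counts, missing


demo_out = ["MOL00004n242A 100", "MOL00004n24Zm 0", "MOL00004n28dM 100"]
demo_wanted = ["MOL00004n24Zm", "MOL0000egWAbN"]
counts, missing = summarize(demo_out, demo_wanted)
print("score counts:", dict(counts))
print("selected IDs without a docking score:", missing)
```

If the list of missing IDs is not empty, those molecules were probably not scored in round_0, or the ID format differs between the two files.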

So now we can copy the out.txt file over to our round_0 seed set docking folder and replace the 'old' scores_round_0.txt
#assuming you are still in round_0_filtering
cp out.txt /path/to/round_0/scores_round_0.txt

With the tweaked scores_round_0.txt file we now run convert_scores_to_npy.py, which reads the adjusted scores_round_0.txt to produce the indices_round_0.npy and scores_round_0.npy files. From here on, every step is the same as previously described (see [[Running_ChemSTEP]] or here [[Running_ChemSTEP_in_the_13_billion_space|ChemSTEP_in_the_13_billion_space]]).

ChemSTEP should now just 'see' compounds with a good score of 0 and ignore the ones with a score of 100. You can check this also in your ChemSTEP folder by looking at the .log files and the picked beacons.



ChemSTEP and how to cinvince it to pick the good stuff

2025-09-19T19:19:22Z

Pseemann:

Recently some Lab members encountered an unusual and dominating enrichment of non-reasonable molecules after several rounds of ChemSTEP. This here might be a work around to train ChemSTEP in the beginning to pick molecules as beacons not only by score but also to be influences by interaction-filtering or visual inspection. This can be crucial in the first round of ChemSTEP, because it can lead to a exclusive domination of cheating molecules -at least in the current version- if the automated beacon selection takes place.

'''First step'''

Docking the seeds set as described either here [[Running_ChemSTEP]] or here [[Running_ChemSTEP_in_the_13_billion_space|ChemSTEP_in_the_13_billion_space]]. This is happening in your round_0 before running ChemSTEP.

'''Filtering - inspection - your own prioritization'''

What we will basically do is IFP-filtering - or any other filtering method you want to do - to not only rely on score. I am assuming here that you will do IFP, too, but you can also just do visual inspection and pick 100 molecules you like, to force ChemSTEP to like them, too (so to use them as beacons for your first round of ChemSTEP). For Lab members IFP-filtering is best describe in the Read-the-DOCK-docs: https://docs.docking.org/filtering.html#interaction-and-novelty-filtering
if you want to have a reference for files and folder structures etc. you might want to have a look here on wynton:

cd /wynton/group/bks/work/pseemann/2_IFP_CHEMSTEP

'''Making a list of names'''

Make a list of the molecule IDs that you find reasonable - so from your list of molecules which passed your filtering/visual-inspection/whatever-method-you-used. Save it as filterd_molecules.txt for example.

Note: Between the 'trillion' and 'billion' version of ChemSTEP the beginning of the MOL IDs might be different, but that should not be an issue for this approach

'''example list'''

MOL0000egWAbN
MOL0000deAfVb
MOL00006qQdBI
MOL0000besVZV
MOL0000blhcqH
MOL00008TJKq8
MOL0000emHoK9
MOL0000e3b5Xz
MOL00008K8dFx
MOL0000byZUhT

We go back to our docking folder for the seeds set (round_0) and run get_scores.py (see [[Running_ChemSTEP]] or [[Running_ChemSTEP_in_the_13_billion_space|ChemSTEP_in_the_13_billion_space]]). This yields a scores_round_0.txt file, which we will now tweak with artificial scores. I usually make a folder like this:

mkdir round_0_filtering
cd round_0_filtering
cp /path/to/filtered_molecules.txt . # your file with the names of molecules you want to be picked as beacons
cp /path/to/scores_round_0.txt . # your file from the get_scores.py script in your round_0 seed set docking folder

Now you need a script that sets the artificial scores. I went for 0 for the good stuff and 100 for the bad stuff. You could also use -40 and 100, or whatever you like. I do IFP every round since my target is a bit special, but I think doing it only in the first round can also suffice.

An example script to do so is here:

'''score_correction.py'''

import sys

# Usage: python score_correction.py <scores_file> <ids_file> <out_file>
scores_file = sys.argv[1]
ids_file = sys.argv[2]
out_file = sys.argv[3]

# IDs of the molecules that should be picked as beacons
with open(ids_file) as f:
    id_set = set(line.strip() for line in f if line.strip())

print("IDs loaded:", id_set)

with open(scores_file) as f:
    lines = f.readlines()

with open(out_file, "w") as out:
    for line in lines:
        parts = line.strip().split()
        if len(parts) != 2:
            continue  # skip lines that are not "<ID> <score>"
        mol_id, score = parts
        if mol_id in id_set:
            out.write(f"{mol_id} 0\n")  # good stuff is set to 0, you can adjust
        else:
            out.write(f"{mol_id} 100\n")  # bad stuff is set to 100, you can adjust

The script takes three arguments: the scores file, the file with your selected molecule IDs, and the name of the output file.

### We run this now in the prepared round_0_filtering folder, which contains your scores_round_0.txt and filtered_molecules.txt
python score_correction.py scores_round_0.txt filtered_molecules.txt out.txt

out.txt should now contain all the IDs from scores_round_0.txt, each with its artificial score.

To check how many molecules received the good score:
grep " 0" out.txt | wc -l
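The grep count only tells you how many lines carry the good score. As a complementary check, here is a minimal sketch (not part of the provided scripts; the function name find_unmarked_ids is made up for illustration) that lists the selected IDs which did not end up with the good score in out.txt, for example because they are missing from scores_round_0.txt. It assumes the two-column "ID score" format shown on this page and a good score of 0.

```python
from pathlib import Path


def find_unmarked_ids(out_path, ids_path, good="0"):
    """Return the selected IDs that do not carry the good score in out_path."""
    # IDs you wanted to promote, one per line (blank lines ignored)
    wanted = {
        line.strip()
        for line in Path(ids_path).read_text().splitlines()
        if line.strip()
    }
    # IDs that actually received the good score in the tweaked scores file
    marked = set()
    for line in Path(out_path).read_text().splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == good:
            marked.add(parts[0])
    return sorted(wanted - marked)
```

Calling find_unmarked_ids("out.txt", "filtered_molecules.txt") in the round_0_filtering folder should return an empty list if every selected molecule was found and marked.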

'''file example of out.txt '''

MOL00004n1zC2 100
MOL00004n204O 100
MOL00004n21qu 100
MOL00004n242A 100
MOL00004n24Zm 0
MOL00004n28dM 100
MOL00004n2Anc 100
MOL00004n2AvB 100

Now we can copy out.txt over to our round_0 seed set docking folder and replace the 'old' scores_round_0.txt:
#assuming you are still in round_0_filtering
cp out.txt /path/to/round_0/scores_round_0.txt

With the tweaked scores_round_0.txt file in place, we now run convert_scores_to_npy.py, which reads the adjusted scores_round_0.txt to produce the indices and scores .npy files. From here on, every step is the same as previously described (see [[Running_ChemSTEP]] or [[Running_ChemSTEP_in_the_13_billion_space|ChemSTEP_in_the_13_billion_space]]).

ChemSTEP should now only 'see' compounds with a good score of 0 and ignore the ones with a score of 100. You can also check this in your ChemSTEP folder by looking at the .log files and the picked beacons.

ChemSTEP and how to convince it to pick the good stuff

2025-09-19T19:18:48Z

Pseemann:

Recently, some lab members encountered an unusual, dominating enrichment of unreasonable molecules after several rounds of ChemSTEP. The approach described here might be a workaround: it steers ChemSTEP from the beginning to pick beacons not only by score but also according to interactions or visual inspection. This can be crucial in the first round of ChemSTEP because, at least in the current version, automated beacon selection can lead to exclusive domination by cheating molecules.

'''First step'''

Dock the seeds set as described either in [[Running_ChemSTEP]] or in [[Running_ChemSTEP_in_the_13_billion_space|ChemSTEP_in_the_13_billion_space]]. This happens in your round_0 folder, before running ChemSTEP.

'''Filtering - inspection - your own prioritization'''

What we will basically do is IFP filtering, or any other filtering method you prefer, so that we do not rely on score alone. I am assuming here that you will do IFP, too, but you can also just do visual inspection and pick 100 molecules you like, to force ChemSTEP to like them, too (that is, to use them as beacons for your first round of ChemSTEP). For lab members, IFP filtering is best described in the Read-the-DOCK-docs: https://docs.docking.org/filtering.html#interaction-and-novelty-filtering
If you want a reference for files, folder structures etc., you might want to have a look here on Wynton:

cd /wynton/group/bks/work/pseemann/2_IFP_CHEMSTEP

'''Making a list of names'''

Make a list of the molecule IDs that you find reasonable, i.e. the molecules that passed your filtering, visual inspection, or whatever method you used. Save it as filtered_molecules.txt, for example.

Note: Between the 'trillion' and 'billion' versions of ChemSTEP, the beginning of the MOL IDs might be different, but that should not be an issue for this approach.

'''example list'''

MOL0000egWAbN
MOL0000deAfVb
MOL00006qQdBI
MOL0000besVZV
MOL0000blhcqH
MOL00008TJKq8
MOL0000emHoK9
MOL0000e3b5Xz
MOL00008K8dFx
MOL0000byZUhT

We go back to our docking folder for the seeds set (round_0) and run get_scores.py (see [[Running_ChemSTEP]] or [[Running_ChemSTEP_in_the_13_billion_space|ChemSTEP_in_the_13_billion_space]]). This yields a scores_round_0.txt file, which we will now tweak with artificial scores. I usually make a folder like this:

mkdir round_0_filtering
cd round_0_filtering
cp /path/to/filtered_molecules.txt . # your file with the names of molecules you want to be picked as beacons
cp /path/to/scores_round_0.txt . # your file from the get_scores.py script in your round_0 seed set docking folder

Now you need a script to just set artificial scores. I went for good stuff 0 and bad stuff 100. You could also do -40 and 100 or whatever you like. I do IFP every round since my target is a bit special. But I think only doing it in the first round can also suffice.

An example script to do so is here:

'''score_correction.py'''

import sys

# Usage: python score_correction.py <scores_file> <ids_file> <out_file>
scores_file = sys.argv[1]
ids_file = sys.argv[2]
out_file = sys.argv[3]

# IDs of the molecules that should be picked as beacons
with open(ids_file) as f:
    id_set = set(line.strip() for line in f if line.strip())

print("IDs loaded:", id_set)

with open(scores_file) as f:
    lines = f.readlines()

with open(out_file, "w") as out:
    for line in lines:
        parts = line.strip().split()
        if len(parts) != 2:
            continue  # skip lines that are not "<ID> <score>"
        mol_id, score = parts
        if mol_id in id_set:
            out.write(f"{mol_id} 0\n")  # good stuff is set to 0, you can adjust
        else:
            out.write(f"{mol_id} 100\n")  # bad stuff is set to 100, you can adjust

The script takes three arguments: the scores file, the file with your selected molecule IDs, and the name of the output file.

### We run this now in the prepared round_0_filtering folder, which contains your scores_round_0.txt and filtered_molecules.txt
python score_correction.py scores_round_0.txt filtered_molecules.txt out.txt

out.txt should now contain all the IDs from scores_round_0.txt, each with its artificial score.

To check how many molecules received the good score:
grep " 0" out.txt | wc -l

'''file example of out.txt '''

MOL00004n1zC2 100
MOL00004n204O 100
MOL00004n21qu 100
MOL00004n242A 100
MOL00004n24Zm 0
MOL00004n28dM 100
MOL00004n2Anc 100
MOL00004n2AvB 100

So now we can copy the out.txt file over to our round_0 seed set docking folder and replace the 'old' scores_round_0.txt
#assuming you are still in round_0_filtering
cp out.txt /path/to/round_0/scores_round_0.txt

With the tweaked scores_round_0.txt file in place, we now run convert_scores_to_npy.py, which reads the adjusted scores_round_0.txt to produce the indices and scores .npy files. From here on, every step is the same as previously described (see [[Running_ChemSTEP]] or [[Running_ChemSTEP_in_the_13_billion_space|ChemSTEP_in_the_13_billion_space]]).

ChemSTEP should now only 'see' compounds with a good score of 0 and ignore the ones with a score of 100. You can also check this in your ChemSTEP folder by looking at the .log files and the picked beacons.

ChemSTEP and how to convince it to pick the good stuff

2025-09-19T19:17:28Z

Pseemann: An approach to force ChemSTEP to select beacons by human reasoning

Recently, some lab members encountered an unusual, dominating enrichment of unreasonable molecules after several rounds of ChemSTEP. The approach described here might be a workaround: it steers ChemSTEP from the beginning to pick molecules not only by score but also according to interactions or visual inspection. This can be crucial in the first round of ChemSTEP because, at least in the current version, it can otherwise lead to exclusive domination by cheating molecules.

'''First step'''

Docking the seeds set as described either here [[Running_ChemSTEP]] or here [[Running_ChemSTEP_in_the_13_billion_space|ChemSTEP_in_the_13_billion_space]]. This is happening in your round_0 before running ChemSTEP.

'''Filtering - inspection - your own prioritization'''

What we will basically do is IFP filtering, or any other filtering method you prefer, so that we do not rely on score alone. I am assuming here that you will do IFP, too, but you can also just do visual inspection and pick 100 molecules you like, to force ChemSTEP to like them, too (that is, to use them as beacons for your first round of ChemSTEP). For lab members, IFP filtering is best described in the Read-the-DOCK-docs: https://docs.docking.org/filtering.html#interaction-and-novelty-filtering
If you want a reference for files, folder structures etc., you might want to have a look here on Wynton:

cd /wynton/group/bks/work/pseemann/2_IFP_CHEMSTEP

'''Making a list of names'''

Make a list of the molecule IDs that you find reasonable, i.e. the molecules that passed your filtering, visual inspection, or whatever method you used. Save it as filtered_molecules.txt, for example.

Note: Between the 'trillion' and 'billion' versions of ChemSTEP, the beginning of the MOL IDs might be different, but that should not be an issue for this approach.

'''example list'''

MOL0000egWAbN
MOL0000deAfVb
MOL00006qQdBI
MOL0000besVZV
MOL0000blhcqH
MOL00008TJKq8
MOL0000emHoK9
MOL0000e3b5Xz
MOL00008K8dFx
MOL0000byZUhT

We go back to our docking folder for the seeds set (round_0) and run the get_scores.py (see [[Running_ChemSTEP]] or here [[Running_ChemSTEP_in_the_13_billion_space|ChemSTEP_in_the_13_billion_space]]). This will yield a scores_round_0.txt file. This file we will now tweak with artificial scores. So I usually make a folder like this

mkdir round_0_filtering
cd round_0_filtering
cp /path/to/filtered_molecules.txt . # your file with the names of molecules you want to be picked as beacons
cp /path/to/scores_round_0.txt . # your file from the get_scores.py script in your round_0 seed set docking folder

Now you need a script to just set artificial scores. I went for good stuff 0 and bad stuff 100. You could also do -40 and 100 or whatever you like. I do IFP every round since my target is a bit special. But I think only doing it in the first round can also suffice.

An example script to do so is here:

'''score_correction.py'''

import sys

# Usage: python score_correction.py <scores_file> <ids_file> <out_file>
scores_file = sys.argv[1]
ids_file = sys.argv[2]
out_file = sys.argv[3]

# IDs of the molecules that should be picked as beacons
with open(ids_file) as f:
    id_set = set(line.strip() for line in f if line.strip())

print("IDs loaded:", id_set)

with open(scores_file) as f:
    lines = f.readlines()

with open(out_file, "w") as out:
    for line in lines:
        parts = line.strip().split()
        if len(parts) != 2:
            continue  # skip lines that are not "<ID> <score>"
        mol_id, score = parts
        if mol_id in id_set:
            out.write(f"{mol_id} 0\n")  # good stuff is set to 0, you can adjust
        else:
            out.write(f"{mol_id} 100\n")  # bad stuff is set to 100, you can adjust

The script takes three arguments: the scores file, the file with your selected molecule IDs, and the name of the output file.

### We run this now in the prepared round_0_filtering folder, which contains your scores_round_0.txt and filtered_molecules.txt
python score_correction.py scores_round_0.txt filtered_molecules.txt out.txt

out.txt should now contain all the IDs from scores_round_0.txt, each with its artificial score.

To check how many molecules received the good score:
grep " 0" out.txt | wc -l

'''file example of out.txt '''

MOL00004n1zC2 100
MOL00004n204O 100
MOL00004n21qu 100
MOL00004n242A 100
MOL00004n24Zm 0
MOL00004n28dM 100
MOL00004n2Anc 100
MOL00004n2AvB 100

So now we can copy the out.txt file over to our round_0 seed set docking folder and replace the 'old' scores_round_0.txt
#assuming you are still in round_0_filtering
cp out.txt /path/to/round_0/scores_round_0.txt

With the tweaked scores_round_0.txt file in place, we now run convert_scores_to_npy.py, which reads the adjusted scores_round_0.txt to produce the indices and scores .npy files. From here on, every step is the same as previously described (see [[Running_ChemSTEP]] or [[Running_ChemSTEP_in_the_13_billion_space|ChemSTEP_in_the_13_billion_space]]).

ChemSTEP should now only 'see' compounds with a good score of 0 and ignore the ones with a score of 100. You can also check this in your ChemSTEP folder by looking at the .log files and the picked beacons.

Running ChemSTEP in the 13 billion space

2025-09-18T21:42:16Z

Pseemann:

*Written by Philipp Seemann with the kind assistance of Katie Holland and Joseph Pepe (08/28/2025)
*This tutorial should work on Wynton with the provided scripts and public environments; you might need to catch tiny typos. I apologize.

This is meant to be a simple, hands-on, step-by-step guide for ChemSTEP on Wynton in the 13 billion space. For exploring the trillion space, please refer to: [[Running ChemSTEP|https://wiki.docking.org/index.php?title=Running_ChemSTEP]]

'''The general workflow is:'''
|--- Simply DOCK the seeds set (you can use any docking method)
|
|--- Run ChemSTEP
|
|--- Build molecules from 1st round of ChemSTEP
|
|--- DOCK molecules from 1st round of ChemSTEP
|
|--- Run ChemSTEP the 2nd time
|
|--- Build molecules from 2nd round of ChemSTEP
|
|--- DOCK molecules from 2nd round of ChemSTEP
|
|--- Run ChemSTEP the 3rd time
.
.
.
.
.
|--- Repeat until your Recovery rate does not improve

Here is a recommended folder layout after several rounds of ChemSTEP; the initial layout follows further below. It is meant as an overview of the folders we will create iteratively, so that building, docking, and ChemSTEP rounds do not get mixed up. (You can change this however you want. Once you get the hang of it, you can be creative and find a layout that works better for you!) This tutorial will lead you through round_0 and round_1. After that, it should be clear which steps you need to repeat over and over again. This tutorial is written for Wynton. Please run the scripts from a dev node.

'''Folder layout:'''
CHEMSTEP_PROJECT_FOLDER/
|
|----run_initial_and_iterative_chemstep/
|
|----round_0/
|
|----building_1/
|
|----round_1/
|
|----building_2/
|
|----round_2/
|
|....

'''Explanation of folders'''
'''run_initial_and_iterative_chemstep/'''

This is where we run ChemSTEP for every round. All scores_round_*.npy and indices_round_*.npy files are copied here after each new round of docking.

'''round_0/'''

Docking of the seeds set. This folder contains your dockfiles, and here we generate the initial indices and scores .npy files, which will be copied to the run_initial_and_iterative_chemstep/ folder later on.

'''building_1/'''

This folder is for building of the first .smi file from run_initial_and_iterative_chemstep/output/complete_info/smi_round_1.smi

'''round_1/'''

Docking of the molecules from the building_1 folder. This folder again contains your dockfiles, with an adjusted INDOCK and a new sdi file. After extracting the scores, it yields the indices and scores .npy files of round_1 for the second round of ChemSTEP.

'''building_2/'''

After running ChemSTEP from run_initial_and_iterative_chemstep the second time, we will have a new .smi file, which we will need to build.

'''round_2/'''

If you have read carefully to this point, you will know what comes next. We dock the molecules from building_2 here, extract the scores, generate the .npy files, and copy those to our run_initial_and_iterative_chemstep folder to generate the next smi_round_*.smi.

This may all still be confusing, but we will start easy. It will make sense, like any other Shoichet Lab tutorial. I promise.

'''First round of docking'''

Here are the first folders that you will need to set up for this tutorial

CHEMSTEP_PROJECT_FOLDER/
|
|
|----round_0/ # docking of the seeds set
|
|
|----run_initial_and_iterative_chemstep/ # all scores_round_*.npy and indices_round_*.npy files will be copied here

First, we make a traditional DOCKing directory in your work directory (be sure not to be in your home directory, because we will dock and build (DISKSPACE!)). The first step for ChemSTEP is basically just docking the 13B_seed_set_built.sdi, so submit it like your traditional LSD screen (with whatever submission style or script you prefer). The layout follows the folder layout shown above. What you need to bring yourself are your dockfiles (and your submission script).

mkdir CHEMSTEP_PROJECT_FOLDER/
cd CHEMSTEP_PROJECT_FOLDER/
mkdir round_0
cd round_0
cp -r path/to/your/dockfiles . #check your INDOCK parameters
mkdir sdi
cd sdi
cp /wynton/group/bks/work/bwhall61/mor_chemstep/DOCK/13M/seed/docking/bundle_paths.sdi .
cd ..
# cp your favourite submission script or whatever you use for submitting jobs
# or in case you use SymDOCK with the right executable set in the submission
# script

Submit the docking job with your method of preference.

'''Things to consider for the size of the seeds set'''

Ideally, we want enough molecules in our desired pProp region to be considered as beacons and virtual hits. So it is of great importance that enough molecules score in the desired region to be considered as beacons. On Wynton, there is the 130k seeds set, a 13M set, and somewhere also a 1.3M set.

/wynton/group/bks/work/shared/kholland/chemstep_13B/13B_seed_set_built.sdi #130k
/wynton/group/bks/work/bwhall61/mor_chemstep/DOCK/13M/seed/docking/bundle_paths.sdi #13M

Our aim is to dock a small chunk that is still big enough to represent the library, and among those molecules we want enough beacons chosen by ChemSTEP (so probably something between 50 and 100 molecules in the desired score range).

'''When the docking finishes:'''

Source the right environment now:

source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate

We will now extract scores and Molecule IDs, so we run get_scores.py

#cd into your round_0 directory

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .
python get_scores.py 0

#0 for initial round of chemstep

When get_scores.py runs successfully, we see scores_round_0.txt in our folder. Check your scores_round_0.txt file for output. It should look like this:

MOL12457028547 -29.32
MOL12457032486 -32.39
...

The get_scores.py script expects a certain output folder structure. If you do not see any output in your .txt file, vim into the script and adjust the paths. If you use the copied version, it should be:
/output/*/*/OUTDOCK.*
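If the scores file stays empty, it can help to check whether that glob pattern matches anything in your docking folder before editing the script. The following is a minimal sketch, not an excerpt from get_scores.py; the function name find_outdock_files is made up for illustration, and the pattern is assumed to be relative to the round_0 directory.

```python
from pathlib import Path


def find_outdock_files(docking_dir, pattern="output/*/*/OUTDOCK.*"):
    """List the OUTDOCK files below docking_dir that match the glob pattern."""
    return sorted(Path(docking_dir).glob(pattern))
```

Running len(find_outdock_files(".")) inside round_0 gives the number of OUTDOCK files the pattern finds. If it is 0, your output folders are probably nested differently and the pattern in the script needs adjusting.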

Now we translate scores_round_0.txt into indices_round_0.npy and scores_round_0.npy

#cd into your round_0 directory

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .
python convert_scores_to_npy.py 0

#0 for initial round of chemstep

Now we should find indices_round_0.npy and scores_round_0.npy in our round_0 directory.

'''First round of ChemSTEP. EXCITING!!!'''

'''Now we set up ChemSTEP:'''

#now cd into your CHEMSTEP_PROJECT_FOLDER

mkdir run_initial_and_iterative_chemstep
cp round_0/*_round_0.npy run_initial_and_iterative_chemstep/
cd run_initial_and_iterative_chemstep/

Now the two .npy files should be in our run_initial_and_iterative_chemstep folder, ready for running ChemSTEP.
We will now set up our initial submission script, which is slightly different from the iterative one used for the following rounds.

#cd into your folder for running chemstep -
#run_initial_and_iterative_chemstep/

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job_initial.sh .
cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_initial.py .

We stay in this folder, copy over a params.txt file, and then adjust it to our liking.

#we are still in run_initial_and_iterative_chemstep

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .

Adjust your params file - with vim if you like.
vim params.txt

seed_indices_file: path/to/your/run_initial_and_iterative_chemstep/indices_round_0.npy

seed_scores_file: path/to/your/run_initial_and_iterative_chemstep/scores_round_0.npy

hit_pprop: 6 #may change depending on library size, here 13B

n_docked_per_round: 1000000 #reasonable for 13B

max_beacons: 100 #reasonable for 13B

max_n_rounds: 250 # recommended
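A wrong path in params.txt is easy to miss until the job has already been queued. The sketch below reads the file and reports seed file entries that do not exist. It is not part of ChemSTEP: the helper names read_params and missing_seed_files are made up, and it assumes the plain "key: value #comment" layout shown above, which may not match how ChemSTEP itself parses the file.

```python
from pathlib import Path


def read_params(path):
    """Parse 'key: value  # comment' lines into a dict of strings."""
    params = {}
    for line in Path(path).read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(":", 1)
        params[key.strip()] = value.strip()
    return params


def missing_seed_files(params):
    """Return the seed file keys whose paths do not point to an existing file."""
    keys = ("seed_indices_file", "seed_scores_file")
    return [key for key in keys if not Path(params.get(key, "")).is_file()]
```

For example, missing_seed_files(read_params("params.txt")) should return an empty list once both .npy paths are set correctly.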

Now there should be two scripts, two .npy files, and one params.txt file in your run_initial_and_iterative_chemstep folder.
We submit our first round of ChemSTEP with:

qsub launch_chemstep_as_job_initial.sh

# This will queue a job.
# The job will split into many jobs (e.g. 600), which will run for a while.
# The initial job keeps running for a while after those jobs finish and should generate
# output in /output/complete_info/; there should be a .smi file if it ran successfully.

'''Two notes here:'''

First, we will never touch the params.txt file again. From my understanding, we need to copy each following scores_round_*.npy and indices_round_*.npy into the same directory (in our case, run_initial_and_iterative_chemstep)

Second, I had issues with getting Wynton to execute python for every submitted job. I hope this is fixed for you as well now.

'''First round of building:'''

When ChemSTEP finishes, we can go to our CHEMSTEP_PROJECT_FOLDER and brace ourselves for building.

#cd into your CHEMSTEP_PROJECT_FOLDER

mkdir building_1
cd building_1
cp ../run_initial_and_iterative_chemstep/output/complete_info/smi_round_1.smi .

#if there is no smi file, something went wrong

We now source the building environment, prepare the job, submit the job, and wait until the building is completed.

source environment:
source /wynton/group/bks/soft/DOCK-3.8.5/env.sh

prepare job

python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi

#note: smi_round_*.smi for the following rounds must be adjusted.

submit job
qsub building_array_job.sh

When this is finished, check for failed jobs and resubmit. Some will always fail, just try to keep them low.
When the molecules are built, we can proceed with docking them.
Now we generate an .sdi file from our first building round.

#cd into your building_1 folder

find /wynton/group/bks/work/pseemann/CHEMSTEP_PROJECT_FOLDER/building_1/building_output/ -type f -name "*.tgz" > round1.sdi

#example adjust paths
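If you prefer to stay in Python, or want the number of bundles reported at the same time, the find command above can be approximated as follows. This is a sketch under the assumption that the .sdi file is simply a list of absolute .tgz paths, one per line, which is what the find command produces; write_sdi is a made-up helper name.

```python
from pathlib import Path


def write_sdi(building_output, sdi_path):
    """Write the absolute path of every .tgz below building_output, one per line."""
    bundles = sorted(
        p.resolve() for p in Path(building_output).rglob("*.tgz") if p.is_file()
    )
    Path(sdi_path).write_text("".join(f"{p}\n" for p in bundles))
    return len(bundles)  # number of bundles listed in the .sdi file
```

Called as write_sdi("building_output", "round1.sdi") from the building_1 folder, it returns the number of bundles found, which is a quick way to notice a building run that produced far fewer bundles than expected.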

'''Next round of docking: '''

We proceed with docking the freshly built compounds, so we make a directory called round_1, with dockfiles, and copy our new SDI file here, too

#cd into your CHEMSTEP_PROJECT_FOLDER

mkdir round_1
cd round_1
mkdir sdi
cd sdi
cp ../../building_1/*.sdi .
cd ..
cp -r ../round_0/dockfiles .
vim dockfiles/INDOCK

# adjust the score maximum in the INDOCK to your chosen pprop (recommended)
# submit with your favourite submission script just as you did for round_0
# so sh your_way_of_submitting.sh

As before, we will now run get_scores.py

source environment
source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate

Copy scripts and execute them
#cd into your round_1 directory

cp ../round_0/get_scores.py .
python get_scores.py 1

# wait until this finishes, always adjust the argument after
# python get_scores.py to not confuse your scores files with other rounds

Copy scripts and execute them
#still in your round_1 directory
#copy the right script, IDs have changed, and needed to be adjusted in the convert script

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/64_char_convert_scores_to_npy.py .
python 64_char_convert_scores_to_npy.py 1

# Wait until this finishes; always adjust the argument after
# python 64_char_convert_scores_to_npy.py to match the round.
# The script was adjusted to fit the output of round_1.

'''Next round of ChemSTEP:'''

#cd into your run_initial_and_iterative_chemstep directory

cp ../round_1/scores_round_1.npy .
cp ../round_1/indices_round_1.npy .

Copy over the iterative scripts run_chemstep_iteratively.py and launch_chemstep_as_job_iteratively.sh.
These are slightly different from the initial ones, and from here on we will only use these scripts. Be sure to use the right scripts and to always provide the right arguments to them. This is (hopefully) the most confusing step in this ChemSTEP tutorial.

#still in run_initial_and_iterative_chemstep directory

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job_iteratively.sh .
cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iteratively.py .

'''Submit the 2nd round of ChemSTEP:'''

qsub launch_chemstep_as_job_iteratively.sh 2

# Pass the number of the ChemSTEP round you are running as the argument.
# If you provide scores_round_1.npy and indices_round_1.npy, pass 2, as in the example above.
# For your 3rd round of ChemSTEP, with indices_round_2.npy and scores_round_2.npy,
# run qsub launch_chemstep_as_job_iteratively.sh 3, and so on.
# If no argument is provided, the job will queue but quit after a few seconds to minutes.
# You can check this in the chemstep_submission.log file.
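Since the argument is always one higher than the latest docked round, it can be derived from the files already copied into the ChemSTEP folder. The following sketch is not part of the provided scripts (next_chemstep_round is a made-up name); it assumes the scores_round_N.npy / indices_round_N.npy naming used in this tutorial and only counts rounds for which both files are present.

```python
import re
from pathlib import Path


def next_chemstep_round(chemstep_dir):
    """Return the argument for the next iterative run: highest docked round + 1."""
    rounds = []
    for path in Path(chemstep_dir).glob("scores_round_*.npy"):
        match = re.fullmatch(r"scores_round_(\d+)\.npy", path.name)
        # only count a round if the matching indices file was copied as well
        if match and (path.parent / f"indices_round_{match.group(1)}.npy").is_file():
            rounds.append(int(match.group(1)))
    if not rounds:
        raise FileNotFoundError("no matching scores/indices .npy pair found")
    return max(rounds) + 1
```

With the round_0 and round_1 files in run_initial_and_iterative_chemstep, next_chemstep_round(".") returns 2, matching the qsub example above.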

Now we wait until the jobs finish, and brace ourselves for the next time-consuming round of building.

'''Next round of building:'''

#so in your CHEMSTEP_PROJECT_FOLDER

mkdir building_2
cd building_2
cp ../run_initial_and_iterative_chemstep/output/complete_info/smi_round_2.smi .

'''source environment'''
source /wynton/group/bks/soft/DOCK-3.8.5/env.sh

'''prepare job (always adjust smi_round.smi here)'''

python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_2.smi

'''submit the job'''
qsub building_array_job.sh

Now proceed as shown before: resubmit failed jobs, make your sdi file, and continue with docking (a new folder round_2 is recommended).

Now we have come full circle. The next step is to set up your round_2 docking directory with the updated .sdi file from the 2nd round of building. After that round of docking, you extract scores and IDs, convert them, feed them to ChemSTEP, and repeat this until you no longer see an increase in recovery rate. Always update the arguments you pass to the python and submission scripts to match the current round of ChemSTEP, building, extracting etc.

'''Notes:'''

Slight changes here so far compared to the trillion space:

-added a workflow chart

-added a suggested directory structure

-adjusted the extract scripts for 13B

-Separated the initial and iterative scripts for submission and run ChemSTEP, the iterative submission now only works when an argument is passed for the round of submission

-Added a shared folder for the 13B scripts if others want to use them as well

/wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/

Running ChemSTEP in the 13 billion space

2025-09-18T20:14:36Z

Pseemann:

*Written by Philipp Seemann with the kind assistance of Katie Holland and Joseph Pepe (08/28/2025)
*This tutorial should work on Wynton with the provided scripts and public environments; you might need to catch tiny typos. I apologize.

This is meant to be a simple, hands-on, step-by-step guide for ChemSTEP on Wynton in the 13 billion space. For exploring the trillion space, please refer to: [[Running ChemSTEP|https://wiki.docking.org/index.php?title=Running_ChemSTEP]]

'''The general workflow is:'''
|--- Simply DOCK the seeds set (you can use any docking method)
|
|--- Run ChemSTEP
|
|--- Build molecules from 1st round of ChemSTEP
|
|--- DOCK molecules from 1st round of ChemSTEP
|
|--- Run ChemSTEP the 2nd time
|
|--- Build molecules from 2nd round of ChemSTEP
|
|--- DOCK molecules from 2nd round of ChemSTEP
|
|--- Run ChemSTEP the 3rd time
.
.
.
.
.
|--- Repeat until your Recovery rate does not improve

Here is a recommended folder layout after several rounds of ChemSTEP; the initial layout follows further below. It is meant as an overview of the folders we will create iteratively, so that building, docking, and ChemSTEP rounds do not get mixed up. (You can change this however you want. Once you get the hang of it, you can be creative and find a layout that works better for you!) This tutorial will lead you through round_0 and round_1. After that, it should be clear which steps you need to repeat over and over again. This tutorial is written for Wynton. Please run the scripts from a dev node.

'''Folder layout:'''
CHEMSTEP_PROJECT_FOLDER/
|
|----run_initial_and_iterative_chemstep/
|
|----round_0/
|
|----building_1/
|
|----round_1/
|
|----building_2/
|
|----round_2/
|
|....

'''Explanation of folders'''
'''run_initial_and_iterative_chemstep/'''

This is where we run ChemSTEP for every round. All scores_round_*.npy and indices_round_*.npy files are copied here after each new round of docking.

'''round_0/'''

Docking of the seeds set. This folder contains your dockfiles, and here we generate the initial indices and scores .npy files, which will be copied to the run_initial_and_iterative_chemstep/ folder later on.

'''building_1/'''

This folder is for building of the first .smi file from run_initial_and_iterative_chemstep/output/complete_info/smi_round_1.smi

'''round_1/'''

Docking of the molecules from the building_1 folder. This folder again contains your dockfiles, with an adjusted INDOCK and a new sdi file. After extracting the scores, it yields the indices and scores .npy files of round_1 for the second round of ChemSTEP.

'''building_2/'''

After running ChemSTEP from run_initial_and_iterative_chemstep the second time, we will have a new .smi file, which we will need to build.

'''round_2/'''

If you have read carefully to this point, you will know what comes next. We dock the molecules from building_2 here, extract the scores, generate the .npy files, and copy those to our run_initial_and_iterative_chemstep folder to generate the next smi_round_*.smi.

This may all still be confusing, but we will start easy. It will make sense, like any other Shoichet Lab tutorial. I promise.

'''First round of docking'''

Here are the first folders that you will need to set up for this tutorial

CHEMSTEP_PROJECT_FOLDER/
|
|
|----round_0/ # docking of the seeds set
|
|
|----run_initial_and_iterative_chemstep/ # all scores_round_*.npy and indices_round_*.npy files will be copied here

First, we make a traditional DOCKing directory in your work directory (be sure not to be in your home directory, because we will dock and build (DISKSPACE!)). The first step for ChemSTEP is basically just docking the 13B_seed_set_built.sdi, so submit it like your traditional LSD screen (with whatever submission style or script you prefer). The layout follows the folder layout shown above. What you need to bring yourself are your dockfiles (and your submission script).

mkdir CHEMSTEP_PROJECT_FOLDER/
cd CHEMSTEP_PROJECT_FOLDER/
mkdir round_0
cd round_0
cp -r path/to/your/dockfiles . #check your INDOCK parameters
mkdir sdi
cd sdi
cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13B_seed_set_built.sdi .
cd ..
# cp your favourite submission script or whatever you use for submitting jobs
# or in case you use SymDOCK with the right executable set in the submission
# script

Submit the docking job with your method of preference.

'''Things to consider for the size of the seeds set'''

Ideally, we want enough molecules in our desired pProp region to be considered as beacons and virtual hits, so it is important that enough molecules score in that region. On Wynton, there is the 130k seeds set, a 13M set, and somewhere also a 1.3M set.

/wynton/group/bks/work/shared/kholland/chemstep_13B/13B_seed_set_built.sdi #130k
/wynton/group/bks/work/bwhall61/mor_chemstep/DOCK/13M/seed/docking/bundle_paths.sdi #13M

Our aim is to dock a small chunk that is still big enough to represent the library, and that contains enough beacons chosen by ChemSTEP (probably somewhere between 50 and 100 molecules in the desired score range).
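
To get a rough feeling for whether a seeds set gives enough molecules in the desired score range, one could count how many docked molecules score at or below a cutoff. The sketch below is only an illustration: it assumes the two-column "ID score" text layout of scores_round_0.txt, and the function name count_below_cutoff and the cutoff of -30.0 are made up for this example. ChemSTEP selects its beacons by its own criteria, so this is not a replacement for that.

```python
def count_below_cutoff(lines, cutoff):
    """Count molecules whose docking score is at or below `cutoff`.

    `lines` is any iterable of text lines in the assumed
    "<molecule ID> <score>" layout; other lines are skipped.
    """
    n = 0
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # blank or malformed line
        try:
            score = float(parts[1])
        except ValueError:
            continue  # second column is not a number
        if score <= cutoff:
            n += 1
    return n


# Two example lines in the assumed layout, plus an empty line.
example = ["MOL12457028547 -29.32", "MOL12457032486 -32.39", ""]
print(count_below_cutoff(example, -30.0))
```

For a real file, one would pass the open file object, e.g. count_below_cutoff(open("scores_round_0.txt"), your_cutoff).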

'''When the docking finishes:'''

Source the right environment now:

source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate

We will now extract scores and Molecule IDs, so we run get_scores.py

#cd into your round_0 directory

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .
python get_scores.py 0

#0 for initial round of chemstep

When get_scores.py runs successfully, we see scores_round_0.txt in our folder. Check this file for output. It should look like this:

MOL12457028547 -29.32
MOL12457032486 -32.39
...

The get_scores.py script expects a certain output folder structure. If you do not see any output in your .txt file, open the script (e.g. with vim) and adjust the paths. If you use the copied version, the expected pattern is:
/output/*/*/OUTDOCK.*
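
Before editing the script, it can help to check whether any OUTDOCK files match this pattern at all. The following is a minimal sketch, assuming the pattern is meant relative to the round directory (so output/*/*/OUTDOCK.* inside round_0); the helper name find_outdock_files is hypothetical and not part of get_scores.py.

```python
import glob
import os


def find_outdock_files(round_dir, pattern="output/*/*/OUTDOCK.*"):
    """Return all files under `round_dir` that match the assumed OUTDOCK pattern."""
    return sorted(glob.glob(os.path.join(round_dir, pattern)))


# Run from inside your round directory; "." is the current directory.
matches = find_outdock_files(".")
print(f"{len(matches)} OUTDOCK files matched")
```

If this prints 0 although the docking finished, the output folders are probably nested differently and the pattern in the script needs to be adjusted.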

Now we translate scores_round_0.txt into indices_round_0.npy and scores_round_0.npy

#cd into your round_0 directory

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .
python convert_scores_to_npy.py 0

#0 for initial round of chemstep

Now we should find indices_round_0.npy and scores_round_0.npy in our round_0 directory.

'''First round of ChemSTEP. EXCITING!!!'''

'''Now we set up ChemSTEP:'''

#now cd into your CHEMSTEP_PROJECT_FOLDER

mkdir run_initial_and_iterative_chemstep
cp round_0/*_round_0.npy run_initial_and_iterative_chemstep/
cd run_initial_and_iterative_chemstep/

Now the two .npy files should be in our run_initial_and_iterative_chemstep folder, ready to run ChemSTEP.
We will now set up our initial submission script, which will be slightly different from the iterative one for the following rounds.

#cd into your folder for running chemstep -
#run_initial_and_iterative_chemstep/

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job_initial.sh .
cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_initial.py .

We stay in this folder, copy over a params.txt file, and then adjust it to our liking.

#we are still in run_initial_and_iterative_chemstep

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .

Adjust your params file - with vim if you like.
vim params.txt

seed_indices_file: path/to/your/run_initial_and_iterative_chemstep/indices_round_0.npy

seed_scores_file: path/to/your/run_initial_and_iterative_chemstep/scores_round_0.npy

hit_pprop: 6 #may change depending on library size, here 13B

n_docked_per_round: 1000000 #reasonable for 13B

max_beacons: 100 #reasonable for 13B

max_n_rounds: 250 # recommended
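
To double-check the edited file before submitting, one could read it back and print the values. This is a minimal sketch that assumes the simple "key: value #comment" layout shown above; parse_params is a hypothetical helper for this check only, and ChemSTEP's own parser may treat the file differently.

```python
def parse_params(text):
    """Parse "key: value #comment" lines into a dict of strings."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue  # skip blank lines and anything without a key
        key, value = line.split(":", 1)
        params[key.strip()] = value.strip()
    return params


# Example content in the layout used in this tutorial.
example = """\
seed_indices_file: path/to/your/run_initial_and_iterative_chemstep/indices_round_0.npy
hit_pprop: 6 #may change depending on library size, here 13B
n_docked_per_round: 1000000 #reasonable for 13B
max_beacons: 100 #reasonable for 13B
max_n_rounds: 250 # recommended
"""
params = parse_params(example)
print(params["hit_pprop"], params["n_docked_per_round"])
```

For the real file, parse_params(open("params.txt").read()) gives the same kind of dict, and the two seed file paths can then be checked with os.path.exists.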

Now there should be two scripts, two .npy files, and one params.txt file in your run_initial_and_iterative_chemstep folder.
Now we submit our first round of ChemSTEP with:

qsub launch_chemstep_as_job_initial.sh

# this will queue a job
# the job will split into many jobs (e.g. 600), which will run for a while
# the initial job keeps running for a while after those jobs finish and should generate
# output in /output/complete_info/; on success, there should be a .smi file

'''Two notes here:'''

First, we will never touch the params.txt file again. From my understanding, we need to copy each following scores_round_*.npy and indices_round_*.npy into the same directory (in our case, run_initial_and_iterative_chemstep).

Second, I had issues with getting Wynton to execute python for every submitted job. I hope this is fixed for you by now.

'''First round of building:'''

When ChemSTEP finishes, we can go to our CHEMSTEP_PROJECT_FOLDER and brace ourselves for building.

#cd into your CHEMSTEP_PROJECT_FOLDER

mkdir building_1
cd building_1
cp ../run_initial_and_iterative_chemstep/output/complete_info/smi_round_1.smi .

#if there is no smi file, something went wrong

We now source the building environment, prepare the job, submit the job, and wait until the building is completed.

source environment:
source /wynton/group/bks/soft/DOCK-3.8.5/env.sh

prepare job

python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi

#note: smi_round_*.smi for the following rounds must be adjusted.

submit job
qsub building_array_job.sh

When this is finished, check for failed jobs and resubmit them. Some will always fail; just try to keep the number low.
When the molecules are built, we can proceed with docking them.
Now we generate an .sdi file from our first building round.

#cd into your building_1 folder

find /wynton/group/bks/work/pseemann/CHEMSTEP_PROJECT_FOLDER/building_1/building_output/ -type f -name "*.tgz" > round1.sdi

#this is an example; adjust the path to your own CHEMSTEP_PROJECT_FOLDER
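
If you prefer not to hard-code the path in a find command, the same list can be written from Python. This is a minimal sketch of an equivalent, not one of the provided tutorial scripts; the function name write_sdi is made up, and it assumes the built bundles are the *.tgz files under building_output, as in the find command above.

```python
from pathlib import Path


def write_sdi(build_output_dir, sdi_path):
    """Write the absolute path of every *.tgz file under `build_output_dir`
    to `sdi_path`, one per line, and return how many were found."""
    paths = sorted(
        str(p.resolve())
        for p in Path(build_output_dir).rglob("*.tgz")
        if p.is_file()
    )
    Path(sdi_path).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)


# Example call from inside building_1 (uncomment to use):
# n = write_sdi("building_output", "round1.sdi")
# print(f"wrote {n} paths to round1.sdi")
```

The number of paths written is a quick check on how many bundles were actually built.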

'''Next round of docking: '''

We proceed with docking the freshly built compounds, so we make a directory called round_1 with dockfiles, and copy our new .sdi file there, too.

#cd into your CHEMSTEP_PROJECT_FOLDER

mkdir round_1
cd round_1
mkdir sdi
cd sdi
cp ../../building_1/*.sdi .
cd ..
cp -r ../round_0/dockfiles .
vim dockfiles/INDOCK

# adjust the score maximum in the INDOCK to your chosen pprop (recommended)
# submit with your favourite submission script just as you did for round_0
# so sh your_way_of_submitting.sh

As before, we will now run get_scores.py

source environment
source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate

Copy scripts and execute them
#cd into your round_1 directory

cp ../round_0/get_scores.py .
python get_scores.py 1

# wait until this finishes, always adjust the argument after
# python get_scores.py to not confuse your scores files with other rounds

Copy scripts and execute them
#still in your round_1 directory
#copy the right script: the molecule IDs have changed, so the convert script needed to be adjusted

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/64_char_convert_scores_to_npy.py .
python 64_char_convert_scores_to_npy.py 1

# wait until this finishes; always adjust the argument after
# python 64_char_convert_scores_to_npy.py to match the round
# (the script was adjusted to fit the output of round_1)

'''Next round of ChemSTEP:'''

#cd into your run_initial_and_iterative_chemstep directory

cp ../round_1/scores_round_1.npy .
cp ../round_1/indices_round_1.npy .

Copy over the iterative run_chemstep_iteratively.py and launch_chemstep_as_job_iteratively.sh.
These are slightly different than the initial ones, but from here on, we will only use these scripts. Be sure to use the right scripts and to always provide the right arguments to them. This is (hopefully) the most confusing step in this ChemSTEP tutorial.

#still in run_initial_and_iterative_chemstep directory

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job_iteratively.sh .
cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iteratively.py .

'''Submit the 2nd round of ChemSTEP:'''

qsub launch_chemstep_as_job_iteratively.sh 2

#Give the number of the ChemSTEP round you are running as the argument: if you are using scores_round_1.npy and indices_round_1.npy, you provide 2,
#like in the example above.
#If you run your 3rd round of ChemSTEP and provide indices_round_2.npy and
#scores_round_2.npy, you would do qsub launch_chemstep_as_job_iteratively.sh 3
#and so on...
#If no argument is provided, the job will queue but then quit after a few seconds to minutes.
#You can check this in the chemstep_submission.log file.
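
The off-by-one between the round argument and the file names is easy to get wrong, so here is the bookkeeping written out as a small sketch. The helper name files_for_round is hypothetical; it only encodes the naming used in this tutorial (argument N reads the round N-1 .npy files and should produce smi_round_N.smi) and does not call ChemSTEP.

```python
def files_for_round(round_arg):
    """Return the input and output file names that belong to the ChemSTEP
    round submitted with `round_arg` (2 for the 2nd round, 3 for the 3rd, ...)."""
    if round_arg < 2:
        raise ValueError("the iterative scripts are used from the 2nd round on")
    prev = round_arg - 1
    return {
        "inputs": [f"scores_round_{prev}.npy", f"indices_round_{prev}.npy"],
        "output": f"smi_round_{round_arg}.smi",
    }


print(files_for_round(2))
```

Checking that both input files are present in run_initial_and_iterative_chemstep before calling qsub avoids a wasted submission.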

Now we wait until the jobs finish, and brace ourselves for the next time-consuming round of building.

'''Next round of building:'''

#so in your CHEMSTEP_PROJECT_FOLDER

mkdir building_2
cd building_2
cp ../run_initial_and_iterative_chemstep/output/complete_info/smi_round_2.smi .

'''source environment'''
source /wynton/group/bks/soft/DOCK-3.8.5/env.sh

'''prepare job (always adjust smi_round.smi here)'''

python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_2.smi

'''submit the job'''
qsub building_array_job.sh

Now proceed as shown before: resubmit failed jobs, make your .sdi file, and proceed with docking (a new folder round_2 is recommended).

Now we have come full circle. The next step would be to make your round_2 docking directory with the updated .sdi file from the 2nd round of building. After that round of docking, you would extract scores and IDs, convert them, feed them to ChemSTEP, and repeat this until you no longer see an increase in recovery rate. You always need to update the arguments you pass to the python and submission scripts to match the current round of ChemSTEP, building, extracting, etc.

'''Notes:'''

Slight changes here so far compared to the trillion space:

-added a workflow chart

-added a suggested directory structure

-adjusted the extract scripts for 13B

-Separated the initial and iterative scripts for submitting and running ChemSTEP; the iterative submission now only works when an argument is passed for the round of submission

-Added a shared folder for the 13B scripts if others want to use them as well

/wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/

Running ChemSTEP in the 13 billion space

2025-09-02T18:18:54Z

Pseemann:

*Written by Philipp Seemann with the kind assistance of Katie Holland and Joseph Pepe (08/28/2025)
*This tutorial should work on Wynton with the provided scripts and public environments; you might need to catch tiny typos. I apologize.

This is meant to be a simple hands-on step by step guide for ChemSTEP on Wynton in the 13 billion space. For exploring the trillion space please refer to:[[Running ChemSTEP|https://wiki.docking.org/index.php?title=Running_ChemSTEP]]

'''The general workflow is:'''
|--- Simply DOCK the seeds set (you can use any docking method)
|
|--- Run ChemSTEP
|
|--- Build molecules from 1st round of ChemSTEP
|
|--- DOCK molecules from 1st round of ChemSTEP
|
|--- Run ChemSTEP the 2nd time
|
|--- Build molecules from 2nd round of ChemSTEP
|
|--- DOCK molecules from 2nd round of ChemSTEP
|
|--- Run ChemSTEP the 3rd time
.
.
.
.
.
|--- Repeat until your Recovery rate does not improve

Here is a recommended folder layout after several rounds of ChemSTEP. We will also provide the initial layout further below. This should just give you an overview of which folders we need to make iteratively, so that we do not confuse our building, docking, and ChemSTEP rounds. (You can change this however you want. Once you get the hang of it, you can be creative and find a layout and workflow that works better for you!) This tutorial will lead you through round_0 and round_1. After that, it should become clear which steps you need to repeat over and over again. This tutorial is written for Wynton. Please run the scripts from a dev node.

'''Folder layout:'''
CHEMSTEP_PROJECT_FOLDER/
|
|----run_initial_and_iterative_chemstep/
|
|----round_0/
|
|----building_1/
|
|----round_1/
|
|----building_2/
|
|----round_2/
|
|....

'''Explanation of folders'''
'''run_initial_and_iterative_chemstep/'''

Here, we actually run ChemSTEP for every new round. All scores_round_*.npy and indices_round_*.npy files are copied here after every new round of docking.

'''round_0/'''

Docking of the seeds set. This folder contains your dockfiles, and here we generate the initial indices and scores .npy files, which will be copied to the run_initial_and_iterative_chemstep/ folder later on.

'''building_1/'''

This folder is for building of the first .smi file from run_initial_and_iterative_chemstep/output/complete_info/smi_round_1.smi

'''round_1/'''

Docking of the molecules from the building_1 folder. This folder again contains your dockfiles, with an adjusted INDOCK and a new .sdi file. After extracting the scores, it yields the indices and scores .npy files of round_1 for the second round of ChemSTEP.

'''building_2/'''

After running ChemSTEP from run_initial_and_iterative_chemstep the second time, we will have a new .smi file, which we will need to build.

'''round_2/'''

If you have read carefully to this point, you will know what comes next. We dock the molecules from building_2 here, extract the scores, and generate the .npy files. We then copy those to our run_initial_and_iterative_chemstep folder to generate the next smi_round_*.smi.

This may all still seem confusing, but we will start easy. It will make sense, like any other Shoichet Lab tutorial. I promise.

'''First round of docking'''

Here are the first folders that you will need to set up for this tutorial

CHEMSTEP_PROJECT_FOLDER/
|
|
|----round_0/ # docking of the seeds set
|
|
|----run_initial_and_iterative_chemstep/ #all scores_round_*.npy and indices_round_*.npy files will be copied here

First, we make a traditional DOCKing directory in your work directory (be sure not to be in your home directory, because docking and building need a lot of disk space!). The first step for ChemSTEP is basically just docking the 13B_seed_set_built.sdi, so submit it like a traditional LSD screen (with whatever submission style or script you prefer). The commands below follow the folder layout shown above. What you need to bring yourself are your dockfiles (and your submission script).

mkdir CHEMSTEP_PROJECT_FOLDER/
cd CHEMSTEP_PROJECT_FOLDER/
mkdir round_0
cd round_0
cp -r path/to/your/dockfiles . #check your INDOCK parameters
mkdir sdi
cd sdi
cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13B_seed_set_built.sdi .
cd ..
# cp your favourite submission script or whatever you use for submitting jobs
# or in case you use SymDOCK with the right executable set in the submission
# script

Submit the docking job with your method of preference.

'''When the docking finishes:'''

Source the right environment now:

source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate

We will now extract scores and Molecule IDs, so we run get_scores.py

#cd into your round_0 directory

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .
python get_scores.py 0

#0 for initial round of chemstep

When get_scores.py runs successfully, we see scores_round_0.txt in our folder. Check this file for output. It should look like this:

MOL12457028547 -29.32
MOL12457032486 -32.39
...

The get_scores.py script expects a certain output folder structure. If you do not see any output in your .txt file, open the script (e.g. with vim) and adjust the paths. If you use the copied version, the expected pattern is:
/output/*/*/OUTDOCK.*

Now we translate scores_round_0.txt into indices_round_0.npy and scores_round_0.npy

#cd into your round_0 directory

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .
python convert_scores_to_npy.py 0

#0 for initial round of chemstep

Now we should find indices_round_0.npy and scores_round_0.npy in our round_0 directory.

'''First round of ChemSTEP. EXCITING!!!'''

'''Now we set up ChemSTEP:'''

#now cd into your CHEMSTEP_PROJECT_FOLDER

mkdir run_initial_and_iterative_chemstep
cp round_0/*_round_0.npy run_initial_and_iterative_chemstep/
cd run_initial_and_iterative_chemstep/

Now the two .npy files should be in our run_initial_and_iterative_chemstep folder, ready to run ChemSTEP.
We will now set up our initial submission script, which will be slightly different from the iterative one for the following rounds.

#cd into your folder for running chemstep -
#run_initial_and_iterative_chemstep/

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job_initial.sh .
cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_initial.py .

We stay in this folder, copy over a params.txt file, and then adjust it to our liking.

#we are still in run_initial_and_iterative_chemstep

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .

Adjust your params file - with vim if you like.
vim params.txt

seed_indices_file: path/to/your/run_initial_and_iterative_chemstep/indices_round_0.npy

seed_scores_file: path/to/your/run_initial_and_iterative_chemstep/scores_round_0.npy

hit_pprop: 6 #may change depending on library size, here 13B

n_docked_per_round: 1000000 #reasonable for 13B

max_beacons: 100 #reasonable for 13B

max_n_rounds: 250 # recommended

Now there should be two scripts, two .npy files, and one params.txt file in your run_initial_and_iterative_chemstep folder.
Now we submit our first round of ChemSTEP with:

qsub launch_chemstep_as_job_initial.sh

# this will queue a job
# the job will split into many jobs (e.g. 600), which will run for a while
# the initial job keeps running for a while after those jobs finish and should generate
# output in /output/complete_info/; on success, there should be a .smi file

'''Two notes here:'''

First, we will never touch the params.txt file again. From my understanding, we need to copy each following scores_round_*.npy and indices_round_*.npy into the same directory (in our case, run_initial_and_iterative_chemstep).

Second, I had issues with getting Wynton to execute python for every submitted job. I hope this is fixed for you by now.

'''First round of building:'''

When ChemSTEP finishes, we can go to our CHEMSTEP_PROJECT_FOLDER and brace ourselves for building.

#cd into your CHEMSTEP_PROJECT_FOLDER

mkdir building_1
cd building_1
cp ../run_initial_and_iterative_chemstep/output/complete_info/smi_round_1.smi .

#if there is no smi file, something went wrong

We now source the building environment, prepare the job, submit the job, and wait until the building is completed.

source environment:
source /wynton/group/bks/soft/DOCK-3.8.5/env.sh

prepare job

python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi

#note: smi_round_*.smi for the following rounds must be adjusted.

submit job
qsub building_array_job.sh

When this is finished, check for failed jobs and resubmit them. Some will always fail; just try to keep the number low.
When the molecules are built, we can proceed with docking them.
Now we generate an .sdi file from our first building round.

#cd into your building_1 folder

find /wynton/group/bks/work/pseemann/CHEMSTEP_PROJECT_FOLDER/building_1/building_output/ -type f -name "*.tgz" > round1.sdi

#this is an example; adjust the path to your own CHEMSTEP_PROJECT_FOLDER

'''Next round of docking: '''

We proceed with docking the freshly built compounds, so we make a directory called round_1 with dockfiles, and copy our new .sdi file there, too.

#cd into your CHEMSTEP_PROJECT_FOLDER

mkdir round_1
cd round_1
mkdir sdi
cd sdi
cp ../../building_1/*.sdi .
cd ..
cp -r ../round_0/dockfiles .
vim dockfiles/INDOCK

# adjust the score maximum in the INDOCK to your chosen pprop (recommended)
# submit with your favourite submission script just as you did for round_0
# so sh your_way_of_submitting.sh

As before, we will now run get_scores.py

source environment
source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate

Copy scripts and execute them
#cd into your round_1 directory

cp ../round_0/get_scores.py .
python get_scores.py 1

# wait until this finishes, always adjust the argument after
# python get_scores.py to not confuse your scores files with other rounds

Copy scripts and execute them
#still in your round_1 directory
#copy the right script: the molecule IDs have changed, so the convert script needed to be adjusted

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/64_char_convert_scores_to_npy.py .
python 64_char_convert_scores_to_npy.py 1

# wait until this finishes; always adjust the argument after
# python 64_char_convert_scores_to_npy.py to match the round
# (the script was adjusted to fit the output of round_1)

'''Next round of ChemSTEP:'''

#cd into your run_initial_and_iterative_chemstep directory

cp ../round_1/scores_round_1.npy .
cp ../round_1/indices_round_1.npy .

Copy over the iterative run_chemstep_iteratively.py and launch_chemstep_as_job_iteratively.sh.
These are slightly different than the initial ones, but from here on, we will only use these scripts. Be sure to use the right scripts and to always provide the right arguments to them. This is (hopefully) the most confusing step in this ChemSTEP tutorial.

#still in run_initial_and_iterative_chemstep directory

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job_iteratively.sh .
cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iteratively.py .

'''Submit the 2nd round of ChemSTEP:'''

qsub launch_chemstep_as_job_iteratively.sh 2

#Give the number of the ChemSTEP round you are running as the argument: if you are using scores_round_1.npy and indices_round_1.npy, you provide 2,
#like in the example above.
#If you run your 3rd round of ChemSTEP and provide indices_round_2.npy and
#scores_round_2.npy, you would do qsub launch_chemstep_as_job_iteratively.sh 3
#and so on...
#If no argument is provided, the job will queue but then quit after a few seconds to minutes.
#You can check this in the chemstep_submission.log file.

Now we wait until the jobs finish, and brace ourselves for the next time-consuming round of building.

'''Next round of building:'''

#so in your CHEMSTEP_PROJECT_FOLDER

mkdir building_2
cd building_2
cp ../run_initial_and_iterative_chemstep/output/complete_info/smi_round_2.smi .

'''source environment'''
source /wynton/group/bks/soft/DOCK-3.8.5/env.sh

'''prepare job (always adjust smi_round.smi here)'''

python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_2.smi

'''submit the job'''
qsub building_array_job.sh

Now proceed as shown before: resubmit failed jobs, make your .sdi file, and proceed with docking (a new folder round_2 is recommended).

Now we have come full circle. The next step would be to make your round_2 docking directory with the updated .sdi file from the 2nd round of building. After that round of docking, you would extract scores and IDs, convert them, feed them to ChemSTEP, and repeat this until you no longer see an increase in recovery rate. You always need to update the arguments you pass to the python and submission scripts to match the current round of ChemSTEP, building, extracting, etc.

'''Notes:'''

Slight changes here so far compared to the trillion space:

-added a workflow chart

-added a suggested directory structure

-adjusted the extract scripts for 13B

-Separated the initial and iterative scripts for submitting and running ChemSTEP; the iterative submission now only works when an argument is passed for the round of submission

-Added a shared folder for the 13B scripts if others want to use them as well

/wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/

Running ChemSTEP in the 13 billion space

2025-08-28T19:59:24Z

Pseemann: Tutorial for running ChemSTEP on Wynton in the 13 billion space

*Written by Philipp Seemann with the kind assistance of Katie Holland and Joseph Pepe (08/28/2025)
*This tutorial should work on Wynton with the provided scripts and public environments; you might need to catch tiny typos. I apologize.

This is meant to be a simple hands-on step by step guide for ChemSTEP on Wynton in the 13 billion space. For exploring the trillion space please refer to:[[Running ChemSTEP|https://wiki.docking.org/index.php?title=Running_ChemSTEP]]

'''The general workflow is:'''
|--- Simply DOCK the seeds set (you can use any docking method)
|
|--- Run ChemSTEP
|
|--- Build molecules from 1st round of ChemSTEP
|
|--- DOCK molecules from 1st round of ChemSTEP
|
|--- Run ChemSTEP the 2nd time
|
|--- Build molecules from 2nd round of ChemSTEP
|
|--- DOCK molecules from 2nd round of ChemSTEP
|
|--- Run ChemSTEP the 3rd time
.
.
.
.
.
|--- Repeat until your Recovery rate does not improve

Here is a recommended folder layout after several rounds of ChemSTEP. We will also provide the initial layout further below. This should just give you an overview of which folders we need to make iteratively, so that we do not confuse our building, docking, and ChemSTEP rounds. (You can change this however you want. Once you get the hang of it, you can be creative and find a layout and workflow that works better for you!) This tutorial will lead you through round_0 and round_1. After that, it should become clear which steps you need to repeat over and over again. This tutorial is written for Wynton. Please run the scripts from a dev node.

'''Folder layout:'''
CHEMSTEP_PROJECT_FOLDER/
|
|----run_initial_and_iterative_chemstep/
|
|----round_0/
|
|----building_1/
|
|----round_1/
|
|----building_2/
|
|----round_2/
|
|....

'''Explanation of folders'''
'''run_initial_and_iterative_chemstep/'''

Here, we actually run ChemSTEP for every new round. All scores_round_*.npy and indices_round_*.npy files are copied here after every new round of docking.

'''round_0/'''

Docking of the seeds set. This folder contains your dockfiles, and here we generate the initial indices and scores .npy files, which will be copied to the run_initial_and_iterative_chemstep/ folder later on.

'''building_1/'''

This folder is for building of the first .smi file from run_initial_and_iterative_chemstep/output/complete_info/smi_round_1.smi

'''round_1/'''

Docking of the molecules from the building_1 folder. This folder again contains your dockfiles, with an adjusted INDOCK and a new .sdi file. After extracting the scores, it yields the indices and scores .npy files of round_1 for the second round of ChemSTEP.

'''building_2/'''

After running ChemSTEP from run_initial_and_iterative_chemstep the second time, we will have a new .smi file, which we will need to build.

'''round_2/'''

If you have read carefully to this point, you will know what comes next. We dock the molecules from building_2 here, extract the scores, and generate the .npy files. We then copy those to our run_initial_and_iterative_chemstep folder to generate the next smi_round_*.smi.

This may all still seem confusing, but we will start easy. It will make sense, like any other Shoichet Lab tutorial. I promise.

'''First round of docking'''

Here are the first folders that you will need to set up for this tutorial

CHEMSTEP_PROJECT_FOLDER/
|
|
|----round_0/ # docking of the seeds set
|
|
|----run_initial_and_iterative_chemstep/ #all scores_round_*.npy and indices_round_*.npy files will be copied here

First, we make a traditional DOCKing directory in your work directory (be sure not to be in your home directory, because docking and building need a lot of disk space!). The first step for ChemSTEP is basically just docking the 13B_seed_set_built.sdi, so submit it like a traditional LSD screen (with whatever submission style or script you prefer). The commands below follow the folder layout shown above. What you need to bring yourself are your dockfiles (and your submission script).

mkdir CHEMSTEP_PROJECT_FOLDER/
cd CHEMSTEP_PROJECT_FOLDER/
mkdir round_0
cd round_0
cp -r path/to/your/dockfiles . #check your INDOCK parameters
mkdir sdi
cd sdi
cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13B_seed_set_built.sdi .
cd ..
# cp your favourite submission script or whatever you use for submitting jobs
# or in case you use SymDOCK with the right executable set in the submission
# script

Submit the docking job with your method of preference.

'''When the docking finishes:'''

Source the right environment now:

source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate

We will now extract scores and Molecule IDs, so we run get_scores.py

#cd into your round_0 directory

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .
python get_scores.py 0

#0 for initial round of chemstep

When get_scores.py runs successfully, we see scores_round_0.txt in our folder. Check this file for output. It should look like this:

MOL12457028547 -29.32
MOL12457032486 -32.39
...

The get_scores.py script expects a certain output folder structure. If you do not see any output in your .txt file, open the script (e.g. with vim) and adjust the paths. If you use the copied version, the expected pattern is:
/output/*/*/OUTDOCK.*

Now we translate scores_round_0.txt into indices_round_0.npy and scores_round_0.npy

#cd into your round_0 directory

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .
python convert_scores_to_npy.py 0

#0 for initial round of chemstep

Now we should find indices_round_0.npy and scores_round_0.npy in our round_0 directory.

'''First round of ChemSTEP. EXCITING!!!'''

'''Now we set up ChemSTEP:'''

#now cd into your CHEMSTEP_PROJECT_FOLDER

mkdir run_initial_and_iterative_chemstep
cp round_0/*_round_0.npy run_initial_and_iterative_chemstep/
cd run_initial_and_iterative_chemstep/

Now the two .npy files should be in our run_initial_and_iterative_chemstep folder, ready to run ChemSTEP.
We will now set up our initial submission script, which will be slightly different from the iterative one for the following rounds.

#cd into your folder for running chemstep -
#run_initial_and_iterative_chemstep/

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job_initial.sh .
cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_initial.py .

We stay in this folder, copy over a params.txt file, and then adjust it to our liking.

#we are still in run_initial_and_iterative_chemstep

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .

Adjust your params file - with vim if you like.
vim params.txt

seed_indices_file: path/to/your/run_initial_and_iterative_chemstep/indices_round_0.npy

seed_scores_file: path/to/your/run_initial_and_iterative_chemstep/scores_round_0.npy

hit_pprop: 6 #may change depending on library size, here 13B

n_docked_per_round: 1000000 #reasonable for 13B

max_beacons: 100 #reasonable for 13B

max_n_rounds: 250 # recommended

Now there should be two scripts, two .npy files, and one params.txt file in your run_initial_and_iterative_chemstep folder.
Now we submit our first round of ChemSTEP with:

qsub launch_chemstep_as_job_initial.sh

# this will queue a job
# the job will split into many jobs (e.g. 600), which will run for a while
# the initial job keeps running for a while after those jobs finish and should generate
# output in /output/complete_info/; on success, there should be a .smi file

'''Two notes here:'''

First, we will never touch the params.txt file again. From my understanding, we need to copy each following scores_round_*.npy and indices_round_*.npy into the same directory (in our case, run_initial_and_iterative_chemstep).

Second, I had issues with getting Wynton to execute python for every submitted job. I hope this is fixed for you by now.

'''First round of building:'''

When ChemSTEP finishes, we can go to our CHEMSTEP_PROJECT_FOLDER and brace ourselves for building.

#cd into your CHEMSTEP_PROJECT_FOLDER

mkdir building_1
cd building_1
cp ../run_initial_and_iterative_chemstep/output/complete_info/smi_round_1.smi .

#if there is no smi file, something went wrong

We now source the building environment, prepare the job, submit the job, and wait until the building is completed.

source environment:
source /wynton/group/bks/soft/DOCK-3.8.5/env.sh

prepare job

python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi

#note: adjust the smi_round_*.smi file name for the following rounds.

submit job
qsub building_array_job.sh

When this is finished, check for failed jobs and resubmit them. Some will always fail; just try to keep the number low.
When the molecules are built, we can proceed with docking them.
Now we generate an .sdi file from our first building round.

#cd into your building_1 folder

find /wynton/group/bks/work/pseemann/CHEMSTEP_PROJECT_FOLDER/building_1/building_output/ -type f -name "*.tgz" > round1.sdi

#this is an example; adjust the path to your own project folder
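
If you prefer not to hard-code the absolute path, the same list can be produced from Python. This is a sketch of an equivalent to the find command above, not one of the provided scripts; write_sdi is a hypothetical helper. Unlike find, it sorts the paths, which is assumed to be harmless for an .sdi file.

```python
from pathlib import Path


def write_sdi(building_output, sdi_path):
    """Write the absolute path of every *.tgz file under building_output, one per line.

    Returns the number of bundles written.
    """
    bundles = sorted(
        str(p.resolve())
        for p in Path(building_output).rglob("*.tgz")
        if p.is_file()
    )
    Path(sdi_path).write_text("".join(b + "\n" for b in bundles))
    return len(bundles)
```

From inside building_1 you would call write_sdi("building_output", "round1.sdi"); a return value of 0 suggests the building jobs have not produced any bundles yet.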

'''Next round of docking: '''

We proceed with docking the freshly built compounds: we make a directory called round_1, copy the dockfiles from round_0 into it, and copy our new .sdi file there, too.

#cd into your CHEMSTEP_PROJECT_FOLDER

mkdir round_1
cd round_1
mkdir sdi
cd sdi
cp ../../building_1/*.sdi .
cd ..
cp -r ../round_0/dockfiles .
vim dockfiles/INDOCK

# adjust the score maximum in the INDOCK to your chosen pprop (recommended)
# submit with your favourite submission script just as you did for round_0
# e.g. sh your_way_of_submitting.sh

As before, we will now run get_scores.py

source environment
source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate

Copy scripts and execute them
#cd into your round_1 directory

cp ../round_0/get_scores.py .
python get_scores.py 1

# wait until this finishes; always adjust the argument after
# python get_scores.py to the current round so you do not confuse your scores files with other rounds

Copy scripts and execute them
#still in your round_1 directory
#copy the right script: the IDs have changed, so the convert script had to be adjusted

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/64_char_convert_scores_to_npy.py .
python 64_char_convert_scores_to_npy.py 1

# Wait until this finishes; always adjust the argument after
# python 64_char_convert_scores_to_npy.py to the current round
# the script was adjusted to fit the output of round_1

'''Next round of ChemSTEP:'''

#cd into your run_initial_and_iterative_chemstep directory

cp ../round_1/scores_round_1.npy .
cp ../round_1/indices_round_1.npy .

Copy over the iterative scripts run_chemstep_iteratively.py and launch_chemstep_as_job_iteratively.sh.
These are slightly different than the initial ones. From here on, we will only use these scripts. Be sure to use the right scripts and to always provide the right arguments to them. This is (hopefully) the most confusing step in this ChemSTEP tutorial.

#still in run_initial_and_iterative_chemstep directory

cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job_iteratively.sh .
cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iteratively.py .

'''Submit the 2nd round of ChemSTEP:'''

qsub launch_chemstep_as_job_iteratively.sh 2

#Pass the number of the ChemSTEP round you are running as the argument. If you are using scores_round_1.npy and indices_round_1.npy, you provide 2,
#like in the example above.
#If you run your 3rd round of ChemSTEP and provide indices_round_2.npy and
#scores_round_2.npy, you would do qsub launch_chemstep_as_job_iteratively.sh 3
#and so on...
#If no argument is provided, the job will queue but quit after a few seconds to minutes.
#You can check this in the launch_chemstep_as_job_iteratively.sh.e* file.
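
Because the round argument and the file names are easy to mix up, a small pre-submission check can help. The sketch below assumes, based on the notes above, that launching with argument N requires scores_round_*.npy and indices_round_*.npy for rounds 0 through N-1 in the run directory; missing_inputs is a hypothetical helper, not part of the provided scripts.

```python
from pathlib import Path


def missing_inputs(run_dir, round_argument):
    """List the .npy files that are absent before launching with this round argument.

    Assumption: round N needs scores and indices files for rounds 0..N-1.
    """
    run_dir = Path(run_dir)
    expected = [
        run_dir / f"{kind}_round_{r}.npy"
        for r in range(round_argument)
        for kind in ("scores", "indices")
    ]
    return [p.name for p in expected if not p.is_file()]
```

For example, missing_inputs(".", 2) run inside run_initial_and_iterative_chemstep should return an empty list before you submit with the argument 2.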

Now we wait until the jobs finish, and brace ourselves for the next time-consuming round of building.

'''Next round of building:'''

#cd into your CHEMSTEP_PROJECT_FOLDER

mkdir building_2
cd building_2
cp ../run_initial_and_iterative_chemstep/output/complete_info/smi_round_2.smi .

'''source environment'''
source /wynton/group/bks/soft/DOCK-3.8.5/env.sh

'''prepare job (always adjust smi_round_*.smi here)'''

python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_2.smi

'''submit the job'''
qsub building_array_job.sh

As shown before, resubmit failed jobs, make your .sdi file, and proceed with docking (new folder round_2 recommended).

We have now come full circle. The next step would be to make the round_2 docking directory with the updated .sdi file from the 2nd round of building. After this round of docking, you would extract scores and IDs, convert them, feed them to ChemSTEP, and repeat this until you no longer see an increase in recovery rate. Always update the arguments you pass to the Python and submission scripts to match the current round of ChemSTEP, building, extraction, etc.
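
The stopping rule can be written down as a tiny check. This is only an illustration: how the recovery rate is obtained is not covered in this tutorial, the function should_stop and its min_gain threshold are hypothetical, and the numbers in the example are made up.

```python
def should_stop(recovery_rates, min_gain=0.0):
    """Return True once the latest round improved recovery by no more than min_gain.

    recovery_rates holds one value per finished round, oldest first.
    """
    if len(recovery_rates) < 2:
        return False  # not enough rounds to compare yet
    return recovery_rates[-1] - recovery_rates[-2] <= min_gain


# Made-up values for illustration only.
print(should_stop([0.10, 0.20]))
print(should_stop([0.10, 0.20, 0.20]))
```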

'''Notes:'''

Slight changes here so far compared to the trillion space:

-added a workflow chart

-added a suggested directory structure

-adjusted the extract scripts for 13B

-Separated the initial and iterative scripts for submitting and running ChemSTEP; the iterative submission now only works when the round number is passed as an argument

-Added a shared folder for the 13B scripts if others want to use them as well

/wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/