Running ChemSTEP
Written July 24, 2025 by Katie. Edited August 11, 2025 with updated methods specific to InfiniSee XReal.
What you need: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP
1. Install ChemSTEP
pip install /wynton/group/bks/work/omailhot/chemstep-0.2.2.tar.gz
2. Copy InfiniSee seed-set into a directory named 'sdi' within your docking directory
This is a split database index (SDI) file containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1 trillion molecules.
cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_00_seed_set.wynton.sdi .
3. Dock seed set to your receptor of interest using DOCK 3.8 /docking directions taken from docs.docking.org
export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]
export DOCKFILES=[path to your dockfiles]
export INPUT_FOLDER=[the folder containing your .sdi file(s)]
export OUTPUT_FOLDER=[where you want the output]
/wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh
Wait for docking to complete, then extract molecule IDs and corresponding scores. To do so, run the following commands in the base docking directory (containing your docking files and output folder) while logged into a dev node:
cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .
python3 get_scores.py
4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition.
cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .
python3 convert_scores_to_npy.py
This should output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores.
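Before moving on, it can be worth confirming that the two arrays really are line-matched. The following is a minimal sketch and not part of ChemSTEP: the helper name check_round_arrays is hypothetical, and it assumes only that both files load with np.load and that the scores are numeric.

```python
import numpy as np


def check_round_arrays(indices_path, scores_path):
    """Load one round's index and score arrays and confirm they line up.

    Hypothetical helper, not part of ChemSTEP.
    Returns (number of molecules, number of non-finite scores).
    """
    indices = np.load(indices_path)
    scores = np.load(scores_path)

    # Line-matched arrays must hold exactly one score per molecule ID.
    if indices.shape[0] != scores.shape[0]:
        raise ValueError(
            f"{indices.shape[0]} indices but {scores.shape[0]} scores"
        )

    # Count NaN/inf scores so they can be inspected before launching a run.
    n_bad = int(np.count_nonzero(~np.isfinite(scores.astype(float))))
    return indices.shape[0], n_bad
```

Calling check_round_arrays('indices_round_0.npy', 'scores_round_0.npy') from the docking directory returns the molecule count and the number of non-finite scores, or raises ValueError if the lengths differ.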
5. Make a directory to run ChemSTEP in. Copy in necessary files: params.txt, run_chemstep.py and launch_chemstep_as_job.sh
cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .
cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .
cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .
6. Edit params.txt file
seed_indices_file: /absolute/path/to/your/indices_round_0.npy
seed_scores_file: /absolute/path/to/your/scores_round_0.npy
hit_pprop: 3
n_docked_per_round: 30000
max_beacons: 150
max_n_rounds: 250
Be sure that this file points to your scores and indices files and sets your desired pProp, number of beacons, and number of molecules to prioritize per round. There should be no need to edit run_chemstep.py or the SGE wrapper script for XReal docking. If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library. We strongly suggest running ChemSTEP as a job array with 32 CPU slots requested (specified in the wrapper script).
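A quick check of params.txt before submitting can catch a mistyped path early. This is a rough sketch, not ChemSTEP's own parser: the function name check_params is hypothetical, and it assumes the file holds one 'key: value' pair per line with the six keys shown above.

```python
from pathlib import Path

# Keys taken from the example params.txt in this guide.
REQUIRED_KEYS = (
    "seed_indices_file", "seed_scores_file", "hit_pprop",
    "n_docked_per_round", "max_beacons", "max_n_rounds",
)


def check_params(params_path):
    """Parse 'key: value' lines and return (params, problems).

    Hypothetical helper; ASSUMES one 'key: value' pair per line.
    """
    params = {}
    for line in Path(params_path).read_text().splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or unrecognised lines
        key, value = line.split(":", 1)
        params[key.strip()] = value.strip()

    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in params]

    # The seed files should be absolute paths to files that exist.
    for key in ("seed_indices_file", "seed_scores_file"):
        value = params.get(key)
        if value is None:
            continue
        path = Path(value)
        if not path.is_absolute():
            problems.append(f"{key} is not an absolute path: {value}")
        elif not path.is_file():
            problems.append(f"{key} does not exist: {value}")
    return params, problems
```

An empty problems list from check_params('params.txt') means every expected key is present and both seed files were found; anything else is worth fixing before running qsub.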
7. Run ChemSTEP with the following command:
qsub launch_chemstep_as_job.sh
When finished, there will be a smi_round_1.smi file inside output_directory/complete_info. These molecules should be built, docked, and fed back into ChemSTEP.
8. Build prioritized molecules (DOCK 3.8). /taken from docs.docking.org
source /wynton/group/bks/soft/DOCK-3.8.5/env.sh
python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi
When building has completed, write an SDI file with the complete paths to each built bundle, then dock it as in step 3. Retrieve docking scores as outlined in step 3 and convert them to NumPy arrays as in step 4, being sure to update the round number in the file names.
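Writing the SDI file by hand is tedious for many bundles, so a small script can help. The sketch below is only an illustration: the function name write_sdi is hypothetical, it assumes an SDI file is simply one absolute bundle path per line (as the description in step 2 suggests), and the glob pattern for the built bundles is left as an argument because the building pipeline's output names are not given here.

```python
from pathlib import Path


def write_sdi(build_dir, sdi_path, pattern):
    """Write one absolute bundle path per line; return how many were written.

    Hypothetical helper. ASSUMES built bundles are files somewhere under
    build_dir whose names match `pattern`; set the pattern to whatever the
    building step actually produced.
    """
    bundles = sorted(
        p.resolve() for p in Path(build_dir).rglob(pattern) if p.is_file()
    )
    if not bundles:
        raise FileNotFoundError(
            f"no files matching {pattern!r} under {build_dir}"
        )
    Path(sdi_path).write_text("".join(f"{b}\n" for b in bundles))
    return len(bundles)
```

For example, write_sdi('building_output', 'round_1.sdi', '*.tgz') would collect every matching file under building_output; both the output name and the '*.tgz' pattern are guesses, so check a few lines of the result against the bundles on disk before docking.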
9. Run second round of ChemSTEP, giving round number as a command-line argument. /copy this script in as run_chemstep_iterative.py
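import sys
import numpy as np
from chemstep.algo import load_from_pickle
# Round number to run, given on the command line (e.g. 2)
round_n = int(sys.argv[1])
# Restore the ChemSTEP state pickled at the end of the previous round
algo = load_from_pickle(f'chemstep_algo_{round_n - 1}.pickle')
# Load the molecule IDs and docking scores from the previous round
indices = np.load(f'indices_round_{round_n - 1}.npy')
scores = np.load(f'scores_round_{round_n - 1}.npy')
algo.run_one_round(round_n, indices, scores)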
Again, this should be run as a job on the scheduler with 32 cores. Update launch_chemstep_as_job.sh to call (specifying round number):
python3 run_chemstep_iterative.py 2
The output will be a smi_round_2.smi file. Repeat steps 8 and 9 for as many rounds as needed. Performance is reported in output_directory/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon's distance to all previous beacons.