Running ChemSTEP
Written July 24, 2025 by Katie. These are directions to run a legacy version of ChemSTEP on Wynton.
What you need: a working directory; a SMILES file of every molecule in the virtual library, with unique molecule IDs (numbered from 1 to the library size); and dockfiles.
1. Copy all necessary scripts to your working directory
cp -r /wynton/group/bks/work/kholland/shared/chemstep/all_scripts .
This includes get_fingerprints.py, chemstep_params.txt, get_threshold.py, the run_chemstep scripts for the initial and subsequent rounds, and a launch_chemstep.sh script for SGE job submission.
2. Source environment
source /wynton/group/bks/work/kholland/shared/chemstep/venv/bin/activate
3. Edit get_fingerprints.py to reflect your input SMILES file and desired output directory. NOTE: this script is not set up to work at scale right now; I am working on a method for parallelization.
if __name__ == "__main__":
    smi_file = "library.smi"  # Replace with your input file
    output_dir = "library_fingerprints"  # Replace with your output directory
Run fingerprint generation. For large libraries, submit it as a job using submit_fp_gen.sh.
python3 all_scripts/get_fingerprints.py
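Step 7 below needs the number of fp_*.npy chunk files this step produces. A quick way to count them (assuming the output directory name shown above; adjust if you changed it):

```python
import glob

# Count the fingerprint chunks written by get_fingerprints.py;
# this value becomes n_files in run_chemstep_init.py (step 7).
n_files = len(glob.glob("library_fingerprints/fp_*.npy"))
print(n_files)
```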
4. Dock a random, representative subset of the total library to your POI.
5. Extract scores and respective molecule IDs (same ones used for FP generation) from step 4, assigning a score of 100 to any molecule that did not dock.
mol0001884980 -17.41
mol0001883931 -21.49
mol0001883965 -27.51
mol0001883247 100
mol0001885445 -20.05
mol0001884461 -14.55
mol0001884565 -16.7
mol0001885496 -18.01
mol0001884345 -16.71
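A minimal sketch of this bookkeeping. The file format above is two whitespace-separated columns; subset_ids and docked_scores are hypothetical names standing in for the IDs you submitted and the scores your docking run actually returned:

```python
# Hypothetical inputs: every molecule ID submitted for docking, and the
# scores that actually came back from the docking run.
subset_ids = ["mol0001884980", "mol0001883931", "mol0001883247"]
docked_scores = {"mol0001884980": -17.41, "mol0001883931": -21.49}

# Write one "ID score" pair per line; molecules that failed to dock get 100.
with open("dock_scores_round_0.txt", "w") as f:
    for mol_id in subset_ids:
        score = docked_scores.get(mol_id, 100)
        f.write(f"{mol_id} {score}\n")
```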
6. Edit the parameter file to reflect the desired step size, pProp goal, and number of beacons per step.
seed_scores_file: dicts_810k/scoredict_2.pickle
novelty_set_file: known_binders_fps.npy
novelty_dist_thresh: 0.5
screen_novelty: False
beacon_dist_thresh: 0.0
diversity_dist_thresh: 0.5
hit_pprop: 4 #change this
artefact_pprop: 6
use_artefact_filter: False
n_docked_per_round: 100 #change this
max_beacons: 10 #change this
max_n_rounds: 10 #change this
7. Edit run_chemstep_init.py to reflect library size (n_files= number of fp_*.npy files generated in step 3), scores_dict (file with dock scores and mol ID from step 5), and path to fingerprint library from step 3.
from chemstep.fp_library import FpLibrary

def run_chemstep_first_round(param_file, libdir, scores_dict, outdir, complete_info_dir, n_proc=32, n_files=#change this):
if __name__ == "__main__":
scores_dict = get_scores_dict('dock_scores_round_0.txt') #change this
    run_chemstep_first_round('chemstep_params.txt',
                             '/wynton/group/bks/work/path/to/fingerprint/library',  # update path
                             scores_dict, 'chemstep_log', 'chemstep_output')
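get_scores_dict ships with the copied scripts; if you want to sanity-check what it expects, a sketch consistent with the two-column file from step 5 would be (an assumption about its behavior, not the shipped implementation):

```python
def get_scores_dict(path):
    """Parse a 'mol_id score' file (step 5 format) into {mol_id: float score}."""
    scores = {}
    with open(path) as f:
        for line in f:
            mol_id, score = line.split()
            scores[mol_id] = float(score)
    return scores
```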
8. Make output directories
mkdir chemstep_output
mkdir chemstep_log
9. Launch ChemSTEP
note: this may take several hours
qsub all_scripts/launch_chemstep_init.sh
When the job is complete, a pickle file will be created in the working directory. Within chemstep_output will be a dataframe containing assigned beacons, a file of calculated Tanimoto distances, and an smi_round_1.smi file containing the SMILES strings and IDs of the molecules prioritized for the next round of docking.
10. View assigned pProp value
python3 all_scripts/get_threshold.py
11. Build and dock prioritized molecules
When completed, extract scores and IDs as outlined in step 5.
12. Edit run_chemstep.py to reflect the new scores_dict and the ChemSTEP round number (we are now on round 2).
if __name__ == "__main__":
    scores_dict = get_scores_dict('dockingscores_round_1.txt')
    run_chemstep_round(scores_dict, 2)
13. Launch ChemSTEP round 2
note: this may take several hours
qsub all_scripts/launch_chemstep.sh
Repeat steps 11-13 as needed for the desired hit recovery, making sure to update the scores_dict and round number each round.