Running ChemSTEP
Written July 24, 2025 by Katie. These are directions for running a legacy version of ChemSTEP on Wynton.
What you need: a working directory; a SMILES file of every molecule in the virtual library, each with a unique molecule ID (numbered from 1 to the library size); and dockfiles.
1. Copy all necessary scripts to your working directory
cp -r /wynton/group/bks/work/kholland/shared/chemstep/all_scripts .
This includes get_fingerprints.py, chemstep_params.txt, get_threshold.py, the run_chemstep scripts for the initial and subsequent rounds, and a launch_chemstep.sh script for SGE job submission.
2. Source environment
source /wynton/group/bks/work/kholland/shared/chemstep/venv/bin/activate
3. Generate ECFP4 FPs for your library
python3 split_smi_and_write_job_array.py <name of your SMILES file>.smi get_fingerprints.py
qsub launch_jobs.sh
This splits your input SMILES library into chunks of 1,000,000 molecules (you can change this number inside the split_smi_and_write_job_array.py script) and writes a job array that submits fingerprint generation for each chunk as one CPU job. The output is a directory named library_fingerprints containing fps.npy, ids.npy, and smiles.txt files.
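For reference, ECFP4 fingerprints are Morgan fingerprints of radius 2. Below is a minimal RDKit sketch of what each chunk job computes; the fingerprint_chunk name, the "SMILES ID" line format, and the exact output filenames are assumptions here, and the real logic lives in get_fingerprints.py.

import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint_chunk(smi_path, out_prefix, n_bits=2048):
    # Hypothetical sketch: compute 2048-bit ECFP4 (Morgan radius-2)
    # fingerprints for one SMILES chunk and save them as NumPy arrays.
    fps, ids, kept_smiles = [], [], []
    with open(smi_path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 2:
                continue
            smi, mol_id = fields[0], fields[1]
            mol = Chem.MolFromSmiles(smi)
            if mol is None:
                continue  # skip unparsable SMILES
            bv = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
            arr = np.zeros((n_bits,), dtype=np.int8)
            DataStructs.ConvertToNumpyArray(bv, arr)
            fps.append(arr)
            ids.append(mol_id)
            kept_smiles.append(smi)
    np.save(f"{out_prefix}_fps.npy", np.asarray(fps, dtype=np.int8))
    np.save(f"{out_prefix}_ids.npy", np.asarray(ids))
    with open(f"{out_prefix}_smiles.txt", "w") as out:
        out.write("\n".join(kept_smiles) + "\n")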
4. Dock a random, representative subset of the total library to your protein of interest (POI).
5. Extract the scores and corresponding molecule IDs (the same IDs used for fingerprint generation) from step 4, assigning a score of 100 to any molecule that did not dock. For example:
mol0001884980 -17.41
mol0001883931 -21.49
mol0001883965 -27.51
mol0001883247 100
mol0001885445 -20.05
mol0001884461 -14.55
mol0001884565 -16.7
mol0001885496 -18.01
mol0001884345 -16.71
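A minimal sketch of this step, assuming a docked-results file of "mol_id score" lines and a file listing every ID in the docked subset; the filenames and the write_round_scores helper are hypothetical, and how you pull scores out of your docking output will depend on your docking program.

def write_round_scores(docked_path, subset_ids_path, out_path, penalty=100):
    # collect the scores that docking actually produced
    scores = {}
    with open(docked_path) as fh:
        for line in fh:
            mol_id, score = line.split()[:2]
            scores[mol_id] = float(score)
    # write one "mol_id score" line per molecule in the docked subset;
    # molecules with no score (failed to dock) get the penalty score of 100
    with open(subset_ids_path) as fh, open(out_path, "w") as out:
        for line in fh:
            mol_id = line.strip()
            out.write(f"{mol_id} {scores.get(mol_id, penalty)}\n")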
6. Edit the parameter file (chemstep_params.txt) to reflect the desired step size, pProp goal, and number of beacons per step:
seed_scores_file: dicts_810k/scoredict_2.pickle
novelty_set_file: known_binders_fps.npy
novelty_dist_thresh: 0.5
screen_novelty: False
beacon_dist_thresh: 0.0
diversity_dist_thresh: 0.5
hit_pprop: 4 #change this
artefact_pprop: 6
use_artefact_filter: False
n_docked_per_round: 100 #change this
max_beacons: 10 #change this
max_n_rounds: 10 #change this
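For reference, a minimal sketch of how a "key: value" file like this can be parsed into a dict; this assumes the format shown above, and ChemSTEP's own parser may differ. Inline "#change this" comments are stripped before parsing.

def read_params(path):
    params = {}
    with open(path) as fh:
        for raw in fh:
            line = raw.split("#", 1)[0].strip()  # drop inline comments
            if not line:
                continue
            key, value = line.split(":", 1)
            params[key.strip()] = value.strip()
    return params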
7. Edit run_chemstep_init.py to reflect the library size (n_files = the number of fp_*.npy files generated in step 3), the scores_dict (the file of dock scores and molecule IDs from step 5), and the path to the fingerprint library from step 3.
from chemstep.fp_library import FpLibrary

def run_chemstep_first_round(param_file, libdir, scores_dict, outdir, complete_info_dir,
                             n_proc=32, n_files=None):  # change this: set n_files to the number of fp_*.npy files from step 3
    ...

if __name__ == "__main__":
    scores_dict = get_scores_dict('dock_scores_round_0.txt')  # change this to your scores file from step 5
    run_chemstep_first_round('chemstep_params.txt',
                             '/wynton/group/bks/work/path/to/fingerprint/library',  # update this path
                             scores_dict, 'chemstep_log', 'chemstep_output')
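get_scores_dict is defined in the copied scripts; as a rough sketch of what it presumably does with the step-5 file (the real implementation may differ):

def get_scores_dict(path):
    # parse "mol_id score" lines into a {mol_id: score} dict
    scores_dict = {}
    with open(path) as fh:
        for line in fh:
            mol_id, score = line.split()[:2]
            scores_dict[mol_id] = float(score)
    return scores_dict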
8. Make output directories
mkdir chemstep_output
mkdir chemstep_log
9. Launch ChemSTEP
Note: this may take several hours.
qsub all_scripts/launch_chemstep_init.sh
When the job is complete, a pickle file will be created in the working directory. Within chemstep_output you will find a dataframe of assigned beacons, a file of calculated Tanimoto distances, and an smi_round_1.smi file containing the SMILES strings and IDs of the molecules prioritized for the next round of docking.
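A quick way to peek at the round-1 outputs, assuming the beacon dataframe is a pickled pandas DataFrame; the exact filenames inside chemstep_output may differ, so this just globs for pickles and prints a summary.

import glob
import pandas as pd

for path in glob.glob("chemstep_output/*.pickle") + glob.glob("chemstep_output/*.pkl"):
    obj = pd.read_pickle(path)  # loads whatever object was pickled
    print(path, getattr(obj, "shape", type(obj)))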
10. View assigned pProp value
python3 all_scripts/get_threshold.py
11. Build and dock prioritized molecules
When docking is complete, extract the scores and IDs as outlined in step 5.
12. Edit run_chemstep.py to reflect the new scores_dict and the ChemSTEP round number (we are now on round 2):
if __name__ == "__main__":
    scores_dict = get_scores_dict('dockingscores_round_1.txt')
    run_chemstep_round(scores_dict, 2)
13. Launch ChemSTEP round 2
Note: this may take several hours.
qsub all_scripts/launch_chemstep.sh
Repeat steps 11-13 as needed until you reach the desired hit recovery, making sure to update the scores_dict and round number each time.
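If you would rather not hand-edit run_chemstep.py every round, one optional pattern is to take the round number on the command line. This sketch assumes get_scores_dict and run_chemstep_round are already defined or imported in that script, and that your score files follow a dockingscores_round_N.txt naming scheme (both assumptions; adapt to the copied scripts).

import sys

if __name__ == "__main__":
    round_num = int(sys.argv[1])  # e.g. 2, 3, ...
    # scores come from the previous round's docking (step 5 format)
    scores_dict = get_scores_dict(f"dockingscores_round_{round_num - 1}.txt")
    run_chemstep_round(scores_dict, round_num)

Invoked as, e.g., python3 run_chemstep.py 3 for round 3.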