Running ChemSTEP

From DISI
Revision as of 18:40, 28 July 2025 by Kholland (talk | contribs)

Written July 24, 2025 by Katie. These are directions to run a legacy version of ChemSTEP on Wynton.

What you need before starting: a working directory, a SMILES file of every molecule in your virtual library with unique molecule IDs (numbered from 1 to the size of the library), and dockfiles.

1. Copy all necessary scripts to your working directory

   cp -r /wynton/group/bks/work/kholland/shared/chemstep/all_scripts . 

This directory includes get_fingerprints.py, chemstep_params.txt, get_threshold.py, run_chemstep scripts for the initial and subsequent rounds, and a launch_chemstep.sh script for SGE job submission.

2. Source environment

   source /wynton/group/bks/work/kholland/shared/chemstep/venv/bin/activate

3. Generate ECFP4 FPs for your library

   python3 split_smi_and_write_job_array.py <name of your SMILES file>.smi get_fingerprints.py
   qsub launch_jobs.sh

This splits your input SMILES library into chunks of 1,000,000 molecules (you can change this chunk size inside the split_smi_and_write_job_array.py script) and writes a job array that submits fingerprint generation for each chunk as one CPU job. The output is a directory named library_fingerprints containing fps.npy, ids.npy, and smiles.txt files.
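The splitting half of this step can be sketched in plain Python. This is a simplified illustration, not the actual contents of split_smi_and_write_job_array.py; the chunk naming and return value here are assumptions.

```python
def split_smi(smi_path, chunk_size=1_000_000):
    """Split a SMILES file into chunks of at most chunk_size lines.

    Hypothetical simplification of split_smi_and_write_job_array.py:
    writes chunk_0.smi, chunk_1.smi, ... and returns their names.
    The real script also writes the SGE job-array submission file.
    """
    chunk_names = []
    with open(smi_path) as fh:
        lines = fh.readlines()
    for i in range(0, len(lines), chunk_size):
        name = f"chunk_{i // chunk_size}.smi"
        with open(name, "w") as out:
            out.writelines(lines[i:i + chunk_size])
        chunk_names.append(name)
    return chunk_names
```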

4. Dock a random, representative subset of the total library to your protein of interest (POI).

5. Extract scores and their molecule IDs (the same IDs used for FP generation) from step 4, assigning a score of 100 to any molecule that did not dock. For example:

   mol0001884980 -17.41
   mol0001883931 -21.49
   mol0001883965 -27.51
   mol0001883247 100
   mol0001885445 -20.05
   mol0001884461 -14.55
   mol0001884565 -16.7
   mol0001885496 -18.01
   mol0001884345 -16.71

6. Edit the parameter file to reflect your desired step size, pProp goal, and number of beacons per step:

  seed_scores_file: dicts_810k/scoredict_2.pickle
  novelty_set_file: known_binders_fps.npy
  novelty_dist_thresh: 0.5
  screen_novelty: False
  beacon_dist_thresh: 0.0
  diversity_dist_thresh: 0.5
  hit_pprop: 4 #change this
  artefact_pprop: 6
  use_artefact_filter: False
  n_docked_per_round: 100 #change this
  max_beacons: 10 #change this
  max_n_rounds: 10 #change this

7. Edit run_chemstep_init.py to reflect your library size (n_files = the number of fp_*.npy files generated in step 3), the scores dict (the file with dock scores and mol IDs from step 5), and the path to the fingerprint library from step 3.

  from chemstep.fp_library import FpLibrary

  def run_chemstep_first_round(param_file, libdir, scores_dict, outdir, complete_info_dir, n_proc=32, n_files=#change this):

  if __name__ == "__main__":
      scores_dict = get_scores_dict('dock_scores_round_0.txt') #change this
      run_chemstep_first_round('chemstep_params.txt', '/wynton/group/bks/work/path/to/fingerprint/library', scores_dict,
                               'chemstep_log', 'chemstep_output') #update path

8. Make output directories

  mkdir chemstep_output
  mkdir chemstep_log


9. Launch ChemSTEP (note: this may take several hours)

  qsub all_scripts/launch_chemstep_init.sh


When the job is complete, a pickle file will be created in the working directory. Inside chemstep_output you will find a dataframe containing the assigned beacons, a file of calculated Tanimoto distances, and an smi_round_1.smi file containing the SMILES strings and IDs of the molecules prioritized for the next round of docking.
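To inspect the pickle file ChemSTEP writes, the standard loading pattern applies. The object's exact structure (dict, DataFrame, etc.) depends on the ChemSTEP version, so examine it interactively after loading:

```python
import pickle

def load_pickle(path):
    """Load a pickle file written by ChemSTEP for inspection.

    Generic loader only; the shape of the returned object is
    version-dependent, so explore it interactively (e.g. type(obj)).
    """
    with open(path, "rb") as fh:
        return pickle.load(fh)
```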

10. View the assigned pProp value

   python3 all_scripts/get_threshold.py

11. Build and dock prioritized molecules

When completed, extract scores and IDs as outlined in step 5.

12. Edit run_chemstep.py to reflect the new scores_dict and the ChemSTEP round number (we are now on round 2).

  if __name__ == "__main__":
      scores_dict = get_scores_dict('dockingscores_round_1.txt')
      run_chemstep_round(scores_dict, 2)

13. Launch ChemSTEP round 2 (note: this may take several hours)

  qsub all_scripts/launch_chemstep.sh

Repeat steps 11-13 as needed for the desired hit recovery, making sure to update the scores_dict and round number each time.