Running ChemSTEP

Written July 24, 2025 by Katie. These are directions for running a legacy version of ChemSTEP on Wynton.
 
What the user needs: a SMILES file of every molecule in the virtual library with unique molecule IDs (ranging from 1 to the size of the library), and dockfiles.
 
'''1. Copy all necessary scripts to your working directory'''
    cp -r /wynton/group/bks/work/kholland/shared/chemstep/all_scripts .
This includes get_fingerprints.py, chemstep_params.txt, get_threshold.py, run_chemstep scripts for the initial and subsequent rounds, and a launch_chemstep.sh script for SGE job submission.
 
'''2. Source environment'''
    source /wynton/group/bks/work/kholland/shared/chemstep/venv/bin/activate
 
'''3. Edit get_fingerprints.py''' to reflect your input SMILES file and desired output directory. NOTE: this script is not set up to work at scale right now; a method for parallelization is in progress.
  if __name__ == "__main__":
    smi_file = '''"library.smi"'''  # Replace with your input file
    output_dir = '''"library_fingerprints"''' # Replace with your output directory
 
'''Run generation'''
For large libraries, submit it as a job using submit_fp_gen.sh. An illustrative sketch of what this step produces follows the command below.
    python3 get_fingerprints.py
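
The provided get_fingerprints.py handles this step; purely as an illustration of the kind of output it produces, here is a minimal sketch that assumes RDKit Morgan fingerprints (radius 2, 2048 bits) saved in batched fp_*.npy files. The fingerprint type, radius, and batch size are assumptions and may differ from the actual script.
  # Minimal sketch only -- NOT the provided get_fingerprints.py.
  # Assumes RDKit Morgan fingerprints (radius 2, 2048 bits) written as
  # batched fp_*.npy files, with molecule order following the SMILES file.
  import os
  import numpy as np
  from rdkit import Chem
  from rdkit.Chem import AllChem
  def write_fingerprints(smi_file, output_dir, batch_size=10000):
      os.makedirs(output_dir, exist_ok=True)
      batch, n_file = [], 0
      with open(smi_file) as fh:
          for line in fh:
              parts = line.split()
              if not parts:
                  continue  # skip blank lines
              mol = Chem.MolFromSmiles(parts[0])
              if mol is None:
                  continue  # skip unparsable SMILES
              fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
              batch.append(np.array(list(fp), dtype=np.uint8))
              if len(batch) == batch_size:
                  np.save(os.path.join(output_dir, f"fp_{n_file}.npy"), np.vstack(batch))
                  batch, n_file = [], n_file + 1
      if batch:
          np.save(os.path.join(output_dir, f"fp_{n_file}.npy"), np.vstack(batch))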
 
'''4. Dock a random, representative subset of the total library to your protein of interest (POI).'''
 
'''5. Extract scores and respective molecule IDs''' (the same IDs used for fingerprint generation) from the docking in step 4, assigning a score of 100 to any molecule that did not dock. The file should look like the example below; a hedged extraction sketch follows it.
    mol0001884980 -17.41
    mol0001883931 -21.49
    mol0001883965 -27.51
    mol0001883247 100
    mol0001885445 -20.05
    mol0001884461 -14.55
    mol0001884565 -16.7
    mol0001885496 -18.01
    mol0001884345 -16.71
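
How the scores get pulled out depends on your docking setup, so the following is only a hedged sketch: it assumes you already have a {molecule ID: score} mapping for the docked subset (docked and subset_ids are hypothetical names) and writes the two-column file in the format shown above, with 100 for anything that failed to dock.
  # Hedged sketch: 'docked' and 'subset_ids' are hypothetical placeholders for
  # however you collect your docking results; only the output format matters.
  def write_score_file(docked, subset_ids, out_path="dock_scores_round_0.txt"):
      with open(out_path, "w") as out:
          for mol_id in subset_ids:
              score = docked.get(mol_id, 100)  # 100 = did not dock
              out.write(f"{mol_id} {score}\n")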
 
'''6. Edit parameter file''' to reflect the desired step size, pProp goal, and number of beacons per step. The bolded fields are the ones to change; an illustrative sketch of the key: value format follows the listing.
 
  seed_scores_file: dicts_810k/scoredict_2.pickle
  novelty_set_file: known_binders_fps.npy
  novelty_dist_thresh: 0.5
  screen_novelty: False
  beacon_dist_thresh: 0.0
  diversity_dist_thresh: 0.5
  '''hit_pprop: 4''' #change this
  artefact_pprop: 6
  use_artefact_filter: False
  '''n_docked_per_round:''' 100 #change this
  '''max_beacons:''' 10 #change this
  '''max_n_rounds:''' 10 #change this
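
The provided scripts read chemstep_params.txt for you; the sketch below is only meant to make the expected key: value layout concrete (treating inline # notes as comments is an assumption).
  # Illustrative reader for the key: value layout above; the real parsing is
  # done inside the ChemSTEP scripts and may differ.
  def read_params(path="chemstep_params.txt"):
      params = {}
      with open(path) as fh:
          for line in fh:
              line = line.split("#", 1)[0].strip()  # assume '#' starts a comment
              if ":" not in line:
                  continue
              key, value = line.split(":", 1)
              params[key.strip()] = value.strip()
      return params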
 
'''7. Edit run_chemstep_init.py''' to reflect the library size (n_files = number of fp_*.npy files generated in step 3), the scores_dict (the file of dock scores and molecule IDs from step 5), and the path to the fingerprint library from step 3. A sketch of what get_scores_dict plausibly returns follows the snippet.
  if __name__ == "__main__":
      scores_dict = get_scores_dict('dock_scores_round_0.txt')  #change this
      run_chemstep_first_round('chemstep_params.txt', '/wynton/group/bks/work/path/to/fingerprint/library', scores_dict,
                               'chemstep_log', 'chemstep_output')  #update path
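
get_scores_dict ships with the script, so you do not need to write it; as a guide to what it plausibly returns given the file format from step 5, a minimal sketch would be:
  # Sketch of a scores-file reader matching the step 5 format (molecule ID,
  # whitespace, score); the actual get_scores_dict may differ in detail.
  def get_scores_dict(path):
      scores = {}
      with open(path) as fh:
          for line in fh:
              parts = line.split()
              if len(parts) != 2:
                  continue  # skip blank or malformed lines
              scores[parts[0]] = float(parts[1])
      return scores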
 
'''8. Make output directories'''
  mkdir chemstep_output
  mkdir chemstep_log
 
 
'''9. Launch ChemSTEP'''
Note: this may take several hours.
  qsub launch_chemstep_init.sh
 
 
When the job is complete, a pickle file will be created in the working directory. Within chemstep_output will be a dataframe containing assigned beacons, a file of calculated Tanimoto distances, and an '''smi_round_1.smi''' file containing the SMILES strings and IDs of molecules prioritized for the next round of docking.
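
To sanity-check the outputs before building the next round, something like the following works, assuming the beacon dataframe is saved as a pickled pandas DataFrame; the dataframe filename here is hypothetical, so check chemstep_output for the actual name.
  # Quick inspection of round-1 outputs; 'beacons_round_1.pkl' is a hypothetical
  # filename -- look in chemstep_output/ for the real one.
  import pandas as pd
  beacons = pd.read_pickle("chemstep_output/beacons_round_1.pkl")
  print(beacons.head())
  with open("chemstep_output/smi_round_1.smi") as fh:
      n_mols = sum(1 for _ in fh)
  print(n_mols, "molecules prioritized for the next round")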
 
'''10. View assigned pProp value'''
    python3 get_threshold.py
 
'''11. Build and dock prioritized molecules'''
 
When completed, extract scores and IDs as outlined in step 5.
 
'''12. Edit run_chemstep.py''' to reflect the library size, the new scores_dict, and the ChemSTEP round number (we are now on round 2).
  if __name__ == "__main__":
      scores_dict = get_scores_dict('dockingscores_round_1.txt')
      run_chemstep_round(scores_dict, 2)
 
'''13. Launch ChemSTEP round 2'''
Note: this may take several hours.
  qsub launch_chemstep.sh
 
Repeat steps 11-13 as needed for the desired hit recovery, making sure to update the scores_dict and round number each round.
