Running ChemSTEP
Written July 24, 2025 by Katie. These are directions for running a legacy version of ChemSTEP on Wynton.
What the user needs: a SMILES file of every molecule in the virtual library with unique molecule IDs (ranging from 1 to the library size), and dockfiles.
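Before starting, it can help to sanity-check the library file. A minimal sketch, assuming the .smi file is whitespace-separated with the SMILES string first and the molecule ID second (the exact column layout in your library may differ; `check_library` is an illustrative helper, not part of ChemSTEP):

```python
def check_library(smi_path):
    """Verify every molecule in a .smi file has a unique ID.

    Assumes whitespace-separated lines: SMILES string, then molecule ID.
    Returns the library size on success.
    """
    ids = []
    with open(smi_path) as fh:
        for n, line in enumerate(fh, start=1):
            parts = line.split()
            if len(parts) < 2:
                raise ValueError(f"line {n}: expected 'SMILES ID', got {line!r}")
            ids.append(parts[1])
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate molecule IDs found")
    return len(ids)
```

Running `check_library("library.smi")` returns the library size, which you will need again in step 7.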
'''1. Copy all necessary scripts to your working directory'''
cp -r /wynton/group/bks/work/kholland/shared/chemstep/all_scripts .
This includes get_fingerprints.py, chemstep_params.txt, get_threshold.py, run_chemstep scripts for the initial and subsequent rounds, and a launch_chemstep.sh script for SGE job submission.
'''2. Source environment'''
source /wynton/group/bks/work/kholland/shared/chemstep/venv/bin/activate
'''3. Edit get_fingerprints.py''' to reflect your input SMILES file and desired output directory. NOTE: this script is not yet set up to work at scale; a parallelization method is in progress.
if __name__ == "__main__":
    smi_file = "library.smi"  # Replace with your input file
    output_dir = "library_fingerprints"  # Replace with your output directory
'''Run generation.''' For large libraries, submit as a job using submit_fp_gen.sh.
python3 get_fingerprints.py
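Until get_fingerprints.py supports parallel execution, one workaround is to split the library into chunks and submit one fingerprint job per chunk. A minimal stdlib sketch (the chunk size and file naming here are arbitrary choices, not ChemSTEP conventions):

```python
import os

def split_smi(smi_path, out_dir, chunk_size=100_000):
    """Split a .smi file into fixed-size chunks (chunk_000.smi, ...)
    so each chunk can be fingerprinted as a separate cluster job.
    Returns the number of chunks written."""
    os.makedirs(out_dir, exist_ok=True)
    chunk, n_chunks = [], 0
    with open(smi_path) as fh:
        for line in fh:
            chunk.append(line)
            if len(chunk) == chunk_size:
                _flush(chunk, out_dir, n_chunks)
                chunk, n_chunks = [], n_chunks + 1
    if chunk:
        _flush(chunk, out_dir, n_chunks)
        n_chunks += 1
    return n_chunks

def _flush(lines, out_dir, idx):
    with open(os.path.join(out_dir, f"chunk_{idx:03d}.smi"), "w") as fh:
        fh.writelines(lines)
```

Each chunk can then be pointed at get_fingerprints.py as its own smi_file and submitted as a separate SGE job.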
'''4. Dock a random, representative subset of the total library to your protein of interest (POI).'''
'''5. Extract scores and respective molecule IDs''' (the same ones used for FP generation) from step 4, assigning a score of 100 to any molecule that did not dock.
mol0001884980 -17.41
mol0001883931 -21.49
mol0001883965 -27.51
mol0001883247 100
mol0001885445 -20.05
mol0001884461 -14.55
mol0001884565 -16.7
mol0001885496 -18.01
mol0001884345 -16.71
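A score file in the format above can be assembled with a short helper. A minimal sketch, assuming you already have a dict of docking scores keyed by molecule ID and an iterable of every ID in the docked subset (the names `docked_scores` and `all_ids` are illustrative, not part of ChemSTEP):

```python
def write_score_file(docked_scores, all_ids, out_path, undocked_score=100):
    """Write 'molID score' lines, assigning undocked_score to any
    molecule that has no docking score (i.e. failed to dock)."""
    with open(out_path, "w") as fh:
        for mol_id in all_ids:
            score = docked_scores.get(mol_id, undocked_score)
            fh.write(f"{mol_id} {score}\n")
```

This guarantees every subset molecule appears exactly once in the score file, with failures marked as 100.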
'''6. Edit parameter file''' to reflect the desired step size, pProp goal, and number of beacons per step.
seed_scores_file: dicts_810k/scoredict_2.pickle
novelty_set_file: known_binders_fps.npy
novelty_dist_thresh: 0.5
screen_novelty: False
beacon_dist_thresh: 0.0
diversity_dist_thresh: 0.5
'''hit_pprop: 4''' #change this
artefact_pprop: 6
use_artefact_filter: False
'''n_docked_per_round:''' 100 #change this
'''max_beacons:''' 10 #change this
'''max_n_rounds:''' 10 #change this
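The parameter file uses simple `key: value` lines with `#` comments. A sketch of how such a file can be parsed into a dict, assuming ChemSTEP reads it roughly like this (the actual parser inside ChemSTEP may differ):

```python
def load_params(path):
    """Parse 'key: value' lines, stripping inline '#' comments and
    converting booleans, ints, and floats where possible.
    Strings (paths, filenames) are kept as-is."""
    params = {}
    with open(path) as fh:
        for line in fh:
            line = line.split("#", 1)[0].strip()  # drop comments
            if not line or ":" not in line:
                continue
            key, value = (s.strip() for s in line.split(":", 1))
            if value in ("True", "False"):
                params[key] = (value == "True")
            else:
                try:
                    params[key] = int(value)
                except ValueError:
                    try:
                        params[key] = float(value)
                    except ValueError:
                        params[key] = value
    return params
```

This can also double as a quick check that your edited chemstep_params.txt parses the way you expect before submitting a job.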
'''7. Edit run_chemstep_init.py''' to reflect the library size (n_files = the number of fp_*.npy files generated in step 3), scores_dict (the file with dock scores and molecule IDs from step 5), and the path to the fingerprint library from step 3.
if __name__ == "__main__":
    scores_dict = get_scores_dict('dock_scores_round_0.txt')  # change this
    run_chemstep_first_round('chemstep_params.txt', '/wynton/group/bks/work/path/to/fingerprint/library', scores_dict,
                             'chemstep_log', 'chemstep_output')  # update path
'''8. Make output directories'''
mkdir chemstep_output
mkdir chemstep_log
'''9. Launch ChemSTEP'''
Note: this may take several hours.
qsub launch_chemstep_init.sh
When the job is complete, a pickle file will be created in the working directory. Within chemstep_output will be a dataframe containing assigned beacons, a file of calculated Tanimoto distances, and an '''smi_round_1.smi''' file containing the SMILES strings and IDs of molecules prioritized for the next round of docking.
'''10. View assigned pProp value'''
python3 get_threshold.py
'''11. Build and dock prioritized molecules'''
When docking is complete, extract scores and IDs as outlined in step 5.
'''12. Edit run_chemstep.py''' to reflect the library size, new scores_dict, and ChemSTEP round number (we are now on round 2).
if __name__ == "__main__":
    scores_dict = get_scores_dict('dockingscores_round_1.txt')
    run_chemstep_round(scores_dict, 2)
'''13. Launch ChemSTEP round 2'''
Note: this may take several hours.
qsub launch_chemstep.sh
Repeat steps 11-13 as needed for desired hit recovery, making sure to update the scores_dict and round number.
Revision as of 23:44, 24 July 2025