Running ChemSTEP
Revision as of 23:14, 8 August 2025

Written July 24, 2025 by Katie. Edited August 8, 2025 with updated methods (PROSPECTIVE APPLICATION).

What you need: a virtual library of interest and dockfiles

1. Install ChemSTEP

   pip install /wynton/group/bks/work/omailhot/chemstep-0.2.0.tar.gz

2. Generate ECFP4 fingerprints for your library. Binary fingerprints should be stored as contiguous uint8 NumPy arrays.

   from rdkit import Chem
   from rdkit.Chem import AllChem
   import numpy as np

   def morgan_fp(smiles, nbits=2048, radius=2):
      # ECFP4 = Morgan fingerprint with radius 2; returns a uint8 bit vector
      mol = Chem.MolFromSmiles(smiles)  # returns None for invalid SMILES
      fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=nbits)
      return np.asarray(fp, dtype=np.uint8)

Save each chunk's fingerprints as fps_XXX.npy, with the matching molecule IDs in zids_XXX.npy and SMILES in smi_XXX.smi.
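The naming convention above can be wrapped in a small helper. A minimal sketch (the function name and directory layout are illustrative, not part of ChemSTEP):

```python
import numpy as np

def save_chunk(chunk_idx, fps_list, ids_list, smiles_list, outdir='.'):
    """Save one fingerprint chunk under the fps_/zids_/smi_ naming scheme."""
    # Stack per-molecule uint8 bit vectors into one contiguous 2D array
    fps = np.ascontiguousarray(np.vstack(fps_list).astype(np.uint8))
    ids = np.asarray(ids_list, dtype=np.int64)
    np.save(f'{outdir}/fps_{chunk_idx}.npy', fps)
    np.save(f'{outdir}/zids_{chunk_idx}.npy', ids)
    with open(f'{outdir}/smi_{chunk_idx}.smi', 'w') as f:
        for smi, mol_id in zip(smiles_list, ids_list):
            f.write(f'{smi} {mol_id}\n')
    return fps.shape
```

Keeping the three files index-aligned per chunk is what lets them be zipped back together when compiling the library in the next step.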

3. Compile the FP library into a pickle file.

    from chemstep.fp_library import FpLibrary

    n_files = 10  # set to the number of fingerprint chunks generated in step 2
    smi_fns = []
    ids_fns = []
    fps_fns = []
    for i in range(n_files):
       smi_fns.append(f'/path/to/smi_{i}.smi')
       ids_fns.append(f'/path/to/zids_{i}.npy')
       fps_fns.append(f'/path/to/fps_{i}.npy')
    fplib = FpLibrary(fps_fns, ids_fns, smi_fns, 'name_of_pickle_file')

4. Identify a seed set of molecules from the virtual library. This should be a RANDOM subset of the total library; the exact number depends on library size, computational resources, time, etc.
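One way to draw the random seed set, assuming the library is a .smi file with one molecule per line (the function name is illustrative):

```python
import random

def sample_seed_set(smi_path, n_seed, out_path, seed=0):
    """Randomly sample n_seed molecules (lines) from a SMILES file."""
    with open(smi_path) as f:
        lines = f.read().splitlines()
    rng = random.Random(seed)           # fixed seed for reproducibility
    chosen = rng.sample(lines, n_seed)  # sampling without replacement
    with open(out_path, 'w') as f:
        f.write('\n'.join(chosen) + '\n')
    return chosen
```

For libraries too large to hold in memory, the same idea can be applied with reservoir sampling instead of reading every line at once.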

5. Build seed set in 3D (DOCK 3.8). taken from docs.docking.org

     source /wynton/group/bks/soft/DOCK-3.8.5/env.sh
     python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder Seeds --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif all.smi

6. Dock molecules. taken from docs.docking.org

    export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]
    export DOCKFILES=[path to your dockfiles]
    export INPUT_FOLDER=[the folder containing your .sdi file(s)]
    export OUTPUT_FOLDER=[where you want the output]
    /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh

7. Extract scores and IDs from docking output
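How you pull scores out of the DOCK output depends on your pipeline; whatever the route, the result should be a plain text file with one whitespace-separated "library_id score" pair per line, which is the format parsed in step 8. A minimal sketch of writing such a file from an ID-to-score dict (function name illustrative):

```python
def write_scores_file(scores_by_id, path):
    """Write docking scores as one 'library_id score' pair per line."""
    with open(path, 'w') as f:
        for lib_id, score in scores_by_id.items():
            f.write(f'{lib_id} {score}\n')
```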

8. Convert scores and IDs into NumPy arrays for recognition by ChemSTEP

     import numpy as np
     from chemstep.id_helper import char_to_int64

     # Input: one 'library_id score' pair per line
     with open('scores_round_0.txt', 'r') as f:
        lines = f.readlines()
     scores = np.zeros(len(lines), dtype=np.float32)
     indices = np.zeros(len(lines), dtype=np.int64)
     for i, line in enumerate(lines):
        lib_id, score = line.strip().split()
        scores[i] = float(score)
        indices[i] = char_to_int64(lib_id)  # encode the string ID as int64
     np.save('scores_round_0.npy', scores)
     np.save('indices_round_0.npy', indices)

9. Write ChemSTEP parameter file

     seed_indices_file: /path/to/indices_round_0.npy
     seed_scores_file: /path/to/scores_round_0.npy
     hit_pprop: 4
     n_docked_per_round: 100000
     max_beacons: 100
     max_n_rounds: 100
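ChemSTEP reads this file itself; purely for illustration, the format is simple "key: value" lines, which a minimal parser (not ChemSTEP's actual loader) could handle like this:

```python
def parse_params(path):
    """Parse 'key: value' lines into a dict, skipping blanks and # comments."""
    params = {}
    with open(path) as f:
        for raw in f:
            line = raw.split('#', 1)[0].strip()
            if not line:
                continue
            key, value = line.split(':', 1)
            params[key.strip()] = value.strip()
    return params
```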

10. Create run_chemstep.py script

    from chemstep.algo import CSAlgo
    from chemstep.fp_library import load_library_from_pickle

    fplib = load_library_from_pickle('/path/to/library/pickle/file')
    # args: fingerprint library, parameter file, output directory, n processes
    # (32 here matches the -pe smp 32 request in the SGE wrapper below)
    algo = CSAlgo(fplib, 'params.txt', 'output', 32, verbose=True)
    indices, scores = algo.seed()            # seed from the round-0 scores
    algo.run_one_round(1, indices, scores)   # run the first ChemSTEP round


11. Create launch_chemstep.sh wrapper to run ChemSTEP as a job

   #$ -S /bin/bash       # the shell language when run via the job scheduler [IMPORTANT]
   #$ -cwd               # job should run in the current working directory
   #$ -pe smp 32
   source /wynton/home/shoichetlab/omailhot/venv/bin/activate
   python3 run_chemstep.py

12. Launch ChemSTEP job (SGE)

   qsub launch_chemstep.sh


When finished, there will be a smi_round_1.smi file inside output_directory/complete_info. These molecules should be built and docked, and their scores converted into NumPy arrays (repeating steps 5-8).

13. Run second round of ChemSTEP, giving round number as a command-line argument.

   import sys
   import numpy as np
   from chemstep.algo import load_from_pickle

   round_n = int(sys.argv[1])
   # reload the algorithm state pickled at the end of the previous round
   saved_algo = load_from_pickle(f'chemstep_algo_{round_n - 1}.pickle')
   indices = np.load(f'indices_round_{round_n - 1}.npy')
   scores = np.load(f'scores_round_{round_n - 1}.npy')
   saved_algo.run_one_round(round_n, indices, scores)

To run:

   python run_chemstep.py 2


You can repeat this procedure for as many rounds as needed. The performance is reported in output_directory/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon's distance to all previous beacons.