Running ChemSTEP


Revision as of 21:46, 11 August 2025

Written July 24, 2025 by Katie. Edited August 8, 2025 with updated methods (PROSPECTIVE APPLICATION).

What you need: a virtual library of interest and dockfiles

1. Install ChemSTEP

   pip install /wynton/group/bks/work/omailhot/chemstep-0.2.0.tar.gz

2. Generate a set of ECFP4 fingerprints for your library. Binary fingerprints should be stored as contiguous uint8 NumPy arrays.

   from rdkit import Chem
   from rdkit.Chem import AllChem
   import numpy as np
   def morgan_fp(smiles, nbits=2048, radius=2):
      # radius 2 with 2048 bits corresponds to ECFP4
      mol = Chem.MolFromSmiles(smiles)
      if mol is None:
         raise ValueError(f'could not parse SMILES: {smiles}')
      fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=nbits)
      # contiguous uint8 NumPy array, as ChemSTEP expects
      return np.asarray(fp, dtype=np.uint8)

Save fingerprints as fps_XXX.npy, zids_XXX.npy, and smi_XXX.smi
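The chunked layout above can be written with a small helper. This is a sketch: `save_chunk` and the three-digit chunk index are illustrative assumptions, not part of ChemSTEP.

```python
import numpy as np

def save_chunk(fps, zinc_ids, smiles, chunk_idx, outdir='.'):
    """Write one chunk in the fps/zids/smi layout above.

    fps: list of uint8 fingerprint arrays (e.g. from morgan_fp above).
    """
    # stack into one contiguous uint8 matrix, one row per molecule
    arr = np.ascontiguousarray(np.stack(fps).astype(np.uint8))
    np.save(f'{outdir}/fps_{chunk_idx:03d}.npy', arr)
    np.save(f'{outdir}/zids_{chunk_idx:03d}.npy', np.asarray(zinc_ids))
    # plain SMILES file: one "SMILES ID" pair per line
    with open(f'{outdir}/smi_{chunk_idx:03d}.smi', 'w') as f:
        for smi, zid in zip(smiles, zinc_ids):
            f.write(f'{smi} {zid}\n')
```

Chunking the library into many fps_XXX/zids_XXX/smi_XXX triples keeps individual files small and matches the sorted-glob loading used in step 3.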

3. Compile FP library into a pickle file.

    from glob import glob
    from chemstep.fp_library import FpLibrary
    lib = FpLibrary(
       fingerprint_files = sorted(glob('absolute/path/to/your/fps*.npy')),
       id_files          = sorted(glob('absolute/path/to/your/zids*.npy')),
       smi_files         = sorted(glob('absolute/path/to/your/smi*.smi')),
       outname           = 'zinc_fplib.pickle')

4. Identify a seed set of molecules from the virtual library. This should be a RANDOM set of molecules from the total library. The exact number can depend on library size, computational resources, time, etc. We suggest sampling between 0.01 and 1% of the virtual library.
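The random sampling can be sketched as follows. `sample_seed_set` is an illustrative helper, not part of ChemSTEP, and the 0.1% fraction is just one example within the suggested range.

```python
import random

def sample_seed_set(smiles_lines, fraction=0.001, seed=42):
    # draw a uniform random sample covering `fraction` of the library
    k = max(1, int(len(smiles_lines) * fraction))
    return random.Random(seed).sample(smiles_lines, k)

# toy library of 10,000 "SMILES ID" lines standing in for the real library
library = [f'CCO ZINC{i:08d}' for i in range(10000)]
seeds = sample_seed_set(library, fraction=0.001)  # 0.1% -> 10 molecules
```

In practice you would read the library's .smi lines, sample them this way, and write the result to an all.smi file for 3D building in step 5.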

5. Build the seed set in 3D (DOCK 3.8); commands taken from docs.docking.org

     source /wynton/group/bks/soft/DOCK-3.8.5/env.sh
     python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder Seeds --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif all.smi

If using InfiniSee XReal V0.0, start here. This is a sample of 100 million molecules from the total 1.1 trillion molecule library.

6. Copy (or compile if need be) the seed set SDI file (DOCK 3.8)

    cp /wynton/group/bks/work/mfreitas/building_100M/100M.sdi .


6. Dock molecules; commands taken from docs.docking.org

    export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]
    export DOCKFILES=[path to your dockfiles]
    export INPUT_FOLDER=[the folder containing your .sdi file(s)]
    export OUTPUT_FOLDER=[where you want the output ]
    /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh

7. Extract scores and IDs from the docking output; in this case, the file is named 'scores_round_0.txt'

8. Convert scores and IDs into a NumPy array for recognition by ChemSTEP

     import numpy as np
     from chemstep.id_helper import char_to_int64
     with open('scores_round_0.txt', 'r') as f:
        lines = f.readlines()
     scores = np.zeros(len(lines), dtype=np.float32)
     indices = np.zeros(len(lines), dtype=np.int64)
     for i, line in enumerate(lines):
        lib_id, score = line.strip().split()
        scores[i] = float(score)
        indices[i] = char_to_int64(lib_id)
     np.save('scores_round_0.npy', scores)
     np.save('indices_round_0.npy', indices)

9. Write ChemSTEP parameter file

     seed_indices_file: /path/to/indices_round_0.npy
     seed_scores_file: /path/to/scores_round_0.npy
     hit_pprop: 4
     n_docked_per_round: 100000
     max_beacons: 100
     max_n_rounds: 100

10. Create run_chemstep.py script and an SGE wrapper called launch_chemstep.sh

run_chemstep.py:

    from chemstep.algo import CSAlgo
    from chemstep.fp_library import load_library_from_pickle
    fplib = load_library_from_pickle('/wynton/group/bks/work/omailhot/xreal_v0p0_fplibrary.pickle')
    algo = CSAlgo(fplib, 'params.txt', 'output', 32, verbose=True, scheduler='sge')
    indices, scores = algo.seed()
    algo.run_one_round(1, indices, scores)

launch_chemstep.sh:

    #!/bin/bash           # the shell language when run outside of the job scheduler
    #$ -S /bin/bash       # the shell language when run via the job scheduler
    #$ -cwd               # job should run in the current working directory
    #$ -pe smp 32         # request 32 slots to match the 32 cores given to CSAlgo
    python3 run_chemstep.py

run with:

    qsub launch_chemstep.sh


When finished, there will be a smi_round_1.smi inside output_directory/complete_info. These molecules should be built, docked, and their scores converted into NumPy arrays (repeating steps 5-8).

11. Run second round of ChemSTEP, giving the round number as a command-line argument.

   import sys
   import numpy as np
   from chemstep.algo import load_from_pickle
   round_n = int(sys.argv[1])
   saved_algo = load_from_pickle(f'chemstep_algo_{round_n - 1}.pickle')
   indices = np.load(f'indices_round_{round_n - 1}.npy')
   scores = np.load(f'scores_round_{round_n - 1}.npy')
   saved_algo.run_one_round(round_n, indices, scores)

Again, this should be run as a job on the scheduler with 32 cores (e.g. qsub a launch script that calls python3 run_chemstep.py 2 for round 2).


You can repeat this procedure for as many rounds as needed. The performance is reported in output_directory/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon's distance to all previous beacons.
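To inspect the summary programmatically, something like the following should work. This is a sketch that assumes run_summary.df is a pickled pandas DataFrame; the mock column names below are placeholders, and the real file (at output_directory/complete_info/run_summary.df) may name its columns differently.

```python
import os
import tempfile

import pandas as pd

# round-trip a mock summary to show the access pattern; replace `path`
# with output_directory/complete_info/run_summary.df in a real run
path = os.path.join(tempfile.mkdtemp(), 'run_summary_demo.df')
pd.DataFrame({
    'n_beacons': [10, 18],          # mock values, not real ChemSTEP output
    'n_docked': [100000, 100000],
    'n_hits': [250, 420],
}).to_pickle(path)

summary = pd.read_pickle(path)
hits_per_round = summary['n_hits'].tolist()
```

Tracking hits per round this way makes it easy to decide when additional rounds stop paying off.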