Running ChemSTEP

From DISI
Revision as of 02:14, 9 August 2025 by Kholland (talk | contribs)
Jump to navigation Jump to search

written July 24 2025 by katie. edited August 8 2025 with updated methods (PROSPECTIVE APPLICATION)

What you need: a virtual library of interest, dockfiles

1. Install ChemSTEP

   pip install /wynton/group/bks/work/omailhot/chemstep-0.2.0.tar.gz

2. Generate a mesh of ECFP4 fingerprints for your library binary fingerprints should be stored as contiguous uint8 NumPy arrays

   from rdkit import Chem
   from rdkit.Chem import AllChem
   import numpy as np
   def morgan_fp(smiles, nbits=2048, radius=2):
      mol = Chem.MolFromSmiles(smiles)
      fp  = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=nbits)
      return np.asarray(fp, dtype=np.uint8)

save fingerprints as fps_XXX.npy, zids_XXX.npy, and smi_XXX.smi

3. Compile FP library into a pickle file.

    from chemstep.fp_library import FpLibrary
    smi_fns = []q
    zids_fns = []
    fps_fns = []
    for i in range(0, #files generated in step 3):
       smi_fns.append(f'/path/to/smi_{i}.smi')
       ids_fns.append(f'/path/to/zids_{i}.npy')
       fps_fns.append(f'/path/to/fps_{i}.npy')
    fplib = FpLibrary(fps_fns, ids_fns, smi_fns, 'name_of_pickle_file')

4. Identify a seed set of molecules from virtual library. This should be a RANDOM set of molecules from the total library. The exact number can depend on library size, computational resources, time. etc.

5. Build seed set in 3D (DOCK 3.8). taken from docs.docking.org

     source /wynton/group/bks/soft/DOCK-3.8.5/env.sh
     python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder Seeds --bundle_size 1000 --  minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif all.smi

6. Dock molecules. taken from docs.docking.org

    export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]
    export DOCKFILES=[path to your dockfiles]
    export INPUT_FOLDER=[the folder containing your .sdi file(s)]
    export OUTPUT_FOLDER=[where you want the output ]
    /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh

7. Extract scores and IDs from docking output

8. Convert scores and IDs into a NumPy array for recognition by ChemSTEP

     import numpy as np
     from chemstep.id_helper import char_to_int64
     with open('scores_round_0.txt', 'r') as f:
        lines = f.readlines()
     scores = np.zeros(len(lines), dtype=np.float32)
     indices = np.zeros(len(lines), dtype=np.int64)
     for i, line in enumerate(lines):
        lib_id, score = line.strip().split()
        scores[i] = float(score)
        indices[i] = char_to_int64(lib_id)
     np.save('scores_round_0.npy', scores)
     np.save('indices_round_0.npy', indices)

9. Write ChemSTEP parameter file

     seed_indices_file: /path/to/indices_round_0.npy
     seed_scores_file: /path/to/scores_round_0.npy
     hit_pprop: 4
     n_docked_per_round: 100000
     max_beacons: 100
     max_n_rounds: 100

10. Create run_chemstep.py script

    from chemstep.algo import CSAlgo
    from chemstep.fp_library import load_library_from_pickle
    fplib = load_library_from_pickle('/path/to/library/pickle/file')
    algo = CSAlgo(fplib, 'params.txt', 'output', 32, verbose=True, scheduler='sge')
    indices, scores = algo.seed()
    algo.run_one_round(1, indices, scores)


When finished, there will be a smi_round_1.smi inside of output_directory/complete_info. These molecules should be built, docked, and scores converted into NumPy arrays (repeating steps 5-8)

11. Run second round of ChemSTEP, giving round number as a command-line argument.

   from chemstep.algo import load_from_pickle
   import sys
   round_n = int(sys.argv[1])
   saved_algo = load_from_pickle(f'chemstep_algo_{round_n - 1}.pickle')
   indices = np.load(f'indices_round_{round_n - 1}.npy')
   scores = np.load(f'scores_round_{round_n - 1}.npy')
   algo.run_one_round(round_n, indices, scores)

to run:

   python run_chemstep.py 2


You can repeat this procedure for as many rounds as needed. The performance is reported in output_directory/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon's distance to all previous beacons.