Running ChemSTEP


Revision as of 21:46, 11 August 2025

Written July 24, 2025 by Katie. Edited August 8, 2025 with updated methods (PROSPECTIVE APPLICATION).

What you need: a virtual library of interest and dockfiles

1. Install ChemSTEP

   pip install /wynton/group/bks/work/omailhot/chemstep-0.2.0.tar.gz

2. Generate a set of ECFP4 fingerprints for your library. Binary fingerprints should be stored as contiguous uint8 NumPy arrays.

   from rdkit import Chem
   from rdkit.Chem import AllChem
   import numpy as np
   def morgan_fp(smiles, nbits=2048, radius=2):
      # radius 2 with 2048 bits corresponds to ECFP4
      mol = Chem.MolFromSmiles(smiles)
      if mol is None:
         raise ValueError(f'could not parse SMILES: {smiles}')
      fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=nbits)
      # contiguous uint8 NumPy array, as ChemSTEP expects
      return np.asarray(fp, dtype=np.uint8)

Save fingerprints as fps_XXX.npy, zids_XXX.npy, and smi_XXX.smi
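The chunked layout above can be written with a small helper. This is a sketch: `save_chunk` and the three-digit chunk index are illustrative assumptions, not part of ChemSTEP.

```python
import numpy as np

def save_chunk(fps, zinc_ids, smiles, chunk_idx, outdir='.'):
    """Write one chunk in the fps/zids/smi layout above.

    fps: list of uint8 fingerprint arrays (e.g. from morgan_fp above).
    """
    # stack into one contiguous uint8 matrix, one row per molecule
    arr = np.ascontiguousarray(np.stack(fps).astype(np.uint8))
    np.save(f'{outdir}/fps_{chunk_idx:03d}.npy', arr)
    np.save(f'{outdir}/zids_{chunk_idx:03d}.npy', np.asarray(zinc_ids))
    # plain SMILES file: one "SMILES ID" pair per line
    with open(f'{outdir}/smi_{chunk_idx:03d}.smi', 'w') as f:
        for smi, zid in zip(smiles, zinc_ids):
            f.write(f'{smi} {zid}\n')
```

Chunking the library into many fps_XXX/zids_XXX/smi_XXX triples keeps individual files small and matches the sorted-glob loading used in step 3.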

3. Compile FP library into a pickle file.

    from glob import glob
    from chemstep.fp_library import FpLibrary
    lib = FpLibrary(
       fingerprint_files = sorted(glob('absolute/path/to/your/fps*.npy')),
       id_files          = sorted(glob('absolute/path/to/your/zids*.npy')),
       smi_files         = sorted(glob('absolute/path/to/your/smi*.smi')),
       outname           = 'zinc_fplib.pickle')

4. Identify a seed set of molecules from the virtual library. This should be a RANDOM set of molecules from the total library. The exact number can depend on library size, computational resources, time, etc. We suggest sampling between 0.01 and 1% of the virtual library.
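The random sampling can be sketched as follows. `sample_seed_set` is an illustrative helper, not part of ChemSTEP, and the 0.1% fraction is just one example within the suggested range.

```python
import random

def sample_seed_set(smiles_lines, fraction=0.001, seed=42):
    # draw a uniform random sample covering `fraction` of the library
    k = max(1, int(len(smiles_lines) * fraction))
    return random.Random(seed).sample(smiles_lines, k)

# toy library of 10,000 "SMILES ID" lines standing in for the real library
library = [f'CCO ZINC{i:08d}' for i in range(10000)]
seeds = sample_seed_set(library, fraction=0.001)  # 0.1% -> 10 molecules
```

In practice you would read the library's .smi lines, sample them this way, and write the result to an all.smi file for 3D building in step 5.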

5. Build the seed set in 3D (DOCK 3.8); commands taken from docs.docking.org

     source /wynton/group/bks/soft/DOCK-3.8.5/env.sh
     python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder Seeds --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif all.smi

If using InfiniSee XReal V0.0, start here. This is a sample of 100 million molecules from the total 1.1 trillion molecule library.

6. Copy (or compile if need be) the seed set SDI file (DOCK 3.8)

    cp /wynton/group/bks/work/mfreitas/building_100M/100M.sdi .


6. Dock molecules; commands taken from docs.docking.org

    export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]
    export DOCKFILES=[path to your dockfiles]
    export INPUT_FOLDER=[the folder containing your .sdi file(s)]
    export OUTPUT_FOLDER=[where you want the output ]
    /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh

7. Extract scores and IDs from the docking output; in this case, the file is named 'scores_round_0.txt'

8. Convert scores and IDs into a NumPy array for recognition by ChemSTEP

     import numpy as np
     from chemstep.id_helper import char_to_int64
     with open('scores_round_0.txt', 'r') as f:
        lines = f.readlines()
     scores = np.zeros(len(lines), dtype=np.float32)
     indices = np.zeros(len(lines), dtype=np.int64)
     for i, line in enumerate(lines):
        lib_id, score = line.strip().split()
        scores[i] = float(score)
        indices[i] = char_to_int64(lib_id)
     np.save('scores_round_0.npy', scores)
     np.save('indices_round_0.npy', indices)

9. Write ChemSTEP parameter file

     seed_indices_file: /path/to/indices_round_0.npy
     seed_scores_file: /path/to/scores_round_0.npy
     hit_pprop: 4
     n_docked_per_round: 100000
     max_beacons: 100
     max_n_rounds: 100

10. Create run_chemstep.py script and an SGE wrapper called launch_chemstep.sh

run_chemstep.py:

    from chemstep.algo import CSAlgo
    from chemstep.fp_library import load_library_from_pickle
    fplib = load_library_from_pickle('/wynton/group/bks/work/omailhot/xreal_v0p0_fplibrary.pickle')
    algo = CSAlgo(fplib, 'params.txt', 'output', 32, verbose=True, scheduler='sge')
    indices, scores = algo.seed()
    algo.run_one_round(1, indices, scores)

launch_chemstep.sh:

    #!/bin/bash           # the shell language when run outside of the job scheduler
    #$ -S /bin/bash       # the shell language when run via the job scheduler
    #$ -cwd               # job should run in the current working directory
    #$ -pe smp 32         # request 32 slots to match the 32 cores given to CSAlgo
    python3 run_chemstep.py

run with:

    qsub launch_chemstep.sh


When finished, there will be a smi_round_1.smi inside output_directory/complete_info. These molecules should be built, docked, and their scores converted into NumPy arrays (repeating steps 5-8).

11. Run second round of ChemSTEP, giving the round number as a command-line argument.

   import sys
   import numpy as np
   from chemstep.algo import load_from_pickle
   round_n = int(sys.argv[1])
   saved_algo = load_from_pickle(f'chemstep_algo_{round_n - 1}.pickle')
   indices = np.load(f'indices_round_{round_n - 1}.npy')
   scores = np.load(f'scores_round_{round_n - 1}.npy')
   saved_algo.run_one_round(round_n, indices, scores)

Again, this should be run as a job on the scheduler with 32 cores (e.g. qsub a launch script that calls python3 run_chemstep.py 2 for round 2).


You can repeat this procedure for as many rounds as needed. The performance is reported in output_directory/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon's distance to all previous beacons.
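To inspect the summary programmatically, something like the following should work. This is a sketch that assumes run_summary.df is a pickled pandas DataFrame; the mock column names below are placeholders, and the real file (at output_directory/complete_info/run_summary.df) may name its columns differently.

```python
import os
import tempfile

import pandas as pd

# round-trip a mock summary to show the access pattern; replace `path`
# with output_directory/complete_info/run_summary.df in a real run
path = os.path.join(tempfile.mkdtemp(), 'run_summary_demo.df')
pd.DataFrame({
    'n_beacons': [10, 18],          # mock values, not real ChemSTEP output
    'n_docked': [100000, 100000],
    'n_hits': [250, 420],
}).to_pickle(path)

summary = pd.read_pickle(path)
hits_per_round = summary['n_hits'].tolist()
```

Tracking hits per round this way makes it easy to decide when additional rounds stop paying off.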