Running ChemSTEP: Difference between revisions

From DISI
Jump to navigation Jump to search
mNo edit summary
No edit summary
 
(52 intermediate revisions by the same user not shown)
Line 1: Line 1:
written July 24 2025 by katie. edited August 11 2025 with updated methods specific to InfiniSee XReal.  
last update: dec 16 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs).  


What you need: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using a drug-like subset of ZINC (22B) or 13.3B library from Enamine REAL. '''for detailed instructions on the 13B space, see 'Running ChemSTEP on the 13B space' wiki page'''. For more general use directions, please refer to [ChemSTEP Read-the-Docs].


'''1. Source ChemSTEP virtual environment'''
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed "seed set", round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of "virtual hits" in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold ("beacons") from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits.
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate




'''2. Copy InfiniSee seed-set into a directory named 'sdi' within your docking directory'''
= Running ChemSTEP (Auto DOCK and Build) =
This is a split database index file containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total vritual library, currently 1 trillion molecules.
    cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_00_seed_set.wynton.sdi .


'''3. Dock seed set to your receptor of interest using DOCK 3.8''' /docking directions taken from docs.docking.org
ChemSTEP is configured to run on Wynton with libraries of '''13B''' and '''22B'''. This page covers the full workflow for running ChemSTEP with automatic submission of docking and building jobs.
    export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]
    export DOCKFILES=[path to your dockfiles]
    export INPUT_FOLDER=[the folder containing your .sdi file(s)]
    export OUTPUT_FOLDER=[where you want the output ]


    /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh
__TOC__


Wait for docking to complete, then extract molecule IDs and corresponding scores. To do so, run the following commands in the base docking directory (containing your docking files and output folder) while logged into a dev node:
== 1. Source Environment ==
    cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .
    python3 get_scores.py


'''4. Convert scores and molecule IDS into NumPY arrays for ChemSTEP recognition.'''
<pre>
    cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .
source /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/bin/activate
    python3 convert_scores_to_npy.py
</pre>


This should output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores.  
== 2. Dock the Seed Set ==


'''5. Make a directory to run ChemSTEP in. Copy in necessary files: params.txt, run_chemstep.py and launch_chemstep_as_job.sh'''
Copy the <code>.sdi</code> file for the library you want to use:
    cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .
    cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .
    cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .


{| class="wikitable"
! Library !! Path
|-
| 13B || <code>/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/13B/13M_seeds.sdi</code>
|-
| 22B || <code>/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/22B/22M_seeds.sdi</code>
|}


'''6. Edit params.txt file'''
Then dock the seed set. See the '''Large-Scale Docking (LSD)''' directions.
    seed_indices_file: /absoulte/path/to/your/indices_round_0.npy
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy
    hit_pprop: 3
    n_docked_per_round: 30000
    max_beacons: 150
    max_n_rounds: 250


Be sure that this file reflects your score and indices files, as well as desired pProp, number of beacons, and number to prioritize per round. There should be no need to edit run_chemstep.py or the SGE wrapper script for XReal docking. ''If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.'' We strongly suggest running ChemSTEP as a job array with 32 CPU slots requested (specified in wrapper).
== 3. Gather Scores for the Seed Set ==


'''7. Run ChemSTEP''' with the following command:
Once docking is complete, run the following from the directory '''one level above''' your docking output (<code>MOLECULES_DIR_TO_BIND</code>).
    qsub launch_chemstep_as_job.sh


'''22B library:'''
<pre>
python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/get_scores.py 0
</pre>


When finished, there will be a smi_round_1.smi inside of output_directory/complete_info. These molecules should be built, docked, and fed back into ChemSTEP.
'''13B library:'''
<pre>
python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/get_scores_13B.py 0 MOL
</pre>


'''8. Build prioritized molecules (DOCK 3.8).''' /taken from docs.docking.org
{{Note|You must specify the molecule ID prefix for the 13B library (<code>MOL</code>).}}
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh


      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi
Verify that <code>scores_round_0.txt</code> was correctly written:
<pre>
wc -l scores_round_0.txt
</pre>


When building has completed, you must write an SDI file with the complete paths to each built bundle and dock. '''Be sure to change the INDOCK file to save only poses that meet your score pProp score threshold (output by ChemSTEP)!''' Retrieve docking scores as outlined in step 3 and convert to NumPy arrays (Step 4, be sure to update the round number in the naming conventions!!!).  
== 4. Convert Scores to .npy Files ==


'''9. Run second round of ChemSTEP, giving round number as a command-line argument.''' /copy this script in as run_chemstep_iterative.py
Convert scores to ChemSTEP-readable <code>.npy</code> files:
    from chemstep.algo import load_from_pickle
    import sys
    round_n = int(sys.argv[1])
    algo = load_from_pickle(f'chemstep_algo_{round_n - 1}.pickle')
    indices = np.load(f'indices_round_{round_n - 1}.npy')
    scores = np.load(f'scores_round_{round_n - 1}.npy')
    algo.run_one_round(round_n, indices, scores)


Again, this should be run as a job on the scheduler with 32 or 64 cores. Update launch_chemstep_as_job.sh to call (specifying round number):
<pre>
  python3 run_chemstep_interative.py 2
python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/convert_scores_to_npy.py 0 <mol_id_prefix>
</pre>


The output will be smi_round_2.smi file. Repeat steps 8 and 9 for as many rounds as needed. The performance is reported in output_directory/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon's distance to all previous beacons.
The <code>mol_id_prefix</code> should match the library:
 
{| class="wikitable"
! Library !! Prefix
|-
| 22B / 72B || <code>CSLB</code>
|-
| 13B || <code>MOL</code>
|}
 
== 5. Set Up the ChemSTEP Run Directory ==
 
Create and enter a new run directory, then copy in the necessary files:
 
<pre>
mkdir chemstep_run
cd chemstep_run
chemstep-run-new
</pre>
 
This will populate the directory with <code>params.txt</code>, <code>run_chemstep.py</code>, and <code>launch_chemstep_as_job.sh</code>.
 
=== Optional: Integrated IFP ===
 
If running with integrated IFP for beacon selection, also run:
 
<pre>
chemstep-run-ifp
</pre>
 
This copies in the additional files <code>ifp_acceptance_criteria.txt</code> and <code>interactions.txt</code>.
 
== 6. Edit params.txt ==
 
Add the absolute paths to the ChemSTEP-readable score and indices <code>.npy</code> arrays generated in Step 4.
 
<pre>
seed_indices_file:  /path/to/your/indices_round_0.npy
seed_scores_file:  /path/to/your/scores_round_0.npy
hit_pprop:          5.5
n_docked_per_round: 2000000
bundle_size:        1000
max_beacons:        100
max_n_rounds:      250
</pre>
 
=== Parameter Reference ===
 
{| class="wikitable"
! Parameter !! Description
|-
| <code>hit_pprop</code> || Defines a "virtual hit." pProp = −log(rank%) within the total library score distribution. E.g., pProp 4 in 13B space ≈ top 0.01% (~1.3M molecules); pProp 5 ≈ 0.001% (~132K). The seed set should contain at least 10<sup>(pProp+2)</sup> molecules.
|-
| <code>n_docked_per_round</code> || Number of molecules prioritized per round. All must be built and docked between rounds. Too many slows throughput and may reduce diversity; too few slows virtual hit recovery.
|-
| <code>max_beacons</code> || Diverse, well-scoring molecules used to guide prioritization. All molecules above the pProp threshold are candidates. Too many reduces inter-beacon diversity; too few hinders space exploration. Fewer beacons than specified may be assigned if insufficient molecules clear the threshold.
|-
| <code>bundle_size</code> || In auto docking mode, number of molecules submitted as a single build job.
|-
| <code>max_n_rounds</code> || No adjustment needed when running ChemSTEP prospectively as described here.
|}
 
== 7. Edit run_chemstep.py ==
 
'''Note:''' All paths must be '''absolute paths'''.
 
=== Required Settings ===
 
Set <code>lib_path</code> to the library pickle for your library:
 
{| class="wikitable"
! Library !! Path
|-
| 13B || <code>/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/13B/boltz_fplib.pickle</code>
|-
| 22B || <code>/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/22B/22B_fplib.pickle</code>
|}
 
<pre>
lib_path = '/full/path/to/library.pickle'
</pre>
 
Set <code>dockfiles_path</code>:
 
<pre>
dockfiles_path="/full/path/to/dockfiles"
</pre>
 
=== Optional: minTD Exclusion Zone ===
 
Molecules will not be prioritized from within a specified Tanimoto distance of beacons. Comment in the relevant lines and update the value. Consider also setting <code>enforce_n_docked_per_round = True</code> when using this option:
 
<pre>
min_td_search=0.5,
enforce_n_docked_per_round=True,
</pre>
 
=== Optional: Integrated IFP ===
 
Only selects beacons that satisfy user-defined interaction criteria. Comment in the following lines and update the paths to the necessary files (copied in Step 5 if you ran <code>chemstep-run-ifp</code>):
 
<pre>
use_IFP=True,
ifp_pdb_path='/full/path/to/rec.crg.pdb',
interactions_file='/full/path/to/interactions.txt',
ifp_acceptance_criteria_file='/full/path/to/ifp_acceptance_criteria.txt',
</pre>
 
'''<code>interactions.txt</code>''' — one interaction per line, comma-separated. Format: <code>interaction_type, residue_name_and_number</code>. Example:
 
<pre>
Hydrogen bond, GLY19
Ionic, ASP149
</pre>
 
Supported interaction types include: Proximal, Hydrogen bond, Ionic, Cation-pi, Hydrophobic, Halogen bond, and others. See LUNA and IFP documentation for the full list.
 
'''<code>ifp_acceptance_criteria.txt</code>''' — defines the number of unsatisfied donors/acceptors/specific interactions required for a molecule to pass IFP and be considered for beacon selection. Example:
 
<pre>
#_donors
#_acceptors
#_unstatisfied_donors == 0
#_unstatisfied_acceptors <= 4
Ionic/ASP-149 > 0
</pre>
 
=== Example: AmpC on 22B with minTD=0.50, No IFP ===
 
<pre>
lib_path = '/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/22B/22B_fplib.pickle'
lib = load_library_from_pickle(lib_path)
algo = CSAlgo(lib, 'params.txt', 'output', 16, verbose=True,
    scheduler='sge', smi_id_prefix='CSLB',
    python_exec="/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/bin/python",
    dockfiles_path="/wynton/group/bks/work/kholland/chemstep_ampc_22B/seed_docking/dockfiles",
    min_td_search=0.5,
    enforce_n_docked_per_round=True,
    #use_IFP=True,
    #ifp_pdb_path='/path/to/your/reference/rec.crg.pdb',
    #interactions_file='/path/to/your/interactions.txt',
    #ifp_acceptance_criteria_file='/path/to/your/ifp_acceptance_criteria.txt',
    docking_method="auto", track_beacon_orig=True)
</pre>
 
== 8. Launch the Job ==
 
Submit the main ChemSTEP job:
 
<pre>
qsub launch_chemstep_as_job.sh
</pre>
 
== 9. Monitor Job Status ==
 
Check job status with <code>qstat</code>. The main job will run for up to '''2 weeks''' given no errors. ChemSTEP will launch search, building, and docking jobs in successive rounds.
 
'''Note:''' If any building or docking subjobs hang, the main job will not proceed until those are canceled or finished. Monitor job statuses regularly and occasionally verify that docking output files (<code>scores_round_*.txt</code>) are being populated.
 
== 10. View Beacon SMILES and IDs ==
 
From the ChemSTEP running directory, run the following in a '''screen session on a dev node''':
 
<pre>
python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/get_beacon_smiles.py /path/to/library/pickle chemstep_algo.log
</pre>
 
Use the library pickle path from [[#7. Edit run_chemstep.py|Step 7]].
 
== 11. Get Poses After Docking ==
 
Make a list of <code>test.mol2.gz.0</code> files from docking:
 
<pre>
find /round_*_docking/bundle_paths -maxdepth 2 -name "test.mol2.gz.0" > docked_poses.txt
</pre>
 
Then extract top poses:
 
<pre>
python /wynton/group/bks/work/bwhall61/for_beau/top_poses.py \
    -t <pProp_threshold> \
    -s <num_poses_per_file> \
    -dock_results_path docked_poses.txt
</pre>

Latest revision as of 16:53, 16 March 2026

last update: dec 16 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs).

ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using a drug-like subset of ZINC (22B) or 13.3B library from Enamine REAL. for detailed instructions on the 13B space, see 'Running ChemSTEP on the 13B space' wiki page. For more general use directions, please refer to [ChemSTEP Read-the-Docs].

At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed "seed set", round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of "virtual hits" in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold ("beacons") from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits.


Running ChemSTEP (Auto DOCK and Build)

ChemSTEP is configured to run on Wynton with libraries of 13B and 22B. This page covers the full workflow for running ChemSTEP with automatic submission of docking and building jobs.

1. Source Environment

source /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/bin/activate

2. Dock the Seed Set

Copy the .sdi file for the library you want to use:

Library Path
13B /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/13B/13M_seeds.sdi
22B /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/22B/22M_seeds.sdi

Then dock the seed set. See the Large-Scale Docking (LSD) directions.

3. Gather Scores for the Seed Set

Once docking is complete, run the following from the directory one level above your docking output (MOLECULES_DIR_TO_BIND).

22B library:

python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/get_scores.py 0

13B library:

python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/get_scores_13B.py 0 MOL

Template:Note

Verify that scores_round_0.txt was correctly written:

wc -l scores_round_0.txt

4. Convert Scores to .npy Files

Convert scores to ChemSTEP-readable .npy files:

python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/convert_scores_to_npy.py 0 <mol_id_prefix>

The mol_id_prefix should match the library:

Library Prefix
22B / 72B CSLB
13B MOL

5. Set Up the ChemSTEP Run Directory

Create and enter a new run directory, then copy in the necessary files:

mkdir chemstep_run
cd chemstep_run
chemstep-run-new

This will populate the directory with params.txt, run_chemstep.py, and launch_chemstep_as_job.sh.

Optional: Integrated IFP

If running with integrated IFP for beacon selection, also run:

chemstep-run-ifp

This copies in the additional files ifp_acceptance_criteria.txt and interactions.txt.

6. Edit params.txt

Add the absolute paths to the ChemSTEP-readable score and indices .npy arrays generated in Step 4.

seed_indices_file:  /path/to/your/indices_round_0.npy
seed_scores_file:   /path/to/your/scores_round_0.npy
hit_pprop:          5.5
n_docked_per_round: 2000000
bundle_size:        1000
max_beacons:        100
max_n_rounds:       250

Parameter Reference

Parameter Description
hit_pprop Defines a "virtual hit." pProp = −log(rank%) within the total library score distribution. E.g., pProp 4 in 13B space ≈ top 0.01% (~1.3M molecules); pProp 5 ≈ 0.001% (~132K). The seed set should contain at least 10(pProp+2) molecules.
n_docked_per_round Number of molecules prioritized per round. All must be built and docked between rounds. Too many slows throughput and may reduce diversity; too few slows virtual hit recovery.
max_beacons Diverse, well-scoring molecules used to guide prioritization. All molecules above the pProp threshold are candidates. Too many reduces inter-beacon diversity; too few hinders space exploration. Fewer beacons than specified may be assigned if insufficient molecules clear the threshold.
bundle_size In auto docking mode, number of molecules submitted as a single build job.
max_n_rounds No adjustment needed when running ChemSTEP prospectively as described here.

7. Edit run_chemstep.py

Note: All paths must be absolute paths.

Required Settings

Set lib_path to the library pickle for your library:

Library Path
13B /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/13B/boltz_fplib.pickle
22B /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/22B/22B_fplib.pickle
lib_path = '/full/path/to/library.pickle'

Set dockfiles_path:

dockfiles_path="/full/path/to/dockfiles"

Optional: minTD Exclusion Zone

Molecules will not be prioritized from within a specified Tanimoto distance of beacons. Comment in the relevant lines and update the value. Consider also setting enforce_n_docked_per_round = True when using this option:

min_td_search=0.5,
enforce_n_docked_per_round=True,

Optional: Integrated IFP

Only selects beacons that satisfy user-defined interaction criteria. Comment in the following lines and update the paths to the necessary files (copied in Step 5 if you ran chemstep-run-ifp):

use_IFP=True,
ifp_pdb_path='/full/path/to/rec.crg.pdb',
interactions_file='/full/path/to/interactions.txt',
ifp_acceptance_criteria_file='/full/path/to/ifp_acceptance_criteria.txt',

interactions.txt — one interaction per line, comma-separated. Format: interaction_type, residue_name_and_number. Example:

Hydrogen bond, GLY19
Ionic, ASP149

Supported interaction types include: Proximal, Hydrogen bond, Ionic, Cation-pi, Hydrophobic, Halogen bond, and others. See LUNA and IFP documentation for the full list.

ifp_acceptance_criteria.txt — defines the number of unsatisfied donors/acceptors/specific interactions required for a molecule to pass IFP and be considered for beacon selection. Example:

#_donors
#_acceptors
#_unstatisfied_donors == 0
#_unstatisfied_acceptors <= 4
Ionic/ASP-149 > 0

Example: AmpC on 22B with minTD=0.50, No IFP

lib_path = '/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/22B/22B_fplib.pickle'
lib = load_library_from_pickle(lib_path)
algo = CSAlgo(lib, 'params.txt', 'output', 16, verbose=True,
    scheduler='sge', smi_id_prefix='CSLB',
    python_exec="/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/bin/python",
    dockfiles_path="/wynton/group/bks/work/kholland/chemstep_ampc_22B/seed_docking/dockfiles",
    min_td_search=0.5,
    enforce_n_docked_per_round=True,
    #use_IFP=True,
    #ifp_pdb_path='/path/to/your/reference/rec.crg.pdb',
    #interactions_file='/path/to/your/interactions.txt',
    #ifp_acceptance_criteria_file='/path/to/your/ifp_acceptance_criteria.txt',
    docking_method="auto", track_beacon_orig=True)

8. Launch the Job

Submit the main ChemSTEP job:

qsub launch_chemstep_as_job.sh

9. Monitor Job Status

Check job status with qstat. The main job will run for up to 2 weeks given no errors. ChemSTEP will launch search, building, and docking jobs in successive rounds.

Note: If any building or docking subjobs hang, the main job will not proceed until those are canceled or finished. Monitor job statuses regularly and occasionally verify that docking output files (scores_round_*.txt) are being populated.

10. View Beacon SMILES and IDs

From the ChemSTEP running directory, run the following in a screen session on a dev node:

python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/get_beacon_smiles.py /path/to/library/pickle chemstep_algo.log

Use the library pickle path from Step 7.

11. Get Poses After Docking

Make a list of test.mol2.gz.0 files from docking:

find /round_*_docking/bundle_paths -maxdepth 2 -name "test.mol2.gz.0" > docked_poses.txt

Then extract top poses:

python /wynton/group/bks/work/bwhall61/for_beau/top_poses.py \
    -t <pProp_threshold> \
    -s <num_poses_per_file> \
    -dock_results_path docked_poses.txt