Running ChemSTEP
last update: dec 16 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs).
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using a drug-like subset of ZINC (22B) or 13.3B library from Enamine REAL. for detailed instructions on the 13B space, see 'Running ChemSTEP on the 13B space' wiki page. For more general use directions, please refer to [ChemSTEP Read-the-Docs].
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed "seed set", round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of "virtual hits" in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold ("beacons") from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits.
Running ChemSTEP (Auto DOCK and Build)
ChemSTEP is configured to run on Wynton with libraries of 13B and 22B. This page covers the full workflow for running ChemSTEP with automatic submission of docking and building jobs.
1. Source Environment
source /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/bin/activate
2. Dock the Seed Set
Copy the .sdi file for the library you want to use:
| Library | Path |
|---|---|
| 13B | /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/13B/13M_seeds.sdi
|
| 22B | /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/22B/22M_seeds.sdi
|
Then dock the seed set. See the Large-Scale Docking (LSD) directions.
3. Gather Scores for the Seed Set
Once docking is complete, run the following from the directory one level above your docking output (MOLECULES_DIR_TO_BIND).
22B library:
python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/get_scores.py 0
13B library:
python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/get_scores_13B.py 0 MOL
Verify that scores_round_0.txt was correctly written:
wc -l scores_round_0.txt
4. Convert Scores to .npy Files
Convert scores to ChemSTEP-readable .npy files:
python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/convert_scores_to_npy.py 0 <mol_id_prefix>
The mol_id_prefix should match the library:
| Library | Prefix |
|---|---|
| 22B / 72B | CSLB
|
| 13B | MOL
|
5. Set Up the ChemSTEP Run Directory
Create and enter a new run directory, then copy in the necessary files:
mkdir chemstep_run cd chemstep_run chemstep-run-new
This will populate the directory with params.txt, run_chemstep.py, and launch_chemstep_as_job.sh.
Optional: Integrated IFP
If running with integrated IFP for beacon selection, also run:
chemstep-run-ifp
This copies in the additional files ifp_acceptance_criteria.txt and interactions.txt.
6. Edit params.txt
Add the absolute paths to the ChemSTEP-readable score and indices .npy arrays generated in Step 4.
seed_indices_file: /path/to/your/indices_round_0.npy seed_scores_file: /path/to/your/scores_round_0.npy hit_pprop: 5.5 n_docked_per_round: 2000000 bundle_size: 1000 max_beacons: 100 max_n_rounds: 250
Parameter Reference
| Parameter | Description |
|---|---|
hit_pprop |
Defines a "virtual hit." pProp = −log(rank%) within the total library score distribution. E.g., pProp 4 in 13B space ≈ top 0.01% (~1.3M molecules); pProp 5 ≈ 0.001% (~132K). The seed set should contain at least 10(pProp+2) molecules. |
n_docked_per_round |
Number of molecules prioritized per round. All must be built and docked between rounds. Too many slows throughput and may reduce diversity; too few slows virtual hit recovery. |
max_beacons |
Diverse, well-scoring molecules used to guide prioritization. All molecules above the pProp threshold are candidates. Too many reduces inter-beacon diversity; too few hinders space exploration. Fewer beacons than specified may be assigned if insufficient molecules clear the threshold. |
bundle_size |
In auto docking mode, number of molecules submitted as a single build job. |
max_n_rounds |
No adjustment needed when running ChemSTEP prospectively as described here. |
7. Edit run_chemstep.py
Note: All paths must be absolute paths.
Required Settings
Set lib_path to the library pickle for your library:
| Library | Path |
|---|---|
| 13B | /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/13B/boltz_fplib.pickle
|
| 22B | /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/22B/22B_fplib.pickle
|
lib_path = '/full/path/to/library.pickle'
Set dockfiles_path:
dockfiles_path="/full/path/to/dockfiles"
Optional: minTD Exclusion Zone
Molecules will not be prioritized from within a specified Tanimoto distance of beacons. Comment in the relevant lines and update the value. Consider also setting enforce_n_docked_per_round = True when using this option:
min_td_search=0.5, enforce_n_docked_per_round=True,
Optional: Integrated IFP
Only selects beacons that satisfy user-defined interaction criteria. Comment in the following lines and update the paths to the necessary files (copied in Step 5 if you ran chemstep-run-ifp):
use_IFP=True, ifp_pdb_path='/full/path/to/rec.crg.pdb', interactions_file='/full/path/to/interactions.txt', ifp_acceptance_criteria_file='/full/path/to/ifp_acceptance_criteria.txt',
interactions.txt — one interaction per line, comma-separated. Format: interaction_type, residue_name_and_number. Example:
Hydrogen bond, GLY19 Ionic, ASP149
Supported interaction types include: Proximal, Hydrogen bond, Ionic, Cation-pi, Hydrophobic, Halogen bond, and others. See LUNA and IFP documentation for the full list.
ifp_acceptance_criteria.txt — defines the number of unsatisfied donors/acceptors/specific interactions required for a molecule to pass IFP and be considered for beacon selection. Example:
#_donors #_acceptors #_unstatisfied_donors == 0 #_unstatisfied_acceptors <= 4 Ionic/ASP-149 > 0
Example: AmpC on 22B with minTD=0.50, No IFP
lib_path = '/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/22B/22B_fplib.pickle'
lib = load_library_from_pickle(lib_path)
algo = CSAlgo(lib, 'params.txt', 'output', 16, verbose=True,
scheduler='sge', smi_id_prefix='CSLB',
python_exec="/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/bin/python",
dockfiles_path="/wynton/group/bks/work/kholland/chemstep_ampc_22B/seed_docking/dockfiles",
min_td_search=0.5,
enforce_n_docked_per_round=True,
#use_IFP=True,
#ifp_pdb_path='/path/to/your/reference/rec.crg.pdb',
#interactions_file='/path/to/your/interactions.txt',
#ifp_acceptance_criteria_file='/path/to/your/ifp_acceptance_criteria.txt',
docking_method="auto", track_beacon_orig=True)
8. Launch the Job
Submit the main ChemSTEP job:
qsub launch_chemstep_as_job.sh
9. Monitor Job Status
Check job status with qstat. The main job will run for up to 2 weeks given no errors. ChemSTEP will launch search, building, and docking jobs in successive rounds.
Note: If any building or docking subjobs hang, the main job will not proceed until those are canceled or finished. Monitor job statuses regularly and occasionally verify that docking output files (scores_round_*.txt) are being populated.
10. View Beacon SMILES and IDs
From the ChemSTEP running directory, run the following in a screen session on a dev node:
python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/get_beacon_smiles.py /path/to/library/pickle chemstep_algo.log
Use the library pickle path from Step 7.
11. Get Poses After Docking
Make a list of test.mol2.gz.0 files from docking:
find /round_*_docking/bundle_paths -maxdepth 2 -name "test.mol2.gz.0" > docked_poses.txt
Then extract top poses:
python /wynton/group/bks/work/bwhall61/for_beau/top_poses.py \
-t <pProp_threshold> \
-s <num_poses_per_file> \
-dock_results_path docked_poses.txt