<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://wiki.docking.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Kholland</id>
	<title>DISI - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://wiki.docking.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Kholland"/>
	<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Special:Contributions/Kholland"/>
	<updated>2026-04-08T09:14:20Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17120</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17120"/>
		<updated>2026-03-16T16:53:01Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Last update: Dec 16 2025 (Katie). Current version: 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using a drug-like subset of ZINC (22B) or the 13.3B library from Enamine REAL. &#039;&#039;&#039;For detailed instructions on the 13B space, see the &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed the &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (the negative log10 of a molecule&#039;s rank fraction) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization: the molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach the desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Running ChemSTEP (Auto DOCK and Build) =&lt;br /&gt;
&lt;br /&gt;
ChemSTEP is configured to run on Wynton with the &#039;&#039;&#039;13B&#039;&#039;&#039; and &#039;&#039;&#039;22B&#039;&#039;&#039; libraries. This page covers the full workflow for running ChemSTEP with automatic submission of docking and building jobs.&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&lt;br /&gt;
== 1. Source Environment ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
source /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/bin/activate&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== 2. Dock the Seed Set ==&lt;br /&gt;
&lt;br /&gt;
Copy the &amp;lt;code&amp;gt;.sdi&amp;lt;/code&amp;gt; file for the library you want to use:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Library !! Path&lt;br /&gt;
|-&lt;br /&gt;
| 13B || &amp;lt;code&amp;gt;/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/13B/13M_seeds.sdi&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| 22B || &amp;lt;code&amp;gt;/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/22B/22M_seeds.sdi&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Then dock the seed set. See the &#039;&#039;&#039;Large-Scale Docking (LSD)&#039;&#039;&#039; directions.&lt;br /&gt;
&lt;br /&gt;
== 3. Gather Scores for the Seed Set ==&lt;br /&gt;
&lt;br /&gt;
Once docking is complete, run the following from the directory &#039;&#039;&#039;one level above&#039;&#039;&#039; your docking output (&amp;lt;code&amp;gt;MOLECULES_DIR_TO_BIND&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;22B library:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/get_scores.py 0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;13B library:&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/get_scores_13B.py 0 MOL&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{{Note|You must specify the molecule ID prefix for the 13B library (&amp;lt;code&amp;gt;MOL&amp;lt;/code&amp;gt;).}}&lt;br /&gt;
&lt;br /&gt;
Verify that &amp;lt;code&amp;gt;scores_round_0.txt&amp;lt;/code&amp;gt; was correctly written:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
wc -l scores_round_0.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== 4. Convert Scores to .npy Files ==&lt;br /&gt;
&lt;br /&gt;
Convert scores to ChemSTEP-readable &amp;lt;code&amp;gt;.npy&amp;lt;/code&amp;gt; files:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/convert_scores_to_npy.py 0 &amp;lt;mol_id_prefix&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;mol_id_prefix&amp;lt;/code&amp;gt; should match the library:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Library !! Prefix&lt;br /&gt;
|-&lt;br /&gt;
| 22B / 72B || &amp;lt;code&amp;gt;CSLB&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| 13B || &amp;lt;code&amp;gt;MOL&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== 5. Set Up the ChemSTEP Run Directory ==&lt;br /&gt;
&lt;br /&gt;
Create and enter a new run directory, then copy in the necessary files:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir chemstep_run&lt;br /&gt;
cd chemstep_run&lt;br /&gt;
chemstep-run-new&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will populate the directory with &amp;lt;code&amp;gt;params.txt&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;run_chemstep.py&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;launch_chemstep_as_job.sh&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Optional: Integrated IFP ===&lt;br /&gt;
&lt;br /&gt;
If running with integrated IFP for beacon selection, also run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chemstep-run-ifp&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This copies in the additional files &amp;lt;code&amp;gt;ifp_acceptance_criteria.txt&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;interactions.txt&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== 6. Edit params.txt ==&lt;br /&gt;
&lt;br /&gt;
Add the absolute paths to the ChemSTEP-readable score and indices &amp;lt;code&amp;gt;.npy&amp;lt;/code&amp;gt; arrays generated in Step 4.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
seed_indices_file:  /path/to/your/indices_round_0.npy&lt;br /&gt;
seed_scores_file:   /path/to/your/scores_round_0.npy&lt;br /&gt;
hit_pprop:          5.5&lt;br /&gt;
n_docked_per_round: 2000000&lt;br /&gt;
bundle_size:        1000&lt;br /&gt;
max_beacons:        100&lt;br /&gt;
max_n_rounds:       250&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Parameter Reference ===&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Parameter !! Description&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;hit_pprop&amp;lt;/code&amp;gt; || Defines a &amp;quot;virtual hit.&amp;quot; pProp = −log10(rank / library size), i.e. the negative log of the rank fraction within the total library score distribution. E.g., pProp 4 in the 13B space ≈ top 0.01% (~1.3M molecules); pProp 5 ≈ top 0.001% (~132K). The seed set should contain at least 10&amp;lt;sup&amp;gt;(pProp+2)&amp;lt;/sup&amp;gt; molecules.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;n_docked_per_round&amp;lt;/code&amp;gt; || Number of molecules prioritized per round. All must be built and docked between rounds. Too many slows throughput and may reduce diversity; too few slows virtual hit recovery.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;max_beacons&amp;lt;/code&amp;gt; || Diverse, well-scoring molecules used to guide prioritization. All molecules above the pProp threshold are candidates. Too many reduces inter-beacon diversity; too few hinders space exploration. Fewer beacons than specified may be assigned if insufficient molecules clear the threshold.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;bundle_size&amp;lt;/code&amp;gt; || In auto docking mode, number of molecules submitted as a single build job.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;max_n_rounds&amp;lt;/code&amp;gt; || No adjustment needed when running ChemSTEP prospectively as described here.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== 7. Edit run_chemstep.py ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; All paths must be &#039;&#039;&#039;absolute paths&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Required Settings ===&lt;br /&gt;
&lt;br /&gt;
Set &amp;lt;code&amp;gt;lib_path&amp;lt;/code&amp;gt; to the library pickle for your library:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Library !! Path&lt;br /&gt;
|-&lt;br /&gt;
| 13B || &amp;lt;code&amp;gt;/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/13B/boltz_fplib.pickle&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| 22B || &amp;lt;code&amp;gt;/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/22B/22B_fplib.pickle&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lib_path = &#039;/full/path/to/library.pickle&#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Set &amp;lt;code&amp;gt;dockfiles_path&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
dockfiles_path=&amp;quot;/full/path/to/dockfiles&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Optional: minTD Exclusion Zone ===&lt;br /&gt;
&lt;br /&gt;
With this option, molecules within the specified Tanimoto distance of a beacon are not prioritized. Uncomment the relevant lines and update the value. Consider also setting &amp;lt;code&amp;gt;enforce_n_docked_per_round = True&amp;lt;/code&amp;gt; when using this option:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
min_td_search=0.5,&lt;br /&gt;
enforce_n_docked_per_round=True,&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Optional: Integrated IFP ===&lt;br /&gt;
&lt;br /&gt;
With this option, ChemSTEP only selects beacons that satisfy user-defined interaction criteria. Uncomment the following lines and update the paths to the necessary files (copied in Step 5 if you ran &amp;lt;code&amp;gt;chemstep-run-ifp&amp;lt;/code&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
use_IFP=True,&lt;br /&gt;
ifp_pdb_path=&#039;/full/path/to/rec.crg.pdb&#039;,&lt;br /&gt;
interactions_file=&#039;/full/path/to/interactions.txt&#039;,&lt;br /&gt;
ifp_acceptance_criteria_file=&#039;/full/path/to/ifp_acceptance_criteria.txt&#039;,&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;code&amp;gt;interactions.txt&amp;lt;/code&amp;gt;&#039;&#039;&#039; — one interaction per line, comma-separated. Format: &amp;lt;code&amp;gt;interaction_type, residue_name_and_number&amp;lt;/code&amp;gt;. Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Hydrogen bond, GLY19&lt;br /&gt;
Ionic, ASP149&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Supported interaction types include: Proximal, Hydrogen bond, Ionic, Cation-pi, Hydrophobic, Halogen bond, and others. See LUNA and IFP documentation for the full list.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;code&amp;gt;ifp_acceptance_criteria.txt&amp;lt;/code&amp;gt;&#039;&#039;&#039; — defines the number of unsatisfied donors/acceptors/specific interactions required for a molecule to pass IFP and be considered for beacon selection. Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#_donors&lt;br /&gt;
#_acceptors&lt;br /&gt;
#_unstatisfied_donors == 0&lt;br /&gt;
#_unstatisfied_acceptors &amp;lt;= 4&lt;br /&gt;
Ionic/ASP-149 &amp;gt; 0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Example: AmpC on 22B with minTD=0.50, No IFP ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
lib_path = &#039;/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/libraries/22B/22B_fplib.pickle&#039;&lt;br /&gt;
lib = load_library_from_pickle(lib_path)&lt;br /&gt;
algo = CSAlgo(lib, &#039;params.txt&#039;, &#039;output&#039;, 16, verbose=True,&lt;br /&gt;
    scheduler=&#039;sge&#039;, smi_id_prefix=&#039;CSLB&#039;,&lt;br /&gt;
    python_exec=&amp;quot;/wynton/group/bks/work/shared/kholland/chemstep_auto_v02/bin/python&amp;quot;,&lt;br /&gt;
    dockfiles_path=&amp;quot;/wynton/group/bks/work/kholland/chemstep_ampc_22B/seed_docking/dockfiles&amp;quot;,&lt;br /&gt;
    min_td_search=0.5,&lt;br /&gt;
    enforce_n_docked_per_round=True,&lt;br /&gt;
    #use_IFP=True,&lt;br /&gt;
    #ifp_pdb_path=&#039;/path/to/your/reference/rec.crg.pdb&#039;,&lt;br /&gt;
    #interactions_file=&#039;/path/to/your/interactions.txt&#039;,&lt;br /&gt;
    #ifp_acceptance_criteria_file=&#039;/path/to/your/ifp_acceptance_criteria.txt&#039;,&lt;br /&gt;
    docking_method=&amp;quot;auto&amp;quot;, track_beacon_orig=True)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== 8. Launch the Job ==&lt;br /&gt;
&lt;br /&gt;
Submit the main ChemSTEP job:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
qsub launch_chemstep_as_job.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== 9. Monitor Job Status ==&lt;br /&gt;
&lt;br /&gt;
Check job status with &amp;lt;code&amp;gt;qstat&amp;lt;/code&amp;gt;. The main job will run for up to &#039;&#039;&#039;2 weeks&#039;&#039;&#039; given no errors. ChemSTEP will launch search, building, and docking jobs in successive rounds.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; If any building or docking subjobs hang, the main job will not proceed until those are canceled or finished. Monitor job statuses regularly and occasionally verify that docking output files (&amp;lt;code&amp;gt;scores_round_*.txt&amp;lt;/code&amp;gt;) are being populated.&lt;br /&gt;
&lt;br /&gt;
== 10. View Beacon SMILES and IDs ==&lt;br /&gt;
&lt;br /&gt;
From the ChemSTEP running directory, run the following in a &#039;&#039;&#039;screen session on a dev node&#039;&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
python /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/scripts/get_beacon_smiles.py /path/to/library/pickle chemstep_algo.log&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Use the library pickle path from [[#7. Edit run_chemstep.py|Step 7]].&lt;br /&gt;
&lt;br /&gt;
== 11. Get Poses After Docking ==&lt;br /&gt;
&lt;br /&gt;
Make a list of &amp;lt;code&amp;gt;test.mol2.gz.0&amp;lt;/code&amp;gt; files from docking:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
find /round_*_docking/bundle_paths -maxdepth 2 -name &amp;quot;test.mol2.gz.0&amp;quot; &amp;gt; docked_poses.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then extract top poses:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
python /wynton/group/bks/work/bwhall61/for_beau/top_poses.py \&lt;br /&gt;
    -t &amp;lt;pProp_threshold&amp;gt; \&lt;br /&gt;
    -s &amp;lt;num_poses_per_file&amp;gt; \&lt;br /&gt;
    -dock_results_path docked_poses.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17056</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17056"/>
		<updated>2026-02-09T20:29:14Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Last update: Dec 16 2025 (Katie). Current version: 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the ZINC library (96B) or the 13.3B library from Enamine REAL. &#039;&#039;&#039;For detailed instructions on the 13B space, see the &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed the &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (the negative log10 of a molecule&#039;s rank fraction) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization: the molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach the desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in roughly three steps: (1) submission of the main ChemSTEP job, (2) the ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. 
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons, i.e. a greedy max-diversity selection. 
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
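
The greedy selection described above can be sketched in a few lines of Python. This is an illustration only, not ChemSTEP's implementation: the function names, the representation of fingerprints as sets of on-bit indices, and the max-min reading of "highest Td from beacon 1 and beacon 2" (each new beacon maximizes its distance to the nearest existing beacon) are all assumptions.

```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance between two fingerprints given as sets of on-bit indices."""
    union = len(fp_a.union(fp_b))
    if union == 0:
        return 0.0
    return 1.0 - len(fp_a.intersection(fp_b)) / union


def pick_beacons(candidates, scores, max_beacons, previous_beacons=()):
    """Greedy max-min diversity selection (hypothetical helper, not ChemSTEP code).

    candidates: dict of molecule ID to fingerprint, already filtered to
        molecules that score above the pProp threshold.
    scores: dict of molecule ID to DOCK score (more negative is better).
    previous_beacons: fingerprints of beacons assigned in earlier rounds.
    """
    remaining = dict(candidates)
    reference = list(previous_beacons)
    chosen = []
    # First round: seed the selection with the best-scoring molecule.
    if not reference and remaining and max_beacons > 0:
        first = min(remaining, key=lambda mol: scores[mol])
        chosen.append(first)
        reference.append(remaining.pop(first))
    while remaining and max_beacons > len(chosen):
        # Pick the candidate whose nearest existing beacon is farthest away.
        best = max(
            remaining,
            key=lambda mol: min(
                tanimoto_distance(remaining[mol], ref) for ref in reference
            ),
        )
        chosen.append(best)
        reference.append(remaining.pop(best))
    return chosen
```

Passing the fingerprints of earlier beacons as previous_beacons mirrors the statement that later rounds select for diversity from all previously assigned beacons. If fewer candidates clear the threshold than max_beacons, the sketch simply returns fewer beacons.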
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algorithm selects N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if the round size is 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance from the most dissimilar prioritized molecule to its nearest beacon in that round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi 
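
As a rough illustration of this selection step, the sketch below picks the molecules with the smallest min-Td and reports the max-minTd of the picked set. It is not ChemSTEP's code: the function name and the in-memory dict are assumptions (the real run reads distances from the mintddistrib files), and the optional exclusion argument follows the min_td_search description in step 7.

```python
import heapq


def prioritize(min_td, n_docked_per_round, min_td_search=None):
    """Return (prioritized IDs, max-minTd) for one round (hypothetical helper).

    min_td: dict of molecule ID to its Tanimoto distance to the nearest beacon.
    min_td_search: if set, only molecules farther than this from their
        nearest beacon are eligible.
    """
    eligible = (
        (td, mol)
        for mol, td in min_td.items()
        if min_td_search is None or td > min_td_search
    )
    # nsmallest returns the picked (distance, ID) pairs in ascending order.
    picked = heapq.nsmallest(n_docked_per_round, eligible)
    ids = [mol for td, mol in picked]
    max_min_td = picked[-1][0] if picked else None
    return ids, max_min_td
```

With an exclusion zone set, fewer molecules may be eligible than the requested round size, in which case the sketch returns all eligible molecules.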
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; 
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_auto_v02/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 92 million molecules sampled randomly from the total virtual library, currently 96 billion molecules (ZINC).&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/96M_seeds.wynton.sdi .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/rebuilt_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
Because the seed set is large, the docking must be done in two separate chunks (100k jobs max per docking array), and the scores combined into one file later on. 

If you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. 
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock the seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; (docking directions taken from docs.docking.org). This is meant to be done as you would do a normal LSD. 
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, extract all molecule IDs and corresponding DOCK scores from the docking output. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen session):

ZINC library:
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/get_scores.py .
     python get_scores.py 0

13B library:
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .
     python get_scores.py 0
&lt;br /&gt;
&lt;br /&gt;
If docking the seeds from the ZINC library, combine the score files from both SDI files into a single file and name it &#039;scores_round_0.txt&#039;.

This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your docking output follows a different layout, edit get_scores.py and change the path. Successful output of the script will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line. For example, when docking the first round of prioritized molecules, pass 1 to produce scores_round_1.txt. 
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv&#039;&#039;&#039;

ZINC library:
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/convert_scores_to_npy.py .
     python convert_scores_to_npy.py 0

13B library:
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .
     python convert_scores_to_npy.py 0

For speed, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we take the readable scores file and convert it for ChemSTEP. The script expects a txt file with molecule IDs and DOCK scores (no header). Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from the molecule ID) and their respective docking scores. 
&lt;br /&gt;
&#039;&#039;&#039;5. Enter or make a directory to run ChemSTEP in. Copy in the files needed to initiate ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Also copy in your score and indices NumPy files, which should be in your docking directory.&#039;&#039;&#039;

ZINC library:
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/params.txt .
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/run_chemstep.py .
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/launch_chemstep_as_job.sh .

13B library:
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. 
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as -log10 of a molecule&#039;s rank fraction (rank divided by library size) within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score in kcal/mol associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing the desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp + 2).
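
The arithmetic behind these numbers can be checked with a short Python sketch. It assumes pProp is the negative base-10 logarithm of the rank fraction, which is the reading consistent with the figures quoted above (pProp 4 = top 0.01% = 110 million of 1.1 trillion). The function names are hypothetical and are not part of ChemSTEP.

```python
import math


def pprop_from_rank(rank, library_size):
    """pProp of the molecule at a given rank (1 = best score) in the library."""
    return -math.log10(rank / library_size)


def n_virtual_hits(hit_pprop, library_size):
    """Expected number of library molecules at or above the pProp threshold."""
    return round(library_size * 10 ** (-hit_pprop))


def min_seed_size(hit_pprop):
    """Suggested minimum seed-set size, 10^(pProp + 2)."""
    return math.ceil(10 ** (hit_pprop + 2))
```

For instance, min_seed_size(5) gives 10 million, so a hit_pprop of 5 calls for a seed set of at least 10 million docked molecules under the suggestion above.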
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the desired number of molecules to prioritize per round. When choosing this value, note that this number of molecules must be built and docked in between each round of ChemSTEP. Prioritizing many molecules will slow building and docking, and, coupled with few beacons (below), may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;bundle_size:&#039;&#039;&#039; if using auto docking mode, the bundle size is used for building. i.e. the number of molecules that get submitted to build as one job. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Edit run_chemstep.py&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
There are several configurable parameters that can be passed when first calling the CSAlgo object:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;docking_method=&#039;&#039;&#039; setting this to &amp;quot;auto&amp;quot; will set up and run iterative rounds of building, docking, and running ChemSTEP. If you choose the auto method, be sure to update the full path to your dockfiles in run_chemstep.py and check that all INDOCK parameters are set to your liking. NOTE: the docking method is already set to auto if you copy in the scripts as outlined above. Change it if desired.
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;min_td_search=&#039;&#039;&#039; setting this parameter establishes a minimum Tanimoto distance &amp;quot;zone&amp;quot; around each beacon for prioritization. For example, setting min_td_search=0.3 requires that each prioritized molecule be more than 0.3 Td from each beacon. 

More optional parameters to come.

At this time, we strongly suggest running ChemSTEP as a job array with 8 or 16 CPU slots requested. The number of cores must be specified both when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). 
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below for running in MANUAL mode: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock the bundles. You can write the SDI file with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the score threshold for your pProp, as calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;11. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2]; increase it by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
to get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17031</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17031"/>
		<updated>2026-01-12T22:43:34Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: dec 16 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of ZINC library (96B) or 13.3B library from Enamine REAL. &#039;&#039;&#039;for detailed instructions on the 13B space, see &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
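&lt;br /&gt;
The beacon assignment described above can be sketched in a few lines of Python. This is only an illustration, not the ChemSTEP source: pick_beacons and tanimoto_distance are hypothetical names, fingerprints are represented as plain sets of on-bits rather than real ECFP4 vectors, and it assumes that a more negative DOCK score is better and that &quot;highest Td from beacon 1 and beacon 2&quot; means maximizing the distance to the nearest existing beacon. &lt;br /&gt;

```python
from operator import le


def tanimoto_distance(a, b):
    """Tanimoto distance between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a.intersection(b)) / union


def pick_beacons(fps, scores, score_cutoff, max_beacons, previous_fps=()):
    """Greedy max-min selection of beacon indices (hypothetical helper)."""
    # Assumption: candidates score at or below the cutoff (more negative = better).
    candidates = [i for i, s in enumerate(scores) if le(s, score_cutoff)]
    # Distance from each candidate to its nearest already-assigned beacon.
    nearest = {
        i: min((tanimoto_distance(fps[i], p) for p in previous_fps), default=1.0)
        for i in candidates
    }
    chosen = []
    for _ in range(min(max_beacons, len(candidates))):
        if not chosen and not previous_fps:
            # First round: the best-scoring molecule becomes the first beacon.
            pick = min(nearest, key=lambda i: scores[i])
        else:
            # Otherwise take the candidate farthest from its nearest beacon.
            pick = max(nearest, key=nearest.get)
        chosen.append(pick)
        del nearest[pick]
        # Update every remaining candidate with its distance to the new beacon.
        for i in nearest:
            nearest[i] = min(nearest[i], tanimoto_distance(fps[i], fps[pick]))
    return chosen


# Toy data: four molecules, the last one scores too poorly to be a beacon.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
scores = [-50.0, -45.0, -40.0, -10.0]
beacons = pick_beacons(fps, scores, score_cutoff=-30.0, max_beacons=2)
print(beacons)
```

In the toy data, molecule 0 is chosen first because it scores best, and molecule 2 second because it shares no bits with molecule 0. Passing beacons from earlier rounds as previous_fps mimics the iterative case, where diversity is measured against all previously assigned beacons. &lt;br /&gt;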
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algo will select N number of molecules for prioritization, the number of which is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. I.e. if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance of the most dissimilar molecule prioritized to any one beacon per round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
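&lt;br /&gt;
The selection rule above (take the round-size molecules with the smallest min-Td, and report the largest of those distances as the max-minTd) can be illustrated with a small Python sketch. The prioritize function is a hypothetical name for illustration only; the real algorithm works on the mintddistrib_*.npy files across the whole library rather than on an in-memory list. &lt;br /&gt;

```python
def prioritize(min_td, round_size):
    """Pick the round_size molecules closest to any beacon (hypothetical helper).

    min_td[i] is the Tanimoto distance from molecule i to its nearest beacon.
    Returns the picked indices and the max-minTd of the picked set.
    """
    # Sort molecule indices by their distance to the nearest beacon.
    order = sorted(range(len(min_td)), key=lambda i: min_td[i])
    picked = order[:round_size]
    # max-minTd: distance of the most dissimilar molecule that was prioritized.
    max_min_td = min_td[picked[-1]] if picked else None
    return picked, max_min_td


picked, max_min_td = prioritize([0.9, 0.1, 0.5, 0.3], round_size=2)
print(picked, max_min_td)
```

A full sort is used here for readability; at library scale a partial selection would be the more natural choice. &lt;br /&gt;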
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_auto/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 92 million molecules sampled randomly from the total virtual library, currently 96 billion molecules (ZINC).&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/96M_seeds.wynton.sdi .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/rebuilt_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
Because the seed set is large, docking must be done in two separate chunks (100k jobs max per docking array), with the scores combined into one file later on. &lt;br /&gt;
&lt;br /&gt;
If you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If docking the seeds from the ZINC library, please combine your score.txt files from both SDI files. Rename this combined file &#039;scores_round_0.txt.&#039;&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &quot;output*&quot; within the CWD. If your output from docking follows different organizational conventions, edit get_scores.py and change the path. On success, the script writes a file named &quot;scores_round_0.txt&quot;. For iterative rounds, pass the round number on the command line; e.g. after docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For speed, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we take the human-readable scores file and convert it for ChemSTEP. The script expects a txt file with molecule IDs and DOCK scores (no header). Use the same round number you used in step 3. As run above, this outputs two line-matched files: indices_round_0.npy (indices determined from the molecule IDs) and scores_round_0.npy (their respective docking scores). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The remaining values in params.txt are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &quot;virtual hit&quot; in your campaign. pProp is the -log10 of a molecule&#039;s rank within the total-library score distribution, with the rank expressed as a fraction of the library. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing the desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp +2).&lt;br /&gt;
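&lt;br /&gt;
The pProp arithmetic above can be checked with a short Python sketch. The function names are hypothetical and only restate the numbers in the text: a pProp threshold p corresponds to the top 10^-p fraction of the library, and the suggested minimum seed set size is 10^(p + 2). Under that suggestion, a random seed set would be expected to contain roughly 100 molecules above the threshold, since 10^(p + 2) x 10^-p = 100. &lt;br /&gt;

```python
def library_hits(library_size, pprop):
    """Number of virtual hits in the library for a given pProp threshold."""
    # pProp p corresponds to the top 10**(-p) fraction of the library.
    return library_size * 10.0 ** (-pprop)


def suggested_seed_size(pprop):
    """Suggested minimum seed set size from the text: 10**(pProp + 2)."""
    return 10 ** (pprop + 2)


for pprop in (4, 5):
    hits = round(library_hits(1.1e12, pprop))
    print(pprop, hits, suggested_seed_size(pprop))
```

For the 1.1T example this reproduces the 110 million and 11 million virtual hits quoted above, with suggested minimum seed sets of 1 million and 10 million molecules. &lt;br /&gt;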
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the number of molecules to prioritize per round. When choosing this value, note that this number of molecules must be built and docked in between each round of ChemSTEP. Prioritizing many molecules will slow building and docking, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;bundle_size:&#039;&#039;&#039; if using auto docking mode, the bundle size is used for building. i.e. the number of molecules that get submitted to build as one job. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Edit run_chemstep.py&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
There are several configurable parameters that can be passed when first calling the CSAlgo object:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;docking_method=&#039;&#039;&#039; setting this to &amp;quot;auto&amp;quot; will set up and run iterative rounds of building, docking, and running ChemSTEP. If you choose the auto method, be sure to update the full path to your dockfiles in run_chemstep.py and check that all INDOCK parameters are set to your liking. NOTE: docking method is set to auto if you copy in the scripts as outlined above. Change if desired!!!!!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;min_td_search=&#039;&#039;&#039; setting this parameter establishes a minimum Tanimoto distance &quot;zone&quot; around each beacon for prioritization. For example, setting min_td_search=0.3 requires that each prioritized molecule be more than 0.3 Td from each beacon. &lt;br /&gt;
&lt;br /&gt;
more optional parameters to come.&lt;br /&gt;
&lt;br /&gt;
At this time, we strongly suggest running ChemSTEP as a job array with 8 or 16 CPU slots requested. The number of cores must be specified both when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
A successful ChemSTEP run outputs a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (python errors, cluster issues, tracebacks if failed). Both files are updated with new information on each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below for running in MANUAL mode: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock the bundles. You can write the SDI file with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the score threshold for your pProp, as calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;11. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2]; increase it by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
to get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17021</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17021"/>
		<updated>2025-12-16T19:41:19Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: dec 16 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of ZINC library (96B) or 13.3B library from Enamine REAL. &#039;&#039;&#039;for detailed instructions on the 13B space, see &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algo will select N number of molecules for prioritization, the number of which is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. I.e. if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance of the most dissimilar molecule prioritized to any one beacon per round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_auto/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 92 million molecules sampled randomly from the total virtual library, currently 96 billion molecules (ZINC).&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/96M_seeds.wynton.sdi .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/rebuilt_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
Because the seed set is large, docking must be done in two separate chunks (100k jobs max per docking array), with the scores combined into one file later on. &lt;br /&gt;
&lt;br /&gt;
If you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If docking the seeds from the ZINC library, please combine your score.txt files from both SDI files. Rename this combined file &#039;scores_round_0.txt.&#039;&lt;br /&gt;
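&lt;br /&gt;
The combining step can be sketched in Python as below. This is a minimal sketch, not part of the ChemSTEP scripts: the function name and the two input file names in the usage comment are hypothetical, and it assumes each score file is plain text with one molecule per line and no header. &lt;br /&gt;
&lt;br /&gt;
```python
def combine_score_files(part_paths, combined_path):
    """Concatenate several score text files into one file.

    Assumes one "ID score" record per line and no header (an assumption
    based on the description of the scores file in this guide).
    Returns the number of records written.
    """
    n_written = 0
    with open(combined_path, "w") as out:
        for part in part_paths:
            with open(part) as handle:
                for line in handle:
                    # Skip blank lines and make sure every record ends in a newline.
                    if line.strip():
                        out.write(line.rstrip("\n") + "\n")
                        n_written += 1
    return n_written


# Hypothetical usage; the two input names are placeholders for your own files:
# combine_score_files(["scores_part_a.txt", "scores_part_b.txt"], "scores_round_0.txt")
```

The function only concatenates; it does not check for duplicate molecule IDs across the two docking runs. &lt;br /&gt;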
&lt;br /&gt;
This script expects a directory named &quot;output*&quot; within the CWD. If your docking output follows a different layout, edit the path in get_scores.py. On success, the script writes a file named &quot;scores_round_0.txt&quot;. For iterative rounds, pass increasing round numbers on the command line. For example, when docking the first round of prioritized molecules, pass 1 to produce scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take our readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores (no header). Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
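&lt;br /&gt;
Before submitting, it can help to sanity-check the file. The sketch below is a hypothetical helper, not part of ChemSTEP: it assumes params.txt holds one &quot;key: value&quot; pair per line, as in the listing above, and only checks that the two seed file entries are absolute paths. How ChemSTEP itself parses the file is not covered here. &lt;br /&gt;
&lt;br /&gt;
```python
from pathlib import Path


def read_params(text):
    """Parse "key: value" lines into a dict of strings (assumed file layout)."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        params[key.strip()] = value.strip()
    return params


def check_params(params):
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    for key in ("seed_indices_file", "seed_scores_file"):
        value = params.get(key, "")
        if not Path(value).is_absolute():
            problems.append(key + " should be an absolute path")
    return problems


# Hypothetical usage:
# problems = check_params(read_params(Path("params.txt").read_text()))
```

The check is deliberately narrow: it does not verify that the files exist or that the remaining values are sensible for your campaign. &lt;br /&gt;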
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value will define what is considered a &quot;virtual hit&quot; in your campaign. pProp is defined as the -log(rank %) of a molecule within the total library score-distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate a DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing your desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp + 2).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the desired number of molecules to prioritize per round. When choosing this value, note that this number of molecules must be built and docked between each round of ChemSTEP. Prioritizing many molecules will slow building and docking, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
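&lt;br /&gt;
The pProp arithmetic above can be checked with a few lines of Python. This is a minimal sketch with hypothetical function names; it assumes, consistent with the worked numbers (pProp 4 = top 0.01% = 110 million of 1.1T), that the hit zone is the top 10^(-pProp) fraction of the library. &lt;br /&gt;
&lt;br /&gt;
```python
def virtual_hits(library_size, pprop):
    """Number of molecules in the top 10**(-pprop) fraction of the library."""
    return round(library_size / 10 ** pprop)


def min_seed_set_size(pprop):
    """Suggested minimum seed-set size from the guideline 10^(pProp + 2)."""
    return 10 ** (pprop + 2)


library_size = 1_100_000_000_000  # the 1.1T space used as the example in the text
for pprop in (4, 5):
    print(pprop, virtual_hits(library_size, pprop), min_seed_set_size(pprop))
```

For pProp 5 the guideline gives a minimum seed set of 10 million molecules, which is why the smaller seed sets are better suited to lower pProp thresholds. &lt;br /&gt;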
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;bundle_size:&#039;&#039;&#039; if using auto docking mode, the bundle size is used for building. i.e. the number of molecules that get submitted to build as one job. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Edit run_chemstep.py&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
There are several configurable parameters that can be passed when first calling the CSAlgo object:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;docking_method=&#039;&#039;&#039; setting this to &amp;quot;auto&amp;quot; will set up and run iterative rounds of building, docking, and running ChemSTEP. If you choose the auto method, be sure to update the full path to your dockfiles in run_chemstep.py and check that all INDOCK parameters are set to your liking. NOTE: docking method is set to auto if you copy in the scripts as outlined above. Change if desired!!!!!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;min_td_search=&#039;&#039;&#039; setting this parameter will establish a minimum Tanimoto distance &quot;zone&quot; around each beacon for prioritization. For example, setting min_td_search=0.3 will require that each prioritized molecule be greater than 0.3 Td from each beacon. &lt;br /&gt;
&lt;br /&gt;
More optional parameters to come.&lt;br /&gt;
&lt;br /&gt;
At this time, we strongly suggest running ChemSTEP as a job array with 8 or 16 CPU slots requested. The number of cores must be specified both when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (python errors, cluster issues, tracebacks if failed). These files will be updated with any information with iterative rounds of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below for running in MANUAL mode: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, then dock. You can do this with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;11. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. The performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
To get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17020</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17020"/>
		<updated>2025-12-16T19:30:53Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: dec 16 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of ZINC library (96B) or 13.3B library from Enamine REAL. &#039;&#039;&#039;for detailed instructions on the 13B space, see &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons. This is a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
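&lt;br /&gt;
The greedy selection described above can be sketched as follows. This is an illustrative toy, not the ChemSTEP implementation: fingerprints are plain Python sets of on-bits rather than ECFP4 bit vectors, the function names are hypothetical, and &quot;highest Td from beacon 1 and beacon 2&quot; is read as the largest minimum distance to any already assigned beacon, which is an assumption. &lt;br /&gt;
&lt;br /&gt;
```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance, 1 - |A and B| / |A or B|, for fingerprints stored as sets."""
    union = fp_a.union(fp_b)
    if not union:
        return 0.0
    return 1.0 - len(fp_a.intersection(fp_b)) / len(union)


def pick_beacons(candidates, max_beacons):
    """Greedy max-min diversity selection.

    candidates maps mol_id to (dock_score, fingerprint_set) and is assumed to
    contain only molecules that already pass the pProp score threshold.
    """
    remaining = dict(candidates)
    if not remaining or max_beacons == 0:
        return []
    # First beacon: best-scoring molecule (DOCK scores: more negative is better).
    first = min(remaining, key=lambda mol_id: remaining[mol_id][0])
    beacons = [first]
    del remaining[first]
    for _ in range(max_beacons - 1):
        if not remaining:
            break  # fewer candidates than requested beacons

        def min_td_to_beacons(mol_id):
            fp = remaining[mol_id][1]
            return min(tanimoto_distance(fp, candidates[b][1]) for b in beacons)

        # Next beacon: the candidate farthest from its nearest existing beacon.
        next_beacon = max(remaining, key=min_td_to_beacons)
        beacons.append(next_beacon)
        del remaining[next_beacon]
    return beacons
```

In iterative rounds, the same loop would start with the beacons from all earlier rounds already in the list, so new beacons are diverse with respect to every previous one. &lt;br /&gt;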
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algorithm selects N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if the round size is 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &quot;max-minTd&quot; for each round, which is the Tanimoto distance of the most dissimilar prioritized molecule to its nearest beacon in that round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
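&lt;br /&gt;
The prioritization rule can be illustrated with a small in-memory example. This is a sketch under stated assumptions, not the ChemSTEP code: fingerprints are toy Python sets, the function names are hypothetical, and the real algorithm works from the min-Td values stored on disk by the chaining sub-jobs rather than recomputing distances. &lt;br /&gt;
&lt;br /&gt;
```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance for fingerprints stored as sets of on-bits."""
    union = fp_a.union(fp_b)
    if not union:
        return 0.0
    return 1.0 - len(fp_a.intersection(fp_b)) / len(union)


def prioritize(library, beacon_fps, round_size):
    """Pick the round_size molecules with the smallest min-Td to any beacon.

    library maps mol_id to a fingerprint set; beacon_fps is a list of
    fingerprint sets. Returns (chosen_ids, max_min_td), where max_min_td is
    the largest min-Td among the chosen molecules.
    """
    min_td = {
        mol_id: min(tanimoto_distance(fp, beacon) for beacon in beacon_fps)
        for mol_id, fp in library.items()
    }
    # Sort by distance, breaking ties by ID so the result is deterministic.
    ranked = sorted(min_td, key=lambda mol_id: (min_td[mol_id], mol_id))
    chosen = ranked[:round_size]
    max_min_td = max(min_td[m] for m in chosen) if chosen else None
    return chosen, max_min_td
```

The second return value corresponds to the max-minTd figure reported per round: it tells you how far from the beacons the current round had to reach to fill its quota. &lt;br /&gt;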
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_auto/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 92 million molecules sampled randomly from the total virtual library, currently 96 billion molecules (ZINC).&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/96M_seeds.wynton.sdi .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/rebuilt_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
Because the seed set is large, the docking must be done in two separate chunks (100k jobs max per docking array), and the scores combined into one file later on. &lt;br /&gt;
&lt;br /&gt;
If you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If docking the seeds from the ZINC library, please combine your score.txt files from both SDI files. Rename this combined file &#039;scores_round_0.txt.&#039;&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &quot;output*&quot; within the CWD. If your docking output follows a different layout, edit the path in get_scores.py. On success, the script writes a file named &quot;scores_round_0.txt&quot;. For iterative rounds, pass increasing round numbers on the command line. For example, when docking the first round of prioritized molecules, pass 1 to produce scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take our readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores (no header). Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value will define what is considered a &quot;virtual hit&quot; in your campaign. pProp is defined as the -log(rank %) of a molecule within the total library score-distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate a DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing your desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp + 2).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the desired number of molecules to prioritize per round. When choosing this value, note that this number of molecules must be built and docked between each round of ChemSTEP. Prioritizing many molecules will slow building and docking, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;bundle_size:&#039;&#039;&#039; if using auto docking mode, the bundle size is used for building. i.e. the number of molecules that get submitted to build as one job. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Edit run_chemstep.py&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
There are several configurable parameters that can be passed when first calling the CSAlgo object:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;docking_method=&#039;&#039;&#039; setting this to &amp;quot;auto&amp;quot; will set up and run iterative rounds of building, docking, and running ChemSTEP. If you choose the auto method, be sure to update the full path to your dockfiles in run_chemstep.py and check that all INDOCK parameters are set to your liking. NOTE: docking method is set to auto if you copy in the scripts as outlined above. Change if desired!!!!!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;min_td_search=&#039;&#039;&#039; setting this parameter will establish a minimum Tanimoto distance &quot;zone&quot; around each beacon for prioritization. For example, setting min_td_search=0.3 will require that each prioritized molecule be greater than 0.3 Td from each beacon. &lt;br /&gt;
&lt;br /&gt;
More optional parameters to come.&lt;br /&gt;
&lt;br /&gt;
At this time, we strongly suggest running ChemSTEP as a job array with 8 or 16 CPU slots requested. The number of cores must be specified both when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (python errors, cluster issues, tracebacks if failed). These files will be updated with any information with iterative rounds of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below for running in MANUAL mode: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle and dock. you can do this with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the score threshold for your pProp, as calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should now have scores_round_1.npy and indices_round_1.npy (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;11. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 9-11 for as many rounds as needed. The performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
to get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17019</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17019"/>
		<updated>2025-12-16T19:27:09Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: dec 16 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of ZINC library (96B) or 13.3B library from Enamine REAL. &#039;&#039;&#039;for detailed instructions on the 13B space, see &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
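&lt;br /&gt;
The greedy selection described above can be sketched roughly as follows. This is only an illustration, not ChemSTEP&#039;s actual code: fingerprints are modeled as plain Python sets of on-bit indices rather than real ECFP4 vectors, &amp;quot;most diverse&amp;quot; is interpreted as maximizing the minimum Td to the beacons already chosen, more negative DOCK scores are assumed to be better, and the names tanimoto_distance and pick_beacons are hypothetical. &lt;br /&gt;

```python
def tanimoto_distance(a, b):
    """Tanimoto distance (1 - similarity) between two sets of on-bit indices."""
    union = len(a.union(b))
    if union == 0:
        return 0.0
    return 1.0 - len(a.intersection(b)) / union


def pick_beacons(fps, scores, score_cutoff, max_beacons):
    """Greedy max-min diversity selection (hypothetical helper).

    fps: list of sets of on-bits, scores: list of DOCK scores (lower is better),
    score_cutoff: score equivalent to the pProp threshold.
    """
    # Only molecules scoring at or better than the cutoff may become beacons.
    candidates = [i for i, s in enumerate(scores) if score_cutoff >= s]
    if not candidates or max_beacons == 0:
        return []
    # The first beacon is the best-scoring candidate.
    first = min(candidates, key=lambda i: scores[i])
    beacons = [first]
    # min_td[i] is the distance from candidate i to its nearest chosen beacon.
    min_td = {i: tanimoto_distance(fps[i], fps[first])
              for i in candidates if i != first}
    while min_td and max_beacons > len(beacons):
        # Next beacon: the candidate farthest from all beacons chosen so far.
        nxt = max(min_td, key=min_td.get)
        beacons.append(nxt)
        del min_td[nxt]
        for i in min_td:
            min_td[i] = min(min_td[i], tanimoto_distance(fps[i], fps[nxt]))
    return beacons


# Toy data: molecule 3 scores too poorly to be a beacon candidate.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {5, 6}]
scores = [-40.0, -35.0, -33.0, -10.0, -32.0]
print(pick_beacons(fps, scores, -30.0, 3))  # prints [0, 2, 4]
```

To mimic iterative rounds, where diversity is measured against ALL previously assigned beacons, the min_td dictionary would be seeded with distances to those earlier beacons as well. &lt;br /&gt;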
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algo will select N number of molecules for prioritization, the number of which is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. I.e. if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance of the most dissimilar molecule prioritized to any one beacon per round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
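&lt;br /&gt;
As a rough sketch of the prioritization rule above: given each molecule&#039;s min-Td to its nearest beacon, take the round-size molecules with the smallest values, and report the largest of those as the &amp;quot;max-minTd&amp;quot;. The function name prioritize and the use of a plain list are assumptions for illustration; ChemSTEP itself reads these distances from the mintddistrib_*.npy files. &lt;br /&gt;

```python
import heapq


def prioritize(min_td, round_size):
    """Return (picked indices, max-minTd) for one round (hypothetical helper).

    min_td[i] is the Tanimoto distance from molecule i to its nearest beacon.
    """
    # nsmallest returns indices ordered from smallest to largest min-Td.
    picked = heapq.nsmallest(round_size, range(len(min_td)),
                             key=min_td.__getitem__)
    # The last pick is the most dissimilar molecule that still made the cut.
    max_min_td = min_td[picked[-1]] if picked else None
    return picked, max_min_td


picked, max_min_td = prioritize([0.9, 0.2, 0.5, 0.1, 0.7], round_size=3)
print(picked, max_min_td)  # prints [3, 1, 2] 0.5
```

A max-minTd that grows from round to round indicates that prioritization is reaching further away from the beacons to fill each round. &lt;br /&gt;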
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_auto/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/96M_seeds.wynton.sdi .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/rebuilt_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
because the seed set is large, the docking must be done in two separate chunks (100k jobs max per docking array), and the scores combined into one file later on. &lt;br /&gt;
&lt;br /&gt;
if you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If docking the seeds from the ZINC library, please combine your score.txt files from both SDI files. Rename this combined file &#039;scores_round_0.txt.&#039;&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your output from docking follows different organizational conventions, vim into get_scores.py and change the path. Successful output of the script will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers into the command line. i.e. When docking the first round of prioritized molecules, pass [1] for scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we are taking our readable scores file and converting it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores (no header). Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
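&lt;br /&gt;
Before converting, it can be worth checking that the scores file really is headerless and two-column. The sketch below is not part of the ChemSTEP scripts; it assumes whitespace-separated &amp;quot;molecule ID, score&amp;quot; lines, which may not match what get_scores.py writes, and check_scores_file is a hypothetical name. &lt;br /&gt;

```python
def check_scores_file(path):
    """Count valid "mol_id score" lines; raise ValueError on the first bad one.

    Assumes two whitespace-separated fields per line (an assumption, adjust
    the split if your scores file uses another delimiter).
    """
    n_valid = 0
    with open(path) as fh:
        for lineno, line in enumerate(fh, start=1):
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            if len(fields) != 2:
                raise ValueError(
                    f"line {lineno}: expected 2 fields, got {len(fields)}")
            try:
                float(fields[1])
            except ValueError:
                # A header such as "id score" ends up here.
                raise ValueError(
                    f"line {lineno}: score {fields[1]!r} is not a number"
                ) from None
            n_valid += 1
    return n_valid
```

Calling check_scores_file(&amp;quot;scores_round_0.txt&amp;quot;) returns the number of scored molecules, which should match the length of both .npy arrays after conversion. &lt;br /&gt;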
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log(rank %) of a molecule within the total library score-distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing the desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp +2).&lt;br /&gt;
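&lt;br /&gt;
The arithmetic above can be reproduced with a few lines of Python. The sketch takes the rank percentage as a fraction (top fraction = 10^-pProp), which is what the worked numbers in the text imply; the function names are hypothetical. &lt;br /&gt;

```python
def expected_virtual_hits(library_size, pprop):
    """Number of library molecules in the top 10**(-pProp) fraction."""
    return library_size * 10.0 ** (-pprop)


def suggested_min_seed_size(pprop):
    """Seed-set size suggested in the text: at least 10**(pProp + 2)."""
    return 10 ** (pprop + 2)


library_size = 1.1e12  # the 1.1T space quoted above
for pprop in (4, 5):
    hits = round(expected_virtual_hits(library_size, pprop))
    print(pprop, hits, suggested_min_seed_size(pprop))
# prints:
# 4 110000000 1000000
# 5 11000000 10000000
```

Under this reading, a seed set of 10^(pProp + 2) molecules is expected to contain about 100 virtual hits, since 10^(pProp + 2) x 10^-pProp = 100. &lt;br /&gt;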
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this number is the desired molecules for prioritization per round. When choosing this value, note that this number of molecules must be built and docked in between each round of ChemSTEP. Prioritizing many molecules will slow building and docking speeds, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;bundle_size:&#039;&#039;&#039; if using auto docking mode, the bundle size is used for building. i.e. the number of molecules that get submitted to build as one job. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Edit run_chemstep.py&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
There are several configurable parameters that can be passed when first calling the CSAlgo object:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;docking_method=&#039;&#039;&#039; setting this to &amp;quot;auto&amp;quot; will set up and run iterative rounds of building, docking, and running ChemSTEP. If you choose the auto method, be sure to update the full path to your dockfiles in run_chemstep.py and check that all INDOCK parameters are set to your liking. NOTE: docking method is set to auto if you copy in the scripts as outlined above. Change if desired!!!!!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;min_td_search=&#039;&#039;&#039; setting this parameter will establish a minimum-tanimoto distance &amp;quot;zone&amp;quot; around each beacon for prioritization. For example, setting a min_td_search=0.3 will require that each prioritized molecule be greater than 0.3 Td from each beacon. &lt;br /&gt;
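&lt;br /&gt;
One way to picture the effect of min_td_search is as a filter applied before the nearest molecules are picked. The sketch below is an assumption about the behavior described above, not ChemSTEP&#039;s implementation; because min_td holds the distance to the NEAREST beacon, requiring it to exceed the cutoff means the molecule is farther than the cutoff from every beacon. &lt;br /&gt;

```python
import heapq


def prioritize_outside_zone(min_td, round_size, min_td_search):
    """Pick the nearest molecules that lie outside the exclusion zone.

    min_td[i] is the distance from molecule i to its nearest beacon
    (hypothetical helper for illustration only).
    """
    # Keep only molecules farther than min_td_search from every beacon.
    eligible = [i for i, td in enumerate(min_td) if td > min_td_search]
    # Among those, prioritize the ones closest to a beacon.
    return heapq.nsmallest(round_size, eligible, key=min_td.__getitem__)


print(prioritize_outside_zone([0.9, 0.2, 0.5, 0.1, 0.7], 2, 0.3))  # prints [2, 4]
```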
&lt;br /&gt;
more optional parameters to come.&lt;br /&gt;
&lt;br /&gt;
At this time, we strongly suggest running ChemSTEP as a job array with 8 or 16 CPU slots requested. The number of cores must be specified both when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (python errors, cluster issues, tracebacks if failed). These files will be updated with any information with iterative rounds of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below for running in MANUAL mode: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle and dock. you can do this with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the score threshold for your pProp, as calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should now have scores_round_1.npy and indices_round_1.npy (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;11. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 9-11 for as many rounds as needed. The performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
to get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17018</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17018"/>
		<updated>2025-12-16T19:26:00Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: dec 16 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of ZINC library (96B) or 13.3B library from Enamine REAL. &#039;&#039;&#039;for detailed instructions on the 13B space, see &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons would be molecules that score well (above pProp thresh), and are as diverse as possible from already assigned beacons. Greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algo will select N number of molecules for prioritization, the number of which is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. I.e. if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance of the most dissimilar molecule prioritized to any one beacon per round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_auto/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/96M_seeds.wynton.sdi .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/rebuilt_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
because the seed set is large, the docking must be done in two separate chunks (100k jobs max per docking array), and the scores combined into one file later on. &lt;br /&gt;
&lt;br /&gt;
if you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If docking the seeds from the ZINC library, combine your score.txt files from both SDI files and rename the combined file &#039;scores_round_0.txt&#039;.&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &quot;output*&quot; within the CWD. If your docking output follows a different layout, edit get_scores.py and change the path. A successful run writes a file named &quot;scores_round_0.txt&quot;. For iterative rounds, pass the round number on the command line. For example, when docking the first round of prioritized molecules, pass 1 to write scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For speed, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we convert the readable scores file into that form. This script expects a txt file with molecule IDs and DOCK scores (no header). Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
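&lt;br /&gt;
Before converting, it can help to confirm that the scores file has the shape the converter expects. The sketch below is not part of ChemSTEP. It assumes whitespace-separated columns with the molecule ID first and the DOCK score second, which may not match your file, and the function name check_scores_file is hypothetical. &lt;br /&gt;

```python
def check_scores_file(path):
    """Count valid rows in a scores file, raising ValueError on a malformed one.

    Assumed layout (not confirmed by the ChemSTEP scripts): one molecule per
    line, whitespace-separated, molecule ID first and DOCK score second,
    with no header row.
    """
    n_rows = 0
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            if len(fields) != 2:
                raise ValueError(f"line {line_number}: expected 2 columns, got {len(fields)}")
            try:
                float(fields[1])
            except ValueError:
                # A header row such as "id score" ends up here.
                raise ValueError(f"line {line_number}: score {fields[1]!r} is not a number") from None
            n_rows += 1
    return n_rows
```

Calling check_scores_file(&quot;scores_round_0.txt&quot;) returns the number of scored molecules, which should match the number of entries you expect in the .npy files. &lt;br /&gt;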
&lt;br /&gt;
&#039;&#039;&#039;5. Enter or make a directory to run ChemSTEP in. Copy in the files needed to initiate ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices NumPy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The remaining values in params.txt are left to the discretion of the user, with some considerations below. &lt;br /&gt;
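&lt;br /&gt;
A quick sanity check of params.txt before submitting can catch a relative path or a missing value. This is a minimal sketch, assuming the file contains only &quot;key: value&quot; lines as shown above. The helper names read_params and check_params are hypothetical, and this is not how ChemSTEP itself reads the file. &lt;br /&gt;

```python
import os


def read_params(path):
    """Parse 'key: value' lines into a dict of strings."""
    params = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            key, _, value = line.partition(":")
            params[key.strip()] = value.strip()
    return params


def check_params(params):
    """Return a list of problems; an empty list means none were detected."""
    problems = []
    # The two seed files must be given as absolute paths.
    for key in ("seed_indices_file", "seed_scores_file"):
        value = params.get(key, "")
        if not os.path.isabs(value):
            problems.append(f"{key} should be an absolute path, got {value!r}")
    # The remaining keys shown on this page are numeric.
    for key in ("hit_pprop", "bundle_size", "n_docked_per_round", "max_beacons", "max_n_rounds"):
        try:
            float(params.get(key, ""))
        except ValueError:
            problems.append(f"{key} is missing or not a number")
    return problems
```

Running check_params(read_params(&quot;params.txt&quot;)) in the ChemSTEP directory returns a list of messages to fix. It does not check that the .npy files exist. &lt;br /&gt;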
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &quot;virtual hit&quot; in your campaign. pProp is the -log10 of a molecule&#039;s rank fraction (rank % / 100) within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing the desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp + 2).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between rounds of ChemSTEP. Prioritizing many molecules will slow building and docking, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;bundle_size:&#039;&#039;&#039; if using auto docking mode, the bundle size is used for building. i.e. the number of molecules that get submitted to build as one job. &lt;br /&gt;
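&lt;br /&gt;
The arithmetic behind these parameters can be sketched as follows. The function names are hypothetical and the code only restates the relationships described above (top 10^-pProp fraction of the library, seed set of at least 10^(pProp + 2), one build job per bundle). It is not taken from the ChemSTEP source. &lt;br /&gt;

```python
import math


def expected_virtual_hits(library_size, hit_pprop):
    """Number of molecules in the top 10**(-pProp) fraction of the library."""
    return library_size * 10 ** (-hit_pprop)


def suggested_min_seed_size(hit_pprop):
    """Suggested seed set size from this page: at least 10**(pProp + 2)."""
    return 10 ** (hit_pprop + 2)


def n_build_bundles(n_docked_per_round, bundle_size):
    """Number of build jobs when each job builds up to bundle_size molecules."""
    return math.ceil(n_docked_per_round / bundle_size)
```

With the default values shown above (hit_pprop 5, n_docked_per_round 10000000, bundle_size 1000), the suggested seed set is at least 10 million molecules and each round is built as 10000 bundles. &lt;br /&gt;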
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Edit run_chemstep.py&#039;&#039;&#039; note that the default parameters when copying over run_chemstep.py invoke the automatic building and docking pipeline. &lt;br /&gt;
&lt;br /&gt;
There are several configurable parameters that can be passed when first calling the CSAlgo object:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;docking_method=&#039;&#039;&#039; setting this to &amp;quot;auto&amp;quot; will set up and run iterative rounds of building, docking, and running ChemSTEP. If you choose the auto method, be sure to update the full path to your dockfiles in run_chemstep.py and check that all INDOCK parameters are set to your liking. NOTE: docking method is set to auto if you copy in the scripts as outlined above. Change if desired!!!!!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;min_td_search=&#039;&#039;&#039; setting this parameter will establish a minimum Tanimoto distance &quot;zone&quot; around each beacon for prioritization. For example, setting min_td_search=0.3 will require that each prioritized molecule be at a Td greater than 0.3 from each beacon. &lt;br /&gt;
&lt;br /&gt;
more optional parameters to come.&lt;br /&gt;
&lt;br /&gt;
At this time, we strongly suggest running ChemSTEP as a job array with 16 or 32 CPU slots requested. The number of cores must be specified both when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (python errors, cluster issues, tracebacks if failed). These files are updated with each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below for running in MANUAL mode: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock. You can do this with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the DOCK score threshold for your pProp calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;11. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_*.smi file. Repeat steps 9-11 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
to get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17016</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17016"/>
		<updated>2025-12-15T22:06:41Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: dec 15 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of ZINC library (96B) or 13.3B library from Enamine REAL. &#039;&#039;&#039;for detailed instructions on the 13B space, see &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
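&lt;br /&gt;
The selection described above can be sketched in a few lines. This is an illustration, not the ChemSTEP implementation: fingerprints are represented as plain Python sets of on-bits rather than ECFP4 arrays, the names tanimoto_distance and pick_beacons are hypothetical, and reading &quot;highest Td from beacon 1 and beacon 2&quot; as &quot;largest distance to the nearest already-assigned beacon&quot; is an assumption. &lt;br /&gt;

```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance between two fingerprints given as sets of on-bits."""
    union = fp_a.union(fp_b)
    if not union:
        return 0.0
    return 1.0 - len(fp_a.intersection(fp_b)) / len(union)


def pick_beacons(candidates, max_beacons, previous_beacons=()):
    """Greedy max-min diversity selection (a sketch, see assumptions above).

    candidates: list of (mol_id, score, fingerprint) tuples that already pass
    the pProp score threshold; lower (more negative) DOCK scores are better.
    previous_beacons: fingerprints of beacons assigned in earlier rounds.
    """
    remaining = list(candidates)
    chosen_fps = list(previous_beacons)
    beacons = []
    while remaining and len(beacons) != max_beacons:
        if not chosen_fps:
            # First beacon of the first round: the best-scoring molecule.
            pick = min(remaining, key=lambda c: c[1])
        else:
            # Otherwise: the candidate farthest from its nearest existing beacon.
            pick = max(
                remaining,
                key=lambda c: min(tanimoto_distance(c[2], fp) for fp in chosen_fps),
            )
        remaining.remove(pick)
        beacons.append(pick[0])
        chosen_fps.append(pick[2])
    return beacons
```

If fewer candidates pass the threshold than max_beacons, the loop simply stops early, which mirrors the note that ChemSTEP may assign fewer beacons than requested. &lt;br /&gt;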
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algorithm will select N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &quot;max-minTd&quot; for each round, which is the largest min-Td among the molecules prioritized in that round, i.e. the distance from the most dissimilar prioritized molecule to its nearest beacon. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
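&lt;br /&gt;
As a toy illustration of this step, the sketch below picks the N molecules with the smallest min-Td and reports the max-minTd of the selection. It assumes the distances fit in an in-memory dict keyed by molecule ID, which is a simplification of the mintddistrib_*.npy files, and the function name prioritize is hypothetical. &lt;br /&gt;

```python
import heapq


def prioritize(min_td_by_mol, n_docked_per_round):
    """Pick the n molecules with the smallest min-Td to any beacon.

    min_td_by_mol: dict mapping molecule ID to its Tanimoto distance to the
    nearest beacon (current or previous rounds).
    Returns (selected_ids, max_min_td); max_min_td is None if nothing is selected.
    """
    selected = heapq.nsmallest(
        n_docked_per_round, min_td_by_mol.items(), key=lambda item: item[1]
    )
    ids = [mol_id for mol_id, _ in selected]
    # nsmallest returns results in ascending order, so the last one is the farthest.
    max_min_td = selected[-1][1] if selected else None
    return ids, max_min_td
```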
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_auto/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/96M_seeds.wynton.sdi .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/rebuilt_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
Because the seed set is large, the docking must be done in two separate chunks (100k jobs max per docking array), and the scores combined into one file later on. &lt;br /&gt;
&lt;br /&gt;
If you are interested in running on a library of 13.2B compounds from Enamine REAL, several seed set sizes are available: 130k, 1.3M, 13M, and 26M. Copy in the one you want to use: &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; (docking directions taken from docs.docking.org). Run this as you would a normal large-scale docking (LSD) campaign. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If docking the seeds from the ZINC library, combine your score.txt files from both SDI files and rename the combined file &#039;scores_round_0.txt&#039;.&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &quot;output*&quot; within the CWD. If your docking output follows a different layout, edit get_scores.py and change the path. A successful run writes a file named &quot;scores_round_0.txt&quot;. For iterative rounds, pass the round number on the command line. For example, when docking the first round of prioritized molecules, pass 1 to write scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For speed, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we convert the readable scores file into that form. This script expects a txt file with molecule IDs and DOCK scores (no header). Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter or make a directory to run ChemSTEP in. Copy in the files needed to initiate ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices NumPy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The remaining values in params.txt are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &quot;virtual hit&quot; in your campaign. pProp is the -log10 of a molecule&#039;s rank fraction (rank % / 100) within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing the desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp + 2).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between rounds of ChemSTEP. Prioritizing many molecules will slow building and docking, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;bundle_size:&#039;&#039;&#039; if using auto docking mode, the bundle size is used for building. i.e. the number of molecules that get submitted to build as one job. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Edit run_chemstep.py&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
There are several configurable parameters that can be passed when first calling the CSAlgo object:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;docking_method=&#039;&#039;&#039; setting this to &amp;quot;auto&amp;quot; will set up and run iterative rounds of building, docking, and running ChemSTEP. If you choose the auto method, be sure to update the full path to your dockfiles in run_chemstep.py and check that all INDOCK parameters are set to your liking. NOTE: docking method is set to auto if you copy in the scripts as outlined above. Change if desired!!!!! &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;min_td_search=&#039;&#039;&#039; setting this parameter will establish a minimum Tanimoto distance &quot;zone&quot; around each beacon for prioritization. For example, setting min_td_search=0.3 will require that each prioritized molecule be at a Td greater than 0.3 from each beacon. &lt;br /&gt;
&lt;br /&gt;
more optional parameters to come.&lt;br /&gt;
&lt;br /&gt;
At this time, we strongly suggest running ChemSTEP as a job array with 16 or 32 CPU slots requested. The number of cores must be specified both when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (python errors, cluster issues, tracebacks if failed). These files are updated with each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below for running in MANUAL mode: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock. You can do this with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the DOCK score threshold for your pProp calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;11. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 9-11 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
To get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17015</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17015"/>
		<updated>2025-12-15T22:04:58Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: dec 15 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the ZINC library (96B) or the 13.3B library from Enamine REAL. &#039;&#039;&#039;For detailed instructions on the 13B space, see the &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; (high-scoring molecules) in the total library. ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization: the molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach the desired virtual hit recovery or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in roughly three steps: (1) submission of the main ChemSTEP job, (2) the ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances (Td) between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
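&lt;br /&gt;
The selection described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the ChemSTEP implementation: fingerprints are plain 0/1 arrays rather than real ECFP4 fingerprints, lower DOCK scores are taken to be better, &amp;quot;highest Td from beacon 1 and beacon 2&amp;quot; is read as the largest distance to the nearest already-chosen beacon, and the function names are hypothetical. Candidates are assumed to be pre-filtered to molecules above the pProp threshold. &lt;br /&gt;

```python
import numpy as np


def tanimoto_distance(fp, fps):
    """Tanimoto distance (1 - similarity) from one 0/1 fingerprint to many."""
    shared = (fps * fp).sum(axis=1)
    total = fps.sum(axis=1) + fp.sum() - shared
    # Two all-zero fingerprints are treated as identical (similarity 1).
    sim = np.divide(shared, total, out=np.ones(len(fps)), where=total != 0)
    return 1.0 - sim


def pick_beacons(fps, scores, n_beacons):
    """Greedy max-min selection starting from the best-scoring candidate."""
    first = int(np.argmin(scores))  # assumption: more negative score = better
    chosen = [first]
    # min_td[i] = distance from candidate i to its nearest chosen beacon.
    min_td = tanimoto_distance(fps[first], fps)
    for _ in range(n_beacons - 1):
        nxt = int(np.argmax(min_td))
        if min_td[nxt] == 0:
            break  # every remaining candidate duplicates a chosen beacon
        chosen.append(nxt)
        min_td = np.minimum(min_td, tanimoto_distance(fps[nxt], fps))
    return chosen
```

To mimic iterative rounds, min_td would be initialised from the beacons of all earlier rounds instead of from a single first beacon. &lt;br /&gt;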
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algorithm selects N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if the round size is 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the largest nearest-beacon Tanimoto distance among the molecules prioritized in that round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
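&lt;br /&gt;
As a rough sketch of this selection step, assuming the per-molecule min-Td values fit in a single NumPy array (the real mintddistrib_*.npy files are split across many chunks, and the function name here is hypothetical): &lt;br /&gt;

```python
import numpy as np


def prioritize(min_td, round_size):
    """Return indices of the round_size molecules nearest to any beacon,
    plus the largest min-Td among them (the max-minTd of the round)."""
    n = min(round_size, len(min_td))
    # argpartition moves the n smallest values to the front without a full sort.
    picked = np.argpartition(min_td, n - 1)[:n]
    # Order the picked molecules from nearest to farthest.
    picked = picked[np.argsort(min_td[picked])]
    return picked, float(min_td[picked[-1]])
```

A max-minTd that grows quickly from round to round suggests the prioritized molecules are drifting away from the beacons. &lt;br /&gt;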
&lt;br /&gt;
What the user needs: DOCKFILES and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_auto/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/96M_seeds.wynton.sdi .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/rebuilt_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
Because the seed set is large, the docking must be done in two separate chunks (100k jobs max per docking array), and the scores combined into one file later on. &lt;br /&gt;
&lt;br /&gt;
If you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If docking the seeds from the ZINC library, please combine your score.txt files from both SDI files. Rename this combined file &#039;scores_round_0.txt.&#039;&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your output from docking follows different organizational conventions, vim into get_scores.py and change the path. Successful output of the script will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers into the command line. i.e. When docking the first round of prioritized molecules, pass [1] for scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take our readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores (no header). Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
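&lt;br /&gt;
Before handing the arrays to ChemSTEP, a quick sanity check can catch mismatched files. The sketch below relies only on what is stated here (two line-matched .npy files); treating duplicate indices as an error is an assumption, and check_round_arrays is a hypothetical helper, not part of ChemSTEP. &lt;br /&gt;

```python
import numpy as np


def check_round_arrays(scores_path, indices_path):
    """Load a scores/indices pair and confirm they are line-matched."""
    scores = np.load(scores_path)
    indices = np.load(indices_path)
    if scores.shape != indices.shape:
        raise ValueError(f"{len(scores)} scores vs {len(indices)} indices")
    # Assumption: each molecule index should appear only once per round.
    if len(np.unique(indices)) != len(indices):
        raise ValueError("duplicate molecule indices found")
    if np.isnan(scores).any():
        raise ValueError("NaN docking scores found")
    return scores, indices
```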
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
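&lt;br /&gt;
A small check of params.txt before submitting can save a failed job. The sketch below assumes the file contains only &amp;quot;key: value&amp;quot; lines like the example above; both helper names are hypothetical and this is not how ChemSTEP itself reads the file. &lt;br /&gt;

```python
import os


def read_params(path):
    """Parse "key: value" lines from params.txt into a dict of strings."""
    params = {}
    with open(path) as handle:
        for line in handle:
            key, sep, value = line.partition(":")
            if sep and key.strip():
                params[key.strip()] = value.strip()
    return params


def check_seed_paths(params):
    """Return the seed-file keys whose values are not absolute paths."""
    keys = ("seed_indices_file", "seed_scores_file")
    return [k for k in keys if not os.path.isabs(params.get(k, ""))]
```

An empty list from check_seed_paths means both seed files were given as absolute paths. &lt;br /&gt;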
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log(rank %) of a molecule within the total library score-distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate a DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than the threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp +2).&lt;br /&gt;
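&lt;br /&gt;
The arithmetic in this paragraph can be checked with a short sketch. It reads pProp as -log10 of the rank fraction (so pProp 4 is the top 0.01%), which matches the worked numbers above. The score-threshold function is only one plausible way to estimate the cutoff from the seed set (take the k-th best seed score, assuming lower scores are better); how ChemSTEP actually does this is not described here, and both function names are hypothetical. &lt;br /&gt;

```python
import numpy as np


def pprop_summary(library_size, pprop):
    """Expected virtual-hit count and suggested minimum seed-set size."""
    n_hits = int(round(library_size * 10.0 ** (-pprop)))
    min_seed_size = 10 ** (pprop + 2)  # rule of thumb quoted in the text
    return n_hits, min_seed_size


def seed_score_threshold(seed_scores, pprop):
    """Score of the k-th best seed molecule, k = expected hits in the seed set."""
    k = max(1, int(round(len(seed_scores) * 10.0 ** (-pprop))))
    # Ascending sort puts the most negative (assumed best) scores first.
    return float(np.sort(seed_scores)[k - 1])
```

For example, pprop_summary(1.1e12, 5) gives 11 million virtual hits and a suggested seed set of at least 10 million molecules. &lt;br /&gt;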
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this number is the desired molecules for prioritization per round. When choosing this value, note that this number of molecules must be built and docked in between each round of ChemSTEP. Prioritizing many molecules will slow building and docking speeds, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;bundle_size:&#039;&#039;&#039; if using auto docking mode, the bundle size is used for building. i.e. the number of molecules that get submitted to build as one job. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Edit run_chemstep.py&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
There are several configurable parameters that can be passed when first calling the CSAlgo object:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;docking_method=&#039;&#039;&#039; setting this to &amp;quot;auto&amp;quot; will set up and run iterative rounds of building, docking, and running ChemSTEP. If you choose the auto method, be sure to update the full path to your dockfiles in run_chemstep.py and check that all INDOCK parameters are set to your liking. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;min_td_search=&#039;&#039;&#039; setting this parameter will establish a minimum Tanimoto distance &amp;quot;zone&amp;quot; around each beacon for prioritization. For example, setting min_td_search=0.3 will require that each prioritized molecule be greater than 0.3 Td from each beacon. &lt;br /&gt;
&lt;br /&gt;
more optional parameters to come.&lt;br /&gt;
&lt;br /&gt;
At this time, we strongly suggest running ChemSTEP as a job array with 16 or 32 CPU slots requested. The number of cores must be specified both when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (python errors, cluster issues, tracebacks if failed). These files will be updated with any information with iterative rounds of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below for running in MANUAL mode: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock. You can do this with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve the docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;11. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 9-11 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
To get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17014</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17014"/>
		<updated>2025-12-15T21:59:09Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: dec 15 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the ZINC library (96B) or the 13.3B library from Enamine REAL. &#039;&#039;&#039;For detailed instructions on the 13B space, see the &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; (high-scoring molecules) in the total library. ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization: the molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach the desired virtual hit recovery or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in roughly three steps: (1) submission of the main ChemSTEP job, (2) the ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances (Td) between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
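&lt;br /&gt;
In array terms, the update performed by each chaining sub-job can be pictured as an element-wise running minimum. A minimal sketch, assuming the distances for one chunk of the library are held in NumPy arrays (the function name and array layout are hypothetical, not the actual ChainingLog code): &lt;br /&gt;

```python
import numpy as np


def update_min_td(previous_min_td, td_to_new_beacons):
    """Keep, for each molecule, the distance to its nearest beacon so far.

    previous_min_td: shape (n_molecules,), nearest-beacon Td from earlier rounds.
    td_to_new_beacons: shape (n_molecules, n_new_beacons), Td to each new beacon.
    """
    nearest_new = td_to_new_beacons.min(axis=1)
    return np.minimum(previous_min_td, nearest_new)
```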
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algorithm selects N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if the round size is 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the largest nearest-beacon Tanimoto distance among the molecules prioritized in that round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_auto/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/96M_seeds.wynton.sdi .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/rebuilt_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
Because the seed set is large, the docking must be done in two separate chunks (100k jobs max per docking array), and the scores combined into one file later on. &lt;br /&gt;
&lt;br /&gt;
If you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If docking the seeds from the ZINC library, please combine your score.txt files from both SDI files. Rename this combined file &#039;scores_round_0.txt.&#039;&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your output from docking follows different organizational conventions, vim into get_scores.py and change the path. Successful output of the script will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers into the command line. i.e. When docking the first round of prioritized molecules, pass [1] for scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take our readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores (no header). Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is the -log10 of a molecule&#039;s rank within the total library score distribution, with rank expressed as a fraction of the library. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score in kcal/mol associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing a pProp: we suggest a seed set of at least 10^(pProp + 2) molecules.&lt;br /&gt;
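&lt;br /&gt;
The arithmetic above can be checked with a few lines of Python. This is only a worked example of the numbers quoted on this page; the function names are made up for illustration and are not part of ChemSTEP.&lt;br /&gt;

```python
def virtual_hit_count(library_size, pprop):
    """Number of library molecules in the top 10**-pprop fraction."""
    return library_size // 10**pprop


def suggested_min_seed_size(pprop):
    """Seed-set size suggested on this page: 10^(pProp + 2)."""
    return 10 ** (pprop + 2)


library_size = 1_100_000_000_000  # 1.1T, the library size quoted above
for pprop in (4, 5):
    print(pprop, virtual_hit_count(library_size, pprop), suggested_min_seed_size(pprop))
# prints:
# 4 110000000 1000000
# 5 11000000 10000000
```

With a randomly sampled seed set of 10^(pProp + 2) molecules, roughly 100 seed molecules are expected to fall above the threshold, since 10^(pProp + 2) x 10^(-pProp) = 100.&lt;br /&gt;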
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; the number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between rounds of ChemSTEP. Prioritizing many molecules will slow building and docking and, coupled with few beacons (below), may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If too few molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than requested. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;bundle_size:&#039;&#039;&#039; if using auto docking mode, the bundle size is used for building: it is the number of molecules submitted to build as one job. &lt;br /&gt;
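&lt;br /&gt;
Before submitting, it can help to sanity-check params.txt. The sketch below is a hypothetical pre-flight check, not ChemSTEP&#039;s own parser: it assumes the &quot;key: value&quot; layout shown above, and the function names are made up for illustration.&lt;br /&gt;

```python
import os


def read_params(path):
    """Parse 'key: value' lines into a dict of strings."""
    params = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            key, _, value = line.partition(":")
            params[key.strip()] = value.strip()
    return params


def check_params(params):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for key in ("seed_indices_file", "seed_scores_file"):
        value = params.get(key, "")
        if not os.path.isabs(value):
            problems.append(f"{key} is not an absolute path: {value!r}")
        elif not os.path.isfile(value):
            problems.append(f"{key} does not exist: {value}")
    numeric_keys = ("hit_pprop", "bundle_size", "n_docked_per_round",
                    "max_beacons", "max_n_rounds")
    for key in numeric_keys:
        try:
            ok = float(params[key]) > 0
        except (KeyError, ValueError):
            ok = False
        if not ok:
            problems.append(f"{key} is missing or not a positive number")
    return problems
```

Running check_params(read_params(&quot;params.txt&quot;)) in the ChemSTEP directory returns an empty list when both seed files exist at absolute paths and all numeric values are positive.&lt;br /&gt;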
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Edit run_chemstep.py&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
There are several configurable parameters that can be passed when first calling the CSAlgo object:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;docking_method=&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
At this time, we strongly suggest running ChemSTEP as a job array with 16 or 32 CPU slots requested. The number of cores must be specified both when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (python errors, cluster issues, tracebacks if failed). Both files are updated during each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock. You can write the SDI file with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;AUTOMATIC CHEMSTEP RUNNING DIRECTIONS&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This version of ChemSTEP will automatically submit building and docking jobs in between rounds of ChemSTEP. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. DOCK the seed set to your receptor as outlined above in steps 1-4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Successful output will be scores_round_0.npy and indices_round_0.npy. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy necessary scripts into your working ChemSTEP directory&#039;&#039;&#039;&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/params.txt .&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/run_chemstep.py .&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Edit run_chemstep.py to add the path to your dockfiles&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The algorithm needs to know where your dockfiles live in order to submit docking jobs. Make sure all INDOCK parameters are set correctly. Use the full path to the dockfiles. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Edit params.txt&#039;&#039;&#039; The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user. The &#039;&#039;&#039;bundle_size:&#039;&#039;&#039; parameter is required for auto docking mode of ChemSTEP. This is the number of molecules submitted as one job for building. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Launch ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
   &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if your job errors out or stops running at any point, check the &#039;&#039;chemstep_submission.log&#039;&#039; file in the working directory.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
to get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17013</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17013"/>
		<updated>2025-12-15T21:40:39Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: dec 15 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of ZINC library (96B) or 13.3B library from Enamine REAL. &#039;&#039;&#039;for detailed instructions on the 13B space, see &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons. This is a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
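&lt;br /&gt;
The beacon assignment described above can be sketched in Python as follows. This is an illustrative toy, not ChemSTEP&#039;s implementation: fingerprints are represented as plain sets of on-bits instead of real ECFP4 fingerprints, the function names are made up, lower DOCK scores are assumed to be better, and &quot;highest Td from beacon 1 and beacon 2&quot; is read as maximizing the distance to the nearest existing beacon (max-min), which is an assumption.&lt;br /&gt;

```python
def tanimoto_distance(a, b):
    """1 - (shared on-bits / total on-bits) for two sets of fingerprint bits."""
    union = len(a.union(b))
    if union == 0:
        return 0.0
    return 1.0 - len(a.intersection(b)) / union


def pick_beacons(candidates, scores, max_beacons):
    """Greedy max-min selection; returns candidate indices in pick order.

    candidates: on-bit sets for molecules that pass the pProp threshold.
    scores: DOCK scores line-matched to candidates (lower assumed better).
    """
    if not candidates or 1 > max_beacons:
        return []
    # The best-scoring candidate becomes the first beacon.
    first = min(range(len(candidates)), key=lambda i: scores[i])
    beacons = [first]
    # min_td[i] = distance from candidate i to its nearest beacon so far.
    min_td = [tanimoto_distance(c, candidates[first]) for c in candidates]
    while max_beacons > len(beacons):
        nxt = max(range(len(candidates)), key=lambda i: min_td[i])
        if min_td[nxt] == 0.0:
            break  # every remaining candidate duplicates an existing beacon
        beacons.append(nxt)
        for i, c in enumerate(candidates):
            min_td[i] = min(min_td[i], tanimoto_distance(c, candidates[nxt]))
    return beacons
```

In iterative rounds, the same idea would apply with min_td initialized against all beacons from earlier rounds rather than against a single first beacon.&lt;br /&gt;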
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algo will select N number of molecules for prioritization, the number of which is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. I.e. if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance of the most dissimilar molecule prioritized to any one beacon per round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
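&lt;br /&gt;
As a toy illustration of this selection rule (not ChemSTEP&#039;s actual code, which works on the mintddistrib_*.npy files), the sketch below picks the round_size molecules with the smallest min-Td from a small in-memory dictionary and reports the max-minTd of the picked set. The function name and molecule IDs are hypothetical.&lt;br /&gt;

```python
import heapq


def prioritize(min_td_by_id, round_size):
    """Return (selected ids, max-minTd) for the round_size closest molecules.

    min_td_by_id maps a molecule ID to its distance to the nearest beacon.
    """
    closest = heapq.nsmallest(round_size, min_td_by_id.items(),
                              key=lambda kv: kv[1])
    selected = [mol_id for mol_id, _ in closest]
    # The largest distance among the picked molecules is the round's max-minTd.
    max_min_td = closest[-1][1] if closest else None
    return selected, max_min_td
```

For example, prioritize({&quot;molA&quot;: 0.2, &quot;molB&quot;: 0.5, &quot;molC&quot;: 0.1}, 2) selects molC and molA, with a max-minTd of 0.2.&lt;br /&gt;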
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_auto/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/96M_seeds.wynton.sdi .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/rebuilt_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
Because the seed set is large, the docking must be done in two separate chunks (100k jobs max per docking array), and the scores combined into one file later on. &lt;br /&gt;
&lt;br /&gt;
If you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If docking the seeds from the ZINC library, please combine the score.txt files from both SDI files into a single file. Rename this combined file &#039;scores_round_0.txt&#039;.&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the current working directory. If your docking output is organized differently, edit the path in get_scores.py. On success, the script writes a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass the round number on the command line; for example, when docking the first round of prioritized molecules, pass 1 to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For speed, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we convert the human-readable scores file into that form. The script expects a txt file with molecule IDs and DOCK scores (no header). Use the same round number you used in step 3. As run above, it outputs two files, scores_round_0.npy and indices_round_0.npy, which contain line-matched indices (determined from the molecule ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is the -log10 of a molecule&#039;s rank within the total library score distribution, with rank expressed as a fraction of the library. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score in kcal/mol associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing a pProp: we suggest a seed set of at least 10^(pProp + 2) molecules.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; the number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between rounds of ChemSTEP. Prioritizing many molecules will slow building and docking and, coupled with few beacons (below), may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If too few molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than requested. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;bundle_size:&#039;&#039;&#039; if using auto docking mode, the bundle size is used for building: it is the number of molecules submitted to build as one job. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. At this time, we strongly suggest running ChemSTEP as a job array with 16 or 32 CPU slots requested. The number of cores must be specified both when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (python errors, cluster issues, tracebacks if failed). Both files are updated during each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock. You can write the SDI file with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
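&lt;br /&gt;
A Python equivalent of the find command, which also reports how many bundles were found so that an empty SDI file is easy to spot, might look like the sketch below. The function name is made up for illustration, and unlike find it writes the paths in sorted order.&lt;br /&gt;

```python
from pathlib import Path


def write_sdi(building_output, sdi_path):
    """Write one absolute bundle path per line; return the number of bundles."""
    root = Path(building_output).resolve()
    # Recursively collect every built bundle under the building output folder.
    bundles = sorted(p for p in root.rglob("bundle.db2.tgz") if p.is_file())
    Path(sdi_path).write_text("".join(f"{p}\n" for p in bundles))
    return len(bundles)
```

If the returned count is zero, or far below n_docked_per_round divided by bundle_size, building likely did not finish cleanly.&lt;br /&gt;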
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_{}.smi file. Repeat steps 8-10 for as many rounds as needed. The performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;AUTOMATIC CHEMSTEP RUNNING DIRECTIONS&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This version of ChemSTEP will automatically submit building and docking jobs in between rounds of ChemSTEP. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. DOCK the seed set to your receptor as outlined above in steps 1-4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Successful output will be scores_round_0.npy and indices_round_0.npy. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy necessary scripts into your working ChemSTEP directory&#039;&#039;&#039;&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/params.txt .&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/run_chemstep.py .&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Edit run_chemstep.py to add the path to your dockfiles&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The algorithm needs to know where your dockfiles live in order to submit docking jobs. Make sure all INDOCK parameters are set correctly. Use the full path to the dockfiles. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Edit params.txt&#039;&#039;&#039; The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user. The &#039;&#039;&#039;bundle_size:&#039;&#039;&#039; parameter is required for the auto docking mode of ChemSTEP. This is the number of molecules submitted per job for building. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Launch ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
   &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if your job errors out or stops running at any point, check the &#039;&#039;chemstep_submission.log&#039;&#039; file in the working directory.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
to get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17012</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17012"/>
		<updated>2025-12-15T21:39:05Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: dec 15 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the ZINC library (96B) or the 13.3B library from Enamine REAL. &#039;&#039;&#039;For detailed instructions on the 13B space, see the &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach the desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton loosely takes place in three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
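&lt;br /&gt;
The selection described above can be sketched in Python as follows. This is only an illustration, not the ChemSTEP source code. It assumes fingerprints are given as sets of on-bit indices, that lower DOCK scores are better, and that &amp;quot;as diverse as possible&amp;quot; means maximizing the minimum Td to all beacons chosen so far. The names tanimoto_distance and pick_beacons are hypothetical. &lt;br /&gt;
&lt;br /&gt;
```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance (1 - similarity) between two sets of on-bit indices."""
    union = len(fp_a.union(fp_b))
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(fp_a.intersection(fp_b)) / union


def pick_beacons(fingerprints, scores, max_beacons, previous_beacons=()):
    """Greedy max-min selection over candidates that already pass the pProp threshold.

    fingerprints: list of sets (hypothetical stand-in for ECFP4 fingerprints)
    scores: DOCK scores, lower is better (assumption)
    previous_beacons: fingerprints of beacons assigned in earlier rounds
    Returns the indices of the newly chosen beacons, in selection order.
    """
    remaining = list(range(len(fingerprints)))
    reference = list(previous_beacons)
    chosen = []
    if not remaining or max_beacons == 0:
        return chosen
    if not reference:
        # First round: the best-scoring molecule becomes the first beacon.
        first = min(remaining, key=lambda i: scores[i])
        chosen.append(first)
        remaining.remove(first)
        reference.append(fingerprints[first])
    # Distance from every remaining candidate to its nearest existing beacon.
    min_td = {
        i: min(tanimoto_distance(fingerprints[i], b) for b in reference)
        for i in remaining
    }
    for _ in range(max_beacons - len(chosen)):
        if not remaining:
            break  # fewer candidates than requested beacons
        # The next beacon is the candidate farthest from all beacons so far.
        nxt = max(remaining, key=lambda i: min_td[i])
        chosen.append(nxt)
        remaining.remove(nxt)
        for i in remaining:
            d = tanimoto_distance(fingerprints[i], fingerprints[nxt])
            min_td[i] = min(min_td[i], d)
    return chosen
```
&lt;br /&gt;
Passing beacons from earlier rounds as previous_beacons makes new beacons diverse with respect to ALL previously assigned beacons, as described for iterative rounds. &lt;br /&gt;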
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algorithm selects N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance of the most dissimilar molecule prioritized to any one beacon per round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
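&lt;br /&gt;
The prioritization rule above can be illustrated with a small Python sketch. It is not the ChemSTEP implementation. It assumes the per-molecule minimum Td values are already available as a plain sequence indexed by molecule, and the function name prioritize is hypothetical. &lt;br /&gt;
&lt;br /&gt;
```python
import heapq


def prioritize(min_td, n_per_round):
    """Pick the n molecules closest to any beacon.

    min_td: sequence where min_td[i] is the smallest Tanimoto distance from
            library molecule i to any beacon (current or previous rounds)
    Returns (indices ordered from nearest to farthest, max-minTd of the picks).
    """
    picked = heapq.nsmallest(
        n_per_round, range(len(min_td)), key=min_td.__getitem__
    )
    # The max-minTd is the distance of the most dissimilar prioritized molecule.
    max_min_td = min_td[picked[-1]] if picked else None
    return picked, max_min_td
```
&lt;br /&gt;
For min_td = [0.9, 0.2, 0.5, 0.1, 0.7] and a round size of 3, molecules 3, 1 and 2 are prioritized and the max-minTd is 0.5. &lt;br /&gt;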
&lt;br /&gt;
What the user needs: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_auto/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/96M_seeds.wynton.sdi .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/rebuilt_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
Because the seed set is large, the docking must be done in two separate chunks (100k jobs max per docking array), and the scores combined into one file later on. &lt;br /&gt;
&lt;br /&gt;
If you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If docking the seeds from the ZINC library, please combine your score.txt files from both SDI files. Rename this combined file &#039;scores_round_0.txt.&#039;&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your output from docking follows different organizational conventions, vim into get_scores.py and change the path. Successful output of the script will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers into the command line. For example, when docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_auto/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take our readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores (no header). Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value will define what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s rank fraction (rank % / 100) within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate a DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than the threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing the desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp + 2).&lt;br /&gt;
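&lt;br /&gt;
The arithmetic above can be checked with a few lines of Python. This sketch reads pProp as -log10 of the rank fraction, which is what the worked numbers imply (pProp 4 = top 0.01%). The function names are hypothetical and are not part of ChemSTEP. &lt;br /&gt;
&lt;br /&gt;
```python
def pprop_to_fraction(pprop):
    """Fraction of the library at or above a pProp (pProp 4 gives 1e-4, i.e. 0.01%)."""
    return 10.0 ** (-pprop)


def expected_virtual_hits(library_size, pprop):
    """Number of molecules in the whole library that count as virtual hits."""
    return library_size * pprop_to_fraction(pprop)


def suggested_min_seed_size(pprop):
    """Suggested lower bound on seed set size from the text: 10^(pProp + 2)."""
    return 10 ** (pprop + 2)


if __name__ == "__main__":
    for pprop in (4, 5):
        hits = round(expected_virtual_hits(1.1e12, pprop))
        print(pprop, hits, suggested_min_seed_size(pprop))
```
&lt;br /&gt;
With a library size of 1.1e12, pProp 4 gives about 110 million virtual hits and pProp 5 about 11 million, in line with the numbers above. The suggested seed set size for pProp 5 is 10 million molecules, so that roughly 100 virtual hits are expected within the seed set. &lt;br /&gt;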
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the desired number of molecules to prioritize per round. When choosing this value, note that this number of molecules must be built and docked between rounds of ChemSTEP. Prioritizing many molecules will slow building and docking and, coupled with few beacons (below), may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. At this time, we strongly suggest running ChemSTEP as a job array with 16 or 32 CPU slots requested. The number of cores must be specified when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (python errors, cluster issues, tracebacks if failed). These files will be updated with new information as iterative rounds of ChemSTEP are run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock. You can do this with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
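&lt;br /&gt;
As an alternative to the find command above, the same SDI file could be written from Python. This is a minimal sketch using only the standard library. The function name write_sdi and its arguments are hypothetical, and it assumes every built bundle is named bundle.db2.tgz as in the command above. &lt;br /&gt;
&lt;br /&gt;
```python
from pathlib import Path


def write_sdi(building_output, sdi_path, bundle_name="bundle.db2.tgz"):
    """Write the absolute path of every built bundle to sdi_path, one per line.

    Returns the number of bundles found so the count can be sanity-checked
    against the number of building jobs that were submitted.
    """
    bundles = sorted(
        p.resolve()
        for p in Path(building_output).rglob(bundle_name)
        if p.is_file()
    )
    Path(sdi_path).write_text("".join(f"{p}\n" for p in bundles))
    return len(bundles)
```
&lt;br /&gt;
For example, write_sdi(&amp;quot;/path/to/your/building_output&amp;quot;, &amp;quot;round_1.wynton.sdi&amp;quot;) would write the SDI file for round 1. &lt;br /&gt;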
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_{}.smi file. Repeat steps 8-10 for as many rounds as needed. The performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;AUTOMATIC CHEMSTEP RUNNING DIRECTIONS&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This version of ChemSTEP will automatically submit building and docking jobs in between rounds of ChemSTEP. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. DOCK the seed set to your receptor as outlined above in steps 1-4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Successful output will be scores_round_0.npy and indices_round_0.npy. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy necessary scripts into your working ChemSTEP directory&#039;&#039;&#039;&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/params.txt .&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/run_chemstep.py .&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Edit run_chemstep.py to add the path to your dockfiles&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The algorithm needs to know where your dockfiles live in order to submit docking jobs. Make sure all INDOCK parameters are set correctly. Use the full path to the dockfiles. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Edit params.txt&#039;&#039;&#039; The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user. The &#039;&#039;&#039;bundle_size:&#039;&#039;&#039; parameter is required for the auto docking mode of ChemSTEP. This is the number of molecules submitted per job for building. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Launch ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
   &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if your job errors out or stops running at any point, check the &#039;&#039;chemstep_submission.log&#039;&#039; file in the working directory.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
to get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17011</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17011"/>
		<updated>2025-12-15T21:15:37Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: dec 11 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the infiniSee XReal library (1.1T) or the 13.3B library from Enamine REAL. &#039;&#039;&#039;For detailed instructions on the 13B space, see the &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach the desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton loosely takes place in three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algorithm selects N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance of the most dissimilar molecule prioritized to any one beacon per round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_0.3.1.5/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your docking output follows a different organizational convention, edit get_scores.py and change the path. On success, the script writes a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line; e.g., when docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
To keep Python fast, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we take the human-readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores (no header). Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from the molecule ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter or make a directory to run ChemSTEP in. Copy in the files needed to initiate ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Also copy in your score and indices NumPy files, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s rank fraction (rank divided by library size) within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score (in kcal/mol) that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing the desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp + 2).&lt;br /&gt;
&lt;br /&gt;
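The pProp arithmetic above can be checked with a short Python sketch. The function names are hypothetical and the sketch assumes an integer pProp; it reproduces the numbers quoted in the text rather than anything ChemSTEP computes internally. &lt;br /&gt;
&lt;br /&gt;

```python
def expected_hits(library_size, pprop):
    """Number of virtual hits in the whole library: library_size * 10^(-pProp)."""
    return library_size // (10 ** pprop)


def suggested_min_seed_size(pprop):
    """Seed-set size suggested in the text: 10^(pProp + 2)."""
    return 10 ** (pprop + 2)


library_size = 1_100_000_000_000  # 1.1T, the XReal library size quoted in the text
for pprop in (4, 5):
    print(pprop, expected_hits(library_size, pprop), suggested_min_seed_size(pprop))
```

By the same rule, a seed set of 100 million molecules (10^8) is large enough for a pProp of up to 6. &lt;br /&gt;
&lt;br /&gt;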
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between each round of ChemSTEP. Prioritizing many molecules will slow building and docking, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the requested value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if you are running ChemSTEP prospectively, as outlined in this wiki, there is no need to adjust this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. At this time, we strongly suggest running ChemSTEP as a job array with 16 or 32 CPU slots requested. The number of cores must be specified both when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with new information during iterative rounds of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, and then dock. You can write the SDI file with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_iterative.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;AUTOMATIC CHEMSTEP RUNNING DIRECTIONS&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This version of ChemSTEP will automatically submit building and docking jobs between rounds of ChemSTEP. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. DOCK the seed set to your receptor as outlined above in steps 1-4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Successful output will be scores_round_0.npy and indices_round_0.npy. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy necessary scripts into your working ChemSTEP directory&#039;&#039;&#039;&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/params.txt .&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/run_chemstep.py .&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Edit run_chemstep.py to add the path to your dockfiles&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The algorithm needs to know where your dockfiles live in order to submit docking jobs. Make sure all INDOCK parameters are set correctly. Use the full path to the dockfiles. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Edit params.txt&#039;&#039;&#039; The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user. The &#039;&#039;&#039;bundle_size:&#039;&#039;&#039; parameter is required for the auto docking mode of ChemSTEP. This is the number of molecules submitted per job for building. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Launch ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
   &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if your job errors out or stops running at any point, check the &#039;&#039;chemstep_submission.log&#039;&#039; file in the working directory.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
to get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17010</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17010"/>
		<updated>2025-12-15T20:52:45Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: dec 11 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T) or the 13.3B library from Enamine REAL. &#039;&#039;&#039;For detailed instructions on the 13B space, see the &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton loosely takes place in three steps: (1) submission of the main ChemSTEP job, (2) the ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
&lt;br /&gt;
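The beacon assignment described above can be sketched as a greedy loop in Python. This is a minimal illustration under several assumptions, not the ChemSTEP code: fingerprints are represented as plain sets of on-bits instead of ECFP4 bit vectors, a more negative DOCK score is treated as better, the next beacon is taken to be the candidate whose distance to its NEAREST existing beacon is largest (one common reading of &amp;quot;highest Td from beacon 1 and beacon 2&amp;quot;), and the names tanimoto_distance and pick_beacons are hypothetical. &lt;br /&gt;
&lt;br /&gt;

```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance (1 - similarity) between two sets of on-bits."""
    union = len(fp_a.union(fp_b))
    if union == 0:
        return 0.0
    return 1.0 - len(fp_a.intersection(fp_b)) / union


def pick_beacons(candidates, max_beacons):
    """Greedy max-diversity selection.

    candidates: list of (mol_id, dock_score, fingerprint) tuples that already
    pass the pProp score threshold.
    """
    if not candidates or max_beacons == 0:
        return []
    # First beacon: the best-scoring candidate (assumed: lowest score is best).
    first = min(candidates, key=lambda c: c[1])
    beacons = [first]
    remaining = [c for c in candidates if c[0] != first[0]]
    # Distance from each remaining candidate to its nearest beacon so far.
    nearest = {c[0]: tanimoto_distance(c[2], first[2]) for c in remaining}
    for _ in range(max_beacons - 1):
        if not remaining:
            break  # fewer candidates than requested beacons
        nxt = max(remaining, key=lambda c: nearest[c[0]])
        beacons.append(nxt)
        remaining = [c for c in remaining if c[0] != nxt[0]]
        for c in remaining:
            nearest[c[0]] = min(nearest[c[0]], tanimoto_distance(c[2], nxt[2]))
    return [b[0] for b in beacons]


# Hypothetical candidates: (ID, DOCK score, fingerprint on-bits).
candidates = [
    ("m1", -50.0, {1, 2, 3, 4}),
    ("m2", -48.0, {1, 2, 3, 5}),
    ("m3", -47.0, {7, 8, 9}),
    ("m4", -46.0, {1, 2, 7, 8}),
]
print(pick_beacons(candidates, 3))
```

In iterative rounds, the same loop would start with the beacons of all earlier rounds already in place, so that new beacons are diverse with respect to every previous one. &lt;br /&gt;
&lt;br /&gt;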
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
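The bookkeeping in the chaining step amounts to keeping a running minimum per molecule. The Python sketch below illustrates that idea only: it uses an in-memory dictionary and set-based fingerprints, whereas ChemSTEP stores the values in the mintddistrib_*.npy files and distributes the work over a job array. The names tanimoto_distance and update_min_td are hypothetical. &lt;br /&gt;
&lt;br /&gt;

```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance (1 - similarity) between two sets of on-bits."""
    union = len(fp_a.union(fp_b))
    if union == 0:
        return 0.0
    return 1.0 - len(fp_a.intersection(fp_b)) / union


def update_min_td(min_td, library_fps, new_beacon_fps):
    """Lower each stored min-Td whenever one of the new beacons is closer."""
    for mol_id, fp in library_fps.items():
        best = min_td.get(mol_id, 1.0)  # 1.0 is the largest possible distance
        for beacon_fp in new_beacon_fps:
            best = min(best, tanimoto_distance(fp, beacon_fp))
        min_td[mol_id] = best
    return min_td


# Hypothetical two-molecule library and beacons from two successive rounds.
library = {"x": {1, 2, 3}, "y": {4, 5}}
min_td = update_min_td({}, library, [{1, 2, 9}])      # round 1 beacons
min_td = update_min_td(min_td, library, [{4, 5, 6}])  # round 2 beacons
print(min_td)
```

Because only the minimum is kept, each round needs distances to the newly assigned beacons only; distances to beacons from earlier rounds are already reflected in the stored value. &lt;br /&gt;
&lt;br /&gt;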
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are complete, the algorithm selects N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if the round size is 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log reports a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance from the most dissimilar prioritized molecule to its nearest beacon in that round. The prioritized molecules are written to /output/complete_info as smi_round_{}.smi &lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_0.3.1.5/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130K, 1.3M, 13M, and 26M. &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your docking output follows a different organizational convention, edit get_scores.py and change the path. On success, the script writes a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line; e.g., when docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
To keep Python fast, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we take the human-readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores (no header). Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from the molecule ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter or make a directory to run ChemSTEP in. Copy in the files needed to initiate ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Also copy in your score and indices NumPy files, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s rank fraction (rank divided by library size) within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score (in kcal/mol) that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing the desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp + 2).&lt;br /&gt;
&lt;br /&gt;
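One simple way to picture how a score threshold can be read off a random seed set is an empirical quantile, sketched below in Python. The estimator ChemSTEP actually uses is not described on this page, so treat this purely as an assumption-laden illustration: the function name score_threshold is hypothetical, pProp is taken to be an integer, and a more negative DOCK score is assumed to be better. &lt;br /&gt;
&lt;br /&gt;

```python
def score_threshold(seed_scores, pprop):
    """Estimate the DOCK score at the lower limit of the pProp zone.

    Because the seed set is a random sample, the top 10^(-pProp) fraction of
    the seed scores approximates the top 10^(-pProp) fraction of the library.
    """
    ranked = sorted(seed_scores)  # ascending, so the best (most negative) come first
    n_hits = len(ranked) // (10 ** pprop)  # seed molecules expected inside the zone
    if n_hits == 0:
        raise ValueError("seed set too small for this pProp")
    return ranked[n_hits - 1]


# Hypothetical seed set of 1000 scores: 0.0, -1.0, ..., -999.0
seed_scores = [-float(i) for i in range(1000)]
print(score_threshold(seed_scores, 2))
```

The error branch shows why seed set size matters: with too few seed molecules, no seed score falls inside the requested pProp zone, and the suggested size of 10^(pProp + 2) leaves about 100 seed molecules in the zone to estimate from. &lt;br /&gt;
&lt;br /&gt;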
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between each round of ChemSTEP. Prioritizing many molecules will slow building and docking, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the requested value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if you are running ChemSTEP prospectively, as outlined in this wiki, there is no need to adjust this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. At this time, we strongly suggest running ChemSTEP as a job array with 16 or 32 CPU slots requested. The number of cores must be specified both when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with new information during iterative rounds of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock the bundles. You can write the SDI file with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. The performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;AUTOMATIC CHEMSTEP RUNNING DIRECTIONS&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This version of ChemSTEP will automatically submit building and docking jobs in between rounds of ChemSTEP. Note that this version does not currently build with strain (to do: add this).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. DOCK the seed set to your receptor as outlined above in steps 1-4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Successful output will be scores_round_0.npy and indices_round_0.npy. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy necessary scripts into your working ChemSTEP directory&#039;&#039;&#039;&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/params.txt .&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/run_chemstep.py .&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/chemstep_auto/scripts/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Edit run_chemstep.py to add the path to your dockfiles&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The algorithm needs to know where your dockfiles live in order to submit docking jobs. Make sure all INDOCK parameters are set correctly. Use the full path to the dockfiles. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Edit params.txt&#039;&#039;&#039; The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user. The &#039;&#039;&#039;bundle_size:&#039;&#039;&#039; parameter is required for the auto docking mode of ChemSTEP. This is the number of molecules submitted per job for building. &lt;br /&gt;
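&lt;br /&gt;
Before submitting, it can help to sanity-check params.txt. The sketch below is a minimal, hypothetical checker and not part of ChemSTEP: it assumes the file holds one key: value pair per line, as in the example above, and that the keys shown there are all required. The helper names parse_params and check_params are invented for illustration, and the parser inside ChemSTEP may accept or require other things. &lt;br /&gt;

```python
# Assumption: the keys shown in the example params.txt are all required.
REQUIRED_KEYS = (
    "seed_indices_file",
    "seed_scores_file",
    "hit_pprop",
    "n_docked_per_round",
    "max_beacons",
    "max_n_rounds",
)


def parse_params(text):
    """Parse 'key: value' lines into a dict of strings (assumed file layout)."""
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError("expected 'key: value', got " + repr(raw_line))
        params[key.strip()] = value.strip()
    return params


def check_params(params, auto_mode=False):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    required = list(REQUIRED_KEYS)
    if auto_mode:
        required.append("bundle_size")  # needed only for automatic build/dock mode
    for key in required:
        if key not in params:
            problems.append("missing " + key)
    for key in ("seed_indices_file", "seed_scores_file"):
        value = params.get(key, "")
        if value and not value.startswith("/"):
            problems.append(key + " should be an absolute path")
    return problems
```

Calling check_params(parse_params(text), auto_mode=True) on the contents of params.txt returns an empty list when nothing looks wrong. &lt;br /&gt;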
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Launch ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
   &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if your job errors out or stops running at any point, check the &#039;&#039;chemstep_submission.log&#039;&#039; file in the working directory.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
To get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17009</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17009"/>
		<updated>2025-12-11T19:55:28Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: dec 11 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T) or the 13.3B library from Enamine REAL. &#039;&#039;&#039;For detailed instructions on the 13B space, see the &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from the already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
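&lt;br /&gt;
The greedy selection described above can be sketched as follows. This is an illustration only, not the ChemSTEP implementation: fingerprints are stood in for by plain Python sets of on-bit integers rather than real ECFP4 fingerprints, more negative DOCK scores are taken to be better, and the diversity criterion is assumed to be max-min (the next beacon is the candidate whose distance to its nearest existing beacon is largest). The function names are hypothetical. &lt;br /&gt;

```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance (1 - similarity) between two sets of on-bits."""
    union = len(fp_a.union(fp_b))
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(fp_a.intersection(fp_b)) / union


def pick_beacons(fingerprints, scores, score_threshold, max_beacons,
                 previous_beacon_fps=()):
    """Greedy max-min selection of beacon indices (assumed criterion)."""
    # Candidates are molecules at or better than the threshold
    # (assumption: more negative DOCK score = better).
    min_td = {i: float("inf")
              for i, s in enumerate(scores) if score_threshold >= s}
    # Account for beacons assigned in earlier rounds.
    for fp in previous_beacon_fps:
        for i in min_td:
            min_td[i] = min(min_td[i], tanimoto_distance(fingerprints[i], fp))

    beacons = []
    while min_td and len(beacons) != max_beacons:
        if not beacons and not previous_beacon_fps:
            # First beacon of the first round: the best-scoring molecule.
            pick = min(min_td, key=lambda i: scores[i])
        else:
            # Otherwise: the candidate farthest from its nearest beacon.
            pick = max(min_td, key=min_td.get)
        beacons.append(pick)
        del min_td[pick]
        for i in min_td:
            min_td[i] = min(min_td[i],
                            tanimoto_distance(fingerprints[i], fingerprints[pick]))
    return beacons
```

Keeping a running minimum distance per candidate means each new beacon costs one pass over the remaining candidates, rather than recomputing distances to every earlier beacon. &lt;br /&gt;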
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algorithm selects N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the largest min-Td among the molecules prioritized in that round, i.e. the Tanimoto distance from the most dissimilar prioritized molecule to its nearest beacon. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
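&lt;br /&gt;
As a small illustration of this prioritization step, the sketch below picks the N molecules with the smallest min-Td from a single in-memory NumPy array and reports the resulting max-minTd. It is an assumption-laden toy, not ChemSTEP code: the real algorithm works across the mintddistrib_*.npy files for the whole library, and the function name prioritize is hypothetical. &lt;br /&gt;

```python
import numpy as np


def prioritize(min_td, n_per_round):
    """Return (indices of the n molecules nearest to any beacon, max-minTd)."""
    min_td = np.asarray(min_td, dtype=float)
    n = min(int(n_per_round), min_td.size)
    if n == 0:
        return np.array([], dtype=int), float("nan")
    # argpartition places the n smallest values first without a full sort.
    picked = np.argpartition(min_td, n - 1)[:n]
    # Order the picked molecules from nearest to farthest.
    picked = picked[np.argsort(min_td[picked], kind="stable")]
    # Largest min-Td among the picked molecules.
    max_min_td = float(min_td[picked[-1]])
    return picked, max_min_td
```

A max-minTd that grows from round to round indicates that the prioritized molecules are moving farther from the beacons. &lt;br /&gt;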
&lt;br /&gt;
What the user needs: DOCKFILES and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_0.3.1.5/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your output from docking follows different organizational conventions, edit get_scores.py and change the path. Successful output of the script will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line. For example, when docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For speed, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we take our readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores (no header). Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
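&lt;br /&gt;
A quick way to confirm the conversion worked is to load both files and check that they look line-matched. The sketch below is a hypothetical helper, not one of the lab scripts; it assumes both files hold 1-D arrays of equal length and that more negative DOCK scores are better. &lt;br /&gt;

```python
import numpy as np


def check_round_arrays(scores_path, indices_path):
    """Load a scores/indices pair and verify they look line-matched."""
    scores = np.load(scores_path)
    indices = np.load(indices_path)
    if scores.ndim != 1 or indices.ndim != 1:
        raise ValueError("expected two 1-D arrays")
    if scores.size != indices.size:
        raise ValueError("scores and indices differ in length, so they are not line-matched")
    if scores.size == 0:
        raise ValueError("arrays are empty")
    if np.isnan(scores.astype(float)).any():
        raise ValueError("scores contain NaN")
    return {
        "n_molecules": int(scores.size),
        "n_unique_indices": int(np.unique(indices).size),
        "best_score": float(scores.min()),  # assumption: more negative = better
    }
```

If n_unique_indices is smaller than n_molecules, some molecule appears more than once in the indices file, which is worth investigating before launching ChemSTEP. &lt;br /&gt;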
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value will define what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s rank fraction (rank divided by library size) within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing your desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp + 2); for example, at least 10 million molecules for pProp 5.&lt;br /&gt;
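&lt;br /&gt;
The pProp arithmetic above can be reproduced with a few lines of Python. This sketch assumes pProp is the -log10 of the rank fraction, which is what the 0.01% example implies. The quantile-based threshold estimate assumes the seed set is a uniform random sample and that more negative scores are better; it is only one simple estimator, and ChemSTEP may compute its threshold differently. All function names are hypothetical. &lt;br /&gt;

```python
import math

import numpy as np


def pprop_from_rank(rank, library_size):
    """pProp = -log10(rank / library size)."""
    return -math.log10(rank / library_size)


def expected_virtual_hits(pprop, library_size):
    """Number of library molecules at or above a given pProp."""
    return library_size * 10.0 ** (-pprop)


def suggested_min_seed_size(pprop):
    """Rule of thumb from the text: seed set of at least 10^(pProp + 2)."""
    return 10 ** (pprop + 2)


def estimate_score_threshold(seed_scores, pprop):
    """Seed-set score at the top 10^-pProp quantile (assumed estimator)."""
    fraction = 10.0 ** (-pprop)
    return float(np.quantile(np.asarray(seed_scores, dtype=float), fraction))
```

With a seed set of 10^(pProp + 2) molecules, roughly 100 seed molecules are expected to fall above the threshold, which is what the threshold estimate and the beacon pool are drawn from. &lt;br /&gt;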
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the desired number of molecules to prioritize per round. When choosing this value, note that this number of molecules must be built and docked between rounds of ChemSTEP. Prioritizing many molecules will slow building and docking and, coupled with few beacons (below), may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. At this time, we strongly suggest running ChemSTEP as a job array with 16 or 32 CPU slots requested. The number of cores must be specified both when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are appended to with each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock the bundles. You can write the SDI file with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
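&lt;br /&gt;
Before docking, it may be worth confirming that every bundle listed in the SDI file actually exists. The sketch below assumes the SDI file is plain text with one bundle path per line, which is what the find command above writes; the helper names are hypothetical and not part of DOCK or ChemSTEP. &lt;br /&gt;

```python
from pathlib import Path


def read_sdi(sdi_path):
    """Return the non-empty lines of an SDI file (one bundle path per line)."""
    text = Path(sdi_path).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]


def missing_bundles(sdi_path):
    """Return the listed bundle paths that do not exist as files."""
    return [p for p in read_sdi(sdi_path) if not Path(p).is_file()]
```

An empty list from missing_bundles means every listed bundle was found; len(read_sdi(...)) gives the number of bundles that will be docked. &lt;br /&gt;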
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. The performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;AUTOMATIC CHEMSTEP RUNNING DIRECTIONS&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This version of ChemSTEP will automatically submit building and docking jobs in between rounds of ChemSTEP. Note that this version does not currently build with strain (to do: add this).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. DOCK the seed set to your receptor as outlined above in steps 1-4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Successful output will be scores_round_0.npy and indices_round_0.npy. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy necessary scripts into your working ChemSTEP directory&#039;&#039;&#039;&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/auto_chemstep/scripts/params.txt .&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/auto_chemstep/scripts/run_chemstep.py .&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/auto_chemstep/scripts/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Edit run_chemstep.py to add the path to your dockfiles&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The algorithm needs to know where your dockfiles live in order to submit docking jobs. Make sure all INDOCK parameters are set correctly. Use the full path to the dockfiles. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Edit params.txt&#039;&#039;&#039; The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user. The &#039;&#039;&#039;bundle_size:&#039;&#039;&#039; parameter is required for the auto docking mode of ChemSTEP. This is the number of molecules submitted per job for building. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Launch ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
   &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if your job errors out or stops running at any point, check the &#039;&#039;chemstep_submission.log&#039;&#039; file in the working directory.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
To get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17008</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17008"/>
		<updated>2025-12-11T19:53:33Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: dec 11 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T) or the 13.3B library from Enamine REAL. &#039;&#039;&#039;For detailed instructions on the 13B space, see the &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from the already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algorithm selects N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the largest min-Td among the molecules prioritized in that round, i.e. the Tanimoto distance from the most dissimilar prioritized molecule to its nearest beacon. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_0.3.1.5/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. Copy the one you want to use: &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
For the 13B space, copy the script from the 13B tutorial directory instead: &lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your docking output follows a different organizational convention, open get_scores.py in an editor and change the path. On success, the script writes a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line; for example, when docking the first round of prioritized molecules, pass 1 to produce scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For the 13B space, copy the script from the 13B tutorial directory instead: &lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For speed, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we convert the human-readable scores file into that format. The script expects a txt file with molecule IDs and DOCK scores (no header). Use the same round number you used in step 3. As run above, this outputs two files, scores_round_0.npy and indices_round_0.npy, that contain line-matched indices (determined from the molecule ID) and their respective docking scores. &lt;br /&gt;
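&lt;br /&gt;
Before moving on, it can be worth confirming that the two arrays really are line-matched. The check below is a hypothetical helper, not part of ChemSTEP; it only assumes that both files load with np.load and that scores are numeric. &lt;br /&gt;

```python
import numpy as np

def check_round_arrays(indices_path, scores_path):
    """Load one round of indices and scores and confirm they line up.

    Returns (number of molecules, best score). More negative is better.
    """
    indices = np.load(indices_path)
    scores = np.load(scores_path)
    if indices.shape != scores.shape:
        raise ValueError(
            f"not line-matched: {indices.shape} indices vs {scores.shape} scores"
        )
    if np.isnan(scores).any():
        raise ValueError("scores contain NaN values")
    return len(scores), float(scores.min())
```

For the seed set this would be called as check_round_arrays(&quot;indices_round_0.npy&quot;, &quot;scores_round_0.npy&quot;) from the directory that holds the converted files. &lt;br /&gt;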
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
For the 13B space, copy the files from the 13B tutorial directory instead: &lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
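&lt;br /&gt;
A small pre-flight check can catch a relative path or a missing value before the job is submitted. The sketch below assumes only the key: value layout shown above; the function names are hypothetical and ChemSTEP does not ship this check. &lt;br /&gt;

```python
import os

def read_params(text):
    """Parse "key: value" lines into a dict, skipping blank lines."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        params[key.strip()] = value.strip()
    return params

def check_params(params):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for key in ("seed_indices_file", "seed_scores_file"):
        path = params.get(key, "")
        if not os.path.isabs(path):
            problems.append(f"{key} is not an absolute path: {path!r}")
    for key in ("hit_pprop", "n_docked_per_round", "max_beacons", "max_n_rounds"):
        try:
            float(params[key])
        except (KeyError, ValueError):
            problems.append(f"{key} is missing or not a number")
    return problems

example = """
seed_indices_file: /absolute/path/to/your/indices_round_0.npy
seed_scores_file: /absolute/path/to/your/scores_round_0.npy
hit_pprop: 5
n_docked_per_round: 10000000
max_beacons: 150
max_n_rounds: 250
"""
print(check_params(read_params(example)))  # []
```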
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s fractional rank within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing the desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp + 2).&lt;br /&gt;
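&lt;br /&gt;
The arithmetic behind these numbers is short enough to write out. In the sketch below the helper names are hypothetical, and the quantile-based cutoff is only one plausible way to estimate a score threshold from a seed set; it is not taken from the ChemSTEP code. &lt;br /&gt;

```python
import numpy as np

def expected_hits(library_size, pprop):
    """Number of library molecules at or above a pProp level."""
    return library_size * 10.0 ** (-pprop)

def min_seed_size(pprop):
    """Suggested minimum seed set size, 10^(pProp + 2)."""
    return 10 ** (pprop + 2)

def score_cutoff(seed_scores, pprop):
    """Estimate the DOCK score at the pProp boundary from seed scores.

    Assumes more negative scores are better, so the boundary is the
    10^-pProp quantile of the seed score distribution.
    """
    return float(np.quantile(seed_scores, 10.0 ** (-pprop)))

print(round(expected_hits(1.1e12, 4)))  # 110000000
print(round(expected_hits(1.1e12, 5)))  # 11000000
print(min_seed_size(5))                 # 10000000
```

With pProp 5, the suggested minimum of 10 million seed molecules leaves roughly 100 seed molecules above the threshold, which gives the cutoff estimate and the beacon pool something to work with. &lt;br /&gt;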
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between rounds of ChemSTEP. Prioritizing many molecules will slow building and docking and, coupled with few beacons (below), may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. At this time, we strongly suggest running ChemSTEP as a job array with 16 or 32 CPU slots requested. The number of cores must be specified when calling CSAlgo (in run_chemstep.py) and the SGE wrapper(launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with new information during iterative rounds of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete paths to each built bundle, then dock. You can do this with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
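&lt;br /&gt;
Before docking, a quick check that the SDI file is non-empty and that every listed bundle exists can save a failed submission. The helper below is a hypothetical sketch that assumes the SDI file holds one path per line, as produced by the find command above. &lt;br /&gt;

```python
from pathlib import Path

def check_sdi(sdi_path):
    """Return (number of bundle paths, list of paths that do not exist)."""
    lines = Path(sdi_path).read_text().splitlines()
    paths = [ln.strip() for ln in lines if ln.strip()]
    missing = [p for p in paths if not Path(p).is_file()]
    return len(paths), missing
```

Multiplying the bundle count by the bundle size used for building (1000 in the command above) gives a rough upper bound to compare against the number of molecules in the prioritized SMILES file. &lt;br /&gt;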
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the DOCK score threshold for your pProp, as calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
For the 13B space, copy the scripts from the 13B tutorial directory instead: &lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;AUTOMATIC CHEMSTEP RUNNING DIRECTIONS&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This version of ChemSTEP will automatically submit building and docking jobs in between rounds of ChemSTEP. Note: this version does not currently build with strain (to do: add this).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. DOCK the seed set to your receptor as outlined above in steps 1-4&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Successful output will be scores_round_0.npy and indices_round_0.npy. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy necessary scripts into your working ChemSTEP directory&#039;&#039;&#039;&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/auto_chemstep/scripts/params.txt .&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/auto_chemstep/scripts/run_chemstep.py .&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/auto_chemstep/scripts/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Edit run_chemstep.py to add the path to your dockfiles&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The algorithm needs to know where your dockfiles live in order to submit docking jobs. Make sure all INDOCK parameters are set correctly. Use the full path to the dockfiles. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Edit params.txt&#039;&#039;&#039; The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user. The &#039;&#039;&#039;bundle_size:&#039;&#039;&#039; parameter is required for the auto docking mode of ChemSTEP. This is the number of molecules submitted per job for building. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Launch ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
   &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if your job errors out or stops running at any point, check the &#039;&#039;chemstep_submission.log&#039;&#039; file in the working directory.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
to get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17007</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=17007"/>
		<updated>2025-12-11T03:19:25Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: oct 8 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the infiniSee XReal library (1.1T) or the 13.3B library from Enamine REAL. &#039;&#039;&#039;For detailed instructions on the 13B space, see the &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from the already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances (Td) between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 is the molecule with the highest Td from both beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those from earlier rounds. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algorithm selects N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if the round size is 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log reports a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance between the most dissimilar prioritized molecule and its nearest beacon in that round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_0.3.1.5/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. Copy the one you want to use: &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
For the 13B space, copy the script from the 13B tutorial directory instead: &lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your docking output follows a different organizational convention, open get_scores.py in an editor and change the path. On success, the script writes a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line; for example, when docking the first round of prioritized molecules, pass 1 to produce scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For the 13B space, copy the script from the 13B tutorial directory instead: &lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For speed, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we convert the human-readable scores file into that format. The script expects a txt file with molecule IDs and DOCK scores (no header). Use the same round number you used in step 3. As run above, this outputs two files, scores_round_0.npy and indices_round_0.npy, that contain line-matched indices (determined from the molecule ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
For the 13B space, copy the files from the 13B tutorial directory instead: &lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s fractional rank within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing the desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp + 2).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between rounds of ChemSTEP. Prioritizing many molecules will slow building and docking and, coupled with few beacons (below), may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. At this time, we strongly suggest running ChemSTEP as a job array with 16 or 32 CPU slots requested. The number of cores must be specified when calling CSAlgo (in run_chemstep.py) and the SGE wrapper(launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with new information during iterative rounds of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock. You can write the SDI file with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. The performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;AUTOMATIC CHEMSTEP RUNNING DIRECTIONS&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This version of ChemSTEP will automatically submit building and docking jobs in between rounds of ChemSTEP. Note: this version does not currently build with strain (to do: add this).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. DOCK the seed set to your receptor as outlined above in steps 1-4&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Successful output will be scores_round_0.npy and indices_round_0.npy. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy necessary scripts into your working ChemSTEP directory&#039;&#039;&#039;&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/auto_chemstep/scripts/params.txt .&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/auto_chemstep/scripts/run_chemstep.py .&lt;br /&gt;
       cp /wynton/group/bks/work/shared/kholland/auto_chemstep/scripts/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Edit run_chemstep.py to add the path to your dockfiles&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The algorithm needs to know where your dockfiles live in order to submit docking jobs. Make sure all INDOCK parameters are set correctly. Use the full path to the dockfiles. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Edit params.txt&#039;&#039;&#039; The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    bundle_size: 1000&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user. The &#039;&#039;&#039;bundle_size:&#039;&#039;&#039; parameter is required for the auto docking mode of ChemSTEP. This is the number of molecules submitted per job for building. &lt;br /&gt;
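&lt;br /&gt;
Before submitting, it can help to sanity-check params.txt. The snippet below is a minimal sketch and not part of ChemSTEP: read_params and check_params are hypothetical helper names, the &quot;key: value&quot; layout is inferred from the example above, and the list of required keys is simply the keys shown in that example. &lt;br /&gt;

```python
# Hypothetical helpers for checking a params.txt laid out as "key: value" lines.
# The key list is copied from the example on this page; it is an assumption,
# not a list taken from the ChemSTEP source.
REQUIRED_KEYS = (
    "seed_indices_file",
    "seed_scores_file",
    "hit_pprop",
    "bundle_size",
    "n_docked_per_round",
    "max_beacons",
    "max_n_rounds",
)


def read_params(text):
    """Parse 'key: value' lines into a dict, converting whole numbers to int."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and lines without a key/value separator
        key, _, value = line.partition(":")
        value = value.strip()
        params[key.strip()] = int(value) if value.isdigit() else value
    return params


def check_params(params):
    """Return a list of problems; an empty list means nothing obvious is wrong."""
    problems = [key + " is missing" for key in REQUIRED_KEYS if key not in params]
    for key in ("seed_indices_file", "seed_scores_file"):
        if key in params and not str(params[key]).startswith("/"):
            problems.append(key + " should be an absolute path")
    return problems
```

To use it, pass the file contents, for example check_params(read_params(open(&quot;params.txt&quot;).read())), and fix anything it reports before running qsub. &lt;br /&gt;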
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Launch ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
   &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if your job errors out or stops running at any point, check the &#039;&#039;chemstep_submission.log&#039;&#039; file in the working directory.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
to get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16956</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16956"/>
		<updated>2025-10-09T01:39:57Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: oct 8 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the infiniSee XReal library (1.1T) or the 13.3B library from Enamine REAL. &#039;&#039;&#039;For detailed instructions on the 13B space, see the &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
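&lt;br /&gt;
The greedy selection described above can be sketched in Python. This is an illustration only, not the ChemSTEP implementation: fingerprints are plain sets of on-bit integers rather than real ECFP4 fingerprints, the function names are hypothetical, and &quot;highest Td from beacon 1 and beacon 2&quot; is read as the largest distance to the nearest already-assigned beacon (an assumption). The threshold is assumed to be a DOCK score, where more negative is better. &lt;br /&gt;

```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance (1 - similarity) between two sets of on-bits."""
    union = len(fp_a.union(fp_b))
    if union == 0:
        return 0.0
    return 1.0 - len(fp_a.intersection(fp_b)) / union


def pick_beacons(fingerprints, scores, score_threshold, max_beacons):
    """Greedy max-min selection of up to max_beacons well-scoring molecules."""
    # Candidates score at or better (more negative) than the threshold.
    candidates = [i for i, s in enumerate(scores) if score_threshold >= s]
    if not candidates or max_beacons == 0:
        return []
    # The first beacon is the best-scoring candidate.
    first = min(candidates, key=lambda i: scores[i])
    beacons = [first]
    # Distance from each remaining candidate to its nearest assigned beacon.
    nearest = {
        i: tanimoto_distance(fingerprints[i], fingerprints[first])
        for i in candidates
        if i != first
    }
    while nearest and max_beacons > len(beacons):
        # Take the candidate that is farthest from its nearest beacon.
        chosen = max(nearest, key=nearest.get)
        beacons.append(chosen)
        del nearest[chosen]
        for i in nearest:
            d = tanimoto_distance(fingerprints[i], fingerprints[chosen])
            nearest[i] = min(nearest[i], d)
    return beacons
```

In iterative rounds, the same idea would apply with the nearest-beacon distances initialised from all previously assigned beacons rather than from the best-scoring molecule alone. &lt;br /&gt;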
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algo will select N number of molecules for prioritization, the number of which is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. I.e. if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance of the most dissimilar molecule prioritized to any one beacon per round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
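&lt;br /&gt;
The prioritization step can be illustrated with a short Python sketch. It is not ChemSTEP code: prioritize is a hypothetical name, and the input is assumed to be a simple list holding each remaining molecule&#039;s minimum Td to any beacon (in practice these values live in the mintddistrib_*.npy files). &lt;br /&gt;

```python
def prioritize(min_td, n_per_round):
    """Pick the n molecules closest to any beacon.

    min_td[i] is molecule i's Tanimoto distance to its nearest beacon.
    Returns (picked indices, max-minTd), where max-minTd is the largest
    min-Td among the picked molecules, or None if nothing was picked.
    """
    # Sort molecule indices by increasing distance to the nearest beacon.
    order = sorted(range(len(min_td)), key=lambda i: min_td[i])
    picked = order[:n_per_round]
    max_min_td = max(min_td[i] for i in picked) if picked else None
    return picked, max_min_td
```

With min-Td values of 0.9, 0.2, 0.5, 0.1 and 0.7 and a round size of 3, the molecules at positions 3, 1 and 2 are picked and the max-minTd for the round is 0.5. &lt;br /&gt;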
&lt;br /&gt;
What the user needs: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_0.3.1.5/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
if you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your output from docking follows different organizational conventions, vim into get_scores.py and change the path. Successful output of the script will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers into the command line. i.e. When docking the first round of prioritized molecules, pass [1] for scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we are taking our readable scores file and converting it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value will define what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s rank fraction (rank divided by library size) within the total library score-distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate a DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp +2).&lt;br /&gt;
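&lt;br /&gt;
The pProp arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not ChemSTEP code: the function names are hypothetical, and pProp is taken as -log10 of the rank fraction (rank divided by library size), which is the reading that reproduces the 110 million and 11 million figures quoted above. &lt;br /&gt;

```python
import math


def pprop_from_rank(rank, library_size):
    """pProp of the molecule at a given rank (1 = best) in the library."""
    return -math.log10(rank / library_size)


def expected_virtual_hits(library_size, hit_pprop):
    """Number of library molecules at or above the chosen pProp."""
    return library_size * 10 ** (-hit_pprop)


def suggested_min_seed_size(hit_pprop):
    """Seed-set size suggested on this page: 10^(pProp + 2)."""
    return 10 ** (hit_pprop + 2)
```

For the 1.1T space, expected_virtual_hits(1.1e12, 5) gives roughly 11 million virtual hits, and suggested_min_seed_size(5) gives 10 million, so a 100 million molecule seed set satisfies the suggestion for a pProp of 5. &lt;br /&gt;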
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this number is the desired molecules for prioritization per round. When choosing this value, note that this number of molecules must be built and docked in between each round of ChemSTEP. Prioritizing many molecules will slow building and docking speeds, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. At this time, we strongly suggest running ChemSTEP as a job array with 16 or 32 CPU slots requested. The number of cores must be specified both when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (python errors, cluster issues, tracebacks if failed). Both files are updated by each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock. You can write the SDI file with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. The performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Helper scripts:&#039;&#039;&#039; a running list of scripts we have been using for analysis and visualization. &lt;br /&gt;
&lt;br /&gt;
to get SMILES for beacons: &lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/work/bwhall61/needs_github/get_beacon_smi.py --beacon_df_path /path/to/your/chemstep/output/complete_info/beacons.df --n_workers 6 --outfile beacon_smiles.smi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16955</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16955"/>
		<updated>2025-10-09T01:35:25Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: oct 8 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the infiniSee XReal library (1.1T) or the 13.3B library from Enamine REAL. &#039;&#039;&#039;For detailed instructions on the 13B space, see the &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algo will select N number of molecules for prioritization, the number of which is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. I.e. if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance of the most dissimilar molecule prioritized to any one beacon per round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_0.3.1.5/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
if you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (do this in a screen!!!):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory matching &amp;quot;output*&amp;quot; within the current working directory. If your docking output follows a different organizational convention, edit the path in get_scores.py. On success, the script writes a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass the round number on the command line; for example, after docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For speed, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we convert the human-readable scores file into that form. The script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files, scores_round_0.npy and indices_round_0.npy, which contain the docking scores and their line-matched indices (determined from the molecule ID), respectively. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPY arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is the negative base-10 logarithm of a molecule&#039;s fractional rank within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules; pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score (in kcal/mol) associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing a pProp: we suggest a seed set of at least 10^(pProp + 2) molecules, e.g. at least 10 million for pProp 5.&lt;br /&gt;
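&lt;br /&gt;
The arithmetic above can be sketched in Python. This is a minimal illustration and not part of ChemSTEP: the function names are hypothetical, pProp is assumed to be the negative base-10 logarithm of the fractional rank (which is what the 1.1T examples imply), and estimate_score_threshold is only one plausible way to read a score cutoff off the seed set, not necessarily how ChemSTEP computes it. &lt;br /&gt;

```python
def n_virtual_hits(library_size, pprop):
    """Number of library molecules inside the pProp zone (top 10^-pProp fraction)."""
    return round(library_size / 10 ** pprop)


def min_seed_set_size(pprop):
    """Suggested lower bound on seed set size from the text: 10^(pProp + 2)."""
    return 10 ** (pprop + 2)


def estimate_score_threshold(seed_scores, pprop):
    """Hypothetical estimate: the score of the k-th best seed molecule,
    where k is the number of seed molecules expected inside the pProp zone.
    DOCK scores are treated as better when more negative."""
    k = max(1, n_virtual_hits(len(seed_scores), pprop))
    return sorted(seed_scores)[k - 1]


xreal = 1_100_000_000_000  # 1.1T library size quoted in the text
print(n_virtual_hits(xreal, 4))  # size of the top 0.01% of the library
print(n_virtual_hits(xreal, 5))  # size of the top 0.001% of the library
print(min_seed_set_size(5))      # suggested minimum seed set for pProp 5
```

The same helper shows why small seed sets limit the usable pProp: with too few seed molecules, k collapses to 1 and the threshold rests on a single docked molecule. &lt;br /&gt;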
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between rounds of ChemSTEP. Prioritizing many molecules will slow building and docking and, coupled with few beacons (below), may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm&#039;s running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. At this time, we strongly suggest running ChemSTEP as a job array with 16 or 32 CPU slots requested. The number of cores must be specified both when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as anything output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated by every iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and that things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock those bundles. You can write the SDI file with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above (steps 3 and 4), updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should now have scores_round_1.npy and indices_round_1.npy (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2]; increase it by one for every subsequent round of ChemSTEP. The output will be a smi_round_*.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the molecules selected for docking, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16954</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16954"/>
		<updated>2025-10-09T01:32:03Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: oct 8 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the infiniSee XReal library (1.1T) or the 13.3B library from Enamine REAL. &#039;&#039;&#039;For detailed instructions on the 13B space, see the &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general usage directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from the already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
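&lt;br /&gt;
The greedy selection described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the ChemSTEP implementation: fingerprints are represented as plain sets of on-bit indices rather than real ECFP4 fingerprints, each new beacon is taken to be the candidate whose distance to its nearest already assigned beacon is largest, and the names tanimoto_distance and pick_beacons are hypothetical. &lt;br /&gt;

```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance: 1 - (shared on-bits) / (on-bits set in either)."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0
    return 1.0 - len(fp_a.intersection(fp_b)) / union


def pick_beacons(fingerprints, scores, score_threshold, max_beacons):
    """Greedy max-min diversity pick among molecules scoring at or below
    score_threshold (DOCK scores: more negative is better)."""
    candidates = [i for i, s in enumerate(scores) if score_threshold >= s]
    if not candidates or 1 > max_beacons:
        return []
    # The first beacon is the best-scoring candidate.
    first = min(candidates, key=lambda i: scores[i])
    beacons = [first]
    # min_td[i]: distance from candidate i to its nearest assigned beacon.
    min_td = {
        i: tanimoto_distance(fingerprints[i], fingerprints[first])
        for i in candidates
        if i != first
    }
    while min_td and max_beacons > len(beacons):
        # Next beacon: the candidate farthest from all assigned beacons.
        nxt = max(min_td, key=min_td.get)
        beacons.append(nxt)
        del min_td[nxt]
        for i in min_td:
            d = tanimoto_distance(fingerprints[i], fingerprints[nxt])
            if min_td[i] > d:
                min_td[i] = d
    return beacons


# Toy data: five molecules, hypothetical score threshold of -30.
toy_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {10, 11}]
toy_scores = [-50, -45, -40, -48, -10]
print(pick_beacons(toy_fps, toy_scores, -30, 3))
```

In the toy data, molecule 0 is the best scorer and becomes the first beacon, while molecule 4 is never considered because it scores worse than the threshold. For iterative rounds, min_td would be seeded with distances to all beacons from earlier rounds as well. &lt;br /&gt;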
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algorithm selects N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if the round size is 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log reports a &amp;quot;max-minTd&amp;quot; for each round, which is the largest min-Td among the molecules prioritized in that round, i.e. the distance from the most dissimilar prioritized molecule to its nearest beacon. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
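&lt;br /&gt;
The chaining update and the final selection can be sketched together in Python. This is a minimal illustration with hypothetical names (update_min_td, prioritize), not the ChemSTEP code: it keeps distances in plain lists rather than the mintddistrib_*.npy files, and it assumes that molecules docked in earlier rounds are excluded from selection. &lt;br /&gt;

```python
import heapq


def update_min_td(stored_min_td, td_to_new_beacons):
    """Per molecule, keep the smaller of the stored min-Td and the distance
    to the nearest beacon assigned in the current round."""
    return [min(old, new) for old, new in zip(stored_min_td, td_to_new_beacons)]


def prioritize(min_td, already_docked, n_per_round):
    """Pick the n_per_round undocked molecules with the smallest min-Td.

    Returns (picked indices, max-minTd), where max-minTd is the largest
    min-Td among the picked molecules, or None if nothing was picked."""
    pool = ((td, i) for i, td in enumerate(min_td) if i not in already_docked)
    chosen = heapq.nsmallest(n_per_round, pool)
    picked = [i for _, i in chosen]
    max_min_td = chosen[-1][0] if chosen else None
    return picked, max_min_td


# Toy library of five molecules; molecule 3 was docked in an earlier round.
stored = [0.9, 0.4, 0.7, 0.2, 0.6]
new_round = [0.5, 0.8, 0.3, 0.6, 0.1]
updated = update_min_td(stored, new_round)
picked, max_min_td = prioritize(updated, {3}, 2)
print(picked, max_min_td)
```

A rising max-minTd from round to round indicates that the prioritized molecules are drifting farther from the beacons, which is worth checking in chemstep_algo.log. &lt;br /&gt;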
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_0.3.1.5/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are running on the library of 13.2B compounds from Enamine REAL, several seed set sizes are available (130k, 1.3M, 13M, and 26M); copy the one you want to use: &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory matching &amp;quot;output*&amp;quot; within the current working directory. If your docking output follows a different naming convention, edit the path in get_scores.py. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass the round number on the command line; for example, after docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For speed, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we convert the human-readable scores file into that form. The script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files, scores_round_0.npy and indices_round_0.npy, which contain the docking scores and their line-matched indices (determined from the molecule ID), respectively. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPY arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is the negative base-10 logarithm of a molecule&#039;s fractional rank within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules; pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score (in kcal/mol) associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing a pProp: we suggest a seed set of at least 10^(pProp + 2) molecules, e.g. at least 10 million for pProp 5.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between rounds of ChemSTEP. Prioritizing many molecules will slow building and docking and, coupled with few beacons (below), may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm&#039;s running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. At this time, we strongly suggest running ChemSTEP as a job array with 16 or 32 CPU slots requested. The number of cores must be specified both when calling CSAlgo (in run_chemstep.py) and in the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as anything output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated by every iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and that things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock those bundles. You can write the SDI file with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above (steps 3 and 4), updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should now have scores_round_1.npy and indices_round_1.npy (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2]; increase it by one for every subsequent round of ChemSTEP. The output will be a smi_round_*.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the molecules selected for docking, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16953</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16953"/>
		<updated>2025-10-09T01:30:42Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: oct 8 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the infiniSee XReal library (1.1T) or the 13.3B library from Enamine REAL. &#039;&#039;&#039;For detailed instructions on the 13B space, see the &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general usage directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
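&lt;br /&gt;
The beacon assignment described above can be sketched in a few lines of Python. This is only an illustration, not the ChemSTEP code: pick_beacons and tanimoto_distance are hypothetical names, fingerprints are represented as plain sets of on-bit indices, more negative scores are assumed to be better, and &amp;quot;highest Td from beacon 1 and beacon 2&amp;quot; is interpreted as the largest distance to the nearest already-assigned beacon. &lt;br /&gt;

```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance between two fingerprints given as sets of on-bit indices."""
    shared = len(fp_a.intersection(fp_b))
    total = len(fp_a) + len(fp_b) - shared
    if total == 0:
        return 0.0
    return 1.0 - shared / total


def pick_beacons(fingerprints, scores, score_threshold, max_beacons, previous_beacons=()):
    """Greedy max-min diversity selection (hypothetical helper, not the ChemSTEP API).

    fingerprints: dict of mol_id -> set of on-bit indices
    scores: dict of mol_id -> DOCK score (assumed: more negative is better)
    previous_beacons: fingerprints of beacons assigned in earlier rounds
    """
    # Only molecules scoring better than the threshold can become beacons.
    candidates = [m for m in fingerprints if score_threshold > scores[m]]
    if not candidates or 1 > max_beacons:
        return []
    chosen = []
    reference = list(previous_beacons)
    if not reference:
        # First round: the best-scoring molecule is the first beacon.
        first = min(candidates, key=lambda m: scores[m])
        chosen.append(first)
        reference.append(fingerprints[first])
        candidates.remove(first)
    # Distance from each candidate to its nearest already-assigned beacon.
    nearest = {
        m: min(tanimoto_distance(fingerprints[m], r) for r in reference)
        for m in candidates
    }
    while candidates and max_beacons > len(chosen):
        # Take the candidate farthest from every beacon assigned so far.
        best = max(candidates, key=lambda m: nearest[m])
        chosen.append(best)
        candidates.remove(best)
        for m in candidates:
            d = tanimoto_distance(fingerprints[m], fingerprints[best])
            if nearest[m] > d:
                nearest[m] = d
    return chosen
```

Passing the fingerprints of earlier beacons as previous_beacons mimics iterative rounds, where new beacons must be diverse from ALL previously assigned beacons. &lt;br /&gt;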
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algorithm selects N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance from the most dissimilar prioritized molecule to its nearest beacon in that round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
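&lt;br /&gt;
As a rough sketch of this selection step, assuming the min-Td values are available as a simple mapping from molecule ID to distance (the real values live in the mintddistrib_*.npy files, whose layout is not described here), picking the round and reporting its max-minTd could look like the following. The function name prioritize is hypothetical. &lt;br /&gt;

```python
import heapq


def prioritize(min_td, round_size):
    """Pick the round_size molecules with the smallest min-Td to any beacon.

    min_td: dict of mol_id -> Tanimoto distance to the nearest beacon
    Returns (selected ids, max-minTd of the selection).
    """
    # nsmallest returns the entries sorted from smallest to largest distance.
    selected = heapq.nsmallest(round_size, min_td.items(), key=lambda item: item[1])
    ids = [mol_id for mol_id, _ in selected]
    # The last entry is the most dissimilar molecule that still made the cut.
    max_min_td = selected[-1][1] if selected else None
    return ids, max_min_td
```

A growing max-minTd from round to round would suggest that prioritization is reaching further away from the beacons. &lt;br /&gt;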
&lt;br /&gt;
What the user needs: DOCKFILES and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_0.3.1.5/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your docking output follows a different naming convention, edit get_scores.py and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line. For example, when docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take our readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s rank, expressed as a fraction of the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing the desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp + 2).&lt;br /&gt;
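&lt;br /&gt;
The arithmetic behind these numbers can be checked quickly. The helper names below are made up for illustration and are not part of ChemSTEP; the only inputs are the library size and the pProp value. &lt;br /&gt;

```python
import math


def pprop_of_rank(rank, library_size):
    """pProp of a molecule at a given rank: -log10 of its rank fraction."""
    return -math.log10(rank / library_size)


def expected_hits(library_size, pprop):
    """Number of molecules in the top 10**(-pprop) fraction of the library."""
    return library_size / 10 ** pprop


def suggested_min_seed_size(pprop):
    """Seed-set size suggested on this page: at least 10**(pProp + 2)."""
    return 10 ** (pprop + 2)


XREAL_SIZE = 1_100_000_000_000  # 1.1 trillion molecules

print(expected_hits(XREAL_SIZE, 4))   # 110 million virtual hits
print(expected_hits(XREAL_SIZE, 5))   # 11 million virtual hits
print(suggested_min_seed_size(5))     # 10 million seed molecules
```

With the 100 million molecule seed set, a pProp of up to 6 stays within the 10^(pProp + 2) suggestion. &lt;br /&gt;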
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the desired number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between each round of ChemSTEP. Prioritizing many molecules will slow building and docking and, coupled with few beacons (below), may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.** &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 16 CPU slots requested. The number of cores should be specified when calling CSAlgo (in run_chemstep.py) and the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with new information by each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, then dock. You can write the SDI file with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. The performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16952</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16952"/>
		<updated>2025-10-09T01:26:55Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: oct 8 2025 katie. current ver = 0.3.1.5 (automatic resubmission of failed SGE jobs). &lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of handling virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T) or the 13.3B library from Enamine REAL. &#039;&#039;&#039;For detailed instructions on the 13B space, see the &#039;Running ChemSTEP on the 13B space&#039; wiki page&#039;&#039;&#039;. For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
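&lt;br /&gt;
Conceptually, each chaining sub-job keeps a running minimum per molecule. The sketch below shows that update on plain Python lists; it is not the ChemSTEP implementation, the function name update_min_td is hypothetical, and the on-disk layout of the mintddistrib_*.npy files is not assumed here. &lt;br /&gt;

```python
def update_min_td(stored_min_td, new_td_to_beacons):
    """Keep, per molecule, the smallest Tanimoto distance seen to any beacon.

    stored_min_td: list of min-Td values carried over from earlier rounds
    new_td_to_beacons: list (one entry per molecule) holding that molecule's
        distances to the beacons assigned in the current round
    """
    updated = []
    for old, new_row in zip(stored_min_td, new_td_to_beacons):
        # A stored value is only replaced when a new beacon is closer.
        updated.append(min([old] + list(new_row)))
    return updated
```

Because the stored value can only decrease, a molecule that was close to a beacon in an earlier round stays close in every later round. &lt;br /&gt;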
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algorithm selects N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance from the most dissimilar prioritized molecule to its nearest beacon in that round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_0.3.1.5/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you are interested in running on a library of 13.2B compounds from Enamine REAL, there are several seed set sizes available: 130k, 1.3M, 13M, and 26M. &lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/130K_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/1.3M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/13M_seeds.wynton.sdi .&lt;br /&gt;
    cp /wynton/group/bks/work/shared/kholland/chemstep_13B/26M_seeds.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your docking output follows a different naming convention, edit get_scores.py and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line. For example, when docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take our readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
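&lt;br /&gt;
Before submitting, it can help to sanity-check params.txt. The sketch below assumes the file is nothing more than the &amp;quot;key: value&amp;quot; lines shown above; parse_params and check_params are hypothetical helpers, not part of ChemSTEP, and ChemSTEP itself may accept or require more than this. &lt;br /&gt;

```python
def parse_params(text):
    """Parse 'key: value' lines into a dict (assumed format, based on the example)."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so paths containing colons survive.
        key, _, value = line.partition(":")
        params[key.strip()] = value.strip()
    return params


def check_params(params):
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []
    for key in ("seed_indices_file", "seed_scores_file"):
        if not params.get(key, "").startswith("/"):
            problems.append(key + " should be an absolute path")
    for key in ("n_docked_per_round", "max_beacons", "max_n_rounds"):
        if not params.get(key, "").isdigit():
            problems.append(key + " should be a whole number")
    return problems
```

For example, running check_params(parse_params(open(&amp;quot;params.txt&amp;quot;).read())) in the ChemSTEP directory would list any relative paths or non-numeric values. &lt;br /&gt;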
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s rank, expressed as a fraction of the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing the desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp + 2).&lt;br /&gt;
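&lt;br /&gt;
One way to picture how a score threshold can be read off a seed set: because the seed set is a random sample, the score at its top 10^(-pProp) fraction approximates the score at the same fraction of the full library. The sketch below is an illustration under that assumption, not the estimator ChemSTEP actually uses; estimate_score_threshold is a hypothetical name and more negative DOCK scores are assumed to be better. &lt;br /&gt;

```python
import math


def estimate_score_threshold(seed_scores, pprop):
    """Score at the boundary of the top 10**(-pprop) fraction of a random seed set.

    Assumes more negative DOCK scores are better and that the seed set is a
    uniform random sample, so its rank fractions approximate the library's.
    """
    if not seed_scores:
        raise ValueError("seed_scores is empty")
    ordered = sorted(seed_scores)  # best (most negative) scores first
    # Number of seed molecules expected inside the pProp zone, at least one.
    n_hits = max(1, math.floor(len(ordered) / 10 ** pprop))
    return ordered[n_hits - 1]
```

This also shows why seed set size matters: with too few seed molecules, n_hits collapses to one and the threshold rests on a single docking score. &lt;br /&gt;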
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the desired number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between each round of ChemSTEP. Prioritizing many molecules will slow building and docking and, coupled with few beacons (below), may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.** &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 16 CPU slots requested. The number of cores should be specified when calling CSAlgo (in run_chemstep.py) and the SGE wrapper (launch_chemstep_as_job.sh). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with new information by each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_{}.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file containing the complete path to each built bundle, then dock. You can do this with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_{}.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve the docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should now have scores_round_1.npy and indices_round_1.npy (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/chemstep_13B_scripts_tutorial/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16951</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16951"/>
		<updated>2025-10-09T00:56:30Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: oct 8 2025 katie. current ver = 0.3.1.5&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as diverse as possible from already assigned beacons, i.e. a greedy max-diversity selection. &lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
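&lt;br /&gt;
The greedy selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ChemSTEP implementation: fingerprints are modeled as Python sets of on-bit indices rather than real ECFP4 bit vectors, the function names tanimoto_distance and pick_beacons are hypothetical, and the next beacon is taken to be the candidate with the largest distance to its nearest assigned beacon (a MaxMin reading of the description). &lt;br /&gt;

```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance (1 - Tanimoto similarity) between two sets of on-bits."""
    union = len(fp_a.union(fp_b))
    if union == 0:
        return 0.0
    return 1.0 - len(fp_a.intersection(fp_b)) / union


def pick_beacons(fps, scores, score_threshold, max_beacons, previous_beacon_fps=()):
    """Greedy max-diversity pick (hypothetical helper, not ChemSTEP code)."""
    # Candidates score better (more negative DOCK score) than the threshold.
    # min_td tracks each candidate's distance to its nearest assigned beacon,
    # including beacons from earlier rounds.
    min_td = {}
    for i, score in enumerate(scores):
        if score_threshold > score:
            min_td[i] = min(
                (tanimoto_distance(fps[i], b) for b in previous_beacon_fps),
                default=float("inf"),
            )

    beacons = []
    while min_td and max_beacons > len(beacons):
        if not beacons and not previous_beacon_fps:
            # Very first beacon: the best-scoring candidate.
            chosen = min(min_td, key=lambda i: scores[i])
        else:
            # Otherwise: the candidate farthest from its nearest beacon.
            chosen = max(min_td, key=min_td.get)
        beacons.append(chosen)
        del min_td[chosen]
        for i in min_td:
            min_td[i] = min(min_td[i], tanimoto_distance(fps[i], fps[chosen]))
    return beacons


# Toy data: five molecules, one of which (index 3) scores too poorly.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {10, 11}]
scores = [-50.0, -45.0, -40.0, -10.0, -42.0]
print(pick_beacons(fps, scores, score_threshold=-30.0, max_beacons=3))
```

If fewer candidates than max_beacons pass the threshold, the sketch simply returns fewer beacons, which matches the behavior described for max_beacons. &lt;br /&gt;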
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algo will select N number of molecules for prioritization, the number of which is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. I.e. if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance of the most dissimilar molecule prioritized to any one beacon per round. The prioritized molecules are output in /output/complete_info as smi_round_{}.smi &lt;br /&gt;
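&lt;br /&gt;
The prioritization rule can be illustrated with a short NumPy sketch. It assumes the per-molecule min-Td values are already available as a single in-memory array, which will not be the case for a real trillion-scale library; the function name prioritize is hypothetical and this is not the ChemSTEP code. &lt;br /&gt;

```python
import numpy as np


def prioritize(min_td, n_docked_per_round):
    """Return indices of the molecules closest to any beacon, plus the max-minTd."""
    n = min(n_docked_per_round, min_td.size)
    if n == 0:
        return np.array([], dtype=int), None
    # argpartition places the n smallest values in the first n slots (unordered).
    picked = np.argpartition(min_td, n - 1)[:n]
    # max-minTd: distance of the most dissimilar molecule that was still picked.
    max_min_td = float(min_td[picked].max())
    return np.sort(picked), max_min_td


min_td = np.array([0.9, 0.1, 0.5, 0.3, 0.7])
picked, max_min_td = prioritize(min_td, n_docked_per_round=3)
print(picked, max_min_td)
```

A max-minTd that grows from round to round indicates that the algorithm is reaching farther from the beacons to fill each round. &lt;br /&gt;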
&lt;br /&gt;
What the user needs: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_0.3.1.5/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your output from docking follows different naming conventions, vim into get_scores.py and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers into the command line. i.e. When docking the first round of prioritized molecules, pass [1] for scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take our readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s rank within the total library score-distribution, with rank expressed as a fraction of the library. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing your desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp +2).&lt;br /&gt;
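&lt;br /&gt;
The arithmetic behind these numbers can be checked with a few lines of Python. The helper names below are hypothetical and only restate the definitions given above; they are not part of ChemSTEP. &lt;br /&gt;

```python
def virtual_hits(library_size, pprop):
    """Number of library molecules at or above a pProp (top 10^-pProp fraction)."""
    return library_size / 10 ** pprop


def min_seed_size(pprop):
    """Suggested minimum seed-set size, 10^(pProp + 2)."""
    return 10 ** (pprop + 2)


def expected_seed_hits(seed_size, pprop):
    """Expected number of seed-set molecules that fall in the hit zone."""
    return seed_size / 10 ** pprop


library_size = 1_100_000_000_000  # 1.1T XReal space
seed_size = 100_000_000           # 100 million molecule seed set

for pprop in (4, 5):
    print(pprop, virtual_hits(library_size, pprop),
          min_seed_size(pprop), expected_seed_hits(seed_size, pprop))
```

The 10^(pProp +2) suggestion amounts to expecting roughly 100 or more seed-set molecules in the hit zone, which gives the score threshold estimate and the beacon selection something to work with. &lt;br /&gt;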
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the desired number of molecules to prioritize per round. When choosing this value, note that this number of molecules must be built and docked in between each round of ChemSTEP. Prioritizing many molecules will slow building and docking, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 64 CPU slots requested, which is specified in the default wrapper. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
A successful ChemSTEP run writes a SMILES file to output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (python errors, cluster issues, tracebacks if failed). Both files are appended to by every iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file containing the complete path to each built bundle, then dock. You can do this with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_N.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve the docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should now have scores_round_1.npy and indices_round_1.npy (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16950</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16950"/>
		<updated>2025-10-09T00:51:23Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: oct 8 2025 katie. current ver = 0.3.1.5&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as far as possible, by Td, from all assigned beacons.&lt;br /&gt;
&lt;br /&gt;
During assignment, ECFP4 Tanimoto distances between the assigned beacon and all potential beacons are calculated. The second beacon is the molecule with the maximum Td from beacon 1. Beacon 3 will be the molecule that has the highest Td from beacon 1 and beacon 2. This continues until the number of beacons specified in the parameter file is reached. In iterative rounds of ChemSTEP, the beacon selection is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
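&lt;br /&gt;
The per-chunk update performed during chaining can be pictured with the NumPy sketch below. It assumes fingerprints are stored as dense 0/1 arrays and that the running minimum for a chunk is a plain array; the actual on-disk layout of the mintddistrib_*.npy files is not described here, and the function names are hypothetical. &lt;br /&gt;

```python
import numpy as np


def tanimoto_distances(library_fps, beacon_fps):
    """Pairwise Tanimoto distances between 0/1 fingerprint rows.

    library_fps: (n_mols, n_bits) array, beacon_fps: (n_beacons, n_bits) array.
    """
    lib = library_fps.astype(np.int32)
    bea = beacon_fps.astype(np.int32)
    common = lib @ bea.T  # shared on-bits per (molecule, beacon) pair
    union = lib.sum(axis=1)[:, None] + bea.sum(axis=1)[None, :] - common
    sim = np.divide(common, union,
                    out=np.zeros(common.shape, dtype=float),
                    where=union > 0)  # empty-vs-empty pairs keep similarity 0
    return 1.0 - sim


def update_min_td(running_min_td, library_fps, new_beacon_fps):
    """Keep, per molecule, the smallest distance to any beacon seen so far."""
    nearest_new = tanimoto_distances(library_fps, new_beacon_fps).min(axis=1)
    return np.minimum(running_min_td, nearest_new)


chunk = np.array([[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]])
new_beacons = np.array([[1, 1, 0, 0]])
running = np.array([1.0, 1.0, 0.5])
print(update_min_td(running, chunk, new_beacons))
```

Because only the element-wise minimum is kept, each round only needs distances to the newly assigned beacons in order to keep the distance to the nearest beacon across all rounds. &lt;br /&gt;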
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algo will select N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. I.e. if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance of the most dissimilar molecule prioritized to any one beacon per round. &lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_0.3.1.5/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your output from docking follows different naming conventions, vim into get_scores.py and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers into the command line. i.e. When docking the first round of prioritized molecules, pass [1] for scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take our readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
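&lt;br /&gt;
Before copying the arrays into the ChemSTEP directory, it can be worth confirming that the two files really are line-matched. The check below is an optional sketch, not part of the ChemSTEP scripts: check_round_arrays is a hypothetical helper, and it assumes both files hold one-dimensional arrays. The demonstration at the bottom writes toy arrays to a temporary directory so the block runs on its own. &lt;br /&gt;

```python
import os
import tempfile

import numpy as np


def check_round_arrays(scores_path, indices_path):
    """Load a scores/indices pair and verify they have matching lengths."""
    scores = np.load(scores_path)
    indices = np.load(indices_path)
    if scores.ndim != 1 or indices.ndim != 1:
        raise ValueError("expected one-dimensional arrays")  # assumption
    if scores.shape[0] != indices.shape[0]:
        raise ValueError(
            "scores and indices are not line-matched: "
            f"{scores.shape[0]} scores vs {indices.shape[0]} indices"
        )
    return scores.shape[0]


# Toy demonstration with made-up values (not real docking output).
with tempfile.TemporaryDirectory() as tmp:
    scores_file = os.path.join(tmp, "scores_round_0.npy")
    indices_file = os.path.join(tmp, "indices_round_0.npy")
    np.save(scores_file, np.array([-41.2, -38.7, -35.0]))
    np.save(indices_file, np.array([10, 42, 77]))
    print(check_round_arrays(scores_file, indices_file), "molecules")
```

For a real run, point the two paths at the scores_round_N.npy and indices_round_N.npy files produced by convert_scores_to_npy.py. &lt;br /&gt;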
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s rank within the total library score-distribution, with rank expressed as a fraction of the library. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing your desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp +2).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the desired number of molecules to prioritize per round. When choosing this value, note that this number of molecules must be built and docked in between each round of ChemSTEP. Prioritizing many molecules will slow building and docking, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if prospectively running ChemSTEP, as outlined in this wiki, no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 64 CPU slots requested, which is specified in the default wrapper. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with new information by each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file containing the complete path to each built bundle, and then dock. You can do this with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_N.wynton.sdi&lt;br /&gt;
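&lt;br /&gt;
Before docking, it may be worth confirming that every path written to the SDI file still exists. The sketch below is one hedged way to do that; the function name missing_bundles is hypothetical, and the only assumption about the SDI format is the one stated above (one bundle path per line).&lt;br /&gt;

```python
from pathlib import Path


def missing_bundles(sdi_path):
    """Return the paths listed in an SDI file that do not exist on disk.

    Assumes one path per line; blank lines are ignored.
    Hypothetical helper for illustration only.
    """
    lines = Path(sdi_path).read_text().splitlines()
    paths = [line.strip() for line in lines if line.strip()]
    return [p for p in paths if not Path(p).is_file()]


# Example call (the file name follows the find command above):
# print(missing_bundles("round_1.wynton.sdi"))
```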
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should now have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_iterative.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16949</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16949"/>
		<updated>2025-10-08T22:24:52Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: oct 8 2025 katie. current ver = 0.3.1.5&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as far as possible, by Tanimoto distance (Td), from all assigned beacons. In iterative rounds of ChemSTEP, the first beacon chosen is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algorithm outputs N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance from the most dissimilar molecule prioritized in that round to its nearest beacon. &lt;br /&gt;
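&lt;br /&gt;
The gathering step described in item 3 can be sketched as follows: given one min-Td value per remaining library molecule, take the round-size molecules with the smallest values and report the largest of them as the max-minTd. This is a minimal sketch under stated assumptions; the function name prioritize and the toy array are hypothetical, and the real algorithm works over the mintddistrib_*.npy files rather than a single in-memory array.&lt;br /&gt;

```python
import numpy as np


def prioritize(min_td, n_docked_per_round):
    """Pick the n molecules with the smallest min-Td and report the max-minTd.

    min_td holds one value per undocked library molecule: its Tanimoto
    distance to the nearest beacon. n_docked_per_round must be at least 1.
    Hypothetical helper for illustration, not a ChemSTEP function.
    """
    n = min(n_docked_per_round, min_td.size)
    # argpartition places the n smallest values first without a full sort.
    chosen = np.argpartition(min_td, n - 1)[:n]
    max_min_td = float(min_td[chosen].max())
    return np.sort(chosen), max_min_td


min_td = np.array([0.42, 0.10, 0.77, 0.25, 0.31])
indices, max_min_td = prioritize(min_td, 3)
print(indices, max_min_td)  # [1 3 4] 0.31
```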
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_0.3.1.5/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your output from docking follows a different naming convention, open get_scores.py in an editor (e.g. vim) and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers on the command line. For example, when docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take our readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
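&lt;br /&gt;
Since the two arrays must be line-matched, a quick check before copying them into the ChemSTEP directory can catch a truncated conversion. The sketch below assumes only that both files load with numpy.load and should have identical shapes; the function name check_line_matched is hypothetical.&lt;br /&gt;

```python
import numpy as np


def check_line_matched(indices_file, scores_file):
    """Load the two arrays and confirm there is exactly one score per index.

    Hypothetical helper for illustration; it makes no assumption about dtypes.
    """
    indices = np.load(indices_file)
    scores = np.load(scores_file)
    if indices.shape != scores.shape:
        raise ValueError(f"shape mismatch: {indices.shape} vs {scores.shape}")
    return indices, scores


# Example call, using the file names produced in this step:
# indices, scores = check_line_matched("indices_round_0.npy", "scores_round_0.npy")
```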
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s rank fraction (rank divided by library size) within the total library score-distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing your desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp +2).&lt;br /&gt;
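&lt;br /&gt;
The pProp arithmetic above can be checked with a few lines of Python. The sketch assumes an integer pProp and reads the worked example (pProp 4 = top 0.01%) as a top fraction of 10^(-pProp); the function names are hypothetical.&lt;br /&gt;

```python
def virtual_hits(library_size, pprop):
    """Number of library molecules in the top 10**(-pprop) fraction."""
    return library_size // 10**pprop


def suggested_min_seed_size(pprop):
    """Seed-set size suggested on this page: at least 10**(pProp + 2)."""
    return 10 ** (pprop + 2)


library_size = 1_100_000_000_000  # 1.1T XReal space, as quoted above
print(virtual_hits(library_size, 4))   # 110000000
print(virtual_hits(library_size, 5))   # 11000000
print(suggested_min_seed_size(5))      # 10000000
```

By this rule of thumb, a 100 million molecule seed set is large enough for a pProp of up to 6.&lt;br /&gt;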
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this number is the desired molecules for prioritization per round. When choosing this value, note that this number of molecules must be built and docked in between each round of ChemSTEP. Prioritizing many molecules will slow building and docking speeds, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if you are running ChemSTEP prospectively, as outlined in this wiki, you do not need to change this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 64 CPU slots requested, which is specified in the default wrapper. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with new information by each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file containing the complete path to each built bundle, and then dock. You can do this with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_N.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should now have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_iterative.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16948</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16948"/>
		<updated>2025-10-08T21:37:29Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: oct 8 2025 katie. current ver = 0.3.1.5&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Main ChemSTEP job:&#039;&#039;&#039; reads input files (DOCK scores and line-matched indices NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as far as possible, by Tanimoto distance (Td), from all assigned beacons. In iterative rounds of ChemSTEP, the first beacon chosen is based on diversity from ALL PREVIOUSLY assigned beacons, including those in earlier rounds. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. ChainingLog sub-jobs:&#039;&#039;&#039; once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Gathering of SMILES for prioritization:&#039;&#039;&#039; When all Td calculations are completed, the algorithm outputs N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if round size = 1 million, the 1 million molecules with the smallest min-Td to ANY beacon (current or previous rounds) are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the Tanimoto distance from the most dissimilar molecule prioritized in that round to its nearest beacon. &lt;br /&gt;
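&lt;br /&gt;
The beacon assignment described in item 1 reads like a greedy max-min (farthest-point) selection, and the sketch below illustrates that reading. It is an interpretation, not the ChemSTEP implementation: it assumes the candidates have already been filtered to those above the pProp threshold, that a full pairwise Tanimoto distance matrix fits in memory, and that more negative DOCK scores are better. The function name pick_beacons and the toy data are hypothetical.&lt;br /&gt;

```python
import numpy as np


def pick_beacons(distance_matrix, scores, max_beacons):
    """Greedy max-min selection of beacons among candidate hits.

    distance_matrix[i, j] is the Tanimoto distance between candidates i and j.
    Hypothetical helper for illustration, not a ChemSTEP function.
    """
    # The first beacon is the best-scoring (most negative) candidate.
    chosen = [int(np.argmin(scores))]
    # Distance from every candidate to its nearest chosen beacon so far.
    nearest = distance_matrix[chosen[0]].copy()
    for _ in range(min(max_beacons, len(scores)) - 1):
        # Next beacon: the candidate farthest from all chosen beacons.
        nxt = int(np.argmax(nearest))
        chosen.append(nxt)
        nearest = np.minimum(nearest, distance_matrix[nxt])
    return chosen


scores = np.array([-40.0, -55.0, -38.0, -50.0])
distances = np.array([
    [0.0, 0.2, 0.9, 0.6],
    [0.2, 0.0, 0.8, 0.5],
    [0.9, 0.8, 0.0, 0.7],
    [0.6, 0.5, 0.7, 0.0],
])
print(pick_beacons(distances, scores, 3))  # [1, 2, 3]
```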
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_0.3.1.5/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your output from docking follows a different naming convention, open get_scores.py in an editor (e.g. vim) and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers on the command line. For example, when docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take our readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s rank fraction (rank divided by library size) within the total library score-distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing your desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp +2).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this number is the desired molecules for prioritization per round. When choosing this value, note that this number of molecules must be built and docked in between each round of ChemSTEP. Prioritizing many molecules will slow building and docking speeds, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if you are running ChemSTEP prospectively, as outlined in this wiki, you do not need to change this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 64 CPU slots requested, which is specified in the default wrapper. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
A successful ChemSTEP run writes a SMILES file to output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as anything output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with new information during each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and that things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check chemstep_submission.log first. Any traceback or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/. If ChemSTEP fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors during the chaining step in subsequent rounds can corrupt the chaining files within the /output directory that are needed for prioritization. At the very least, you will have duplicate beacons and duplicated information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
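&lt;br /&gt;
A quick way to act on the troubleshooting advice is to scan the submission log for obvious failure markers before reading it in full. The sketch below is a hedged example and not part of ChemSTEP: the function name and the marker strings are assumptions, and only the log file name is taken from this page.&lt;br /&gt;

```python
from pathlib import Path

# Assumed failure markers; adjust to whatever your logs actually contain.
MARKERS = ("Traceback", "Error", "error")


def find_suspect_lines(log_text, markers=MARKERS):
    """Return (line_number, line) pairs for lines containing any marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(marker in line for marker in markers):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    log_path = Path("chemstep_submission.log")
    if log_path.exists():
        for number, line in find_suspect_lines(log_path.read_text()):
            print(f"{log_path.name}:{number}: {line}")
    else:
        print(f"{log_path} not found in the current directory")
```

Run it from the directory you ran ChemSTEP in. An empty result does not prove the run succeeded, so still check for the SMILES file.&lt;br /&gt;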
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file with the complete paths to each built bundle, then dock. You can do this with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_N.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16947</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16947"/>
		<updated>2025-10-08T21:32:40Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: oct 8 2025 katie. current ver = 0.3.1.5&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we dock a random sample of the total library (termed the &amp;quot;seed set&amp;quot;, round zero) to the target of interest. From this seed set, we can estimate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; (high-scoring molecules) in the total library. ChemSTEP then identifies, from the seed set, a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;). These beacons guide prioritization: the molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built and docked, and their scores are returned to ChemSTEP for a second round of prioritization. This process is repeated until you reach the desired virtual hit recovery or are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in roughly three steps: (1) submission of the main ChemSTEP job, (2) the ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
1. Main ChemSTEP job: reads input files (DOCK scores and line-matched indices as NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as far as possible, by Tanimoto distance (Td), from all assigned beacons. In iterative rounds of ChemSTEP, the first beacon is chosen based on diversity from ALL PREVIOUSLY assigned beacons, including those from earlier rounds. &lt;br /&gt;
&lt;br /&gt;
2. ChainingLog sub-jobs: once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
&lt;br /&gt;
3. Gathering of SMILES for prioritization: when all Td calculations are complete, the algorithm chooses N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if round size = 1 million, the 1 million molecules with the smallest min-Td to any beacon are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the largest min-Td among the molecules prioritized in that round (the distance from the most dissimilar prioritized molecule to its nearest beacon). &lt;br /&gt;
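&lt;br /&gt;
The three steps above can be illustrated on a toy library. The sketch below is not the ChemSTEP implementation: fingerprints are modeled as Python sets of on-bit indices, all function names are hypothetical, and beacon selection is written as a simple greedy max-min picker, which is one assumed reading of &amp;quot;as far as possible from all assigned beacons&amp;quot;. It assumes more negative scores are better.&lt;br /&gt;

```python
def tanimoto_distance(fp_a, fp_b):
    """Td = 1 - |A intersect B| / |A union B| for fingerprints given as sets."""
    union = len(fp_a.union(fp_b))
    if union == 0:
        return 0.0
    return 1.0 - len(fp_a.intersection(fp_b)) / union


def pick_beacons(scores, fps, threshold, max_beacons):
    """Greedy max-min selection among molecules scoring at or below threshold."""
    candidates = [i for i in range(len(scores)) if threshold >= scores[i]]
    if not candidates or 1 > max_beacons:
        return []
    beacons = [min(candidates, key=lambda i: scores[i])]  # best scorer first
    while max_beacons > len(beacons) and len(candidates) > len(beacons):
        remaining = [i for i in candidates if i not in beacons]
        # Next beacon: the candidate whose nearest beacon is farthest away.
        beacons.append(max(
            remaining,
            key=lambda i: min(tanimoto_distance(fps[i], fps[b]) for b in beacons),
        ))
    return beacons


def prioritize(library_fps, beacon_fps, n_per_round):
    """Return the n library indices with the smallest min-Td, plus the max-minTd."""
    min_td = [min(tanimoto_distance(fp, b) for b in beacon_fps) for fp in library_fps]
    order = sorted(range(len(library_fps)), key=lambda i: min_td[i])[:n_per_round]
    return order, max(min_td[i] for i in order)


# Toy seed set: four docked molecules, three of which pass the threshold.
seed_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
seed_scores = [-50.0, -48.0, -47.0, -20.0]
beacons = pick_beacons(seed_scores, seed_fps, threshold=-45.0, max_beacons=2)
print(beacons)  # [0, 2]

# Toy undocked library: pick the two molecules nearest to any beacon.
library_fps = [{1, 2, 3, 5}, {7, 8}, {20, 21}]
order, max_min_td = prioritize(library_fps, [seed_fps[i] for i in beacons], 2)
print(order)  # [0, 1]
```

Molecule 1 scores better than molecule 2 but is not picked as the second beacon, because it sits close to the first beacon. That is the diversity trade-off controlled by max_beacons.&lt;br /&gt;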
   &lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_0.3.1.5/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your docking output follows a different naming convention, edit get_scores.py and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers on the command line. For example, when docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we take our readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter or make a directory to run ChemSTEP in. Copy in the files needed to initiate ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Also copy in your score and indices NumPy files, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The remaining values in params.txt are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is the -log10 of a molecule&#039;s rank fraction (its rank percentage divided by 100) within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing the desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp + 2).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between rounds of ChemSTEP. Prioritizing many molecules will slow building and docking, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. Any molecule that scores better than the assigned pProp threshold may be considered as a beacon. By default, ChemSTEP chooses each set of beacons to be as diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the requested value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if you are running ChemSTEP prospectively, as outlined in this wiki, you do not need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 64 CPU slots requested, which is specified in the default wrapper. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
A successful ChemSTEP run writes a SMILES file to output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as anything output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with new information during each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and that things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check chemstep_submission.log first. Any traceback or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/. If ChemSTEP fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors during the chaining step in subsequent rounds can corrupt the chaining files within the /output directory that are needed for prioritization. At the very least, you will have duplicate beacons and duplicated information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file with the complete paths to each built bundle, then dock. You can do this with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_N.wynton.sdi&lt;br /&gt;
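&lt;br /&gt;
Before docking, it can help to confirm that the SDI file is non-empty and that every listed bundle exists. The following is a minimal sketch with a hypothetical helper name and script name; it assumes only what the find command above produces, namely one bundle path per line.&lt;br /&gt;

```python
import sys
from pathlib import Path


def check_sdi(sdi_path):
    """Return (n_listed, missing) for an SDI file with one bundle path per line."""
    lines = Path(sdi_path).read_text().splitlines()
    paths = [line.strip() for line in lines if line.strip()]
    missing = [p for p in paths if not Path(p).is_file()]
    return len(paths), missing


if __name__ == "__main__":
    if len(sys.argv) == 2:
        n_listed, missing = check_sdi(sys.argv[1])
        print(f"{n_listed} bundles listed, {len(missing)} missing")
        for path in missing:
            print("missing:", path)
    else:
        print("usage: python check_sdi.py round_N.wynton.sdi")
```

A count of zero usually means the find path or the bundle name pattern did not match your building output.&lt;br /&gt;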
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16946</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16946"/>
		<updated>2025-10-08T21:32:09Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: oct 8 2025 katie. current ver = 0.3.1.5&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we dock a random sample of the total library (termed the &amp;quot;seed set&amp;quot;, round zero) to the target of interest. From this seed set, we can estimate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; (high-scoring molecules) in the total library. ChemSTEP then identifies, from the seed set, a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;). These beacons guide prioritization: the molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built and docked, and their scores are returned to ChemSTEP for a second round of prioritization. This process is repeated until you reach the desired virtual hit recovery or are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in roughly three steps: (1) submission of the main ChemSTEP job, (2) the ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
   1. Main ChemSTEP job: reads input files (DOCK scores and line-matched indices as NumPy arrays) and assigns beacons. Beacons are a set of N maximally diverse molecules that score above the pProp threshold specified in the parameter file. For the first round of ChemSTEP, the best-scoring molecule from the seed set is assigned as the first beacon. Subsequent beacons are molecules that score well (above the pProp threshold) and are as far as possible, by Tanimoto distance (Td), from all assigned beacons. In iterative rounds of ChemSTEP, the first beacon is chosen based on diversity from ALL PREVIOUSLY assigned beacons, including those from earlier rounds. &lt;br /&gt;
   2. ChainingLog sub-jobs: once beacons are assigned, ChemSTEP will launch a series of sub-jobs that calculate the Tanimoto distances of every molecule remaining in the library to the assigned beacons. Calculated distances to the NEAREST beacon (minimum-minimum Td) are updated in the mintddistrib_*.npy files within the /output directory. &lt;br /&gt;
   3. Gathering of SMILES for prioritization: when all Td calculations are complete, the algorithm chooses N molecules for prioritization, where N is specified by the user in the parameter file. Molecules that are &#039;&#039;closest in chemical space&#039;&#039; to beacons are prioritized. For example, if round size = 1 million, the 1 million molecules with the smallest min-Td to any beacon are prioritized. chemstep_algo.log will provide a &amp;quot;max-minTd&amp;quot; for each round, which is the largest min-Td among the molecules prioritized in that round (the distance from the most dissimilar prioritized molecule to its nearest beacon). &lt;br /&gt;
   &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_0.3.1.5/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your docking output follows a different naming convention, edit get_scores.py and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers on the command line. For example, when docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we take our readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter or make a directory to run ChemSTEP in. Copy in the files needed to initiate ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Also copy in your score and indices NumPy files, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The remaining values in params.txt are left to the discretion of the user, with some considerations below. &lt;br /&gt;
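&lt;br /&gt;
Since params.txt is only edited once, it is worth checking before submission. The sketch below parses the key: value layout shown above and applies the seed-size suggestion of 10^(pProp + 2) given on this page. It is an illustrative check with hypothetical function names and assumes an integer hit_pprop; ChemSTEP itself may read the file differently.&lt;br /&gt;

```python
import os


def parse_params(text):
    """Parse 'key: value' lines into a dict of strings (assumed layout)."""
    params = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            params[key.strip()] = value.strip()
    return params


def check_params(params, seed_size):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for key in ("seed_indices_file", "seed_scores_file"):
        if not os.path.isabs(params.get(key, "")):
            problems.append(f"{key} should be an absolute path")
    pprop = int(params.get("hit_pprop", 0))  # assumes an integer pProp
    suggested = 10 ** (pprop + 2)
    if suggested > seed_size:
        problems.append(
            f"seed set of {seed_size} is below the suggested {suggested} for pProp {pprop}"
        )
    return problems


example = """
seed_indices_file: /absolute/path/to/your/indices_round_0.npy
seed_scores_file: /absolute/path/to/your/scores_round_0.npy
hit_pprop: 5
n_docked_per_round: 10000000
max_beacons: 150
max_n_rounds: 250
"""
print(check_params(parse_params(example), seed_size=100_000_000))  # []
```

The check does not confirm that the .npy files exist or load correctly; it only catches relative paths and an undersized seed set.&lt;br /&gt;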
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value will define what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log(rank %) of a molecule within the total library score-distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate a DOCK score value in kcal/mol that associated with the lower-limit of your desired pProp zone. Any molecules that score better than the threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing desired pProp. Our suggestion is that the size of the seed set should be at least 10^(pProp +2).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this number is the desired molecules for prioritization per round. When choosing this value, note that this number of molecules must be built and docked in between each round of ChemSTEP. Prioritizing many molecules will slow building and docking speeds, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if you are running ChemSTEP prospectively, as outlined in this wiki, there is no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 64 CPU slots requested, which is specified in the default wrapper. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with new information during each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and that things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock those bundles. You can write the SDI file with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_N.wynton.sdi&lt;br /&gt;
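&lt;br /&gt;
If you would rather write the SDI file from Python (for example, to check the bundle count at the same time), the find command can be mirrored with pathlib. This is a minimal sketch, not part of ChemSTEP or DOCK; the function name write_sdi is hypothetical, and it assumes every built bundle is named bundle.db2.tgz as in the command above.&lt;br /&gt;

```python
from pathlib import Path


def write_sdi(building_output, sdi_path, bundle_name="bundle.db2.tgz"):
    """Write one absolute bundle path per line; return the number of bundles found."""
    root = Path(building_output).resolve()
    # rglob walks the building output recursively, like `find -name ... -type f`
    bundles = sorted(p for p in root.rglob(bundle_name) if p.is_file())
    Path(sdi_path).write_text("".join(f"{p}\n" for p in bundles))
    return len(bundles)
```

Calling write_sdi with your building output directory and round_N.wynton.sdi returns the number of bundles written, which can be compared against the number of bundles you expected from building.&lt;br /&gt;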
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should now have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_iterative.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2]; it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16900</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16900"/>
		<updated>2025-09-18T20:15:48Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: Sept 18 2025 katie. current ver = 0.3.1&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in roughly three steps: (1) submission of the main ChemSTEP job, (2) the ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your docking output follows a different naming convention, edit get_scores.py and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers on the command line, e.g. when docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we take our readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The remaining values in params.txt are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is the -log10 of a molecule&#039;s rank, expressed as a fraction of the library, within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score (in kcal/mol) associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing the desired pProp. We suggest a seed set of at least 10^(pProp + 2) molecules.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between each round of ChemSTEP. Prioritizing many molecules will slow building and docking, and, coupled with few beacons (below), may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
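&lt;br /&gt;
To build intuition for how max_beacons interacts with diversity, here is a toy sketch of one common way to pick a diverse subset: greedy farthest-point (max-min) selection. It is an assumption for illustration only and is not claimed to be the procedure ChemSTEP implements; the function names are hypothetical, and fingerprints are represented as plain Python sets of on-bit positions.&lt;br /&gt;

```python
def tanimoto_distance(fp_a, fp_b):
    """1 - (shared on-bits / total on-bits) for fingerprints stored as sets."""
    union = len(fp_a.union(fp_b))
    return 1.0 - len(fp_a.intersection(fp_b)) / union if union else 0.0


def pick_beacons(hits, max_beacons):
    """hits: list of (score, fingerprint) pairs that already pass the pProp threshold."""
    if not hits or max_beacons == 0:
        return []
    remaining = sorted(hits, key=lambda h: h[0])  # best (most negative) score first
    beacons = [remaining.pop(0)]                  # start from the best-scoring hit
    while remaining and max_beacons > len(beacons):
        # choose the candidate whose nearest existing beacon is farthest away
        def nearest(h):
            return min(tanimoto_distance(h[1], b[1]) for b in beacons)
        best = max(remaining, key=nearest)
        remaining.remove(best)
        beacons.append(best)
    return beacons


hits = [(-50.0, {1, 2, 3}), (-48.0, {1, 2, 4}), (-45.0, {7, 8, 9})]
print([score for score, _ in pick_beacons(hits, 2)])
```

In this toy picker, each additional beacon is necessarily closer to the ones already chosen, and asking for more beacons than there are hits simply returns every hit.&lt;br /&gt;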
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if you are running ChemSTEP prospectively, as outlined in this wiki, there is no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 64 CPU slots requested, which is specified in the default wrapper. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
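&lt;br /&gt;
The distance each chaining worker computes can be illustrated on a toy scale. The sketch below assumes fingerprints are stored as sets of on-bit positions and that a molecule is ranked by its distance to the nearest beacon; both are simplifying assumptions for illustration, the names are hypothetical, and the real jobs work on the packed FP library rather than Python sets.&lt;br /&gt;

```python
def tanimoto_distance(fp_a, fp_b):
    """1 - (shared on-bits / total on-bits) for fingerprints stored as sets."""
    union = len(fp_a.union(fp_b))
    if union == 0:
        return 0.0  # convention used here: two empty fingerprints count as identical
    return 1.0 - len(fp_a.intersection(fp_b)) / union


def min_distance_to_beacons(fp, beacon_fps):
    """Distance from one library molecule to its nearest beacon."""
    return min(tanimoto_distance(fp, b) for b in beacon_fps)


beacons = [{1, 4, 9, 16}, {2, 3, 5, 7}]
library = {"mol_a": {1, 4, 9, 25}, "mol_b": {2, 3, 5, 7}, "mol_c": {30, 31}}

# One entry per library molecule, then sort so the molecules closest to any beacon come first
distances = {name: min_distance_to_beacons(fp, beacons) for name, fp in library.items()}
closest_first = sorted(distances, key=distances.get)
print(closest_first)
```

Here mol_b is identical to a beacon (distance 0.0), mol_a shares three of five bits with its nearest beacon (distance 0.4), and mol_c shares nothing with either (distance 1.0).&lt;br /&gt;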
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with new information during each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and that things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock those bundles. You can write the SDI file with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_N.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should now have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_iterative.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2]; it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16898</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16898"/>
		<updated>2025-09-18T20:14:33Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: Sept 18 2025 katie. current ver = 0.3.1&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in roughly three steps: (1) submission of the main ChemSTEP job, (2) the ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your docking output follows a different naming convention, edit get_scores.py and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers on the command line, e.g. when docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we take our readable scores file and convert it for ChemSTEP recognition. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The remaining values in params.txt are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is the -log10 of a molecule&#039;s rank, expressed as a fraction of the library, within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score (in kcal/mol) associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing the desired pProp. We suggest a seed set of at least 10^(pProp + 2) molecules.&lt;br /&gt;
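&lt;br /&gt;
One simple way to picture how a score threshold can be read off a random seed set is to sort the seed scores and take the score at the rank matching the desired pProp. The sketch below is an assumption for illustration, not ChemSTEP&#039;s actual estimator; the function name is hypothetical, and it relies on the DOCK convention that more negative scores are better.&lt;br /&gt;

```python
def estimate_score_threshold(seed_scores, pprop):
    """Score of the seed molecule sitting at the top 10^-pProp rank fraction."""
    ranked = sorted(seed_scores)  # most negative (best) scores first
    # number of seed molecules expected inside the pProp zone, at least one
    n_hits_in_seed = max(1, round(len(ranked) * 10.0 ** (-pprop)))
    return ranked[n_hits_in_seed - 1]


# Toy seed set of 1000 scores: 0.0, -1.0, ..., -999.0
seed_scores = [-float(i) for i in range(1000)]
threshold = estimate_score_threshold(seed_scores, 2)  # top 1% of 1000 = 10 molecules
print(threshold)
```

In this toy case the ten best scores run from -999.0 to -990.0, so the threshold is -990.0. If the seed set is too small for the chosen pProp, the estimate rests on a single molecule, which is why the seed set size guideline matters.&lt;br /&gt;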
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between each round of ChemSTEP. Prioritizing many molecules will slow building and docking, and, coupled with few beacons (below), may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold may be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if you are running ChemSTEP prospectively, as outlined in this wiki, there is no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 64 CPU slots requested, which is specified in the default wrapper. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with new information during each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and that things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock. You can write the SDI file with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_N.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_iterative.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2]; increase it by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16897</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16897"/>
		<updated>2025-09-18T20:14:06Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: Sept 18 2025 katie. current ver = 0.3.1&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your docking output follows a different naming convention, open get_scores.py in an editor and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line. For example, when docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we convert our readable scores file into a form ChemSTEP can read. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched indices (determined from Mol ID) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log(rank %) of a molecule within the total library score distribution (in the examples here, -log10 of the fraction of the library that ranks at or above the molecule). For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score (in kcal/mol) associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing your desired pProp. We suggest that the seed set contain at least 10^(pProp + 2) molecules.&lt;br /&gt;
&lt;br /&gt;
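The pProp arithmetic above can be checked with a short Python sketch. It is illustrative only and not part of ChemSTEP: the function names are hypothetical, and it assumes pProp is the negative base-10 log of the rank fraction, which is consistent with the examples given (pProp 4 = top 0.01%).&lt;br /&gt;

```python
# Illustrative only: these helpers are hypothetical and not part of ChemSTEP.
# Assumes pProp = -log10(rank fraction), matching the examples in the text.


def expected_virtual_hits(library_size, hit_pprop):
    """Number of molecules in the top 10^-pProp fraction of the library."""
    return library_size // 10 ** hit_pprop


def min_seed_set_size(hit_pprop):
    """Suggested lower bound on the seed set size: 10^(pProp + 2)."""
    return 10 ** (hit_pprop + 2)


library_size = 1_100_000_000_000  # 1.1T XReal space, as stated in the text

for hit_pprop in (4, 5):
    hits = expected_virtual_hits(library_size, hit_pprop)
    seed = min_seed_set_size(hit_pprop)
    print(f"pProp {hit_pprop}: {hits:,} virtual hits, suggested seed set of at least {seed:,}")
```

Compare the suggested minimum against the seed set you actually docked (100 million molecules for the seed set described in step 2) before settling on hit_pprop.&lt;br /&gt;
&lt;br /&gt;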
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the desired number of molecules to prioritize per round. When choosing this value, note that this number of molecules must be built and docked in between each round of ChemSTEP. Prioritizing many molecules will slow building and docking, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold are considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if you are running ChemSTEP prospectively, as outlined in this wiki, you do not need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 64 CPU slots requested, which is specified in the default wrapper. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
A successful ChemSTEP run writes a SMILES file to output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (python errors, cluster issues, tracebacks if failed). Both files are appended to during each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock. You can write the SDI file with:&lt;br /&gt;
    find /path/to/your/building_output -name &amp;quot;bundle.db2.tgz&amp;quot; -type f &amp;gt; round_N.wynton.sdi&lt;br /&gt;
&lt;br /&gt;
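Before docking, it can be worth sanity-checking the SDI file. The sketch below is a hypothetical helper, not part of DOCK or ChemSTEP. It assumes the SDI is a plain-text file with one bundle path per line, as written by the find command above.&lt;br /&gt;

```python
# Hypothetical helper, not part of DOCK or ChemSTEP.
# Assumes the SDI file lists one bundle path per line, as written by find.
from pathlib import Path


def check_sdi(sdi_path):
    """Return (number of listed bundles, listed paths that do not exist)."""
    lines = Path(sdi_path).read_text().splitlines()
    bundles = [line.strip() for line in lines if line.strip()]
    missing = [bundle for bundle in bundles if not Path(bundle).is_file()]
    return len(bundles), missing
```

For example, check_sdi("round_1.wynton.sdi") returns the bundle count and a list of any listed paths that are not present on disk. An empty list means every bundle was found.&lt;br /&gt;
&lt;br /&gt;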
 &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_iterative.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2]; increase it by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
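A small pre-flight check can catch a wrong round number before submitting. The sketch below is hypothetical and not part of ChemSTEP. It assumes, based on the tutorial example (scores_round_1.npy and indices_round_1.npy in place before submitting round 2), that round N needs the score and indices files from round N-1 in the ChemSTEP directory.&lt;br /&gt;

```python
# Hypothetical pre-flight check, not part of ChemSTEP.
# Assumption: submitting round N requires scores_round_{N-1}.npy and
# indices_round_{N-1}.npy in the directory where ChemSTEP is run.
from pathlib import Path


def missing_inputs(round_number, workdir="."):
    """Return the expected input files for this round that are not present."""
    previous = round_number - 1
    expected = [f"scores_round_{previous}.npy", f"indices_round_{previous}.npy"]
    return [name for name in expected if not (Path(workdir) / name).is_file()]


# An empty list means both round 1 files were found in the current directory.
print(missing_inputs(2))
```

If the list is not empty, copy the named files into the ChemSTEP directory before running qsub.&lt;br /&gt;
&lt;br /&gt;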
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16896</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16896"/>
		<updated>2025-09-18T20:10:17Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: Sept 11 2025 katie. current ver = 0.3.1&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your docking output follows a different naming convention, open get_scores.py in an editor and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line. For example, when docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given as NumPy arrays. In this step, we convert our readable scores file so that ChemSTEP can read it. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log(rank %) of a molecule within the total library score distribution (in the examples here, -log10 of the fraction of the library that ranks at or above the molecule). For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score (in kcal/mol) associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign. Be mindful of seed set size when choosing your desired pProp. We suggest that the seed set contain at least 10^(pProp + 2) molecules.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the desired number of molecules to prioritize per round. When choosing this value, note that this number of molecules must be built and docked in between each round of ChemSTEP. Prioritizing many molecules will slow building and docking, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the assigned pProp threshold are considered for beacons. By default, ChemSTEP chooses each set of beacons to be as maximally diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if you are running ChemSTEP prospectively, as outlined in this wiki, you do not need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 64 CPU slots requested, which is specified in the default wrapper. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
A successful ChemSTEP run writes a SMILES file to output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (python errors, cluster issues, tracebacks if failed). Both files are appended to during each iterative round of ChemSTEP run in the same directory. &#039;&#039;&#039;Visually inspect these files after each round of ChemSTEP to ensure the algorithm has picked your desired number of beacons and things ran smoothly.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_iterative.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2]; increase it by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16885</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16885"/>
		<updated>2025-09-11T22:39:16Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: Sept 11 2025 katie. current ver = 0.3.1&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach the desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your output from docking follows different naming conventions, edit get_scores.py and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers on the command line. For example, when docking the first round of prioritized molecules, pass [1] for scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take our readable scores file and convert it so ChemSTEP can read it. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores. &lt;br /&gt;
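&lt;br /&gt;
Before moving on, it can help to confirm that the two arrays really are line-matched. The following is a minimal sketch, not part of ChemSTEP: check_line_matched is a hypothetical helper, and it assumes both files hold one-dimensional arrays (integer molecule indices and floating-point DOCK scores) of equal length. The demonstration writes small synthetic arrays to a temporary directory rather than reading real docking output.&lt;br /&gt;

```python
import os
import tempfile

import numpy as np


def check_line_matched(indices_path, scores_path):
    """Load both arrays and confirm they have matching shapes and finite scores."""
    indices = np.load(indices_path)
    scores = np.load(scores_path)
    if indices.shape != scores.shape:
        raise ValueError(f"shape mismatch: {indices.shape} vs {scores.shape}")
    if not np.all(np.isfinite(scores)):
        raise ValueError("scores contain NaN or infinite values")
    return indices, scores


# Demonstration on small synthetic arrays (assumed layout, not real output)
with tempfile.TemporaryDirectory() as tmp:
    idx_file = os.path.join(tmp, "indices_round_0.npy")
    sc_file = os.path.join(tmp, "scores_round_0.npy")
    np.save(idx_file, np.array([11, 42, 7], dtype=np.int64))
    np.save(sc_file, np.array([-35.2, -41.0, -28.9], dtype=np.float32))
    indices, scores = check_line_matched(idx_file, sc_file)
    # More negative DOCK scores are better, so argmin gives the best molecule
    best = int(indices[np.argmin(scores)])
    print(len(indices), "molecules; best-scoring index:", best)
```

For a real run, pass the paths to your own indices_round_0.npy and scores_round_0.npy instead of the synthetic files.&lt;br /&gt;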
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s rank fraction (rank % / 100) within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate a DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than the threshold is considered a virtual hit in your campaign.&lt;br /&gt;
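&lt;br /&gt;
The arithmetic above can be reproduced with a short sketch. This is an illustration only and not ChemSTEP&#039;s own code: the function names are hypothetical, the seed scores are synthetic, and the quantile-based cutoff is just one plausible way to estimate a score threshold from a random sample.&lt;br /&gt;

```python
import numpy as np


def n_virtual_hits(library_size, hit_pprop):
    """Number of molecules in the top 10**-pProp fraction of the library."""
    return round(library_size * 10.0 ** (-hit_pprop))


def score_threshold_from_seed(seed_scores, hit_pprop):
    """Estimate a DOCK score cutoff as the 10**-pProp quantile of seed scores.

    More negative DOCK scores are better, so the best fraction of
    molecules sits at the low end of the score distribution.
    """
    fraction = 10.0 ** (-hit_pprop)
    return float(np.quantile(np.asarray(seed_scores, dtype=float), fraction))


library_size = 1_100_000_000_000  # 1.1T molecules, as quoted in the text
hits_pprop4 = n_virtual_hits(library_size, 4)
hits_pprop5 = n_virtual_hits(library_size, 5)
print(hits_pprop4, hits_pprop5)

# Synthetic seed scores, purely for demonstration
rng = np.random.default_rng(0)
seed_scores = rng.normal(loc=-30.0, scale=8.0, size=1_000_000)
threshold = score_threshold_from_seed(seed_scores, 3)
print("estimated cutoff for pProp 3:", threshold)
```

A random seed set can only resolve pProp values for which it contains enough molecules in the tail, which is one reason the seed set is large.&lt;br /&gt;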
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the desired number of molecules to prioritize per round. When choosing this value, note that this number of molecules must be built and docked between each round of ChemSTEP. Prioritizing many molecules will slow building and docking, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the pProp threshold can be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_n_rounds:&#039;&#039;&#039; if running ChemSTEP prospectively, as outlined in this wiki, there is no need to worry about this parameter.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 64 CPU slots requested, which is specified in the default wrapper. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
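&lt;br /&gt;
To make the chaining step more concrete, here is a toy sketch of Tanimoto distances between library fingerprints and beacons. It is an assumption-laden illustration: the fingerprint type, array layout, and selection rule that ChemSTEP actually uses are not described on this page, and tanimoto_distances is a hypothetical helper operating on small binary fingerprints held in memory.&lt;br /&gt;

```python
import numpy as np


def tanimoto_distances(library_fps, beacon_fps):
    """Return an (n_library, n_beacons) matrix of Tanimoto distances.

    Both inputs are 2D binary arrays with one fingerprint per row.
    """
    lib = np.asarray(library_fps, dtype=np.int64)
    bea = np.asarray(beacon_fps, dtype=np.int64)
    shared = lib @ bea.T  # bits set in both fingerprints
    lib_bits = lib.sum(axis=1)[:, None]
    bea_bits = bea.sum(axis=1)[None, :]
    union = lib_bits + bea_bits - shared
    # Two all-zero fingerprints are treated as identical (similarity 1)
    similarity = np.where(union == 0, 1.0, shared / np.maximum(union, 1))
    return 1.0 - similarity


# Toy data: three library molecules and one beacon, 4-bit fingerprints
library = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]], dtype=bool)
beacons = np.array([[1, 1, 0, 0]], dtype=bool)

dist = tanimoto_distances(library, beacons)
nearest_beacon_dist = dist.min(axis=1)
# Illustrative selection: keep the two molecules closest to any beacon
closest = np.argsort(nearest_beacon_dist)[:2]
print(dist.ravel(), closest)
```

In the real workflow each parallel worker handles a slice of the library and writes its distances to the /output directory, so nothing like the in-memory matrix above is held for the whole library at once.&lt;br /&gt;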
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). These files will be updated with new information during iterative rounds of ChemSTEP run in the same directory. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. The performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16884</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16884"/>
		<updated>2025-09-11T22:37:58Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: Sept 11 2025 katie. current ver = 0.3.1&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;, round zero) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for a second round of prioritization. This process is iterated until you reach the desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your output from docking follows different naming conventions, edit get_scores.py and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers on the command line. For example, when docking the first round of prioritized molecules, pass [1] for scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For Python speed purposes, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take our readable scores file and convert it so ChemSTEP can read it. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s rank fraction (rank % / 100) within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate a DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than the threshold is considered a virtual hit in your campaign.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the desired number of molecules to prioritize per round. When choosing this value, note that this number of molecules must be built and docked between each round of ChemSTEP. Prioritizing many molecules will slow building and docking, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the pProp threshold can be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 64 CPU slots requested, which is specified in the default wrapper. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). These files will be updated with new information during iterative rounds of ChemSTEP run in the same directory. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. The performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16883</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16883"/>
		<updated>2025-09-11T22:36:16Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: Sept 11 2025 katie. current ver = 0.3.1&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for another round of prioritization. This process is iterated until you reach the desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in loosely three steps: (1) submission of the main ChemSTEP job, (2) ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your output from docking follows a different naming convention, edit get_scores.py and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line, e.g. when docking the first round of prioritized molecules, pass 1 to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For speed, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take our readable scores file and convert it so ChemSTEP can read it. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
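&lt;br /&gt;
As a sanity check before submitting, the params.txt layout shown above can be read and validated with a short script. This is a minimal sketch, assuming the file holds one key: value pair per line exactly as in the example; parse_params and check_params are hypothetical helper names and are not part of ChemSTEP, which may parse the file differently. &lt;br /&gt;
&lt;br /&gt;
```python
import os


def parse_params(text):
    """Parse 'key: value' lines into a dict (assumed params.txt layout)."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(":", 1)
        params[key.strip()] = value.strip()
    return params


def check_params(params):
    """Return a list of problems; an empty list means the file looks sane."""
    problems = []
    # Both seed files must be given as absolute paths to .npy arrays.
    for key in ("seed_indices_file", "seed_scores_file"):
        path = params.get(key, "")
        if not os.path.isabs(path):
            problems.append(key + " must be an absolute path")
        elif not path.endswith(".npy"):
            problems.append(key + " should point to a .npy file")
    # The remaining values are user choices but must be positive numbers.
    for key in ("hit_pprop", "n_docked_per_round", "max_beacons", "max_n_rounds"):
        try:
            if float(params[key]) > 0:
                continue
        except (KeyError, ValueError):
            pass
        problems.append(key + " must be a positive number")
    return problems


example = """
seed_indices_file: /absolute/path/to/your/indices_round_0.npy
seed_scores_file: /absolute/path/to/your/scores_round_0.npy
hit_pprop: 5
n_docked_per_round: 10000000
max_beacons: 150
max_n_rounds: 250
"""
params = parse_params(example)
print(check_params(params))
```
&lt;br /&gt;
To check your own file, read it with open(&amp;quot;params.txt&amp;quot;).read() and pass the text to parse_params. &lt;br /&gt;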
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s rank proportion (its rank percentage expressed as a fraction) within the total-library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the desired number of molecules to prioritize per round. When choosing this value, note that this number of molecules must be built and docked between each round of ChemSTEP. Prioritizing many molecules will slow building and docking, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the maximum number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the pProp threshold can be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
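&lt;br /&gt;
The pProp arithmetic described above can be checked with a few lines of Python. This is a minimal sketch, assuming pProp is the -log10 of the top fraction of the library (so pProp 4 is the top 0.01%); the function names and the simple rank-based threshold estimate are illustrative assumptions, not ChemSTEP&#039;s internal implementation. &lt;br /&gt;
&lt;br /&gt;
```python
import math


def top_fraction(pprop):
    """Fraction of the library inside a pProp zone, e.g. 4 -> 0.0001."""
    return 10.0 ** (-pprop)


def expected_hits(pprop, library_size):
    """Number of molecules in the whole library inside the pProp zone."""
    return round(library_size * top_fraction(pprop))


def pprop_of_rank(rank, library_size):
    """pProp of the molecule at a given 1-based rank (1 = best score)."""
    return -math.log10(rank / library_size)


def score_threshold(seed_scores, pprop):
    """Estimate the DOCK score cutoff from a random seed set.

    More negative DOCK scores are better, so the cutoff is taken as the
    score of the k-th best seed molecule, where k is the share of the
    seed set expected to fall inside the pProp zone (an assumed estimator).
    """
    k = max(1, round(len(seed_scores) * top_fraction(pprop)))
    return sorted(seed_scores)[k - 1]


library_size = 1_100_000_000_000  # 1.1 trillion molecules
print(expected_hits(4, library_size))  # 110000000
print(expected_hits(5, library_size))  # 11000000
```
&lt;br /&gt;
Under these assumptions, a 100-million-molecule seed set and pProp 5 place the cutoff at roughly the 1,000th best seed score, which is why a large seed set is needed to estimate high pProp thresholds. &lt;br /&gt;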
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 64 CPU slots requested, which is specified in the default wrapper. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
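&lt;br /&gt;
For readers unfamiliar with the chaining step, the Tanimoto distance between a library molecule and the beacons can be illustrated as follows. This is a minimal sketch, assuming fingerprints are represented as sets of on-bit positions; the fingerprint type and storage format ChemSTEP actually uses are not described on this page, and tanimoto_distance and nearest_beacon_distance are hypothetical names. &lt;br /&gt;
&lt;br /&gt;
```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance: 1 - (shared on-bits) / (on-bits in either)."""
    union = len(fp_a.union(fp_b))
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(fp_a.intersection(fp_b)) / union


def nearest_beacon_distance(molecule_fp, beacon_fps):
    """Smallest Tanimoto distance from one molecule to any beacon."""
    return min(tanimoto_distance(molecule_fp, b) for b in beacon_fps)


# Toy fingerprints: each set lists the positions of the bits that are on.
beacons = [{1, 4, 9}, {2, 3, 5}]
molecule = {1, 4, 25}
print(nearest_beacon_distance(molecule, beacons))  # 0.5
```
&lt;br /&gt;
A distance of 0 means identical fingerprints and 1 means no shared bits, so molecules with small distances to a beacon are the ones considered close in chemical space. &lt;br /&gt;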
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are appended to as iterative rounds of ChemSTEP are run in the same directory. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold calculated by ChemSTEP!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
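&lt;br /&gt;
Writing the SDI file amounts to listing the absolute path of every built bundle, one per line. This is a minimal sketch, assuming the building step leaves bundles named *.db2.tgz somewhere under building_output; the file pattern, the output name round_1.sdi, and the helper write_sdi are assumptions to adapt to your own layout. &lt;br /&gt;
&lt;br /&gt;
```python
from pathlib import Path


def write_sdi(build_dir, sdi_path, pattern="*.db2.tgz"):
    """Write absolute paths of all bundles under build_dir to an SDI file.

    The search is recursive; returns the number of bundle paths written.
    """
    bundles = sorted(p.resolve() for p in Path(build_dir).rglob(pattern))
    with open(sdi_path, "w") as handle:
        for bundle in bundles:
            handle.write(str(bundle) + "\n")
    return len(bundles)


# Example call, run from the directory that contains building_output:
# n_bundles = write_sdi("building_output", "round_1.sdi")
```
&lt;br /&gt;
Check that the number of paths written matches the number of bundles you expect before launching docking. &lt;br /&gt;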
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16882</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16882"/>
		<updated>2025-09-11T22:33:04Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: Sept 11 2025 katie. current ver = 0.3.1&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for another round of prioritization. This process is iterated until you reach the desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in roughly three steps: (1) submission of the main ChemSTEP job, (2) the ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your output from docking follows a different naming convention, edit get_scores.py and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line, e.g. when docking the first round of prioritized molecules, pass 1 to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For speed, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take our readable scores file and convert it so ChemSTEP can read it. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s rank proportion (its rank percentage expressed as a fraction) within the total-library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the desired number of molecules to prioritize per round. When choosing this value, note that this number of molecules must be built and docked between each round of ChemSTEP. Prioritizing many molecules will slow building and docking, and coupled with few beacons (below) may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm running time.**   &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the maximum number of diverse, well-scoring molecules ChemSTEP will use to guide prioritization per round. All molecules that score better than the pProp threshold can be considered for beacons. By default, ChemSTEP chooses each set of beacons to be as diverse as possible. Choosing too many beacons may result in decreased diversity between beacons, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 64 CPU slots requested, which is specified in the default wrapper. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are appended to as iterative rounds of ChemSTEP are run in the same directory. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet the pProp score threshold (output by ChemSTEP)!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16881</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16881"/>
		<updated>2025-09-11T22:29:52Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: Sept 11 2025 katie. current ver = 0.3.1&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for another round of prioritization. This process is iterated until you reach the desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in roughly three steps: (1) submission of the main ChemSTEP job, (2) the ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your output from docking follows a different naming convention, edit get_scores.py and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line, e.g. when docking the first round of prioritized molecules, pass 1 to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For speed, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take our readable scores file and convert it so ChemSTEP can read it. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The rest of the values within the params.txt file are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log10 of a molecule&#039;s rank proportion (its rank percentage expressed as a fraction) within the total-library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score value in kcal/mol that is associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between each round of ChemSTEP. Prioritizing many molecules will slow building and docking and, coupled with few beacons (below), may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm&#039;s running time.** &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the maximum number of diverse, well-scoring molecules ChemSTEP will assign as beacons per round. All molecules that score better than the pProp threshold can be considered for beacons. Choosing too many beacons may result in decreased diversity, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 64 CPU slots requested, which is specified in the default wrapper. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with new information during iterative rounds of ChemSTEP run in the same directory. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet your pProp score threshold (output by ChemSTEP)!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16880</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16880"/>
		<updated>2025-09-11T22:28:25Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: Sept 11 2025 katie. current ver = 0.3.1&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process that identifies molecules from the larger virtual library to prioritize for docking. First, we identify a random sample of the total library (termed &amp;quot;seed set&amp;quot;) and dock those molecules to the target of interest. From this seed set, we can calculate total-library pProp values (-log rank percentages) and the number of &amp;quot;virtual hits&amp;quot; in the total library (high-scoring molecules). ChemSTEP will identify a set of maximally diverse molecules that score above the desired pProp threshold (&amp;quot;beacons&amp;quot;) from the seed set. These beacons guide prioritization, where molecules chosen and output by ChemSTEP are close in chemical space to the beacons. Prioritized molecules are then built, docked, and returned to ChemSTEP for another round of prioritization. This process is iterated until you reach desired virtual hit recovery, or you are no longer recovering virtual hits. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running the ChemSTEP algorithm on Wynton takes place in roughly three steps: (1) submission of the main ChemSTEP job, (2) the ChainingLog sub-job array, and (3) gathering of SMILES for prioritization. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What the user needs: DOCKFILES, and directories for (1) docking, (2) building, and (3) running ChemSTEP. &#039;&#039;&#039;All iterative rounds of ChemSTEP should be run in the SAME directory.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules.&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock, just your base docking directory]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory containing your docking files and output folder, while logged into a dev node (I suggest in a screen):&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your docking output follows a different naming convention, open get_scores.py in an editor (e.g. vim) and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers on the command line; e.g., when docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
For speed, ChemSTEP requires that DOCK scores and IDs be given in the form of NumPy arrays. In this step, we take the readable scores file and convert it so ChemSTEP can read it. This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh. Copy in your score and indices numpy files as well, which should be in your docking directory.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all rounds of ChemSTEP. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Within params.txt, add the absolute paths to the ChemSTEP-readable score and indices NumPy arrays. The remaining values in params.txt are left to the discretion of the user, with some considerations below. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;pProp:&#039;&#039;&#039; this value defines what is considered a &amp;quot;virtual hit&amp;quot; in your campaign. pProp is defined as the -log(rank %) of a molecule within the total library score distribution. For example, a pProp of 4 in the 1.1T XReal space is equivalent to the top 0.01% of the library, corresponding to the top 110 million molecules. pProp 5 = 0.001% = 11 million virtual hits, etc. From the seed set, ChemSTEP will estimate the DOCK score (in kcal/mol) associated with the lower limit of your desired pProp zone. Any molecule that scores better than this threshold is considered a virtual hit in your campaign.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;n_docked_per_round:&#039;&#039;&#039; this is the number of molecules to prioritize per round. When choosing this value, note that this many molecules must be built and docked between each round of ChemSTEP. Prioritizing many molecules will slow building and docking and, coupled with few beacons (below), may lead to decreased diversity. Prioritizing too few molecules may result in slower virtual hit recovery. Round size does not significantly impact the algorithm&#039;s running time.** &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;max_beacons:&#039;&#039;&#039; this is the maximum number of diverse, well-scoring molecules ChemSTEP will assign as beacons per round. All molecules that score better than the pProp threshold can be considered for beacons. Choosing too many beacons may result in decreased diversity, but too few beacons could hinder space exploration. If not enough molecules score above your pProp threshold, ChemSTEP may assign fewer beacons than the assigned value. &lt;br /&gt;
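&lt;br /&gt;
Before submitting, it can help to sanity-check params.txt. The following is a minimal sketch, assuming the file uses the key: value layout shown in step 6. read_params and check_params are hypothetical helpers, not part of ChemSTEP, and ChemSTEP itself may accept other keys or formats. &lt;br /&gt;

```python
REQUIRED_KEYS = (
    "seed_indices_file", "seed_scores_file", "hit_pprop",
    "n_docked_per_round", "max_beacons", "max_n_rounds",
)
PATH_KEYS = ("seed_indices_file", "seed_scores_file")
NUMERIC_KEYS = ("hit_pprop", "n_docked_per_round", "max_beacons", "max_n_rounds")


def read_params(text):
    """Parse 'key: value' lines into a dict, skipping blank or malformed lines."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        params[key.strip()] = value.strip()
    return params


def check_params(params):
    """Return a list of readable problems; an empty list means nothing obvious is wrong."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in params]
    for key in PATH_KEYS:
        value = params.get(key, "")
        if value and not value.startswith("/"):
            problems.append(f"{key} is not an absolute path: {value}")
    for key in NUMERIC_KEYS:
        value = params.get(key)
        if value is not None and not value.replace(".", "", 1).isdigit():
            problems.append(f"{key} is not a non-negative number: {value}")
    return problems


EXAMPLE = """
seed_indices_file: /absolute/path/to/your/indices_round_0.npy
seed_scores_file: /absolute/path/to/your/scores_round_0.npy
hit_pprop: 5
n_docked_per_round: 10000000
max_beacons: 150
max_n_rounds: 250
"""

print(check_params(read_params(EXAMPLE)))
```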
&lt;br /&gt;
&lt;br /&gt;
There should be no need to edit run_chemstep.py or the SGE wrapper script. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; At this time, we strongly suggest running ChemSTEP as a job array with 64 CPU slots requested, which is specified in the default wrapper. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This will launch the main ChemSTEP job to the SGE scheduler. Check your job status with the &#039;qstat&#039; command. The algorithm will read the score and indices files provided, calculate pProp (for round 1) and assign beacons. As this job runs, it will launch a subsequent job array. These sub-jobs are the Chaining step of ChemSTEP, where each parallel worker calculates the Tanimoto distances of 100 million molecules from the virtual library to the beacons. These distances are then written to the files within the /output directory. When these jobs complete, the main job will read through all Tanimoto distances and pull molecules for prioritization. &lt;br /&gt;
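&lt;br /&gt;
To make the Chaining step concrete, the distance calculation can be sketched with NumPy. This is an illustration only: it assumes fingerprints are stored as 0/1 bit vectors and uses the standard definition of Tanimoto distance (one minus the ratio of shared bits to bits set in either fingerprint). The function name and toy fingerprints are hypothetical, and the real ChemSTEP workers may store and process fingerprints differently. &lt;br /&gt;

```python
import numpy as np


def tanimoto_distances(library_fps, beacon_fps):
    """Return an (n_library, n_beacons) array of Tanimoto distances for 0/1 fingerprints."""
    lib = np.asarray(library_fps, dtype=np.int64)
    bea = np.asarray(beacon_fps, dtype=np.int64)
    shared = lib @ bea.T  # bits set in both fingerprints
    total = lib.sum(axis=1)[:, None] + bea.sum(axis=1)[None, :]
    union = total - shared  # bits set in either fingerprint
    # Pairs of all-zero fingerprints are given similarity 0 (distance 1).
    similarity = np.divide(shared, union, out=np.zeros(shared.shape), where=union != 0)
    return 1.0 - similarity


# Toy data: three library molecules and one beacon, four bits each.
library = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0]]
beacons = [[1, 1, 0, 0]]

distances = tanimoto_distances(library, beacons)
nearest_beacon_distance = distances.min(axis=1)
print(nearest_beacon_distance)
```

Molecules with a small distance to their nearest beacon are the ones close in chemical space to a beacon, which is the property the prioritization step relies on. &lt;br /&gt;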
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps, including the DOCK score associated with your desired pProp. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with new information during iterative rounds of ChemSTEP run in the same directory. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting:&#039;&#039;&#039; if no jobs are running and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE or Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet your pProp score threshold (output by ChemSTEP)!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** anecdotally true.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16872</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16872"/>
		<updated>2025-09-08T23:36:57Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: Sept 3 2025 katie. current ver = 0.3.1&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process to run in between rounds of docking. The general procedure is as follows: build molecules, dock molecules, convert scores for ChemSTEP, run ChemSTEP. In this case, the first round of building (the &amp;quot;seed set&amp;quot;) has already been done.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What you (the user) need: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules. &lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_seed_set_v0p0.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory (containing your docking files and output folder) while logged into a dev node:&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the CWD. If your docking output follows a different naming convention, open get_scores.py in an editor (e.g. vim) and change the path. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing numbers on the command line; e.g., when docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores. &lt;br /&gt;
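&lt;br /&gt;
Because the two arrays must be line-matched, a quick check before copying them into the ChemSTEP directory can catch a bad conversion early. The sketch below assumes both files hold one-dimensional arrays of equal length; check_round_files is a hypothetical helper, not part of ChemSTEP, and the demonstration arrays are made up. &lt;br /&gt;

```python
import os
import tempfile

import numpy as np


def check_round_files(scores_path, indices_path):
    """Load one round's score and index arrays and confirm they are line-matched."""
    scores = np.load(scores_path)
    indices = np.load(indices_path)
    if scores.ndim != 1 or indices.ndim != 1:
        raise ValueError("expected one-dimensional arrays")
    if scores.shape != indices.shape:
        raise ValueError(f"length mismatch: {scores.size} scores vs {indices.size} indices")
    n_duplicates = indices.size - np.unique(indices).size
    return {"n_molecules": int(scores.size), "n_duplicate_ids": int(n_duplicates)}


# Demonstration with small made-up arrays written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    scores_file = os.path.join(tmp, "scores_round_0.npy")
    indices_file = os.path.join(tmp, "indices_round_0.npy")
    np.save(scores_file, np.array([-35.2, -28.9, -41.0]))
    np.save(indices_file, np.array([101, 102, 103]))
    summary = check_round_files(scores_file, indices_file)
    print(summary)
```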
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh, your score and indices numpy files.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through the round of ChemSTEP chaining. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Be sure that this file reflects your score and indices files for round zero (the seed set). Define your desired pProp, number of beacons, and number to prioritize per round. There should be no need to edit run_chemstep.py or the SGE wrapper script for the FIRST round of XReal docking. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; We strongly suggest running ChemSTEP as a job array with 64 CPU slots requested (specified in the wrapper). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are updated with new information during iterative rounds of ChemSTEP run in the same directory. &lt;br /&gt;
&lt;br /&gt;
Troubleshooting: if the job is completed and there is no SMI file, check the chemstep_submission.log first. Any traceback error or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE/wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can potentially corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will definitely have duplicates of beacons and information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet your pProp score threshold (output by ChemSTEP)!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16863</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16863"/>
		<updated>2025-09-03T22:54:41Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: Sept 3 2025 katie. current ver = 0.3.1&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high-level, ChemSTEP is an iterative process to run in between rounds of docking. The general procedure is as follows: build molecules, dock molecules, convert scores for ChemSTEP, run ChemSTEP. In this case, the first round of building (the &amp;quot;seed set&amp;quot;) has already been done.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What you (the user) need: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules. &lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_00_seed_set.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory (containing your docking files and output folder) while logged into a dev node:&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the current working directory. If your docking output follows a different naming convention, edit the path in get_scores.py. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line. For example, when docking the first round of prioritized molecules, pass 1 to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores. &lt;br /&gt;
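&lt;br /&gt;
For readers who want to see what &amp;quot;line-matched&amp;quot; means in practice, here is a minimal sketch of such a conversion. It is not the contents of convert_scores_to_npy.py. It assumes each line of the scores file holds an integer molecule ID and a score separated by whitespace, and the function name convert_scores is hypothetical.&lt;br /&gt;

```python
import os

import numpy as np


def convert_scores(txt_path, round_number, out_dir="."):
    """Split a two-column text file into line-matched index and score arrays.

    Assumption: each line is 'integer_id score' separated by whitespace.
    """
    indices = []
    scores = []
    with open(txt_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 2:
                continue  # skip blank or malformed lines
            indices.append(int(fields[0]))
            scores.append(float(fields[1]))
    indices_arr = np.array(indices, dtype=np.int64)
    scores_arr = np.array(scores, dtype=np.float32)
    # Position i in one array corresponds to position i in the other.
    np.save(os.path.join(out_dir, f"indices_round_{round_number}.npy"), indices_arr)
    np.save(os.path.join(out_dir, f"scores_round_{round_number}.npy"), scores_arr)
    return indices_arr, scores_arr
```

Calling convert_scores(&amp;quot;scores_round_0.txt&amp;quot;, 0) would write the two round-zero files named above into the current directory.&lt;br /&gt;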
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh, your score and indices numpy files.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all subsequent rounds of ChemSTEP chaining. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Be sure that this file reflects your score and indices files for round zero (the seed set). Define your desired pProp threshold, maximum number of beacons, and number of molecules to prioritize per round. There should be no need to edit run_chemstep.py or the SGE wrapper script for the FIRST round of XReal docking. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; We strongly suggest running ChemSTEP as a job array with 64 CPU slots requested (specified in the wrapper). &lt;br /&gt;
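&lt;br /&gt;
Because a mistyped path in params.txt only surfaces after the job has been queued, a quick pre-flight check can save a resubmission. The sketch below is an optional helper, not part of ChemSTEP, and it does not claim to mirror how ChemSTEP itself reads the file. It assumes the &amp;quot;key: value&amp;quot; layout shown above, and the function names read_params and check_params are hypothetical.&lt;br /&gt;

```python
import os

PATH_KEYS = ("seed_indices_file", "seed_scores_file")
INT_KEYS = ("n_docked_per_round", "max_beacons", "max_n_rounds")


def read_params(path):
    """Parse 'key: value' lines into a dict of strings."""
    params = {}
    with open(path) as handle:
        for line in handle:
            if ":" not in line:
                continue  # ignore blank or unrelated lines
            key, value = line.split(":", 1)
            params[key.strip()] = value.strip()
    return params


def check_params(params):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for key in PATH_KEYS:
        value = params.get(key, "")
        if not os.path.isabs(value):
            problems.append(f"{key} should be an absolute path, got {value!r}")
    for key in INT_KEYS:
        value = params.get(key, "")
        if not value.isdigit():
            problems.append(f"{key} should be a whole number, got {value!r}")
    try:
        float(params.get("hit_pprop", ""))
    except ValueError:
        problems.append("hit_pprop should be a number")
    return problems
```

Running check_params(read_params(&amp;quot;params.txt&amp;quot;)) before qsub and printing the result is enough to catch a relative path or a missing value.&lt;br /&gt;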
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are appended to by iterative rounds of ChemSTEP run in the same directory. &lt;br /&gt;
&lt;br /&gt;
Troubleshooting: if the job has completed and there is no SMI file, check chemstep_submission.log first. Any traceback or SGE error should give you some idea of why ChemSTEP failed. Some errors may be due to SGE/Wynton issues. If directed, look at a few .out files within /output/jobs/ . If a ChemSTEP run fails on your FIRST run, delete the output and log files, fix what needs to be fixed, and rerun. Errors in subsequent rounds during the chaining step can corrupt chaining files within the /output directory that are needed for prioritization. At the very least, you will have duplicate beacons and duplicated information written to the log files. At this time, if your run fails during iterative rounds, it&#039;s best to start from the beginning. &lt;br /&gt;
&lt;br /&gt;
Prioritized molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet your pProp score threshold (output by ChemSTEP)!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
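&lt;br /&gt;
Writing the SDI file amounts to listing one absolute bundle path per line. The sketch below shows one way to do that. The bundle filename pattern (*.db2.tgz) is an assumption and should be adjusted to whatever the building pipeline actually produced in building_output. The function name write_sdi is hypothetical.&lt;br /&gt;

```python
import glob
import os


def write_sdi(building_dir, sdi_path, pattern="*.db2.tgz"):
    """Write the absolute path of every matching bundle, one per line.

    Assumption: built bundles match `pattern` somewhere under building_dir.
    """
    search = os.path.join(os.path.abspath(building_dir), "**", pattern)
    bundles = sorted(glob.glob(search, recursive=True))
    with open(sdi_path, "w") as handle:
        for bundle in bundles:
            handle.write(bundle + "\n")
    return bundles
```

For example, write_sdi(&amp;quot;building_output&amp;quot;, &amp;quot;round_1.sdi&amp;quot;) returns the list of bundles it found, so an empty list is a quick sign that the pattern does not match your files.&lt;br /&gt;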
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_iterative.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is 2, and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16862</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16862"/>
		<updated>2025-09-03T22:45:42Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: Sept 3 2025 katie. current ver = 0.3.1&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process run between rounds of docking. The general procedure is as follows: build molecules, dock molecules, convert scores for ChemSTEP, run ChemSTEP. In this case, the first round of building (the &amp;quot;seed set&amp;quot;) has already been done.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What you (the user) need: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules. &lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_00_seed_set.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory (containing your docking files and output folder) while logged into a dev node:&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the current working directory. If your docking output follows a different naming convention, edit the path in get_scores.py. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line. For example, when docking the first round of prioritized molecules, pass 1 to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh, your score and indices numpy files.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all subsequent rounds of ChemSTEP chaining. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Be sure that this file reflects your score and indices files for round zero (the seed set). Define your desired pProp threshold, maximum number of beacons, and number of molecules to prioritize per round. There should be no need to edit run_chemstep.py or the SGE wrapper script for the FIRST round of XReal docking. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; We strongly suggest running ChemSTEP as a job array with 64 CPU slots requested (specified in the wrapper). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are appended to by iterative rounds of ChemSTEP run in the same directory. The prioritized molecules in the SMILES file should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 5 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet your pProp score threshold (output by ChemSTEP)!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_iterative.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is 2, and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16861</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16861"/>
		<updated>2025-09-03T22:42:36Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: Sept 3 2025 katie. current ver = 0.3.1&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process run between rounds of docking. The general procedure is as follows: build molecules, dock molecules, convert scores for ChemSTEP, run ChemSTEP. In this case, the first round of building (the &amp;quot;seed set&amp;quot;) has already been done.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What you (the user) need: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules. &lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_00_seed_set.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory (containing your docking files and output folder) while logged into a dev node:&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the current working directory. If your docking output follows a different naming convention, edit the path in get_scores.py. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line. For example, when docking the first round of prioritized molecules, pass 1 to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh, your score and indices numpy files.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all subsequent rounds of ChemSTEP chaining. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Be sure that this file reflects your score and indices files for round zero (the seed set). Define your desired pProp threshold, maximum number of beacons, and number of molecules to prioritize per round. There should be no need to edit run_chemstep.py or the SGE wrapper script for the FIRST round of XReal docking. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; We strongly suggest running ChemSTEP as a job array with 64 CPU slots requested (specified in the wrapper). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Running ChemSTEP successfully will result in the output of a SMILES file within output/complete_info/. You will also get (1) chemstep_algo.log and (2) chemstep_submission.log in the working directory after job submission. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps. chemstep_submission.log contains the output of ChemSTEP, as well as any information output by the SGE submission system (Python errors, cluster issues, tracebacks if failed). Both files are appended to by iterative rounds of ChemSTEP run in the same directory. The prioritized molecules in the SMILES file should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet your pProp score threshold (output by ChemSTEP)!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py. Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_iterative.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is 2, and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16860</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16860"/>
		<updated>2025-09-03T22:42:09Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: Sept 3 2025 katie. current ver = 0.3.1&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process run between rounds of docking. The general procedure is as follows: build molecules, dock molecules, convert scores for ChemSTEP, run ChemSTEP. In this case, the first round of building (the &amp;quot;seed set&amp;quot;) has already been done.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What you (the user) need: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules. &lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_00_seed_set.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory (containing your docking files and output folder) while logged into a dev node:&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python3 get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the current working directory. If your docking output follows a different naming convention, edit the path in get_scores.py. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line. For example, when docking the first round of prioritized molecules, pass 1 to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays for ChemSTEP recognition. Requires the ChemSTEP venv&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh, your score and indices numpy files.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all subsequent rounds of ChemSTEP chaining. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Be sure that this file reflects your score and indices files for round zero (the seed set). Define your desired pProp threshold, maximum number of beacons, and number of molecules to prioritize per round. There should be no need to edit run_chemstep.py or the SGE wrapper script for the FIRST round of XReal docking. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; We strongly suggest running ChemSTEP as a job array with 64 CPU slots requested (specified in the wrapper). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A successful ChemSTEP run writes a SMILES file to output/complete_info/. After job submission you will also find two log files in the working directory: (1) chemstep_algo.log and (2) chemstep_submission.log. chemstep_algo.log contains the running output of the algorithm in a readable format, with timestamps. chemstep_submission.log contains the output of ChemSTEP, as well as anything reported by the SGE submission system (Python errors, cluster issues, tracebacks if the job failed). Both logs are updated as later iterative rounds of ChemSTEP are run in the same directory. The molecules in the SMILES file should be built and docked, and their scores fed back into ChemSTEP. More detailed instructions are below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet your pProp score threshold (output by ChemSTEP)!&#039;&#039;&#039; Retrieve the docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should now have scores_round_1.npy and indices_round_1.npy (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
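&lt;br /&gt;
Writing the SDI file can be scripted. The sketch below is illustrative only: it assumes an SDI file is a plain list with one absolute bundle path per line, and that built bundles under building_output match the pattern *.db2.tgz. Both the pattern and the helper name write_sdi are assumptions, so adjust them to what the building pipeline actually produced. &lt;br /&gt;

```python
from pathlib import Path


def write_sdi(build_dir, sdi_path, pattern="*.db2.tgz"):
    """Write one absolute bundle path per line; return the number written.

    'pattern' is an ASSUMED naming convention for built bundles; change it
    to match the files found in your building output folder.
    """
    # Search the build folder recursively and sort for a reproducible order.
    bundles = sorted(p.resolve() for p in Path(build_dir).rglob(pattern))
    with open(sdi_path, "w") as handle:
        for bundle in bundles:
            handle.write(f"{bundle}\n")
    return len(bundles)
```

For example, write_sdi(&quot;building_output&quot;, &quot;round_1.sdi&quot;) would index the bundles from step 8; the output name round_1.sdi is only a suggestion. &lt;br /&gt;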
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2]; increase it by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold used to select molecules for docking, and the last added beacon&#039;s distance to all previous beacons.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16859</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16859"/>
		<updated>2025-09-03T22:38:58Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: Sept 3 2025 katie&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process run in between rounds of docking. The general procedure is as follows: build molecules, dock molecules, convert scores for ChemSTEP, run ChemSTEP. In this case, the first round of building (the &quot;seed set&quot;) has already been done.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What you (the user) need: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules. &lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_00_seed_set.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory (containing your docking files and output folder) while logged into a dev node:&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python3 get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &quot;output*&quot; within the current working directory. If your docking output follows a different naming convention, edit the path in get_scores.py. The output will be a file named &quot;scores_round_0.txt&quot;. For iterative rounds, pass the round number on the command line; for example, after docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays that ChemSTEP can read. Requires the ChemSTEP venv.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
This script expects a text file of molecule IDs and DOCK scores. Pass the same round number you used in step 3. Run as above, it writes two line-matched files: indices_round_0.npy (molecule IDs) and scores_round_0.npy (the corresponding docking scores). &lt;br /&gt;
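&lt;br /&gt;
Because every round reuses the same file-name pattern, a small bookkeeping helper can confirm that the files for a given round are in place before ChemSTEP is run. This is a sketch only; the helper names round_files and missing_round_files are hypothetical, and the file names are the ones given in steps 3 and 4. &lt;br /&gt;

```python
from pathlib import Path


def round_files(round_number):
    """Return the per-round file names described in steps 3 and 4."""
    n = int(round_number)
    if n != abs(n):
        raise ValueError("round number must be zero or a positive integer")
    return {
        "scores_txt": f"scores_round_{n}.txt",    # written by get_scores.py
        "scores_npy": f"scores_round_{n}.npy",    # written by convert_scores_to_npy.py
        "indices_npy": f"indices_round_{n}.npy",  # written by convert_scores_to_npy.py
    }


def missing_round_files(directory, round_number):
    """List the .npy files for this round that are not yet in 'directory'."""
    names = round_files(round_number)
    wanted = (names["indices_npy"], names["scores_npy"])
    return [name for name in wanted if not (Path(directory) / name).is_file()]
```

An empty list from missing_round_files(&quot;.&quot;, 0) means both seed-set arrays are present in the current directory. &lt;br /&gt;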
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh, your score and indices numpy files.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit the params.txt file.&#039;&#039;&#039; This file ONLY needs to be edited for the initial round of ChemSTEP. The parameters set here are carried through all subsequent rounds of ChemSTEP chaining. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Make sure this file points to your score and indices files for round zero (the seed set). Set your desired pProp threshold (hit_pprop), the maximum number of beacons (max_beacons), and the number of molecules to prioritize per round (n_docked_per_round). There should be no need to edit run_chemstep.py or the SGE wrapper script for the FIRST round of XReal docking. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; We strongly suggest running ChemSTEP as a job array with 64 CPU slots requested (specified in the wrapper script). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When the job finishes, there will be a smi_round_1.smi file inside output/complete_info. These molecules should be built and docked, and their scores fed back into ChemSTEP. More detailed instructions are below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet your pProp score threshold (output by ChemSTEP)!&#039;&#039;&#039; Retrieve the docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should now have scores_round_1.npy and indices_round_1.npy (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2]; increase it by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold used to select molecules for docking, and the last added beacon&#039;s distance to all previous beacons.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16858</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16858"/>
		<updated>2025-09-03T22:37:08Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: August 26 2025 katie&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process run in between rounds of docking. The general procedure is as follows: build molecules, dock molecules, convert scores for ChemSTEP, run ChemSTEP. In this case, the first round of building (the &quot;seed set&quot;) has already been done.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What you (the user) need: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules. &lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_00_seed_set.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory (containing your docking files and output folder) while logged into a dev node:&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python3 get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &quot;output*&quot; within the current working directory. If your docking output follows a different naming convention, edit the path in get_scores.py. The output will be a file named &quot;scores_round_0.txt&quot;. For iterative rounds, pass the round number on the command line; for example, after docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays that ChemSTEP can read. Requires the ChemSTEP venv.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
This script expects a text file of molecule IDs and DOCK scores. Pass the same round number you used in step 3. Run as above, it writes two line-matched files: indices_round_0.npy (molecule IDs) and scores_round_0.npy (the corresponding docking scores). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files for initiating ChemSTEP: params.txt, run_chemstep.py and launch_chemstep_as_job.sh, your score and indices numpy files.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit the params.txt file.&#039;&#039;&#039; This file ONLY needs to be edited for the initial round of ChemSTEP. The parameters set here are carried through all subsequent rounds of ChemSTEP chaining. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Make sure this file points to your score and indices files for round zero (the seed set). Set your desired pProp threshold (hit_pprop), the maximum number of beacons (max_beacons), and the number of molecules to prioritize per round (n_docked_per_round). There should be no need to edit run_chemstep.py or the SGE wrapper script for the FIRST round of XReal docking. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; We strongly suggest running ChemSTEP as a job array with 64 CPU slots requested (specified in the wrapper script). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When the job finishes, there will be a smi_round_1.smi file inside output/complete_info. These molecules should be built and docked, and their scores fed back into ChemSTEP. More detailed instructions are below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet your pProp score threshold (output by ChemSTEP)!&#039;&#039;&#039; Retrieve the docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! If you are following along as a tutorial, you should now have scores_round_1.npy and indices_round_1.npy (from the FIRST round of ChemSTEP prioritization). Copy these new score and indices files into your ChemSTEP working directory.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2]; increase it by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold used to select molecules for docking, and the last added beacon&#039;s distance to all previous beacons.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16850</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16850"/>
		<updated>2025-08-27T00:05:54Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: August 26 2025 katie&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process run in between rounds of docking. The general procedure is as follows: build molecules, dock molecules, convert scores for ChemSTEP, run ChemSTEP. In this case, the first round of building (the &quot;seed set&quot;) has already been done.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What you (the user) need: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules. &lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_00_seed_set.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory (containing your docking files and output folder) while logged into a dev node:&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python3 get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &quot;output*&quot; within the current working directory. If your docking output follows a different naming convention, edit the path in get_scores.py. The output will be a file named &quot;scores_round_0.txt&quot;. For iterative rounds, pass the round number on the command line; for example, after docking the first round of prioritized molecules, pass [1] to get scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays that ChemSTEP can read.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python3 convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
This script expects a text file of molecule IDs and DOCK scores. Pass the same round number you used in step 3. Run as above, it writes two line-matched files: indices_round_0.npy (molecule IDs) and scores_round_0.npy (the corresponding docking scores). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files: params.txt, run_chemstep.py and launch_chemstep_as_job.sh, your score and indices numpy files.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit the params.txt file.&#039;&#039;&#039; This file ONLY needs to be edited for the initial round of ChemSTEP. The parameters set here are carried through all subsequent rounds of ChemSTEP chaining. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Make sure this file points to your score and indices files for round zero (the seed set). Set your desired pProp threshold (hit_pprop), the maximum number of beacons (max_beacons), and the number of molecules to prioritize per round (n_docked_per_round). There should be no need to edit run_chemstep.py or the SGE wrapper script for the FIRST round of XReal docking. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; We strongly suggest running ChemSTEP as a job array with 64 CPU slots requested (specified in the wrapper script). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When the job finishes, there will be a smi_round_1.smi file inside output/complete_info. These molecules should be built and docked, and their scores fed back into ChemSTEP. More detailed instructions are below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; /taken from docs.docking.org &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, write an SDI file containing the complete path to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet your pProp score threshold (output by ChemSTEP)!&#039;&#039;&#039; Retrieve the docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should now have scores_round_1.npy and indices_round_1.npy (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2]; increase it by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. Performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold used to select molecules for docking, and the last added beacon&#039;s distance to all previous beacons.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16849</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16849"/>
		<updated>2025-08-27T00:02:23Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: August 20 2025 katie&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process run in between rounds of docking. The general procedure is as follows: build molecules, dock molecules, convert scores for ChemSTEP, run ChemSTEP. In this case, the first round of building (the &quot;seed set&quot;) has already been done.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What you (the user) need: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules. &lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_00_seed_set.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; /docking directions taken from docs.docking.org. This is meant to be done as you would do a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory (containing your docking files and output folder) while logged into a dev node:&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python3 get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the current working directory. If your docking output follows a different naming convention, edit the path in get_scores.py. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line. For example, when docking the first round of prioritized molecules, pass 1 to produce scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays that ChemSTEP can read.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python3 convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores. &lt;br /&gt;
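
As a rough illustration of what this conversion involves, the sketch below assumes that the scores text file holds one whitespace-separated pair of integer molecule ID and DOCK score per line. That file layout, the function name convert_scores, and the array dtypes are assumptions made for this example; the actual convert_scores_to_npy.py may work differently.&lt;br /&gt;

```python
import os

import numpy as np


def convert_scores(txt_path, round_number, out_dir):
    """Split a two-column text file into line-matched index and score arrays.

    Assumed input layout (hypothetical): "molecule_id score" on each line.
    """
    indices, scores = [], []
    with open(txt_path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            indices.append(int(parts[0]))
            scores.append(float(parts[1]))
    indices_path = os.path.join(out_dir, f"indices_round_{round_number}.npy")
    scores_path = os.path.join(out_dir, f"scores_round_{round_number}.npy")
    # Both arrays keep the same order, so row i of one matches row i of the other.
    np.save(indices_path, np.array(indices, dtype=np.int64))
    np.save(scores_path, np.array(scores, dtype=np.float64))
    return indices_path, scores_path
```

Keeping the two arrays in the same order is what makes them line-matched: entry i of the indices file identifies the molecule whose score is entry i of the scores file.&lt;br /&gt;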
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files: params.txt, run_chemstep.py and launch_chemstep_as_job.sh, your score and indices numpy files.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all subsequent rounds of ChemSTEP chaining. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Be sure that this file reflects your score and indices files for round zero (the seed set). Define your desired pProp, number of beacons, and number of molecules to prioritize per round. There should be no need to edit run_chemstep.py or the SGE wrapper script for the FIRST round of XReal docking. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; We strongly suggest running ChemSTEP as a job array with 64 CPU slots requested (specified in wrapper). &lt;br /&gt;
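
Before submitting, it can help to sanity-check params.txt. The sketch below is an independent check written for this page, not ChemSTEP's own parser: it assumes the simple key: value layout shown above, and the helper names read_params and check_params are hypothetical.&lt;br /&gt;

```python
NUMERIC_KEYS = ("hit_pprop", "n_docked_per_round", "max_beacons", "max_n_rounds")
PATH_KEYS = ("seed_indices_file", "seed_scores_file")


def parse_value(raw):
    """Return raw as an int or float when possible, otherwise as a string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def read_params(text):
    """Parse 'key: value' lines into a dictionary, ignoring other lines."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        params[key.strip()] = parse_value(value.strip())
    return params


def check_params(params):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for key in PATH_KEYS:
        if not str(params.get(key, "")).startswith("/"):
            problems.append(f"{key} should be an absolute path")
    for key in NUMERIC_KEYS:
        if not isinstance(params.get(key), (int, float)):
            problems.append(f"{key} should be a number")
    return problems
```

To use it, read the file with open("params.txt").read(), pass the text to read_params, and print whatever check_params returns before running qsub.&lt;br /&gt;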
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When finished, there will be a smi_round_1.smi inside of output/complete_info. These molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; (taken from docs.docking.org) &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet your pProp score threshold (output by ChemSTEP)!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
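
One way to write the SDI file is sketched below. It assumes the built bundles sit somewhere under the building output folder and share a common file suffix; the default pattern *.db2.tgz and the function name write_sdi are assumptions for illustration, so check your building output for the actual bundle names before relying on it.&lt;br /&gt;

```python
import glob
import os


def write_sdi(build_dir, sdi_path, pattern="*.db2.tgz"):
    """Write one absolute bundle path per line and return how many were written.

    The default pattern is a guess at the bundle suffix; adjust it as needed.
    """
    search = os.path.join(os.path.abspath(build_dir), "**", pattern)
    # recursive=True lets "**" match bundles at any depth under build_dir.
    bundles = sorted(glob.glob(search, recursive=True))
    with open(sdi_path, "w") as handle:
        for bundle in bundles:
            handle.write(bundle + "\n")
    return len(bundles)
```

For example, write_sdi("building_output", "round_1.sdi") would list every matching bundle under building_output; the returned count is a quick check that the number of bundles is what you expect.&lt;br /&gt;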
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], and it should increase by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. The performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16848</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16848"/>
		<updated>2025-08-26T23:55:56Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: August 20 2025 katie&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process that runs between rounds of docking. The general procedure is as follows: build molecules, dock molecules, convert scores for ChemSTEP, run ChemSTEP. In this case, the first round of building (the &amp;quot;seed set&amp;quot;) has already been done.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What you (the user) need: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules. &lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_00_seed_set.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; (docking directions taken from docs.docking.org). Run this as you would a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory (containing your docking files and output folder) while logged into a dev node:&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python3 get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the current working directory. If your docking output follows a different naming convention, edit the path in get_scores.py. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line. For example, when docking the first round of prioritized molecules, pass 1 to produce scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays that ChemSTEP can read.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python3 convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files: params.txt, run_chemstep.py and launch_chemstep_as_job.sh, your score and indices numpy files.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all subsequent rounds of ChemSTEP chaining. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Be sure that this file reflects your score and indices files for round zero (the seed set). Define your desired pProp, number of beacons, and number of molecules to prioritize per round. There should be no need to edit run_chemstep.py or the SGE wrapper script for the FIRST round of XReal docking. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; We strongly suggest running ChemSTEP as a job array with 64 CPU slots requested (specified in wrapper). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When finished, there will be a smi_round_1.smi inside of output/complete_info. These molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; (taken from docs.docking.org) &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet your pProp score threshold (output by ChemSTEP)!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], increasing by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. The performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16847</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16847"/>
		<updated>2025-08-26T23:54:16Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: August 20 2025 katie&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process that runs between rounds of docking. The general procedure is as follows: build molecules, dock molecules, convert scores for ChemSTEP, run ChemSTEP. In this case, the first round of building (the &amp;quot;seed set&amp;quot;) has already been done.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What you (the user) need: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules. &lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_00_seed_set.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; (docking directions taken from docs.docking.org). Run this as you would a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory (containing your docking files and output folder) while logged into a dev node:&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python3 get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the current working directory. If your docking output follows a different naming convention, edit the path in get_scores.py. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line. For example, when docking the first round of prioritized molecules, pass 1 to produce scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays that ChemSTEP can read.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python3 convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files: params.txt, run_chemstep.py and launch_chemstep_as_job.sh, your score and indices numpy files.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all subsequent rounds of ChemSTEP chaining. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Be sure that this file reflects your score and indices files for round zero (the seed set). Define your desired pProp, number of beacons, and number of molecules to prioritize per round. There should be no need to edit run_chemstep.py or the SGE wrapper script for the FIRST round of XReal docking. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; We strongly suggest running ChemSTEP as a job array with 64 CPU slots requested (specified in wrapper). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When finished, there will be a smi_round_1.smi inside of output/complete_info. These molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; (taken from docs.docking.org) &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet your pProp score threshold (output by ChemSTEP)!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_iterative.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh [round number]&lt;br /&gt;
&lt;br /&gt;
For the first iterative round, the round number is [2], increasing by one for every subsequent round of ChemSTEP. The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. The performance is reported in output/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16846</id>
		<title>Running ChemSTEP</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=Running_ChemSTEP&amp;diff=16846"/>
		<updated>2025-08-26T23:42:23Z</updated>

		<summary type="html">&lt;p&gt;Kholland: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;last update: August 20 2025 katie&lt;br /&gt;
&lt;br /&gt;
ChemSTEP (Chemical Space Traversal and Exploration Procedure) is an open-source, transparent acceleration algorithm for molecular docking capable of dealing with virtual libraries of several trillion compounds. This wiki page is a guide for BKS lab members to run ChemSTEP on Wynton HPC, using the current version of the InfiniSee XReal library (1.1T). For more general use directions, please refer to [ChemSTEP Read-the-Docs]. &lt;br /&gt;
&lt;br /&gt;
At a high level, ChemSTEP is an iterative process that runs between rounds of docking. The general procedure is as follows: build molecules, dock molecules, convert scores for ChemSTEP, run ChemSTEP. In this case, the first round of building (the &amp;quot;seed set&amp;quot;) has already been done.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
What you (the user) need: DOCKFILES, directories for (1) docking (2) building and (3) running ChemSTEP&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Source ChemSTEP virtual environment on Wynton&#039;&#039;&#039;&lt;br /&gt;
    source /wynton/group/bks/work/shared/kholland/chemstep_env/bin/activate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Copy InfiniSee seed-set SDI file into your docking directory&#039;&#039;&#039;&lt;br /&gt;
This directory should already contain your dockfiles, with INDOCK parameters set to your liking. In this step, we are copying in a split database index file (SDI) containing paths to bundles of db2 files. This seed set contains 100 million molecules sampled randomly from the total virtual library, currently 1.1 trillion molecules. &lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/XR_00_seed_set.wynton.sdi .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Dock seed set to your receptor of interest using DOCK 3.8&#039;&#039;&#039; (docking directions taken from docs.docking.org). Run this as you would a normal LSD. &lt;br /&gt;
     export MOLECULES_DIR_TO_BIND=[outermost folder containing the molecules to dock]&lt;br /&gt;
     export DOCKFILES=[path to your dockfiles]&lt;br /&gt;
     export INPUT_FOLDER=[the folder containing your .sdi file(s)]&lt;br /&gt;
     export OUTPUT_FOLDER=[where you want the output ]&lt;br /&gt;
&lt;br /&gt;
     /wynton/group/bks/work/bwhall61/needs_github/super_dock3r.sh&lt;br /&gt;
&lt;br /&gt;
Wait for docking to complete. Next, you must extract all molecule IDs and corresponding DOCK scores from above. To do so, run the following commands in the base docking directory (containing your docking files and output folder) while logged into a dev node:&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/get_scores.py .&lt;br /&gt;
     python3 get_scores.py 0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This script expects a directory named &amp;quot;output*&amp;quot; within the current working directory. If your docking output follows a different naming convention, edit the path in get_scores.py. The output will be a file named &amp;quot;scores_round_0.txt&amp;quot;. For iterative rounds, pass increasing round numbers on the command line. For example, when docking the first round of prioritized molecules, pass 1 to produce scores_round_1.txt. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Convert scores and molecule IDs into NumPy arrays that ChemSTEP can read.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/convert_scores_to_npy.py .&lt;br /&gt;
     python3 convert_scores_to_npy.py 0&lt;br /&gt;
&lt;br /&gt;
This script expects a txt file with molecule IDs and DOCK scores. Use the same round number you used in step 3. As run above, this will output two files named scores_round_0.npy and indices_round_0.npy that contain line-matched molecule IDs (indices) and their respective docking scores. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Enter into or make a directory to run ChemSTEP in. Copy in necessary files: params.txt, run_chemstep.py and launch_chemstep_as_job.sh, your score and indices numpy files.&#039;&#039;&#039;&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/params.txt .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep.py .&lt;br /&gt;
     cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/launch_chemstep_as_job.sh .&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Edit params.txt file&#039;&#039;&#039; This ONLY needs to be edited for the initial round of ChemSTEP. The parameters outlined here will be carried through all subsequent rounds of ChemSTEP chaining. &lt;br /&gt;
    seed_indices_file: /absolute/path/to/your/indices_round_0.npy&lt;br /&gt;
    seed_scores_file: /absolute/path/to/your/scores_round_0.npy&lt;br /&gt;
    hit_pprop: 5&lt;br /&gt;
    n_docked_per_round: 10000000&lt;br /&gt;
    max_beacons: 150&lt;br /&gt;
    max_n_rounds: 250&lt;br /&gt;
&lt;br /&gt;
Be sure that this file reflects your score and indices files for round zero (the seed set). Define your desired pProp, number of beacons, and number of molecules to prioritize per round. There should be no need to edit run_chemstep.py or the SGE wrapper script for the FIRST round of XReal docking. &#039;&#039;If using another virtual library, be sure to update the path in run_chemstep.py to point to your FP library.&#039;&#039; We strongly suggest running ChemSTEP as a job array with 64 CPU slots requested (specified in wrapper). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Run ChemSTEP&#039;&#039;&#039; with the following command: &lt;br /&gt;
    qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When finished, there will be a smi_round_1.smi inside of output/complete_info. These molecules should be built, docked, and their scores fed back into ChemSTEP. More detailed instructions below: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Build prioritized molecules (DOCK 3.8).&#039;&#039;&#039; (taken from docs.docking.org) &lt;br /&gt;
      source /wynton/group/bks/soft/DOCK-3.8.5/env.sh&lt;br /&gt;
&lt;br /&gt;
      python /wynton/group/bks/soft/DOCK-3.8.5/DOCK3.8/zinc22-3d/submit/submit_building_docker.py --output_folder building_output --bundle_size 1000 --minutes_per_mol 1 --skip_name_check --scheduler sge --container_software apptainer --container_path_or_name /wynton/group/bks/soft/DOCK-3.8.5/building_pipeline.sif smi_round_1.smi&lt;br /&gt;
&lt;br /&gt;
When building has completed, you must write an SDI file with the complete paths to each built bundle, then dock. &#039;&#039;&#039;Be sure to change the INDOCK file to save only poses that meet your pProp score threshold (output by ChemSTEP)!&#039;&#039;&#039; Retrieve docking scores and convert them to NumPy arrays as outlined above, updating the round number when running get_scores.py and convert_scores_to_npy.py! Copy the new score and indices files into the directory you ran ChemSTEP in. If you are following along as a tutorial, you should have scores_round_1.npy and indices_round_1.npy from the previous step (from the FIRST round of ChemSTEP prioritization). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9. Set up for iterative rounds of ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
      cp /wynton/group/bks/work/shared/kholland/chemstep_v0p0/run_chemstep_iterative.py .&lt;br /&gt;
&lt;br /&gt;
      vim/nano launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
Edit the launch_chemstep_as_job.sh wrapper so that it calls run_chemstep_iterative.py with the current round number. In this example, we are now running our second round of prioritization [2]. For additional rounds of ChemSTEP, it is ESSENTIAL to update this number in the job submission script, because it determines how the ChemSTEP output files are named.&lt;br /&gt;
      python run_chemstep_iterative.py 2&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10. Run ChemSTEP&#039;&#039;&#039;&lt;br /&gt;
      qsub launch_chemstep_as_job.sh &lt;br /&gt;
&lt;br /&gt;
The output will be a smi_round_****.smi file. Repeat steps 8-10 for as many rounds as needed. The performance is reported in output_directory/complete_info/run_summary.df, which contains the number of beacons selected, the number of molecules docked, the number of hits found, the distance threshold for the selected molecules to dock, and the last added beacon&#039;s distance to all previous beacons.&lt;/div&gt;</summary>
		<author><name>Kholland</name></author>
	</entry>
</feed>