ChemSTEP and how to convince it to pick the good stuff
Recently, some lab members saw unreasonable molecules become strongly enriched, to the point of dominating, after several rounds of ChemSTEP. This page describes a workaround: in the first round you steer ChemSTEP so that beacons are picked not by score alone but also according to interaction filtering or visual inspection. The first round matters most because, at least in the current version, fully automated beacon selection can let cheating molecules take over the beacon set completely.
First step
Dock the seed set as described in either Running_ChemSTEP or ChemSTEP_in_the_13_billion_space. This happens in your round_0, before running ChemSTEP.
Filtering - inspection - your own prioritization
The basic idea is to apply IFP filtering, or any other filtering method you like, so that you do not rely on score alone. I assume here that you will do IFP filtering too, but you can also just inspect the poses visually and pick 100 molecules you like, which forces ChemSTEP to like them too (that is, to use them as beacons in your first round of ChemSTEP). For lab members, IFP filtering is best described in the Read-the-DOCK-docs: https://docs.docking.org/filtering.html#interaction-and-novelty-filtering. For a reference on files, folder structure, etc., have a look here on Wynton:
cd /wynton/group/bks/work/pseemann/2_IFP_CHEMSTEP
Making a list of names
Make a list of the molecule IDs you find reasonable, i.e. the molecules that passed your filtering, visual inspection, or whatever method you used. Save it, one ID per line, as filtered_molecules.txt, for example.
Note: The beginning of the MOL IDs might differ between the 'trillion' and 'billion' versions of ChemSTEP, but that should not be an issue for this approach.
example list
MOL0000egWAbN
MOL0000deAfVb
MOL00006qQdBI
MOL0000besVZV
MOL0000blhcqH
MOL00008TJKq8
MOL0000emHoK9
MOL0000e3b5Xz
MOL00008K8dFx
MOL0000byZUhT
We go back to our docking folder for the seed set (round_0) and run get_scores.py (see Running_ChemSTEP or ChemSTEP_in_the_13_billion_space). This yields a scores_round_0.txt file, which we will now tweak with artificial scores. I usually make a folder like this:
mkdir round_0_filtering
cd round_0_filtering
cp /path/to/filtered_molecules.txt . # your file with the names of molecules you want to be picked as beacons
cp /path/to/scores_round_0.txt . # your file from the get_scores.py script in your round_0 seed set docking folder
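Before rewriting any scores, it can be worth confirming that every ID in filtered_molecules.txt actually occurs in scores_round_0.txt. An ID that is missing (a typo, or a different ID prefix) would otherwise silently never become a beacon. The following is only a minimal sketch, assuming one ID per line in the list and two whitespace-separated columns (ID, score) in the scores file; the function names are hypothetical and not part of ChemSTEP, and the demo scores are made up.

```python
def read_ids(lines):
    """Collect non-empty, stripped molecule IDs, one per line."""
    return {line.strip() for line in lines if line.strip()}


def scored_ids(lines):
    """Collect the IDs (first column) from 'ID score' lines."""
    ids = set()
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # assumption: exactly two columns per valid line
            ids.add(parts[0])
    return ids


def missing_ids(id_lines, score_lines):
    """Return the selected IDs that do not occur in the scores file, sorted."""
    return sorted(read_ids(id_lines) - scored_ids(score_lines))


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the two files (score values are invented)
    demo_ids = ["MOL00004n24Zm\n", "MOL0000egWAbN\n"]
    demo_scores = ["MOL00004n1zC2 -31.2\n", "MOL00004n24Zm -28.7\n"]
    print(missing_ids(demo_ids, demo_scores))
```

On the real files, the same function can be given two open file handles, for example open("filtered_molecules.txt") and open("scores_round_0.txt"); an empty list means every selected molecule has a score entry.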
Now you need a script that sets the artificial scores. I went for 0 for the good stuff and 100 for the bad stuff. You could also use -40 and 100, or whatever you like, as long as the molecules you want get the lower (better) score. I do IFP filtering every round since my target is a bit special, but I think doing it only in the first round can also suffice.
An example script to do so is here:
score_correction.py
import sys

scores_file = sys.argv[1]
ids_file = sys.argv[2]
out_file = sys.argv[3]

with open(ids_file) as f:
    id_set = set(line.strip() for line in f if line.strip())

print("IDs loaded:", id_set)

with open(scores_file) as f:
    lines = f.readlines()

with open(out_file, "w") as out:
    for line in lines:
        parts = line.strip().split()
        if len(parts) != 2:
            continue
        mol_id, score = parts
        mol_id = mol_id.strip()
        if mol_id in id_set:
            out.write(f"{mol_id} 0\n")  # set good stuff to 0, you can adjust
        else:
            out.write(f"{mol_id} 100\n")  # set bad stuff to 100, you can adjust
The script takes three arguments: the scores file, the file with the selected molecule IDs, and the name of the output file.
### We run this now in the prepared round_0_filtering folder which contains your scores_round_0.txt and filtered_molecules.txt
python score_correction.py scores_round_0.txt filtered_molecules.txt out.txt
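out.txt should now contain all the previous IDs, each with its score replaced by 0 or 100.
To check, count the molecules that received the good score. The number should match the number of IDs in filtered_molecules.txt, provided every ID was found in the scores file:
grep " 0" out.txt | wc -l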
file example of out.txt
MOL00004n1zC2 100
MOL00004n204O 100
MOL00004n21qu 100
MOL00004n242A 100
MOL00004n24Zm 0
MOL00004n28dM 100
MOL00004n2Anc 100
MOL00004n2AvB 100
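As an alternative to grep, the rewritten file can be summarized in Python, which also shows at a glance whether any unexpected score value slipped in. This is a sketch only, assuming the two-column format shown in the example; the helper name tally_scores is hypothetical.

```python
from collections import Counter


def tally_scores(lines):
    """Count how often each score value occurs in 'ID score' lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # skip anything that is not 'ID score'
            counts[parts[1]] += 1
    return counts


if __name__ == "__main__":
    # In-memory stand-in for out.txt, taken from the example lines
    demo = [
        "MOL00004n1zC2 100\n",
        "MOL00004n24Zm 0\n",
        "MOL00004n28dM 100\n",
    ]
    counts = tally_scores(demo)
    print(counts["0"], "kept as beacon candidates,", counts["100"], "penalized")
```

For the real file, pass an open handle such as open("out.txt"). With the 0/100 scheme, any key other than "0" or "100" in the result would point to a problem in the rewrite.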
So now we can copy the out.txt file over to our round_0 seed set docking folder and replace the 'old' scores_round_0.txt
# assuming you are still in round_0_filtering
cp out.txt /path/to/round_0/scores_round_0.txt
With the tweaked scores_round_0.txt in place, we now run convert_scores_to_npy.py, which reads the adjusted scores to produce the indices.npy and score.npy files. From here on, every step is the same as previously described (see Running_ChemSTEP or ChemSTEP_in_the_13_billion_space).
ChemSTEP should now only 'see' the compounds with the good score of 0 and ignore the ones with a score of 100. You can check this in your ChemSTEP folder by looking at the .log files and the picked beacons.