ChemSTEP and how to convince it to pick the good stuff

Recently some lab members encountered an unusual, dominating enrichment of unreasonable molecules after several rounds of ChemSTEP. This page describes a possible workaround: at the start, train ChemSTEP to pick beacons not only by score but also according to interaction filtering or visual inspection. This can be crucial in the first round of ChemSTEP because, at least in the current version, automated beacon selection can lead to cheating molecules dominating exclusively.

First step

Dock the seed set as described in Running_ChemSTEP or in ChemSTEP_in_the_13_billion_space. This happens in your round_0, before running ChemSTEP.

Filtering - inspection - your own prioritization

What we basically do is IFP filtering (or any other filtering method you prefer), so that we do not rely on score alone. I am assuming here that you will use IFP too, but you can also just do visual inspection and pick 100 molecules you like, to force ChemSTEP to like them too (that is, to use them as beacons for your first round of ChemSTEP). For lab members, IFP filtering is best described in the Read-the-DOCK-docs: https://docs.docking.org/filtering.html#interaction-and-novelty-filtering. If you want a reference for files, folder structures etc., have a look here on Wynton:

  cd /wynton/group/bks/work/pseemann/2_IFP_CHEMSTEP 

Making a list of names

Make a list of the molecule IDs that you find reasonable, that is, the molecules that passed your filtering, visual inspection, or whatever method you used. Save it as filtered_molecules.txt, for example.

Note: The beginning of the MOL IDs might differ between the 'trillion' and 'billion' versions of ChemSTEP, but that should not be an issue for this approach.

example list

  MOL0000egWAbN
  MOL0000deAfVb
  MOL00006qQdBI
  MOL0000besVZV
  MOL0000blhcqH
  MOL00008TJKq8
  MOL0000emHoK9
  MOL0000e3b5Xz
  MOL00008K8dFx
  MOL0000byZUhT

We go back to our docking folder for the seed set (round_0) and run get_scores.py (see Running_ChemSTEP or ChemSTEP_in_the_13_billion_space). This yields a scores_round_0.txt file, which we will now tweak with artificial scores. I usually make a folder like this:

  mkdir round_0_filtering
  cd round_0_filtering
  cp /path/to/filtered_molecules.txt . # your file with the names of molecules you want to be picked as beacons
  cp /path/to/scores_round_0.txt . # your file from the get_scores.py script in your round_0 seed set docking folder

Now you need a script that sets the artificial scores. I went for 0 for the good stuff and 100 for the bad stuff. You could also use -40 and 100, or whatever you like. I do IFP filtering every round since my target is a bit special, but I think doing it only in the first round can also suffice.

An example script to do so is here:

score_correction.py

  import sys
  
  # Usage: python score_correction.py <scores_file> <ids_file> <out_file>
  scores_file = sys.argv[1]
  ids_file = sys.argv[2]
  out_file = sys.argv[3]
  
  # IDs of the molecules that should be picked as beacons
  with open(ids_file) as f:
      id_set = set(line.strip() for line in f if line.strip())
  
  print("IDs loaded:", id_set)
  
  with open(scores_file) as f:
      lines = f.readlines()
  
  with open(out_file, "w") as out:
      for line in lines:
          parts = line.split()
          if len(parts) != 2:
              continue  # skip lines that are not of the form "ID score"
          mol_id, score = parts
          if mol_id in id_set:
              out.write(f"{mol_id} 0\n")    # good stuff gets 0, adjust as you like
          else:
              out.write(f"{mol_id} 100\n")  # bad stuff gets 100, adjust as you like

The script takes three arguments: the scores file, the file with your molecule IDs, and the name of the output file.

   ### We run this now in the prepared round_0_filtering folder which contains your scores_round_0.txt and filtered_molecules.txt
   python score_correction.py scores_round_0.txt filtered_molecules.txt out.txt

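out.txt should now contain all the previous IDs, with a score of 0 for the molecules on your list and 100 for everything else.

  # to check how many molecules got the good score
  grep " 0" out.txt | wc -l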
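The grep count only tells you how many molecules got the good score. It does not tell you whether every ID on your list was actually found in scores_round_0.txt. A minimal Python sketch of such a check could look like the following. The function name check_scores and its return format are made up for illustration and are not part of ChemSTEP, and the sketch assumes the good score is written as 0, as in score_correction.py above.

```python
from pathlib import Path


def check_scores(out_file, ids_file, good_score="0"):
    """Compare the rewritten scores file against the list of wanted IDs.

    Hypothetical helper, not part of ChemSTEP. Assumes one "ID score"
    pair per line and that wanted molecules were given `good_score`.
    """
    wanted = {
        line.strip()
        for line in Path(ids_file).read_text().splitlines()
        if line.strip()
    }

    good = set()  # IDs that received the artificial good score
    total = 0     # number of well-formed "ID score" lines
    for line in Path(out_file).read_text().splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # same rule as score_correction.py: skip malformed lines
        total += 1
        if parts[1] == good_score:
            good.add(parts[0])

    # IDs from your list that never showed up with the good score
    missing = sorted(wanted - good)
    return {"total": total, "good": len(good), "missing": missing}
```

Calling check_scores("out.txt", "filtered_molecules.txt") in the round_0_filtering folder returns a small dictionary. If "missing" is not empty, some of your IDs were not in the scores file at all (for example because of a typo or a different ID prefix), so they cannot become beacons.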

Example of out.txt

  MOL00004n1zC2 100
  MOL00004n204O 100
  MOL00004n21qu 100
  MOL00004n242A 100
  MOL00004n24Zm 0
  MOL00004n28dM 100
  MOL00004n2Anc 100
  MOL00004n2AvB 100

Now we can copy out.txt over to our round_0 seed set docking folder and replace the 'old' scores_round_0.txt:

  #assuming you are still in round_0_filtering
  cp out.txt /path/to/round_0/scores_round_0.txt
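Note that the cp above overwrites the real docking scores. If you would rather keep them around, a small Python sketch like this one does the same copy but saves the original first. The function name replace_with_backup and the .orig suffix are assumptions chosen for illustration, not a ChemSTEP convention.

```python
import shutil
from pathlib import Path


def replace_with_backup(new_scores, target):
    """Copy new_scores over target, keeping the original as <target>.orig.

    Hypothetical helper; the ".orig" suffix is an arbitrary choice.
    """
    target = Path(target)
    backup = target.with_name(target.name + ".orig")
    # Back up only once, so running this a second time cannot
    # overwrite the real docking scores with already tweaked ones.
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(new_scores, target)
    return backup
```

For example, replace_with_backup("out.txt", "/path/to/round_0/scores_round_0.txt") leaves the tweaked scores in scores_round_0.txt and the untouched docking scores in scores_round_0.txt.orig next to it.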

With the tweaked scores_round_0.txt in place, we now run convert_scores_to_npy.py, which reads the adjusted scores_round_0.txt to produce the indices.npy and score.npy files. From here on, every step is the same as previously described (see Running_ChemSTEP or ChemSTEP_in_the_13_billion_space).

ChemSTEP should now only 'see' the compounds with the good score of 0 and ignore the ones with a score of 100. You can check this in your ChemSTEP folder by looking at the .log files and the picked beacons.