Zoey's way of filtering LSD

From DISI

This is my current method (as of 2024-09) for filtering LSD results; it consolidates the various processing scripts used by Shoichet lab members into a single tutorial. Procedures below are adapted from various other lab sources.

Step 0: Source the environment and copy over relevant files

Source JK's environment with the following commands, accessible on gimel. Note that the setenv lines assume a csh/tcsh shell:

source /mnt/nfs/home/jklyu/anaconda3/etc/profile.d/conda.sh
conda activate bioinfo-env2
if ( ! $?PYTHONPATH ) setenv PYTHONPATH ""
setenv PYTHONPATH $PYTHONPATH\:/mnt/nfs/ex5/work/jklyu/IFP/package/ifp/scripts\:
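As an optional sanity check, you can confirm that the IFP scripts directory actually landed on PYTHONPATH before proceeding (the grep pattern below is just the tail of the path exported above):

```shell
# Optional sanity check: confirm the IFP scripts directory is on PYTHONPATH.
# The grep pattern is just the tail of the path set above.
echo "$PYTHONPATH" | grep 'ifp/scripts' && echo "PYTHONPATH looks OK"
```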

Subsequently, copy over my collection of modified scripts by running the script below, which builds the requisite directory structure and copies all necessary scripts into place:

sh /nfs/exk/work/zdingman/scripts/LSD_analysis/populate_LSD_analysis.sh .

Step 1: Interaction fingerprinting

Copy your initially extracted poses to prep_poses. These poses need to be split so they can be fingerprinted in parallel, so run the splitting script. The fewer poses per mol2 file, the faster each job completes; anecdotally, a file with 2000 molecules takes about 6 hours to fingerprint. Zoey will rewrite the splitting script at some point, since it currently has an arbitrary cap of 156 output files.

[The split script needs to be modified and updated. For now, run splitting on Wynton. Ask Seth or others for assistance. Otherwise, see Tia's scripts.]
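Until the official splitting script is updated, a minimal awk sketch of the splitting step might look like the following. It assumes each record in poses.mol2 begins with an @<TRIPOS>MOLECULE line and writes chunks of n molecules to chunk_0.mol2, chunk_1.mol2, and so on; the file names and chunk size here are illustrative, not the lab's convention:

```shell
# Hedged sketch: split poses.mol2 into chunks of n molecules each.
# Assumes every molecule record begins with an @<TRIPOS>MOLECULE line.
awk -v n=2000 '
    /@<TRIPOS>MOLECULE/ {
        if (count % n == 0) file = "chunk_" int(count / n) ".mol2"
        count++
    }
    file { print > file }
' poses.mol2
```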

Once your mol2 files are split, reformat and relocate them to your working directory. Place your pdb file in the working directory.

Generate a dirlist of mol2 files, and run from the working directory:

ls *.mol2 > dirlist
csh ../scripts/submit.csh [rec name.pdb]

Now we need to combine the output directories back together:

ls -d [0-9]* > dirlist_combine
python2.7 ../scripts/combine_ifp.py dirlist_combine combined

At this point, our interactions have been identified. We can view them in combined.interactions.csv and use the awk command to fetch a list of compounds matching our desired interaction patterns. For example:

awk -F, '$4 <= 1 && ($8 == 1 || $9 == 1) {print $1}' combined.interactions.csv > filtered_compounds

Note: awk is a powerful command-line tool with its own small scripting language (pattern matching plus C-like actions) for processing multi-column text files line by line. The command above opens combined.interactions.csv and splits each line into columns on commas, as specified by the -F flag. Columns are 1-indexed and accessed as $1, $2, and so on, analogous to how bash script arguments are accessed. The pattern to match is given inside the single quotes; for our purposes this is usually a combination of logical operators as above. OR statements (||) are especially useful when multiple interactions are acceptable (e.g. either a hydrophobic interaction or a pi-stacking interaction with a phenylalanine). For each line that matches the pattern, awk executes the action inside the curly braces, here printing the first column. We then redirect that output to a new file titled 'filtered_compounds' for future use.
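To make the column logic concrete, here is a toy run on a made-up three-column CSV (the file and the meaning of its columns are hypothetical, not the real fingerprint output):

```shell
# Toy CSV: compound name, then two hypothetical interaction flags.
printf 'ZINC001,1,0\nZINC002,0,1\nZINC003,0,0\n' > toy.csv

# Keep compounds with either flag set, printing only the name.
awk -F, '$2 == 1 || $3 == 1 {print $1}' toy.csv
# prints:
# ZINC001
# ZINC002
```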

awk is worth learning in greater detail. While far from exhaustive, I found this page and this page to be useful starting points.

Unfortunately, splitting the mol2 files and recombining them has scrambled the rank ordering of our compounds, which will cause issues in the clustering step below. I will write a script to deal with this at some point.
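Until that script exists, one hedged workaround: if you still have a file listing compound IDs in the original rank order (the file name original_ranking.txt below is hypothetical), you can re-impose that order by filtering the ranked list against filtered_compounds:

```shell
# Hedged sketch: restore the original rank order of filtered_compounds.
# Assumes original_ranking.txt lists one compound ID per line, best first.
# -F fixed strings, -x whole-line matches, -f read patterns from a file.
grep -Fxf filtered_compounds original_ranking.txt > filtered_compounds.ranked
```

Because grep walks original_ranking.txt top to bottom, the output inherits that file's ordering regardless of the order in filtered_compounds.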

For now, we have a list of compounds that pass the interaction fingerprinting step.

Step 2: Novelty filtering

I've been making use of Olivier's method for novelty, which can be found here.

It has been recommended to restrict the ChEMBL results for the target protein to just those compounds annotated with binding data (EC50, IC50, etc.). Additionally, for targets we have previously screened internally, it can be worthwhile to include a second novelty filter against previously identified/purchased compounds.


Step 3: Best first clustering

Note that the input SMILES need to be in rank-order.