Zoey's way of filtering LSD

From DISI
Jump to navigation Jump to search

This is my current method (updated 2026-03) of filtering LSD results on Cluster 2, and consolidates various processing scripts used by Shoichet lab members into a single tutorial for filtering LSD results. For working up on Wynton, reference the ReadTheDocks. Procedures below are adapted from various other lab sources.

Step 0: Source the environment and copy over relevant files

Source JK's environment with the following commands, accessible on gimel:

source /mnt/nfs/home/jklyu/anaconda3/etc/profile.d/conda.sh
conda activate bioinfo-env2
if ! ( $?PYTHONPATH ) setenv PYTHONPATH ""
setenv PYTHONPATH $PYTHONPATH\:/mnt/nfs/ex5/work/jklyu/IFP/package/ifp/scripts\:

Subsequently, copy the needed scripts:

cp /nfs/home/zdingman/scripts/IFP/jk_scripts ./scripts

Step 1: Interaction fingerprinting

Copy your initially extracted poses to prep_poses. These poses will need to be split so we can run them in parallel, so run the requisite script(s). The fewer poses you have in your mol2 files, the faster they will complete. Each mol2 file must be named 'filtered-#.mol2', owing to how the underlying scripts were written by JK. Rename as follows:

 mkdir output_renamed
 sh rename_utility.sh

Anecdotally, it takes about 6 hours to fingerprint a file with 2000 molecules in it. [Tia and Andrii both have splitting scripts. Alternatively, just use the poses_extract files, but note that the larger # of mol2 per file will run slower.]

Once your mol2 files are split, relocate them to your working directory. This is necessary. Place your pdb file in the working directory. Generate a dirlist of mol2 files, and run from the working directory:

ls *.mol2 > dirlist
csh ../scripts/submit.csh [rec name.pdb]

Now we need to combine the output directories back together:

ls -d [0-9]* > dirlist_combine
python2.7 ../scripts/combine_ifp.py dirlist_combine combined

At this point, our interactions have been identified. We can view them in combined.interactions.csv and use the awk command to fetch a list of compounds matching our desired interaction patterns. For example:

awk -F, '$4 <= 1 && ($8 == 1 || $9 == 1) {print $1}' combined.interactions.csv > filtered_compounds.names

Columns correspond to the following:

CSV Field Value
$1 Compound Name
$2 # H-Bond Donors
$3 # H-Bond Acceptors
$4 # Unsatisfied H-Bond Donors
$5 # Unsatisfied H-Bond Acceptors
etc..., Interactions that were checked.

Finally, run the following:

 ssh epyc2
 cd [WORKING DIRECTORY]
 source /nfs/home/omailhot/pyenv_source.sh
 python ../scripts/process_interactions.py combined.csv rec combined.zincid.energy combined.interactions.csv
 grep -Ff filtered_compounds.names combined.all.csv > filtered.all.interactions.csv
 awk -F',' '{print $2 " " $3}' filtered.all.interactions.csv > filtered.all.interactions.smi

We now have a rank-ordered list of compounds and corresponding structures passing the interaction fingerprinting step.

Note: filtered.all.interactions.csv and filtered_compounds.names should be approximately the same length. filtered.all.interactions.csv may be longer if some compounds had multiple protomers that both passed IFP. This is negligible, and will be taken care of by Best First Clustering.

Step 2: Novelty filtering

See Olivier's method described here.

It is recommended to restrict the results on ChEMBL for a given target protein to only include compounds annotated with binding (EC50, IC50, etc) data. Additionally, it can be worthwhile with targets we have previously screened internally to include a second novelty filter for previously identified/purchased compounds.


Step 3: Best first clustering

Note that the input SMILES need to be in rank-order.

 ssh gimel
 cd [WORKING DIRECTORY]
 source /nfs/soft/dock/versions/dock37/DOCK-3.7.5.0/env.sh
 python2 get_fingerprints_bfc.py *smi [OUTPUT_NAME] [MIN_TC_SIM_IN_CLUSTER] [MAX_NUM_CLUSTERS]


Addendum: Using awk

awk is a powerful command-line tool that uses its own scripting language (essentially regex with additional features added on top) to process multi-column text files on a line-by-line basis. The above command opens the file 'combined.interactions.csv' and separates each line into multiple columns separated by commas, as specified by the -F flag. Columns are 1-indexed and can be accessed using $1, analogous to how bash scripting arguments are accessed. The operation being executed is specified within single-quotes; for our purposes we usually want to use a combination of logic operators as above. OR statements (||) are especially useful, especially in cases where multiple interactions are possible (e.x. either a hydrophobic interaction or a pi-stacking interaction with a phenylalanine). For lines which match the pattern specified, awk will output the contents contained within curly braces. We redirect the awk output to a new file titled 'filtered_compounds' for future use.

awk is worth learning in greater detail. While far from exhaustive, I found this page and this page to be useful starting points.