Zoey's way of filtering LSD
This is my current method (updated 2026-03) of filtering LSD results on Cluster 2, and consolidates various processing scripts used by Shoichet lab members into a single tutorial for filtering LSD results. For working up on Wynton, reference the ReadTheDocks. Procedures below are adapted from various other lab sources.
Step 0: Source the environment and copy over relevant files
Source JK's environment with the following commands, accessible on gimel:
source /mnt/nfs/home/jklyu/anaconda3/etc/profile.d/conda.sh conda activate bioinfo-env2 if ! ( $?PYTHONPATH ) setenv PYTHONPATH "" setenv PYTHONPATH $PYTHONPATH\:/mnt/nfs/ex5/work/jklyu/IFP/package/ifp/scripts\:
Subsequently, copy the needed scripts:
mkdir post_LSD cd post_LSD cp /nfs/home/zdingman/scripts/IFP/jk_scripts ./scripts
Step 1: Interaction fingerprinting
Copy your initially extracted poses to prep_poses. These poses will need to be renamed 'filtered-#.mol2', owing to how the underlying scripts were written by JK. Rename as follows:
mkdir prep_poses cd prep_poses cp [/path/to/poses/*mol2] . mkdir output_renamed sh rename_utility.sh
[Note: It takes about 6 hours to fingerprint a file with 2000 molecules in it. Tia and Andrii both have splitting scripts; if you want to combine all your mol2 files and split them into smaller chunks, now would be the time. Alternatively, just use the original poses_extract mol2, but note that the larger # of poses per file will take a bit longer.]
Once your mol2 files are renamed, relocate them to your working directory. This is necessary (you can't soft link them). Place your pdb file in the working directory. Generate a dirlist of mol2 files, and run from the working directory:
ls *.mol2 > dirlist csh ../scripts/submit.csh [rec name.pdb]
Now we need to combine the output directories back together:
ls -d [0-9]* > dirlist_combine python2.7 ../scripts/combine_ifp.py dirlist_combine combined
At this point, our interactions have been identified. We can view them in combined.interactions.csv and use the awk command to fetch a list of compounds matching our desired interaction patterns. For example:
awk -F, '$4 <= 1 && ($8 == 1 || $9 == 1) {print $1}' combined.interactions.csv > filtered_compounds.names
Columns correspond to the following:
| CSV Field | Value |
|---|---|
| $1 | Compound Name |
| $2 | # H-Bond Donors |
| $3 | # H-Bond Acceptors |
| $4 | # Unsatisfied H-Bond Donors |
| $5 | # Unsatisfied H-Bond Acceptors |
| etc..., | Interactions that were checked. |
Finally, run the following:
ssh epyc2
cd [WORKING DIRECTORY]
source /nfs/home/omailhot/pyenv_source.sh
python ../scripts/process_interactions.py combined.csv rec combined.zincid.energy combined.interactions.csv
grep -Ff filtered_compounds.names combined.all.csv > filtered.all.interactions.csv
awk -F',' '{print $2 " " $3}' filtered.all.interactions.csv > filtered.all.interactions.smi
We now have a rank-ordered list of compounds and corresponding structures passing the interaction fingerprinting step.
Note: filtered.all.interactions.csv and filtered_compounds.names should be approximately the same length. filtered.all.interactions.csv may be longer if some compounds had multiple protomers that both passed IFP. This is negligible, and will be taken care of by Best First Clustering.
Step 2: Novelty filtering
See Olivier's method described here.
It is recommended to restrict the results on ChEMBL for a given target protein to only include compounds annotated with binding (EC50, IC50, etc) data. Additionally, it can be worthwhile with targets we have previously screened internally to include a second novelty filter for previously identified/purchased compounds.
Step 3: Best first clustering
Note that the input SMILES need to be in rank-order.
ssh gimel cd [WORKING DIRECTORY] cp /nfs/home/zdingman/scripts/IFP/get_fingerprints_bfc.py . source /nfs/soft/dock/versions/dock37/DOCK-3.7.5.0/env.sh python2 get_fingerprints_bfc.py *smi [OUTPUT_NAME] [MIN_TC_SIM_IN_CLUSTER] [MAX_NUM_CLUSTERS]
To get the poses from your mol2 file:
cp /nfs/home/zdingman/scripts/IFP/lc_blazing_fast_collect_mol2.py .
awk -F',' '{print $3}' cluster_head.list > cluster_head.names
python lc_blazing_fast_collect_mol2.py cluster_head.names [PATH/TO/COMBINED/MOL2] [OUTPUTNAME].mol2
Addendum: Using awk
awk is a powerful command-line tool that uses its own scripting language (essentially regex with additional features added on top) to process multi-column text files on a line-by-line basis. The above command opens the file 'combined.interactions.csv' and separates each line into multiple columns separated by commas, as specified by the -F flag. Columns are 1-indexed and can be accessed using $1, analogous to how bash scripting arguments are accessed. The operation being executed is specified within single-quotes; for our purposes we usually want to use a combination of logic operators as above. OR statements (||) are especially useful, especially in cases where multiple interactions are possible (e.x. either a hydrophobic interaction or a pi-stacking interaction with a phenylalanine). For lines which match the pattern specified, awk will output the contents contained within curly braces. We redirect the awk output to a new file titled 'filtered_compounds' for future use.
awk is worth learning in greater detail. While far from exhaustive, I found this page and this page to be useful starting points.