Filtering ligands for novelty
Written by Chase Webb 09-01-2018
After a large-scale docking campaign, it is important to remove prospective ligands that are too similar to compounds already known to modulate the receptor, so that effort can be focused on assessing new chemical interactions. This is best done after clustering has been carried out as described here: Processing Results from LSD
This process proceeds in the following steps:
Make a new directory to do similarity filtering.
Make a symbolic link to the location where clustering occurred.
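The two setup steps above can be sketched as follows. The directory names novelty_filter and clustering are hypothetical placeholders, not names used by the original workflow; substitute the actual location where clustering was run.

```shell
# Hypothetical paths: adjust both to match your own campaign layout.
CLUSTER_DIR="$PWD/clustering"      # assumed location of the clustering results
WORK_DIR="$PWD/novelty_filter"     # assumed name for the new working directory

# Create the working directory (no error if it already exists).
mkdir -p "$WORK_DIR"

# Create a symbolic link to the clustering directory inside the working directory.
# -f replaces an existing link; -n avoids descending into an existing link target.
ln -sfn "$CLUSTER_DIR" "$WORK_DIR/clustering"

# Confirm that the link is in place.
ls -l "$WORK_DIR"
```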
1. Generate a list of SMILES for the known compounds. The simplest way to do this is to download them from ZINC. For the Mu opioid receptor (OPRM1), for instance, go here: ZINC15 Genes
2. Generate fingerprints for the known compounds. Run the following script written by TEB and JKL. The inputs are the name of the knowns file and the name of the output fingerprint file.
python ~jklyu/zzz.github/ChemInfTools/utils/teb_chemaxon_cheminf_tools/generate_chemaxon_fingerprints.py knowns_list.smi knowns
3. Convert the fingerprints from binary to unsigned integers. Run the following script written by TEB and JKL. The inputs are the bitstrings generated by the above script, the SMILES file used to generate those bitstrings, and the prefix of the output file. You will need to do this for both the knowns and the clusterheads that were calculated in the previous tutorial: Processing Results from LSD
~jklyu/zzz.github/ChemInfTools/utils/convert_fp_2_fp_in_16unit/convert_fp_2_fp_in_uint16 knowns.fp knowns_list.smi knowns
~jklyu/zzz.github/ChemInfTools/utils/convert_fp_2_fp_in_16unit/convert_fp_2_fp_in_uint16 extract_all.topN.sort.uniq.fp extract_all.topN.zincid.sort.uniq.smi topN_clusterhead
4. Calculate an all-by-all Tc (Tanimoto coefficient) matrix for the knowns against the clusterheads. Run the following script written by TEB and JKL:
nohup ~jklyu/zzz.github/ChemInfTools/utils/cal_Tc_matrix_uint16/cal_Tc_matrix_uint16 topN_clusterhead_uint16.fp extract_all.topN.zincid.sort.uniq.smi topN_clusterhead_uint16.count knowns_uint16.fp knowns_list.smi knowns_uint16.count tc_matrix > log &
The arguments supplied to this script are as follows:
(1) topN_clusterhead_uint16.fp
(2) extract_all.topN.zincid.sort.uniq.smi
(3) topN_clusterhead_uint16.count
(4) knowns_uint16.fp
(5) knowns_list.smi
(6) knowns_uint16.count
(7) prefix for output file
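For orientation, the Tanimoto coefficient between two binary fingerprints is conventionally defined as below. This is the standard textbook definition; it is an assumption, not something confirmed here, that cal_Tc_matrix_uint16 computes exactly this quantity. The symbols a, b and c are introduced only for this illustration.

```latex
T_c(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{c}{a + b - c}
% a = number of bits set in fingerprint A
% b = number of bits set in fingerprint B
% c = number of bits set in both A and B
%
% Worked example: a = 4, b = 5, c = 3
% T_c = 3 / (4 + 5 - 3) = 3 / 6 = 0.5
```

A value of 1 means the two fingerprints are identical, and a value of 0 means they share no set bits, so a clusterhead with a high Tc to any known is a candidate for removal as insufficiently novel.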
To check on the progress of this script, use the command ps -fu, or run ls -l in the directory where the script is running and look at the log file.
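A minimal sketch of those checks, assuming it is run from the directory where the Tc calculation was started and that its output was redirected to a file called log, as in the command above. The variable name me is introduced only for this example.

```shell
# Assumption: the job was started from the current directory with "> log".
me="$(id -un)"

# List your own processes and keep only the Tc matrix job, if it is still running.
# The [c] in the pattern stops grep from matching its own command line.
ps -fu "$me" | grep '[c]al_Tc_matrix_uint16' || echo "no cal_Tc_matrix_uint16 process found"

# Show the size and modification time of the log, then its last few lines.
if [ -f log ]; then
    ls -l log
    tail -n 20 log
else
    echo "no log file in $PWD yet"
fi
```

If the process is no longer listed and the log has stopped growing, the job has most likely ended, whether successfully or not.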