ECFP4 Best First Clustering

From DISI

Latest revision as of 15:51, 19 September 2017

Written by Jiankun Lyu, 2017/09/13

1) Cluster about 1M molecules

Run the script from the directory that contains your extract_all.sort.uniq.txt file:

cd <directory containing your extract_all.sort.uniq.txt>
csh ~jklyu/zzz.script/large_scale_docking/cluster_analysis/best_first_clustering.csh number_of_top_molecules_you_want_to_cluster TC_cutoff

Example:

csh ~jklyu/zzz.script/large_scale_docking/cluster_analysis/best_first_clustering.csh 1000000 0.5

This clusters the top 1M molecules from a docking run with a Tc cutoff of 0.5.

Clustering the top 1M molecules usually takes about 6 hours. Please do not use this script to cluster more than 1M molecules.
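
To make the role of the Tc cutoff concrete, here is a minimal Python sketch of the general best-first clustering idea. It is not the code inside best_first_clustering.csh. It assumes that molecules are already ordered best-first (for example by docking score), that fingerprints are represented as sets of on-bit indices, and that a molecule joins a cluster when its Tanimoto coefficient to the cluster head is greater than or equal to the cutoff. The function names tanimoto and best_first_clusters are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch: two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def best_first_clusters(fingerprints, tc_cutoff):
    """Greedy best-first clustering (illustrative sketch only).

    `fingerprints` must already be ordered best-first.
    Returns a list of (head_index, member_indices) tuples.
    """
    unassigned = list(range(len(fingerprints)))
    clusters = []
    while unassigned:
        head = unassigned[0]  # best-ranked molecule not yet in any cluster
        members, remaining = [], []
        for idx in unassigned[1:]:
            # Assumption: ">=" decides membership; the real tool may differ.
            if tanimoto(fingerprints[head], fingerprints[idx]) >= tc_cutoff:
                members.append(idx)
            else:
                remaining.append(idx)
        clusters.append((head, members))
        unassigned = remaining
    return clusters


# Toy fingerprints, ordered best-first; real ECFP4 fingerprints have many more bits.
toy = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}, {7, 8, 9, 10}, {1, 2, 3, 4, 5}]
clusters = best_first_clusters(toy, 0.5)
print(clusters)  # [(0, [1, 4]), (2, [3])]
```

A lower cutoff merges more molecules into each cluster and so yields fewer cluster heads. A higher cutoff yields more, smaller clusters.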

2) Cluster more than 1M molecules

Follow the tutorial "Large-scale SMILES Requesting and Fingerprints Converting" to get the smi files and compressed fingerprints for the molecules you want to cluster.

Then run the commands below:

setenv BFCPATH "/mnt/nfs/home/jklyu/zzz.github/ChemInfTools/utils/best_first_clustering_uint16"
${BFCPATH}/best_first_clustering_uint16 /path/fingerprint.file /path/count.file /path/smiles.file tc.thres.val max.num.val
syntax: best_first_clustering_uint16 
 (1) fingerprint file 
 (2) count file 
 (3) smiles file 
 (4) tanimoto coefficient threshold value to define clustering (must be between 0.0 and 1.0) 
 (5) max number of clusters (must be an integer)
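
Since a run on more than 1M molecules is expensive, it can help to check the two numeric arguments before launching the program. The sketch below only builds and validates the argument list. The helper name build_bfc_command is hypothetical, the value 10000 for the maximum number of clusters is an arbitrary illustration, and the threshold range is assumed to include both 0.0 and 1.0. The file paths are the placeholders from the command above.

```python
def build_bfc_command(binary, fingerprint_file, count_file, smiles_file,
                      tc_threshold, max_clusters):
    """Return the argument list for best_first_clustering_uint16 after basic checks."""
    tc = float(tc_threshold)
    # Assumption: the bounds 0.0 and 1.0 are themselves allowed.
    if not 0.0 <= tc <= 1.0:
        raise ValueError("Tanimoto threshold must be between 0.0 and 1.0")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_clusters, bool) or not isinstance(max_clusters, int):
        raise ValueError("max number of clusters must be an integer")
    return [binary, fingerprint_file, count_file, smiles_file,
            str(tc), str(max_clusters)]


cmd = build_bfc_command(
    "best_first_clustering_uint16",
    "/path/fingerprint.file",
    "/path/count.file",
    "/path/smiles.file",
    0.5,
    10000,  # illustrative value only
)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run once the binary's full path (BFCPATH above) is prepended.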