Large-scale SMILES Requesting and Fingerprints Converting

From DISI

Written by Jiankun Lyu, 20170918

The hierarchy of the directories:

smiles_requesting/----- working/ 
              |                |
              |                |------ extract_all.sort.uniq.txt file(soft link)
              |                | 
              |                |------ db_zincid/
              |                                                 
              |                                                 
              |
              ------- scripts/ ------ submit.csh
                              |
                              |------ make_chunks_for_file_new.py
                              |
                              |------ setup_converting_fps_files.py
                              |
                              |------ combine_smi_and_fp.py
                              |
                              |------ check_outputs.csh


This tutorial describes how to request SMILES for a large number of docking results from the ZINC server. It is intended for requests of more than about 5M ZINC IDs.

1) Make directories and copy scripts

mkdir smiles_requesting
cd smiles_requesting
mkdir working
mkdir scripts
cd working
mkdir db_zincid
ln -s /path/to/extract_all.sort.uniq.txt
cd ../scripts
cp /mnt/nfs/home/jklyu/zzz.script/large_scale_docking/cluster_analysis/best_first_clustering/converting_fps/submit.csh .
cp /mnt/nfs/home/jklyu/zzz.script/large_scale_docking/cluster_analysis/best_first_clustering/converting_fps/setup_converting_fps_files.py .
cp /mnt/nfs/home/jklyu/zzz.script/large_scale_docking/cluster_analysis/best_first_clustering/converting_fps/combine_smi_and_fp.py .
cp /mnt/nfs/home/jklyu/zzz.script/large_scale_docking/cluster_analysis/best_first_clustering/converting_fps/check_outputs.csh .
cp /mnt/nfs/home/jklyu/zzz.script/large_scale_docking/cluster_analysis/best_first_clustering/converting_fps/make_chunks_for_file_new.py .
cd ../

2) Get ZINC ID and energy columns from the extract_all.sort.uniq.txt file and split the zincid file

cd working/db_zincid
head -(number) ../extract_all.sort.uniq.txt | awk '{print $3" "$22}' > extract_all.top(number).sort.uniq.zincid.energy
Note: replace (number), including the parentheses, with the number of lines you want to take from the top of the file.
python ../../scripts/make_chunks_for_file_new.py extract_all.top(number).sort.uniq.zincid.energy top(number).zincid 500 .
cd ../
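
For readers who want to see what step 2 does to the data, here is a minimal Python sketch of the same two operations: keeping fields 3 and 22 of the first N lines (what the head and awk commands do), and splitting the result into pieces. The function names are hypothetical, and the chunking rule is an assumption: this sketch treats 500 as the number of chunks, but make_chunks_for_file_new.py may interpret its arguments differently, so check the script itself.

```python
from itertools import islice


def extract_id_energy(lines, top_n):
    """Keep whitespace-separated fields 3 and 22 (1-based) of the first top_n lines."""
    pairs = []
    for line in islice(lines, top_n):
        fields = line.split()
        # awk's $3 and $22 are 1-based, so they map to indices 2 and 21.
        # Unlike awk, this raises IndexError on lines with fewer than 22 fields.
        pairs.append((fields[2], fields[21]))
    return pairs


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous pieces whose sizes differ by at most one.

    Assumption: the real chunking script may use a different rule.
    """
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks
```

With 10 items and 3 chunks, for example, the pieces have sizes 4, 3 and 3, and concatenating them gives back the original order.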

3) Create a zincid.sdi file

ls /full/path/to/db_zincid/top(number)_*.zincid > zincid.sdi
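
The ls command above writes one full path per line into zincid.sdi. If you prefer to build the list from Python, a rough equivalent is sketched below. The function name write_sdi is hypothetical, and sorted() orders by code point, which may differ from the locale-dependent order that ls uses.

```python
import glob
import os


def write_sdi(chunk_dir, pattern, sdi_path):
    """Write the absolute paths of files matching pattern in chunk_dir, one per line."""
    full_pattern = os.path.join(os.path.abspath(chunk_dir), pattern)
    paths = sorted(glob.glob(full_pattern))
    with open(sdi_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths
```

For example, write_sdi("db_zincid", "top(number)_*.zincid", "zincid.sdi"), with (number) replaced as in step 2, run from the working directory.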

4) Set up requesting files and directories

python ../scripts/setup_converting_fps_files.py . converting_fps_ zincid.sdi 500 count

5) Submit requesting and converting jobs

csh ../scripts/submit.csh

6) Check whether every job finished

cd db_zincid
csh ../../scripts/check_outputs.csh (number_of_jobs) (prefix)
If any files are missing, edit dirlist in the working directory and resubmit the corresponding jobs.
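
The idea behind this check can be sketched in Python as follows. This is not the logic of check_outputs.csh, which is not shown here; the naming scheme (prefix, then job index, then an optional suffix), the starting index, and the function name are all assumptions for illustration.

```python
import os


def find_missing_outputs(directory, prefix, n_jobs, suffix="", start=0):
    """Return the expected output names (prefix + index + suffix) absent from directory.

    Assumption: one output per job, numbered consecutively from start.
    """
    missing = []
    for i in range(start, start + n_jobs):
        name = f"{prefix}{i}{suffix}"
        if not os.path.exists(os.path.join(directory, name)):
            missing.append(name)
    return missing
```

An empty list means every expected file is present; otherwise the returned names indicate which jobs to resubmit.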

7) Collect data from the compressed files

python ../../scripts/combine_smi_and_fp.py 500 (prefix)