Search zinc22.py
Latest revision as of 21:25, 3 December 2024
Description
 usage: init_partitioned_search.py [-i] input_file [-o] results_out [--get-vendors] [-s]
 search for smiles and vendor codes by zinc22 id
 required arguments:
   input_file       file containing list of zinc ids or vendor codes to look up
   results_out      destination for output directory
 optional arguments:
   -h, --help       show this help message and exit
   --vendor-search  look up molecules by vendor code instead of zinc id
   --get-vendors    get vendor supplier codes associated with zinc id
   -s               submit slurm jobs and start the search process, can be omitted if you are only looking to split a list of molecules into tranches
init_partitioned_search.py replaces search_zinc22.py as a more efficient way to look up ZINC IDs on the zinc22 system. Operation is simple: provide a file containing a list of ZINC IDs, and the script splits the input into tranches and searches across all zinc22 databases.
The output format is as follows:
 SMILES ZINC_ID TRANCHE_NAME
With --get-vendors or --vendor-search the output format looks like this:
 SMILES ZINC_ID VENDOR_ID TRANCHE_NAME CATALOG
In other words, the script finds all vendor information and SMILES associated with the provided ZINC IDs or vendor codes.
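If you want to post-process results in Python, the two layouts above can be read with a small parser. This is only a sketch: it assumes the columns are whitespace-separated and that no field (for example a catalog name) contains spaces, neither of which this page confirms. The names ResultRow and parse_result_line are hypothetical and are not part of the zinc22 tools.

```python
from typing import NamedTuple, Optional


class ResultRow(NamedTuple):
    """One line of search output (hypothetical helper type)."""
    smiles: str
    zinc_id: str
    tranche: str
    vendor_id: Optional[str] = None
    catalog: Optional[str] = None


def parse_result_line(line: str) -> ResultRow:
    """Parse one results line.

    Assumption: fields are separated by whitespace and contain no spaces.
    3 columns: SMILES ZINC_ID TRANCHE_NAME
    5 columns: SMILES ZINC_ID VENDOR_ID TRANCHE_NAME CATALOG
    """
    fields = line.split()
    if len(fields) == 3:
        smiles, zinc_id, tranche = fields
        return ResultRow(smiles, zinc_id, tranche)
    if len(fields) == 5:
        smiles, zinc_id, vendor_id, tranche, catalog = fields
        return ResultRow(smiles, zinc_id, tranche, vendor_id, catalog)
    raise ValueError(f"expected 3 or 5 columns, got {len(fields)}: {line!r}")
```

Raising on any other column count is a deliberate choice here: a silently misparsed line is harder to notice than an error.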
Location
You can activate the environment using:
 source /nfs/soft/zinc22/search_zinc/miniconda/bin/activate zinc22_search
Side note about slow queries
Depending on the molecules you happen to be looking up, your search may go by very quickly, or somewhat slowly. Smaller molecules tend to look up very quickly, while larger molecules take longer to find. We're working on it.
Tracking Progress
After submitting the search, you can check the slurm queue to see the status of your jobs. Jobs will be labeled 'search_zinc22'.
Using Output
When all jobs are done, the output folder will contain input files, log files, and .results files. If results files are missing for a tranche, refer to the logs to learn more, or contact the JJI team for troubleshooting. You can merge all results into one file with a simple command:
 cat *.results > yourresultsfile.txt
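The same merge can be done from Python if you are already scripting around the output folder. The sketch below assumes every .results file sits directly in the output folder and is plain text; merge_results is a hypothetical helper name. Unlike cat, it skips blank lines and adds a trailing newline where a file lacks one.

```python
from pathlib import Path


def merge_results(output_dir: str, merged_name: str = "yourresultsfile.txt") -> int:
    """Concatenate all *.results files in output_dir into one file.

    Returns the number of non-blank lines written. Files are merged in
    sorted name order so the result is reproducible.
    """
    out_dir = Path(output_dir)
    merged_path = out_dir / merged_name
    n_lines = 0
    with merged_path.open("w") as merged:
        for results_file in sorted(out_dir.glob("*.results")):
            with results_file.open() as fh:
                for line in fh:
                    if not line.strip():
                        continue  # skip blank lines
                    # make sure the last line of each file ends with a newline
                    merged.write(line if line.endswith("\n") else line + "\n")
                    n_lines += 1
    return n_lines
```

The merged file uses a .txt name so that it is not picked up by the *.results pattern on a later run.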
Usage w/ Bash on BKS cluster
 source /nfs/soft/zinc22/search_zinc/miniconda/bin/activate /nfs/soft/zinc22/search_zinc/miniconda/envs/zinc22_search
 python /nfs/soft/zinc22/search_zinc/init_partitioned_search.py -i /path/to/input/file -o /path/to/output/folder -s
 python /nfs/soft/zinc22/search_zinc/init_partitioned_search.py -i /path/to/input/file -o /path/to/output/folder -s --get-vendors
Usage w/ Csh on BKS cluster
This version of the tool is currently only compatible with Bash.
Dealing with NULL
Sometimes a ZINC ID will fail to look up. This could be because a server is down (the script will notify you if this is the case), or because the ID is missing from the system for some reason. In this case, it may be helpful to separate the molecules that didn't look up from the molecules that did. You may want to save them for later when the servers come back online, or to run a deeper search with comb_legacy_files.py (more on this below).
How to:
 [env]$ cat legitimate_ids.txt > input.txt
 [env]$ echo ZINCzz00ZZZZZZZZ >> input.txt
 [env]$ echo ZINCyy00AAAAAAAA >> input.txt
 [env]$ echo ZINCxx00BBBBBBBB >> input.txt
 [env]$ python search_zinc.py input.txt output.txt
 Searching Zinc22: |XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| 100.0% 0.00s 23/23 complete!
 [env]$ grep "_null_" output.txt
 _null_ ZINCzz00ZZZZZZZZ H33P270
 _null_ ZINCyy00AAAAAAAA H34P280
 _null_ ZINCxx00BBBBBBBB H35P290
search_zinc.py does not omit IDs that fail to look up. Instead, it returns the ZINC ID with "_null_" in place of the fields it could not retrieve (the SMILES in the example above). We can therefore use grep to filter the results.
 [env]$ grep "_null_" output.txt > missing.txt
 [env]$ grep -v "_null_" output.txt > found.txt
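The grep split keeps the whole output line, including the "_null_" placeholder. If you want a bare list of the missing ZINC IDs (for example, to retry later), a short Python sketch can do the split and extract the IDs in one pass. It assumes the ZINC ID is the second whitespace-separated column, as in the example output above; split_results is a hypothetical helper name. Whether comb_legacy_files.py expects one bare ID per line is not stated on this page, so check before feeding it this list.

```python
from typing import Iterable, List, Tuple


def split_results(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate found result lines from the IDs that did not look up.

    Returns (found_lines, missing_ids). A line counts as missing when any
    whole field equals "_null_" (grep matches substrings; this is stricter).
    Assumption: the ZINC ID is the second column.
    """
    found: List[str] = []
    missing_ids: List[str] = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if "_null_" in fields and len(fields) >= 2:
            missing_ids.append(fields[1])
        else:
            found.append(line.rstrip("\n"))
    return found, missing_ids
```

For example, call it as split_results(open("output.txt")) and write missing_ids to a file, one ID per line.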
ZINC IDs should only rarely fail to look up, but if this happens you can use the following script:
comb_legacy_files.py
 python3 /mnt/nfs/home/xyz/btingle/bin/2dload.testing/utils-2d/tin/misc/comb_legacy_files.py [INPUT_ZINC_IDS_FILE]
You don't need to source any particular python 3 environment for this script, but the environment used for search_zinc22.py will work just fine here.
This script combs through our deprecated files and attempts to locate your ZINC IDs there. It creates a file called "result" in your current directory containing all the SMILES found.
If you're looking for vendor information, you can look up the SMILES you get back in arthor/smallworld sets to find vendor codes. Functionality for looking up by SMILES is planned for search_zinc22.py, but is not implemented yet.
If after this you're STILL unable to find your ZINC IDs, you can send them to our development team and we will find them for you.
Email ben@tingle.org, CCing khtang015@gmail.com and josecastanon4@gmail.com. Include your missing file as an attachment.