Get custom library
Latest revision as of 23:01, 17 December 2015
Here is how to get a custom library from ZINC. We describe two methods: one that runs locally on our cluster and one that works from anywhere in the world. If you are local, use the local method; it will be 50-100 times faster. External users: we are aware of this gap, and are working to make remote execution as fast as local, or at least within 2X.
OK, let's say you want the following library in db2 (or db or sdf or mol2 or whatever) format:
- tertiary aliphatic amines, thus matching C[ND3](C)(C)
- in stock for 2 week delivery
- representations around physiological pH (usual)
- mwt 350 to 500
- logP 1.5 to 3.5
Step 1. Get a list of zincids that match all these criteria
#!/nfs/soft/www/apps/zinc15/envs/internal/bin/python
# fetch.py
import os,sys
import logging
from zinc.environ_init import init_environ
init_environ(force_green_threads=True)
from zinc.management import build_context
from zinc.data.representations import OBJECT_MIMETYPE_TO_FORMATTER
# NB: for parallelism, setenv ZINC_ENVIRON_GREEN yes
build_context(context=globals(),environ=os.environ)

logging.basicConfig(level=logging.INFO)
Formatter = OBJECT_MIMETYPE_TO_FORMATTER["text/plain"]
formatter = Formatter(options={}, fields=[
    "smiles","zinc_id",
])
#mwt 350 to 500
m = 'FGHIJ'
# logp 1.5 to 3.5
l = 'DEFG'
t = [x + y for x in m for y in l]
# in stock, contains tertiary aliphatic amine, has protomers in database
c = [
    Substance.tranche_prefix.in_(t),
    Substance.structure.match('C[ND3](C)C'),
    Substance.purchasable > 20,
    Substance.protomers.any(),
]
q = Substance.query.filter(*c)
q = q.yield_per(10)
q = q.limit(100000)
q = q.parallelize()
logging.info(str(q))
for line in formatter(q):
    print line,
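The tranche prefix list in fetch.py is just every pairing of one mwt letter with one logP letter. The standalone sketch below (Python 3, no ZINC imports) reproduces only that step so you can see which prefixes the query will ask for. The letter codes are copied from the script; the variable names are ours, and the bin boundaries behind each letter are defined by ZINC and are not reproduced here.

```python
# Standalone illustration of the tranche-prefix list built in fetch.py.
# The letter codes are copied from the script; the variable names are
# illustrative only.
mwt_codes = "FGHIJ"   # script comment: mwt 350 to 500
logp_codes = "DEFG"   # script comment: logp 1.5 to 3.5

# Every pairing of one mwt letter with one logP letter, mwt letter first.
tranche_prefixes = [m + l for m in mwt_codes for l in logp_codes]

print(len(tranche_prefixes))
print(tranche_prefixes[:4])
```

Widening or narrowing either range is then a matter of adding or removing letters from the corresponding string.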
This will get you SMILES and ZINC IDs in one file (the next command assumes you redirected the script's output to smiles.txt). Extract the ZINC IDs for the next step.
awk '{print $2}' smiles.txt > zincids.txt
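If awk is not available, the same extraction can be sketched in Python 3. This assumes, as the awk command does, that each line holds the SMILES first and the ZINC ID second, separated by whitespace. The function names are ours, not part of any ZINC tooling. Unlike awk, this version skips lines with fewer than two fields instead of printing an empty line.

```python
def extract_zinc_ids(lines):
    """Return the second whitespace-separated field of each line."""
    ids = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:      # skip blank or malformed lines
            ids.append(fields[1])
    return ids


def write_zinc_ids(src="smiles.txt", dst="zincids.txt"):
    """Read src, write one ZINC ID per line to dst, return the count."""
    with open(src) as fin:
        ids = extract_zinc_ids(fin)
    with open(dst, "w") as fout:
        fout.write("".join(zinc_id + "\n" for zinc_id in ids))
    return len(ids)

# Usage (run in the directory that contains smiles.txt):
#   write_zinc_ids()
```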
This will only work if you have ssh access to our cluster. If you do not, you'll have to compose a URL and use the API. e.g.
http://zinc15.docking.org/substances/subsets/now/?structure-matches=C[ND3](C)C&count=all&logp-between=1.5+3.5&mwt-between=350+500
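To vary the ranges without editing the URL by hand, it can be composed programmatically. The following is a minimal Python 3 sketch using only the standard library; the parameter names are taken from the example URL above, while the function name and its arguments are ours. Note that urlencode percent-encodes the brackets and parentheses of the SMARTS pattern, which is the standard encoding of the same string.

```python
from urllib.parse import urlencode

# Base URL copied from the example above.
BASE = "http://zinc15.docking.org/substances/subsets/now/"


def build_query_url(smarts, logp_range, mwt_range, count="all"):
    """Compose a query URL with the same parameters as the example."""
    params = [
        ("structure-matches", smarts),
        ("count", count),
        # A space between the two bounds is encoded as '+' by urlencode.
        ("logp-between", "%s %s" % logp_range),
        ("mwt-between", "%s %s" % mwt_range),
    ]
    return BASE + "?" + urlencode(params)


url = build_query_url("C[ND3](C)C", (1.5, 3.5), (350, 500))
print(url)
```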
It is a pretty bad query! We are working on making this faster!
Step 2. Download
Here, you can get either a split database index file (small, efficient), or download the db2 (or db or whatever) files (slower, but works anywhere, not just in the lab).
#!/bin/csh -f
# getlibrary.csh
# split file "zincids.txt" created in step 1 above into chunks of 10000 molecules per file
split -l 10000 zincids.txt
#
# get files for each tranche
foreach i (x??)
    curl -F zinc_id-in=@${i} "http://zinc15.docking.org/protomers/subsets/now+usual/" -F output_format=txt -F output_fields=files.db2 > split_database_index.$i
    # or, db2: curl -F zinc_id-in=@${i} "http://zinc15.docking.org/protomers/subsets/now+usual/" -F output_format=db2 | gzip -9 -c > $i.db2.gz
end
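The script relies on split to break the ID list into pieces small enough to upload one at a time. For readers without csh, the chunking step alone can be sketched in Python 3 as below. The function name is ours, and the chunk size of 10000 simply mirrors the split -l 10000 call; the upload itself is still left to the curl commands shown above.

```python
def chunk(items, size=10000):
    """Split a list into consecutive pieces of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]

# Usage, assuming zincids.txt from step 1 is in the current directory:
#   with open("zincids.txt") as f:
#       pieces = chunk(f.read().split())
#   # each piece can then be written to its own file and posted with curl
```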
Step 3. DOCK
...
Optional step 0 is "make sure 3D representations are built for all the relevant molecules".
- for right now, please talk to John about getting these built for you.
- We are taking requests!
Step 0 is important now (Dec 2015) as we only have 5M molecules built, growing by 10M per month.
- This step will be much less important 6 months from now when ZINC15 is built out.