Get custom library
Here is how to get a custom library from ZINC. There are two methods: one that runs locally on our cluster, and one that works from anywhere in the world. If you are local, use the local method; it is 50-100 times faster. External users: we are aware of this, and are working to make remote execution as fast as local, or at least within 2X.
OK, let's say you want the following library in db2 (or db or sdf or mol2 or whatever) format:
- tertiary aliphatic amines, thus matching C[ND3](C)(C)
- in stock for 2 week delivery
- representations around physiological pH (usual)
- mwt 350 to 500
- logP 1.5 to 3.5
Step 1. Get a list of zincids that match all these criteria
#!/nfs/soft/www/apps/zinc15/envs/internal/bin/python
# fetch.py
import os,sys
import logging
from zinc.environ_init import init_environ
init_environ(force_green_threads=True)
from zinc.management import build_context
from zinc.data.representations import OBJECT_MIMETYPE_TO_FORMATTER
# NB: for parallelism, setenv ZINC_ENVIRON_GREEN yes
build_context(context=globals(),environ=os.environ)
logging.basicConfig(level=logging.INFO)
Formatter = OBJECT_MIMETYPE_TO_FORMATTER["text/plain"]
formatter = Formatter(options={}, fields=[
    "smiles","zinc_id",
])
# mwt 350 to 500
m = 'FGHIJ'
# logp 1.5 to 3.5
l = 'DEFG'
t = [x + y for x in m for y in l]
# in stock, contains tertiary aliphatic amine, has protomers in database
c = [
    Substance.tranche_prefix.in_(t),
    Substance.structure.match('C[ND3](C)C'),
    Substance.purchasable > 20,
    Substance.protomers.any(),
]
q = Substance.query.filter(*c)
q = q.yield_per(10)
q = q.limit(100000)
q = q.parallelize()
logging.info(str(q))
for line in formatter(q):
    print line,
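The tranche prefix list in fetch.py is every pairing of one mwt letter with one logP letter. The standalone sketch below reproduces only that step so the resulting list can be inspected; it does not touch the ZINC environment. The letter strings are copied from fetch.py, while the helper name `tranche_prefixes` is made up for illustration, and the mapping from letters to property ranges is defined by ZINC and not reproduced here.

```python
from itertools import product

# Letter codes copied from fetch.py.
MWT_LETTERS = "FGHIJ"   # fetch.py uses these for mwt 350 to 500
LOGP_LETTERS = "DEFG"   # fetch.py uses these for logP 1.5 to 3.5


def tranche_prefixes(mwt_letters, logp_letters):
    """Return every two-letter prefix: one mwt letter followed by one logP letter.

    Hypothetical helper; it yields the same list, in the same order, as the
    list comprehension in fetch.py.
    """
    return [m + l for m, l in product(mwt_letters, logp_letters)]


prefixes = tranche_prefixes(MWT_LETTERS, LOGP_LETTERS)
print(len(prefixes))   # 20
print(prefixes[:4])    # ['FD', 'FE', 'FF', 'FG']
```

Widening either property range means adding letters to the corresponding string; in fetch.py the resulting list is what gets passed to Substance.tranche_prefix.in_().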
fetch.py prints the SMILES and ZINC ID of each match, one molecule per line. Save the output to a file (smiles.txt here), then extract the ZINC IDs for the next step.
awk '{print $2}' smiles.txt > zincids.txt
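If awk is not to hand, the same extraction can be sketched in Python. This assumes the layout described above (SMILES first, ZINC ID second, separated by whitespace). The function names are made up for illustration and the sample IDs are placeholders, not real ZINC entries. Unlike the awk one-liner, this version skips lines that have fewer than two fields rather than emitting blank lines.

```python
def extract_zinc_ids(lines):
    """Yield the second whitespace-separated field of each line that has one."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            yield fields[1]


def write_zinc_ids(src_path="smiles.txt", dst_path="zincids.txt"):
    """Read SMILES/ID pairs from src_path and write one ID per line to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for zinc_id in extract_zinc_ids(src):
            dst.write(zinc_id + "\n")


# Placeholder input lines; the IDs are invented for the demonstration.
sample = ["CCN(C)C ZINC_EXAMPLE_1\n", "\n", "CN(C)CC ZINC_EXAMPLE_2\n"]
print(list(extract_zinc_ids(sample)))  # ['ZINC_EXAMPLE_1', 'ZINC_EXAMPLE_2']
```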
fetch.py only works if you have ssh access to our cluster. If you do not, you will have to compose a URL and use the API instead. For example:
http://zinc15.docking.org/substances/subsets/now/?structure-matches=C[ND3](C)C&count=all&logp-between=1.5+3.5&mwt-between=350+500
This query is currently quite slow. We are working on making it faster!
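Composing the query string by hand is error-prone, so here is a minimal sketch that builds the same kind of URL with Python's standard library. The endpoint and parameter names are taken from the example URL above; the function name `build_query_url` is made up for illustration. Note that `urlencode` percent-encodes the brackets and parentheses in the SMARTS pattern, which assumes the server decodes them in the standard way.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Endpoint copied from the example URL in the text.
BASE_URL = "http://zinc15.docking.org/substances/subsets/now/"


def build_query_url(smarts, logp_range, mwt_range, count="all"):
    """Build a substance query URL (hypothetical helper).

    Each range is a (low, high) pair. The two values are joined with a
    space, which urlencode turns into '+', as in the example URL.
    """
    params = [
        ("structure-matches", smarts),
        ("count", count),
        ("logp-between", "%s %s" % logp_range),
        ("mwt-between", "%s %s" % mwt_range),
    ]
    return BASE_URL + "?" + urlencode(params)


url = build_query_url("C[ND3](C)C", (1.5, 3.5), (350, 500))
print(url)

# Decoding the query string recovers the original values.
decoded = parse_qs(urlsplit(url).query)
print(decoded["structure-matches"])  # ['C[ND3](C)C']
```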
Step 2. Download
Here, you can either get a split database index file (small, efficient) or download the db2 (or db or whatever) files themselves (slower, but works anywhere, not just in the lab).
#!/bin/csh -f
# getlibrary.csh
# split file "zincids.txt" created in step 1 above into chunks of 10000 molecules per file
split -l 10000 zincids.txt
#
# get files for each chunk
foreach i (x??)
    curl -F zinc_id-in=@${i} "http://zinc15.docking.org/protomers/subsets/now+usual/" -F output_format=txt -F output_fields=files.db2 > split_database_index.$i
    # or, db2:
    curl -F zinc_id-in=@${i} "http://zinc15.docking.org/protomers/subsets/now+usual/" -F output_format=db2 | gzip -9 -c > $i.db2.gz
end
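The chunking step in getlibrary.csh is what keeps each upload to a manageable size. Where split is not available, the same idea can be sketched in Python. The helper names and the output file naming scheme below are made up for illustration (split itself names its pieces xaa, xab, and so on), and the download step is deliberately left out.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` (a list) with at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_chunks(ids, size=10000, prefix="zincids_part_"):
    """Write each chunk to its own file and return the file names.

    The naming scheme (prefix plus a zero-padded counter) is an
    assumption for this sketch, not what split produces.
    """
    paths = []
    for n, chunk in enumerate(chunked(ids, size)):
        path = "%s%03d.txt" % (prefix, n)
        with open(path, "w") as handle:
            handle.write("\n".join(chunk) + "\n")
        paths.append(path)
    return paths


# Placeholder IDs, invented for the demonstration.
demo_ids = ["ID%d" % i for i in range(25)]
print([len(c) for c in chunked(demo_ids, 10)])  # [10, 10, 5]
```

Each file written this way could then be posted with the curl commands from getlibrary.csh in place of the x?? files.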
Step 3. DOCK
...
Optional step 0 is "make sure 3D representations are built for all the relevant molecules".
- For now, please talk to John about getting these built for you.
- We are taking requests!
Step 0 is important now (Dec 2015), as we only have 5M molecules built, growing by 10M per month.
- This step will be much less important 6 months from now, when ZINC15 is built out.