Get custom library

From DISI
Revision as of 23:01, 17 December 2015 by Frodo

Here is how to get a custom library from ZINC. We describe two routes: running locally on our cluster, and running remotely from anywhere in the world. If you have local access, use the local route; it is 50 to 100 times faster. External users: we are aware of the gap and are working to make remote execution as fast as local, or at least within a factor of two.

OK, let's say you want the following library in db2 format (or db, sdf, mol2, or whatever):

  • tertiary aliphatic amines, i.e. matching the SMARTS pattern C[ND3](C)(C)
  • in stock, for delivery within 2 weeks
  • representations near physiological pH (the "usual" subset)
  • molecular weight (mwt) 350 to 500
  • logP 1.5 to 3.5

Step 1. Get a list of ZINC IDs that match all these criteria

#!/nfs/soft/www/apps/zinc15/envs/internal/bin/python
# fetch.py

import os,sys
import logging
from zinc.environ_init import init_environ
init_environ(force_green_threads=True)
from zinc.management import build_context
from zinc.data.representations import OBJECT_MIMETYPE_TO_FORMATTER

# NB: for parallelism, setenv ZINC_ENVIRON_GREEN yes

build_context(context=globals(),environ=os.environ)
logging.basicConfig(level=logging.INFO)
Formatter = OBJECT_MIMETYPE_TO_FORMATTER["text/plain"]
formatter = Formatter(options={}, fields=[
   "smiles","zinc_id",
])

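# Tranche prefixes are two letters: one mwt code followed by one logP code.
# mwt 350 to 500
m = 'FGHIJ'
# logP 1.5 to 3.5
l = 'DEFG'
# every mwt/logP combination: 'FD', 'FE', ..., 'JG'
t = [x + y for x in m for y in l]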

# in stock, contains tertiary aliphatic amine, has protomers in database
c = [
    Substance.tranche_prefix.in_(t),
    Substance.structure.match('C[ND3](C)C'),
    Substance.purchasable > 20,
    Substance.protomers.any(),
]
q = Substance.query.filter(*c)
q = q.yield_per(10)
q = q.limit(100000)
q = q.parallelize()

logging.info(str(q))

for line in formatter(q):
    print line,
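
The tranche filter in fetch.py is just the cross product of two strings of letter codes. The sketch below reproduces that step on its own, so you can check which prefixes a given pair of ranges expands to before running the query. The letter codes are copied from the script; the names `MWT_CODES`, `LOGP_CODES` and `tranche_prefixes` are made up for this illustration, and the meaning of each letter is taken from the script's comments only.

```python
from itertools import product

# Letter codes copied from fetch.py. What each letter stands for is taken
# from that script's comments, not from an independent reference.
MWT_CODES = "FGHIJ"   # mwt 350 to 500
LOGP_CODES = "DEFG"   # logP 1.5 to 3.5


def tranche_prefixes(mwt_codes, logp_codes):
    """Return every two-letter prefix: one mwt letter followed by one logP letter."""
    return [m + l for m, l in product(mwt_codes, logp_codes)]


prefixes = tranche_prefixes(MWT_CODES, LOGP_CODES)
print(len(prefixes))
print(prefixes)
```

With five mwt codes and four logP codes this gives 20 prefixes, which is the list that ends up in `Substance.tranche_prefix.in_(...)`.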

Running fetch.py prints the SMILES and ZINC ID of each matching molecule, one per line. Redirect that output to a file (smiles.txt here), then extract the ZINC IDs for the next step:

awk '{print $2}' smiles.txt > zincids.txt
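
If you would rather stay in Python, the same column extraction might look like the sketch below. It assumes each line is "SMILES, whitespace, ZINC ID", as the awk command implies; it additionally skips malformed lines and drops duplicate IDs, which awk does not do. The function name `extract_zinc_ids` and the sample rows are invented for illustration and are not real ZINC records.

```python
def extract_zinc_ids(lines):
    """Return the second whitespace-separated field of each line.

    Lines with fewer than two fields are skipped. Order is preserved
    and repeated IDs are kept only once.
    """
    seen = set()
    ids = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        zinc_id = fields[1]
        if zinc_id not in seen:
            seen.add(zinc_id)
            ids.append(zinc_id)
    return ids


# Made-up sample rows, for illustration only.
sample = [
    "CCN(C)C ZINC000000000001\n",
    "\n",
    "CN(C)CC ZINC000000000002\n",
    "CCN(C)C ZINC000000000001\n",
]
zinc_ids = extract_zinc_ids(sample)
print("\n".join(zinc_ids))

# On a real file:
#   with open("smiles.txt") as handle:
#       zinc_ids = extract_zinc_ids(handle)
```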

The fetch.py route only works if you have ssh access to our cluster. If you do not, you will have to compose a URL and use the API instead, e.g.:

http://zinc15.docking.org/substances/subsets/now/?structure-matches=C[ND3](C)C&count=all&logp-between=1.5+3.5&mwt-between=350+500

This query is currently quite slow. We are working on making it faster.
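
Composing the URL by hand is error-prone because the SMARTS pattern contains brackets and parentheses. A minimal sketch of building it programmatically follows. The base URL and the parameter names (structure-matches, count, logp-between, mwt-between) are copied from the example URL above; the function name `build_query_url` is hypothetical, and the sketch assumes the server accepts the percent-encoded form of the SMARTS, which has not been checked here.

```python
from urllib.parse import urlencode

# Base URL and parameter names are taken from the example URL in the text.
BASE_URL = "http://zinc15.docking.org/substances/subsets/now/"


def build_query_url(smarts, logp_range, mwt_range, count="all"):
    """Build the query URL; each range is a (low, high) pair."""
    params = [
        ("structure-matches", smarts),
        ("count", count),
        # urlencode turns the space into '+', matching the "1.5+3.5" form
        ("logp-between", "{} {}".format(*logp_range)),
        ("mwt-between", "{} {}".format(*mwt_range)),
    ]
    return BASE_URL + "?" + urlencode(params)


url = build_query_url("C[ND3](C)C", (1.5, 3.5), (350, 500))
print(url)
```

The only difference from the hand-written URL is that the brackets and parentheses in the SMARTS come out percent-encoded.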

Step 2. Download

Here you can either get a split database index file (small and efficient) or download the db2 (or db, or whatever) files themselves (slower, but works anywhere, not just in the lab).

#!/bin/csh -f 
# getlibrary.csh
# split file "zincids.txt" created in step 1 above into chunks of 10000 molecules per file (xaa, xab, ...)
split -l 10000 zincids.txt
#
# for each chunk, fetch the locations of the db2 files into a split database index
foreach i (x??)
    curl -F zinc_id-in=@${i} "http://zinc15.docking.org/protomers/subsets/now+usual/" -F output_format=txt -F output_fields=files.db2 > split_database_index.$i
    # or, to download the db2 files themselves:
    # curl -F zinc_id-in=@${i} "http://zinc15.docking.org/protomers/subsets/now+usual/" -F output_format=db2 | gzip -9 -c > $i.db2.gz
end
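
For anyone scripting this step outside csh, the chunk-then-request pattern can be sketched in Python as follows. The chunk size mirrors the `split -l 10000` command, and the curl arguments are copied from the first curl call in getlibrary.csh; the helper names `chunked` and `curl_index_args` and the placeholder IDs are assumptions made for this illustration. The sketch only builds the argument list and does not contact the server.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def curl_index_args(chunk_path):
    """Argument list mirroring the first curl call in getlibrary.csh."""
    return [
        "curl",
        "-F", "zinc_id-in=@" + chunk_path,
        "http://zinc15.docking.org/protomers/subsets/now+usual/",
        "-F", "output_format=txt",
        "-F", "output_fields=files.db2",
    ]


# Placeholder IDs standing in for the contents of zincids.txt.
ids = ["ZINC%012d" % n for n in range(1, 25001)]
chunks = list(chunked(ids, 10000))
print([len(chunk) for chunk in chunks])
print(" ".join(curl_index_args("xaa")))
```

Each chunk would be written to its own file and the argument list passed to `subprocess.run`, with standard output redirected to the matching `split_database_index` file.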

Step 3. DOCK

...

Optional step 0: make sure 3D representations are built for all the relevant molecules.

  • For right now, please talk to John about getting these built for you.
  • We are taking requests!

Step 0 is important right now (Dec 2015), as we have only 5M molecules built, growing by 10M per month.

  • This step will be much less important 6 months from now when ZINC15 is built out.