Data Sources for ZINC15
ZINC incorporates data from many vendor catalogs and annotated databases. Some sources are used in specific ways, as noted below.
- breaks out endogenous and other levels of specificity, which we load into ZINC as separate catalogs.
- currently we get them from ChEMBL, but it is not updated often enough
- another source is the DrugBank XML, which does seem to be updated regularly
- a third way is directly from WHO (Norway). We currently do not do this.
- we parse out FDA-approved drugs separately
- we parse out each of the subsets (street, experimental, etc.)
- target affinities of compounds, 10 µM or better
- ATC codes
- two levels of protein hierarchy classification (major class and sub class in ZINC15)
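The "10 µM or better" affinity cutoff mentioned above can be sketched as a simple filter. This is a minimal illustration, not the actual ZINC loading code: the record layout, the field name `affinity_nM`, and the choice to include values exactly at the cutoff are all assumptions made for the example.

```python
import math

# Hypothetical activity records; the field names are illustrative only
# and do not reflect the ChEMBL schema or the ZINC loader.
observations = [
    {"compound": "cmpd_A", "target": "HTR1A", "affinity_nM": 12.0},
    {"compound": "cmpd_B", "target": "HTR1A", "affinity_nM": 10_000.0},
    {"compound": "cmpd_C", "target": "DRD2", "affinity_nM": 25_000.0},
]

CUTOFF_NM = 10_000.0  # 10 uM expressed in nM


def p_affinity(affinity_nM):
    """Return -log10 of the affinity expressed in molar units."""
    return 9.0 - math.log10(affinity_nM)


def passes_cutoff(record, cutoff_nM=CUTOFF_NM):
    """Keep records at the cutoff or more potent (assumed inclusive)."""
    return record["affinity_nM"] <= cutoff_nM


kept = [r for r in observations if passes_cutoff(r)]
```

On the negative-log scale the same cutoff corresponds to a value of 5 or higher, since 10 µM is 1e-5 M.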
Proteins whose expression is highly correlated are more likely to be related. We get this from Matt, who in turn gets it from the Gillis Lab as CoExpNet.csv. TODO: figure out how to cite this properly. How often might it be updated?
We use UniProt to translate the Swiss-Prot/UniProt accession codes in ChEMBL to UniProt gene symbols; for example, the entry 5HT1A_HUMAN becomes HTR1A. This is how we unify observations from different species, and also how we intersect with ICGC DCC.
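A minimal sketch of this unification step follows. The lookup table `ENTRY_TO_GENE`, the helper `unify_gene`, and the sample observations are hypothetical; in practice the mapping would come from UniProt. Upper-casing the symbol so that rodent-style symbols such as Htr1a collapse onto HTR1A is an assumption made for illustration, not necessarily what ZINC does.

```python
from collections import defaultdict

# Hypothetical lookup from UniProt entry name to gene symbol.
# In practice this table would be derived from UniProt itself.
ENTRY_TO_GENE = {
    "5HT1A_HUMAN": "HTR1A",
    "5HT1A_RAT": "Htr1a",
    "5HT1A_MOUSE": "Htr1a",
    "DRD2_HUMAN": "DRD2",
}


def unify_gene(entry_name):
    """Map an entry name to a species-independent gene key.

    Upper-casing is an assumed normalization so that orthologs
    from different species share one key. Returns None if unmapped.
    """
    symbol = ENTRY_TO_GENE.get(entry_name)
    return symbol.upper() if symbol is not None else None


# Hypothetical (compound, target entry) observations.
observations = [
    ("cmpd_A", "5HT1A_HUMAN"),
    ("cmpd_B", "5HT1A_RAT"),
    ("cmpd_C", "DRD2_HUMAN"),
    ("cmpd_D", "UNMAPPED_ENTRY"),
]

# Group compounds by unified gene symbol, skipping unmapped targets.
by_gene = defaultdict(set)
for compound, entry in observations:
    gene = unify_gene(entry)
    if gene is not None:
        by_gene[gene].add(compound)
```

Keying everything on one gene symbol is also what makes it possible to intersect with other gene-indexed resources by a plain dictionary lookup.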
Protein-protein interactions with BioGRID
- we got the file from Matt, who simply downloaded it.
- URL goes here.
- update frequency?
- we calculate these ourselves
- script here.
BLAST of NR
- we got the script from Matt
- we adapted it ourselves to cover just those genes for which ligands are available.
Guide to Pharmacology - contains metabolites, drugs, and targets. We think it is already mostly incorporated into ChEMBL, but the metabolite and drug part is not clear to us, so we use it directly.