Build ChEMBL for SEA


This tutorial describes how to build a ChEMBL library for SEA, based on Matt O'Meara's scripts.

Updated 11/06/2020

Setting up Postgres

The ChEMBL database is located on mem2 (Postgres v12.3).

 psql -h mem2 -U chembl -d chembl -p 5433
 # basic psql commands you should know
 -  To list all tables: \dt
 -  To list all users: \du
 -  To list all schemas: \dn
 -  To view table setup: \d+ <table_name>

Installing dependent packages

Install BioChemPantry

R package written by Matt O'Meara for building ChEMBL library for SEA


BioChemPantry depends on the RPostgres package, which requires pgsql version > 9.0. Using the development version of pgsql-9.5 is recommended.

export PATH=/usr/pgsql-9.5/bin:$PATH
export LIBPQ_DIR=/usr/pgsql-9.5/
export LIBRARY_PATH=/usr/pgsql-9.5/lib

Install Zr


Install SEAR

install_version("data.table", version = "1.11.8", repos = "")

Install Bethany


Set up library building script

  • Clone BioChemPantry into a local directory for editing
git clone 
# This has been edited for ChEMBL25; it might not work for future releases, but it is worth trying
  • Edit the scripts
cd <dir_to_install>/BioChemPantry/vignette/sets
cp -r chembl23 chembl25
cd chembl25/scripts
Replace every occurrence of the string "chembl23" with "chembl25" in scripts 0-8
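
The string replacement in the last step can be scripted. The following is a minimal Python sketch, assuming the scripts are plain-text files in a single directory. The function name retarget_scripts and the "*.R" file pattern are illustrative choices, not part of BioChemPantry.

```python
from pathlib import Path


def retarget_scripts(script_dir, old="chembl23", new="chembl25", pattern="*.R"):
    """Replace every occurrence of `old` with `new` in matching files.

    Returns the names of the files that were changed.
    The "*.R" pattern is an assumption about how the scripts are named.
    """
    changed = []
    for path in sorted(Path(script_dir).glob(pattern)):
        text = path.read_text()
        updated = text.replace(old, new)
        if updated != text:
            # Only rewrite files that actually contained the old string.
            path.write_text(updated)
            changed.append(path.name)
    return changed
```

Calling retarget_scripts(".") from inside chembl25/scripts would rewrite the files in place, so it is worth reviewing the result with git diff before running anything.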

Loading ChEMBL on PHI server

Set up .pantry_config file in home directory


The user specified here must have the createdb permission on phi.

vim ~/.pantry_config
{
   "staging_directory" : "<setup_dir>/pantry_sets",
   "login" : {
       "dbname" : "chembl",
       "host" : "mem2",
       "user" : "chembl",
       "password" : "",
       "port" : 5433
   }
}

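Before running the scripts, it can help to confirm that the config file parses and contains the fields shown above. The following is a minimal Python sketch, assuming the file is a JSON object. check_pantry_config is a hypothetical helper, not part of BioChemPantry, and the list of required keys is taken only from the example on this page, not from an official schema.

```python
import json

# Keys taken from the example config on this page (an assumption, not an official schema).
REQUIRED_LOGIN_KEYS = ("dbname", "host", "user", "password", "port")


def check_pantry_config(text):
    """Parse config text as JSON and return a list of problems (empty if none)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return ["not valid JSON: %s" % err]
    if not isinstance(config, dict):
        return ["top level is not a JSON object"]
    problems = []
    if "staging_directory" not in config:
        problems.append("missing staging_directory")
    login = config.get("login")
    if not isinstance(login, dict):
        problems.append("missing login section")
        return problems
    for key in REQUIRED_LOGIN_KEYS:
        if key not in login:
            problems.append("missing login.%s" % key)
    return problems
```

Passing the contents of ~/.pantry_config to check_pantry_config should return an empty list when the file is complete. A missing brace shows up as a "not valid JSON" entry.
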
Loading ChEMBL into mem2 server

It is recommended to have two terminals open: one running R and one with the R script open. Execute the code chunk by chunk.

0_load_chembl_database.R downloads the ChEMBL Postgres dump into pantry_sets/chembl25/dump and then tries to load it into the database. The psql command in the script may or may not work. If it does not, try the pg_restore command:

pg_restore -h <host> -d momeara -U <username> -O chembl_25_postgresql.dmp

If 0_load_chembl_database.R and pg_restore fail

If you get the error "Segmentation fault (core dumped)" or the script simply fails, there is a workaround, although it takes a bit of work. Check the version of the Postgres server on phi, install the same Postgres server version locally on your computer, and set up a database. Make sure you are the only one who uses this database!

Step 1: Load the tables into public schema of the newly created local postgres database

pg_restore -U <username> -d <dbname> -O chembl_25_postgresql.dmp

Step 2: Rename schema and export

Log in as the postgres user and connect to the database where the ChEMBL library is loaded

sudo -i
su - postgres
psql -d <dbname>
chembl25=# alter schema public rename to <schema_name>; -- <schema_name> must match the schema name set in script 0_load_chembl_database.R
# Exit psql, open a new local terminal, and make sure that psql --version matches the version on the phi server
pg_dump -U khtang -d chembl25 --schema=chembl25 -O -Fp > export_chembl25.sql
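
Comparing the local and server versions by eye is error-prone, so the check can be scripted. The following is a minimal Python sketch, assuming version strings in the form printed by psql --version (for example "psql (PostgreSQL) 12.3"). The helper names are illustrative. It treats the first number as the major version for Postgres 10 and later, and the first two numbers for older releases such as 9.5.

```python
import re


def parse_pg_version(text):
    """Extract the numeric version from a string such as 'psql (PostgreSQL) 12.3'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.groups() if part is not None)


def major_version(version):
    """Postgres 10+ uses one number for the major version; 9.x and older use two."""
    return version[:1] if version[0] >= 10 else version[:2]


def same_major(local_text, server_text):
    """True when both version strings belong to the same major release."""
    local = major_version(parse_pg_version(local_text))
    server = major_version(parse_pg_version(server_text))
    return local == server
```

The two inputs would be the output of psql --version in the local terminal and the version reported by the server. If same_major returns False, the exported dump may not load cleanly.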

Step 3: Take the exported SQL file and run script 0_load_chembl_database.R again

Generating data files for SEA

Note that this assumes the new version of the ChEMBL library you are trying to build is already loaded into ZINC. Ask John or Khanh to load the new library.

Script 1-8 notes

Script 1: Get ZINC info for ChEMBL

  • Specify tmp directory
> write("TMP = YOUR_PATH_VARIABLE", file=file.path('~/.Renviron')) # on R console
  • It is recommended to run zinc_to_chembl in small batches, because the website will otherwise time out.

To specify the batch size, add result_batch_size to the list of parameters for Zr::catalog_items. If the request times out, you can resume from the page where it timed out by adding page=<page_num>

> zinc_to_chembl <- Zr::catalog_items(
   page=3, # optional
   result_batch_size=5000) %>%
chembl_id = supplier_code)
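
The resume-from-page idea described above can be sketched independently of R. The following Python sketch assumes a caller-supplied fetch_page(page) function that returns a list of rows, returns an empty list past the last page, and raises TimeoutError on a timeout. These names and conventions are hypothetical and do not describe the Zr API.

```python
def fetch_in_batches(fetch_page, start_page=1, max_retries=3):
    """Collect rows page by page.

    Returns (rows, resume_page). resume_page is None when all pages were
    fetched, otherwise it is the page that kept timing out, so a later call
    can pass it as start_page.
    """
    rows = []
    page = start_page
    while True:
        for attempt in range(max_retries):
            try:
                batch = fetch_page(page)
                break
            except TimeoutError:
                if attempt == max_retries - 1:
                    # Give up and report where to resume.
                    return rows, page
        if not batch:
            # An empty page marks the end of the results.
            return rows, None
        rows.extend(batch)
        page += 1
```

Keeping the rows collected so far and returning the failing page number mirrors the manual procedure: note the page where the request timed out, then rerun with page=<page_num>.
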
  • Alternative better way (on command line)
$ cd ~khtang/jji/
$ source /nfs/soft/www/apps/zinc15/envs/production/env.csh
$ python  chembl25 0 <ZINC estimated size> > chembl25.txt
$ python  chembl25 0 1900000000 > chembl25.txt # as one process from 0 to 1.9B
This can be run in parallel in chunks:
$ python  chembl25 0 1000000 > chembl25_1.txt
$ python  chembl25 1000001 2000000 > chembl25_2.txt
Then concatenate them all together:
$ cat chembl25_1.txt chembl25_2.txt ... > all_chembl25.txt
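
Working out the chunk boundaries by hand gets tedious for 1.9B IDs. The following is a minimal Python sketch that generates them, following the pattern of the two example chunks above (0 to 1000000, then 1000001 to 2000000). Treating both bounds as inclusive is an assumption read off that example, and the function names are illustrative.

```python
def chunk_ranges(total, chunk_size):
    """Split 0..total into consecutive (start, end) ranges.

    Follows the pattern of the example: the first chunk ends at chunk_size,
    and each later chunk starts one past the previous end.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    ranges = []
    start = 0
    end_mark = chunk_size
    while start <= total:
        end = min(end_mark, total)
        ranges.append((start, end))
        start = end + 1
        end_mark += chunk_size
    return ranges


def chunk_plan(prefix, total, chunk_size):
    """Pair each range with a numbered output file name such as chembl25_1.txt."""
    return [
        (start, end, "%s_%d.txt" % (prefix, index))
        for index, (start, end) in enumerate(chunk_ranges(total, chunk_size), start=1)
    ]
```

Each tuple from chunk_plan("chembl25", 1900000000, 1000000) supplies the start, end, and output file for one parallel run of the command shown above.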

Script 2-4

Follow the code in BioChemPantry.

Script 5.1: Precalculate compound images

Must run inside the SEA conda environment

source /nfs/home/khtang/SEA/tools/anaconda2/bin/activate sea16
export REDISPORT=6380
export SEA_APP_ROOT=$CONDA_PREFIX/var/seaserver

The chembl_compound_images part will take ~10 days to complete.
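
Because this step runs for so long, it may be worth confirming that the environment is set up before starting it. The following is a minimal Python sketch that checks the variables exported above, plus CONDA_PREFIX, which conda sets when an environment is activated. The helper name missing_sea_env is illustrative and not part of SEA.

```python
import os

# REDISPORT and SEA_APP_ROOT come from the exports on this page;
# CONDA_PREFIX is set by conda when an environment is active.
REQUIRED_VARS = ("CONDA_PREFIX", "REDISPORT", "SEA_APP_ROOT")


def missing_sea_env(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_sea_env() with no arguments inspects the current process environment. An empty list means all three variables are present.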