Build ChEMBL for SEA

This tutorial describes how to build a ChEMBL library for SEA, based on Matt O'Meara's scripts.

Updated 11/06/2020

Setting up Postgres

The ChEMBL database is located on mem2 (Postgres v12.3).

 psql -h mem2 -U chembl -d chembl -p 5433
 # basic psql commands you should know
 -  To list all tables: \dt
 -  To list all users: \du
 -  To list all schemas: \dn
 -  To view table setup: \d+ <table_name>

In the PostgreSQL console, create a new schema:

CREATE SCHEMA chembl33;

Change the default schema to the newly created one. This must be done as the postgres user:

alter role chembl in database chembl set search_path to 'chembl33';

Installing dependent packages

Install BioChemPantry

An R package written by Matt O'Meara for building the ChEMBL library for SEA.

R
install.packages("devtools")

BioChemPantry depends on the RPostgres package, which requires pgsql version > 9.0. Using the development version of pgsql-9.5 is recommended.

bash
export PATH=/usr/pgsql-9.5/bin:$PATH
export LIBPQ_DIR=/usr/pgsql-9.5/
export LIBRARY_PATH=/usr/pgsql-9.5/lib
R
devtools::install_github("momeara/BioChemPantry")

Install Zr

devtools::install_github("momeara/Zr")

Install SEAR

require(devtools)
install_version("data.table", version = "1.11.8", repos = "http://cran.us.r-project.org")
devtools::install_github("momeara/SEAR")

Install Bethany

devtools::install_github("momeara/Bethany")

Set up library building script

  • Clone BioChemPantry into a local directory for editing
git clone https://github.com/khtang17/BioChemPantry.git
# This fork has been edited for ChEMBL25; it might not work for future releases, but it is worth trying
  • Edit the scripts
cd <dir_to_install>/BioChemPantry/vignette/sets
cp -r chembl23 chembl25
cd chembl25/scripts
# Replace every occurrence of "chembl23" with "chembl25" in scripts 0-8
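
The string replacement in scripts 0-8 can be done in one pass rather than by hand. The following is a minimal sketch, assuming the scripts are the *.R files in the scripts directory; the helper name rename_chembl_version is hypothetical and not part of BioChemPantry.

```shell
# Hypothetical helper: replace one ChEMBL version string with another in
# every R script of a directory. The *.R glob is an assumption.
rename_chembl_version() {
    old="$1"; new="$2"; dir="$3"
    for f in "$dir"/*.R; do
        [ -f "$f" ] || continue
        # Write to a temporary file first so this works with GNU and BSD sed
        sed "s/$old/$new/g" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
    # List files that still contain the old string, if any
    grep -l "$old" "$dir"/*.R 2>/dev/null || true
}

# Usage (from inside chembl25/scripts):
# rename_chembl_version chembl23 chembl25 .
```

If the helper prints no file names, no occurrence of the old string is left. It is still worth reading through the scripts afterwards, since version-specific URLs or table names may need manual changes.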

Loading ChEMBL on PHI server

Set up .pantry_config file in home directory

Read more

The user specified here must have the createdb permission on phi.

vim .pantry_config
{
   "staging_directory" : "<setup_dir>/pantry_sets",
   "login" : {
       "dbname" : "chembl",
       "host" : "mem2",
       "user" : "chembl",
       "password" : "",
       "port" : 5433
   }
}
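
The same file can also be written from the shell instead of an editor. This is only a sketch: the helper name write_pantry_config is hypothetical, the values are copied from the example above, and restricting the file permissions is a suggestion (the file can hold a database password), not something BioChemPantry is known to require.

```shell
# Hypothetical helper: write a .pantry_config into a target directory.
# Arguments: <target_dir> <staging_directory>
write_pantry_config() {
    target_dir="$1"; staging_dir="$2"
    cat > "$target_dir/.pantry_config" <<EOF
{
   "staging_directory" : "$staging_dir",
   "login" : {
       "dbname" : "chembl",
       "host" : "mem2",
       "user" : "chembl",
       "password" : "",
       "port" : 5433
   }
}
EOF
    # The file may hold a password, so keep it readable by the owner only
    chmod 600 "$target_dir/.pantry_config"
}

# Usage:
# write_pantry_config "$HOME" "<setup_dir>/pantry_sets"
```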

Loading ChEMBL into mem2 server

It is recommended to keep two terminals open: one running R and one showing the R script. Execute the code chunk by chunk.

0_load_chembl_database.R downloads the ChEMBL Postgres database into pantry_sets/chembl25/dump and tries to load the file into the database. The psql command in the script may or may not work. If it does not, try the pg_restore command:

pg_restore -h phi.cluster.bkslab.org -d momeara -U <username> -O chembl_25_postgresql.dmp
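
Which tool applies depends on the dump format: pg_restore reads pg_dump archive formats, while a plain SQL dump is loaded with psql. A minimal sketch that checks for the signature of a custom-format archive follows; the helper name dump_loader is hypothetical, and tar or directory format archives are not detected by this check.

```shell
# Hypothetical helper: print which client tool matches the dump file format.
# Custom-format archives written by pg_dump -Fc begin with the bytes "PGDMP".
dump_loader() {
    magic=$(head -c 5 "$1")
    if [ "$magic" = "PGDMP" ]; then
        echo pg_restore
    else
        echo psql
    fi
}

# Usage:
# dump_loader chembl_25_postgresql.dmp
```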

If 0_load_chembl_database.R and pg_restore fail

If you get the error "Segmentation fault (core dumped)" or the script simply fails, there is a workaround, although it takes a little work. Check the version of the Postgres server on phi, install the same Postgres server version locally on your computer, and set up a database. Make sure you are the only one who uses this database!

Step 1: Load the tables into the public schema of the newly created local Postgres database

pg_restore -U <username> -d <dbname> -O chembl_25_postgresql.dmp

Step 2: Rename schema and export

Log in as the postgres user and connect to the database where the ChEMBL library is loaded

sudo -i
su - postgres
psql -d <dbname>
chembl25=# alter schema public rename to <schema_name>; -- <schema_name> must match the schema name set in script 0_load_chembl_database.R
# Exit psql, open a new local terminal, and make sure that psql --version matches the version on the phi server
pg_dump -U khtang -d chembl25 --schema=chembl25 -O -Fp > export_chembl25.sql
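
The version check can be scripted by comparing the text printed by --version on each machine. A minimal sketch, assuming the usual "psql (PostgreSQL) 12.3" output layout; the helper names pg_version and same_pg_version are hypothetical.

```shell
# Hypothetical helper: pull the version number out of "--version" output,
# e.g. "pg_dump (PostgreSQL) 12.3" -> 12.3
pg_version() {
    echo "$1" | sed -n 's/^[^)]*) \([0-9][0-9.]*\).*/\1/p'
}

# Succeeds only when both strings carry the same, non-empty version number
same_pg_version() {
    a=$(pg_version "$1")
    b=$(pg_version "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage: paste the output of "psql --version" from the server as the second argument
# same_pg_version "$(pg_dump --version)" "psql (PostgreSQL) 12.3" && echo "versions match"
```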

Step 3: Export the sql file and attempt script 0_load_chembl_database.R again

Generating data files for SEA

Note that the following steps assume the new version of the ChEMBL library you are trying to build has already been loaded into ZINC. Ask John or Khanh to load the new library.

Script 1-8 notes

Script 1: Get ZINC info for ChEMBL

  • Specify tmp directory
> write("TMP = YOUR_PATH_VARIABLE", file=file.path('~/.Renviron')) # in the R console
  • It is recommended to run zinc_to_chembl in small batches because the website can time out.

To specify the batch size, add result_batch_size to the list of parameters for Zr::catalog_items. If the request times out, you can resume from the page where it timed out by adding page=<page_num>.

> zinc_to_chembl <- Zr::catalog_items(
   catalog_short_name="chembl25",
   count='all',
   verbose=TRUE,
   page=3, # optional
   result_batch_size=5000) %>%
   dplyr::select(
       zinc_id,
       chembl_id = supplier_code)
  • A better alternative (on the command line)
cd ~khtang/jji/
$ source /nfs/soft/www/apps/zinc15/envs/production/env.csh
$ python fetchVendorC.py chembl25 0 <ZINC estimated size> > chembl25.txt
   Example:
      $ python fetchVendorC.py chembl25 0 1900000000 > chembl25.txt # as one process, from 0 to 1.9B
   This can also be run in parallel, in chunks:
      $ python fetchVendorC.py chembl25 0 1000000 > chembl25_1.txt
      $ python fetchVendorC.py chembl25 1000001 2000000 > chembl25_2.txt
      ....
      Then concatenate them all together:
      $ cat chembl25_1.txt chembl25_2.txt ... > all_chembl25.txt
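
The chunk boundaries for the parallel runs do not have to be typed by hand. The sketch below generates them in the same pattern as the example ranges (0 to 1000000, 1000001 to 2000000, and so on) and prints the matching commands without running them. The helper names chunk_ranges and print_fetch_commands are hypothetical, and it is an assumption that fetchVendorC.py treats both bounds as inclusive.

```shell
# Hypothetical helper: print "start end" pairs covering 0..total in chunks
chunk_ranges() {
    total="$1"; size="$2"
    start=0
    end="$size"
    while [ "$start" -le "$total" ]; do
        # Clamp the last chunk to the total
        if [ "$end" -gt "$total" ]; then end="$total"; fi
        echo "$start $end"
        start=$((end + 1))
        end=$((end + size))
    done
}

# Dry run: print one fetch command per chunk, nothing is executed
print_fetch_commands() {
    chunk_ranges "$1" "$2" | {
        i=1
        while read -r s e; do
            echo "python fetchVendorC.py chembl25 $s $e > chembl25_$i.txt"
            i=$((i + 1))
        done
    }
}

# Usage:
# print_fetch_commands 1900000000 1000000
```

The printed lines can be reviewed first and then submitted however jobs are normally run in parallel on the cluster.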

Script 2-4

Follow the code in BioChemPantry.

Script 5.1: Precalculate compound images

This must be run inside the SEA conda environment:

source /nfs/home/khtang/SEA/tools/anaconda2/bin/activate sea16
export REDISPORT=6380
export SEA_APP_ROOT=$CONDA_PREFIX/var/seaserver
export SEA_RUN_FOLDER=$SEA_APP_ROOT/run
export SEA_DATA_FOLDER=$SEA_APP_ROOT/data
export SEA_DATA_CACHE_FOLDER=$SEA_APP_ROOT/data/cache
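
Because the image calculation is a long job, it may help to confirm that all of the variables above are set before starting it. A minimal sketch; the helper name check_sea_env is hypothetical and not part of SEA.

```shell
# Hypothetical helper: report any SEA variable from the list above that is unset or empty
check_sea_env() {
    missing=0
    for name in REDISPORT SEA_APP_ROOT SEA_RUN_FOLDER SEA_DATA_FOLDER SEA_DATA_CACHE_FOLDER; do
        # Indirect lookup of the variable named by $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
# check_sea_env && echo "SEA environment looks complete"
```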

The chembl_compound_images part will take about 10 days to complete.