Build ChEMBL for SEA
This tutorial describes how to build a ChEMBL library for SEA, based on Matt O'Meara's scripts.
Updated 11/06/2020
Setting up Postgres
The ChEMBL database is located on mem2 (Postgres v12.3):
psql -h mem2 -U chembl -d chembl -p 5433
Basic psql commands you should know:
- To list all tables: \dt
- To list all users: \du
- To list all schemas: \dn
- To view table setup: \d+ <table_name>
In the PostgreSQL console, create a new schema:
CREATE SCHEMA chembl33;
Change the default schema to the newly created one. You must do this as the postgres user:
alter role chembl in database chembl set search_path to 'chembl33';
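The two statements above change with every release, so it can help to generate them from the version number and keep the schema name consistent. The following is a minimal Python 3 sketch, assuming the schema always follows the `chembl<version>` naming used here. The helper name `schema_setup_sql` is hypothetical; it only builds SQL text to paste into psql and does not connect to the database.

```python
import re


def schema_setup_sql(version, role="chembl", dbname="chembl"):
    """Build the SQL that creates a chembl<version> schema and makes it
    the default search_path for `role` in `dbname`.

    Hypothetical helper: returns text only, nothing is executed.
    """
    schema = "chembl{}".format(version)
    # Accept only plain lowercase identifiers so nothing unexpected
    # ends up inside the generated SQL.
    for name in (schema, role, dbname):
        if not re.fullmatch(r"[a-z_][a-z0-9_]*", name):
            raise ValueError("unsafe identifier: {!r}".format(name))
    return [
        "CREATE SCHEMA {};".format(schema),
        # This statement must be run as the postgres user.
        "ALTER ROLE {} IN DATABASE {} SET search_path TO '{}';".format(
            role, dbname, schema
        ),
    ]


if __name__ == "__main__":
    for statement in schema_setup_sql(33):
        print(statement)
```

Run as a script, it prints the statements for ChEMBL 33 so they can be pasted into the psql session.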
Installing dependent packages
Install BioChemPantry
An R package written by Matt O'Meara for building the ChEMBL library for SEA.
install.packages("devtools")  # in R
BioChemPantry depends on the RPostgres package, which requires pgsql version > 9.0. Using the development version of pgsql-9.5 is recommended.
export PATH=/usr/pgsql-9.5/bin:$PATH
export LIBPQ_DIR=/usr/pgsql-9.5/
export LIBRARY_PATH=/usr/pgsql-9.5/lib
Then, in R:
devtools::install_github("momeara/BioChemPantry")
Install Zr
devtools::install_github("momeara/Zr")
Install SEAR
require(devtools)
install_version("data.table", version = "1.11.8", repos = "http://cran.us.r-project.org")
devtools::install_github("momeara/SEAR")
Install Bethany
devtools::install_github("momeara/Bethany")
Set up library building script
- Clone BioChemPantry into a local directory for editing
git clone https://github.com/khtang17/BioChemPantry.git # This has been edited for ChEMBL25; it might not work for future releases, but it is worth trying
- Edit the scripts
cd <dir_to_install>/BioChemPantry/vignette/sets
cp -r chembl23 chembl25
cd chembl25/scripts
Replace every string containing "chembl23" with "chembl25" in scripts 0-8.
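The string replacement across scripts 0-8 can be done in one pass instead of by hand. The following is a minimal Python 3 sketch, assuming the scripts are the `*.R` files in the `scripts` directory. The function name `retarget_scripts` is hypothetical and not part of BioChemPantry.

```python
from pathlib import Path


def retarget_scripts(scripts_dir, old="chembl23", new="chembl25"):
    """Replace `old` with `new` in every *.R file in scripts_dir.

    Hypothetical helper. Returns the names of the files that changed,
    so the result can be reviewed before running the scripts.
    """
    changed = []
    for path in sorted(Path(scripts_dir).glob("*.R")):
        text = path.read_text()
        if old in text:
            # Rewrite the file in place with the new release name.
            path.write_text(text.replace(old, new))
            changed.append(path.name)
    return changed
```

Calling `retarget_scripts(".")` from inside `chembl25/scripts` edits the files in place, so it is worth checking the result with `git diff` afterwards.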
Loading ChEMBL on PHI server
Set up the .pantry_config file in your home directory
The username used here must have the createdb permission on phi.
vim ~/.pantry_config
{
  "staging_directory" : "<setup_dir>/pantry_sets",
  "login" : {
    "dbname" : "chembl",
    "host" : "mem2",
    "user" : "chembl",
    "password" : "",
    "port" : 5433
  }
}
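Hand-edited JSON is easy to break with a stray comma or quote, so the config can also be written programmatically. The following is a minimal Python 3 sketch using the keys and default values shown above. The function name `write_pantry_config` is hypothetical, and the staging directory is whatever path you pass in.

```python
import json
from pathlib import Path


def write_pantry_config(path, staging_directory, dbname="chembl",
                        host="mem2", user="chembl", password="", port=5433):
    """Write a .pantry_config file with the layout shown above.

    Hypothetical helper. Returns the dict that was written.
    """
    config = {
        "staging_directory": staging_directory,
        "login": {
            "dbname": dbname,
            "host": host,
            "user": user,
            "password": password,
            "port": port,  # kept as a number, as in the example config
        },
    }
    # json.dumps guarantees syntactically valid JSON.
    Path(path).write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For example, `write_pantry_config(Path.home() / ".pantry_config", "<setup_dir>/pantry_sets")` writes the file to the home directory, with `<setup_dir>` replaced by your own path.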
Loading ChEMBL into mem2 server
It is recommended to have two terminals open: one running R and one showing the R script. Execute the code chunk by chunk.
0_load_chembl_database.R downloads the ChEMBL Postgres database into pantry_sets/chembl25/dump and tries to import the file into the database. The psql command may or may not work. If it does not, try the pg_restore command:
pg_restore -h phi.cluster.bkslab.org -d momeara -U <username> -O chembl_25_postgresql.dmp
If 0_load_chembl_database.R and pg_restore both fail
If you get the error "Segmentation fault (core dumped)" or the script simply fails, there is a workaround, though it takes a bit of work. Check the version of the Postgres server on phi, install the same Postgres server version locally on your computer, and set up a database. Make sure you are the only one using this database!
Step 1: Load the tables into the public schema of the newly created local Postgres database
pg_restore -U <username> -d <dbname> -O chembl_25_postgresql.dmp
Step 2: Rename schema and export
Log in to Postgres as the postgres user and connect to the database where the ChEMBL library is loaded:
sudo -i
su - postgres
psql -d <dbname>
chembl25=# alter schema public rename to <schema_name>;
# <schema_name> must match the schema name set in script 0_load_chembl_database.R
# Exit psql. In a new local terminal, make sure that psql --version matches the version on the phi server, then run:
pg_dump -U khtang -d chembl25 --schema=chembl25 -O -Fp > export_chembl25.sql
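The version check before dumping can be automated by comparing the text printed by `psql --version` or `pg_dump --version` with the version reported by the server. The following is a minimal Python 3 sketch, assuming that matching major versions is enough for the dump to load; compare the full strings instead if you need an exact match. The function names are hypothetical.

```python
import re


def major_version(version_text):
    """Extract the PostgreSQL major version from text such as
    'psql (PostgreSQL) 12.3' or a bare '12.3'.

    Hypothetical helper. Returns a tuple: (12,) for 12.3, (9, 5) for 9.5.24.
    """
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("no version number in {!r}".format(version_text))
    first, second = int(match.group(1)), int(match.group(2))
    # Before PostgreSQL 10, the major version had two parts (e.g. 9.5).
    return (first,) if first >= 10 else (first, second)


def same_major(client_text, server_text):
    """True when the client and server report the same major version."""
    return major_version(client_text) == major_version(server_text)


if __name__ == "__main__":
    # Example strings only; substitute the real output of
    # `pg_dump --version` and the server's reported version.
    print(same_major("pg_dump (PostgreSQL) 12.9", "12.3"))
    print(same_major("psql (PostgreSQL) 9.5.24", "12.3"))
```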
Step 3: Export the sql file and attempt script 0_load_chembl_database.R again
Generating data files for SEA
Note that this assumes the new version of the ChEMBL library you are trying to build is already loaded on ZINC. Ask John or Khanh to load the new library.
Script 1-8 notes
Script 1: Get ZINC info for ChEMBL
- Specify the tmp directory
> write("TMP = YOUR_PATH_VARIABLE", file=file.path('~/.Renviron')) # in the R console
- It is recommended to run zinc_to_chembl in small batches because the website will time out.
To specify the batch size, add result_batch_size to the list of parameters for Zr::catalog_items. If it times out, you can resume from the page where it timed out by adding page=<page_num>.
> zinc_to_chembl <- Zr::catalog_items(
    catalog_short_name="chembl25",
    count='all',
    verbose=T,
    page=3,  # optional
    result_batch_size=5000) %>%
  dplyr::select(
    zinc_id,
    chembl_id = supplier_code)
- Alternative, better way (on the command line)
cd ~khtang/jji/
$ source /nfs/soft/www/apps/zinc15/envs/production/env.csh
$ python fetchVendorC.py chembl25 0 <ZINC estimated size> > chembl25.txt
Example:
$ python fetchVendorC.py chembl25 0 1900000000 > chembl25.txt   # one process covering 0 to 1.9B
This can be run in parallel as chunks:
$ python fetchVendorC.py chembl25 0 1000000 > chembl25_1.txt
$ python fetchVendorC.py chembl25 1000001 2000000 > chembl25_2.txt
....
Then concatenate them all together:
$ cat chembl25_1.txt chembl25_2.txt ... > all_chembl25.txt
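Typing the chunked commands by hand becomes error-prone once there are more than a few chunks. The following is a minimal Python 3 sketch that generates the ranges and command lines following the same pattern as the example (0 to 1000000, then 1000001 to 2000000, and so on). The function names are hypothetical, and whether fetchVendorC.py treats the end of a range as inclusive is not stated here, so the sketch simply mirrors the boundaries used above.

```python
def chunk_ranges(total, chunk_size):
    """Return (start, end) pairs covering 0..total in steps of chunk_size,
    e.g. (0, 1000000), (1000001, 2000000), ...

    Hypothetical helper mirroring the range pattern in the example.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    ranges = []
    start, end = 0, min(chunk_size, total)
    while True:
        ranges.append((start, end))
        if end >= total:
            break
        # The next chunk starts right after the previous end.
        start = end + 1
        end = min(end + chunk_size, total)
    return ranges


def fetch_commands(catalog, total, chunk_size):
    """Build one fetchVendorC.py command line per chunk."""
    return [
        "python fetchVendorC.py {0} {1} {2} > {0}_{3}.txt".format(
            catalog, start, end, index
        )
        for index, (start, end) in enumerate(
            chunk_ranges(total, chunk_size), start=1
        )
    ]


if __name__ == "__main__":
    for command in fetch_commands("chembl25", 3000000, 1000000):
        print(command)
```

The printed lines can be pasted into separate terminals or handed to a job scheduler, and the resulting files concatenated with `cat` as described above.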
Script 2-4
Follow the code in BioChemPantry.
Script 5.1: Precalculate compound images
This must be run inside the SEA conda environment:
source /nfs/home/khtang/SEA/tools/anaconda2/bin/activate sea16
export REDISPORT=6380
export SEA_APP_ROOT=$CONDA_PREFIX/var/seaserver
export SEA_RUN_FOLDER=$SEA_APP_ROOT/run
export SEA_DATA_FOLDER=$SEA_APP_ROOT/data
export SEA_DATA_CACHE_FOLDER=$SEA_APP_ROOT/data/cache
The chembl_compound_images part will take ~10 days to complete.