How to generate an HEI database

From DISI
Revision as of 21:49, 27 November 2010 by Kolb (Talk)

Introduction

This guide explains how to use Johannes Hermann's Python scripts for library generation, along with the scripts for running AMSOL, OMEGA and MOL2DB. Although the latter steps are done in the canonical way, the scripts described below fit the data structure produced by the database generation scripts, so using them is highly recommended.

There is one caveat: none of these scripts is standardized, so you will probably have to edit the names of the files, directories and databases they use.

Recommended data structure

  • generate 4 subdirectories: 1_SOMENAME2SDF, 2_OMEGA, 3_AMSOL and 4_MOL2DB.
  • SMILES files must be of the format somesmiles somename.
  • copy the database preparation scripts (a1*.py-c3*.py) to 1_SOMENAME2SDF.
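The layout above can be set up with a few lines of Python. This is a minimal sketch and not one of the original scripts: the helper names make_layout and check_smiles_line are hypothetical, and the check assumes that "somesmiles somename" means exactly two whitespace-separated fields per line.

```python
from pathlib import Path

# Subdirectory names taken from the list above; replace SOMENAME as needed.
SUBDIRS = ["1_SOMENAME2SDF", "2_OMEGA", "3_AMSOL", "4_MOL2DB"]


def make_layout(root):
    """Create the four working subdirectories under root (hypothetical helper)."""
    root = Path(root)
    created = []
    for name in SUBDIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


def check_smiles_line(line):
    """Return True if a line has two fields (assumed: SMILES string, then name)."""
    return len(line.split()) == 2
```

Running check_smiles_line over every line of a .smi file before starting the a scripts is a cheap way to catch entries with a missing name.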

HEI generation

Broadly speaking, the generation procedure involves three steps:

  1. Conversion of the input .mol2 or .sdf files to SMILES and then isomeric SMILES.
  2. Conversion ("reaction") of the appropriate group(s) in each molecule to form the HEI.
  3. Generation of multiple protonation states, 3D-structures and partial charges for each HEI, resulting in .db files that can be fed to DOCK.

The scripts for each step are prefixed with the letter 'a' (step 1), 'b' (step 2), and 'c' (first part of step 3), respectively. Within each letter, the scripts are enumerated consecutively.

  • every script takes a list of SMILES as input and outputs a list of SMILES (except for the b7 and b8 scripts, which output .sdf files), prefixed with the sequential number of the script.
  • the scripts a3-a4 have to be run in sequence.
  • scripts b1 to b8 all take the output of a4 as input. Each of these scripts describes a different reaction and each reaction will only happen when the appropriate reacting groups are encountered in a molecule.
  • each b script will generate an LN (neutral leaving group) and an LP (protonated leaving group) file.
  • it is a VERY GOOD idea to keep the LN and LP files separate throughout the entire procedure, especially when running the c scripts. This will make things easier later on.
  • c3_sdf2mol2_mysql_names.py has to be run on four files: the LN and LP files coming out of c2_ionizer_min.py and the sdf files resulting from the b7 and b8 scripts.
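The naming scheme described above (letter for the step, then a consecutive number) can be used to list the scripts in their numbering order. The following is only an illustrative sketch: script_key is a hypothetical helper, and the regular expression is an assumption derived from the file names shown on this page. It does not imply that the b scripts must be run one after another, since each of them takes the output of a4.

```python
import re

# Matches names such as "a3.1_sdf2ism_filter.py" or "b9_remove_doubles.py".
SCRIPT_RE = re.compile(r"^([abc])(\d+)(?:\.(\d+))?_")

# Letter prefix -> step of the generation procedure, as described above.
STEP_OF_LETTER = {"a": 1, "b": 2, "c": 3}


def script_key(filename):
    """Return (step, major, minor) for sorting script names (hypothetical helper)."""
    match = SCRIPT_RE.match(filename)
    if match is None:
        raise ValueError("not a generation script: " + filename)
    letter, major, minor = match.groups()
    # Scripts without a sub-number (e.g. a4) get minor = 0.
    return (STEP_OF_LETTER[letter], int(major), int(minor or 0))


names = [
    "b4.2_rxn_amidine_aromatic.py",
    "a4_rm_doubles.py",
    "c1_corina.py",
    "a3.2_size_cutoff_filter.py",
    "a3.1_sdf2ism_filter.py",
]
ordered = sorted(names, key=script_key)
```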

a Scripts

a1_create_sdf_from_fold.py folder(unpacked from KEGG-website)
a2_corina.py molfile.smi

a2_corina.py can be started with either .smi, .ism or .sdf files.

a3.1_sdf2ism_filter.py molfile.sdf 
a3.2_size_cutoff_filter.py molfile.smi
a4_rm_doubles.py molfile.smi

b Scripts

b1_rxn_carbonyl.py molfile.smi
b1_rxn_lactone.pk.py molfile.smi
b2.1_rxn_aromatic_cleav.py molfile.smi
b2.2_rxn_aromatic_cleav.py molfile.smi
b3_rxn_amidines.py molfile.smi
b4.1_rxn_amidine_aromatic.py molfile.smi
b4.2_rxn_amidine_aromatic.py molfile.smi
b4.3_rxn_amidine_aromatic.py molfile.smi
b4.4_rxn_amidine_aromatic.py molfile.smi
b5_rxn_imin.py molfile.smi
b6.1_rxn_imin_aromatic.py molfile.smi
b6.2_rxn_imin_aromatic.py molfile.smi
b6.3_rxn_imin_aromatic.py molfile.smi
b7.1_parts_split_1.py molfile.smi
b7.2_parts_split_2.py file-identifier(e.g. _mol_32007)
b7.3_parts_connect_1.py file-identifier(e.g. _mol_32007)
b7.4_parts_connect_2.py file-identifier(e.g. _mol_32007)
b7.5_ionizer.py file-identifier(e.g. _mol_32007)
b8.1_thio_parts_split_1.py molfile.smi
b8.2_thio_parts_split_2.py file-identifier(e.g. _mol_32007)
b8.3_thio_parts_connect_1.py file-identifier(e.g. _mol_32007)
b8.4_thio_parts_connect_2.py file-identifier(e.g. _mol_32007)
b8.5_thio_ionizer.py file-identifier(e.g. _mol_32007)
b9_remove_doubles.py start-file-pattern end-file-pattern

c Scripts

c1_corina.py start-file-pattern end-file-pattern
c2_ionizer_min.py start-file-pattern end-file-pattern
c3_sdf2mol2_mysql_names.py sdf-file(from corina+ionizer) suffix(for Folders after ring)
c3_sdf2mol2_mysql_names_remove.py filename_containing_mol2_filenames(zipped does not hurt)

Running omega

  • Be careful! This script needs access to a mysql database – make sure to set the appropriate values that allow you access in the script.
  • change to 2_OMEGA.
  • command line:
om2_chunks_on_tmp.py MOLS_SUBDIR_1 MOLS_SUBDIR_2 MOL_RAID MAXMOL
  • alternative command line if you want to run on the cluster:
om2_chunks_on_scratch.py MOLS_SUBDIR_1 MOLS_SUBDIR_2 MOL_RAID MAXMOL
  • the individual arguments will be connected to form the path to the mol2-files generated in step 3:
    /raid[MOL_RAID]/people/kolb/DB4/[MOLS_SUBDIR_2]/MOLS/[MOLS_SUBDIR_1]
  • MAXMOL gives the maximum number of molecules that are processed in one chunk. If you have to kill the job, it is advisable to do so between the processing of two chunks.
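To make the argument handling above concrete, here is a minimal sketch of how the three path arguments combine and what processing in chunks of MAXMOL could look like. Both functions are hypothetical illustrations rather than code from om2_chunks_on_tmp.py; how the real script splits its input is an assumption, and the example values are made up.

```python
def omega_input_path(mols_subdir_1, mols_subdir_2, mol_raid):
    """Build the mol2 input path from the arguments, following the pattern above."""
    return "/raid{}/people/kolb/DB4/{}/MOLS/{}".format(
        mol_raid, mols_subdir_2, mols_subdir_1
    )


def chunks(items, maxmol):
    """Yield successive lists of at most maxmol items (assumed chunking scheme)."""
    if maxmol < 1:
        raise ValueError("MAXMOL must be at least 1")
    for start in range(0, len(items), maxmol):
        yield items[start:start + maxmol]


# Hypothetical values, in the same order as on the command line.
path = omega_input_path("OH_LN", "RING_MORE_KEGG_HEI", 5)
```

Note that MOLS_SUBDIR_1 is the first argument on the command line but the last component of the path.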

Running amsol

  • Be careful! This script needs access to a mysql database – make sure to set the appropriate values that allow you access in the script.
  • change to 3_AMSOL
  • command line:
am_chunks_on_tmp.py MOLS_SUBDIR_1 MOL_RAID MOLS_SUBDIR_2
  • the individual arguments will be connected to form the path to the mol2-files generated in step 3:
    /raid[MOL_RAID]/people/kolb/DB[MOLS_SUBDIR_2]/2_OMEGA/[MOLS_SUBDIR_1]
  • the script will call amsol_limit.py, so make sure that this file is in your directory.
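Before starting a run it can help to check both points from the list above: the path that the arguments resolve to, and the presence of amsol_limit.py. The sketch below is hypothetical and not part of am_chunks_on_tmp.py; the example values are made up. Unlike the OMEGA script, the argument order here is MOLS_SUBDIR_1 MOL_RAID MOLS_SUBDIR_2, and MOLS_SUBDIR_2 is appended directly to "DB".

```python
from pathlib import Path


def amsol_input_path(mols_subdir_1, mol_raid, mols_subdir_2):
    """Build the input path from the arguments, following the pattern above."""
    return "/raid{}/people/kolb/DB{}/2_OMEGA/{}".format(
        mol_raid, mols_subdir_2, mols_subdir_1
    )


def has_amsol_limit(directory="."):
    """Return True if amsol_limit.py is present in the given directory."""
    return (Path(directory) / "amsol_limit.py").is_file()


# Hypothetical values, in the same order as on the command line.
path = amsol_input_path("OH_LN", 5, 4)
```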

Running mol2db

  • change to 4_MOL2DB.
  • create a subfolder for every subpart of the database, e.g., RING_MORE_KEGG_HEI/OH_LN, RING_MORE_KEGG_HEI/OH_LP, and so on.
  • copy the file inhier_col (the input file) into every subdirectory.
  • run the appropriate script directly in the subfolder: mrm_3_limit.py for molecules with rings, mrn_1s.py for molecules with no rings.
  • in each script, make sure that the maximum number of molecules per .db file is set to not more than 1000.
  • keep in mind that each .mol2 file read by mol2db must contain exactly 6 lines between the @<TRIPOS>MOLECULE and @<TRIPOS>ATOM tags.
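The 6-line requirement in the last point can be checked before running mol2db. The following is a minimal sketch under the assumption that "between" means the lines strictly between the two tags, with blank lines counted; header_line_counts and all_headers_ok are hypothetical helper names.

```python
def header_line_counts(mol2_text):
    """Return, per molecule, the number of lines between the MOLECULE and ATOM tags."""
    counts = []
    current = None  # None means we are not inside a MOLECULE header
    for line in mol2_text.splitlines():
        tag = line.strip()
        if tag == "@<TRIPOS>MOLECULE":
            current = 0
        elif tag == "@<TRIPOS>ATOM":
            if current is not None:
                counts.append(current)
            current = None
        elif current is not None:
            current += 1  # blank lines are counted as well
    return counts


def all_headers_ok(mol2_text, expected=6):
    """Return True if at least one header was found and all have the expected length."""
    counts = header_line_counts(mol2_text)
    return bool(counts) and all(n == expected for n in counts)
```

A file is read with open(filename).read() and passed to all_headers_ok; molecules with a shorter header can usually be padded with blank lines, but whether that suits your files needs to be verified.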

Running mrm_3_limit.py

  • command line:
mrm_3_limit.py MOL_RAID DB_VERSION MOLS_SUBDIR JOB_ID OMEGA_PATH AMSOL_PATH CHECK WRITE_BROKEN
  • The individual arguments and the pwd will be connected to form the path to the mol2-files generated in step 2:
    /raid[MOL_RAID]/people/kolb/DB[DB_VERSION]/[MOLS_SUBDIR]/MOLS/[obtained from pwd: penultimate dir]/[obtained from pwd: last dir]
  • CHECK gives the frequency of the check whether a molecule has already been processed or not: '0' → no check; '1' → check at the beginning of every job; '2' → check before processing each molecule.
  • in case the script stops after just one molecule, do the following:
      1. check that the file .labels[JOB_ID].txt exists.
      2. create a file .dbnums[JOB_ID].txt and write something like "101 0" to it. The first number will be the starting number for the enumeration of the .db files, while the second is the current number of molecules already in that .db file.
      3. delete everything but the header from the .db file.
      4. start mrm_3_limit.py again.
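The first two recovery steps above can be scripted. This is a hedged sketch, not part of mrm_3_limit.py: reset_dbnums is a hypothetical helper, and the trailing newline in the written file is an assumption. Trimming the .db file to its header is left as a manual step, because the header length is not specified here.

```python
from pathlib import Path


def reset_dbnums(job_id, start_number, directory="."):
    """Check for .labels[JOB_ID].txt and write a fresh .dbnums[JOB_ID].txt."""
    directory = Path(directory)
    labels = directory / ".labels{}.txt".format(job_id)
    if not labels.is_file():
        raise FileNotFoundError("missing " + str(labels))
    dbnums = directory / ".dbnums{}.txt".format(job_id)
    # First number: starting index for the .db files.
    # Second number: molecules already in that .db file (0 after trimming it).
    dbnums.write_text("{} 0\n".format(start_number))  # newline is an assumption
    return dbnums
```

For example, reset_dbnums(7, 101) run in the job's subfolder would write "101 0" to .dbnums7.txt, provided that .labels7.txt exists.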

Running mrn_1s.py

  • command line:
mrn_1s.py JOBNUM MOLS_SUBDIR MOL_RAID AMSOL_RAID DB_VERSION CHECK
  • individual arguments and the pwd will be connected to form the path to the mol2-files generated in step 2:
    /raid[MOL_RAID]/people/kolb/DB[DB_VERSION]/[MOLS_SUBDIR]/MOLS/[obtained from pwd: penultimate dir]/[obtained from pwd: last dir]
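Because the last two path components are taken from the pwd, the directory from which the script is started matters. The sketch below shows one way such a path could be assembled; mol2_path_from_pwd is a hypothetical function, not the code in mrn_1s.py or mrm_3_limit.py, and the example values are made up.

```python
from pathlib import PurePosixPath


def mol2_path_from_pwd(mol_raid, db_version, mols_subdir, pwd):
    """Combine the arguments with the last two directories of pwd."""
    pwd = PurePosixPath(pwd)
    penultimate_dir = pwd.parent.name
    last_dir = pwd.name
    return "/raid{}/people/kolb/DB{}/{}/MOLS/{}/{}".format(
        mol_raid, db_version, mols_subdir, penultimate_dir, last_dir
    )


# Hypothetical example: started from a subfolder of 4_MOL2DB.
path = mol2_path_from_pwd(
    5, 4, "SOMESUBDIR", "/home/user/4_MOL2DB/RING_MORE_KEGG_HEI/OH_LN"
)
```

In practice the pwd would come from os.getcwd(). Starting the script from a different directory changes the path it looks in, which is why the scripts should be run directly in the subfolder.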

Acknowledgments

Johannes has sacrificed a week of his time to introduce me to the scripts. Hao Fan has prepared HEI databases on his own and given valuable input.

--Kolb 14:12, 3 July 2008 (PDT)