How to generate an HEI database


Revision as of 22:12, 27 November 2010


This guide explains how to use Johannes Hermann's Python scripts for library generation, as well as the scripts for running AMSOL, OMEGA and MOL2DB. Although the latter steps are carried out in the canonical way, the scripts mentioned below fit the data structure produced by the database generation scripts, so using them is highly recommended.

There is just one caveat: none of these scripts is standardized, so you will probably have to edit the names of the files, directories and databases they use.

Recommended data structure

  • generate 4 subdirectories: 1_SOMENAME2SDF, 2_OMEGA, 3_AMSOL and 4_MOL2DB. Each line of a SMILES file must have the format somesmiles somename, i.e., the SMILES string followed by the molecule name.
  • copy the database preparation scripts (a1*.py-c3*.py) to 1_SOMENAME2SDF.
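
The layout above can be set up and checked with a few lines of Python. This is only a sketch: the helpers make_layout and check_smiles_line are made up for illustration and are not part of the original scripts, SOMENAME is kept as the placeholder used above, and the check assumes that the SMILES string and the name are separated by whitespace.

```python
import os
import tempfile

# Directory names as recommended above; SOMENAME is a placeholder.
SUBDIRS = ["1_SOMENAME2SDF", "2_OMEGA", "3_AMSOL", "4_MOL2DB"]


def make_layout(root):
    """Create the four recommended subdirectories below root."""
    paths = []
    for name in SUBDIRS:
        path = os.path.join(root, name)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        paths.append(path)
    return paths


def check_smiles_line(line):
    """Return True if the line has exactly two fields: SMILES and name.

    Assumption: the two fields are separated by whitespace.
    """
    return len(line.split()) == 2


if __name__ == "__main__":
    root = tempfile.mkdtemp()  # throwaway directory for the demonstration
    for path in make_layout(root):
        print(path)
    print(check_smiles_line("CC(=O)O acetic_acid"))
```

Running a check like this over the input file before starting the a scripts catches lines with a missing name early.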

HEI generation

Broadly speaking, the generation procedure involves three steps:

  1. Conversion of the input .mol2 or .sdf files to SMILES and then isomeric SMILES.
  2. Conversion ("reaction") of the appropriate group(s) in each molecule to form the HEI.
  3. Generation of multiple protonation states, 3D-structures and partial charges for each HEI, resulting in .db files that can be fed to DOCK.

The scripts for each step are prefixed with the letter 'a' (step 1), 'b' (step 2), and 'c' (first part of step 3), respectively. Within each letter, the scripts are enumerated consecutively.

  • every script takes a list of SMILES as input and outputs a list of SMILES (except for the b7 and b8 scripts, which output .sdf files), prefixed with the sequential number of the script.
  • the scripts a3-a4 have to be run in sequence.
  • scripts b1 to b8 all take the output of a4 as input. Each of these scripts describes a different reaction and each reaction will only happen when the appropriate reacting groups are encountered in a molecule.
  • each b script will generate an LN (neutral leaving group) and an LP (protonated leaving group) file.
  • it is a VERY GOOD idea to keep the LN and LP files separate throughout the entire procedure, especially when running the c scripts. This will make things easier later on.
  • c3_sdf2mol2_mysql_names.py has to be run on four files: the LN and LP files coming out of c2_ionizer_min.py and the sdf files resulting from the b7 and b8 scripts.
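
One way to keep the LN and LP files apart is to sort the output files by leaving-group type before running the c scripts. The sketch below is hypothetical: it assumes that the file names carry "LN" or "LP" as a separate token (for example b1_mols_LN.smi, an invented name), which may not match what the b scripts actually write.

```python
import re
from collections import defaultdict

# Assumption: "LN" or "LP" appears in the file name as its own token,
# delimited by characters that are neither letters nor digits.
_LG_TOKEN = re.compile(r"(?<![A-Za-z0-9])(LN|LP)(?![A-Za-z0-9])")


def group_by_leaving_group(filenames):
    """Sort file names into 'LN', 'LP' and 'other' (no token found)."""
    groups = defaultdict(list)
    for name in filenames:
        match = _LG_TOKEN.search(name)
        key = match.group(1) if match else "other"
        groups[key].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Invented file names, for illustration only.
    files = ["b1_mols_LN.smi", "b1_mols_LP.smi", "b7_parts.sdf"]
    for key, names in group_by_leaving_group(files).items():
        print(key, names)
```

Files that end up under "other" (such as the .sdf output of b7 and b8) need to be handled separately.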

a Scripts

a1_create_sdf_from_fold.py folder(unpacked from KEGG-website)
a2_corina.py molfile.smi

a2_corina.py can be started with either .smi, .ism or .sdf files.

a3.1_sdf2ism_filter.py molfile.sdf 
a3.2_size_cutoff_filter.py molfile.smi
a4_rm_doubles.py molfile.smi

b Scripts

b1_rxn_carbonyl.py molfile.smi
b1_rxn_lactone.pk.py molfile.smi
b2.1_rxn_aromatic_cleav.py molfile.smi
b2.2_rxn_aromatic_cleav.py molfile.smi
b3_rxn_amidines.py molfile.smi
b4.1_rxn_amidine_aromatic.py molfile.smi
b4.2_rxn_amidine_aromatic.py molfile.smi
b4.3_rxn_amidine_aromatic.py molfile.smi
b4.4_rxn_amidine_aromatic.py molfile.smi
b5_rxn_imin.py molfile.smi
b6.1_rxn_imin_aromatic.py molfile.smi
b6.2_rxn_imin_aromatic.py molfile.smi
b6.3_rxn_imin_aromatic.py molfile.smi
b7.1_parts_split_1.py molfile.smi
b7.2_parts_split_2.py file-identifier(e.g. _mol_32007)
b7.3_parts_connect_1.py file-identifier(e.g. _mol_32007)
b7.4_parts_connect_2.py file-identifier(e.g. _mol_32007)
b7.5_ionizer.py file-identifier(e.g. _mol_32007)
b8.1_thio_parts_split_1.py molfile.smi
b8.2_thio_parts_split_2.py file-identifier(e.g. _mol_32007)
b8.3_thio_parts_connect_1.py file-identifier(e.g. _mol_32007)
b8.4_thio_parts_connect_2.py file-identifier(e.g. _mol_32007)
b8.5_thio_ionizer.py file-identifier(e.g. _mol_32007)
b9_remove_doubles.py start-file-pattern end-file-pattern

c Scripts

c1_corina.py start-file-pattern end-file-pattern
c2_ionizer_min.py start-file-pattern end-file-pattern
c3_sdf2mol2_mysql_names.py sdf-file(from corina+ionizer) suffix(for Folders after ring)
c3_sdf2mol2_mysql_names_remove.py filename_containing_mol2_filenames(zipped does not hurt)

Running omega

  • Be careful! This script needs access to a MySQL database, so make sure to set the values in the script that give you access.
  • change to 2_OMEGA.
  • required files:
    • torlib_1205.txt
    • omega_03.2_3_2.param
    • omega_07.2_3_2.param
    • om2_chunks_on_tmp.py or om2_chunks_on_scratch.py
  • commandline:
  • alternative commandline if you want to run on the cluster:
om2_chunks_on_scratch.py MOLS_SUBDIR_1 MOLS_SUBDIR_2 MOL_RAID MAXMOL
  • the individual arguments will be connected to form the path to the mol2-files generated in step 3:
  • MAXMOL gives the maximum number of molecules which are processed in one chunk. If a job has to be killed, it is advisable to do so between the processing of two chunks.
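
How the om2 scripts split the work internally is not shown here. As a rough illustration of what a MAXMOL limit means, the following sketch cuts a list of molecules into chunks of at most maxmol entries. The function name chunks and the molecule names are invented for the example.

```python
def chunks(items, maxmol):
    """Yield consecutive slices of items with at most maxmol entries each."""
    if maxmol < 1:
        raise ValueError("maxmol must be at least 1")
    for start in range(0, len(items), maxmol):
        yield items[start:start + maxmol]


if __name__ == "__main__":
    molecules = ["mol_%d" % i for i in range(5)]  # invented names
    for number, chunk in enumerate(chunks(molecules, 2), start=1):
        # The end of each loop iteration is a chunk boundary, i.e. the
        # point at which stopping a job loses the least work.
        print(number, chunk)
```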

Running amsol

  • Be careful! This script needs access to a MySQL database, so make sure to set the values in the script that give you access.
  • change to 3_AMSOL
  • required files:
    • amsol_limit.py
    • amsol_functions.py
    • amsol.py
    • am_chunks_on_tmp.py or am_chunks_on_scratch.py
  • commandline:
am_chunks_on_tmp.py MOLS_SUBDIR_1 MOL_RAID MOLS_SUBDIR_2
  • alternative commandline if you want to run on the cluster:
am_chunks_on_scratch.py MOLS_SUBDIR_1 MOL_RAID MOLS_SUBDIR_2
  • the individual arguments will be connected to form the path to the mol2-files generated in step 3:
  • the script will call amsol_limit.py, so make sure that this file is in your directory.

Running mol2db

  • change to 4_MOL2DB.
  • create a subfolder for every subpart of the database, e.g., RING_MORE_KEGG_HEI/OH_LN, RING_MORE_KEGG_HEI/OH_LP, and so on.
  • required files in 4_MOL2DB:
    • inhier_col
    • mol2db_limit.csh
    • lettercode.txt (a file specifying a single letter for each subdirectory)
  • run the appropriate script directly in the subfolder: mrm_3_limit.py for molecules with multiple rings, mro_5.py for molecules with one ring, and mrn_1s.py for molecules with no rings.
  • in each script, make sure that the maximum number of molecules per .db file is set to not more than 1000.
  • keep in mind that the .mol2 file read by mol2db must contain exactly 6 lines between @<TRIPOS>MOLECULE and @<TRIPOS>ATOM
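
The 6-line requirement is easy to check before mol2db is started. The sketch below counts the lines between each @<TRIPOS>MOLECULE record and the following @<TRIPOS>ATOM record. It assumes that blank lines count towards the six; the function names are invented and the code is not part of the original scripts.

```python
def header_line_counts(mol2_text):
    """Count the lines between each @<TRIPOS>MOLECULE and the next @<TRIPOS>ATOM."""
    counts = []
    current = None  # None while we are outside a molecule header
    for line in mol2_text.splitlines():
        tag = line.strip()
        if tag == "@<TRIPOS>MOLECULE":
            current = 0
        elif tag == "@<TRIPOS>ATOM":
            if current is not None:
                counts.append(current)
            current = None
        elif current is not None:
            current += 1  # blank lines are counted as well (assumption)
    return counts


def has_valid_headers(mol2_text, expected=6):
    """Return True if every molecule header has exactly `expected` lines."""
    counts = header_line_counts(mol2_text)
    return bool(counts) and all(n == expected for n in counts)
```

For a file on disk, read it with open(path).read() and pass the text to has_valid_headers; a False result means that at least one molecule has to be fixed before running mol2db.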

Example: running mrm_3_limit.py

  • commandline:
  • The individual arguments and the pwd will be connected to form the path to the mol2-files generated in step 2:

/raid[MOL_RAID]/people/kolb/DB[DB_VERSION]/[MOLS_SUBDIR]/MOLS/[obtained from pwd: penultimate dir]/[obtained from pwd: last dir].
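
The path template above can be reproduced as follows. This is a sketch of the pattern only, not code taken from mrm_3_limit.py: the function name mol2_dir is invented, and it assumes that the bracketed values are inserted directly, without separators.

```python
import posixpath


def mol2_dir(mol_raid, db_version, mols_subdir, cwd):
    """Build the .mol2 directory from the arguments and the working directory.

    The last two components of cwd supply the final two path elements.
    """
    parts = [p for p in cwd.split("/") if p]
    if len(parts) < 2:
        raise ValueError("cwd must contain at least two directory levels")
    penultimate, last = parts[-2], parts[-1]
    return posixpath.join(
        "/raid" + str(mol_raid),
        "people",
        "kolb",
        "DB" + str(db_version),
        mols_subdir,
        "MOLS",
        penultimate,
        last,
    )


if __name__ == "__main__":
    # The argument values are invented for the demonstration.
    print(mol2_dir(3, 5, "HEI", "/home/me/4_MOL2DB/RING_MORE_KEGG_HEI/OH_LN"))
```

Printing the path this way before starting a job is a quick check that the script is run from the intended subfolder.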

  • CHECK gives the frequency of the check whether a molecule has already been processed or not: '0' → no check; '1' → check at the beginning of every job; '2' → check before processing each molecule.
  • in case the script stops after just one molecule, do the following:
    • check that the file .labels[JOB_ID].txt exists.
    • create a file .dbnums[JOB_ID].txt and write something like "101 0" to it. The first number will be the starting number for the enumeration of the .db files, while the second is the current number of molecules already in that .db file.
    • delete everything but the header from the .db file.
    • start mrm_3_limit.py again.
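
The first two steps of the restart procedure above can be scripted. The following sketch uses an invented helper, prepare_restart, and assumes that JOB_ID is inserted into the file names without brackets and that a trailing newline in the .dbnums file is acceptable. Trimming the .db file back to its header is left out, because the header layout is not described here.

```python
import os
import tempfile


def prepare_restart(job_id, first_db_number=101, mols_in_db=0, workdir="."):
    """Check the labels file and write a fresh .dbnums file for job_id."""
    labels = os.path.join(workdir, ".labels%s.txt" % job_id)
    dbnums = os.path.join(workdir, ".dbnums%s.txt" % job_id)
    if not os.path.exists(labels):
        raise FileNotFoundError(labels)
    with open(dbnums, "w") as handle:
        # First number: where the .db file numbering starts.
        # Second number: molecules already present in that .db file.
        handle.write("%d %d\n" % (first_db_number, mols_in_db))
    return dbnums


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()  # throwaway directory for the demonstration
    open(os.path.join(workdir, ".labels7.txt"), "w").close()
    print(prepare_restart(7, workdir=workdir))
```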

Inserting the newly generated molecules into a mysql database

This step is essential to preserve knowledge about the correspondence between the original database name of a molecule, its HEI form, protonation states and conformations and the final name given by mol2db (of the form A00000000 [one letter + eight digits]).
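
The name format given above (one letter followed by eight digits) can be checked with a regular expression before anything is inserted into the database. This is an illustrative sketch with invented function names; it accepts both upper- and lower-case letters, which is an assumption, since only the example A00000000 is given.

```python
import re

# One letter followed by exactly eight digits, e.g. A00000000.
MOL2DB_NAME = re.compile(r"([A-Za-z])([0-9]{8})")


def is_mol2db_name(name):
    """Return True if name has the form 'one letter + eight digits'."""
    return MOL2DB_NAME.fullmatch(name) is not None


def split_mol2db_name(name):
    """Split a valid name into its letter and its running number."""
    match = MOL2DB_NAME.fullmatch(name)
    if match is None:
        raise ValueError("not a mol2db name: %r" % name)
    return match.group(1), int(match.group(2))
```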

  • required files:
    • mysql_insert_db6.py
  • this also requires you to generate a mysql database of the proper format, best done with mysql_create_table_db5.pk.py
  • commandline:
  • the individual arguments will be connected to form the path to the .mol2 and files as follows


  • the .db files are expected in


  • TAG is optional and is the name with which the molecule names start.


Johannes has sacrificed a week of his time to introduce me to the scripts. Hao Fan and Magdalena Korczynska have prepared HEI databases on their own and given valuable input.

Kolb 21:11, 27 November 2010 (PST)