Therese: 40 revisions

2012-10-08T20:25:31Z

Revision as of 20:25, 8 October 2012

Tbalius at 20:24, 5 September 2012

2012-09-05T20:24:59Z

New page

=Introduction=

This guide explains how to use Johannes Hermann's
Python scripts for High Energy Intermediate (HEI) library generation, as well as the scripts
for running AMSOL, OMEGA and MOL2DB. Although the latter steps are
done in the canonical way, the scripts mentioned below fit in nicely with the data
structure generated by the database generation scripts, and using them is
highly recommended.

There is just one caveat: none of these scripts are standard, so you
will probably have to edit the names of the files, directories and databases
that these scripts use.

=Recommended data structure=

*generate 4 subdirectories: <tt>1_SOMENAME2SDF</tt>, <tt>2_OMEGA</tt>, <tt>3_AMSOL</tt> and <tt>4_MOL2DB</tt>.
*SMILES files must be of the format <tt>somesmiles somename</tt>.
*copy the database preparation scripts (<tt>a1*.py-c3*.py</tt>) to <tt>1_SOMENAME2SDF</tt>.
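
A minimal sketch of setting up this layout in Python. It assumes that the placeholder <tt>SOMENAME</tt> is replaced by a database name of your choice and that the SMILES string and the name are separated by whitespace. The function names are hypothetical and not part of the original scripts.

```python
import os


def make_layout(base_dir, name):
    """Create the four recommended subdirectories under base_dir.

    `name` replaces the SOMENAME placeholder (an assumption for illustration).
    """
    subdirs = ["1_%s2SDF" % name, "2_OMEGA", "3_AMSOL", "4_MOL2DB"]
    for sub in subdirs:
        os.makedirs(os.path.join(base_dir, sub), exist_ok=True)
    return subdirs


def bad_smiles_lines(lines):
    """Return the 1-based numbers of non-empty lines that do not have
    exactly two whitespace-separated fields (SMILES, then name)."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if fields and len(fields) != 2:
            bad.append(number)
    return bad
```

Running <tt>bad_smiles_lines</tt> on the lines of a SMILES file before starting the a scripts is one way to catch entries with a missing name.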

=HEI generation=
Broadly speaking, the generation procedure involves three steps:
# Conversion of the input .mol2 or .sdf files to [http://www.daylight.com/smiles/ SMILES] and then isomeric SMILES.
# Conversion ("reaction") of the appropriate group(s) in each molecule to form the HEI.
# Generation of multiple protonation states, 3D-structures and partial charges for each HEI, resulting in .db files that can be fed to [[DOCK]].
The scripts for each step are prefixed with the letter 'a' (step 1), 'b' (step 2), and 'c' (first part of step 3), respectively. Within each letter, the scripts are enumerated consecutively.
*every script takes a list of SMILES as input and outputs a list of SMILES (except for the b7 and b8 scripts, which output .sdf files), prefixed with the sequential number of the script.
*the scripts a3-a4 have to be run in sequence.
*scripts b1 to b8 all take the output of a4 as input. Each of these scripts describes a different reaction and each reaction will only happen when the appropriate reacting groups are encountered in a molecule.
*each b script will generate an LN (neutral leaving group) and an LP (protonated leaving group) file.
*it is a VERY GOOD idea to keep the LN and LP files separate throughout the entire procedure, especially when running the c scripts. This will make things easier later on.
*c3_sdf2mol2_mysql_names.py has to be run on four files: the LN and LP files coming out of c2_ionizer_min.py and the sdf files resulting from the b7 and b8 scripts.
==a Scripts==

 a1_create_sdf_from_fold.py folder(unpacked from KEGG-website)
 a2_corina.py molfile.smi

<tt>a2_corina.py</tt> can be started with <tt>.smi</tt>, <tt>.ism</tt> or <tt>.sdf</tt> files.

a3.1_sdf2ism_filter.py molfile.sdf
a3.2_size_cutoff_filter.py molfile.smi
a4_rm_doubles.py molfile.smi

==b Scripts==

b1_rxn_carbonyl.py molfile.smi
b1_rxn_lactone.pk.py molfile.smi
b2.1_rxn_aromatic_cleav.py molfile.smi
b2.2_rxn_aromatic_cleav.py molfile.smi
b3_rxn_amidines.py molfile.smi
b4.1_rxn_amidine_aromatic.py molfile.smi
b4.2_rxn_amidine_aromatic.py molfile.smi
b4.3_rxn_amidine_aromatic.py molfile.smi
b4.4_rxn_amidine_aromatic.py molfile.smi
b5_rxn_imin.py molfile.smi
b6.1_rxn_imin_aromatic.py molfile.smi
b6.2_rxn_imin_aromatic.py molfile.smi
b6.3_rxn_imin_aromatic.py molfile.smi
b7.1_parts_split_1.py molfile.smi
b7.2_parts_split_2.py file-identifier(e.g. _mol_32007)
b7.3_parts_connect_1.py file-identifier(e.g. _mol_32007)
b7.4_parts_connect_2.py file-identifier(e.g. _mol_32007)
b7.5_ionizer.py file-identifier(e.g. _mol_32007)
b8.1_thio_parts_split_1.py molfile.smi
b8.2_thio_parts_split_2.py file-identifier(e.g. _mol_32007)
b8.3_thio_parts_connect_1.py file-identifier(e.g. _mol_32007)
b8.4_thio_parts_connect_2.py file-identifier(e.g. _mol_32007)
b8.5_thio_ionizer.py file-identifier(e.g. _mol_32007)
b9_remove_doubles.py start-file-pattern end-file-pattern

==c Scripts==

c1_corina.py start-file-pattern end-file-pattern
c2_ionizer_min.py start-file-pattern end-file-pattern
c3_sdf2mol2_mysql_names.py sdf-file(from corina+ionizer) suffix(for Folders after ring)
c3_sdf2mol2_mysql_names_remove.py filename_containing_mol2_filenames(zipped does not hurt)
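
The naming convention described above (a letter for the step, followed by a running number and an optional sub-number) can be parsed to list scripts in their enumerated order. The following is only a sketch of that idea; the regular expression and function names are assumptions for illustration. Note that the sorted order among the b scripts does not imply a dependency, since b1 to b8 all take the output of a4 as input.

```python
import re

# Step letter, running number, optional ".sub-number", then an underscore.
_PREFIX = re.compile(r"^([abc])(\d+)(?:\.(\d+))?_")


def script_order_key(filename):
    """Return (letter, number, sub-number) for a name such as 'a3.1_x.py'."""
    match = _PREFIX.match(filename)
    if match is None:
        raise ValueError("not an a/b/c script name: %r" % filename)
    letter, major, minor = match.groups()
    return (letter, int(major), int(minor or 0))


def sort_scripts(filenames):
    """Sort script file names by step letter and enumeration."""
    return sorted(filenames, key=script_order_key)
```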

=Running [http://www.eyesopen.com/products/applications/omega.html <tt>omega</tt>]=
*''Be careful! This script needs access to a mysql database. Make sure to set the values in the script that grant you access.''
*change to <tt>2_OMEGA</tt>.
*required files:
**torlib_1205.txt
**omega_03.2_3_2.param
**omega_07.2_3_2.param
**om2_chunks_on_tmp.py ''or'' om2_chunks_on_scratch.py
*commandline:
om2_chunks_on_tmp.py MOLS_SUBDIR_1 MOLS_SUBDIR_2 MOL_RAID MAXMOL
*alternative commandline if you want to run on the cluster:
om2_chunks_on_scratch.py MOLS_SUBDIR_1 MOLS_SUBDIR_2 MOL_RAID MAXMOL
*the individual arguments will be connected to form the path to the mol2-files generated in step 3:<br><tt>/raid[MOL_RAID]/people/kolb/DB4/[MOLS_SUBDIR_2]/MOLS/[MOLS_SUBDIR_1]</tt>
*<tt>MAXMOL</tt> gives the maximum number of molecules that are processed in one chunk. If the job has to be killed, it is advisable to do so between the processing of two chunks.
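
To make the role of the arguments concrete, the sketch below builds the path from the pattern given above and splits a list of molecules into chunks of at most <tt>MAXMOL</tt> entries. It illustrates the idea only; how <tt>om2_chunks_on_tmp.py</tt> actually forms its chunks is not documented here, and both function names are hypothetical.

```python
def omega_mol2_dir(mols_subdir_1, mols_subdir_2, mol_raid):
    """Build the mol2 directory from the three path arguments,
    following the pattern quoted in the text."""
    return "/raid%s/people/kolb/DB4/%s/MOLS/%s" % (
        mol_raid, mols_subdir_2, mols_subdir_1)


def chunks(items, maxmol):
    """Yield consecutive slices of items with at most maxmol entries each."""
    if maxmol < 1:
        raise ValueError("maxmol must be at least 1")
    for start in range(0, len(items), maxmol):
        yield items[start:start + maxmol]
```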

=Running <tt>[http://comp.chem.umn.edu/amsol/ amsol]</tt>=
*''Be careful! This script needs access to a mysql database. Make sure to set the values in the script that grant you access.''
*change to <tt>3_AMSOL</tt>
*required files:
**amsol_limit.py
**amsol_functions.py
**amsol.py
**am_chunks_on_tmp.py ''or'' am_chunks_on_scratch.py
*commandline:
am_chunks_on_tmp.py MOLS_SUBDIR_1 MOL_RAID MOLS_SUBDIR_2
*alternative commandline if you want to run on the cluster:
am_chunks_on_scratch.py MOLS_SUBDIR_1 MOL_RAID MOLS_SUBDIR_2
*the individual arguments will be connected to form the path to the mol2-files generated in step 3:<br><tt>/raid[MOL_RAID]/people/kolb/DB[MOLS_SUBDIR_2]/2_OMEGA/[MOLS_SUBDIR_1]</tt>
*the script will call <tt>amsol_limit.py</tt>, so make sure that this file is in your directory.
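
Because the chunk script depends on helper files being present, a small pre-flight check can save a failed run. The sketch below assumes the required files are expected in the working directory; the helper name <tt>missing_files</tt> is hypothetical.

```python
import os

# File names taken from the "required files" list for AMSOL.
AMSOL_REQUIRED = ["amsol_limit.py", "amsol_functions.py", "amsol.py"]
# Only one of these two driver scripts is needed.
AMSOL_DRIVERS = ["am_chunks_on_tmp.py", "am_chunks_on_scratch.py"]


def missing_files(directory, required, one_of=()):
    """Return the required file names absent from directory.

    If none of the alternatives in one_of exists, they are reported
    together as a single 'a or b' entry.
    """
    missing = [name for name in required
               if not os.path.isfile(os.path.join(directory, name))]
    if one_of and not any(
            os.path.isfile(os.path.join(directory, name)) for name in one_of):
        missing.append(" or ".join(one_of))
    return missing
```

Calling <tt>missing_files(".", AMSOL_REQUIRED, AMSOL_DRIVERS)</tt> from within <tt>3_AMSOL</tt> returns an empty list when everything is in place.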

=Running <tt>mol2db</tt>=

*change to <tt>4_MOL2DB</tt>.
*create a subfolder for every subpart of the database, e.g., <tt>RING_MORE_KEGG_HEI/OH_LN</tt>, <tt>RING_MORE_KEGG_HEI/OH_LP</tt>, and so on.
*required files in <tt>4_MOL2DB</tt>:
**inhier_col
**mol2db_limit.csh
**lettercode.txt (a file specifying a single letter for each subdirectory)
*run the appropriate script directly in the subfolder: <tt>mrm_3_limit.py</tt> for molecules with multiple rings, <tt>mro_5.py</tt> for molecules with one ring, and <tt>mrn_1s.py</tt> for molecules with no rings.
*in each script, make sure that the maximum number of molecules per <tt>.db</tt> file is set to no more than 1000.
*keep in mind that the <tt>.mol2</tt> file read by <tt>mol2db</tt> must contain exactly 6 lines between <tt>@<TRIPOS>MOLECULE</tt> and <tt>@<TRIPOS>ATOM</tt>.
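
The six-line requirement can be checked before running <tt>mol2db</tt>. The sketch below counts the lines strictly between each <tt>@<TRIPOS>MOLECULE</tt> tag and the following <tt>@<TRIPOS>ATOM</tt> tag; it assumes each tag sits alone on its line, and the function names are hypothetical.

```python
MOLECULE_TAG = "@<TRIPOS>MOLECULE"
ATOM_TAG = "@<TRIPOS>ATOM"


def header_line_counts(lines):
    """Return, for every molecule record, the number of lines strictly
    between its MOLECULE tag and the next ATOM tag."""
    counts = []
    start = None
    for index, line in enumerate(lines):
        tag = line.strip()
        if tag == MOLECULE_TAG:
            start = index
        elif tag == ATOM_TAG and start is not None:
            counts.append(index - start - 1)
            start = None
    return counts


def mol2_headers_ok(lines, expected=6):
    """True if at least one record exists and all have `expected` header lines."""
    counts = header_line_counts(lines)
    return bool(counts) and all(count == expected for count in counts)
```

For a file on disk, pass the result of <tt>open(path).readlines()</tt> as <tt>lines</tt>.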

==Example: running <tt>mrm_3_limit.py</tt>==

*commandline:
mrm_3_limit.py MOL_RAID DB_VERSION MOLS_SUBDIR JOB_ID OMEGA_PATH AMSOL_PATH CHECK WRITE_BROKEN
*The individual arguments and the <tt>pwd</tt> will be connected to form the path to the mol2-files generated in step 2:<br><tt>/raid[MOL_RAID]/people/kolb/DB[DB_VERSION]/[MOLS_SUBDIR]/MOLS/[obtained from pwd: penultimate dir]/[obtained from pwd: last dir]</tt>
*<tt>CHECK</tt> gives the frequency of the check whether a molecule has already been processed or not: '0' → no check; '1' → check at the beginning of every job; '2' → check before processing each molecule.
*in case the script stops after just one molecule, do the following:
**check that the file <tt>.labels[JOB_ID].txt</tt> exists.
**create a file <tt>.dbnums[JOB_ID].txt</tt> and write something like "101 0" to it. The first number will be the starting number for the enumeration of the <tt>.db</tt> files, while the second is the current number of molecules already in that <tt>.db</tt> file.
**delete everything but the header from the <tt>.db</tt> file.
**start <tt>mrm_3_limit.py</tt> again.
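
The path construction and the recovery file described above can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, paths are taken to be POSIX-style, and whether <tt>mrm_3_limit.py</tt> expects a trailing newline in the <tt>.dbnums</tt> file is not known.

```python
import os


def mrm_mol2_dir(mol_raid, db_version, mols_subdir, cwd):
    """Combine the arguments with the last two components of cwd,
    following the path pattern quoted in the text."""
    parts = cwd.rstrip("/").split("/")
    penultimate, last = parts[-2], parts[-1]
    return "/raid%s/people/kolb/DB%s/%s/MOLS/%s/%s" % (
        mol_raid, db_version, mols_subdir, penultimate, last)


def write_dbnums(directory, job_id, first_db_number, molecules_in_db=0):
    """Write '.dbnums[JOB_ID].txt' containing e.g. '101 0'.

    The trailing newline is an assumption about the expected format.
    """
    path = os.path.join(directory, ".dbnums%s.txt" % job_id)
    with open(path, "w") as handle:
        handle.write("%d %d\n" % (first_db_number, molecules_in_db))
    return path
```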

=Inserting the newly generated molecules into a mysql database=

This step is essential to preserve knowledge about the correspondence between the original database name of a molecule, its HEI form, protonation states and conformations and the final name given by mol2db (of the form A00000000 [one letter + eight digits]).
*required files:
**mysql_insert_db6.py
*this also requires you to generate a mysql database of the proper format, best done with mysql_create_table_db5.pk.py
*commandline:
mysql_insert_db6.py MOLS_SUBDIR MYSQL_DB MOL2_SUBDIR DB_SUBDIR MYSQL_TABLE DB_VERSION MOL_RAID TAG
*the individual arguments will be connected to form the path to the .mol2 files as follows:<br><tt>/raid[MOL_RAID]/people/kolb/DB[DB_VERSION]/[MOLS_SUBDIR]/MOLS/[MOL2_SUBDIR]</tt>
*the .db files are expected in <tt>./[DB_SUBDIR]</tt>
*<tt>TAG</tt> is optional and is the name with which the molecule names start.
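
Since the database links every molecule to its final mol2db name, it can be useful to validate names before inserting them. The sketch below checks the "one letter + eight digits" form mentioned above; that the letter is always uppercase is an assumption based on the example <tt>A00000000</tt>, and the function name is hypothetical.

```python
import re

# One uppercase letter followed by exactly eight digits, e.g. A00000000.
_DB_NAME = re.compile(r"[A-Z][0-9]{8}")


def is_mol2db_name(name):
    """True if name has the form of a final mol2db molecule name."""
    return _DB_NAME.fullmatch(name) is not None
```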

=Acknowledgments=
Johannes has sacrificed a week of his time to introduce me to the
scripts. Hao Fan and Magdalena Korczynska have prepared HEI databases on their own and given
valuable input.

[http://shoichetlab.compbio.ucsf.edu/~kolb Kolb] 21:11, 27 November 2010 (PST)
[[Category:Tutorials]]

How to generate an HEI database - Revision history
