<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://wiki.docking.org/index.php?action=history&amp;feed=atom&amp;title=How_to_generate_an_HEI_database</id>
	<title>How to generate an HEI database - Revision history</title>
	<link rel="self" type="application/atom+xml" href="http://wiki.docking.org/index.php?action=history&amp;feed=atom&amp;title=How_to_generate_an_HEI_database"/>
	<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=How_to_generate_an_HEI_database&amp;action=history"/>
	<updated>2026-04-08T13:54:49Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>http://wiki.docking.org/index.php?title=How_to_generate_an_HEI_database&amp;diff=3253&amp;oldid=prev</id>
		<title>Therese: 40 revisions</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=How_to_generate_an_HEI_database&amp;diff=3253&amp;oldid=prev"/>
		<updated>2012-10-08T20:25:31Z</updated>

		<summary type="html">&lt;p&gt;40 revisions&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 20:25, 8 October 2012&lt;/td&gt;
				&lt;/tr&gt;
&lt;!-- diff cache key wikidb:diff::1.12:old-3252:rev-3253 --&gt;
&lt;/table&gt;</summary>
		<author><name>Therese</name></author>
	</entry>
	<entry>
		<id>http://wiki.docking.org/index.php?title=How_to_generate_an_HEI_database&amp;diff=3252&amp;oldid=prev</id>
		<title>Tbalius at 20:24, 5 September 2012</title>
		<link rel="alternate" type="text/html" href="http://wiki.docking.org/index.php?title=How_to_generate_an_HEI_database&amp;diff=3252&amp;oldid=prev"/>
		<updated>2012-09-05T20:24:59Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;=Introduction=&lt;br /&gt;
&lt;br /&gt;
This guide explains how to use Johannes Hermann&amp;#039;s&lt;br /&gt;
Python scripts for High Energy Intermediate (HEI) library generation, as well as the scripts&lt;br /&gt;
for running AMSOL, OMEGA and MOL2DB. Although the latter steps are&lt;br /&gt;
done in the canonical way, the scripts mentioned below fit in nicely with the data&lt;br /&gt;
structure generated by the database generation scripts, and it is&lt;br /&gt;
highly recommended to use them.&lt;br /&gt;
&lt;br /&gt;
There is just one caveat: none of these scripts is standardized, so you will&lt;br /&gt;
probably have to edit the names of the files, directories and databases that these&lt;br /&gt;
scripts use.&lt;br /&gt;
&lt;br /&gt;
=Recommended data structure=&lt;br /&gt;
&lt;br /&gt;
*generate 4 subdirectories: &amp;lt;tt&amp;gt;1_SOMENAME2SDF&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;2_OMEGA&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;3_AMSOL&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;4_MOL2DB&amp;lt;/tt&amp;gt;. SMILES files must be of the format &amp;lt;tt&amp;gt;somesmiles somename&amp;lt;/tt&amp;gt;.&lt;br /&gt;
*copy the database preparation scripts (&amp;lt;tt&amp;gt;a1*.py-c3*.py&amp;lt;/tt&amp;gt;) to &amp;lt;tt&amp;gt;1_SOMENAME2SDF&amp;lt;/tt&amp;gt;.&lt;br /&gt;
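The layout above can be set up with a short script. The following is only a sketch: the function names &amp;lt;tt&amp;gt;make_layout&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;check_smiles_file&amp;lt;/tt&amp;gt; are hypothetical, &amp;lt;tt&amp;gt;1_SOMENAME2SDF&amp;lt;/tt&amp;gt; is used literally as a placeholder, and the check assumes that the SMILES string and the name are separated by whitespace.&lt;br /&gt;

```python
from pathlib import Path

# Directory names taken from the list above; SOMENAME is a placeholder.
SUBDIRS = ["1_SOMENAME2SDF", "2_OMEGA", "3_AMSOL", "4_MOL2DB"]


def make_layout(root):
    """Create the four working directories below root and return their paths."""
    paths = []
    for name in SUBDIRS:
        path = Path(root) / name
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths


def check_smiles_file(path):
    """Return the line numbers that are not of the form 'somesmiles somename'.

    Assumption: the two fields are separated by whitespace.
    """
    bad = []
    with open(path) as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            if len(line.split()) != 2:
                bad.append(number)
    return bad
```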
&lt;br /&gt;
=HEI generation=&lt;br /&gt;
Broadly speaking, the generation procedure involves three steps:&lt;br /&gt;
# Conversion of the input .mol2 or .sdf files to [http://www.daylight.com/smiles/ SMILES] and then isomeric SMILES.&lt;br /&gt;
# Conversion (&amp;quot;reaction&amp;quot;) of the appropriate group(s) in each molecule to form the HEI.&lt;br /&gt;
# Generation of multiple protonation states, 3D-structures and partial charges for each HEI, resulting in .db files that can be fed to [[DOCK]]. &lt;br /&gt;
The scripts for each step are prefixed with the letter &amp;#039;a&amp;#039; (step 1), &amp;#039;b&amp;#039; (step 2), and &amp;#039;c&amp;#039; (first part of step 3), respectively. Within each letter, the scripts are enumerated consecutively.&lt;br /&gt;
*every script takes a list of SMILES as input and outputs a list of SMILES (except for the b7 and b8 scripts, which output .sdf files), prefixed with the sequential number of the script.&lt;br /&gt;
*the scripts a3-a4 have to be run in sequence.&lt;br /&gt;
*scripts b1 to b8 all take the output of a4 as input. Each of these scripts describes a different reaction, and each reaction will only happen when the appropriate reacting groups are encountered in a molecule.&lt;br /&gt;
*each b script will generate an LN (neutral leaving group) and an LP (protonated leaving group) file.&lt;br /&gt;
*it is a VERY GOOD idea to keep the LN and LP files separate throughout the entire procedure, especially when running the c scripts. This will make things easier later on.&lt;br /&gt;
*c3_sdf2mol2_mysql_names.py has to be run on four files: the LN and LP files coming out of c2_ionizer_min.py and the sdf files resulting from the b7 and b8 scripts.&lt;br /&gt;
==a Scripts==&lt;br /&gt;
&lt;br /&gt;
 a1_create_sdf_from_fold.py folder(unpacked from KEGG-website)&lt;br /&gt;
 a2_corina.py molfile.smi&lt;br /&gt;
&amp;lt;tt&amp;gt;a2_corina.py&amp;lt;/tt&amp;gt; can be started with either &amp;lt;tt&amp;gt;.smi&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;.ism&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;.sdf&amp;lt;/tt&amp;gt; files.&lt;br /&gt;
&lt;br /&gt;
 a3.1_sdf2ism_filter.py molfile.sdf &lt;br /&gt;
 a3.2_size_cutoff_filter.py molfile.smi&lt;br /&gt;
 a4_rm_doubles.py molfile.smi&lt;br /&gt;
&lt;br /&gt;
==b Scripts==&lt;br /&gt;
&lt;br /&gt;
 b1_rxn_carbonyl.py molfile.smi&lt;br /&gt;
 b1_rxn_lactone.pk.py molfile.smi&lt;br /&gt;
 b2.1_rxn_aromatic_cleav.py molfile.smi&lt;br /&gt;
 b2.2_rxn_aromatic_cleav.py molfile.smi&lt;br /&gt;
 b3_rxn_amidines.py molfile.smi&lt;br /&gt;
 b4.1_rxn_amidine_aromatic.py molfile.smi&lt;br /&gt;
 b4.2_rxn_amidine_aromatic.py molfile.smi&lt;br /&gt;
 b4.3_rxn_amidine_aromatic.py molfile.smi&lt;br /&gt;
 b4.4_rxn_amidine_aromatic.py molfile.smi&lt;br /&gt;
 b5_rxn_imin.py molfile.smi&lt;br /&gt;
 b6.1_rxn_imin_aromatic.py molfile.smi&lt;br /&gt;
 b6.2_rxn_imin_aromatic.py molfile.smi&lt;br /&gt;
 b6.3_rxn_imin_aromatic.py molfile.smi&lt;br /&gt;
 b7.1_parts_split_1.py molfile.smi&lt;br /&gt;
 b7.2_parts_split_2.py file-identifier(e.g. _mol_32007)&lt;br /&gt;
 b7.3_parts_connect_1.py file-identifier(e.g. _mol_32007)&lt;br /&gt;
 b7.4_parts_connect_2.py file-identifier(e.g. _mol_32007)&lt;br /&gt;
 b7.5_ionizer.py file-identifier(e.g. _mol_32007)&lt;br /&gt;
 b8.1_thio_parts_split_1.py molfile.smi&lt;br /&gt;
 b8.2_thio_parts_split_2.py file-identifier(e.g. _mol_32007)&lt;br /&gt;
 b8.3_thio_parts_connect_1.py file-identifier(e.g. _mol_32007)&lt;br /&gt;
 b8.4_thio_parts_connect_2.py file-identifier(e.g. _mol_32007)&lt;br /&gt;
 b8.5_thio_ionizer.py file-identifier(e.g. _mol_32007)&lt;br /&gt;
 b9_remove_doubles.py start-file-pattern end-file-pattern&lt;br /&gt;
&lt;br /&gt;
==c Scripts==&lt;br /&gt;
&lt;br /&gt;
 c1_corina.py start-file-pattern end-file-pattern&lt;br /&gt;
 c2_ionizer_min.py start-file-pattern end-file-pattern&lt;br /&gt;
 c3_sdf2mol2_mysql_names.py sdf-file(from corina+ionizer) suffix(for Folders after ring)&lt;br /&gt;
 c3_sdf2mol2_mysql_names_remove.py filename_containing_mol2_filenames(zipped does not hurt)&lt;br /&gt;
&lt;br /&gt;
=Running [http://www.eyesopen.com/products/applications/omega.html &amp;lt;tt&amp;gt;omega&amp;lt;/tt&amp;gt;]=&lt;br /&gt;
*&amp;#039;&amp;#039;Be careful! This script needs access to a mysql database &amp;amp;ndash; make sure to set the appropriate values that allow you access in the script.&amp;#039;&amp;#039;&lt;br /&gt;
*change to &amp;lt;tt&amp;gt;2_OMEGA&amp;lt;/tt&amp;gt;.&lt;br /&gt;
*required files:&lt;br /&gt;
**torlib_1205.txt&lt;br /&gt;
**omega_03.2_3_2.param&lt;br /&gt;
**omega_07.2_3_2.param&lt;br /&gt;
**om2_chunks_on_tmp.py &amp;#039;&amp;#039;or&amp;#039;&amp;#039; om2_chunks_on_scratch.py&lt;br /&gt;
*commandline: &lt;br /&gt;
 om2_chunks_on_tmp.py MOLS_SUBDIR_1 MOLS_SUBDIR_2 MOL_RAID MAXMOL&lt;br /&gt;
*alternative commandline if you want to run on the cluster:&lt;br /&gt;
 om2_chunks_on_scratch.py MOLS_SUBDIR_1 MOLS_SUBDIR_2 MOL_RAID MAXMOL&lt;br /&gt;
*the individual arguments will be connected to form the path to the mol2-files generated in step 3:&amp;lt;br&amp;gt;&amp;lt;tt&amp;gt;/raid[MOL_RAID]/people/kolb/DB4/[MOLS_SUBDIR_2]/MOLS/[MOLS_SUBDIR_1]&amp;lt;/tt&amp;gt;&lt;br /&gt;
*&amp;lt;tt&amp;gt;MAXMOL&amp;lt;/tt&amp;gt; gives the maximum number of molecules that are processed in one chunk. If you have to kill the job, it is advisable to do so between the processing of two chunks.&lt;br /&gt;
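How the arguments combine into a path, and how &amp;lt;tt&amp;gt;MAXMOL&amp;lt;/tt&amp;gt; splits the work, can be sketched as follows. The helper names &amp;lt;tt&amp;gt;omega_mol2_dir&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;chunks&amp;lt;/tt&amp;gt; are hypothetical and are not taken from &amp;lt;tt&amp;gt;om2_chunks_on_tmp.py&amp;lt;/tt&amp;gt;; only the path template is copied from the description above.&lt;br /&gt;

```python
def omega_mol2_dir(mols_subdir_1, mols_subdir_2, mol_raid):
    """Join the arguments into the directory that holds the mol2 files.

    The argument order mirrors the command line shown above
    (MOLS_SUBDIR_1 MOLS_SUBDIR_2 MOL_RAID).
    """
    return "/raid{}/people/kolb/DB4/{}/MOLS/{}".format(
        mol_raid, mols_subdir_2, mols_subdir_1
    )


def chunks(items, maxmol):
    """Yield consecutive lists of at most maxmol items each."""
    for start in range(0, len(items), maxmol):
        yield items[start:start + maxmol]
```

For example, with the hypothetical values &amp;lt;tt&amp;gt;SUB1&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;SUB2&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;3&amp;lt;/tt&amp;gt;, the first function returns &amp;lt;tt&amp;gt;/raid3/people/kolb/DB4/SUB2/MOLS/SUB1&amp;lt;/tt&amp;gt;.&lt;br /&gt;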
&lt;br /&gt;
=Running &amp;lt;tt&amp;gt;[http://comp.chem.umn.edu/amsol/ amsol]&amp;lt;/tt&amp;gt;=&lt;br /&gt;
*&amp;#039;&amp;#039;Be careful! This script needs access to a mysql database &amp;amp;ndash; make sure to set the appropriate values that allow you access in the script.&amp;#039;&amp;#039;&lt;br /&gt;
*change to &amp;lt;tt&amp;gt;3_AMSOL&amp;lt;/tt&amp;gt;&lt;br /&gt;
*required files:&lt;br /&gt;
**amsol_limit.py&lt;br /&gt;
**amsol_functions.py&lt;br /&gt;
**amsol.py&lt;br /&gt;
**am_chunks_on_tmp.py &amp;#039;&amp;#039;or&amp;#039;&amp;#039; am_chunks_on_scratch.py&lt;br /&gt;
*commandline: &lt;br /&gt;
 am_chunks_on_tmp.py MOLS_SUBDIR_1 MOL_RAID MOLS_SUBDIR_2&lt;br /&gt;
*alternative commandline if you want to run on the cluster:&lt;br /&gt;
 am_chunks_on_scratch.py MOLS_SUBDIR_1 MOL_RAID MOLS_SUBDIR_2&lt;br /&gt;
*the individual arguments will be connected to form the path to the mol2-files generated in step 3:&amp;lt;br&amp;gt;&amp;lt;tt&amp;gt;/raid[MOL_RAID]/people/kolb/DB[MOLS_SUBDIR_2]/2_OMEGA/[MOLS_SUBDIR_1]&amp;lt;/tt&amp;gt;&lt;br /&gt;
*the script will call &amp;lt;tt&amp;gt;amsol_limit.py&amp;lt;/tt&amp;gt;, so make sure that this file is in your directory.&lt;br /&gt;
&lt;br /&gt;
=Running &amp;lt;tt&amp;gt;mol2db&amp;lt;/tt&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
*change to &amp;lt;tt&amp;gt;4_MOL2DB&amp;lt;/tt&amp;gt;.&lt;br /&gt;
*create a subfolder for every subpart of the database, e.g., &amp;lt;tt&amp;gt;RING_MORE_KEGG_HEI/OH_LN&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;RING_MORE_KEGG_HEI/OH_LP&amp;lt;/tt&amp;gt;, and so on.&lt;br /&gt;
*required files in &amp;lt;tt&amp;gt;4_MOL2DB&amp;lt;/tt&amp;gt;:&lt;br /&gt;
**inhier_col&lt;br /&gt;
**mol2db_limit.csh&lt;br /&gt;
**lettercode.txt (a file specifying a single letter for each subdirectory)&lt;br /&gt;
*run the appropriate script directly in the subfolder: &amp;lt;tt&amp;gt;mrm_3_limit.py&amp;lt;/tt&amp;gt; for molecules with multiple rings, &amp;lt;tt&amp;gt;mro_5.py&amp;lt;/tt&amp;gt; for molecules with one ring, and &amp;lt;tt&amp;gt;mrn_1s.py&amp;lt;/tt&amp;gt; for molecules with no rings.&lt;br /&gt;
*in each script, make sure that the maximum number of molecules per &amp;lt;tt&amp;gt;.db&amp;lt;/tt&amp;gt; file is set to not more than 1000.&lt;br /&gt;
*keep in mind that the &amp;lt;tt&amp;gt;.mol2&amp;lt;/tt&amp;gt; file read by &amp;lt;tt&amp;gt;mol2db&amp;lt;/tt&amp;gt; must contain exactly 6 lines between &amp;lt;tt&amp;gt;@&lt;TRIPOS&gt;MOLECULE&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;@&lt;TRIPOS&gt;ATOM&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Example: running &amp;lt;tt&amp;gt;mrm_3_limit.py&amp;lt;/tt&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
*commandline:&lt;br /&gt;
 mrm_3_limit.py MOL_RAID DB_VERSION MOLS_SUBDIR JOB_ID OMEGA_PATH AMSOL_PATH CHECK WRITE_BROKEN&lt;br /&gt;
*The individual arguments and the &amp;lt;tt&amp;gt;pwd&amp;lt;/tt&amp;gt; will be connected to form the path to the mol2-files generated in step 2:&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;/raid[MOL_RAID]/people/kolb/DB[DB_VERSION]/[MOLS_SUBDIR]/MOLS/[obtained from pwd: penultimate dir]/[obtained from pwd: last dir]&amp;lt;/tt&amp;gt;.&lt;br /&gt;
*&amp;lt;tt&amp;gt;CHECK&amp;lt;/tt&amp;gt; gives the frequency of the check whether a molecule has already been processed or not: &amp;#039;0&amp;#039; &amp;amp;rarr; no check; &amp;#039;1&amp;#039; &amp;amp;rarr; check at the beginning of every job; &amp;#039;2&amp;#039; &amp;amp;rarr; check before processing each molecule.&lt;br /&gt;
*in case the script stops after just one molecule, do the following:&lt;br /&gt;
**check that the file &amp;lt;tt&amp;gt;.labels[JOB_ID].txt&amp;lt;/tt&amp;gt; exists.&lt;br /&gt;
**create a file &amp;lt;tt&amp;gt;.dbnums[JOB_ID].txt&amp;lt;/tt&amp;gt; and write something like &amp;quot;101 0&amp;quot; to it. The first number will be the starting number for the enumeration of the &amp;lt;tt&amp;gt;.db&amp;lt;/tt&amp;gt; files, while the second is the current number of molecules already in that &amp;lt;tt&amp;gt;.db&amp;lt;/tt&amp;gt; file.&lt;br /&gt;
**delete everything but the header from the &amp;lt;tt&amp;gt;.db&amp;lt;/tt&amp;gt; file.&lt;br /&gt;
**start &amp;lt;tt&amp;gt;mrm_3_limit.py&amp;lt;/tt&amp;gt; again.&lt;br /&gt;
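Creating the &amp;lt;tt&amp;gt;.dbnums&amp;lt;/tt&amp;gt; file by hand can also be scripted. The sketch below is an illustration only: the helper names are hypothetical, and it is assumed that &amp;lt;tt&amp;gt;[JOB_ID]&amp;lt;/tt&amp;gt; is replaced by the job ID in the file name and that the two numbers are separated by a single space, as in the &amp;quot;101 0&amp;quot; example.&lt;br /&gt;

```python
from pathlib import Path


def dbnums_path(directory, job_id):
    """Build the name of the counter file for a job (assumed naming)."""
    return Path(directory) / ".dbnums{}.txt".format(job_id)


def write_dbnums(directory, job_id, first_db_number, molecules_in_db=0):
    """Write 'first_db_number molecules_in_db' to the counter file."""
    path = dbnums_path(directory, job_id)
    path.write_text("{} {}\n".format(first_db_number, molecules_in_db))
    return path


def read_dbnums(directory, job_id):
    """Return (db file number, molecules already in that db file)."""
    fields = dbnums_path(directory, job_id).read_text().split()
    return int(fields[0]), int(fields[1])
```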
&lt;br /&gt;
=Inserting the newly generated molecules into a mysql database=&lt;br /&gt;
&lt;br /&gt;
This step is essential to preserve knowledge about the correspondence between the original database name of a molecule, its HEI form, protonation states and conformations and the final name given by mol2db (of the form A00000000 [one letter + eight digits]).&lt;br /&gt;
*required files:&lt;br /&gt;
**mysql_insert_db6.py&lt;br /&gt;
*this also requires you to generate a mysql database of the proper format, best done with mysql_create_table_db5.pk.py&lt;br /&gt;
*commandline:&lt;br /&gt;
 mysql_insert_db6.py MOLS_SUBDIR MYSQL_DB MOL2_SUBDIR DB_SUBDIR MYSQL_TABLE DB_VERSION MOL_RAID TAG&lt;br /&gt;
*the individual arguments will be connected to form the path to the .mol2 files as follows:&lt;br /&gt;
&amp;lt;tt&amp;gt;/raid[MOL_RAID]/people/kolb/DB[DB_VERSION]/[MOLS_SUBDIR]/MOLS/[MOL2_SUBDIR]&amp;lt;/tt&amp;gt;&lt;br /&gt;
*the .db files are expected in&lt;br /&gt;
&amp;lt;tt&amp;gt;./DB_SUBDIR&amp;lt;/tt&amp;gt;&lt;br /&gt;
*&amp;lt;tt&amp;gt;TAG&amp;lt;/tt&amp;gt; is optional and is the name with which the molecule names start.&lt;br /&gt;
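Before inserting, it can help to verify that the names produced by mol2db really have the form described at the start of this section (one letter + eight digits). A minimal sketch, assuming the letter is an upper-case ASCII letter; the function name &amp;lt;tt&amp;gt;is_mol2db_name&amp;lt;/tt&amp;gt; is hypothetical and not part of &amp;lt;tt&amp;gt;mysql_insert_db6.py&amp;lt;/tt&amp;gt;:&lt;br /&gt;

```python
import re

# One letter followed by eight digits, e.g. A00000000.
# Assumption: the letter is an upper-case ASCII letter.
MOL2DB_NAME = re.compile(r"[A-Z][0-9]{8}")


def is_mol2db_name(name):
    """Return True if name has the form 'one letter + eight digits'."""
    return MOL2DB_NAME.fullmatch(name) is not None
```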
&lt;br /&gt;
=Acknowledgments=&lt;br /&gt;
Johannes has sacrificed a week of his time to introduce me to the&lt;br /&gt;
scripts. Hao Fan and Magdalena Korczynska have prepared HEI databases on their own and given&lt;br /&gt;
valuable input.&lt;br /&gt;
&lt;br /&gt;
[http://shoichetlab.compbio.ucsf.edu/~kolb Kolb] 21:11, 27 November 2010 (PST)&lt;br /&gt;
[[Category:Tutorials]]&lt;/div&gt;</summary>
		<author><name>Tbalius</name></author>
	</entry>
</feed>