Mol2db2 Format 2

From DISI
Latest revision as of 15:44, 23 October 2014

This page is a wishlist of features that a new version of the flexibase file format should support. mol2db2 format features that are already implemented are marked [x].

New Features

implemented

  • Real Atom Types and Bond Information [x]
  • Way to determine which mix-and-match conformations have clashes (and avoid trying them) [x]
  • A place to store an internal energy for each possible conformation [x]
  • Terminal hydrogen rotations?? [x]
  • support for clusters of conformations [x]
  • arbitrary information to be written into output mol2 file (5th and above M lines) [x]

wished

  • Per-conformation per-atom partial charge & solvation information to support internal energies
  • Aliphatic ring movements?
  • group tagging (needed for covalent docking) and basic set of covalent groups
  • specified rigid component override (and better rules for finding non-ring rigid components)
  • per molecule pKa
  • valence for each atom

Nomenclature Definitions

  • Conf - one set of atoms that moves together with a single position per atom.
  • Set - a group of conformations that completely defines one position for each atom in a ligand.
  • Cluster - Not yet implemented in DOCK3.7
  • Cloud - Not yet implemented in DOCK3.7
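
The Conf and Set definitions above can be sketched as small Python data structures. This is only an illustration: the class and function names (`Conf`, `ConfSet`, `set_is_complete`) are hypothetical and not part of mol2db2 or DOCK, and the check assumes atoms are numbered 1..N.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Conf:
    """Atoms that move together, with a single position per atom."""
    number: int
    # atom number -> (x, y, z)
    coords: Dict[int, Tuple[float, float, float]] = field(default_factory=dict)


@dataclass
class ConfSet:
    """A group of confs that together give one position for every atom."""
    number: int
    conf_numbers: List[int] = field(default_factory=list)


def set_is_complete(conf_set, confs, atom_count):
    """True if the set places atoms 1..atom_count exactly once each.

    Assumption for this sketch: atoms are numbered from 1 to atom_count.
    """
    seen = []
    for conf_number in conf_set.conf_numbers:
        seen.extend(confs[conf_number].coords.keys())
    return sorted(seen) == list(range(1, atom_count + 1))
```

A set that leaves an atom out, or that places the same atom twice, fails the check.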

File Format

current plan for the file format

  • T type information (implicitly assumed)
  • M molecule (4 lines req'd, after that they are optional, 24 lines max)
  • A atoms
  • B bond
  • X xyz
  • R rigid xyz for matching (can actually be any xyzs)
  • C conformation
  • S sets
  • D clusters
  • E end of molecule
T ## namexxxx (implicitly assumed to be the standard 7)
M zincname protname #atoms #bonds #xyz #confs #sets #rigid #Mlines #clusters
M charge polar_solv apolar_solv total_solv surface_area
M smiles
M longname
[M arbitrary information preserved for writing out]
A stuff about each atom, 1 per line 
B stuff about each bond, 1 per line
X coordnum atomnum confnum x y z 
R rigidnum color x y z
C confnum coordstart coordend
S setnum #lines #confs_total broken hydrogens omega_energy
S setnum linenum #confs confs [until full column]
D clusternum setstart setend matchstart matchend #additionalmatching
D matchnum color x y z
E 
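
Since every molecule ends with an E record, a reader can split a file into molecules before looking at individual records. The following is a minimal sketch of that idea, not the reader used by DOCK; `split_molecules` and `count_records` are hypothetical helper names.

```python
from collections import Counter


def split_molecules(lines):
    """Yield one list of lines per molecule, each ending at its E record."""
    current = []
    for line in lines:
        current.append(line)
        if line.rstrip() == "E":
            yield current
            current = []


def count_records(molecule_lines):
    """Count records by their one-letter type (T, M, A, B, X, R, C, S, D, E)."""
    return Counter(line[0] for line in molecule_lines if line.strip())
```

Comparing the per-type counts with the totals declared on the first M line is one way to sanity-check a file.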

Given the record descriptions above, the column layout of each record is shown below, followed by format statements for Python and Fortran. If speed or size becomes an issue, this might be replaced with a binary file format.

Notes:

  • 17 child groups per group line in the current scheme.
  • 9 child confs per group line.
  • 9 child confs per conf line.
  • 8 confs per set line.
  • Groups/confs with no children are written out.
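
As an illustration of the 8 confs per set line limit, the sketch below splits the conf numbers of one set into S continuation lines. The function name `set_lines` is hypothetical, and numbering the continuation lines from 1 is an assumption that the page does not state.

```python
def set_lines(set_num, conf_nums, per_line=8):
    """Write the conf numbers of one set as S lines, at most 8 confs each.

    Assumption for this sketch: continuation lines are numbered from 1.
    """
    lines = []
    for start in range(0, len(conf_nums), per_line):
        chunk = conf_nums[start:start + per_line]
        line_num = start // per_line + 1
        body = " ".join("%6d" % conf for conf in chunk)
        lines.append("S %6d %6d %1d %s\n" % (set_num, line_num, len(chunk), body))
    return lines
```

A set with 10 confs therefore needs two continuation lines, holding 8 and 2 confs.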

On the atom line, dt is the dock type and co is the color.

          1         2         3         4         5         6         7
01234567890123456789012345678901234567890123456789012345678901234567890123456789
T ## typename
M ZINCCODEXXXXXXXX PROTCODEX ATO BON XYZXXX CONFSX SETSXX RIGIDX MLINES NUMCLU
M +CHA.RGEX +POLAR.SOL +APOLA.SOL +TOTAL.SOL SURFA.REA
M SMILESXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
M LONGNAMEXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
[M ARBITRARY_INFORMATION_PRESERVEDXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX]
A NUM NAME TYPEX DT CO +CHA.RGEX +POLAR.SOL +APOLA.SOL +TOTAL.SOL SURFA.REA
B NUM ATO ATO TY
X COORDNUMX ATO CONFNU +XCO.ORDX +YCO.ORDX +ZCO.ORDX
R NUM CO +XCO.ORDX +YCO.ORDX +ZCO.ORDX
C CONFNO COORDSTAR COORDENDX
S SETIDX #LINES #CO C H +ENERGY.XXX
S SETIDX LINENO # CCONFS CCONFS CCONFS CCONFS CCONFS CCONFS CCONFS CCONFS
D CLUSID STASET ENDSET MST MEN ADD
D NUM CO +XCO.ORDX +YCO.ORDX +ZCO.ORDX
E
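
Because the columns are fixed, a record can be read by slicing instead of splitting on whitespace. Here is a minimal sketch for the X record, with slice positions read off the column ruler above; `parse_x_line` is a hypothetical name, not an existing DOCK function.

```python
def parse_x_line(line):
    """Parse one X record using the fixed column positions.

    Columns (0-based): 2-10 coordnum, 12-14 atomnum, 16-21 confnum,
    23-31 x, 33-41 y, 43-51 z.
    """
    if not line.startswith("X "):
        raise ValueError("not an X line: %r" % line)
    return {
        "coordnum": int(line[2:11]),
        "atomnum": int(line[12:15]),
        "confnum": int(line[16:22]),
        "x": float(line[23:32]),
        "y": float(line[33:42]),
        "z": float(line[43:52]),
    }
```

Slicing keeps working even if two adjacent fields fill their full width, which is the usual reason for preferring it over `str.split` with fixed-width formats.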

The following type lines are assumed by DOCK unless overridden:

T  1 positive
T  2 negative
T  3 acceptor
T  4 donor
T  5 ester_o
T  6 amide_o
T  7 neutral
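
One way a reader might handle these defaults is to start from the table above and let any T lines in the file replace entries by number. This is a sketch under that assumption; the page does not say exactly how DOCK applies overrides, and `DEFAULT_TYPES` and `read_types` are hypothetical names.

```python
# The seven type lines assumed when a file gives none.
DEFAULT_TYPES = {
    1: "positive",
    2: "negative",
    3: "acceptor",
    4: "donor",
    5: "ester_o",
    6: "amide_o",
    7: "neutral",
}


def read_types(lines):
    """Return the type table, letting T lines override the defaults.

    Assumption for this sketch: a T line replaces the entry with its number.
    """
    types = dict(DEFAULT_TYPES)
    for line in lines:
        if line.startswith("T "):
            fields = line.split()
            types[int(fields[1])] = fields[2]
    return types
```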

The following are the Python format statements for each line:

T %2d %8s\n
M %16s %9s %3d %3d %6d %6d %6d %6d %6d %6d\n
M %+9.4f %+10.3f %+10.3f %+10.3f %9.3f\n
M %77s\n
M %77s\n
M %77s\n
A %3d %-4s %-5s %2d %2d %+9.4f %+10.3f %+10.3f %+10.3f %9.3f\n
B %3d %3d %3d %-2s\n
X %9d %3d %6d %+9.4f %+9.4f %+9.4f\n
R %3d %2d %+9.4f %+9.4f %+9.4f\n
C %6d %9d %9d\n
S %6d %6d %3d %1d %1d %+11.3f\n
S %6d %6d %1d %6d %6d %6d %6d %6d %6d %6d %6d\n 
D %6d %6d %6d %3d %3d %3d\n
D %3d %2d %+9.4f %+9.4f %+9.4f\n
E\n
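
These format statements can be applied directly with Python's `%` operator. The sketch below does so for a few record types; `FORMATS` and `format_record` are hypothetical names, and the atom values in the usage note are made up for illustration.

```python
# A subset of the format statements above, keyed by record letter.
FORMATS = {
    "T": "T %2d %8s\n",
    "A": "A %3d %-4s %-5s %2d %2d %+9.4f %+10.3f %+10.3f %+10.3f %9.3f\n",
    "B": "B %3d %3d %3d %-2s\n",
    "X": "X %9d %3d %6d %+9.4f %+9.4f %+9.4f\n",
    "R": "R %3d %2d %+9.4f %+9.4f %+9.4f\n",
    "C": "C %6d %9d %9d\n",
}


def format_record(kind, *values):
    """Format one record; values must match the fields of that record type."""
    return FORMATS[kind] % values
```

For example, `format_record("A", 1, "C1", "C.3", 5, 7, -0.1234, 1.5, -0.25, 1.25, 12.345)` gives a line that is 75 characters wide before the newline, matching the A row of the column layout.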

The following are the Fortran 77 format statements:

!T ## namexxxx (implicitly assumed to be the standard 7)
1000 format(2x,i2,1x,a8)
!M zincname protname #atoms #bonds #xyz #groups #confs #sets #rigid #mlines #clusters
2000 format(2x,a16,1x,a9,1x,i3,1x,i3,1x,i6,1x,i6,1x,i6,x,i6,x,i6,x,i6,x,i6)
!M charge polar_solv apolar_solv total_solv surface_area
2100 format(2x,f9.4,1x,f10.3,1x,f10.3,1x,f10.3,1x,f9.3)
!M smiles or longname
2200 format(2x,a77)
!A stuff about each atom, 1 per line
3000 format(2x,i3,1x,a4,1x,a5,1x,i2,1x,i2,1x,f9.4,1x,f10.3,1x,
    &       f10.3,1x,f10.3,1x,f9.3)
!B stuff about each bond, 1 per line
4000 format(2x,i3,1x,i3,1x,i3,1x,a2)
!X coordnum atomnum confnum x y z
5000 format(2x,i9,1x,i3,1x,i6,x,f9.4,1x,f9.4,1x,f9.4)
!R rigidnum color x y z
6000 format(2x,i3,x,i2,x,f9.4,1x,f9.4,1x,f9.4)
!C confnum #startcoord #endcoord
7000 format(2x,i6,1x,i9,1x,i9)
!S setnum #lines #confs_total broken hydrogens omega_energy
8000 format(2x,i6,1x,i6,1x,i3,1x,i1,1x,i1,1x,f11.3)
!S setnum linenum #confs confs [until full column]
8100 format(2x,i6,1x,i6,1x,i1,1x,i6,1x,i6,1x,i6,1x,i6,
    &       1x,i6,1x,i6,1x,i6,1x,i6)
!D CLUSID STARTSETX ENDSETXXX ADD MST MEN
9000 format(2x,i6,x,i6,x,i6,x,i3,x,i3,x,i3)
!D NUM CO +XCO.ORDX +YCO.ORDX +ZCO.ORDX
!re-use 6000
!E
!E does not get a format line

The following are Fortran95 format statements:

!T ## namexxxx (implicitly assumed to be the standard 7)
      character (len=*), parameter :: DB2NAME = '(2x,i2,x,a8)' !1000
!M zincname protname #atoms #bonds #xyz #confs #sets #rigid #maxmlines #clusters
      character (len=*), parameter :: DB2M1 =
     &    '(2x,a16,x,a9,x,i3,x,i3,x,i6,x,i6,x,i6,x,i6,x,i6,x,i6)' !2000
!M charge polar_solv apolar_solv total_solv surface_area
      character (len=*), parameter :: DB2M2 =
     &    '(2x,f9.4,x,f10.3,x,f10.3,x,f10.3,x,f9.3)' !2100
!M smiles/longname/arbitrary
      character (len=*), parameter :: DB2M3 = '(2x,a78)' !2200
!A stuff about each atom, 1 per line
      character (len=*), parameter :: DB2ATOM =
     &    '(2x,i3,x,a4,x,a5,x,i2,x,i2,x,f9.4,x,f10.3,x,
     &    f10.3,x,f10.3,x,f9.3)' !3000
!B stuff about each bond, 1 per line
      character (len=*), parameter :: DB2BOND =
     &    '(2x,i3,x,i3,x,i3,x,a2)' !4000
!X coordnumx atomnum confnum x y z
      character (len=*), parameter :: DB2COORD =
     &    '(2x,i9,x,i3,x,i6,x,f9.4,x,f9.4,x,f9.4)' !5000
!R rigidnum color x y z
      character (len=*), parameter :: DB2RIGID =
     &    '(2x,i6,x,i2,x,f9.4,x,f9.4,x,f9.4)' !6000
!C confnum coordstart coordend
      character (len=*), parameter :: DB2CONF = '(2x,i6,x,i9,x,i9)' !7000
!S setnum #lines #confs_total broken hydrogens omega_energy 
      character (len=*), parameter :: DB2SET1 =
     &    '(2x,i6,x,i6,x,i3,x,i1,x,i1,x,f11.3)' !8000
!S setnum linenum #confs confs [until full column]
      character (len=*), parameter :: DB2SET2 =
     &    '(2x,i6,x,i6,x,i1,x,i6,x,i6,x,i6,x,i6,
     &    1x,i6,x,i6,x,i6,x,i6)' !8100
!D CLUSID STASET ENDSET ADD(itional matching spheres count) MST(art) MEN(d)
      character (len=*), parameter :: DB2CLUSTER =
     &    '(2x,i6,x,i6,x,i6,x,i3,x,i3,x,i3)' !9000
!D NUM CO x y z
!reuse DB2RIGID
!E
!E does not get a format line