3D Pipeline Explanation
From email to Seth:
The output file from the tautprot_cxcalc.sh step currently looks like this:
[smiles] [zinc id] [protomer id]
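A minimal sketch of reading one such line in Python. It assumes the three fields are whitespace-separated and that the protomer id can be kept as a string; the names `ProtomerRecord` and `parse_protomer_line` are made up for illustration and are not part of the pipeline.

```python
from typing import NamedTuple


class ProtomerRecord(NamedTuple):
    """One line of the protomer output: [smiles] [zinc id] [protomer id]."""
    smiles: str
    zinc_id: str
    protomer_id: str


def parse_protomer_line(line: str) -> ProtomerRecord:
    # Assumption: fields are separated by whitespace (spaces or tabs).
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    return ProtomerRecord(*fields)


record = parse_protomer_line("CCO ZINC123 0")
print(record.zinc_id, record.protomer_id)
```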
The script generates multiple files, but the "protomers-expanded.ism" file is the one used in subsequent steps.
As for how the molecules are processed in the pipeline, we apply each pipeline step to all molecules in the batch at once, like so:
- X SMILES go in
- X SMILES get processed into Y protomer SMILES (Y ~= 1.35 * X)
- Z mol2 embedding files are created by corina. Z depends on the CORINA_MAX_CONFS argument and on whether corina succeeds in creating the embedding(s) (Z ~= CORINA_MAX_CONFS * Y)
- Z solvation files are generated from the mol2 files
- Z mol2 files are decomposed into ring systems and processed by omega into W multi-conformer representations, which are kept in memory. ( W = sum([num_ring_systems(mol) for mol in Z]) )
- W omega representations are combined with solvation files and processed into Z db2s, which are kept in memory (every ring system from the same molecule gets sorted into the same db2 file)
- Z db2s get put into the output tar.gz file
We usually set X=50 when submitting in batches, since memory usage may get too high otherwise.
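To get a feel for the numbers, here is a rough sketch of the count arithmetic from the list above. The 1.35 ratio comes from the approximation given for Y. The default of one embedding per protomer and the `ring_systems_per_mol` average are assumptions added for illustration, and `estimate_batch_counts` is not a function in the pipeline. Z is an upper estimate, since corina may fail on some protomers.

```python
def estimate_batch_counts(x, max_confs=1, protomer_ratio=1.35,
                          ring_systems_per_mol=1.0):
    """Rough per-batch counts for X input SMILES.

    max_confs stands in for CORINA_MAX_CONFS (assumed to be 1 when unset).
    ring_systems_per_mol is a hypothetical average used to approximate
    W = sum(num_ring_systems(mol) for each of the Z mol2 files).
    """
    y = protomer_ratio * x          # protomer SMILES
    z = max_confs * y               # mol2 / solvation / db2 files (upper estimate)
    w = ring_systems_per_mol * z    # omega multi-conformer representations
    return {"X": x, "Y": y, "Z": z, "W": w}


# The usual batch size of 50 input SMILES.
for name, value in estimate_batch_counts(50).items():
    print(f"{name} ~= {value:.1f}")
```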
This makes me realize that I misled you about the naming of the db2s. There is actually only one additional number by default, the protomer id, since each molecule ring system gets sorted into the same db2 (e.g. ZINC123.0). With a CORINA_MAX_CONFS argument there will be two additional numbers (e.g. ZINC123.0.0).
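The naming scheme can be sketched like this. The helper `db2_name` is hypothetical, and treating the second number as the index of the corina embedding is an assumption; the text above only says that a second number appears when CORINA_MAX_CONFS is given.

```python
from typing import Optional


def db2_name(zinc_id: str, protomer_id: int,
             embedding_index: Optional[int] = None) -> str:
    """Build a db2 name such as ZINC123.0 or ZINC123.0.0."""
    name = f"{zinc_id}.{protomer_id}"
    # Assumption: the second number identifies the corina embedding and is
    # only present when CORINA_MAX_CONFS is set.
    if embedding_index is not None:
        name = f"{name}.{embedding_index}"
    return name


print(db2_name("ZINC123", 0))      # default naming
print(db2_name("ZINC123", 0, 0))   # naming with CORINA_MAX_CONFS
```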