2dload.py

From DISI
Revision as of 05:00, 25 December 2020 by Btingle (talk | contribs) (Created page with "2dload.py is BKSLab's ZINC22 database management program, created by Benjamin Tingle. 2dload.py has three basic functionalities: * 2dload.py add * 2dload.py rollback * 2dloa...")

2dload.py is BKSLab's ZINC22 database management program, created by Benjamin Tingle.

2dload.py has three basic functionalities:

  • 2dload.py add
  • 2dload.py rollback
  • 2dload.py postgres

2dload.py operates at the level of ZINC22 partitions.

Adding data with 2dload

python 2dload.py add ${partition id} ${preprocessed input} ${catalog shortname}

(Refer to partitions.txt to see which range of tranche space each partition id is associated with.)

For each tranche in the partition, the add function extracts the entries in a preprocessed input file that are new and adds them to the corresponding database tables.

The algorithm is as follows:

The input file contains multiple tranche input files.
Each tranche input file has two columns, one for molecule SMILES, and another for supplier codes.
For each tranche:
    The input file is first split into its two component columns.
    For each input column:
        The input column file is concatenated with its corresponding table file into a temporary file, which is then sorted and combed through by a uniqueness algorithm.
        The effect of this is to create two new files, one containing all entries in the column file that are new to the table, and another containing the resulting id of each column input line. If an input line was a duplicate, the id will be the id of the corresponding original entry in the database.
    The resulting id files for each column are pasted onto one another. This new file is the input to our catalog table.
    The catalog input undergoes a process identical to that of the previous columns, except that the resulting ids are not kept this time.
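
The steps above can be illustrated with a minimal in-memory sketch. This is not the code in 2dload.py, which works on files. It only mirrors the described concatenate, sort, and uniqueness pass using Python lists. The function names merge_column and add_tranche, the table names, the whitespace-separated input format, and the rule that new ids continue from the largest existing id are all assumptions made for illustration.

```python
def merge_column(table_rows, input_values):
    """Sketch of the concatenate / sort / uniqueness step for one column.

    table_rows:   list of (value, id) pairs already in the table (hypothetical layout).
    input_values: list of values, one per input line.
    Returns (new_rows, ids): new_rows are the (value, id) pairs new to the table,
    and ids[i] is the id that input line i resolves to.
    """
    # Assumption: new ids continue from the largest id already in the table.
    next_id = max((row_id for _, row_id in table_rows), default=0) + 1

    # Concatenate table and input. Tag 0 sorts table rows ahead of input
    # rows (tag 1) that carry the same value.
    combined = [(value, 0, row_id) for value, row_id in table_rows]
    combined += [(value, 1, line_no) for line_no, value in enumerate(input_values)]
    combined.sort()

    new_rows = []
    ids = [None] * len(input_values)
    current_value, current_id = None, None  # assumes no value is None
    for value, tag, ref in combined:
        if tag == 0:
            # Existing table entry: later duplicates reuse its id.
            current_value, current_id = value, ref
        else:
            if value != current_value:
                # First sighting of this value: it is new to the table.
                current_value, current_id = value, next_id
                next_id += 1
                new_rows.append((value, current_id))
            ids[ref] = current_id
    return new_rows, ids


def add_tranche(lines, substance_table, supplier_table, catalog_table):
    """Process one tranche input: two columns, then the catalog pairs."""
    # Split the input into its two component columns.
    smiles_col = [line.split()[0] for line in lines]
    code_col = [line.split()[1] for line in lines]

    new_substances, substance_ids = merge_column(substance_table, smiles_col)
    new_suppliers, supplier_ids = merge_column(supplier_table, code_col)

    # "Paste" the two id columns side by side to form the catalog input.
    catalog_input = list(zip(substance_ids, supplier_ids))
    # Same process again, but the resulting ids are discarded.
    new_catalog, _ = merge_column(catalog_table, catalog_input)
    return new_substances, new_suppliers, new_catalog


new_sub, new_sup, new_cat = add_tranche(
    ["CCO ABC-1", "CCN ABC-2", "CCO ABC-2", "CCN ABC-2"],
    substance_table=[("CCO", 1)],
    supplier_table=[("ABC-1", 1)],
    catalog_table=[((1, 1), 1)],
)
print(new_sub, new_sup, new_cat)
```

In this toy run, the SMILES CCO and the code ABC-1 already exist, so only CCN and ABC-2 are reported as new, and the repeated line "CCN ABC-2" resolves to a single new catalog entry.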