2dload.py

2dload.py is BKSLab's ZINC22 database management program, created by Benjamin Tingle.

2dload.py has three basic functionalities:

  • 2dload.py add
  • 2dload.py rollback
  • 2dload.py postgres

2dload.py operates on the level of ZINC22 partitions.

Adding data with 2dload

python 2dload.py add ${partition id} ${preprocessed input} ${catalog shortname}

(refer to partitions.txt to see the range of tranche space each partition id is associated with)
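For example, a hypothetical invocation (the partition id, input path, and catalog shortname here are placeholders, not values from the real system):

python 2dload.py add 17 /tmp/preprocessed_input.txt examplecat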

The add function extracts new entries from a preprocessed input file and appends them to their corresponding tables. This happens for every tranche in the partition.

The algorithm is as follows (a sketch of the per-column merge step appears after the outline):

The input file contains multiple tranche input files.
Each tranche input file has two columns, one for molecule SMILES, and another for supplier codes.
For each tranche:
    The input file is first split into its two component columns (substance, supplier).
    For each input column:
        The input column file is concatenated with its corresponding table file into a temporary file, which is then sorted and combed through by a uniqueness algorithm.
        The effect of this is to create two new files, one containing all entries in the column file that are new to the table, and another containing the resulting primary key of each column input line. If an input line was a duplicate, the primary key will be that of the corresponding original entry in the database.
    The resulting id files for the two columns are pasted together side by side. This new file is the input to our catalog table.
    The catalog input undergoes a process identical to that of the previous columns, except we don't bother to take the resulting ids this time.
    All new table entries are appended to their corresponding table file.
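To make the per-column merge concrete, here is a minimal Python sketch of the effect of that step. It is illustrative only, not 2dload.py's actual code: the file layouts (table files as value<TAB>id lines, input columns as one value per line) and the in-memory dictionary are assumptions; the real tool achieves the same result by sorting a concatenated file on disk, which scales beyond memory.

def merge_column(table_path, input_path, new_entries_path, ids_path):
    # Load the existing table file. Assumed layout: "value<TAB>id" per line.
    value_to_id = {}
    max_id = 0
    with open(table_path) as table:
        for line in table:
            value, id_str = line.rstrip("\n").split("\t")
            entry_id = int(id_str)
            value_to_id[value] = entry_id
            max_id = max(max_id, entry_id)

    with open(input_path) as inp, \
         open(new_entries_path, "w") as new_entries, \
         open(ids_path, "w") as ids:
        for line in inp:
            value = line.rstrip("\n")
            if value not in value_to_id:
                # New to the table: assign the next primary key and record it.
                max_id += 1
                value_to_id[value] = max_id
                new_entries.write(f"{value}\t{max_id}\n")
            # New or duplicate, emit the resulting id for this input line.
            ids.write(f"{value_to_id[value]}\n")

Run once for the substance column and once for the supplier column, the two resulting id files can then be pasted side by side to form the catalog table input described above.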

After the merge step, the new entries for each table are recorded in an archive.

Why not just upload the data directly to postgres? Great question- inserting into a postgres table is slow (especially when the table has multiple indexes), and gets slower the larger the table gets. In our case, we also need to query the data after we insert it in order to create the catalog table input. This leads to a bit of a paradox- to query faster, we need to add indexes to the table, but to insert faster, we need to remove them. The solution is to sidestep postgres entirely and use our sorting method to ensure that data going in is already checked for uniqueness. This allows us to disable all indexes and triggers, copy the data to postgres in bulk, then re-enable the indexes and triggers, giving the maximum insert performance possible.
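For illustration, here is a minimal sketch of that bulk-load pattern using psycopg2. The connection string, table, column, and index names are placeholders, and since postgres has no way to disable an individual index, the sketch drops it and rebuilds it after the COPY; treat this as one possible realization, not 2dload.py's actual postgres code.

import psycopg2

# Sketch of the bulk-load pattern: drop the index, disable triggers, COPY the
# already-deduplicated file in, then restore everything. All names below are
# placeholders, not 2dload.py's real schema.
def bulk_load(dsn, table, columns, index_name, index_ddl, data_path):
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:  # one transaction for the whole load
            # Postgres has no "disable index", so drop it and rebuild it later.
            cur.execute(f"DROP INDEX IF EXISTS {index_name}")
            # Triggers, on the other hand, can simply be switched off.
            cur.execute(f"ALTER TABLE {table} DISABLE TRIGGER ALL")

            # COPY appends rows far faster than INSERTs; the input file was
            # already checked for uniqueness, so no conflict handling is needed.
            with open(data_path) as data:
                cur.copy_expert(
                    f"COPY {table} ({', '.join(columns)}) FROM STDIN", data)

            # Put the table back the way we found it.
            cur.execute(f"ALTER TABLE {table} ENABLE TRIGGER ALL")
            cur.execute(index_ddl)
    finally:
        conn.close()

Rebuilding the index once, after all the rows are in, is much cheaper than maintaining it on every inserted row, which is the point of deferring it.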