Why Loading ZINC Is So Slow
Latest revision as of 00:08, 31 May 2020
/nfs/home/xyz/btingle/zinc_deploy/zincload-catalog.sh
the problem is with the "resolution" and "loading" steps of the catalog script. these steps ensure the uniqueness of molecules in the substance and catalog_substance tables, respectively
a. sorting & queries
in order to be sure that each molecule is unique, the database must effectively sort molecules as they come in; this is equivalent to a b-tree select+insert for each new molecule
THEORETICALLY this is fine, but practically it is not
when relatively few molecules are being added to the database this is fine, but for large inserts on a large database this becomes inefficient
there is a significant amount of overhead in each select, at least compared to an insert done in bulk
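to make the slow path concrete, here is a minimal sketch of the per-molecule select+insert pattern described above. the table name, column, and SMILES strings are illustrative, and sqlite3 stands in for ZINC's actual database, which this page does not run against:

```python
import sqlite3

# hypothetical mini-schema standing in for the substance table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE substance (smiles TEXT)")
con.execute("CREATE UNIQUE INDEX substance_smiles ON substance (smiles)")

incoming = ["CCO", "c1ccccc1", "CCO", "CC(=O)O"]  # duplicates included

# the slow path: one index lookup (select) plus one insert per molecule
for smi in incoming:
    exists = con.execute(
        "SELECT 1 FROM substance WHERE smiles = ?", (smi,)
    ).fetchone()
    if exists is None:
        con.execute("INSERT INTO substance (smiles) VALUES (?)", (smi,))

count = con.execute("SELECT COUNT(*) FROM substance").fetchone()[0]
```

every row pays for a b-tree descent in the select before the insert even starts, and that cost grows with the size of the table.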
the solution is to ensure the uniqueness of the molecules *before* they are inserted into the database. this can be done by a sorting program like `sort -u`
it is better to load the entire dataset in memory and sort it manually beforehand than to let the database software figure it out as we gradually feed it the data
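as a sketch, the in-memory equivalent of `sort -u` followed by a single bulk insert looks like this (again with an illustrative schema and sqlite3 standing in for the real database):

```python
import sqlite3

incoming = ["CCO", "c1ccccc1", "CCO", "CC(=O)O", "c1ccccc1"]
# deduplicate and sort before the database ever sees the data;
# `sort -u` does the same thing on disk for files too large for memory
unique = sorted(set(incoming))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE substance (smiles TEXT)")
# one bulk insert, no per-row select needed
con.executemany(
    "INSERT INTO substance (smiles) VALUES (?)",
    [(s,) for s in unique],
)

count = con.execute("SELECT COUNT(*) FROM substance").fetchone()[0]
```

because the input is already unique, the database never has to check for collisions, and the insert can run at full bulk speed.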
dropping the select operations allows us to make a further optimization... (see b)
b. indexes
there are a lot of indexes on the substance and catalog data
indexes tend to slow down insert operations significantly (https://use-the-index-luke.com/sql/dml/insert), for large operations this becomes intolerable
the solution is to "shut off" the indexes for the duration of the insert operation and rebuild them once it has completed
again, it is better to build the indexes when the entire table is loaded rather than piece-by-piece
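a minimal sketch of the drop-load-rebuild pattern, using an illustrative index name and sqlite3 in place of the real database (the exact statements for disabling indexes vary by RDBMS):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE substance (smiles TEXT)")
con.execute("CREATE INDEX substance_smiles ON substance (smiles)")

rows = [("mol%06d" % i,) for i in range(10000)]

# "shut off" the index for the duration of the bulk insert...
con.execute("DROP INDEX substance_smiles")
con.executemany("INSERT INTO substance (smiles) VALUES (?)", rows)
# ...then rebuild it once, over the complete table
con.execute("CREATE INDEX substance_smiles ON substance (smiles)")

count = con.execute("SELECT COUNT(*) FROM substance").fetchone()[0]
```

building the index once over the finished table is one linear pass plus a sort, instead of 10,000 separate b-tree updates interleaved with the inserts.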
c. improvement
after disabling the loading and resolution steps, turning off indices, and manually sorting, the time it took to load 2.34m molecules into ZINC went from
12+ hours (maybe more, i never let it finish)
down to
1 hour
and the uniqueness of the molecules is still preserved
other
at its core this is a balance issue: having to both insert and select in a query means making sacrifices. let me explain: in order to make select fast (sub-O(n) time), you need to create indexes for the table. in order to make insert fast (O(1), constant time per row) you need to disable indexes for the table. if you need to perform both operations in the same query, you must sacrifice one's performance for the other. by removing the select operation from this query, we can get the full speed-up for the insert.