Why Loading ZINC Is So Slow
/nfs/home/xyz/btingle/zinc_deploy/zincload-catalog.sh

The problem is with the "resolution" and "loading" steps of the catalog script. These steps ensure the uniqueness of molecules in the substance and catalog_substance tables, respectively.

a. sorting & queries

To be sure that each molecule is unique, the database must effectively sort molecules as they come in; this is equivalent to a B-tree query plus insert for each new molecule.
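For illustration only, the per-molecule pattern amounts to something like the following (a hypothetical sketch, not the actual catalog script; the database, table, and column names are assumptions):

 # one round trip, one B-tree lookup, and one insert per molecule:
 # the slow path the resolution step effectively takes
 while IFS=$'\t' read -r smiles zinc_id; do
     psql zinc -c "
         INSERT INTO substance (smiles, zinc_id)
         SELECT '$smiles', '$zinc_id'
         WHERE NOT EXISTS (
             SELECT 1 FROM substance WHERE smiles = '$smiles'
         );"
 done < molecules.txt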

In theory this is fine, but in practice it is not.

When relatively few molecules are being added this works well, but for large inserts into an already-large database it becomes inefficient.

There is a significant amount of overhead in each query, at least compared to the insert operation itself, which is done in bulk.

The solution is to ensure the uniqueness of each molecule *before* it is inserted into the database. This can be done with a sorting program like `sort -u`.
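A minimal sketch, assuming a tab-separated input with the SMILES string as the first field (the file name and layout are assumptions):

 # keep only one line per SMILES key before the database ever sees the data;
 # with -u plus a sort key, GNU sort compares just the key fields for uniqueness
 sort -u -t $'\t' -k1,1 molecules.txt > molecules.dedup.txt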

It is better to load the entire dataset into memory and sort it ourselves beforehand than to let the database software figure it out as we gradually feed it the data.
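Continuing the sketch above: GNU sort can be given a large in-memory buffer and several threads, and the deduplicated file can then go in as a single bulk load. This assumes a PostgreSQL-style \copy; the table and column names are still hypothetical.

 # sort the whole dataset up front, in memory as far as possible,
 # rather than trickling it into the database row by row
 sort -u -t $'\t' -k1,1 -S 80% --parallel=4 molecules.txt > molecules.dedup.txt

 # one bulk load instead of one round trip per molecule
 psql zinc -c "\copy substance (smiles, zinc_id) FROM 'molecules.dedup.txt'"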

Dropping the query operations allows us to make a further optimization (see b).

b. indexes

There are a lot of indexes on the substance and catalog data.

Indexes tend to slow down insert operations significantly (https://use-the-index-luke.com/sql/dml/insert); for large operations this becomes intolerable.

The solution is to "shut off" the indexes for the duration of the insert operation and rebuild them once it has completed.
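Assuming the backing database is PostgreSQL, which has no literal on/off switch for an ordinary index, "shutting off" amounts to dropping the indexes before the load and recreating them afterwards. A sketch with hypothetical index, table, and column names:

 # drop the index so the bulk load pays no per-row maintenance cost
 psql zinc -c "DROP INDEX IF EXISTS substance_smiles_idx;"

 # bulk-load the pre-deduplicated data (see a)
 psql zinc -c "\copy substance (smiles, zinc_id) FROM 'molecules.dedup.txt'"

 # rebuild the index once, over the complete table
 psql zinc -c "CREATE INDEX substance_smiles_idx ON substance (smiles);"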

Again, it is better to build the indexes once the entire table is loaded than to maintain them piece by piece.