Continuous curation

Revision as of 16:36, 15 November 2016

We continually curate ZINC. This page describes the actions taken (briefly) and the current status, with date. It is used to keep track of the current state of curation, to communicate among the curators, and to inform users of what is done and remains to be done.

2D catalog loading

  • Purpose:
    • To load new catalogs and catalog updates.
    • To deplete compounds no longer available.
    • To count unique, post text files, update filtered, update original.

As of Nov 14: sub_id max is 525,926,658

  • loading: molport and enamine-v
  • queued for loading: molport-v
  • awaiting post-loading curation: DONE5

3D protomer loading

  • Purpose: to generate 3D models and load them into ZINC.

As of Nov 14: prot_id max is 255,762,865

  • preparation: on hold
  • building: on hold
  • loading: on hold

Expect to resume Nov 20. Currently building Ellman.

2D exporting

  • Purpose: We export 2D by property for the tranche browser over a 3-week period beginning on the first of the month.

As of Nov 14, oldest is Oct 25, thus ca. 20 days, which is within our goal of < 30 days.

  • CE Oct 25 oldest
  • 2 running

Currently, the oldest tranche is less than 30 days old. We intend to maintain this level of currency.

3D exporting

  • Purpose:
    • To export 3D for the tranche browser.
    • The export runs continuously and takes slightly longer than a month to complete.

As of Nov 14, oldest is Oct 13, which is just over 30 days. You can see the date we last updated each tranche at files.docking.org/3D. Currently, the oldest tranche is less than 40 days old. We intend to keep 3D tranches within 60 days, which we feel is possible even as ZINC grows.
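The tranche ages quoted in the 2D and 3D exporting sections can be checked with a few lines of date arithmetic. This is only a sketch: the function name `tranche_age_days` is made up for illustration, and the year 2016 is assumed from the revision date of this page.

```python
from datetime import date


def tranche_age_days(oldest_update: date, as_of: date) -> int:
    """Return the age in days of the oldest tranche on the status date."""
    return (as_of - oldest_update).days


# Status date used throughout this page (year assumed from the revision date).
AS_OF = date(2016, 11, 14)

# Oldest 2D tranche: Oct 25. Goal stated on this page: < 30 days.
age_2d = tranche_age_days(date(2016, 10, 25), AS_OF)

# Oldest 3D tranche: Oct 13. Currently < 40 days; intended limit: 60 days.
age_3d = tranche_age_days(date(2016, 10, 13), AS_OF)

print("2D oldest tranche age:", age_2d, "days; within goal:", age_2d < 30)
print("3D oldest tranche age:", age_3d, "days; within limit:", age_3d <= 60)
```

The same check could be run against the dates listed at files.docking.org/2D and files.docking.org/3D whenever the status lines are updated.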

Ring curation status

  • Purpose:
    • Compute rings for newly added compounds.
    • Compute rings when missing, e.g. recently returned to current status.
    • Count rings when rings stabilize.
    • Delete unused rings when counts refreshed.

As of Nov 14:

Pattern curation status

  • Purpose:
    • Compute patterns for newly added compounds.
    • Compute patterns when missing, e.g. recently returned to current status.

Biological table counts

  • Purpose:
    • Maintain counts of compounds on biological resources.

SEA prediction curation

  • Purpose:
    • Identify compounds with no SEA prediction.
    • Run SEA prediction on compounds with no prediction and update the database.

Basic warehousing (recalculate purchasability, reactivity class)

We recalculate each catalog as it is loaded. We also recalculate the entire database continuously.


Vacuuming

We continuously and aggressively vacuum tables. The rotation order is:

  • substance**, substance_to_ecfp4_new, ecfp4_new, protomer, pattern, rings, subpat, hasring -> then back to the beginning.
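The rotation above can be pictured as an endless cycle over the table list. The sketch below only generates statement strings and does not connect to any database. It assumes a PostgreSQL-style `VACUUM` command, which this page does not confirm, and it drops the asterisks after `substance`, whose meaning is not explained here. The names `VACUUM_ROTATION` and `vacuum_statements` are illustrative only.

```python
from itertools import cycle, islice

# Rotation order as listed on this page (asterisks after "substance" omitted).
VACUUM_ROTATION = [
    "substance",
    "substance_to_ecfp4_new",
    "ecfp4_new",
    "protomer",
    "pattern",
    "rings",
    "subpat",
    "hasring",
]


def vacuum_statements(tables=VACUUM_ROTATION):
    """Yield one VACUUM statement per table, wrapping around forever."""
    for table in cycle(tables):
        # Assumed PostgreSQL-style syntax; adjust for the actual database.
        yield f"VACUUM {table};"


# Take the first ten statements: one full rotation plus the start of the next.
first_ten = list(islice(vacuum_statements(), 10))
for statement in first_ten:
    print(statement)
```

After `hasring` the generator returns to `substance`, matching "then back to the beginning".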