ZINC22 Partitions: Difference between revisions


Revision as of 00:42, 25 December 2020

The ZINC22 molecule database is sorted into buckets that we call tranches. Each tranche covers a specific range of physicochemical properties, namely heavy atom count and logP value. By creating a separate database for each tranche, we reduce the number of potential collisions when inserting new data: although we don't know whether a particular piece of new data has a counterpart in the database, we do know which database to check it against. There is a problem, though: there are a lot of tranches, and they can vary greatly in size. How much the size varies depends on the chemical data used as input, and on how much input data there is. This makes managing resources difficult. Considering that a large portion of these tranches are small, unimportant, or both, does it really make sense to have a separately managed database for each tranche, of which there are thousands? It would be better if each of our database units was roughly equal in size to every other unit, and the total number of units was closer to a hundred than to a thousand or more, which would make manually administering each one feasible.
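The collision argument above can be illustrated with a minimal sketch. The function name `tranche_key`, the logP bin width of 0.5, and the in-memory `databases` dictionary are all assumptions made for illustration; they are not the actual ZINC22 binning scheme or storage layer.

```python
import math

def tranche_key(heavy_atoms: int, logp: float, logp_step: float = 0.5):
    """Return a hypothetical (heavy atom count, logP bin index) key.

    The bin width `logp_step` is an assumed value, not the real ZINC22 one.
    """
    # floor() keeps negative logP values in consistently ordered negative bins
    return heavy_atoms, math.floor(logp / logp_step)

# One "database" per tranche, modelled here as a set of molecule identifiers.
databases = {}

def insert(smiles: str, heavy_atoms: int, logp: float) -> bool:
    """Insert a molecule; only its own tranche is checked for a duplicate."""
    bucket = databases.setdefault(tranche_key(heavy_atoms, logp), set())
    if smiles in bucket:
        return False  # counterpart already present in this tranche
    bucket.add(smiles)
    return True
```

Because the key is computed from the molecule itself, a duplicate check never has to look outside a single bucket, which is the property the paragraph above relies on.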

...

Thus partitions were created. A partition consists of one or more tranches that are physicochemically adjacent, such that each partition forms a square or rectangle on the tranche "grid".
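One way to picture a partition is as an inclusive rectangle of tranche grid coordinates. The sketch below is only an illustration: the `Partition` class, its field names, and the use of integer logP bin indices are hypothetical and not taken from the ZINC22 code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Partition:
    """A rectangle of tranches on the (heavy atom count, logP bin) grid."""
    h_min: int  # lowest heavy atom count covered (inclusive)
    h_max: int  # highest heavy atom count covered (inclusive)
    p_min: int  # lowest logP bin index covered (inclusive)
    p_max: int  # highest logP bin index covered (inclusive)

    def contains(self, h: int, p: int) -> bool:
        """True if tranche (h, p) lies inside this rectangle."""
        return self.h_min <= h <= self.h_max and self.p_min <= p <= self.p_max

    def tranches(self):
        """List every tranche coordinate covered by this partition."""
        return [(h, p)
                for h in range(self.h_min, self.h_max + 1)
                for p in range(self.p_min, self.p_max + 1)]
```

A single-tranche partition is just the degenerate case where the minimum and maximum coincide on both axes.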

...

Each partition is generated to be roughly equal in size to partitions of the same importance. Partitions of higher importance will have fewer molecules per partition.

This required us to assign a drug-discovery "importance" to broad regions of our physicochemical space. Each of the broad regions we painted was assigned a relative importance value and had partitions generated within it.
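As a rough sketch of how importance could translate into partition size, the example below cuts one region into vertical strips along the heavy atom axis, closing a strip once it reaches a target molecule count that shrinks as importance grows. This greedy strip-cutting is an assumed stand-in, not the procedure actually used to generate the ZINC22 partitions, and `split_region`, `base_target`, and `importance` are hypothetical names.

```python
def split_region(column_counts, base_target, importance):
    """Greedily cut a region into strips of roughly equal molecule count.

    column_counts: list of (heavy_atom_count, molecule_count) pairs for the
                   region, sorted by heavy atom count.
    base_target:   assumed molecule count per partition at importance 1.
    importance:    relative importance; higher values give smaller partitions.
    Returns a list of (h_min, h_max, molecule_count) strips. Each strip spans
    the region's full logP range, so it is a rectangle on the tranche grid.
    """
    target = base_target / importance
    partitions = []
    start, end, running = None, None, 0
    for h, count in column_counts:
        if start is None:
            start = h
        end = h
        running += count
        if running >= target:
            # strip is full enough: close it and begin a new one
            partitions.append((start, end, running))
            start, running = None, 0
    if start is not None:
        # leftover columns form a final, possibly smaller, strip
        partitions.append((start, end, running))
    return partitions
```

With the same input counts, doubling `importance` halves the target and so yields more partitions with fewer molecules each, which matches the behaviour described above.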

...

To be clear: database partitions are still managed at the level of tranches (we want insert performance), but the partition creates logical groupings of tranches that we humans can interface with.

Pictured below is a colorful rendering of our current partitions. The X-axis in this image is heavy atom count; the Y-axis is logP value. You can see some of the regions of importance very clearly: the large regions on the right side of tranche space hold few molecules or are unimportant, so each is assigned just one partition. The more active regions towards the center are the more important ones. It's hard to tell, but there are fifteen different regions in this image, corresponding to a 3x5 cut on tranche space.
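A 3x5 cut can be sketched as two lists of boundary values, one per axis, with a lookup that returns which of the fifteen regions a molecule falls in. The cut values below are invented for illustration, and the choice of five bands on the heavy atom axis and three on the logP axis is also an assumption; the source does not say which axis carries which count.

```python
from bisect import bisect_right

# Hypothetical cut points; the real region boundaries are not given here.
H_CUTS = [15, 20, 25, 30]   # 4 boundaries -> 5 heavy atom bands (assumed)
LOGP_CUTS = [1.0, 3.0]      # 2 boundaries -> 3 logP bands (assumed)

def region_index(heavy_atoms: int, logp: float) -> int:
    """Return a region number from 0 to 14 for a point in tranche space."""
    col = bisect_right(H_CUTS, heavy_atoms)  # band along the X-axis
    row = bisect_right(LOGP_CUTS, logp)      # band along the Y-axis
    return row * (len(H_CUTS) + 1) + col
```

Each region found this way would then carry its own importance value and receive its own set of partitions.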

[Image: Partitions_new.PNG, rendering of the current partitions; X-axis heavy atom count, Y-axis logP]