ZINC22 Partitions
Revision as of 00:34, 25 December 2020
The ZINC22 molecule database is sorted into buckets that we call tranches. Each tranche covers a specific range of physicochemical properties, namely heavy atom count and logP. Keeping a separate database for each tranche reduces the number of potential collisions when inserting new data: we don't know in advance whether a new piece of data already has a counterpart in the database, but we do know which database to check it against. There is a problem, though: there are a lot of tranches, and they vary greatly in size. How much they vary depends on the chemical data used as input and on how much of it there is. This makes managing resources difficult.

Considering that a large portion of these tranches are small or unimportant, does it really make sense to have a separately managed database for each of the thousands of tranches? It would be better if every database unit were roughly equal in size to every other, and the total number of units were closer to a hundred than to a thousand or more, so that manually administering each one is feasible.
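The tranche lookup described above can be sketched roughly as follows. This is only an illustration: the logP bin width, the `tranche_key` and `has_counterpart` names, and the use of in-memory sets in place of real databases are all assumptions, not the actual ZINC22 implementation.

```python
import math
from typing import Dict, Set, Tuple

# Hypothetical bin width; the real tranche boundaries are not stated on this page.
LOGP_BIN_WIDTH = 0.5

TrancheKey = Tuple[int, int]


def tranche_key(heavy_atoms: int, logp: float) -> TrancheKey:
    """Map a molecule's properties to a (heavy atom count, logP bin) tranche key."""
    return (heavy_atoms, math.floor(logp / LOGP_BIN_WIDTH))


def has_counterpart(databases: Dict[TrancheKey, Set[str]],
                    heavy_atoms: int, logp: float, smiles: str) -> bool:
    """Check only the one per-tranche database that could hold this molecule."""
    key = tranche_key(heavy_atoms, logp)
    return smiles in databases.get(key, set())


# Stand-in for the per-tranche databases: one set of SMILES strings per tranche.
databases = {(17, 4): {"CCO"}}
print(has_counterpart(databases, 17, 2.3, "CCO"))
```

Because the key is computed from the molecule itself, an insert never has to search more than one tranche for a duplicate.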
...
Thus partitions were created. A partition consists of one or more physicochemically adjacent tranches that together form a square or rectangle on the tranche "grid".
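As a small illustration of the rectangle constraint, the sketch below checks whether a set of tranches fills a rectangle on the grid. Representing each tranche as a `(row, column)` pair of grid indices, and the `is_rectangle` name itself, are assumptions made for this example.

```python
from typing import Iterable, Tuple


def is_rectangle(tranches: Iterable[Tuple[int, int]]) -> bool:
    """Return True if the tranche cells exactly fill their bounding box."""
    cells = set(tranches)
    if not cells:
        return False
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    # Every cell lies inside the bounding box, so the set is a filled
    # rectangle exactly when the cell count equals the box area.
    return len(cells) == height * width


print(is_rectangle({(0, 0), (0, 1), (1, 0), (1, 1)}))  # 2x2 block
print(is_rectangle({(0, 0), (0, 1), (1, 0)}))          # L-shape
```

A single tranche counts as a 1x1 rectangle, which matches the "one or more tranches" wording above.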
...
Each partition is generated to be roughly equal in size to the other partitions of the same importance.
This required us to assign a drug-discovery "importance" to broad regions of our physicochemical space. Each broad region we painted was assigned a relative importance value, and partitions were then generated within it.
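This page does not describe how the partitions are actually generated. One plausible approach, sketched here purely as an assumption, is to recursively bisect a rectangular region of the grid along its longer side, choosing the cut that best balances molecule counts, until each piece holds at most a target number of molecules. The `split_region` name, the `counts` mapping, and the idea of deriving `target` from a region's importance are all hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]
Rect = Tuple[int, int, int, int]  # (row_start, row_end, col_start, col_end), ends exclusive


def rect_total(counts: Dict[Cell, int], rect: Rect) -> int:
    """Sum the molecule counts of all tranches inside the rectangle."""
    r0, r1, c0, c1 = rect
    return sum(counts.get((r, c), 0) for r in range(r0, r1) for c in range(c0, c1))


def split_region(counts: Dict[Cell, int], rect: Rect, target: int) -> List[Rect]:
    """Bisect rect until each piece holds <= target molecules or is one tranche."""
    r0, r1, c0, c1 = rect
    total = rect_total(counts, rect)
    single_tranche = (r1 - r0 == 1) and (c1 - c0 == 1)
    if total <= target or single_tranche:
        return [rect]
    # Enumerate every possible cut along the longer side of the rectangle.
    if r1 - r0 >= c1 - c0:
        cuts = [((r0, k, c0, c1), (k, r1, c0, c1)) for k in range(r0 + 1, r1)]
    else:
        cuts = [((r0, r1, c0, k), (r0, r1, k, c1)) for k in range(c0 + 1, c1)]
    # Keep the cut whose first half is closest to half of the total.
    first, second = min(cuts, key=lambda h: abs(rect_total(counts, h[0]) - total / 2))
    return split_region(counts, first, target) + split_region(counts, second, target)


# Four tranches of 10 molecules each; the target of 20 is a made-up value
# standing in for whatever size a region's importance would imply.
counts = {(0, 0): 10, (0, 1): 10, (1, 0): 10, (1, 1): 10}
print(split_region(counts, (0, 2, 0, 2), target=20))
```

Every piece produced this way is a rectangle by construction, and running the split separately per importance region with its own target keeps partitions of the same importance close in size.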
...
To be clear: database partitions are still managed at the level of tranches (we want insert performance), but a partition creates a logical grouping of tranches that we humans can interface with.