Loading And Creating ZINC Partitions Automatically
In this wiki page I will walk through the process of
1) generating a partition configuration for a set of fine-tranched molecules
2) deploying an individual partition database for a set of fine-tranched molecules
Generating a Partition Configuration
Buckets of the fine tranches are collapsed together to create new partition sets, such that the number of molecules in each partition is as close as possible to a target size. This keeps structurally similar molecules grouped together while making efficient use of resources. Additionally, certain "zones" of the fine tranches, such as the fertile region from H17P200 to H30P400, are given priority by lowering the target partition size for the zone.
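The collapsing idea above can be sketched roughly as follows. This is not the code in partition.py; it is a simple greedy pass over an ordered list of buckets, written under several assumptions: bucket labels have the form H<digits>P<digits>, the priority zone is treated as a rectangle (H17 to H30, P200 to P400), and the function names and target sizes are hypothetical.

```python
import re

def parse_tranche(label):
    """Split a label such as 'H17P200' into (17, 200). Assumes this exact form."""
    match = re.fullmatch(r"H(\d+)P(\d+)", label)
    return int(match.group(1)), int(match.group(2))

def target_for(label, base=100, priority=50):
    """Hypothetical targets: a lower target inside the assumed priority zone."""
    h, p = parse_tranche(label)
    in_zone = 17 <= h <= 30 and 200 <= p <= 400
    return priority if in_zone else base

def collapse_buckets(buckets, target_fn):
    """Greedily group consecutive (label, count) buckets into partitions.

    A partition is closed when adding the next bucket would leave its total
    further from the target than stopping where it is.
    """
    partitions = []
    current, total = [], 0
    for label, count in buckets:
        limit = target_fn(label)
        if current and abs(total + count - limit) > abs(total - limit):
            partitions.append((current, total))
            current, total = [], 0
        current.append(label)
        total += count
    if current:
        partitions.append((current, total))
    return partitions
```

Calling `collapse_buckets(buckets, target_for)` on a list of `(label, count)` pairs returns a list of `(labels, total)` tuples. Because the target is lower inside the zone, buckets there end up in smaller partitions than buckets elsewhere.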
1. Getting/Generating a heatmap
The first thing we will need to generate partitions from the fine tranches is statistics about the fine tranches themselves. Specifically, we need a "heatmap", which is a 2D grid of values representing the number of molecules in each fine tranche bucket.
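To make the heatmap structure concrete, here is a minimal sketch that arranges per-bucket molecule counts into a 2D grid. It is not heatmap.py itself. It assumes the counts are already available as a dictionary keyed by labels of the form H<digits>P<digits>, and the function name `build_heatmap` is hypothetical.

```python
import re

def build_heatmap(counts):
    """Arrange {label: molecule_count} into a grid indexed by H (rows) and P (columns).

    Labels that do not look like 'H17P200' are skipped; missing buckets become 0.
    """
    parsed = {}
    for label, n in counts.items():
        match = re.fullmatch(r"H(\d+)P(\d+)", label)
        if match is None:
            continue
        parsed[(int(match.group(1)), int(match.group(2)))] = n
    h_values = sorted({h for h, _ in parsed})
    p_values = sorted({p for _, p in parsed})
    grid = [[parsed.get((h, p), 0) for p in p_values] for h in h_values]
    return h_values, p_values, grid
```

Each row of the returned grid corresponds to one H value and each column to one P value, so `grid[i][j]` is the molecule count for the bucket at `h_values[i]`, `p_values[j]`.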
If you want to grab an existing heatmap, you should request access to this spreadsheet doc: https://docs.google.com/spreadsheets/d/1Ty3G4mQ9xkD3-wSqv2tp9SvIbb5sLuQqxoxANhkAals/edit#gid=811611113 and copy the values from one of its sheets.
To generate a heatmap from an existing fine tranche directory, cd to /nfs/home/btingle/bin/zinc_deploy and do the following:
- source a python 3.6+ environment (there is one in the tldr3 dir in my work directory)
- `python heatmap.py $TRANCHE_DIR`
- after some time the script will finish and you will have your heatmap
2. Generating the partitions
In the same directory as the heatmap script, run the following:
- `python partition.py $TRANCHE_DIR`
You should get a colorful picture representing the new partitions, as well as a partitions.txt file that gives an explicit description of the partition configuration. Because this file is independent of the specific fine tranche set used to generate it, it can be used to deploy any set of fine tranches.
Deploying a Partition
The next step of this process is to prepare and export the molecules in each partition to an actual database. I've used SLURM to automate and parallelize this process.
1. Running the script
Navigate to /nfs/home/xyz/btingle/zinc_deploy, where you will find the scripts for deploying ZINC partitions, and run:
- `./deploy_partition.bash $PARTITION_NO $TRANCHE_DIR $CATALOG_PREFIX $ZINC_HOST`
- PARTITION_NO: the line number of the target partition in the partitions.txt file
- TRANCHE_DIR: the source tranche directory
- CATALOG_PREFIX: the character prepended to each source file name, e.g. 'm', 's', or 'w' (so 'm' produces 'mH17PXXX', etc.)
- ZINC_HOST: hostname of the target machine
This script queues a number of SLURM jobs and then exits. Once those jobs have completed, you can find the newly exported files on the target host at /local2/load/<PARTITION_LABEL>/export
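If you need to deploy every partition in a partitions.txt file, the invocations can be generated rather than typed by hand. The sketch below only builds the argument lists and does not run anything. It assumes that PARTITION_NO is a 1-based line number (the page does not say whether numbering starts at 0 or 1) and that the argument order matches the usage line above; `deploy_commands` is a hypothetical helper, not part of the zinc_deploy scripts.

```python
def deploy_commands(partition_lines, tranche_dir, catalog_prefix, zinc_host):
    """Build one deploy_partition.bash argument list per non-blank partition line."""
    commands = []
    # Assumption: PARTITION_NO is the 1-based line number in partitions.txt.
    for line_no, line in enumerate(partition_lines, start=1):
        if not line.strip():
            continue  # skip blank lines but keep line numbers aligned with the file
        commands.append([
            "./deploy_partition.bash",
            str(line_no),
            str(tranche_dir),
            catalog_prefix,
            zinc_host,
        ])
    return commands
```

To use it, read the file with `open("partitions.txt").read().splitlines()`, pass the result in along with your own tranche directory, prefix, and host, and hand each returned list to `subprocess.run` from inside the zinc_deploy directory.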
TODO
- add the ZINC_HOST field to deploy_partition.bash
- optimize for more parallel jobs