Loading And Creating ZINC Partitions Automatically


This page describes two steps:

1) generating a partition configuration for a set of fine-tranched molecules

2) deploying an individual partition database for a set of fine-tranched molecules

Generating a Partition Configuration

0. Purpose

Buckets of the fine tranches are merged to form new partitions, so that the number of molecules in each partition is as close as possible to a target size. This keeps structurally similar molecules together while making efficient use of resources. Certain "zones" of the fine tranches, such as the fertile region from H17P200 to H30P400, are given priority by lowering the target partition size for that zone.
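To make the idea of collapsing buckets toward a target size concrete, here is a minimal sketch in one dimension. It is not taken from partition.py; the function name `collapse_row`, the greedy rule, and the example counts are all assumptions made for illustration.

```python
from typing import List, Tuple


def collapse_row(counts: List[int], target: int) -> List[Tuple[int, int, int]]:
    """Greedily merge adjacent buckets into partitions close to `target`.

    Hypothetical helper, not the logic used by partition.py.
    Returns one (first_index, last_index, total_molecules) tuple per partition.
    """
    partitions = []
    start, total = 0, 0
    for i, n in enumerate(counts):
        # Close the current partition if adding this bucket would move
        # the running total further away from the target.
        if total > 0 and abs(total + n - target) > abs(total - target):
            partitions.append((start, i - 1, total))
            start, total = i, 0
        total += n
    if counts:
        # Whatever is left over becomes the final partition.
        partitions.append((start, len(counts) - 1, total))
    return partitions


# Made-up bucket counts for a single row of the fine tranches.
row = [40, 30, 50, 10, 90, 5]
normal = collapse_row(row, target=100)
# A priority zone uses a lower target, which yields more, smaller partitions.
priority = collapse_row(row, target=50)
```

With the lower target, the same row is split into more partitions, which is the effect described above for priority zones.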

1. Getting/Generating a heatmap

The first thing we will need to generate partitions from the fine tranches is statistics about the fine tranches themselves. Specifically, we need a "heatmap", which is a 2D grid of values representing the number of molecules in each fine tranche bucket.
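As a rough illustration of what such a heatmap contains, the sketch below builds a 2D grid from per-bucket molecule counts keyed by tranche names like H17P200. It does not reproduce heatmap.py; the helper names, the assumed name pattern (H followed by digits, P followed by digits), and the counts are hypothetical.

```python
import re
from typing import Dict, Tuple

# Assumed naming pattern, based on names such as H17P200.
TRANCHE_RE = re.compile(r"^H(\d+)P(\d+)$")


def parse_tranche(name: str) -> Tuple[int, int]:
    """Split a tranche name into its numeric H and P components."""
    match = TRANCHE_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected tranche name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def build_heatmap(counts: Dict[str, int]) -> Dict[Tuple[int, int], int]:
    """Map (H, P) grid coordinates to the number of molecules in that bucket."""
    heatmap: Dict[Tuple[int, int], int] = {}
    for name, n in counts.items():
        key = parse_tranche(name)
        heatmap[key] = heatmap.get(key, 0) + n
    return heatmap


def render(heatmap: Dict[Tuple[int, int], int]) -> str:
    """Render the grid as tab-separated text, one row per H value."""
    hs = sorted({h for h, _ in heatmap})
    ps = sorted({p for _, p in heatmap})
    lines = ["\t".join([""] + [f"P{p:03d}" for p in ps])]
    for h in hs:
        # Buckets with no entry are shown as 0.
        cells = [str(heatmap.get((h, p), 0)) for p in ps]
        lines.append("\t".join([f"H{h:02d}"] + cells))
    return "\n".join(lines)


# Made-up counts, for illustration only.
example = build_heatmap({"H17P200": 5, "H17P210": 3, "H18P200": 7})
print(render(example))
```

Tab-separated output is used here only because it pastes cleanly into a spreadsheet.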

If you want to use an existing heatmap, request access to this spreadsheet: https://docs.google.com/spreadsheets/d/1Ty3G4mQ9xkD3-wSqv2tp9SvIbb5sLuQqxoxANhkAals/edit#gid=811611113 and copy from one of its sheets.

To generate a heatmap from an existing fine tranche directory, cd to /nfs/home/btingle/bin/zinc_deploy and do the following:

  • source a python 3.6+ environment (there is one in the tldr3 dir in my work directory)
  • `python heatmap.py $TRANCHE_DIR`
  • after some time the script will finish and you will have your heatmap

2. Generating the partitions

In the same directory as the heatmap script, run the following:

  • `python partition.py $TRANCHE_DIR`

You should get a colorful picture representing the new partitions that looks something like this:

[Image: Partitions new.PNG]

You will also get a partitions.txt file that explicitly describes the partition configuration. The configuration does not depend on the specific fine tranche set used to generate it, so it can be used to deploy any set of fine tranches.

Deploying a Partition

0. Purpose

The next step of this process is to prepare and export the molecules in each partition to an actual database. I've used SLURM to automate and parallelize this process.

1. Running the script

Navigate to /nfs/home/xyz/btingle/zinc_deploy, where you will find the scripts for deploying ZINC partitions.

  • `./deploy_partition.bash $PARTITION_NO $TRANCHE_DIR $CATALOG_PREFIX $ZINC_HOST`
  • PARTITION_NO: the line number of the target partition in the partitions.txt file
  • TRANCHE_DIR: the source tranche directory
  • CATALOG_PREFIX: the character that gets prepended to each of the source file names, e.g. 'm', 's', or 'w'; 'm' creates 'mH17PXXX', etc.
  • ZINC_HOST: the hostname of the target machine
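If you want to launch several partitions from a Python driver, the argument order above can be captured in a small helper. This is only a sketch: `deploy_command` and `prefixed_name` are hypothetical names, the validation rules are assumptions, and the example path and hostname are placeholders. Note also that the TODO at the bottom of this page suggests the ZINC_HOST argument may not be wired into deploy_partition.bash yet.

```python
import shlex
from typing import List


def prefixed_name(catalog_prefix: str, tranche: str) -> str:
    """Show how the catalog prefix combines with a tranche name, e.g. 'm' + 'H17P200'."""
    return catalog_prefix + tranche


def deploy_command(partition_no: int, tranche_dir: str,
                   catalog_prefix: str, zinc_host: str) -> List[str]:
    """Build the argument list for deploy_partition.bash without running it."""
    # Assumption: partition numbers are never negative. Whether counting
    # starts at 0 or 1 is not stated on this page.
    if partition_no < 0:
        raise ValueError("partition_no must not be negative")
    # The page describes the prefix as a single character.
    if len(catalog_prefix) != 1:
        raise ValueError("catalog_prefix must be a single character")
    return ["./deploy_partition.bash", str(partition_no),
            tranche_dir, catalog_prefix, zinc_host]


# Placeholder values; substitute your own tranche directory and host.
cmd = deploy_command(3, "/path/to/tranches", "m", "example-host")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` from the script directory; it is printed here instead so the sketch has no side effects.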

This script will queue a number of SLURM jobs and then exit. Once the jobs have completed, you can find the newly exported files on the target host at /local2/load/<PARTITION_LABEL>/export

TODO

add the ZINC_HOST field to deploy_partition.bash

optimize for more parallel jobs [done!]