Repackaging DB2 DOCK38

From DISI
Revision as of 21:59, 27 March 2023 by Btingle (talk | contribs)

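The following script repackages 3D pipeline results: it collects the db2.gz files from the pipeline's output tarballs and bundles them into larger tarballs for docking, 100 source tarballs per package by default. First, here is the script: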

#!/bin/bash
# make_tarballs.bash

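# required parameters
TARBALL_SOURCE=$1
TARBALL_REPACK_DEST=$2

# validate before resolving paths, since realpath fails on an empty argument
[ -z "$TARBALL_SOURCE" ] && echo "need to provide TARBALL_SOURCE as 1st arg!" && exit 1
[ -z "$TARBALL_REPACK_DEST" ] && echo "need to provide TARBALL_REPACK_DEST as 2nd arg!" && exit 1

# create the destination first, so the later mv moves files into it instead of renaming onto it
mkdir -p "$TARBALL_REPACK_DEST"
# absolute paths are needed because the script changes directory below
TARBALL_SOURCE=$(realpath "$TARBALL_SOURCE")
TARBALL_REPACK_DEST=$(realpath "$TARBALL_REPACK_DEST")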

# optional parameters
WORKING_DIRECTORY=${WORKING_DIRECTORY-/tmp/$(whoami)}
PACKAGES_PER_PACKAGE=${PACKAGES_PER_PACKAGE-100}
PACKAGE_TYPE=${PACKAGE_TYPE-db2.gz}
PACKAGE_TYPE_SHORT=$(echo $PACKAGE_TYPE | cut -d'.' -f1)

echo WORKING_DIRECTORY=$WORKING_DIRECTORY
mkdir -p $WORKING_DIRECTORY && cd $WORKING_DIRECTORY
mkdir -p output working tarball_split_list

echo finding
find $TARBALL_SOURCE -name '*.tar.gz' > tarball_list.txt
echo splitting
split -l $PACKAGES_PER_PACKAGE tarball_list.txt tarball_split_list/
echo working
cd working
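for f in ../tarball_split_list/*; do
        chunk=$(basename "$f")
        # extract only the wanted files from each source tarball, dropping directory prefixes
        # (GNU tar needs --wildcards to treat the member name as a pattern when extracting)
        while read -r tb; do
                [ -n "$VERBOSE" ] && echo tar --wildcards --transform='s/^.*\///' -xf "$tb" "*.$PACKAGE_TYPE"
                tar --wildcards --transform='s/^.*\///' -xf "$tb" "*.$PACKAGE_TYPE" 2>/dev/null
        done < "$f"
        # bundle everything extracted for this chunk into one package and move it to the destination
        [ -n "$VERBOSE" ] && echo tar -czf "$chunk.$PACKAGE_TYPE_SHORT.tar.gz" "*.$PACKAGE_TYPE"
        tar -czf "$chunk.$PACKAGE_TYPE_SHORT.tar.gz" *."$PACKAGE_TYPE"
        mv "$chunk.$PACKAGE_TYPE_SHORT.tar.gz" "$TARBALL_REPACK_DEST"
        rm *."$PACKAGE_TYPE"
        echo "$chunk"
done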
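cd ..
# remove only what this script created, so that a shared WORKING_DIRECTORY is not wiped
rm -r working tarball_split_list tarball_list.txt
# these are removed only if they are empty
rmdir output "$WORKING_DIRECTORY" 2>/dev/null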
echo Done! Results in $TARBALL_REPACK_DEST
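
The extraction step is the least obvious part of the script, so here is a small self-contained sketch of it. It assumes GNU tar, the file names are placeholders, and --wildcards and --wildcards-match-slash are written out explicitly so that the '*.db2.gz' pattern is treated as a glob that may span directory separators. The --transform expression is the one from the script: it deletes everything up to the last slash, so nested members land flat in the current directory.

```shell
# Build a tiny tarball with a nested layout (placeholder files, not real db2 data).
demo=$(mktemp -d)
mkdir -p "$demo/src/a/b" "$demo/flat"
echo data > "$demo/src/a/b/mol1.db2.gz"
echo note > "$demo/src/a/b/readme.txt"
tar -C "$demo/src" -czf "$demo/nested.tar.gz" a

# Extract only *.db2.gz members and strip their directory prefixes.
tar -C "$demo/flat" --wildcards --wildcards-match-slash \
    --transform='s/^.*\///' -xf "$demo/nested.tar.gz" '*.db2.gz'

extracted=$(ls "$demo/flat")
echo "$extracted"   # prints mol1.db2.gz; readme.txt is skipped and no a/b/ directories are created
rm -rf "$demo"
```

Because directory prefixes are discarded, two members with the same base name in different source tarballs would overwrite each other, so this approach relies on file names being unique within a package.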

Now, an example usage:

[user@gimel5 ~] bash make_tarballs.bash $PWD/H17P200_H19P400.smi.batch-3d.d/out $PWD/tarballs_repacked/H17P200_H19P400
finding
splitting
working
aa
ab
ac
ad
ae
af
ag
ah
ai
aj
Done! Results in /path/to/tarballs_repacked/H17P200_H19P400
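
The two-letter names in this output (aa, ab, ...) are the default suffixes that split gives to each chunk of tarball_list.txt, and each chunk becomes one repacked package. As a rough illustration, assuming a hypothetical list of 950 source tarballs and the default PACKAGES_PER_PACKAGE of 100, the following sketch reproduces the same ten chunk names:

```shell
# Hypothetical list of 950 tarball names, split into chunks of 100 lines each.
demo=$(mktemp -d)
mkdir -p "$demo/tarball_split_list"
seq -f 'tarball_%g.tar.gz' 1 950 > "$demo/tarball_list.txt"
split -l 100 "$demo/tarball_list.txt" "$demo/tarball_split_list/"

chunk_count=$(ls "$demo/tarball_split_list" | wc -l)
last_chunk=$(ls "$demo/tarball_split_list" | tail -n 1)
last_chunk_lines=$(wc -l < "$demo/tarball_split_list/$last_chunk")

ls "$demo/tarball_split_list" | tr '\n' ' '; echo   # aa ab ac ad ae af ag ah ai aj
echo "$chunk_count chunks, last chunk $last_chunk holds $last_chunk_lines tarballs"
rm -rf "$demo"
```

The number of packages is the number of source tarballs divided by PACKAGES_PER_PACKAGE, rounded up, so the last package is usually smaller than the rest.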

Note that this script is effective for fairly small batches of molecules, e.g. on the order of millions rather than billions. Talk to me (ben@tingle.org) or John Irwin for more information on how to repack very large ligand libraries.

For docking ligands built using our pipeline with default options, running this script unmodified is sufficient to create appropriately sized packages. If running out of space on /tmp is a concern, set the WORKING_DIRECTORY environment variable to a location on /scratch or some other larger filesystem. Point it at a dedicated subdirectory rather than at /scratch itself, because the script cleans up its working directory when it finishes. The /tmp directory typically holds only around 50G of data, which may not be enough for some workloads or environments.
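
The optional parameters are read from the environment with the ${VAR-default} form, so they can be overridden for a single run without editing the script. The sketch below copies only those defaults into a helper, show_settings, which is a hypothetical name used for illustration. The /scratch/repack_demo path is likewise just an example.

```shell
# Mirrors the script's optional-parameter defaults.
show_settings() {
    echo "WORKING_DIRECTORY=${WORKING_DIRECTORY-/tmp/$(whoami)}"
    echo "PACKAGES_PER_PACKAGE=${PACKAGES_PER_PACKAGE-100}"
    echo "PACKAGE_TYPE=${PACKAGE_TYPE-db2.gz}"
}

unset WORKING_DIRECTORY PACKAGES_PER_PACKAGE PACKAGE_TYPE
defaults=$(show_settings)
overridden=$(WORKING_DIRECTORY=/scratch/repack_demo PACKAGES_PER_PACKAGE=50 show_settings)

echo "$defaults"
echo "$overridden"
```

The same prefix syntax applies to the real script, e.g. WORKING_DIRECTORY=/scratch/repack_demo PACKAGES_PER_PACKAGE=50 bash make_tarballs.bash followed by the source and destination arguments. Setting VERBOSE to any non-empty value makes the script echo each tar command before running it.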