GPU DOCK

  1. GPU DOCK Wiki

GPU DOCK is a fork of DOCK 3.8 that utilizes GPU acceleration via CUDA. Currently (as of April 2023), GPU DOCK uses the GPU for parallel scoring and supports only the van der Waals, electrostatics, and solvation scores.

  2. Running and Building GPU DOCK

The main files for GPU DOCK are in the `DOCK/ucsfdock/docking/DOCK/src` directory of DOCK. There are a few important scripts here:

  • `clean.sh`, which clears object files from previous builds
  • `compile.sh`, which builds all components of DOCK with `make` using 8 parallel compilation threads
  • `empty.sh`, which clears output files from previous runs (subdock requires that previous output be cleared)
  • `run.sh`, which launches GPU DOCK on a given set of input ligands (see "Building and Running" below)
  • `debug.sh`, which runs `clean.sh`, `compile.sh`, `empty.sh`, and `run.sh` in that order

There is also an additional utility in `src/../util` that outputs the makefile dependencies for the Fortran files. Fortran files cannot be compiled independently: if file A relies on file B for a function or module definition, then compiling file B must first produce a `.mod` file, which must then be fed into the compilation invocation for file A. This is a problem for the makefile, which assumes independent compilation and simply compiles the files in alphabetical order. Without the utility, we therefore have to run `make` multiple times to generate the necessary `.mod` files for each "layer" of the dependency tree. With the utility, we can generate a makefile dependency list that tells `make` in what order the files must be compiled, so `make` only needs to be run once for the DOCK Fortran files.

When you have cloned GPU DOCK, make sure the scripts in `src` and the utility script are marked as executable; otherwise file permissions may prevent you from running DOCK.

    1. Building and Running

You can use `clean.sh`, `compile.sh`, `empty.sh`, or `run.sh` individually to suit your needs, or you can run `debug.sh` to build from scratch and run. Building from scratch is important if you modify a C/C++ header file, because the dependent object files need to be recompiled and `make` does not detect this. Furthermore, `debug.sh` builds from scratch within 20 seconds on an FX-8350 (an extremely old CPU that was bad even when it was first released), so building from scratch is not a significant time loss, especially on modern hardware.

The input ligands for the run are in the `DOCK/results` folder. The `run.sh` script reads the list of `.tgz` files in this directory and automatically creates an `sdi.in` file that is used in docking. The dockfiles are stored in `DOCK/results/dockfiles`. The output will be in `DOCK/results/output`.

Note: **do not try to run multiple instances of GPU DOCK at the same time unless `match_goal` is very high and you have a small list of files in `DOCK/results/input`.** If many GPU DOCK instances are running, they all compete for GPU resources, and the driver has to micromanage all of the instances at once, which results in **significant** overhead. You can mitigate this by setting `match_goal` to a very high number: since matching currently runs on the CPU, a high `match_goal` moves most of the execution time to the CPU, so GPU DOCK instances are less likely to compete for GPU computation time at the same moment. However, this is not a perfect solution, as it creates a CPU bottleneck and significantly slows down the CPU side of the program, largely negating the benefit of GPU acceleration. In a future update, GPU DOCK will support a single-instance, multiple-threads model (as opposed to the multiple-instances, single-thread model that subdock currently uses; see further down this page for more details).

  3. Terminology and Background

Before I get into the technical aspects of GPU DOCK, here is some brief terminology I will use on this page:

  • GPU DOCK is the GPU-ified version of DOCK 3.8
  • CPU DOCK is the original version of DOCK 3.8
  • Uploading is transferring data from the CPU to the GPU via PCIe
  • Downloading is transferring data from the GPU to the CPU via PCIe
  • Kernels are GPU programs

Also, only the CPU can call `cudaMalloc`, i.e. only the CPU can control GPU memory allocations and deallocations.

Launching a kernel is similar to launching a process. Each kernel is launched from the CPU, and kernel invocations are independent of one another. However, unlike processes on the CPU, only one kernel invocation runs at a time on the GPU, and every kernel accesses the exact same allocated blocks of memory.
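
To make this concrete, here is a minimal, hedged CUDA sketch (illustrative names, not GPU DOCK code) of defining a kernel, allocating GPU memory from the CPU, and launching the kernel twice on the default stream:

```cuda
#include <cuda_runtime.h>

// A kernel is an ordinary function marked __global__; each GPU thread runs
// one copy of its body.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n) data[i] *= factor;
}

void example(int n) {
    float *d_data;                           // device pointer
    cudaMalloc(&d_data, n * sizeof(float));  // only the CPU can allocate GPU memory

    // Two launches on the default stream: the second invocation does not start
    // until the first finishes, and both access the same allocation d_data.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    scale<<<(n + 255) / 256, 256>>>(d_data, 0.5f, n);

    cudaFree(d_data);
}
```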

By default, the GPU and CPU are unsynchronized: when the CPU issues a call to the GPU to execute a kernel, it does not wait until the GPU is finished (if the GPU is already processing another kernel, the new kernel simply begins immediately after the current one). This design is useful in asynchronous models, where the CPU and GPU mostly work separately until either needs data from the other. A consequence of this design is that communication between the CPU and GPU is slow: each upload or download call to the Nvidia driver incurs significant overhead. Hence, upload and download calls should be avoided where possible, or the data transferred should be packed together or compressed. It is also important to note that only the CPU can initiate uploads and downloads.
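
Here is a small sketch of this default asynchrony, reusing the illustrative `scale` kernel from the sketch above (again, hypothetical names rather than GPU DOCK code): the launch returns immediately, the CPU keeps working, and the CPU-initiated download later on the default stream waits for the kernel before copying.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n);      // kernel from the sketch above
static void do_other_cpu_work() { /* placeholder for unrelated CPU work */ }

void run_async(const float *h_in, float *h_out, float *d_buf, int n) {
    // Upload: always initiated by the CPU, and each call carries driver overhead.
    cudaMemcpy(d_buf, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // The launch returns immediately; the GPU runs the kernel in the background.
    scale<<<(n + 255) / 256, 256>>>(d_buf, 2.0f, n);

    do_other_cpu_work();  // the CPU is free to keep working meanwhile

    // Download: cudaMemcpy on the default stream waits for the kernel to finish
    // before copying, so no explicit cudaDeviceSynchronize() is needed here.
    cudaMemcpy(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
}
```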

The following are what I refer to as communication operations:

  • Uploading data
  • Downloading data
  • Launching a kernel

Due to the communication overhead, data transfer follows a curve approximated by $T=\log_c(a+b^x)$. More information can be found in this [Nvidia presentation](https://www.cs.virginia.edu/~skadron/Papers/cuda_tuning_bof_sc09_final.pdf).

  4. Overall structure

CPU DOCK processes ligands sequentially, while GPU DOCK processes them in batches of size `n`, typically 48. For each batch, the docking process looks like this:

  1. Read `n` ligands from the disk
  2. Match the `n` ligands *sequentially*
  3. Upload the matching results and ligands to GPU memory and score them
  4. Download the scoring results
  5. Determine the `nsave` best poses and write them to the disk

The scoring code (steps 3 and 4) is written in C++ and CUDA. The rationale behind batching the ligands instead of processing them on the GPU one by one is the aforementioned communication bottleneck between the CPU and GPU. If we communicated with the GPU once per ligand to upload, score, and download, we would waste a lot of time on each communication operation. By batching, we instead distribute the communication overhead across multiple ligands and save that time.
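
As a rough illustration (with made-up symbols, not measured numbers): if each communication operation carries a fixed overhead $o$ and each ligand needs $t$ of actual transfer and compute time, then handling $n$ ligands one at a time costs roughly $n(o + t)$, while handling them as one batch costs roughly $o + nt$; the batch amortizes the fixed overhead, saving about $(n-1)o$.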

To reduce the number of transfer calls, everything that needs to be uploaded for a scoring call (e.g. ligands, matches) is batched together in one transfer. All ligands and matches are written to a single contiguous buffer of memory, and a single `cudaMemcpy` call (which transfers data between the CPU and GPU) is used to upload the data. A single kernel is launched to do all the scoring, and the scoring kernel writes to a large buffer that keeps track of all the scores (which I call the scoring buffer). Then a download operation transfers data from the scoring buffer to the CPU.
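
Below is a rough, hedged sketch of this batched-transfer pattern. The names (`score_batch`, the packed byte buffer) are hypothetical stand-ins rather than the actual GPU DOCK data structures; the point is one upload, one kernel launch, and one download per batch.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Hypothetical scoring kernel: one thread per pose, writing into the scoring buffer.
__global__ void score_batch(const char *packed, float *scores, int n_poses) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_poses) scores[i] = 0.0f;  // placeholder for the real scoring math
}

void score_one_batch(const std::vector<std::vector<char>> &ligands_and_matches,
                     std::vector<float> &scores, int n_poses) {
    // 1. Pack all ligands and matches into one contiguous host buffer.
    size_t total = 0;
    for (const auto &blob : ligands_and_matches) total += blob.size();
    std::vector<char> h_packed(total);
    size_t offset = 0;
    for (const auto &blob : ligands_and_matches) {
        std::memcpy(h_packed.data() + offset, blob.data(), blob.size());
        offset += blob.size();
    }

    // 2. One upload, one kernel launch, one download for the whole batch.
    char *d_packed = nullptr;
    float *d_scores = nullptr;
    cudaMalloc(&d_packed, total);
    cudaMalloc(&d_scores, n_poses * sizeof(float));
    cudaMemcpy(d_packed, h_packed.data(), total, cudaMemcpyHostToDevice);

    score_batch<<<(n_poses + 255) / 256, 256>>>(d_packed, d_scores, n_poses);

    scores.resize(n_poses);
    cudaMemcpy(scores.data(), d_scores, n_poses * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaFree(d_packed);
    cudaFree(d_scores);
}
```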

A short note on inefficiencies: the current design deliberately avoids disfiguring the original Fortran code. Because of this, the C++ side of DOCK does not directly take the `nsave` best ligands; instead, it implements an interface for Fortran to read the ligands back and determine the `nsave` best ones. This is inefficient and will eventually be removed (ideally, the GPU will select the `nsave` best ligands itself to reduce the size of the download). Other inefficiencies exist as well; for example, the upload buffer currently also contains the scoring buffer, purely in order to clear the scoring buffer. This too will be removed soon.

    1. Prefiltering

CPU DOCK looks at the rigid conf to determine whether it should bump a ligand. This optimization means it looks at only one conf to decide whether the pose is already bad enough that scoring it should be avoided. Letting the GPU do this is inefficient, as the GPU would have to launch threads that die immediately if bumping is enabled or do useless work if bumping is disabled. Furthermore, the CPU adjusts the size of the matching buffer according to the expected number of matches, meaning a lot of GPU memory would be wasted on bumped ligands.

Instead, the CPU checks ahead of time which ligands will be bumped, and only sends ligands whose rigid conf was not bumped to the GPU for scoring. Since memory accesses are expensive on the GPU, it is more efficient to recompute the conf score on the GPU than to reuse the CPU result.
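
A minimal sketch of the prefiltering idea, with hypothetical field and parameter names (the real DOCK structures differ): the CPU checks each match's rigid-conf score against the bump cutoff and copies only the survivors into the buffer that will be uploaded.

```cuda
// Hypothetical match record: only the fields needed for the illustration.
struct Match {
    int   ligand_id;
    float rigid_score;  // score of the rigid conf, computed on the CPU
};

// Returns how many matches survived the bump check; only those are uploaded
// to the GPU for full scoring.
int prefilter_matches(const Match *matches, int n_matches, float bump_cutoff,
                      Match *upload_buffer) {
    int kept = 0;
    for (int i = 0; i < n_matches; ++i) {
        if (matches[i].rigid_score <= bump_cutoff)   // rigid conf not bumped
            upload_buffer[kept++] = matches[i];
    }
    return kept;
}
```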

    2. No Caching

Another optimization CPU DOCK uses is that it caches the results of confs in the set-score computation. As previously mentioned, reading memory is expensive on the GPU, so GPU DOCK instead recalculates the set scores.