GPU DOCK

GPU DOCK is a fork of DOCK 3.8 that utilizes GPU acceleration via CUDA. Currently (as of April 2023), GPU DOCK uses the GPU for parallel scoring and supports only van der Waals, electrostatics, and solvation scores.

Running and Building GPU DOCK

The main files for GPU DOCK are in the DOCK/ucsfdock/docking/DOCK/src directory. There are a few important scripts here:

  • clean.sh, which clears object files from previous builds
  • compile.sh, which builds all components of DOCK using multithreaded compilation with 8 threads
  • empty.sh, which clears output files from previous runs (subdock needs the previous files cleared)
  • run.sh, which launches GPU DOCK on a certain set of input ligands (see "Building and Running" below)
  • debug.sh, which runs clean.sh, compile.sh, empty.sh, and run.sh in that order

There is also an additional utility in DOCK/ucsfdock/docking/DOCK/util that outputs the makefile dependencies for the Fortran files. Fortran files cannot be compiled independently; if file A relies on file B for some function definition, then compiling file B must produce a .mod file that is fed into the compilation invocation for file A. This is problematic for make, which assumes independent compilation and thus compiles in alphabetical order. Without the utility, we would therefore have to run make multiple times in order to generate the necessary .mod files for each "layer" down the dependency tree. With the utility, we can output a makefile dependency list that tells make in what order the files must be compiled, so we only have to run make once for the DOCK Fortran files.

When you have cloned GPU DOCK, make sure to set the scripts in src and the utility script to be executable as programs; otherwise Linux file permissions might prevent you from running DOCK.

Building and Running

You can use clean.sh, compile.sh, empty.sh, and run.sh individually to suit your needs, or you can run debug.sh to build from scratch and run. Building from scratch is important if you modify a C/C++ header file, because the dependent object files need to be recompiled and make does not detect this. Furthermore, debug.sh builds from scratch within 20 seconds on an FX-8350 (an extremely old CPU that was bad even when it was first released), so building from scratch is not a significant time loss, especially on modern hardware.

The input ligands for the run are in the DOCK/results folder. The run.sh script reads the list of .tgz files in this directory and automatically creates an sdi.in file that is used in docking. The dockfiles are stored in DOCK/results/dockfiles. The output will be in DOCK/results/output.

Note: **do not try to run multiple instances of GPU DOCK at the same time unless match_goal is very high and you have a small list of files in DOCK/results/input.** The reason is that if many GPU DOCK instances were running, they would all be competing for GPU resources. As a result, the driver has to micromanage all instances at the same time, and this results in **significant** overhead. You can mitigate this by setting match_goal to a very high number. This works because matching is currently done on the CPU, so a high match_goal moves most of the execution time to the CPU, and GPU DOCK instances are then less likely to compete for GPU computation time at the same moment. However, this is not a perfect solution, as it creates a CPU bottleneck and significantly slows down the program on the CPU side, rendering the GPU acceleration benefit useless. In a future update, GPU DOCK will support a single-instance, multiple-threads model (as opposed to the multiple-instances, single-thread model that subdock currently uses; see further down this page for more details).

How GPU DOCK Works

Here, I discuss how GPU DOCK works. I'll cover how the code is structured and the reasoning behind the algorithmic design.

Terminology and Background

Before I get into the technical aspects of GPU DOCK, here is some brief terminology I will use on this page:

  • GPU DOCK is the GPU-ified version of DOCK 3.8
  • CPU DOCK is the original version of DOCK 3.8
  • Uploading is transferring data from the CPU to the GPU via PCIe
  • Downloading is transferring data from the GPU to the CPU via PCIe
  • Kernels are GPU programs

Also, only the CPU can call cudaMalloc, i.e. only the CPU can control GPU memory allocations and deallocations.

Launching a kernel is similar to launching a process. Each kernel is launched from the CPU, and each kernel invocation is independent of the others. However, unlike launching processes on the CPU, only one kernel invocation runs at a time on the GPU, and every kernel accesses the same allocated blocks of GPU memory.

By default, the GPU and CPU are unsynchronized: when the CPU issues a call to the GPU to execute a kernel, it does not wait until the GPU is finished (or, if the GPU is processing another kernel, the GPU begins the next kernel immediately after the current one). This design is useful in asynchronous models, where the CPU and GPU for the most part work separately until either needs data from the other. A consequence of this design is that each individual communication between the CPU and GPU is slow: every upload or download call to the Nvidia driver incurs a significant overhead. Hence, upload and download calls should be kept to a minimum, and the data transferred should be packed together or compressed. It is also important to note that only the CPU can initiate uploads and downloads.

The following are what I refer to as communication operations:

  • Uploading data
  • Downloading data
  • Launching a kernel
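
To make these concrete, here is a minimal, self-contained CUDA sketch that performs all three operations. The kernel and buffer sizes are hypothetical and not taken from GPU DOCK; the point is that the kernel launch returns immediately on the CPU, while each transfer call pays a fixed driver overhead.

  // Minimal sketch of the three communication operations (hypothetical kernel,
  // not taken from GPU DOCK). Build with nvcc.
  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void add_one(float *data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] += 1.0f;
  }

  int main() {
      const int n = 1 << 20;
      float *host = new float[n]();            // zero-initialized CPU buffer
      float *device = nullptr;
      cudaMalloc(&device, n * sizeof(float));  // only the CPU can allocate GPU memory

      // Upload (CPU -> GPU over PCIe); each such call carries a fixed overhead.
      cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);

      // Kernel launch: returns immediately, the CPU does not wait for the GPU.
      add_one<<<(n + 255) / 256, 256>>>(device, n);

      // Block until the kernel has finished before using its results.
      cudaDeviceSynchronize();

      // Download (GPU -> CPU over PCIe), again with a fixed per-call overhead.
      cudaMemcpy(host, device, n * sizeof(float), cudaMemcpyDeviceToHost);
      printf("host[0] = %f\n", host[0]);       // prints 1.000000

      cudaFree(device);
      delete[] host;
      return 0;
  }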

Due to the communication overhead, data transfer time follows a curve approximated by <math>T = \log_c(a + b^x)</math>. More information can be found in this Nvidia presentation.

GPU DOCK Structure

CPU DOCK processes ligands sequentially, whereas GPU DOCK processes them in batches of size n, typically 48. For each batch, the docking process looks like this:

  1. Read n ligands from the disk
  2. Match n ligands *sequentially*
  3. Upload matching results and ligands to GPU memory and score
  4. Download scoring results
  5. Determine the nsave best poses and write them to the disk

The scoring code (steps 3 and 4) is written in C++ and CUDA. The rationale behind batching the ligands instead of processing them on the GPU one by one is the aforementioned communication bottleneck between the CPU and GPU. If we communicated with the GPU for each individual ligand to upload, score, and download, we would waste a lot of time on every communication operation. By batching, we instead distribute the communication overhead across multiple ligands and gain performance.

To reduce the number of transfer calls, everything that needs to be uploaded for a scoring call (e.g. ligands, matches) is batched together in one transfer. All ligands and matches are written to a single contiguous buffer of memory, and a single cudaMemcpy call (which transfers data between the CPU and GPU) is used to upload the data. A single kernel is launched to do all the scoring, and the scoring kernel writes to a large buffer that keeps track of all the scores (which I call the scoring buffer). Then a download operation transfers data from the scoring buffer to the CPU.
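
As a rough illustration of this pattern, the toy program below packs a whole batch into one contiguous buffer, uploads it with a single cudaMemcpy, scores it with a single kernel launch, and downloads the whole scoring buffer in one call. The Ligand struct and the scoring formula are placeholders, not DOCK's actual data layout or scoring function; only the single-upload / single-kernel / single-download structure mirrors GPU DOCK.

  // Toy version of the batched transfer pattern (placeholder data layout and
  // scoring formula; the structure, not the science, is the point).
  #include <cstdio>
  #include <vector>
  #include <cuda_runtime.h>

  struct Ligand {          // placeholder ligand representation
      float x, y, z;       // e.g. a single atom position
      float charge;
  };

  __global__ void score_batch(const Ligand *ligands, float *scores, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;
      Ligand l = ligands[i];
      // Placeholder "score": distance from the origin weighted by charge.
      scores[i] = l.charge * sqrtf(l.x * l.x + l.y * l.y + l.z * l.z);
  }

  int main() {
      const int n = 48;                        // one batch of ligands
      std::vector<Ligand> batch(n);
      for (int i = 0; i < n; ++i)
          batch[i] = {1.0f * i, 2.0f * i, 3.0f * i, 0.5f};

      Ligand *d_ligands = nullptr;
      float  *d_scores  = nullptr;
      cudaMalloc(&d_ligands, n * sizeof(Ligand));
      cudaMalloc(&d_scores,  n * sizeof(float));

      // One upload for the whole batch instead of one upload per ligand.
      cudaMemcpy(d_ligands, batch.data(), n * sizeof(Ligand), cudaMemcpyHostToDevice);

      // One kernel launch scores every ligand in the batch.
      score_batch<<<(n + 63) / 64, 64>>>(d_ligands, d_scores, n);

      // One download for the entire scoring buffer.
      std::vector<float> scores(n);
      cudaMemcpy(scores.data(), d_scores, n * sizeof(float), cudaMemcpyDeviceToHost);

      printf("score of ligand 1: %f\n", scores[1]);
      cudaFree(d_ligands);
      cudaFree(d_scores);
      return 0;
  }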

A short note on inefficiencies: the current program is designed not to disfigure the original Fortran code significantly. Because of this, the C++ side of DOCK does not directly pick out the nsave best ligands. Instead, it implements an interface for Fortran to read back the ligands and determine the nsave best ones. This results in inefficiency and will be removed eventually (ideally, the GPU will select the nsave best ligands to reduce the size of the download). More inefficiencies exist in the program; for example, the upload buffer also contains the scoring buffer, solely so that the scoring buffer is cleared on upload. This will also be removed soon.

Prefiltering

CPU DOCK looks at the rigid conf to determine whether it should bump a ligand. This optimization means that it looks at only one conf to decide whether the pose is already bad enough that scoring should be skipped. Letting the GPU do this is inefficient, as the GPU would have to launch threads that die immediately if bumping is enabled or do useless work if bumping is disabled. Furthermore, the CPU adjusts the size of the matching buffer according to the expected number of matches, meaning that a lot of GPU memory would be wasted on bumped ligands.

Instead, the CPU checks ahead of time which ligands will be bumped. As a result, it only sends ligands whose rigid conf was not bumped to the GPU for scoring. Since memory accesses are expensive on the GPU, it is more efficient to recompute the conf score on the GPU than to reuse the CPU result.
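
Below is a host-side sketch of this prefiltering idea. The types, the bump test, and the cutoff value are placeholders rather than actual DOCK code; the point is that only ligands whose rigid conf passes the bump check end up in the batch that gets uploaded to the GPU.

  // Placeholder prefiltering sketch: score only the rigid conf on the CPU and
  // keep the ligand for GPU scoring only if it is not bumped.
  #include <cstdio>
  #include <vector>

  struct Ligand {
      float rigid_conf_score;  // placeholder: score of the rigid conf only
      // ... the rest of the conf data would live here ...
  };

  std::vector<Ligand> prefilter(const std::vector<Ligand> &batch, float bump_cutoff) {
      std::vector<Ligand> kept;
      kept.reserve(batch.size());
      for (const Ligand &lig : batch) {
          // A ligand whose rigid conf already scores worse than the cutoff is
          // "bumped": it is never uploaded, so no GPU threads or matching-buffer
          // memory are wasted on it.
          if (lig.rigid_conf_score <= bump_cutoff)
              kept.push_back(lig);
      }
      return kept;  // only these ligands are uploaded and scored on the GPU
  }

  int main() {
      std::vector<Ligand> batch = {{-5.0f}, {12.0f}, {-1.5f}};
      std::vector<Ligand> kept = prefilter(batch, 0.0f);
      std::printf("%zu of %zu ligands survive prefiltering\n", kept.size(), batch.size());
      return 0;
  }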

No Caching

Another optimization CPU DOCK uses is that it caches the results of confs in the set-score computation. As previously mentioned, reading memory is expensive on the GPU, so GPU DOCK instead recalculates set scores.