GPU DOCK

GPU DOCK is a fork of DOCK 3.8 that uses GPU acceleration via CUDA. Currently (as of April 2023), GPU DOCK uses the GPU only for parallel scoring, and it supports only the Van der Waals, electrostatics, and solvation scores.

```THIS PAGE IS CURRENTLY WIP```

Running and Building GPU DOCK

The main files for GPU DOCK are in the DOCK/ucsfdock/docking/DOCK/src directory of DOCK. There are a few important scripts here:

  • clean.sh, which clears object files from previous builds
  • compile.sh, which uses multithreaded compilation with 8 threads to make all components of DOCK
  • empty.sh, which clears output files from previous runs (subdock needs the previous files cleared)
  • run.sh, which launches GPU DOCK on a certain set of input ligands (see "Building and Running" below)
  • debug.sh, which runs clean.sh, compile.sh, empty.sh, and run.sh in that order

There is also an additional utility in DOCK/ucsfdock/docking/DOCK/util that outputs the makefile dependencies for the Fortran files. Fortran files cannot be compiled independently: if file A relies on file B for some function definition, then compiling file B must produce a .mod file that is fed into the compilation invocation for file A. This is problematic for the makefile, which assumes independent compilation and thus compiles files in alphabetical order. Without the utility, we therefore have to run make multiple times to generate the necessary .mod files for each "layer" down the dependency tree. With the utility, however, we can output a makefile dependency list that tells make in what order the files must be compiled, so we only have to run make once for the DOCK Fortran files.

When you have cloned GPU DOCK, make sure to mark the scripts in src and the utility script as executable; otherwise, Linux file permissions might get in the way of running DOCK.

Building and Running

You can use clean.sh, compile.sh, empty.sh, and run.sh individually to suit your needs, or you can run debug.sh to build from scratch and run. Building from scratch is important if you modify a C/C++ header file, because the dependent object files need to be recompiled and make does not detect this. Furthermore, debug.sh builds from scratch within 20 seconds on an FX-8350 (an extremely old CPU that was slow even when it was first released), so building from scratch is not a significant time loss, especially on modern hardware.

The input ligands for the run are in the DOCK/results folder. The run.sh script reads the list of .tgz files in this directory and automatically creates an sdi.in file that is used in docking. The dockfiles are stored in DOCK/results/dockfiles. The output will be in DOCK/results/output.

Note: **do not try to run multiple instances of GPU DOCK at the same time unless match_goal is very high and you have a small list of files in DOCK/results/input.** If many GPU DOCK instances are running, they all compete for GPU resources, and the driver has to micromanage all of them at once, which results in **significant** overhead. You can mitigate this by setting match_goal to a very high number: since matching currently runs on the CPU, a high match_goal moves most of the execution time to the CPU, so GPU DOCK instances are less likely to compete for GPU computation time at the same moment. However, this is not a perfect solution, as it creates a CPU bottleneck and significantly slows down the CPU side of the program, rendering the GPU acceleration benefit useless. In a future update, GPU DOCK will support a single-instance, multiple-threads model (as opposed to the multiple-instances, single-thread model that subdock currently uses; see further down this page for more details).

How GPU DOCK Works

GPU DOCK is written in CUDA. The code is structured such that the GPU components are as independent as possible from the original Fortran program. Here I discuss how GPU DOCK works: how the code is structured and the algorithmic designs behind it.

Terminology and Background

Before I get into the technical aspects of GPU DOCK, here is some brief terminology I will use in this page:

  • GPU DOCK is the GPU-ified version of DOCK 3.8
  • CPU DOCK is the original version of DOCK 3.8
  • Uploading is transferring data from the CPU to the GPU via PCIe
  • Downloading is transferring data from the GPU to the CPU via PCIe
  • Kernels are GPU programs

Also, only the CPU can call cudaMalloc, i.e. only the CPU can control GPU memory allocations and deallocations.

Launching a kernel is similar to launching a process. Each kernel is launched from the CPU, and each kernel invocation is independent of the others. However, unlike processes on the CPU, only one kernel invocation runs at a time on the GPU, and every kernel accesses the exact same allocated blocks of memory.

By default, the GPU and CPU are unsynchronized: when the CPU issues a call for the GPU to execute a kernel, it does not wait until the GPU is finished (if the GPU is already processing another kernel, the new kernel simply begins immediately after the current one). This design is useful in asynchronous models, where the CPU and GPU mostly work separately until one needs data from the other. A consequence of this design is that communication between the CPU and GPU is slow: each upload or download call to the Nvidia driver incurs a significant overhead. Hence, upload and download calls should be avoided where possible, or the data being transferred should be packed together or compressed. It is also important to note that only the CPU can initiate uploads and downloads.

The following are what I refer to as communication operations:

  • Uploading data
  • Downloading data
  • Launching a kernel
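
To make these ideas concrete, here is a minimal, self-contained CUDA sketch (not GPU DOCK code) that performs all three communication operations: an upload, a kernel launch, and a download.

```
#include <cuda_runtime.h>
#include <cstdio>

// Trivial kernel: each GPU thread scales one element of the array.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* host = new float[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    // Only the CPU can allocate GPU memory.
    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));

    // Upload (CPU -> GPU over PCIe).
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel launch: returns immediately; the CPU does not wait for the GPU.
    scale<<<(n + 255) / 256, 256>>>(dev, n, 2.0f);

    // Download (GPU -> CPU); this call waits for the pending kernel to finish first.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("host[0] = %f\n", host[0]);  // prints 2.0
    cudaFree(dev);
    delete[] host;
    return 0;
}
```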

Due to the communication overhead, data transfer follows a curve approximated by <math>T = \log_c(a + b^x)</math>. More information can be found in this Nvidia presentation.

Partially GPU-ified Docking Pipeline

CPU DOCK processes ligands sequentially, while GPU DOCK processes them in batches of size n, typically 48. For each batch, the docking process looks like this:

  1. Read n ligands from the disk
  2. Match n ligands *sequentially*
  3. Upload matching results and ligands to GPU memory and score
  4. Download scoring results
  5. Determine the nsave best poses and write them to the disk

The scoring code (steps 3 and 4) is written in C++ and CUDA. The rationale behind batching the ligands instead of processing them on the GPU one by one is the aforementioned communication bottleneck between the CPU and GPU. If we communicated with the GPU once per ligand to upload, score, and download, we would waste a lot of time on each communication operation. By batching, we instead amortize the communication overhead across many ligands.

To reduce the number of transfer calls, everything that needs to be uploaded for a scoring call (e.g. ligands, matches) is batched together in one transfer. All ligands and matches are written to a single contiguous buffer of memory, and a single cudaMemcpy call (which transfers data between the CPU and GPU) is used to upload the data. A single kernel is launched to do all the scoring, and the scoring kernel writes to a large buffer that keeps track of all the scores (which I call the scoring buffer). Then a download operation transfers data from the scoring buffer to the CPU.
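
A condensed sketch of this upload-score-download pattern is shown below. The buffer layout, the score_batch kernel, and the function names are hypothetical; GPU DOCK's real scoring kernel computes the Van der Waals, electrostatics, and solvation terms rather than the placeholder used here.

```
#include <cuda_runtime.h>
#include <vector>

// Hypothetical scoring kernel: one thread per pose, writing into the scoring buffer.
__global__ void score_batch(const char* packed, float* scores, int n_poses) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_poses) scores[i] = 0.0f;  // placeholder for the real VDW/ES/solvation sum
}

// Upload one packed batch, score it with a single kernel, and download the scores.
void score_ligand_batch(const std::vector<char>& packed_batch, int n_poses,
                        std::vector<float>& scores_out) {
    char*  dev_packed = nullptr;
    float* dev_scores = nullptr;
    cudaMalloc(&dev_packed, packed_batch.size());
    cudaMalloc(&dev_scores, n_poses * sizeof(float));

    // One cudaMemcpy uploads every ligand and match in the batch.
    cudaMemcpy(dev_packed, packed_batch.data(), packed_batch.size(),
               cudaMemcpyHostToDevice);

    // One kernel launch scores every pose and fills the scoring buffer.
    score_batch<<<(n_poses + 127) / 128, 128>>>(dev_packed, dev_scores, n_poses);

    // One download brings the whole scoring buffer back to the CPU.
    scores_out.resize(n_poses);
    cudaMemcpy(scores_out.data(), dev_scores, n_poses * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaFree(dev_packed);
    cudaFree(dev_scores);
}
```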

A short note on inefficiencies: the current design aims not to disfigure the original Fortran code significantly. Because of this, the C++ side of DOCK does not directly take the nsave best ligands; instead, it implements an interface for Fortran to read back the ligands and determine the nsave best ones. This is inefficient and will be removed eventually (ideally, the GPU will pick the nsave best ligands to reduce the size of the download). More inefficiencies exist in the program; for example, the upload buffer also contains the scoring buffer in order to clear it. This, too, will be removed soon.

Prefiltering

CPU DOCK looks at the rigid conf to determine whether it should bump a ligand. This optimization means that it looks at only one conf to determine whether the pose is already bad enough that scoring it should be avoided. Letting the GPU do this is inefficient, as the GPU would have to launch threads that either die immediately if bumping is enabled or do useless work if bumping is disabled. Furthermore, the CPU sizes the matching buffer according to the expected number of matches, so a lot of GPU memory would be wasted on bumped ligands. Instead, the CPU checks ahead of time which ligands will be bumped, and only ligands whose rigid conf was not bumped are sent to the GPU for scoring.
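
A rough sketch of the idea, with hypothetical types and names (the real prefiltering code lives in cpu_score.h/cpp, listed in the file index below):

```
#include <vector>

// Hypothetical pose record: the ligand it belongs to and the CPU-side score
// of its rigid conf (computed before any GPU work).
struct Pose {
    int   ligand_id;
    float rigid_conf_score;
};

// True if the rigid conf alone is already bad enough that the pose would be bumped.
bool rigid_conf_bumps(const Pose& p, float bump_cutoff) {
    return p.rigid_conf_score > bump_cutoff;
}

// Keep only the poses that survive the rigid-conf check; only these are
// packed into the matching buffer and uploaded to the GPU for full scoring.
std::vector<Pose> prefilter(const std::vector<Pose>& poses, float bump_cutoff) {
    std::vector<Pose> kept;
    for (const Pose& p : poses)
        if (!rigid_conf_bumps(p, bump_cutoff))
            kept.push_back(p);
    return kept;
}
```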

No Caching

Another optimization CPU DOCK uses is that it caches the results of confs in the set score computation. The issue with caching on the GPU is that a lot of memory would have to be written to the cache, and writes are slow on the GPU. On the other hand, reads and arithmetic are strong points of GPUs, so it makes more sense to simply recompute these scores.

Code Structure

All of the original Fortran code is in src and src/libfgz. The GPU code is in src/gpu, all of which is written in C++ and CUDA. There are also a few new Fortran files in src, mixed in with the old Fortran files (some of which have been modified).

The Fortran/C++ Interop

To allow the GPU parts to be written in C++, we need to write an interface for Fortran and C++ to interact with each other and exchange data. The easiest way to do this is to call C++ code via functions: we hand off execution momentarily to C++ so it can do its work. The interface provided in the Fortran standard for Fortran/C interop is very restrictive. For this reason, the C++ code is designed to be largely self-contained and exposes a very high-level interface to Fortran. For example, Fortran calls score_all_ligands_gpu to score the ligands on the C++/CUDA side (instead of calling C++ functions that are more closely involved in the scoring process, such as GPU memory allocation). Furthermore, no part of the C++ code calls a Fortran function; only the Fortran code calls into C++.

Fortran by default passes everything by reference (i.e. as a pointer). It also appends an underscore (_) to all function names. Suppose the Fortran compiler saw the code call myFunc(myIntegerValue); we would then have to define the function signature void myFunc_(int* myIntegerValue) on the **C** side. For **C++**, on the other hand, the compiler mangles the symbol name: a function like void fooBar(int* xyz) will be compiled into a symbol such as __fooBar@INSERT_GARBAGE_CHARACTERS (or some other variation), and the linker will not be able to find it. To get around this issue, we declare the function extern "C" so it compiles without name mangling. For brevity and clarity, I define the macro FORTRAN_EXPORT to mean extern "C".
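
Putting this together, the example from the paragraph above looks roughly like this on the C++ side (the function and its argument are the hypothetical example from the text; exact symbol naming, including letter case, is compiler-dependent):

```
// FORTRAN_EXPORT disables C++ name mangling so the Fortran-generated
// symbol name (with its trailing underscore) can be found by the linker.
#define FORTRAN_EXPORT extern "C"

// Fortran side:  call myFunc(myIntegerValue)
// C++ side: trailing underscore, pointer argument.
FORTRAN_EXPORT void myFunc_(int* myIntegerValue) {
    int value = *myIntegerValue;  // Fortran passes arguments by reference
    (void)value;                  // ... real code would do work with it here
}
```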

As said before, the interface is restrictive. For instance, it is nearly impossible to pass a struct or an array legally and safely into a function, and it is also very difficult to make Fortran and C++ share the same memory resources. For this reason, all data (ligands, matches, grids, scores, etc.) unfortunately needs to be replicated on the C++ side. A lot of DOCK's data structures contain large arrays (such as ligand atom and conf data), and transferring these directly to C++ is a problem, since we cannot pass a struct's internal array directly to a C++ function through this interface. As a result, I copy these large arrays into a secondary array that both Fortran and C++ have access to. Fortran writes to this array, and C++ (via a function call from Fortran) creates a copy of it on its side (so that the array can be reused for further transfers). This array is located in transfer.f, and each of the major data structure files (e.g. db2type.f and matchtype.f) has functions that use this array and the other utility functions defined in transfer.f.
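
A heavily simplified sketch of the C++ receiving side of this hand-off (the function and variable names here are illustrative, not the actual helpers in transfer.h/cpp):

```
#include <vector>

#define FORTRAN_EXPORT extern "C"

// C++-side copy of the array; kept so the shared transfer buffer can be reused.
static std::vector<float> received_array;

// Called from Fortran after it has copied a large array (e.g. ligand atom
// data) into the shared transfer buffer.
FORTRAN_EXPORT void receive_real_array_(const float* transfer_buffer, const int* count) {
    received_array.assign(transfer_buffer, transfer_buffer + *count);
}
```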

C++ Side Data Structures

Since the data structures need to be replicated on the C++ side, we need to define them there. Here are the currently replicated data structures:

  • matcht, which can be found in matchtype.h/cpp
  • db2, which can be found in db2type.h/cpp (in the C++ version of the code it is called db2t instead of just db2)
  • The Van der Waals, electrostatics, and solvation grids, which can be found in vdw.h/cpp, phimap.h/cpp, and solvmap.h/cpp respectively

The C++ side data structures *also* need to be replicated on the GPU in addition to their CPU versions. These GPU-side structures are located in the same files as their CPU counterparts and are prefixed with dev_ to indicate that they live on the *device* (CUDA's terminology for the GPU). For instance, db2t's GPU counterpart is named dev_db2t.
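
The replication pattern looks roughly like this; the struct and field names below are invented for illustration and do not match the real db2t/dev_db2t definitions.

```
#include <cuda_runtime.h>
#include <vector>

// CPU-side (C++) copy of a ligand, received from Fortran.
struct db2t_sketch {
    std::vector<float3> atom_coords;  // large arrays live in host containers
    int                 n_confs;
};

// GPU-side counterpart: the same data, but held as raw device pointers
// that were filled by an upload (cudaMemcpy) from the CPU copy.
struct dev_db2t_sketch {
    float3* atom_coords;  // points into GPU memory
    int     n_atoms;
    int     n_confs;
};
```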

An important thing to note is the difference between the 2D, 3D, and 4D vector types used on the CPU side and the GPU side. The CPU code uses the glm library, as it is the most convenient and versatile linear algebra library on the CPU. However, CUDA does not support native vectorization and other optimizations with glm's types; instead, CUDA has built-in vector data structures that should be used in the CUDA code. This means that when we upload data structures to the GPU, we cannot use the source arrays directly: we have to go through an intermediary vector that contains the converted data. This does introduce an inefficiency, and the CPU/GPU vector type mismatch will be removed in a future update.
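
For example, converting a CPU-side glm array into CUDA's built-in float3 type through an intermediary buffer before uploading might look like this (a sketch; the function name is illustrative):

```
#include <cuda_runtime.h>
#include <glm/glm.hpp>
#include <vector>

// Convert glm::vec3 data into CUDA's built-in float3 type and upload it.
void upload_coords(const std::vector<glm::vec3>& cpu_coords, float3* dev_coords) {
    // Intermediary vector holding the converted elements.
    std::vector<float3> staging(cpu_coords.size());
    for (size_t i = 0; i < cpu_coords.size(); ++i)
        staging[i] = make_float3(cpu_coords[i].x, cpu_coords[i].y, cpu_coords[i].z);

    cudaMemcpy(dev_coords, staging.data(),
               staging.size() * sizeof(float3), cudaMemcpyHostToDevice);
}
```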

When Fortran nears the end of the reading/matching loop, it transfers the ligands and matches to C++. Since a single temporary transfer buffer is used for arrays, this requires multiple function calls. First, metadata (e.g. the sizes of various arrays, or anything else that is not a large resizable array, including 2D and 3D vectors) is packed into the arguments of one call. Then, for each large array in the object, Fortran copies it into the temporary transfer buffer and calls a C++ function that copies the temporary array into a C++-side array. Before the main loop starts, various things like the grids are sent to C++ using the same process. Generally, every data structure that is sent to C++ has a function that abstracts its transfer into one call. For instance, db2 has the helper function send_db2lig in db2type.f to abstract the transfer process away from the main search loop.

Index of Files

Here is an index of important GPU DOCK files and what they do:

  • common.h/cpp contains commonly included headers and basic debugging utilities.
  • cpu_score.h/cpp contains a CPU-side implementation of scoring and the prefiltering code.
  • db2type.h/cpp contains a C++/CUDA-side implementation of the db2 data structure.
  • dev_alloc.hpp is a GPU memory pool.
  • dev_vector.hpp is an implementation of std::vector on CUDA. It does not support resizing.
  • gpu_common.h/cpp contains important includes and structs for GPU-side scoring that are used throughout the code.
  • gpu_score.h/cu contains the GPU-side scoring code and performance collection.
  • hdd_interface.hpp/cu (short for host-device data interface) contains utility functions that help in the upload process.
  • initialize.cpp contains important start-up and clean-up code.
  • interface.h/cpp contains variables, functions, and macros that are important for Fortran/C++ interoperability, data transfer, and debugging.
  • matchtype.h/cpp contains a CUDA/C++ side definition of the matcht data structure.
  • options.h/cpp contains a CUDA/C++ side definition of the options data structure.
  • phimap.h/cpp contains a CUDA/C++ side data structure for the phimap (the electrostatics grid).
  • scoring_common.h/cpp contains structs important to both CPU-side and GPU-side scoring.
  • shared_mem.hpp contains a memory pool-like data structure for managing the memory for the master score (since the master score is deprecated, this too is deprecated).
  • solvmap.h/cpp contains the CUDA/C++ side implementation of the solvation and asolvation grids.
  • status.h/cpp contains a copy of Fortran's status codes.
  • timer.hpp contains a utility timer class that runs on the wall clock (not the CPU clock).
  • transfer.h/cpp contains helper functions for receiving integer and float data from Fortran.
  • upload.h/cu writes the ligand and match data to a temporary buffer before uploading it to the GPU.
  • vdw.h/cpp contains the CUDA/C++ definition of the Van der Waals map.

The GPU Memory Management System

GPU DOCK handles a lot of memory. It needs to upload many ligands and matches to the GPU and download a scoring buffer that is hundreds of megabytes in size, so we need an efficient memory management system to make sure that just moving memory around does not become a bottleneck. As mentioned earlier, each call to cudaMemcpy has significant overhead regardless of how much data is being transferred. Hence, we batch the uploads and downloads to keep performance high, generally stuffing as much data as possible into a single buffer when uploading or downloading.

However, putting everything into one buffer leads to another issue: that buffer needs to be very large, so how do we allocate that memory efficiently? The most straightforward approach is to keep the same large buffer from call to call; if the buffer proves too small, we free it and allocate the necessary size. This does lead to issues at the start of the program, where the buffer grows rapidly, so we set the initial size of the buffer to a large value. This single-buffer approach, despite how "inelegant" it is, works quite well. When we take multiple data sets from H17P050-N-la* and the DUDE-Z DRD4 receptor and run GPU DOCK *sequentially*, we find that across all GPU DOCK instances we only have to reallocate for 3% of the batches (and this would be even lower if we combined all data sets into one and ran a single instance of GPU DOCK).

Generally, the structure of this stuff-everything-into-one-buffer approach is as follows (a condensed sketch appears after the list):

  1. For each ligand and its matches:
       a. Sum how much GPU memory we would need to upload the ligand and matches
       b. Add this sum to the "total memory required" sum
  2. If the current buffer is smaller than the total memory required, resize the buffer
  3. Copy all ligands and matches to a temporary buffer
  4. Call cudaMemcpy to upload this buffer
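
A condensed sketch of this approach is below (buffer handling only; the packing of ligands and matches into packed_batch is omitted, and the names and initial size are illustrative):

```
#include <cuda_runtime.h>
#include <vector>

// Persistent upload buffer kept from call to call; it only grows.
static char*  dev_upload_buffer   = nullptr;
static size_t dev_upload_capacity = 64 * 1024 * 1024;  // generous initial size

void upload_batch(const std::vector<char>& packed_batch) {
    // Allocate lazily on the first call.
    if (dev_upload_buffer == nullptr)
        cudaMalloc(&dev_upload_buffer, dev_upload_capacity);

    // Resize only when the total memory required exceeds the current capacity.
    if (packed_batch.size() > dev_upload_capacity) {
        cudaFree(dev_upload_buffer);
        cudaMalloc(&dev_upload_buffer, packed_batch.size());
        dev_upload_capacity = packed_batch.size();
    }

    // A single cudaMemcpy uploads every ligand and match in the batch.
    cudaMemcpy(dev_upload_buffer, packed_batch.data(),
               packed_batch.size(), cudaMemcpyHostToDevice);
}
```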