Gpus: Difference between revisions
(Added troubleshooting information for nvidia kernel api mismatches. Based on info Teague suggested.)
Revision as of 01:40, 8 July 2016
We have 7 GPUs on the cluster (as of June 2016). There is a separate queue, gpu.q, to manage GPU jobs.
To log in interactively to the gpu queue:
qlogin -q gpu.q
Each GPU is a GeForce GTX 980. To list them:
/sbin/lspci | grep -i nvidia
Instructions for getting set up for GPU computation: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/
NVIDIA drivers are installed in /usr/local/cuda*. To use the 7.5 drivers, make sure these environment variables are set (both must point at the same version):
export PATH=/usr/local/cuda-7.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH
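To keep the two variables pointing at the same version, the setup above can be wrapped in a small shell function. This is only a sketch: the function name use_cuda and its optional prefix argument are hypothetical, and it assumes the /usr/local/cuda-VERSION layout described above.

```shell
# Hypothetical helper: point PATH and LD_LIBRARY_PATH at one CUDA version.
# Assumes installs live under <prefix>/cuda-<version>; prefix defaults to /usr/local.
use_cuda() {
    local version="$1"
    local prefix="${2:-/usr/local}"
    local root="${prefix}/cuda-${version}"

    if [ ! -d "${root}" ]; then
        echo "use_cuda: ${root} not found" >&2
        return 1
    fi

    export PATH="${root}/bin:${PATH}"
    # ${VAR:+...} avoids a trailing ':' when LD_LIBRARY_PATH was previously unset.
    export LD_LIBRARY_PATH="${root}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
}
```

For example, running use_cuda 7.5 followed by which nvcc should show whether the 7.5 compiler is the one being picked up.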
Check that the drivers are installed:
cat /proc/driver/nvidia/version
which should return
NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.79 Wed Jan 13 16:17:53 PST 2016
GCC version: gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC)
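If only the version number is needed (for instance to compare it against a client library version), it can be pulled out of that file with awk. This is a sketch; the function name nvidia_module_version is hypothetical, and it assumes the NVRM line has the layout shown above.

```shell
# Hypothetical helper: print the NVIDIA kernel module version (e.g. 352.79).
# Reads /proc/driver/nvidia/version by default; another file can be passed for testing.
nvidia_module_version() {
    local file="${1:-/proc/driver/nvidia/version}"
    # The version is the first field after the words "Kernel Module" on the NVRM line.
    awk '/^NVRM version:/ {
             for (i = 1; i < NF; i++)
                 if ($i == "Kernel" && $(i + 1) == "Module") { print $(i + 2); exit }
         }' "${file}"
}
```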
Try compiling and running the sample programs:
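mkdir -p /scratch/$USER/cuda-7.5_samples
# copy the contents of samples/ (not the directory itself) so the Makefile ends up at the top level
cp -r /usr/local/cuda-7.5/samples/. /scratch/$USER/cuda-7.5_samples/
cd /scratch/$USER/cuda-7.5_samples/
make
Run the sample program: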
/nfs/ge/bin/on-one-gpu - /scratch/$USER/cuda-7.5_samples/bin/x86_64/linux/release/deviceQuery
Here is a sample script to run amber:
/nfs/work/tbalius/MOR/run_amber/run.pmemd_cuda_wraper.csh
Here is an excerpt from the script:
##########
cat << EOF > qsub.amber.csh
#\$ -S /bin/csh
#\$ -cwd
#\$ -q gpu.q
#\$ -o stdout
#\$ -e stderr
# export CUDA_VISIBLE_DEVICES="0,1,2,3"
# setenv CUDA_VISIBLE_DEVICES "0,1,2,3"
setenv AMBERHOME /nfs/soft/amber/amber14/
set amberexe = "/nfs/ge/bin/on-one-gpu - \$AMBERHOME/bin/pmemd.cuda"
##########
Note that we run the executable through on-one-gpu, which manages which GPUs are used.
If your job generates significant output (which is generally, but not always, the case),
it is important to write locally to scratch and then copy the results over the network onto the NFS disk.
Writing large amounts of data directly to the NFS disk can cause problems for other users.
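The write-locally-then-copy pattern can be sketched as a shell function. The name run_in_scratch and the SCRATCH_ROOT override are hypothetical; the sketch assumes local scratch is /scratch/$USER, as in the samples example above.

```shell
# Hypothetical helper: run a command inside a private scratch directory, then
# copy everything it wrote to a destination directory in a single pass.
# Usage: run_in_scratch DEST_DIR COMMAND [ARGS...]
# SCRATCH_ROOT is an assumed override; it defaults to /scratch/$USER.
run_in_scratch() {
    local dest="$1"; shift
    local root="${SCRATCH_ROOT:-/scratch/${USER}}"
    local workdir
    local status=0

    mkdir -p "${root}" || return 1
    workdir="$(mktemp -d "${root}/job.XXXXXX")" || return 1

    # Run the job with the scratch directory as its working directory.
    ( cd "${workdir}" && "$@" ) || status=$?

    # Copy results over the network once, after the job has finished.
    # The scratch copy is removed only if the copy succeeded.
    mkdir -p "${dest}" && cp -r "${workdir}/." "${dest}/" && rm -rf "${workdir}"

    # Report the job's own exit status to the caller.
    return "${status}"
}
```

A call might look like run_in_scratch /nfs/work/$USER/results ./my_job.sh, where both the destination path and my_job.sh are placeholders.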
When encountering kernel errors for Nvidia
The specific error is an API mismatch such as the following, which can be viewed with dmesg:
run dmesg | grep "NVRM":
API mismatch: the client has the version 352.93, but
NVRM: this kernel module has the version 352.79. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
Address this problem with the following commands:
# this will remove the nvidia modules and reload the updated versions of the modules (run as root)
# nvidia_uvm depends on nvidia, so it has to be unloaded first
rmmod nvidia_uvm
rmmod nvidia
modprobe nvidia
modprobe nvidia_uvm
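To confirm which two versions disagree before reloading anything, the dmesg output can be filtered with awk. This is a sketch; the function name nvrm_mismatch_versions is hypothetical, and it assumes the message wording shown above with one message per line.

```shell
# Hypothetical helper: read dmesg-style text on stdin and print the two
# versions from an "API mismatch" report as "client=<v> module=<v>".
# Prints nothing and returns 1 when no complete report is found.
nvrm_mismatch_versions() {
    awk '
        /API mismatch: the client has the version/ {
            for (i = 1; i < NF; i++) if ($i == "version") client = $(i + 1)
            sub(/,$/, "", client)        # drop the trailing comma
        }
        /this kernel module has the version/ {
            for (i = 1; i < NF; i++) if ($i == "version") module = $(i + 1)
            sub(/\.$/, "", module)       # drop the trailing full stop
        }
        END {
            if (client != "" && module != "") {
                print "client=" client " module=" module
            } else {
                exit 1
            }
        }'
}
```

It would be used as dmesg | nvrm_mismatch_versions; after the modules are reloaded, new messages should no longer report a mismatch.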