We provide a limited amount of GPU resources.
The regular rules for batch jobs also apply to GPU jobs!
Remember that for GPU jobs (including interactive GPU jobs) the same rules apply as for regular batch jobs; in particular, the time limit is automatically set to a 3 h job lifetime unless you set it otherwise using

+RequestRuntime = <seconds>   # requested runtime in seconds

in your submit file!
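Since +RequestRuntime takes seconds, a quick shell sketch for converting a wanted runtime into the right value (the 12-hour figure below is just an illustration, not a site limit):

```shell
# Convert a desired runtime of 12 hours into seconds for +RequestRuntime.
# (12 hours is only an example value, not a site default or limit.)
HOURS=12
RUNTIME_SECONDS=$(( HOURS * 3600 ))
echo "+RequestRuntime = ${RUNTIME_SECONDS}"   # prints: +RequestRuntime = 43200
```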
See: Submitting Jobs
Interactive machines:
For users of atlas, cms, ilc, and belle, we provide shared GPU access on interactive WGS. Ask your group admin to add the resource 'nafgpu' to your account:
- naf-atlas-gpu01.desy.de
- naf-cms-gpu01.desy.de
- naf-ilc-gpu01.desy.de
- naf-belle-gpu01.desy.de
These machines have a standard WGS installation plus some GPU-related software. (Contact the naf-helpdesk if you are missing something.)
Beware: these are shared resources, so use them for development and testing only!
Access to these machines is via ssh or FastX.
The GPU hardware is currently one NVIDIA P100 per server.
Batch machines:
Access is restricted to people with the 'nafgpu' resource. Contact your experiment support (who should in turn contact the naf-helpdesk) for access to this resource.
Once you have the needed resource, add:

[ ... ]
Requirements = OpSysAndVer == "CentOS7"
Request_GPUs = 1
[ ... ]

to your job submit file.
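Putting the pieces together, a minimal complete GPU submit file might look like the following sketch (the executable name my_gpu_job.sh, the file names, and the 12 h runtime are placeholders, not site defaults):

```
# Minimal GPU submit file sketch -- names are placeholders, adapt to your setup
executable      = my_gpu_job.sh        # hypothetical user script
output          = out_$(Cluster)_$(Process).txt
error           = err_$(Cluster)_$(Process).txt
log             = log_$(Cluster)_$(Process).txt

Requirements    = OpSysAndVer == "CentOS7"
Request_GPUs    = 1

# uncomment to extend the default 3 h runtime (value in seconds, here 12 h)
#+RequestRuntime = 43200

queue
```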
GPU resources are scarce, and since usage is exclusive, they are precious. Try to make efficient use of your allocated compute time on the GPUs.
The GPU hardware currently in use is one NVIDIA GeForce GTX 1080 Ti per server.
List GPU batchnodes and state
[flemming@pal53]~% condor_status -constraint 'GPUs >= 1'
Name                       OpSys      Arch   State     Activity LoadAv Mem     ActvtyTime

batchg001.desy.de          LINUX      X86_64 Claimed   Busy      1.280  46758  1+20:27:59
batchg002.desy.de          LINUX      X86_64 Claimed   Busy      0.980  46758  0+03:57:50
batchg003.desy.de          LINUX      X86_64 Claimed   Busy      0.980  46758  0+02:19:51
batchg004.desy.de          LINUX      X86_64 Claimed   Busy      0.980  46758  0+02:15:10
batchg005.desy.de          LINUX      X86_64 Claimed   Busy      0.980  46758  0+03:41:59
batchg006.desy.de          LINUX      X86_64 Claimed   Busy      0.980  46758  0+03:52:32
batchg007.desy.de          LINUX      X86_64 Claimed   Busy      0.980  46778  0+02:23:47
batchg008.desy.de          LINUX      X86_64 Unclaimed Idle      0.000  46778  1+21:44:58
batchg009.desy.de          LINUX      X86_64 Claimed   Busy      0.980  46778  0+00:18:21
slot1_1@batchg010.desy.de  LINUX      X86_64 Claimed   Busy      1.010   1536  0+01:08:15
slot1_2@batchg010.desy.de  LINUX      X86_64 Claimed   Busy      1.010   1536  0+02:44:29
slot1_3@batchg010.desy.de  LINUX      X86_64 Claimed   Busy      1.010   1536  0+03:33:18
slot1_4@batchg010.desy.de  LINUX      X86_64 Claimed   Busy      0.000  50176  1+19:19:38
batchg011.desy.de          LINUX      X86_64 Unclaimed Idle      0.000 385437  1+23:44:52
batchg012.desy.de          LINUX      X86_64 Claimed   Busy      0.000 385437  2+01:10:08
batchg013.desy.de          LINUX      X86_64 Claimed   Busy      1.000 385437  2+01:08:57

               Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
  X86_64/LINUX    16     0      14         2       0          0        0     0
         Total    16     0      14         2       0          0        0     0
List Capabilities and Software versions of BIRD GPU nodes
[root@bird-htc-sched02 ~]# condor_status -constraint 'gpus >= 1' -af:h Name CUDACapability CUDADeviceName CUDADriverVersion CUDAGlobalMemoryMb CUDARuntimeVersion
Name              CUDACapability CUDADeviceName      CUDADriverVersion CUDAGlobalMemoryMb CUDARuntimeVersion
batchg001.desy.de 6.1            GeForce GTX 1080 Ti 10.0              11178              8.0
batchg002.desy.de 6.1            GeForce GTX 1080 Ti 10.0              11178              8.0

# list software/firmware versions of a specific node
[chbeyer@htc-it02]~/htcondor/testjobs% condor_status -l batchg004.desy.de | grep -i cuda
CUDACapability = 6.1
CUDADeviceName = "GeForce GTX 1080 Ti"
CUDADriverVersion = 10.0
CUDAECCEnabled = false
CUDAGlobalMemoryMb = 11178
CUDARuntimeVersion = 8.0
DetectedGPUs = "CUDA0"
List GPU jobs
[chbeyer@htc-it02]~/htcondor/testjobs% condor_q -constraint 'RequestGPUs >= 1'

-- Schedd: bird-htc-sched02.desy.de : <131.169.56.95:9618?... @ 11/07/18 10:36:14
OWNER   BATCH_NAME   SUBMITTED   DONE  RUN  IDLE  TOTAL JOB_IDS
chbeyer ID: 9902561  11/7  09:44    _    1     8     11 9902561.2-10
chbeyer ID: 9903331  11/7  10:31    _    _    11     11 9903331.0-10
chbeyer ID: 9903332  11/7  10:31    _    _    11     11 9903332.0-10
GPU-top
[root@batchg001 ~]# nvidia-smi
Wed Nov  7 10:38:21 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
|100%   90C    P2    79W / 250W | 10077MiB / 11178MiB  |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    279258      C   .../htcondor_exec/gpu-burn-master/gpu_burn 10067MiB |
+-----------------------------------------------------------------------------+
Examples
[chbeyer@batchg002]~/htcondor/testjobs% cat gpu_interactive.submit
Requirements = OpSysAndVer == "CentOS7"
Request_GPUs = 1
queue

[chbeyer@htc-it02]~/htcondor/testjobs% condor_submit -i gpu_interactive.submit
Submitting job(s).
1 job(s) submitted to cluster 4777582.
Waiting for job to start...
Welcome to batchg002.desy.de!

[chbeyer@batchg002]~/htcondor/testjobs% nvidia-smi
Thu Feb 21 09:15:12 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
| 47%   48C    P5    15W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
# You can require any listed ClassAd of a node as mandatory for your job
# (see "List Capabilities and Software versions of BIRD GPU nodes"), for example:

[chbeyer@batchg002]~/htcondor/testjobs% cat gpu_interactive.submit
Requirements = OpSysAndVer == "CentOS7" && (CUDAGlobalMemoryMb > 10000) && (CUDARuntimeVersion == 8.0)
Request_GPUs = 1
queue

[chbeyer@htc-it02]~/htcondor/testjobs% condor_submit -i gpu_interactive.submit
Submitting job(s).
1 job(s) submitted to cluster 4777641.
Waiting for job to start...
Welcome to batchg002.desy.de!
[chbeyer@batchg002]~/htcondor/testjobs% cat sleep.submit
# Unix submit description file
# sleep.sub -- simple sleep job using GPU resources
executable   = /afs/desy.de/user/c/chbeyer/htcondor_exec/sleep_runtime.sh
output       = /afs/desy.de/user/c/chbeyer/out_$(Cluster)_$(Process).txt
error        = /afs/desy.de/user/c/chbeyer/error_$(Cluster)_$(Process).txt
Requirements = OpSysAndVer == "CentOS7"
Request_GPUs = 1

# uncomment this if you want to use the job-specific variables $CLUSTER and $PROCESS inside your batch job
#environment = "CLUSTER=$(Cluster) PROCESS=$(Process)"

# uncomment this to specify a runtime longer than 3 hours (time in seconds)
#+RequestRuntime = 6000

# uncomment this to specify an argument given to the executable
#Args = 20

# uncomment this to give this batch job an individual name tag to find it easily in the queue
#batch_name = sleep_test_2

queue 1