Computing : GPU on NAF


We provide some GPU resources.

The regular rules for batch jobs also apply to GPU jobs!

For GPU jobs (including interactive GPU jobs) the same rules apply as for regular batch jobs. In particular, the job lifetime is automatically limited to 3 hours unless you request a different runtime using

# requested runtime in seconds
+RequestRuntime = <seconds>

in your submit file.
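
Because the runtime has to be given in seconds, a small shell helper can do the conversion from hours. This is only a sketch: the function name runtime_line is hypothetical and not part of HTCondor or the NAF setup.

```shell
#!/bin/sh
# Hypothetical helper (not part of HTCondor): print a +RequestRuntime line
# for a runtime given in whole hours.
runtime_line() {
    hours=$1
    echo "+RequestRuntime = $((hours * 3600))"
}

# Example: request 8 hours
runtime_line 8
```

The printed line can be pasted into the submit file as is.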

See: Submitting Jobs

Interactive machines:

For users of ATLAS, CMS, ILC and Belle, we provide shared GPU access on interactive WGS. Ask your group admin to add the resource 'nafgpu' to your account:

  • naf-atlas-gpu01.desy.de
  • naf-cms-gpu01.desy.de
  • naf-ilc-gpu01.desy.de
  • naf-belle-gpu01.desy.de

These machines have a standard WGS installation plus some GPU-related software (ask naf-helpdesk if you are missing something).

Beware: these are shared resources, so use them for development and testing only!

Access to these machines is via ssh or FastX.

The GPU hardware is currently one NVIDIA P100 per server.

Batch machines:

Access is restricted to users with the nafgpu resource. To get this resource, contact your experiment support, who will in turn contact naf-helpdesk.

Once you have the needed resource, add:

[ ... ]
Requirements = OpSysAndVer == "CentOS7"
Request_GPUs = 1
[ ... ]


to your job submit file.

GPU resources are scarce, and since usage is exclusive, they are precious. Make efficient use of your allocated compute time on the GPUs.

The GPU hardware currently in use is one NVIDIA GeForce GTX 1080 Ti per server.

List GPU batch nodes and their state

[flemming@pal53]~% condor_status -constraint 'GPUs >= 1' 
Name                       OpSys  Arch    State      Activity  LoadAv  Mem     ActvtyTime

batchg001.desy.de          LINUX  X86_64  Claimed    Busy      1.280   46758   1+20:27:59
batchg002.desy.de          LINUX  X86_64  Claimed    Busy      0.980   46758   0+03:57:50
batchg003.desy.de          LINUX  X86_64  Claimed    Busy      0.980   46758   0+02:19:51
batchg004.desy.de          LINUX  X86_64  Claimed    Busy      0.980   46758   0+02:15:10
batchg005.desy.de          LINUX  X86_64  Claimed    Busy      0.980   46758   0+03:41:59
batchg006.desy.de          LINUX  X86_64  Claimed    Busy      0.980   46758   0+03:52:32
batchg007.desy.de          LINUX  X86_64  Claimed    Busy      0.980   46778   0+02:23:47
batchg008.desy.de          LINUX  X86_64  Unclaimed  Idle      0.000   46778   1+21:44:58
batchg009.desy.de          LINUX  X86_64  Claimed    Busy      0.980   46778   0+00:18:21
slot1_1@batchg010.desy.de  LINUX  X86_64  Claimed    Busy      1.010   1536    0+01:08:15
slot1_2@batchg010.desy.de  LINUX  X86_64  Claimed    Busy      1.010   1536    0+02:44:29
slot1_3@batchg010.desy.de  LINUX  X86_64  Claimed    Busy      1.010   1536    0+03:33:18
slot1_4@batchg010.desy.de  LINUX  X86_64  Claimed    Busy      0.000   50176   1+19:19:38
batchg011.desy.de          LINUX  X86_64  Unclaimed  Idle      0.000   385437  1+23:44:52
batchg012.desy.de          LINUX  X86_64  Claimed    Busy      0.000   385437  2+01:10:08
batchg013.desy.de          LINUX  X86_64  Claimed    Busy      1.000   385437  2+01:08:57

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill  Drain

X86_64/LINUX     16      0       14          2        0           0         0      0

       Total     16      0       14          2        0           0         0      0
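
For a quick overview of how many GPU slots are free, the State column alone can be summarised. The following is a minimal sketch: count_states is a hypothetical helper, and the condor_status call in the comment assumes the same constraint as in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: read one slot state per line from standard input
# and print "<State> <count>" for each state, sorted by state name.
count_states() {
    sort | uniq -c | awk '{print $2, $1}'
}

# Assumed usage on a host with HTCondor installed (left commented out here):
# condor_status -constraint 'GPUs >= 1' -af State | count_states
```

A non-zero count for Unclaimed indicates that a GPU job could start right away, provided its other requirements match.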


List Capabilities and Software versions of BIRD GPU nodes

[root@bird-htc-sched02 ~]# condor_status -constraint 'gpus >= 1' -af:h Name CUDACapability CUDADeviceName CUDADriverVersion CUDAGlobalMemoryMb CUDARuntimeVersion
Name              CUDACapability        CUDADeviceName      CUDADriverVersion     CUDAGlobalMemoryMb CUDARuntimeVersion   
batchg001.desy.de 6.1                   GeForce GTX 1080 Ti 10.0                  11178              8.0                  
batchg002.desy.de 6.1                   GeForce GTX 1080 Ti 10.0                  11178              8.0    

# list software/firmware versions of a specific node
[chbeyer@htc-it02]~/htcondor/testjobs% condor_status -l batchg004.desy.de | grep -i cuda  
CUDACapability = 6.1
CUDADeviceName = "GeForce GTX 1080 Ti"
CUDADriverVersion = 10.0
CUDAECCEnabled = false
CUDAGlobalMemoryMb = 11178
CUDARuntimeVersion = 8.0
DetectedGPUs = "CUDA0"
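
To see in advance which nodes would satisfy a memory requirement, the autoformat output can be filtered. This is a sketch under assumptions: nodes_with_gpu_mem is a hypothetical helper, and the input is expected to be "Name CUDAGlobalMemoryMb" pairs, one per line.

```shell
#!/bin/sh
# Hypothetical helper: read "<Name> <CUDAGlobalMemoryMb>" lines from standard
# input and print the names of nodes with at least the given GPU memory in MB.
# Non-numeric values (for example "undefined") count as 0 and are skipped.
nodes_with_gpu_mem() {
    awk -v min="$1" '$2 + 0 >= min + 0 {print $1}'
}

# Assumed usage on a host with HTCondor installed (left commented out here):
# condor_status -constraint 'GPUs >= 1' -af Name CUDAGlobalMemoryMb | nodes_with_gpu_mem 10000
```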
           


List GPU jobs

[chbeyer@htc-it02]~/htcondor/testjobs% condor_q -constraint 'RequestGPUs >= 1'
-- Schedd: bird-htc-sched02.desy.de : <131.169.56.95:9618?... @ 11/07/18 10:36:14
OWNER   BATCH_NAME     SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
chbeyer ID: 9902561  11/7  09:44      _      1      8     11 9902561.2-10
chbeyer ID: 9903331  11/7  10:31      _      _     11     11 9903331.0-10
chbeyer ID: 9903332  11/7  10:31      _      _     11     11 9903332.0-10


GPU-top

[root@batchg001 ~]# nvidia-smi
Wed Nov  7 10:38:21 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
|100%   90C    P2    79W / 250W |  10077MiB / 11178MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    279258      C   .../htcondor_exec/gpu-burn-master/gpu_burn 10067MiB |
+-----------------------------------------------------------------------------+
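
For logging from inside a job, the full table is often more than needed. nvidia-smi can also print selected values as CSV, which is easier to process. The sketch below assumes one GPU per node as described above; gpu_summary is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: turn one "utilization, memory used, memory total" CSV
# line (as printed by the nvidia-smi query in the comment below) into a
# one-line summary with the memory usage as a truncated percentage.
gpu_summary() {
    awk -F', *' '{printf "util=%d%% mem=%d/%dMiB (%d%%)\n", $1, $2, $3, ($2 * 100) / $3}'
}

# Assumed usage on a GPU node (left commented out here):
# nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits | gpu_summary
```

With the values from the listing above (100 % utilization, 10077 MiB of 11178 MiB), the helper prints util=100% mem=10077/11178MiB (90%).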

Examples

Get an interactive session on a GPU node
[chbeyer@batchg002]~/htcondor/testjobs% cat gpu_interactive.submit 
Requirements = OpSysAndVer == "CentOS7"
Request_GPUs = 1
queue


[chbeyer@htc-it02]~/htcondor/testjobs% condor_submit -i gpu_interactive.submit
Submitting job(s).
1 job(s) submitted to cluster 4777582.
Waiting for job to start...
Welcome to batchg002.desy.de!

[chbeyer@batchg002]~/htcondor/testjobs% nvidia-smi 
Thu Feb 21 09:15:12 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
| 47%   48C    P5    15W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Get an interactive session on a node with specific properties
# You can require any of a node's listed ClassAd attributes for your job (see "List Capabilities and Software versions of BIRD GPU nodes"), for example:

[chbeyer@batchg002]~/htcondor/testjobs% cat gpu_interactive.submit 
Requirements = OpSysAndVer == "CentOS7" && (CUDAGlobalMemoryMb > 10000) && (CUDARuntimeVersion == 8.0)
Request_GPUs = 1
queue

[chbeyer@htc-it02]~/htcondor/testjobs% condor_submit -i gpu_interactive.submit
Submitting job(s).
1 job(s) submitted to cluster 4777641.
Waiting for job to start...
Welcome to batchg002.desy.de!

Regular batch job using GPU resources on the NAF
[chbeyer@batchg002]~/htcondor/testjobs% cat  sleep.submit
# Unix submit description file
# sleep.submit -- simple sleep job using GPU resources

executable              = /afs/desy.de/user/c/chbeyer/htcondor_exec/sleep_runtime.sh
output                  = /afs/desy.de/user/c/chbeyer/out_$(Cluster)_$(Process).txt
error                   = /afs/desy.de/user/c/chbeyer/error_$(Cluster)_$(Process).txt
Requirements = OpSysAndVer == "CentOS7"
Request_GPUs = 1

# uncomment this if you want to use the job specific variables $CLUSTER and $PROCESS inside your batch job
#environment = "CLUSTER=$(Cluster) PROCESS=$(Process)"

# uncomment this to request a runtime other than the default 3 hours (time in seconds; 6000 s = 100 min)
#+RequestRuntime = 6000

# uncomment this to specify an argument given to the executable
#Args = 20

# uncomment this to give this batch job an individual name tag to find it easily in the queue
#batch_name = sleep_test_2

queue 1
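
Inside the job it can be useful to record which GPU was assigned before the real work starts. The following job script fragment is a sketch only: it assumes that the batch system exports CUDA_VISIBLE_DEVICES for jobs that requested a GPU, and report_gpus is a hypothetical function name. It is not the content of sleep_runtime.sh, which is not shown here.

```shell
#!/bin/sh
# Hypothetical job script fragment: report which GPU(s) the job can see.
# Assumption: CUDA_VISIBLE_DEVICES is set by the batch system for GPU jobs.
report_gpus() {
    if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "assigned GPU(s): ${CUDA_VISIBLE_DEVICES}"
    else
        echo "no GPU assigned"
    fi
}

# Write the result to the job's standard output before the payload runs.
report_gpus
```

The message ends up in the file given as output in the submit file, which makes it easy to confirm afterwards that the job really ran with a GPU.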