
Situation

A grid Compute Element running ARC CE version 5.2.1 with HTCondor as LRMS and the local condor client on version 8.6.0 ran out of memory and started to swap. Starting at the beginning of the month, more and more memory got allocated. Restarting the condor daemons or the whole server brought only short relief; the node ran out of memory again within hours.

Several condor_schedd processes were running in parallel, each using ~4GB of memory. The condor master process had started one scheduler (/var/log/condor/SchedLog), which every few minutes spawned new schedds that ran for several seconds to minutes and disappeared again. Since the parent schedd kept running, ScheddRestartReport did not report much. The short-running condor_schedd processes' /proc/<PID>/environ and cmdline contained nothing or 'condor_schedd -f', respectively. The sub-schedds got logged in the ProcLog, mostly running for less than two minutes:

> grep -r "2166254" /var/log/condor/
/var/log/condor/ProcLog:02/13/17 11:19:39 : method ENV: found family 2022208 for process 2166254
/var/log/condor/ProcLog:02/13/17 11:19:39 : method PARENT: found family 2022208 for process 2166254 (already determined)
/var/log/condor/ProcLog:02/13/17 11:21:39 : process 2166254 (of family 2022208) has exited
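
Since /proc/<PID>/cmdline and environ are NUL-separated, they are easiest to inspect with a small translation, e.g. (a sketch, using the PID from the ProcLog excerpt above while such a sub-schedd is still alive):

# PID taken from the ProcLog example above
> tr '\0' ' ' < /proc/2166254/cmdline; echo
> tr '\0' '\n' < /proc/2166254/environ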

The sub-schedds had quite some hunger for memory (see the process status below), while their open files/sockets were unsuspicious (only /var/lib/condor/spool/job_queue.log and /var/lib/condor/spool/history were open).

/proc/2181602/status
Name:   condor_schedd

...
VmPeak:  6826732 kB
VmSize:  6826728 kB
VmLck:         0 kB
VmHWM:   6760676 kB
VmRSS:   6760676 kB
VmData:  6766040 kB
VmStk:        88 kB
VmExe:      1268 kB
VmLib:     11632 kB
VmPTE:     13360 kB
VmSwap:     6292 kB
Threads:        1
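
The open files and sockets were checked along these lines (a sketch; the PID is the sub-schedd from the status dump above):

# lsof is assumed to be installed; the plain /proc listing works without it
> lsof -p 2181602
> ls -l /proc/2181602/fd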

Downgrading condor packages to 8.4.10 did not resolve the issue.

Cause

The picture became clearer after realizing that ~160,000 jobs were actually in status 'held' in Condor.

> condor_status -schedd "grid-arcce1.desy.de"
Name                Machine             RunningJobs   IdleJobs   HeldJobs

grid-arcce1.desy.de grid-arcce1.desy.de          84          0     161617

                      TotalRunningJobs      TotalIdleJobs      TotalHeldJobs

Before, monitoring and warnings were mainly oriented towards the number of running jobs, i.e., keeping the cluster full, while other states had to be checked manually. Over a week, the number of held jobs rose from a few to ~160k.
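
The per-state counters can also be queried directly from the schedd ad, which makes such a check easy to script (a sketch using condor_status' autoformat option; attribute names as in the output above):

> condor_status -schedd grid-arcce1.desy.de -af Name TotalRunningJobs TotalIdleJobs TotalHeldJobs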

Apparently, the jobs were sent into hold when their files could not be accessed by their shadow daemons. Presumably, the system then got overwhelmed by the sheer number of held jobs, with /var/lib/condor/spool/job_queue.log having already grown to 1.4GB.

> condor_q  -hold
...
1622466.0   cmsger014       2/3  23:37 Error from slot1@batch0236.desy.de: SHADOW at 131.169.223.111 failed to send file(s) to <131.169.160.67:64456>: error reading from /var/spool/arc/grid/wNCNDmxJ1tpnvxDnJpv7\
dLGoABFKDmABFKDmjpwXDmABFKDmRup1lm/.gahp_complete: (errno 2) No such file or directory; STARTER failed to receive file(s) from <131.169.223.111:9620>
1622467.0   cmsplt036       2/3  23:41 Failed to initialize user log to /var/spool/arc/grid/MeKMDmzJ1tpnvxDnJpv7dLGoABFKDmABFKDmtavXDmKBFKDmGTmUhn/log or /dev/null                                               \

1622468.0   cmsplt036       2/3  23:41 Error from slot1@batch0678.desy.de: SHADOW at 131.169.223.111 failed to send file(s) to <131.169.162.238:56422>: error reading from /var/spool/arc/grid/H4fKDmxJ1tpnvxDnJpv\
7dLGoABFKDmABFKDmtavXDmABFKDmeC6Bin/.gahp_complete: (errno 2) No such file or directory; STARTER failed to receive file(s) from <131.169.223.111:9620>
1622469.0   cmsplt036       2/3  23:41 Error from slot1@batch0445.desy.de: SHADOW at 131.169.223.111 failed to send file(s) to <131.169.163.192:62262>: error reading from /var/spool/arc/grid/FPULDmyJ1tpnvxDnJpv\
7dLGoABFKDmABFKDmtavXDmFBFKDmSXUsFn/.gahp_complete: (errno 2) No such file or directory; STARTER failed to receive file(s) from <131.169.223.111:9620>
1622470.0   cmsger014       2/4  20:52 Error from slot1@batch0236.desy.de: STARTER at 131.169.160.67 failed to send file(s) to <131.169.223.111:9620>; SHADOW at 131.169.223.111 failed to write to file /var/spoo\
l/arc/grid/iHPODmxJ1tpnvxDnJpv7dLGoABFKDmABFKDmjpwXDmBBFKDm9hvuqn.comment: (errno 13) Permission denied

...
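
To get an overview of why jobs end up on hold, the hold reasons can be aggregated, e.g. (a sketch; -allusers and the -af/autoformat option are assumed to be available in the installed condor_q):

> condor_q -name grid-arcce1.desy.de -allusers -const 'JobStatus==5' -af HoldReason | sort | uniq -c | sort -rn | head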

Troubleshooting

After draining the ARC queue by setting the session directory to drain

sessiondir="/var/spool/arc/grid drain"

we tried to remove all held jobs in Condor, which failed repeatedly with different errors:

> condor_rm -name grid-arcce1.desy.de -const 'jobstatus==5'
SECMAN:2007:Failed to end classad message.
Couldn't find/remove all jobs matching constraint (jobstatus==5)
> condor_rm -name grid-arcce1.desy.de -const 'jobstatus==5'
DCSchedd::actOnJobs:6004:Can't read response ad
Couldn't find/remove all jobs matching constraint (jobstatus==5)

probably because Condor's daemon communication was overwhelmed by the large number of jobs.
Thus, we stopped Condor, moved /var/lib/condor/spool away, reinstalled all condor packages for a fresh start and ramped it up again. Since the cleanup, the CE has performed well, with Condor behaving as expected.
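
The cleanup amounted to roughly the following (a sketch; the service command and package manager depend on the OS release and are assumptions, the spool path is the one from above):

# assuming a yum-based node; adapt service/package commands as needed
> service condor stop
> mv /var/lib/condor/spool /var/lib/condor/spool.broken
> yum reinstall "condor*"
> service condor start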

Precautions

  • adding checks that also raise warnings when a scheduler's absolute number of jobs or the ratio of running to held jobs crosses a threshold
  • limiting the total number (running+held+...) of jobs on a scheduler
    • on the ARC side, setting values for all four maxjobs numbers ~~> man arc.conf : "number4 - jobs in whole system"
      maxjobs =  "#### #### #### ####"
    • on the Condor side, by capping the schedd's queue size in the condor configuration (see the sketch below)
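
For the Condor side, a minimal sketch with illustrative values for the condor configuration (MAX_JOBS_SUBMITTED caps the total number of jobs a schedd accepts; MAX_JOBS_PER_OWNER, available in recent 8.5/8.6 releases, additionally caps jobs per submitting user):

# illustrative values only - tune to the site's capacity
MAX_JOBS_SUBMITTED = 20000
MAX_JOBS_PER_OWNER = 10000

Submissions beyond these limits are rejected at submit time instead of piling up in the schedd's queue.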

Original Cause(?)

Presumably, the original cause was transparent maintenance work on the CE's VM hypervisor at the beginning of the month. If the maintenance was less transparent than promised, reads/writes to the VM's disk image may have been hampered, making the shadows sad. However, the story is not completely convincing, since the number of held jobs grew over a week while the hypervisor intervention took only a day.

 

Comments

Petr Vokac at CERN

It is caused by the different default condor_q output format in HTCondor 8.5.5+ and 8.6.0+ (see https://ggus.eu/index.php?mode=ticket_info&ticket_id=123947#update#25). There was an attempt to fix this issue in ARC-CE (http://bugzilla.nordugrid.org/show_bug.cgi?id=3603), but unfortunately it used only the HTCondor environment variable from comment #24, while the environment variable from comment #25 is also necessary.

ARC-CE has to force condor_q to produce output compatible with former versions of HTCondor, which can be achieved by appending the following environment variables to /usr/share/arc/configure-condor-env.sh:

_condor_CONDOR_Q_ONLY_MY_JOBS=false
export _condor_CONDOR_Q_ONLY_MY_JOBS
_condor_CONDOR_Q_DASH_BATCH_IS_DEFAULT=false
export _condor_CONDOR_Q_DASH_BATCH_IS_DEFAULT