cgroup setup

  • cgroup setup differs between Scientific Linux 6 (kernel 2.6*) and CentOS 7 (kernel 3.10*)
    • under SL6/SysV you need to install the libcgroup tools (cgconfig etc.) and configure them
    • under EL7, systemd manages cgroup resources itself
    • note: the CPU shares of the slot cgroups are relative to each other within the condor parent cgroup, and the parent's own share is in turn relative to the other cgroups at its level
    • slots are configured as partitionable so that single-core and multi-core jobs can mix
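
The nesting of CPU shares described in the note above can be made concrete with a little arithmetic. The following is only a sketch: the function name and all share values are hypothetical and chosen for illustration. It also assumes the usual cgroup v1 behaviour that shares only matter when CPUs are contended.

```shell
# effective_cpu_percent PARENT_SHARES LEVEL_TOTAL JOB_SHARES JOBS_TOTAL
#   PARENT_SHARES: cpu.shares of the condor parent cgroup
#   LEVEL_TOTAL:   sum of cpu.shares of all cgroups at the parent's level
#   JOB_SHARES:    cpu.shares of one job/slot cgroup
#   JOBS_TOTAL:    sum of cpu.shares of all job cgroups below the condor parent
# Prints the worst-case (fully contended) CPU percentage of that job.
effective_cpu_percent() {
    awk -v p="$1" -v pt="$2" -v j="$3" -v jt="$4" \
        'BEGIN { printf "%.1f\n", 100 * (p / pt) * (j / jt) }'
}

# Hypothetical numbers: condor holds 1024 of 2048 shares at its level,
# and the job holds 100 of 400 shares inside the condor cgroup.
effective_cpu_percent 1024 2048 100 400
```

With these made-up values the job is guaranteed half of a quarter of the machine under full contention; when the node is idle it can use more.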

EL7

(thanks to Brian for the help)

  • Condor job cgroups can be placed under the system slice, inside the cgroup of the condor service (a dedicated resource unit might also be possible)

BASE_CGROUP = /system.slice/condor.service

  • with this setting, a job's cgroup should appear under

/sys/fs/cgroup/{cpu,cpuacct/memory/...}/system.slice/condor.service/condor_var_lib_condor_execute_slot1_*\@WN.FQDN.HERE/
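
To check on a worker node that job cgroups really show up below BASE_CGROUP, a small helper along these lines may be handy. It is a sketch, not part of the original setup: the function name `list_job_cgroups` is made up, and it only assumes the cgroup v1 layout and the `condor_*` directory naming shown in the path above.

```shell
# list_job_cgroups ROOT CONTROLLER BASE_CGROUP
#   ROOT:        cgroup mount root, e.g. /sys/fs/cgroup on EL7
#   CONTROLLER:  controller directory, e.g. memory
#   BASE_CGROUP: value of BASE_CGROUP from the Condor config
# Prints one line per job cgroup directory found below BASE_CGROUP.
list_job_cgroups() (
    root="$1"
    controller="$2"
    base="${3#/}"   # strip a leading slash, if any
    for dir in "$root/$controller/$base"/condor_*; do
        # An unmatched glob stays literal, so check that the entry exists.
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
        fi
    done
)
```

On an EL7 node with running jobs, `list_job_cgroups /sys/fs/cgroup memory /system.slice/condor.service` should then print one directory per active slot; no output means no job cgroup was found for that controller.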

SL6

  • libcgroup needs to be installed and services cgconfig and cgred enabled/started
  • to enable cgroups, configure a parent cgroup 'htcondor'
    • in the Condor worker node config as BASE_CGROUP; it will later contain all jobs as sub-cgroups
    • and in the basic cgroup config (the htcondor group definition can be placed in /etc/cgconfig.d/... instead)

...

 

...

...

  • with this, a job's cgroup resource information and limits should be available under

/cgroup/{cpu/cpuacct/memory}/htcondor/condor_var_lib_condor_execute_slot1_*\@WN.FQDN.HERE/
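
The limits themselves are plain files inside the job's cgroup directory, so they can be read directly. The helper below is a sketch with a made-up name (`show_cgroup_values`); the file names in the usage note are the standard cgroup v1 ones, and which of them Condor actually sets depends on the configured policy.

```shell
# show_cgroup_values DIR FILE...
#   DIR:  one job cgroup directory of a single controller
#   FILE: control files to read, e.g. memory.limit_in_bytes
# Prints "file = value" per file, or a marker if the file is not readable.
show_cgroup_values() (
    dir="$1"
    shift
    for file in "$@"; do
        if [ -r "$dir/$file" ]; then
            printf '%s = %s\n' "$file" "$(cat "$dir/$file")"
        else
            printf '%s = (not available)\n' "$file"
        fi
    done
)
```

For a job directory below /cgroup/memory/htcondor/ one might call it with `memory.limit_in_bytes memory.soft_limit_in_bytes memory.usage_in_bytes`, and for the matching directory below the cpu controller with `cpu.shares`.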

00worker.conf

DAEMON_LIST = MASTER, STARTD
DEFAULT_DOMAIN_NAME = desy.de
UID_DOMAIN = desy.de
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
ALLOW_WRITE = *.$(UID_DOMAIN)
ALLOW_READ = *.$(UID_DOMAIN)
CONDOR_ADMIN = iMAIL@HERE.FOO
CONDOR_HOST = condor01.desy.de
COLLECTOR_NAME = Test Condor Pool - $(CONDOR_HOST)
StartJobs = true
STARTD_ATTRS = StartJobs, $(STARTD_ATTRS)
# When is this node willing to run jobs?
START = (NODE_IS_HEALTHY =?= True) && (StartJobs =?= True)

# Permanent way of stopping jobs from starting
HOSTALLOW_CONFIG = $(CONDOR_HOST)
ALLOW_CONFIG = $(CONDOR_HOST)
ENABLE_RUNTIME_CONFIG = True
RUNTIME_CONFIG_ADMIN = $(CONDOR_HOST)
STARTD.SETTABLE_ATTRS_ADMINISTRATOR = StartJobs
ENABLE_PERSISTENT_CONFIG = True
PERSISTENT_CONFIG_DIR = /etc/condor/persistent

# use one shared port
USE_SHARED_PORT = True
SHARED_PORT_ARGS = -p 9620
COLLECTOR_HOST = $(CONDOR_HOST):9618

# Enable CGROUP control
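# BASE_CGROUP depends on the OS. HTCondor does not treat '#' after a value as a
# comment, so the alternatives are listed on their own lines:
#   SL6: BASE_CGROUP = htcondor
#   EL7: BASE_CGROUP = /system.slice/condor.service
# (EL7 value active here; use htcondor on SL6)
BASE_CGROUP = /system.slice/condor.service
# hard: job can't access more physical memory than allocated
# soft: job can access more physical memory than allocated while free memory is available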
CGROUP_MEMORY_LIMIT_POLICY = soft

# slots
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = true
COUNT_HYPERTHREAD_CPUS = true

# startd hook to check if node is healthy
STARTD_CRON_JOBLIST = NODEHEALTH
STARTD_CRON_NODEHEALTH_EXECUTABLE = /etc/condor/tests/healthcheck_wn_condor.sh
STARTD_CRON_NODEHEALTH_PERIOD = 180s
STARTD_CRON_NODEHEALTH_MODE = Periodic

...
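
The health check script referenced by STARTD_CRON_NODEHEALTH_EXECUTABLE is not shown on this page. A startd cron job reports its result by printing ClassAd attributes as `Name = Value` lines on stdout, which is how NODE_IS_HEALTHY ends up in the slot ad used by the START expression. The following is only a minimal sketch of such a script: the checks, the function name `node_health_ad`, the thresholds, and the extra attribute NODE_HEALTH_REASON are assumptions, not the actual DESY tests.

```shell
#!/bin/sh
# node_health_ad EXEC_DIR MIN_FREE_KB
#   EXEC_DIR:    Condor execute directory to check
#   MIN_FREE_KB: minimum free space required there, in KiB
# Prints ClassAd attributes for the startd; NODE_HEALTH_REASON is a
# hypothetical extra attribute for debugging.
node_health_ad() (
    exec_dir="$1"
    min_free_kb="$2"
    healthy=True
    reason="ok"
    if [ ! -d "$exec_dir" ] || [ ! -w "$exec_dir" ]; then
        healthy=False
        reason="execute dir missing or not writable"
    else
        # Column 4 of POSIX df output is the available space in KiB.
        free_kb=$(df -Pk "$exec_dir" | awk 'NR==2 {print $4}')
        if [ "${free_kb:-0}" -lt "$min_free_kb" ]; then
            healthy=False
            reason="low disk space"
        fi
    fi
    printf 'NODE_IS_HEALTHY = %s\n' "$healthy"
    printf 'NODE_HEALTH_REASON = "%s"\n' "$reason"
)
```

A real script would end with a call such as `node_health_ad /var/lib/condor/execute 1048576`; both arguments are example values. Since START compares with `=?=`, a node whose script never reports NODE_IS_HEALTHY will not start jobs either.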