  • settings for running grid jobs are organized in /etc/condor/config.d/01grid.conf
    • since both grid and NAF users run on this HTCondor pool, settings relevant to only one of them are kept in separate files (see the directory listing after this list)
    • for NAF: todo
  • every 180s each node checks its own health (free disk space, state of the CVMFS mounts, ...)
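
With those files in place, the config directory looks roughly as follows (the per-file comments just summarize the sections further down; 02naf.conf is still empty):

ls /etc/condor/config.d/
00worker.conf   # general worker node setup, shared by grid and NAF
01grid.conf     # grid-only attributes
02naf.conf      # NAF-only settings (todo)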

cgroup setup

  • the cgroup setup differs between Scientific Linux 6 (kernel 2.6*) and CentOS 7 (kernel 3.10*)
    • under SL6/SysV you need to install the libcgroup tools and configure them yourself
    • under EL7 systemd handles the cgroup hierarchy itself

EL7

(thanks to Brian for the help)

  • Condor job cgroups can be put under the systemd slice of the condor service (a dedicated slice might also be possible)

BASE_CGROUP = /system.slice/condor.service

  • with this setting a job's cgroup should end up under

/sys/fs/cgroup/{cpu,cpuacct|memory|...}/system.slice/condor.service/condor_var_lib_condor_execute_slot1_*@WN.FQDN.HERE/
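
To check that running jobs really land there, one can inspect the slice on a worker node; the slot name below is just an example:

# list the cgroup tree of the condor service
systemd-cgls /system.slice/condor.service

# peek at the memory accounting of one job cgroup
cat /sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_1@WN.FQDN.HERE/memory.usage_in_bytes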

SL6

  • libcgroup needs to be installed and the services cgconfig and cgred enabled and started
  • to enable cgroups, configure a mother cgroup 'htcondor' (referenced by BASE_CGROUP) that will later contain all jobs as sub-cgroups (see the cgconfig.conf sketch after this list); the per-job cgroups then show up under

 

/cgroup/{cpu|cpuacct|memory}/htcondor/condor_var_lib_condor_execute_slot1_*@WN.FQDN.HERE/

      • remember: the CPU shares of the slot cgroups are relative to each other within the htcondor mother cgroup, whose share is in turn relative to the other top-level cgroups
      • we configure the slots as partitionable so that single-core and multi-core jobs can mix
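
A minimal /etc/cgconfig.conf entry for the mother cgroup could look roughly like this (the controller list is an example; adjust as needed):

group htcondor {
    cpu {}
    cpuacct {}
    memory {}
    freezer {}
    blkio {}
}

After (re)starting cgconfig and cgred, point BASE_CGROUP to htcondor as in 00worker.conf below.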

 

00worker.conf

DAEMON_LIST = MASTER, STARTD
DEFAULT_DOMAIN_NAME = desy.de
UID_DOMAIN = desy.de
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
ALLOW_WRITE = *.$(UID_DOMAIN)
ALLOW_READ = *.$(UID_DOMAIN)
CONDOR_ADMIN = iMAIL@HERE.FOO
CONDOR_HOST = condor01.desy.de
COLLECTOR_NAME = Test Condor Pool - $(CONDOR_HOST)
StartJobs = true
STARTD_ATTRS = StartJobs, $(STARTD_ATTRS)
# When is this node willing to run jobs?
START = (NODE_IS_HEALTHY =?= True) && (StartJobs =?= True)

# permanent way of stopping jobs from starting (StartJobs can be set persistently from the central manager)
HOSTALLOW_CONFIG = $(CONDOR_HOST)
ALLOW_CONFIG = $(CONDOR_HOST)
ENABLE_RUNTIME_CONFIG = True
RUNTIME_CONFIG_ADMIN = $(CONDOR_HOST)
STARTD.SETTABLE_ATTRS_ADMINISTRATOR = StartJobs
ENABLE_PERSISTENT_CONFIG = True
PERSISTENT_CONFIG_DIR = /etc/condor/persistent
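
# example (sketch): with the settings above the central manager can persistently
# disable job starts on a worker node and re-enable them later, e.g.
#   condor_config_val -name WN.FQDN.HERE -startd -set "StartJobs = false"
#   condor_reconfig WN.FQDN.HERE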

# use one shared port
USE_SHARED_PORT = True
SHARED_PORT_ARGS = -p 9620
COLLECTOR_HOST = $(CONDOR_HOST):9618

# Enable CGROUP control
# set one of the following depending on the OS:
# BASE_CGROUP = htcondor                        # SL6
# BASE_CGROUP = /system.slice/condor.service    # EL7
# hard: job can't access more physical memory than allocated
# soft: job can access more physical memory than allocated as long as free memory is available
CGROUP_MEMORY_LIMIT_POLICY = soft

# slots
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = true
COUNT_HYPERTHREAD_CPUS = true

# startd hook to check if node is healthy
STARTD_CRON_JOBLIST = NODEHEALTH
STARTD_CRON_NODEHEALTH_EXECUTABLE = /etc/condor/tests/healthcheck_wn_condor.sh
STARTD_CRON_NODEHEALTH_PERIOD = 180s
STARTD_CRON_NODEHEALTH_MODE = Periodic
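
A minimal sketch of what /etc/condor/tests/healthcheck_wn_condor.sh could look like is given below (the threshold, the execute directory and the CVMFS repository are assumptions); all it has to do is print the NODE_IS_HEALTHY attribute that the START expression evaluates:

#!/bin/bash
# sketch of a startd cron health check: publish NODE_IS_HEALTHY = True/False
healthy=True

# enough free space in the execute directory? (10 GB threshold is an assumption)
free_kb=$(df -Pk /var/lib/condor/execute 2>/dev/null | awk 'NR==2 {print $4}')
[ "${free_kb:-0}" -lt 10485760 ] && healthy=False

# CVMFS mount readable? (repository is an example)
ls /cvmfs/grid.cern.ch >/dev/null 2>&1 || healthy=False

echo "NODE_IS_HEALTHY = ${healthy}"
# a single dash ends the ClassAd for the startd cron mechanism
echo "-"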

 

01grid.conf

GRID_RESOURCE = true

# the worker node starts both single-core and multi-core jobs
# to push multi-core jobs, resources are partly drained and allocated to multicore-only jobs
# (a sketch of how OnlyMulticore could enter START follows after this file)
# https://www.gridpp.ac.uk/wiki/Example_Build_of_an_ARC/Condor_Cluster#Fallow
OnlyMulticore = False # legacy

# worker node attributes
STARTD_ATTRS = $(STARTD_ATTRS), GRID_RESOURCE, OnlyMulticore
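
As a sketch (not part of the current config): while a node is being drained for multi-core jobs, OnlyMulticore could be flipped to True and folded into the START expression so that only jobs requesting more than one core are accepted:

# sketch: only accept multi-core jobs while OnlyMulticore is set
START = $(START) && ((OnlyMulticore =?= False) || (TARGET.RequestCpus > 1))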

02naf.conf

(still empty; NAF-specific settings are a todo, see above)

 

 

 
