A directed acyclic graph (DAG) can be used to represent a set of computations where the input, output, or execution of
one or more computations is dependent on one or more other computations. The computations are nodes (vertices) in
the graph, and the edges (arcs) identify the dependencies. HTCondor finds machines for the execution of programs,
but it does not schedule programs based on dependencies. The Directed Acyclic Graph Manager (DAGMan) is a
meta-scheduler for the execution of programs (computations). DAGMan submits the programs to HTCondor in an order
represented by a DAG and processes the results. A DAG input file describes the DAG.
DAGMan is itself executed as a scheduler universe job within HTCondor. It submits the HTCondor jobs within
nodes in such a way as to enforce the DAG’s dependencies. DAGMan also handles recovery and reporting on the
HTCondor jobs. (From the HTCondor manual: http://research.cs.wisc.edu/htcondor/manual/v8.7/2_10DAGMan_Applications.html)
The DAG input file describes the jobs to execute and the dependencies between them, e.g.:
[chbeyer@htc-cms01]~/htcondor/testjobs% cat sleep_dag.submit
# File name: sleep.dag
JOB A sleep_job.condor
JOB B sleep_job.condor
JOB C sleep_job.condor
JOB D sleep_job.condor
PARENT A CHILD B C
PARENT B C CHILD D
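The four JOB lines define the nodes, each pointing at the HTCondor submit file used for that node's job. The two PARENT ... CHILD lines define a diamond-shaped dependency: B and C only start once A has completed successfully, and D only starts once both B and C are done. As a sketch:

      A
     / \
    B   C        (PARENT A CHILD B C)
     \ /
      D          (PARENT B C CHILD D)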
The actual 'worker jobs' are regular HTCondor jobs and all common rules apply.
As these jobs are submitted from the scheduler universe on the scheduler itself, they cannot inherit your project, hence you must put an appropriate project in the submit file!
[chbeyer@htc-it02]~/htcondor/testjobs% cat sleep_job.condor
# Unix submit description file
# sleep.sub -- simple sleep job
executable = /afs/desy.de/user/c/chbeyer/htcondor_exec/sleep.sh
log        = /afs/desy.de/user/c/chbeyer/log_$(Cluster)_$(Process).log
output     = /afs/desy.de/user/c/chbeyer/out_$(Cluster)_$(Process).txt
error      = /afs/desy.de/user/c/chbeyer/error_$(Cluster)_$(Process).txt
+MyProject = "support"
queue
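The executable referenced by sleep_job.condor is not shown on this page; a minimal sleep.sh could look like the following sketch (the 60-second duration is an arbitrary assumption). DAGMan only considers a node successful when its job exits with status 0, so the explicit exit matters:

#!/bin/bash
# Hypothetical payload: sleep for a while, then signal success to DAGMan
# (a node only counts as done if the job exits with status 0).
sleep 60
exit 0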
At submit time of the DAG you can add 'MyProject' if you need to do so. Omitting 'MyProject' will result in a job with the default accounting group of the host that you use as a submit node.
[chbeyer@htc-cms01]~/htcondor/testjobs% condor_submit_dag -append '+MyProject = "support"' sleep_dag.submit
Running rescue DAG 7
-----------------------------------------------------------------------
File for submitting this DAG to HTCondor : sleep_dag.submit.condor.sub
Log of DAGMan debugging messages         : sleep_dag.submit.dagman.out
Log of HTCondor library output           : sleep_dag.submit.lib.out
Log of HTCondor library error messages   : sleep_dag.submit.lib.err
Log of the life of condor_dagman itself  : sleep_dag.submit.dagman.log
Submitting job(s).
1 job(s) submitted to cluster 771718.
-----------------------------------------------------------------------
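The 'Running rescue DAG 7' line indicates that a previous run of this DAG did not finish: whenever a DAG fails, DAGMan writes a rescue DAG file (here presumably sleep_dag.submit.rescue007), and a later resubmission resumes from the most recent rescue file rather than starting over. If you want to rerun the whole DAG from scratch instead, something like the following should work (a sketch; -force regenerates the submit files and sets existing rescue DAGs aside):

condor_submit_dag -force sleep_dag.submit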
[chbeyer@htc-cms01]~/htcondor/testjobs% condor_q -nobatch

-- Schedd: bird-htc-sched01.desy.de : <131.169.56.32:9618?... @ 02/07/18 13:48:01
 ID       OWNER    SUBMITTED   RUN_TIME   ST PRI SIZE CMD
 771718.0 chbeyer  2/7  13:47  0+00:00:28 R  0   0.3  condor_dagman -p 0 -f -l . -Lockfile sleep_dag.submit.lock -AutoRescue 1 -DoR
 771719.0 chbeyer  2/7  13:47  0+00:00:00 I  0   0.0  sleep.sh
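The first entry (771718.0) is the condor_dagman scheduler universe job itself; the second (771719.0) is node A's sleep.sh, already submitted by DAGMan. Two convenient ways to watch the DAG's progress (a sketch; the -dag option groups the node jobs under their DAGMan job):

condor_q -nobatch -dag
tail -f sleep_dag.submit.dagman.out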
HINT:
Do not submit the same DAG, with the same DAG input file, from within the same directory, such that more than one instance of this DAG is running at the same time. It will fail in an unpredictable manner, as each instance will attempt to use the same files to enforce the dependencies. You can, however, run the same DAG input file from a different directory, or rename the file per submission without altering its content, to overcome this!
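As a concrete instance of the two work-arounds (a sketch; the directory and file names are arbitrary):

# Variant 1: same files, separate directory per DAG instance
mkdir run2
cp sleep_dag.submit sleep_job.condor run2/
(cd run2 && condor_submit_dag sleep_dag.submit)

# Variant 2: same directory, renamed DAG input file per submission
cp sleep_dag.submit sleep_dag_run2.submit
condor_submit_dag sleep_dag_run2.submit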