Computing : Individual job status and return codes

HTCondor's handling of job return values is rather generic, and you will often want to find out which jobs actually succeeded in the way they were supposed to and which ones did not.

An elegant way around this limitation is to use 'condor_chirp' to create and alter ClassAd attributes of the job at runtime and/or to create custom status files or entries in the job logfile.

For all examples you need to enable this option in the submit file: +WantIOProxy = true
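
A minimal submit file with this option could look like the sketch below. The file names 'my_job.sub' and 'my_job.sh' and the log, output and error file patterns are assumptions chosen for illustration; only the '+WantIOProxy = true' line is taken from the text above.

```shell
#!/bin/sh
# Sketch: write a minimal submit file that enables the chirp I/O proxy.
# my_job.sub, my_job.sh and the file name patterns are illustrative assumptions.
cat > my_job.sub <<'EOF'
executable   = my_job.sh
log          = log_$(Cluster)_$(Process).log
output       = out_$(Cluster)_$(Process).txt
error        = err_$(Cluster)_$(Process).txt
+WantIOProxy = true
queue
EOF

# Submit it on a node where HTCondor is installed:
# condor_submit my_job.sub
```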

Creating custom entries in the job logfile

Inside your job you can use, for example:

<snip>

/usr/libexec/condor/condor_chirp ulog "Hello World - I am your condor job"

<snip>

Leading to an entry in the job logfile:

[chbeyer@htc-it02]~/htcondor/testjobs% cat /afs/desy.de/user/c/chbeyer/log_7455691_0.log
<snip>
...
008 (7455691.000.000) 08/20 13:15:54 Hello World - I am your condor job
...
005 (7455691.000.000) 08/20 13:15:54 Job terminated.
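
If you write many such entries, a small wrapper keeps the job script readable. The following is a minimal sketch: 'job_ulog' is a hypothetical helper name, and the fallback to stderr is an assumption added so that the script can also be tested outside of HTCondor.

```shell
#!/bin/sh
# Sketch: a small wrapper around 'condor_chirp ulog' (job_ulog is a hypothetical name).
# CHIRP defaults to the path used in the examples above.
CHIRP="${CHIRP:-/usr/libexec/condor/condor_chirp}"

job_ulog() {
    if [ -x "$CHIRP" ]; then
        # Inside a job with +WantIOProxy = true: write to the job logfile.
        "$CHIRP" ulog "$*"
    else
        # Outside HTCondor (e.g. local testing): fall back to stderr.
        echo "ulog: $*" >&2
    fi
}

job_ulog "Starting step 1"
```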

Writing job states and/or return states after job finish into a custom file in your $HOME

Inside your job you can use, for example:

<snip>
echo "$CLUSTER I am feeling bad" | /usr/libexec/condor/condor_chirp put -mode wa - /afs/desy.de/user/c/chbeyer/my_logfile.txt
<snip>

In this case 'wa' means 'write' and 'append', so all of your jobs can write their status or return state into one file, which you can monitor using 'tail -f', for example.

[chbeyer@htc-it02]~/htcondor/testjobs% cat /afs/desy.de/user/c/chbeyer/my_logfile.txt
7456398 I am feeling fine
7456763 I am feeling fine
7456764 I am feeling fine
7456765 I am feeling fine
7456768 I am feeling fine
7456766 I am feeling fine
7456767 I am feeling fine
7456769 I am feeling fine
7456761 I am feeling bad
7456760 I am feeling bad
7456762 I am feeling bad
7456773 I am feeling bad
7456770 I am feeling bad
7456771 I am feeling bad
7456772 I am feeling bad
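
To pick the message automatically from the exit code of your payload, something like the following sketch could be used. The helper names 'status_line' and 'report_status' and the variable 'STATUS_FILE' are hypothetical. The sketch also assumes that 'CLUSTER' is exported to the job by the submit file, for instance with environment = "CLUSTER=$(Cluster)", and that 'STATUS_FILE' is set to an absolute path like the one in the example above.

```shell
#!/bin/sh
# Sketch: append one status line per job to a shared file, based on an exit code.
# Assumptions: CLUSTER is exported by the submit file, e.g.
#   environment = "CLUSTER=$(Cluster)"
# STATUS_FILE is a hypothetical variable; set it to an absolute path on the submit side.
CHIRP="${CHIRP:-/usr/libexec/condor/condor_chirp}"
STATUS_FILE="${STATUS_FILE:-my_logfile.txt}"

status_line() {
    # $1 = exit code of the payload
    if [ "$1" -eq 0 ]; then
        echo "${CLUSTER:-unknown} I am feeling fine"
    else
        echo "${CLUSTER:-unknown} I am feeling bad"
    fi
}

report_status() {
    if [ -x "$CHIRP" ]; then
        # Inside a job: append the line to the shared file via chirp.
        status_line "$1" | "$CHIRP" put -mode wa - "$STATUS_FILE"
    else
        # Outside HTCondor: just print the line.
        status_line "$1"
    fi
}

true    # replace with the real payload command
report_status "$?"
```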

Altering and adding ClassAd attributes of a running job from inside the job

You can use 'condor_chirp' to inject additional ClassAd attributes into the job, or alter existing ones, to reflect the current state of your job from inside the job. The charming thing about this is that you can then use the custom attribute to find or sort jobs using 'condor_q' while the jobs are running, or 'condor_history' once the jobs are done.

At any time inside your job you can set a ClassAd attribute of the running job, for example with state messages as shown here. The attribute is created on the fly if it does not exist yet. I named mine 'MyJobState' and 'MyJobReturn', but anything goes; just be sure not to overwrite an existing HTCondor ClassAd attribute:

my_job.sh :

/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"Starting"'
sleep 120 # do something here
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"1/10 Done"'
sleep 120 # do some more here
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"2/10 Done"'
sleep 120 # you got it ...
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"3/10 Done"'
sleep 120
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"4/10 Done"'
sleep 120
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"5/10 Done"'
sleep 120
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"6/10 Done"'
sleep 120
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"7/10 Done"'
sleep 120
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"8/10 Done"'
sleep 120
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"9/10 Done"'
sleep 120
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"Done"'
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobReturn' '"Good"'
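
The same progress reporting can be written more compactly as a loop. This is only a sketch: 'set_state' is a hypothetical helper, 'STEPS' is an assumed variable, and the ':' placeholders stand for your real work (the example above uses 'sleep 120'). Note that a string value has to be wrapped in double quotes to be a valid ClassAd string, which the helper does for you.

```shell
#!/bin/sh
# Sketch: progress reporting via set_job_attr, written as a loop.
# set_state is a hypothetical helper; STEPS and the placeholder work are assumptions.
CHIRP="${CHIRP:-/usr/libexec/condor/condor_chirp}"
STEPS="${STEPS:-10}"

set_state() {
    # $1 = attribute name, $2 = string value (quoted here for the ClassAd)
    if [ -x "$CHIRP" ]; then
        "$CHIRP" set_job_attr "$1" "\"$2\""
    else
        echo "$1 = \"$2\""    # outside HTCondor: show what would be set
    fi
}

set_state MyJobState "Starting"
i=1
while [ "$i" -lt "$STEPS" ]; do
    :    # do one unit of work here
    set_state MyJobState "$i/$STEPS Done"
    i=$((i + 1))
done
:    # last unit of work
set_state MyJobState "Done"
set_state MyJobReturn "Good"
```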

Now you can use 'condor_q' to check on your job states while the jobs are running (use 'condor_q -l' to see which other attributes, such as the submit time, you may want to list):

[chbeyer@htc-it02]~/htcondor/testjobs% condor_q -af ClusterID -af MyJobState
7453792 5/10 Done
7453806 4/10 Done
7453810 4/10 Done
7453815 4/10 Done
7453819 4/10 Done
7453823 4/10 Done
7453827 4/10 Done
7453831 3/10 Done
7453837 3/10 Done
7453843 3/10 Done
7453847 3/10 Done
7453851 3/10 Done
7453855 3/10 Done
7453860 2/10 Done
7453864 2/10 Done
7453868 2/10 Done
7453872 2/10 Done
7453876 2/10 Done
7453878 2/10 Done
7453880 2/10 Done
7453883 1/10 Done
7453885 1/10 Done
7453887 1/10 Done
7453889 1/10 Done
7453891 1/10 Done
7453893 1/10 Done
7453895 Starting
7453897 Starting
7453899 Starting
7453901 Starting
7453904 Starting

You can also list only the jobs that are in a certain state:

[chbeyer@htc-it02]~/htcondor/testjobs% condor_q -af ClusterID -constraint 'MyJobState == "3/10 Done"'
7453860
7453864
7453868
7453872
7453876
7453880

In my example above I put the final return state of my 'job' into the ClassAd attribute 'MyJobReturn', which I can query with 'condor_history' after the job has finished:

[chbeyer@htc-it02]~/htcondor/testjobs% condor_history -af ClusterID -af 'MyJobReturn'
7453806 Good
7453810 Good
7454011 False
7454010 False
7454013 False
7454012 False
7454009 False
7454008 False
7453999 Not so good
7454001 Not so good
7454004 Not so good
7454007 Not so good
7454003 Not so good
7454005 Not so good
7454002 Not so good
7454006 Not so good
7454000 Not so good
7453792 Good
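
For many jobs a summary per return state is often more useful than the full list. The sketch below pipes the same 'condor_history' output through standard tools; 'count_states' is a hypothetical helper name, and the guard around 'condor_history' is only there so that the script does not fail on a machine without HTCondor.

```shell
#!/bin/sh
# Sketch: count finished jobs per MyJobReturn value.
# count_states is a hypothetical helper that reads "ClusterId value..." lines.
count_states() {
    # Drop the first column (the cluster id), then count identical values.
    sed 's/^[^ ]* *//' | sort | uniq -c | sort -rn
}

if command -v condor_history >/dev/null 2>&1; then
    condor_history -af ClusterID -af MyJobReturn | count_states
fi
```

Applied to the listing above, this would report 9 jobs with 'Not so good', 6 with 'False' and 3 with 'Good'.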

See the manual page of condor_chirp for more information: https://htcondor.readthedocs.io/en/latest/man-pages/condor_chirp.html