HTCondor's handling of job return values is fairly generic, and it can be challenging to figure out which jobs actually succeeded in the way they were supposed to and which ones did not.
An elegant way around this limitation is to use 'condor_chirp' to create and alter ClassAd attributes of the job at runtime, and/or to write custom status files or custom entries in the job log file.
For all examples you need to enable this option in the submit file: +WantIOProxy = true
Creating custom entries in the job logfile
Inside your job you can use, for example:
<snip>
/usr/libexec/condor/condor_chirp ulog "Hello World - I am your condor job"
<snip>
This leads to the following entry in the job log file:
[chbeyer@htc-it02]~/htcondor/testjobs% cat /afs/desy.de/user/c/chbeyer/log_7455691_0.log
<snip>
...
008 (7455691.000.000) 08/20 13:15:54 Hello World - I am your condor job
...
005 (7455691.000.000) 08/20 13:15:54 Job terminated.
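If you want these log entries to tell you whether a step actually worked, you can log each step's exit code with the same 'ulog' call and later filter the custom entries (event code 008, as in the output above) out of the log file. The following is a minimal sketch, not an official tool: 'run_logged', 'custom_log_messages' and the 'CHIRP' variable (defaulting to the condor_chirp path used on this page, overridable for local testing) are my own hypothetical names, './preprocess.sh' is a hypothetical command, and the filter assumes the plain-text log format shown above (event code, job ID, date and time in front of the message); other event log formats are not handled.

```bash
#!/bin/bash
# Minimal sketch of two helpers around 'condor_chirp ulog':
#  - run_logged (inside the job) runs a command and logs its exit code,
#  - custom_log_messages (after the job) prints only the 008 entries.
# CHIRP is my own variable; it defaults to the path used on this page and
# can be pointed at a stub for local testing.
CHIRP="${CHIRP:-/usr/libexec/condor/condor_chirp}"

# run_logged STEPNAME COMMAND [ARGS...]
# Runs COMMAND, writes "STEPNAME finished with exit code N" to the job log
# and returns COMMAND's exit code N.
run_logged() {
    local step="$1"
    shift
    local rc=0
    "$@" || rc=$?
    "$CHIRP" ulog "${step} finished with exit code ${rc}"
    return "$rc"
}

# custom_log_messages LOGFILE
# Prints "JOBID MESSAGE" for every event 008 in a plain-text job log,
# assuming event code, job ID, date and time precede the message.
custom_log_messages() {
    awk '$1 == "008" {
        id = $2
        gsub(/[()]/, "", id)                       # strip the parentheses
        msg = $0
        sub(/^[^ ]+ [^ ]+ [^ ]+ [^ ]+ /, "", msg)  # drop code, ID, date, time
        print id, msg
    }' "$1"
}

# Example:
#   in the job:     run_logged "preprocessing" ./preprocess.sh input.dat
#   after the job:  custom_log_messages log_7455691_0.log
```

Because 'run_logged' returns the command's exit code, the job script can still react to a failed step (or exit with it) as before.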
Writing job states and/or return states into a custom file in your $HOME when the job finishes
Inside your job you can use, for example:
<snip>
echo "$CLUSTER I am feeling bad" | /usr/libexec/condor/condor_chirp put -mode wa - /afs/desy.de/user/c/chbeyer/my_logfile.txt
<snip>
Here the mode 'wa' means 'write' and 'append', so all of your jobs can write their status or return state into one shared file, which you can then monitor with 'tail -f', for example.
[chbeyer@htc-it02]~/htcondor/testjobs% cat /afs/desy.de/user/c/chbeyer/my_logfile.txt
7456398 I am feeling fine
7456763 I am feeling fine
7456764 I am feeling fine
7456765 I am feeling fine
7456768 I am feeling fine
7456766 I am feeling fine
7456767 I am feeling fine
7456769 I am feeling fine
7456761 I am feeling bad
7456760 I am feeling bad
7456762 I am feeling bad
7456773 I am feeling bad
7456770 I am feeling bad
7456771 I am feeling bad
7456772 I am feeling bad
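The example above uses a variable '$CLUSTER' whose origin is not shown here. One way to fill in the job's IDs, sketched below, is to read 'ClusterId' and 'ProcId' from the job ClassAd file that HTCondor names in the '_CONDOR_JOB_AD' environment variable inside the job; adding the ProcId also tells jobs of the same cluster apart. 'condor_chirp get_job_attr ClusterId' should work as well. 'job_attr', 'report_status' and the 'CHIRP' override are hypothetical names of my own, and this is a sketch rather than a tested production helper.

```bash
#!/bin/bash
# Minimal sketch: append "ClusterId.ProcId MESSAGE" to a shared status file
# with 'condor_chirp put -mode wa', as in the example above.
# Assumption: the IDs are read from the job ClassAd file named in
# $_CONDOR_JOB_AD, which HTCondor provides inside the job.
# CHIRP is my own variable so the call can be replaced for local testing.
CHIRP="${CHIRP:-/usr/libexec/condor/condor_chirp}"

# job_attr NAME: print the value of attribute NAME from the job ad file
# (names are compared case-insensitively, as ClassAd attribute names are)
job_attr() {
    [ -r "${_CONDOR_JOB_AD:-}" ] || return 1
    awk -F ' = ' -v name="$1" \
        'tolower($1) == tolower(name) { print $2; exit }' "$_CONDOR_JOB_AD"
}

# report_status STATUSFILE MESSAGE
report_status() {
    local cluster proc
    cluster=$(job_attr ClusterId) || return 1
    proc=$(job_attr ProcId) || return 1
    echo "${cluster}.${proc} $2" | "$CHIRP" put -mode wa - "$1"
}

# Example (inside a job):
#   report_status /afs/desy.de/user/c/chbeyer/my_logfile.txt "I am feeling fine"
```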
Altering and adding ClassAd attributes of a running job from inside the job
You can use 'condor_chirp' from inside the job to add new ClassAd attributes to the job, or to alter existing ones, with the current state of your job. The nice thing about this is that you can then use the custom attributes to find or sort jobs with 'condor_q' while the jobs are running, or with 'condor_history' once the jobs are done.
At any time inside your job you can alter the job ClassAd of the running job, for example with state messages like the ones below, by adding attributes that are created on the fly. I named them 'MyJobState' and 'MyJobReturn', but any name works; just be sure not to overwrite an existing HTCondor attribute:
my_job.sh:
#!/bin/bash
# the inner double quotes make each value a ClassAd string
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"Starting"'
sleep 120 # do something here
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"1/10 Done"'
sleep 120 # do some more here
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"2/10 Done"'
sleep 120 # you got it ...
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"3/10 Done"'
sleep 120
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"4/10 Done"'
sleep 120
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"5/10 Done"'
sleep 120
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"6/10 Done"'
sleep 120
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"7/10 Done"'
sleep 120
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"8/10 Done"'
sleep 120
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"9/10 Done"'
sleep 120
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobState' '"Done"'
/usr/libexec/condor/condor_chirp set_job_attr 'MyJobReturn' '"Good"'
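The repetitive script above can also be written as a loop that records a failure instead of always reporting "Good". This is a minimal sketch: 'do_step' is a hypothetical placeholder for your real work, 'set_attr' and the 'CHIRP' override are my own helper names, and the values "Bad" and "Failed in step ..." are just a convention I picked. The guard at the bottom only calls 'main' when '_CONDOR_JOB_AD' is set, i.e. when the script runs as an HTCondor job, so the functions can also be tried out locally.

```bash
#!/bin/bash
# Minimal sketch of my_job.sh as a loop that also reports failures.
# Assumptions: do_step is a hypothetical placeholder for the real work,
# and "Bad" / "Failed in step ..." are just a naming convention.
# CHIRP is my own variable so the calls can be replaced for local testing.
CHIRP="${CHIRP:-/usr/libexec/condor/condor_chirp}"
NSTEPS=10

# set_attr NAME TEXT: store TEXT as a ClassAd string attribute of this job
# (TEXT must not contain double quotes)
set_attr() {
    "$CHIRP" set_job_attr "$1" "\"$2\""
}

# do_step N: hypothetical placeholder for the real work of step N
do_step() {
    sleep 120
}

main() {
    local i
    set_attr MyJobState "Starting"
    i=1
    while [ "$i" -le "$NSTEPS" ]; do
        if ! do_step "$i"; then
            set_attr MyJobState "Failed in step ${i}/${NSTEPS}"
            set_attr MyJobReturn "Bad"
            return 1
        fi
        set_attr MyJobState "${i}/${NSTEPS} Done"
        i=$((i + 1))
    done
    set_attr MyJobState "Done"
    set_attr MyJobReturn "Good"
}

# Run only when started by HTCondor ($_CONDOR_JOB_AD is set inside jobs),
# so the functions can also be sourced and tried out locally.
if [ -n "${_CONDOR_JOB_AD:-}" ]; then
    main "$@"
fi
```

Since 'main' returns 1 on failure, the job's exit code also reflects the failure, in addition to the 'MyJobReturn' attribute.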
Now you can use 'condor_q' to check on your job states while the jobs are running (use 'condor_q -l' to see which other attributes, such as the submit time, you might want to list):
[chbeyer@htc-it02]~/htcondor/testjobs% condor_q -af ClusterID -af MyJobState
7453792 5/10 Done
7453806 4/10 Done
7453810 4/10 Done
7453815 4/10 Done
7453819 4/10 Done
7453823 4/10 Done
7453827 4/10 Done
7453831 3/10 Done
7453837 3/10 Done
7453843 3/10 Done
7453847 3/10 Done
7453851 3/10 Done
7453855 3/10 Done
7453860 2/10 Done
7453864 2/10 Done
7453868 2/10 Done
7453872 2/10 Done
7453876 2/10 Done
7453878 2/10 Done
7453880 2/10 Done
7453883 1/10 Done
7453885 1/10 Done
7453887 1/10 Done
7453889 1/10 Done
7453891 1/10 Done
7453893 1/10 Done
7453895 Starting
7453897 Starting
7453899 Starting
7453901 Starting
7453904 Starting
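With many jobs, a count per state is often more useful than the full list. A minimal sketch that reads the "ClusterID MyJobState" lines printed above; 'count_states' is a hypothetical name, and jobs that have not set 'MyJobState' yet are counted under "undefined", which is what 'condor_q -af' prints for a missing attribute.

```bash
#!/bin/bash
# Minimal sketch: count jobs per MyJobState from "ClusterID MyJobState" lines
# on stdin, as printed by the condor_q command above.
# Output lines are "COUNT STATE", sorted by state.
count_states() {
    awk '{ sub(/^[^ ]+ /, ""); count[$0]++ }
         END { for (s in count) print count[s], s }' | sort -k 2
}

# Example:
#   condor_q -af ClusterID -af MyJobState | count_states
```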
Of course, you can also list only the jobs that are in a certain state:
[chbeyer@htc-it02]~/htcondor/testjobs% condor_q -af ClusterID -constraint 'MyJobState == "3/10 Done"'
7453860
7453864
7453868
7453872
7453876
7453880
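If you query different states often, you can wrap the constraint in a small function so the quoting of the ClassAd string is done in one place. The second function lists jobs that have not reported a state at all: for them 'MyJobState == "3/10 Done"' evaluates to undefined and therefore never matches, whereas the meta-comparison operator '=?=' can compare against undefined directly. This is a sketch; 'jobs_in_state', 'jobs_without_state' and the 'CONDOR_Q' override (defaulting to 'condor_q' on your PATH) are hypothetical names of my own.

```bash
#!/bin/bash
# Minimal sketch: wrap the constraint query above so the state is an argument.
# CONDOR_Q is my own variable so the command can be replaced for testing;
# by default condor_q is taken from the PATH.
CONDOR_Q="${CONDOR_Q:-condor_q}"

# jobs_in_state STATE: cluster IDs of queued jobs whose MyJobState equals STATE
jobs_in_state() {
    "$CONDOR_Q" -af ClusterID -constraint "MyJobState == \"$1\""
}

# jobs_without_state: jobs that have not set MyJobState yet; '==' yields
# undefined (never true) for them, while '=?=' matches undefined directly
jobs_without_state() {
    "$CONDOR_Q" -af ClusterID -constraint 'MyJobState =?= undefined'
}

# Example:
#   jobs_in_state "3/10 Done"
#   jobs_without_state
```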
In the example above I put the final return state of my 'job' into the attribute 'MyJobReturn', which I can query with 'condor_history' after the job has finished:
[chbeyer@htc-it02]~/htcondor/testjobs% condor_history -af ClusterID -af 'MyJobReturn'
7453806 Good
7453810 Good
7454011 False
7454010 False
7454013 False
7454012 False
7454009 False
7454008 False
7453999 Not so good
7454001 Not so good
7454004 Not so good
7454007 Not so good
7454003 Not so good
7454005 Not so good
7454002 Not so good
7454006 Not so good
7454000 Not so good
7453792 Good
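To get the jobs that need a closer look (or a resubmission), you can filter this output for everything that did not report "Good". A minimal sketch; 'failed_jobs' is a hypothetical name, and jobs that never set 'MyJobReturn' (printed as "undefined") are treated as failed. Passing the constraint MyJobReturn =!= "Good" to condor_history with '-constraint' should select the same jobs directly.

```bash
#!/bin/bash
# Minimal sketch: read "ClusterID MyJobReturn" lines on stdin (as printed by
# the condor_history command above) and print the IDs of all jobs whose
# return state is not exactly "Good" (including "undefined").
failed_jobs() {
    awk '{ id = $1; sub(/^[^ ]+ /, "") } $0 != "Good" { print id }'
}

# Example:
#   condor_history -af ClusterID -af MyJobReturn | failed_jobs
```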
See the manual page of condor_chirp for more information: https://htcondor.readthedocs.io/en/latest/man-pages/condor_chirp.html