Production Plans

  • MC9
    • MC9 started July 5, 2017
      • Phase III signal samples for prerelease-00-09-00b validation
      • Phase III Y(3S) generic (300 fb-1)
      • Phase III Y(4S) generic (4 x 1 ab-1)
      • Phase III Y(5S) generic (1 ab-1)
      • Phase III Y(6S) generic (100 fb-1)
      • Phase III Y(4S) signal samples
      • Phase III Y(4S) low multiplicity samples
      • Phase III Y(5S) signal samples
      • Phase III Y(6S) signal samples
      • Phase II Y(3S) signal samples
      • Phase II Y(4S) generic (50 fb-1)
      • Phase II Y(4S) signal samples
      • Phase II Y(4S) low multiplicity samples

Production Status

MC9

Official production started at ~21:00 JST on July 5, 2017.

Older status updates

Starting with BGx0 generic samples (0.2 ab-1)

Submitted second batch of BGx0 generic jobs (July 7, ~04:00 JST)

Third and fourth batches of BGx0 generic jobs (July 10)

Submitted a few BGx0 signal samples (July 12 ~04:30 JST)

Submitted the phase 2 generic samples with BGx0 (July 14 ~04:00 JST)

Submitted the rest of the BGx0 signal samples (July 16 ~00:00 JST)

New requests for BGx0 signal samples submitted (July 19 ~01:30 JST)

MC9 restarted with BGx1 phase 2 samples - 50 fb-1 generic and signal samples (July 30 ~10:00 JST)

Submitted first batch of phase 3 samples with background - mixed and charged BBbar - about 140k jobs (August 12 ~08:00 JST)

Added uubar: ~180k jobs (August 13 ~10:30 JST)

Added ddbar: ~53k jobs (August 29 ~22:00 JST)

Added ssbar: ~51k jobs (Sept. 2 ~11:00 JST)

Submitted phase 3 low-multiplicity samples: ~43.5k jobs (Sept. 2 ~13:00 JST) → includes a generator-level skim, so the number of jobs is inflated relative to the run time

Added ccbar and taupair: ~317k jobs (Sept 3 ~09:30 JST)

Submitted new signal MC samples: ~57.6k jobs (Sept 11 ~23:00 JST)

Submitted new phase 2 signal MC samples: ~21.2k jobs (Sept 12 ~05:00 JST) → short jobs < 3 hrs each

Submitted new phase 3 signal MC samples: ~600k jobs (Sept 24 ~22:30)

Submitted new phase 3 signal MC samples (almost all submitted now) (Sept 28 ~02:00 JST)

Submitted phase 3 Y(5S) bsbs and non-bsbs samples: ~54.4k jobs (Oct 6 ~04:30 JST)

Submitted phase 3 Y(5S) uubar samples: ~242k jobs (Oct 9 ~09:30 JST)

Submitted phase 3 Y(5S) ddbar samples: ~60k jobs (Oct 16 ~10:30 JST)

Submitted phase 3 Y(5S) ssbar and ccbar samples: ~300k jobs (Oct 18 ~02:00 JST)

Submitted a few last signal samples and the phase 3 Y(5S) taupair samples: ~200k jobs (Oct 24 ~03:00 JST)
     → The taupair samples should run as shorter jobs (~5-6 hours at KEKCC)

Submitted Y(6S) continuum samples: ~30k jobs (Oct 30 ~23:00 JST)

Submitted Y(3S) generic samples: ~260k 8h jobs (Nov 3 ~04:30 JST)

Submitted Y(3S) continuum samples (uubar): ~170k 5h jobs (Nov 5 ~08:00 JST)

Submitted Y(3S) continuum samples (ddbar, ssbar, ccbar): ~70k 5h jobs + ~80k 8h jobs (Nov 13 ~23:00 JST)

Submitted Y(3S) taupair samples: ~70k 5h jobs (Nov 17 ~01:00 JST)

Submitted remaining Y(5S) generic: ~4.5k jobs (Nov 20 ~00:00 JST)

Submitted next batch of Y(4S) generic (mixed, charged): ~100k 8h jobs, ~150k 5h jobs (Nov 20 ~03:00 JST)

Submitted Y(4S) uubar continuum: ~250k 8h jobs (Nov 25 ~22:00 JST)

Submitted Y(4S) ddbar continuum: ~100k 5h jobs (Nov 28 ~21:00 JST)

Submitted Y(4S) ssbar continuum: ~100k 5h jobs (Dec 6 ~21:00 JST)

Submitted Y(4S) ccbar continuum: ~200k 8h jobs (Dec 9 ~05:00 JST)

Submitted Y(4S) taupair: ~180k 5h jobs (Jan 2 ~08:30 JST)

Submitted Y(4S) mixed sample (batch 3): ~100k 8h jobs (Jan 8 ~23:00 JST)

Submitted a few low multiplicity samples (Jan 18)

Submitted Y(4S) bbbar samples for data challenge (phase 3, BGx1): ~220k 9hr jobs (Jan 18 ~02:00 JST)

Submitted Y(4S) ddbar samples for data challenge (phase 3, BGx1): ~130k 5 hr jobs (Jan 22 ~22:00 JST)

Submitted phase 3 BG overlay production scripts (Jan 27 ~00:50 JST)

Submitted Y(4S) uubar samples for data challenge (phase 3, BGx1): ~290k 9 hr jobs (Feb 2 ~00:30 JST)

Submitted Y(4S) ssbar samples for data challenge (phase 3, BGx1): ~95k 6 hr jobs (Feb 3 ~21:45 JST)

Submitted Y(4S) ccbar samples for data challenge (phase 3, BGx1): ~320k 7.5 hr jobs (Feb 13 ~10:30 JST)

Submitted Y(4S) taupair samples for data challenge (phase 3, BGx1): ~160k 8 hr jobs (Feb 19 ~23:00 JST)

Submitted Y(4S) generic charged (not for data challenge): ~150k 5h jobs (Mar 1 ~23:00 JST)

Submitted Y(4S) uubar samples (phase 3, BGx1): ~250k 8h jobs (Mar 2 ~21:00 JST)

Submitted phase 3, BGx1 low multiplicity samples as requested by the bottomonium group (Mar 8 ~00:00 JST)

Submitted MC10 analysis validation samples, both BGx0 and BGx1 (Mar 9, 01:40 JST)

More MC9 skim jobs approved (Mar 12, ~23:00 JST)

Submitted Y(4S) ddbar samples (phase 3, BGx1): ~100k 5h jobs (Mar 13 ~00:00 JST)

Submitted Y(4S) ssbar samples (phase 3, BGx1): ~95k 5h jobs (Apr 3 ~04:00 JST)

Submitted Y(4S) ccbar samples (phase 3, BGx1): ~200k 8h jobs (Apr 4 ~02:30 JST)

Submitted Y(4S) taupair samples (phase 3, BGx1): ~180k 5h jobs (Apr 5 ~01:30 JST)

All scheduled MC9 fabrication jobs have finished (as of April 16, 2018)

First official batch of MC10 jobs submitted - phase 3 Y(4S) mixed samples: BGx1 ~100k 8h jobs and BGx0 ~36k 5h jobs (April 14 ~00:00 JST)

MC10 - phase 3 Y(4S) charged samples: BGx1 ~110k 8h jobs and BGx0 ~38k 5h jobs (April 17 ~01:00 JST)

First (small) batch of MC10 signal jobs submitted (April 20 ~03:30 JST)
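
The submission times in this log are given in JST, while the service and site entries on this page use UTC. A minimal sketch for converting between the two when cross-referencing entries, using only the Python standard library; the helper name jst_to_utc is hypothetical and not part of any Belle II tool:

```python
from datetime import datetime, timedelta, timezone

# JST is a fixed UTC+9 offset with no daylight saving time.
JST = timezone(timedelta(hours=9), name="JST")


def jst_to_utc(stamp: str) -> str:
    """Convert a 'YYYY-MM-DD HH:MM' timestamp in JST to the same format in UTC."""
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=JST)
    return local.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")


if __name__ == "__main__":
    # Official MC9 start: ~21:00 JST on July 5, 2017
    print(jst_to_utc("2017-07-05 21:00"))
```

A fixed offset is used here instead of a time zone database lookup because JST has no daylight saving transitions, which keeps the sketch free of external data dependencies.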



Central Services

Dirac (dirac.cc.kek.jp, b2dchsv01-b2dchsv06.cc.kek.jp, b2dchsv08.cc.kek.jp)

  • b2dchsv01.cc.kek.jp: rapid increase of more than 5 GB BIIDCO-952

    • Restarted, which seemed to improve things, but DIRAC Production now appears to be down. BIIDCO-956

DB Production (b2dchdb1.cc.kek.jp, b2dchdb2.cc.kek.jp, b2dcsdb1.cc.kek.jp, b2dcsdb2.cc.kek.jp)

  • Please update ticket BIIDCO-583 if the monitoring plots continue to show as Down.


DDM (dirac-ddm-prod.hep.pnnl.gov)

  • BIIDCO-844
  • 2018-03-01: DDM deletion task seems stuck BIIDCO-808,
    which causes no replication from 2018-02-27 BIIDCO-806

Conditions DB (belle2db.hep.pnnl.gov, belle2db-files.hep.pnnl.gov)

  • Conditions DB slow-response failures have occurred since 2018-03-29 10:30 UTC BIIDCO-907
  • Conditions DB switch-over to the new server: BIIDCO-742

Monitor

LFC

File Transfers and Replication Status

See also DDM for related issues

FTS

Any problems in the FTS service or FTS monitoring are to be recorded here. Site/SE-specific issues are to be recorded under each Site/SE.

Note that the FTS dashboard we use is an "old" instance and is not well maintained. Belle II members in general do not have access to the "new" monitoring. When the dashboard is down, shifters only need to notify the expert and skip the corresponding part of their work. The expert should then check the new monitoring, since access to that monitoring page is limited.

  • FTS dashboard has been unavailable since 2018-03-22 06:00 UTC BIIDCO-876
    → According to Grafana, FTS transfers resumed after 2018-03-22 12:00 UTC, but the FTS dashboard has not recovered

Replication Status

  • 2018-03-12: replication stuck and no file transfers BIIDCO-844

Job Status Plot

  • Date, Issue, Tickets...

Job Summary

  • Number of running jobs reduced by half due to KEK-SandboxSE access failure since 2018-04-20 16:00 UTC BIIDCO-974

SEs

SE Common Issues

  • Zero "Done" with non-zero "Waiting" at almost all SEs BIIDCO-972
  • Replication status: "scheduled" increased and "Done" decreased for 9 hours at all SEs BIIDCO-965
  • StorageElementStatusAgent is down BIIDCO-857

  • Date, Issue, Tickets...

Primary SEs

Primary SE: BNL-TMP-SE (dcblsrm.sdcc.bnl.gov)

  • 2018-04-11: file transfer errors over several hours BIIDCO-948

Primary SE: CESNET-TMP-SE (dpm1.egee.cesnet.cz)      

Primary SE: CNAF-TMP-SE (storm-fe-archive.cr.cnaf.infn.it)

  • Showing zero "Done" and non-zero others in the replication status plot.

  • Explicitly banned since Feb. 28 BIIDCO-966
  • Downtime: Massive flood of the datacenter, Start time: 2017-11-09 11:00, End time: 2018-01-18 09:00 (UTC): BIIDCO-495. No clear date for the end.

    → SE Health check by DDM: No need to report this issue during the above downtime.

  • 2017/03/13: Not enough free space BIIDCO-137

Primary SE: DESY-TMP-SE (dcache-se-desy.desy.de)

  • File transfer failure for several hours (2018-04-11) BIIDCO-949
  • File transfer failure for several hours (2018-02-17): BIIDCO-780

  • No plot in replication trend

  • Not enough free space BIIDCO-107

Primary SE: KEK2-TMP-SE (kek2-se01.cc.kek.jp)

  • kek2-se03.cc.kek.jp: Disk full error on /disk/belle/TMP/belle/user BIIDCO-964
  • SE Health check by DDM: remove file, download, upload have not worked since 2018-03-27 13:44:14 UTC. BIIDCO-894

Primary SE: KISTI-TMP-SE (belle-se-head.sdfarm.kr)

  • SE Health check by DDM: checksum, remove file, remove directory, download, upload, ls have not worked since 2018-04-06 00:41:25 UTC. BIIDCO-928

  • No new assignment of MC production data blocks to this destination BIIDCO-848

Primary SE: KIT-TMP-SE (dcachesrm-kit.gridka.de)

  • KIT SE does not have enough free disk space BIIDCO-860
  • KIT SE giving occasional timeouts BIIDCO-428

Primary SE: KMI-TMP-SE (nsrmfe01.hepl.phys.nagoya-u.ac.jp)

  • Not enough free space BIIDCO-136

Primary SE: Napoli-TMP-SE (belle-dpm-01.na.infn.it)

  • File transfer error over several hours (2018-04-11) BIIDCO-947
  • Disk is full BIIDCO-858

  • Not enough free space BIIDCO-146

  • SE Health check by DDM: checksum, remove file, remove directory, download, upload, ls have not worked since 2017-11-30 04:51:01 UTC.

Primary SE: PNNL-TMP-SE (se.hep.pnnl.gov)

  • PNNL SE to be decommissioned BIIDCO-838

Primary SE: SIGNET-TMP-SE (dcache.ijs.si)

Other SEs

Adelaide-TMP-SE (coepp-dpm-01.ersa.edu.au)

CYFRONET-TMP-SE (dpm.cyf-kr.edu.pl)

  • 2018-04-01: Efficiency below 50% for 4 hours: BIIDCO-914

  • 2018-04-10: No transfer activity / efficiency 0% BIIDCO-938

Frascati-TMP-SE (atlasse.lnf.infn.it)

  • Downtime 07-Apr-18 13:30:00 to 10-Apr-18 17:30:00 BIIDCO-929
  • Transfer efficiency is zero: BIIDCO-910

HEPHY-TMP-SE (hephyse.oeaw.ac.at)

  • Date, Issue, Tickets...

IPHC-TMP-SE (sbgse1.in2p3.fr)

  • Date, Issue, Tickets...

Melbourne-TMP-SE (b2se.mel.coepp.org.au)

  • Transfer rate to this SE is zero BIIDCO-896

  • Melbourne-DATA-SE banned for write BIIDCO-927

McGill-TMP-SE (storm02.clumeq.mcgill.ca)

  • BIIDCO-516 McGill-TMP-SE will be decommissioned in early 2018.

MPPMU-TMP-SE (grid-srm.rzg.mpg.de)

  • Transfer efficiency is zero: BIIDCO-911

NTU-TMP-SE (bgrid3.phys.ntu.edu.tw)

  • Date, Issue, Tickets...

Pisa-TMP-SE (stormfe1.pi.infn.it)

TAU-TMP-SE (tau-se.hep.tau.ac.il)

Torino-TMP-SE (se-srm-00.to.infn.it)

  • Date, Issue, Tickets...

ULAKBIM-TMP-SE (torik1.ulakbim.gov.tr)

  • Date, Issue, Tickets...

UMiss-TMP-SE (umiss005.hep.olemiss.edu)

  • Date, Issue, Tickets...

UVic-TMP-SE (charon01.westgrid.ca)

  • Date, Issue, Tickets...


Sites

Sites Common Issues

  • "Short pilot jobs" occurred at 25 sites for more than 5 hours, at 18:20, 19:20, 20:20, 21:20, 22:20 UTC on 2018/04/20. BIIDCO-973
    The failures seem to be caused by the KEK-SandboxSE access problem and InputSandbox download failure BIIDCO-974
  • BIIDCO-257
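
Most per-site entries on this page follow the health checker's fixed wording: a quoted issue name, then "has been found at" or "has been found since", then a UTC time and date. A minimal sketch for turning such lines into structured records, for example to count recurring issues per site. The regular expression and the function name parse_health_entry are assumptions derived only from the entries on this page; they are not part of the health checker itself:

```python
import re
from datetime import datetime, timezone

# Pattern inferred from entries such as:
#   "Short pilot jobs" has been found since 09:20:00 UTC on 2018/04/21.
PATTERN = re.compile(
    r'"(?P<issue>[^"]+)" has been found (?P<mode>at|since) '
    r"(?P<time>\d{2}:\d{2}:\d{2}) UTC on (?P<date>\d{4}/\d{2}/\d{2})"
)


def parse_health_entry(line):
    """Return a dict describing one health checker entry, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    when = datetime.strptime(
        match["date"] + " " + match["time"], "%Y/%m/%d %H:%M:%S"
    ).replace(tzinfo=timezone.utc)
    return {
        "issue": match["issue"],
        # "since" marks a condition that persisted; "at" marks a single check
        "ongoing": match["mode"] == "since",
        "when": when,
    }


if __name__ == "__main__":
    sample = (
        'Health checker info. : "Short pilot jobs" has been found '
        "since 09:20:00 UTC on 2018/04/21."
    )
    print(parse_health_entry(sample))
```

Entries that lack the opening quotation mark or use free-form wording will not match and are returned as None, so they would still need manual review.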

ARC.DESY.de

  • Health checker info. : "Failed pilot jobs" has been found since 09:20:00 UTC on 2018/03/16.(details)
  • Health checker info. : "Failed pilot jobs" has been found at 05:20:00 UTC on 2018/03/16.(details)
  • Health checker info. : "Short pilot jobs" has been found at 14:20:00 UTC on 2018/03/15.
  • Health checker info. : "Short pilot jobs" has been found at 23:20:00 UTC on 2018/03/14.

  • Job submission check : Pilot submission failure has been found at 07:33:00 UTC on 2018/02/22. (details) BIIDCO-792

ARC.KIT.de

  • Health checker info. : "Short pilot jobs" has been found at 15:20:00 UTC on 2018/04/19. BIIDCO-970
  • Health checker info. : "Short pilot jobs" has been found at 22:20:00 UTC on 2018/04/14.(details)
  • Health checker info. : "Short pilot jobs" has been found since 13:20:00 UTC on 2018/04/11.(details)
  • Health checker info. : "Short pilot jobs" has been found since 19:20:00 UTC on 2018/03/19.(details)
  • Health checker info. : "Short pilot jobs" has been found at 23:20:00 UTC on 2018/03/14.

  • Downtime BIIDCO-819

  • Health checker info. : "Aborted pilot jobs" has been found since 06:20:00 UTC on 2018/02/17.

  • Downtime BIIDCO-931

ARC.LMU.de

  • This is a test site. Do not need to report any issue.

ARC.LMU2.de

  • Banned, as there is currently no resource behind the CE BIIDCO-239

  • Downtime: Start time: 2018-04-09 21:00 (UTC) End time: 2018-04-10 18:00 (UTC) BIIDCO-932

ARC.Melbourne.au

  • Health checker info. : "Short pilot jobs" has been found at 05:20:00 UTC on 2018/03/16.(details)

  • Downtime: Start time: 2018-03-05 01:00 End time: 2018-03-09 10:00 BIIDCO-812

  • "Stalled" jobs BIIDCO-446

ARC.MPPMU.de

  • Health checker info. : "Failed pilot jobs" has been found at 05:20:00 UTC on 2018/03/16.(details)
  • Health checker info. : "Failed pilot jobs" has been found at 02:20:00 UTC on 2018/03/15.(details)
  • Health checker info. : "Aborted pilot jobs" has been found at 16:20:00 UTC on 2018/03/11.(details)
  • Health checker info. : "Failed pilot jobs" has been found since 01:20:00 UTC on 2018/03/09.

  • Health checker info. : "Failed pilot jobs" has been found since 19:20:00 UTC on 2018/02/01.(details) BIIDCO-715

  • Job submission check : Pilot submission failure has been found at 06:31:00 UTC on 2017/05/10. (details)

ARC.SIGNET.si

  • Job submission check : Pilot submission failure has been found at 13:25:00 UTC on 2018/04/21.
  • Health checker info. : "Failed pilot jobs" has been found at 16:20:00 UTC on 2018/04/19. BIIDCO-832
  • Job submission check : Pilot submission failure has been found at 14:31:00 UTC on 2018/04/19.
  • wms.ijs.si: Downtime from 2018-04-12 12:00 (UTC) to 2018-05-11 12:00 (UTC) BIIDCO-968
  • wms2.arnes.si: Downtime from 2018-04-12 12:00 (UTC) to 2018-05-11 12:00 (UTC) BIIDCO-945
  • Downtime for 'nagios.sling.si' (not expected to affect Belle II activities) from 2018-03-29 12:00 (UTC) to 2018-04-10 12:00 (UTC) BIIDCO-912
  • Job submission check : Pilot submission failure has been found at 22:24:00 UTC on 2018/03/25. (details)
  • Failed jobs BIIDCO-832
    • The number of MCProduction jobs is limited to 10 until the issue is resolved.
  • Health checker info. : "Short pilot jobs" has been found at 08:20:00 UTC on 2018/02/23 BIIDCO-799

  • "Stalled" jobs BIIDCO-287

CLOUD.CC1_Krakow.pl

  • Not used in production yet. Seeing no jobs (no plot) is not a problem

DIRAC.Beihang.cn

  • Health checker info. : "Short pilot jobs" has been found since 21:20:00 UTC on 2018/04/06.
  • Health checker info. : "Short pilot jobs" has been found at 15:20:00 UTC on 2018/02/02.
  • 2017-12-24: Many MCProduction jobs failed at the file upload stage for fail-over SEs BIIDCO-647
  • The number of jobs is limited. BIIDCO-289
  • All the upload trials are failing against all the SEs configured: OutputSE (KMI-TMP-SE, PNNL-TMP-SE), Fail-over SEs(DESY-TMP-SE, Napoli-TMP-SE, PNNL-TMP-SE, KIT-TMP-SE)
  • Large % of failed jobs in DIRAC status plot (Added 2016-11-03 22:45:00 UTC) 

DIRAC.BINP.ru

  • Date, Issue, Tickets..

DIRAC.BINP-VM.ru

  • Job status plots, "Application Finished With Errors" (2018-02-11 but lasting for at least a month) BIIDCO-749

DIRAC.CINVESTAV.mx

  • Job status plots, "Application Finished With Errors" & "Watchdog identified this job as Stalled" (2018-02-12) BIIDCO-755
  • Job submission check : Pilot submission failure has been found since 20:26:00 UTC on 2018/01/01. (details)

DIRAC.DESY.de

  • Test site. Not in use in MC production

DIRAC.IITG.in

  • Health checker info. : "Aborted pilot jobs" has been found at 22:20:00 UTC on 2018/04/06.
  • "Aborted pilot jobs" has been found at 16:20:00 UTC on 2018/03/11.(details)

DIRAC.LMU.de

  • Not in use in MC production BIIDCO-26
  • Banned for now.

DIRAC.MIPT.ru

  • Health checker info. : "Aborted pilot jobs" has been found at 14:20:00 UTC on 2018/04/20.
  • Health checker info. : "Aborted pilot jobs" has been found since 13:20:00 UTC on 2018/04/11.(details)
  • Health checker info. : "Aborted pilot jobs" has been found since 01:20:00 UTC on 2018/03/17.(details)
    BIIDCO-747
  • Health checker info. : "Short pilot jobs" has been found at 23:20:00 UTC on 2018/03/15.(details)
  • Health checker info. : "Short pilot jobs" has been found at 23:20:00 UTC on 2018/03/14.
  • Health checker info. : "Aborted pilot jobs" has been found at 12:20:00 UTC on 2018/02/11 BIIDCO-747

DIRAC.Nagoya.jp


  • Health checker info. : "Short pilot jobs" has been found since 08:20:00 UTC on 2018/04/05.(details) BIIDCO-925

DIRAC.Nara-WU.jp

  • Decommissioned site: Since this still uses SL5, DIRAC pilot cannot be executed there.

DIRAC.NDU.jp

  • Date, Issue, Tickets..

DIRAC.Niigata.jp

  • Job submission check : Pilot submission failure has been found since 14:44:00 UTC on 2018/04/04. BIIDCO-922

DIRAC.Osaka-CU.jp

  • Health checker info. : "Short pilot jobs" has been found since 07:20:00 UTC on 2018/04/19.
  • Health checker info. : "Short pilot jobs" has been found since 16:20:00 UTC on 2018/03/19.(details)
  • Health checker info. : "Belle II software could not be installed on " has been found since 20:20:00 UTC on 2018/03/19.
  • Health checker info. : "Short pilot jobs" has been found since 22:20:00 UTC on 2018/03/17.
    → Ask site admin to check the status 2018-03-17 10:00 JST. (DB access failure again from DIRAC.Osaka-CU.jp to PNNL from 2018-03-16 11:00 UTC)
    BIIDCO-290
  • DB access failure from Osaka-CU to the PNNL server starting around 2018-03-05 03:00 UTC, which causes the "Short pilot jobs" failures.
    Health checker info. : "Short pilot jobs" has been found since 10:20:00 UTC on 2018/02/17 BIIDCO-290
  • MCProduction = 5 BIIDCO-312

DIRAC.PNNL.us

  • Date, Issue, Tickets...
  • Site to be decommissioned BIIDCO-919

DIRAC.PNNL2.us

  • Site to be decommissioned BIIDCO-920

DIRAC.PNNL-CASCADE.us

  • Seeing no jobs (no plot) is not a problem

DIRAC.PNNL-PIC.us

  • Seeing no jobs (no plot) is not a problem

DIRAC.RCNP.jp

  • Health checker info. : "Aborted pilot jobs" has been found at 14:20:00 UTC on 2018/03/24.(details)
  • Downtime: estimated downtime is 8 AM to 6 PM JST on March 22 BIIDCO-878
  • Health checker info. : "Short pilot jobs" has been found at 23:20:00 UTC on 2018/03/07.

DIRAC.SSU.kr

  • Date, Issue, Tickets...

DIRAC.TIFR.in

  • Health checker info. : "Short pilot jobs" has been found at 14:20:00 UTC on 2018/04/21.
  • Health checker info. : "Short pilot jobs" has been found at 14:20:00 UTC on 2018/04/20.
  • Health checker info. : "Short pilot jobs" has been found at 16:20:00 UTC on 2018/04/19. BIIDCO-971
  • Health checker info. : "Short pilot jobs" has been found at 22:20:00 UTC on 2018/04/06.
  • Health checker info. : "Short pilot jobs" has been found at 22:20:00 UTC on 2018/03/25.(details)
  • Health checker info. : "Short pilot jobs" has been found at 14:20:00 UTC on 2018/03/20.(details)
  • Health checker info. : "Short pilot jobs" has been found since 22:20:00 UTC on 2018/03/15.(details)
  • Health checker info. : "Short pilot jobs" has been found at 01:20:00 UTC on 2018/03/15.
  • Health checker info. : "Short pilot jobs" has been found at 23:20:00 UTC on 2018/03/08.
  • Job stalled at input data resolution BIIDCO-714
  • Health checker info. : "Short pilot jobs" has been found at 22:20:00 UTC on 2018/03/01.(details)

DIRAC.TMU.jp

  • Health checker info. : "Aborted pilot jobs" has been found since 09:20:00 UTC on 2018/04/11.
  • Health checker info. : "Aborted pilot jobs" has been found at 22:20:00 UTC on 2018/04/05.
  • Health checker info. : "Aborted pilot jobs" has been found since 23:20:00 UTC on 2018/03/12.
    → Ask site administrator to check the situation 2018-03-18 10:00 JST. BIIDCO-845
  • Pilot submission failure has been found since 16:24:00 UTC on 2018/02/17. (details)
  • Job submission check : Pilot submission failure has been found since 16:24:00 UTC on 2018/02/17. BIIDCO-785

DIRAC.Tokyo.jp

  • Date, Issue, Tickets..

DIRAC.UAS.mx

  • Health checker info. : "Short pilot jobs" has been found since 16:20:00 UTC on 2018/04/04. BIIDCO-923
    CVMFS revision is out-of-date and pilot jobs failed → report to site admin 2018-04-04

DIRAC.UVic.ca

  • Health checker info. : "Short pilot jobs" has been found at 22:20:00 UTC on 2018/03/25.(details)

DIRAC.UVic-local.ca

  • This site is under commissioning.

DIRAC.Yamagata.jp

  • Date, Issue, Tickets...

DIRAC.Yonsei.kr

  • Health checker info. : "Short pilot jobs" has been found since 09:20:00 UTC on 2018/04/21.
  • Health checker info. : "Short pilot jobs" has been found since 22:20:00 UTC on 2018/04/19.
  • Health checker info. : "Short pilot jobs" has been found since 12:20:00 UTC on 2018/04/19.
  • Health checker info. : "Short pilot jobs" has been found since 20:20:00 UTC on 2018/04/14.(details) → BIIDCO-959
  • Health checker info. : "Short pilot jobs" has been found at 02:20:00 UTC on 2018/03/17.(details)
  • Many failed jobs BIIDCO-833

LCG.CESNET.cz

  • Health checker info. : "Short pilot jobs" has been found since 09:20:00 UTC on 2018/04/21.
  • Health checker info. : "Short pilot jobs" has been found since 20:20:00 UTC on 2018/04/14.(details) → BIIDCO-960
  • Health checker info. : "Short pilot jobs" has been found since 02:20:00 UTC on 2018/04/10.
  • Health checker info. : "Short pilot jobs" has been found since 13:20:00 UTC on 2018/03/31.
  • Health checker info. : "Short pilot jobs" has been found since 16:20:00 UTC on 2018/03/21.(details)
  • "Short pilot jobs" has been found since 07:20:00 UTC on 2018/03/10.(details)
  • GGUS ticket : "prague_cesnet_lcg2 : Frequent I/O error at dpm1.egee.cesnet.cz"(133845) was submitted at 03:52:52 UTC on 2018/03/06.
  • Health checker info. : "Short pilot jobs" has been found since 20:20:00 UTC on 2018/03/03. BIIDCO-811
  • Needs some intervention to run Merge jobs BIIDCO-771

LCG.CNAF.it

  • Permanently banned since Feb. 15 BIIDCO-967
  • Downtime: Massive flood of the datacenter, Start time: 2017-11-13 11:00 (UTC), End time: 2018-01-18 09:00 (UTC): BIIDCO-495

LCG.Cosenza.it

  • Downtime: Start time: 2018-04-09 10:15 (UTC) End time: 2018-04-12 10:00 (UTC) BIIDCO-937
  • Health checker info. : "BLAH ERROR" has been found since 17:20:00 UTC on 2018/03/22.(details)
  • Pilots abort with BLAH error since 2018-03-22 15:00 UTC BIIDCO-886, GGUS ticket https://ggus.eu/?mode=ticket_info&ticket_id=134224&come_from=submit
  • Health checker info. : "Not enough disk space on recas-wn-02" has been found since 10:20:00 UTC on 2018/03/17.
  • Health checker info. : "Not enough disk space on recas-wn-04" has been found since 13:20:00 UTC on 2018/03/16.
    Disk full failure on both WNs recas-wn-02 and recas-wn-04 BIIDCO-859
    Solved and verified 2018-03-22: GGUS ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=134093 was submitted at 2018-03-16 15:55

LCG.CYFRONET.pl

  • Health checker info. : "Short pilot jobs" has been found at 14:20:00 UTC on 2018/03/22.(details)
  • Health checker info. : "Belle II software could not be installed on n1072-amd" has been found since 02:20:00 UTC on 2018/03/20.
  • BIIDCO-820 BIIDCO-774
    → Reported to site admin: GGUS ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=133849 was submitted 2018-03-06 04:49 for "Belle II software installation failure and short pilot jobs"
  • Downtime: Start time: 2018-02-27 23:00 (UTC), End time: 2018-12-31 00:00 (UTC) BIIDCO-694
  • Health checker info. : "Short pilot jobs" has been found since 01:20:00 UTC on 2018/03/05.(details) BIIDCO-826
  • Health checker info. : "Belle II software could not be installed on n1075-amd" has been found since 19:20:00 UTC on 2018/03/05. BIIDCO-774
  • Health checker info. : "Short pilot jobs" has been found at 07:20:00 UTC on 2018/01/13.(details) BIIDCO-531

LCG.DESY.de

  • Health checker info. : "Short pilot jobs" has been found at 02:20:00 UTC on 2018/03/15.(details)
  • Health checker info. : "Short pilot jobs" has been found since 14:20:00 UTC on 2018/03/08.(details)
  • Health checker info. : "Belle II software could not be installed on grid-wn0840.desy.de,grid-wn0793.desy.de" has been found since 14:20:00 UTC on 2017/12/20.
  • Health checker info. : "Short pilot jobs" has been found since 02:20:00 UTC on 2017/12/19.(details)
  • LCG.DESY.de: Stalled jobs BIIDCO-293
  • Health checker info. : "Short pilot jobs" has been found at 14:20:00 UTC on 2017/11/22.

LCG.Frascati.it

  • Downtime from 2018-04-19 09:00 (UTC) to 2018-04-20 10:00 (UTC) BIIDCO-969
  • Downtime: central network switch down, from 2018-04-06 08:10 (UTC) to 2018-04-07 13:30 (UTC) BIIDCO-929
  • Health checker info. : "Failed pilot jobs" has been found at 14:20:00 UTC on 2018/03/16.(details)
  • Health checker info. : "Short pilot jobs" has been found since 12:20:00 UTC on 2018/03/06. BIIDCO-826
  • Health checker info. : "Failed pilot jobs" has been found since 14:20:00 UTC on 2018/03/08.(details)

LCG.HEPHY.at

LCG.IPHC.fr

  • Job submission check : Pilot submission failure has been found since 10:25:00 UTC on 2018/04/21. BIIDCO-975
  • Job submission check : Pilot submission failure has been found since 01:26:00 UTC on 2018/03/03.

LCG.KEK.jp

  • Downtime BIIDCO-842 2018-03-12 06:00 (UTC) - 2018-03-19 06:00 (UTC)
  • Health checker info. : "Failed pilot jobs" has been found since 07:20:00 UTC on 2018/03/08

LCG.KEK2.jp

  • Downtime BIIDCO-842 2018-03-12 06:00 (UTC) - 2018-03-19 06:00 (UTC)
  • Health checker info. : "Failed pilot jobs" has been found at 15:20:00 UTC on 2018/03/08.(details) BIIDCO-826
  • Health checker info. : "Failed pilot jobs" has been found since 06:20:00 UTC on 2018/03/08. BIIDCO-826 BIIDCO-794

LCG.KISTI.kr

  • Downtime: Start time: 2018-04-04 08:00 (UTC) End time: 2018-04-09 03:00 (UTC) BIIDCO-924
    Pilot jobs aborted with BLAH error related to downtime BIIDCO-921
  • MCProduction = 280 BIIDCO-280
  • A large number of Merge jobs in waiting status BIIDCO-773

LCG.KMI.jp

  • Health checker info. : "Failed pilot jobs" has been found at 05:20:00 UTC on 2018/03/16.(details)
  • Health checker info. : "Short pilot jobs" has been found since 13:20:00 UTC on 2018/03/13.(details)

LCG.Legnaro.it

  • Health checker info. : "Failed pilot jobs" has been found since 13:20:00 UTC on 2018/03/16.(details)

LCG.McGill.ca

  • Downtime: Start time: 2018-01-11 22:30 (UTC) End time: 2018-04-01 04:02 (UTC) BIIDCO-867
  • BIIDCO-516 LCG.McGill.ca will be decommissioned in early 2018

LCG.Napoli.it

  • Health checker info. : "Failed pilot jobs" has been found since 13:20:00 UTC on 2018/04/20.
  • Health checker info. : "Failed pilot jobs" has been found at 15:20:00 UTC on 2018/04/19.
  • Health checker info. : "Failed pilot jobs" has been found since 09:20:00 UTC on 2018/04/11.(details)
  • Health checker info. : "Failed pilot jobs" has been found since 21:20:00 UTC on 2018/04/05.
  • Health checker info. : "Short pilot jobs" has been found since 22:20:00 UTC on 2018/04/04.
  • Job submission check : Pilot submission failure has been found at 22:24:00 UTC on 2018/03/25. (details)
  • Health checker info. : "Failed pilot jobs" has been found at 14:20:00 UTC on 2018/03/22.(details)
  • This server has a scheduled downtime. BIIDCO-866
  • Health checker info. : "Failed pilot jobs" has been found at 14:20:00 UTC on 2018/03/17. BIIDCO-825
  • Health checker info. : "Not enough disk space on wn203.scope.unina.it" has been found since 22:20:00 UTC on 2018/03/15. BIIDCO-841
  • Health checker info. : "BLAH ERROR" has been found at 14:20:00 UTC on 2018/03/13. BIIDCO-745
    Solved and verified 2018-03-21: GGUS ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=133844 was submitted 2018-03-06 02:31 UTC

LCG.NTU.tw

  • Health checker info. : "CRL has expired" has been found since 16:20:00 UTC on 2018/04/05. BIIDCO-926
  • Health checker info. : "Not enough disk space on node34" has been found at 22:20:00 UTC on 2018/03/31 BIIDCO-951
  • Health checker info. : "Short pilot jobs" has been found at 06:20:00 UTC on 2018/03/24.
  • Health checker info. : "Failed pilot jobs" has been found at 15:20:00 UTC on 2018/03/16.(details)
  • Health checker info. : "Failed pilot jobs" has been found at 05:20:00 UTC on 2018/03/16.(details)
  • Health checker info. : "Short pilot jobs" has been found since 11:20:00 UTC on 2018/02/02, again since 15:20:00 UTC on 2018/02/12. BIIDCO-739
  • MCProduction = 20 BIIDCO-279

LCG.Pisa.it

LCG.Roma3.it

  • Roma3 commissioning BIIDCO-111

LCG.TAU.il

  • Date, Issue, Tickets...

LCG.Torino.it

  • Job submission check : Pilot submission failure has been found at 22:24:00 UTC on 2018/03/25. (details)
  • Job submission check : Pilot submission failure has been found at 14:24:00 UTC on 2018/03/24. (details)
  • Health checker info. : "Short pilot jobs" has been found at 14:20:00 UTC on 2018/03/23.(details)
  • Job submission check : Pilot submission failure has been found since 00:29:00 UTC on 2018/03/17. (details)
  • Job submission check : Pilot submission failure has been found at 22:27:00 UTC on 2018/03/16. (details)
  • Job submission check : Pilot submission failure has been found at 14:32:00 UTC on 2018/03/16. (details)
  • Pilot submission failure has been found since 21:36:00 UTC on 2018/03/08. BIIDCO-352

LCG.ULAKBIM.tr

OSG.BNL.us

  • Health checker info. : "Aborted pilot jobs" has been found since 13:20:00 UTC on 2018/04/11.(details)
    JIRA ticket has been issued: BIIDCO-950
  • Health checker info. :
    1. "Short pilot jobs" has been found since 12:20:00 UTC on 2018/04/11.(details)
    2. "Aborted pilot jobs" has been found since 13:20:00 UTC on 2018/04/11.(details)
  • Health checker info. : "Aborted pilot jobs" has been found since 18:20:00 UTC on 2018/03/30.
  • Health checker info. : "Short pilot jobs" has been found at 22:20:00 UTC on 2018/03/25.(details)
  • Health checker info. : "Aborted pilot jobs" has been found at 22:20:00 UTC on 2018/03/25.(details)
  • Health checker info. : "Aborted pilot jobs" has been found since 12:20:00 UTC on 2018/03/24.(details)
  • Health checker info. : "Aborted pilot jobs" has been found since 02:20:00 UTC on 2018/03/23.(details)
  • Health checker info. : "Short pilot jobs" has been found at 06:20:00 UTC on 2018/03/22.
  • Health checker info. : "Aborted pilot jobs" has been found since 03:20:00 UTC on 2018/03/13.
  • Health checker info. : "Aborted pilot jobs" has been found since 18:20:00 UTC on 2018/03/08.
  • Health checker info. : "Aborted pilot jobs" has been found since 10:20:00 UTC on 2018/02/16 BIIDCO-718
  • Health checker info. : "Aborted pilot jobs" has been found since 12:20:00 UTC on 2018/01/25 (for more than 2 days) BIIDCO-718

OSG.CORI.us

  • OSG.CORI.us resource has been removed because CY18 allocation was not approved

OSG.UMiss.us

  • Health checker info. : "Short pilot jobs" has been found since 13:20:00 UTC on 2018/03/17.(details)
  • Health checker info. : "Short pilot jobs" has been found since 11:20:00 UTC on 2018/03/15.(details)
  • Health checker info. : "Short pilot jobs" has been found at 07:20:00 UTC on 2018/02/17.
  • Not enough space error: Application finished with errors BIIDCO-241

SSH.KMI.jp

  • Date, Issue, Tickets...

VCYCLE.Napoli.it

  • Date, Issue, Tickets...




Links




