
Production Plans

 

Production Status

MC8 

Official production started at 01:30 JST on Feb 17 with a 1 ab-1 sample of phase 3 - Y(4S) generic MC

Submitted 72 signal samples for a total of about 162 million events (Feb 25, ~02:00 JST)

Submitted additional signal samples equivalent to 820 million events (Mar 1, ~00:00 JST)

Submitted first part of 0.5 ab-1 Y(5S) generic sample: bsbs only = 29.24 million events (Mar 1, ~02:00 JST)

Submitted the rest of 0.5 ab-1 Y(5S) generic sample (Mar 1, ~22:30 JST)

Submitted the rest of the phase 3 Y(4S) signal samples (Mar 8)

Submitted the phase 2 signal samples (Mar 8)

Submitted a 200 fb-1 Y(4S) generic sample with a total of just over 1 billion events (Mar 11, ~01:30 JST)

Submitted phase 3 bottomonium samples (Mar 14). Some problematic productions were resubmitted; Miyake-san is fixing the others (Mar 15)

Submitted phase 2 bottomonium samples (Mar 15)

Submitted 300 fb-1 phase 3 Y(3S) generic sample and several small signal samples (Mar 18, ~02:00 JST)

Submitted a large generator-level skim sample (96 x 10^9 generated events). These should finish very quickly, but will make a big spike in the production progress plot (Mar 26, ~02:00 JST)

One last signal sample (7400 jobs) was still in the pipeline (requested before the last DP update). Submitted on Mar 28 at ~22:00 JST.

 


MC7

Official production started as scheduled at 00:00 JST on Nov. 1.

A full list of samples is given on the data production page

A total of 100,000 jobs of 10k events each have been submitted as of Nov. 4

The following productions have been stopped and the corresponding jobs killed due to improper ROOT output (Nov. 10):

  • 517 - nonbsbs phase 2 Y(6S)
  • 518 - bsbs phase 2 Y(6S)

A total of 139,972 jobs of 10k events have been submitted as of Nov. 10

Phase 2 background samples have been distributed. 

  • Distribution to the Grid sites: BIIDCO-31
  • Distribution to the non-Grid sites: BIIDCO-32
  • When they are ready, we will finish production of phase 2 samples. This will complete the official requests for MC7. Thereafter, we will submit additional generic samples and perhaps include some additional requests from the physics group.

Generic phase 2 samples at Y(6S) and Y(4S) as well as some signal samples with backgrounds have been submitted (~23:50 JST on Nov. 11 and ~04:50 JST on Nov. 12). In total 17,776 Y(6S) + 29,608 Y(4S) = 47,384 jobs were submitted.
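The job counts quoted above can be cross-checked with a few lines of Python. This is only a sketch: the 10k events per job is an assumption carried over from the earlier MC7 entries, since this entry does not state the job size itself.

```python
# Job counts quoted in the entry above.
jobs = {"Y(6S)": 17_776, "Y(4S)": 29_608}

# Assumption: 10k events per job, as stated for the earlier MC7 submissions.
EVENTS_PER_JOB = 10_000

total_jobs = sum(jobs.values())
total_events = total_jobs * EVENTS_PER_JOB

print(total_jobs)    # 47384
print(total_events)  # 473840000
```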

All requested MC samples for MC7 have been submitted (Nov 14)

Also submitted Phase III Y(3S) requests, which total 23,850 jobs of 10k events (Nov 14 at 04:30 JST). (Running total is now 215,332 jobs)

Completed database access tests with 2k and 5k concurrent jobs (Nov 16 at ~10:00 JST)

Submitted Phase III Y(4S) generic samples (1 ab-1), which total 575,300 jobs of 10k events (Nov 16 at ~23:40 JST)

Submitted additional jobs including another 1 ab-1 of phase 3 generic Y(4S) samples, which total 636,000 jobs, split roughly half and half between 10k-event and 5k-event jobs. This is roughly equal to the number of jobs previously submitted, bringing the total number to over 1.2 million jobs. (December 8 at ~00:00 JST)

Submitted additional generic samples (1 ab-1 of mixed and charged events), totalling 160,400 additional jobs. (December 12 at ~04:00 JST)

Submitted additional signal samples (12.7k jobs) and a new ccbar sample with modified parameters (159.4k jobs). (December 21 at ~02:00 JST)

The system will keep working through the jobs already submitted until they are exhausted. No additional samples will be submitted until January 9, after the KEK downtime.

 

New submission started January 9 at 0:00 JST with phase 3 - Y(4S) generic samples:

  • mixed (BGx1: 42770 jobs, BGx0: 21380) (~0:00 JST)
  • charged (BGx1: 45230 jobs, BGx0: 22620) (~6:00 JST)
  • uubar, ddbar, ssbar, ccbar, taupair: 538,060 jobs (~01:30 JST)

Submitted new ccbar samples with new parameters: 159,480 jobs (~22:15 JST on Jan 27)

Test production for MC8, including 10 jobs of 10k events each with BG for generic samples (70 jobs total) started Feb. 1.

Submitted additional MC7 signal samples: 11,200 jobs (~02:00 JST on Feb 5)

 

Central Services

Dirac

  • Memory consumption is increasing and fluctuating on one server (b2dchsv05.cc.kek.jp). 2017/03/15 20:30 JST
    → Experts are still investigating. 2017/03/17 00:14 JST
    → b2dchsv05.cc.kek.jp is now reset every hour to avoid reaching the memory limit. 2017/03/17 02:27 JST
  • The memory issue above seems to be gone, although the root cause has not been identified. 2017-03-24 09:28 UTC
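Entries on this page mix JST and UTC timestamps. A minimal sketch for putting them on a common scale is given below; the helper name `jst_to_utc` is hypothetical, and the only fact it relies on is that JST is a fixed UTC+9 offset with no daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# JST is a fixed offset of UTC+9 (no daylight saving time).
JST = timezone(timedelta(hours=9), "JST")


def jst_to_utc(stamp, fmt="%Y/%m/%d %H:%M"):
    """Parse a JST timestamp in the page's YYYY/MM/DD HH:MM style and return it in UTC."""
    local = datetime.strptime(stamp, fmt).replace(tzinfo=JST)
    return local.astimezone(timezone.utc)


# First report of the memory issue above.
print(jst_to_utc("2017/03/15 20:30").strftime("%Y-%m-%d %H:%M UTC"))
# 2017-03-15 11:30 UTC
```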

DDM

  • The parameter "NumFilesPerSE" was increased from 200 to 500, so at most 500 files are submitted to RMS at a time. 2017-04-02 22:14:00 UTC
  • The parameter "NumFilesPerSE" was increased from 100 to 200, so at most 200 files are submitted to RMS at a time. This is to cope with many pending transfers. 2017-04-02 16:44:57 UTC
  • The parameter "DDMLoadonRMS" was decreased from 300.0 to 150.0 because old pending and stuck RMS requests were resubmitted. 2017-03-30 15:30 UTC
  • The number of replication requests to be submitted has been reduced through the parameter /System/DistributedDataManagement/Development/Agents/TransferRequestExecutingAgent/DDMLoadonRMS (see Computing DIRACConfigParameters)
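The effect of a cap such as "NumFilesPerSE" can be pictured with the sketch below. It is not the DDM agent code: the function `batch_per_se` and the reading of the cap as a per-destination-SE limit are assumptions made purely for illustration.

```python
from collections import defaultdict


def batch_per_se(pending, num_files_per_se):
    """Group (destination SE, file name) pairs by SE and cut each group
    into batches of at most num_files_per_se files.

    Hypothetical helper: it only illustrates what a per-SE cap means.
    """
    by_se = defaultdict(list)
    for se, lfn in pending:
        by_se[se].append(lfn)
    return {
        se: [files[i:i + num_files_per_se]
             for i in range(0, len(files), num_files_per_se)]
        for se, files in by_se.items()
    }


# 1200 pending files for one SE with a cap of 500 give batches of 500, 500 and 200.
pending = [("PNNL-TMP-SE", f"/belle/file_{i}.root") for i in range(1200)]
batches = batch_per_se(pending, 500)
print([len(b) for b in batches["PNNL-TMP-SE"]])  # [500, 500, 200]
```

Raising the cap reduces the number of submission rounds needed to clear a backlog, at the price of a larger load on RMS per round.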


FTS

See DDM for related issues

  • SOLVED: The REST interface of the KEK server looked problematic. Submitted GGUS:127355. The FTS3 server was switched to fts.hep.pnnl.gov. 2017-03-29 06:30 UTC
    • FTS3 server switched back to kek2-fts.cc.kek.jp

Monitor

SEs

SE Common Issues

Destination SE: KEK2-TMP-SE (kek2-se01.cc.kek.jp)

Destination SE: PNNL-TMP-SE (se.hep.pnnl.gov)

  • Waiting jobs have been increasing since 2017/03/17 17:00 UTC
    → Experts investigated this issue; it seems to be caused by the NTU-TMP-SE (bgrid3.phys.ntu.edu.tw) downtime (2017/03/17 12:00 - 2017/03/20 23:00)
  • Transfer failure of DestSE se.hep.pnnl.gov from SourceSE grid-srm.rzg.mpg.de (no efficiency from SourceSE grid-srm.rzg.mpg.de in the last three hours), 2017/03/15, 16:08 UTC
  • Continuous increase of Scheduled. Reported to experts at 2:50 UTC on 2017/3/14.
    → The Scheduled jobs were caused by a SourceSE problem at UVic-TMP-SE (charon01.westgrid.ca), tracked in GGUS and now solved. 2017/03/15 07:53 JST
  • Continuous transfer failures (destination) have been observed for several days. 2017/03/12
    BIIDCO-133: Routine small number of failures among a large number of jobs at PNNL-TMP-SE; no issue. 2017/03/12 12:00 JST
  • "Transfer Efficiency" was observed to be below 30% at 13:00:00 UTC on 2017/03/10.
  • "Efficiency" has been below 20% over the last 4 hours.
    • Site admin notified (2017-03-07 22:30 UTC)
  • SE Health check by DDM: remove file, remove directory, download, upload have not worked since 2017-03-07 04:43:39 UTC. Notified comp-dc-operations@belle2.org
  • A steady increase of Scheduled for more than 5 hours at 00:03 UTC on Tuesday, March 7, 2017.
  • SE Health check by DDM: checksum, remove file, remove directory, download, upload, ls have not worked since 2017-03-03 17:59:45 UTC.

Destination SE: DESY-TMP-SE (dcache-se-desy.desy.de)

Destination SE: CNAF-TMP-SE (storm-fe-archive.cr.cnaf.infn.it)

Destination SE: KMI-TMP-SE (nsrmfe01.hepl.phys.nagoya-u.ac.jp)

Destination SE: KIT-TMP-SE (gridka-dcache.fzk.de)

Destination SE: Napoli-TMP-SE (belle-dpm-01.na.infn.it)

Destination SE: CESNET-TMP-SE (dpm1.egee.cesnet.cz)

Destination SE: SIGNET-TMP-SE (dcache.ijs.si)

Other SEs

CYFRONET-TMP-SE (dpm.cyf-kr.edu.pl)

  • 2017-02-21 08:30 JST: Solved and verified: GGUS Ticket #125723 "CYFRONET SE: dpm.cyf-kr.edu.pl not accessible"

McGill SE  (storm02.clumeq.mcgill.ca)

Pisa-TMP-SE (stormfe1.pi.infn.it)

Torino-TMP-SE (se-srm-00.to.infn.it)

HEPHY-TMP-SE (hephyse.oeaw.ac.at)

ULAKBIM-TMP-SE (torik1.ulakbim.gov.tr)

UVic-TMP-SE(charon01.westgrid.ca)

  • File transfer failures observed. GGUS ticket submitted. 2017/03/14 02:02:00 UTC
    → A filesystem problem on one node was solved by the site maintainer at 2017-03-14 21:37 UTC

NTU-TMP-SE (bgrid3.phys.ntu.edu.tw)

  • Downtime from 2017/03/17 12:00 - 2017/03/20 23:00 UTC (BIIDCO-139)
    → The increase in waiting jobs at PNNL-TMP-SRM seems to be caused by this downtime. 2017/03/17 17:00

 

Sites

Sites Common Issues

ARC.DESY.de

  • Health checker info. : "Short pilot jobs" has been found since 05:20:00 UTC on 2017/03/28. → reported to comp-dc-operations@belle2.org
  • Job submission check : Pilot submission failure has been found since 11:29:00 UTC on 2017/03/22.
  • Job submission check : Pilot submission failure has been found since 09:28:00 UTC on 2017/03/21. → reported to comp-dc-operations@belle2.org
  • Health checker info. : "Short pilot jobs" has been found since 18:20:00 UTC on 2017/03/19.
  • Job submission check : Pilot submission failure has been found since 05:28:00 UTC on 2017/03/21.
  • Job submission check : Pilot submission failure has been found at 14:28:00 UTC on 2017/03/20.
  • Health checker info. : "Short pilot jobs" has been found at 07:20:00 UTC on 2017/03/19.
  • Health checker info. : "Short pilot jobs" has been found since 13:20:00 UTC on 2017/03/18.
  • Health checker info. : "Short pilot jobs" has been found at 06:20:00 UTC on 2017/03/18.
  • Health checker info. : "Short pilot jobs" has been found since 13:20:00 UTC on 2017/03/16. → reported to comp-dc-operations@belle2.org 2017/03/16, 15:20 UTC
    → An expert started checking this site at 2017/03/17 13:00 JST → Some file IOErrors were observed at 2017/03/17 15:26 → The IOError is not the cause of the short pilots and is not a site issue.
  • Health checker info. : "Short pilot jobs" has been found since 03:20:00 UTC on 2017/03/15.
  • Health checker info. : "Short pilot jobs" has been found at 06:20:00 UTC on 2017/03/11.
  • Health checker info. : "Short pilot jobs" has been found at 15:20:00 UTC on 2017/03/08.
  • Solved and verified: job submission to grid-arcce[0-1].desy.de failed (https://ggus.eu/index.php?mode=ticket_info&ticket_id=124740)
  • Job submission check : Pilot submission failure has been found since 13:25:00 UTC on 2016/12/03
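The health-checker and job-submission entries follow a fixed wording, so they can be turned into records for sorting or de-duplication. The regular expression and the `parse_alert` helper below are a sketch written against the wording seen on this page; they are not part of any monitoring tool.

```python
import re
from datetime import datetime, timezone

# Matches the two alert wordings used on this page (hypothetical pattern).
ALERT = re.compile(
    r'(?P<check>Health checker info\.|Job submission check)\s*:\s*'
    r'"?(?P<issue>[^"]+?)"?\s+has been found\s+(?P<mode>since|at)\s+'
    r'(?P<time>\d{2}:\d{2}:\d{2}) UTC on (?P<date>\d{4}/\d{2}/\d{2})'
)


def parse_alert(line):
    """Return a dict describing one alert line, or None if the line does not match."""
    m = ALERT.search(line)
    if m is None:
        return None
    when = datetime.strptime(
        m.group("date") + " " + m.group("time"), "%Y/%m/%d %H:%M:%S"
    ).replace(tzinfo=timezone.utc)
    return {
        "check": m.group("check"),
        "issue": m.group("issue").strip(),
        "ongoing": m.group("mode") == "since",  # "since" = still observed, "at" = single observation
        "when": when,
    }


line = 'Health checker info. : "Short pilot jobs" has been found since 05:20:00 UTC on 2017/03/28.'
alert = parse_alert(line)
print(alert["issue"], alert["ongoing"], alert["when"].isoformat())
# Short pilot jobs True 2017-03-28T05:20:00+00:00
```

Parsed records with identical check, issue and timestamp can then be collapsed, which is how repeated entries in a list like the one above could be spotted.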

ARC.KIT.de

  • Health checker info. : "Failed pilot jobs" has been found at 14:20:00 UTC on 2017/03/16. → Both CEs show failing jobs, reported to comp-dc-operations@belle2.org 2017/03/16, 15:20 UTC
  • Downtime info.: all CEs were in downtime within 24 hours. (GOCDB 22727)
  • Job submission check : Pilot submission failure has been found at 14:27:00 UTC on 2017/03/15.
  • Health checker info. : "Aborted pilot jobs" has been found at 06:20:00 UTC on 2017/03/12.
    → One CE (arc-2-gridka.de) was aborting pilots but now appears to be working. 2017/03/13 07:41 JST
  • Health checker info. : "Failed pilot jobs" has been found at 07:20:00 UTC on 2017/03/10.
  • Job submission check : Pilot submission failure has been found since 05:26:00 UTC on 2017/03/10.

ARC.LMU.de

  • This is a test site. Issues do not need to be reported.

ARC.LMU2.de

  • Health checker info. : "Aborted pilot jobs" has been found at 06:20:00 UTC on 2017/03/16.
  • Health checker info. : "Aborted pilot jobs" has been found since 22:20:00 UTC on 2017/03/11.
    → Experts are working on this under JIRA ticket BIIDCO-127 (jobs appear to be running, but a large number of pilots are aborted). 2017/03/10
  • Health checker info. : "Aborted pilot jobs" has been found since 14:20:00 UTC on 2017/03/11. Experts notified again.
  • Health checker info. : "Aborted pilot jobs" has been found since 06:20:00 UTC on 2017/03/09. Notified comp-dc-operations@belle2.org
  • Health checker info. : "Aborted pilot jobs" has been found at 06:20:00 UTC on 2017/03/08. → An expert started to investigate this issue (2017/03/09 21:58 JST)

ARC.MPPMU.de

  • BIIDCO-128
  • Health checker info. : "Short pilot jobs" has been found at 22:20:00 UTC on 2017/01/31.
  • Stalled jobs at ARC.MPPMU.de, reported to comp-dc-operations@belle2.org (Jobs by Final Minor Status)

ARC.SIGNET.si

  • Health checker info. : "Short pilot jobs" has been found at 06:20:00 UTC on 2017/03/16.
  • Job submission check : Pilot submission failure has been found since 05:26:00 UTC on 2017/03/10.
  • BIIDCO-126
  • Job submission check : Pilot submission failure has been found at 14:27:00 UTC on 2017/03/09.

CLOUD.CC1_Krakow.pl

DIRAC.Beihang.cn

  • Banned.
  • Upload to CNAF SE is problematic (packets cannot reach the site).
  • Solved: ggus:124942
    • CNAF-TMP-SE added to OutputSE for verification of the solution.
  • Short pilot jobs at 16:20:00 UTC on 2016/11/05.
  • Large % of failed jobs in DIRAC status plot (Added 2016-11-03 22:45:00 UTC) 
  • Health checker info. : "Short pilot jobs" has been found at 14:20:00 UTC on 2016/11/12.
    • Reported to comp-dc-operations@belle2.org 2016/11/13 00:40:00 JST
    • Suddenly, the site cannot access the KEK DIRAC servers. This produces many short pilot jobs and "input data resolution" jobs. (15/Nov/2016)
      The site maintainer has already been informed.
  • All upload attempts are failing against all configured SEs: OutputSE (KMI-TMP-SE, PNNL-TMP-SE), Fail-over SEs (DESY-TMP-SE, Napoli-TMP-SE, PNNL-TMP-SE, KIT-TMP-SE)
    • Banned for now. BIIDCO-43

DIRAC.BINP.ru

DIRAC.CINVESTAV.mx

DIRAC.DESY.de

DIRAC.IITG.in

DIRAC.LMU.de

  • Not in use in MC production (BIIDCO-26)
  • Banned for now.

DIRAC.MIPT.ru

  • Health checker info. : "Aborted pilot jobs" has been found since 08:20:00 UTC on 2017/03/21. → reported to comp-dc-operations@belle2.org
    All jobs failed and pilots have been aborted since 2017/03/21 10:00 UTC (BIIDCO-141)
    • Site admin notified. 2017 3/22 22:30 (UTC)
    • Fixed. 2017 3/23 3:00 (UTC)
  • Health checker info. : "Aborted pilot jobs" has been found at 14:20:00 UTC on 2017/03/19.
  • Health checker info. : "Aborted pilot jobs" has been found since 15:20:00 UTC on 2016/11/16. These aborted pilot jobs disappeared a few hours later.

DIRAC.Nagoya.jp

  • Health checker info. : "Short pilot jobs" has been found since 07:20:00 UTC on 2017/03/24.
    • dc-operations notified 21:00 2017/03/25
  • Banned for downtime (2017 3/18 6:00 UTC)
    • Un-banned (2017 3/19 5:00 UTC)

DIRAC.Nara-WU.jp

  • Decommissioned site: it still uses SL5, so the DIRAC pilot cannot be executed there.

DIRAC.NDU.jp

DIRAC.Niigata.jp

DIRAC.Osaka-CU.jp

DIRAC.PNNL-CASCADE.us

DIRAC.PNNL-PIC.us

DIRAC.PNNL.us

DIRAC.PNNL2.us

DIRAC.RCNP.jp

  • All jobs failed and none were running from 2017/03/10 00:47 UTC → Scheduled downtime on 2017/03/10; the site has recovered
  • Health checker info. : "Short pilot jobs" has been found since 14:20:00 UTC on 2017/02/17. Experts notified.
    • Cannot access LFC. Under investigation (2017/02/23)
    • Fixed by installing libtool-ltdl (2017/02/24).

DIRAC.SSU.kr

  • 'Failed to install DIRAC' error reported to the site (BIIDCO-118)
    • Fixed on 2017/02/17

DIRAC.TIFR.in

DIRAC.TMU.jp

DIRAC.Tokyo.jp

DIRAC.UAS.mx

DIRAC.UVic.ca

DIRAC.Yamagata.jp

DIRAC.Yonsei.kr

LCG.CESNET.cz

LCG.CNAF.it

LCG.Cosenza.it

LCG.CYFRONET.pl

  • Pilot submission failure was due to a full disk. Site admin notified (GGUS)
    • Fixed now (2017 1/30)

LCG.DESY.de

LCG.Frascati.it

LCG.HEPHY.at

LCG.KEK.jp

LCG.KEK2.jp

  • Job submission check : Pilot submission failure has been found at 07:27:00 UTC on 2017/03/08
    • The same cause as for LCG.KEK.jp (BIIDCO-125)

LCG.KISTI.kr

LCG.KIT.de

  • BIIDCO-131
    • The maximum number of jobs is set to zero (for job drain) 2017/03/07.

LCG.KMI.jp

LCG.Legnaro.it

LCG.McGill.ca

LCG.Melbourne.au

LCG.Napoli.it

  • Submission failures are concentrated on recas-ce02.na.infn.it/cream-pbs-belle.
  • Health checker info. : "Failed pilot jobs" has been found at 06:20:00 UTC on 2017/03/28. → reported to comp-dc-operations@belle2.org
  • Health checker info. : "Failed pilot jobs" has been found at 14:20:00 UTC on 2017/03/20.
  • Health checker info. : "Failed pilot jobs" has been found at 07:20:00 UTC on 2017/03/14.
  • Health checker info. : "Short pilot jobs" has been found since 20:20:00 UTC on 2017/03/05.

LCG.NTU.tw

  • Health checker info. : "Failed pilot jobs" has been found since 00:20:00 UTC on 2017/03/28. → reported to comp-dc-operations@belle2.org
  • Job submission check : Pilot submission failure has been found since 12:32:00 UTC on 2017/03/22.
    • Notified dc-operations 00:37:00 UTC 2017/03/26.
    • GGUS ticket submitted on 2017/3/26 1:30 UTC.
  • Downtime info.: bgrid1.phys.ntu.edu.tw is now in downtime. (GOCDB 22765)
  • Job submission check : Pilot submission failure has been found since 22:29:00 UTC on 2017/03/18.
  • Downtime info.: bgrid1.phys.ntu.edu.tw and bgrid3.phys.ntu.edu.tw will be in downtime. (GOCDB 22765) 2017/03/17 12:00 - 2017/03/20 23:00 UTC
    → Pilot submission failures have been observed on bgrid1.phys.ntu.edu.tw since 2017/03/16 11:00 UTC due to the downtime
  • GGUS ticket : "Not enough disk space on WNs"(126897) has been submitted at 02:26:01 UTC on 2017/03/03.
    • Fixed now
  • GGUS ticket : "[TW-NTU-HEP] Job aborted with BLAH error"(125175) has been submitted at 02:57:16 UTC on 2016/11/25.

LCG.Pisa.it

  • GGUS ticket : "Jobs submitted to gridce0.pi.infn.it are finished immediately"(122842) has been submitted at 08:12:14 UTC on 2016/07/13.
  • Job submission check : Pilot submission failure has been found since 03:29:00 UTC on 2017/03/17.
  • Job submission check : Pilot submission failure has been found since 05:29:00 UTC on 2017/03/14.
  • Health checker info. : "Failed pilot jobs" has been found at 05:20:00 UTC on 2017/03/11.
  • Health checker info. : "Short pilot jobs" has been found at 06:20:00 UTC on 2016/12/04.
    • GGUS ticket : "Jobs submitted to gridce0.pi.infn.it are finished immediately"(122842) has been submitted at 08:12:14 UTC on 2016/07/13.
    • Reported to comp-dc-operations@belle2.org 8:08 UTC on 2016/12/02
    • Health checker info. : "Short pilot jobs" has been found since 09:20:00 UTC on 2016/11/04 (Reported 2016-11-04 15:45 UTC)
    • Health checker info. : "Short pilot jobs" has been found at 13:20:00 UTC on 2016/11/15.
  • Health checker info. : "Failed pilot jobs" has been found since 02:20:00 UTC on 2016/12/02.
    • Health checker info. : "Failed pilot jobs" has been found since 12:20:00 UTC on 2016/11/21.
    • BIIDCO-75

LCG.Roma3.it

LCG.Torino.it

LCG.ULAKBIM.tr

OSG.UMiss.us

  • Health checker info. : "Short pilot jobs" has been found at 23:20:00 UTC on 2017/03/04.

SSH.KMI.jp

  • Banned for downtime (2017 3/18 6:00 UTC).
    • Un-banned (2017 3/18 5:00 UTC).
  • Health checker info. : "Not enough disk space on " has been found since 23:20:00 UTC on 2017/03/11.

Links

 


 

 


Currently no FTS transfers visible for the last 4 hours. 2017/03/30 13:20 UTC Notified comp-dc-operations@belle2.org
