Conversation

@tvami
Contributor

@tvami tvami commented Jun 8, 2022

Replay Request

Requestor

ALCADB

Describe the configuration

  • Release: CMSSW_12_5_0_pre2, then CMSSW_12_3_6
  • Run: 352929,354332
  • GTs:
    • expressGlobalTag: 123X_dataRun3_Express_v8, then 123X_dataRun3_Express_v10
    • promptrecoGlobalTag: 123X_dataRun3_Prompt_v10, then 123X_dataRun3_Prompt_v12
    • alcap0GlobalTag: 123X_dataRun3_Prompt_v10, then 123X_dataRun3_Prompt_v12
  • Additional changes:
    Add new express workflow with PPS AlCa producers

Purpose of the test

After CMSSW_12_5_0_pre2 comes out (later today), we'll be able to test the new PPS AlCa producers with the PPS data collected last week (cmsTalk to keep streamer files for the run is in [1]). This is a lightweight replay where all the other streams are ignored.

[1] https://cms-talk.web.cern.ch/t/streamer-files-to-be-kept-for-pps-test-run-352929/11213

T0 Operations cmsTalk thread

https://cms-talk.web.cern.ch/t/replay-for-testing-pps-pcl-on-run-3-data/11398

@tvami
Contributor Author

tvami commented Jun 8, 2022

test syntax please

@tvami
Contributor Author

tvami commented Jun 8, 2022

attn: @vavati @francescobrivio @malbouis

Contributor

@francescobrivio francescobrivio left a comment

Should we also add a producer (PPSCalMaxTracks) to the PPS dataset:

# PPS 2022
DATASETS = ["AlCaPPS"]
for dataset in DATASETS:
    addDataset(tier0Config, dataset,
               do_reco=False,
               scenario=ppScenario)

?
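
For illustration, here is a minimal sketch of what that suggestion could look like. It is not the real Tier-0 configuration: `addDataset` is replaced by a hypothetical stand-in that only records its arguments so the snippet runs anywhere, the `alca_producers` keyword name is an assumption, and `ppScenario` is set to the `ppEra_Run3` value reported in the replay summaries of this PR.

```python
# Hypothetical stand-in for the Tier-0 configuration helper, so this snippet
# runs outside a Tier-0 environment. It only records what it is given.
def addDataset(config, name, **settings):
    config.setdefault("datasets", {})[name] = settings


tier0Config = {}
ppScenario = "ppEra_Run3"  # value taken from the replay summaries in this PR

# PPS 2022
DATASETS = ["AlCaPPS"]
for dataset in DATASETS:
    addDataset(tier0Config, dataset,
               do_reco=False,
               # Assumed keyword name for attaching AlCa producers to a dataset.
               alca_producers=["PPSCalMaxTracks"],
               scenario=ppScenario)

print(tier0Config["datasets"]["AlCaPPS"])
```

Whether the producer belongs on this dataset at all is the open question in the comment above; the sketch only shows where such a setting would sit.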

@tvami tvami force-pushed the TestPCL_forPPS_Run3 branch from 98a7904 to 370abfe Compare June 8, 2022 19:46
@tvami tvami force-pushed the TestPCL_forPPS_Run3 branch from 370abfe to e7242ce Compare June 8, 2022 19:50
@tvami
Contributor Author

tvami commented Jun 8, 2022

test syntax please

@tvami
Contributor Author

tvami commented Jun 8, 2022

@vavati do we need the DQM step for the prompt PPS wf?

@tvami
Contributor Author

tvami commented Jun 8, 2022

Add ctpps DQM sequance to PPS prompt

I added it here: b5fdea3. Let me know if this was wrong of me to do.

@tvami
Contributor Author

tvami commented Jun 8, 2022

test syntax please

@jhonatanamado
Contributor

@tvami, a replay node is available for this test. Just trigger it yourself as soon as the CMSSW release is available.

@tvami
Contributor Author

tvami commented Jun 8, 2022

Thanks @jhonatanamado, great to hear that! Indeed, we'll need to wait for @qliphy to upload the new release to cvmfs.

@vavati

vavati commented Jun 8, 2022

Add ctpps DQM sequance to PPS prompt

I added it here b5fdea3 let me know if this was wrong of me to do.

It should be ok. Let's see...

@francescobrivio
Contributor

run replay please

  • release is available: /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0_pre2

@cmsdmwmbot

Replay testing PR 'Add PPS ALCA producers for a replay on Run-3 data'
An automatic replay has been requested by francescobrivio.
Here is a brief description of the replay.
Deployment ID: 220609094033
Github PR: #4686
PR author: tvami
Requestor: ALCADB
Injected runs: 352929
CMSSW release: CMSSW_12_5_0_pre2
Tier0 release: 3.0.4
ppScenario: ppEra_Run3
Tier0 Config: https://cmst0.web.cern.ch/CMST0/tier0/offline_config/ReplayOfflineConfiguration_047.php
Contatiner ID: 1
Jenkins Build: https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-T0-PR-test-job/469/
Jira Issue : https://its.cern.ch/jira/browse/CMSTZDEV-747

@cmsdmwmbot

Monitoring for replay is closed.
Log Begins ====
Tier0_REPLAY v469 DMWM-T0-PR-test-job on vocms047.cern.ch. Add PPS ALCA producers for a replay on Run-3 data
JIRA URL : https://its.cern.ch/jira/browse/CMSTZDEV-747
There is no fileset and job.
All filesets were closed.
There was NO paused job in the replay.
End.
Replay was succesfull.
End.

End Of Log ====

@germanfgv
Contributor

There was a problem with the definition of ALCAPPS. I'll resubmit.

@germanfgv
Contributor

run replay please

@francescobrivio
Contributor

@germanfgv could you explain the change in c3c7586? I'm a bit lost...

Also, thanks for restarting the replay! :)

@cmsdmwmbot

Replay testing PR 'Add PPS ALCA producers for a replay on Run-3 data'
An automatic replay has been requested by francescobrivio.
Here is a brief description of the replay.
Deployment ID: 220609100020
Github PR: #4686
PR author: tvami
Requestor: ALCADB
Injected runs: 352929
CMSSW release: CMSSW_12_5_0_pre2
Tier0 release: 3.0.4
ppScenario: ppEra_Run3
Tier0 Config: https://cmst0.web.cern.ch/CMST0/tier0/offline_config/ReplayOfflineConfiguration_047.php
Contatiner ID: 1
Jenkins Build: https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-T0-PR-test-job/470/
Jira Issue : https://its.cern.ch/jira/browse/CMSTZDEV-748

@tvami
Contributor Author

tvami commented Jun 9, 2022

Here is the Grafana link for the replay:
https://monit-grafana.cern.ch/d/t_jr45h7k/cms-tier0-replayid-monitoring?orgId=11&var-Bin=5m&var-ReplayID=220609100020&var-JobType=All&var-WorkflowType=All&refresh=1m

@germanfgv do I understand correctly that nothing is running? Does that mean there is an issue?

@tvami tvami force-pushed the TestPCL_forPPS_Run3 branch from c3c7586 to 4331b6b Compare June 9, 2022 13:00
@cmsdmwmbot

Monitoring for replay is closed.
Log Begins ====
Tier0_REPLAY v470 DMWM-T0-PR-test-job on vocms047.cern.ch. Add PPS ALCA producers for a replay on Run-3 data
JIRA URL : https://its.cern.ch/jira/browse/CMSTZDEV-748
All repack workflows were processed.
All filesets were closed.
There was NO paused job in the replay.
End.
Replay was succesfull.
End.

End Of Log ====

@germanfgv
Contributor

run replay please

@cmsdmwmbot

Replay testing PR 'Add PPS ALCA producers for a replay on Run-3 data'
An automatic replay has been requested by germanfgv.
Here is a brief description of the replay.
Deployment ID: 220609173230
Github PR: #4686
PR author: tvami
Requestor: ALCADB
Injected runs: 352929
CMSSW release: CMSSW_12_5_0_pre2
Tier0 release: 3.0.4
ppScenario: ppEra_Run3
Tier0 Config: https://cmst0.web.cern.ch/CMST0/tier0/offline_config/ReplayOfflineConfiguration_047.php
Contatiner ID: 1
Jenkins Build: https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-T0-PR-test-job/471/
Jira Issue : https://its.cern.ch/jira/browse/CMSTZDEV-749

@tvami
Contributor Author

tvami commented Jun 9, 2022

@germanfgv
Contributor

We are running a large-scale replay that's taking over a lot of resources. That's why we have a lot of idle jobs. Some of the jobs have failed and are being retried, but the failure is not likely to be recoverable:

Fatal Exception (Exit Code: 8006)
An exception of category 'ProductNotFound' occurred while
   [0] Processing  Event run: 352929 lumi: 3 event: 55649 stream: 4
   [1] Running path 'dqmoffline_step'
   [2] Prefetching for module TrackToTrackComparisonHists/'hltMerged2highPurityPV'
   [3] Prefetching for module BeamSpotOnlineProducer/'offlineBeamSpot'
   [4] Calling method for module ScalersRawToDigi/'scalersRawToDigi'
Exception Message:
Principal::getByToken: Found zero products matching all criteria
Looking for type: FEDRawDataCollection
Looking for module label: rawDataCollector
Looking for productInstanceName: 

   Additional Info:
      [a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
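
When triaging several of these tarball logs, it can help to reduce each exception block to the few fields that matter. The following is a rough sketch using only the Python standard library; the `summarize` helper and its regular expressions are illustrative assumptions based on the layout of the log above, not part of any CMSSW or Tier-0 tool.

```python
import re

# Excerpt of the exception block quoted above.
LOG = """\
Fatal Exception (Exit Code: 8006)
An exception of category 'ProductNotFound' occurred while
   [0] Processing  Event run: 352929 lumi: 3 event: 55649 stream: 4
   [1] Running path 'dqmoffline_step'
   [2] Prefetching for module TrackToTrackComparisonHists/'hltMerged2highPurityPV'
   [3] Prefetching for module BeamSpotOnlineProducer/'offlineBeamSpot'
   [4] Calling method for module ScalersRawToDigi/'scalersRawToDigi'
Exception Message:
Principal::getByToken: Found zero products matching all criteria
Looking for type: FEDRawDataCollection
Looking for module label: rawDataCollector
"""


def _first(pattern, text):
    """Return the first capture group of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def summarize(log):
    """Extract exit code, category, path, failing module and missing product."""
    # Every "module Class/'label'" entry, in the order the context stack lists them.
    modules = re.findall(r"module (\w+)/'([^']+)'", log)
    exit_code = _first(r"Exit Code: (\d+)", log)
    return {
        "exit_code": int(exit_code) if exit_code else None,
        "category": _first(r"category '([^']+)'", log),
        "path": _first(r"Running path '([^']+)'", log),
        # The last entry is the module whose method was actually being called.
        "failing_module": modules[-1] if modules else None,
        "missing_type": _first(r"Looking for type: (\S+)", log),
        "missing_label": _first(r"Looking for module label: (\S+)", log),
    }


summary = summarize(LOG)
for key, value in summary.items():
    print(f"{key}: {value}")
```

For the log above, this points at `scalersRawToDigi` in `dqmoffline_step` asking for a `FEDRawDataCollection` labelled `rawDataCollector`.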

@jhonatanamado
Contributor

jhonatanamado commented Jun 14, 2022

This replay shows almost the same error as before, but this time it is in the ALCARECO step:

cmsRun1
Fatal Exception (Exit Code: 8006)
An exception of category 'ProductNotFound' occurred while
   [0] Processing  Event run: 352929 lumi: 21 event: 9299315 stream: 1
   [1] Running path 'write_StreamALCAPPS_ALCARECO_step'
   [2] Prefetching for module PoolOutputModule/'write_StreamALCAPPS_ALCARECO'
   [3] Prefetching for module CTPPSLocalTrackLiteProducer/'ctppsLocalTrackLiteProducerAlCaRecoProducer'
   [4] Prefetching for module TotemRPLocalTrackFitter/'totemRPLocalTrackFitter'
   [5] Prefetching for module TotemRPUVPatternFinder/'totemRPUVPatternFinder'
   [6] Prefetching for module TotemRPRecHitProducer/'totemRPRecHitProducer'
   [7] Prefetching for module TotemRPClusterProducer/'totemRPClusterProducer'
   [8] Calling method for module TotemVFATRawToDigi/'totemRPRawToDigi'
Exception Message:
Principal::getByToken: Found zero products matching all criteria
Looking for type: FEDRawDataCollection
Looking for module label: rawDataCollector
Looking for productInstanceName: 

   Additional Info:
      [a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.

Logs and tarballs can be found here /afs/cern.ch/user/c/cmst0/public/Tarballs_Replays/PR4686/job_20

@tvami
Contributor Author

tvami commented Jun 14, 2022

@vavati DQM problem solved... now we are at least dealing with an AlCa problem.

Looking here
https://github.com/cms-sw/cmssw/blob/master/EventFilter/CTPPSRawToDigi/python/ctppsRawToDigi_cff.py#L53
and
https://github.com/cms-sw/cmssw/blob/master/EventFilter/CTPPSRawToDigi/python/ctppsRawToDigi_cff.py#L101
and
https://github.com/cms-sw/cmssw/blob/master/EventFilter/CTPPSRawToDigi/python/ctppsRawToDigi_cff.py#L135
they expect rawDataCollector while the ALCARECO expects
hltPPSCalibrationRaw
https://github.com/cms-sw/cmssw/blob/master/Calibration/PPSAlCaRecoProducer/python/ALCARECOPPSCalMaxTracks_cff.py#L17-L19

Weren't these supposed to have the same input? I understand that the second one, hltPPSCalibrationRaw, is the one that's present in the stream, correct? Cloning the modules might still leave the originals running too, and that could cause this, right?
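
To make the suspected mechanism concrete, here is a toy model in plain Python (not CMSSW). The dependency chain is copied from the prefetching stack in the ALCARECO log above. The assumptions are that the event holds only `hltPPSCalibrationRaw`, and that re-pointing the raw-to-digi input at that label is what a fix would amount to; `find_missing` is a hypothetical helper.

```python
def find_missing(module, consumes, available, seen=None):
    """Return (module, label) pairs for inputs that nothing can provide."""
    seen = set() if seen is None else seen
    if module in seen:
        return []
    seen.add(module)
    missing = []
    for label in consumes.get(module, []):
        if label in available:
            continue  # product is already in the event
        if label in consumes:
            # Another module would be run on demand to produce it.
            missing += find_missing(label, consumes, available, seen)
        else:
            missing.append((module, label))
    return missing


# Assumption: the only raw collection present in the stream.
available = {"hltPPSCalibrationRaw"}

# Chain taken from the prefetching stack in the ALCARECO log above.
consumes = {
    "ctppsLocalTrackLiteProducerAlCaRecoProducer": ["totemRPLocalTrackFitter"],
    "totemRPLocalTrackFitter": ["totemRPUVPatternFinder"],
    "totemRPUVPatternFinder": ["totemRPRecHitProducer"],
    "totemRPRecHitProducer": ["totemRPClusterProducer"],
    "totemRPClusterProducer": ["totemRPRawToDigi"],
    "totemRPRawToDigi": ["rawDataCollector"],
}

top = "ctppsLocalTrackLiteProducerAlCaRecoProducer"
broken = find_missing(top, consumes, available)
print(broken)

# Same chain with the raw-to-digi input re-pointed at the stream's label.
fixed_consumes = dict(consumes, totemRPRawToDigi=["hltPPSCalibrationRaw"])
fixed = find_missing(top, fixed_consumes, available)
print(fixed)
```

In this model a single module left on the default label is enough to fail the whole chain, even when the module at the top carries the AlCaRecoProducer suffix.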

@grzanka

grzanka commented Jun 14, 2022

@vavati DQM problem solved... now we are at least dealing with an AlCa problem.

Great !

Looking here https://github.com/cms-sw/cmssw/blob/master/EventFilter/CTPPSRawToDigi/python/ctppsRawToDigi_cff.py#L53 and https://github.com/cms-sw/cmssw/blob/master/EventFilter/CTPPSRawToDigi/python/ctppsRawToDigi_cff.py#L101 and https://github.com/cms-sw/cmssw/blob/master/EventFilter/CTPPSRawToDigi/python/ctppsRawToDigi_cff.py#L135 they expect rawDataCollector while the ALCARECO expects hltPPSCalibrationRaw https://github.com/cms-sw/cmssw/blob/master/Calibration/PPSAlCaRecoProducer/python/ALCARECOPPSCalMaxTracks_cff.py#L17-L19

weren't these supposed to have the same input?

Are you wondering why our default reconstruction expects rawDataCollector rather than hltPPSCalibrationRaw ?

I understand the second one hltPPSCalibrationRaw is the one that's present in the stream, correct?

That is correct, it's one of the reasons for the extensive renaming of modules in the PPSCalMaxTracks producer:
https://github.com/cms-sw/cmssw/blob/c850e19102dcc5fa5978de102c57e40a7410490c/Calibration/PPSAlCaRecoProducer/python/ALCARECOPPSCalMaxTracks_cff.py

@tvami
Contributor Author

tvami commented Jun 14, 2022

Are you wondering why our default reconstruction expects rawDataCollector rather than hltPPSCalibrationRaw ?

Yes indeed that's what I'm wondering about

@grzanka

grzanka commented Jun 14, 2022

Are you wondering why our default reconstruction expects rawDataCollector rather than hltPPSCalibrationRaw ?

Yes indeed that's what I'm wondering about

In fact, it has been using rawDataCollector for at least the past 6 years (looking at the commits in the repo).
hltPPSCalibrationRaw appeared when we started to introduce the new AlCa reco producer in 2021.

@tvami
Contributor Author

tvami commented Jun 14, 2022

We can try to explicitly not run the reco: 8964144

@grzanka

grzanka commented Jun 14, 2022

This replay shows almost the same error as before, but this time it is in the ALCARECO step:

cmsRun1
Fatal Exception (Exit Code: 8006)
An exception of category 'ProductNotFound' occurred while
   [0] Processing  Event run: 352929 lumi: 21 event: 9299315 stream: 1
   [1] Running path 'write_StreamALCAPPS_ALCARECO_step'
   [2] Prefetching for module PoolOutputModule/'write_StreamALCAPPS_ALCARECO'
   [3] Prefetching for module CTPPSLocalTrackLiteProducer/'ctppsLocalTrackLiteProducerAlCaRecoProducer'
   [4] Prefetching for module TotemRPLocalTrackFitter/'totemRPLocalTrackFitter'
   [5] Prefetching for module TotemRPUVPatternFinder/'totemRPUVPatternFinder'
   [6] Prefetching for module TotemRPRecHitProducer/'totemRPRecHitProducer'
   [7] Prefetching for module TotemRPClusterProducer/'totemRPClusterProducer'
   [8] Calling method for module TotemVFATRawToDigi/'totemRPRawToDigi'

I suspect the problem may be related to the fact that PPS Silicon Strips were not included in the PPSCalMaxTracks AlCa reco producer (corresponding module labels were not provided with AlCaRecoProducer suffix and the input label was not adjusted properly).

This is however strange as we exclude strips from the reconstruction:
https://github.com/cms-sw/cmssw/blob/master/Calibration/PPSAlCaRecoProducer/python/ALCARECOPPSCalMaxTracks_cff.py#L80

I need to investigate it further by analysing the provided log files.

@cmsdmwmbot

Replay testing PR 'Add PPS ALCA producers for a replay on Run-3 data'
An automatic replay has been requested by tvami.
Here is a brief description of the replay.
Deployment ID: 220620064728
Github PR: #4686
PR author: tvami
Requestor: ALCADB
Injected runs: 352929
CMSSW release: CMSSW_12_5_0_pre2
Tier0 release: 3.0.4
ppScenario: ppEra_Run3
Tier0 Config: https://cmst0.web.cern.ch/CMST0/tier0/offline_config/ReplayOfflineConfiguration_047.php
Contatiner ID: 1
Jenkins Build: https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-T0-PR-test-job/478/
Jira Issue : https://its.cern.ch/jira/browse/CMSTZDEV-752

@cmsdmwmbot

Monitoring for replay is closed.
Log Begins ====
Tier0_REPLAY v478 DMWM-T0-PR-test-job on vocms047.cern.ch. Add PPS ALCA producers for a replay on Run-3 data
JIRA URL : https://its.cern.ch/jira/browse/CMSTZDEV-752
All repack workflows were processed.
All filesets were closed.
There was NO paused job in the replay.
End.
Replay was succesfull.
End.

End Of Log ====

@tvami
Contributor Author

tvami commented Jun 29, 2022

@tvami
Contributor Author

tvami commented Jun 29, 2022

@jhonatanamado
Contributor

Monitoring for this configuration is https://monit-grafana.cern.ch/d/t_jr45h7k/cms-tier0-replayid-monitoring?orgId=11&var-Bin=5m&var-ReplayID=220706190228&var-JobType=All&var-WorkflowType=All. This configuration shows paused jobs with the following error:

cmsRun1
Fatal Exception (Exit Code: 8009)
An exception of category 'Configuration' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=PoolOutputModule label='ALCARECOStreamPromptCalibProdPPSDiamondSampic'
Exception Message:
EventSelector::init, An OutputModule is using SelectEvents
to request a trigger name that does not exist
The unknown trigger name is: pathALCARECOPromptCalibProdPPSDiamondSampicTiming

Logs and tarballs can be found here /afs/cern.ch/user/c/cmst0/public/Tarballs_Replays/PR4686/job_274
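
This kind of failure could in principle be caught before submitting a replay by cross-checking the path names each output module selects against the paths that are actually defined. Below is a minimal sketch in plain Python under stated assumptions: `unknown_selected_paths` is a hypothetical helper, the list of defined paths is invented for the example, and only exact names are compared, so wildcard selections are not handled.

```python
def unknown_selected_paths(defined_paths, output_modules):
    """Map each output module label to the selected paths that are not defined."""
    defined = set(defined_paths)
    problems = {}
    for label, selected in output_modules.items():
        unknown = [path for path in selected if path not in defined]
        if unknown:
            problems[label] = unknown
    return problems


# Hypothetical set of paths available in the job; the real list would come
# from the generated configuration.
defined_paths = ["pathALCARECOPPSCalMaxTracks"]

# Output module label and selected path as reported in the error above.
output_modules = {
    "ALCARECOStreamPromptCalibProdPPSDiamondSampic": [
        "pathALCARECOPromptCalibProdPPSDiamondSampicTiming",
    ],
}

problems = unknown_selected_paths(defined_paths, output_modules)
for label, paths in problems.items():
    print(f"{label}: unknown path(s) {', '.join(paths)}")
```

An empty result means every selected path exists by name; it does not say anything about whether the paths do what is intended.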

@tvami
Contributor Author

tvami commented Jul 7, 2022

@vavati please note the error with the SAMPIC above
@jhonatanamado I'll remove the SAMPIC now from the config, so let's do another test in a minute

@tvami
Contributor Author

tvami commented Jul 8, 2022

@tvami
Contributor Author

tvami commented Jul 8, 2022

@germanfgv please confirm that the latest replay was successful. And if yes, let's go ahead with #4708 (then I'll close this PR)

@germanfgv
Contributor

Yes @tvami. It was successful, I'll include this in the 12_3_7 replay.

@tvami
Contributor Author

tvami commented Jul 8, 2022

Great, feel free to merge #4708 before doing that

@tvami
Contributor Author

tvami commented Jul 8, 2022

Closing this for #4708

7 participants