Fix NullPointerException in LazyBuildMixIn on jenkins reload #26399
timja
left a comment
Thanks for sorting this out so quickly. I've tweaked the changelog entry a bit; I'm not entirely happy with it, but I think it's better than before.
jglick
left a comment
To be clear, is this reverting all or part of some specific prior PR?
I would say that it's a rework of #11038.
/label ready-for-merge
This PR is now ready for merge. After ~24 hours, we will merge it if there's no negative feedback. Thanks!
Thanks for getting this fix done so quickly. I'm not too familiar with your LTS build cut-off dates, but assuming it gets merged this week, is it likely to make it into 2.541.3 a couple of weeks from now?
see #26397 (comment)
…ci#26399) Co-authored-by: Dmytro Ukhlov <[email protected]>
…ci#26399) Co-authored-by: Dmytro Ukhlov <[email protected]> (cherry picked from commit ef7e6e6)
// Iterate through keySet() instead of entrySet() or values() to avoid triggering lazy loading
// for the first `numToKeep` builds
runMap.keySet().stream().skip(numToKeep).map(runMap::get)
        .filter(r -> r != null && !shouldKeepRun(r, lsb, lstb)).forEach(r -> {
As far as I can tell, this skip does not actually work as advertised. In 2.555.x using RunLoadCounter it seems that even the first numToKeep builds are loaded:
hudson.model.Run.onLoad(Run.java:376)
hudson.model.RunMap.retrieve(RunMap.java:290)
hudson.model.RunMap.retrieve(RunMap.java:64)
jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:451)
jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:445)
jenkins.model.lazy.AbstractLazyLoadRunMap.resolveBuildRef(AbstractLazyLoadRunMap.java:371)
jenkins.model.lazy.AbstractLazyLoadRunMap$BuildReferenceMapAdapterResolver.resolveBuildRef(AbstractLazyLoadRunMap.java:528)
jenkins.model.lazy.BuildReferenceMapAdapter$KeySetAdapter.lambda$iterator$0(BuildReferenceMapAdapter.java:176)
hudson.util.Iterators$6.adapt(Iterators.java:337)
hudson.util.AdaptedIterator.next(AdaptedIterator.java:57)
com.google.common.collect.Iterators$5.computeNext(Iterators.java:674)
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
jenkins.model.lazy.BuildReferenceMapAdapter$KeySetAdapter$1.tryAdvance(BuildReferenceMapAdapter.java:197)
java.base/java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:673)
java.base/java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:718)
java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
hudson.tasks.LogRotator.perform(LogRotator.java:171)
hudson.model.Job.logRotate(Job.java:519)
The current keySet() implementation returns core BuildReferenceMap keys filtered by resolver.isBuildRefResolvable(ref) == true.
jenkins/core/src/main/java/jenkins/model/lazy/BuildReferenceMapAdapter.java
Lines 180 to 187 in 6a36596
isBuildRefResolvable resolves a build if the corresponding BuildReference has not been resolved before (isSet() == false).
jenkins/core/src/main/java/jenkins/model/lazy/AbstractLazyLoadRunMap.java
Lines 551 to 564 in 6a36596
So the first iteration over RunMap.keySet() (or the first one after reload) may trigger BuildReference resolution.
I consider this a reasonable compromise between optimization and consistency.
I'm not sure about all possible use cases. We could make it more consistent by resolving build references on every iteration, or relax consistency further by simply returning the core BuildReferenceMap keys as-is.
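To make the described behavior concrete, here is a minimal sketch, not the Jenkins implementation. `LazyRefSketch` and `KeySetSketch` are hypothetical stand-ins, and the rule "an unset reference is resolved once, then the key is kept only if a build exists" is an assumption taken from the description above rather than from the actual `isBuildRefResolvable` source.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;
import java.util.function.IntFunction;

/** Hypothetical stand-in for a lazily resolved build reference; not the Jenkins class. */
final class LazyRefSketch {
    final int number;
    private String build;   // null until resolved, or when the build cannot be loaded
    private boolean set;    // true once a resolution attempt has been made

    LazyRefSketch(int number) {
        this.number = number;
    }

    /** Resolves the reference on first use, then reports whether a build exists. */
    boolean isResolvable(IntFunction<String> loader, List<Integer> loadLog) {
        if (!set) {
            loadLog.add(number);            // record the (expensive) load
            build = loader.apply(number);
            set = true;
        }
        return build != null;
    }
}

/** Hypothetical map whose key view filters out unresolvable references. */
final class KeySetSketch {
    private final TreeMap<Integer, LazyRefSketch> refs = new TreeMap<>(Comparator.reverseOrder());
    private final IntFunction<String> loader;
    final List<Integer> loadLog = new ArrayList<>();

    KeySetSketch(IntFunction<String> loader, int... numbers) {
        this.loader = loader;
        for (int n : numbers) {
            refs.put(n, new LazyRefSketch(n));
        }
    }

    /** Keys newest-first, keeping only those whose reference is resolvable. */
    List<Integer> keys() {
        List<Integer> result = new ArrayList<>();
        for (LazyRefSketch ref : refs.values()) {
            if (ref.isResolvable(loader, loadLog)) {
                result.add(ref.number);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Build 2 is "missing on disk": the hypothetical loader returns null for it.
        KeySetSketch map = new KeySetSketch(n -> n == 2 ? null : "build#" + n, 1, 2, 3);
        System.out.println(map.keys() + " loads=" + map.loadLog);  // first pass resolves every reference
        map.loadLog.clear();
        System.out.println(map.keys() + " loads=" + map.loadLog);  // second pass resolves nothing
    }
}
```

In this sketch the first key iteration pays for every load and later iterations pay for none, which is the compromise described above. It does not model references being cleared again later.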
I do not have a strong opinion at this point, but at least the existing comment
> avoid triggering lazy loading for the first `numToKeep` builds
does not seem to be true as written.
Context: analyzing a severe performance problem reported by a CloudBees CI customer (running in high availability mode, though I do not believe it matters here) and examining FINEST stack traces from RunMap, I found that hundreds of thousands of build records were being loaded by LogRotator when iterating artifactNumToKeep, because (I inferred) this was set to a value far lower than numToKeep (or numToKeep was not configured at all).
Imagine you have a job whose last build is number 1000. numToKeep is set to 100 while artifactNumToKeep is set to 10. So the existing builds will be 901–1000, of which only 991–1000 have artifacts while 901–990 still exist but have no artifacts. Now say you run builds 1001, 1002, and 1003, and then LogRotator runs due to the hourly background build discarder. So it should process numToKeep by skipping over the last 100, examining 901, 902, and 903, and deleting each. Fine so far. Then it processes artifactNumToKeep by skipping over the last 10, examining 904–993. It will delete artifacts from 991, 992, and 993 (forcing them to be loaded, fine); but it will also load 904–990 into memory only to find that they did not have any artifacts (they were deleted long ago). This is a serious performance problem.
Without some optimization in core, my conclusion is that you just should not configure artifactNumToKeep on a job with a lot of builds (unless it is nearly as big as numToKeep) because you will be constantly checking builds for nonexistent artifacts and increasing heap (and I/O and CPU).
To confirm my suspicion, I wrote a test using RunLoadCounter.countLoads. But to my surprise, the performance was even worse than predicted: the numToKeep loop loads not just 901–903 but 901–1003.
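The arithmetic of the expected behavior in the scenario above can be checked with a small standalone sketch. It only models build numbers, not Jenkins itself; `LogRotatorScenario` and its helpers are hypothetical names, and it reproduces the predicted loads rather than the worse behavior measured with RunLoadCounter.

```java
import java.util.ArrayList;
import java.util.List;

final class LogRotatorScenario {
    /** Build numbers from newest to oldest, the order in which the run map is iterated. */
    static List<Integer> newestFirst(int oldest, int newest) {
        List<Integer> builds = new ArrayList<>();
        for (int n = newest; n >= oldest; n--) {
            builds.add(n);
        }
        return builds;
    }

    /** Builds left over after skipping the newest {@code keep} entries. */
    static List<Integer> beyond(List<Integer> builds, int keep) {
        return new ArrayList<>(builds.subList(Math.min(keep, builds.size()), builds.size()));
    }

    /** Number of examined builds that turn out to have no artifacts left. */
    static int wastedLoads(List<Integer> examined, int oldestWithArtifacts) {
        int wasted = 0;
        for (int n : examined) {
            if (n < oldestWithArtifacts) {
                wasted++;
            }
        }
        return wasted;
    }

    public static void main(String[] args) {
        List<Integer> builds = newestFirst(901, 1003);
        List<Integer> discarded = beyond(builds, 100);   // numToKeep = 100
        builds.removeAll(discarded);                     // 904..1003 remain
        List<Integer> examined = beyond(builds, 10);     // artifactNumToKeep = 10
        System.out.println("discarded: " + discarded);
        System.out.println("examined for artifacts: " + examined.size());
        System.out.println("loaded for nothing: " + wastedLoads(examined, 991));
    }
}
```

Under these assumptions the artifact pass examines 90 builds (904–993), of which 87 (904–990) are loaded only to find nothing to delete.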
> I do not have a strong opinion at this point, but at least the existing comment
Before my change, this caused a performance issue in our setup as well. I made a minimal performance improvement, which was enough to fix the problem in my setup.
> But to my surprise, the performance was even worse than predicted: the numToKeep loop loads not just 901–903 but 901–1003.
Yes, but only the first time after a RunMap is loaded; the next hourly log rotator run won't do this. That was enough for my use case.
> avoid triggering lazy loading for the first numToKeep builds
Agreed, it is better to correct the comment (or change the behavior).
> Without some optimization in core, my conclusion is that you just should not configure artifactNumToKeep
Yes, we also decided not to use it. We could try to keep the mandatory fields in memory (start/end datetime, number of artifacts, shouldKeep, etc., as BuildReference fields). That should be easy for read-only fields, but it is harder for a field like shouldKeep, which can change after build completion (possibly we could use a ReferenceQueue and store the build's mandatory data in the BuildReference object when the build is garbage-collected, or find some other way).
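A minimal sketch of the "keep mandatory fields in memory" idea, covering only the easy read-only case. `HeavyBuildSketch` and `SummaryRefSketch` are hypothetical classes, not the Jenkins `BuildReference`. The summary is copied when the reference is created, which is an assumption of this sketch; mutable fields such as shouldKeep are deliberately left out.

```java
import java.lang.ref.SoftReference;

/** Hypothetical heavy build record; stands in for a fully loaded Run. */
final class HeavyBuildSketch {
    final int number;
    final long startMillis;
    final int artifactCount;

    HeavyBuildSketch(int number, long startMillis, int artifactCount) {
        this.number = number;
        this.startMillis = startMillis;
        this.artifactCount = artifactCount;
    }
}

/** Hypothetical reference that keeps a few read-only fields after the build is released. */
final class SummaryRefSketch {
    final int number;
    final long startMillis;    // read-only once the build has completed
    final int artifactCount;   // assumed read-only here; artifact deletion would have to update it
    private final SoftReference<HeavyBuildSketch> ref;

    SummaryRefSketch(HeavyBuildSketch build) {
        this.number = build.number;
        this.startMillis = build.startMillis;
        this.artifactCount = build.artifactCount;
        this.ref = new SoftReference<>(build);
    }

    /** Returns the heavy build if it is still in memory, else null. */
    HeavyBuildSketch getIfLoaded() {
        return ref.get();
    }

    /** Simulates the garbage collector clearing the soft reference. */
    void release() {
        ref.clear();
    }

    public static void main(String[] args) {
        SummaryRefSketch summary = new SummaryRefSketch(new HeavyBuildSketch(904, 1_700_000_000_000L, 0));
        summary.release();
        // The heavy object is gone, but the question "does it have artifacts?" needs no reload.
        System.out.println("loaded=" + (summary.getIfLoaded() != null)
                + " artifacts=" + summary.artifactCount);
    }
}
```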
> I do not have a strong opinion at this point, but at least the existing comment
Please decide what to do (fix the comment or align the code with the comment); I can create a PR for it.
> just first time after loading a RunMap, during next-hour log rotator job it won't do this
Even if the SoftReferences are cleared? In the case I found, even loading the old builds once per session was bad enough at that scale, in conjunction with OldDataMonitor. Now I remember that I already wrote up the problem in #26711.
> we can try to keep mandatory fields in memory
The trouble (for this case) is that currently there is no API to determine given a Job × number whether artifacts exist except by loading the Run, in the general case that a plugin like artifact-manager-s3 is in use:
jenkins/core/src/main/java/jenkins/model/ArtifactManager.java
Lines 71 to 72 in 142d36e
LogRotator could write some placeholder file .artifacts-deleted to avoid reprocessing builds (at the expense of a little I/O).
> decide what to do (fix the comment or align the code with the comment), I can create a PR for this
I guess fix the comment for now, since I do not have a clear idea of what to improve in the code. Thanks!
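For illustration, the `.artifacts-deleted` placeholder idea mentioned above could look roughly like this. It is a sketch using plain `java.nio.file` calls on a build directory; `ArtifactsDeletedMarker` and its methods are hypothetical, and nothing like this exists in LogRotator as far as this thread shows.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

final class ArtifactsDeletedMarker {
    static final String MARKER = ".artifacts-deleted";

    /** True if a previous rotation already handled this build directory. */
    static boolean alreadyProcessed(Path buildDir) {
        return Files.exists(buildDir.resolve(MARKER));
    }

    /** Writes the empty marker file; calling it again is a no-op. */
    static void markProcessed(Path buildDir) {
        try {
            Files.createFile(buildDir.resolve(MARKER));
        } catch (FileAlreadyExistsException e) {
            // Already marked by an earlier run: nothing to do.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for the demo only: a throwaway directory standing in for a build directory. */
    static Path newTempBuildDir() {
        try {
            return Files.createTempDirectory("build-dir-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path buildDir = newTempBuildDir();
        System.out.println("before: " + alreadyProcessed(buildDir));
        markProcessed(buildDir);   // would follow the artifact deletion for this build
        System.out.println("after: " + alreadyProcessed(buildDir));
    }
}
```

A rotation pass could then check `alreadyProcessed` from the build number alone and skip loading the Run, at the cost of one file-existence check per build.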
jenkins/core/src/main/java/jenkins/model/lazy/AbstractLazyLoadRunMap.java Lines 558 to 563 in 142d36e
Yes, that is a possible way to go.
Fixes #26397
This PR removes the lazy loading behavior of RunMap’s entry.getValue(). The value will now be resolved inside iterator.next() instead of in entry.getValue() itself.
The weak semantics will be preserved for keySet() iteration, allowing iteration without resolving values on each step. Values can still be resolved explicitly by calling RunMap.get(someKey).
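The difference between resolving in `entry.getValue()` and resolving during iteration can be sketched as follows. This is an illustration only, not the code in this PR: `EagerEntryIterator` is a hypothetical class, and skipping keys whose value cannot be resolved is an assumption of the sketch.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

/** Hypothetical iterator that resolves each value while advancing, never inside getValue(). */
final class EagerEntryIterator<V> implements Iterator<Map.Entry<Integer, V>> {
    private final Iterator<Integer> keys;
    private final IntFunction<V> resolver;
    private Map.Entry<Integer, V> pending;

    EagerEntryIterator(Iterator<Integer> keys, IntFunction<V> resolver) {
        this.keys = keys;
        this.resolver = resolver;
    }

    @Override
    public boolean hasNext() {
        // Advance until a key resolves to a non-null value (assumption: unresolvable keys are skipped).
        while (pending == null && keys.hasNext()) {
            Integer key = keys.next();
            V value = resolver.apply(key);
            if (value != null) {
                pending = new AbstractMap.SimpleImmutableEntry<>(key, value);
            }
        }
        return pending != null;
    }

    @Override
    public Map.Entry<Integer, V> next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        Map.Entry<Integer, V> result = pending;
        pending = null;
        return result;   // getValue() on this entry is a plain field read and cannot be null
    }

    public static void main(String[] args) {
        // Build 2 cannot be loaded: the hypothetical resolver returns null for it.
        Iterator<Map.Entry<Integer, String>> it = new EagerEntryIterator<>(
                List.of(3, 2, 1).iterator(), n -> n == 2 ? null : "build#" + n);
        List<String> values = new ArrayList<>();
        while (it.hasNext()) {
            values.add(it.next().getValue());
        }
        System.out.println(values);
    }
}
```

With this shape, a caller that receives an entry never sees a null value from `getValue()`, because entries that fail to resolve are never handed out.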
Testing done
Only the Jenkins integration tests. The issue is hard to reproduce.
Proposed changelog entries
RunMap that causes issues when reloading
Proposed changelog category
/label regression-fix
Proposed upgrade guidelines
N/A
Desired reviewers
@jglick