
Conversation

@svonworl (Contributor) commented Oct 9, 2025

Description
This PR makes some improvements to the Zenodo DOI generation code that should increase the chances of successfully generating new DOIs.

Sometime in June, as far as we can tell, the DOI creation process began to fail because the Zenodo createFile and deleteFile endpoints were responding with spurious 403 errors (more often than not) and occasional 503s. The calls would still sometimes succeed, and it does not appear that we were using the endpoints incorrectly. Rather, something is going sideways on the Zenodo side.

When the DOI generation process fails, a DOI is not generated for the tagged version, which is a bummer. However, something more insidious was happening...

The failed DOI generations left draft deposits in the Zenodo system, causing all future DOI generation attempts to fail for the associated workflow.

This PR addresses the above problems by:

  1. Attempting to delete any draft deposit(s?) corresponding to the concept DOI, at the start of the DOI generation process. This should address the problem of existing drafts and "jammed" workflows, with the caveat that we find drafts via what appears to be an ElasticSearch query on the Zenodo side, which may not always provide up-to-date information (a draft may not yet be indexed, or still be in the index after it is deleted).
  2. Deleting the in-process draft deposit if there's a failure, as we exit the DOI generation code. This should keep the system free of new drafts.
  3. Retrying the createFile and deleteFile calls on failure, to increase the probability that we will succeed (a minimal retry sketch appears after this list). The code currently makes 5 attempts, each separated by a 1-second sleep. I'm tempted to increase the number of attempts, but I'm also concerned about triggering a rate limit.
  4. Adding a bit of LOG output that'll make it easier to tell how the above changes are working.
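For illustration, here's a minimal sketch of the retry behavior described in item 3. The class and helper names (ZenodoRetrySketch, callWithRetries), the Supplier-based wrapping, and the logger setup are assumptions made for the sketch, not the actual ZenodoHelper implementation; only the 5 attempts and the 1-second sleep come from the description above.

import java.util.function.Supplier;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ZenodoRetrySketch {
    private static final Logger LOG = LoggerFactory.getLogger(ZenodoRetrySketch.class);
    private static final int MAX_ATTEMPTS = 5;
    private static final long SLEEP_MILLISECONDS = 1000L;

    // Wrap a flaky Zenodo call (such as createFile or deleteFile) and retry it
    // a few times before giving up, sleeping briefly between attempts.
    static <T> T callWithRetries(Supplier<T> zenodoCall, String description) {
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return zenodoCall.get();
            } catch (RuntimeException e) {
                LOG.info("attempt {} of {} to {} failed", attempt, MAX_ATTEMPTS, description, e);
                lastFailure = e;
                if (attempt < MAX_ATTEMPTS) {
                    try {
                        Thread.sleep(SLEEP_MILLISECONDS);
                    } catch (InterruptedException interrupted) {
                        Thread.currentThread().interrupt();
                        break;
                    }
                }
            }
        }
        throw lastFailure;
    }
}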

In tandem, the above changes should allow DOI generation to succeed much more frequently. However, it'll still fail on occasion.

It's very difficult to test how this code responds to various Zenodo failures, especially via automated tests. So, instead, I tested it locally by hand, tweaking the code in various spots to simulate failures (including leaving a draft in the Zenodo sandbox) and submitting requests to confirm that the code was working properly.
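As a rough illustration of that kind of local tweak (not code that ships in this PR; the class name, method name, and failure rate are made up), one can make the file-upload path fail most of the time to exercise the retry and cleanup logic:

import java.util.concurrent.ThreadLocalRandom;

class ZenodoFaultInjectionSketch {
    // Temporarily call this at the top of the createFile wrapper during local
    // testing; roughly 75% of calls will then fail with a simulated error.
    static void maybeSimulateZenodoFailure() {
        if (ThreadLocalRandom.current().nextInt(100) < 75) {
            throw new RuntimeException("simulated Zenodo 403");
        }
    }
}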

Review Instructions
On staging, push some tagged versions on an entry, and confirm that most of the DOIs have been generated correctly. Try the same thing on prod, after we deploy. After a few weeks on prod, analyze the logs and see if we need to take any more action.

Issue
https://ucsc-cgl.atlassian.net/browse/SEAB-7226

Security and Privacy

If there are any concerns that require extra attention from the security team, highlight them here and check the box when complete.

  • Security and Privacy assessed

e.g. Does this change...

  • Any user data we collect, or data location?
  • Access control, authentication or authorization?
  • Encryption features?

Please make sure that you've checked the following before submitting your pull request. Thanks!

  • Check that you pass the basic style checks and unit tests by running mvn clean install
  • Ensure that the PR targets the correct branch. Check the milestone or fix version of the ticket.
  • Follow the existing JPA patterns for queries, using named parameters, to avoid SQL injection
  • If you are changing dependencies, check the Snyk status check or the dashboard to ensure you are not introducing new high/critical vulnerabilities
  • Assume that inputs to the API can be malicious, and sanitize and/or check for Denial of Service type values, e.g., massive sizes
  • Do not serve user-uploaded binary images through the Dockstore API
  • Ensure that endpoints that only allow privileged access enforce that with the @RolesAllowed annotation
  • Do not create cookies, although this may change in the future
  • If this PR is for a user-facing feature, create and link a documentation ticket for this feature (usually in the same milestone as the linked issue). Style points if you create a documentation PR directly and link that instead.

@svonworl svonworl self-assigned this Oct 9, 2025

codecov bot commented Oct 9, 2025

Codecov Report

❌ Patch coverage is 60.78431% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.02%. Comparing base (cc46e6f) to head (ac2426a).
⚠️ Report is 4 commits behind head on hotfix/1.18.1.

Files with missing lines Patch % Lines
.../io/dockstore/webservice/helpers/ZenodoHelper.java 60.78% 19 Missing and 1 partial ⚠️
Additional details and impacted files
@@                 Coverage Diff                 @@
##             hotfix/1.18.1    #6174      +/-   ##
===================================================
- Coverage            74.07%   74.02%   -0.05%     
- Complexity            5724     5731       +7     
===================================================
  Files                  397      397              
  Lines                20571    20611      +40     
  Branches              2116     2117       +1     
===================================================
+ Hits                 15238    15258      +20     
- Misses                4326     4345      +19     
- Partials              1007     1008       +1     
Flag Coverage Δ
bitbuckettests 25.78% <0.00%> (-0.06%) ⬇️
hoverflytests 27.48% <60.78%> (+0.04%) ⬆️
integrationtests 55.91% <0.00%> (-0.11%) ⬇️
languageparsingtests 10.77% <0.00%> (-0.03%) ⬇️
localstacktests 21.14% <0.00%> (-0.05%) ⬇️
toolintegrationtests 29.70% <0.00%> (-0.06%) ⬇️
unit-tests_and_non-confidential-tests 26.05% <0.00%> (-0.17%) ⬇️
workflowintegrationtests 39.44% <0.00%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@denis-yuen (Member)

  1. Attempting to delete any draft deposit(s?) corresponding to the concept DOI, at the start of the DOI generation process.

Idle thought while reading in sequence: how do we think this will react, particularly in the Broad case where they may be deleting and re-creating multiple tags at the same time (will one webservice delete the in-progress draft deposits being created by another)?

with the caveat that we find drafts via what appears to be an ElasticSearch query on the Zenodo side, which may not always provide up-to-date information (a draft may not yet be indexed, or still be in the index after it is deleted).

On the other hand, maybe this is slow enough?

@denis-yuen denis-yuen left a comment

Tried relying on the index before and had poor results, see below

// Create a Lucene query that finds drafts corresponding to the specified concept DOI.
// Apparently, this endpoint pulls information from ElasticSearch, so the view may be stale.
// Drafts may take a while to appear, or seem to persist after they are deleted.
String query = "(conceptrecid:\"%d\") AND (submitted:\"false\")".formatted(conceptDoiId);
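For example, with an illustrative concept record id of 123456, the formatted query resolves to:

(conceptrecid:"123456") AND (submitted:"false")

which matches only unpublished (draft) deposits belonging to that concept.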
(could also try both)


private static void deleteDeposit(DepositsApi depositsApi, int depositId) {
    try {
        depositsApi.deleteDeposit(depositId);

@svonworl (Contributor, Author) commented Oct 9, 2025

  1. Attempting to delete any draft deposit(s?) corresponding to the concept DOI, at the start of the DOI generation process.

Idle thought reading in sequence. How do we think this will react, thinking particularly of the Broad case where they may be deleting and re-creating multiple tags at the same time (will one webservice delete the in-progress draft deposits being created for another one?)

I think it's ok, the reasoning is something like:

  • Currently, we serialize the push processing at the repo level, so, in theory, for a given repo, we should only be generating a single DOI at a given time.
  • To create the next version DOI, we only remove drafts associated with the particular concept DOI, so we're not going to accidentally clobber drafts of other workflows.
  • The drafts that show up in the query might be stale, and could actually be published or deleted, in which case the deleteDeposit API call fails, the exception is absorbed, and the DOI generation process continues.
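
Here's a minimal sketch of the absorb-and-continue behavior described in the last bullet. The helper name, the log messages, and the broad catch are illustrative (the Zenodo client's exact exception type isn't shown here); the depositsApi.deleteDeposit call mirrors the one quoted above.

// DepositsApi and LOG as used elsewhere in ZenodoHelper.
private static void deleteDepositQuietly(DepositsApi depositsApi, int depositId) {
    try {
        depositsApi.deleteDeposit(depositId);
        LOG.info("deleted draft deposit {}", depositId);
    } catch (Exception e) {
        // The draft may have been published or already removed; log and move on
        // so DOI generation can continue.
        LOG.info("could not delete deposit {}, continuing", depositId, e);
    }
}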

@svonworl (Contributor, Author)

Ok, so, I ran some experiments and dabbled with some test code. I made some improvements to the findDraftDeposits function that make it more reliable and ensure good performance, even if the API starts mixing published deposits into the response for whatever reason.

Here's my conclusion:

I strongly recommend we go with the current solution. Would like to get this into the hotfix, deploy, and assess how it works after a couple of weeks. Can we do that?

My reasoning:

  1. The design and behavior of the various APIs suggest that they provide different views of the same ElasticSearch resource (I could be wrong, of course). If that's true, switching the calls doesn't buy us anything, and calling both endpoints burns time.
  2. When possible, we should use well-documented endpoints, because undocumented endpoints are more likely to change or disappear.
  3. The deleteDeposits endpoint is documented to only delete unpublished deposits, so there's no danger there. As mentioned in point 1, it would not be surprising if the deleteDraftRecord endpoint used the exact same machinery on the backend.

@svonworl svonworl requested a review from denis-yuen October 10, 2025 16:25
@denis-yuen (Member) commented Oct 10, 2025

When possible, we should use well-documented endpoints, because undocumented endpoints are more likely to change or disappear.

To be clear, this is not an argument in favour of the current approach.

The "new" endpoints are documented by zenodo in openapi, but incompletely without return objects.

The "old" endpoints are purely documented by us in openapi by inspecting their textual documentation and behaviour.

I then extended the openapi description that we use (owned by us in our repository) for the "old" endpoints to cover those two "new" endpoints.

@denis-yuen (Member)

3. The deleteDeposits endpoint is documented to only delete unpublished deposits, so there's no danger there. As mentioned in point 1, would not be surprising if the deleteDraftRecord endpoint used the exact same machinery on the backend.

I'm not sure about the downside of just using the new endpoint here.

@denis-yuen denis-yuen left a comment

I'm ok with splitting the difference; how about using the old search endpoint but the new endpoint for deleting drafts?

@svonworl (Contributor, Author)

When possible, we should use well-documented endpoints, because undocumented endpoints are more likely to change or disappear.

To be clear, this is not an argument in favour of the current approach.

The "new" endpoints are documented by zenodo in openapi, but incompletely without return objects.

The "old" endpoints are purely documented by us in openapi by inspecting their textual documentation and behaviour.

It is indeed an argument in favor of the current approach; both of the "old" endpoints are documented here:
https://developers.zenodo.org/

@sonarqubecloud

Quality Gate failed

Failed conditions
54.9% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@svonworl (Contributor, Author)

I changed the code to use the "new" listUserRecords and deleteDraftRecord API calls.

@svonworl svonworl requested a review from denis-yuen October 14, 2025 16:47
// to mix a few published records into the response, or doesn't list the draft first.
final int maxResults = 10;
// In the Zenodo API, page numbers start at 1 (!)
return previewApi.listUserRecords(query, "newest", maxResults, 1, true, false).getHits().getHits().stream()

created https://ucsc-cgl.atlassian.net/browse/SEAB-7341
probably need to work on tags/namespace

@svonworl svonworl merged commit 60b6a3f into hotfix/1.18.1 Oct 14, 2025
19 of 23 checks passed
@svonworl svonworl deleted the feature/seab-7226/address-doi-creation-failures branch October 14, 2025 23:30