Part of #3654.
Background
magda-indexer already subscribes to DeleteRecord webhook events and deletes indexed datasets from Elasticsearch. The webhook registration includes delete events and includeEvents: true, and WebhookApi currently builds ES document IDs directly from every DeleteRecord event:
DataSet.uniqueEsDocumentId(recordId, tenantId)
This gives magda-indexer on-the-fly deletion, but it does not check the latest registry state before deleting. If the indexer is behind the webhook event stream, an old delete event can remove search data for a record that has already been recreated.
magda-indexer also has a full /reindex cleanup path. A cronjob currently calls that endpoint periodically as a cleanup safety net. That should remain, but event-based deletion should also be safe.
Problem
A stale-delete race is possible:
- A dataset record exists and is indexed.
- The record is deleted and a DeleteRecord event is queued.
- The same record ID is recreated before magda-indexer processes the old delete event.
- magda-indexer processes the old delete event and deletes the ES document.
- The recreated record still exists in the latest registry state, but the dataset disappears from search until later backlog events are processed or /reindex runs.
This is the same class of race addressed by the latest-state guarded deletion design for TypeScript minions and semantic indexers.
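The sequence above can be reproduced with a toy model. This is an illustration only: the event types, the BigInt tenant ID, and the in-memory Set standing in for the ES index are assumptions made for the example, not the actual WebhookApi code.

```scala
object StaleDeleteRace {
  // Hypothetical, simplified event model; real registry events carry more fields.
  sealed trait Event
  final case class CreateRecord(tenantId: BigInt, recordId: String) extends Event
  final case class DeleteRecord(tenantId: BigInt, recordId: String) extends Event

  type Key = (BigInt, String) // (tenantId, recordId)

  // Unguarded behaviour: every DeleteRecord event removes the document,
  // without consulting what the registry currently holds.
  def applyUnguarded(index: Set[Key], events: Seq[Event]): Set[Key] =
    events.foldLeft(index) {
      case (idx, CreateRecord(t, r)) => idx + ((t, r))
      case (idx, DeleteRecord(t, r)) => idx - ((t, r))
    }

  def main(args: Array[String]): Unit = {
    val key: Key = (BigInt(0), "ds-1")
    val registryNow = Set(key) // the record has already been recreated
    val indexBefore = Set(key) // indexed from the original create
    // The indexer is behind: this page only contains the old delete.
    val indexAfter = applyUnguarded(indexBefore, Seq(DeleteRecord(key._1, key._2)))
    println(s"in registry: ${registryNow(key)}, in index: ${indexAfter(key)}")
  }
}
```

The index only recovers once the later create event (or /reindex) is processed, which is the window this ticket aims to close.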
Proposed design
Keep magda-indexer subscribed to DeleteRecord events, but guard each delete event with current registry state before deleting from Elasticsearch.
For each webhook payload:
- Process payload.records as index candidates, as today.
- Extract DeleteRecord events from payload.events.
- Deduplicate delete events by tenantId + recordId.
- For each delete event, decide whether to delete:
  - If the same tenantId + recordId is present in payload.records, skip delete.
  - Otherwise, fetch latest registry state for that exact tenantId + recordId.
  - If the latest scoped record exists and can be converted to DataSet, skip delete.
  - If the latest scoped lookup returns 404 because the record is deleted or no longer has required indexer aspects, delete the ES document.
  - If the latest lookup returns a transient/server error, fail webhook processing so registry retries.
- Acknowledge the webhook only after indexing and guarded deletion both succeed.
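The per-event decision in the steps above could look roughly like the following. It is a minimal sketch under stated assumptions: Key, Latest, shouldDelete and plan are hypothetical names, the lookup is passed in as a plain function instead of a real registry client, and the code is synchronous for brevity where the real indexer would be asynchronous.

```scala
object GuardedDelete {
  type Key = (BigInt, String) // (tenantId, recordId)

  // Hypothetical result of the tenant-scoped latest-state lookup.
  sealed trait Latest
  case object Indexable extends Latest                      // exists and converts to DataSet
  case object NotFound extends Latest                       // 404: deleted or out of scope
  final case class Transient(reason: String) extends Latest // 5xx, network error, ...

  // Decide for one delete event. Right(true) = delete the ES document,
  // Right(false) = skip, Left = fail webhook processing so registry retries.
  def shouldDelete(key: Key, payloadRecords: Set[Key], lookup: Key => Latest): Either[String, Boolean] =
    if (payloadRecords.contains(key)) Right(false) // current record is in the payload
    else
      lookup(key) match {
        case Indexable         => Right(false)
        case NotFound          => Right(true)
        case Transient(reason) => Left(s"latest-state lookup failed for $key: $reason")
      }

  // Deduplicate the delete events, then collect the keys whose ES documents
  // should be deleted. `lookup` is not called again after the first failure.
  def plan(deleteEvents: Seq[Key], payloadRecords: Set[Key], lookup: Key => Latest): Either[String, Vector[Key]] =
    deleteEvents.distinct.foldLeft[Either[String, Vector[Key]]](Right(Vector.empty)) { (acc, key) =>
      acc.flatMap { keys =>
        shouldDelete(key, payloadRecords, lookup).map(del => if (del) keys :+ key else keys)
      }
    }

  def main(args: Array[String]): Unit = {
    val t = BigInt(0)
    val lookup: Key => Latest = {
      case (_, "gone") => NotFound
      case _           => Indexable
    }
    // "gone" appears twice but is planned once; "recreated" is skipped.
    println(plan(Seq((t, "gone"), (t, "gone"), (t, "recreated")), Set.empty, lookup))
  }
}
```

Each key returned by plan would then be turned into an ES document ID with DataSet.uniqueEsDocumentId(recordId, tenantId); a Left result means the webhook is not acknowledged as successful.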
The ES delete remains scoped with the existing multi-tenant document ID:
DataSet.uniqueEsDocumentId(recordId, tenantId)
Why payload records should short-circuit deletion
magda-indexer already has coverage for mixed delete events and included records. The registry can send a payload where delete events exist but the current record is also included because it exists or is dereferenced in the same event page.
If payload.records includes the same tenantId + recordId, the indexer should prefer the included current record and skip delete. This preserves existing behavior and avoids the risk that a separate registry lookup briefly returns stale data from a read replica.
Tenant-specific latest lookup
The latest-state lookup must be tenant-specific.
Registry record IDs are not globally unique in multi-tenant mode. A lookup using a broad/system tenant context could incorrectly find a record with the same ID in another tenant and skip a valid delete, or otherwise make the wrong decision.
Add or use a registry client method that fetches a record by ID with the event's concrete tenant ID in X-Magda-Tenant-Id.
The lookup should use the same aspect set as indexer crawling/webhook conversion:
- RegistryConstants.aspects
- RegistryConstants.optionalAspects
- dereference=true
A 404 from this scoped lookup should be treated as out of scope for the indexer. That means the ES document should be deleted, because the record is either deleted or no longer has the aspects required to be indexed as a dataset.
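As a rough sketch of what the scoped lookup needs to carry and how its status could be interpreted: LookupRequest, build and classify are hypothetical and do not reflect the existing RegistryExternalInterface API. The query parameter names and the aspect IDs in main are illustrative assumptions. Treating every non-2xx status other than 404 as retryable is a conservative choice of the sketch, not something this ticket specifies.

```scala
object ScopedLookup {
  // Hypothetical request description; not the real RegistryExternalInterface API.
  final case class LookupRequest(
      recordId: String,
      headers: Map[String, String],
      query: Seq[(String, String)]
  )

  // `aspects` / `optionalAspects` stand in for RegistryConstants.aspects and
  // RegistryConstants.optionalAspects; the query parameter names are assumptions.
  def build(
      tenantId: BigInt,
      recordId: String,
      aspects: Seq[String],
      optionalAspects: Seq[String]
  ): LookupRequest =
    LookupRequest(
      recordId = recordId,
      // Always the event's concrete tenant, never a broad/system tenant ID.
      headers = Map("X-Magda-Tenant-Id" -> tenantId.toString),
      query = aspects.map(a => "aspect" -> a) ++
        optionalAspects.map(a => "optionalAspect" -> a) ++
        Seq("dereference" -> "true")
    )

  sealed trait Outcome
  case object Exists extends Outcome     // candidate for DataSet conversion
  case object OutOfScope extends Outcome // delete the ES document
  final case class Retry(status: Int) extends Outcome

  // Only 404 means "out of scope"; any other non-2xx status is treated as
  // retryable here, which is an assumption of this sketch.
  def classify(status: Int): Outcome =
    if (status >= 200 && status < 300) Exists
    else if (status == 404) OutOfScope
    else Retry(status)

  def main(args: Array[String]): Unit = {
    // Example aspect IDs only; the real lists come from RegistryConstants.
    val req = build(BigInt(3), "ds-1", Seq("dcat-dataset-strings"), Seq("dataset-distributions"))
    println(req)
    println(classify(404))
  }
}
```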
Error handling
- Missing/malformed recordId in a delete event should fail webhook processing.
- Latest scoped lookup 404 should proceed with the delete.
- Latest scoped lookup 5xx/network/transient failure should fail webhook processing so registry retries.
- Conversion failure for a record that exists in the latest state should be handled according to its cause:
  - If it means the record is not indexable, delete the stale ES output.
  - If it is a transient conversion dependency failure, fail webhook processing.
- ES delete failure should fail webhook processing.
- Async webhook acknowledgement should report success only after all indexing and guarded deletion work succeeds.
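The acknowledgement rule can be sketched with plain scala.concurrent Futures. Ack and process are hypothetical names, the two steps are passed in as functions so the example stays self-contained, and running them sequentially is a choice of this sketch; the ticket only requires that both succeed before success is reported.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

object WebhookAck {
  // Hypothetical acknowledgement payload; the real one may carry more fields.
  final case class Ack(succeeded: Boolean)

  // Runs indexing, then guarded deletion. Any non-fatal failure in either
  // step yields Ack(false), so success is never reported for partial work.
  def process(
      index: () => Future[Unit],
      guardedDelete: () => Future[Unit]
  )(implicit ec: ExecutionContext): Future[Ack] =
    Future.unit
      .flatMap(_ => index()) // also captures exceptions thrown synchronously
      .flatMap(_ => guardedDelete())
      .map(_ => Ack(succeeded = true))
      .recover { case NonFatal(_) => Ack(succeeded = false) }

  def main(args: Array[String]): Unit = {
    import scala.concurrent.Await
    import scala.concurrent.duration._
    import ExecutionContext.Implicits.global
    val failing = () => Future.failed[Unit](new RuntimeException("ES delete failed"))
    println(Await.result(process(() => Future.unit, failing), 5.seconds))
  }
}
```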
Relationship to /reindex
The existing /reindex endpoint and cronjob should remain as a periodic cleanup safety net. This ticket improves on-the-fly deletion safety; it does not replace full reindex/trim cleanup.
Acceptance criteria
- magda-indexer still subscribes to DeleteRecord events.
- Delete events are deduplicated by tenantId + recordId.
- If a matching current record is present in payload.records, the delete event is skipped.
- Latest scoped registry lookup uses the event's tenant ID, not a broad/system tenant lookup.
- Latest scoped existing indexable record skips ES delete.
- Latest scoped 404/out-of-scope record deletes the ES document.
- ES delete uses DataSet.uniqueEsDocumentId(recordId, tenantId).
- Transient latest-state lookup failures prevent webhook acknowledgement.
- Existing mixed delete/included-record behavior remains covered.
- Add coverage for stale delete event where the latest record exists and should not be deleted.
- Add coverage for stale delete event where the latest record no longer has required aspects and should be deleted.
Relevant code
- magda-indexer/src/main/scala/au/csiro/data61/magda/indexer/external/registry/WebhookApi.scala
- magda-indexer/src/main/scala/au/csiro/data61/magda/indexer/external/registry/RegisterWebhook.scala
- magda-indexer/src/main/scala/au/csiro/data61/magda/indexer/search/SearchIndexer.scala
- magda-indexer/src/main/scala/au/csiro/data61/magda/indexer/search/elasticsearch/ElasticSearchIndexer.scala
- magda-scala-common/src/main/scala/au/csiro/data61/magda/client/RegistryExternalInterface.scala
- magda-int-test/src/test/scala/au/csiro/data61/magda/indexer/WebhookDeleteDatasetsSpec.scala
- magda-int-test/src/test/scala/au/csiro/data61/magda/indexer/WebhookIncludedRecordNotDeletedSpec.scala