Add latest-state guarded delete handling to magda-indexer #3657

@t83714

Description

Part of #3654.

Background

magda-indexer already subscribes to DeleteRecord webhook events and deletes indexed datasets from Elasticsearch. The webhook registration includes delete events and includeEvents: true, and WebhookApi currently builds ES document IDs directly from every DeleteRecord event:

DataSet.uniqueEsDocumentId(recordId, tenantId)

This gives magda-indexer on-the-fly deletion, but it does not check the latest registry state before deleting. If the indexer is behind the webhook event stream, an old delete event can remove search data for a record that has already been recreated.

magda-indexer also has a full /reindex cleanup path. A cronjob currently calls that endpoint periodically as a cleanup safety net. That should remain, but event-based deletion should also be safe.

Problem

A stale-delete race is possible:

  1. A dataset record exists and is indexed.
  2. The record is deleted and a DeleteRecord event is queued.
  3. The same record ID is recreated before magda-indexer processes the old delete event.
  4. magda-indexer processes the old delete event and deletes the ES document.
  5. The record still exists in the latest registry state, but the dataset disappears from search until later backlog events are processed or /reindex runs.

This is the same class of race addressed by the latest-state guarded deletion design for TypeScript minions and semantic indexers.
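
The race can be illustrated with a small in-memory model. This is only a sketch: the index and registry are modelled as sets of record IDs, tenant scoping is ignored for brevity, and `StaleDeleteRace`, `unguardedDelete` and `guardedDelete` are hypothetical names that do not exist in magda-indexer.

```scala
// Hypothetical in-memory model of the stale-delete race (not magda-indexer code).
object StaleDeleteRace {
  // Current behaviour: every DeleteRecord event removes the document.
  def unguardedDelete(index: Set[String], recordId: String): Set[String] =
    index - recordId

  // Guarded behaviour: consult the latest registry state before deleting.
  def guardedDelete(
      index: Set[String],
      registry: Set[String],
      recordId: String
  ): Set[String] =
    if (registry.contains(recordId)) index else index - recordId
}

object StaleDeleteRaceExample {
  // The record was deleted and then recreated, so the registry contains it again,
  // while the old DeleteRecord event is still waiting in the indexer backlog.
  val registry: Set[String] = Set("ds-1")
  val index: Set[String] = Set("ds-1")

  // Unguarded: the record is removed from search although it exists.
  val afterUnguarded: Set[String] = StaleDeleteRace.unguardedDelete(index, "ds-1")

  // Guarded: the stale event is ignored because the record exists.
  val afterGuarded: Set[String] = StaleDeleteRace.guardedDelete(index, registry, "ds-1")
}
```

With the unguarded path the index and registry disagree until a later event or /reindex repairs the index; with the guard they stay consistent.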

Proposed design

Keep magda-indexer subscribed to DeleteRecord events, but guard each delete event with current registry state before deleting from Elasticsearch.

For each webhook payload:

  1. Process payload.records as index candidates, as today.
  2. Extract DeleteRecord events from payload.events.
  3. Deduplicate delete events by tenantId + recordId.
  4. For each delete event, decide whether to delete:
    • If the same tenantId + recordId is present in payload.records, skip delete.
    • Otherwise, fetch latest registry state for that exact tenantId + recordId.
    • If the latest scoped record exists and can be converted to DataSet, skip delete.
    • If the latest scoped lookup returns 404 because the record is deleted or no longer has required indexer aspects, delete the ES document.
    • If the latest lookup returns a transient/server error, fail webhook processing so registry retries.
  5. Acknowledge the webhook only after indexing and guarded deletion both succeed.
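
The decision in step 4 could be sketched roughly as follows. All types and names here (`RecordKey`, `LatestState`, `DeleteDecision`, `GuardedDelete.decide`) are hypothetical and introduced only for illustration; the registry lookup is passed in as a plain function so that the sketch stays self-contained, and `tenantId` is assumed to be a `BigInt`.

```scala
// Hypothetical types for illustration; these are not existing magda-indexer classes.
final case class RecordKey(tenantId: BigInt, recordId: String)

// Result of the latest scoped registry lookup.
sealed trait LatestState
case object Indexable extends LatestState   // record exists and converts to DataSet
case object OutOfScope extends LatestState  // scoped lookup returned 404
final case class LookupFailed(reason: String) extends LatestState // 5xx / network error

// What to do with one delete event.
sealed trait DeleteDecision
case object SkipDelete extends DeleteDecision
case object DeleteFromEs extends DeleteDecision
final case class FailWebhook(reason: String) extends DeleteDecision

object GuardedDelete {
  def decide(
      deleteEvents: Seq[RecordKey],
      payloadRecords: Set[RecordKey],
      lookupLatest: RecordKey => LatestState
  ): Map[RecordKey, DeleteDecision] =
    deleteEvents.distinct.map { key => // deduplicate by tenantId + recordId
      val decision: DeleteDecision =
        if (payloadRecords.contains(key)) SkipDelete // included record wins, no lookup
        else
          lookupLatest(key) match {
            case Indexable            => SkipDelete
            case OutOfScope           => DeleteFromEs
            case LookupFailed(reason) => FailWebhook(reason)
          }
      key -> decision
    }.toMap
}
```

In this sketch the webhook would be acknowledged only if no decision is a `FailWebhook` and every `DeleteFromEs` decision has been applied successfully.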

The ES delete remains scoped with the existing multi-tenant document ID:

DataSet.uniqueEsDocumentId(recordId, tenantId)

Why payload records should short-circuit deletion

magda-indexer already has coverage for mixed delete events and included records. The registry can send a payload where delete events exist but the current record is also included because it exists or is dereferenced in the same event page.

If payload.records includes the same tenantId + recordId, the indexer should prefer the included current record and skip the delete. This preserves existing behavior and avoids a separate registry lookup, which could briefly return stale data from a read replica.

Tenant-specific latest lookup

The latest-state lookup must be tenant-specific.

Registry record IDs are not globally unique in multi-tenant mode. A lookup using a broad/system tenant context could incorrectly find a record with the same ID in another tenant and skip a valid delete, or otherwise make the wrong decision.

Add or use a registry client method that fetches a record by ID with the event's concrete tenant ID in X-Magda-Tenant-Id.

The lookup should use the same aspect set as indexer crawling/webhook conversion:

  • RegistryConstants.aspects
  • RegistryConstants.optionalAspects
  • dereference=true

A 404 from this scoped lookup should be treated as out of scope for the indexer. That means the ES document should be deleted, because the record is either deleted or no longer has the aspects required to be indexed as a dataset.
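
As a rough illustration, the inputs of such a scoped lookup could be assembled as below. This is a sketch with the HTTP client left out: `ScopedLookup` and `ScopedLookupRequest` are hypothetical names, the query parameter names `aspect` and `optionalAspect` are assumptions about the registry API, and the aspect lists are expected to come from RegistryConstants.aspects and RegistryConstants.optionalAspects.

```scala
// Hypothetical request description; the real client method may look different.
final case class ScopedLookupRequest(
    recordId: String,
    headers: Map[String, String],
    query: Seq[(String, String)]
)

object ScopedLookup {
  def build(
      recordId: String,
      tenantId: BigInt,
      aspects: Seq[String],
      optionalAspects: Seq[String]
  ): ScopedLookupRequest =
    ScopedLookupRequest(
      recordId = recordId,
      // Use the concrete tenant from the delete event, never a system-wide tenant.
      headers = Map("X-Magda-Tenant-Id" -> tenantId.toString),
      // Parameter names "aspect" / "optionalAspect" are assumed, not confirmed here.
      query = aspects.map("aspect" -> _) ++
        optionalAspects.map("optionalAspect" -> _) :+
        ("dereference" -> "true")
    )
}
```

Keeping the tenant ID as an explicit argument makes it harder to fall back to a broad tenant context by accident.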

Error handling

  • Missing/malformed recordId in a delete event should fail webhook processing.
  • Latest scoped lookup 404 should process delete.
  • Latest scoped lookup 5xx/network/transient failure should fail webhook processing so registry retries.
  • Conversion failure for a latest existing record should be handled according to its cause:
    • If it means the record is not indexable, delete the stale ES output.
    • If it is a transient conversion dependency failure, fail webhook processing.
  • ES delete failure should fail webhook processing.
  • Async webhook acknowledgement should report success only after all indexing and guarded deletion work succeeds.
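
One possible way to encode these rules is sketched below. The names (`WebhookOutcome`, `Action`, `Conversion`, `firstFailure`) are hypothetical. Treating every status other than 200 and 404 as a failure is an assumption of this sketch; the rules above only spell out 404 and 5xx/network/transient failures.

```scala
// Hypothetical helpers illustrating the error-handling rules (not magda-indexer code).
object WebhookOutcome {
  sealed trait Action
  case object Delete extends Action
  case object Skip extends Action
  final case class Fail(reason: String) extends Action

  // Classify the HTTP status of the latest scoped lookup.
  def onLookupStatus(status: Int): Action = status match {
    case 200   => Skip   // record exists; conversion is checked separately
    case 404   => Delete // deleted or out of scope for the indexer
    case other => Fail(s"latest-state lookup returned $other") // assumption: retry
  }

  // Outcome of converting an existing latest record to a DataSet.
  sealed trait Conversion
  case object Converted extends Conversion
  case object NotIndexable extends Conversion
  final case class TransientFailure(reason: String) extends Conversion

  def onConversion(conversion: Conversion): Action = conversion match {
    case Converted                => Skip
    case NotIndexable             => Delete // stale ES output must go
    case TransientFailure(reason) => Fail(reason)
  }

  // Acknowledge only if every indexing and deletion step succeeded.
  def firstFailure(results: Seq[Either[String, Unit]]): Either[String, Unit] =
    results.collectFirst { case Left(error) => Left(error) }.getOrElse(Right(()))
}
```

A `Left` from `firstFailure` would mean the webhook is not acknowledged, so the registry retries the whole payload.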

Relationship to /reindex

The existing /reindex endpoint and cronjob should remain as a periodic cleanup safety net. This ticket improves on-the-fly deletion safety; it does not replace full reindex/trim cleanup.

Acceptance criteria

  • magda-indexer still subscribes to DeleteRecord events.
  • Delete events are deduplicated by tenantId + recordId.
  • If a matching current record is present in payload.records, the delete event is skipped.
  • Latest scoped registry lookup uses the event's tenant ID, not a broad/system tenant lookup.
  • Latest scoped existing indexable record skips ES delete.
  • Latest scoped 404/out-of-scope record deletes the ES document.
  • ES delete uses DataSet.uniqueEsDocumentId(recordId, tenantId).
  • Transient latest-state lookup failures prevent webhook acknowledgement.
  • Existing mixed delete/included-record behavior remains covered.
  • Add coverage for stale delete event where the latest record exists and should not be deleted.
  • Add coverage for stale delete event where the latest record no longer has required aspects and should be deleted.

Relevant code

  • magda-indexer/src/main/scala/au/csiro/data61/magda/indexer/external/registry/WebhookApi.scala
  • magda-indexer/src/main/scala/au/csiro/data61/magda/indexer/external/registry/RegisterWebhook.scala
  • magda-indexer/src/main/scala/au/csiro/data61/magda/indexer/search/SearchIndexer.scala
  • magda-indexer/src/main/scala/au/csiro/data61/magda/indexer/search/elasticsearch/ElasticSearchIndexer.scala
  • magda-scala-common/src/main/scala/au/csiro/data61/magda/client/RegistryExternalInterface.scala
  • magda-int-test/src/test/scala/au/csiro/data61/magda/indexer/WebhookDeleteDatasetsSpec.scala
  • magda-int-test/src/test/scala/au/csiro/data61/magda/indexer/WebhookIncludedRecordNotDeletedSpec.scala
