Epic: Add latest-state guarded deletion handling for indexers and minions #3654

@t83714

Background

Magda has several components that derive secondary data from registry records:

  • magda-indexer indexes registry datasets into Elasticsearch.
  • magda-minion-framework powers metadata enhancement minions that listen to registry changes and write derived aspects back to registry.
  • magda-semantic-indexer-framework builds semantic OpenSearch documents from registry records or storage objects.

Today, these components have incomplete or inconsistent on-the-fly deletion handling.

magda-indexer subscribes to DeleteRecord webhook events and deletes indexed datasets immediately, but it does not check the latest registry state before deleting. This means an old delete event can remove search data for a record that has already been recreated.

magda-indexer also has a full reindex cleanup path. Its /reindex endpoint re-crawls registry records and trims indexed documents that were not refreshed during that crawl. A cronjob currently calls this endpoint periodically as a cleanup safety net. This helps remove stale indexed data eventually, but it is batch-oriented and does not remove the need for safe on-the-fly delete handling.

magda-minion-framework does not subscribe to DeleteRecord events by default. It registers for create/patch style events, requests records, and invokes onRecordFound. Public minions built with @magda/minion-sdk are usually metadata enhancement services, such as minions that derive and write aspects from current record metadata. They currently have no common deletion callback or latest-state decision point.

magda-minion-framework also exposes a /recrawl endpoint that reprocesses current registry records for a minion. This can refresh derived metadata for records that still exist, but it is not an on-the-fly deletion mechanism by itself.

magda-semantic-indexer-framework builds on magda-minion-framework, sets includeEvents: false, and receives records only. It can delete older semantic chunks when the same record is successfully re-indexed, but it cannot remove semantic documents for deleted registry records on the fly. Its /recrawl path reprocesses current registry records but does not currently trim semantic documents for records that are no longer returned by the registry.

Separately, connector trim is a major producer of registry delete events. Connectors built with packages/connector-sdk write records using a sourceTag, then call registry deleteBySource(sourceTagToPreserve, sourceId) at the end of a run. When upstream resources disappear, registry records are deleted and DeleteRecord events are emitted. No connector framework change is currently required, but downstream minions and indexers need safe deletion handling for those events.

Motivation

Periodic recrawl/reindex cleanup is useful and should remain as a safety net; the existing magda-indexer cronjob that calls /reindex is an example of this model. Batch cleanup alone is not enough, though: deleted records can remain visible in derived indexes or derived metadata until the next full cleanup. Event-based deletion should shorten that stale-data window.

Delete events must nonetheless be handled carefully. Webhook consumers can fall behind the event stream, so a record can be deleted and later recreated with the same ID before a delayed minion or indexer processes the old delete event.

Example race:

  1. A record exists and has indexed or derived output.
  2. The record is deleted, producing a DeleteRecord event.
  3. The same record ID is recreated before a minion/indexer catches up.
  4. The minion/indexer processes the old delete event from its backlog.
  5. If it blindly deletes derived output, the record becomes invisible or loses derived metadata even though the record exists in the latest registry state.
  6. The record may remain missing from derived output until later backlog events are processed.

The solution should avoid this stale-delete race by checking latest registry state before processing a delete event.
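
The race can be reproduced with a small in-memory model. This is only an illustrative sketch: DerivedIndex, LatestRegistry and handleDelete are hypothetical names, not Magda APIs, and the two sets stand in for a real index and the registry.

```typescript
// Hypothetical in-memory model of the race; none of these names are Magda APIs.
type DerivedIndex = Set<string>; // record IDs that currently have derived output
type LatestRegistry = Set<string>; // record IDs that exist in registry right now

function handleDelete(
    index: DerivedIndex,
    registry: LatestRegistry,
    recordId: string,
    guarded: boolean
): void {
    const existsNow = registry.has(recordId);
    if (guarded && existsNow) {
        return; // stale delete: the record was recreated, keep its output
    }
    index.delete(recordId);
}

// Steps 2-3 have already happened: "ds-1" was deleted and then recreated.
const registry: LatestRegistry = new Set(["ds-1"]);
const blindIndex: DerivedIndex = new Set(["ds-1"]);
const guardedIndex: DerivedIndex = new Set(["ds-1"]);

// Step 4: the consumer finally processes the old DeleteRecord event.
handleDelete(blindIndex, registry, "ds-1", false);
handleDelete(guardedIndex, registry, "ds-1", true);

console.log(blindIndex.has("ds-1"), guardedIndex.has("ds-1")); // false true
```

The blind consumer loses the record until a later event restores it, while the guarded consumer keeps its output because the latest registry state still contains the record.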

Desired outcome

Introduce a latest-state guarded deletion approach across Magda's derived-data consumers.

The implementation is split across three related areas:

  • TypeScript minions get a generic opt-in delete-event contract.
  • Semantic indexers use that contract to delete only their own OpenSearch documents.
  • magda-indexer keeps its existing delete-event subscription but guards deletes with latest registry state before removing Elasticsearch documents.

Periodic /reindex and /recrawl cleanup paths should remain complementary safety nets. The goal here is to make on-the-fly deletion safe, not to remove full cleanup jobs.

Proposed shared model

For each DeleteRecord event, the consumer should:

  1. Extract recordId and tenantId from the event.
  2. Fetch the current record state from registry.
  3. Treat 404/not found as notFound.
  4. Treat an existing record as exists and pass the latest record to a decision function.
  5. Fail webhook processing on transient or server lookup failures so registry retries.
  6. Only delete owned derived output when the decision is to process the delete.

Default decision:

  • notFound -> process delete
  • exists -> skip delete
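
The six steps and the default decision could be sketched roughly as follows. All names here (LatestState, GuardedDeleteDeps, fetchLatest, deleteOwnedOutput, shouldDelete, handleDeleteEvent) are hypothetical and chosen for illustration; in particular, the sketch assumes the registry lookup resolves to undefined on 404 and rejects on any other failure.

```typescript
// Sketch only: all names here are hypothetical, not existing Magda APIs.
type LatestState<R> =
    | { state: "notFound" }
    | { state: "exists"; record: R };

type DeleteEvent = { recordId: string; tenantId: number };

interface GuardedDeleteDeps<R> {
    // Assumed to resolve undefined on 404 and reject on any other failure.
    fetchLatest: (recordId: string, tenantId: number) => Promise<R | undefined>;
    // Removes only the derived output owned by this consumer.
    deleteOwnedOutput: (recordId: string, tenantId: number) => Promise<void>;
    // Optional override of the default decision.
    shouldDelete?: (latest: LatestState<R>) => boolean;
}

// Default decision: notFound -> process delete, exists -> skip delete.
function defaultShouldDelete<R>(latest: LatestState<R>): boolean {
    return latest.state === "notFound";
}

async function handleDeleteEvent<R>(
    event: DeleteEvent,
    deps: GuardedDeleteDeps<R>
): Promise<"deleted" | "skipped"> {
    // Steps 1-2: look up the latest state. A rejection propagates to the
    // caller so the webhook call fails and registry can retry (step 5).
    const record = await deps.fetchLatest(event.recordId, event.tenantId);

    // Steps 3-4: map the lookup result onto notFound / exists.
    const latest: LatestState<R> =
        record === undefined
            ? { state: "notFound" }
            : { state: "exists", record };

    // Step 6: delete owned output only when the decision says so.
    const decide: (latest: LatestState<R>) => boolean =
        deps.shouldDelete ?? defaultShouldDelete;
    if (!decide(latest)) {
        return "skipped";
    }
    await deps.deleteOwnedOutput(event.recordId, event.tenantId);
    return "deleted";
}
```

Passing the lookup and delete operations in as dependencies keeps the guard itself independent of Elasticsearch, OpenSearch or aspect writes, which is one way the same model could be reused by all three consumers.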

Advanced consumers should be able to override this. A recreated record can exist but no longer match a minion or indexer's scope. In that case, the consumer may still need to delete its owned stale output.
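
As an illustration of such an override, a consumer whose scope is "records that carry a particular aspect" might decide as below. The types and the makeScopedDecision helper are hypothetical, and the aspect name is only an example of a scope condition.

```typescript
// Hypothetical types and helper for illustration only.
type RecordLike = { id: string; aspects: Record<string, unknown> };
type LatestState =
    | { state: "notFound" }
    | { state: "exists"; record: RecordLike };

// Builds a decision function for a consumer that only handles records
// carrying `requiredAspect`.
function makeScopedDecision(requiredAspect: string) {
    return (latest: LatestState): boolean => {
        if (latest.state === "notFound") {
            return true; // record is gone: clean up owned output
        }
        // Record was recreated: clean up only if it left this consumer's scope.
        return !(requiredAspect in latest.record.aspects);
    };
}

const shouldDelete = makeScopedDecision("dcat-dataset-strings");
```

With this decision, a recreated record that still carries the aspect is left alone, while a recreated record without it still triggers cleanup of the consumer's stale output.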

Scope and ownership

Deletion must only remove output owned by the current consumer.

For magda-indexer, deletion must be scoped by the existing multi-tenant Elasticsearch document ID:

  • DataSet.uniqueEsDocumentId(recordId, tenantId)

For semantic indexing, every delete query must be scoped by at least:

  • configured semantic index
  • current indexerId
  • relevant record identity
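
One way to make this scoping hard to forget is to build every delete request through a single helper. The sketch below returns a delete-by-query style body using the standard bool/filter/term query DSL. The field names indexerId and recordId are assumptions about the semantic document schema (and are assumed to be keyword-mapped), and buildScopedDeleteQuery is a hypothetical helper.

```typescript
// Field names below are assumptions about the semantic document schema.
interface ScopedDeleteRequest {
    index: string;
    body: {
        query: {
            bool: { filter: Array<{ term: Record<string, string> }> };
        };
    };
}

function buildScopedDeleteQuery(
    semanticIndex: string,
    indexerId: string,
    recordId: string
): ScopedDeleteRequest {
    // Refuse to build an unscoped or partially scoped delete.
    if (!semanticIndex || !indexerId || !recordId) {
        throw new Error("semanticIndex, indexerId and recordId are all required");
    }
    return {
        index: semanticIndex,
        body: {
            query: {
                bool: {
                    // Both filters must match, so other indexers' documents
                    // for the same record are never touched.
                    filter: [{ term: { indexerId } }, { term: { recordId } }]
                }
            }
        }
    };
}
```

Rejecting empty scope values up front means a missing indexerId fails loudly instead of widening the delete to every indexer's documents.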

For metadata minions, the framework must not infer which derived aspects to delete from watched aspects alone. Minions can monitor one set of aspects, use optional aspects, write different owned aspects, aggregate data onto parent records, or intentionally preserve output. Cleanup semantics must therefore be explicit via callbacks.
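
A possible shape for such an explicit, opt-in callback is sketched below. This is not the current @magda/minion-sdk API: MinionOptionsSketch, DeleteContext, onRecordDeleted and eventTypesFor are hypothetical names, and the event type list is an illustrative subset.

```typescript
// Hypothetical option shape; not the current @magda/minion-sdk API.
type LatestState =
    | { state: "notFound" }
    | { state: "exists"; record: unknown };

interface DeleteContext {
    recordId: string;
    tenantId: number;
    latest: LatestState;
}

interface MinionOptionsSketch {
    aspects: string[];
    onRecordFound: (record: unknown) => Promise<void>;
    // Optional: the minion decides which of its own output to clean up.
    onRecordDeleted?: (context: DeleteContext) => Promise<void>;
}

// Illustrative subset of event types a minion registers for by default.
const DEFAULT_EVENT_TYPES = ["CreateRecord", "PatchRecord"];

// DeleteRecord is subscribed to only when the minion supplies a delete callback.
function eventTypesFor(options: MinionOptionsSketch): string[] {
    return options.onRecordDeleted
        ? DEFAULT_EVENT_TYPES.concat("DeleteRecord")
        : DEFAULT_EVENT_TYPES.slice();
}
```

Because the subscription is derived from the presence of the callback, a minion that defines only onRecordFound keeps its current behaviour and never receives delete events.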

Linked tickets

Implementation tickets:

Acceptance criteria

  • TypeScript minions do not subscribe to delete events by default.
  • Existing onRecordFound minions continue to work unchanged.
  • Minions can opt into DeleteRecord handling and receive latest-state context before deleting owned output.
  • Semantic indexers opt into delete handling and delete only documents belonging to the current indexerId.
  • magda-indexer keeps delete-event handling but checks latest tenant-scoped registry state before deleting Elasticsearch documents.
  • Old delete events are skipped when the latest registry record exists and remains in scope.
  • Old delete events can still trigger cleanup when the latest registry record exists but is out of scope for the consumer.
  • Transient latest-state lookup failures do not acknowledge webhook success.
  • Connector trim delete bursts are handled with deduplication and bounded concurrency.
  • Existing periodic /reindex and /recrawl cleanup paths remain available as safety nets.
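
For the connector trim criterion, deduplication and bounded concurrency could look roughly like the sketch below. dedupeDeleteEvents and processWithLimit are hypothetical helpers; the sketch assumes a batch of delete events is available in memory and that one latest-state lookup per (tenantId, recordId) pair is enough.

```typescript
type DeleteEvent = { recordId: string; tenantId: number };

// Keeps one event per (tenantId, recordId); the guard checks latest state
// anyway, so repeated delete events for the same record add no information.
function dedupeDeleteEvents(events: DeleteEvent[]): DeleteEvent[] {
    const byKey = new Map<string, DeleteEvent>();
    for (const event of events) {
        byKey.set(`${event.tenantId}:${event.recordId}`, event);
    }
    return Array.from(byKey.values());
}

// Runs `worker` over `items` with at most `limit` calls in flight at once.
// The returned promise rejects if any worker call rejects.
async function processWithLimit<T>(
    items: T[],
    limit: number,
    worker: (item: T) => Promise<void>
): Promise<void> {
    let next = 0;
    const width = Math.max(1, Math.min(limit, items.length));
    const runners = Array.from({ length: width }, async () => {
        // Each runner pulls the next unprocessed item until none remain.
        while (next < items.length) {
            const item = items[next++];
            await worker(item);
        }
    });
    await Promise.all(runners);
}
```

A failure in any worker rejects the returned promise, which fits the requirement that transient lookup failures must not acknowledge webhook success.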
