Add latest-state guarded delete handling to semantic indexer framework #3656

@t83714

Part of #3654. Depends on the minion framework delete-event support in #3655.

Background

magda-semantic-indexer-framework builds semantic OpenSearch documents from registry records or storage objects. It runs on top of magda-minion-framework and currently receives records only. It sets includeEvents: false, so it cannot see DeleteRecord events.

The framework already cleans older chunks for a record when that same record is successfully re-indexed. This is done by indexing new chunks and then deleting older documents for the same indexerId and recordId.

That per-record replacement does not clean up semantic documents for records that are deleted from the registry, because deleted records are not returned by webhooks or recrawls as records to re-index.

Problem

Semantic documents can remain in OpenSearch after the source registry record has been deleted. Periodic recrawl with an indexer-scoped trim can help, but it does not provide on-the-fly cleanup.

A delete event handler must also handle stale-delete races. A semantic indexer can lag behind the registry event stream, so a record may be deleted and then recreated before the old DeleteRecord event is processed. Blindly deleting semantic documents for that event would remove documents for a record that currently exists in the registry, hiding it from semantic search.

Multiple semantic indexers may share an OpenSearch cluster or index. Each indexer can use different id, itemType, watched aspects, format filters, and dataset/distribution scope. Deletion must only remove documents owned by the current semantic indexer.

Proposed design

Once magda-minion-framework supports opt-in latest-state guarded delete handling, make magda-semantic-indexer-framework opt in and provide:

  • shouldProcessDeleteEvent
  • onRecordDeleted

The semantic delete implementation must always scope deletion to:

  • configured semantic index name
  • current semantic indexer id stored as indexerId
  • current semantic itemType
  • affected record identity

For itemType: "registryRecord", delete documents matching:

indexerId = current indexer id
itemType = "registryRecord"
recordId = deleted record id

For itemType: "storageObject", delete documents matching:

indexerId = current indexer id
itemType = "storageObject"
(
  recordId = deleted record id
  or parentRecordId = deleted record id
)

The parentRecordId condition matters for dataset deletion. Storage-object documents are keyed by the distribution's recordId and store the owning dataset as parentRecordId. If a dataset is deleted, semantic content for its distributions should be removed even if only the dataset's DeleteRecord event is processed.
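
The two match rules above could be expressed as an OpenSearch delete-by-query body roughly as follows. This is only a sketch: `buildDeleteQuery` and `DeleteScope` are hypothetical names, and it assumes indexerId, itemType, recordId, and parentRecordId are mapped as keyword fields so that `term` clauses match exactly. The configured index name would be passed separately to the delete-by-query call.

```typescript
// Hypothetical helper; not part of the existing framework.
type ItemType = "registryRecord" | "storageObject";

interface DeleteScope {
    indexerId: string; // id of the current semantic indexer
    itemType: ItemType; // item type of the current semantic indexer
    recordId: string; // id of the deleted registry record
}

function buildDeleteQuery(scope: DeleteScope) {
    // Every delete is restricted to this indexer's own documents.
    const filter: unknown[] = [
        { term: { indexerId: scope.indexerId } },
        { term: { itemType: scope.itemType } }
    ];

    if (scope.itemType === "registryRecord") {
        filter.push({ term: { recordId: scope.recordId } });
    } else {
        // storageObject: match the distribution itself, or any
        // distribution whose owning dataset is the deleted record.
        filter.push({
            bool: {
                should: [
                    { term: { recordId: scope.recordId } },
                    { term: { parentRecordId: scope.recordId } }
                ],
                minimum_should_match: 1
            }
        });
    }

    return { query: { bool: { filter } } };
}
```

Keeping indexerId and itemType in the top-level filter means the record-identity clause can never widen the delete beyond documents owned by the current indexer.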

Latest-state decision

For a semantic delete event:

  • If latest registry lookup returns not found, process delete.
  • If latest record exists and still matches this semantic indexer's scope, skip delete.
  • If latest record exists but no longer matches this semantic indexer's scope, process delete.
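
The three rules above reduce to a small pure function. The sketch below is illustrative only: `decideDelete`, `LatestRecord`, and the `isInScope` callback are hypothetical names, and the actual shouldProcessDeleteEvent signature is defined by the minion framework work in #3655, not here.

```typescript
// Hypothetical shapes for illustration.
type DeleteDecision = "process" | "skip";

interface LatestRecord {
    id: string;
    aspects: Record<string, unknown>;
}

function decideDelete(
    latest: LatestRecord | undefined, // undefined: registry lookup returned not found
    isInScope: (record: LatestRecord) => boolean // cheap scope predicate
): DeleteDecision {
    if (latest === undefined) {
        // The record is gone from the registry: clean up.
        return "process";
    }
    // The record exists: keep documents only if this indexer still owns it.
    return isInScope(latest) ? "skip" : "process";
}
```

Transient lookup failures are deliberately not modelled here. Per the error-handling rules in this issue, they should fail webhook processing rather than resolve to either decision.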

Scope checks should be cheap and should reuse current indexing predicates where possible.

For registryRecord indexers:

  • required aspects are present
  • any future cheap scope predicate passes

For storageObject indexers:

  • dcat-distribution-strings and dataset-format data are sufficient for indexing
  • download/access URL exists
  • detected format matches configured formatTypes

The delete decision should not call expensive text extraction, file download, parsing, or embedding logic.
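
A cheap storageObject scope check might look like the sketch below. It reads only aspect data already present on the record and performs no I/O. The names `storageObjectInScope` and `RecordLike` are hypothetical, and the aspect field names used here (downloadURL, accessURL, format) are assumptions that should be checked against the existing predicates in onRecordFoundStorageObject.ts.

```typescript
// Hypothetical record shape; aspect field names are assumptions.
interface RecordLike {
    id: string;
    aspects?: Record<string, any>;
}

function storageObjectInScope(record: RecordLike, formatTypes: string[]): boolean {
    const dist = record.aspects?.["dcat-distribution-strings"];
    const fmt = record.aspects?.["dataset-format"];
    if (!dist || !fmt) {
        return false; // required aspect data is missing
    }

    // A download or access URL must exist.
    const url = dist.downloadURL ?? dist.accessURL;
    if (typeof url !== "string" || url.length === 0) {
        return false;
    }

    // The detected format must match one of the configured formatTypes.
    const format = fmt.format;
    if (typeof format !== "string") {
        return false;
    }
    const wanted = formatTypes.map((f) => f.toLowerCase());
    return wanted.indexOf(format.trim().toLowerCase()) !== -1;
}
```

For example, a distribution with a download URL and detected format "PDF" is in scope for an indexer configured with formatTypes ["pdf"] and out of scope for one configured with ["csv"].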

Connector trim interaction

Connector trim can emit many DeleteRecord events when upstream resources disappear. It can also be followed by later connector runs that recreate the same record IDs.

The semantic indexer must rely on the minion framework's latest-state guard to avoid deleting semantic documents for records that have since been recreated and still match scope. If the latest record exists but is out of scope, semantic documents owned by this indexer should be removed.

No change is proposed for packages/connector-sdk.

Error handling

  • OpenSearch delete-by-query must be scoped by indexerId.
  • OpenSearch delete failures should fail webhook processing so registry retries.
  • Latest-state lookup transient failures should fail webhook processing via the minion framework.
  • A stale delete event for a recreated in-scope record should be skipped.
  • A stale delete event for a recreated out-of-scope record should remove this indexer's old semantic documents.

Acceptance criteria

  • Semantic indexer opts into delete events through the minion framework.
  • Semantic indexer supplies latest-state delete decision logic.
  • registryRecord delete query is scoped by configured index, indexerId, itemType, and recordId.
  • storageObject delete query is scoped by configured index, indexerId, itemType, and recordId or parentRecordId.
  • Delete handling never removes documents belonging to another indexerId.
  • Latest record in scope skips delete.
  • Latest record out of scope processes delete.
  • Missing latest record processes delete.
  • Existing per-record replacement cleanup remains unchanged.
  • Delete decision avoids expensive embedding/text/file processing.

Relevant code

  • magda-semantic-indexer-framework/src/semanticIndexer.ts
  • magda-semantic-indexer-framework/src/indexEmbeddingText.ts
  • magda-semantic-indexer-framework/src/onRecordFoundRegistryRecord.ts
  • magda-semantic-indexer-framework/src/onRecordFoundStorageObject.ts
  • magda-semantic-indexer-framework/src/indexSchema.ts
