Part of #3654. Depends on the minion framework delete-event support in #3655.
Background
magda-semantic-indexer-framework builds semantic OpenSearch documents from registry records or storage objects. It runs on top of magda-minion-framework and currently receives records only. It sets includeEvents: false, so it cannot see DeleteRecord events.
The framework already cleans older chunks for a record when that same record is successfully re-indexed. This is done by indexing new chunks and then deleting older documents for the same indexerId and recordId.
That per-record replacement does not clean up semantic documents for records that are deleted from the registry, because deleted records are not returned by webhooks or recrawls as records to re-index.
Problem
Semantic documents can remain in OpenSearch after the source registry record has been deleted. Periodic recrawl with an indexer-scoped trim can help, but it does not provide on-the-fly cleanup.
A delete event handler must also handle stale-delete races. A semantic indexer can be behind the registry event stream. A record may be deleted and then recreated before the old DeleteRecord event is processed. Blindly deleting semantic documents for that event can hide a record whose latest registry state is not deleted.
Multiple semantic indexers may share an OpenSearch cluster or index. Each indexer can use different id, itemType, watched aspects, format filters, and dataset/distribution scope. Deletion must only remove documents owned by the current semantic indexer.
Proposed design
Once magda-minion-framework supports opt-in latest-state guarded delete handling, make magda-semantic-indexer-framework opt in and provide:
- shouldProcessDeleteEvent
- onRecordDeleted
The semantic delete implementation must always scope deletion to:
- configured semantic index name
- current semantic indexer id stored as indexerId
- current semantic itemType
- affected record identity
For itemType: "registryRecord", delete documents matching:
indexerId = current indexer id
itemType = "registryRecord"
recordId = deleted record id
For itemType: "storageObject", delete documents matching:
indexerId = current indexer id
itemType = "storageObject"
(
recordId = deleted record id
or parentRecordId = deleted record id
)
The parentRecordId condition matters for dataset deletion. Storage-object documents are keyed by distribution recordId and store the owning dataset as parentRecordId. If a dataset is deleted, semantic content for its distributions should be removed even if only the dataset delete event is considered.
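The two match rules above could be expressed as a single OpenSearch query body. The following is a minimal sketch, not the framework's actual implementation: the function name buildSemanticDeleteQuery and the DeleteScope type are hypothetical, and it assumes indexerId, itemType, recordId, and parentRecordId are mapped as keyword fields so that term clauses match exactly. The configured index name would be supplied separately as the target of the delete-by-query request.

```typescript
// Hypothetical types for illustration; not part of the existing framework.
type ItemType = "registryRecord" | "storageObject";

interface DeleteScope {
    indexerId: string; // id of the current semantic indexer
    itemType: ItemType; // itemType of the current semantic indexer
    recordId: string; // id of the deleted registry record
}

type TermClause = { term: Record<string, string> };
type BoolClause = {
    bool: { should: TermClause[]; minimum_should_match: number };
};

interface DeleteByQueryBody {
    query: { bool: { filter: Array<TermClause | BoolClause> } };
}

// Builds the query body for a delete-by-query request.
// Every clause sits in `filter`, so all of them must match (logical AND).
function buildSemanticDeleteQuery(scope: DeleteScope): DeleteByQueryBody {
    const filter: Array<TermClause | BoolClause> = [
        { term: { indexerId: scope.indexerId } },
        { term: { itemType: scope.itemType } }
    ];

    if (scope.itemType === "storageObject") {
        // Match the distribution itself, or any distribution whose owning
        // dataset (parentRecordId) is the deleted record.
        filter.push({
            bool: {
                should: [
                    { term: { recordId: scope.recordId } },
                    { term: { parentRecordId: scope.recordId } }
                ],
                minimum_should_match: 1
            }
        });
    } else {
        filter.push({ term: { recordId: scope.recordId } });
    }

    return { query: { bool: { filter } } };
}
```

Because indexerId and itemType are always present in the filter, a query built this way cannot match documents written by another indexer, even when several indexers share one index.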
Latest-state decision
For a semantic delete event:
- If latest registry lookup returns not found, process delete.
- If latest record exists and still matches this semantic indexer's scope, skip delete.
- If latest record exists but no longer matches this semantic indexer's scope, process delete.
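The three rules above reduce to a small pure function. This is a sketch only: the LatestRecordLookup type and the shouldProcessSemanticDelete name are hypothetical, and the real shouldProcessDeleteEvent callback signature depends on what the minion framework exposes in #3655.

```typescript
// Hypothetical result of the latest-state registry lookup.
// A transient lookup failure is not modelled here; it is expected to be
// raised as an error before this decision is reached.
type LatestRecordLookup<R> =
    | { found: false }
    | { found: true; record: R };

// Returns true when semantic documents for the deleted record should be removed.
function shouldProcessSemanticDelete<R>(
    latest: LatestRecordLookup<R>,
    isInScope: (record: R) => boolean
): boolean {
    if (!latest.found) {
        // Record no longer exists in the registry: the delete is still valid.
        return true;
    }
    // Record was recreated. Keep its documents only if this indexer would
    // still index it; otherwise the old documents are stale.
    return !isInScope(latest.record);
}
```

Passing the scope check in as a function keeps the decision identical for registryRecord and storageObject indexers; only the predicate differs.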
Scope checks should be cheap and should reuse current indexing predicates where possible.
For registryRecord indexers:
- required aspects are present
- any future cheap scope predicate passes
For storageObject indexers:
- dcat-distribution-strings and dataset-format data are sufficient for indexing
- download/access URL exists
- detected format matches configured formatTypes
The delete decision should not call expensive text extraction, file download, parsing, or embedding logic.
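The scope checks listed above can stay cheap if they only inspect aspect data already present on the latest record. The sketch below is illustrative and makes several assumptions that should be checked against the existing indexing predicates: the RecordLike shape, the function names, the downloadURL, accessURL, and format field names, and the case-insensitive format comparison are all hypothetical here.

```typescript
// Hypothetical minimal record shape; the registry client's real type is richer.
interface RecordLike {
    id: string;
    aspects?: { [aspectId: string]: { [key: string]: unknown } | undefined };
}

function nonEmptyString(value: unknown): value is string {
    return typeof value === "string" && value.trim().length > 0;
}

// registryRecord indexers: every required aspect must be present.
function isRegistryRecordInScope(
    record: RecordLike,
    requiredAspects: string[]
): boolean {
    return requiredAspects.every(
        (aspectId) => record.aspects?.[aspectId] !== undefined
    );
}

// storageObject indexers: needs distribution data, a URL, and a wanted format.
// Assumed field names: downloadURL, accessURL, format.
function isStorageObjectInScope(
    record: RecordLike,
    formatTypes: string[]
): boolean {
    const dist = record.aspects?.["dcat-distribution-strings"];
    if (!dist) {
        return false;
    }
    const hasUrl =
        nonEmptyString(dist["downloadURL"]) || nonEmptyString(dist["accessURL"]);
    if (!hasUrl) {
        return false;
    }
    // Prefer the detected format; fall back to the declared one (assumption).
    const detected =
        record.aspects?.["dataset-format"]?.["format"] ?? dist["format"];
    if (!nonEmptyString(detected)) {
        return false;
    }
    const normalised = detected.trim().toLowerCase();
    return formatTypes.some((f) => f.toLowerCase() === normalised);
}
```

Neither predicate downloads a file, extracts text, or computes embeddings, so both are safe to call from the delete decision.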
Connector trim interaction
Connector trim can emit many DeleteRecord events when upstream resources disappear. It can also be followed by later connector runs that recreate the same record IDs.
The semantic indexer must rely on the minion framework's latest-state guard to avoid deleting semantic documents for records that have since been recreated and still match scope. If the latest record exists but is out of scope, semantic documents owned by this indexer should be removed.
No change is proposed for packages/connector-sdk.
Error handling
- OpenSearch delete-by-query must be scoped by indexerId.
- OpenSearch delete failures should fail webhook processing so that the registry retries the webhook.
- Latest-state lookup transient failures should fail webhook processing via the minion framework.
- A stale delete event for a recreated in-scope record should be skipped.
- A stale delete event for a recreated out-of-scope record should remove this indexer's old semantic documents.
Acceptance criteria
- Semantic indexer opts into delete events through the minion framework.
- Semantic indexer supplies latest-state delete decision logic.
- registryRecord delete query is scoped by configured index, indexerId, itemType, and recordId.
- storageObject delete query is scoped by configured index, indexerId, itemType, and either recordId or parentRecordId.
- Delete handling never removes documents belonging to another indexerId.
- Latest record in scope skips delete.
- Latest record out of scope processes delete.
- Missing latest record processes delete.
- Existing per-record replacement cleanup remains unchanged.
- Delete decision avoids expensive embedding/text/file processing.
Relevant code
magda-semantic-indexer-framework/src/semanticIndexer.ts
magda-semantic-indexer-framework/src/indexEmbeddingText.ts
magda-semantic-indexer-framework/src/onRecordFoundRegistryRecord.ts
magda-semantic-indexer-framework/src/onRecordFoundStorageObject.ts
magda-semantic-indexer-framework/src/indexSchema.ts