
Conversation

@JamesGuthrie
Member

There is no FK from the queue table to the source table, or from the embedding table to the source table. As a result, we can have "dangling" entries in both the queue table and the embedding table. These occur because of a race condition between deleting a source table row and processing the embeddings for that row.

To "clean up" dangling embeddings, we insert a queue row when a source table row is deleted. Because the source row no longer exists, this queue row is dangling by construction.

When a dangling queue row is identified, we use the PK values of the queue row to remove all associated embeddings (if present).
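A minimal sketch of this cleanup path in SQL. The table and column names (`source`, `queue`, `embedding_store`, a single PK column `id`) are illustrative stand-ins, not the names pgai actually generates:

```sql
-- Illustrative names only: "source", "queue", "embedding_store", PK "id".
-- When a source row is deleted, enqueue its PK so the worker can clean
-- up any embeddings that the race condition may have left behind.
CREATE FUNCTION enqueue_deleted_row() RETURNS trigger AS $$
BEGIN
    INSERT INTO queue (id) VALUES (OLD.id);
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER source_cleanup
AFTER DELETE ON source
FOR EACH ROW EXECUTE FUNCTION enqueue_deleted_row();

-- When the worker picks up this queue row, the PK no longer exists in
-- "source", so the row is recognized as dangling and the worker deletes
-- any embeddings stored under that PK:
DELETE FROM embedding_store WHERE id = $1;  -- $1: PK value from the queue row
```

Because the trigger fires in the same transaction as the delete, a queue row exists for every deleted source row, whether or not the race actually left embeddings behind.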

@JamesGuthrie JamesGuthrie requested a review from a team as a code owner June 16, 2025 12:42
@JamesGuthrie JamesGuthrie temporarily deployed to internal-contributors June 16, 2025 12:42 with GitHub Actions
@JamesGuthrie JamesGuthrie requested a review from Askir June 16, 2025 12:48
@JamesGuthrie JamesGuthrie force-pushed the james/pga-251-handle-potential-dead-chunks-in-target-table-after-row branch from d7bb489 to 01afd44 on June 17, 2025 06:36
@JamesGuthrie JamesGuthrie temporarily deployed to internal-contributors June 17, 2025 06:36 with GitHub Actions
@JamesGuthrie JamesGuthrie force-pushed the james/pga-251-handle-potential-dead-chunks-in-target-table-after-row branch from 01afd44 to 38a22d1 on June 17, 2025 08:07
@JamesGuthrie JamesGuthrie temporarily deployed to internal-contributors June 17, 2025 08:07 with GitHub Actions
alejandrodnm and others added 7 commits June 18, 2025 16:05
* chore: split deps into separate extras

* chore: update docs and dockerfile

* chore: only install vectorizer-worker extra in dockerfile
The vectorizer worker used to process queue items in a single
transaction. If any step (other than file loading) failed, processing
aborted and the item was retried later. Because everything ran in a
single transaction, failed attempts were never recorded in the queue
table.

This change rearchitects queue item processing into two transactions
(sketched below):
- The "fetch work" transaction gets a batch of rows from the database
  for processing. It updates the `attempts` column of those rows to
  signal that an attempt has been made to process the item, and it
  deletes duplicate queue items with the same primary key values.
- The "embed and write" transaction performs embedding, writes the
  embeddings to the database, and removes successfully processed queue
  rows. Rows that failed processing have their `retry_after` column
  set to a value proportional to the number of existing attempts. When
  the `attempts` column exceeds a predefined threshold (6), the queue
  item is moved to the "failed" (dead letter) queue.
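A sketch of the two transactions, with illustrative table names and an assumed batch size and backoff interval; only the `attempts` bump, the `retry_after` backoff, and the threshold of 6 come from the change itself:

```sql
-- Transaction 1: "fetch work". Claim a batch and record the attempt,
-- so the attempt survives even if the second transaction aborts.
BEGIN;
WITH batch AS (
    SELECT ctid FROM queue
    WHERE retry_after IS NULL OR retry_after < now()
    LIMIT 50                  -- assumed batch size
    FOR UPDATE SKIP LOCKED    -- concurrent workers skip claimed rows
)
UPDATE queue SET attempts = attempts + 1
WHERE ctid IN (SELECT ctid FROM batch)
RETURNING id;
-- (the same transaction also deletes duplicate queue items that share
--  the claimed rows' primary key values)
COMMIT;

-- ... embeddings are computed outside the database ...

-- Transaction 2: "embed and write".
BEGIN;
INSERT INTO embedding_store (id, chunk, embedding)
VALUES ($1, $2, $3);                      -- one row per embedded chunk
DELETE FROM queue WHERE id = ANY($4);     -- drop successfully processed items
UPDATE queue                              -- failed items: back off in
SET retry_after = now()                   -- proportion to prior attempts
    + attempts * interval '1 minute'      -- (interval is an assumption)
WHERE id = ANY($5) AND attempts <= 6;
WITH dead AS (                            -- past the threshold: move to the
    DELETE FROM queue                     -- "failed" (dead letter) queue
    WHERE id = ANY($5) AND attempts > 6
    RETURNING *
)
INSERT INTO queue_failed SELECT * FROM dead;
COMMIT;
```

Committing the `attempts` bump in its own transaction is what makes failed attempts visible in the queue table: a crash or error during embed-and-write no longer rolls back the bookkeeping.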
There is no FK from the queue table to the source table, or from the
embedding table to the source table. As a result, we can have "dangling"
entries in both the queue table and the embedding table. These occur
because of a race condition between deleting a source table row and
processing the embeddings for that row.

To "clean up" dangling embeddings, we insert a queue row when a source
table row is deleted. Because the source row no longer exists, this
queue row is dangling by construction.

When a dangling queue row is identified, we use the PK values of the
queue row to remove all associated embeddings (if present).

Co-authored-by: Jascha <[email protected]>
@JamesGuthrie JamesGuthrie force-pushed the james/pga-251-handle-potential-dead-chunks-in-target-table-after-row branch from 38a22d1 to dc12bf9 on June 24, 2025 07:26
@JamesGuthrie JamesGuthrie temporarily deployed to internal-contributors June 24, 2025 07:26 with GitHub Actions