
Conversation


@newmanifold newmanifold commented Sep 15, 2025

Problem:
For large traces we sometimes see the exception ValueError('task_done() called too many times'). After a few of these exceptions, traces/spans stop reaching Langfuse.

Cause:
In ingestion_consumer, events dropped inside _truncate_item_in_place were still being appended to the batch events in the _next method. Since task_done is already called for dropped items inside _truncate_item_in_place, the later task_done call on the ingestion queue in the upload method failed with the exception above.
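The double-accounting described above can be reproduced with a minimal sketch using the standard library queue (hypothetical simplification; the real code paths are _truncate_item_in_place and upload in ingestion_consumer.py):

```python
import queue

q = queue.Queue()
q.put({"type": "trace-create", "body": {}})

item = q.get()

# Inside truncation, an oversized item is dropped and accounted for:
q.task_done()  # first call, made for the dropped item

# ...but the item was still appended to the batch, so the upload path
# calls task_done() again for the same queue entry:
try:
    q.task_done()  # second call for the same item
except ValueError as e:
    print(e)  # task_done() called too many times
```

queue.Queue tracks an internal count of unfinished tasks; each task_done() decrements it, and a call with no outstanding task raises ValueError, which is exactly the exception observed.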

Changes:

  • Added missing metadata check to event drop condition and updated warning.

Important

Fixes task_done exception in ingestion_consumer.py by preventing dropped items from being appended to batch events.

  • Behavior:
    • Fixes ValueError('task_done() called too many times') in ingestion_consumer.py by preventing dropped items from being appended to batch events.
    • Adds a flag in _truncate_item_in_place() to indicate if an item was dropped.
    • Modifies _next() to append only non-dropped items to events.
  • Functions:
    • _truncate_item_in_place() now returns a tuple (item_size, dropped).
    • Updates _next() to handle the new return value from _truncate_item_in_place() and conditionally append events.

This description was created by Ellipsis for bffe753.

Disclaimer: Experimental PR review

Greptile Summary

Updated On: 2025-09-15 05:00:00 UTC

This PR fixes a critical bug in the task queue management system for Langfuse's ingestion consumer. The issue occurred when processing large traces that exceeded size limits - the _truncate_item_in_place() method would drop oversized events and call task_done() on the ingestion queue, but these dropped events were still being added to the batch for processing. Later, when the upload() method processed the batch, it would call task_done() again for each event in the batch, including the already-dropped ones, resulting in ValueError('task_done() called too many times').

The fix modifies the _truncate_item_in_place() method to return a tuple containing both the item size and a boolean flag indicating whether the item was dropped. The _next() method now checks this flag and only appends non-dropped events to the batch. This ensures that task_done() is called exactly once per queue item - either during the truncation/drop process or during batch processing, but never both.

The change maintains backward compatibility in terms of functionality while fixing the queue state management. The ingestion consumer continues to handle oversized events by truncating or dropping them as before, but now properly tracks which items were dropped to prevent double-counting in the task queue. This fix is essential for maintaining the integrity of Langfuse's async task processing system, particularly when dealing with large traces that require size-based filtering.

Confidence score: 4/5

  • This PR addresses a well-defined concurrency bug with a focused solution that maintains existing behavior
  • Score reflects solid understanding of the queue management issue and appropriate fix, though the async nature of task queues can have subtle edge cases
  • Pay close attention to the tuple return modification in _truncate_item_in_place() and its usage in _next()

@CLAassistant

CLAassistant commented Sep 15, 2025

CLA assistant check
All committers have signed the CLA.

Contributor

@greptile-apps greptile-apps bot left a comment


1 file reviewed, no comments


@hassiebp hassiebp self-requested a review September 15, 2025 09:51
@hassiebp
Contributor

Thank you for raising this issue. The event is only dropped in _truncate_item_in_place if the body does not exist or both the input and output are empty. However, truncation is only necessary if the input, output, or metadata are too large. In your events, are the input and output missing, with only the metadata being large?

@newmanifold
Author

Hey! Yes metadata was very large in some of our events, and input/outputs were missing.

@hassiebp
Contributor

In that case, this PR could be simplified: extend the drop condition to also check for missing metadata, so an event is only dropped if input, output, and metadata are all missing.

@newmanifold
Author

Hey!
Do you mean we should only add a check for missing metadata in the drop condition and remove all the other changes in the PR?
That would solve the issue in my specific case, but if the drop condition can still be triggered, we'd run into the same problem of task_done being called multiple times.

Unless, of course, there's no actual way for the drop condition to be triggered in which case, the drop condition itself might be redundant.

@hassiebp
Contributor

The drop condition should never be triggered: if all of input / output / metadata are missing, there is no other field that can hold a value with size exceeding the limit. And yes, it is only an additional safeguard here.

Do you mean we should only add a check for missing metadata in the drop condition and remove all the other changes in the PR?

yes 👍

@newmanifold
Author

@hassiebp I've made the changes; let me know if it's fine or if anything else needs to be done.
I've also updated the PR description, but the Ellipsis and Greptile summaries are still the older ones. I hope that's okay.

@hassiebp hassiebp merged commit 07150c5 into langfuse:v2-stable Sep 16, 2025
1 check passed
@hassiebp
Contributor

Thanks for your contribution, @newmanifold

@hassiebp
Contributor

Released in 2.60.10
