LLM-as-a-Judge format JSON #9993
Based on your screenshot, the issue appears to be that the LLM is returning extra text alongside the JSON in the score comment field. This is likely model behavior rather than a Langfuse limitation. When setting up your LLM-as-a-Judge evaluator in Langfuse, you define an outputSchema, for example:

```json
{
  "score": "Score between 0 and 1. Score 0 if false or negative and 1 if true or positive",
  "reasoning": "One sentence reasoning for the score"
}
```

The key requirement is that your chosen default model must support structured output; this is essential for Langfuse to correctly interpret the evaluation results from the LLM judge.
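If you want to sanity-check outside of Langfuse whether a given model reliably returns strict JSON, here is a minimal sketch assuming the OpenAI Python SDK and its JSON-schema structured output mode; the model name, prompt, and schema details are illustrative, not taken from your setup:

```python
# Minimal sketch: verify that a model can be forced into strict JSON output.
# Assumptions: OpenAI Python SDK >= 1.x, a model that supports JSON-schema
# structured output, and an API key in the environment. The schema mirrors
# the score/reasoning outputSchema shown above.
import json

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # swap in whichever model your evaluator uses
    messages=[
        {"role": "system", "content": "You are an evaluator. Return only JSON."},
        {"role": "user", "content": "Evaluate this answer: ..."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "evaluation",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "score": {
                        "type": "number",
                        "description": "0 if false or negative, 1 if true or positive",
                    },
                    "reasoning": {
                        "type": "string",
                        "description": "One sentence reasoning for the score",
                    },
                },
                "required": ["score", "reasoning"],
                "additionalProperties": False,
            },
        },
    },
)

# With structured output enforced, the content parses as plain JSON with no
# surrounding text.
result = json.loads(response.choices[0].message.content)
print(result["score"], result["reasoning"])
```

If a call like this parses cleanly but the same model still adds extra text when used as an evaluator, that narrows the problem down to the evaluator prompt or configuration rather than the model itself.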
To get strict JSON output without extra text: if ChatGPT-4 Mini doesn't reliably produce structured output, consider switching to a model with better structured-output support, or adjust your prompt to be more explicit about returning only valid JSON without any additional text or explanations. The classification issue you're seeing (alerts labeled as Caso_Uso: 4) is likely a prompt-engineering problem rather than a Langfuse issue; you may need to refine your evaluation prompt to be more specific about your use-case classifications.
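If you keep the current model, one possible way to tighten the prompt (purely illustrative wording, not an official Langfuse template) is to append an explicit instruction such as this to the evaluation prompt:

```python
# Hypothetical instruction to append to the evaluation prompt; the wording is
# illustrative and mirrors the score/reasoning schema defined above.
STRICT_JSON_INSTRUCTION = (
    "Respond with a single JSON object of the form "
    '{"score": 0 or 1, "reasoning": "<one sentence>"} and nothing else. '
    "Do not add explanations, markdown fences, or any text before or after "
    "the JSON object."
)
```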
I have observed that when using the same model in Playground versus LLM-as-a-Judge, the output differs. In Playground, the model responds as expected. However, in LLM-as-a-Judge, the same model and prompt produce additional text beyond what is needed. This indicates that the issue originates from how LLM-as-a-Judge handles the model output rather than from the model itself. The extra text is introduced by the evaluator's processing, leading to inconsistencies compared to the Playground results. Addressing this behavior on the Langfuse side would be necessary to ensure that LLM-as-a-Judge returns only the intended output.
Hi @lnave300, the model calls of the LLM-as-a-Judge Evaluators are logged as traces as well. Have you already tried to open one of the LLM-as-a-Judge generations in the Playground to check if you can reproduce this?
In general, I would recommend that you set up separate LLM-as-a-Judge Evaluators to get distinct scores for each Caso_Uso.