LLM-as-a-Judge format JSON #9993
Based on your screenshot, the issue appears to be that the LLM is returning extra text alongside the JSON in the score comment field. This is likely model behavior rather than a Langfuse limitation. When setting up your LLM-as-a-Judge evaluator in Langfuse, you define an outputSchema, for example:

```json
{
  "score": "Score between 0 and 1. Score 0 if false or negative and 1 if true or positive",
  "reasoning": "One sentence reasoning for the score"
}
```

The key requirement is that your chosen default model must support structured output; this is essential for Langfuse to correctly interpret the evaluation results from the LLM judge.
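If you want to sanity-check outside of Langfuse whether a given model reliably returns strict JSON, here is a minimal sketch assuming the OpenAI Python SDK and its JSON-schema structured output mode; the model name, prompt, and schema details are illustrative, not taken from your setup:

```python
# Minimal sketch: verify that a model can be forced into strict JSON output.
# Assumptions: OpenAI Python SDK >= 1.x, a model that supports JSON-schema
# structured output, and an API key in the environment. The schema mirrors
# the score/reasoning outputSchema shown above.
import json

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # swap in whichever model your evaluator uses
    messages=[
        {"role": "system", "content": "You are an evaluator. Return only JSON."},
        {"role": "user", "content": "Evaluate this answer: ..."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "evaluation",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "score": {
                        "type": "number",
                        "description": "0 if false or negative, 1 if true or positive",
                    },
                    "reasoning": {
                        "type": "string",
                        "description": "One sentence reasoning for the score",
                    },
                },
                "required": ["score", "reasoning"],
                "additionalProperties": False,
            },
        },
    },
)

# With structured output enforced, the content parses as plain JSON with no
# surrounding text.
result = json.loads(response.choices[0].message.content)
print(result["score"], result["reasoning"])
```

If a call like this parses cleanly but the same model still adds extra text when used as an evaluator, that narrows the problem down to the evaluator prompt or configuration rather than the model itself.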
To get strict JSON output without extra text: if ChatGPT-4 Mini doesn't reliably produce structured output, consider switching to a model with better structured-output support, or adjust your prompt to be more explicit about returning only valid JSON without any additional text or explanations. The classification issue you're seeing (alerts labeled as Caso_Uso: 4) is likely a prompt-engineering problem rather than a Langfuse issue; you may need to refine your evaluation prompt to be more specific about your use-case classifications.
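If you keep the current model, one possible way to tighten the prompt (purely illustrative wording, not an official Langfuse template) is to append an explicit instruction such as this to the evaluation prompt:

```python
# Hypothetical instruction to append to the evaluation prompt; the wording is
# illustrative and mirrors the score/reasoning schema defined above.
STRICT_JSON_INSTRUCTION = (
    "Respond with a single JSON object of the form "
    '{"score": 0 or 1, "reasoning": "<one sentence>"} and nothing else. '
    "Do not add explanations, markdown fences, or any text before or after "
    "the JSON object."
)
```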
I have observed that when using the same model in Playground versus LLM-as-a-Judge, the output differs. In Playground, the model responds as expected. However, in LLM-as-a-Judge, the same model and prompt produce additional text beyond what is needed. This indicates that the issue originates from how LLM-as-a-Judge handles the model output rather than from the model itself. The extra text is introduced by the evaluator's processing, leading to inconsistencies compared to the Playground results. Addressing this behavior on the Langfuse side would be necessary to ensure that LLM-as-a-Judge returns only the intended output.
Hi @lnave300, the model calls of the LLM-as-a-Judge Evaluators are logged as traces as well. Have you already tried to open one of the LLM-as-a-Judge generations in the Playground to check if you can reproduce this?
In general, I would recommend that you set up separate LLM-as-a-Judge Evaluators to get distinct scores for each Caso_Uso.