Evaluations

The Python SDK provides ways to evaluate your application. You can add custom scores to your traces and observations, or use the SDK to execute Dataset Runs.

This page shows the evaluation methods that are supported by the Python SDK. Please refer to the Evaluation documentation for more information on how to evaluate your application in Langfuse.

Create Scores

  • span_or_generation_obj.score(): Scores the specific observation object.
  • span_or_generation_obj.score_trace(): Scores the entire trace to which the object belongs.

from langfuse import get_client
 
langfuse = get_client()
 
with langfuse.start_as_current_generation(name="summary_generation") as gen:
    # ... LLM call ...
    gen.update(output="summary text...")
    # Score this specific generation
    gen.score(name="conciseness", value=0.8, data_type="NUMERIC")
    # Score the overall trace
    gen.score_trace(name="user_feedback_rating", value="positive", data_type="CATEGORICAL")

Score Parameters:

  • name (str): Name of the score (e.g., “relevance”, “accuracy”). Required.
  • value (Union[float, str]): Score value. Float for NUMERIC/BOOLEAN, string for CATEGORICAL. Required.
  • trace_id (str): ID of the trace to associate the score with (for create_score). Required.
  • observation_id (Optional[str]): ID of the specific observation to score (for create_score).
  • session_id (Optional[str]): ID of the specific session to score (for create_score).
  • score_id (Optional[str]): Custom ID for the score (auto-generated if None).
  • data_type (Optional[ScoreDataType]): "NUMERIC", "BOOLEAN", or "CATEGORICAL". Inferred from the value type and the score config on the server if not provided.
  • comment (Optional[str]): Comment or explanation for the score.
  • config_id (Optional[str]): ID of a pre-defined score configuration in Langfuse.
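
Several parameters above refer to langfuse.create_score(), which attaches a score to an existing trace or observation by ID, for example when user feedback arrives after the request has finished. A minimal sketch, where the score name, trace ID, and comment are placeholders:

from langfuse import get_client
 
langfuse = get_client()
 
# Attach a score to an existing trace after the fact, e.g. from a feedback endpoint.
# Replace the trace_id with the ID of the trace you want to score.
langfuse.create_score(
    name="helpfulness",
    value=1,
    data_type="BOOLEAN",
    trace_id="<trace-id-of-the-request>",
    comment="User marked the answer as helpful",
)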

See Scoring for more details.

Dataset Runs

Langfuse Datasets let you manage collections of inputs and expected outputs for evaluating and testing your LLM application.

Create a Dataset

  • Creating: You can programmatically create new datasets with langfuse.create_dataset(...) and add items to them using langfuse.create_dataset_item(...).
  • Fetching: Retrieve a dataset and its items using langfuse.get_dataset(name: str). This returns a DatasetClient instance, which contains a list of DatasetItemClient objects (accessible via dataset.items). Each DatasetItemClient holds the input, expected_output, and metadata for an individual data point.

from langfuse import get_client
 
langfuse = get_client()
 
# Fetch an existing dataset
dataset = langfuse.get_dataset(name="my-eval-dataset")
for item in dataset.items:
    print(f"Input: {item.input}, Expected: {item.expected_output}")
 
# Create a new dataset and add an item to it
new_dataset = langfuse.create_dataset(name="new-summarization-tasks")
langfuse.create_dataset_item(
    dataset_name="new-summarization-tasks",
    input={"text": "Long article..."},
    expected_output={"summary": "Short summary."}
)

Run an Experiment on a Dataset

After fetching your dataset, you can execute a run against it. This will create a new trace for each item in the dataset. Please refer to the Experiments via SDK documentation for more details.
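
For orientation, a minimal sketch of the lower-level pattern: iterate over the dataset items and execute your application inside item.run(), which yields a root span whose trace is linked to the dataset run. The run name, the my_app function, and the exact-match score below are placeholders.

from langfuse import get_client
 
langfuse = get_client()
 
dataset = langfuse.get_dataset(name="my-eval-dataset")
 
for item in dataset.items:
    # item.run() opens a root span; its trace is linked to the dataset run
    with item.run(run_name="experiment-v1") as root_span:
        output = my_app(item.input)  # placeholder for your application logic
        root_span.update_trace(input=item.input, output=output)
        # Optionally score the result immediately
        root_span.score_trace(
            name="exact_match",
            value=1.0 if output == item.expected_output else 0.0,
        )
 
# Ensure all queued events are sent before the script exits
langfuse.flush()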
