# Evaluations
The Python SDK provides ways to evaluate your application. You can add custom scores to your traces and observations, or use the SDK to execute Dataset Runs.
This page covers the evaluation methods supported by the Python SDK. Refer to the Evaluation documentation for more information on how to evaluate your application in Langfuse.
## Create Scores
- `span_or_generation_obj.score()`: Scores the specific observation object.
- `span_or_generation_obj.score_trace()`: Scores the entire trace to which the object belongs.
```python
from langfuse import get_client

langfuse = get_client()

with langfuse.start_as_current_generation(name="summary_generation") as gen:
    # ... LLM call ...
    gen.update(output="summary text...")

    # Score this specific generation
    gen.score(name="conciseness", value=0.8, data_type="NUMERIC")

    # Score the overall trace
    gen.score_trace(name="user_feedback_rating", value="positive", data_type="CATEGORICAL")
```
Score Parameters:
| Parameter | Type | Description |
|---|---|---|
| `name` | `str` | Name of the score (e.g., “relevance”, “accuracy”). Required. |
| `value` | `Union[float, str]` | Score value. Float for `NUMERIC`/`BOOLEAN`, string for `CATEGORICAL`. Required. |
| `trace_id` | `str` | ID of the trace to associate with (for `create_score`). Required. |
| `observation_id` | `Optional[str]` | ID of the specific observation to score (for `create_score`). |
| `session_id` | `Optional[str]` | ID of the specific session to score (for `create_score`). |
| `score_id` | `Optional[str]` | Custom ID for the score (auto-generated if `None`). |
| `data_type` | `Optional[ScoreDataType]` | `"NUMERIC"`, `"BOOLEAN"`, or `"CATEGORICAL"`. Inferred from the value type and the score config on the server if not provided. |
| `comment` | `Optional[str]` | Optional comment or explanation for the score. |
| `config_id` | `Optional[str]` | Optional ID of a pre-defined score configuration in Langfuse. |
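The `trace_id`, `observation_id`, and `session_id` parameters apply when creating a score outside of an active observation via `langfuse.create_score()`. A minimal sketch, assuming you already hold the ID of a trace you want to score (the ID below is a placeholder):

```python
from langfuse import get_client

langfuse = get_client()

# Attach a score to an existing trace by ID.
# The trace_id below is a placeholder; use a real trace ID from your application.
langfuse.create_score(
    name="correctness",
    value=1.0,
    trace_id="abcdef1234567890abcdef1234567890",
    data_type="NUMERIC",
    comment="Answer matched the reference answer.",
)
```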
See Scoring for more details.
## Dataset Runs
Langfuse Datasets let you manage collections of inputs and their expected outputs, which is essential for evaluating and testing your LLM applications.
### Create a Dataset
- **Creating:** You can programmatically create new datasets with `langfuse.create_dataset(...)` and add items to them using `langfuse.create_dataset_item(...)`.
- **Fetching:** Retrieve a dataset and its items using `langfuse.get_dataset(name: str)`. This returns a `DatasetClient` instance, which contains a list of `DatasetItemClient` objects (accessible via `dataset.items`). Each `DatasetItemClient` holds the `input`, `expected_output`, and `metadata` for an individual data point.
```python
from langfuse import get_client

langfuse = get_client()

# Fetch an existing dataset
dataset = langfuse.get_dataset(name="my-eval-dataset")
for item in dataset.items:
    print(f"Input: {item.input}, Expected: {item.expected_output}")

# Briefly: Creating a dataset and an item
new_dataset = langfuse.create_dataset(name="new-summarization-tasks")
langfuse.create_dataset_item(
    dataset_name="new-summarization-tasks",
    input={"text": "Long article..."},
    expected_output={"summary": "Short summary."}
)
```
### Run an Experiment on a Dataset
After fetching your dataset, you can execute a run against it. This will create a new trace for each item in the dataset. Please refer to the Experiments via SDK documentation for more details.
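A minimal sketch of such a run, assuming the SDK's `item.run()` context manager, which creates a trace per item and links it to a named dataset run (`my_llm_app` is a placeholder for your application logic):

```python
from langfuse import get_client

langfuse = get_client()


def my_llm_app(input_data: dict) -> dict:
    # Placeholder for your actual application logic
    return {"summary": "Short summary."}


dataset = langfuse.get_dataset(name="my-eval-dataset")

for item in dataset.items:
    # Creates a new trace for this item and links it to the dataset run
    with item.run(run_name="summarization-run-v1") as root_span:
        output = my_llm_app(item.input)
        root_span.update_trace(input=item.input, output=output)

        # Score the trace produced for this dataset item
        root_span.score_trace(
            name="exact_match",
            value=1.0 if output == item.expected_output else 0.0,
            data_type="BOOLEAN",
        )

# Ensure all events are sent before the script exits
langfuse.flush()
```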