This is the official code for the paper "BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback".

All data is provided in this repository. If you plan to run the evaluators, place the provided data under the `dataset/` directory so the scripts can reference it.
## Data Fields (in this repository)

### Query Table (`queries/query_table.parquet`)

| Key | Type | Description |
|---|---|---|
| user | str | User identifier |
| query-id | str | Query identifier (canonical hyphen variant) |
| query | str | Natural-language query |
| gold_information_need | str | Short description of the required information |
### Gold Information (`gold_information/gold_information.parquet`)

| Key | Type | Description |
|---|---|---|
| user | str | User identifier |
| query-id | str | Query identifier |
| id | str | UUID per gold info item |
| gold_information_value | str | JSON/text with gold answer metadata for the query |
### Personalized Rubric (`personalized_rubric/personalized_rubric.parquet`)

| Key | Type | Description |
|---|---|---|
| user | str | User identifier |
| query-id | str | Query identifier |
| id | str | UUID per rubric item |
| personalized_rubric_value | str | Free-form or structured rubric text per query |
### Meta Test (`meta_test/meta_test_set.parquet`)

| Key | Type | Description |
|---|---|---|
| user | str | User identifier |
| query-id | str | Query identifier |
| id | str | UUID per meta test item |
| meta_test_set_value | str | JSON string with `{query, gold_information_need, response_list, ...}` |
### Evaluation Shot (`evaluation_shot/evaluation_shot.parquet`)

| Key | Type | Description |
|---|---|---|
| user | str | User identifier |
| query-id | str | Query identifier |
| id | str | UUID per shot |
| evaluation_shot_value | str | JSON/text few-shot example for evaluation or prompting |
### Chat/Search History (`chat_history/chat_history.parquet`, `search_history/search_history.parquet`)

| Key | Type | Description |
|---|---|---|
| user | str | User identifier |
| id | str | UUID per record |
| history | str | Raw text transcript blob per row |
Example query table row (`queries/query_table.parquet`):

```json
{
  "user": "user1",
  "query-id": "1",
  "query": "Why is marketing so popular these days?",
  "gold_information_need": "..."
}
```
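The parquet files can be inspected with standard tooling. Below is a minimal pandas sketch (an illustration, assuming `pandas` with a parquet engine such as `pyarrow` is installed and the data has been placed under `dataset/` as described above):

```python
import pandas as pd

# Load the query table and the per-query gold information items.
queries = pd.read_parquet("dataset/queries/query_table.parquet")
gold = pd.read_parquet("dataset/gold_information/gold_information.parquet")

# Gold information items are keyed by (user, query-id); a query may have several.
merged = queries.merge(gold, on=["user", "query-id"], how="left")
print(merged[["user", "query-id", "query", "gold_information_value"]].head())
```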
Set your API key (or pass it directly in code):

```bash
export OPENAI_API_KEY=YOUR_KEY
```

Required files:
- `dataset/evaluation_shot/{user}/query-id_{N}.json`
- `dataset/personalized_rubric/{user}/query-id_{N}.txt`
Run via the provided script (fill in the placeholders inside `evaluate/run_personalizaion_eval.py`):

```bash
python evaluate/run_personalizaion_eval.py
```

Set these fields inside the script:
- `user` (e.g., `"user2"`)
- `query_id` (e.g., `6`)
- `response_text` (the response text to evaluate)
- `eval_shot_root` (e.g., `dataset/evaluation_shot`)
- `rubric_root` (e.g., `dataset/personalized_rubric`)
- `model` (default: `gpt-5`)
Output (printed as JSON):
- `need_alignment_score`, `content_depth_score`, `tone_score`, `explanation_style_score`, plus corresponding feedback strings
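For illustration, a printed result might look like the following (the score values and the names of the feedback keys are hypothetical; only the four score fields listed above are documented here):

```json
{
  "need_alignment_score": 4,
  "content_depth_score": 3,
  "tone_score": 5,
  "explanation_style_score": 4,
  "need_alignment_feedback": "...",
  "content_depth_feedback": "...",
  "tone_feedback": "...",
  "explanation_style_feedback": "..."
}
```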
Programmatic usage (alternative):

```python
from pathlib import Path
from evaluate.evaluator import PersonalizationEvaluator

evaluator = PersonalizationEvaluator(
    eval_shot_root=Path("dataset/evaluation_shot"),
    rubric_root=Path("dataset/personalized_rubric"),
    model="gpt-5",
)
result = evaluator.evaluate(user="user2", query_id=6, response_text="... your response ...")
```

Required file:
`dataset/gold_information/{user}/query-id_{N}.json` (format example):

```json
{
  "gold_information": ["Claim A", "Claim B"]
}
```

Run via the provided script (fill in the placeholders inside `evaluate/run_recall_eval.py`):
```bash
python evaluate/run_recall_eval.py
```

Set these fields inside the script:
- `user` (e.g., `"user2"`)
- `query_id` (e.g., `6`)
- `response_text` (the response text to evaluate)
- `gold_information_root` (e.g., `dataset/gold_information`)
- `model` (default: `gpt-5`)
Output (printed as JSON):
- `matched_gold_information`: list of gold claims found in the response
- `recall`: matched fraction (0.0–1.0)
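The recall value is simply the matched fraction. A minimal sketch of the arithmetic (with hypothetical values; the actual claim matching is performed by the LLM judge, not by string comparison):

```python
# Hypothetical values illustrating the reported recall computation.
gold_claims = ["Claim A", "Claim B"]      # from gold_information.json
matched = ["Claim A"]                     # claims the judge found in the response
recall = len(matched) / len(gold_claims)  # 0.5, reported in the 0.0-1.0 range
print({"matched_gold_information": matched, "recall": recall})
```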
Notes:
- Paths in the scripts can be absolute or relative to the repo root.
- The scripts are templates; fill in the placeholders before running.
A CLI is provided to generate a personalized answer for a single query. Use `--enforce_search` to enable web search/browsing for grounding.
Optional environment variables:
- OpenAI: `OPENAI_API_KEY`
- Perplexity: `PERPLEXITY_API_KEY`
- Gemini: `GEMINI_API_KEY`
Main arguments (an illustrative invocation follows this list):
- `--user_id` (required), `--query` (required)
- `--user_context`: inline user preferences/context text
- `--model_type`: `openai` | `perplexity` | `gemini` (default: `openai`)
- `--model_name` (default: `gpt-4o-mini`)
- `--enforce_search`: enable web search/browsing
- `--output_path`: path to save the JSON result
- `--print_json`: print the full JSON result
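An illustrative invocation (the script path `generate/run_generation.py` is hypothetical; substitute the actual generation entry point in this repo):

```bash
# Hypothetical script path; the argument names follow the list above.
python generate/run_generation.py \
  --user_id user2 \
  --query "Why is marketing so popular these days?" \
  --model_type openai \
  --model_name gpt-4o-mini \
  --enforce_search \
  --print_json
```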
Output:
- Default: prints the final answer text only
- With `--print_json`: prints a JSON containing `response`, `response_urls`, `model`, etc.
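With `--print_json`, the printed result might look like the following (values are hypothetical, and additional fields may be present beyond the three documented above):

```json
{
  "response": "... final answer text ...",
  "response_urls": ["https://example.com/cited-source"],
  "model": "gpt-4o-mini"
}
```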
If you use this dataset, please consider citing it:

```bibtex
@misc{kim2025bespokebenchmarksearchaugmentedlarge,
      title={BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback},
      author={Hyunseo Kim and Sangam Lee and Kwangwook Seo and Dongha Lee},
      year={2025},
      eprint={2509.21106},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.21106},
}
```