With this repository, you can input any demand for the capability you want to evaluate and receive a high-quality, customized benchmark in return.
For more details, see LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient.
Install all the required libraries and modify the API_all.py file to configure your API model.
Run gradio_demo.py with the command gradio gradio_demo.py for an intuitive way to generate your customized benchmark.
Install all the required libraries.
Modify the API_all.py file as required to configure your API model.
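The exact contents of API_all.py depend on your provider, so the snippet below is only a minimal sketch assuming an OpenAI-compatible endpoint; the variable names (client, chat) and the model name are illustrative, not the repository's actual code.

```python
# Minimal sketch of an API configuration (illustrative; adapt to API_all.py's actual structure).
# Assumes an OpenAI-compatible endpoint via the openai>=1.0 client.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",                 # your provider's key
    base_url="https://api.openai.com/v1",   # change for a self-hosted or proxy endpoint
)

def chat(prompt: str, model: str = "gpt-4o") -> str:
    """Send a single-turn prompt and return the model's reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```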
Define your assessment demands in a JSON file under task_des, following the format of the existing examples.
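The expected schema is defined by the files already in task_des; the sketch below only illustrates how such a demand might be written out, and the keys (task_name, description, example) are hypothetical.

```python
# Hypothetical sketch of a task-description entry; mirror the keys used by the
# existing files in task_des rather than this sketch.
import json

task_demand = {
    "task_name": "math_word_problems",   # hypothetical task name
    "description": "Evaluate multi-step arithmetic reasoning on grade-school word problems.",
    "example": "Tom has 3 boxes with 12 apples each. How many apples does he have in total?",
}

with open("task_des/math_word_problems.json", "w", encoding="utf-8") as f:
    json.dump(task_demand, f, ensure_ascii=False, indent=2)
```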
Modify the task_name in final_generate_attribute_0.py and run it.
Modify the task_name in final_LLMasBenchmarkGenerator_1.py and run it.
Modify the task_name in final_decode_2.py and run it.
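If you prefer not to launch each stage by hand, the three generation stages can be chained from a small driver script; the sketch below simply runs them in order with subprocess and assumes task_name has already been set inside each script as described above.

```python
# Sketch of a driver that runs the three generation stages in order.
# Assumes task_name has already been set inside each script.
import subprocess
import sys

STAGES = [
    "final_generate_attribute_0.py",
    "final_LLMasBenchmarkGenerator_1.py",
    "final_decode_2.py",
]

for script in STAGES:
    print(f"Running {script} ...")
    subprocess.run([sys.executable, script], check=True)  # stop if a stage fails
```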
At this point, the generated benchmark is available in generated_benchmark. To further evaluate faithfulness, alignment, and semantic diversity, run final_get_faithfulness_3_1.py, final_get_relevance_3_2.py, and final_get_embedding_3_0.py, respectively.
You need to configure your embedding model in final_get_embedding_3_0.py.
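Which embedding backend final_get_embedding_3_0.py expects is repository-specific; the sketch below only illustrates one common option, a local sentence-transformers model, and the model name is an assumption.

```python
# Illustrative embedding setup (assumption: a local sentence-transformers model);
# replace with whatever backend final_get_embedding_3_0.py actually expects.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical choice of model

def embed(texts):
    """Return one embedding vector per input text."""
    return embedder.encode(texts, normalize_embeddings=True)
```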