Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Latest commit

 

History

History
81 lines (62 loc) · 3.17 KB

File metadata and controls

81 lines (62 loc) · 3.17 KB

Eval

Usage

Please refer to eval_llada.sh for the required dependencies and execution commands.

For LLaDA-Base, we provide a comparison of the five conditional generation metrics evaluated using both the open-source lm-eval library and our internal evaluation toolkit.

BBH GSM8K Math HumanEval MBPP
Internal toolkit 49.8 70.7 27.3 33.5 38.4
lm-eval 49.7 70.3 31.4 35.4 40.0

In addition, we provide ablation studies on the above five metrics with respect to different generation lengths using lm-eval.

BBH GSM8K Math HumanEval MBPP
gen_length=1024,steps=1024,block_length=1024 49.7 70.3 31.4 35.4 40.0
gen_length=512,steps=512,block_length=512 50.4 70.8 30.9 32.9 39.2
gen_length=256,steps=256,block_length=256 45.0 70.0 30.3 32.9 40.2

Challenges encountered when reproducing the Instruct model with lm-eval

To ensure that we was using lm-eval correctly, we first tested it on LLaMA3-8B-Instruct. The results are as follows:

MMLU MMLU Pro ARC-C GSM8K Math GPQA HumanEval MBPP
Reported 68.4 41.0 - 79.6 30.0 34.2 62.2 67.9
Internal toolkit 68.4 41.9 82.4 78.3 29.6 33.5 59.8 57.6
lm-eval 66.5 19.6 82.1 67.3 27.3 33.5 36.6 57.0

We found that for benchmarks such as MMLU-Pro, GSM8K, and HumanEval, the results obtained using lm-eval are significantly lower than expected. Once we resolve the issues affecting the evaluation of LLaMA3-8B-Instruct, we will release the evaluation code for LLaDA-Instruct.

If you have any suggestions or feedback on this BUG, please feel free to contact us via email at [email protected] or reach out via WhatsApp/WeChat at (+86) 18809295303. We would greatly appreciate it.

Below is the command we used to test LLaMA3-8B-Instruct:

pip install transformers==4.49.0 accelerate==0.34.2
pip install antlr4-python3-runtime==4.11 math_verify sympy hf_xet

git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .


export HF_ALLOW_CODE_EVAL=1
export HF_DATASETS_TRUST_REMOTE_CODE=true


accelerate launch -m lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks mmlu_generative,gpqa_main_generative_n_shot,gsm8k \
    --num_fewshot 5 \
    --trust_remote_code \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --batch_size auto:4

accelerate launch -m lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks minerva_math \
    --num_fewshot 4 \
    --trust_remote_code \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --batch_size auto:4

accelerate launch -m lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks mmlu_pro,arc_challenge_chat \
    --trust_remote_code \
    --apply_chat_template \
    --batch_size auto:4

# For HumanEval and MBPP, using --apply_chat_template leads to significantly lower final results.
accelerate launch -m lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks humaneval_instruct,mbpp \
    --trust_remote_code \
    --confirm_run_unsafe_code \
    --batch_size auto:4