Model     Size   Code   Commonsense Reasoning   World Knowledge   Reading Comprehension   Math   MMLU   BBH    AGI Eval
MPT       7B     20.5   57.4                    41.0              57.5                    4.9    26.8   31.0   23.5
MPT       30B    28.9   64.9                    50.0              64.7                    9.1    46.9   38.0   33.8
Falcon    7B     5.6    56.1                    42.8              36.0                    4.6    26.2   28.0   21.2
Falcon    40B    15.2   69.2                    56.7              65.7                    12.6   55.4   37.1   37.0
Llama 1   7B     14.1   60.8                    46.2              58.5                    6.95   35.1   30.3   23.9
Llama 1   13B    18.9   66.1                    52.6              62.3                    10.9   46.9   37.0   33.9
Llama 1   33B    26.0   70.0                    58.4              67.6                    21.4   57.8   39.8   41.7
Llama 1   65B    30.7   70.7                    60.5              68.6                    30.8   63.4   43.5   47.6
Llama 2   7B     16.8   63.9                    48.9              61.3                    14.6   45.3   32.6   29.3
Llama 2   13B    24.5   66.9                    55.4              65.8                    28.7   54.8   39.4   39.1
Llama 2   34B    27.8   69.9                    58.7              68.0                    24.2   62.6   44.1   43.4
Llama 2   70B    37.5   71.9                    63.6              69.4                    35.2   68.9   51.2   54.2

Table 3: Overall performance on grouped academic benchmarks compared to open-source base models.

• Popular Aggregated Benchmarks. We report the overall results for MMLU (5 shot) (Hendrycks
et al., 2020), Big Bench Hard (BBH) (3 shot) (Suzgun et al., 2022), and AGI Eval (3–5 shot) (Zhong
et al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.
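To make the reporting protocol concrete, below is a minimal sketch (in Python) of how a k-shot prompt for a multiple-choice benchmark such as MMLU might be assembled, and how the AGI Eval English-task average could be computed. The exemplar format, helper names, and data layout are illustrative assumptions, not the evaluation harness used to produce Table 3.

# Illustrative sketch only; prompt layout and helper names are assumptions,
# not the exact setup used in the paper.

def build_kshot_prompt(exemplars, question, choices, k=5):
    """Concatenate k solved exemplars ahead of the test question (e.g., MMLU 5-shot)."""
    blocks = []
    for ex in exemplars[:k]:
        options = "\n".join(f"{label}. {text}" for label, text in zip("ABCD", ex["choices"]))
        blocks.append(f"Question: {ex['question']}\n{options}\nAnswer: {ex['answer']}\n")
    options = "\n".join(f"{label}. {text}" for label, text in zip("ABCD", choices))
    blocks.append(f"Question: {question}\n{options}\nAnswer:")
    return "\n".join(blocks)

def agieval_english_average(per_task_accuracy, english_tasks):
    """Report the unweighted mean accuracy over the English AGI Eval tasks only."""
    scores = [per_task_accuracy[task] for task in english_tasks]
    return sum(scores) / len(scores)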

As shown in Table 3, Llama 2 models outperform Llama 1 models. In particular, Llama 2 70B improves the
results on MMLU and BBH by ≈5 and ≈8 points, respectively, compared to Llama 1 65B (68.9 vs. 63.4 on MMLU
and 51.2 vs. 43.5 on BBH). Llama 2 7B and 34B models outperform MPT models of the corresponding size (7B
and 30B) on all categories besides code benchmarks. For the Falcon models, Llama 2 7B and 34B outperform
Falcon 7B and 40B models on all categories of benchmarks. Additionally, the Llama 2 70B model outperforms
all open-source models.
In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown
in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant
gap on coding benchmarks. Llama 2 70B results are on par with or better than PaLM (540B) (Chowdhery et al.,
2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4
and PaLM-2-L.
We also analyzed the potential data contamination and share the details in Section A.6.

Benchmark (shots)            GPT-3.5   GPT-4   PaLM   PaLM-2-L   Llama 2
MMLU (5-shot)                70.0      86.4    69.3   78.3       68.9
TriviaQA (1-shot)            –         –       81.4   86.1       85.0
Natural Questions (1-shot)   –         –       29.3   37.5       33.0
GSM8K (8-shot)               57.1      92.0    56.5   80.7       56.8
HumanEval (0-shot)           48.1      67.0    26.2   –          29.9
BIG-Bench Hard (3-shot)      –         –       52.3   65.7       51.2

Table 4: Comparison to closed-source models on academic benchmarks. Results for GPT-3.5 and GPT-4
are from OpenAI (2023). Results for the PaLM model are from Chowdhery et al. (2022). Results for
PaLM-2-L are from Anil et al. (2023).

3 Fine-tuning
Llama 2-Chat is the result of several months of research and iterative applications of alignment techniques,
including both instruction tuning and RLHF, requiring significant computational and annotation resources.
In this section, we report on our experiments and findings using supervised fine-tuning (Section 3.1), as
well as initial and iterative reward modeling (Section 3.2.2) and RLHF (Section 3.2.3). We also share a
new technique, Ghost Attention (GAtt), which we find helps control dialogue flow over multiple turns
(Section 3.3). See Section 4.2 for safety evaluations on fine-tuned models.
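As a rough illustration of the multi-turn control problem that GAtt targets, the following sketch (Python) constructs a synthetic training example in which an instruction meant to persist across the dialogue is attached only to the first user turn. The message schema, helper name, and masking flag are illustrative assumptions, not the GAtt procedure detailed in Section 3.3.

# Minimal sketch of a GAtt-style training example: an instruction that should
# persist across turns is kept only on the first user message, and the model is
# fine-tuned to keep honoring it in later turns. The dialogue format and the
# loss-masking policy shown here are simplified assumptions.

def build_gatt_style_example(instruction, turns):
    """turns: list of (user_msg, assistant_msg) pairs for a multi-turn dialogue."""
    messages = []
    for i, (user_msg, assistant_msg) in enumerate(turns):
        # Attach the persistent instruction only to the first user turn.
        user_text = f"{instruction}\n{user_msg}" if i == 0 else user_msg
        messages.append({"role": "user", "content": user_text})
        # Supervise on the final assistant turn so training focuses on whether
        # the instruction is still being followed late in the dialogue.
        messages.append({
            "role": "assistant",
            "content": assistant_msg,
            "train_on": i == len(turns) - 1,
        })
    return messages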
