Evaluating LLM performance

Dataset: "The Verdict" short story (about 20,000 characters, ~5,000 tokens)

1) Convert text into tokens: Byte Pair Encoding

Large Language Models (LLMs) don’t work directly with raw text; instead, they process text as tokens. A token is typically a word, part of a word, or even a punctuation mark, depending on the tokenizer’s rules.

Byte Pair Encoding (BPE) is a common tokenization method where:


1. The text is first split into characters.
2. The most frequent pairs of characters are merged into bigger units.
3. This merging repeats until a fixed vocabulary size is reached.
This approach keeps common words as single tokens and breaks rare or unknown words into smaller
pieces, helping the model handle any text.

After tokenization, the text is represented as a list of integer token IDs.
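As a minimal sketch of this step, assuming the tiktoken library with the GPT-2 encoding (the file name is illustrative):

    import tiktoken

    # GPT-2's BPE tokenizer (vocabulary of 50,257 tokens)
    tokenizer = tiktoken.get_encoding("gpt2")

    with open("the-verdict.txt", "r", encoding="utf-8") as f:  # file name is illustrative
        raw_text = f.read()

    # Encode into integer token IDs; keep the special <|endoftext|> token intact
    token_ids = tokenizer.encode(raw_text, allowed_special={"<|endoftext|>"})
    print(len(raw_text), len(token_ids))  # roughly 20,000 characters -> ~5,000 tokens

    # A rare proper noun is usually split into several subword tokens
    print(tokenizer.encode("Gisburn"))                       # likely more than one ID
    print(tokenizer.decode(tokenizer.encode("Gisburn")))     # decodes back to "Gisburn"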


For training, we usually split this list into chunks of fixed length, so the model can learn patterns in
manageable pieces.

Sliding Window Chunking is often used, where each chunk overlaps slightly with the previous one.
* max_length is the size of each chunk.
* stride controls how far the window moves each step.
When stride is smaller than max_length, consecutive chunks overlap, which helps the model keep context across chunk boundaries.

In the code below:


* tokenizer.encode() converts the text into token IDs (integers) while keeping special tokens like
<|endoftext|>.
* A sliding window loops through the token IDs, producing input sequences and their shifted target
sequences for next-token prediction.
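A minimal sketch of that code, assuming PyTorch and the tiktoken tokenizer above (the class name GPTDataset and the variable names are illustrative):

    import torch
    from torch.utils.data import Dataset

    class GPTDataset(Dataset):
        def __init__(self, text, tokenizer, max_length, stride):
            # Encode the full text into token IDs, keeping the special token
            token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
            self.input_ids, self.target_ids = [], []
            # Slide a window of max_length tokens over the IDs, moving by `stride`
            for i in range(0, len(token_ids) - max_length, stride):
                input_chunk = token_ids[i:i + max_length]
                target_chunk = token_ids[i + 1:i + max_length + 1]  # shifted by one token
                self.input_ids.append(torch.tensor(input_chunk))
                self.target_ids.append(torch.tensor(target_chunk))

        def __len__(self):
            return len(self.input_ids)

        def __getitem__(self, idx):
            return self.input_ids[idx], self.target_ids[idx]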
2) Train/Validation Split

The dataset is split into 90% training and 10% validation using train_ratio. The training set is shuffled for
better learning, while the validation set remains in order for consistent evaluation. Separate data loaders
are created for each split with the same chunking parameters (max_length, stride) but different shuffle
and drop_last settings.
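A sketch of how the split and the two loaders might be set up, building on the GPTDataset sketch above (the batch size of 2 and the max_length and stride of 256 are assumptions, chosen to match the batch counts listed later in these notes):

    from torch.utils.data import DataLoader

    train_ratio = 0.90
    split_idx = int(train_ratio * len(raw_text))
    train_text, val_text = raw_text[:split_idx], raw_text[split_idx:]

    train_dataset = GPTDataset(train_text, tokenizer, max_length=256, stride=256)
    val_dataset = GPTDataset(val_text, tokenizer, max_length=256, stride=256)

    # Training loader: shuffled, incomplete last batch dropped
    train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True, drop_last=True)
    # Validation loader: kept in order for consistent evaluation
    val_loader = DataLoader(val_dataset, batch_size=2, shuffle=False, drop_last=False)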

[Diagram: creating input/target pairs and the train/validation split]

The dataset is split roughly 90% / 10% into training data and validation data, and a loss is tracked separately for each (training loss and validation loss).

LLMs are autoregressive models: there are no separate labels, so the targets are created from the text itself. With a context size of 4, the model uses 4 tokens at a time, and each target is the input window shifted forward by one token; a stride of 4 then moves the window on to the next chunk. For example, from "I had always thought Jack Gisburn rather a cheap ...":

input: "I had always thought"   ->  target: "had always thought Jack"
input: "Jack Gisburn rather a"  ->  target: "Gisburn rather a cheap"
Summary of Calculating Loss

[Diagram: loss calculation pipeline]

1. The input tokens (e.g. "I had always thought") are passed through the GPT model, which outputs logits: one score for every entry in the vocabulary at every input position.
2. Softmax turns the logits into probabilities. The token with the highest probability would be the model's prediction, but for the loss we instead look up the probability the model assigned to the actual target token ID at each position (here the targets "had always thought Jack").
3. Taking the negative log of these probabilities and averaging them gives the cross-entropy loss:

   loss = -(log p_1 + log p_2 + ... + log p_N) / N

An untrained model with random weights assigns essentially random probabilities; training pushes the probabilities corresponding to the target token IDs higher.
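A sketch of the same pipeline in PyTorch for a single batch, assuming model is a GPT-style module that returns logits of shape (batch, sequence length, vocabulary size):

    import torch

    inputs, targets = next(iter(train_loader))
    with torch.no_grad():
        logits = model(inputs)                   # (batch, seq_len, vocab_size)

    probas = torch.softmax(logits, dim=-1)       # probabilities over the vocabulary
    # Probability the model assigned to each correct (target) token
    target_probas = torch.gather(probas, dim=-1, index=targets.unsqueeze(-1)).squeeze(-1)
    log_probas = torch.log(target_probas)        # log p_1, log p_2, ...
    loss = -log_probas.mean()                    # negative average log-likelihood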
Batch Processing

[Diagram: batch processing]

With a batch of 2 sequences and context size 4:

input 1: "I had always thought"   ->  target 1: "had always thought Jack"
input 2: "Jack Gisburn rather a"  ->  target 2: "Gisburn rather a cheap"

The GPT model produces logits of shape (2, 4, 50257): a 50,257-way score vector for every token position in the batch. Softmax converts these to probabilities, and the probabilities corresponding to the target token IDs are picked out. Flattening the batch and sequence dimensions gives an (8, 50257) matrix and 8 target token IDs, and the cross-entropy loss is the negative mean of the target log-probabilities:

loss = -(log p_1 + log p_2 + ... + log p_8) / 8
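In practice the softmax, target lookup, and negative log steps are folded into a single call to torch.nn.functional.cross_entropy on the flattened logits and targets; a sketch under the same assumptions as above:

    import torch.nn.functional as F

    logits = model(inputs)                # (2, 4, 50257) in the example above
    loss = F.cross_entropy(
        logits.flatten(0, 1),             # (2*4, 50257) = (8, 50257)
        targets.flatten()                 # (8,) target token IDs
    )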


3) X, y split.

With max_length = 256, the split produces:

* 9 training batches of 2 samples, 256 tokens each
* 1 validation batch of 2 samples, 256 tokens each
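A quick check of the resulting batch counts and tensor shapes from the loaders sketched earlier (exact numbers depend on the text length and the assumed batch size):

    print(len(train_loader), len(val_loader))   # e.g. 9 training batches, 1 validation batch
    x, y = next(iter(train_loader))
    print(x.shape, y.shape)                     # torch.Size([2, 256]) torch.Size([2, 256])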
4) Pass through GPT-Model.

5) Calculate the Cross Entropy Loss.

Calculate the loss for all the batches, then take the mean loss per batch.
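A sketch of averaging the loss over every batch in a loader (the helper name calc_loss_loader and the device handling are illustrative):

    import torch
    import torch.nn.functional as F

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def calc_loss_loader(data_loader, model, device):
        total_loss = 0.0
        for inputs, targets in data_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)
            loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
            total_loss += loss.item()
        # Mean loss per batch
        return total_loss / len(data_loader)

    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)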
