This material is taken from Jay Alammar's
"The Illustrated Transformer"
Example input sequence: "As aliens entered our planet and began to colonize earth, a certain group of ..."
The encoder learns information for the entire input sequence and compresses it into some common intermediate representation (abstract, continuous).
The decoder takes that representation, along with the previously generated o/p, and generates the o/p sequence step by step.
There can be attentions of several kinds:
- seq1 -> seq1 and seq2 -> seq2: self-attention within a sequence, which learns the syntax and semantics of that sequence.
- seq1 -> seq2: cross-attention (content transfer / relevance between the sequences); this is an additional layer in the decoder.
Attention helps the model focus: the encoder considers the other input words as it encodes a specific word, and learns the semantic dependence on the relevant input words.
Word embeddings (word2vec-style encoding): each input word becomes a 512-d vector.
The bottom encoder gets the word embeddings; the other encoders get the o/p of the previous encoder.
Each encoder receives a list of embedding vectors of the input sentence.
The word at each position flows through its own path in the encoder; the self-attention layer introduces dependencies between these paths, but the feed-forward layer has NO dependencies between positions, so the paths can be executed in parallel there.
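A minimal numpy sketch of this (illustrative random weights; 512-d model and 2048-d hidden layer as in the base Transformer), showing that the feed-forward sublayer gives the same result whether positions are processed together or one at a time:

```python
import numpy as np

rng = np.random.default_rng(0)

n_words, d_model, d_ff = 3, 512, 2048        # toy 3-word sequence; dims as in the base Transformer
X = rng.normal(size=(n_words, d_model))      # stand-in for the list of word-embedding vectors

# The position-wise feed-forward network applies the SAME weights to every position,
# and no position looks at any other, so the rows can be processed in parallel.
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.01, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.01, np.zeros(d_model)

def ffn(x):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # two linear layers with a ReLU in between

out_all_at_once = ffn(X)                                        # whole sequence in one matrix op
out_one_by_one = np.stack([ffn(X[i]) for i in range(n_words)])  # each position independently
assert np.allclose(out_all_at_once, out_one_by_one)             # identical: no cross-position dependency
```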
Self-Attention (SA):
"The animal didn't cross the street because it was too tired" -- while processing "it", self-attention allows the model to discover that "it" refers to "animal" and to associate the two as it encodes "it".
Such references are used as clues: as SA encodes a word, it looks at the other, related words in the sequence.
In a trained encoder, the attention learned for "it" gives more weight to "the animal", and that association is baked into the encoding of "it".
How self-attention is calculated:
- Basically, we need three learned weight matrices WQ, WK, WV, each of size 512 x 64. The roles of Query / Key / Value need to be understood.
- For an input X of 2 words (2 x 512):
  Q = X WQ (2 x 64), K = X WK (2 x 64), V = X WV (2 x 64)
  scores = Q K^T (2 x 2), scaled by sqrt(64) = 8
  Z = softmax(Q K^T / sqrt(64)) V  -> 2 x 64
- The resulting Z's can then be sent to the feed-forward (MLP) network.
- These operations are done in matrix form for faster processing.
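A minimal numpy sketch of the calculation above, with random weights standing in for the learned WQ, WK, WV:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, n_words = 512, 64, 2

X  = rng.normal(size=(n_words, d_model))          # 2 x 512 input (two word vectors)
WQ = rng.normal(size=(d_model, d_k)) * 0.01       # 512 x 64 learned projections (random here)
WK = rng.normal(size=(d_model, d_k)) * 0.01
WV = rng.normal(size=(d_model, d_k)) * 0.01

Q, K, V = X @ WQ, X @ WK, X @ WV                  # each 2 x 64

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

scores = Q @ K.T / np.sqrt(d_k)                   # 2 x 2 attention scores, scaled by sqrt(64) = 8
Z = softmax(scores) @ V                           # 2 x 64 output: a weighted sum of the values
print(Z.shape)                                    # (2, 64)
```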
Multi-head attention (MHA):
- Attention allows the model to focus on different positions; Z is a weighted sum of all the value vectors (encodings).
- MHA enables the attention layer to learn multiple representation subspaces H1, H2, ...: each head has its own WQ, WK, WV and projects its input into a different representation subspace.
- Instead of one Z of size 2 x 64, we get 8 such 2 x 64 representations (one per head).
- These are concatenated into a 2 x 512 matrix and multiplied by an output weight matrix WO (512 x 512) to get the final 2 x 512 matrix that is sent to the feed-forward layer.
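A minimal numpy sketch of multi-head attention under the same toy shapes (8 heads of 64 dims, random weights standing in for the learned projections):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, n_words = 512, 8, 2
d_k = d_model // n_heads                               # 64 dims per head

X = rng.normal(size=(n_words, d_model))                # 2 x 512

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def one_head(X, WQ, WK, WV):
    Q, K, V = X @ WQ, X @ WK, X @ WV                   # each 2 x 64
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V         # 2 x 64

# One (WQ, WK, WV) triple per head -> 8 different representation subspaces.
heads = [
    one_head(X,
             rng.normal(size=(d_model, d_k)) * 0.01,
             rng.normal(size=(d_model, d_k)) * 0.01,
             rng.normal(size=(d_model, d_k)) * 0.01)
    for _ in range(n_heads)
]

WO = rng.normal(size=(n_heads * d_k, d_model)) * 0.01  # 512 x 512 output projection
Z = np.concatenate(heads, axis=-1) @ WO                # concat to 2 x 512, then project
print(Z.shape)                                         # (2, 512), ready for the feed-forward layer
```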
Recap till now:
- With multi-head attention, the encoding of "it" accumulates contributions from all the heads.
- One head is, say, focusing on "the animal" while another focuses on "tired".
- Adding more heads may be helpful, but beyond a point they start to learn redundant attention patterns or tend to overfit the data.
Positional Embedding (PE):
- Transformers are permutation equivariant; hence, along with the word embeddings, we need to pass positional embeddings (i/p = W.E. + P.E.).
- PE helps the network understand the distance between the words in the i/p sequence (illustrated with a toy 4-dim embedding).
- PE also helps to scale to unseen lengths of testing sequences too.
[Figure: positional encoding values for the toy 4-dim embedding, plotted against position (rows 0-3) and embedding dimension.]
Both the sin and cos functions are interleaved across the embedding dimensions.
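A minimal numpy sketch of such sinusoidal positional encodings with sin and cos interleaved (the 50-position length is arbitrary):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encodings with sin/cos interleaved."""
    pe = np.zeros((n_positions, d_model))
    pos = np.arange(n_positions)[:, None]            # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]            # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)                      # even dims: sin
    pe[:, 1::2] = np.cos(angle)                      # odd dims:  cos
    return pe

pe = positional_encoding(n_positions=50, d_model=512)
print(pe.shape)   # (50, 512) -- added element-wise to the word embeddings
# Because the encoding is a deterministic function of position, it can be
# generated for sequence lengths longer than any seen during training.
```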
In the encoder, each sublayer has a residual connection around it, followed by layer normalization (visualized as the LayerNorm / "Add & Norm" blocks in the diagrams).
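A minimal sketch of this "Add & Norm" wrapper; the learned gain/bias of layer normalization is omitted, and a random linear map stands in for the attention or feed-forward sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean / unit variance (gain/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection around a sublayer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 512))                        # two 512-d positions
W_toy = rng.normal(size=(512, 512)) * 0.01           # stand-in weights for attention or FFN
out = add_and_norm(x, lambda h: h @ W_toy)
print(out.shape)                                     # (2, 512)
```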
Connecting Encoder and Decoder:
- Encoder and decoder learn some Common Representation (CR) of seq1 and seq2, and they are connected through cross-attention.
- The o/p of the last (top) encoder is transformed into a set of attention matrices K and V, which is used in each decoder's "encoder-decoder attention" layer (see the sketch after the encoder-decoder attention notes below).
The decoder attends only to previously produced positions. This is done by masking the future positions: setting them to -inf (not 0) before the softmax step in the self-attention calculations.
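A minimal numpy sketch of this masking on a toy 4 x 4 score matrix:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n = 4                                                    # toy decoder length
scores = np.random.default_rng(0).normal(size=(n, n))    # raw Q K^T / sqrt(d_k) scores

# Mask future positions: position i may only look at positions <= i.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)         # True strictly above the diagonal
scores[mask] = -np.inf                                   # -inf (not 0), so softmax gives exactly 0 weight

weights = softmax(scores)
print(np.round(weights, 2))   # upper triangle is all zeros: no attention to the future
```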
The "Encoder-Decoder Attention" layer works similar to multi-head self-attention, except that it creates only the Queries matrix (from the layer beneath it) and takes the Keys and Values matrices from the o/p of the encoder stack.
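A single-head numpy sketch of this encoder-decoder (cross) attention, with hypothetical toy lengths (5 encoder positions, 3 decoder positions); the real layer is multi-headed:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
n_enc, n_dec = 5, 3                                   # toy source / target lengths

enc_out = rng.normal(size=(n_enc, d_model))           # o/p of the top encoder
dec_in  = rng.normal(size=(n_dec, d_model))           # o/p of the decoder layer beneath

WQ = rng.normal(size=(d_model, d_k)) * 0.01
WK = rng.normal(size=(d_model, d_k)) * 0.01
WV = rng.normal(size=(d_model, d_k)) * 0.01

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

Q = dec_in  @ WQ        # queries come from the decoder side
K = enc_out @ WK        # keys and values come from the encoder output
V = enc_out @ WV

Z = softmax(Q @ K.T / np.sqrt(d_k)) @ V
print(Z.shape)          # (3, 64): one row per decoder position, attending over encoder positions
```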
Final Linear + Softmax layer:
- The decoder o/p is a vector of floats; a further step is required for getting the o/p word.
- The Linear layer is a fully connected layer that projects the decoder o/p into a much larger logits vector. Assuming the model has learned 1,000 words as its o/p vocabulary, the logits vector's size is also 1k (one cell per word).
- Finally, softmax converts the raw logit scores into probabilities; the highest-probability cell gives the o/p word.
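A minimal numpy sketch of this projection and softmax, using the 1k vocabulary assumed above and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 1000                           # vocabulary size as assumed in the notes

dec_vec = rng.normal(size=(d_model,))                     # decoder o/p: one vector of floats
W_vocab = rng.normal(size=(d_model, vocab_size)) * 0.01   # Linear (fully connected) projection

logits = dec_vec @ W_vocab                                # 1,000 raw scores, one per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                      # softmax -> probabilities summing to 1

next_word_id = int(np.argmax(probs))                      # highest-probability cell -> o/p word index
print(next_word_id, probs[next_word_id])
```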
Loss function:
- The untrained model initially produces random/incorrect o/p.
- The o/p is a distribution over the vocabulary (assuming a toy o/p vocabulary: a, am, I, thanks, student, <eos>).
- The predicted and target distributions can be compared using cross-entropy or KL divergence.
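A minimal numpy sketch comparing a one-hot target distribution with an (illustrative, made-up) model output over that toy vocabulary:

```python
import numpy as np

vocab = ["a", "am", "I", "thanks", "student", "<eos>"]    # toy o/p vocabulary from the notes

target    = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])      # one-hot: assume the correct word is "thanks"
predicted = np.array([0.2, 0.2, 0.1, 0.3, 0.1, 0.1])      # an untrained model's near-uniform guess

eps = 1e-12                                                # avoid log(0)
cross_entropy = -np.sum(target * np.log(predicted + eps))
kl_divergence = np.sum(target * np.log((target + eps) / (predicted + eps)))

print(cross_entropy, kl_divergence)   # equal here: the target is one-hot, so its entropy is 0
```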
Assuming the input is "je suis étudiant", the expected o/p is "I am a student". After some initial training, the model may start to converge to the target distributions. The model then selects the best match, i.e. the highest-probability word, at each position (greedy decoding).
Beam Search:
- Instead of choosing only the top-1 word, one can consider the top-k words (k = beam size, say 2).
- E.g., with beam size 2, run the model twice for the next position: once assuming "I" was the next word and once assuming "am" was, and keep whichever version produces less error.
- At any time, the partial hypotheses (unfinished translations) are kept in memory.
[Diagram: beam search over the o/p vocabulary with beam size = 2 and a fixed max output length.]
The search will consider all of the kept hypotheses and choose the one giving minimum error.
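A minimal sketch of beam search under these assumptions; the `fake_decoder` stand-in and the 6-word vocabulary size are hypothetical, just to make it runnable:

```python
import numpy as np

def beam_search(next_log_probs, beam_size=2, max_length=4):
    """Toy beam search. `next_log_probs(prefix)` returns log-probabilities over the
    o/p vocabulary for the next word, given a prefix (a tuple of word ids)."""
    beams = [((), 0.0)]                                   # (prefix, cumulative log-prob)
    for _ in range(max_length):
        candidates = []
        for prefix, score in beams:
            log_p = next_log_probs(prefix)
            for word_id, lp in enumerate(log_p):
                candidates.append((prefix + (word_id,), score + lp))
        # Keep only the `beam_size` best partial hypotheses (unfinished translations) in memory.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    # Among the kept hypotheses, return the one with minimum error (highest log-probability).
    return max(beams, key=lambda c: c[1])

# Hypothetical stand-in for the trained decoder: random log-probs over a 6-word
# vocabulary, just so the sketch runs end to end.
rng = np.random.default_rng(0)
def fake_decoder(prefix):
    logits = rng.normal(size=6)
    return logits - (logits.max() + np.log(np.exp(logits - logits.max()).sum()))

best_prefix, best_log_prob = beam_search(fake_decoder, beam_size=2, max_length=4)
print(best_prefix, best_log_prob)
```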