This material is taken from Jay Alammar's
"The Illustrated Transformer"
Example input sequence: "As aliens entered our planet and began to colonize earth, a certain group of ..."
The encoder learns information for the entire input sequence and compresses it into some common intermediate representation (abstract, continuous).
The decoder takes that representation, along with the previously generated o/p, and generates the o/p sequence step by step.
There can be attentions of several kinds:
- seq1 -> seq1 and seq2 -> seq2: self-attention within a sequence, which learns the syntax and semantics of that sequence.
- seq1 -> seq2: cross-attention (content transfer / relevance between the sequences); this is an additional layer in the decoder.
Attention helps the model focus: the encoder considers the other input words as it encodes a specific word, and learns the semantic dependence on the relevant input words.
Word embeddings (word2vec-style encoding): each input word becomes a 512-d vector.
The bottom encoder gets the word embeddings; the other encoders get the o/p of the previous encoder.
Each encoder receives a list of embedding vectors of the input sentence.
The word at each position flows through its own path in the encoder; the self-attention layer introduces dependencies between these paths, but the feed-forward layer has NO dependencies between positions, so the paths can be executed in parallel there.
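A minimal numpy sketch of this (illustrative random weights; 512-d model and 2048-d hidden layer as in the base Transformer), showing that the feed-forward sublayer gives the same result whether positions are processed together or one at a time:

```python
import numpy as np

rng = np.random.default_rng(0)

n_words, d_model, d_ff = 3, 512, 2048        # toy 3-word sequence; dims as in the base Transformer
X = rng.normal(size=(n_words, d_model))      # stand-in for the list of word-embedding vectors

# The position-wise feed-forward network applies the SAME weights to every position,
# and no position looks at any other, so the rows can be processed in parallel.
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.01, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.01, np.zeros(d_model)

def ffn(x):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # two linear layers with a ReLU in between

out_all_at_once = ffn(X)                                        # whole sequence in one matrix op
out_one_by_one = np.stack([ffn(X[i]) for i in range(n_words)])  # each position independently
assert np.allclose(out_all_at_once, out_one_by_one)             # identical: no cross-position dependency
```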
Self-Attention (SA):
"The animal didn't cross the street because it was too tired" -- while processing "it", self-attention allows the model to discover that "it" refers to "animal" and to associate the two as it encodes "it".
Such references are used as clues: as SA encodes a word, it looks at the other, related words in the sequence.
In a trained encoder, the attention learned for "it" gives more weight to "the animal", and that association is baked into the encoding of "it".
How self-attention is calculated:
- Basically, we need three learned weight matrices WQ, WK, WV, each of size 512 x 64. The roles of Query / Key / Value need to be understood.
- For an input X of 2 words (2 x 512):
  Q = X WQ (2 x 64), K = X WK (2 x 64), V = X WV (2 x 64)
  scores = Q K^T (2 x 2), scaled by sqrt(64) = 8
  Z = softmax(Q K^T / sqrt(64)) V  -> 2 x 64
- The resulting Z's can then be sent to the feed-forward (MLP) network.
- These operations are done in matrix form for faster processing.
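A minimal numpy sketch of the calculation above, with random weights standing in for the learned WQ, WK, WV:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, n_words = 512, 64, 2

X  = rng.normal(size=(n_words, d_model))          # 2 x 512 input (two word vectors)
WQ = rng.normal(size=(d_model, d_k)) * 0.01       # 512 x 64 learned projections (random here)
WK = rng.normal(size=(d_model, d_k)) * 0.01
WV = rng.normal(size=(d_model, d_k)) * 0.01

Q, K, V = X @ WQ, X @ WK, X @ WV                  # each 2 x 64

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

scores = Q @ K.T / np.sqrt(d_k)                   # 2 x 2 attention scores, scaled by sqrt(64) = 8
Z = softmax(scores) @ V                           # 2 x 64 output: a weighted sum of the values
print(Z.shape)                                    # (2, 64)
```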
Multi-head attention (MHA):
- Attention allows the model to focus on different positions; Z is a weighted sum of all the value vectors (encodings).
- MHA enables the attention layer to learn multiple representation subspaces H1, H2, ...: each head has its own WQ, WK, WV and projects its input into a different representation subspace.
- Instead of one Z of size 2 x 64, we get 8 such 2 x 64 representations (one per head).
- These are concatenated into a 2 x 512 matrix and multiplied by an output weight matrix WO (512 x 512) to get the final 2 x 512 matrix that is sent to the feed-forward layer.
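A minimal numpy sketch of multi-head attention under the same toy shapes (8 heads of 64 dims, random weights standing in for the learned projections):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, n_words = 512, 8, 2
d_k = d_model // n_heads                               # 64 dims per head

X = rng.normal(size=(n_words, d_model))                # 2 x 512

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def one_head(X, WQ, WK, WV):
    Q, K, V = X @ WQ, X @ WK, X @ WV                   # each 2 x 64
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V         # 2 x 64

# One (WQ, WK, WV) triple per head -> 8 different representation subspaces.
heads = [
    one_head(X,
             rng.normal(size=(d_model, d_k)) * 0.01,
             rng.normal(size=(d_model, d_k)) * 0.01,
             rng.normal(size=(d_model, d_k)) * 0.01)
    for _ in range(n_heads)
]

WO = rng.normal(size=(n_heads * d_k, d_model)) * 0.01  # 512 x 512 output projection
Z = np.concatenate(heads, axis=-1) @ WO                # concat to 2 x 512, then project
print(Z.shape)                                         # (2, 512), ready for the feed-forward layer
```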
Recap till now:
- With multi-head attention, the encoding of "it" accumulates contributions from all the heads.
- One head is, say, focusing on "the animal" while another focuses on "tired".
- Adding more heads may be helpful, but beyond a point they start to learn redundant attention patterns or tend to overfit the data.
Positional Embedding (PE):
- Transformers are permutation equivariant; hence, along with the word embeddings, we need to pass positional embeddings (i/p = W.E. + P.E.).
- PE helps the network understand the distance between the words in the i/p sequence (illustrated with a toy 4-dim embedding).
- PE also helps to scale to unseen lengths of testing sequences too.
[Figure: positional encoding values for the toy 4-dim embedding, plotted against position (rows 0-3) and embedding dimension.]
Both the sin and cos functions are interleaved across the embedding dimensions.
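A minimal numpy sketch of such sinusoidal positional encodings with sin and cos interleaved (the 50-position length is arbitrary):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encodings with sin/cos interleaved."""
    pe = np.zeros((n_positions, d_model))
    pos = np.arange(n_positions)[:, None]            # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]            # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)                      # even dims: sin
    pe[:, 1::2] = np.cos(angle)                      # odd dims:  cos
    return pe

pe = positional_encoding(n_positions=50, d_model=512)
print(pe.shape)   # (50, 512) -- added element-wise to the word embeddings
# Because the encoding is a deterministic function of position, it can be
# generated for sequence lengths longer than any seen during training.
```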
In the encoder, each sublayer has a residual connection around it, followed by layer normalization (visualized as the LayerNorm / "Add & Norm" blocks in the diagrams).
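A minimal sketch of this "Add & Norm" wrapper; the learned gain/bias of layer normalization is omitted, and a random linear map stands in for the attention or feed-forward sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean / unit variance (gain/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection around a sublayer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 512))                        # two 512-d positions
W_toy = rng.normal(size=(512, 512)) * 0.01           # stand-in weights for attention or FFN
out = add_and_norm(x, lambda h: h @ W_toy)
print(out.shape)                                     # (2, 512)
```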
Connecting Encoder and Decoder:
- Encoder and decoder learn some Common Representation (CR) of seq1 and seq2, and they are connected through cross-attention.
- The o/p of the last (top) encoder is transformed into a set of attention matrices K and V, which is used in each decoder's "encoder-decoder attention" layer (see the sketch after the encoder-decoder attention notes below).
The decoder attends only to previously produced positions. This is done by masking the future positions: setting them to -inf (not 0) before the softmax step in the self-attention calculations.
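A minimal numpy sketch of this masking on a toy 4 x 4 score matrix:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n = 4                                                    # toy decoder length
scores = np.random.default_rng(0).normal(size=(n, n))    # raw Q K^T / sqrt(d_k) scores

# Mask future positions: position i may only look at positions <= i.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)         # True strictly above the diagonal
scores[mask] = -np.inf                                   # -inf (not 0), so softmax gives exactly 0 weight

weights = softmax(scores)
print(np.round(weights, 2))   # upper triangle is all zeros: no attention to the future
```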
The "Encoder-Decoder Attention" layer works similar to multi-head self-attention, except that it creates only the Queries matrix (from the layer beneath it) and takes the Keys and Values matrices from the o/p of the encoder stack.
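A single-head numpy sketch of this encoder-decoder (cross) attention, with hypothetical toy lengths (5 encoder positions, 3 decoder positions); the real layer is multi-headed:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
n_enc, n_dec = 5, 3                                   # toy source / target lengths

enc_out = rng.normal(size=(n_enc, d_model))           # o/p of the top encoder
dec_in  = rng.normal(size=(n_dec, d_model))           # o/p of the decoder layer beneath

WQ = rng.normal(size=(d_model, d_k)) * 0.01
WK = rng.normal(size=(d_model, d_k)) * 0.01
WV = rng.normal(size=(d_model, d_k)) * 0.01

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

Q = dec_in  @ WQ        # queries come from the decoder side
K = enc_out @ WK        # keys and values come from the encoder output
V = enc_out @ WV

Z = softmax(Q @ K.T / np.sqrt(d_k)) @ V
print(Z.shape)          # (3, 64): one row per decoder position, attending over encoder positions
```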
Final Linear + Softmax layer:
- The decoder o/p is a vector of floats; a further step is required for getting the o/p word.
- The Linear layer is a fully connected layer that projects the decoder o/p into a much larger logits vector. Assuming the model has learned 1,000 words as its o/p vocabulary, the logits vector's size is also 1k (one cell per word).
- Finally, softmax converts the raw logit scores into probabilities; the highest-probability cell gives the o/p word.
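A minimal numpy sketch of this projection and softmax, using the 1k vocabulary assumed above and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 1000                           # vocabulary size as assumed in the notes

dec_vec = rng.normal(size=(d_model,))                     # decoder o/p: one vector of floats
W_vocab = rng.normal(size=(d_model, vocab_size)) * 0.01   # Linear (fully connected) projection

logits = dec_vec @ W_vocab                                # 1,000 raw scores, one per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                      # softmax -> probabilities summing to 1

next_word_id = int(np.argmax(probs))                      # highest-probability cell -> o/p word index
print(next_word_id, probs[next_word_id])
```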
Loss function:
- The untrained model initially produces random/incorrect o/p.
- The o/p is a distribution over the vocabulary (assuming a toy o/p vocabulary: a, am, I, thanks, student, <eos>).
- The predicted and target distributions can be compared using cross-entropy or KL divergence.
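A minimal numpy sketch comparing a one-hot target distribution with an (illustrative, made-up) model output over that toy vocabulary:

```python
import numpy as np

vocab = ["a", "am", "I", "thanks", "student", "<eos>"]    # toy o/p vocabulary from the notes

target    = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])      # one-hot: assume the correct word is "thanks"
predicted = np.array([0.2, 0.2, 0.1, 0.3, 0.1, 0.1])      # an untrained model's near-uniform guess

eps = 1e-12                                                # avoid log(0)
cross_entropy = -np.sum(target * np.log(predicted + eps))
kl_divergence = np.sum(target * np.log((target + eps) / (predicted + eps)))

print(cross_entropy, kl_divergence)   # equal here: the target is one-hot, so its entropy is 0
```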
Assuming the input is "je suis étudiant", the expected o/p is "I am a student". After some initial training, the model may start to converge to the target distributions. The model then selects the best match, i.e. the highest-probability word, at each position (greedy decoding).
Beam Search:
- Instead of choosing only the top-1 word, one can consider the top-k words (k = beam size, say 2).
- E.g., with beam size 2, run the model twice for the next position: once assuming "I" was the next word and once assuming "am" was, and keep whichever version produces less error.
- At any time, the partial hypotheses (unfinished translations) are kept in memory.
[Diagram: beam search over the o/p vocabulary with beam size = 2 and a fixed max output length.]
The search will consider all of the kept hypotheses and choose the one giving minimum error.
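A minimal sketch of beam search under these assumptions; the `fake_decoder` stand-in and the 6-word vocabulary size are hypothetical, just to make it runnable:

```python
import numpy as np

def beam_search(next_log_probs, beam_size=2, max_length=4):
    """Toy beam search. `next_log_probs(prefix)` returns log-probabilities over the
    o/p vocabulary for the next word, given a prefix (a tuple of word ids)."""
    beams = [((), 0.0)]                                   # (prefix, cumulative log-prob)
    for _ in range(max_length):
        candidates = []
        for prefix, score in beams:
            log_p = next_log_probs(prefix)
            for word_id, lp in enumerate(log_p):
                candidates.append((prefix + (word_id,), score + lp))
        # Keep only the `beam_size` best partial hypotheses (unfinished translations) in memory.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    # Among the kept hypotheses, return the one with minimum error (highest log-probability).
    return max(beams, key=lambda c: c[1])

# Hypothetical stand-in for the trained decoder: random log-probs over a 6-word
# vocabulary, just so the sketch runs end to end.
rng = np.random.default_rng(0)
def fake_decoder(prefix):
    logits = rng.normal(size=6)
    return logits - (logits.max() + np.log(np.exp(logits - logits.max()).sum()))

best_prefix, best_log_prob = beam_search(fake_decoder, beam_size=2, max_length=4)
print(best_prefix, best_log_prob)
```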