<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Home on NAN blog</title>
<link>https://theflash010.github.io/</link>
<description>Recent content in Home on NAN blog</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-gb</language>
<lastBuildDate>Sat, 16 Aug 2025 00:17:30 +0800</lastBuildDate><atom:link href="https://theflash010.github.io/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>Perfetto_tutorial</title>
<link>https://theflash010.github.io/posts/perfetto_tutorial/%E5%A4%A7%E6%A8%A1%E5%9E%8B%E8%AE%AD%E7%BB%83trace%E5%B7%A5%E5%85%B7-perfetto/</link>
<pubDate>Sat, 16 Aug 2025 00:17:30 +0800</pubDate>
<guid>https://theflash010.github.io/posts/perfetto_tutorial/%E5%A4%A7%E6%A8%A1%E5%9E%8B%E8%AE%AD%E7%BB%83trace%E5%B7%A5%E5%85%B7-perfetto/</guid>
<description><h2 id="perfetto-远程渲染-本地访问">Perfetto: Remote Rendering, Local Access</h2>
<p>Download the trace_processor script below (it is a Python script) and make it executable:</p>
<p>curl -LO <a href="https://get.perfetto.dev/trace_processor">https://get.perfetto.dev/trace_processor</a><br>chmod +x ./trace_processor</p>
<p>Start the HTTP server and load the trace file (default port 9001) by running: ./trace_processor --httpd /path/to/trace.pftrace</p>
<p>Here /path/to/trace.pftrace is the trace file you want to view (passing a directory is still not supported).</p>
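<p>The first run of trace_processor may fail with an error, because it has to download trace_processor_shell from the external internet, as shown below.
<img src="https://raw.githubusercontent.com/theflash010/self-gallery/main/image-20250725180331790.png" alt="">
Since this needs external network access, set up an SSH reverse tunnel so the download can go through, using the command below (or see the <strong>no-external-network approach</strong> later in this post):
ssh -C -N -g -R 16666:127.0.0.1:10808 root@remote_host -p 2022</p>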
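<p>10808 is the port on the local machine that can reach the external internet, and 16666 is the port on the remote host: traffic that the remote host sends to its port 16666 is forwarded out through local port 10808.</p>
<p>Because the download uses HTTPS, the proxy environment variable on the remote host has to be set:</p>
<p>export https_proxy=127.0.0.1:16666</p>
<p>In the ssh command above, 2022 is the port of the SSH connection.</p>
<p>Once the first run has downloaded trace_processor_shell, later runs do not need the reverse tunnel again.</p>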
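<p>To decide whether the reverse tunnel is still needed, one option is to look for the cached binary. This is a minimal sketch, assuming the cache location named in the script's own docstring (~/.local/share/perfetto/prebuilts/) and the .sha256 sidecar file that download_or_get_cached writes; the helper names cached_prebuilt_path and needs_first_download are hypothetical.</p>

```python
import os


def cached_prebuilt_path(tool_name="trace_processor_shell", home=None):
    """Return the path where the wrapper script caches a prebuilt binary."""
    home = home or os.path.expanduser("~")
    return os.path.join(home, ".local", "share", "perfetto", "prebuilts", tool_name)


def needs_first_download(tool_name="trace_processor_shell", home=None):
    """True when the binary or its .sha256 sidecar is missing from the cache."""
    path = cached_prebuilt_path(tool_name, home)
    return not (os.path.exists(path) and os.path.exists(path + ".sha256"))
```

<p>Even when both files exist, the script may download again if the cached digest no longer matches the one it expects, for example after the trace_processor script itself has been updated.</p>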
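<p>On later runs, ./trace_processor --httpd /path/to/trace.pftrace makes the remote host serve the trace on port 9001.</p>
<p><a href="https://www.ui.perfetto.dev/">https://www.ui.perfetto.dev/</a> automatically probes port 9001 on the local machine (127.0.0.1:9001) to see whether there is a trace to render. Here, however, it is port 9001 on the remote host that is serving the trace, not port 9001 on the local machine, so one more SSH tunnel is needed to send local port 9001 traffic to port 9001 on the remote host:</p>
<p>ssh -N -L 9001:localhost:9001 root@remote_host -p 2022</p>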
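<p>Before opening the browser, it can help to confirm that something is listening on the local end of the tunnel. The following is a rough sketch using only the Python standard library; port_is_open is a hypothetical helper name, and the defaults mirror the 127.0.0.1:9001 address mentioned above.</p>

```python
import socket


def port_is_open(host="127.0.0.1", port=9001, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

<p>Calling port_is_open() with no arguments checks 127.0.0.1:9001. A True result only shows that the local side of the ssh -L forward is accepting connections; it does not prove that trace_processor is running on the remote host.</p>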
<p>Finally, you can read the trace in your local browser (be sure to use Google Chrome, otherwise it may not work)
<img src="https://raw.githubusercontent.com/theflash010/self-gallery/main/20250816002230867.png" alt="20250816000703328(1).png"></p>
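<h3 id="无需外网方案">No-External-Network Approach</h3>
<p>If the remote host cannot reach the external internet, or the reverse tunnel is blocked as well, you can download trace_processor_shell (<a href="https://commondatastorage.googleapis.com/perfetto-luci-artifacts/v51.2/linux-amd64/trace_processor_shell">https://commondatastorage.googleapis.com/perfetto-luci-artifacts/v51.2/linux-amd64/trace_processor_shell</a>) on your local machine, copy it to the remote host, and make it executable there.</p>
<p>This raises a new problem: when the trace_processor script checks the trace_processor_shell file, it compares a SHA-256 checksum (presumably to verify the file's integrity), and a manually downloaded file has no cached checksum next to it. This is easy to work around: edit the trace_processor script and remove the SHA-256 check.</p>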
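<p>A gentler alternative to deleting the check may be to create the checksum file yourself. In the original download_or_get_cached shown in this post, the download is skipped when both the binary and a file_name.sha256 file exist and the stored digest equals the one the script expects. The sketch below relies on an assumption: the manually downloaded binary must be exactly the version the script expects, otherwise the digests differ and the script still tries to download. The name write_sha256_sidecar is hypothetical.</p>

```python
import hashlib
import os


def write_sha256_sidecar(bin_path):
    """Hash bin_path and store the hex digest in bin_path + '.sha256'."""
    digest = hashlib.sha256()
    with open(bin_path, "rb") as f:
        # Read in 1 MiB chunks so a large binary is not loaded at once.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    hexdigest = digest.hexdigest()
    with open(bin_path + ".sha256", "w") as f:
        f.write(hexdigest)
    # The original script also marks the downloaded binary as executable.
    os.chmod(bin_path, 0o755)
    return hexdigest
```

<p>For example, write_sha256_sidecar(os.path.expanduser("~/.local/share/perfetto/prebuilts/trace_processor_shell")) would be run once on the remote host after copying the binary there.</p>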
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">download_or_get_cached</span><span class="p">(</span><span class="n">file_name</span><span class="p">,</span> <span class="n">url</span><span class="p">,</span> <span class="n">sha256</span><span class="p">):</span>
</span></span><span class="line"><span class="cl"> <span class="s2">&#34;&#34;&#34; Downloads a prebuilt or returns a cached version
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2"> The first time this is invoked, it downloads the |url| and caches it into
</span></span></span><span class="line"><span class="cl"><span class="s2"> ~/.local/share/perfetto/prebuilts/$tool_name. On subsequent invocations it
</span></span></span><span class="line"><span class="cl"><span class="s2"> just runs the cached version.
</span></span></span><span class="line"><span class="cl"><span class="s2"> &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl"> <span class="nb">dir</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span>
</span></span><span class="line"><span class="cl"> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">expanduser</span><span class="p">(</span><span class="s1">&#39;~&#39;</span><span class="p">),</span> <span class="s1">&#39;.local&#39;</span><span class="p">,</span> <span class="s1">&#39;share&#39;</span><span class="p">,</span> <span class="s1">&#39;perfetto&#39;</span><span class="p">,</span> <span class="s1">&#39;prebuilts&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="n">os</span><span class="o">.</span><span class="n">makedirs</span><span class="p">(</span><span class="nb">dir</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="n">bin_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">dir</span><span class="p">,</span> <span class="n">file_name</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="n">sha256_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">dir</span><span class="p">,</span> <span class="n">file_name</span> <span class="o">+</span> <span class="s1">&#39;.sha256&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="n">needs_download</span> <span class="o">=</span> <span class="kc">True</span>
</span></span><span class="line"><span class="cl"><span class="err"></span>
</span></span><span class="line"><span class="cl"> <span class="c1"># Avoid recomputing the SHA-256 on each invocation. The SHA-256 of the last </span>
</span></span><span class="line"><span class="cl"> <span class="c1"># download is cached into file_name.sha256, just check if that matches. </span>
</span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">bin_path</span><span class="p">)</span> <span class="ow">and</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">sha256_path</span><span class="p">):</span>
</span></span><span class="line"><span class="cl"> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">sha256_path</span><span class="p">,</span> <span class="s1">&#39;rb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"> <span class="n">digest</span> <span class="o">=</span> <span class="n">f</span><span class="o">.</span><span class="n">read</span><span class="p">()</span><span class="o">.</span><span class="n">decode</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="n">digest</span> <span class="o">==</span> <span class="n">sha256</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"> <span class="n">needs_download</span> <span class="o">=</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl"><span class="err"></span>
</span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="n">needs_download</span><span class="p">:</span> <span class="c1"># The file doesn&#39;t exist or the SHA256 doesn&#39;t match. </span>
</span></span><span class="line"><span class="cl"> <span class="c1"># Use a unique random file to guard against concurrent executions. </span>
</span></span><span class="line"><span class="cl"> <span class="c1"># See https://github.com/google/perfetto/issues/786 . </span>
</span></span><span class="line"><span class="cl"> <span class="n">tmp_path</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="si">%s</span><span class="s1">.</span><span class="si">%d</span><span class="s1">.tmp&#39;</span> <span class="o">%</span> <span class="p">(</span><span class="n">bin_path</span><span class="p">,</span> <span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100000</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"> <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Downloading &#39;</span> <span class="o">+</span> <span class="n">url</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="n">subprocess</span><span class="o">.</span><span class="n">check_call</span><span class="p">([</span><span class="s1">&#39;curl&#39;</span><span class="p">,</span> <span class="s1">&#39;-f&#39;</span><span class="p">,</span> <span class="s1">&#39;-L&#39;</span><span class="p">,</span> <span class="s1">&#39;-#&#39;</span><span class="p">,</span> <span class="s1">&#39;-o&#39;</span><span class="p">,</span> <span class="n">tmp_path</span><span class="p">,</span> <span class="n">url</span><span class="p">])</span>
</span></span><span class="line"><span class="cl"> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">tmp_path</span><span class="p">,</span> <span class="s1">&#39;rb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">fd</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"> <span class="n">actual_sha256</span> <span class="o">=</span> <span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">(</span><span class="n">fd</span><span class="o">.</span><span class="n">read</span><span class="p">())</span><span class="o">.</span><span class="n">hexdigest</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="n">actual_sha256</span> <span class="o">!=</span> <span class="n">sha256</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"> <span class="k">raise</span> <span class="ne">Exception</span><span class="p">(</span><span class="s1">&#39;Checksum mismatch for </span><span class="si">%s</span><span class="s1"> (actual: </span><span class="si">%s</span><span class="s1">, expected: </span><span class="si">%s</span><span class="s1">)&#39;</span> <span class="o">%</span>
</span></span><span class="line"><span class="cl"> <span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">actual_sha256</span><span class="p">,</span> <span class="n">sha256</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"> <span class="n">os</span><span class="o">.</span><span class="n">chmod</span><span class="p">(</span><span class="n">tmp_path</span><span class="p">,</span> <span class="mo">0o755</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="n">os</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">tmp_path</span><span class="p">,</span> <span class="n">bin_path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">tmp_path</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"> <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">sha256</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="n">os</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">tmp_path</span><span class="p">,</span> <span class="n">sha256_path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="n">bin_path</span> <span class="c1"># This is the original code. What ultimately matters is that it returns bin_path, the location of trace_processor_shell, so the code can be changed to skip the checksum, as shown below. </span>
</span></span><span class="line"><span class="cl"><span class="err"></span>
</span></span><span class="line"><span class="cl"><span class="c1"># After the modification:</span>
</span></span><span class="line"><span class="cl"><span class="err"></span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">download_or_get_cached</span><span class="p">(</span><span class="n">file_name</span><span class="p">,</span> <span class="n">url</span><span class="p">,</span> <span class="n">sha256</span><span class="p">):</span>
</span></span><span class="line"><span class="cl"> <span class="s2">&#34;&#34;&#34; Downloads a prebuilt or returns a cached version
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2"> The first time this is invoked, it downloads the |url| and caches it into
</span></span></span><span class="line"><span class="cl"><span class="s2"> ~/.local/share/perfetto/prebuilts/$tool_name. On subsequent invocations it
</span></span></span><span class="line"><span class="cl"><span class="s2"> just runs the cached version.
</span></span></span><span class="line"><span class="cl"><span class="s2"> &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="s2">&#34;/root/.local/share/perfetto/prebuilts/trace_processor_shell&#34;</span> <span class="c1"># Return the location of trace_processor_shell directly</span>
</span></span></code></pre></div><p>If you use VS Code, there is a small trick for mapping port 9001 that makes adding a port forward very convenient.</p></description>
</item>
<item>
<title>VLLM_Worker</title>
<link>https://theflash010.github.io/posts/vllm_worker/</link>
<pubDate>Tue, 15 Apr 2025 21:42:12 +0800</pubDate>
<guid>https://theflash010.github.io/posts/vllm_worker/</guid>
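<description><h1 id="vllm-worker">VLLM Worker</h1>
<p>The Worker is a very important component of the VLLM inference system. Each Worker corresponds to one physical GPU. A Worker handles the tasks sent by the Executor (one Executor contains multiple Workers), which include warming up the inference engine and running model inference, and it returns the results to the Executor.</p>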
<img src=".\image-20250415222935321.png" alt="image-20250415222935321" style="zoom: 67%;" />
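<h2 id="worker的定义与实例化">Worker Definition and Instantiation</h2>
<p>Workers are instantiated in the Executor's init. Depending on the parallelism strategy (TP, PP, DP), the init calls WorkerProc.make_worker_process in a loop, once per rank. In the version of the code shown in this post, world_size must equal the tensor-parallel size, so there is one Worker per TP rank.</p>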
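<p>The pattern of starting one worker per rank and then waiting until each one reports that it is ready can be modelled with a toy example. This is only an illustrative sketch: threads and queue.Queue stand in for the separate processes and the MessageQueue of the real code, and every name in it (ToyExecutor, worker_main, the doubling "task") is hypothetical and not part of VLLM.</p>

```python
import queue
import threading


def worker_main(rank, inbox, response_q):
    """Toy worker: report READY, then answer tasks until told to stop."""
    response_q.put("READY")
    while True:
        task = inbox.get()
        if task is None:  # shutdown sentinel
            break
        response_q.put((rank, task * 2))  # stand-in for running the model


class ToyExecutor:
    def __init__(self, world_size):
        # One inbox per worker emulates a broadcast queue; one response
        # queue per worker carries results back to the executor.
        self.inboxes = [queue.Queue() for _ in range(world_size)]
        self.responses = [queue.Queue() for _ in range(world_size)]
        self.threads = []
        for rank in range(world_size):
            t = threading.Thread(
                target=worker_main,
                args=(rank, self.inboxes[rank], self.responses[rank]))
            t.start()
            self.threads.append(t)
        # Block until every worker has reported READY.
        for q in self.responses:
            assert q.get(timeout=5) == "READY"

    def broadcast(self, task):
        """Send the same task to all workers and collect one reply from each."""
        for inbox in self.inboxes:
            inbox.put(task)
        return [q.get(timeout=5) for q in self.responses]

    def shutdown(self):
        for inbox in self.inboxes:
            inbox.put(None)
        for t in self.threads:
            t.join()
```

<p>A typical use would be ex = ToyExecutor(2), then ex.broadcast(21) to collect one reply per rank, then ex.shutdown(). The real executor differs in important ways, since it spawns processes and exchanges queue handles with them.</p>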
<p>The Executor's init code is as follows:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">MultiprocExecutor</span><span class="p">(</span><span class="n">Executor</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="k">def</span> <span class="nf">_init_executor</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="kc">None</span><span class="p">:</span> <span class="c1"># an executor holds many workers</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># Call self.shutdown at exit to clean up</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># and ensure workers will be terminated.</span>
</span></span><span class="line"><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">_finalizer</span> <span class="o">=</span> <span class="n">weakref</span><span class="o">.</span><span class="n">finalize</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">shutdown</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="c1"># The child processes will send SIGUSR1 when unrecoverable</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># errors happen.</span>
</span></span><span class="line"><span class="cl"> <span class="k">def</span> <span class="nf">sigusr1_handler</span><span class="p">(</span><span class="n">signum</span><span class="p">,</span> <span class="n">frame</span><span class="p">):</span>
</span></span><span class="line"><span class="cl"> <span class="n">logger</span><span class="o">.</span><span class="n">fatal</span><span class="p">(</span>
</span></span><span class="line"><span class="cl"> <span class="s2">&#34;MulitprocExecutor got fatal signal from worker processes, &#34;</span>
</span></span><span class="line"><span class="cl"> <span class="s2">&#34;shutting down. See stack trace above for root cause issue.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># Propagate error up to parent process.</span>
</span></span><span class="line"><span class="cl"> <span class="n">parent_process</span> <span class="o">=</span> <span class="n">psutil</span><span class="o">.</span><span class="n">Process</span><span class="p">()</span><span class="o">.</span><span class="n">parent</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"> <span class="n">parent_process</span><span class="o">.</span><span class="n">send_signal</span><span class="p">(</span><span class="n">signal</span><span class="o">.</span><span class="n">SIGUSR1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">shutdown</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="n">signal</span><span class="o">.</span><span class="n">signal</span><span class="p">(</span><span class="n">signal</span><span class="o">.</span><span class="n">SIGUSR1</span><span class="p">,</span> <span class="n">sigusr1_handler</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">world_size</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">parallel_config</span><span class="o">.</span><span class="n">world_size</span>
</span></span><span class="line"><span class="cl"> <span class="n">tensor_parallel_size</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">parallel_config</span><span class="o">.</span><span class="n">tensor_parallel_size</span>
</span></span><span class="line"><span class="cl"> <span class="k">assert</span> <span class="bp">self</span><span class="o">.</span><span class="n">world_size</span> <span class="o">==</span> <span class="n">tensor_parallel_size</span><span class="p">,</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl"> <span class="sa">f</span><span class="s2">&#34;world_size (</span><span class="si">{</span><span class="bp">self</span><span class="o">.</span><span class="n">world_size</span><span class="si">}</span><span class="s2">) must be equal to the &#34;</span>
</span></span><span class="line"><span class="cl"> <span class="sa">f</span><span class="s2">&#34;tensor_parallel_size (</span><span class="si">{</span><span class="n">tensor_parallel_size</span><span class="si">}</span><span class="s2">). &#34;</span>
</span></span><span class="line"><span class="cl"> <span class="sa">f</span><span class="s2">&#34;Pipeline parallelism is not yet implemented in v1&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="c1"># Set multiprocessing envs that are common to V0 and V1</span>
</span></span><span class="line"><span class="cl"> <span class="n">set_multiprocessing_worker_envs</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">parallel_config</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="c1"># Multiprocessing-based executor does not support multi-node setting.</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># Since it only works for single node, we can use the loopback address</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># 127.0.0.1 for communication.</span>
</span></span><span class="line"><span class="cl"> <span class="n">distributed_init_method</span> <span class="o">=</span> <span class="n">get_distributed_init_method</span><span class="p">(</span>
</span></span><span class="line"><span class="cl"> <span class="s2">&#34;127.0.0.1&#34;</span><span class="p">,</span> <span class="n">get_open_port</span><span class="p">())</span> <span class="c1"># pick a port that nobody is using</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="c1"># Initialize worker and set up message queues for SchedulerOutputs</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># and ModelRunnerOutputs</span>
</span></span><span class="line"><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">rpc_broadcast_mq</span> <span class="o">=</span> <span class="n">MessageQueue</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">world_size</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">world_size</span><span class="p">)</span> <span class="c1"># a message queue defined by vllm, using ZMQ's PUB-SUB pattern; creates a queue with world_size subscribers</span>
</span></span><span class="line"><span class="cl"> <span class="n">scheduler_output_handle</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">rpc_broadcast_mq</span><span class="o">.</span><span class="n">export_handle</span><span class="p">()</span> <span class="c1"># the handle carries the information a SUB needs in order to subscribe</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># Create workers</span>
</span></span><span class="line"><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">workers</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">WorkerProcHandle</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl"> <span class="k">for</span> <span class="n">rank</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">world_size</span><span class="p">):</span> <span class="c1"># create one worker process per TP rank</span>
</span></span><span class="line"><span class="cl"> <span class="n">worker</span> <span class="o">=</span> <span class="n">WorkerProc</span><span class="o">.</span><span class="n">make_worker_process</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">vllm_config</span><span class="p">,</span> <span class="n">rank</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">rank</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">distributed_init_method</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">scheduler_output_handle</span><span class="p">)</span> <span class="c1"># message queue handle</span>
</span></span><span class="line"><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">workers</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">worker</span><span class="p">)</span> <span class="c1"># add the worker process handle (including the handle of the message queue the Worker created) to workers</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="c1"># Ensure message queues are ready. Will deadlock if re-ordered</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># Must be kept consistent with the WorkerProc. rpc_broadcast_mq is the queue the executor uses to message all Workers; worker_response_mq is the queue a Worker uses to message the executor</span>
</span></span><span class="line"><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">rpc_broadcast_mq</span><span class="o">.</span><span class="n">wait_until_ready</span><span class="p">()</span> <span class="c1"># the executor, as PUB, waits until every Worker has subscribed as SUB</span>
</span></span><span class="line"><span class="cl"> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">workers</span><span class="p">:</span> <span class="c1"># each Worker creates its own message queue</span>
</span></span><span class="line"><span class="cl"> <span class="n">w</span><span class="o">.</span><span class="n">worker_response_mq</span><span class="o">.</span><span class="n">wait_until_ready</span><span class="p">()</span> <span class="c1"># the executor, as SUB, waits for the Worker (PUB) to send &#34;READY&#34;</span>
</span></span></code></pre></div><p>The WorkerProc.make_worker_process function creates a child process, and the child process runs WorkerProc.worker_main. WorkerProc.worker_main then instantiates a WorkerProc object.</p></description>
</item>
<item>
<title>VLLM_Engine</title>
<link>https://theflash010.github.io/posts/vllm_engine/</link>
<pubDate>Tue, 08 Apr 2025 10:04:58 +0800</pubDate>
<guid>https://theflash010.github.io/posts/vllm_engine/</guid>
<description><h1 id="vllm推理框架">The VLLM Inference Framework</h1>
<h2 id="vllm-engine四个组件">The Four Components of the VLLM Engine</h2>
<h3 id="1-tokenizer">1. Tokenizer</h3>
<h3 id="2-processor-convert-inputs--enginecorerequests">2. Processor (convert Inputs &ndash;&gt; EngineCoreRequests)</h3>
<p>The Processor is defined in the Engine as follows:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">processor</span> <span class="o">=</span> <span class="n">Processor</span><span class="p">(</span><span class="n">vllm_config</span><span class="o">=</span><span class="n">vllm_config</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">tokenizer</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">tokenizer</span><span class="p">,</span> <span class="c1"># the Processor needs the tokenizer</span>
</span></span><span class="line"><span class="cl"> <span class="n">input_registry</span><span class="o">=</span><span class="n">input_registry</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">mm_registry</span><span class="o">=</span><span class="n">mm_registry</span><span class="p">)</span>
</span></span></code></pre></div><p>The Processor's main operation is shown below: it converts the input prompt into an EngineCoreRequest.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"> <span class="k">def</span> <span class="nf">process_inputs</span><span class="p">(</span>
</span></span><span class="line"><span class="cl"> <span class="bp">self</span><span class="p">,</span> <span class="c1"># the Processor instance</span>
</span></span><span class="line"><span class="cl"> <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">prompt</span><span class="p">:</span> <span class="n">PromptType</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">params</span><span class="p">:</span> <span class="n">Union</span><span class="p">[</span><span class="n">SamplingParams</span><span class="p">,</span> <span class="n">PoolingParams</span><span class="p">],</span>
</span></span><span class="line"><span class="cl"> <span class="n">arrival_time</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">float</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">lora_request</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">LoRARequest</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">trace_headers</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Mapping</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">prompt_adapter_request</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">PromptAdapterRequest</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">priority</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">EngineCoreRequest</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="c1"># TODO(woosuk): Support pooling models.</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># TODO(woosuk): Support encoder-decoder models.</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">_validate_lora</span><span class="p">(</span><span class="n">lora_request</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">_validate_params</span><span class="p">(</span><span class="n">params</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="n">priority</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"> <span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">&#34;V1 does not support priority yet.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="n">trace_headers</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"> <span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">&#34;V1 does not support tracing yet.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="n">prompt_adapter_request</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"> <span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">&#34;V1 does not support prompt_adapter_request.&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="n">arrival_time</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"> <span class="n">arrival_time</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="c1"># Process inputs, which includes:</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># 1. Tokenize text prompt, with LoRA request if one exists.</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># 2. For multimodal models with a merged preprocessor, preprocess</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># multimodal data and expand prompt token ids accordingly.</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># 3. Apply prompt adapter to prompt token ids if one exists.</span>
</span></span><span class="line"><span class="cl"> <span class="n">processed_inputs</span><span class="p">:</span> <span class="n">ProcessorInputs</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">input_preprocessor</span><span class="o">.</span><span class="n">preprocess</span><span class="p">(</span>
</span></span><span class="line"><span class="cl"> <span class="n">prompt</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">lora_request</span><span class="o">=</span><span class="n">lora_request</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">prompt_adapter_request</span><span class="o">=</span><span class="n">prompt_adapter_request</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">return_mm_hashes</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">use_hash</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="n">eos_token_id</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">input_preprocessor</span><span class="o">.</span><span class="n">get_eos_token_id</span><span class="p">(</span><span class="n">lora_request</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">_validate_model_inputs</span><span class="p">(</span><span class="n">processed_inputs</span><span class="p">,</span> <span class="n">lora_request</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="n">is_encoder_decoder_inputs</span><span class="p">(</span><span class="n">processed_inputs</span><span class="p">):</span>
</span></span><span class="line"><span class="cl"> <span class="n">decoder_inputs</span> <span class="o">=</span> <span class="n">SingletonInputsAdapter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl"> <span class="n">processed_inputs</span><span class="p">[</span><span class="s2">&#34;decoder&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl"> <span class="n">encoder_inputs</span> <span class="o">=</span> <span class="n">SingletonInputsAdapter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl"> <span class="n">processed_inputs</span><span class="p">[</span><span class="s2">&#34;encoder&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl"> <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"> <span class="n">decoder_inputs</span> <span class="o">=</span> <span class="n">SingletonInputsAdapter</span><span class="p">(</span><span class="n">processed_inputs</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="n">encoder_inputs</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="c1"># TODO: Impl encoder-decoder</span>
</span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="n">encoder_inputs</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"> <span class="k">raise</span> <span class="ne">NotImplementedError</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="k">assert</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">params</span><span class="p">,</span> <span class="n">SamplingParams</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># TODO: can we avoid cloning here in multiproc case?</span>
</span></span><span class="line"><span class="cl"> <span class="n">sampling_params</span> <span class="o">=</span> <span class="n">params</span><span class="o">.</span><span class="n">clone</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># If unset max tokens, then generate up to the max_model_len.</span>
</span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="n">sampling_params</span><span class="o">.</span><span class="n">max_tokens</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"> <span class="n">sampling_params</span><span class="o">.</span><span class="n">max_tokens</span> <span class="o">=</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">model_config</span><span class="o">.</span><span class="n">max_model_len</span> <span class="o">-</span>
</span></span><span class="line"><span class="cl"> <span class="nb">len</span><span class="p">(</span><span class="n">decoder_inputs</span><span class="o">.</span><span class="n">prompt_token_ids</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"> <span class="n">sampling_params</span><span class="o">.</span><span class="n">update_from_generation_config</span><span class="p">(</span>
</span></span><span class="line"><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">generation_config_fields</span><span class="p">,</span> <span class="n">eos_token_id</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="n">sampling_params</span><span class="o">.</span><span class="n">update_from_tokenizer</span><span class="p">(</span>
</span></span><span class="line"><span class="cl"> <span class="bp">self</span><span class="o">.</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">get_lora_tokenizer</span><span class="p">(</span><span class="n">lora_request</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="c1"># Multimodal related.</span>
</span></span><span class="line"><span class="cl"> <span class="n">sorted_mm_inputs</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">list</span><span class="p">[</span><span class="n">MultiModalKwargs</span><span class="p">]]</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl"> <span class="n">sorted_mm_positions</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">list</span><span class="p">[</span><span class="n">PlaceholderRange</span><span class="p">]]</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl"> <span class="n">sorted_mm_hashes</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="p">(</span><span class="n">decoder_mm_inputs</span> <span class="o">:=</span> <span class="n">decoder_inputs</span><span class="o">.</span><span class="n">multi_modal_data</span><span class="p">):</span>
</span></span><span class="line"><span class="cl"> <span class="k">assert</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">decoder_mm_inputs</span><span class="p">,</span> <span class="n">MultiModalKwargs</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="c1"># The output of merged multi-modal processor (`decoder_mm_inputs`)</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># contains the kwargs for all items from all modalities.</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># This code separates them so that there is one set of kwargs</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># per item per modality.</span>
</span></span><span class="line"><span class="cl"> <span class="n">individual_mm_inputs</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl"> <span class="n">MultiModalKwargs</span><span class="o">.</span><span class="n">from_items</span><span class="p">([</span><span class="n">item</span><span class="p">])</span>
</span></span><span class="line"><span class="cl"> <span class="k">for</span> <span class="n">modality</span> <span class="ow">in</span> <span class="n">decoder_mm_inputs</span><span class="o">.</span><span class="n">modalities</span>
</span></span><span class="line"><span class="cl"> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">decoder_mm_inputs</span><span class="o">.</span><span class="n">get_items</span><span class="p">(</span><span class="n">modality</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="c1"># Merge and flatten multimodal placeholders, hashes and inputs</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># from dictionaries to lists, and sort them by each item&#39;s position</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># in the input sequence.</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># NOTE: interleaved modalities are not supported.</span>
</span></span><span class="line"><span class="cl"> <span class="p">(</span>
</span></span><span class="line"><span class="cl"> <span class="n">sorted_modalities</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">sorted_mm_positions</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">sorted_mm_hashes</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="p">)</span> <span class="o">=</span> <span class="n">merge_and_sort_multimodal_metadata</span><span class="p">(</span>
</span></span><span class="line"><span class="cl"> <span class="n">decoder_inputs</span><span class="o">.</span><span class="n">multi_modal_placeholders</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">decoder_inputs</span><span class="o">.</span><span class="n">multi_modal_hashes</span> <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">use_hash</span> <span class="k">else</span> <span class="kc">None</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="c1"># NOTE: Sort multimodal inputs/kwargs ONLY IF there are multiple</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># modalities involved.</span>
</span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sorted_modalities</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"> <span class="n">modality_order_dict</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl"> <span class="n">modality</span><span class="p">:</span> <span class="n">order</span>
</span></span><span class="line"><span class="cl"> <span class="k">for</span> <span class="n">order</span><span class="p">,</span> <span class="n">modality</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">sorted_modalities</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="c1"># Sanity check to make sure each multimodal input has only one</span>
</span></span><span class="line"><span class="cl"> <span class="c1"># modality key.</span>
</span></span><span class="line"><span class="cl"> <span class="k">for</span> <span class="n">mm_input</span> <span class="ow">in</span> <span class="n">individual_mm_inputs</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"> <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">mm_input</span><span class="o">.</span><span class="n">modalities</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="c1"># Sort MultiModalKwargs to match sorted_mm_positions</span>
</span></span><span class="line"><span class="cl"> <span class="n">sorted_mm_inputs</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span>
</span></span><span class="line"><span class="cl"> <span class="n">individual_mm_inputs</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">mm_input</span><span class="p">:</span> <span class="n">modality_order_dict</span><span class="p">[</span><span class="nb">list</span><span class="p">(</span>
</span></span><span class="line"><span class="cl"> <span class="n">mm_input</span><span class="o">.</span><span class="n">modalities</span><span class="p">)[</span><span class="mi">0</span><span class="p">]])</span>
</span></span><span class="line"><span class="cl"> <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"> <span class="n">sorted_mm_inputs</span> <span class="o">=</span> <span class="n">individual_mm_inputs</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="n">EngineCoreRequest</span><span class="p">(</span>
</span></span><span class="line"><span class="cl"> <span class="n">request_id</span><span class="o">=</span><span class="n">request_id</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">prompt</span><span class="o">=</span><span class="n">decoder_inputs</span><span class="o">.</span><span class="n">prompt</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">prompt_token_ids</span><span class="o">=</span><span class="n">decoder_inputs</span><span class="o">.</span><span class="n">prompt_token_ids</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">mm_inputs</span><span class="o">=</span><span class="n">sorted_mm_inputs</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">mm_hashes</span><span class="o">=</span><span class="n">sorted_mm_hashes</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">mm_placeholders</span><span class="o">=</span><span class="n">sorted_mm_positions</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">sampling_params</span><span class="o">=</span><span class="n">sampling_params</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">eos_token_id</span><span class="o">=</span><span class="n">eos_token_id</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">arrival_time</span><span class="o">=</span><span class="n">arrival_time</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">lora_request</span><span class="o">=</span><span class="n">lora_request</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="p">)</span>
</span></span></code></pre></div><h3 id="3-outputprocessorconvert-enginecoreoutputs--requestoutput">3. OutputProcessor(convert EngineCoreOutputs &ndash;&gt; RequestOutput)</h3>
<p>The OutputProcessor is defined in the engine as follows:</p></description>
</item>
<item>
<title>LLM Scheduling</title>
<link>https://theflash010.github.io/posts/llm-scheduling/</link>
<pubDate>Thu, 27 Mar 2025 14:32:49 +0800</pubDate>
<guid>https://theflash010.github.io/posts/llm-scheduling/</guid>
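<description><h1 id="有关大模型任务调度的讨论">A discussion of task scheduling for large language models</h1>
<p>Large models have some inherent characteristics that create new challenges for task scheduling.</p>
<h2 id="冷启动问题">The cold-start problem</h2>
<p>One defining trait of large models is their very large parameter count, commonly 175B, 670B, and so on.</p>
<p>When the system's request volume changes, the cluster needs a new scheduling plan, which includes changing the number of replicas of each model, that is, <strong>scaling out or in</strong>. Because large models have so many parameters, loading a new model onto GPUs is very expensive. For most large models a single GPU cannot even meet the memory required for inference, so the model must be loaded across several GPUs, which makes the cold-start problem worse. Loading cost varies with parameter count, but it generally exceeds 10 minutes.</p>
<p>Common approaches to the cold-start problem are as follows.</p>
<h3 id="预测流派">The prediction-based approach</h3>
<p>The root cause of rescheduling is change in the system's request volume. Prediction-based approaches use methods such as EWMA and ARIMA to forecast the next period's request volume at a <font color='red'>coarse granularity</font> and compute the new scheduling plan ahead of time.</p>
<p><font color='red'>Coarse granularity</font> means the time unit is hours: the system forecasts the request volume for the next period (a few hours) and then loads models gradually during that period, since 10 minutes is short relative to hours. Because the load is forecast ahead of time rather than arriving as a sudden burst, the cold-start cost is hidden: by the time the request volume rises, the models are already loaded.</p>
<p>Prediction-based approaches depend on reasonably accurate forecasts. An inaccurate forecast either wastes resources (too many replicas loaded) or causes SLO violations (too few replicas loaded).</p>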
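<p>As a rough illustration of the forecasting step, the sketch below applies an EWMA to hourly request rates and turns the forecast into a replica count. The function names, the smoothing factor alpha, and the per-replica capacity are assumptions made for this example; they do not come from any particular system.</p>

```python
import math


def ewma_forecast(hourly_requests, alpha=0.5):
    """Return the EWMA of the series, used here as the next-period forecast."""
    if not hourly_requests:
        raise ValueError("need at least one observation")
    forecast = float(hourly_requests[0])
    for observed in hourly_requests[1:]:
        # Blend the newest observation with the running estimate.
        forecast = alpha * observed + (1 - alpha) * forecast
    return forecast


def replicas_needed(forecast_rps, per_replica_rps):
    """Round up so the forecast load fits in the provisioned replicas."""
    return max(1, math.ceil(forecast_rps / per_replica_rps))
```

<p>A larger alpha reacts faster to recent hours but is noisier; over-forecasting wastes replicas, while under-forecasting risks SLO violations.</p>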
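<p>Related work: <em><strong>"Serving Models, Fast and Slow: Optimizing Heterogeneous LLM Inferencing Workloads at Scale"</strong></em></p>
<p>The paper also discusses in detail several strategies for loading models gradually during the next period:</p>
<p>Naive: once the new scheduling plan is computed, apply it immediately.</p>
<p>Instance Utilization (LT-U): tracks the utilization of model instances. When utilization rises above a threshold, new replicas are loaded gradually following the new plan, but the replica count may not exceed the plan's value. Conversely, when utilization falls below a threshold, replicas are unloaded gradually following the new plan, but the count may not drop below the plan's value.</p>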
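<p>The LT-U rule can be sketched as a single control step. This is a simplified reading of the description above, not the paper's implementation: the thresholds, the one-replica step size, and the name next_replica_count are all assumptions. The allow_beyond_target flag only lifts the plan's bound, which loosely resembles the relaxed variant; it does not model any ARIMA-gap signal.</p>

```python
def next_replica_count(current, target, utilization,
                       high=0.8, low=0.3, allow_beyond_target=False):
    """One control step: move by at most one replica per call."""
    if utilization > high:
        proposed = current + 1
        if not allow_beyond_target:
            # Bounded mode never grows past the planned target.
            proposed = min(proposed, max(current, target))
        return proposed
    if low > utilization:
        proposed = max(current - 1, 0)
        if not allow_beyond_target:
            # Bounded mode never shrinks below the planned target.
            proposed = max(proposed, min(current, target))
        return proposed
    # Utilization is inside the band: keep the current count.
    return current
```

<p>Calling this once per monitoring interval moves the cluster toward the plan only when observed utilization justifies it.</p>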
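<p>Instance utilization and ARIMA gap (LT-UA): tracks the utilization of model instances. When utilization rises above a threshold, new replicas are loaded gradually following the new plan, and the replica count may exceed the plan's value. Conversely, when utilization falls below a threshold, replicas are unloaded gradually following the new plan, and the count may drop below the plan's value. This strategy scales more intelligently when the forecast is inaccurate.</p></description>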
</item>
<item>
<title>Flash Attention MFU Calculation</title>
<link>https://theflash010.github.io/posts/flash_attention_mfu_calcuation/</link>
<pubDate>Mon, 24 Mar 2025 10:39:17 +0800</pubDate>
<guid>https://theflash010.github.io/posts/flash_attention_mfu_calcuation/</guid>
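<description><h1 id="计算flash-attention中self-attention部分的mfu">Computing the MFU of the self-attention part of Flash Attention</h1>
<p>Background</p>
<p>Model FLOPs Utilization (MFU) and Hardware FLOPs Utilization (HFU) are common metrics for how well a model implementation uses a chip's compute capability.</p>
<ul>
<li>MFU is the <strong>ratio</strong> of the matrix-compute FLOPs required by one forward-backward pass of the model to the FLOPs the machine could deliver.</li>
<li>HFU is the same <strong>ratio</strong>, but the numerator counts the FLOPs of one forward-backward pass including recomputation.</li>
</ul>
<p>As a formula: <strong>MFU = model FLOPs per iteration / (peak FLOPS per GPU * number of GPUs * time per iteration)</strong></p>
<p>The machine's compute is easy to obtain: record the elapsed time and look up the accelerator's peak throughput.</p>
<p>Counting the FLOPs is more involved and requires deriving the formula by hand.</p>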
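<p>The MFU formula above translates directly into code. In this minimal sketch the numbers are hypothetical: 312e12 is roughly the dense BF16 peak of an A100, and the FLOPs per iteration and the iteration time are made up for illustration.</p>

```python
def mfu(model_flops_per_iter, peak_flops_per_gpu, num_gpus, iter_time_s):
    """MFU = model FLOPs / FLOPs the hardware could have delivered."""
    available = peak_flops_per_gpu * num_gpus * iter_time_s
    return model_flops_per_iter / available


# Hypothetical run: 8 GPUs at 312 TFLOPS peak, 2 s per iteration,
# and a model that needs 2.0e15 FLOPs per iteration.
example = mfu(2.0e15, 312e12, 8, 2.0)
print(round(example, 4))
```

<p>HFU uses the same denominator; only the numerator changes to include recomputed FLOPs.</p>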
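<h2 id="flash-attention中self-attention的flops推导">Deriving the FLOPs of self-attention in Flash Attention</h2>
<p>The core idea of Flash Attention is to fetch a block of data from HBM once and do as much computation on it as possible (that is, raise the <strong>arithmetic intensity</strong>), which reduces memory-transfer overhead.</p>
<p>For each block, Flash Attention performs three operations: Q*K<sup>T</sup>, softmax, and score*V. Standard attention needs three passes, finishing one operation on all the data before starting the next, which causes more memory traffic.</p>
<p>The FLOPs are derived below.</p>
<p>In tiled Flash Attention the outer loop is over Q and the inner loop is over K and V, so K and V share the same tiling.</p>
<p>Different blocks can be processed in parallel (blocks for different Q tiles run in parallel; blocks sharing a Q tile run serially), but this does not affect total FLOPs: parallelism reduces time, not the amount of computation (unless the algorithm itself changes).</p>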
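<p>A quick check of the claim that tiling leaves the FLOP count unchanged, for the Q*K<sup>T</sup> step of a single head and a single batch element. Here s is the sequence length, d is the per-head dimension, and t_q, t_k are the tile counts; the function names are illustrative, and a matmul of shapes (m, k) x (k, n) is counted as 2*m*k*n FLOPs.</p>

```python
def qk_flops_untiled(s, d):
    # (s, d) x (d, s) matmul: one multiply and one add per term.
    return 2 * s * s * d


def qk_flops_tiled(s, d, t_q, t_k):
    # Assumes the sequence length divides evenly into tiles.
    assert s % t_q == 0 and s % t_k == 0
    rows, cols = s // t_q, s // t_k
    total = 0
    for _ in range(t_q):          # outer loop over Q tiles
        for _ in range(t_k):      # inner loop over K tiles
            total += 2 * rows * cols * d
    return total


print(qk_flops_tiled(1024, 64, 8, 4) == qk_flops_untiled(1024, 64))
```

<p>The per-tile cost shrinks by a factor of t_q * t_k while the number of tiles grows by the same factor, so the two cancel.</p>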
<p>Notation:</p>
<table>
<thead>
<tr>
<th style="text-align: left">Name</th>
<th>Variables</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Batch_size</td>
<td>b</td>
</tr>
<tr>
<td style="text-align: left">Sequence_len</td>
<td>s</td>
</tr>
<tr>
<td style="text-align: left">Hidden_size</td>
<td>h</td>
</tr>
<tr>
<td style="text-align: left">num_heads</td>
<td>n_h</td>
</tr>
<tr>
<td style="text-align: left">Number of Q tiles</td>
<td>t_q</td>
</tr>
<tr>
<td style="text-align: left">Number of K tiles</td>
<td>t_k</td>
</tr>
</tbody>
</table>
<ol>
<li>
<p>First, project the input: input*(W<sub>Q</sub>, W<sub>K</sub>, W<sub>V</sub>) -&gt; (Q, K, V)</p>
<p>FLOPs_1=3(b,s,h)*(h,h)= 3*2bsh<sup>2</sup>=6bsh<sup>2</sup></p>
</li>
<li>
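<p>After step 1, multi-head attention first splits Q, K, and V: (b,s,h)-&gt;(b,n_h,s,h/n_h). Flash Attention then tiles them by t_q and t_k: Q: (b,n_h,s,h/n_h)-&gt;(t_q,b,n_h,s/t_q,h/n_h); K, V: (b,n_h,s,h/n_h)-&gt;(t_k,b,n_h,s/t_k,h/n_h). The total cost after tiling is the sum over all blocks, so the rest of the derivation traces a single block and counts its FLOPs along the computation flow shown below.</p>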
<img src=".\image-20250324114222193.png" alt="image-20250324114222193" />
<p>2.1 Q*K<sup>T</sup>: the resulting matrix has shape (b,n_h,s/t_q,s/t_k)</p>
<p>FLOPs_2 =(b,n_h,s/t_q,h/n_h)*(b,n_h,s/t_k,h/n_h)<sup>T</sup></p></description>
</item>
<item>
<title>Megatron_LM</title>
<link>https://theflash010.github.io/posts/megatron_lm/</link>
<pubDate>Fri, 21 Mar 2025 15:35:05 +0800</pubDate>
<guid>https://theflash010.github.io/posts/megatron_lm/</guid>
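<description><h1 id="megatron_lm中使用的的优化方案">Optimizations used in Megatron_LM</h1>
<p>Megatron_LM combines DP, PP, and TP into a strategy called PTD-P.</p>
<p>DP is the outermost level; PP is used within each DP group, and TP within each PP stage.</p>
<p>As shown in the figure below:</p>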
<img src=".\image-20250321153637863.png" alt="image-20250321153637863" />
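<p>The nesting described above (DP outermost, then PP, then TP) can be illustrated by mapping a global rank to its coordinates. This is only a sketch of that nesting with TP varying fastest; the function name is hypothetical, and the rank ordering a real Megatron_LM job uses depends on its configuration and may differ.</p>

```python
def rank_coords(rank, tp_size, pp_size, dp_size):
    """Map a global rank to (dp, pp, tp), with TP varying fastest."""
    world = tp_size * pp_size * dp_size
    if rank not in range(world):
        raise ValueError("rank outside the world size")
    tp = rank % tp_size                  # position inside the TP group
    pp = (rank // tp_size) % pp_size     # pipeline stage inside the DP replica
    dp = rank // (tp_size * pp_size)     # which data-parallel replica
    return dp, pp, tp


# 2 DP replicas x 4 pipeline stages x 8-way TP = 64 ranks.
print(rank_coords(9, tp_size=8, pp_size=4, dp_size=2))
```

<p>With this layout the 8 ranks of one TP group are consecutive, so they can share a single 8-GPU server.</p>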
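<h2 id="pp调度方面优化">Optimizing the PP schedule</h2>
<p>It uses the interleaved pipeline schedule variant of 1F1B, which reduces bubble time by increasing the number of (virtual) pipeline stages; see the blog post <a href="https://theflash010.github.io/posts/pipeline_parallelism/">"Pipeline Parallelism"</a> for details.</p>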
<img src=".\image-20250321154650026.png" alt="image-20250321154650026" />
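<h2 id="pptp通信方面的优化">Optimizing PP+TP communication</h2>
<p>Background: each PP stage runs on a DGX server (8 A100 GPUs). GPUs inside a DGX communicate over NVLink, while the DGX talks to the outside through 8 InfiniBand (IB) adapters. IB transfers are point-to-point, so a single communication call uses at most one IB device; the other 7 can serve other point-to-point transfers at the same time, but they do not help this call. Megatron_LM's TP is shown below.</p>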
<img src=".\image-20250321154546371.png" alt="image-20250321154546371" style="zoom:80%;" />
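<p>However, in existing PP+TP setups, communication between stages is redundant, as shown in the figure below.</p>
<p>The existing flow is as follows: after the TP group finishes the Dropout in the MLP block shown above, every rank holds the full output activations, and each rank sends the full activations over IB to the corresponding rank in the next stage's DGX (a cross-node transfer). Every rank therefore sends the same data over IB, which is redundant.</p>
<p>Megatron_LM optimizes this: each rank sends only 1/8 of the data to the corresponding rank in the next stage, and the next stage reassembles the activations internally with an all-gather. Because the stage communicates internally over NVLink, this step is fast. In effect, the 8 IB adapters jointly transfer a single copy of the data, which both reduces communication volume and increases parallelism.</p>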
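<p>The saving can be made concrete with a small simulation. The sketch below uses plain Python lists in place of tensors and real collectives; the function names are hypothetical and the activation size is an arbitrary example.</p>

```python
def per_link_bytes(activation_bytes, tp_size, scatter_gather):
    """Bytes each rank sends over its own IB link between two stages."""
    if scatter_gather:
        return activation_bytes // tp_size
    return activation_bytes


def scatter_then_all_gather(activation, tp_size):
    """Each rank sends one chunk; the receivers all-gather the chunks."""
    assert len(activation) % tp_size == 0
    chunk = len(activation) // tp_size
    # Cross-node step: rank r sends only its own slice.
    sent = [activation[r * chunk:(r + 1) * chunk] for r in range(tp_size)]
    # Intra-node step: every receiving rank ends up with all slices, in order.
    gathered = [x for part in sent for x in part]
    return [list(gathered) for _ in range(tp_size)]


print(per_link_bytes(8_000_000, 8, scatter_gather=False))
print(per_link_bytes(8_000_000, 8, scatter_gather=True))
```

<p>Every receiving rank still ends up with the full activation, but each IB link carries one eighth of the bytes.</p>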
<img src=".\image-20250321155831934.png" alt="image-20250321155831934" /></description>
</item>
<item>
<title>Pipeline Parallelism</title>
<link>https://theflash010.github.io/posts/pipeline_parallelism/</link>
<pubDate>Wed, 19 Mar 2025 14:58:46 +0800</pubDate>
<guid>https://theflash010.github.io/posts/pipeline_parallelism/</guid>
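<description><h1 id="pp策略">PP strategies</h1>
<p>PP has two elements: layer assignment and scheduling strategy.</p>
<p>Only the scheduling strategy is discussed here: when each stage runs forward passes and when it runs backward passes.</p>
<p>A pipeline flush is a forced synchronization mechanism in pipeline parallelism that keeps computation and optimizer steps on all devices (such as GPUs) in sync. (With the number of stages fixed, increasing the number of micro-batches shortens each stage's per-micro-batch processing time, so devices idle for less time while waiting for the flush, which reduces the flush's impact.)</p>
<h2 id="naive版本">Naive version</h2>
<p>The naive approach injects all mini-batches of an epoch into the pipeline at once, runs every forward pass, and then runs the backward passes. Gradients are applied only after all mini-batches have finished their backward passes, so an epoch performs a single weight update and effectively degenerates into one batch. Training efficiency is therefore poor, even though pipeline utilization is high (the machines stay busy).</p>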
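<h2 id="gpipe版本">GPipe version</h2>
<p>GPipe changes the naive version to inject m mini-batches into the pipeline at a time; it still runs all forward passes and then all backward passes.</p>
<p>It has the pipeline flush problem: after each round of m mini-batches the pipeline must be flushed to prepare for the next round (increasing the number of micro-batches reduces the impact of the flush).</p>
<p>As shown in the figure below:</p>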
<img src=".\image-20250319143350643.png" alt="image-20250319143350643" />
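<h2 id="1f1b-版本论文版">1F1B (paper version)</h2>
<p>From the paper <em><strong>"PipeDream: Generalized Pipeline Parallelism for DNN Training"</strong></em></p>
<p>Mini-batches are injected into the pipeline continuously, with forward passes interleaved with backward passes that update the parameters; each stage must strictly alternate one forward pass and one backward pass.</p>
<p>No pipeline flush is needed (except when flushing to recover from a pipeline failure).</p>
<p>As shown in the figure below:</p>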
<img src=".\image-20250319143535310.png" alt="image-20250319143535310" />
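<p>But this clearly raises several problems.</p>
<ul>
<li>For example: when mini-batch 2 runs its backward pass, mini-batch 1 has already modified stage 4's parameters. Mini-batch 2 should compute its gradients against the original parameters it used in the forward pass, so it clearly cannot run the backward pass directly on the modified parameters.</li>
</ul>
<p>To solve this, 1F1B stashes weights, keeping multiple weight versions. For mini-batch 2's backward pass, every stage therefore holds both the latest parameters (updated by mini-batch 1) and the original version, which makes the gradient computation possible. <font color='red'>(Does this add a very large memory overhead on each worker?)</font></p>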
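<p>A toy sketch of weight stashing, assuming a stage whose "model" is a single scalar weight w computing y = w * x. The class and method names are hypothetical; the point is only that the backward pass reads the weight version saved during that mini-batch's forward pass, while the update is applied to the latest weights.</p>

```python
class StashingStage:
    """Toy pipeline stage with a scalar weight and per-mini-batch stashes."""

    def __init__(self, weight):
        self.weight = weight
        self.stash = {}  # mini-batch id -> weight used in its forward pass

    def forward(self, mb_id, x):
        # Remember which weight version this mini-batch saw.
        self.stash[mb_id] = self.weight
        return self.weight * x

    def backward(self, mb_id, x, grad_out, lr=0.1):
        w_used = self.stash.pop(mb_id)   # the version can be freed afterwards
        grad_w = grad_out * x            # d(w*x)/dw
        grad_x = grad_out * w_used       # computed with the stashed version
        self.weight -= lr * grad_w       # the update goes to the latest weights
        return grad_x
```

<p>The stash holds one entry per mini-batch that has run forward but not yet backward, which is what drives the memory cost.</p>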
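<ul>
<li>For example: by the time mini-batch 2 runs its backward pass, the parameters have already been updated. Is mini-batch 2's gradient still useful for training?</li>
</ul>
<p>The authors argue that parameters change slowly, so gradients from later mini-batches remain useful even after an update, and their experiments support this.</p>
<ul>
<li>For example: for mini-batch 5 (mb5 below), the forward pass uses inconsistent weight versions. At stage 1, mb5 uses parameters produced by mb1's update; at stage 2, mb5 uses parameters produced by mb2's update. Does this inconsistency cause problems?</li>
</ul>
<p>The authors propose Vertical Sync to keep weight versions consistent. With it, a stage keeps more weight versions, and mini-batch 5 uses the weights produced by mini-batch 1's update at every stage during its forward pass. For example, stage 2 additionally keeps the weights from mini-batch 1's update so that mini-batch 5 sees a consistent version. Once mini-batch 5's backward pass finishes, the versions tied to mini-batch 1's update can be deleted on every stage.</p>
<p>The authors note that 1F1B's default semantics do not include Vertical Sync, because it adds memory overhead and the version inconsistency may not hurt training quality.</p>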
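<h3 id="显存开销的讨论">Memory overhead</h3>
<p>At any moment, the extra memory for weight stashing on a stage only needs to cover the versions required by backward passes that are still pending.</p>
<p>As shown in the figure below:</p>
<p>Consider the rightmost time step.</p>
<p>At stage 1, mini-batches 5, 6, 7, and 8 all have pending backward passes and need different versions (5-&gt;1, 6-&gt;2, 7-&gt;3, 8-&gt;4), so four weight versions must be kept.</p>
<p>At stage 4, once mini-batch 1's backward pass finishes, v0 is no longer needed and can be deleted, leaving v1. Mini-batch 2's forward pass also uses v1; when its backward pass finishes, v1 can be deleted, leaving v2, and so on. Only one weight version ever needs to be kept.</p>
<p>Suppose the full model has size N. With pipeline parallelism each stage holds N/p of the weights, and by the analysis above no stage keeps more than p versions, so per-stage weight memory is at most N. This is no worse than data parallelism, where every worker holds the full N.</p>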
<img src=".\image-20250320100623997.png" alt="image-20250320100623997" />
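<p>The bound can be tabulated per stage. The sketch below assumes equal-sized stages and that stage i (1-indexed) keeps p - i + 1 versions in steady state; this matches the two cases worked out above (4 versions at stage 1, 1 version at stage 4 with p = 4), and the values for the middle stages are an interpolation of that pattern. The function names are illustrative.</p>

```python
def stashed_versions(stage, num_stages):
    """Weight versions kept by a 1-indexed stage in steady state."""
    return num_stages - stage + 1


def stage_weight_memory(model_size, stage, num_stages):
    """Weight memory on one stage, in the same unit as model_size."""
    per_stage = model_size / num_stages
    return stashed_versions(stage, num_stages) * per_stage


for stage in range(1, 5):
    print(stage, stashed_versions(stage, 4), stage_weight_memory(8.0, stage, 4))
```

<p>Only the first stage reaches the worst case of N; later stages hold progressively less.</p>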
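<h2 id="1f1b-版本工程版">1F1B (practical version)</h2>
<p>Because of the memory cost of weight versions, practical implementations do not apply gradient updates during the backward pass; they update synchronously at the end of the batch. In other words, the practical version uses a pipeline flush.</p>
<p>As shown in the figure below:</p>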
<img src=".\image-20250320173445048.png" alt="image-20250320173445048" />
<h3 id="gpipe和1f1b的比较">GPipe vs. 1F1B</h3>
<h4 id="气泡方面">Bubbles</h4>
<p>If 1F1B uses a pipeline flush (the practical version), GPipe and 1F1B have the same bubble time, as shown below:</p>
<img src=".\image-20250320171443810.png" alt="image-20250320171443810" />
<p>The formula is as follows:</p>
<img src=".\image-20250320175154103.png" alt="image-20250320175154103" />
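<p>For reference, a commonly used expression for a flushed schedule with p stages and m micro-batches is bubble time = (p - 1) * (t_f + t_b), which gives a bubble fraction of (p - 1) / m relative to the ideal compute time m * (t_f + t_b). The sketch below encodes that expression; it assumes uniform per-micro-batch forward and backward times t_f and t_b, and it is not a transcription of the figure above.</p>

```python
def bubble_time(p, t_f, t_b):
    """Idle time per device caused by pipeline fill and drain."""
    return (p - 1) * (t_f + t_b)


def bubble_fraction(p, m):
    """Bubble time divided by the ideal compute time m * (t_f + t_b)."""
    return (p - 1) / m


print(bubble_fraction(4, 8))
print(bubble_fraction(4, 16))
```

<p>Doubling the number of micro-batches halves the bubble fraction, which is why more micro-batches soften the cost of a flush.</p>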
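<p>The paper version of 1F1B, however, needs no pipeline flush: over n batches GPipe flushes n-1 times while 1F1B never does, so it has fewer bubbles.</p>
<h4 id="显存方面">Memory</h4>
<p>Consider the practical version of 1F1B.</p></description>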
</item>
<item>
<title>Activation Memory Analysis</title>
<link>https://theflash010.github.io/posts/activation_memory_analysis/</link>
<pubDate>Tue, 18 Mar 2025 11:53:10 +0800</pubDate>
<guid>https://theflash010.github.io/posts/activation_memory_analysis/</guid>
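<description><h1 id="分析使用不同模型并行策略模型训练时激活值的显存占用情况">Analyzing the memory footprint of <font color=red>activations</font> during training under different model-parallel strategies</h1>
<p>Based on the analysis in the paper <em><strong>"REDUCING ACTIVATION RECOMPUTATION IN LARGE TRANSFORMER MODELS"</strong></em></p>
<p>Training a large model involves two kinds of variables: those that must be kept in memory and those that need not be.</p>
<p>The variables that must be kept fall into two groups: weights and activations.</p>
<p>Here, weights cover the model parameters, gradients, and optimizer states.</p>
<p>Activations are the tensors needed by the backward pass; anything the backward pass does not need is not counted as an activation. Dropout masks are included.</p>
<p>Background:</p>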
<ol>
<li>Note that “activations” in this paper refers to any tensor that is created in the forward pass and is necessary for gradient computation during back-propagation. As a result, this excludes the main parameters of the model and optimizer state, but, for example, includes the mask used by the dropout operation.</li>
<li>We also assume that the network and the activations are stored in a 16-bit floating point format and therefore each element requires 2 bytes for storage. The only exceptions are the dropout masks which only require a single byte per element. (In other words, every element takes 2 bytes except dropout masks, which take 1 byte.)</li>
</ol>
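<p>These two storage assumptions are enough to size any single tensor. The sketch below computes the bytes for one tensor with one element per (token, sample, hidden unit); the function names and the example sizes are illustrative and are not taken from the paper.</p>

```python
def tensor_bytes(seq_len, micro_batch, hidden, bytes_per_elem=2):
    """Bytes for a seq_len x micro_batch x hidden tensor (fp16 by default)."""
    return seq_len * micro_batch * hidden * bytes_per_elem


def dropout_mask_bytes(seq_len, micro_batch, hidden):
    """Dropout masks use a single byte per element."""
    return tensor_bytes(seq_len, micro_batch, hidden, bytes_per_elem=1)


print(tensor_bytes(2048, 1, 4096))
print(dropout_mask_bytes(2048, 1, 4096))
```

<p>A per-layer total is then a sum of such terms, one for every tensor the backward pass needs.</p>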
<h2 id="baseline-不使用任何模型并行策略">Baseline: no model parallelism</h2>
<img src=".\image-20250317032832928.png" alt="image-20250317032832928" style="zoom:50%;" />
<p>The figure above shows the Transformer architecture; the memory footprint (for a single layer) is analyzed based on it:</p></description>
</item>
<item>
<title>Encoder Decoder vs Decoder Only</title>
<link>https://theflash010.github.io/posts/encoder-decoder_vs_decoder-only/</link>
<pubDate>Wed, 08 Jan 2025 11:10:08 +0800</pubDate>
<guid>https://theflash010.github.io/posts/encoder-decoder_vs_decoder-only/</guid>
<description><h1 id="encoder-decoder--vs--decoder-only">Encoder-Decoder VS Decoder-Only</h1>
<h2 id="encoder-decoder"><mark>Encoder-Decoder</mark></h2>
<h4 id="encoder">Encoder</h4>
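<p>Purpose: extract information from the input text</p>
<p>Process:</p>
<ol>
<li>In practice, attention + add&amp;norm + FFN produce an Encoder output e<sub>i</sub> (a vector) for every word</li>
</ol>
<p>Note: e<sub>i</sub> always refers to the output of the topmost Encoder block (when the Encoder has multiple blocks)</p>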
<h4 id="decoder">Decoder</h4>
<p>Purpose: generate the result one word at a time, using the Encoder's output vectors.</p>
<p>Process:</p>
<ol>
<li>
<p>Insert a &lt;bos&gt; token to mark the start of generation.</p>
</li>
<li>
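<p>Suppose n words have been generated so far (not counting the input text). Apply Masked Attention to the n-th word: compute its self-attention scores against the previous n-1 words, and from them the self-attention output d (a vector). Add&amp;norm also applies here but is ignored for now.</p>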
</li>
<li>
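<p>Run cross attention between d and the Encoder output vectors e<sub>i</sub> of all input words. Each e<sub>i</sub> is multiplied by the matrices M<sub>k</sub> and M<sub>v</sub> to give the vectors k<sub>i</sub> and v<sub>i</sub>, while d is multiplied by M<sub>q</sub> to give the vector q<sub>d</sub>. Attention over k<sub>i</sub>, v<sub>i</sub> and q<sub>d</sub> then yields the result o (a vector).</p>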
</li>
<li>
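<p>Apply add&amp;norm and the FFN to o to produce the new word, then repeat steps 2, 3 and 4 until &lt;eos&gt; is generated, which marks the end of generation.</p>
<p>Note: even when the decoder has many blocks, the e<sub>i</sub> used by cross attention is the same in every block: it is always the output of the topmost Encoder block.</p>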
</li>
</ol>
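<p>Step 3 can be sketched in plain Python. This is a minimal single-head illustration and not taken from any library. It assumes "multiplying by M" means a matrix-vector product, and it adds the usual 1/sqrt(dimension) scaling, which the description above does not mention. The helper names <code>matvec</code>, <code>softmax</code> and <code>cross_attention</code> are hypothetical.</p>

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(d, encoder_outputs, M_q, M_k, M_v):
    """Attend from the decoder vector d to every encoder output e_i."""
    q_d = matvec(M_q, d)                                   # query comes from the decoder
    keys = [matvec(M_k, e_i) for e_i in encoder_outputs]   # k_i come from the encoder
    values = [matvec(M_v, e_i) for e_i in encoder_outputs] # v_i come from the encoder
    scale = math.sqrt(len(q_d))                            # assumed scaled dot-product
    scores = [sum(q * k for q, k in zip(q_d, k_i)) / scale for k_i in keys]
    weights = softmax(scores)
    dim = len(values[0])
    o = [sum(w * v_i[j] for w, v_i in zip(weights, values)) for j in range(dim)]
    return o, weights


# Toy data: identity projections and two encoder outputs (illustrative only).
identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_outputs = [[1.0, 0.0], [0.0, 1.0]]
d = [1.0, 0.0]
o, weights = cross_attention(d, encoder_outputs, identity, identity, identity)
```

<p>Because <code>encoder_outputs</code> is passed in unchanged, every decoder block can call the same function with the same e<sub>i</sub>; only d differs from block to block.</p>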
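<h4 id="总结">Summary</h4>
<p>The Encoder-Decoder architecture uses the Encoder to extract information from the input text, and the Decoder consults that extracted information at every generation step.</p>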
<h2 id="decoder-only"><mark>Decoder-Only</mark></h2>
<h4 id="decoder-1">Decoder</h4>
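<p>Purpose: generate the result by applying attention repeatedly.</p>
<p>Process:</p>
<ol>
<li>Start from the first word of the input text (note that no &lt;bos&gt; is inserted here) and apply masked attention. Words predicted before the end of the input text are not output; output begins only at the last input word. The loop then continues until &lt;eos&gt; is generated.</li>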
</ol>
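<p>Tips:</p>
<ol>
<li>In practice the input text itself needs no prediction. Only the KV Cache of the input words has to be computed, and prediction starts once the last input word is reached.</li>
<li>Decoder-Only inference has two phases, prefill and decode. Prefill computes the KV Cache for the input text and decode performs the generation (prefill first, then decode).</li>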
</ol>
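<p>The two phases can be sketched as a toy loop. This is a simplified illustration under stated assumptions, not a real model: <code>toy_kv</code> stands in for the per-token key/value projection, <code>next_token_fn</code> stands in for masked attention over the cache plus token selection, and <code>"[eos]"</code> stands in for the end-of-sequence marker. All of these names are hypothetical.</p>

```python
EOS = "[eos]"  # stand-in for the end-of-sequence marker


def toy_kv(token):
    """Hypothetical stand-in for computing a token's key and value."""
    return (f"k({token})", f"v({token})")


def prefill(prompt_tokens):
    """Build the KV cache for every prompt token; nothing is emitted here."""
    return [toy_kv(token) for token in prompt_tokens]


def decode(kv_cache, next_token_fn, max_new_tokens=16):
    """Emit tokens one at a time, extending the cache after each step."""
    generated = []
    for _ in range(max_new_tokens):
        token = next_token_fn(kv_cache)    # sees every cached entry so far
        generated.append(token)
        if token == EOS:
            break
        kv_cache.append(toy_kv(token))     # the new token joins the cache
    return generated


# Toy usage: the "model" simply replays a fixed reply.
prompt = ["the", "cat"]
reply = iter(["sat", "down", EOS])
cache = prefill(prompt)
output = decode(cache, lambda kv: next(reply))
```

<p>The prompt tokens contribute cache entries but no output, while each decoded token contributes both.</p>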
<h4 id="总结-1">Summary</h4>
<p>The Decoder-Only architecture has no Encoder to extract information from the input text. It generates directly, using masked attention to attend to the preceding input text.</p>
<img src="https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1-727x1024.png" alt="img" style="zoom:50%;" /></description>
</item>
<item>
<title>Introduce</title>
<link>https://theflash010.github.io/pages/about/</link>
<pubDate>Fri, 03 Jan 2025 14:30:31 +0800</pubDate>
<guid>https://theflash010.github.io/pages/about/</guid>
<description><ul>
<li>Graduated from Jilin University (CS)</li>
<li>Currently pursuing a PhD at Zhejiang University (CS)</li>
</ul></description>
</item>
<item>
<title>CUDA Study</title>
<link>https://theflash010.github.io/posts/cuda_study/</link>
<pubDate>Fri, 03 Jan 2025 14:02:03 +0800</pubDate>
<guid>https://theflash010.github.io/posts/cuda_study/</guid>
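<description><h2 id="heterogeneous-programming异构编程">Heterogeneous Programming</h2>
<p>A programming model that computes on machines with more than one architecture, for example a CPU combined with a GPU.</p>
<h4 id="asynchronous-simt-programming-model异步simt编程模型">Asynchronous SIMT Programming Model</h4>
<p>When a CUDA program issues certain time-consuming operations (GPU kernel execution, memory copies and so on), it does not wait for them to finish; the program on the CPU continues with subsequent commands or tasks. This lets the CPU and GPU work in parallel more efficiently and improves overall performance. Asynchronous (async) means the devices do not proceed in lockstep: each runs at its own pace. Synchronous (sync) means the devices must keep in step: while one device is running, the others wait for it to finish before all continue together.</p>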
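<h2 id="compute-capability版本与cuda版本">Compute Capability version vs. CUDA version</h2>
<p>A device's Compute Capability is denoted by a version number, sometimes called its "SM version". This number identifies the features the GPU hardware supports, and applications use it at runtime to determine which hardware features and instructions are available on the current GPU. It consists of a major version X and a minor version Y, written X.Y. The CUDA version, by contrast, is the version of the CUDA software platform. Developers use the CUDA platform to build applications that run on many generations of GPU architectures, including future ones not yet invented. A new version of the CUDA platform usually adds native support for a new GPU architecture by supporting that architecture's Compute Capability version, but it typically also includes software features that are independent of the hardware.</p>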
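<h2 id="cuda-runtime">CUDA Runtime</h2>
<p>CUDA Runtime is a high-level API that NVIDIA provides to simplify CUDA programming. It is built on top of the CUDA Driver API and offers higher-level interfaces that make parallel programming with CUDA easier, such as the cudaMalloc and cudaFree functions. It also lets developers define and launch kernel functions.</p>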
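<h2 id="cuda-context">CUDA Context</h2>
<p>The context is an important concept in CUDA programming: it represents the association and state between the CUDA runtime and a specific CUDA device (usually an NVIDIA GPU). When you program with the CUDA Runtime API, the runtime automatically creates a context for each CUDA device (unless you explicitly request separate per-device contexts in different threads). A context serves one process, and all threads in that process share it.</p>
<h4 id="primary-context">primary context</h4>
<p>The primary context is the first, or default, context associated with a CUDA device (it is created automatically, without explicit action from the user). It is created when the CUDA runtime initializes and usually manages most of the CUDA operations on that device. It is shared among all host threads of the application.</p>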
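<h2 id="cuda-stream-cuda流">CUDA Stream</h2>
<p>With ordinary GPU acceleration, GPU operations such as kernel launches and memory copies (cudaMemcpy) run in order, because they are all governed by a single default stream. To let some GPU operations run concurrently, place them in different streams: operations within one stream execute in order, while operations in different streams execute asynchronously with respect to each other. This is the idea of a stream.
A CUDA stream is a sequence of CUDA operations issued by the host and executed on one device. The operations in one stream execute in the order the host issued them, but operations from two different streams have no guaranteed order and may run concurrently or interleaved. A stream is comparable to a "thread" in CPU programming (not the CUDA notion of a thread): tasks in the same "thread" run serially, and different "threads" can run in parallel.
There are two kinds of default stream: the legacy default stream and the per-thread default stream.</p>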
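<p>The CPU "thread" analogy above can be made concrete with a small Python model. This is only an analogy built on the standard library, not the CUDA API: the <code>ToyStream</code> class and its methods are hypothetical, and each instance is single-use (it cannot accept work after <code>synchronize</code>).</p>

```python
import queue
import threading


class ToyStream:
    """CPU-side analogy of a stream: a FIFO queue drained by one worker thread."""

    def __init__(self, name):
        self.name = name
        self.completed = []                 # results in the order operations finished
        self._ops = queue.Queue()
        self._worker = threading.Thread(target=self._drain)
        self._worker.start()

    def _drain(self):
        while True:
            op = self._ops.get()
            if op is None:                  # sentinel: no more work
                return
            self.completed.append(op())     # operations run strictly in FIFO order

    def enqueue(self, op):
        """Return immediately; the operation runs later on the worker."""
        self._ops.put(op)

    def synchronize(self):
        """Block the caller until everything enqueued so far has finished."""
        self._ops.put(None)
        self._worker.join()


stream_a = ToyStream("a")
stream_b = ToyStream("b")
for step in ("copy_in", "kernel", "copy_out"):
    stream_a.enqueue(lambda s=step: f"a:{s}")
    stream_b.enqueue(lambda s=step: f"b:{s}")
stream_a.synchronize()
stream_b.synchronize()
```

<p>Within each toy stream the three operations always finish in the order they were enqueued, while nothing fixes the relative order of operations across the two streams.</p>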
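<h4 id="legacy-default-stream">legacy default stream</h4>
<p>By default (with no compile option, or when compiling with nvcc and the --default-stream legacy option), each device (each GPU card) gets one default stream (the NULL stream), called the legacy default stream. Every operation on that device that specifies no stream, or specifies the default stream, goes into it. If the host is multithreaded, the GPU operations of all its threads go into this one shared stream.
Note: if a host thread issues a command to the legacy default stream between two commands from different (user-created) streams, those two commands cannot run concurrently. In one sentence: the legacy default stream synchronizes with every stream that is not a non-blocking stream.</p></description>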
</item>
</channel>
</rss>