diff --git a/blog/2025/2025-05-09-blender-baking/bake-influence.webp b/blog/2025/2025-05-09-blender-baking/bake-influence.webp
new file mode 100644
index 0000000..abca7ee
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/bake-influence.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/bake-panel.webp b/blog/2025/2025-05-09-blender-baking/bake-panel.webp
new file mode 100644
index 0000000..0ea6c03
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/bake-panel.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/bake-type-diffuse.webp b/blog/2025/2025-05-09-blender-baking/bake-type-diffuse.webp
new file mode 100644
index 0000000..04829f2
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/bake-type-diffuse.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/bake-type-rough.webp b/blog/2025/2025-05-09-blender-baking/bake-type-rough.webp
new file mode 100644
index 0000000..6a93ca1
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/bake-type-rough.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/bowl-rendered.webp b/blog/2025/2025-05-09-blender-baking/bowl-rendered.webp
new file mode 100644
index 0000000..3736686
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/bowl-rendered.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/bumpy-bowl.webp b/blog/2025/2025-05-09-blender-baking/bumpy-bowl.webp
new file mode 100644
index 0000000..99a9111
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/bumpy-bowl.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/compare-materials.webp b/blog/2025/2025-05-09-blender-baking/compare-materials.webp
new file mode 100644
index 0000000..fcfb8ba
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/compare-materials.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/compare-polygons.webp b/blog/2025/2025-05-09-blender-baking/compare-polygons.webp
new file mode 100644
index 0000000..d4d0b55
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/compare-polygons.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/complete-bake.webp b/blog/2025/2025-05-09-blender-baking/complete-bake.webp
new file mode 100644
index 0000000..ac5009d
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/complete-bake.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/emission-colour.webp b/blog/2025/2025-05-09-blender-baking/emission-colour.webp
new file mode 100644
index 0000000..0e88a7f
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/emission-colour.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/extrusion-and-max-ray-distance.webp b/blog/2025/2025-05-09-blender-baking/extrusion-and-max-ray-distance.webp
new file mode 100644
index 0000000..449ccff
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/extrusion-and-max-ray-distance.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/good-values.webp b/blog/2025/2025-05-09-blender-baking/good-values.webp
new file mode 100644
index 0000000..21244da
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/good-values.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/hatchet.webp b/blog/2025/2025-05-09-blender-baking/hatchet.webp
new file mode 100644
index 0000000..3090599
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/hatchet.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/high-poly-blender.webp b/blog/2025/2025-05-09-blender-baking/high-poly-blender.webp
new file mode 100644
index 0000000..6e2c669
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/high-poly-blender.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/hit-bake.webp b/blog/2025/2025-05-09-blender-baking/hit-bake.webp
new file mode 100644
index 0000000..57e79fe
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/hit-bake.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/index.md b/blog/2025/2025-05-09-blender-baking/index.md
new file mode 100644
index 0000000..e1aa86e
--- /dev/null
+++ b/blog/2025/2025-05-09-blender-baking/index.md
@@ -0,0 +1,222 @@
+---
+title: 'The one true guide to baking materials in Blender'
+slug: 'blender-baking'
+description: 'How to get nice materials on low poly objects'
+date: '2025-05-09'
+authors: ['jaked']
+tags: ['blender', 'baking', 'normal-maps', 'article', 'tutorial']
+---
+
+Real-time rendering performance is often limited by 3D assets as much as by code. Good low poly assets inevitably rely on baking: the process of transferring details from a high poly mesh with a complex material to a low poly one with a much simpler material. Unfortunately, good information on baking is hard to find, and in Blender in particular the process can be unintuitive. While working on my game, A Short Odyssey (ASO), I came up with a workflow that works quite well for me, so I will share it with you today.
+
+For this tutorial we are going to use this wooden bowl model from the fantastic website [Polyhaven](https://polyhaven.com/a/wooden_bowl_02).
+
+![Bowl Rendered](bowl-rendered.webp)
+
+<!-- truncate -->
+
+As with all of the free CC0 models on Polyhaven, this mesh has a fairly high number of small triangles: 4,666 to be exact. While that may not seem like a lot, think about how big the bowl is likely to be in a real-time scene. Most of the time the entire bowl might only be a few pixels tall! Small triangles are also much more expensive to render than large ones (due to [quad occupancy](https://blog.selfshadow.com/2012/11/12/counting-quads/)), so this is something we should deal with.
+
+![Small Bowl](small-bowl.webp)
+
+Now that we understand why we must bake, let's go ahead and do it.
+
+# Preparing for the Bake
+
+Open up your high poly model in Blender. I am using Blender 4.4. Other versions should work but your UI might not match up exactly with this tutorial.
+
+![High Poly Blender](high-poly-blender.webp)
+
+You then will need a low poly version of the model. How to create a low poly model is outside the scope of this tutorial, but it *must* be UV-unwrapped before proceeding and none of the polygons should be overlapping on the UV map.
+
+![Low Poly Blender](low-poly-blender.webp)
+
+My low-poly version uses only 272 triangles, or roughly 5.8% of the original number. These triangles are also bigger, so they should have much better quad occupancy on the GPU.
+
+![overlapping models](overlapping-models.webp)
+
+The first thing you need to do is make sure the high and low-poly models are directly on top of each other, just like in the image above. You must also ensure the scale of the low poly version is exactly 1.0 on all axes.
+
+![Unit Scale](unit-scale.webp)
+
+If it is not, you can apply the scale with <kbd>Ctrl+A</kbd> -> Apply -> Scale, while the low poly object is selected in object mode. 
+
+![triangulate](triangulate.webp)
+
+Next, you need to add a triangulate modifier to the low poly object. The exact options you pick here don't really matter, but if you change them after the bake you must re-bake all maps.
+
+# Creating a Bake Target Proxy
+
+The main way my workflow differs from what I've seen elsewhere is the use of a Bake Target, or proxy. This is not strictly necessary, but it makes the entire process far less frustrating if you need to run the bake more than once, which you inevitably will. It involves creating a linked duplicate of our low poly object, which lets you preview the bake results without having to mess around with the shader nodes and reconnect things between bakes.
+
+![Linked Duplicate](linked-duplicate.webp)
+
+To create a linked duplicate, simply select your low-poly object and hit <kbd>Alt+D</kbd>. You can then move your linked duplicate off to the side somewhere.
+
+![Outliner Names](outliner-names.webp)
+
+I'm going to name the new object `Low Poly` and the first one `Bake Target` (the names don't matter, but it's nice to be organized).
+
+![Object Data](object-data.webp)
+
+This next part is very important: you must set the `Bake Target` to source its materials from "Object" instead of "Data". This way the two linked objects can have different materials. This is done in the material tab for the `Bake Target` object, as shown above.
+
+You can then create a material for it, which I will also call `Bake Target`. I will also create a new material for the `Low Poly` object and call it `Low Poly`.
+
+# Setting up Materials
+
+![Shading Tab](shading-tab.webp)
+
+The rest of this process will be done in the shading tab so we can switch there.
+
+![Material Nodes](material-nodes.webp)
+
+With the bake target selected, we will add 3 texture nodes to its material. Because I'm using a PBR workflow, these will be Albedo, Normal & Roughness (I will get into metalness later in this tutorial). These texture nodes should have their colour space set to "sRGB" for the Albedo and "Non-Color" for the others. You should NOT connect these nodes to anything.
+
+![Pasted Nodes](pasted-nodes.webp)
+
+You can then copy & paste these nodes into the material for the `Low Poly` object. Then connect the nodes as shown here.
+
+![Invert Green](invert-green.webp)
+
+If you use DirectX style normal maps (like I do in ASO), you will need to add an "RGB Curves" node with its green curve flipped in order to invert the green channel of the normal map.
+
+![Weird Shiny](weird-shiny.webp)
+
+Your low poly will look weird and shiny. That is because our baked textures are all black at the moment, and that is OK: it will look correct after we are done baking.
+
+Now that everything is set up, we can start looking at the actual baking UI.
+
+# The Baking UI
+
+![Render Panel](render-panel.webp)
+
+Baking is accessed through the Render tab on the properties panel.
+
+![Render Engine](render-engine.webp)
+
+In order to see the bake options you need to set the Render Engine to "Cycles". You probably also want to set Device to "GPU Compute" in order to speed things up.
+
+![Bake Panel](bake-panel.webp)
+
+Expanding the bake controls will give you access to several new options.
+
+![Normal Baking](normal-baking.webp)
+
+We will start by baking the normal map. To do so we must first select "Normal" from the Bake Type combo box. You will also want to check "Selected to Active". For users of DirectX style normal maps, like myself, you will also need to set the G channel to "-Y". If you are using OpenGL style normal maps you can leave it as is.
+
+# Performing the Bake
+
+![Selected To Active](selected-to-active.webp)
+
+OK, it's finally bake time. Select your high poly asset, then hold <kbd>Ctrl</kbd> and select your `Bake Target`. This sets the high poly as selected and your `Bake Target` as active. If everything is selected correctly, your outliner should look like the image above, with a dark orange highlight on the high poly object and a bright orange one on the `Bake Target`.
+
+![Select Normal Node](select-normal-node.webp)
+
+Now select the normal map texture node in the shader nodes for the current material. This tells Blender to use it as the destination for baking.
+
+![Hit Bake](hit-bake.webp)
+
+We can finally hit bake!
+
+![Messed Up Bake](messed-up-bake.webp)
+
+After some amount of processing time, you should see a preview of the normal map. There is also a 99% chance it will be messed up in some way.
+
+![Messed Up Bake Normals](messed-up-bake-normals.webp)
+
+As you can see by looking at our `Low Poly` object, something is very off.
+
+![Extrusion And Max Ray Distance](extrusion-and-max-ray-distance.webp)
+
+The solution to this problem is adjusting two very important parameters for baking. They are "Extrusion" and "Max Ray Distance". 
+
+In Blender, baking works by shooting rays from the Bake Target. Since our low poly mesh doesn't lie completely outside the surface of the high poly object, Blender needs to effectively extrude the surfaces of the target outward so that the high poly object is completely contained within the low poly one. The amount by which it does this is the "Extrusion", and the length of the rays is the "Max Ray Distance".
+
+Now, of course, you are probably wondering at this point: how do I know what to set these numbers to? My rule of thumb is to set Extrusion to the smallest value that makes the green pixels in the normal map go away, then set the Max Ray Distance to ~1.5-2 times the Extrusion.
+
+![Good Values](good-values.webp)
+
+In this case, an Extrusion of 0.1 and a Max Ray Distance of 0.2 are good values.
+
+![Not Enough Distance](not-enough-distance.webp)
+
+If the Max Ray Distance were too low, e.g. 0.1, we would get holes in our normal map as shown above.
+
+![Perfect Normals](perfect-normals.webp)
+
+If our values are set properly we get a nice normal map without any artifacts.
+
+![Bumpy Bowl](bumpy-bowl.webp)
+
+We can also now look at our `Low Poly` object and see that it looks nice and bumpy. But there is one tiny problem: it's far too shiny! This is because its roughness map is entirely black, or 0.0, which corresponds to a mirror-like shine. So of course our next step should be to bake a roughness map.
+
+# Baking a Roughness Map
+
+![Select Roughness](select-roughness.webp)
+
+With your selection still on the high poly and your active still on the bake target, select the roughness map texture node in the shader node editor.
+
+![Bake Type Rough](bake-type-rough.webp)
+
+Select "Roughness" for Bake Type and hit Bake again.
+
+![Roughness Result](roughness-result.webp)
+
+After waiting for the bake to complete we now have a roughness map and the shininess of our bowl looks correct. Last but certainly not least we need to bake albedo. This is the actual surface colour of our object.
+
+# Baking an Albedo Map
+
+![Select Albedo](select-albedo.webp)
+
+Just as before we need to select the Albedo texture node in the shader node editor. 
+
+![Bake Type Diffuse](bake-type-diffuse.webp)
+
+We set the Bake Type to "Diffuse" this time, but there is one more thing before you bake!
+
+![Bake Influence](bake-influence.webp)
+
+Below the bake button, under "Influence", you must uncheck "Direct" and "Indirect"; otherwise Blender will bake the lighting into your albedo texture. Now we can hit Bake.
+
+![Complete Bake](complete-bake.webp)
+
+If everything went well, our bowl now has a complete material!
+
+![Compare Materials](compare-materials.webp)
+
+Our low poly now looks much more like the high poly one.
+
+![Compare Polygons](compare-polygons.webp)
+
+Even though their polygon counts are radically different.
+
+![Save Textures](save-textures.webp)
+
+Now, before I finish, I need to remind you to save your textures; for some reason Blender doesn't do this automatically for you. You can do it from the hamburger menu in the "Image Editor" under Image -> Save. This must be done for each of your textures.
+
+There we go, that's it! That's how to bake full materials in Blender!
+
+# A Note on Metalness
+
+There is, however, one tiny consideration for metallic materials. For some reason, if your high poly object has any metal on it whatsoever, it will completely break everything when baking. Luckily, there is a workaround.
+
+![Hatchet](hatchet.webp)
+
+Let's use this hatchet as an example. You need to take the metallic parameter for the high poly mesh's material and hook it up to the *Emission Color* input.
+
+![Emission Colour](emission-colour.webp)
+
+Because ASO packs roughness and metal together, I'm gonna send both through the Emission Color using a "Combine Color" node (note ASO uses R = Roughness, G = Metal; this is different from glTF). All you do now is locate the correct texture in your `Bake Target` material and, instead of baking Metal and Roughness, bake using "Emission" as the bake type.
+
+# Considerations for Mirrored Objects
+
+![Mirror Modifier](mirror-modifier.webp)
+
+If your low poly object has a mirror modifier, like the hatchet from the metal section, there is one more thing to be aware of. You should set the UV coordinate offset to 1.0 for either U or V. This ensures the mirrored geometry generates UV coordinates that do not overlap with the ones we already have, which would otherwise cause problems during the bake.
+
+# The End
+
+Hope you enjoyed this tutorial! If you found it useful or wanna know about my game A Short Odyssey, please wishlist it on Steam: https://store.steampowered.com/app/2818690/A_Short_Odyssey
+
+
diff --git a/blog/2025/2025-05-09-blender-baking/invert-green.webp b/blog/2025/2025-05-09-blender-baking/invert-green.webp
new file mode 100644
index 0000000..3b941b2
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/invert-green.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/linked-duplicate.webp b/blog/2025/2025-05-09-blender-baking/linked-duplicate.webp
new file mode 100644
index 0000000..81cc85d
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/linked-duplicate.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/low-poly-blender.webp b/blog/2025/2025-05-09-blender-baking/low-poly-blender.webp
new file mode 100644
index 0000000..f3570bd
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/low-poly-blender.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/material-nodes.webp b/blog/2025/2025-05-09-blender-baking/material-nodes.webp
new file mode 100644
index 0000000..132860d
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/material-nodes.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/messed-up-bake-normals.webp b/blog/2025/2025-05-09-blender-baking/messed-up-bake-normals.webp
new file mode 100644
index 0000000..a531eda
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/messed-up-bake-normals.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/messed-up-bake.webp b/blog/2025/2025-05-09-blender-baking/messed-up-bake.webp
new file mode 100644
index 0000000..c54693d
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/messed-up-bake.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/mirror-modifier.webp b/blog/2025/2025-05-09-blender-baking/mirror-modifier.webp
new file mode 100644
index 0000000..cfd1140
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/mirror-modifier.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/normal-baking.webp b/blog/2025/2025-05-09-blender-baking/normal-baking.webp
new file mode 100644
index 0000000..be97e7b
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/normal-baking.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/not-enough-distance.webp b/blog/2025/2025-05-09-blender-baking/not-enough-distance.webp
new file mode 100644
index 0000000..578a5ff
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/not-enough-distance.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/object-data.webp b/blog/2025/2025-05-09-blender-baking/object-data.webp
new file mode 100644
index 0000000..4d121cf
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/object-data.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/outliner-names.webp b/blog/2025/2025-05-09-blender-baking/outliner-names.webp
new file mode 100644
index 0000000..d48554a
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/outliner-names.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/overlapping-models.webp b/blog/2025/2025-05-09-blender-baking/overlapping-models.webp
new file mode 100644
index 0000000..021f6aa
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/overlapping-models.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/pasted-nodes.webp b/blog/2025/2025-05-09-blender-baking/pasted-nodes.webp
new file mode 100644
index 0000000..92600d3
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/pasted-nodes.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/perfect-normals.webp b/blog/2025/2025-05-09-blender-baking/perfect-normals.webp
new file mode 100644
index 0000000..6714433
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/perfect-normals.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/render-engine.webp b/blog/2025/2025-05-09-blender-baking/render-engine.webp
new file mode 100644
index 0000000..99f671f
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/render-engine.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/render-panel.webp b/blog/2025/2025-05-09-blender-baking/render-panel.webp
new file mode 100644
index 0000000..697d786
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/render-panel.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/roughness-result.webp b/blog/2025/2025-05-09-blender-baking/roughness-result.webp
new file mode 100644
index 0000000..352c78b
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/roughness-result.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/save-textures.webp b/blog/2025/2025-05-09-blender-baking/save-textures.webp
new file mode 100644
index 0000000..66c4acf
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/save-textures.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/select-albedo.webp b/blog/2025/2025-05-09-blender-baking/select-albedo.webp
new file mode 100644
index 0000000..61d6300
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/select-albedo.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/select-normal-node.webp b/blog/2025/2025-05-09-blender-baking/select-normal-node.webp
new file mode 100644
index 0000000..52ea975
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/select-normal-node.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/select-roughness.webp b/blog/2025/2025-05-09-blender-baking/select-roughness.webp
new file mode 100644
index 0000000..81c0f1d
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/select-roughness.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/selected-to-active.webp b/blog/2025/2025-05-09-blender-baking/selected-to-active.webp
new file mode 100644
index 0000000..97b7d0f
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/selected-to-active.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/shading-tab.webp b/blog/2025/2025-05-09-blender-baking/shading-tab.webp
new file mode 100644
index 0000000..ef4ddc5
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/shading-tab.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/small-bowl.webp b/blog/2025/2025-05-09-blender-baking/small-bowl.webp
new file mode 100644
index 0000000..df36585
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/small-bowl.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/triangulate.webp b/blog/2025/2025-05-09-blender-baking/triangulate.webp
new file mode 100644
index 0000000..fe0c020
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/triangulate.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/unit-scale.webp b/blog/2025/2025-05-09-blender-baking/unit-scale.webp
new file mode 100644
index 0000000..e191f47
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/unit-scale.webp differ
diff --git a/blog/2025/2025-05-09-blender-baking/weird-shiny.webp b/blog/2025/2025-05-09-blender-baking/weird-shiny.webp
new file mode 100644
index 0000000..0357c2b
Binary files /dev/null and b/blog/2025/2025-05-09-blender-baking/weird-shiny.webp differ
diff --git a/blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/index.md b/blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/index.md
new file mode 100644
index 0000000..9803e1b
--- /dev/null
+++ b/blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/index.md
@@ -0,0 +1,366 @@
+---
+title: 'Nvidia SPIR-V Compiler Bug or Do Subgroup Shuffle Operations Not Imply Execution Dependency?'
+slug: 'subgroup-shuffle-execution-dependency-on-nvidia'
+description: "A look at the behavior behind Nabla's subgroup scan"
+date: '2025-06-19'
+authors: ['keptsecret', 'devshgraphicsprogramming']
+tags: ['nabla', 'vulkan', 'article']
+last_update:
+    date: '2025-06-19'
+    author: keptsecret
+---
+
+Reduce and scan operations are core building blocks in the world of parallel computing, and now [Nabla has a new release](https://github.com/Devsh-Graphics-Programming/Nabla/tree/v0.6.2-alpha1) with those operations made even faster for Vulkan at the subgroup and workgroup levels.
+
+This article takes a brief look at the Nabla implementation for reduce and scan on the GPU in Vulkan.
+
+Then, I discuss a missing execution dependency expected for a subgroup shuffle operation, which was only a problem on Nvidia devices in some test cases.
+
+<!-- truncate -->
+
+## Reduce and Scan
+
+Let's give a quick introduction, or recap for those already familiar, to reduce and scan operations.
+
+A reduction takes a binary associative operator $\bigoplus$ and an array of $n$ elements
+
+$\left[x_0, x_1,...,x_{n-1}\right]$,
+
+and returns
+
+$x_0 \bigoplus x_1 \bigoplus ... \bigoplus x_{n-1}$.
+
+In other words, when $\bigoplus$ is addition, a reduction of an array $X$ is the sum of all elements of $X$.
+
+```
+Input:      4  6  2  3  7  1  0  5
+Reduction:  28
+```
+
+A scan is a generalization of reduction. It takes a binary associative operator $\bigoplus$ with identity $I$ and an array of $n$ elements.
+Then, for each element, it performs the reduction from the first element up to the current element.
+An _exclusive_ scan does so for all elements before the current element.
+
+$\left[I, x_0, (x_0 \bigoplus x_1), ..., (x_0 \bigoplus x_1 \bigoplus ... \bigoplus x_{n-2})\right]$.
+
+An _inclusive_ scan then includes the current element as well.
+
+$\left[x_0, (x_0 \bigoplus x_1), ..., (x_0 \bigoplus x_1 \bigoplus ... \bigoplus x_{n-1})\right]$.
+
+Notice the last element of the inclusive scan is the same as the reduction.
+
+```
+Input:      4  6  2  3  7  1  0  5
+Exclusive:  0  4  10 12 15 22 23 23
+Inclusive:  4  10 12 15 22 23 23 28
+```
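+
+As a minimal CPU-side sketch of these definitions (plain C++ for illustration, not Nabla code; the function names are made up for this example), the three operations for addition can be written as:
+
+```cpp
+#include <cstddef>
+#include <numeric>
+#include <vector>
+
+// Reduction with operator + and identity 0: the sum of all elements.
+int reduce_add(const std::vector<int>& in)
+{
+    return std::accumulate(in.begin(), in.end(), 0);
+}
+
+// Inclusive scan: out[i] = in[0] + ... + in[i].
+std::vector<int> inclusive_scan_add(const std::vector<int>& in)
+{
+    std::vector<int> out(in.size());
+    std::partial_sum(in.begin(), in.end(), out.begin());
+    return out;
+}
+
+// Exclusive scan: out[i] = identity + in[0] + ... + in[i-1].
+std::vector<int> exclusive_scan_add(const std::vector<int>& in)
+{
+    std::vector<int> out(in.size());
+    int running = 0; // identity of addition
+    for (std::size_t i = 0; i < in.size(); ++i)
+    {
+        out[i] = running;
+        running += in[i];
+    }
+    return out;
+}
+```
+
+Calling these on the input above gives the same `Reduction`, `Inclusive` and `Exclusive` rows.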
+
+## Nabla's subgroup scans
+
+We start with the most basic of building blocks: doing a reduction or a scan in the local subgroup of a Vulkan device.
+This is pretty simple actually, since Vulkan already supports subgroup arithmetic operations.
+Nabla exposes this via the [GLSL compatibility header](https://github.com/Devsh-Graphics-Programming/Nabla/blob/v0.6.2-alpha1/include/nbl/builtin/hlsl/glsl_compat/subgroup_arithmetic.hlsl) built on [HLSL SPIR-V inline intrinsics](https://github.com/Devsh-Graphics-Programming/Nabla/blob/v0.6.2-alpha1/include/nbl/builtin/hlsl/spirv_intrinsics/subgroup_arithmetic.hlsl).
+
+```cpp
+nbl::hlsl::glsl::groupAdd(T value)
+nbl::hlsl::glsl::groupInclusiveAdd(T value)
+nbl::hlsl::glsl::groupExclusiveAdd(T value)
+etc...
+```
+
+But wait, the SPIR-V-provided operations all require your Vulkan physical device to support the `GroupNonUniformArithmetic` capability.
+So, Nabla provides emulated versions for that too, and both versions are compiled into a single templated struct call.
+
+```cpp
+template<class Params, class BinOp, uint32_t ItemsPerInvocation, bool native>
+struct inclusive_scan;
+
+template<class Params, class BinOp, uint32_t ItemsPerInvocation, bool native>
+struct exclusive_scan;
+
+template<class Params, class BinOp, uint32_t ItemsPerInvocation, bool native>
+struct reduction;
+```
+
+The implementation of emulated subgroup scans makes use of subgroup shuffle operations to access partial sums from other invocations in the subgroup.
+This is based on the [Kogge–Stone adder (KSA)](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda), using $\log_2 n$ steps where $n$ is the subgroup size with all lanes active.
+It should also be noted that in cases like this, where the SIMD/SIMT processor pays for all lanes regardless of whether or not they're active, the KSA design is faster than more theoretically work-efficient parallel scans like the Blelloch scan (which we use at the workgroup granularity).
+
+```cpp
+T inclusive_scan(T value)
+{
+    rhs = shuffleUp(value, 1)
+    value = value + (firstInvocation ? identity : rhs)
+
+    [unroll]
+    for (i = 1; i < SubgroupSizeLog2; i++)
+    {
+        nextLevelStep = 1 << i
+        rhs = shuffleUp(value, nextLevelStep)
+        value = value + (nextLevelStep out of bounds ? identity : rhs)
+    }
+    return value
+}
+```
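+
+To make the data flow of the pseudocode above concrete, here is a small CPU simulation of it. This is only a sketch: a `std::vector` stands in for one register across all invocations of a subgroup, and `shuffle_up_or_identity` and `kogge_stone_inclusive_add` are hypothetical helpers written for this illustration, not Nabla functions.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Stand-in for a subgroup shuffle-up: lane i reads the value held by
+// lane i - delta, or the identity when that lane would be out of bounds.
+std::vector<int> shuffle_up_or_identity(const std::vector<int>& lanes, std::size_t delta, int identity)
+{
+    std::vector<int> out(lanes.size(), identity);
+    for (std::size_t i = delta; i < lanes.size(); ++i)
+        out[i] = lanes[i - delta];
+    return out;
+}
+
+// Kogge-Stone inclusive scan for addition over a simulated subgroup.
+// The step doubles each iteration: 1, 2, 4, ... (log2 n steps in total).
+std::vector<int> kogge_stone_inclusive_add(std::vector<int> lanes)
+{
+    const int identity = 0;
+    for (std::size_t step = 1; step < lanes.size(); step <<= 1)
+    {
+        // Every lane reads its neighbour's value before any lane writes,
+        // which is what executing the shuffle in lockstep would give.
+        const std::vector<int> rhs = shuffle_up_or_identity(lanes, step, identity);
+        for (std::size_t i = 0; i < lanes.size(); ++i)
+            lanes[i] += rhs[i];
+    }
+    return lanes;
+}
+```
+
+Note that the simulation takes a snapshot of all lanes before each update; the scan is only correct if every invocation's `value` is up to date at the moment of the shuffle.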
+
+In addition, Nabla also supports passing vectors into these subgroup operations, so you can perform reductions or scans on up to subgroup size * 4 (for `vec4`) elements per call.
+Note that it expects the elements in the vectors to be consecutive and in the same order as the input array.
+This is because we've found through benchmarking that instructing the GPU to do a vector load/store results in faster performance than any attempt at coalesced load/store with striding.
+
+We also found shuffles and vector arithmetic to be very expensive, and so having the least amount of data exchange between invocations and pre-scanning up to 4 elements within an invocation was significantly faster.
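+
+The idea of pre-scanning within an invocation can be sketched on the CPU as follows. This is an illustration of the general technique under assumed names (`Items`, `inclusive_scan_4_per_invocation`), not Nabla's implementation; a serial loop stands in for the cross-invocation scan that would use shuffles on the GPU.
+
+```cpp
+#include <array>
+#include <cstddef>
+#include <vector>
+
+// Each simulated invocation holds 4 consecutive input elements.
+using Items = std::array<int, 4>;
+
+std::vector<Items> inclusive_scan_4_per_invocation(std::vector<Items> lanes)
+{
+    // 1. Every invocation scans its own 4 items; no data exchange is needed.
+    for (Items& v : lanes)
+        for (std::size_t k = 1; k < v.size(); ++k)
+            v[k] += v[k - 1];
+
+    // 2. Exclusive scan, across invocations, of each invocation's total
+    //    (its last item). Only this step needs inter-invocation traffic.
+    int offset = 0; // sum of all preceding invocations
+    for (Items& v : lanes)
+    {
+        const int total = v.back();
+        // 3. Add the preceding invocations' sum to the local results.
+        for (int& x : v)
+            x += offset;
+        offset += total;
+    }
+    return lanes;
+}
+```
+
+With this layout only one value per invocation is exchanged, instead of one per element.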
+
+You can find all the implementations on the [Nabla repository](https://github.com/Devsh-Graphics-Programming/Nabla/blob/v0.6.2-alpha1/include/nbl/builtin/hlsl/subgroup2/arithmetic_portability_impl.hlsl).
+
+## An issue with subgroup sync and reconvergence
+
+Now, onto a pretty significant, but strangely obscure, problem that I ran into while unit testing this prior to release.
+[See the unit tests.](https://github.com/Devsh-Graphics-Programming/Nabla-Examples-and-Tests/blob/master/23_Arithmetic2UnitTest/app_resources/testSubgroup.comp.hlsl)
+Nabla also has implementations for workgroup reduce and scans that make use of the subgroup scans above, and one such section looks like this.
+
+```cpp
+... workgroup scan code ...
+
+debug_barrier()
+for (idx = 0; idx < VirtualWorkgroupSize / WorkgroupSize; idx++)
+{
+    value = getValueFromDataAccessor(memoryIdx)
+
+    value = subgroup::inclusive_scan(value)
+
+    setValueToDataAccessor(memoryIdx)
+
+    if (lastSubgroupInvocation)
+    {
+        setValueToSharedMemory(smemIdx)
+    }
+}
+workgroup_execution_and_memory_barrier()
+
+... workgroup scan code ...
+```
+
+_I should note that this is the first level of scans for the workgroup scope. It is only one step of the algorithm and the data accesses are completely independent. Thus, `memoryIdx` is unique per invocation, and shared memory is only written to in this step, to be read in later steps._
+
+At first glance, it looks fine, and it does produce the expected results for the most part... except in some very specific cases.
+After some more testing and debugging to try and identify the cause, I've found the conditions to be:
+
+* using an Nvidia GPU
+* using emulated versions of subgroup operations
+* a decent number of iterations in the loop (in this case at least 8).
+
+I tested this on an Intel GPU, to be sure, and the workgroup scan ran correctly.
+This was very baffling initially. And the results produced on an Nvidia device looked like a sync problem.
+
+It was even more convincing when I moved the control barrier inside the loop and it immediately produced correct scan results.
+
+```cpp
+... workgroup scan code ...
+
+debug_barrier()
+for (idx = 0; idx < VirtualWorkgroupSize / WorkgroupSize; idx++)
+{
+    value = getValueFromDataAccessor(memoryIdx)
+
+    value = subgroup::inclusive_scan(value)
+
+    setValueToDataAccessor(memoryIdx)
+
+    if (lastSubgroupInvocation)
+    {
+        setValueToSharedMemory(smemIdx)
+    }
+    workgroup_execution_and_memory_barrier()
+}
+
+... workgroup scan code ...
+```
+
+Ultimately, we came to the conclusion that the subgroup invocations were probably somehow falling out of sync as the loop went on.
+Particularly, the effect we're seeing is a shuffle done as if `value` is not in lockstep at the call site.
+We tested using a subgroup execution barrier and maximal reconvergence.
+Strangely enough, just a memory barrier also fixed it, which it shouldn't have as subgroup shuffles are magical intrinsics that take arguments by copy and don't really deal with accessing any memory locations (SSA form).
+
+```cpp
+T inclusive_scan(T value)
+{
+    subgroup_execution_barrier()
+    rhs = shuffleUp(value, 1)
+    value = value + (firstInvocation ? identity : rhs)
+
+    [unroll]
+    for (i = 1; i < SubgroupSizeLog2; i++)
+    {
+        nextLevelStep = 1 << i
+        subgroup_execution_barrier()
+        rhs = shuffleUp(value, nextLevelStep)
+        value = value + (nextLevelStep out of bounds ? identity : rhs)
+    }
+    return value
+}
+```
+
+However, this problem was only observed on Nvidia devices.
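+
+For readers who prefer real shader code, the emulated scan above could be sketched in GLSL roughly as follows. This is only a minimal sketch, not the implementation used in the benchmark: the helper name `emulatedInclusiveAdd`, the `Data` buffer and the workgroup size of 256 are made up for illustration. It only handles `uint` addition, and it assumes that the dispatch covers the buffer exactly and that every invocation of the subgroup is active.
+
+```glsl
+#version 450
+#extension GL_KHR_shader_subgroup_basic : require
+#extension GL_KHR_shader_subgroup_shuffle_relative : require
+
+layout(local_size_x = 256) in;
+// Hypothetical buffer, only here to make the sketch self-contained
+layout(std430, binding = 0) buffer Data { uint values[]; };
+
+// Hypothetical helper: Hillis-Steele inclusive add scan built from shuffles
+uint emulatedInclusiveAdd(uint value)
+{
+    for (uint stride = 1u; stride < gl_SubgroupSize; stride <<= 1u)
+    {
+        // Subgroup-scope execution barrier before every shuffle
+        subgroupBarrier();
+        uint rhs = subgroupShuffleUp(value, stride);
+        // The shuffle result is undefined for the lowest `stride` invocations,
+        // so they add the identity (0) instead
+        value += (gl_SubgroupInvocationID >= stride) ? rhs : 0u;
+    }
+    return value;
+}
+
+void main()
+{
+    // Assumes there are no out-of-bounds invocations
+    uint i = gl_GlobalInvocationID.x;
+    values[i] = emulatedInclusiveAdd(values[i]);
+}
+```
+
+With `GL_KHR_shader_subgroup_arithmetic` enabled, the native counterpart of this helper is a single `subgroupInclusiveAdd(value)` call.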
+
+As a side note, and somewhat surprisingly, using the `SPV_KHR_maximal_reconvergence` extension doesn't resolve this issue.
+I feel I should point out that many presentations and code listings give the impression that subgroup shuffle operations execute in lockstep, based on the very simple examples they provide.
+
+For instance, [the example in this presentation](https://vulkan.org/user/pages/09.events/vulkanised-2025/T08-Hugo-Devillers-SaarlandUniversity.pdf) correctly demonstrates a case where invocations in a tangle read from and store to an SSBO, but it may mislead readers into not considering availability and visibility in other scenarios that need them.
+
+Such simple examples are good enough to demonstrate the purpose of the extension, but they don't elaborate on these details.
+Had the example contained a read-after-write between subgroup invocations, subgroup-scope memory dependencies would have been needed.
+
+(With that said, since subgroup operations are SSA and take arguments "by copy", this discussion of Memory Dependencies and availability-visibility is not relevant to our problem, but just something to be aware of.)
+
+### A minor detour onto the performance of native vs. emulated on Nvidia devices
+
+Since all recent Nvidia GPUs support the subgroup arithmetic SPIR-V capability, why were we using emulation with shuffles?
+I think this observation warrants a small discussion section of its own.
+The tables below show numbers from our benchmark of a subgroup inclusive scan, measured with Nvidia's Nsight Graphics profiler, using native SPIR-V instructions and our emulated version.
+
+#### Native
+
+| Workgroup size | SM throughput (%) | CS warp occupancy (%) | # registers | Dispatch time (ms) |
+| :------------: | :---------------: | :-------------------: | :---------: | :----------------: |
+| 256           | 41.6              | 90.5                  | 16            | 27                |
+| 512           | 41.4              | 89.7                  | 16            | 27.15             |
+| 1024          | 40.5              | 59.7                  | 16            | 27.74             |
+
+#### Emulated
+
+| Workgroup size | SM throughput (%) | CS warp occupancy (%) | # registers | Dispatch time (ms) |
+| :------------: | :---------------: | :-------------------: | :---------: | :----------------: |
+| 256           | 37.9              | 90.7                  | 16            | 12.22             |
+| 512           | 37.7              | 90.3                  | 16            | 12.3              |
+| 1024          | 37.1              | 60.5                  | 16            | 12.47             |
+
+These numbers are baffling to say the least, particularly the fact that our emulated subgroup scans are more than twice as fast as the native solution.
+It should be noted that this is with the subgroup barrier before every shuffle; adding it did not cause any marked decrease in performance.
+
+A potential explanation for this may be that Nvidia has to account for any inactive invocations in a subgroup, having them behave as if they contribute the identity element $I$ to the scan.
+Our emulated scan instead requires callers to invoke the arithmetic in subgroup-uniform fashion.
+If that is not the case, this seems like a cause for concern for Nvidia's SPIR-V to SASS compiler.
+
+### What could cause this behavior on Nvidia? — The Independent Program Counter
+
+We think a potential culprit for this could be Nvidia's Independent Program Counter (IPC) that was introduced with the Volta architecture.
+
+Prior to Volta, all threads in a subgroup shared the same program counter, which handled the scheduling of instructions across all those threads.
+This meant all threads in the same subgroup executed the same instruction at any given time.
+Therefore, when threads in the same subgroup branch differently, all execution paths generally have to be executed one after another, with the threads that should not be active on a given path masked off.
+
+<figure class="image">
+    ![Pascal and prior SIMT model](pascal_simt_model.png "Pascal and prior SIMT model")
+    <figcaption>Thread scheduling under the SIMT warp execution model of Pascal and earlier NVIDIA GPUs. Taken from [NVIDIA TESLA V100 GPU ARCHITECTURE](https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf)</figcaption>
+</figure>
+
+From Volta onwards, each thread has its own program counter, which allows it to execute independently of the other threads in the same subgroup.
+This also provides a new possibility on Nvidia devices: you can now synchronize threads within the same subgroup.
+At any given cycle the active invocations still execute the same instruction, but different invocations of a subgroup can be at different locations in the program (e.g. different iterations of a loop).
+
+<figure class="image">
+    ![Volta Independent Thread Scheduling model](volta_scheduling_model.png "Volta Independent Thread Scheduling model")
+    <figcaption>Independent thread scheduling in Volta architecture onwards interleaving execution from divergent branches, using an explicit sync to reconverge threads. Taken from [NVIDIA TESLA V100 GPU ARCHITECTURE](https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf)</figcaption>
+</figure>
+
+In CUDA, this is exposed through `__syncwarp()`, and we can do something similar in Vulkan using subgroup control barriers.
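+
+As a rough illustration of what that looks like on the Vulkan side, here is a minimal GLSL sketch in which invocations diverge, meet again at a subgroup control barrier, and only then exchange values. The buffer layout, the workgroup size and the odd/even split are assumptions made up for the example, and it presumes an even subgroup size with all invocations active.
+
+```glsl
+#version 450
+#extension GL_KHR_shader_subgroup_basic : require
+#extension GL_KHR_shader_subgroup_shuffle : require
+
+layout(local_size_x = 64) in;
+// Hypothetical buffer, only here to make the sketch self-contained
+layout(std430, binding = 0) buffer Data { uint values[]; };
+
+void main()
+{
+    uint i = gl_GlobalInvocationID.x;
+    uint v = values[i];
+
+    // Divergent work: even and odd invocations take different paths
+    if ((gl_SubgroupInvocationID & 1u) == 0u)
+        v *= 2u;
+    else
+        v += 1u;
+
+    // Subgroup control barrier: every active invocation has to execute it
+    // before any of them is allowed to continue
+    subgroupBarrier();
+
+    // Neighbouring invocations swap their results
+    values[i] = subgroupShuffleXor(v, 1u);
+}
+```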
+
+The IPC also enables starvation-free algorithms on CUDA, along with the use of mutexes where a thread that attempts to acquire a mutex is guaranteed to eventually succeed. Consider the example in the Volta whitepaper of a doubly linked list:
+
+```cpp
+__device__ void insert_after(Node* a, Node* b)
+{
+    Node* c;
+    lock(a);
+    lock(a->next);
+    c = a->next;
+
+    a->next = b;
+    b->prev = a;
+
+    b->next = c;    
+    c->prev = b;
+
+    unlock(c);
+    unlock(a);
+}
+```
+
+The diagram shows how, with IPC, even if thread K holds the lock for node A, another thread J in the same subgroup (warp in the case of CUDA) can wait for the lock to become available and not affect K's progress.
+
+<figure class="image">
+    ![Doubly Linked List lock](linked_list_lock.png "Doubly Linked List lock")
+    <figcaption>Locks are acquired for nodes A and C, shown on the left, before the thread inserts node B, shown on the right. Taken from [NVIDIA TESLA V100 GPU ARCHITECTURE](https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf)</figcaption>
+</figure>
+
+In our case however, it's entirely possible that each subgroup shuffle operation does not run in lockstep with the branching introduced, which would be why subgroup execution barriers are our solution to the problem for now.
+
+Unfortunately, even after hours of scouring the SPIR-V specification, I couldn't find any explicit mention that confirms whether subgroup shuffle operations actually imply an execution dependency.
+
+So then we either have...
+
+## This is a gray area of the Subgroup Shuffle Spec and allowed Undefined Behaviour
+
+Consider what it means if subgroup convergence doesn't guarantee that active tangle invocations execute a subgroup operation in lockstep.
+
+Subgroup ballot and ballot arithmetic are two cases where you don't have to consider lockstepness, because the return value of a ballot is expected to be uniform in a tangle and, as a corollary, it is known exactly what it should be.
+
+Similarly, for subgroup broadcasts, the value being broadcast first needs to be computed, say by invocation K.
+Even if the other invocations don't run in lockstep, they can't read the value until invocation K broadcasts it, since they all have to read the same value (uniformity), and it is known what that value should be (the broadcasting invocation can check that it got the same value back).
+
+On the flip side, reductions will always produce a uniform return value for all invocations, even if you reduce a stale or out-of-lockstep input value.
+
+Meanwhile, subgroup operations that don't return tangle-uniform values, such as shuffles and scans, would produce the expected result only if performed on constants or on variables written with an execution dependency.
+These operations can give different results per invocation so there's no implied uniformity, which means there's no reason to expect any constraints on their apparent lockstepness being implied transitively through the properties of the return value.
+
+The important consideration then is how a subgroup operation is implemented.
+When the specification of a subgroup operation doesn't explicitly state that it has to be executed by all invocations at the same time, we can imagine a scenario where a shuffle is as simple as the receiving invocation snooping another's register without requiring any action on the latter's part.
+That comes with obvious dangers under IPC: without other execution dependencies, snooping the register before it gets written or after it gets overwritten will surely produce inconsistent results.
+
+This leads to code listings like the following becoming undefined behavior simply by changing the `Broadcast` into a `Shuffle`.
+
+```cpp
+// Broadcasting after computation
+// OK, only counts active invocations in tangle (doesn't change)
+int count = subgroupBallotBitCount(true);
+// OK, done on a constant
+int index = subgroupExclusiveAdd(1);
+int base, base_slot;
+if (subgroupElect())
+    base_slot = atomicAdd(dst.size,count);
+// NOT OK, `base_slot` may not be available or visible, or other invocations may even have raced ahead of the elected one
+// Not every invocation will see the correct value of `base_slot` from the elected one if the memory dependency is not ensured
+base = subgroupBroadcastFirst(base_slot);
+```
+
+Similarly again, with [this example from the Khronos blog on maximal reconvergence](https://www.khronos.org/blog/khronos-releases-maximal-reconvergence-and-quad-control-extensions-for-vulkan-and-spir-v)
+
+```cpp
+// OK, thanks to subgroup uniform control flow, no wiggle room here (need to know all invocation values)
+if (subgroupAny(needs_space)) {
+   // OK, narrowly because `subgroupBallot` returns a ballot that's uniform in a tangle
+   uvec4 mask = subgroupBallot(needs_space);
+   // OK, because `mask` is tangle-uniform
+   uint size = subgroupBallotBitCount(mask);
+   uint base = 0;
+   if (subgroupElect())
+     base = atomicAdd(b.free, size);
+
+    // NOT OK if Broadcast is replaced with Shuffle: non-elected invocations could race ahead or not see (visibility) the `base` value in the elected invocation before that one executes the shuffle
+    base = subgroupBroadcastFirst(base);
+    // OK, but only because `mask` is tangle-uniform
+    uint offset = subgroupBallotExclusiveBitCount(mask);
+
+    if (needs_space)
+      b.data[base + offset] = ...;
+}
+```
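+
+If one takes that reading seriously, a defensive shuffle-based variant would put a subgroup execution barrier between the elected invocation's write and the shuffle that reads it. The GLSL below is a minimal sketch under that assumption. Whether the barrier is actually required by the specification is exactly the open question here, and the `Dst` buffer, its members and the workgroup size are hypothetical names chosen for the example.
+
+```glsl
+#version 450
+#extension GL_KHR_shader_subgroup_basic : require
+#extension GL_KHR_shader_subgroup_ballot : require
+#extension GL_KHR_shader_subgroup_shuffle : require
+
+layout(local_size_x = 64) in;
+// Hypothetical output buffer: a running size followed by the payload
+layout(std430, binding = 0) buffer Dst { uint size; uint data[]; } dst;
+
+void main()
+{
+    // Tangle-uniform mask of the active invocations
+    uvec4 active = subgroupBallot(true);
+    uint count = subgroupBallotBitCount(active);
+    uint index = subgroupBallotExclusiveBitCount(active);
+
+    uint base_slot = 0u;
+    if (subgroupElect())
+        base_slot = atomicAdd(dst.size, count);
+
+    // Make sure the elected invocation has produced `base_slot`
+    // before anyone tries to read it through a shuffle
+    subgroupBarrier();
+
+    // subgroupElect() picks the lowest active invocation, so read from that one
+    uint elected = subgroupBallotFindLSB(active);
+    uint base = subgroupShuffle(base_slot, elected);
+
+    // Assumes `data` is large enough for every write
+    dst.data[base + index] = gl_GlobalInvocationID.x;
+}
+```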
+
+With all that said, it needs to be noted that one can't expect every instruction to run in lockstep, as that would negate the advantages of Nvidia's IPC.
+
+## Or a bug in Nvidia's SPIR-V to SASS compiler
+
+Crucially, it's impossible to know (or to discuss, in the case of a signed NDA) what is actually happening on Nvidia's side, for either the bug or the performance regression.
+Unlike with AMD's RDNA ISAs, where we can use Radeon GPU Analyzer to verify that the compiler is doing what it should, Nvidia's generated SASS is inaccessible and the compiler is not public.
+
+----------------------------
+_This issue was observed happening inconsistently on Nvidia driver version 576.80, released 17th June 2025._
diff --git a/blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/linked_list_lock.png b/blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/linked_list_lock.png
new file mode 100644
index 0000000..6ecf59e
Binary files /dev/null and b/blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/linked_list_lock.png differ
diff --git a/blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/pascal_simt_model.png b/blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/pascal_simt_model.png
new file mode 100644
index 0000000..d6f4700
Binary files /dev/null and b/blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/pascal_simt_model.png differ
diff --git a/blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/volta_scheduling_model.png b/blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/volta_scheduling_model.png
new file mode 100644
index 0000000..6ee1c2b
Binary files /dev/null and b/blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/volta_scheduling_model.png differ
diff --git a/blog/authors.yml b/blog/authors.yml
index f7e1c12..dc95fa4 100644
--- a/blog/authors.yml
+++ b/blog/authors.yml
@@ -27,9 +27,27 @@ jaked:
 
 fletterio:
   name: Francisco Letterio
-  title: Junior Developer @ DevSH GP
+  title: Junior Developer @ DevSH Graphics Programming Sp. z O.O.
   url: https://github.com/Fletterio
   image_url: https://avatars.githubusercontent.com/u/40742817?v=4
   page: true
   socials:
-    github: Fletterio
\ No newline at end of file
+    github: Fletterio
+
+keptsecret:
+  name: Sorakrit Chonwattanagul
+  title: Associate Developer @ DevSH Graphics Programming Sp. z O.O.
+  url: https://github.com/keptsecret/
+  image_url: https://avatars.githubusercontent.com/u/27181108?v=4
+  page: true
+  socials:
+    github: keptsecret
+
+devshgraphicsprogramming:
+  name: Mateusz Kielan
+  title: CTO of DevSH Graphics Programming Sp. z O.O.
+  url: https://www.devsh.eu/
+  image_url: https://avatars.githubusercontent.com/u/6894321?v=4
+  page: true
+  socials:
+    github: devshgraphicsprogramming
diff --git a/docusaurus.config.ts b/docusaurus.config.ts
index 82c8b01..c5e8ed4 100644
--- a/docusaurus.config.ts
+++ b/docusaurus.config.ts
@@ -132,7 +132,7 @@ const config: Config = {
           items: [
             {
               label: "Discord",
-              href: "https://discord.com/invite/graphicsprogramming",
+              href: "https://discord.graphics-programming.org/",
             },
             {
               label: "YouTube",
diff --git a/src/pages/index.tsx b/src/pages/index.tsx
index ae085c1..91eef9a 100644
--- a/src/pages/index.tsx
+++ b/src/pages/index.tsx
@@ -38,7 +38,7 @@ function HomepageHeader() {
           </Link>
           <Link
             className="button button--secondary button--lg test"
-            to="https://discord.gg/"
+            to="https://discord.graphics-programming.org/"
           >
             Join our Discord Server
           </Link>
diff --git a/static/webring/froglist.json b/static/webring/froglist.json
index 41b2efd..de4e856 100644
--- a/static/webring/froglist.json
+++ b/static/webring/froglist.json
@@ -46,5 +46,41 @@
     "url": "https://juandiegomontoya.github.io/",
     "displayName": "Jake Ryan",
     "description": "A blog about graphics frogramming"
+  },
+  {
+    "name": "neonmoe",
+    "url": "https://blog.neon.moe/",
+    "displayName": "Jens Pitkänen",
+    "description": "A blog about programming, the small web, and arcane personal computing"
+  },
+  {
+    "name": "geometrian",
+    "url": "https://geometrian.com/",
+    "displayName": "Agatha Mallett",
+    "description": "Homepage of Agatha Mallett, including computer graphics research and many other projects!"
+  },
+  {
+    "name": "jmaier",
+    "url": "https://www.jakobmaier.at/",
+    "displayName": "Jakob Maier",
+    "description": "A website where I share my projects and blog posts."
+  },
+  {
+    "name": "edthedev",
+    "url": "https://edward.delaporte.us/",
+    "displayName": "Edward Delaporte",
+    "description": "Interactive JavaScript art, example code, and reference links."
+  },
+  {
+    "name": "rtarun9",
+    "url": "https://rtarun9.github.io/",
+    "displayName": "Tarun R",
+    "description": "Personal site with projects and blogs that are rarely updated"
+  },
+  {
+    "name": "devsh",
+    "url": "https://www.devsh.eu/",
+    "displayName": "DevSH Graphics Programming",
+    "description": "Homepage of DevSH Graphics Programming: Computer Graphics, Computer Geometry & Vision and High Performance Computing Consultancy"
   }
 ]