From b68c714bbc8570d43269fc7a165b604ac5774ebe Mon Sep 17 00:00:00 2001
From: "Ma, Guokai" <guokai.ma@intel.com>
Date: Tue, 11 Apr 2023 08:23:29 +0800
Subject: [PATCH 01/11] initial document for accelerator interface

---
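Note (illustrative, not part of the commit): the `exclude` change in this patch keeps the `check-torchcuda` hook away from the new tutorial, which quotes `torch.cuda` calls on purpose. The sketch below shows how the pattern is expected to match, assuming pre-commit treats `exclude` as a Python regular expression searched against the repo-relative path. The pattern is abbreviated to three of the alternatives in the hunk, and `is_excluded` is a hypothetical helper name, not something in the repository.

```python
import re

# Abbreviated copy of the `exclude` value after this patch (three alternatives only).
EXCLUDE = r"^(.github/workflows/|scripts/check-torchcuda.py|docs/_tutorials/how-to-accelerator-interface.md)"


def is_excluded(path: str) -> bool:
    """Return True when the hook would skip `path` (assumes re.search semantics)."""
    return re.search(EXCLUDE, path) is not None


print(is_excluded("docs/_tutorials/how-to-accelerator-interface.md"))  # the tutorial is skipped
print(is_excluded("deepspeed/runtime/engine.py"))  # other files are still checked
```

Because the pattern is anchored with `^`, a path is skipped only when it starts with one of the listed alternatives, so renaming the tutorial requires updating this list as well.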
 .pre-commit-config.yaml                       |  2 +-
 .../how-to-accelerator-interface.md           | 51 +++++++++++++++++++
 2 files changed, 52 insertions(+), 1 deletion(-)
 create mode 100644 docs/_tutorials/how-to-accelerator-interface.md

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index c0250f243178..27b3027d1201 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -75,5 +75,5 @@ repos:
         name: check-torchcuda
         entry: ./scripts/check-torchcuda.py
         language: script
-        exclude: ^(.github/workflows/|scripts/check-torchcuda.py|accelerator/cuda_accelerator.py|deepspeed/inference/engine.py|deepspeed/model_implementations/transformers/clip_encoder.py|deepspeed/model_implementations/diffusers/vae.py|deepspeed/model_implementations/diffusers/unet.py|op_builder/spatial_inference.py|op_builder/transformer_inference.py|op_builder/builder.py|setup.py|tests/unit/ops/sparse_attention/test_sparse_attention.py)
+        exclude: ^(.github/workflows/|scripts/check-torchcuda.py|docs/_tutorials/how-to-accelerator-interface.md|accelerator/cuda_accelerator.py|deepspeed/inference/engine.py|deepspeed/model_implementations/transformers/clip_encoder.py|deepspeed/model_implementations/diffusers/vae.py|deepspeed/model_implementations/diffusers/unet.py|op_builder/spatial_inference.py|op_builder/transformer_inference.py|op_builder/builder.py|setup.py|tests/unit/ops/sparse_attention/test_sparse_attention.py)
         # Specific deepspeed/ files are excluded for now until we wrap ProcessGroup in deepspeed.comm
diff --git a/docs/_tutorials/how-to-accelerator-interface.md b/docs/_tutorials/how-to-accelerator-interface.md
new file mode 100644
index 000000000000..3eb3f807549c
--- /dev/null
+++ b/docs/_tutorials/how-to-accelerator-interface.md
@@ -0,0 +1,51 @@
+---
+title: How-to DeepSpeed Accelerator Abstraction Interface
+---
+
+DeepSpeed Accelerator Interface is introduced to allow user to run large language model seamlessly on different Deep Learning acceleration hardware seamlessly with DeepSpeed.   It provides a set of accelerator runtime and accelerator op builder interface which can be implemented for different hardware.  It also allows user to use the interface to write large language model code that does not has hardware specific code.  With DeepSpeed Accelerator Interface, user can run the same large language model on different hareware platform, without the need to rewrite model code for different hardware.  This makes running large language model on different hardware easier.
+
+This document cover two topics related to DeepSpeed Accelerator Abstraction Interface:
+1. How to write accelerator agnostic models with DeepSpeed Accelerator Abstraction Interface
+2. How to make a new accelerator implementation for DeepSpeed Accelerator Abstraction Interface
+
+# How to write accelerator agnostic models with DeepSpeed Accelerator Abstraction Interface
+In this part, you will learn how to write a model that does not contain HW specific code, or how to port a model that run on a specific HW only to be device agnostic.  To do this, we first import `get_accelerator` from `deepspeed.accelerator`
+
+```
+from deepspeed.accelerator import get_accelerator
+```
+
+`get_accelerator()` is the single entrance to DeepSpeed Accelerator Abstraction Interface
+
+<code that use accelerator functionality> get_accelerator().<interface name>(...)
+For existing torch.cuda.<interface name> runtime call, we convert it like the following example:
+
+if torch.cuda.is_available():
+    ...
+-->
+
+if get_accelerator().is_available():
+    ...
+For CUDA specific device name such as 'cuda' or 'cuda:0', or 'cuda:1', we convert them to get_accelerator().device_name(), get_accelerator().device_name(0), and get_accelerator().device_name(1).
+
+It is a little bit trick when we convert places where torch.cuda.current_device() are called. Current device return device index, but if we supply device index in Pytorch code where a device is needed, Pytorch will explain it as a CUDA device. To get current device that can be used as a device name, we need to call get_accelerator().current_device_name():
+
+my_tensor = torch.empty(3, 4, device=get_accelerator().current_device_name())
+Only when an integer number is expected we use get_accelerator().current_device():
+
+idx = get_accelerator().current_device()
+default_generator = get_accelerator().default_generator(idx)
+Tensor operations
+When we convert a torch tensor to accelerator device such as my_tensor.cuda(), we use my_tensor.to(get_accelerator().deivce_name())
+
+When we check whether a torch tensor is on accelerator device such as my_tensor.is_cuda, we use get_accelerator().on_accelerator(my_tensor)
+
+When pin a tensor to GPU memory such as my_tensor.pin_memory(), we use get_accelerator().pin_memory(my_tensor)
+
+Communication backend
+When a communication backend string is used, the interface get_accelerator().communication_backend_name() is used get get communication backend name. So instead of torch.distributed.init_process_group('nccl'), we use torch.distributed.init_process_group(get_accelerator().communication_backend_name())
+
+Op builder abstraction
+Op builders are abstracted through get_accelerator().create_op_builder(<op builder name>), if the op builder is implemented in the accelerator, an object of OpBuilder subclass will be returned. If the op builder is not implemented, None will be returned.
+
+A typical implementation can be referred to from the CUDA implementation, or from an XPU implementation which will be released later. Typical call such as CPUAdamBuilder().load() can be convert to get_accelerator().create_op_builder("CPUAdamBuilder").load().

From 599d8620d1c90a698ac349c32fedd2e30a6de5ce Mon Sep 17 00:00:00 2001
From: "Ma, Guokai" <guokai.ma@gmail.com>
Date: Tue, 11 Apr 2023 10:59:30 +0800
Subject: [PATCH 02/11] Update how-to-accelerator-interface.md

---
 .../how-to-accelerator-interface.md           | 29 +++++++++++++------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/docs/_tutorials/how-to-accelerator-interface.md b/docs/_tutorials/how-to-accelerator-interface.md
index 3eb3f807549c..458c1f58e40e 100644
--- a/docs/_tutorials/how-to-accelerator-interface.md
+++ b/docs/_tutorials/how-to-accelerator-interface.md
@@ -1,31 +1,40 @@
 ---
-title: How-to DeepSpeed Accelerator Abstraction Interface
+title: DeepSpeed Accelerator Abstraction Interface
+tags: getting-started
 ---
 
+# Contents
+  * [Introduction](#introduction)
+  * [Write accelerator agnostic models](#write-accelerator-agnostic-models)
+  * [Implement new accelerator extension](#implement-new-accelerator-extension)
+
+# Introduction
 DeepSpeed Accelerator Interface is introduced to allow user to run large language model seamlessly on different Deep Learning acceleration hardware seamlessly with DeepSpeed.   It provides a set of accelerator runtime and accelerator op builder interface which can be implemented for different hardware.  It also allows user to use the interface to write large language model code that does not has hardware specific code.  With DeepSpeed Accelerator Interface, user can run the same large language model on different hareware platform, without the need to rewrite model code for different hardware.  This makes running large language model on different hardware easier.
 
 This document cover two topics related to DeepSpeed Accelerator Abstraction Interface:
-1. How to write accelerator agnostic models with DeepSpeed Accelerator Abstraction Interface
-2. How to make a new accelerator implementation for DeepSpeed Accelerator Abstraction Interface
+1. Write accelerator agnostic models using DeepSpeed Accelerator Abstraction Interface
+2. Implement new accelerator extension for DeepSpeed Accelerator Abstraction Interface
 
-# How to write accelerator agnostic models with DeepSpeed Accelerator Abstraction Interface
+# Write accelerator agnostic models
 In this part, you will learn how to write a model that does not contain HW specific code, or how to port a model that run on a specific HW only to be device agnostic.  To do this, we first import `get_accelerator` from `deepspeed.accelerator`
-
 ```
 from deepspeed.accelerator import get_accelerator
 ```
+Note: `get_accelerator()` is the single entrance to DeepSpeed Accelerator Abstraction Interface
+## Port accelerator runtime calls
+First we need to port accelerator runtime calls.  On CUDA device, accelerator runtime call appears in the form of `torch.cuda.<interface>(...)`.   With DeepSpeed Accelerator Abstract Interface, such accelerator runtime call can be written in the form of `get_accelerator().<interface>(...)` which will be accelerator agnostic.
 
-`get_accelerator()` is the single entrance to DeepSpeed Accelerator Abstraction Interface
-
-<code that use accelerator functionality> get_accelerator().<interface name>(...)
 For existing torch.cuda.<interface name> runtime call, we convert it like the following example:
 
+```
 if torch.cuda.is_available():
     ...
+```
 -->
-
+```
 if get_accelerator().is_available():
     ...
+```
 For CUDA specific device name such as 'cuda' or 'cuda:0', or 'cuda:1', we convert them to get_accelerator().device_name(), get_accelerator().device_name(0), and get_accelerator().device_name(1).
 
 It is a little bit trick when we convert places where torch.cuda.current_device() are called. Current device return device index, but if we supply device index in Pytorch code where a device is needed, Pytorch will explain it as a CUDA device. To get current device that can be used as a device name, we need to call get_accelerator().current_device_name():
@@ -49,3 +58,5 @@ Op builder abstraction
 Op builders are abstracted through get_accelerator().create_op_builder(<op builder name>), if the op builder is implemented in the accelerator, an object of OpBuilder subclass will be returned. If the op builder is not implemented, None will be returned.
 
 A typical implementation can be referred to from the CUDA implementation, or from an XPU implementation which will be released later. Typical call such as CPUAdamBuilder().load() can be convert to get_accelerator().create_op_builder("CPUAdamBuilder").load().
+
+# Implement new accelerator extension

From f0d862c6530b5d6fe8a418a7573511ef20de199b Mon Sep 17 00:00:00 2001
From: "Ma, Guokai" <guokai.ma@intel.com>
Date: Tue, 11 Apr 2023 16:28:41 +0800
Subject: [PATCH 03/11] improve accelerator interface document

---
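Note (illustrative, not part of the commit): the "Implement new accelerator extension" section added here describes an `XYZ_Accelerator` class and its op builders. The sketch below shows the rough shape of such a class without importing DeepSpeed. `AcceleratorInterfaceStub` and `CPUAdamBuilderStub` are hypothetical stand-ins for `DeepSpeedAccelerator` and an `OpBuilder` subclass, and the `'xyz'` and `'xyz_ccl'` strings are made up. A real extension has to implement the full abstract interface, not only these three methods.

```python
from abc import ABC, abstractmethod


class AcceleratorInterfaceStub(ABC):
    """Hypothetical stand-in for a small subset of the DeepSpeedAccelerator interface."""

    @abstractmethod
    def device_name(self, device_index=None): ...

    @abstractmethod
    def communication_backend_name(self): ...

    @abstractmethod
    def create_op_builder(self, class_name): ...


class CPUAdamBuilderStub:
    """Hypothetical op builder; a real one would inherit from OpBuilder."""

    def load(self):
        return "cpu_adam kernel loaded"


class XYZ_Accelerator(AcceleratorInterfaceStub):
    # Builders this extension implements; anything else is reported as missing.
    _builders = {"CPUAdamBuilder": CPUAdamBuilderStub}

    def device_name(self, device_index=None):
        # 'xyz' without an index, 'xyz:N' with one, mirroring 'cuda' and 'cuda:N'.
        return "xyz" if device_index is None else f"xyz:{device_index}"

    def communication_backend_name(self):
        return "xyz_ccl"  # made-up backend name

    def create_op_builder(self, class_name):
        # Return a builder object when implemented, None otherwise.
        builder_cls = self._builders.get(class_name)
        return builder_cls() if builder_cls is not None else None


acc = XYZ_Accelerator()
print(acc.device_name(1))
print(acc.create_op_builder("CPUAdamBuilder").load())
print(acc.create_op_builder("FusedAdamBuilder"))
```

In a real extension the instance would be registered with `deepspeed.accelerator.set_accelerator(XYZ_Accelerator())`, as the tutorial text describes.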
 .pre-commit-config.yaml                       |  2 +-
 docs/_tutorials/accelerator-interface.md      | 92 +++++++++++++++++++
 .../how-to-accelerator-interface.md           | 62 -------------
 3 files changed, 93 insertions(+), 63 deletions(-)
 create mode 100644 docs/_tutorials/accelerator-interface.md
 delete mode 100644 docs/_tutorials/how-to-accelerator-interface.md

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 27b3027d1201..bb53dd80b031 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -75,5 +75,5 @@ repos:
         name: check-torchcuda
         entry: ./scripts/check-torchcuda.py
         language: script
-        exclude: ^(.github/workflows/|scripts/check-torchcuda.py|docs/_tutorials/how-to-accelerator-interface.md|accelerator/cuda_accelerator.py|deepspeed/inference/engine.py|deepspeed/model_implementations/transformers/clip_encoder.py|deepspeed/model_implementations/diffusers/vae.py|deepspeed/model_implementations/diffusers/unet.py|op_builder/spatial_inference.py|op_builder/transformer_inference.py|op_builder/builder.py|setup.py|tests/unit/ops/sparse_attention/test_sparse_attention.py)
+        exclude: ^(.github/workflows/|scripts/check-torchcuda.py|docs/_tutorials/accelerator-interface.md|accelerator/cuda_accelerator.py|deepspeed/inference/engine.py|deepspeed/model_implementations/transformers/clip_encoder.py|deepspeed/model_implementations/diffusers/vae.py|deepspeed/model_implementations/diffusers/unet.py|op_builder/spatial_inference.py|op_builder/transformer_inference.py|op_builder/builder.py|setup.py|tests/unit/ops/sparse_attention/test_sparse_attention.py)
         # Specific deepspeed/ files are excluded for now until we wrap ProcessGroup in deepspeed.comm
diff --git a/docs/_tutorials/accelerator-interface.md b/docs/_tutorials/accelerator-interface.md
new file mode 100644
index 000000000000..dfc89e60caf3
--- /dev/null
+++ b/docs/_tutorials/accelerator-interface.md
@@ -0,0 +1,92 @@
+---
+title: DeepSpeed Accelerator Abstraction Interface
+tags: getting-started
+---
+
+# Contents
+  * [Introduction](#introduction)
+  * [Write accelerator agnostic models](#write-accelerator-agnostic-models)
+  * [Run DeepSpeed model on different accelerators](#run-deepspeed-on-different-accelerators)
+  * [Implement new accelerator extension](#implement-new-accelerator-extension)
+
+# Introduction
+DeepSpeed Accelerator Interface is introduced to allow user to run large language model seamlessly on different Deep Learning acceleration hardware seamlessly with DeepSpeed.   It provides a set of accelerator runtime and accelerator op builder interface which can be implemented for different hardware.  It also allows user to use the interface to write large language model code that does not has hardware specific code.  With DeepSpeed Accelerator Interface, user can run the same large language model on different hareware platform, without the need to rewrite model code for different hardware.  This makes running large language model on different hardware easier.
+
+This document cover two topics related to DeepSpeed Accelerator Abstraction Interface:
+1. Write accelerator agnostic models using DeepSpeed Accelerator Abstraction Interface
+2. Implement new accelerator extension for DeepSpeed Accelerator Abstraction Interface
+
+# Write accelerator agnostic models
+In this part, you will learn how to write a model that does not contain HW specific code, or how to port a model that run on a specific HW only to be device agnostic.  To do this, we first import `get_accelerator` from `deepspeed.accelerator`
+```
+from deepspeed.accelerator import get_accelerator
+```
+Note: `get_accelerator()` is the single entrance to DeepSpeed Accelerator Abstraction Interface
+## Port accelerator runtime calls
+First we need to port accelerator runtime calls.  On CUDA device, accelerator runtime call appears in the form of `torch.cuda.<interface>(...)`.   With DeepSpeed Accelerator Abstract Interface, such accelerator runtime call can be written in the form of `get_accelerator().<interface>(...)` which will be accelerator agnostic.
+
+A typical conversion looks like the following example:
+
+```
+if torch.cuda.is_available():
+    ...
+```
+-->
+```
+if get_accelerator().is_available():
+    ...
+```
+
+For most `torch.cuda.<interface>(...)` calls, we can literally replace `torch.cuda` with `get_accelerator()`.  However, there are some exceptions that need attention:
+1. For `torch.cuda.current_device()`, we need to know whether the call is meant to get the device index or to supply the return value as a device.  If the return value is used as a device string, call `get_accelerator().current_device_name()` instead.  For example:
+```
+torch.empty(weight_shape, dtype=dtype, device=get_accelerator().current_device_name())
+```
+However, if we wish to get the device index as a number, we should call `get_accelerator().current_device()`:
+```
+local_rank = get_accelerator().current_device()
+```
+2. For `torch.cuda.default_generators[index]`, convert to `get_accelerator().default_generator(index)`
+
+## Port accelerator device name
+For CUDA specific device name such as 'cuda' or 'cuda:0', or 'cuda:1', we convert them to get_accelerator().device_name(), get_accelerator().device_name(0), and get_accelerator().device_name(1).
+
+A device name without index can be used if model need to do specific thing for certain accelerator.  We suggest to make as less as such usage only for situatio can not be resolve other way.
+
+## Tensor operations
+CUDA specific tensor operations needs to be converted according to the following rules:
+- When we convert a torch tensor to accelerator device such as my_tensor.cuda(), we use my_tensor.to(get_accelerator().deivce_name())
+
+- When we check whether a torch tensor is on accelerator device such as my_tensor.is_cuda, we use get_accelerator().on_accelerator(my_tensor)
+
+- When pin a tensor to GPU memory such as my_tensor.pin_memory(), we use get_accelerator().pin_memory(my_tensor)
+
+## Communication backend
+When a communication backend string is used, the interface get_accelerator().communication_backend_name() is used get get communication backend name. So instead of torch.distributed.init_process_group('nccl'), we use torch.distributed.
+```
+init_process_group(get_accelerator().communication_backend_name())
+```
+
+# Run DeepSpeed on different accelerators
+Once a model is ported with DeepSpeed Accelerator Interface, we can run this model on different accelerators using extension to DeepSpeed.  DeepSpeed check whether certain extension is installed in the environment to decide whether to use the Accelerator backend in that extension.  For example if we wish to run model on Intel GPU, we can install Intel Extension for DeepSpeed following the [link](https://github.com/intel/intel-extension-for-deepspeed/)
+
+After the extension is installed, install DeepSpeed and run model.   The model will be running on top of DeepSpeed.   Because DeepSpeed installation is also accelerator related, it is recommended to install DeepSpeed accelerator extension before install DeepSpeed.
+
+`CUDA_Accelerator` is the default accelerator in DeepSpeed.  If no other DeepSpeed accelerator extension is installed, `CUDA_Accelerator` will be used.
+
+When run a model on different accelerator in a cloud environment, the recommended practice is provision environment for each accelerator in different env with tool such as anaconda/miniconda/virtualenv.  When run model on different Accelerator, load the env accordingly.
+
+Note that different accelerator may have different 'flavor' of float16 or bfloat16.   So it is recommended to make the model configurable for both float16 and bfloat16, so model code does not need to be changed.
+
+# Implement new accelerator extension
+It is possible to implement a new DeepSpeed accelerator extension to support new accelerator in DeepSpeed.  An example to follow is [Intel Extension For DeepSpeed](https://github.com/intel/intel-extension-for-deepspeed/).   An accelerator extension contains the following components:
+1. XYZ_Accelerator(DeepSpeedAccelerator) class definition, where 'XYZ' is the accelerator name, such as 'XPU' or 'CPU'.
+This class implements `class DeepSpeedAccelerator` and will be returned by `get_accelerator()` in DeepSpeed.
+2. Op builders following https://github.com/intel/intel-extension-for-deepspeed/tree/main/intel_extension_for_deepspeed/op_builder.   All op builders needs to inherit `deepspeed.ops.op_builder.builder.OpBuilder` directly or indirectly.  A common practice is to implement a base op builder (SYCLOpBuilder in the case of Intel Extension for DeepSpeed) and inherit this base op builder instead.
+3. Op kernels as in the following [link](https://github.com/intel/intel-extension-for-deepspeed/tree/main/intel_extension_for_deepspeed/op_builder/csrc).
+
+Note that an extension does not have to implement all op builders under https://github.com/microsoft/DeepSpeed/tree/master/op_builder at once.  A missing op builder usually means certain DeepSpeed functionality cannot be used on that accelerator, but models that do not use that functionality can still run.
+
+When implementing op builders for an accelerator extension, note that the op builder native code is built by the DeepSpeed JIT load mechanism.  This means the native source files being built need to be in the DeepSpeed installation directory.  However, these files are defined in the accelerator extension installation directory, which DeepSpeed cannot build directly.  To solve this, follow the example in https://github.com/intel/intel-extension-for-deepspeed/blob/main/intel_extension_for_deepspeed/op_builder/cpu_adam.py and use 'sycl_kernel_path' and 'sycl_kernel_include' (users can change 'sycl' to another prefix in their own accelerator extension) so that the native code can be built during DeepSpeed JIT load.
+
+Once an accelerator extension is installed in the environment, it can be used either by explicitly calling `deepspeed.accelerator.set_accelerator(XYZ_Accelerator())`, following the example in https://github.com/microsoft/DeepSpeed/blob/master/accelerator/real_accelerator.py, or by adding implicit detection code to `get_accelerator` in the same file.
diff --git a/docs/_tutorials/how-to-accelerator-interface.md b/docs/_tutorials/how-to-accelerator-interface.md
deleted file mode 100644
index 458c1f58e40e..000000000000
--- a/docs/_tutorials/how-to-accelerator-interface.md
+++ /dev/null
@@ -1,62 +0,0 @@
----
-title: DeepSpeed Accelerator Abstraction Interface
-tags: getting-started
----
-
-# Contents
-  * [Introduction](#introduction)
-  * [Write accelerator agnostic models](#write-accelerator-agnostic-models)
-  * [Implement new accelerator extension](#implement-new-accelerator-extension)
-
-# Introduction
-DeepSpeed Accelerator Interface is introduced to allow user to run large language model seamlessly on different Deep Learning acceleration hardware seamlessly with DeepSpeed.   It provides a set of accelerator runtime and accelerator op builder interface which can be implemented for different hardware.  It also allows user to use the interface to write large language model code that does not has hardware specific code.  With DeepSpeed Accelerator Interface, user can run the same large language model on different hareware platform, without the need to rewrite model code for different hardware.  This makes running large language model on different hardware easier.
-
-This document cover two topics related to DeepSpeed Accelerator Abstraction Interface:
-1. Write accelerator agnostic models using DeepSpeed Accelerator Abstraction Interface
-2. Implement new accelerator extension for DeepSpeed Accelerator Abstraction Interface
-
-# Write accelerator agnostic models
-In this part, you will learn how to write a model that does not contain HW specific code, or how to port a model that run on a specific HW only to be device agnostic.  To do this, we first import `get_accelerator` from `deepspeed.accelerator`
-```
-from deepspeed.accelerator import get_accelerator
-```
-Note: `get_accelerator()` is the single entrance to DeepSpeed Accelerator Abstraction Interface
-## Port accelerator runtime calls
-First we need to port accelerator runtime calls.  On CUDA device, accelerator runtime call appears in the form of `torch.cuda.<interface>(...)`.   With DeepSpeed Accelerator Abstract Interface, such accelerator runtime call can be written in the form of `get_accelerator().<interface>(...)` which will be accelerator agnostic.
-
-For existing torch.cuda.<interface name> runtime call, we convert it like the following example:
-
-```
-if torch.cuda.is_available():
-    ...
-```
--->
-```
-if get_accelerator().is_available():
-    ...
-```
-For CUDA specific device name such as 'cuda' or 'cuda:0', or 'cuda:1', we convert them to get_accelerator().device_name(), get_accelerator().device_name(0), and get_accelerator().device_name(1).
-
-It is a little bit trick when we convert places where torch.cuda.current_device() are called. Current device return device index, but if we supply device index in Pytorch code where a device is needed, Pytorch will explain it as a CUDA device. To get current device that can be used as a device name, we need to call get_accelerator().current_device_name():
-
-my_tensor = torch.empty(3, 4, device=get_accelerator().current_device_name())
-Only when an integer number is expected we use get_accelerator().current_device():
-
-idx = get_accelerator().current_device()
-default_generator = get_accelerator().default_generator(idx)
-Tensor operations
-When we convert a torch tensor to accelerator device such as my_tensor.cuda(), we use my_tensor.to(get_accelerator().deivce_name())
-
-When we check whether a torch tensor is on accelerator device such as my_tensor.is_cuda, we use get_accelerator().on_accelerator(my_tensor)
-
-When pin a tensor to GPU memory such as my_tensor.pin_memory(), we use get_accelerator().pin_memory(my_tensor)
-
-Communication backend
-When a communication backend string is used, the interface get_accelerator().communication_backend_name() is used get get communication backend name. So instead of torch.distributed.init_process_group('nccl'), we use torch.distributed.init_process_group(get_accelerator().communication_backend_name())
-
-Op builder abstraction
-Op builders are abstracted through get_accelerator().create_op_builder(<op builder name>), if the op builder is implemented in the accelerator, an object of OpBuilder subclass will be returned. If the op builder is not implemented, None will be returned.
-
-A typical implementation can be referred to from the CUDA implementation, or from an XPU implementation which will be released later. Typical call such as CPUAdamBuilder().load() can be convert to get_accelerator().create_op_builder("CPUAdamBuilder").load().
-
-# Implement new accelerator extension

From 6964c2866c434de38e8ed0b784404767913d8803 Mon Sep 17 00:00:00 2001
From: "Ma, Guokai" <guokai.ma@intel.com>
Date: Tue, 11 Apr 2023 22:08:27 +0800
Subject: [PATCH 04/11] refine accelerator-interface tutorial

---
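Note (illustrative, not part of the commit): the porting rules in the tutorial are mostly mechanical, so they can be sketched as a small line rewriter. `port_line` is a hypothetical helper, not a DeepSpeed tool, and it covers only the simple cases named in the document. It deliberately leaves `torch.cuda.current_device()` untouched, because the tutorial says a person has to choose between `current_device()` and `current_device_name()`.

```python
import re

ACC = "get_accelerator()"


def port_line(line: str) -> str:
    """Apply the purely mechanical rewrites to one line of model code."""
    # 'cuda:N' -> get_accelerator().device_name(N); 'cuda' -> get_accelerator().device_name()
    line = re.sub(r"""(['"])cuda:(\d+)\1""", rf"{ACC}.device_name(\2)", line)
    line = re.sub(r"""(['"])cuda\1""", f"{ACC}.device_name()", line)
    # tensor.cuda() -> tensor.to(get_accelerator().device_name())
    line = re.sub(r"\.cuda\(\)", f".to({ACC}.device_name())", line)
    # 'nccl' -> get_accelerator().communication_backend_name()
    line = re.sub(r"""(['"])nccl\1""", f"{ACC}.communication_backend_name()", line)
    # torch.cuda.<interface>( -> get_accelerator().<interface>(, except current_device,
    # which needs a human decision between an index and a device name.
    line = re.sub(r"torch\.cuda\.(?!current_device\b)(\w+)\(", rf"{ACC}.\1(", line)
    return line


for src in [
    "if torch.cuda.is_available():",
    "x = torch.empty(3, device='cuda:1')",
    "y = my_tensor.cuda()",
    "torch.distributed.init_process_group('nccl')",
    "idx = torch.cuda.current_device()",
]:
    print(port_line(src))
```

A rewriter like this is only a starting point: attribute checks such as `my_tensor.is_cuda` and indexing into `torch.cuda.default_generators` still need the manual conversions listed in the tutorial.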
 docs/_tutorials/accelerator-interface.md | 37 ++++++++++++++----------
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/docs/_tutorials/accelerator-interface.md b/docs/_tutorials/accelerator-interface.md
index dfc89e60caf3..4a6ea628e359 100644
--- a/docs/_tutorials/accelerator-interface.md
+++ b/docs/_tutorials/accelerator-interface.md
@@ -10,18 +10,19 @@ tags: getting-started
   * [Implement new accelerator extension](#implement-new-accelerator-extension)
 
 # Introduction
-DeepSpeed Accelerator Interface is introduced to allow user to run large language model seamlessly on different Deep Learning acceleration hardware seamlessly with DeepSpeed.   It provides a set of accelerator runtime and accelerator op builder interface which can be implemented for different hardware.  It also allows user to use the interface to write large language model code that does not has hardware specific code.  With DeepSpeed Accelerator Interface, user can run the same large language model on different hareware platform, without the need to rewrite model code for different hardware.  This makes running large language model on different hardware easier.
+The DeepSpeed Accelerator Abstraction Interface allows users to run large language models seamlessly on various deep learning acceleration hardware with DeepSpeed.  It offers a set of accelerator runtime and accelerator op builder interfaces that can be implemented for different hardware.  This means users can write large language model code without hardware-specific code.  With the DeepSpeed Accelerator Abstraction Interface, the same large language model can run on different hardware platforms without rewriting the model code, which makes running large language models on different hardware easier.
 
-This document cover two topics related to DeepSpeed Accelerator Abstraction Interface:
-1. Write accelerator agnostic models using DeepSpeed Accelerator Abstraction Interface
-2. Implement new accelerator extension for DeepSpeed Accelerator Abstraction Interface
+This document covers three topics related to the DeepSpeed Accelerator Abstraction Interface:
+1. Write accelerator agnostic models using the DeepSpeed Accelerator Abstraction Interface.
+2. Run DeepSpeed model on different accelerators.
+3. Implement new accelerator extension for the DeepSpeed Accelerator Abstraction Interface.
 
 # Write accelerator agnostic models
-In this part, you will learn how to write a model that does not contain HW specific code, or how to port a model that run on a specific HW only to be device agnostic.  To do this, we first import `get_accelerator` from `deepspeed.accelerator`
+In this part, you will learn how to write a model that does not contain hardware-specific code, or how to port a model that runs only on specific hardware to be accelerator agnostic.  To do this, we first import `get_accelerator` from `deepspeed.accelerator`:
 ```
 from deepspeed.accelerator import get_accelerator
 ```
-Note: `get_accelerator()` is the single entrance to DeepSpeed Accelerator Abstraction Interface
+Note: `get_accelerator()` is the entry point to the DeepSpeed Accelerator Abstraction Interface.
 ## Port accelerator runtime calls
 First we need to port accelerator runtime calls.  On CUDA device, accelerator runtime call appears in the form of `torch.cuda.<interface>(...)`.   With DeepSpeed Accelerator Abstract Interface, such accelerator runtime call can be written in the form of `get_accelerator().<interface>(...)` which will be accelerator agnostic.
 
@@ -49,37 +50,41 @@ local_rank = get_accelerator().current_device()
 2. For `torch.cuda.default_generators[index]`, convert to `get_accelerator().default_generator(index)`
 
 ## Port accelerator device name
-For CUDA specific device name such as 'cuda' or 'cuda:0', or 'cuda:1', we convert them to get_accelerator().device_name(), get_accelerator().device_name(0), and get_accelerator().device_name(1).
+For CUDA-specific device names such as `'cuda'`, `'cuda:0'`, or `'cuda:1'`, we convert them to `get_accelerator().device_name()`, `get_accelerator().device_name(0)`, and `get_accelerator().device_name(1)`, respectively.
 
 A device name without index can be used if model need to do specific thing for certain accelerator.  We suggest to make as less as such usage only for situatio can not be resolve other way.
 
 ## Tensor operations
 CUDA specific tensor operations needs to be converted according to the following rules:
-- When we convert a torch tensor to accelerator device such as my_tensor.cuda(), we use my_tensor.to(get_accelerator().deivce_name())
+- To move a torch tensor to the accelerator device (as with `my_tensor.cuda()`), we use `my_tensor.to(get_accelerator().device_name())`.
 
-- When we check whether a torch tensor is on accelerator device such as my_tensor.is_cuda, we use get_accelerator().on_accelerator(my_tensor)
+- To check whether a torch tensor is on the accelerator device (as with `my_tensor.is_cuda`), we use `get_accelerator().on_accelerator(my_tensor)`.
 
-- When pin a tensor to GPU memory such as my_tensor.pin_memory(), we use get_accelerator().pin_memory(my_tensor)
+- To pin a tensor in page-locked host memory (as with `my_tensor.pin_memory()`), we use `get_accelerator().pin_memory(my_tensor)`.
 
 ## Communication backend
-When a communication backend string is used, the interface get_accelerator().communication_backend_name() is used get get communication backend name. So instead of torch.distributed.init_process_group('nccl'), we use torch.distributed.
+When a communication backend string is used, call `get_accelerator().communication_backend_name()` to get the communication backend name.  So instead of:
 ```
-init_process_group(get_accelerator().communication_backend_name())
+torch.distributed.init_process_group('nccl')
+```
+we use:
+```
+torch.distributed.init_process_group(get_accelerator().communication_backend_name())
 ```
 
 # Run DeepSpeed on different accelerators
-Once a model is ported with DeepSpeed Accelerator Interface, we can run this model on different accelerators using extension to DeepSpeed.  DeepSpeed check whether certain extension is installed in the environment to decide whether to use the Accelerator backend in that extension.  For example if we wish to run model on Intel GPU, we can install Intel Extension for DeepSpeed following the [link](https://github.com/intel/intel-extension-for-deepspeed/)
+Once a model is ported with DeepSpeed Accelerator Interface, we can run this model on different accelerators using extension to DeepSpeed.  DeepSpeed check whether certain extension is installed in the environment to decide whether to use the Accelerator backend in that extension.  For example if we wish to run model on Intel GPU, we can install _Intel Extension for DeepSpeed_ following the instruction in [link](https://github.com/intel/intel-extension-for-deepspeed/)
 
 After the extension is installed, install DeepSpeed and run model.   The model will be running on top of DeepSpeed.   Because DeepSpeed installation is also accelerator related, it is recommended to install DeepSpeed accelerator extension before install DeepSpeed.
 
 `CUDA_Accelerator` is the default accelerator in DeepSpeed.  If no other DeepSpeed accelerator extension is installed, `CUDA_Accelerator` will be used.
 
-When run a model on different accelerator in a cloud environment, the recommended practice is provision environment for each accelerator in different env with tool such as anaconda/miniconda/virtualenv.  When run model on different Accelerator, load the env accordingly.
+When running a model on different accelerators in a cloud environment, the recommended practice is to provision a separate environment for each accelerator with a tool such as _anaconda/miniconda/virtualenv_.  When running the model on a different accelerator, load the corresponding environment.
 
-Note that different accelerator may have different 'flavor' of float16 or bfloat16.   So it is recommended to make the model configurable for both float16 and bfloat16, so model code does not need to be changed.
+Note that different accelerators may have different 'flavors' of float16 or bfloat16.  So it is recommended to make the model configurable for both float16 and bfloat16; that way, model code does not need to be changed when running on different accelerators.
 
 # Implement new accelerator extension
-It is possible to implement a new DeepSpeed accelerator extension to support new accelerator in DeepSpeed.  An example to follow is [Intel Extension For DeepSpeed](https://github.com/intel/intel-extension-for-deepspeed/).   An accelerator extension contains the following components:
+It is possible to implement a new DeepSpeed accelerator extension to support a new accelerator in DeepSpeed.  An example to follow is _[Intel Extension For DeepSpeed](https://github.com/intel/intel-extension-for-deepspeed/)_.  An accelerator extension contains the following components:
 1. XYZ_Accelerator(DeepSpeedAccelerator) class definition, where 'XYZ' is the accelerator name, such as 'XPU' or 'CPU'.
 This class implements `class DeepSpeedAccelerator` and will be returned by `get_accelerator()` in DeepSpeed.
 2. Op builders following https://github.com/intel/intel-extension-for-deepspeed/tree/main/intel_extension_for_deepspeed/op_builder.   All op builders needs to inherit `deepspeed.ops.op_builder.builder.OpBuilder` directly or indirectly.  A common practice is to implement a base op builder (SYCLOpBuilder in the case of Intel Extension for DeepSpeed) and inherit this base op builder instead.
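
The first component in the list above can be pictured with a minimal, self-contained sketch. Everything here is an assumption made for illustration: `AcceleratorInterface` is a hypothetical stand-in for DeepSpeed's abstract accelerator class, reduced to two of the methods named in this tutorial (`device_name()` and `communication_backend_name()`), the local `get_accelerator()` is a stand-in for the real lookup, and the strings `xyz` and `xyz_ccl` are invented. The real interface has many more methods.

```python
from abc import ABC, abstractmethod


class AcceleratorInterface(ABC):
    """Hypothetical stand-in for DeepSpeed's abstract accelerator class."""

    @abstractmethod
    def device_name(self):
        """Return the device string that tensors are moved to."""

    @abstractmethod
    def communication_backend_name(self):
        """Return the backend string passed to init_process_group."""


class XYZ_Accelerator(AcceleratorInterface):
    """Illustrative accelerator; 'xyz' and 'xyz_ccl' are invented names."""

    def device_name(self):
        return "xyz"

    def communication_backend_name(self):
        return "xyz_ccl"


def get_accelerator():
    # Stand-in for the real lookup, which picks the installed extension.
    return XYZ_Accelerator()


print(get_accelerator().device_name())
print(get_accelerator().communication_backend_name())
```

Model code written against the interface only ever calls `get_accelerator()`, so swapping `XYZ_Accelerator` for another implementation would not require touching the model.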

From 40a3429f368af51473e12240e98a08536e8acd29 Mon Sep 17 00:00:00 2001
From: "Ma, Guokai" <guokai.ma@intel.com>
Date: Tue, 11 Apr 2023 22:12:56 +0800
Subject: [PATCH 05/11] fix link

---
 docs/_tutorials/accelerator-interface.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/_tutorials/accelerator-interface.md b/docs/_tutorials/accelerator-interface.md
index 4a6ea628e359..91c6a17bb205 100644
--- a/docs/_tutorials/accelerator-interface.md
+++ b/docs/_tutorials/accelerator-interface.md
@@ -72,7 +72,7 @@ torch.distributed.init_process_group('nccl')
 torch.distributed.init_process_group(get_accelerator().communication_backend_name())
 ```
 
-# Run DeepSpeed on different accelerators
+# Run DeepSpeed model on different accelerators
 Once a model is ported with DeepSpeed Accelerator Interface, we can run this model on different accelerators using extension to DeepSpeed.  DeepSpeed check whether certain extension is installed in the environment to decide whether to use the Accelerator backend in that extension.  For example if we wish to run model on Intel GPU, we can install _Intel Extension for DeepSpeed_ following the instruction in [link](https://github.com/intel/intel-extension-for-deepspeed/)
 
 After the extension is installed, install DeepSpeed and run model.   The model will be running on top of DeepSpeed.   Because DeepSpeed installation is also accelerator related, it is recommended to install DeepSpeed accelerator extension before install DeepSpeed.

From 81444770f9f8ccb9f64fccd2e83f2ca9c8a529fb Mon Sep 17 00:00:00 2001
From: "Ma, Guokai" <guokai.ma@intel.com>
Date: Tue, 11 Apr 2023 22:19:04 +0800
Subject: [PATCH 06/11] add sub-bullets

---
 docs/_tutorials/accelerator-interface.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/_tutorials/accelerator-interface.md b/docs/_tutorials/accelerator-interface.md
index 91c6a17bb205..d333f2fd927b 100644
--- a/docs/_tutorials/accelerator-interface.md
+++ b/docs/_tutorials/accelerator-interface.md
@@ -6,6 +6,10 @@ tags: getting-started
 # Contents
   * [Introduction](#introduction)
   * [Write accelerator agnostic models](#write-accelerator-agnostic-models)
+    * [Port accelerator runtime calls](#port-accelerator-runtime-calls)
+    * [Port accelerator device name](#port-accelerator-device-name)
+    * [Tensor operations](#tensor-operations)
+    * [Communication backend](#communication-backend)
   * [Run DeepSpeed model on different accelerators](#run-deepspeed-model-on-different-accelerators)
   * [Implement new accelerator extension](#implement-new-accelerator-extension)
 

From 54b1454b9786d12b3f4d2676602d70cb957f25c6 Mon Sep 17 00:00:00 2001
From: "Ma, Guokai" <guokai.ma@intel.com>
Date: Tue, 11 Apr 2023 22:23:49 +0800
Subject: [PATCH 07/11] unify naming

---
 .pre-commit-config.yaml                                       | 2 +-
 ...ator-interface.md => accelerator-abstraction-interface.md} | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
 rename docs/_tutorials/{accelerator-interface.md => accelerator-abstraction-interface.md} (87%)

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index bb53dd80b031..473fcf9c0822 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -75,5 +75,5 @@ repos:
         name: check-torchcuda
         entry: ./scripts/check-torchcuda.py
         language: script
-        exclude: ^(.github/workflows/|scripts/check-torchcuda.py|docs/_tutorials/accelerator-interface.md|accelerator/cuda_accelerator.py|deepspeed/inference/engine.py|deepspeed/model_implementations/transformers/clip_encoder.py|deepspeed/model_implementations/diffusers/vae.py|deepspeed/model_implementations/diffusers/unet.py|op_builder/spatial_inference.py|op_builder/transformer_inference.py|op_builder/builder.py|setup.py|tests/unit/ops/sparse_attention/test_sparse_attention.py)
+        exclude: ^(.github/workflows/|scripts/check-torchcuda.py|docs/_tutorials/accelerator-abstraction-interface.md|accelerator/cuda_accelerator.py|deepspeed/inference/engine.py|deepspeed/model_implementations/transformers/clip_encoder.py|deepspeed/model_implementations/diffusers/vae.py|deepspeed/model_implementations/diffusers/unet.py|op_builder/spatial_inference.py|op_builder/transformer_inference.py|op_builder/builder.py|setup.py|tests/unit/ops/sparse_attention/test_sparse_attention.py)
         # Specific deepspeed/ files are excluded for now until we wrap ProcessGroup in deepspeed.comm
diff --git a/docs/_tutorials/accelerator-interface.md b/docs/_tutorials/accelerator-abstraction-interface.md
similarity index 87%
rename from docs/_tutorials/accelerator-interface.md
rename to docs/_tutorials/accelerator-abstraction-interface.md
index d333f2fd927b..54050dc938f8 100644
--- a/docs/_tutorials/accelerator-interface.md
+++ b/docs/_tutorials/accelerator-abstraction-interface.md
@@ -14,7 +14,7 @@ tags: getting-started
   * [Implement new accelerator extension](#implement-new-accelerator-extension)
 
 # Introduction
-The DeepSpeed Accelerator Interface allows user to run large language model seamlessly on various Deep Learning acceleration hardware seamlessly with DeepSpeed.   It offers a set of accelerator runtime and accelerator op builder interface which can be implemented for different hardware.  This means user can write large language model code without hardware specific code.  With DeepSpeed Accelerator Interface, the same large language model can run on different hardware platform, without the need to rewrite model code.  This makes running large language model on different hardware easier.
+The DeepSpeed Accelerator Abstraction allows user to run large language model seamlessly on various Deep Learning acceleration hardware seamlessly with DeepSpeed.   It offers a set of accelerator runtime and accelerator op builder interface which can be implemented for different hardware.  This means user can write large language model code without hardware specific code.  With DeepSpeed Accelerator Abstraction, the same large language model can run on different hardware platform, without the need to rewrite model code.  This makes running large language model on different hardware easier.
 
 This document covers three topics related to DeepSpeed Accelerator Abstraction Interface:
 1. Write accelerator agnostic models using DeepSpeed Accelerator Abstraction Interface.
@@ -77,7 +77,7 @@ torch.distributed.init_process_group(get_accelerator().communication_backend_nam
 ```
 
 # Run DeepSpeed model on different accelerators
-Once a model is ported with DeepSpeed Accelerator Interface, we can run this model on different accelerators using extension to DeepSpeed.  DeepSpeed check whether certain extension is installed in the environment to decide whether to use the Accelerator backend in that extension.  For example if we wish to run model on Intel GPU, we can install _Intel Extension for DeepSpeed_ following the instruction in [link](https://github.com/intel/intel-extension-for-deepspeed/)
+Once a model is ported with the DeepSpeed Accelerator Abstraction Interface, we can run it on different accelerators using extensions to DeepSpeed.  DeepSpeed checks whether a given extension is installed in the environment to decide whether to use the accelerator backend in that extension.  For example, to run a model on Intel GPU, we can install _Intel Extension for DeepSpeed_ by following the instructions at this [link](https://github.com/intel/intel-extension-for-deepspeed/).
 
 After the extension is installed, install DeepSpeed and run model.   The model will be running on top of DeepSpeed.   Because DeepSpeed installation is also accelerator related, it is recommended to install DeepSpeed accelerator extension before install DeepSpeed.
 

From a392bf30324b8f5c1f61d3e89beac1ef754e926c Mon Sep 17 00:00:00 2001
From: "Ma, Guokai" <guokai.ma@intel.com>
Date: Fri, 28 Apr 2023 16:28:29 +0800
Subject: [PATCH 08/11] remove duplicate words

---
 docs/_tutorials/accelerator-abstraction-interface.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/_tutorials/accelerator-abstraction-interface.md b/docs/_tutorials/accelerator-abstraction-interface.md
index 54050dc938f8..4e3eedf07985 100644
--- a/docs/_tutorials/accelerator-abstraction-interface.md
+++ b/docs/_tutorials/accelerator-abstraction-interface.md
@@ -14,7 +14,7 @@ tags: getting-started
   * [Implement new accelerator extension](#implement-new-accelerator-extension)
 
 # Introduction
-The DeepSpeed Accelerator Abstraction allows user to run large language model seamlessly on various Deep Learning acceleration hardware seamlessly with DeepSpeed.   It offers a set of accelerator runtime and accelerator op builder interface which can be implemented for different hardware.  This means user can write large language model code without hardware specific code.  With DeepSpeed Accelerator Abstraction, the same large language model can run on different hardware platform, without the need to rewrite model code.  This makes running large language model on different hardware easier.
+The DeepSpeed Accelerator Abstraction allows users to run large language models seamlessly on various deep learning acceleration hardware with DeepSpeed.  It offers a set of accelerator runtime and accelerator op builder interfaces which can be implemented for different hardware.  This means users can write large language model code without hardware-specific code.  With the DeepSpeed Accelerator Abstraction, the same large language model can run on different hardware platforms without rewriting the model code.
 
 This document covers three topics related to DeepSpeed Accelerator Abstraction Interface:
 1. Write accelerator agnostic models using DeepSpeed Accelerator Abstraction Interface.

From f67c3d891b59d94b559591c6c9e817fae3b25c0d Mon Sep 17 00:00:00 2001
From: "Ma, Guokai" <guokai.ma@intel.com>
Date: Wed, 17 May 2023 17:13:42 +0800
Subject: [PATCH 09/11] add documentation for CPU

---
 .../accelerator-abstraction-interface.md      | 54 +++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/docs/_tutorials/accelerator-abstraction-interface.md b/docs/_tutorials/accelerator-abstraction-interface.md
index 4e3eedf07985..886788bb31bb 100644
--- a/docs/_tutorials/accelerator-abstraction-interface.md
+++ b/docs/_tutorials/accelerator-abstraction-interface.md
@@ -87,6 +87,60 @@ When run a model on different accelerator in a cloud environment, the recommende
 
 Note that different accelerators may have different 'flavors' of float16 or bfloat16.  So it is recommended to make the model configurable for both float16 and bfloat16; that way, model code does not need to be changed when running on different accelerators.
 
+# Run DeepSpeed model on CPU
+DeepSpeed support use CPU as accelerator.  DeepSpeed model using DeepSpeed Accelerator Abstraction Interface could run on CPU without change to model code.   DeepSpeed decide whether _Intel Extension for PyTorch_ is installed in the environment.  If this packaged is installed, DeepSpeed will use CPU as accelerator.  Otherwise CUDA device will be used as accelerator.
+
+To run DeepSpeed model on CPU, use the following steps to prepare environment:
+
+```
+python -m pip install intel_extension_for_pytorch
+python -m pip install oneccl_bind_pt==2.0 -f https://developer.intel.com/ipex-whl-stable-cpu
+git clone https://github.com/oneapi-src/oneCCL
+cd oneCCL
+mkdir build
+cd build
+cmake ..
+make
+make install
+```
+
+Before running a CPU workload, we need to source the oneCCL environment variables:
+```
+source <path-to-oneCCL>/build/_install/env/setvars.sh
+```
+
+After the environment is prepared, we can launch DeepSpeed inference with the following command:
+```
+deepspeed --bind_cores_to_rank <deepspeed-model-script>
+```
+
+This command launches a number of workers equal to the number of CPU sockets on the system.  Currently DeepSpeed supports running inference models with AutoTP on CPU.  The argument `--bind_cores_to_rank` distributes the CPU cores on the system evenly among workers, so that each worker runs on a dedicated set of CPU cores.
+
+On CPU system, there might by daemon process that periodically activate which would increase variance of each worker.  One practice is leave a couple of cores for daemon process using `--bind-core-list` argument:
+
+```
+deepspeed --bind_cores_to_rank --bind_core_list 0-51,56-107 <deepspeed-model-script>
+```
+
+The command above leaves 4 cores on each socket to daemon processes (assuming two sockets with 56 cores each).
+
+We can also set an arbitrary number of workers.  Unlike GPUs, the CPU cores on a host can be further divided into subgroups.  When this number is not set, DeepSpeed detects the number of NUMA nodes on the system and launches one worker for each NUMA node.
+
+```
+deepspeed --num_accelerators 4 --bind_cores_to_rank <deepspeed-model-script>
+```
+
+Launching DeepSpeed model on multiple CPU nodes is similar to other accelerators.  We need to specify `impi` as launcher and specify `--bind_cores_to_rank` for better core binding.  Also specify `slot` number according to number of CPU sockets in host file.
+
+```
+# hostfile content should follow the format
+# worker-1-hostname slots=<#sockets>
+# worker-2-hostname slots=<#sockets>
+# ...
+
+deepspeed --hostfile=<hostfile> --bind_cores_to_rank --launcher impi --master_addr <master-ip> <deepspeed-model-script>
+```
+
 # Implement new accelerator extension
 It is possible to implement a new DeepSpeed accelerator extension to support a new accelerator in DeepSpeed.  An example to follow is _[Intel Extension For DeepSpeed](https://github.com/intel/intel-extension-for-deepspeed/)_.  An accelerator extension contains the following components:
 1. XYZ_Accelerator(DeepSpeedAccelerator) class definition, where 'XYZ' is the accelerator name, such as 'XPU' or 'CPU'.
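
As an illustration of the `--bind_core_list` example in the CPU section above (two sockets, 56 cores each, 4 cores left per socket), the following sketch derives the core-range string. The helper name `bind_core_list` and its parameters are hypothetical and not part of DeepSpeed. It assumes cores are numbered contiguously socket by socket and that the reserved cores are the last ones on each socket; real systems may number cores differently.

```python
def bind_core_list(num_sockets, cores_per_socket, reserved_per_socket):
    """Build a core-range string that skips the last cores of each socket.

    Hypothetical helper: assumes socket s owns cores
    [s * cores_per_socket, (s + 1) * cores_per_socket - 1].
    """
    if not 0 <= reserved_per_socket < cores_per_socket:
        raise ValueError("reserved_per_socket must be in [0, cores_per_socket)")
    ranges = []
    for socket in range(num_sockets):
        first = socket * cores_per_socket
        # Last usable core: stop before the cores left to daemon processes.
        last = first + cores_per_socket - reserved_per_socket - 1
        ranges.append(f"{first}-{last}")
    return ",".join(ranges)


print(bind_core_list(2, 56, 4))  # prints 0-51,56-107
```

The printed string matches the value passed to `--bind_core_list` in the tutorial's example command.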

From 7bff4fbc5fa6681cfc7de2c7da7421e6e29edaae Mon Sep 17 00:00:00 2001
From: "Ma, Guokai" <guokai.ma@intel.com>
Date: Wed, 17 May 2023 17:17:49 +0800
Subject: [PATCH 10/11] fix gramma

---
 docs/_tutorials/accelerator-abstraction-interface.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/_tutorials/accelerator-abstraction-interface.md b/docs/_tutorials/accelerator-abstraction-interface.md
index 886788bb31bb..45430abf1a2d 100644
--- a/docs/_tutorials/accelerator-abstraction-interface.md
+++ b/docs/_tutorials/accelerator-abstraction-interface.md
@@ -88,7 +88,7 @@ When run a model on different accelerator in a cloud environment, the recommende
 Note that different accelerators may have different 'flavors' of float16 or bfloat16.  So it is recommended to make the model configurable for both float16 and bfloat16; that way, model code does not need to be changed when running on different accelerators.
 
 # Run DeepSpeed model on CPU
-DeepSpeed support use CPU as accelerator.  DeepSpeed model using DeepSpeed Accelerator Abstraction Interface could run on CPU without change to model code.   DeepSpeed decide whether _Intel Extension for PyTorch_ is installed in the environment.  If this packaged is installed, DeepSpeed will use CPU as accelerator.  Otherwise CUDA device will be used as accelerator.
+DeepSpeed supports using CPU as an accelerator.  A DeepSpeed model using the DeepSpeed Accelerator Abstraction Interface can run on CPU without changes to model code.  DeepSpeed checks whether _Intel Extension for PyTorch_ is installed in the environment.  If this package is installed, DeepSpeed will use CPU as the accelerator.  Otherwise a CUDA device will be used as the accelerator.
 
 To run DeepSpeed model on CPU, use the following steps to prepare environment:
 

From 5c00e7864af90857ecdbbcd2cf867b27bd3b82de Mon Sep 17 00:00:00 2001
From: "Ma, Guokai" <guokai.ma@intel.com>
Date: Wed, 17 May 2023 17:20:25 +0800
Subject: [PATCH 11/11] fix typo

---
 docs/_tutorials/accelerator-abstraction-interface.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/_tutorials/accelerator-abstraction-interface.md b/docs/_tutorials/accelerator-abstraction-interface.md
index 45430abf1a2d..bc0db6d809f4 100644
--- a/docs/_tutorials/accelerator-abstraction-interface.md
+++ b/docs/_tutorials/accelerator-abstraction-interface.md
@@ -116,7 +116,7 @@ deepspeed --bind_cores_to_rank <deepspeed-model-script>
 
 This command launches a number of workers equal to the number of CPU sockets on the system.  Currently DeepSpeed supports running inference models with AutoTP on CPU.  The argument `--bind_cores_to_rank` distributes the CPU cores on the system evenly among workers, so that each worker runs on a dedicated set of CPU cores.
 
-On CPU system, there might by daemon process that periodically activate which would increase variance of each worker.  One practice is leave a couple of cores for daemon process using `--bind-core-list` argument:
+On a CPU system, there might be daemon processes that periodically activate, which would increase the variance of each worker.  One practice is to leave a couple of cores for daemon processes using the `--bind_core_list` argument:
 
 ```
 deepspeed --bind_cores_to_rank --bind_core_list 0-51,56-107 <deepspeed-model-script>
@@ -130,7 +130,7 @@ We can also set an arbitrary number of workers.  Unlike GPU, CPU cores on host c
 deepspeed --num_accelerators 4 --bind_cores_to_rank <deepspeed-model-script>
 ```
 
-Launching DeepSpeed model on multiple CPU nodes is similar to other accelerators.  We need to specify `impi` as launcher and specify `--bind_cores_to_rank` for better core binding.  Also specify `slot` number according to number of CPU sockets in host file.
+Launching a DeepSpeed model on multiple CPU nodes is similar to other accelerators.  We need to specify `impi` as the launcher and specify `--bind_cores_to_rank` for better core binding.  Also specify the `slots` number in the host file according to the number of CPU sockets.
 
 ```
 # hostfile content should follow the format