An NVM Express Tutorial
Kevin Marks
Dell, Inc.
Flash Memory Summit 2013
Santa Clara, CA
What is NVM Express and Why
NVM Express defines an optimized queuing
interface, command set, and feature set for PCIe
SSDs
Architected to scale from client to enterprise
Standardization accelerates industry adoption
Standard drivers
Consistent feature set
Industry ecosystem
Development tools
Compliance and interoperability testing
Who created NVM Express (NVMe)
NVM Express was developed by an industry consortium of 90+
member companies and is directed by a 13-company
Promoter Group
NVM Express Release Timeline
Timeline axis: 2009 to 2014
NVMe Technical Work Begins
NVMe 1.0 Released March 1, 2011
Queueing Interface
NVM Command Set
Admin Command Set
End-to-end Protection (DIF/DIX)
Security
Physical Region Pages (PRPs)
NVMe 1.1 Released October 11, 2012
General Scatter Gather Lists (SGLs)
Multi-Path I/O & Namespace Sharing
Reservations
Autonomous Power Transitions During Idle
Goals of NVM Express relative to
AHCI
Remove uncacheable reads from command
issue/completion
Minimize MMIO writes in command issue/completion
path
Support deep command queues and simplify
command decoding and processing
Support MSI-X / flexible interrupt aggregation
Support for many core systems
Support Enterprise features
Comprehensive statistics / Health status reporting /
Robust error reporting & handling
NVM Express Usage Models
[Figure: four topologies attaching NVMe devices to host root complexes, directly (x4/x16), through PCIe switches, behind PCIe RAID, and in a dual-controller (Controller A / Controller B) SAN enclosure alongside SAS and SATA HDDs]
Server Caching
Used for temporary data
Non-redundant
Used to reduce memory footprint
Server Storage
Typically for persistent data
Redundant (i.e., RAIDed)
Commonly used as Tier-0 storage
Client Storage
Used for Boot/OS drive and/or HDD cache
Non-redundant
Power optimized
External Storage
Used for Metadata or data
Multi-ported device
Redundancy based on usage
NVMe Queues
NVMe uses circular queues to pass messages (e.g., commands and
command completion notifications). The queues may be located
anywhere in PCIe memory
Typically queues are located in host memory
Queues may consist of a contiguous block of physical memory or optionally a
non-contiguous set of physical memory pages (defined by a PRP List)
A queue consists of a set of fixed-size elements
[Figure: logical view of a circular queue with Head and Tail pointers, and its physical view in memory from Low Memory to High Memory]
Head
Points to next entry to be pulled off, if queue is not empty
If an element is removed from the element pointed to by the head, the head is
incremented to point to the next element, taking wrapping into consideration
Tail
Points to next free element
If an element is added to the element pointed to by the tail, the tail is
incremented to point to the next free element, taking wrapping into consideration
Queue Size (Usable)
Number of entries in the queue - 1
Minimum size is 2, Maximum is ~64K for I/O Queues and 4K for Admin Queue
Queue Empty
Head == Tail
Queue Full
Head == (Tail + 1) mod # of Queue Entries
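The head, tail, empty, and full rules can be illustrated with a small model. This is a minimal sketch, not driver code: the class name `CircularQueue` and the 4-entry depth are assumptions made for illustration.

```python
class CircularQueue:
    """Fixed-size circular queue following the head/tail rules on this slide."""

    def __init__(self, num_entries):
        if num_entries < 2:
            raise ValueError("minimum queue size is 2 entries")
        self.slots = [None] * num_entries
        self.head = 0  # next entry to be pulled off
        self.tail = 0  # next free element

    def is_empty(self):
        return self.head == self.tail

    def is_full(self):
        # One slot always stays unused, so usable size is entries - 1.
        return self.head == (self.tail + 1) % len(self.slots)

    def push(self, item):
        if self.is_full():
            raise OverflowError("queue full")
        self.slots[self.tail] = item
        self.tail = (self.tail + 1) % len(self.slots)  # wrap at the end

    def pop(self):
        if self.is_empty():
            raise IndexError("queue empty")
        item = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)  # wrap at the end
        return item


q = CircularQueue(4)          # 4 entries, 3 usable
for cmd in ("a", "b", "c"):
    q.push(cmd)
full_after_three = q.is_full()
first = q.pop()
```

Keeping one slot unused is what lets Head == Tail mean "empty" without a separate element counter.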
Types of Queues
Admin Queue for Admin Command Set
One per NVMe controller with up to 4K elements per queue
Used to configure IO Queues and controller/feature management
I/O Queues for IO Command Sets (e.g., NVM command set)
Up to 64K queues per NVMe controller with up to 64K elements per queue
Used to submit/complete IO commands
Where each type has:
Submission Queues (SQ)
Queues messages from host to controller
Used to submit commands
Identified by SQ ID
Completion Queues (CQ)
Queues messages from controller to host
Used to post command completions
Identified by CQ ID
May have an independent MSI-X interrupt per completion queue
NVMe queues are messaging queues, not command queues
NVMe Command Execution
1) Queue Command(s)
2) Ring Doorbell (New Tail)
3) Fetch Command(s)
4) Process Command(s)
5) Queue Completion(s)
6) Generate Interrupt
7) Process Completion(s)
8) Ring Doorbell (New Head)
[Figure: host and controller exchanging the steps above; each transfer between them is labeled as a PCIe TLP]
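The eight steps can be walked through with a toy model. This is a sketch under stated assumptions: `ToyController`, the 8-entry depth, and the dictionary-based commands are hypothetical, and the host here peeks at the controller's CQ tail for brevity, whereas a real host detects new completions with the Phase Tag.

```python
QUEUE_ENTRIES = 8  # hypothetical queue depth


class ToyController:
    def __init__(self, sq, cq):
        self.sq, self.cq = sq, cq
        self.sq_head = 0      # where the controller fetches next
        self.cq_tail = 0      # where the controller posts next
        self.cq_head_db = 0   # CQ head doorbell, written by the host
        self.interrupts = 0

    def ring_sq_tail_doorbell(self, new_tail):
        while self.sq_head != new_tail:
            cmd = self.sq[self.sq_head]                  # 3) fetch command
            self.sq_head = (self.sq_head + 1) % QUEUE_ENTRIES
            status = 0                                   # 4) process (always succeeds here)
            self.cq[self.cq_tail] = {"cid": cmd["cid"], "status": status}  # 5) queue completion
            self.cq_tail = (self.cq_tail + 1) % QUEUE_ENTRIES
        self.interrupts += 1                             # 6) generate interrupt


sq = [None] * QUEUE_ENTRIES
cq = [None] * QUEUE_ENTRIES
ctrl = ToyController(sq, cq)
sq_tail = 0
cq_head = 0

for cid, op in enumerate(["read", "write"]):
    sq[sq_tail] = {"cid": cid, "op": op}                 # 1) queue command
    sq_tail = (sq_tail + 1) % QUEUE_ENTRIES
ctrl.ring_sq_tail_doorbell(sq_tail)                      # 2) ring doorbell (new tail)

completed = []
while cq_head != ctrl.cq_tail:                           # 7) process completions
    completed.append(cq[cq_head]["cid"])
    cq_head = (cq_head + 1) % QUEUE_ENTRIES
ctrl.cq_head_db = cq_head                                # 8) ring doorbell (new head)
```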
SQ and CQ relationships
Each SQ is associated with only one CQ (i.e.,
commands submitted on a specific SQ
complete on a specific CQ).
The SQ to CQ relationship is defined at SQ
creation time.
It is permissible within the architecture to
have multiple SQs mapped to a single CQ
(n:1)
Scalable Queuing Interface
[Figure: the host side shows Controller Management with the Admin Submission Queue and Admin Completion Queue, and Core 0, Core 1, ..., Core N, each with one or more I/O Submission Queues and one I/O Completion Queue; each completion queue has its own MSI-X interrupt. All queues connect to the NVMe Controller.]
Enables NUMA optimized drivers
Per core: One or more submission queues, one completion queue, and one MSI-X
interrupt
High performance and low latency command issue
No locking between cores
Up to ~2^32 outstanding commands
Support for up to ~ 64K I/O submission and completion queues
Each queue supports up to ~ 64K outstanding commands
Command Arbitration
All controllers support round robin arbitration
[Figure: the Admin Submission Queue (ASQ) and the I/O Submission Queues (SQ) all feed a single round robin (RR) arbiter]
Command Arbitration
An NVMe controller may support weighted round robin with urgent priority class
arbitration
Arbitration Primitives
[Figure: a priority arbiter with inputs ordered from High to Low priority, and a weighted round robin (WRR) arbiter whose inputs carry Weight = 3, Weight = 2, and Weight = 1]
Example above shown with an arbitration burst of no limit
NVMe supports an arbitration burst of 1, 2, 4, 8, 16, 32, 64 and
no limit
NVMe supports 8-bit WRR weights
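A simplified model of weighted round robin may help make the weights concrete. This sketch assumes each queue may issue up to its weight in commands per round; it does not model the urgent priority class or the arbitration burst setting, and the queue names and weight assignment are hypothetical.

```python
from collections import deque


def weighted_round_robin(queues, weights):
    """Yield commands, letting each queue issue up to `weight` commands per round."""
    pending = {name: deque(cmds) for name, cmds in queues.items()}
    while any(pending.values()):
        for name, weight in weights.items():
            for _ in range(weight):
                if not pending[name]:
                    break  # this queue is drained; move on to the next one
                yield pending[name].popleft()


queues = {
    "high": ["h1", "h2", "h3", "h4"],
    "med": ["m1", "m2", "m3"],
    "low": ["l1", "l2"],
}
weights = {"high": 3, "med": 2, "low": 1}  # hypothetical weight assignment
order = list(weighted_round_robin(queues, weights))
```

With these inputs the first round serves three high, two medium, and one low command before the second round starts.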
NVMe Subsystem Model
NVM Subsystem - one or more controllers, one or more
namespaces, one or more PCI Express ports, a non-volatile
memory storage medium, and an interface between the
controller(s) and non-volatile memory storage medium
Controller: a PCI Express function that implements NVM
Express
NVMe Subsystem Example
Single controller, single namespace
[Figure: one PCIe Port leading to one NVMe Controller (PCI Function 0), which exposes NSID 1 mapped to NS A]
NS = Namespace, an amount of NVM
storage formatted for block access
NSID = Namespace ID, a controller-unique
identifier for a namespace (NS)
NVM Subsystem Example
Single Controller, multiple Namespaces
[Figure: one PCIe Port leading to one NVMe Controller (PCI Function 0), which exposes NSID 1 mapped to NS A and NSID 2 mapped to NS B]
NVM Subsystem Example
Multiple controllers
NVM Subsystem with Two Controllers and One Port
[Figure: one PCIe Port with PCI Function 0 and PCI Function 1, each an NVM Express Controller exposing NSID 1 and NSID 2; the subsystem contains namespaces NS A, NS B, and NS C]
NVM Subsystem with Two Controllers and Two Ports
[Figure: PCIe Port x and PCIe Port y, each with PCI Function 0 as an NVM Express Controller exposing NSID 1 and NSID 2; the subsystem contains namespaces NS A, NS B, and NS C]
PCIe Multi-Path Usage Model
[Figure: two hosts, two PCIe switches, and eight PCIe SSDs joined by a PCIe interconnect]
Uniquely Identifying a Namespace
How do Host A and Host B know that NS B is
the same namespace?
[Figure: Host A attaches to an NVMe Controller (PCI Function 0) and Host B attaches to an NVMe Controller (PCI Function 1) in the same NVM Subsystem; each controller exposes NSID 1 and NSID 2 across namespaces NS A, NS B, and NS C]
NVM Express 1.1 added unique identifiers for:
The NVMe Controller; and
Each Namespace within an NVM Subsystem
These identifiers are guaranteed to be globally
unique
Unique NVMe Controller Identifier (64B) =
2B PCI Vendor ID + 20B Serial Number + 40B Model Number + 2B Controller ID
Unique Namespace Identifier (8B) = 8B IEEE Extended Unique Identifier
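The 64-byte controller identifier is simply the four fields concatenated. The sketch below assumes little-endian 16-bit fields and space-padded ASCII strings; the function name and all example values are hypothetical.

```python
import struct


def controller_unique_id(vendor_id, serial, model, controller_id):
    """Concatenate 2B vendor ID + 20B serial + 40B model + 2B controller ID."""
    sn = serial.encode("ascii").ljust(20)  # pad with spaces to 20 bytes
    mn = model.encode("ascii").ljust(40)   # pad with spaces to 40 bytes
    if len(sn) != 20 or len(mn) != 40:
        raise ValueError("serial or model string too long")
    return struct.pack("<H", vendor_id) + sn + mn + struct.pack("<H", controller_id)


# All values below are made up for illustration.
uid = controller_unique_id(0x1234, "SN0001", "EXAMPLE-MODEL", 1)
```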
NVMe controller register map
Controller Initialization
The host performs the following actions in sequence to initialize the
controller to begin executing Admin commands:
1. Set the PCI and PCI Express registers based on the system
configuration. This includes configuration of power management
features. Pin-based or single-message MSI interrupts should be used
until the number of I/O Queues is determined.
2. Configure the Admin Queue by setting the Admin Queue Attributes
(AQA), Admin Submission Queue Base Address (ASQ), and Admin
Completion Queue Base Address (ACQ) to appropriate values.
3. Configure:
1. the arbitration mechanism in CC.AMS
2. the memory page size in CC.MPS
3. the I/O Command Set in CC.CSS
4. Enable the controller by setting CC.EN to 1.
5. Wait for the controller to indicate it is ready to process commands (i.e.,
when CSTS.RDY is set to 1)
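Steps 2 through 5 can be sketched against a stand-in register file. The register offsets and field positions used here (CC at 14h, CSTS at 1Ch, AQA at 24h, ASQ at 28h, ACQ at 30h; CC.EN bit 0, CC.CSS bits 6:4, CC.MPS bits 10:7, CC.AMS bits 13:11) are taken from the NVMe 1.x register map rather than from this slide, and `FakeController` is a hypothetical test double that reports ready as soon as it is enabled. Step 1 (PCI configuration) is omitted.

```python
CC, CSTS, AQA, ASQ, ACQ = 0x14, 0x1C, 0x24, 0x28, 0x30


class FakeController:
    """Stand-in register file; sets CSTS.RDY as soon as CC.EN is written."""

    def __init__(self):
        self.regs = {CC: 0, CSTS: 0, AQA: 0, ASQ: 0, ACQ: 0}

    def write(self, offset, value):
        self.regs[offset] = value
        if offset == CC:
            self.regs[CSTS] = (self.regs[CSTS] & ~1) | (value & 1)

    def read(self, offset):
        return self.regs[offset]


def init_admin_queue(ctrl, asq_addr, acq_addr, entries, mps=0, ams=0, css=0):
    # Step 2: admin queue sizes (zero-based values) and base addresses.
    ctrl.write(AQA, ((entries - 1) << 16) | (entries - 1))
    ctrl.write(ASQ, asq_addr)
    ctrl.write(ACQ, acq_addr)
    # Step 3: arbitration mechanism, memory page size, I/O command set.
    cc = (ams << 11) | (mps << 7) | (css << 4)
    ctrl.write(CC, cc)
    # Step 4: enable the controller.
    ctrl.write(CC, cc | 1)
    # Step 5: wait until the controller reports ready.
    while not ctrl.read(CSTS) & 1:
        pass
    return ctrl.read(CC)


ctrl = FakeController()
cc = init_admin_queue(ctrl, asq_addr=0x10000, acq_addr=0x20000, entries=64)
```

A real driver would bound the wait in step 5 with a timeout rather than spin indefinitely.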
Submission Queue Element with PRPs (64B)
Layout (16 DWords, fields listed from high bits to low bits):
DWord 0: Command Identifier | PRP or SGL for Data Transfer | FUSE | Opcode
DWord 1: Namespace Identifier
DWords 4-5: Metadata Pointer
DWords 6-7: PRP Entry 1
DWords 8-9: PRP Entry 2
DWords 10-15: Command specific
Opcode: Command operation code
Fused Operation (FUSE): specifies if
two commands should be executed as
an atomic unit (optional)
PRP or SGL for Data Transfer: 0
specifies that PRPs are used; 1 specifies
that SGLs are used
Command Identifier: Command ID
within submission queue
Namespace: Namespace on which
command operates
Metadata Pointer: Pointer to
contiguous buffer containing metadata
PRP Entry 1: First PRP entry for the
command or PRP list pointer depending
on the command
PRP Entry 2: Second PRP entry for the
command or PRP list pointer depending
on the command
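The 64-byte element can be packed with Python's `struct` module. This is a sketch under assumptions: the bit positions in DWord 0 (Opcode in bits 7:0, FUSE in bits 9:8, Command Identifier in bits 31:16) follow the NVMe 1.x specification rather than this slide, `build_sqe` is a hypothetical helper, and opcode 02h is the NVM command set Read.

```python
import struct


def build_sqe(opcode, cid, nsid, prp1=0, prp2=0, mptr=0, fuse=0, cdw10_15=(0,) * 6):
    """Pack a 64-byte submission queue element that uses PRPs."""
    cdw0 = (cid << 16) | (fuse << 8) | opcode
    # Layout: DW0, DW1, DW2-3 (reserved), DW4-5 (metadata pointer),
    # DW6-7 (PRP Entry 1), DW8-9 (PRP Entry 2), DW10-15 (command specific).
    return struct.pack("<IIQQQQ6I", cdw0, nsid, 0, mptr, prp1, prp2, *cdw10_15)


sqe = build_sqe(opcode=0x02, cid=7, nsid=1, prp1=0x1000_0200)
```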
Physical Region Pages (PRPs)
PRP contains the 64-bit physical memory page address. The
lower bits (n:2) of this field indicate the offset within the memory
page, where n is defined by the memory page size (CC.MPS)
PRP List contains a list of PRPs with generally no offsets.
PRP Example
NVMe command example utilizing the two PRP Entries as
PRPs. The first PRP has an offset into the memory page.
[Figure: PRP Entry 1 points into a host physical page at a non-zero offset; PRP Entry 2 (which could otherwise be a PRP List pointer) points to a second host physical page with an offset of 0]
PRP List Example
NVMe command example utilizing
the two PRP Entries, one as a PRP
and the other as a PRP List.
PRPs in the PRP List always have
offsets of zero if the first PRP entry
in the command is a PRP
[Figure: PRP Entry 1 points into a host physical page at a non-zero offset; PRP Entry 2 is a PRP List Pointer to a PRP List whose entries each point to a host physical page with an offset of 0]
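The choice between a second PRP and a PRP List depends on how many memory pages the transfer touches. The sketch below assumes a physically contiguous buffer, a PRP List that fits in one page, and a made-up list address (`prp_list_addr`); `build_prps` is a hypothetical helper.

```python
def build_prps(addr, length, page_size=4096, prp_list_addr=0x9000_0000):
    """Return (prp1, prp2, prp_list) for a physically contiguous buffer."""
    first_len = page_size - (addr % page_size)  # bytes covered by PRP Entry 1
    prp1 = addr                                 # may carry a page offset
    if length <= first_len:
        return prp1, 0, []                      # PRP Entry 2 unused
    next_page = addr - (addr % page_size) + page_size
    remaining = length - first_len
    num_pages = -(-remaining // page_size)      # ceiling division
    pages = [next_page + i * page_size for i in range(num_pages)]
    if len(pages) == 1:
        return prp1, pages[0], []               # PRP Entry 2 is a plain PRP
    return prp1, prp_list_addr, pages           # PRP Entry 2 points to the list


two_page = build_prps(0x1000_0200, 4096)
with_list = build_prps(0x1000_0200, 3 * 4096)
```

Every entry after the first is page aligned, which matches the rule that PRPs in the list have offsets of zero.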
Submission Queue Element with SGLs (64B)
Layout (16 DWords, fields listed from high bits to low bits):
DWord 0: Command Identifier | PRP or SGL for Data Transfer | FUSE | Opcode
DWord 1: Namespace Identifier
DWords 4-5: Metadata SGL Segment Pointer
DWords 6-9: SGL Entry 1
DWords 10-15: Command specific
Opcode: Command operation code
Fused Operation (FUSE): specifies if
two commands should be executed as
an atomic unit (optional)
PRP or SGL for Data Transfer: 0
specifies that PRPs are used; 1
specifies that SGLs are used
Command Identifier: Command ID
within submission queue
Namespace: Namespace on which
command operates
Metadata SGL Segment Pointer: first
SGL segment which describes the
metadata to transfer
SGL Entry 1: the first SGL segment for
the command
Scatter Gather List (SGL)
SGL Descriptor (16B)
Bytes 0-14: Descriptor Type Specific
Byte 15: SGL Desc. Type | Desc. Type Specific
SGL List
[Figure: the First SGL Segment is in the SQ Entry; each SGL Segment holds SGL Data Block Descriptors and ends with a descriptor pointing to the next segment; the Last SGL Segment is reached through an SGL Last Segment Descriptor and holds the final SGL Data Block Descriptors]
Code     Descriptor Type
0h       SGL Data Block
1h       SGL Bit Bucket
2h       SGL Segment
3h       SGL Last Segment
4h - Eh  Reserved
Fh       Vendor Specific
Completion Queue Element (16B)
Layout (4 DWords, fields listed from high bits to low bits):
DWord 0: Command specific
DWord 1: Reserved
DWord 2: SQ Identifier | SQ Head Pointer
DWord 3: Status Field | P | Command Identifier
SQ Head Pointer: Submission queue head pointer associated with SQ
Identifier
SQ Identifier: Submission queue associated with completed command
Command Identifier: Command ID within submission queue
Phase Tag (P): Indicates whether a completion queue entry is new
Status Field: Status associated with completed command
A value of zero indicates successful command completion
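A completion entry can be decoded with a few shifts and masks. The bit positions assumed here (SQ Head Pointer in DWord 2 bits 15:0, SQ Identifier in bits 31:16; Command Identifier in DWord 3 bits 15:0, Phase Tag in bit 16, Status Field in bits 31:17) come from the NVMe 1.x specification rather than this slide, and `parse_cqe` is a hypothetical helper.

```python
import struct


def parse_cqe(raw):
    """Decode a 16-byte completion queue element into its fields."""
    dw0, _dw1, dw2, dw3 = struct.unpack("<4I", raw)
    return {
        "command_specific": dw0,
        "sq_head": dw2 & 0xFFFF,
        "sq_id": dw2 >> 16,
        "cid": dw3 & 0xFFFF,
        "phase": (dw3 >> 16) & 1,
        "status": dw3 >> 17,  # zero means successful completion
    }


# Example entry: SQ 1, SQ head 5, command 7, phase tag 1, status 0.
raw = struct.pack("<4I", 0, 0, (1 << 16) | 5, (0 << 17) | (1 << 16) | 7)
cqe = parse_cqe(raw)
```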
Phase Tag
[Figure: a completion queue drawn from Low Memory to High Memory in three states. Initial state: every entry has phase tag 0, with Head and Tail at the start. Odd pass (1, 3, 5, ...): entries written by the controller carry phase tag 1 while unwritten entries still carry 0. Even pass (2, 4, 6, ...): newly written entries carry phase tag 0 while entries from the previous pass still carry 1.]
Phase Tag Operation
Initially zero
Controller inverts the phase tag of an entry each time it writes a completion
entry
Host knows the phase tag of completions and can determine when the last full
entry is reached
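The phase tag rule can be modeled in a few lines. This sketch assumes the controller never overruns entries the host has not yet consumed; the class name `CompletionQueue` and the 4-entry depth are hypothetical.

```python
class CompletionQueue:
    def __init__(self, entries):
        self.entries = [{"phase": 0, "cid": None} for _ in range(entries)]
        self.tail = 0        # controller side: next entry to write
        self.ctrl_phase = 1  # phase written during the first pass
        self.head = 0        # host side: next entry to examine
        self.host_phase = 1  # phase the host expects for a new entry

    def post(self, cid):
        """Controller writes a completion entry with the current phase."""
        self.entries[self.tail] = {"phase": self.ctrl_phase, "cid": cid}
        self.tail += 1
        if self.tail == len(self.entries):
            self.tail = 0
            self.ctrl_phase ^= 1  # invert on every wrap

    def reap(self):
        """Host consumes entries until the phase tag no longer matches."""
        done = []
        while self.entries[self.head]["phase"] == self.host_phase:
            done.append(self.entries[self.head]["cid"])
            self.head += 1
            if self.head == len(self.entries):
                self.head = 0
                self.host_phase ^= 1  # expect the inverted tag next pass
        return done


cq = CompletionQueue(4)
for cid in (0, 1, 2):
    cq.post(cid)
first_batch = cq.reap()
for cid in (3, 4, 5):   # wraps around the end of the queue
    cq.post(cid)
second_batch = cq.reap()
```

The host never needs to read the controller's tail pointer; a mismatching phase tag marks the first entry that has not been written in the current pass.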
NVMe Command Sets
[Figure: NVMe command sets divide into the Admin Command Set and the I/O Command Sets; the I/O Command Sets are the NVM Cmd Set plus three reserved slots (Rsvd #1, Rsvd #2, Rsvd #3)]
Admin Commands
Command                             Required or Optional   Category
Create I/O Submission Queue         Required               Queue Management
Delete I/O Submission Queue         Required               Queue Management
Create I/O Completion Queue         Required               Queue Management
Delete I/O Completion Queue         Required               Queue Management
Identify                            Required               Configuration
Get Features                        Required               Configuration
Set Features                        Required               Configuration
Get Log Page                        Required               Status Reporting
Asynchronous Event Request          Required               Status Reporting
Abort                               Required               Abort Command
Firmware Image Download             Optional               Firmware Update / Management
Firmware Activate                   Optional               Firmware Update / Management
I/O Command Set Specific Commands   Optional               I/O Command Set Specific
Vendor Specific Commands            Optional               Vendor Specific
All Admin commands use PRPs
Create I/O Submission Queue
Create specified I/O submission queue
Layout (fields listed from high bits to low bits):
DWord 0: Command Identifier | FUSE | Opcode
DWord 1: Namespace Identifier
DWords 6-7: PRP Entry 1
DWord 10: Queue Size | Queue Identifier
DWord 11: Completion Queue Identifier | QPRIO | PC
Queue Identifier: Submission queue ID
number
Queue Size: Number of entries in
submission queue (zero based value)
Completion Queue Identifier:
Completion queue ID number associated
with submission queue
Queue Priority (QPRIO): Queue priority
when weighted round robin with urgent
priority class arbitration is selected
Physically Contiguous (PC)
1: Submission queue is physically contiguous in
host memory
0: Submission queue is not physically
contiguous
PRP Entry 1: When not physically
contiguous, this is a pointer to a PRP list
that contains host pages
Command Specific Error Values
Completion Queue Invalid
Invalid Queue Identifier
Maximum Queue Size Exceeded
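The two command-specific DWords can be assembled as follows. The bit positions assumed here (Queue Identifier in DWord 10 bits 15:0, Queue Size in bits 31:16; PC in DWord 11 bit 0, QPRIO in bits 2:1, Completion Queue Identifier in bits 31:16) follow the NVMe 1.x specification rather than this slide, and the helper name is hypothetical.

```python
def create_io_sq_dwords(qid, entries, cqid, qprio=0, contiguous=True):
    """Return (CDW10, CDW11) for a Create I/O Submission Queue command."""
    cdw10 = ((entries - 1) << 16) | qid  # Queue Size is a zero based value
    cdw11 = (cqid << 16) | (qprio << 1) | int(contiguous)
    return cdw10, cdw11


# A 64-entry, physically contiguous SQ 1 that completes on CQ 1.
cdw10, cdw11 = create_io_sq_dwords(qid=1, entries=64, cqid=1)
```

Because the SQ names its completion queue here, the completion queue has to be created first.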
Create I/O Completion Queue
Create specified I/O completion queue
Layout (fields listed from high bits to low bits):
DWord 0: Command Identifier | FUSE | Opcode
DWord 1: Namespace Identifier
DWords 6-7: PRP Entry 1
DWord 10: Queue Size | Queue Identifier
DWord 11: Interrupt Vector | IEN | PC
Queue Identifier: Completion queue ID
number
Queue Size: Number of entries in
completion queue (zero based value)
Interrupt Vector: MSI-X or MSI vector
number
Interrupt Enable (IEN)
0: Interrupts disabled
1: Interrupts enabled
Physically Contiguous (PC)
1: Completion queue is physically contiguous in
host memory
0: Completion queue is not physically
contiguous
PRP Entry 1: When not physically
contiguous, this is a pointer to a PRP list
that contains host pages
Identify
Returns up to 4KB data structure that
describes controller or namespace
Layout (fields listed from high bits to low bits):
DWord 0: Command Identifier | FUSE | Opcode
DWord 1: Namespace Identifier
DWords 6-7: PRP Entry 1
DWords 8-9: PRP Entry 2
DWord 10: CNS
PRP Entry 1: Starting address of where
4KB data structure is to be written
Offset may be non-zero
PRP Entry 2: Starting address of where
remainder of 4KB data structure is to be
written
Controller or Namespace Structure
(CNS)
00b: Return corresponding namespace data
structure
01b: Return corresponding controller data
structure
10b: Return list of 1024 active namespace IDs
starting at the Namespace Identifier.
Active Namespace Reporting
[Figure: the Identify Admin Command returns one of three 4KB structures, selected by CNS]
Identify Namespace Data Structure: Return 4KB Identify
Namespace Data Structure for Namespace
Specified in CDW1.NSID
Identify Controller Data Structure: Return 4KB Identify
Controller Data Structure
Active Namespace Data Structure: Return 4KB Active
Namespace Data Starting at Namespace
Specified in CDW1.NSID
The Active Namespace Data Structure holds 1024 Dwords (Dword 0 to
Dword 1023), each an Active NSID: a list of active
NSIDs greater than or equal to CDW1.NSID, with unused
trailing entries set to 0
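Walking the returned list is straightforward. This sketch assumes the 4KB buffer holds 1024 little-endian 32-bit namespace IDs and that a zero entry ends the list; `parse_active_nsids` and the sample IDs are hypothetical.

```python
import struct


def parse_active_nsids(buf):
    """Return the active namespace IDs from a 4KB active namespace list."""
    active = []
    for nsid in struct.unpack("<1024I", buf):
        if nsid == 0:  # unused trailing entries are zero
            break
        active.append(nsid)
    return active


# Made-up response listing namespaces 1, 2, and 5.
buf = struct.pack("<3I", 1, 2, 5).ljust(4096, b"\x00")
active = parse_active_nsids(buf)
```

If all 1024 entries are non-zero, the host can issue Identify again with the Namespace Identifier advanced past the last ID returned.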
Identify Controller Data Structure
Example fields; does not show the complete data structure
Identify Namespace Data Structure
Example fields; does not show the complete data structure
Set Features
Set value of configurable feature
Layout (fields listed from high bits to low bits):
DWord 0: Command Identifier | FUSE | Opcode
DWord 1: Namespace Identifier
DWords 6-7: PRP Entry 1
DWords 8-9: PRP Entry 2
DWord 10: Feature Identifier
DWord 11: Parameter
PRP Entry 1: Starting address of where
feature data is located (used by some
features)
PRP Entry 2: Starting address of where
remainder of feature data is located
(used by some features)
Parameter: Feature parameter (used by
some features)
Feature Identifier: ID of feature
Get Log Page
Retrieves up to 4KB of data from specified
log page
Layout (fields listed from high bits to low bits):
DWord 0: Command Identifier | FUSE | Opcode
DWord 1: Namespace Identifier
DWords 6-7: PRP Entry 1
DWords 8-9: PRP Entry 2
DWord 10: Number of DWords | Log Page Identifier
PRP Entry 1: Starting address of where
log page should be written
PRP Entry 2: Starting address of where
remainder of log page should
be written
Number of DWords: Number of DWords
to transfer
Log Page Identifier: ID of log page to
retrieve
Asynchronous Event Request
Method to obtain asynchronous event status
from controller
Event signaled by a completion to a previously issued
asynchronous event request command
After asynchronous event, events of that same type
are masked until the host reads the corresponding log
page
Command layout (fields listed from high bits to low bits):
DWord 0: Command Identifier | FUSE | Opcode
DWord 1: Namespace Identifier
Completion layout (fields listed from high bits to low bits):
DWord 0: Log Page | Async Event Info | Type
DWord 2: SQ Identifier | SQ Head Pointer
DWord 3: Status Field | P | Command Identifier
Type: Type of asynchronous event
Error Status
SMART / Health status
Vendor Specific
Async Event Info: Provides error type
specific details
Examples:
Temperature above threshold
Spare space below threshold
Invalid doorbell value write
Log Page: ID of log page to retrieve more
information and clear mask (using Get Log
Page)
Controller Initialization (Part 2)
The host performs the following actions in sequence to initialize the controller to
begin executing IO commands:
1. Determine the controller configuration using the Identify command (Controller
data structure).
2. Determine namespace configuration for each namespace by using the Identify
command (Namespace data structure).
3. Determine the number of I/O Submission and Completion Queues supported
using the Set Features command.
4. After determining the number of I/O Queues, configure the MSI and/or MSI-X
registers.
5. Allocate the appropriate number of I/O Completion Queues, using the Create I/O
Completion Queue command.
6. Allocate the appropriate number of I/O Submission Queues, using the Create I/O
Submission Queue command.
7. If the host desires asynchronous notification of error or health events, submit an
appropriate number of Asynchronous Event Request commands.
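Step 3 uses Set Features with the Number of Queues feature. The encoding sketched here (Feature Identifier 07h; requested completion queues in DWord 11 bits 31:16 and submission queues in bits 15:0, both zero based; allocated counts returned the same way in completion DWord 0) is drawn from the NVMe 1.x specification rather than this slide, and the helper names are hypothetical.

```python
NUMBER_OF_QUEUES = 0x07  # Feature Identifier per the NVMe 1.x specification


def number_of_queues_request(num_sq, num_cq):
    """Return (feature_id, CDW11) asking for the given I/O queue counts."""
    return NUMBER_OF_QUEUES, ((num_cq - 1) << 16) | (num_sq - 1)


def number_of_queues_allocated(completion_dw0):
    """Return (num_sq, num_cq) actually allocated by the controller."""
    return (completion_dw0 & 0xFFFF) + 1, (completion_dw0 >> 16) + 1


fid, cdw11 = number_of_queues_request(num_sq=8, num_cq=8)
# Hypothetical completion in which the controller grants only 4 of each.
granted = number_of_queues_allocated(0x00030003)
```

The host then sizes its MSI/MSI-X configuration and its Create I/O Completion/Submission Queue calls to the granted counts, which may be lower than requested.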
NVM Cmd Set Admin Commands
Command            Required or Optional   Category
Format NVM         Optional               Admin
Security Send      Optional               Admin
Security Receive   Optional               Admin
NVM Command Set
Command                    Required or Optional   Category
Read                       Required               Required Data Commands
Write                      Required               Required Data Commands
Flush                      Required               Required Data Commands
Write Uncorrectable        Optional               Optional Data Commands
Write Zeroes               Optional               Optional Data Commands
Compare                    Optional               Optional Data Commands
Dataset Management         Optional               Data Hints
Reservation Acquire        Optional               Reservations Commands
Reservation Register       Optional               Reservations Commands
Reservation Release        Optional               Reservations Commands
Reservation Report         Optional               Reservations Commands
Vendor Specific Commands   Optional               Vendor Specific
NVM commands support either PRPs or SGLs.
Read
Read logical blocks from NVM and perform
specified protection information processing
Layout (fields listed from high bits to low bits):
DWord 0: Command Identifier | FUSE | Opcode
DWord 1: Namespace Identifier
DWords 4-5: Metadata Pointer or Metadata SGL Segment Pointer
DWords 6-7: PRP Entry 1
DWords 8-9: PRP Entry 2
DWords 10-11: Starting LBA
DWord 12: LR | FUA | PRINFO | Number of Logical Blocks
DWord 13: DSM
DWord 14: Expected Initial Logical Block Reference Tag
DWord 15: Expected Logical Block Application Tag Mask | Expected Logical Block Application Tag
PRP Entry 1, PRP Entry 2, Metadata Pointer:
Host buffers to write data read from NVM
Starting LBA: Address of first logical block to
read
Number of Logical Blocks: Number of logical
blocks to read from NVM
Protection Information Field (PRINFO)
Protection Information Action
Pass protection information or read and strip
Protection Information Check
Guard field check or no check
Application tag field check or no check
Reference tag field check or no check
Protection Information Related Fields
Expected Initial Logical Block Reference Tag
Expected Logical Block Application Tag Mask
Expected Logical Block Application Tag
Force Unit Access (FUA)
Return data from NVM
Limited Retry (LR)
Apply limited retry or apply all available error recovery
means to return data
Data Set Management (DSM)
Described later
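The addressing fields of a Read can be encoded as below. The bit positions (Number of Logical Blocks in DWord 12 bits 15:0 as a zero based value, PRINFO in bits 29:26, FUA in bit 30, LR in bit 31) are assumptions taken from the NVMe 1.x specification rather than this slide, and `read_dwords` is a hypothetical helper.

```python
def read_dwords(slba, num_blocks, fua=False, limited_retry=False, prinfo=0):
    """Return (CDW10, CDW11, CDW12) for an NVM command set Read."""
    cdw10 = slba & 0xFFFFFFFF  # Starting LBA, low 32 bits
    cdw11 = slba >> 32         # Starting LBA, high 32 bits
    cdw12 = (
        (int(limited_retry) << 31)
        | (int(fua) << 30)
        | (prinfo << 26)
        | (num_blocks - 1)     # zero based block count
    )
    return cdw10, cdw11, cdw12


# Read 8 blocks starting at LBA 0x1_0000_0010 with Force Unit Access set.
cdw10, cdw11, cdw12 = read_dwords(0x1_0000_0010, 8, fua=True)
```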
Fused Operation
A fused operation is a method to create a complex command by
fusing together two simpler commands.
Layout (fields listed from high bits to low bits):
DWord 0: Command Identifier | FUSE | Opcode
DWord 1: Namespace Identifier
DWords 4-5: Metadata Pointer
DWords 6-7: PRP Entry 1
DWords 8-9: PRP Entry 2
FUSE
This field specifies whether this
command is part of a fused
operation and if so, which
command it is in the sequence.
Field definition
00b: Normal operation
01b: Fused operation, first command
10b: Fused operation, second command
11b: Reserved
Compare and Write
Compare and write is the only defined fused
operation
Compare and Write commands are submitted in
adjacent slots in the submission queue
Compare and Write are executed as atomic unit
A completion queue entry is posted for each of the two
commands
If Compare succeeds, then Write command is executed
If Compare fails, then Write command is aborted
Command Aborted due to Failed Fused Command completion
status for write command
Both Compare and Write must operate on the same LBA
range
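The compare-then-write behavior can be modeled as a single function returning one status per command. This is a toy sketch: the list-backed `storage`, the function name, and the "Compare Failure" status string for the failed Compare are assumptions for illustration.

```python
def compare_and_write(storage, lba, expected, new_data):
    """Return (compare_status, write_status) for a fused Compare and Write."""
    if len(expected) != len(new_data):
        raise ValueError("Compare and Write must cover the same LBA range")
    end = lba + len(expected)
    if storage[lba:end] != expected:
        # The Write is aborted; both commands still post a completion.
        return "Compare Failure", "Command Aborted due to Failed Fused Command"
    storage[lba:end] = new_data
    return "Success", "Success"


storage = ["a", "b", "c", "d"]  # one item per logical block
ok = compare_and_write(storage, 1, ["b", "c"], ["x", "y"])
stale = compare_and_write(storage, 1, ["b", "c"], ["p", "q"])
```

The second call fails because the first already replaced the blocks, which is the property that makes the pair usable as an atomic test-and-set on shared storage.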
Data Set Management (DSM) Hints
[Figure: Read and Write commands each carry a Starting LBA, Num Logical Blks, and a DSM field; the Dataset Management command carries 1 to 256 LBA Ranges, each with its own DSM attributes]
DSM Hints
Access size (in logical blocks)
Written in near future
Sequential read
Sequential write
Access latency (longer, typical,
small)
Access frequency
Typical read and write
Infrequent read and write
Infrequent write, frequent read
Frequent write, infrequent read
Frequent read and write
Dataset Management Command
Deallocate (TRIM)
Integral write dataset
Integral read dataset
Reservation Overview
Reservations provide capabilities that may be utilized by two or
more hosts to provide coordinated access to a shared
namespace
The protocol and manner in which these capabilities are used are
outside the scope of NVMe
Reservations are functionally compatible with T10 persistent
reservations
Reservations are on a namespace and restrict host access to that
namespace
If a host submits a command to a namespace in the presence of a
reservation and lacks sufficient rights, then the command is aborted
by the controller with a status of Reservation Conflict
Capabilities are provided to allow recovery from a reservation
held by a failing or uncooperative host
Example Multi-Host System
[Figure: Host A connects to NVM Express Controller 1 and Controller 2 (both Host ID = A), Host B connects to Controller 3 (Host ID = B), and Host C connects to Controller 4 (Host ID = C); each controller exposes NSID 1, and all four map to the same Namespace in the NVM Subsystem]
Host Identifier (Host ID) associated with each controller allows NVM subsystem to
identify controllers associated with the same host and preserve reservation
properties across controllers
New NVM Reservation Commands
NVM I/O Command: Operation
Reservation Register
Register a reservation key
Unregister a reservation key
Replace a reservation key
Reservation Acquire
Acquire a reservation on a namespace
Preempt a reservation held on a namespace
Abort a reservation held on a namespace
Reservation Release
Release a reservation held on a namespace
Clear a reservation held on a namespace
Reservation Report
Retrieve reservation status data structure
Type of reservation held on the namespace (if any)
Persist through power loss state
Reservation status, Host ID, reservation key for each
host that has access to the namespace
Command Behavior In Presence
of a Reservation
[Table: Read and Write permission columns for the Reservation Holder, Registrant, and Non-Registrant per reservation type; the cell values were graphical and are not reproduced here]
Reservation Type                       Reservation Holder Definition
Write Exclusive                        One Reservation Holder
Exclusive Access                       One Reservation Holder
Write Exclusive - Registrants Only     One Reservation Holder
Exclusive Access - Registrants Only    One Reservation Holder
Write Exclusive - All Registrants      All Registrants are Reservation Holders
Exclusive Access - All Registrants     All Registrants are Reservation Holders
Reservation Acquire
The Reservation Acquire command is used
to acquire a reservation on a namespace,
preempt a reservation held on a namespace,
and abort a reservation held on a
namespace
Layout (fields listed from high bits to low bits):
DWord 0: Command Identifier | FUSE | Opcode
DWord 1: Namespace Identifier
DWords 6-7: PRP Entry 1
DWords 8-9: PRP Entry 2
DWord 10: Reservation Type | IEKEY | RACQA
Reservation Type (RTYPE): specifies the type of
reservation to be created
Ignore Existing Key (IEKEY): If this bit is set to
1, then the Current Reservation Key (CRKEY)
check is disabled and the command shall succeed
regardless of the CRKEY field value
Reservation Acquire Action (RACQA): specifies
the action that is performed by the command.
000b: Acquire
001b: Preempt
010b: Preempt and Abort
Logical Block Format
LBA Data: 2^n bytes, where n ≥ 9
512B, 1024B, 2048B, 4096B, ...
LBA Metadata: N bytes
Identify Namespace data structure indicates
supported formats
A Namespace may indicate support for up to 16
different formats
Example:
512B, 520B, 528B, 4096B, ...
Metadata Host Transfer Options
Protection Information Location
[Figure: a logical block shown as LBA Data followed by LBA Metadata, with the protection information (PI) placed inside the metadata in one of two positions]
Protection Information in First 8B of Metadata
Protection Information in Last 8B of Metadata
End-to-End Data Protection Options
[Figure: three data paths from Host through the NVMe Controller to NVM inside a PCIe SSD]
No Data Protection Information: LB Data alone travels between host, controller, and NVM
End-to-End Data Protection Information: LB Data with protection information (Prot.) travels between host, controller, and NVM
Insert & Strip End-to-End Data Protection Information: the host transfers LB Data alone, and protection information is present between the controller and NVM
Functionally compatible with T10 DIF & DIX, including DIF Type 1, 2, and 3
End-to-end protection configured per namespace with NVM Format command
Controller may insert and strip protection information
Format NVM
Used to low level format a namespace
Support for this command is optional
May apply to a specific namespace or to all
namespaces
Layout (fields listed from high bits to low bits):
DWord 0: Command Identifier | FUSE | Opcode
DWord 1: Namespace Identifier
DWord 10: SES | PIL | PI | MS | LBAF
LBA Format (LBAF): Indicates one of the
supported LBA formats (in Identify)
Metadata Settings (MS): Extended LBA or two buffers
0: Two buffers
1: Extended LBA
Protection Information (PI): Protection
information mode
0: No PI
1: Type 1
2: Type 2
3: Type 3
Protection Information Location (PIL)
0: Last 8 bytes of metadata
1: First 8 bytes of metadata
Secure Erase Settings (SES)
0: No secure erase
1: User data erase
2: Cryptographic erase
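The five fields pack into DWord 10 as shown below. The bit positions (LBAF in bits 3:0, MS in bit 4, PI in bits 7:5, PIL in bit 8, SES in bits 11:9) are assumptions taken from the NVMe 1.x specification rather than this slide, and `format_nvm_cdw10` is a hypothetical helper.

```python
def format_nvm_cdw10(lbaf, ms=0, pi=0, pil=0, ses=0):
    """Pack the Format NVM settings into command DWord 10."""
    return (ses << 9) | (pil << 8) | (pi << 5) | (ms << 4) | lbaf


# LBA format 1, extended LBA, Type 1 protection in the first 8 bytes
# of metadata, with a cryptographic erase.
cdw10 = format_nvm_cdw10(lbaf=1, ms=1, pi=1, pil=1, ses=2)
```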
NVMe Power Management
[Figure: a Power Manager (host software) takes a Power Objective and a Performance Objective, sets the Power State of the NVMe SSD, and receives Performance Statistics back]
Power State Descriptor Table
Columns: Power State, Maximum Power, Operational State, Entry Latency, Exit Latency, Relative Read Throughput, Relative Read Latency, Relative Write Throughput, Relative Write Latency
(Relative throughput and latency values, and entries shown as "...", are not recoverable.)
Power State   Maximum Power   Operational State   Entry Latency   Exit Latency
0             25 W            Yes                 ... ms          ... ms
1             18 W            Yes                 ... ms          ... ms
2             18 W            Yes                 ... ms          ... ms
3             15 W            Yes                 20 ms           15 ms
4             7 W             Yes                 20 ms           30 ms
5             1 W             No                  100 ms          50 ms
6             .25 W           No                  100 ms          500 ms
Autonomous Power State Transitions
Autonomous Power State Transition Table
Columns: Idle Time Prior to Transition, Idle Transition Power State
[Table: shown beside the Power State Descriptor Table (power states of 25 W, 18 W, 18 W, 15 W, 7 W, 1 W, and .25 W); the idle time prior to transition is 500 ms for each of the five operational power states and 10,000 ms for the 1 W state; the Idle Transition Power State values are not recoverable]
[Figure: after 500 ms idle the controller moves from one power state to a lower one, and after 10,000 ms idle to a still lower one; I/O activity (Submission Queue Tail Doorbell Written) triggers the transition back]
Backup
SGL Data Block Descriptor
Used to transfer data between PCIe memory and Controller
Address: 64-bit PCIe address of the data
  Supports any byte alignment
Length: Length of the data block in bytes
  A value of zero indicates that no data is transferred
Descriptor layout (16 bytes)
  Bytes 0-7     Address
  Bytes 8-11    Length
  Bytes 12-14   Reserved
  Byte 15       SGL Desc. Type, Desc. Type Specific
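The 16-byte Data Block descriptor can be packed directly from its layout. A minimal sketch, assuming little-endian field encoding and a descriptor type code of 0h held in the upper four bits of byte 15 (both taken from the NVMe 1.1 specification rather than from this slide); `pack_sgl_data_block` is an illustrative name:

```python
import struct

SGL_TYPE_DATA_BLOCK = 0x0  # assumed type code, stored in bits 7:4 of byte 15

def pack_sgl_data_block(address, length):
    """Build a 16-byte descriptor: 8-byte address, 4-byte length, 3 reserved bytes, type byte."""
    if not 0 <= address < 2**64:
        raise ValueError("address must fit in 64 bits")
    if not 0 <= length < 2**32:
        raise ValueError("length must fit in 32 bits")
    type_byte = (SGL_TYPE_DATA_BLOCK << 4) & 0xF0
    # "<" = little-endian, Q = 8-byte address, I = 4-byte length, 3x = reserved, B = type byte
    return struct.pack("<QI3xB", address, length, type_byte)
```

A length of zero is accepted because the descriptor defines it as "no data is transferred".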
SGL Bit Bucket Descriptor
Skip source data bytes
  Only makes sense for controller to host transfers
  This descriptor is ignored in host to controller transfers
Length: Length of the data block in bytes
  A value of zero indicates that no data is transferred
Descriptor layout (16 bytes)
  Bytes 0-7     Reserved
  Bytes 8-11    Length
  Bytes 12-14   Reserved
  Byte 15       SGL Desc. Type, Desc. Type Specific
SGL Segment and SGL Last Segment Descriptors
SGL Segment: Pointer to next SGL Segment
SGL Last Segment: Pointer to last SGL Segment
Address: Address in PCIe memory of next segment
  Must be 64-bit aligned
Length: Length of the segment in bytes
  Must be multiple of 16 (a descriptor is 16B)
Descriptor layout (16 bytes)
  Bytes 0-7     Address
  Bytes 8-11    Length
  Bytes 12-14   Reserved
  Byte 15       SGL Desc. Type, Desc. Type Specific
Delete I/O Submission Queue
Delete specified I/O submission queue
Queue Identifier: Submission queue ID number
Command Specific Error Values
  Invalid Queue Identifier
Command layout
  DWord 0    Command Identifier, FUSE, Opcode
  DWord 1    Namespace Identifier
  DWord 10   Queue Identifier
Delete I/O Completion Queue
Delete specified I/O completion queue
Queue Identifier: Completion queue ID number
Command layout
  DWord 0    Command Identifier, FUSE, Opcode
  DWord 1    Namespace Identifier
  DWord 10   Queue Identifier
Get Features
Get value of configurable feature
Feature value returned in memory (PRPs) or in DWord 0 of completion entry
PRP Entry 1: Starting address of where feature data should be written (used by some features)
PRP Entry 2: Starting address of where the remainder of the feature data should be written (used by some features)
Feature Identifier: ID of feature
Command layout
  DWord 0      Command Identifier, FUSE, Opcode
  DWord 1      Namespace Identifier
  DWords 6-7   PRP Entry 1
  DWords 8-9   PRP Entry 2
  DWord 10     Feature Identifier
Abort
Used to cancel/abort a specific command previously issued on the admin or I/O submission queue
  (Submission Queue ID, Command Identifier) is globally unique
  The aborting of a command is best effort by the controller
  It is implementation specific when a controller completes the Abort command if the command to abort is not found
  A controller specifies the maximum number of outstanding Abort commands that it can support in the Identify Controller Data Structure
Submission Queue ID: ID of submission queue on which command was issued
Command Identifier: ID of the command to abort
Command layout
  DWord 0    Command Identifier, FUSE, Opcode
  DWord 1    Namespace Identifier
  DWord 10   Command Identifier, Submission Queue ID
Completion entry layout
  DWord 0    A
  DWord 2    SQ Identifier, SQ Head Pointer
  DWord 3    Status Field, P, Command Identifier
A: Abort status
  0 Command was aborted
  1 Command was not aborted
Firmware Image Download
Used to download all or a portion of a firmware image
  Firmware image may consist of multiple pieces
  Pieces do not need to be downloaded in order
  Pieces must not overlap
PRP Entry 1 and PRP Entry 2: PRP entries / list pointer where firmware piece is located
Number of Dwords: Number of DWords contained in the portion of the firmware image being downloaded
Offset: DWord offset from 0 (the start) associated with this firmware piece
Command layout
  DWord 0      Command Identifier, FUSE, Opcode
  DWord 1      Namespace Identifier
  DWords 6-7   PRP Entry 1
  DWords 8-9   PRP Entry 2
  DWord 10     Number of Dwords
  DWord 11     Offset
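Splitting an image into non-overlapping pieces amounts to computing a DWord offset and a DWord count for each piece. A minimal sketch, assuming a 4 KB piece size chosen for illustration; `firmware_pieces` is a hypothetical helper, and the counts it returns are plain DWord counts (the NVMe 1.1 specification encodes the Number of Dwords field as a zero-based value, so a real command would carry the count minus one):

```python
def firmware_pieces(image, piece_bytes=4096):
    """Split a firmware image into (offset_dwords, num_dwords, data) pieces."""
    if piece_bytes <= 0 or piece_bytes % 4 or len(image) % 4:
        raise ValueError("image and piece sizes must be whole DWords")
    pieces = []
    for start in range(0, len(image), piece_bytes):
        data = image[start:start + piece_bytes]
        # Offsets and counts are expressed in 4-byte DWords, matching the command fields.
        pieces.append((start // 4, len(data) // 4, data))
    return pieces
```

Because each piece starts where the previous one ends, the pieces cannot overlap, and they may be submitted in any order.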
Firmware Activate
Used to activate a firmware image
  Newly activated image is the one that runs after a controller reset
Performs two orthogonal operations
  Validates and loads downloaded firmware image into firmware slot
  Activates a firmware slot
Activate Action (AA): Action taken on the downloaded image or image associated with a firmware slot
  00b Downloaded image becomes the new image in the firmware slot specified by the FS field. This image is NOT activated.
  01b Downloaded image becomes the new image in the firmware slot specified by the FS field and this image is activated.
  10b Image contained in the firmware slot specified by the FS field is activated
Firmware Slot (FS): Field used by AA field to indicate which slot to be updated and/or activated
  Values 1 through 7 indicate a slot number
  Value of 0 indicates that the controller should pick a slot number
Command layout
  DWord 0    Command Identifier, FUSE, Opcode
  DWord 1    Namespace Identifier
  DWord 10   AA, FS
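The AA and FS fields combine into one command DWord. A minimal sketch, assuming the NVMe 1.1 encoding in which FS occupies bits 2:0 and AA bits 4:3 of DWord 10, and in which 10b activates the image already held in a slot; the constant and function names are illustrative:

```python
AA_REPLACE = 0b00               # store downloaded image in slot, do not activate
AA_REPLACE_AND_ACTIVATE = 0b01  # store downloaded image in slot and activate it
AA_ACTIVATE_SLOT = 0b10         # activate the image already in the slot (assumed encoding)

def firmware_activate_cdw10(action, slot=0):
    """Pack Activate Action and Firmware Slot into DWord 10 (bit positions are an assumption)."""
    if action not in (AA_REPLACE, AA_REPLACE_AND_ACTIVATE, AA_ACTIVATE_SLOT):
        raise ValueError("unsupported Activate Action")
    if not 0 <= slot <= 7:
        raise ValueError("slot must be 0 (controller picks) or 1..7")
    return (action << 3) | slot
```

Activating the image in slot 6, for instance, gives `firmware_activate_cdw10(AA_ACTIVATE_SLOT, 6)`; the new image then runs after the next controller reset.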
Security Receive
Transfers the status and data result of one or more Security Send commands that were previously submitted to the controller
PRP Entry 1, PRP Entry 2: Host buffers that contain the security protocol information
Security Protocol: Specifies the security protocol as defined in SPC-4
SP Specific: Specific to the Security Protocol as defined in SPC-4
Allocation Length: Specific to the Security Protocol as defined in SPC-4
Command layout
  DWord 0      Command Identifier, FUSE, Opcode
  DWord 1      Namespace Identifier
  DWords 6-7   PRP Entry 1
  DWords 8-9   PRP Entry 2
  DWord 10     Security Protocol, SP Specific
  DWord 11     Allocation Length
Security Send
Used to transfer security protocol data to the controller
PRP Entry 1, PRP Entry 2: Host buffers that contain the security protocol information
Security Protocol: Specifies the security protocol as defined in SPC-4
SP Specific: Specific to the Security Protocol as defined in SPC-4
Transfer Length: Specific to the Security Protocol as defined in SPC-4
Command layout
  DWord 0      Command Identifier, FUSE, Opcode
  DWord 1      Namespace Identifier
  DWords 6-7   PRP Entry 1
  DWords 8-9   PRP Entry 2
  DWord 10     Security Protocol, SP Specific
  DWord 11     Transfer Length
Write
Write logical blocks to NVM and perform specified protection information processing
PRP Entry 1, PRP Entry 2, Metadata Pointer: Host buffers for data to be written to NVM
Starting LBA: Address of first logical block to be written
Number of Logical Blocks: Number of logical blocks to write to NVM
Limited Retry (LR): Apply limited retry or apply all available means to write data to NVM
Force Unit Access (FUA): Write data to NVM
Data Set Management (DSM)
Protection Information Field (PRINFO)
  Protection Information Action: Pass protection information or write and insert
  Protection Information Check
    Guard field check or no check
    Application tag field check or no check
    Reference tag field check or no check
Protection Information Related Fields (described later)
  Initial Logical Block Reference Tag
  Logical Block Application Tag Mask
  Logical Block Application Tag
Command layout
  DWord 0        Command Identifier, FUSE, Opcode
  DWord 1        Namespace Identifier
  DWords 4-5     Metadata Pointer or Metadata SGL Segment Pointer
  DWords 6-7     PRP Entry 1
  DWords 8-9     PRP Entry 2
  DWords 10-11   Starting LBA
  DWord 12       LR, FUA, PRINFO, Number of Logical Blocks
  DWord 13       DSM
  DWord 14       Initial Logical Block Reference Tag
  DWord 15       Logical Block Application Tag Mask, Logical Block Application Tag
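The Starting LBA and the DWord 12 fields of a Write can be encoded as three integers. A minimal sketch, assuming the NVMe 1.1 bit positions (LR in bit 31, FUA in bit 30, PRINFO in bits 29:26, Number of Logical Blocks in bits 15:0 as a zero-based value); none of these positions appear on the slide, and `write_cdw10_12` is an illustrative name:

```python
def write_cdw10_12(slba, nlb, lr=False, fua=False, prinfo=0):
    """Return (CDW10, CDW11, CDW12) for a Write of nlb logical blocks starting at slba."""
    if not 0 <= slba < 2**64:
        raise ValueError("Starting LBA must fit in 64 bits")
    if not 1 <= nlb <= 2**16:
        raise ValueError("a single command covers 1 to 65536 logical blocks")
    if not 0 <= prinfo <= 0xF:
        raise ValueError("PRINFO is a 4-bit field")
    cdw10 = slba & 0xFFFFFFFF           # low 32 bits of Starting LBA
    cdw11 = slba >> 32                  # high 32 bits of Starting LBA
    # Number of Logical Blocks is assumed zero-based, so 8 blocks is stored as 7.
    cdw12 = (int(lr) << 31) | (int(fua) << 30) | (prinfo << 26) | (nlb - 1)
    return cdw10, cdw11, cdw12
```

The remaining DWords (PRP entries, DSM, and the protection information tags) are left out to keep the sketch short.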
Flush
Causes any data in volatile storage to be flushed to non-volatile memory
Volatile Write Cache (VWC) field in Identify Controller Data Structure
  0 Volatile write cache is NOT present
  1 Volatile write cache is present
    Flush command may be used to write volatile data to NVM
    Set Features command may be used to enable/disable the volatile write cache
Command layout
  DWord 0    Command Identifier, FUSE, Opcode
  DWord 1    Namespace Identifier
Write Uncorrectable
Mark logical blocks invalid
  Subsequent reads return Unrecovered Read Error status
Starting LBA: Address of first logical block to be written
Number of Logical Blocks: Number of logical blocks to write to NVM
Command layout
  DWord 0        Command Identifier, FUSE, Opcode
  DWord 1        Namespace Identifier
  DWords 10-11   Starting LBA
  DWord 12       Number of Logical Blocks
Write Zeroes
Write zeroes to the logical blocks on the NVM and perform specified protection information processing
Starting LBA: Address of first logical block to be written
Number of Logical Blocks: Number of logical blocks to write to NVM
Limited Retry (LR): Apply limited retry or apply all available means to write data to NVM
Force Unit Access (FUA): Write data to NVM
Protection Information Field (PRINFO)
  Protection Information Action: Pass protection information or write and insert
  Protection Information Check
    Guard field check or no check
    Application tag field check or no check
    Reference tag field check or no check
Protection Information Related Fields (described later)
  Initial Logical Block Reference Tag
  Logical Block Application Tag Mask
  Logical Block Application Tag
Command layout
  DWord 0        Command Identifier, FUSE, Opcode
  DWord 1        Namespace Identifier
  DWords 10-11   Starting LBA
  DWord 12       LR, FUA, PRINFO, Number of Logical Blocks
  DWord 14       Initial Logical Block Reference Tag
  DWord 15       Logical Block Application Tag Mask, Logical Block Application Tag
Compare
Read logical block data from NVM and compare the data read to data buffer(s) fetched from the host
Same fields as a read operation
  Expected Initial Logical Block Reference Tag
  Expected Logical Block Application Tag
  Expected Logical Block Application Tag Mask
  No Dataset Management field
Protection information checking is performed (if enabled)
Command layout
  DWord 0        Command Identifier, FUSE, Opcode
  DWord 1        Namespace Identifier
  DWords 4-5     Metadata Pointer or Metadata SGL Segment Pointer
  DWords 6-7     PRP Entry 1
  DWords 8-9     PRP Entry 2
  DWords 10-11   Starting LBA
  DWord 12       LR, FUA, PRINFO, Number of Logical Blocks
Dataset Management
Allows host to indicate attributes for ranges of logical blocks
  Each logical block range definition is 16B
  Up to 256 range definitions in a command
  Range definitions are held in a contiguous buffer that is up to 4KB in size
  Buffer is defined by PRP1 and PRP2
Number of Ranges: Number of range definitions associated with the command
Integral Dataset for Read (IDR): Indicates that dataset (all provided ranges) should be optimized to be read as a single unit. If a portion of the dataset is read, it is expected that all range definitions will be read.
Integral Dataset for Write (IDW): Indicates that dataset (all provided ranges) should be optimized to be written as a single unit. If a portion of the dataset is written, it is expected that all range definitions will be written.
Deallocate (AD): Indicates that all provided ranges may be de-allocated
Command layout
  DWord 0      Command Identifier, FUSE, Opcode
  DWord 1      Namespace Identifier
  DWords 6-7   PRP Entry 1
  DWords 8-9   PRP Entry 2
  DWord 10     Number of Ranges
  DWord 11     AD, IDW, IDR
Range Definition
Context Attributes: Provides information on how range will be used by host software (described later)
Length in Logical Blocks: Number of logical blocks associated with range definition
Starting LBA: Range definition logical block starting address
Buffer layout (Range 0, Range 1, Range 2, ... follow one another, 16 bytes each)
  DWord 0      Context Attributes
  DWord 1      Length in Logical Blocks
  DWords 2-3   Starting LBA
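The range buffer is a flat array of 16-byte entries. A minimal sketch of building it, assuming little-endian encoding (NVMe data structures are little-endian, although the slide does not say so); `pack_dsm_ranges` is an illustrative name:

```python
import struct

def pack_dsm_ranges(ranges):
    """Pack (context_attributes, length_in_blocks, starting_lba) tuples, 16 bytes each."""
    if not 1 <= len(ranges) <= 256:
        raise ValueError("a command carries 1 to 256 range definitions")
    # I = 4-byte Context Attributes, I = 4-byte Length in Logical Blocks, Q = 8-byte Starting LBA
    return b"".join(struct.pack("<IIQ", ctx, nlb, slba) for ctx, nlb, slba in ranges)
```

With the maximum of 256 ranges the buffer is 256 x 16 = 4096 bytes, which matches the 4KB limit stated for the command.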
Context Attributes
Command Access Size: Number of logical blocks that are expected to be accessed in a read or write command in the near future. Zero indicates no information provided
Write Prepare (WP): Range is expected to be written in the near future
Sequential Write Range (SW): Optimize for sequential writes as a single object
Sequential Read Range (SR): Optimize for sequential reads as a single object
Access Latency (AL)
  No info provided
  Longer latency acceptable
  Typical latency
  Smallest latency possible
Access Frequency (AF)
  No info provided
  Typical access
  Infrequent access
  Infrequent writes and frequent reads
  Frequent writes and infrequent reads
  Frequent writes and frequent reads
Layout (one DWord, high bits to low bits): Command Access Size, WP, SW, SR, AL, AF
Reservation Register
The Reservation Register command is used to register, unregister, or replace a reservation key
Change Persist Through Power Loss State (CPTPL): This field allows the Persist Through Power Loss state associated with the namespace to be modified as a side effect of processing this command
  00b No change to PTPL state
  10b Set PTPL state to 0. Reservations are released and registrants are cleared on a power on
  11b Set PTPL state to 1. Reservations and registrants persist across a power loss
Ignore Existing Key (IEKEY): If this bit is set to a 1, then Reservation Register Action (RREGA) field values that use the Current Reservation Key (CRKEY) shall succeed regardless of the value of the Current Reservation Key field in the command (i.e., the current reservation key is not checked)
Reservation Register Action (RREGA): Specifies the action that is performed by the command.
  000b Register Reservation Key
  001b Unregister Reservation Key
  010b Replace Reservation Key
Command layout
  DWord 0      Command Identifier, FUSE, Opcode
  DWord 1      Namespace Identifier
  DWords 6-7   PRP Entry 1
  DWords 8-9   PRP Entry 2
  DWord 10     CPTPL, IEKEY, RREGA
Reservation Release
The Reservation Release command is used to release or clear a reservation held on a namespace
Reservation Type (RTYPE): If the Reservation Release Action is 00b (i.e., Release), then this field specifies the type of reservation that is being released. The reservation type in this field shall match the current reservation type
Ignore Existing Key (IEKEY): If this bit is set to a 1, then the Current Reservation Key (CRKEY) check is disabled and the command succeeds regardless of the CRKEY field value
Reservation Release Action (RRELA): Specifies the registration action that is performed by the command.
  00b Release
  01b Clear
Command layout
  DWord 0      Command Identifier, FUSE, Opcode
  DWord 1      Namespace Identifier
  DWords 6-7   PRP Entry 1
  DWords 8-9   PRP Entry 2
  DWord 10     Reservation Type, IEKEY, RRELA
Reservation Report
The Reservation Report command returns a Reservation Status data structure to host memory that describes the registration and reservation status of a namespace
Number of Dwords (NUMD): Specifies the number of Dwords of the Reservation Status data structure to transfer.
Command layout
  DWord 0      Command Identifier, FUSE, Opcode
  DWord 1      Namespace Identifier
  DWords 6-7   PRP Entry 1
  DWords 8-9   PRP Entry 2
  DWord 10     Number of Dwords
Reservations in Action
Example: Host A and B have read/write access and host C has read-only access to the shared namespace
[Diagram: Host A connects to NVM Express Controller 1 and Controller 2 (Host ID = A), Host B to Controller 3 (Host ID = B), and Host C to Controller 4 (Host ID = C); every controller exposes NSID 1, which maps to one shared Namespace in the NVM Subsystem]
HostA-SetFeatures(HostID_A) -> OK
HostB-SetFeatures(HostID_B) -> OK
HostC-SetFeatures(HostID_C) -> OK
HostA-Register(NSID, Key_A) -> OK
HostB-Register(NSID, Key_B) -> OK
HostA-AcquireReservation(NSID, Reservation, WriteExclusiveRegistrantsOnly, Key_A) -> OK
HostC-AcquireReservation(NSID, Reservation, WriteExclusiveRegistrantsOnly, Key_C) -> Error Reservation Conflict
HostA-Write(NSID) -> OK
HostB-Read(NSID) -> OK
HostB-Write(NSID) -> OK
HostC-Read(NSID) -> OK
HostC-Write(NSID) -> Error Reservation Conflict
HostA-ReleaseReservation(NSID, Key_A) -> OK
HostC-Write(NSID) -> OK
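The access rules behind this example can be modeled in a few lines. A toy sketch of a Write Exclusive, Registrants Only reservation on one namespace; the class and method names are hypothetical and the model ignores everything a real controller tracks beyond registrants and the reservation holder:

```python
class Namespace:
    """Toy model of a Write Exclusive - Registrants Only reservation (illustration only)."""

    def __init__(self):
        self.registrants = {}  # host id -> registered reservation key
        self.holder = None     # host id that holds the reservation, if any

    def register(self, host, key):
        self.registrants[host] = key
        return "OK"

    def acquire(self, host, key):
        # Only a registrant presenting its own key may acquire a free reservation.
        if self.registrants.get(host) != key or self.holder not in (None, host):
            return "Reservation Conflict"
        self.holder = host
        return "OK"

    def release(self, host, key):
        if self.registrants.get(host) != key:
            return "Reservation Conflict"
        if self.holder == host:
            self.holder = None
        return "OK"

    def read(self, host):
        return "OK"  # reads are not restricted by this reservation type

    def write(self, host):
        # While a reservation is held, only registrants may write.
        if self.holder is not None and host not in self.registrants:
            return "Reservation Conflict"
        return "OK"
```

Replaying the sequence from the slide against this model gives the same results: Host C is refused the reservation and refused writes until Host A releases, after which its write succeeds.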
Queue Management
To allocate I/O Submission Queues and I/O Completion Queues,
host software follows these steps:
1. Configure the Admin Registers and enable controller (CC.EN=1)
2. Submit a Set Features command for the Number of Queues
attribute in order to request the number of I/O Submission
Queues and I/O Completion Queues desired. The completion of
this Set Features command indicates the number of I/O
Submission and Completion Queues allocated.
3. Determine the maximum number of entries supported per queue
(CAP.MQES) and whether the queues are required to be
physically contiguous (CAP.CQR)
4. Allocate the desired I/O Completion Queues by using the Create
I/O Completion Queue command.
5. Allocate the desired I/O Submission Queues by using the
Create I/O Submission Queue command.
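Step 2 boils down to encoding two requested counts and decoding two granted counts. A minimal sketch, assuming the NVMe 1.1 encoding of the Number of Queues feature (submission queue count in bits 15:0, completion queue count in bits 31:16, both zero-based, in command DWord 11 and in completion DWord 0); the function names are illustrative:

```python
def number_of_queues_cdw11(sq_wanted, cq_wanted):
    """Encode requested I/O queue counts for Set Features, Number of Queues (assumed layout)."""
    for count in (sq_wanted, cq_wanted):
        if not 1 <= count <= 65535:
            raise ValueError("queue count must be between 1 and 65535")
    return ((cq_wanted - 1) << 16) | (sq_wanted - 1)

def queues_allocated(completion_dw0):
    """Decode (submission, completion) queue counts granted by the controller."""
    return (completion_dw0 & 0xFFFF) + 1, ((completion_dw0 >> 16) & 0xFFFF) + 1
```

The controller may grant fewer queues than requested, so host software should size its Create I/O Completion Queue and Create I/O Submission Queue loops from the decoded values rather than from the request.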
PCI Express SR-IOV
[Diagram: one PCIe Port with Physical Function 0 and four NVMe Controllers exposed as Virtual Functions (0,1), (0,2), (0,3), and (0,4); each controller presents NSID 1 and NSID 2, backed by namespaces NS A through NS E]
Multi-Path I/O and Namespace
Sharing
An NVMe namespace may be accessed via multiple paths
SSD with multiple PCI Express ports
SSD behind a PCIe switch to many hosts
Two hosts accessing the same namespace must coordinate
NVM Express 1.1 added hooks to enable Enterprise multi-host usage models
Globally Unique ID for a namespace
Reservation capability
[Diagram: PCIe Port x and PCIe Port y each lead to an NVMe Controller (PCI Function 0); each controller presents NSID 1 and NSID 2, backed by namespaces NS A, NS B, and NS C]
Controller Shutdown
The host performs the following actions in sequence for a normal shutdown:
1. Stop submitting any new I/O commands to the controller and allow any
outstanding commands to complete.
2. The host should delete all I/O Submission Queues, using the Delete I/O
Submission Queue command.
3. The host should delete all I/O Completion Queues, using the Delete I/O
Completion Queue command.
4. The host should set the Shutdown Notification (CC.SHN) field to 01b to indicate
a normal shutdown operation. The controller indicates when shutdown
processing is completed by updating the Shutdown Status (CSTS.SHST) field to
10b.
The host performs the following actions in sequence for an abrupt shutdown:
1. Stop submitting any new I/O commands to the controller.
2. The host should set the Shutdown Notification (CC.SHN) field to 10b to indicate
an abrupt shutdown operation. The controller indicates when shutdown
processing is completed by updating the Shutdown Status (CSTS.SHST) field to
10b.
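Both shutdown sequences end with the same register handshake. A minimal sketch of the bit manipulation, assuming the NVMe 1.1 register layout (CC.SHN in bits 15:14 of CC, CSTS.SHST in bits 3:2 of CSTS); the register access itself is left out, and the names are illustrative:

```python
SHN_NORMAL = 0b01     # CC.SHN value for a normal shutdown
SHN_ABRUPT = 0b10     # CC.SHN value for an abrupt shutdown
SHST_COMPLETE = 0b10  # CSTS.SHST value once shutdown processing is complete

def set_shn(cc, shn):
    """Return the CC register value with the Shutdown Notification field replaced."""
    if shn not in (SHN_NORMAL, SHN_ABRUPT):
        raise ValueError("unsupported shutdown notification")
    return (cc & ~(0b11 << 14) & 0xFFFFFFFF) | (shn << 14)

def shutdown_complete(csts):
    """True when the CSTS register value reports shutdown processing complete."""
    return ((csts >> 2) & 0b11) == SHST_COMPLETE
```

Host software would write the value returned by `set_shn` to CC and then poll CSTS until `shutdown_complete` returns true, ideally with a timeout.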
Firmware Update Process
Firmware slots allow multiple images to be supported
  Controller supports 1 to 7 slots
  Slot 0 not a valid slot - reserved
  Slot 1 may be a read-only firmware image
Firmware update process
  Download Firmware Image: controller transfers image from host
  Activate Firmware:
    Replace Firmware: controller validates image & applies to selected slot
    Controller makes selected slot active
  Firmware update occurs on next reset
Firmware boot failure
  Revert to previous active slot or baseline read-only image
  Host software notified via a Firmware Image Load Error asynchronous event
[Diagram: New Firmware Image in Host Memory -> Firmware Image Download -> Firmware Slots -> Firmware Activate (Slot 6) -> Controller Reset -> Controller Running Slot 6 Firmware Image]
Resets
There are five primary controller level reset mechanisms:
NVM Subsystem Reset
Conventional Reset (PCI Express Hot, Warm, or Cold reset)
PCI Express transaction layer Data Link Down status
Function Level Reset (PCI reset)
Controller Reset (CC.EN transitions from 1 to 0)
When any of the above resets occur, the following actions are performed:
All I/O Submission and Completion Queues are deleted.
All outstanding Admin and I/O commands shall be processed as aborted by
host software.
The controller is brought to an Idle state; when this is complete, CSTS.RDY is cleared to 0.
The Admin Queue registers (AQA, ASQ, or ACQ) are not reset as part of a
controller reset. All other controller registers defined in section 3 and
internal controller state are reset.
In all cases except a Controller Reset, the PCI register space is reset as defined
by the PCI Express base specification.
NVM Subsystem Reset
If an NVM Subsystem Reset occurs, the entire NVM
subsystem is reset. This includes the initiation of a
Controller Level Reset on all controllers that make
up the NVM subsystem and a transition to the
Detect LTSSM state by all PCI Express ports of the
NVM subsystem.
An NVM Subsystem Reset is initiated when:
Power is applied to the NVM subsystem,
A value of 4E564D65h (NVMe) is written to the
NSSR.NSSRC field, or
A vendor specific event occurs.
To perform an NVM Subsystem Reset, write the value "NVMe" (4E564D65h) to the NSSR register
NVMe = NVM Express
Bit     Type   Reset   Description
31:00   RW     0h      NVM Subsystem Reset Control (NSSRC): A write of the value 4E564D65h ("NVMe") to this field initiates an NVM Subsystem Reset. A write of any other value has no functional effect on the operation of the NVM subsystem. This field shall return the value 0h when read.
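The reset value is simply the ASCII string "NVMe" read as a 32-bit number. A short sketch that derives it, with `nssr_write_triggers_reset` as an illustrative helper that mirrors the field description rather than any real driver function:

```python
# "N" = 4Eh, "V" = 56h, "M" = 4Dh, "e" = 65h, most significant byte first.
NSSRC_MAGIC = int.from_bytes(b"NVMe", "big")

def nssr_write_triggers_reset(value):
    """True only for the value that initiates an NVM Subsystem Reset; other writes have no effect."""
    return value == NSSRC_MAGIC
```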
Data Protection
Data protection information associated with each sector
Same format as DIF / DIX
Layout (8 bytes, MSB first within each field)
  Bytes 0-1   Guard
  Bytes 2-3   Application Tag
  Bytes 4-7   Reference Tag
Guard field
  CRC-16 as defined by T10 DIF
  IP Checksum not supported
Application tag field
  Same definition as T10 DIF
  May be used to disable checking of protection information (i.e., 0xFFFF)
  Generally opaque data not interpreted by controller
Reference tag field
  Same definition as T10 DIF
  May be used to disable checking of protection info (i.e., 0xFFFF_FFFF)
  Incrementing value associated with sector address or value provided as part of command
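The 8-byte protection information tuple can be computed in software for testing. A minimal sketch, assuming the standard T10 DIF CRC parameters (polynomial 8BB7h, initial value 0, no bit reflection, no final XOR), which this slide references but does not spell out; the function names are illustrative:

```python
import struct

def crc16_t10dif(data):
    """Bitwise CRC-16 using the T10 DIF polynomial 0x8BB7 and an initial value of 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def protection_info(block, app_tag, ref_tag):
    """Return Guard (2 bytes), Application Tag (2 bytes), Reference Tag (4 bytes), MSB first."""
    return struct.pack(">HHI", crc16_t10dif(block), app_tag, ref_tag)
```

A table-driven CRC would be faster; the bitwise loop is used here only because it is easy to check against the polynomial.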