
HCIA-Storage
Learning Guide

V5

Huawei Certification System


Huawei Certification is an integral part of the company's "Platform + Ecosystem" strategy, and
it supports the ICT infrastructure featuring "Cloud-Pipe-Device". It evolves to reflect the latest
trends of ICT development. Huawei Certification consists of two categories: ICT Infrastructure
Certification, and Cloud Service & Platform Certification, making it the most extensive technical
certification program in the industry.
Huawei offers three levels of certification: Huawei Certified ICT Associate (HCIA), Huawei
Certified ICT Professional (HCIP), and Huawei Certified ICT Expert (HCIE). Huawei Certification
covers all ICT fields and adapts to the industry trend of ICT convergence. With its leading talent
development system and certification standards, it is committed to fostering new ICT talent in the
digital era, and building a sound ICT talent ecosystem.
Huawei Certified ICT Associate-Storage (HCIA-Storage) is designed for Huawei engineers,
students, and ICT industry personnel. HCIA-Storage covers storage technology trends, basic
storage technologies, common advanced storage technologies, storage business continuity
solutions, and storage system O&M management.
The HCIA-Storage certification introduces you to the industry and market, helps you innovate,
and enables you to stand at the forefront of the storage field.

Contents

1 Storage Technology Trends
1.1 Storage Technology Trends
1.1.1 Data and Information
1.1.2 Data Storage
1.1.3 Development of Storage Technologies
1.1.4 Development Trend of Storage Products
2 Basic Storage Technologies
2.1 Intelligent Data Storage System
2.1.1 Intelligent Data Storage System
2.1.2 Intelligent Data Storage Components
2.1.3 Storage System Expansion Methods
2.2 RAID Technologies
2.2.1 Traditional RAID
2.2.2 RAID 2.0+
2.2.3 Other RAID Technologies
2.3 Common Storage Protocols
2.3.1 SAN Protocol
2.3.2 NAS Protocols
2.3.3 Object and HDFS Storage Protocols
2.3.4 Storage System Architecture Evolution
2.3.5 Storage System Expansion Methods
2.3.6 Huawei Storage Product Architecture
2.4 Storage Network Architecture
2.4.1 DAS
2.4.2 NAS
2.4.3 SAN
2.4.4 Distributed Architecture
3 Huawei Intelligent Storage Products and Features
3.1 Huawei Intelligent Storage Products
3.1.1 Panorama
3.1.2 All-Flash Storage
3.1.3 Hybrid Flash Storage
3.1.4 Scale-out Storage
3.1.5 Hyper-Converged Storage
3.1.6 Backup Storage
3.2 Storage System Operation Management
3.2.1 Storage Management Overview
3.2.2 Introduction to Storage Management Tools
3.2.3 Introduction to Basic Management Operations
3.3 Storage Resource Tuning Technologies and Applications
3.3.1 SmartThin
3.3.2 SmartTier&SmartCache
3.3.3 SmartAcceleration
3.3.4 SmartQoS
3.3.5 SmartDedupe&SmartCompression
3.3.6 SmartVirtualization
3.3.7 SmartMigration
3.4 Storage Data Protection Technologies and Applications
3.4.1 HyperSnap
3.4.2 HyperClone
3.4.3 HyperCDP
3.4.4 HyperLock
4 Storage System O&M Management
4.1 Storage System O&M Management
4.1.1 O&M Overview
4.1.2 O&M Management Tools
4.1.3 O&M Scenarios

1 Storage Technology Trends

1.1 Storage Technology Trends


1.1.1 Data and Information
1.1.1.1 What Is Data?
Data refers to recognizable symbols that record events: physical symbols, or combinations of such
symbols, that record the properties, states, and relationships of objects and events. These symbols
are identifiable and abstract. In a narrow sense, data refers to numbers. In a broad sense, data can
be characters, letters, digits, graphics, images, video, and audio that have specific meanings, and
can also be an abstract representation of the attributes, quantities, locations, and relationships of
objects. For example, 0, 1, 2, windy weather, rain, a fall in temperature, student records, and the
transportation of goods are all data.
In computer science, data is a generic term for all media such as numbers, letters, symbols, and
analog parameters that can be input to and processed by computer programs. Computers store and
process a wide range of objects that generate complex data.
The Global Data Management Community (DAMA) defines data as the expression of facts in the
form of texts, numbers, graphics, images, sounds, and videos.
1.1.1.2 Data Types
Based on data storage and management modes, data is classified into structured, semi-structured,
and unstructured data.
Structured data can be represented and stored in a relational database, and is often organized as
a two-dimensional table, for example, data stored in MySQL or Oracle databases.
Semi-structured data does not conform to the structure of relational databases or other data tables,
but uses tags to separate semantic elements or enforces hierarchies of records and fields. For
example, XML, HTML, and JSON.
Unstructured data is not organized in a regular or complete data structure, or does not have a
predefined data model. For example, texts, pictures, reports, images, audios, and videos.
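To make the three categories concrete, the short Python sketch below represents related facts as structured rows, semi-structured JSON, and an unstructured text note. The field names and sample values are illustrative assumptions, not part of any particular storage product.

```python
import json

# Structured data: fixed schema, fits a two-dimensional (relational) table.
structured_rows = [
    ("S001", "Alice", 21, "Storage Basics"),
    ("S002", "Bob", 22, "RAID Technologies"),
]

# Semi-structured data: self-describing keys/tags, no rigid table schema.
semi_structured = json.dumps({
    "student": {"id": "S002", "name": "Bob",
                "courses": ["RAID Technologies", "SAN Protocols"]}
})

# Unstructured data: free text (could equally be an image or video blob).
unstructured = "Bob asked about RAID 2.0+ after the Tuesday lecture."

print(structured_rows[0])                                # one table row
print(json.loads(semi_structured)["student"]["name"])    # navigate by keys
print(len(unstructured))   # only generic operations apply without parsing
```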
1.1.1.3 Data Processing Cycle
Data processing is the reorganization or reordering of data by humans or machines to increase its
value. A data processing cycle includes three basic steps: input, processing, and output (a minimal
sketch follows the list below).
⚫ Input: Data is input in a specific format, which depends on the processing mechanism. For
example, when a computer is used, the input data can be recorded on several types of media,
such as disks and tapes.
⚫ Processing: Actions are performed on the input data to obtain more data value. For example,
time card hours are used to calculate payroll, or sales orders are processed to generate sales
reports.
⚫ Output: generates and outputs the processing result. The form of the output data depends on
the data use. For example, the output data can be an employee's salary.
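As a minimal illustration of the input-processing-output cycle, the Python sketch below turns time-card hours (input) into pay amounts (processing) and prints one-line pay slips (output). The hourly rate and employee IDs are hypothetical values chosen for this example.

```python
# Input: time-card records, which a real system might read from disk or tape.
time_cards = {"E1001": 38.5, "E1002": 42.0}   # employee ID -> hours worked
HOURLY_RATE = 25.0                            # assumed flat hourly rate

# Processing: derive new, more valuable data (pay) from the raw input.
payroll = {emp: hours * HOURLY_RATE for emp, hours in time_cards.items()}

# Output: present the result in a form that matches its intended use.
for emp, pay in payroll.items():
    print(f"Pay slip for {emp}: {pay:.2f}")
```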
1.1.1.4 What Is Information
Information refers to the content that is transmitted and processed by voice, messaging, and
communication systems, and more broadly to everything that is communicated in human society.
By acquiring and identifying information about nature and society, people can distinguish between
different things and understand and transform the world. In all communication and control systems,
information is a form of universal connection. In 1948, mathematician Claude Elwood Shannon
pointed out in his paper "A Mathematical Theory of Communication" that the essence of information
is the resolution of random uncertainty. Information can even be regarded as the most basic unit
that constitutes all things in the universe.
Information is data with context. The context includes:
⚫ Application meanings of data elements and related terms
⚫ Format of data expression
⚫ Time range of the data
⚫ Relevance of data to particular usage
Generally speaking, the concept of "data" is objective and does not depend on people's will.
Information is processed data that has value and meaning.
For example, from the perspective of a football fan, the history of football, football matches,
coaches, players, and even the FIFA rules are all football data, whereas data about his or her
favorite team, favorite star, and followed football events is information.
People can never know "all data" but can obtain "adequate information" that allows them to make
decisions.
1.1.1.5 Data vs. Information
⚫ Data is a raw and unorganized fact that needs to be processed to become meaningful, whereas
information is a set of data that has been processed in a meaningful way according to a given
requirement.
⚫ Data does not have any specific purpose, whereas information carries a meaning that has been
assigned by interpreting data.
⚫ Data alone has no significance, while information is significant by itself.
⚫ Data never depends on information, while information depends on data.
⚫ Data is measured in bits and bytes, while information is measured in meaningful units such as
time and quantity.
⚫ Data can be structured as tabular data, graphs, or data trees, whereas information consists of
language, ideas, and thoughts based on the given data.
Data is a record that reflects the attributes of an object and is the specific form that carries
information. Data becomes information after being processed, and information needs to be digitized
into data for storage and transmission.
1.1.1.6 Information Lifecycle Management
Information lifecycle management (ILM) is an information technology strategy and concept, not just
a product or solution, for enterprise users. Data is key to informatization and is the core
competitiveness of an enterprise. Information enters a cycle from the moment it is generated. A
lifecycle is completed in the process of data creation, protection, access, migration, archiving, and
destruction. This process requires good management and cooperation. If the process is not well
managed, too many resources may be wasted or the work will be inefficient due to insufficient
resources.

Data management in ILM is generally divided into the following stages (a small policy sketch
follows the list):
⚫ Data creation stage: Data is generated from terminals and saved to storage devices.
⚫ Data protection stage: Different data protection technologies are used based on data and
application system levels to ensure that various types of data and information are effectively
protected in a timely manner. A storage system provides data protection functions, such as
redundant array of independent disk (RAID), HA, disaster recovery (DR), and permission
management.
⚫ Data access stage: Information must be easy to access and can be shared among organizations
and applications of enterprises to maximize business value.
⚫ Data migration stage: When using IT devices, you need to upgrade and replace devices, and
migrate the data from the old to new devices.
⚫ Data archiving stage: The data archiving system supports the business operation for enterprises
by providing the record query for transactions and decision-making. Deduplication and
compression are often used in this stage.
⚫ Data destruction stage: After a period of time, data is no longer needed. In this case, we can
destroy or reclaim data that does not need to be retained or stored, and clear the data from
storage systems and data warehouses.
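As a toy illustration of how these stages can drive automated data placement, the Python sketch below assigns a lifecycle action to a data set based on its age and idle time. The thresholds, tier names, and actions are invented for this example; real ILM policies are defined by business and compliance requirements.

```python
from datetime import date, timedelta

def ilm_action(created: date, last_access: date, today: date) -> str:
    """Return an illustrative lifecycle action for a data set (assumed policy)."""
    age = today - created
    idle = today - last_access
    if age > timedelta(days=7 * 365):      # assumed retention limit
        return "destroy"                   # destruction stage
    if idle > timedelta(days=365):
        return "archive"                   # archiving stage (dedupe/compress)
    if idle > timedelta(days=90):
        return "migrate to capacity tier"  # migration stage
    return "keep on performance tier"      # creation/protection/access stages

today = date(2023, 1, 1)
print(ilm_action(date(2015, 1, 1), date(2021, 6, 1), today))   # -> destroy
print(ilm_action(date(2022, 6, 1), date(2022, 12, 20), today)) # -> keep on performance tier
```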

1.1.2 Data Storage


1.1.2.1 What Is Data Storage
In a narrow sense, storage refers to the physical storage media with redundancy, protection,
migration, and other functions. For example, floppy disks, CDs, DVDs, disks, and even tapes.
In a broad sense, storage refers to a portfolio of solutions that provide information access,
protection, optimization, and utilization for enterprises. It is the pillar of the data-centric information
architecture.
Data storage covered in this course refers to storage in a broad sense.
1.1.2.2 Data Storage System
Storage technologies are not separate or isolated. Actually, a complete storage system consists of a
series of components.
A storage system consists of storage hardware, storage software, and storage solutions. Hardware
involves storage devices and devices for storage connections, such as disk arrays, tape libraries, and
FC switches. Storage software greatly improves the availability of a storage device. Data mirroring,
data replication, and automatic data backup can be implemented by using storage software.
1.1.2.3 Physical Structure of Storage
A typical storage system comprises the disk, control, connection, and storage management software
subsystems.
In terms of its physical structure, disks reside in the bottom layer and are connected to back-end
cards and controllers of the storage system via connectors such as optical fibers and serial cables.
The storage system is connected to hosts via front-end cards and storage network switching devices
to provide data access services.
Storage management software is used to configure, monitor, and optimize subsystems and
connectors of the storage system.

1.1.2.4 Data Storage Types


Storage systems can be classified into internal and external storage systems based on the locations
of storage devices and hosts.
An internal storage system is directly connected to the host bus, and includes the high-speed cache
and memory required for CPU computing and the disks and CD-ROM drives that are directly
connected to the main boards of computers. Its capacity is generally small and hard to expand.
An external storage system is classified into direct-attached storage (DAS) and fabric-attached
storage (FAS) by connection mode.
FAS is classified into network-attached storage (NAS) and storage area network (SAN) by
transmission protocol.
1.1.2.5 Evolution of Data Management Technologies
Data management is a process of effectively collecting, storing, processing, and applying data by
using computer hardware and software technologies. The purpose of data management is to use the
data. Data organization is the key to effective data management.
Data management technology is used to classify, organize, encode, input, store, retrieve, maintain,
and output data. The evolution of data storage devices and computer application systems promotes
the development of databases and data management technologies. Data management in a
computer system goes through four phases: manual management, file system management,
traditional database system management, and big data management.
1.1.2.6 Data Storage Application
Data generated by individuals and organizations is processed by computing systems and stored in
data storage systems.
In the ICT era, storage is mainly used for data access, data protection for security, and data
management.
Online storage means that storage devices and stored data are always online and accessible to users
anytime, and the data access speed meets the requirements of the computing platform. The working
mode is similar to the disk storage mode on PCs. Online storage devices use disks and disk arrays,
which are expensive but provide good performance.
Offline storage is used to back up online storage data to prevent possible data disasters. Data stored
on the offline storage is not often accessed, and is generally away from system applications. That is
why people use "offline" to describe this storage mode.
Data on offline storage is read and written sequentially. If a tape library is used as the offline
storage medium, the tape must be rolled from the beginning to locate the data before it can be
read, and modifying written data requires rewriting all of it. Therefore, offline storage is slow to
access and inefficient. A typical offline storage product is a tape library, which is relatively cheap.
Nearline storage (NearStore) is a storage type for providing more storage choices to customers. Its
costs and performance are between online storage and offline storage. If the data is not frequently
used or the amount of accessed data is small, it can be stored on nearline storage devices. These
devices still provide fast addressing capabilities and a high transmission rate. For example, files
that have not been used for a long time can be archived to nearline storage. Therefore, nearline
storage is suitable for scenarios that do not require top performance but still require relatively
fast access.

1.1.3 Development of Storage Technologies


1.1.3.1 Storage Architecture
The storage architecture derives from traditional storage, external storage, and storage networks,
and has developed into scale-out and cloud storage.
Traditional storage is composed of disks. In 1956, the world's first hard disk drive was invented,
which used 50 x 24-inch platters with a capacity of only 5 MB. It was as big as two refrigerators and
weighed more than a ton. It was used in the industrial field at that time and was independent of the
mainframe.
External storage is also called direct attached storage. Its earliest form is Just a Bundle Of Disks
(JBOD), which simply combines disks and is represented to hosts as a bundle of independent disks. It
only increases the capacity and cannot ensure data security.
The disks deployed in servers have the following disadvantages: limited slots and insufficient
capacity; poor reliability as data is stored on independent disks; disks becoming the system
performance bottleneck; low storage space utilization; scattered data since it is stored on different
servers.
JBOD solves the problem of limited slots to a certain extent, and the RAID technology improves
reliability and performance. External storage gradually develops into storage arrays with controllers.
The controllers contain the cache and support the RAID function. In addition, dedicated
management software can be configured. A storage array is presented to hosts as a large,
high-performance, redundant disk. However, DAS still suffers from scattered data and low
storage space utilization.
As the amount of data in our society increases explosively, data storage is required to offer
flexible data sharing, high resource utilization, and extended transmission distances. The
emergence of networks infuses new vitality into storage.
SAN: establishes a network between storage devices and servers to provide block storage services.
NAS: builds networks between servers and storage devices with file systems to provide file storage
services.
Since 2011, unified storage that supports both SAN and NAS protocols has been a popular choice.
Storage convergence set a new trend: converged NAS and SAN. This convergence provides both
database and file sharing services, simplifies storage management, and improves storage
utilization.
SAN is a typical storage network that originally transmitted data mainly over FC networks; IP SAN
emerged later.
Scale-out storage uses general-purpose server hardware to build storage resource pools and is
applicable to cloud computing scenarios. Physical resources are organized using software to form a
high-performance logical storage pool, ensuring reliability and providing multiple storage services.
Generally, scale-out storage scatters data to multiple independent storage servers in a scalable
system structure. It uses those storage servers to share storage loads and location servers to locate
storage information. Scale-out storage architecture has the following characteristics: universal
hardware, unified architecture, and storage-network decoupling; linear expansion of performance
and capacity, up to thousands of nodes; elastic resource scaling and high resource utilization.
Storage virtualization consolidates the storage devices into logical resources, thereby providing
comprehensive and unified storage services. Unified functions are provided regardless of different
storage forms and device types.
The cloud storage system combines multiple storage devices, applications, and services. It uses
highly virtualized multi-tenant infrastructure to provide scalable storage resources for enterprises.
Those storage resources can be dynamically configured based on organization requirements.

Cloud storage is a concept derived from cloud computing, and is a new network storage technology.
Based on functions such as cluster applications, network technologies, and distributed file systems, a
cloud storage system uses application software to enable various types of storage devices on
networks to work together, providing data storage and service access externally. When a cloud
computing system stores and manages a huge amount of data, the system requires a matched
number of storage devices. In this way, the cloud computing system turns into a cloud storage
system. Therefore, we can regard a cloud storage system as a cloud computing system with data
storage and management as its core. In a word, cloud storage is an emerging solution that
consolidates storage resources on the cloud for people to access. Users can access data on the cloud
anytime, anywhere, through any networked device.
1.1.3.2 Storage Media
History of HDDs:
⚫ From 1970 to 1991, the storage density of disk platters increased by 25% to 30% annually.
⚫ Starting from 1991, the annual increase rate of storage density surged to 60% to 80%.
⚫ Since 1997, the annual increase rate rocketed up to 100% and even 200%.
⚫ In 1992, 1.8-inch HDDs were invented.
History of SSDs:
⚫ Invented by Dawon Kahng and Simon Min Sze at Bell Labs in 1967, the floating-gate transistor
became the basis of the NAND flash technology used in solid-state drives (SSDs). If you are
familiar with MOSFETs, you will find that this transistor is similar to a MOSFET except for a
floating gate in the middle, which is how it got its name. The floating gate is wrapped in
high-impedance insulating material above and below so that it preserves the charges that
enter it through the quantum tunneling effect.
⚫ In 1976, Dataram sold SSDs called Bulk Core. The SSD had the capacity of 2 MB (which was very
large at that time), and used eight large circuit boards, each board with eighteen 256 KB RAMs.
⚫ At the end of the 1990s, some vendors began to use the flash medium to manufacture SSDs. In
1997, altec ComputerSysteme launched a parallel SCSI flash SSD. In 1999, BiTMICRO released an
18-GB flash SSD. Since then, flash SSD has gradually replaced RAM SSD and become the
mainstream product of the SSD market. The flash memory can store data even in the event of
power failure, which is similar to the hard disk drive (HDD).
⚫ In May 2005, Samsung Electronics announced its entry into the SSD market, the first IT giant
entering this market. It is also the first SSD vendor that is widely recognized today.
⚫ In 2006, Nextcom began to use SSDs on its laptops. Samsung launched the SSD with the 32 GB
capacity. According to Samsung, the market of SSDs was $1.3 billion in 2007 and reached $4.5
billion in 2010. In September, Samsung launched the PRAM SSD, another SSD technology that
used the PRAM as the carrier, and hoped to replace NOR flash memory. In November, Windows
Vista came into being as the first PC operating system to support SSD-specific features.
⚫ In 2009, the capacity of SSDs caught up with that of HDDs. pureSilicon's 2.5-inch SSD provides 1
TB capacity and consists of 128 pieces of 64 Gbit MLC NAND memory. Finally, SSD provides the
same capacity as HDD in the same size. This is very important. HDD vendors once believed that
the HDD capacity could be easily increased by increasing the disk density with low costs.
However, the SSD capacity could be doubled only when the internal chips were doubled, which
was difficult. However, the MLC SSD proves that it is possible to double the capacity by storing
more bits in one cell. In addition, the SSD performance is much higher than that of HDD. The
SSD has the read bandwidth of 240 MB/s, write bandwidth of 215 MB/s, read latency less than
100 microseconds, 50,000 read IOPS, and 10,000 write IOPS. HDD vendors are facing a huge
threat.

The flash chips of SSDs have evolved from SLC (one bit per cell) to MLC (two bits per cell) and TLC
(three bits per cell), and are now developing into QLC (four bits per cell).
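The capacity effect of storing more bits per cell can be seen with a little arithmetic. The Python lines below compare SLC, MLC, TLC, and QLC for a hypothetical die with 64 billion cells; the cell count is an assumption made purely for illustration.

```python
# Bits stored per cell for each NAND flash cell type.
bits_per_cell = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}

cells = 64_000_000_000  # hypothetical die with 64 billion cells

for cell_type, bits in bits_per_cell.items():
    capacity_gib = cells * bits / 8 / 2**30   # bits -> bytes -> GiB
    print(f"{cell_type}: about {capacity_gib:.1f} GiB per die")
```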
History of Flash Storage:
⚫ The storage-class memory launched in 2016 combines the performance advantages of dynamic
random-access memory (DRAM) and NAND, and features large capacity, low latency, and non-
volatility in inexpensive hardware. It is considered the next-gen medium that separates itself
from conventional media.
⚫ Storage class memory (SCM) is non-volatile memory. It is not as fast as DRAM but provides much
faster access than NAND.
⚫ There are various types of SCM media under development, but the mainstream SCM media are
PCRAM, ReRAM, MRAM, and NRAM.
⚫ Phase-change RAM (PCRAM) uses the electrical conductivity difference between crystalline and
amorphous alloy materials to represent binary values (0 or 1).
⚫ Resistive RAM (ReRAM) controls the formation and fusing status of conductive wires in a cell by
applying different voltages between the upper and lower electrodes to display different
impedance values (memristors) and represent data.
⚫ Magnetic RAM (MRAM) uses the electromagnetic field to change the electron spin direction
and represent different data states.
⚫ Nantero's CNT RAM (NRAM) uses carbon nanotubes to control circuit connectivity and
represent different data states.
1.1.3.3 Interface Protocols
Interface protocols refer to the communication modes and requirements that interfaces must
comply with for information exchange.
Interfaces are used to transfer data between disk cache and host memory. Different disk interfaces
determine the connection speed between disks and controllers.
During the development of storage protocols, the data transmission rate is increasing. As storage
media evolves from HDDs to SSDs, the protocol develops from SCSI to Non-Volatile Memory Express
(NVMe), including the PCIe-based NVMe protocol and NVMe over Fabrics (NVMe-oF) protocol to
connect host networks.
NVMe-oF uses ultra-low-latency transmission protocols such as remote direct memory access
(RDMA) to remotely access SSDs, resolving the trade-off between performance, functionality, and
capacity during scale-out of next-generation data centers.
The NVMe-oF specification released in 2016 supported both FC and RDMA transports. The
RDMA-based transports include InfiniBand, RDMA over Converged Ethernet (RoCE), and the
Internet Wide Area RDMA Protocol (iWARP).
In November 2018, TCP was added as a transport option (NVMe/TCP), which was incorporated
into the NVMe-oF 1.1 specification. With RDMA-based transports such as RoCE, data can be
transferred directly between host memory and the storage system, largely bypassing the CPU
and avoiding extra data copies.
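The sketch below is not a real driver API; it is a hedged Python illustration of the parameters an NVMe-oF initiator must know for each transport discussed above: the transport type (RDMA, TCP, or FC), the target address, the service ID (4420 is the IANA-assigned NVMe-oF port), and the NVMe Qualified Name (NQN) of the target subsystem. All names and addresses are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class NvmeOfTarget:
    """Illustrative description of an NVMe-oF subsystem endpoint."""
    transport: str   # "rdma" (InfiniBand/RoCE/iWARP), "tcp", or "fc"
    traddr: str      # target address (IP for rdma/tcp, WWN for fc)
    trsvcid: str     # transport service ID; 4420 is the standard NVMe-oF port
    subnqn: str      # NVMe Qualified Name of the target subsystem

target = NvmeOfTarget(
    transport="rdma",
    traddr="192.0.2.10",                              # documentation-range IP
    trsvcid="4420",
    subnqn="nqn.2014-08.org.example:storage.array01", # hypothetical NQN
)

# On Linux, the same parameters roughly map onto nvme-cli's `nvme connect`
# options (-t, -a, -s, -n); mentioned here only as a comment for orientation.
print(f"connect {target.transport}://{target.traddr}:{target.trsvcid} {target.subnqn}")
```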
NVMe is an SSD controller interface standard. It is designed for PCIe interface-based SSDs and aims
to maximize flash memory performance. It can provide intensive computing capabilities for
enterprise-class workloads in data-intensive industries, such as life sciences, financial services,
multimedia, and entertainment.
Typically, NVMe SSDs are used in database applications. If the NVMe features, such as high speed
and low latency, are used to design file systems, NVMe all-flash array (AFA) can achieve excellent
read/write performance. NVMe AFA can realize efficient storage, network switching, and metadata
communication.

1.1.4 Development Trend of Storage Products


1.1.4.1 History of Storage Products
Overall History:
⚫ All-flash storage: In terms of media, the price of flash chips decreases year by year, and HDDs
are gradually used as tapes for storing cold data and archive data.
⚫ Cloudification: In terms of storage architecture trend, elastic scalability is provided by the scale-
out architecture, and moving workloads to the cloud helps reduce the total cost of ownership
(TCO).
⚫ Intelligence: In terms of O&M, intelligent software is complemented by intelligent hardware
functions, such as smart disk enclosures.
1.1.4.2 Challenges to Data Storage in the Intelligence Era
Historically, human society has experienced three technological revolutions, that is, steam,
electricity, and information. Each period has a huge impact on our working and personal lives. Now,
the fourth revolution, the age of intelligence, is here. New technologies such as artificial intelligence
(AI), cloud computing, Internet of Things, and big data are used for large-scale digital transformation
in many sectors including industry, agriculture, and services. New services are centered on data and
intelligence, promoting changes such as service-oriented extension, network-based collaboration,
intelligent production, and customization.
The entire human society is rapidly evolving into an intelligent society. During this process, the data
volume is growing explosively. The average mobile data usage per user per day is over 1 GB. During
the training of autonomous vehicles, each vehicle generates 64 TB data every day. According to
Huawei's Global Industry Vision 2025, the amount of global data will increase from 33 ZB in 2018 to
180 ZB in 2025. Data is becoming a core business asset of enterprises and even countries. The smart
government, smart finance, and smart factory built based on effective data utilization greatly
improve the efficiency of the entire society. More and more enterprises have realized that data
infrastructure is the key to intelligent success, and storage is the core foundation of data
infrastructure. In the past, we used to classify storage systems based on new technology hotspots,
technical architecture, and storage media. As the economy and society transform from digitalization
to intelligence, we tend to call this new type of storage "storage in the intelligence era".
It has several trends:
First, intelligence, classified by Huawei as Storage for AI and AI in Storage. Storage for AI indicates
that in the future, storage will better support enterprises in AI training and applications. AI in
Storage means that storage systems use AI technologies and integrate AI into storage lifecycle
management to provide outstanding storage management, performance, efficiency, and stability.
Second, storage systems will transform. In the future, more and more applications will require low
latency, high reliability, and low TCO, and all-flash storage arrays will be a good choice. Although
new storage media will emerge to compete, all-flash storage will be the mainstream storage
medium in the future, even though it is not yet the mainstream in today's storage market.
The third trend is scale-out storage. In the 5G intelligent era, high-performance applications such
as AI, HPC, and autonomous driving, together with the massive amounts of data they generate,
require scale-out storage. With dedicated hardware, scale-out storage can provide efficient,
cost-effective, EB-level capacity. Scale-out storage also faces the challenges of intensification and
large-scale expansion, as well as possible changes in chips and algorithms in the future. Scientists
are attempting to use chip, algorithm, and bus technologies to break the barriers of the von
Neumann architecture, provide more computing power for the underlying data infrastructure,
deliver efficient and low-cost storage media, and narrow the gap between storage and computing.
These problems need to be solved by dedicated storage hardware. Concepts similar to Memory
Fabric also bring changes to the storage architecture.
The last trend is convergence. In the future, storage will be integrated with the data infrastructure to
support heterogeneous chip computing, streamline diversified protocols, and collaborate with data
processing and big data analytics to reduce data processing costs and improve efficiency. For
example, compared with the storage provided by general-purpose servers, the integration of data
and storage will lower the TCO because data processing is offloaded from servers to storage. Object,
big data, and other protocols are converged and interoperate to implement migration-free big data.
Such convergence greatly affects the design of storage systems and is the key to improving storage
efficiency.
1.1.4.3 Data Storage Trend
In the intelligence era, we must focus on innovation in hardware, protocols, and technologies.
From mainframes to x86 and then to virtualization, all-flash storage media and all-IP network
protocols have become a major trend.
In the intelligence era, Huawei Cache Coherence System (HCCS) and Compute Express Link (CXL) are
designed based on ultra-fast new interconnection protocols, helping to implement high-speed
interconnection between heterogeneous processors of CPUs and neural processing units (NPUs).
RoCE and NVMe support high-speed data transmission and containerization technologies. In
addition, new hardware and technologies provide abundant choices for data storage. The Memory
Fabric architecture implements memory resource pooling with all-flash + storage class memory
(SCM) and provides microsecond-level data processing performance. SCM media include Optane,
MRAM, ReRAM, FRAM, and Fast NAND. In terms of reliability, system reconstruction and data
migration are involved. As the chip-level design of all-flash storage advances, upper-layer
applications will be unaware of the underlying storage hardware.
Currently, the access performance of SSDs has been improved by 100-fold compared with that of
HDDs. For NVMe SSDs, the access performance is 10,000 times higher than that of HDDs. While the
latency of storage media has been greatly reduced, the ratio of network latency to the total latency
has rocketed from less than 5% to about 65%. That is to say, in more than half of the time, storage
media is idle, waiting for the network communication. How to reduce network latency is the key to
improving input/output operations per second (IOPS).
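A rough calculation shows why the network becomes the bottleneck as the media gets faster. The numbers in the Python lines below are illustrative assumptions only, chosen to reproduce the ratios quoted above (a network share of a few percent with HDDs, growing to well over half with NVMe SSDs).

```python
# Illustrative end-to-end latency budget (microseconds).
hdd_media_us, nvme_media_us = 5000.0, 100.0   # assumed media access times
network_us = 200.0                            # assumed fixed network/stack cost

for name, media_us in [("HDD", hdd_media_us), ("NVMe SSD", nvme_media_us)]:
    total = media_us + network_us
    share = network_us / total * 100
    print(f"{name}: total {total:.0f} us, network share {share:.0f}%")
# -> HDD: network is under 5% of total latency
# -> NVMe SSD: network is roughly two thirds of total latency
```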
Development of Storage Media
Let's move on to Blu-ray storage. The optical storage technology started to develop in the late 1960s,
and experienced three generations of product updates and iterations: CD, DVD, and BD. Blu-ray
storage (or BD) is a relatively new member of the optical storage family. It can retain data for 50 to
100 years, but still cannot meet storage requirements nowadays. We expect to store data for a
longer time. The composite glass material based on gold nanoparticles can stably store data for
more than 600 years.
In addition, technologies such as deoxyribonucleic acid (DNA) storage and quantum storage are
emerging.
As science and technology develop, disk capacity keeps increasing while disk size keeps shrinking.
When it comes to storing information, however, a hard disk is still very large compared with genes,
yet the amount of information it stores is far less. Therefore, scientists have started to use DNA to
store data. At first, a few teams tried to write data into the genomes of living cells, but this
approach has a couple of disadvantages: cells replicate, introducing new mutations over time that
can change the data, and cells die, which means the data is lost. Later, teams attempted to store
data in artificially synthesized DNA that is free of cells. Although the DNA storage density is now
high enough that a small amount of artificial DNA can store a large amount of data, reads and
writes are not efficient. In addition, the synthesis of DNA molecules is expensive. However, it can
be predicted that, with the development of gene sequencing technologies, the cost will be reduced.
References:
Bohannon, J. (2012). DNA: The Ultimate Hard Drive. Science. Retrieved from:
https://www.sciencemag.org/news/2012/08/dna-ultimate-hard-drive
Akram F, Haq IU, Ali H, Laghari AT (October 2018). "Trends to store digital data in DNA: an overview".
Molecular Biology Reports. 45 (5): 1479–1490. doi:10.1007/s11033-018-4280-y
Although atomic storage has a short history as a technology, it is not a new concept.
As early as December 1959, physicist Richard Feynman gave the lecture "There's Plenty of Room at
the Bottom: An Invitation to Enter a New Field of Physics," in which he considered the possibility of
using individual atoms as basic units for information storage.
In July 2016, researchers from Delft University of Technology in the Netherlands published a paper
in Nature Nanotechnology describing how they used chlorine atoms on copper plates to store 1
kilobyte of rewritable data. For now, however, the memory can only operate in a highly clean
vacuum environment or in liquid nitrogen at minus 196°C (77 K).
References:
Erwin, S. A picture worth a thousand bytes. Nature Nanotech 11, 919–920 (2016).
https://doi.org/10.1038/nnano.2016.141
Kalff, F., Rebergen, M., Fahrenfort, E. et al. A kilobyte rewritable atomic memory. Nature Nanotech
11, 926–929 (2016). https://doi.org/10.1038/nnano.2016.131
Because an atom is so small, the capacity of atomic storage will be much larger than that of the
existing storage medium in the same size. With the development of science and technology in recent
years, Feynman's idea has become a reality. To pay tribute to Feynman's great idea, some research
teams wrote his lecture into atomic memory. Although the idea of atomic storage is incredible and
its implementation is becoming possible, atomic memory has strict requirements on the operating
environment. Atoms are constantly moving, and even the atoms inside solids vibrate at ambient
temperature, so it is difficult to keep them in an ordered state under normal conditions. Atomic
storage can only be used at low temperatures, in liquid nitrogen, or under vacuum conditions.
If both DNA storage and atomic storage are intended to reduce the size of storage and increase the
capacity of storage, quantum storage is designed to improve performance and running speed.
After years of research, both the storage efficiency and the lifetime of quantum memory have
improved, but it is still difficult to put quantum memory into practice. Quantum memory suffers
from inefficiency, high noise, short lifespan, and difficulty operating at room temperature. Only by
solving these problems can quantum memory be brought to market.
Elements in a quantum state are easily lost due to the influence of the external environment. In
addition, it is difficult to guarantee 100% accuracy when preparing quantum states and performing
quantum operations.
References:
Wang, Y., Li, J., Zhang, S. et al. Efficient quantum memory for single-photon polarization qubits. Nat.
Photonics 13, 346–351 (2019). https://doi.org/10.1038/s41566-019-0368-8
Dou Jian-Peng, Li Hang, Pang Xiao-Ling, Zhang Chao-Ni, Yang Tian-Huai, Jin Xian-Min. Research
progress of quantum memory. Acta Physica Sinica, 2019, 68(3): 030307. doi:
10.7498/aps.68.20190039
Storage Network Development
In traditional data centers, IP SAN uses Ethernet technology to form a multi-hop symmetric
network architecture and uses the TCP/IP protocol stack for data transmission, whereas FC SAN
requires an independent FC network. Although traditional TCP/IP and FC networks have matured
after years of development, their technical architectures limit the application of AI computing and
scale-out storage.
To reduce network latency and CPU usage, Remote Direct Memory Access (RDMA) technology
emerged. RDMA transmits data directly from the memory of one computer to that of another:
data is quickly moved to the remote system's storage without intervention by either operating
system and without time-consuming processing by the CPUs. In this way, the system achieves high
bandwidth, low latency, and efficient resource usage.
1.1.4.4 History of Huawei Storage Products
Huawei has been developing storage technology since 2002 and has gathered a team of elite
engineers from around the world. Huawei is dedicated to storage innovation and R&D in the
intelligent era, and its products are globally recognized for their superior quality by customers and
standards organizations.
Oriented to the intelligence era, Huawei OceanStor storage builds an innovative architecture based
on intelligence, hardware, and algorithms. It builds a memory/SCM-centric ultimate performance
layer based on Memory Fabric, and a cost-effective, high-performance capacity layer based on all-IP
and all-flash storage to provide intelligent tiering for data storage. Computing resource pooling and
intelligent scheduling at the 10,000-core level are implemented for CPUs, NPUs, and GPUs based on
innovative algorithms and high-speed interconnection protocols. In addition, container-based
heterogeneous microservices are tailored to business to break the boundaries of memory
performance, computing power, and protocols. Finally, an intelligent management system is
provided across the entire data lifecycle to deliver innovative storage products for a fully connected,
intelligent world.

2 Basic Storage Technologies

2.1 Intelligent Data Storage System


2.1.1 Intelligent Data Storage System
2.1.1.1 Storage in the Intelligent Era
As we move rapidly toward an intelligent world, data is being generated at an unprecedented rate.
More companies have realized that the key to achieving smartization is data infrastructure, with
storage at its core.
2.1.1.2 Intelligent Data Storage Architecture
Huawei has built an innovative architecture for new age data storage. This architecture is built on
intelligence + hardware + algorithm. It comprises the following features:
⚫ Memory and SCM-centered high performance tier based on Memory Fabric
⚫ All IP-based all-flash capacity tier, providing intelligent layered management for storage
⚫ 10,000-core computing resource pooling and intelligent scheduling for CPUs, NPUs, and GPUs,
based on innovative algorithms and high-speed interconnection protocols
⚫ Container-based heterogeneous microservices for services
⚫ Intelligent data lifecycle management system to deliver innovative storage products for a fully
connected, intelligent world

2.1.2 Intelligent Data Storage Components


2.1.2.1 Controller Enclosure
2.1.2.1.1 Controller Enclosure Design
A controller enclosure houses controllers and is the core component of a storage system to provide
storage services.
The controller enclosure uses a modular design and consists of a system subrack, controllers (with
built-in fan modules), BBUs, power modules, management modules, and interface modules.
⚫ The system subrack integrates a backplane to provide signal and power connectivity among
modules.
⚫ The controller is a core module that processes services in a storage system.
⚫ BBUs supply power to the storage system in the event of an external power supply failure to
protect data in the storage system.
⚫ The AC power module supplies power to the controller enclosure, allowing the enclosure to
operate normally at maximum power.
⚫ The management module provides management, maintenance, and serial ports.

⚫ Interface modules provide service or management ports and are field replaceable units.
2.1.2.1.2 Controller Enclosure Components
⚫ A controller is the core component of a storage system. It processes storage services, receives
configuration management commands, saves configuration data, connects to disks, and saves
critical data to coffer disks.
➢ The CPU and cache on the controller work together to process I/O requests from the host
and manage RAID of the storage system.
➢ Each controller has multiple built-in disks to store system data. If a power failure occurs,
these disks also store cache data. Disks on different controllers are redundant of each
other.
⚫ Front-end (FE) ports are used for service communication between application servers and the
storage system, that is, for processing host I/Os.
⚫ Back-end (BE) ports connect a controller enclosure to a disk enclosure and provide channels for
reading/writing data from/to disks.
⚫ A cache is memory on the controller. It provides fast data access and functions as a buffer
between the internal storage and external interfaces (a simplified write-back sketch follows this
list).
⚫ An engine is the core processing unit of a storage system. In Huawei storage products, it usually
refers to a controller enclosure together with the controllers it houses.
⚫ Coffer disks are used to store user data, system configurations, logs, and dirty data in the cache
in the event of an unexpected power outage.
➢ Built-in coffer disk: Each controller of Huawei OceanStor Dorado V6 has one or two built-in
SSDs as coffer disks. For details, see the product documentation.
➢ External coffer disk: The storage system automatically selects four disks as coffer disks. Each
coffer disk provides 2 GB space to form a RAID 1 group. The remaining space of the coffer
disks can be used to store service data. If a coffer disk is faulty, the system automatically
replaces the faulty coffer disk with a normal disk to ensure redundancy.
⚫ Power module: The AC power module supplies power to the controller enclosure, allowing the
enclosure to operate normally at maximum power.
➢ A 4 U controller enclosure has four power modules (PSU 0, PSU 1, PSU 2, and PSU 3). PSU 0
and PSU 1 form a power plane to supply power to controllers A and C, and are redundant of
each other. PSU 2 and PSU 3 form the other power plane to supply power to controllers B
and D, and are redundant of each other. For reliability purposes, it is recommended that
you connect PSU 0 and PSU 2 to one PDU, and PSU 1 and PSU 3 to another PDU.
➢ A 2 U controller enclosure has two power modules (PSU 0 and PSU 1) to supply power to
controllers A and B. The two power modules form a power plane and are redundant of each
other. For reliability purposes, it is recommended that you connect PSU 0 and PSU 1 to
different PDUs.
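The toy Python sketch below ties several of these components together: a write arrives through a front-end port, is acknowledged from cache (write-back), and is either destaged to disks through the back-end ports or, on a power failure, flushed to a coffer area while the BBU keeps the controller alive. It is a conceptual model only, with invented names, and not Huawei's implementation.

```python
class ToyController:
    """Conceptual write-back path: FE port -> cache -> BE port / coffer."""

    def __init__(self):
        self.cache = {}    # dirty data waiting to be destaged
        self.disks = {}    # data persisted via back-end ports
        self.coffer = {}   # dirty cache data saved on power failure

    def write(self, lba, data):
        # Front-end port receives the host I/O; cache it and acknowledge.
        self.cache[lba] = data
        return "ack"       # host sees low latency: data is only in cache so far

    def destage(self):
        # Back-end ports flush dirty cache data to disks in the background.
        self.disks.update(self.cache)
        self.cache.clear()

    def power_failure(self):
        # BBU keeps the controller powered long enough to dump the cache
        # into the coffer area so that no acknowledged write is lost.
        self.coffer.update(self.cache)
        self.cache.clear()

ctrl = ToyController()
ctrl.write(0x100, b"payroll batch")
ctrl.power_failure()
print(ctrl.coffer)   # acknowledged data survives the outage
```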
2.1.2.2 Disk Enclosure
2.1.2.2.1 Disk Enclosure Design
The disk enclosure uses a modular design and consists of a system subrack, expansion modules,
power modules, and disks.
⚫ The system subrack integrates a backplane to provide signal and power connectivity among
modules.

⚫ The expansion module provides expansion ports to connect to a controller enclosure or another
disk enclosure for data transmission.
⚫ The power module supplies power to the disk enclosure, allowing the enclosure to operate
normally at maximum power.
⚫ Disks provide storage space for the storage system to save service data, system data, and cache
data. Specific disks are used as coffer disks.
2.1.2.3 Expansion Module
2.1.2.3.1 Expansion Module
Each expansion module provides one P0 and one P1 expansion port. The expansion module provides
expansion ports to connect to a controller enclosure or another disk enclosure for data transmission.
2.1.2.3.2 CE Switch
Huawei CloudEngine series fixed switches are next-generation Ethernet switches designed for data
centers and provide high performance, high port density, and low latency. The switches use flexible
front-to-rear or rear-to-front airflow design and can be used in IP SANs and distributed storage
networks.
2.1.2.3.3 Fibre Channel Switch
Fibre Channel switches are high-speed network transmission relay devices that transmit data over
optical fibers. They accelerate transmission and protect against interference, and are used on FC
SANs.
2.1.2.3.4 Device Cable
A serial cable connects the serial port of the storage system to the maintenance terminal.
Mini SAS HD cables connect to expansion ports on controller and disk enclosures. There are mini SAS
HD electrical cables and mini SAS HD optical cables.
100G QSFP28 cables are used for direct connection between controllers or for connection to smart
disk enclosures.
25G SFP28 cables are used for front-end networking.
Optical fibers connect the storage system to Fibre Channel switches. One end of the optical fiber
connects to a Fibre Channel host bus adapter (HBA), and the other end connects to the Fibre
Channel switch or the storage system. An optical fiber uses Lucent Connectors (LCs) at both ends.
MPO-4*DLC optical fibers, which are dedicated for 8 Gbit/s Fibre Channel interface modules (8
ports) and 16 Gbit/s Fibre Channel interface modules (8 ports), can be used to connect the storage
system to Fibre Channel switches.
2.1.2.4 Disk
2.1.2.4.1 Disk Components
⚫ A platter is coated with magnetic materials on both surfaces. The magnetic grains on the platter
are polarized to represent a binary information unit (or bit).
⚫ A read/write head reads data from and writes data to a platter. It changes the N and S polarities
of magnetic grains on the surface of the platter to save data.
⚫ The actuator arm moves the read/write head to the specified position.
⚫ The spindle has a motor and bearing underneath. It rotates the platter to move the specified
position on the platter to the read/write head.
⚫ The control circuit controls the speed of the platter and movement of the actuator arm, and
delivers commands to the head.
Each platter of a disk has two read/write heads, which respectively read and write data on two
surfaces of the platter.
The head floats above the platter on an air cushion generated by the spinning platter and does not touch it. Therefore, the head can move back and forth between tracks at a high speed. If the distance between the head and the platter is too large, the signal read by the head is weak; if it is too small, the head may rub against the platter surface. Therefore, the platter surface must be smooth and flat. Any foreign matter or dust can cause the head to rub against the magnetic surface, resulting in permanent data corruption.
Working principles:
⚫ At the beginning, the read/write head is in the landing zone near the spindle of the platters.
⚫ The spindle connects to all platters and a motor. The spindle motor rotates at a constant speed
to drive the platters.
⚫ When the spindle rotates, there is a small gap between the head and platter, which is called the
flying height of the head.
⚫ The head is attached to the end of the actuator arm, which drives the head to the specified
position above the platter where data needs to be written or read.
⚫ The head reads and writes data in binary format on the platter surface. The read data is first stored in the disk's cache and then transmitted to the host program.
⚫ Platter surface: Each platter of a disk has two surfaces, both of which can store data and are
valid. All valid surfaces are numbered in sequence, starting from 0 for the top surface. In a disk
system, a surface number is also referred to as a head number, because each valid surface has a
read/write head.
⚫ Track: Tracks are concentric circles around the spindle on a platter. Data is recorded on the
tracks. Tracks are numbered from the outermost circle to the innermost one, starting from 0.
Each platter surface has 300 to 1024 tracks. New types of large-capacity disks have even more
tracks on each surface. Generally, the tracks per inch (TPI) on a platter are used to measure the
track density. Tracks are only magnetized areas on the platter surfaces and are invisible to
human eyes.
⚫ Cylinder: A cylinder is formed by tracks with the same number on all platter surfaces of a disk.
The heads of each cylinder are numbered from top to bottom, starting from 0. Data is read and
written based on cylinders. That is, head 0 in a cylinder reads and writes data first, and then the
other heads in the same cylinder read and write data in sequence. After all heads have
completed reads and writes in a cylinder, the heads move to the next cylinder. Selection of
cylinders is a mechanical switching process called seek. Generally, the position of heads in a disk
is indicated by the cylinder number instead of the track number.
⚫ Sector: Each track is divided into smaller units called sectors to arrange data orderly. A sector is
the smallest storage unit that can be independently addressed in a disk. Tracks may have
different number of sectors. Generally, a sector can store 512 bytes of user data, but some disks
can be formatted into larger sectors, such as 4 KB sectors.
Disks may have one or multiple platters. However, a disk allows only one head to read and write
data at a time. Therefore, increasing the number of platters and heads only improves the disk
capacity, but cannot improve the throughput or I/O performance of the disk.
Disk capacity = Number of cylinders x Number of heads x Number of sectors per track x Sector size. The result is usually expressed in MB or GB. The disk capacity is determined by the capacity of a single platter and the number of platters.
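The following minimal Python sketch applies the capacity formula above; the CHS geometry values used in the example are hypothetical and chosen only for illustration.

# Minimal sketch: disk capacity from CHS geometry, per the formula above.
# The geometry values used here are hypothetical.

def disk_capacity_bytes(cylinders, heads, sectors_per_track, sector_size=512):
    """Capacity = cylinders x heads x sectors per track x sector size."""
    return cylinders * heads * sectors_per_track * sector_size

capacity = disk_capacity_bytes(cylinders=16383, heads=16, sectors_per_track=63)
print(f"{capacity / 1024 ** 3:.2f} GB")   # about 7.87 GB for this example geometry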
Because the processing speed of a CPU is much faster than that of a disk, the CPU must wait until the
disk completes a read/write operation before issuing a new command. To solve this problem, a
cache is added to the disk to improve the read/write speed.
2.1.2.4.2 Factors Relevant to Disk Performance
⚫ Rotation speed: indicates how many circles a platter can rotate per minute. The unit is
revolutions per minute (rpm). When data is being read or written, the platter rotates while the
head stays still. Therefore, a faster rotation speed of the platter means shorter data
transmission time. In the case of sequential I/Os, the actuator arm does not need to seek
frequently. Therefore, the rotation speed is the primary factor to determine the throughput and
IOPS.
⚫ Seek speed: In the case of random I/Os, the actuator arm must change tracks frequently, and
the data transmission time is much shorter than the time for track changes. Therefore, a faster
seek speed of the actuator arm can improve the IOPS of random I/Os.
⚫ Single platter capacity: A larger capacity of a single platter means more data can be stored in a
unit space, bringing a higher data density. Under the same rotation speed and seek speed, disks
with a higher data density provide better performance.
⚫ Port speed: In theory, the current port speed is enough to support the maximum external
transmission bandwidth of disks. In the case of random I/Os, the seek speed is the bottleneck
and the port speed has little impact on performance.
2.1.2.4.3 Average Access Time
⚫ The average seek time refers to the average time required for a head to move from the initial
position to a specified track on a platter. It is an important metric for the internal transfer rate of
a disk. The time should be as short as possible.
⚫ The average latency time means how long a head needs to wait for a sector to move to the
specified position after the head has reached the desired track. The average latency time is
generally half of the time required for the platter to rotate a full circle. Therefore, a faster
rotation speed leads to a shorter latency time.
2.1.2.4.4 Data Transfer Rate
The data transfer rate of a disk means how fast the disk can read and write data. It includes the
internal and external data transfer rates. The unit is MB/s.
⚫ Internal transfer rate is also called sustained transfer rate, which is the highest rate at which a
head reads and writes data. This does not include the seek time and the time waiting for the
sector to move to the head. It is achieved in an ideal situation that the head does not need to
change the track or read a specified sector, but reads and writes all sectors sequentially and
cyclically on one track. The rate in such a situation is the internal transfer rate.
⚫ External transfer rate is also called burst data transfer rate or interface transfer rate. It refers to
the data transfer rate between the system bus and the disk buffer, and depends on the disk port
type and buffer size.
2.1.2.4.5 Disk IOPS and Transmission Bandwidth
IOPS is calculated using the seek time, rotation latency, and data transmission time.
⚫ Seek time: The shorter the seek time, the faster the I/O operation. Currently, the average seek
time is 3 to 15 ms.
⚫ Rotation latency: refers to the time required for the platter to rotate the sector where the
requested data is located to the position below the head. The rotation latency depends on the
rotation speed. Generally, the latency is half of the time required for the platter to rotate a full
circle. For example, the average rotation latency of a 7200 rpm disk is about 4.17 ms (60 x
1000/7200/2), and the average rotation latency of a 15000 rpm disk is about 2 ms.
⚫ Data transmission time: indicates the time required for transmitting the requested data. It
depends on the data transfer rate. Data transmission time = Data size/Data transfer rate. For
example, the data transfer rate of IDE/ATA disks can reach 133 MB/s, and that of SATA II disks
can reach 300 MB/s.
⚫ In the case of random I/Os, the head must change tracks frequently. The data transmission time
is much less than the time for track changes (not in the same order of magnitude). Therefore,
the data transmission time can be ignored.
Theoretically, the maximum IOPS of a disk can be calculated using the following formula: IOPS =
1000 ms/(Seek time + Rotation latency). The data transmission time is ignored. For example, if the
average seek time is 3 ms, the theoretical maximum IOPS for 7200 rpm, 10k rpm, and 15k rpm disks
is 140, 167, and 200, respectively.
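As a quick illustration of this estimate, the following Python sketch reproduces the arithmetic above (rotation latency taken as half a revolution, data transmission time ignored); the 3 ms average seek time is an assumed value.

# Minimal sketch of the theoretical-maximum IOPS estimate described above.
# Rotation latency is half of one revolution; data transmission time is
# ignored, as in the text. The 3 ms average seek time is an assumption.

def rotation_latency_ms(rpm):
    """Average rotational latency in ms: half of one full revolution."""
    return 60 * 1000 / rpm / 2

def max_iops(rpm, avg_seek_ms):
    """Theoretical maximum IOPS = 1000 ms / (seek time + rotation latency)."""
    return 1000 / (avg_seek_ms + rotation_latency_ms(rpm))

for rpm in (7200, 10000, 15000):
    print(rpm, round(max_iops(rpm, avg_seek_ms=3)))   # ~140, ~167, ~200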
2.1.2.4.6 Data Transfer Mode
Parallel transmission:
⚫ Parallel transmission features high efficiency, short distances, and low frequency.
⚫ In long-distance transmission, using multiple lines is more expensive than using a single line.
⚫ Long-distance transmission requires thicker conducting wires to reduce signal attenuation, but it
is difficult to bundle them into a single cable.
⚫ In long-distance transmission, the time for data on each line to reach the peer end varies due to
wire resistance or other factors. The next transmission can be initiated only after data on all
lines has reached the peer end.
⚫ When the transmission frequency is high, the circuit oscillation is serious and great interference
is generated between the lines. Therefore, the frequency of parallel transmission must be set
properly.
Serial transmission:
⚫ Serial transmission transfers fewer bits per clock cycle than parallel transmission, but its speed can be raised by increasing the transmission frequency. In general, the overall speed of serial transmission is therefore higher than that of parallel transmission.
⚫ Serial transmission is used for long-distance transmission. PCIe, which replaced the parallel PCI bus, is a typical example of a serial interface: a single lane provides a transmission rate of up to 2.5 Gbit/s.
2.1.2.4.7 HDD Port Technology
Disks are classified into IDE, SCSI, SATA, SAS, and Fibre Channel disks by port. In addition to ports,
these disks also differ in the mechanical base.
IDE and SATA disks use the ATA mechanical base and are suitable for single-task processing.
SCSI, SAS, and Fibre Channel disks use the SCSI mechanical base and are suitable for multi-task
processing.
Comparison:
⚫ Under high data throughput, SCSI disks provide higher processing speed than ATA disks.
⚫ In the case of multi-task processing, the read/write head moves frequently, which causes
overheating on ATA disks.
⚫ SCSI disks provide higher reliability than ATA disks.
IDE disk port:
⚫ Multiple ATA versions have been released, including ATA-1 (IDE), ATA-2 (Enhanced IDE/Fast ATA),
ATA-3 (Fast ATA-2), ATA-4 (ATA33), ATA-5 (ATA66), ATA-6 (ATA100), and ATA-7 (ATA133).
⚫ The advantages and disadvantages of the ATA port are as follows:
➢ Advantages: low price and good compatibility
➢ Disadvantages: Low speed; built-in use only; strict restriction on the cable length
➢ The transmission rate of the PATA port does not meet the current user needs.
SATA port:
⚫ During data transmission, the data lines and signal lines are separate and use an independent transmission clock frequency. The transmission rate of SATA is far higher than that of PATA.
⚫ Advantages:
➢ Generally, a SATA port has 7+15 pins and uses a single channel. The transmission rate of
SATA is higher than that of ATA.
➢ SATA uses the cyclic redundancy check (CRC) for instructions and data packets to ensure
data transmission reliability.
➢ SATA has a better anti-interference capability than ATA.
SCSI port:
⚫ SCSI disks were developed to replace IDE disks to provide higher rotation speed and
transmission rate. SCSI is originally a bus-type interface and works independently of the system
bus.
⚫ Advantages:
➢ Applicable to a wide range of devices. One SCSI controller card can connect to 15 devices
simultaneously.
➢ High performance (multi-task processing, low CPU usage, fast rotation speed, and high
transmission rate)
➢ SCSI disks can be external or built-in ones, and are hot-swappable.
⚫ Disadvantages:
➢ High cost and complex installation and configuration.
SAS port:
⚫ SAS is similar to SATA in its use of a serial architecture for a high transmission rate and
streamlined internal space with shorter internal connections.
⚫ SAS improves the efficiency, availability, and scalability of the storage system, and is backward
compatible with SATA in terms of the physical and protocol layers.
⚫ Advantages:
➢ SAS is superior to SCSI in terms of transmission rate and anti-interference, and supports a
longer connection distance.
⚫ Disadvantages:
➢ SAS disks are more expensive.
Fibre Channel port:
⚫ Fibre Channel (FC) was not originally designed for disk interfaces, but for network transmission. It was gradually applied to disk systems in pursuit of higher speed.
⚫ Advantages:
➢ Easy to upgrade. Supports optical fiber cables with a length over 10 km.
➢ Large bandwidth.
➢ Strong universality.
⚫ Disadvantages:
➢ High cost.
➢ Complex to build.
2.1.2.4.8 SSD Overview
Unlike traditional disks that use magnetic materials to store data, SSDs use NAND flash (using cells as
the storage units) to store data. NAND flash is a non-volatile random access storage medium that
can retain stored data after the power is turned off. It can quickly and compactly store digital
information.
SSDs do not have high-speed rotational components, and feature high performance, low power
consumption, and zero noise.
SSDs do not have mechanical parts, but this does not mean that they have an infinite life cycle. Because NAND flash is a non-volatile medium, existing data must be erased before new data can be written to the same location. However, each cell can only be erased a limited number of times. When this upper limit is reached, data reads and writes on the cell become invalid.
2.1.2.4.9 SSD Architecture
The host interface is the protocol and physical interface used by a host to access an SSD. Common
interfaces are SATA, SAS, and PCIe.
The SSD controller is the core SSD component responsible for read and write access from a host to
the back-end media and for protocol conversion, table entry management, data caching, and data
checking.
DRAM is the cache for flash translation layer (FTL) entries and data.
NAND flash is a non-volatile random access storage medium that stores data.
An SSD uses multiple concurrent channels, allowing time-division multiplexing of the flash granules within a channel. It also supports TCQ and NCQ, enabling it to respond to multiple I/O requests at a time.
2.1.2.4.10 NAND Flash
Internal storage units in NAND flash include LUNs, planes, blocks, pages, and cells.
NAND flash working principles: NAND flash stores data using floating gate transistors. The threshold
voltage changes based on the number of electric charges stored in a floating gate. Data is then
represented using the read voltage of the transistor threshold.
⚫ A LUN is the smallest physical unit that can be independently encapsulated. A LUN typically
contains multiple planes.
⚫ A plane has an independent page register. It typically contains 1,000 or 2,000 blocks, all of which are either odd-numbered or even-numbered.
⚫ A block is the smallest erasure unit and generally consists of multiple pages.
⚫ A page is the smallest programming and read unit. Its size is usually 16 KB.
⚫ A cell is the smallest erasable, programmable, and readable unit found in pages. A cell
corresponds to a floating gate transistor that stores one or multiple bits.
A page is the basic unit of programming and reading, and a block is the basic unit of erasing.
Each P/E cycle causes some damage to the insulation layer of the floating gate transistor. If erasing
or programming a block fails, the block is considered as a bad block. When the number of bad blocks
reaches a threshold (4%), the NAND flash reaches the end of its service life.
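The following Python sketch illustrates this bookkeeping under simple assumptions: a block that fails to erase or program is marked bad, and the flash is treated as worn out once bad blocks reach the 4% threshold. The class and block counts are illustrative, not an actual SSD firmware interface.

# Minimal sketch of the bad-block bookkeeping described above. The class,
# block count, and threshold handling are illustrative only.

class NandFlash:
    BAD_BLOCK_THRESHOLD = 0.04   # 4% of all blocks, as stated in the text

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.bad_blocks = set()

    def mark_bad(self, block_id):
        """Called when an erase or program operation on a block fails."""
        self.bad_blocks.add(block_id)

    def end_of_life(self):
        return len(self.bad_blocks) / self.total_blocks >= self.BAD_BLOCK_THRESHOLD

flash = NandFlash(total_blocks=1000)
for blk in range(40):            # 40 failed blocks = 4% of 1000 blocks
    flash.mark_bad(blk)
print(flash.end_of_life())       # True: the flash has reached the end of its service life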
2.1.2.4.11 SLC, MLC, TLC, and QLC
NAND flash chips can be classified into the following types based on the number of bits stored in a
cell:
⚫ A single level cell (SLC) can store one bit of data: 0 or 1.
⚫ A multi level cell (MLC) can store two bits of data: 00, 01, 10, and 11.
⚫ A triple level cell (TLC) can store three bits of data: 000, 001, 010, 011, 100, 101, 110, and 111.
⚫ A quad level cell (QLC) can store four bits of data: 0000, 0001, 0010, 0011, 0100, 0101, 0110,
0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, and 1111.
The four types of cells have similar costs but store different amounts of data. Originally, the capacity
of an SSD was only 64 GB or smaller. Now, a TLC SSD can store up to 2 TB of data. However, different
types of cells have different life cycles, resulting in different SSD reliability. The life cycle of SSDs is
also an important factor in selecting SSDs.
The following shows the logic diagram of a flash chip (Toshiba 3D-TLC):
⚫ A page is logically formed by 18336*8=146688 cells. Each page can store 16 KB content and
1952 bytes of ECC data. A page is the minimum I/O unit of the flash chip.
⚫ Every 768 pages form a block. Every 1478 blocks form a plane.
⚫ A flash chip consists of two planes. One plane stores blocks with odd sequence numbers, and
the other stores blocks with even sequence numbers. The two planes can be operated
concurrently.
Because ECC must be performed on the data stored in NAND flash, the actual page size is not exactly 16 KB; each page carries extra bytes. For example, the actual size of a 16 KB page is 16384 + 1952 bytes: 16384 bytes store user data and 1952 bytes store the ECC check codes.
2.1.2.4.12 Address Mapping Management
The logical block address (LBA) may refer to an address of a data block or a data block pointed to by
an address.
PBA: physical block address
The host accesses the SSD through the LBA. Each LBA represents a sector (generally 512 bytes).
Generally, the host OS accesses the SSD in the unit of 4 KB. The basic unit for the host to access the
SSD is called host page.
Inside an SSD, the flash page is the basic unit for the SSD controller to access the flash chip, which is
called physical page. Each time the host writes a host page, the SSD controller writes it to a physical
page and records their mapping relationship.
When the host reads a host page, the SSD finds the requested data according to the mapping
relationship.
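A minimal Python sketch of such a mapping table is shown below; the structures are illustrative and greatly simplified compared with a real FTL. A write always lands on a fresh flash page and updates the mapping, and a read follows the current mapping.

# Minimal sketch of a host-page to flash-page mapping table (FTL). A write
# always goes to a fresh flash page and updates the mapping; the previously
# mapped page becomes garbage. The structures are illustrative only.

class SimpleFtl:
    def __init__(self):
        self.mapping = {}        # host page -> flash page
        self.flash = {}          # flash page -> data
        self.garbage = set()     # aged (invalid) flash pages
        self.next_free_page = 0

    def write(self, host_page, data):
        old = self.mapping.get(host_page)
        if old is not None:
            self.garbage.add(old)               # old page becomes garbage data
        self.flash[self.next_free_page] = data  # NAND never overwrites in place
        self.mapping[host_page] = self.next_free_page
        self.next_free_page += 1

    def read(self, host_page):
        return self.flash[self.mapping[host_page]]

ftl = SimpleFtl()
ftl.write(0, b"A")                  # host page 0 -> flash page 0
ftl.write(0, b"A2")                 # rewrite: 0 -> flash page 1, page 0 becomes garbage
print(ftl.read(0), ftl.garbage)     # b'A2' {0}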
2.1.2.4.13 Read and Write Process on an SSD
Data write process on an SSD:
⚫ The SSD controller connects to eight flash dies through eight channels. For better explanation,
the figure shows only one block in each die. Each square in the blocks represents a page
(assuming that the size is 4 KB).
➢ The host writes 4 KB data to the block of channel 0 (occupying one page).
➢ The host continues to write 16 KB data. In this case, 4 KB data is written to each block of
channels 1 to 4.
➢ The host continues to write data to the blocks until all blocks are full.
⚫ When the blocks on all channels are full, the SSD controller selects a new block to write data in
the same way.
⚫ Green indicates valid data and red indicates invalid data. When the user no longer needs the
data on a flash page, the data on this page becomes aged or invalid and its mapping relationship
is replaced by a new mapping.
⚫ For example, host page A was originally stored in flash page X, and the mapping relationship
was A->X. Later, the host rewrites the host page. Because the flash memory does not overwrite
data, the SSD writes the new data to a new page Y. In this case, a new mapping relationship
A->Y is established, and the original mapping relationship is canceled. The data in page X
becomes aged and invalid, which is called garbage data.
⚫ The host continues to write data to the SSD until its space is used up. In this case, the host
cannot write more data unless the garbage data is cleared.
Data read process on an SSD:
⚫ Whether the read speed can be improved 8-fold depends on whether the data to be read is
evenly distributed in the blocks of each channel. If the 32 KB data is stored in the blocks of
channels 1-4, the read speed is improved 4-fold at most. That is why smaller files are
transmitted at a lower rate.
Short response time: Traditional HDDs spend most of the time in seeking and mechanical latency,
limiting the data transmission efficiency. SSDs use NAND flash as the storage medium, which does
not cause any seek time or mechanical latency, delivering quick responses to read and write
requests.
High read/write efficiency: When an HDD is performing random read/write operations, the head
moves back and forth, resulting in low read/write efficiency. In comparison, an SSD calculates data
storage locations using an internal controller, which saves the mechanical operation time and greatly
improves read/write efficiency.
When a large number of SSDs are used, they have a prominent advantage in saving power.
2.1.2.4.14 SCM Card
SCM is the next-generation of storage media that features both persistence and fast access. Its
read/write speed is faster than that of flash memory.
An SCM card is a cache acceleration card of the SCM media type. To use SmartCache for OceanStor
Dorado V6 all-flash storage (6.1.0 and later versions), install an SCM card on the controller
enclosure.
2.1.2.5 Interface Modules
2.1.2.5.1 Front-End: GE Interface Modules
A GE electrical interface module provides four 1 Gbit/s electrical ports and is used for HyperMetro
quorum networking.
A 10GE electrical interface module provides four 10 Gbit/s electrical ports for connecting storage
devices to application servers, which can be used only after electrical modules are installed.
A 40GE interface module provides two 40 Gbit/s optical ports for connecting storage devices to
application servers.
A 100GE interface module provides two 100 Gbit/s optical ports for connecting storage devices to
application servers.
A 25 Gbit/s RDMA interface module provides four 25 Gbit/s optical ports for direct connections
between two controller enclosures.
A 100 Gbit/s RDMA interface module provides two 100 Gbit/s optical ports for connecting controller
enclosures to scale-out switches or smart disk enclosures. In the labels, SO stands for scale-out and
BE stands for back-end.
A 12 Gbit/s SAS expansion module provides four 4 x 12 Gbit/s mini SAS HD expansion ports to
connect controller enclosures to 2 U SAS disk enclosures.
2.1.2.5.2 Front-End: RoCE Interface Modules
A 25 Gbit/s RoCE interface module is used for connecting storage devices to application servers.
A 100 Gbit/s RoCE interface module is used for connecting storage devices to application servers.
2.1.2.5.3 Front-End: SmartIO Interface Modules
SmartIO interface modules support 8, 10, 16, 25, and 32 Gbit/s optical modules, which respectively
provide 8 Gbit/s Fibre Channel, 10GE, 16 Gbit/s Fibre Channel, 25GE, and 32 Gbit/s Fibre Channel
ports. SmartIO interface modules connect storage devices to application servers.
The optical module rate must match that on the interface module label. Otherwise, the storage
system will report an alarm and the port is unavailable.
2.1.2.5.4 Back-End: 100 Gbit/s RDMA Interface Module and 12 Gbit/s SAS Expansion Module
A 100 Gbit/s RDMA interface module provides two 100 Gbit/s optical ports for connecting controller
enclosures to scale-out switches or smart disk enclosures. In the labels, SO stands for scale-out and
BE stands for back-end.
A 12 Gbit/s SAS expansion module provides four 4 x 12 Gbit/s mini SAS HD expansion ports to
connect controller enclosures to 2 U SAS disk enclosures.
2.1.2.5.5 Scale-out: 100 Gbit/s RDMA Interface Module and 25 Gbit/s RDMA Interface Module
A 25 Gbit/s RDMA interface module provides four 25 Gbit/s optical ports for direct connections
between two controller enclosures.
A 100 Gbit/s RDMA interface module provides two 100 Gbit/s optical ports for connecting controller
enclosures to scale-out switches or smart disk enclosures. In the labels, SO stands for scale-out and
BE stands for back-end.
2.1.3 Storage System Expansion Methods
2.1.3.1 Scale-up and Scale-out
⚫ Scale-up:
This traditional vertical expansion architecture continuously adds storage disks into the existing
storage systems to supply capacity.
Advantage: simple operation at the initial stage
Disadvantage: As the storage system scale increases, resource expansion eventually reaches a bottleneck.
⚫ Scale-out:
This horizontal expansion architecture adds controllers to cope with demands.
Advantage: As the scale increases, the unit price decreases and the efficiency improves.
Disadvantage: The complexity of software and management increases.
2.1.3.2 SAS Disk Enclosure Scale-up Networking Principles
Huawei SAS disk enclosure is used as an example.
Port consistency: In a loop, the downlink (EXP) port of an upper-level disk enclosure is connected to
the uplink (PRI) port of a lower-level disk enclosure.
Dual-plane networking: Expansion module A is connected to controller A and expansion module B is
connected to controller B.
Symmetric networking: On controllers A and B, ports with the same port IDs and slot IDs are
connected to one enclosure.
Forward and backward connection networking: Expansion module A uses forward connection and expansion module B uses backward connection. (OceanStor Dorado V6 and new converged storage use forward connection.)
Cascading level: The number of cascading disk enclosures cannot exceed the preset threshold.
2.1.3.3 Smart Disk Enclosure Scale-up Networking Principles
Huawei smart disk enclosure is used as an example.
Port consistency: In a loop, the downlink port (P1) of an upper-level disk enclosure is connected to
the uplink port (P0) of a lower-level disk enclosure.
Dual-plane networking: Expansion module A is connected to controller A and expansion module B is
connected to controller B.
Symmetric networking: On controllers A and B, ports with the same port IDs and slot IDs are
connected to one enclosure.
Forward connection networking: Both expansion modules A and B use forward connection.
Cascading level: The number of cascading disk enclosures cannot exceed the preset threshold.
2.1.3.4 Local Write Process
The LUN to which a host writes data is owned by the engine to which the host delivers write I/Os.
The process is as follows:
1 A host delivers write I/Os to engine 0.
2 Engine 0 writes the data into the local cache, implements mirror protection, and returns a
message indicating that data is written successfully.
3 Engine 0 flushes dirty data onto a disk. If the target disk belongs to the local engine, engine 0 directly delivers the write I/Os to the disk.
4 If the target disk is on a remote device, engine 0 transfers the I/Os to the engine (for example,
engine 1) where the disk resides.
5 Engine 1 writes dirty data onto disks.
2.1.3.5 Non-local Write Process
The LUN to which a host writes data is not owned by the engine to which the host delivers write
I/Os. The process is as follows:
1 The LUN is owned by engine 0 and the host delivers write I/Os to engine 2.
2 After detecting that the LUN is owned by engine 0, engine 2 transfers the write I/Os to engine 0.
3 Engine 0 writes the data into the local cache, implements mirror protection, and returns a
message indicating that data is written successfully.
4 Engine 2 returns the write success message to the host.
5 Engine 0 flushes dirty data onto a disk. If the target disk belongs to the local engine, engine 0 directly delivers the write I/Os to the disk.
6 If the target disk is on a remote device, engine 0 transfers the I/Os to the engine (for example,
engine 1) where the disk resides.
7 Engine 1 writes dirty data onto disks.
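The following Python sketch summarizes this ownership-based routing under simplified assumptions (the engine numbers and helper function are illustrative, not a real storage system interface): the receiving engine forwards a write to the LUN's owning engine, which caches and mirrors the data, acknowledges the host, and later flushes the dirty data to the engine where the target disk resides.

# Minimal sketch of LUN-ownership-based write I/O routing. The engine numbers
# and helper function are hypothetical; it only lists the forwarding steps.

def route_write(receiving_engine, lun_owner, target_disk_engine):
    steps = []
    if receiving_engine != lun_owner:
        steps.append(f"engine {receiving_engine} forwards the write I/O to owning engine {lun_owner}")
    steps.append(f"engine {lun_owner} writes to its cache, mirrors the data, and acknowledges the host")
    if target_disk_engine != lun_owner:
        steps.append(f"engine {lun_owner} transfers the dirty data to engine {target_disk_engine}")
    steps.append(f"engine {target_disk_engine} flushes the dirty data to disk")
    return steps

# Non-local write: host sends to engine 2, LUN owned by engine 0, disk on engine 1.
for step in route_write(receiving_engine=2, lun_owner=0, target_disk_engine=1):
    print(step)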
2.1.3.6 Local Read Process
The LUN from which a host reads data is owned by the engine to which the host delivers read I/Os. The process is as follows:
1 A host delivers read I/Os to engine 0.
2 If the read I/Os are hit in the cache of engine 0, engine 0 returns the data to the host.
3 If the read I/Os are not hit in the cache of engine 0, engine 0 reads data from the disks. If the target disk belongs to the local engine, engine 0 reads the data from it directly.
4 After the data is read locally, engine 0 returns it to the host.
5 If the target disk is on a remote device, engine 0 transfers the I/Os to the engine (for example,
engine 1) where the disk resides.
6 Engine 1 reads data from the disk.
7 Engine 1 accomplishes the data read.
8 Engine 1 returns the data to engine 0, which then returns the data to the host.
2.1.3.7 Non-local Read Process
The LUN from which a host reads data is not owned by the engine to which the host delivers read I/Os. The process is as follows:
1 The LUN is owned by engine 0 and the host delivers read I/Os to engine 2.
2 After detecting that the LUN is owned by engine 0, engine 2 transfers the read I/Os to engine 0.
3 If the read I/Os are hit in the cache of engine 0, engine 0 returns the data to engine 2.
4 Engine 2 returns the data to the host.
5 If the read I/Os are not hit in the cache of engine 0, engine 0 reads data from the disks. If the target disk belongs to the local engine, engine 0 reads the data from it directly.
6 After the data is read locally, engine 0 returns it to engine 2, and engine 2 returns it to the host.
7 If the target disk is on a remote engine, engine 0 transfers the I/Os to the engine (for example, engine 1) where the disk resides.
8 Engine 1 reads data from the disk.
9 Engine 1 accomplishes the data read.
10 Engine 1 returns the data to engine 0, and engine 0 returns the data to engine 2, which then returns the data to the host.
2.2 RAID Technologies
2.2.1 Traditional RAID
2.2.1.1 Basic Concept of RAID
Redundant Array of Independent Disks (RAID) combines multiple physical disks into one logical disk in different ways to improve read/write performance and data security.
Functionality of RAID:
⚫ Combines multiple physical disks into one logical disk array to provide larger storage capacity.
⚫ Divides data into blocks and concurrently writes/reads data to/from multiple disks to improve
disk access efficiency.
⚫ Provides mirroring or parity for fault tolerance.
Hardware RAID and software RAID can be implemented in storage devices.
⚫ Hardware RAID uses a dedicated RAID adapter, disk controller, or storage processor. The RAID
controller has a built-in processor, I/O processor, and memory to improve resource utilization
and data transmission speed. The RAID controller manages routes and buffers, and controls
data flows between the host and the RAID array. Hardware RAID is usually used in servers.
⚫ Software RAID has no built-in processor or I/O processor but relies on a host processor.
Therefore, a low-speed CPU cannot meet the requirements for RAID implementation. Software
RAID is typically used in enterprise-class storage devices.
Disk striping: Space in each disk is divided into multiple strips of a specific size. Data is also divided
into blocks based on strip size when data is being written.
⚫ Strip: A strip consists of one or more consecutive sectors in a disk, and multiple strips form a
stripe.
⚫ Stripe: A stripe consists of strips of the same location or ID on multiple disks in the same array.
RAID generally provides two methods for data protection.
⚫ One is storing data copies on another redundant disk to improve data reliability and read
performance.
⚫ The other is parity. Parity data is additional information calculated using user data. For a RAID
array that uses parity, an additional parity disk is required. The XOR (symbol: ⊕) algorithm is
used for parity.
2.2.1.2 RAID 0
RAID 0, also referred to as striping, provides the best storage performance among all RAID levels.
RAID 0 uses the striping technology to distribute data to all disks in a RAID array.
Figure 2-1 Working principles of RAID 0
A RAID 0 array contains at least two member disks. A RAID 0 array divides data into blocks of
different sizes ranging from 512 bytes to megabytes (usually multiples of 512 bytes) and
concurrently writes the data blocks to different disks. The preceding figure shows a RAID 0 array
consisting of two disks (drives). The first two data blocks are written to stripe 0: the first data block is
written to strip 0 in disk 1, and the second data block is written to strip 0 in disk 2. Then, the next
data block is written to the next strip (strip 1) in disk 1, and so forth. In this mode, I/O loads are
balanced among all disks in the RAID array. As the data transfer speed on the bus is much higher
than the data read and write speed on disks, data reads and writes on disks can be considered as
being processed concurrently.
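The round-robin placement described above can be expressed in a few lines of Python; the disk count and block numbering below are illustrative.

# Minimal sketch of RAID 0 round-robin block placement: logical block i of an
# array with N member disks lands on disk (i % N), strip (i // N).

def raid0_location(block_index, num_disks):
    """Return (disk number, strip number), both 0-based."""
    return block_index % num_disks, block_index // num_disks

for i in range(6):                          # first six data blocks, two disks
    disk, strip = raid0_location(i, num_disks=2)
    print(f"D{i} -> disk {disk + 1}, strip {strip}")
# D0 -> disk 1, strip 0; D1 -> disk 2, strip 0; D2 -> disk 1, strip 1; ...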
A RAID 0 array provides a large-capacity disk with high I/O processing performance. Before the
introduction of RAID 0, there was a similar technology called Just a Bunch Of Disks (JBOD). JBOD refers to a large virtual disk consisting of multiple disks. Unlike RAID 0, JBOD does not
concurrently write data blocks to different disks. JBOD uses another disk only when the storage
capacity in the first disk is used up. Therefore, JBOD provides a total available capacity which is the
sum of capacities in all disks but provides the performance of individual disks.
In contrast, RAID 0 searches the target data block and reads data in all disks upon receiving a data
read request. The preceding figure shows a data read process. A RAID 0 array provides a read/write
performance that is directly proportional to disk quantity.
2.2.1.3 RAID 1
RAID 1, also referred to as mirroring, maximizes data security. A RAID 1 array uses two identical disks
including one mirror disk. When data is written to a disk, a copy of the same data is stored in the
mirror disk. When the source (physical) disk fails, the mirror disk takes over services from the source
disk to maintain service continuity. The mirror disk is used as a backup to provide high data
reliability.
The amount of data stored in a RAID 1 array is only equal to the capacity of a single disk, because a copy of the data is retained on the other disk. That is, each gigabyte of data requires two gigabytes of disk space. Therefore, a RAID 1 array consisting of two disks has a space utilization of 50%.
Figure 2-2 Working principles of RAID 1
Unlike RAID 0 which utilizes striping to concurrently write different data to different disks, RAID 1
writes same data to each disk so that data in all member disks is consistent. As shown in the
preceding figure, data blocks D 0, D 1, and D 2 are to be written to disks. D 0 and the copy of D 0 are
written to the two disks (disk 1 and disk 2) at the same time. Other data blocks are also written to
the RAID 1 array in the same way by mirroring. Generally, a RAID 1 array provides write performance
of a single disk.
A RAID 1 array reads data from the data disk and the mirror disk at the same time to improve read
performance. If one disk fails, data can be read from the other disk.
A RAID 1 array provides read performance which is the sum of the read performance of the two
disks. When a RAID array degrades, its performance decreases by half.
2.2.1.4 RAID 3
RAID 3 is similar to RAID 0 but uses dedicated parity stripes. In a RAID 3 array, a dedicated disk
(parity disk) is used to store the parity data of strips in other disks in the same stripe. If incorrect
data is detected or a disk fails, data in the faulty disk can be recovered using the parity data. RAID 3
applies to data-intensive or single-user environments where data blocks need to be continuously
accessed for a long time. RAID 3 writes data to all member data disks. However, when new data is
written to any disk, RAID 3 recalculates and rewrites parity data. Therefore, when a large amount of
data from an application is written, the parity disk in a RAID 3 array needs to process heavy
workloads. Parity operations have certain impact on the read and write performance of a RAID 3
array. In addition, the parity disk is subject to the highest failure rate in a RAID 3 array due to heavy
workloads. A write penalty occurs when just a small amount of data is written to multiple disks,
which does not improve disk performance as compared with data writes to a single disk.
Figure 2-3 Working principles of RAID 3
RAID 3 uses a single disk for fault tolerance and performs parallel data transmission. RAID 3 uses
striping to divide data into blocks and writes XOR parity data to the last disk (parity disk).
The write performance of RAID 3 depends on the amount of changed data, the number of disks, and
the time required to calculate and store parity data. If a RAID 3 array consists of N member disks of
the same rotational speed and write penalty is not considered, its sequential I/O write performance
is theoretically slightly inferior to N – 1 times that of a single disk when full-stripe write is performed.
(Additional time is required to calculate redundancy check.)
In a RAID 3 array, data is read by stripe. Data blocks in a stripe can be read concurrently because all member disks are driven in parallel.
RAID 3 performs parallel data reads and writes. The read performance of a RAID 3 array depends on
the amount of data to be read and the number of member disks.
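The XOR parity used by RAID 3 (and, with distributed placement, RAID 5) can be illustrated with a short Python sketch; the strip contents below are made-up example bytes.

# Minimal sketch of XOR parity: the parity strip is the XOR of the data strips
# in a stripe, and a lost strip is rebuilt by XOR-ing the surviving strips with
# the parity.

from functools import reduce

def xor_strips(strips):
    """Byte-wise XOR of equally sized strips."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))

stripe = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]   # D0, D1, D2 in one stripe
parity = xor_strips(stripe)                        # written to the parity disk

# The disk holding D1 fails: rebuild it from the surviving strips and parity.
rebuilt_d1 = xor_strips([stripe[0], stripe[2], parity])
assert rebuilt_d1 == stripe[1]
print(rebuilt_d1.hex())                            # 3344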
2.2.1.5 RAID 5
RAID 5 is improved based on RAID 3 and consists of striping and parity. In a RAID 5 array, data is
written to disks by striping. In a RAID 5 array, the parity data of different strips is distributed among
member disks instead of a parity disk.
Similar to RAID 3, a write penalty occurs when just a small amount of data is written.
Figure 2-4 Working principles of RAID 5
The write performance of a RAID 5 array depends on the amount of data to be written and the
number of member disks. If a RAID 5 array consists of N member disks of the same rotational speed
and write penalty is not considered, its sequential I/O write performance is theoretically slightly
inferior to N – 1 times that of a single disk when full-stripe write is performed. (Additional time is
required to calculate redundancy check.)
In a RAID 3 or RAID 5 array, if a disk fails, the array changes from the online (normal) state to the
degraded state until the faulty disk is reconstructed. If a second disk also fails, the data in the array
will be lost.
2.2.1.6 RAID 6
Data protection mechanisms of all RAID arrays previously discussed considered only failures of
individual disks (excluding RAID 0). The time required for reconstruction increases along with the
growth of disk capacities. It may take several days instead of hours to reconstruct a RAID 5 array
consisting of large-capacity disks. During the reconstruction, the array is in the degraded state, and
the failure of any additional disk will cause the array to be faulty and data to be lost. This is why
some organizations or units need a dual-redundancy system. In other words, a RAID array should
tolerate failures of up to two disks while maintaining normal access to data. Such dual-redundancy
data protection can be implemented in the following ways:
⚫ The first one is multi-mirroring. Multi-mirroring is a method of storing multiple copies of a data
block in redundant disks when the data block is stored in the primary disk. This means heavy
overheads.
⚫ The second one is a RAID 6 array. A RAID 6 array protects data by tolerating failures of up to two
disks even at the same time.
The formal name of RAID 6 is distributed double-parity (DP) RAID. It is essentially an improved RAID
5, and also consists of striping and distributed parity. RAID 6 supports double parity, which means
that:
⚫ When user data is written, double parity calculation needs to be performed. Therefore, RAID 6
provides the slowest data writes among all RAID levels.
⚫ Additional parity data takes storage spaces in two disks. This is why RAID 6 is considered as an N
+ 2 RAID.
Currently, RAID 6 is implemented in different ways. Different methods are used for obtaining parity
data.
RAID 6 P+Q
Figure 2-5 Working principles of RAID 6 P+Q
⚫ When a RAID 6 array uses P+Q parity, P and Q represent two independent parity data. P and Q
parity data is obtained using different algorithms. User data and parity data are distributed in all
disks in the same stripe.
⚫ As shown in the figure, P 1 is obtained by performing an XOR operation on D 0, D 1, and D 2 in
stripe 0, P 2 is obtained by performing an XOR operation on D 3, D 4, and D 5 in stripe 1, and P 3
is obtained by performing an XOR operation on D 6, D 7, and D 8 in stripe 2.
⚫ Q 1 is obtained by performing a GF transform and then an XOR operation on D 0, D 1, and D 2 in
stripe 0, Q 2 is obtained by performing a GF transform and then an XOR operation on D 3, D 4,
and D 5 in stripe 1, and Q 3 is obtained by performing a GF transform and then an XOR
operation on D 6, D 7, and D 8 in stripe 2.
⚫ If a strip on a disk fails, data on the failed disk can be recovered using the P parity value. The
XOR operation is performed between the P parity value and other data disks. If two disks in the
same stripe fail at the same time, different solutions apply to different scenarios. If the Q parity
data is not in any of the two faulty disks, the data can be recovered to data disks, and then the
parity data is recalculated. If the Q parity data is in one of the two faulty disks, data in the two
faulty disks must be recovered by using both the formulas.
RAID 6 DP
Figure 2-6 Working principles of RAID 6 DP
⚫ RAID 6 DP also has two independent parity data blocks. The first parity data is the same as the
first parity data of RAID 6 P+Q. The second parity data is the diagonal parity data obtained
through diagonal XOR operation. Horizontal parity data is obtained by performing an XOR
operation on user data in the same stripe. As shown in the preceding figure, P 0 is obtained by
performing an XOR operation on D 0, D 1, D 2, and D 3 in stripe 0, and P 1 is obtained by
performing an XOR operation on D 4, D 5, D 6, and D 7 in stripe 1. Therefore, P 0 = D 0 ⊕ D 1 ⊕ D 2 ⊕ D 3, P 1 = D 4 ⊕ D 5 ⊕ D 6 ⊕ D 7, and so on.
⚫ The second parity data block is obtained by performing an XOR operation on diagonal data
blocks in the array. The process of selecting data blocks is relatively complex. DP 0 is obtained by
performing an XOR operation on D 0 in disk 1 in stripe 0, D 5 in disk 2 in stripe 1, D 10 in disk 3
in stripe 2, and D 15 in disk 4 in stripe 3. DP 1 is obtained by performing an XOR operation on D
1 in disk 2 in stripe 0, D 6 in disk 3 in stripe 1, D 11 in disk 4 in stripe 2, and P 3 in the first parity
disk in stripe 3. DP 2 is obtained by performing an XOR operation on D 2 in disk 3 in stripe 0, D 7
in disk 4 in stripe 1, P 2 in the first parity disk in stripe 2, and D 12 in disk 1 in stripe 3. Therefore,
DP 0 = D 0 ⊕ D 5 ⊕ D 10 ⊕ D 15, DP 1 = D 1 ⊕ D 6 ⊕ D 11 ⊕ P 3, and so on.
⚫ A RAID 6 array tolerates failures of up to two disks.
⚫ A RAID 6 array provides relatively poor performance no matter whether DP or P+Q is
implemented. Therefore, RAID 6 applies to the following two scenarios:
➢ Data is critical and should be consistently in online and available state.
➢ Large-capacity (generally > 2 T) disks are used. The reconstruction of a large-capacity disk
takes a long time. Data will be inaccessible for a long time if two disks fail at the same time.
A RAID 6 array tolerates failure of another disk during the reconstruction of one disk. Some
enterprises anticipate to use a dual-redundancy protection RAID array for their large-
capacity disks.
2.2.1.7 RAID 10
For most enterprises, RAID 0 is not really a practical choice, while RAID 1 is limited by disk capacity
utilization. RAID 10 provides the optimal solution by combining RAID 1 and RAID 0. In particular,
RAID 10 provides superior performance by eliminating write penalty in random writes.
A RAID 10 array consists of an even number of disks. User data is written to half of the disks and
mirror copies of user data are retained in the other half of disks. Mirroring is performed based on
stripes.
Figure 2-7 Working principles of RAID 10
As shown in the figure, physical disks 1 and 2 form a RAID 1 array, and physical disks 3 and 4 form
another RAID 1 array. The two RAID 1 sub-arrays form a RAID 0 array.
When data is written to the RAID 10 array, data blocks are concurrently written to sub-arrays by
mirroring. As shown in the figure, D 0 is written to physical disk 1, and its copy is written to physical
disk 2.
If disks (such as disk 2 and disk 4) in both the two RAID 1 sub-arrays fail, accesses to data in the RAID
10 array will remain normal. This is because integral copies of the data in faulty disks 2 and 4 are
retained on other two disks (such as disk 3 and disk 1). However, if disks (such as disk 1 and 2) in the
same RAID 1 sub-array fail at the same time, data will be inaccessible.
Theoretically, RAID 10 tolerates failures of half of the physical disks. However, in the worst case,
failures of two disks in the same sub-array may also cause data loss. Generally, RAID 10 protects data
against the failure of a single disk.
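The failure rule described above can be expressed as a small Python check; the mirror-pair layout below is illustrative (disks 1 and 2 form one RAID 1 sub-array, disks 3 and 4 the other).

# Minimal sketch of the RAID 10 failure rule: data is lost only if both disks
# of the same mirror pair fail.

def raid10_data_lost(failed_disks, mirror_pairs):
    return any(a in failed_disks and b in failed_disks for a, b in mirror_pairs)

pairs = [(1, 2), (3, 4)]
print(raid10_data_lost({2, 4}, pairs))   # False: one disk failed in each pair
print(raid10_data_lost({1, 2}, pairs))   # True: both disks of one pair failed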
2.2.1.8 RAID 50
RAID 50 combines RAID 0 and RAID 5. Two RAID 5 sub-arrays form a RAID 0 array. The two RAID 5
sub-arrays are independent of each other. A RAID 50 array requires at least six disks because a RAID
5 sub-array requires at least three disks.
Figure 2-8 Working principles of RAID 50
As shown in the figure, disks 1, 2, and 3 form a RAID 5 sub-array, and disks 4, 5, and 6 form another
RAID 5 sub-array. The two RAID 5 sub-arrays form a RAID 0 array.
A RAID 50 array tolerates failures of multiple disks at the same time. However, failures of two disks
in the same RAID 5 sub-array will cause data loss.
2.2.2 RAID 2.0+
2.2.2.1 RAID Evolution
As a well-developed and reliable disk data protection standard, RAID has always been used as a basic technology for storage systems. However, with ever-increasing data storage requirements and capacity per disk, the drawbacks of traditional RAID are becoming more pronounced, particularly in the reconstruction of large-capacity disks.
Traditional RAID has two main drawbacks: a high risk of data loss and a material impact on services.
⚫ High risk of data loss: Ever-increasing disk capacities lead to longer reconstruction time and
higher risk of data loss. Dual redundancy protection is invalid during reconstruction and data
will be lost if any additional disk or data block fails. Therefore, a longer reconstruction duration
results in higher risk of data loss.
⚫ Material impact on services: During reconstruction, member disks are engaged in
reconstruction and provide poor service performance, which will affect the operation of upper-
layer services.
To solve the preceding problems of traditional RAID and ride on the development of virtualization
technologies, the following alternative solutions emerged:
⚫ LUN virtualization: A traditional RAID array is further divided into small units. These units are
regrouped into storage spaces accessible to hosts.
⚫ Block virtualization: Disks in a storage pool are divided into small data blocks. A RAID array is
created using these data blocks so that data can be evenly distributed to all disks in the storage
pool. Then, resources are managed based on data blocks.
2.2.2.2 Basic Principles of RAID 2.0+
RAID 2.0+ divides a physical disk into multiple chunks (CKs). CKs in different disks form a chunk group
(CKG). CKGs have a RAID relationship with each other. Multiple CKGs form a large storage resource
pool. Resources are allocated from the resource pool to hosts.
Implementation mechanism of RAID 2.0+:
⚫ Multiple SSDs form a storage pool.
⚫ Each SSD is then divided into chunks (CKs) of a fixed size (typically 4 MB) for logical space
management.
⚫ CKs from different SSDs form chunk groups (CKGs) based on the RAID policy specified on
DeviceManager.
⚫ CKGs are further divided into grains (typically 8 KB). Grains are mapped to LUNs for refined
management of storage resources.
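The following Python sketch illustrates the chunk-based layout just described under simple assumptions (chunk size, disk sizes, and the selection logic are illustrative, not the storage system's internal data structures): each disk is cut into fixed-size CKs, and a CKG takes one free CK from each of several different disks according to the RAID policy.

# Minimal sketch of CK/CKG formation: each disk is cut into fixed-size chunks
# (4 MB here), and a CKG takes one free CK from each of several different disks.

CHUNK_SIZE_MB = 4

def split_into_chunks(disk_id, disk_size_mb):
    """Return the CKs of one disk as (disk_id, chunk_index) tuples."""
    return [(disk_id, i) for i in range(disk_size_mb // CHUNK_SIZE_MB)]

def build_ckg(free_chunks, width):
    """Take one free CK from each of `width` disks with the most free CKs."""
    disks = sorted(free_chunks, key=lambda d: len(free_chunks[d]), reverse=True)[:width]
    return [free_chunks[d].pop() for d in disks]

free = {d: split_into_chunks(d, disk_size_mb=40) for d in range(8)}   # 8 small example disks
ckg = build_ckg(free, width=5)     # e.g. a RAID 5 policy needing 5 CKs (4 data + 1 parity)
print(ckg)                         # five CKs, each taken from a different disk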
RAID 2.0+ outperforms traditional RAID in the following aspects:
⚫ Service load balancing to avoid hot spots: Data is evenly distributed to all disks in the resource
pool, protecting disks from early end of service life due to excessive writes.
⚫ Fast reconstruction to reduce risk window: When a disk fails, the valid data in the faulty disk is
reconstructed to all other functioning disks in the resource pool (fast many-to-many
reconstruction), efficiently resuming redundancy protection.
⚫ Reconstruction load balancing among all disks in the resource pool to minimize the impact on
upper-layer applications.
2.2.2.3 RAID 2.0+ Composition
1. Disk Domain
A disk domain is a combination of disks (which can be all disks in the array). After the disks are
combined and reserved for hot spare capacity, each disk domain provides storage resources for
the storage pool.
For traditional RAID, a RAID array must be created first for allocating disk spaces to service
hosts. However, there are some restrictions and requirements for creating a RAID array: a RAID array must consist of disks of the same type, size, and rotational speed, and generally contains no more than 12 disks.
Huawei RAID 2.0+ is implemented in another way. A disk domain should be created first. A disk
domain is a disk array. A disk can belong to only one disk domain. One or more disk domains can
be created in an OceanStor storage system. It seems that a disk domain is similar to a RAID
array. Both consist of disks but have significant differences. A RAID array consists of disks of the
same type, size, and rotational speed, and such disks are associated with a RAID level. In
contrast, a disk domain consists of up to more than 100 disks of up to three types. Each type of
disk is associated with a storage tier. For example, SSDs are associated with the high
performance tier, SAS disks are associated with the performance tier, and NL-SAS disks are
associated with the capacity tier. A storage tier would not exist if there are no disks of the
corresponding type in a disk domain. A disk domain separates an array of disks from another
array of disks for fully isolating faults and maintaining independent performance and storage
resources. RAID levels are not specified when a disk domain is created. That is, data redundancy
protection methods are not specified. Actually, RAID 2.0+ provides more flexible and specific
data redundancy protection methods. The storage space formed by disks in a disk domain is
divided into storage pools of a smaller granularity and hot spare space shared among storage
tiers. The system automatically sets the hot spare space based on the hot spare policy (high,
low, or none) set by an administrator for the disk domain and the number of disks at each
storage tier in the disk domain. In a traditional RAID array, an administrator must specify a disk as the hot spare disk.
2. Storage Pool and Storage Tier
A storage pool is a storage resource container. The storage resources used by application
servers are all from storage pools.
A storage tier is a collection of storage media providing the same performance level in a storage
pool. Different storage tiers manage storage media of different performance levels and provide
storage space for applications that have different performance requirements.
A storage pool created based on a specified disk domain dynamically allocates CKs from the disk
domain to form CKGs according to the RAID policy of each storage tier for providing storage
resources with RAID protection to applications.
A storage pool can be divided into multiple tiers based on disk types.
When creating a storage pool, a user is allowed to specify a storage tier and related RAID policy
and capacity for the storage pool.
OceanStor storage systems support RAID 1, RAID 10, RAID 3, RAID 5, RAID 50, and RAID 6 and
related RAID policies.
The capacity tier consists of large-capacity SATA and NL-SAS disks. DP RAID 6 is recommended.
3. Disk Group
An OceanStor storage system automatically divides disks of each type in each disk domain into
one or more disk groups (DGs) according to disk quantity.
One DG consists of disks of only one type.
CKs in a CKG are allocated from different disks in a DG.
DGs are internal objects automatically configured by OceanStor storage systems and typically
used for fault isolation. DGs are not presented externally.
4. Logical Drive
A logical drive (LD) is a disk that is managed by a storage system and corresponds to a physical
disk.
5. CK
A chunk (CK) is a disk space of a specified size allocated from a storage pool. It is the basic unit
of a RAID array.
6. CKG
A chunk group (CKG) is a logical storage unit that consists of CKs from different disks in the same
DG based on the RAID algorithm. It is the minimum unit for allocating resources from a disk
domain to a storage pool.
All CKs in a CKG are allocated from the disks in the same DG. A CKG has RAID attributes, which
are actually configured for corresponding storage tiers. CKs and CKGs are internal objects
automatically configured by storage systems. They are not presented externally.
7. Extent
Each CKG is divided into logical storage spaces of a specific and adjustable size called extents.
Extent is the minimum unit (granularity) for migration and statistics of hot data. It is also the
minimum unit for space application and release in a storage pool.
An extent belongs to a volume or LUN. A user can set the extent size when creating a storage
pool. After that, the extent size cannot be changed. Different storage pools may consist of
extents of different sizes, but one storage pool must consist of extents of the same size.
8. Grain
When a thin LUN is created, extents are divided into 64 KB blocks which are called grains. A thin
LUN allocates storage space by grains. Logical block addresses (LBAs) in a grain are consecutive.
Grains are mapped to thin LUNs. A thick LUN does not involve grains.
9. Volume and LUN
A volume is an internal management object in a storage system.
A LUN is a storage unit that can be directly mapped to a host for data reads and writes. A LUN is
the external embodiment of a volume.
A volume organizes all extents and grains of a LUN and applies for and releases extents to
increase and decrease the actual space used by the volume.

2.2.3 Other RAID Technologies


2.2.3.1 Huawei Dynamic RAID Algorithm
When a flash component fails, Huawei dynamic RAID algorithm can proactively recover the data in
the faulty flash component and keep providing RAID protection for the data.
This RAID algorithm dynamically adjusts the number of data blocks in a RAID array to meet system
reliability and capacity requirements. If a chunk is faulty and no free chunk is available on other disks in
the disk domain for reconstruction, the system dynamically rebuilds the original N + M chunks as (N - 1) + M
chunks. When a new SSD is inserted, the system migrates data from the (N - 1) + M chunks to the
newly constructed N + M chunks for efficient disk utilization.
Dynamic RAID adopts the erasure coding (EC) algorithm, which can dynamically adjust the number of
CKs in a CKG if only SSDs are used to meet the system reliability and capacity requirements.
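
A minimal sketch of this adjustment, assuming an 8+2 stripe, is shown below. It only illustrates the N + M to (N - 1) + M decision, not Huawei's actual EC implementation.

```python
# Conceptual sketch: if a CK fails and no free CK is available for
# reconstruction, the stripe is rebuilt with one fewer data column.
def adjust_stripe(n_data, m_parity, free_chunk_available):
    """Return the (data, parity) column counts used after a chunk failure."""
    if free_chunk_available:
        return n_data, m_parity        # normal reconstruction keeps N + M
    return n_data - 1, m_parity        # shrink to (N - 1) + M, keep parity level

print(adjust_stripe(8, 2, free_chunk_available=False))   # -> (7, 2)
```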
2.2.3.2 RAID-TP
RAID protection is essential to a storage system for consistently high reliability and performance.
However, the reliability of RAID protection is challenged by uncontrollable RAID array construction
time due to drastic increase in capacity.
RAID-TP achieves optimal performance, reliability, and capacity utilization.
Customers have to purchase larger-capacity disks to replace existing ones when upgrading a system, so
one system may contain disks of different capacities. How can optimal capacity utilization be maintained
in a system that mixes disks of different capacities?
RAID-TP uses Huawei's optimized FlexEC algorithm that allows the system to tolerate failures of up
to three disks, improving reliability while allowing a longer reconstruction time window.


RAID-TP with FlexEC algorithm reduces the amount of data read from a single disk by 70%, as
compared with traditional RAID, minimizing the impact on system performance.
In a typical 4:2 RAID 6 array, the capacity utilization is about 67%. On this basis, a Huawei OceanStor
all-flash storage system with 25 disks improves capacity utilization by about 20%.
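
These utilization figures follow directly from the stripe layout, as the short calculation below shows. The 22+3 column split assumed for a 25-disk RAID-TP pool is an illustration, not a documented layout.

```python
# Capacity utilization = data columns / total columns in a stripe.
def utilization(n_data, m_parity):
    return n_data / (n_data + m_parity)

print(f"RAID 6, 4+2  : {utilization(4, 2):.0%}")    # about 67%
# Assumed RAID-TP layout on a 25-disk pool: 22 data + 3 parity columns
print(f"RAID-TP, 22+3: {utilization(22, 3):.0%}")   # about 88%
```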

2.3 Common Storage Protocols


2.3.1 SAN Protocol
2.3.1.1 SCSI and SAS
2.3.1.1.1 SCSI Protocol
Small Computer System Interface (SCSI) is an extensive protocol family. The SCSI protocol defines a model
and the instruction sets required for different devices to exchange information within that framework.
SCSI reference documents cover devices, models, and links.
⚫ SCSI architecture documents discuss the basic architecture models SAM and SPC and describe
the SCSI architecture in detail, covering topics like the task queue model and basic common
instruction model.
⚫ SCSI device implementation documents cover the implementation of specific devices, such as
the block device (disk) SBC and stream device (tape) SSC instruction systems.
⚫ SCSI transmission link implementation documents discuss FCP, SAS, iSCSI, and FCoE and describe
in detail the implementation of the SCSI protocol on media.
SCSI Logical Topology
The SCSI logical topology includes initiators, targets, and LUNs.
⚫ Initiator: SCSI is essentially a client/server (C/S) architecture in which a client acts as an initiator
to send request instructions to a SCSI target. Generally, a host acts as an initiator.
⚫ Target: processes SCSI instructions. It receives and parses instructions from a host. For example,
a disk array functions as a target.
⚫ LUN: a namespace resource described by a SCSI target. A target may include multiple LUNs, and
attributes of the LUNs may be different. For example, LUN#0 may be a disk, and LUN#1 may be
another device.
The initiator and target of SCSI constitute a typical C/S model. Each instruction is implemented
through the request/response mode. The initiator sends SCSI requests. The target responds to the
SCSI requests, provides services through LUNs, and provides a task management function.
SCSI Initiator Model
SCSI initiator logical layers in different operating systems:
⚫ On Windows, a SCSI initiator includes three logical layers: storage/tape driver, SCSI port, and
mini port. The SCSI port implements the basic framework processing procedures for SCSI, such
as device discovery and namespace scanning.
⚫ On Linux, a SCSI initiator includes three logical layers: SCSI device driver, scsi_mod middle layer,
and SCSI adapter driver (HBA). The scsi_mod middle layer processes SCSI device-irrelevant and
adapter-irrelevant processes, such as exceptions and namespace maintenance. The HBA driver
provides link implementation details, such as SCSI instruction packaging and unpacking. The
device driver implements specific SCSI device drivers, such as the famous SCSI disk driver, SCSI
tape driver, and SCSI CD-ROM device driver.


⚫ The structure of Solaris comprises the SCSI device driver, SSA middle layer, and SCSI adapter
driver, which is similar to the structure of Linux/Windows.
⚫ The AIX architecture is structured in three layers: SCSI device driver, SCSI middle layer, and SCSI
adaptation driver.
SCSI Target Model
Based on the SCSI architecture, a target is divided into three layers: port layer, middle layer, and
device layer.
⚫ A PORT model in a target packages or unpackages SCSI instructions on links. For example, a
PORT can package instructions into FCP, iSCSI, or SAS, or unpackage instructions from those
formats.
⚫ A device model in a target serves as a SCSI instruction analyser. It tells the initiator what device
the current LUN is by processing INQUIRY commands, and processes I/Os through READ/WRITE commands.
⚫ The middle layer of a target maintains models such as LUN space, task set, and task (command).
There are two ways to maintain LUN space. One is to maintain a global LUN for all PORTs, and
the other is to maintain a LUN space for each PORT.
SCSI Protocol and Storage System
The SCSI protocol is the basic protocol used for communication between hosts and storage devices.
The controller sends a signal to the bus processor requesting to use the bus. After the request is
accepted, the controller's high-speed cache sends data. During this process, the bus is occupied by
the controller and other devices connected to the same bus cannot use it. However, the bus
processor can interrupt the data transfer at any time and allow other devices to use the bus for
operations of a higher priority.
A SCSI controller is like a small CPU with its own command set and cache. The special SCSI bus
architecture can dynamically allocate resources to tasks run by multiple devices in a computer. In
this way, multiple tasks can be processed at the same time.
SCSI Protocol Addressing
A traditional SCSI controller is connected to a single bus, so only one bus ID is allocated. An
enterprise-level server may be configured with multiple SCSI controllers, so there may be multiple
SCSI buses. In a storage network, each FC HBA or iSCSI network adapter is connected to a bus. A bus
ID must therefore be allocated to each bus to distinguish between them.
To address devices connected to a SCSI bus, SCSI device IDs and LUNs are used. Each device on the
SCSI bus must have a unique device ID. The HBA on the server also has its own device ID: 7. Each bus,
including the bus adapter, supports a maximum of 8 or 16 device IDs. The device ID is used to
address devices and identify the priority of the devices on the bus.
Each storage device may include sub-devices, such as virtual disks and tape drives. So LUN IDs are
used to address sub-devices in a storage device.
A ternary description (bus ID, target device ID, and LUN ID) is used to identify a SCSI target.
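
A sketch of this addressing scheme is given below; the helper function and the wide_bus flag are illustrative assumptions.

```python
from collections import namedtuple

# A SCSI device is addressed by the (bus ID, target device ID, LUN ID) triple.
ScsiAddress = namedtuple("ScsiAddress", ["bus_id", "target_id", "lun_id"])

HBA_ID = 7   # the host adapter conventionally uses device ID 7

def is_valid_target(addr, wide_bus=True):
    max_ids = 16 if wide_bus else 8        # 8 or 16 device IDs per bus
    return 0 <= addr.target_id < max_ids and addr.target_id != HBA_ID

disk = ScsiAddress(bus_id=0, target_id=3, lun_id=0)
print(disk, "is a valid target:", is_valid_target(disk))
```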
2.3.1.1.2 SAS Protocol
Serial Attached SCSI (SAS) is the serial standard of the SCSI bus protocol. A serial port has a simple
structure, supports hot swap, and boasts a high transmission speed and execution efficiency.
Generally, large parallel cables cause electronic interference. The SAS cable structure can solve this
problem. The SAS cable structure saves space, thereby improving heat dissipation and ventilation for
servers that use SAS disks.
SAS has the following advantages:
⚫ Lower cost:


➢ A SAS backplane supports SAS and SATA disks, which reduces the cost of using different
types of disks.
➢ There is no need to design different products based on the SCSI and SATA standards. In
addition, the cabling complexity and the number of PCB layers are reduced, further
reducing costs.
➢ System integrators do not need to purchase different backplanes and cables for different
disks.
⚫ More devices can be connected:
➢ The SAS technology introduces the SAS expander, so that a SAS system supports more
devices. Each expander can be connected to multiple ports, and each port can be
connected to a SAS device, a host, or another SAS expander.
⚫ High reliability:
➢ The reliability is the same as that of SCSI and FC disks and is better than that of SATA disks.
➢ The verified SCSI command set is retained.
⚫ High performance:
➢ The unidirectional port rate is high.
⚫ Compatibility with SATA:
➢ SATA disks can be directly installed in a SAS environment.
➢ SATA and SAS disks can be used in the same system, which meets the requirements of the
popular tiered storage strategy.
The SAS architecture includes six layers from the bottom to the top: physical layer, phy layer, link
layer, port layer, transport layer, and application layer. Each layer provides certain functions.
⚫ Physical layer: defines hardware, such as cables, connectors, and transceivers.
⚫ Phy layer: includes the lowest-level protocols, like coding schemes and power supply/reset
sequences.
⚫ Link layer: describes how to control phy layer connection management, primitives, CRC,
scrambling and descrambling, and rate matching.
⚫ Port layer: describes the interfaces of the link layer and transport layer, including how to
request, interrupt, and set up connections.
⚫ Transport layer: defines how the transmitted commands, status, and data are encapsulated into
SAS frames and how SAS frames are decomposed.
⚫ Application layer: describes how to use SAS in different types of applications.
SAS has the following characteristics:
⚫ SAS uses the full-duplex (bidirectional) communication mode. The traditional parallel SCSI can
communicate only in one direction. When a device receives a data packet from the parallel SCSI
and needs to respond, a new SCSI communication link needs to be set up after the previous link
is disconnected. However, each SAS cable contains two input cables and two output cables. This
way, SAS can read and write data at the same time, improving the data throughput efficiency.
⚫ Compared with SCSI, SAS has the following advantages:
➢ As it uses the serial communication mode, SAS provides higher throughput and may deliver
higher performance in the future.
➢ Four narrow ports can be bound as a wide link port to provide higher throughput.
Scalability of SAS:


⚫ SAS uses expanders to expand interfaces. One SAS domain supports a maximum of 16,384 disk
devices.
⚫ A SAS expander is an interconnection device in a SAS domain. Similar to an Ethernet switch, a
SAS expander enables an increased number of devices to be connected in a SAS domain, and
reduces the cost in HBAs. Each expander can connect to a maximum of 128 terminals or
expanders. The main components in a SAS domain are SAS expanders, terminal devices, and
connection devices (or SAS connection cables).
➢ A SAS expander is equipped with a routing table that tracks the addresses of all SAS drives.
➢ A terminal device can be an initiator (usually a SAS HBA) or a target (a SAS or SATA disk, or
an HBA in target mode).
⚫ Loops cannot be formed in a SAS domain. This ensures terminal devices can be detected.
⚫ In reality, the number of terminal devices connected to an expander is far fewer than 128 because of
bandwidth constraints.
Cable Connection Principles of SAS:
⚫ Most storage device vendors use SAS cables to connect disk enclosures to controller enclosures
or connect disk enclosures. A SAS cable bundles four independent channels (narrow ports) into
a wide port to provide higher bandwidth. The four independent channels provide 12 Gbit/s
each, so a wide port can provide 48 Gbit/s of bandwidth. To ensure that the data volume on a
SAS cable does not exceed the maximum bandwidth of the SAS cable, the total number of disks
connected to a SAS loop must be limited.
⚫ For a Huawei storage device, a loop supports a maximum of 168 disks, that is, up to seven disk
enclosures with 24 disk slots each, provided all disks in the loop are traditional SAS disks. Because
SSDs deliver much higher transmission speeds than SAS HDDs, a loop of SSDs supports a maximum of
96 disks, that is, four disk enclosures with 24 disk slots each (see the bandwidth sketch after this list).
⚫ A SAS cable is called a mini SAS cable when the speed of a single channel is 6 Gbit/s, and a SAS
cable is called a high-density mini SAS cable when the speed is increased to 12 Gbit/s.
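
A back-of-the-envelope calculation shows where these disk limits come from. The per-disk throughput figures below are rough assumptions chosen only to illustrate how a 48 Gbit/s wide port constrains the number of disks per loop.

```python
# Rough estimate of how many disks a 4 x 12 Gbit/s SAS wide port can feed.
# Per-disk throughput figures are assumptions for illustration only.
LANE_GBITS = 12
LANES_PER_WIDE_PORT = 4
WIDE_PORT_GBITS = LANE_GBITS * LANES_PER_WIDE_PORT     # 48 Gbit/s

def max_disks(per_disk_mbytes_per_s):
    per_disk_gbits = per_disk_mbytes_per_s * 8 / 1000
    return int(WIDE_PORT_GBITS // per_disk_gbits)

print("SAS HDD loop (assume ~35 MB/s each):", max_disks(35))   # on the order of 168
print("SSD loop     (assume ~60 MB/s each):", max_disks(60))   # on the order of 96
```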
2.3.1.2 iSCSI and FC
2.3.1.2.1 iSCSI Protocol
The Internet SCSI (iSCSI) protocol was first proposed by IBM and Cisco. Since 2004, the iSCSI protocol has
been a formal IETF standard (RFC 3720). The existing iSCSI protocol is based on SCSI Architecture Model-2
(SAM-2).
iSCSI is short for Internet Small Computer System Interface. It is an IP-based storage networking
standard for linking data storage facilities. It provides block-level access to storage devices by
carrying SCSI commands over a TCP/IP network.
The iSCSI protocol encapsulates SCSI commands and block data into TCP packets for transmission
over IP networks. As the transport layer protocol of SCSI, iSCSI uses mature IP network technologies
to implement and extend SAN. The SCSI protocol layer generates CDBs and sends the CDBs to the
iSCSI protocol layer. The iSCSI protocol layer then encapsulates the CDBs into PDUs and transmits
the PDUs over an IP network.
iSCSI Initiator and Target
The iSCSI communication system inherits some of SCSI's features. The iSCSI communication involves
an initiator that sends I/O requests and a target that responds to the I/O requests and executes I/O
operations. After a connection is set up between the initiator and target, the target controls the
entire process as the primary device.


⚫ There are three types of iSCSI initiators: software-based initiator driver, hardware-based TCP
offload engine (TOE) NIC, and iSCSI HBA. Their performance increases in that order.
⚫ An iSCSI target is usually an iSCSI disk array or iSCSI tape library.
The iSCSI protocol defines a set of naming and addressing methods for iSCSI initiators and targets. All
iSCSI nodes are identified by their iSCSI names. This method distinguishes iSCSI names from host
names.
iSCSI Architecture
In an iSCSI system, a user sends a data read or write command on a SCSI storage device. The
operating system converts this request into one or multiple SCSI instructions and sends the
instructions to the target SCSI controller card. The iSCSI node encapsulates the instructions and data
into an iSCSI packet and sends the packet to the TCP/IP layer, where the packet is encapsulated into
an IP packet to be transmitted over a network. You can also encrypt the SCSI instructions for
transmission over an insecure network.
Data packets can be transmitted over a LAN or the Internet. The receiving storage controller
restructures the data packets and sends the SCSI control commands and data in the iSCSI packets to
corresponding disks. The disks execute the operation requested by the host or application. For a
data request, data will be read from the disks and sent to the host. The process is completely
transparent to users. Though SCSI instruction execution and data preparation can be implemented
by the network controller software using TCP/IP, the host will spare a lot of CPU resources to
process the SCSI instructions and data. If these transactions are processed by dedicated devices, the
impact on system performance will be reduced to a minimum. Therefore, it is necessary to develop
dedicated iSCSI adapters that execute SCSI commands and complete data preparation under iSCSI
standards. An iSCSI adapter combines the functions of an NIC and an HBA. The iSCSI adapter obtains
data by blocks, classifies and processes data using the TCP/IP processing engine, and sends IP data
packets over an IP network. In this way, users can create IP SANs without compromising server
performance.
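
The encapsulation path described above (SCSI CDB, then iSCSI PDU, then TCP/IP) can be sketched as follows. The READ(10) CDB layout is standard SCSI, but the "PDU" header here is a simplified placeholder, not the real iSCSI wire format (the actual iSCSI basic header segment is 48 bytes).

```python
import struct

def build_read10_cdb(lba, blocks):
    # Standard SCSI READ(10) CDB: opcode 0x28, flags, 4-byte LBA,
    # group number, 2-byte transfer length, control byte (10 bytes total).
    return struct.pack(">BBIBHB", 0x28, 0, lba, 0, blocks, 0)

def encapsulate_iscsi(cdb, task_tag):
    # Simplified placeholder "PDU" header: an opcode-like byte, the initiator
    # task tag, and the CDB length. Not the real 48-byte iSCSI header.
    header = struct.pack(">BIH", 0x01, task_tag, len(cdb))
    return header + cdb          # this byte string becomes the TCP payload

pdu = encapsulate_iscsi(build_read10_cdb(lba=2048, blocks=8), task_tag=1)
print(len(pdu), "bytes handed to the TCP/IP stack")
```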
2.3.1.2.2 FC Protocol
Fibre Channel (FC) can be referred to as the FC protocol, FC network, or FC interconnection. As FC
delivers high performance, it is becoming more commonly used for front-end host access on point-
to-point and switch-based networks. Like TCP/IP, the FC protocol suite also includes concepts from
the TCP/IP protocol suite and the Ethernet, such as FC switching, FC switch, FC routing, FC router,
and SPF routing algorithm.
FC protocol structure:
⚫ FC-0: defines physical connections and selects different physical media and data rates for
protocol operations. This maximizes system flexibility and allows for existing cables and different
technologies to be used to meet the requirements of different systems. Copper cables and
optical cables are commonly used.
⚫ FC-1: defines the 8b/10b transmission encoding used to balance the transmitted bit stream. The
encoding also serves as a mechanism for transferring data and detecting errors. The favourable transfer
characteristics of 8b/10b encoding help reduce component design costs and ensure an optimal transition
density for better clock recovery.
⚫ FC-2: includes the following items for sending data over the network:
➢ How data should be split into small frames
➢ How much data should be sent at a time (flow control)
➢ Where frames should be sent (including defining service levels based on applications)
⚫ FC-3: defines advanced functions such as striping (data is transferred through multiple
channels), multicast (one message is sent to multiple targets), and group query (multiple ports


are mapped to one node). When FC-2 defines functions for a single port, FC-3 can define
functions across ports.
⚫ FC-4: maps upper-layer protocols, such as IP, SCSI, and ATM, onto FC. The SCSI mapping (FCP) is the
most widely used.
Like the Ethernet, FC provides the following network topologies:
⚫ Point-to-point:
➢ The simplest topology that allows direct communication between two nodes (usually a
storage device and a server).
⚫ FC-AL:
➢ Similar to the Ethernet shared bus topology but is in arbitrated loop mode rather than bus
connection mode. Each device is connected to another device end to end to form a loop.
➢ Data frames are transmitted hop by hop in the arbitrated loop and the data frames can be
transmitted only in one direction at any time. As shown in the figure, node A needs to
communicate with node H. After node A wins the arbitration, it sends data frames to node
H. However, the data frames are transmitted clockwise in the sequence of B-C-D-E-F-G-H,
which is inefficient.
⚫ Fabric:
➢ Similar to an Ethernet switching topology, a fabric topology is a mesh switching matrix.
➢ The forwarding efficiency is much greater than in FC-AL.
➢ FC devices are connected to fabric switches through optical fibres or copper cables to
implement point-to-point communication between nodes.
FC frees the workstation from the management of every port. Each port manages its own
point-to-point connection to the fabric, and other fabric functions are implemented by FC
switches. There are seven types of ports in FC networks.
⚫ Device (node) port:
➢ N_Port: Node port. A fabric device can be directly attached.
➢ NL_Port: Node loop port. A device can be attached to a loop.
⚫ Switch port:
➢ E_Port: Expansion port (connecting switches).
➢ F_Port: A port on a fabric device that is used to connect to an N_Port.
➢ FL_Port: Fabric loop port.
➢ G_Port: A generic port that can be converted into an E_Port or F_Port.
➢ U_Port: A universal port used to describe automatic port detection.
2.3.1.3 PCIe and NVMe
2.3.1.3.1 PCIe Protocol
In 1991, Intel first proposed the concept of PCI. PCI has the following characteristics:
⚫ Simple bus structure, low costs, easy designs.
⚫ The parallel bus supports a limited number of devices and the bus scalability is poor.
⚫ When multiple devices are connected, the effective bandwidth of the bus is greatly reduced and
the transmission rate slows down.
With the development of modern processor technologies, it is inevitable that parallel buses will be
replaced by high-speed differential buses in the interconnectivity field. Compared with single-ended


parallel signals, high-speed differential signals can run at much higher clock frequencies. This is how the
PCIe bus came into being.
PCIe is short for PCI Express, which is a high-performance and high-bandwidth serial communication
interconnection standard. It was first proposed by Intel and then developed by the Peripheral
Component Interconnect Special Interest Group (PCI-SIG) to replace bus-based communication
architectures.
Compared with the traditional PCI bus, PCIe has the following advantages:
⚫ Dual channels, high bandwidth, and a fast transmission rate: Separate RX and TX paths provide a
transmission mode similar to full duplex and a higher transmission rate. PCIe 1.0, 2.0, 3.0, 4.0, and 5.0
deliver raw rates of 2.5, 5, 8, 16, and 32 Gbit/s per lane, respectively, and bandwidth scales further
with the number of lanes (see the throughput sketch after this list).
⚫ Compatibility: PCIe is compatible with PCI at the software layer but has upgraded software.
⚫ Ease-of-use: Hot swap is supported. A PCIe bus interface slot contains the hot swap detection
signal, supporting hot swap and heat exchange.
⚫ Error processing and reporting: A PCIe bus uses a layered structure, in which the software layer
can process and report errors.
⚫ Virtual channels of each physical connection: Each physical channel supports multiple virtual
channels (in theory, eight virtual channels are supported for independent communication
control), thereby supporting QoS of each virtual channel and achieving high-quality traffic
control.
⚫ Reduced I/Os, board-level space, and crosstalk: A typical PCI bus data line requires at least 50
I/O resources, while PCIe X1 requires only four I/O resources. Reduced I/Os saves board-level
space, and the direct distance between I/Os can be longer, thereby reducing crosstalk.
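
The per-lane figures quoted in the list translate into effective throughput once line-code overhead is taken into account (8b/10b for PCIe 1.0/2.0, 128b/130b from PCIe 3.0 onwards). The short calculation below is a sketch of that arithmetic.

```python
# Effective per-lane throughput per PCIe generation, after coding overhead.
GENERATIONS = {        # generation: (raw rate in GT/s, coding efficiency)
    "1.0": (2.5, 8 / 10),
    "2.0": (5.0, 8 / 10),
    "3.0": (8.0, 128 / 130),
    "4.0": (16.0, 128 / 130),
    "5.0": (32.0, 128 / 130),
}

for gen, (rate, eff) in GENERATIONS.items():
    per_lane_gbytes = rate * eff / 8          # divide by 8 bits per byte
    print(f"PCIe {gen}: {per_lane_gbytes:.2f} GB/s per lane, "
          f"{per_lane_gbytes * 4:.2f} GB/s for an x4 link")
```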
Why PCIe? PCIe is future-oriented, and higher throughputs can be achieved in the future. PCIe is
providing increasing throughput using the latest technologies, and the transition from PCI to PCIe
can be simplified by guaranteeing compatibility with PCI software using layered protocols and drives.
The PCIe protocol features point-to-point connection, high reliability, tree networking, full duplex,
and frame-structure-based transmission.
PCIe protocol layers include the physical layer, data link layer, transaction layer, and application
layer.
⚫ The physical layer in a PCIe bus architecture determines the physical features of the bus. In
future, the performance of a PCIe bus can be further improved by increasing the speed or
changing the encoding or decoding mode. Such changes only affect the physical layer,
facilitating upgrades.
⚫ The data link layer ensures the correctness and reliability of data packets transmitted over a
PCIe bus. It checks whether the data packet encapsulation is complete and correct, adds the
sequence number and CRC code to the data, and uses the ack/nack handshake protocol for
error detection and correction.
⚫ The transaction layer receives read and write requests from the software layer, encapsulates them
into request packets called transaction layer packets (TLPs), and transmits them to the data link layer.
It also receives TLPs passed up by the data link layer, associates them with the related software
requests, and delivers them to the software layer for processing.
⚫ The application layer is designed by users based on actual needs. Other layers must comply with
the protocol requirements.


2.3.1.3.2 NVMe Protocol


NVMe is short for Non-Volatile Memory Express. The NVMe standard is oriented to PCIe SSDs. Direct
connection from the native PCIe channel to the CPU can avoid the latency caused by communication
between the external controller (PCH) of the SATA and SAS interface and the CPU.
In terms of the entire storage process, NVMe not only serves as a logical protocol port, but also as an
instruction standard and a specified protocol. The low latency and parallelism of PCIe channels and
the parallelism of contemporary processors, platforms, and applications can be used to greatly
improve the read and write performance of SSDs at controllable costs. NVMe also eliminates the latency
introduced by the Advanced Host Controller Interface (AHCI), which constrained SSD performance in the
SATA era.
NVMe protocol stack:
⚫ In terms of the transmission path, I/Os of a SAS all-flash array are transmitted from the front-
end server to the CPU through the FC/IP front-end interface protocol of a storage device. They
are then transmitted to a SAS chip, a SAS expander, and finally a SAS SSD through PCIe links and
switches.
⚫ The Huawei NVMe-based all-flash storage system supports end-to-end NVMe. Data I/Os are
transmitted from a front-end server to the CPU through a storage device's FC-NVMe/NVMe
Over RDMA front-end interface protocol. Back-end data is transmitted directly to NVMe-based
SSDs through 100 Gbit/s RDMA. The CPU of the NVMe-based all-flash storage system appears to
communicate directly with NVMe SSDs via a shorter transmission path, providing higher
transmission efficiency and a lower transmission latency.
⚫ In terms of software protocol parsing, SAS- and NVMe-based all-flash storage systems differ
greatly in protocol interaction for data writes. If the SAS back-end SCSI protocol is used, four
protocol interactions are required for a complete data write operation. Huawei NVMe-based all-
flash storage systems require only two protocol interactions, making them twice as efficient as
SAS-based all-flash storage systems in terms of processing write requests.
Advantages of NVMe:
⚫ Low latency: Data is not read from registers when commands are executed, resulting in a low
I/O latency.
⚫ High bandwidth: A PCIe 3.0 x4 link can provide close to 4 GB/s of throughput for a single drive.
⚫ High IOPS: NVMe increases the maximum queue depth from 32 to 64,000. The IOPS of SSDs is
also greatly improved.
⚫ Low power consumption: The automatic switchover between power consumption modes and
dynamic power management greatly reduces power consumption.
⚫ Wide driver applicability: The driver applicability problem between different PCIe SSDs is solved.
Huawei OceanStor Dorado all-flash storage systems use NVMe-oF to implement SSD resource
sharing, and provide 32 Gbit/s FC-NVMe and NVMe over 100 Gbit/s RDMA networking designs. In
this way, the same network protocol is used for front-end network connection, back-end disk
enclosure connection, and scale-out controller interconnection.
RDMA uses related hardware and network technologies to enable NICs of servers to directly read
memory, achieving high bandwidth, low latency, and low resource consumption. However, the
RDMA-dedicated IB network architecture is incompatible with a live network, resulting in high costs.
RoCE effectively solves this problem. RoCE is a network protocol that uses the Ethernet to carry
RDMA. There are two versions of RoCE. RoCEv1 is a link layer protocol and cannot be used in
different broadcast domains. RoCEv2 is a network layer protocol and can implement routing
functions.


2.3.1.4 RDMA and RoCE


2.3.1.4.1 RDMA Protocol
RDMA is short for Remote Direct Memory Access, a method of transferring data in a buffer between
application software on two servers over a network.
Comparison between traditional mode and RDMA mode:
⚫ Compared with the internal bus I/O of traditional DMA, RDMA uses direct buffer transmission
between the application software of two endpoints over a network.
⚫ Compared with traditional network transmission, RDMA does not require operating systems or
protocol stacks.
⚫ RDMA can achieve ultra-low latency and ultra-high throughput transmission between endpoints
without using an abundance of CPU and OS resources. Few resources are consumed for data
processing and migration.
Currently, there are three types of RDMA networks: IB, RoCE, and iWARP. IB is designed for RDMA to
ensure reliable transmission at the hardware level. RoCE and iWARP are Ethernet-based RDMA
technologies and support corresponding verbs interfaces.
⚫ IB is a next-generation network protocol that supports RDMA from the beginning. NICs and
switches that support this technology are required.
⚫ RoCE is a network protocol that allows RDMA over the Ethernet. Its lower network headers are
Ethernet headers, and its upper network headers (including the data) are IB headers. RoCE
allows RDMA over a standard Ethernet infrastructure (switch). The NIC must support RoCE.
RoCE v1 is an RDMA protocol implemented based on the Ethernet link layer. Switches must
support flow control technologies like PFC to ensure reliable transmission at the physical layer.
RoCE v2 is implemented at the UDP layer in the Ethernet TCP/IP protocol.
⚫ iWARP: allows RDMA through TCP. The functions supported by IB and RoCE are not supported
by iWARP. iWARP allows RDMA to be executed over a standard Ethernet infrastructure (switch).
The NIC must support iWARP (if CPU offload is used). Otherwise, all iWARP stacks can be
implemented in the SW, and most RDMA performance advantages are lost.
2.3.1.4.2 NVMe over RoCE
NVMe over RoCE is a type of NVMe-oF protocol based on RDMA. It has been significantly optimized
in terms of performance, cost, network management, and technology development, and is gradually
becoming an optimal application of NVMe-oF.

2.3.2 NAS Protocols


2.3.2.1 File System
A file system is used by a computer to manage and organize data in the form of files and directories.
It forms a tree diagram, where the leaf node is a file, the intermediate node is a directory at each
level, and the top level is the root directory.
The file service is used to provide stable and reliable file sharing functions. It is a basic feature of the
NAS storage system and supports shared file access using CIFS and NFS clients.
2.3.2.2 CIFS, NFS, and CIFS-NFS Cross-Protocol Access
2.3.2.2.1 CIFS Protocol
In 1996, Microsoft renamed SMB to CIFS and added many new functions. Now, CIFS includes SMB1,
SMB2, and SMB3.0.


CIFS has high requirements on network transmission reliability, so it usually runs over TCP/IP. CIFS is mainly
used for the Internet and by Windows hosts to access files or other resources over the Internet. CIFS
allows Windows clients to identify and access shared resources. With CIFS, clients can quickly read,
write, and create files in storage systems as on local PCs. CIFS helps maintain a high access speed
and a fast system response even when many users simultaneously access the same shared file.
CIFS share in non-domain environments
The storage system can employ CIFS shares to share the file systems to users as directories. Users
can only view or access their own shared directories.
On the network, the storage system serves as the CIFS server and employs the CIFS protocol to
provide shared file system access for clients. After the clients map the shared files to the local
directories, users can access the files on the server as if they are accessing local files. You can set
locally authenticated user names and passwords in the storage system to determine the local
authentication information that can be used for accessing the file system.
CIFS share in AD domain environments
With the expansion of LAN and WAN, many enterprises use the AD domain to manage networks in
Windows. The AD domain makes network management simple and flexible.
A storage system can be added to an AD domain as a client. That is, it can be seamlessly integrated
with the AD domain. The AD domain controller saves information about all the clients and groups in
the domain. Clients in the AD domain need to be authenticated by the AD domain controller before
accessing the CIFS share provided by the storage system. By setting the permissions of AD domain
users, you can allow different domain users to have different permissions for shared directories. A
client in the AD domain can only access the shared directory with the same name as the client.
2.3.2.2.2 NFS Protocol
NFS is short for Network File System. The network file sharing protocol is defined by the IETF and
widely used in the Linux/Unix environment.
NFS is a client/server application that uses remote procedure call (RPC) for communication between
computers. Users can store and update files on the remote NAS just like on local PCs. A system
requires an NFS client to connect to an NFS server. NFS is transport-independent and can run over
TCP or UDP. Users or system administrators can use NFS to mount all file systems or a part of a file
system (a part of any directory or subdirectory hierarchy). Access to the mounted file system can be
controlled using permissions, for example, read-only or read-write permissions.
Differences between NFSv3 and NFSv4:
⚫ NFSv4 is a stateful protocol. It implements the file lock function and can obtain the root node of
a file system without the help of the NLM and MOUNT protocols. NFSv3 is a stateless protocol.
It requires the NLM protocol to implement the file lock function.
⚫ NFSv4 has enhanced security and supports RPCSEC-GSS identity authentication.
⚫ NFSv4 provides only two requests: NULL and COMPOUND. All operations are integrated into
COMPOUND. A client can encapsulate multiple operations into one COMPOUND request based
on actual requests to improve flexibility.
⚫ The namespace of the NFSv4 file system is changed. A root file system (fsid=0) must be set
on the server, and other file systems are mounted under the root file system for export.
⚫ Compared with NFSv3, the cross-platform feature of NFSv4 is enhanced.
NFS share in a non-domain environment
The NFS share in a non-domain environment is commonly used for small- and medium-sized
enterprises. On the network, the storage system serves as the NFS server and employs the NFS
protocol to provide shared file system access for clients. After the clients mount the shared files to


the local directories, users can access the files on the server in the same way as accessing local files.
Filter criteria can be set for client IP addresses in storage systems to restrict clients that can access
NFS shares.
NFS share in a domain environment
Domains enable accounts, applications, and networks to be centrally managed. In Linux, LDAP and
NIS domains are available.
LDAP is an open and extendable network protocol. The purpose of LDAP-based authentication
applications is to set up a directory-oriented user authentication system, specifically, an LDAP
domain. When a client user needs to access applications in the LDAP domain environment, the LDAP
server compares the user name and password sent by the client with corresponding authentication
information in the directory database for identity verification.
NIS is a directory service technology that enables users to centrally manage system databases. It
provides a yellow page function to support the centralized management of network information. It
works based on client/server architecture. When the user name and password of a user are saved in
the NIS server database, the user can log in to an NIS client and maintain the database to centrally
manage the network information on the LAN.
When a client needs to access an NFS share provided by the storage system in a domain
environment, the storage system employs the domain server network group to authenticate the
accessible IP address, ensuring the reliability of file system data.
2.3.2.2.3 CIFS-NFS Cross-Protocol Access
The storage system allows users to share a file system or dtree using NFS and CIFS at the same time.
Clients using different protocols can access the same file system or dtree at the same time. Since
Windows, Linux, and UNIX adopt different mechanisms to authenticate users and control access, the
storage system uses a mechanism to centrally map users and control access, protecting the security
of CIFS-NFS cross-protocol access.
If you use a CIFS-based client to access a storage system, the storage system authenticates local or
AD domain users in the first place. If the UNIX permission (UNIX Mode bits) has been configured for
the file or directory to be accessed, the CIFS user is mapped as an NFS user based on preset user
mapping rules during authentication. The storage system then performs UNIX permission authentication for the user.
If an NFS user attempts to access a file or directory that has the NT ACL on the storage system, the
storage system maps the NFS user as a CIFS user based on the preset mapping rules. Then the
storage system implements NT ACL permission authentication for the user.
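
The mapping step can be pictured as a rule lookup followed by a permission check. The rule table, user names, and the highly simplified Mode-bits check below are illustrative placeholders, not the storage system's actual mapping mechanism.

```python
# Illustrative sketch: a CIFS (Windows) user is mapped to an NFS (UNIX)
# identity before UNIX Mode bits are evaluated. All names and rules are
# hypothetical placeholders.
USER_MAPPING = {
    r"DOMAIN\alice": {"uid": 1001, "gid": 100},
    r"DOMAIN\bob":   {"uid": 1002, "gid": 100},
}

def cifs_access_to_unix_file(cifs_user, mode_bits, wants_write):
    if cifs_user not in USER_MAPPING:
        return False                       # no mapping rule: deny access
    other_bits = mode_bits & 0o7           # simplified: check "other" bits only
    needed = 0o2 if wants_write else 0o4
    return bool(other_bits & needed)

print(cifs_access_to_unix_file(r"DOMAIN\alice", 0o644, wants_write=False))  # True
print(cifs_access_to_unix_file(r"DOMAIN\alice", 0o644, wants_write=True))   # False
```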
2.3.2.3 HTTP, FTP, and NDMP
2.3.2.3.1 HTTP Shared File System
The storage system supports the HTTP shared file system. With the HTTPS service enabled, you can
share a file system in HTTPS mode.
Shared resource management is implemented based on the WebDAV protocol. As an HTTP
extension protocol, WebDAV allows clients to copy, move, modify, lock, unlock, and search for
resources in shared directories on servers.
Hypertext Transfer Protocol (HTTP) is a data transfer protocol that regulates the way a browser and
the web server communicate. It is used to transfer World Wide Web documents over the Internet.
HTTP defines how web clients request web pages from web servers and how web servers return web
pages to web clients.
HTTP uses short connections to transmit packets. A connection is terminated each time the
transmission is complete. HTTP and HTTPS use port 80 and port 443 respectively.


2.3.2.3.2 File Transfer Protocol (FTP)


FTP is a universal protocol for transferring files between remote servers and local clients over an IP
network.
It belongs to the application layer in the TCP/IP protocol suite and employs TCP ports 20 and 21 to
transfer files between remote servers and local clients. Port 20 is used to transfer data, and port 21
is used to transfer control messages. RFC 959 describes the basic FTP operations.
The storage system supports the FTP service. The FTPS server function on a device allows users to
securely access a remote device by using the FTP client.
FTP works in either of the following modes:
Active mode (PORT): The FTP server creates a data connection request. This mode is inapplicable
when the FTP client is behind a firewall, for example, the FTP client resides on a private network.
Passive mode (PASV): The FTP client creates a data connection request. This mode is inapplicable
when the FTP server does not allow the FTP client to connect to its high-order ports (usually, the
port IDs are larger than 1024).
FTP is short for File Transfer Protocol. It is used to control bidirectional transmission of files on the
Internet. It is also an application. FTP applications vary with different operating systems but these
applications use the same protocol to transfer files.
FTP allows users to communicate with another host by performing file operations, such as adding,
deleting, modifying, querying, and transferring files.
Active Mode of the FTP Server
The FTP client uses PORT commands to inform the server of the IP address and temporary port used
for receiving the data connection request initiated by the FTP server from port 20. As the request is
actively initiated by the FTP server, this mode is called active mode. In the following figure, the
temporary port is port 30000 and the IP address is 192.168.10.50.
A data connection will be set up after a control connection is set up. If a data connection is
successfully created, you can see the file list of the FTP server on the FTP client. If listing directories
times out, a data connection cannot be created.
Passive Mode of the FTP Server
The FTP client uses PASV commands to inform the FTP server that it will create a data connection
request. Then the FTP server uses PORT commands to inform the FTP client of the IP address and
temporary port used for receiving the data connection request. In the following figure, the FTP
server uses port 30000 and IP address 192.168.10.200 to receive the data connection request from
the FTP client. Then, the FTP client sends the data connection request to port 30000 and IP address
192.168.10.200. As the server passively receives the data connection request, this mode is called
passive mode.
If a data connection is successfully created, you can see the file list of the FTP server on the FTP
client. If listing directories times out, a data connection cannot be created.
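
The active/passive switch maps directly onto an ordinary FTP client. The sketch below uses Python's built-in ftplib; the host address and credentials are placeholders.

```python
from ftplib import FTP

HOST, USER, PASSWORD = "192.168.10.200", "ftpuser", "ftppass"   # placeholders

ftp = FTP()
ftp.connect(HOST, 21)            # control connection on port 21
ftp.login(USER, PASSWORD)

ftp.set_pasv(True)               # passive mode: the client opens the data connection
ftp.retrlines("LIST")            # listing works only if a data connection is created

ftp.set_pasv(False)              # active mode: the server connects back from port 20
ftp.retrlines("LIST")

ftp.quit()
```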
2.3.2.3.3 Network Data Management Protocol (NDMP)
The backup process of the traditional NAS storage is as follows:
⚫ A NAS device is a closed storage system. The Client Agent of the backup software can only be
installed on the production system instead of the NAS device. In the traditional network backup
process, data is read from a NAS device through the CIFS or NFS sharing protocol, and then
transferred to a backup server over a network.
⚫ Such a mode occupies network, production system and backup server resources, resulting in
poor performance and an inability to meet the requirements for backing up a large amount of
data.


NDMP is designed for the data backup system of NAS devices. It enables NAS devices, without any
backup client agent, to send data directly to the connected disk devices or the backup servers on the
network for backup.
There are two networking modes for NDMP:
⚫ On a 2-way network, backup media is directly connected to a NAS storage system instead of
being connected to a backup server. In a backup process, the backup server sends a backup
command to the NAS storage system through the Ethernet. The system then directly backs up
data to the tape library it is connected to.
➢ In the NDMP 2-way backup mode, data flows are transmitted directly to backup media,
greatly improving the transmission performance and reducing server resource usage.
However, a tape library is connected to a NAS storage device, so the tape library can back
up data only for the NAS storage device to which it is connected.
➢ Tape libraries are expensive. To enable different NAS storage devices to share tape
devices, NDMP also supports the 3-way backup mode.
⚫ In the 3-way backup mode, a NAS storage system can transfer backup data to a NAS storage
device connected to a tape library through a dedicated backup network. Then, the storage
device backs up the data to the tape library.

2.3.3 Object and HDFS Storage Protocols


2.3.3.1 Object Storage Protocol
2.3.3.1.1 Object Service
The object service is an object-based mass data storage service offering scalable, secure, reliable,
and cost-effective data storage capabilities.
The object service provides standard S3 APIs, which are HTTP/HTTPS-based REST APIs. Users can use
object service clients, APIs, and SDKs to easily manage and use object service data and develop
various types of upper-layer applications.
It is suitable for storing files of any type. It is generally used in large-scale data storage scenarios,
such as mass Internet content (videos, images, photos, books, media, and magazines), web disks,
digital media, backup, and archiving.
2.3.3.1.2 S3 Concepts
Simple Storage Service (S3) is an open cloud storage service that web application developers can use
to store digital assets, including pictures, videos, music, and documents.
S3 provides a RESTful API to interact with the service programmatically.
Assets stored and retrieved by S3 are called objects. Objects are stored in buckets. If we compare S3
to a disk, objects are files, and buckets are folders (or directories).
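
Because the object service exposes standard S3 APIs, any S3 SDK can be used against it. The sketch below uses boto3; the endpoint URL, credentials, bucket name, and object key are placeholders.

```python
import boto3

# Endpoint, credentials, bucket, and key are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="media-archive")                      # bucket = "folder"
s3.put_object(Bucket="media-archive", Key="photos/cat.jpg",   # object = "file"
              Body=b"...image bytes...")
obj = s3.get_object(Bucket="media-archive", Key="photos/cat.jpg")
print(obj["ContentLength"], "bytes retrieved")
```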
2.3.3.1.3 RESTful
REST is short for REpresentational State Transfer. It indicates the state transfer of resources in a
certain form on the network.
Resource: resource, that is, data. For example, newsfeed and friends.
Representational: a representation form, for example, an image or a video.
State Transfer: status change. This is implemented through HTTP verbs.
It uses the HTTP protocol and URI to add, delete, modify, and query resources using the client/server
model.


REST is not a specification, but an architecture for network applications. It can be regarded as a
design mode which is applied to the network application architecture.
RESTful
An architecture complying with the REST principle is called a RESTful architecture.
It provides a set of software design guidelines and constraints for designing software for interaction
between clients and servers. RESTful software is simpler and more hierarchical and facilitates the
cache mechanism.
2.3.3.2 HDFS Storage Protocol
2.3.3.2.1 HDFS Service
The HDFS service provides an HDFS decoupled storage-compute solution based on native HDFS. The
solution implements on-demand configuration of storage and compute resources, provides
consistent user experience, and helps reduce the total cost of ownership (TCO). It can coexist with
the legacy coupled storage-compute architecture.
The HDFS service provides native HDFS interfaces to interconnect with big data platforms, such as
FusionInsight, Cloudera CDH, and Hortonworks HDP, to implement big data storage and computing
and provide big data analysis services for upper-layer big data applications.
Typical application scenarios include big data for finance, Internet log retention, governments, and
Smart City.
Hadoop Distributed File System (HDFS) is one of the major components in the open-source Hadoop.
HDFS consists of NameNode and DataNode.
2.3.3.2.2 Distributed File System
The distributed file system stores files on multiple computer nodes. Thousands of computer nodes
form a computer cluster.
Currently, the computer cluster used by the distributed file system consists of common hardware,
which greatly reduces the hardware overhead.
The Hadoop Distributed File System (HDFS) is a distributed file system running on universal
hardware and is designed and developed based on the GFS paper.
2.3.3.2.3 HDFS Architecture
The HDFS architecture consists of three parts: NameNode, DataNode, and Client.
A NameNode stores and generates the metadata of file systems. It runs one instance.
A DataNode stores the actual data and reports blocks it manages to the NameNode. A DataNode
runs multiple instances.
A Client supports service access to HDFS. It obtains metadata from the NameNode, reads data from
DataNodes, and returns the data to services. Multiple client instances can run together with services.
2.3.3.2.4 HDFS Communication Protocol
HDFS is a distributed file system deployed on a cluster. Therefore, a large amount of data needs to
be transmitted over the network.
All HDFS communication protocols are based on the TCP/IP protocol.
The client initiates a TCP connection to the NameNode through a configurable port and uses the
client protocol to interact with the NameNode.
The NameNode and the DataNode interact with each other by using the DataNode protocol.
The interaction between the client and the DataNode is implemented through the Remote
Procedure Call (RPC) mechanism. By design, the NameNode does not initiate RPC requests; it only responds
to RPC requests from clients and DataNodes.
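
The division of labour during a read can be sketched conceptually as below. The classes and block names are illustrative only and do not reflect the real Hadoop RPC interfaces.

```python
# Conceptual sketch of an HDFS read: the client asks the NameNode where the
# blocks live, then fetches each block directly from a DataNode.
class NameNode:
    def __init__(self):
        self.block_map = {"/logs/app.log": [("blk_1", "dn-01"), ("blk_2", "dn-02")]}

    def get_block_locations(self, path):
        return self.block_map[path]

class DataNode:
    def __init__(self, name):
        self.name = name

    def read_block(self, block_id):
        return f"<data of {block_id} from {self.name}>"

def hdfs_read(path, namenode, datanodes):
    blocks = []
    for block_id, dn_name in namenode.get_block_locations(path):  # metadata from NameNode
        blocks.append(datanodes[dn_name].read_block(block_id))    # data from DataNodes
    return "".join(blocks)

datanodes = {"dn-01": DataNode("dn-01"), "dn-02": DataNode("dn-02")}
print(hdfs_read("/logs/app.log", NameNode(), datanodes))
```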


2.3.4 Storage System Architecture Evolution


The storage system evolved from a single controller to mutual backup of dual controllers that
processed their own tasks before processing data concurrently. Later, parallel symmetric processing
was implemented for multiple controllers. Scale-out storage has become widely used thanks to the
development of cloud computing and big data.
Currently, single-controller storage is rare. Most entry-level and mid-range storage systems use dual-
controller architecture, while most mission-critical storage systems use multi-controller architecture.
Single-controller Storage:
⚫ External disk array with RAID controllers: Using a disk chassis, a disk array virtualizes internal
disks into logical disks through RAID controllers, and then connects to a SCSI adapter on the host
through an external SCSI interface.
⚫ If a storage system has only one controller module, it is a single point of failure (SPOF).
Dual-controller Storage:
⚫ Currently, dual-controller architecture is mainly used in mainstream entry-level and mid-range
storage systems.
⚫ There are two working modes: Active-Standby and Active-Active.
➢ Active-Standby
This is also called high availability (HA). That is, only one is working at a time, while the
other waits, synchronizes data, and monitors services. If the active controller fails, the
standby controller takes over its services. In addition, the active controller is powered off
or restarted before the takeover to prevent split-brain. The bus ownership of the active controller
is released and then back-end and front-end buses are taken over.
➢ Active-Active
Two controllers are working at the same time. Each connects to all back-end buses, but
each bus is managed by only one controller. Each controller manages half of all back-end
buses. If one controller is faulty, the other takes over all buses. This is more efficient than
Active-Standby.
Mid-range Storage Architecture Evolution:
⚫ Mid-range storage systems always use an independent dual-controller architecture. Controllers
are usually of modular hardware.
⚫ The evolution of mid-range storage mainly focuses on the rate of host interfaces and disk
interfaces, and the number of ports.
⚫ The common form factor is the convergence of SAN and NAS storage services.
Multi-controller Storage:
⚫ Most mission-critical storage systems use multi-controller architecture.
⚫ The main architecture models are as follows:
➢ Bus architecture
➢ Hi-Star architecture
➢ Direct-connection architecture
➢ Virtual matrix architecture
Mission-critical Storage Architecture Evolution:
⚫ In 1990, EMC launched Symmetrix, a full bus architecture. A parallel bus connected front-end
interface modules, cache modules, and back-end disk interface modules for data and signal
exchange in time-division multiplexing mode.


⚫ In 2000, HDS adopted the switching architecture for Lightning 9900 products. Front-end
interface modules, cache modules, and back-end disk interface modules were connected on two
redundant switched networks, increasing communication channels to dozens of times more
than that of the bus architecture. The internal bus was no longer a performance bottleneck.
⚫ In 2003, EMC launched the DMX series based on full mesh architecture. All modules were
connected in point-to-point mode, which theoretically provided larger internal bandwidth but increased
system complexity and limited scalability.
⚫ In 2009, to reduce hardware development costs, EMC launched the distributed switching
architecture by connecting a separated switch module to the tightly coupled dual-controller of
mid-range storage systems. This achieved a balance between costs and scalability.
⚫ In 2012, Huawei launched the Huawei OceanStor 18000 series, a mission-critical storage
product also based on distributed switching architecture.
Storage Software Technology Evolution:
A storage system combines unreliable and low-performance disks to provide high-reliability and
high-performance storage through effective management. Storage systems provide sharing, easy-to-
manage, and convenient data protection functions. Storage system software has evolved from basic
RAID and cache to data protection features such as snapshot and replication, to dynamic resource
management with improved data management efficiency, and deduplication and tiered storage with
improved storage efficiency.
Scale-out Storage Architecture:
⚫ A scale-out storage system organizes local HDDs and SSDs of general-purpose servers into a
large-scale storage resource pool, and then distributes data to multiple data storage servers.
⚫ Huawei's scale-out storage currently takes an approach similar to Google's: it builds a distributed file
system across multiple servers and then implements storage services on top of that file system.
⚫ Most storage nodes are general-purpose servers. Huawei OceanStor 100D is compatible with
multiple general-purpose x86 servers and Arm servers.
➢ Protocol: storage protocol layer. The block, object, HDFS, and file services support local
mounting access over iSCSI or VSC, S3/Swift access, HDFS access, and NFS access
respectively.
➢ VBS: block access layer of FusionStorage Block. User I/Os are delivered to VBS over iSCSI or
SCSI.
➢ EDS-B: provides block services with enterprise features, and receives and processes I/Os
from VBS.
➢ EDS-F: provides the HDFS service.
➢ Metadata Controller (MDC): The metadata control device controls distributed cluster node
status, data distribution rules, and data rebuilding rules.
➢ Object Storage Device (OSD): the component that stores user data in the distributed cluster.
➢ Cluster Manager (CM): manages cluster information.

2.3.5 Storage System Expansion Methods


Service data continues to increase with the continued development of enterprise information
systems and the ever-expanding scale of services. The initial configuration of storage systems is
often not enough to meet these demands. Storage system capacity expansion has become a major
concern of system administrators. There are two capacity expansion methods: scale-up and scale-
out.


Scale-Up:
⚫ This traditional vertical expansion architecture continuously adds storage disks into the existing
storage systems to meet demands.
⚫ Advantage: simple operation at the initial stage
⚫ Disadvantage: As the storage system scale increases, resource increase reaches a bottleneck.
Scale-Out:
⚫ This horizontal expansion architecture adds controllers to meet demands.
⚫ Advantage: As the scale increases, the unit price decreases and the efficiency is improved.
⚫ Disadvantage: The complexity of software and management increases.
Huawei SAS disk enclosure is used as an example.
⚫ Port consistency: In a loop, the EXP port of an upper-level disk enclosure is connected to the PRI
port of a lower-level disk enclosure.
⚫ Dual-plane networking: Expansion module A connects to controller A, while expansion module
B connects to controller B.
⚫ Symmetric networking: On controllers A and B, symmetric ports and slots are connected to the
same disk enclosure.
⚫ Forward and backward connection networking: Expansion module A uses forward connection,
while expansion module B uses backward connection.
⚫ Cascading depth: The number of cascaded disk enclosures in a loop cannot exceed the upper
limit.
Huawei smart disk enclosure is used as an example.
⚫ Port consistency: In a loop, the EXP (P1) port of an upper-level disk enclosure is connected to
the PRI (P0) port of a lower-level disk enclosure.
⚫ Dual-plane networking: Expansion board A connects to controller A, while expansion board B
connects to controller B.
⚫ Symmetric networking: On controllers A and B, symmetric ports and slots are connected to the
same disk enclosure.
⚫ Forward connection networking: Both expansion modules A and B use forward connection.
⚫ Cascading depth: The number of cascaded disk enclosures in a loop cannot exceed the upper
limit.
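
These cabling rules lend themselves to a simple validation routine. The sketch below checks only port consistency and cascading depth; the limit of seven enclosures per loop is taken from the SAS HDD example earlier and is an assumption for other configurations.

```python
# Illustrative check of two cascading rules for one expansion loop:
# every hop must be EXP -> PRI, and the loop must not exceed MAX_DEPTH enclosures.
MAX_DEPTH = 7    # assumed limit, matching the SAS HDD example above

def validate_loop(links):
    """links: one (upstream_port, downstream_port) tuple per hop in the loop."""
    if len(links) + 1 > MAX_DEPTH:                 # hops + 1 = enclosures
        return False, "cascading depth exceeded"
    for up_port, down_port in links:
        if (up_port, down_port) != ("EXP", "PRI"):
            return False, f"bad hop: {up_port} -> {down_port}"
    return True, "loop cabling is consistent"

print(validate_loop([("EXP", "PRI")] * 6))   # 7 enclosures: OK
print(validate_loop([("EXP", "PRI")] * 7))   # 8 enclosures: too deep
```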
IP scale-out is used for Huawei OceanStor V3 and V5 entry-level and mid-range series, Huawei
OceanStor V5 Kunpeng series, and Huawei OceanStor Dorado V6 series. IP scale-out integrates
TCP/IP, Remote Direct Memory Access (RDMA), and Internet Wide Area RDMA Protocol (iWARP) to
implement service switching between controllers, which complies with the all-IP trend of the data
center network.
PCIe scale-out is used for Huawei OceanStor 18000 V3 and V5 series, and Huawei OceanStor Dorado
V3 series. PCIe scale-out integrates PCIe channels and the RDMA technology to implement service
switching between controllers.
PCIe scale-out: features high bandwidth and low latency.
IP scale-out: employs standard data center technologies (such as ETH, TCP/IP, and iWARP) and
infrastructure, and boosts the development of Huawei's proprietary chips for entry-level and mid-
range products.
Next, let's move on to the host's I/O read and write processes. The scenarios are as follows (a brief sketch of the ownership-based forwarding logic follows the list):
⚫ Local Write Process


➢ A host delivers write I/Os to engine 0.


➢ Engine 0 writes the data into the local cache, implements mirror protection, and returns a
message indicating that data is written successfully.
➢ Engine 0 flushes dirty data onto a disk. If the target disk is managed by the local engine,
engine 0 directly delivers the write I/Os.
➢ If the target disk is on a remote device, engine 0 transfers the I/Os to the engine (engine 1
for example) where the disk resides.
➢ Engine 1 writes dirty data onto disks.
⚫ Non-local Write Process
➢ A host delivers write I/Os to engine 2.
➢ After detecting that the LUN is owned by engine 0, engine 2 transfers the write I/Os to
engine 0.
➢ Engine 0 writes the data into the local cache, implements mirror protection, and returns a
message to engine 2, indicating that data is written successfully.
➢ Engine 2 returns the write success message to the host.
➢ Engine 0 flushes dirty data onto a disk. If the target disk is managed by the local engine,
engine 0 directly delivers the write I/Os.
➢ If the target disk is on a remote device, engine 0 transfers the I/Os to the engine (engine 1
for example) where the disk resides.
➢ Engine 1 writes dirty data onto disks.
⚫ Local Read Process
➢ A host delivers read I/Os to engine 0.
➢ If the read I/Os are hit in the cache of engine 0, engine 0 returns the data to the host.
➢ If the read I/Os are not hit in the cache of engine 0, engine 0 reads data from the disk. If
the target disk is managed by the local engine, engine 0 reads the data directly.
➢ After the data is read from the local disk, engine 0 returns it to the host.
➢ If the target disk is on a remote device, engine 0 transfers the I/Os to the engine (engine 1
for example) where the disk resides.
➢ Engine 1 reads data from the disk.
➢ Engine 1 accomplishes the data read.
➢ Engine 1 returns the data to engine 0 and then engine 0 returns the data to the host.
⚫ Non-local Read Process
➢ The LUN is not owned by the engine that delivers read I/Os, and the host delivers the read
I/Os to engine 2.
➢ After detecting that the LUN is owned by engine 0, engine 2 transfers the read I/Os to
engine 0.
➢ If the read I/Os are hit in the cache of engine 0, engine 0 returns the data to engine 2.
➢ Engine 2 returns the data to the host.
➢ If the read I/Os are not hit in the cache of engine 0, engine 0 reads data from the disk. If
the target disk is managed by the local engine, engine 0 reads the data directly.
➢ After the data is read from the local disk, engine 0 returns the data to engine 2 and then
engine 2 returns the data to the host.


➢ If the target disk is on a remote device, engine 0 transfers the I/Os to engine 1 where the
disk resides.
➢ Engine 1 reads data from the disk.
➢ Engine 1 completes the data read.
➢ Engine 1 returns the data to engine 0, engine 0 returns the data to engine 2, and then
engine 2 returns the data to the host.
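The forwarding rule behind all four scenarios is the same: the engine that receives an I/O checks which engine owns the target LUN and which engine manages the target disk, and forwards the request whenever that engine is not itself. The following minimal Python sketch illustrates only this decision logic; the engine names, ownership map, and function are illustrative assumptions, not the actual controller software.

```python
# Minimal sketch of ownership-based I/O forwarding (illustrative assumptions only).
LUN_OWNER = {"lun-01": "engine0"}          # which engine owns each LUN
DISK_LOCATION = {"lun-01": "engine1"}      # which engine manages the backing disks

def handle_write(receiving_engine: str, lun: str, data: bytes) -> str:
    owner = LUN_OWNER[lun]
    if receiving_engine != owner:
        # Non-local write: forward the I/O to the owning engine first.
        print(f"{receiving_engine}: forward write to {owner}")
    # The owning engine caches the data, mirrors it, and acknowledges the host.
    print(f"{owner}: write to local cache, mirror copy, ack host")
    # Later, dirty data is flushed; forward again if the disk is remote.
    disk_engine = DISK_LOCATION[lun]
    if disk_engine != owner:
        print(f"{owner}: forward flush to {disk_engine}")
    print(f"{disk_engine}: flush dirty data to disk")
    return "ok"

handle_write("engine2", "lun-01", b"payload")   # non-local write example
```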

2.3.6 Huawei Storage Product Architecture


Huawei entry-level and mid-range storage products use a dual-controller architecture by default,
while Huawei mission-critical storage products use a multi-controller architecture. The OceanStor
Dorado V6 SmartMatrix architecture combines the advantages of the scale-up and scale-out
architectures. A single system can be expanded to a maximum of 32 controllers, greatly improving
end-to-end reliability. The architecture ensures zero service interruption even if seven out of
eight controllers fail, providing 99.9999% availability, and is well suited to mission-critical
applications in finance, manufacturing, and carrier industries.
SmartMatrix makes breakthroughs in the mission-critical storage architecture that separates
computing and storage resources. Controller enclosures are completely separated from and directly
connected to disk enclosures. The biggest advantage is that controllers and storage devices can be
independently expanded and upgraded, which greatly improves storage system flexibility, protects
customers' investments in the long term, reduces storage risks, and guarantees service continuity.
⚫ Front-end full interconnection
➢ Dorado 8000 and 18000 V6 support FIMs, which can be simultaneously accessed by four
controllers in a controller enclosure.
➢ Upon reception of host I/Os, the FIM directly distributes the I/Os to appropriate
controllers.
⚫ Full interconnection among controllers
➢ Controllers in a controller enclosure are connected by 100 Gbit/s (40 Gbit/s for Dorado
3000 V6) RDMA links on the backplane.
➢ For scale-out to multiple controller enclosures, any two controllers can be directly
connected to avoid data forwarding.
⚫ Back-end full interconnection
➢ Dorado 8000 and 18000 V6 support BIMs, which allow a smart disk enclosure to be
connected to two controller enclosures and accessed by eight controllers simultaneously.
This technique, together with continuous mirroring, allows the system to tolerate failure of
7 out of 8 controllers.
➢ Dorado 3000, 5000, and 6000 V6 do not support BIMs. Disk enclosures connected to
Dorado 3000, 5000, and 6000 V6 can be accessed by only one controller enclosure.
Continuous mirroring is not supported.
The storage system supports three types of disk enclosures: SAS, smart SAS, and smart NVMe.
Currently, they cannot be used together on one storage system. Smart SAS and smart NVMe disk
enclosures use the same networking mode. In this mode, a controller enclosure uses the shared 2-
port 100 Gbit/s RDMA interface module to connect to a disk enclosure. Each interface module
connects to the four controllers in the controller enclosure through PCIe 3.0 x16. In this way, each
disk enclosure can be simultaneously accessed by all four controllers, achieving full interconnection
between the disk enclosure and the four controllers. A smart disk enclosure has two groups of uplink
ports and can connect to two controller enclosures at the same time. This allows the two controller
enclosures (eight controllers) to simultaneously access a disk enclosure, implementing full


interconnection between the disk enclosure and eight controllers. When full interconnection
between disk enclosures and eight controllers is implemented, the system can use continuous
mirroring to tolerate failure of 7 out of 8 controllers without service interruption.
Huawei storage provides E2E global resource sharing:
⚫ Symmetric architecture
➢ All products support host access in active-active mode. Requests can be evenly distributed
to each front-end link.
➢ They eliminate LUN ownership by controllers, making LUNs easier to use and balancing
loads. They accomplish this by dividing a LUN into multiple slices that are then evenly
distributed to all controllers using the DHT algorithm.
➢ Mission-critical products reduce latency with intelligent FIMs that divide host I/Os into
LUN slices and send each request to its target controller.
⚫ Shared port
➢ A single port is shared by four controllers in a controller enclosure.
➢ Loads are balanced without host multipathing.
⚫ Global cache
➢ The system directly writes received I/Os (in one or two slices) to the cache of the
corresponding controller and sends an acknowledgement to the host.
➢ The intelligent read cache of all controllers participates in prefetch and cache hit of all LUN
data and metadata.
FIMs of Huawei OceanStor Dorado 8000 and 18000 V6 series storage adopt Huawei-developed
Hi1822 chip to connect to all controllers in a controller enclosure via four internal links and each
front-end port provides a communication link for the host. If any controller restarts during an
upgrade, services are seamlessly switched to another controller without impacting hosts or
interrupting links. The host is unaware of controller faults. Switchover is completed within 1 second.
The FIM has the following features:
⚫ Failure of a controller will not disconnect the front-end link, and the host is unaware of the
controller failure.
⚫ The PCIe link between the FIM and the controller is disconnected, and the FIM detects the
controller failure.
⚫ Service switchover is performed between the controllers, and the FIM redistributes host
requests to other controllers.
⚫ The switchover time is about 1 second, which is much shorter than switchover performed by
multipathing software (10-30s).
In global cache mode, host data is directly written into linear space logs, and the logs directly copy
the host data to the memory of multiple controllers using RDMA based on a preset copy policy. The
global cache consists of two parts:
⚫ Global memory: memory of all controllers (four controllers in the figure). This is managed in a
unified memory address, and provides linear address space for the upper layer based on a
redundancy configuration policy.
⚫ WAL: a log-type write cache (write-ahead log) for newly written data
The global pool uses RAID 2.0+, full-stripe writes for new data, and RAID groups shared between
multiple stripes.
Another feature is back-end sharing, which includes sharing of back-end interface modules within an
enclosure and cross-controller enclosure sharing of back-end disk enclosures.


Active-Active Architecture with Full Load Balancing:


⚫ Even distribution of LUNs without fixed controller ownership
➢ Data on LUNs is divided into 64 MB slices. The slices are distributed to different virtual
nodes based on the hash of (LUN ID + LBA), as illustrated in the sketch after this list.
⚫ Front-end load balancing
➢ UltraPath selects appropriate physical links to send each slice to the corresponding virtual
node.
➢ The front-end interconnect I/O modules forward the slices to the corresponding virtual
nodes.
➢ Front-end: If there is no UltraPath or FIM, the controllers forward I/Os to the
corresponding virtual nodes.
⚫ Global write cache load balancing
➢ The data volume is balanced.
➢ Data hotspots are balanced.
⚫ Global storage pool load balancing
➢ Usage of disks is balanced.
➢ The wear degree and lifecycle of disks are balanced.
➢ Data is evenly distributed.
➢ Hotspot data is balanced.
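The following short Python sketch illustrates how a (LUN ID, LBA) pair can be hashed to a 64 MB slice and mapped to a virtual node so that consecutive slices spread across controllers. The hash function, slice size constant, and node count are illustrative assumptions, not the storage system's internal DHT implementation.

```python
import hashlib

SLICE_SIZE = 64 * 1024 * 1024      # 64 MB slices, as described above
VIRTUAL_NODES = 32                 # assumed number of virtual nodes

def virtual_node_for(lun_id: int, lba: int) -> int:
    """Map a logical block address to a virtual node via its slice."""
    slice_index = lba // SLICE_SIZE
    key = f"{lun_id}:{slice_index}".encode()
    digest = hashlib.md5(key).hexdigest()          # stand-in for the DHT hash
    return int(digest, 16) % VIRTUAL_NODES

# Consecutive slices of one LUN land on different virtual nodes,
# so the load is spread across all controllers.
for lba in range(0, 5 * SLICE_SIZE, SLICE_SIZE):
    print(lba, "->", virtual_node_for(1001, lba))
```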
Huawei Storage Cache Mirroring Technology:
⚫ Three cache copies
➢ The system supports two or three copies of the write cache.
➢ Three-copy requires an extra license.
➢ Only mission-critical storage systems support three copies.
⚫ Three copies tolerate simultaneous failure of two controllers.
➢ Failure of two controllers does not cause data loss or service interruption.
⚫ Three copies tolerate failure of one controller enclosure.
➢ With three copies, data is mirrored in a controller enclosure and across controller
enclosures.
➢ Failure of a controller enclosure does not cause data loss or service interruption.
Key reliability technologies of Huawei storage products:
⚫ Continuous mirroring
➢ Dorado V6's mission-critical storage systems support continuous mirroring. In the event of
a controller failure, the system automatically selects new controllers for mirroring.
➢ Continuous mirroring includes all devices in back-end full interconnection.
⚫ Back-end full interconnection
➢ Controllers are directly connected to disk enclosures.
➢ Dorado V6's mission-critical storage systems support back-end full interconnection.
➢ BIMs + two groups of uplink ports on the disk enclosures achieve full interconnection of
the disk enclosures to eight controllers.
⚫ Continuous mirroring and back-end full interconnection allow the system to tolerate failure of
seven out of eight controllers.


Host service switchover when a single controller is faulty: When FIMs are used, failure of a
controller will not disconnect front-end ports from hosts, and the hosts are unaware of the
controller failure, ensuring high availability. When a controller fails, the FIM port chip detects
that the PCIe link between the FIM and the controller is disconnected. Then service switchover
is performed between the controllers, and the FIM redistributes host I/Os to other controllers.
This process is completed within seconds and does not affect host services. In comparison,
when non-shared interface modules are used, a link switchover must be performed by the
host's multipathing software in the event of a controller failure, which takes a longer time (10 to
30 seconds) and reduces reliability.

2.4 Storage Network Architecture


2.4.1 DAS
Direct-attached storage (DAS) connects one or more storage devices to servers. These storage
devices provide block-level data access for the servers. Based on the locations of storage devices and
servers, DAS is classified into internal DAS and external DAS. SCSI cables are used to connect hosts
and storage devices.
DAS is applicable to small- and medium-sized LANs that have general storage capacity requirements
and only a few servers. The advantages of DAS are easy deployment and low initial investment.
JBOD, short for Just a Bunch Of Disks, logically connects several physical disks in series to increase
capacity but does not provide data protection. JBOD can resolve the insufficient capacity expansion
issue caused by limited disk slots of internal storage. However, it offers no redundancy, resulting in
poor reliability.
For a smart disk array, the controller provides RAID and large-capacity cache, enables the disk array
to have multiple functions, and is equipped with dedicated management software.

2.4.2 NAS
Enterprises need to store a large amount of data and share the data through a network. Therefore,
network-attached storage (NAS) is a good choice. NAS connects storage devices to the live network
and provides data and file services.
For a server or host, NAS is an external device and can be flexibly deployed through the network. In
addition, NAS provides file-level sharing rather than block-level sharing, which makes it easier for
clients to access NAS over the network. UNIX and Microsoft Windows users can seamlessly share
data through NAS or File Transfer Protocol (FTP). When NAS sharing is used, UNIX uses NFS and
Windows uses CIFS.
NAS has the following characteristics:
⚫ NAS provides storage resources through file-level data access and sharing, enabling users to
quickly share files with minimum storage management costs.
⚫ NAS is a preferred file sharing storage solution that does not require multiple file servers.
⚫ NAS also helps eliminate bottlenecks in user access to general-purpose servers.
⚫ NAS uses network and file sharing protocols for archiving and storage. These protocols include
TCP/IP for data transmission as well as CIFS and NFS for providing remote file services.
A general-purpose server can be used to carry any application and run a general-purpose operating
system. Unlike general-purpose servers, NAS is dedicated to file services and provides file sharing
services for other operating systems using open standard protocols. NAS devices are optimized


based on general-purpose servers in aspects such as file service functions, storage, and retrieval. To
improve the high availability of NAS devices, some NAS vendors also support the NAS clustering
function.
The components of a NAS device are as follows:
⚫ NAS engine (CPU and memory)
⚫ One or more NICs that provide network connections, for example, GE NIC and 10GE NIC.
⚫ An optimized operating system for NAS function management
⚫ NFS and CIFS protocols
⚫ Disk resources that use industry-standard storage protocols, such as ATA, SCSI, and Fibre
Channel
NAS protocols include NFS, CIFS, FTP, HTTP, and NDMP.
⚫ NFS is a traditional file sharing protocol in the UNIX environment. It is a stateless protocol. If a
fault occurs, NFS connections can be automatically recovered.
⚫ CIFS is a traditional file sharing protocol in the Microsoft environment. It is a stateful protocol
based on the Server Message Block (SMB) protocol. If a fault occurs, CIFS connections cannot be
automatically recovered. CIFS is integrated into the operating system and does not require
additional software. Moreover, CIFS sends only a small amount of redundant information, so it
has higher transmission efficiency than NFS.
⚫ FTP is one of the protocols in the TCP/IP protocol suite. It consists of two parts: FTP server and
FTP client. The FTP server is used to store files. Users can use the FTP client to access resources
on the FTP server through FTP.
⚫ Hypertext Transfer Protocol (HTTP) is an application-layer protocol used to transfer hypermedia
documents (such as HTML). It is designed for communication between a Web browser and a
Web server, but can also be used for other purposes.
⚫ Network Data Management Protocol (NDMP) provides an open standard for NAS network
backup. NDMP enables data to be directly written to tapes without being backed up by backup
servers, improving the speed and efficiency of NAS data protection.
Working principles of NFS: Like other file sharing protocols, NFS uses the client/server architecture.
However, NFS itself provides only basic file processing functions; data transmission over TCP/IP is
implemented by the Remote Procedure Call (RPC) protocol. NFS file systems are completely transparent to clients.
Accessing files or directories in an NFS file system is the same as accessing local files or directories.
One program can use RPC to request a service from a program located in another computer over a
network without having to understand the underlying network protocols. RPC assumes the existence
of a transmission protocol such as Transmission Control Protocol (TCP) or User Datagram Protocol
(UDP) to carry the message data between communicating programs. In the OSI network
communication model, RPC traverses the transport layer and application layer. RPC simplifies
development of applications.
RPC works based on the client/server model. The requester is a client, and the service provider is a
server. The client sends a call request with parameters to the RPC server and waits for a response.
On the server side, the process remains in a sleep state until the call request arrives. Upon receipt of
the call request, the server obtains the process parameters, outputs the calculation results, and
sends the response to the client. Then, the server waits for the next call request. The client receives
the response and obtains call results.
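As a hedged illustration of this call pattern, the Python sketch below uses the standard library's XML-RPC modules to show a remote procedure that looks like a local call to the client. Note that NFS actually relies on ONC (Sun) RPC rather than XML-RPC; the server address, port, and procedure name here are assumptions for demonstration only.

```python
# A generic RPC round trip using Python's standard library (xmlrpc).
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def read_file(path: str) -> str:
    """Server-side procedure: pretend to read a file and return its content."""
    return f"contents of {path}"

server = SimpleXMLRPCServer(("127.0.0.1", 8000), logRequests=False)
server.register_function(read_file)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client sends a call request with parameters and waits for the response.
client = ServerProxy("http://127.0.0.1:8000")
print(client.read_file("/export/data.txt"))   # looks like a local call
server.shutdown()
```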
One of the typical applications of NFS is using the NFS server as internal shared storage in cloud
computing. The NFS client is optimized based on cloud computing to provide better performance


and reliability. Cloud virtualization software (such as VMware) optimizes the NFS client, so that the
VM storage space can be created on the shared space of the NFS server.
Working principles of CIFS: CIFS runs on top of TCP/IP and allows Windows computers to access files
on UNIX computers over a network.
The CIFS protocol applies to file sharing. Two typical application scenarios are as follows:
⚫ File sharing service
➢ CIFS is commonly used in file sharing service scenarios such as enterprise file sharing.
⚫ Hyper-V VM application scenario
➢ SMB can be used to share images of Hyper-V virtual machines promoted by Microsoft. In
this scenario, the failover feature of SMB 3.0 is required to ensure service continuity upon
a node failure and to ensure the reliability of VMs.

2.4.3 SAN
2.4.3.1 IP SAN Technologies
NIC + Initiator software: Host devices such as servers and workstations use standard NICs to connect
to Ethernet switches. iSCSI storage devices are also connected to the Ethernet switches or to the
NICs of the hosts. The initiator software installed on hosts virtualizes NICs into iSCSI cards. The iSCSI
cards are used to receive and transmit iSCSI data packets, implementing iSCSI and TCP/IP
transmission between the hosts and iSCSI devices. This mode uses standard Ethernet NICs and
switches, eliminating the need for adding other adapters. Therefore, this mode is the most cost-
effective. However, the mode occupies host resources when converting iSCSI packets into TCP/IP
packets, increasing host operation overheads and degrading system performance. The NIC + initiator
software mode is applicable to scenarios with relatively low requirements on I/O and bandwidth
performance for data access.
TOE NIC + initiator software: The TOE NIC processes the functions of the TCP/IP protocol layer, and
the host processes the functions of the iSCSI protocol layer. Therefore, the TOE NIC significantly
improves the data transmission rate. Compared with the pure software mode, this mode reduces
host operation overheads and requires minimal network construction expenditure. This is a trade-off
solution.
iSCSI HBA:
⚫ An iSCSI HBA is installed on the host to implement efficient data exchange between the host
and the switch and between the host and the storage device. Functions of the iSCSI protocol
layer and TCP/IP protocol stack are handled by the host HBA, occupying the least CPU resources.
This mode delivers the best data transmission performance but requires high expenditure.
⚫ The iSCSI communication system inherits part of SCSI's features. The iSCSI communication
involves an initiator that sends I/O requests and a target that responds to the I/O requests and
executes I/O operations. After a connection is set up between the initiator and target, the target
controls the entire process as the primary device. The target includes the iSCSI disk array and
iSCSI tape library.
⚫ The iSCSI protocol defines a set of naming and addressing methods for iSCSI initiators and
targets. All iSCSI nodes are identified by their iSCSI names. In this way, iSCSI names are
distinguished from host names.
⚫ iSCSI uses iSCSI Qualified Name (IQN) to identify initiators and targets. Addresses change with
the relocation of initiator or target devices, but their names remain unchanged. When setting
up a connection, an initiator sends a request. After the target receives the request, it checks
whether the iSCSI name contained in the request is consistent with that bound with the target.
If the iSCSI names are consistent, the connection is set up. Each iSCSI node has a unique iSCSI


name. One iSCSI name can be used in connections from one initiator to multiple targets, and
multiple iSCSI names can be used in connections from one target to multiple initiators, as illustrated in the naming sketch below.
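As a small illustration of the naming scheme, the sketch below builds names in the commonly used iqn.yyyy-mm.<reversed-domain>:<identifier> format and performs the simple name check a target applies before setting up a connection. The specific names and the helper function are fictitious examples, not values from a real storage system.

```python
# Illustrative IQN handling (names are fictitious examples).
def make_iqn(year_month: str, reversed_domain: str, identifier: str) -> str:
    return f"iqn.{year_month}.{reversed_domain}:{identifier}"

initiator_iqn = make_iqn("2021-05", "com.example.host", "app-server-01")
target_iqn    = make_iqn("2021-05", "com.example.storage", "ctrl-a.lun-group1")

def accept_login(request_target_iqn: str, bound_target_iqn: str) -> bool:
    # The target only sets up the connection if the requested name
    # matches the name bound to it.
    return request_target_iqn == bound_target_iqn

print(initiator_iqn, "->", target_iqn, accept_login(target_iqn, target_iqn))
```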
Logical ports are created based on bond ports, VLAN ports, or Ethernet ports. Logical ports are
virtual ports that carry host services. A unique IP address is allocated to each logical port for carrying
its services.
⚫ Bond port: To improve reliability of paths for accessing file systems and increase bandwidth, you
can bond multiple Ethernet ports on the same interface module to form a bond port.
⚫ VLAN: VLANs logically divide the physical Ethernet ports or bond ports of a storage system into
multiple broadcast domains. On a VLAN, when service data is being sent or received, a VLAN ID
is configured for the data so that the networks and services of VLANs are isolated, further
ensuring service data security and reliability.
⚫ Ethernet port: Physical Ethernet ports on an interface module of a storage system. Bond ports,
VLANs, and logical ports are created based on Ethernet ports.
IP address failover: A logical IP address fails over from a faulty port to an available port. In this
way, services are switched from the faulty port to the available port without interruption. The
faulty port takes the services back after it recovers. This task can be completed automatically
or manually. IP address failover applies to both IP SAN and NAS.
During the IP address failover, services are switched from the faulty port to an available port,
ensuring service continuity and improving the reliability of paths for accessing file systems. Users are
not aware of this process.
The essence of IP address failover is a service switchover between ports. The ports can be Ethernet
ports, bond ports, or VLAN ports. A simplified sketch of the common failover logic follows the examples below.
⚫ Ethernet port–based IP address failover: To improve the reliability of paths for accessing file
systems, you can create logical ports based on Ethernet ports.

Figure 2-9
➢ Host services are running on logical port A of Ethernet port A. The corresponding IP
address is "a". Ethernet port A fails and thereby cannot provide services. After IP address
failover is enabled, the storage system will automatically locate available Ethernet port B,
delete the configuration of logical port A that corresponds to Ethernet port A, and create
and configure logical port A on Ethernet port B. In this way, host services are quickly
switched to logical port A on Ethernet port B. The service switchover is executed quickly.
Users are not aware of this process.
⚫ Bond port-based IP address failover: To improve the reliability of paths for accessing file
systems, you can bond multiple Ethernet ports to form a bond port. When an Ethernet port that


is used to create the bond port fails, services are still running on the bond port. The IP address
fails over only when all Ethernet ports that are used to create the bond port fail.

Figure 2-10
➢ Multiple Ethernet ports are bonded to form bond port A. Logical port A created based on
bond port A can provide high-speed data transmission. When both Ethernet ports A and B
fail due to various causes, the storage system will automatically locate bond port B, delete
logical port A, and create the same logical port A on bond port B. In this way, services are
switched from bond port A to bond port B. After Ethernet ports A and B recover, services
will be switched back to bond port A if failback is enabled. The service switchover is
executed quickly, and users are not aware of this process.
⚫ VLAN-based IP address failover: You can create VLANs to isolate different services.
➢ To implement VLAN-based IP address failover, you must create VLANs, allocate a unique ID
to each VLAN, and use the VLANs to isolate different services. When an Ethernet port on a
VLAN fails, the storage system will automatically locate an available Ethernet port with the
same VLAN ID and switch services to the available Ethernet port. After the faulty port
recovers, it takes over the services.
➢ VLAN names, such as VLAN A and VLAN B, are automatically generated when VLANs are
created. The actual VLAN names depend on the storage system version.
➢ Ethernet ports and their corresponding switch ports are divided into multiple VLANs, and
different IDs are allocated to the VLANs. The VLANs are used to isolate different services.
VLAN A is created on Ethernet port A, and the VLAN ID is 1. Logical port A that is created
based on VLAN A can be used to isolate services. When Ethernet port A fails due to various
causes, the storage system will automatically locate VLAN B and the port whose VLAN ID is
1, delete logical port A, and create the same logical port A based on VLAN B. In this way,
the port where services are running is switched to VLAN B. After Ethernet port A recovers,
the port where services are running will be switched back to VLAN A if failback is enabled.
➢ An Ethernet port can belong to multiple VLANs. When the Ethernet port fails, all VLANs will
fail. Services must be switched to ports of other available VLANs. The service switchover is
executed quickly, and users are not aware of this process.
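All three variants share the same control flow: when the port carrying a logical port fails, the system finds another available port of the matching type (same bond or same VLAN ID where applicable), recreates the logical port there, and optionally fails back after recovery. The Python sketch below models this flow; the port names, IP address, and selection policy are assumptions for illustration.

```python
# Simplified IP address failover logic (illustrative assumptions only).
ports = {
    "ethA": {"up": False, "vlan": 1},
    "ethB": {"up": True,  "vlan": 1},
    "ethC": {"up": True,  "vlan": 2},
}
logical_port = {"name": "logA", "ip": "192.168.10.8", "home": "ethA", "current": "ethA"}

def failover(lport, ports):
    current = lport["current"]
    if ports[current]["up"]:
        return lport                      # nothing to do
    vlan = ports[lport["home"]]["vlan"]
    for name, p in ports.items():         # find an available port with the same VLAN ID
        if p["up"] and p["vlan"] == vlan:
            lport["current"] = name       # delete and recreate the logical port there
            print(f"{lport['name']} ({lport['ip']}) moved from {current} to {name}")
            return lport
    raise RuntimeError("no available port: services interrupted")

def failback(lport, ports):
    home = lport["home"]
    if ports[home]["up"] and lport["current"] != home:
        lport["current"] = home           # original port takes the services back
        print(f"{lport['name']} failed back to {home}")

failover(logical_port, ports)             # ethA is down -> move to ethB
ports["ethA"]["up"] = True
failback(logical_port, ports)
```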
2.4.3.2 FC SAN Technologies
FC HBA: The FC HBA converts SCSI packets into Fibre Channel packets, which does not occupy host
resources.
Here are some key concepts in Fibre Channel networking:
⚫ Fibre Channel Routing (FCR) provides connectivity to devices in different fabrics without
merging the fabrics. Different from E_Port cascading of common switches, after switches are


connected through an FCR switch, the two fabric networks are not converged and are still two
independent fabrics. The link switch between two fabrics functions as a router.
⚫ FC Router: a switch running the FC-FC routing service.
⚫ EX_Port: a type of port that functions like an E_Port, but does not propagate fabric services or
routing topology information from one fabric to another.
⚫ Backbone fabric: fabric of a switch running the Fibre Channel router service.
⚫ Edge fabric: fabric that connects a Fibre Channel router.
⚫ Inter-fabric link (IFL): the link between an E_Port and an EX_Port, or between a VE_Port and a VEX_Port.
Another important concept is zoning. A zone is a set of ports or devices that communicate with each
other. A zone member can only access other members of the same zone. A device can reside in
multiple zones. You can configure basic zones to control the access permission of each device or
port. Moreover, you can set traffic isolation zones. When there are multiple ISLs (E_Ports), an ISL
only transmits the traffic destined for ports that reside in the same traffic isolation zone.
2.4.3.3 Comparison Between IP SAN and FC SAN
First, let's look back on the concept of SAN.
⚫ Protocol: Fibre Channel/iSCSI. The SAN architectures that use the two protocols are FC SAN and
IP SAN.
⚫ Raw device access: suitable for traditional database access.
⚫ Dependence on the application host to provide file access. Share access requires the support of
cluster software, which causes high overheads in processing access conflicts, resulting in poor
performance. In addition, it is difficult to support sharing in heterogeneous environments.
⚫ High performance, high bandwidth, and low latency, but high cost and poor scalability
Then, let's compare FC SAN and IP SAN.
⚫ To solve the poor scalability issue of DAS, storage devices can be networked using FC SAN to
support connection to more than 100 servers.
⚫ IP SAN is designed to address the management and cost challenges of FC SAN. IP SAN requires
only a few hardware configurations and the hardware is widely used. Therefore, the cost of IP
SAN is much lower than that of FC SAN. Most hosts have been configured with appropriate NICs
and switches, which are also suitable (although not perfect) for iSCSI transmission. High-
performance IP SAN requires dedicated iSCSI HBAs and high-end switches.
2.4.3.4 Comparison Between DAS, NAS, and SAN
⚫ Protocol: NAS uses TCP/IP. SAN uses Fibre Channel (FC SAN) or iSCSI over TCP/IP (IP SAN). DAS can use SCSI, FC, or ATA.
⚫ Transmission object: DAS and SAN transmit data blocks, whereas NAS transmits files.
⚫ SAN offers better disaster recovery capabilities and has dedicated DR solutions.

2.4.4 Distributed Architecture


A scale-out storage system organizes local HDDs and SSDs of general-purpose servers into large-scale
storage resource pools, and then distributes data to multiple data storage servers.
10GE, 25GE, and IB networks are generally used as the backend networks of scale-out storage. The
frontend network is usually a GE, 10GE, or 25GE network.
The network planes and their functions are described as follows:
⚫ Management plane: interconnects with the customer's management network for system
management and maintenance.


⚫ BMC plane: connects to Mgmt ports of management or storage nodes to enable remote device
management.
⚫ Storage plane: an internal plane, used for service data communication among all nodes in the
storage system.
⚫ Service plane: interconnects with customer applications and accesses storage devices through
standard protocols such as iSCSI and HDFS.
⚫ Replication plane: enables data synchronization and replication among replication nodes.
⚫ Arbitration plane: communicates with the HyperMetro quorum server. This plane is planned
only when the HyperMetro function is planned for the block service.
The key software components and their functions are described as follows:
⚫ FSM: a management process of Huawei scale-out storage that provides operation and
maintenance (O&M) functions, such as alarm management, monitoring, log management, and
configuration. It is recommended that this module be deployed on two nodes in active/standby
mode.
⚫ Virtual Block Service (VBS): a process that provides the scale-out storage access point service
through SCSI or iSCSI interfaces and enables application servers to access scale-out storage
resources
⚫ Object Storage Device (OSD): a component of Huawei scale-out storage for storing user data in
distributed clusters.
⚫ REP: data replication network
⚫ Enterprise Data Service (EDS): a component that processes I/O services sent from VBS.


3 Huawei Intelligent Storage Products and


Features

3.1 Huawei Intelligent Storage Products


3.1.1 Panorama
Huawei OceanStor scale-out storage provides diversified storage services for virtualization/cloud
resource pool, big data analysis, HPC, video, and content storage, backup, and archiving applications,
helping enterprises unleash the value of mass data.
The next-generation OceanStor Pacific intelligent scale-out mass storage provides multiple forms of
products to meet various service requirements of HPC, AI applications, virtualization/cloud resource
pools, databases, big data analysis, and mass data backup and archiving.
Huawei next-generation OceanStor 18510 and 18810 high-end hybrid flash storage systems oriented
to new data centers deliver 99.99999% reliability, 100% higher performance than peer vendors, and
industry-leading convergence features. They provide efficient and reliable data services for core
services and facilitate evolution to new data centers.
Huawei launches hyper-converged infrastructure (HCI) products for data centers. Based on the
convergence of compute, storage, and network resources, HCI products implement on-demand
loading of management, backup, and DR functions using software and provide integrated delivery
through pre-integration. With intelligent management software and intelligent algorithms that
match various industry scenarios, these products cover all scenarios from data centers to enterprise
branches and feature flexible architecture, high performance, security, reliability, simplicity, and
efficiency.

3.1.2 All-Flash Storage


3.1.2.1 All-Flash Products
In the first decade of the 21st century, the explosive growth of Internet data saw distributed
architecture become the norm for data storage, because it is an effective means to cope with the
mass data.
In the 2010's, many enterprises took their first steps toward digital transformation, which cemented
mass data storage's place in the enterprise market.
The OceanStor Pacific series includes performance- and capacity-oriented products which support
tiered storage to meet application requirements in different scenarios.
OceanStor Pacific 9950 is a typical performance-oriented model. It provides customers with ultimate
all-flash performance of mass data storage. A single device delivers 160 GB/s bandwidth
performance.
OceanStor Pacific 9550 is a typical capacity-oriented device. A single device provides 120 disk slots.
Using mainstream 14 TB disks, a single device can provide 1.68 PB storage capacity.


The OceanStor Pacific series supports mainstream front-end ports for mass storage. It supports
10GE, 25GE, 40GE, 100GE, HDR-100, and EDR IB. It supports TCP and IP standard protocols as well as
RDMA and RoCE.
3.1.2.2 Product Features
Highlights of the next-generation Huawei all-flash storage:
Excellent performance: 21 million IOPS and 0.05 ms latency based on the SPC-1 model, and 30%
higher NAS performance than the industry average
High reliability: tolerance of failures of seven out of eight controllers, and the only active-active
architecture for SAN and NAS integration in the industry, ensuring service continuity
High efficiency: intelligent O&M and device-cloud synergy, promoting intelligent and simplified
storage resource usage
The SmartMatrix 3.0 full-mesh architecture leverages a high-speed, fully interconnected passive
backplane to connect to multiple controllers. Interface modules are shared by all controllers over the
backplane, allowing hosts to access any controller via any port. The SmartMatrix full-mesh
architecture allows close coordination between controllers, simplifies software models, and achieves
active-active fine-grained balancing, high efficiency, and low latency.
FlashLink® provides high IOPS concurrency with low latency ensured. FlashLink® employs a series of
optimizations for flash media. It associates controller CPUs with SSD CPUs to coordinate SSD
algorithms between these CPUs, thereby achieving high system performance and reliability.
An SSD uses flash memory (NAND Flash) to store data persistently. Compared with a traditional
HDD, an SSD features high speed, low energy consumption, low latency, small size, light weight, and
shockproof capability.
High performance
⚫ All SSD design for high IOPS and low latency
⚫ FlashLink supported, providing intelligent multi-core, efficient RAID, hot and cold data
separation, and low latency guarantee
High reliability
⚫ Component failure protection implemented through the redundancy design and active-active
working mode; SmartMatrix 3.0 front- and back-end full-mesh architecture for controller
enclosures, ensuring high efficiency and low latency
⚫ Component redundancy design, power failure protection, and coffer disks
⚫ Advanced data protection technologies, including HyperSnap, HyperReplication, HyperClone,
and HyperMetro
⚫ RAID 2.0+ underlying virtualization
High availability
⚫ Online replacement of components, including controllers, power modules, interface modules,
and disks
⚫ Disk roaming, providing automatic identification of disks with slots changed and automatic
restoration of original services
⚫ Centralized management of resources in third-party storage systems
3.1.2.3 Product Form
Product form:
OceanStor Dorado V6 uses the next-generation innovative hardware platform and adopts the high-
density design for controllers and disk enclosures.


⚫ High-end storage models use 4 U independent controller enclosures. Each enclosure supports a
maximum of 28 interface modules. Mirroring in controller enclosures and interconnection
across enclosures are implemented through 100 Gbit/s RDMA networks.
⚫ Mid-range storage models use 2 U controller enclosures. Each enclosure has integrated disks
and supports a maximum of 12 interface modules. The controllers are mirrored through 100
Gbit/s RDMA networks within each enclosure. Two controller enclosures are interconnected
through 25 Gbit/s RDMA networks.
⚫ The entry-level storage model uses 2 U controller enclosures. Each enclosure has integrated
disks and supports a maximum of 6 interface modules. The controllers are mirrored through 40
Gbit/s RDMA networks within each enclosure. Two controller enclosures are interconnected
through 25 Gbit/s RDMA networks.
⚫ A smart disk enclosure provides 25 SAS SSD slots (standard) and a high-density one supports 36
NVMe SSDs. A common disk enclosure supports 25 SAS SSDs (standard).
3.1.2.4 Innovative and Intelligent Hardware Accelerates Critical Paths (Ever Fast)
Huawei developed components for all key I/O paths to provide customers with ultimate
performance.
Intelligent multi-protocol interface module: hosts protocol parsing previously performed by
controller CPUs, which reduces controller CPU workloads and improves overall performance.
Huawei Kunpeng high-performance computing platform: delivers 25% higher computing power than
same-level Intel CPUs. In addition, the advantages in CPU and core quantities further improve
storage system performance. To be specific, 4 chips are deployed on a controller. Therefore, a 4 U 4-
controller device has 16 chips with 192 cores.
The Huawei-developed intelligent acceleration module is inserted into a controller as an interface
module to work with the internal intelligent cache algorithm. When I/O data flows from the upper
layer come, the intelligent acceleration module automatically captures the flows, learns and
analyzes their rules, predicts possible follow-up actions, and prefetches cache resources. In this way,
the read cache hit ratio is greatly improved, especially in batch read scenarios.
Intelligent SSD controller: The brain of SSDs. It accelerates read and write responses (single disk: 15%
faster than peer vendors) and reduces latency.
Intelligent management hardware: has a built-in storage fault library which contains 10 years of
accumulated fault data. The chip speeds up fault locating and offers corresponding solutions and
self-healing suggestions, improving fault locating accuracy from 60% to 93%.
With the preceding Huawei-developed hardware for transmission, computing, intelligence, storage,
and management, Huawei all-flash storage consistently outperforms peer vendors.
3.1.2.5 Software Architecture
The OceanStor OS is used to manage hardware and support storage service software.
⚫ The basic function control software provides basic data storage and read/write functions.
⚫ The value-added function control software provides advanced functions such as backup, DR,
and performance tuning.
⚫ The management function control software provides management functions for the storage
system.
The maintenance terminal software is used for system configuration and maintenance. You can use
software such as SmartKit and eService on the maintenance terminal to configure and maintain the
storage system.
Application server software includes OceanStor BCManager and UltraPath.


3.1.2.6 Intelligent Chips


The key technologies of FlashLink® include:
Intelligent multi-core technology
The storage system uses Huawei-developed CPUs. The number of CPUs and the number of CPU
cores in one controller are the largest in the industry. The intelligent multi-core technology allows
storage performance to increase linearly with the number of CPUs and cores.
Efficient RAID
The storage system uses the redirect-on-write (ROW) full-stripe write design, which writes all new
data to new blocks instead of overwriting existing blocks. This greatly reduces the overhead on
controller CPUs and read/write loads on SSDs in a write process, improving system performance in
various RAID levels.
Hot and cold data separation
The controller works with SSDs to identify hot and cold data in the system, improve garbage
collection efficiency, and reduce the program/erase (P/E) cycles on SSDs to prolong their service life.
Low latency guarantee
The storage system uses the latest generation of Huawei-developed SSDs and a faster protocol to
optimize I/O processing and maintain a low I/O latency.
Smart disk enclosure
Storage systems support the Huawei-developed smart disk enclosure. The smart disk enclosure is
equipped with CPU and memory resources, and can offload tasks, such as data reconstruction upon
a disk failure, from controllers to reduce the workload on the controllers and eliminate the impact of
such tasks on service performance.
Efficient time point technology
Storage systems implement data protection features using the distributed time point technology.
Read and write I/Os from user hosts carry the time point information to quickly locate metadata,
thereby improving access performance.
Global wear leveling and anti-wear leveling
Global wear leveling: If data is unevenly distributed to SSDs, certain SSDs may be used more
frequently and wear faster than others. As a result, they may fail much earlier than expected,
increasing the maintenance costs. Storage systems address this problem by using global wear
leveling that levels the wear degree among all SSDs, improving SSD reliability.
Global anti-wear leveling: When the wear degree of multiple SSDs is reaching the threshold, the
system preferentially writes data to specific SSDs. In this way, these SSDs wear faster than the
others. This prevents multiple SSDs from failing at a time.
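A minimal sketch of the two policies, assuming a simple fractional wear metric: normal writes go to the least-worn SSD (wear leveling), but once several SSDs approach the wear threshold, writes are steered to one of them so that the drives do not all reach end of life at the same time (anti-wear leveling). The threshold value and selection rules are illustrative assumptions.

```python
# Illustrative global wear leveling / anti-wear leveling policy.
ssd_wear = {"ssd0": 0.42, "ssd1": 0.61, "ssd2": 0.58, "ssd3": 0.60}  # fraction of rated P/E cycles used
WEAR_THRESHOLD = 0.90

def pick_ssd_for_write(wear: dict) -> str:
    near_limit = [d for d, w in wear.items() if w >= WEAR_THRESHOLD]
    if len(near_limit) >= 2:
        # Anti-wear leveling: concentrate writes on one drive so that the
        # worn drives reach end of life (and get replaced) one at a time.
        return max(near_limit, key=lambda d: wear[d])
    # Global wear leveling: spread writes to the least-worn drive.
    return min(wear, key=lambda d: wear[d])

print(pick_ssd_for_write(ssd_wear))       # -> ssd0 (least worn)
ssd_wear.update({"ssd1": 0.92, "ssd2": 0.93})
print(pick_ssd_for_write(ssd_wear))       # -> ssd2 (most worn of the near-limit drives)
```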

3.1.2.7 Typical Application Scenario – Mission-Critical Service Acceleration


Challenges:
With the rapid development of mobile Internet, effective value mining from mass transaction data
relies on efficient data collection, analysis, consolidation, and extraction to promote the
implementation of data-centric strategies. Existing IT systems are under increasing pressure. For
example, it takes several hours to process data and integrate data warehouses in the bill and
inventory systems of banks and large enterprises. As a result, services such as operation analysis and
service query cannot be obtained in a timely manner or the query speed is slow, severely affecting
the efficiency.
Solution:


Huawei's high-performance all-flash solution resolves these problems. High-end all-flash storage
systems are used to carry multiple core applications (such as transaction system database services).
The processing time is reduced by more than half, the response latency is shortened, and service
efficiency is improved several-fold.

3.1.3 Hybrid Flash Storage


3.1.3.1 Hybrid Flash Storage
In the first decade of the 21st century, the explosive growth of Internet data saw distributed
architecture become the norm for data storage, because it is an effective means to cope with the
mass data.
In the 2010's, many enterprises took their first steps toward digital transformation, which cemented
mass data storage's place in the enterprise market.
The OceanStor Pacific series includes performance- and capacity-oriented products which support
tiered storage to meet application requirements in different scenarios.
OceanStor Pacific 9950 is a typical performance-oriented device. It provides customers with ultimate
mass data all-flash storage performance. The bandwidth performance of a single device reaches 160
GB/s.
OceanStor Pacific 9550 is a typical capacity-oriented device. A single device provides 120 disk slots.
Using mainstream 14 TB disks, a single device can provide 1.68 PB storage capacity.
The OceanStor Pacific series supports mainstream front-end ports for mass storage. It supports
10GE, 25GE, 40GE, 100GE, HDR-100 and EDR IB. It supports TCP and IP standard protocols as well as
RDMA and RoCE.
New-Gen OceanStor Hybrid Flash Storage
Converged storage:
Convergence of SAN and NAS storage technologies
Support for storage network protocols such as iSCSI, FC, NFS, CIFS, and FTP
High performance:
High-performance processor, high-speed and large-capacity cache, and various high-speed interface
modules, providing excellent storage performance
Support for SSD acceleration, greatly improving storage performance
High scalability:
Support for various disk types
Support for various interface modules
Support for technologies such as scale-out
High reliability:
SmartMatrix full-mesh architecture, redundancy design for all components, active-active working
mode, and RAID 2.0+
Multiple data protection technologies, such as power failure protection, data pre-copy, coffer disk,
and bad sector repair
High availability:
Support for NDMP and multiple advanced data protection technologies, such as snapshot, LUN copy,
remote replication, clone, volume mirroring, and active-active
Intelligence and high efficiency:


Multiple control and management functions, such as SmartTier, SmartQoS, and SmartThin, providing
refined control and management
GUI-based operations and management through DeviceManager
Self-service intelligent O&M provided by eService
Product form:
A storage system consists of controller enclosures and disk enclosures. It provides an intelligent
storage platform that features robust reliability, high performance, and large capacity.
Different models of products are configured with different types of controller enclosures and disk
enclosures.
3.1.3.2 Converged SAN and NAS
Convergence of SAN and NAS storage technologies: One storage system supports both SAN and NAS
services at the same time and allows SAN and NAS services to share storage device resources. Hosts
can access any LUN or file system through the front-end port of any controller. During the entire
data lifecycle, hot data gradually becomes cold data. If cold data occupies cache or SSDs for a long
time, resources will be wasted and the long-term performance of the storage system will be
affected. The storage system uses the intelligent storage tiering technology to flexibly allocate data
storage media in the background.
The intelligent tiering technology requires a device with different media types. Data access is
monitored in real time. Data that has not been accessed for a long time is marked as cold data and
is gradually moved from high-performance media to low-speed media without slowing down service
responses. When cold data becomes active again, it can be quickly moved back to high-performance
media, ensuring stable system performance.
Manual or automatic migration policies are supported.
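The sketch below captures the idea of such a migration policy: blocks that have not been accessed for longer than a configurable interval are demoted from SSD to HDD, and blocks that become active again are promoted back. The interval, tier names, and data structures are assumptions for illustration and not the product's SmartTier implementation.

```python
import time

# Illustrative hot/cold tiering decision (not the actual SmartTier algorithm).
COLD_AFTER_SECONDS = 7 * 24 * 3600          # assumed "not accessed for a long time"

blocks = {
    "blk-1": {"tier": "ssd", "last_access": time.time() - 30 * 24 * 3600},
    "blk-2": {"tier": "ssd", "last_access": time.time() - 60},
    "blk-3": {"tier": "hdd", "last_access": time.time() - 10},   # recently re-activated
}

def rebalance(blocks, now=None):
    now = now or time.time()
    for name, b in blocks.items():
        idle = now - b["last_access"]
        if b["tier"] == "ssd" and idle > COLD_AFTER_SECONDS:
            b["tier"] = "hdd"                # demote cold data to low-speed media
            print(f"{name}: demoted to HDD")
        elif b["tier"] == "hdd" and idle < COLD_AFTER_SECONDS:
            b["tier"] = "ssd"                # promote re-activated data
            print(f"{name}: promoted to SSD")

rebalance(blocks)
```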
3.1.3.3 Support for Multiple Service Scenarios
Huawei OceanStor hybrid flash storage systems integrate SAN and NAS and support multiple storage
protocols. This improves the service scope for general-purpose, backup, and DR scenarios in
government, finance, carrier, and manufacturing fields, among others.

3.1.3.4 Application Scenario – Active-Active Data Centers


Load balancing among controllers
Key services: RPO = 0, RTO ≈ 0
Convergence of SAN and NAS: SAN and NAS active-active services can be deployed on the same
device. If a single controller is faulty, local switchover is supported.
Solutions that ensure uninterrupted service running for customers
The active-active solution can be widely used in industries such as healthcare, finance, and social
security.

3.1.4 Scale-out Storage


3.1.4.1 Scale-out Storage Series
In the first decade of the 21st century, the explosive growth of Internet data saw distributed
architecture become the norm for data storage, because it is an effective means to cope with the
mass data.
In the 2010's, many enterprises took their first steps toward digital transformation, which cemented
mass data storage's place in the enterprise market.


The OceanStor Pacific series includes performance- and capacity-oriented products which support
tiered storage to meet application requirements in different scenarios.
OceanStor Pacific 9950 is a typical performance-oriented model. It provides customers with ultimate
all-flash performance of mass data storage. A single device delivers 160 GB/s bandwidth
performance.
OceanStor Pacific 9550 is a typical capacity-oriented device. A single device provides 120 disk slots.
Using mainstream 14 TB disks, a single device can provide 1.68 PB storage capacity.
The OceanStor Pacific series supports mainstream front-end ports for mass storage. It supports
10GE, 25GE, 40GE, 100GE, HDR-100 and EDR IB. It supports TCP and IP standard protocols as well as
RDMA and RoCE.
3.1.4.2 Appearance of OceanStor Pacific 9950
OceanStor Pacific 9950 is a 5 U 8-node all-flash high-density device developed based on the Huawei
Kunpeng 920 processor, providing customers with ultimate performance of scale-out storage. It
features high performance, high density, ultimate TCO, and high reliability.
The device has eight independent pluggable nodes on the front panel. Each node supports 10 half-
palm NVMe SSDs, and the entire device supports 80 half-palm NVMe SSDs. Each node supports 32
GB battery-protected memory space to protect data in the event of a power failure.
The back panel of the device consists of three parts. The 2 data cluster modules in the upper part
provide each node with two 100GE RoCE dual-plane back-end storage network ports. The 16
interface modules in the middle are used for front-end network access. Each two interface modules
belong to a node. The 6 power supply slots in the lower part provide 2+2 power supply redundancy,
improving reliability.
OceanStor Pacific 9950 adopts a full FRU design. Nodes, BBUs, half-palm NVMe SSDs, fan modules,
data cluster modules, interface modules, and power modules of the device are independently
pluggable, facilitating maintenance and replacement.
Front view: contains the dust-proof cover, status display panel (operation indicator on the right:
indicates push-back or pull-forward; location indicator: steady amber for component replacement
including main storage disks, fans, BBUs, and half-palm NVMe cache SSDs), alarm indicator (front),
fans, power modules (two on the left and right, respectively), BBUs, disks, node control board (rear),
and overtemperature indicator.
Rear view: There are two storage node integration boards containing node 0 and node 1. Each node
is configured with a Kunpeng 920 processor and interface module slots of various standards. Huawei
uses a design of putting disks at the upper and nodes at the bottom instead of the traditional design
in which disks are at the front and nodes are at the rear with vertical backplanes. Each integration
control board manages 60 disks from the front to rear and connects disk enclosures to CPU
integration boards using flexible cables in tank chain mode.
Top view: The front and rear disk areas and the fan area in the middle are visible. The front and rear
disk areas contain 120 disk slots. The fan area in the middle contains fans in 5 x 2 stacked mode to
form double-layer air channels. To cope with the high air resistance of 120 disks, Huawei customizes
ten 80 mm aviation-standard counter-rotating fans to form a fan wall containing two air channels.
The upper channel is the main channel, which is used for heat dissipation of front-row disks. The
lower channel is used for heat dissipation of CPUs and rear-row disks. The fans draw air from the
front and blow air from the rear, perfectly solving the heat dissipation problem of high-density
devices. Compared with the mainstream heat dissipation technology in the industry, the reliability of
OceanStor Pacific 9550 components is improved by more than 100%.
The rightmost part shows finer division of the physical disk area. The entire disk area is divided into
eight sawtooth sub-areas. Each sub-area is in 7+8 layout mode, using expansion modules for
connection.


OceanStor Pacific 9550 adopts a full FRU design. Nodes, power supplies, BBUs, half-palm NVMe
cache SSDs, expansion modules, fan modules, and disk modules are all hot-swappable, facilitating
maintenance and replacement.
Fans are in N+1 redundant mode. There are five groups of fans. Each fan has two motors (rotors). If
one motor is faulty, the system runs properly. If the entire fan is faulty, replace it immediately.
3.1.4.3 Software System Architecture
NFS/CIFS: Standard file sharing protocol allows the system to provide file sharing in various OSs.
POSIX/MPI: It is compatible with standard MPI and POSIX semantics in scenarios where file sharing is
implemented using DPC and provides parallel interfaces and intelligent data caching algorithms to
enable upper-layer applications to access storage space more intelligently.
S3: processes Amazon S3 and NFS protocol messages and the object service logic.
HDFS: provides standard HDFS interfaces for external systems.
SCSI/iSCSI: provides volumes for OSs and databases by locally mapping the volumes using SCSI
standard drivers or by mapping volumes to application servers through multipathing software and
iSCSI interfaces.
Data redundancy management: performs erasure coding (EC) calculation to ensure high data reliability (see the parity sketch after this list).
Distributed data routing: evenly distributes data and metadata to storage nodes according to preset
rules.
Cluster status control: controls the distributed cluster status.
Strong-consistency replication protocol: ensures data consistency for HyperMetro pairs in the block
service.
Data reconstruction and balancing: reconstructs and balances data.
3.1.4.4 Product Features
The Huawei OceanStor Pacific series provides industry-leading hardware for different industries in
different application scenarios to meet diversified user requirements. It provides block, object, big
data, and file storage protocols and uses one architecture to meet different data storage
requirements. Oriented to different industries in different service scenarios, it provides leading
solutions and storage products with higher efficiency, reliability, and cost-effectiveness to help
customers better store and use mass data.
3.1.4.4.1 FlashLink - Multi-core Technology
Data flows from I/O cards to CPUs. The CPU algorithm used by the Huawei OceanStor Pacific series
has the following advantages:
CPUs use the latest intelligent partitioning algorithm, including the CPU grouping algorithm and the CPU core-based algorithm. The CPU grouping algorithm divides I/Os into different groups (for example, data read/write and data exchange) so that they run independently without mutual interference, ensuring read and write performance. I/Os with lower priorities are placed in the same group and share CPU resources, maximizing resource utilization.
The CPU core-based algorithm is Huawei's unique advantage. Without it, I/Os work on CPU cores in polling mode: an I/O runs on core 1 of a CPU for a period of time and is then moved to core 2. Switching from core 1 to core 2 takes time, and to ensure data consistency, core 1 must be locked during the switch so that the I/O runs only on core 2, which also takes time. If the I/O later needs to switch back to core 1, core 2 must be locked and core 1 unlocked. These processes are all time consuming.


The CPU core-based algorithm solves the CPU polling and locking/unlocking problems by keeping a read or write request on the same CPU core for its entire execution. This lock-free design ensures that an I/O always runs on the same core, saving the time otherwise spent on core switching, locking, and unlocking.
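The idea of keeping an I/O on one core can be illustrated with a short sketch. The following is a minimal, hypothetical Python model rather than Huawei's implementation: each worker stands in for a CPU core with its own queue, and every request is hashed to a fixed worker, so a request is always processed by the same worker and no cross-core locking is needed. The worker count and the hashing key are assumptions for illustration only.

```python
# Minimal sketch (not Huawei's implementation): route each I/O to a fixed
# per-core worker so it never migrates between cores and needs no locks.
import queue
import threading

NUM_CORES = 4  # assumed number of cores reserved for I/O processing

core_queues = [queue.Queue() for _ in range(NUM_CORES)]

def worker(core_id):
    """Each worker models one CPU core; only this worker touches its own state."""
    local_state = {}  # private to this core, so no locking is required
    while True:
        io = core_queues[core_id].get()
        if io is None:            # shutdown signal
            break
        lba, data = io
        local_state[lba] = data   # process the I/O entirely on this core
        core_queues[core_id].task_done()

def submit_io(lba, data):
    """Hash the address so the same LBA always lands on the same core."""
    core_id = hash(lba) % NUM_CORES
    core_queues[core_id].put((lba, data))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_CORES)]
for t in threads:
    t.start()

for lba in range(16):
    submit_io(lba, b"block-%d" % lba)

for q in core_queues:
    q.join()
for q in core_queues:
    q.put(None)   # stop the workers
for t in threads:
    t.join()
```

Because each worker owns its state exclusively, no locks are taken on the data path, which mirrors the lock-free, core-affine design described above.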
Data flows from CPUs to cache. Cache is important for storage. The cache algorithm is one of the key
factors that determine storage performance. The cache algorithm has the following highlights:
1 The binary tree algorithm of traditional storage is replaced by a hash table algorithm dedicated to flash memory. A binary tree occupies less space but searches slowly, whereas a hash table searches quickly but occupies more space. Multi-level hashing is therefore used to both save space and improve the search speed (see the sketch after this list).
2 Metadata and data cache resources are partitioned to improve the metadata hit ratio. Read
data and write data are partitioned to avoid resource preemption.
3 Hash indexes support hot and cold partitioning to improve the search speed.
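The trade-off in item 1 can be sketched as follows. This hypothetical Python example (partition count and field names are assumptions, not the actual cache code) uses a small first-level hash to select a partition and a second-level hash table inside the partition, so lookups stay constant-time while each individual table stays small.

```python
# Minimal sketch of a two-level (multi-level) hash index for cache metadata.

NUM_PARTITIONS = 16  # assumed first-level table size

class MultiLevelHashIndex:
    def __init__(self):
        # First-level table; each slot holds a smaller second-level hash table.
        self.partitions = [{} for _ in range(NUM_PARTITIONS)]

    def _partition(self, lba):
        return hash(lba) % NUM_PARTITIONS   # first-level hash

    def put(self, lba, cache_location):
        self.partitions[self._partition(lba)][lba] = cache_location

    def get(self, lba):
        # Two hash lookups instead of an O(log n) binary-tree walk.
        return self.partitions[self._partition(lba)].get(lba)

index = MultiLevelHashIndex()
index.put(0x1000, "cache-page-42")
print(index.get(0x1000))   # -> cache-page-42
print(index.get(0x2000))   # -> None (cache miss)
```

Hot and cold partitioning (item 3) could be modeled on top of this by keeping separate partition sets for frequently and rarely accessed entries.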
3.1.4.4.2 SmartInterworking
To solve problems such as long data analysis links, large data volumes, and long copy periods in HPC and big data analysis scenarios such as autonomous driving, the OceanStor Pacific series provides SmartInterworking to implement multi-protocol convergence and interworking of files, objects, and big data without semantics or performance loss. Data can be accessed through multiple semantics without format conversion. In this way, the data copy process is eliminated, data analysis efficiency is greatly improved, and redundant copies of analysis and publishing data in traditional production systems are avoided, saving storage space. Based on protocol interworking of files, objects, and big data, data can be accessed by multiple services at the same time without data migration, improving efficiency and maximizing the unique value of the data lake. Among products of the same type, Dell EMC products support unstructured protocol interworking, but with semantics loss and some native functions unavailable. Inspur and H3C products are developed based on open-source Ceph and usually support protocol interworking by adding gateways to a certain storage protocol; the extra logic conversion, however, causes considerable performance loss.
3.1.4.4.3 SmartIndexing
In mass data scenarios, users must use metadata to manage mass data. For example, users must
obtain the list of images or files whose name suffix is .jpeg, the list of files whose size is greater than
10 GB, and the list of files created before a specific date before managing them.
The OceanStor Pacific series provides the metadata indexing function (SmartIndexing) based on unstructured (file, object, and big data) storage pools. SmartIndexing can be enabled or disabled by
namespace (or file system). For storage systems where metadata indexing is enabled, after front-end
service I/Os change the metadata of files, objects, or directories, the changed metadata is
proactively pushed to the index system in asynchronous mode, which is different from periodic
metadata scanning used by traditional storage products. In this way, impact on production system
performance is prevented and requirements on quick search for mass data are met. For example,
the search result of hundreds of billions of metadata records can be returned within seconds. This
feature is widely applicable to services such as electronic check image and autonomous driving
training.
Application scenarios: Electronic check image and autonomous driving data preprocessing
Benefits: Quick search is supported in scenarios with multiple directory levels or massive numbers of files. Multi-dimensional metadata search is supported, offering more than twice the search dimensions of the find command.
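As a rough illustration of the kind of multi-dimensional metadata query SmartIndexing serves, the hypothetical Python sketch below keeps file metadata in an in-memory index (assumed here to be populated asynchronously by the service I/O path) and answers queries by suffix, size, and creation date. The field names and sample values are assumptions for illustration only.

```python
# Minimal sketch of multi-dimensional metadata search (not the SmartIndexing code).
from datetime import datetime

# Metadata records assumed to be pushed asynchronously by front-end service I/Os.
metadata_index = [
    {"name": "frame_0001.jpeg", "size": 4 * 1024**2,  "ctime": datetime(2021, 5, 1)},
    {"name": "lidar_run.bin",   "size": 12 * 1024**3, "ctime": datetime(2021, 6, 3)},
    {"name": "frame_0002.jpeg", "size": 5 * 1024**2,  "ctime": datetime(2022, 1, 9)},
]

def search(suffix=None, min_size=None, created_before=None):
    """Return records matching every given dimension (suffix, size, date)."""
    results = metadata_index
    if suffix is not None:
        results = [m for m in results if m["name"].endswith(suffix)]
    if min_size is not None:
        results = [m for m in results if m["size"] > min_size]
    if created_before is not None:
        results = [m for m in results if m["ctime"] < created_before]
    return results

# All .jpeg files, files larger than 10 GB, and files created before 2022:
print(search(suffix=".jpeg"))
print(search(min_size=10 * 1024**3))
print(search(created_before=datetime(2022, 1, 1)))
```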
3.1.4.4.4 Parallel File System
The process of HPC/HPDA and big data applications consists of multiple phases. Some phases require
large I/Os with high bandwidth, and some phases require small I/Os with high IOPS. Traditional HPC


storage performs well in only one performance model. As a result, different storage products are required
in different phases, and data needs to be copied between different phases, greatly reducing
efficiency.
The next-generation parallel file system of OceanStor Pacific achieves high bandwidth and IOPS
through innovations. One storage system meets the requirements of hybrid loads.
1 In terms of architecture, OceanStor Pacific distributes metadata to multiple nodes by directory. In this way, each directory's metadata has an owning node, and small I/Os can be forwarded to that node for direct processing, similar to centralized storage, eliminating distributed lock overheads and greatly reducing the read/write latency of small I/Os.
2 In terms of I/O flow, large I/Os are written to disks in pass-through mode, improving bandwidth.
Small I/Os are written to disks after aggregation through cache, ensuring low latency and
improving small file utilization.
3 In terms of data layout, OceanStor Pacific uses two-level indexes. The fixed-length large-
granularity primary indexes ensure sequential read/write performance of large I/Os. Sub-
indexes can be automatically adapted based on I/O sizes to avoid write penalty when small I/Os
are written.
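The large/small I/O handling in items 2 and 3 can be condensed into a simplified, hypothetical Python model (the thresholds and buffer sizes are assumptions): large writes go straight to the backing store, while small writes are accumulated in a cache buffer and flushed as one large sequential write.

```python
# Minimal sketch: large I/Os bypass the cache; small I/Os are aggregated
# and flushed as one large sequential write. Thresholds are assumptions.

LARGE_IO_THRESHOLD = 512 * 1024   # 512 KB: treat anything bigger as a large I/O
FLUSH_SIZE = 4 * 1024 * 1024      # flush the aggregation buffer at 4 MB

backing_store = []                # stands in for the disks
small_io_buffer = []
buffered_bytes = 0

def write(data: bytes):
    global buffered_bytes
    if len(data) >= LARGE_IO_THRESHOLD:
        backing_store.append(data)        # pass-through for high bandwidth
        return
    small_io_buffer.append(data)          # cache small I/Os for aggregation
    buffered_bytes += len(data)
    if buffered_bytes >= FLUSH_SIZE:
        flush()

def flush():
    """Write the aggregated small I/Os to disk as one sequential chunk."""
    global buffered_bytes
    if small_io_buffer:
        backing_store.append(b"".join(small_io_buffer))
        small_io_buffer.clear()
        buffered_bytes = 0

write(b"x" * (1024 * 1024))   # large I/O: written directly
for _ in range(1100):
    write(b"y" * 4096)        # small I/Os: aggregated and flushed in bulk
flush()
print(len(backing_store), "chunks on disk")
```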
3.1.4.4.5 Distributed Parallel Client (DPC)
DPC is short for the Huawei distributed parallel client.
Some HPC scenarios require high single-stream bandwidth and single-client bandwidth. In other scenarios, MPI-IO (multiple clients concurrently accessing the same file) may be used. When NFS is used, a single client can connect to only one storage node, TCP is mainly used for network access, and MPI-IO is not supported. Therefore, requirements in these scenarios cannot be met.
To address these issues, OceanStor Pacific provides the next-generation DPC, which is deployed on
compute nodes. A single compute client can connect to multiple storage nodes, eliminating
performance bottlenecks caused by storage node configurations and maximizing compute node
capabilities.
In addition, I/O-level load balancing in DPC access mode is better than that in NFS access mode, where load balancing is determined only during mounting. In this way, storage cluster performance is improved.
DPC supports POSIX and MPI-IO access modes. MPI applications can obtain better access
performance without modification.
DPC supports RDMA networks in IB and RoCE modes. It can directly perform EC on data on the client
to ensure lower data read/write latency.
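To illustrate in principle what "performing EC on the client" means, here is a hypothetical Python sketch (not the DPC protocol): the client splits a write into data strips for several storage nodes and computes a simple XOR parity strip, so redundancy is generated before the data leaves the compute node. Real erasure coding uses more general codes than XOR; the node count and strip size here are assumptions.

```python
# Minimal sketch of client-side erasure coding with XOR parity (2+1 layout).

STRIP_SIZE = 4   # bytes per strip, kept tiny for readability

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes):
    """Split data into two data strips and one XOR parity strip."""
    data = data.ljust(2 * STRIP_SIZE, b"\0")        # pad to a full stripe
    d0, d1 = data[:STRIP_SIZE], data[STRIP_SIZE:2 * STRIP_SIZE]
    parity = xor_bytes(d0, d1)
    return [d0, d1, parity]                         # one strip per storage node

def recover(strips):
    """Rebuild a lost data strip from the surviving strip and the parity."""
    d0, d1, parity = strips
    if d0 is None:
        d0 = xor_bytes(d1, parity)
    if d1 is None:
        d1 = xor_bytes(d0, parity)
    return d0 + d1

strips = encode(b"HPCdata!")
strips[1] = None                 # simulate losing the strip on node 1
print(recover(strips))           # -> b'HPCdata!'
```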
3.1.4.4.6 Solutions for Four Typical Scenarios
The OceanStor Pacific series provides solutions for four typical scenarios: HPC storage, virtualization
and cloud resource pool storage, object resource pool, and decoupled storage and compute in the
big data analysis scenario.

3.1.5 Hyper-Converged Storage


3.1.5.1 Hyper-Converged Series
Huawei FusionCube is IT infrastructure based on a hyper-converged architecture. Rather than an independent compute, network, or storage device, it is a single device that integrates compute, storage, and network resources, eliminating the need to purchase additional storage or network resources. It converges compute and storage resources, and preintegrates a distributed storage
engine, virtualization platform, and cloud management software. It supports on-demand resource
scheduling and linear expansion. It is mainly used in data center scenarios with hybrid loads, such as
database, desktop cloud, container, and virtualization.


Huawei FusionCube 1000 is an edge IT infrastructure solution that adopts the all-in-one design and is
delivered as an entire cabinet. It is mainly used in edge branches and vertical industry edge
application scenarios, such as gas stations, campuses, coal mines, and power grids. It can be deployed in
offices and remotely managed using FusionCube Center. With FusionCube Center Vision (FCV), this
solution offers integrated cabinets, service deployment, O&M management, and troubleshooting
services in a centralized manner. It greatly shortens the deployment duration and reduces the O&M
cost.
Hyper-converged infrastructure (HCI) is a set of devices consolidating not only compute, network,
storage, and server virtualization resources, but also elements such as backup software, snapshot
technology, data deduplication, and inline compression. Multiple sets of devices can be aggregated
by a network to achieve modular, seamless scale-out and form a unified resource pool.

3.1.6 Backup Storage


3.1.6.1 Backup Storage
Huawei backup storage provides excellent performance, efficient reduction, stability, and reliability,
to help users efficiently back up their data at a lower TCO.
Rapid Backup, Rapid Recovery
Huawei OceanProtect backup storage fully utilizes hardware and optimizes all I/O paths, including host access balancing, front-end and back-end networks, and CPUs, improving overall system performance, providing ultra-high bandwidth, and meeting high throughput and reduction ratio demands.
As of December 2021, OceanProtect provides 155 TB/hour backup bandwidth and 172 TB/hour
recovery bandwidth, 3x and 5x higher than the next-best player.
Efficient Reduction
OceanProtect’s backup and data services run premium algorithms to slice data flows and identify
data features, while inline variable-length deduplication, feature-based compression, and byte-level
compaction technologies achieve a superior 72:1 reduction ratio, 20% higher than peer vendors, to
slash customers’ system investments.
High Reliability
Lost or damaged data must be quickly recovered using backup data from the backup system.
Employing the A-A architecture, RAID-TP, and ransomware protection technologies, OceanProtect
ensures data reliability and service availability. If an interface module, a controller, or a disk fails,
services are not interrupted and backup applications are unaware of the failure, ensuring that
backup jobs and instant recovery jobs can be completed within each backup window and providing
99.9999% reliability.
3.1.6.2 Software Architecture
Infrastructure: provides infrastructure capabilities for the storage system, such as scheduling and
memory allocation.
Persistence layer: provides persistent storage capabilities and supports EC, multi-copy, and RAID.
StorageService: provides block storage and back-end deduplication and compression, and supports
snapshot copy and virtual clone.
Protocol: protocol layer of DPA, on which local disks can be mounted and accessed through the iSCSI
protocol or VBS.
BackupDataService: provides backup processing services and supports source deduplication.
BackupService: provides application backup services, including standard backup, advanced backup,
continuous backup, and continuous replication.


Management: the DPA management plane, which provides deployment, upgrade, capacity
expansion, monitoring, and multi-domain management operations.
The fair usage policy uses a preset threshold. When the storage usage of a user in a period reaches
the threshold, the system automatically executes the preset restriction policy to reduce the
bandwidth allocated to the user.
The fair usage policy is useful in certain scenarios. For example, if a carrier launches an unlimited
traffic package, this policy prevents subscribers from over-consuming network resources.
3.1.6.3 All-Scenario Data Protection for the Intelligent World
Huawei's investment in storage products not only focuses on hardware products, but also aims to
build E2E products and solutions for customers.
To cope with the rapid growth of diversified data, OceanProtect implements full DR of hot data, hot
backup of warm data, and warm archiving of cold data throughout the data lifecycle and in all
scenarios, ensuring uninterrupted services, zero data loss, and long-term information retention.
Huawei storage provides comprehensive DR solutions, including active/standby DR, active-active
storage, and 3DC, and implements unified, intelligent, and simplified DR scheduling and O&M with
OceanStor BCManager.
In terms of backup, Huawei provides centralized and all-in-one backup methods to meet varying demands in different scenarios.
For archiving, Huawei provides a solution for archiving data to the local storage system.
3.1.6.4 Active-active Architecture and RAID-TP Technology for System-level
Reliability of Services and Data.
The number of write cycles significantly affects SSD lifespans. Huawei uses its patented global wear leveling and anti-wear leveling technologies to counteract the effects of write cycles and extend SSD lifespans.
1 Huawei RAID 2.0+ evenly distributes data to SSDs based on fingerprints, to level SSD wear and
improve SSD reliability.
2 At the end of SSD lifecycles, Huawei uses global anti-wear leveling to increase the service
volume of one SSD, preventing simultaneous multi-SSD failures.
Huawei backup storage uses two technologies to reduce the risk of multi-SSD failures, while
extending SSD lifespans and improving system reliability.
3.1.6.5 Data Protection Appliance: Comprehensive Protection for User Data and
Applications
CDM is short for Converged Data Management.
1 In D2C scenarios, production data is directly backed up to the cloud by the backup software.
In D2D2C scenarios, backup copies are tiered and archived to the cloud.
2 The difference between logical backup and physical backup is as follows. In physical backup scenarios, data is replicated from the active node to the standby node in units of disk blocks, and data is backed up in units of sectors (512 bytes). By contrast, logical backup replicates data from the active node to the standby node in units of files (the size of the logical backup depends on the file size). In physical backup scenarios, the entire file does not need to be backed up when a small change occurs; instead, only the changed part is backed up, effectively improving the backup efficiency and shortening the backup window.


3.2 Storage System Operation Management


3.2.1 Storage Management Overview
DeviceManager is a piece of integrated storage management software developed by Huawei. It has
been loaded to storage systems before factory delivery. You can log in to DeviceManager using a
web browser or a tablet.
After logging in to the CLI of a storage system, you can query, set, manage, and maintain the storage
system. On any maintenance terminal connected to the storage system, you can use PuTTY to access
the IP address of the management network port on the controller of the storage system through the
SSH protocol to log in to the CLI. The SSH protocol supports two authentication modes: user name +
password and public key.
You can log in to the storage system by either of the following methods:
Login Using a Serial Port
After the controller enclosure is connected to the maintenance terminal using serial cables, you can
log in to the CLI of the storage device using a terminal program (such as PuTTY).
Login Using a Management Network Port
You can log in to the CLI using an IPv4 or IPv6 address.
After connecting the controller enclosure to the maintenance terminal using a network cable, you
can log in to the storage system by using any type of remote login software that supports SSH.
For a 2 U controller enclosure, the default IP addresses of the management network ports on
controller A and controller B are 192.168.128.101 and 192.168.128.102, respectively. The default
subnet mask is 255.255.0.0. For a 3 U/6 U controller enclosure, the default IP addresses of the
management network ports on management module 0 and management module 1 are
192.168.128.101 and 192.168.128.102, respectively. The default subnet mask is 255.255.0.0.
The IP address of the controller enclosure's management network port must be in the same network
segment as that of the maintenance terminal. Otherwise, you need to modify the IP address of the
management network port through a serial port by running the change system management_ip
command.
Introduction to UltraPath
OceanStor UltraPath is the multipathing software developed by Huawei. Its functions include
masking of redundant LUNs, optimum path selection, I/O load balancing, and failover and failback.
These functions enable your storage network to be intelligent, stable, and fast.

3.2.2 Introduction to Storage Management Tools


DeviceManager is a piece of integrated storage management software designed by Huawei for a
single storage system. DeviceManager can help you easily configure, manage, and maintain storage
devices.
Users can query, set, manage, and maintain storage systems on DeviceManager and the CLI. Tools
such as SmartKit and eService can improve O&M efficiency.
Before using DeviceManager, ensure that the maintenance terminal meets the following
requirements of DeviceManager:
Operating system and browser versions of the maintenance terminal are supported.
DeviceManager supports multiple operating systems and browsers. For details about the
compatibility information, visit Huawei Storage Interoperability Navigator.
The maintenance terminal communicates with the storage system properly.


The super administrator can log in to the storage system using this authentication mode only.
Before logging in to DeviceManager as a Lightweight Directory Access Protocol (LDAP) domain user,
first configure the LDAP domain server, and then configure parameters on the storage system to add
it into the LDAP domain, and finally create an LDAP domain user.
By default, DeviceManager allows 32 users to log in concurrently.
A storage system provides built-in roles and supports customized roles.
Built-in roles are preset in the system with specific permissions. Built-in roles include the super
administrator, administrator, and read-only user.
Permissions of user-defined roles can be configured based on actual requirements.
To support permission control in multi-tenant scenarios, the storage system divides built-in roles
into two groups: system group and tenant group. Specifically, the differences between the system
group and tenant group are as follows:
⚫ Tenant group: roles in this group are used only in the tenant view (view that can be operated
after you log in to DeviceManager using a tenant account).
⚫ System group: roles belonging to this group are used only in the system view (view that can be
operated after you log in to DeviceManager using a system group account).
Huawei UltraPath provides the following functions:
⚫ The path to the owning controller of a LUN is used to achieve the best performance.
⚫ Virtual LUNs mask physical LUNs and are visible to upper-layer users. Read/Write operations are
performed on virtual LUNs.
⚫ Mainstream clustering software: MSCS, VCS, HACMP, Oracle RAC, and so on
⚫ Mainstream database software: Oracle, DB2, MySQL, Sybase, Informix, and so on
⚫ After link recovery, failback immediately occurs without manual intervention or service
interruption.
⚫ Multiple paths are automatically selected to deliver I/Os, improving I/O performance. Paths are
selected based on the path workload.
⚫ Failover occurs if a link becomes faulty, preventing service interruption.


3.2.3 Introduction to Basic Management Operations

Figure 3-1 Configuration process

3.3 Storage Resource Tuning Technologies and Applications


3.3.1 SmartThin
Huawei-developed SmartThin for the OceanStor storage series provides the automatic thin provisioning function. It solves problems in the deployment of traditional storage systems.
SmartThin allocates storage spaces on demand rather than pre-allocating all storage spaces at the
initial stage. It is more cost-effective because customers can start business with a few disks and add
disks based on site requirements. In this way, both the initial purchase cost and TCO are minimized.
Thin LUN
Definition: A thin LUN is a logical disk that can be accessed by hosts. It dynamically allocates storage
resources from the storage pool according to the actual capacity requirements of users.


3.3.1.1 Working Principles of SmartThin


The working principle of SmartThin is to virtualize storage resources.
SmartThin manages storage devices on demand. Instead of allocating all space in advance, SmartThin presents users with a virtual storage space that is larger than the available physical storage space, and then allocates physical space based on users' actual demands. If the storage space becomes insufficient, users can add back-end storage units to expand the system capacity. The whole expansion process is transparent to users and requires no system shutdown.
SmartThin creates thin LUNs based on a RAID 2.0+ virtual storage resource pool, that is, thin LUNs coexist with thick LUNs in the same storage resource pool. A thin LUN is a logical unit created in a storage pool, which can be mapped to and then accessed by a host. The capacity of a thin LUN is not actual physical space but a virtual value. Physical space is allocated from the storage resource pool, based on the capacity-on-write policy, only when the thin LUN starts to process I/O requests.
A thin LUN is a logical disk that can be accessed by hosts. It dynamically allocates storage resources
from the storage pool according to the actual capacity requirements of users.
3.3.1.2 Read/Write Process of SmartThin
SmartThin uses the capacity-on-write and direct-on-time technologies to help hosts process read
and write requests of thin LUNs. Capacity-on-write is used to allocate space upon writes, and direct-
on-time is used to redirect data read and write requests.
Capacity-on-write: Upon receiving a write request from a host, a thin LUN uses direct-on-time to check whether physical storage space has already been allocated for the logical storage addressed by the request. If not, a space allocation task is triggered, and space is allocated in units of grains (the minimum allocation granularity). Data is then written to the newly allocated physical storage space.
Direct-on-time: When capacity-on-write is used, the relationship between the actual storage area
and logical storage area of data is not calculated using a fixed formula but determined by random
mappings based on the capacity-on-write principle. Therefore, when data is read from or written
into a thin LUN, the read or write request must be redirected to the actual storage area based on the
mapping relationship between the actual storage area and logical storage area.
Mapping table: This table is used to record the mapping between an actual storage area and a logical
storage area. A mapping table is dynamically updated during the write process and is queried during
the read process.
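A simplified model of capacity-on-write and the mapping table is sketched below in Python. It is a hypothetical illustration (the grain size and pool size are assumptions): physical grains are allocated from the pool only on the first write to a logical grain, and reads are redirected through the mapping table.

```python
# Minimal sketch of a thin LUN: allocate physical grains only on first write
# (capacity-on-write) and redirect reads via a mapping table (direct-on-time).

GRAIN_SIZE = 64 * 1024   # assumed grain size (64 KB)

class ThinLUN:
    def __init__(self, virtual_capacity, pool):
        self.virtual_capacity = virtual_capacity   # the size presented to the host
        self.pool = pool                # shared storage pool (list of free grains)
        self.mapping = {}               # mapping table: logical grain -> physical grain

    def write(self, offset, data):
        grain = offset // GRAIN_SIZE
        if grain not in self.mapping:               # capacity-on-write
            if not self.pool:
                raise RuntimeError("storage pool exhausted, expand the pool")
            self.mapping[grain] = self.pool.pop()   # allocate one physical grain
        self.mapping[grain]["data"] = data          # write to the physical grain

    def read(self, offset):
        grain = offset // GRAIN_SIZE
        physical = self.mapping.get(grain)          # direct-on-time lookup
        return physical["data"] if physical else b"\0" * GRAIN_SIZE  # unallocated

    def allocated_capacity(self):
        return len(self.mapping) * GRAIN_SIZE

pool = [{"data": None} for _ in range(1024)]        # 64 MB of physical grains
lun = ThinLUN(virtual_capacity=10 * 1024**3, pool=pool)   # a 10 GB thin LUN
lun.write(0, b"hello")
print(lun.read(0)[:5], lun.allocated_capacity())    # b'hello' 65536
```

Although the host sees a 10 GB LUN, only one grain of physical space has been consumed, which is the essence of thin provisioning.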
3.3.1.3 Application Scenarios of SmartThin
SmartThin allocates storage space on demand. The storage system allocates space to application
servers as needed within a specific quota threshold, eliminating the storage resource waste.
SmartThin can be used in the following scenarios:
SmartThin expands the capacity of the banking transaction systems in online mode without
interrupting ongoing services.
SmartThin dynamically allocates physical storage spaces on demand to email services and online
storage services.
SmartThin allows different services provided by a carrier to compete for physical storage space to
optimize storage configurations.
3.3.1.4 Configuration Process of SmartThin
To use thin LUNs, you need to import and activate the license file of SmartThin on your storage
device.


After a thin LUN is created, if an alarm is displayed indicating that the storage pool has no available
space, you are advised to expand the storage pool as soon as possible. Otherwise, the thin LUN may
enter the write through mode, causing performance deterioration.

3.3.2 SmartTier&SmartCache
3.3.2.1 SmartTier
SmartTier is also called intelligent storage tiering. It provides the intelligent data storage
management function that automatically matches data to the storage media best suited to that type
of data by analyzing data activities.
SmartTier migrates hot data to storage media with high performance (such as SSDs) and moves idle
data to more cost-effective storage media (such as NL-SAS disks) with more capacity. This provides
hot data with quick response and high input/output operations per second (IOPS), thereby
improving the performance of the storage system.
3.3.2.1.1 Dividing Storage Tiers
In the same storage pool, a storage tier is a collection of storage media with the same performance.
SmartTier divides storage media into high-performance, performance, and capacity tiers based on
their performance levels. Each storage tier respectively uses the same type of disks and RAID policy.
⚫ High-performance tier
Disk type: SSDs
Disk characteristics: SSDs have a high IOPS and respond quickly to I/O requests. However, the cost per unit of storage capacity is high.
Application characteristics: Applications with intensive random access requests are often
deployed at this tier.
Data characteristics: It carries the most active data (hot data).
⚫ Performance tier
Disk type: SAS disks
Disk characteristics: This tier delivers a high bandwidth under a heavy service workload. I/O requests are responded to relatively quickly. Data writes are slower than data reads if no data is cached.
Application characteristics: Applications with moderate access requests are often deployed at
this tier.
Data characteristics: It carries hot data (active data).
⚫ Capacity tier
Disk type: NL-SAS disks
Disk characteristics: NL-SAS disks have a low IOPS and respond slowly to I/O requests. However, the price per unit of storage request processing is high.
Application characteristics: Applications with fewer access requests are often deployed at this
tier.
Data characteristics: It carries cold data (idle data).
The types of disks in a storage pool determine how many storage tiers there are.
3.3.2.1.2 Managing Member Disks
A storage pool with SmartTier enabled manages SCM drives and SSDs as member disks.


Number of hot spare disks: Each tier reserves its own hot spare disks. For example, with a low hot spare policy for a storage pool, one disk is reserved for each of the performance tier and the capacity tier.
Large- and small-capacity disks: The capacity of each tier is calculated independently. Each tier
supports a maximum of two disk capacity specifications.
RAID: RAID groups are configured for each tier separately. Chunk groups (CKGs) are not formed
across different media.
Capacity: The available capacity of a storage pool is the sum of the available logical capacity of each
tier.
3.3.2.1.3 Migrating Data
If a storage pool contains both SCM drives and SSDs, the data newly written by a host is
preferentially saved to the performance tier for better performance. Data that has not been
accessed for a long time is migrated to the capacity tier. When the capacity of the preferred tier is
insufficient, data is written to the other tier.
3.3.2.2 SmartCache
3.3.2.2.1 SmartCache
Based on the short read response time of SCM drives, SmartCache uses SCM drives to compose a
SmartCache pool and caches frequently-read data to the SmartCache pool. This shortens the
response time for reading hot data, improving system performance.
SmartCache pool:
A SmartCache pool consists of SCM drives and is used as a complement of DRAM cache to store hot
data.
SmartCache partition:
The SmartCache partition is a logical concept based on the SmartCache pool. It is used to store LUNs
and file system services.
3.3.2.2.2 SmartCache Write Process
⚫ After receiving a write I/O request to a LUN or file system from a server, the storage system
sends data to the DRAM cache.
⚫ After the data is written to the DRAM cache, an I/O response is returned to the server.
⚫ The DRAM cache sends the data to the storage pool management module.
⚫ Data is stored on SSDs, and an I/O response is returned.
⚫ The DRAM cache sends data copies to the SmartCache pool. After the data is filtered by the cold
and hot data identification algorithm, the identified hot data is written to the SCM media, and
the metadata of the mapping between the data and SCM media is created in the memory.
⚫ Data is cached to the SmartCache pool, and an I/O response is returned.
3.3.2.2.3 SmartCache Read Hit
⚫ A read I/O request from an application server is first delivered to the DRAM cache before
arriving at LUNs or file systems.
⚫ If the requested data is not found in the DRAM cache, the read I/O request is further delivered
to the SmartCache pool.
⚫ If the requested data is found in the SmartCache pool, the read I/O request is delivered to SCM
drives. Data is read from the SCM drives and returned to the DRAM cache.
⚫ The DRAM cache returns the data to the application server.


3.3.2.2.4 SmartCache Read Miss


⚫ A read I/O request from an application server is first delivered to the DRAM cache before
arriving at LUNs or file systems.
⚫ If the requested data is not found in the DRAM cache, the DRAM cache forwards the read I/O
request to the SmartCache pool.
⚫ If the requested data is not found in the SmartCache pool either, the SmartCache module
forwards the read I/O request to the storage pool management module to read the data from
SSDs.
⚫ Data is read from the SSDs and returned to the SmartCache module.
⚫ The SmartCache module returns the data to the DRAM cache.
⚫ The DRAM cache returns the data to the application server.
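The read hit and miss flows above can be condensed into one small sketch. The Python below is a hypothetical model, not the actual cache code: a read first checks the DRAM cache, then the SmartCache pool, and finally the SSD pool, and data found at a lower layer is cached so that later reads hit a faster layer. The hot-data identification algorithm is deliberately omitted.

```python
# Minimal sketch of the layered read path: DRAM cache -> SmartCache (SCM) -> SSDs.
dram_cache = {}                       # fastest, smallest layer
smartcache_pool = {}                  # SCM media, caches frequently read data
ssd_pool = {"lba-7": b"cold data"}    # backing storage pool

def read(lba):
    if lba in dram_cache:                 # DRAM cache hit
        return dram_cache[lba]
    if lba in smartcache_pool:            # SmartCache hit
        data = smartcache_pool[lba]
    else:                                 # miss everywhere: read from the SSDs
        data = ssd_pool[lba]
        smartcache_pool[lba] = data       # hot-data identification omitted here
    dram_cache[lba] = data                # data is returned via the DRAM cache
    return data

print(read("lba-7"))   # first read: served from the SSDs, then cached
print(read("lba-7"))   # second read: served from the DRAM cache
```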
3.3.2.2.5 Highlights
⚫ Dynamic Capacity Expansion
You can add SCM media to a SmartCache pool for dynamic capacity expansion.
You are advised to configure SCM resources of the same quantity and capacity for controllers in
the same controller enclosure. This ensures balanced acceleration performance for LUN or file
system services of multiple controllers in the same controller enclosure.
⚫ Flexible Policy Configuration
You can enable or disable SmartCache for specific LUNs or file systems without interrupting
services.
You can add one or more LUNs or file systems to a specified SmartCache partition to shorten
the data read response time.
⚫ Adaptive Switch
When detecting that the SmartCache policy is inefficient (for example, when the SmartCache
hit ratio is low or the CPU is busy), the system stops allowing I/Os to enter the SmartCache pool
to reduce the impact on services. In this case, I/Os are not sent to the SmartCache pool and the
requested data will not be found in the SmartCache pool. In addition, to ensure data
consistency, data in the SmartCache pool is cleared.
When detecting a scenario suitable for SmartCache, the system automatically restores to the
previous state.

3.3.3 SmartAcceleration
SmartAcceleration is a key feature for performance improvement on the next-generation OceanStor
hybrid flash storage. It leverages the large block sequential write mechanism of redirect-on-write
(ROW) and uses a unified performance layer for cache and tier performance acceleration. This
breaks the bottleneck of traditional HDDs in random IOPS performance, maximizing the performance
of the hybrid flash system.
3.3.3.1 Unified Performance Layer That Flexibly Integrates Caches and Tiers
Global cold and hot data sensing and data collaboration algorithms, breaking the boundaries of
caches and tiers, and providing optimal data layout and simplified configuration:
⚫ Converged caches and tiers, preventing repeated data flow and improving efficiency
⚫ Global popularity, unifying cache's admission and eviction and tier's traffic distribution and
migration


⚫ Detecting performance, capacity, and service life of different media to comprehensively


determine data placement
Scenario-based adaptive elastic cache adjustment, achieving optimal balance between performance
and capacity:
⚫ Flexible cache adjustment and dynamic conversion between caches and tiers, going for optimal
cost-effectiveness
⚫ No need to configure caches and tiers (physical) separately, simplifying configuration
The performance layer and capacity layer use multiple index technologies to adapt to different data layouts, achieving optimal efficiency in mixed models with variable data:
⚫ Performance layer PelagoDB, implementing small- and medium-capacity, and fast data update
⚫ Capacity layer KVDB, implementing medium- and large-capacity, and balanced read and write
efficiency
3.3.3.2 ROW-based Large-Block Sequential Write
Huawei OceanStor storage also supports ROW-based sequential writes of large blocks. This enables
controllers to detect SSD data layouts with Huawei-developed SSDs. Consequently, multiple small
and scattered blocks are aggregated into a large continuous block and sequentially written to SSDs.
RAID 5, RAID 6, and RAID-TP perform just one I/O operation and do not require the usual multiple
read and write operations for small scattered write blocks. The write performance of RAID 5 and
RAID 6 for OceanStor storage arrays far surpasses that of traditional arrays and is almost the same as
that of RAID-TP.

3.3.4 SmartQoS
SmartQoS is an intelligent service quality control feature. It dynamically allocates storage system
resources to meet the performance requirement of certain applications.
SmartQoS is an essential value-added feature for a storage system, especially in certain applications
that demand high service-level requirements.
When multiple applications are deployed on the same storage device, users can obtain maximized
benefits through the proper configuration of SmartQoS.
Performance control reduces adverse impacts of applications on each other to ensure the
performance of critical services.
SmartQoS limits the resources allocated to non-critical applications to ensure high performance for
critical applications.
3.3.4.1 Upper Limit Traffic Control Management
SmartQoS traffic control is implemented by I/O queue management, token allocation, and dequeue
management for controlled objects.
SmartQoS determines the amount of storage resources to be allocated to an I/O queue of controlled objects by counting the number of tokens owned by the queue. The more tokens an I/O queue owns, the more resources are allocated to the queue, and the more preferentially the I/O requests in the queue are processed.
SmartQoS maintains a token bucket and an I/O queue for each LUN associated with a SmartQoS policy and converts the upper limit of the policy into the token generation rate of the bucket.
If the maximum IOPS is limited in a SmartQoS policy, one I/O consumes one token. If the maximum
bandwidth is limited, one sector consumes one token.
A larger maximum IOPS or bandwidth indicates a larger number of tokens in the token bucket and
higher performance of the corresponding LUN.


I/O queue processing mechanism of applications in the system:


1 After application servers send I/O requests, the storage system delivers the requests to
corresponding I/O queues of controlled objects.
2 The storage system adjusts the number of tokens owned by I/O queues based on the priorities
of controlled objects. By reducing the number of tokens owned by low-priority queues, the
storage system guarantees sufficient resources for high-priority controlled objects so that I/Os
to these controlled objects can be preferentially processed.
3 The storage system processes the I/Os in queues by priority.
3.3.4.2 Burst Traffic Control Management
Principle: The burst capability of a LUN comes from the burst credits converted from performance
accumulated when the actual performance of the LUN is lower than the upper limit.
SmartQoS uses the dual-token-bucket mode to support the production and consumption of burst
credits. Token bucket C is used to implement upper limit control, and Token bucket E is used to
implement burst control.
If the LUN's service pressure is lower than the configured performance upper limit, that is, the
consumption rate is lower than the production rate of tokens in Token bucket C, the number of
tokens in Token bucket C gradually increases and overflows when it exceeds the bucket depth.
Excess tokens are stored in Token bucket E until the number of tokens in it also exceeds the bucket depth, at which point the excess tokens are discarded. The LUN has the burst capability only when Token bucket E has remaining tokens.
If the LUN's service pressure is greater than the configured performance upper limit, tokens in Token
bucket C will be exhausted gradually. When Token bucket C has no token, it attempts to borrow
tokens from Token bucket E until tokens in Token bucket E are exhausted.
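The dual-token-bucket behavior can be made concrete with a short, hypothetical Python sketch (the rates, depths, and per-I/O token cost are assumptions): bucket C refills at the configured upper limit, overflow spills into bucket E, and an I/O may borrow from bucket E only when bucket C is empty, which is the burst capability described above.

```python
# Minimal sketch of SmartQoS-style dual token buckets (all values are assumptions).
class DualTokenBucket:
    def __init__(self, rate, depth_c, depth_e):
        self.rate = rate            # tokens generated per second (upper limit)
        self.c = depth_c            # bucket C: upper-limit control
        self.e = 0.0                # bucket E: burst credits
        self.depth_c = depth_c
        self.depth_e = depth_e

    def refill(self, seconds):
        self.c += self.rate * seconds
        if self.c > self.depth_c:               # overflow spills into bucket E
            self.e = min(self.e + self.c - self.depth_c, self.depth_e)
            self.c = self.depth_c               # tokens beyond E's depth are discarded

    def try_io(self, tokens=1):
        if self.c >= tokens:                    # normal case: consume from bucket C
            self.c -= tokens
            return True
        if self.e >= tokens:                    # C exhausted: borrow burst credits
            self.e -= tokens
            return True
        return False                            # I/O must wait (traffic limited)

bucket = DualTokenBucket(rate=1000, depth_c=1000, depth_e=2000)
bucket.refill(5)                 # light load for 5 s accumulates burst credits
served = sum(bucket.try_io() for _ in range(2500))
print(served)                    # all 2500 I/Os pass: 1000 from C, 1500 borrowed from E
```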
3.3.4.3 Lower Limit Guarantee
Principle: Set lower limit objectives for all LUNs: Users set a lower limit objective for LUNs of a few
critical services. The system sets a shared lower limit objective for other LUNs at 10% (configurable)
of the nominal performance value of the system. The latency objective is converted into the IOPS
objective.
Rate the loads of all LUNs in the system. If the load of a LUN does not reach the lower limit objective,
the system rating and adjustment process are triggered.
Suppress the traffic for LUNs whose performance values are far beyond the lower limit to release
resources.

3.3.5 SmartDedupe&SmartCompression
3.3.5.1 SmartDedupe
Working Principles of Post-processing Similarity-based Deduplication
⚫ The system divides to-be-written data into blocks. The default data block size is 8 KB.
⚫ The storage system uses a similar fingerprint algorithm to calculate similar fingerprints of the
new data blocks.
⚫ The storage system writes the data blocks to disks and records data blocks' fingerprint and
location information in the opportunity table.
⚫ The storage system periodically checks whether there are similar fingerprints in the opportunity
table.
➢ If yes, the storage system performs operations in the next step.


➢ If no, the storage system continues the periodic check.


⚫ The storage system performs a byte-by-byte comparison to check whether similar blocks are
actually the same.
➢ If yes, the storage system deletes the new data block and points the fingerprint and
storage location of the new data block to that of the existing one in the fingerprint table.
➢ If no, the storage system performs delta compression on the new data block, records the
new data block's fingerprint in the fingerprint table, updates the fingerprint to the
metadata of the new data block, and reclaims the storage space of the new data block.
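The flow above can be illustrated with a compact, hypothetical Python sketch. It is not the real algorithm: a real similar-fingerprint function and delta compression are far more sophisticated, so here the "fingerprint" is simply a hash of the block, and duplicates are confirmed by byte-by-byte comparison in a background pass, mirroring the inline and periodic steps described above.

```python
# Minimal sketch of post-process deduplication (8-KB blocks, simplified fingerprints).
import hashlib

BLOCK_SIZE = 8 * 1024

stored_blocks = {}        # location -> block data (stands in for the disks)
opportunity_table = {}    # fingerprint -> locations written with that fingerprint
dedup_map = {}            # location -> location of the identical block it points to

def fingerprint(block: bytes) -> str:
    # Stand-in for the similar-fingerprint algorithm (a real one tolerates small diffs).
    return hashlib.sha1(block).hexdigest()

def write_block(location, block):
    """Inline path: write the block and record its fingerprint for later processing."""
    stored_blocks[location] = block
    opportunity_table.setdefault(fingerprint(block), []).append(location)

def periodic_dedup():
    """Background path: compare blocks with matching fingerprints byte by byte."""
    for fp, locations in opportunity_table.items():
        base = locations[0]
        for loc in locations[1:]:
            if loc not in stored_blocks:
                continue                                    # already deduplicated
            if stored_blocks[loc] == stored_blocks[base]:   # byte-by-byte check
                dedup_map[loc] = base                       # point to the existing block
                del stored_blocks[loc]                      # reclaim the space
            # else: a real system would apply delta compression here

write_block("L0", b"A" * BLOCK_SIZE)
write_block("L1", b"A" * BLOCK_SIZE)    # duplicate of L0
periodic_dedup()
print(dedup_map)                        # {'L1': 'L0'}
```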

3.3.6 SmartVirtualization
3.3.6.1 Related Concepts
Local storage system: refers to the OceanStor Dorado series storage system.
Heterogeneous storage system: can be either a storage system manufactured by another
mainstream vendor or a Huawei storage system of a specific model.
External LUN: a LUN in a heterogeneous storage system, which is displayed as a remote LUN in
DeviceManager.
eDevLUN: In the storage pool of a local storage system, the mapped external LUNs are reorganized
as raw storage devices based on a certain data organization form. A raw device is called an eDevLUN.
The physical space occupied by an eDevLUN in the local storage system is merely the storage space
needed by the metadata. The service data is still stored on the heterogeneous storage system.
Application servers can use eDevLUNs to access data on external LUNs in the heterogeneous storage
system, and the SmartMigration feature can be configured for the eDevLUNs.
Online takeover: During the online takeover process, services are not interrupted, ensuring service
continuity and data integrity. In this mode, the critical identity information about heterogeneous
LUNs is masqueraded so that multipathing software can automatically identify new storage systems
and switch I/Os to the new storage systems. This remarkably simplifies data migration and minimizes
time consumption.
Offline takeover: During the offline takeover process, connections between heterogeneous storage
systems and application servers are down and services are interrupted temporarily. This mode is
applicable to all compatible Huawei and third-party heterogeneous storage systems.
Hosting: LUNs in a heterogeneous storage system are mapped to a local storage system for use and
management.
3.3.6.2 Relationship Between an eDevLUN and an External LUN
The physical space needed by data is provided by the external LUN from the heterogeneous storage
system. Data does not occupy the capacity of the local storage system.
Metadata is used to manage storage locations of data on an eDevLUN. The space used to store
metadata comes from the metadata space in the storage pool created in the local storage system.
Metadata occupies merely a small amount of space. Therefore, eDevLUNs occupy a small amount of
space in the local storage system. (If no value-added feature is configured for eDevLUNs, each
eDevLUN occupies only dozens of KBs in the storage pool created in the local storage system.)
If value-added features are configured for eDevLUNs, each eDevLUN, like any other local LUNs,
occupies local storage system space to store the metadata of value-added features. Properly plan
storage space before creating eDevLUNs to ensure that value-added features can work properly.


3.3.6.3 Data Read Process


With the use of SmartVirtualization, an application server can read data from and write data to an
external LUN in a heterogeneous storage system through the local storage system. The entire
process is similar to the process of reading and writing data to a LUN in the local storage system.
3.3.6.4 Data Write Process
With the use of SmartVirtualization, an application server can read data from and write data to an
external LUN in a heterogeneous storage system through the local storage system. The entire
process is similar to the process of reading and writing data to a LUN in the local storage system.
3.3.6.5 Centralized Management of Storage Resources
SmartVirtualization functions similarly to virtual gateways. SmartVirtualization allows you to discover
storage resources in multiple heterogeneous storage systems from the local storage system, and
deliver read and write commands to the storage resources for centralized management.
3.3.6.6 Takeover Mode Selection
The offline takeover mode is applicable to all compatible Huawei and third-party heterogeneous
storage systems. In this mode, services running on the related application servers are stopped
temporarily and the masquerading property for eDevLUNs is No masquerading.
When a Huawei heterogeneous storage system is taken over in online mode, the masquerading
property for eDevLUNs is Basic masquerading or Extended masquerading. The selection of basic
masquerading or extended masquerading depends on the vendor and version of the multipathing
software and the versions of Huawei heterogeneous storage systems. For details, see the product
documentation of the corresponding version.

3.3.7 SmartMigration
SmartMigration is a key technology for service migration. Services on a source LUN can be
completely migrated to a target LUN without interrupting host services. The target LUN can totally
replace the source LUN to carry services after the replication is complete.
3.3.7.1 Benefits of SmartMigration
Benefits of SmartMigration: Reliable service continuity: Service data is migrated non-disruptively,
preventing any loss caused by service interruption during service migration.
Stable data consistency: During service data migration, data changes made by hosts will be sent to
both the source LUN and target LUN, ensuring data consistency after migration and preventing data
loss.
Convenient performance adjustment: To flexibly adjust service performance levels, SmartMigration
migrates service data between different storage media and RAID levels based on service
requirements.
Data migration between heterogeneous storage systems: In addition to service data migration
within a storage system, SmartMigration also supports service data migration between a Huawei
storage system and a compatible heterogeneous storage system.
3.3.7.2 Working Principles of SmartMigration
SmartMigration is leveraged to adjust service performance or upgrade storage systems by migrating
services between LUNs.
3.3.7.3 Related Concepts
Storage systems use virtual storage technology. Virtual data in a storage pool consists of metadata
volumes and data volumes.


Metadata volumes: record the data storage locations, including LUN IDs and data volume IDs. LUN
IDs are used to identify LUNs, and data volume IDs are used to identify physical space of data
volumes.
Data volumes: store actual user data.
3.3.7.4 SmartMigration Service Data Synchronization
The two synchronization modes of service data are independent and can be performed at the same
time to ensure that service data changes on the host can be synchronized to the source LUN and the
target LUN.
Data change synchronization:
1 A host delivers an I/O write request to the LM module of a storage system.
2 The LM module writes the data to the source LUN and target LUN and records write operations
to the log.
3 The source LUN and target LUN return the data write result to the LM module.
4 The LM module determines whether to clear the log based on the write I/O result.
5 A write success acknowledgment is returned to the host.
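The five steps above can be modeled with a small, hypothetical Python sketch (the LM module, LUN objects, and log are simplified stand-ins): the write is logged first, sent to both LUNs, and the log entry is cleared only when both writes succeed, so an interrupted write can later be replayed.

```python
# Minimal sketch of dual-write data change synchronization with a write log.
source_lun = {}
target_lun = {}
write_log = {}     # pending writes; cleared only after both LUNs acknowledge

def write_to_lun(lun, lba, data):
    lun[lba] = data
    return True

def lm_write(lba, data):
    """LM module: log the write, send it to both LUNs, clear the log on success."""
    write_log[lba] = data                 # record the operation before writing
    ok_source = write_to_lun(source_lun, lba, data)
    ok_target = write_to_lun(target_lun, lba, data)
    if ok_source and ok_target:
        del write_log[lba]                # both copies consistent: clear the log
        return "success"                  # acknowledgment returned to the host
    return "retry-from-log"               # the logged entry allows resynchronization

print(lm_write(100, b"new data"))         # -> success
print(source_lun == target_lun)           # -> True (both LUNs hold the change)
```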
3.3.7.5 SmartMigration LUN Information Exchange
LUN information exchange is the prerequisite for a target LUN to take over services from a source
LUN after service information synchronization.
In a storage system, each LUN and its corresponding data volume have a unique identifier, namely,
the ID of a LUN and data volume ID. A source LUN corresponds to a data volume. The former is a
logical concept whereas the latter is a physical concept.
Before LUN information exchange: A host identifies a source LUN by the ID of the source LUN. The ID
of a LUN corresponds to a data volume ID.
During LUN information exchange: The source data volume ID and the target data volume ID are exchanged. The physical storage space to which the source LUN points becomes the target data volume.
After LUN information exchange: The ID of the source LUN is unchanged, and users notice no change because services are not affected. The ID of the source LUN and the target data volume ID form a new mapping relationship, so the host actually reads from and writes to the physical space of the target LUN.
3.3.7.6 SmartMigration Pair Splitting
⚫ In splitting, host services are suspended. After information is exchanged, services are delivered
to the target LUN. In this way, service migration is transparent to users.
⚫ The consistency splitting of SmartMigration means that multiple pairs exchange LUN information at the same time and concurrently remove their pair relationships after the information exchange is complete, ensuring data consistency at any point in time before and after the pairs are split.
⚫ Pair splitting: Data migration relationship between a source LUN and a target LUN is removed
after LUN information is exchanged.
➢ After the pair is split, if the host delivers an I/O request to the storage system, data is only
written to the source LUN.
➢ The target LUN stores all data of the source LUN at the pair splitting point in time.
➢ After the pair is split, no connections can be established between the source LUN and
target LUN.


3.4 Storage Data Protection Technologies and Applications


3.4.1 HyperSnap
HyperSnap is a snapshot feature. A snapshot generated by HyperSnap is a point-in-time, consistent,
and fully usable copy of source data. It is a static image of the source data at the copy point in time.
A snapshot can be implemented using the copy-on-write (COW) or redirect-on-write (ROW)
technology.
COW enables data to be copied in the initial data write process. Data copy affects write performance
of hosts.
ROW does not copy data. However, after data is overwritten frequently, data distribution on the
source LUN will be damaged, adversely affecting sequential read performance of hosts.
Purposes of snapshots:
Backup and archiving: Snapshots can serve as data sources for backup and archiving.
Quick recovery: Snapshots flexibly and frequently generate recovery points for data on storage
devices, enabling fast data recovery when necessary.
Instant generation: Snapshots are instantaneously generated without impacting host services. It is a
data duplicate of the source data at a specific point in time.
ROW principle
When a source file system receives a write request to modify existing data, the storage system
writes the new data to a new location and directs the pointer of the modified data block to the new
location.
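A tiny, hypothetical Python sketch of redirect-on-write is given below (the block map and snapshot structures are simplified assumptions): creating a snapshot only copies the pointer table, and a later overwrite writes new data to a new location and updates the active pointer, leaving the snapshot's view untouched.

```python
# Minimal sketch of a redirect-on-write (ROW) snapshot on a block map.
physical_blocks = {}          # physical location -> data
active_map = {}               # logical block -> physical location (source volume)
next_location = 0

def write(logical_block, data):
    """ROW: always write to a new physical location and repoint the logical block."""
    global next_location
    physical_blocks[next_location] = data
    active_map[logical_block] = next_location
    next_location += 1

def create_snapshot():
    """A snapshot is just a copy of the pointer table at this point in time."""
    return dict(active_map)

def read(block_map, logical_block):
    return physical_blocks[block_map[logical_block]]

write(0, b"v1")
snap = create_snapshot()      # instantaneous: no data is copied
write(0, b"v2")               # redirected to a new location; v1 is untouched
print(read(active_map, 0))    # -> b'v2' (source volume)
print(read(snap, 0))          # -> b'v1' (snapshot view)
```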
Working principles of HyperSnap
Common snapshot terms:
Source volume: A volume that stores the source data of a snapshot. It is presented as a LUN to users.
Snapshot volume: A logical data duplicate generated after a snapshot is created for a source volume.
It is presented as a snapshot LUN to users.
Redirect on write: When data is modified, new space is allocated to new data. After the new data
has been written successfully, the original space is released.
Snapshot rollback: Data of a snapshot LUN is copied to the source LUN. In this way, data of the
source LUN is recovered to a state at the point in time when the snapshot LUN was activated.
Inactive: A state of a snapshot in which the snapshot is unavailable. The opposite state is activated.

3.4.2 HyperClone
Definition
HyperClone creates a full copy of the source LUN's data on a target LUN at a specified point in time
(synchronization start time).
For file systems, HyperClone creates a clone of a file system or of a file system snapshot at a specific point in time. After a clone file system has been created, its data (including the dtree configuration and dtree data) is consistent with that of the parent file system at the corresponding point in time.
Features
The target LUN can be read and written during synchronization.
Full synchronization and differential synchronization are supported.
Forward synchronization and reverse synchronization are supported.


Consistency groups are supported.


Read/Write principles of a clone file system
After a clone file system is created, it shares source data with the parent file system when there is no
data change. The snapshot is used to ensure data consistency at the point in time when clone was
created. If the clone file system is shared for reads and writes on application servers, the application
servers actually read the source data of the parent file system.
When an application server writes new data to an existing data block in the parent or clone file
system, the storage system allocates new storage space for the new data instead of overwriting the
original data. When the application server attempts to modify data block A in the parent file system,
the storage pool allocates a new data block A1 to store the new data and retains data block A; when
the application server attempts to modify data block D in the clone file system, the storage pool
allocates a new data block D1 to store the new data and retains data block D. Data in the file system
snapshot is not changed during this process.

3.4.3 HyperCDP
3.4.3.1 LUN HyperCDP
Based on the lossless snapshot technology, HyperCDP has little impact on the performance of source
LUNs. Compared with writable snapshots, HyperCDP does not need to build LUNs, greatly reducing
memory overhead and providing stronger and continuous protection.
HyperCDP is a value-added feature that requires a license.
In the license file, the HyperCDP feature name is displayed as HyperCDP.
The HyperCDP license also grants the permissions for HyperSnap. If you have imported a valid
HyperCDP license to your storage system, you can perform all operations of HyperSnap even though
you do not import a HyperSnap license.
HyperCDP has the following advantages:
It provides data protection at an interval of seconds, with zero impact on performance and small
space occupation.
It supports scheduled tasks. You can specify HyperCDP schedules by day, week, month, or specific
interval to meet different backup requirements.
It provides intensive and persistent data protection. HyperCDP provides more recovery points for
data and provides shorter data protection intervals, longer data protection periods, and continuous
data protection.
Purposes and benefits
⚫ Efficient use of storage space, protecting user investments
⚫ HyperCDP objects for various applications
⚫ Continuous data protection
Working principles of HyperCDP
HyperCDP creates high-density snapshots on a storage system to provide continuous data
protection. Based on the lossless snapshot technology, HyperCDP has little impact on the
performance of source LUNs. Compared with writable snapshots, HyperCDP does not need to build
LUNs, greatly reducing memory overhead and providing stronger and continuous protection.
The storage system supports HyperCDP schedules to meet customers' backup requirements.
HyperCDP objects cannot be mapped to a host directly. To read data from a HyperCDP object, you
can create a duplicate for it and map the duplicate to the host.


3.4.3.2 FS HyperCDP
HyperCDP periodically generates snapshots for a file system based on a HyperCDP schedule. The
HyperCDP schedule can be a default (named NAS_DEFAULT_BUILDIN) or non-default one.
The default schedule is automatically created when the first file system is created in the storage
system. The storage system has only one default schedule and it cannot be deleted. OceanStor
Dorado 6.1.2 and later versions support the default schedule.
Non-default schedules are created by users as required.

3.4.4 HyperLock
Write Once Read Many (WORM), also called HyperLock, protects the integrity, confidentiality, and
accessibility of data, meeting secure storage requirements.
Working Principle
With the WORM technology, data can be written to a file only once and cannot subsequently be rewritten, modified, deleted, or renamed. If a common file system is protected by the WORM feature, files in the file system are read-only within the protection period. After WORM file systems are created, you need to map them to application servers using the NFS or CIFS protocol.
WORM shifts files in a WORM file system among the initial, locked, appending, and expired states, preventing important data from being incorrectly or maliciously tampered with within a specified period.
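The state behavior described above can be sketched as a simple state model. The transition rules below are assumptions made for illustration, not Huawei's implementation; what the sketch preserves is the key invariant from the text: once protected, a file can be read but not rewritten, deleted, or renamed until it expires.

```python
# Simplified, illustrative state model for a file in a WORM file system.

class WormFile:
    def __init__(self):
        self.state = "initial"        # freely writable before it is locked

    def lock(self):
        self.state = "locked"         # read-only for the protection period

    def rewrite(self):
        if self.state != "initial":
            raise PermissionError(f"rewrite rejected in state '{self.state}'")

    def append(self):
        # In the appending state, new data may only be added at the end of
        # the file; existing content still cannot be changed.
        if self.state not in ("initial", "appending"):
            raise PermissionError(f"append rejected in state '{self.state}'")

    def delete(self):
        if self.state in ("locked", "appending"):
            raise PermissionError("delete rejected during the protection period")

    def expire(self):
        # After the protection period ends, the file may be deleted,
        # but its existing content still cannot be modified.
        if self.state in ("locked", "appending"):
            self.state = "expired"


worm_file = WormFile()
worm_file.lock()
try:
    worm_file.rewrite()
except PermissionError as err:
    print(err)                        # rewrite rejected in state 'locked'
```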


4 Storage System O&M Management

4.1 Storage System O&M Management


4.1.1 O&M Overview
ITIL
Information Technology Infrastructure Library (ITIL) is a widely recognized set of practice guidelines for effective IT service management. Starting in the 1980s, the UK Office of Government Commerce gradually developed and refined a set of methods for assessing IT service quality, known as ITIL, to address the problem of poor IT service quality. In 2001, the British Standards Institution released the British national standard BS 15000, with ITIL at its core, at the IT Service Management Forum (itSMF), a milestone in the IT service management field.
Traditionally, IT played only a supporting role; today, IT is delivered as a service. Driven by the goals of reducing costs, increasing productivity, and improving service quality, ITIL has been adopted worldwide, and many well-known multinational companies are active practitioners. As the industry shifts from being technology-oriented to service-oriented, enterprises' requirements for IT service management keep growing. ITIL helps standardize IT processes, keep them aligned with the business, and improve processing efficiency.
ITIL enjoys strong support in the UK, the rest of Europe, North America, New Zealand, and Australia. Whether an enterprise has adopted ITIL is often regarded as a key indicator of whether a supplier or outsourcing service contractor is qualified to bid.

4.1.2 O&M Management Tools


In storage scenarios, the following O&M tools are used:
⚫ DeviceManager: single-device O&M software.
⚫ SmartKit: a professional tool for Huawei technical support engineers, including compatibility
evaluation, planning and design, one-click fault information collection, inspection, upgrade, and
FRU replacement.
⚫ eSight: a customer-oriented multi-device maintenance suite that features fault monitoring and
visualized O&M.
⚫ DME: customer-oriented software that manages storage resources in a unified manner,
orchestrates service catalogs, and provides storage services and data application services on
demand.
⚫ eService Client: software deployed in the customer's equipment room that detects storage device anomalies in real time and notifies the Huawei maintenance center of the anomalies.
⚫ eService cloud platform: a platform deployed in the Huawei maintenance center that monitors devices on the entire network in real time, changing passive maintenance into proactive maintenance and even enabling agent maintenance.


4.1.3 O&M Scenarios


Maintenance Item Overview
Based on the maintenance items and periods, the system administrator can check the device
environment and device status. If an anomaly occurs, the system administrator can handle and
maintain the device in a timely manner to ensure the continuous and healthy running of the storage
system.
First Maintenance Items

Item: Checking SmartKit installation
Maintenance operation: On the maintenance terminal, check whether SmartKit and its sub-tools have been installed. The sub-tools provide the following functions:
⚫ Device archive collection
⚫ Information collection
⚫ Disk health analysis
⚫ Inspection
⚫ Patch tool

Item: Checking the eService installation and configuration
Maintenance operation: On the maintenance terminal, check whether the eService tool has been installed and the alarm policy has been configured.

Item: Checking the alarm policy configuration
Maintenance operation: On DeviceManager, check whether an alarm policy has been configured. After an alarm policy is configured, alarms will be reported to the customer's server or mobile phone for timely query and handling. An alarm policy includes:
⚫ Email notification
⚫ SMS message notification
⚫ System notification
⚫ Alarm dump
⚫ Trap IP address management
⚫ USM user management
⚫ Alarm masking
⚫ Syslog notification (a generic receiver sketch follows this table)
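To make the Syslog notification option in the table above more concrete, the sketch below is a minimal, generic UDP syslog receiver that could run on the customer's server to catch alarm messages. It uses only the Python standard library; the port and the message handling are assumptions, and a real deployment should follow the product documentation when configuring the notification target on DeviceManager.

```python
# Generic UDP syslog receiver (standard library only). Not part of
# DeviceManager or eService; port and message handling are assumptions.

import socketserver


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        data, _sock = self.request                       # UDP: (datagram, socket)
        message = data.decode("utf-8", errors="replace").strip()
        # In practice the message would be parsed and forwarded to the
        # customer's monitoring system; here it is simply printed.
        print(f"alarm from {self.client_address[0]}: {message}")


if __name__ == "__main__":
    # Standard syslog uses UDP port 514 (binding to it usually needs privileges).
    with socketserver.UDPServer(("0.0.0.0", 514), SyslogHandler) as server:
        server.serve_forever()
```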

Daily Maintenance Items


Check and handle alarms. Log in to DeviceManager, or use the configured alarm reporting mode, to view alarms, and handle them promptly based on the suggestions provided.
Weekly Maintenance Items

Item: Inspecting storage devices
Maintenance operation: Use the inspection tool of SmartKit on the maintenance terminal to perform the inspection. The inspection items are as follows:
⚫ Hardware status
⚫ Software status
⚫ Value-added service
⚫ Checking alarms
Note: If suggestions provided by SmartKit cannot resolve the problem, use SmartKit to collect related information and contact Huawei technical support.

Item: Checking the equipment room environment
Maintenance operation: Check the equipment room environment according to the check methods.
Note: If the requirements are not met, adjust the equipment room environment based on related specifications.

Item: Checking the rack internal environment
Maintenance operation: Check whether the rack internal environment meets the requirements.
Note: If the requirements are not met, adjust the rack internal environment based on related requirements.

Information Collection
The information to be collected includes basic information, fault information, storage device
information, networking information, and application server information.

Basic information
⚫ Device serial number and version: Provides the serial number and version of a storage device. Note: You can log in to DeviceManager and query the serial number and version of a storage device in the Basic Information area.
⚫ Customer information: Provides the contact and contact details.

Fault information
⚫ Time when a fault occurs: Records the time when a fault occurs.
⚫ Symptom: Records the symptom of a fault, such as the displayed error dialog box and the received event notification.
⚫ Operations performed before a fault occurs: Records the operations performed before a fault occurs.
⚫ Operations performed after a fault occurs: Records the operations performed from the time when a fault occurs to the time when the fault is reported to the maintenance personnel.

Storage device information
⚫ Hardware module configuration: Records the configuration information about the hardware of a storage device.
⚫ Indicator status: Records the status of indicators on a storage device, especially indicators in orange or red. For details about the indicator status of each component on the storage device, see the Product Description of the corresponding product model.
⚫ Storage system data: Manually export the running data and system logs of a storage device.
⚫ Alarm and log: Manually export alarms and logs of a storage device.

Network information
⚫ Connection mode: Describes how an application server and a storage device are connected, such as the Fibre Channel network mode or iSCSI network mode.
⚫ Switch model: If a switch exists on the network, record the switch model.
⚫ Switch diagnosis information: Manually export the diagnosis information about the running switch, including the startup configuration, current configuration, interface information, time, and system version.
⚫ Network topology: Describes the topology diagram or provides the networking diagram between an application server and a storage device.
⚫ IP address: Describes IP address planning rules or provides the IP address allocation list if an application server is connected to a storage device over an iSCSI network.

Application server information
⚫ OS version: Records the type and version of the OS running on an application server.
⚫ Port rate: Records the port rate of an application server that is connected to a storage device. For details about how to check the port rate, see the Online Help.
⚫ OS log: View and export the OS logs (a collection sketch follows this table).
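As an illustration of how part of the application server information above (the OS version and OS logs noted in the table) could be gathered automatically, the helper below records the OS type and version and copies a recent OS log. The file paths, output layout, and function name are assumptions; adapt them to the actual application server and the product documentation.

```python
# Illustrative host-side collection helper. Paths and names are assumptions.

import json
import platform
import shutil
from datetime import datetime
from pathlib import Path


def collect_application_server_info(output_dir="./collected_info"):
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # OS type and version of the application server.
    info = {
        "collected_at": datetime.now().isoformat(timespec="seconds"),
        "os": platform.system(),
        "os_release": platform.release(),
        "os_version": platform.version(),
        "hostname": platform.node(),
    }
    (out / "os_info.json").write_text(json.dumps(info, indent=2))

    # Copy a recent OS log if it exists (example path for Linux hosts).
    syslog = Path("/var/log/messages")
    if syslog.exists():
        shutil.copy(syslog, out / "os_messages.log")

    return out


if __name__ == "__main__":
    print(f"Information collected in {collect_application_server_info()}")
```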
