HCIA-Storage V5.0 Guide: Huawei Certification
HCIA-Storage
Learning Guide
V5
⚫ Output: generates and outputs the processing result. The form of the output data depends on
the data use. For example, the output data can be an employee's salary.
1.1.1.4 What Is Information
Information refers to the objects that are transmitted and processed by voice, message, and communication systems, and covers all the content that spreads through human society. By acquiring and identifying different information from nature and society, people can distinguish between different things and understand and transform the world. In all communication and control systems, information is a form of universal connection. In 1948, mathematician Claude Elwood Shannon pointed out in his paper A Mathematical Theory of Communication that the essence of information is the resolution of random uncertainty. The most basic unit that creates all things in the universe is information.
Information is the data with context. The context includes:
⚫ Application meanings of data elements and related terms
⚫ Format of data expression
⚫ Time range of the data
⚫ Relevance of data to particular usage
Generally speaking, the concept of "data" is more objective and is not subject to people's will. Information is processed data that has value and meaning.
For example, from the perspective of a football fan, the history of football, football matches, coaches, players, and even the FIFA rules are all football data. Data about his or her favorite team, favorite star, and followed football events is information.
People can never know "all data" but can obtain "adequate information" that allows them to make
decisions.
1.1.1.5 Data vs. Information
Data is a raw, unorganized fact that needs to be processed to become meaningful, whereas information is a set of data that has been processed in a meaningful way according to a given requirement.
Data does not have any specific purpose, whereas information carries a meaning assigned by interpreting data.
Data alone has no significance, while information is significant by itself.
Data never depends on information, while information depends on data.
Data is measured in bits and bytes, while information is measured in meaningful units such as time and quantity.
Data can take the form of structured records, tables, graphs, or trees, whereas information consists of the language, ideas, and thoughts derived from that data.
Data is a record that reflects the attributes of an object and is the specific form that carries
information. Data becomes information after being processed, and information needs to be digitized
into data for storage and transmission.
1.1.1.6 Information Lifecycle Management
Information lifecycle management (ILM) is an information technology strategy and concept, not just
a product or solution, for enterprise users. Data is key to informatization and is the core
competitiveness of an enterprise. Information enters a cycle from the moment it is generated. A
lifecycle is completed in the process of data creation, protection, access, migration, archiving, and
destruction. This process requires good management and cooperation. If the process is not well
managed, too many resources may be wasted or the work will be inefficient due to insufficient
resources.
Cloud storage is a concept derived from cloud computing, and is a new network storage technology.
Based on functions such as cluster applications, network technologies, and distributed file systems, a
cloud storage system uses application software to enable various types of storage devices on
networks to work together, providing data storage and service access externally. When a cloud
computing system stores and manages a huge amount of data, the system requires a matched
number of storage devices. In this way, the cloud computing system turns into a cloud storage
system. Therefore, we can regard a cloud storage system as a cloud computing system with data
storage and management as its core. In a word, cloud storage is an emerging solution that
consolidates storage resources on the cloud for people to access. Users can access data on the cloud
anytime, anywhere, through any networked device.
1.1.3.2 Storage Media
History of HDDs:
⚫ From 1970 to 1991, the storage density of disk platters increased by 25% to 30% annually.
⚫ Starting from 1991, the annual increase rate of storage density surged to 60% to 80%.
⚫ Since 1997, the annual increase rate rocketed up to 100% and even 200%.
⚫ In 1992, 1.8-inch HDDs were invented.
History of SSDs:
⚫ Invented by Dawon Kahng and Simon Min Sze at Bell Labs in 1967, the floating gate transistor became the basis of the NAND flash technology used in solid-state drives (SSDs). If you are familiar with MOSFETs, you will find that this transistor is similar to a MOSFET except for a floating gate in the middle, which is how it got its name. The floating gate is wrapped in high-impedance materials and insulated above and below so that it preserves the charges that enter it through the quantum tunneling effect.
⚫ In 1976, Dataram sold SSDs called Bulk Core. The SSD had the capacity of 2 MB (which was very
large at that time), and used eight large circuit boards, each board with eighteen 256 KB RAMs.
⚫ At the end of the 1990s, some vendors began to use the flash medium to manufacture SSDs. In
1997, altec ComputerSysteme launched a parallel SCSI flash SSD. In 1999, BiTMICRO released an
18-GB flash SSD. Since then, flash SSD has gradually replaced RAM SSD and become the
mainstream product of the SSD market. The flash memory can store data even in the event of
power failure, which is similar to the hard disk drive (HDD).
⚫ In May 2005, Samsung Electronics announced its entry into the SSD market, the first IT giant
entering this market. It is also the first SSD vendor that is widely recognized today.
⚫ In 2006, Nextcom began to use SSDs in its laptops. Samsung launched a 32 GB SSD. According to Samsung, the SSD market was worth $1.3 billion in 2007 and reached $4.5 billion in 2010. In September, Samsung launched the PRAM SSD, another SSD technology that used PRAM as the storage medium and was expected to replace NOR flash memory. In November, Windows Vista was released as the first PC operating system to support SSD-specific features.
⚫ In 2009, the capacity of SSDs caught up with that of HDDs. pureSilicon's 2.5-inch SSD provided 1 TB of capacity and consisted of 128 pieces of 64 Gbit MLC NAND memory. Finally, an SSD provided the same capacity as an HDD of the same size. This was very important. HDD vendors once believed that HDD capacity could be increased easily and at low cost by increasing the platter density, whereas SSD capacity could be doubled only by doubling the number of internal chips, which was difficult. However, the MLC SSD proved that capacity can also be doubled by storing more bits in one cell. In addition, SSD performance is much higher than that of HDDs: this SSD delivered a read bandwidth of 240 MB/s, a write bandwidth of 215 MB/s, a read latency of less than 100 microseconds, 50,000 read IOPS, and 10,000 write IOPS. HDD vendors were facing a huge threat.
SSD flash chips have evolved from SLC (one bit per cell) to MLC (two bits per cell), TLC (three bits per cell), and now QLC (four bits per cell).
History of Flash Storage:
⚫ The storage-class memory launched in 2016 combines the performance advantages of dynamic
random-access memory (DRAM) and NAND, and features large capacity, low latency, and non-
volatility in inexpensive hardware. It is considered the next-gen medium that separates itself
from conventional media.
⚫ Storage class memory (SCM) is non-volatile memory. It is not as fast as DRAM but provides much faster access than NAND flash.
⚫ There are various types of SCM media under development, but the mainstream SCM media are
PCRAM, ReRAM, MRAM, and NRAM.
⚫ Phase-change RAM (PCRAM) uses the electrical conductivity difference between crystalline and
amorphous alloy materials to represent binary values (0 or 1).
⚫ Resistive RAM (ReRAM) controls the formation and fusing status of conductive wires in a cell by
applying different voltages between the upper and lower electrodes to display different
impedance values (memristors) and represent data.
⚫ Magnetic RAM (MRAM) uses the electromagnetic field to change the electron spin direction
and represent different data states.
⚫ Nantero's CNT RAM (NRAM) uses carbon nanotubes to control circuit connectivity and
represent different data states.
1.1.3.3 Interface Protocols
Interface protocols refer to the communication modes and requirements that interfaces must
comply with for information exchange.
Interfaces are used to transfer data between disk cache and host memory. Different disk interfaces
determine the connection speed between disks and controllers.
During the development of storage protocols, the data transmission rate is increasing. As storage
media evolves from HDDs to SSDs, the protocol develops from SCSI to Non-Volatile Memory Express
(NVMe), including the PCIe-based NVMe protocol and NVMe over Fabrics (NVMe-oF) protocol to
connect host networks.
NVMe-oF uses ultra-low-latency transmission protocols such as remote direct memory access
(RDMA) to remotely access SSDs, resolving the trade-off between performance, functionality, and
capacity during scale-out of next-generation data centers.
Released in 2016, the NVMe-oF specification supported both FC and RDMA transports. The RDMA-based framework includes InfiniBand, RDMA over Converged Ethernet (RoCE), and the Internet Wide Area RDMA Protocol (iWARP).
In the NVMe-oF 1.1 specification released in November 2018, TCP was added as a transport option. With RDMA transports such as RoCE, data is transferred directly between host memory and the storage target, so remote SSDs can be accessed with minimal CPU involvement.
NVMe is an SSD controller interface standard. It is designed for PCIe interface-based SSDs and aims
to maximize flash memory performance. It can provide intensive computing capabilities for
enterprise-class workloads in data-intensive industries, such as life sciences, financial services,
multimedia, and entertainment.
Typically, NVMe SSDs are used in database applications. If the NVMe features, such as high speed
and low latency, are used to design file systems, NVMe all-flash array (AFA) can achieve excellent
read/write performance. NVMe AFA can realize efficient storage, network switching, and metadata
communication.
solved by dedicated hardware storage. The concept similar to Memory Fabric also brings changes to
the storage architecture.
The last trend is convergence. In the future, storage will be integrated with the data infrastructure to
support heterogeneous chip computing, streamline diversified protocols, and collaborate with data
processing and big data analytics to reduce data processing costs and improve efficiency. For
example, compared with the storage provided by general-purpose servers, the integration of data
and storage will lower the TCO because data processing is offloaded from servers to storage. Object,
big data, and other protocols are converged and interoperate to implement migration-free big data.
Such convergence greatly affects the design of storage systems and is the key to improving storage
efficiency.
1.1.4.3 Data Storage Trend
In the intelligence era, we must focus on innovation in hardware, protocols, and technologies. From mainframes to x86 and then to virtualization, all-flash storage media and all-IP network protocols have become a major trend.
In the intelligence era, Huawei Cache Coherence System (HCCS) and Compute Express Link (CXL) are
designed based on ultra-fast new interconnection protocols, helping to implement high-speed
interconnection between heterogeneous processors of CPUs and neural processing units (NPUs).
RoCE and NVMe support high-speed data transmission and containerization technologies. In
addition, new hardware and technologies provide abundant choices for data storage. The Memory
Fabric architecture implements memory resource pooling with all-flash + storage class memory
(SCM) and provides microsecond-level data processing performance. SCM media include Optane,
MRAM, ReRAM, FRAM, and Fast NAND. In terms of reliability, system reconstruction and data
migration are involved. As the chip-level design of all-flash storage advances, upper-layer
applications will be unaware of the underlying storage hardware.
Currently, the access performance of SSDs is 100 times higher than that of HDDs, and for NVMe SSDs it is 10,000 times higher. While the latency of storage media has been greatly reduced, the ratio of network latency to total latency has rocketed from less than 5% to about 65%. That is to say, for more than half of the time the storage media are idle, waiting for network communication. Reducing network latency is therefore the key to improving input/output operations per second (IOPS).
Development of Storage Media
Let's move on to Blu-ray storage. The optical storage technology started to develop in the late 1960s,
and experienced three generations of product updates and iterations: CD, DVD, and BD. Blu-ray
storage (or BD) is a relatively new member of the optical storage family. It can retain data for 50 to
100 years, but still cannot meet storage requirements nowadays. We expect to store data for a
longer time. The composite glass material based on gold nanoparticles can stably store data for
more than 600 years.
In addition, technologies such as deoxyribonucleic acid (DNA) storage and quantum storage are
emerging.
As science and technology develop, disk capacity keeps increasing while disk size keeps shrinking. When it comes to storing information, a hard disk is still very large compared with genes, yet the amount of information it stores is far smaller. Therefore, scientists started to use DNA to store data. At first, a few teams tried to write data into the genomes of living cells, but this approach has a couple of disadvantages. Cells replicate, introducing new mutations over time that can change the data. Moreover, cells die, which means the data is lost. Later, teams attempted to store data using artificially synthesized DNA that is free from cells. Although DNA storage density is now high enough that a small amount of artificial DNA can store a large amount of data, data reads and writes are not efficient. In addition, synthesizing DNA molecules is expensive. However, it can be predicted that the cost will fall as gene sequencing technologies develop.
References:
Bohannon, J. (2012). DNA: The Ultimate Hard Drive. Science. Retrieved from:
https://www.sciencemag.org/news/2012/08/dna-ultimate-hard-drive
Akram F, Haq IU, Ali H, Laghari AT (October 2018). "Trends to store digital data in DNA: an overview".
Molecular Biology Reports. 45 (5): 1479–1490. doi:10.1007/s11033-018-4280-y
Although atomic storage has a short history as a technology, it is not a new concept.
As early as December 1959, physicist Richard Feynman gave the lecture "There's Plenty of Room at the Bottom: An Invitation to Enter a New Field of Physics." In this lecture, Feynman considered the possibility of using individual atoms as basic units for information storage.
In July 2016, researchers from Delft University of Technology in the Netherlands published a paper in Nature Nanotechnology. They used chlorine atoms on copper plates to store 1 kilobyte of rewritable data. However, for now the memory can only operate in an ultra-clean vacuum environment or in liquid nitrogen at minus 196°C (77 K).
References:
Erwin, S. A picture worth a thousand bytes. Nature Nanotech 11, 919–920 (2016).
https://doi.org/10.1038/nnano.2016.141
Kalff, F., Rebergen, M., Fahrenfort, E. et al. A kilobyte rewritable atomic memory. Nature Nanotech
11, 926–929 (2016). https://doi.org/10.1038/nnano.2016.131
Because an atom is so small, the capacity of atomic storage will be much larger than that of existing storage media of the same size. With the development of science and technology in recent years, Feynman's idea has become a reality. To pay tribute to Feynman's great idea, some research teams wrote his lecture into atomic memory. Although the idea of atomic storage is incredible and its implementation is becoming possible, atomic memory has strict requirements on the operating environment. Atoms are constantly moving, and even the atoms inside solids vibrate at ambient temperature, so it is difficult to keep them in an ordered state under ordinary conditions. Atomic storage can only be used at low temperatures, in liquid nitrogen, or in vacuum conditions.
If both DNA storage and atomic storage are intended to reduce the size of storage and increase the
capacity of storage, quantum storage is designed to improve performance and running speed.
After years of research, both the storage efficiency and the lifetime of quantum memory have improved, but it is still difficult to put quantum memory into practice. Quantum memory suffers from low efficiency, high noise, a short lifespan, and difficulty operating at room temperature. Only by solving these problems can quantum memory be brought to market.
Quantum states are easily lost due to the influence of the external environment. In addition, it is difficult to guarantee 100% accuracy when preparing quantum states and performing quantum operations.
References:
Wang, Y., Li, J., Zhang, S. et al. Efficient quantum memory for single-photon polarization qubits. Nat.
Photonics 13, 346–351 (2019). https://doi.org/10.1038/s41566-019-0368-8
Dou Jian-Peng, Li Hang, Pang Xiao-Ling, Zhang Chao-Ni, Yang Tian-Huai, Jin Xian-Min. Research
progress of quantum memory. Acta Physica Sinica, 2019, 68(3): 030307. doi:
10.7498/aps.68.20190039
Storage Network Development
In traditional data centers, an IP SAN uses Ethernet technology to form a multi-hop symmetric network architecture and uses the TCP/IP protocol stack for data transmission. FC SAN
⚫ Interface modules provide service or management ports and are field replaceable units.
In computer science, data is a generic term for all media, such as numbers, letters, symbols, and analog parameters, that can be input to and processed by computer programs. Computers store and process a wide range of objects that generate complex data.
2.1.2.1.2 Controller Enclosure Components
⚫ A controller is the core component of a storage system. It processes storage services, receives
configuration management commands, saves configuration data, connects to disks, and saves
critical data to coffer disks.
➢ The CPU and cache on the controller work together to process I/O requests from the host
and manage RAID of the storage system.
➢ Each controller has multiple built-in disks to store system data. If a power failure occurs,
these disks also store cache data. Disks on different controllers are redundant of each
other.
⚫ Front-end (FE) ports are used for service communication between application servers and the
storage system, that is, for processing host I/Os.
⚫ Back-end (BE) ports connect a controller enclosure to a disk enclosure and provide channels for
reading/writing data from/to disks.
⚫ A cache is a memory chip on a disk controller. It provides fast data access speed and functions
as a buffer between the internal storage and external interfaces.
⚫ An engine is a core component of a development program or system on an electronic platform.
It is usually the support part of a program or a set of systems.
⚫ Coffer disks are used to store user data, system configurations, logs, and dirty data in the cache
in the event of an unexpected power outage.
➢ Built-in coffer disk: Each controller of Huawei OceanStor Dorado V6 has one or two built-in
SSDs as coffer disks. For details, see the product documentation.
➢ External coffer disk: The storage system automatically selects four disks as coffer disks. Each
coffer disk provides 2 GB space to form a RAID 1 group. The remaining space of the coffer
disks can be used to store service data. If a coffer disk is faulty, the system automatically
replaces the faulty coffer disk with a normal disk to ensure redundancy.
⚫ Power module: The AC power module supplies power to the controller enclosure, allowing the
enclosure to operate normally at maximum power.
➢ A 4 U controller enclosure has four power modules (PSU 0, PSU 1, PSU 2, and PSU 3). PSU 0
and PSU 1 form a power plane to supply power to controllers A and C, and are redundant of
each other. PSU 2 and PSU 3 form the other power plane to supply power to controllers B
and D, and are redundant of each other. For reliability purposes, it is recommended that
you connect PSU 0 and PSU 2 to one PDU, and PSU 1 and PSU 3 to another PDU.
➢ A 2 U controller enclosure has two power modules (PSU 0 and PSU 1) to supply power to
controllers A and B. The two power modules form a power plane and are redundant of each
other. For reliability purposes, it is recommended that you connect PSU 0 and PSU 1 to
different PDUs.
2.1.2.2 Disk Enclosure
2.1.2.2.1 Disk Enclosure Design
The disk enclosure uses a modular design and consists of a system subrack, expansion modules,
power modules, and disks.
⚫ The system subrack integrates a backplane to provide signal and power connectivity among
modules.
⚫ The expansion module provides expansion ports to connect to a controller enclosure or another
disk enclosure for data transmission.
⚫ The power module supplies power to the disk enclosure, allowing the enclosure to operate
normally at maximum power.
⚫ Disks provide storage space for the storage system to save service data, system data, and cache
data. Specific disks are used as coffer disks.
2.1.2.3 Expansion Module
2.1.2.3.1 Expansion Module
Each expansion module provides one P0 and one P1 expansion port. The expansion module provides
expansion ports to connect to a controller enclosure or another disk enclosure for data transmission.
2.1.2.3.2 CE Switch
Huawei CloudEngine series fixed switches are next-generation Ethernet switches designed for data
centers and provide high performance, high port density, and low latency. The switches use flexible
front-to-rear or rear-to-front airflow design and can be used in IP SANs and distributed storage
networks.
2.1.2.3.3 Fibre Channel Switch
Fibre Channel switches are high-speed network transmission relay devices that transmit data over
optical fibers. They accelerate transmission and protect against interference, and are used on FC
SANs.
2.1.2.3.4 Device Cable
A serial cable connects the serial port of the storage system to the maintenance terminal.
Mini SAS HD cables connect to expansion ports on controller and disk enclosures. There are mini SAS
HD electrical cables and mini SAS HD optical cables.
100G QSFP28 cables are used for direct connection between controllers or for connection to smart
disk enclosures.
25G SFP28 cables are used for front-end networking.
Optical fibers connect the storage system to Fibre Channel switches. One end of the optical fiber
connects to a Fibre Channel host bus adapter (HBA), and the other end connects to the Fibre
Channel switch or the storage system. An optical fiber uses Lucent Connectors (LCs) at both ends.
MPO-4*DLC optical fibers, which are dedicated for 8 Gbit/s Fibre Channel interface modules (8
ports) and 16 Gbit/s Fibre Channel interface modules (8 ports), can be used to connect the storage
system to Fibre Channel switches.
2.1.2.4 Disk
2.1.2.4.1 Disk Components
⚫ A platter is coated with magnetic materials on both surfaces. The magnetic grains on the platter
are polarized to represent a binary information unit (or bit).
⚫ A read/write head reads data from and writes data to a platter. It changes the N and S polarities
of magnetic grains on the surface of the platter to save data.
⚫ The actuator arm moves the read/write head to the specified position.
⚫ The spindle has a motor and bearing underneath. It rotates the platter to move the specified
position on the platter to the read/write head.
⚫ The control circuit controls the speed of the platter and movement of the actuator arm, and
delivers commands to the head.
Each platter of a disk has two read/write heads, which respectively read and write data on two
surfaces of the platter.
The head floats above the platter on a cushion of air and does not touch the platter, so it can move back and forth between tracks at a high speed. If the distance between the head and the platter is too large, the signal read by the head is weak; if the distance is too small, the head may rub against the platter surface. Therefore, the platter surface must be smooth and flat. Any foreign matter or dust will cause the head to scrape the magnetic surface, causing permanent data corruption.
Working principles:
⚫ At the beginning, the read/write head is in the landing zone near the spindle of the platters.
⚫ The spindle connects to all platters and a motor. The spindle motor rotates at a constant speed
to drive the platters.
⚫ When the spindle rotates, there is a small gap between the head and platter, which is called the
flying height of the head.
⚫ The head is attached to the end of the actuator arm, which drives the head to the specified
position above the platter where data needs to be written or read.
⚫ The head reads and writes data in binary format on the platter surface. The read data is stored
in the flash chip of the disk and then transmitted to the program.
⚫ Platter surface: Each platter of a disk has two surfaces, both of which can store data and are
valid. All valid surfaces are numbered in sequence, starting from 0 for the top surface. In a disk
system, a surface number is also referred to as a head number, because each valid surface has a
read/write head.
⚫ Track: Tracks are concentric circles around the spindle on a platter. Data is recorded on the
tracks. Tracks are numbered from the outermost circle to the innermost one, starting from 0.
Each platter surface has 300 to 1024 tracks. New types of large-capacity disks have even more
tracks on each surface. Generally, the tracks per inch (TPI) on a platter are used to measure the
track density. Tracks are only magnetized areas on the platter surfaces and are invisible to
human eyes.
⚫ Cylinder: A cylinder is formed by tracks with the same number on all platter surfaces of a disk.
The heads of each cylinder are numbered from top to bottom, starting from 0. Data is read and
written based on cylinders. That is, head 0 in a cylinder reads and writes data first, and then the
other heads in the same cylinder read and write data in sequence. After all heads have
completed reads and writes in a cylinder, the heads move to the next cylinder. Selection of
cylinders is a mechanical switching process called seek. Generally, the position of heads in a disk
is indicated by the cylinder number instead of the track number.
⚫ Sector: Each track is divided into smaller units called sectors to arrange data orderly. A sector is
the smallest storage unit that can be independently addressed in a disk. Tracks may have
different number of sectors. Generally, a sector can store 512 bytes of user data, but some disks
can be formatted into larger sectors, such as 4 KB sectors.
Disks may have one or multiple platters. However, a disk allows only one head to read and write
data at a time. Therefore, increasing the number of platters and heads only improves the disk
capacity, but cannot improve the throughput or I/O performance of the disk.
Disk capacity = Number of cylinders x Number of heads x Number of sectors x Sector size. The unit is
MB or GB. The disk capacity is determined by the capacity of a single platter and the number of
platters.
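As a quick illustration of this formula, the following Python snippet computes the capacity of a hypothetical disk geometry. The cylinder, head, and sector counts are made-up values for illustration only:

# Hypothetical CHS geometry, for illustration only
cylinders = 10000          # number of cylinders
heads = 8                  # number of read/write heads (valid surfaces)
sectors_per_track = 500    # sectors per track
sector_size = 512          # bytes per sector

capacity_bytes = cylinders * heads * sectors_per_track * sector_size
print(f"Capacity: {capacity_bytes / 1024 ** 3:.2f} GiB")  # about 19.07 GiB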
Because the processing speed of a CPU is much faster than that of a disk, the CPU must wait until the
disk completes a read/write operation before issuing a new command. To solve this problem, a
cache is added to the disk to improve the read/write speed.
For example, the data transfer rate of IDE/ATA disks can reach 133 MB/s, and that of SATA II disks can reach 300 MB/s.
⚫ In the case of random I/Os, the head must change tracks frequently. The data transmission time
is much less than the time for track changes (not in the same order of magnitude). Therefore,
the data transmission time can be ignored.
Theoretically, the maximum IOPS of a disk can be calculated using the following formula: IOPS =
1000 ms/(Seek time + Rotation latency). The data transmission time is ignored. For example, if the
average seek time is 3 ms, the theoretical maximum IOPS for 7200 rpm, 10k rpm, and 15k rpm disks
is 140, 167, and 200, respectively.
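The figures above can be reproduced with a short calculation. The sketch below is a minimal Python illustration of this formula, assuming a 3 ms average seek time and taking the rotation latency as the time of half a revolution:

def max_iops(avg_seek_ms: float, rpm: int) -> float:
    """Theoretical maximum IOPS = 1000 ms / (seek time + rotation latency)."""
    rotation_latency_ms = 60000 / rpm / 2   # half a revolution, in milliseconds
    return 1000 / (avg_seek_ms + rotation_latency_ms)

for rpm in (7200, 10000, 15000):
    print(rpm, round(max_iops(3, rpm)))     # prints about 140, 167, and 200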
2.1.2.4.6 Data Transfer Mode
Parallel transmission:
⚫ Parallel transmission features high efficiency, short distances, and low frequency.
⚫ In long-distance transmission, using multiple lines is more expensive than using a single line.
⚫ Long-distance transmission requires thicker conducting wires to reduce signal attenuation, but it
is difficult to bundle them into a single cable.
⚫ In long-distance transmission, the time for data on each line to reach the peer end varies due to
wire resistance or other factors. The next transmission can be initiated only after data on all
lines has reached the peer end.
⚫ When the transmission frequency is high, the circuit oscillation is serious and great interference
is generated between the lines. Therefore, the frequency of parallel transmission must be set
properly.
Serial transmission:
⚫ Per clock cycle, serial transmission is less efficient than parallel transmission, but its speed can be raised by increasing the transmission frequency. As a result, the overall speed of serial transmission is generally higher than that of parallel transmission.
⚫ Serial transmission is used for long-distance transmission. The PCIe interface is a typical example of serial transmission, with a single-lane transmission rate of up to 2.5 Gbit/s.
2.1.2.4.7 HDD Port Technology
Disks are classified into IDE, SCSI, SATA, SAS, and Fibre Channel disks by port. In addition to ports,
these disks also differ in the mechanical base.
IDE and SATA disks use the ATA mechanical base and are suitable for single-task processing.
SCSI, SAS, and Fibre Channel disks use the SCSI mechanical base and are suitable for multi-task
processing.
Comparison:
⚫ Under high data throughput, SCSI disks provide higher processing speed than ATA disks.
⚫ In the case of multi-task processing, the read/write head moves frequently, which causes
overheating on ATA disks.
⚫ SCSI disks provide higher reliability than ATA disks.
IDE disk port:
⚫ Multiple ATA versions have been released, including ATA-1 (IDE), ATA-2 (Enhanced IDE/Fast ATA),
ATA-3 (Fast ATA-2), ATA-4 (ATA33), ATA-5 (ATA66), ATA-6 (ATA100), and ATA-7 (ATA133).
⚫ The advantages and disadvantages of the ATA port are as follows:
➢ Advantages: low price and good compatibility
➢ Disadvantages: Low speed; built-in use only; strict restriction on the cable length
➢ The transmission rate of the PATA port does not meet the current user needs.
SATA port:
⚫ During data transmission, the data lines and signal lines are separated and use independent transmission clock frequencies. The transmission rate of SATA is far higher than that of PATA.
⚫ Advantages:
➢ Generally, a SATA port has 7+15 pins and uses a single channel. The transmission rate of
SATA is higher than that of ATA.
➢ SATA uses the cyclic redundancy check (CRC) for instructions and data packets to ensure
data transmission reliability.
➢ SATA has a better anti-interference capability than ATA.
SCSI port:
⚫ SCSI disks were developed to replace IDE disks and provide a higher rotation speed and transmission rate. SCSI was originally a bus-type interface that works independently of the system bus.
⚫ Advantages:
➢ Applicable to a wide range of devices. One SCSI controller card can connect to 15 devices
simultaneously.
➢ High performance (multi-task processing, low CPU usage, fast rotation speed, and high
transmission rate)
➢ SCSI disks can be external or built-in ones, and are hot-swappable.
⚫ Disadvantages:
➢ High cost and complex installation and configuration.
SAS port:
⚫ SAS is similar to SATA in its use of a serial architecture for a high transmission rate and
streamlined internal space with shorter internal connections.
⚫ SAS improves the efficiency, availability, and scalability of the storage system, and is backward
compatible with SATA in terms of the physical and protocol layers.
⚫ Advantages:
➢ SAS is superior to SCSI in terms of transmission rate and anti-interference, and supports a
longer connection distance.
⚫ Disadvantages:
➢ SAS disks are more expensive.
Fibre Channel port:
⚫ Fibre Channel (FC) was not originally designed as a disk interface but for network transmission. It has gradually been applied to disk systems in pursuit of higher speeds.
⚫ Advantages:
➢ Easy to upgrade. Supports optical fiber cables with a length over 10 km.
➢ Large bandwidth.
➢ Strong universality.
⚫ Disadvantages:
➢ High cost.
➢ Complex to build.
2.1.2.4.8 SSD Overview
Unlike traditional disks that use magnetic materials to store data, SSDs use NAND flash (using cells as
the storage units) to store data. NAND flash is a non-volatile random access storage medium that
can retain stored data after the power is turned off. It can quickly and compactly store digital
information.
SSDs do not have high-speed rotational components, and feature high performance, low power
consumption, and zero noise.
SSDs have no mechanical parts, but this does not mean that they have an infinite service life. Because NAND flash is a non-volatile medium, existing data must be erased before new data can be written. However, each cell can be erased only a limited number of times. When the upper limit is reached, the cell can no longer reliably store data.
2.1.2.4.9 SSD Architecture
The host interface is the protocol and physical interface used by a host to access an SSD. Common
interfaces are SATA, SAS, and PCIe.
The SSD controller is the core SSD component responsible for read and write access from a host to
the back-end media and for protocol conversion, table entry management, data caching, and data
checking.
DRAM is the cache for flash translation layer (FTL) entries and data.
NAND flash is a non-volatile random access storage medium that stores data.
The SSD controller supports multiple concurrent channels, allowing time-division multiplexing of the flash dies on each channel, and supports TCQ and NCQ, which allow it to respond to multiple I/O requests at a time.
2.1.2.4.10 NAND Flash
Internal storage units in NAND flash include LUNs, planes, blocks, pages, and cells.
NAND flash working principles: NAND flash stores data using floating gate transistors. The threshold
voltage changes based on the number of electric charges stored in a floating gate. Data is then
represented using the read voltage of the transistor threshold.
⚫ A LUN is the smallest physical unit that can be independently encapsulated. A LUN typically
contains multiple planes.
⚫ A plane has an independent page register. It typically contains one to two thousand blocks, all with odd or all with even sequence numbers.
⚫ A block is the smallest erasure unit and generally consists of multiple pages.
⚫ A page is the smallest programming and read unit. Its size is usually 16 KB.
⚫ A cell is the smallest erasable, programmable, and readable unit found in pages. A cell
corresponds to a floating gate transistor that stores one or multiple bits.
A page is the basic unit of programming and reading, and a block is the basic unit of erasing.
Each P/E cycle causes some damage to the insulation layer of the floating gate transistor. If erasing
or programming a block fails, the block is considered as a bad block. When the number of bad blocks
reaches a threshold (4%), the NAND flash reaches the end of its service life.
2.1.2.4.11 SLC, MLC, TLC, and QLC
NAND flash chips can be classified into the following types based on the number of bits stored in a
cell:
⚫ A single level cell (SLC) can store one bit of data: 0 or 1.
⚫ A multi level cell (MLC) can store two bits of data: 00, 01, 10, and 11.
⚫ A triple level cell (TLC) can store three bits of data: 000, 001, 010, 011, 100, 101, 110, and 111.
⚫ A quad level cell (QLC) can store four bits of data: 0000, 0001, 0010, 0011, 0100, 0101, 0110,
0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, and 1111.
The four types of cells have similar costs but store different amounts of data. Originally, the capacity
of an SSD was only 64 GB or smaller. Now, a TLC SSD can store up to 2 TB of data. However, different
types of cells have different life cycles, resulting in different SSD reliability. The life cycle of SSDs is
also an important factor in selecting SSDs.
The following shows the logic diagram of a flash chip (Toshiba 3D-TLC):
⚫ A page is logically formed by 18336*8=146688 cells. Each page can store 16 KB content and
1952 bytes of ECC data. A page is the minimum I/O unit of the flash chip.
⚫ Every 768 pages form a block. Every 1478 blocks form a plane.
⚫ A flash chip consists of two planes. One plane stores blocks with odd sequence numbers, and
the other stores blocks with even sequence numbers. The two planes can be operated
concurrently.
Because ECC must be performed on the data stored in NAND flash, the size of a page in NAND flash is not exactly 16 KB but includes extra bytes. For example, the actual size of a 16 KB page is 16384 + 1952 bytes: 16384 bytes store user data and 1952 bytes store ECC check codes.
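Using only the figures given above (16 KB of user data per page, 768 pages per block, 1478 blocks per plane, and two planes per chip), the usable capacity of such a chip can be estimated with a rough sketch like the following:

page_data_bytes = 16 * 1024    # user data per page, ECC bytes excluded
pages_per_block = 768
blocks_per_plane = 1478
planes_per_chip = 2

chip_bytes = page_data_bytes * pages_per_block * blocks_per_plane * planes_per_chip
print(f"Usable capacity per chip: {chip_bytes / 1024 ** 3:.1f} GiB")  # about 34.6 GiB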
2.1.2.4.12 Address Mapping Management
The logical block address (LBA) may refer to an address of a data block or a data block pointed to by
an address.
PBA: physical block address
The host accesses the SSD through the LBA. Each LBA represents a sector (generally 512 bytes).
Generally, the host OS accesses the SSD in the unit of 4 KB. The basic unit for the host to access the
SSD is called host page.
Inside an SSD, the flash page is the basic unit for the SSD controller to access the flash chip, which is
called physical page. Each time the host writes a host page, the SSD controller writes it to a physical
page and records their mapping relationship.
When the host reads a host page, the SSD finds the requested data according to the mapping
relationship.
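The mapping behavior described above can be illustrated with a minimal page-mapping flash translation layer (FTL) sketch. The class and its simple mapping table below are assumptions for illustration only, not the actual SSD controller implementation:

# Minimal page-mapping FTL sketch: host page -> physical flash page
class SimpleFTL:
    def __init__(self, num_physical_pages: int):
        self.l2p = {}                                # host page -> physical page mapping
        self.free_pages = list(range(num_physical_pages))
        self.garbage = set()                         # invalidated (aged) physical pages
        self.flash = {}                              # physical page -> stored data

    def write(self, host_page: int, data: bytes) -> None:
        if not self.free_pages:
            raise RuntimeError("no free pages: garbage collection required")
        new_page = self.free_pages.pop(0)            # flash never overwrites in place
        if host_page in self.l2p:
            self.garbage.add(self.l2p[host_page])    # old physical page becomes garbage
        self.flash[new_page] = data
        self.l2p[host_page] = new_page               # record the new mapping

    def read(self, host_page: int) -> bytes:
        return self.flash[self.l2p[host_page]]       # look up the mapping, then read

ftl = SimpleFTL(num_physical_pages=8)
ftl.write(0, b"A")         # mapping 0 -> X is established
ftl.write(0, b"A2")        # rewrite: new mapping 0 -> Y, page X becomes garbage
print(ftl.read(0))         # b'A2'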
2.1.2.4.13 Read and Write Process on an SSD
Data write process on an SSD:
⚫ The SSD controller connects to eight flash dies through eight channels. For better explanation,
the figure shows only one block in each die. Each square in the blocks represents a page
(assuming that the size is 4 KB).
➢ The host writes 4 KB data to the block of channel 0 (occupying one page).
➢ The host continues to write 16 KB data. In this case, 4 KB data is written to each block of
channels 1 to 4.
➢ The host continues to write data to the blocks until all blocks are full.
⚫ When the blocks on all channels are full, the SSD controller selects a new block to write data in
the same way.
⚫ Green indicates valid data and red indicates invalid data. When the user no longer needs the
data on a flash page, the data on this page becomes aged or invalid and its mapping relationship
is replaced by a new mapping.
⚫ For example, host page A was originally stored in flash page X, and the mapping relationship
was A->X. Later, the host rewrites the host page. Because the flash memory does not overwrite
data, the SSD writes the new data to a new page Y. In this case, a new mapping relationship
A->Y is established, and the original mapping relationship is canceled. The data in page X
becomes aged and invalid, which is called garbage data.
⚫ The host continues to write data to the SSD until its space is used up. In this case, the host
cannot write more data unless the garbage data is cleared.
Data read process on an SSD:
⚫ Whether the read speed can be improved 8-fold depends on whether the data to be read is
evenly distributed in the blocks of each channel. If the 32 KB data is stored in the blocks of
channels 1-4, the read speed is improved 4-fold at most. That is why smaller files are
transmitted at a lower rate.
Short response time: Traditional HDDs spend most of the time in seeking and mechanical latency,
limiting the data transmission efficiency. SSDs use NAND flash as the storage medium, which does
not cause any seek time or mechanical latency, delivering quick responses to read and write
requests.
High read/write efficiency: When an HDD is performing random read/write operations, the head
moves back and forth, resulting in low read/write efficiency. In comparison, an SSD calculates data
storage locations using an internal controller, which saves the mechanical operation time and greatly
improves read/write efficiency.
When a large number of SSDs are used, they have a prominent advantage in saving power.
2.1.2.4.14 SCM Card
SCM is the next-generation of storage media that features both persistence and fast access. Its
read/write speed is faster than that of flash memory.
An SCM card is a cache acceleration card of the SCM media type. To use SmartCache for OceanStor
Dorado V6 all-flash storage (6.1.0 and later versions), install an SCM card on the controller
enclosure.
2.1.2.5 Interface Modules
2.1.2.5.1 Front-End: GE Interface Modules
A GE electrical interface module provides four 1 Gbit/s electrical ports and is used for HyperMetro
quorum networking.
A 10GE electrical interface module provides four 10 Gbit/s electrical ports for connecting storage
devices to application servers, which can be used only after electrical modules are installed.
A 40GE interface module provides two 40 Gbit/s optical ports for connecting storage devices to
application servers.
A 100GE interface module provides two 100 Gbit/s optical ports for connecting storage devices to
application servers.
A 25 Gbit/s RDMA interface module provides four 25 Gbit/s optical ports for direct connections
between two controller enclosures.
A 100 Gbit/s RDMA interface module provides two 100 Gbit/s optical ports for connecting controller
enclosures to scale-out switches or smart disk enclosures. In the labels, SO stands for scale-out and
BE stands for back-end.
A 12 Gbit/s SAS expansion module provides four 4 x 12 Gbit/s mini SAS HD expansion ports to
connect controller enclosures to 2 U SAS disk enclosures.
Forward and backward connection networking: Expansion module A uses forward connection and expansion module B uses backward connection. (OceanStor Dorado V6 and new converged storage use forward connection.)
Cascading level: The number of cascading disk enclosures cannot exceed the preset threshold.
2.1.3.3 Smart Disk Enclosure Scale-up Networking Principles
Huawei smart disk enclosure is used as an example.
Port consistency: In a loop, the downlink port (P1) of an upper-level disk enclosure is connected to
the uplink port (P0) of a lower-level disk enclosure.
Dual-plane networking: Expansion module A is connected to controller A and expansion module B is
connected to controller B.
Symmetric networking: On controllers A and B, ports with the same port IDs and slot IDs are
connected to one enclosure.
Forward connection networking: Both expansion modules A and B use forward connection.
Cascading level: The number of cascading disk enclosures cannot exceed the preset threshold.
2.1.3.4 Local Write Process
The LUN to which a host writes data is owned by the engine to which the host delivers write I/Os.
The process is as follows:
1 A host delivers write I/Os to engine 0.
2 Engine 0 writes the data into the local cache, implements mirror protection, and returns a
message indicating that data is written successfully.
3 Engine 0 flushes dirty data onto disks. If the target disk is owned by the local engine, engine 0 directly delivers the write I/Os.
4 If the target disk is owned by a remote engine, engine 0 transfers the I/Os to the engine (for example, engine 1) where the disk resides.
5 Engine 1 writes dirty data onto disks.
2.1.3.5 Non-local Write Process
The LUN to which a host writes data is not owned by the engine to which the host delivers write
I/Os. The process is as follows:
1 The LUN is owned by engine 0 and the host delivers write I/Os to engine 2.
2 After detecting that the LUN is owned by engine 0, engine 2 transfers the write I/Os to engine 0.
3 Engine 0 writes the data into the local cache, implements mirror protection, and returns a
message indicating that data is written successfully.
4 Engine 2 returns the write success message to the host.
5 Engine 0 flushes dirty data onto disks. If the target disk is owned by the local engine, engine 0 directly delivers the write I/Os.
6 If the target disk is owned by a remote engine, engine 0 transfers the I/Os to the engine (for example, engine 1) where the disk resides.
7 Engine 1 writes dirty data onto disks.
2.1.3.6 Local Read Process
The LUN from which a host reads data is owned by the engine to which the host delivers read I/Os.
The process is as follows:
1 A host delivers read I/Os to engine 0.
2 If the read I/Os are hit in the cache of engine 0, engine 0 returns the data to the host.
3 If the read I/Os are not hit in the cache of engine 0, engine 0 reads data from disks. If the target disk is owned by the local engine, engine 0 reads the data directly from the disk.
4 After the read I/Os are hit locally, engine 0 returns the data to the host.
5 If the target disk is owned by a remote engine, engine 0 transfers the I/Os to the engine (for example, engine 1) where the disk resides.
6 Engine 1 reads data from the disk.
7 Engine 1 accomplishes the data read.
8 Engine 1 returns the data to engine 0, which then returns the data to the host.
2.1.3.7 Non-local Read Process
The LUN from which a host reads data is not owned by the engine to which the host delivers read I/Os. The process is as follows:
1 The LUN is owned by engine 0 and the host delivers read I/Os to engine 2.
2 After detecting that the LUN is owned by engine 0, engine 2 transfers the read I/Os to engine 0.
3 If the read I/Os are hit in the cache of engine 0, engine 0 returns the data to engine 2.
4 Engine 2 returns the data to the host.
5 If the read I/Os are not hit in the cache of engine 0, engine 0 reads data from disks. If the target disk is owned by the local engine, engine 0 reads the data directly from the disk.
6 After the read I/Os are hit locally, engine 0 returns the data to engine 2, and then engine 2 returns the data to the host.
7 If the target disk is on a remote engine, engine 0 transfers the I/Os to the engine (for example, engine 1) where the disk resides.
8 Engine 1 reads data from the disk.
9 Engine 1 completes the data read.
10 Engine 1 returns the data to engine 0, and engine 0 returns the data to engine 2, which then returns the data to the host.
and data transmission speed. The RAID controller manages routes and buffers, and controls
data flows between the host and the RAID array. Hardware RAID is usually used in servers.
⚫ Software RAID has no built-in processor or I/O processor but relies on a host processor.
Therefore, a low-speed CPU cannot meet the requirements for RAID implementation. Software
RAID is typically used in enterprise-class storage devices.
Disk striping: Space in each disk is divided into multiple strips of a specific size. Data is also divided
into blocks based on strip size when data is being written.
⚫ Strip: A strip consists of one or more consecutive sectors in a disk, and multiple strips form a
stripe.
⚫ Stripe: A stripe consists of strips of the same location or ID on multiple disks in the same array.
RAID generally provides two methods for data protection.
⚫ One is storing data copies on another redundant disk to improve data reliability and read
performance.
⚫ The other is parity. Parity data is additional information calculated using user data. For a RAID
array that uses parity, an additional parity disk is required. The XOR (symbol: ⊕) algorithm is
used for parity.
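As a simple illustration of XOR parity, the sketch below computes a parity strip from three data strips and then rebuilds a lost strip from the surviving strips and the parity. It is a conceptual example only, not a specific RAID implementation:

# XOR parity: P = D0 xor D1 xor D2; any single lost strip can be rebuilt
def xor_strips(*strips: bytes) -> bytes:
    result = bytearray(len(strips[0]))
    for strip in strips:
        for i, byte in enumerate(strip):
            result[i] ^= byte
    return bytes(result)

d0, d1, d2 = b"\x11\x22", b"\x33\x44", b"\x55\x66"
parity = xor_strips(d0, d1, d2)

# Suppose the disk holding d1 fails: rebuild it from the surviving strips and parity
rebuilt_d1 = xor_strips(d0, d2, parity)
assert rebuilt_d1 == d1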
2.2.1.2 RAID 0
RAID 0, also referred to as striping, provides the best storage performance among all RAID levels.
RAID 0 uses the striping technology to distribute data to all disks in a RAID array.
A RAID 0 array provides a large-capacity disk with high I/O processing performance. Before the
introduction of RAID 0, there was a technology similar to RAID 0, called Just a Bundle Of Disks
(JBOD). JBOD refers to a large virtual disk consisting of multiple disks. Unlike RAID 0, JBOD does not
concurrently write data blocks to different disks. JBOD uses another disk only when the storage
capacity in the first disk is used up. Therefore, JBOD provides a total available capacity which is the
sum of capacities in all disks but provides the performance of individual disks.
In contrast, upon receiving a read request, a RAID 0 array locates the target data blocks and reads data from all disks concurrently. The preceding figure shows a data read process. A RAID 0 array provides read/write performance that is directly proportional to the number of disks.
2.2.1.3 RAID 1
RAID 1, also referred to as mirroring, maximizes data security. A RAID 1 array uses two identical disks
including one mirror disk. When data is written to a disk, a copy of the same data is stored in the
mirror disk. When the source (physical) disk fails, the mirror disk takes over services from the source
disk to maintain service continuity. The mirror disk is used as a backup to provide high data
reliability.
The amount of data that can be stored in a RAID 1 array is only equal to the capacity of a single disk, with a copy of the data retained on the other disk. That is, each gigabyte of data requires two gigabytes of disk space. Therefore, a RAID 1 array consisting of two disks has a space utilization of 50%.
2.2.1.4 RAID 3
RAID 3 is similar to RAID 0 but uses dedicated parity stripes. In a RAID 3 array, a dedicated disk
(parity disk) is used to store the parity data of strips in other disks in the same stripe. If incorrect
data is detected or a disk fails, data in the faulty disk can be recovered using the parity data. RAID 3
applies to data-intensive or single-user environments where data blocks need to be continuously
accessed for a long time. RAID 3 writes data to all member data disks. However, when new data is
written to any disk, RAID 3 recalculates and rewrites parity data. Therefore, when a large amount of
data from an application is written, the parity disk in a RAID 3 array needs to process heavy
workloads. Parity operations have certain impact on the read and write performance of a RAID 3
array. In addition, the parity disk is subject to the highest failure rate in a RAID 3 array due to heavy
workloads. A write penalty occurs when just a small amount of data is written to multiple disks,
which does not improve disk performance as compared with data writes to a single disk.
Currently, RAID 6 is implemented in different ways. Different methods are used for obtaining parity
data.
RAID 6 P+Q
When data is written to the RAID 10 array, data blocks are concurrently written to sub-arrays by
mirroring. As shown in the figure, D 0 is written to physical disk 1, and its copy is written to physical
disk 2.
If one disk in each of the two RAID 1 sub-arrays fails (for example, disk 2 and disk 4), access to data in the RAID 10 array remains normal, because complete copies of the data on the faulty disks are retained on the other two disks (disk 1 and disk 3). However, if both disks in the same RAID 1 sub-array (for example, disks 1 and 2) fail at the same time, the data becomes inaccessible.
Theoretically, RAID 10 tolerates failures of half of the physical disks. However, in the worst case,
failures of two disks in the same sub-array may also cause data loss. Generally, RAID 10 protects data
against the failure of a single disk.
2.2.1.8 RAID 50
RAID 50 combines RAID 0 and RAID 5. Two RAID 5 sub-arrays form a RAID 0 array. The two RAID 5
sub-arrays are independent of each other. A RAID 50 array requires at least six disks because a RAID
5 sub-array requires at least three disks.
will be lost if any additional disk or data block fails. Therefore, a longer reconstruction duration
results in higher risk of data loss.
⚫ Severe impact on services: During reconstruction, member disks are busy with reconstruction and deliver poor service performance, which affects the operation of upper-layer services.
To solve the preceding problems of traditional RAID and ride on the development of virtualization
technologies, the following alternative solutions emerged:
⚫ LUN virtualization: A traditional RAID array is further divided into small units. These units are
regrouped into storage spaces accessible to hosts.
⚫ Block virtualization: Disks in a storage pool are divided into small data blocks. A RAID array is
created using these data blocks so that data can be evenly distributed to all disks in the storage
pool. Then, resources are managed based on data blocks.
2.2.2.2 Basic Principles of RAID 2.0+
RAID 2.0+ divides a physical disk into multiple chunks (CKs). CKs in different disks form a chunk group
(CKG). CKGs have a RAID relationship with each other. Multiple CKGs form a large storage resource
pool. Resources are allocated from the resource pool to hosts.
Implementation mechanism of RAID 2.0+:
⚫ Multiple SSDs form a storage pool.
⚫ Each SSD is then divided into chunks (CKs) of a fixed size (typically 4 MB) for logical space
management.
⚫ CKs from different SSDs form chunk groups (CKGs) based on the RAID policy specified on
DeviceManager.
⚫ CKGs are further divided into grains (typically 8 KB). Grains are mapped to LUNs for refined
management of storage resources.
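The hierarchy described above (SSDs carved into CKs, CKs from different disks grouped into CKGs, CKGs divided into grains) can be sketched conceptually as follows. The chunk and grain sizes are the typical values mentioned above, and the selection logic is a simplified assumption, not Huawei's actual algorithm:

# Conceptual sketch of RAID 2.0+ space carving, not the real implementation
CK_SIZE = 4 * 1024 * 1024      # typical chunk (CK) size: 4 MB
GRAIN_SIZE = 8 * 1024          # typical grain size: 8 KB

def chunks_per_disk(disk_capacity_bytes: int) -> int:
    """Number of CKs a single SSD contributes to the storage pool."""
    return disk_capacity_bytes // CK_SIZE

def build_ckg(disk_ids: list, raid_width: int) -> list:
    """Pick one CK from each of raid_width different disks to form a CKG."""
    assert len(disk_ids) >= raid_width, "a CKG needs CKs from different disks"
    return disk_ids[:raid_width]

def grains_per_ckg(raid_width: int, parity_columns: int) -> int:
    """Grains of usable space provided by one CKG."""
    usable_bytes = (raid_width - parity_columns) * CK_SIZE
    return usable_bytes // GRAIN_SIZE

print(chunks_per_disk(960 * 10 ** 9))       # CKs contributed by a 960 GB SSD
print(build_ckg([0, 1, 2, 3, 4, 5], 5))     # one CKG spanning 5 different disks
print(grains_per_ckg(5, 1))                 # usable grains in a 4+1 CKG: 2048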
RAID 2.0+ outperforms traditional RAID in the following aspects:
⚫ Service load balancing to avoid hot spots: Data is evenly distributed to all disks in the resource
pool, protecting disks from early end of service life due to excessive writes.
⚫ Fast reconstruction to reduce risk window: When a disk fails, the valid data in the faulty disk is
reconstructed to all other functioning disks in the resource pool (fast many-to-many
reconstruction), efficiently resuming redundancy protection.
⚫ Reconstruction load balancing among all disks in the resource pool to minimize the impact on
upper-layer applications.
2.2.2.3 RAID 2.0+ Composition
1. Disk Domain
A disk domain is a combination of disks (which can be all disks in the array). After the disks are
combined and reserved for hot spare capacity, each disk domain provides storage resources for
the storage pool.
For traditional RAID, a RAID array must be created first for allocating disk spaces to service
hosts. However, there are some restrictions and requirements for creating a RAID array: A RAID
array must consist of disks of the same type, size, and rotational speed, and should contain no more than 12 disks.
Huawei RAID 2.0+ is implemented in another way. A disk domain should be created first. A disk
domain is a disk array. A disk can belong to only one disk domain. One or more disk domains can
be created in an OceanStor storage system. It seems that a disk domain is similar to a RAID
array. Both consist of disks but have significant differences. A RAID array consists of disks of the
same type, size, and rotational speed, and such disks are associated with a RAID level. In
contrast, a disk domain can consist of more than 100 disks of up to three types. Each type of
disk is associated with a storage tier. For example, SSDs are associated with the high
performance tier, SAS disks are associated with the performance tier, and NL-SAS disks are
associated with the capacity tier. A storage tier would not exist if there are no disks of the
corresponding type in a disk domain. A disk domain separates an array of disks from another
array of disks for fully isolating faults and maintaining independent performance and storage
resources. RAID levels are not specified when a disk domain is created. That is, data redundancy
protection methods are not specified. Actually, RAID 2.0+ provides more flexible and specific
data redundancy protection methods. The storage space formed by disks in a disk domain is
divided into storage pools of a smaller granularity and hot spare space shared among storage
tiers. The system automatically sets the hot spare space based on the hot spare policy (high,
low, or none) set by an administrator for the disk domain and the number of disks at each
storage tier in the disk domain. In a traditional RAID array, an administrator should specify a
disk as the hot spare disk.
2. Storage Pool and Storage Tier
A storage pool is a storage resource container. The storage resources used by application
servers are all from storage pools.
A storage tier is a collection of storage media providing the same performance level in a storage
pool. Different storage tiers manage storage media of different performance levels and provide
storage space for applications that have different performance requirements.
A storage pool created based on a specified disk domain dynamically allocates CKs from the disk
domain to form CKGs according to the RAID policy of each storage tier for providing storage
resources with RAID protection to applications.
A storage pool can be divided into multiple tiers based on disk types.
When creating a storage pool, a user is allowed to specify a storage tier and related RAID policy
and capacity for the storage pool.
OceanStor storage systems support RAID 1, RAID 10, RAID 3, RAID 5, RAID 50, and RAID 6 and
related RAID policies.
The capacity tier consists of large-capacity SATA and NL-SAS disks. DP RAID 6 is recommended.
3. Disk Group
An OceanStor storage system automatically divides disks of each type in each disk domain into
one or more disk groups (DGs) according to disk quantity.
One DG consists of disks of only one type.
CKs in a CKG are allocated from different disks in a DG.
DGs are internal objects automatically configured by OceanStor storage systems and typically
used for fault isolation. DGs are not presented externally.
4. Logical Drive
A logical drive (LD) is a disk that is managed by a storage system and corresponds to a physical
disk.
5. CK
A chunk (CK) is a disk space of a specified size allocated from a disk in a disk domain. It is the basic unit for composing a RAID group (CKG).
6. CKG
A chunk group (CKG) is a logical storage unit that consists of CKs from different disks in the same
DG based on the RAID algorithm. It is the minimum unit for allocating resources from a disk
domain to a storage pool.
All CKs in a CKG are allocated from the disks in the same DG. A CKG has RAID attributes, which
are actually configured for corresponding storage tiers. CKs and CKGs are internal objects
automatically configured by storage systems. They are not presented externally.
7. Extent
Each CKG is divided into logical storage spaces of a specific and adjustable size called extents.
Extent is the minimum unit (granularity) for migration and statistics of hot data. It is also the
minimum unit for space application and release in a storage pool.
An extent belongs to a volume or LUN. A user can set the extent size when creating a storage
pool. After that, the extent size cannot be changed. Different storage pools may consist of
extents of different sizes, but one storage pool must consist of extents of the same size.
8. Grain
When a thin LUN is created, extents are divided into 64 KB blocks which are called grains. A thin
LUN allocates storage space by grains. Logical block addresses (LBAs) in a grain are consecutive.
Grains are mapped to thin LUNs. A thick LUN does not involve grains.
9. Volume and LUN
A volume is an internal management object in a storage system.
A LUN is a storage unit that can be directly mapped to a host for data reads and writes. A LUN is
the external embodiment of a volume.
A volume organizes all extents and grains of a LUN and applies for and releases extents to
increase and decrease the actual space used by the volume.
RAID-TP with the FlexEC algorithm reduces the amount of data read from a single disk by 70% compared with traditional RAID, minimizing the impact on system performance.
In a typical 4:2 RAID 6 array, the capacity utilization is about 67%. The capacity utilization of a
Huawei OceanStor all-flash storage system with 25 disks is improved by 20% on this basis.
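The 67% figure is simply the data-to-total ratio of a 4+2 stripe. The short calculation below, a sketch rather than a statement of the product's exact stripe widths, reproduces it and shows how a wider stripe across a 25-disk pool (assumed here to be 23 data plus 2 parity) raises the utilization.

def raid6_utilization(data_disks, parity_disks=2):
    """Usable-capacity fraction of a RAID 6 style stripe."""
    return data_disks / (data_disks + parity_disks)

print(f"4+2 stripe:  {raid6_utilization(4):.0%}")    # about 67%, the figure quoted above
print(f"23+2 stripe: {raid6_utilization(23):.0%}")   # about 92%, an assumed wide stripe on 25 disks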
⚫ The structure of Solaris comprises the SCSI device driver, SSA middle layer, and SCSI adapter
driver, which is similar to the structure of Linux/Windows.
⚫ The AIX architecture is structured in three layers: SCSI device driver, SCSI middle layer, and SCSI
adaptation driver.
SCSI Target Model
Based on the SCSI architecture, a target is divided into three layers: port layer, middle layer, and
device layer.
⚫ A PORT model in a target packages or unpackages SCSI instructions on links. For example, a
PORT can package instructions into FCP, iSCSI, or SAS, or unpackage instructions from those
formats.
⚫ A device model in a target serves as a SCSI instruction analyser. It tells the initiator what device the current LUN is by processing INQUIRY, and processes I/Os through READ/WRITE.
⚫ The middle layer of a target maintains models such as LUN space, task set, and task (command).
There are two ways to maintain LUN space. One is to maintain a global LUN for all PORTs, and
the other is to maintain a LUN space for each PORT.
SCSI Protocol and Storage System
The SCSI protocol is the basic protocol used for communication between hosts and storage devices.
The controller sends a signal to the bus processor requesting to use the bus. After the request is
accepted, the controller's high-speed cache sends data. During this process, the bus is occupied by
the controller and other devices connected to the same bus cannot use it. However, the bus
processor can interrupt the data transfer at any time and allow other devices to use the bus for
operations of a higher priority.
A SCSI controller is like a small CPU with its own command set and cache. The special SCSI bus
architecture can dynamically allocate resources to tasks run by multiple devices in a computer. In
this way, multiple tasks can be processed at the same time.
SCSI Protocol Addressing
A traditional SCSI controller is connected to a single bus, so only one bus ID is allocated. An
enterprise-level server may be configured with multiple SCSI controllers, so there may be multiple
SCSI buses. In a storage network, each FC HBA or iSCSI network adapter is connected to a bus. A bus
ID must therefore be allocated to each bus to distinguish between them.
To address devices connected to a SCSI bus, SCSI device IDs and LUNs are used. Each device on the
SCSI bus must have a unique device ID. The HBA on the server also has its own device ID: 7. Each bus,
including the bus adapter, supports a maximum of 8 or 16 device IDs. The device ID is used to
address devices and identify the priority of the devices on the bus.
Each storage device may include sub-devices, such as virtual disks and tape drives. So LUN IDs are
used to address sub-devices in a storage device.
A ternary description (bus ID, target device ID, and LUN ID) is used to identify a SCSI target.
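The ternary address can be pictured as a tuple key. The hedged sketch below shows a (bus ID, target device ID, LUN ID) triple resolving to a device, with the ID limits mentioned above; the device table itself is invented for illustration.

# Hypothetical lookup table keyed by the SCSI ternary address described above.
MAX_TARGET_IDS = 16          # a bus supports 8 or 16 device IDs; 16 is assumed here
HBA_DEVICE_ID = 7            # the server HBA conventionally uses device ID 7

devices = {
    (0, 0, 0): "virtual disk 0 on array A",
    (0, 0, 1): "virtual disk 1 on array A",
    (1, 3, 0): "tape drive in library B",
}

def resolve(bus_id, target_id, lun_id):
    if not 0 <= target_id < MAX_TARGET_IDS:
        raise ValueError("target device ID out of range for this bus")
    return devices.get((bus_id, target_id, lun_id), "unknown device")

print(resolve(0, 0, 1))   # -> virtual disk 1 on array A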
2.3.1.1.2 SAS Protocol
Serial Attached SCSI (SAS) is the serial standard of the SCSI bus protocol. A serial port has a simple
structure, supports hot swap, and boasts a high transmission speed and execution efficiency.
Generally, large parallel cables cause electronic interference. The SAS cable structure can solve this
problem. The SAS cable structure saves space, thereby improving heat dissipation and ventilation for
servers that use SAS disks.
SAS has the following advantages:
⚫ Lower cost:
➢ A SAS backplane supports SAS and SATA disks, which reduces the cost of using different
types of disks.
➢ There is no need to design different products based on the SCSI and SATA standards. In
addition, the cabling complexity and the number of PCB layers are reduced, further
reducing costs.
➢ System integrators do not need to purchase different backplanes and cables for different
disks.
⚫ More devices can be connected:
➢ The SAS technology introduces the SAS expander, so that a SAS system supports more devices. Each expander can be connected to multiple ports, and each port can be connected to a SAS device, a host, or another SAS expander.
⚫ High reliability:
➢ The reliability is the same as that of SCSI and FC disks and is better than that of SATA disks.
➢ The verified SCSI command set is retained.
⚫ High performance:
➢ The unidirectional port rate is high.
⚫ Compatibility with SATA:
➢ SATA disks can be directly installed in a SAS environment.
➢ SATA and SAS disks can be used in the same system, which meets the requirements of the
popular tiered storage strategy.
The SAS architecture includes six layers from the bottom to the top: physical layer, phy layer, link
layer, port layer, transport layer, and application layer. Each layer provides certain functions.
⚫ Physical layer: defines hardware, such as cables, connectors, and transceivers.
⚫ Phy layer: includes the lowest-level protocols, like coding schemes and power supply/reset
sequences.
⚫ Link layer: describes how to control phy layer connection management, primitives, CRC,
scrambling and descrambling, and rate matching.
⚫ Port layer: describes the interfaces of the link layer and transport layer, including how to
request, interrupt, and set up connections.
⚫ Transport layer: defines how the transmitted commands, status, and data are encapsulated into
SAS frames and how SAS frames are decomposed.
⚫ Application layer: describes how to use SAS in different types of applications.
SAS has the following characteristics:
⚫ SAS uses the full-duplex (bidirectional) communication mode. The traditional parallel SCSI can
communicate only in one direction. When a device receives a data packet from the parallel SCSI
and needs to respond, a new SCSI communication link needs to be set up after the previous link
is disconnected. However, each SAS cable contains two input cables and two output cables. This
way, SAS can read and write data at the same time, improving the data throughput efficiency.
⚫ Compared with SCSI, SAS has the following advantages:
➢ As it uses the serial communication mode, SAS provides higher throughput and may deliver
higher performance in the future.
➢ Four narrow ports can be bound as a wide link port to provide higher throughput.
Scalability of SAS:
⚫ SAS uses expanders to expand interfaces. One SAS domain supports a maximum of 16,384 disk
devices.
⚫ A SAS expander is an interconnection device in a SAS domain. Similar to an Ethernet switch, a
SAS expander enables an increased number of devices to be connected in a SAS domain, and
reduces the cost in HBAs. Each expander can connect to a maximum of 128 terminals or
expanders. The main components in a SAS domain are SAS expanders, terminal devices, and
connection devices (or SAS connection cables).
➢ A SAS expander is equipped with a routing table that tracks the addresses of all SAS drives.
➢ A terminal device can be an initiator (usually a SAS HBA) or a target (a SAS or SATA disk, or
an HBA in target mode).
⚫ Loops cannot be formed in a SAS domain. This ensures terminal devices can be detected.
⚫ In reality, the number of terminal devices connected to an expander is far fewer than 128 due to bandwidth constraints.
Cable Connection Principles of SAS:
⚫ Most storage device vendors use SAS cables to connect disk enclosures to controller enclosures
or connect disk enclosures. A SAS cable bundles four independent channels (narrow ports) into
a wide port to provide higher bandwidth. The four independent channels provide 12 Gbit/s
each, so a wide port can provide 48 Gbit/s of bandwidth. To ensure that the data volume on a
SAS cable does not exceed the maximum bandwidth of the SAS cable, the total number of disks
connected to a SAS loop must be limited.
⚫ For a Huawei storage device, a loop supports a maximum of 168 disks, that is, up to seven disk enclosures with 24 disk slots each, and all disks in the loop must be traditional SAS disks. Because SSDs deliver much higher transmission speeds than SAS disks, a loop of SSDs supports a maximum of 96 disks, that is, four disk enclosures with 24 disk slots each (both limits are reproduced in the short calculation after this list).
⚫ A SAS cable is called a mini SAS cable when the speed of a single channel is 6 Gbit/s, and a SAS
cable is called a high-density mini SAS cable when the speed is increased to 12 Gbit/s.
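Both loop limits and the wide-port bandwidth quoted above are plain arithmetic, reproduced in the sketch below.

LANES_PER_WIDE_PORT = 4
LANE_RATE_GBPS = 12                 # 12 Gbit/s per channel (high-density mini SAS)
SLOTS_PER_ENCLOSURE = 24

print("wide port bandwidth:", LANES_PER_WIDE_PORT * LANE_RATE_GBPS, "Gbit/s")  # 48
print("SAS disk loop limit:", 7 * SLOTS_PER_ENCLOSURE, "disks")                # 168
print("SSD loop limit:     ", 4 * SLOTS_PER_ENCLOSURE, "disks")                # 96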
2.3.1.2 iSCSI and FC
2.3.1.2.1 iSCSI Protocol
The Internet SCSI (iSCSI) protocol was initially developed by IBM and Cisco and has been a formal IETF standard since 2004. The current iSCSI protocol is based on SCSI Architecture Model-2 (SAM-2).
iSCSI is short for Internet Small Computer System Interface. It is an IP-based storage networking
standard for linking data storage facilities. It provides block-level access to storage devices by
carrying SCSI commands over a TCP/IP network.
The iSCSI protocol encapsulates SCSI commands and block data into TCP packets for transmission
over IP networks. As the transport layer protocol of SCSI, iSCSI uses mature IP network technologies
to implement and extend SAN. The SCSI protocol layer generates CDBs and sends the CDBs to the
iSCSI protocol layer. The iSCSI protocol layer then encapsulates the CDBs into PDUs and transmits
the PDUs over an IP network.
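As a rough picture of the layering just described (SCSI CDB inside an iSCSI PDU inside TCP/IP), the sketch below nests the pieces as plain Python objects. It deliberately ignores the real iSCSI Basic Header Segment layout and TCP mechanics; it only shows which layer wraps which. The READ(10) opcode and the standard iSCSI port 3260 are the only protocol specifics used.

from dataclasses import dataclass

@dataclass
class ScsiCdb:            # command descriptor block produced by the SCSI layer
    opcode: int
    lba: int
    length: int

@dataclass
class IscsiPdu:           # iSCSI layer wraps the CDB (real PDUs carry a binary header, etc.)
    initiator_task_tag: int
    cdb: ScsiCdb

@dataclass
class TcpSegment:         # TCP/IP layer carries the PDU bytes to port 3260 on the target
    dst_port: int
    payload: IscsiPdu

read_cmd = ScsiCdb(opcode=0x28, lba=2048, length=8)   # SCSI READ(10)
pdu = IscsiPdu(initiator_task_tag=1, cdb=read_cmd)
segment = TcpSegment(dst_port=3260, payload=pdu)
print(segment)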
iSCSI Initiator and Target
The iSCSI communication system inherits some of SCSI's features. The iSCSI communication involves
an initiator that sends I/O requests and a target that responds to the I/O requests and executes I/O
operations. After a connection is set up between the initiator and target, the target controls the
entire process as the primary device.
⚫ There are three types of iSCSI initiators: software-based initiator driver, hardware-based TCP
offload engine (TOE) NIC, and iSCSI HBA. Their performance increases in that order.
⚫ An iSCSI target is usually an iSCSI disk array or iSCSI tape library.
The iSCSI protocol defines a set of naming and addressing methods for iSCSI initiators and targets. All
iSCSI nodes are identified by their iSCSI names. This method distinguishes iSCSI names from host
names.
iSCSI Architecture
In an iSCSI system, a user sends a data read or write command on a SCSI storage device. The
operating system converts this request into one or multiple SCSI instructions and sends the
instructions to the target SCSI controller card. The iSCSI node encapsulates the instructions and data
into an iSCSI packet and sends the packet to the TCP/IP layer, where the packet is encapsulated into
an IP packet to be transmitted over a network. You can also encrypt the SCSI instructions for
transmission over an insecure network.
Data packets can be transmitted over a LAN or the Internet. The receiving storage controller
restructures the data packets and sends the SCSI control commands and data in the iSCSI packets to
corresponding disks. The disks execute the operation requested by the host or application. For a
data request, data will be read from the disks and sent to the host. The process is completely
transparent to users. Though SCSI instruction execution and data preparation can be implemented in software by the network controller using TCP/IP, the host must devote considerable CPU resources to processing the SCSI instructions and data. If these transactions are processed by dedicated devices, the
impact on system performance will be reduced to a minimum. Therefore, it is necessary to develop
dedicated iSCSI adapters that execute SCSI commands and complete data preparation under iSCSI
standards. An iSCSI adapter combines the functions of an NIC and an HBA. The iSCSI adapter obtains
data by blocks, classifies and processes data using the TCP/IP processing engine, and sends IP data
packets over an IP network. In this way, users can create IP SANs without compromising server
performance.
2.3.1.2.2 FC Protocol
Fibre Channel (FC) can be referred to as the FC protocol, FC network, or FC interconnection. As FC
delivers high performance, it is becoming more commonly used for front-end host access on point-
to-point and switch-based networks. Like TCP/IP, the FC protocol suite also includes concepts from
the TCP/IP protocol suite and the Ethernet, such as FC switching, FC switch, FC routing, FC router, and the FSPF routing algorithm.
FC protocol structure:
⚫ FC-0: defines physical connections and selects different physical media and data rates for
protocol operations. This maximizes system flexibility and allows for existing cables and different
technologies to be used to meet the requirements of different systems. Copper cables and
optical cables are commonly used.
⚫ FC-1: defines the 8-bit/10-bit transmission encoding used to balance the transmitted bit stream. The encoding also serves as a mechanism to transfer data and detect errors. 8b/10b encoding helps reduce component design costs and ensures sufficient transition density for reliable clock recovery.
⚫ FC-2: includes the following items for sending data over the network:
➢ How data should be split into small frames
➢ How much data should be sent at a time (flow control)
➢ Where frames should be sent (including defining service levels based on applications)
⚫ FC-3: defines advanced functions such as striping (data is transferred through multiple
channels), multicast (one message is sent to multiple targets), and group query (multiple ports
are mapped to one node). When FC-2 defines functions for a single port, FC-3 can define
functions across ports.
⚫ FC-4: maps upper-layer protocols, such as SCSI, IP, and ATM, onto the FC transport. The SCSI mapping (FCP) is the most widely used part of the FC protocol suite.
Like the Ethernet, FC provides the following network topologies:
⚫ Point-to-point:
➢ The simplest topology that allows direct communication between two nodes (usually a
storage device and a server).
⚫ FC-AL:
➢ Similar to the Ethernet shared bus topology but is in arbitrated loop mode rather than bus
connection mode. Each device is connected to another device end to end to form a loop.
➢ Data frames are transmitted hop by hop in the arbitrated loop and the data frames can be
transmitted only in one direction at any time. As shown in the figure, node A needs to
communicate with node H. After node A wins the arbitration, it sends data frames to node
H. However, the data frames are transmitted clockwise in the sequence of B-C-D-E-F-G-H,
which is inefficient.
⚫ Fabric:
➢ Similar to an Ethernet switching topology, a fabric topology is a mesh switching matrix.
➢ The forwarding efficiency is much greater than in FC-AL.
➢ FC devices are connected to fabric switches through optical fibres or copper cables to
implement point-to-point communication between nodes.
FC frees the workstation from the management of every port. Each port manages its own
point-to-point connection to the fabric, and other fabric functions are implemented by FC
switches. There are seven types of ports in FC networks.
⚫ Device (node) port:
➢ N_Port: Node port, used to attach a node directly to the fabric.
➢ NL_Port: Node loop port. A device can be attached to a loop.
⚫ Switch port:
➢ E_Port: Expansion port (connecting switches).
➢ F_Port: A fabric port that is used to connect to an N_Port.
➢ FL_Port: Fabric loop port.
➢ G_Port: A generic port that can be converted into an E_Port or F_Port.
➢ U_Port: A universal port used to describe automatic port detection.
2.3.1.3 PCIe and NVMe
2.3.1.3.1 PCIe Protocol
In 1991, Intel first proposed the concept of PCI. PCI has the following characteristics:
⚫ Simple bus structure, low costs, easy designs.
⚫ The parallel bus supports a limited number of devices and the bus scalability is poor.
⚫ When multiple devices are connected, the effective bandwidth of the bus is greatly reduced and
the transmission rate slows down.
With the development of modern processor technologies, it is inevitable that parallel buses will be
replaced by high-speed differential buses in the interconnectivity field. Compared with single-ended
parallel signals, high-speed differential signals allow much higher clock frequencies. Against this backdrop, the PCIe bus came into being.
PCIe is short for PCI Express, which is a high-performance and high-bandwidth serial communication
interconnection standard. It was first proposed by Intel and then developed by the Peripheral
Component Interconnect Special Interest Group (PCI-SIG) to replace bus-based communication
architectures.
Compared with the traditional PCI bus, PCIe has the following advantages:
⚫ Dual channels, high bandwidth, and a fast transmission rate: Separate RX and TX paths provide a full-duplex-like transmission mode and a higher transmission rate. PCIe 1.0, 2.0, 3.0, 4.0, and 5.0 deliver per-lane rates of up to 2.5, 5, 8, 16, and 32 Gbit/s, respectively, and bandwidth can be multiplied by increasing the number of lanes (a short calculation follows this list).
⚫ Compatibility: PCIe is compatible with PCI at the software layer but has upgraded software.
⚫ Ease-of-use: Hot swap is supported. A PCIe bus interface slot contains the hot swap detection
signal, supporting hot swap and heat exchange.
⚫ Error processing and reporting: A PCIe bus uses a layered structure, in which the software layer
can process and report errors.
⚫ Virtual channels of each physical connection: Each physical channel supports multiple virtual
channels (in theory, eight virtual channels are supported for independent communication
control), thereby supporting QoS of each virtual channel and achieving high-quality traffic
control.
⚫ Reduced I/Os, board-level space, and crosstalk: A typical PCI bus data line requires at least 50
I/O resources, while PCIe X1 requires only four I/O resources. Reduced I/Os saves board-level
space, and the direct distance between I/Os can be longer, thereby reducing crosstalk.
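The per-generation figures in the bandwidth bullet above are raw per-lane signalling rates; usable throughput also depends on the line encoding (8b/10b for PCIe 1.0/2.0, 128b/130b from PCIe 3.0 onward) and scales with the lane count. The short calculation below makes that relationship explicit.

# Raw per-lane rate (GT/s) and encoding efficiency by PCIe generation.
GENERATIONS = {
    "1.0": (2.5, 8 / 10),      # 8b/10b encoding
    "2.0": (5.0, 8 / 10),
    "3.0": (8.0, 128 / 130),   # 128b/130b encoding
    "4.0": (16.0, 128 / 130),
    "5.0": (32.0, 128 / 130),
}

def lane_throughput_gbps(gen):
    rate, efficiency = GENERATIONS[gen]
    return rate * efficiency               # usable Gbit/s per lane

for gen in GENERATIONS:
    x16_gbps = lane_throughput_gbps(gen) * 16
    print(f"PCIe {gen}: {lane_throughput_gbps(gen):5.2f} Gbit/s per lane, "
          f"{x16_gbps / 8:6.1f} GB/s for an x16 link")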
Why PCIe? PCIe is future-oriented, and higher throughputs can be achieved in the future. PCIe is
providing increasing throughput using the latest technologies, and the transition from PCI to PCIe
can be simplified by guaranteeing compatibility with PCI software using layered protocols and drives.
The PCIe protocol features point-to-point connection, high reliability, tree networking, full duplex,
and frame-structure-based transmission.
PCIe protocol layers include the physical layer, data link layer, transaction layer, and application
layer.
⚫ The physical layer in a PCIe bus architecture determines the physical features of the bus. In
future, the performance of a PCIe bus can be further improved by increasing the speed or
changing the encoding or decoding mode. Such changes only affect the physical layer,
facilitating upgrades.
⚫ The data link layer ensures the correctness and reliability of data packets transmitted over a
PCIe bus. It checks whether the data packet encapsulation is complete and correct, adds the
sequence number and CRC code to the data, and uses the ack/nack handshake protocol for
error detection and correction.
⚫ The transaction layer receives read and write requests from the software layer, encapsulates them into request packets, and transmits them to the data link layer. This type of packet is called a transaction layer packet (TLP). The transaction layer also receives TLPs coming up from the data link layer, associates them with the related software requests, and delivers them to the software layer for processing.
⚫ The application layer is designed by users based on actual needs. Other layers must comply with
the protocol requirements.
CIFS has high requirements on network transmission reliability, so it usually runs over TCP/IP. CIFS is mainly
used for the Internet and by Windows hosts to access files or other resources over the Internet. CIFS
allows Windows clients to identify and access shared resources. With CIFS, clients can quickly read,
write, and create files in storage systems as on local PCs. CIFS helps maintain a high access speed
and a fast system response even when many users simultaneously access the same shared file.
CIFS share in non-domain environments
The storage system can employ CIFS shares to share the file systems to users as directories. Users
can only view or access their own shared directories.
On the network, the storage system serves as the CIFS server and employs the CIFS protocol to
provide shared file system access for clients. After the clients map the shared files to the local
directories, users can access the files on the server as if they are accessing local files. You can set
locally authenticated user names and passwords in the storage system to determine the local
authentication information that can be used for accessing the file system.
CIFS share in AD domain environments
With the expansion of LAN and WAN, many enterprises use the AD domain to manage networks in
Windows. The AD domain makes network management simple and flexible.
A storage system can be added to an AD domain as a client. That is, it can be seamlessly integrated
with the AD domain. The AD domain controller saves information about all the clients and groups in
the domain. Clients in the AD domain need to be authenticated by the AD domain controller before
accessing the CIFS share provided by the storage system. By setting the permissions of AD domain
users, you can allow different domain users to have different permissions for shared directories. A
client in the AD domain can only access the shared directory with the same name as the client.
2.3.2.2.2 NFS Protocol
NFS is short for Network File System. The network file sharing protocol is defined by the IETF and
widely used in the Linux/Unix environment.
NFS is a client/server application that uses remote procedure call (RPC) for communication between
computers. Users can store and update files on the remote NAS just like on local PCs. A system
requires an NFS client to connect to an NFS server. NFS is transport-independent and can run over TCP or UDP. Users or system administrators can use NFS to mount all file systems or a part of a file
system (a part of any directory or subdirectory hierarchy). Access to the mounted file system can be
controlled using permissions, for example, read-only or read-write permissions.
Differences between NFSv3 and NFSv4:
⚫ NFSv4 is a stateful protocol. It implements the file lock function and can obtain the root node of
a file system without the help of the NLM and MOUNT protocols. NFSv3 is a stateless protocol.
It requires the NLM protocol to implement the file lock function.
⚫ NFSv4 has enhanced security and supports RPCSEC_GSS identity authentication.
⚫ NFSv4 provides only two requests: NULL and COMPOUND. All operations are integrated into
COMPOUND. A client can encapsulate multiple operations into one COMPOUND request based
on actual requests to improve flexibility.
⚫ The namespace of the NFSv4 file system is changed. A root file system (fsid=0) must be set on the server, and other file systems are mounted under the root file system for export.
⚫ Compared with NFSv3, the cross-platform feature of NFSv4 is enhanced.
NFS share in a non-domain environment
The NFS share in a non-domain environment is commonly used for small- and medium-sized
enterprises. On the network, the storage system serves as the NFS server and employs the NFS
protocol to provide shared file system access for clients. After the clients mount the shared files to
the local directories, users can access the files on the server in the same way as accessing local files.
Filter criteria can be set for client IP addresses in storage systems to restrict clients that can access
NFS shares.
NFS share in a domain environment
Domains enable accounts, applications, and networks to be centrally managed. In Linux, LDAP and
NIS domains are available.
LDAP is an open and extendable network protocol. The purpose of LDAP-based authentication
applications is to set up a directory-oriented user authentication system, specifically, an LDAP
domain. When a client user needs to access applications in the LDAP domain environment, the LDAP
server compares the user name and password sent by the client with corresponding authentication
information in the directory database for identity verification.
NIS is a directory service technology that enables users to centrally manage system databases. It
provides a yellow page function to support the centralized management of network information. It
works based on client/server architecture. When the user name and password of a user are saved in
the NIS server database, the user can log in to an NIS client and maintain the database to centrally
manage the network information on the LAN.
When a client needs to access an NFS share provided by the storage system in a domain
environment, the storage system employs the domain server network group to authenticate the
accessible IP address, ensuring the reliability of file system data.
2.3.2.2.3 CIFS-NFS Cross-Protocol Access
The storage system allows users to share a file system or dtree using NFS and CIFS at the same time.
Clients using different protocols can access the same file system or dtree at the same time. Since
Windows, Linux, and UNIX adopt different mechanisms to authenticate users and control access, the
storage system uses a mechanism to centrally map users and control access, protecting the security
of CIFS-NFS cross-protocol access.
If you use a CIFS-based client to access a storage system, the storage system authenticates local or
AD domain users in the first place. If the UNIX permission (UNIX Mode bits) has been configured for
the file or directory to be accessed, the CIFS user is mapped as an NFS user based on preset user
mapping rules during authentication. The storage system then implements UNIX permission authentication for the user.
If an NFS user attempts to access a file or directory that has the NT ACL on the storage system, the
storage system maps the NFS user as a CIFS user based on the preset mapping rules. Then the
storage system implements NT ACL permission authentication for the user.
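The cross-protocol behaviour described above comes down to a mapping-rule lookup before the permission check. The sketch below models that with a plain dictionary of preset rules; the rule table and user names are invented for illustration and do not reflect the storage system's actual mapping mechanism.

# Hypothetical preset mapping rules between CIFS (Windows/AD) and NFS (UNIX) identities.
CIFS_TO_NFS = {"DOMAIN\\alice": "alice_unix"}
NFS_TO_CIFS = {"bob_unix": "DOMAIN\\bob"}

def authorize(user, protocol, target_permission_style):
    """Map the user across protocols when the target object uses the other permission style."""
    if protocol == "CIFS" and target_permission_style == "UNIX":
        mapped = CIFS_TO_NFS.get(user)
        return f"check UNIX mode bits as {mapped}" if mapped else "access denied: no mapping rule"
    if protocol == "NFS" and target_permission_style == "NTACL":
        mapped = NFS_TO_CIFS.get(user)
        return f"check NT ACL as {mapped}" if mapped else "access denied: no mapping rule"
    return "check permissions natively, no mapping needed"

print(authorize("DOMAIN\\alice", "CIFS", "UNIX"))
print(authorize("bob_unix", "NFS", "NTACL"))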
2.3.2.3 HTTP, FTP, and NDMP
2.3.2.3.1 HTTP Shared File System
The storage system supports the HTTP shared file system. With the HTTPS service enabled, you can
share a file system in HTTPS mode.
Shared resource management is implemented based on the WebDAV protocol. As an HTTP
extension protocol, WebDAV allows clients to copy, move, modify, lock, unlock, and search for
resources in shared directories on servers.
Hypertext Transfer Protocol (HTTP) is a data transfer protocol that regulates the way a browser and
the web server communicate. It is used to transfer World Wide Web documents over the Internet.
HTTP defines how web clients request web pages from web servers and how web servers return web
pages to web clients.
HTTP uses short connections to transmit packets. A connection is terminated each time the
transmission is complete. HTTP and HTTPS use port 80 and port 443 respectively.
NDMP is designed for the data backup system of NAS devices. It enables NAS devices, without any
backup client agent, to send data directly to the connected disk devices or the backup servers on the
network for backup.
There are two networking modes for NDMP:
⚫ On a 2-way network, backup media is directly connected to a NAS storage system instead of
being connected to a backup server. In a backup process, the backup server sends a backup
command to the NAS storage system through the Ethernet. The system then directly backs up
data to the tape library it is connected to.
➢ In the NDMP 2-way backup mode, data flows are transmitted directly to backup media,
greatly improving the transmission performance and reducing server resource usage.
However, a tape library is connected to a NAS storage device, so the tape library can back
up data only for the NAS storage device to which it is connected.
➢ Tape libraries are expensive. To enable different NAS storage devices to share tape
devices, NDMP also supports the 3-way backup mode.
⚫ In the 3-way backup mode, a NAS storage system can transfer backup data to a NAS storage
device connected to a tape library through a dedicated backup network. Then, the storage
device backs up the data to the tape library.
REST is not a specification but an architectural style for network applications. It can be regarded as a design pattern applied to the architecture of networked applications.
RESTful
An architecture complying with the REST principle is called a RESTful architecture.
It provides a set of software design guidelines and constraints for designing software for interaction
between clients and servers. RESTful software is simpler and more hierarchical and facilitates the
cache mechanism.
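As a minimal illustration of the RESTful idea (resources identified by URLs and manipulated with standard HTTP verbs), the snippet below issues a GET against a placeholder resource URL using Python's standard library. The URL and the JSON shape are stand-ins, not a real management API.

import json
import urllib.request

# Placeholder RESTful endpoint: a resource identified by a URL, read with HTTP GET.
url = "https://storage.example.com/api/v1/filesystems/fs01"

request = urllib.request.Request(url, headers={"Accept": "application/json"})
try:
    with urllib.request.urlopen(request, timeout=5) as response:
        filesystem = json.loads(response.read())
        print(filesystem)
except OSError as error:   # the placeholder host does not exist, so handle failure gracefully
    print("request failed:", error)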
2.3.3.2 HDFS Storage Protocol
2.3.3.2.1 HDFS Service
The HDFS service provides an HDFS decoupled storage-compute solution based on native HDFS. The
solution implements on-demand configuration of storage and compute resources, provides
consistent user experience, and helps reduce the total cost of ownership (TCO). It can coexist with
the legacy coupled storage-compute architecture.
The HDFS service provides native HDFS interfaces to interconnect with big data platforms, such as
FusionInsight, Cloudera CDH, and Hortonworks HDP, to implement big data storage and computing
and provide big data analysis services for upper-layer big data applications.
Typical application scenarios include big data for finance, Internet log retention, governments, and
Smart City.
Hadoop Distributed File System (HDFS) is one of the major components in the open-source Hadoop.
HDFS consists of NameNode and DataNode.
2.3.3.2.2 Distributed File System
The distributed file system stores files on multiple computer nodes. Thousands of computer nodes
form a computer cluster.
Currently, the computer cluster used by the distributed file system consists of common hardware,
which greatly reduces the hardware overhead.
The Hadoop Distributed File System (HDFS) is a distributed file system running on universal
hardware and is designed and developed based on the GFS paper.
2.3.3.2.3 HDFS Architecture
The HDFS architecture consists of three parts: NameNode, DataNode, and Client.
A NameNode stores and generates the metadata of file systems. It runs one instance.
A DataNode stores the actual data and reports blocks it manages to the NameNode. A DataNode
runs multiple instances.
A Client supports service access to HDFS. It obtains metadata from the NameNode, reads data from DataNodes, and returns the data to services. Multiple client instances can run together with services.
2.3.3.2.4 HDFS Communication Protocol
HDFS is a distributed file system deployed on a cluster. Therefore, a large amount of data needs to
be transmitted over the network.
All HDFS communication protocols are based on the TCP/IP protocol.
The client initiates a TCP connection to the NameNode through a configurable port and uses the
client protocol to interact with the NameNode.
The NameNode and the DataNode interact with each other by using the DataNode protocol.
The interaction between the client and the DataNode is implemented through remote procedure calls (RPCs). By design, the NameNode does not initiate RPC requests; it only responds to RPC requests from clients and DataNodes.
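The protocol split described above (client to NameNode for metadata, client to DataNode for block data) can be pictured with the hedged sketch below. The function names and the block map are invented stand-ins; real clients use the Hadoop RPC and data-transfer protocols rather than these placeholders.

# Invented metadata: which DataNodes hold each block of a file (illustration only).
BLOCK_MAP = {
    "/logs/app.log": [("blk_1", ["dn1", "dn3"]), ("blk_2", ["dn2", "dn3"])],
}

def namenode_get_block_locations(path):
    """Client -> NameNode: ask where the file's blocks live (metadata only)."""
    return BLOCK_MAP[path]

def datanode_read_block(datanode, block_id):
    """Client -> DataNode: fetch the actual block bytes (placeholder content)."""
    return f"<contents of {block_id} from {datanode}>"

def hdfs_read(path):
    data = []
    for block_id, replicas in namenode_get_block_locations(path):
        data.append(datanode_read_block(replicas[0], block_id))  # read from the first replica
    return "".join(data)

print(hdfs_read("/logs/app.log"))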
⚫ In 2000, HDS adopted the switching architecture for Lightning 9900 products. Front-end
interface modules, cache modules, and back-end disk interface modules were connected on two
redundant switched networks, increasing the number of communication channels to dozens of times that of the bus architecture. The internal bus was no longer a performance bottleneck.
⚫ In 2003, EMC launched the DMX series based on full mesh architecture. All modules were
connected in point-to-point mode, which theoretically provided larger internal bandwidth but added system complexity and limited scalability.
⚫ In 2009, to reduce hardware development costs, EMC launched the distributed switching
architecture by connecting a separated switch module to the tightly coupled dual-controller of
mid-range storage systems. This achieved a balance between costs and scalability.
⚫ In 2012, Huawei launched the Huawei OceanStor 18000 series, a mission-critical storage
product also based on distributed switching architecture.
Storage Software Technology Evolution:
A storage system combines unreliable and low-performance disks to provide high-reliability and
high-performance storage through effective management. Storage systems provide sharing, easy-to-
manage, and convenient data protection functions. Storage system software has evolved from basic
RAID and cache to data protection features such as snapshot and replication, to dynamic resource
management with improved data management efficiency, and deduplication and tiered storage with
improved storage efficiency.
Scale-out Storage Architecture:
⚫ A scale-out storage system organizes local HDDs and SSDs of general-purpose servers into a
large-scale storage resource pool, and then distributes data to multiple data storage servers.
⚫ Huawei's scale-out storage currently follows the approach pioneered by Google: it builds a distributed file system across multiple servers and then implements storage services on top of that file system.
⚫ Most storage nodes are general-purpose servers. Huawei OceanStor 100D is compatible with
multiple general-purpose x86 servers and Arm servers.
➢ Protocol: storage protocol layer. The block, object, HDFS, and file services support local mounting access over iSCSI or SCSI, S3/Swift access, HDFS access, and NFS access, respectively.
➢ VBS: block access layer of FusionStorage Block. User I/Os are delivered to VBS over iSCSI or
SCSI.
➢ EDS-B: provides block services with enterprise features, and receives and processes I/Os
from VBS.
➢ EDS-F: provides the HDFS service.
➢ Metadata Controller (MDC): The metadata control device controls distributed cluster node
status, data distribution rules, and data rebuilding rules.
➢ Object Storage Device (OSD): the component that stores user data in a distributed cluster.
➢ Cluster Manager (CM): manages cluster information.
Scale-Up:
⚫ This traditional vertical expansion architecture continuously adds storage disks into the existing
storage systems to meet demands.
⚫ Advantage: simple operation at the initial stage
⚫ Disadvantage: As the storage system scale increases, resource increase reaches a bottleneck.
Scale-Out:
⚫ This horizontal expansion architecture adds controllers to meet demands.
⚫ Advantage: As the scale increases, the unit price decreases and the efficiency is improved.
⚫ Disadvantage: The complexity of software and management increases.
Huawei SAS disk enclosure is used as an example.
⚫ Port consistency: In a loop, the EXP port of an upper-level disk enclosure is connected to the PRI
port of a lower-level disk enclosure.
⚫ Dual-plane networking: Expansion module A connects to controller A, while expansion module
B connects to controller B.
⚫ Symmetric networking: On controllers A and B, symmetric ports and slots are connected to the
same disk enclosure.
⚫ Forward and backward connection networking: Expansion module A uses forward connection,
while expansion module B uses backward connection.
⚫ Cascading depth: The number of cascaded disk enclosures in a loop cannot exceed the upper
limit.
Huawei smart disk enclosure is used as an example.
⚫ Port consistency: In a loop, the EXP (P1) port of an upper-level disk enclosure is connected to
the PRI (P0) port of a lower-level disk enclosure.
⚫ Dual-plane networking: Expansion board A connects to controller A, while expansion board B
connects to controller B.
⚫ Symmetric networking: On controllers A and B, symmetric ports and slots are connected to the
same disk enclosure.
⚫ Forward connection networking: Both expansion modules A and B use forward connection.
⚫ Cascading depth: The number of cascaded disk enclosures in a loop cannot exceed the upper
limit.
IP scale-out is used for Huawei OceanStor V3 and V5 entry-level and mid-range series, Huawei
OceanStor V5 Kunpeng series, and Huawei OceanStor Dorado V6 series. IP scale-out integrates
TCP/IP, Remote Direct Memory Access (RDMA), and Internet Wide Area RDMA Protocol (iWARP) to
implement service switching between controllers, which complies with the all-IP trend of the data
center network.
PCIe scale-out is used for Huawei OceanStor 18000 V3 and V5 series, and Huawei OceanStor Dorado
V3 series. PCIe scale-out integrates PCIe channels and the RDMA technology to implement service
switching between controllers.
PCIe scale-out: features high bandwidth and low latency.
IP scale-out: employs standard data center technologies (such as ETH, TCP/IP, and iWARP) and
infrastructure, and boosts the development of Huawei's proprietary chips for entry-level and mid-
range products.
Next, let's move on to I/O read and write processes of the host. The scenarios are as follows:
⚫ Remote Read Process
➢ If the target disk is on a remote device, engine 0 transfers the I/Os to engine 1 where the
disk resides.
➢ Engine 1 reads data from the disk.
➢ Engine 1 completes the data read.
➢ Engine 1 returns the data to engine 0, engine 0 returns the data to engine 2, and then
engine 2 returns the data to the host.
The back-end design supports full interconnection between the disk enclosure and eight controllers. When full interconnection between disk enclosures and eight controllers is implemented, the system can use continuous mirroring to tolerate the failure of 7 out of 8 controllers without service interruption.
Huawei storage provides E2E global resource sharing:
⚫ Symmetric architecture
➢ All products support host access in active-active mode. Requests can be evenly distributed
to each front-end link.
➢ They eliminate LUN ownership of controllers, making LUNs easier to use and balancing
loads. They accomplish this by dividing a LUN into multiple slices that are then evenly
distributed to all controllers using the DHT algorithm (a sketch of this slice placement follows this list).
➢ Mission-critical products reduce latency with intelligent front-end interface modules (FIMs) that slice host I/Os by LUN and send the requests directly to their target controllers.
⚫ Shared port
➢ A single port is shared by four controllers in a controller enclosure.
➢ Loads are balanced without host multipathing.
⚫ Global cache
➢ The system directly writes received I/Os (in one or two slices) to the cache of the
corresponding controller and sends an acknowledgement to the host.
➢ The intelligent read cache of all controllers participates in prefetch and cache hit of all LUN
data and metadata.
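As mentioned in the "Symmetric architecture" item, each LUN is sliced and the slices are spread over all controllers by a DHT. The sketch below approximates that with a simple hash-and-modulo placement; the slice size and hash choice are assumptions, so treat it purely as a model of the idea, not the product's algorithm.

import hashlib
from collections import Counter

CONTROLLERS = ["A", "B", "C", "D"]
SLICE_SIZE_MB = 64            # assumed slice size, for illustration only

def owner_controller(lun_id, slice_index):
    """Hash (LUN, slice) to one controller - a stand-in for the DHT placement."""
    key = f"{lun_id}:{slice_index}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return CONTROLLERS[digest % len(CONTROLLERS)]

lun_size_gb = 16
slice_count = lun_size_gb * 1024 // SLICE_SIZE_MB
placement = Counter(owner_controller("LUN-7", i) for i in range(slice_count))
print(placement)              # slices of one LUN spread roughly evenly over A-D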
FIMs of Huawei OceanStor Dorado 8000 and 18000 V6 series storage adopt Huawei-developed
Hi1822 chip to connect to all controllers in a controller enclosure via four internal links and each
front-end port provides a communication link for the host. If any controller restarts during an
upgrade, services are seamlessly switched to the other controller without impacting hosts and
interrupting links. The host is unaware of controller faults. Switchover is completed within 1 second.
The FIM has the following features:
⚫ Failure of a controller will not disconnect the front-end link, and the host is unaware of the
controller failure.
⚫ The PCIe link between the FIM and the controller is disconnected, and the FIM detects the
controller failure.
⚫ Service switchover is performed between the controllers, and the FIM redistributes host
requests to other controllers.
⚫ The switchover time is about 1 second, which is much shorter than switchover performed by
multipathing software (10-30s).
In global cache mode, host data is directly written into linear space logs, and the logs directly copy
the host data to the memory of multiple controllers using RDMA based on a preset copy policy. The
global cache consists of two parts:
⚫ Global memory: memory of all controllers (four controllers in the figure). This is managed in a
unified memory address, and provides linear address space for the upper layer based on a
redundancy configuration policy.
⚫ WAL: new write cache of the log type
The global pool uses RAID 2.0+, full-stripe writes of new data, and RAID groups shared among multiple stripes.
Another feature is back-end sharing, which includes sharing of back-end interface modules within an
enclosure and cross-controller enclosure sharing of back-end disk enclosures.
Host service switchover when a single controller is faulty: When FIMs are used, failure of a
controller will not disconnect front-end ports from hosts, and the hosts are unaware of the
controller failure, ensuring high availability. When a controller fails, the FIM port chip detects
that the PCIe link between the FIM and the controller is disconnected. Then service switchover
is performed between the controllers, and the FIM redistributes host I/Os to other controllers.
This process is completed within seconds and does not affect host services. In comparison,
when non-shared interface modules are used, a link switchover must be performed by the
host's multipathing software in the event of a controller failure, which takes a longer time (10 to
30 seconds) and reduces reliability.
2.4.2 NAS
Enterprises need to store a large amount of data and share the data through a network. Therefore,
network-attached storage (NAS) is a good choice. NAS connects storage devices to the live network
and provides data and file services.
For a server or host, NAS is an external device and can be flexibly deployed through the network. In
addition, NAS provides file-level sharing rather than block-level sharing, which makes it easier for
clients to access NAS over the network. UNIX and Microsoft Windows users can seamlessly share
data through NAS or File Transfer Protocol (FTP). When NAS sharing is used, UNIX uses NFS and
Windows uses CIFS.
NAS has the following characteristics:
⚫ NAS provides storage resources through file-level data access and sharing, enabling users to
quickly share files with minimum storage management costs.
⚫ NAS is a preferred file sharing storage solution that does not require multiple file servers.
⚫ NAS also helps eliminate bottlenecks in user access to general-purpose servers.
⚫ NAS uses network and file sharing protocols for archiving and storage. These protocols include
TCP/IP for data transmission as well as CIFS and NFS for providing remote file services.
A general-purpose server can be used to carry any application and run a general-purpose operating
system. Unlike general-purpose servers, NAS is dedicated to file services and provides file sharing
services for other operating systems using open standard protocols. NAS devices are optimized
based on general-purpose servers in aspects such as file service functions, storage, and retrieval. To
improve the high availability of NAS devices, some NAS vendors also support the NAS clustering
function.
The components of a NAS device are as follows:
⚫ NAS engine (CPU and memory)
⚫ One or more NICs that provide network connections, for example, GE NIC and 10GE NIC.
⚫ An optimized operating system for NAS function management
⚫ NFS and CIFS protocols
⚫ Disk resources that use industry-standard storage protocols, such as ATA, SCSI, and Fibre
Channel
NAS protocols include NFS, CIFS, FTP, HTTP, and NDMP.
⚫ NFS is a traditional file sharing protocol in the UNIX environment. It is a stateless protocol. If a
fault occurs, NFS connections can be automatically recovered.
⚫ CIFS is a traditional file sharing protocol in the Microsoft environment. It is a stateful protocol
based on the Server Message Block (SMB) protocol. If a fault occurs, CIFS connections cannot be
automatically recovered. CIFS is integrated into the operating system and does not require
additional software. Moreover, CIFS sends only a small amount of redundant information, so it
has higher transmission efficiency than NFS.
⚫ FTP is one of the protocols in the TCP/IP protocol suite. It consists of two parts: FTP server and
FTP client. The FTP server is used to store files. Users can use the FTP client to access resources
on the FTP server through FTP.
⚫ Hypertext Transfer Protocol (HTTP) is an application-layer protocol used to transfer hypermedia
documents (such as HTML). It is designed for communication between a Web browser and a
Web server, but can also be used for other purposes.
⚫ Network Data Management Protocol (NDMP) provides an open standard for NAS network
backup. NDMP enables data to be directly written to tapes without being backed up by backup
servers, improving the speed and efficiency of NAS data protection.
Working principles of NFS: Like other file sharing protocols, NFS also uses the C/S architecture.
However, NFS provides only the basic file processing function and does not provide any TCP/IP data
transmission function. The TCP/IP data transmission function can be implemented only by using the
Remote Procedure Call (RPC) protocol. NFS file systems are completely transparent to clients.
Accessing files or directories in an NFS file system is the same as accessing local files or directories.
One program can use RPC to request a service from a program located in another computer over a
network without having to understand the underlying network protocols. RPC assumes the existence
of a transmission protocol such as Transmission Control Protocol (TCP) or User Datagram Protocol
(UDP) to carry the message data between communicating programs. In the OSI network
communication model, RPC traverses the transport layer and application layer. RPC simplifies
development of applications.
RPC works based on the client/server model. The requester is a client, and the service provider is a
server. The client sends a call request with parameters to the RPC server and waits for a response.
On the server side, the process remains in a sleep state until the call request arrives. Upon receipt of
the call request, the server obtains the process parameters, outputs the calculation results, and
sends the response to the client. Then, the server waits for the next call request. The client receives
the response and obtains call results.
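The request/response pattern just described is easy to demonstrate with Python's built-in XML-RPC modules. This is not the Sun/ONC RPC that NFS actually uses, but the call flow is the same in spirit: the client sends a call with parameters, the server computes, and the result comes back over the network.

import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def add(a, b):                      # the remote procedure offered by the server
    return a + b

server = SimpleXMLRPCServer(("127.0.0.1", 8000), logRequests=False, allow_none=True)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

client = ServerProxy("http://127.0.0.1:8000")
print(client.add(2, 3))             # the call travels over the loopback network and returns 5
server.shutdown()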
One of the typical applications of NFS is using the NFS server as internal shared storage in cloud
computing. The NFS client is optimized based on cloud computing to provide better performance
and reliability. Cloud virtualization software (such as VMware) optimizes the NFS client, so that the
VM storage space can be created on the shared space of the NFS server.
Working principles of CIFS: CIFS runs on top of TCP/IP and allows Windows computers to access files
on UNIX computers over a network.
The CIFS protocol applies to file sharing. Two typical application scenarios are as follows:
⚫ File sharing service
➢ CIFS is commonly used in file sharing service scenarios such as enterprise file sharing.
⚫ Hyper-V VM application scenario
➢ SMB can be used to share the images of Hyper-V virtual machines promoted by Microsoft. In
this scenario, the failover feature of SMB 3.0 is required to ensure service continuity upon
a node failure and to ensure the reliability of VMs.
2.4.3 SAN
2.4.3.1 IP SAN Technologies
NIC + Initiator software: Host devices such as servers and workstations use standard NICs to connect
to Ethernet switches. iSCSI storage devices are also connected to the Ethernet switches or to the
NICs of the hosts. The initiator software installed on hosts virtualizes NICs into iSCSI cards. The iSCSI
cards are used to receive and transmit iSCSI data packets, implementing iSCSI and TCP/IP
transmission between the hosts and iSCSI devices. This mode uses standard Ethernet NICs and
switches, eliminating the need for adding other adapters. Therefore, this mode is the most cost-
effective. However, the mode occupies host resources when converting iSCSI packets into TCP/IP
packets, increasing host operation overheads and degrading system performance. The NIC + initiator
software mode is applicable to scenarios with relatively low I/O and bandwidth performance requirements for data access.
TOE NIC + initiator software: The TOE NIC processes the functions of the TCP/IP protocol layer, and
the host processes the functions of the iSCSI protocol layer. Therefore, the TOE NIC significantly
improves the data transmission rate. Compared with the pure software mode, this mode reduces
host operation overheads and requires minimal network construction expenditure. This is a trade-off
solution.
iSCSI HBA:
⚫ An iSCSI HBA is installed on the host to implement efficient data exchange between the host
and the switch and between the host and the storage device. Functions of the iSCSI protocol
layer and TCP/IP protocol stack are handled by the host HBA, occupying the least CPU resources.
This mode delivers the best data transmission performance but requires high expenditure.
⚫ The iSCSI communication system inherits part of SCSI's features. The iSCSI communication
involves an initiator that sends I/O requests and a target that responds to the I/O requests and
executes I/O operations. After a connection is set up between the initiator and target, the target
controls the entire process as the primary device. The target includes the iSCSI disk array and
iSCSI tape library.
⚫ The iSCSI protocol defines a set of naming and addressing methods for iSCSI initiators and
targets. All iSCSI nodes are identified by their iSCSI names. In this way, iSCSI names are
distinguished from host names.
⚫ iSCSI uses iSCSI Qualified Name (IQN) to identify initiators and targets. Addresses change with
the relocation of initiator or target devices, but their names remain unchanged. When setting
up a connection, an initiator sends a request. After the target receives the request, it checks
whether the iSCSI name contained in the request is consistent with that bound with the target.
If the iSCSI names are consistent, the connection is set up. Each iSCSI node has a unique iSCSI
name. One iSCSI name can be used in the connections from one initiator to multiple targets.
Multiple iSCSI names can be used in the connections from one target to multiple initiators.
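IQNs follow the iqn.&lt;year-month&gt;.&lt;reversed-domain&gt;:&lt;identifier&gt; pattern defined in the iSCSI standard (RFC 3720). The small check below validates that shape for a couple of names; the names themselves are illustrative, not real device identifiers.

import re

# iqn.<yyyy-mm>.<reversed domain name>:<unique identifier>
IQN_PATTERN = re.compile(r"^iqn\.\d{4}-\d{2}\.[a-z0-9.-]+:[\w.:-]+$")

examples = [
    "iqn.2006-08.com.huawei:oceanstor:target01",   # illustrative target name
    "iqn.1991-05.com.microsoft:host01",            # illustrative initiator name
    "not-an-iqn",
]
for name in examples:
    print(name, "->", bool(IQN_PATTERN.match(name)))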
Logical ports are created based on bond ports, VLAN ports, or Ethernet ports. Logical ports are
virtual ports that carry host services. A unique IP address is allocated to each logical port for carrying
its services.
⚫ Bond port: To improve reliability of paths for accessing file systems and increase bandwidth, you
can bond multiple Ethernet ports on the same interface module to form a bond port.
⚫ VLAN: VLANs logically divide the physical Ethernet ports or bond ports of a storage system into
multiple broadcast domains. On a VLAN, when service data is being sent or received, a VLAN ID
is configured for the data so that the networks and services of VLANs are isolated, further
ensuring service data security and reliability.
⚫ Ethernet port: Physical Ethernet ports on an interface module of a storage system. Bond ports,
VLANs, and logical ports are created based on Ethernet ports.
IP address failover: A logical IP address fails over from a faulty port to an available port. In this
way, services are switched from the faulty port to the available port without interruption. The
faulty port takes over services back after it recovers. This task can be completed automatically
or manually. IP address failover applies to IP SAN and NAS.
During the IP address failover, services are switched from the faulty port to an available port,
ensuring service continuity and improving the reliability of paths for accessing file systems. Users are
not aware of this process.
The essence of IP address failover is a service switchover between ports. The ports can be Ethernet
ports, bond ports, or VLAN ports.
⚫ Ethernet port–based IP address failover: To improve the reliability of paths for accessing file
systems, you can create logical ports based on Ethernet ports.
Figure 2-9
➢ Host services are running on logical port A of Ethernet port A. The corresponding IP
address is "a". Ethernet port A fails and thereby cannot provide services. After IP address
failover is enabled, the storage system will automatically locate available Ethernet port B,
delete the configuration of logical port A that corresponds to Ethernet port A, and create
and configure logical port A on Ethernet port B. In this way, host services are quickly
switched to logical port A on Ethernet port B. The service switchover is executed quickly.
Users are not aware of this process.
⚫ Bond port-based IP address failover: To improve the reliability of paths for accessing file
systems, you can bond multiple Ethernet ports to form a bond port. When an Ethernet port that
is used to create the bond port fails, services are still running on the bond port. The IP address
fails over only when all Ethernet ports that are used to create the bond port fail.
Figure 2-10
➢ Multiple Ethernet ports are bonded to form bond port A. Logical port A created based on
bond port A can provide high-speed data transmission. When both Ethernet ports A and B
fail due to various causes, the storage system will automatically locate bond port B, delete
logical port A, and create the same logical port A on bond port B. In this way, services are
switched from bond port A to bond port B. After Ethernet ports A and B recover, services
will be switched back to bond port A if failback is enabled. The service switchover is
executed quickly, and users are not aware of this process.
⚫ VLAN-based IP address failover: You can create VLANs to isolate different services.
➢ To implement VLAN-based IP address failover, you must create VLANs, allocate a unique ID
to each VLAN, and use the VLANs to isolate different services. When an Ethernet port on a
VLAN fails, the storage system will automatically locate an available Ethernet port with the
same VLAN ID and switch services to the available Ethernet port. After the faulty port
recovers, it takes over the services.
➢ VLAN names, such as VLAN A and VLAN B, are automatically generated when VLANs are
created. The actual VLAN names depend on the storage system version.
➢ Ethernet ports and their corresponding switch ports are divided into multiple VLANs, and
different IDs are allocated to the VLANs. The VLANs are used to isolate different services.
VLAN A is created on Ethernet port A, and the VLAN ID is 1. Logical port A that is created
based on VLAN A can be used to isolate services. When Ethernet port A fails due to various
causes, the storage system will automatically locate VLAN B and the port whose VLAN ID is
1, delete logical port A, and create the same logical port A based on VLAN B. In this way,
the port where services are running is switched to VLAN B. After Ethernet port A recovers,
the port where services are running will be switched back to VLAN A if failback is enabled.
➢ An Ethernet port can belong to multiple VLANs. When the Ethernet port fails, all VLANs will
fail. Services must be switched to ports of other available VLANs. The service switchover is
executed quickly, and users are not aware of this process.
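The failover behavior described above can be condensed into a small Python sketch: a logical port is re-created on an
available physical port that satisfies the same constraint (for example, the same VLAN ID), and its IP address moves
with it. The class and function names are illustrative assumptions, not the storage system's internal implementation.

# Illustrative sketch of logical-port failover (hypothetical names, not product code).

class PhysicalPort:
    def __init__(self, name, vlan_id=None, healthy=True):
        self.name = name
        self.vlan_id = vlan_id      # None means the port carries untagged traffic
        self.healthy = healthy

class LogicalPort:
    def __init__(self, name, ip, home_port):
        self.name = name
        self.ip = ip                # the IP address follows the logical port
        self.current_port = home_port

def fail_over(logical_port, candidate_ports):
    """Move the logical port to an available port with the same VLAN ID."""
    required_vlan = logical_port.current_port.vlan_id
    for port in candidate_ports:
        if port.healthy and port.vlan_id == required_vlan:
            # Delete the old configuration and re-create it on the new port.
            logical_port.current_port = port
            return port
    return None  # no suitable port: services on this logical port are interrupted

# Example: Ethernet port A (VLAN 1) fails; the logical port moves to port B (VLAN 1).
port_a = PhysicalPort("eth_a", vlan_id=1, healthy=False)
port_b = PhysicalPort("eth_b", vlan_id=1)
lp = LogicalPort("logical_a", "192.168.10.10", port_a)
print(fail_over(lp, [port_b]).name)   # -> eth_b; the IP address is unchanged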
2.4.3.2 FC SAN Technologies
FC HBA: The FC HBA converts SCSI packets into Fibre Channel packets, offloading this processing so
that it does not occupy host resources.
Here are some key concepts in Fibre Channel networking:
⚫ Fibre Channel Routing (FCR) provides connectivity to devices in different fabrics without
merging the fabrics. Different from E_Port cascading of common switches, after switches are
connected through an FCR switch, the two fabric networks are not converged and are still two
independent fabrics. The link switch between two fabrics functions as a router.
⚫ FC Router: a switch running the FC-FC routing service.
⚫ EX_Port: a type of port that functions like an E_Port, but does not propagate fabric services or
routing topology information from one fabric to another.
⚫ Backbone fabric: fabric of a switch running the Fibre Channel router service.
⚫ Edge fabric: fabric that connects to a Fibre Channel router.
⚫ Inter-fabric link (IFL): the link between an E_Port and an EX_Port, or between a VE_Port and a VEX_Port.
Another important concept is zoning. A zone is a set of ports or devices that communicate with each
other. A zone member can only access other members of the same zone. A device can reside in
multiple zones. You can configure basic zones to control the access permission of each device or
port. Moreover, you can set traffic isolation zones. When there are multiple ISLs (E_Ports), an ISL
only transmits the traffic destined for ports that reside in the same traffic isolation zone.
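The zoning rule (a member can access only members of a zone it shares) can be illustrated with a minimal Python
sketch. The zone names and members are hypothetical, and traffic isolation zones are ignored.

# Illustrative zoning check (hypothetical data, not switch configuration syntax).

zones = {
    "zone_db":  {"host_1_wwpn", "array_port_1"},
    "zone_app": {"host_2_wwpn", "array_port_1", "array_port_2"},
}

def can_communicate(member_a, member_b):
    """Two members may talk only if at least one zone contains both of them."""
    return any(member_a in z and member_b in z for z in zones.values())

print(can_communicate("host_1_wwpn", "array_port_1"))  # True: same zone
print(can_communicate("host_1_wwpn", "host_2_wwpn"))   # False: no shared zone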
2.4.3.3 Comparison Between IP SAN and FC SAN
First, let's look back on the concept of SAN.
⚫ Protocol: Fibre Channel/iSCSI. The SAN architectures that use the two protocols are FC SAN and
IP SAN.
⚫ Raw device access: suitable for traditional database access.
⚫ Dependence on the application host to provide file access. Shared access requires the support of
cluster software, which incurs high overheads when processing access conflicts, resulting in poor
performance. In addition, it is difficult to support sharing in heterogeneous environments.
⚫ High performance, high bandwidth, and low latency, but high cost and poor scalability
Then, let's compare FC SAN and IP SAN.
⚫ To solve the poor scalability issue of DAS, storage devices can be networked using FC SAN to
support connection to more than 100 servers.
⚫ IP SAN is designed to address the management and cost challenges of FC SAN. IP SAN requires
only a few hardware configurations and the hardware is widely used. Therefore, the cost of IP
SAN is much lower than that of FC SAN. Most hosts have been configured with appropriate NICs
and switches, which are also suitable (although not perfect) for iSCSI transmission. High-
performance IP SAN requires dedicated iSCSI HBAs and high-end switches.
2.4.3.4 Comparison Between DAS, NAS, and SAN
⚫ Protocol: NAS uses the TCP/IP protocol. SAN uses the FC protocol. DAS can use SCSI/FC/ATA.
⚫ Transmission object: DAS and SAN are transmitted based on data blocks. NAS is transmitted
based on files.
⚫ SAN is more tolerant of disasters and has dedicated solutions.
⚫ BMC plane: connects to Mgmt ports of management or storage nodes to enable remote device
management.
⚫ Storage plane: an internal plane, used for service data communication among all nodes in the
storage system.
⚫ Service plane: interconnects with customer applications and accesses storage devices through
standard protocols such as iSCSI and HDFS.
⚫ Replication plane: enables data synchronization and replication among replication nodes.
⚫ Arbitration plane: communicates with the HyperMetro quorum server. This plane is planned
only when the HyperMetro function is planned for the block service.
The key software components and their functions are described as follows:
⚫ FSM: a management process of Huawei scale-out storage that provides operation and
maintenance (O&M) functions, such as alarm management, monitoring, log management, and
configuration. It is recommended that this module be deployed on two nodes in active/standby
mode.
⚫ Virtual Block Service (VBS): a process that provides the scale-out storage access point service
through SCSI or iSCSI interfaces and enables application servers to access scale-out storage
resources.
⚫ Object Storage Device (OSD): a component of Huawei scale-out storage for storing user data in
distributed clusters.
⚫ REP: data replication network
⚫ Enterprise Data Service (EDS): a component that processes I/O services sent from VBS.
The OceanStor Pacific series supports mainstream front-end ports for mass storage. It supports
10GE, 25GE, 40GE, 100GE, HDR-100, and EDR IB. It supports TCP and IP standard protocols as well as
RDMA and RoCE.
3.1.2.2 Product Features
Highlights of the next-generation Huawei all-flash storage:
Excellent performance: 21 million IOPS and 0.05 ms latency based on the SPC-1 model, and 30%
higher NAS performance than the industry average
High reliability: tolerance of failures of seven out of eight controllers, and the only active-active
architecture for SAN and NAS integration in the industry, ensuring service continuity
High efficiency: intelligent O&M and device-cloud synergy, promoting intelligent and simplified
storage resource usage
The SmartMatrix 3.0 full-mesh architecture leverages a high-speed, fully interconnected passive
backplane to connect to multiple controllers. Interface modules are shared by all controllers over the
backplane, allowing hosts to access any controller via any port. The SmartMatrix full-mesh
architecture allows close coordination between controllers, simplifies software models, and achieves
active-active fine-grained balancing, high efficiency, and low latency.
FlashLink® provides high IOPS concurrency with low latency ensured. FlashLink® employs a series of
optimizations for flash media. It associates controller CPUs with SSD CPUs to coordinate SSD
algorithms between these CPUs, thereby achieving high system performance and reliability.
An SSD uses flash memory (NAND flash) to store data persistently. Compared with a traditional
HDD, an SSD features higher speed, lower energy consumption, lower latency, smaller size, lighter
weight, and better shock resistance.
High performance
⚫ All SSD design for high IOPS and low latency
⚫ FlashLink supported, providing intelligent multi-core, efficient RAID, hot and cold data
separation, and low latency guarantee
High reliability
⚫ Component failure protection implemented through the redundancy design and active-active
working mode; SmartMatrix 3.0 front- and back-end full-mesh architecture for controller
enclosures, ensuring high efficiency and low latency
⚫ Component redundancy design, power failure protection, and coffer disks
⚫ Advanced data protection technologies, including HyperSnap, HyperReplication, HyperClone,
and HyperMetro
⚫ RAID 2.0+ underlying virtualization
High availability
⚫ Online replacement of components, including controllers, power modules, interface modules,
and disks
⚫ Disk roaming, providing automatic identification of disks with slots changed and automatic
restoration of original services
⚫ Centralized management of resources in third-party storage systems
3.1.2.3 Product Form
Product form:
OceanStor Dorado V6 uses the next-generation innovative hardware platform and adopts the high-
density design for controllers and disk enclosures.
⚫ High-end storage models use 4 U independent controller enclosures. Each enclosure supports a
maximum of 28 interface modules. Mirroring in controller enclosures and interconnection
across enclosures are implemented through 100 Gbit/s RDMA networks.
⚫ Mid-range storage models use 2 U controller enclosures. Each enclosure has integrated disks
and supports a maximum of 12 interface modules. The controllers are mirrored through 100
Gbit/s RDMA networks within each enclosure. Two controller enclosures are interconnected
through 25 Gbit/s RDMA networks.
⚫ The entry-level storage model uses 2 U controller enclosures. Each enclosure has integrated
disks and supports a maximum of 6 interface modules. The controllers are mirrored through 40
Gbit/s RDMA networks within each enclosure. Two controller enclosures are interconnected
through 25 Gbit/s RDMA networks.
⚫ A smart disk enclosure provides 25 SAS SSD slots (standard) and a high-density one supports 36
NVMe SSDs. A common disk enclosure supports 25 SAS SSDs (standard).
3.1.2.4 Innovative and Intelligent Hardware Accelerates Critical Paths (Ever Fast)
Huawei developed components for all key I/O paths to provide customers with ultimate
performance.
Intelligent multi-protocol interface module: offloads protocol parsing previously performed by
controller CPUs, which reduces controller CPU workloads and improves overall performance.
Huawei Kunpeng high-performance computing platform: delivers 25% higher computing power than
same-level Intel CPUs. In addition, the advantages in CPU and core quantities further improve
storage system performance. To be specific, 4 chips are deployed on a controller. Therefore, a 4 U 4-
controller device has 16 chips with 192 cores.
The Huawei-developed intelligent acceleration module is inserted into a controller as an interface
module to work with the internal intelligent cache algorithm. When I/O data flows from the upper
layer come, the intelligent acceleration module automatically captures the flows, learns and
analyzes their rules, predicts possible follow-up actions, and prefetches cache resources. In this way,
the read cache hit ratio is greatly improved, especially in batch read scenarios.
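The prefetch idea can be illustrated with a toy sequential-pattern detector: if recent reads are contiguous, the
following blocks are loaded into cache before the host asks for them. This is only a sketch of the general technique;
the intelligent acceleration module's actual algorithm is not described in this guide, and the names are made up.

# Toy read-ahead sketch (illustrative only; not the actual cache algorithm).

class ReadAheadCache:
    def __init__(self, prefetch_depth=4):
        self.cache = set()
        self.last_block = None
        self.prefetch_depth = prefetch_depth

    def read(self, block, backend):
        hit = block in self.cache
        if not hit:
            self.cache.add(backend(block))
        # If the access pattern looks sequential, prefetch the following blocks.
        if self.last_block is not None and block == self.last_block + 1:
            for b in range(block + 1, block + 1 + self.prefetch_depth):
                self.cache.add(backend(b))
        self.last_block = block
        return hit

backend = lambda b: b          # stands in for an SSD read
cache = ReadAheadCache()
hits = [cache.read(b, backend) for b in range(8)]
print(hits)  # after the sequential pattern is detected, later reads hit the cache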
Intelligent SSD controller: The brain of SSDs. It accelerates read and write responses (single disk: 15%
faster than peer vendors) and reduces latency.
Intelligent management hardware: has a built-in storage fault library which contains 10 years of
accumulated fault data. The chip speeds up fault locating and offers corresponding solutions and
self-healing suggestions, improving fault locating accuracy from 60% to 93%.
With the preceding Huawei-developed hardware for transmission, computing, intelligence, storage,
and management, Huawei all-flash storage delivers consistently faster ("Ever Fast") performance than peer vendors.
3.1.2.5 Software Architecture
The OceanStor OS is used to manage hardware and support storage service software.
⚫ The basic function control software provides basic data storage and read/write functions.
⚫ The value-added function control software provides advanced functions such as backup, DR,
and performance tuning.
⚫ The management function control software provides management functions for the storage
system.
The maintenance terminal software is used for system configuration and maintenance. You can use
software such as SmartKit and eService on the maintenance terminal to configure and maintain the
storage system.
Application server software includes OceanStor BCManager and UltraPath.
Huawei's high-performance all-flash solution resolves these problems. High-end all-flash storage
systems are used to carry multiple core applications (such as transaction system database services).
The processing time is reduced by more than half, the response latency is shortened, and service
efficiency is improved severalfold.
⚫ Multiple control and management functions, such as SmartTier, SmartQoS, and SmartThin,
providing refined control and management
⚫ GUI-based operations and management through DeviceManager
⚫ Self-service intelligent O&M provided by eService
Product form:
A storage system consists of controller enclosures and disk enclosures. It provides an intelligent
storage platform that features robust reliability, high performance, and large capacity.
Different models of products are configured with different types of controller enclosures and disk
enclosures.
3.1.3.2 Converged SAN and NAS
Convergence of SAN and NAS storage technologies: One storage system supports both SAN and NAS
services at the same time and allows SAN and NAS services to share storage device resources. Hosts
can access any LUN or file system through the front-end port of any controller. During the entire
data lifecycle, hot data gradually becomes cold data. If cold data occupies cache or SSDs for a long
time, resources will be wasted and the long-term performance of the storage system will be
affected. The storage system uses the intelligent storage tiering technology to flexibly allocate data
storage media in the background.
The intelligent tiering technology requires a device with different media types. Data is monitored in
real time. Data that is not accessed for a long time is marked as cold data and is gradually
transferred from high-performance media to low-speed media, ensuring that service responsiveness
does not degrade. If cold data becomes active again, it can be quickly moved back to high-
performance media, ensuring stable system performance.
Manual or automatic migration policies are supported.
3.1.3.3 Support for Multiple Service Scenarios
Huawei OceanStor hybrid flash storage systems integrate SAN and NAS and support multiple storage
protocols. This improves the service scope for general-purpose, backup, and DR scenarios in
government, finance, carrier, and manufacturing fields, among others.
The OceanStor Pacific series includes performance- and capacity-oriented products which support
tiered storage to meet application requirements in different scenarios.
OceanStor Pacific 9950 is a typical performance-oriented model. It provides customers with ultimate
all-flash performance of mass data storage. A single device delivers 160 GB/s bandwidth
performance.
OceanStor Pacific 9550 is a typical capacity-oriented device. A single device provides 120 disk slots.
Using mainstream 14 TB disks, a single device can provide 1.68 PB storage capacity.
The OceanStor Pacific series supports mainstream front-end ports for mass storage. It supports
10GE, 25GE, 40GE, 100GE, HDR-100 and EDR IB. It supports TCP and IP standard protocols as well as
RDMA and RoCE.
3.1.4.2 Appearance of OceanStor Pacific 9950
OceanStor Pacific 9950 is a 5 U 8-node all-flash high-density device developed based on the Huawei
Kunpeng 920 processor, providing customers with ultimate performance of scale-out storage. It
features high performance, high density, ultimate TCO, and high reliability.
The device has eight independent pluggable nodes on the front panel. Each node supports 10 half-
palm NVMe SSDs, and the entire device supports 80 half-palm NVMe SSDs. Each node supports 32
GB battery-protected memory space to protect data in the event of a power failure.
The back panel of the device consists of three parts. The 2 data cluster modules in the upper part
provide each node with two 100GE RoCE dual-plane back-end storage network ports. The 16
interface modules in the middle are used for front-end network access. Each two interface modules
belong to a node. The 6 power supply slots in the lower part provide 2+2 power supply redundancy,
improving reliability.
OceanStor Pacific 9950 adopts a full FRU design. Nodes, BBUs, half-palm NVMe SSDs, fan modules,
data cluster modules, interface modules, and power modules of the device are independently
pluggable, facilitating maintenance and replacement.
Front view: contains the dust-proof cover, status display panel (operation indicator on the right:
indicates push-back or pull-forward; location indicator: steady amber for component replacement
including main storage disks, fans, BBUs, and half-palm NVMe cache SSDs), alarm indicator (front),
fans, power modules (two on the left and right, respectively), BBUs, disks, node control board (rear),
and overtemperature indicator.
Rear view: There are two storage node integration boards containing node 0 and node 1. Each node
is configured with a Kunpeng 920 processor and interface module slots of various standards. Huawei
uses a design that places disks at the top and nodes at the bottom instead of the traditional design
in which disks are at the front and nodes are at the rear with vertical backplanes. Each integration
control board manages 60 disks from the front to rear and connects disk enclosures to CPU
integration boards using flexible cables in tank chain mode.
Top view: The front and rear disk areas and the fan area in the middle are visible. The front and rear
disk areas contain 120 disk slots. The fan area in the middle contains fans in 5 x 2 stacked mode to
form double-layer air channels. To cope with the high air resistance of 120 disks, Huawei customizes
ten 80 mm aviation-standard counter-rotating fans to form a fan wall containing two air channels.
The upper channel is the main channel, which is used for heat dissipation of front-row disks. The
lower channel is used for heat dissipation of CPUs and rear-row disks. The fans draw air from the
front and blow air from the rear, perfectly solving the heat dissipation problem of high-density
devices. Compared with the mainstream heat dissipation technology in the industry, the reliability of
OceanStor Pacific 9550 components is improved by more than 100%.
The rightmost part shows finer division of the physical disk area. The entire disk area is divided into
eight sawtooth sub-areas. Each sub-area is in 7+8 layout mode, using expansion modules for
connection.
OceanStor Pacific 9550 adopts a full FRU design. Nodes, power supplies, BBUs, half-palm NVMe
cache SSDs, expansion modules, fan modules, and disk modules are all hot-swappable, facilitating
maintenance and replacement.
Fans are in N+1 redundant mode. There are five groups of fans. Each fan has two motors (rotors). If
one motor is faulty, the system runs properly. If the entire fan is faulty, replace it immediately.
3.1.4.3 Software System Architecture
NFS/CIFS: standard file sharing protocols that allow the system to provide file sharing for various OSs.
POSIX/MPI: compatible with standard POSIX and MPI semantics in scenarios where file sharing is
implemented using DPC; provides parallel interfaces and intelligent data caching algorithms so that
upper-layer applications can access storage space more intelligently.
S3: processes Amazon S3 and NFS protocol messages and the object service logic.
HDFS: provides standard HDFS interfaces for external systems.
SCSI/iSCSI: provides volumes for OSs and databases by locally mapping the volumes using SCSI
standard drivers or by mapping volumes to application servers through multipathing software and
iSCSI interfaces.
Data redundancy management: performs erasure coding calculation to ensure high data reliability.
Distributed data routing: evenly distributes data and metadata to storage nodes according to preset
rules.
Cluster status control: controls the distributed cluster status.
Strong-consistency replication protocol: ensures data consistency for HyperMetro pairs in the block
service.
Data reconstruction and balancing: reconstructs and balances data.
3.1.4.4 Product Features
The Huawei OceanStor Pacific series provides industry-leading hardware for different industries in
different application scenarios to meet diversified user requirements. It provides block, object, big
data, and file storage protocols and uses one architecture to meet different data storage
requirements. Oriented to different industries in different service scenarios, it provides leading
solutions and storage products with higher efficiency, reliability, and cost-effectiveness to help
customers better store and use mass data.
3.1.4.4.1 FlashLink - Multi-core Technology
Data flows from I/O cards to CPUs. The CPU algorithm used by the Huawei OceanStor Pacific series
has the following advantages:
CPUs use the latest intelligent partitioning algorithm, including the CPU grouping algorithm and CPU
core-based algorithm. The CPU grouping algorithm divides I/Os into different groups (for example,
data read/write and data exchange) so that they are independent to avoid mutual interference and
ensure read and write performance. I/Os with lower priorities are in the same group and share CPU
resources. In this way, resource utilization can be maximized.
The CPU core-based algorithm is Huawei's unique advantage. Without it, I/Os work on CPU cores in
polling mode: an I/O runs on core 1 of a CPU for a period of time and is then moved to core 2.
Switching from core 1 to core 2 takes time, and to ensure data consistency, core 1 must be locked
during the switch so that the I/O runs only on core 2, which also takes time. If, after running on core
2 for a while, the I/O needs to switch back to core 1, core 2 must be locked and core 1 unlocked.
These operations are all time consuming.
The CPU core-based algorithm solves CPU polling and locking/unlocking problems by enabling a
reading or writing request to be continuously executed on the same core of a CPU. The lock-free
design enables an I/O to run on the same core always, saving the time spent on core switching,
locking, and unlocking.
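A minimal sketch of the core-binding idea: each request is hashed to a fixed worker queue, so a given I/O is always
processed by the same worker and no locking between workers is needed. The structure and names below are
illustrative assumptions, not Huawei's implementation.

# Illustrative sketch: pin each I/O to one worker queue so no cross-core locks are needed.
from collections import deque

NUM_CORES = 4
queues = [deque() for _ in range(NUM_CORES)]   # one private queue per "core"

def submit(io_id, payload):
    """Always route the same I/O (e.g. the same LUN/LBA range) to the same core."""
    core = hash(io_id) % NUM_CORES
    queues[core].append((io_id, payload))
    return core

def run_core(core):
    """Each worker drains only its own queue, so no locking or core switching occurs."""
    results = []
    while queues[core]:
        io_id, payload = queues[core].popleft()
        results.append((io_id, payload.upper()))   # stand-in for the real I/O work
    return results

print(submit("lun1:block42", "write"))
print(run_core(submit("lun1:block42", "read")))   # the same I/O id always lands on the same core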
Data flows from CPUs to cache. Cache is important for storage. The cache algorithm is one of the key
factors that determine storage performance. The cache algorithm has the following highlights:
1 The binary tree algorithm of traditional storage is abandoned. The hash table algorithm
dedicated to flash memory is used. The binary tree algorithm occupies less space but the search
speed is low. The hash table algorithm provides high search speed and occupies more space.
Multi-level hash is used to save space and improve the search speed.
2 Metadata and data cache resources are partitioned to improve the metadata hit ratio. Read
data and write data are partitioned to avoid resource preemption.
3 Hash indexes support hot and cold partitioning to improve the search speed.
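The cache-index ideas above (hash lookup instead of tree search, plus partitioning of metadata and hot/cold data) can
be sketched as follows. A plain dictionary per partition stands in for the multi-level hash; it is an illustration
only, with made-up keys and locations.

# Illustrative sketch of a partitioned hash index for cached blocks (not product code).

class PartitionedCacheIndex:
    def __init__(self):
        # Separate partitions avoid resource preemption between classes of entries.
        self.partitions = {"metadata": {}, "hot_data": {}, "cold_data": {}}

    def insert(self, partition, key, location):
        self.partitions[partition][key] = location      # O(1) hash insert

    def lookup(self, key):
        # Search hotter partitions first; this mimics hot/cold partitioning of the index.
        for name in ("metadata", "hot_data", "cold_data"):
            if key in self.partitions[name]:
                return name, self.partitions[name][key]
        return None

index = PartitionedCacheIndex()
index.insert("hot_data", "lun3:lba100", "scm:slot17")
print(index.lookup("lun3:lba100"))   # ('hot_data', 'scm:slot17')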
3.1.4.4.2 SmartInterworking
To solve the problems including long data analysis link, large data volume, and long copy period in
HPC and big data analysis scenarios such as autonomous driving, the OceanStor Pacific series
provides SmartInterworking to implement multi-protocol convergence and interworking of files,
objects, and big data without semantics or performance loss. Data can be accessed through multiple
semantics without format conversion. In this way, the data copy process is eliminated, data analysis
efficiency is greatly improved, and redundant data of data analysis and publishing stored in
traditional production systems is avoided, saving storage space. Based on protocol interworking of
files, objects, and big data, data can be accessed by multiple services at the same time without data
migration, improving efficiency and maximizing the unique value of the data lake. Among products
of the same type, Dell EMC products support unstructured protocol interworking but there is
semantics loss and some native functions are unavailable. Inspur and H3C products are developed based on
open-source Ceph. These products usually support protocol interworking by adding gateways to a
certain storage protocol. However, the extra logic conversion causes great performance loss.
3.1.4.4.3 SmartIndexing
In mass data scenarios, users must use metadata to manage mass data. For example, users must
obtain the list of images or files whose name suffix is .jpeg, the list of files whose size is greater than
10 GB, and the list of files created before a specific date before managing them.
The OceanStor Pacific series provides the metadata indexing function (SmartIndexing) based on
unstructured (file, object, and big data) storage pools. SmartIndexing can be enabled or disabled by
namespace (or file system). For storage systems where metadata indexing is enabled, after front-end
service I/Os change the metadata of files, objects, or directories, the changed metadata is
proactively pushed to the index system in asynchronous mode, which is different from periodic
metadata scanning used by traditional storage products. In this way, impact on production system
performance is prevented and requirements on quick search for mass data are met. For example,
the search result of hundreds of billions of metadata records can be returned within seconds. This
feature is widely applicable to services such as electronic check image and autonomous driving
training.
Application scenarios: Electronic check image and autonomous driving data preprocessing
Benefits: Quick search is supported in scenarios where there are multiple directory levels or massive
files. Multi-dimensional metadata search is supported. The search dimensions are more than twice
that of using the find command.
3.1.4.4.4 Parallel File System
The process of HPC/HPDA and big data applications consists of multiple phases. Some phases require
large I/Os with high bandwidth, and some phases require small I/Os with high IOPS. Traditional HPC
storage acts well at only one performance model. As a result, different storage products are required
in different phases, and data needs to be copied between different phases, greatly reducing
efficiency.
The next-generation parallel file system of OceanStor Pacific achieves high bandwidth and IOPS
through innovations. One storage system meets the requirements of hybrid loads.
1 In terms of architecture, OceanStor Pacific distributes metadata to multiple nodes by directory.
In this way, each directory's metadata has an owning node, and small I/Os can be forwarded to
that node for direct processing, similar to centralized storage, eliminating distributed lock
overheads and greatly reducing the read/write latency of small I/Os.
2 In terms of I/O flow, large I/Os are written to disks in pass-through mode, improving bandwidth.
Small I/Os are written to disks after aggregation through cache, ensuring low latency and
improving small file utilization.
3 In terms of data layout, OceanStor Pacific uses two-level indexes. The fixed-length large-
granularity primary indexes ensure sequential read/write performance of large I/Os. Sub-
indexes can be automatically adapted based on I/O sizes to avoid write penalty when small I/Os
are written.
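The I/O-flow behavior in point 2 can be sketched as a simple size-based router: large I/Os bypass the cache and go
straight to disk, while small I/Os are aggregated in cache and flushed in batches. The threshold, batch size, and
names below are assumptions chosen only for illustration.

# Illustrative sketch of size-based I/O routing (threshold chosen arbitrarily).

LARGE_IO_THRESHOLD = 512 * 1024      # assume 512 KB as the pass-through threshold

class IoRouter:
    def __init__(self, flush_batch=4):
        self.pending_small = []
        self.flush_batch = flush_batch
        self.disk_writes = []

    def write(self, data: bytes):
        if len(data) >= LARGE_IO_THRESHOLD:
            self.disk_writes.append(data)            # large I/O: pass-through for high bandwidth
        else:
            self.pending_small.append(data)          # small I/O: aggregate in cache first
            if len(self.pending_small) >= self.flush_batch:
                self.disk_writes.append(b"".join(self.pending_small))
                self.pending_small = []

router = IoRouter()
router.write(b"x" * LARGE_IO_THRESHOLD)              # written to disk immediately
for _ in range(4):
    router.write(b"y" * 4096)                        # flushed as one aggregated write
print(len(router.disk_writes))                       # -> 2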
3.1.4.4.5 Distributed Parallel Client (DPC)
DPC is short for the Huawei distributed parallel client.
Some HPC scenarios require single-stream bandwidth and single-client bandwidth. In other
scenarios, MPI-IO (multiple clients concurrently access the same file) may be used. When NFS is
used, a single client can connect to only one storage node, TCP is mainly used for network access,
and MPI-IO is not supported. Therefore, requirements in these scenarios cannot be met.
To address these issues, OceanStor Pacific provides the next-generation DPC, which is deployed on
compute nodes. A single compute client can connect to multiple storage nodes, eliminating
performance bottlenecks caused by storage node configurations and maximizing compute node
capabilities.
In addition, I/O-level load balancing in DPC access mode is better than that in NFS access mode (load
balancing determination is performed only during mounting). In this way, storage cluster
performance is improved.
DPC supports POSIX and MPI-IO access modes. MPI applications can obtain better access
performance without modification.
DPC supports RDMA networks in IB and RoCE modes. It can directly perform EC on data on the client
to ensure lower data read/write latency.
3.1.4.4.6 Solutions for Four Typical Scenarios
The OceanStor Pacific series provides solutions for four typical scenarios: HPC storage, virtualization
and cloud resource pool storage, object resource pool, and decoupled storage and compute in the
big data analysis scenario.
Huawei FusionCube 1000 is an edge IT infrastructure solution that adopts the all-in-one design and is
delivered as an entire cabinet. It is mainly used in edge branches and vertical industry edge
application scenarios, such as gas station, campus, coal mine, and power grid. It can be deployed in
offices and remotely managed using FusionCube Center. With FusionCube Center Vision (FCV), this
solution offers integrated cabinets, service deployment, O&M management, and troubleshooting
services in a centralized manner. It greatly shortens the deployment duration and reduces the O&M
cost.
Hyper-converged infrastructure (HCI) is a set of devices consolidating not only compute, network,
storage, and server virtualization resources, but also elements such as backup software, snapshot
technology, data deduplication, and inline compression. Multiple sets of devices can be aggregated
by a network to achieve modular, seamless scale-out and form a unified resource pool.
Management: the DPA management plane, which provides deployment, upgrade, capacity
expansion, monitoring, and multi-domain management operations.
The fair usage policy relies on a preset threshold. When the storage usage of a user in a period reaches
the threshold, the system automatically executes the preset restriction policy to reduce the
bandwidth allocated to the user.
The fair usage policy is useful in certain scenarios. For example, if a carrier launches an unlimited
traffic package, this policy prevents subscribers from over-consuming network resources.
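A minimal sketch of this policy, under the assumption that usage is accumulated per user per period and bandwidth
is reduced once the preset threshold is crossed. The threshold and bandwidth values are invented for illustration.

# Illustrative fair-usage throttling sketch (thresholds and rates are made up).

THRESHOLD_GB = 100          # preset usage threshold per period
NORMAL_BW_MBPS = 1000
RESTRICTED_BW_MBPS = 100

usage_gb = {}               # per-user usage within the current period

def record_usage(user, gigabytes):
    usage_gb[user] = usage_gb.get(user, 0) + gigabytes

def allowed_bandwidth(user):
    """Return the bandwidth allocated to the user for the rest of the period."""
    if usage_gb.get(user, 0) >= THRESHOLD_GB:
        return RESTRICTED_BW_MBPS     # the restriction policy kicks in
    return NORMAL_BW_MBPS

record_usage("tenant_a", 120)
print(allowed_bandwidth("tenant_a"))   # 100 -> throttled
print(allowed_bandwidth("tenant_b"))   # 1000 -> unrestricted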
3.1.6.3 All-Scenario Data Protection for the Intelligent World
Huawei's investment in storage products not only focuses on hardware products, but also aims to
build E2E products and solutions for customers.
To cope with the rapid growth of diversified data, OceanProtect implements full DR of hot data, hot
backup of warm data, and warm archiving of cold data throughout the data lifecycle and in all
scenarios, ensuring uninterrupted services, zero data loss, and long-term information retention.
Huawei storage provides comprehensive DR solutions, including active/standby DR, active-active
storage, and 3DC, and implements unified, intelligent, and simplified DR scheduling and O&M with
OceanStor BCManager.
In terms of backup, Huawei provides centralized and all-in-one backup methods to meet
varying demands in different scenarios.
For archiving, Huawei provides a solution for archiving data to the local storage system.
3.1.6.4 Active-active Architecture and RAID-TP Technology for System-level
Reliability of Services and Data.
The number of write cycles largely determines SSD lifespan. Huawei uses its patented global wear
leveling and anti-wear leveling technologies to counteract the effects of write cycles and extend SSD lifespan.
1 Huawei RAID 2.0+ evenly distributes data to SSDs based on fingerprints, to level SSD wear and
improve SSD reliability.
2 Near the end of SSD lifecycles, Huawei uses global anti-wear leveling to increase the write load
on one selected SSD so that SSDs do not wear out at the same time, preventing simultaneous multi-SSD failures.
Huawei backup storage uses two technologies to reduce the risk of multi-SSD failures, while
extending SSD lifespans and improving system reliability.
3.1.6.5 Data Protection Appliance: Comprehensive Protection for User Data and
Applications
CDM is short for Converged Data Management.
1 In D2C scenarios, production data is directly backed up to the cloud by the backup software.
In D2D2C scenarios, backup copies are tiered and archived to the cloud.
2 The difference between logical backup and physical backup is as follows. In physical backup,
data is replicated from the active node to the standby node in units of disk blocks, that is, in
units of sectors (512 bytes). By contrast, logical backup replicates data from the active node to
the standby node in units of files, so the size of a logical backup depends on the file size. In
physical backup, the entire file does not need to be backed up when a small change occurs;
only the changed blocks are backed up, which improves backup efficiency and shortens the
backup window.
The super administrator can log in to the storage system using this authentication mode only.
Before logging in to DeviceManager as a Lightweight Directory Access Protocol (LDAP) domain user,
first configure the LDAP domain server, and then configure parameters on the storage system to add
it into the LDAP domain, and finally create an LDAP domain user.
By default, DeviceManager allows 32 users to log in concurrently.
A storage system provides built-in roles and supports customized roles.
Built-in roles are preset in the system with specific permissions. Built-in roles include the super
administrator, administrator, and read-only user.
Permissions of user-defined roles can be configured based on actual requirements.
To support permission control in multi-tenant scenarios, the storage system divides built-in roles
into two groups: system group and tenant group. Specifically, the differences between the system
group and tenant group are as follows:
⚫ Tenant group: roles in this group are used only in the tenant view (view that can be operated
after you log in to DeviceManager using a tenant account).
⚫ System group: roles belonging to this group are used only in the system view (view that can be
operated after you log in to DeviceManager using a system group account).
Huawei UltraPath provides the following functions:
⚫ The path to the owning controller of a LUN is used to achieve the best performance.
⚫ Virtual LUNs mask physical LUNs and are visible to upper-layer users. Read/Write operations are
performed on virtual LUNs.
⚫ Mainstream clustering software: MSCS, VCS, HACMP, Oracle RAC, and so on
⚫ Mainstream database software: Oracle, DB2, MySQL, Sybase, Informix, and so on
⚫ After link recovery, failback immediately occurs without manual intervention or service
interruption.
⚫ Multiple paths are automatically selected to deliver I/Os, improving I/O performance. Paths are
selected based on the path workload (see the sketch after this list).
⚫ Failover occurs if a link becomes faulty, preventing service interruption.
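The workload-based path selection referenced above can be sketched as follows: among the paths that are currently
healthy, the one with the least outstanding I/O is chosen. This illustrates the general technique only; it is not
UltraPath's actual algorithm, and the path data is hypothetical.

# Illustrative least-busy path selection (not UltraPath's actual algorithm).

paths = [
    {"name": "path_ctrl_a_port0", "healthy": True,  "outstanding_ios": 12},
    {"name": "path_ctrl_a_port1", "healthy": True,  "outstanding_ios": 3},
    {"name": "path_ctrl_b_port0", "healthy": False, "outstanding_ios": 0},
]

def select_path(paths):
    """Pick the healthy path with the lightest workload; fail over away from faulty paths."""
    healthy = [p for p in paths if p["healthy"]]
    if not healthy:
        raise RuntimeError("no usable path: all links are faulty")
    return min(healthy, key=lambda p: p["outstanding_ios"])

print(select_path(paths)["name"])   # -> path_ctrl_a_port1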
After a thin LUN is created, if an alarm is displayed indicating that the storage pool has no available
space, you are advised to expand the storage pool as soon as possible. Otherwise, the thin LUN may
enter the write through mode, causing performance deterioration.
3.3.2 SmartTier&SmartCache
3.3.2.1 SmartTier
SmartTier is also called intelligent storage tiering. It provides the intelligent data storage
management function that automatically matches data to the storage media best suited to that type
of data by analyzing data activities.
SmartTier migrates hot data to storage media with high performance (such as SSDs) and moves idle
data to more cost-effective storage media (such as NL-SAS disks) with more capacity. This provides
hot data with quick response and high input/output operations per second (IOPS), thereby
improving the performance of the storage system.
3.3.2.1.1 Dividing Storage Tiers
In the same storage pool, a storage tier is a collection of storage media with the same performance.
SmartTier divides storage media into high-performance, performance, and capacity tiers based on
their performance levels. Each storage tier uses one type of disk and one RAID policy.
⚫ High-performance tier
Disk type: SSDs
Disk characteristics: SSDs deliver high IOPS and respond to I/O requests quickly. However, the
cost per unit of storage capacity is high.
Application characteristics: Applications with intensive random access requests are often
deployed at this tier.
Data characteristics: It carries the most active data (hot data).
⚫ Performance tier
Disk type: SAS disks
Disk characteristics: This tier delivers high bandwidth under heavy service workloads. I/O
requests are responded to relatively quickly. Data writes are slower than data reads if no data is
cached.
Application characteristics: Applications with moderate access requests are often deployed at
this tier.
Data characteristics: It carries hot data (active data).
⚫ Capacity tier
Disk type: NL-SAS disks
Disk characteristics: NL-SAS disks deliver low IOPS and respond to I/O requests slowly. However,
the cost per unit of storage capacity is low.
Application characteristics: Applications with fewer access requests are often deployed at this
tier.
Data characteristics: It carries cold data (idle data).
The types of disks in a storage pool determine how many storage tiers there are.
3.3.2.1.2 Managing Member Disks
A storage pool with SmartTier enabled manages SCM drives and SSDs as member disks.
Number of hot spare disks: Each tier reserves its own hot spare disks. For example, in the case of a
low hot spare policy for a storage pool, one disk is reserved for the performance tier and the
capacity tier, respectively.
Large- and small-capacity disks: The capacity of each tier is calculated independently. Each tier
supports a maximum of two disk capacity specifications.
RAID: RAID groups are configured for each tier separately. Chunk groups (CKGs) are not formed
across different media.
Capacity: The available capacity of a storage pool is the sum of the available logical capacity of each
tier.
3.3.2.1.3 Migrating Data
If a storage pool contains both SCM drives and SSDs, the data newly written by a host is
preferentially saved to the performance tier for better performance. Data that has not been
accessed for a long time is migrated to the capacity tier. When the capacity of the preferred tier is
insufficient, data is written to the other tier.
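A small sketch of the migration behavior described in this subsection: new writes preferentially land on the
performance tier if it has free capacity, and data not accessed for a long time is relocated to the capacity tier.
The capacity figures and the cold-data threshold are assumptions made only for illustration.

# Illustrative tiering sketch (capacities and the cold threshold are assumptions).
import time

COLD_AFTER_SECONDS = 7 * 24 * 3600        # data untouched for a week counts as cold

class TieredPool:
    def __init__(self, perf_capacity=2, cap_capacity=10):
        self.tiers = {"performance": {}, "capacity": {}}
        self.free = {"performance": perf_capacity, "capacity": cap_capacity}

    def write(self, key, value):
        # New data is preferentially placed on the performance tier.
        tier = "performance" if self.free["performance"] > 0 else "capacity"
        self.tiers[tier][key] = (value, time.time())
        self.free[tier] -= 1

    def migrate_cold(self, now=None):
        now = now or time.time()
        for key, (value, last_access) in list(self.tiers["performance"].items()):
            if now - last_access > COLD_AFTER_SECONDS and self.free["capacity"] > 0:
                del self.tiers["performance"][key]
                self.free["performance"] += 1
                self.tiers["capacity"][key] = (value, last_access)
                self.free["capacity"] -= 1

pool = TieredPool()
pool.write("extent_1", b"hot data")
pool.migrate_cold(now=time.time() + 8 * 24 * 3600)   # a week later, extent_1 moves down
print(list(pool.tiers["capacity"]))                  # ['extent_1']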
3.3.2.2 SmartCache
3.3.2.2.1 SmartCache
Based on the short read response time of SCM drives, SmartCache uses SCM drives to compose a
SmartCache pool and caches frequently-read data to the SmartCache pool. This shortens the
response time for reading hot data, improving system performance.
SmartCache pool:
A SmartCache pool consists of SCM drives and is used as a complement of DRAM cache to store hot
data.
SmartCache partition:
The SmartCache partition is a logical concept based on the SmartCache pool. It is used to cache data
for LUN and file system services.
3.3.2.2.2 SmartCache Write Process
⚫ After receiving a write I/O request to a LUN or file system from a server, the storage system
sends data to the DRAM cache.
⚫ After the data is written to the DRAM cache, an I/O response is returned to the server.
⚫ The DRAM cache sends the data to the storage pool management module.
⚫ Data is stored on SSDs, and an I/O response is returned.
⚫ The DRAM cache sends data copies to the SmartCache pool. After the data is filtered by the cold
and hot data identification algorithm, the identified hot data is written to the SCM media, and
the metadata of the mapping between the data and SCM media is created in the memory.
⚫ Data is cached to the SmartCache pool, and an I/O response is returned.
3.3.2.2.3 SmartCache Read Hit
⚫ A read I/O request from an application server is first delivered to the DRAM cache before
arriving at LUNs or file systems.
⚫ If the requested data is not found in the DRAM cache, the read I/O request is further delivered
to the SmartCache pool.
⚫ If the requested data is found in the SmartCache pool, the read I/O request is delivered to SCM
drives. Data is read from the SCM drives and returned to the DRAM cache.
⚫ The DRAM cache returns the data to the application server.
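The read-hit flow above can be condensed into a small lookup chain: DRAM cache first, then the SmartCache (SCM)
pool, then the storage pool. The sketch below is illustrative only and ignores metadata management and hot-data
promotion policies; the keys and data are invented.

# Illustrative read path: DRAM cache -> SmartCache (SCM) pool -> storage pool (SSDs).

dram_cache = {}
smartcache_pool = {"lun1:lba8": b"hot block"}
storage_pool = {"lun1:lba8": b"hot block", "lun1:lba9": b"cold block"}

def read(key):
    if key in dram_cache:
        return dram_cache[key], "dram"
    if key in smartcache_pool:
        data = smartcache_pool[key]
        dram_cache[key] = data            # data is returned via the DRAM cache, as in the flow above
        return data, "smartcache"
    data = storage_pool[key]              # miss in both caches: read from the storage pool
    dram_cache[key] = data
    return data, "storage_pool"

print(read("lun1:lba8")[1])   # 'smartcache'
print(read("lun1:lba8")[1])   # 'dram' on the second read
print(read("lun1:lba9")[1])   # 'storage_pool'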
3.3.3 SmartAcceleration
SmartAcceleration is a key feature for performance improvement on the next-generation OceanStor
hybrid flash storage. It leverages the large block sequential write mechanism of redirect-on-write
(ROW) and uses a unified performance layer for cache and tier performance acceleration. This
breaks the bottleneck of traditional HDDs in random IOPS performance, maximizing the performance
of the hybrid flash system.
3.3.3.1 Unified Performance Layer That Flexibly Integrates Caches and Tiers
Global cold and hot data sensing and data collaboration algorithms, breaking the boundaries of
caches and tiers, and providing optimal data layout and simplified configuration:
⚫ Converged caches and tiers, preventing repeated data flow and improving efficiency
⚫ Global popularity, unifying cache's admission and eviction and tier's traffic distribution and
migration
3.3.4 SmartQoS
SmartQoS is an intelligent service quality control feature. It dynamically allocates storage system
resources to meet the performance requirement of certain applications.
SmartQoS is an essential value-added feature for a storage system, especially in certain applications
that demand high service-level requirements.
When multiple applications are deployed on the same storage device, users can obtain maximized
benefits through the proper configuration of SmartQoS.
Performance control reduces adverse impacts of applications on each other to ensure the
performance of critical services.
SmartQoS limits the resources allocated to non-critical applications to ensure high performance for
critical applications.
3.3.4.1 Upper Limit Traffic Control Management
SmartQoS traffic control is implemented by I/O queue management, token allocation, and dequeue
management for controlled objects.
SmartQoS determines the amount of storage resources to be allocated to an I/O queue of
controlled objects by counting the number of tokens owned by the queue. The more tokens owned
by an I/O queue, the more resources will be allocated to the queue, and the more preferentially the
I/O requests in the queue will be processed.
SmartQoS maintains a token bucket and an I/O queue for each LUN associated with a SmartQoS
policy and converts the upper limit defined in the policy into the token generation rate of the bucket.
If the maximum IOPS is limited in a SmartQoS policy, one I/O consumes one token. If the maximum
bandwidth is limited, one sector consumes one token.
A larger maximum IOPS or bandwidth indicates a larger number of tokens in the token bucket and
higher performance of the corresponding LUN.
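The token-bucket behavior described here can be sketched directly: tokens are generated at the configured rate, one
token is consumed per I/O for an IOPS limit (or per sector for a bandwidth limit), and requests without tokens wait.
This is a simplified, generic token bucket, not the storage system's actual traffic control code.

# Illustrative token bucket for an upper-limit SmartQoS-style policy (simplified).

class TokenBucket:
    def __init__(self, rate_per_second, burst):
        self.rate = rate_per_second       # e.g. the configured maximum IOPS
        self.capacity = burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now, cost=1):
        """cost=1 per I/O for an IOPS limit; cost=sectors for a bandwidth limit."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                       # the request stays queued until tokens accumulate

bucket = TokenBucket(rate_per_second=1000, burst=1000)   # cap the LUN at roughly 1000 IOPS
accepted = sum(bucket.allow(now=0.0) for _ in range(1500))
print(accepted)                                          # only the first 1000 requests pass at t=0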
3.3.5 SmartDedupe&SmartCompression
3.3.5.1 SmartDedupe
Working Principles of Post-processing Similarity-based Deduplication
⚫ The system divides to-be-written data into blocks. The default data block size is 8 KB.
⚫ The storage system uses a similar fingerprint algorithm to calculate similar fingerprints of the
new data blocks.
⚫ The storage system writes the data blocks to disks and records data blocks' fingerprint and
location information in the opportunity table.
⚫ The storage system periodically checks whether there are similar fingerprints in the opportunity
table.
➢ If yes, the storage system performs operations in the next step.
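Under the flow just described, a simplified post-processing pass might look like the sketch below: fingerprints and
block locations are recorded in an opportunity table at write time, and a periodic job later groups blocks with
matching fingerprints as deduplication candidates. An ordinary hash stands in for the similar-fingerprint algorithm,
which is not the same thing; all names are illustrative.

# Illustrative post-process deduplication sketch (hashlib stands in for the similar-fingerprint algorithm).
import hashlib
from collections import defaultdict

BLOCK_SIZE = 8 * 1024                       # default 8 KB data blocks
opportunity_table = []                      # (fingerprint, location) recorded at write time

def write_block(location, data):
    fingerprint = hashlib.sha256(data).hexdigest()
    opportunity_table.append((fingerprint, location))

def periodic_dedupe_scan():
    """Group locations whose fingerprints match; these are deduplication candidates."""
    groups = defaultdict(list)
    for fingerprint, location in opportunity_table:
        groups[fingerprint].append(location)
    return {fp: locs for fp, locs in groups.items() if len(locs) > 1}

write_block("lba_10", b"a" * BLOCK_SIZE)
write_block("lba_20", b"a" * BLOCK_SIZE)    # duplicate content
write_block("lba_30", b"b" * BLOCK_SIZE)
print(periodic_dedupe_scan())               # one group containing 'lba_10' and 'lba_20'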
3.3.6 SmartVirtualization
3.3.6.1 Related Concepts
Local storage system: refers to the OceanStor Dorado series storage system.
Heterogeneous storage system: can be either a storage system manufactured by another
mainstream vendor or a Huawei storage system of a specific model.
External LUN: a LUN in a heterogeneous storage system, which is displayed as a remote LUN in
DeviceManager.
eDevLUN: In the storage pool of a local storage system, the mapped external LUNs are reorganized
as raw storage devices based on a certain data organization form. A raw device is called an eDevLUN.
The physical space occupied by an eDevLUN in the local storage system is merely the storage space
needed by the metadata. The service data is still stored on the heterogeneous storage system.
Application servers can use eDevLUNs to access data on external LUNs in the heterogeneous storage
system, and the SmartMigration feature can be configured for the eDevLUNs.
Online takeover: During the online takeover process, services are not interrupted, ensuring service
continuity and data integrity. In this mode, the critical identity information about heterogeneous
LUNs is masqueraded so that multipathing software can automatically identify new storage systems
and switch I/Os to the new storage systems. This remarkably simplifies data migration and minimizes
time consumption.
Offline takeover: During the offline takeover process, connections between heterogeneous storage
systems and application servers are down and services are interrupted temporarily. This mode is
applicable to all compatible Huawei and third-party heterogeneous storage systems.
Hosting: LUNs in a heterogeneous storage system are mapped to a local storage system for use and
management.
3.3.6.2 Relationship Between an eDevLUN and an External LUN
The physical space needed by data is provided by the external LUN from the heterogeneous storage
system. Data does not occupy the capacity of the local storage system.
Metadata is used to manage storage locations of data on an eDevLUN. The space used to store
metadata comes from the metadata space in the storage pool created in the local storage system.
Metadata occupies merely a small amount of space. Therefore, eDevLUNs occupy a small amount of
space in the local storage system. (If no value-added feature is configured for eDevLUNs, each
eDevLUN occupies only dozens of KBs in the storage pool created in the local storage system.)
If value-added features are configured for eDevLUNs, each eDevLUN, like any other local LUNs,
occupies local storage system space to store the metadata of value-added features. Properly plan
storage space before creating eDevLUNs to ensure that value-added features can work properly.
3.3.7 SmartMigration
SmartMigration is a key technology for service migration. Services on a source LUN can be
completely migrated to a target LUN without interrupting host services. The target LUN can totally
replace the source LUN to carry services after the replication is complete.
3.3.7.1 Benefits of SmartMigration
Benefits of SmartMigration: Reliable service continuity: Service data is migrated non-disruptively,
preventing any loss caused by service interruption during service migration.
Stable data consistency: During service data migration, data changes made by hosts will be sent to
both the source LUN and target LUN, ensuring data consistency after migration and preventing data
loss.
Convenient performance adjustment: To flexibly adjust service performance levels, SmartMigration
migrates service data between different storage media and RAID levels based on service
requirements.
Data migration between heterogeneous storage systems: In addition to service data migration
within a storage system, SmartMigration also supports service data migration between a Huawei
storage system and a compatible heterogeneous storage system.
3.3.7.2 Working Principles of SmartMigration
SmartMigration is leveraged to adjust service performance or upgrade storage systems by migrating
services between LUNs.
3.3.7.3 Related Concepts
Storage systems use virtual storage technology. Virtual data in a storage pool consists of metadata
volumes and data volumes.
Metadata volumes: record the data storage locations, including LUN IDs and data volume IDs. LUN
IDs are used to identify LUNs, and data volume IDs are used to identify physical space of data
volumes.
Data volumes: store actual user data.
3.3.7.4 SmartMigration Service Data Synchronization
The two synchronization modes of service data are independent and can be performed at the same
time to ensure that service data changes on the host can be synchronized to the source LUN and the
target LUN.
Data change synchronization:
1 A host delivers an I/O write request to the LM module of a storage system.
2 The LM module writes the data to the source LUN and target LUN and records write operations
to the log.
3 The source LUN and target LUN return the data write result to the LM module.
4 The LM module determines whether to clear the log based on the write I/O result.
5 A write success acknowledgment is returned to the host.
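The data change synchronization steps can be sketched as a double write plus a log entry that is cleared only when
both copies succeed. This is an illustration of the described flow under simplified assumptions, not the LM module's
real code.

# Illustrative dual-write sketch for SmartMigration data change synchronization.

source_lun, target_lun, write_log = {}, {}, []

def write_to(lun, lba, data):
    lun[lba] = data
    return True                            # a real system would report media or link errors here

def host_write(lba, data):
    write_log.append((lba, data))          # record the operation before writing
    ok_source = write_to(source_lun, lba, data)
    ok_target = write_to(target_lun, lba, data)
    if ok_source and ok_target:
        write_log.pop()                    # both copies are consistent: clear the log entry
    return ok_source and ok_target         # acknowledgment returned to the host

print(host_write(42, b"new data"))         # True
print(write_log)                           # [] -> nothing left to replay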
3.3.7.5 SmartMigration LUN Information Exchange
LUN information exchange is the prerequisite for a target LUN to take over services from a source
LUN after service information synchronization.
In a storage system, each LUN and its corresponding data volume have a unique identifier, namely,
the ID of a LUN and data volume ID. A source LUN corresponds to a data volume. The former is a
logical concept whereas the latter is a physical concept.
Before LUN information exchange: A host identifies a source LUN by the ID of the source LUN. The ID
of a LUN corresponds to a data volume ID.
During LUN information exchange: The source data volume ID and the target data volume ID are
exchanged. The physical storage space to which the source LUN points becomes the target data
volume.
After LUN information exchange: The ID of the source LUN is unchanged, and users sense no fault
because services are not affected. The ID of the source LUN and target data volume ID form a new
mapping relationship. The host actually reads from and writes to the physical space of the target LUN.
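The exchange can be pictured as swapping the data-volume IDs behind two LUN IDs while the LUN IDs seen by the host
stay the same. The sketch below only illustrates this mapping swap; the identifiers are hypothetical.

# Illustrative sketch of LUN information exchange: swap the data volumes behind two LUN IDs.

lun_to_volume = {"source_lun_id": "data_volume_1", "target_lun_id": "data_volume_2"}

def exchange_lun_info(source, target):
    """Swap data-volume IDs; the host-visible LUN IDs do not change."""
    lun_to_volume[source], lun_to_volume[target] = (
        lun_to_volume[target],
        lun_to_volume[source],
    )

exchange_lun_info("source_lun_id", "target_lun_id")
print(lun_to_volume["source_lun_id"])   # 'data_volume_2': host I/O now lands on the target's space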
3.3.7.6 SmartMigration Pair Splitting
⚫ In splitting, host services are suspended. After information is exchanged, services are delivered
to the target LUN. In this way, service migration is transparent to users.
⚫ The consistency splitting of SmartMigration means that multiple pairs exchange LUN
information at the same time and concurrently remove pair relationships after the information
exchange is complete, ensuring data consistency at any point in time before and after the
pairs are split.
⚫ Pair splitting: Data migration relationship between a source LUN and a target LUN is removed
after LUN information is exchanged.
➢ After the pair is split, if the host delivers an I/O request to the storage system, data is only
written to the source LUN.
➢ The target LUN stores all data of the source LUN at the pair splitting point in time.
➢ After the pair is split, no connections can be established between the source LUN and
target LUN.
3.4.2 HyperClone
Definition
HyperClone creates a full copy of the source LUN's data on a target LUN at a specified point in time
(synchronization start time).
For file systems, HyperClone creates a clone of a file system or of a file system snapshot at a specific
point in time. After a clone file system has been created, its data (including the dtree configuration
and dtree data) is consistent with that of the parent file system at the corresponding point in time.
Features
The target LUN can be read and written during synchronization.
Full synchronization and differential synchronization are supported.
Forward synchronization and reverse synchronization are supported.
3.4.3 HyperCDP
3.4.3.1 LUN HyperCDP
Based on the lossless snapshot technology, HyperCDP has little impact on the performance of source
LUNs. Compared with writable snapshots, HyperCDP does not need to build LUNs, greatly reducing
memory overhead and providing stronger and continuous protection.
HyperCDP is a value-added feature that requires a license.
In the license file, the HyperCDP feature name is displayed as HyperCDP.
The HyperCDP license also grants the permissions for HyperSnap. If you have imported a valid
HyperCDP license to your storage system, you can perform all operations of HyperSnap even though
you do not import a HyperSnap license.
HyperCDP has the following advantages:
It provides data protection at an interval of seconds, with zero impact on performance and small
space occupation.
It supports scheduled tasks. You can specify HyperCDP schedules by day, week, month, or specific
interval to meet different backup requirements.
It provides intensive and persistent data protection. HyperCDP provides more recovery points for
data and provides shorter data protection intervals, longer data protection periods, and continuous
data protection.
Purposes and benefits
⚫ Efficient use of storage space, protecting user investments
⚫ HyperCDP objects for various applications
⚫ Continuous data protection
Working principles of HyperCDP
HyperCDP creates high-density snapshots on a storage system to provide continuous data
protection. Based on the lossless snapshot technology, HyperCDP has little impact on the
performance of source LUNs. Compared with writable snapshots, HyperCDP does not need to build
LUNs, greatly reducing memory overhead and providing stronger and continuous protection.
The storage system supports HyperCDP schedules to meet customers' backup requirements.
HyperCDP objects cannot be mapped to a host directly. To read data from a HyperCDP object, you
can create a duplicate for it and map the duplicate to the host.
3.4.3.2 FS HyperCDP
HyperCDP periodically generates snapshots for a file system based on a HyperCDP schedule. The
HyperCDP schedule can be a default (named NAS_DEFAULT_BUILDIN) or non-default one.
The default schedule is automatically created when the first file system is created in the storage
system. The storage system has only one default schedule and it cannot be deleted. OceanStor
Dorado 6.1.2 and later versions support the default schedule.
Non-default schedules are created by users as required.
3.4.4 HyperLock
Write Once Read Many (WORM), also called HyperLock, protects the integrity, confidentiality, and
accessibility of data, meeting secure storage requirements.
Working Principle
With the WORM technology, data can be written to files once only, and cannot be rewritten,
modified, deleted, or renamed. If a common file system is protected by the WORM feature, files in
the file system can be read only within the protection period. After WORM file systems are created,
you need to map them to application servers using the NFS or CIFS protocol.
WORM enables files in the WORM file system to shift between the initial, locked, appending, and
expired states, preventing important data from being accidentally or maliciously tampered with
during a specified period.
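The WORM file states named above can be sketched as a small transition table that rejects modification while a file
is locked. The allowed transitions and the omission of protection periods are simplifying assumptions made for this
illustration only.

# Illustrative WORM file-state sketch (simplified; protection periods are omitted).

ALLOWED_TRANSITIONS = {
    "initial":   {"locked"},
    "locked":    {"appending", "expired"},
    "appending": {"locked"},
    "expired":   {"locked"},
}

def transition(state, new_state):
    if new_state not in ALLOWED_TRANSITIONS[state]:
        raise ValueError(f"invalid WORM transition: {state} -> {new_state}")
    return new_state

def can_modify(state):
    """Once locked, file content can only be read; it cannot be rewritten or deleted."""
    return state == "initial"

state = transition("initial", "locked")
print(can_modify(state))    # False: the file is now read-only within its protection period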
Checking the eService installation and configuration: On the maintenance terminal, check whether
the eService tool has been installed and the alarm policy has been configured.
Information Collection
The information to be collected includes basic information, fault information, storage device
information, networking information, and application server information.
⚫ Customer information: provides the contact person and contact details.
⚫ Time when a fault occurs: records the time when the fault occurred.
⚫ Hardware module configuration: records the configuration information about the hardware of
the storage device.
⚫ Storage system data: manually export the running data and system logs of the storage device.