Exadata Platform Deep Dive

Deepanshu Agarwal
Cyril Malaki

Copyright © 2020 Oracle


Safe Harbor

The following is intended to outline our general product direction. It is intended for information purposes
only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code,
or functionality, and should not be relied upon in making purchasing decisions. The development,
release, timing, and pricing of any features or functionality described for Oracle’s products may change
and remains at the sole discretion of Oracle Corporation.

Statements in this presentation relating to Oracle’s future plans, expectations, beliefs, intentions and
prospects are “forward-looking statements” and are subject to material risks and uncertainties. A detailed
discussion of these factors and other risks that affect our business is contained in Oracle’s Securities and
Exchange Commission (SEC) filings, including our most recent reports on Form 10-K and Form 10-Q
under the heading “Risk Factors.” These filings are available on the SEC’s website or on Oracle’s website
at http://www.oracle.com/investor. All information in this presentation is current as of September 2019
and Oracle undertakes no duty to update any statement in light of new information or future events.



Exadata X8M
Continues Tradition of State-of-the-Art Hardware



Exadata X8M-2: State-of-the-Art Hardware
• Scale-Out 2-Socket Database Servers
  • 2-socket Xeon processors, 48 cores per server
  • 384 GB - 1.5 TB DRAM
  • 25/10 GigE external network
• 100Gb RDMA over Converged Ethernet (RoCE) Network Fabric
  • 100 GbE RoCE network fabric
• Scale-Out Intelligent 2-Socket Storage Servers (Smart Storage)
  • 32 cores for SQL offload
  • 192 GB DRAM
  • 1.5 TB Persistent Memory
  • 51.2 TB PCI NVMe Flash (EF)
  • 25.6 TB PCI NVMe Flash (HC)
  • 168 TB disk capacity (HC)
• Fully Redundant
Exadata X8M: Scalability
• Models: X8M-2 and X8M-8
• Configurations: Eighth Rack, Quarter Rack, Elastic, Multi-Rack Elastic



DB System Bottleneck
• ONE NVMe flash device saturates the network
• Throughput of all other flash devices is LOST
• One NVMe flash device (5.8 GB/sec) > one 40Gb/s SAN link (5 GB/sec)
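The bottleneck is plain arithmetic: a 40 Gb/s link carries at most 40 / 8 = 5 GB/s, less than the 5.8 GB/s a single NVMe device can deliver. A minimal sketch that ignores protocol overhead; the helper names and the eight-device count in the last line are hypothetical.

```python
def link_gbytes_per_sec(link_gbits: float) -> float:
    """Convert a link rate in Gb/s to GB/s (8 bits per byte, no protocol overhead)."""
    return link_gbits / 8


def delivered_and_lost(devices: int, device_gbs: float, link_gbs: float):
    """Return (GB/s delivered over one link, GB/s of flash throughput left unused)."""
    offered = devices * device_gbs
    delivered = min(offered, link_gbs)  # the single link caps delivered throughput
    return delivered, offered - delivered


link = link_gbytes_per_sec(40)
print(link, delivered_and_lost(1, 5.8, link))  # one device already exceeds the link
print(delivered_and_lost(8, 5.8, link))        # hypothetical 8-device server
```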



Exadata X8M
Unique differentiators
Exadata Storage Software 19.3



Exadata Uses RDMA for Extreme Performance
• Remote Direct Memory Access (RDMA) is the ability for one computer
to access data from a remote computer without any OS or CPU involvement
• Network card directly reads/writes memory with no extra copying or buffering
and very low latency

• RDMA is an integral part of the Exadata high-performance architecture, enabling:
• High throughput and low-CPU usage for large data transfers
• Unique Direct-to-Wire Protocol to deliver 3x faster inter-node OLTP cluster messaging
• Unique Smart Fusion Block Transfer that eliminates log write on inter-node block move
• Unique RDMA protocol to coordinate transactions between nodes

• InfiniBand was the only viable RDMA capable network at the inception of Exadata
• Ethernet has caught up



Remote Direct Memory Access (RDMA)
[Diagram: RDMA Read and RDMA Write move data directly between a memory region on the Database Server and a memory region on the Storage Server, without involving either server's CPU]



Exadata RoCE – RDMA Over Converged Ethernet
• RDMA over Converged Ethernet is a protocol that runs RDMA software on top of Ethernet
• Same software at upper levels of network protocol stack
• RoCE makes use of UDP encapsulation, allowing it to transcend Layer 3 networks
• RoCE on Exadata supports all Exadata RDMA optimizations
• Exadata RoCE provides RDMA speed and reliability on Ethernet fabric
  • 100Gb throughput
  • Zero packet loss messaging
  • Prioritization of critical database messages
• Defined by an open consortium (IBTA), developed in open source, maintained in upstream Linux
• Supported by major network card and switch vendors
[Diagram: protocol stacks by layer, RoCE vs InfiniBand. User: Application (both). Transport: InfiniBand (both). Network: IP (RoCE) vs InfiniBand network layer. Hardware: Ethernet (RoCE) vs InfiniBand.]
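To make the UDP encapsulation concrete, here is a minimal sketch that packs a UDP header followed by the 12-byte InfiniBand Base Transport Header (BTH), which is how RoCE v2 carries InfiniBand transport packets over IP. It assumes RoCE v2's IANA-assigned UDP destination port 4791 and the Reliable Connection "RDMA READ Request" opcode 0x0C. The IP and Ethernet headers, the trailing ICRC, and the 16-byte RDMA Extended Transport Header (remote address, key, length) that a real read request carries are all omitted, and the source port, queue pair, and PSN values are arbitrary.

```python
import struct

ROCEV2_UDP_PORT = 4791           # IANA-assigned UDP destination port for RoCE v2
OPCODE_RC_RDMA_READ_REQ = 0x0C   # InfiniBand Reliable Connection "RDMA READ Request"


def bth(opcode: int, dest_qp: int, psn: int, pkey: int = 0xFFFF, ack_req: bool = True) -> bytes:
    """Pack the 12-byte InfiniBand Base Transport Header in network byte order."""
    flags = 0                                            # SE, MigReq, PadCnt, TVer all zero
    qp_word = dest_qp & 0xFFFFFF                         # 8 reserved bits + 24-bit dest QP
    psn_word = (int(ack_req) << 31) | (psn & 0xFFFFFF)   # AckReq bit + 24-bit PSN
    return struct.pack("!BBHII", opcode, flags, pkey, qp_word, psn_word)


def udp_header(src_port: int, payload_len: int) -> bytes:
    """Pack an 8-byte UDP header; the checksum is left at 0 in this sketch."""
    return struct.pack("!HHHH", src_port, ROCEV2_UDP_PORT, 8 + payload_len, 0)


def rocev2_segment(src_port: int, opcode: int, dest_qp: int, psn: int,
                   payload: bytes = b"") -> bytes:
    """UDP header + BTH + payload; IP/Ethernet headers and the ICRC are omitted."""
    ib_part = bth(opcode, dest_qp, psn) + payload
    return udp_header(src_port, len(ib_part)) + ib_part


seg = rocev2_segment(49152, OPCODE_RC_RDMA_READ_REQ, dest_qp=0x12, psn=7)
src, dst, length, checksum = struct.unpack("!HHHH", seg[:8])
print(dst, length, hex(seg[8]))
```

Because the InfiniBand transport header rides unchanged inside UDP/IP, ordinary IP routers can forward it, which is what lets RoCE cross Layer 3 boundaries.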
RoCE on Exadata: RDMA over Converged Ethernet (RoCE)
• Cisco Nexus 9336C-FX QSFP28 (100GbE) Ethernet Switches
• RDMA Network Fabric Adapter
  • 2 active-active ports in every RDMA Network Fabric Adapter
• RDMA Network Fabric Switch
  • 2 RDMA Network Fabric Switches in every Exadata single rack
  • 22 ports per switch used for internal cluster network, cabled ensuring no single point of failure exists



Exadata Network Architecture - RoCE
[Diagram: Ethernet switch, PDU A and PDU B, and two RoCE switches. Exadata Database Server ports: NET0-NET4, ILOM, BONDETH0. Exadata Storage Server ports: NET0, ILOM. The RE0 and RE1 ports on both servers link to the RoCE switches. Key: Management, Client Access, Private RoCE.]
* BONDETH0 can be either copper or optical links



Exadata RoCE: Smart Network Prioritization
• Network prioritization for latency-sensitive DB algorithms
• Ensures that messages requiring low latency are not slowed by high-throughput messages
  • Low latency: transaction commit, cache fusion
  • High throughput: backup, reporting, batch
• RoCE Class of Service (CoS)
  • Allows packets to be sent on multiple classes of service, each with separate network buffers for independence
• Exadata uniquely chooses the best Class of Service for each database message
[Diagram: network switch with per-Class-of-Service transmit buffers; highest-priority and lowest-priority messages use separate switch buffers]
Completely Automatic And Transparent
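The idea of separate per-class buffers can be sketched as one queue per class of service, with the transmitter always draining the higher-priority queue first. Everything concrete here is an assumption for illustration: the two classes, the `CLASS_OF_SERVICE` mapping, and strict-priority scheduling. The slide does not say how Exadata assigns messages to classes or how the switch schedules them.

```python
from collections import deque

# Hypothetical mapping from database message type to class (0 = highest priority).
CLASS_OF_SERVICE = {
    "commit": 0, "cache_fusion": 0,            # latency-sensitive
    "backup": 1, "reporting": 1, "batch": 1,   # throughput-oriented
}


class PerClassTransmitter:
    def __init__(self, num_classes: int = 2):
        # A separate buffer per class, so bulk traffic cannot fill the urgent one.
        self.queues = [deque() for _ in range(num_classes)]

    def enqueue(self, kind: str, payload: str) -> None:
        self.queues[CLASS_OF_SERVICE[kind]].append((kind, payload))

    def transmit_next(self):
        # Strict priority: serve the highest-priority non-empty queue first.
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None


tx = PerClassTransmitter()
for i in range(3):
    tx.enqueue("backup", f"chunk-{i}")
tx.enqueue("commit", "txn-42")
print(tx.transmit_next())  # the commit is sent ahead of the queued backup chunks
```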



Exadata RoCE – Avoiding Packet Loss
• Packet loss caused by:
  • Congestion: packets sent faster than receiver or switches can process
  • Switch failure, link failure
• Conventional Ethernet silently drops packets and expects retransmit
  • Huge hit to latency and throughput
• Exadata RoCE avoids packet drops using:
  • RoCE Priority-based Flow Control (PFC)
    • Tells sender to pause the messages if switch buffer is full
  • RoCE Explicit Congestion Notification (ECN)
    • Marks packet flow as too fast, telling source to slow down packet sends
[Diagram: network switch with per-Class-of-Service transmit buffers; when a switch buffer is full, the sender is told to pause]
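The two mechanisms can be sketched as thresholds on one class's switch buffer. Above an ECN threshold the switch marks packets and the sender slows down; above a higher PFC threshold it tells the sender to pause, so the buffer never overflows and nothing is dropped. All numbers (buffer size, thresholds, send and drain rates) are hypothetical, and the halve-on-mark, +1-otherwise rate rule is a simplified stand-in for whatever rate control the NICs actually implement.

```python
def simulate(ticks=50, send_rate=20, drain_rate=10, capacity=100,
             ecn_threshold=40, pfc_xoff=70, pfc_xon=50,
             use_pfc=True, use_ecn=True):
    """Toy per-class switch buffer with optional PFC pause and ECN marking."""
    queue, paused, rate = 0, False, send_rate
    stats = {"accepted": 0, "dropped": 0, "marked": 0, "max_queue": 0}
    for _ in range(ticks):
        marks = 0
        if not paused:                       # PFC: a paused sender transmits nothing
            for _ in range(rate):
                if queue >= capacity:        # buffer full: conventional Ethernet drops
                    stats["dropped"] += 1
                    continue
                queue += 1
                stats["accepted"] += 1
                if use_ecn and queue > ecn_threshold:
                    marks += 1               # ECN: mark the packet as "flow too fast"
        stats["marked"] += marks
        stats["max_queue"] = max(stats["max_queue"], queue)
        if use_ecn:
            # Sender reacts to marks: back off sharply, otherwise recover slowly.
            rate = max(1, rate // 2) if marks else min(send_rate, rate + 1)
        queue = max(0, queue - drain_rate)   # switch forwards packets downstream
        if use_pfc:
            if queue >= pfc_xoff:
                paused = True                # PFC pause: "buffer is full, stop"
            elif queue <= pfc_xon:
                paused = False               # resume once the buffer has drained
    return stats


print(simulate(use_pfc=False, use_ecn=False))  # packets drop once the buffer fills
print(simulate())                              # PFC + ECN: no drops
```

The PFC pause threshold is set well below the buffer capacity so that packets already sent in the tick before the pause takes effect still fit.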



Instant Failure Detection with RoCE
• Exadata uses frequent heartbeat messages between nodes to detect possible server failure
• Server failure detection normally requires a long timeout
  • Hard to quickly distinguish whether a slow response to a heartbeat is due to very high CPU load or to server failure
• Exadata X8M uses RDMA to quickly confirm server failure
  • RDMA uses hardware, so remote ports respond even if software is slow
  • 4 RDMA Reads are sent to the suspect server, across all combinations of source and target ports
  • If all 4 RDMAs fail, the server is evicted from the cluster
[Diagram: NIC Port #1 and NIC Port #2 on the local server send RDMA Reads to NIC Port #1 and NIC Port #2 on the suspect server]
Completely Automatic And Transparent
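The eviction rule above can be sketched as issuing one read per (source port, target port) pair and evicting only if none completes. `rdma_read` is a hypothetical callable standing in for a one-sided RDMA Read; heartbeats and what triggers the check are outside this sketch.

```python
from itertools import product


def confirm_failure(local_ports, remote_ports, rdma_read) -> bool:
    """Return True (evict) only if RDMA Reads fail on every port combination.

    `rdma_read(src, dst)` is a hypothetical stand-in for a one-sided RDMA Read
    from local port `src` to remote port `dst`; it returns True on completion.
    """
    # 2 local x 2 remote ports -> 4 reads; all are issued before deciding.
    results = [rdma_read(src, dst) for src, dst in product(local_ports, remote_ports)]
    return not any(results)


PORTS = ("port1", "port2")
scenarios = {
    "busy CPU, NIC still answers": lambda src, dst: True,
    "one remote link down": lambda src, dst: dst != "port2",
    "server dead": lambda src, dst: False,
}
for label, probe in scenarios.items():
    print(f"{label}: evict={confirm_failure(PORTS, PORTS, probe)}")
```

Requiring all four paths to fail keeps a single bad cable or port from being mistaken for a dead server.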



New Persistent Memory Storage Tier
• Persistent memory is a new silicon technology
• Capacity, performance, and price are between DRAM and flash
Persistent memory benefits include:
• Reads at memory speed, much faster than flash
• Writes survive power failure, unlike DRAM
• Exadata implements sophisticated algorithms to best utilize Persistent Memory
  • Remote Direct Memory Access is key
  • Exadata Smart System Software: top-to-bottom integration
[Diagram: DRAM, Persistent Mem, and Flash tiers; higher tiers are faster and have a higher cost per GB]

Persistent Memory with Traditional Storage
• Persistent Memory usage with traditional storage:
  • Database issues read I/O call to OS
  • OS sends message to storage
  • Storage CPU issues read to Persistent Memory
  • Storage CPU sends reply to Server OS
  • Server OS wakes up Database
• Speed of Persistent Memory read is overwhelmed by high cost of network and I/O software, interrupts, and context switches
  • Performance benefit is limited
[Diagram: Compute Server connected over SAN to Storage Server, with Persistent Memory as the hot tier]



Dissect the Exadata Flash IO Latency
[Diagram: 8K read path from Database Software through Kernel/OS (Database Server), across the network to Kernel/OS (Storage Server) and Storage Server Software, then to flash]
• Flash read raw latency: <100 µs
• Context switch: 10s of µs on the database server and again on the storage server
• Database 8K read end-to-end latency: ~200 µsec
What if we drop in PMEM as is?
[Diagram: the same 8K read path, with PMEM in place of flash on the storage server]
• PMEM read raw latency: ~1 µs
• Context switch: 10s of µs on the database server and again on the storage server
• Database 8K read end-to-end latency: ~100 µsec
• Over 90% of the time wasted
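The "time wasted" figure follows from the slide's own numbers: the share of end-to-end latency not spent in the storage media is (end-to-end − raw) / end-to-end. A small sketch using the approximate figures above; since the flash raw latency is an upper bound (<100 µs), the flash result is a lower bound on its overhead.

```python
def overhead_share(raw_us: float, end_to_end_us: float) -> float:
    """Fraction of end-to-end latency spent outside the storage media itself."""
    return (end_to_end_us - raw_us) / end_to_end_us


flash = overhead_share(raw_us=100, end_to_end_us=200)  # slide: <100 µs of ~200 µs
pmem = overhead_share(raw_us=1, end_to_end_us=100)     # slide: ~1 µs of ~100 µs
print(f"flash: {flash:.0%} overhead, PMEM: {pmem:.0%} overhead")
```

The same arithmetic applies to the redo log write figures later in the deck: `overhead_share(10, 100)` gives 0.9, and because the PMEM write itself takes <10 µs, the real share is above 90%.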
Conventional Two-Sided Read
• Conventional storage performs a server-side cache read:
  • DB sends READ request [Disk, Offset] to storage
  • Storage looks up flash cache, maps [Disk, Offset] -> location on flash, and issues read to flash
  • Storage sends data to DB
[Diagram: Database Server and Storage Server; the storage server holds flash cache lines]
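A toy model of the two-sided flow above: every read runs storage-server code that performs the [Disk, Offset] lookup before replying, which is where the software, interrupt, and context-switch costs come from. The class, method, and disk names are hypothetical, and a cache miss simply returns `None` here, where a real cell would read from disk.

```python
class StorageServer:
    """Toy storage server holding a flash cache indexed by [Disk, Offset]."""

    def __init__(self):
        self.flash = []            # flash cache lines
        self.cache_index = {}      # (disk, offset) -> slot in self.flash
        self.requests_served = 0   # reads that ran storage-server software

    def cache(self, disk: str, offset: int, data: bytes) -> None:
        self.cache_index[(disk, offset)] = len(self.flash)
        self.flash.append(data)

    def handle_read(self, disk: str, offset: int):
        # Two-sided: storage software receives the request, maps
        # [Disk, Offset] -> location on flash, reads it, and replies.
        self.requests_served += 1
        slot = self.cache_index.get((disk, offset))
        return None if slot is None else self.flash[slot]


def db_read(server: StorageServer, disk: str, offset: int):
    """Database side: send READ [Disk, Offset] and wait for the storage reply."""
    return server.handle_read(disk, offset)


cell = StorageServer()
cell.cache("disk3", 8192, b"cached block")   # hypothetical disk name and offset
print(db_read(cell, "disk3", 8192), cell.requests_served)
```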



Exadata X8M With Persistent Memory
World's First and Only Shared Persistent Memory Optimized for Database
• Shared capacity across Databases
• Smart Capacity management for maximum performance
  • Primaries on persistent memory, secondaries on flash
  • Secondaries moved into persistent memory when they become active primaries
• Management server on storage server automatically manages persistent memory
  • Creates regions, namespaces, DAX devices, etc. as needed
  • Persistent memory replacement follows standard hardware replacement process
• End to End Security
  • Persistent memory is not visible to the OS on the compute server
  • Protected end to end while serving RDMA
[Diagram: Compute Server connected to Storage Server over RoCE RDMA; storage tiers Hot (Persistent Memory), Warm (Flash), Cold]



Exadata X8M With Persistent Memory Data Accelerator
World's First and Only Shared Persistent Memory Optimized for Database
• Exadata Storage Servers transparently add Persistent Memory Accelerator in front of Flash memory
• Database uses RDMA instead of I/O to read remote PMEM
  • Bypasses network and I/O software, interrupts, context switches
• PMEM automatically tiered and shared across DBs
  • Using it as a cache for the hottest data increases effective capacity 10x
• Persistent Memory mirrored automatically across storage servers for fault-tolerance
• 16 million IOPS, <19 µs latency for 8K I/Os from the database
[Diagram: Compute Server connected to Storage Server over RoCE RDMA; storage tiers Hot (Persistent Memory), Warm (Flash), Cold]
Enabled with Exadata System Software 19.3 and Database Software 19c
Exadata Data Access Tiers
PMEM MAA Characteristics
• Primary copy of data placed in PMEM cache on a read miss
• Secondary copy of data placed in flash cache on buffer eviction
• If a PMEM fails in Writethrough mode, no redundancy restoration is required
• If a PMEM fails in Writeback mode, a resilver operation is run to restore redundancy
• Low latency flash reads will repopulate super low latency PMEM
[Diagram (not drawn to scale): Database Node buffer cache (Sizzling) above Storage Cell tiers PMEM (Hot), Flash (Warm), and Disk (Cold); a database read populates PMEM, and buffers evicted from the buffer cache go to flash]



A Radical Approach – RDMA to PMEM
[Diagram: the database server reads PMEM on the storage server directly with RDMA, bypassing the Kernel/OS and Storage Server Software layers of the earlier path and their context switches (10s of µs each)]
• Database 8K read end-to-end latency: <19 µsec
• 10x faster than Exadata X8
Ultra Fast One-Sided Read
• New disruptive technology enables PMEM cache read via RDMA
  • Database server uses RDMA to fetch data from the PMEM cache on the storage server
[Diagram: Database Server and Storage Server; the storage server holds PMEM cache lines]
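By contrast with the two-sided read, a one-sided read copies bytes straight out of memory the storage server has registered, and no storage-server code runs on the read path. In this sketch a `bytearray` stands in for the registered PMEM region and slicing it stands in for the NIC's copy. The names are hypothetical, and the sketch simply assumes the database already knows the remote address of the cache line; the slide does not say how that address is obtained.

```python
class MemoryRegion:
    """Stand-in for a PMEM region a storage server has registered for RDMA."""

    def __init__(self, size: int):
        self._buf = bytearray(size)
        self.view = memoryview(self._buf)  # what the NIC can access directly

    def write_local(self, addr: int, data: bytes) -> None:
        # Storage-server software populating the PMEM cache (not on the read path).
        self.view[addr:addr + len(data)] = data


def rdma_read(region: MemoryRegion, addr: int, length: int) -> bytes:
    """One-sided read: copy bytes out of the remote region.

    No storage-server method runs here; in real RDMA the NICs perform this copy.
    """
    return bytes(region.view[addr:addr + length])


BLOCK = 8192                                         # 8K database block
pmem = MemoryRegion(4 * BLOCK)
pmem.write_local(2 * BLOCK, b"hot block".ljust(BLOCK, b"\0"))
data = rdma_read(pmem, addr=2 * BLOCK, length=BLOCK)  # assumes the DB knows the address
print(data[:9])
```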




Exadata X8M Persistent Memory Commit Accelerator
• Log write latency is critical for OLTP performance
  • Faster log writes mean faster commit times
  • Any log write slowdown stalls the whole database
• Automatic Commit Accelerator
  • Database issues one-way RDMA writes to PMEM on multiple Storage Servers
  • Bypasses network and I/O software, interrupts, context switches, etc.
  • Up to 8x faster log writes
[Diagram: Compute Server connected to Storage Server over RoCE RDMA; redo lands in Persistent Memory (Hot) and is flushed later to Flash/Disk (Warm/Cold)]
Enabled with Exadata System Software 19.3 and Database Software 19c
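A minimal sketch of the commit path described above, under stated assumptions: each storage server's PMEM log buffer is modeled as a Python list, `rdma_write` stands in for a one-way RDMA write, `background_flush` models the later flush to flash/disk, and the three mirrors and the "every mirror holds the record" success rule are hypothetical choices, not Exadata's documented behavior.

```python
class PmemLogBuffer:
    """Hypothetical stand-in for one storage server's PMEM log buffer."""

    def __init__(self, name: str):
        self.name = name
        self.pending = []   # redo persisted in PMEM, not yet flushed
        self.flushed = []   # redo already written to flash/disk

    def rdma_write(self, record: bytes) -> None:
        # Models a one-way RDMA write: no storage-server software runs here.
        self.pending.append(record)

    def background_flush(self) -> None:
        # Models the later, asynchronous flush from PMEM to flash/disk.
        self.flushed.extend(self.pending)
        self.pending.clear()


def commit(redo: bytes, mirrors) -> bool:
    """Write the redo record to every mirror; report success once all hold it."""
    for mirror in mirrors:
        mirror.rdma_write(redo)
    return all(redo in m.pending or redo in m.flushed for m in mirrors)


cells = [PmemLogBuffer(f"cell{i}") for i in range(1, 4)]  # three hypothetical cells
print(commit(b"txn-1 redo", cells))
for cell in cells:
    cell.background_flush()
```

The commit returns as soon as the redo sits in PMEM; the slower flash/disk write happens off the commit path.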
What about Redo Log Writes to PMEM?
[Diagram: the same request path as the PMEM read, for a database log write to PMEM on the storage server]
• PMEM write raw latency: <10 µs
• Context switch: 10s of µs on the database server and again on the storage server
• Database log write end-to-end latency: ~100 µsec
• Over 90% of the time wasted
Conventional Two-Sided Log Write
• DB sends requests to storage
• Storage writes to flash log and sends ack
  • DB sends log write request to storage
  • Storage server issues write to flash and HDD simultaneously (via Flash Log)
  • Storage server sends ack to DB
[Diagram: Database Server and Storage Server with the Flash Log]



Ultra Fast One-Sided Log Write
• New disruptive technology enables durable PMEM log write via RDMA
  • Database server issues an RDMA write to a PMEM log buffer on the storage server to persist redo log
  • Storage server performs flash log write in the background
[Diagram: Database Server and Storage Server; the storage server holds PMEM log buffers (PMEM…Log Buffer)]



Ultra Fast One-Sided Log Write
• Storage Server Crash Safe!
  • Storage server performs PMEM log recovery
[Diagram: Database Server and Storage Server; the storage server holds PMEM log buffers (PMEM…Log Buffer)]
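What "PMEM log recovery" has to do can be sketched as scanning the log buffer after a restart and keeping only records that were completely written. The record framing here (a length and a CRC32 in front of each payload) and the rule of stopping at the first torn or zero-filled record are assumptions for illustration; the slide does not describe Exadata's actual log format or recovery algorithm.

```python
import struct
import zlib

HEADER = struct.Struct("!II")  # hypothetical record header: (payload length, CRC32)


def append_record(log: bytearray, payload: bytes) -> None:
    """Append one self-describing redo record to a simulated PMEM log buffer."""
    log.extend(HEADER.pack(len(payload), zlib.crc32(payload)))
    log.extend(payload)


def recover(log: bytes) -> list:
    """After a crash, return every complete, intact record in log order."""
    records, pos = [], 0
    while pos + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, pos)
        if length == 0:
            break                      # zero-filled space: end of the written log
        start = pos + HEADER.size
        payload = bytes(log[start:start + length])
        if len(payload) < length or zlib.crc32(payload) != crc:
            break                      # torn or incomplete write: stop recovery here
        records.append(payload)
        pos = start + length
    return records


log = bytearray()
append_record(log, b"redo 1")
append_record(log, b"redo 2")
crashed = bytes(log[:-3])              # crash partway through writing the last record
print(recover(crashed))                # only the intact first record is recovered
```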



Exadata PMEM Management
• No User Interaction or Management Required
• Installed and configured by OEDA
• ILOM performs fault management for PMEM
• ILOM sends ASR traps for failed PMEM
• MS alert history shows PMEM Failure and Replacement Alerts

• Secure Erase
• Crypto Erase option
• Drop CellDisk Erase=1pass/3pass/7pass does crypto erase on underlying PMEM DIMMs



Operating System and Virtualization: X8M

New UEK5 kernel, Oracle Linux 7.7, KVM



Operating System Update in Exadata 19.3
• Oracle Linux kernel UEK5 (4.14.35-xx) enables RoCE and persistent memory
• Oracle Linux distribution updated to Oracle Linux 7.7
• OS image size reduced by up to 35%, which reduces download and install time
• Imaging a compute node with RoCE no longer requires dual boot
• New default file system is XFS for all components, including bare metal, virtual, and the KVM host



Exadata Virtualization
Consolidation Options
• VMs provide good isolation but poor efficiency and high management
  • VMs have separate OS, memory, CPUs, and patching
  • Isolation without need to trust DBA, System Admin
• Database consolidation in a single OS is highly efficient but less isolated
  • DB Resource Manager isolation adds no overhead
  • Resources can be shared much more dynamically
  • Must configure systems correctly, since resources are shared
• Best strategy is to combine VMs with database native consolidation
  • Multiple trusted DBs or Pluggable DBs in a VM
  • Few VMs per server to limit overhead of fragmenting CPUs/memory/patching etc.
[Diagram: consolidation options ranging from Dedicated DB Servers through Virtual Machines (VM VM VM) and Multiple DBs on one Server to Multitenant Databases, with arrows labeled More Isolation and More Efficient]



Exadata X8M Virtualization
• Exadata X8M uses KVM-based virtualization
  • Different from Xen-based virtualization used in InfiniBand Exadata
• KVM on X8M gives:
  • 12 VMs per server, an increase of 50% from X8
  • Max memory 1390 GB per VM guest
  • Max CPU per guest = 46 cores or 92 vCPUs (total CPU on host - 2)
  • Lower client network latency
  • Life cycle management for CPU, disk, memory, and VMs using vm_maker and OEDACLI
  • Trusted Partitions
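The guest limits above follow from simple arithmetic on the 48-core X8M-2 database server: two cores stay with the KVM host, and each remaining core contributes two vCPUs (two hardware threads per core is the assumption that makes 46 cores equal 92 vCPUs). The function name is hypothetical.

```python
def max_guest_cpus(host_cores: int, reserved_cores: int = 2, threads_per_core: int = 2):
    """Largest single KVM guest: host cores minus the cores kept for the KVM host."""
    cores = host_cores - reserved_cores
    return cores, cores * threads_per_core


cores, vcpus = max_guest_cpus(48)  # X8M-2 database server: 48 cores
print(cores, vcpus)                # matches the slide's 46 cores / 92 vCPUs

x8m_vms = 12
x8_vms = round(x8m_vms / 1.5)      # "an increase of 50% from X8" implies 8 VMs on X8
print(x8_vms)
```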



Exadata KVM Requirements
Hardware
• Exadata X8M
Software
• Exadata Storage Software 19.3.0 or higher
• KVM Host and guests can run different Exadata versions
• Grid Infrastructure
• Recommended – GI 19, using latest RU
• Supported - GI 12.1.0.2 or higher release, using 2019-Jul RU
• Database
• Recommended - DB 19, using latest RU
• Supported – DB 19, 18, 12.2.0.1, 12.1.0.2, or 11.2.0.4

Review MOS 888828.1



Exadata and In-Memory Columnar

Unique Technologies for Data Warehousing

Just in Time Smart Columnar Decryption
• Columns are decrypted only when needed
  • If many columns are in a cache-line, decrypt only columns of interest
  • If first predicate returns nothing, don't access columns for second predicate
• Example scan
  • select ename from emp where job like '%VP' and sal + bonus > 500000
  • Projected columns (ename) and one or more predicated columns are decrypted (not entire cache-line)
  • For data regions that do not have a VP, salary is not accessed
• Up to 30% faster encrypted smart scans and reduced storage server CPU usage
[Diagram: an encrypted emp cache-line (ename, job, sal: John, Staff, 120k; Joe, Staff, 230k; Jane, Sr. Staff, 440k; Mary, Sr. Staff, 380k) from which only the job column (Staff, Staff, Sr. Staff, Sr. Staff) is decrypted]
Enabled with Exadata Storage Server 19.3 for in-memory customers
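The example scan can be sketched as lazy, per-column decryption: the job predicate is evaluated first, and sal, bonus, and ename are decrypted only if some row in the region could still qualify. Several things here are assumptions: XOR with a fixed key is a toy stand-in for real encryption, each column is stored as one encrypted blob per data region, the `bonus` values are hypothetical because the slide's table omits that column, and the second region is invented to show the VP case.

```python
from itertools import cycle

KEY = b"demo-key"  # hypothetical key; XOR is a toy stand-in, NOT real encryption


def xor_bytes(data: bytes, key: bytes = KEY) -> bytes:
    """Toy symmetric transform used to 'encrypt' and 'decrypt' column blobs."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def encrypt_region(rows, columns):
    """Store each column of a data region as its own encrypted blob."""
    return {c: xor_bytes("|".join(str(r[c]) for r in rows).encode()) for c in columns}


class RegionScanner:
    def __init__(self, region):
        self.region = region
        self.decrypted = {}  # columns decrypted so far for this region

    def column(self, name):
        # Just in time: decrypt a column only the first time the query needs it.
        if name not in self.decrypted:
            self.decrypted[name] = xor_bytes(self.region[name]).decode().split("|")
        return self.decrypted[name]


def scan(region):
    """select ename from emp where job like '%VP' and sal + bonus > 500000"""
    s = RegionScanner(region)
    hits = [i for i, job in enumerate(s.column("job")) if job.endswith("VP")]
    if hits:  # only touch sal and bonus if the first predicate matched something
        sal, bonus = s.column("sal"), s.column("bonus")
        hits = [i for i in hits if int(sal[i]) + int(bonus[i]) > 500_000]
    names = [s.column("ename")[i] for i in hits] if hits else []
    return names, set(s.decrypted)


COLUMNS = ("ename", "job", "sal", "bonus")
# Rows from the slide; the bonus values are hypothetical.
slide_rows = [
    {"ename": "John", "job": "Staff", "sal": 120_000, "bonus": 10_000},
    {"ename": "Joe", "job": "Staff", "sal": 230_000, "bonus": 10_000},
    {"ename": "Jane", "job": "Sr. Staff", "sal": 440_000, "bonus": 10_000},
    {"ename": "Mary", "job": "Sr. Staff", "sal": 380_000, "bonus": 10_000},
]
# A second, entirely hypothetical region that does contain a VP.
vp_rows = [
    {"ename": "Ann", "job": "VP", "sal": 450_000, "bonus": 100_000},
    {"ename": "Bob", "job": "Staff", "sal": 150_000, "bonus": 5_000},
]
print(scan(encrypt_region(slide_rows, COLUMNS)))  # only "job" is ever decrypted
print(scan(encrypt_region(vp_rows, COLUMNS)))
```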
Fast In-Flash Columnar Cache Creation
• Runtime analyzer finds best compression algorithm
• Reuse dictionary created in-line during analysis
• Faster symbol lookup and insert using new dictionary data structure
• Up to 35% faster columnar cache creation
• This feature applies to Exadata Hybrid Columnar Compression tables only.
[Diagram: In-Memory columnar scans and In-Flash columnar scans]
Completely Automatic And Transparent
Enabled with Exadata Storage Server 19.3 for in-memory customers


Smart Aggregation with Columnar Cache
• SUM and GROUP BY smart scan enabled with in-memory columnar format
  • Reduces data sent to the database server
  • Improves CPU utilization on the database server
• Example
  • select dept, sum(sal) from emp where country='USA' group by dept
  • Sum and group by operations performed on the storage server
• Up to 2x faster queries and reduced database server CPU usage
[Diagram: emp (dept, country, sal): Sales, USA, 120k; Engr, UK, 230k; Sales, USA, 440k; Engr, USA, 380k. The storage-side scan returns only the aggregated result (dept, sum(sal)): Engr 380K; Sales 560K]
Completely Automatic And Transparent
Enabled with Exadata Storage Server 19.3 for in-memory customers, DB version 18.1
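The example query can be reproduced on the slide's four rows to show why offloading the aggregation reduces data sent to the database server. Three rows match `country='USA'`, but aggregating where the data lives ships only one row per department. The function name is hypothetical, and plain Python stands in for the storage server's columnar scan.

```python
from collections import defaultdict

EMP = [  # (dept, country, sal) rows from the slide
    ("Sales", "USA", 120_000),
    ("Engr", "UK", 230_000),
    ("Sales", "USA", 440_000),
    ("Engr", "USA", 380_000),
]


def storage_side_aggregate(rows, country):
    """select dept, sum(sal) from emp where country = :country group by dept,
    evaluated where the data lives, so only one row per group is returned."""
    totals = defaultdict(int)
    for dept, ctry, sal in rows:
        if ctry == country:      # predicate offload
            totals[dept] += sal  # SUM ... GROUP BY offload
    return dict(totals)


result = storage_side_aggregate(EMP, "USA")
matching_rows = sum(1 for _, ctry, _ in EMP if ctry == "USA")
print(result, f"{len(result)} rows shipped instead of {matching_rows}")
```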
Smart In-Flash Columnar Cache
• Smart Columnar Cache with chained rows
  • Wide tables, large rows or row migration on update can create chained rows
  • Optimized SIMD representation created for chained rows when the head/tail pieces fit in the same 1MB region
  • Up to 3x faster scans for wide tables
• Faster DMLs using in-memory format
  • update weather_history set temp = (temp - 32) * (5/9) where country = 'ENGLAND'
  • Smart scan to find rows that satisfy country='ENGLAND'
  • Row IDs returned from previous phase
  • Up to 5x performance improvement due to smart scans
• Enabled with 18.3.0.0.0180717
[Diagram: In-Memory columnar scans and In-Flash columnar scans]
Completely Automatic And Transparent




World's Fastest Database Machine – Exadata X8M
Exadata X8M storage performance is comparable to in-memory
• With capacity, sharing, and cost benefits of shared storage
• 16 Million OLTP Read IOPS (8K IOs)
  • 2.5x faster than Exadata X8
• <19 microsecond OLTP IO latency
  • 10x faster than Exadata X8
• Ultra fast log file writes to accelerate transactions
• 560GB/sec Analytic Scan throughput
• Over 1 TB/sec analytic scans with columnar data in flash
Each rack has up to:
• 3.0 PB Raw Disk
• 920 TB NVMe Flash
• 27 TB PMEM
Performance scales as more racks are added

References
• Exadata Manuals (Security and Maintenance): https://docs.oracle.com/en/engineered-systems/exadata-database-machine/books.htm
• Oracle Exadata Best Practices (Doc ID 757552.1)
• Exadata Patching Overview Including Database and Storage Server Upgrade Demo (Doc ID 1584200.1)
• Exadata Database Machine and Exadata Storage Server Supported Versions (Doc ID 888828.1)
• https://blogs.oracle.com/in-memory/how-to-determine-if-columnar-format-on-exadata-flash-cache-is-being-used

