Postgres Conference
HangZhou, China
Scaling Out PostgreSQL
Present and Future
November 21st, 2015
NTT DATA INTELLILINK Corporation
Koichi Suzuki
Copyright © 2015 NTT DATA INTELLILINK Corporation
Introduction
About the Speaker
● Fellow at NTT DATA Intellilink Corporation
● Principal, Technology Professionals at NTT DATA Group
In Charge Of
● General Database Technology
● Database in huge data warehouse and its design
● PostgreSQL and its cluster technology
In The Past
● Character Set Standard (Extended Unix Code, Unicode, etc.)
● Heisei-font development (Technical Committee)
● Oracle Porting
● Object-Relational Database
Agenda
● Scale-out motivation
● Postgres-XC and Postgres-XL
● Other PostgreSQL cluster efforts (examples)
● Efforts in PostgreSQL core
● Impact on scale-out features
● Storage and server technology innovation
● IoT and Big Data
● Scale-out architecture in the future
Scale-out Motivation
● Performance requirements
  ● Larger amounts of data
    ● In the petabyte range
  ● More transactions in transactional workloads
    ● More than 20,000 TPS
● Growing demand for big data analytic workloads
  ● Aggregates
  ● Scanning tens of billions of tuples
● Use of commodity hardware/software platforms
  ● No dedicated hardware
  ● Shared nothing preferred
→ Scale-out on top of a Database Cluster
Clustering effort in PostgreSQL core
● Streaming Replication
  ● First active cluster
  ● Originally for high availability
  ● Copies all database updates to slaves
  ● A slave can take over when the master fails
● Read-Only Slave
  ● Uses a streaming replication slave to run read queries
  ● Scales out reads
● Logical replication
  ● More sophisticated update transfer to other database servers
  ● Configurable
Postgres-XC
● Multi-node update
● Issue updates to any cluster node
● Transparent transaction ACID property
● Atomic visibility of updates
● Initially for transactional workloads
Postgres-XL
● Postgres-XC spin-off
● More focus on analytic workloads
● More sophisticated execution for complex queries
● Same architecture and code base as XC
Read Scale-out in PostgreSQL Master/Slave
[Figure: the Master handles read/write transactions and ships WAL (or redo log) to the Slave, which serves read-only transactions with a possible time delay.]
Scaling Out in Postgres XC/XL
[Figure: read/write transactions go to any of several nodes, each with its own local disk. Backend transaction synchronization between the nodes means no delay in update visibility.]
DBT-1 Workload Scalability
DBT-1 (Rev)
MPP Performance – DBT-3 (TPC-H)
By courtesy of Mason Sharp, Postgres-XL leader
Scale Out Approach (1): Table Distribution/Replication
Categorize tables into two groups:
Large and frequently-updated tables
→ Distribute rows among nodes (Distributed Tables)
→ Based on a column value (distribution key)
→ Hash, modulo or round-robin
→ Parallelism among transactions (OLTP) or in SQL processing (OLAP)
Smaller and stable tables
→ Replicate among nodes (Replicated Tables)
→ Join Pushdown
As much as possible, avoid joins between Distributed Tables on join keys that differ from the distribution key.
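The three row-placement strategies named above (hash, modulo, round-robin) and the replicated-table case can be sketched as follows. This is a minimal illustration, not the Postgres-XC/XL implementation: the node names are hypothetical, and `zlib.crc32` merely stands in for whatever hash function the cluster actually uses.

```python
import zlib
from itertools import count

NODES = ["datanode1", "datanode2", "datanode3"]  # hypothetical node names


def hash_node(key, nodes=NODES):
    """Pick a node from a stable hash of the distribution key.

    crc32 is an assumption for illustration only.
    """
    digest = zlib.crc32(str(key).encode("utf-8"))
    return nodes[digest % len(nodes)]


def modulo_node(key, nodes=NODES):
    """Pick a node by taking an integer key modulo the node count."""
    return nodes[key % len(nodes)]


def make_round_robin(nodes=NODES):
    """Return a chooser that ignores the key and cycles through the nodes."""
    counter = count()

    def choose(_key=None):
        return nodes[next(counter) % len(nodes)]

    return choose


def replicated_nodes(nodes=NODES):
    """A replicated table keeps a full copy of every row on every node."""
    return list(nodes)
```

With hash or modulo placement the same key always maps to the same node, which is what lets a join on the distribution key run locally on each datanode. Round-robin spreads rows evenly but gives the coordinator no way to locate a row from its key.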
Node Configuration: Two-Tier Approach
Coordinator:
● Maintains global catalog information
● Builds the global SQL plan and the SQL statements for datanodes
● Interacts with datanodes to execute local SQL statements and accumulates the results
Datanode:
● Maintains actual data (local data)
● Runs local SQL statements from the Coordinator
(In XL, a datanode may ask other datanodes for their local data)
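The division of labor between the two tiers can be illustrated with a toy scatter/gather model. The class and method names here are hypothetical, a Python predicate stands in for a local SQL statement, and the datanode list stands in for the global catalog.

```python
class Datanode:
    """Holds local rows only and runs the local statements sent to it."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows  # local data

    def run_local(self, predicate):
        # Stand-in for executing a local SQL statement with a WHERE clause.
        return [row for row in self.rows if predicate(row)]


class Coordinator:
    """Knows which datanodes exist and accumulates their results."""

    def __init__(self, datanodes):
        self.datanodes = datanodes  # stand-in for global catalog information

    def execute(self, predicate):
        result = []
        for node in self.datanodes:
            result.extend(node.run_local(predicate))
        return result
```

For example, a coordinator over two datanodes holding `[1, 2, 3]` and `[4, 5, 6]` returns `[2, 4, 6]` for an "even values" predicate. A real coordinator would dispatch to the datanodes concurrently rather than in a loop.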
Coordinator and Datanode
[Figure: read/write transactions arrive at the Coordinators, which sit in front of the Datanodes.]
Other PostgreSQL cluster effort
● PG Cluster
  – Multi-node update
  – Backend update synchronization
● PGPool
  – SQL-based database replication
  – Multi-node update
  – Read scalability
  – Now incorporates streaming replication, as pgpool-II
● Slony
  – Trigger-based database replication
  – Very flexible and robust as logical replication
Why GTM? Isn't the Two-Phase Commit Protocol enough?
Two-Phase Commit Protocol does:
● Maintain database consistency in transactions updating more than one node.
Two-Phase Commit Protocol doesn't:
● Maintain atomic visibility of updates to other transactions (next slide)
Atomic Visibility and GTM
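[Figure: timeline of two transactions across Node A and Node B]
1. TXN 1 updates A and B
2. TXN 1 prepares A and B
3. TXN 2 reads B and gets the old value
4. TXN 1 commits A and B
5. TXN 2 reads A and gets the new value
→ Inconsistent read!
GTM monitors transaction activity and makes the new values available at the appropriate time.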
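One way to picture how a central transaction manager avoids the inconsistent read is a global snapshot: a reader asks once which transactions are committed and uses that same answer on every node. The sketch below is a toy model under that assumption, with hypothetical class names; it is not the Postgres-XC GTM protocol itself.

```python
class GTM:
    """Toy global transaction manager: tracks committed transaction IDs."""

    def __init__(self):
        self.committed = set()

    def snapshot(self):
        # A reader sees only the transactions committed at this moment.
        return frozenset(self.committed)

    def commit(self, txn_id):
        self.committed.add(txn_id)


class Node:
    """Keeps versions per key as (txn_id, value), newest last."""

    def __init__(self, initial):
        # txn_id 0 marks pre-existing data that every snapshot can see.
        self.versions = {key: [(0, value)] for key, value in initial.items()}

    def write(self, txn_id, key, value):
        self.versions[key].append((txn_id, value))

    def read(self, key, snapshot):
        # Return the newest version whose writer is visible in the snapshot.
        for txn_id, value in reversed(self.versions[key]):
            if txn_id == 0 or txn_id in snapshot:
                return value
        raise KeyError(key)


gtm = GTM()
node_a = Node({"x": "old"})
node_b = Node({"x": "old"})

node_a.write(1, "x", "new")      # TXN 1 updates A and B
node_b.write(1, "x", "new")
snap = gtm.snapshot()            # TXN 2 starts before TXN 1 commits
read_b = node_b.read("x", snap)  # TXN 2 reads B
gtm.commit(1)                    # TXN 1 commits
read_a = node_a.read("x", snap)  # TXN 2 reads A with the same snapshot
read_later = node_a.read("x", gtm.snapshot())  # a later transaction
```

Because TXN 2 reuses one snapshot, both of its reads return the old value even though TXN 1 commits in between, while a transaction that starts afterwards sees the new value on both nodes.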
Final Configuration: GTM, Coordinator and Datanode
[Figure: read/write transactions arrive at the Coordinators; Coordinators and Datanodes both connect to the GTM.]
Configuration in Practice
Just like configuring many database servers to talk to each other
● Many pitfalls
● Pgxc_ctl provides a simpler way to configure the whole cluster
  ● Provide only the needed parameters
  ● Pgxc_ctl does the rest, issuing the needed commands and SQL statements.
    – Visit http://sourceforge.net/p/postgres-xc/xc-wiki/PGOpen2013_Postgres_Open_2013/
OLTP Workload Characteristics
Number of Transactions: Many
Number of Involved Table Rows: Small
Locality of Row Allocation: High
Update Frequency: High
Scaling Out OLTP Workload
[Figure: read/write transactions run in parallel. Components: Coordinator, GTM, Datanode. Annotation: High workload.]
OLAP Workload Characteristics
Number of Transactions: Small
Number of Involved Table Rows: Huge
Locality of Row Allocation: Low
Update Frequency: Low
Scaling Out OLAP Workload
[Figure: one SQL statement enters a Coordinator, which runs small local SQLs on each Datanode in parallel. Components: Coordinator, GTM, Datanode. Annotations: May need fewer coordinators; Top-level aggregation; Low workload.]
Join Offloading: When row allocation is available
● Replicated Table and Partitioned Table
– The coordinator can determine which datanode to go to from the WHERE clause
Join Offloading: When row allocation is available
● Replicated Table and Partitioned Table
– When the coordinator cannot determine which datanode to go to from the WHERE clause
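The two cases on these slides come down to whether the WHERE clause pins the distribution key. A minimal sketch of that routing decision, assuming modulo distribution on an integer key and equality predicates only (the function name, node names, and dict-based WHERE representation are all hypothetical):

```python
NODES = ["datanode1", "datanode2", "datanode3"]  # hypothetical node names


def target_nodes(where, distribution_key, nodes=NODES):
    """Return the datanodes a statement has to visit.

    `where` maps column names to the integer value required by an
    equality predicate. Modulo distribution is assumed.
    """
    if distribution_key in where:
        # The key value identifies exactly one datanode.
        return [nodes[where[distribution_key] % len(nodes)]]
    # No predicate on the distribution key: every datanode may hold rows.
    return list(nodes)
```

Here `target_nodes({"customer_id": 7}, "customer_id")` selects a single datanode, so the join with a replicated table can be offloaded there. A predicate on any other column falls back to all datanodes.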
Aggregate Functions in PostgreSQL
[Figure: State Transition Function → Finalize Function]
Aggregate Functions in Postgres-XC/XL
[Figure: each Datanode runs the State Transition Function and returns (Sum, Count); the Coordinator runs the Collector Function and then the Finalize Function. AVG ← (Sum, Count)]
Similar to MapReduce!
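The AVG example can be worked through in a few lines: each datanode folds its rows into a (sum, count) state, the coordinator combines those partial states, and only then is the average computed. The function names below are illustrative and are not the names used in the PostgreSQL or Postgres-XC catalogs.

```python
def transition(state, value):
    """State transition function: runs on a datanode, once per row."""
    total, n = state
    return (total + value, n + 1)


def collect(state, partial):
    """Collector function: runs on the coordinator, once per datanode."""
    return (state[0] + partial[0], state[1] + partial[1])


def finalize(state):
    """Finalize function: turns (sum, count) into the average."""
    total, n = state
    return total / n if n else None


def distributed_avg(rows_per_datanode):
    combined = (0, 0)
    for rows in rows_per_datanode:
        local = (0, 0)
        for value in rows:
            local = transition(local, value)  # datanode side
        combined = collect(combined, local)   # coordinator side
    return finalize(combined)
```

For instance, `distributed_avg([[1, 2, 3], [4, 5]])` combines (6, 3) and (9, 2) into (15, 5) and returns 3.0. Averaging the per-node averages instead (2.0 and 4.5) would give the wrong answer, which is why the partial state has to carry both sum and count.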
Scale-out effort in PostgreSQL core
● Efforts like Postgres-XC/XL
  – Inter-node communication based upon FDW (Foreign Data Wrapper)
    ● General PostgreSQL means of handling external data (not only PostgreSQL)
  – Introducing parallelism
    ● Parallel sequential scan has just been committed
● Some not discussed yet
  – Update atomicity
  – Node management/configuration
  – Aggregate
  – Other cluster-wide architecture
Visit https://wiki.postgresql.org/wiki/PG-EU_2015_Cluster_Summit for details
PostgreSQL Scale-Out Cluster
In the Future
Impact to the approach
● Increasing demand for analytic workloads
  – IoT
    ● Larger amounts of data
    ● Not well-formatted
    ● Just adding/archiving new/old data
  – Big data analysis
    ● SQL-based analysis (flexible and reasonable performance)
    ● Semi-structured data (JSONB)
● Storage innovation
  – SSD via PCIe/NVMe
    ● Solution to the HDD performance bottleneck?
  – Even faster storage like 3D XPoint
    ● 1,000 times faster than HDD
    ● Storage via memory bus
    ● New kernel support expected
Impact to the approach (cont.)
● New server architecture
  – Software-defined
    ● RSA (Rack-Scale Architecture)
    ● More suited to the scale-out approach
● GPU as CPU accelerator
  – Parallel filter
  – In-memory sort
    ● Applicable to external sort?
  – Data compression
    ● Too much for GPU? Need FPGA?
● Server backbone N/W
  – 1Gig → 10Gig → 100Gig
  – Suitable for the scale-out approach
Future scale-out technology forecast
● Use datanodes as data storage and intelligent scan
  – Simpler statements
  – Intelligent scan
    ● Parallel scan: both intra-node and inter-node
● Allow the coordinator to do more with less workload
  – Use physical data distribution, index, etc. at the coordinator
  – More parallelism
    ● Map-reduce aggregate
● Allow local operation of cluster nodes
  – Improve the Global Transaction ID approach in XC
● Make the cluster more symmetric
  – Get rid of central global transaction management
● Simpler update synchronization
  – Reduce 2PC overhead
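More great content in the PG community
• PostgreSQL China community: postgres.cn
• PostgreSQL professional group 1: 3336901 (full)
• PostgreSQL professional group 2: 100910388
• PostgreSQL professional group 3: 150657323
• Documentation translation group: 309292849
PostgresChina WeChat official account, PostgreSQL user group Weibo
Postgres Conference China 2015 (China User Conference)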