Cloud Computing Unit 1 To Unit 5 Material
UNIT – 1
SYLLABUS: Systems modelling, Clustering and virtualization: Scalable
Computing over the Internet, Technologies for Network based systems, System
models for Distributed and Cloud Computing, Software environments for
distributed systems and clouds, Performance, Security And Energy Efficiency.
1.1 Scalable Computing Over the Internet
The general trend is to leverage shared web resources and massive amounts of data over the
Internet. Figure 1.1 shows the evolution of HPC and HTC systems.
On the HPC side, supercomputers are gradually being replaced by clusters of cooperating
computers that share data among themselves. A cluster is a collection of homogeneous
computers that are physically connected.
HTC shows the formation of peer-to-peer (P2P) networks for distributed file sharing
and apps. A P2P system is built over many client machines and is globally
distributed. This leads to formation of computational grids or data grids.
1.1.1.2. High Performance Computing (HPC): HPC stresses raw speed of performance. The
speed of HPC systems has increased from Gflops to Pflops (FLOPS => floating-point
operations per second) these days, driven by requirements from fields such as science,
engineering, medicine and others. The systems that generally offer such high speed are
supercomputers, mainframes and other large servers.
1.1.1.3. High Throughput Computing (HTC): Market-oriented computing is now going
through a strategic change from the HPC to the HTC paradigm. HTC concentrates more on
high-flux computing. The performance goal has shifted from the speed of the device to the
number of tasks completed per unit of time (throughput).
1.1.1.4. Three New Computing Paradigms: It can be seen from Figure 1.1 that
SOA (Service-Oriented Architecture) has made web services available for all kinds of
tasks. Internet clouds have become a major factor to consider for all types of
tasks. Three new paradigms have come into existence:
(a) Radio-Frequency Identification (RFID): This uses electro-magnetic fields to
automatically identify and track tags attached to objects. These tags contain
electronically stored information.
(b) Global Positioning System (GPS): It is a global navigation satellite system
that provides the geographical location and time information to a GPS receiver
[5].
(c) Internet of Things (IoT): It is the internetworking of different physical devices
(vehicles, buildings etc.) embedded with electronic devices (sensors), software,
and network connectivity. Data can be collected and exchanged through this
network (IoT).
1.1.1.5. Computing Paradigm Distinctions:
(a) Centralized Computing: All computer resources like processors, memory and
storage are centralized in one physical system. All of these are shared and
inter-connected and monitored by the OS.
(b) Parallel Computing: All processors are either tightly coupled with centralized shared
memory or loosely coupled with distributed memory (parallel processing). Inter-processor
communication is accomplished through shared memory or via message passing. This
methodology is known as parallel computing.
(c) Distributed Computing: A distributed system consists of multiple
autonomous computers with each device having its own private memory. They
interconnect among themselves by the usage of a computer network. Here
also, information exchange is accomplished by message passing.
(d) Cloud Computing: An Internet cloud of resources can be either a centralized
or a distributed computing system. The cloud applies parallel or distributed
computing, or both. A cloud can be built using physical or virtualized resources
over data centers. Cloud computing is also regarded as a form of utility or service computing.
1.1.1.6. Distributed System Families
In the future, both HPC and HTC will demand multicore processors that can handle
large numbers of computing threads per core. Both concentrate on parallel and
distributed computing. The main work lies in the fields of throughput, efficiency,
scalability and reliability.
Main Objectives:
(a) Efficiency: Efficiency is decided by how well the speed, programmability, and
throughput demands are met.
(b) Dependability: This measures the reliability from the chip to the system at
different levels. Main purpose here is to provide good QoS (Quality of Service).
(c) Adaptation in the Programming Model: This measures the ability to support an
unbounded number of job requests over massive data sets and virtualized cloud
resources under different workload and service models.
(d) Flexibility: This is the ability of distributed systems to perform well in both
HPC (science/engineering) and HTC (business) applications.
1.1.2 SCALABLE COMPUTING TRENDS AND NEW PARADIGMS
1.1.2.1. Degrees of Parallelism:
(a) Bit-level Parallelism (BLP): Processor word sizes have progressed from 8-bit to 16-, 32-, and 64-bit, so more bits are processed by each instruction.
(b) Instruction-level Parallelism (ILP): The processor executes multiple
instructions simultaneously. Ex: pipelining, superscalar execution, VLIW (very long
instruction word), and multithreading.
Pipelining: Data processing elements are connected in series where output of
one element is input to the next.
Multithreading: Multithreading is the ability of a CPU or a single core in
a multi-core processor to execute multiple processes or threads concurrently,
supported by the OS.
(c) Data-level Parallelism (DLP): A single instruction operates on many data elements
at once (single instruction, multiple data, SIMD), as with array and vector
operations. More hardware and compiler support is needed.
(d) Task-level Parallelism (TLP): A form of execution in which different tasks
(threads or functions) are distributed across multiple processors and run in
parallel (see the sketch after this list).
(e) Job-level Parallelism (JLP): Job level parallelism is the highest level of
parallelism where we concentrate on a lab or computer center to execute as
many jobs as possible in any given time period. To achieve this, we purchase
more systems so that more jobs are running at any one time, even though any
one user's job will not run faster.
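As a rough illustration of task-level parallelism, the minimal Python sketch below distributes independent tasks (function calls) across processor cores; the function name count_primes and the chosen limits are illustrative assumptions, not part of the syllabus material. Running many such complete programs at once on a lab's machines would correspond to job-level parallelism.

# A minimal sketch of task-level parallelism (TLP): independent tasks are
# distributed across worker processes, one per core where possible.
from concurrent.futures import ProcessPoolExecutor

def count_primes(limit):
    """Count primes below 'limit' using simple trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    limits = [50_000, 60_000, 70_000, 80_000]   # four independent tasks
    with ProcessPoolExecutor() as pool:         # tasks run in parallel processes
        print(list(pool.map(count_primes, limits)))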
1.1.2.2. Innovative Applications: HPC and HTC systems are applied in many fields for
many purposes. All applications demand computing economics, web-scale data collection,
system reliability, and scalable performance. Ex: Distributed transaction processing
is practised in the banking industry; transactions represent about 90 percent of the
existing market for reliable banking systems.
1.1.2.3 The Trend toward Utility Computing: Major computing paradigms and
available services/capabilities are coming together to produce a technology
convergence of cloud/utility computing where both HPC and HTC are utilised to
achieve objectives like reliability and scalability. They also aim to reach autonomic
operations that can be self-organized and support dynamic recovery. Ex:
Interpretation of sensor data, effectors like Google Home and Amazon Echo, smart
home devices etc.
Cloud Computing focuses on a business model where a customer receives different
computing resources (storage, service, security etc.) from service providers like
AWS, EMC, Salesforce.com.
1.1.2.4 The Hype Cycle of New Technologies: A new hype cycle is emerging in which
different important and significant capabilities needed by the customer are offered
as services by cloud computing. Ex: SaaS, IaaS, Security as a Service, DM
as a Service etc. Many others are also in the pipeline.
In the IoT, these objects can be interconnected and can exchange data and interact with
each other through suitable (web/mobile) applications. In the IoT era, cloud computing can
be used efficiently and securely to provide different services to humans, computers
and other objects. Ex: smart cities, interconnected networks, self-controlling street
lights/traffic lights etc.
1.1.3.2. Cyber-Physical Systems (CPS): A CPS is a cyber-physical system in which
physical objects and computational processes interact with each other. Ex: wrist bands
that monitor blood pressure. CPS merges the 3Cs, namely computation, communication and
control, to provide intelligent feedback between the cyber and physical worlds.
In the future, thousand-core GPUs may feature in the field of Eflops (10^18 flops)
systems.
1.2.2.3. Power Efficiency of the GPU: The major benefits of the GPU over the CPU are power
efficiency and massive parallelism. It is estimated that 60 Gflops/watt per core is needed to run
an exaflops system. [One exaflops is a thousand petaflops, or a quintillion (10^18) floating-point
operations per second.] A GPU chip requires roughly one-tenth of the power per instruction
that a CPU requires.
The CPU is optimized for latency (the time between request and response) in caches and
memory, whereas the GPU is optimized for throughput with explicit management of its
on-chip memory. Both power consumption and software are the future
challenges in parallel and distributed systems.
1.2.3. Memory, Storage and WAN:
(a) Memory Technology: The upper curve in Figure 1.10 shows the growth of DRAM
chip capacity from 16 KB to 64 GB. [SRAM is Static RAM and is 'static' because the
memory does not have to be continuously refreshed like Dynamic RAM. SRAM is
faster but also more expensive and is used inside the CPU. The traditional RAMs in
computers are all DRAMs]. For hard drives, capacity increased from 260 MB to 3 TB
and lately 5 TB (by Seagate). Faster processor speed and higher memory capacity
will result in a wider gap between processors and memory, which is an ever-
existing problem.
(b) Disks and Storage Technology: The rapid growth of flash memory and solid-
state drives (SSD) also has an impact on the future of HPC and HTC systems. An
SSD can handle 300,000 to 1 million write cycles per block, increasing the speed
and performance. Power consumption should also be taken care of before planning
any increase in capacity.
(c) System-Area Interconnects: The nodes in small clusters are interconnected by
an Ethernet switch or a LAN. As shown in Figure 1.11, a LAN is used to connect
clients to servers. A Storage Area Network (SAN) connects servers to network
storage like disk arrays. Network Attached Storage (NAS) connects clients directly
to disk arrays. All these types of network appear in a large cluster built with
commercial network components (e.g., Cisco, Juniper). If not much data is shared
across nodes, we can build a small cluster with an Ethernet switch plus copper cables
linking the end machines (clients/servers).
(d) WAN: We can also notice the rapid growth of Ethernet bandwidth from 10 Mbps
to 1 Gbps and still increasing. Different bandwidths are needed for local, national, and
international levels of networks. It is also expected that many computers will be used
concurrently in the coming years, and higher bandwidth will certainly add more speed
and capacity to aid cloud and distributed computing.
1.2.4. Virtual Machines and Middleware: A typical computer has a single OS
image at a time. This leads to a rigid architecture that tightly couples apps to a
specific hardware platform i.e., an app working on a system might not work on
another system with another OS (non-portable).
To build large clusters, grids and clouds, we need to increase the capacity of
computing, storage and networking resources in a virtualized manner. A cloud of
limited resources should aggregate all of these dynamically to produce the expected
results.
(a) Virtual Machines: As seen in Figure 1.12, the host machine is equipped with a
physical hardware. The VM is built with virtual resources managed by a guest OS to
run a specific application (Ex: VMware to run Ubuntu for Hadoop). Between the
VMs and the host platform we need a middleware called VM Monitor (VMM). A
hypervisor (VMM) is a program that allows different operating systems to share a
single hardware host. This approach is called bare-metal VM because a hypervisor
handles CPU, memory and I/O directly. VM can also be implemented with a dual
mode as shown in Figure 1.12 (d). Here, part of the VMM runs at user level and
another part runs at supervisor level.
(b) VM Primitive Operations: The VMM provides the VM abstraction to the guest
OS. With full virtualization, the VMM exports a VM abstraction identical to the
physical machine, so that a standard OS can run just as it would on the physical
hardware. Low-level VMM operations are indicated in Figure 1.13.
The VMs can be multiplexed between hardware machines, as shown in Figure 1.13(a).
A VM can be suspended and stored in stable storage, as shown in Figure 1.13(b).
A suspended VM can be resumed on a new hardware platform, as shown in Figure 1.13(c).
A VM can be migrated from one hardware platform to another, as shown in Figure 1.13(d).
Advantages:
These VM operations can enable a VM to work on any hardware platform.
They enable flexibility in porting distributed application executions.
VM approach enhances the utilization of server resources – multiple server
functions can be integrated on the same hardware platform to achieve higher
system efficiency. [VMware claims that server resource utilization has increased
from 5-15% to 60-80%].
The VM approach eliminates server sprawl by deploying systems as VMs, which moves
transparency to the shared hardware.
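The four low-level VM operations above (multiplex, suspend, resume, migrate) can be pictured with a toy sketch; the VirtualMachine class and host names below are illustrative assumptions, not a real hypervisor API.

# A toy model of the VM primitive operations of Figure 1.13 (illustrative only).
class VirtualMachine:
    def __init__(self, name, host):
        self.name, self.host, self.state = name, host, "running"

    def suspend(self):
        # 1.13(b): stop the VM and keep its state in stable storage
        self.state = "suspended"

    def resume(self, new_host):
        # 1.13(c): bring a suspended VM back up, possibly on new hardware
        self.host, self.state = new_host, "running"

    def migrate(self, new_host):
        # 1.13(d): suspend, move the saved state, then resume elsewhere
        self.suspend()
        self.resume(new_host)

vm = VirtualMachine("hadoop-node", host="server-A")
vm.migrate("server-B")
print(vm.host, vm.state)   # server-B running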
1.2.5. Data Center Virtualization for Cloud Computing: Cloud architecture is built
with commodity hardware and network devices. Almost all cloud platforms use x86
processors (the family descended from the Intel 8086). Low-cost terabyte disks and
Gigabit Ethernet are used to build data centers. A data center considers the
performance/price ratio rather than speed alone.
(a) Data Center Growth and Cost Breakdown: Large data centers are built with
thousands of servers, and smaller ones with hundreds. The cost of building and
maintaining data centers has increased over the years; most of this money is spent on
management and maintenance, while the purchase cost of the servers themselves has not
increased much with time. Electricity and cooling also consume a large share of the budget.
(b) Low-Cost Design Philosophy: High-end switches or routers that provide high-bandwidth
networks cost more and do not match the financial constraints of cloud
computing. For a fixed budget, commodity switches and networks are more desirable.
Similarly, commodity x86 servers are preferred over expensive mainframes.
Appropriate software ‘layer’ should be able to balance between the available
resources and the general requirements like network traffic, fault tolerance, and
expandability. [Fault tolerance is the property that enables a system to continue
operating properly even when one or more of its components have failed].
(c) Convergence of Technologies: CC is enabled by the convergence of technologies
in four areas:
Hardware virtualization and multi-core chips
Utility and grid computing
SOA, Web 2.0 and Web Service integration
Autonomic Computing and Data Center Automation
Web 2.0 is the second stage of development of the Internet, characterized by the shift
from static to dynamic web pages and the growth of social media.
Data is increasing by leaps and bounds every day, coming from sensors,
simulations, web services, mobile services and so on. Storage, acquisition and
access of this huge amount of data sets requires standard tools that support high
performance, scalable file systems, DBs, algorithms and visualization. With science
becoming data-centric, storage and analysis of the data plays a huge role in the
appropriate usage of the data-intensive technologies.
Cloud Computing is basically focused on the massive data that is flooding the
industry. CC also impacts the e-science where multi-core and parallel computing is
required. To achieve the goals in these fields, one needs to work on workflows,
databases, algorithms and virtualization issues.
Cloud Computing is a transformative approach since it promises more results than
a normal data center. The basic interaction with the information is taken up in a
different approach to obtain a variety of results, by using different types of data to
end up with useful analytical results.
A typical cloud runs on an extremely large cluster of standard PCs. In each cluster
node, multithreading is practised with a large number of cores in many-core GPU
clusters. Hence, data science, cloud computing and multi-core computing are
coming together to revolutionize the next generation of computing and take up the
new programming challenges.
1.3. SYSTEM MODELS FOR DISTRIBUTED AND CLOUD COMPUTING: Distributed
and Cloud Computing systems are built over a large number of independent
computer nodes, which are interconnected by SANs, LANs or WANs. A few LAN switches
can easily connect hundreds of machines as a working cluster. A WAN can connect
many local clusters to form a large cluster of clusters.
Large systems are highly scalable, and can reach web-scale connectivity either
physically or logically. Table 1.2 below shows massive systems classification as four
groups: clusters, P2P networks, computing grids and Internet clouds over large
data centers. These machines work collectively, cooperatively, or collaboratively at
various levels.
Clusters are most popular in supercomputing applications, and they have laid the
foundation for cloud computing. P2P networks are mostly used in business applications.
Many grids formed in the previous decade have not been utilized to their potential
due to the lack of proper middleware or well-written applications.
Through a hierarchical construction using SAN, LAN or WAN, scalable clusters can
be built with increasing number of nodes. The concerned cluster is connected to
the Internet through a VPN (Virtual Private Network) gateway, which has an IP
address to locate the cluster. Generally, most clusters have loosely connected
nodes, which are autonomous with their own OS.
To create a single system image (SSI), we need special cluster middleware support. Note
that both sequential and parallel applications can run on the cluster, but parallel
environments exploit the resources more effectively. Distributed shared memory (DSM)
allows data to be shared across all cluster nodes, making the resources available to
every user. However, SSI features are expensive and difficult to achieve, so users
generally prefer loosely coupled machines.
(d) Major Cluster Design Issues: A cluster-wide OS, or a single OS controlling the
whole cluster virtually, is not yet available. This makes the design and achievement of
SSI difficult and expensive. All applications must rely on the middleware to bring
out the coupling between machines in a cluster or between clusters. But it
should also be noted that the major advantages of clustering are scalable
performance, efficient message passing, high system availability, good fault
tolerance and a cluster-wide job management which react positively to the user
demands.
Grids are often built across LANs, WANs, or Internet backbones at a regional, national or
global scale. They are also termed virtual platforms. Computers, workstations, servers
and clusters are used in a grid. Note that PCs, laptops and others can be viewed as
access devices to a grid system. Figure 1.6 below shows an example grid built by
different organisations over multiple systems of different types, with different
operating systems.
When a new peer joins the system, its peer ID is added as a node in the overlay network.
The P2P overlay network characterizes the logical connectivity among the peers. There
are two types: unstructured and structured. An unstructured P2P overlay network is
formed randomly and has no fixed route of contact; flooding is used to send queries to
all nodes, which results in a sudden increase in network traffic and uncertain results.
Structured overlay networks, on the other hand, follow a pre-determined methodology of
connectivity for inserting and removing nodes from the overlay graph.
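The flooding behaviour of an unstructured overlay can be sketched as a breadth-first search over a peer graph; the small overlay and the TTL (time-to-live) value below are illustrative assumptions.

# A minimal sketch of query flooding in an unstructured P2P overlay network.
from collections import deque

overlay = {                                  # peer ID -> directly connected peers
    "A": ["B", "C"], "B": ["A", "D"],
    "C": ["A", "D"], "D": ["B", "C", "E"], "E": ["D"],
}

def flood(start, ttl):
    """Return every peer reached when 'start' floods a query with a hop limit."""
    visited, frontier = {start}, deque([(start, ttl)])
    while frontier:
        peer, hops = frontier.popleft()
        if hops == 0:
            continue
        for neighbour in overlay[peer]:      # forward the query to every neighbour
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append((neighbour, hops - 1))
    return visited

print(flood("A", ttl=2))   # one query touches most of the network -> heavy traffic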
(c) P2P Application Families: There are four types of P2P networks: distributed file
sharing, collaborative platforms, distributed P2P computing, and others. Ex:
BitTorrent, Napster, Skype, Genome@home, JXTA, .NET etc.
(d)P2P Computing Challenges: The main problems in P2P computing are those in
hardware, software and network. Many hardware models exist to select from;
incompatibility exists between the software and the operating systems; different
network connections and protocols make it too complex to apply in real-time
applications. Further, data location, scalability, performance, bandwidth etc. are
the other challenges.
A further disadvantage is that, since the system is not centralized, managing the whole
network is difficult; anyone can log on and upload any type of data, so security is weak.
1.3.4. Cloud Computing over Internet: Cloud Computing is defined by IBM as
follows: A cloud is a pool of virtualized computer resources. A cloud can host a
variety of different workloads that include batch-style backend jobs and interactive
and user-facing applications.
Since the explosion of data, the trend of computing has changed – the software
apps have to be sent to the concerned data. Previously, the data was transferred to
the software for computation. This is the main reason for promoting cloud
computing.
A cloud allows workloads to be deployed and scaled out through rapid provisioning
of physical or virtual systems. The cloud supports redundant, self-recovering, and
highly scalable programming models that allow workloads to recover from software
or hardware failures. The cloud system also monitors the resource use in such a
way that allocations can be rebalanced when required.
(a) Internet Clouds: The idea in Cloud Computing is to move desktop computing to a
service-oriented platform using server clusters and huge DBs at data centers. CC
benefits both users and providers by using its low cost and simple resources
through machine virtualization. Many user applications are satisfied simultaneously
by CC and finally, its design should satisfy the security norms, be trustworthy and
dependable. CC is viewed in two ways: a centralized resource pool or a server
cluster practising distributed computing.
(b) The Cloud Landscape: A distributed computing system is controlled by
companies or organisations. But these traditional systems encounter several
bottlenecks like constant maintenance, poor utilization, and increasing costs and
updates of software or hardware. To get rid of these, CC should be utilized as on-
demand computing.
(a) Layered Architecture for Web Services and Grids: The entity interfaces
correspond to the Web Services Description Language (WSDL), Java methods, and the
CORBA interface definition language (IDL) in these distributed systems. These
interfaces are linked with high-level communication systems such as SOAP, RMI and
IIOP, which are in turn built on message-oriented middleware infrastructures such as
JMS and WebSphere MQ.
At the entity level, fault tolerance is provided by the Web Services Reliable
Messaging (WSRM) framework, whose features mimic the reliability capabilities of the
OSI layers. Entity communication is supported by higher-level services for service
discovery, metadata, and the management of entities, discussed later. Ex: JNDI, CORBA
Trading Service, UDDI, LDAP and ebXML. This enables effective exchange of
information and results in higher performance and throughput.
In web services, the aim is to specify all aspects of the offered service and its
environment. This idea is carried out by using SOAP. Consequently, the
environment becomes a universal distributed OS with fully distributed capability
carried out by SOAP messages. But it should be noted that this approach has had
mixed results since the protocol can’t be agreed upon easily and even if so, it is
hard to implement.
The REST approach stresses simplicity and delegates the difficult problems to the
applications. In web services terms, REST carries minimal information in the header,
and the message body carries the needed information. REST architectures are more
useful in rapidly changing technology environments. Above the
communication and management layers, we can compose new entities or
distributed programs by grouping several entities together.
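The contrast between the two styles can be sketched as follows; the resource URI and operation names are placeholders invented for illustration, not a real service.

# A REST request carries only a verb, a URI and a small header, with data in the
# body; a SOAP call wraps the same operation in a self-describing XML envelope.
import json

rest_request = {
    "method": "GET",
    "uri": "http://example.com/api/orders/42",    # resource identified by URI
    "headers": {"Accept": "application/json"},    # minimal header information
}

soap_envelope = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Body>
    <GetOrder><OrderId>42</OrderId></GetOrder>
  </soap:Body>
</soap:Envelope>"""

print(json.dumps(rest_request, indent=2))
print(soap_envelope)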
Java and CORBA use RPC methodology through RMI. In grids, sensors represent
entities that output data as messages; grids and clouds represent collection of
services that have multiple message-based inputs and outputs.
(c) The Evolution of SOA: Service-Oriented Architecture (SOA) applies to building grids,
clouds, their combinations, and even inter-clouds and systems of systems. Data
collection is done through sensors such as ZigBee devices, Bluetooth devices, Wi-Fi
access points, PCs, mobile phones and others. All these devices interact with each
other or with grids, clouds and databases at distant locations.
(d) Grids vs. Clouds: Grid systems apply static resources, while a cloud stresses
elastic resources. The differences between a grid and a cloud lie mainly in dynamic
resource allocation based on virtualization and autonomic computing. A ‘grid of
clouds’ can also be built and can do a better job than a pure cloud because it can
support resource allocation. Grid of clouds, cloud of grids, cloud of clouds and inter-
clouds are also possible.
Amdahl’s Law states that if a fraction α of a program's workload must be executed
sequentially and the remaining portion (1 - α) can be fully parallelized, then the
speedup of an n-processor system over a single processor is given by:
S = 1/[α + (1 - α)/n] ---- (1.1)
The maximum speedup of n can be obtained only if α is reduced to zero, i.e., the code
is entirely parallelizable.
As the cluster becomes very large (n → ∞), S approaches 1/α, which is the upper bound
on the speedup S. Note that this bound is independent of n. The sequential bottleneck
is the portion of the code that cannot be parallelized. Ex: The maximum speedup
achieved is 4 if α = 0.25 (1 - α = 0.75), even if a user uses hundreds of processors.
This law tells us that we should make the sequential bottleneck as small as possible.
e) Problem with Fixed Workload: In Amdahl's law, the same amount of workload is
assumed for both sequential and parallel execution of the program, with a fixed
problem size or data set. This was called fixed-workload speedup by other scientists.
To execute this fixed workload on n processors, parallel processing leads to a system
efficiency E, which is given by:
E = S/n = 1/[αn + (1 - α)] ---- (1.2)
Generally, the system efficiency is low, especially when the cluster size is large. To
execute a program on cluster with n=256 nodes, and α=0.25, efficiency E =
1/[0.25x256 + 0.75] = 1.5%, which is very low. This is because only a few
processors, say 4, are kept busy whereas the others are kept idle.
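The two formulas above can be checked with a short sketch (the values reproduce the worked example in the text):

# Amdahl's law: speedup (Eq. 1.1) and efficiency (Eq. 1.2) for a fixed workload
# with sequential fraction alpha on n processors.
def amdahl_speedup(alpha, n):
    return 1.0 / (alpha + (1.0 - alpha) / n)

def amdahl_efficiency(alpha, n):
    return amdahl_speedup(alpha, n) / n          # E = S/n = 1/(alpha*n + 1 - alpha)

print(round(amdahl_speedup(0.25, 256), 2))            # ~3.95, below the 1/alpha = 4 bound
print(round(amdahl_efficiency(0.25, 256) * 100, 1))   # ~1.5 percent, as computed above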
f) Gustafson’s Law: To obtain higher efficiency when using a large cluster,
scaling the problem size to match the cluster’s capability should be considered. The
speedup law proposed by Gustafson is also referred to as scaled-workload
speedup.
Let W be the workload in a given program. When using an n-processor system, the
user scales the workload to W' = αW + (1 - α)nW. Note that only the parallelizable
portion of the workload is scaled n times in the second term. This scaled workload W'
is essentially the sequential execution time on a single processor. The parallel
execution of the scaled workload W' on n processors leads to a scaled-workload speedup
defined as:
S’ = W’/W = [αW + (1-α) nW]/W = α+ (1-α) n ---- (1.3)
This speedup is known as Gustafson’s law. By fixing the parallel execution time at
level W, we can obtain the following efficiency:
E’ = S’/n = α/n+ (1-α) ---- (1.4)
Taking previous workload values into consideration, efficiency can be improved for
a 256-node cluster to E’ = 0.25/256 + (1-0.25) = 0.751. For a fixed workload
Amdahl’s law must be used and for scaled problems users should apply Gustafson’s
law.
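A similar sketch reproduces the scaled-workload numbers used above:

# Gustafson's law: scaled-workload speedup (Eq. 1.3) and efficiency (Eq. 1.4).
def gustafson_speedup(alpha, n):
    return alpha + (1.0 - alpha) * n

def gustafson_efficiency(alpha, n):
    return alpha / n + (1.0 - alpha)             # E' = S'/n

print(gustafson_speedup(0.25, 256))              # 192.25
print(round(gustafson_efficiency(0.25, 256), 3)) # 0.751, as in the 256-node example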
1.5.2. Fault Tolerance and System Availability:
a)System Availability: High availability (HA) is needed in all clusters, grids, P2P
networks and cloud systems. A system is highly available if it has a long mean time
to failure (MTTF) and a short mean time to repair (MTTR).
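Using the common definition Availability = MTTF/(MTTF + MTTR), a quick sketch shows why a long MTTF and a short MTTR give high availability (the hour values below are illustrative assumptions):

# System availability from mean time to failure (MTTF) and mean time to repair (MTTR).
def availability(mttf_hours, mttr_hours):
    return mttf_hours / (mttf_hours + mttr_hours)

print(round(availability(mttf_hours=10_000, mttr_hours=10), 4))   # 0.999 -> highly available
print(round(availability(mttf_hours=1_000, mttr_hours=100), 4))   # 0.9091 -> much lower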
As a distributed system increases in size, availability decreases due to a higher
chance of failure and the difficulty of isolating failures. Both SMP and MPP systems
are vulnerable because of their centralized resources under one OS. NUMA machines are
somewhat better in this respect since they use multiple operating systems.
1.5.3 Network Threats and Data Integrity:
a) Threats to Systems and Networks:
Figure 1.25 presents a summary of various attack types and the damage they cause to
users. Information leaks lead to a loss of confidentiality. Loss of data integrity can
be caused by user alteration, Trojan horses, and service spoofing attacks, while
Denial of Service (DoS) leads to loss of Internet
connections and system operations. Users need to protect clusters, grids, clouds
and P2P systems from malicious intrusions that may destroy hosts, network and
storage resources. Internet anomalies found generally in routers, gateways and
distributed hosts may hinder (hold back) the usage and acceptance of these public
resources.
c) Application Layer: Most applications in areas such as science, engineering,
business and finance try to increase the system's speed or quality. In introducing
energy-aware applications, one should plan usage and consumption so that the
applications can exploit the new multi-level and multi-domain energy management
methodologies without reducing performance. For this goal, we need to identify a
relationship (correlation) between performance and energy consumption. Note that
these two factors, performance and energy consumption, are strongly correlated and
both affect the completion time.
d) Middleware layer: The middleware layer is a connection between application
layer and resource layer. This layer provides resource broker, communication
service, task analyzer & scheduler, security access, reliability control, and
information service capabilities. It is also responsible for energy-efficient
techniques in task scheduling. In a distributed computing system, a balance has to
be struck between efficient resource usage and the available energy.
e) Resource Layer: This layer consists of different resources including the
computing nodes and storage units. Since this layer interacts with hardware
devices and the operating systems, it is responsible for controlling all distributed
resources. Several methods exist for efficient power management of the hardware and
OS, and the majority of them are concerned with the processors.
Dynamic power management (DPM) and dynamic voltage frequency scaling
(DVFS) are the two popular methods being used recently. In DPM, hardware
devices can switch from idle modes to one or more lower-power modes. In DVFS, energy
savings are obtained from the fact that power consumption in CMOS
(Complementary Metal-Oxide-Semiconductor) circuits has a direct relationship
with the frequency and the square of the supply voltage [P = 0.5·C·V²·f]. Execution
time and power consumption can be controlled by switching among different
voltages and frequencies.
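A quick sketch of the quoted relation P = 0.5·C·V²·f shows why lowering both supply voltage and frequency saves power; the capacitance, voltage and frequency values are illustrative assumptions.

# Dynamic power in a CMOS circuit: P = 0.5 * C * V^2 * f.
def dynamic_power(c_farads, v_volts, f_hertz):
    return 0.5 * c_farads * v_volts ** 2 * f_hertz

nominal = dynamic_power(1e-9, 1.2, 2.0e9)    # nominal voltage and frequency
scaled  = dynamic_power(1e-9, 0.9, 1.0e9)    # DVFS: lower voltage and frequency
print(nominal, scaled, round(scaled / nominal, 3))   # power falls to ~28% of nominal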
f) Network Layer: The main responsibilities of the network layer in distributed
computing are routing and transferring packets, and enabling network services to
the resource layer. Energy consumption and performance are to be measured,
predicted and balanced in a systematic manner so as to produce energy-efficient
networks. Two challenges exist here:
The models should represent the networks systematically and should possess a
full understanding of interactions among time, space and energy.
New and energy-efficient algorithms have to be developed to exploit the
advantages to the maximum extent and to defend against attacks.
Data centers are becoming more important in distributed computing since the data
is ever-increasing with the advent of social media. They are now another core
infrastructure like power grid and transportation systems.
g) DVFS Method for Energy Efficiency: This method exploits the idle time (slack time)
arising from inter-task relationships. The slack time associated with a task is used
to execute the task at a lower voltage and frequency. The relationship between energy
and voltage/frequency in CMOS circuits is given by:
E = Ceff · f · v² · t,   f = K(v - vt)²/v ---- (1.6)
where v, Ceff, K and vt are the voltage, circuit switching capacity, a technology
dependent factor and threshold voltage; t is the execution time of the task under
clock frequency f. By reducing v and f, the energy consumption of the device can
also be reduced.
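Equation (1.6) can be sketched as below; the constants Ceff, K and vt are illustrative assumptions, not data for a real chip. Execution time is computed from a fixed number of clock cycles, so running at a lower voltage stretches the task into its slack time while reducing the total energy.

# Eq. (1.6): E = Ceff * f * v^2 * t with f = K * (v - vt)^2 / v (illustrative constants).
CEFF, K, VT = 1e-9, 1.0e9, 0.3     # switching capacity, technology factor, threshold voltage

def frequency(v):
    return K * (v - VT) ** 2 / v

def energy(v, cycles):
    f = frequency(v)
    t = cycles / f                  # a slower clock means a longer execution time
    return CEFF * f * v ** 2 * t    # simplifies to CEFF * v**2 * cycles

print(energy(1.2, 1e9))   # energy at the nominal voltage
print(energy(0.9, 1e9))   # lower voltage and frequency -> less energy for the same work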
CLOUD COMPUTING
UNIT II
SYLLABUS: Virtual Machines and Virtualization of Clusters and Data Centers:
Implementation Levels of Virtualization, Virtualization Structures/ Tools and mechanisms,
Virtualization of CPU, Memory and I/O Devices, Virtual Clusters and Resource Management,
Virtualization for Data Center Automation.
INTRODUCTION:
The massive usage of virtual machines (VMs) opens up new opportunities for parallel, cluster,
grid, cloud and distributed computing. Virtualization enables users to share expensive hardware
resources by multiplexing (i.e., combining multiple signals into one over a shared
medium) VMs on the same set of hardware hosts, such as servers or data centers.
The main idea is to separate hardware from software to obtain greater efficiency from the system.
Ex: Users can gain access to more memory through this concept of VMs. With sufficient storage,
any computer platform can be installed in another host computer, even if the two use different
processors (instruction sets) and run different operating systems.
a) Levels of Virtualization Implementation: A traditional computer system runs with a host
OS specially adjusted for its hardware architecture. This is depicted in Figure 3.1a.
After virtualization, different user apps managed by their own OS (i.e., guest OS) can run on
the same hardware, independent of the host OS. This is often done by adding a virtualization
layer as shown in Figure 3.1b.
This virtualization layer is called VM Monitor or hypervisor. The VMs can be seen in the upper
boxes where apps run on their own guest OS over a virtualized CPU, memory and I/O devices.
The main function of the software layer for virtualization is to virtualize the physical hardware
of a host machine into virtual resources to be used by the VMs. The virtualization software
creates the abstraction of VMs by introducing a virtualization layer at various levels of a
computer. General virtualization layers include the instruction set architecture (ISA) level,
hardware level, OS level, library support level, and app level. This can be seen in Figure 3.2.
The levels are discussed below.
I) Instruction Set Architecture Level: At the ISA level, virtualization is performed by
emulation (imitate) of the given ISA by the ISA of the host machine. Ex: MIPS binary code
can run on an x86-based host machine with the help of ISA simulation. Instruction
emulation leads to virtual ISAs created on any hardware machine.
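The idea of ISA-level emulation can be sketched with a tiny interpreter; the mini instruction set below is invented for illustration and is not MIPS, x86 or any real ISA.

# A toy instruction-set emulator: guest instructions are interpreted one by one
# on the host instead of executing directly on the host CPU.
def emulate(program):
    regs = {"r0": 0, "r1": 0, "r2": 0}          # guest registers held in host memory
    for op, *args in program:
        if op == "li":                          # load immediate value into a register
            regs[args[0]] = args[1]
        elif op == "add":                       # add two registers into a destination
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "print":
            print(regs[args[0]])
    return regs

emulate([("li", "r1", 20), ("li", "r2", 22),
         ("add", "r0", "r1", "r2"), ("print", "r0")])   # prints 42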
(b) vCUDA, which virtualizes the NVIDIA CUDA library so that CUDA-based applications can run inside VMs.
V) User-Application Level: Virtualization at the application level virtualizes an
application as a VM; this process is also known as process-level virtualization.
Generally, high-level language (HLL) VMs are used, where the virtualization layer sits
as an application above the OS and runs programs written and compiled for an abstract
machine definition. Ex: JVM and .NET CLR (Common Language Runtime).
Other forms of app level virtualization are app isolation, app sandboxing or app streaming.
Here, the app is wrapped in a layer and is isolated from the host OS and other apps. This
makes the app much easier to distribute and remove from user workstations. Ex:
LANDesk (an app virtualization platform) – this installs apps as self-contained, executable
files in an isolated environment. No actual installation is required and no system
modifications are needed.
Table 3.1 shows that hardware and OS support yield the highest performance. At the same
time, the hardware and application levels are the most expensive to implement.
Application isolation is difficult to achieve at the user level, while the ISA level
offers the best flexibility.
• For programs, a VMM should provide an identical environment, same as the original
machine.
• Programs running in this environment should show only minor decreases in speed.
• A VMM should be in complete control of the system resources.
Some differences might still be caused due to availability of system resources (more than one
VM is running on the same system) and differences caused by timing dependencies.
The hardware resource requirements (such as memory) of each VM are reduced, but the total
sum over all VMs is greater than that of the real machine. This is needed because of the
other VMs that are concurrently running on the same hardware.
A VMM should demonstrate efficiency in using the VMs. To guarantee the efficiency of a VMM,
a statistically dominant subset of the virtual processor’s instructions needs to be executed
directly by the real processor with no intervention by the VMM. A comparison can be seen in
Table 3.2:
The aspects to be considered here include: (1) the VMM is responsible for allocating
hardware resources for programs; (2) a program cannot access any resource that has not
been allocated to it; and (3) it is possible, under certain circumstances, for the VMM
to regain control of resources already allocated. Note that not all processors satisfy
these requirements of a VMM.
b) Advantages of OS Extensions:
▪ VMs at the OS level have minimal start-up/shutdown costs, low resource requirements
and high scalability.
▪ For an OS-level VM, the VM and its host environment can synchronise state changes
when necessary.
➢ All OS-level VMs on the same physical machine share a single OS kernel.
➢ The virtualization layer can be designed in a way that allows processes in VMs to access
as many resources of the host machine as possible, but never to modify them.
As we can observe in Figure 3.3, the virtualization layer is inserted inside the OS to partition
the hardware resources for multiple VMs to run their applications in multiple virtual
environments. To implement this OS level virtualization, isolated execution environments
(VMs) should be created based on a single OS kernel. In addition, the access requests from a
VM must be redirected to the VM’s local resource partition on the physical machine. For
example, the 'chroot' command in a UNIX system can create several virtual root directories within
an OS that can be used for multiple VMs.
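A minimal sketch of the chroot idea (Unix only, requires root privileges) is shown below; the directory path is an illustrative assumption.

# Confine the current process to a virtual root directory, as OS-level
# virtualization does for each VM partition.
import os

def enter_virtual_root(path):
    os.chroot(path)    # from now on, the process sees 'path' as '/'
    os.chdir("/")      # move the working directory inside the new root

# enter_virtual_root("/srv/vm1-root")   # e.g., one virtual root per OS-level VM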
To implement the virtual root directories’ concept, there exist two ways: (a) duplicating
common resources to each VM partition or (b) sharing most resources with the host
environment but creating private copies for the VMs on demand. It should be noted that the
first method incurs higher resource costs and places a heavier burden on the physical
machine. Therefore, the second method is the obvious choice.
Linux platforms are not tied to a special kernel. In such a case, a host can run several VMs
simultaneously on the same hardware. Examples can be seen in Table 3.3.
e) Middleware Support for Virtualization: This is the other name for Library-level
Virtualization and is also known as user-level Application Binary Interface or API emulation.
This type of virtualization creates execution environments for running alien (new/unknown)
programs on a platform, rather than creating a VM to run an entire OS. The key functions
performed here are API call interception and remapping (redirecting the intercepted calls
to appropriate implementations).
a) Hypervisor and Xen Architecture: The hypervisor (VMM) supports hardware level
virtualization on bare metal devices like CPU, memory, disk and network interfaces. The hypervisor
software exists between the hardware and its OS (platform). The hypervisor provides hypercalls
for the guest operating systems and applications. Depending on the functionality, a hypervisor can
assume micro-kernel architecture like MS Hyper-V or monolithic hypervisor architecture like the
VMware ESX for server virtualization.
➢ Hypercall: A hypercall is a software trap from a domain to the hypervisor, just as a
syscall is a software trap from an application to the kernel. Domains will use hypercalls to
request privileged operations like updating page tables.
➢ Software Trap: A trap, also known as an exception or a fault, is typically a type of
synchronous interrupt caused by an exceptional condition (e.g., breakpoint, division by
zero, invalid memory access). A trap usually results in a switch to kernel mode, wherein
the OS performs some action before returning control to the originating process. A trap in
a system process is more serious than a trap in a user process and might be fatal. The
term trap might also refer to an interrupt intended to initiate a context switch to a monitor
program or debugger.
➢ Domain: It is a group of computers/devices on a network that are administered as a unit
with common rules and procedures. Ex: Within the Internet, all devices sharing a common
part of the IP address are said to be in the same domain.
➢ Page Table: A page table is the data structure used by a virtual memory system in an
OS to store the mapping between virtual addresses and physical addresses.
➢ Kernel: A kernel is the central part of an OS and manages the tasks of the computer and
hardware like memory and CPU time.
➢ Monolithic Kernel: This is the architecture commonly used by operating systems. When a
device is needed, its driver is added as part of the kernel, and the kernel grows in
size. This has disadvantages, such as faulty programs damaging the kernel. Ex: memory
management, processor management, device drivers etc.
➢ Micro-kernel: In micro-kernels, only the basic functions are handled, e.g., memory
management and processor scheduling. Note that an OS cannot run on the micro-kernel
alone, and the extra communication involved can slow down the OS.
The size of the hypervisor code of a micro-kernel hypervisor is smaller than that of monolithic
hypervisor. Essentially, a hypervisor must be able to convert physical devices into virtual resources
dedicated for the VM usage.
i) Xen Architecture: It is an open source hypervisor program developed by Cambridge
University. Xen is a micro-kernel hypervisor, whose policy is implemented by Domain 0.
As can be seen in Figure 3.5, Xen doesn’t include any device drivers; it provides a mechanism by
which a guest-OS can have direct access to the physical devices. The size of Xen is kept small, and
provides a virtual environment between the hardware and the OS. Commercial Xen hypervisors
are provided by Citrix, Huawei and Oracle.
The core components of Xen are the hypervisor, kernel and applications. Many guest operating
systems can run on the top of the hypervisor; but it should be noted that one of these guest OS
controls the others. This guest OS with the control ability is called Domain 0 – the others are called
Domain U. Domain 0 is first loaded when the system boots and can access the hardware directly
and manage devices by allocating the hardware resources for the guest domains (Domain U).
For example, Xen is based on Linux and its security level is C2. Its management VM is named
Domain 0, which can access and manage all other VMs on the same host. If a user has access to
Domain 0 (the VMM), he can create, copy, save, modify or share the files and resources of all
the VMs. This is a huge advantage for the user, but concentrating all resources in Domain 0
also makes it a prime target for hackers. If Domain 0 is hacked, a hacker can control all the
VMs and through them, the total host system or systems. Security problems are to be dealt with in
a careful manner before handing over Xen to the user.
A (physical) machine's lifetime can be thought of as a straight line that progresses
monotonically as the software executes. During this time, executions are made,
configurations are changed, and software patches can be applied. A VM's lifetime, in
contrast, is more like a tree: execution can branch into N different paths, and multiple
instances of the VM can exist in the tree at any time. VMs can also be rolled back to a
particular state and rerun from that point.
b) Binary Translation with Full Virtualization: Hardware virtualization can be categorised into
two categories: full virtualization and host-based virtualization.
➢ Full Virtualization does not require the host OS to be modified; it relies on binary
translation to trap and virtualize the execution of certain sensitive instructions.
Normal (non-critical) instructions run directly on the host hardware, while the
critical, privileged instructions are first discovered via traps and then emulated by
the VMM. Running normal instructions directly keeps the performance overhead low,
while trapping the critical instructions preserves the security and correctness of the
system.
This approach is mainly used by VMware and others. As it can be seen in Figure 3.6, the
VMware puts the VMM at Ring 0 and the guest OS at Ring 1. The VMM scans the
instructions to identify complex and privileged instructions and trap them into the VMM,
which emulates the behaviour of these instructions. Binary translation is the method used
for emulation (A => 97 => 01100001). Full virtualization combines both binary translation
and direct execution. The guest OS is totally decoupled from the hardware and run
virtually (like an emulator).
The performance of full virtualization is not ideal, since binary translation is
time-consuming. Binary translation also adds memory cost (a code cache is used for
translated hot instructions), although it improves overall performance, typically
reaching about 80 to 97 percent of the host machine's speed.
➢ Host based Virtualization: In a host-based virtualization system both host and guest
OS are used and a virtualization layer is built between them. The host OS is still
responsible for managing the hardware resources. Dedicated apps might run on the VMs
and some others can run on the host OS directly. By using this methodology, the user can
install the VM architecture without modifying the host OS. The virtualization software can
rely upon the host OS to provide device drivers and other low-level services. Hence the
installation and maintenance of the VM becomes easier.
Another advantage is that many host machine configurations can be perfectly utilized; still
four layers of mapping exist in between the guest and host operating systems. This may
hinder the speed and performance, in particular when the ISA (Instruction Set
Architecture) of a guest OS is different from that of the hardware – binary translation
MUST be deployed. This adds time and cost and slows down the system.
Kernel based VM (KVM): This is a Linux para-virtualization system – it is a part of the Linux
kernel. Memory management and scheduling activities are carried out by the existing Linux kernel.
Other activities are taken care of by the KVM and this methodology makes it easier to handle than
the hypervisor. Also note that KVM is hardware assisted para-virtualization tool, which improves
performance and supports unmodified guest operating systems like Windows, Linux, Solaris and
others.
a) H/W Support for Virtualization: Modern operating systems and processors permit
multiple processes to run simultaneously. A protection mechanism should exist in the processor
so that all instructions from different processes will not access the hardware directly – this will
lead to a system crash.
All processors should have at least two modes – user and supervisor modes to control the
access to the hardware directly. Instructions running in the supervisor mode are called
privileged instructions and the others are unprivileged.
A CPU architecture is virtualizable only if it supports running the VM's instructions in
the CPU's user mode while the VMM runs in supervisor mode. When privileged instructions
are executed, they are trapped in the
VMM. In this case, the VMM acts as a mediator between the hardware resources and different
VMs so that correctness and stability of the system are not disturbed. It should be noted that
not all CPU architectures support virtualization.
Process:
• System call triggers the 80h interrupt and passes control to the OS kernel.
• Kernel invokes the interrupt handler to process the system call
• In Xen, the 80h interrupt in the guest OS concurrently causes the 82h interrupt in the
hypervisor; control is passed on to the hypervisor as well.
• After the task is completed, the control is transferred back to the guest OS kernel.
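The trap-and-emulate flow described above can be sketched as a toy model; the instruction names and the Trap class are invented for illustration and do not correspond to real CPU or Xen interfaces.

# Unprivileged instructions execute directly; privileged ones trap to the VMM.
class Trap(Exception):
    pass

PRIVILEGED = {"halt", "set_page_table"}

def cpu_execute(instr, privileged_mode):
    if instr in PRIVILEGED and not privileged_mode:
        raise Trap(instr)                        # privileged instruction in user mode
    return f"executed {instr} directly"

def vmm_run(instr):
    try:
        return cpu_execute(instr, privileged_mode=False)   # guest runs in user mode
    except Trap as trap:
        return f"VMM emulated {trap}"                      # VMM handles it in supervisor mode

print(vmm_run("add"))              # executed add directly
print(vmm_run("set_page_table"))   # VMM emulated set_page_table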
Each page table of a guest OS has a page table allocated for it in the VMM. The page table in the
VMM that handles all these mappings is called a shadow page table. As can be seen, this
whole process is nested and inter-connected at different levels through the concerned
addresses. If any change
occurs in the virtual memory page table or TLB, the shadow page table in the VMM is updated
accordingly.
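The shadow page table can be pictured as the composition of two mappings; the page and frame numbers below are illustrative assumptions.

# The VMM composes the guest's (virtual page -> guest frame) table with its own
# (guest frame -> host machine frame) map to build the shadow page table.
guest_page_table = {0: 7, 1: 3}       # guest virtual page -> guest "physical" frame
vmm_frame_map    = {7: 42, 3: 19}     # guest "physical" frame -> host machine frame

def rebuild_shadow_table():
    # Re-derived whenever the guest page table (or TLB) changes.
    return {vpage: vmm_frame_map[gframe]
            for vpage, gframe in guest_page_table.items()}

shadow_page_table = rebuild_shadow_table()
print(shadow_page_table)   # {0: 42, 1: 19}: usable directly by the hardware MMU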
d) I/O Virtualization: This involves managing the routing of I/O requests between virtual
devices and the shared physical hardware. The three ways to implement it are full device
emulation, para-virtualization and direct I/O.
• Full Device Emulation: This process emulates well-known and real-world devices. All the
functions of a device or bus infrastructure such as device enumeration, identification,
interrupts etc. are replicated in the software, which itself is located in the VMM and acts as a
virtual device. The I/O requests are trapped in the VMM accordingly. The emulation
approach can be seen in Figure 3.14.
• Para- virtualization: This method of I/O virtualization is taken up since software emulation
runs slower than the hardware it emulates. In para- virtualization, the frontend driver runs
in Domain-U; it manages the requests of the guest OS. The backend driver runs in Domain-
0 and is responsible for managing the real I/O devices. This para-virtualization
methodology gives better performance than full device emulation, but has a higher CPU overhead.
• Direct I/O virtualization: This lets the VM access devices directly; achieves high
performance with lower costs. Currently, it is used only for the mainframes.
Ex: VMware Workstation for I/O virtualization: NIC=> Network Interface Controller
e) Virtualization in Multi-Core Processors: Virtualizing a multi-core processor is more
complicated than that of a unicore processor. Multi-core processors have high performance by
integrating multiple cores in a chip, but their virtualization poses a new challenge. The main
difficulties are that applications must be parallelized to use all the cores, and this task
must be accomplished by software, which is a much harder problem.
To reach these goals, new programming models, algorithms, languages and libraries are needed to
increase the parallelism.
i)Physical versus Virtual Processor Cores: A multi-core virtualization method was proposed
to allow hardware designers to obtain an abstraction of the lowest level details of all the cores.
This technique alleviates (lessens) the burden of managing the hardware resources by
software. It is located under the ISA (Instruction Set Architecture) and is unmodified by the OS
or hypervisor. This can be seen in Figure 3.16.
ii) Virtual Hierarchy: The emerging concept of many-core chip multiprocessors (CMPs) is a
new computing landscape (background). Instead of supporting time-sharing jobs on one or few
cores, abundant cores can be used in a space-sharing – here single or multi-threaded jobs are
simultaneously assigned to the cores. Thus, the cores are separated from each other and no
interferences take place. Jobs go on in parallel, for long time intervals. To optimize (use
effectively) the workloads, a virtual hierarchy has been proposed to overlay (place on top) a
coherence (consistency) and caching hierarchy onto a physical processor. A virtual hierarchy
can adapt by itself to fit how to carry out the works and share the workspace depending upon
the workload and the availability of the cores.
The CMPs use a physical hierarchy of two or more cache levels that statically determine the
cache (memory) allocation and mapping. A virtual hierarchy is a cache hierarchy that can
adapt to fit the workloads. First level in the hierarchy locates data blocks close to the cores to
increase the access speed; it then establishes a shared-cache domain, establishes a point of
coherence, thus increasing communication speed between the levels. This idea can be seen in
Figure 3.17(a) [1].
Space sharing is applied to assign three workloads to three clusters of virtual cores: VM0 and
VM3 for DB workload, VM1 and VM2 for web server workload, and VM4-VM7 for middleware
workload. Basic assumption here is that a workload runs in its own VM. But in a single OS,
space sharing applies equally. To address this problem, Marty and Hill suggested a two-level
virtual coherence and caching hierarchy. This can be seen in Figure 3.17(b) [1]. Each VM
operates in its own virtual cluster in the first level which minimises both access time and
performance interference. The second level maintains a globally shared memory.
The provisioning of VMs to a virtual cluster is done dynamically to have the following properties:
• The virtual cluster nodes can be either physical or virtual (VMs) with different operating
systems.
• A VM runs with a guest OS that manages the resources in the physical machine.
• The purpose of using VMs is to consolidate multiple functionalities on the same server.
• VMs can be replicated on multiple servers to promote parallelism, fault tolerance and disaster
recovery.
• The no. of nodes in a virtual cluster can grow or shrink dynamically.
• The failure of some physical nodes will slow the work but the failure of VMs will cause no harm
(fault tolerance is high).
Since system virtualization is widely used, the VMs on virtual clusters have to be effectively
managed. The virtual computing environment should provide high performance in virtual cluster
deployment, monitoring large clusters, scheduling of the resources, fault tolerance and so on.
Figure 3.19 shows the concept of a virtual cluster based on app partitioning. The different colours
represent nodes in different virtual clusters. How to efficiently store the VM images coming
from different virtual clusters is the most important issue here. Software packages can be
pre-installed as
templates and the users can build their own software stacks. Note that the boundary of the
virtual cluster might change since VM nodes are added, removed, or migrated dynamically.
➢ Fast Deployment and Effective Scheduling: The concerned system should be able to
• Construct and distribute software stacks (OS, libraries, apps) to a physical node inside
the cluster as fast as possible
• Quickly switch runtime environments from one virtual cluster to another.
Green Computing: It is a methodology that is environmentally responsible and an eco-friendly
usage of computers and their resources. It is also defined as the study of designing,
manufacturing, using and disposing of computing devices in a way that reduces their
environmental impact.
Engineers must concentrate on utilizing the available resources in a cost- and energy-efficient
manner to optimize performance and throughput. Parallelism must be applied wherever needed,
and virtual machines/clusters should be used to attain this goal. Through this, we can reduce
overhead, attain load balancing, and achieve scale-up and scale-down mechanisms on the virtual
clusters. Finally, the virtual clusters can themselves be clustered again through dynamic
mapping methods.
It should be noted that every VM is configured with a name, disk image, network settings, and is
allocated a CPU and memory. But this might be cumbersome if the VMs are many in number. The
process can be simplified by configuring similar VMs with pre-edited profiles. Finally, the
deployment principle should be able to fulfil the VM requirement to balance the workloads.
b) Live VM Migration Steps: Normally, in a cluster built with mixed nodes of host and guest
systems, the procedure is to run everything on the physical machine. When a VM fails, it can be
replaced by another VM on a different node, as long as they both run the same guest OS. This is
called a failover (a procedure by which a system automatically transfers control to a duplicate
system when it detects a fault or failure) of a physical system to a VM. Compared to a physical-
physical failover, this methodology has more flexibility. It also has a drawback – a VM must stop
working if its host node fails. This can be lessened by migrating from one node to another for a
similar VM. The live migration process is depicted in Figure 3.20.
Managing a Virtual Cluster: There exist four ways.
1. We can use a guest-based manager, by which the cluster manager resides inside a guest
OS. Ex: A Linux cluster can run different guest operating systems on top of the Xen
hypervisor.
2. We can bring out a host-based manager which itself is a cluster manager on the host
systems. Ex: VMware HA (High Availability) system that can restart a guest system after
failure.
3. An independent cluster manager, which can be used on both the host and the guest –
making the infrastructure complex.
4. Finally, we might also use an integrated cluster (manager), on the guest and host
operating systems; here the manager must clearly distinguish between physical and virtual
resources.
Virtual clusters are generally used where fault tolerance of VMs on the host plays an important
role in the overall cluster strategy. These clusters can be applied in grids, clouds and HPC
platforms. High performance is obtained by dynamically finding and using resources as required,
and by keeping the migration time and the bandwidth consumed by migration low.
i) Memory Migration: This is done between the physical host and any other physical/virtual
machine. The techniques used here depend upon the guest OS. The memory to be moved can range
from megabytes to gigabytes. The Internet Suspend-Resume (ISR) technique exploits temporal
locality, since the memory states may have overlaps between the suspended and resumed instances
of a VM. Temporal locality (TL) refers to the fact that the memory states differ only by the amount
of work done since the VM was last suspended.
To exploit TL, each file is represented as a tree of small sub-files. A copy of this tree exists in
both the running and the suspended instances of the VM. The advantage of combining the tree
representation of a file with caching is that only the changed sub-files need to be selected for
transmission.
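A rough sketch of the idea behind the tree of small sub-files: if each sub-file is fingerprinted (hashing is used here purely as an illustrative change detector, not as the ISR mechanism itself), only the sub-files that changed since suspension need to be transmitted.

import hashlib

# Represent a large file as a list of fixed-size sub-files (chunks).
def split_into_subfiles(data: bytes, chunk_size: int = 4096):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

def changed_subfiles(suspended: bytes, resumed: bytes):
    """Return the indices of sub-files that differ between the suspended and
    resumed images; only these need to be sent over the network."""
    old = [digest(c) for c in split_into_subfiles(suspended)]
    new = [digest(c) for c in split_into_subfiles(resumed)]
    changed = [i for i, (a, b) in enumerate(zip(new, old)) if a != b]
    changed += list(range(len(old), len(new)))      # any newly appended chunks
    return changed

before = b"A" * 8192
after  = b"A" * 4096 + b"B" * 4096                  # only the second chunk changed
print(changed_subfiles(before, after))              # -> [1]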
ii) File System Migration: To support VM migration from one cluster to another, a consistent
and location-independent view of the file system must be available on all hosts. Each VM is provided
with its own virtual disk, to which its file system is mapped. The contents of the VM can be
transmitted across the cluster through inter-connections (mappings) between the hosts. Migration of
an entire host (if required), however, is not advisable due to cost and security problems. We can also
provide a global file system across all host machines where a VM can be located. This methodology
removes the need to copy files from one machine to another – all files on all machines can be
accessed through the network. It should be noted here that the actual files are not mapped or copied.
The VMM accesses only the local file system of a machine, and the original/modified files are stored
at their respective systems only. This decoupling improves security and performance but increases
the overhead of the VMM – every file has to be stored in the virtual disks of its local files. Smart
Copying ensures that, after being resumed from the suspended state, a VM does not receive a whole
file as a backup; it receives only the changes that were made. This technique reduces the amount of
data that has to be moved between the two locations.
iii) Network Migration: A migrating VM should maintain all open network connections. It should
not depend upon forwarding mechanisms (mediators) or mobility mechanisms. Each VM should be
assigned a unique IP or MAC (Media Access Control) [7] address that is different from that of
the host machine. The mapping of the IP and MAC addresses to their respective VMs is done by
the VMM.
If the destination of the VM is on the same LAN, special messages are sent using the MAC
address, announcing that the IP address of the VM has moved to a new location. If the destination
is on another network, the migrating OS can keep its original Ethernet MAC address and depend
on the network switch [9] to detect its move to a new port [8].
iv) Live Migration of VM Using Xen: Live migration means moving a VM from one physical
node to another while keeping its OS environment and apps intact. The whole process is carried out
by a program called the migration daemon. This capability provides efficient online system
maintenance, reconfiguration, load balancing, and improved fault tolerance. Recently improved
mechanisms are able to migrate without suspending the concerned VM.
There are two approaches to live migration: pre-copy and post-copy.
(a) In pre-copy, which is mainly used in live migration, all memory pages are first transferred
while the VM keeps running; the pages modified in the meantime (dirty pages) are then copied
iteratively over several rounds, with a short final round for the remaining dirty pages. Performance
degradation occurs because the migration keeps encountering dirty pages that must be re-sent
over the network before the transfer completes, and the number of iterations may grow. To
counter these problems, a checkpointing/recovery process is used at different points to handle
failures and improve performance.
(b) In post-copy, all memory pages are transferred only once during the migration process. The
threshold time allocated for migration is reduced, but the downtime is higher than that in
pre-copy.
Downtime means the time during which the system is out of action or cannot handle other work.
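The iterative pre-copy loop can be sketched as follows; the VM methods used (all_pages, transfer, dirty_pages, stop_and_copy) are stand-ins for hypervisor facilities and are assumptions made for illustration only.

# Sketch of the iterative pre-copy live-migration loop.
# transfer(), dirty_pages() and stop_and_copy() stand in for hypervisor
# facilities; they are assumptions made for illustration only.

def precopy_migrate(vm, max_rounds=5, dirty_threshold=32):
    # Round 0: push every memory page while the VM keeps running.
    pages_to_send = set(vm.all_pages())
    for round_no in range(max_rounds):
        vm.transfer(pages_to_send)                 # copy pages to the target host
        pages_to_send = vm.dirty_pages()           # pages modified meanwhile
        if len(pages_to_send) <= dirty_threshold:  # writable working set is small
            break
    # Final stop-and-copy round: brief downtime while the last dirty pages
    # and the CPU state are moved, then the VM resumes on the target node.
    vm.stop_and_copy(pages_to_send)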
a) Server Consolidation in Data Centers: In data centers, heterogeneous workloads may run
at different times. The two types are:
1. Chatty (Interactive) Workloads: These may reach a peak at a particular time
and be silent at some other time. Ex: a messaging app such as WhatsApp, busy in the evening
and quiet at midday.
2. Non-Interactive Workloads: These don't require any user effort to make progress
after they have been submitted. Ex: HPC batch jobs.
The data center should be able to handle the workload with satisfactory performance both at the
peak and at normal levels.
It is common that many of the resources of data centers, like hardware, space, power and cost, are
under-utilized at various levels and times. To overcome this disadvantage, one approach is to
use the methodology of server consolidation. This improves the utilization ratio of hardware
devices by reducing the number of physical servers. There exist two types of server consolidation:
(a) centralized and physical consolidation and (b) virtualization-based server consolidation.
The second method is widely used these days, and it has several advantages:
• Consolidation increases hardware utilization.
• It enables more agile provisioning of the available resources.
• The total cost of owning and using a data center is reduced (lower maintenance, cooling,
cabling, etc.).
• It enables availability and business continuity – the crash of a guest OS has no effect upon
the host OS.
To automate virtualized data centers, one must consider several factors like resource
scheduling, power management, performance of analytical models, and so on. This improves the
utilization in data centers and gives high performance. Scheduling and reallocation can be done at
different levels – the VM level, the server level and the data center level – but generally only one
(or two) levels are used at a time.
The schemes that can be considered are:
(a) Dynamic CPU allocation: This is based on VM utilization and application-level QoS (Quality of
Service) metrics. The CPU allocation should adjust automatically according to the demands and
workloads to deliver the best possible performance.
(b) Another scheme uses a two-level resource management system to handle the complexity of
the requests and allocations. The resources are allocated automatically and autonomously to
bring down the workload on each server of a data center.
Finally, we should efficiently balance power saving and data center performance to achieve both
high performance and high throughput as different situations demand.
Parallax itself runs as a user-level application in the storage appliance VM, providing Virtual Disk
Images (VDIs). A VDI can be accessed in a transparent manner from any host machine in the
Parallax cluster. It is the core abstraction of the storage methodology used by Parallax.
c) Cloud OS for Virtualized Data Centers: VI => Virtual Infrastructure manager; the types can be
seen in Table 3.6. Abbreviations used there:
EC2 => Amazon Elastic Compute Cloud; WS => Web Service; CLI => Command Line Interface;
WSRF => Web Services Resource Framework; KVM => Kernel-based Virtual Machine;
VMFS => VM File System; HA => High Availability.
Example of Eucalyptus for Virtual Networking of a Private Cloud: Eucalyptus is an open-source
software system intended for IaaS clouds; it is shown in Figure 3.27.
Instance Manager (IM): It controls the execution, inspection and termination of VM instances on the
host machines where it runs.
Group Manager (GM): It gathers information about VM execution and schedules them on specific
IMs; it also manages virtual instance network.
Cloud Manager (CM): It is an entry-point into the cloud for both users and administrators. It
gathers information about the resources, allocates them by proper scheduling, and implements
them through the GMs.
The above figure proposes the concept of running the IDS only on a highly privileged VM. Notice
that policies play an important role here. A policy framework can monitor the events in the different
guest operating systems of different VMs by using an OS interface library to determine which access
is secure and which is not.
It is difficult to determine which access is an intrusion and which is not without some time delay.
Systems may also use access 'logs' to analyse which access is an intrusion and which is secure. The
IDS log service is based on the OS kernel, and the UNIX kernel is hard to break; so even if a host
machine is taken over by hackers, the IDS log book remains unaffected.
The security problems of the cloud mainly arise during the transport of VM images through the
network from one location to another. The VMM must be used more effectively and efficiently to
deny any chances to the hackers.
CLOUD COMPUTING
UNIT – 3
SYLLABUS: Cloud Platform Architecture: Cloud Computing and service Models, Public
Cloud Platforms, Service Oriented Architecture, Programming on Amazon AWS and
Microsoft Azure.
In recent years, the IT industry has moved from manufacturing to offering more services (it has
become service-oriented). As of now, about 80% of the industry is a 'service industry'. It should be
realized that services are not manufactured/invented from time to time; they are only rented and
improved as per the requirements. Clouds aim to utilize the resources of data centers virtually over
automated hardware, databases, user interfaces and apps.
I)Public, Private and Hybrid Clouds: Cloud computing has evolved from the concepts of
clusters, grids and distributed computing. Different resources (hardware, finance, time) are
leveraged (used to maximum advantage) to bring out maximum HTC. A Cloud Computing
model enables the users to share resources from anywhere at any time through their connected
devices.
Advantages of Cloud Computing: Recall that in Cloud Computing, the programming is
sent to data rather than the reverse, to avoid large data movement, and maximize the bandwidth
utilization. Cloud Computing also reduces the costs incurred by the data centers, and increases
the app flexibility. Cloud Computing consists of a virtual platform with elastic resources and puts
together the hardware, data and software as per demand. Furthermore, the apps utilized and
offered are heterogeneous.
The Basic Architecture of the types of clouds can be seen in Figure 4.1 below.
• Public Clouds: A public cloud is owned by a service provider, built over the Internet and
offered to users on payment. Ex: Google App Engine (GAE), AWS, MS-Azure, IBM
Blue Cloud and Salesforce's Force.com. All of these offer their services for creating and
managing VM instances for users within their own infrastructure.
• Private Clouds: A private cloud is built within the domain of an intranet owned by a
single organization. It is client-owned and managed; its access is granted to a limited
number of clients only. Private clouds offer a flexible and agile private infrastructure
to run workloads within their own domains. Though private cloud offers more control, it
has limited resources only.
• Hybrid Clouds: A hybrid cloud is built with both public and private clouds. Private
clouds can also support a hybrid cloud model by enhancing the local infrastructure with
computing capacity of a public external cloud.
• Data Center Networking Architecture: The core of a cloud is the server cluster and the
cluster nodes are used as compute nodes. The scheduling of user jobs requires that virtual
clusters are to be created for the users and should be granted control over the required
resources. Gateway nodes are used to provide the access points of the concerned service
from the outside world. They can also be used for security control of the entire cloud
platform. It is to be noted that in physical clusters/grids, the workload is static; in clouds,
the workload is dynamic and the cloud should be able to handle any level of workload on
demand.
ii) Cloud Ecosystem and Enabling Technologies: The differences between classical
computing and cloud computing can be seen in the table below. In traditional computing, a
user has to buy the hardware, acquire the software, install the system, test the configuration and
execute the app code. The management of the available resources is also a part of this. Finally,
all of this has to be revised every 1.5 to 2 years, since the methodologies used will
become obsolete.
On the other hand, Cloud Computing follows a pay-as-you-go model [1]. Hence the cost is
reduced significantly – a user doesn’t buy any resources but rents them as per his requirements.
All S/W and H/W resources are leased by the user from the cloud resource providers. This is
advantageous for small and medium business firms, which require only a limited amount of
resources. Finally, Cloud Computing also saves power.
b) Cost Model:
The above Figure 4.3a shows the additional costs on top of the fixed capital investments in
traditional computing. In Cloud Computing, only a pay-per-use model is applied, and user jobs are
outsourced to data centers. To use a cloud, one has no need to buy hardware resources; they can
be utilized as per the demands of the work and released after the job is completed.
An ecosystem for private clouds was suggested by scientists, as depicted in Figure 4.4.
In the four levels suggested above, at the user end, a flexible platform is required by the
customers. At the cloud management level, virtualization resources are provided by the
concerned cloud manager to offer IaaS. At the VI management level, the manager
allocates the VMs to the multiple available clusters. Finally, at the VM management level, the
VM managers handle the VMs installed on the individual host machines.
d) Increase of Private Clouds: Private clouds influence the infrastructure and services that are
utilized by an organization. Both private and public clouds handle workloads dynamically, but
public clouds handle them without communication dependency. On the other hand, private
clouds can balance workloads to exploit the infrastructure effectively and obtain high
performance. The major advantage of private clouds is fewer security problems, whereas public
clouds need less investment.
II) PUBLIC CLOUD PLATFORMS: Cloud services are provided as per demand by different
companies. It can be seen in Figure 4.19 that there are 5 levels of cloud players.
The app providers at the SaaS level are used mainly by the individual users. Most business
organizations are serviced by IaaS and PaaS providers. IaaS provides compute, storage, and
communication resources to both app providers and organizational users. The cloud
environment is defined by PaaS providers. Note that PaaS supports both IaaS services
and organizational users directly.
Cloud services depend upon machine virtualization, SOA, grid infrastructure management and
power efficiency. The provider service charges are much lower than the cost incurred by the
users when replacing damaged servers. The Table 4.5 shows a summary of the profiles of the
major service providers.
PKI=> Public Key Infrastructure; VPN=> Virtual Private Network
a. Google App Engine (GAE): The Google platform is based on its search engine
expertise and is applicable to many other areas (Ex: MapReduce). The Google Cloud
Infrastructure consists of several apps like Gmail, Google Docs and Google Earth, and can
support a large number of users simultaneously, raising the bar for HA (high
availability). Other technology achievements of Google include the Google File System
(GFS) [like HDFS], MapReduce, BigTable, and Chubby (A Distributed Lock Service).
GAE enables users to run their apps on a large number of data centers associated with
Google’s search engine operations. The GAE architecture can be seen in Figure 4.20 [1]
below:
The building blocks of Google’s Cloud Computing app include GFS for storing large amounts
of data, the MapReduce programming framework for developers, Chubby for distributed lock
services and BigTable as a storage service for accessing structural data.
GAE runs the user program on Google's infrastructure, where the user need not worry about
storage or maintenance of data in the servers. It is a combination of several software
components, but the frontend is the same as in ASP (Active Server Pages), J2EE and JSP.
The well-known GAE apps are the search engine, Docs, Earth and Gmail. Users linked with one
app can interact and interface with other apps through the resources of GAE (synchronization
and one login for all services).
b. Amazon Web Services (AWS): Amazon applies the IaaS model in providing its
services. The Figure 4.21 [1] below shows the architecture of AWS:
EC2 provides the virtualized platforms to host the VMs where the cloud app can run.
S3 (Simple Storage Service) provides an object-based storage service for the users.
EBS (Elastic Block Service) provides the block storage interface which can be used to support
traditional apps.
SQS (Simple Queue Service) ensures a reliable message service between two processes.
Amazon offers an RDS (Relational Database Service) with a messaging interface. The AWS
offerings are summarized in Table 4.6 below.
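A minimal sketch of how these services are typically driven programmatically, assuming the boto3 Python SDK; the AMI ID, bucket name and queue URL are placeholders, not real resources.

import boto3

# EC2: launch one small virtual machine (the AMI ID is a placeholder).
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.run_instances(ImageId="ami-0123456789abcdef0",
                  InstanceType="t2.micro", MinCount=1, MaxCount=1)

# S3: store an object in a bucket (the bucket name is a placeholder).
s3 = boto3.client("s3")
s3.put_object(Bucket="my-example-bucket", Key="hello.txt", Body=b"hello cloud")

# SQS: send a message between two processes (the queue URL is a placeholder).
sqs = boto3.client("sqs")
sqs.send_message(QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/jobs",
                 MessageBody="process dataset 42")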
c. MS-Azure: The overall architecture of MS cloud platform, built on its own data
centers, is shown in Figure 4.22. It is divided into 3 major component platforms as it
can be seen. Apps are installed on VMs and Azure platform itself is built on Windows
OS.
• Live Service: Through this, the users can apply MS live apps and data across multiple
machines concurrently.
• .NET Service: This package supports app development on local hosts and execution on cloud
machines.
• SQL Azure: Users can access and utilize the relational database associated with an SQL server
in the cloud.
• SharePoint Service: A scalable platform to develop special business apps.
• Dynamic CRM Service: This provides a business platform for the developers to manage the
CRM apps in financing, marketing, sales and promotions.
SOAP: This provides a standard packaging structure for the transmission of XML documents over
various Internet protocols (HTTP, SMTP, FTP). A SOAP message consists of an envelope (the root
element), which itself contains a header; it also has a body that carries the payload of the message.
WSDL: It describes the interface and a set of operations supported by a web service in a
standard format.
UDDI: This provides a global registry for advertising and discovery of web services by
searching for names, identifiers, categories.
Since SOAP can combine the strengths of XML and HTTP, it is useful for heterogeneous
distributed computing environments like grids and clouds.
ii. Enterprise Multitier Architecture: This is a kind of client/server architecture in which
application processing and data management are logically separate processes. As seen
below in Figure 5.4, it is a three-tier information system where each layer has its own
important responsibilities.
Presentation Layer: Presents information to external entities and allows them to interact with
the system by submitting operations and getting responses.
Application Logic (Middleware): These consist of programs that implement actual operations
requested by the client. The middle tier can also be used for user authentication and granting of
resources, thus removing some load from the servers.
Resource Management Layer (Data Layer): It deals with the data sources of an information
system.
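The separation of the three tiers can be sketched in a few lines; the class and function names below are illustrative, not part of any real framework.

# Resource management (data) layer: owns the data source.
class AccountStore:
    def __init__(self):
        self._balances = {"alice": 100}
    def get(self, user):           return self._balances[user]
    def set(self, user, amount):   self._balances[user] = amount

# Application-logic (middleware) layer: implements the operations
# requested by clients; authentication could also be enforced here.
class AccountService:
    def __init__(self, store):     self.store = store
    def deposit(self, user, amount):
        self.store.set(user, self.store.get(user) + amount)
        return self.store.get(user)

# Presentation layer: formats responses for the external entity.
def present(user, balance):
    return f"{user}: balance = {balance}"

store = AccountStore()
service = AccountService(store)
print(present("alice", service.deposit("alice", 50)))   # alice: balance = 150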
CLOUD COMPUTING
UNIT IV
Syllabus: Cloud Resource Management and Scheduling: Policies and Mechanisms for
Resource Management, Applications of Control Theory to Task Scheduling on a Cloud, Stability of a
Two-Level Resource Allocation Architecture, Feedback Control Based on Dynamic Thresholds.
Coordination of Specialized Autonomic Performance Managers, Resource Bundling, Scheduling
Algorithms for Computing Clouds, Fair Queuing, Start Time Fair Queuing, Borrowed Virtual Time,
Cloud Scheduling Subject to Deadlines, Scheduling MapReduce Applications Subject to Deadlines.
5.1. INTRODUCTION:
Resource management is a core function of any man-made system. It affects the three basic
criteria for the evaluation of a system: performance, functionality, and cost. An inefficient resource
management has a direct negative effect on performance and cost and an indirect effect on the
functionality of a system.
Cloud resource management requires complex policies and decisions for multi-objective
optimization. Cloud resource management is extremely challenging because of the complexity of
the system, which makes it impossible to have accurate global state information, and because of
the unpredictable interactions with the environment.
The strategies for resource management associated with the three cloud delivery models, IaaS,
PaaS, and SaaS, differ from one another.
5.2. POLICIES AND MECHANISMS FOR RESOURCE MANAGEMENT
A policy typically refers to the principal guiding decisions, whereas mechanisms represent the
means to implement policies. Separation of policies from mechanisms is a guiding principle in
computer science.
Cloud resource management policies can be loosely grouped into five classes:
1. Admission control.
2. Capacity allocation.
3. Load balancing.
4. Energy optimization.
5. Quality-of-service (QoS) guarantees
Admission control: a validation process in communication systems where a check is performed
before a connection is established to see whether the current resources are sufficient for the
proposed connection. It is a policy to prevent the system from accepting workloads in violation of
high-level system policies.
Capacity allocation: allocating resources to individual instances; an instance is an activation of a
service.
Load balancing: distributing the workload evenly among the servers.
Energy optimization: minimization of energy consumption.
Load balancing and energy optimization are correlated and affect the cost of providing the
services.
Quality of service is the aspect of resource management that is probably the most difficult to
address and, at the same time, possibly the most critical to the future of cloud computing: the
ability to satisfy timing or other conditions specified by a Service Level Agreement.
The four basic mechanisms for the implementation of resource management policies are:
• Control theory: uses the feedback to guarantee system stability and predict transient
behavior.
• Machine Learning: does not need a performance model of the system.
• Utility based: requires a performance model and a mechanism to correlate user-level
performance with cost.
• Market-oriented/economic mechanisms: do not require a model of the system, e.g.,
combinatorial auctions for bundles of resources.
5.3. APPLICATIONS OF CONTROL THEORY TO TASK SCHEDULING ON A CLOUD
Control theory has been used to design adaptive resource management for many classes of
applications, including power management, task scheduling, QoS adaptation in Web servers, and
load balancing.
The classical feedback control methods are used in all these cases to regulate the key operating
parameters of the system based on measurement of the system output.
A technique to design self-managing systems allows multiple QoS objectives and operating
constraints to be expressed as a cost function; it can be applied to stand-alone or distributed
Web servers, database servers, high-performance application servers, and even mobile/embedded
systems.
Our goal is to illustrate the methodology for optimal resource management based on control
theory concepts. The analysis is intricate and cannot be easily extended to a collection of servers.
Control Theory Principles. Optimal control generates a sequence of control inputs over a look-
ahead horizon while estimating changes in operating conditions. A convex cost function has
arguments x (k), the state at step k, and u(k), the control vector; this cost function is minimized,
subject to the constraints imposed by the system dynamics. The discrete-time optimal control
problem is to determine the sequence of control variables u(i), u(i + 1), . . . , u(n − 1) to
minimize the expression
J = Φ(n, x(n)) + Σ_{k=i}^{n−1} Lk(x(k), u(k)),
where Φ(n, x(n)) is the cost function of the final step, n, and Lk(x(k), u(k)) is a time-varying
cost function at the intermediate step k over the horizon [i, n]. The minimization is subject to the
constraints
x(k + 1) = f^k(x(k), u(k)), i ≤ k ≤ n − 1,
where x(k + 1), the system state at time k + 1, is a function of x(k), the state at time k, and of
u(k), the input at time k; in general, the function f^k is time-varying, hence its superscript.
The controller uses the feedback regarding the current state as well as the estimation of the future
disturbance due to environment to compute the optimal inputs over a finite horizon. The two
parameters r and s are the weighting factors of the performance index.
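The feedback idea behind such controllers can be illustrated with a toy sketch (this is not the optimal controller described above): a proportional controller adjusts the CPU share allocated to an application so that its measured response time tracks a target. The plant model and all numbers are illustrative assumptions.

# Toy proportional feedback controller: adjust the CPU share so that the
# measured response time tracks a target value.  The "plant" model
# (response time inversely proportional to CPU share) is an illustrative
# assumption, not a real system model.

def measured_response_time(cpu_share, base_work=2.0):
    return base_work / cpu_share          # simple synthetic plant

def control_loop(target_rt=1.0, gain=0.2, steps=15):
    cpu_share = 0.5                       # initial allocation (fraction of a core)
    for k in range(steps):
        rt = measured_response_time(cpu_share)        # sensor: measure the output
        error = rt - target_rt                        # deviation from the target
        cpu_share = min(4.0, max(0.1, cpu_share + gain * error))   # actuator
        print(f"step {k}: share={cpu_share:.2f} response_time={rt:.2f}")

control_loop()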
A two-level resource allocation architecture based on control theory concepts can be applied to the
entire cloud. The automatic resource management is based on two levels of controllers, one for the
service provider and one for the application, as shown below.
The main components of a control system are the inputs, the control system components, and
the outputs.
The system components are sensors used to estimate relevant measures of performance and
controllers that implement various policies; the output is the resource allocations to the individual
applications.
There are three main sources of instability in any control system:
1. The delay in getting the system reaction after a control action.
2. The granularity of the control, i.e., the fact that a small change enacted by the controllers leads
to very large changes of the output.
3. Oscillations, which occur when the changes of the input are too large and the control is too
weak, such that the changes of the input propagate directly to the output.
The elements involved in a control system are sensors, monitors, and actuators.
The sensors measure the parameter(s) of interest, then transmit the measured values to a
monitor, which determines whether the system behavior must be changed, and, if so, it requests
that the actuators carry out the necessary actions. Often the parameter used for admission
control policy is the current system load; when a threshold, e.g., 80%, is reached, the cloud stops
accepting additional load.
Thresholds:
A threshold is the value of a parameter related to the state of a system that triggers a change in
the system behavior. Thresholds are used in control theory to keep critical parameters of a system
in a predefined range. The threshold could be static, defined once and for all, or it could be
dynamic. A dynamic threshold could be based on an average of measurements carried out over a
time interval, a so-called integral control.
To maintain the system parameters in a given range, a high and a low threshold are often defined.
A dynamic (proportional) thresholding policy for VM allocation can then work as follows:
1. Compute the integral values of the high and the low thresholds as averages of the maximum
and, respectively, the minimum of the CPU utilization over the process history.
2. Request additional VMs when the average value of the CPU utilization over the current
time slice exceeds the high threshold.
3. Release a VM when the average value of the CPU utilization over the current time slice
falls below the low threshold.
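A small sketch of this threshold policy; the threshold values and utilization samples are illustrative assumptions.

# Sketch of threshold-based VM scaling driven by the average CPU utilization
# over a time slice.  Thresholds and utilization samples are illustrative.

HIGH, LOW = 0.80, 0.30          # high and low thresholds

def scale(current_vms, utilization_samples):
    avg = sum(utilization_samples) / len(utilization_samples)
    if avg > HIGH:
        return current_vms + 1, "request additional VM"
    if avg < LOW and current_vms > 1:
        return current_vms - 1, "release one VM"
    return current_vms, "no change"

print(scale(4, [0.91, 0.85, 0.88]))   # -> (5, 'request additional VM')
print(scale(4, [0.15, 0.22, 0.18]))   # -> (3, 'release one VM')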
Virtually all modern processors support dynamic voltage scaling (DVS) as a mechanism for energy
saving. Indeed, the energy dissipation scales quadratically with the supply voltage. The power
management controls the CPU frequency and, thus, the rate of instruction execution. For some
compute-intensive workloads the performance decreases linearly with the CPU clock frequency,
whereas for others the effect of lower clock frequency is less noticeable or nonexistent.
The approach to coordinating power and performance management is based on several ideas:
• Use a joint utility function for power and performance. The joint performance-power utility
function, Upp(R, P), is a function of the response time, R, and the power, P; it can be of the
form Upp(R, P) = U(R) − ε × P, with U(R) the utility function based on response time only
and ε a parameter that weights the influence of the two factors, response time and power.
• Set up a power cap for individual systems based on the utility-optimized power management
policy.
• Use a standard performance manager, modified only to accept input from the power manager
regarding the frequency determined according to the power management policy. The
power manager consists of Tcl (Tool Command Language) and C programs that compute the
per-server (per-blade) power caps and send them via IPMI (Intelligent Platform Management
Interface) to the firmware controlling the blade power. The power manager and the
performance manager interact, but no negotiation between the two agents is involved.
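A sketch of the joint-utility idea: among a set of candidate power caps, pick the one that maximizes Upp(R, P) = U(R) − ε × P. The mapping from power cap to response time and all constants are illustrative assumptions.

# Sketch: choose a per-server power cap that maximizes the joint
# performance-power utility U_pp(R, P) = U(R) - eps * P.
# The (power cap -> response time) mapping is an illustrative assumption.

def U(response_time, target=1.0):
    """Utility based on response time only: highest when R is below the target."""
    return 10.0 - 5.0 * max(0.0, response_time - target)

def joint_utility(response_time, power, eps=0.02):
    return U(response_time) - eps * power

# Candidate caps (watts) and the response times they would yield.
candidates = {120: 2.0, 160: 1.3, 200: 1.0, 240: 0.95}

best_cap = max(candidates, key=lambda p: joint_utility(candidates[p], p))
print(best_cap)    # the cap with the best performance-power trade-off (200 here)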
5.7. RESOURCE BUNDLING: Combinatorial Auctions For Cloud Resources
Resources in a cloud are allocated in bundles, allowing users to get maximum benefit from a specific
combination of resources. Indeed, along with CPU cycles, an application needs specific amounts of
main memory, disk space, network bandwidth, and so on. Resource bundling complicates
traditional resource allocation models and has generated interest in economic models and, in
particular, auction algorithms. In the context of cloud computing, an auction is the allocation of
resources to the highest bidder.
Combinatorial Auctions:
Auctions in which participants can bid on combinations of items, or packages, are called
combinatorial auctions. Such auctions provide a relatively simple, scalable, and tractable solution
to cloud resource allocation.
The final auction prices for individual resources are given by the vector p = (p1, p2, . . . , pR),
and the amounts of resources allocated to user u are xu = (xu1, xu2, . . . , xuR). Thus, the
expression (xu)^T · p represents the total price paid by user u for the bundle of resources if the bid
is successful. The scalar min over q ∈ Qu of (q^T · p) is the final price established through the
bidding process.
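The total price (xu)^T · p for a bundle is simply a dot product of allocated amounts and final unit prices, as the small sketch below shows; the resource amounts and prices are made-up numbers.

# Total price paid by user u for a bundle of R resources:
# (x_u)^T . p, the dot product of allocated amounts and final unit prices.

p  = [0.05, 0.002, 0.01]      # unit prices: CPU-hour, MB of RAM, GB of storage
xu = [8,    2048,  50]        # amounts allocated to user u

total_price = sum(x * price for x, price in zip(xu, p))
print(total_price)            # -> 4.996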
A pricing and allocation algorithm partitions the set of users into two disjoint sets, winners
and losers, denoted as W and L, respectively. The algorithm should:
FIGURE: Best-effort policies do not impose requirements regarding either the amount of
resources allocated to an application or the timing when an application is scheduled.
Soft-requirements allocation policies require statistically guaranteed amounts and timing
constraints; hard-requirements allocation policies demand strict timing and precise
amounts of resources.
5.9. FAIR QUEUING
When the load exceeds its capacity, a switch starts dropping packets because it has limited input
buffers for the switching fabric and for the outgoing links, as well as limited CPU cycles. A switch
must handle multiple flows and pairs of source-destination endpoints of the traffic.
To address this problem, a fair queuing algorithm has been proposed that requires separate queues,
one per flow, to be maintained by a switch and the queues to be serviced in a round-robin manner.
This algorithm guarantees the fairness of buffer space management, but it does not guarantee
fairness of bandwidth allocation. Indeed, a flow transporting large packets will benefit from a larger
bandwidth.
The fair queuing (FQ) algorithm proposes a solution to this problem. First, it introduces a bit-
by-bit round-robin (BR) strategy; as the name implies, in this rather impractical scheme a single
bit from each queue is transmitted and the queues are visited in a round-robin fashion. Let R(t)
be the number of rounds of the BR algorithm up to time t and Nactive(t) be the number of active
flows through the switch. Call ta^i the time when packet i of flow a, of size Pa^i bits, arrives, and
call Sa^i and Fa^i the values of R(t) when the first and the last bit, respectively, of packet i of flow
a are transmitted. Then
Fa^i = Sa^i + Pa^i and Sa^i = max [ Fa^(i−1), R(ta^i) ].
The quantities R(t), Nactive(t), Sa^i and Fa^i depend only on the arrival times of the packets, ta^i,
and not on their transmission times, provided that a flow a is considered active as long as it still
has bits to transmit in the BR emulation.
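A sketch of the resulting bookkeeping: compute a start and a finish number for each arriving packet and always transmit the packet with the smallest finish number. Packet sizes and the R(t) values at arrival are made-up.

# Sketch of fair-queuing bookkeeping: compute start (S) and finish (F)
# numbers per packet and always transmit the packet with the smallest F.
# R_at_arrival plays the role of R(t) at the packet's arrival time; the
# packet sizes and round numbers are illustrative.

last_finish = {}                      # F of the previous packet of each flow

def finish_number(flow, size_bits, R_at_arrival):
    start = max(last_finish.get(flow, 0), R_at_arrival)   # S = max(F_prev, R(t))
    finish = start + size_bits                            # F = S + P
    last_finish[flow] = finish
    return finish

queue = [("a", 500, 0), ("b", 1500, 0), ("a", 500, 100)]
tagged = sorted(((finish_number(f, p, r), f) for f, p, r in queue))
print(tagged)     # packets are served in increasing order of finish number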
A hierarchical CPU scheduler for multimedia operating systems has also been proposed. The basic
idea of the start-time fair queuing (SFQ) algorithm is to organize the consumers of the CPU
bandwidth in a tree structure; the root node is the processor and the leaves of this tree are the
threads of each application. A scheduler acts at each level of the hierarchy. The fraction of the
processor bandwidth, B, allocated to the intermediate node i is proportional to its weight:
Bi = B × wi / Σj wj, where wi is the weight of node i and the sum runs over all the nodes sharing
the bandwidth with node i.
5.11. BORROWED VIRTUAL TIME
5.12. CLOUD SCHEDULING SUBJECT TO DEADLINES
• Hard deadlines → if the task is not completed by the deadline, other tasks which depend
on it may be affected and there are penalties; a hard deadline is strict and expressed
precisely as milliseconds, or possibly seconds.
• Soft deadlines→ more of a guideline and, in general, there are no penalties; soft
deadlines can be missed by fractions of the units used to express them. (cloud schedules
are usually in this category)
System Model:
• First in, first out (FIFO) → The tasks are scheduled for execution in the order of their
arrival.
• Earliest deadline first (EDF) → The task with the earliest deadline is scheduled first.
• Maximum workload derivative first (MWF) → The tasks are scheduled in the order of their
derivatives, the one with the highest derivative first. The number n of nodes assigned to
the application is kept to a minimum.
Workload Partitioning Rules:
• Optimal Partitioning Rule (OPR)→ the workload is partitioned to ensure the earliest
possible completion time and all tasks are required to complete at the same time.
• The head node distributes sequentially the data to individual worker nodes.
• Worker nodes start processing the data as soon as the transfer is complete.
Figure: The timing diagram for the Optimal Partitioning Rule; the algorithm requires
worker nodes to complete execution at the same time. The head node, S0, distributes sequentially
the data to individual worker nodes.
where:
Δ → the time a worker node Si needs to process a unit of data;
S0 → the head node;
Si → the worker nodes;
Γ → the time to transfer the data.
• Equal Partitioning Rule (EPR) → assigns an equal workload to individual worker nodes.
• The head node distributes sequentially the data to individual worker nodes.
• Worker nodes start processing the data as soon as the transfer is complete.
• The workload is partitioned in equal segments.
The timing diagram for the Equal Partitioning Rule; the algorithm assigns an equal
workload to individual worker nodes.
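A small sketch of the EPR timing model: the head node sends equal segments one after another, and each worker starts processing as soon as its own transfer completes. The values of Γ and Δ are made-up.

# Sketch of the Equal Partitioning Rule (EPR) timing model:
# the head node S0 sends equal segments one after another; worker Si starts
# processing only after its own transfer completes.  gamma is the transfer
# time per unit of data and delta the processing time per unit (made-up values).

def epr_completion_times(total_data, n_workers, gamma, delta):
    segment = total_data / n_workers
    finish_times = []
    transfer_done = 0.0
    for i in range(n_workers):
        transfer_done += gamma * segment          # sequential distribution by S0
        finish_times.append(transfer_done + delta * segment)
    return finish_times

print(epr_completion_times(total_data=100, n_workers=4, gamma=0.1, delta=1.0))
# workers finish at different times; the last one determines the makespan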
5.13. SCHEDULING MAPREDUCE APPLICATIONS SUBJECT TO DEADLINES
This section considers scheduling MapReduce applications on the cloud subject to deadlines. Several options for scheduling Apache
Hadoop, an open-source implementation of the MapReduce algorithm, are:
• The default FIFO schedule.
• The Fair Scheduler.
• The Capacity Scheduler.
• The Dynamic Proportional Scheduler.
The following table summarizes the notation used for the analysis of Hadoop; the term slots is
equivalent to nodes and means the number of instances.
We make two assumptions for our initial derivation:
• The system is homogeneous; this means that ρm and ρr , the cost of processing a unit
data by the map and the reduce task, respectively, are the same for all servers.
• Load Equipartition
CLOUD COMPUTING
UNIT-5
SYLLABUS: Storage Systems: Evolution of storage technology, storage models, file systems
and databases, distributed file systems, general parallel file systems, Google File System,
Apache Hadoop, BigTable, Megastore, Amazon Simple Storage Service (S3).
6.1. STORAGE SYSTEMS
Storage and processing on the cloud are intimately tied to one another.
• Most cloud applications process very large amounts of data. Effective data
replication and storage management strategies are critical to the computations
performed on the cloud.
• Strategies to reduce the access time and to support real-time multimedia access
are necessary to satisfy the requirements of content delivery.
• An ever-increasing number of cloud-based services collect detailed data about
their services and information about the users of these services. The service
providers use the clouds to analyze the data.
• In 2013, humongous amounts of data were forecast:
▪ Internet video will generate over 18 EB/month.
▪ Global mobile data traffic will reach 2 EB/month.
▪ EB → exabyte
A new concept, “big data,” reflects the fact that many applications use data sets so large
that they cannot be stored and processed using local resources.
The consensus is that “big data” growth can be viewed as a three-dimensional
phenomenon; it implies an increased volume of data, requires an increased processing speed
to process more data and produce more results, and at the same time it involves a diversity
of data sources and data types.
A storage model describes the layout of a data structure in physical storage; a data
model captures the most important logical aspects of a data structure in a database. The
physical storage can be a local disk, a removable media, or storage accessible via a network.
Two abstract models of storage are commonly used: cell storage and journal storage.
Cell storage assumes that the storage consists of cells of the same size and that each object
fits exactly in one cell. This model reflects the physical organization of several storage media;
the primary memory of a computer is organized as an array of memory cells, and a secondary
storage device (e.g., a disk) is organized in sectors or blocks read and written as a unit.
Read/write coherence and before-or-after atomicity are two highly desirable properties of any
storage model, and in particular of cell storage (see Figure).
Journal storage is a fairly elaborate organization for storing composite objects such as
records consisting of multiple fields. Journal storage consists of a manager and cell storage,
where the entire history of a variable is maintained, rather than just the current value.
The user does not have direct access to the cell storage; instead the user can request the
journal manager to (i) start a new action; (ii) read the value of a cell; (iii) write the value of
a cell; (iv) commit an action; or (v) abort an action. The journal manager translates user
requests to commands sent to the cell storage: (i) read a cell; (ii) write a cell; (iii) allocate
a cell; or (iv) deallocate a cell.
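A very small sketch of a journal manager sitting on top of cell storage and supporting the operations listed above; the in-memory log and dictionary merely stand in for nonvolatile journal storage and cell storage.

# Minimal sketch of journal storage: the user talks to the journal manager,
# which appends every update to a log and installs values in cell storage
# only on commit.  The in-memory structures stand in for nonvolatile media.

class JournalManager:
    def __init__(self):
        self.cells = {}          # cell storage: current committed values
        self.log = []            # entire history of all actions
        self.pending = {}        # uncommitted writes, keyed by action id
        self.next_action = 0

    def begin(self):
        aid = self.next_action
        self.next_action += 1
        self.pending[aid] = {}
        self.log.append(("BEGIN", aid))
        return aid

    def read(self, aid, cell):
        # Reads see the action's own uncommitted writes first.
        return self.pending[aid].get(cell, self.cells.get(cell))

    def write(self, aid, cell, value):
        self.log.append(("WRITE", aid, cell, value))
        self.pending[aid][cell] = value

    def commit(self, aid):
        self.log.append(("COMMIT", aid))
        self.cells.update(self.pending.pop(aid))   # install values in cell storage

    def abort(self, aid):
        self.log.append(("ABORT", aid))
        self.pending.pop(aid)                      # cell storage is untouched

jm = JournalManager()
a = jm.begin()
jm.write(a, "A", 42)
jm.commit(a)                  # all-or-nothing: A becomes visible only now
print(jm.cells, len(jm.log))  # {'A': 42} 3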
FIGURE: A log contains the entire history of all variables. The log is stored on nonvolatile
media of journal storage. If the system fails after the new value of a variable is stored in
the log but before the value is stored in cell memory, then the value can be recovered from
the log. If the system fails while writing the log, the cell memory is not updated. This
guarantees that all actions are all-or-nothing. Two variables, A and B, in the log and cell
storage are shown. A new value of A is written first to the log and then installed on cell
memory at the unique address assigned to A.
In the context of storage systems, a log contains a history of all variables in cell storage.
The information about the updates of each data item forms a record appended at the end of
the log. A log provides authoritative information about the outcome of an action involving
cell storage; the cell storage can be reconstructed using the log, which can be easily accessed
– we only need a pointer to the last record.
Parallel file systems are scalable, are capable of distributing files across a large number of
nodes, and provide a global naming space. In a parallel data system, several I/O nodes serve
data to all computational nodes and include a metadata server that contains information
about the data stored in the I/O nodes. The interconnection network of a parallel file system
could be a SAN.
Most cloud applications do not interact directly with file systems but rather through an
application layer that manages a database. A database is a collection of logically related
records. The software that controls the access to the database is called a database
management system (DBMS). The main functions of a DBMS are to enforce data integrity,
manage data access and concurrency control, and support recovery after a failure.
A DBMS supports a query language, a dedicated programming language used to develop
database applications. Several database models, including the navigational model of the
1960s, the relational model of the 1970s, the object-oriented model of the 1980s, and the
NoSQL model of the first decade of the 2000s, reflect the limitations of the hardware available
at the time and the requirements of the most popular applications of each period.
Most cloud applications are data intensive and test the limitations of the existing
infrastructure. For example, they demand DBMSs capable of supporting rapid application
development and short time to market. At the same time, cloud applications require low
latency, scalability, and high availability and demand a consistent view of the data.
In many cloud data stores, transactional support is limited to a single data item. The NoSQL model
is useful when the structure of the data does not require a relational model and the amount of data
is very large. Several types of NoSQL
database have emerged in the last few years. Based on the way the NoSQL databases store
data, we recognize several types, such as key-value stores, BigTable implementations,
document store databases, and graph databases.
Replication, used to ensure fault tolerance of large-scale systems built with commodity
components, requires mechanisms to guarantee that all replicas are consistent with one
another. This is another example of increased complexity of modern computing and
communication systems due to physical characteristics of components, a topic discussed in
Chapter 10. Section 8.7 contains an in-depth analysis of a service implementing a consensus
algorithm to guarantee that replicated objects are consistent.
Network File System (NFS). NFS was the first widely used distributed file system; the
development of this application based on the client-server model was motivated by the need
to share a file system among a number of clients interconnected by a local area network.
A majority of workstations were running under Unix; thus, many design decisions for the
NFS were influenced by the design philosophy of the Unix File System (UFS). It is not
surprising that the NFS designers aimed to:
• Provide the same semantics as a local UFS to ensure compatibility with existing
applications.
• Facilitate easy integration into existing UFS.
• Ensure that the system would be widely used and thus support clients running on
different operating systems.
• Accept a modest performance degradation due to remote access over a network with a
bandwidth of several Mbps.
Before we examine NFS in more detail, we have to analyze three important characteristics
of the Unix File System that enabled the extension from local to remote file management:
• The layered design provides the necessary flexibility for the file system; layering allows
separation of concerns and minimization of the interaction among the modules
necessary to implement the system. The addition of the vnode layer allowed the Unix
File System to treat local and remote file access uniformly.
• The hierarchical design supports scalability of the file system; indeed, it allows grouping
of files into special files called directories and supports multiple levels of directories and
collections of directories and files, the so-called file systems. The hierarchical file
structure is reflected by the file-naming convention.
• The metadata supports a systematic rather than an ad hoc design philosophy of the file
system. The so called inodes contain information about individual files and directories.
The inodes are kept on persistent media, together with the data. Metadata includes the
file owner, the access rights, the creation time or the time of the last modification of
the file, the file size, and information about the structure of the file and the persistent
storage device cells where data is stored. Metadata also supports device independence,
a very important objective due to the very rapid pace of storage technology
development.
The logical organization of a file reflects the data model – the view of the data from the
perspective of the application. The physical organization reflects the storage model and
describes the manner in which the file is stored on a given storage medium. The layered
design allows UFS to separate concerns for the physical file structure from the logical one.
Recall that a file is a linear array of cells stored on a persistent storage device. The file
pointer identifies a cell used as a starting point for a read or write operation. This linear array
is viewed by an application as a collection of logical records; the file is stored on a physical
device as a set of physical records, or blocks, of a size dictated by the physical media.
The lower three layers of the UFS hierarchy – the block, the file, and the inode layer –
reflect the physical organization. The block layer allows the system to locate individual blocks
on the physical device; the file layer reflects the organization of blocks into files; and the
inode layer provides the metadata for the objects (files and directories). The upper three
layers – the path name, the absolute path name, and the symbolic path name layer – reflect
the logical organization. The file-name layer mediates between the machine-oriented and the
user-oriented views of the file system (see Figure).
FIGURE: The layered design of the Unix File System separates the physical file structure
from the logical one.
Several control structures maintained by the kernel of the operating system support
file handling by a running process. These structures are maintained in the user area of the
process address space and can only be accessed in kernel mode. To access a file, a process
must first establish a connection with the file system by opening the file. At that time a new
entry is added to the file description table, and the meta-information is brought into another
control structure, the open file table.
A path specifies the location of a file or directory in a file system; a relative path
specifies this location relative to the current/working directory of the process, whereas a full
path, also called an absolute path, specifies the location of the file independently of the
current directory, typically relative to the root directory. A local file is uniquely identified by
a file descriptor (fd), generally an index in the open file table.
The Network File System is based on the client-server paradigm. The client runs on
the local host while the server is at the site of the remote file system, and they interact by
means of remote procedure calls (RPCs) (see Figure 6.4). The API of the local file
system distinguishes file operations on a local file from the ones on a remote file and, in the
latter case, invokes the RPC client. Figure 6.5 shows the API for a Unix File System, with the
calls made by the RPC client in response to API calls issued by a user program for a remote
file system, as well as some of the actions carried out by the NFS server in response to an
RPC call. NFS uses a vnode layer to distinguish between operations on local and remote files,
as shown in Figure 6.4.
FIGURE 6.4: The NFS client-server interaction. The vnode layer implements file operation
in a uniform manner, regardless of whether the file is local or remote. An operation targeting
a local file is directed to the local file system, whereas one for a remote file involves NFS. An
NFS client packages the relevant information about the target and the NFS server passes it
to the vnode layer on the remote host, which, in turn, directs it to the remote file system.
A remote file is uniquely identified by a file handle (fh) rather than a file descriptor.
The file handle is a 32-byte internal name, a combination of the file system identification, an
inode number, and a generation number. The file handle allows the system to locate the
remote file system and the file on that system; the generation number allows the system to
reuse the inode numbers and ensures correct semantics when multiple clients operate on the
same remote file.
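The composition of a file handle can be illustrated by packing its three components into a fixed-size byte string; the field widths chosen here are an assumption for illustration and differ from real NFS implementations.

import struct

# Illustrative file handle: file-system id, inode number, generation number,
# padded to a fixed 32-byte internal name.  Field widths are assumptions.

def make_file_handle(fsid: int, inode: int, generation: int) -> bytes:
    return struct.pack(">QQQ", fsid, inode, generation).ljust(32, b"\x00")

fh = make_file_handle(fsid=7, inode=123456, generation=3)
print(len(fh))                                  # 32
print(struct.unpack(">QQQ", fh[:24]))           # (7, 123456, 3)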
Andrew File System (AFS). AFS is a distributed file system developed in the late 1980s at
Carnegie Mellon University (CMU) in collaboration with IBM. The designers of the system
envisioned a very large number of workstations interconnected with a relatively small
number of servers; it was anticipated that each individual at CMU would have an Andrew
workstation, so the system would connect up to 10,000 workstations. The set of trusted
servers in AFS forms a structure called Vice. The OS on a workstation, 4.2 BSD Unix,
intercepts file system calls and forwards them to a user-level process called Venus, which
caches files from Vice and stores modified copies of files back on the servers they came from.
Reading and writing from/to a file are performed directly on the cached copy and bypass
Venus; only when a file is opened or closed does Venus communicate with Vice.
The emphasis of the AFS design was on performance, security, and simple
management of the file system. To ensure scalability and to reduce response time, the local
disks of the workstations are used as persistent cache. The master copy of a file residing on
one of the servers is updated only when the file is modified. This strategy reduces the load
placed on the servers and contributes to better system performance.
Another major objective of the AFS design was improved security. The
communications between clients and servers are encrypted, and all file operations require
secure network connections. When a user signs into a workstation, the password is used to
obtain security tokens from an authentication server. These tokens are then used every time
a file operation requires a secure network connection.
The AFS uses access control lists (ACLs) to allow controlled sharing of the data. An ACL
specifies the access rights of an individual user or a group of users. A set of tools supports ACL
management. Another facet of the effort to reduce user involvement in file management is
location transparency. The files can be accessed from any location and can be moved
automatically or at the request of system administrators without user involvement and/or
inconvenience. The relatively small number of servers drastically reduces the efforts related
to system administration because operations, such as backups, affect only the servers,
whereas workstations can be added, removed, or moved from one location to another
without administrative intervention.
FIGURE 6.5: The API of the Unix File System and the corresponding RPCs issued by an NFS
client to the NFS server. fd stands for file descriptor, fh for file handle, fname for filename,
dname for directory name, dfh for the directory where the file handle can be found, count
for the number of bytes to be transferred, buf for the buffer to transfer the data to/from,
device for the device on which the file system is located, and fsname for the file system name.
Sprite Network File System (SFS). SFS is a component of the Sprite network operating
system. SFS supports non-write-through caching of files on the client as well as the server
systems. Processes running on all workstations enjoy the same semantics for file access as
they would if they were run on a single system. This is possible due to a cache consistency
mechanism that flushes portions of the cache and disables caching for shared files opened
for read/write operations.
Caching not only hides the network latency, it also reduces server utilization and
obviously improves performance by reducing response time. A file access request made by
a client process could be satisfied at different levels. First, the request is directed to the local
cache; if it’s not satisfied there, it is passed to the local file system of the client. If it cannot
be satisfied locally then the request is sent to the remote server. If the request cannot be
satisfied by the remote server’s cache, it is sent to the file system running on the server.
The design decisions for the Sprite system were influenced by the resources available
at a time when a typical workstation had a 1–2 MIPS processor and 4–14 Mbytes of physical
memory. The main-memory caches allowed diskless workstations to be integrated into the
system and enabled the development of unique caching mechanisms and policies for both
clients and servers. The results of a file-intensive benchmark report show that SFS was 30–
35% faster than either NFS or AFS.
The file cache is organized as a collection of 4 KB blocks; a cache block has a virtual
address consisting of a unique file identifier supplied by the server and a block number in
the file. Virtual addressing allows the clients to create new blocks without the need to
communicate with the server. File servers map virtual to physical disk addresses. Note that
the page size of the virtual memory in Sprite is also 4K.
The size of the cache available to an SFS client or a server system changes
dynamically as a function of the needs. This is possible because the Sprite operating system
ensures optimal sharing of the physical memory between file caching by SFS and virtual
memory management.
FIGURE 6.6: A GPFS configuration. The disks are interconnected by a SAN and compute
servers are distributed in four LANs, LAN1–LAN4. The I/O nodes/servers are connected to
LAN1.
GPFS reliability:
o Byte-range locking (tokens) → when a node first writes to a file, it receives a token
covering the entire file; this node is allowed to carry out all reads and
writes to the file without any need for permission until a second node
attempts to write to the same file; then, the range of the token given to
the first node is restricted.
o Data-shipping → an alternative to byte-range locking, allows fine-grain
data sharing. In this mode the file blocks are controlled by the I/O nodes
in a round-robin manner. A node forwards a read or write operation to the
node controlling the target block, the only one allowed to access the file.
The architecture of a GFS cluster is illustrated in Figure 8.7. A master controls a large
number of chunk servers; it maintains metadata such as filenames, access control
information, the location of all the replicas for every chunk of each file, and the state of
individual chunk servers. Some of the metadata is stored in persistent storage (e.g., the
operation log records the file namespace as well as the file-to-chunk mapping).
The locations of the chunks are stored only in the control structure of the master’s
memory and are updated at system startup or when a new chunk server joins the cluster.
This strategy allows the master to have up-to-date information about the location of the
chunks.
System reliability is a major concern, and the operation log maintains a historical
record of metadata changes, enabling the master to recover in case of a failure. As a result,
such changes are atomic and are not made visible to the clients until they have been recorded
on multiple replicas on persistent storage. To recover from a failure, the master replays the
operation log. To minimize the recovery time, the master periodically checkpoints its state
and at recovery time replays only the log records after the last checkpoint.
Each chunk server is a commodity Linux system; it receives instructions from the
master and responds with status information. To access a file, an application sends to the
master the filename and the chunk index, the offset in the file for the read or write operation;
the master responds with the chunk handle and the location of the chunk. Then the
application communicates directly with the chunk server to carry out the desired file
operation.
The consistency model is very effective and scalable. Operations, such as file creation, are
atomic and are handled by the master. To ensure scalability, the master has minimal
involvement in file mutations and operations such as write or append that occur frequently.
In such cases the master grants a lease for a particular chunk to one of the chunk servers,
called the primary; then, the primary creates a serial order for the updates of that chunk.
When data for a write straddles the chunk boundary, two operations are carried out, one
for each chunk. The steps for a write request illustrate a process that buffers data and
decouples the control flow from the data flow for efficiency:
1. The client contacts the master, which assigns a lease to one of the chunk servers for
a particular chunk if no lease for that chunk exists; then the master replies with the
ID of the primary as well as secondary chunk servers holding replicas of the chunk.
The client caches this information.
2. The client sends the data to all chunk servers holding replicas of the chunk; each
one of the chunk servers stores the data in an internal LRU buffer and then sends an
acknowledgment to the client.
3. The client sends a write request to the primary once it has received the
acknowledgments from all chunk servers holding replicas of the chunk. The primary
identifies mutations by consecutive sequence numbers.
4. The primary sends the write requests to all secondaries.
5. Each secondary applies the mutations in the order of the sequence numbers and
then sends an acknowledgment to the primary.
6. Finally, after receiving the acknowledgments from all secondaries, the primary
informs the client.
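The six steps above can be summarized in a small, self-contained Python sketch. The class and method names below are hypothetical simplifications, not the real GFS interfaces; the point is only to show the lease, the data push to all replicas, and the ordering of mutations by the primary.

class ChunkServer:
    def __init__(self, name):
        self.name = name
        self.buffer = {}   # step 2: data pushed by the client, buffered (LRU in GFS)
        self.chunks = {}   # mutations applied in sequence-number order

    def push_data(self, handle, data):
        self.buffer[handle] = data
        return "ack"

    def apply(self, handle, seq):
        self.chunks.setdefault(handle, []).append((seq, self.buffer.pop(handle)))
        return "ack"

class Master:
    """Grants a lease for a chunk: one primary plus the remaining secondaries."""
    def __init__(self, servers):
        self.servers = servers
        self.leases = {}

    def grant_lease(self, handle):
        if handle not in self.leases:                        # step 1
            self.leases[handle] = (self.servers[0], self.servers[1:])
        return self.leases[handle]

lease_cache = {}   # the client caches lease information (step 1)

def client_write(master, handle, data):
    if handle not in lease_cache:
        lease_cache[handle] = master.grant_lease(handle)     # 1. contact the master
    primary, secondaries = lease_cache[handle]
    # 2. Push the data to all replicas and wait for every acknowledgment.
    acks = [s.push_data(handle, data) for s in [primary] + list(secondaries)]
    assert all(a == "ack" for a in acks)
    # 3. Send the write request to the primary, which serializes the mutation.
    seq = len(primary.chunks.get(handle, [])) + 1
    primary.apply(handle, seq)
    # 4-5. The primary forwards the request; secondaries apply it in order.
    for s in secondaries:
        s.apply(handle, seq)
    # 6. The primary informs the client that the write succeeded.
    return seq

servers = [ChunkServer(f"cs{i}") for i in range(3)]
master = Master(servers)
client_write(master, "chunk-42", b"hello")   # -> sequence number 1

The key design choice visible here is that the bulky data transfer (step 2) completes before the lightweight control message (step 3), which is why the data flow and the control flow can use different network paths.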
FIGURE 8.7: The architecture of a GFS cluster. The master maintains state information about all system components;
it controls a number of chunk servers. A chunk server runs under Linux; an application uses metadata provided by the
master to communicate directly with the chunk servers. The data flow is decoupled from the control flow; data paths
are shown with thick lines and control paths with thin lines. Arrows show the flow of control among the application,
the master, and the chunk servers.
FIGURE 6.8: A Hadoop cluster using HDFS. The cluster includes a master and four slave
nodes. Each node runs a MapReduce engine and a data node of the distributed file system,
typically HDFS. The job tracker of the master's engine communicates with the task trackers
on all the nodes and with the name node of HDFS. The name node of HDFS shares information
about data placement with the job tracker to minimize communication between the nodes on
which data is located and the ones where it is needed.
• Chubby Lock Service is designed for use within a loosely coupled distributed system
consisting of moderately large numbers of small machines connected by a high-
speed network.
• Chubby Lock Service allows a group of servers to elect a master, lets the master
discover the servers it controls, and permits clients to find the master.
• File locking is a mechanism that allows only one process to access a file at any
given time. By using a file-locking mechanism, many processes can read and write a
single file safely.
We will take the following example to understand why file locking is required.
1. Process “A” opens and reads a file which contains account related information.
2. Process “B” also opens the file and reads the information in it.
3. Now Process “A” changes the account balance of a record in its copy, and writes it
back to the file.
4. Process “B”, which has no way of knowing that the file has changed since its last
read, still holds the stale original value. It then changes the account balance of the
same record and writes it back into the file.
5. Now the file contains only the changes made by process “B”; the update made by
process “A” is lost.
Locks
Advisory locks → Advisory locking requires cooperation from the participating processes
(a Python sketch of advisory locking appears after these definitions).
Mandatory locks → Mandatory locking does not require cooperation from the participating
processes. Mandatory locking causes the kernel to check every open, read, and write to
verify that the calling process is not violating a lock on the given file.
Fine-grained locks → locks that can be held for only a very short time.
Coarse-grained locks → locks held for a longer time.
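As a concrete illustration of advisory locking, the sketch below uses the POSIX flock call through Python's fcntl module (Linux/Unix only) to serialize the read-modify-write cycle from the numbered example above; this is our own example, not code from Chubby.

# Sketch of advisory locking with POSIX flock: both processes must cooperate
# by taking the lock, which prevents the lost-update scenario described above.
import fcntl

def update_balance(path, delta):
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # block until the exclusive lock is granted
        try:
            balance = int(f.read().strip() or "0")
            f.seek(0)
            f.write(str(balance + delta))
            f.truncate()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)  # release; other cooperating processes proceed

# If processes A and B both call update_balance("account.txt", ...), their
# read-modify-write cycles are serialized instead of overwriting each other.
# A process that skips flock() is not stopped -- the lock is advisory, like
# the stop-sign analogy discussed below.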
The question of how to most effectively support a locking and consensus component
of a large-scale distributed system demands several design decisions. A first design
decision is whether the locks should be mandatory or advisory. Mandatory locks have the
obvious advantage of enforcing access control; a traffic analogy is that a mandatory lock is
like a drawbridge. Once it is up, all traffic is forced to stop.
An advisory lock is like a stop sign; those who obey the traffic laws will stop, but
some might not. The disadvantages of mandatory locks are added overhead and less
flexibility. Once a data item is locked, even a high-priority task related to maintenance or
recovery cannot access the data unless it forces the application holding the lock to terminate.
This is a very significant problem in large-scale systems where partial system failures are
likely.
In the early 2000s, when Google started to develop a lock service called Chubby, it was
decided to use advisory and coarse-grained locks. The service is used by several Google
systems, including the GFS and BigTable.
FIGURE 6.9: A Chubby cell consisting of five replicas, one of which is elected as a
master; n clients use RPCs to communicate with the master.
The basic organization of the system is shown in Figure 6.9. A Chubby cell typically serves
one data center. The cell server includes several replicas, the standard number of which is
five. To reduce the probability of correlated failures, the servers hosting replicas are
distributed across the campus of a data center.
The replicas use a distributed consensus protocol to elect a new master when the current
one fails. The master is elected by a majority, as required by the asynchronous Paxos
algorithm, accompanied by the commitment that a new master will not be elected for a
period called a master lease. A session is a connection between a client and the cell server
maintained over a period of time; the data cached by the client, the locks acquired, and the
handles of all files locked by the client are valid for only the duration of the session.
Clients use RPCs to request services from the master. When it receives a write request,
the master propagates the request to all replicas and waits for a reply from a majority of
replicas before responding. When it receives a read request the master responds without
consulting the replicas. The client interface of the system is similar to, yet simpler than, the
one supported by the Unix File System. In addition, it includes notification of events related
to file or system status. A client can subscribe to events such as file content modification,
change or addition of a child node, master failure, lock acquired, conflicting lock requests,
and invalid file handle.
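The read/write behavior described above can be sketched as follows. The names are hypothetical, and a real Chubby cell reaches agreement through Paxos rather than by simply counting acknowledgments, but the sketch shows why writes wait for a majority while reads are served by the master alone.

class Replica:
    def __init__(self):
        self.store = {}
    def apply(self, key, value):
        self.store[key] = value
        return True                        # acknowledgment

class ChubbyMaster:
    def __init__(self, replicas):
        self.replicas = replicas           # includes the master's own copy
        self.local = replicas[0]

    def write(self, key, value):
        acks = sum(1 for r in self.replicas if r.apply(key, value))
        # Respond only after a majority of replicas acknowledged the update.
        return acks > len(self.replicas) // 2

    def read(self, key):
        # Reads are answered by the master without consulting the replicas.
        return self.local.store.get(key)

cell = ChubbyMaster([Replica() for _ in range(5)])   # a typical cell has five replicas
cell.write("/ls/cell/lockA", "owner-1")
print(cell.read("/ls/cell/lockA"))                   # -> owner-1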
The files and directories of the Chubby service are organized in a tree structure and use a
naming scheme similar to that of Unix. Each file has a file handle similar to the file descriptor.
The master of a cell periodically writes a snapshot of its database to a GFS file server.
We now take a closer look at the actual implementation of the service. As pointed out
earlier, Chubby locks and Chubby files are stored in a database, and this database is
replicated. The architecture of these replicas shows that the stack consists of the Chubby
component, which implements the Chubby protocol for communication with the clients, and
the active components, which write log entries and files to the local storage of the replica
(see Figure 6.10).
Recall that an atomicity log for a transaction-processing system allows a crash recovery
procedure to undo all-or-nothing actions that did not complete or to finish all-or-nothing
actions that committed but did not record all of their effects. Each replica maintains its own
copy of the log; a new log entry is appended to the existing log and the Paxos algorithm is
executed repeatedly to ensure that all replicas have the same sequence of log entries.
FIGURE 6.10: Chubby replica architecture. The Chubby component implements the
communication protocol with the clients. The system includes a component to transfer files
to a fault-tolerant database and a fault-tolerant log component to write log entries. The
fault-tolerant log uses the Paxos protocol to achieve consensus. Each replica has its own
local file system; replicas communicate with one another using a dedicated interconnect
and communicate with clients through a client network.
The next element of the stack is responsible for the maintenance of a fault-tolerant
database – in other words, making sure that all local copies are consistent. The database
consists of the actual data, or the local snapshot in Chubby speak, and a replay log to allow
recovery in case of failure. The state of the system is also recorded in the database.
The Paxos algorithm is used to reach consensus on sets of values (e.g., the sequence of
entries in a replicated log). To ensure that the Paxos algorithm succeeds in spite of the
occasional failure of a replica, three phases of the algorithm are executed repeatedly:
(1) a replica is elected coordinator; (2) the coordinator selects a value and broadcasts it to
the replicas, which either accept or reject it; (3) once a majority of the replicas accept,
consensus is reached and the coordinator broadcasts a commit message.
Implementation of the Paxos algorithm is far from trivial. Although the algorithm can be
expressed in as few as ten lines of pseudocode, its actual implementation can run to several
thousand lines of C++ code. Moreover, practical use of the algorithm cannot ignore the wide
variety of failure modes, including algorithm errors and bugs in its implementation, and
testing a software system of a few thousand lines of code is challenging.
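For readers who want to see what the "few lines of pseudocode" look like, the following is a toy single-decree Paxos round in Python. It deliberately ignores retries, failures, persistence, and log replication, which is exactly where the thousands of lines of production code go; it is not Chubby's implementation.

class Acceptor:
    def __init__(self):
        self.promised = -1        # highest proposal number promised so far
        self.accepted = None      # (number, value) accepted so far, if any

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "reject"

def propose(acceptors, n, value):
    """One proposer trying to get `value` chosen with proposal number n."""
    # Phase 1: prepare -- gather promises from a majority.
    promises = [a.prepare(n) for a in acceptors]
    granted = [p for p in promises if p[0] == "promise"]
    if len(granted) <= len(acceptors) // 2:
        return None
    # If some acceptor already accepted a value, the proposer must adopt the
    # value with the highest proposal number instead of its own.
    prior = [p[1] for p in granted if p[1] is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2: accept -- the value is chosen once a majority accepts it.
    acks = [a.accept(n, value) for a in acceptors]
    if sum(1 for r in acks if r == "accepted") > len(acceptors) // 2:
        return value
    return None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, value="log-entry-17"))   # -> "log-entry-17"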
• OLTP (Online Transaction Processing) typically involves inserting, updating, and/or
deleting small amounts of data in a database.
• OLTP mainly deals with large numbers of transactions by a large number of users.
• Examples of OLTP transactions include:
• Online banking
• Purchasing a book online
• Booking an airline ticket
• Sending a text message
• Order entry
A major concern for the designers of OLTP systems is to reduce the response time.
The term memcaching refers to a general-purpose distributed memory system that caches
objects in main memory (RAM); the system is based on a very large hash table distributed
across many servers. The memcached system is based on a client-server architecture and
runs under several operating systems, including Linux, Unix, Mac OS X, and Windows. The
servers maintain a key-value associative array. The API allows the clients to add entries to
the array and to query it. A key can be up to 250 bytes long, and a value can be no larger
than 1 MB. The memcached system uses an LRU cache-replacement strategy.
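The idea of a hash table distributed across many servers with LRU replacement can be sketched in a few lines of Python. This is not the real memcached protocol or client API; it only illustrates how a client hashes a key to pick a server and how each server evicts the least recently used entry when it is full.

from collections import OrderedDict
from hashlib import md5

class CacheServer:
    def __init__(self, capacity=3):
        self.items = OrderedDict()
        self.capacity = capacity

    def set(self, key, value):
        self.items.pop(key, None)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)       # evict the least recently used entry

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)          # mark as recently used
            return self.items[key]
        return None                              # cache miss

class MemcacheClient:
    def __init__(self, servers):
        self.servers = servers

    def _pick(self, key):
        # Hash the key to select which server owns it.
        return self.servers[int(md5(key.encode()).hexdigest(), 16) % len(self.servers)]

    def set(self, key, value):
        self._pick(key).set(key, value)

    def get(self, key):
        return self._pick(key).get(key)

cache = MemcacheClient([CacheServer() for _ in range(4)])
cache.set("user:42", {"name": "Alice"})
print(cache.get("user:42"))    # -> {'name': 'Alice'} until evicted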
Scalability is the other major concern for cloud OLTP applications and implicitly for
datastores. There is a distinction between vertical scaling, where the data and the workload
are distributed to systems that share resources such as cores and processors, disks, and
possibly RAM, and horizontal scaling, where the systems do not share either primary or
secondary storage.
NOSQL DATABASES:
• NoSQL, originally referring to "non-SQL" or "non-relational", is a database that
provides a mechanism for the storage and retrieval of data.
• It is an alternative to traditional relational databases, in which data is placed in tables
and the data schema is carefully designed before the database is built. NoSQL
databases are especially useful for working with large sets of distributed data.
• The NoSQL model is useful when the structure of the data does not require a
relational model and the amount of data is very large.
• Does not support SQL as a query language.
• May not guarantee the ACID (Atomicity, Consistency, Isolation, Durability)
properties of traditional databases; it usually guarantees the eventual
consistency for transactions limited to a single data item.
6.10. BIGTABLE:
• Distributed storage system developed by Google to
o store massive amounts of data.
o scale up to thousands of storage servers.
• The system uses
o Google File System → to store user data and system information.
o Chubby distributed lock service → to guarantee atomic read and write
operations; the directories and the files in the namespace of Chubby are used
as locks.
• Data is kept in lexicographic order by row key, and the map is indexed by row key,
column key, and timestamp. Compression algorithms help achieve high capacity.
• Google Bigtable serves as the database for applications such as the Google App
Engine Datastore, Google Personalized Search, Google Earth and Google Analytics.
• BigTable data is partitioned into tablets, and the locations of the tablets are stored in
a hierarchy of metadata tablets. An application client searches through this hierarchy
to identify the location of its tablets and then caches the addresses for further use (a
short sketch of the data model and location caching follows).
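A toy Python sketch of this data model and of client-side location caching follows. The classes are illustrative only, not the BigTable API; a real client walks a multi-level hierarchy of metadata tablets, which is stubbed out here as a locate() callback.

# The data model: a sparse, sorted map from (row key, column key, timestamp)
# to an uninterpreted value.
from bisect import insort

class Tablet:
    """Holds a contiguous, sorted range of rows."""
    def __init__(self):
        self.cells = {}            # (row, column) -> sorted list of (timestamp, value)

    def put(self, row, column, timestamp, value):
        insort(self.cells.setdefault((row, column), []), (timestamp, value))

    def get(self, row, column):
        versions = self.cells.get((row, column), [])
        return versions[-1][1] if versions else None   # newest timestamp wins

class BigtableClient:
    def __init__(self, locate):
        self.locate = locate       # walks the metadata hierarchy (slow path)
        self.location_cache = {}   # cached tablet addresses (fast path)

    def tablet_for(self, row):
        if row not in self.location_cache:
            self.location_cache[row] = self.locate(row)
        return self.location_cache[row]

tablet = Tablet()
client = BigtableClient(locate=lambda row: tablet)     # single-tablet toy setup
client.tablet_for("com.example/index").put("com.example/index", "contents:", 2, "<html>v2")
client.tablet_for("com.example/index").put("com.example/index", "contents:", 1, "<html>v1")
print(client.tablet_for("com.example/index").get("com.example/index", "contents:"))  # -> <html>v2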
6.11 MEGASTORE:
• Scalable storage for online services. Widely used internally at Google, it handles
some 23 billion transactions daily, 3 billion write and 20 billion read transactions.
• The system, distributed over several data centers, has a very large capacity, 1 PB in
2011, and it is highly available.
• Each partition is replicated in data centers in different geographic areas.
• Megastore is a storage system developed to meet the storage requirements of
today's interactive online services. It is novel in that it blends the scalability of a
NoSQL datastore with the convenience of a traditional RDBMS.
The basic design philosophy of the system is to partition the data into entity groups and
replicate each partition independently in data centers located in different geographic areas.
The system supports full ACID semantics within each partition and provides limited
consistency guarantees across partitions (see Figure 6.12). Megastore supports only those
traditional database features that allow the system to scale well and that do not drastically
affect the response time.
Another distinctive feature of the system is the use of the Paxos consensus algorithm, to
replicate primary user data, metadata, and system configuration information across data
centers and for locking. The version of the Paxos algorithm used by Megastore does not
require a single master. Instead, any node can initiate read and write operations to a write-
ahead log replicated to a group of symmetric peers.
The entity groups are application-specific and store together logically related data. For
example, an email account could be an entity group for an email application. Data should be
carefully partitioned to avoid excessive communication between entity groups. Sometimes it
is desirable to form multiple entity groups, as in the case of blogs.
The middle ground between traditional and NoSQL databases taken by the Megastore
designers is also reflected in the data model. The data model is declared in a schema
consisting of a set of tables composed of entities, each entity being a collection of named and
typed properties. The unique primary key of an entity in a table is created as a composition
of entity properties. A Megastore table can be a root or a child table. Each child entity must
reference a special entity, called a root entity in its root table. An entity group consists of
the primary entity and all entities that reference it.
The system makes extensive use of BigTable. Entities from different Megastore tables can
be mapped to the same BigTable row without collisions. This is possible because the BigTable
column name is a concatenation of the Megastore table name and the name of a property.
A BigTable row for the root entity stores the transaction and all metadata for the entity
group. Megastore takes advantage of this feature to implement multi-version concurrency
control (MVCC); when a mutation of a transaction occurs, this mutation is recorded along
with its time stamp, rather than marking the old data as obsolete and adding the new version.
This strategy has several advantages: read and write operations can proceed concurrently,
and a read always returns the last fully updated version.
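The mapping and the MVCC behavior can be illustrated with a short Python sketch: the BigTable column name concatenates the Megastore table name and the property name, and each mutation is stored under its own timestamp rather than overwriting the previous version. The table and property names are invented for the example.

row = {}    # one BigTable row for the root entity of an entity group

def record_mutation(table, prop, timestamp, value):
    column = f"{table}.{prop}"                    # e.g. "User.email"
    row.setdefault(column, []).append((timestamp, value))

def read_latest(table, prop, committed_up_to):
    column = f"{table}.{prop}"
    versions = [v for v in row.get(column, []) if v[0] <= committed_up_to]
    return max(versions)[1] if versions else None # last fully committed version

record_mutation("User", "email", 10, "alice@old.example")
record_mutation("User", "email", 20, "alice@new.example")
print(read_latest("User", "email", committed_up_to=15))   # -> alice@old.example
print(read_latest("User", "email", committed_up_to=25))   # -> alice@new.example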
A write transaction involves the following steps: (1) Get the timestamp and the log position
of the last committed transaction. (2) Gather the write operations in a log entry. (3) Use the
consensus algorithm to append the log entry and then commit. (4) Update the BigTable
entries. (5) Clean up.
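A sketch of these five steps in Python, with the Paxos-based consensus append stubbed out as a plain list append (real Megastore runs Paxos across the replicas of the write-ahead log):

import time

write_ahead_log = []     # replicated log (toy: a local list)
bigtable_rows = {}       # the entity group's BigTable data

def run_write_transaction(mutations):
    # (1) Read the timestamp and log position of the last committed transaction.
    last_position = len(write_ahead_log)
    timestamp = time.time()
    # (2) Gather the write operations into a single log entry.
    entry = {"position": last_position, "timestamp": timestamp, "mutations": mutations}
    # (3) Use the consensus algorithm to append the entry, then commit.
    write_ahead_log.append(entry)            # stand-in for a Paxos round
    # (4) Apply the mutations to the BigTable entries.
    for (row_key, column, value) in mutations:
        bigtable_rows.setdefault(row_key, {})[column] = (timestamp, value)
    # (5) Clean up any transient state.
    return entry["position"]

run_write_transaction([("user:1", "Account.balance", 150)])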
FIGURE 6.12: Megastore organization. The data is partitioned into entity groups; full
ACID semantics within each partition and limited consistency guarantees across
partitions are supported. A partition is replicated across data centers in different
geographic areas.
Amazon S3 provides a simple web services interface that can be used to store and
retrieve any amount of data, at any time, from anywhere on the web. S3 provides an object-
based storage service: users can access their objects through the Simple Object Access
Protocol (SOAP) with either browsers or other client programs that support SOAP (a short
boto3 sketch appears after the feature list below). SQS is responsible for ensuring a reliable
message service between two processes, even if the receiver processes are not running.
The following figure shows the S3 execution environment.
o Low cost and Easy to Use − Using Amazon S3, the user can store a large amount
of data at very low charges.
o Secure − Amazon S3 supports data transfer over SSL and the data gets encrypted
automatically once it is uploaded. The user has complete control over their data by
configuring bucket policies using AWS IAM.
o Scalable − Using Amazon S3, there need not be any worry about storage concerns.
We can store as much data as we have and access it anytime.
o Higher performance − Amazon S3 is integrated with Amazon CloudFront, which
distributes content to end users with low latency and provides high data transfer
speeds without any minimum usage commitments.
o Integrated with AWS services − Amazon S3 integrates with AWS services including
Amazon CloudFront, Amazon CloudWatch, Amazon Kinesis, Amazon RDS, Amazon
Route 53, Amazon VPC, AWS Lambda, Amazon EBS, Amazon DynamoDB, etc.