CLOUD COMPUTING

UNIT – 1
SYLLABUS: Systems modelling, Clustering and virtualization: Scalable
Computing over the Internet, Technologies for Network based systems, System
models for Distributed and Cloud Computing, Software environments for
distributed systems and clouds, Performance, Security And Energy Efficiency.
1.1 Scalable Computing Over the Internet

Scalability: Scalability is the capability of a system, network, or process to handle a
growing amount of work, such as database storage or software usage. A scalable
system should be able to handle ever-increasing data and levels of computation
efficiently.
Parallel Computing: Execution of many processes is carried out simultaneously in
this case. Large problems can be divided into smaller ones, solved at the same time
and integrated later.
Distributed Computing: A distributed system is a model in which components
located on networked computers coordinate their actions by passing messages.
Distributed computing may refer to systems situated at different physical locations
or to different actions being performed on the same system.
Distributed Computing is centred on data and based on networks.
Data Center is a centralised repository and distribution of data and information
organised around a particular concept (ex: Telecommunications, Health data,
business data etc.). A typical data center may have a capacity in Petabytes.
Internet Computing: Data centers and supercomputer sites must be upgraded to
meet the demands of millions of users who utilize the Internet. High Performance
Computing (HPC), once the standard yardstick for measuring system performance,
is no longer sufficient on its own. High Throughput Computing (HTC) came into
existence with the emergence of computing clouds; here, the systems are both
parallel and distributed.
1.1.1 The Age of Internet Computing

1.1.1.1 Platform Evolution:

Figure 1.1 Evolutionary trend towards parallel, distributed and cloud computing

Computer technology has gone through five generations of development, each
spanning 10 to 20 years. By the start of the 1990s, the use of HPC and HTC systems
had sky-rocketed. These systems use clusters, grids, the Internet and clouds.

The general trend is to control shared web resources and massive data over the
Internet. In the above figure 1.1, we can observe the evolution of HPC and HTC
systems.
HPC comprises supercomputers, which are gradually being replaced by clusters of
cooperating systems that share data among themselves. A cluster is a collection of
homogeneous computers which are physically connected.
HTC shows the formation of peer-to-peer (P2P) networks for distributed file sharing
and apps. A P2P system is built over many client machines and is globally
distributed. This leads to formation of computational grids or data grids.
1.1.1.2. High Performance Computing (HPC): HPC stresses speed performance.
The speed of HPC systems has increased from Gflops to Pflops (FLOPS: floating-point
operations per second) these days, driven by requirements from fields such as
science, engineering, medicine and others. The systems that generally provide such
speed are supercomputers, mainframes and other large servers.
1.1.1.3. High Throughput Computing: Market-oriented computing is now going
through a strategic change from the HPC to the HTC paradigm. HTC concentrates
more on high-flux computing. The performance goal has shifted from the speed of
the device to the number of tasks completed per unit of time (throughput).
1.1.1.4. Three New Computing Paradigms: It can be seen from Figure 1.1 that
SOA (Service-Oriented Architecture) has made web services available for all kinds of
tasks. Internet clouds have become a major factor to consider for all types of
tasks. Three new paradigms have come into existence:
(a) Radio-Frequency Identification (RFID): This uses electro-magnetic fields to
automatically identify and track tags attached to objects. These tags contain
electronically stored information.
(b) Global Positioning System (GPS): It is a global navigation satellite system
that provides the geographical location and time information to a GPS receiver
[5].
(c) Internet of Things (IoT): It is the internetworking of different physical devices
(vehicles, buildings etc.) embedded with electronic devices (sensors), software,
and network connectivity. Data can be collected and exchanged through this
network (IoT).
1.1.1.5. Computing Paradigm Distinctions:
(a) Centralized Computing: All computer resources like processors, memory and
storage are centralized in one physical system. All of these are shared and
inter-connected and monitored by the OS.
(b) Parallel Computing: All processors are either tightly coupled with centralized
shared memory or loosely coupled with distributed memory (parallel processing).
Inter-processor communication is achieved through shared memory or by message
passing. This methodology is known as parallel computing.
(c) Distributed Computing: A distributed system consists of multiple
autonomous computers with each device having its own private memory. They
interconnect among themselves by the usage of a computer network. Here
also, information exchange is accomplished by message passing.
(d) Cloud Computing: An Internet cloud of resources can be either a centralized
or a distributed computing system. The cloud applies parallel or distributed
computing, or both. A cloud can be built using physical or virtual resources over
data centers. Cloud computing is also called utility, service, or concurrent computing.
1.1.1.6. Distributed System Families
In the future, both HPC and HTC will demand multicore processors that can handle
a large number of computing threads per core. Both concentrate on parallel and
distributed computing. The main work lies in the fields of throughput, efficiency,
scalability and reliability.
Main Objectives:
(a) Efficiency: Efficiency measures how well resources are utilized to meet speed,
programming, and throughput demands.

(b) Dependability: This measures the reliability from the chip to the system at
different levels. Main purpose here is to provide good QoS (Quality of Service).
(c) Adaptation in the Programming Model: This measures the ability to support
an unending number of job requests over massive data sets and virtualized cloud
resources under different models.
(d) Flexibility: It is the ability of distributed systems to run in good health in both
HPC (science/engineering) and HTC (business).
1.1.2 SCALABLE COMPUTING TRENDS AND NEW PARADIGMS
1.1.2.1. Degrees of Parallelism:
(a) Bit-level parallelism (BLP): Processor word sizes have grown from 8 bits to 16, 32, and 64 bits.
(b) Instruction-level parallelism (ILP): The processor executes multiple
instructions simultaneously. Ex: pipelining, superscalar computing, VLIW (very long
instruction word), and multithreading.
Pipelining: Data processing elements are connected in series where output of
one element is input to the next.
Multithreading: Multithreading is the ability of a CPU or a single core in
a multi-core processor to execute multiple processes or threads concurrently,
supported by the OS.
(c) Data-level Parallelism (DLP): Here, a single instruction operates on many data
items at once (single instruction, multiple data, SIMD). More hardware support is needed.
(d) Task-level Parallelism (TLP): It is a form of execution where different
threads (functions) are distributed across multiple processors in parallel
computing environments (a minimal sketch after this list contrasts DLP and TLP).
(e) Job-level Parallelism (JLP): Job level parallelism is the highest level of
parallelism where we concentrate on a lab or computer center to execute as
many jobs as possible in any given time period. To achieve this, we purchase
more systems so that more jobs are running at any one time, even though any
one user's job will not run faster.
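As a rough, hypothetical illustration of the contrast between data-level and task-level parallelism in items (c) and (d), the short Python sketch below applies one operation to many data items and then runs two different tasks concurrently; the function names and workload are invented for illustration only.

    # Minimal sketch contrasting data-level and task-level parallelism (hypothetical example).
    from concurrent.futures import ProcessPoolExecutor

    def square(x):
        return x * x                        # one operation applied to many data items (DLP style)

    def report(label, values):
        return f"{label}: {sum(values)}"    # a separate task/function (TLP style)

    if __name__ == "__main__":
        data = list(range(1_000))
        with ProcessPoolExecutor() as pool:
            # Data-level parallelism: the same operation spread over the data by the workers.
            squares = list(pool.map(square, data))
            # Task-level parallelism: different functions (tasks) run on different processors.
            job1 = pool.submit(report, "sum of squares", squares)
            job2 = pool.submit(report, "sum of inputs", data)
            print(job1.result())
            print(job2.result())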
1.1.2.2. Innovative Applications: HPC and HTC systems are used in many fields for
different purposes. All applications demand computing economics, web-scale data collection,
system reliability, and scalable performance. Ex: Distributed transaction processing
is practiced in the banking industry. Transactions represent 90 percent of the
existing market for reliable banking systems.

1.1.2.3 The Trend toward Utility Computing: Major computing paradigms and
available services/capabilities are coming together to produce a technology
convergence of cloud/utility computing where both HPC and HTC are utilised to
achieve objectives like reliability and scalability. They also aim to reach autonomic
operations that can be self-organized and support dynamic recovery. Ex:
Interpretation of sensor data, effectors like Google Home and Amazon Echo, smart
home devices etc.

Cloud Computing focuses on a business model where a customer receives different
computing resources (storage, service, security etc.) from service providers like
AWS, EMC, Salesforce.com.

1.1.2.4 The Hype Cycle of New Technologies: A new hype (exciting) cycle is
coming into picture where different important and significant works needed by the
customer are offered as services by CC. Ex: SaaS, IaaS, Security as a Service, DM
as a Service etc. Many others are also in the pipeline.

1.1.3 The Internet of Things and Cyber-Physical Systems:


1.1.3.1. Internet of Things: The IoT [8] refers to the networked interconnection of
everyday objects, tools, devices and computers. It can be seen as a wireless network of
sensors that interconnects all the things we use in our daily life. RFID and GPS are also used
here. The IoT demands universal addressability of all the objects or things, whether they are
stationary or moving.

These objects can be interconnected, can exchange data and interact with each other by
the usage of suitable applications (web/mobile). In the IoT era, CC can be used
efficiently and in a secure way to provide different services to the humans, computers
and other objects. Ex: Smart cities, inter-connected networks, self-controlling street
lights/traffic lights etc.
1.1.3.2. Cyber-Physical Systems (CPS): A CPS is a cyber-physical system in which
physical objects and computational processes interact with each other. Ex: wrist bands
that monitor blood pressure. CPS merges the 3Cs, namely computation, communication and
control, to provide intelligent feedback between the cyber and physical worlds.

1.2. Technologies for Network based Systems


1.2.1. Multi-core CPUs and Multithreading Technologies: Over the last 30 years
the speed of the chips and their capacity to handle variety of jobs has increased at an
exceptional rate. This is crucial to both HPC and HTC system development. The
processor speed is measured in MIPS (millions of instructions per second) and the
utilized network bandwidth is measured in Mbps or Gbps.
1.2.1.1. Advances in CPU Processors: The advanced microprocessor chips (by Intel,
NVIDIA, AMD, Qualcomm etc.) assume a multi-core architecture with dual core, quad
core or more processing cores. They exploit parallelism at different levels. Moore’s law
has proven accurate at these levels. Moore's law is the observation that the number of
transistors in a dense integrated circuit doubles approximately every two years.
1.2.1.2. Multi-core CPU: A multi-core processor is a single computing component with
two or more independent actual processing units (called "cores"), which are units that
read and execute program instructions. (Ex: add, move data, and branch). The multiple
cores can run multiple instructions at the same time, increasing overall speed for
programs open to parallel computing.
1.2.1.3. Many-core GPU: (Graphics Processing Unit) Many-core processors are
specialist multi-core processors designed for a high degree of parallel processing,
containing a large number of simpler, independent processor cores. Many-core
processors are used extensively in embedded computers and high-performance
computing. (Main frames, super computers).
1.2.2. GPU Computing: A GPU is a graphics co-processor mounted on a computer's
graphics card to perform high-level graphics tasks such as video editing (Ex: NVIDIA
GPUs). A modern GPU chip can be built with hundreds of processing cores. These
days, parallel GPUs or GPU clusters are gaining more attention. Starting as co-processors
attached to the CPU, GPUs these days can have 128 cores on a single chip (NVIDIA), with
each core handling eight threads; this gives 1,024 threads (128 × 8) executing tasks
concurrently on a single GPU. This can be termed massive parallelism at the multicore and
multithreading levels. GPUs are not restricted to graphics – they are used in HPC systems
and supercomputers for handling large-scale calculations in parallel.
1.2.2.1. GPU Programming Model: Figure 1.7 and 1.8 [2] show the interaction
between a CPU and GPU in performing parallel execution of floating-point operations
concurrently.
Floating-point operations involve floating-point numbers and typically take longer to
execute than simple binary integer operations. A GPU has hundreds of simple cores
organised as multiprocessors. Each core can have one or more threads. The CPU
instructs the GPU to perform massive data processing where the bandwidth must be
matched between main memory and GPU memory.
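As a hedged sketch of this CPU and GPU division of labour (not an example from the text), the Python code below uses the Numba library's CUDA support; it assumes the numba and numpy packages and a CUDA-capable GPU are available, and the kernel and array sizes are illustrative only.

    # Sketch: the CPU prepares data, copies it to GPU memory, launches many GPU threads,
    # and copies results back (assumes numba + a CUDA-capable GPU are installed).
    import numpy as np
    from numba import cuda

    @cuda.jit
    def vector_add(a, b, out):
        i = cuda.grid(1)                 # global thread index across all blocks
        if i < out.size:
            out[i] = a[i] + b[i]         # each lightweight GPU thread handles one element

    n = 1_000_000
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)

    d_a, d_b = cuda.to_device(a), cuda.to_device(b)          # main memory -> GPU memory
    d_out = cuda.device_array_like(d_a)

    threads_per_block = 256
    blocks = (n + threads_per_block - 1) // threads_per_block
    vector_add[blocks, threads_per_block](d_a, d_b, d_out)   # massive thread-level parallelism
    result = d_out.copy_to_host()                            # GPU memory -> main memory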

In the future, thousand-core GPUs may feature in the field of Eflops (10^18 flops)
systems.

1.2.2.3. Power Efficiency of the GPU: The major benefits of the GPU over the CPU are power
efficiency and massive parallelism. It is estimated that about 60 Gflops/watt per core is needed
to run an exaflops system. [One exaflops is a thousand petaflops, or a quintillion (10^18)
floating-point operations per second]. Per instruction, a GPU is estimated to require only
about one-tenth of the power that a CPU requires.

The CPU is optimized for latency (the time between request and response) in caches and
memory, while the GPU is optimized for throughput with explicit management of its
on-chip memory. Both power consumption and software are the future challenges in
parallel and distributed systems.

1.2.3. Memory, Storage and WAN:
(a) Memory Technology: The upper curve in Figure 1.10 shows the growth of DRAM
chip capacity from 16 KB to 64 GB. [SRAM is Static RAM and is 'static' because the
memory does not have to be continuously refreshed like Dynamic RAM. SRAM is
faster but also more expensive and is used inside the CPU. The traditional RAMs in
computers are all DRAMs]. For hard drives, capacity increased from 260 MB to 3 TB
and lately 5 TB (by Seagate). Faster processor speed and higher memory capacity
will result in a wider gap between processors and memory, which is an ever-
existing problem.
(b) Disks and Storage Technology: The rapid growth of flash memory and solid-
state drives (SSD) also has an impact on the future of HPC and HTC systems. An
SSD can handle 300,000 to 1 million write cycles per block, increasing the speed
and performance. Power consumption should also be taken care-of before planning
any increase of capacity.
(c) System-Area Interconnects: The nodes in small clusters are interconnected by
an Ethernet switch or a LAN. As shown in Figure 1.11, a LAN is used to connect
clients to servers. A Storage Area Network (SAN) connects servers to network
storage like disk arrays. Network Attached Storage (NAS) connects clients directly
to disk arrays. All three types of network appear in a large cluster built with
commercial network components (Cisco, Juniper). If not much data is shared, we
can build a small cluster with an Ethernet switch and copper cables to link the end
machines (clients/servers).

(d) WAN: We can also notice the rapid growth of Ethernet bandwidth from 10 Mbps
to 1 Gbps and beyond. Different bandwidths are needed at the local, national, and
international levels of networks. It is also expected that more computers will be used
concurrently in the future, and higher bandwidth will certainly add more speed and
capacity to aid cloud and distributed computing.
1.2.4. Virtual Machines and Middleware: A typical computer has a single OS
image at a time. This leads to a rigid architecture that tightly couples apps to a
specific hardware platform, i.e., an app working on one system might not work on
another system with another OS (it is non-portable).
To build large clusters, grids and clouds, we need to access computing, storage and
networking resources in a virtualized manner and increase their capacity. A cloud of
limited resources should aggregate all of these dynamically to produce the expected
results.

(a) Virtual Machines: As seen in Figure 1.12, the host machine is equipped with a
physical hardware. The VM is built with virtual resources managed by a guest OS to
run a specific application (Ex: VMware to run Ubuntu for Hadoop). Between the
VMs and the host platform we need a middleware called VM Monitor (VMM). A
hypervisor (VMM) is a program that allows different operating systems to share a
single hardware host. This approach is called bare-metal VM because a hypervisor
handles CPU, memory and I/O directly. VM can also be implemented with a dual
mode as shown in Figure 1.12 (d). Here, part of VMM runs under user level and
another part runs under supervisor level.
(b) VM Primitive Operations: The VMM provides the VM abstraction to the guest
OS. With full virtualization, the VMM can export an abstraction of the full physical machine
so that a standard OS can run just as it would on the physical hardware. Low-level VMM
operations are indicated in Figure 1.13.

 The VMs can be multiplexed between hardware machines as shown in 1.13 (a)
 A VM can be suspended and stored in a stable storage as shown in 1.13(b)
 A suspended VM can be resumed on a new hardware platform as shown in 1.13 (c)
 A VM can be migrated from one hardware platform to another as shown in 1.13 (d)

Advantages:
 These VM operations can enable a VM to work on any hardware platform.
 They enable flexibility (the quality of bending easily without breaking) in porting
distributed app executions.
 VM approach enhances the utilization of server resources – multiple server
functions can be integrated on the same hardware platform to achieve higher
system efficiency. [VMware claims that server resource utilization has increased
from 5-15% to 60-80%].
 VM usage helps eliminate the impact of individual server crashes and brings more
transparency to the operations that are being carried out.

(c) Virtual Infrastructures: A virtual infrastructure connects resources to distributed
applications in such a way that a resource needed by an app is mapped exactly to
it. This decreases costs and increases efficiency and server responsiveness.

1.2.5. Data Center Virtualization for Cloud Computing: Cloud architecture is built
with commodity hardware and network devices. Almost all cloud platforms use x86
processors (the family descended from the Intel 8086). Low-cost terabyte disks and Gigabit
Ethernet are used to build data centers. A data center design emphasizes the
performance/price ratio rather than speed alone.
(a) Data Center Growth and Cost Breakdown: Large data centers are built with
thousands of servers and smaller ones with hundreds. The cost of running a data
center has increased over the years, and much of this money goes to management
and maintenance, which keep growing with time even though server purchase costs do not.
Electricity and cooling also consume a large share of the budget.
(b) Low Cost Design Philosophy: High-end switches or routers that provide high
bandwidth networks cost more and do not match the financial design of cloud
computing. For a fixed budget, typical switches and networks are more desirable.
Similarly, usage of x86 servers is more preferred over expensive mainframes.
Appropriate software ‘layer’ should be able to balance between the available
resources and the general requirements like network traffic, fault tolerance, and
expandability. [Fault tolerance is the property that enables a system to continue
operating properly even when one or more of its components have failed].
(c) Convergence of Technologies: CC is enabled by the convergence of technologies
in four areas:
 Hardware virtualization and multi-core chips
 Utility and grid computing
 SOA, Web 2.0 and Web Service integration
 Autonomic Computing and Data Center Automation

Web 2.0 is the second stage of development of the Internet, characterized by the change
from static to dynamic web pages and the growth of social media.
Data is increasing by leaps and bounds every day, coming from sensors,
simulations, web services, mobile services and so on. Storage, acquisition and
access of this huge amount of data sets requires standard tools that support high
performance, scalable file systems, DBs, algorithms and visualization. With science
becoming data-centric, storage and analysis of the data plays a huge role in the
appropriate usage of the data-intensive technologies.
Cloud Computing is basically focused on the massive data that is flooding the
industry. CC also impacts the e-science where multi-core and parallel computing is
required. To achieve the goals in these fields, one needs to work on workflows,
databases, algorithms and virtualization issues.
Cloud Computing is a transformative approach since it promises more results than
a normal data center. The basic interaction with the information is taken up in a
different approach to obtain a variety of results, by using different types of data to
end up with useful analytical results.
A typical cloud runs on an extremely large cluster of standard PCs. In each cluster
node, multithreading is practised with a large number of cores in many-core GPU
clusters. Hence, data science, cloud computing and multi-core computing are
coming together to revolutionize the next generation of computing and take up the
new programming challenges.
1.3. SYSTEM MODELS FOR DISTRIBUTED AND CLOUD COMPUTING: Distributed
and Cloud Computing systems are built over a large number of independent
computer nodes, which are interconnected by SANs, LANs or WANs. A few LAN switches
can easily connect hundreds of machines as a working cluster. A WAN can connect
many local clusters to form a very large cluster of clusters.
Large systems are highly scalable, and can reach web-scale connectivity either
physically or logically. Table 1.2 below shows massive systems classification as four
groups: clusters, P2P networks, computing grids and Internet clouds over large
data centers. These machines work collectively, cooperatively, or collaboratively at
various levels.

Clusters are more popular in supercomputing apps. They have laid the foundation
for cloud computing. P2P networks are used mostly in business apps. Many grids formed in
the previous decade have not been utilized to their potential due to lack of proper
middleware or well-coded apps.

1.3.1. Clusters of Cooperative Computers: A computing cluster consists of
interconnected standalone computers which work jointly as a single integrated
computing resource. This approach yields particularly good results in handling
heavy workloads with large datasets.
(a) Figure 1.15 below shows the architecture of a typical server cluster that has a
low-latency, high-bandwidth network. [Latency is the delay from an input into a
system to its desired outcome]. For building a large cluster, an interconnection
network can be built using Gigabit Ethernet, Myrinet or InfiniBand switches.

Through a hierarchical construction using SAN, LAN or WAN, scalable clusters can
be built with increasing number of nodes. The concerned cluster is connected to
the Internet through a VPN (Virtual Private Network) gateway, which has an IP
address to locate the cluster. Generally, most clusters have loosely connected
nodes, which are autonomous with their own OS.

(b) Single-System Image (SSI): It was indicated that multiple system images
should be integrated into a single-system image for a cluster. A cluster OS, or a
middleware that supports SSI (including sharing of CPUs, memory and I/O across
all the nodes in the cluster), is more desired these days. An SSI is an illusion (something
that doesn't actually exist) that presents the integrated resources as one single,
powerful resource. SSI can be created by software or hardware. Finally, a cluster
with multiple system images is only a collection of the resources of independent
computers that are loosely inter-connected.
(c) HW, SW and MW Support: It should be noted that MPPs (Massively Parallel
Processing) are clusters exploring high-level parallel processing. The building
blocks here are the computer nodes (PCs, Symmetric Multi-Processors (SMPs),
work stations or servers), communication software like Parallel Virtual Machine
(PVM), Message Passing Interface (MPI), and a network interface card in each
node. All the nodes are interconnected by high bandwidth network (Ex: Gigabit
Ethernet).

To create SSIs, we need special cluster middleware support. Note that both
sequential and parallel apps can run on the cluster but parallel environments give
effective exploitation of the resources. Distributed Shared Memory (DSM) allows
data to be shared across all the cluster nodes, making all the resources available to
every user. But SSI features are expensive and difficult to achieve, so users
generally prefer loosely coupled machines.
(d) Major Cluster Design Issues: A cluster-wide OS, or a single OS virtually controlling
the entire cluster, is not yet available. This makes designing and achieving
SSI difficult and expensive. All the apps should rely upon the middleware to bring
out the coupling between the machines in cluster or between the clusters. But it
should also be noted that the major advantages of clustering are scalable
performance, efficient message passing, high system availability, good fault
tolerance and a cluster-wide job management which react positively to the user
demands.

1.3.2. Grid Computing Infrastructures: Grid computing is designed to allow close
interaction among applications running on distant computers simultaneously.
(a) Computational Grids: A computing grid provides an infrastructure that couples
computers, software/hardware, sensors and others together. The grid can be
constructed across LANs, WANs and other networks on a regional, national or global
scale. Grids are also termed virtual platforms. Computers, workstations, servers
and clusters are used in a grid. Note that PCs, laptops and other devices can be viewed as
access devices to a grid system. Figure 1.16 below shows an example grid built by
different organisations over multiple systems of different types, with different
operating systems.

(b) Grid Families: Grid technology demands new distributed computing models,
software/middleware support, network protocols, and hardware infrastructures.
National grid projects were followed by industrial grid platforms from IBM, Microsoft,
HP, Dell-EMC, Cisco, and Oracle. New grid service providers (GSPs) and new grid
applications have emerged rapidly, similar to the growth of Internet and web
services in the past two decades. Grid systems are classified in essentially two
categories: computational or data grids and P2P grids. Computing or data grids are
built primarily at the national level.
1.3.3. Peer-to-Peer Network Families: In the basic client-server architecture,
client machines are connected to a central server for different purposes. The P2P
architecture, in contrast, offers a distributed model of networked systems: a P2P
network is client-oriented instead of server-oriented.
(a) P2P Systems: Here, every node acts as both a client and a server. Peer machines
are simply those connected to the Internet; all client machines act autonomously, joining
or leaving the P2P system at their own choice. No central coordination or central database
is needed. The system is self-organising, with distributed control.
Basically, the peers are unrelated. Each peer machine joins or leaves the P2P
network at any time. The participating peers form the physical network at any
time. This physical network is not a dedicated interconnection but a simple ad-hoc
network at various Internet domains formed randomly.
(b)Overlay Networks: As shown in Figure 1.17, an overlay network is a virtual
network formed by mapping each physical machine with its ID, through a virtual
mapping.

If a new peer joins the system, its peer ID is added as a node in the overlay network.
The P2P overlay network characterizes the logical connectivity among the peers. Two
types exist: unstructured and structured. An unstructured P2P overlay network is random
and has no fixed route of contact, so flooding is used to send queries to all nodes. This
results in a sudden increase of network traffic and unreliable results. On the other hand,
structured overlay networks follow a pre-determined methodology of connectivity for
inserting and removing nodes from the overlay graph.
(c) P2P Application Families: There exist 4 types of P2P networks: distributed file
sharing, collaborative platform, distributed P2P computing and others. Ex:
BitTorrent, Napster, Skype, Geonome, JXTA, .NET etc.
(d)P2P Computing Challenges: The main problems in P2P computing are those in
hardware, software and network. Many hardware models exist to select from;
incompatibility exists between the software and the operating systems; different
network connections and protocols make it too complex to apply in real-time
applications. Further, data location, scalability, performance, bandwidth etc. are
the other challenges.

P2P performance is further affected by routing efficiency and self-organization
among the peers. Fault tolerance, failure management, load balancing, lack of trust
among the peers (regarding security, privacy and copyright violations), and storage space
availability are other issues that have to be taken care of. But it should also be
noted that the distributed nature of a P2P network increases robustness, since the
failure of some peers doesn't affect the whole network – fault tolerance is good.

The disadvantage is that, since the system is not centralized, management of the
whole network is difficult – anyone can log on and put in any type of data, and
security is weaker.
1.3.4. Cloud Computing over Internet: Cloud Computing is defined by IBM as
follows: A cloud is a pool of virtualized computer resources. A cloud can host a
variety of different workloads that include batch-style backend jobs and interactive
and user-facing applications.

Since the explosion of data, the trend of computing has changed – the software
apps have to be sent to the concerned data. Previously, the data was transferred to
the software for computation. This is the main reason for promoting cloud
computing.

A cloud allows workloads to be deployed and scaled out through rapid provisioning
of physical or virtual systems. The cloud supports redundant, self-recovering, and
highly scalable programming models that allow workloads to recover from software
or hardware failures. The cloud system also monitors the resource use in such a
way that allocations can be rebalanced when required.

(a) Internet Clouds: The idea in Cloud Computing is to move desktop computing to a
service-oriented platform using server clusters and huge DBs at data centers. CC
benefits both users and providers through its low cost and simple access to resources
via machine virtualization. Many user applications can be satisfied simultaneously
by CC, and its design should satisfy security norms and be trustworthy and
dependable. CC is viewed in two ways: a centralized resource pool or a server
cluster practising distributed computing.
(b) The Cloud Landscape: A distributed computing system is controlled by
companies or organisations. But these traditional systems encounter several
bottlenecks like constant maintenance, poor utilization, and increasing costs and
updates of software or hardware. To get rid of these, CC should be utilized as on-
demand computing.

Cloud Computing offers different types of computing as services:


 Infrastructure as a Service (IaaS): This model provides infrastructure
components such as servers, storage, networks and the data center fabric to the
user on demand. A typical user can deploy and run multiple VMs in which guest
operating systems are used for specific applications.
 Platform as a Service (PaaS): In this model, the user can install his own apps
onto a virtualized cloud platform. PaaS includes middleware, DBs, development
tools, and some computing languages. It includes both hardware and software.
The provider supplies the API and the software tools (ex: Java, Python, .NET).
The user need not manage the cloud infrastructure which is taken care of by the
provider.
 Software as a Service (SaaS): This refers to browser-initiated application software
delivered to paid cloud customers. This model is used in business processes, industry
applications, CRM, ERP, HR and collaborative applications. Ex: Google
Apps, Twitter, Facebook, Cloudera, Salesforce etc.
(c) Internet clouds offer four deployment models: private, public, managed and
hybrid.
 Private Cloud: Private cloud is a type of cloud computing that delivers similar
advantages to public cloud, including scalability and self-service, but through a
proprietary architecture.
 Public Cloud: A public cloud is one based on the standard cloud computing
model, in which a service provider makes resources, such as applications and
storage, available to the general public over the Internet.
 Managed Cloud: Managed cloud hosting is a process in which organizations
share and access resources, including databases, hardware and software tools,
across a remote network via multiple servers in another location.
 Hybrid Cloud: A hybrid cloud is an integrated cloud service utilising both
private and public clouds to perform distinct functions within the same
organisation.
1.4. SOFTWARE ENVIRONMENTS FOR DISTRIBUTED SYSTEMS AND
CLOUDS
1.4.1. SERVICE ORIENTED ARCHITECTURE (SOA): In grids that use
Java/CORBA, an entity is a service or an object. Such architectures build on the
seven OSI layers (APSTNDP) that provide networking abstractions. Above this we
have a base service environment like .NET, Java etc. and a broker network for
CORBA, which enables collaboration between systems on different operating
systems, programming languages and hardware. By using this base, one can build
a higher-level environment reflecting the special features of distributed computing.
The same is reflected in the figure 1.20 below.

(a) Layered Architecture for Web Services and Grids: The entity interfaces
correspond to the WSDL (Web Services Description Language), Java methods, and the
CORBA interface definition language (IDL) in these distributed systems. These
interfaces are linked with high-level communication systems such as SOAP, RMI and
IIOP, which are built on message-oriented middleware infrastructures such as JMS
and WebSphere MQ.

At the entity level, for fault tolerance, the features of WSRM (Web Services Reliable
Messaging) and its framework mirror the corresponding levels of the OSI model. Entity
communication is supported by higher-level services for service discovery, metadata, and
the management of entities, which are discussed later. Ex: JNDI, CORBA
trading service, UDDI, LDAP and ebXML. This enables effective exchange of
information and results in higher performance and greater throughput.

(b) Web Services and Tools: Loose coupling and support for heterogeneous
implementations make services (SaaS, IaaS etc.) more attractive than distributed
objects. It should be realised that the figure above corresponds to two choices of
service architecture: web services or REST (Representational State Transfer)
systems.

In web services, the aim is to specify all aspects of the offered service and its
environment. This idea is carried out by using SOAP. Consequently, the
environment becomes a universal distributed OS with fully distributed capability
carried out by SOAP messages. But it should be noted that this approach has had
mixed results since the protocol can’t be agreed upon easily and even if so, it is
hard to implement.

In the REST approach, simplicity is stressed upon, and difficult problems are
delegated to the apps. In a web services language, REST has minimal information
in the header and the message body carries the needed information. REST
architectures are more useful in rapid technology environments. Above the
communication and management layers, we can compose new entities or
distributed programs by grouping several entities together.
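As a small, hypothetical illustration of the REST style described above (the URL and JSON fields are invented, and the third-party requests library is assumed to be installed), a client simply applies an HTTP verb to a resource and reads the message body:

    # Minimal REST-style call: little information in the headers, payload in the message body.
    import requests   # assumed to be installed; the URL below is hypothetical

    response = requests.get(
        "https://api.example.com/v1/services/42",   # the resource is identified by a URI
        headers={"Accept": "application/json"},     # minimal header information
        timeout=10,
    )
    response.raise_for_status()
    service = response.json()                       # the body carries the needed information
    print(service.get("name"), service.get("status"))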

Java and CORBA use RPC methodology through RMI. In grids, sensors represent
entities that output data as messages; grids and clouds represent collection of
services that have multiple message-based inputs and outputs.

(c) The Evolution of SOA: Service-Oriented Architecture applies to building grids,
clouds, their combinations, inter-clouds and even systems of systems. Data
collection is done through sensors and devices such as a ZigBee device, a Bluetooth device,
a Wi-Fi access point, a PC, a mobile phone and others. All these devices interact with
each other or with grids, clouds and databases at distant places.

Raw Data → Data → Information → Knowledge → Wisdom → Decisions

(d) Grids vs. Clouds: Grid systems apply static resources, while a cloud stresses
elastic resources. The difference between a grid and a cloud lies mainly in dynamic
resource allocation based on virtualization and autonomic computing. A 'grid of
clouds' can also be built, and it can do a better job than a pure cloud because it can
explicitly support negotiated resource allocation. Grids of clouds, clouds of grids, clouds
of clouds and inter-clouds are also possible.

1.4.2. TRENDS TOWARD DISTRIBUTED OPERATING SYSTEMS:


a) Distributed Operating Systems: To promote resource sharing and fast
communication, it is best to have a distributed operating system that can manage the
resources efficiently. One approach is a network OS running over many heterogeneous
platforms; such an OS offers the lowest transparency to users. Middleware can
also be used to enable resource sharing, but only up to a certain level. The third
approach is to develop a truly distributed OS to achieve the highest efficiency and
maximum transparency. A comparison can be seen in Table 1.6.

b) Amoeba vs. DCE: DCE (Distributed Computing Environment) is a middleware-based
system for distributed computing environments. Amoeba was developed by academics
in Holland. It should be noted that DCE, Amoeba and MOSIX2 are all research
prototypes used mainly in academia.

c) MOSIX2 vs. Linux Clusters: MOSIX is a distributed OS which runs with a
virtualization layer in the Linux environment. This layer provides a single-system
image to user apps. MOSIX supports both sequential and parallel apps, and
resources are discovered and migrated among the Linux nodes (MOSIX2 uses the Linux
kernel). A MOSIX-enabled grid can extend indefinitely as long as interoperation among
the clusters exists.

d) Transparency: Transparency in how the programming environment handles user data,
the OS, and the hardware plays a key role in the success of clouds. This concept is divided
into four levels, as seen below: data, application, OS, and hardware. Users are able to
choose the OS they like as well as the apps they like; this is the main concept behind
Software as a Service (SaaS).

1.4.3. Parallel and Distributed Programming Models:


a) Message-Passing Interface (MPI): MPI is a library of sub-programs that can
be called from C or FORTRAN to write parallel programs running on a distributed
system. The goal here is to represent clusters, grid systems, and P2P systems with
upgraded web services and other utility apps. Distributed programming can also be
supported by Parallel Virtual Machine (PVM).
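MPI programs are normally written in C or FORTRAN; as a hedged sketch of the same message-passing style, the code below uses the Python binding mpi4py (assumed to be installed) and would be launched with mpiexec.

    # Sketch of point-to-point message passing, assuming the mpi4py package is installed.
    # Run with, e.g.:  mpiexec -n 2 python mpi_hello.py   (the file name is illustrative)
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()            # this process's ID within the communicator
    size = comm.Get_size()            # total number of parallel processes

    if rank == 0:
        comm.send({"greeting": "hello", "from": rank}, dest=1, tag=11)   # process 0 sends
        print(f"rank 0 of {size}: message sent")
    elif rank == 1:
        msg = comm.recv(source=0, tag=11)                                # process 1 receives
        print(f"rank 1 of {size}: received {msg}")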
b) MapReduce: It is a web programming model for scalable data processing on
large data clusters. It is applied mainly in web-scale search and cloud computing
apps. The user specifies a Map function to generate a set of intermediate key/value
pairs. Then the user applies a Reduce function to merge all intermediate values
with the same (intermediate) key. MapReduce is highly scalable to explore high
degrees of parallelism at different job levels and can handle terabytes of data on
thousands of client machines. Many MapReduce programs can be executed
simultaneously. Ex: Google’s clusters.
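A minimal sketch of the Map and Reduce functions for the classic word-count problem is shown below in plain Python; it only imitates the model on a single machine, whereas a real MapReduce or Hadoop runtime would shuffle the intermediate key/value pairs across thousands of nodes.

    # Toy word count in the MapReduce style (single-machine sketch, not a real framework).
    from collections import defaultdict

    def map_phase(document):
        # Map: emit an intermediate (key, value) pair for every word.
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_phase(key, values):
        # Reduce: merge all intermediate values that share the same key.
        return (key, sum(values))

    documents = ["the cloud scales", "the grid and the cloud"]

    # Shuffle: group intermediate values by key (the framework does this across nodes).
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)

    counts = dict(reduce_phase(k, v) for k, v in groups.items())
    print(counts)   # {'the': 3, 'cloud': 2, 'scales': 1, 'grid': 1, 'and': 1}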
c) Hadoop Library: Hadoop enables users to write and run apps over vast
amounts of distributed data. Users can easily scale Hadoop to store and process
Petabytes of data in the web space. The package is economical (open source),
efficient (high level of parallelism) and is reliable (keeps multiple data copies).

d) Open Grid Services Architecture: OGSA is driven by large-scale distributed
computing apps. These apps must take into account a high degree of
resource and data sharing. The key features here are: a distributed execution
environment, public key infrastructure (PKI) services, trust management, and
security problems in grid computing.

Globus is a middleware library that implements OGSA standards for resource
discovery, allocation and security enforcement.
1.5 PERFORMANCE, SECURITY, AND ENERGY EFFICIENCY
1.5.1 PERFORMANCE METRICS AND SCALABILITY ANALYSIS:
a) Performance Metrics: In a distributed system, system throughput is measured
in MIPS, Tflops (tera floating-point operations per second) or transactions per
second (TPS). Other measures also exist, such as job response time and network latency. An
interconnection network with low latency and high bandwidth is preferred. The key
factors to be considered for performance are OS boot time, compile time, I/O data
rate, and the runtime support system used.
b) Dimensions of Scalability: System scaling can increase or decrease resources
depending on different practical factors.
 Size Scalability: This targets higher performance or more functionality by
increasing the machine size (cache, processors, memory etc.). We can
determine the size scalability by counting the number of processors installed:
more processors means more 'size'.
 Software Scalability: Upgrades in OS/compilers, adding mathematical
libraries, installing new apps, and using more user-friendly environments are
the factors considered in determining software scalability.
 Application Scalability: This refers to matching problem size scalability
(increasing data) with machine size scalability (effectively use the resources to
obtain the best result possible).
 Technology Scalability: Here, systems that can adapt to changes in different
aspects of technology like component or network are considered. Three
aspects play an important role here: time, space and heterogeneity. Time is
concerned with processors, motherboard, power supply packaging and cooling.
All these have to be upgraded between 3 to 5 years. Space is related to
packaging and energy concerns. Heterogeneity refers to the use of hardware
components or software packages from different vendors; this affects
scalability the most.
c) Scalability versus OS Image Count: In Figure 1.23, scalable performance is
estimated against the multiplicity of OS images in distributed systems. Scalable
performance means we can increase the speed of the system by adding more
servers or processors, by enlarging memory size, and so on. The OS image count is
the number of independent OS images observed in a cluster, grid, P2P
network or cloud.

An SMP (symmetric multiprocessor) server has a single system image, which can be
seen as a single node in a large cluster. NUMA (non-uniform memory access) machines
are SMP machines with distributed, shared memory. A NUMA machine can run with
multiple operating systems and can scale to a few hundred processors.
d) Amdahl’s Law: Consider the execution of a given program on a uniprocessor
workstation with a total execution time of T minutes. Now say the program is to be run in
parallel on a cluster of many processing nodes. Assume that a
fraction α of the code must be executed sequentially (sequential bottleneck).
Hence, (1-α) of the code can be compiled for parallel execution by n processors.
The total execution time of the program is calculated by αT + (1-α) T/n where the
first term is for sequential execution time on a single processor and the second
term is for parallel execution time on n parallel nodes.

Amdahl’s Law states that the speedup factor of using n-processor system over the
use of a single processor is given by:

Speedup S= T/[αT + (1-α) T/n] = 1/[ α + (1-α)/n] ---- (1.1)

The maximum speedup of n can be obtained only if α is reduced to zero or the
code can be parallelized with α = 0.

As the cluster becomes large (that is, as n → ∞), S approaches 1/α, which is an upper
bound on the speedup S. Note that this bound is independent of n. The
sequential bottleneck is the portion of the code that cannot be parallelized. Ex: The
maximum speedup achieved is 4 if α = 0.25 (i.e., 1 − α = 0.75), even if a user uses
hundreds of processors. This law teaches us that the sequential bottleneck should be
made as small as possible.
e)Problem with fixed workload: In Amdahl’s law, same amount of workload
was assumed for both sequential and parallel execution of the program with a fixed
problem size or dataset. This was called fixed-workload speedup by other
scientists. To execute this fixed-workload on n processors, parallel processing leads
to a system efficiency E which is given by:
E = S/n = 1/[α n + 1-α] ---- (1.2)
Generally, the system efficiency is low, especially when the cluster size is large. To
execute a program on cluster with n=256 nodes, and α=0.25, efficiency E =
1/[0.25x256 + 0.75] = 1.5%, which is very low. This is because only a few
processors, say 4, are kept busy whereas the others are kept idle.
f) Gustafson’s Law: To obtain higher efficiency when using a large cluster,
scaling the problem size to match the cluster’s capability should be considered. The
speedup law proposed by Gustafson is also referred to as scaled-workload
speedup.
Let W be the workload in a given program. When using an n-processor system, the
user scales the workload to W’= αW + (1-α)nW. Note that only the portion of the
workload that can be parallelized is scaled n times in the second term. This scaled
workload W’ is the sequential execution time on a single processor. The parallel
execution time W’ on n processors is defined by a scaled-workload speedup as:
S’ = W’/W = [αW + (1-α) nW]/W = α+ (1-α) n ---- (1.3)
This speedup is known as Gustafson’s law. By fixing the parallel execution time at
level W, we can obtain the following efficiency:
E’ = S’/n = α/n+ (1-α) ---- (1.4)
Taking previous workload values into consideration, efficiency can be improved for
a 256-node cluster to E’ = 0.25/256 + (1-0.25) = 0.751. For a fixed workload
Amdahl’s law must be used and for scaled problems users should apply Gustafson’s
law.
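The worked numbers above (a speedup bound of 4, an efficiency of about 1.5%, and a scaled efficiency of 0.751) can be checked with a short Python sketch of Eqs. (1.1) to (1.4):

    # Quick numeric check of Amdahl's and Gustafson's laws for alpha = 0.25 and n = 256.
    def amdahl_speedup(alpha, n):
        return 1.0 / (alpha + (1.0 - alpha) / n)           # Eq. (1.1)

    def fixed_workload_efficiency(alpha, n):
        return amdahl_speedup(alpha, n) / n                # Eq. (1.2)

    def gustafson_speedup(alpha, n):
        return alpha + (1.0 - alpha) * n                   # Eq. (1.3)

    def scaled_workload_efficiency(alpha, n):
        return alpha / n + (1.0 - alpha)                   # Eq. (1.4)

    alpha, n = 0.25, 256
    print(round(amdahl_speedup(alpha, 10**9), 2))          # ~4.0: the 1/alpha bound as n grows
    print(round(fixed_workload_efficiency(alpha, n), 4))   # ~0.0154, i.e. about 1.5%
    print(round(scaled_workload_efficiency(alpha, n), 3))  # 0.751 for the scaled workload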
1.5.2. Fault Tolerance and System Availability:
a)System Availability: High availability (HA) is needed in all clusters, grids, P2P
networks and cloud systems. A system is highly available if it has a long mean time
to failure (MTTF) and a short mean time to repair (MTTR).

System Availability = MTTF/(MTTF + MTTR) ---- (1.5)


System availability depends on many factors, including the hardware, software and
network components. Any failure that leads to the failure of the total system is known as
a single point of failure. It is the general goal of any manufacturer or user to bring
out a system with no single point of failure. To achieve this goal, the factors that
need to be considered are: adding hardware redundancy, increasing component
reliability, and designing for testability. In Figure 1.24 below, the effects on system
availability are estimated by scaling the system size in terms of the number of processor
cores in the system.

As a distributed system increases in size, availability decreases due to a higher
chance of failure and the difficulty of isolating failures. Both SMP and MPP are
vulnerable under centralized resources with one OS. NUMA machines fare a bit
better here since they use multiple operating systems.
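A quick numeric check of Eq. (1.5), using hypothetical MTTF and MTTR values that are not from the text:

    # Availability from Eq. (1.5) for illustrative (hypothetical) MTTF/MTTR values.
    def availability(mttf_hours, mttr_hours):
        return mttf_hours / (mttf_hours + mttr_hours)

    # Example: a server failing on average every 1,000 hours and taking 2 hours to repair.
    print(f"{availability(1000, 2):.4f}")   # 0.9980, i.e. about 99.8% availability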
1.5.3 Network Threats and Data Integrity:
a) Threats to Systems and Networks:
Figure 1.25 presents a summary of various attack types and the damage
they cause to users. Information leaks lead to a loss of confidentiality.
Loss of data integrity can be caused by user alteration, Trojan horses, service
spoofing attacks, and Denial of Service (DoS), which leads to loss of Internet
connections and system operations. Users need to protect clusters, grids, clouds
and P2P systems from malicious intrusions that may destroy hosts, network and
storage resources. Internet anomalies found in routers, gateways and
distributed hosts may hinder (hold back) the usage and acceptance of these public
resources.

b) Security Responsibilities: The main responsibilities include confidentiality,


integrity and availability for most Internet service providers and cloud users. In the
order of SaaS, PaaS and IaaS, the providers increase/transfer security control to
the users. In brief, the SaaS model relies on the cloud provider for all the security
features. On the other hand, IaaS wants the users to take control of all security
functions, but their availability is still decided by the providers. Finally, the PaaS
model divides the security aspects in this way: data integrity and availability is with
the provider while confidentiality and privacy control is the burden of the users.
c) Copyright Protection: Collusive (secret agreement) piracy is the main source
of copyright violation within the boundary of a P2P network. Clients may illegally
share their software, allotted only to them, with others, thus triggering piracy. One
can develop a proactive (acting before damage happens) content
poisoning scheme to stop colluders (conspirers) and pirates, detect them and stop
them from proceeding with their illegal work.
d)System Defence Technologies: There exist three generations of network
defence. In the first generation, tools were designed to prevent intrusions. These
tools established themselves as access control policies, cryptographic systems etc.
but an intruder can always slip into the system since there existed a weak link
every time. The second generation detected intrusions in a timely manner to
enforce remedies. Ex: Firewalls, intrusion detection systems (IDS), public key
infrastructure (PKI) services (banking, e-commerce), reputation systems etc. The
third generation provides more intelligent responses to intrusions.
e) Data Protection Infrastructure: Security infrastructure is required to protect
web and cloud services. At the user level, one needs to perform trust negotiation
and reputation aggregation over all users. At the app end, we need to establish
security precautions and intrusion detection systems to restrain virus, worm,
malware, and DDoS attacks. Piracy and copyright violations should also be
detected and contained.
1.5.4. Energy Efficiency in Distributed Computing: The primary goals in
parallel and distributed computing systems are high performance and high throughput,
along with performance reliability (fault tolerance and security). New challenges in this
area these days include energy efficiency, and workload and resource outsourcing
(distributed power management, DPM). In the forthcoming topics, the energy
consumption issues in servers and HPC systems are discussed.
Energy consumption in parallel and distributed computing raises different issues
like monetary (financial), environmental and system performance issues. The
megawatts of power needed for PFlops has to be within the budget control and the
distributed usage of resources has to be planned accordingly. The rise in
temperature due to heavier usage of the resources (and the resulting cooling load) also
has to be addressed.
a) Energy Consumption of Unused Servers: To run a data center, a company
has to spend huge amount of money for hardware, software, operational support
and energy every year. Hence, the firm should plan accordingly to make maximum
utilization of the available resources and yet the financial and cooling issues should
not cross their limits. For all the finance spent on a data center, it should also not
lie down idle and should be utilized or leased for useful work.
Reclaiming idle servers can save a lot of money and energy, so the first step for IT
departments is to identify the unused or underused servers and plan to utilize their
resources in a suitable manner.
b) Reducing Energy in Active Servers: In addition to identifying
unused/underused servers for energy savings, we should also apply necessary
techniques to decrease energy consumption in active distributed systems. These
techniques should not hinder the performance of the concerned system. Power
management issues in distributed computing can be classified into four layers, as
seen in Figure 1.26.

c) Application Layer: Most apps in areas such as science, engineering,
business and finance try to increase the system's speed or quality. By introducing
energy-conscious applications, one should plan usage and consumption so that the
apps can exploit the new multi-level and multi-domain energy management
methodologies without reducing performance. For this goal, we need to identify the
relationship (correlation) between performance and energy consumption. Note that
these two factors (compute and storage) are correlated and affect completion time.
d) Middleware layer: The middleware layer is a connection between application
layer and resource layer. This layer provides resource broker, communication
service, task analyzer & scheduler, security access, reliability control, and
information service capabilities. It is also responsible for energy-efficient
techniques in task scheduling. In a distributed computing system, a balance has to
be struck between efficient resource usage and the available energy.
e) Resource Layer: This layer consists of different resources including the
computing nodes and storage units. Since this layer interacts with hardware
devices and the operating systems, it is responsible for controlling all distributed
resources. Several methods exist for efficient power management of the hardware and
the OS, and the majority of them are concerned with the processors.
Dynamic power management (DPM) and dynamic voltage frequency scaling
(DVFS) are two popular methods in recent use. In DPM, hardware
devices can switch from idle mode to one or more lower-power modes. In DVFS, energy
savings are obtained based on the fact that the power consumption of CMOS
(Complementary Metal-Oxide Semiconductor) circuits is directly related to the
frequency and to the square of the supply voltage [P = 0.5·C·V²·f]. Execution
time and power consumption can therefore be controlled by switching among different
voltages and frequencies.
f) Network Layer: The main responsibilities of the network layer in distributed
computing are routing and transferring packets, and enabling network services to
the resource layer. Energy consumption and performance have to be measured,
predicted and balanced in a systematic manner so as to bring about energy-efficient
networks. Two challenges exist here:
 The models should represent the networks systematically and should possess a
full understanding of interactions among time, space and energy.
 New energy-efficient algorithms have to be developed to realize the
advantages at maximum scale and to defend against network attacks.

Data centers are becoming more important in distributed computing since the data
is ever-increasing with the advent of social media. They are now another core
infrastructure like power grid and transportation systems.
g) DVFS Method for Energy Efficiency: This method exploits the
slack (idle) time encountered in inter-task relationships. The slack time
associated with a task is used to run that task at a lower voltage and frequency. The
relationship between energy and voltage/frequency in CMOS circuits is given
by:
E = Ceff × f × v² × t ---- (1.6)
f = K × (v − vt)² / v
where v, Ceff, K and vt are the voltage, circuit switching capacity, a technology
dependent factor and threshold voltage; t is the execution time of the task under
clock frequency f. By reducing v and f, the energy consumption of the device can
also be reduced.
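A small Python sketch of Eq. (1.6), with illustrative (hypothetical) values for K, Ceff and vt, shows how lowering the supply voltage lowers both the clock frequency and the energy consumed by a task:

    # Sketch of the DVFS relations in Eq. (1.6) with illustrative constants.
    def frequency(v, vt=0.3, k=1.0):
        return k * (v - vt) ** 2 / v                        # f = K(v - vt)^2 / v

    def energy(v, t, ceff=1.0, vt=0.3, k=1.0):
        return ceff * frequency(v, vt, k) * v ** 2 * t      # E = Ceff * f * v^2 * t

    t = 1.0   # execution time of the task at the given frequency (arbitrary units)
    for v in (1.2, 1.0, 0.8):
        print(f"v={v:.1f}  f={frequency(v):.3f}  E={energy(v, t):.3f}")
    # Lower voltage gives a lower frequency and much lower energy for the same run time
    # (in practice the task then takes longer, which is where slack time is exploited).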

CLOUD COMPUTING
UNIT II
SYLLABUS: Virtual Machines and Virtualization of Clusters and Data Centers:
Implementation Levels of Virtualization, Virtualization Structures/ Tools and mechanisms,
Virtualization of CPU, Memory and I/O Devices, Virtual Clusters and Resource Management,
Virtualization for Data Center Automation.

INTRODUCTION:
The massive usage of virtual machines (VMs) opens up new opportunities for parallel, cluster, grid,
cloud and distributed computing. Virtualization enables users to share expensive hardware
resources by multiplexing VMs (i.e., running many VMs that share one physical system) on the
same set of hardware hosts like servers or data centers.

2.1 Implementation Levels of Virtualization:


Virtualization is a concept by which several VMs are multiplexed into the same hardware machine.
The purpose of a VM is to enhance resource sharing by many users and improve computer
performance in terms of resource utilization and application flexibility. Hardware resources (CPU,
memory, I/O devices etc.) or software resources (OS and apps) can be virtualized at various
layers of functionality.

The main idea is to separate hardware from software to obtain greater efficiency from the system.
Ex: Users can gain access to more memory by this concept of VMs. With sufficient storage, any
computer platform can be installed in another host computer, even if processors’ usage and
operating systems are different.
a) Levels of Virtualization Implementation: A traditional computer system runs with a host
OS specially adjusted for its hardware architecture. This is depicted in Figure 3.1a.

After virtualization, different user apps managed by their own OS (i.e., guest OS) can run on
the same hardware, independent of the host OS. This is often done by adding a virtualization
layer as shown in Figure 3.1b.

This virtualization layer is called VM Monitor or hypervisor. The VMs can be seen in the upper
boxes where apps run on their own guest OS over a virtualized CPU, memory and I/O devices.
The main function of the software layer for virtualization is to virtualize the physical hardware
of a host machine into virtual resources to be saved by the VMs. The virtualization software
creates the abstract of VMs by introducing a virtualization layer at various levels of a
computer. General virtualization layers include the instruction set architecture (ISA) level,
hardware level, OS level, library support level, and app level. This can be seen in Figure 3.2.
The levels are discussed below.

I) Instruction Set Architecture Level: At the ISA level, virtualization is performed by
emulation (imitate) of the given ISA by the ISA of the host machine. Ex: MIPS binary code
can run on an x86-based host machine with the help of ISA simulation. Instruction
emulation leads to virtual ISAs created on any hardware machine.

The most basic level of emulation is code interpretation. An interpreter program
translates and executes the source instructions one by one, which is slow. To speed
things up, dynamic binary translation can be used, which translates blocks of dynamic
source instructions to target instructions. The basic blocks can also be extended to
program traces or super blocks to increase translation efficiency. This emulation requires
binary translation and optimization. Hence, a Virtual-ISA requires a processor specific
translation layer to the compiler.
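
A minimal sketch of the dynamic binary translation idea, using an invented three-opcode "guest ISA": each basic block is translated once into a host-callable (Python) function and cached, so repeated execution of the same block avoids the slow one-by-one interpretation described above.

# Toy dynamic binary translation: a block of "guest" instructions is translated
# once into a host-callable function and cached; later executions reuse it.
# The three-opcode guest ISA here is invented purely for illustration.

translation_cache = {}          # block start address -> translated function

def translate_block(code, start):
    ops = []
    pc = start
    while code[pc][0] != "HALT":    # collect the basic block up to HALT
        ops.append(code[pc])
        pc += 1
    def run(regs):                  # the "translated" host code
        for op, a, b in ops:
            if op == "MOV":
                regs[a] = b
            elif op == "ADD":
                regs[a] += regs[b]
        return regs
    return run

def execute(code, start, regs):
    if start not in translation_cache:          # translate only on first use
        translation_cache[start] = translate_block(code, start)
    return translation_cache[start](regs)

guest = [("MOV", "r0", 5), ("MOV", "r1", 7), ("ADD", "r0", "r1"), ("HALT", None, None)]
print(execute(guest, 0, {"r0": 0, "r1": 0}))    # {'r0': 12, 'r1': 7}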
II) Hardware Abstraction Level: Hardware level virtualization is performed on the bare
hardware. This approach generates a virtual hardware environment and processes the
hardware in a virtual manner. The idea is to virtualize the resources of a computer by
utilizing them concurrently. Ex: The Xen hypervisor (VMM) runs Linux or other guest
operating systems. [Discussed later]
III) OS Level: This refers to an abstraction layer between the OS and the user apps. The
OS level virtualization creates isolated containers on a single physical server and OS
instances to utilize software and hardware in data centers. The containers behave like real
servers. OS level virtualization is used in creating virtual hosting environments to allocate
hardware resources among a large number of ‘distrusting’ users. It can also be used to
indirectly merge server hardware by moving resources on different hosts into different
containers or VMs on one server.
IV) Library Support Level: Most applications use APIs exported by user-level libraries
rather than lengthy system calls by the OS. Virtualization with library interfaces is possible
by controlling the communication link between apps and the rest of the system through
API hooks.
Ex: (a) Wine (recursive acronym for Wine Is Not an Emulator) is a free and open
source compatibility layer software application that aims to allow applications designed
for MS-Windows to run on Linux OS.

(b) vCUDA, which virtualizes the CUDA library so that CUDA applications running in a guest OS can use the GPU on the host.

V) User-App Level: An app level virtualization brings out a real VM; this process is also
known as process level virtualization. Generally HLL VMs are used where virtualization
layer is an app above the OS; it can run programs written and compiled to an abstract
machine definition. Ex: JVM and .NET CLR (Common Language Runtime).

Other forms of app level virtualization are app isolation, app sandboxing or app streaming.
Here, the app is wrapped in a layer and is isolated from the host OS and other apps. This
makes the app much easier to distribute and remove from user workstations. Ex:
LANDesk (an app virtualization platform) – this installs apps as self-contained, executable
files in an isolated environment. No actual installation is required and no system
modifications are needed.

Table 3.1 shows that hardware and OS support yield the highest performance. At the same
time, the hardware and application levels are the most expensive to implement. User isolation is
difficult to achieve, and the ISA level offers the best flexibility.

b) VMM Design Requirement and Providers: As seen before, hardware-level virtualization


inserts a layer between real hardware and traditional OS. This layer (VMM/hypervisor)
manages the hardware resources of the computer effectively. By the usage of VMM, different
traditional operating systems can be used with the same set of hardware simultaneously.

Requirements for a VMM:

• For programs, a VMM should provide an identical environment, same as the original
machine.
• Programs running in this environment should show only minor decreases in speed.
• A VMM should be in complete control of the system resources.

Some differences might still be caused due to availability of system resources (more than one
VM is running on the same system) and differences caused by timing dependencies.

The hardware resource requirements (like memory) of each VM are reduced, but the total sum
of them is greater than that of the real machine. This extra capacity is needed because of the
other VMs that are concurrently running on the same hardware.

A VMM should demonstrate efficiency in using the VMs. To guarantee the efficiency of a VMM,
a statistically dominant subset of the virtual processor’s instructions needs to be executed
directly by the real processor with no intervention by the VMM. A comparison can be seen in
Table 3.2:

The aspects to be considered here include (1) The VMM is responsible for allocating hardware
resources for programs; (2) a program can’t access any resource that has not been allocated
to it; and (3) under certain circumstances, the VMM should be able to regain control of resources
already allocated. Note that not all processors satisfy these requirements of a VMM.

A VMM is tightly related to the architectures of the processors. It is difficult to implement a


VMM on some types of processors like x86. If a processor is not designed to satisfy the
requirements of a VMM, the hardware should be modified – this is known as hardware
assisted virtualization.

c) Virtualization Support at the OS Level: CC is transforming the computing landscape by


shifting the hardware and management costs of a data center to third parties, much as people
entrust their money to banks. The challenges of CC are: (a) the ability to use a variable number of
physical machines and VM instances depending on the needs of the problem. Ex: A task may need
a single CPU at one instant but multiple CPUs at another; (b) the slow operation of instantiating new VMs.

As of now, new VMs originate either as fresh boots or as replicates of a VM template –


unaware of the current status of the application.

i)Why OS Level Virtualization (Disadvantages of hardware level virtualization):


• It is slow to initiate a hardware level VM since each VM creates its own image from the
beginning.
• Content redundancy among these VM images is high.
• Slow performance and low density
• Hardware modifications maybe needed

To provide a solution to all these problems, OS level virtualization is needed. It inserts a


virtualization layer inside the OS to partition the physical resources of a system. It enables
multiple isolated VMs within a single OS kernel. This kind of VM is called a Virtual Execution
Environment (VE) or Virtual Private System or simply a container. From the user’s point of
view, a VE/container has its own set of processes, file system, user accounts, network
interfaces (with IP addresses), routing tables, firewalls and other personal settings. The
containers can be customized for different people, they share the same OS kernel. Therefore,
this methodology is also called single-OS image virtualization. All this can be observed in
Figure 3.3.

ii) Advantages of OS Extensions:
▪ VMs at the OS level have minimal start-up shutdown costs, low resource requirements
and high scalability.
▪ For an OS level VM, the VM and its host environment can synchronise state changes

These can be achieved through two mechanisms of OS level virtualization:

➢ All OS level VMs on the same physical machine share a single OS kernel
➢ The virtualization layer can be designed in a way that allows processes in VMs to access as
many resources of the host machine as possible, but never to modify them.

iii) Disadvantages of OS Extensions: The main disadvantage of OS extensions is that all VMs


at the OS level on a single physical host (container host) must have the same kind of guest OS. Though
different OS-level VMs may use different OS distributions (Windows XP, 7, 10), they must belong to the
same OS family (Windows). A Windows distribution cannot run in a Linux-based container.

As we can observe in Figure 3.3, the virtualization layer is inserted inside the OS to partition
the hardware resources for multiple VMs to run their applications in multiple virtual
environments. To implement this OS level virtualization, isolated execution environments
(VMs) should be created based on a single OS kernel. In addition, the access requests from a
VM must be redirected to the VM’s local resource partition on the physical machine. For
example, ‘chroot’ command in a UNIX system can create several virtual root directories within
an OS that can be used for multiple VMs.
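
A minimal sketch of the chroot idea in Python (it must run as root, and the directory /srv/ve1 is an assumed path already populated with a minimal userland, so treat it as an illustration rather than a complete container):

# Sketch of OS-level isolation with chroot (must run as root).
# /srv/ve1 is an assumed directory already populated with a minimal userland.
import os

def enter_container(new_root):
    os.chroot(new_root)     # the process now sees new_root as "/"
    os.chdir("/")           # make sure the working directory is inside it

if __name__ == "__main__":
    pid = os.fork()
    if pid == 0:                          # child becomes the "container" process
        enter_container("/srv/ve1")
        os.execv("/bin/sh", ["sh"])       # a shell confined to the virtual root
    os.waitpid(pid, 0)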

To implement the virtual root directories’ concept, there exist two ways: (a) duplicating
common resources to each VM partition or (b) sharing most resources with the host
environment but create private copies for the VMs on demand. It is to be noted that the first
method incurs (brings up) resource costs and burden on a physical machine. Therefore, the
second method is the apparent choice.

d) Virtualization on Linux or Windows Platforms: Generally, the OS-level virtualization


systems are Linux-based. Windows based virtualization platforms are not much in use. The
Linux kernel offers an abstraction layer to allow software processes to interact with and operate on
resources without knowing the hardware details. Different Linux platforms use patched
kernels to provide special support for extended functionality.

Not all Linux platforms are tied to a special kernel; in such a case, a host can run several VMs
simultaneously on the same hardware. Examples can be seen in Table 3.3.

e) Middleware Support for Virtualization: This is the other name for Library-level
Virtualization and is also known as user-level Application Binary Interface or API emulation.

This type of virtualization can create execution environments for running alien (new/unknown)
programs on a platform rather than creating a VM to run the entire OS. The key functions
performed here are API call interception and remapping (redirecting the intercepted calls to a different implementation).

2.2. VIRTUALIZATION STRUCTURES/TOOLS AND MECHANISMS:


It should be noted that there are three classes of VM architecture. Before virtualization, the OS
manages the hardware. After virtualization, a virtualization layer is inserted between the hardware
and the OS. Here, the virtualization layer is responsible for converting parts of real hardware into
virtual hardware. Different operating systems like Windows and Linux can run simultaneously on
the same machine in this manner. Depending on the position of the virtualization layer,
several classes of VM architectures can be framed out: Hypervisor Architecture, para-virtualization
and host-based virtualization.

a) Hypervisor and Xen Architecture: The hypervisor (VMM) supports hardware level
virtualization on bare metal devices like CPU, memory, disk and network interfaces. The hypervisor
software exists between the hardware and its OS (platform). The hypervisor provides hypercalls
for the guest operating systems and applications. Depending on the functionality, a hypervisor can
assume micro-kernel architecture like MS Hyper-V or monolithic hypervisor architecture like the
VMware ESX for server virtualization.
➢ Hypercall: A hypercall is a software trap from a domain to the hypervisor, just as a
syscall is a software trap from an application to the kernel. Domains will use hypercalls to
request privileged operations like updating page tables.
➢ Software Trap: A trap, also known as an exception or a fault, is typically a type of
synchronous interrupt caused by an exceptional condition (e.g., breakpoint, division by
zero, invalid memory access). A trap usually results in a switch to kernel mode, wherein
the OS performs some action before returning control to the originating process. A trap in
a system process is more serious than a trap in a user process and might be fatal. The
term trap might also refer to an interrupt intended to initiate a context switch to a monitor
program or debugger.
➢ Domain: It is a group of computers/devices on a network that are administered as a unit
with common rules and procedures. Ex: Within the Internet, all devices sharing a common
part of the IP address are said to be in the same domain.
➢ Page Table: A page table is the data structure used by a virtual memory system in an
OS to store the mapping between virtual addresses and physical addresses.
➢ Kernel: A kernel is the central part of an OS and manages the tasks of the computer and
hardware like memory and CPU time.
➢ Monolithic Kernel: These are commonly used by the OS. When a device is needed, it is
added as a part of the kernel and the kernel increases in size. This has disadvantages like
faulty programs damaging the kernel and so on. Ex: Memory, processor, device drivers
etc.
➢ Micro-kernel: In micro-kernels, only the basic functions are dealt with – nothing else. Ex:
Memory management and processor scheduling. It should also be noted that OS can’t run
only on a micro-kernel, which slows down the OS.

The size of the hypervisor code of a micro-kernel hypervisor is smaller than that of monolithic
hypervisor. Essentially, a hypervisor must be able to convert physical devices into virtual resources
dedicated for the VM usage.
i) Xen Architecture: It is an open source hypervisor program developed by Cambridge
University. Xen is a micro-kernel hypervisor, whose policy is implemented by Domain 0.

As can be seen in Figure 3.5, Xen doesn’t include any device drivers; it provides a mechanism by
which a guest-OS can have direct access to the physical devices. The size of Xen is kept small, and
provides a virtual environment between the hardware and the OS. Commercial Xen hypervisors
are provided by Citrix, Huawei and Oracle.
The core components of Xen are the hypervisor, kernel and applications. Many guest operating
systems can run on the top of the hypervisor; but it should be noted that one of these guest OS
controls the others. This guest OS with the control ability is called Domain 0 – the others are called
Domain U. Domain 0 is first loaded when the system boots and can access the hardware directly
and manage devices by allocating the hardware resources for the guest domains (Domain U).
Xen is based on Linux and its security level is C2. Its management VM is named
Domain 0, which can access and manage all other VMs on the same host. If a user has access to
Domain 0 (VMM), he can create, copy, save, modify or share files and resources of all the VMs.
This is a huge advantage for the user but concentrating all the resources in Domain 0 can also
become a privilege for a hacker. If Domain 0 is hacked, through it, a hacker can control all the
VMs and through them, the total host system or systems. Security problems are to be dealt with in
a careful manner before handing over Xen to the user.
A traditional machine’s lifetime can be thought of as a straight line that progresses monotonically
(always moving forward) as the software executes. During this time, executions are made,
configurations are changed, and software patches can be applied. In a virtualized environment, the
lifetime is more like a tree: execution can fork into N different branches, and multiple instances of a
VM can exist in this tree at any time. VMs can also be allowed to roll back to a particular state and
rerun from the same point.

b) Binary Translation with Full Virtualization: Hardware virtualization can be categorised into
two categories: full virtualization and host-based virtualization.

➢ Full Virtualization doesn’t need to modify the host OS; it relies upon binary translation to
trap and virtualize the execution of certain sensitive instructions. Normal (noncritical)
instructions run directly on the hardware. This keeps the performance overhead low –
normal instructions are carried out in the normal manner, while the difficult and sensitive
ones are first discovered with a trap and then emulated in a virtual manner. This improves
both the security and the performance of the system.

➢ Binary Translation of Guest OS Requests Using a VMM:

This approach is mainly used by VMware and others. As it can be seen in Figure 3.6, the
VMware puts the VMM at Ring 0 and the guest OS at Ring 1. The VMM scans the
instructions to identify complex and privileged instructions and trap them into the VMM,
which emulates the behaviour of these instructions. Binary translation is the method used
for emulation (A => 97 => 01100001). Full virtualization combines both binary translation
and direct execution. The guest OS is totally decoupled from the hardware and run
virtually (like an emulator).

Full virtualization is not ideal, since the binary translation it involves is time-consuming. Binary
translation also adds memory and processing cost, but because only the sensitive instructions are
translated, the overall performance stays close to that of the host (around 90%).

➢ Host based Virtualization: In a host-based virtualization system both host and guest
OS are used and a virtualization layer is built between them. The host OS is still
responsible for managing the hardware resources. Dedicated apps might run on the VMs
and some others can run on the host OS directly. By using this methodology, the user can
install the VM architecture without modifying the host OS. The virtualization software can
rely upon the host OS to provide device drivers and other low-level services. Hence the
installation and maintenance of the VM becomes easier.
Another advantage is that many host machine configurations can be perfectly utilized; still
four layers of mapping exist in between the guest and host operating systems. This may
hinder the speed and performance, in particular when the ISA (Instruction Set
Architecture) of a guest OS is different from that of the hardware – binary translation
MUST be deployed. This increases time and cost, and slows down the system.

c) Para-Virtualization with Compiler Support: Para-Virtualization modifies the guest operating


systems; a para-virtualized VM provides special APIs that require modifications in the user applications
that use them. Para-virtualization tries to reduce the virtualization burden (extra work) so as to improve
performance – this is done by modifying only the guest OS kernel. This can be seen in Figure 3.7.

Ex: In a typical para-virtualization architecture, which considers an x86 processor, a virtualization


layer is inserted between h/w and OS. According to the x86 ‘ring definition’ the virtualization layer
should also be installed at Ring 0. In Figure 3.8, we can notice that para-virtualization replaces
instructions that cannot be virtualized with hypercalls (placing a trap) that communicate directly
with the VMM. Notice that if a guest OS kernel is modified for virtualization, it can’t run the
hardware directly – that should be done through the virtualization layer.

Disadvantages of Para-Virtualization: Although para-virtualization reduces the overhead, it


has other problems. Its compatibility (suitability) and portability can be in doubt because it has to
support both the modified guest OS and the host OS as per requirements. Also, the maintenance
cost of para-virtualization is high since it may require deep kernel modifications. Finally, the
performance advantage of para-virtualization is not stable – it varies as per the workload. But
compared with full virtualization, para-virtualization is easier and more practical since binary
translation is not much considered. Many products utilize para-virtualization to overcome the less
speed of binary translation. Ex: Xen, KVM, VMware ESX.

Kernel based VM (KVM): This is a Linux para-virtualization system – it is a part of the Linux
kernel. Memory management and scheduling activities are carried out by the existing Linux kernel.
Other activities are taken care of by the KVM and this methodology makes it easier to handle than
the hypervisor. Also note that KVM is a hardware-assisted para-virtualization tool, which improves
performance and supports unmodified guest operating systems like Windows, Linux, Solaris and
others.
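
As a small, hedged illustration (Linux-specific), the following checks whether the CPU advertises hardware virtualization extensions and whether the KVM device node is present on a host:

# Check for hardware virtualization support and the KVM device node on Linux.
import os

def cpu_has_virt_extensions():
    with open("/proc/cpuinfo") as f:
        flags = f.read()
    return ("vmx" in flags) or ("svm" in flags)   # Intel VT-x or AMD-V

print("CPU virtualization extensions:", cpu_has_virt_extensions())
print("KVM device present (/dev/kvm): ", os.path.exists("/dev/kvm"))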

2.3 VIRTUALIZATION OF CPU, MEMORY AND I/O DEVICES:

Processors employ a special running mode and instructions, known as hardware-assisted


virtualization. Through this, the VMM and guest OS run in different modes; all sensitive
instructions of the guest OS and its apps are caught by the ‘trap’ in the VMM.

a) H/W Support for Virtualization: Modern operating systems and processors permit
multiple processes to run simultaneously. A protection mechanism should exist in the processor
so that instructions from different processes do not access the hardware directly – otherwise this
could lead to a system crash.

All processors should have at least two modes – user and supervisor modes to control the
access to the hardware directly. Instructions running in the supervisor mode are called
privileged instructions and the others are unprivileged.

Ex: VMware Workstation

b) CPU Virtualization: A VM is a duplicate of an existing system; majority of instructions are


executed by the host processor. Unprivileged instructions run on the host machine directly;
other instructions are to be handled carefully. These critical instructions are of three types:
privileged, control-sensitive and behaviour-sensitive.
Privileged=> Executed in a special mode and are trapped if not done so.
Control-Sensitive=> Attempt to change the configuration of the used resources
Behaviour-Sensitive=> Behave differently depending on the configuration of resources (for
example, load and store operations over the virtual memory)

A CPU architecture is virtualizable only if it supports running the VM’s instructions in the CPU’s
user mode while the VMM runs in supervisor mode. When the privileged instructions are executed, they are trapped in the
VMM. In this case, the VMM acts as a mediator between the hardware resources and different
VMs so that correctness and stability of the system are not disturbed. It should be noted that
not all CPU architectures support virtualization.
Process:
• System call triggers the 80h interrupt and passes control to the OS kernel.
• Kernel invokes the interrupt handler to process the system call
• In Xen, the 80h interrupt in the guest OS concurrently causes the 82h interrupt in the
hypervisor; control is passed on to the hypervisor as well.
• After the task is completed, the control is transferred back to the guest OS kernel.
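
The trap-and-emulate behaviour described above can be pictured with a toy sketch; the instruction names and the "privileged" set below are invented purely for illustration:

# Toy trap-and-emulate loop: unprivileged instructions "run directly",
# privileged ones trap to a stand-in VMM. The instruction names are invented.
PRIVILEGED = {"HLT", "OUT", "LOAD_PAGE_TABLE"}

class TrapToVMM(Exception):
    pass

def guest_execute(instr):
    if instr in PRIVILEGED:
        raise TrapToVMM(instr)          # the hardware would trap to the VMM here
    return f"{instr}: executed directly on the host CPU"

def vmm_emulate(instr):
    return f"{instr}: emulated safely by the VMM on behalf of the guest"

for instr in ["ADD", "MOV", "OUT", "HLT"]:
    try:
        print(guest_execute(instr))
    except TrapToVMM as trap:
        print(vmm_emulate(str(trap)))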

➢ Hardware Assisted CPU Virtualization: Since full Virtualization or para- Virtualization is


complicated, this new methodology tries to simplify the situation. Intel and AMD add an
additional mode, called privilege mode level (often referred to as Ring -1), to the x86 processors.
The OS can still run at Ring 0 and the hypervisor at Ring -1. Note that all privileged instructions
are trapped into the hypervisor. Hence, no modifications are required in the guest OS of the VMs.

VMCS=> Virtual Machine Control Structure (the Intel VT-x data structure holding guest and host state)

VMX=> Virtual Machine Extensions (the Intel VT-x root/non-root operating modes)
c) Memory Virtualization: In the traditional methodology, the OS maintains mappings
between virtual memory to machine memory (MM) using page tables, which is a one-stage
mapping from virtual memory to MM.
• Virtual memory is a feature of an operating system (OS) that allows a computer to
compensate for shortages of physical memory by temporarily transferring pages of data
from random access memory (RAM) to disk storage.
• Machine Memory is the actual physical memory of the host; it is the upper bound (threshold)
of the memory that a host can allocate to its VMs. All modern x86 processors contain a memory management unit (MMU)
and a translation look-aside buffer (TLB) to optimize the virtual memory performance.
In a virtual execution environment, virtual memory virtualization involves sharing the
physical system memory in RAM and dynamically allocating it to the physical memory of the
VMs.
Stages:
• Virtual memory to physical memory
• Physical memory to machine memory.

MMU virtualization should be supported; the guest OS continues to control the mapping of virtual
addresses to the physical memory addresses of its VM, but it cannot directly access the actual
machine memory. All this is depicted in Figure 3.12.

Each page table of a guest OS has a page table allocated for it in the VMM. The page table in the
VMM which handles all these is called a shadow page table. As it can be seen all this process is
nested and inter-connected at different levels through the concerned address. If any change
occurs in the virtual memory page table or TLB, the shadow page table in the VMM is updated
accordingly.
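
A minimal sketch of the two-stage mapping and of the shadow page table that the VMM derives from it (all page numbers and the page size are hypothetical):

# Two-stage address translation in a virtualized host, plus the shadow table
# that maps guest-virtual pages straight to machine pages. Numbers are hypothetical.

guest_page_table = {0: 10, 1: 11, 2: 12}        # guest virtual page -> guest "physical" page
vmm_p2m_table    = {10: 200, 11: 201, 12: 205}  # guest physical page -> machine page

# The VMM builds the shadow page table by composing the two mappings,
# so the hardware needs only one lookup per access.
shadow_page_table = {gv: vmm_p2m_table[gp] for gv, gp in guest_page_table.items()}

PAGE = 4096
def translate(vaddr):
    page, offset = divmod(vaddr, PAGE)
    return shadow_page_table[page] * PAGE + offset

print(hex(translate(0x1abc)))   # guest virtual page 1 maps to machine page 201

Whenever the guest changes its own page table, the VMM must refresh the corresponding shadow entries, which is exactly the update step described above.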

d) I/O Virtualization: This involves managing the routing of I/O requests between virtual
devices and the shared physical hardware. The three ways to implement this are full device
emulation, para-virtualization and direct I/O.
• Full Device Emulation: This process emulates well-known and real-world devices. All the
functions of a device or bus infrastructure such as device enumeration, identification,
interrupts etc. are replicated in the software, which itself is located in the VMM and acts as a
virtual device. The I/O requests are trapped in the VMM accordingly. The emulation
approach can be seen in Figure 3.14.

• Para- virtualization: This method of I/O virtualization is taken up since software emulation
runs slower than the hardware it emulates. In para- virtualization, the frontend driver runs
in Domain-U; it manages the requests of the guest OS. The backend driver runs in Domain-
0 and is responsible for managing the real I/O devices. This methodology (para) gives more
performance but has a higher CPU overhead.
• Direct I/O virtualization: This lets the VM access devices directly; achieves high
performance with lower costs. Currently, it is used only for the mainframes.

Ex: VMware Workstation for I/O virtualization: NIC=> Network Interface Controller

e) Virtualization in Multi-Core Processors: Virtualizing a multi-core processor is more
complicated than that of a unicore processor. Multi-core processors have high performance by
integrating multiple cores in a chip, but their virtualization poses a new challenge. The main
difficulties are that apps must be parallelized to use all the cores, and this task
must be accomplished by software, which is a much harder problem.
To reach these goals, new programming models, algorithms, languages and libraries are needed to
increase the parallelism.
i)Physical versus Virtual Processor Cores: A multi-core virtualization method was proposed
to allow hardware designers to obtain an abstraction of the lowest level details of all the cores.
This technique alleviates (lessens) the burden of managing the hardware resources by
software. It is located under the ISA (Instruction Set Architecture) and is unmodified by the OS
or hypervisor. This can be seen in Figure 3.16.

ii) Virtual Hierarchy: The emerging concept of many-core chip multiprocessors (CMPs) is a
new computing landscape (background). Instead of supporting time-sharing jobs on one or few
cores, abundant cores can be used in a space-sharing – here single or multi-threaded jobs are
simultaneously assigned to the cores. Thus, the cores are separated from each other and no

interferences take place. Jobs go on in parallel, for long time intervals. To optimize (use
effectively) the workloads, a virtual hierarchy has been proposed to overlay (place on top) a
coherence (consistency) and caching hierarchy onto a physical processor. A virtual hierarchy
can adapt by itself to fit how to carry out the works and share the workspace depending upon
the workload and the availability of the cores.

The CMPs use a physical hierarchy of two or more cache levels that statically determine the
cache (memory) allocation and mapping. A virtual hierarchy is a cache hierarchy that can
adapt to fit the workloads. First level in the hierarchy locates data blocks close to the cores to
increase the access speed; it then establishes a shared-cache domain, establishes a point of
coherence, thus increasing communication speed between the levels. This idea can be seen in
Figure 3.17(a) [1].

Space sharing is applied to assign three workloads to three clusters of virtual cores: VM0 and
VM3 for DB workload, VM1 and VM2 for web server workload, and VM4-VM7 for middleware
workload. Basic assumption here is that a workload runs in its own VM. But in a single OS,

space sharing applies equally. To address this problem, Marty and Hill suggested a two-level
virtual coherence and caching hierarchy. This can be seen in Figure 3.17(b) [1]. Each VM
operates in its own virtual cluster in the first level which minimises both access time and
performance interference. The second level maintains a globally shared memory.

A virtual hierarchy adapts to space-shared workloads like multiprogramming and server


consolidation.

2.4. VIRTUAL CLUSTERS AND RESOURCE MANAGEMENT: A physical cluster is a collection of


physical servers that are interconnected. The issues that are to be dealt with here are: live
migration of VMs, memory and file migrations and dynamic deployment of virtual clusters.
When a general VM is initialized, the administrator has to manually write configuration
information; this increases his workload, particularly when more and more VMs join the clusters.
As a solution to this, a service is needed that takes care of the configuration information (capacity,
speed etc.) of the VMs. The best example is Amazon’s Elastic Compute Cloud (EC2), which
provides elastic computing power in a cloud.
Most virtualization platforms like VMware ESX Server, and XenServer support a bridging mode
which allows all domains to appear on the network as individual hosts. Through this mode, VMs
can communicate with each other freely through the virtual network and configure automatically.
a) Physical versus Virtual Clusters: Virtual Clusters are built with VMs installed at one or
more physical clusters. The VMs in a virtual cluster are interconnected by a virtual network
across several physical networks. The concept can be observed in Figure 3.18.

The provisioning of VMs to a virtual cluster is done dynamically to have the following properties:
• The virtual cluster nodes can be either physical or virtual (VMs) with different operating
systems.
• A VM runs with a guest OS that manages the resources in the physical machine.
• The purpose of using VMs is to consolidate multiple functionalities on the same server.
• VMs can be replicated in multiple servers to promote parallelism, fault tolerance and disaster
recovery.
• The no. of nodes in a virtual cluster can grow or shrink dynamically.
• The failure of some physical nodes will slow the work but the failure of VMs will cause no harm
(fault tolerance is high).
Since system virtualization is widely used, the VMs on virtual clusters have to be effectively
managed. The virtual computing environment should provide high performance in virtual cluster
deployment, monitoring large clusters, scheduling of the resources, fault tolerance and so on.

Figure 3.19 shows the concept of a virtual cluster based on app partitioning. The different colours
represent nodes in different virtual clusters. The storage images (SSI) from different VMs in
different clusters are the most important concept here. Software packages can be pre-installed as
templates and the users can build their own software stacks. Note that the boundary of the
virtual cluster might change since VM nodes are added, removed, or migrated dynamically.
➢ Fast Deployment and Effective Scheduling: The concerned system should be able to
• Construct and distribute software stacks (OS, libraries, apps) to a physical node inside
the cluster as fast as possible
• Quickly switch runtime environments from one virtual cluster to another.
Green Computing: It is a methodology that is environmentally responsible and an eco-friendly
usage of computers and their resources. It is also defined as the study of designing,
manufacturing, using and disposing of computing devices in a way that reduces their
environmental impact.
Engineers must concentrate upon the point the available resources are utilized in a cost and
energy-reducing manner to optimize the performance and throughput. Parallelism must be put in
place wherever needed and virtual machines/clusters should be used for attaining this goal.
Through this, we can reduce the overhead, attain load balancing and achieve scale-up and scale-
down mechanisms on the virtual clusters. Finally, the virtual clusters must be clustered among
themselves again by mapping methods in a dynamical manner.

➢ High Performance Virtual Storage: A template must be prepared for the VM


construction and usage, and distributed to the physical hosts. Software packages should
reduce the time needed for customization and for switching environments. Users
should be identified by their profiles that are stored in data blocks. All these methods
increase the performance in virtual storage. Ex: Dropbox

Steps to deploy (arrange/install) a group of VMs onto a target cluster:


• Preparing the disk image (SSI)
• Configuring the virtual machines
• Choosing the destination nodes
• Executing the VM deployment commands at every host
A template is a disk image/SSI that hides the distributed environment from the user. It may
consist of an OS and some apps. Templates are chosen by the users as per their requirements
and can implement COW (Copy on Write) format. A new COW backup file is small and easy to
create and transfer, thus reducing space consumption.

It should be noted that every VM is configured with a name, disk image, network settings, and is
allocated a CPU and memory. But this might be cumbersome if the VMs are many in number. The
process can be simplified by configuring similar VMs with pre-edited profiles. Finally, the
deployment principle should be able to fulfil the VM requirement to balance the workloads.
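
A small sketch of the pre-edited profile idea: one template holds the common settings and each VM configuration only overrides the fields that differ. All field names and values below are hypothetical placeholders.

# Generate many similar VM configurations from one pre-edited profile.
# All field names and values are hypothetical placeholders.

template = {
    "disk_image": "base-template.qcow2",   # shared template image (COW-friendly)
    "vcpus": 2,
    "memory_mb": 2048,
    "network": "virbr0",
}

def make_vm_config(name, **overrides):
    cfg = dict(template)      # start from the common profile
    cfg["name"] = name
    cfg.update(overrides)     # only the differing fields are specified per VM
    return cfg

cluster = [make_vm_config(f"web{i}") for i in range(3)]
cluster.append(make_vm_config("db0", vcpus=4, memory_mb=8192))
for vm in cluster:
    print(vm)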

b) Live VM Migration Steps: Normally in a cluster built with mixed modes of host and guest
systems, the procedure is to run everything on the physical machine. When a VM fails, it can be
replaced by another VM on a different node, as long as they both run the same guest OS. This is
called a failover (a procedure by which a system automatically transfers control to a duplicate
system when it detects a fault or failure) of a physical system to a VM. Compared to a physical-
physical failover, this methodology has more flexibility. It also has a drawback – a VM must stop
working if its host node fails. This can be lessened by live migration of a similar VM from one
node to another. The live migration process is depicted in Figure 3.20.
Managing a Virtual Cluster: There exist four ways.
1. We can use a guest-based manager, by which the cluster manager resides inside a guest
OS. Ex: A Linux cluster can run different guest operating systems on top of the Xen
hypervisor.
2. We can bring out a host-based manager which itself is a cluster manager on the host
systems. Ex: VMware HA (High Availability) system that can restart a guest system after
failure.
3. An independent cluster manager, which can be used on both the host and the guest –
making the infrastructure complex.
4. Finally, we might also use an integrated cluster (manager), on the guest and host
operating systems; here the manager must clearly distinguish between physical and virtual
resources.

Virtual clusters are generally used where fault tolerance of VMs on the host plays an important
role in the total cluster strategy. These clusters can be applied in grids, clouds and HPC
platforms. High performance is obtained by dynamically finding and using resources as required,
and by keeping the migration time and the bandwidth used as low as possible.

A VM can be in one of the following states:


(a) Inactive State: This is defined by the virtualization platform, under which the VM is not
enabled.
(b) Active State: This refers to a VM that has been instantiated at the virtualization platform to
perform a task.
(c) Paused State: A VM has been instantiated but disabled temporarily to process a task or is in
a waiting state itself.
(d) Suspended State: A VM enters this state if its machine file and virtual resources are stored
back to the disk.

Live Migration Steps: This consists of 6 steps.


• Steps 0 and 1: Start migration automatically and checkout load balances and server
consolidation.
• Step 2: Transfer memory (transfer the memory data + recopy any data that is changed
during the process). This goes on iteratively till changed memory is small enough to be
handled directly.
• Step 3: Suspend the VM and copy the last portion of the data.
• Steps 4 and 5: Commit and activate the new host. Here, all the data is recovered, and
the VM is started from exactly the place where it was suspended, but on the new host.

c) Migration of Memory, Files, and Network resources


Virtual Clusters are being widely used to use the computing resources effectively, generate HP,
overcome the burden of interaction between different OSs and make different configurations to
coexist.

i)Memory Migration: This is done between the physical host and any other physical/virtual
machine. The techniques used here depend upon the guest OS. MM can be in a range of
megabytes to gigabytes. The Internet Suspend-Resume (ISR) technique exploits temporal
locality, since the memory states may have considerable overlap between the suspended and resumed instances of a
VM. Temporal locality (TL) refers to the fact that the memory states differ only by the amount
of work done since a VM was last suspended.
To utilize the TL, each file is represented as a tree of small sub-files. A copy of this tree exists in
both the running and suspended instances of the VM. The advantage here is usage of tree
representation of a file and caching ensures that the changed files are only utilized for
transmission.

ii) File System Migration: To support VM migration from one cluster to another, a consistent
and location-independent view of the file system must be available on all hosts. Each VM is provided with
its own virtual disk to which the file system is mapped to. The contents of the VM can be
transmitted across the cluster by inter-connections (mapping) between the hosts. But migration of
an entire host (if required) is not advisable due to cost and security problems. We can also provide
a global file system across all host machines where a VM can be located. This methodology
removes the need of copying files from one machine to another – all files on all machines can be
accessed through network. It should be noted here that the actual files are not mapped or copied.
The VMM accesses only the local file system of a machine and the original/modified files are stored
at their respective systems only. This decoupling improves security and performance but increases
the overhead of the VMM – every file has to be stored in virtual disks in its local files. Smart
Copying ensures that after being resumed from suspension state, a VM doesn’t get a whole file as
a backup. It receives only the changes that were made. This technique reduces the amount of data
that has to be moved between two locations.

iii) Network Migration: A migrating VM should maintain its open network connections. It should not
depend upon forwarding mechanisms (mediators) or mobile mechanisms. Each VM should be
assigned a unique virtual IP or MAC (Media Access Control) [7] address, different from that of
the host machine. The mapping of the IP and MAC addresses to their respective VMs is done by
the VMM.

If the destination of the VM is also on the same LAN, special messages are sent using MAC
address that the IP address of the VM has moved to a new location. If the destination is on
another network, the migrating OS can keep its original Ethernet MAC address and depend on
the network switch [9] to detect its move to a new port [8].

iv) Live Migration of VM Using Xen: live migration means moving a VM from one physical
node to another while keeping its OS environment and apps intact. All this process is carried out
by a program called migration daemon. This capability provides efficient online system
maintenance, reconfiguration, load balancing, and improved fault tolerance. The recently
improved mechanisms are able to migrate without suspending the concerned VM.
There are two approaches in live migration: pre copy and post copy.
(a) In pre-copy, which is the mainly used approach in live migration, all memory pages are first
transferred while the VM keeps running; the pages modified during each round (dirty pages)
are then re-copied iteratively in the following rounds (a minimal sketch of this loop appears
after this list). Performance degradation occurs because the migration keeps encountering
newly dirtied pages before reaching the destination, and the number of iterations can also
grow, causing another problem. To counter these problems, a check-pointing/recovery
process is used at different stages to limit the iterations and improve performance.
(b) In post-copy, all memory pages are transferred only once during the migration process. The
threshold time allocated for migration is reduced. But the downtime is higher than that in
pre-copy.
Downtime means the period during which the migrating VM is stopped and cannot serve requests.
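
A minimal sketch of the iterative pre-copy loop referred to in (a); the dirty-page model and all numbers below are invented purely to show the shrinking working set and do not come from any real hypervisor:

# Sketch of the iterative pre-copy phase of live migration.
# The dirty-page model and all numbers are invented for illustration only.
import random

THRESHOLD = 32        # stop iterating once this few pages remain dirty
MAX_ROUNDS = 10       # safety cap so the loop always terminates

def dirty_pages_since_last_round(previous):
    # Hypothetical model: some fraction of the pages just sent get dirtied again.
    return max(1, int(previous * random.uniform(0.2, 0.5)))

def precopy_migration(total_pages):
    sent, to_send = 0, total_pages
    for round_no in range(1, MAX_ROUNDS + 1):
        sent += to_send                               # copy the current dirty set while the VM runs
        to_send = dirty_pages_since_last_round(to_send)
        print(f"round {round_no}: copied {sent} pages so far, {to_send} still dirty")
        if to_send <= THRESHOLD:
            break
    print(f"stop-and-copy: suspend the VM, send the final {to_send} pages, resume on the target")

precopy_migration(100_000)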

Ex: Live migration between two Xen-enabled hosts: Figure 3.22


CBC Compression=> Context Based Compression
RDMA=> Remote Direct memory Access

2.5. VIRTUALIZATION FOR DATA CENTRE AUTOMATION:


Data Centres have been built and automated recently by different companies like Google, MS, IBM,
Apple etc. By utilizing the data centres and the data in the same, virtualization is moving towards
mobility, reduced maintenance time, and increasing the number of virtual clients. Other factors
that influence the deployment and usage of data centres are high availability (HA), backup
services, and workload balancing.

a) Server Consolidation in Data Centres: In data centers, heterogeneous workloads may run
at different times. The two types here are
1. Chatty (Interactive) Workloads: These types may reach the peak at a particular time
and may be silent at some other time. Ex: A messaging app such as WhatsApp may peak in the
evening and be relatively quiet at midday.
2. Non-Interactive Workloads: These don’t require any users’ efforts to make progress
after they have been submitted. Ex: HPC

The data center should be able to handle the workload with satisfactory performance both at the
peak and normal levels.
It is common that much of the resources of data centers like hardware, space, power and cost are
under-utilized at various levels and times. To overcome this disadvantage, one approach is to
use the methodology of server consolidation. This improves the server utility ratio of hardware
devices by reducing the number of physical servers. There exist two types of server consolidation:
(a) Centralised and Physical Consolidation (b) virtualization-based server consolidation.
The second method is widely used these days, and it has some advantages.
• Consolidation increases hardware utilization
• It enables more agile provisioning of the available resources
• The total cost of owning and using data center is reduced (low maintenance, low cooling,
low cabling etc.)
• It enables availability and business continuity – the crash of a guest OS has no effect upon
a host OS.
To automate (virtualization) data centers one must consider several factors like resource
scheduling, power management, performance of analytical models and so on. This improves the
utilization in data centers and gives high performance. Scheduling and reallocation can be done at
different levels at VM level, server level and data center level, but generally any one (or two)
level is used at a time.
The schemes that can be considered are:
(a) Dynamic CPU allocation: This is based on VM utilization and app level QoS (Quality of
Service) metrics. The CPU should adjust automatically according to the demands and
workloads to deliver the best performance possible.
(b) Another scheme uses two-level resource management system to handle the complexity of
the requests and allocations. The resources are allocated automatically and autonomously to
bring down the workload on each server of a data center.
Finally, we should efficiently balance the power saving and data center performance to achieve the
HP and HT also at different situations as they demand.
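
A hedged sketch of virtualization-based consolidation viewed as a simple first-fit-decreasing placement of VM CPU demands onto as few physical servers as possible (all demands and the capacity are hypothetical):

# First-fit-decreasing placement of VM loads onto physical servers.
# Demands and capacity are hypothetical, expressed in "CPU share" units.

SERVER_CAPACITY = 1.0
vm_demands = {"vm1": 0.6, "vm2": 0.5, "vm3": 0.4, "vm4": 0.3, "vm5": 0.2}

def consolidate(demands, capacity):
    servers = []    # each server is a dict {vm_name: demand}
    for vm, load in sorted(demands.items(), key=lambda kv: kv[1], reverse=True):
        for srv in servers:
            if sum(srv.values()) + load <= capacity:
                srv[vm] = load          # fits on an already-open server
                break
        else:
            servers.append({vm: load})  # otherwise open a new physical server
    return servers

for i, srv in enumerate(consolidate(vm_demands, SERVER_CAPACITY), 1):
    print(f"server {i}: {srv}  utilization={sum(srv.values()):.0%}")

Real data-center schedulers also weigh memory, I/O, power and QoS constraints, but the packing intuition is the same: fewer, better-utilized physical servers.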

b) Virtual Storage Management: Storage virtualization lags behind the modernization of
data centers and is a bottleneck of VM deployment. The CPUs are rarely updated, the chips are
not replaced, and the host/guest operating systems are not adjusted to the demands of the
situation.
Also, the storage methodologies used by the VMs are not as fast as they are expected to be
(nimble). Thousands of such VMs may flood the data center, and their hundreds of thousands of images (SSI) may
lead to data center collapse. Research has been conducted for this purpose to bring out an efficient
storage and reduce the size of images by storing parts of them at different locations. The solution
here is Content Addressable Storage (CAS). Ex: Parallax system architecture (A distributed storage
system). This can be viewed at Figure 3.26.

Parallax itself runs as a user-level application in the VM storage, providing Virtual Disk Images
(VDIs). A VDI can be accessed in a transparent manner from any host machine in the Parallax
cluster. It is a core abstraction of the storage methodology used by Parallax.

c) Cloud OS for Virtualized Data Centers: VI => Virtual Infrastructure manager. The types can
be seen in Table 3.6.

EC2 => Amazon Elastic Compute Cloud WS => Web Service CLI => Command Line Interface
WSRF => Web Services Resource Framework KVM => Kernel-based VM VMFS => VM File System
HA => High Availability
Example of Eucalyptus for Virtual Networking of Private Cloud: It is an open-source software
system intended for IaaS clouds. This is seen in Figure 3.27.

Instance Manager (IM): It controls the execution, inspection and termination of VM instances on the
host machines where it runs.

Group Manager (GM): It gathers information about VM execution and schedules them on specific
IMs; it also manages virtual instance network.
Cloud Manager (CM): It is an entry-point into the cloud for both users and administrators. It
gathers information about the resources, allocates them by proper scheduling, and implements
them through the GMs.

d) Trust Management in Virtualized Data Centers: To recap, a VMM (hypervisor) is a layer
between the hardware and the operating systems that creates one or more VMs on a single platform. A VM
encapsulates the guest OS and its current state and can transport it through the network as a SSI.
At this juncture, in the network transportation, any intruders may get into the image or the
concerned hypervisor itself and pose danger to both the image and the host system. Ex: A subtle
problem lies in reusing a random number for cryptography.

i) VM-based Intrusion Detection: Intrusions are unauthorized access to a computer from


other network users. An intrusion detection system (IDS), which is built on the host OS can be
divided into two types: host-based IDS (HIDS) and a network-based IDS (NIDS).
A virtualization-based IDS can isolate each VM on the VMM and monitor the concerned systems
without contact with the others. Any problem with a VM will not pose problems for other
VMs. Also, a VMM audits the hardware allocation and usage for the VMs regularly so as to notice
any abnormal changes. Still yet, the host and guest OS are fully isolated from each other. A
methodology on these bases can be noticed in Figure 3.29.

The above figure proposes the concept of running the IDS only on a highly privileged VM. Notice
that policies play an important role here. A policy framework can monitor the events in different
guest operating systems of different VMs by using an OS interface library to determine which grant
is secure and which is not.
It is difficult to determine which access is intrusion and which is not without some time delay.
Systems also may use access ‘logs’ to analyse which is an intrusion and which is secure. The IDS
log service is based on the OS kernel and the UNIX kernel is hard to break; so even if a host
machine is taken over by the hackers, the IDS log book remains unaffected.
The security problems of the cloud mainly arise in the transport of the images through the network
from one location to another. The VMM must be used more effectively and efficiently to deny any
chances for the hackers.

CLOUD COMPUTING
UNIT – 3

SYLLABUS: Cloud Platform Architecture: Cloud Computing and service Models, Public
Cloud Platforms, Service Oriented Architecture, Programming on Amazon AWS and
Microsoft Azure.

3.1. CLOUD COMPUTING AND SERVICE MODELS:

In recent days, the IT industry has moved from manufacturing to offering more services (service-
oriented). As of now, 80% of the industry is ‘service-industry’. It should be realized that services
are not manufactured/invented from time-to-time; they are only rented and improved as per the
requirements. Clouds aim to utilize the resources of data centers virtually over automated
hardware, databases, user interfaces and apps.
I)Public, Private and Hybrid Clouds: Cloud computing has evolved from the concepts of
clusters, grids and distributed computing. Different resources (hardware, finance, time) are
leveraged (used to maximum advantage) to bring out the maximum HTC. A Cloud Computing
model enables the users to share resources from anywhere at any time through their connected
devices.
Advantages of Cloud Computing: Recall that in Cloud Computing, the programming is
sent to data rather than the reverse, to avoid large data movement, and maximize the bandwidth
utilization. Cloud Computing also reduces the costs incurred by the data centers, and increases
the app flexibility. Cloud Computing consists of a virtual platform with elastic resources and puts
together the hardware, data and software as per demand. Furthermore, the apps utilized and
offered are heterogeneous.
The Basic Architecture of the types of clouds can be seen in Figure 4.1 below.

• Public Clouds: A public cloud is owned by a service provider, built over the Internet and
offered to a user on payment. Ex: Google App Engine (GAE), AWS, MS-Azure, IBM
Blue Cloud and Salesforce’s Force.com. All these offer their services for creating and
managing VM instances to the users within their own infrastructure.
• Private Clouds: A private cloud is built within the domain of an intranet owned by a
single organization. It is client-owned and managed; its access is granted to a limited
number of clients only. Private clouds offer a flexible and agile private infrastructure
to run workloads within their own domains. Though private cloud offers more control, it
has limited resources only.
• Hybrid Clouds: A hybrid cloud is built with both public and private clouds. Private
clouds can also support a hybrid cloud model by enhancing the local infrastructure with
computing capacity of a public external cloud.

• Data Center Networking Architecture: The core of a cloud is the server cluster and the
cluster nodes are used as compute nodes. The scheduling of user jobs requires that virtual
clusters are to be created for the users and should be granted control over the required
resources. Gateway nodes are used to provide the access points of the concerned service
from the outside world. They can also be used for security control of the entire cloud
platform. It is to be noted that in physical clusters/grids, the workload is static; in clouds,
the workload is dynamic and the cloud should be able to handle any level of workload on
demand.

Data centers and supercomputers also differ in networking requirements, as illustrated in


Figure 4.2. Supercomputers use custom-designed high-bandwidth networks such as fat trees
or 3D torus networks. Data-center networks are mostly IP-based commodity networks, such
as the 10 Gbps Ethernet network, which is optimized for Internet access. Figure 4.2 shows a
multilayer structure for accessing the Internet. The server racks are at the bottom Layer 2,
and they are connected through fast switches (S) as the hardware core. The data center is
connected to the Internet at Layer 3 with many access routers (ARs) and border routers
(BRs).
• Cloud Development Trends: There is a good chance that private clouds will grow in the
future since private clouds are more secure, and adjustable within an organization. Once
they are matured and more scalable, they might be converted into public clouds. In another
angle, hybrid clouds might also grow in the future.

ii) Cloud Ecosystem and Enabling Technologies: The differences between classical
computing and cloud computing can be seen in the table below. In traditional computing, a
user has to buy the hardware, acquire the software, install the system, test the configuration and
execute the app code. The management of the available resources is also a part of this. Finally,
all this process has to be repeated every 1.5 to 2 years, since the methodologies used will have
become obsolete.

On the other hand, Cloud Computing follows a pay-as-you-go model [1]. Hence the cost is
reduced significantly – a user doesn’t buy any resources but rents them as per his requirements.
All S/W and H/W resources are leased by the user from the cloud resource providers. This is
advantageous for small and middle business firms which require limited amount of resources
only. Finally, Cloud Computing also saves power.

a) Cloud Design Objectives:


• Shifting computing from desktops to data centers: Computer processing, storage, and
software delivery are shifted away from desktops and local servers toward data centers
over the Internet.
• Service provisioning and cloud economics: Providers supply cloud services by signing
SLAs with consumers and end users. The services must be efficient in terms of
computing, storage, and power consumption. Pricing is based on a pay-as-you-go policy.
• Scalability in performance: The cloud platforms, software, and infrastructure services
must be able to scale in performance as the number of users increases.
• Data privacy protection: Can you trust data centers to handle your private data and
records? This concern must be addressed to make clouds successful as trusted services.
• High quality of cloud services: The QoS of cloud computing must be standardized to
make clouds interoperable among multiple providers.
• New standards and interfaces: This refers to solving the data lock-in problem associated
with data centers or cloud providers. Universally accepted APIs and access protocols are
needed to provide high portability and flexibility of virtualized applications.

b) Cost Model:

Figure 4.3a shows the additional costs on top of fixed capital investments in traditional computing. In Cloud Computing, only pay-per-use charges apply, and user jobs are outsourced to data centers. To use a cloud, one does not need to buy hardware resources; they can be utilized as the demands of the work require and released after the job is completed.

c) Cloud Ecosystems: With the emergence of Internet clouds, an ecosystem (a complex network of interconnected systems) has evolved. It consists of users, providers and technologies, and is based mainly on open source Cloud Computing tools that let organizations build their own IaaS clouds; private and hybrid clouds are also used. Ex: Amazon EC2.

An ecosystem for private clouds was suggested by scientists as depicted in Figure 4.4.

In the suggested four levels, at the user end, a flexible platform is required by the customers. At the cloud management level, the cloud manager provides virtualized resources to offer the IaaS. At the VI (virtual infrastructure) management level, the manager allocates the VMs to the available multiple clusters. Finally, at the VM management level, the VM managers handle the VMs installed on the individual host machines.

d) Increase of Private Clouds: Private clouds leverage the infrastructure and services already utilized by an organization. Both private and public clouds handle workloads dynamically, but public clouds handle them without communication dependency. Private clouds, on the other hand, can balance workloads to exploit the infrastructure effectively and obtain high performance (HP). The major advantage of private clouds is fewer security problems, whereas public clouds need less investment.

iii) Infrastructure-as-a-Service (IaaS): A model for the different services is shown in Figure 4.5 below. The required service is performed on the rented cloud infrastructure, on which the user can deploy and run his apps. Note that the user does not control the underlying cloud infrastructure but can choose the OS, storage, deployed apps and network components. Ex: Amazon EC2.

iv) Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS)


• Platform-as-a-Service (PaaS): To develop, deploy and manage apps with provisioned resources, a capable platform is needed by the users. Such a platform includes OS and runtime library support. Different PaaS offerings in the current market and other details are highlighted in Table 4.2 below.
It should be noted that a platform cloud is an integrated system consisting of both S/W and H/W. The user doesn't manage the cloud infrastructure but chooses the platform best suited to his choice of apps. The model also encourages third parties to provide software management, integration and service monitoring solutions.
• Software-as-a-Service (SaaS): This refers to browser-initiated application software delivered to thousands of cloud customers. The services and tools offered by PaaS are utilized in the construction and deployment of apps and the management of their resources. The customer needs no upfront investment in servers, and the provider can keep costs low. Customer data is also stored in the cloud and is accessible through other services. Ex: Gmail, Google Docs, Salesforce.com etc.
• Mashup of Cloud Services: Public clouds are used more these days, but private clouds are not far behind. To utilize resources to the maximum and to deploy/remove apps as required, different parts of individual services may need to be mashed up (combined) into a chain of connected activities. Ex: Google Maps, Twitter, Amazon e-commerce, YouTube etc.

II) PUBLIC CLOUD PLATFORMS: Cloud services are provided as per demand by different
companies. It can be seen in Figure 4.19 that there are 5 levels of cloud players.

The app providers at the SaaS level are used mainly by the individual users. Most business
organizations are serviced by IaaS and PaaS providers. IaaS provides compute, storage, and
communication resources to both app providers and organizational users. The cloud
environment is defined by PaaS providers. Note that PaaS supports both IaaS services and organizational users directly.
Cloud services depend upon machine virtualization, SOA, grid infrastructure management and
power efficiency. The provider service charges are much lower than the cost incurred by the
users when replacing damaged servers. The Table 4.5 shows a summary of the profiles of the
major service providers.

PKI=> Public Key Infrastructure; VPN=> Virtual Private Network

a. Google App Engine (GAE): The Google platform is based on its search engine
expertise and is applicable to many other areas (Ex: MapReduce). The Google Cloud
Infrastructure consists of several apps like Gmail, Google Docs, and Google Earth, and can support a large number of users simultaneously, raising the bar for HA (high availability). Other technology achievements of Google include the Google File System (GFS) [similar to HDFS], MapReduce, BigTable, and Chubby (a distributed lock service).
GAE enables users to run their apps on a large number of data centers associated with
Google’s search engine operations. The GAE architecture can be seen in Figure 4.20 [1]
below:

The building blocks of Google’s Cloud Computing app include GFS for storing large amounts
of data, the MapReduce programming framework for developers, Chubby for distributed lock
services and BigTable as a storage service for accessing structural data.
GAE runs the user program on Google's infrastructure, where the user need not worry about the storage or maintenance of data in the servers. It is a combination of several software components, but the frontend is similar to ASP (Active Server Pages), J2EE and JSP.

Functional Modules of GAE:


• Datastore offers OO, distributed and structured data storage services based on
BigTable techniques. This secures data management operations.
• Application Runtime Environment: It is a platform for scalable web programming
and execution. (Supports the languages of Java and Python)
• Software Development Kit: It is used for local app development and test runs of the
new apps.
• Administration Console: Used for easy management of user app development cycles
instead of physical resource management.
• Web Service Infrastructure provides special interfaces to guarantee flexible use and
management of storage and network resources.

The well-known GAE apps are the search engine, Docs, Earth and Gmail. Users linked with one app can interact and interface with other apps through the resources of GAE (synchronization and a single login for all services).
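As a concrete illustration of the Datastore and runtime modules listed above, the following minimal sketch assumes the legacy App Engine Python SDK with the ndb library is available; the Greeting model and its fields are hypothetical names chosen for this example.

# Minimal sketch of using the GAE Datastore through the legacy Python ndb library.
# Assumes the App Engine Python SDK/runtime is available; Greeting and its fields
# are hypothetical names chosen for this example.
from google.appengine.ext import ndb


class Greeting(ndb.Model):
    """An entity stored in the Datastore (backed by BigTable techniques)."""
    content = ndb.StringProperty()                    # user-supplied text
    date = ndb.DateTimeProperty(auto_now_add=True)    # set when the entity is created


def post_greeting(text):
    """Persist a new Greeting entity and return its key."""
    return Greeting(content=text).put()


def latest_greetings(limit=10):
    """Fetch the most recent greetings, newest first."""
    return Greeting.query().order(-Greeting.date).fetch(limit)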

b. Amazon Web Services (AWS): Amazon applies the IaaS model in providing its
services. The Figure 4.21 [1] below shows the architecture of AWS:

EC2 provides the virtualized platforms to host the VMs where the cloud app can run.
S3 (Simple Storage Service) provides the OO storage service for the users.
EBS (Elastic Block Service) provides the block storage interface which can be used to support
traditional apps.
SQS (Simple Queue Service) ensures a reliable message service between two processes.
Amazon offers a RDS (relational database service) with a messaging interface. The AWS
offerings are given below in Table 4.6
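As a hedged illustration of how these services are commonly accessed programmatically, the sketch below uses the boto3 SDK, which is not mentioned in the text above; it is assumed to be installed and configured with valid credentials, and all bucket, queue and AMI identifiers are placeholders.

# Sketch of accessing EC2, S3 and SQS with the boto3 SDK (assumed to be installed
# and configured with valid AWS credentials). All resource names/IDs are placeholders.
import boto3

# S3: object storage -- create a bucket and upload an object.
# (Outside us-east-1 a CreateBucketConfiguration is also required.)
s3 = boto3.client("s3")
s3.create_bucket(Bucket="example-bucket")                      # placeholder bucket name
s3.put_object(Bucket="example-bucket", Key="data/report.txt",
              Body=b"hello from the cloud")

# SQS: reliable messaging between two processes.
sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="example-queue")["QueueUrl"]
sqs.send_message(QueueUrl=queue_url, MessageBody="job-42 ready")

# EC2: launch a virtualized platform (VM) to host a cloud app.
ec2 = boto3.resource("ec2")
ec2.create_instances(ImageId="ami-0123456789abcdef0",          # placeholder AMI ID
                     InstanceType="t2.micro", MinCount=1, MaxCount=1)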

c. MS-Azure: The overall architecture of MS cloud platform, built on its own data
centers, is shown in Figure 4.22. It is divided into 3 major component platforms as it
can be seen. Apps are installed on VMs and Azure platform itself is built on Windows
OS.

• Live Service: Through this, the users can apply MS live apps and data across multiple
machines concurrently.
• .NET Service: This package supports app development on local hosts and execution on cloud
machines.
• SQL Azure: Users can visit and utilize the relational database associated with a SQL server in the cloud.
• SharePoint Service: A scalable platform to develop special business apps.
• Dynamic CRM Service: This provides a business platform for the developers to manage the
CRM apps in financing, marketing, sales and promotions.

III) SERVICE-ORIENTED ARCHITECTURE: SOA is concerned with how to design a software system that makes use of services or apps through their interfaces. These apps are
distributed over the networks. The World Wide Web Consortium (W3C) defines SOA as a form
of distributed architecture characterized by:
• Logical View: The SOA is an abstracted, logical view of actual programs, DBs etc., defined in terms of the operations it carries out. The service is formally defined in terms of the messages exchanged between providers and requesters.
• Message Orientation
• Description Orientation
i. Services and Web Services: In an SOA concept, the s/w capabilities are delivered &
consumed through loosely coupled and reusable services using messages. ‘Web
Service’ is a self-contained modular application designed to be used by other apps
across the web. This can be seen in Figure 5.2.

WSDL => Web Services Description Language
UDDI => Universal Description, Discovery and Integration
SOAP => Simple Object Access Protocol

SOAP: This provides a standard packaging structure for the transmission of XML documents over various Internet protocols (HTTP, SMTP, FTP). A SOAP message consists of an envelope (the root element), which itself contains a header; it also has a body that carries the payload of the message.
WSDL: It describes the interface and a set of operations supported by a web service in a
standard format.
UDDI: This provides a global registry for advertising and discovery of web services by
searching for names, identifiers, categories.
Since SOAP can combine the strengths of XML and HTTP, it is useful for heterogeneous distributed computing environments like grids and clouds.
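To make the SOAP packaging structure concrete, the following minimal sketch builds an envelope and posts it over HTTP using only the Python standard library; the endpoint, namespace and GetTemperature operation are hypothetical.

# Sketch of sending a SOAP request over HTTP. The endpoint, namespace and
# GetTemperature operation are hypothetical; only the envelope structure matters here.
import urllib.request

envelope = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Header/>
  <soap:Body>
    <GetTemperature xmlns="http://example.com/weather">
      <City>Hyderabad</City>
    </GetTemperature>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    "http://example.com/weather-service",            # hypothetical endpoint
    data=envelope.encode("utf-8"),                    # POST body: the SOAP envelope
    headers={"Content-Type": "text/xml; charset=utf-8",
             "SOAPAction": "http://example.com/weather/GetTemperature"},
)
# response = urllib.request.urlopen(request)         # the reply is another SOAP envelope
# print(response.read().decode("utf-8"))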
ii. Enterprise Multitier Architecture: This is a kind of client/server architecture in which application processing and data management are logically separate processes. As seen below in Figure 5.4, it is a three-tier information system where each layer has its own important responsibilities.

Presentation Layer: Presents information to external entities and allows them to interact with
the system by submitting operations and getting responses.
Application Logic (Middleware): These consist of programs that implement actual operations
requested by the client. The middle tier can also be used for user authentication and granting of
resources, thus removing some load from the servers.
Resource Management Layer (Data Layer): It deals with the data sources of an information
system.

iii. OGSA Grid: Open Grid Services Architecture is intended to


• Facilitate the usage of resources across heterogeneous environments
• Deliver best QoS
• Define open interfaces between diverse resources
• Develop inter-operable standards
The OGSA architecture falls into seven broad areas, as shown in Figure 5.5: Infrastructure Services, Execution Management Services, Data Management Services, Resource Management Services, Security Services, Information Services and Self-Management Services (automation).

These services are summarized as follows:


• Infrastructure Services Refer to a set of common functionalities, such as naming,
typically required by higher level services.
• Execution Management Services Concerned with issues such as starting and
managing tasks, including placement, provisioning, and life-cycle management. Tasks
may range from simple jobs to complex workflows or composite services.
• Data Management Services Provide functionality to move data to where it is needed,
maintain replicated copies, run queries and updates, and transform data into new
formats. These services must handle issues such as data consistency, persistency, and
integrity. An OGSA data service is a web service that implements one or more of the
base data interfaces to enable access to, and management of, data resources in a
distributed environment. The three base interfaces, Data Access, Data Factory, and Data
Management, define basic operations for representing, accessing, creating, and
managing data.
• Resource Management Services Provide management capabilities for grid resources:
management of the resources themselves, management of the resources as grid
components, and management of the OGSA infrastructure.
• Security Services Facilitate the enforcement of security-related policies within a
(virtual) organization, and supports safe resource sharing. Authentication, authorization,
and integrity assurance are essential functionalities provided by these services.
• Information Services Provide efficient production of, and access to, information about
the grid and its constituent resources. The term “information” refers to dynamic data or
events used for status monitoring; relatively static data used for discovery; and any data
that is logged.
• Self-Management Services Support service-level attainment for a set of services (or resources), with as much automation as possible, to reduce the costs and complexity of managing the system. These services are essential in addressing the increasing complexity of owning and operating an IT infrastructure.

CLOUD COMPUTING
UNIT IV
Syllabus: Cloud Resource Management and Scheduling: Policies and Mechanisms for
Resource Management, Applications of Control Theory to Task Scheduling on a Cloud, Stability of a
Two-Level Resource Allocation Architecture, Feedback Control Based on Dynamic Thresholds.
Coordination of Specialized Autonomic Performance Managers, Resource Bundling, Scheduling
Algorithms for Computing Clouds, Fair Queuing, Start Time Fair Queuing, Borrowed Virtual Time,
Cloud Scheduling Subject to Deadlines, Scheduling MapReduce Applications Subject to Deadlines.

CLOUD RESOURCE MANAGEMENT AND SCHEDULING

5.1. INTRODUCTION:
Resource management is a core function of any man-made system. It affects the three basic
criteria for the evaluation of a system: performance, functionality, and cost. An inefficient resource
management has a direct negative effect on performance and cost and an indirect effect on the
functionality of a system.
Cloud resource management requires complex policies and decisions for multi-objective
optimization. Cloud resource management is extremely challenging because of the complexity of
the system, which makes it impossible to have accurate global state information, and because of
the unpredictable interactions with the environment.
The strategies for resource management associated with the three cloud delivery models, IaaS,
PaaS, and SaaS, differ from one another.
5.2. POLICIES AND MECHANISMS FOR RESOURCE MANAGEMENT

A policy typically refers to the principal guiding decisions, whereas mechanisms represent the
means to implement policies. Separation of policies from mechanisms is a guiding principle in
computer science.
Cloud resource management policies can be loosely grouped into five classes:
1. Admission control.
2. Capacity allocation.
3. Load balancing.
4. Energy optimization.
5. Quality-of-service (QoS) guarantees
Admission control It is a validation process in communication systems where a check is
performed before a connection is established to see if current resources are sufficient for the
proposed connection. It is a policy to prevent the system from accepting workloads in violation of
high-level system policies.
Capacity allocation means to allocate resources for individual instances; an instance is an
activation of a service.
Load balancing means distributing the workload evenly among the servers.
Energy optimization means minimizing energy consumption.
Load balancing and energy optimization are correlated and affect the cost of providing the
services.
Quality of service is the aspect of resource management that is probably the most difficult to address and, at the same time, possibly the most critical to the future of cloud computing; it is the ability to satisfy timing or other conditions specified by a Service Level Agreement.
The four basic mechanisms for the implementation of resource management policies are:
• Control theory: uses the feedback to guarantee system stability and predict transient
behavior.
• Machine Learning: does not need a performance model of the system.
• Utility based: require a performance model and a mechanism to correlate user-level
performance with cost.
• Market-oriented/economic mechanisms: do not require a model of the system, e.g., combinatorial auctions for bundles of resources.
5.3. APPLICATIONS OF CONTROL THEORY TO TASK SCHEDULING ON A CLOUD

Control theory has been used to design adaptive resource management for many classes of
applications, including power management, task scheduling, QoS adaptation in Web servers, and
load balancing.
The classical feedback control methods are used in all these cases to regulate the key operating
parameters of the system based on measurement of the system output.
A technique to design self-managing systems allows multiple QoS objectives and operating constraints to be expressed as a cost function; it can be applied to stand-alone or distributed Web servers, database servers, high-performance application servers, and even mobile/embedded systems.
Our goal is to illustrate the methodology for optimal resource management based on control
theory concepts. The analysis is intricate and cannot be easily extended to a collection of servers.

Control Theory Principles. Optimal control generates a sequence of control inputs over a look-
ahead horizon while estimating changes in operating conditions. A convex cost function has
arguments x (k), the state at step k, and u(k), the control vector; this cost function is minimized,
subject to the constraints imposed by the system dynamics. The discrete-time optimal control
problem is to determine the sequence of control variables u(i ), u(i + 1), . . . , u(n − 1) to
minimize the expression

J = Φ(n, x(n)) + Σ_{k=i}^{n−1} L_k(x(k), u(k)),

where Φ(n, x(n)) is the cost function of the final step, n, and L_k(x(k), u(k)) is a time-varying cost function at the intermediate step k over the horizon [i, n]. The minimization is subject to the constraints

x(k + 1) = f^k(x(k), u(k)),

where x(k + 1), the system state at time k + 1, is a function of x(k), the state at time k, and of u(k), the input at time k; in general, the function f^k is time-varying; thus, its superscript.

The controller uses the feedback regarding the current state as well as the estimation of the future
disturbance due to environment to compute the optimal inputs over a finite horizon. The two
parameters r and s are the weighting factors of the performance index.
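A minimal sketch of evaluating the finite-horizon cost defined above for a candidate control sequence follows; the dynamics f, the quadratic stage cost (with weights r and s) and the terminal cost Φ used here are toy placeholders, not a model of any particular cloud system.

# Sketch: evaluate J = Phi(n, x(n)) + sum_{k=i}^{n-1} L_k(x(k), u(k)) for a candidate
# control sequence, propagating x(k+1) = f_k(x(k), u(k)). The quadratic stage cost
# below (weights r and s) and the queue-like dynamics are toy placeholders.

def horizon_cost(x0, controls, f, L, Phi, i=0):
    """Propagate the state and accumulate the stage costs plus the terminal cost."""
    x, cost = x0, 0.0
    for k, u in enumerate(controls, start=i):
        cost += L(k, x, u)          # time-varying stage cost L_k(x(k), u(k))
        x = f(k, x, u)              # system dynamics x(k+1) = f_k(x(k), u(k))
    return cost + Phi(len(controls) + i, x)


# Toy example: state = request backlog, control = allocated service capacity.
r, s = 1.0, 0.1                                     # weighting factors of the performance index
f = lambda k, x, u: max(x + 2.0 - u, 0.0)           # arrivals of 2 units/step, served at rate u
L = lambda k, x, u: r * x**2 + s * u**2             # penalize backlog and control effort
Phi = lambda n, x: r * x**2                         # terminal cost

print(horizon_cost(x0=5.0, controls=[3.0, 3.0, 2.0, 2.0], f=f, L=L, Phi=Phi))

An optimal controller would search for the control sequence that minimizes this cost; the sketch only evaluates it for one candidate sequence.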

5.4. STABILITY OF A TWO-LEVEL RESOURCE ALLOCATION ARCHITECTURE

A two-level resource allocation architecture based on control theory concepts can be applied to the entire cloud. The automatic resource management is based on two levels of controllers, one for the service provider and one for the application, as shown below.
The main components of a control system are the inputs, the control system components, and
the outputs.

The system components are sensors used to estimate relevant measures of performance and
controllers that implement various policies; the output is the resource allocations to the individual
applications.

There are three main sources of instability in any control system:

1. The delay in getting the system reaction after a control action.

2. The granularity of the control, the fact that a small change enacted by the controllers leads to
very large changes of the output.

3. Oscillations, which occur when the changes of the input are too large and the control is too
weak, such that the changes of the input propagate directly to the output.

Two types of policies are used in autonomic systems:

i) threshold-based policies and


ii) sequential decision policies based on Markovian decision models.

5.5. FEEDBACK CONTROL BASED ON DYNAMIC THRESHOLDS

The elements involved in a control system are sensors, monitors, and actuators.

The sensors measure the parameter(s) of interest, then transmit the measured values to a
monitor, which determines whether the system behavior must be changed, and, if so, it requests
that the actuators carry out the necessary actions. Often the parameter used for admission
control policy is the current system load; when a threshold, e.g., 80%, is reached, the cloud stops
accepting additional load.

Thresholds:

A threshold is the value of a parameter related to the state of a system that triggers a change in
the system behavior. Thresholds are used in control theory to keep critical parameters of a system
in a predefined range. The threshold could be static, defined once and for all, or it could be
dynamic. A dynamic threshold could be based on an average of measurements carried out over a
time interval, a so-called integral control.

To maintain the system parameters in a given range, a high and a low threshold are often defined.

The essence of the proportional thresholding is captured by the following algorithm (a code sketch follows the steps):


1. Compute the integral value of the high and the low thresholds as averages of the
maximum and, respectively, the minimum of the processor utilization over the process history.

2. Request additional VMs when the average value of the CPU utilization over the current
time slice exceeds the high threshold.

3. Release a VM when the average value of the CPU utilization over the current time slice
falls below the low threshold.
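A minimal sketch of the three steps above; the utilization values and window sizes are invented for illustration.

# Sketch of proportional thresholding: integral (history-averaged) high/low thresholds
# drive requests to add or release VMs. All utilization values below are invented.

def integral_thresholds(history):
    """High/low thresholds = averages of the per-slice maxima and minima of CPU utilization."""
    highs = [max(slice_) for slice_ in history]
    lows = [min(slice_) for slice_ in history]
    return sum(highs) / len(highs), sum(lows) / len(lows)


def scaling_decision(history, current_slice):
    high, low = integral_thresholds(history)
    avg = sum(current_slice) / len(current_slice)   # average utilization over the current slice
    if avg > high:
        return "request additional VM"
    if avg < low:
        return "release a VM"
    return "no action"


history = [[0.55, 0.70, 0.62], [0.48, 0.81, 0.66], [0.52, 0.75, 0.60]]   # past time slices
print(scaling_decision(history, current_slice=[0.82, 0.88, 0.79]))       # -> request additional VM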

5.6. COORDINATION OF SPECIALIZED AUTONOMIC PERFORMANCE MANAGERS

Virtually all modern processors support dynamic voltage scaling (DVS) as a mechanism for energy
saving. Indeed, the energy dissipation scales quadratically with the supply voltage. The power
management controls the CPU frequency and, thus, the rate of instruction execution. For some
compute-intensive workloads the performance decreases linearly with the CPU clock frequency,
whereas for others the effect of lower clock frequency is less noticeable or nonexistent.

The approach to coordinating power and performance management is based on several ideas:

• Use a joint utility function for power and performance. The joint performance-power utility function, U_pp(R, P), is a function of the response time, R, and the power, P, and it can be of the form U_pp(R, P) = U(R) − ε × P, with U(R) the utility function based on response time only and ε a parameter to weight the influence of the two factors, response time and power.

• Identify a minimal set of parameters to be exchanged between the two managers.

• Set up a power cap for individual systems based on the utility-optimized power management
policy.

• Use a standard performance manager modified only to accept input from the power manager
regarding the frequency determined according to the power management policy. The
power manager consists of Tcl (Tool Command Language) and C programs to compute the
per-server (per-blade) power caps and send them via IPMI (Intelligent Platform Management Interface) to the firmware controlling the
blade power. The power manager and the performance manager interact, but no
negotiation between the two agents is involved.
5.7. RESOURCE BUNDLING: Combinatorial Auctions For Cloud Resources

Resources in a cloud are allocated in bundles, allowing users to get maximum benefit from a specific
combination of resources. Indeed, along with CPU cycles, an application needs specific amounts of
main memory, disk space, network bandwidth, and so on. Resource bundling complicates
traditional resource allocation models and has generated interest in economic models and, in
particular, auction algorithms. In the context of cloud computing, an auction is the allocation of
resources to the highest bidder.

Combinatorial Auctions:

Auctions in which participants can bid on combinations of items, or packages, are called
combinatorial auctions. Such auctions provide a relatively simple, scalable, and tractable solution
to cloud resource allocation.
The final auction prices for individual resources are given by the vector p = (p1, p2, . . . , pR), and the amounts of resources allocated to user u are x_u = (x_u1, x_u2, . . . , x_uR). Thus, the expression (x_u)^T · p represents the total price paid by user u for the bundle of resources if the bid is successful. The scalar min_{q∈Qu} (q^T · p) is the final price established through the bidding process.

Pricing and Allocation Algorithms:

A pricing and allocation algorithm partitions the set of users into two disjoint sets, winners and losers, denoted as W and L, respectively (a simple winner-determination sketch follows the list below). The algorithm should:

1. Be computationally tractable. Traditional combinatorial auction algorithms such as Vickrey-Clarke-Groves (VCG) fail this criterion because they are not computationally tractable.
2. Scale well. Given the scale of the system and the number of requests for service,
scalability is a necessary condition.
3. Be objective. Partitioning in winners and losers should only be based on the price πu of a
user’s bid. If the price exceeds the threshold, the user is a winner; otherwise the user is a
loser.
4. Be fair. Make sure that the prices are uniform. All winners within a given resource pool pay
the same price.
5. Indicate clearly at the end of the auction the unit prices for each resource pool.
6. Indicate clearly to all participants the relationship between the supply and the demand in
the system.
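A minimal sketch of the objectivity and uniform-pricing criteria (items 3 to 5 above): a user wins if and only if the bid price covers the cost of the requested bundle at the final uniform unit prices. The resource names, prices and bids below are invented for illustration.

# Sketch of winner determination under uniform pricing: user u wins iff the bid price
# pi_u covers the bundle cost (x_u)^T p at the final unit prices p.
# Resource pools, prices and bids are invented for illustration.

prices = {"cpu": 2.0, "memory": 0.5, "bandwidth": 1.0}       # final unit price per resource pool

bids = {                                                     # user -> (bundle x_u, bid price pi_u)
    "u1": ({"cpu": 4, "memory": 8, "bandwidth": 2}, 15.0),
    "u2": ({"cpu": 2, "memory": 4, "bandwidth": 1}, 5.0),
    "u3": ({"cpu": 8, "memory": 16, "bandwidth": 4}, 30.0),
}


def bundle_cost(bundle, prices):
    """Scalar product (x_u)^T p: total price of the requested bundle."""
    return sum(quantity * prices[resource] for resource, quantity in bundle.items())


winners = {u for u, (bundle, bid) in bids.items() if bid >= bundle_cost(bundle, prices)}
losers = set(bids) - winners
print(winners, losers)    # winners = {'u1', 'u3'}, losers = {'u2'}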

The function to be maximized is

5.8. SCHEDULING ALGORITHMS FOR COMPUTING CLOUDS

Scheduling is a critical component of cloud resource management. Scheduling is responsible for resource sharing/multiplexing at several levels. A server can be shared among several virtual
machines, each virtual machine could support several applications, and each application may
consist of multiple threads. CPU scheduling supports the virtualization of a processor, the
individual threads acting as virtual processors; a communication link can be multiplexed among a
number of virtual channels, one for each flow.

Two distinct dimensions of resource management must be addressed by a scheduling policy:


a) The amount or quantity of resources allocated and
b) The timing when access to resources is granted.
There are multiple definitions of a fair scheduling algorithm. First, we discuss the max-min fairness criterion. Consider a resource with bandwidth B shared among n users who have equal rights. Each user requests an amount bi and receives Bi. Then, according to the max-min criterion, the following conditions must be satisfied by a fair allocation (a progressive-filling sketch follows the conditions):
C1. The amount received by any user is not larger than the amount requested, Bi≤bi .
C2. If the minimum allocation of any user is Bmin no allocation satisfying condition C1 has a higher
Bmin than the current allocation.
C3. When we remove the user receiving the minimum allocation Bmin and then reduce the total
amount of the resource available from B to (B − Bmin ), the condition C2 remains recursively true.
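A minimal sketch of a progressive-filling allocation that satisfies conditions C1–C3 above; the capacity and requests are invented for illustration.

# Sketch of max-min fair allocation by progressive filling: repeatedly share the
# remaining capacity equally among users whose requests are not yet satisfied.
# The capacity B and the requests b_i below are invented for illustration.

def max_min_fair(capacity, requests):
    allocation = [0.0] * len(requests)
    unsatisfied = set(range(len(requests)))
    while unsatisfied and capacity > 1e-12:
        share = capacity / len(unsatisfied)                   # equal share of what is left
        for i in list(unsatisfied):
            grant = min(share, requests[i] - allocation[i])   # never exceed the request (C1)
            allocation[i] += grant
            capacity -= grant
            if allocation[i] >= requests[i] - 1e-12:          # request fully satisfied
                unsatisfied.discard(i)
    return allocation


print(max_min_fair(10.0, [2.0, 2.6, 4.0, 5.0]))   # -> approximately [2.0, 2.6, 2.7, 2.7]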
A fairness criterion for CPU scheduling requires that the amounts of work in the time interval from t1 to t2 of two runnable threads a and b, Ωa(t1, t2) and Ωb(t1, t2), respectively, minimize the expression | Ωa(t1, t2)/wa − Ωb(t1, t2)/wb |, where wa and wb are the weights of the two threads.

FIGURE: Best-effort policies do not impose requirements regarding either the amount of
resources allocated to an application or the timing when an application is scheduled.
Soft-requirements allocation policies require statistically guaranteed amounts and timing
constraints; hard-requirements allocation policies demand strict timing and precise
amounts of resources.
5.9. FAIR QUEUING

When the load exceeds its capacity, a switch starts dropping packets because it has limited input
buffers for the switching fabric and for the outgoing links, as well as limited CPU cycles. A switch
must handle multiple flows and pairs of source-destination endpoints of the traffic.

To address this problem, a fair queuing algorithm was proposed that requires separate queues, one per flow, to be maintained by a switch and that the queues be serviced in a round-robin manner. This
algorithm guarantees the fairness of buffer space management, but does not guarantee fairness of
bandwidth allocation. Indeed, a flow transporting large packets will benefit from a larger
bandwidth.
The fair queuing (FQ) algorithm proposes a solution to this problem. First, it introduces a bit-by-bit round-robin (BR) strategy; as the name implies, in this rather impractical scheme a single bit from each queue is transmitted and the queues are visited in a round-robin fashion. Let R(t) be the number of rounds of the BR algorithm up to time t and N_active(t) be the number of active flows through the switch. Call t_i^a the time when packet i of flow a, of size P_i^a bits, arrives, and call S_i^a and F_i^a the values of R(t) when the first and the last bit, respectively, of packet i of flow a are transmitted. Then,

F_i^a = S_i^a + P_i^a  and  S_i^a = max [ F_{i−1}^a, R(t_i^a) ].

The quantities R(t), N_active(t), S_i^a, and F_i^a depend only on the arrival time of the packets, t_i^a, and not on their transmission time, provided that a flow a is active as long as

R(t) ≤ F_i^a when i = max ( j | t_j^a ≤ t ).


Another approach, for packet-by-packet transmission, uses the following nonpreemptive scheduling rule, which emulates the BR strategy: the next packet to be transmitted is the one with the smallest F_i^a. A preemptive version of the algorithm requires that the transmission of the current packet be interrupted as soon as one with a shorter finishing time, F_i^a, arrives.
A fair allocation of the bandwidth does not have an effect on the timing of the transmission. A possible strategy is to allow less delay for the flows using less than their fair share of the bandwidth. The same paper [102] proposes the introduction of a quantity called the bid, B_i^a, and scheduling the packet transmission based on its value. The bid is defined as

B_i^a = P_i^a + max [ F_{i−1}^a, R(t_i^a) − δ ],

with δ a nonnegative parameter. The properties of the FQ algorithm, as well as the implementation of a nonpreemptive version of the algorithm, are analyzed in the literature.
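A minimal sketch of the tag computation above: for each arriving packet the start and finish tags S and F are computed, and the nonpreemptive rule picks the backlogged packet with the smallest finish tag. The round number R(t) is supplied directly rather than simulated, and the flows and packet sizes are invented.

# Sketch of the FQ tag computation: for packet i of flow a,
#   S_i^a = max(F_{i-1}^a, R(t_i^a))  and  F_i^a = S_i^a + P_i^a.
# R(t) is passed in as a precomputed round number; flows and sizes are invented.

class FairQueue:
    def __init__(self):
        self.last_finish = {}          # flow -> finish tag F of its previous packet
        self.backlog = []              # list of (finish_tag, flow, size)

    def arrive(self, flow, size, round_number):
        start = max(self.last_finish.get(flow, 0.0), round_number)
        finish = start + size
        self.last_finish[flow] = finish
        self.backlog.append((finish, flow, size))
        return start, finish

    def next_packet(self):
        """Nonpreemptive rule: transmit the backlogged packet with the smallest F."""
        self.backlog.sort()
        return self.backlog.pop(0) if self.backlog else None


fq = FairQueue()
fq.arrive("a", size=1000, round_number=0.0)     # F = 1000
fq.arrive("b", size=200, round_number=0.0)      # F = 200  -> transmitted first
fq.arrive("a", size=500, round_number=50.0)     # S = max(1000, 50) = 1000, F = 1500
print(fq.next_packet())                         # -> (200.0, 'b', 200)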

5.10. START TIME FAIR QUEUING

A hierarchical CPU scheduler for multimedia operating systems was proposed in the literature. The basic idea of the start-time fair queuing (SFQ) algorithm is to organize the consumers of the CPU bandwidth in a tree structure; the root node is the processor and the leaves of this tree are the threads of each application. A scheduler acts at each level of the hierarchy. The fraction of the processor bandwidth, B, allocated to an intermediate node i is proportional to its weight w_i relative to its siblings: B_i = B × w_i / Σ_j w_j.
5.11. BORROWED VIRTUAL TIME
5.12. CLOUD SCHEDULING SUBJECT TO DEADLINES

Task Characterization and Deadlines:

• Hard deadlines → if the task is not completed by the deadline, other tasks which depend
on it may be affected and there are penalties; a hard deadline is strict and expressed
precisely as milliseconds, or possibly seconds.

• Soft deadlines→ more of a guideline and, in general, there are no penalties; soft
deadlines can be missed by fractions of the units used to express them. (cloud schedules
are usually in this category)

System Model:

• Aperiodic (irregular) tasks with arbitrarily (randomly) divisible workloads only.


• The application runs on a partition of a cloud, a virtual cloud with a head node (a node that, as in a Hadoop cluster, runs the services that manage the worker nodes) and n worker nodes (the nodes that host the actual processing).
• The system is homogeneous, all workers are identical, and the communication time from
the head node to any worker node is the same.
• The problems to be resolved are:
1. The order of execution of the tasks.
2. The workload partitioning and task mapping to the worker nodes.
Scheduling Policies:

• First in, first out (FIFO) → The tasks are scheduled for execution in the order of their
arrival.
• Earliest deadline first (EDF) → The task with the earliest deadline is scheduled first (see the sketch after this list).
• Maximum workload derivative first (MWF) → The tasks are scheduled in the order of their
derivatives, the one with the highest derivative first. The number n of nodes assigned to
the application is kept to a minimum.
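A minimal sketch of the FIFO and EDF orderings above; the task names, arrival times and deadlines are invented for illustration (MWF is omitted because it needs the workload-derivative model).

# Sketch of FIFO and EDF ordering. Tasks are (name, arrival_time, deadline);
# all values are invented for illustration.

tasks = [("t1", 0.0, 9.0), ("t2", 1.0, 4.0), ("t3", 2.0, 6.0)]

fifo_order = [name for name, arrival, _ in sorted(tasks, key=lambda t: t[1])]
edf_order = [name for name, _, deadline in sorted(tasks, key=lambda t: t[2])]

print("FIFO:", fifo_order)   # -> ['t1', 't2', 't3']  (order of arrival)
print("EDF :", edf_order)    # -> ['t2', 't3', 't1']  (earliest deadline first)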
Workload Partitioning Rules:

• Optimal Partitioning Rule (OPR)→ the workload is partitioned to ensure the earliest
possible completion time and all tasks are required to complete at the same time.
• The head node distributes sequentially the data to individual worker nodes.
• Worker nodes start processing the data as soon as the transfer is complete.

Figure: The timing diagram for the Optimal Partitioning Rule; the algorithm requires
worker nodes to complete execution at the same time. The head node, S0, distributes sequentially
the data to individual worker nodes.
Where
Δ→ time the worker S needs to process a unit of data.
S0→ head node
Si→ worker nodes
Γ→ time to transfer the data
• Equal Partitioning Rule (EPR) → assigns an equal workload to individual worker nodes (a timing sketch follows the figure below).
• The head node distributes sequentially the data to individual worker nodes.
• Worker nodes start processing the data as soon as the transfer is complete.
• The workload is partitioned in equal segments.

The timing diagram for the Equal Partitioning Rule; the algorithm assigns an equal
workload to individual worker nodes.
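A minimal sketch of the Equal Partitioning Rule timing above, assuming Γ and Δ are the per-unit-of-data transfer and processing times (the notation of the figures); the workload size, node count and timing values are invented.

# Sketch of EPR timing: the workload W is split into n equal segments; the head node
# transfers the segments sequentially, so worker i can start processing only after the
# i-th transfer completes. gamma/delta are per-unit-of-data transfer/processing times
# (an assumption based on the figure notation); all values are invented.

def epr_completion_times(W, n, gamma, delta):
    segment = W / n
    return [i * gamma * segment + delta * segment for i in range(1, n + 1)]


times = epr_completion_times(W=1200.0, n=4, gamma=0.01, delta=0.05)
print(times)          # -> [18.0, 21.0, 24.0, 27.0]: each worker finishes later than the previous one
print(max(times))     # -> 27.0, i.e., gamma*W + delta*W/n

In contrast, the OPR chooses unequal segments so that all workers complete execution at the same time.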
5.13. SCHEDULING MAPREDUCE APPLICATIONS SUBJECT TO DEADLINES
MapReduce applications on the cloud subject to deadlines. Several options for scheduling Apache
Hadoop, an open-source implementation of the MapReduce algorithm, are:
• The default FIFO schedule.
• The Fair Scheduler.
• The Capacity Scheduler.
• The Dynamic Proportional Scheduler.
The following table summarizes the notations used for the analysis of Hadoop; the term slots is equivalent to nodes and denotes the number of instances.
We make two assumptions for our initial derivation:
• The system is homogeneous; this means that ρm and ρr, the costs of processing a unit of data by the map and the reduce task, respectively, are the same for all servers.
• Load Equipartition

CLOUD COMPUTING
UNIT-5
SYLLABUS: Storage Systems: Evolution of storage technology, storage models, file systems and databases, distributed file systems, general parallel file systems, Google File System, Apache Hadoop, BigTable, Megastore, Amazon Simple Storage Service (S3)
6.1. STORAGE SYSTEMS
Storage and processing on the cloud are intimately tied to one another.
• Most cloud applications process very large amounts of data. Effective data
replication and storage management strategies are critical to the computations
performed on the cloud.
• Strategies to reduce the access time and to support real-time multimedia access
are necessary to satisfy the requirements of content delivery.
• An ever-increasing number of cloud-based services collect detailed data about
their services and information about the users of these services. The service
providers use the clouds to analyze the data.
• In 2013, humongous amounts of data were expected:
▪ Internet video will generate over 18 EB/month.
▪ Global mobile data traffic will reach 2 EB/month.
▪ EB → Exabyte

A new concept, “big data,” reflects the fact that many applications use data sets so large
that they cannot be stored and processed using local resources.
The consensus is that “big data” growth can be viewed as a three-dimensional
phenomenon; it implies an increased volume of data, requires an increased processing speed
to process more data and produce more results, and at the same time it involves a diversity
of data sources and data types.

Applications in genomics, structural biology, high energy physics, astronomy, meteorology,


and the study of the environment carry out complex analysis of data sets often of the order
of TBs (terabytes). Examples:
▪ In 2010, the four main detectors at the Large Hadron Collider (LHC) (particle
accelerator) produced 13 PB of data.
▪ The Sloan Digital Sky Survey (SDSS) collects about 200 GB of data per night.
6.2. THE EVOLUTION OF STORAGE TECHNOLOGY
The technological capacity to store information has grown over time at an accelerated pace:
• 1986: 2.6 EB; equivalent to less than one 730 MB CD-ROM of data per computer user.
• 1993: 15.8 EB; equivalent to four CD-ROMs per user.
• 2000: 54.5 EB; equivalent to 12 CD-ROMs per user.
• 2007: 295 EB; equivalent to almost 61 CD-ROMs per user.
Hard disk drives (HDD) - during the 1980-2003 period:
▪ Storage density has increased by four orders of magnitude, from about 0.01 Gb/in2 to about 100 Gb/in2.
▪ Prices have fallen by five orders of magnitude to about 1 cent/MB.
▪ HDD densities are projected to climb to 1,800 Gb/in2 by 2016, up from 744 Gb/in2 in
2011.
Dynamic Random-Access Memory (DRAM) - during the period 1990-2003:
▪ The density increased from about 1 Gb/in2 in 1990 to 100 Gb/in2
▪ The cost has tumbled from about $80/MB to less than $1/MB.

6.3. STORAGE MODELS, FILE SYSTEMS, AND DATABASES

A storage model describes the layout of a data structure in physical storage; a data
model captures the most important logical aspects of a data structure in a database. The
physical storage can be a local disk, a removable media, or storage accessible via a network.

Two abstract models of storage are commonly used: cell storage and journal storage.
Cell storage assumes that the storage consists of cells of the same size and that each object
fits exactly in one cell. This model reflects the physical organization of several storage media;
the primary memory of a computer is organized as an array of memory cells, and a secondary
storage device (e.g., a disk) is organized in sectors or blocks read and written as a unit.
read/write coherence and before-or-after atomicity are two highly desirable properties of any
storage model and in particular of cell storage (see Figure).

Journal storage is a fairly elaborate organization for storing composite objects such as
records consisting of multiple fields. Journal storage consists of a manager and cell storage,
where the entire history of a variable is maintained, rather than just the current value.
The user does not have direct access to the cell storage; instead the user can request the
journal manager to (i) start a new action; (ii) read the value of a cell; (iii) write the value of
a cell; (iv) commit an action; or (v) abort an action. The journal manager translates user
requests to commands sent to the cell storage: (i) read a cell; (ii) write a cell; (iii) allocate
a cell; or (iv) deallocate a cell.
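A minimal sketch of a journal manager in front of cell storage, following the action protocol described above (start, read, write, commit, abort); the in-memory dictionaries standing in for the log and the cell storage media are an assumption made for illustration.

# Sketch of journal storage: the user talks only to the journal manager; writes are
# appended to the log and installed into cell storage only on commit (all-or-nothing).
# The in-memory dictionaries stand in for nonvolatile log and cell storage media.

class JournalManager:
    def __init__(self):
        self.log = []              # entire history of all variables (append-only)
        self.cells = {}            # cell storage: current committed values
        self.pending = {}          # action_id -> uncommitted writes

    def start_action(self, action_id):
        self.pending[action_id] = {}

    def read(self, action_id, cell):
        # a read sees the action's own uncommitted writes, otherwise the committed value
        return self.pending[action_id].get(cell, self.cells.get(cell))

    def write(self, action_id, cell, value):
        self.pending[action_id][cell] = value

    def commit(self, action_id):
        for cell, value in self.pending.pop(action_id).items():
            self.log.append((action_id, cell, value))   # record in the log first...
            self.cells[cell] = value                    # ...then install in cell storage

    def abort(self, action_id):
        self.pending.pop(action_id, None)               # discard uncommitted writes


jm = JournalManager()
jm.start_action("A1")
jm.write("A1", "x", 42)
jm.commit("A1")                 # the write is logged, then installed in cell storage
jm.start_action("A2")
print(jm.read("A2", "x"))       # -> 42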

FIGURE: Illustration capturing the semantics of read/write coherence and before-or-after atomicity.

FIGURE: A log contains the entire history of all variables. The log is stored on nonvolatile
media of journal storage. If the system fails after the new value of a variable is stored in
the log but before the value is stored in cell memory, then the value can be recovered from
the log. If the system fails while writing the log, the cell memory is not updated. This
guarantees that all actions are all-or-nothing. Two variables, A and B, in the log and cell
storage are shown. A new value of A is written first to the log and then installed on cell
memory at the unique address assigned to A.

In the context of storage systems, a log contains a history of all variables in cell storage.
The information about the updates of each data item forms a record appended at the end of
the log. A log provides authoritative information about the outcome of an action involving
cell storage; the cell storage can be reconstructed using the log, which can be easily accessed
– we only need a pointer to the last record.

A file system consists of a collection of directories. Each directory provides information


about a set of files. Today high-performance systems can choose among three classes of file
system: network file systems (NFSs), storage area networks (SANs), and parallel file
systems (PFSs). The NFS is very popular and has been used for some time, but it does not
scale well and has reliability problems; an NFS server could be a single point of failure.

Advances in networking technology allow the separation of storage systems from


computational servers; the two can be connected by a SAN. SANs offer additional flexibility
and allow cloud servers to deal with nondisruptive changes in the storage configuration.
Moreover, the storage in a SAN can be pooled and then allocated based on the needs of the
servers; pooling requires additional software and hardware support and represents another
advantage of a centralized storage system. A SAN-based implementation of a file system can
be expensive, since each node must have a Fibre Channel adapter to connect to the network.

Parallel file systems are scalable, are capable of distributing files across a large number of
nodes, and provide a global naming space. In a parallel data system, several I/O nodes serve
data to all computational nodes and include a metadata server that contains information
about the data stored in the I/O nodes. The interconnection network of a parallel file system
could be a SAN.

Most cloud applications do not interact directly with file systems but rather through an
application layer that manages a database. A database is a collection of logically related
records. The software that controls the access to the database is called a database
management system (DBMS). The main functions of a DBMS are to enforce data integrity,
manage data access and concurrency control, and support recovery after a failure.
A DBMS supports a query language, a dedicated programming language used to develop
database applications. Several database models, including the navigational model of the
1960s, the relational model of the 1970s, the object-oriented model of the 1980s, and the
NoSQL model of the first decade of the 2000s, reflect the limitations of the hardware available
at the time and the requirements of the most popular applications of each period.

Most cloud applications are data intensive and test the limitations of the existing
infrastructure. For example, they demand DBMSs capable of supporting rapid application
development and short time to market. At the same time, cloud applications require low
latency, scalability, and high availability and demand a consistent view of the data.

These requirements cannot be satisfied simultaneously by existing database models; for


example, relational databases are easy to use for application development but do not scale
well. As its name implies, the NoSQL model does not support SQL as a query language and
may not guarantee the atomicity, consistency, isolation, durability (ACID) properties of
traditional databases. NoSQL usually guarantees the eventual consistency for transactions

limited to a single data item. The NoSQL model is useful when the structure of the data does
not require a relational model and the amount of data is very large. Several types of NoSQL
database have emerged in the last few years. Based on the way the NoSQL databases store
data, we recognize several types, such as key-value stores, BigTable implementations,
document store databases, and graph databases.

Replication, used to ensure fault tolerance of large-scale systems built with commodity
components, requires mechanisms to guarantee that all replicas are consistent with one
another. This is another example of increased complexity of modern computing and
communication systems due to physical characteristics of components, a topic discussed in
Chapter 10. Section 8.7 contains an in-depth analysis of a service implementing a consensus
algorithm to guarantee that replicated objects are consistent.

6.4. DISTRIBUTED FILE SYSTEMS: THE PRECURSORS


In the 1980s many organizations, including research centers, universities, financial
institutions, and design centers, considered networks of workstations to be an ideal
environment for their operations. Diskless workstations were appealing due to reduced
hardware costs and because of lower maintenance and system administration costs. Soon it
became obvious that a distributed file system could be very useful for the management of a
large number of workstations. Sun Microsystems, one of the main promoters of distributed
systems based on workstations, proceeded to develop the NFS in the early 1980s.

Network File System (NFS). NFS was the first widely used distributed file system; the
development of this application based on the client-server model was motivated by the need
to share a file system among a number of clients interconnected by a local area network.

A majority of workstations were running under Unix; thus, many design decisions for the
NFS were influenced by the design philosophy of the Unix File System (UFS). It is not
surprising that the NFS designers aimed to:

• Provide the same semantics as a local UFS to ensure compatibility with existing
applications.
• Facilitate easy integration into existing UFS.
• Ensure that the system would be widely used and thus support clients running on
different operating systems.
• Accept a modest performance degradation due to remote access over a network with a
bandwidth of several Mbps.

Before we examine NFS in more detail, we have to analyze three important characteristics
of the Unix File System that enabled the extension from local to remote file management:
• The layered design provides the necessary flexibility for the file system; layering allows
separation of concerns and minimization of the interaction among the modules
necessary to implement the system. The addition of the vnode layer allowed the Unix
File System to treat local and remote file access uniformly.
• The hierarchical design supports scalability of the file system; indeed, it allows grouping
of files into special files called directories and supports multiple levels of directories and
collections of directories and files, the so-called file systems. The hierarchical file
structure is reflected by the file-naming convention.
• The metadata supports a systematic rather than an ad hoc design philosophy of the file
system. The so called inodes contain information about individual files and directories.
The inodes are kept on persistent media, together with the data. Metadata includes the
file owner, the access rights, the creation time or the time of the last modification of
the file, the file size, and information about the structure of the file and the persistent
storage device cells where data is stored. Metadata also supports device independence,
a very important objective due to the very rapid pace of storage technology
development.
The logical organization of a file reflects the data model – the view of the data from the
perspective of the application. The physical organization reflects the storage model and
describes the manner in which the file is stored on a given storage medium. The layered
design allows UFS to separate concerns for the physical file structure from the logical one.

Recall that a file is a linear array of cells stored on a persistent storage device. The file
pointer identifies a cell used as a starting point for a read or write operation. This linear array
is viewed by an application as a collection of logical records; the file is stored on a physical
device as a set of physical records, or blocks, of a size dictated by the physical media.

The lower three layers of the UFS hierarchy – the block, the file, and the inode layer –
reflect the physical organization. The block layer allows the system to locate individual blocks
on the physical device; the file layer reflects the organization of blocks into files; and the
inode layer provides the metadata for the objects (files and directories). The upper three
layers – the path name, the absolute path name, and the symbolic path name layer – reflect
the logical organization. The file-name layer mediates between the machine-oriented and the
user-oriented views of the file system (see Figure).

FIGURE: The layered design of the Unix File System separates the physical file structure
from the logical one.

Several control structures maintained by the kernel of the operating system support
file handling by a running process. These structures are maintained in the user area of the
process address space and can only be accessed in kernel mode. To access a file, a process
must first establish a connection with the file system by opening the file. At that time a new
entry is added to the file description table, and the meta-information is brought into another
control structure, the open file table.

A path specifies the location of a file or directory in a file system; a relative path
specifies this location relative to the current/working directory of the process, whereas a full
path, also called an absolute path, specifies the location of the file independently of the
current directory, typically relative to the root directory. A local file is uniquely identified by
a file descriptor (fd), generally an index in the open file table.

The Network File System is based on the client-server paradigm. The client runs on
the local host while the server is at the site of the remote file system, and they interact by
means of remote procedure calls (RPCs) (see Figure 8.4). The API interface of the local file
system distinguishes file operations on a local file from the ones on a remote file and, in the
latter case, invokes the RPC client. Figure 8.5 shows the API for a Unix File System, with the
calls made by the RPC client in response to API calls issued by a user program for a remote
file system as well as some of the actions carried out by the NFS server in response to an
RPC call. NFS uses a vnode layer to distinguish between operations on local and remote files,
as shown in Figure 8.4.

FIGURE 6.4: The NFS client-server interaction. The vnode layer implements file operation
in a uniform manner, regardless of whether the file is local or remote. An operation targeting
a local file is directed to the local file system, whereas one for a remote file involves NFS. An
NFS client packages the relevant information about the target and the NFS server passes it
to the vnode layer on the remote host, which, in turn, directs it to the remote file system.

A remote file is uniquely identified by a file handle (fh) rather than a file descriptor.
The file handle is a 32-byte internal name, a combination of the file system identification, an
inode number, and a generation number. The file handle allows the system to locate the
remote file system and the file on that system; the generation number allows the system to

reuse the inode numbers and ensures correct semantics when multiple clients operate on the
same remote file.
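A minimal sketch of packing such a 32-byte handle; the 16/8/8-byte layout chosen here is an assumption for illustration, not the actual NFS encoding.

# Sketch of building a 32-byte file handle from a file system identification, an inode
# number and a generation number. The 16/8/8-byte layout is an illustrative assumption,
# not the real NFS wire format.
import struct

def make_file_handle(fs_id: bytes, inode: int, generation: int) -> bytes:
    return struct.pack(">16sQQ", fs_id[:16].ljust(16, b"\0"), inode, generation)


fh = make_file_handle(b"fs-volume-01", inode=482193, generation=7)
print(len(fh))   # -> 32 (bytes)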

Andrew File System (AFS). AFS is a distributed file system developed in the late 1980s at
Carnegie Mellon University (CMU) in collaboration with IBM. The designers of the system
envisioned a very large number of workstations interconnected with a relatively small
number of servers; it was anticipated that each individual at CMU would have an Andrew
workstation, so the system would connect up to 10,000 workstations. The set of trusted
servers in AFS forms a structure called Vice. The OS on a workstation, 4.2 BSD Unix,
intercepts file system calls and forwards them to a user-level process called Venus, which
caches files from Vice and stores modified copies of files back on the servers they came from.
Reading and writing from/to a file are performed directly on the cached copy and bypass
Venus; only when a file is opened or closed does Venus communicate with Vice.

The emphasis of the AFS design was on performance, security, and simple
management of the file system. To ensure scalability and to reduce response time, the local
disks of the workstations are used as persistent cache. The master copy of a file residing on
one of the servers is updated only when the file is modified. This strategy reduces the load
placed on the servers and contributes to better system performance.

Another major objective of the AFS design was improved security. The
communications between clients and servers are encrypted, and all file operations require
secure network connections. When a user signs into a workstation, the password is used to
obtain security tokens from an authentication server. These tokens are then used every time
a file operation requires a secure network connection.

The AFS uses access control lists (ACLs) to allow control sharing of the data. An ACL
specifies the access rights of an individual user or group of users. A set of tools supports ACL
management. Another facet of the effort to reduce user involvement in file management is
location transparency. The files can be accessed from any location and can be moved
automatically or at the request of system administrators without user involvement and/or
inconvenience. The relatively small number of servers drastically reduces the efforts related
to system administration because operations, such as backups, affect only the servers,
whereas workstations can be added, removed, or moved from one location to another
without administrative intervention.

FIGURE 6.5: The API of the Unix File System and the corresponding RPC issued by an NFS
client to the NFS server. fd stands for file descriptor, fh for file handle, fname for filename,
dname for directory name, dfh for the directory where the file handle can be found, count
for the number of bytes to be transferred, buf for the buffer to transfer the data to/from,
device for the device on which the file system is located, and fsname for the file system name.

Sprite Network File System (SFS). SFS is a component of the Sprite network operating
system. SFS supports non-write-through caching of files on the client as well as the server
systems. Processes running on all workstations enjoy the same semantics for file access as
they would if they were run on a single system. This is possible due to a cache consistency
mechanism that flushes portions of the cache and disables caching for shared files opened
for read/write operations.

Caching not only hides the network latency, it also reduces server utilization and
obviously improves performance by reducing response time. A file access request made by
a client process could be satisfied at different levels. First, the request is directed to the local
cache; if it’s not satisfied there, it is passed to the local file system of the client. If it cannot

be satisfied locally then the request is sent to the remote server. If the request cannot be
satisfied by the remote server’s cache, it is sent to the file system running on the server.

The design decisions for the Sprite system were influenced by the resources available
at a time when a typical workstation had a 1–2 MIPS processor and 4–14 Mbytes of physical
memory. The main-memory caches allowed diskless workstations to be integrated into the
system and enabled the development of unique caching mechanisms and policies for both
clients and servers. The results of a file-intensive benchmark report show that SFS was 30–
35% faster than either NFS or AFS.
The file cache is organized as a collection of 4 KB blocks; a cache block has a virtual
address consisting of a unique file identifier supplied by the server and a block number in
the file. Virtual addressing allows the clients to create new blocks without the need to
communicate with the server. File servers map virtual to physical disk addresses. Note that
the page size of the virtual memory in Sprite is also 4K.
The size of the cache available to an SFS client or a server system changes
dynamically as a function of the needs. This is possible because the Sprite operating system
ensures optimal sharing of the physical memory between file caching by SFS and virtual
memory management.

6.5. GENERAL PARALLEL FILE SYSTEM


• Developed at IBM in the early 2000s.
• General Parallel File System (IBM GPFS) is a file system used to distribute and manage
data across multiple servers, and is implemented in many high-performance
computing and large-scale storage environments.
• GPFS is among the leading file systems for high performance computing (HPC)
applications.
• Storage used for large supercomputers is often GPFS-based.
• GPFS is also popular for commercial applications requiring high-speed access to large
volumes of data
• Designed for optimal performance of large clusters; it can support a file system of up
to 4 PB consisting of up to 4,096 disks of 1 TB each.
• Maximum file size is (2^63 − 1) bytes.
• A file consists of blocks of equal size, ranging from 16 KB to 1 MB, striped across several disks.

FIGURE 6.6: A GPFS configuration. The disks are interconnected by a SAN and compute
servers are distributed in four LANs, LAN1–LAN4. The I/O nodes/servers are connected to
LAN1.

GPFS reliability:

• Reliability is a major concern in a system with many physical components. To


recover from system failures, GPFS records all metadata updates in a write-ahead
log file.
• Write-ahead → updates are written to persistent storage only after the log records have been written.
• The log files are maintained by each I/O node for each file system it mounts; any
I/O node can initiate recovery on behalf of a failed node.
• Data striping allows concurrent access and improves performance, but can have
unpleasant side-effects. When a single disk fails, a large number of files are
affected.
• The system uses RAID devices with the stripes equal to the block size and dual-
attached RAID controllers.
• To further improve the fault tolerance of the system, GPFS data files as well as
metadata are replicated on two different physical disks.
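
A minimal write-ahead-log sketch in Python (a toy under stated assumptions, nothing
GPFS-specific: the file names, the JSON record format, and the single-node setting are
made up) showing the rule that a log record must reach persistent storage before the
metadata itself is updated, so a failed node’s state can be rebuilt by replaying the log:

import json, os

LOG_PATH, DATA_PATH = "metadata.log", "metadata.json"   # hypothetical file names

def apply(update, state):
    state.update(update)
    return state

def write_update(update):
    # Write-ahead: the log record is forced to disk before the data file is touched,
    # so a crash in between can be repaired by replaying the log.
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps(update) + "\n")
        log.flush()
        os.fsync(log.fileno())
    state = {}
    if os.path.exists(DATA_PATH):
        with open(DATA_PATH) as f:
            state = json.load(f)
    with open(DATA_PATH, "w") as f:
        json.dump(apply(update, state), f)

def recover():
    # After a failure, replaying the log reconstructs the latest metadata state.
    state = {}
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as log:
            for line in log:
                state = apply(json.loads(line), state)
    return state

write_update({"/home/a.txt": {"size": 4096}})
print(recover())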
GPFS distributed locking:
• In GPFS, consistency and synchronization are ensured by a distributed locking
mechanism. A central lock manager grants lock tokens to local lock managers
running in each I/O node. Lock tokens are also used by the cache management
system.
• Lock granularity has important implications on the performance. GPFS uses a
variety of techniques for different types of data.
o Byte-range tokens → used for read and write operations to data files as
follows: the first node attempting to write to a file acquires a token
covering the entire file; this node is allowed to carry out all reads and
writes to the file without any need for permission until a second node
attempts to write to the same file; then, the range of the token given to
the first node is restricted (a toy sketch follows this list).
o Data-shipping → an alternative to byte-range locking, allows fine-grain
data sharing. In this mode the file blocks are controlled by the I/O nodes
in a round-robin manner. A node forwards a read or write operation to the
node controlling the target block, the only one allowed to access the file.
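
A toy Python model of the byte-range token idea (a simplification under assumptions:
one file, writers only, naive range splitting) showing the whole-file token granted to the
first writer being restricted when a second writer appears:

class ByteRangeTokenManager:
    def __init__(self, file_size):
        self.file_size = file_size
        self.tokens = {}                      # node -> (start, end) byte range

    def request_write(self, node, start, end):
        if not self.tokens:                   # first writer: whole-file token
            self.tokens[node] = (0, self.file_size)
            return self.tokens[node]
        # Restrict conflicting tokens so the holder keeps only the part below
        # the newly requested range (naive split, enough for illustration).
        for holder, (s, e) in list(self.tokens.items()):
            if holder != node and s < end and start < e:
                self.tokens[holder] = (s, start)
        self.tokens[node] = (start, end)
        return self.tokens[node]

mgr = ByteRangeTokenManager(file_size=1 << 30)
print(mgr.request_write("node-A", 0, 4096))           # node-A gets the whole file
print(mgr.request_write("node-B", 1 << 20, 2 << 20))  # node-A's token is restricted
print(mgr.tokens["node-A"])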

6.6 GOOGLE FILE SYSTEM


• GFS was developed in the late 1990s; it uses thousands of storage systems built from
inexpensive commodity components to provide petabytes of storage to a large
user community with diverse needs.
• Design considerations.

o Scalability and reliability are critical features of the system; they
must be considered from the beginning rather than at some stage of
the design.
o The vast majority of files range in size from a few GB to hundreds of
TB.
o The most common operation is to append to an existing file; random
write operations to a file are extremely infrequent.
o Sequential read operations are the norm.
o The users process the data in bulk and are less concerned with the
response time.
o The consistency model should be relaxed to simplify the system
implementation, but without placing an additional burden on the
application developers.

GFS – design decisions:


Several design decisions were made as a result of this analysis:
1. Segment a file in large chunks.
2. Implement an atomic file append operation allowing multiple applications operating
concurrently to append to the same file.
3. Build the cluster around a high-bandwidth rather than low-latency interconnection
network. Separate the flow of control from the data flow; schedule the high-bandwidth
data flow by pipelining the data transfer over TCP connections to reduce the response
time. Exploit network topology by sending data to the closest node in the network.
4. Eliminate caching at the client site. Caching increases the overhead for maintaining
consistency among cached copies at multiple client sites and it is not likely to improve
performance.
5. Ensure consistency by channeling critical file operations through a master, a component
of the cluster that controls the entire system.
6. Minimize the involvement of the master in file access operations to avoid hot-spot
contention and to ensure scalability.
7. Support efficient checkpointing and fast recovery mechanisms.
8. Support an efficient garbage-collection mechanism.
GFS chunks:

• GFS files are collections of fixed-size segments called chunks.


• The chunk size is 64 MB.
• A chunk consists of 64 KB blocks, and each block has a 32-bit checksum (a small
addressing sketch follows this list).
• Chunks are stored on Linux file systems and are replicated on multiple sites.
• At the time of file creation each chunk is assigned a unique chunk handle.
• A master controls a large number of chunk servers; it maintains metadata such as
filenames, access control information, the location of all the replicas for every chunk
of each file, and the state of individual chunk servers.
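
A small sketch (Python, illustrative only) of the addressing implied by these sizes: a file
offset maps to a 64 MB chunk index and a 64 KB block index, and each block carries a
32-bit checksum. CRC32 is used here purely as an example of a 32-bit checksum, not as
GFS’s actual algorithm:

import zlib

CHUNK_SIZE = 64 * 1024 * 1024     # 64 MB chunks
BLOCK_SIZE = 64 * 1024            # 64 KB blocks, each checksummed

def locate(offset):
    """Translate a file offset into (chunk index, block index within the chunk)."""
    chunk_index = offset // CHUNK_SIZE
    block_index = (offset % CHUNK_SIZE) // BLOCK_SIZE
    return chunk_index, block_index

def block_checksum(block):
    # 32-bit checksum per 64 KB block (CRC32 chosen only for illustration).
    return zlib.crc32(block) & 0xFFFFFFFF

print(locate(200 * 1024 * 1024))        # (3, 128): offset 200 MB falls in chunk 3
print(hex(block_checksum(b"example")))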

The architecture of a GFS cluster is illustrated in Figure 6.7. A master controls a large
number of chunk servers; it maintains metadata such as filenames, access control
information, the location of all the replicas for every chunk of each file, and the state of
individual chunk servers. Some of the metadata is stored in persistent storage (e.g., the
operation log records the file namespace as well as the file-to-chunk mapping).

The locations of the chunks are stored only in the control structure of the master’s
memory and are updated at system startup or when a new chunk server joins the cluster.
This strategy allows the master to have up-to-date information about the location of the
chunks.

System reliability is a major concern; the operation log maintains a historical
record of metadata changes, enabling the master to recover in case of a failure. Such
changes are atomic and are not made visible to the clients until they have been recorded
on multiple replicas on persistent storage. To recover from a failure, the master replays the
operation log. To minimize the recovery time, the master periodically checkpoints its state
and at recovery time replays only the log records after the last checkpoint.

Each chunk server is a commodity Linux system; it receives instructions from the
master and responds with status information. To access a file, an application sends to the
master the filename and the chunk index (derived from the offset in the file) for the read or
write operation; the master responds with the chunk handle and the location of the chunk.
Then the application communicates directly with the chunk server to carry out the desired
file operation.

The consistency model is very effective and scalable. Operations, such as file creation, are
atomic and are handled by the master. To ensure scalability, the master has minimal
involvement in file mutations and operations such as write or append that occur frequently.
In such cases the master grants a lease for a particular chunk to one of the chunk servers,
called the primary; then, the primary creates a serial order for the updates of that chunk.
When data for a write straddles the chunk boundary, two operations are carried out, one
for each chunk. The steps for a write request illustrate a process that buffers data and
decouples the control flow from the data flow for efficiency (a toy sketch follows these steps):

1. The client contacts the master, which assigns a lease to one of the chunk servers for
a particular chunk if no lease for that chunk exists; then the master replies with the
ID of the primary as well as secondary chunk servers holding replicas of the chunk.
The client caches this information.
2. The client sends the data to all chunk servers holding replicas of the chunk; each
one of the chunk servers stores the data in an internal LRU buffer and then sends an
acknowledgment to the client.
3. The client sends a write request to the primary once it has received the
acknowledgments from all chunk servers holding replicas of the chunk. The primary
identifies mutations by consecutive sequence numbers.
4. The primary sends the write requests to all secondaries.
5. Each secondary applies the mutations in the order of the sequence numbers and
then sends an acknowledgment to the primary.
6. Finally, after receiving the acknowledgments from all secondaries, the primary
informs the client.
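
The following is a toy, in-memory Python sketch of these six steps under simplifying
assumptions (plain objects instead of RPCs, a single chunk, no failures or retries); it is
meant only to show the decoupling of the data flow (push_data) from the control flow
(write), not GFS itself:

class ChunkServer:
    def __init__(self, name):
        self.name, self.buffer, self.chunk = name, {}, []

    def push_data(self, data_id, data):          # step 2: data flow to every replica
        self.buffer[data_id] = data

    def apply(self, seq, data_id):               # steps 4-5: mutations in sequence order
        self.chunk.append((seq, self.buffer.pop(data_id)))
        return "ack"

class Primary(ChunkServer):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries, self.next_seq = secondaries, 0

    def write(self, data_id):                    # step 3: control flow through the primary
        self.next_seq += 1
        self.apply(self.next_seq, data_id)
        acks = [s.apply(self.next_seq, data_id) for s in self.secondaries]
        return "ok" if all(a == "ack" for a in acks) else "retry"   # step 6

secondaries = [ChunkServer("s1"), ChunkServer("s2")]
primary = Primary("p", secondaries)
for server in [primary] + secondaries:           # the client pushes data everywhere first
    server.push_data("d1", b"appended record")
print(primary.write("d1"))                       # the primary orders and forwards the write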

FIGURE 6.7: The architecture of a GFS cluster. The master maintains state information about all system components;
it controls a number of chunk servers. A chunk server runs under Linux; it uses metadata provided by the master to
communicate directly with the application. The data flow is decoupled from the control flow. The data and the control
paths are shown separately, data paths with thick lines and control paths with thin lines. Arrows show the flow of
control among the application, the master, and the chunk servers.

6.7. APACHE HADOOP:


• It is open-source, Java-based software that supports distributed applications handling
extremely large volumes of data.
• Hadoop is used by many organizations from industry, government, and research:
major IT companies (e.g., Apple, IBM, HP, Microsoft, Yahoo, and Amazon), media
companies (e.g., The New York Times and Fox), social networks (Twitter,
Facebook, and LinkedIn), and government agencies such as the Federal Reserve.
• A Hadoop system has two components, a MapReduce engine and a database;
the database could be the Hadoop Distributed File System (HDFS), Amazon’s S3, or
CloudStore, an open-source implementation of GFS.
• HDFS is a distributed file system written in Java; it is portable. HDFS holds very
large amounts of data and provides easy access to it. To store such huge data, files
are spread across multiple machines and stored redundantly to protect the system
from possible data loss in case of failure. HDFS also makes the data available for
parallel processing.
• The Hadoop engine on the master of a multinode cluster consists of a job tracker
and a task tracker, whereas the engine on a slave has only a task tracker.
• The job tracker receives a MapReduce job from a client and dispatches the work to
the task trackers running on the nodes of a cluster.
• To increase efficiency, the job tracker attempts to dispatch the tasks to available
slaves closest to the place where the task data is stored.
• The task tracker supervises the execution of the work allocated to the node.
• The name node running on the master manages the data distribution and data
replication and communicates with data nodes running on all cluster nodes; it shares
with the job tracker information about data placement to minimize communication
between the nodes on which data is located and the ones where it is needed.

FIGURE 6.8: A Hadoop cluster using HDFS. The cluster includes a master and four slave
nodes. Each node runs a MapReduce engine and a database engine, often HDFS. The job
tracker of the master’s engine communicates with the task trackers on all the nodes and
with the name node of HDFS. The name node of HDFS shares information about data
placement with the job tracker to minimize communication between the nodes on which
data is located and the ones where it is needed.
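
To make the map/reduce flow concrete, here is a hedged word-count sketch in the Hadoop
Streaming style, where the mapper and reducer are ordinary programs reading stdin and
writing stdout; the file name wordcount.py and the job command at the end are illustrative
assumptions, not part of the Hadoop distribution:

# wordcount.py - run as "python wordcount.py map" or "python wordcount.py reduce".
import sys

def mapper(lines):
    # Emit "<word>\t1" for every word; the framework sorts by key between phases.
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    # Input arrives sorted by key, so all counts for a word are adjacent.
    current, count = None, 0
    for line in lines:
        word, n = line.rstrip("\n").split("\t")
        if current is not None and word != current:
            print(f"{current}\t{count}")
            count = 0
        current = word
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if role == "map" else reducer)(sys.stdin)

# Illustrative job submission (paths and jar name are assumptions):
# hadoop jar hadoop-streaming.jar -input in/ -output out/ \
#   -mapper "python wordcount.py map" -reducer "python wordcount.py reduce"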

6.8. LOCKS AND CHUBBY: A LOCKING SERVICE:

• The Chubby lock service is designed for use within a loosely coupled distributed system
consisting of moderately large numbers of small machines connected by a high-
speed network.
• Chubby allows the election of a master, allows the master to discover the servers it
controls, and permits clients to find the master.
• File locking is a mechanism that allows only one process to access a file at any
specific time. By using a file locking mechanism, many processes can read/write a
single file in a safe way.
Consider the following example to understand why file locking is required (a minimal
advisory-locking sketch follows the example).
1. Process “A” opens and reads a file which contains account related information.
2. Process “B” also opens the file and reads the information in it.
3. Now Process “A” changes the account balance of a record in its copy, and writes it
back to the file.
4. Process “B” which has no way of knowing that the file is changed since its last read,
has the stale original value. It then changes the account balance of the same record,
and writes back into the file.
5. Now the file will have only the changes done by process “B”.
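
A minimal sketch of how an advisory lock prevents the lost update above, using Python’s
POSIX fcntl module (assumptions: a Unix-like system and an existing account.txt holding a
single integer balance; being advisory, it works only if both processes cooperate by locking):

import fcntl

def update_balance(path, delta):
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # advisory: effective only if A and B both lock
        try:
            balance = int(f.read().strip() or 0)
            f.seek(0)
            f.write(str(balance + delta))
            f.truncate()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

# If processes A and B both call update_balance("account.txt", ...), their
# read-modify-write cycles are serialized and neither update is lost.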

THE PAXOS ALGORITHM:


• Used to reach consensus on sets of values, e.g., the sequence of entries in
a replicated log.
• The phases of the algorithm.
o Elect a replica to be the master/coordinator. When a master fails,
several replicas may decide to assume the role of a master; to ensure
that the result of the election is unique each replica generates a
sequence number larger than any sequence number it has seen, in
the range (1,r) where r is the number of replicas, and broadcasts it
in a propose message. The replicas which have not seen a higher
sequence number broadcast a promise reply and declare that they
will reject proposals from other candidate masters; if the number of
respondents represents a majority of replicas, the one who sent the
propose message is elected as the master.
o The master broadcasts to all replicas an accept message including
the value it has selected and waits for replies, either acknowledge or
reject.
o Consensus is reached when the majority of the replicas send the
acknowledge message; then the master broadcasts the commit
message.

Locks
Advisory locks→ Advisory locking requires cooperation from the participating processes.
Mandatory locks→ Mandatory locking doesn’t require cooperation from the participating
processes. Mandatory locking causes the kernel to check every open, read, and write to
verify that the calling process isn’t violating a lock on the given file.
Fine-grained locks → locks that can be held for only a very short time.
Coarse-grained locks → locks held for a longer time.

The question of how to most effectively support a locking and consensus component
of a large-scale distributed system demands several design decisions. A first design
decision is whether the locks should be mandatory or advisory. Mandatory locks have the
obvious advantage of enforcing access control; a traffic analogy is that a mandatory lock is
like a drawbridge. Once it is up, all traffic is forced to stop.
An advisory lock is like a stop sign; those who obey the traffic laws will stop, but
some might not. The disadvantages of mandatory locks are added overhead and less
flexibility. Once a data item is locked, even a high-priority task related to maintenance or
recovery cannot access the data unless it forces the application holding the lock to terminate.
This is a very significant problem in large-scale systems where partial system failures are
likely.

A second design decision is whether the system should be based on fine-grained
or coarse-grained locking. Fine-grained locks allow more application threads to access shared
data in any time interval, but they generate a larger workload for the lock server. Moreover,
when the lock server fails for a period of time, a larger number of applications are affected.
Advisory locks and coarse-grained locks seem to be a better choice for a system expected to
scale to a very large number of nodes distributed in data centers that are interconnected via
wide area networks.

A third design decision is how to support a systematic approach to locking. Two
alternatives come to mind: (i) delegate to the clients the implementation of the consensus
algorithm and provide a library of functions needed for this task, or (ii) create a locking
service that implements a version of the asynchronous Paxos algorithm and provide a library
to be linked with an application client to support service calls. Forcing application developers
to invoke calls to a Paxos library is more cumbersome and more prone to errors than the
service alternative. Of course, the lock service itself has to be scalable to support a potentially
heavy load.

In the early 2000s, when Google started to develop a lock service called Chubby, it was
decided to use advisory and coarse-grained locks. The service is used by several Google
systems, including the GFS and BigTable.

FIGURE 6.9: A Chubby cell consisting of five replicas, one of which is elected as a
master; n clients use RPCs to communicate with the master.

The basic organization of the system is shown in Figure 6.9. A Chubby cell typically serves
one data center. The cell server includes several replicas, the standard number of which is
five. To reduce the probability of correlated failures, the servers hosting replicas are
distributed across the campus of a data center.

The replicas use a distributed consensus protocol to elect a new master when the current
one fails. The master is elected by a majority, as required by the asynchronous Paxos
algorithm, accompanied by the commitment that a new master will not be elected for a
period called a master lease. A session is a connection between a client and the cell server
maintained over a period of time; the data cached by the client, the locks acquired, and the
handles of all files locked by the client are valid for only the duration of the session.

Clients use RPCs to request services from the master. When it receives a write request,
the master propagates the request to all replicas and waits for a reply from a majority of
replicas before responding. When it receives a read request the master responds without
consulting the replicas. The client interface of the system is similar to, yet simpler than, the
one supported by the Unix File System. In addition, it includes notification of events related
to file or system status. A client can subscribe to events such as file content modification,
change or addition of a child node, master failure, lock acquired, conflicting lock requests,
and invalid file handle.

The files and directories of the Chubby service are organized in a tree structure and use a
naming scheme similar to that of Unix. Each file has a file handle similar to the file descriptor.
The master of a cell periodically writes a snapshot of its database to a GFS file server.

We now take a closer look at the actual implementation of the service. As pointed out
earlier, Chubby locks and Chubby files are stored in a database, and this database is
replicated. The architecture of these replicas shows that the stack consists of the Chubby
component, which implements the Chubby protocol for communication with the clients, and
the active components, which write log entries and files to the local storage of the replica
(see Figure 6.10).

Recall that an atomicity log for a transaction-processing system allows a crash recovery
procedure to undo all-or-nothing actions that did not complete or to finish all-or-nothing
actions that committed but did not record all of their effects. Each replica maintains its own
copy of the log; a new log entry is appended to the existing log and the Paxos algorithm is
executed repeatedly to ensure that all replicas have the same sequence of log entries.

FIGURE 6.10: Chubby replica architecture. The Chubby component implements the
communication protocol with the clients. The system includes a component to transfer files
to a fault-tolerant database and a fault-tolerant log component to write log entries. The
fault-tolerant log uses the Paxos protocol to achieve consensus. Each replica has its own
local file system; replicas communicate with one another using a dedicated interconnect
and communicate with clients through a client network.

The next element of the stack is responsible for the maintenance of a fault-tolerant
database – in other words, making sure that all local copies are consistent. The database
consists of the actual data, or the local snapshot in Chubby speak, and a replay log to allow
recovery in case of failure. The state of the system is also recorded in the database.

The Paxos algorithm is used to reach consensus on sets of values (e.g., the sequence of
entries in a replicated log). To ensure that the Paxos algorithm succeeds in spite of the
occasional failure of a replica, the following three phases of the algorithm are executed
repeatedly:

1. Elect a replica to be the master/coordinator. When a master fails, several replicas
may decide to assume the role of a master. To ensure that the result of the election
is unique, each replica generates a sequence number larger than any sequence
number it has seen, in the range (1, r), where r is the number of replicas, and
broadcasts it in a propose message. The replicas that have not seen a higher
sequence number broadcast a promise reply and declare that they will reject
proposals from other candidate masters. If the number of respondents represents a
majority of replicas, the one that sent the propose message is elected master.
2. The master broadcasts to all replicas an accept message, including the value it has
selected, and waits for replies, either acknowledge or reject.
3. Consensus is reached when the majority of the replicas send an acknowledge
message; then the master broadcasts the commit message.

Implementation of the Paxos algorithm is far from trivial. Although the algorithm can be
expressed in as few as ten lines of pseudocode, its actual implementation could be several
thousand lines of C++ code. Moreover, practical use of the algorithm cannot ignore the wide
variety of failure modes, including algorithm errors and bugs in its implementation, and
testing a software system of a few thousand lines of code is challenging.
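
As a flavor of why even the first phase takes care to get right, here is a toy, single-round
Python sketch of the election step only (assumptions: reliable in-process message delivery,
no failures, no replicated log); it is nowhere near a usable Paxos implementation:

class Replica:
    def __init__(self):
        self.highest_seen = 0              # highest proposal number seen so far

    def on_propose(self, seq):
        if seq > self.highest_seen:        # promise, and reject lower proposals later
            self.highest_seen = seq
            return "promise"
        return "reject"

def elect_master(replicas, candidate_seq):
    replies = [r.on_propose(candidate_seq) for r in replicas]
    return replies.count("promise") > len(replicas) // 2   # majority of promises?

replicas = [Replica() for _ in range(5)]
print(elect_master(replicas, candidate_seq=7))   # True: a majority promised
print(elect_master(replicas, candidate_seq=3))   # False: a higher number was already seen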

6.9. TRANSACTION PROCESSING AND NOSQL DATABASES:


Many cloud services are based on online transaction processing (OLTP) and operate
under tight latency constraints. Moreover, these applications have to deal with extremely
high data volumes and are expected to provide reliable services for very large communities
of users. It did not take very long for companies heavily involved in cloud computing, such
as Google and Amazon, e-commerce companies such as eBay, and social media networks
such as Facebook, Twitter, or LinkedIn, to discover that traditional relational databases are
not able to handle the massive amount of data and the real-time demands of online
applications that are critical for their business models.
• OLTP (Online Transactional Processing) is a category of data processing that is
focused on transaction-oriented tasks.

• OLTP typically involves inserting, updating, and/or deleting small amounts of data in
a database.
• OLTP mainly deals with large numbers of transactions by a large number of users.
• Examples of OLTP transactions include:
• Online banking
• Purchasing a book online
• Booking an airline ticket
• Sending a text message
• Order entry

A major concern for the designers of OLTP systems is to reduce the response time.
The term memcaching refers to a general-purpose distributed memory system that caches
objects in main memory (RAM); the system is based on a very large hash table distributed
across many servers. The memcached system is based on a client-server architecture and
runs under several operating systems, including Linux, Unix, Mac OS X, and Windows. The
servers maintain a key-value associative array. The API allows the clients to add entries to
the array and to query it. A key can be up to 250 bytes long, and a value can be no larger
than 1 MB. The memcached system uses an LRU cache-replacement strategy.
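
A hedged usage example with the pymemcache client library (assumptions: a memcached
server running locally on the default port 11211, and made-up key names):

from pymemcache.client.base import Client

client = Client(("localhost", 11211))
client.set("user:42:profile", b"cached profile blob", expire=60)  # value up to 1 MB
print(client.get("user:42:profile"))   # served from RAM, avoiding a database hit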

Scalability is the other major concern for cloud OLTP applications and implicitly for
datastores. There is a distinction between vertical scaling, where the data and the workload
are distributed to systems that share resources such as cores and processors, disks, and
possibly RAM, and horizontal scaling, where the systems do not share either primary or
secondary storage.

Sources of OLTP overhead


Four sources with equal contribution:
1. Logging – expensive because traditional databases require transaction durability;
thus, every write to the database can only be completed after the log has been
updated.
2. Locking - to guarantee atomicity, transactions lock every record and this requires
access to a lock table.
3. Latching – many operations require multi-threading and the access to shared data
structures, such as lock tables, demands short-term latches for coordination. A latch
is a counter that triggers an event when it reaches zero; for example, a master
thread initiates a counter with the number of worker threads and waits to be notified
when all of them have finished.
4. Buffer management.

NOSQL DATABASES:
• NoSQL (originally referring to non-SQL or non-relational) is a type of database that
provides a mechanism for storage and retrieval of data.
• It is an alternative to traditional relational databases, in which data is placed in tables
and the data schema is carefully designed before the database is built. NoSQL databases
are especially useful for working with large sets of distributed data.
• The NoSQL model is useful when the structure of the data does not require a
relational model and the amount of data is very large.
• Does not support SQL as a query language.
• May not guarantee the ACID (Atomicity, Consistency, Isolation, Durability)
properties of traditional databases; it usually guarantees the eventual
consistency for transactions limited to a single data item.

6.10. BIGTABLE:
• Distributed storage system developed by Google to
o store massive amounts of data.
o scale up to thousands of storage servers.
• The system uses
o Google File System → to store user data and system information.
o Chubby distributed lock service → to guarantee atomic read and write
operations; the directories and the files in the namespace of Chubby are used
as locks.
• Data is kept in order by row key, and the map is indexed by row key, column key,
and timestamp. Compression algorithms help achieve high capacity.
• Google Bigtable serves as the database for applications such as the Google App
Engine Datastore, Google Personalized Search, Google Earth and Google Analytics.

The organization of a BigTable (see Figure 6.11) shows a sparse, distributed,
multidimensional map for an email application. The system consists of three major
components: a library linked to application clients to access the system, a master server,
and a large number of tablet servers. The master server controls the entire system, assigns
tablets to tablet servers and balances the load among them, manages garbage collection,
and handles table and column family creation and deletion.

Internally, the space management is ensured by a three-level hierarchy: the root
tablet, the location of which is stored in a Chubby file, points to entries in the second element,
the metadata tablet, which, in turn, points to user tablets, collections of locations of users’
tablets. An application client searches through this hierarchy to identify the location of its
tablets and then caches the addresses for further use.

FIGURE 6.11: A BigTable example. The organization of an email application as a sparse,
distributed, multidimensional map. The slice of the BigTable shown consists of a row with
the key “UserId” and three family columns. The “Contents” key identifies the cell holding
the contents of emails received, the cell with key “Subject” identifies the subject of emails,
and the cell with the key “Reply” identifies the cell holding the replies. The versions of
records in each cell are ordered according to their time stamps. The row keys of this
BigTable are ordered lexicographically. A column key is obtained by concatenating the
family and the qualifier fields. Each value is an uninterpreted array of bytes.
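
A sketch of such a sparse (row key, column key, timestamp) → value map in Python; this is
purely illustrative of the data model in the caption (nested dictionaries, newest version
first), not the Bigtable API:

import time
from collections import defaultdict

table = defaultdict(dict)          # row key -> {column key -> [(timestamp, value)]}

def put(row, family, qualifier, value, ts=None):
    column = f"{family}:{qualifier}"           # column key = family + qualifier
    cell = table[row].setdefault(column, [])
    cell.append((ts or time.time(), value))
    cell.sort(reverse=True)                    # versions ordered by time stamp, newest first

def get(row, family, qualifier):
    versions = table[row].get(f"{family}:{qualifier}", [])
    return versions[0][1] if versions else None   # latest version

put("UserId-1001", "Contents", "msg-1", b"Hello")
put("UserId-1001", "Contents", "msg-1", b"Hello (edited)")
print(get("UserId-1001", "Contents", "msg-1"))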

6.11 MEGASTORE:
• Scalable storage for online services. Widely used internally at Google, it handles
some 23 billion transactions daily, 3 billion write and 20 billion read transactions.
• The system, distributed over several data centers, has a very large capacity, 1 PB in
2011, and it is highly available.
• Each partition is replicated in data centers in different geographic areas.
• Megastore is a storage system developed to meet the storage requirements of
today's interactive online services. It is novel in that it blends the scalability of a
NoSQL datastore with the convenience of a traditional RDBMS.

The basic design philosophy of the system is to partition the data into entity groups and
replicate each partition independently in data centers located in different geographic areas.
The system supports full ACID semantics within each partition and provides limited
consistency guarantees across partitions (see Figure 6.12). Megastore supports only those
traditional database features that allow the system to scale well and that do not drastically
affect the response time.

Another distinctive feature of the system is the use of the Paxos consensus algorithm, to
replicate primary user data, metadata, and system configuration information across data
centers and for locking. The version of the Paxos algorithm used by Megastore does not
require a single master. Instead, any node can initiate read and write operations to a write-
ahead log replicated to a group of symmetric peers.

The entity groups are application-specific and store together logically related data. For
example, an email account could be an entity group for an email application. Data should be
carefully partitioned to avoid excessive communication between entity groups. Sometimes it
is desirable to form multiple entity groups, as in the case of blogs.

The middle ground between traditional and NoSQL databases taken by the Megastore
designers is also reflected in the data model. The data model is declared in a schema
consisting of a set of tables composed of entities, each entity being a collection of named and
typed properties. The unique primary key of an entity in a table is created as a composition
of entity properties. A Megastore table can be a root or a child table. Each child entity must
reference a special entity, called a root entity in its root table. An entity group consists of
the primary entity and all entities that reference it.

The system makes extensive use of BigTable. Entities from different Megastore tables can
be mapped to the same BigTable row without collisions. This is possible because the BigTable
column name is a concatenation of the Megastore table name and the name of a property.
A BigTable row for the root entity stores the transaction and all metadata for the entity
group. Megastore takes advantage of this feature to implement multi-version concurrency
control (MVCC); when a mutation of a transaction occurs, this mutation is recorded along
with its time stamp, rather than marking the old data as obsolete and adding the new version.
This strategy has several advantages: read and write operations can proceed concurrently,
and a read always returns the last fully updated version.
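
A simplified, single-cell sketch of the multi-version idea in Python (assumptions: integer
timestamps, no garbage collection of old versions): each write adds a time-stamped version
instead of overwriting, and a read returns the latest version visible at the requested time:

class MVCCCell:
    def __init__(self):
        self.versions = []                 # list of (timestamp, value), kept sorted

    def write(self, timestamp, value):
        self.versions.append((timestamp, value))
        self.versions.sort()

    def read(self, as_of=None):
        # Return the newest version not younger than the requested timestamp.
        visible = [v for t, v in self.versions if as_of is None or t <= as_of]
        return visible[-1] if visible else None

cell = MVCCCell()
cell.write(10, "v1")
cell.write(20, "v2")
print(cell.read())          # 'v2' - the last fully updated version
print(cell.read(as_of=15))  # 'v1' - a read at an older timestamp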

A write transaction involves the following steps: (1) Get the timestamp and the log position
of the last committed transaction. (2) Gather the write operations in a log entry. (3) Use the
consensus algorithm to append the log entry and then commit. (4) Update the BigTable
entries. (5) Clean up.

FIGURE 6.12: Megastore organization. The data is partitioned into entity groups; full
ACID semantics within each partition and limited consistency guarantees across
partitions are supported. A partition is replicated across data centers in different
geographic areas.

6.12 AMAZON SIMPLE STORAGE SERVICE (S3):

Amazon Simple Storage Service (S3) is a scalable, high-speed, low-cost web-based
service designed for online backup and archiving of data and application programs. It allows
uploading, storing, and downloading any type of file up to 5 GB in size. This service allows
subscribers to access the same systems that Amazon uses to run its own web sites. The
subscriber has control over the accessibility of the data, i.e., whether it is privately or publicly
accessible.

Amazon S3 provides a simple web services interface that can be used to store and
retrieve any amount of data, at any time, from anywhere on the web. S3 provides an object-
oriented storage service for users. Users can access their objects through the Simple Object
Access Protocol (SOAP) with either browsers or other client programs that support SOAP.
Amazon Simple Queue Service (SQS) is responsible for ensuring a reliable message service
between two processes, even if the receiver processes are not running. The following figure
shows the S3 execution
environment.

Fig: Amazon S3 Execution Environment


The fundamental operation unit of S3 is called an object. Each object is stored in a
bucket and retrieved via a unique, developer-assigned key. In other words, the bucket is the
container of the object. Besides unique key attributes, the object has other attributes such
as values, metadata, and access control information. From the programmer’s perspective,
the storage provided by S3 can be viewed as a very coarse-grained key-value pair. Through
the key-value programming interface, users can write, read, and delete objects containing
from 1 byte to 5 gigabytes of data each. There are two types of web service interface for the
user to access the data stored in Amazon clouds. One is a REST (web 2.0) interface, and the
other is a SOAP interface.
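
A hedged example of the key-value interface using the boto3 library over the REST API
(assumptions: configured AWS credentials, an existing bucket, and made-up bucket and key
names):

import boto3

s3 = boto3.client("s3")
s3.put_object(Bucket="my-example-bucket", Key="backups/notes.txt",
              Body=b"object value, from 1 byte up to 5 GB")          # write an object
obj = s3.get_object(Bucket="my-example-bucket", Key="backups/notes.txt")
print(obj["Body"].read())                                             # read it back
s3.delete_object(Bucket="my-example-bucket", Key="backups/notes.txt") # delete it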
Key features of S3:
o Redundant through geographic dispersion.
o Designed to provide 99.999999999 percent durability and 99.99 percent availability
of objects over a given year with cheaper reduced redundancy storage (RRS).
o Authentication mechanisms to ensure that data is kept secure from unauthorized
access. Objects can be made private or public, and rights can be granted to specific
users.
o Per-object URLs and ACLs (access control lists).
o Default download protocol of HTTP. A BitTorrent protocol interface is provided to
lower costs for high-scale distribution.
o $0.055 (for more than 5,000 TB) to $0.15 per GB per month for storage (depending
on the total amount stored).
o The first 1 GB per month of input or output is free; after that, transfers outside an
S3 region cost $0.08 to $0.15 per GB.
o There is no data transfer charge for data transferred between Amazon EC2 and
Amazon S3 within the same region.

o Low cost and Easy to Use − Using Amazon S3, the user can store a large amount
of data at very low charges.
o Secure − Amazon S3 supports data transfer over SSL and the data gets encrypted
automatically once it is uploaded. The user has complete control over their data by
configuring bucket policies using AWS IAM.
o Scalable − With Amazon S3, there is no need to worry about storage capacity; we
can store as much data as needed and access it anytime.
o Higher performance − Amazon S3 is integrated with Amazon CloudFront, which
distributes content to end users with low latency and provides high data transfer
speeds without any minimum usage commitments.
o Integrated with AWS services − Amazon S3 integrates with AWS services including
Amazon CloudFront, Amazon CloudWatch, Amazon Kinesis, Amazon RDS, Amazon
Route 53, Amazon VPC, AWS Lambda, Amazon EBS, Amazon DynamoDB, etc.
