Final Distributed Systems
2: What are the main design goals for building a distributed system? Explain
transparency, scalability, and reliability in detail.
The main design goals for building a distributed system are to make it easy to use,
powerful, and robust. These goals can be summarized as achieving Transparency,
Scalability, and Reliability.
1. Transparency
Transparency (also called "invisibility") is the concealment of the separation of
components in a distributed system from the user and the application programmer.
The aim is to make the system appear as a single computer.
● Access Transparency: Hides differences in data representation and how a
resource is accessed. Example: A user accesses a file without knowing if it's
stored on a Linux (ext4) or Windows (NTFS) file system.
● Location Transparency: Hides where a resource is physically located.
Example: A URL like http://www.google.com/index.html does not reveal the
server's actual IP address or location.
● Replication Transparency: Hides the fact that a resource is replicated
(copied) across multiple computers. The user interacts with what they believe
is a single copy.
● Failure Transparency: Hides the failure and recovery of a resource. If a
server fails, the user's request might be automatically redirected to a backup
server without the user ever knowing.
2. Scalability
Scalability is the ability of the system to handle a growing amount of work by adding
resources. A scalable system maintains its performance and efficiency even as the
number of users, objects, or computers increases.
● Size Scalability: The ability to add more users and resources without
affecting performance. Example: Gmail can support millions of new users
without slowing down for existing ones.
● Geographical Scalability: The system remains effective even when users
and resources are far apart. This is challenging due to communication delays
(latency). Example: Content Delivery Networks (CDNs) place data closer to
users worldwide to reduce latency.
● Administrative Scalability: The system remains easy to manage even if it
spans multiple independent administrative organizations with different policies.
3. Reliability (and Availability)
Reliability ensures that the system can be trusted to work correctly and continue
functioning even when faults occur.
● Availability: The property of a system being ready for immediate use. It is the
fraction of time the system is operational. A highly available system is
accessible most of the time (e.g., 99.999% uptime).
● Fault Tolerance: The ability of a system to provide its services even in the
presence of faults (e.g., server crashes, network disconnections). This is often
achieved through redundancy—having backup components or data.
3: Explain the different aspects of distribution transparency in brief.
Distribution transparency is the property of a distributed system that hides its
distributed nature from users and programmers. The goal is to make the system
look and feel like a single, centralized system.
The key aspects (or types) of transparency are:
1. Access Transparency: Hides differences in data representation and the way
resources are accessed. For example, a program should be able to access a
local file and a remote file using the exact same read() or write() operations.
2. Location Transparency: Hides the physical location of a resource. Users and
applications do not need to know the IP address or host name where a
resource (like a file or database) resides. They use a logical name, and the
system resolves it.
3. Migration (or Mobility) Transparency: Hides the fact that a resource or a
process may move from one physical location to another while it is in use. For
example, a mobile user's session should continue uninterrupted even when
their device switches from a Wi-Fi network to a cellular network.
4. Replication Transparency: Hides the existence of multiple copies (replicas)
of a resource. The user interacts with the resource as if it were a single entity,
while the system manages the replicas in the background for performance and
reliability.
5. Concurrency Transparency: Hides the fact that a resource may be shared
by several competing users simultaneously. The system ensures that
concurrent access is managed correctly (e.g., through locking mechanisms) so
that the resource remains in a consistent state.
6. Failure Transparency: Hides the failure and recovery of a component. If a
server fails, the system might automatically redirect requests to a backup
server without the user noticing anything more than a slight delay.
| 5. Fault Tolerance | The failure of one node does not affect others, but services on the failed node become unavailable. | Higher fault tolerance is a key design goal. The system can often automatically migrate tasks from a failed node to a working one. |
5: How does the World Wide Web serve as a practical example of a distributed
system?
The World Wide Web (WWW) is arguably the largest and most successful
distributed system in existence. It exhibits all the key characteristics of a distributed
system:
1. Vast Resource Sharing: The WWW is built on sharing resources (web pages,
images, videos, services) on a global scale. These resources are stored on
millions of different servers worldwide.
2. Client-Server Architecture: It operates on a classic client-server model. Web
browsers (clients) request resources from web servers (servers) using the
HTTP protocol.
3. Heterogeneity: The WWW is extremely heterogeneous. Servers run on
different hardware (Intel, ARM) and operating systems (Linux, Windows). Web
pages are created using different technologies. This heterogeneity is managed
through open, standard protocols like HTTP, HTML, and TCP/IP, which act as
a form of middleware.
4. Massive Scalability: The Web has scaled to billions of users and servers.
This is managed through the Domain Name System (DNS), a distributed
database that maps human-readable names (like www.google.com) to
machine-readable IP addresses.
5. Openness: Its success is due to its open standards. Anyone can create a web
page, set up a web server, or build a web browser as long as they adhere to
the public standards (HTTP, URL, HTML).
6. Fault Tolerance (Partial): The Web is resilient to failures. If a single web
server or network path goes down, it does not bring down the entire Web.
Other parts remain accessible. However, it does not provide strong failure
transparency (if a site is down, the user sees an explicit error such as a connection timeout or a "404 Not Found" page).
3. Offer Common Services: Middleware often provides essential services that
many distributed applications need, such as:
○ Naming services: To locate other processes or resources in the
network.
○ Security services: For authentication, authorization, and encryption.
○ Transaction services: To ensure operations are completed reliably.
○ Persistence services: For storing application data.
4. Enhance Transparency: Middleware is the key to achieving many forms of
distribution transparency (location, access, failure, etc.), making the distributed
system look like a single machine to the programmer.
General Organization
Middleware is often organized as a layer that extends the services of the local
operating system on each machine, providing a more powerful and uniform interface
for applications.
Example: CORBA (Common Object Request Broker Architecture) and Java RMI
(Remote Method Invocation) are classic examples of middleware.
1. Client-Server Architecture
In this style, processes are divided into clients, which request services, and servers, which provide them.
● Example: Web browsing (the browser is the client, the web server is the server), databases.
2. Peer-to-Peer (P2P) Architecture
In P2P architecture, all processes (or nodes) are considered equal and play the role
of both client and server simultaneously. Each node, called a peer, contributes
resources (processing power, storage, bandwidth) to the network.
● How it works: Peers communicate directly with each other without the need
for a central server. They form an overlay network on top of the physical
internet to discover each other and exchange data.
Example: BitTorrent (file sharing), Skype (early versions for voice calls),
Cryptocurrencies (Bitcoin, Ethereum).
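A minimal Python sketch of this dual role (all names hypothetical): each peer can both serve the chunks it holds and fetch chunks from another peer, after which it can serve them too.

```python
# Sketch of a peer acting as both client and server (illustrative only).
class Peer:
    def __init__(self, name, chunks):
        self.name = name
        self.chunks = chunks                 # data this peer contributes

    def serve(self, chunk_id):               # server role: answer requests
        return self.chunks.get(chunk_id)

    def fetch(self, other, chunk_id):        # client role: request from a peer
        data = other.serve(chunk_id)
        if data is not None:
            self.chunks[chunk_id] = data     # this peer can now serve it too

a = Peer("peer-a", {1: "block-1"})
b = Peer("peer-b", {})
b.fetch(a, 1)
print(b.serve(1))                            # 'block-1': b now serves the chunk
```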
3. Layered Architecture
In this style, components are organized into a series of layers. Each layer provides a
service to the layer above it and uses services from the layer below it. Requests and
responses flow down and up through the layers.
● How it works: A request from a top-level layer is passed down through
successive layers, with each layer adding its own logic or data, until it reaches
the bottom. The response travels back up the stack.
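The flow can be sketched in a few lines, assuming three hypothetical layers (presentation, logic, data), where each layer calls only the layer directly below it.

```python
# Sketch of a three-layer request path (illustrative names).
def data_layer(user):                  # bottom layer: raw data access
    return {"user": user, "status": "active"}

def logic_layer(user):                 # middle layer: business rules
    record = data_layer(user)
    record["allowed"] = record["status"] == "active"
    return record

def presentation_layer(user):          # top layer: formats the response
    record = logic_layer(user)
    verdict = "granted" if record["allowed"] else "denied"
    return f"{record['user']}: access {verdict}"

# The request flows down through the layers; the response travels back up.
print(presentation_layer("alice"))
```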
8: Differentiate between stateful and stateless servers with examples.
A key design choice for servers in a distributed system is whether to maintain
information about past interactions with clients. This leads to two types of servers:
stateful and stateless.
| Feature | Stateless Server | Stateful Server |
| --- | --- | --- |
| 1. Client Information | Does not keep any information (state) about past client requests. | Maintains a record of client interactions and their current state. |
| 2. Request Handling | Each request is treated as independent and must contain all information needed to process it. | Uses stored state information to process new requests from a client. |
| 3. Failure Recovery | Recovery from a server crash is simple. A new server can immediately take over since no state is lost. | Recovery is complex. The server's state must be restored before it can resume serving clients. |
| 4. Scalability | Generally more scalable, as any server can handle any client's request, making load balancing easy. | Less scalable, as a client may need to be "stuck" to the specific server that holds its state. |
| 5. Example | An HTTP web server. Each HTTP request is independent, and the server forgets about the client after sending the response. | A file server that allows clients to open a file, read parts of it, and then close it. The server must remember which file is open for which client. |
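The contrast can be sketched in a few lines of Python (paths, client IDs, and request fields are hypothetical): the stateless handler needs everything in the request itself, while the stateful one keeps per-client session state that would be lost in a crash.

```python
# Stateless: each request carries the file name, offset, and size explicitly.
def stateless_read(request):
    with open(request["path"], "rb") as f:
        f.seek(request["offset"])
        return f.read(request["size"])

# Stateful: the server remembers which file each client has open.
open_files = {}                     # client_id -> open file handle (server state)

def stateful_open(client_id, path):
    open_files[client_id] = open(path, "rb")

def stateful_read(client_id, size):
    return open_files[client_id].read(size)   # relies on state from earlier calls
```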
4. Resource Consolidation: Multiple servers can be consolidated onto a single
physical machine, saving power, cooling, and space. This is a key principle of
modern cloud computing data centers.
5. Simplified Deployment: New servers (as VMs) can be provisioned and
deployed in minutes by cloning a template, which drastically speeds up the
process of scaling a distributed application.
Types of Virtualization
1. Hardware Virtualization (or Platform Virtualization): This is the most
common type. The hypervisor creates a complete virtual hardware platform for
each VM. Each VM can then run its own full-fledged guest operating system.
○ Type 1 (Bare-Metal) Hypervisor: Runs directly on the physical
hardware. Examples: VMware ESXi, Microsoft Hyper-V.
○ Type 2 (Hosted) Hypervisor: Runs as an application on top of a
conventional host operating system. Examples: Oracle VirtualBox,
VMware Workstation.
2. Operating System-Level Virtualization (Containerization): The host OS
kernel is shared by multiple isolated user-space instances called containers.
Containers are more lightweight and have less overhead than VMs because
they don't need a full guest OS. Example: Docker, Kubernetes, LXC.
10: Explain the difference between a thread and a process in the context of a
distributed system.
| Feature | Process | Thread |
| --- | --- | --- |
| 1. Memory & Resources | Has its own private memory address space and dedicated system resources. | Shares the memory address space and resources of its parent process. |
| 2. Isolation & Faults | Highly isolated. The crash of one process does not affect other processes. | Not isolated. If one thread crashes, it brings down the entire parent process, including all its other threads. |
| 3. Communication | Inter-Process Communication (IPC) is complex and slower (e.g., via sockets). | Inter-thread communication is simple and fast, as threads can directly access shared memory. |
| 4. Creation Cost | "Heavyweight." Slower and more resource-intensive for the OS to create. | "Lightweight." Faster and cheaper to create. |
| 5. Typical Use Case | Used for running separate, isolated services or applications on a server. | Used within a single server to handle multiple client requests concurrently and efficiently. |
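A small Python sketch of the memory-sharing difference (illustrative only): the same handler increments a shared counter, which is visible across threads but not across processes.

```python
import threading, multiprocessing

counter = 0
lock = threading.Lock()

def handle_request(_):
    global counter
    with lock:                # threads share 'counter', so access must be locked
        counter += 1

if __name__ == "__main__":
    # Threads share the parent's address space: every increment is visible.
    threads = [threading.Thread(target=handle_request, args=(i,)) for i in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("after threads:", counter)      # 4

    # Processes get private copies: the parent's 'counter' is untouched.
    procs = [multiprocessing.Process(target=handle_request, args=(i,)) for i in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
    print("after processes:", counter)    # still 4
```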
11: How is process resilience achieved through groups in a distributed
system?
Process resilience is the ability of a system to continue functioning correctly even
when some of its processes fail. This is a form of fault tolerance. A key technique to
achieve this is by organizing processes into groups.
How Resilience is Achieved:
1. Replication: The service or data is replicated across multiple processes in the
group. All processes in the group are functionally identical (they can perform
the same tasks).
○ Flat Groups: All processes are peers. Decisions are made collectively.
This is democratic but complex.
○ Hierarchical Groups: One process acts as a coordinator (or primary),
and others are backups (or secondaries). This is simpler to manage.
2. Failure Detection: The group members need a mechanism to detect when
another member has failed. This is often done using "heartbeat" messages,
where each process periodically sends an "I'm alive" message to the others. If a
heartbeat is missed for a certain period, the process is assumed to have
crashed (a minimal sketch follows this list).
3. Takeover Mechanism: When a failure is detected, the group must react.
○ In a hierarchical group, if the primary fails, an election algorithm (like
the Bully or Ring algorithm) is run among the backups to choose a new
primary. The new primary then takes over all responsibilities.
○ In a flat group, the remaining processes may need to re-distribute the
workload of the failed process among themselves.
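Here is the heartbeat sketch referenced in step 2: a minimal, single-machine failure detector (peer names and the timeout value are hypothetical).

```python
import time

TIMEOUT = 3.0                    # seconds without a heartbeat before suspicion
last_heartbeat = {}              # peer -> time of its last "I'm alive" message

def on_heartbeat(peer, now=None):
    last_heartbeat[peer] = now if now is not None else time.time()

def suspected_dead(now=None):
    now = now if now is not None else time.time()
    return [p for p, t in last_heartbeat.items() if now - t > TIMEOUT]

on_heartbeat("replica-1", now=100.0)
on_heartbeat("replica-2", now=100.0)
on_heartbeat("replica-1", now=104.0)      # replica-2 has gone silent
print(suspected_dead(now=104.5))          # ['replica-2'] -> trigger takeover
```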
12: What is code migration? Explain the different models for code migration
(e.g., weak vs. strong mobility).
Code migration is the mechanism of moving a process, or a part of it (such as a specific function or object), from one machine to another in a distributed system. The key idea is to move the computation closer to the data or resources it needs, rather than moving large amounts of data to the computation.
The two main models, distinguished by what is transferred, are:
● Weak Mobility: Only the code segment (possibly with some initialization data) is transferred. The migrated program always starts execution from its initial state on the target machine. Example: Java applets, whose code is downloaded and then run from the beginning.
● Strong Mobility: The code segment and the execution state (stack, program counter) are both transferred, so a running process can be stopped, moved, and resumed exactly where it left off. This is more powerful but much harder to implement.
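A toy Python sketch of weak mobility, assuming the code arrives as a plain string (real systems add authentication and sandboxing): the receiver compiles and runs the shipped code from scratch.

```python
# Code "shipped" from another machine (hypothetical payload).
code = """
def task(data):
    return sum(data) / len(data)
"""

namespace = {}
exec(code, namespace)                 # weak mobility: execution starts fresh
print(namespace["task"]([1, 2, 3]))   # 2.0

# Strong mobility would additionally require shipping the execution state
# (stack, program counter) so the computation could resume mid-flight.
```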
13: Explain Remote Procedure Call (RPC) and its working process with a
suitable diagram. Also, discuss its advantages and disadvantages.
Remote Procedure Call is a communication mechanism that allows a program on
one computer to execute a procedure (or function) on another computer as if it were
a local call. The goal of RPC is to hide the complexities of network communication
(like socket programming, data marshalling) from the programmer, providing a high
degree of access transparency.
1. Client Procedure Call: The client program calls the client stub, a local proxy
procedure that presents the same interface as the remote procedure.
2. Client Stub (Marshalling): The client stub "marshals" the call, packing the
procedure's parameters into a network message.
3. Network Communication: The client's operating system sends this message
to the remote server machine.
4. Server Stub (Unmarshalling): The message is received by the server's OS
and passed to the server stub. The server stub "unmarshals" the message,
unpacking the parameters.
5. Server Procedure Execution: The server stub calls the actual procedure on
the server, passing it the unpacked parameters.
6. Return and Marshalling: When the procedure finishes, it returns the result to
the server stub. The server stub marshals the return value into another
message.
7. Network Return: The server's OS sends the result message back to the client
machine.
8. Client Unmarshalling & Return: The client's OS gives the message to the
client stub, which unmarshals the return value and passes it back to the
original client program.
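As a concrete sketch, Python's standard xmlrpc module implements exactly this stub-and-marshalling pipeline; the procedure name and port below are arbitrary.

```python
# --- server side ---
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):                        # the actual procedure (step 5)
    return a + b

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_function(add)         # exposed through the server stub
server.serve_forever()

# --- client side (separate program) ---
# import xmlrpc.client
# proxy = xmlrpc.client.ServerProxy("http://localhost:8000")  # client stub
# print(proxy.add(2, 3))              # looks like a local call; prints 5
```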
Advantages of RPC:
● Simplicity and Transparency: Programmers use a familiar procedure call
syntax, hiding the underlying network communication.
● Abstraction: It abstracts away the complexities of data representation and
network protocols.
● Efficiency: Can be highly optimized for request-reply interactions.
Disadvantages of RPC:
● Limited to Request-Reply: Not suitable for all communication patterns (e.g.,
asynchronous messaging, streaming).
● Coupling: The client and server are tightly coupled. A change in the server's
procedure signature often requires the client to be recompiled.
● Failure Handling: Handling failures (e.g., a crashed server or lost message)
is more complex than with local calls and must be explicitly programmed.
14: Differentiate Remote Procedure Call (RPC) from Message-Oriented
Communication, providing an illustrative example for each.
| Feature | Remote Procedure Call (RPC) | Message-Oriented Communication (MOM) |
| --- | --- | --- |
| 1. Communication Style | Synchronous: the client calls a function and blocks (waits) until it gets a reply. | Asynchronous: the client sends a message and continues its work without waiting. |
| 2. Coupling | Tightly coupled: the client and server must both be active at the same time. | Loosely coupled: a message queue acts as a middleman, so client and server need not be active at the same time. |
| 3. Addressing | Direct: the client communicates directly with a known server address. | Indirect: the client sends a message to a named queue, not to a specific receiver. |
| 4. Paradigm Analogy | Like making a direct phone call; you wait for the other person to answer. | Like dropping a letter in a mailbox; you post it and walk away, knowing it will be delivered later. |
| 5. Failure Handling | If the server is down, the call fails immediately. | If the receiver is down, the message is stored reliably in the queue until the receiver is available. |
Illustrative Examples:
● RPC Example: A banking app directly calls a remote procedure on the server to get the account balance and blocks until the reply arrives before showing it.
● MOM Example: An e-commerce site places each new order on a message queue and instantly shows a confirmation; backend services consume and process the order later.
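A minimal sketch of the decoupling, using Python's in-process queue.Queue as a stand-in for a real message broker (the order payload is hypothetical):

```python
import queue, threading

orders = queue.Queue()                # the "mailbox": a named queue

def place_order(order):
    orders.put(order)                 # asynchronous: send and keep going
    print("order accepted:", order)

def fulfillment_worker():
    while True:
        order = orders.get()          # receiver consumes when it is ready
        print("shipping:", order)
        orders.task_done()

threading.Thread(target=fulfillment_worker, daemon=True).start()
place_order({"item": "book", "qty": 1})
orders.join()                         # wait until the queued order is handled
```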
17: What is the purpose of a naming system? Explain the different entity
naming schemes (e.g., name, address, identifier).
In a distributed system, a naming system is used to uniquely identify and locate entities such as files, users, services, or servers.
Purpose of a Naming System:
1. Identification – To uniquely refer to an entity.
2. Location – To find and communicate with that entity.
3. Sharing – To let multiple users/processes access the same entity via a
common name.
4. Abstraction – To hide the physical location; the name remains constant even
if the address changes.
Entity Naming Schemes:
1. Name (Human-Readable):
○ A string that users can easily read and remember.
○ It doesn't reveal location details.
○ Example: www.example.com, /home/user/file.txt
2. Address (Location-Based):
○ Contains actual location information to access the entity.
○ Changes if the entity moves.
○ Example: 192.168.1.100:8080, MAC address
3. Identifier (Unique System-Generated):
○ A unique, location-independent bit string.
○ Remains constant throughout the entity’s life.
○ Example: UUID like 123e4567-e89b-12d3-a456-426614174000,
database user ID
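A tiny sketch of how the three levels chain together (all values hypothetical): resolution goes from human-readable name to stable identifier to current address, so a move only changes the last mapping.

```python
names = {"www.example.com": "123e4567-e89b-12d3-a456-426614174000"}
addresses = {"123e4567-e89b-12d3-a456-426614174000": "192.168.1.100:8080"}

def resolve(name):
    uid = names[name]          # name -> location-independent identifier
    return addresses[uid]      # identifier -> current (changeable) address

print(resolve("www.example.com"))     # '192.168.1.100:8080'
# If the entity moves, only the 'addresses' entry changes; the name and
# identifier stay constant, which is exactly location transparency.
```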
18: Explain structured naming and attribute-based naming with examples.
| Feature | Structured Naming (Hierarchical) | Attribute-Based Naming (Descriptive) |
| --- | --- | --- |
21: What is an election algorithm? Explain the Bully algorithm and the Ring
algorithm with a suitable diagram showing the election process.
An election algorithm is a procedure used in a distributed system for processes to
agree on a single process that will act as a coordinator or leader. This process is
necessary when the previous coordinator fails or when the system initializes. The
goal is for all working processes to reach a consensus on the new leader. Any active
process can initiate an election.
1. The Bully Algorithm: The Bully algorithm assumes every process knows the ID
of all other processes. The process with the highest ID is always chosen as the
leader.
Explanation:
1. When a process P detects a coordinator failure, it sends an ELECTION
message to all processes with a higher ID.
2. If no higher-ID process responds after a timeout, P declares itself the winner
and sends a COORDINATOR message to all other processes.
3. If P receives a RESPONSE from a higher-ID process Q, it stops its own
election effort and waits for Q (or another, even higher-ID process) to
eventually announce victory.
4. A process that receives an ELECTION message from a lower-ID process
sends a RESPONSE back and then starts its own election, effectively
"bullying" the lower-ID process out of the race.
2. The Ring Algorithm: The Ring algorithm assumes processes are organized in a
logical ring where each process only knows its immediate successor.
Explanation:
1. When a process P detects a coordinator failure, it creates an ELECTION
message containing its own ID and sends it to its successor.
2. Each subsequent process that receives the message adds its own ID to the
list in the message and forwards it to its successor.
3. The message circulates the entire ring until it returns to the process that
started it (P).
4. At this point, P has a complete list of all active processes. It elects the process
with the highest ID from this list as the new coordinator.
5. P then circulates a final COORDINATOR message around the ring to inform all
other processes who the winner is.
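A corresponding sketch of the Ring election, assuming (hypothetically) that every process in the ring is alive and that the list below gives the successor order:

```python
def ring_election(ring, start):
    ids = []
    i = ring.index(start)
    for _ in range(len(ring)):        # ELECTION message travels once around
        ids.append(ring[i])           # each process appends its own ID
        i = (i + 1) % len(ring)
    return max(ids)                   # initiator picks the highest collected ID
                                      # and then circulates COORDINATOR(winner)

ring = [3, 6, 0, 8, 1]                # processes in successor order
print(ring_election(ring, start=6))   # 8 becomes the new coordinator
```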
22: Why is Clock Synchronization Needed in a Distributed System? Differentiate between the Berkeley Algorithm and the Network Time Protocol (NTP).
Clock Synchronization is crucial in distributed systems to ensure consistency,
coordination, and correctness across multiple nodes. Without synchronized clocks,
systems may face issues like:
1. Event Ordering: Logical ordering of events becomes difficult (e.g., in
distributed transactions or logs).
2. Consistency: Databases or caches may return inconsistent results due to
timestamp mismatches.
3. Timeout Handling: Distributed algorithms (e.g., leader election, failure
detection) rely on accurate timeouts.
4. Debugging: Logs from different machines are hard to correlate without a
common time reference.
| Feature | Berkeley Algorithm | Network Time Protocol (NTP) |
| --- | --- | --- |
| 1. Primary Goal | Internal consistency: makes all clocks in a group agree with each other. | External accuracy: synchronizes clocks to the official world time (UTC). |
| 2. Architecture | Centralized: a master polls the slaves and computes an average. | Hierarchical: servers are organized in layers (strata) based on accuracy. |
| 3. Time Source | None required. The average time of the group becomes the standard. | Relies on reference clocks (e.g., atomic, GPS) at the top of the hierarchy. |
| 4. Method | The master sends each node an adjustment value (e.g., "+5 ms") based on the group average. | A client calculates its precise offset and network delay using a four-timestamp exchange. |
| 5. Scalability | Low. Best for small, local networks (e.g., a single server cluster). | High. Designed to scale globally for the entire internet. |
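The core arithmetic of both methods can be sketched directly (all timestamps are hypothetical):

```python
# Berkeley: the master averages the polled clocks and sends each node a correction.
clocks = {"master": 100.0, "s1": 102.0, "s2": 95.0}
avg = sum(clocks.values()) / len(clocks)               # 99.0
corrections = {node: avg - t for node, t in clocks.items()}
print(corrections)          # {'master': -1.0, 's1': -3.0, 's2': 4.0}

# NTP: the client computes offset and delay from a four-timestamp exchange.
# t1: client sends, t2: server receives, t3: server replies, t4: client receives.
t1, t2, t3, t4 = 10.000, 10.110, 10.112, 10.020
offset = ((t2 - t1) + (t3 - t4)) / 2                   # estimated clock offset
delay = (t4 - t1) - (t3 - t2)                          # round-trip network delay
print(offset, delay)        # 0.101 0.018
```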
○ Eventually Consistent: All nodes will get the data over time, but exact
timing is not guaranteed.
3. Use Cases:
○ Database Replication (e.g., Amazon DynamoDB)
○ Failure Detection (nodes gossip about alive/dead nodes)
○ Cluster Membership (sharing the list of active nodes)
○ Data Aggregation (e.g., average load, max value)
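A minimal sketch of push-style gossip over a small hypothetical cluster: each round, every node that has the update forwards it to one random peer, so all nodes converge eventually.

```python
import random

nodes = {i: None for i in range(8)}     # node id -> stored value
nodes[0] = "v2"                         # node 0 receives an update

rounds = 0
while any(v is None for v in nodes.values()):
    rounds += 1
    for node, value in list(nodes.items()):
        if value is not None:
            peer = random.randrange(len(nodes))
            nodes[peer] = value         # push the rumor to a random peer
print(f"all nodes consistent after {rounds} rounds")
```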
26: What is replication and why is it used? Explain the different data-centric
consistency models.
Replication is the process of creating and maintaining multiple copies of data or
resources on different computers in a distributed system. These copies are called
replicas.
Why is Replication Used?
1. Increased Reliability and Availability: If one server holding a replica fails,
the system can continue to operate by using other available replicas. This
makes the system more fault-tolerant and highly available.
2. Improved Performance: By placing replicas closer to the users who access
them, replication can significantly reduce data access latency. For example, a
user in Asia can access a replica on a server in Asia instead of one in North
America. It also allows for load balancing, as client requests can be distributed
among multiple replica servers.
Data-Centric Consistency Models
The main challenge with replication is keeping the replicas consistent. If data is
written to one replica, how and when do the other replicas get updated?
Consistency models are contracts between the data store and the application that
define the rules for the consistency of data.
Here are the most common data-centric models, ordered from strongest to weakest:
1. Strict Consistency: All reads return the result of the most recent write instantly, requiring a perfect global clock, which makes it theoretically strong but impractical in real-world systems. Used where absolute correctness is critical, though rarely feasible in practice.
2. Sequential Consistency: All processes see operations in the same
sequential order, preserving each process’s program order, though the order
may not reflect actual real-time execution. Provides a balance between
usability and enforceability without needing global time.
3. Causal Consistency: Causally related writes (e.g., a reply following a
message) must be seen in the correct order by all processes, while
concurrent, unrelated operations can be observed in different orders. Common
in collaborative applications where operation dependencies matter.
4. Eventual Consistency: The weakest model, where all replicas eventually
converge to the same value if no new updates occur; allows temporary
inconsistencies and favors availability over immediate accuracy. Ideal for
large-scale, fault-tolerant systems like DNS or NoSQL databases.
28: Difference between continuous consistency and sequential consistency.
| Feature | Sequential Consistency | Continuous Consistency |
| --- | --- | --- |
| 1. Core Concept | A strict ordering model that defines a correct sequence of operations. | A flexible framework for measuring and bounding the level of data inconsistency. |
| 2. Type of Guarantee | Guarantees that all processes see the same single, global order of all operations. | Guarantees that data will not deviate from a perfectly consistent state beyond a specified tolerance. |
| 3. Measurement | Binary: a system is either sequentially consistent or it is not. | Quantitative: measures inconsistency by numerical value, staleness (time), and order. |
| 4. Flexibility | Rigid. It is an all-or-nothing property with no room for negotiation. | Tunable. Developers can specify the exact level of inconsistency their application can tolerate. |
| 5. Typical Use Case | Systems where a predictable, global order is critical (e.g., simple distributed locks). | Large, highly available systems where some data staleness is acceptable (e.g., online gaming, caching). |
29: What is replica management? Why is it important?
Replica management is the process of deciding where, when, and how to create
and maintain replicas (copies) of data or services in a distributed system. It involves
a set of policies and mechanisms that govern the lifecycle of replicas.
Why is it Important?
Effective replica management is crucial for achieving the primary goals of
replication: performance and reliability. Poor management can negate the benefits
and even introduce new problems. Key decisions in replica management include:
1. Replica Placement: Deciding where to place the replicas.
○ Goal: To improve performance by reducing latency for users and to
increase fault tolerance by placing replicas in different physical locations
(different racks, data centers, or continents) to avoid a single point of
failure.
○ Types of Placement:
■ Permanent Replicas: The initial, fixed set of replicas of a data
store (e.g., a database cluster with 3 servers).
■ Server-Initiated Replicas: A server dynamically creates a replica
of a heavily accessed file on another server to balance the load.
■ Client-Initiated Replicas (Caching): A client creates a temporary,
local copy (a cache) of data for fast, repeated access.
2. Content Replication and Update Propagation: Deciding what to replicate
and how to spread updates.
○ Goal: To keep replicas consistent without overwhelming the network.
○ Key Decisions:
■ Push vs. Pull: Should the server push updates to the replicas as
they happen (active), or should replicas periodically pull updates
from the server (passive)?
■ State vs. Operations: Should the entire modified data item (state)
be transferred, or only the operation that was performed
(operation)? Transferring operations is often more efficient. For
example, instead of sending the entire new bank balance, just
send the operation DEPOSIT($100).
Proper replica management ensures that the system is both fast and resilient. For
instance, placing replicas too close together compromises fault tolerance, while
placing them too far apart can increase the latency of keeping them synchronized.
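A small sketch contrasting the two propagation styles for the bank-balance example above (the replica layout is hypothetical): operation transfer ships a small message and lets each replica do the work.

```python
replicas = [{"balance": 500}, {"balance": 500}]

def propagate_state(new_state):          # state transfer: ship the whole value
    for r in replicas:
        r["balance"] = new_state

def propagate_operation(op, amount):     # operation transfer: ship the op only
    for r in replicas:
        if op == "DEPOSIT":
            r["balance"] += amount       # each replica applies DEPOSIT itself

propagate_operation("DEPOSIT", 100)
print(replicas)                          # both replicas now show 600
```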
30: What are the different types of failures that can occur in a distributed
system? Explain the methods used to recover from a crash.
31: Define distributed commit. Explain the two-phase commit (2PC) protocol in
detail with a diagram.
32: What is a fault and an error? Differentiate between them.
Faults and errors are related but distinct concepts in the study of system reliability.
They represent different stages in the chain of events leading to a system failure.
● Fault: A fault is the underlying cause of a problem. It is a defect or flaw in a
hardware or software component. It can be a bug in the code, a faulty
hardware component, or a design mistake. A fault is static and may exist in the
system for a long time without causing any issue.
● Error: An error is a part of the system's state that is incorrect and may lead to
a failure. An error is the manifestation of a fault. When a fault is activated (e.g.,
a buggy piece of code is executed), it produces an error in the system's state.
● Failure: A failure is the observable deviation of a system from its specified
behavior. It is what the user sees when the system does not perform its
function correctly. A failure occurs when an error in the system's state
propagates to the output.
Example Chain:
1. Fault: A programmer writes x = y / 0; in the code. This is the defect.
2. Error: When this line of code is executed, the system's state becomes
erroneous (e.g., a "division by zero" exception is raised).
3. Failure: The program crashes or displays an error message to the user, which
is a deviation from its specified behavior.
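The same chain can be shown as a runnable sketch, with a guard-less division standing in for the fault:

```python
def average(values):
    return sum(values) / len(values)   # FAULT: no guard against an empty list

try:
    average([])                        # activating the fault...
except ZeroDivisionError as exc:
    # ERROR: the incorrect state surfaced as an exception. If it were not
    # caught, the program would crash: the user-visible FAILURE.
    print("error state reached:", exc)
```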
| Feature | Fault | Error |
| --- | --- | --- |
| 1. Definition | The root cause of a potential problem (a defect). | An incorrect state within the system caused by a fault. |
| 2. Nature | A static condition or flaw in a component. | A dynamic, incorrect state that the system enters. |
| 3. Analogy | A disease or virus present in the body. | The symptoms of the disease (e.g., fever). |
| 4. Example | A bug in the source code; a loose wire. | A variable having a wrong value; a corrupted packet. |
| 5. Causal Relation | A fault causes an error. | An error is caused by a fault and causes a failure. |
33: Explain reliable client-server communication.
Reliable client-server communication aims to guarantee that a client's request is
processed by a server exactly once, and a reply is successfully returned, even in the
presence of network failures or server crashes. This is primarily achieved by
systematically handling three main problems.
1. Handling Lost Messages
● Problem: A request or reply message can be lost in the network. The client
will wait for a response that never comes.
● Solution: The client sets a timeout after sending a request. If no reply arrives
within this period, it assumes the message was lost and retransmits the
original request.
2. Handling Duplicate Requests
● Problem: Retransmission can lead to duplicate processing. If the server
successfully processed the first request but its reply was lost, the server will
receive the same request again. For non-idempotent operations (like
add_to_balance(100)), this can lead to incorrect results.
● Solution: The server must be able to detect duplicates. This is done by having
the client include a unique request ID with every message. The server
maintains a history of recently processed IDs. If a request arrives with an ID
that has already been processed, the server does not re-execute the
operation; instead, it simply re-sends its stored reply (see the sketch at the
end of this answer).
3. Handling Server Crashes
● Problem: A server can crash at any point, leaving the client uncertain about
the state of its request (was it started, partially completed, or finished?).
● Solution: The system provides different levels of guarantees, known as
execution semantics, to manage this uncertainty:
○ At-Least-Once: Retries until success; may run the operation multiple
times. Safe only for idempotent actions.
○ At-Most-Once: Ensures no duplicate execution; operation might be lost
if failure occurs.
○ Exactly-Once: Guarantees the operation runs once and only once using
persistent logs or transactions—ideal but complex.
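Here is the duplicate-detection sketch promised above (the request ID and deposit operation are hypothetical): the server replays its stored reply instead of re-executing a non-idempotent operation.

```python
processed = {}                           # request_id -> stored reply

def handle(request_id, operation, *args):
    if request_id in processed:          # duplicate: re-send the old reply,
        return processed[request_id]     # do NOT re-execute the operation
    reply = operation(*args)
    processed[request_id] = reply
    return reply

balance = [500]
def deposit(amount):
    balance[0] += amount
    return balance[0]

print(handle("req-42", deposit, 100))    # 600: executed once
print(handle("req-42", deposit, 100))    # 600: retransmission detected, no re-run
```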
The Relationship:
Authentication always comes before Authorization. You cannot grant access rights
to someone until you have reliably confirmed their identity.
| Feature | Authentication | Authorization |
| --- | --- | --- |
| 1. Purpose | To verify identity. | To grant or deny permissions. |
| 2. Question | "Who are you?" | "What are you allowed to do?" |
| 3. Process | Validates user credentials (password, token, biometric). | Checks user privileges against an access control policy (ACL, RBAC). |
| 4. Timing | The first step in the security process. | The second step, performed after successful authentication. |
| 5. Output | A decision of "Valid Identity" or "Invalid Identity". | A decision of "Access Granted" or "Access Denied". |
4. Firewall
● A network security device or software that monitors and controls incoming and
outgoing traffic based on security rules.
● Acts as a barrier between trusted internal networks and untrusted external
networks (like the internet).
● Can block unauthorized access while allowing legitimate communication.
● Types: Packet-filtering firewall, stateful firewall, proxy firewall.
5. Secure Naming
● Refers to the process of securely mapping names (like URLs or hostnames) to
network addresses.
● Prevents attacks like DNS spoofing or man-in-the-middle attacks.
● Secure naming systems use cryptographic methods to ensure authenticity and
integrity.
● Example: DNSSEC (Domain Name System Security Extensions) adds
security to DNS.
7. Idempotent Operations
1. Definition: An operation is idempotent if repeating it multiple times produces
the same result as performing it just once.
2. Core Principle: f(f(x)) = f(x). Applying the function again does not change the
outcome.
3. Importance in DS: Crucial for fault tolerance. It allows clients to safely
retransmit requests after a timeout without fear of corrupting data (e.g., if a
server's reply was lost).
4. Examples: Setting an absolute value (SET x = 10), reading a piece of data, or
deleting a specific record by its unique ID.
5. Non-Examples: Appending data to a log file or incrementing a counter (x = x
+ 1), as these operations change the state with each execution.
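A short sketch of the difference, using a hypothetical state variable:

```python
state = {"x": 0}

def set_x(value):        # idempotent: f(f(x)) == f(x)
    state["x"] = value

def increment_x():       # NOT idempotent: each retry changes the result
    state["x"] += 1

set_x(10); set_x(10)
print(state["x"])        # 10 -- a retransmitted SET is harmless
increment_x(); increment_x()
print(state["x"])        # 12 -- a retransmitted INCREMENT corrupts the value
```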
10. Kerberos
1. Definition: A network authentication protocol that uses a trusted third party,
the Key Distribution Center (KDC), to provide secure identity verification for
users and services.
2. Core Concept: Works on the basis of "tickets" to avoid sending passwords
over the network. It provides a single sign-on (SSO) experience.
3. Key Components:
○ Authentication Server (AS): Verifies the user's identity initially.
○ Ticket-Granting Server (TGS): Issues temporary tickets for specific
services.
4. Workflow: A user gets a main Ticket-Granting Ticket (TGT) once, then uses it
to request temporary service tickets for any resource they need to access,
without re-entering their password.
5. Use Case: The standard for authentication in trusted, managed enterprise
networks, such as Windows Active Directory domains.
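A heavily simplified toy sketch of the ticket flow, using HMAC-signed dictionaries in place of real encrypted Kerberos tickets (all keys and names are hypothetical; this shows only the shape of the protocol, not the actual cryptography):

```python
import hashlib, hmac, json, time

AS_TGS_KEY = b"shared-secret-AS-TGS"     # hypothetical long-term keys
TGS_SVC_KEY = b"shared-secret-TGS-svc"

def sign(key, payload):
    blob = json.dumps(payload, sort_keys=True).encode()
    return payload, hmac.new(key, blob, hashlib.sha256).hexdigest()

def verify(key, ticket):
    payload, mac = ticket
    blob = json.dumps(payload, sort_keys=True).encode()
    return hmac.compare_digest(mac, hmac.new(key, blob, hashlib.sha256).hexdigest())

def as_issue_tgt(user):                  # AS: verify the user once, issue a TGT
    return sign(AS_TGS_KEY, {"user": user, "expires": time.time() + 3600})

def tgs_issue_ticket(tgt, service):      # TGS: swap a valid TGT for a service ticket
    assert verify(AS_TGS_KEY, tgt), "invalid TGT"
    return sign(TGS_SVC_KEY, {"user": tgt[0]["user"], "service": service})

tgt = as_issue_tgt("alice")              # single sign-on: password used only here
ticket = tgs_issue_ticket(tgt, "fileserver")
print(verify(TGS_SVC_KEY, ticket))       # True -> the service grants access
```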
12. Distributed Event Matching
1. Definition: A system where clients (subscribers) express interest in complex
patterns of events, and a network of brokers matches sequences of events
published by other clients (publishers) against these patterns.
2. Core Functionality: It goes beyond simple "publish/subscribe." Instead of
subscribing to a single topic (e.g., "stock_prices"), a subscriber might define a
pattern like: "Notify me if Stock A drops 5% AND Stock B rises 3% within 10
seconds."
3. Mechanism: It involves a broker network that filters, aggregates, and
correlates events from multiple sources to detect when a composite event
pattern has occurred.
4. Challenges: Efficiently matching millions of complex patterns against a
high-volume stream of events in a distributed, low-latency manner.
5. Use Cases: Algorithmic trading in finance, real-time network intrusion
detection, supply chain monitoring (e.g., tracking a package through multiple
checkpoints), and IoT sensor data analysis.
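A minimal sketch of matching the composite pattern from point 2 over a sliding time window (the symbols, thresholds, and 10-second window are hypothetical):

```python
from collections import deque

WINDOW = 10.0                       # seconds
events = deque()                    # (timestamp, symbol, pct_change)

def publish(ts, symbol, pct_change):
    events.append((ts, symbol, pct_change))
    while events and ts - events[0][0] > WINDOW:
        events.popleft()            # expire events outside the window
    return pattern_matched()

def pattern_matched():
    a_drop = any(s == "A" and p <= -5.0 for _, s, p in events)
    b_rise = any(s == "B" and p >= 3.0 for _, s, p in events)
    return a_drop and b_rise        # composite: both conditions in the window

publish(100.0, "A", -6.2)           # Stock A drops 6.2%
print(publish(104.0, "B", 3.5))     # True: Stock B rose within 10 s of the drop
```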