Unit 4
Failures and Their Classification:
Definition: A failure in a distributed database system refers to any event that disrupts the normal
operation of the system, resulting in the loss of data consistency, availability, or reliability.
Types of Failures:
- Hardware Failures: Failures in physical components such as servers, disks, or network devices.
- Software Failures: Errors or bugs in the software components of the database system, such as the
database management system (DBMS) or applications.
- Network Failures: Communication failures or network outages that prevent data transmission
between distributed nodes.
- Site Failures: Failures that affect an entire site or data center, resulting in the loss of access to all
resources hosted at that location.
- Media Failures: Physical damage or corruption to storage media, such as disks or tapes, leading to
data loss or corruption.
Classification:
- Transient Failures: Temporary failures that can be recovered from quickly, such as a network
glitch or a brief power outage.
- Permanent Failures: Irreversible failures that require more extensive recovery procedures, such
as hardware failures or data corruption.
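The practical difference between the two classes is that a transient failure is worth retrying, while a
permanent one is not. A minimal sketch of that idea follows, assuming two hypothetical exception types
(TransientError, PermanentError) and a hypothetical helper call_with_retry; none of these names come from a
real library.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a network glitch)."""


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


def call_with_retry(operation, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Run operation(); retry on TransientError, let any other error propagate."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: the caller must now treat it as permanent
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff between tries
```

A PermanentError is never caught here, so it reaches the caller on the first attempt and can trigger the
heavier recovery procedures.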
__________________________________________________________________________________
Checkpoints and Recovery:
1. Checkpoints:
- Definition: Checkpoints are predefined moments in time when the state of a distributed database
system is saved to stable storage, allowing recovery to a consistent state after a failure.
- Purpose: Checkpoints help reduce the amount of work needed during recovery by providing a
consistent starting point.
- Types:
- Periodic Checkpoints: Scheduled at regular intervals to save the current state of the system.
- Forced Checkpoints: Triggered manually or automatically in response to specific events, such as a
transaction commit or a planned system shutdown.
2. Recovery:
- Definition: Recovery in a distributed database system involves restoring the system to a
consistent state after a failure occurs.
- Phases:
- Analysis: Identifying the transactions that were in progress at the time of failure and
determining the necessary actions for recovery.
- Undo: Reverting the effects of incomplete transactions by restoring the values the affected data
items held before those transactions began.
- Redo: Reapplying the effects of committed transactions that were lost due to the failure.
- Techniques:
- Backward Recovery: Reverting to a previous consistent state and replaying transactions from
that point forward.
- Forward Recovery: Applying recovery actions directly to the current state of the system without
reverting to a previous state.
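The three phases can be sketched on a toy log. This is only an illustration under stated assumptions: the
log is a list of tuples in a made-up format, ("write", tx, key, old, new) and ("commit", tx), the database is
a plain dict, and transactions are assumed not to overwrite each other's uncommitted data (strict locking).
The function name recover is hypothetical. The phases run in the order listed above; some real recovery
algorithms, such as ARIES, run redo before undo.

```python
def recover(log, db):
    """Bring db (the state found after a crash) back to a consistent state."""
    # Analysis: a transaction with no commit record was in progress at the crash.
    committed = {rec[1] for rec in log if rec[0] == "commit"}

    # Undo: scan the log backwards and restore the old value of every
    # write made by an incomplete transaction.
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] not in committed:
            _, _, key, old, _new = rec
            if old is None:
                db.pop(key, None)  # the key did not exist before the transaction
            else:
                db[key] = old

    # Redo: scan the log forwards and reapply every write made by a
    # committed transaction, in case it never reached the database.
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, _, key, _old, new = rec
            db[key] = new
    return db
```

For example, if T1 committed a write to x and T2 was still running when the system failed, recover keeps
T1's value of x and removes every change made by T2.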
3. Recovery Protocols:
- Two-Phase Commit (2PC): Ensures atomicity of distributed transactions by coordinating a single
commit or rollback decision among all participating nodes. Durability is provided by the logging
performed at each node.
- Three-Phase Commit (3PC): Extends 2PC with an additional pre-commit phase between voting and
the final commit, so that participants are not left blocked if the coordinator fails.
Process Resilience:
Definition: Process resilience refers to the ability of a system or application to continue functioning
despite failures or disruptions.
Fault Tolerance:
- Redundancy: Introducing duplicate processes or components to ensure continued operation if
one fails.
- Failure Detection: Detecting failures quickly to initiate recovery processes.
- Recovery Mechanisms: Implementing strategies such as checkpointing and rollback to recover
from failures.
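Failure detection is often built on heartbeats: every process reports in periodically, and a process that
stays silent for longer than a timeout is suspected to have failed. A minimal sketch, assuming a
hypothetical HeartbeatDetector class and explicit timestamps passed in by the caller so the behaviour is easy
to follow:

```python
class HeartbeatDetector:
    """Suspects a node once no heartbeat has arrived for more than `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of its most recent heartbeat

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A suspicion is not proof of a crash: a slow network delays heartbeats too, so the timeout trades detection
speed against false alarms.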
Techniques:
- Replication: Running multiple instances of a process on different nodes to tolerate failures.
- Isolation: Isolating individual processes to prevent failures from propagating to other
components.
- Graceful Degradation: Prioritizing essential functions to maintain basic functionality during failure
conditions.
Challenges:
- Overhead: Replication and recovery mechanisms can introduce overhead in terms of resources
and performance.
- Consistency: Ensuring consistency across replicated processes while maintaining performance.
- Complexity: Designing and managing resilient systems can be complex and require careful
planning.
__________________________________________________________________________________
Reliable Client-Server Communication:
Definition: Reliable client-server communication ensures that data is transmitted accurately and in
the correct order between clients and servers, even in the presence of failures or network issues.
Techniques:
- Acknowledgments: Using acknowledgments to confirm successful receipt of data and
retransmitting if necessary.
- Sequence Numbers: Assigning sequence numbers to data packets to ensure correct ordering.
- Timeouts and Retransmissions: Setting timeouts to detect lost packets and retransmitting them if
no acknowledgment is received.
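The three techniques work together. The sketch below simulates a stop-and-wait exchange in a single
function, assuming a made-up loss model: lost_data and lost_acks are sets of (sequence number, attempt)
pairs for which the simulated network drops the packet or the acknowledgment. The function name
send_reliably and its parameters are hypothetical.

```python
def send_reliably(messages, lost_data=(), lost_acks=(), max_attempts=5):
    """Deliver messages in order, exactly once, over a simulated lossy channel."""
    lost_data, lost_acks = set(lost_data), set(lost_acks)
    delivered = []  # what the receiver hands to the application
    expected = 0    # receiver side: next sequence number it will accept

    for seq, payload in enumerate(messages):
        for attempt in range(max_attempts):
            if (seq, attempt) in lost_data:
                continue  # data packet lost: sender times out and retransmits

            # Receiver: accept only the expected sequence number. A lower
            # number is a duplicate, which is discarded but still acknowledged.
            if seq == expected:
                delivered.append(payload)
                expected += 1

            if (seq, attempt) in lost_acks:
                continue  # acknowledgment lost: sender times out and retransmits
            break  # acknowledgment received: move on to the next message
        else:
            raise TimeoutError(f"no acknowledgment for packet {seq}")
    return delivered
```

When an acknowledgment is lost, the sender retransmits data the receiver already has. The sequence number
is what lets the receiver recognise the duplicate instead of delivering it twice.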
Protocols:
- TCP (Transmission Control Protocol): Provides reliable, connection-oriented communication with
mechanisms such as acknowledgment, retransmission, and flow control.
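- HTTP (Hypertext Transfer Protocol): Commonly runs on top of TCP and relies on it for reliable,
ordered delivery of web data between clients and servers.
- RPC (Remote Procedure Call): Abstracts communication between distributed systems as procedure
calls over the network. Its reliability depends on the underlying transport and on the call semantics
chosen (for example, at-least-once or at-most-once).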
Challenges:
- Performance: Ensuring reliability without sacrificing performance can be challenging.
- Overhead: Adding reliability mechanisms can increase network overhead and latency.
- Scalability: Maintaining reliability in large-scale distributed systems with many clients and servers
can be complex.
_____________________________________________________________________
Reliable Group Communication:
Definition: Reliable group communication ensures that messages are delivered to all members of a
group in a consistent and ordered manner, even in the presence of failures or network partitions.
Techniques:
- Total Order: Ensuring that messages are delivered to all group members in the same order.
- View Synchronization: Ensuring that all surviving members agree on the current membership (the
view) and on which messages were delivered in each view, so that failures are handled consistently.
- Membership Management: Handling dynamic changes in group membership due to joins, leaves,
or failures.
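One simple way to obtain total order is a central sequencer that stamps every message with a global
sequence number, while each member holds back messages that arrive early. The sketch below assumes this
sequencer approach and uses hypothetical class names (Sequencer, Member); it ignores sequencer failure and
membership changes.

```python
class Sequencer:
    """Assigns one global sequence number to every message."""

    def __init__(self):
        self.next_seq = 0

    def stamp(self, message):
        stamped = (self.next_seq, message)
        self.next_seq += 1
        return stamped


class Member:
    """Delivers stamped messages strictly in sequence-number order."""

    def __init__(self):
        self.next_expected = 0
        self.pending = {}    # sequence number -> message that arrived early
        self.delivered = []  # messages handed to the application, in order

    def receive(self, stamped):
        seq, message = stamped
        if seq < self.next_expected:
            return  # duplicate of a message that was already delivered
        self.pending[seq] = message
        # Deliver as long as the next expected message is available.
        while self.next_expected in self.pending:
            self.delivered.append(self.pending.pop(self.next_expected))
            self.next_expected += 1
```

Two members that receive the same stamped messages in different network orders end up with identical
delivered lists, which is the total order property.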
Protocols:
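- IP Multicast: Allows for one-to-many communication by sending packets to a group of destination
hosts. Delivery is best-effort, so reliable multicast protocols add acknowledgments and
retransmission on top of it.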
- Paxos: A consensus protocol used to ensure agreement among a group of nodes in a distributed
system.
- Virtual Synchrony: Maintains a consistent view of the group by synchronizing membership
changes and message delivery.
Challenges:
- Scalability: Ensuring reliable group communication in large-scale distributed systems with many
members.
- Fault Tolerance: Handling failures and network partitions while maintaining consistency.
- Complexity: Designing and implementing reliable group communication protocols can be complex
and require careful consideration of various factors.
Mechanisms for Commit and Recovery in Distributed Database Systems:
Ans: In distributed database systems, the Two-Phase Commit (2PC) protocol is commonly used for
commit, and recovery is often facilitated by techniques such as logging and checkpoints.
Two-Phase Commit Protocol:
1. Prepare Phase:
- The coordinator (typically the transaction manager) sends a prepare request to all participants
(resource managers) involved in the transaction.
- Each participant responds with either a "yes" (vote to commit) or "no" (vote to abort).
- If any participant votes "no" (indicating it cannot commit the transaction), the coordinator
proceeds to the abort phase.
2. Commit Phase:
- If all participants vote "yes" in the prepare phase, the coordinator sends a commit request to all
participants.
- Upon receiving the commit request, each participant performs the commit operation, making the
transaction's changes permanent.
- After successfully committing, the participant sends an acknowledgment to the coordinator.
3. Abort Phase:
- If any participant votes "no" in the prepare phase or if the coordinator times out waiting for
responses, the coordinator sends an abort request to all participants.
- Upon receiving the abort request, each participant rolls back the transaction, undoing any
changes made by the transaction.
- After successfully aborting, the participant sends an acknowledgment to the coordinator.
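The decision logic of the coordinator can be sketched as follows. This is a simplified illustration, not
a complete implementation: Participant and two_phase_commit are hypothetical names, the participants are
local objects rather than remote nodes, a raised exception stands in for a timeout, and the logging that
a real coordinator performs before announcing its decision is omitted.

```python
class Participant:
    """Toy resource manager that votes according to a fixed flag."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Vote "yes" only if the local part of the transaction can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return "committed" only if every participant votes yes."""
    # Phase 1 (prepare): collect one vote per participant.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare())
        except Exception:
            votes.append(False)  # no usable answer is treated as a "no" vote

    # Phase 2: commit if the vote was unanimous, otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```

A single "no" vote is enough to abort the transaction at every participant, including those that voted
"yes", which is how the protocol keeps the outcome atomic.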
Recovery Mechanisms: Logging and Checkpoints
1. Logging:
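- Logging involves recording every change made by a transaction in a log on stable storage before
the change is applied to the database.
- During recovery, the log is used to redo the changes of committed transactions and to undo the
changes of incomplete or aborted transactions, bringing the system to a consistent state.
- Write-Ahead Logging (WAL) is the common protocol that enforces this ordering: a log record must
reach stable storage before the corresponding change is written to the database, and all log records
of a transaction must be flushed before it commits. This supports both atomicity (undo) and
durability (redo).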
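The "log first, then apply" rule can be sketched with an append-only file. The record layout (JSON lines
with tx, key, old, and new fields) and the names WriteAheadLog and put are assumptions made for this
example; the database is a plain in-memory dict. The calls flush and os.fsync are standard Python and ask
the operating system to push the record to disk before the function returns.

```python
import json
import os


class WriteAheadLog:
    """Append-only log file; every record is forced to disk before append returns."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()             # move the record from Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS to write it to the storage device

    def records(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def put(wal, db, tx, key, value):
    """Update db[key], logging the old and new values before the change is applied."""
    wal.append({"tx": tx, "key": key, "old": db.get(key), "new": value})
    db[key] = value  # applied only after the log record is on disk
```

Keeping the old value in each record is what makes undo possible, and keeping the new value is what makes
redo possible.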
2. Checkpoints:
- Checkpoints involve periodically saving the system state to stable storage.
- During recovery, the system restores the state saved at the last checkpoint and replays the log
from that point to reapply the transactions committed after the checkpoint.
- Checkpoints help reduce the time and resources required for recovery by providing a consistent
starting point.
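How a checkpoint shortens recovery can be seen in a small sketch. It assumes a hypothetical
CheckpointedStore class that keeps its data, its log, and its checkpoint in memory; in a real system the
log and the checkpoint would live on stable storage. Only committed updates are logged here, so recovery
needs redo only.

```python
import copy


class CheckpointedStore:
    """Key-value store with a redo log and a single saved checkpoint."""

    def __init__(self):
        self.data = {}
        self.log = []              # committed updates, in order
        self.checkpoint = ({}, 0)  # (snapshot of data, log position at that time)

    def apply(self, key, value):
        self.log.append((key, value))  # log first, then apply
        self.data[key] = value

    def take_checkpoint(self):
        """Save the current state together with the current end of the log."""
        self.checkpoint = (copy.deepcopy(self.data), len(self.log))

    def recover(self):
        """Rebuild data from the checkpoint; return how many log records were replayed."""
        snapshot, position = self.checkpoint
        self.data = copy.deepcopy(snapshot)
        replayed = 0
        for key, value in self.log[position:]:  # only records after the checkpoint
            self.data[key] = value
            replayed += 1
        return replayed
```

Without the checkpoint, recovery would have to replay the whole log from the beginning; with it, only the
records written after the checkpoint are processed.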