Enhanced Checkpointing Algorithm for Mobile Applications

 

Kailash Prasad Dewangan1, Somesh Kumar Dewangan2

1Department of Computer Science & Engineering, Kruti Institute of Technology and Engineering, Raipur (C.G.), India,

2Department of Computer Science & Engineering, Disha Institute of Management and Technology, Raipur (C.G.), India,

*Corresponding Author Email: kailashprasadd80@gmail.com, somesh_4@yahoo.com

 

ABSTRACT:

Rollback-recovery in mobile systems is important for fault-tolerant computing. Without fault tolerance mechanisms, an application running on a system has to be restarted from scratch if a fault occurs in the middle of its execution, resulting in the loss of useful computation. To provide efficient rollback-recovery for fault tolerance in distributed systems, it is important that distributed checkpointing algorithms reduce the number of checkpoints while still guaranteeing the existence of consistent global checkpoints. Because of the dependencies among process states induced by inter-process communication in distributed systems, asynchronous checkpointing may suffer from the domino effect. Therefore, a consistent global checkpoint should always be maintained to restrict the rollback distance. Quasi-synchronous checkpointing protocols achieve this synchronization in a loose fashion. The algorithm proposed in this paper follows a new strategy that updates the checkpoint interval dynamically.

 

KEY WORDS: Checkpoint, Quasi-synchronous checkpointing, Overhead, Recovery, Distributed Systems 

 


 I.      INTRODUCTION

A checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving this status information. A checkpoint of a process is the information about the state of the process at some instant of time. Fault tolerance through checkpoint and recovery techniques involves taking a checkpoint of an application process periodically and logging the checkpoint in stable storage that is immune to failures. A checkpoint of an application process captures the state of the process and can be used to restart it from the state corresponding to the checkpoint. Rollback-recovery for an application, on the other hand, is defined as the procedure for restarting the application process from a checkpoint stored in stable storage.

 

A checkpoint can be saved on either stable storage or the volatile storage of another process, depending on the failure scenarios to be tolerated.

 

For long-running scientific applications, checkpointing and rollback-recovery can be used to minimize the total execution times in the presence of failures.

 

For an application involving a single process, checkpointing and recovery are simple. In the event of a node failure, the application process is restarted from its latest checkpoint, i.e., it is rolled back to its latest checkpoint. However, several issues with checkpointing and recovery arise in a distributed system that runs applications with multiple concurrent processes communicating with each other via messages.

 

 II.      ASPECTS OF CHECKPOINTING:

Some of the aspects to be considered with checkpointing are frequency of checkpointing, contents of a checkpoint and methods of checkpointing.

 

A.      Frequency of Checkpointing:

A checkpointing algorithm executes in parallel with the underlying computation. Therefore, the overheads introduced due to checkpointing should be minimized. Checkpointing should enable a user to recover quickly and not lose substantial computation in case of an error, which necessitates frequent checkpointing and, consequently, significant overhead. The number of checkpoints initiated should be such that the cost of information loss due to failure is small and the overhead due to checkpointing is not significant. Both depend on the failure probability and the importance of the computation. For example, in a transaction processing system where every transaction is important and information loss is not permitted, a checkpoint may be taken after every transaction, increasing the checkpointing overhead significantly.

 

B.      Contents of a Checkpoint:

The state of a process has to be saved in stable storage so that the process can be restarted in case of an error. The saved context includes the code, data and stack segments along with the environment and the register contents. The environment holds information about the various files currently in use and the file pointers. In message-passing systems, the environment also includes those messages that have been sent but not yet received.

 

C.      Methods of Checkpointing:

The methodology used for checkpointing depends on the architecture of the system. Unlike uniprocessor systems, methods used in multiprocessor systems must incorporate explicit coordination. In a message-passing system, the messages should be monitored and, if necessary, saved as part of the global context, because messages introduce dependencies among the processors. A shared-memory system, on the other hand, communicates through shared variables, which introduce dependencies among the nodes; thus, at the time of checkpointing, the memory has to be in a consistent state to obtain a set of concurrent checkpoints.

 

III. OVERHEAD OF CHECKPOINTING:

During a failure-free run, every global checkpoint incurs coordination overhead and context-saving overhead in a distributed system. We define them as follows.

 

A.      Coordination Overhead:

In a distributed system, coordination among processes is needed to obtain a consistent global state. Special messages and information piggybacked on regular messages are used to achieve this coordination. Coordination overhead is the cost of these special messages and the piggybacked information.

 

B.      Context Saving Overhead:

The time taken to save the global context of a computation is defined as the context-saving overhead. Overhead for checkpoint storage in stable storage contributes a major part of the overhead of checkpoint and recovery protocols. This overhead is proportional to the size of the context.

 

 IV.      CONSISTENT SYSTEM STATE:

In a distributed system, a computation node can take checkpoints of its local processes only, and such checkpoints are called local checkpoints. A global checkpoint of a distributed system is defined as a set of local checkpoints, one from each process in the system. After recovery, the system must be in a consistent state. A global state of a message-passing system is a collection of the individual states of all participating processes and of the states of the communication channels. Intuitively, a consistent global state is one that may occur during a failure-free, correct execution of a distributed computation. More precisely, a consistent system state is one in which, if a process state reflects the receipt of a message, then the state of the corresponding sender reflects sending that message.
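As an informal illustration of this definition (not part of the original protocol descriptions), the following minimal Python sketch checks whether a recorded global state is consistent; the dictionary layout and the function name is_consistent are assumptions made only for the example:

# Illustrative sketch: a global state maps each process id to the sets of
# messages that process has recorded as sent and as received.
def is_consistent(global_state):
    """A state is consistent if every recorded receive has a matching
    recorded send; messages sent but not yet received are allowed."""
    all_sent = set()
    for state in global_state.values():
        all_sent |= state["sent"]
    for state in global_state.values():
        for msg in state["received"]:
            if msg not in all_sent:
                return False
    return True

# Mirrors Fig 1: m1 sent but not yet received is consistent,
# m2 received without a recorded send is not.
fig_1a = {"P1": {"sent": {"m1"}, "received": set()},
          "P2": {"sent": set(), "received": set()}}
fig_1b = {"P1": {"sent": set(), "received": set()},
          "P2": {"sent": set(), "received": {"m2"}}}
assert is_consistent(fig_1a) and not is_consistent(fig_1b)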

 

Fig 1 shows two examples of global states: a consistent state in Fig 1(a) and an inconsistent state in Fig 1(b). Note that the consistent state in Fig 1(a) shows message m1 as having been sent but not yet received. This state is consistent because it represents a situation in which the message has left the sender and is still traveling across the network. On the other hand, the state in Fig 1(b) is inconsistent because process P2 is shown to have received m2 but the state of process P1 does not reflect sending it. Such a state is impossible in any failure-free, correct computation.

 

Inconsistent states occur because of failures. For example, the situation shown in Fig 1(b) may occur if process P1 fails after sending message m2 to P2 and then restarts at the state shown in the figure. A fundamental goal of any rollback-recovery protocol is to bring the system into a consistent state when inconsistencies occur because of a failure. The reconstructed consistent state is not necessarily one that occurred before the failure; it is sufficient that the reconstructed state be one that could have occurred before the failure in a failure-free, correct execution.

 

Fig 1: An example of a consistent and an inconsistent state.

 

V.      ISSUES IN CHECKPOINTING:

In concurrent systems, several processes cooperate by exchanging information to accomplish a task. The information is exchanged through message passing. In such a system, if one of the cooperating processes fails and resumes execution from a recovery point, then the effects it has caused at other processes, through the information it exchanged with them after establishing the recovery point, have to be undone. To undo the effects caused by a failed process at an active process, the active process must also roll back to an earlier state. Rolling back processes in a concurrent system is more difficult than in the case of a single process. The following discussion illustrates how the rolling back of processes can cause further problems.

 

A.      Orphan Messages and Domino Effect:

Consider three processes X, Y and Z running concurrently, as in Fig 2. The parallel lines show the execution of the processes, which communicate through the exchange of messages. Each symbol '[' marks a recovery point to which a process can be rolled back in the event of a failure.

 

 

Fig 2: Effects of Orphan Messages and Domino Effect

 

If process X is to be rolled back, it can be rolled back to recovery point x3 without affecting any other process. Suppose that Y fails after sending message m and is rolled back to y2. In this case, the receipt of m is recorded in x3, but the sending of m is not recorded in y2. We now have a situation where X has received message m from Y, but Y has no record of sending it, which corresponds to an inconsistent state. Under such circumstances, m is referred to as an orphan message, and process X must also roll back. X must roll back because Y interacted with X after establishing its recovery point y2; when Y is rolled back to y2, the event responsible for that interaction is undone.

 

Therefore, all the effects at X caused by the interaction must also be undone. This can be achieved by rolling back X to recovery point x2. Likewise, it can be seen that if Z is rolled back, all three processes must roll back to their very first recovery points, namely x1, y1 and z1. This effect, where rolling back one process causes one or more other processes to roll back, is known as the domino effect, and orphan messages are its cause.

B.      Lost Messages:

Suppose that checkpoints x1 and y1 in Fig 3 are chosen as the recovery points for processes X and Y, respectively. In this case, the event that sent message m is recorded in x1, while the event of its receipt at Y is not recorded in y1. If Y fails after receiving message m, the system is restored to the state (x1, y1), in which message m is lost, as process X is past the point where it sent message m. This condition can also arise if m is lost in the communication channel while processes X and Y are in states x1 and y1, respectively.

 

Fig 3: Lost Messages

 

C.      Livelocks:

In rollback recovery, a livelock is a situation in which a single failure can cause an infinite number of rollbacks, preventing the system from making progress. A livelock situation in a distributed system is shown in Fig 4, which illustrates the activity of two processes X and Y until the failure of Y. Process Y fails before receiving message n1, sent by X. When Y rolls back to y1, there is no record of sending message m1, hence X must roll back to x1. When process Y recovers, it sends out m2 and receives n1 (Fig 5). Process X, after resuming from x1, sends n2 and receives m2. However, because X rolled back, there is no record of sending n1, and hence Y has to roll back a second time. This forces X to roll back too, as it has received m2 and there is no record of sending m2 at Y. This situation can repeat indefinitely, preventing the system from making any progress.

Fig 4: State before Livelock

 

Fig 5: Livelock Situation

 

 VI.   TYPES OF CHECKPOINTING PROTOCOLS:

The protocols are classified into the following categories: (i) asynchronous checkpointing protocols, (ii) synchronous checkpointing protocols and (iii) quasi-synchronous checkpointing protocols.

 

A.      Asynchronous Checkpointing Protocols:

The protocols in this class allow each process to take local checkpoints independently of other processes in the distributed system. Such protocols are also referred to as uncoordinated or asynchronous checkpointing protocols. The failure-free runtime overhead is lowest for this kind of protocol because no coordination is needed between the processes to take checkpoints. During recovery, however, processes must coordinate among themselves to determine a consistent global state, so the recovery overhead is high. Due to bad placement of checkpoints over the communication pattern, the recovery protocol may require several rounds of coordination and rollbacks until a consistent global checkpoint is found. As a result, a lot of useful computation may be lost and the recovery overhead increases.

 

Many of the local checkpoints taken may not be part of any recovery line; these are called useless checkpoints. However, processes may need to store all local checkpoints, since identifying which checkpoints are useless at runtime can be costly. As a result, the storage requirement is high for this kind of protocol.

 

Fig 6: Domino Effect

 

There is also the possibility of a domino effect, which may cause the loss of a large amount of useful computation. Fig 6 shows a checkpoint and communication pattern involving two processes, which can be affected by the domino effect. It can be seen from the figure that all the possible global checkpoints are inconsistent and therefore all local checkpoints are useless checkpoints. If a fault occurs, the recovering process will force the other process to roll back to its previous checkpoint, until both processes are rolled back to their initial states. In order to determine a recovery line during recovery, processes record their dependencies on the checkpoints of other processes during failure-free execution.

 

A straightforward way to keep track of such information is to use a set of message counters, one for each of the processes with which a process communicates. When a process sends out an application message, it increments the counter corresponding to the receiver process and tags the value onto the message as an identification number. Each process also maintains a record of the highest-numbered message received from each of its senders.

 

Processes store both the send and receive information along with their checkpoints, and this information is used to determine inconsistent checkpoints during recovery. A checkpoint of a process P is inconsistent with that of a process Q if the highest message id received from Q recorded in P's checkpoint is higher than the send counter value corresponding to P recorded in Q's checkpoint. In that case process P has to roll back again.
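The following Python sketch illustrates this dependency tracking and the inconsistency test. The class and field names (send_count, highest_recv) are assumptions introduced only for the example, not taken from any specific protocol:

# Illustrative sketch of dependency tracking with per-peer message counters.
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.send_count = {}      # per receiver: number of messages sent
        self.highest_recv = {}    # per sender: highest message id received

    def send(self, receiver_pid):
        # Increment the counter for this receiver and tag the message with it.
        msg_id = self.send_count.get(receiver_pid, 0) + 1
        self.send_count[receiver_pid] = msg_id
        return {"from": self.pid, "id": msg_id}

    def receive(self, message):
        # Remember the highest-numbered message seen from each sender.
        sender = message["from"]
        self.highest_recv[sender] = max(self.highest_recv.get(sender, 0),
                                        message["id"])

    def checkpoint(self):
        # A local checkpoint records both counters (plus the process state).
        return {"send_count": dict(self.send_count),
                "highest_recv": dict(self.highest_recv)}

def inconsistent(ckpt_p, ckpt_q, pid_p, pid_q):
    """P's checkpoint is inconsistent with Q's if P has recorded receiving
    more messages from Q than Q has recorded sending to P."""
    return (ckpt_p["highest_recv"].get(pid_q, 0) >
            ckpt_q["send_count"].get(pid_p, 0))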

 

This may lead to several rounds of coordination and cascaded rollbacks until a consistent global checkpoint is found. These dependency tracking protocols add some overhead during fault-free execution. In case of a fault, all the processes have to coordinate among themselves to decide upon a recovery line to which they will roll back.

 

Therefore, recovery overhead is high in most cases. This approach is very reasonable if faults are rare in the system under consideration. There is also a possibility of domino effect during recovery. This class of protocols does not inherently support output commit.

 

B.      Synchronous Checkpointing Protocols:

In this class of protocols a process does not take local checkpoints independently but synchronizes every checkpointing event with the other processes, so that every checkpointing effort results in a consistent global checkpoint. These protocols are also known as coordinated checkpointing protocols. This class of protocols ensures that whenever a process takes a local checkpoint, all other processes in the system also take their respective local checkpoints. As a result, every local checkpointing effort translates into a global checkpointing activity. These protocols are free from the domino effect. The storage space requirement is minimal, since only the latest checkpoint needs to be stored for recovery.

 

All the checkpoints taken with successful coordination are useful, i.e., the recovery line steadily progresses with every checkpoint. All the latest local checkpoints are part of a consistent global checkpoint, and therefore recovery time is lower compared to asynchronous protocols. However, due to the synchronization effort involved in every checkpointing activity, the checkpointing overhead is high. Synchronous checkpointing protocols can be either blocking or non-blocking.

 

a)       Blocking Protocols:

A straightforward approach to synchronous checkpointing is to block inter-process communication until the checkpointing protocol completes. The protocol is initiated by a coordinator, which sends a request to all processes asking them to checkpoint. On receipt of the request, each process blocks normal execution, takes a tentative checkpoint, and sends an acknowledgment message to the coordinator. On receipt of the acknowledgment messages from all the processes, the coordinator sends a message indicating the end of the protocol. On receipt of this message, all processes make their tentative checkpoints permanent, remove their old permanent checkpoints, and resume normal execution.
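A simplified, single-threaded Python sketch of this blocking two-phase flow is given below; the names Participant and run_blocking_checkpoint are illustrative assumptions, and a real implementation would exchange the requests and acknowledgments as actual messages:

# Simplified, single-threaded simulation of the blocking two-phase flow.
class Participant:
    def __init__(self, pid):
        self.pid = pid
        self.blocked = False
        self.tentative = None
        self.permanent = None

    def on_checkpoint_request(self):
        self.blocked = True                      # stop normal execution
        self.tentative = f"state-of-{self.pid}"  # take a tentative checkpoint
        return "ack"                             # acknowledge to the coordinator

    def on_commit(self):
        self.permanent = self.tentative          # make the tentative checkpoint permanent
        self.tentative = None                    # (the old permanent one is discarded)
        self.blocked = False                     # resume normal execution

def run_blocking_checkpoint(participants):
    # Phase 1: request tentative checkpoints and collect acknowledgments.
    acks = [p.on_checkpoint_request() for p in participants]
    # Phase 2: once every process has acknowledged, commit everywhere.
    if all(a == "ack" for a in acks):
        for p in participants:
            p.on_commit()

procs = [Participant(i) for i in range(3)]
run_blocking_checkpoint(procs)
assert all(p.permanent and not p.blocked for p in procs)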

 

b)       Non Blocking Protocols:

The problem with blocking synchronous checkpointing protocols is that the processes are not allowed to execute until the coordination is complete. Hence checkpointing overhead is high.

 

The idea is to use a special message, called a marker message, which is a checkpoint request message carrying the checkpointing interval number. The protocol is similar to the flooding protocols used in distributed systems. The process that initiates the coordination blocks its local computation, takes a local checkpoint, sends a marker message to all its neighboring processes, and then unblocks the local computation.

 

The receiver of a marker message, if it has not already received a similar message, follows the same procedure as the coordinator and forwards the marker message to all its neighbors. The FIFO constraint can be relaxed by piggybacking the marker on every post-checkpoint message. The same effect can be obtained by marking local checkpoints with a sequence number, called the checkpoint index, and piggybacking the current checkpoint index value on every application message.
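The marker-flooding behaviour described above can be sketched as follows (illustrative Python only; the blocking of the local computation is omitted, and the class name MarkerProcess is an assumption):

# Illustrative single-threaded sketch of marker flooding.
class MarkerProcess:
    def __init__(self, pid, neighbors):
        self.pid = pid
        self.neighbors = neighbors          # directly connected processes
        self.seen_interval = -1             # highest marker interval handled
        self.checkpoints = []

    def initiate(self, interval):
        self.handle_marker(interval)

    def handle_marker(self, interval):
        # A marker for an interval already handled is ignored; otherwise the
        # process takes a checkpoint and floods the marker to its neighbors.
        if interval <= self.seen_interval:
            return
        self.seen_interval = interval
        self.checkpoints.append((interval, f"state-of-{self.pid}"))
        for n in self.neighbors:
            n.handle_marker(interval)

# Three fully connected processes; P0 initiates checkpoint interval 1.
p0, p1, p2 = MarkerProcess(0, []), MarkerProcess(1, []), MarkerProcess(2, [])
p0.neighbors, p1.neighbors, p2.neighbors = [p1, p2], [p0, p2], [p0, p1]
p0.initiate(1)
assert all(p.checkpoints[-1][0] == 1 for p in (p0, p1, p2))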

 

Similar to blocking synchronous checkpointing protocols, attempts have been made to construct non-blocking synchronous checkpointing protocols which require that only a minimal number of processes take checkpoints. The coordination protocols discussed above involve all the processes in the system and therefore raise concerns about the scalability of the protocols.

 

A minimal number of processes, only those whose present states are causally dependent on the current state of the coordinator, need to participate.

 

Mutable checkpoints can be stored in volatile memory and are converted into permanent checkpoints in stable storage only when a new recovery line is developed. Another approach to non-blocking synchronous checkpointing is to avoid explicit coordination by message passing and instead use synchronized clocks to achieve implicit coordination.

 

In modern distributed systems many applications require the clocks of the processors to be approximately synchronized. Many distributed systems run a clock synchronization protocol in the background to keep their clock differences within some guaranteed bound. Such loosely synchronized clocks can facilitate checkpointing without explicit coordination. A process takes a local checkpoint and blocks all receives for a period equal to the maximum deviation between clocks plus the maximum time to detect a failure in the system. It can be shown that the local checkpoints taken this way by all processes form a global recovery line.
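A small illustrative sketch of the receive-blocking window used by such clock-based coordination is shown below; the numeric bounds and the function names are assumptions chosen only for the example:

# Illustrative sketch: with loosely synchronized clocks, each process takes a
# checkpoint at an agreed clock time and then delays incoming application
# messages for a window covering clock skew plus failure-detection latency.
MAX_CLOCK_DEVIATION = 0.05     # assumed bound on clock skew (seconds)
MAX_FAILURE_DETECT = 0.20      # assumed failure-detection latency (seconds)

def blocking_window():
    return MAX_CLOCK_DEVIATION + MAX_FAILURE_DETECT

def should_deliver(receive_time, checkpoint_time):
    # Messages arriving inside the window are delayed, not delivered, so no
    # checkpoint can record a receive whose send is not also recorded.
    return receive_time >= checkpoint_time + blocking_window()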

 

C.      Quasi-Synchronous Checkpointing Protocols:

This class comprises protocols that try to combine the advantages of both asynchronous and synchronous checkpointing protocols. These protocols take two types of checkpoints, namely basic and forced checkpoints.

 

These protocols ensure that every checkpoint taken by a process is part of some recovery line. Since local checkpoints are taken independently, these protocols incur less checkpointing overhead, yet they avoid the domino effect. However, they may end up taking a greater number of checkpoints than asynchronous checkpointing protocols. The failed process can determine the recovery line without any coordination with other processes; the other processes only need to roll back to the recovery line sent by the failed process. Therefore recovery is simple.

 

The protocols in this class piggyback protocol-specific information on every application message. The receiver process analyzes this information to decide whether a forced checkpoint is required. If so, the process first takes a checkpoint and then delivers the message to the application. Informally, the decision is based on whether the checkpoint and communication pattern would create any useless checkpoints in the system. If there is such a possibility, a forced checkpoint is taken to break the pattern.

 

The decision is based on the notions of a Z-path and a Z-cycle, which come from the zigzag path formulation. A Z-path is the same as a zigzag path. A Z-cycle is a Z-path that begins and ends in the same checkpoint interval (Fig 7). Communication-induced checkpointing (CIC) protocols can be broadly subdivided into two classes: index-based coordination protocols and model-based checkpointing protocols.

 

Fig 7: Z-path determination

 

In an index-based coordination checkpointing protocol, a process takes both basic and forced checkpoints, and all local checkpoints of a process are indexed by a monotonically increasing value. The index value is piggybacked on all application messages. When a process receives an application message, it checks whether the piggybacked index value is higher than its own. If so, it updates its own index value to the piggybacked value and takes a forced checkpoint, and only then delivers the message to the application process. The protocol ensures that local checkpoints in different processes having the same index value form a recovery line. More sophisticated protocols transmit additional information on application messages to reduce the number of forced checkpoints.
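The basic forced-checkpoint rule of an index-based protocol can be sketched as follows (illustrative Python; the class CICProcess and its fields are assumptions, not the exact algorithm of any particular published protocol):

# Illustrative sketch of the index-based forced-checkpoint rule.
class CICProcess:
    def __init__(self, pid):
        self.pid = pid
        self.index = 0               # monotonically increasing checkpoint index
        self.checkpoints = []

    def take_basic_checkpoint(self):
        # Basic checkpoints are taken independently, e.g. on a local timer.
        self.index += 1
        self.checkpoints.append((self.index, f"state-of-{self.pid}"))

    def send(self, payload):
        # The current index is piggybacked on every application message.
        return {"payload": payload, "index": self.index}

    def receive(self, message):
        # A higher piggybacked index forces a checkpoint before delivery, so
        # checkpoints with equal indices in different processes form a
        # recovery line.
        if message["index"] > self.index:
            self.index = message["index"]
            self.checkpoints.append((self.index, f"state-of-{self.pid}"))  # forced
        self.deliver(message["payload"])

    def deliver(self, payload):
        pass  # hand the message to the application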

 

A model-based checkpointing protocol defines a model of checkpoint and communication pattern which contains no useless checkpoints.

 

VII. CONCLUSIONS:

This paper presents an enhanced checkpointing algorithm for mobile applications that improves the performance of checkpointing algorithms. The proposed algorithm follows a new strategy to update the checkpoint interval dynamically, as opposed to the static interval used by the existing algorithms described in the preceding sections. Whenever a process takes a forced checkpoint due to the reception of a message with a sequence number higher than its own, the checkpoint interval is reset; that is, a new interval starts from the point where the forced checkpoint is taken. Resetting the timer after a forced checkpoint thus begins a new basic checkpoint interval.
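As a rough sketch of the proposed strategy, the interval reset could be implemented along the following lines (illustrative Python only; the paper gives no implementation, so the class EnhancedProcess and all parameter names are assumptions):

import time

# Illustrative sketch of the proposed dynamic-interval strategy: the basic
# checkpoint timer is restarted whenever a forced checkpoint is taken, so a
# fresh interval begins at the forced checkpoint instead of continuing the
# old static schedule.
class EnhancedProcess:
    def __init__(self, pid, basic_interval):
        self.pid = pid
        self.index = 0
        self.basic_interval = basic_interval              # seconds between basic checkpoints
        self.next_basic_due = time.time() + basic_interval

    def _save_checkpoint(self, new_index=None):
        self.index = new_index if new_index is not None else self.index + 1
        # ... write the process state to stable storage here ...
        # Reset the basic checkpoint interval: a new interval starts now.
        self.next_basic_due = time.time() + self.basic_interval

    def tick(self):
        # Called periodically; takes a basic checkpoint when the timer expires.
        if time.time() >= self.next_basic_due:
            self._save_checkpoint()

    def receive(self, message):
        # Reception of a message with a higher sequence number forces a
        # checkpoint, and the forced checkpoint restarts the basic interval.
        if message["index"] > self.index:
            self._save_checkpoint(new_index=message["index"])
        # deliver message["payload"] to the application afterwards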

 
