Enhanced Checkpointing Algorithm for Mobile Applications
Kailash Prasad Dewangan1, Somesh Kumar Dewangan2
1Department of Computer Science & Engineering, Kruti Institute of Technology and Engineering, Raipur (C.G.), India
2Department of Computer Science & Engineering, Disha Institute of Management and Technology, Raipur (C.G.), India
*Corresponding Author Email: kailashprasadd80@gmail.com, somesh_4@yahoo.com
ABSTRACT:
Rollback-recovery in mobile systems is important for fault-tolerant computing. Without fault-tolerance mechanisms, an application running on a system has to be restarted from scratch if a fault occurs in the middle of its execution, resulting in the loss of useful computation. To provide efficient rollback-recovery for fault tolerance in distributed systems, it is important for distributed checkpointing algorithms to reduce the number of checkpoints while still guaranteeing the existence of consistent global checkpoints. Because of the dependencies among process states induced by inter-process communication in distributed systems, asynchronous checkpointing may suffer from the domino effect. Therefore, a consistent global checkpoint should always be ensured to restrict the rollback distance. Quasi-synchronous checkpointing protocols achieve this synchronization in a loose fashion. The algorithm proposed in this paper follows a new strategy that updates the checkpoint interval dynamically.
KEY WORDS: Checkpoint, Quasi-synchronous checkpointing, Overhead, Recovery, Distributed Systems
I. INTRODUCTION
A checkpoint is a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving this status information. A checkpoint of a process is the information about the state of that process at some instant of time. Fault tolerance through checkpoint-and-recovery techniques involves taking a checkpoint of an application process periodically and logging the checkpoint in stable storage, which is immune to failures. The checkpoint of an application process captures the state of the process and can be used to restart it from a state corresponding to that checkpoint. Rollback recovery, in turn, is the procedure of restarting an application process from a checkpoint stored in stable storage.
A checkpoint can be saved on either stable
storage or the volatile storage of another process, depending on the failure
scenarios to be tolerated.
For long-running scientific applications, checkpointing and rollback-recovery can be used to minimize the total execution time in the presence of failures.
For an application involving a single process, checkpointing and recovery are simple. In the event of a node failure, the application process is restarted from its latest checkpoint, i.e., it is rolled back to its latest checkpoint. However, several issues with checkpointing and recovery arise in a distributed system that runs applications with multiple concurrent processes communicating with each other via messages.
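For the single-process case, the mechanism can be illustrated with a minimal sketch in Python; the file name, the state layout, and the use of pickle as stable storage are illustrative assumptions, not details from this paper.

```python
import os
import pickle

CHECKPOINT_FILE = "app.ckpt"  # hypothetical stable-storage location


def save_checkpoint(state):
    """Write the process state to stable storage atomically."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)  # atomic rename: the old checkpoint survives a crash mid-write


def restore_checkpoint():
    """Return the latest saved state, or a fresh initial state if none exists."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "result": 0}


def run(total_steps, checkpoint_every):
    state = restore_checkpoint()          # after a failure, roll back to the latest checkpoint
    while state["step"] < total_steps:
        state["result"] += state["step"]  # stand-in for one unit of useful computation
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state)
    return state["result"]
```

If the node fails and the process is restarted, run() resumes from the last checkpointed step rather than from scratch.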
II. ASPECTS OF CHECKPOINTING:
Some of the aspects to be considered with checkpointing are the frequency of checkpointing, the contents of a checkpoint, and the methods of checkpointing.
A. Frequency of Checkpointing:
A checkpointing algorithm executes in parallel with the underlying computation. Therefore, the overhead introduced by checkpointing should be minimized. At the same time, checkpointing should enable a user to recover quickly and not lose substantial computation in case of an error, which calls for frequent checkpointing and consequently significant overhead. The number of checkpoints initiated should therefore be chosen so that the cost of information loss due to failure is small and the overhead due to checkpointing is not significant. Both depend on the failure probability and the importance of the computation. For example, in a transaction processing system where every transaction is important and information loss is not permitted, a checkpoint may be taken after every transaction, increasing the checkpointing overhead significantly.
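This trade-off can be made concrete with a small back-of-the-envelope model in Python; the constant failure rate, the half-interval expected-loss assumption, and all numbers below are illustrative assumptions rather than an analysis from this paper.

```python
def expected_cost_per_hour(interval_h, ckpt_overhead_h, failure_rate_per_h):
    """Rough cost model: checkpointing overhead per hour plus expected recomputation.

    Assumes failures arrive at a constant rate and that, on average, half a
    checkpoint interval of work is lost per failure. A simple illustrative
    model, not the paper's analysis.
    """
    overhead = ckpt_overhead_h / interval_h                # checkpoints per hour * cost of each
    expected_loss = failure_rate_per_h * (interval_h / 2)  # expected lost work per hour
    return overhead + expected_loss


# Compare a few candidate intervals for a machine that fails about once per 100 hours
for interval in (0.25, 0.5, 1.0, 2.0, 4.0):
    cost = expected_cost_per_hour(interval, ckpt_overhead_h=0.01, failure_rate_per_h=0.01)
    print(f"interval={interval:>4} h  expected overhead+loss = {cost:.4f} h per hour")
```

Short intervals pay mostly in checkpointing overhead, long intervals mostly in lost work, which is the balance described above.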
B. Contents of a Checkpoint:
The state of a process has to be saved in stable storage so that the process can be restarted in case of an error. The saved context includes the code, data and stack segments along with the environment and the register contents. The environment includes information about the various files currently in use and their file pointers. In message-passing systems, the environment also includes those messages that have been sent but not yet received.
C. Methods of Checkpointing:
The methodology used for checkpointing depends on the architecture of the system. Methods used in multiprocessor systems should incorporate explicit coordination, unlike uniprocessor systems. In a message-passing system, the messages should be monitored and, if necessary, saved as part of the global context, because messages introduce dependencies among the processors. A shared-memory system, on the other hand, communicates through shared variables, which introduce dependencies among the nodes; thus, at the time of checkpointing, the memory has to be in a consistent state to obtain a set of concurrent checkpoints.
III. OVERHEAD OF CHECKPOINTING:
During a failure-free run, every global checkpoint incurs coordination overhead and context-saving overhead in a distributed system. We define them as follows.
A. Coordination Overhead:
In a distributed system, coordination among processes is needed to obtain a consistent global state. Special messages and information piggybacked on regular messages are used to achieve this coordination. Coordination overhead is the cost of these special messages and the piggybacked information.
B. Context Saving Overhead:
The time taken to save the global context of a computation is defined as the context-saving overhead. Storing checkpoints in stable storage contributes a major part of the overhead of checkpoint-and-recovery protocols, and this overhead is proportional to the size of the saved context.
IV. CONSISTENT SYSTEM STATE:
In a distributed system, a computation node can take checkpoints of its local processes only; such checkpoints are called local checkpoints. A global checkpoint of a distributed system is defined as a set of local checkpoints, one from each process in the system. After recovery, the system must be in a consistent state. A global state of a message-passing system is a collection of the individual states of all participating processes and of the states of the communication channels. Intuitively, a consistent global state is one that may occur during a failure-free, correct execution of a distributed computation. More precisely, a consistent system state is one in which, if a process state reflects a message receipt, then the state of the corresponding sender reflects sending that message.
Fig 1 shows two examples of global states: a consistent state in Fig 1(a) and an inconsistent state in Fig 1(b). Note that the consistent state in Fig 1(a) shows message m1 to have been sent but not yet received. This state is consistent, because it represents a situation in which the message has left the sender and is still traveling across the network. On the other hand, the state in Fig 1(b) is inconsistent because process P2 is shown to have received m2 but the state of process P1 does not reflect sending it. Such a state is impossible in any failure-free, correct computation.
Inconsistent states occur because of
failures. For example, the situation shown in part (b) of Figure 1 may occur if
process P1 fails after sending message m2 to P2 and
then restarts at the state shown in the figure. A fundamental goal of any
rollback-recovery protocol is to bring the system into a consistent state when
inconsistencies occur because of a failure. The reconstructed consistent state
is not necessarily one that has occurred before the failure. It is sufficient
that the reconstructed state be one that could have occurred before the failure
in a failure-free, correct execution.
Fig 1: An example of a consistent and an inconsistent state.
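The consistency condition can be expressed directly as a predicate over the recorded send and receive events; the event-set representation of local states below is an illustrative assumption used only to mirror the Fig 1 scenarios.

```python
def is_consistent(global_state):
    """Return True if every recorded message receipt has a corresponding recorded send.

    `global_state` maps a process id to the sets of message ids it has sent and
    received up to its local checkpoint (an illustrative representation).
    """
    all_sent = set()
    for local in global_state.values():
        all_sent |= local["sent"]
    return all(local["received"] <= all_sent for local in global_state.values())


# Fig 1(a)-style state: m1 has been sent but not yet received -> consistent (in transit)
print(is_consistent({"P1": {"sent": {"m1"}, "received": set()},
                     "P2": {"sent": set(), "received": set()}}))   # True

# Fig 1(b)-style state: m2 is recorded as received, but its send is not -> inconsistent
print(is_consistent({"P1": {"sent": set(), "received": set()},
                     "P2": {"sent": set(), "received": {"m2"}}}))  # False
```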
V. ISSUES IN CHECKPOINTING:
In concurrent systems, several processes cooperate by exchanging information to accomplish a task. The information is exchanged through message passing. In such a system, if one of the cooperating processes fails and resumes execution from a recovery point, then the effects it has caused at other processes through the information it exchanged with them after establishing the recovery point have to be undone. To undo the effects caused by a failed process at an active process, the active process must also roll back to an earlier state. Rolling back processes in a concurrent system is more difficult than in the case of a single process. The following discussion illustrates how rolling back processes can cause further problems.
A. Orphan Messages and Domino Effect:
Consider three processes X, Y and Z running concurrently, as shown in Fig 2. The horizontal lines show the executions of the processes, which communicate through the exchange of messages. Each symbol '[' marks a recovery point to which a process can be rolled back in the event of a failure.
Fig 2: Effects of Orphan Messages and the Domino Effect
If process X is to be rolled back, it can be rolled back to recovery point x3 without affecting any other process. Suppose instead that Y fails after sending message m and is rolled back to y2. In this case, the receipt of m is recorded in x3, but the sending of m is not recorded in y2. We now have a situation where X has received message m from Y, but Y has no record of sending it, which corresponds to an inconsistent state. Under such circumstances, m is referred to as an orphan message, and process X must also roll back. X must roll back because Y interacted with X after establishing its recovery point y2; when Y is rolled back to y2, the event responsible for that interaction is undone.
Therefore, all the effects at X caused by the interaction must also be undone. This can be achieved by rolling back X to recovery point x2. Likewise, it can be seen that, if Z is rolled back, all three processes must roll back to their very first recovery points, namely x1, y1 and z1. This effect, where rolling back one process causes one or more other processes to roll back, is known as the domino effect, and orphan messages are its cause.
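The cascading rollback can be sketched as a fixed-point computation: keep rolling processes back while any chosen checkpoint records the receipt of a message whose send is not recorded. The cumulative send/receive representation below is an illustrative assumption, not notation from the paper.

```python
def find_recovery_line(checkpoints):
    """Return, for each process, the index of the checkpoint it must roll back to.

    `checkpoints[p]` is the list of checkpoints of process p, oldest first; each
    checkpoint records the sets of message ids p has sent and received so far.
    Every process, the failed one included, starts from its latest checkpoint.
    """
    line = {p: len(cps) - 1 for p, cps in checkpoints.items()}

    changed = True
    while changed:
        changed = False
        recorded_sends = set()
        for p, idx in line.items():
            recorded_sends |= checkpoints[p][idx]["sent"]
        for p, idx in line.items():
            orphans = checkpoints[p][idx]["received"] - recorded_sends
            if orphans and idx > 0:
                line[p] = idx - 1      # orphan message found: roll p back one more checkpoint
                changed = True
    return line
```

In a pattern like the one in Fig 2, successive iterations of this loop are exactly the domino effect: each pass uncovers a new orphan message and pushes some process back toward its first recovery point.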
B. Lost Messages:
Suppose that checkpoints x1 and y1 in Fig 3 are chosen as the recovery points for processes X and Y, respectively. In this case, the event that sent message m is recorded in x1, while the event of its receipt at Y is not recorded in y1. If Y fails after receiving message m, the system is restored to the state {x1, y1}, in which message m is lost because process X is past the point where it sent m. This condition can also arise if m is lost in the communication channel while processes X and Y are in states x1 and y1, respectively.
Fig 3: Lost Messages
C. Livelocks:
In rollback recovery, a livelock is a situation in which a single failure can cause an infinite number of rollbacks, preventing the system from making progress. A livelock situation in a distributed system is shown in Fig 4, which illustrates the activity of two processes X and Y until the failure of Y. Process Y fails before receiving message n1, sent by X. When Y rolls back to y1, there is no record of sending message m1, hence X must roll back to x1. When process Y recovers, it sends out m2 and receives n1 (Fig 5). Process X, after resuming from x1, sends n2 and receives m2. However, because X rolled back, there is no record of sending n1, and hence Y has to roll back a second time. This forces X to roll back too, as it has received m2 and there is no record of sending m2 at Y. This situation can repeat indefinitely, preventing the system from making any progress.
Fig 4: State before Livelock
Fig 5: Livelock Situation
VI. TYPES OF CHECKPOINTING PROTOCOLS:
Checkpointing protocols are classified into the following categories: (i) asynchronous checkpointing protocols, (ii) synchronous checkpointing protocols, and (iii) quasi-synchronous checkpointing protocols.
A. Asynchronous Checkpointing Protocols:
The protocols in this class allow processes to take local checkpoints independently of other processes in the distributed system. Such protocols are also referred to as uncoordinated or asynchronous checkpointing protocols. The fault-free runtime overhead is lowest for this kind of protocol, because no coordination is needed between the processes to take checkpoints. During recovery, however, processes must coordinate among themselves to determine a consistent global state, so the recovery overhead is high. Due to bad placement of checkpoints over the communication pattern, the recovery protocol may require several rounds of coordination and rollbacks until a consistent global checkpoint is found. As a result, a lot of useful computation may be lost and the recovery overhead increases.
Many of the local checkpoints taken may not be part of any recovery line; these are called useless checkpoints. However, processes may need to store all their local checkpoints, since identifying which checkpoints are useless at runtime can be costly. As a result, the storage requirement is high for this kind of protocol.
Fig 6: Domino Effect
There is also the possibility of the domino effect, which may cause the loss of a large amount of useful computation. Fig 6 shows a checkpoint and communication pattern involving two processes that is affected by the domino effect. It can be seen from the figure that all the possible global checkpoints are inconsistent and therefore all local checkpoints are useless checkpoints. If a fault occurs, the recovering process will force the other process to roll back to its previous checkpoint, and so on, until both processes are rolled back to their initial state. In order to determine a recovery line during recovery, processes record their dependencies on checkpoints of other processes during failure-free execution.
The straightforward way to keep track of such information is to use a set of message counters, one for each process with which a process communicates. When a process sends out an application message, it increments the counter value corresponding to the receiver process and tags the value onto the message as an identification number. Each process also maintains a record of the highest-numbered message received from each of its senders.
Processes store this send and receive information along with their checkpoints, and it is used to determine inconsistent checkpoints during recovery. A checkpoint of a process P is inconsistent with that of a process Q if the highest message id received from Q recorded in P's checkpoint is higher than the send counter value corresponding to P recorded in Q's checkpoint. In that case, process P has to roll back again.
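A minimal sketch of this counter-based dependency tracking might look as follows; the class layout and method names are illustrative assumptions.

```python
from collections import defaultdict


class TrackedProcess:
    """Tracks per-receiver send counters and highest received message ids (a sketch)."""

    def __init__(self, pid):
        self.pid = pid
        self.send_count = defaultdict(int)        # receiver pid -> messages sent so far
        self.highest_received = defaultdict(int)  # sender pid -> highest message id received
        self.checkpoints = []

    def send(self, receiver_pid):
        self.send_count[receiver_pid] += 1
        return (self.pid, self.send_count[receiver_pid])   # (sender, message id) tag

    def receive(self, tag):
        sender_pid, msg_id = tag
        self.highest_received[sender_pid] = max(self.highest_received[sender_pid], msg_id)

    def take_checkpoint(self):
        self.checkpoints.append({"sent": dict(self.send_count),
                                 "recv": dict(self.highest_received)})


def inconsistent(ckpt_p, ckpt_q, p_id, q_id):
    """P's checkpoint is inconsistent with Q's if P recorded receiving a message
    from Q that Q's checkpoint does not record as sent."""
    return ckpt_p["recv"].get(q_id, 0) > ckpt_q["sent"].get(p_id, 0)
```

During recovery, inconsistent() is evaluated pairwise over candidate checkpoints; every time it returns True, the receiving process must roll back further, which is what produces the cascaded rollbacks described next.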
This may lead to several rounds of
coordination and cascaded rollbacks until a consistent global checkpoint is
found. These dependency tracking protocols add some overhead during fault-free
execution. In case of a fault, all the processes have to coordinate among
themselves to decide upon a recovery line to which they will roll back.
Therefore, the recovery overhead is high in most cases. This approach is reasonable if faults are rare in the system under consideration. There is also a possibility of the domino effect during recovery. This class of protocols does not inherently support output commit.
B. Synchronous Checkpointing Protocols:
In this class of protocols, a process does not take local checkpoints independently but synchronizes every checkpointing event with the other processes, so that every checkpointing effort results in a consistent global checkpoint. These protocols are also known as coordinated checkpointing protocols. This class of protocols ensures that whenever a process takes a local checkpoint, all other processes in the system also take their respective local checkpoints. As a result, every local checkpointing effort translates into a global checkpointing activity, and these protocols are free from the domino effect. The storage space requirement is minimal, since only the latest checkpoint needs to be stored for recovery.
All the checkpoints taken with successful
coordination are useful, i.e., the recovery line steadily progresses with every
checkpoint. All the latest local checkpoints are part of a consistent global
checkpoint, and therefore recovery time is lower compared to asynchronous
protocols. However, due to the effort of synchronization involved in every checkpointing activity, checkpointing
overhead is high. Synchronous checkpointing protocols
can be either blocking or non-blocking.
a) Blocking Protocols:
A straightforward approach to synchronous checkpointing is to block inter-process communication until the checkpointing protocol completes. The protocol is initiated by a coordinator, which sends a request to all processes asking them to checkpoint. On receipt of the request, each process blocks normal execution, takes a tentative checkpoint, and sends an acknowledgment message to the coordinator. On receipt of the acknowledgment messages from all the processes, the coordinator sends a message indicating the end of the protocol. On receipt of this message, all processes make their tentative checkpoints permanent, remove their old permanent checkpoints, and resume normal execution.
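The two phases of this blocking protocol can be sketched sequentially in Python; in a real system the request, acknowledgment, and commit steps are network messages, and the class and function names here are illustrative assumptions.

```python
class Proc:
    """Minimal stand-in for an application process (an illustrative assumption)."""

    def __init__(self, pid):
        self.pid = pid
        self.blocked = False
        self.state = {"work_done": 0}
        self.tentative = None
        self.permanent = None

    def block(self):
        self.blocked = True

    def unblock(self):
        self.blocked = False

    def take_tentative_checkpoint(self):
        self.tentative = dict(self.state)

    def make_checkpoint_permanent(self):
        # The new checkpoint replaces the old permanent one, which is discarded.
        self.permanent, self.tentative = self.tentative, None


def blocking_checkpoint_round(processes):
    """One round of blocking coordinated checkpointing, run sequentially as a sketch."""
    acks = []
    for p in processes:                     # coordinator's checkpoint request
        p.block()                           # suspend normal execution and communication
        p.take_tentative_checkpoint()
        acks.append(p.pid)                  # acknowledgment back to the coordinator
    if len(acks) == len(processes):         # coordinator's end-of-protocol message
        for p in processes:
            p.make_checkpoint_permanent()
            p.unblock()                     # resume normal execution


procs = [Proc(i) for i in range(3)]
blocking_checkpoint_round(procs)
print([p.permanent for p in procs])         # every process now holds a consistent checkpoint
```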
b) Non-Blocking Protocols:
The problem with blocking synchronous checkpointing protocols is that the processes are not allowed to execute until the coordination is complete; hence the checkpointing overhead is high. The non-blocking approach instead uses a special message, called a marker message, which is a checkpoint request carrying the checkpointing interval number. The protocol is similar to the flooding protocols used in distributed systems. The process that initiates the coordination blocks its local process, takes a local checkpoint, sends a marker message to all its neighboring processes, and then unblocks the local process.
The receiver of a marker message, if it has not already received a similar message, follows the same procedure as the initiator and forwards the marker message to all its neighbors. The FIFO constraint can be relaxed by piggybacking the marker on every post-checkpoint message. The same effect can be obtained by marking local checkpoints with a sequence number, called the checkpoint index, and piggybacking the current checkpoint index value on every application message.
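A sketch of the marker-flooding behaviour for a single process might look as follows; the on_marker interface and the interval bookkeeping are illustrative assumptions.

```python
class MarkerProc:
    """Marker-based non-blocking checkpointing, sketched for one process.

    `neighbors` holds peer objects exposing an `on_marker(interval)` method;
    the names and structure are illustrative assumptions.
    """

    def __init__(self, pid):
        self.pid = pid
        self.neighbors = []        # filled in once all processes are created
        self.interval = 0          # current checkpointing interval number
        self.checkpoints = []

    def initiate(self):
        """Start a new coordination round as the initiator."""
        self._checkpoint_and_flood(self.interval + 1)

    def on_marker(self, marker_interval):
        """Handle a marker; ignore it if this interval has already been checkpointed."""
        if marker_interval > self.interval:
            self._checkpoint_and_flood(marker_interval)

    def _checkpoint_and_flood(self, new_interval):
        # (A real implementation briefly blocks the local application here.)
        self.interval = new_interval
        self.checkpoints.append(("local state", new_interval))  # take the local checkpoint
        for peer in self.neighbors:
            peer.on_marker(new_interval)   # forward the marker to every neighbor
        # (The local application is unblocked here.)
```

With a set of mutually connected MarkerProc objects, a single call to initiate() on any one of them leaves every process holding a checkpoint for the same interval, with duplicate markers ignored.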
As with blocking synchronous checkpointing protocols, attempts have been made to construct non-blocking synchronous checkpointing protocols that require only a minimal number of processes to take checkpoints. The coordination protocols discussed above involve all the processes in the system and therefore raise concerns about scalability. Only a minimal set of processes, namely those whose present states are causally dependent on the current state of the coordinator, need to participate.
Mutable checkpoints can be stored in volatile memory and are converted into permanent checkpoints in stable storage only when a new recovery line is formed. Another approach to non-blocking synchronous checkpointing is to avoid explicit coordination by message passing and instead use synchronized clocks to achieve implicit coordination.
In modern distributed systems, many applications require the clocks of the processors to be approximately synchronized. Many distributed systems run a clock synchronization protocol in the background to keep their clock differences within some guaranteed bound. Such loosely coupled synchronized clocks can facilitate checkpointing without explicit coordination: a process takes a local checkpoint and blocks all receives for a period equal to the maximum deviation between clocks plus the maximum time to detect a failure in the system. It can be shown that the local checkpoints taken this way by all processes form a global recovery line.
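A sketch of this clock-driven, coordination-free scheme is shown below; the callback parameters and the sleep-based loop are illustrative assumptions.

```python
import time


def clock_driven_checkpointing(rounds, period_s, clock_skew_s, failure_detect_s,
                               take_checkpoint, block_receives, unblock_receives):
    """Checkpointing driven by loosely synchronized clocks (a sketch).

    Every process runs this loop independently. Because the clocks agree to
    within `clock_skew_s`, the checkpoints taken at the same period boundary
    line up into a recovery line. All callbacks and parameter names are
    illustrative assumptions.
    """
    for _ in range(rounds):
        # Sleep until the next period boundary of the (synchronized) local clock.
        time.sleep(period_s - (time.time() % period_s))
        take_checkpoint()
        # Block receives long enough for every other process to have reached the
        # same boundary (clock skew) and for any failure to have been detected.
        block_receives()
        time.sleep(clock_skew_s + failure_detect_s)
        unblock_receives()
```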
C. Quasi-Synchronous Checkpointing Protocols:
This class comprises protocols that try to combine the advantages of both asynchronous and synchronous checkpointing protocols. These protocols take two types of checkpoints, namely basic and forced checkpoints.
They ensure that every checkpoint taken by a process belongs to some recovery line. Since local checkpoints are taken independently, these protocols incur less checkpointing overhead, and yet they avoid the domino effect. However, they may end up taking more checkpoints than asynchronous checkpointing protocols. A failed process can determine the recovery line without any coordination with other processes; the other processes only need to roll back to the recovery line sent by the failed process. Recovery is therefore simple.
The protocols in this class piggyback protocol-specific information on every application message. The receiver process analyzes this information to decide whether a forced checkpoint is required. If so, the process first takes a checkpoint and then delivers the message to the application. Informally, the decision is based on whether the checkpoint and communication pattern would create any useless checkpoints in the system; if there is such a possibility, a forced checkpoint is taken to break the pattern.
The decision is based on the notions of a Z-path and a Z-cycle, which come from the zigzag path formulation. A Z-path is the same as a zigzag path, and a Z-cycle is a Z-path that begins and ends in the same checkpoint interval (Fig 7). Communication-induced checkpointing (CIC) protocols can be broadly subdivided into two classes: index-based coordination protocols and model-based checkpointing protocols.
Fig 7: Z-path determination
In an index-based coordination checkpointing protocol, a process takes both basic and forced checkpoints, and all local checkpoints of a process are indexed by a monotonically increasing value. The index value is piggybacked on all application messages. When a process receives an application message, it checks whether the piggybacked index value is higher than its own; if so, it updates its own index value to the piggybacked value and takes a forced checkpoint. It then delivers the message to the application process. The protocol ensures that local checkpoints in different processes having the same index value form a recovery line. More sophisticated protocols transmit additional information on application messages to reduce the number of forced checkpoints.
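An index-based forced-checkpoint rule of this kind can be sketched as follows; the class and method names are illustrative assumptions.

```python
class IndexedProc:
    """Index-based communication-induced checkpointing for one process (a sketch)."""

    def __init__(self, pid):
        self.pid = pid
        self.index = 0              # monotonically increasing checkpoint index
        self.checkpoints = []

    def take_basic_checkpoint(self):
        """Basic checkpoint, taken independently (e.g. when a local timer expires)."""
        self.index += 1
        self.checkpoints.append((self.index, "local state"))

    def send(self, payload):
        return {"payload": payload, "index": self.index}   # piggyback the current index

    def receive(self, message):
        if message["index"] > self.index:
            # Forced checkpoint: adopt the higher index before delivering the message,
            # so that checkpoints with equal indices form a recovery line.
            self.index = message["index"]
            self.checkpoints.append((self.index, "local state"))
        self.deliver(message["payload"])

    def deliver(self, payload):
        pass  # hand the message to the application (omitted)
```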
A model-based checkpointing protocol, in contrast, defines a model of checkpoint and communication patterns that contains no useless checkpoints, and takes forced checkpoints to keep the observed pattern within that model.
VII. CONCLUSIONS:
This paper presents an enhanced checkpointing
algorithm for mobile applications to improve the performance of checkpointing algorithms. The proposed algorithm follows a
new strategy to update the checkpoint interval dynamically as opposed to the
static interval used by the existing algorithms explained in the above
sections. Whenever a process takes a forced checkpoint due to the reception of
a message with sequence number higher than the sequence number of the process,
the checkpoint interval is reset that is a new interval starts from the point
where the forced checkpoint is taken. The timer reset strategy of resetting the
timer after a forced checkpoint restarts a new basic checkpoint interval.
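A minimal sketch of how this timer-reset strategy might look is given below; it is an illustrative reading of the description above, not the authors' implementation, and the class and parameter names are assumptions.

```python
import time


class DynamicIntervalProc:
    """Sketch of the proposed timer-reset strategy (an illustrative reading of the
    paper's description, not the authors' code)."""

    def __init__(self, pid, interval_s):
        self.pid = pid
        self.index = 0                      # checkpoint sequence number of this process
        self.interval_s = interval_s        # length of the basic checkpoint interval
        self.next_basic_at = time.time() + interval_s
        self.checkpoints = []

    def tick(self):
        """Called periodically; takes a basic checkpoint when the timer expires."""
        if time.time() >= self.next_basic_at:
            self.index += 1
            self.checkpoints.append((self.index, time.time()))
            self.next_basic_at = time.time() + self.interval_s

    def receive(self, message):
        if message["index"] > self.index:
            # Forced checkpoint: the message carries a higher sequence number.
            self.index = message["index"]
            self.checkpoints.append((self.index, time.time()))
            # Reset the timer: a new basic checkpoint interval starts here,
            # instead of letting the old static interval run to completion.
            self.next_basic_at = time.time() + self.interval_s
        # ... deliver message["payload"] to the application (omitted)
```

The intent, as described above, is that a basic checkpoint does not fire on the old schedule shortly after a forced checkpoint has already been taken, avoiding closely spaced, redundant checkpoints.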
Received on 07.03.2012; Accepted on 02.04.2013; Modified on 06.04.2013
© A&V Publications. All rights reserved.
Research J. Science and Tech. 5(3): July-Sept., 2013, pp. 352-357