Basics
- E. N. Elnozahy, D. B. Johnson, and Y. M. Wang. "A survey of
rollback-recovery protocols in message-passing systems." Technical Report
CMU-CS-96-181, Carnegie Mellon University, October 1996. [pdf]
- L. Lamport. "Time, clocks, and the ordering of events in a
distributed system." Communications of the ACM, 21(7):558-565,
Jul. 1978. [pdf]
- J. Mellor-Crummey and T. LeBlanc. "A software instruction
counter." In Proceedings of the 3rd Symposium on Architectural
Support for Programming Languages and Operating Systems, pp. 78-86,
Apr. 1989. [pdf]
- R.D. Schlichting and F.B. Schneider. "Fail-Stop processors: An
approach to designing fault-tolerant computing systems." ACM
Transactions on Computer Systems, 1(3):222-238, Aug. 1983. [pdf]
Coordinated checkpointing
blocking
- J.S. Plank, M. Beck, G. Kingsley and K. Li. "Libckpt: Transparent
checkpointing under UNIX." In Proceedings of the USENIX
Winter 1995 Technical Conference, pp. 213-223, Jan. 1995. [pdf]
- J. S. Plank, J. Xu and R.B. Netzer. "Compressed differences: An
algorithm for fast incremental checkpointing." Technical Report
CS-95-302, University of Tennessee at Knoxville, Aug. 1995. [ps]
- A. Beguelin, E. Seligman and P. Stephan. "Application level fault
tolerance in heterogeneous networks of workstations." In Journal of
Parallel & Distributed Computing, 43(2):147-155, Jun. 1997. [ps]
- E. Seligman and A. Beguelin. "High-level fault tolerance in
distributed programs." Technical Report CMU-CS-94-223, Department of
Computer Science, Carnegie Mellon University, Dec. 1994. [ps]
non-blocking
- O. Babaoglu and K. Marzullo. "Consistent global states of
distributed systems: Fundamental concepts and mechanisms." Distributed
Systems, Ed. S. Mullender, Addison-Wesley, pp. 55-96,
1993. [ps]
- M. Chandy and L. Lamport. "Distributed snapshots: Determining
global states of distributed systems." In ACM Transactions
on Computer Systems, 3(1):63-75, Feb. 1985. [pdf]
- K. Li, J.F. Naughton and J.S. Plank. "Real-time, concurrent
checkpoint for parallel programs." In Proceedings of the 1990 Conference
on the Principles and Practice of Parallel Programming, pp. 79-88,
Mar. 1990. [pdf]
Uncoordinated checkpointing
- Y. M. Wang. "Space reclamation for uncoordinated checkpointing in
message-passing systems." Ph.D. Thesis, University of Illinois
Urbana-Champaign, Aug. 1993. [???]
- Y. M. Wang, P. Y. Chung, I. J. Lin and W. K. Fuchs. "Checkpoint space
reclamation for uncoordinated checkpointing in message-passing
systems." In IEEE Transactions on Parallel and Distributed Systems,
6(5):546-554, May 1995. [???]
- Y. M. Wang, P. Y. Chung, and W. K. Fuchs. "Tight upper bound on
useful distributed system checkpoints." Tech. Rep. CRHC-95-16,
Coordinated Science Laboratory, University of Illinois at Urbana-Champaign,
1995. [ps]
Message logging
survey: Alvisi/Marzullo
- L. Alvisi and K. Marzullo. "Message logging: Pessimistic,
optimistic, causal and optimal." In IEEE Transactions on Software
Engineering, 24(2):149-159, Feb. 1998. [ps]
optimistic:
- R. Strom and S. Yemini. "Optimistic recovery in distributed
systems." ACM Transactions on Computer Systems, 3(3):204-226,
Aug. 1985. [pdf]
- D.B. Johnson and W. Zwaenepoel. "Recovery in distributed systems
using optimistic message logging and checkpointing." In Proceedings
of the Sixth Annual ACM Symposium on Principles of Distributed Computing
Systems, PODC-88, pp. 171-181, Aug. 1988. [pdf]
- D. B. Johnson and W. Zwaenepoel. "Transparent optimistic rollback
recovery." In Operating Systems Review, pp. 99-102, Apr.
1991. [ps]
sender-based logging:
- D.B. Johnson and W. Zwaenepoel. "Sender-based message
logging." In Proceedings of the Seventeenth International Symposium
on Fault-Tolerant Computing (FTCS-17), pp. 14-19, Jun. 1987. [ps]
- J. Xu, R.B. Netzer, and M. Mackey. "Sender-based message logging
for reducing rollback propagation." In Proceedings of the
Seventh IEEE Symposium on Parallel and Distributed Processing, pp. 602-609,
1995. [???]
causal logging:
Manetho:
- E.N. Elnozahy. "Manetho: Fault tolerance in distributed systems
using rollback-recovery and process replication." Ph.D. Thesis,
Rice University, Oct. 1993. Also available as Technical Report 93-212,
Department of Computer Science, Rice University. [ps]
- E.N. Elnozahy, D.B. Johnson, and W. Zwaenepoel. "The performance
of consistent checkpointing." In Proceedings of the Eleventh
Symposium on Reliable Distributed Systems, pp. 39-47, Oct.
1992. [ps]
- E.N. Elnozahy and W. Zwaenepoel. "On the use and implementation
of message logging." In Proceedings of the Twenty Fourth
International Symposium on Fault-Tolerant Computing (FTCS-24), pp.
298-307, Jun. 1994. [ps]
- E.N. Elnozahy and W. Zwaenepoel. "Manetho, transparent
rollback-recovery with low overhead, limited rollback and fast output
commit." In IEEE Transactions on Computers, Special Issue on
Fault-Tolerant Computing, 41(5):526-531, May 1992. [ps]
FBL: Alvisi and Marzullo
- L. Alvisi and K. Marzullo. "Trade-offs in implementing causal
message logging protocols." In Proceedings of the 1996 ACM
SIGACT-SIGOPS Symposium on Principles of Distributed Computing Systems (PODC),
pp. 58-67, 1996. [pdf]
Byzantine failures
Tennessee:
- Y. Kim, J.S. Plank and J.J. Dongarra. "Fault-tolerant matrix
operations using checksum and reverse computation." In Proceedings
of the 6th Symposium on the Frontiers of Massively Parallel Computation, Oct.
1996. [ps]
- Y. Kim, J.S. Plank and J.J. Dongarra. "Fault-tolerant matrix
operations for network of workstations using multiple checkpointing."
In Proceedings of HPC Asia '97, High Performance Computing in
the Information Superhighway, pp. 460-465, Apr. 1997. [ps]
- J. S. Plank, K. Li, and M.A. Puening. "Diskless
checkpointing." IEEE Transactions on Parallel & Distributed
Systems, 9(10):972-986, Oct. 1998. [ps]
- J. S. Plank, Y. Kim and J.J. Dongarra. "Algorithm-based diskless
checkpointing for fault-tolerant matrix computations." In Proceedings
of the Twenty Fifth International Symposium on Fault-Tolerant Computing
Systems, pp. 351-360, Jun. 1995. [ps]
- J. S. Plank, Y. Kim and J. J. Dongarra. "Fault-tolerant matrix
operations for networks of workstations using diskless
checkpointing." In Journal of Parallel & Distributed Computing,
43(2):125-138, Jun. 1997. [ps]
Prith Banerjee
- Prithviraj Banerjee, Vijay Balasubramanian, and Amber Roy-Chowdhury.
"Compiler Assisted Synthesis of Algorithm-Based Checking in
Multiprocessors." To appear in Foundations of Dependable
Computing: Vol III. System Implementation. Gary Koob, editor, Kluwer
Academic Publishers. [ps]
self-checking programs
Replay and debugging
Netzer et al:
- R.B. Netzer and B.P. Miller. "Optimal tracing and replay for
debugging message-passing parallel programs." In Proceedings of
Supercomputing '92, pp. 502-511, Nov. 1992. [ps]
- R.B. Netzer and J. Xu. "Adaptive message logging for incremental
program replay." In IEEE Parallel and Distributed Technology,
1(4):32-39, Nov. 1993. [???]
- R.B. Netzer and J. Xu. "Replaying distributed programs without
message logging." In Proceedings of the Sixth IEEE International
Symposium on High Performance Distributed Computing (HPDC), pp. 137-147,
Aug. 1997. [???]
Objective CAML
Shared memory
- N. Neves, M. Castro and P. Guedes. "A checkpoint protocol for an
entry consistent shared memory system." In Proceedings of the 13th
ACM Symposium on Principles of Distributed Computing, Aug. 1994. [ps]
- L.M. Silva, J.G. Silva and S. Chapple. "Portable transparent
checkpointing for distributed shared memory." In Proceedings of the
Fifth IEEE International Symposium on High Performance Distributed Computing,
HPDC-5, pp. 422-431, Aug. 1996. [href]
MPI
- G. Stellner. "CoCheck: Checkpointing and process migration for MPI."
In Proceedings of the 10th International Parallel Processing
Symposium, Apr. 1996. [ps]
- W-J. Li and J-J. Tsay. "Checkpointing message-passing interface
(MPI) parallel programs." In Proceedings of the Pacific Rim
International Symposium on Fault-Tolerant Systems, pp. 147-152, 1997.
[href]
Compiling
- G. Barigazzi and L. Strigini. "Application-transparent setting of
recovery points." In Proceedings of the Thirteenth International
Symposium on Fault-Tolerant Computing Systems, FTCS-13, pp. 48-55,
1983. [???]
- M. Beck, J. S. Plank and G. Kingsley. "Compiler-assisted
checkpointing." Technical Report CS-94-269, Department of Computer
Science, University of Tennessee at Knoxville, Dec. 1994. [ps]
- P. L'Ecuyer and J. Malenfant. "Computing optimal checkpointing
strategies for rollback and recovery systems." In IEEE Transactions
on Computers, vol. 37, pp. 491-496, Apr. 1988. [???]
- C.M. Krishna, K.G. Shin and Y.-H. Lee. "Optimization criteria for
checkpoint placement." In Communications of the ACM, 27(10):1008-1012,
Oct. 1984. [pdf]
- J. Long, W.K. Fuchs and J.A. Abraham. "Compiler-assisted static
checkpoint insertion." In Proceedings of the Twenty Second
Annual International Symposium on Fault-Tolerant Computing, FTCS-22,
pp. 58-65, Jul. 1992. [???]
- D. Manivannan, R. H. Netzer, and M. Singhal. "Finding consistent
global checkpoints in a distributed computation." In IEEE
Transactions on Parallel & Distributed Systems, 8(6):623-627, Jun.
1997. [ps]
- J. S. Plank, M. Beck and G. Kingsley. "Compiler-assisted memory
exclusion for fast checkpointing." In IEEE Technical Committee
on Operating Systems Newsletter, Special Issue on Fault Tolerance, pp.
62-67, Dec. 1995. [ps]
- J. S. Plank, Y. Chen, K. Li, M. Beck and G. Kingsley. "Memory
exclusion: Optimizing the performance of checkpointing systems."
Technical Report UT-CS-96-335, University of Tennessee, Aug. 1996. [ps]
- A. Ziv and J. Bruck. "An on-line algorithm for checkpoint
placement." In IEEE Transactions on Computers, 46(9):976-985,
Sep. 1997. [ps]
Other stuff
- A.C. Klaiber and H.M. Levy. "Crash recovery for scientific
applications." In Proceedings of the International Conference on
Parallel and Distributed Systems, 1993. [???]
- S. Rangarajan, S. Garg and Y. Huang. "Checkpoints-on-demand with
active replication." In Proceedings of the Seventeenth
Symposium on Reliable Distributed Systems, pp. 75-83, Oct. 1998. [href]
- L.M. Silva and J.G. Silva. "An experimental study about diskless
checkpointing." In Proceedings of the 24th EUROMICRO
Conference, pp. 395-402, Aug. 1998. [pdf]
- L.M. Silva and J.G. Silva. "System-level versus user-defined
checkpointing." In Proceedings of the Seventeenth Symposium
on Reliable Distributed Systems, pp. 68-74, Oct. 1998. [pdf]
- L.M. Silva, J.G. Silva, S. Chapple and L. Clarke. "Portable
checkpointing and recovery." In Proceedings of the 4th International
Symposium on High-Performance Distributed Computing, HPDC-4, pp. 188-195,
Aug. 1995. [???]
- R. E. Strom, D. F. Bacon and S. A. Yemini. "Volatile logging in
n-fault-tolerant distributed systems." In Proceedings of the
Eighteenth International Symposium on Fault-Tolerant Computing Systems,
pp. 44-49, 1988. [???]
- K. Tanaka, H. Higaki and M. Takizawa. "Object-based checkpoints in
distributed systems." In Computer Systems Science &
Engineering, 13(3):179-185, May 1998. [ps]
- P. Tullmann, J. Lepreau, B. Ford and M. Hibler. "User-level
checkpointing through exportable kernel state." In Proceedings of
the Fifth International Workshop on Object-Orientation in Operating Systems,
pp. 85-88, Oct. 1996. [href]
- Y. M. Wang, E. Chung, Y. Huang, and E.N. Elnozahy. "Integrating
checkpointing with transaction processing." In Proceedings of the
Twenty Seventh International Symposium on Fault-Tolerant Computing (FTCS-27),
pp. 304-308, Jun. 1997. [ps]
Reversible computations:
- K. Perumalla and R. Fujimoto. "Source-code transformations for
efficient reversibility." Technical Report GIT-CC-99-21, College
of Computing, Georgia Tech, Sep. 1999. [ps]
- C. Carothers, K. Perumalla and R. Fujimoto. "Efficient optimistic
parallel simulations using reverse computation." Best Paper, ACM/IEEE
Workshop on Parallel and Distributed Simulation, 1999. [ps]
Systems:
NetSolve and Globus
Muller:
- G. Muller, M. Banâtre, N. Peyrouze and B. Rochat. "Lessons from
FTM: an experiment in design and implementation of a low-cost
fault-tolerant system." In IEEE Transactions on Reliability,
45(2):332-340, Jun. 1996. [ps]
Seti@Home