Fault Tolerance for Off-the-Shelf Applications and Hardware

Authors: M. Russinovich, Z. Segall

FTCS '95: Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing

Page 67

Published: 27 June 1995

Abstract

The concept of middleware provides a transparent way to augment and change the characteristics of a service provider as seen from a client. Fault-tolerant policies are ideal candidates for middleware implementation. We have defined and implemented operating-system-based middleware support that provides the power and flexibility needed by diverse fault-tolerant policies. This mechanism, called the sentry, has been built into the UNIX 4.3 BSD operating system server running on a Mach 3.0 kernel. To demonstrate the effectiveness of the mechanism, several policies, including checkpointing and journaling, have been implemented using sentries. The implementation shows that complex fault-tolerant policies can be implemented efficiently and transparently as middleware. The performance overhead of input journaling is less than 5%, and application suspension during a checkpoint typically lasts under 10 seconds. A standard hard disk stores journal and checkpoint information, with dedicated storage requirements of less than 20 MB.

Copyright © 1995 Institute of Electrical and Electronics Engineers, Inc. All rights reserved.

Publisher

IEEE Computer Society

United States

Cited By

Lee G (2005) Duplex method for mobile communication systems. Proceedings of the First International Conference on Mobile Ad-hoc and Sensor Networks, pp. 1125-1132. Online publication date: 13-Dec-2005
https://dl.acm.org/doi/10.1007/11599463_112

Bowen T, Segal M (2000) Remediation of Application-Specific Security Vulnerabilities at Runtime. IEEE Software 17:5, pp. 59-67. Online publication date: 1-Sep-2000
https://dl.acm.org/doi/10.1109/52.877867

Plank J, Li K, Puening M (1998) Diskless Checkpointing. IEEE Transactions on Parallel and Distributed Systems 9:10, pp. 972-986. Online publication date: 1-Oct-1998
https://dl.acm.org/doi/10.1109/71.730527

Chen Y, Plank J, Li K, Crawford D (1997) CLIP. Proceedings of the 1997 ACM/IEEE Conference on Supercomputing, pp. 1-11. Online publication date: 15-Nov-1997
https://dl.acm.org/doi/10.1145/509593.509626
