DOI: 10.5555/874064.875650
Article

Fault Tolerance for Off-the-Shelf Applications and Hardware

Published: 27 June 1995

Abstract

The concept of middleware provides a transparent way to augment and change the characteristics of a service provider as seen from a client. Fault-tolerant policies are ideal candidates for middleware implementation. We have defined and implemented operating-system-based middleware support that provides the power and flexibility needed by diverse fault-tolerant policies. This mechanism, called the sentry, has been built into the UNIX 4.3 BSD operating system server running on a Mach 3.0 kernel. To demonstrate the effectiveness of the mechanism, several policies, including checkpointing and journaling, have been implemented using sentries. The implementation shows that complex fault-tolerant policies can be implemented efficiently and transparently as middleware. The performance overhead of input journaling is less than 5%, and application suspension during a checkpoint typically lasts under 10 seconds. A standard hard disk stores the journal and checkpoint information, with dedicated storage requirements of less than 20 MB.
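The idea of interposing a journaling policy between an application and a service provider can be sketched in a few lines. The following is a minimal user-level illustration only, not the paper's mechanism: the actual sentry lives inside the UNIX server on Mach, whereas the class `JournalingSentry`, its methods `call` and `start_replay`, and the helpers `make_input_source` and `application` are all hypothetical names invented for this sketch. It records each input the application receives and, after a simulated restart, replays the recorded inputs so the application recomputes the same state without touching the live service.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple


class JournalingSentry:
    """Hypothetical sentry: interposes on a service and journals its replies."""

    def __init__(self, service: Callable[[str], Any]) -> None:
        self._service = service
        self.journal: List[Tuple[str, Any]] = []  # (request, reply) pairs
        self._replay_pos: Optional[int] = None    # None means live mode

    def start_replay(self) -> None:
        """Simulate recovery: serve replies from the journal until it runs out."""
        self._replay_pos = 0

    def call(self, request: str) -> Any:
        # Replay mode: return the journaled reply instead of calling the service.
        if self._replay_pos is not None and self._replay_pos < len(self.journal):
            logged_request, reply = self.journal[self._replay_pos]
            if logged_request != request:
                raise RuntimeError("replay diverged from journal")
            self._replay_pos += 1
            return reply
        # Live mode (or journal exhausted): call the service and record the reply.
        self._replay_pos = None
        reply = self._service(request)
        self.journal.append((request, reply))
        return reply


def make_input_source(values: Iterable[int]) -> Callable[[str], int]:
    """Hypothetical service provider that yields one new input per request."""
    it = iter(values)
    return lambda request: next(it)


def application(read: Callable[[str], int]) -> int:
    """Unmodified 'off-the-shelf' application: sums three inputs."""
    return sum(read("read") for _ in range(3))
```

The application receives only a `read` callable, so it cannot tell whether its inputs come from the live service or from the journal; that separation is the transparency property the abstract attributes to middleware. A real system would also persist the journal to stable storage and truncate it at each checkpoint, which this sketch omits.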

Published In

FTCS '95: Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
June 1995

Publisher

IEEE Computer Society

United States

Author Tags

  1. Mach 3.0 kernel
  2. UNIX 4.3 BSD operating system server
  3. Unix
  4. application suspension
  5. checkpointing
  6. client
  7. dedicated storage requirements
  8. fault tolerant computing
  9. fault tolerant policies
  10. fault-tolerance
  11. hard disk
  12. input journaling
  13. journaling
  14. middleware
  15. off-the-shelf applications
  16. off-the-shelf hardware
  17. operating system based middleware support
  18. operating systems (computers)
  19. performance overhead
  20. reliability
  21. sentry
  22. service provider
  23. software fault tolerance

Qualifiers

  • Article
