DOI: 10.1145/1508244.1508251
research-article

Recovery domains: an organizing principle for recoverable operating systems

Published: 07 March 2009

Abstract

We describe a strategy for enabling existing commodity operating systems to recover from unexpected run-time errors in nearly any part of the kernel, including core kernel components. Our approach is dynamic and request-oriented: it isolates the effects of a fault to the requests that caused the fault rather than to static kernel components. The approach is based on the notion of "recovery domains," an organizing principle that enables rollback of the state affected by a request in a multithreaded system with minimal impact on other requests or threads. We have applied this approach to v2.4.22 and v2.6.27 of the Linux kernel; it required 132 lines of changed or new code, and all other changes are performed by a simple compiler instrumentation pass. Our experiments show that the approach is able to recover from otherwise fatal faults with minimal collateral impact during a recovery event.
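
The request-oriented rollback described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it is a toy per-request undo log in Python, and the names `RecoveryDomain`, `run_request`, `good_request`, and `faulty_request` are hypothetical. It assumes a single-threaded setting and a plain dictionary standing in for kernel state, and it ignores the dependencies between concurrent requests that a real system would have to track.

```python
class RecoveryDomain:
    """Hypothetical per-request undo log; not the paper's actual mechanism."""

    _MISSING = object()  # sentinel: the key did not exist before the write

    def __init__(self, state):
        self.state = state    # shared state the request may modify
        self.undo_log = []    # (key, previous value) pairs, oldest first

    def write(self, key, value):
        # Save the old value before overwriting so it can be restored later.
        self.undo_log.append((key, self.state.get(key, self._MISSING)))
        self.state[key] = value

    def rollback(self):
        # Undo writes newest-first so the earliest saved value is restored last.
        while self.undo_log:
            key, old = self.undo_log.pop()
            if old is self._MISSING:
                self.state.pop(key, None)
            else:
                self.state[key] = old


def run_request(state, handler):
    """Run a handler in its own domain; on a fault, undo only its writes."""
    domain = RecoveryDomain(state)
    try:
        handler(domain)
        return True
    except Exception:
        domain.rollback()
        return False


def good_request(domain):
    domain.write("a", 1)


def faulty_request(domain):
    domain.write("a", 99)
    domain.write("b", 2)
    raise RuntimeError("simulated fault")


shared = {}
ok_good = run_request(shared, good_request)
ok_bad = run_request(shared, faulty_request)
```

After both calls, `shared` holds only the write made by `good_request`: the faulting request's changes are undone while the state left by the earlier request is untouched, which is the isolation property the abstract describes at a much larger scale.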

    Published In

    ASPLOS XIV: Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
    March 2009
    358 pages
    ISBN:9781605584065
    DOI:10.1145/1508244
    • ACM SIGPLAN Notices, Volume 44, Issue 3
      ASPLOS 2009
      March 2009
      346 pages
      ISSN:0362-1340
      EISSN:1558-1160
      DOI:10.1145/1508284
    • ACM SIGARCH Computer Architecture News, Volume 37, Issue 1
      ASPLOS 2009
      March 2009
      346 pages
      ISSN:0163-5964
      DOI:10.1145/2528521
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. akeso
    2. automatic fault recovery
    3. recovery domains

    Qualifiers

    • Research-article

    Conference

    ASPLOS09

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%
