Software fault tolerance

Showing 1 - 20 of 44 Results

editorial
Free
November 2024
Editorial
- Robbert van Renesse,
- Sam H. Noh
ACM Transactions on Computer Systems (TOCS), Volume 42, Issue 3-4, Article No.: 5, Pages 1–2, https://doi.org/10.1145/3696656
Metrics: Total Citations: 0; Total Downloads: 207; Last 12 Months: 207; Last 6 weeks: 207
research-article
December 2011
Depot: Cloud Storage with Minimal Trust
ACM Transactions on Computer Systems (TOCS), Volume 29, Issue 4, Article No.: 12, Pages 1–38, https://doi.org/10.1145/2063509.2063512

This article describes the design, implementation, and evaluation of Depot, a cloud storage system that minimizes trust assumptions. Depot tolerates buggy or malicious behavior by any number of clients or servers, yet it provides safety and liveness ...
Metrics: Total Citations: 119; Total Downloads: 2,225; Last 12 Months: 59; Last 6 weeks: 9
research-article
December 2011
Efficient Testing of Recovery Code Using Fault Injection
- Paul D. Marinescu,
- George Candea
ACM Transactions on Computer Systems (TOCS), Volume 29, Issue 4, Article No.: 11, Pages 1–38, https://doi.org/10.1145/2063509.2063511

A critical part of developing a reliable software system is testing its recovery code. This code is traditionally difficult to test in the lab, and, in the field, it rarely gets to run; yet, when it does run, it must execute flawlessly in order to ...
Metrics: Total Citations: 42; Total Downloads: 1,068; Last 12 Months: 17; Last 6 weeks: 3
research-article
July 2010
Throughput optimal total order broadcast for cluster environments
ACM Transactions on Computer Systems (TOCS), Volume 28, Issue 2, Article No.: 5, Pages 1–32, https://doi.org/10.1145/1813654.1813656

Total order broadcast is a fundamental communication primitive that plays a central role in bringing cheap software-based high availability to a wide range of services. This article studies the practical performance of such a primitive on a cluster of ...
Metrics: Total Citations: 22; Total Downloads: 766; Last 12 Months: 19; Last 6 weeks: 2
research-article
July 2010
Proactive obfuscation
- Tom Roeder,
- Fred B. Schneider
ACM Transactions on Computer Systems (TOCS), Volume 28, Issue 2, Article No.: 4, Pages 1–54, https://doi.org/10.1145/1813654.1813655

Proactive obfuscation is a new method for creating server replicas that are likely to have fewer shared vulnerabilities. It uses semantics-preserving code transformations to generate diverse executables, periodically restarting servers with these fresh ...
Metrics: Total Citations: 51; Total Downloads: 1,015; Last 12 Months: 16; Last 6 weeks: 6
research-article
January 2010
Zyzzyva: Speculative Byzantine fault tolerance
ACM Transactions on Computer Systems (TOCS), Volume 27, Issue 4, Article No.: 7, Pages 1–39, https://doi.org/10.1145/1658357.1658358

A longstanding vision in distributed systems is to build reliable systems from unreliable components. An enticing formulation of this vision is Byzantine Fault-Tolerant (BFT) state machine replication, in which a group of servers collectively act as a ...
Metrics: Total Citations: 179; Total Downloads: 1,540; Last 12 Months: 85; Last 6 weeks: 10
research-article
May 2009
Practical and low-overhead masking of failures of TCP-based servers
ACM Transactions on Computer Systems (TOCS), Volume 27, Issue 2, Article No.: 4, Pages 1–39, https://doi.org/10.1145/1534909.1534911

This article describes an architecture that allows a replicated service to survive crashes without breaking its TCP connections. Our approach does not require modifications to the TCP protocol, to the operating system on the server, or to any of the ...
Metrics: Total Citations: 15; Total Downloads: 830; Last 12 Months: 14; Last 6 weeks: 5
article
November 2006
Recovering device drivers
ACM Transactions on Computer Systems (TOCS), Volume 24, Issue 4, Pages 333–360, https://doi.org/10.1145/1189256.1189257

This article presents a new mechanism that enables applications to run correctly when device drivers fail. Because device drivers are the principal failing component in most systems, reducing driver-induced failures greatly improves overall reliability. ...
Metrics: Total Citations: 107; Total Downloads: 1,694; Last 12 Months: 44; Last 6 weeks: 9
article
February 2005
Improving the reliability of commodity operating systems
ACM Transactions on Computer Systems (TOCS), Volume 23, Issue 1, Pages 77–110, https://doi.org/10.1145/1047915.1047919

Despite decades of research in extensible operating system technology, extensions such as device drivers remain a significant cause of system failures. In Windows XP, for example, drivers account for 85% of recently reported failures. This article ...
Metrics: Total Citations: 184; Total Downloads: 3,846; Last 12 Months: 65; Last 6 weeks: 6
article
August 2003
BASE: Using abstraction to improve fault tolerance
ACM Transactions on Computer Systems (TOCS), Volume 21, Issue 3, Pages 236–269, https://doi.org/10.1145/859716.859718

Software errors are a major cause of outages and they are increasingly exploited in malicious attacks. Byzantine fault tolerance allows replicated systems to mask some software errors but it is expensive to deploy. This paper describes a replication ...
Metrics: Total Citations: 111; Total Downloads: 2,193; Last 12 Months: 15; Last 6 weeks: 3
article
May 2003
Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining
ACM Transactions on Computer Systems (TOCS), Volume 21, Issue 2, Pages 164–206, https://doi.org/10.1145/762483.762485

Scalable management and self-organizational capabilities are emerging as central requirements for a generation of large-scale, highly dynamic, distributed applications. We have developed an entirely new distributed information management system called ...
Metrics: Total Citations: 469; Total Downloads: 5,466; Last 12 Months: 71; Last 6 weeks: 16
article
November 2002
Practical byzantine fault tolerance and proactive recovery
- Miguel Castro,
- Barbara Liskov
ACM Transactions on Computer Systems (TOCS), Volume 20, Issue 4, Pages 398–461, https://doi.org/10.1145/571637.571640

Our growing reliance on online services accessible on the Internet demands highly available systems that provide correct service without interruptions. Software bugs, operator mistakes, and malicious attacks are a major cause of service interruptions ...
Metrics: Total Citations: 1,788; Total Downloads: 11,303; Last 12 Months: 1,876; Last 6 weeks: 68
article
November 2002
COCA: A secure distributed online certification authority
ACM Transactions on Computer Systems (TOCS), Volume 20, Issue 4, Pages 329–368, https://doi.org/10.1145/571637.571638

COCA is a fault-tolerant and secure online certification authority that has been built and deployed both in a local area network and in the Internet. Extremely weak assumptions characterize environments in which COCA's protocols execute correctly: no ...
Metrics: Total Citations: 216; Total Downloads: 2,083; Last 12 Months: 27; Last 6 weeks: 6
article
May 2002
The evolution of Coda
- M. Satyanarayanan
ACM Transactions on Computer Systems (TOCS), Volume 20, Issue 2, Pages 85–124, https://doi.org/10.1145/507052.507053

Failure-resilient, scalable, and secure read-write access to shared information by mobile and static users over wireless and wired networks is a fundamental computing challenge. In this article, we describe how the Coda file system has evolved to meet ...
Metrics: Total Citations: 88; Total Downloads: 2,879; Last 12 Months: 18; Last 6 weeks: 2
article
May 2001
Specifying and using a partitionable group communication service
ACM Transactions on Computer Systems (TOCS), Volume 19, Issue 2, Pages 171–216, https://doi.org/10.1145/377769.377776

Group communication services are becoming accepted as effective building blocks for the construction of fault-tolerant distributed applications. Many specifications for group communication services have been proposed. However, there is still no agreement ...
Metrics: Total Citations: 49; Total Downloads: 1,242; Last 12 Months: 15; Last 6 weeks: 4
article
Free
August 2000
Manageability, availability, and performance in porcupine: a highly scalable, cluster-based mail service
ACM Transactions on Computer Systems (TOCS), Volume 18, Issue 3, Page 298, https://doi.org/10.1145/354871.354875

This paper describes the motivation, design and performance of Porcupine, a scalable mail server. The goal of Porcupine is to provide a highly available and scalable electronic mail service using a large cluster of commodity PCs. We designed Porcupine to ...
Metrics: Total Citations: 39; Total Downloads: 1,304; Last 12 Months: 55; Last 6 weeks: 13
article
Free
November 1998
Coyote: a system for constructing fine-grain configurable communication services
ACM Transactions on Computer Systems (TOCS), Volume 16, Issue 4, Pages 321–366, https://doi.org/10.1145/292523.292524

Communication-oriented abstractions such as atomic multicast, group RPC, and protocols for location-independent mobile computing can simplify the development of complex applications built on distributed systems. This article describes Coyote, a system ...
Metrics: Total Citations: 98; Total Downloads: 1,029; Last 12 Months: 117; Last 6 weeks: 19
article
Free
May 1998
The part-time parliament
- Leslie Lamport
ACM Transactions on Computer Systems (TOCS), Volume 16, Issue 2, Pages 133–169, https://doi.org/10.1145/279227.279229

Recent archaeological discoveries on the island of Paxos reveal that the parliament functioned despite the peripatetic propensity of its part-time legislators. The legislators maintained consistent copies of the parliamentary record, despite their ...
Metrics: Total Citations: 2,045; Total Downloads: 11,744; Last 12 Months: 1,652; Last 6 weeks: 291
article
Free
May 1997
Strong loss tolerance of electronic coin systems
- Birgit Pfitzmann,
- Michael Waidner
ACM Transactions on Computer Systems (TOCS), Volume 15, Issue 2, Pages 194–213, https://doi.org/10.1145/253145.253282

Untraceable electronic cash means prepaid digital payment systems, usually with offline payments, that protect user privacy. Such systems have recently been given considerable attention by both theory and development projects. However, in most current ...
Metrics: Total Citations: 15; Total Downloads: 1,119; Last 12 Months: 76; Last 6 weeks: 19
article
Free
August 1996
Recovery in the Calypso file system
ACM Transactions on Computer Systems (TOCS), Volume 14, Issue 3, Pages 287–310, https://doi.org/10.1145/233557.233560

This article presents the design and implementation of the recovery scheme in Calypso. Calypso is a cluster-optimized, distributed file system for UNIX clusters. As in Sprite and AFS, Calypso servers are stateful and scale well to a large number of ...
Metrics: Total Citations: 24; Total Downloads: 921; Last 12 Months: 107; Last 6 weeks: 28