Fault-Tolerant Computing: Understanding Failures and Design Flaws in Computer Systems

This presentation, prepared for the graduate course ECE 257A at the University of California, Santa Barbara, explores the concept of fault-tolerant computing, discussing motivation, background, tools, and various types of failures and design flaws in computer systems. It covers hardware, software, and design-flaw examples, such as the Intel Pentium processor and the Disney Concert Hall, and discusses the learning curve and the causes of human errors.

Slide 1: Fault-Tolerant Computing: Motivation, Background, and Tools (Sep. 2006)

Slide 2: About This Presentation
First edition released Sep. 2006. This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at the University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited. © Behrooz Parhami

Slide 5: The Curse of Complexity
Computer engineering is the art and science of translating user requirements we do not fully understand into hardware and software we cannot precisely analyze, to operate in environments we cannot accurately predict, all in such a way that society at large is given no reason to suspect the extent of our ignorance.[1]
[1] Adapted from a definition of structural engineering: Ralph Kaplan, By Design: Why There Are No Locks on the Bathroom Doors in the Hotel Louis XIV and Other Object Lessons, Fairchild Books, 2004, p. 229.
- Microsoft Windows NT (1992): ≈4M lines of code
- Microsoft Windows XP (2002): ≈40M lines of code
- Intel Pentium processor (1993): ≈4M transistors
- Intel Pentium 4 processor (2001): ≈40M transistors
- Intel Itanium 2 processor (2002): ≈500M transistors

Slide 6: Defining Failure
Failure is an unacceptable difference between expected and observed performance.[1]
[1] Definition used by the Technical Council on Forensic Engineering of the American Society of Civil Engineers.
A structure (building or bridge) need not collapse catastrophically to be deemed a failure.
Reasons for typical Web site failures:
- Hardware problems: 15%
- Software problems: 34%
- Operator error: 51%
[Slide diagram: Specification ≈? Implementation]

Slide 7: Design Flaws: "To Engineer is Human"[1]
[1] Title of a book by Henry Petroski.
Complex systems almost certainly contain multiple design flaws. Redundancy in the form of safety factors is routinely used in buildings and bridges; even so, there is one catastrophic bridge collapse every 30 years or so. See the following amazing video clip (Tacoma Narrows Bridge): http://www.enm.bris.ac.uk/research/nonlinear/tacoma/tacnarr.mpg
Example of a more subtle flaw: the Disney Concert Hall in Los Angeles reflected light into a nearby building, causing discomfort for tenants due to blinding light and high temperature.

Slide 10: Mishaps, Accidents, and Catastrophes
- Mishap: misfortune; unfortunate accident
- Accident: unexpected (no-fault) happening causing loss or injury
- Catastrophe: final, momentous event of drastic action; utter failure
At one time (following the initial years of highly unreliable hardware), computer mishaps were predominantly the result of human error; now, most mishaps are due to complexity (unanticipated interactions).
Forum on Risks to the Public in Computers and Related Systems: http://catless.ncl.ac.uk/risks (Peter G. Neumann, moderator)

Slide 11: Example
On August 17, 2006, a class-two incident occurred at the Swedish nuclear plant Forsmark. A short circuit in the electricity network caused a problem inside the reactor, and it needed to be shut down immediately, using emergency backup electricity. However, in two of the four generators, which run on AC, the AC/DC converters died. The generators disconnected, leaving the reactor in an unsafe state and the operators unaware of the current state of the system for approximately 20 minutes. A meltdown, such as the one at Chernobyl, could have occurred. The coincidence of problems at multiple protection levels seems to be a recurring theme in many modern-day mishaps; here, the emergency systems had not been tested with the grid electricity being off.

Slide 12: Layers of Safeguards
With multiple layers of safeguards, a system failure occurs only if warning symptoms and compensating actions are missed at each layer, which is quite unlikely. Is it really? The computer engineering literature is full of examples of mishaps in which two or more layers of protection failed at the same time. Multiple layers increase reliability significantly only if the "holes" in the different layers are fairly randomly distributed, so that the probability of their being aligned is negligible.
Dec. 1986: ARPANET had 7 dedicated lines between New York and Boston; a backhoe accidentally cut all 7 (they went through the same conduit).
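The layered-safeguards argument can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only: the per-layer miss probabilities and the common-cause weight are assumed values, not figures from the course material; the point is merely that one shared weakness (like the ARPANET conduit) can dominate the product of otherwise independent layers.

```python
# Back-of-the-envelope sketch of the layered-safeguards argument.
# The per-layer miss probabilities and the common-cause probability are
# made-up illustrative values, not figures from the course material.

def p_fail_independent(miss_probs):
    """System fails only if every layer misses; layers assumed independent."""
    p = 1.0
    for m in miss_probs:
        p *= m
    return p

def p_fail_common_cause(miss_probs, p_common=0.01):
    """Crude model: with probability p_common a single event (e.g., one backhoe,
    one untested procedure) defeats every layer at once; otherwise the layers
    miss independently as above."""
    return p_common + (1.0 - p_common) * p_fail_independent(miss_probs)

layers = [0.1, 0.1, 0.1]                 # each layer misses 10% of the time
print(p_fail_independent(layers))        # 0.001: "quite unlikely"
print(p_fail_common_cause(layers))       # ~0.011: the shared weakness dominates
```

The exact numbers do not matter; the comparison shows why the holes being randomly distributed, rather than aligned by a common cause, is what makes multiple layers effective.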
Slide 15: Properties of a Good User Interface
1. Simplicity: easy to use, clean and unencumbered look
2. Design for error: makes errors easy to prevent, detect, and reverse; asks for confirmation of critical actions
3. Visibility of system state: lets the user know what is happening inside the system from looking at the interface
4. Use of familiar language: uses terms that are known to the user (there may be different classes of users, each with its own vocabulary)
5. Minimal reliance on human memory: shows critical info on screen; uses selection from a set of options whenever possible
6. Frequent feedback: messages indicate consequences of actions
7. Good error messages: descriptive, rather than cryptic
8. Consistency: similar/different actions produce similar/different results and are encoded with similar/different colors and shapes

Slide 16: Operational Errors in Computer Systems
Hardware examples:
- Permanent incapacitation due to shock, overheating, or a voltage spike
- Intermittent failure due to overload, timing irregularities, or crosstalk
- Transient signal deviation due to alpha particles or external interference
Software examples:
- Counter or buffer overflow
- Out-of-range, unreasonable, or unanticipated input
- Unsatisfied loop termination condition
Dec. 2004: "Comair runs a 15-year-old scheduling software package from SBS International (www.sbsint.com). The software has a hard limit of 32,000 schedule changes per month. With all of the bad weather last week, Comair apparently hit this limit and then was unable to assign pilots to planes." It appears that they were using a 16-bit integer format to hold the count.
June 1996: The explosion of the Ariane 5 rocket 37 s into its maiden flight was due to a silly software error. For an excellent exposition of the cause, see http://www.comp.lancs.ac.uk/computing/users/dixa/teaching/CSC221/ariane.pdf
These examples can also be classified as design errors.
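As a rough illustration of the 16-bit counter failure mode described in the Comair report, the sketch below simulates two's-complement wraparound. The helper function and the change count are assumptions made here for illustration; they are not the actual SBS scheduling code or its data layout.

```python
# Minimal sketch of a signed 16-bit counter overflowing, in the spirit of the
# Comair incident above. to_int16 simulates two's-complement wraparound; it is
# an illustration only, not the real scheduling software's representation.

def to_int16(x):
    """Interpret x as a signed 16-bit (two's-complement) integer."""
    x &= 0xFFFF
    return x - 0x10000 if x >= 0x8000 else x

count = 0
for _ in range(33000):          # a month with more changes than 16 bits can hold
    count = to_int16(count + 1)

print(count)   # -32536, not 33000: the counter silently wrapped past 32767
```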
Slide 17: About the Name of This Course
Fault-tolerant computing is a discipline that began in the late 1960s; the first Fault-Tolerant Computing Symposium (FTCS) was held in 1971. In the early 1980s, the name "dependable computing" was proposed for the field, to account for the fact that tolerating faults is but one approach to ensuring reliable computation. The terms "fault tolerance" and "fault-tolerant" were so firmly established, however, that people started to use "dependable and fault-tolerant computing." In 2000, the premier conference of the field was merged with another and renamed the International Conference on Dependable Systems and Networks (DSN). In 2004, IEEE began publication of the IEEE Transactions on Dependable and Secure Computing (the inclusion of the term "secure" is for emphasis, because security was already accepted as an aspect of dependability).