THE SCIENCE OF DERIVING STABILITY ANALYSES

PAOLO BIENTINESI* AND ROBERT A. VAN DE GEIJN†

FLAME Working Note #33

* RWTH Aachen University, AICES, Pauwelsstrasse 12, 52074 Aachen, Germany; pauldj@aices.rwth-aachen.de
† Department of Computer Sciences, 1 University Station C0500, The University of Texas at Austin, Austin, TX 78712; rvdg@cs.utexas.edu

Abstract. We introduce a methodology for obtaining inventories of error results for families of numerical dense linear algebra algorithms. The approach for deriving the analyses is goal-oriented, systematic, and layered. The presentation places the analysis side-by-side with the algorithm so that it is obvious where error is introduced, making it attractive for use in the classroom. Yet the approach is sufficiently powerful to support the analysis of more complex algorithms, such as blocked variants that attain high performance. We provide what we believe to be the first analysis of a practical blocked LU factorization. Contrary to popular belief, the bound on the error is tighter than that of the corresponding unblocked algorithm.

1. Introduction. Numerical stability analysis related to dense linear algebra operations is one of the most important topics in numerical analysis. It is part of almost any introductory and advanced text on numerical linear algebra and earned one of its pioneers, J.H. Wilkinson, a Turing Award in 1970.

Parlett [2] observes that "One of the major difficulties in a practical analysis is that of description. An ounce of analysis follows a pound of preparation." This suggests the desirability of a standardization of the description of error analyses. Higham [11] echoes this sentiment: "Two reasons why rounding error analysis can be hard to understand are that, first, there is no standard notation and, second, error analyses are often cluttered with re-derivations of standard results." His statement also suggests that an inventory of results, to be used in subsequent analyses, is desirable. We believe that texts and papers alike focus on the error results and their proofs rather than on exposing a methodology that can be used by practitioners and students to obtain error results in a systematic way. Especially in a classroom setting, assignments related to stability analysis reward the recognition of tricks because no systematic methodology is explicitly passed from a typical instructor to the student.

As part of the FLAME project, we approach such problems differently. As computer scientists, we are interested in the methodology for deriving results as much as, if not more than, in the results themselves. The goal for us is to identify notation and a procedure that yields the solution and that can, in principle, be made mechanical (automated). For the problem of deriving algorithms for linear algebra operations we were able to achieve these objectives: we identified notation and exposed a systematic procedure that was reproducible by a computer algebra system [3]. In this paper, we show that the same notation and procedure can be extended to equally systematically (although not yet mechanically) derive stability analyses.

This paper makes the following contributions:
• The notation we use deviates from tradition. As much as possible we abstract away from details like indices in an effort to avoid clutter, both in the presentation of the algorithms and in the error analyses of those algorithms. We are not alone in this: many texts (e.g., [8]) partition matrices in analyses. We believe we do so more systematically and more consistently with how we present algorithms.
• In most texts and courses the error results are presented as theorems with proofs that incorporate one or more obscure tricks, often with different tricks for different operations. In contrast, we show that the derivation of error results is a goal-oriented exercise: given an operation and an algorithm that computes it, a possible error result is motivated and is subsequently established hand-in-hand with its proof. The methodology naturally leads to a sequence of error results that can be catalogued for future reuse. If in an analysis a subresult has not yet been catalogued, the methodology can be applied to derive that subresult.
• We motivate and demonstrate the methodology by applying it to a sequence of progressively more difficult algorithms for which results were already known. The presentation is such that it can be used as a supplement to a text being used in a class; this is why we took the liberty of including exercises in this paper.
• We apply the methodology to analyze a blocked algorithm for LU factorization that is closely related to the most commonly used high-performance algorithm for LU factorization with partial pivoting. We show that the computed Ľ and Ǔ factors of matrix A are such that ĽǓ = A + ∆A, with |∆A| ≤ γ_{n/b+b}(|A| + |Ľ||Ǔ|). The factor γ_{n/b+b} improves on the known factor γ_n resulting from unblocked algorithms, yielding γ_{2√n} for a suitable choice of the block size b (namely b = √n).

We do not claim that the approach is different from what an expert does as he/she analyzes an algorithm. We do claim that we provide structure so that such analyses can be performed by people with considerably less expertise. Experience tells us that there are many, including experts, in this particular field who think at the index level and for whom the abstractions themselves are clutter. However, we have noticed that our approach often provokes a reaction like "As much as I have been trying to avoid [stability] analysis for the last ten years, I can only say that I like [this]." Indeed, this pretty much summarizes our own reaction as we wrote this paper. One should not dismiss the approach without at least trying some of the more advanced exercises.

While the paper is meant to be self-contained, full understanding will come from first reading our paper on the systematic derivation of algorithms in this problem domain [4]. The reason is that the derivation of the analysis of an algorithm mirrors the derivation of the algorithm itself, as discussed in the conclusion.

It is impossible for us to give a complete treatment of related work. We refer the reader to Higham's book [11], which lists no less than 1134 citations. Other classic references are the books by Golub and Van Loan [9], by Stewart [13], and by Stewart and Sun [14].

The structure of this paper is unusual. Since we claim that the presented approach makes the subject more accessible to the novice, it is written to a target audience that includes mathematically sophisticated graduate students who have not yet been exposed to numerical error analysis, and it includes exercises. As a result, the first sections may appear to the expert to be a review of basic notation followed by an unorthodox presentation of the subject. In Section 2, we review notation for capturing and analyzing error.
In Section 3, we analyze the error accumulated when computing the inner (dot) product. We also use this section to introduce what we call the "error worksheet", first as a framework for capturing the inductive proof of how error accumulates, and next as a tool for constructively deriving the error analysis. This helps relate the new approach to the more traditional analysis that is usually presented in a classroom setting.

Definition 2.1. Given x ∈ Rⁿ and A ∈ Rᵐˣⁿ, |x| denotes the vector with entries |χ_0|, ..., |χ_{n−1}| and |A| denotes the m × n matrix with (i, j) entry |α_{i,j}|.

Definition 2.2. Let ⋄ ∈ {<, ≤, =, ≥, >} and x, y ∈ Rⁿ. Then |x| ⋄ |y| iff |χ_i| ⋄ |ψ_i| for i = 0, ..., n−1. Similarly, given A and B ∈ Rᵐˣⁿ, |A| ⋄ |B| iff |α_{i,j}| ⋄ |β_{i,j}| for i = 0, ..., m−1 and j = 0, ..., n−1.

The next lemma is exploited in later sections:

Lemma 2.3. Let A ∈ Rᵐˣᵏ and B ∈ Rᵏˣⁿ. Then |AB| ≤ |A||B|.

Exercise 2.4. Prove Lemma 2.3.
Answer: Let C = AB. Then the (i, j) entry of |C| satisfies

    |γ_{i,j}| = | Σ_{p=0}^{k−1} α_{i,p}β_{p,j} | ≤ Σ_{p=0}^{k−1} |α_{i,p}β_{p,j}| = Σ_{p=0}^{k−1} |α_{i,p}||β_{p,j}|,

which equals the (i, j) entry of |A||B|. Thus |AB| ≤ |A||B|. End of Answer.

The fact that the bounds that we establish can be easily converted into bounds involving norms is a consequence of the following theorem, where ‖·‖_F indicates the Frobenius matrix norm.

Theorem 2.5. Let A, B ∈ Rᵐˣⁿ. If |A| ≤ |B| then ‖A‖_1 ≤ ‖B‖_1, ‖A‖_∞ ≤ ‖B‖_∞, and ‖A‖_F ≤ ‖B‖_F.

Exercise 2.6. Prove Theorem 2.5.
Answer: Show that if |A| ≤ |B| then ‖A‖_1 ≤ ‖B‖_1: Let A = (a_0 | ··· | a_{n−1}) and B = (b_0 | ··· | b_{n−1}). Then

    ‖A‖_1 = max_{0≤j<n} ‖a_j‖_1 = max_{0≤j<n} Σ_{i=0}^{m−1} |α_{i,j}| = Σ_{i=0}^{m−1} |α_{i,k}|   (where k is the index that attains the maximum)
          ≤ Σ_{i=0}^{m−1} |β_{i,k}|   (since |A| ≤ |B|)
          ≤ max_{0≤j<n} Σ_{i=0}^{m−1} |β_{i,j}| = max_{0≤j<n} ‖b_j‖_1 = ‖B‖_1.

Show that if |A| ≤ |B| then ‖A‖_∞ ≤ ‖B‖_∞: Note that ‖A‖_∞ = ‖Aᵀ‖_1 and ‖B‖_∞ = ‖Bᵀ‖_1. Also, if |A| ≤ |B| then, clearly, |Aᵀ| ≤ |Bᵀ|. Hence ‖A‖_∞ = ‖Aᵀ‖_1 ≤ ‖Bᵀ‖_1 = ‖B‖_∞.

Show that if |A| ≤ |B| then ‖A‖_F ≤ ‖B‖_F:

    ‖A‖_F² = Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} α_{i,j}² ≤ Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} β_{i,j}² = ‖B‖_F².

Hence ‖A‖_F ≤ ‖B‖_F. End of Answer.

2.5. Deriving dense linear algebra algorithms. In various papers, we have shown that for a broad class of linear algebra operations, given an operation one can systematically derive algorithms for computing it [10]. The primary vehicle in the derivation of algorithms is a worksheet to be filled in a prescribed order [4]. We do not discuss the derivation worksheet in this paper, in order to keep the focus on the derivation of error analyses. However, we encourage the reader to compare the error worksheet, introduced next, to the worksheet for deriving algorithms, as the order in which the error worksheet is filled mirrors that of the derivation worksheet.

3. Stability of the Dot Product Operation and Introduction to the Error Worksheet. The triangular solve algorithm discussed in the next section requires the computation of the dot (inner) product (Dot) of vectors x, y ∈ Rⁿ: κ := xᵀy. In this section, we give an algorithm for this operation and the related error results. We also introduce the error worksheet as a framework for presenting the error analysis side-by-side with the algorithm.

3.1. An algorithm for computing Dot. We will consider the algorithm given in Fig. 3.1. It uses the FLAME notation [10, 4] to express the computation

    κ := ((···(χ_0ψ_0 + χ_1ψ_1) + ···) + χ_{n−2}ψ_{n−2}) + χ_{n−1}ψ_{n−1}     (3.1)

in the indicated order.

Algorithm Dot: κ := xᵀy
    κ := 0
    Partition x → (x_T / x_B), y → (y_T / y_B), where x_T and y_T are empty
    While m(x_T) < m(x) do
        Repartition (x_T / x_B) → (x_0 / χ_1 / x_2), (y_T / y_B) → (y_0 / ψ_1 / y_2), where χ_1 and ψ_1 are scalars
        κ := κ + χ_1ψ_1
        Continue with (x_T / x_B) ← (x_0 / χ_1 / x_2), (y_T / y_B) ← (y_0 / ψ_1 / y_2)
    endwhile

Fig. 3.1. Algorithm for computing κ := xᵀy; m(x) denotes the length of vector x.
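To make the loop concrete, the following is a minimal NumPy sketch of the algorithm in Fig. 3.1, with the FLAME partitioning flattened into an ordinary index loop; the function name and test data are ours, not from the paper:

    import numpy as np

    def dot_flame_order(x, y):
        # kappa := ((chi_0 psi_0 + chi_1 psi_1) + ...) + chi_{n-1} psi_{n-1},
        # accumulated left to right as in (3.1): one * and one + per iteration.
        kappa = 0.0
        for chi, psi in zip(x, y):
            kappa = kappa + chi * psi
        return kappa

    x = np.random.default_rng(0).standard_normal(1000)
    y = np.random.default_rng(1).standard_normal(1000)
    print(dot_flame_order(x, y), float(x @ y))  # agree to roughly machine precision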
3.2. Preparation. Under the computational model given in Section 2.2, the computed result of (3.1), κ̌, satisfies

    κ̌ = (···(χ_0ψ_0(1 + ε_*^(0)) + χ_1ψ_1(1 + ε_*^(1)))(1 + ε_+^(1)) + ···)(1 + ε_+^(n−2)) + χ_{n−1}ψ_{n−1}(1 + ε_*^(n−1)))(1 + ε_+^(n−1))
      = Σ_{i=0}^{n−1} [ χ_iψ_i (1 + ε_*^(i)) Π_{j=i}^{n−1} (1 + ε_+^(j)) ],     (3.2)

where ε_+^(0) = 0 and |ε_*^(0)|, |ε_*^(j)|, |ε_+^(j)| ≤ u for j = 1, ..., n−1. Clearly, a notation to keep expressions from becoming unreadable is desirable.

Lemma 3.1. Let ε_i ∈ R, 0 ≤ i ≤ n−1, with nu < 1 and |ε_i| ≤ u. Then there exists θ_n ∈ R such that Π_{i=0}^{n−1} (1 + ε_i)^{±1} = 1 + θ_n, with |θ_n| ≤ nu/(1 − nu).

Proof: By Mathematical Induction.

Base case. n = 1. Trivial.

Inductive Step. The Inductive Hypothesis (I.H.) tells us that for all ε_i ∈ R, 0 ≤ i ≤ n−1, with nu < 1 and |ε_i| ≤ u, there exists a θ_n ∈ R such that Π_{i=0}^{n−1} (1 + ε_i)^{±1} = 1 + θ_n, with |θ_n| ≤ nu/(1 − nu). We will show that if ε_i ∈ R, 0 ≤ i ≤ n, with (n+1)u < 1 and |ε_i| ≤ u, then there exists a θ_{n+1} ∈ R such that Π_{i=0}^{n} (1 + ε_i)^{±1} = 1 + θ_{n+1}, with |θ_{n+1}| ≤ (n+1)u/(1 − (n+1)u).

Case 1: Π_{i=0}^{n} (1 + ε_i)^{±1} = (Π_{i=0}^{n−1} (1 + ε_i)^{±1})(1 + ε_n). See Exercise 3.2.

Case 2: Π_{i=0}^{n} (1 + ε_i)^{±1} = (Π_{i=0}^{n−1} (1 + ε_i)^{±1})/(1 + ε_n). By the I.H. there exists a θ_n such that 1 + θ_n = Π_{i=0}^{n−1} (1 + ε_i)^{±1} and |θ_n| ≤ nu/(1 − nu). Then

    (Π_{i=0}^{n−1} (1 + ε_i)^{±1})/(1 + ε_n) = (1 + θ_n)/(1 + ε_n) = 1 + (θ_n − ε_n)/(1 + ε_n),

which tells us how to pick θ_{n+1} = (θ_n − ε_n)/(1 + ε_n). Now

    |θ_{n+1}| = |θ_n − ε_n|/|1 + ε_n| ≤ (|θ_n| + u)/(1 − u) ≤ (nu/(1 − nu) + u)/(1 − u)
              = (nu + (1 − nu)u)/((1 − nu)(1 − u)) = ((n+1)u − nu²)/(1 − (n+1)u + nu²) ≤ (n+1)u/(1 − (n+1)u).

By the Principle of Mathematical Induction, the result holds.

Exercise 3.2. Complete the proof of Lemma 3.1.
Answer: We merely need to fill in the details for Case 1 in the proof:

Case 1: Π_{i=0}^{n} (1 + ε_i)^{±1} = (Π_{i=0}^{n−1} (1 + ε_i)^{±1})(1 + ε_n). By the I.H. there exists a θ_n such that 1 + θ_n = Π_{i=0}^{n−1} (1 + ε_i)^{±1} and |θ_n| ≤ nu/(1 − nu). Then

    (Π_{i=0}^{n−1} (1 + ε_i)^{±1})(1 + ε_n) = (1 + θ_n)(1 + ε_n) = 1 + (θ_n + ε_n + θ_nε_n),

which tells us how to pick θ_{n+1} = θ_n + ε_n + θ_nε_n. End of Answer.
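The bound in Lemma 3.1 is easy to illustrate numerically. Below is a small sketch (ours), assuming IEEE double precision with unit roundoff u = 2⁻⁵³; since the product is itself evaluated in floating point, the observed θ_n is only approximate, so the comparison is printed rather than asserted:

    import numpy as np

    u = 2.0**-53  # unit roundoff of IEEE double precision (assumption)
    rng = np.random.default_rng(0)

    n, trials = 100, 10000
    gamma_n = n * u / (1 - n * u)  # the bound nu/(1 - nu) of Lemma 3.1
    worst = 0.0
    for _ in range(trials):
        eps = rng.uniform(-u, u, size=n)          # |eps_i| <= u
        signs = rng.choice([-1.0, 1.0], size=n)   # exponents +-1
        theta_n = np.prod((1.0 + eps) ** signs) - 1.0
        worst = max(worst, abs(theta_n))
    print(f"max |theta_n| observed: {worst:.3e}, gamma_n = {gamma_n:.3e}")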
3.4. A proof in traditional format. In the proof below, we pick symbols to denote vectors so that the proof can be easily related to the alternative framework to be presented in Section 3.5.

Proof: By Mathematical Induction on n, the length of vectors x and y.

Base case. m(x) = m(y) = 0. Trivial.

Inductive Step. I.H.: Assume that if x_T, y_T ∈ Rᵏ, k > 0, then [x_Tᵀ y_T] = x_Tᵀ(I + Σ_T)y_T, where Σ_T = Σ^(k). We will show that when x_T, y_T ∈ Rᵏ⁺¹, the equality [x_Tᵀ y_T] = x_Tᵀ(I + Σ_T)y_T holds true again. Assume that x_T, y_T ∈ Rᵏ⁺¹, and partition x_T → (x_0 / χ_1) and y_T → (y_0 / ψ_1). Then

    [ (x_0 / χ_1)ᵀ (y_0 / ψ_1) ]
      = [ [x_0ᵀ y_0] + [χ_1ψ_1] ]                                         (definition)
      = [ x_0ᵀ(I + Σ_0)y_0 + [χ_1ψ_1] ]                                   (I.H. with x_T = x_0, y_T = y_0, and Σ_0 = Σ^(k))
      = ( x_0ᵀ(I + Σ_0)y_0 + χ_1ψ_1(1 + ε_*) )(1 + ε_+)                   (SCM, twice)
      = (x_0 / χ_1)ᵀ diag(I + Σ_0, 1 + ε_*)(1 + ε_+) (y_0 / ψ_1)          (rearrangement)
      = x_Tᵀ(I + Σ_T)y_T                                                  (renaming),

where |ε_*|, |ε_+| ≤ u, ε_+ = 0 if k = 0, and (I + Σ_T) = diag(I + Σ_0, 1 + ε_*)(1 + ε_+), so that Σ_T = Σ^(k+1). By the Principle of Mathematical Induction, the result holds.

3.5. The error worksheet. We focus the reader's attention on Fig. 3.2, in which we present a framework, which we will call the error worksheet, for presenting the inductive proof of Theorem 3.7 side-by-side with the algorithm for Dot. This framework, in a slightly different form, was first introduced in [3]. The expressions enclosed by { } (in the grey boxes) are predicates describing the state of the variables used in the algorithm and in its analysis. In the worksheet, we use superscripts to indicate the iteration number; thus, the symbols v^{i} and v^{i+1} do not denote two different variables, but two different states of variable v.

The proof presented in Fig. 3.2 goes hand in hand with the algorithm, as it shows that before and after each iteration of the loop that computes κ := xᵀy, the variables κ̌, x_T, y_T, Σ_T are such that the predicate

    { κ̌ = x_Tᵀ(I + Σ_T)y_T ∧ k = m(x_T) ∧ Σ_T = Σ^(k) }     (3.5)

holds true. This relation is satisfied at each iteration of the loop, so it is also satisfied when the loop completes. Upon completion, the loop guard is m(x_T) = m(x) = n, which implies that κ̌ = xᵀ(I + Σ^(n))y, i.e., the thesis of the theorem, is satisfied too. In detail, the inductive proof of Theorem 3.7 is captured by the error worksheet as follows:

Base case. In Step 2a, i.e., before the execution of the loop, predicate (3.5) is satisfied, as k = m(x_T) = 0.

Step 1a:  κ := 0, with { Σ = 0 } on the error side
Step 4:   Partition x → (x_T / x_B), y → (y_T / y_B), Σ → diag(Σ_T, Σ_B), where x_T and y_T are empty and Σ_T is 0 × 0
Step 2a:  { κ̌ = x_Tᵀ(I + Σ_T)y_T ∧ Σ_T = Σ^(k) ∧ m(x_T) = k }
Step 3:   While m(x_T) < m(x) do
Step 2b:      { κ̌ = x_Tᵀ(I + Σ_T)y_T ∧ Σ_T = Σ^(k) ∧ m(x_T) = k }
Step 5a:      Repartition (x_T / x_B) → (x_0 / χ_1 / x_2), (y_T / y_B) → (y_0 / ψ_1 / y_2), Σ → diag(Σ_0^{i}, σ_1^{i}, Σ_2), where χ_1, ψ_1, and σ_1^{i} are scalars
Step 6:       { κ̌^{i} = x_0ᵀ(I + Σ_0^{i})y_0 ∧ Σ_0^{i} = Σ^(k) ∧ m(x_0) = k }
Step 8:       κ := κ + χ_1ψ_1, with, on the error side,
                  κ̌^{i+1} = (κ̌^{i} + χ_1ψ_1(1 + ε_*))(1 + ε_+)                             (SCM, twice; ε_+ = 0 if k = 0)
                          = (x_0ᵀ(I + Σ_0^(k))y_0 + χ_1ψ_1(1 + ε_*))(1 + ε_+)              (Step 6: I.H.)
                          = (x_0 / χ_1)ᵀ diag(I + Σ_0^(k), 1 + ε_*)(1 + ε_+)(y_0 / ψ_1)    (rearrange)
                          = (x_0 / χ_1)ᵀ (I + Σ^(k+1)) (y_0 / ψ_1)                         (Exercise 3.6)
Step 7:       { κ̌^{i+1} = (x_0 / χ_1)ᵀ(I + diag(Σ_0^{i+1}, σ_1^{i+1}))(y_0 / ψ_1) ∧ diag(Σ_0^{i+1}, σ_1^{i+1}) = Σ^(k+1) ∧ m(x_0 / χ_1) = k + 1 }
Step 5b:      Continue with (x_T / x_B) ← (x_0 / χ_1 / x_2), (y_T / y_B) ← (y_0 / ψ_1 / y_2), Σ ← diag(Σ_0^{i+1}, σ_1^{i+1}, Σ_2)
Step 2c:      { κ̌ = x_Tᵀ(I + Σ_T)y_T ∧ Σ_T = Σ^(k) ∧ m(x_T) = k }
          endwhile
Step 2d:  { κ̌ = x_Tᵀ(I + Σ_T)y_T ∧ Σ_T = Σ^(k) ∧ m(x_T) = k ∧ m(x_T) = m(x) }
Step 1b:  { κ̌ = xᵀ(I + Σ^(n))y ∧ m(x) = n }

Fig. 3.2. Error worksheet completed to establish the backward error result for the given algorithm that computes the Dot operation.

Inductive step. Assume that predicate (3.5) holds true at Step 2b, i.e., at the top of the loop. Then Steps 6, 7, and 8 in Fig. 3.2 prove that the predicate is satisfied again at Step 2c, i.e., at the bottom of the loop. Specifically:
• Step 6 holds by virtue of the equalities x_0 = x_T, y_0 = y_T, and Σ_0^{i} = Σ_T.
• The update in Step 8-left introduces the error indicated in Step 8-right (SCM, twice), yielding the results for Σ_0^{i+1} and σ_1^{i+1}, leaving the variables in the state indicated in Step 7.
• Finally, the redefinition of Σ_T in Step 5b transforms the predicate in Step 7 into that of Step 2c, completing the inductive step.

By the Principle of Mathematical Induction, the predicate (3.5) holds for all iterations. In particular, when the loop terminates, the predicate becomes κ̌ = xᵀ(I + Σ^(n))y ∧ n = m(x_T). This completes the discussion of the proof as captured by Fig. 3.2.
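The invariant can also be spot-checked in code. The sketch below (ours) runs the Dot loop in float32, so u = 2⁻²⁴, tracks a float64 running sum as a stand-in for the exact value, and verifies at every iteration the forward-error consequence |κ̌ − x_Tᵀy_T| ≤ γ_k |x_T|ᵀ|y_T| of predicate (3.5):

    import numpy as np

    u = 2.0**-24  # unit roundoff of float32, the working precision here
    rng = np.random.default_rng(1)
    x = rng.standard_normal(200).astype(np.float32)
    y = rng.standard_normal(200).astype(np.float32)

    kappa = np.float32(0.0)  # computed value, float32 throughout
    exact = 0.0              # float64 running sum: a proxy for the exact kappa
    for k in range(1, x.size + 1):
        chi, psi = x[k - 1], y[k - 1]
        kappa = kappa + chi * psi         # SCM twice: one rounded *, one rounded +
        exact += float(chi) * float(psi)  # float32 products are exact in float64
        gamma_k = k * u / (1 - k * u)
        bound = gamma_k * float(np.abs(x[:k].astype(np.float64)) @ np.abs(y[:k].astype(np.float64)))
        assert abs(float(kappa) - exact) <= bound, f"bound violated at k = {k}"
    print("gamma_k |x_T|^T |y_T| bounded the error at every iteration")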
In the derivation of algorithms, the concept of loop-invariant plays a central role. Let L be a loop and P a predicate. If P is true before the execution of L, at the beginning and at the end of each iteration of L, and after the completion of L, then predicate P is a loop-invariant with respect to L. Similarly, we give the definition of error-invariant.

Definition 3.8. We call the predicate involving the operands and error operands in Steps 2a–d the error-invariant for the analysis. This predicate is true before and after each iteration of the loop.

For any algorithm, the loop-invariant and the error-invariant are related in that the former describes the status of the computation at the beginning and the end of each iteration, while the latter captures an error result for the computation indicated by the loop-invariant.

The reader will likely think that the error worksheet is overkill when proving the error result for the dot product. We agree. We use the dot product merely as a vehicle to introduce the reader to the methodology. As the operations being analyzed become more complex, the benefits of the structure that the error worksheet provides will become more obvious.

3.6. Results. A number of useful consequences of Theorem 3.7 follow. These will be used later as an inventory (library) of error results from which to draw when analyzing operations and algorithms that utilize Dot.

Corollary 3.9. Under the assumptions of Theorem 3.7 the following relations hold:
R1-B (backward analysis): κ̌ = (x + δx)ᵀy, where |δx| ≤ γ_n|x|, and κ̌ = xᵀ(y + δy), where |δy| ≤ γ_n|y|;
R1-F (forward analysis): κ̌ = xᵀy + δκ, where |δκ| ≤ γ_n|x|ᵀ|y|.

Proof: We leave the proof of R1-B as an exercise. For R1-F, let δκ = xᵀΣ^(n)y, where Σ^(n) is as in Theorem 3.7. Then

    |δκ| = |xᵀΣ^(n)y| ≤ |χ_0||θ_n||ψ_0| + |χ_1||θ_n||ψ_1| + ··· + |χ_{n−1}||θ_2||ψ_{n−1}|
         ≤ γ_n|χ_0||ψ_0| + γ_n|χ_1||ψ_1| + ··· + γ_2|χ_{n−1}||ψ_{n−1}| ≤ γ_n|x|ᵀ|y|.

Exercise 3.10. Prove R1-B.
Answer: From Theorem 3.7 we know that κ̌ = xᵀ(I + Σ^(n))y = (x + Σ^(n)x)ᵀy, so that δx = Σ^(n)x. Then

    |δx| = |Σ^(n)x| = ( |θ_n χ_0|, |θ_n χ_1|, |θ_{n−1} χ_2|, ..., |θ_2 χ_{n−1}| )ᵀ
         = ( |θ_n||χ_0|, |θ_n||χ_1|, |θ_{n−1}||χ_2|, ..., |θ_2||χ_{n−1}| )ᵀ
         ≤ |θ_n| ( |χ_0|, |χ_1|, |χ_2|, ..., |χ_{n−1}| )ᵀ ≤ γ_n|x|.

End of Answer.

4. Analysis of a Triangular Linear System Solve.

Algorithm Ltrsv: Compute x such that Lx = b
    Partition L → (L_TL 0 / L_BL L_BR), x → (x_T / x_B), b → (b_T / b_B), where L_TL, x_T, and b_T are empty
    While m(x_T) < m(x) do
        Repartition (L_TL 0 / L_BL L_BR) → (L_00 0 0 / l_10ᵀ λ_11 0 / L_20 l_21 L_22), (x_T / x_B) → (x_0 / χ_1 / x_2), (b_T / b_B) → (b_0 / β_1 / b_2), where λ_11, χ_1, and β_1 are scalars
        χ_1 := (β_1 − l_10ᵀx_0)/λ_11
        Continue with (L_TL 0 / L_BL L_BR) ← (L_00 0 0 / l_10ᵀ λ_11 0 / L_20 l_21 L_22), (x_T / x_B) ← (x_0 / χ_1 / x_2), (b_T / b_B) ← (b_0 / β_1 / b_2)
    endwhile

Fig. 4.1. An algorithm for solving Lx = b.
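A minimal NumPy transcription of Ltrsv follows (ours; the FLAME partitioning becomes an index loop, and the test matrix is kept well conditioned by boosting its diagonal):

    import numpy as np

    def ltrsv(L, b):
        # Solve L x = b, L lower triangular, by forward substitution:
        # chi_1 := (beta_1 - l10^T x0) / lambda_11, as in Fig. 4.1.
        n = b.shape[0]
        x = np.zeros_like(b)
        for i in range(n):
            x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
        return x

    rng = np.random.default_rng(2)
    L = np.tril(rng.standard_normal((6, 6))) + 6.0 * np.eye(6)
    b = rng.standard_normal(6)
    print(np.allclose(ltrsv(L, b), np.linalg.solve(L, b)))  # True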
Step 1: The error-precondition and error-postcondition. The first question is what error result is desired. We will show that the computed solution, x̌, satisfies the backward error result (L + ∆L)x̌ = b with |∆L| ≤ max(γ_2, γ_{n−1})|L|, i.e., |∆L| ≤ γ_{n−1}|L| when n > 2. The reasoning for the factor γ_{n−1} is as follows. We would expect the maximal error to be incurred during the final iteration, when computing χ_1 = (β_1 − l_10ᵀx_0)/λ_11. This assignment involves a dot product with vectors of length n − 1. Thus, result R4 from Theorem 3.12 suggests that it is reasonable to expect the indicated factor.

In practice, the analysis proceeds in two stages. In the first stage one proves, constructively, the existence of a matrix ∆L, the error-operand, such that (L + ∆L)x̌ = b. In this stage error bounds are not considered, as the only concern is to assign the error generated by the computational updates to the error-operands. This process involves error results regarding the suboperations that appear in the updates. These error results then allow one to make an educated guess of the bounds that can be established for the error operands. In the second stage the bounds on ∆L are determined. For space considerations, in this paper we incorporate the proof of the bounds on the error operands into the proof of existence.

We call the predicate that describes the state of the error operands, in this case the matrix ∆L, before the algorithm is executed the error-precondition, and the predicate that describes the target error result the error-postcondition. In Fig. 4.2 these predicates are placed in Steps 1a and 1b, respectively. The error-precondition is {∆L = 0} (no error has yet accumulated), and the error-postcondition is

    {(L + ∆L)x̌ = b ∧ |∆L| ≤ max(γ_2, γ_{n−1})|L| ∧ m(x) = n}.

Step 2: The error-invariant. Now we pick an error-invariant for the algorithm, i.e., a predicate that describes the state of the error operands in Steps 2a–d. Since the loop-invariant is the predicate that describes the computation performed when the algorithm reaches these stages, it makes sense to impose that the error-invariant equal the portion of the error-postcondition that corresponds to the state described by the loop-invariant. The algorithm in Fig. 4.1 has the property that before and after each iteration the contents of x_T are such that {L_TL x_T = b_T}. This predicate is the loop-invariant. Thus, we expect the accumulated error to satisfy (L_TL + ∆L_TL)x̌_T = b_T. Considering also bounds on ∆L_TL, the error-invariant becomes

    {(L_TL + ∆L_TL)x̌_T = b_T ∧ |∆L_TL| ≤ max(γ_2, γ_{k−1})|L_TL| ∧ m(x_T) = k},

and it appears in Steps 2a–d.

Step 1a:  { ∆L = 0 }
Step 4:   Partition L → (L_TL 0 / L_BL L_BR), x → (x_T / x_B), b → (b_T / b_B), ∆L → (∆L_TL 0 / ∆L_BL ∆L_BR), where L_TL, x_T, b_T, and ∆L_TL are empty
Step 2a:  { (L_TL + ∆L_TL)x̌_T = b_T ∧ |∆L_TL| ≤ max(γ_2, γ_{k−1})|L_TL| ∧ m(x_T) = k }
Step 3:   While m(x_T) < m(x) do
Step 2b:      { (L_TL + ∆L_TL)x̌_T = b_T ∧ |∆L_TL| ≤ max(γ_2, γ_{k−1})|L_TL| ∧ m(x_T) = k }
Step 5a:      Repartition the operands as in Fig. 4.1 and (∆L_TL 0 / ∆L_BL ∆L_BR) → (∆L_00 0 0 / δl_10ᵀ δλ_11 0 / ∆L_20 δl_21 ∆L_22), where λ_11, δλ_11, χ_1, and β_1 are scalars
Step 6:       { (L_00 + ∆L_00)x̌_0 = b_0 ∧ |∆L_00| ≤ max(γ_2, γ_{k−1})|L_00| ∧ m(x_0) = k }
Step 8:       χ_1 := (β_1 − l_10ᵀx_0)/λ_11, with, on the error side,
                  b_0 = (L_00 + ∆L_00)x̌_0 ∧ |∆L_00| ≤ max(γ_2, γ_{k−1})|L_00|     (Step 6: I.H.)
                  β_1 = (l_10ᵀ + δl_10ᵀ)x̌_0 + (λ_11 + δλ_11)χ̌_1 ∧ |(δl_10ᵀ δλ_11)| ≤ max(γ_2, γ_k)|(l_10ᵀ λ_11)|     (Th. 3.12 R4-B)
Step 7:       { ((L_00 0 / l_10ᵀ λ_11) + (∆L_00 0 / δl_10ᵀ δλ_11))(x̌_0 / χ̌_1) = (b_0 / β_1) ∧ |(∆L_00 0 / δl_10ᵀ δλ_11)| ≤ max(γ_2, γ_k)|(L_00 0 / l_10ᵀ λ_11)| ∧ m(x_0 / χ_1) = k + 1 }
Step 5b:      Continue with L, x, and b as in Fig. 4.1, and (∆L_TL 0 / ∆L_BL ∆L_BR) ← (∆L_00 0 0 / δl_10ᵀ δλ_11 0 / ∆L_20 δl_21 ∆L_22)
Step 2c:      { (L_TL + ∆L_TL)x̌_T = b_T ∧ |∆L_TL| ≤ max(γ_2, γ_{k−1})|L_TL| ∧ m(x_T) = k }
          endwhile
Step 2d:  { (L_TL + ∆L_TL)x̌_T = b_T ∧ |∆L_TL| ≤ max(γ_2, γ_{k−1})|L_TL| ∧ m(x_T) = m(x) ∧ m(x_T) = k }
Step 1b:  { (L + ∆L)x̌ = b ∧ |∆L| ≤ max(γ_2, γ_{n−1})|L| ∧ m(x) = n }

Fig. 4.2. Error worksheet for deriving the backward stability error result for the algorithm in Fig. 4.1.

Step 3: The loop-guard. There is nothing to be done for this step, as the loop guard comes as part of the algorithm to analyze. We retain this step number so that the error worksheet closely resembles the worksheet we use for deriving algorithms [4, 3, 15].
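The error-postcondition has a directly testable consequence: (L + ∆L)x̌ = b with |∆L| ≤ max(γ_2, γ_{n−1})|L| implies the componentwise residual bound |b − Lx̌| ≤ max(γ_2, γ_{n−1})|L||x̌|. A sketch of the check (ours), solving in float32 so that a float64 residual is trustworthy:

    import numpy as np

    u = 2.0**-24  # unit roundoff of float32, the working precision
    rng = np.random.default_rng(3)
    n = 300
    L = (np.tril(rng.standard_normal((n, n))) + n * np.eye(n)).astype(np.float32)
    b = rng.standard_normal(n).astype(np.float32)

    x = np.zeros(n, dtype=np.float32)
    for i in range(n):  # forward substitution, entirely in float32
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]

    gamma = (n - 1) * u / (1 - (n - 1) * u)  # max(gamma_2, gamma_{n-1}) since n > 2
    L64, x64, b64 = (a.astype(np.float64) for a in (L, x, b))
    residual = np.abs(b64 - L64 @ x64)            # |Delta L x|, evaluated in float64
    bound = gamma * (np.abs(L64) @ np.abs(x64))   # from |Delta L| <= max(gamma_2, gamma_{n-1}) |L|
    print(bool(np.all(residual <= bound)))        # True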
Step 4: Initialization. In this step the error operand, ∆L, is partitioned conformally to the operands in the algorithm with respect to the error-postcondition. In other words, ∆L is partitioned so that the dimensions of the variables in the expression (L + ∆L)x̌ = b, with the operands partitioned as in Fig. 4.1, are conformal.

Step 5: Moving through the error operand. In Steps 5a and 5b, the error operand is repartitioned conformally to the repartitioning of the operands in the algorithm.

Step 6: State of the variables before the update in Step 8. Step 5a consists of the following relabelling statements: L_TL → L_00, x_T → x_0, b_T → b_0, and ∆L_TL → ∆L_00. Thus, given the state of the variables in Step 2b, the contents of the submatrices and subvectors exposed in Step 5a are described by

    (L_00 + ∆L_00)x̌_0 = b_0 ∧ |∆L_00| ≤ max(γ_2, γ_{k−1})|L_00| ∧ m(x_0) = k     (4.1)

at Step 6. This predicate expresses the state of the error operands before the execution of the computational statements listed in Step 8, on the left. Such a replacement of symbols is called textual substitution.

Step 7: State of the variables after the update in Step 8. At the bottom of the loop, the variables must again be in a state where the loop-invariant holds, as indicated by Step 2c. In Step 5b, the different quadrants of L and ∆L, and the subvectors of x and b, are redefined so that

    L_TL ← (L_00 0 / l_10ᵀ λ_11),  x_T ← (x_0 / χ_1),  b_T ← (b_0 / β_1),  and  ∆L_TL ← (∆L_00 0 / δl_10ᵀ δλ_11).

Thus, given the state in which the variables must be in Step 2c, the contents of the submatrices and subvectors before Step 5b must satisfy

    ((L_00 0 / l_10ᵀ λ_11) + (∆L_00 0 / δl_10ᵀ δλ_11))(x̌_0 / χ̌_1) = (b_0 / β_1) ∧ m(x_0 / χ_1) = k + 1
      ∧ |(∆L_00 0 / δl_10ᵀ δλ_11)| ≤ max(γ_2, γ_k)|(L_00 0 / l_10ᵀ λ_11)|     (4.2)

at Step 7. This predicate is attained via textual substitution and algebraic manipulation.

Step 8: The inductive step. When the computation reaches this step, predicate (4.1), described in Step 6, is satisfied. The question now is whether the computational update χ_1 := (β_1 − l_10ᵀx_0)/λ_11, with m(x_0) = k, generates errors for which one can find assignments to the error variables ∆L_00, δl_10ᵀ, and δλ_11 so that predicate (4.2), described in Step 7, is also satisfied.

5. Analysis of LU Factorization. The a priori justification for the bound is not as straightforward as it was for the solution of the triangular system. The factor γ_n comes from the fact that each element of L and U is exposed to the effects of at most n individual operations. In addition, experience tells us that bounds for the backward error of various algorithms for LU factorization have the form c_1γ_n|A| + c_2γ_n|L||U|, where c_1 ≥ 0 and c_2 ≥ 1. We stress that in practice one would start with a formulation of the analysis without taking bounds into account; our methodology makes it easy to experiment with loose analyses to gain insight that leads to a tighter formulation. For space considerations, we start the analysis already with tight bounds.

In Fig. 5.2, we show both the algorithm, on the left, and the error side of the error worksheet, on the right. The latter is filled out in the order indicated by the Step column, as we demonstrated in Section 4. We have already established the expressions for the error-precondition and error-postcondition (Step 1), in which the matrix ∆A acts as the error operand.
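Fig. 5.1 itself is not reproduced above, but the updates it performs are quoted in Step 8 of the worksheet below: υ_11 := α_11 − l_10ᵀu_01, u_12ᵀ := a_12ᵀ − l_10ᵀU_02, and l_21 := (a_21 − L_20u_01)/υ_11. A NumPy sketch of that Crout variant (ours; no pivoting, overwriting a copy of A):

    import numpy as np

    def lu_crout(A):
        # Crout variant of unblocked LU without pivoting. On return, the strictly
        # lower part of A holds L (unit diagonal implied), the upper part holds U.
        A = A.copy()
        n = A.shape[0]
        for k in range(n):
            A[k, k] -= A[k, :k] @ A[:k, k]                                       # upsilon_11
            A[k, k + 1:] -= A[k, :k] @ A[:k, k + 1:]                             # u_12^T
            A[k + 1:, k] = (A[k + 1:, k] - A[k + 1:, :k] @ A[:k, k]) / A[k, k]   # l_21
        return np.tril(A, -1) + np.eye(n), np.triu(A)

    A = np.random.default_rng(4).standard_normal((5, 5)) + 5.0 * np.eye(5)
    L, U = lu_crout(A)
    print(np.allclose(L @ U, A))  # True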
In Step 2 one has to select a feasible error-invariant, i.e., an error result for the computations performed up to this stage. To this end, we restrict the error-postcondition to the computation described by the loop-invariant (5.1), yielding the error-invariant

    m(A_TL) = k ∧
    ( Ľ_TLǓ_TL = A_TL + ∆A_TL    Ľ_TLǓ_TR = A_TR + ∆A_TR
      Ľ_BLǓ_TL = A_BL + ∆A_BL )
    ∧ ( |∆A_TL| ≤ γ_k|Ľ_TL||Ǔ_TL|    |∆A_TR| ≤ γ_k|Ľ_TL||Ǔ_TR|
        |∆A_BL| ≤ γ_k|Ľ_BL||Ǔ_TL| ),

which appears in Step 2. (In the error worksheet of Fig. 4.2, the error-invariant appears in four different places; in order to save space, here we list it only before the loop.) Given the error-invariant, Steps 4–7 only require symbolic manipulation. In particular, the predicates indicated in Steps 6 and 7 follow immediately from the error-invariant by substituting the submatrices from Steps 5a and 5b, respectively. Step 6 becomes

    m(A_00) = k ∧
    ( Ľ_00Ǔ_00 = A_00 + ∆A_00    Ľ_00ǔ_01 = a_01 + δa_01    Ľ_00Ǔ_02 = A_02 + ∆A_02
      ľ_10ᵀǓ_00 = a_10ᵀ + δa_10ᵀ
      Ľ_20Ǔ_00 = A_20 + ∆A_20 )
    ∧ |( ∆A_00 δa_01 ∆A_02 / δa_10ᵀ / ∆A_20 )| ≤ γ_k ( |Ľ_00||Ǔ_00| |Ľ_00||ǔ_01| |Ľ_00||Ǔ_02| / |ľ_10ᵀ||Ǔ_00| / |Ľ_20||Ǔ_00| ).     (5.2)

Step 2 tells us that this predicate is true before the execution of the computational statements in the left box of Step 8. In other words, Step 6 represents our Inductive Hypothesis (I.H.). Algebraic manipulation of Step 7 yields

    m( A_00 a_01 / a_10ᵀ α_11 ) = k + 1 ∧
    ( Ľ_00Ǔ_00 = A_00 + ∆A_00      Ľ_00ǔ_01 = a_01 + δa_01                Ľ_00Ǔ_02 = A_02 + ∆A_02
      ľ_10ᵀǓ_00 = a_10ᵀ + δa_10ᵀ   ľ_10ᵀǔ_01 + υ̌_11 = α_11 + δα_11        ľ_10ᵀǓ_02 + ǔ_12ᵀ = a_12ᵀ + δa_12ᵀ
      Ľ_20Ǔ_00 = A_20 + ∆A_20      Ľ_20ǔ_01 + ľ_21υ̌_11 = a_21 + δa_21 )
    ∧ |( ∆A_00 δa_01 ∆A_02 / δa_10ᵀ δα_11 δa_12ᵀ / ∆A_20 δa_21 )|
        ≤ γ_{k+1} ( |Ľ_00||Ǔ_00|   |Ľ_00||ǔ_01|                |Ľ_00||Ǔ_02|
                    |ľ_10ᵀ||Ǔ_00|  |ľ_10ᵀ||ǔ_01| + |υ̌_11|      |ľ_10ᵀ||Ǔ_02| + |ǔ_12ᵀ|
                    |Ľ_20||Ǔ_00|   |Ľ_20||ǔ_01| + |ľ_21||υ̌_11| ),     (5.3)

which is the predicate that must be true after the execution of the computational statements. The goal is to show that the computation in Step 8, when executed in a state that satisfies predicate (5.2), generates errors that satisfy the relations in (5.3). In the worksheet the submatrices affected by the computation are highlighted (grey-shaded). The relations for the other submatrices are satisfied "by the I.H.", in the terminology of an inductive proof, since γ_k ≤ γ_{k+1}. Thus it remains to show that there exist δα_11, δa_12ᵀ, and δa_21 that satisfy the constraints in the grey-shaded boxes. To this end, we examine the error introduced by the computational updates to determine how error is contributed to each of these variables:

Determining δα_11: The assignment υ_11 := α_11 − l_10ᵀu_01 is executed in a state where m(A_00) = k. Theorem 3.12 R2-F states that there exists δυ_11 such that l_10ᵀu_01 + υ̌_11 = α_11 + δυ_11, where |δυ_11| ≤ γ_k(|l_10ᵀ||u_01| + |υ̌_11|) ≤ γ_{k+1}(|l_10ᵀ||u_01| + |υ̌_11|); therefore we choose δα_11 := δυ_11.

Determining δa_12ᵀ: The assignment u_12ᵀ := a_12ᵀ − l_10ᵀU_02, executed in a state where m(A_00) = k, and Theorem 5.1 R2-F imply that there exists δυ_12ᵀ such that l_10ᵀU_02 + ǔ_12ᵀ = a_12ᵀ + δυ_12ᵀ, where |δυ_12ᵀ| ≤ γ_k(|l_10ᵀ||U_02| + |ǔ_12ᵀ|) ≤ γ_{k+1}(|l_10ᵀ||U_02| + |ǔ_12ᵀ|); thus, δa_12ᵀ := δυ_12ᵀ is the desired update.

Determining δa_21: The assignment l_21 := (a_21 − L_20u_01)/υ_11, executed in a state where m(A_00) = k, and Theorem 5.1 R4-F imply that there exists δw_21 such that L_20u_01 + υ_11ľ_21 = a_21 + δw_21, where |δw_21| ≤ γ_{k+1}(|L_20||u_01| + |ľ_21||υ_11|); therefore δa_21 := δw_21 is the desired update.

This completes the proof of the Inductive Step. The proof is also summarized in Step 8 of Fig. 5.2.
5.4. Results. The above discussion, summarized in Fig. 5.2, proves the following theorem:

Theorem 5.4. Let A ∈ Rⁿˣⁿ, n > 2, and let Ľ and Ǔ be computed via the Crout variant of the LU factorization, as given in Fig. 5.1. Then the computed factors Ľ and Ǔ satisfy ĽǓ = A + ∆A, where |∆A| ≤ γ_n|Ľ||Ǔ|.

The backward stability result in this theorem agrees with the one in [11]. The attentive reader will have noticed that none of the bounds used to determine the error operands involves the entries of A; in terms of the form c_1γ_n|A| + c_2γ_n|L||U| discussed earlier, the Crout variant attains c_1 = 0.

Step 1a:  { ∆A = 0 }
Step 4:   Partition A → (A_TL A_TR / A_BL A_BR), L → (L_TL 0 / L_BL L_BR), U → (U_TL U_TR / 0 U_BR), ∆A → (∆A_TL ∆A_TR / ∆A_BL ∆A_BR), where A_TL, L_TL, U_TL, and ∆A_TL are empty
Step 2a:  { m(A_TL) = k ∧ Ľ_TLǓ_TL = A_TL + ∆A_TL ∧ Ľ_TLǓ_TR = A_TR + ∆A_TR ∧ Ľ_BLǓ_TL = A_BL + ∆A_BL ∧ |(∆A_TL ∆A_TR / ∆A_BL)| ≤ γ_k(|Ľ_TL||Ǔ_TL| |Ľ_TL||Ǔ_TR| / |Ľ_BL||Ǔ_TL|) }
Step 3:   While m(A_TL) < m(A) do
Step 5a:      Repartition A, L, U as in Fig. 5.1, and (∆A_TL ∆A_TR / ∆A_BL ∆A_BR) → (∆A_00 δa_01 ∆A_02 / δa_10ᵀ δα_11 δa_12ᵀ / ∆A_20 δa_21 ∆A_22), where δα_11 is a scalar
Step 6:       predicate (5.2)
Step 8:       υ_11 := α_11 − l_10ᵀu_01, with υ̌_11 + δα_11 = α_11 − l_10ᵀu_01 ∧ |δα_11| ≤ γ_{k+1}(|l_10ᵀ||u_01| + |υ̌_11|)     (Th. 3.12 R2-F)
              u_12ᵀ := a_12ᵀ − l_10ᵀU_02, with ǔ_12ᵀ + δa_12ᵀ = a_12ᵀ − l_10ᵀU_02 ∧ |δa_12ᵀ| ≤ γ_{k+1}(|l_10ᵀ||U_02| + |ǔ_12ᵀ|)     (Th. 5.1 R2-F)
              l_21 := (a_21 − L_20u_01)/υ_11, with ľ_21υ_11 + δa_21 = a_21 − L_20u_01 ∧ |δa_21| ≤ γ_{k+1}(|Ľ_20||ǔ_01| + |ľ_21||υ_11|)     (Th. 5.1 R4-F)
Step 7:       predicate (5.3)
Step 5b:      Continue with A, L, and U as in Fig. 5.1, and (∆A_TL ∆A_TR / ∆A_BL ∆A_BR) ← (∆A_00 δa_01 ∆A_02 / δa_10ᵀ δα_11 δa_12ᵀ / ∆A_20 δa_21 ∆A_22)
          endwhile
Step 1b:  { m(A) = n ∧ ĽǓ = A + ∆A ∧ |∆A| ≤ γ_n|Ľ||Ǔ| }

Fig. 5.2. LU = A. Error worksheet for proving the backward stability of the Crout variant for the LU factorization.
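Theorem 5.4 can be probed numerically in the same style as before: factor in float32 (u = 2⁻²⁴) with the Crout sketch given earlier, then evaluate |A − ĽǓ| and the bound γ_n|Ľ||Ǔ| in float64 (ours; the diagonal is boosted so that no pivoting is needed):

    import numpy as np

    u = 2.0**-24
    rng = np.random.default_rng(5)
    n = 200
    A = (rng.standard_normal((n, n)) + n * np.eye(n)).astype(np.float32)

    W = A.copy()  # Crout LU in float32, as in the earlier lu_crout sketch
    for k in range(n):
        W[k, k] -= W[k, :k] @ W[:k, k]
        W[k, k + 1:] -= W[k, :k] @ W[:k, k + 1:]
        W[k + 1:, k] = (W[k + 1:, k] - W[k + 1:, :k] @ W[:k, k]) / W[k, k]
    L = (np.tril(W, -1) + np.eye(n, dtype=np.float32)).astype(np.float64)
    U = np.triu(W).astype(np.float64)

    gamma_n = n * u / (1 - n * u)
    delta = np.abs(L @ U - A.astype(np.float64))  # |Delta A|, evaluated in float64
    bound = gamma_n * (np.abs(L) @ np.abs(U))     # Theorem 5.4: |Delta A| <= gamma_n |L||U|
    print(bool(np.all(delta <= bound)))           # True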
Fig. 6.2 (typeset sideways in the original layout) gives the details for Steps 6–8 in the proof of Theorem 6.5. Its Step 8 records the computational updates

    [L_11, U_11] := LU(A_11),   U_12 := L_11⁻¹A_12,   L_21 := A_21U_11⁻¹,   A_22 := A_22 − L_21U_12,

together with the error results Ľ_11Ǔ_11 = A_11 + ∆A_11 with |∆A_11| ≤ γ_b|Ľ_11||Ǔ_11| (Theorem 5.4); Ľ_11Ǔ_12 = A_12 + ∆A_12 with |∆A_12| ≤ γ_b|Ľ_11||Ǔ_12| (Corollary 6.3); Ľ_21Ǔ_11 = A_21 + ∆A_21 with |∆A_21| ≤ γ_b|Ľ_21||Ǔ_11| (Corollary 6.3); and Ǎ_22^{i+1} + Ľ_21Ǔ_12 = Ǎ_22^{i} + ∆Ǎ_22 with |∆Ǎ_22| ≤ γ_{b+1}(|Ǎ_22^{i}| + |Ľ_21||Ǔ_12|) (Corollary 6.1).

Fig. 6.2. LU = A. Details for Steps 6–8 in the proof of Theorem 6.5.

As in the earlier worksheets, the error-invariant is satisfied at the beginning and at the end of each iteration; superscripts denote the iteration number.

6.2. Preparation. The computation in Fig. 6.1 (left) is cast in terms of an unblocked LU factorization, LU(A_11), two triangular solves with multiple right-hand sides, L_11U_12 = A_12 and L_21U_11 = A_21, and one matrix-matrix multiplication, A_22 − L_21U_12. An error result for the unblocked LU factorization, if the Crout variant is used, is given in Theorem 5.4. Here we present theorems related to triangular solve with multiple right-hand sides and matrix-matrix multiplication.

Error results for matrix-matrix multiplication.

Corollary 6.1. Let C ∈ Rᵐˣⁿ, A ∈ Rᵐˣᵏ, and B ∈ Rᵏˣⁿ. Assume that Z := C − AB is computed one column at a time by the matrix-vector multiplication discussed in Theorem 5.1. Partition Z, C, and B by columns as Z → (z_0 | ··· | z_{n−1}), B → (b_0 | ··· | b_{n−1}), and C → (c_0 | ··· | c_{n−1}), so that Z = (c_0 − Ab_0 | ··· | c_{n−1} − Ab_{n−1}) = C − AB. Then
R1-F: AB + Ž = C + ∆Z, where |∆Z| ≤ γ_k|A||B| + γ_1|Ž|;
R2-F: AB + Ž = C + ∆C, where |∆C| ≤ γ_{k+1}|A||B| + γ_1|Č|.

Exercise 6.2. Prove Corollary 6.1. Hint: use Theorem 5.1.

Error results for triangular solve with multiple right-hand sides. Let L ∈ Rᵐˣᵐ be a lower triangular matrix (no assumptions on the diagonal entries), and X, B ∈ Rᵐˣⁿ, where L and B are input operands and matrix X is to be computed. Partition X → (x_0 | ··· | x_{n−1}) and B → (b_0 | ··· | b_{n−1}). Then

    LX = L(x_0 | ··· | x_{n−1}) = (Lx_0 | ··· | Lx_{n−1}) = (b_0 | ··· | b_{n−1}),

so that Lx_j = b_j. This explains the name of the operation: each column of X is computed by solving a lower triangular system with the corresponding column of B as right-hand side. The following result follows immediately from Theorem 4.1:

Corollary 6.3. Let L ∈ Rᵐˣᵐ be a nonsingular lower triangular matrix and X, B ∈ Rᵐˣⁿ. Assume that the solution X to LX = B is computed one column at a time via the algorithm in Fig. 4.2 (left side). Then X̌ satisfies LX̌ = B + ∆B, where |∆B| ≤ γ_n|L||X̌|.

Exercise 6.4. Discuss why in Cor. 6.3 there is no result that corresponds to Theorem 4.1 R1.
There are a number of different algorithms for computing this operation, each of which has a similar error result.

6.3. Analysis. As for the unblocked algorithm discussed in the previous section, the right-looking LU factorization algorithm, in the presence of round-off error, computes factors Ľ and Ǔ. This section provides the proof of the following theorem.

Theorem 6.5. Given A ∈ Rⁿˣⁿ, assume that the blocked right-looking algorithm in Fig. 6.1 completes. Then the computed factors Ľ and Ǔ are such that ĽǓ = A + ∆A, where |∆A| ≤ γ_{n/b+b}(|A| + |Ľ||Ǔ|).

The worksheet containing the proof of Theorem 6.5 is shown in Figs. 6.1 (right) and 6.2. In the remainder of the section we will prove that the error bounds indicated in the worksheet hold. Consider the error-invariant

    m(A_TL) = k ∧
    ( Ľ_TLǓ_TL = A_TL + ∆A_TL    Ľ_TLǓ_TR = A_TR + ∆A_TR
      Ľ_BLǓ_TL = A_BL + ∆A_BL    Ǎ_BR = A_BR − Ľ_BLǓ_TR + ∆A_BR )
    ∧ |( ∆A_TL ∆A_TR / ∆A_BL ∆A_BR )| ≤ γ_{k/b+b} ( |A_TL| + |Ľ_TL||Ǔ_TL|   |A_TR| + |Ľ_TL||Ǔ_TR|
                                                     |A_BL| + |Ľ_BL||Ǔ_TL|   |A_BR| + |Ľ_BL||Ǔ_TR| ),     (6.1)

which results from restricting the target error result to the state of the variables at the top of each iteration. The base case of the proof requires this predicate to be satisfied right after the initialization of Step 4. The initialization sets A_TL to be an empty matrix; therefore predicate (6.1) reduces to Ǎ_BR = A_BR + ∆A_BR ∧ |∆A_BR| ≤ γ_b|A_BR|, which is true as no computation has been performed yet; thus Ǎ_BR equals A_BR, and ∆A_BR = 0.

We know that the predicate in Step 6 of the worksheet represents the inductive hypothesis. This predicate follows immediately by substituting the submatrices from Step 5a into the error-invariant. Algebraic manipulation yields

    m(A_00) = k ∧     (6.2)
    ( Ľ_00Ǔ_00 = A_00 + ∆A_00    Ľ_00Ǔ_01 = A_01 + ∆A_01                    Ľ_00Ǔ_02 = A_02 + ∆A_02
      Ľ_10Ǔ_00 = A_10 + ∆A_10    Ǎ_11 = A_11 − Ľ_10Ǔ_01 + ∆A_11^{i}         Ǎ_12 = A_12 − Ľ_10Ǔ_02 + ∆A_12^{i}
      Ľ_20Ǔ_00 = A_20 + ∆A_20    Ǎ_21 = A_21 − Ľ_20Ǔ_01 + ∆A_21^{i}         Ǎ_22^{i} = A_22 − Ľ_20Ǔ_02 + ∆A_22^{i} )
    ∧ |( ∆A_00 ∆A_01 ∆A_02 / ∆A_10 ∆A_11^{i} ∆A_12^{i} / ∆A_20 ∆A_21^{i} ∆A_22^{i} )|
        ≤ γ_{k/b+b} ( |A_00| + |Ľ_00||Ǔ_00|   |A_01| + |Ľ_00||Ǔ_01|   |A_02| + |Ľ_00||Ǔ_02|
                      |A_10| + |Ľ_10||Ǔ_00|   |A_11| + |Ľ_10||Ǔ_01|   |A_12| + |Ľ_10||Ǔ_02|
                      |A_20| + |Ľ_20||Ǔ_00|   |A_21| + |Ľ_20||Ǔ_01|   |A_22| + |Ľ_20||Ǔ_02| ).

Similarly, the predicate in Step 7 represents the relations that have to be satisfied at the end of each iteration. The predicate is obtained by substituting the submatrices from Step 5b into the error-invariant. Algebraic manipulation yields

    m( A_00 A_01 / A_10 A_11 ) = k + b ∧     (6.3)
    ( ?    ?                                             ?
      ?    Ľ_10Ǔ_01 + Ľ_11Ǔ_11 = A_11 + ∆A_11^{i+1}      Ľ_10Ǔ_02 + Ľ_11Ǔ_12 = A_12 + ∆A_12^{i+1}
      ?    Ľ_20Ǔ_01 + Ľ_21Ǔ_11 = A_21 + ∆A_21^{i+1}      Ǎ_22^{i+1} = A_22 − Ľ_20Ǔ_02 − Ľ_21Ǔ_12 + ∆A_22^{i+1} )
    ∧ |( ? ? ? / ? ∆A_11^{i+1} ∆A_12^{i+1} / ? ∆A_21^{i+1} ∆A_22^{i+1} )|
        ≤ γ_{k/b+b+1} ( ?   ?                                        ?
                        ?   |A_11| + |Ľ_10||Ǔ_01| + |Ľ_11||Ǔ_11|     |A_12| + |Ľ_10||Ǔ_02| + |Ľ_11||Ǔ_12|
                        ?   |A_21| + |Ľ_20||Ǔ_01| + |Ľ_21||Ǔ_11|     |A_22| + |Ľ_20||Ǔ_02| + |Ľ_21||Ǔ_12| ),

where the ? indicates that the expression is identical to the corresponding result in (6.2) above. The goal is to prove that the updates in Step 8 (left), when executed in a state that satisfies predicate (6.2), generate errors that satisfy predicate (6.3). The constraints in (6.3) that are not highlighted are already satisfied in Step 6, and since the computation only affects submatrices L_11, L_21, U_11, U_12, and A_22, they are also satisfied in Step 7, as γ_{k/b+b} ≤ γ_{k/b+b+1}.
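Fig. 6.1 (left) is not reproduced above, but the text specifies its updates: [L_11, U_11] := LU(A_11), U_12 := L_11⁻¹A_12, L_21 := A_21U_11⁻¹, and A_22 := A_22 − L_21U_12. A NumPy sketch of the blocked right-looking algorithm (ours; no pivoting, np.linalg.solve standing in for proper triangular solves, and block size b = 8 ≈ √n, the choice that makes γ_{n/b+b} ≈ γ_{2√n}):

    import numpy as np

    def lu_crout(A):
        # Unblocked Crout LU without pivoting (as in the earlier sketch).
        A = A.copy()
        n = A.shape[0]
        for k in range(n):
            A[k, k] -= A[k, :k] @ A[:k, k]
            A[k, k + 1:] -= A[k, :k] @ A[:k, k + 1:]
            A[k + 1:, k] = (A[k + 1:, k] - A[k + 1:, :k] @ A[:k, k]) / A[k, k]
        return np.tril(A, -1) + np.eye(n), np.triu(A)

    def lu_blocked(A, b):
        # Blocked right-looking LU without pivoting, block size b.
        A = A.copy()
        n = A.shape[0]
        for k in range(0, n, b):
            e = min(k + b, n)
            L11, U11 = lu_crout(A[k:e, k:e])                     # [L11, U11] := LU(A11)
            A[k:e, k:e] = np.tril(L11, -1) + U11                 # pack both factors
            A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])        # U12 := L11^{-1} A12
            A[e:, k:e] = np.linalg.solve(U11.T, A[e:, k:e].T).T  # L21 := A21 U11^{-1}
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]                 # A22 := A22 - L21 U12
        return np.tril(A, -1) + np.eye(n), np.triu(A)

    rng = np.random.default_rng(6)
    n = 64
    A = rng.standard_normal((n, n)) + n * np.eye(n)
    L, U = lu_blocked(A, b=8)
    print(np.allclose(L @ U, A))  # True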
6.4. A comment about practical implementations. In practice, the factorization of A_11 and subsequent updating of A_21 is accomplished by computing an LU factorization with partial pivoting of the panel of columns (A_11 / A_21), after which the row exchanges are applied to the remainder of the matrix. As we argued for the unblocked algorithm, pivoting does not introduce error and therefore does not change the analysis. Once pivoting is taken out of the equation, the factorization of the panel of columns can be thought of as a simultaneous factorization of A_11 and subsequent update of A_21. Thus, the analyses of these separate operations are equivalent to the analysis of the factorization of the panel, and the error result established for the blocked algorithm holds.

7. Conclusion. In this paper, we described a systematic approach to deriving numerical stability results for linear algebra algorithms. The approach leverages notation for representing algorithms that was developed as part of the FLAME project. It extends the FLAME methodology for deriving algorithms so that numerical stability results are established in a goal-oriented and modular fashion. The presented methodology alleviates the difficulties of description and preparation mentioned in the quote by Parlett. It also overcomes the problem of re-derivation of standard results mentioned in the quote by Higham, facilitating the establishment of an inventory of results. In addition, it has yielded a previously unpublished result for the backward stability of blocked LU factorization.

While the paper was written so that later results build on earlier results, the best way to unleash the methodology is to start with the final algorithm to be analyzed and work backwards. For example, we could have started with the blocked LU factorization. As part of the analysis, it would have become clear that algorithms and stability results were needed for unblocked LU factorization, triangular solve with multiple right-hand sides, and matrix-matrix multiplication. In turn, each of these operations would have exposed other suboperations, and so forth. Eventually, the analysis would have reached the fundamental operations to which the SCM and ACM can be directly applied. From that stage, the results would have then slowly built back up to the analysis of the blocked algorithm. However, such a structure would have made the paper difficult to read and of little pedagogical value.

Just as it has been shown that systematic derivation of algorithms can be made mechanical [3], we believe the proposed system could also be made mechanical by a system that understands the rules of linear algebra. This will be the topic of future research, as will be the application of the proposed techniques to more complex matrix operations.

8. Acknowledgments. The idea of systematically deriving stability analyses occurred to us shortly after the idea of deriving algorithms was first proposed [10, 4]. At that time, we considered it high-hanging fruit and ignored the possibility. Then, in response to a paper with Enrique Quintana-Ortí on the derivation of algorithms for the triangular Sylvester equation [12] that yielded a number of new algorithms for that operation, a referee pushed us to add a stability analysis for the algorithms. This referee, who we correctly guessed was G.W. (Pete) Stewart, gave us three choices: (1) show empirical results; (2) present an analysis that shows the algorithms to have equivalent error accumulation; or (3) extend the derivation process so that it also yields the stability analysis. We invited Pete to work on (3) with us, to which he responded that some Holy Grails are best left unpursued. And thus we opted for (1) in the revision of that paper.
It took a few more years before one of the authors decided to pursue the suggestion as part of his dissertation [3]. We are grateful to Pete for his insight and encouragement. In addition, we thank Enrique Quintana-Ortí and Maggie Myers for feedback on an early draft.

This work was supported in part by NSF grants ACI-0305163, CCF-0342369, CCF-0540926, and CCF-0702714. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

[1] E. Anderson, Z. Bai, J. Demmel, J. E. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. E. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide, SIAM, Philadelphia, 1992.
[2] B. Parlett, Matrix eigenvalue problems, The American Mathematical Monthly, 72 (1965).
[3] P. Bientinesi, Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms, PhD thesis, Department of Computer Sciences, The University of Texas, 2006. Technical Report TR-06-46, September 2006.
[4] P. Bientinesi, J. A. Gunnels, M. E. Myers, E. S. Quintana-Ortí, and R. A. van de Geijn, The science of deriving dense linear algebra algorithms, ACM Transactions on Mathematical Software, 31 (2005), pp. 1–26.
[5] J. W. Demmel, N. J. Higham, and R. S. Schreiber, Stability of block LU factorization, Numer. Lin. Algebra Applic., 2 (1995), pp. 173–190.
[6] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff, A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Soft., 16 (1990), pp. 1–17.
[7] J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, Philadelphia, PA, 1991.
[8] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, 2nd ed., 1989.
[9] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, 3rd ed., 1996.
[10] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn, FLAME: Formal Linear Algebra Methods Environment, ACM Trans. Math. Soft., 27 (2001), pp. 422–455.
[11] N. J. Higham, Accuracy and Stability of Numerical Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA, 2nd ed., 2002.
[12] E. S. Quintana-Ortí and R. A. van de Geijn, Formal derivation of algorithms: The triangular Sylvester equation, ACM Trans. Math. Soft., 29 (2003), pp. 218–243.
[13] G. W. Stewart, Matrix Algorithms. Volume I: Basic Decompositions, SIAM, Philadelphia, 1998.
[14] G. W. Stewart and J.-G. Sun, Matrix Perturbation Theory, Computer Science and Scientific Computing, Academic Press, 1990.
[15] R. A. van de Geijn and E. S. Quintana-Ortí, The Science of Programming Matrix Computations, www.lulu.com, 2008.