Math Tools and Problem-Solving Techniques Cheatsheet, Cheat Sheet of Probability and Statistics

A quick refresher on some math tools and problem-solving techniques from a prerequisite course. It covers probability, continuous optimization, creative problem-solving, and writing solid proofs. The notes are intended for an audience who has seen these topics before, but may enjoy a refresher before using these tools to solve abstract problems.

Typology: Cheat Sheet

2021/2022

Uploaded on 05/11/2023

parolie
Observation 1. For all x < 5, Pr[X · I(X ∈ [5, 10]) > x] = e^{-5} − e^{-10}.

Proof. For 0 ≤ x < 5, X · I(X ∈ [5, 10]) > x exactly when X ∈ [5, 10] (because this guarantees both that X > x, and that the multiplier is one and not zero). Therefore, Pr[X · I(X ∈ [5, 10]) > x] = Pr[X ∈ [5, 10]]. Observe also that

Pr[X ∈ [5, 10]] = Pr[X ≥ 5] − Pr[X > 10] = (1 − F(5)) − (1 − F(10)) = F(10) − F(5) = e^{-5} − e^{-10}.

Observation 2. For all x > 10, Pr[X · I(X ∈ [5, 10]) > x] = 0.

Proof. Whenever X > 10, the multiplier I(X ∈ [5, 10]) = 0, so the product is zero; whenever X ≤ 10, the product is at most 10. So the product can never exceed any x > 10.

Observation 3. For x ∈ [5, 10], Pr[X · I(X ∈ [5, 10]) > x] = e^{-x} − e^{-10}.

Proof. To see this, observe that when x ∈ [5, 10], X · I(X ∈ [5, 10]) > x if and only if X ∈ (x, 10]. This is because we first need X > x, but we also need I(X ∈ [5, 10]) = 1. Therefore, we get that Pr[X · I(X ∈ [5, 10]) > x] = Pr[X ∈ (x, 10]]. We again compute this as Pr[X ≥ x] − Pr[X > 10] = e^{-x} − e^{-10}.

Now that we know Pr[X · I(X ∈ [5, 10]) > x] for all x, we can compute its expectation:

E[X · I(X ∈ [5, 10])] = ∫_0^∞ Pr[X · I(X ∈ [5, 10]) > x] dx
= ∫_0^5 (e^{-5} − e^{-10}) dx + ∫_5^{10} (e^{-x} − e^{-10}) dx + ∫_{10}^∞ 0 dx
= 5(e^{-5} − e^{-10}) + [−e^{-x} − x·e^{-10}]_5^{10}
= 5(e^{-5} − e^{-10}) + (e^{-5} − e^{-10}) − 5e^{-10}
= 6e^{-5} − 11e^{-10}.

Finally, to compute E[X | X ∈ [5, 10]] = E[X · I(X ∈ [5, 10])] / Pr[X ∈ [5, 10]], we get:

(6e^{-5} − 11e^{-10}) / (e^{-5} − e^{-10}) = 6 − 5/(e^5 − 1).

1.7 Coupling Arguments

Coupling arguments are typically useful to relate two probabilities. We'll first give a definition of a coupling, and then give several examples of why these arguments can be useful.
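Before turning to couplings, the conditional-expectation computation above can be sanity-checked numerically. This is only a sketch using rejection sampling, not part of the derivation:

```python
# Sanity check (not a proof): estimate E[X | X in [5, 10]] for
# X ~ Exponential(1) and compare to the closed form 6 - 5/(e^5 - 1).
import math
import random

random.seed(0)
samples = []
while len(samples) < 20000:
    x = random.expovariate(1.0)  # X ~ Exponential with rate 1
    if 5 <= x <= 10:
        samples.append(x)  # keep only draws with X in [5, 10]

estimate = sum(samples) / len(samples)
closed_form = 6 - 5 / (math.e**5 - 1)  # about 5.966
assert abs(estimate - closed_form) < 0.05
```

Note that conditioning is implemented by discarding draws outside [5, 10]; since Pr[X ∈ [5, 10]] ≈ 0.0067, most draws are rejected.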
In general, coupling arguments can be useful for saving significant energy compared to raw calculations (but you can typically replace any coupling argument with a brute-force calculation, if you find that approach preferable and are able to execute the calculations cleanly).

Definition 4 (Coupling). Let D1, D2 be distributions. A coupling of D1 and D2 is a process to jointly draw two random variables X, Y such that:

• X is distributed according to D1.
• Y is distributed according to D2.

For this problem, even the "raw calculations" approach is quite tricky (because it's cumbersome to write a closed form for Hk). However, there is an elegant coupling argument that makes the proof clean. Consider the following process for producing two random walks. To be explicit, we will let D1 and D2 be the same distribution over steps R0, ..., Rn, where R0 = 0, and each Ri is equal to Ri−1 + 1 or Ri−1 − 1 with probability 1/2, independently. Now consider the following process to jointly draw two random walks:

• Let R0 = S0 = 0. For all i from 1 to n, let Ri be equal to Ri−1 + 1 with probability 1/2, and equal to Ri−1 − 1 with probability 1/2 (independently).
• If there is no i such that Ri = k, let Si := Ri for all i.
• Otherwise, let i∗ be the largest i such that Ri = k, let Si := Ri for all i ≤ i∗, and let Si := k − (Ri − k) for all i > i∗. That is, "reflect" Ri over the horizontal line at k to get Si when i > i∗. Observe that because n is even and k is odd, we must have Rn ≠ k (the walk cannot end exactly at k), and hence i∗ < n.

Again, we need to show two things. First, we need to show that both R and S are correctly sampled random walks (that is, R is distributed according to D1, and S is distributed according to D2). Again, this will be trivial for R, and not-bad-but-not-trivial for S. Then, we need to use these two walks to reason about Hk and Lk.

First, it's clear that R is a correctly-drawn random walk, because we explicitly define it as such. It's also true that S is a correctly-drawn random walk.
To see this, observe importantly that reflecting from R to S after i∗ does not change i∗. Indeed, after i∗, R is either completely above k or completely below k (by the definition of i∗). Reflecting R to S maintains Si∗ = k, and also maintains that S is completely below k or completely above k after i∗ (the opposite of R). This means that the mapping from R to S is its own inverse (applying it twice returns the original R). In particular, this means that every possible random walk is equally likely to appear as S as it is to appear as R: it appears as R if it is drawn directly, and as S if its reflection is drawn as R. Therefore, S is also a correctly-drawn random walk.

Finally, observe that either neither R nor S ever reaches k, or both R and S reach k. Additionally, when neither R nor S reaches k, neither R nor S finishes above k. On the other hand, when both R and S reach k, exactly one of them finishes above k. Because both R and S are correctly-drawn random walks, Hk is exactly half the probability that R reaches k, plus half the probability that S reaches k. Similarly, Lk is exactly half the probability that R finishes above k, plus half the probability that S finishes above k. By the reasoning at the beginning of this paragraph, Hk is exactly twice Lk.

Concluding Thoughts: Coupling arguments are a nice tool to make elegant arguments that better align with intuition than "raw calculations." You won't have to use them extensively in 445, but they will be relevant when we discuss information cascades, and coupling is a good tool to have in your toolkit to avoid messy calculations.

1.8 Stochastic Dominance

Stochastic dominance is a useful relationship between two single-variable distributions, and also a good demonstration of coupling arguments. We won't use it much throughout the course, but it's generally a good concept to understand.

Definition 5 (Stochastic Dominance). We say that a single-variable distribution D+ stochastically dominates D if for all x, Pr_{v+ ← D+}[v+ ≥ x] ≥ Pr_{v ← D}[v ≥ x].
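The reflection argument above is easy to check by simulation. The following sketch estimates Hk (the probability the walk ever reaches k) and Lk (the probability it finishes above k); n = 10 and k = 3 are my own arbitrary choices of an even n and odd k:

```python
# Simulation sanity check (not a proof): for an n-step +/-1 random walk
# with n even and k odd, Hk (ever reach k) should be exactly twice
# Lk (finish above k).
import random

random.seed(0)
n, k = 10, 3
trials = 200000
hit = 0           # walks that reach k at some step
finish_above = 0  # walks that end strictly above k
for _ in range(trials):
    pos, reached = 0, False
    for _ in range(n):
        pos += random.choice((1, -1))
        if pos == k:
            reached = True
    hit += reached
    finish_above += (pos > k)

Hk = hit / trials
Lk = finish_above / trials
assert abs(Hk - 2 * Lk) < 0.01  # Hk = 2 * Lk up to sampling noise
```

For n = 10 and k = 3, the exact values are Hk = 0.34375 and Lk = 0.171875, which the estimates should approach.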
Equivalently, if F+ is the CDF of D+, and F is the CDF of D, then F+(x) ≤ F(x) for all x.

We'll shortly go through some examples applying this definition and a key fact (Fact 6). We'll first make a quick observation, then state and prove the fact, and then go through the examples.

Observation 6. Let D+ stochastically dominate D. Let v+ be a random variable drawn from D+, and v be a random variable drawn from D. Then E[v+] ≥ E[v].

Proof. We have:

E[v+] = ∫_0^∞ (1 − F+(x)) dx ≥ ∫_0^∞ (1 − F(x)) dx = E[v].

The first and last equalities are just the definition of expected value. The middle inequality follows by stochastic dominance, as 1 − F+(x) ≥ 1 − F(x) for all x.

Fact 6. Let D+ stochastically dominate D. Then there is a way to couple draws v+ from D+ and v from D so that v+ ≥ v with probability one. That is, it is possible to jointly draw samples from D and D+ so that the sample from D+ is always at least as large as the sample from D.

Proof. We'll prove the claim for continuous distributions whose CDF is strictly increasing on their support (that is, the CDF is 0 on [0, x], strictly increasing on (x, y), and equal to 1 on [y, ∞), for some x, y). (Footnote 4) Recall that in order to properly establish a coupling argument, we must first define our random variables. To this end:

• Draw w+ from D+.
• Let w := F^{-1}(F+(w+)).
• Set v+ := w+ and v := w.

We need to prove three claims:

• v+ is distributed according to a draw from D+. This is clearly true, as w+ is drawn from D+, and v+ := w+.
• v is distributed according to a draw from D. To see this, let's explicitly compute the probability that v ≤ x, for any x. We can write:

Pr[v ≤ x] = Pr[F^{-1}(F+(w+)) ≤ x] = Pr[w+ ≤ (F+)^{-1}(F(x))] = F+((F+)^{-1}(F(x))) = F(x).

Above, the first equality follows by definition of v. The second follows by applying F and then (F+)^{-1} to both sides of the inequality (using the fact that both CDFs and their inverses are continuous and increasing).
The third equality follows by the definition of the CDF (F+(x) := Pr[w+ ≤ x]). The final equality just cancels F+ with (F+)^{-1} (which is valid when the CDFs and their inverses are continuous).

• Finally, we have to ensure that v+ ≥ v for all draws. To see this, observe that:

v = F^{-1}(F+(v+))
⇒ F(v) = F+(v+)
⇒ F+(v) ≤ F(v) = F+(v+)
⇒ v ≤ v+.

Here, the first line is just the definition of v. The second line applies F to both sides. The third line applies the definition of stochastic dominance. The final line follows because F+ is an increasing function.

Now, we've defined a coupling, proved that both v+ and v are drawn correctly, and also established that v+ ≥ v with probability one. This completes the proof.

Footnote 4: The proof for discrete distributions, or distributions with weakly increasing CDF, follows the same outline, but requires some extra steps due to the fact that the CDF or its inverse is discontinuous. Due to the excessive calculations, the proof for the general case is omitted here. But you may cite this fact for any distribution, discrete or continuous, without proof.

Here are now a few examples of distributions that stochastically dominate each other, and the corresponding coupling in the proof of Fact 6.

Example One: Consider the uniform distribution on [0, 1], which has CDF F1(x) := x on x ∈ [0, 1], and F1(x) = 1 for x > 1. Consider also the uniform distribution on [0, 2], which has CDF F2(x) := x/2 for x ∈ [0, 2], and F2(x) = 1 for x > 2. We can clearly see that F2(x) ≤ F1(x) for all x, and therefore the uniform distribution on [0, 2] stochastically dominates the uniform distribution on [0, 1]. Moreover, if we draw y uniformly at random from [0, 1] and output the pair (z1, z2) = (y, 2y), then z1 is distributed uniformly on [0, 1], z2 is distributed uniformly on [0, 2], and z1 ≤ z2 with probability 1 (since y ≤ 2y for y ≥ 0).
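The inverse-CDF coupling from the proof of Fact 6 can be sketched concretely for the two uniform distributions in Example One. Here I draw from the dominant distribution and map down, so w = F1^{-1}(F2(w+)) = w+/2; this is a sanity check, not a proof:

```python
# Sketch of the Fact 6 coupling for Uniform[0, 2] (dominant, CDF F2)
# vs. Uniform[0, 1] (CDF F1): w := F1^{-1}(F2(w+)) = w+ / 2.
import random

random.seed(0)
pairs = []
for _ in range(100000):
    w_plus = random.uniform(0, 2)  # draw from the dominant D+
    w = w_plus / 2                 # F1^{-1}(F2(w+)) = w+ / 2
    assert w_plus >= w             # dominance holds for every draw
    pairs.append((w_plus, w))

# Marginal sanity check: sample means should be near 1 and 1/2.
mean_plus = sum(p for p, _ in pairs) / len(pairs)
mean = sum(q for _, q in pairs) / len(pairs)
assert abs(mean_plus - 1.0) < 0.02
assert abs(mean - 0.5) < 0.01
```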
Example Two: Consider the exponential distribution with rate one, which has CDF F1(x) := 1 − e^{-x} on [0, ∞), and the equal revenue curve, which has CDF F2(x) := 1 − 1/x on [1, ∞), and F2(x) := 0 on [0, 1). Then F2(x) ≤ F1(x) for all x, as 0 ≤ 1 − e^{-x} on [0, 1), and 1 − 1/x ≤ 1 − e^{-x} on [1, ∞). (Footnote 5) So the equal revenue curve stochastically dominates the exponential distribution with rate one. Moreover, if we draw z2 from the equal revenue curve and output z1 := ln(z2), then z2 is distributed according to an equal revenue curve, and z1 is distributed according to an exponential with rate one. (Footnote 6) Also, z2 ≥ z1 always, because ln(x) ≤ x whenever x ≥ 1.

Footnote 5: To quickly see this, observe that 1/x ≥ e^{-x}, as x ≤ e^x for all x ∈ [1, ∞).

Footnote 6: To see this calculation, we have just applied the coupling used in the proof of Fact 6. We've set z1 := F1^{-1}(F2(z2)) = −ln(1 − F2(z2)) = −ln(1 − (1 − 1/z2)) = ln(z2).

And finally, here is an example of a simple proof that uses Fact 6. We'll reprove Observation 6 using Fact 6.

Alternate proof of Observation 6. Recall that D+ stochastically dominates D. Therefore, by Fact 6, there is a way to couple draws (v+, v) so that v+ is a draw from D+, v is a draw from D, and v+ ≥ v with probability 1. Because v+ ≥ v with probability 1, it is immediate that E[v+] ≥ E[v] as well.

1.9 "Principle of Deferred Decisions"

The "principle of deferred decisions" isn't a formal theorem or definition, but a concept that will be useful throughout this class to simplify analyses. Consider the following example: say that [...]

2.2 Single-variable, constrained optimization

Say now we want to find the constrained maximum of a differentiable function f(·) over the interval [a, b]. Now, any value that is the constrained maximum must be either a critical point or an endpoint of the interval. Here are a few approaches to find the constrained maximum:

• Find all critical points, compute f(a), f(b), and f(x) for all critical points x, and output the largest.
• Confirm that f′(a) > 0 (that is, f is increasing at a) and f′(b) < 0 (f is decreasing at b). This proves that neither a nor b can be the constrained maximum. Then compute f(x) for all critical points x and output the largest.
• In either of the above, rather than directly comparing f(x) to f(y), one can instead prove that f′(z) ≥ 0 on the entire interval [x, y] to conclude that f(y) ≥ f(x).
• Prove that some x is a global unconstrained maximum of f(·), and observe that x ∈ [a, b].

There are many other approaches. The point is that at the end of the day, you must directly or indirectly compare all critical points and all endpoints. You don't have to directly compute f(·) at all of these values (the bullets above provide some shortcuts), but you must at least indirectly compare them. For this class, it is OK to just describe your approach without writing down the entire calculations (as in the following examples).

Example 3: Say we want to find the constrained maximum of f(x) = x^2 on the interval [3, 8]. f has no critical points on this interval, so the maximum must be at either 3 or 8. f′(x) = 2x > 0 on this entire interval, so the maximum must be at 8.

Example 4: Say we want to find the constrained maximum of f(x) = 3x^2 − x^3 on the interval [−2, 3]. f′(x) = 6x − 3x^2, and therefore f has critical points at 0 and 2. So we need to (at least indirectly) consider −2, 0, 2, 3. We see that f′(x) ≤ 0 on [−2, 0], so we can immediately conclude that f(−2) ≥ f(0). We also see that f′(x) ≤ 0 on [2, 3], so we can immediately conclude that f(2) ≥ f(3). Now, we only need to compare −2 and 2. We can also immediately see that f(−x) > f(x) for all x > 0, and therefore f(−2) > f(2), and x = −2 is the global constrained maximum.

Example 5: Say we want to find the constrained maximum of f(x) = 4x − x^2 on the interval [−8, 5]. We already proved above that x = 2 is the global unconstrained maximum (Example 2). Therefore, f(2) ≥ f(x) for all x ∈ R, and certainly f(2) ≥ f(x) for all x ∈ [−8, 5].
Therefore, x = 2 is also the global constrained maximum on [−8, 5].

Warning! An incorrect approach. It might be tempting to try the following approach: First, find all local maxima of f(·). Call this set X. Then, check which elements of X lie in [a, b]. Call them Y. Then, output the argmax of f(x) over all x ∈ Y. This approach does not work, and in fact we already saw a counterexample. Say we want to find the constrained maximum of f(x) = 3x^2 − x^3 on the interval [−2, 3]. Then f′(x) = 6x − 3x^2, and f has critical points at 0 and 2. We can verify that x = 0 is a local minimum and x = 2 is a local maximum. So x = 2 is the unique local maximum, and it also lies in [−2, 3]. But, as we saw, it is incorrect to conclude that x = 2 is therefore the constrained global maximum.

In general, remember that your goal is to find a y such that f(y) ≥ f(x) for all x ∈ [a, b]. You can use shortcuts (you know that the maximum must be either a critical point or an endpoint). You can also use derivatives to compare f(x) vs. f(y) without explicitly computing f(x) or f(y). You can also use all sorts of other tricks to save on calculations (and you will never need to use any fancy tricks in this course). But at the end of the day, you must provide a logically sound argument that f(y) ≥ f(x) for all x ∈ [a, b]. The above examples are arguments that are complete, and plenty rigorous for this course.

2.3 Multi-variable, unconstrained optimization

Say now we want to find the unconstrained global maximum of a differentiable multi-variate function f(·, ·, ..., ·). Again, any value that is the unconstrained maximum must be a critical point, where a critical point has ∂f(~x)/∂xi = 0 for all i. Again, not all critical points are local maxima, but all local maxima are definitely critical points. Recall also that some f(·) don't achieve their global maximum at all, depending on what happens when approaching ∞.
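As an aside, the single-variable answers from Examples 3 through 5 are easy to sanity-check numerically. A dense grid search is no substitute for the arguments above, but it catches sign errors quickly; `grid_argmax` is my own helper name:

```python
# Numerical sanity check (not a proof) of Examples 3-5 via grid search.

def grid_argmax(f, a, b, steps=100001):
    """Return the grid point in [a, b] maximizing f (dense uniform grid)."""
    best_x, best_v = a, f(a)
    for i in range(1, steps):
        x = a + (b - a) * i / (steps - 1)
        v = f(x)
        if v > best_v:
            best_x, best_v = x, v
    return best_x

# Example 3: f(x) = x^2 on [3, 8] -> maximum at x = 8.
assert abs(grid_argmax(lambda x: x**2, 3, 8) - 8) < 1e-3
# Example 4: f(x) = 3x^2 - x^3 on [-2, 3] -> maximum at x = -2.
assert abs(grid_argmax(lambda x: 3*x**2 - x**3, -2, 3) - (-2)) < 1e-3
# Example 5: f(x) = 4x - x^2 on [-8, 5] -> maximum at x = 2.
assert abs(grid_argmax(lambda x: 4*x - x**2, -8, 5) - 2) < 1e-3
```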
Having a general approach that works in all cases is quite tedious, but in this class we'll only see cases where a simple approach works. (Footnote 8) Again, remember that your goal is just to provide a logically sound proof that f(~y) ≥ f(~x) for all ~x ∈ R^n. Here are some examples that you might reasonably need to solve, and an approach to solve them:

Example 6: Say you want to maximize f(x1, x2) = x1 − x1^2 − x2^2. We will solve this in two steps. First, we will think of x1 as fixed, and try to maximize f(x1, y) as a function of y. To this end, observe that f(x1, y) = x1 − x1^2 − y^2. As a function of y, this is just a constant minus y^2, which is clearly maximized at y = 0. Therefore, we can immediately conclude that f(x1, 0) ≥ f(x1, x2) for all x2. This is the key first step, and we now know that the optimum must be of the form (x1, 0), for some x1.

Now, we just need to optimize f(x1, 0) = x1 − x1^2 over x1. This is a single-variable optimization problem, which we can solve using the tools in Section 2.1. We notice that the derivative is 1 − 2x1, which is positive on (−∞, 1/2) and negative on (1/2, ∞), meaning that x1 = 1/2 is the global maximum. This means that the unconstrained global maximum is (1/2, 0). To recap the complete logic, what we've shown is that for any (x1, x2), we have: f(1/2, 0) ≥ f(x1, 0) ≥ f(x1, x2). The previous paragraph proves the first inequality, and the first paragraph proves the second.

Example 7: Say you want to maximize f(x1, x2) = x1·x2 − x1^2 − x2^2. We can again think of x1 as fixed, and try to optimize f(x1, y) over y. Because we treat x1 as fixed, this is again a single-variable optimization problem. The derivative with respect to y is x1 − 2y. We conclude that the derivative is positive on (−∞, x1/2), and negative on (x1/2, ∞), and therefore y = x1/2 is the unconstrained optimum. This directly proves that for all x1, f(x1, x1/2) ≥ f(x1, x2) for all x2. Now, we want to find the maximizer among points of the form (x1, x1/2).
Observe that this is again a single-variable function, and it is equal to x1·(x1/2) − x1^2 − (x1/2)^2 = (−3/4)·x1^2. This is clearly maximized at x1 = 0, so the global maximizer is (0, 0). To again recap the complete logic, we've shown, for any (x1, x2), that f(0, 0) ≥ f(x1, x1/2) ≥ f(x1, x2). The previous paragraph proves the first inequality, and the first paragraph proves the second.

Example 8: Say you want to maximize f(~x) = Σ_i fi(xi). That is, the function you're trying to maximize is just the sum of single-variable functions (one for each coordinate of ~x). Then we can simply maximize each fi(·) separately, and let x*_i = argmax_{xi} fi(xi). Observe that ~x* must be the maximizer of f(~x). To recap the complete logic, we've shown, for any ~x, that f(~x*) = Σ_i fi(x*_i) ≥ Σ_i fi(xi) = f(~x). The middle inequality follows simply because every term in the sum on the left is at least the corresponding term in the sum on the right.

Footnote 8: Sometimes you'll need to be clever, but ideally very few (if any) proofs will require very tedious calculations.

2.4 Multi-variable, constrained optimization

Finally, say we want to find the constrained global maximum of a differentiable multi-variate function f(·, ..., ·). Then the same rules as before apply: we must (at least indirectly) consider all critical points and all extreme points. Multi-variable constrained optimization in general is tricky, and would require an entire class to learn enough tricks to solve every instance. All of the instances that you'll need to solve in this class can be done using a simple approach (building off approaches from the previous section). Just remember that at the end of the day, you need a logically sound approach to compare your claimed optimum to all critical points and extreme points.

Example 9: For example, say you want to maximize f(~x) = Σ_i xi·e^{−xi}, subject to the constraints −5 ≤ xi ≤ 5 for all i.
We can first try to find the unconstrained maximizer using the approach in Example 8, because this is a sum of single-variable functions. Indeed, the derivative of each single-variable function is of the form e^{−xi} − xi·e^{−xi}, which is positive when xi < 1, and negative when xi > 1. This means that the maximizer of each single-variable function is at xi = 1. Therefore, (1, ..., 1) is the global maximizer. We observe that −5 ≤ 1 ≤ 5, so (1, ..., 1) also satisfies the constraints. So (1, ..., 1) is also the constrained maximizer over [−5, 5]^n.

To recap the complete logic, we first showed that f(1, ..., 1) ≥ f(~x) for all ~x ∈ R^n. Therefore, clearly f(1, ..., 1) ≥ f(~x) for all ~x ∈ [−5, 5]^n. Also, we confirmed that (1, ..., 1) ∈ [−5, 5]^n, so it is the constrained global maximum.

Repeat Warning! Again, recall that it is not a valid approach to first find all critical points of f(·), then see which critical points satisfy the constraints, and consider only those (recall the example at the end of Section 2.2).

3 Basic Problem Solving

The PSets for this class are "short", in that they're only three problems (four if you count the Strategy Designs). But some of the problems will be a full paragraph description, introduce new definitions, etc. Part of the challenge is figuring out on your own how to break these problems down into tractable subparts. Problem solving is more of an art, so I can't recommend a concrete step-by-step procedure. However, I can try to give general guidelines/tips. I will use the following problem as a running example for this section:

Recall that a bipartite graph has two sets of nodes, L and R, with all edges having one endpoint in L and the other in R. Recall also that a perfect matching is a set of edges such that every node is in exactly one edge. Let G be a bipartite graph with n nodes on each side. Prove that if every node has degree ≥ n/2, then G has a perfect matching. Hint: You may use Hall's Marriage Theorem.
Recall that Hall's Marriage Theorem asserts that a bipartite graph has a perfect matching if and only if for every set S ⊆ L, |N(S)| ≥ |S| (where N(S) denotes the set of neighbors of S).

– Fully Solve a Special Case. For example, if the problem asks you to prove a claim for all n, prove it first for n = 2. Add n = 3, or larger n if you can.

– Prove a Clearly Useful and Clearly Stated Lemma. For example, if the problem asks you to prove that X holds if and only if Y holds, prove that X implies Y. (Footnote 9)

• Be up front about what your solution accomplishes. The better the grader understands what your solution accomplishes, the easier it is for them to award partial credit. Additionally, the graders are instructed to give a couple points back for partial-credit solutions that were exceptionally easy to grade.

– On a related note, do not try to sneak something false, vague, or otherwise wrong past the graders. The graders will always catch it. (Footnote 10) Mistakes happen, and the graders aren't out for blood to punish mistakes. But if it's challenging for the graders to figure out what your solution accomplishes, the graders are instructed to deduct a couple points to give you proper feedback on the quality of presentation.

Here are two examples for the problem in Section 3.

I was unable to solve the general case, but here is a proof for n = 2. Let u, v denote the two nodes on the left. Observe that Hall's Theorem says that a perfect matching exists if and only if: (a) u has at least one neighbor, (b) v has at least one neighbor, and (c) the set {u, v} has two neighbors. Observe that, immediately because every node has degree at least n/2, both u and v must have at least one neighbor, which covers (a) and (b). To consider {u, v}, first observe that both right nodes have at least one neighbor (again, immediately from the problem statement).
Because {u, v} are the only possible neighbors of a right node, each right node has a neighbor in {u, v}. In particular, this means that the set {u, v} has both right nodes as neighbors, which covers (c). In the general case, we can again seek to apply Hall's Theorem. For any set S with |S| ≤ n/2, observe that even a single node in S has at least n/2 neighbors, which implies that |N(S)| ≥ n/2 ≥ |S|, as desired. But I can't figure out how to handle sets with |S| > n/2.

This solution makes a lot of partial progress. First, the logical clarity is excellent — it is easy for a grader to figure out exactly what is proved. Additionally, it provides both examples of concrete partial progress: first, it provides a complete proof in the case of n = 2; second, it provides the easy half of a proof for the general case. If I were to grade this solution, I might give it a 14/20.

Below is a picture of a graph on 4 nodes without a perfect matching (imagine that a picture is given). We note that it has no perfect matching because it violates Hall's Theorem. We also note that node u has < 2 neighbors, violating the problem hypothesis. It seems like it would be really challenging to add a neighbor to u without creating a perfect matching. For example, if we add any single edge to u, this yields a perfect matching. Of course, maybe we can add an edge to u and delete another edge, but this also seems unlikely to work. Also recall that Hall's Theorem can be proved using max-flow-min-cut.

Footnote 9: There are also more creative ways to prove a clearly useful and clearly stated lemma. But to contribute to partial credit, it must be both clearly useful and clearly stated. You won't get credit for proving random claims that the grader finds unrelated to the problem.

Footnote 10: Obviously not always, but it is much easier for graders to detect logical flaws and imprecise claims than it is to detect mistakes in other courses.
As such, we could also consider a proof approach by writing out a network with a source on the left with capacity-1 edges to all left nodes, infinite-capacity edges from left nodes to right nodes, and capacity-1 edges from the right nodes to a sink. We could then try to prove that whenever each node has degree at least n/2, there is a flow of value n, and therefore a perfect matching.

This solution conveys that the author clearly has a lot of ideas. I think it is a good idea to write down a concrete example and play around with it. I also think it's a good idea to see whether max-flow-min-cut helps at all with a proof. But unfortunately there's not much else here. The logical clarity of this solution is pretty good (there are no flaws, and I find it easy to read). And there are no false or frustrating claims. But I just can't find any concrete partial progress to give credit for. While I'm convinced that the author tried, and did come up with some ideas, none of the ideas make concrete progress towards a solution. Put another way, the solution does convince me that the author is further along towards a solution than when they first started, but the solution wouldn't help another problem-solver. If I were to grade this solution, I might give it a 6/20.

I hope that the examples above help display the difference between productive and unproductive partial progress. (Footnote 11) Here are two last suggestions: First, whenever you're stuck on a problem, I strongly recommend doing something small, but well (e.g. a proof for a special case, or a clear proof of a concrete stepping stone). It may be helpful to think of an analogy to programming: it's always best if you can provide well-documented code that correctly solves the task you were assigned. Short of that, it is significantly better to submit well-documented code that correctly solves a less ambitious task, than to submit buggy or unparseable code that may or may not solve the entire task.
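In the spirit of that programming analogy, here is a small brute-force check (a sketch with my own helper names, not something you would submit) that the running example's claim holds for n = 2, by testing Hall's condition on every small graph:

```python
# Brute-force check: every bipartite graph on 2 + 2 nodes with minimum
# degree >= n/2 = 1 satisfies Hall's condition |N(S)| >= |S| for all S,
# which is equivalent to having a perfect matching by Hall's Theorem.
from itertools import combinations, product

def halls_condition(n, adj):
    """adj[u] = set of right-neighbors of left node u; nodes are 0..n-1.
    Returns True iff |N(S)| >= |S| for every nonempty S subseteq L."""
    for size in range(1, n + 1):
        for S in combinations(range(n), size):
            neighborhood = set().union(*(adj[u] for u in S))
            if len(neighborhood) < size:
                return False
    return True

# Enumerate all 2^4 bipartite graphs on L = {0, 1}, R = {0, 1}.
for bits in product([0, 1], repeat=4):
    adj = [set(), set()]
    for idx, bit in enumerate(bits):
        if bit:
            adj[idx // 2].add(idx % 2)  # edge (left idx//2, right idx%2)
    left_ok = all(len(adj[u]) >= 1 for u in range(2))
    right_ok = all(any(r in adj[u] for u in range(2)) for r in range(2))
    if left_ok and right_ok:  # minimum degree >= 1 on both sides
        assert halls_condition(2, adj)
```

This kind of exhaustive check is exactly the sort of "small but well-executed" step that can build confidence in a claim before attempting the general proof.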
Second, once you've solved as much as you can and are now writing it up, logical clarity plays a huge role in receiving credit for partial progress. If you submit a well-presented partial solution, it is easy for the grader to find your concrete progress; if your solution is poorly presented, it is much harder. Note that I'm trying really hard to emphasize logical clarity, and not fancy formatting or fancy pictures. The grader needs to understand exactly what you're claiming (due to strong logical clarity), rather than feel that you worked really hard (due to good formatting and pictures).

3.1.1 A Generic Rubric

To help you understand in more detail how a typical problem is graded, here is a generic rubric template (for a 20-point problem). A majority (maybe a super-majority) of the problems you solve in this course will start from this rubric template, filling in details with common examples of various levels of "concrete partial progress."

20 points: The solution is correct, clear, concise, and easy to evaluate.

16 points: The solution has a clear, concrete outline of all the key steps, but some steps are not done rigorously.

12 points: The solution is clearly on the right track, and proves a concrete lemma which constitutes significant partial progress, or handles a non-trivial special case.

8 points: The solution proves a concrete lemma related to the problem, or solves a special case. This includes any concrete lemma, no matter how small, or any special case, no matter how special.

4-6 points: The solution demonstrates good-to-strong intuition, but does not provide any concrete partial progress.

0 points: The solution is absent or flawed.

Footnote 11: All examples in this document were made up by me for the purpose of explanation — these are not actual problems I've asked in 445, nor actual student solutions.
In addition to this core rubric, the graders are also instructed to add/subtract up to two points based on the quality of presentation:

+2 points: The solution was exceptionally easy to evaluate. The logical flow was clear and precise. It was easy to see that all logical claims/implications were true.

+1 point: (In between 0 and +2.)

Neutral: The solution was reasonable to evaluate. The logical flow was reasonably clear, and I did not struggle to evaluate any logical claims.

-1 point: (In between 0 and -2.)

-2 points: The solution was tough to evaluate. The solution may have made imprecise claims, and it was not easy to figure out what was intended. The solution may have made false claims, and it was not easy to see why they were false. It may have been a struggle to figure out the intended logical flow, or to figure out exactly what concrete claims the solution accomplished.

Using this rubric, I would have given the first sample solution a 12/20 + 2: the solution made significant partial progress, but does not have a clear outline of all the key steps. However, it is very easy for me to read the solution and understand exactly what it accomplishes, so I'd give an extra +2 in presentation points. I would have given the second solution a 6/20 + 0: I unfortunately can't find any concrete (even small!) claims to give partial credit for, although there is certainly good intuition demonstrated. The writing is fine, but it takes some time after reading it to realize that there is no concrete content. See the following section for writing tips to help make it clear where your solution lies on the rubric, and also to help avoid presentation deductions.

4 Basic Proof Writing

This is a bit of an oversimplification, but I think there are two "kinds" of proof-writing that this class will develop. First, you must be able to write rigorous, complete proofs of short claims.
Second, you must be able to write a clear, rigorous outline for a complex proof, by breaking it down into concrete, rigorous claims. Section 4.1 deals with the first kind, and Section 4.2 deals with the second. I strongly recommend reading both sections.

4.1.1 A Final Example

Consider the following problem, which I first heard about here: https://gilkalai.wordpress.com/2017/09/08/elchanan-mossels-amazing-dice-paradox-answers-to-tyi-30/. You roll a fair six-sided die until it lands six. What is the expected number of rolls you make (including the one which lands six), conditioned on all rolls being even?

Let's view two conflicting "proofs." In both, we'll use the following fact:

Fact 7. Let D be a distribution such that when random variable X is drawn from D, the probability that X = x is p. Then if we repeatedly sample draws from D independently until we see one which is equal to x, the expected number of draws we make is 1/p.

Proof. This is also a good example where the math is simpler if we use "form 2" of the definition of expectation. The probability that we make strictly more than i draws is the probability that all of the first i draws are not equal to x. Because the draws are independent, and each is equal to x with probability p, this is just (1 − p)^i. So we get that the expected number of draws we make is Σ_{i=0}^∞ (1 − p)^i = 1/p.

Proof 1: We know that if we were to roll the die until we hit a six, it would take six rolls in expectation, by Fact 7 (because we have a 1/6 chance of rolling a six each time). If instead we condition on all rolls being even, now there are only three possibilities instead of six, so the probability of rolling a six each time is 1/3 instead of 1/6. So by the same Fact 7, the expected number of rolls until we hit a six, conditioned on all rolls being even, is three.

Proof 2: Consider instead repeatedly rolling a die in the following manner, using the principle of deferred decisions.
First, decide if the die will land on two/four, or not on two/four (then decide exactly the roll, uniformly at random among the remaining possibilities). Stop as soon as the die lands not on two/four. Then the probability of terminating any given round is 2/3, and so by Fact 7, the expected number of rolls is 3/2. Moreover, observe that we can decide whether or not to stop rolling independently of whether the last roll is a six or odd. Therefore, the expected number of rolls until we hit a six, conditioned on all rolls being even, is 3/2.

Both proofs seem tempting: the logic in Proof 1 is pretty straightforward to follow. Proof 2 may be extra tempting because it uses a fancy term that was introduced earlier. Proof 2, it turns out, is correct (but you should not typically associate correctness with fancy terms), and Proof 1 is not.

The first sentence of Proof 1 is correct. The second sentence of Proof 1 is vague or incorrect. In particular, when the proof says “now there are only three possibilities instead of six,” it seems to suggest that conditioning on all rolls being even is the same as independently rolling each die, and enforcing that each draw is even. These are not the same (likely the original motivation for this problem was to point out this misconception).

Indeed, let’s consider instead a million-sided die. The point is that we are extremely unlikely to have a long run where all rolls are even, so conditioning on all rolls being even makes the length of the runs quite short. In particular, we shouldn’t expect to see a long run of all even throws followed by a six at all, and most of the time when this happens, it’s because we got a six very quickly. If we repeat the argument in Proof 1, it would imply that the expected number of throws, conditioned on all throws being even, is 500000. Proof 2 instead suggests that the expected number of throws until we hit a roll which is either odd or six is 1000000/500001 ≈ 2. Hopefully that gives some intuition.
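To double-check the conclusion empirically, here is a minimal Monte Carlo sketch in Python (the function name and trial count are my own choices, not from the notes). It estimates the conditional expectation by rejection sampling: simulate runs of a fair die up to the first six, and keep only those runs in which every roll landed even.

```python
import random

def conditional_expected_rolls(trials=200_000, seed=0):
    """Estimate E[number of rolls until a six | every roll was even]
    by rejection sampling over independent runs of a fair die."""
    rng = random.Random(seed)
    total, kept = 0, 0
    for _ in range(trials):
        rolls = 0
        while True:
            face = rng.randint(1, 6)
            rolls += 1
            if face % 2 == 1:   # an odd roll disqualifies this run
                break
            if face == 6:       # run ended in a six, with all rolls even
                total += rolls
                kept += 1
                break
    return total / kept

print(conditional_expected_rolls())  # close to 3/2, not 3
```

With a couple hundred thousand trials, the estimate lands near 3/2 (as Proof 2 predicts), far from the value of 3 claimed by Proof 1, and the fraction of runs that survive the conditioning is close to 1/4, matching the exact calculation.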
We can also do the full calculation to confirm. The probability that we roll exactly i times until hitting a six, and that the first i − 1 rolls were all even (i.e. two or four), is (1/3)^{i−1} · (1/6). So the probability that we roll all evens until hitting a six is:

∑_{i≥1} (1/3)^{i−1} · (1/6) = 1/4.

Also, the expected number of rolls, counting only sequences in which we rolled evens until hitting a six, is:

∑_{i≥1} i · (1/3)^{i−1} · (1/6) = 9/24.

The conditional expectation then just divides these two to get (9/24)/(1/4) = 9/6 = 3/2.

4.2 Effectively breaking down long proofs

Let’s revisit the problem from Section 3, and see how to write a clear proof of a complex claim. Recall first the problem:

Recall that a bipartite graph has two sets of nodes, L and R, with all edges having one endpoint in L and the other in R. Recall also that a perfect matching is a set of edges such that every node is in exactly one edge. Let G be a bipartite graph with n nodes on each side. Prove that if every node has degree ≥ n/2, then G has a perfect matching. Hint: You may use Hall’s Marriage Theorem. Recall that Hall’s Marriage Theorem asserts that a bipartite graph has a perfect matching if and only if for every set S ⊆ L of nodes on the left, we have |N(S)| ≥ |S|, where N(S) denotes the set of nodes with an edge to some node in S.

Here is a solution I would write. Afterwards, I’ll explain what I think of as the key points.

Solution. We will prove that G has a perfect matching using Hall’s Marriage Theorem, by showing that for all sets S ⊆ L, |N(S)| ≥ |S|. Let us first consider sets S where |S| ≤ n/2.

Lemma 7 Consider any S ⊆ L, with |S| ≤ n/2. Then, |N(S)| ≥ |S|.

Proof. Let v be any node in S. By the definition of G, the degree of v is at least n/2, so v has at least n/2 neighbors, and therefore S has at least n/2 neighbors. Because |S| ≤ n/2, we have that |N(S)| ≥ n/2 ≥ |S|, as desired.

Next, we consider the case where |S| > n/2.
Lemma 8 Consider any S ⊆ L, with |S| > n/2. Then, |N(S)| ≥ |S|.

Proof. In fact, we will prove an even stronger claim: that |N(S)| = n. To do this, assume for contradiction that |N(S)| < n. This means that there must exist some node u ∈ R such that u ∉ N(S). In particular, this means that u has no neighbors in S. However, there are > n/2 nodes in S, and therefore < n/2 nodes in L \ S. This means that the degree of u is strictly less than n/2 (because all of u’s neighbors must lie in L \ S). This contradicts the definition of G, as every node has degree at least n/2.

Now, we can wrap up the proof. Lemmas 7 and 8 together prove that |N(S)| ≥ |S| for all S ⊆ L. Hall’s Marriage Theorem now implies that G has a perfect matching, as desired.

Thoughts on this writeup. Here is a great test to see if you’ve successfully broken down your complex proof into clear subparts: Ignore the proofs of Lemma 7 and Lemma 8. Now, the entire proof is just a few sentences. Assuming that Lemma 7, Lemma 8, and Hall’s Marriage Theorem are all correct, is it easy to follow the logical flow?

This is always a bit subjective, but I’d argue that the logic is quite clear, and is entirely captured in the final two sentences. This is the key difference between writing a ‘long’ proof and a ‘short’ proof. For a long proof, it’s impossible for a reader to follow multiple logical trails at once, so your job is to break it down into manageable short proofs, and also provide a single short proof to bring it all together.

Try to think of it like a tree: the root is the outline which connects Lemma 7, Lemma 8, and Hall’s Marriage Theorem to prove the claim. It has three children: Lemma 7, Lemma 8, and Hall’s Marriage Theorem.

• Lemma 7 has a standalone proof, because the logic is short and coherent.
• Lemma 8 has a standalone proof, because the logic is short and coherent.
• Hall’s Marriage Theorem is given to you, and does not require a proof.
Each node in the tree should provide a clear, logically coherent proof (i.e. the proof uses just a few ideas, and can fit in the grader’s head all at once) of the desired claim, assuming that the claims made in its children are correct. Here are some other bulleted thoughts:

• Is it crucial that the proof separates out the key claims using the Lemma environment in LaTeX? No. But, if you’re new to proof-writing, this is a good structure to enforce on yourself. I personally try to use this structure whenever I write my own proofs.

• Is it crucial that Lemmas 7 and 8 are broken down into two lemmas, instead of just one? No. But, the two cases clearly use different logic. So absolutely, they should at least be broken up into separate paragraphs.

• It is crucial that each subpart has a clear, concrete, and formal statement. For example, it’s crucial that it’s easy to see that Lemmas 7 and 8 together cover all S, and connect to Hall’s Marriage Theorem. Informal statements like “All sets have sufficient neighbors” (what is “sufficient”?) or “small sets satisfy |N(S)| ≥ |S|” (what is “small”?) fail this, because the grader can’t figure out exactly what’s proved in these steps.

• Staff solutions and lecture notes will give you further examples of how to break down large proofs into smaller chunks. When you read them, try to go through the exercise of reading just the definitions/lemmas/conclusion, and confirming that the logic follows. Then, when reading each individual lemma, you just need to confirm that this individual lemma is correct.

4.3 And Two Quick Tips

Tip One: I still suggest this even to PhD students, and follow this advice myself: after you write something up, read it yourself (perhaps even out-loud-in-your-head). If you can’t follow your own logic, don’t be surprised when the grader can’t either! You will likely be surprised at how