Machine Learning, 46, 225–254, 2002
© 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

Linear Programming Boosting via Column Generation

AYHAN DEMIRIZ (demira@rpi.edu)
Department of Decision Sciences and Eng. Systems, Rensselaer Polytechnic Institute, Troy, NY 12180, USA

KRISTIN P. BENNETT (bennek@rpi.edu)
Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA, while visiting Microsoft Research, Redmond, WA, USA

JOHN SHAWE-TAYLOR (jst@cs.rhul.ac.uk)
Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK

Editor: Nello Cristianini

Abstract. We examine linear program (LP) approaches to boosting and demonstrate their efficient solution using LPBoost, a column generation based simplex method. We formulate the problem as if all possible weak hypotheses had already been generated. The labels produced by the weak hypotheses become the new feature space of the problem. The boosting task becomes to construct a learning function in the label space that minimizes misclassification error and maximizes the soft margin. We prove that for classification, minimizing the 1-norm soft margin error function directly optimizes a generalization error bound. The equivalent linear program can be efficiently solved using column generation techniques developed for large-scale optimization problems. The resulting LPBoost algorithm can be used to solve any LP boosting formulation by iteratively optimizing the dual misclassification costs in a restricted LP and dynamically generating weak hypotheses to make new LP columns. We provide algorithms for soft margin classification, confidence-rated, and regression boosting problems. Unlike gradient boosting algorithms, which may converge in the limit only, LPBoost converges in a finite number of iterations to a global solution satisfying mathematically well-defined optimality conditions. The optimal solutions of LPBoost are very sparse, in contrast with gradient based methods. Computationally, LPBoost is competitive in quality and computational cost with AdaBoost.

Keywords: ensemble learning, boosting, linear programming, sparseness, soft margin

1. Introduction

Recent papers (Schapire et al., 1998) have shown that boosting, arcing, and related ensemble methods (hereafter summarized as boosting) can be viewed as margin maximization in function space. By changing the cost function, different boosting methods such as AdaBoost can be viewed as gradient descent to minimize this cost function. Some authors have noted the possibility of choosing cost functions that can be formulated as linear programs (LPs), but then dismissed the approach as intractable using standard LP algorithms (Rätsch et al., 2000a; Breiman, 1999).

The main contribution in the application of LP methods to boosting has been made by Grove and Schuurmans (1998), who derived an LP method, DualLPBoost, based on maximizing the margin of the combined classifier. They experienced difficulties, however, in getting the method to work in practice. We adopt a similar approach but optimize a new rigorous generalization bound obtained in terms of a soft margin measure. Using a soft margin ensures that the approach is able to handle noisy data more robustly, but also overcomes the convergence problems experienced by Grove and Schuurmans.
We discuss in more detail the reasons for this improvement in Section 7. Furthermore, in this paper we show that LP boosting is generally computationally feasible using a classic column generation simplex algorithm (Nash & Sofer, 1996). This method performs tractable boosting using any cost function expressible as an LP. We specifically examine variations of the 1-norm soft margin cost function used for support vector machines (Rätsch et al., 2000b; Bennett, 1999; Mangasarian, 2000). One advantage of these approaches is that the methods of analysis for support vector machine problems immediately become applicable to the boosting problem.

In Section 2, we prove that the LPBoost approach to classification directly minimizes a bound on the generalization error. We adopt the LP formulations developed for support vector machines. In Section 3, we discuss the soft margin LP formulation. By adopting linear programming, we immediately have the tools of mathematical programming at our disposal. In Section 4 we examine how column generation approaches for solving large-scale LPs can be adapted to boosting. For classification, we examine both standard and confidence-rated boosting. Standard boosting algorithms use weak hypotheses that are classifiers, that is, whose outputs are in the set {−1, +1}. Schapire and Singer (1998) have considered boosting weak hypotheses whose outputs reflect not only a classification but also an associated confidence encoded by a value in the range [−1, +1]. They demonstrate that so-called confidence-rated boosting can speed convergence of the composite classifier, though the accuracy in the long term was not found to be significantly affected. In Section 5, we discuss the minor modifications needed for LPBoost to perform confidence-rated boosting. The methods we develop can be readily extended to any ensemble problem formulated as an LP. We demonstrate this by adapting the approach to regression in Section 6. In Section 7, we examine the hard margin LP formulation of Grove and Schuurmans (1998), which is also a special case of the column generation approach. By use of duality theory and optimality conditions, we can gain insight into how LP boosting works mathematically, specifically demonstrating the critical differences between the prior hard margin approach and the proposed soft margin approach. Computational results and practical issues for implementation of the method are given in Section 8.

2. Motivation for soft margin boosting

We begin with an analysis of the boosting problem using the methodology developed for support vector machines. The function classes that we will be considering are of the form $\mathrm{co}(H) = \{\sum_{h \in H} a_h h : a_h \ge 0\}$, where $H$ is a set of weak hypotheses which we assume is closed under complementation. Initially, these will be classification functions with outputs in the set $\{-1, 1\}$, though this can be taken as $[-1, 1]$ in confidence-rated boosting. We begin,

… with probability $1 - \delta$ over $\ell$ random examples $S$, any hypothesis $f \in \mathcal{F}$ for which $(f, g_f) \in G$ has generalization error no more than

$$\mathrm{err}_{\mathcal{D}}(f) \;\le\; \epsilon(\ell, \mathcal{F}, \delta, \gamma) \;=\; \frac{2}{\ell}\left(\log \mathcal{N}(G, 2\ell, \gamma/2) + \log\frac{2}{\delta}\right),$$

provided $\ell > 2/\epsilon$, and there is no discrete probability on misclassified training points. We are now in a position to apply these results to our function class, which will be in the form described above, $\mathcal{F} = \mathrm{co}(H) = \{\sum_{h\in H} a_h h : a_h \ge 0\}$, where we have left open for the time being what the class $H$ of weak hypotheses might contain.
The sets $G$ of Theorem 2.2 will be chosen as follows:

$$G_B = \left\{\left(\sum_{h \in H} a_h h,\; g\right) : \sum_{h \in H} a_h + \|g\|_1 \le B,\; a_h \ge 0\right\}.$$

Hence, the condition that a function $f = \sum_{h\in H} a_h h$ satisfies the conditions of Theorem 2.2 for $G = G_B$ is simply

$$\sum_{h \in H} a_h + \frac{1}{\ell}\sum_{i=1}^{\ell} \xi((x_i, y_i), f, \gamma) \;=\; \sum_{h\in H} a_h + \frac{1}{\ell}\sum_{i=1}^{\ell} \xi_i \;\le\; B. \qquad (1)$$

Note that this will be the quantity that we minimize through the boosting iterations described in later sections, where we will use the parameter $C$ in place of $1/\ell$ and the margin $\gamma$ will be set to 1. The final piece of the puzzle that we require to apply Theorem 2.2 is a bound on the covering numbers of $G_B$ in terms of the class of weak hypotheses $H$, the bound $B$, and the margin $\gamma$. Before launching into this analysis, observe that for any input $x$, $\max_{h\in H}\{|h(x)|\} = 1$, while $\max_{x_i} \delta_{x_i}(x) \le 1$.

2.1. Covering numbers of convex hulls

In this subsection we analyze the covering numbers $\mathcal{N}(G_B, \ell, \gamma)$ of the set

$$G_B = \left\{\left(\sum_{h\in H} a_h h,\; g\right) : \sum_{h\in H} a_h + \|g\|_1 \le B,\; a_h \ge 0\right\}$$

in terms of $B$, the class $H$, and the scale $\gamma$. Assume first that we have an $\eta/B$-cover $G$ of the function class $H$ with respect to the set $S = (x_1, x_2, \ldots, x_\ell)$ for some $\eta < \gamma$. If $H$ is a class of binary-valued functions, then we will take $\eta$ to be zero and $G$ will be the set of dichotomies that can be realized by the class. Now consider the set $V$ of vectors of positive real numbers indexed by $G \cup \{1, \ldots, \ell\}$. Let $V_B$ be the function class

$$V_B = \{g \mapsto \langle g \cdot v\rangle : v \in V,\; \|v\|_1 \le B,\; \|g\|_\infty \le 1\},$$

and suppose that $U$ is a $(\gamma - \eta)$-cover of $V_B$. We claim that the set

$$A = \left\{\left(\sum_{h\in G} v_h h,\; \sum_{i=1}^{\ell} v_i \delta_{x_i}\right) : v \in U\right\}$$

is a $\gamma$-cover of $G_B$ with respect to the set $\tau(S)$. We prove this assertion by taking a general function $f = (\sum_{h\in H} a_h h, g) \in G_B$ and finding a function in $A$ within $\gamma$ of it on all of the points $\tau(x_i)$. First, for each $h$ with non-zero coefficient $a_h$, select $\hat h \in G$ such that $|h(x_i) - \hat h(x_i)| \le \eta/B$, and for $h' \in G$ set $v_{h'} = \sum_{h : \hat h = h'} a_h$ and $v_i = g(x_i)/\ell$, $i = 1, \ldots, \ell$. Now we form the function $\bar f = (\sum_{h\in G} v_h h, \sum_{i=1}^{\ell} v_i \delta_{x_i})$, which lies in the set $V_B$, since $\sum_{h\in G} a_h + \sum_{i=1}^{\ell} v_i \le B$. Furthermore we have that

$$\bigl|f(\tau(x_j)) - \bar f(\tau(x_j))\bigr| = \left|\sum_{h\in H} a_h h(x_j) + g(x_j) - \sum_{h\in G} v_h h(x_j) - v_j\right| \le \left|\sum_{h\in H} a_h \bigl(h(x_j) - \hat h(x_j)\bigr)\right| \le \frac{\eta}{B}\sum_{h\in H} a_h \le \eta.$$

Since $U$ is a $(\gamma - \eta)$-cover of $V_B$, there exists $\hat v \in U$ such that $\hat f = (\sum_{h\in G} \hat v_h h, \sum_{i=1}^{\ell} \hat v_i \delta_{x_i})$ is within $\gamma - \eta$ of $\bar f$ on $\tau(x_j)$, $j = 1, \ldots, \ell$. It follows that $\hat f$ is within $\gamma$ of $f$ on this same set. Hence, $A$ forms a $\gamma$-cover of the class $G_B$. We bound $|A| = |U|$ using the following theorem due to Zhang (1999), though a slightly weaker version can also be found in Anthony and Bartlett (1999).

Theorem 2.3 (Zhang, 1999). For the class $V_B$ defined above we have that

$$\log \mathcal{N}(V_B, \ell, \gamma) \;\le\; 1 + \frac{144 B^2}{\gamma^2}\bigl(2 + \ln(|G| + \ell)\bigr)\, \log\left(2\left\lceil \frac{4B}{\gamma} + 2\right\rceil \ell + 1\right).$$

Hence we see that optimizing $B$ directly optimizes the relevant covering number bound, and hence the generalization bound given in Theorem 2.2 with $G = G_B$. Note that in the cases considered $|G|$ is just the growth function $B_H(\ell)$ of the class $H$ of weak hypotheses.

3. Boosting LP for classification

From the above discussion we can see that a soft margin cost function should be valuable for boosting classification functions. Once again using the techniques used in support vector machines, we can formulate this problem as a linear program. The quantity $B$ defined in Eq. (1) can be optimized directly using an LP.
The LP is formulated as if all possible labelings of the training data by the weak hypotheses were known. The LP minimizes the 1-norm soft margin cost function used in support vector machines, with the added restrictions that all the weights are positive and the threshold is assumed to be zero. This LP and its variants can be practically solved using a column generation approach. Weak hypotheses are generated as needed to produce the optimal support vector machine based on the output of all the weak hypotheses. In essence the base learning algorithm becomes an "oracle" that generates the necessary columns. The dual variables of the linear program provide the misclassification costs needed by the learning machine. The column generation procedure searches for the best possible misclassification costs in dual space. Only at optimality is the actual ensemble of weak hypotheses constructed.

3.1. LP formulation

Let the matrix $H$ be an $\ell \times m$ matrix of all the possible labelings of the training data using functions from $H$. Specifically, $H_{ij} = h_j(x_i)$ is the label ($1$ or $-1$) given by weak hypothesis $h_j \in H$ on the training point $x_i$. Each column $H_{\cdot j}$ of the matrix $H$ constitutes the output of weak hypothesis $h_j$ on the training data, while each row $H_i$ gives the outputs of all the weak hypotheses on the example $x_i$. There may be up to $2^\ell$ distinct weak hypotheses. The following linear program can be used to minimize the quantity in Eq. (1):

$$\begin{array}{ll}
\displaystyle\min_{a,\xi} & \displaystyle\sum_{i=1}^{m} a_i + C \sum_{i=1}^{\ell} \xi_i \\
\text{s.t.} & y_i H_i a + \xi_i \ge 1,\quad \xi_i \ge 0,\quad i = 1, \ldots, \ell \\
& a_i \ge 0,\quad i = 1, \ldots, m
\end{array} \qquad (2)$$

where $C > 0$ is the tradeoff parameter between misclassification error and margin maximization. The dual of LP (2) is

$$\begin{array}{ll}
\displaystyle\max_{u} & \displaystyle\sum_{i=1}^{\ell} u_i \\
\text{s.t.} & \displaystyle\sum_{i=1}^{\ell} u_i y_i H_{ij} \le 1,\quad j = 1, \ldots, m \\
& 0 \le u_i \le C,\quad i = 1, \ldots, \ell.
\end{array} \qquad (3)$$
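For concreteness, the following sketch solves LP (2) with an off-the-shelf solver in the regime where the label matrix $H$ is small enough to enumerate explicitly. It is an illustration under assumptions of our own (NumPy and SciPy with the HiGHS backend; the helper name and the toy data are ours), not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def solve_soft_margin_primal(H, y, C):
    """Solve LP (2): min sum(a) + C*sum(xi)  s.t.  y_i * (H_i a) + xi_i >= 1, a >= 0, xi >= 0."""
    ell, m = H.shape
    c = np.concatenate([np.ones(m), C * np.ones(ell)])        # objective over x = [a, xi]
    A_ub = np.hstack([-(y[:, None] * H), -np.eye(ell)])       # -(y_i H_i) a - xi_i <= -1
    b_ub = -np.ones(ell)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (m + ell), method="highs")
    return res.x[:m], res.x[m:], res.fun                      # weights a, slacks xi, objective

# Toy example: 4 points, 3 candidate weak hypotheses (columns are their +/-1 outputs).
H = np.array([[ 1,  1, -1],
              [ 1, -1, -1],
              [-1,  1,  1],
              [-1, -1,  1]])
y = np.array([1, 1, -1, -1])
a, xi, obj = solve_soft_margin_primal(H, y, C=1.0)
```

Enumerating all of $H$ is of course exactly what column generation avoids; the sketch is only meant to make the formulation concrete.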
formulation solvable by LPBoost.) A notable difference is that LP (5) has an additional upper bound on the misclassification costs $u$, $0 \le u_i \le D$, $i = 1, \ldots, \ell$, that is produced by the introduction of the soft margin in the primal. From the LP optimality conditions and the fact that linear programs have extreme point solutions, we know that there exist very sparse solutions of both the primal and dual problems, and that the degree of sparsity will be greatly influenced by the choice of the parameter $D = \frac{1}{\nu\ell}$.

The size of the dual feasible region depends on our choice of $\nu$. If $\nu$ is too large, forcing $D$ small, then the dual problem is infeasible. For large but still feasible $\nu$ ($D$ very small but still feasible), the problem degrades to something very close to the equal-cost case, $u_i = 1/\ell$: all the $u_i$ are forced to be nonzero. Practically, this means that as $\nu$ increases ($D$ becomes smaller), the optimal solution may be one or two weak hypotheses that are best assuming approximately equal costs. As $\nu$ decreases ($D$ grows), the misclassification costs $u_i$ will increase for hard-to-classify points or points on the margin in the label space, and will go to 0 for points that are easy to classify; thus the misclassification costs $u$ become sparser. If $\nu$ is too small (and $D$ too large), then the meaningless null solution, $a = 0$, with all points classified as one class, becomes optimal. For a good choice of $\nu$, a sparse solution for the primal ensemble weights $a$ will be optimal, which implies that few weak hypotheses will be used. A sparse dual $u$ will also be optimal, which means that the solution will depend only on a smaller subset of the data (the support vectors).

Data with $u_i = 0$ are well classified with sufficient margin, so the performance on these data is not critical. From LP sensitivity analysis, we know that the $u_i$ are exactly the sensitivity of the optimal solution to small perturbations in the margin. In some sense the sparseness of $u$ is good, because the weak hypotheses can be constructed using only smaller subsets of the data. But as we will see in Sections 7 and 8, this sparseness of the misclassification costs can lead to problems when implementing algorithms.

4. LPBoost algorithms

We now examine practical algorithms for solving the LP (4). Since the matrix $H$ has a very large number of columns, prior authors have dismissed the idea of solving LP formulations for boosting as being intractable using standard LP techniques. But column generation techniques for solving such LPs have existed since the 1950s and can be found in LP textbooks; see for example Nash and Sofer (1996, Section 7.4). Column generation is frequently used in large-scale integer and linear programming algorithms, so commercial codes such as CPLEX have been optimized to perform column generation very efficiently (CPLEX, 1994).

The simplex method does not require that the matrix $H$ be explicitly available. At each iteration, only a subset of the columns is used to determine the current solution (called a basic feasible solution). The simplex method needs some means of determining whether the current solution is optimal and, if it is not, of generating a column that violates the optimality conditions. The tasks of verifying optimality and generating a column can be performed by the learning algorithm. A simplex-based boosting method will therefore alternate between solving an LP for a reduced matrix $\hat H$ corresponding to the weak hypotheses generated so far, and using the base learning algorithm to generate the best-scoring weak hypothesis based on the dual misclassification costs provided by the LP. This continues until the algorithm terminates at an exact or approximate optimal solution based on well-defined stopping criteria, or until some other stopping criterion, such as a maximum number of iterations, is reached.

The idea of column generation (CG) is to restrict the primal problem (2) by considering only a subset of all the possible labelings based on the weak hypotheses generated so far; i.e., only a subset $\hat H$ of the columns of $H$ is used. The LP solved using $\hat H$ is typically referred to as the restricted master problem. Solving the restricted primal LP corresponds to solving a relaxation of the dual LP, in which the constraints for weak hypotheses that have not yet been generated are missing. One extreme case is when no weak hypotheses are considered; in this case the optimal dual solution is $\hat u_i = 1/\ell$ (with an appropriate choice of $D$), and this provides the initialization of the algorithm.

If we consider the unused columns to have $\hat a_i = 0$, then $\hat a$ is feasible for the original primal LP. If $(\hat u, \hat\beta)$ is feasible for the original dual problem, then we are done, since we have primal and dual feasibility with equal objectives. If $\hat a$ is not optimal, then $(\hat u, \hat\beta)$ is infeasible for the dual LP with the full matrix $H$. Specifically, the constraint $\sum_{i=1}^{\ell} \hat u_i y_i H_{ij} \le \hat\beta$ is violated for at least one weak hypothesis; equivalently, $\sum_{i=1}^{\ell} \hat u_i y_i H_{ij} > \hat\beta$ for some $j$. Of course we do not want to generate all columns $H_{\cdot j}$ of $H$ a priori, so we use our base learning algorithm as an oracle that either produces a column $H_{\cdot j}$ with $\sum_{i=1}^{\ell} \hat u_i y_i H_{ij} > \hat\beta$, or a guarantee that no such column exists. To speed convergence we would like to find the one with maximum deviation; that is, the base learning algorithm $\mathcal{H}(S, u)$ must deliver a function $\hat h$ satisfying

$$\sum_{i=1}^{\ell} y_i \hat h(x_i)\, \hat u_i \;=\; \max_{h \in H} \sum_{i=1}^{\ell} \hat u_i y_i h(x_i). \qquad (10)$$

Thus $\hat u_i$ becomes the new misclassification cost for example $i$ that is given to the base learning machine to guide the choice of the next weak hypothesis.
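When the candidate weak hypotheses form a finite pool whose outputs can be tabulated (decision stumps, for instance), the oracle of Eq. (10) reduces to a single weighted-vote maximization. The sketch below assumes such a tabulated pool and is ours; in the paper the oracle is a base learner (e.g., a stump learner or C4.5) driven by the costs $u$.

```python
import numpy as np

def best_weak_hypothesis(predictions, y, u):
    """Pricing step of Eq. (10): 'predictions' holds the +/-1 outputs of the candidate
    weak hypotheses as columns; return the index and score of the candidate that
    maximizes sum_i u_i * y_i * h(x_i)."""
    scores = predictions.T @ (u * y)     # one weighted edge per candidate column
    j = int(np.argmax(scores))
    return j, float(scores[j])

# Column j enters the restricted master only if scores[j] > beta_hat; otherwise the
# current ensemble is provably optimal over the whole candidate pool.
```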
One of the big payoffs of the approach is that we have a stopping criterion that guarantees that the optimal ensemble has been found: if there is no weak hypothesis $h$ for which $\sum_{i=1}^{\ell} \hat u_i y_i h(x_i) > \hat\beta$, then the current combined hypothesis is the optimal solution over all linear combinations of weak hypotheses. We can also gauge the cost of early stopping, since if $\max_{h\in H}\sum_{i=1}^{\ell} \hat u_i y_i h(x_i) \le \hat\beta + \epsilon$ for some $\epsilon > 0$, we can obtain a feasible solution of the full dual problem by taking $(\hat u, \hat\beta + \epsilon)$. Hence the value $V$ of the optimal solution can be bounded between $\hat\beta \le V < \hat\beta + \epsilon$. This implies that, even if we were to include a non-zero coefficient for all the weak hypotheses, the value of the objective $\rho - D\sum_{i=1}^{\ell}\xi_i$ can only be increased by at most $\epsilon$.

We assume the existence of the weak learning algorithm $\mathcal{H}(S, u)$, which selects the best weak hypothesis from a set $H$ closed under complementation using the criterion of Eq. (10). The following algorithm results.

Algorithm 4.1 (LPBoost).
  Given as input the training set S
  m ← 0                              (no weak hypotheses)
  a ← 0                              (all coefficients are 0)
  β ← 0
  u ← (1/ℓ, …, 1/ℓ)                  (corresponding optimal dual)
  REPEAT
    m ← m + 1
    Find a weak hypothesis using Eq. (10): h_m ← H(S, u)
    Check for an optimal solution:
      if Σ_{i=1}^{ℓ} u_i y_i h_m(x_i) ≤ β, then m ← m − 1, break
    H_{im} ← h_m(x_i)
    Solve the restricted master for new costs:
      (u, β) ← argmin β
               s.t. Σ_{i=1}^{ℓ} u_i y_i h_j(x_i) ≤ β,  j = 1, …, m
                    Σ_{i=1}^{ℓ} u_i = 1
                    0 ≤ u_i ≤ D,  i = 1, …, ℓ
  END
  a ← Lagrangian multipliers from the last LP
  return m, f = Σ_{j=1}^{m} a_j h_j

Note that the assumption of finding the best weak hypothesis is not essential for good performance of the algorithm. Recall that the role of the learning algorithm is to generate columns (weak hypotheses) corresponding to a dual infeasible row, or to indicate optimality by showing that no infeasible weak hypotheses exist. All that we require is that the base learner return a column corresponding to a dual infeasible row; it need not be the one with maximum infeasibility, which is sought primarily to improve convergence speed. In fact, choosing columns using "steepest edge" criteria that look for the column leading to the biggest actual change in the objective may lead to even faster convergence. If the learning algorithm fails to find a dual infeasible weak hypothesis when one exists, then the algorithm may prematurely stop at a nonoptimal solution.

With small changes this algorithm can be adapted to perform any of the LP boosting formulations by simply changing the restricted master LP solved, the costs given to the learning algorithm, and the optimality conditions checked. Assuming the base learner solves (10) exactly, LPBoost is a variant of the dual simplex algorithm (Nash & Sofer, 1996) and thus inherits all the benefits of the simplex algorithm, including:

1) Well-defined exact and approximate stopping criteria for global optimality. Typically, ad hoc termination schemes, e.g. a fixed number of iterations, are the only effective termination criteria for gradient-based boosting algorithms.
2) Finite termination at a globally optimal solution. In practice the algorithm generates few weak hypotheses before arriving at an optimal solution.
3) The optimal solution is sparse and thus uses few weak hypotheses.
4) The algorithm is performed in the dual space of the classification costs. The weights of the optimal ensemble are only generated, and fixed, at optimality.
5) High-performance commercial LP algorithms optimized for column generation exist, making the algorithm efficient in practice.
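Putting the pieces together, a minimal column-generation loop in the spirit of Algorithm 4.1 can be sketched as follows. It is our sketch, not the authors' code, and it assumes a recent SciPy whose HiGHS backend reports constraint marginals (`res.ineqlin.marginals`), from which the ensemble weights $a$ are read off as the multipliers of the restricted master.

```python
import numpy as np
from scipy.optimize import linprog

def lpboost(X, y, oracle, D, max_iter=100, tol=1e-8):
    """Column generation loop sketching Algorithm 4.1.

    oracle(X, y, u) must return the +/-1 predictions of a weak hypothesis on X,
    ideally the one maximizing sum_i u_i * y_i * h(x_i)  (Eq. (10))."""
    ell = len(y)
    u = np.full(ell, 1.0 / ell)      # initial dual misclassification costs
    beta = 0.0
    preds = []                       # prediction vectors of the generated weak hypotheses
    a = np.array([])                 # ensemble weights, fixed only at optimality

    for _ in range(max_iter):
        h = oracle(X, y, u)
        if np.sum(u * y * h) <= beta + tol:
            break                    # no violated dual constraint: current ensemble is optimal
        preds.append(h)

        # Restricted master (dual form): min beta
        #   s.t. sum_i u_i y_i h_j(x_i) <= beta for every generated j,
        #        sum_i u_i = 1,  0 <= u_i <= D.
        m = len(preds)
        c = np.concatenate([np.zeros(ell), [1.0]])                  # variables (u, beta)
        A_ub = np.hstack([np.array([y * p for p in preds]), -np.ones((m, 1))])
        b_ub = np.zeros(m)
        A_eq = np.concatenate([np.ones(ell), [0.0]])[None, :]
        b_eq = [1.0]
        bounds = [(0.0, D)] * ell + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=bounds, method="highs")
        u, beta = res.x[:ell], res.x[ell]
        # HiGHS reports the duals of the <= rows as non-positive marginals here;
        # their negatives are the ensemble weights a_j, which sum to one.
        a = -np.asarray(res.ineqlin.marginals)

    return a, preds                  # ensemble prediction: sign(sum_j a_j * h_j(x))
```

A pricing routine like the earlier sketch of Eq. (10), restricted to the current costs `u`, can serve as `oracle`.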
If $\sum_{i=1}^{\ell}(-u_i + u_i^*) H_{ij} > \beta$, then the ensemble is not optimal and the weak hypothesis should be added to the ensemble. To speed convergence we would like the weak hypothesis with maximum deviation, i.e.,

$$\max_j \sum_{i=1}^{\ell} (-u_i + u_i^*) H_{ij}. \qquad (15)$$

This is perhaps odd at first glance, because the criterion does not explicitly involve the dependent variables $y_i$. But within the LPBoost algorithm, the $u_i$ are closely related to the error residuals of the current ensemble. If the data point $x_i$ is overestimated by the current ensemble function by more than $\epsilon$, then by complementarity $u_i$ will be positive and $u_i^* = 0$; so at the next iteration the base learner will attempt to construct a function that has a negative sign at point $x_i$. If the point $x_i$ falls within the $\epsilon$ margin, then $u_i = u_i^* = 0$, and the next base learner will try to construct a function with value 0 at that point. If the data point $x_i$ is underestimated by the current ensemble function by more than $\epsilon$, then by complementarity $u_i^*$ will be positive and $u_i = 0$; so at the next iteration the base learner will attempt to construct a function that has a positive sign at point $x_i$. By sensitivity analysis, the magnitudes of $u$ and $u^*$ are proportional to the changes of the objective with respect to changes in $y$. This becomes even clearer using the approach taken in the Barrier Boosting algorithm for this problem (Rätsch et al., 2000c).

Equation (15) can be converted to a least squares problem. For $v_i = -u_i + u_i^*$ and $H_{ij} = f_j(x_i)$,

$$(f(x_i) - v_i)^2 = f(x_i)^2 - 2 v_i f(x_i) + v_i^2. \qquad (16)$$

So the objective to be optimized by the base learner can be transformed as follows:

$$\max_j \sum_{i=1}^{\ell} (-u_i + u_i^*) f_j(x_i) \;=\; \max_j \sum_{i=1}^{\ell} v_i f_j(x_i) \;=\; \min_j \frac{1}{2}\sum_{i=1}^{\ell}\Bigl[(f_j(x_i) - v_i)^2 - f_j(x_i)^2 - v_i^2\Bigr]. \qquad (17)$$

The constant term $v_i^2$ can be ignored, so effectively the base learner must construct a regularized least squares approximation of the residual function.

The final regression algorithm looks very much like the classification case. The variables $u_i$ and $u_i^*$ can be initialized to any feasible point; we present one such strategy here, assuming that $D$ is sufficiently large. Here $(a)_+ := \max(a, 0)$ denotes the plus function.

Algorithm 6.1 (LPBoost-Regression).
  Given as input the training set S
  m ← 0                              (no weak hypotheses)
  a ← 0                              (all coefficients are 0)
  β ← 0
  u_i ← (−y_i)_+ / ‖y‖₁              (corresponding feasible dual)
  u*_i ← (y_i)_+ / ‖y‖₁
  REPEAT
    m ← m + 1
    Find a weak hypothesis using Eq. (17): h_m ← H(S, (−u + u*))
    Check for an optimal solution:
      if Σ_{i=1}^{ℓ} (−u_i + u*_i) h_m(x_i) ≤ β, then m ← m − 1, break
    H_{im} ← h_m(x_i)
    Solve the restricted master for new costs:
      (u, u*, β) ← argmin β + Σ_{i=1}^{ℓ} (u_i − u*_i) y_i
                   s.t. Σ_{i=1}^{ℓ} (−u_i + u*_i) h_j(x_i) ≤ β,  j = 1, …, m
                        Σ_{i=1}^{ℓ} (u_i + u*_i) = 1
                        0 ≤ u_i, u*_i ≤ C,  i = 1, …, ℓ
  END
  a ← Lagrangian multipliers from the last LP
  return m, f = Σ_{j=1}^{m} a_j h_j
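As a sketch of this residual trick, the dual costs can be turned into least-squares targets and handed to any regression base learner. The unregularized least-squares fit below is a hypothetical stand-in (names ours) for the regularized least-squares approximation described above.

```python
import numpy as np

def residual_targets(u, u_star):
    """Targets for the regression base learner (Eqs. (16)-(17)): maximizing
    sum_i (-u_i + u*_i) f(x_i) is, up to constants, a least-squares fit to v."""
    return -u + u_star

def least_squares_base_learner(X, v):
    """Minimal stand-in base learner: ordinary least squares on the residual targets v
    (no regularization, for brevity). Returns a callable hypothesis."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])             # add an intercept column
    w, *_ = np.linalg.lstsq(X1, v, rcond=None)
    return lambda Xnew: np.hstack([Xnew, np.ones((Xnew.shape[0], 1))]) @ w
```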
7. Hard margins, soft margins, and sparsity

The column generation algorithm can also be applied to the hard margin LP error function for boosting. In fact the DualLPBoost proposed by Grove and Schuurmans (1998) does exactly this. Breiman (1999) also investigated an equivalent formulation using an asymptotic algorithm. Both papers found that optimizing the hard margin LP to construct ensembles did not work well in practice. In contrast, the soft margin LP ensemble methods optimized using column generation in this paper, and using an arcing approach in Rätsch et al. (2000b), worked well (see Section 8). Poor performance of hard margin versus soft margin classification methods has been noted in other contexts as well. In a computational study of the hard margin Multisurface-Method (MSM) for classification (Mangasarian, 1965) and the soft margin Robust Linear Programming (RLP) method (Bennett & Mangasarian, 1992), both closely related LP precursors to the Support Vector Machine (Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995), the soft margin RLP performed uniformly better than the hard margin MSM. In this section we will examine the critical difference between hard and soft margin classifiers geometrically through a simple example. This discussion will also illustrate some of the practical issues of using a column generation approach to solve the soft margin problems.

The hard margin ensemble LP found in Grove and Schuurmans (1998), expressed in the notation of this paper, is

$$\begin{array}{ll}
\displaystyle\max_{a,\rho} & \rho \\
\text{s.t.} & y_i H_i a \ge \rho,\quad i = 1, \ldots, \ell \\
& \displaystyle\sum_{j=1}^{m} a_j = 1,\quad a_j \ge 0,\quad j = 1, \ldots, m.
\end{array} \qquad (18)$$

This is the primal formulation. The dual of the hard margin problem is

$$\begin{array}{ll}
\displaystyle\min_{u,\beta} & \beta \\
\text{s.t.} & \displaystyle\sum_{i=1}^{\ell} u_i y_i H_{ij} \le \beta,\quad j = 1, \ldots, m \\
& \displaystyle\sum_{i=1}^{\ell} u_i = 1,\quad u_i \ge 0,\quad i = 1, \ldots, \ell.
\end{array} \qquad (19)$$

Let us examine geometrically what the hard and soft margin formulations do, using concepts used to describe the geometry of SVMs in Bennett and Bredensteiner (2000). Consider the LP subproblem in the column generation algorithm after sufficient weak hypotheses have been generated such that the two classes are linearly separable; specifically, there exist $\rho > 0$ and $a$ such that $y_i H_i a \ge \rho > 0$ for $i = 1, \ldots, \ell$. Figure 1 gives an example of two confidence-rated hypotheses (labels between 0 and 1). The left figure shows the separating hyperplane in the label space, where each data point $x_i$ is plotted as $(h_1(x_i), h_2(x_i))$. The separating hyperplane is shown as a dotted line through the origin, as there is no threshold. The minimum margin $\rho$ is positive and produces a very reasonable separating plane. The solution depends only on the two support vectors indicated by boxes. The right side shows the problem in dual or margin space, where each point is plotted as $(y_i h_1(x_i), y_i h_2(x_i))$. Recall that a weak hypothesis is correct on a point if $y_i h(x_i)$ is positive. The convex hull of the points in the dual space is shown with dotted lines. The dual LP computes a point in the convex hull that is optimal by some criterion. When the data are linearly separable, the dual problem finds the

(Figure 1. No-noise hard margin LP solution for two confidence-rated hypotheses. Left: the separation in label space. Right: the separation in dual or margin space.)
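For contrast with the soft margin sketch given after LPs (2)–(3), the hard margin LP (18) uses the same machinery with a margin variable $\rho$, an equality constraint on the weights, and no slacks. A minimal sketch (ours, under the same SciPy assumption) is:

```python
import numpy as np
from scipy.optimize import linprog

def solve_hard_margin_lp(H, y):
    """Hard margin ensemble LP (18): max rho  s.t.  y_i * (H_i a) >= rho, sum(a) = 1, a >= 0.
    A negative optimal rho signals that the generated hypotheses cannot separate the data."""
    ell, m = H.shape
    c = np.concatenate([np.zeros(m), [-1.0]])                   # maximize rho
    A_ub = np.hstack([-(y[:, None] * H), np.ones((ell, 1))])    # rho - y_i (H_i a) <= 0
    b_ub = np.zeros(ell)
    A_eq = np.concatenate([np.ones(m), [0.0]])[None, :]         # sum(a) = 1
    b_eq = [1.0]
    bounds = [(0.0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:m], res.x[m]                                  # weights a, margin rho
```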
But in the early iterations of a column generation algorithm, the hard margin LP will be optimized over a small set of hypotheses for which the classes are not linearly separable in the label space. In this case we observe several problematic characteristics of the hard margin formulation: extreme sensitivity to noise (producing undesirable hypothesis weightings), extreme sparsity of the dual vector, especially in the early iterations of a column generation algorithm, failure to assign positive Lagrangian multipliers to misclassified examples, and no guarantee that points will be drawn from both distributions. Although we examined these potential problems using the confidence-rated case in two dimensions, it is easy to see that they hold true, and are somewhat worse, for the more typical case where the labels are restricted to 1 and −1.

The soft margin LP adopted in this paper addresses some but not all of these problems. Adding soft margins makes the LP much less sensitive to noise. Adding soft margins to the primal corresponds to adding bounds on the dual multipliers. The constraint that the dual multipliers sum to one forces more of the multipliers to be positive, both in the separable and inseparable cases. Furthermore, the complementarity conditions of the soft margin LP guarantee that any point that violates the soft margin will have a positive multiplier. Assuming $D$ is sufficiently small, this means that every misclassified point will have a positive multiplier.

But this geometric analysis illustrates that there are some potential problems with the soft margin LP. The column generation algorithm uses the dual costs as misclassification costs for the base learner to generate new hypotheses, so the characteristics of the dual solution are critical. For a small set of hypotheses, the LP will be degenerate, and the dual solution may still be quite sparse. Any method that finds extreme point solutions will be biased toward the sparsest dual optimal solution, when in practice less sparse solutions would be better suited as misclassification costs for the base learner. If the parameter $D$ is chosen too large, the margin may still be negative, so the LP will still suffer from many of the problems found in the hard margin case. If the parameter $D$ is chosen too small, the problem reduces to the equal-cost case, so little advantage will be gained through using an ensemble method. Potentially, the distribution of the support vectors may still be highly skewed towards one class. All of these are potential problems in an LP-based ensemble method. As we will see in the following sections, they can arise in practice.

8. Computational experiments

We performed three sets of experiments to compare the performance of LPBoost, CRB, and AdaBoost on three classification tasks: one boosting decision tree stumps on smaller datasets, and two boosting C4.5 (Quinlan, 1996). For decision tree stumps, six datasets were used; LPBoost was run until the optimal ensemble was found, and AdaBoost was stopped at 100 and 1000 iterations. For the C4.5 experiments, we report results for four large datasets with and without noise. Finally, to further validate C4.5, we experimented with ten additional datasets. The rationale was to first evaluate LPBoost where the base learner solves (10) exactly and the optimal ensemble can be found by LPBoost, and then to examine LPBoost in a more realistic environment by using C4.5 as a base learner with a relatively small maximum number of iterations for both LPBoost and AdaBoost. All of the datasets were obtained from the UC-Irvine data repository (Murphy & Aha, 1992).
For the C4.5 experiments we performed both traditional and confidence-rated boosting. Different strategies for picking the LP model parameter were used in each of the three types of experiments to make sure the results were not a quirk of any particular model selection strategy. The implementations of LPBoost were identical except in how the misclassification costs were generated and in the stopping criteria. Both methods were allowed the same maximum number of iterations.

8.1. Boosting decision tree stumps

We used decision tree stumps as base hypotheses on the following six datasets: Cancer (9, 699), Diagnostic (30, 569), Heart (13, 297), Ionosphere (34, 351), Musk (166, 476), and Sonar (60, 208). The number of features and number of points in each dataset are shown, respectively, in parentheses. We report testing set accuracy for each dataset based on 10-fold cross-validation (CV). We generate the decision tree stumps based on the midpoint between two consecutive values of a given variable. Since there is limited confidence information in stumps, we did not perform confidence-rated boosting. All boosting methods search, at each iteration, for the best weak hypothesis, i.e., the one with the least weighted misclassification error. LPBoost can take advantage of the fact that each weak hypothesis need only be added to the ensemble once: once a stump is added to the ensemble it is never evaluated by the learning algorithm again, and the weights of the weak hypotheses are adjusted dynamically by the LP. This is an advantage over AdaBoost, since AdaBoost adjusts weights by repeatedly adding the same weak hypothesis to the ensemble. As discussed in Section 8.3, the computational effort to reoptimize the LP is a fraction of the time needed to find a weak hypothesis.

The parameter ν for LPBoost was set using a simple heuristic: 0.1 added to previously reported linear discriminant error rates on each dataset in Bennett and Demiriz (1999), except for the Cancer dataset. Specifically, the values of ν, in the same order as the datasets given above, were (0.2, 0.1, 0.25, 0.2, 0.25, 0.3). The parameter ν corresponds to the fraction of the data that are support vectors, which we overestimate as the error estimate plus 10 percent. The base linear error estimate can easily be derived using cross-validation instead of using results from a previous paper. This is a rough heuristic that gives a reasonable guess at ν, but further tuning may improve it. For Cancer, the initial guess of ν = 0.15 overfit the training data, so it was increased to ν = 0.2.

Results for AdaBoost are reported for a maximum of 100 and 1000 iterations; many authors have reported results for AdaBoost at these iterations using decision stumps. The 10-fold average classification accuracies and standard deviations are reported in Table 1. We also report the average number of unique weak hypotheses over the 10 folds. LPBoost performed well in terms of classification accuracy, number of weak hypotheses, and training time. There is little difference between the accuracy of LPBoost and the best accuracy reported for AdaBoost using either 100 or 1000 iterations. The variation in AdaBoost between 100 and 1000 iterations illustrates the importance of well-defined stopping criteria. Typically, AdaBoost only obtains its solution in the limit and thus stops when the maximum number of iterations (or some other heuristic stopping criterion) is reached.
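The stump generation described earlier in this subsection (thresholds at midpoints of consecutive feature values) and the mapping from the heuristic ν to the dual bound can be sketched as follows; the function names are ours, and $D = 1/(\nu\ell)$ follows the soft margin discussion in Section 3.

```python
import numpy as np

def stump_columns(X):
    """Enumerate decision stumps with thresholds at the midpoints of consecutive sorted
    feature values and return their +/-1 outputs as columns (one orientation only;
    the complemented stump is the negation of a column)."""
    cols = []
    for f in range(X.shape[1]):
        values = np.unique(X[:, f])
        thresholds = (values[:-1] + values[1:]) / 2.0
        for t in thresholds:
            cols.append(np.where(X[:, f] > t, 1, -1))
    return np.array(cols).T          # shape: (n_points, n_stumps)

def soft_margin_bound(nu, ell):
    """Dual bound from the heuristic: nu is a guessed support-vector fraction
    (estimated error rate + 0.1), and D = 1 / (nu * ell)."""
    return 1.0 / (nu * ell)
```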
Table 1. Average accuracy and standard deviations of boosting using decision tree stumps.

Dataset      LPBoost (m)             AB-100 (m)              AB-1000 (m)
Cancer       0.966 ± 0.025 (14.7)    0.954 ± 0.029 (36.8)    0.947 ± 0.026 (59.3)
Diagnostic   0.961 ± 0.027 (54.2)    0.968 ± 0.027 (67.7)    0.970 ± 0.031 (196.1)
Heart        0.795 ± 0.079 (70.8)    0.818 ± 0.075 (51.1)    0.801 ± 0.061 (103.1)
Ionosphere   0.906 ± 0.052 (87.6)    0.906 ± 0.054 (69.1)    0.903 ± 0.043 (184.2)
Musk         0.882 ± 0.035 (205.3)   0.840 ± 0.042 (89.8)    0.891 ± 0.033 (370.9)
Sonar        0.870 ± 0.082 (85.7)    0.808 ± 0.084 (76.4)    0.856 ± 0.078 (235.5)

(m) = average number of unique decision tree stumps in the final ensemble.

There is no magic number of iterations good for all datasets. LPBoost has a well-defined criterion for stopping when an optimal ensemble is found, and that criterion is reached in relatively few iterations; it also uses few weak hypotheses. There are only 81 possible stumps on the Breast Cancer dataset (nine attributes having nine possible values), so clearly AdaBoost may require the same tree to be generated multiple times. LPBoost generates a weak hypothesis only once and can alter the weight on that weak hypothesis at any iteration.

The run time of LPBoost is proportional to the number of weak hypotheses generated. Since the LP package that we used, CPLEX 4.0 (CPLEX, 1994), is optimized for column generation, the cost of adding a column and reoptimizing the LP at each iteration is small. An iteration of LPBoost is only slightly more expensive than an iteration of AdaBoost, and the time is proportional to the number of weak hypotheses generated, so for problems in which LPBoost generates far fewer weak hypotheses it is much less computationally costly. Results also clearly indicate that if AdaBoost uses fewer unique weak hypotheses, it underfits; in the opposite case, it overfits. LPBoost depends on the choice of the model parameter to prevent overfitting, while AdaBoost depends on the choice of the maximum number of iterations. In the next subsection, we test the practicality of our methodology on different datasets using C4.5 in a more realistic environment where both AdaBoost and LPBoost are halted after a relatively small number of iterations.

8.2. Boosting C4.5

LPBoost with C4.5 as the base algorithm performed well after some operational challenges were solved. In concept, boosting using C4.5 is straightforward, since the C4.5 algorithm accepts misclassification costs. One problem is that C4.5 only finds a good solution, not one guaranteed to maximize (10); this can affect the convergence speed of the algorithm and may cause it to terminate at a suboptimal solution. As discussed in Section 7, another challenge is that the misclassification costs determined by LPBoost are very sparse, i.e., $u_i = 0$ for many of the points. The dual LP has a basic feasible solution corresponding to a vertex of the dual feasible region, and only the variables corresponding to the basic solution can be nonzero. So while a face of the region corresponding to many nonzero weights may be optimal, only a vertex solution will

(Figure 4. Validation set accuracy by ν value. Triangles are no noise and circles are with noise. (a) Forest dataset; (b) Adult dataset; (c) USPS dataset; (d) Optdigits dataset.)

noisy data. All boosting methods outperform C4.5. Results also indicate that none of the boosting methods overfits badly; this can be explained by early stopping based on large validation sets. We also conducted experiments by boosting C4.5 on small datasets.
Once again there was no strong evidence of superiority of any of the boosting approaches. In addition to the six UCI datasets used in the decision tree stump experiments, we use four additional UCI datasets here: House (16, 435), Housing (13, 506), Pima (8, 768), and Spam (57, 4601). As in the decision tree stump experiments, we report results from 10-fold CV. Since the best ν value for LPBoost varies between 0.03 and 0.11 for the large datasets, we pick ν = 0.07 for the small datasets. No effort was made to tune the ν parameter, so no advantage is given to LPBoost. All boosting methods were allowed to run up to 25 iterations. Results are reported in Table 4. C4.5 performed best on the House dataset. AdaBoost performed best on four datasets out of ten. LPBoost and CRB had the best classification performance on three and two datasets, respectively. If CRB is dropped from Table 4, LPBoost would perform best on five datasets.

Table 2. Stopping iterations determined by validation for AdaBoost.

             25 iterations        50 iterations        100 iterations
Dataset      Original    Noisy    Original    Noisy    Original    Noisy
Forest       22          19       36          39       51          39
Adult        25          4        48          4        74          91
USPS         22          25       47          40       86          99
Optdigits    25          25       49          50       91          94

Table 3. Large dataset results from boosting C4.5 by method and maximum number of iterations.

Method     Iteration   Forest   +15% Noise   Adult    +15% Noise   USPS     +15% Noise   OptDigits   +15% Noise
LPBoost    25          0.7226   0.6602       0.8476   0.8032       0.9123   0.8744       0.9249      0.8948
           50          0.7300   0.6645       0.8495   0.8176       0.9188   0.8849       0.9416      0.9060
           100         0.7322   0.6822       0.8501   0.8461       0.9153   0.9103       0.9449      0.9160
CRB        25          0.7259   0.6569       0.8461   0.8219       0.9103   0.8739       0.9355      0.8948
           50          0.7303   0.6928       0.8496   0.8240       0.9063   0.8789       0.9349      0.9093
           100         0.7326   0.7045       0.8508   0.8250       0.9133   0.8874       0.9343      0.9238
AdaBoost   25          0.7370   0.6763       0.8358   0.7630       0.9130   0.8789       0.9416      0.8770
           50          0.7432   0.6844       0.8402   0.7630       0.9188   0.8934       0.9494      0.9104
           100         0.7475   0.6844       0.8412   0.7752       0.9218   0.8889       0.9510      0.9243
C4.5       1           0.6638   0.5927       0.8289   0.7630       0.7833   0.6846       0.7958      0.6884

Table 4. Small dataset results from boosting C4.5.

Dataset      LPBoost          CRB              AdaBoost         C4.5
Cancer       0.959 ± 0.017    0.963 ± 0.025    0.966 ± 0.025    0.945 ± 0.025
Diagnostic   0.965 ± 0.026    0.963 ± 0.028    0.971 ± 0.019    0.937 ± 0.037
Heart        0.791 ± 0.062    0.795 ± 0.010    0.787 ± 0.061    0.788 ± 0.077
House        0.959 ± 0.034    0.945 ± 0.053    0.951 ± 0.042    0.962 ± 0.029
Housing      0.854 ± 0.048    0.866 ± 0.038    0.879 ± 0.039    0.817 ± 0.049
Ionosphere   0.937 ± 0.038    0.926 ± 0.060    0.936 ± 0.041    0.916 ± 0.052
Musk         0.882 ± 0.054    0.906 ± 0.049    0.929 ± 0.028    0.834 ± 0.034
Pima         0.750 ± 0.050    0.728 ± 0.048    0.748 ± 0.071    0.729 ± 0.046
Sonar        0.817 ± 0.083    0.832 ± 0.083    0.814 ± 0.093    0.701 ± 0.073
Spam         0.956 ± 0.009    0.955 ± 0.010    0.952 ± 0.009    0.930 ± 0.009

(Figure 5. CPU times in seconds for each iteration. (a) Adult dataset; (b) Spam dataset, average over 10-fold CV.)

8.3. Computational cost analysis

In this section, we analyze the computational costs of the different boosting methods. One important issue is to justify the additional cost of the LP time in LPBoost and CRB: is it worth reoptimizing an LP at each iteration? Since we use a column generation approach, in theory it should not affect performance very much. In order to report timings, we reran some of the experiments on a fully dedicated IBM RS-6000 with 512 MB RAM and a single 330 MHz processor. Results were consistent across the different datasets, so we focus on two sample cases.
We plot the CPU time in seconds per iteration for a large dataset (a single run on Adult) and for a relatively smaller dataset (Spam, averaged over 10 folds) in Figure 5. The total CPU times