SVMs in Computational Genomics: Theory & Applications

An overview of support vector machines (SVMs), a popular machine learning algorithm used in various fields, including computational functional genomics. SVMs are large-margin classifiers that have been successful in applications such as bioinformatics, cheminformatics, text classification, handwriting recognition, and more. The notes cover basic notation, linear learning machines, the implicit mapping to feature spaces, kernels, and the generalization problem, and also discuss the advantages and disadvantages of using SVMs.


Computational Functional Genomics (BioE 594)
Lecture 18: Genomic data-mining method 5 - Support vector machines
Yang Dai

Brief history of Support Vector Machines (slide 2)
- SVMs are inspired by statistical learning theory.
- Introduced at COLT-92 by Boser, Guyon, and Vapnik.
- Initially popularized in the NIPS community; now a large and diverse community drawing on machine learning, optimization, statistics, neural networks, functional analysis, etc.
- SVMs are a particular instance of a large class of learning algorithms called "kernel machines".
- Successful applications in many fields (bioinformatics, cheminformatics, text, handwriting recognition, etc.).
- Implementations: LIBSVM, SVMlight, etc. Website: www.kernel-machines.org

Large-margin decision boundary (slide 5)
- The decision boundary should be as far away from the data of both classes as possible.
- Maximize the margin m.
(Figure: two classes separated by a hyperplane, with the margin m marked between them.)

Perceptron (slide 6)
- Linear separation of the input space: f(x) = ⟨w, x⟩ + b, with predicted label sign(f(x)) ∈ {+1, −1}.
- Update rule (ignoring the threshold): if yi ⟨wk, xi⟩ ≤ 0, then wk+1 ← wk + η yi xi and k ← k + 1.
- The solution is a linear combination of training points: w = ∑i αi yi xi with αi ≥ 0.
- Only informative points are used (the algorithm is mistake driven).
- The coefficient of a point in the combination reflects its "difficulty".

Dual representation (slide 7)
- The decision function can be rewritten as f(x) = ⟨w, x⟩ + b = ∑i αi yi ⟨xi, x⟩ + b, where w = ∑i αi yi xi.
- The update rule can also be rewritten: if yi (∑j αj yj ⟨xj, xi⟩ + b) ≤ 0, then αi ← αi + η.

Duality: first property of SVMs
- Duality is the first feature of support vector machines (and of kernel methods in general).
- SVMs are linear learning machines represented in a dual fashion: f(x) = ⟨w, x⟩ + b = ∑i αi yi ⟨xi, x⟩ + b.
- The data appear only within dot products, both in the decision function and in the training algorithm.
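To make the dual representation concrete, here is a minimal sketch (added for illustration, not part of the original slides) of the perceptron trained directly on the coefficients αi, using only the Gram matrix of inner products. The function name dual_perceptron and the learning rate eta are illustrative choices.

```python
import numpy as np

def dual_perceptron(X, y, eta=1.0, epochs=100):
    """Perceptron in its dual form: f(x) = sum_j alpha_j y_j <x_j, x> + b,
    with alpha_i increased only when point i is misclassified."""
    n = X.shape[0]
    alpha = np.zeros(n)
    b = 0.0
    G = X @ X.T                                   # Gram matrix of <x_i, x_j>
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * ((alpha * y) @ G[:, i] + b) <= 0:   # mistake on x_i
                alpha[i] += eta                            # alpha_i <- alpha_i + eta
                b += eta * y[i]
                mistakes += 1
        if mistakes == 0:                          # no mistakes: data separated
            break
    w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
    return alpha, w, b
```

Because the data enter only through the Gram matrix G, replacing G with a kernel matrix turns this into a kernel perceptron with no other change, which is exactly the point of the dual representation.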
Overview of kernel methods (slide 10)
- Kernel methods work by embedding the data into a vector space and detecting linear relations in that space.
- "Linear relations" can be regressions, classifications, correlations, principal components, etc.
- If the feature space is chosen suitably, pattern recognition can be easy.
- Kernel-based algorithms share a general structure composed of two modules: a learning algorithm, which performs the learning in the embedding space, and a kernel function, which takes care of the embedding.

Kernel methods (slide 11)
- Kernel methods exploit information about the inner products between data items.
- Many standard algorithms can be rewritten so that they only require inner products between the data (inputs).
- Kernel functions = inner products in some feature space (potentially a very complex one).
- If the kernel is given, there is no need to specify which features of the data are being used.

Kernels (slide 12)
- A kernel is a function that returns the value of the dot product between the images of its two arguments: K(x, z) = ⟨φ(x), φ(z)⟩.
- Given a function K, it is possible to verify that it is a kernel.
- One can use linear learning machines (LLMs) in a feature space simply by rewriting them in dual representation and replacing dot products with kernels: ⟨xi, x⟩ ← K(xi, x) = ⟨φ(xi), φ(x)⟩.
- Example: polynomial kernels.

No free kernel (slide 15)
- A bad kernel is one whose kernel matrix is mostly diagonal: all points are orthogonal to each other, with no clusters and no structure.
- If the mapping is into a space with too many irrelevant features, the kernel matrix becomes nearly diagonal (close to the identity matrix).
- Some prior knowledge of the target is needed to choose a good kernel.

The generalization problem (slide 16)
- The curse of dimensionality: it is easy to overfit in high-dimensional spaces.
- Regularities could be found in the training set that are accidental, that is, that would not be found again in a test set.
- The separation problem is ill posed: many hyperplanes separate the data, so a principled way to choose the best possible hyperplane is needed.

Second property of SVMs (slide 17)
- SVMs are linear learning machines that use a dual representation and operate in a kernel-induced feature space: f(x) = ∑i αi yi ⟨φ(xi), φ(x)⟩ + b, which is a linear function in the feature space implicitly defined by K.

The primal problem (slide 20)
- Minimize ⟨w, w⟩ (note that ⟨w, w⟩ = ||w||²) subject to yi (⟨w, xi⟩ + b) ≥ 1.
- Finding the maximal-margin hyperplane is thus a constrained optimization problem (quadratic programming).
- Use Lagrange theory (or Kuhn-Tucker theory). The Lagrangian of this optimization problem is
  L(α) = (1/2) ⟨w, w⟩ − ∑i αi [yi (⟨w, xi⟩ + b) − 1],  with αi ≥ 0.

From primal to dual (slide 21)
- By setting the derivatives of the Lagrangian to zero, the optimization problem can be rewritten in terms of the αi (the dual problem):
  maximize  W(α) = ∑i αi − (1/2) ∑i,j αi αj yi yj ⟨xi, xj⟩
  subject to  αi ≥ 0 and ∑i αi yi = 0.
- This is a quadratic programming (QP) problem; by convexity, the global maximum over the αi can always be found.
- w can then be recovered as w = ∑i αi yi xi.
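As an illustration of how this dual is solved in practice, the sketch below maximizes W(α) for the hard-margin case using SciPy's general-purpose SLSQP optimizer in place of a dedicated QP solver; the helper name train_hard_margin_dual, the toy data, and the 1e-6 threshold for identifying support vectors are assumptions made here, not details from the slides.

```python
import numpy as np
from scipy.optimize import minimize

def train_hard_margin_dual(X, y):
    """Maximize W(alpha) subject to alpha_i >= 0 and sum_i alpha_i y_i = 0."""
    n = len(y)
    G = (y[:, None] * y[None, :]) * (X @ X.T)     # G_ij = y_i y_j <x_i, x_j>

    def neg_W(a):                                  # SLSQP minimizes, so use -W
        return 0.5 * a @ G @ a - a.sum()

    cons = {"type": "eq", "fun": lambda a: a @ y}  # sum_i alpha_i y_i = 0
    bnds = [(0.0, None)] * n                       # alpha_i >= 0
    res = minimize(neg_W, np.zeros(n), method="SLSQP",
                   bounds=bnds, constraints=cons)
    alpha = res.x
    w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
    sv = alpha > 1e-6                              # points with non-zero weight
    b = np.mean(y[sv] - X[sv] @ w)                 # KKT: y_i(<w, x_i> + b) = 1 on SVs
    return w, b, alpha

# Tiny linearly separable toy problem (illustrative data only)
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(train_hard_margin_dual(X, y))
```

Most of the returned αi are numerically zero, anticipating the sparseness property discussed on the next slide.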
Properties of the solution (slide 22)
- Kuhn-Tucker theorem. Duality: we can use kernels.
- KKT conditions: αi [yi (⟨w, xi⟩ + b) − 1] = 0.
- The KKT conditions imply sparseness, another property of SVMs: many of the αi are zero. Only a small number of data points, those nearest to the hyperplane (on the margin, where yi (⟨w, xi⟩ + b) = 1), have positive weight in w = ∑i αi yi xi.
- The xi with non-zero αi are called support vectors (SVs); the decision boundary is determined only by the support vectors.
- For testing with a new data point z, let tj (j = 1, ..., s) be the indices of the s support vectors. We can write
  w = ∑j=1..s αtj ytj xtj,
  compute ⟨w, z⟩ + b = ∑j=1..s αtj ytj ⟨xtj, z⟩ + b,
  and classify z as class 1 if the sum is positive and as class 2 otherwise.

Non-linearly separable problems (slide 25)
- The data may not be separable in the original space.
- One could separate them with a "finer" kernel, but that will cause overfitting.
- It is better to have a model that tolerates mislabeled points, i.e. to allow an "error" ξi in the classification.

Soft margin hyperplane (slide 26)
- The errors ξi are obtained by relaxing the constraints:
  wᵀxi + b ≥ 1 − ξi for yi = +1,
  wᵀxi + b ≤ −1 + ξi for yi = −1,
  ξi ≥ 0,
  which can be written compactly as yi (wᵀxi + b) ≥ 1 − ξi with ξi ≥ 0.
- The ξi are "slack variables" in the optimization: ξi = 0 if there is no error for xi, and ∑i ξi is an upper bound on the number of training errors.
- We want to minimize (1/2) ||w||² + C ∑i ξi, where C is a tradeoff parameter between error and margin.

The optimization problem (slide 27)
- The dual of the soft-margin problem is:
  maximize  W(α) = ∑i αi − (1/2) ∑i,j αi αj yi yj ⟨xi, xj⟩
  subject to  0 ≤ αi ≤ C and ∑i αi yi = 0.
- w can be recovered as w = ∑j=1..s αtj ytj xtj.
- This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on the αi. Once again, a QP solver can be used to find the αi.

An example for φ(.) and K(.,.) (slide 30)
- Suppose φ(.) is given, for two-dimensional input x = (x1, x2), as the standard degree-2 expansion
  φ(x) = (1, √2 x1, √2 x2, x1², x2², √2 x1 x2).
- An inner product in the feature space is then
  ⟨φ(x), φ(z)⟩ = (1 + x1 z1 + x2 z2)².
- So, if we define the kernel function as K(x, z) = (1 + x1 z1 + x2 z2)², there is no need to carry out φ(.) explicitly.
- This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick.

Kernel functions (slide 31)
- In the practical use of SVMs, only the kernel function (and not φ(.)) is specified.
- The kernel function can be thought of as a similarity measure between the input objects.
- Not every similarity measure can be used as a kernel function, however: the kernel must satisfy Mercer's theorem, i.e. it must be a positive-definite function. As a consequence, the kernel matrix, whose (i, j)-th entry is K(xi, xj), is always positive semi-definite.
- Note that the xi need not be vectorial for a kernel function to exist. This opens up enormous opportunities for classification of sequences, graphs, etc. with SVMs.
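The kernel trick of slide 30 can be checked numerically. The short snippet below (an added illustration, not from the slides) compares the explicit six-dimensional feature map with the degree-2 polynomial kernel evaluated directly on the two-dimensional inputs; the two numbers agree.

```python
import numpy as np

def phi(v):
    """Explicit feature map whose inner products equal (1 + <x, z>)^2 for 2-D input."""
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, z):
    """Degree-2 polynomial kernel, computed without ever forming phi."""
    return (1.0 + x @ z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(phi(x) @ phi(z))   # inner product after the explicit embedding ...
print(K(x, z))           # ... matches the kernel evaluated in the input space
```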
Kernel function examples (slide 32)
- Polynomial kernel with degree d: K(x, z) = (⟨x, z⟩ + 1)^d.
- Radial basis function (RBF) kernel with width σ: K(x, z) = exp(−||x − z||² / (2σ²)). Closely related to radial basis function neural networks.
- Sigmoid kernel with parameters κ and θ: K(x, z) = tanh(κ xᵀz + θ). It does not satisfy the Mercer condition for all κ and θ.

Example (slide 35)
- Suppose we have five 1-D data points, x1 = 1, x2 = 2, x3 = 4, x4 = 5, x5 = 6, with 1, 2, 6 in class 1 and 4, 5 in class 2, so y1 = 1, y2 = 1, y3 = −1, y4 = −1, y5 = 1.
- We use the polynomial kernel of degree 2, K(x, z) = (xz + 1)², and set C to 100.
- We first find the αi (i = 1, ..., 5) by solving the dual QP above with ⟨xi, xj⟩ replaced by K(xi, xj).

Example, continued (slide 36)
- Using a QP solver, we get α1 = 0, α2 = 2.5, α3 = 0, α4 = 7.333, α5 = 4.833. Note that the constraints are indeed satisfied.
- The support vectors are {x2 = 2, x4 = 5, x5 = 6}.
- The discriminant function is
  f(x) = 2.5 (1)(2x + 1)² + 7.333 (−1)(5x + 1)² + 4.833 (1)(6x + 1)² + b = 0.6667 x² − 5.333 x + b.
- b is recovered by solving f(2) = 1, f(5) = −1, or f(6) = 1: since x2, x4, x5 lie on the margin, yi (wᵀφ(xi) + b) = 1 for each of them, and all three equations give b = 9. Hence
  f(x) = 0.6667 x² − 5.333 x + 9.

Choosing the kernel function for SVMs (slide 37)
- The kernel function is important because it creates the kernel matrix, which summarizes all of the data.
- Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, ...).
- In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try.
- See also "A practical guide to SVM classification" by Chih-Chung Chang and Chih-Jen Lin.
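The numbers in this worked example can be verified with a few lines of NumPy. The snippet below (added here as an illustration) plugs the α values from slide 36 into f(x) = ∑i αi yi K(xi, x) + b, recovers b from a support vector, and confirms that f evaluates to roughly +1, −1, +1 on x2, x4, x5.

```python
import numpy as np

# 1-D toy data from the slides: {1, 2, 6} are class 1 (+1), {4, 5} are class 2 (-1)
x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])

# alpha values reported on slide 36 (alpha_1 = alpha_3 = 0)
alpha = np.array([0.0, 2.5, 0.0, 7.333, 4.833])

K = (np.outer(x, x) + 1.0) ** 2             # degree-2 polynomial kernel matrix

print("sum_i alpha_i y_i =", alpha @ y)     # dual equality constraint, ~0

# f(x_j) = sum_i alpha_i y_i K(x_i, x_j) + b; recover b from support vector x_2 = 2
scores = K @ (alpha * y)
b = y[1] - scores[1]                        # y_2 (scores[1] + b) = 1 with y_2 = +1
print("b =", b)                             # ~9, as on the slide

print("f on x2, x4, x5:", scores[[1, 3, 4]] + b)   # approximately +1, -1, +1
```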