Radial-Basis Function Networks (Haykin Chapter 5)
CPSC 636-600, Instructor: Yoonsuck Choe, Spring 2008

Learning in MLP

• Supervised learning in multilayer perceptrons can be approached in two ways:
  – As a recursive technique of stochastic approximation, e.g., backpropagation.
  – As a curve-fitting (approximation) problem, e.g., RBF networks.
• Curve-fitting:
  – Finding a surface in a multidimensional space that provides a best fit to the training data.
  – "Best fit" is measured in a certain statistical sense.
  – RBF networks are an example: the hidden neurons form an arbitrary basis for the input patterns when they are expanded into the hidden space. These basis functions are called radial basis functions.

Radial-Basis Function Networks

[Figure: three-layer network with inputs feeding hidden units φ_1, ..., φ_4, which connect to the output through weights W.]

Three layers:
• Input.
• Hidden: nonlinear transformation from the input space to the hidden space.
• Output: linear activation.

Principal motivation: Cover's theorem, which states that a pattern-classification problem cast into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

Cover's Theorem

Cover's theorem on the separability of patterns: a complex pattern-classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

Basic idea: nonlinearly map points in the input space to a hidden space that has a higher dimension than the input space. Once the proper mapping is done, simple and fast algorithms can be used to find the separating hyperplane.

φ-Separability of Patterns

• N input patterns X = {x_1, x_2, ..., x_N} in m_0-dimensional space.
• The inputs belong to either of two sets X_1 and X_2: they form a dichotomy.
• The dichotomy is separable with respect to a family of surfaces if a surface exists in the family that separates the points in class X_1 from those in X_2.
• For each x ∈ X, define an m_1-vector of hidden functions {φ_i(x) | i = 1, 2, ..., m_1}:

    φ(x) = [φ_1(x), φ_2(x), ..., φ_{m_1}(x)]^T,

  which maps inputs in the m_0-dimensional input space to the m_1-dimensional hidden space. The φ_i(x) are called hidden functions, and the space spanned by them is called the hidden space or feature space.
• A dichotomy is φ-separable if there exists an m_1-dimensional vector w such that

    w^T φ(x) > 0,  x ∈ X_1
    w^T φ(x) < 0,  x ∈ X_2,

  with separating hyperplane w^T φ(x) = 0.

Cover's Theorem Revisited

• Given a set X of N inputs picked from the input space independently, suppose all possible dichotomies of X are equiprobable.
• Let P(N, m_1) denote the probability that a particular dichotomy picked at random is φ-separable, where the family of surfaces has m_1 degrees of freedom.
• In this case,

    P(N, m_1) = \left(\frac{1}{2}\right)^{N-1} \sum_{m=0}^{m_1 - 1} \binom{N-1}{m},

  where

    \binom{l}{m} = \frac{l(l-1)(l-2) \cdots (l-m+1)}{m!}.

Cover's Theorem: Interpretation

• Separability depends on (1) the particular dichotomy and (2) the distribution of patterns in the input space.
• The derived P(N, m_1) states that the probability of being φ-separable equals the cumulative binomial distribution corresponding to the probability that N − 1 flips of a fair coin result in m_1 − 1 or fewer heads.
• In sum, Cover's theorem has two basic ingredients:
  – Nonlinear mapping to the hidden space with φ_i(x), i = 1, 2, ..., m_1.
  – High dimensionality of the hidden space compared to the input space (m_1 > m_0).
• Corollary: a maximum of 2m_1 patterns can be linearly separated by a hidden space of dimension m_1.
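To make the counting concrete, here is a small Python sketch (not from the slides) that evaluates P(N, m_1) directly from the formula above. The only assumptions are the use of Python's math.comb and the toy values of m_1; the loop evaluates P at N = 2m_1, the capacity highlighted by the corollary, where the separation probability drops to 1/2.

```python
from math import comb

def p_separable(N, m1):
    """P(N, m1) = (1/2)^(N-1) * sum_{m=0}^{m1-1} C(N-1, m):
    probability that a random dichotomy of N points is phi-separable
    when the family of separating surfaces has m1 degrees of freedom."""
    return 0.5 ** (N - 1) * sum(comb(N - 1, m) for m in range(m1))

# At N = 2*m1 the probability of phi-separability is exactly 1/2 for any m1,
# which is the capacity referred to in the corollary above.
for m1 in (2, 5, 10):
    print(m1, p_separable(2 * m1, m1))   # prints 0.5 each time
```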
Example: XOR (again!)

• With Gaussian hidden functions, the inputs become linearly separable in the hidden space:

    φ_1(x) = exp(−‖x − t_1‖^2),  t_1 = [1, 1]^T
    φ_2(x) = exp(−‖x − t_2‖^2),  t_2 = [0, 0]^T

Other Perspectives on RBF Learning

• RBF learning is formulated as

    F(x) = \sum_{k=1}^{m_0} w_k φ_k(x).

• This kind of expression was given without much rationale, other than intuitive appeal.
• However, there is a way to derive the above formalism from an interesting theoretical point of view, which we will see next.

Supervised Learning as an Ill-Posed Hypersurface Reconstruction Problem

• The exact interpolation approach has limitations:
  – Poor generalization: data points more numerous than the degrees of freedom of the underlying process can lead to overfitting.
• How to overcome this issue?
  – Approach the problem from the perspective that learning is a hypersurface reconstruction problem given a sparse set of data points.
  – Contrast the direct problem (in many cases well-posed) with the inverse problem (in many cases ill-posed).

Well-Posed Problems in Reconstructing a Functional Mapping

Given an unknown mapping from domain X to range Y, we want to reconstruct the mapping f. This mapping is well-posed if all of the following conditions are satisfied:
• Existence: for every x ∈ X, there exists an output y ∈ Y such that y = f(x).
• Uniqueness: for all x, t ∈ X, f(x) = f(t) iff x = t.
• Continuity: the mapping is continuous, i.e., for any ε > 0 there exists δ = δ(ε) such that ρ_X(x, t) < δ implies ρ_Y(f(x), f(t)) < ε, where ρ(·, ·) is the distance measure.
If any of these conditions is violated, the problem is called an ill-posed problem.

Ill-Posed Problems and Solutions

• Direct (causal) mappings are generally well-posed (e.g., projecting a 3D object to a 2D image).
• On the other hand, inverse problems are ill-posed (e.g., reconstructing 3D structure from 2D projections).
• For ill-posed problems, solutions are not unique (in many cases there are infinitely many): we need prior knowledge (or some kind of preference) to narrow down the range of solutions. This is called regularization.

Treating supervised learning as an ill-posed problem, and using certain prior knowledge, we can derive the RBF formalism.

Regularization Theory: Overview

• The main idea behind regularization is to stabilize the solution by means of prior information.
• This is done by including a functional (a function that maps a function to a scalar) in the cost function, so that the functional is also minimized. Only a small number of candidate solutions will minimize this functional.
• Such a functional term is called a regularization term.
• Typically, the functionals measure the smoothness of the function.

Tikhonov's Regularization Theory

• Task: given inputs x_i ∈ R^{m_0} and targets d_i ∈ R^1, find F(x).
• Minimize the sum of two terms:
  1. Standard error term:

     E_s(F) = \frac{1}{2} \sum_{i=1}^{N} (d_i − y_i)^2 = \frac{1}{2} \sum_{i=1}^{N} (d_i − F(x_i))^2.

  2. Regularization term:

     E_c(F) = \frac{1}{2} ‖DF‖^2,

     where D is a linear differential operator and ‖·‖ is the norm of the function space.
• Putting these together, we want to minimize (with regularization parameter λ)

    E(F) = E_s(F) + λ E_c(F) = \frac{1}{2} \sum_{i=1}^{N} (d_i − F(x_i))^2 + \frac{1}{2} λ ‖DF‖^2.
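As a rough numerical illustration of the Tikhonov cost (not part of the slides), the sketch below evaluates E(F) for a 1-D Gaussian RBF expansion on toy data. Taking D to be the second-derivative operator, approximating ‖DF‖² by finite differences on a dense grid, and the data and parameter values themselves are all assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical 1-D training data (assumption: not from the slides).
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
d_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(20)

def rbf_expansion(x, centers, weights, sigma=0.1):
    """F(x) = sum_i w_i * exp(-(x - t_i)^2 / (2 sigma^2))."""
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * sigma**2)) @ weights

def tikhonov_cost(weights, centers, lam, sigma=0.1):
    # Standard error term: E_s = 1/2 * sum_i (d_i - F(x_i))^2.
    err = d_train - rbf_expansion(x_train, centers, weights, sigma)
    E_s = 0.5 * np.sum(err**2)
    # Regularization term: E_c = 1/2 * ||DF||^2, with D assumed to be the
    # second derivative and the norm approximated numerically on a grid.
    grid = np.linspace(0, 1, 400)
    F = rbf_expansion(grid, centers, weights, sigma)
    h = grid[1] - grid[0]
    F2 = (F[2:] - 2 * F[1:-1] + F[:-2]) / h**2   # F''(x) on the interior grid
    E_c = 0.5 * np.sum(F2**2) * h                # numerical integral of (DF)^2
    return E_s + lam * E_c

centers = x_train.copy()            # one center per data point (regularization network)
weights = 0.1 * rng.standard_normal(len(centers))
print(tikhonov_cost(weights, centers, lam=1e-4))
```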
Error Term vs. Regularization Term

[Figure: three example fits (data points plus fitted curve, with the corresponding RBFs plotted below each fit) illustrating the effect of the regularization term.]

             Error      Regularization
    Left     Bad fit    Extremely smooth
    Middle   Good fit   Smooth
    Right    Overfit    Jagged

Try this demo: http://lcn.epfl.ch/tutorial/english/rbf/html/.

Solution that Minimizes E(F)

• Problem: minimize

    E(F) = E_s(F) + λ E_c(F) = \frac{1}{2} \sum_{i=1}^{N} (d_i − F(x_i))^2 + \frac{1}{2} λ ‖DF‖^2.

• Solution: the F_λ(x) that satisfies the Euler–Lagrange equation below minimizes E(F):

    \tilde{D} D F_λ(x) − \frac{1}{λ} \sum_{i=1}^{N} [d_i − F(x_i)] δ(x − x_i) = 0,

  where \tilde{D} is the adjoint operator of D and δ(·) is the Dirac delta function.

Solution that Minimizes E(F) (cont'd)

• The solution to the Euler–Lagrange equation can be formulated in terms of the Green's function that satisfies

    \tilde{D} D G(x, x') = δ(x − x').

  Note: the form of G(·, ·) depends on the particular choice of D.
• Finally, the desired function F_λ(x) that minimizes E(F) is

    F_λ(x) = \frac{1}{λ} \sum_{i=1}^{N} [d_i − F(x_i)] G(x, x_i).

Solution that Minimizes E(F) (cont'd)

• Letting

    w_i = \frac{1}{λ} [d_i − F(x_i)],  i = 1, 2, ..., N,

  we can recast F_λ(x) as

    F_λ(x) = \sum_{i=1}^{N} w_i G(x, x_i).

• Plugging in input x_j, we get

    F_λ(x_j) = \sum_{i=1}^{N} w_i G(x_j, x_i).

• Note the similarity to the RBF expansion

    F(x) = \sum_{i=1}^{N} w_i φ(‖x − x_i‖).

Solution that Minimizes E(F) (cont'd)

We can use matrix notation:

    F_λ = [F_λ(x_1), F_λ(x_2), ..., F_λ(x_N)]^T
    d = [d_1, d_2, ..., d_N]^T
    G = \begin{bmatrix}
          G(x_1, x_1) & G(x_1, x_2) & \cdots & G(x_1, x_N) \\
          G(x_2, x_1) & G(x_2, x_2) & \cdots & G(x_2, x_N) \\
          \vdots      & \vdots      & \ddots & \vdots      \\
          G(x_N, x_1) & G(x_N, x_2) & \cdots & G(x_N, x_N)
        \end{bmatrix}
    w = [w_1, w_2, ..., w_N]^T.

Then we can rewrite the formulas from the previous slide as w = \frac{1}{λ}(d − F_λ) and F_λ = Gw.

Solution that Minimizes E(F) (cont'd)

• Combining

    w = \frac{1}{λ}(d − F_λ)
    F_λ = Gw,

  we can eliminate F_λ to get (G + λI)w = d.
• From this, we can get the weights

    w = (G + λI)^{−1} d

  if G + λI is invertible (it needs to be positive definite, which can be ensured by a sufficiently large λ). Note: G(x_i, x_j) = G(x_j, x_i), thus G^T = G.

Regularization Networks vs. Generalized RBF

[Figure: a regularization network with N Green's-function units G versus a generalized RBF network with m_1 units φ plus a bias unit φ = 1 and bias weight w_0 = b.]

• The hidden layer in a generalized RBF (GRBF) network is much smaller: m_1 < N.
• In GRBF, (1) the weights, (2) the RBF centers t_i, and (3) the norm weighting matrix are all unknown parameters to be determined.
• In regularization networks, the RBF centers are known (they are the inputs themselves), and only the weights need to be determined.

Estimating the Parameters

• Weights w_i: already discussed (more next).
• Regularization parameter λ:
  – Minimize the averaged squared error.
  – Use generalized cross-validation.
• RBF centers:
  – Randomly select fixed centers.
  – Self-organized selection.
  – Supervised selection.

Estimating the Regularization Parameter λ

• Minimize the average squared error: for a fixed λ, over all N inputs, calculate the squared error between the true function value and the estimated RBF network output using that λ. Find the optimal λ that minimizes this error. Problem: this requires knowledge of the true function values.
• Generalized cross-validation: use leave-one-out cross-validation. With a fixed λ, for each of the N inputs, find the difference between the target value (from the training set) and the value predicted by the network trained with that input left out. This approach depends only on the training set; a sketch follows below.
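Below is a minimal sketch (not from the slides) of the leave-one-out procedure just described, applied to a regularization network with one Green's-function unit per input and weights w = (G + λI)⁻¹d. The Gaussian Green's function, the toy data, and the σ and λ values are assumptions for illustration.

```python
import numpy as np

def gaussian_green(X1, X2, sigma=1.0):
    """G(x, x') = exp(-||x - x'||^2 / (2 sigma^2)): an assumed Gaussian Green's function."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :])**2, axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

def fit_reg_network(X, d, lam, sigma=1.0):
    """Regularization network: w = (G + lam*I)^(-1) d, one center per input."""
    G = gaussian_green(X, X, sigma)
    return np.linalg.solve(G + lam * np.eye(len(X)), d)

def loo_error(X, d, lam, sigma=1.0):
    """Leave-one-out cross-validation error for a fixed lambda."""
    errs = []
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        w = fit_reg_network(X[keep], d[keep], lam, sigma)
        pred = gaussian_green(X[i:i+1], X[keep], sigma) @ w
        errs.append((d[i] - pred[0])**2)
    return np.mean(errs)

# Hypothetical toy data; pick the lambda that minimizes the LOO error.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(30, 1))
d = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(30)
lambdas = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
best = min(lambdas, key=lambda lam: loo_error(X, d, lam, sigma=0.2))
print("best lambda:", best)
```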
RBF Learning

[Figure: generalized RBF network with inputs x_1, ..., x_m, hidden units φ (plus a bias unit φ = 1 with weight w_0 = b), and a linear output.]

The basic idea is to learn on two different time scales:
• Nonlinear, slow learning of the RBF parameters (centers, variances).
• Linear, fast learning of the hidden-to-output weights.

RBF Learning (1/3): Random Centers

• Use m_1 hidden units:

    G(‖x − t_i‖) = exp\left(− \frac{m_1}{d_{max}^2} ‖x − t_i‖^2\right),

  where the centers t_i (i = 1, 2, ..., m_1) are picked at random from the available inputs x_j (j = 1, 2, ..., N).
• Note that the standard deviation (width) of the RBF is fixed to

    σ = \frac{d_{max}}{\sqrt{2 m_1}},

  where d_{max} is the maximum distance between the chosen centers t_i. This gives a width that is neither too peaked nor too flat.
• The linear weights are learned using the pseudoinverse:

    w = G^+ d = (G^T G)^{−1} G^T d,

  where the matrix G = {g_{ji}}, with g_{ji} = G(‖x_j − t_i‖^2).

Finding G^+ with Singular Value Decomposition

If, for a real N × M matrix G, there exist orthogonal matrices

    U = [u_1, u_2, ..., u_N]
    V = [v_1, v_2, ..., v_M]

such that

    U^T G V = diag(σ_1, σ_2, ..., σ_K) = Σ,  K = min(M, N),

then U is called the left singular matrix, V the right singular matrix, and σ_1, σ_2, ..., σ_K the singular values of the matrix G. Once these are known, we can obtain G^+ as

    G^+ = V Σ^+ U^T,  where Σ^+ = diag\left(\frac{1}{σ_1}, \frac{1}{σ_2}, ..., \frac{1}{σ_K}\right).

There are efficient algorithms for singular value decomposition that can be used for this.

Finding G^+ with Singular Value Decomposition (cont'd)

Using the properties

    U^{−1} = U^T,  V^{−1} = V^T,  Σ Σ^+ = I,

we can verify that G G^+ = I:

    U^T G V = Σ
    U U^T G V V^T = U Σ V^T
    G = U Σ V^T
    G G^+ = U Σ V^T V Σ^+ U^T = U Σ Σ^+ U^T = U U^T = I.

RBF Learning (2/3): Self-Organized Centers

The random-center approach is only effective with large input sets. To overcome this, we can take a hybrid approach: (1) self-organized learning of the centers, and (2) supervised learning of the linear weights.

Clustering for RBF center learning (similar to self-organizing maps):
1. Initialization: randomly choose distinct initial centers t_k(0).
2. Sampling: draw a random input vector x(n) ∈ X.
3. Similarity matching: find the best-matching center vector t_{k(x)}:

    k(x) = \arg\min_k ‖x(n) − t_k(n)‖.

4. Updating: update the center vectors:

    t_k(n+1) = t_k(n) + η [x(n) − t_k(n)]  if k = k(x),
    t_k(n+1) = t_k(n)                      otherwise.

5. Continuation: increment n and repeat from step 2.

RBF Learning (3/3): Supervised Selection of Centers

Use error-correction learning to adjust all RBF parameters to minimize the error cost function

    E = \frac{1}{2} \sum_{j=1}^{N} e_j^2,  e_j = d_j − F^*(x_j) = d_j − \sum_{i=1}^{M} w_i G(‖x_j − t_i‖_{C_i}).

• Linear weights (output layer): w_i(n+1) = w_i(n) − η_1 \frac{∂E(n)}{∂w_i(n)}.
• Position of centers (hidden layer): t_i(n+1) = t_i(n) − η_2 \frac{∂E(n)}{∂t_i(n)}.
• Spread of centers (hidden layer): Σ_i^{−1}(n+1) = Σ_i^{−1}(n) − η_3 \frac{∂E(n)}{∂Σ_i^{−1}(n)}.

RBF Learning (3/3): Supervised Selection of Centers (cont'd)

• Linear weights:

    \frac{∂E(n)}{∂w_i(n)} = \sum_{j=1}^{N} e_j(n) G(‖x_j − t_i(n)‖_{C_i}).

• Position of centers:

    \frac{∂E(n)}{∂t_i(n)} = 2 w_i(n) \sum_{j=1}^{N} e_j(n) G'(‖x_j − t_i(n)‖_{C_i}) Σ_i^{−1}(n) [x_j − t_i(n)].

• Spread of centers:

    \frac{∂E(n)}{∂Σ_i^{−1}(n)} = −w_i(n) \sum_{j=1}^{N} e_j(n) G'(‖x_j − t_i(n)‖_{C_i}) Q_{ji}(n),
    Q_{ji}(n) = [x_j − t_i(n)][x_j − t_i(n)]^T.
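As a concrete sketch of strategy (1/3) above (random fixed centers), the code below draws m_1 centers from the inputs, fixes the width from d_max as on the slide, and solves for the linear weights with the pseudoinverse, which NumPy's np.linalg.pinv computes via singular value decomposition. The toy data and the value of m_1 are assumptions for illustration only.

```python
import numpy as np

def train_rbf_fixed_centers(X, d, m1, rng=None):
    """Fixed-centers strategy: random centers, width sigma = d_max / sqrt(2*m1),
    linear weights via the pseudoinverse."""
    if rng is None:
        rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=m1, replace=False)]
    # d_max is the maximum distance between the chosen centers.
    d_max = np.sqrt(((centers[:, None, :] - centers[None, :, :])**2).sum(-1)).max()
    beta = m1 / d_max**2                       # so G(r) = exp(-beta * r^2)
    # Design matrix g_ji = G(||x_j - t_i||).
    G = np.exp(-beta * ((X[:, None, :] - centers[None, :, :])**2).sum(-1))
    # w = G^+ d; np.linalg.pinv computes the pseudoinverse by SVD,
    # equivalent to (G^T G)^(-1) G^T when G has full column rank.
    w = np.linalg.pinv(G) @ d
    return centers, beta, w

def rbf_predict(X, centers, beta, w):
    G = np.exp(-beta * ((X[:, None, :] - centers[None, :, :])**2).sum(-1))
    return G @ w

# Hypothetical toy usage.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 2))
d = np.sin(2 * np.pi * X[:, 0]) * np.cos(np.pi * X[:, 1])
centers, beta, w = train_rbf_fixed_centers(X, d, m1=15, rng=rng)
print("training MSE:", np.mean((d - rbf_predict(X, centers, beta, w))**2))
```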
Comparison of RBF and MLP

• RBF networks have a single hidden layer, while MLPs can have many.
• In an MLP, hidden and output neurons share the same underlying neuron model; in an RBF network, they serve distinct, specialized functions.
• In an RBF network the output layer is linear, but in an MLP all neurons are nonlinear.
• The hidden neurons in an RBF network compute the Euclidean distance between the input vector and the center, while in an MLP the inner product of the input vector and the weight vector is computed.
• MLPs construct global approximations to nonlinear input–output mappings. RBF networks use exponentially decaying, localized nonlinearities (e.g., Gaussians) to construct local approximations to nonlinear input–output mappings.

Summary

• The RBF network is unusual in having two different unit types: an RBF hidden layer and a linear output layer.
• Unlike the MLP, the RBF network is derived in a principled manner, starting from Tikhonov's regularization theory.
• In RBF networks, the smoothing term is important, with different operators D giving rise to different Green's functions G(·, ·).
• The generalized RBF network lifts the requirement of N hidden units for N input patterns, greatly reducing the computational complexity.
• Proper estimation of the regularization parameter λ is needed.