Bayesian interpretation of kernel regularization

In machine learning, kernel methods arise from the assumption of an inner product space or similarity structure on inputs. For some such methods, such as support vector machines (SVMs), the original formulation and its regularization were not Bayesian in nature, but it is helpful to understand them from a Bayesian perspective. Because the kernels are not necessarily positive semidefinite, the underlying structure may not be an inner product space but a more general reproducing kernel Hilbert space. In Bayesian probability, kernel methods are a key component of Gaussian processes, where the kernel function is known as the covariance function. Kernel methods have traditionally been used in supervised learning problems where the input space is usually a space of vectors and the output space is a space of scalars. More recently these methods have been extended to problems that deal with multiple outputs, such as in multi-task learning.

A mathematical equivalence between the regularization and the Bayesian point of view is easily proved in cases where the reproducing kernel Hilbert space is finite-dimensional. The infinite-dimensional case raises subtle mathematical issues; we consider here the finite-dimensional case. We start with a brief review of the main ideas underlying kernel methods for scalar learning, and briefly introduce the concepts of regularization and Gaussian processes. We then show how both points of view arrive at essentially equivalent estimators, and show the connection that ties them together.

The classical supervised learning problem requires estimating the output for some new input point $\mathbf{x}'$ by learning a scalar-valued estimator $\hat{f}(\mathbf{x}')$ on the basis of a training set $S$ consisting of $n$ input-output pairs, $S = (\mathbf{X}, \mathbf{Y}) = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$. Given a symmetric and positive bivariate function $k(\cdot, \cdot)$ called a kernel, one of the most popular estimators in machine learning is given by

$$\hat{f}(\mathbf{x}') = \mathbf{k}^\top (\mathbf{K} + \lambda n \mathbf{I})^{-1} \mathbf{Y}, \qquad (1)$$

where $\mathbf{K} \equiv k(\mathbf{X}, \mathbf{X})$ is the kernel matrix with entries $\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, $\mathbf{k} = [k(\mathbf{x}_1, \mathbf{x}'), \ldots, k(\mathbf{x}_n, \mathbf{x}')]^\top$, and $\mathbf{Y} = [y_1, \ldots, y_n]^\top$. We will see how this estimator can be derived both from a regularization and a Bayesian perspective.
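As a concrete illustration of the estimator in Eq. (1), the short sketch below evaluates $\hat{f}(\mathbf{x}')$ with plain NumPy. This is a minimal sketch rather than code from any particular source: the Gaussian (RBF) kernel, the toy sine data, and the values of $\lambda$ and the kernel width $\gamma$ are illustrative assumptions.

```python
# Minimal sketch of the kernel estimator in Eq. (1): f_hat(x') = k^T (K + lambda*n*I)^{-1} Y.
# The RBF kernel, toy data, and parameter values are illustrative assumptions.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq_dists)

def kernel_estimator(X, Y, x_new, lam=0.01, gamma=1.0):
    """Evaluate f_hat(x') = k^T (K + lambda*n*I)^{-1} Y at a single new point x_new."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)                     # kernel matrix, K_ij = k(x_i, x_j)
    k = rbf_kernel(X, x_new[None, :], gamma)[:, 0]  # vector k = [k(x_1, x'), ..., k(x_n, x')]
    c = np.linalg.solve(K + lam * n * np.eye(n), Y)  # c = (K + lambda*n*I)^{-1} Y
    return k @ c

# Toy usage: noisy samples of a sine function.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(50, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
print(kernel_estimator(X, Y, np.array([1.0])))  # prediction at x' = 1.0, a smoothed estimate of sin(1.0)
```

Because Eq. (1) involves only the kernel matrix and a linear solve, any symmetric positive-definite kernel can be substituted for the RBF choice used here.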
The main assumption in the regularization perspective is that the set of functions $\mathcal{F}$ is assumed to belong to a reproducing kernel Hilbert space $\mathcal{H}_k$. A reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ is a Hilbert space of functions defined by a symmetric, positive-definite function $k : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ called the reproducing kernel, such that the function $k(\mathbf{x}, \cdot)$ belongs to $\mathcal{H}_k$ for all $\mathbf{x} \in \mathcal{X}$. There are three main properties that make an RKHS appealing:

1. The reproducing property, which gives the name to the space,
$$f(\mathbf{x}) = \langle f, k(\mathbf{x}, \cdot) \rangle_k \quad \text{for all } f \in \mathcal{H}_k,$$
where $\langle \cdot, \cdot \rangle_k$ is the inner product in $\mathcal{H}_k$.

2. Functions in an RKHS are in the closure of linear combinations of the kernel at given points, $f(\mathbf{x}) = \sum_i c_i k(\mathbf{x}_i, \mathbf{x})$.

3. The squared norm of such a function can be written in terms of the kernel, $\|f\|_k^2 = \sum_{i,j} c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{c}^\top \mathbf{K}\, \mathbf{c}$.

In the regularization perspective, the estimator (1) is derived as the minimizer over $f \in \mathcal{H}_k$ of the regularized functional
$$\frac{1}{n} \sum_{i=1}^{n} \left(f(\mathbf{x}_i) - y_i\right)^2 + \lambda \|f\|_k^2. \qquad (2)$$
The representer theorem states that the minimizer of (2) can always be written as a linear combination of the kernels centered at the training points,
$$\hat{f}(\mathbf{x}') = \sum_{i=1}^{n} c_i k(\mathbf{x}_i, \mathbf{x}') = \mathbf{k}^\top \mathbf{c}, \qquad (3)$$
for some $\mathbf{c} \in \mathbb{R}^n$; substituting (3) into (2) and minimizing over $\mathbf{c}$ yields $\mathbf{c} = (\mathbf{K} + \lambda n \mathbf{I})^{-1} \mathbf{Y}$, which recovers the estimator (1).
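To make the regularization/Bayesian connection concrete, the following sketch (again an illustration under assumed kernel, data, and parameter choices, not code from the article) checks numerically that the representer-theorem solution $\mathbf{c} = (\mathbf{K} + \lambda n \mathbf{I})^{-1} \mathbf{Y}$ of Eqs. (2)-(3) coincides with the posterior mean of a Gaussian process whose covariance function is $k$ and whose noise variance is $\sigma^2 = \lambda n$.

```python
# Illustrative check that the regularized estimator of Eqs. (1)-(3) equals the
# Gaussian-process posterior mean when the noise variance is sigma^2 = lambda * n.
# Kernel choice, data, and parameters are assumptions made for this sketch.
import numpy as np

rng = np.random.default_rng(1)
n, lam, gamma = 30, 0.05, 0.5
X = rng.uniform(-2.0, 2.0, size=(n, 1))
Y = np.cos(2.0 * X[:, 0]) + 0.1 * rng.standard_normal(n)
x_new = np.array([[0.3]])

def rbf(A, B):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

K = rbf(X, X)            # kernel matrix / prior covariance, K_ij = k(x_i, x_j)
k = rbf(X, x_new)[:, 0]  # k = [k(x_1, x'), ..., k(x_n, x')]

# Regularization view: representer coefficients c = (K + lambda*n*I)^{-1} Y, Eqs. (1) and (3).
f_reg = k @ np.linalg.solve(K + lam * n * np.eye(n), Y)

# Bayesian view: GP posterior mean k^T (K + sigma^2 I)^{-1} Y with sigma^2 = lambda * n.
sigma2 = lam * n
f_bayes = k @ np.linalg.solve(K + sigma2 * np.eye(n), Y)

print(np.isclose(f_reg, f_bayes))  # True: the two perspectives give the same estimator
```

The agreement is exact because, in this finite-dimensional setting, both perspectives produce the same linear function $\mathbf{k}^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{Y}$ of the observations.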

[ "Artificial neural network" ]
Parent Topic
Child Topic
    No Parent Topic