
Empirical risk minimization

Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance. The core idea is that we cannot know exactly how well an algorithm will work in practice (the true 'risk') because we do not know the true distribution of data that the algorithm will work on, but we can instead measure its performance on a known set of training data (the 'empirical' risk).

Consider the following situation, which is a general setting of many supervised learning problems. We have two spaces of objects $X$ and $Y$ and would like to learn a function $h: X \to Y$ (often called a hypothesis) which outputs an object $y \in Y$, given $x \in X$. To do so, we have at our disposal a training set of $n$ examples $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in X$ is an input and $y_i \in Y$ is the corresponding response that we wish to get from $h(x_i)$.

To put it more formally, we assume that there is a joint probability distribution $P(x, y)$ over $X$ and $Y$, and that the training set consists of $n$ instances $(x_1, y_1), \ldots, (x_n, y_n)$ drawn i.i.d. from $P(x, y)$. Note that the assumption of a joint probability distribution allows us to model uncertainty in predictions (e.g. from noise in the data), because $y$ is not a deterministic function of $x$, but rather a random variable with conditional distribution $P(y \mid x)$ for a fixed $x$.

We also assume that we are given a non-negative real-valued loss function $L(\hat{y}, y)$ which measures how different the prediction $\hat{y}$ of a hypothesis is from the true outcome $y$. The risk associated with hypothesis $h(x)$ is then defined as the expectation of the loss function:

$$R(h) = \mathbf{E}\left[L(h(x), y)\right] = \int L(h(x), y)\, dP(x, y).$$

A loss function commonly used in theory is the 0-1 loss function: $L(\hat{y}, y) = I(\hat{y} \neq y)$, where $I(\cdot)$ is the indicator function.

The ultimate goal of a learning algorithm is to find a hypothesis $h^{*}$ among a fixed class of functions $\mathcal{H}$ for which the risk $R(h)$ is minimal:

$$h^{*} = \underset{h \in \mathcal{H}}{\arg\min}\; R(h).$$

In general, the risk $R(h)$ cannot be computed because the distribution $P(x, y)$ is unknown to the learning algorithm (this situation is referred to as agnostic learning).
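To make the notion of risk concrete, the following minimal Python sketch evaluates a fixed threshold hypothesis under a synthetic joint distribution using the 0-1 loss. The distribution, the hypothesis, and the helper names (`sample_P`, `h`) are illustrative assumptions made for this sketch, not part of the original article.

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_one_loss(y_hat, y):
    """0-1 loss L(y_hat, y) = I(y_hat != y)."""
    return (y_hat != y).astype(float)

def sample_P(n):
    """Draw n pairs from a hypothetical joint distribution P(x, y):
    x is uniform on [0, 1], and y = 1[x > 0.5] with its label flipped
    with probability 0.1, so y is a noisy (non-deterministic) function of x."""
    x = rng.uniform(0.0, 1.0, size=n)
    flip = rng.uniform(size=n) < 0.1
    y = ((x > 0.5) ^ flip).astype(int)
    return x, y

def h(x):
    """A fixed, purely illustrative hypothesis: threshold at 0.5."""
    return (x > 0.5).astype(int)

# R(h) = E[L(h(x), y)]: approximate the expectation with a very large sample,
# which is only possible because we chose P(x, y) ourselves.
x_big, y_big = sample_P(1_000_000)
print(f"approximate true risk R(h): {zero_one_loss(h(x_big), y_big).mean():.3f}")
```

Approximating $R(h)$ by brute-force sampling is only possible here because we constructed $P(x, y)$ ourselves; a learning algorithm sees only the finite training sample.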
However, we can compute an approximation, called empirical risk, by averaging the loss function on the training set:

$$R_{\mathrm{emp}}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i).$$

The empirical risk minimization principle states that the learning algorithm should choose a hypothesis $\hat{h}$ which minimizes the empirical risk:

$$\hat{h} = \underset{h \in \mathcal{H}}{\arg\min}\; R_{\mathrm{emp}}(h).$$

Thus the learning algorithm defined by the ERM principle consists in solving the above optimization problem.
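The sketch below continues the same toy setting and carries out the ERM principle over a small hypothesis class of threshold classifiers; the class $\mathcal{H}$, the sample size, and the helper names are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def zero_one_loss(y_hat, y):
    return (y_hat != y).astype(float)

def empirical_risk(h, xs, ys):
    """R_emp(h) = (1/n) * sum_i L(h(x_i), y_i): the average training loss."""
    return zero_one_loss(h(xs), ys).mean()

# A training set (x_1, y_1), ..., (x_n, y_n) of n = 50 examples, drawn here
# from the same hypothetical noisy-threshold distribution as above.
n = 50
xs = rng.uniform(0.0, 1.0, size=n)
ys = ((xs > 0.5) ^ (rng.uniform(size=n) < 0.1)).astype(int)

# A fixed hypothesis class H of threshold classifiers h_t(x) = 1[x > t].
thresholds = np.linspace(0.0, 1.0, 101)
H = [lambda x, t=t: (x > t).astype(int) for t in thresholds]

# The ERM principle: choose the hypothesis in H with minimal empirical risk.
risks = [empirical_risk(h, xs, ys) for h in H]
best = int(np.argmin(risks))
print(f"ERM chooses t = {thresholds[best]:.2f}, empirical risk = {risks[best]:.3f}")
```

The hypothesis returned this way minimizes training error only; how close its true risk $R(\hat{h})$ comes to $R(h^{*})$ is what the theoretical bounds mentioned in the introduction address.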

[ "Support vector machine", "Generalization error", "Machine learning", "Mathematical optimization", "Artificial intelligence" ]
Parent Topic
Child Topic
    No Parent Topic