Robust Sequential Prediction in Linear Regression with Student's t-Distribution

2016 
The Predictive Least Squares (PLS) model selection criterion is known to be consistent in the context of linear regression. For small sample sizes, however, it can exhibit erratic behavior. We show that this shortcoming can be amended by incorporating a Student's t-distribution into PLS. The resulting criterion is shown to be asymptotically equivalent to PLS but significantly more robust for small sample sizes. A scale parameter of the t-distribution can be used to incorporate an estimate of the scale of the noise; it is shown that the new criterion is robust with regard to the choice of this parameter and that its effect disappears asymptotically. The recently proposed Sequentially Normalized Least Squares (SNLS) criterion can be written in a form that admits a similar interpretation, with the exception that the scale parameter of the t-distribution is estimated sequentially from the data. Numerical experiments are presented; they indicate that using a Student's t-distribution enhances model selection performance and that the benefit of the scale estimator of SNLS is negligible.

Introduction

Linear regression has recently received attention in the sequential or online setting, where work has been done on selecting a subset of the covariates (Määttä, Schmidt, and Roos 2015) and on finding a predictor that minimizes the worst-case regret (Bartlett et al. 2015). The probabilistic case that we consider also fits the prequential framework of Dawid (1984).

In this article, we concentrate on the subset selection problem, also called the model selection problem. We assume a fixed design matrix $Z_n \in \mathbb{R}^{n \times q}$ that consists of the row vectors $z_1, z_2, \ldots, z_n$. Associated with each sample $z_t$, we have a response $y_t \in \mathbb{R}$. The goal is to select a non-empty subset of the covariates, $\gamma \subseteq \{1, 2, \ldots, q\}$, that strikes a good balance between underfitting (poor prediction of the training data) and overfitting (poor generalization to future data).

In order to assess the performance of subset selection methods, one often introduces the assumption that the data $(y_{1:n}, Z_n)$ comes from the linear model

$$y_t = z_t \beta + e_t, \qquad (1)$$

where $\beta \in \mathbb{R}^q$ is a fixed coefficient vector and the $e_t$ are i.i.d. noise terms with $\mathrm{E}[e_t] = 0$ and $\mathrm{E}[e_t^2] = \sigma^2 < \infty$. In this setting, a subset selection method is said to be consistent if its probability of selecting the $\gamma$ that corresponds exactly to the non-zero elements of $\beta$ approaches one as the sample size $n$ tends to infinity. This $\gamma$ is referred to as the true model or true subset.

For the batch case, where the score of a subset $\gamma$ cannot be represented in a sequential manner, there are numerous methods (McQuarrie and Tsai 1998). Perhaps the most well-known of these is the Bayesian Information Criterion (Akaike 1978; Schwarz 1978), or BIC, which is also known as the Schwarz Information Criterion (SIC). For the model (1), the BIC criterion is

$$\mathrm{BIC}(y_{1:n}, Z_n, \gamma) := n \log \hat{\sigma}^2_{n,\gamma} + |\gamma| \log n, \qquad (2)$$

where $|\gamma|$ is the cardinality of $\gamma$ and

$$\hat{\sigma}^2_{n,\gamma} := \frac{1}{n} \sum_{t=1}^{n} \left( y_t - z_t \hat{\beta}_n \right)^2. \qquad (3)$$

Here and later, $\hat{\beta}_n \in \mathbb{R}^q$ denotes the maximum likelihood estimate of $\beta$ computed from the first $n$ samples, with the restriction that the entries of $\hat{\beta}_n$ that are not present in $\gamma$ are forced to be zero.

As for information criteria based on sequential prediction, we are aware of only two (besides Bayesian methods that admit a sequential interpretation). The first is the Predictive Least Squares (PLS) criterion, introduced by Rissanen (1986), which is the accumulated squared one-step prediction error

$$\mathrm{PLS}(y_{1:n}, Z_n, \gamma) := \sum_{t=m+1}^{n} \left( y_t - z_t \hat{\beta}_{t-1} \right)^2, \qquad (4)$$

where $m$ is the smallest index for which the restricted estimate $\hat{\beta}_m$ is well defined.
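As a concrete illustration of how the batch score (2)-(3) and the sequential score (4) are computed, the following is a minimal NumPy sketch. It is not the authors' implementation: the function names, the synthetic data, and the convention of starting the PLS sum at the first index where the restricted least squares estimate can be fitted are assumptions made here for illustration.

```python
import numpy as np
from itertools import combinations

def ls_estimate(Z, y, gamma):
    # Least squares fit on all given samples; coefficients of
    # covariates outside the subset gamma are forced to zero.
    beta = np.zeros(Z.shape[1])
    idx = sorted(gamma)
    beta[idx] = np.linalg.lstsq(Z[:, idx], y, rcond=None)[0]
    return beta

def bic(y, Z, gamma):
    # BIC score of the subset gamma, eqs. (2)-(3); lower is better.
    n = len(y)
    resid = y - Z @ ls_estimate(Z, y, gamma)
    sigma2 = np.mean(resid ** 2)              # eq. (3)
    return n * np.log(sigma2) + len(gamma) * np.log(n)

def pls(y, Z, gamma):
    # PLS score of gamma, eq. (4): accumulated squared one-step
    # prediction errors, refitting beta on the preceding samples only.
    n, m = len(y), len(gamma)
    total = 0.0
    for t in range(m, n):        # need at least |gamma| samples to fit
        beta = ls_estimate(Z[:t], y[:t], gamma)
        total += (y[t] - Z[t] @ beta) ** 2
    return total

# Synthetic data from model (1); the true subset is {0, 2}.
rng = np.random.default_rng(0)
n, q = 200, 4
Z = rng.normal(size=(n, q))
beta_true = np.array([1.5, 0.0, -2.0, 0.0])
y = Z @ beta_true + rng.normal(size=n)

# Exhaustively score every non-empty subset with both criteria.
subsets = [set(c) for r in range(1, q + 1)
           for c in combinations(range(q), r)]
print("BIC selects:", min(subsets, key=lambda g: bic(y, Z, g)))
print("PLS selects:", min(subsets, key=lambda g: pls(y, Z, g)))
```

On data of this kind both criteria typically recover the true subset once $n$ is moderately large; note that the brute-force enumeration over all $2^q - 1$ subsets is only feasible for small $q$.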