On Flat Minima, Large Margins and Generalizability
Citations: 0 · References: 0 · Related Papers: 20
Abstract:
The intuitive connection to robustness and convincing empirical evidence have made the flatness of the loss surface an attractive measure of generalizability for neural networks. Yet it suffers from various problems, such as computational difficulties, reparametrization issues, and a growing concern that it may only be an epiphenomenon of optimization methods. We provide empirical evidence that, under the cross-entropy loss, once a neural network reaches a non-trivial training error, the flatness correlates well (as measured by the Pearson correlation coefficient) with the classification margins, which allows us to better reason about the concerns surrounding flatness. Our results lead to the practical recommendation that when assessing generalizability one should consider a margin-based measure instead, as it is computationally more efficient, provides further insight, and is highly correlated with flatness. We also use our insight to replace the misleading folklore that small-batch methods generalize better because they are able to escape sharp minima. Instead, we argue that large-batch methods have not had enough time to maximize margins and hence generalize worse.
Keywords:
Flatness (loss landscape)
Maxima and minima
Robustness
Margin (machine learning)
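As a rough illustration of the comparison described in the abstract above, the sketch below trains a few small softmax classifiers on synthetic data, measures the mean classification margin and a crude sharpness proxy (the average loss increase under random weight perturbations), and reports the Pearson correlation between the two across models. The synthetic data, the sharpness proxy, and all names are illustrative assumptions, not the paper's actual protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, X, y):
    p = softmax(X @ W)
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

def train(X, y, lr, steps):
    """Plain gradient descent on a linear softmax classifier."""
    W = np.zeros((X.shape[1], y.max() + 1))
    for _ in range(steps):
        p = softmax(X @ W)
        p[np.arange(len(y)), y] -= 1.0
        W -= lr * X.T @ p / len(y)
    return W

def mean_margin(W, X, y):
    """Mean output-space margin: correct logit minus best competing logit."""
    logits = X @ W
    correct = logits[np.arange(len(y)), y].copy()
    logits[np.arange(len(y)), y] = -np.inf
    return (correct - logits.max(axis=1)).mean()

def sharpness_proxy(W, X, y, sigma=0.05, trials=20):
    """Average loss increase under random Gaussian weight perturbations."""
    base = cross_entropy(W, X, y)
    rises = [cross_entropy(W + sigma * rng.standard_normal(W.shape), X, y) - base
             for _ in range(trials)]
    return float(np.mean(rises))

# Synthetic 3-class data, separable along one coordinate per class.
X = rng.standard_normal((600, 10))
y = rng.integers(0, 3, size=600)
X[np.arange(600), y] += 2.0

margins, sharpnesses = [], []
for steps in (50, 200, 800, 3200):   # models trained for different lengths
    W = train(X, y, lr=0.5, steps=steps)
    margins.append(mean_margin(W, X, y))
    sharpnesses.append(sharpness_proxy(W, X, y))

r = np.corrcoef(margins, sharpnesses)[0, 1]
print("Pearson correlation between margin and sharpness proxy:", round(float(r), 3))
```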
Highly overparametrized neural networks can display curiously strong generalization performance - a phenomenon that has recently garnered a wealth of theoretical and empirical research aimed at better understanding it. In contrast to most previous work, which typically considers performance as a function of model size, in this paper we empirically study generalization performance as the size of the training set varies over multiple orders of magnitude. These systematic experiments lead to some interesting and potentially very useful observations; perhaps most notably, that training on smaller subsets of the data can lead to more reliable model selection decisions whilst simultaneously enjoying smaller computational costs. Our experiments furthermore allow us to estimate Minimum Description Lengths for common datasets given modern neural network architectures, thereby paving the way for principled model selection that takes Occam's razor into account.
Training set
Data set
Small data
Citations (5)
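A minimal sketch of the kind of subset-based model selection the abstract above argues for, with polynomial regression as a stand-in for neural architectures: three candidate model sizes are ranked on nested training subsets spanning roughly two orders of magnitude, so one can see whether the small-subset ranking already agrees with the full-data ranking. The task, the candidates, and the subset sizes are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression task: cubic ground truth plus noise, with a held-out validation set.
x = rng.uniform(-1, 1, size=2000)
y = x**3 - 0.5 * x + 0.1 * rng.standard_normal(x.size)
x_val = rng.uniform(-1, 1, size=500)
y_val = x_val**3 - 0.5 * x_val + 0.1 * rng.standard_normal(x_val.size)

def val_error(degree, n_train):
    """Fit a polynomial of the given degree on the first n_train points; return validation MSE."""
    coeffs = np.polyfit(x[:n_train], y[:n_train], deg=degree)
    pred = np.polyval(coeffs, x_val)
    return float(np.mean((pred - y_val) ** 2))

candidates = (1, 3, 9)               # three "model sizes"
for n in (50, 200, 800, 2000):       # training-set sizes spanning ~2 orders of magnitude
    errs = {d: round(val_error(d, n), 4) for d in candidates}
    best = min(errs, key=errs.get)
    print(f"n={n:5d}  best degree={best}  validation MSE per degree: {errs}")
```

If the degree chosen on the small subsets matches the one chosen on the full data, the cheaper runs were already sufficient for the selection decision, which is the observation the abstract highlights.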
Group testing is a fundamental problem in statistical inference with many real-world applications, including the need for massive group testing during the ongoing COVID-19 pandemic. In this paper we study the task of approximate recovery, in which we tolerate having a small number of incorrectly classified items.
One of the most well-known, optimal, and easy to implement testing procedures is the non-adaptive Bernoulli group testing, where all tests are conducted in parallel, and each item is chosen to be part of any certain test independently with some fixed probability. In this setting, there is an observed gap between the number of tests above which recovery is information theoretically possible, and the number of tests required by the currently best known efficient algorithms to succeed. In this paper we seek to understand whether this computational-statistical gap can be closed. Our main contributions are the following:
1. Oftentimes such gaps are explained by a phase transition in the landscape of the solution space of the problem (an Overlap Gap Property (OGP) phase transition). We provide first-moment evidence that, perhaps surprisingly, such a phase transition does not take place throughout the regime for which recovery is information theoretically possible. This fact suggests that the model is in fact amenable to local search algorithms.
2. We prove the complete absence of “bad” local minima for a part of the “hard” regime, a fact which implies an improvement over known theoretical results on the performance of efficient algorithms for approximate recovery without false-negatives.
3. Finally, motivated by the evidence for the absence of the OGP, we present extensive simulations that strongly suggest that a very simple local algorithm known as Glauber Dynamics does indeed succeed, and can be used to efficiently implement the well-known (theoretically optimal) Smallest Satisfying Set (SSS) estimator. Given that practical algorithms for this task utilize Branch and Bound and Linear Programming relaxation techniques, our finding could potentially be of practical interest.
Maxima and minima
Statistical Inference
Citations (0)
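The following sketch simulates non-adaptive Bernoulli group testing and runs a Glauber-dynamics local search over an energy that trades off set size against uncovered positive tests, i.e. a heuristic for the Smallest Satisfying Set estimator discussed in the abstract above. The problem sizes, Bernoulli design density, energy weights, and temperature are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

n, k, T = 200, 10, 120            # items, infected items, tests (illustrative sizes)
p = np.log(2) / k                 # Bernoulli design density, roughly ln(2)/k

truth = np.zeros(n, dtype=bool)
truth[rng.choice(n, size=k, replace=False)] = True

design = rng.random((T, n)) < p                   # design[t, i]: item i takes part in test t
outcome = (design & truth).any(axis=1)            # a test is positive iff it contains an infected item

# Items appearing in any negative test are definitely healthy; keep only the rest.
possibly_infected = ~((design & ~outcome[:, None]).any(axis=0))
pos_tests = design[outcome][:, possibly_infected]
m = int(possibly_infected.sum())

def energy(S, penalty=5.0):
    """Size of S plus a penalty for every positive test left uncovered by S."""
    uncovered = ~(pos_tests[:, S].any(axis=1))
    return S.sum() + penalty * uncovered.sum()

# Glauber dynamics over the candidate items at inverse temperature beta.
S = np.ones(m, dtype=bool)        # start from "every candidate is infected"
beta = 2.0
for _ in range(20000):
    i = rng.integers(m)
    S_in, S_out = S.copy(), S.copy()
    S_in[i], S_out[i] = True, False
    d = energy(S_in) - energy(S_out)
    S[i] = rng.random() < 1.0 / (1.0 + np.exp(beta * d))   # Gibbs update for item i

estimate = np.zeros(n, dtype=bool)
estimate[np.flatnonzero(possibly_infected)[S]] = True
print("misclassified items:", int((estimate != truth).sum()))
```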
Current methods for training robust networks lead to a drop in test accuracy, which has led prior works to posit that a robustness-accuracy tradeoff may be inevitable in deep learning. We take a closer look at this phenomenon and first show that real image datasets are actually separated. With this property in mind, we then prove that robustness and accuracy should both be achievable for benchmark datasets through locally Lipschitz functions, and hence, there should be no inherent tradeoff between robustness and accuracy. Through extensive experiments with robustness methods, we argue that the gap between theory and practice arises from two limitations of current methods: either they fail to impose local Lipschitzness or they are insufficiently generalized. We explore combining dropout with robust training methods and obtain better generalization. We conclude that achieving robustness and accuracy in practice may require using methods that impose local Lipschitzness and augmenting them with deep learning generalization techniques. Code available at this https URL
Robustness
Citations (25)
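Since the abstract above hinges on whether trained networks are locally Lipschitz, here is a minimal sketch of an empirical local-Lipschitz probe: it takes a stand-in random two-layer network and estimates, around a few points, the largest ratio of output change to input change over random small perturbations. The stand-in network and the sampling scheme are assumptions; in practice the probe would be applied to the trained model under study.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in classifier: a fixed random two-layer ReLU network (illustrative only).
W1 = rng.standard_normal((20, 64)) / np.sqrt(20)
W2 = rng.standard_normal((64, 10)) / np.sqrt(64)

def f(x):
    return np.maximum(x @ W1, 0.0) @ W2     # logits

def local_lipschitz(x, radius=0.1, samples=200):
    """Crude estimate: max ||f(x+d) - f(x)|| / ||d|| over random directions of length `radius`."""
    fx = f(x)
    best = 0.0
    for _ in range(samples):
        d = rng.standard_normal(x.shape)
        d *= radius / np.linalg.norm(d)
        best = max(best, np.linalg.norm(f(x + d) - fx) / np.linalg.norm(d))
    return best

points = rng.standard_normal((5, 20))
for i, x in enumerate(points):
    print(f"point {i}: local Lipschitz estimate = {local_lipschitz(x):.2f}")
```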
There is a myriad of phenomena that are better modelled with semi-infinite distribution families, many of which are studied in survival analysis. When performing inference, lack of knowledge of the populational minimum becomes a problem, which can be dealt with by making a good guess thereof, or by handcrafting a grid of initial parameters that will be useful for that particular problem. These solutions are fine when analyzing a single set of samples, but they become infeasible when there are multiple datasets and a case-by-case analysis would be too time-consuming. In this paper we propose methods to deal with the populational minimum in algorithmic, efficient and/or simple ways. Six methods are presented and analyzed, two of which have full theoretical support but lack simplicity. The other four are simple and have some theoretical grounding in non-parametric results such as the law of the iterated logarithm, and they exhibited very good results when it comes to maximizing the likelihood and to recycling the grid of initial parameters among the datasets. With our results, we hope to ease the inference process for practitioners, and we expect that these methods will eventually be included in software packages themselves.
Iterated function
Maxima and minima
Simplicity
Citations (1)
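As a small illustration of the problem the abstract above describes, the sketch below draws survival-style data shifted by an unknown populational minimum, plugs in a simple guess for that minimum (the sample minimum lowered by one spacing, an assumption made here for illustration and not one of the paper's six methods), shifts the data, and fits a two-parameter gamma distribution by maximum likelihood with SciPy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Synthetic survival-style data: a gamma distribution shifted by an unknown minimum.
true_min = 7.3
x = true_min + rng.gamma(shape=2.0, scale=1.5, size=500)

# Plug-in guess for the populational minimum: go slightly below the sample minimum,
# by the gap between the two smallest observations (illustrative choice).
xs = np.sort(x)
guess_min = xs[0] - (xs[1] - xs[0])

# Shift the data and fit a two-parameter gamma by maximum likelihood (location fixed at 0).
shape, loc, scale = stats.gamma.fit(x - guess_min, floc=0)
print(f"guessed minimum: {guess_min:.3f}  (true: {true_min})")
print(f"fitted gamma: shape={shape:.3f}, scale={scale:.3f}")
```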
Deep models, while being extremely versatile and accurate, are vulnerable to adversarial attacks: slight perturbations that are imperceptible to humans can completely flip the prediction of deep models. Many attack and defense mechanisms have been proposed, although a satisfying solution still largely remains elusive. In this work, we give strong evidence that during training, deep models maximize the minimum margin in order to achieve high accuracy, but at the same time decrease the average margin, hence hurting robustness. Our empirical results highlight an intrinsic trade-off between accuracy and robustness for current deep model training. To further address this issue, we propose a new regularizer to explicitly promote the average margin, and we verify through extensive experiments that it does lead to better robustness. Our regularized objective remains Fisher-consistent and hence can asymptotically still recover the Bayes optimal classifier.
Robustness
Margin (machine learning)
Citations (8)
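A toy sketch of the idea of explicitly promoting the average margin: a linear softmax classifier is trained with cross-entropy plus a term whose gradient pulls the true-class weight towards each input and pushes the strongest competitor away, and the minimum and average margins are compared with and without that term. This is not the paper's regularizer or its Fisher-consistency construction; the model, the data, and the lambda value are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(X, y, lam, lr=0.3, steps=2000):
    """Cross-entropy plus a term that rewards the average output-space margin."""
    n, d = X.shape
    C = y.max() + 1
    W = np.zeros((d, C))
    idx = np.arange(n)
    for _ in range(steps):
        logits = X @ W
        p = softmax(logits)
        grad_ce = X.T @ (p - np.eye(C)[y]) / n
        # Gradient of -(average margin): move the true-class weight towards x,
        # and the strongest competing class away from x.
        comp = logits.copy()
        comp[idx, y] = -np.inf
        j = comp.argmax(axis=1)
        gm = np.zeros((C, d))
        np.add.at(gm, y, -X)
        np.add.at(gm, j, X)
        W -= lr * (grad_ce + lam * gm.T / n)
    return W

def margins(W, X, y):
    logits = X @ W
    true = logits[np.arange(len(y)), y].copy()
    logits[np.arange(len(y)), y] = -np.inf
    return true - logits.max(axis=1)

# Synthetic 3-class data.
X = rng.standard_normal((600, 10))
y = rng.integers(0, 3, size=600)
X[np.arange(600), y] += 2.0

for lam in (0.0, 0.1):
    W = train(X, y, lam)
    m = margins(W, X, y)
    print(f"lambda={lam}: min margin={m.min():.2f}  average margin={m.mean():.2f}")
```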
As deep learning techniques have become more prevalent in computer vision, the need to explain these so-called 'black boxes' has increased. Indeed, these techniques are now being developed and deployed in such sensitive areas as medical imaging, autonomous vehicles, and security applications. Being able to create reliable explanations of their operations is therefore essential.
For images, a common method for explaining the predictions of a convolutional neural network is to highlight the regions of an input image that are deemed important. Many techniques have been proposed; however, these are often constrained to produce an explanation with a certain level of coarseness. Explanations can be created that either score individual pixels or score large regions of an image as a whole. It is difficult to create an explanation with a level of coarseness that falls in between these two. A potentially even greater problem is that none of these explanation techniques have been designed to explain what happens when a network fails to obtain the correct prediction. In these instances, current explanation techniques are not useful.
In this thesis, we propose two novel techniques that are able to efficiently create explanations that are neither too fine nor too coarse.
The first of these techniques uses superpixels weighted with gradients to create explanations of any desirable coarseness (within computational constraints). We show that we are able to produce explanations in an efficient way that have a higher accuracy than comparable existing methods. In addition, we find that our technique can be used in conjunction with existing techniques such as LIME to improve their accuracy. This is subsequently shown to generalise well for use in networks that use video as an input.
The second of these techniques creates multiple explanations using a rescaled input image to allow finer features to be found. We show that this performs much better than comparable techniques in both accuracy and weak-localisation metrics. With this technique, we also show that a common metric, faithfulness, is a flawed metric, and recommend that its use be discontinued.
Finally, we propose a third novel technique to address the issue of explaining failure using the concepts of surprise and expectation. By building an understanding of how a model has learnt to represent the training data, we can begin to explore the reasons for failure. Using this technique, we show that we can highlight regions in the image that have caused failure, explore features that may be missing from a misclassified image, and provide an insightful method to explore an unseen portion of a dataset.
Citations (0)
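To make the superpixel-weighting idea from the first technique concrete, here is a minimal sketch that segments an image into superpixels with SLIC and aggregates a per-pixel attribution map into per-superpixel scores. The attribution map is a random stand-in here (in the thesis setting it would come from backpropagated gradients of a CNN's class score), and the demo image and segment count are arbitrary choices.

```python
import numpy as np
from skimage import data, segmentation

rng = np.random.default_rng(6)

image = data.astronaut()                              # 512x512 RGB demo image
segments = segmentation.slic(image, n_segments=150, compactness=10, start_label=0)

# Stand-in per-pixel attribution map; in practice this would be a gradient magnitude map.
grad_map = np.abs(rng.standard_normal(image.shape[:2]))

# Score each superpixel by the mean attribution of the pixels it covers and build
# a coarse, superpixel-resolution saliency map.
saliency = np.zeros_like(grad_map)
scores = {}
for label in np.unique(segments):
    mask = segments == label
    scores[label] = grad_map[mask].mean()
    saliency[mask] = scores[label]

top = sorted(scores, key=scores.get, reverse=True)[:10]
print("ten highest-scoring superpixels:", top)
```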
Adversarial training is a powerful type of defense against adversarial examples. Previous empirical results suggest that adversarial training requires wider networks for better performance. However, it remains elusive how neural network width affects model robustness. In this paper, we carefully examine the relationship between network width and model robustness. Specifically, we show that model robustness is closely related to the tradeoff between natural accuracy and perturbation stability, which is controlled by the robust regularization parameter $\lambda$. With the same $\lambda$, wider networks can achieve better natural accuracy but worse perturbation stability, leading to potentially worse overall model robustness. To understand the origin of this phenomenon, we further relate perturbation stability to the network's local Lipschitzness. By leveraging recent results on neural tangent kernels, we theoretically show that wider networks tend to have worse perturbation stability. Our analyses suggest that: 1) the common strategy of first fine-tuning $\lambda$ on small networks and then directly using it for wide-model training can lead to deteriorated model robustness; 2) one needs to properly enlarge $\lambda$ to fully unleash the robustness potential of wider models. Finally, we propose a new Width Adjusted Regularization (WAR) method that adaptively enlarges $\lambda$ on wide models and significantly saves tuning time.
Robustness
Regularization
Citations (2)
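A hedged sketch of the kind of objective the abstract above refers to: a TRADES-style loss combining natural cross-entropy with lambda times a KL term over adversarially perturbed inputs, where lambda is enlarged with network width in the spirit of Width Adjusted Regularization. The square-root width schedule, network shapes, and inner-PGD settings are assumptions for illustration, not the paper's actual WAR rule.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def make_net(width):
    return torch.nn.Sequential(
        torch.nn.Linear(20, width), torch.nn.ReLU(), torch.nn.Linear(width, 3))

def trades_like_loss(net, x, y, lam, eps=0.1, steps=5, alpha=0.03):
    """Natural cross-entropy plus lam * KL(clean || perturbed), with a crude inner PGD."""
    delta = (0.001 * torch.randn_like(x)).requires_grad_(True)   # small random start
    for _ in range(steps):
        kl = F.kl_div(F.log_softmax(net(x + delta), dim=1),
                      F.softmax(net(x), dim=1), reduction="batchmean")
        grad, = torch.autograd.grad(kl, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    clean_logits = net(x)
    robust_kl = F.kl_div(F.log_softmax(net(x + delta), dim=1),
                         F.softmax(clean_logits, dim=1), reduction="batchmean")
    return F.cross_entropy(clean_logits, y) + lam * robust_kl

# Width-adjusted idea, schematically: grow lambda with width.
base_lam, base_width = 6.0, 64
x = torch.randn(128, 20)
y = torch.randint(0, 3, (128,))
for width in (64, 256, 1024):
    lam = base_lam * (width / base_width) ** 0.5   # assumed schedule, not the paper's formula
    net = make_net(width)
    loss = trades_like_loss(net, x, y, lam)
    print(f"width={width:4d}  lambda={lam:5.1f}  initial robust loss={loss.item():.3f}")
```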
Work on metalearning for algorithm selection has often been criticized because it mostly considers only the default parameter settings of the candidate base learning algorithms. Many have indeed argued that the choice of parameter values can have a significant impact on accuracy. Yet little empirical evidence exists to provide definitive support for that argument. Recent experiments do suggest that parameter optimization may indeed have an impact. However, the distribution of performance differences has a long tail, suggesting that in most cases parameter optimization has little effect on accuracy. In this paper, we revisit some of these results and use metalearning to characterize the situations when parameter optimization is likely to cause a significant increase in accuracy. In so doing, we show that 1) a relatively simple and efficient landmarker carries significant predictive power, and 2) metalearning for algorithm selection should be effected in two phases, the first in which one determines whether parameter optimization is likely to increase accuracy, and the second in which algorithm selection actually takes place.
Citations (10)
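A compact sketch of the two-phase idea described above: a cheap landmarker (a depth-1 decision stump evaluated by cross-validation) probes the dataset first, and parameter optimization for the actual learner is only run when the landmarker suggests it is likely to pay off. The learner, the landmarker, and the 0.9 threshold are illustrative assumptions, not the paper's metalearning setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Phase 1: a cheap landmarker probes the dataset.
stump_score = cross_val_score(DecisionTreeClassifier(max_depth=1), X, y, cv=5).mean()

# Illustrative decision rule: only pay for parameter optimization when the
# landmarker suggests the problem is not trivially easy.
if stump_score < 0.9:
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)
    search.fit(X, y)
    print("tuned SVC:", search.best_params_, round(search.best_score_, 3))
else:
    default_score = cross_val_score(SVC(), X, y, cv=5).mean()
    print("default SVC:", round(default_score, 3))
```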
The problem of adversarial examples has shown that modern Neural Network (NN) models can be rather fragile. Among the more established techniques to solve the problem, one is to require the model to be $\epsilon$-adversarially robust (AR); that is, to require that the model not change its predicted labels when any given input example is perturbed within a certain range. However, it is observed that such methods lead to standard performance degradation, i.e., degradation on natural examples. In this work, we study the degradation through the regularization perspective. We identify quantities from the generalization analysis of NNs; with the identified quantities we empirically find that AR is achieved by regularizing/biasing NNs towards less confident solutions, by making the changes in the feature space (induced by changes in the instance space) of most layers uniformly smoother in all directions; so, to a certain extent, it prevents sudden changes in prediction w.r.t. perturbations. However, the end result of such smoothing concentrates samples around decision boundaries, resulting in less confident solutions and worse standard performance. Our studies suggest that one might consider ways to build AR into NNs in a gentler way, so as to avoid this problematic regularization.
Regularization
Smoothing
Robustness
Deep Neural Networks
Citations (2)
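Since the abstract above builds on the definition of $\epsilon$-adversarial robustness, here is a minimal sketch of a necessary (but not sufficient) empirical check: a point passes if its predicted label survives many random perturbations of bounded infinity norm. The stand-in random network, the value of epsilon, and the sample counts are assumptions; a real evaluation would use the trained model under study and an adversarial attack rather than random noise.

```python
import numpy as np

rng = np.random.default_rng(8)

# Stand-in classifier: a fixed random two-layer ReLU network (illustrative only).
W1 = rng.standard_normal((20, 64)) / np.sqrt(20)
W2 = rng.standard_normal((64, 10)) / np.sqrt(64)

def predict(x):
    return (np.maximum(x @ W1, 0.0) @ W2).argmax(axis=-1)

def passes_random_check(x, eps=0.1, samples=500):
    """Necessary condition for epsilon-robustness: the label survives random perturbations
    with infinity norm at most eps (an adversarial attack could still flip it)."""
    label = predict(x)
    noise = rng.uniform(-eps, eps, size=(samples,) + x.shape)
    return bool((predict(x + noise) == label).all())

points = rng.standard_normal((100, 20))
frac = np.mean([passes_random_check(x) for x in points])
print(f"fraction of points passing the random-perturbation robustness check: {frac:.2f}")
```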
How predictable is success in complex social systems? In spite of a recent profusion of prediction studies that exploit online social and information network data, this question remains unanswered, in part because it has not been adequately specified. In this paper we attempt to clarify the question by presenting a simple stylized model of success that attributes prediction error to one of two generic sources: insufficiency of available data and/or models on the one hand, and inherent unpredictability of complex social systems on the other. We then use this model to motivate an illustrative empirical study of information cascade size prediction on Twitter. Despite an unprecedented volume of information about users, content, and past performance, our best performing models can explain less than half of the variance in cascade sizes. In turn, this result suggests that even with unlimited data, predictive performance would be bounded well below deterministic accuracy. Finally, we explore this potential bound theoretically using simulations of a diffusion process on a random scale-free network similar to Twitter. We show that although higher predictive power is possible in theory, such performance requires a homogeneous system and perfect ex-ante knowledge of it: even a small degree of uncertainty in estimating product quality, or slight variation in quality across products, leads to substantially more restrictive bounds on predictability. We conclude that realistic bounds on predictive accuracy are not dissimilar from those we have obtained empirically, and that such bounds for other complex social systems, for which data is more difficult to obtain, are likely even lower.
Predictability
Predictive power
Stylized fact
Information cascade
Social network (sociolinguistics)
Citations (1)
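To illustrate the inherent-unpredictability argument, the sketch below repeatedly simulates an independent-cascade diffusion from the same highly connected seed on a Barabasi-Albert (scale-free) graph; the spread of the resulting cascade sizes is variance that no ex-ante feature of the seed can explain. The graph size, infection probability, and cascade model are assumptions chosen only for illustration.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(7)
G = nx.barabasi_albert_graph(n=5000, m=2, seed=7)   # scale-free-style network

def cascade_size(G, seed_node, p=0.05):
    """Independent-cascade diffusion: each new adopter infects each neighbor once with prob p."""
    active, frontier = {seed_node}, [seed_node]
    while frontier:
        nxt = []
        for u in frontier:
            for v in G.neighbors(u):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

# Re-run the same seed many times: the spread of outcomes is the part of the prediction
# error that remains even with perfect knowledge of the seed and the network.
seed = max(G.degree, key=lambda kv: kv[1])[0]       # highest-degree node
sizes = np.array([cascade_size(G, seed) for _ in range(200)])
print(f"mean cascade size {sizes.mean():.1f}, std {sizes.std():.1f}")
```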