    Supposed Maximum Mutual Information for Improving Generalization and Interpretation of Multi-Layered Neural Networks
Citations: 13 · References: 33 · Related Papers: 10
    Abstract:
The present paper aims to propose a new type of information-theoretic method to maximize mutual information between inputs and outputs. The importance of mutual information in neural networks is well known, but the actual implementation of mutual information maximization has been quite difficult to undertake. In addition, mutual information has not been used extensively in neural networks, meaning that its applicability is very limited. To overcome this shortcoming of mutual information maximization, we present it here in a very simplified manner by supposing that mutual information is already maximized before learning, or at least at the beginning of learning. The method was applied to three data sets (crab data set, wholesale data set, and human resources data set) and examined in terms of generalization performance and connection weights. The results showed that, by disentangling connection weights, maximizing mutual information made it possible to explicitly interpret the relations between inputs and outputs.
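As a rough, independent illustration of the quantity the abstract targets, the sketch below estimates mutual information between a single input variable and an output from paired samples via histogram discretization; the toy data, bin count, and function name are assumptions for illustration, not the paper's actual procedure.

```python
import numpy as np

def mutual_information(x, y, bins=5):
    """Estimate I(X; Y) in nats from paired samples via histogram discretization."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities p(x, y)
    px = pxy.sum(axis=1, keepdims=True)       # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)       # marginal p(y)
    nz = pxy > 0                              # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# Toy example: an input variable and a strongly dependent output.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=0.5, size=1000)
print(mutual_information(x, y))   # well above 0 for dependent variables, near 0 if independent
```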
    Keywords:
    Maximization
    Pointwise mutual information
    Conditional mutual information
    Interaction information
    Relevance
    Information diagram
    Information Theory
    Citations (163)
Feature selection is used to eliminate redundant features and keep relevant ones; it can enhance a machine learning algorithm's performance and accelerate computation. Among the various methods, mutual information has attracted increasing attention as an effective criterion for measuring variable correlation. However, current works mainly focus on maximizing feature relevancy with the class label and minimizing feature redundancy within the selected features. We argue that pursuing feature redundancy minimization is reasonable but not necessary, because some so-called redundant features also carry useful information that can improve performance. In terms of mutual information calculation, the true relationship between two variables may be distorted without a proper neighborhood partition; traditional methods usually split continuous variables into several intervals and even ignore this influence. We theoretically prove how variable fluctuation negatively influences mutual information calculation. To remove these obstacles, we propose a full conditional mutual information maximization method (FCMIM) for feature selection, which considers only feature relevancy, in two aspects. To obtain a better partition and to eliminate the negative influence of attribute fluctuation, we propose an adaptive neighborhood partition algorithm (ANP) with feedback from the mutual information maximization algorithm; the backpropagation process helps search for a proper neighborhood partition parameter. We compare our method with several mutual information methods on 17 benchmark datasets. The results of FCMIM are better than those of the other methods under different classifiers, and they show that ANP improves the performance of nearly all the mutual information methods.
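To make the selection criterion concrete, here is a minimal sketch of a generic conditional-mutual-information-based greedy selector in the CMIM style, with plain equal-width binning in place of the adaptive neighborhood partition; the function names, scoring rule, and toy data are assumptions and do not reproduce the authors' exact FCMIM or ANP.

```python
import numpy as np

def discretize(x, bins=5):
    # Plain equal-width binning; the cited paper's ANP adapts this step instead.
    edges = np.linspace(x.min(), x.max(), bins + 1)[1:-1]
    return np.digitize(x, edges)

def cond_mutual_info(x, y, z):
    """I(X; Y | Z) in nats for integer-coded discrete variables."""
    counts = np.zeros((x.max() + 1, y.max() + 1, z.max() + 1))
    np.add.at(counts, (x, y, z), 1)
    p = counts / counts.sum()
    pz = p.sum(axis=(0, 1), keepdims=True)    # p(z)
    pxz = p.sum(axis=1, keepdims=True)        # p(x, z)
    pyz = p.sum(axis=0, keepdims=True)        # p(y, z)
    nz = p > 0
    return float(np.sum(p[nz] * np.log((p * pz)[nz] / (pxz * pyz)[nz])))

def greedy_cmim(X, y, k, bins=5):
    """CMIM-style greedy selection: score a candidate by its worst-case
    conditional relevancy given the already selected features."""
    Xd = np.column_stack([discretize(X[:, j], bins) for j in range(X.shape[1])])
    selected = []
    while len(selected) < k:
        scores = {}
        for j in range(X.shape[1]):
            if j in selected:
                continue
            if not selected:
                scores[j] = cond_mutual_info(Xd[:, j], y, np.zeros_like(y))
            else:
                scores[j] = min(cond_mutual_info(Xd[:, j], y, Xd[:, s]) for s in selected)
        selected.append(max(scores, key=scores.get))
    return selected

# Toy usage: pick 3 features for a synthetic binary label (assumed data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)
print(greedy_cmim(X, y, k=3))
```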
    Maximization
    Conditional mutual information
    Feature (linguistics)
    Interaction information
    Minification
    Pointwise mutual information
    Benchmark (surveying)
    Citations (1)
The paper proposes a new information-theoretic method to improve the generalization performance of multi-layered neural networks, called "self-organized mutual information maximization learning." In the method, the self-organizing map (SOM) is successively applied to give knowledge to the subsequent multi-layered neural networks. In this process, the mutual information between input patterns and competitive neurons is forced to increase by changing the spread parameter. Though several methods to increase information have been proposed for multi-layered neural networks, the present paper is the first to confirm that mutual information plays an important role in learning in multi-layered neural networks and to show how to compute it. The method was applied to the extended Senate data. In the experiments, it is examined whether mutual information is actually increased by the present method, because mutual information can seemingly be increased simply by changing the spread parameter. Experimental results show that even when the parameter responsible for changing mutual information was fixed, mutual information could be increased. This means that neural networks can be organized so as to store information content on input patterns by the present method. In addition, it was observed that generalization performance was much improved by this increase in mutual information.
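As a sketch of the quantity being increased, the code below computes mutual information between input patterns and competitive units when firing probabilities are a softmax of negative squared distances controlled by a spread parameter; the toy data and the Gaussian-style firing rule are assumptions, not the paper's exact SOM procedure or the Senate data.

```python
import numpy as np

def competitive_mutual_info(X, W, spread):
    """I(input pattern; competitive unit) in nats, with softmax firing probabilities."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)        # squared distances
    logits = -d2 / (2.0 * spread ** 2)
    p_unit_given_x = np.exp(logits - logits.max(axis=1, keepdims=True))
    p_unit_given_x /= p_unit_given_x.sum(axis=1, keepdims=True)     # p(j | x)
    p_unit = p_unit_given_x.mean(axis=0)                            # p(j), uniform p(x) assumed
    ratio = (p_unit_given_x + 1e-12) / (p_unit + 1e-12)
    return float(np.mean(np.sum(p_unit_given_x * np.log(ratio), axis=1)))

# Toy input patterns and competitive-unit weight vectors (assumed values).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
W = rng.normal(size=(10, 4))
for spread in (0.1, 1.0, 10.0):
    print(spread, competitive_mutual_info(X, W, spread))
# A small spread sharpens p(j | x) toward winner-take-all and raises the mutual information.
```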
    Maximization
    Pointwise mutual information
    Citations (3)
    Mutual information quantifies the dependence between two random variables and remains invariant under diffeomorphisms. In this paper, we explore the pointwise mutual information profile, an extension of mutual information that maintains this invariance. We analytically describe the profiles of multivariate normal distributions and introduce the family of fine distributions, for which the profile can be accurately approximated using Monte Carlo methods. We then show how fine distributions can be used to study the limitations of existing mutual information estimators, investigate the behavior of neural critics used in variational estimators, and understand the effect of experimental outliers on mutual information estimation. Finally, we show how fine distributions can be used to obtain model-based Bayesian estimates of mutual information, suitable for problems with available domain expertise in which uncertainty quantification is necessary.
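A minimal sketch of the pointwise mutual information profile for a bivariate normal distribution, evaluated by Monte Carlo; the correlation value and sample size are arbitrary choices, and the paper's fine distributions are a broader family than this single Gaussian example.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rho = 0.8                                    # assumed correlation for the toy example
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
samples = joint.rvs(size=100_000, random_state=0)

# Pointwise mutual information at each joint sample: log p(x, y) - log p(x) - log p(y).
pmi = (joint.logpdf(samples)
       - norm.logpdf(samples[:, 0])
       - norm.logpdf(samples[:, 1]))

print(np.percentile(pmi, [5, 50, 95]))        # a crude summary of the profile itself
# The Monte Carlo mean of the profile recovers I(X; Y) = -0.5 * log(1 - rho^2).
print(pmi.mean(), -0.5 * np.log(1.0 - rho ** 2))
```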
    Pointwise mutual information
    Pointwise
    Citations (0)
The present paper aims to propose a new type of information-theoretic method to maximize mutual information between neurons. The importance of mutual information has been well known in neural networks, but the actual implementation of mutual information maximization is a hard problem, and mutual information has not necessarily been used in neural networks; its application has therefore been very limited. To overcome this shortcoming of mutual information maximization, we present here a very simplified version of mutual information maximization by supposing that mutual information is already maximized before learning. The method was applied to the wholesale data set and the inference of default credit card holders. The experimental results show that mutual information between neurons could be increased and generalization performance could be improved. Moreover, the important features could be obtained by the present method even when the training data set was small, whereas logistic regression analysis could extract the important features only with the large training data set.
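To illustrate the kind of comparison described, the sketch below reads a feature ranking from a small network's connection weights and from logistic regression coefficients on a small synthetic training set; the data, model sizes, and the weight-product heuristic are assumptions, not the paper's supposed-maximum-mutual-information procedure or its data sets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Small synthetic training set (an assumption; the paper uses the wholesale
# and default-credit-card data sets, which are not bundled here).
X, y = make_classification(n_samples=120, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Logistic regression importance: absolute coefficients.
logreg = LogisticRegression(max_iter=1000).fit(X, y)
logreg_importance = np.abs(logreg.coef_).ravel()

# Network importance: magnitude of input-to-output weight products,
# a generic weight-reading heuristic rather than the paper's method.
mlp = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=0).fit(X, y)
mlp_importance = np.abs(mlp.coefs_[0] @ mlp.coefs_[1]).sum(axis=1)

print(np.argsort(logreg_importance)[::-1])   # feature ranking from logistic regression
print(np.argsort(mlp_importance)[::-1])      # feature ranking read from connection weights
```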
    Maximization
    Pointwise mutual information
    Conditional mutual information
    Citations (9)
This paper introduces hypervolume maximization with a single solution as an alternative to mean loss minimization. The relationship between the two problems is proved through bounds on the cost function when an optimal solution to one of the problems is evaluated on the other, with a hyperparameter to control the similarity between the two problems. This same hyperparameter allows higher weight to be placed on samples with higher loss when computing the hypervolume's gradient, whose normalized version can range from the mean loss to the max loss. An experiment on MNIST with a neural network is used to validate the theory developed, showing that hypervolume maximization can behave similarly to mean loss minimization and can also provide better performance, resulting in a 20% reduction of the classification error on the test set.
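A minimal sketch of how the reference point (the hyperparameter above) reweights samples in the gradient of a single-solution log-hypervolume objective; the loss values and variable names are assumptions for illustration.

```python
import numpy as np

def hypervolume_grad_weights(losses, mu):
    """Normalized per-sample weights in the gradient of -sum_i log(mu - l_i).

    mu is a reference point above every per-sample loss; each sample is weighted
    by 1 / (mu - l_i), so higher-loss samples receive more weight.
    """
    w = 1.0 / (mu - losses)
    return w / w.sum()

losses = np.array([0.1, 0.5, 2.0])           # toy per-sample losses (assumed values)
for mu in (2.1, 3.0, 100.0):                 # the reference point controls the behavior
    print(mu, hypervolume_grad_weights(losses, mu))
# Near mu = max loss the weights concentrate on the worst sample (max-loss-like);
# for very large mu they approach uniform weights (mean-loss-like).
```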
    Maximization
    Citations (2)