Problems of Algorithms Development to Determine Quality of Topic Models Ensembles for Make Rubricators

Alexey P. Shiryaev, Alexey R. Fedorov, P. A. Fedorov, Larisa G. Gagarina, E. M. Portnov

2018

Intelligent data mining is one of the most relevant areas of modern research. Its range of applications is extremely wide and covers practically all scientific disciplines. A pressing task is the analysis of text collections in order to establish thematic headings under which individual articles should be classified, observing the principle of systematization “from the general to the particular” and forming a list of “nuclear” categories. Clustering, and topic modeling in particular, is one of the methods of intelligent text analysis. The problem of clustering text collections is fundamentally ambiguous, for several reasons. First, there is no single, clearly best criterion of clustering quality: many reasonable criteria exist, but they can give different results. Second, the number of clusters is usually unknown in advance and is determined by some subjective criterion. Third, the clustering result depends significantly on the distance metric, the choice of which is usually subjective and made by an expert. Ensembles of models are becoming increasingly widespread among data mining techniques and can significantly improve the accuracy of modeling results. The main purpose of this research is to increase the effectiveness of clustering textual information by using ensembles of topic models. This article describes the use of a voting algorithm based on a group of different evaluation algorithms. The voting algorithm makes it possible to select the most appropriate solution, to assess the quality of the topic model accurately, and to generate a set of relevant topics. A computational experiment shows that the results agree, on the whole, with expert assessments and with evaluations by formal criteria. A concept for evaluating the quality of a topic model ensemble using a simple voting algorithm was explored and is proposed for further research.
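
The abstract does not specify how the voting is carried out, so the following is only a minimal sketch of one plausible reading of "simple voting": each evaluation criterion casts a single vote for the candidate topic model it ranks best, and the model with the most votes is selected. The function name `simple_vote`, the criterion names, the model names, and all numeric values are hypothetical and are not taken from the paper.

```python
from collections import Counter


def simple_vote(scores, higher_is_better):
    """Pick a model by simple majority voting over evaluation criteria.

    scores: {criterion: {model: value}} (hypothetical structure)
    higher_is_better: {criterion: bool}, direction of each criterion
    Returns (winner, votes) where votes maps model -> number of votes.
    """
    votes = Counter()
    for criterion, per_model in scores.items():
        pick = max if higher_is_better[criterion] else min
        # Each criterion casts one vote for the model it ranks best.
        best = pick(per_model, key=per_model.get)
        votes[best] += 1
    # Most votes wins; ties are broken by model name for determinism.
    winner = min(votes, key=lambda m: (-votes[m], m))
    return winner, dict(votes)


# Illustrative numbers only, not results from the paper.
scores = {
    "coherence": {"model_a": 0.41, "model_b": 0.47, "model_c": 0.39},
    "perplexity": {"model_a": 910.0, "model_b": 955.0, "model_c": 1020.0},
    "topic_diversity": {"model_a": 0.62, "model_b": 0.71, "model_c": 0.58},
}
higher_is_better = {
    "coherence": True,
    "perplexity": False,  # lower perplexity is better
    "topic_diversity": True,
}

winner, votes = simple_vote(scores, higher_is_better)
print(winner, votes)
```

Because each criterion contributes only its top choice, the outcome does not depend on the scale of the individual criteria, which may be one reason a voting scheme suits criteria that "can give different results". A weighted or rank-based vote would be a natural variation, but the abstract gives no indication of which variant the authors used.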
