Hierarchical Three-module Method of Text Classification in Web Big Data

2020 
Text analysis is a method for extracting knowledge from text. Memory and time limitations are crucial when processing big data, because the data sources are distributed across the web, search engines, and social network sites. In addition, the automation of search, summarization, and the discovery of user interests has drawn industrial and scientific attention to the immediate classification of diverse texts in a streaming manner. Hierarchical text classification is a common problem that traditional bag-of-words methods handle easily; with big data and a large number of class labels, however, traditional methods no longer meet society's needs. As the volume of data on the internet and social networks grows, more powerful methods are needed that can classify the data accurately and immediately. Through abstraction over textual data, deep learning can address these challenges. This paper introduces a deep learning method based on the hierarchical attention network (HAN), named HAN-MODI, which classifies texts from social networks and websites in real time, bilingually in English and Farsi, with an accuracy of 98.81%. The paper also shows that this complex network, with its three modules (word, sentence, and document), works better at the word level and requires no knowledge of the syntactic or semantic structure of the language. The novelty of the proposed method is the addition of a third level to the hierarchical structure, which supports both general detection and more exact detection of the class. In addition, classification with this method is multi-level, and with a change to HAN the method can be applied to Farsi texts. The model is improved by adding a new layer above the HAN architecture. We call this layer, which segments sentences into expressions, Bag of Sentences, and we add a dynamic window at every stage, with the attention mechanism applied simultaneously.
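The three-level idea can be illustrated with a minimal sketch in plain Python. It is one possible reading of the abstract, not the authors' implementation: it assumes word vectors are pooled into sentence vectors, sentence vectors into vectors for groups of sentences, and group vectors into a document vector, each time with additive attention. The recurrent encoders of HAN, the dynamic window, and the classifier head are omitted, the parameters are random rather than trained, and all names (`attention_pool`, `make_params`, `encode_document`) are hypothetical.

```python
import math
import random

def attention_pool(vectors, W, b, u):
    """Additive attention: weight each vector by softmax(u . tanh(W v + b))."""
    scores = []
    for v in vectors:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, v)) + bias)
                  for row, bias in zip(W, b)]
        scores.append(sum(h * c for h, c in zip(hidden, u)))
    top = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(vectors[0])
    pooled = [sum(a * v[i] for a, v in zip(weights, vectors)) for i in range(dim)]
    return pooled, weights

def make_params(rng, dim):
    """Random, untrained attention parameters (assumption: illustration only)."""
    W = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
    b = [0.0] * dim
    u = [rng.gauss(0, 0.1) for _ in range(dim)]
    return W, b, u

def encode_document(doc, word_p, sent_p, group_p):
    """doc: list of sentence groups; group: list of sentences; sentence: list of word vectors."""
    group_vecs = []
    for group in doc:
        # Level 1: words -> one vector per sentence.
        sent_vecs = [attention_pool(sentence, *word_p)[0] for sentence in group]
        # Level 2: sentences -> one vector per group of sentences.
        group_vecs.append(attention_pool(sent_vecs, *sent_p)[0])
    # Level 3: groups -> document vector, plus the attention weight of each group.
    return attention_pool(group_vecs, *group_p)

rng = random.Random(0)
DIM = 4
word_p, sent_p, group_p = (make_params(rng, DIM) for _ in range(3))

def random_sentence(n_words):
    """Stand-in for embedded words; a real system would use learned embeddings."""
    return [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(n_words)]

# Toy document: two groups, holding two sentences and one sentence respectively.
doc = [[random_sentence(5), random_sentence(3)], [random_sentence(4)]]
doc_vec, group_weights = encode_document(doc, word_p, sent_p, group_p)
print(len(doc_vec), group_weights)
```

In a trained model the document vector would feed a classifier, and the attention weights at each level would indicate which words, sentences, and groups contributed most to the predicted class.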