Term Weight Algorithm Oriented Terms: Low Frequency Rather Than Little Occurrences

Yiyi He,Tiejun Li,Yuhong Huang,Shijie Li,Yanhuang Jiang

2020

Abstract Term weight algorithms based on inverse document analysis are widely used to express the characteristic information of text. Based on the finding that frequently occurring terms carry less feature information about a text, such algorithms assign higher weights to terms with lower frequency. However, terms with very few occurrences, such as rare terms and misspelled terms, often convey unimportant or even erroneous information. To tackle this problem, this paper proposes a novel term weight algorithm that focuses on terms with low frequency rather than few occurrences. Using statistics based on non-homogeneous compression of term frequency, the algorithm highlights the effect of terms in the frequency range of interest. A logarithmic function, combined with the number of terms sharing the same frequency, is used to weight terms of different frequencies according to their compression intervals. Compared with TF-IDF and SIF, the proposed approach performs similarly to SIF and slightly better than TF-IDF. The differences among these methods suggest that terms with low frequency, rather than terms with few occurrences, may dominate the feature information of a text.
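
The problem the abstract describes can be illustrated with plain inverse document frequency. The sketch below is not the proposed algorithm, whose formula the abstract does not give; it assumes the textbook definition idf(t) = log(N / df(t)), and the function name `idf_weights` and the toy corpus are hypothetical. It shows that a misspelled term occurring once receives the largest weight in the corpus.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return the textbook IDF, log(N / df), for every term in the corpus.

    Assumption: whitespace tokenization and unsmoothed IDF. This is a
    baseline illustration, not the weighting scheme proposed in the paper.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term at most once per document (document frequency).
        df.update(set(doc.split()))
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Hypothetical toy corpus; "dgo" is a deliberate misspelling of "dog".
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the dgo barked",
]
weights = idf_weights(docs)

# Print terms from highest to lowest weight.
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

Under this baseline the misspelling "dgo" outweighs the correctly spelled "dog", which is the kind of behavior a scheme that separates low frequency from few occurrences is meant to avoid.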
