Term Weight Algorithm Oriented Terms: Low Frequency Rather Than Little Occurrences

2020 
Abstract: Term weight algorithms based on inverse document analysis are widely used to express the characteristic information of text. Because frequently occurring terms tend to carry less feature information about a text, terms with lower frequency are given higher weight. However, terms with very few occurrences, such as rare terms and misspelled terms, often carry unimportant or even erroneous information. To tackle this problem, this paper proposes a novel term weight algorithm that focuses on terms with low frequency rather than terms with few occurrences. Statistics based on non-homogeneous compression of term frequency highlight the contribution of terms in the frequency range of interest. A logarithmic function, combined with the number of terms that share the same frequency, is then used to weight terms of different frequencies according to their compression intervals. Compared with TF-IDF and SIF, the proposed approach performs similarly to SIF and slightly better than TF-IDF. The differences among these methods suggest that terms with low frequency, rather than terms with few occurrences, may dominate the feature information of a text.
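The problem the abstract describes can be seen in the two baselines it names. The following is a minimal sketch, not the paper's algorithm: it assumes the textbook forms of the baselines, IDF as log(N / df) and a SIF-style weight a / (a + p(w)), and it uses an invented toy corpus. The function names, the value of `a`, and the misspelled token `catt` are illustrative assumptions.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Textbook inverse document frequency: log(N / df) per term."""
    n_docs = len(docs)
    # Count each term once per document it appears in.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def sif_weights(docs, a=1e-3):
    """SIF-style weight a / (a + p(w)), with p(w) the relative corpus frequency.

    The smoothing constant `a` is an assumed value chosen for illustration.
    """
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: a / (a + count / total) for term, count in counts.items()}


# Hypothetical toy corpus; "catt" is a deliberate misspelling of "cat".
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
    "the catt sat".split(),
]

idf = idf_weights(docs)
sif = sif_weights(docs)

for term in ("the", "cat", "catt"):
    print(f"{term:5s} idf={idf[term]:.3f} sif={sif[term]:.4f}")
```

Under both weightings the single-occurrence misspelling `catt` receives a higher weight than the correctly spelled `cat`, which is the behavior the proposed algorithm is meant to avoid by favoring low-frequency terms over terms that barely occur at all.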