Fisher's exact test explains a popular metric in information retrieval.
2020
Term frequency-inverse document frequency, or tf-idf for short, is a numerical measure that is widely used in information retrieval to quantify the importance of a term of interest in one out of many documents. While tf-idf was originally proposed as a heuristic, much work has been devoted over the years to placing it on a solid theoretical foundation. Following in this tradition, we here advance the first justification for tf-idf that is grounded in statistical hypothesis testing. More precisely, we first show that the one-tailed version of Fisher's exact test, also known as the hypergeometric test, corresponds well with a common tf-idf variant on selected real-data information retrieval tasks. We then set forth a mathematical argument that suggests the tf-idf variant approximates the negative logarithm of the one-tailed Fisher's exact test P-value (i.e., a hypergeometric distribution tail probability). The Fisher's exact test interpretation of this common tf-idf variant furnishes the working statistician with a ready explanation of tf-idf's long-established effectiveness.
Keywords:
- Correction
- Source
- Cite
- Save
- Machine Reading By IdeaReader
38
References
0
Citations
NaN
KQI