Quality-efficiency trade-offs in machine learning for text processing

2017 
As the amount of available digital documents keeps growing rapidly, extracting useful information from them has become a major challenge. Data mining, natural language processing, and machine learning are powerful techniques that can be used together to deal with this problem. Many different approaches can be applied, depending on the task at hand. The available methods are continuously improved, but not all of them have been tested and compared on a coherent set of problems using supervised machine learning algorithms. For example, what happens to the quality of the methods if we increase the training data size from, say, 100 MB to over 1 GB? Moreover, are quality gains worth it when the rate of data processing diminishes? Can we trade quality for time efficiency and recover the quality loss simply by being able to process more data? We attempt to answer these questions in a general way for text processing tasks, considering the trade-offs among training data size, learning time, and quality obtained. For this, we propose a performance trade-off framework and apply it to three important tasks: Named Entity Recognition, Sentiment Analysis, and Document Classification. These problems were also chosen because they have different levels of object granularity: words, paragraphs, and documents. For each problem, we selected several supervised machine learning algorithms and evaluated their trade-offs on large publicly available data sets (news, reviews, patents). To explore these trade-offs, we use data subsets of increasing size, ranging from 50 MB to several GB. For the last two tasks, we also consider similar algorithms with two different data sets and two evaluation techniques, to study their impact on the resulting trade-offs. We find that the results do not change significantly and that, most of the time, the best algorithms are the ones with the fastest processing time. However, we also show that the results for small data (say, less than 100 MB) differ from the results for big data, and in those cases the best algorithm is much harder to determine.
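The kind of trade-off analysis described above can be illustrated with a minimal sketch: train supervised text classifiers on training subsets of increasing size and record both learning time and quality. The sketch below is not the authors' framework; the dataset (scikit-learn's 20 newsgroups), the two models (Naive Bayes and a linear SVM), and the subset fractions are stand-ins chosen only to make the idea concrete.

```python
# Illustrative sketch (not the paper's code): measure how quality and
# learning time change as the training subset grows, for two classifiers.
import time

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Stand-in corpus; the paper uses news, reviews, and patent collections.
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

vectorizer = TfidfVectorizer(max_features=50_000)
X_train_full = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

models = {"NaiveBayes": MultinomialNB(), "LinearSVM": LinearSVC()}
fractions = [0.1, 0.25, 0.5, 1.0]  # increasing training subsets

for name, model in models.items():
    for frac in fractions:
        n = int(frac * X_train_full.shape[0])
        start = time.perf_counter()
        model.fit(X_train_full[:n], train.target[:n])
        elapsed = time.perf_counter() - start
        acc = accuracy_score(test.target, model.predict(X_test))
        # Quality per unit of learning time: one simple trade-off view.
        print(f"{name:10s} n={n:6d} time={elapsed:6.2f}s "
              f"acc={acc:.3f} acc/s={acc / elapsed:.3f}")
```

Plotting accuracy against training time for each subset size gives the kind of quality-efficiency curve the abstract refers to, where a faster but slightly less accurate algorithm may win once more data can be processed in the same time budget.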