Machine learning is becoming increasingly prevalent and integrated into a wide range of software systems. These systems, referred to as ML systems, must be adequately tested to gain confidence that they behave correctly. Although many research efforts have been devoted to testing technologies for ML systems, industrial teams face new challenges when testing ML systems in real-world settings. To gain insights from industry into the problems of ML testing, we conducted an empirical study comprising a survey with 87 responses and interviews with 7 senior ML practitioners from well-known IT companies. Our study uncovers significant industrial concerns about the major testing activities, i.e., test data collection, test execution, and test result analysis, as well as good practices and open challenges from the industry's perspective. (1) Test data collection is conducted in different ways for the ML model, data, and code, and each faces different challenges. (2) Test execution in ML systems suffers from two major problems: entanglement among components and regressions in model performance. (3) Test result analysis centers on quantitative methods, e.g., metric-based evaluation, combined with qualitative methods based on practitioners' experience. Based on our findings, we highlight research opportunities and provide implications for practitioners.
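As a hedged illustration of the metric-based evaluation and model-performance regression concerns mentioned above, the following Python sketch gates a candidate model against a baseline with per-metric tolerances; the metric names, values, and tolerances are hypothetical and are not taken from the study.

```python
# Minimal sketch of a metric-based regression gate for an ML model.
# Assumes metrics are computed upstream; names and tolerances are illustrative only.

BASELINE = {"accuracy": 0.91, "f1": 0.88}    # metrics of the currently deployed model (hypothetical)
TOLERANCE = {"accuracy": 0.01, "f1": 0.02}   # allowed per-metric drop before flagging a regression

def check_regression(candidate: dict[str, float]) -> list[str]:
    """Return the metrics on which the candidate model regresses beyond tolerance."""
    regressions = []
    for name, baseline_value in BASELINE.items():
        candidate_value = candidate.get(name, 0.0)
        if baseline_value - candidate_value > TOLERANCE.get(name, 0.0):
            regressions.append(f"{name}: {baseline_value:.3f} -> {candidate_value:.3f}")
    return regressions

if __name__ == "__main__":
    candidate_metrics = {"accuracy": 0.92, "f1": 0.85}  # hypothetical evaluation result
    failures = check_regression(candidate_metrics)
    print("regression detected:" if failures else "no regression", failures)
```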
Bug-related user reviews of mobile applications negatively affect their reputation and competitiveness, so developers take these reviews seriously. Before fixing a bug, developers need to manually reproduce the bug reported in a user review, which is an extremely time-consuming and tedious task. Hence, automating this process is highly desirable. However, doing so is challenging because user reviews are hard to understand and poorly informative for bug reproduction (in particular, they typically lack reproduction steps). In this paper, we propose RepRev to automatically Reproduce Android application bugs from user Reviews. Specifically, RepRev leverages natural language processing techniques to extract information valuable for bug reproduction. It then ranks GUI components by their semantic similarity with the user review and dynamically explores the app with a novel one-step exploration technique. To evaluate RepRev, we construct a benchmark of 63 crash-related user reviews from Google Play, each of which was successfully reproduced by three graduate students. On this benchmark, RepRev achieves performance comparable to humans, successfully reproducing 44 user reviews (about 70%) in 432.2 seconds on average. We make the implementation of our approach publicly available, along with the artifacts and experimental data we used [4].
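To make the ranking step concrete, the sketch below shows one plausible way to rank GUI components by textual similarity to a user review using a simple bag-of-words cosine measure. This is not RepRev's actual scoring; the tokenizer, component labels, and similarity function are assumptions for illustration only.

```python
# Illustrative sketch: rank GUI components by lexical similarity to a user review.
# Stands in for a generic semantic-similarity ranking; not RepRev's implementation.
import math
import re
from collections import Counter

def tokens(text: str) -> Counter:
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_components(review: str, components: list[str]) -> list[tuple[str, float]]:
    review_vec = tokens(review)
    scored = [(c, cosine(review_vec, tokens(c))) for c in components]
    return sorted(scored, key=lambda x: x[1], reverse=True)

if __name__ == "__main__":
    review = "The app crashes every time I tap the share photo button"
    components = ["Share photo", "Edit profile", "Open settings"]  # hypothetical labels from the GUI hierarchy
    print(rank_components(review, components))
```

In practice, a semantic (embedding-based) similarity would handle paraphrases better than this lexical measure; the ranking structure stays the same.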
Record-and-replay tools are indispensable for quality assurance of mobile applications. Due to their importance, an increasing number of tools are being developed to record and replay user interactions on Android. However, an empirical study of various existing tools in industrial settings has revealed a gap between the characteristics required by industry and the performance of publicly available record-and-replay tools, concluding that none of the evaluated tools is sufficient for industrial applications. In this paper, we present a record-and-replay tool called SARA that aims to bridge this gap and target wide adoption. Specifically, a dynamic instrumentation technique is used to accommodate rich sources of inputs in the application layer while satisfying various constraints imposed by industry. A self-replay mechanism is proposed to record richer information about user inputs for accurate replaying without degrading the user experience. In addition, an adaptive replay method is designed to enable replaying events on different devices with diverse screen sizes and OS versions. Through an evaluation on 53 highly popular industrial Android applications and 265 common usage scenarios, we demonstrate the effectiveness of SARA in recording and replaying rich sources of inputs on the same or different devices.
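One common way to make recorded touch events replayable across devices with different screen sizes, which is the goal of the adaptive replay method described above, is to store coordinates as screen-relative proportions and rescale them on the target device. The sketch below illustrates that general idea only; the event format and device resolutions are assumptions, not SARA's internal mechanism.

```python
# Minimal sketch: record a tap as screen-relative coordinates, then rescale for a target device.
# The event representation and resolutions are illustrative, not SARA's actual format.
from dataclasses import dataclass

@dataclass
class TapEvent:
    rel_x: float  # 0.0 .. 1.0, fraction of screen width
    rel_y: float  # 0.0 .. 1.0, fraction of screen height

def record_tap(x: int, y: int, width: int, height: int) -> TapEvent:
    return TapEvent(rel_x=x / width, rel_y=y / height)

def replay_tap(event: TapEvent, width: int, height: int) -> tuple[int, int]:
    return round(event.rel_x * width), round(event.rel_y * height)

if __name__ == "__main__":
    recorded = record_tap(540, 1800, width=1080, height=2340)   # tap captured on a 1080x2340 device
    print(replay_tap(recorded, width=1440, height=3200))        # replayed on a 1440x3200 device
```

Purely proportional scaling breaks down when layouts differ structurally between devices, which is why an adaptive method that also considers the GUI hierarchy is needed.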
The purpose of the General Data Protection Regulation (GDPR) is to provide improved privacy protection. If an app controls personal data from users, it must be GDPR compliant. However, GDPR lists general rules rather than exact step-by-step guidelines on how to develop an app that fulfills its requirements. Therefore, GDPR compliance violations may exist in existing apps, posing severe privacy threats to app users. In this paper, we take mobile health applications (mHealth apps) as a peephole to examine the status quo of GDPR compliance in Android apps. We first propose an automated system, named HPDROID, to bridge the semantic gap between the general rules of GDPR and app implementations by identifying the data practices declared in the app's privacy policy and the data-relevant behaviors in the app's code. Then, based on HPDROID, we detect three kinds of GDPR compliance violations: incomplete privacy policies, inconsistent data collection, and insecure data transmission. We perform an empirical evaluation of 796 mHealth apps. The results reveal that 189 (23.7%) of them do not provide complete privacy policies. Moreover, 59 apps collect sensitive data through different means, but 46 (77.9%) of them contain at least one inconsistent collection behavior. Even worse, among the 59 apps, only 8 try to ensure the transmission security of the collected data, yet all of them contain at least one encryption or SSL misuse. Our work exposes severe privacy issues to raise awareness of privacy protection among app users and developers.
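The inconsistency check between what a privacy policy declares and what the code actually collects can be viewed as a set comparison over data types. The following sketch illustrates that idea with hypothetical data-type labels; it is not HPDROID's implementation, which derives these sets through policy and code analysis.

```python
# Illustrative sketch: flag data types collected in code but not declared in the privacy policy.
# The data-type labels and inputs are hypothetical placeholders for analysis results.

def undeclared_collections(declared: set[str], observed: set[str]) -> set[str]:
    """Data types the app collects without declaring them in its privacy policy."""
    return observed - declared

if __name__ == "__main__":
    declared_in_policy = {"email", "device_id"}
    observed_in_code = {"email", "device_id", "location"}   # e.g., from static/dynamic analysis
    violations = undeclared_collections(declared_in_policy, observed_in_code)
    print("inconsistent collection:", violations)   # -> {'location'}
```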