Mobile applications (apps) provide users valuable benefits at the risk of exposing users to privacy harms. Improving privacy in mobile apps faces several challenges, in particular, that many apps are developed by low resourced software development teams, such as end-user programmers or in startups. In addition, privacy risks are primarily known to users, which can make it difficult for developers to prioritize privacy for sensitive data. In this paper, we introduce a novel, lightweight method that allows app developers to elicit scenarios and privacy risk scores from users directly using only an app screenshot. The technique relies on named entity recognition (NER) to identify information types in user-authored scenarios, which are then fed in real-time to a privacy risk survey that users complete. The best-performing NER model predicts information types with a weighted average precision of 0.70 and recall of 0.72, after post-processing to remove false positives. The model was trained on a labeled 300-scenario corpus, and evaluated in an end-to-end evaluation using an additional 203 scenarios yielding 2,338 user-provided privacy risk scores. Finally, we discuss how developers can use the risk scores to prioritize, select and apply privacy design strategies in the context of four user-authored scenarios.
Differential privacy (DP) ensures that training a machine learning model does not leak private data. In practice, we may have access to auxiliary public data that is free of privacy concerns. In this work, we assume access to a given amount of public data and settle the following fundamental open questions: 1. What is the optimal (worst-case) error of a DP model trained over a private data set while having access to side public data? 2. How can we harness public data to improve DP model training in practice? We consider these questions in both the local and central models of pure and approximate DP. To answer the first question, we prove tight (up to log factors) lower and upper bounds that characterize the optimal error rates of three fundamental problems: mean estimation, empirical risk minimization, and stochastic convex optimization. We show that the optimal error rates can be attained (up to log factors) by either discarding private data and training a public model, or treating public data like it is private and using an optimal DP algorithm. To address the second question, we develop novel algorithms that are "even more optimal" (i.e. better constants) than the asymptotically optimal approaches described above. For local DP mean estimation, our algorithm is \ul{optimal including constants}. Empirically, our algorithms show benefits over the state-of-the-art.
Quantization of the parameters of machine learning models, such as deep neural networks, requires solving constrained optimization problems, where the constraint set is formed by the Cartesian product of many simple discrete sets. For such optimization problems, we study the performance of the Alternating Direction Method of Multipliers for Quantization ($\texttt{ADMM-Q}$) algorithm, which is a variant of the widely-used ADMM method applied to our discrete optimization problem. We establish the convergence of the iterates of $\texttt{ADMM-Q}$ to certain $\textit{stationary points}$. To the best of our knowledge, this is the first analysis of an ADMM-type method for problems with discrete variables/constraints. Based on our theoretical insights, we develop a few variants of $\texttt{ADMM-Q}$ that can handle inexact update rules, and have improved performance via the use of soft projection and injecting randomness to the algorithm. We empirically evaluate the efficacy of our proposed approaches.
Mobile applications (apps) provide users valuable benefits at the risk of exposing users to privacy harms. Improving privacy in mobile apps faces several challenges, in particular, that many apps are developed by low resourced software development teams, such as end-user programmers or in startups. In addition, privacy risks are primarily known to users, which can make it difficult for developers to prioritize privacy for sensitive data. In this paper, we introduce a novel, lightweight method that allows app developers to elicit scenarios and privacy risk scores from users directly using only an app screenshot. The technique relies on named entity recognition (NER) to identify information types in user-authored scenarios, which are then fed in real-time to a privacy risk survey that users complete. The best-performing NER model predicts information types with a weighted average precision of 0.70 and recall of 0.72, after post-processing to remove false positives. The model was trained on a labeled 300-scenario corpus, and evaluated in an end-to-end evaluation using an additional 203 scenarios yielding 2,338 user-provided privacy risk scores. Finally, we discuss how developers can use the risk scores to prioritize, select and apply privacy design strategies in the context of four user-authored scenarios.
Recent applications that arise in machine learning have surged significant interest in solving min-max saddle point games. This problem has been extensively studied in the convex-concave regime for which a global equilibrium solution can be computed efficiently. In this paper, we study the problem in the non-convex regime and show that an \varepsilon--first order stationary point of the game can be computed when one of the player's objective can be optimized to global optimality efficiently. In particular, we first consider the case where the objective of one of the players satisfies the Polyak-Łojasiewicz (PL) condition. For such a game, we show that a simple multi-step gradient descent-ascent algorithm finds an \varepsilon--first order stationary point of the problem in \widetilde{\mathcal{O}}(\varepsilon^{-2}) iterations. Then we show that our framework can also be applied to the case where the objective of the "max-player" is concave. In this case, we propose a multi-step gradient descent-ascent algorithm that finds an \varepsilon--first order stationary point of the game in \widetilde{\cal O}(\varepsilon^{-3.5}) iterations, which is the best known rate in the literature. We applied our algorithm to a fair classification problem of Fashion-MNIST dataset and observed that the proposed algorithm results in smoother training and better generalization.
Quantization of the parameters of machine learning models, such as deep neural networks, requires solving constrained optimization problems, where the constraint set is formed by the Cartesian product of many simple discrete sets. For such optimization problems, we study the performance of the Alternating Direction Method of Multipliers for Quantization ($\texttt{ADMM-Q}$) algorithm, which is a variant of the widely-used ADMM method applied to our discrete optimization problem. We establish the convergence of the iterates of $\texttt{ADMM-Q}$ to certain $\textit{stationary points}$. To the best of our knowledge, this is the first analysis of an ADMM-type method for problems with discrete variables/constraints. Based on our theoretical insights, we develop a few variants of $\texttt{ADMM-Q}$ that can handle inexact update rules, and have improved performance via the use of "soft projection" and "injecting randomness to the algorithm". We empirically evaluate the efficacy of our proposed approaches.
While deep learning through empirical risk minimization (ERM) has succeeded at achieving human-level performance at a variety of complex tasks, ERM is not robust to distribution shifts or adversarial attacks. Synthetic data augmentation followed by empirical risk minimization (DA-ERM) is a simple and widely used solution to improve robustness in ERM. In addition, consistency regularization can be applied to further improve the robustness of the model by forcing the representation of the original sample and the augmented one to be similar. However, existing consistency regularization methods are not applicable to covariant data augmentation, where the label in the augmented sample is dependent on the augmentation function. For example, dialog state covaries with named entity when we augment data with a new named entity. In this paper, we propose data augmented loss invariant regularization (DAIR), a simple form of consistency regularization that is applied directly at the loss level rather than intermediate features, making it widely applicable to both invariant and covariant data augmentation regardless of network architecture, problem setup, and task. We apply DAIR to real-world learning problems involving covariant data augmentation: robust neural task-oriented dialog state tracking and robust visual question answering. We also apply DAIR to tasks involving invariant data augmentation: robust regression, robust classification against adversarial attacks, and robust ImageNet classification under distribution shift. Our experiments show that DAIR consistently outperforms ERM and DA-ERM with little marginal computational cost and sets new state-of-the-art results in several benchmarks involving covariant data augmentation. Our code of all experiments is available at: https://github.com/optimization-for-data-driven-science/DAIR.git
The min-max optimization problem, also known as the <;i>saddle point problem<;/i>, is a classical optimization problem that is also studied in the context of zero-sum games. Given a class of objective functions, the goal is to find a value for the argument that leads to a small objective value even for the worst-case function in the given class. Min-max optimization problems have recently become very popular in a wide range of signal and data processing applications, such as fair beamforming, training generative adversarial networks (GANs), and robust machine learning (ML), to just name a few.
Recent applications that arise in machine learning have surged significant interest in solving min-max saddle point games. This problem has been extensively studied in the convex-concave regime for which a global equilibrium solution can be computed efficiently. In this paper, we study the problem in the non-convex regime and show that an $\varepsilon$--first order stationary point of the game can be computed when one of the player’s objective can be optimized to global optimality efficiently. In particular, we first consider the case where the objective of one of the players satisfies the Polyak-{\L}ojasiewicz (PL) condition. For such a game, we show that a simple multi-step gradient descent-ascent algorithm finds an $\varepsilon$--first order stationary point of the problem in $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ iterations. Then we show that our framework can also be applied to the case where the objective of the ``max-player is concave. In this case, we propose a multi-step gradient descent-ascent algorithm that finds an $\varepsilon$--first order stationary point of the game in $\widetilde{\cal O}(\varepsilon^{-3.5})$ iterations, which is the best known rate in the literature. We applied our algorithm to a fair classification problem of Fashion-MNIST dataset and observed that the proposed algorithm results in smoother training and better generalization.