Large Language Models (LLMs) are attracting significant research attention due to their instruction-following abilities, allowing users and developers to leverage LLMs for a variety of tasks. However, LLMs are vulnerable to prompt-injection attacks: a class of attacks that hijack the model's instruction-following abilities, changing responses to prompts to undesired, possibly malicious ones. In this work, we introduce Jatmo, a method for generating task-specific models resilient to prompt-injection attacks. Jatmo leverages the fact that LLMs can only follow instructions once they have undergone instruction tuning. It harnesses a teacher instruction-tuned model to generate a task-specific dataset, which is then used to fine-tune a base model (i.e., a non-instruction-tuned model). Jatmo only needs a task prompt and a dataset of inputs for the task: it uses the teacher model to generate outputs. For situations with no pre-existing datasets, Jatmo can use a single example, or in some cases none at all, to produce a fully synthetic dataset. Our experiments on seven tasks show that Jatmo models provide similar quality of outputs on their specific task as standard LLMs, while being resilient to prompt injections. The best attacks succeeded in less than 0.5% of cases against our models, versus 87% success rate against GPT-3.5-Turbo. We release Jatmo at https://github.com/wagner-group/prompt-injection-defense.
Deep neural networks (DNNs) are acknowledged as vulnerable to adversarial attacks while the existing black-box attacks require extensive queries on the victim DNN to achieve high success rates. For query-efficiency, surrogate models of the victim are used to generate transferable adversarial examples (AEs) because of their gradient similarity (GS), i.e., surrogates' attack gradients are similar to the victim's ones. However, it is generally neglected to exploit their similarity on outputs, namely the prediction similarity (PS), to filter out inefficient queries by surrogates without querying the victim. To jointly utilize and also optimize surrogates' GS and PS, we develop QueryNet, a unified attack framework that can significantly reduce queries. QueryNet creatively attacks by multi-identity surrogates, i.e., crafts several AEs for one sample by different surrogates and also uses surrogates to decide on the most promising AE for the query. After that, the victim's query feedback is accumulated to optimize not only surrogates' parameters but also their architectures, enhancing both the GS and the PS. Although QueryNet has no access to pretrained surrogates' prior, it reduces queries by averagely about an order of magnitude compared to alternatives within an acceptable time, according to our comprehensive experiments: 11 victims (including two commercial models) on MNIST/CIFAR10/ImageNet, allowing only 8-b image queries, and no access to the victim's training data. The code is available at https://github.com/Sizhe-Chen/QueryNet .
Deep neural networks could be fooled by adversarial examples with trivial differences to original samples. To keep the difference imperceptible in human eyes, researchers bound the adversarial perturbations by the ℓ ∞ norm, which is now commonly served as the standard to align the strength of different attacks for a fair comparison. However, we propose that using the ℓ ∞ norm alone is not sufficient in measuring the attack strength, because even with a fixed ℓ ∞ distance, the ℓ 2 distance also greatly affects the attack transferability between models. Through the discovery, we reach more in-depth understandings towards the attack mechanism, i.e., several existing methods attack black-box models better partly because they craft perturbations with 70% to 130% larger ℓ 2 distances. Since larger perturbations naturally lead to better transferability, we thereby advocate that the strength of attacks should be simultaneously measured by both the ℓ ∞ and ℓ 2 norm. Our proposal is firmly supported by extensive experiments on ImageNet dataset from 7 attacks, 4 white-box models, and 9 black-box models.
As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the model, such as "do not generate abusive content", but these may be circumvented by jailbreaking techniques. Existing evaluations of adversarial attacks and defenses on LLMs generally require either expensive manual review or unreliable heuristic checks. To address this issue, we propose Rule-following Language Evaluation Scenarios (RuLES), a programmatic framework for measuring rule-following ability in LLMs. RuLES consists of 14 simple text scenarios in which the model is instructed to obey various rules while interacting with the user. Each scenario has a programmatic evaluation function to determine whether the model has broken any rules in a conversation. Our evaluations of proprietary and open models show that almost all current models struggle to follow scenario rules, even on straightforward test cases. We also demonstrate that simple optimization attacks suffice to significantly increase failure rates on test cases. We conclude by exploring two potential avenues for improvement: test-time steering and supervised fine-tuning.
The score-based query attacks (SQAs) pose practical threats to deep neural networks by crafting adversarial perturbations within dozens of queries, only using the model's output scores. Nonetheless, we note that if the loss trend of the outputs is slightly perturbed, SQAs could be easily misled and thereby become much less effective. Following this idea, we propose a novel defense, namely Adversarial Attack on Attackers (AAA), to confound SQAs towards incorrect attack directions by slightly modifying the output logits. In this way, (1) SQAs are prevented regardless of the model's worst-case robustness; (2) the original model predictions are hardly changed, i.e., no degradation on clean accuracy; (3) the calibration of confidence scores can be improved simultaneously. Extensive experiments are provided to verify the above advantages. For example, by setting $\ell_\infty=8/255$ on CIFAR-10, our proposed AAA helps WideResNet-28 secure 80.59% accuracy under Square attack (2500 queries), while the best prior defense (i.e., adversarial training) only attains 67.44%. Since AAA attacks SQA's general greedy strategy, such advantages of AAA over 8 defenses can be consistently observed on 8 CIFAR-10/ImageNet models under 6 SQAs, using different attack targets, bounds, norms, losses, and strategies. Moreover, AAA calibrates better without hurting the accuracy. Our code is available at https://github.com/Sizhe-Chen/AAA.
With the popularity of online purchases and user feedback system, huge number of reviews about products, services and social issues enable us to judge items through others' experience. However, generating comprehensive and diversified review summaries usually consumes extensive manual efforts. To solve this problem, we propose a generalized automatic pipeline to generate diversified review summary on different aspects. With the certain review corpus, word embedding matrix can be trained and entities in all aspects will be enlarged accordingly. Combining entities and syntax parsing method, adjectives in every aspect are extracted and divided given their sentiments. The final summary is generated by diverse sampling given aspect and sentiment. This pipeline could be transferred to other review corpora in other languages only if new aspects are specified.
The clumping index (CI) is a key structural parameter that quantifies the nonrandomness of the spatial distribution of vegetation canopy leaves. Investigating seasonal variations in the CI is crucial, especially for estimating the leaf area index (LAI) and studying global carbon and water cycles. However, accurate estimations of the seasonal CI have substantial challenges, e.g., from the need for accurate hot spot measurements, i.e., the typical feature of the bidirectional reflectance distribution fumnction (BRDF) shape in the current CI algorithm framework. Therefore, deriving a phenologically simplified stable CI product from a high-frequency CI product (e.g., 8 days) to reduce the uncertainty of CI seasonality and simplify CI applications remains important. In this study, we applied the discrete Fourier transform and an improved dynamic threshold method to estimate the start of season (SOS) and end of season (EOS) from the CI time series and indicated that the CI exhibits significant seasonal variation characteristics that are generally consistent with the MODIS land surface phenology (LSP) product (MCD12Q2), although seasonal differences between them probably exist. Second, we divided the vegetation cycle into two phenological stages based on the MODIS LSP product, ignoring the differences mentioned above, i.e., the leaf-on season (LOS, from greenup to dormancy) and the leaf-off season (LFS, after dormancy and before greenup of the next vegetation cycle), and developed the phenologically simplified two-stage CI product using the MODIS 8-day CI product suite. Finally, we assessed the accuracy of this CI product (RMSE = 0.06, bias = 0.01) via 95 datasets from 14 field-measured sites globally. This study revealed that the CI exhibited an approximately inverse trend in terms of phenological variation compared with the NDVI. Globally, based on the phenologically simplified two-stage CI product, the CILOS is smaller than the CILFS across all land cover types. Compared with the LFS stage, the quality for this CI product is better in the LOS stage, where the QA is basically identified as 0 and 1, accounting for more than ~90% of the total quality flag, which is significantly higher than that in the LFS stage (~60%). This study provides relatively reliable CI datasets that capture the general trend of seasonal CI variations and simplify potential applications in modeling ecological, meteorological, and other surface processes at both global and regional scales. Therefore, this study provides both new perspectives and datasets for future research in relation to CI and other biophysical parameters, e.g., the LAI.