Large language models are finetuned to refuse questions about hazardous
knowledge, but these protections can often be bypassed. Unlearning methods aim
to remove hazardous capabilities from models entirely, making them inaccessible
to adversaries. This work challenges, from an adversarial perspective, whether
unlearning differs fundamentally from traditional safety post-training. We
demonstrate that existing jailbreak methods, previously
reported as ineffective against unlearning, can be successful when applied
carefully. Furthermore, we develop a variety of adaptive methods that recover
most supposedly unlearned capabilities. For instance, we show that finetuning
on 10 unrelated examples or removing specific directions in the activation
space can recover most hazardous capabilities for models edited with RMU, a
state-of-the-art unlearning method. Our findings challenge the robustness of
current unlearning approaches and question their advantages over safety
training.
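To make the activation-space intervention concrete, the sketch below shows a common form of directional ablation: estimate a direction as the difference of mean residual-stream activations between two prompt sets, then project it out of a layer's output with a forward hook. This is a minimal sketch assuming a HuggingFace-style Llama model; the layer index, prompt sets, and hook placement are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of directional ablation, assuming a HuggingFace-style Llama
# model. `layer_idx`, the prompt sets, and the hook placement are illustrative.
import torch

@torch.no_grad()
def mean_activation(model, tokenizer, prompts, layer_idx):
    """Average residual-stream activation at each prompt's last token."""
    acts = []
    for p in prompts:
        batch = tokenizer(p, return_tensors="pt").to(model.device)
        out = model(**batch, output_hidden_states=True)
        acts.append(out.hidden_states[layer_idx][0, -1])
    return torch.stack(acts).mean(dim=0)

def make_ablation_hook(direction):
    """Remove the component along `direction` from a layer's output."""
    d = direction / direction.norm()
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h - (h @ d).unsqueeze(-1) * d  # project out the chosen direction
        return (h,) + output[1:] if isinstance(output, tuple) else h
    return hook

# Candidate direction: difference of mean activations between two prompt sets
# (both sets are assumptions for illustration).
# direction = mean_activation(model, tok, probe_prompts, layer_idx) \
#           - mean_activation(model, tok, unrelated_prompts, layer_idx)
# handle = model.model.layers[layer_idx].register_forward_hook(
#     make_ablation_hook(direction))
```

Because the hook can be removed again (`handle.remove()`), this kind of intervention is a cheap probe of how superficially a capability was suppressed.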
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank
Modifications
22 pages, 9 figures. Project page is available at
https://boyiwei.com/alignment-attribution/
Large language models (LLMs) show inherent brittleness in their safety
mechanisms, as evidenced by their susceptibility to jailbreaking and even
non-malicious fine-tuning. This study explores this brittleness of safety
alignment by leveraging pruning and low-rank modifications. We develop methods
to identify critical regions that are vital for safety guardrails, and that are
disentangled from utility-relevant regions at both the neuron and rank levels.
Surprisingly, the isolated regions we find are sparse, comprising about 3%
at the parameter level and 2.5% at the rank level. Removing these regions
compromises safety without significantly impacting utility, corroborating the
inherent brittleness of the model's safety mechanisms. Moreover, we show that
LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications
to the safety-critical regions are restricted. These findings underscore the
urgent need for more robust safety strategies in LLMs.
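To make the neuron-level analysis concrete, the sketch below shows one common way to locate candidate safety-critical weights: a first-order saliency score (|weight × gradient|) computed separately on a small safety-relevant batch and a utility batch, keeping weights that matter for the former but not the latter. The score, the ratio heuristic, and the 3% budget are assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch of locating candidate "safety-critical" weights with a
# first-order importance score, computed on a safety batch and a utility
# batch; the ratio heuristic and budget are illustrative assumptions.
import torch

def importance_scores(model, batch):
    """Per-parameter saliency |w * dL/dw| for one labeled batch."""
    model.zero_grad()
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    return {n: (p * p.grad).abs().detach()
            for n, p in model.named_parameters() if p.grad is not None}

def safety_critical_masks(model, safety_batch, utility_batch, frac=0.03):
    """Boolean masks marking weights important for safety but not utility."""
    s = importance_scores(model, safety_batch)
    u = importance_scores(model, utility_batch)
    masks = {}
    for n in s:
        score = s[n] / (u[n] + 1e-8)           # safety-specific importance
        k = max(1, int(frac * score.numel()))  # keep the top `frac` fraction
        thresh = score.flatten().topk(k).values.min()
        masks[n] = score >= thresh
    return masks
```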
The Responsible Foundation Model Development Cheatsheet: A Review of
Tools & Resources
arXiv:2406.16746v3
Foundation model development attracts a rapidly expanding body of
contributors, scientists, and applications. To help shape responsible
development practices, we introduce the Foundation Model Development
Cheatsheet: a growing collection of 250+ tools and resources spanning text,
vision, and speech modalities. We draw on a large body of prior work to survey
resources (e.g. software, documentation, frameworks, guides, and practical
tools) that support informed data selection, processing, and understanding,
precise and limitation-aware artifact documentation, efficient model training,
awareness of the environmental impact of training, careful evaluation of model
capabilities, risks, and claims, as well as responsible model
release, licensing and deployment practices. We hope this curated collection of
resources helps guide more responsible development. The process of curating
this list enabled us to review the AI development ecosystem, revealing what
tools are critically missing, misused, or over-used in existing practices. We
find that (i) tools for data sourcing, model evaluation, and monitoring are
critically under-serving ethical and real-world needs, (ii) evaluations for
model safety, capabilities, and environmental impact all lack reproducibility
and transparency, (iii) text and particularly English-centric analyses continue
to dominate over multilingual and multi-modal analyses, and (iv) evaluation of
systems, rather than just models, is needed so that capabilities and impact are
assessed in context.
What is in Your Safe Data? Identifying Benign Data that Breaks Safety
arXiv:2404.01099v2
Current Large Language Models (LLMs), even those tuned for safety and
alignment, are susceptible to jailbreaking. Some have found that just further
fine-tuning an aligned model with benign data (i.e., data without harmful
content) surprisingly leads to substantial degradation in safety. We delve into
the data-centric aspects of why benign fine-tuning inadvertently contributes to
jailbreaking. First, we represent fine-tuning data through two lenses:
representation and gradient spaces. We then propose a bi-directional
anchoring method that, during the selection process, prioritizes data points
that are close to harmful examples and far from benign ones. Our approach
effectively identifies subsets of benign data that are more likely to degrade
the model's safety after fine-tuning. Training on just 100 of these seemingly
benign datapoints surprisingly leads to the fine-tuned model affirmatively
responding to >70% of tested harmful requests, compared to <20% after
fine-tuning on randomly selected data. We also observe that the selected data
frequently appear as lists, bullet points, or math questions, indicating a
systematic pattern in fine-tuning data that contributes to jailbreaking.
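The selection step can be illustrated with a small ranking function: score each candidate example by its similarity to harmful anchor examples minus its similarity to safe anchors in some representation space, and keep the top-k. The `embed` placeholder, the use of anchor-set means, and the subset size are assumptions for illustration, not the paper's exact bi-directional anchoring procedure.

```python
# Minimal sketch of anchor-based selection in a representation space.
# `embed` is a placeholder returning an (N, d) tensor of example embeddings.
import torch
import torch.nn.functional as F

def select_risky_subset(embed, candidates, harmful_anchors, safe_anchors, k=100):
    cand = F.normalize(embed(candidates), dim=-1)            # (N, d)
    harm = F.normalize(embed(harmful_anchors), dim=-1).mean(0)
    safe = F.normalize(embed(safe_anchors), dim=-1).mean(0)
    # close to harmful anchors, far from safe anchors
    scores = cand @ harm - cand @ safe                        # (N,)
    top = torch.topk(scores, k).indices
    return [candidates[i] for i in top.tolist()]
```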
SORRY-Bench: Systematically Evaluating Large Language Model Safety
Refusal Behaviors
arXiv:2406.14598v1
Evaluating aligned large language models' (LLMs) ability to recognize and
reject unsafe user requests is crucial for safe, policy-compliant deployments.
Existing evaluation efforts, however, face three limitations that we address
with SORRY-Bench, our proposed benchmark. First, existing methods often use
coarse-grained taxonomies of unsafe topics and over-represent some fine-grained
topics. For example, among the ten existing datasets that we
evaluated, tests for refusals of self-harm instructions are over 3x less
represented than tests for fraudulent activities. SORRY-Bench improves on this
by using a fine-grained taxonomy of 45 potentially unsafe topics, and 450
class-balanced unsafe instructions, compiled through human-in-the-loop methods.
Second, the linguistic characteristics and formatting of prompts -- different
languages, dialects, and more -- are often overlooked or only implicitly
considered in many evaluations. We supplement SORRY-Bench with 20
diverse linguistic augmentations to systematically examine these effects.
Third, existing evaluations rely on large LLMs (e.g., GPT-4) for evaluation,
which can be computationally expensive. We investigate design choices for
creating a fast, accurate automated safety evaluator. By collecting 7K+ human
annotations and conducting a meta-evaluation of diverse LLM-as-a-judge designs,
we show that fine-tuned 7B LLMs can achieve accuracy comparable to GPT-4 scale
LLMs, with lower computational cost. Putting these together, we evaluate over
40 proprietary and open-source LLMs on SORRY-Bench, analyzing their distinctive
refusal behaviors. We hope our effort provides a building block for systematic
evaluations of LLMs' safety refusal capabilities, in a balanced, granular, and
efficient manner.
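The evaluation loop such a benchmark implies can be sketched as follows: for each (category, unsafe instruction) pair, generate a response, ask a judge whether the request was fulfilled, and report per-category refusal rates. `generate` and `judge_fulfilled` are placeholder callables, not SORRY-Bench's actual API.

```python
# Minimal sketch of a category-balanced refusal evaluation with an automated
# judge. `generate` and `judge_fulfilled` are placeholders, not the benchmark's
# actual interface.
from collections import defaultdict

def refusal_rates(bench, generate, judge_fulfilled):
    """bench: iterable of (category, unsafe_instruction) pairs."""
    fulfilled, total = defaultdict(int), defaultdict(int)
    for category, instruction in bench:
        response = generate(instruction)
        if judge_fulfilled(instruction, response):  # True = model complied
            fulfilled[category] += 1
        total[category] += 1
    return {c: 1.0 - fulfilled[c] / total[c] for c in total}  # refusal rate
```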
Safety Alignment Should Be Made More Than Just a Few Tokens Deep
arXiv:2406.05946v1
The safety alignment of current Large Language Models (LLMs) is vulnerable.
Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned
models. We argue that many of these vulnerabilities are related to a shared
underlying issue: safety alignment can take shortcuts, wherein the alignment
adapts a model's generative distribution primarily over only its very first few
output tokens. We refer to this issue as shallow safety alignment. In this
paper, we present case studies to explain why shallow safety alignment can
exist and provide evidence that current aligned LLMs are subject to this issue.
We also show how these findings help explain multiple recently discovered
vulnerabilities in LLMs, including the susceptibility to adversarial suffix
attacks, prefilling attacks, decoding parameter attacks, and fine-tuning
attacks. Importantly, we discuss how this consolidated notion of shallow safety
alignment sheds light on promising research directions for mitigating these
vulnerabilities. For instance, we show that deepening the safety alignment
beyond just the first few tokens can often meaningfully improve robustness
against some common exploits. Finally, we design a regularized finetuning
objective that makes the safety alignment more persistent against fine-tuning
attacks by constraining updates on initial tokens. Overall, we advocate that
future safety alignment should be made more than just a few tokens deep.
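One way to read "constraining updates on initial tokens" is as a per-position KL penalty toward the original aligned model, weighted more heavily on the earliest output positions. The sketch below implements that reading; the weight schedule, `first_k`, and `beta` are illustrative assumptions (and positions are counted from the start of the sequence for simplicity), not the paper's exact objective.

```python
# Minimal sketch of a fine-tuning loss with a position-weighted KL penalty to
# a frozen reference (aligned) model; the schedule and constants are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def token_constrained_loss(model, ref_model, batch, first_k=5, beta=5.0):
    out = model(**batch, labels=batch["input_ids"])
    with torch.no_grad():
        ref_logits = ref_model(**batch).logits
    logp = F.log_softmax(out.logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # per-token KL(reference || current), shape (batch, seq)
    kl = (ref_logp.exp() * (ref_logp - logp)).sum(-1)
    # heavier penalty on the earliest positions, lighter afterwards
    w = torch.ones_like(kl)
    w[:, :first_k] = beta
    return out.loss + (w * kl).mean()
```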
AI Risk Management Should Incorporate Both Safety and Security
arXiv:2405.19524v1
The exposure of security vulnerabilities in safety-aligned language models,
e.g., susceptibility to adversarial attacks, has shed light on the intricate
interplay between AI safety and AI security. Although the two disciplines now
come together under the overarching goal of AI risk management, they have
historically evolved separately, giving rise to differing perspectives.
Therefore, in this paper, we advocate that stakeholders in AI risk management
should be aware of the nuances, synergies, and interplay between safety and
security, and unambiguously take into account the perspectives of both
disciplines in order to devise effective and holistic risk mitigation
approaches. Unfortunately, this vision is often obscured, as the definitions
of the basic concepts of "safety" and "security" themselves are often
inconsistent and lack consensus across communities. With AI risk management
being increasingly cross-disciplinary, this issue is particularly salient. In
light of this conceptual challenge, we introduce a unified reference framework
to clarify the differences and interplay between AI safety and AI security,
aiming to facilitate a shared understanding and effective collaboration across
communities.
arXiv:2403.04893v1
Independent evaluation and red teaming are critical for identifying the risks
posed by generative AI systems. However, the terms of service and enforcement
strategies used by prominent AI companies to deter model misuse can
disincentivize good-faith safety evaluations. This causes some researchers to
fear that conducting such research or releasing their findings will result in
account suspensions or legal reprisal. Although some companies offer researcher
access programs, they are an inadequate substitute for independent research
access, as they have limited community representation, receive inadequate
funding, and lack independence from corporate incentives. We propose that major
AI developers commit to providing a legal and technical safe harbor,
indemnifying public interest safety research and protecting it from the threat
of account suspensions or legal reprisal. These proposals emerged from our
collective experience conducting safety, privacy, and trustworthiness research
on generative AI systems, where norms and incentives could be better aligned
with public interests, without exacerbating model misuse. We believe these
commitments are a necessary step towards more inclusive and unimpeded community
efforts to tackle the risks of generative AI.
Fine-tuning Aligned Language Models Compromises Safety, Even When Users
Do Not Intend To!
arXiv:2310.03693v1
Optimizing large language models (LLMs) for downstream use cases often
involves the customization of pre-trained LLMs through further fine-tuning.
Meta's open release of Llama models and OpenAI's APIs for fine-tuning GPT-3.5
Turbo on custom datasets also encourage this practice. But what are the safety
costs associated with such custom fine-tuning? We note that while existing
safety alignment infrastructures can restrict harmful behaviors of LLMs at
inference time, they do not cover safety risks when fine-tuning privileges are
extended to end-users. Our red teaming studies find that the safety alignment
of LLMs can be compromised by fine-tuning with only a few adversarially
designed training examples. For instance, we jailbreak GPT-3.5 Turbo's safety
guardrails by fine-tuning it on only 10 such examples at a cost of less than
$0.20 via OpenAI's APIs, making the model responsive to nearly any harmful
instructions. Disconcertingly, our research also reveals that, even without
malicious intent, simply fine-tuning with benign and commonly used datasets can
also inadvertently degrade the safety alignment of LLMs, though to a lesser
extent. These findings suggest that fine-tuning aligned LLMs introduces new
safety risks that current safety infrastructures fall short of addressing --
even if a model's initial safety alignment is impeccable, it will not
necessarily be maintained after custom fine-tuning. We outline and critically analyze
potential mitigations and advocate for further research efforts toward
reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
Holistic Evaluation of Language Models
Authored by the Center for Research on Foundation Models (CRFM) at
the Stanford Institute for Huma...
Language models (LMs) are becoming the foundation for almost all major
language technologies, but their capabilities, limitations, and risks are not
well understood. We present Holistic Evaluation of Language Models (HELM) to
improve the transparency of language models. First, we taxonomize the vast
space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata)
that are of interest for LMs. Then we select a broad subset based on coverage
and feasibility, noting what's missing or underrepresented (e.g. question
answering for neglected English dialects, metrics for trustworthiness). Second,
we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration,
robustness, fairness, bias, toxicity, and efficiency) for each of 16 core
scenarios when possible (87.5% of the time). This ensures that metrics beyond
accuracy don't fall by the wayside and that trade-offs are clearly exposed. We
also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze
specific aspects (e.g. reasoning, disinformation). Third, we conduct a
large-scale evaluation of 30 prominent language models (spanning open,
limited-access, and closed models) on all 42 scenarios, 21 of which were not
previously used in mainstream LM evaluation. Prior to HELM, models on average
were evaluated on just 17.9% of the core HELM scenarios, with some prominent
models not sharing a single scenario in common. We improve this to 96.0%: now
all 30 models have been densely benchmarked on the same core scenarios and
metrics under standardized conditions. Our evaluation surfaces 25 top-level
findings. For full transparency, we release all raw model prompts and
completions publicly for further analysis, as well as a general modular
toolkit. We intend for HELM to be a living benchmark for the community,
continuously updated with new scenarios, metrics, and models.
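The bookkeeping behind such a multi-metric, multi-scenario evaluation can be sketched with a small aggregation step: results keyed by (model, scenario, metric), from which per-metric averages and core-scenario coverage are computed. The data layout and names below are assumptions for illustration, not HELM's actual toolkit.

```python
# Minimal sketch of aggregating a (model, scenario, metric) results table into
# per-model metric averages and core-scenario coverage. Names are placeholders.
from collections import defaultdict

METRICS = ["accuracy", "calibration", "robustness", "fairness",
           "bias", "toxicity", "efficiency"]

def summarize(results, core_scenarios):
    """results: dict mapping (model, scenario, metric) -> float."""
    per_model = defaultdict(lambda: defaultdict(list))
    covered = defaultdict(set)
    for (model, scenario, metric), value in results.items():
        per_model[model][metric].append(value)
        if scenario in core_scenarios:
            covered[model].add(scenario)
    return {
        model: {
            "coverage": len(covered[model]) / len(core_scenarios),
            **{m: sum(v) / len(v) for m, v in metrics.items()},
        }
        for model, metrics in per_model.items()
    }
```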