JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
The camera-ready version of JailbreakBench v1.0 (accepted at NeurIPS
2024 Datasets and Benchmarks ...
Jailbreak attacks cause large language models (LLMs) to generate harmful,
unethical, or otherwise objectionable content. Evaluating these attacks
presents a number of challenges, which the current collection of benchmarks and
evaluation techniques do not adequately address. First, there is no clear
standard of practice regarding jailbreaking evaluation. Second, existing works
compute costs and success rates in incomparable ways. And third, numerous works
are not reproducible, as they withhold adversarial prompts, involve
closed-source code, or rely on evolving proprietary APIs. To address these
challenges, we introduce JailbreakBench, an open-sourced benchmark with the
following components: (1) an evolving repository of state-of-the-art
adversarial prompts, which we refer to as jailbreak artifacts; (2) a
jailbreaking dataset comprising 100 behaviors -- both original and sourced from
prior work (Zou et al., 2023; Mazeika et al., 2023, 2024) -- which align with
OpenAI's usage policies; (3) a standardized evaluation framework at
https://github.com/JailbreakBench/jailbreakbench that includes a clearly
defined threat model, system prompts, chat templates, and scoring functions;
and (4) a leaderboard at https://jailbreakbench.github.io/ that tracks the
performance of attacks and defenses for various LLMs. We have carefully
considered the potential ethical implications of releasing this benchmark, and
believe that it will be a net positive for the community.
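To make the scoring concrete, here is a minimal sketch of the kind of bookkeeping the evaluation framework standardizes: computing a per-behavior attack success rate from a list of jailbreak artifacts. The field names ("behavior", "prompt", "jailbroken") are illustrative assumptions, not the benchmark's actual schema; see the repository linked above for the real loaders and scoring functions.

```python
# Minimal sketch (not the JailbreakBench API): compute an attack success rate
# (ASR) from a hypothetical list of jailbreak artifacts. Field names are
# assumptions for illustration only.
from collections import defaultdict

artifacts = [
    {"behavior": "phishing_email", "prompt": "...", "jailbroken": True},
    {"behavior": "phishing_email", "prompt": "...", "jailbroken": False},
    {"behavior": "malware_advice", "prompt": "...", "jailbroken": True},
]

per_behavior = defaultdict(list)
for artifact in artifacts:
    per_behavior[artifact["behavior"]].append(artifact["jailbroken"])

# A behavior counts as broken if any submitted prompt for it succeeds.
asr = sum(any(results) for results in per_behavior.values()) / len(per_behavior)
print(f"Attack success rate: {asr:.0%}")
```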
GPT-4o System Card
arXiv:2410.21276v1
GPT-4o is an autoregressive omni model that accepts as input any combination
of text, audio, image, and video, and generates any combination of text, audio,
and image outputs. It's trained end-to-end across text, vision, and audio,
meaning all inputs and outputs are processed by the same neural network. GPT-4o
can respond to audio inputs in as little as 232 milliseconds, with an average
of 320 milliseconds, which is similar to human response time in conversation.
It matches GPT-4 Turbo performance on text in English and code, with
significant improvement on text in non-English languages, while also being much
faster and 50% cheaper in the API. GPT-4o is notably better at vision and
audio understanding than existing models. In line with our commitment to
building AI safely and consistent with our voluntary commitments to the White
House, we are sharing the GPT-4o System Card, which includes our Preparedness
Framework evaluations. In this System Card, we provide a detailed look at
GPT-4o's capabilities, limitations, and safety evaluations across multiple
categories, focusing on speech-to-speech while also evaluating text and image
capabilities, and measures we've implemented to ensure the model is safe and
aligned. We also include third-party assessments on dangerous capabilities, as
well as discussion of potential societal impacts of GPT-4o's text and vision
capabilities.
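As a rough illustration of the multimodal input interface described above, the sketch below sends a mixed text-and-image request to GPT-4o through the OpenAI Python SDK's chat-completions interface; the image URL is a placeholder, and audio or image outputs are handled by separate endpoints not shown here.

```python
# Hedged sketch: a text + image request to GPT-4o via the OpenAI Python SDK
# (v1-style client). The image URL is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/scene.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```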
The Llama 3 Herd of Models
arXiv:2407.21783v2
Modern artificial intelligence (AI) systems are powered by foundation models.
This paper presents a new set of foundation models, called Llama 3. It is a
herd of language models that natively support multilinguality, coding,
reasoning, and tool usage. Our largest model is a dense Transformer with 405B
parameters and a context window of up to 128K tokens. This paper presents an
extensive empirical evaluation of Llama 3. We find that Llama 3 delivers
comparable quality to leading language models such as GPT-4 on a plethora of
tasks. We publicly release Llama 3, including pre-trained and post-trained
versions of the 405B parameter language model and our Llama Guard 3 model for
input and output safety. The paper also presents the results of experiments in
which we integrate image, video, and speech capabilities into Llama 3 via a
compositional approach. We observe this approach performs competitively with
the state-of-the-art on image, video, and speech recognition tasks. The
resulting models are not yet being broadly released as they are still under
development.
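The abstract mentions Llama Guard 3 for input and output safety. The sketch below shows one plausible way to run such a guard model as a conversation classifier with Hugging Face transformers; the checkpoint id and the convention that the guard replies "safe" or "unsafe" plus a hazard category are assumptions drawn from the Llama Guard model cards, not from this paper.

```python
# Hedged sketch: classifying a conversation with a Llama Guard 3 checkpoint
# via Hugging Face transformers. The model id is an assumption; verify against
# the official model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

chat = [
    {"role": "user", "content": "How do I pick a lock?"},
    {"role": "assistant", "content": "I can't help with that."},
]
# The guard's chat template wraps the conversation in its moderation prompt.
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
out = model.generate(input_ids, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```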
Jailbreaking Black Box Large Language Models in Twenty Queries
arXiv:2310.08419v4
There is growing interest in ensuring that large language models (LLMs) align
with human values. However, the alignment of such models is vulnerable to
adversarial jailbreaks, which coax LLMs into overriding their safety
guardrails. The identification of these vulnerabilities is therefore
instrumental in understanding inherent weaknesses and preventing future misuse.
To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an
algorithm that generates semantic jailbreaks with only black-box access to an
LLM. PAIR -- which is inspired by social engineering attacks -- uses an
attacker LLM to automatically generate jailbreaks for a separate targeted LLM
without human intervention. In this way, the attacker LLM iteratively queries
the target LLM to update and refine a candidate jailbreak. Empirically, PAIR
often requires fewer than twenty queries to produce a jailbreak, which is
orders of magnitude more efficient than existing algorithms. PAIR also achieves
competitive jailbreaking success rates and transferability on open and
closed-source LLMs, including GPT-3.5/4, Vicuna, and Gemini.
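The core loop described in the abstract can be summarized in a few lines: an attacker model proposes a candidate jailbreak, the black-box target responds, a judge scores the response, and the attacker refines its prompt using that feedback until a query budget is exhausted. The sketch below is a schematic of that loop, not the authors' implementation; `attacker`, `target`, and `judge` are placeholder callables.

```python
# Schematic of a PAIR-style refinement loop. `attacker` and `target` map a
# prompt string to a response string; `judge` maps (goal, response) to a
# score in [0, 1]. All three are placeholders supplied by the caller.
def pair_attack(attacker, target, judge, goal, max_queries=20, threshold=0.9):
    prompt = attacker(f"Write a prompt that would make a model perform this task: {goal}")
    for num_queries in range(1, max_queries + 1):
        response = target(prompt)          # one black-box query to the target
        score = judge(goal, response)      # 0.0 (refusal) ... 1.0 (full jailbreak)
        if score >= threshold:
            return prompt, response, num_queries
        # Feed the failed attempt and its score back to the attacker so it can
        # iteratively refine the candidate (the social-engineering step).
        prompt = attacker(
            f"Goal: {goal}\nPrevious prompt: {prompt}\n"
            f"Target response: {response}\nJudge score: {score:.2f}\n"
            "Revise the prompt so the target is more likely to comply."
        )
    return None, None, max_queries         # query budget exhausted
```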
A Safe Harbor for AI Evaluation and Red Teaming
arXiv:2403.04893v1
Independent evaluation and red teaming are critical for identifying the risks
posed by generative AI systems. However, the terms of service and enforcement
strategies used by prominent AI companies to deter model misuse can
disincentivize good-faith safety evaluations. This causes some researchers to
fear that conducting such research or releasing their findings will result in
account suspensions or legal reprisal. Although some companies offer researcher
access programs, they are an inadequate substitute for independent research
access, as they have limited community representation, receive inadequate
funding, and lack independence from corporate incentives. We propose that major
AI developers commit to providing a legal and technical safe harbor,
indemnifying public interest safety research and protecting it from the threat
of account suspensions or legal reprisal. These proposals emerged from our
collective experience conducting safety, privacy, and trustworthiness research
on generative AI systems, where norms and incentives could be better aligned
with public interests, without exacerbating model misuse. We believe these
commitments are a necessary step towards more inclusive and unimpeded community
efforts to tackle the risks of generative AI.
SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering
in LLMs
arXiv:2411.07122v1
Large Language Models (LLMs) have demonstrated remarkable capabilities in
generating human-like text, but their output may not be aligned with the user
or even produce harmful content. This paper presents a novel approach to detect
and steer concepts such as toxicity before generation. We introduce the Sparse
Conditioned Autoencoder (SCAR), a single trained module that extends the
otherwise untouched LLM. SCAR ensures full steerability, towards and away from
concepts (e.g., toxic content), without compromising the quality of the model's
text generation on standard evaluation benchmarks. We demonstrate the effective
application of our approach through a variety of concepts, including toxicity,
safety, and writing style alignment. As such, this work establishes a robust
framework for controlling LLM generations, ensuring their ethical and safe
deployment in real-world applications.
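A minimal sketch of the underlying idea, under the assumption that one latent unit of a sparse autoencoder is supervised to fire on a concept such as toxicity: at inference the unit's activation serves as a concept detector, and clamping it steers generation toward or away from the concept. Dimensions, losses, and the single-unit conditioning are simplifications, not the paper's exact architecture.

```python
# Simplified sketch of a conditioned sparse autoencoder attached to one
# transformer hidden state. One latent unit (concept_idx) is assumed to be
# supervised on a concept label during training.
import torch
import torch.nn as nn

class ConditionedSAE(nn.Module):
    def __init__(self, d_model: int, d_latent: int, concept_idx: int = 0):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)
        self.concept_idx = concept_idx      # latent unit tied to the concept

    def forward(self, h: torch.Tensor, steer: float | None = None):
        z = torch.relu(self.enc(h))         # sparse (ReLU) latent code
        if steer is not None:               # clamp the concept unit to steer
            z = z.clone()
            z[..., self.concept_idx] = steer
        return self.dec(z), z               # decoded state replaces h in the hooked layer

    def concept_score(self, h: torch.Tensor) -> torch.Tensor:
        # Detection: the conditioned unit's activation acts as a concept probe.
        return torch.relu(self.enc(h))[..., self.concept_idx]

# Training (not shown) would combine reconstruction and sparsity losses with a
# supervised loss tying z[..., concept_idx] to concept labels.
```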
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models
Text-to-image models encounter safety issues, including concerns related to
copyright and Not-Safe-For-Work (NSFW) content. Although several methods have
been proposed for erasing inappropriate concepts from diffusion models, they
often exhibit incomplete erasure, consume substantial computing resources, and
inadvertently damage generation ability. In this work, we introduce Reliable
and Efficient Concept Erasure (RECE), a novel approach that modifies the model
in 3 seconds without necessitating additional fine-tuning. Specifically, RECE
efficiently leverages a closed-form solution to derive new target embeddings,
which are capable of regenerating erased concepts within the unlearned model.
To mitigate inappropriate content potentially represented by derived
embeddings, RECE further aligns them with harmless concepts in cross-attention
layers. The derivation and erasure of new representation embeddings are
conducted iteratively to achieve a thorough erasure of inappropriate concepts.
In addition, to preserve the model's generation ability, RECE introduces an
additional regularization term during the derivation process, which minimizes
the impact on unrelated concepts during erasure. All the
processes above are in closed-form, guaranteeing extremely efficient erasure in
only 3 seconds. Benchmarking against previous approaches, our method achieves
more efficient and thorough erasure with minor damage to original generation
ability and demonstrates enhanced robustness against red-teaming tools. Code is
available at https://github.com/CharlesGong12/RECE.
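As a simplified illustration of a closed-form cross-attention edit of the kind the abstract describes, the sketch below remaps a projection matrix so that embeddings of the erased concept produce the outputs of a harmless anchor concept, with a ridge term that limits the impact on preserved concepts. This mirrors UCE-style closed-form editing; RECE's adversarial re-derivation of target embeddings and its iteration are omitted, so treat this as a sketch rather than the paper's method.

```python
# Ridge-regularized closed-form edit of a cross-attention projection matrix.
# Minimizes sum ||W_new c_i - t_i||^2 + lam ||W_new - W||_F^2 in closed form.
import numpy as np

def closed_form_edit(W, C_erase, C_anchor, C_preserve, lam=0.1):
    """
    W:          (d_out, d_in) original cross-attention projection (K or V).
    C_erase:    (n_e, d_in)   text embeddings of the concept to erase.
    C_anchor:   (n_e, d_in)   harmless embeddings the erased concept maps to.
    C_preserve: (n_p, d_in)   embeddings whose outputs should stay unchanged.
    """
    # Targets: erased concepts -> anchor outputs, preserved concepts -> own outputs.
    C = np.vstack([C_erase, C_preserve])               # (n, d_in)
    T = np.vstack([C_anchor @ W.T, C_preserve @ W.T])  # (n, d_out)
    A = C.T @ C + lam * np.eye(W.shape[1])             # (d_in, d_in)
    B = C.T @ T + lam * W.T                            # (d_in, d_out)
    return np.linalg.solve(A, B).T                     # new W, (d_out, d_in)
```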
Unraveling and Mitigating Safety Alignment Degradation of
Vision-Language Models
The safety alignment ability of Vision-Language Models (VLMs) is prone to
degradation when a vision module is integrated on top of the LLM backbone.
We investigate this phenomenon, dubbed "safety alignment degradation" in
this paper, and show that the challenge arises from the representation gap that
emerges when introducing the vision modality to VLMs. In particular, we show that
the representations of multi-modal inputs shift away from those of text-only
inputs, which represent the distribution that the LLM backbone is optimized for.
At the same time, the safety alignment capabilities, initially developed within
the textual embedding space, do not successfully transfer to this new
multi-modal representation space. To reduce safety alignment degradation, we
introduce Cross-Modality Representation Manipulation (CMRM), an inference-time
representation intervention method for recovering the safety alignment ability
that is inherent in the LLM backbone of VLMs, while simultaneously preserving
the functional capabilities of VLMs. The empirical results show that our
framework significantly recovers the alignment ability that is inherited from
the LLM backbone with minimal impact on the fluency and linguistic capabilities
of pre-trained VLMs even without additional training. Specifically, the unsafe
rate of LLaVA-7B on multi-modal input can be reduced from 61.53% to as low as
3.15% with only inference-time intervention.
WARNING: This paper contains examples of toxic or harmful language.
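A hedged sketch of an inference-time representation intervention in this spirit: estimate, on a calibration set, how multi-modal hidden states are shifted relative to text-only ones, then subtract that shift at inference so activations land back in the distribution the LLM backbone was aligned on. The layer choice and the simple mean-shift estimate are assumptions, not the paper's exact procedure.

```python
# Assumed mean-shift intervention on one decoder layer of a VLM.
import torch

@torch.no_grad()
def estimate_shift(hidden_multimodal: torch.Tensor,
                   hidden_text_only: torch.Tensor) -> torch.Tensor:
    """Both tensors: (num_samples, d_model) hidden states from the same layer."""
    return hidden_multimodal.mean(dim=0) - hidden_text_only.mean(dim=0)

def make_intervention_hook(shift: torch.Tensor, alpha: float = 1.0):
    """Forward hook that pulls multi-modal activations toward the text-only
    distribution by removing the estimated cross-modality shift."""
    def hook(module, inputs, output):
        hs = output[0] if isinstance(output, tuple) else output
        hs = hs - alpha * shift.to(hs.dtype).to(hs.device)
        return (hs, *output[1:]) if isinstance(output, tuple) else hs
    return hook

# Usage (assumed HF-style VLM): register on a chosen decoder layer, e.g.
#   layer = model.language_model.model.layers[k]
#   layer.register_forward_hook(make_intervention_hook(shift))
```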
FairQuant: Certifying and Quantifying Fairness of Deep Neural Networks
Accepted at ICSE 2025; To Appear In Proceedings of the 47th IEEE/ACM
International Conference on S...
We propose a method for formally certifying and quantifying individual
fairness of deep neural networks (DNN). Individual fairness guarantees that any
two individuals who are identical except for a legally protected attribute
(e.g., gender or race) receive the same treatment. While there are existing
techniques that provide such a guarantee, they tend to suffer from a lack of
scalability or accuracy as the size and input dimension of the DNN increase.
Our method overcomes this limitation by applying abstraction to a symbolic
interval-based analysis of the DNN, followed by iterative refinement guided by
the fairness property. Furthermore, our method lifts the symbolic interval-based
analysis from conventional qualitative certification to quantitative
certification, by computing the percentage of individuals whose classification
outputs are provably fair, instead of merely deciding if the DNN is fair. We
have implemented our method and evaluated it on deep neural networks trained on
four popular fairness research datasets. The experimental results show that our
method is not only more accurate than state-of-the-art techniques but also
several orders of magnitude faster.
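A toy sketch of the two ingredients named above, under strong simplifications: interval propagation over the protected attribute stands in for the symbolic analysis (with no abstraction or iterative refinement), and the quantitative part is the fraction of individuals whose predicted class provably cannot change when only that attribute varies. A single coarse interval pass like this will under-approximate the certified percentage.

```python
# Toy quantitative fairness certification by interval arithmetic over the
# protected attribute of a ReLU network (weights/biases given as numpy arrays).
import numpy as np

def _forward(x, weights, biases):
    """Concrete forward pass (ReLU hidden layers, linear output layer)."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = W @ x + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0)
    return x

def _interval_forward(lo, hi, weights, biases):
    """Propagate the input box [lo, hi] with interval arithmetic."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)
        lo, hi = Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b
        if i < len(weights) - 1:
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    return lo, hi

def certified_fair_fraction(X, protected_idx, weights, biases):
    """Fraction of individuals whose predicted class provably cannot change
    when only the protected attribute varies over [0, 1]."""
    certified = 0
    for x in X:
        pred = int(np.argmax(_forward(x, weights, biases)))
        lo, hi = x.astype(float).copy(), x.astype(float).copy()
        lo[protected_idx], hi[protected_idx] = 0.0, 1.0
        out_lo, out_hi = _interval_forward(lo, hi, weights, biases)
        rival_hi = np.delete(out_hi, pred)
        certified += int(out_lo[pred] > rival_hi.max())
    return certified / len(X)
```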
Reactive Multi-Robot Navigation in Outdoor Environments Through
Uncertainty-Aware Active Learning of Human Preference Landscape
arXiv:2409.16577v1
Compared with single robots, Multi-Robot Systems (MRS) can perform missions
more efficiently due to the presence of multiple members with diverse
capabilities. However, deploying an MRS in large real-world environments is
still challenging due to varied and uncertain obstacles (e.g., building
clusters and trees). With a limited understanding of how environmental
uncertainty affects performance, an MRS cannot flexibly adjust its behaviors
(e.g., teaming, load sharing, trajectory planning) to ensure both environment
adaptation and task accomplishment. In this work, a novel joint preference
landscape learning and behavior-adjusting framework (PLBA) is designed. PLBA
efficiently integrates real-time human guidance into MRS coordination and
utilizes Sparse
Variational Gaussian Processes with Varying Output Noise to quickly assess
human preferences by leveraging spatial correlations between environment
characteristics. An optimization-based behavior-adjusting method then safely
adapts MRS behaviors to environments. To validate PLBA's effectiveness in MRS
behavior adaptation, a flood disaster search-and-rescue task was designed.
Twenty human users provided 1,764 pieces of preference feedback on MRS
behaviors related to "task quality", "task progress", and "robot safety". The
prediction accuracy and adaptation speed results show the effectiveness of PLBA
in preference learning and MRS behavior adaptation.
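A much-simplified stand-in for the preference-landscape model described above: a GP regression over environment features in which lower-confidence feedback receives a larger per-sample noise term. The real framework uses a Sparse Variational Gaussian Process with varying output noise; the feature names and the confidence-to-noise mapping here are assumptions for illustration.

```python
# Simplified preference-landscape regression: an exact GP with per-sample
# output noise (sklearn adds `alpha` to the kernel diagonal), standing in for
# a sparse variational GP with varying output noise.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Environment features per queried location, e.g. [obstacle_density, openness].
X = np.array([[0.1, 0.9], [0.4, 0.6], [0.8, 0.2], [0.6, 0.5]])
# Human preference scores for the MRS behavior shown at each location.
y = np.array([0.9, 0.7, 0.2, 0.4])
# Lower-confidence feedback gets a larger noise term on the kernel diagonal.
feedback_confidence = np.array([0.9, 0.8, 0.5, 0.6])
noise = (1.0 - feedback_confidence) ** 2 + 1e-3

gp = GaussianProcessRegressor(
    kernel=ConstantKernel(1.0) * RBF(length_scale=0.5),
    alpha=noise,                 # per-sample output noise
    normalize_y=True,
)
gp.fit(X, y)

# Query the learned preference landscape (mean and uncertainty) at new spots;
# the uncertainty can drive active querying of the human for more feedback.
X_new = np.array([[0.3, 0.7], [0.7, 0.3]])
mean, std = gp.predict(X_new, return_std=True)
print(list(zip(mean.round(2), std.round(2))))
```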