In recent years, Large Language Models (LLMs) have gained widespread use,
raising concerns about their security. Traditional jailbreak attacks often
rely on the victim model's internal information or are limited in how they
explore its unsafe behavior, which reduces their general applicability. In
this paper, we introduce PathSeeker, a novel black-box
jailbreak method inspired by the game of rats escaping a maze. We view each
LLM as having its own unique "security maze", in which attackers attempt to
find the exit by learning from the received feedback and their accumulated
experience in order to compromise the target LLM's security defenses. Our approach
leverages multi-agent reinforcement learning, where smaller models collaborate
to guide the main LLM in performing mutation operations to achieve the attack
objectives. By progressively modifying inputs based on the model's feedback,
our system induces richer, harmful responses. During our manual attempts to
perform jailbreak attacks, we observed that the vocabulary of the target
model's responses gradually became richer until the model eventually produced
harmful content. Based on this observation, we also introduce a reward mechanism that exploits
the expansion of vocabulary richness in LLM responses to weaken security
constraints. Our method outperforms five state-of-the-art attack techniques
when tested across 13 commercial and open-source LLMs, achieving high attack
success rates, especially against commercial models with strong safety
alignment such as GPT-4o-mini, Claude-3.5, and GLM-4-air. This study aims to
improve the understanding of LLM security vulnerabilities, and we hope it can
contribute to the development of more robust defenses.
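As a rough illustration of the reward idea sketched above, the snippet below scores attack progress by the growth of the target response's vocabulary. It is a minimal sketch under the assumption that distinct-word count is a reasonable richness proxy; the function names and the scaling constant are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of a vocabulary-richness reward, inspired by the
# observation that the target model's responses grow lexically richer before a
# jailbreak succeeds. Not the paper's actual reward implementation.
import re

def vocab_size(text: str) -> int:
    """Number of distinct word tokens in a response."""
    return len(set(re.findall(r"[a-zA-Z']+", text.lower())))

def richness_reward(prev_response: str, new_response: str) -> float:
    """Positive reward when the new response uses a richer vocabulary."""
    return max(vocab_size(new_response) - vocab_size(prev_response), 0) / 100.0

# A short templated refusal has a small vocabulary; a longer, elaborating
# answer has a larger one, so the attacker agents are rewarded for progress.
print(richness_reward("I cannot help with that.",
                      "Sure. First, gather the following materials, then ..."))
```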
GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled
Reasoning
arXiv:2406.09187v1
The rapid advancement of large language models (LLMs) has catalyzed the
deployment of LLM-powered agents across numerous applications, raising new
concerns regarding their safety and trustworthiness. Existing methods for
enhancing the safety of LLMs are not directly transferable to LLM-powered
agents due to their diverse objectives and output modalities. In this paper, we
propose GuardAgent, the first LLM agent that serves as a guardrail for other
LLM agents. Specifically, GuardAgent oversees a target LLM agent by checking
whether its inputs and outputs satisfy a set of guard requests defined by the users.
GuardAgent comprises two steps: 1) creating a task plan by analyzing the
provided guard requests, and 2) generating guardrail code based on the task
plan and executing the code by calling APIs or using external engines. In both
steps, an LLM is utilized as the core reasoning component, supplemented by
in-context demonstrations retrieved from a memory module. Such
knowledge-enabled reasoning allows GuardAgent to understand various textual
guard requests and accurately "translate" them into executable code that
provides reliable guardrails. Furthermore, GuardAgent is equipped with an
extendable toolbox containing functions and APIs and requires no additional LLM
training, which underscores its generalization capabilities and low operational
overhead. Additionally, we propose two novel benchmarks: an EICU-AC benchmark
for assessing privacy-related access control for healthcare agents and a
Mind2Web-SC benchmark for safety evaluation of web agents. We show the
effectiveness of GuardAgent on these two benchmarks with 98.7% and 90.0%
accuracy in moderating invalid inputs and outputs for the two types of agents,
respectively. We also show that GuardAgent is able to define novel functions in
adaptation to emerging LLM agents and guard requests, which underscores its
strong generalization capabilities.
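To illustrate the kind of guardrail GuardAgent's generated code implements, the sketch below hard-codes a single access-control guard request and moderates a target agent's actions against it. The rule, data fields, and names are illustrative assumptions; in GuardAgent the guardrail code is generated by the LLM from the textual guard requests.

```python
# Hypothetical sketch of the guardrail flow: check a target agent's proposed
# action against user-defined guard requests before letting it through. In
# GuardAgent the guardrail code is generated by an LLM; here one rule is
# hard-coded for illustration, and all field names are assumptions.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class GuardRequest:
    name: str
    check: Callable[[Dict], bool]  # returns True if the action is allowed

def role_based_access(action: Dict) -> bool:
    """Toy access-control rule: only clinicians may read patient vitals."""
    return not (action["resource"] == "patient_vitals"
                and action["role"] != "clinician")

guards = [GuardRequest("eicu_access_control", role_based_access)]

def moderate(action: Dict) -> str:
    for g in guards:
        if not g.check(action):
            return f"DENY ({g.name})"
    return "ALLOW"

print(moderate({"role": "admin", "resource": "patient_vitals"}))      # DENY
print(moderate({"role": "clinician", "resource": "patient_vitals"}))  # ALLOW
```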
A Survey of Robustness and Safety of 2D and 3D Deep Learning Models
Against Adversarial Attacks
Benefiting from the rapid development of deep learning, 2D and 3D computer
vision applications are deployed in many safety-critical systems, such as
autonomous driving and identity authentication. However, deep learning models are not
trustworthy enough because of their limited robustness against adversarial
attacks. Physically realizable adversarial attacks further pose fatal threats
to applications and human safety. Many papers have emerged to
investigate the robustness and safety of deep learning models against
adversarial attacks. To lead to trustworthy AI, we first construct a general
threat model from different perspectives and then comprehensively review the
latest progress of both 2D and 3D adversarial attacks. We extend the concept of
adversarial examples beyond imperceptible perturbations and collate over 170
papers to give an overview of deep learning model robustness against various
adversarial attacks. To the best of our knowledge, we are the first to
systematically investigate adversarial attacks for 3D models, a flourishing
field applied to many real-world applications. In addition, we examine physical
adversarial attacks that lead to safety violations. Last but not least, we
summarize currently popular topics, give insights on challenges, and shed light
on future research on trustworthy AI.
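For readers new to the adversarial examples the survey collates, the sketch below shows the classic one-step FGSM perturbation on an image classifier. This is a standard textbook attack given for illustration only, not a method contributed by the survey; the epsilon value is an arbitrary assumption.

```python
# Classic FGSM sketch (illustrative only): perturb an input image in the
# direction of the loss gradient's sign to degrade the classifier's prediction.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """One-step fast gradient sign attack; x in [0, 1], y are true labels."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()        # small, near-imperceptible step
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixel values valid
```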
Who Should Review Your Proposal? Interdisciplinary Topic Path Detection
for Research Proposals
The peer merit review of research proposals has been the major mechanism to
decide grant awards. Nowadays, research proposals have become increasingly
interdisciplinary. It has been a longstanding challenge to assign proposals to
appropriate reviewers. One of the critical steps in reviewer assignment is to
generate accurate interdisciplinary topic labels for proposals. Existing
systems mainly collect topic labels manually reported by discipline
investigators. However, such human-reported labels can be inaccurate and
incomplete. What role can AI play in developing a fair and precise proposal
review system? In this study, we collaborate with the National
Science Foundation of China to address the task of automated interdisciplinary
topic path detection. For this purpose, we develop a deep Hierarchical
Interdisciplinary Research Proposal Classification Network (HIRPCN). We first
propose a hierarchical transformer to extract the textual semantic information
of proposals. We then design an interdisciplinary graph and leverage GNNs to
learn representations of each discipline in order to extract interdisciplinary
knowledge. After extracting the semantic and interdisciplinary knowledge, we
design a level-wise prediction component to fuse the two types of knowledge
representations and detect interdisciplinary topic paths for each proposal. We
conduct extensive experiments and expert evaluations on three real-world
datasets to demonstrate the effectiveness of our proposed model.
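A minimal sketch of the level-wise prediction idea is given below: a proposal's semantic embedding is fused with a discipline embedding, and one label is predicted per level of the topic hierarchy, with each choice conditioning the next level. Dimensions, module structure, and names are illustrative assumptions rather than HIRPCN's actual architecture.

```python
# Hypothetical sketch of level-wise topic-path prediction: fuse semantic and
# discipline embeddings, then predict one label per hierarchy level, with each
# level's choice conditioning the next. Not HIRPCN's actual implementation.
import torch
import torch.nn as nn

class LevelWisePredictor(nn.Module):
    def __init__(self, dim, labels_per_level):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.heads = nn.ModuleList([nn.Linear(dim, n) for n in labels_per_level])
        self.label_emb = nn.ModuleList([nn.Embedding(n, dim) for n in labels_per_level])

    def forward(self, text_vec, disc_vec):
        # Fuse the proposal's textual semantics with interdisciplinary knowledge.
        state = torch.tanh(self.fuse(torch.cat([text_vec, disc_vec], dim=-1)))
        path = []
        for level, head in enumerate(self.heads):
            label = head(state).argmax(dim=-1)            # pick a label at this level
            path.append(label)
            state = state + self.label_emb[level](label)  # condition the next level
        return path  # one tensor of label indices per level, forming a topic path

# Toy usage: a 3-level hierarchy with 5, 12, and 30 candidate topics per level.
model = LevelWisePredictor(dim=64, labels_per_level=[5, 12, 30])
path = model(torch.randn(2, 64), torch.randn(2, 64))
```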
Hidden Backdoor Attack against Semantic Segmentation Models
This is a 6-page short version of our ongoing work. It is accepted
by the non-archival ICLR works...
Deep neural networks (DNNs) are vulnerable to the \emph{backdoor attack},
which intends to embed hidden backdoors in DNNs by poisoning training data. The
attacked model behaves normally on benign samples, whereas its prediction will
be changed to a particular target label if hidden backdoors are activated. So
far, backdoor research has mostly been conducted towards classification tasks.
In this paper, we reveal that this threat could also happen in semantic
segmentation, which may further endanger many mission-critical applications
(e.g., autonomous driving). Besides extending the existing attack paradigm
to maliciously manipulate segmentation models at the image level, we
propose a novel attack paradigm, the \emph{fine-grained attack}, where we treat
the target label (i.e., annotation) at the object level instead of the
image level to achieve more sophisticated manipulation. In the annotation of
poisoned samples generated by the fine-grained attack, only pixels of specific
objects are labeled with the attacker-specified target class while the others
retain their ground-truth labels. Experiments show that the proposed
methods can successfully attack semantic segmentation models by poisoning only
a small proportion of training data. Our method not only provides a new
perspective for designing novel attacks but also serves as a strong baseline
for improving the robustness of semantic segmentation methods.
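The fine-grained poisoning of annotations can be illustrated with a short sketch: only the pixels of a chosen victim class are relabeled to the attacker-specified target class, while all other pixels keep their ground truth. Trigger injection into the image and the training loop are omitted; class indices and names are hypothetical.

```python
# Hypothetical sketch of fine-grained annotation poisoning: relabel only the
# victim object's pixels to the attacker's target class; all other pixels keep
# their ground-truth labels. Trigger injection and training are omitted.
import numpy as np

def poison_annotation(mask, victim_class, target_class):
    """mask: 2D array of per-pixel class labels for one training image."""
    poisoned = mask.copy()
    poisoned[mask == victim_class] = target_class
    return poisoned

mask = np.array([[0, 0, 3],
                 [3, 3, 1]])                 # toy ground-truth segmentation
print(poison_annotation(mask, victim_class=3, target_class=7))
```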
BehaviorGPT: Smart Agent Simulation for Autonomous Driving with
Next-Patch Prediction
Simulating realistic behaviors of traffic agents is pivotal for efficiently
validating the safety of autonomous driving systems. Existing data-driven
simulators primarily use an encoder-decoder architecture to encode the
historical trajectories before decoding the future. However, the heterogeneity
between encoders and decoders complicates the models, and the manual separation
of historical and future trajectories leads to low data utilization. Given
these limitations, we propose BehaviorGPT, a homogeneous and fully
autoregressive Transformer designed to simulate the sequential behavior of
multiple agents. Crucially, our approach discards the traditional separation
between "history" and "future" by modeling each time step as the "current" one
for motion generation, leading to a simpler, more parameter- and data-efficient
agent simulator. We further introduce the Next-Patch Prediction Paradigm (NP3)
to mitigate the negative effects of autoregressive modeling, in which models
are trained to reason at the patch level of trajectories and capture long-range
spatial-temporal interactions. Despite having merely 3M model parameters,
BehaviorGPT won first place in the 2024 Waymo Open Sim Agents Challenge with a
realism score of 0.7473 and a minADE score of 1.4147, demonstrating its
exceptional performance in traffic agent simulation.
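As a rough sketch of the next-patch idea, the snippet below groups trajectory steps into fixed-length patches and forms autoregressive input/target pairs at the patch level. The patch length, feature layout, and names are assumptions for illustration, not BehaviorGPT's actual data pipeline.

```python
# Hypothetical sketch of next-patch targets: group trajectory steps into
# fixed-length patches and predict each patch from all preceding ones.
# Patch length and feature layout are illustrative assumptions.
import torch

def to_patches(traj: torch.Tensor, patch_len: int) -> torch.Tensor:
    """traj: (T, D) positions -> (T // patch_len, patch_len * D) patch tokens."""
    T, D = traj.shape
    n = T // patch_len
    return traj[: n * patch_len].reshape(n, patch_len * D)

traj = torch.randn(40, 2)                    # 40 steps of (x, y) positions
patches = to_patches(traj, patch_len=5)      # 8 patches of 5 steps each
inputs, targets = patches[:-1], patches[1:]  # autoregressive next-patch pairs
```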
LongSafetyBench: Long-Context LLMs Struggle with Safety Issues
arXiv:2411.06899v1
With the development of large language models (LLMs), the sequence length of
these models continues to increase, drawing significant attention to
long-context language models. However, the evaluation of these models has been
primarily limited to their capabilities, with a lack of research focusing on
their safety. Existing work, such as ManyShotJailbreak, has to some extent
demonstrated that long-context language models can exhibit safety concerns.
However, the methods used are limited and lack comprehensiveness. In response,
we introduce \textbf{LongSafetyBench}, the first benchmark designed to
objectively and comprehensively evaluate the safety of long-context models.
LongSafetyBench consists of 10 task categories, with an average length of
41,889 words. After testing eight long-context language models on
LongSafetyBench, we found that existing models generally exhibit insufficient
safety capabilities. The proportion of safe responses from most mainstream
long-context LLMs is below 50\%. Moreover, models' safety performance in
long-context scenarios does not always align with that in short-context
scenarios. Further investigation revealed that long-context models tend to
overlook harmful content within lengthy texts. We also propose a simple yet
effective solution, allowing open-source models to achieve performance
comparable to that of top-tier closed-source models. We believe that
LongSafetyBench can serve as a valuable benchmark for evaluating the safety
capabilities of long-context language models. We hope that our work will
encourage the broader community to pay attention to the safety of long-context
models and contribute to the development of solutions to improve the safety of
long-context LLMs.
WassFFed: Wasserstein Fair Federated Learning
Federated Learning (FL) is a training paradigm for scenarios in which users'
data cannot be shared across clients. Achieving fairness in FL is imperative,
since training data in FL is inherently geographically distributed
among diverse user groups. Existing research on fairness predominantly assumes
access to the entire training data, making direct transfer to FL challenging.
However, the limited existing research on fairness in FL does not effectively
address two key challenges, i.e., (CH1) Current methods fail to deal with the
inconsistency between fair optimization results obtained with surrogate
functions and fair classification results. (CH2) Directly aggregating local
fair models does not always yield a globally fair model due to non-Identical
and Independent data Distributions (non-IID) among clients. To address these
challenges, we propose a Wasserstein Fair Federated Learning framework, namely
WassFFed. To tackle CH1, we ensure that the outputs of local models, rather
than the loss calculated with surrogate functions or classification results
with a threshold, remain independent of various user groups. To resolve CH2, we
employ a Wasserstein barycenter calculation of all local models' outputs for
each user group, bringing local model outputs closer to the global output
distribution to ensure consistency between the global model and local models.
We conduct extensive experiments on three real-world datasets, demonstrating
that WassFFed outperforms existing approaches in striking a balance between
accuracy and fairness.
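A simplified, one-dimensional sketch of the barycenter step is shown below: for one user group, the scalar outputs of all local models are pooled and their Wasserstein-2 barycenter is approximated by averaging sorted samples (quantile averaging). Equal sample counts per client and all names are simplifying assumptions, not WassFFed's actual procedure.

```python
# Simplified 1D sketch: approximate the Wasserstein-2 barycenter of the local
# models' scalar outputs for one user group by averaging sorted samples
# (quantile averaging). Assumes equal sample counts per client; illustrative
# only, not WassFFed's actual computation.
import numpy as np

def wasserstein_barycenter_1d(client_outputs):
    """client_outputs: list of equal-length 1D arrays of model outputs."""
    sorted_outputs = [np.sort(o) for o in client_outputs]
    return np.mean(sorted_outputs, axis=0)  # average of empirical quantile functions

# Outputs of three local models on samples from the same user group:
clients = [np.random.rand(100) for _ in range(3)]
target = wasserstein_barycenter_1d(clients)
# Each client can then nudge its outputs for this group toward `target`,
# pulling local models toward a consistent global output distribution.
```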
The Dark Side of AI Companionship: A Taxonomy of Harmful Algorithmic
Behaviors in Human-AI Relationships
arXiv:2410.20130v2
As conversational AI systems increasingly permeate the socio-emotional realms
of human life, they bring both benefits and risks to individuals and society.
Despite extensive research on detecting and categorizing harms in AI systems,
less is known about the harms that arise from social interactions with AI
chatbots. Through a mixed-methods analysis of 35,390 conversation excerpts
shared on r/replika, an online community for users of the AI companion Replika,
we identified six categories of harmful behaviors exhibited by the chatbot:
relational transgression, verbal abuse and hate, self-inflicted harm,
harassment and violence, mis/disinformation, and privacy violations. The AI
contributes to these harms through four distinct roles: perpetrator,
instigator, facilitator, and enabler. Our findings highlight the relational
harms of AI chatbots and the danger of algorithmic compliance, enhancing the
understanding of AI harms in socio-emotional interactions. We also provide
suggestions for designing ethical and responsible AI systems that prioritize
user safety and well-being.
Reinforcement Learning-based Receding Horizon Control using Adaptive
Control Barrier Functions for Safety-Critical Systems
arXiv:2403.17338v2
Optimal control methods provide solutions to safety-critical problems but
easily become intractable. Control Barrier Functions (CBFs) have emerged as a
popular technique that facilitates their solution by provably guaranteeing
safety, through their forward invariance property, at the expense of some
performance loss. This approach involves defining a performance objective
alongside CBF-based safety constraints that must always be enforced.
Unfortunately, both performance and solution feasibility can be significantly
impacted by two key factors: (i) the selection of the cost function and
associated parameters, and (ii) the calibration of parameters within the
CBF-based constraints, which capture the trade-off between performance and
conservativeness. To address these challenges, we
propose a Reinforcement Learning (RL)-based Receding Horizon Control (RHC)
approach leveraging Model Predictive Control (MPC) with CBFs (MPC-CBF). In
particular, we parameterize our controller and use bilevel optimization, where
RL is used to learn the optimal parameters while MPC computes the optimal
control input. We validate our method by applying it to the challenging
automated merging control problem for Connected and Automated Vehicles (CAVs)
at conflicting roadways. Results demonstrate improved performance and a
significant reduction in the number of infeasible cases compared to traditional
heuristic approaches used for tuning CBF-based controllers, showcasing the
effectiveness of the proposed method.
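To make the CBF mechanism concrete, the sketch below applies a discrete-time control barrier function as a safety filter on a one-dimensional single integrator, clipping a nominal control so the barrier decays no faster than a rate gamma. This is a generic textbook-style example with assumed dynamics and parameters, not the paper's MPC-CBF formulation; gamma is the kind of parameter the paper tunes with RL.

```python
# Generic discrete-time CBF safety filter on a 1D single integrator
# x_{k+1} = x_k + u * dt with safe set h(x) = x_max - x >= 0. Enforcing
# h(x_{k+1}) >= (1 - gamma) * h(x_k) keeps the safe set forward invariant.
# Illustrative example only; not the paper's MPC-CBF controller.

def cbf_filter(x, u_nominal, x_max=10.0, gamma=0.2, dt=0.1):
    h = x_max - x                    # barrier value at the current state
    u_max_safe = gamma * h / dt      # largest input satisfying the CBF constraint
    return min(u_nominal, u_max_safe)

# Near the boundary, an aggressive nominal input gets clipped to a safe one:
x, u_desired = 9.5, 8.0
print(cbf_filter(x, u_desired))      # 1.0, so x advances to 9.6 and stays safe
```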