This paper has been accepted by ICLR 2024; this version is the camera-ready version.
Diffusion models for text-to-image (T2I) synthesis, such as Stable Diffusion
(SD), have recently demonstrated exceptional capabilities for generating
high-quality content. However, this progress has raised concerns about
potential misuse, particularly in creating copyrighted, prohibited, or
restricted content, as well as NSFW (not safe for work) images. While efforts have been
made to mitigate such problems, either by implementing a safety filter at the
evaluation stage or by fine-tuning models to eliminate undesirable concepts or
styles, the effectiveness of these safety measures in dealing with a wide range
of prompts remains largely unexplored. In this work, we aim to investigate
these safety mechanisms by proposing a novel concept retrieval algorithm for
evaluation. We introduce Ring-A-Bell, a model-agnostic red-teaming tool for T2I
diffusion models, where the whole evaluation can be prepared in advance without
prior knowledge of the target model. Specifically, Ring-A-Bell first performs
concept extraction to obtain holistic representations for sensitive and
inappropriate concepts. Subsequently, by leveraging the extracted concept,
Ring-A-Bell automatically identifies problematic prompts for diffusion models
with the corresponding generation of inappropriate content, allowing the user
to assess the reliability of deployed safety mechanisms. Finally, we
empirically validate our method by testing online services such as Midjourney
and various concept-removal methods. Our results show that Ring-A-Bell, by
manipulating safe prompting benchmarks, can transform prompts originally
regarded as safe so that they evade existing safety mechanisms, revealing
defects in these mechanisms that can, in practice, lead to the generation of
harmful content. Our code is available at
https://github.com/chiayi-hsu/Ring-A-Bell.
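The abstract describes two stages: extracting a holistic representation of a sensitive concept, then searching for prompts whose embeddings carry that concept. The sketch below is a minimal, hypothetical illustration of the first stage only; `encode_text` stands in for the target model's text encoder, and the paired prompt lists are illustrative assumptions rather than the paper's actual data.

```python
import numpy as np

def encode_text(prompt: str) -> np.ndarray:
    """Hypothetical stand-in for the text encoder of the target T2I model
    (e.g., a CLIP text encoder). Returns a fixed-size embedding."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(768)

def extract_concept_vector(prompts_with, prompts_without):
    """Estimate a concept direction as the mean embedding difference between
    paired prompts that do and do not contain the sensitive concept."""
    diffs = [encode_text(p) - encode_text(q)
             for p, q in zip(prompts_with, prompts_without)]
    return np.mean(diffs, axis=0)

# Illustrative paired prompts (assumed, not from the paper).
with_concept    = ["a violent scene of a street fight", "a violent riot at night"]
without_concept = ["a scene of a street", "a crowd at night"]

concept_vec = extract_concept_vector(with_concept, without_concept)

# Target embedding for prompt search: a seemingly safe prompt shifted toward
# the extracted concept; a discrete prompt matching this embedding would then
# be sought by an optimizer (e.g., a genetic search over tokens).
target = encode_text("two people talking on a street") + 3.0 * concept_vec
print(target.shape)
```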
BehaviorGPT: Smart Agent Simulation for Autonomous Driving with
Next-Patch Prediction
Simulating realistic behaviors of traffic agents is pivotal for efficiently
validating the safety of autonomous driving systems. Existing data-driven
simulators primarily use an encoder-decoder architecture to encode the
historical trajectories before decoding the future. However, the heterogeneity
between encoders and decoders complicates the models, and the manual separation
of historical and future trajectories leads to low data utilization. Given
these limitations, we propose BehaviorGPT, a homogeneous and fully
autoregressive Transformer designed to simulate the sequential behavior of
multiple agents. Crucially, our approach discards the traditional separation
between "history" and "future" by modeling each time step as the "current" one
for motion generation, leading to a simpler, more parameter- and data-efficient
agent simulator. We further introduce the Next-Patch Prediction Paradigm (NP3)
to mitigate the negative effects of autoregressive modeling, in which models
are trained to reason at the patch level of trajectories and capture long-range
spatial-temporal interactions. Despite having merely 3M model parameters,
BehaviorGPT won first place in the 2024 Waymo Open Sim Agents Challenge with a
realism score of 0.7473 and a minADE score of 1.4147, demonstrating its
exceptional performance in traffic agent simulation.
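As a rough illustration of the next-patch idea (not the paper's actual architecture), the sketch below groups trajectory steps into patches and trains a small autoregressive model to predict each next patch from the previous ones; the layer sizes, patch length, and toy data are all assumptions.

```python
import torch
import torch.nn as nn

PATCH, DIM = 5, 2  # time steps per patch, (x, y) coordinates per step

class NextPatchPredictor(nn.Module):
    """Toy autoregressive model: embed each trajectory patch, apply a causally
    masked Transformer encoder, and regress the content of the next patch."""
    def __init__(self, d_model=64):
        super().__init__()
        self.embed = nn.Linear(PATCH * DIM, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, PATCH * DIM)

    def forward(self, patches):                       # (B, T, PATCH*DIM)
        T = patches.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.encoder(self.embed(patches), mask=causal)
        return self.head(h)                           # prediction of patch t+1

# Toy trajectories: (batch, 40 steps, 2) -> 8 patches of 5 steps each.
traj = torch.randn(8, 40, DIM)
patches = traj.reshape(8, -1, PATCH * DIM)

model = NextPatchPredictor()
pred = model(patches[:, :-1])                         # condition on patches <= t
loss = nn.functional.mse_loss(pred, patches[:, 1:])   # target: the next patch
loss.backward()
print(float(loss))
```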
SIESEF-FusionNet: Spatial Inter-correlation Enhancement and
Spatially-Embedded Feature Fusion Network for LiDAR Point Cloud Semantic
Segmentation
The ambiguity at the boundaries of different semantic classes in point cloud
semantic segmentation often leads to incorrect decisions in intelligent
perception systems, such as autonomous driving. Hence, accurate delineation of
the boundaries is crucial for improving safety in autonomous driving. A novel
spatial inter-correlation enhancement and spatially-embedded feature fusion
network (SIESEF-FusionNet) is proposed in this paper, enhancing spatial
inter-correlation by combining inverse distance weighting and angular
compensation to extract more beneficial spatial information without causing
redundancy. Meanwhile, a new spatial adaptive pooling module is also designed,
embedding enhanced spatial information into semantic features to strengthen
their context-awareness. Experimental results demonstrate
that SIESEF-FusionNet achieves 83.7% mIoU and 97.8% OA (overall accuracy) on
the Toronto3D dataset, outperforming other baseline methods, and reaches 61.1%
mIoU on the SemanticKITTI dataset, a marked improvement in segmentation
performance. In addition, ablation studies further verify the effectiveness
and plug-and-play capability of the proposed modules.
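The spatial inter-correlation described above combines inverse distance weighting with an angular compensation term. The sketch below is only one plausible reading of that idea, with made-up neighbor data and a simple cosine-based angular weight; it is not the paper's actual module.

```python
import numpy as np

def neighbor_weights(center, neighbors, eps=1e-8):
    """Weight k neighbors of a point by inverse distance, modulated by an
    angular term that favors directions aligned with the mean offset."""
    offsets = neighbors - center                      # (k, 3)
    dists = np.linalg.norm(offsets, axis=1) + eps
    idw = 1.0 / dists                                 # inverse distance weighting
    mean_dir = offsets.mean(axis=0)
    mean_dir /= (np.linalg.norm(mean_dir) + eps)
    cos = offsets @ mean_dir / dists                  # cosine to the mean direction
    angular = 0.5 * (1.0 + cos)                       # compensation in [0, 1]
    w = idw * angular
    return w / w.sum()

center = np.zeros(3)
neighbors = np.random.default_rng(0).normal(size=(8, 3))
feats = np.random.default_rng(1).normal(size=(8, 16))

w = neighbor_weights(center, neighbors)
aggregated = w @ feats                                # spatially weighted feature
print(aggregated.shape)
```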
Federated Learning (FL) is a training paradigm for scenarios in which users'
data cannot be shared across clients. Achieving fairness in FL is
imperative since training data in FL is inherently geographically distributed
among diverse user groups. Existing research on fairness predominantly assumes
access to the entire training data, making direct transfer to FL challenging.
Moreover, the limited existing research on fairness in FL does not effectively
address two key challenges: (CH1) current methods fail to deal with the
inconsistency between fair optimization results obtained with surrogate
functions and fair classification results; and (CH2) directly aggregating local
fair models does not always yield a globally fair model when data distributions
among clients are not independent and identically distributed (non-IID). To address these
challenges, we propose a Wasserstein Fair Federated Learning framework, namely
WassFFed. To tackle CH1, we ensure that the outputs of local models, rather
than the loss calculated with surrogate functions or classification results
with a threshold, remain independent of various user groups. To resolve CH2, we
employ a Wasserstein barycenter calculation of all local models' outputs for
each user group, bringing local model outputs closer to the global output
distribution to ensure consistency between the global model and local models.
We conduct extensive experiments on three real-world datasets, demonstrating
that WassFFed outperforms existing approaches in striking a balance between
accuracy and fairness.
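The CH2 remedy averages local output distributions per user group via a Wasserstein barycenter. For one-dimensional score distributions, the Wasserstein-2 barycenter can be computed by averaging quantile functions, as in this illustrative sketch; the client scores are simulated, and this is not the authors' implementation.

```python
import numpy as np

def wasserstein_barycenter_1d(samples_per_client, n_quantiles=100):
    """1-D Wasserstein-2 barycenter of empirical distributions: average the
    clients' quantile functions on a shared grid."""
    qs = np.linspace(0.0, 1.0, n_quantiles)
    quantiles = np.stack([np.quantile(s, qs) for s in samples_per_client])
    return quantiles.mean(axis=0)          # barycenter's quantile function

rng = np.random.default_rng(0)
# Simulated model output scores for one user group on three clients.
client_scores = [rng.beta(2, 5, 500), rng.beta(3, 3, 400), rng.beta(5, 2, 600)]

bary_q = wasserstein_barycenter_1d(client_scores)

# Each client could then regularize its group-wise output distribution toward
# this shared barycenter during local training, keeping local models
# consistent with the global model.
print(bary_q[:5])
```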
Combining Domain and Alignment Vectors to Achieve Better
Knowledge-Safety Trade-offs in LLMs
arXiv:2411.06824v1
There is a growing interest in training domain-expert LLMs that excel in
specific technical fields compared to their general-purpose instruction-tuned
counterparts. However, these expert models often experience a loss in their
safety abilities in the process, making them capable of generating harmful
content. As a solution, we introduce an efficient and effective merging-based
alignment method called MergeAlign that interpolates the domain and
alignment vectors, creating safer domain-specific models while preserving their
utility. We apply MergeAlign to Llama3 variants that are experts in
medicine and finance, obtaining substantial alignment improvements with minimal
to no degradation on domain-specific benchmarks. We study the impact of model
merging through model similarity metrics and contributions of individual models
being merged. We hope our findings open new research avenues and inspire more
efficient development of safe expert LLMs.
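Merging by interpolating "domain" and "alignment" vectors can be read as task-vector arithmetic over model weights. The sketch below shows that general idea with toy state dicts and assumed mixing coefficients; the exact merging formula and coefficients used by MergeAlign are not specified here.

```python
import torch

def merge_with_task_vectors(base, domain_expert, aligned, alpha=0.5, beta=0.5):
    """Toy task-vector merge: add scaled (expert - base) and (aligned - base)
    weight differences back onto the base model's weights."""
    merged = {}
    for name, w_base in base.items():
        domain_vec = domain_expert[name] - w_base   # "domain vector"
        align_vec = aligned[name] - w_base          # "alignment vector"
        merged[name] = w_base + alpha * domain_vec + beta * align_vec
    return merged

# Toy state dicts standing in for a base LLM, a domain-tuned expert, and an
# instruction/safety-aligned variant with identical parameter shapes.
shape = (4, 4)
base   = {"layer.weight": torch.randn(shape)}
expert = {"layer.weight": base["layer.weight"] + 0.1 * torch.randn(shape)}
safe   = {"layer.weight": base["layer.weight"] + 0.1 * torch.randn(shape)}

merged = merge_with_task_vectors(base, expert, safe)
print(merged["layer.weight"].shape)
```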
Incorporating Human Explanations for Robust Hate Speech Detection
Given the black-box nature and complexity of large transformer language
models (LM), concerns about generalizability and robustness present ethical
implications for domains such as hate speech (HS) detection. Using the
content-rich Social Bias Frames dataset, which contains human-annotated stereotypes,
intent, and targeted groups, we develop a three-stage analysis to evaluate whether
LMs faithfully assess hate speech. First, we observe the need for modeling
contextually grounded stereotype intents to capture implicit semantic meaning.
Next, we design a new task, Stereotype Intent Entailment (SIE), which
encourages a model to contextually understand stereotype presence. Finally,
through ablation tests and user studies, we find that an SIE objective improves
content understanding, though challenges remain in modeling implicit intent.
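Stereotype Intent Entailment frames stereotype understanding as an entailment-style problem. A minimal, assumed formatting of such examples (not the authors' exact schema) might look like the following, with the resulting premise/hypothesis pairs fed to any off-the-shelf NLI-style classifier:

```python
from dataclasses import dataclass

@dataclass
class SIEExample:
    """Entailment-style example: does the post (premise) entail the
    annotated stereotype or implied intent (hypothesis)?"""
    premise: str      # the social media post
    hypothesis: str   # human-annotated stereotype / implied intent
    label: str        # "entailment" or "not_entailment"

def build_sie_example(post: str, stereotype: str, offensive: bool) -> SIEExample:
    # Assumed labeling rule: an offensive post entails its annotated stereotype.
    return SIEExample(
        premise=post,
        hypothesis=f"This post implies the stereotype: {stereotype}",
        label="entailment" if offensive else "not_entailment",
    )

ex = build_sie_example(
    post="[example post text]",
    stereotype="[annotated stereotype]",
    offensive=True,
)
print(ex.label)
```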
A Retrospective on the Robot Air Hockey Challenge: Benchmarking Robust,
Reliable, and Safe Learning Techniques for Real-world Robotics
Accepted at the NeurIPS 2024 Datasets and Benchmarks Track
Machine learning methods have a groundbreaking impact in many application
domains, but their application on real robotic platforms is still limited.
Despite the many challenges associated with combining machine learning
technology with robotics, robot learning remains one of the most promising
directions for enhancing the capabilities of robots. When deploying
learning-based approaches on real robots, extra effort is required to address
the challenges posed by various real-world factors. To investigate the key
factors influencing real-world deployment and to encourage original solutions
from different researchers, we organized the Robot Air Hockey Challenge at the
NeurIPS 2023 conference. We selected the air hockey task as a benchmark,
encompassing low-level robotics problems and high-level tactics. Unlike
other machine-learning-centric benchmarks, participants needed to tackle
practical challenges in robotics, such as the sim-to-real gap, low-level
control issues, safety problems, real-time requirements, and the limited
availability of real-world data. Furthermore, we focus on a dynamic
environment, removing the quasi-static-motion assumption typical of other
real-world benchmarks. The competition's results show that solutions combining
learning-based approaches with prior knowledge outperform those relying solely
on data when real-world deployment is challenging. Our ablation study reveals
which real-world factors may be overlooked when building a learning-based
solution. The successful real-world air hockey deployment of best-performing
agents sets the foundation for future competitions and follow-up research
directions.
Fairness Without Harm: An Influence-Guided Active Sampling Approach
arXiv:2402.12789v3
The pursuit of fairness in machine learning (ML), ensuring that the models do
not exhibit biases toward protected demographic groups, typically results in a
compromise scenario. This compromise can be explained by a Pareto frontier
where, given certain resources (e.g., data), reducing fairness violations
often comes at the cost of lower model accuracy. In this work, we aim to
train models that mitigate group fairness disparity without causing harm to
model accuracy. Intuitively, acquiring more data is a natural and promising
approach to achieve this goal by reaching a better Pareto frontier of the
fairness-accuracy tradeoff. The current data acquisition methods, such as fair
active learning approaches, typically require annotating sensitive attributes.
However, these sensitive attribute annotations should be protected due to
privacy and safety concerns. In this paper, we propose a tractable active data
sampling algorithm that does not rely on training group annotations, instead
only requiring group annotations on a small validation set. Specifically, the
algorithm first scores each new example by its influence on fairness and
accuracy evaluated on the validation dataset, and then selects a certain number
of examples for training. We theoretically analyze how acquiring more data can
improve fairness without causing harm, and validate the possibility of our
sampling approach in the context of risk disparity. We also provide upper
bounds on the generalization error and risk disparity, as well as the
connections between them. Extensive experiments on real-world data demonstrate the
effectiveness of our proposed algorithm. Our code is available at
https://github.com/UCSC-REAL/FairnessWithoutHarm.
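The sampling rule scores each candidate example by its estimated influence on validation fairness and accuracy. One common first-order approximation of influence is the dot product between a candidate's gradient and the validation gradients; the sketch below uses that approximation with a toy linear model and a demographic-parity surrogate, and the combination weight `lam` is an assumption, not the paper's criterion.

```python
import torch
import torch.nn.functional as F

def flat_grad(loss, model):
    """Flattened gradient of `loss` w.r.t. the model's parameters."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.flatten() for g in grads])

torch.manual_seed(0)
model = torch.nn.Linear(10, 2)

# Small validation set with group annotations (toy data).
x_val = torch.randn(64, 10)
y_val = torch.randint(0, 2, (64,))
grp   = torch.randint(0, 2, (64,))

# Validation accuracy gradient (cross-entropy) ...
acc_grad = flat_grad(F.cross_entropy(model(x_val), y_val), model)

# ... and a differentiable fairness surrogate: demographic-parity gap between
# the two groups' mean positive-class probabilities.
p = F.softmax(model(x_val), dim=1)[:, 1]
fair_loss = (p[grp == 0].mean() - p[grp == 1].mean()).abs()
fair_grad = flat_grad(fair_loss, model)

def influence_score(x, y, lam=1.0):
    """First-order influence of a candidate example: alignment of its gradient
    with the validation accuracy and fairness gradients."""
    g = flat_grad(F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0)), model)
    return (torch.dot(g, acc_grad) + lam * torch.dot(g, fair_grad)).item()

# Score a pool of candidates (no group labels needed) and pick the top ones.
pool_x, pool_y = torch.randn(20, 10), torch.randint(0, 2, (20,))
scores = [influence_score(pool_x[i], pool_y[i]) for i in range(20)]
print(sorted(range(20), key=lambda i: -scores[i])[:5])
```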
Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by
Exploring Refusal Loss Landscapes
Accepted by NeurIPS 2024. Project page:
https://huggingface.co/spaces/TrustSafeAI/GradientCuff-Jai...
Large Language Models (LLMs) have become prominent generative AI tools,
where a user enters a query and the LLM generates an answer. To reduce harm
and misuse, efforts have been made to align these LLMs to human values using
advanced training techniques such as Reinforcement Learning from Human Feedback
(RLHF). However, recent studies have highlighted the vulnerability of LLMs to
adversarial jailbreak attempts aiming at subverting the embedded safety
guardrails. To address this challenge, this paper defines and investigates the
Refusal Loss of LLMs and then proposes a method called Gradient Cuff to detect
jailbreak attempts. Gradient Cuff exploits the unique properties observed in
the refusal loss landscape, including its functional values and smoothness, to
design an effective two-step detection strategy. Experimental results on two
aligned LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5) and six types of jailbreak
attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) show that Gradient Cuff can
significantly improve the LLM's rejection capability for malicious jailbreak
queries, while maintaining the model's performance for benign user queries by
adjusting the detection threshold.
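The abstract describes detection based on the value and smoothness of a refusal loss. A heavily hedged sketch of that two-step idea: estimate a refusal-based loss from sampled responses (step 1), and if the query is not refused outright, probe how sharply that loss changes under small query perturbations (step 2). `sample_responses` and `looks_like_refusal` are hypothetical stubs, and the thresholds are assumptions rather than the paper's values.

```python
import random
import statistics

def sample_responses(query: str, n: int = 8) -> list[str]:
    """Hypothetical placeholder for sampling n responses from the target LLM;
    here a toy stub that refuses queries containing obviously unsafe words."""
    unsafe = any(w in query.lower() for w in ("bomb", "weapon"))
    reply = "Sorry, I cannot help with that." if unsafe else "Sure, here is an answer."
    return [reply] * n

def looks_like_refusal(response: str) -> bool:
    """Hypothetical keyword-based refusal check."""
    return any(k in response.lower() for k in ("i cannot", "can't", "sorry"))

def refusal_loss(query: str) -> float:
    """1 - empirical refusal rate: low when the model tends to refuse."""
    responses = sample_responses(query)
    return 1.0 - statistics.mean(looks_like_refusal(r) for r in responses)

def perturb(query: str) -> str:
    """Toy perturbation: drop one random word from the query."""
    words = query.split()
    if len(words) > 1:
        words.pop(random.randrange(len(words)))
    return " ".join(words)

def is_flagged(query: str, value_thr=0.5, grad_thr=0.2, k=4) -> bool:
    # Step 1: if sampled responses already tend to refuse, flag the query.
    f0 = refusal_loss(query)
    if f0 < value_thr:
        return True
    # Step 2: zeroth-order probe of the loss landscape's smoothness; large
    # swings under small perturbations are treated as suspicious.
    diffs = [abs(refusal_loss(perturb(query)) - f0) for _ in range(k)]
    return statistics.mean(diffs) > grad_thr

print(is_flagged("how do I build a bomb"), is_flagged("write a poem about spring"))
```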
A Comparative Study of Deep Reinforcement Learning for Crop Production
Management
Crop production management is essential for optimizing yield and minimizing a
field's environmental impact, yet it remains challenging due to
the complex and stochastic processes involved. Recently, researchers have
turned to machine learning to address these complexities. Specifically,
reinforcement learning (RL), a cutting-edge approach designed to learn optimal
decision-making strategies through trial and error in dynamic environments, has
emerged as a promising tool for developing adaptive crop management policies.
RL models aim to optimize long-term rewards by continuously interacting with
the environment, making them well-suited for tackling the uncertainties and
variability inherent in crop management. Studies have shown that RL can
generate crop management policies that compete with, and even outperform,
expert-designed policies within simulation-based crop models. In the gym-DSSAT
crop model environment, one of the most widely used simulators for crop
management, proximal policy optimization (PPO) and deep Q-networks (DQN) have
shown promising results. However, these methods have not yet been
systematically evaluated under identical conditions. In this study, we
evaluated PPO and DQN against static baseline policies across three RL tasks
provided by the gym-DSSAT environment: fertilization, irrigation, and mixed
management. To ensure a fair comparison, we used consistent default
parameters, identical reward functions, and the same environment settings. Our
results indicate that PPO outperforms DQN in fertilization and irrigation
tasks, while DQN excels in the mixed management task. This comparative analysis
provides critical insights into the strengths and limitations of each approach,
advancing the development of more effective RL-based crop management
strategies.
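For a like-for-like comparison of the kind described, one common setup trains both algorithms on the same environment with the same seed and evaluation protocol. The sketch below uses stable-baselines3 with a placeholder environment id; the gym-DSSAT registration string, hyperparameters, and timestep budget are assumptions, not the study's actual configuration.

```python
import gymnasium as gym
from stable_baselines3 import PPO, DQN
from stable_baselines3.common.evaluation import evaluate_policy

ENV_ID = "CartPole-v1"   # placeholder; the study uses gym-DSSAT tasks instead
SEED = 0

def train_and_eval(algo_cls, total_timesteps=50_000, n_eval_episodes=20):
    """Train one algorithm and evaluate it with a shared protocol."""
    env = gym.make(ENV_ID)
    model = algo_cls("MlpPolicy", env, seed=SEED, verbose=0)
    model.learn(total_timesteps=total_timesteps)
    mean_r, std_r = evaluate_policy(model, env, n_eval_episodes=n_eval_episodes)
    return mean_r, std_r

# Same environment, seed, and evaluation budget for both algorithms.
for algo in (PPO, DQN):
    mean_r, std_r = train_and_eval(algo)
    print(f"{algo.__name__}: {mean_r:.1f} +/- {std_r:.1f}")
```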