The goal of multi-objective optimization (MOO) is to learn under multiple,
potentially conflicting, objectives. One widely used technique to tackle MOO is
through linear scalarization, where one fixed preference vector is used to
combine the objectives into a single scalar value for optimization. However,
recent work (Hu et al., 2024) has shown that linear scalarization often fails
to capture the non-convex regions of the Pareto front and thus cannot recover
the complete set of Pareto-optimal solutions. In light of these limitations,
this paper focuses on Tchebycheff scalarization that optimizes for the
worst-case objective. In particular, we propose an online mirror descent
algorithm for Tchebycheff scalarization, which we call OMD-TCH. We show that
OMD-TCH enjoys a convergence rate of O(√(log m / T)), where m is the
number of objectives and T is the number of iteration rounds. We also propose
a novel adaptive online-to-batch conversion scheme that significantly improves
the practical performance of OMD-TCH while maintaining the same convergence
guarantees. We demonstrate the effectiveness of OMD-TCH and the adaptive
conversion scheme on both synthetic problems and federated learning tasks under
fairness constraints, showing state-of-the-art performance.
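The abstract does not spell out the update rule, but the named ingredients suggest a min-max structure: Tchebycheff scalarization optimizes the worst weighted objective, and online mirror descent over the simplex weights corresponds to exponentiated-gradient updates. The sketch below is an illustrative reading under those assumptions; the step sizes, iteration count, and toy objectives are placeholders, not the paper's algorithm.

```python
# Illustrative sketch only (not the paper's algorithm): alternate a gradient
# step on the model variable x with an online-mirror-descent (exponentiated
# gradient) step on simplex weights w, so w concentrates on the worst-case
# objective as in Tchebycheff scalarization. Step sizes and T are arbitrary.
import numpy as np

def omd_tch_sketch(losses_and_grads, dim, T=1000, eta_x=0.05, eta_w=0.1):
    """losses_and_grads(x) -> (losses[m], grads[m, dim]) for m objectives."""
    x = np.zeros(dim)
    m = len(losses_and_grads(x)[0])
    w = np.full(m, 1.0 / m)                  # uniform weights on the simplex
    for _ in range(T):
        losses, grads = losses_and_grads(x)
        x = x - eta_x * (w @ grads)          # descend the w-weighted objective
        w = w * np.exp(eta_w * losses)       # multiplicative (mirror) ascent on w
        w = w / w.sum()                      # project back onto the simplex
    return x, w

# Toy bi-objective example in 1-D: f1(x) = (x - 1)^2, f2(x) = (x + 1)^2.
toy = lambda x: (np.array([(x[0] - 1) ** 2, (x[0] + 1) ** 2]),
                 np.array([[2 * (x[0] - 1)], [2 * (x[0] + 1)]]))
x_star, w_star = omd_tch_sketch(toy, dim=1)
```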
GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled
Reasoning
arXiv:2406.09187v1
The rapid advancement of large language models (LLMs) has catalyzed the
deployment of LLM-powered agents across numerous applications, raising new
concerns regarding their safety and trustworthiness. Existing methods for
enhancing the safety of LLMs are not directly transferable to LLM-powered
agents due to their diverse objectives and output modalities. In this paper, we
propose GuardAgent, the first LLM agent as a guardrail to other LLM agents.
Specifically, GuardAgent oversees a target LLM agent by checking whether its
inputs/outputs satisfy a set of given guard requests defined by the users.
GuardAgent comprises two steps: 1) creating a task plan by analyzing the
provided guard requests, and 2) generating guardrail code based on the task
plan and executing the code by calling APIs or using external engines. In both
steps, an LLM is utilized as the core reasoning component, supplemented by
in-context demonstrations retrieved from a memory module. Such
knowledge-enabled reasoning allows GuardAgent to understand various textual
guard requests and accurately "translate" them into executable code that
provides reliable guardrails. Furthermore, GuardAgent is equipped with an
extendable toolbox containing functions and APIs and requires no additional LLM
training, which underscores its generalization capabilities and low operational
overhead. Additionally, we propose two novel benchmarks: an EICU-AC benchmark
for assessing privacy-related access control for healthcare agents and a
Mind2Web-SC benchmark for safety evaluation of web agents. We show the
effectiveness of GuardAgent on these two benchmarks with 98.7% and 90.0%
accuracy in moderating invalid inputs and outputs for the two types of agents,
respectively. We also show that GuardAgent is able to define novel functions in
adaptation to emergent LLM agents and guard requests, which underscores its
strong generalization capabilities.
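A minimal sketch of the two-step flow described above: build a task plan from the guard requests with retrieved demonstrations, then generate and execute guardrail code. The llm, memory, and toolbox objects and the is_allowed convention are hypothetical placeholders, not GuardAgent's actual interfaces.

```python
# Hypothetical sketch of the two-step guardrail flow; llm(), memory.retrieve(),
# the toolbox dict, and the is_allowed convention are placeholders, not
# GuardAgent's actual interfaces.
def guard(agent_log: str, guard_requests: str, memory, llm, toolbox: dict) -> bool:
    # Step 1: analyze the guard requests into a task plan, conditioned on
    # in-context demonstrations retrieved from the memory module.
    demos = memory.retrieve(guard_requests)
    plan = llm(f"Demonstrations:\n{demos}\n\nGuard requests:\n{guard_requests}\n"
               "Produce a step-by-step guardrail plan.")

    # Step 2: turn the plan into guardrail code and execute it against the
    # target agent's inputs/outputs using only the toolbox functions/APIs.
    code = llm(f"Plan:\n{plan}\n\nWrite Python that checks `agent_log` against "
               f"the plan using only these functions: {sorted(toolbox)}. "
               "Set a boolean variable `is_allowed`.")
    scope = dict(toolbox, agent_log=agent_log)
    exec(code, scope)                        # a real deployment would sandbox this
    return bool(scope.get("is_allowed", False))
```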
Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion
Models?
Accepted at ICLR 2024; this is the camera-ready version.
Diffusion models for text-to-image (T2I) synthesis, such as Stable Diffusion
(SD), have recently demonstrated exceptional capabilities for generating
high-quality content. However, this progress has raised several concerns of
potential misuse, particularly in creating copyrighted, prohibited, and
restricted content, or NSFW (not safe for work) images. While efforts have been
made to mitigate such problems, either by implementing a safety filter at the
evaluation stage or by fine-tuning models to eliminate undesirable concepts or
styles, the effectiveness of these safety measures in dealing with a wide range
of prompts remains largely unexplored. In this work, we aim to investigate
these safety mechanisms by proposing a novel concept retrieval algorithm for
evaluation. We introduce Ring-A-Bell, a model-agnostic red-teaming tool for T2I
diffusion models, where the whole evaluation can be prepared in advance without
prior knowledge of the target model. Specifically, Ring-A-Bell first performs
concept extraction to obtain holistic representations for sensitive and
inappropriate concepts. Subsequently, by leveraging the extracted concept,
Ring-A-Bell automatically identifies problematic prompts for diffusion models
with the corresponding generation of inappropriate content, allowing the user
to assess the reliability of deployed safety mechanisms. Finally, we
empirically validate our method by testing online services such as Midjourney
and various methods of concept removal. Our results show that Ring-A-Bell, by
manipulating safe prompting benchmarks, can transform prompts originally
regarded as safe into ones that evade existing safety mechanisms, revealing
defects in these so-called safety mechanisms that could in practice lead to
the generation of harmful content. Our code is available at
https://github.com/chiayi-hsu/Ring-A-Bell.
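The abstract describes concept extraction only at a high level. One common way to realize such a step, sketched below as an assumption rather than the released implementation, is to take the average embedding offset between paired prompts with and without the sensitive concept and add it to a nominally safe prompt's embedding before searching for a matching discrete prompt.

```python
# Sketch under assumptions (not the released code): represent the sensitive
# concept as the mean embedding offset between paired prompts that do / do not
# contain it, then push a nominally safe prompt toward that concept before a
# discrete prompt search (e.g., a genetic search) tries to match the result.
import numpy as np

def extract_concept(encode, prompts_with, prompts_without):
    """encode: text -> embedding vector (e.g., a CLIP-style text encoder)."""
    with_vecs = np.stack([encode(p) for p in prompts_with])
    without_vecs = np.stack([encode(p) for p in prompts_without])
    return (with_vecs - without_vecs).mean(axis=0)   # holistic concept vector

def target_embedding(encode, safe_prompt, concept_vec, strength=3.0):
    # The strength coefficient is illustrative; larger values push harder
    # toward the extracted concept.
    return encode(safe_prompt) + strength * concept_vec
```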
Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient
LLMs Under Compression
Compressing high-capability Large Language Models (LLMs) has emerged as a
favored strategy for resource-efficient inferences. While state-of-the-art
(SoTA) compression methods boast impressive advancements in preserving benign
task performance, the potential risks of compression in terms of safety and
trustworthiness have been largely neglected. This study conducts the first,
thorough evaluation of three (3) leading LLMs using five (5) SoTA compression
techniques across eight (8) trustworthiness dimensions. Our experiments
highlight the intricate interplay between compression and trustworthiness,
revealing some interesting patterns. We find that quantization is currently a
more effective approach than pruning in achieving efficiency and
trustworthiness simultaneously. For instance, a 4-bit quantized model retains
the trustworthiness of its original counterpart, but model pruning
significantly degrades trustworthiness, even at 50% sparsity. Moreover,
employing quantization within a moderate bit range could unexpectedly improve
certain trustworthiness dimensions such as ethics and fairness. Conversely,
extreme quantization to very low bit levels (3 bits) tends to reduce
trustworthiness significantly. This increased risk cannot be uncovered by
looking at benign performance alone, which in turn mandates comprehensive
trustworthiness evaluation in practice. These findings culminate in practical
recommendations for simultaneously achieving high utility, efficiency, and
trustworthiness in LLMs. Code and models are available at
https://decoding-comp-trust.github.io.
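For concreteness, the snippet below sketches one plausible way to produce the two model variants compared above, a 4-bit quantized model and a 50%-pruned model, using standard Hugging Face and PyTorch utilities. It is not the paper's evaluation harness, MODEL_ID is a placeholder, and the trustworthiness metrics themselves are omitted.

```python
# Sketch only: one plausible way to produce the 4-bit quantized and 50%-pruned
# variants discussed above with standard Hugging Face / PyTorch utilities.
# MODEL_ID is a placeholder (the paper does not prescribe this code), and the
# trustworthiness metrics themselves are not shown.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "path/to/your-llm"  # placeholder model identifier

# 4-bit weight-only quantization via bitsandbytes.
quantized = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=BitsAndBytesConfig(load_in_4bit=True))

# 50% magnitude (L1) pruning of every linear layer.
pruned = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
for module in pruned.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")       # make the sparsity permanent
```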
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT
Models
NeurIPS 2023 Outstanding Paper (Datasets and Benchmarks Track)
Generative Pre-trained Transformer (GPT) models have exhibited exciting
progress in their capabilities, capturing the interest of practitioners and the
public alike. Yet, while the literature on the trustworthiness of GPT models
remains limited, practitioners have proposed employing capable GPT models for
sensitive applications such as healthcare and finance -- where mistakes can be
costly. To this end, this work proposes a comprehensive trustworthiness
evaluation for large language models with a focus on GPT-4 and GPT-3.5,
considering diverse perspectives -- including toxicity, stereotype bias,
adversarial robustness, out-of-distribution robustness, robustness on
adversarial demonstrations, privacy, machine ethics, and fairness. Based on our
evaluations, we discover previously unpublished vulnerabilities to
trustworthiness threats. For instance, we find that GPT models can be easily
misled to generate toxic and biased outputs and leak private information in
both training data and conversation history. We also find that although GPT-4
is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more
vulnerable given jailbreaking system or user prompts, potentially because GPT-4
follows (misleading) instructions more precisely. Our work illustrates a
comprehensive trustworthiness evaluation of GPT models and sheds light on the
trustworthiness gaps. Our benchmark is publicly available at
https://decodingtrust.github.io/ ; our dataset can be previewed at
https://huggingface.co/datasets/AI-Secure/DecodingTrust ; a concise version of
this work is at https://openreview.net/pdf?id=kaHpo8OZw2.
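As a minimal illustration of the benchmark's structure, the stub below enumerates the eight trustworthiness perspectives listed above and loops a user-supplied scorer over them; evaluate_perspective is a placeholder, not the DecodingTrust API.

```python
# Placeholder sketch: loop a user-supplied scorer over the eight perspectives
# listed above; evaluate_perspective() is a stub, not the DecodingTrust API.
PERSPECTIVES = [
    "toxicity", "stereotype_bias", "adversarial_robustness",
    "out_of_distribution_robustness", "robustness_to_adversarial_demonstrations",
    "privacy", "machine_ethics", "fairness",
]

def evaluate_perspective(model_name: str, perspective: str) -> float:
    raise NotImplementedError("plug in the DecodingTrust benchmark here")

def trust_report(model_name: str) -> dict:
    return {p: evaluate_perspective(model_name, p) for p in PERSPECTIVES}
```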
FOCUS: Fairness via Agent-Awareness for Federated Learning on
Heterogeneous Data
arXiv:2207.10265v4
Federated learning (FL) allows agents to jointly train a global model without
sharing their local data. However, due to the heterogeneous nature of local
data, it is challenging to optimize or even define fairness of the trained
global model for the agents. For instance, existing work usually treats
accuracy equity across agents as fairness in FL, which is limited, especially
under the heterogeneous setting: it is intuitively "unfair" to force agents
with high-quality data to achieve accuracy similar to those who contribute
low-quality data, which may discourage agents from participating in FL. In
this work, we propose a formal FL fairness definition, fairness via
agent-awareness (FAA), which takes different contributions of heterogeneous
agents into account. Under FAA, the performance of agents with high-quality
data will not be sacrificed just due to the existence of a large number of
agents with low-quality data. In addition, we propose a fair FL training
algorithm based on agent clustering (FOCUS) to achieve fairness in FL measured
by FAA. Theoretically, we prove the convergence and optimality of FOCUS under
mild conditions for linear and general convex loss functions with bounded
smoothness. We also prove that FOCUS always achieves higher fairness in terms
of FAA compared with standard FedAvg under both linear and general convex loss
functions. Empirically, we show that on four FL datasets, including synthetic
data, images, and texts, FOCUS achieves significantly higher fairness in terms
of FAA while maintaining competitive prediction accuracy compared with FedAvg
and state-of-the-art fair FL algorithms.
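The abstract does not give FOCUS's update rule or the FAA formula, so the sketch below only illustrates the general shape of an agent-clustering FL round under explicit assumptions: agents are clustered by the similarity of their local updates and aggregated FedAvg-style within each cluster.

```python
# Sketch under explicit assumptions (the abstract does not give FOCUS's update
# rule or the FAA formula): cluster agents by the similarity of their local
# updates, then aggregate FedAvg-style within each cluster.
import numpy as np
from sklearn.cluster import KMeans

def clustered_fl_round(global_params, agents, n_clusters=2):
    """agents: objects whose .local_update(params) returns a flat np.ndarray delta."""
    deltas = np.stack([a.local_update(global_params) for a in agents])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(deltas)
    cluster_params = {}
    for c in range(n_clusters):
        members = deltas[labels == c]
        if len(members):                               # per-cluster FedAvg step
            cluster_params[c] = global_params + members.mean(axis=0)
    return cluster_params, labels
```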
Subnet Replacement: Deployment-stage backdoor attack against deep neural
networks in gray-box setting
6 pages, 3 figures, ICLR 2021 Workshop on Security and Safety in
Machine Learning Systems
We study the realistic potential of conducting backdoor attack against deep
neural networks (DNNs) during deployment stage. Specifically, our goal is to
design a deployment-stage backdoor attack algorithm that is both threatening
and realistically implementable. To this end, we propose Subnet Replacement
Attack (SRA), which is capable of embedding backdoor into DNNs by directly
modifying a limited number of model parameters. For realistic practicability,
we abandon the strong white-box assumption widely adopted in existing studies;
instead, our algorithm works in a gray-box setting, where the architecture of
the victim model is available but the adversaries have no knowledge of its
parameter values. The key philosophy underlying our approach is that, given
any neural network instance of a certain architecture (regardless of its
specific parameter values), we can always embed a backdoor into that instance
by replacing a very narrow subnet of the benign model with a malicious
backdoor subnet, which is designed to be sensitive (i.e., fire a large
activation value) to a particular backdoor trigger pattern.
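To make the replacement idea concrete, here is an illustrative PyTorch sketch, under assumptions, of writing a one-channel backdoor subnet into channel 0 of two consecutive convolutional layers of an arbitrary victim instance and cutting its connections to the rest of the network; layer sizes, the channel choice, and the function name are not taken from the paper.

```python
# Illustrative sketch (not the authors' code): write a one-channel backdoor
# subnet into channel 0 of two consecutive conv layers of an arbitrary victim
# instance, and zero the cross-connections so the subnet stays isolated.
import torch

@torch.no_grad()
def replace_subnet(victim_conv1, victim_conv2, backdoor_conv1, backdoor_conv2):
    # Layer 1: channel 0 now computes the backdoor feature from the raw input.
    victim_conv1.weight[0] = backdoor_conv1.weight[0]
    # Layer 2: channel 0 reads only from channel 0 of the previous layer.
    victim_conv2.weight[0].zero_()
    victim_conv2.weight[0, 0] = backdoor_conv2.weight[0, 0]
    # Other output channels ignore the backdoor channel entirely.
    victim_conv2.weight[1:, 0].zero_()
```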
Is Your LLM Secretly a World Model of the Internet? Model-Based Planning
for Web Agents
Language agents have demonstrated promising capabilities in automating
web-based tasks, though their current reactive approaches still largely
underperform humans. While incorporating advanced planning algorithms,
particularly tree search methods, could enhance these agents' performance,
implementing tree search directly on live websites poses significant safety
risks and practical constraints due to irreversible actions such as confirming
a purchase. In this paper, we introduce a novel paradigm that augments language
agents with model-based planning, pioneering the innovative use of large
language models (LLMs) as world models in complex web environments. Our method,
WebDreamer, builds on the key insight that LLMs inherently encode comprehensive
knowledge about website structures and functionalities. Specifically,
WebDreamer uses LLMs to simulate outcomes for each candidate action (e.g.,
"what would happen if I click this button?") using natural language
descriptions, and then evaluates these imagined outcomes to determine the
optimal action at each step. Empirical results on two representative web agent
benchmarks with online interaction -- VisualWebArena and Mind2Web-live --
demonstrate that WebDreamer achieves substantial improvements over reactive
baselines. By establishing the viability of LLMs as world models in web
environments, this work lays the groundwork for a paradigm shift in automated
web interaction. More broadly, our findings open exciting new avenues for
future research into 1) optimizing LLMs specifically for world modeling in
complex, dynamic environments, and 2) model-based speculative planning for
language agents.
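A minimal sketch of the simulate-then-score loop described above; llm() is a placeholder text-completion call, and the prompts and 0-10 scoring scheme are illustrative assumptions rather than WebDreamer's actual prompts.

```python
# Minimal sketch of the simulate-then-score loop; llm() is a placeholder
# text-completion call and the prompts / 0-10 scoring are illustrative.
def choose_action(llm, page: str, task: str, candidate_actions: list[str]) -> str:
    scored = []
    for action in candidate_actions:
        # Use the LLM as a world model: imagine the resulting page in words.
        imagined = llm(f"Current page: {page}\n"
                       f"If the agent performs: {action}\n"
                       "Describe what the page would look like afterwards.")
        # Evaluate how much the imagined outcome advances the task.
        score = float(llm(f"Task: {task}\nImagined outcome: {imagined}\n"
                          "Rate progress toward the task from 0 to 10. "
                          "Answer with a single number."))
        scored.append((score, action))
    return max(scored)[1]                     # act greedily on the best rollout
```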
arXiv:2411.06353v1
Machine learning models deployed in open-world scenarios often encounter
unfamiliar conditions and perform poorly in unanticipated situations. As AI
systems advance and find application in safety-critical domains, effectively
handling out-of-distribution (OOD) data is crucial to building open-world
learning systems. In this work, we introduce ALOE, a novel active learning
algorithm for open-world environments designed to enhance model adaptation by
incorporating new OOD classes via a two-stage approach. First, diversity
sampling selects a representative set of examples, followed by energy-based OOD
detection to prioritize likely unknown classes for annotation. This strategy
accelerates class discovery and learning, even under constrained annotation
budgets. Evaluations on three long-tailed image classification benchmarks
demonstrate that ALOE outperforms traditional active learning baselines,
effectively expanding known categories while balancing annotation cost. Our
findings reveal a crucial tradeoff between enhancing known-class performance
and discovering new classes, setting the stage for future advancements in
open-world machine learning.
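The two-stage selection can be sketched as follows, with the usual implementations assumed rather than taken from the paper: k-means over features for the diversity stage, and the energy score -logsumexp(logits) for the OOD stage, where higher energy suggests an unfamiliar class worth annotating.

```python
# Sketch of the two-stage selection (assumed implementations, not ALOE's exact
# procedure): k-means over features for a diverse seed set, then rank the rest
# of the pool by the energy score -logsumexp(logits); higher energy suggests a
# likely unknown (OOD) class worth annotating.
import numpy as np
from scipy.special import logsumexp
from sklearn.cluster import KMeans

def select_for_annotation(features, logits, n_diverse=50, n_ood=50):
    # Stage 1: diversity sampling: pick the pool example nearest each centroid.
    km = KMeans(n_clusters=n_diverse, n_init=10).fit(features)
    diverse = {int(np.argmin(np.linalg.norm(features - c, axis=1)))
               for c in km.cluster_centers_}
    # Stage 2: energy-based OOD prioritization on the remaining pool.
    energy = -logsumexp(logits, axis=1)
    ranked = [int(i) for i in np.argsort(-energy) if int(i) not in diverse]
    return sorted(diverse) + ranked[:n_ood]
```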
Evaluation and Improvement of Fault Detection for Large Language Models
arXiv:2404.14419v2
Large language models (LLMs) have recently achieved significant success
across various application domains, garnering substantial attention from
different communities. Unfortunately, even the best LLMs still exhibit many
faults, i.e., inputs they cannot predict properly. Such faults harm the
usability of LLMs in general and could introduce safety issues in
reliability-critical systems such as autonomous driving systems. Quickly
revealing these faults in the real-world datasets an LLM may face is important
but challenging, mainly because ground-truth labels are required and data
labeling is costly in time and human effort. To handle
this problem, in the conventional deep learning testing field, test selection
methods have been proposed for efficiently evaluating deep learning models by
prioritizing faults. However, despite their importance, the usefulness of these
methods for LLMs is unclear and underexplored. In this paper, we conduct
the first empirical study to investigate the effectiveness of existing fault
detection methods for LLMs. Experimental results on four different
tasks (including both code tasks and natural language processing tasks) and
four LLMs (e.g., LLaMA3 and GPT-4) demonstrate that simple methods such as
Margin perform well on LLMs but there is still large room for improvement.
Based on the study, we further propose MuCS, a prompt Mutation-based
prediction Confidence Smoothing framework that boosts the fault detection
capability of existing methods. Concretely, we propose multiple prompt
mutation techniques to help collect more diverse outputs for confidence
smoothing. The results show that our framework significantly enhances
existing methods, improving test relative coverage by up to 70.53%.
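A minimal sketch of the smoothing idea described above, with the mutation operator and model call left as placeholders (they are assumptions, not the paper's implementation): average class probabilities over several mutated prompts, then rank test inputs by the Margin score so likely faults are labeled first.

```python
# Placeholder sketch of mutation-based confidence smoothing: average class
# probabilities over several mutated prompts, then rank test inputs by the
# Margin score so likely faults are labeled first. model_probs() and mutate()
# are stubs, not the paper's implementation.
import numpy as np

def smoothed_probs(model_probs, prompt, mutate, n_mutants=5):
    """model_probs(text) -> np.ndarray of class probabilities."""
    variants = [prompt] + [mutate(prompt) for _ in range(n_mutants)]
    return np.mean([model_probs(v) for v in variants], axis=0)

def margin_rank(prompts, model_probs, mutate):
    margins = []
    for p in prompts:
        probs = np.sort(smoothed_probs(model_probs, p, mutate))[::-1]
        margins.append(probs[0] - probs[1])   # small margin = uncertain = likely fault
    return np.argsort(margins)                # label low-margin inputs first
```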