arXiv:2404.01099v2
Current Large Language Models (LLMs), even those tuned for safety and
alignment, are susceptible to jailbreaking. Recent work has found that merely
fine-tuning an aligned model further with benign data (i.e., data without
harmful content) surprisingly leads to substantial degradation in safety. We delve into
the data-centric aspects of why benign fine-tuning inadvertently contributes to
jailbreaking. First, we represent fine-tuning data through two lenses:
representation and gradient spaces. Additionally, we propose a bi-directional
anchoring method that, during the selection process, prioritizes data points
that are close to harmful examples and far from benign ones. Our approach
effectively identifies subsets of benign data that are more likely to degrade
the model's safety after fine-tuning. Training on just 100 of these seemingly
benign datapoints surprisingly leads to the fine-tuned model affirmatively
responding to >70% of tested harmful requests, compared to <20% after
fine-tuning on randomly selected data. We also observe that the selected data
frequently appear as lists, bullet points, or math questions, indicating a
systematic pattern in fine-tuning data that contributes to jailbreaking.
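As a rough illustration of the bi-directional anchoring idea, the sketch below ranks candidate fine-tuning examples by how close they sit to harmful anchor examples and how far from benign anchors in a shared feature space (representations or gradients). The function name, the max-similarity scoring rule, and the random stand-in features are assumptions for illustration, not the authors' implementation.

# Minimal sketch of bi-directional anchoring for data selection, assuming
# precomputed feature vectors (e.g., representations or per-example gradients)
# for candidates, harmful anchors, and benign ("safe") anchors.
import numpy as np

def select_anchored_subset(cand, harmful, safe, k=100):
    """Rank candidates close to harmful anchors and far from safe anchors."""
    def cos(a, b):
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        return a @ b.T                             # (n_cand, n_anchor) similarities

    sim_harm = cos(cand, harmful).max(axis=1)      # closeness to nearest harmful anchor
    sim_safe = cos(cand, safe).max(axis=1)         # closeness to nearest safe anchor
    score = sim_harm - sim_safe                    # bi-directional anchoring score
    return np.argsort(-score)[:k]                  # indices of the top-k candidates

# Example with random features standing in for real representations/gradients:
rng = np.random.default_rng(0)
idx = select_anchored_subset(rng.normal(size=(1000, 64)),
                             rng.normal(size=(50, 64)),
                             rng.normal(size=(50, 64)))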
SORRY-Bench: Systematically Evaluating Large Language Model Safety
Refusal Behaviors
arXiv:2406.14598v1
Evaluating aligned large language models' (LLMs) ability to recognize and
reject unsafe user requests is crucial for safe, policy-compliant deployments.
Existing evaluation efforts, however, face three limitations that we address
with SORRY-Bench, our proposed benchmark. First, existing methods often use
coarse-grained taxonomies of unsafe topics and over-represent some
fine-grained topics. For example, among the ten existing datasets that we
evaluated, tests for refusals of self-harm instructions are over 3x less
represented than tests for fraudulent activities. SORRY-Bench improves on this
by using a fine-grained taxonomy of 45 potentially unsafe topics, and 450
class-balanced unsafe instructions, compiled through human-in-the-loop methods.
Second, the linguistic characteristics and formatting of prompts -- different
languages, dialects, and more -- are often overlooked or only implicitly
considered in many evaluations. We supplement SORRY-Bench with 20
diverse linguistic augmentations to systematically examine these effects.
Third, existing evaluations rely on large LLMs (e.g., GPT-4) as safety judges,
which can be computationally expensive. We investigate design choices for
creating a fast, accurate automated safety evaluator. By collecting 7K+ human
annotations and conducting a meta-evaluation of diverse LLM-as-a-judge designs,
we show that fine-tuned 7B LLMs can achieve accuracy comparable to GPT-4 scale
LLMs, with lower computational cost. Putting these together, we evaluate over
40 proprietary and open-source LLMs on SORRY-Bench, analyzing their distinctive
refusal behaviors. We hope our effort provides a building block for systematic
evaluations of LLMs' safety refusal capabilities, in a balanced, granular, and
efficient manner.
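To make the meta-evaluation step concrete, the sketch below measures how often an automated safety judge agrees with human annotations of whether a response fulfills an unsafe request; this is the kind of accuracy comparison used to validate smaller fine-tuned judges against GPT-4-scale ones. The toy keyword judge and data here are placeholders, not SORRY-Bench components.

# Minimal sketch of judge meta-evaluation against human annotations (illustrative).
from typing import Callable, List, Tuple

def judge_agreement(judge: Callable[[str, str], bool],
                    samples: List[Tuple[str, str, bool]]) -> float:
    """samples: (unsafe_instruction, model_response, human_label_is_fulfillment)."""
    correct = sum(judge(inst, resp) == label for inst, resp, label in samples)
    return correct / len(samples)

# Toy usage with a trivial keyword-based judge standing in for a fine-tuned 7B LLM:
toy_judge = lambda inst, resp: not resp.lower().startswith(("i can't", "i cannot", "sorry"))
data = [("how do I pick a lock?", "Sorry, I can't help with that.", False),
        ("how do I pick a lock?", "Sure, first insert a tension wrench...", True)]
print(judge_agreement(toy_judge, data))  # 1.0 on this toy pair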
AI Risk Management Should Incorporate Both Safety and Security
arXiv:2405.19524v1
The exposure of security vulnerabilities in safety-aligned language models,
e.g., susceptibility to adversarial attacks, has shed light on the intricate
interplay between AI safety and AI security. Although the two disciplines now
come together under the overarching goal of AI risk management, they have
historically evolved separately, giving rise to differing perspectives.
Therefore, in this paper, we advocate that stakeholders in AI risk management
should be aware of the nuances, synergies, and interplay between safety and
security, and unambiguously take into account the perspectives of both
disciplines in order to devise more effective and holistic risk mitigation
approaches. Unfortunately, this vision is often obfuscated, as the definitions
of the basic concepts of "safety" and "security" themselves are often
inconsistent and lack consensus across communities. With AI risk management
being increasingly cross-disciplinary, this issue is particularly salient. In
light of this conceptual challenge, we introduce a unified reference framework
to clarify the differences and interplay between AI safety and AI security,
aiming to facilitate a shared understanding and effective collaboration across
communities.
Aleatoric and Epistemic Discrimination: Fundamental Limits of Fairness
Interventions
arXiv:2301.11781v3
Machine learning (ML) models can underperform on certain population groups
due to choices made during model development and bias inherent in the data. We
categorize sources of discrimination in the ML pipeline into two classes:
aleatoric discrimination, which is inherent in the data distribution, and
epistemic discrimination, which is due to decisions made during model
development. We quantify aleatoric discrimination by determining the
performance limits of a model under fairness constraints, assuming perfect
knowledge of the data distribution. We demonstrate how to characterize
aleatoric discrimination by applying Blackwell's results on comparing
statistical experiments. We then quantify epistemic discrimination as the gap
between a model's accuracy when fairness constraints are applied and the limit
posed by aleatoric discrimination. We apply this approach to benchmark existing
fairness interventions and investigate fairness risks in data with missing
values. Our results indicate that state-of-the-art fairness interventions are
effective at removing epistemic discrimination on standard (overused) tabular
datasets. However, when data has missing values, there is still significant
room for improvement in handling aleatoric discrimination.
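In schematic form (notation assumed here for illustration, not the paper's exact symbols), aleatoric discrimination is captured by the fairness-constrained performance limit computed with perfect knowledge of the data distribution, and epistemic discrimination by the gap between that limit and a trained model's accuracy:

\[
\underbrace{\mathrm{Acc}^{*}_{\mathrm{fair}}}_{\substack{\text{limit under fairness constraints}\\ \text{(set by aleatoric discrimination)}}}
\;-\;
\underbrace{\mathrm{Acc}_{\mathrm{fair}}(h)}_{\substack{\text{accuracy of trained model } h\\ \text{under the same constraints}}}
\;=\; \text{epistemic discrimination}.
\]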
Revisiting, Benchmarking and Understanding Unsupervised Graph Domain
Adaptation
Unsupervised Graph Domain Adaptation (UGDA) involves the transfer of
knowledge from a label-rich source graph to an unlabeled target graph under
domain discrepancies. Despite the proliferation of methods designed for this
emerging task, the lack of standard experimental settings and fair performance
comparisons makes it challenging to understand which models perform well and in
which scenarios. To fill this gap, we present the first
comprehensive benchmark for unsupervised graph domain adaptation named
GDABench, which encompasses 16 algorithms across 5 datasets with 74 adaptation
tasks. Through extensive experiments, we observe that the performance of
current UGDA models varies significantly across different datasets and
adaptation scenarios. Specifically, we recognize that when the source and
target graphs face significant distribution shifts, it is imperative to
formulate strategies to effectively address and mitigate graph structural
shifts. We also find that with appropriate neighbourhood aggregation
mechanisms, simple GNN variants can even surpass state-of-the-art UGDA
baselines. To facilitate reproducibility, we have developed an easy-to-use
library PyGDA for training and evaluating existing UGDA methods, providing a
standardized platform for this community. Our source code and datasets can be
found at: https://github.com/pygda-team/pygda.
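As a concrete reading of the "simple GNN variants" finding, the sketch below implements one layer of GraphSAGE-style mean-neighbourhood aggregation from scratch; it is a generic baseline for illustration, not code from GDABench or PyGDA.

# Minimal sketch of a simple GNN layer with mean-neighbourhood aggregation.
import numpy as np

def mean_aggregation_layer(X, adj, W):
    """X: (n, d) node features; adj: (n, n) binary adjacency; W: (2d, d_out)."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)   # avoid divide-by-zero
    neigh = adj @ X / deg                              # mean of neighbour features
    h = np.concatenate([X, neigh], axis=1) @ W         # combine self + neighbourhood
    return np.maximum(h, 0)                            # ReLU

# Toy usage on a 4-node graph:
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
adj = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], dtype=float)
H = mean_aggregation_layer(X, adj, rng.normal(size=(16, 4)))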
Integrating Object Detection Modality into Visual Language Model for
Enhanced Autonomous Driving Agent
In this paper, we propose a novel framework for enhancing visual
comprehension in autonomous driving systems by integrating visual language
models (VLMs) with an additional visual perception module specialised in object
detection. We extend the Llama-Adapter architecture by incorporating a
YOLOS-based detection network alongside the CLIP perception network, addressing
limitations in object detection and localisation. Our approach introduces
camera ID-separators to improve multi-view processing, crucial for
comprehensive environmental awareness. Experiments on the DriveLM visual
question answering challenge demonstrate significant improvements over baseline
models, with enhanced performance in ChatGPT scores, BLEU scores, and CIDEr
metrics, indicating closer alignment of model answers with the ground truth. Our method
represents a promising step towards more capable and interpretable autonomous
driving systems. Possible safety enhancements enabled by the detection modality
are also discussed.
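The sketch below illustrates one plausible way to realize the described token-level fusion: per-camera CLIP-style visual tokens and detection tokens are interleaved with learned camera-ID separator tokens before being fed to the language model. Shapes and names are assumptions, not the paper's implementation.

# Minimal sketch of multi-view token assembly with camera-ID separators.
import torch

def assemble_multiview_tokens(clip_feats, det_feats, cam_id_embed):
    """clip_feats/det_feats: lists of (n_i, d) tensors per camera; cam_id_embed: (n_cams, d)."""
    seq = []
    for cam, (vis, det) in enumerate(zip(clip_feats, det_feats)):
        seq.append(cam_id_embed[cam].unsqueeze(0))   # camera-ID separator token
        seq.append(vis)                              # perception tokens for this view
        seq.append(det)                              # detection tokens for this view
    return torch.cat(seq, dim=0)                     # (total_tokens, d) prefix for the LLM

# Toy usage: 3 cameras, 16 visual tokens and 4 detection tokens each, d = 32.
d = 32
tokens = assemble_multiview_tokens([torch.randn(16, d) for _ in range(3)],
                                   [torch.randn(4, d) for _ in range(3)],
                                   torch.randn(3, d))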
Generalize or Detect? Towards Robust Semantic Segmentation Under
Multiple Distribution Shifts
In open-world scenarios, where both novel classes and domains may exist, an
ideal segmentation model should detect anomaly classes for safety and
generalize to new domains. However, existing methods often struggle to
distinguish between domain-level and semantic-level distribution shifts,
leading to poor out-of-distribution (OOD) detection or domain generalization
performance. In this work, we aim to equip the model to generalize effectively
to covariate-shift regions while precisely identifying semantic-shift regions.
To achieve this, we design a novel generative augmentation method to produce
coherent images that incorporate both anomaly (or novel) objects and various
covariate shifts at both image and object levels. Furthermore, we introduce a
training strategy that recalibrates uncertainty specifically for semantic
shifts and enhances the feature extractor to align features associated with
domain shifts. We validate the effectiveness of our method across benchmarks
featuring both semantic and domain shifts. Our method achieves state-of-the-art
performance across all benchmarks for both OOD detection and domain
generalization. Code is available at
https://github.com/gaozhitong/MultiShiftSeg.
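For context on the "detect" side of the task, the sketch below shows a common per-pixel anomaly-scoring baseline (negative max logit) used to flag semantic-shift regions; it is a generic baseline, not the uncertainty-recalibration strategy proposed in the paper.

# Generic per-pixel OOD scoring for semantic segmentation (illustrative baseline).
import torch

def pixel_anomaly_scores(logits: torch.Tensor) -> torch.Tensor:
    """logits: (num_classes, H, W) for one image; returns (H, W) anomaly scores."""
    return -logits.max(dim=0).values   # low confidence on all known classes => high score

# Pixels above a validation-chosen threshold are flagged as semantic-shift (anomaly)
# regions; covariate-shift pixels should ideally stay below it.
scores = pixel_anomaly_scores(torch.randn(19, 64, 128))
mask = scores > scores.quantile(0.95)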
Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse
Activation Control
arXiv:2411.02461v1
As the development and application of Large Language Models (LLMs) continue
to advance rapidly, enhancing their trustworthiness and aligning them with
human preferences has become a critical area of research. Traditional methods
rely heavily on extensive data for Reinforcement Learning from Human Feedback
(RLHF), but representation engineering offers a new, training-free approach.
This technique leverages semantic features to control an LLM's intermediate
hidden-state representations, enabling the model to meet specific
requirements such as increased honesty or heightened safety awareness. However,
a significant challenge arises when attempting to fulfill multiple requirements
simultaneously. It proves difficult to encode distinct semantic contents, such as
honesty and safety, into a single semantic feature, which restricts the
approach's practicality. In this work, we address this issue through "Sparse
Activation Control". By delving into the intrinsic mechanisms of LLMs, we manage to
identify and pinpoint components that are closely related to specific tasks
within the model, i.e., attention heads. These heads display sparse
characteristics that allow for near-independent control over different tasks.
Our experiments, conducted on the open-source Llama series models, have yielded
encouraging results. The models were able to align with human preferences on
issues of safety, factuality, and bias concurrently.
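A minimal sketch of this kind of head-level intervention, assuming steering is applied by shifting the outputs of a few selected attention heads at inference time via a forward hook; module paths, head indices, and steering vectors are placeholders, not the authors' implementation.

# Minimal sketch: per-head steering on an attention output projection via a hook.
import torch
import torch.nn as nn

def steer_heads(attn_out_proj: nn.Linear, head_dim: int, head_ids, vec: torch.Tensor):
    """Register a hook that shifts chosen heads before the output projection."""
    def hook(module, inputs, output):
        # inputs[0]: (batch, seq, n_heads * head_dim) concatenated head outputs
        x = inputs[0].clone()
        for h in head_ids:
            x[..., h * head_dim:(h + 1) * head_dim] += vec   # sparse, per-head control
        return nn.functional.linear(x, module.weight, module.bias)
    return attn_out_proj.register_forward_hook(hook)

# Toy usage on a stand-in projection layer (8 heads of dim 16):
proj = nn.Linear(128, 128)
handle = steer_heads(proj, head_dim=16, head_ids=[2, 5], vec=torch.randn(16) * 0.1)
y = proj(torch.randn(1, 4, 128))
handle.remove()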
PhysMLE: Generalizable and Priors-Inclusive Multi-task Remote
Physiological Measurement
arXiv:2405.06201v2
Remote photoplethysmography (rPPG) has been widely applied to measure heart
rate from face videos. To increase the generalizability of the algorithms,
domain generalization (DG) has attracted increasing attention in rPPG. However,
when rPPG is extended to simultaneously measure more vital signs (e.g.,
respiration and blood oxygen saturation), achieving generalizability brings new
challenges. Although partial features shared among different physiological
signals can benefit multi-task learning, the sparse and imbalanced target label
space introduces a seesaw effect in task-specific feature learning. To resolve
this problem, we designed an end-to-end Mixture of Low-rank Experts for
multi-task remote Physiological measurement (PhysMLE), which is based on
multiple low-rank experts with a novel router mechanism, thereby enabling the
model to adeptly handle both task-specific characteristics and correlations
across tasks.
Additionally, we introduced prior knowledge from physiology among tasks to
overcome the imbalance of label space under real-world multi-task physiological
measurement. For fair and comprehensive evaluation, this paper proposes a
large-scale multi-task generalization benchmark, the Multi-Source Synsemantic
Domain Generalization (MSSDG) protocol. Extensive experiments under MSSDG and in
intra-dataset settings show the effectiveness and efficiency of PhysMLE. In
addition, a new dataset was collected and made publicly available to meet the
needs of the MSSDG protocol.
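The sketch below shows one way a mixture of low-rank experts with an input-conditioned router can be written: each expert is a low-rank update on top of a shared linear layer, and a softmax router mixes the experts per sample. Dimensions, the router form, and all names are assumptions for illustration, not the PhysMLE code.

# Minimal sketch of a mixture of low-rank experts with a softmax router.
import torch
import torch.nn as nn

class LowRankMoE(nn.Module):
    def __init__(self, d_in, d_out, n_experts=4, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)                        # shared backbone weight
        self.A = nn.Parameter(torch.randn(n_experts, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, d_out, rank))
        self.router = nn.Linear(d_in, n_experts)                  # input-conditioned gating

    def forward(self, x):                                         # x: (batch, d_in)
        gates = torch.softmax(self.router(x), dim=-1)             # (batch, n_experts)
        low_rank = torch.einsum('erd,bd->ber', self.A, x)         # (batch, n_experts, rank)
        expert_out = torch.einsum('eor,ber->beo', self.B, low_rank)  # (batch, n_experts, d_out)
        return self.base(x) + (gates.unsqueeze(-1) * expert_out).sum(dim=1)

# Toy usage: task-specific heads (e.g., heart rate, respiration) could sit on the output.
y = LowRankMoE(32, 64)(torch.randn(5, 32))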
Efficient Mixture-of-Expert for Video-based Driver State and
Physiological Multi-task Estimation in Conditional Autonomous Driving
arXiv:2410.21086v1
Road safety remains a critical challenge worldwide, with approximately 1.35
million fatalities annually attributed to traffic accidents, often due to human
errors. As we advance towards higher levels of vehicle automation, challenges
remain: automated driving can cognitively over-demand drivers if they engage in
non-driving-related tasks (NDRTs), or lead to drowsiness if driving is the sole
task. This underscores the urgent need for an effective
Driver Monitoring System (DMS) that can evaluate cognitive load and drowsiness
in SAE Level-2/3 autonomous driving contexts. In this study, we propose a novel
multi-task DMS, termed VDMoE, which leverages RGB video input to monitor driver
states non-invasively. By utilizing key facial features to minimize
computational load and integrating remote Photoplethysmography (rPPG) for
physiological insights, our approach enhances detection accuracy while
maintaining efficiency. Additionally, we optimize the Mixture-of-Experts (MoE)
framework to accommodate multi-modal inputs and improve performance across
different tasks. A novel prior-inclusive regularization method is introduced to
align model outputs with statistical priors, thus accelerating convergence and
mitigating overfitting risks. We validate our method on a newly created
dataset (MCDD), which comprises RGB video and physiological indicators from 42
participants, and on two public datasets. Our findings demonstrate the
effectiveness of VDMoE in monitoring driver states, contributing to safer
autonomous driving systems. The code and data will be released.
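One plausible form of a prior-inclusive regularizer, shown below, penalizes the KL divergence between a known statistical prior over driver states and the batch-average predicted distribution; the prior values and the exact penalty are assumptions for illustration, not the paper's loss.

# Minimal sketch of aligning batch-average predictions with a statistical prior.
import torch
import torch.nn.functional as F

def prior_regularizer(logits: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
    """logits: (batch, n_states); prior: (n_states,) probabilities summing to 1."""
    p_batch = torch.softmax(logits, dim=-1).mean(dim=0)        # batch-average prediction
    return F.kl_div(p_batch.log(), prior, reduction='sum')     # KL(prior || p_batch)

# Toy usage: total loss = task loss + lambda * prior term, with an assumed prior.
logits = torch.randn(16, 3, requires_grad=True)
prior = torch.tensor([0.7, 0.2, 0.1])                          # hypothetical state frequencies
loss = F.cross_entropy(logits, torch.randint(0, 3, (16,))) + 0.1 * prior_regularizer(logits, prior)
loss.backward()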