In high-stakes domains such as healthcare and hiring, the role of machine
learning (ML) in decision-making raises significant fairness concerns. This
work focuses on Counterfactual Fairness (CF), which posits that an ML model's
outcome on any individual should remain unchanged if they had belonged to a
different demographic group. Previous works have proposed methods that
guarantee CF. Nevertheless, their effects on the model's predictive
performance remain largely unclear. To fill this gap, we provide a
theoretical study on the inherent trade-off between CF and predictive
performance in a model-agnostic manner. We first propose a simple but effective
method to cast an optimal but potentially unfair predictor into a fair one
without losing optimality. By analyzing the excess risk incurred to achieve
CF, we quantify this inherent trade-off. We further analyze our method's
performance when only incomplete causal knowledge is available.
Building on this analysis, we propose a performant algorithm that can be applied in such
scenarios. Experiments on both synthetic and semi-synthetic datasets
demonstrate the validity of our analysis and methods.
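For reference, the standard formalization of counterfactual fairness that this line of work builds on (following Kusner et al., 2017) can be written as follows; the notation is the common convention rather than this paper's own.

```latex
% Counterfactual fairness: for every individual with features X = x and
% sensitive attribute A = a, the predictor's distribution is unchanged
% under the counterfactual intervention A <- a'.
P\big(\hat{Y}_{A \leftarrow a}(U) = y \mid X = x, A = a\big)
  = P\big(\hat{Y}_{A \leftarrow a'}(U) = y \mid X = x, A = a\big),
\quad \forall y,\ \forall a' .
```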
ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language
Models
arXiv:2410.18491v1
With the rapid development of large language models (LLMs), understanding the
capabilities of LLMs in identifying unsafe content has become increasingly
important. While previous works have introduced several benchmarks to evaluate
the safety risk of LLMs, the community still has a limited understanding of
current LLMs' capability to recognize illegal and unsafe content in Chinese
contexts. In this work, we present a Chinese safety benchmark (ChineseSafe) to
facilitate research on the content safety of large language models. To align
with the regulations for Chinese Internet content moderation, our ChineseSafe
contains 205,034 examples across 4 classes and 10 sub-classes of safety issues.
For Chinese contexts, we add several special types of illegal content:
political sensitivity, pornography, and variant/homophonic words. Moreover, we
employ two methods to evaluate the legal risks of popular LLMs, including
open-source models and APIs. The results reveal that many LLMs exhibit
vulnerability to certain types of safety issues, leading to legal risks in
China. Our work provides a guideline for developers and researchers to
facilitate the safety of LLMs. Our results are also available at
https://huggingface.co/spaces/SUSTech/ChineseSafe-Benchmark.
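As a rough illustration of this kind of benchmark evaluation, the sketch below runs a generation-based safety-classification pass over labeled examples. The dataset fields, prompt wording, and model choice are assumptions for illustration only, not ChineseSafe's actual interface or protocol.

```python
# Minimal sketch of a generation-based safety evaluation loop.
# Assumptions (not from the paper): each example carries "text" and "label"
# fields, and the model is an arbitrary HuggingFace causal LM; the benchmark's
# real prompts and evaluation methods may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2-7B-Instruct"  # placeholder model choice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

def classify(text: str) -> str:
    """Ask the model whether a piece of content is safe or unsafe."""
    prompt = f"请判断以下内容是否安全，只回答“安全”或“不安全”。\n内容：{text}\n回答："
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    answer = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return "unsafe" if "不安全" in answer else "safe"

def accuracy(examples: list) -> float:
    """examples: list of dicts with hypothetical 'text' and 'label' keys."""
    hits = sum(classify(ex["text"]) == ex["label"] for ex in examples)
    return hits / len(examples)
```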
CodeAttack: Revealing Safety Generalization Challenges of Large Language
Models via Code Completion
ACL Findings 2024. Code is available at
https://github.com/renqibing/CodeAttack
The rapid advancement of Large Language Models (LLMs) has brought about
remarkable generative capabilities but also raised concerns about their
potential misuse. While strategies like supervised fine-tuning and
reinforcement learning from human feedback have enhanced their safety, these
methods primarily focus on natural language, which may not generalize to other
domains. This paper introduces CodeAttack, a framework that transforms natural
language inputs into code inputs, presenting a novel environment for testing
the safety generalization of LLMs. Our comprehensive studies on
state-of-the-art LLMs including GPT-4, Claude-2, and Llama-2 series reveal a
new and universal safety vulnerability of these models against code input:
CodeAttack bypasses the safety guardrails of all models more than 80% of the
time. We find that a larger distribution gap between CodeAttack and natural
language, for example when natural-language input is encoded with data
structures, leads to weaker safety generalization. Furthermore, we hypothesize
that CodeAttack succeeds because of a misaligned bias acquired by LLMs during
code training, which prioritizes code completion over avoiding potential safety risks.
Finally, we analyze potential mitigation measures. These findings highlight new
safety risks in the code domain and the need for more robust safety alignment
algorithms to match the code capabilities of LLMs.
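To make the idea of "transforming natural-language inputs into code inputs" concrete, here is a schematic and intentionally benign sketch of such a wrapper; the actual templates used by CodeAttack live in the linked repository, and this is not one of them.

```python
# Schematic illustration of wrapping a natural-language query inside a
# code-completion task. This simplified template is a stand-in, not
# CodeAttack's actual prompt; see https://github.com/renqibing/CodeAttack.
def wrap_query_as_code_task(query: str) -> str:
    """Embed a natural-language query into a Python code-completion prompt."""
    return (
        "Complete the following Python function.\n\n"
        "# The query is stored word-by-word in a list -- a data-structure\n"
        "# encoding that widens the distribution gap from natural language.\n"
        f"query_words = {query.split()!r}\n"
        "def respond():\n"
        "    answer = []\n"
        "    # TODO: append the steps that answer the reconstructed query\n"
        "    return answer\n"
    )

# Benign example query, purely for illustration.
print(wrap_query_as_code_task("How do plants convert sunlight into energy?"))
```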
PsySafe: A Comprehensive Framework for Psychological-based Attack,
Defense, and Evaluation of Multi-agent System Safety
Multi-agent systems, when enhanced with Large Language Models (LLMs), exhibit
profound capabilities in collective intelligence. However, the potential misuse
of this intelligence for malicious purposes presents significant risks. To
date, comprehensive research on the safety issues associated with multi-agent
systems remains limited. In this paper, we explore these concerns through the
innovative lens of agent psychology, revealing that the dark psychological
states of agents constitute a significant threat to safety. To tackle these
concerns, we propose a comprehensive framework (PsySafe) grounded in agent
psychology, focusing on three key areas: firstly, identifying how dark
personality traits in agents can lead to risky behaviors; secondly, evaluating
the safety of multi-agent systems from psychological and behavioral
perspectives; and thirdly, devising effective strategies to mitigate these
risks. Our experiments reveal several intriguing phenomena, such as the
collective dangerous behaviors among agents, agents' self-reflection when
engaging in dangerous behavior, and the correlation between agents'
psychological assessments and dangerous behaviors. We anticipate that our
framework and observations will provide valuable insights for further research
into the safety of multi-agent systems. We will make our data and code publicly
accessible at https://github.com/AI4Good24/PsySafe.
Leave No Patient Behind: Enhancing Medication Recommendation for Rare
Disease Patients
arXiv:2403.17745v2
Medication recommendation systems have gained significant attention in
healthcare as a means of providing tailored and effective drug combinations
based on patients' clinical information. However, existing approaches often
suffer from fairness issues, as recommendations tend to be more accurate for
patients with common diseases compared to those with rare conditions. In this
paper, we propose a novel model called Robust and Accurate REcommendations for
Medication (RAREMed), which leverages the pretrain-finetune learning paradigm
to enhance accuracy for rare diseases. RAREMed employs a transformer encoder
with a unified input sequence approach to capture complex relationships among
disease and procedure codes. Additionally, it introduces two self-supervised
pre-training tasks, namely Sequence Matching Prediction (SMP) and Self
Reconstruction (SR), to learn specialized medication needs and interrelations
among clinical codes. Experimental results on two real-world datasets
demonstrate that RAREMed provides accurate drug sets for both rare and common
disease patients, thereby mitigating unfairness in medication recommendation
systems.
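As a rough sketch of the kind of architecture the abstract describes, a transformer encoder over a unified sequence of disease and procedure codes with a multi-label medication head, the skeleton below may help. The dimensions, pooling, and the placeholder SMP/SR heads are illustrative assumptions, not RAREMed's actual implementation.

```python
# Illustrative skeleton of an encoder over a unified clinical-code sequence
# with a multi-label medication head. Hyperparameters and the SMP/SR heads
# are assumptions for illustration, not RAREMed's code.
import torch
import torch.nn as nn

class CodeSequenceEncoder(nn.Module):
    def __init__(self, n_codes: int, n_drugs: int, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(n_codes, d_model, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.drug_head = nn.Linear(d_model, n_drugs)    # medication recommendation
        self.match_head = nn.Linear(d_model, 1)         # sequence-matching head (assumed stand-in for SMP)
        self.recon_head = nn.Linear(d_model, n_codes)   # reconstruction head (assumed stand-in for SR)

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: (batch, seq_len) of disease + procedure code ids in one unified sequence
        h = self.encoder(self.embed(codes))             # (batch, seq_len, d_model)
        patient = h.mean(dim=1)                         # simple pooled patient representation
        return torch.sigmoid(self.drug_head(patient))   # multi-label drug probabilities

model = CodeSequenceEncoder(n_codes=2000, n_drugs=150)
dummy = torch.randint(1, 2000, (4, 32))                 # 4 patients, 32 codes each
print(model(dummy).shape)                               # torch.Size([4, 150])
```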
CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue
Coreference
As large language models (LLMs) constantly evolve, ensuring their safety
remains a critical research problem. Previous red-teaming approaches for LLM
safety have primarily focused on single prompt attacks or goal hijacking. To
the best of our knowledge, we are the first to study LLM safety in multi-turn
dialogue coreference. We created a dataset of 1,400 questions across 14
categories, each featuring multi-turn coreference safety attacks. We then
conducted detailed evaluations on five widely used open-source LLMs. The
results indicated that under multi-turn coreference safety attacks, the highest
attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was
13.9% with the Mistral-7B-Instruct model. These findings highlight the safety
vulnerabilities in LLMs during dialogue coreference interactions.
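To illustrate what a multi-turn coreference probe looks like structurally, here is a benign, invented dialogue in the shape the abstract describes, where the final turn refers back to earlier content only through a pronoun. It is not a record from the CoSafe dataset; in the real attacks, the referenced topic would be safety-sensitive.

```python
# Benign structural illustration of a multi-turn coreference probe: the last
# turn only makes sense by resolving "it" to content introduced earlier.
# Invented example, not taken from the CoSafe dataset.
dialogue = [
    {"role": "user", "content": "I watched a documentary about volcanoes yesterday."},
    {"role": "assistant", "content": "Volcanoes form where magma reaches the surface through the crust."},
    {"role": "user", "content": "Can you explain exactly how it erupts, step by step?"},  # "it" -> turn 1
]

def has_coreference_probe(turns: list) -> bool:
    """Crude check: does the final user turn rely on a pronoun instead of naming the topic?"""
    last = turns[-1]["content"].lower()
    return any(p in last.split() for p in ("it", "that", "this", "them"))

print(has_coreference_probe(dialogue))  # True
```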
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision
Language Model
arXiv:2406.12030v1
The emergence of Vision Language Models (VLMs) has brought unprecedented
advances in understanding multimodal information. The combination of textual
and visual semantics in VLMs is highly complex and diverse, making the safety
alignment of these models challenging. Furthermore, due to the limited study on
the safety alignment of VLMs, there is a lack of large-scale, high-quality
datasets. To address these limitations, we propose a Safety Preference
Alignment dataset for Vision Language Models named SPA-VL. In terms of breadth,
SPA-VL covers 6 harmfulness domains, 13 categories, and 53 subcategories, and
contains 100,788 samples of the quadruple (question, image, chosen response,
rejected response). In terms of depth, the responses are collected from 12
open- (e.g., QwenVL) and closed-source (e.g., Gemini) VLMs to ensure diversity.
The experimental results indicate that models trained with alignment techniques
on the SPA-VL dataset exhibit substantial improvements in harmlessness and
helpfulness while maintaining core capabilities. SPA-VL, as a large-scale,
high-quality, and diverse dataset, represents a significant milestone in
ensuring that VLMs achieve both harmlessness and helpfulness. We have made our
code (https://github.com/EchoseChen/SPA-VL-RLHF) and the SPA-VL dataset
(https://huggingface.co/datasets/sqrti/SPA-VL) publicly available.
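A minimal sketch of inspecting the quadruple structure the abstract describes, assuming the dataset can be loaded through the HuggingFace datasets library; the split and column names here are guesses based on the (question, image, chosen response, rejected response) description, not confirmed field names.

```python
# Sketch of loading and inspecting SPA-VL-style preference quadruples.
# The split name and column names ("question", "chosen", "rejected") are
# assumptions inferred from the abstract, not the dataset's documented schema.
from datasets import load_dataset

ds = load_dataset("sqrti/SPA-VL", split="train")
print(ds.column_names)          # check the real schema before relying on it

example = ds[0]
for key in ("question", "chosen", "rejected"):
    if key in example:          # guard against the assumed names being wrong
        print(key, "->", str(example[key])[:80])
```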
LIDAO: Towards Limited Interventions for Debiasing (Large) Language
Models
arXiv:2406.00548v1
Large language models (LLMs) have achieved impressive performance on various
natural language generation tasks. Nonetheless, they suffer from generating
negative and harmful content that is biased against certain demographic
groups (e.g., women), raising severe fairness concerns. As remedies, prior
works intervened in generation by removing attitude or demographic
information, inevitably degrading generation quality and resulting in
notable fairness-fluency trade-offs. However, it remains under-explored to
what extent fluency has to be affected in order to achieve a desired level
of fairness. In this work, we conduct the first formal study from an
information-theoretic perspective. We show that previous approaches are
excessive for debiasing and propose LIDAO, a general framework that provably
debiases a (L)LM with better fluency. We further robustify LIDAO in
adversarial scenarios, where a carefully crafted prompt may stimulate
instruction-following LLMs to generate texts whose fairness issues appear
only when the prompt is also taken into account. Experiments on
three LMs ranging from 0.7B to 7B parameters demonstrate the superiority of our
method.
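One hedged way to write the fairness-fluency trade-off described here in information-theoretic terms (the notation below is assumed for illustration, not necessarily the paper's formulation) is as a constrained problem:

```latex
% Illustrative formalization (notation assumed, not taken from the LIDAO paper):
% A is the demographic group referenced by the text, Y the generated text,
% q the debiased generator and p the original LM. Fairness is a bound on the
% mutual information I_q(A; Y); fluency loss is the divergence from p.
\min_{q} \; D_{\mathrm{KL}}\!\big(q(Y)\,\|\,p(Y)\big)
\quad \text{s.t.} \quad I_{q}(A; Y) \le \epsilon .
```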
arXiv:2309.16487v2
Fair machine learning seeks to mitigate model prediction bias against certain
demographic subgroups such as the elderly and women. Recently, fair representation
learning (FRL) trained by deep neural networks has demonstrated superior
performance, whereby representations containing no demographic information are
inferred from the data and then used as the input to classification or other
downstream tasks. Despite the development of FRL methods, their vulnerability
to data poisoning attacks, a popular protocol for benchmarking model robustness
in adversarial scenarios, remains under-explored. Data poisoning attacks have
been developed for classical fair machine learning methods which incorporate
fairness constraints into shallow-model classifiers. Nonetheless, these attacks
fall short in FRL due to notably different fairness goals and model
architectures. This work proposes the first data poisoning framework attacking
FRL. We induce the model to output unfair representations that contain as much
demographic information as possible by injecting carefully crafted poisoning
samples into the training data. This attack entails a prohibitively expensive
bilevel optimization, for which we propose an effective approximate solution. A
theoretical analysis of the number of poisoning samples required is derived and
sheds light on defending against the attack. Experiments on benchmark fairness
datasets and state-of-the-art fair representation learning models demonstrate
the superiority of our attack.
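For intuition, the bilevel structure mentioned here can be sketched as follows; the specific attacker objective (how "demographic information in the representation" is scored) and the notation are placeholders, not the paper's exact formulation.

```latex
% Illustrative bilevel poisoning sketch (notation and attacker objective are
% placeholders, not the paper's exact formulation). D_p is the poisoned set,
% f_theta the representation encoder, and A(.) measures how well the sensitive
% attribute S can be recovered from the learned representation Z = f_theta(X).
\max_{D_p \,:\, |D_p| \le k} \; A\big(f_{\theta^{*}(D_p)}\big)
\quad \text{s.t.} \quad
\theta^{*}(D_p) \in \arg\min_{\theta} \; \mathcal{L}_{\mathrm{FRL}}\big(\theta;\, D \cup D_p\big).
```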
SimFair: A Unified Framework for Fairness-Aware Multi-Label
Classification
Recent years have witnessed increasing concerns towards unfair decisions made
by machine learning algorithms. To improve fairness in model decisions, various
fairness notions have been proposed and many fairness-aware methods are
developed. However, most existing definitions and methods focus only on
single-label classification. Fairness for multi-label classification, where
each instance is associated with more than one label, has yet to be
established. To fill this gap, we study fairness-aware multi-label classification
in this paper. We start by extending Demographic Parity (DP) and Equalized
Opportunity (EOp), two popular fairness notions, to multi-label classification
scenarios. Through a systematic study, we show that on multi-label data,
because of unevenly distributed labels, EOp usually fails to construct a
reliable estimate on labels with few instances. We then propose a new framework
named Similarity s-induced Fairness (sγ-SimFair). This new framework
utilizes data that have similar labels when estimating fairness on a particular
label group for better stability, and can unify DP and EOp. Theoretical
analysis and experimental results on real-world datasets together demonstrate
the advantage of sγ-SimFair over existing methods on multi-label
classification tasks.
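For concreteness, the per-label extensions of DP and EOp that this line of work starts from can be written as below. This is a natural per-label formalization in assumed notation rather than the paper's exact definitions; the sγ-SimFair estimator itself additionally borrows data with similar labels to stabilize the estimates.

```latex
% Per-label extensions of DP and EOp for multi-label classification
% (a natural formalization in assumed notation; the paper's exact definitions
% may differ). A is the demographic attribute, \hat{Y}_k and Y_k the predicted
% and true indicators for label k.
\text{DP (label } k\text{):} \quad
P\big(\hat{Y}_k = 1 \mid A = a\big) = P\big(\hat{Y}_k = 1 \mid A = a'\big),
\quad \forall a, a' .
\qquad
\text{EOp (label } k\text{):} \quad
P\big(\hat{Y}_k = 1 \mid Y_k = 1, A = a\big) = P\big(\hat{Y}_k = 1 \mid Y_k = 1, A = a'\big),
\quad \forall a, a' .
```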