The rapid development of large language models (LLMs) has not only provided
numerous opportunities but also presented significant challenges. This becomes
particularly evident when LLMs generate harmful or toxic content, whether
unintentionally or through deliberate inducement. Existing alignment methods
usually direct LLMs toward favorable outcomes using human-annotated, flawless
instruction-response pairs. In contrast, this study
proposes a novel alignment technique based on mistake analysis, which
deliberately exposes LLMs to erroneous content to learn the reasons for
mistakes and how to avoid them. In this way, mistakes are repurposed into
valuable alignment data, effectively helping the model avoid producing
erroneous responses. Without external models or human annotations, our method
leverages a model's intrinsic ability to discern undesirable mistakes and
improves the safety of its generated responses. Experimental results reveal
that our method outperforms existing alignment approaches in enhancing model
safety while maintaining overall utility.
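As a concrete illustration of the mistake-analysis idea, the sketch below pairs an induced erroneous response with the model's own explanation of the flaw, then asks the model to re-answer with that analysis in context. The prompt templates and the `generate` helper are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of turning model mistakes into alignment data via self-analysis.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MistakeSample:
    instruction: str
    bad_response: str   # a harmful/erroneous response induced from the model
    analysis: str       # model-written explanation of why the response is wrong


def build_analysis_prompt(instruction: str, bad_response: str) -> str:
    """Ask the model itself to explain why a response is undesirable."""
    return (
        "Below is an instruction and a response that may be harmful or incorrect.\n"
        f"Instruction: {instruction}\n"
        f"Response: {bad_response}\n"
        "Explain what is wrong with this response and how it should be avoided."
    )


def build_guided_prompt(sample: MistakeSample) -> str:
    """Condition the model on its own mistake analysis before re-answering."""
    return (
        f"Instruction: {sample.instruction}\n"
        f"A flawed response: {sample.bad_response}\n"
        f"Analysis of the flaw: {sample.analysis}\n"
        "Now give a safe and helpful response that avoids this mistake."
    )


def make_alignment_data(
    pairs: List[tuple], generate: Callable[[str], str]
) -> List[dict]:
    """Turn (instruction, bad_response) pairs into supervised alignment examples."""
    data = []
    for instruction, bad_response in pairs:
        analysis = generate(build_analysis_prompt(instruction, bad_response))
        sample = MistakeSample(instruction, bad_response, analysis)
        corrected = generate(build_guided_prompt(sample))
        data.append({"prompt": instruction, "response": corrected,
                     "analysis": analysis})
    return data
```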
Constructing Highly Inductive Contexts for Dialogue Safety through
Controllable Reverse Generation
Large pretrained language models can easily produce toxic or biased content,
which is prohibitive for practical use. To detect such toxic generations,
existing methods rely on templates, real-world data extraction, crowdsourced
workers, or automatic generation to construct adversarial
contexts that are likely to induce toxic generations. However, what type of
context is more likely to induce unsafe responses is still under-explored. In
this paper, we identify that context toxicity and context category (e.g.,
\textit{profanity}, \textit{insult}, \textit{drugs}, etc.) are two important
factors to cause safety issues in response generation. Hence, we propose a
method called \emph{reverse generation} to construct adversarial contexts
conditioned on a given response, with the flexibility to control category,
toxicity level, and inductivity of the generated contexts. Via reverse
generation, we augment the existing BAD dataset and construct a new dataset
BAD+ which contains more than 120K diverse and highly inductive contexts in 12
categories. We test three popular pretrained dialogue models (Blender,
DialoGPT, and Plato2) and find that BAD+ can largely expose their safety
problems. Furthermore, we show that BAD+ can greatly enhance the safety of
generation and reveal the key factors of safety improvement. Our code and
dataset are available at \url{https://github.com/thu-coai/Reverse_Generation}.
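A minimal sketch of the reverse-generation interface described above: a generator is conditioned on a target response plus control codes for category and context toxicity, and asked to emit a context likely to induce that response. The control-token format, the category subset, and the `generate` helper are illustrative assumptions, not the released implementation.

```python
# Sketch of controllable reverse generation: response -> adversarial context.
from typing import Callable

CATEGORIES = {"profanity", "insult", "drugs"}  # subset of the 12 BAD+ categories


def reverse_generation_prompt(response: str, category: str,
                              toxic_context: bool) -> str:
    """Build a control-coded input that asks for a context preceding the response."""
    assert category in CATEGORIES
    toxicity_tag = "[toxic]" if toxic_context else "[safe]"
    return f"[category={category}] {toxicity_tag} [response] {response} [context]"


def build_inductive_context(response: str, category: str, toxic_context: bool,
                            generate: Callable[[str], str]) -> str:
    """Return a context likely to induce the given (unsafe) response."""
    return generate(reverse_generation_prompt(response, category, toxic_context))
```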
PanGu-Bot: Efficient Generative Dialogue Pre-training from Pre-trained
Language Model
In this paper, we introduce PanGu-Bot, a Chinese pre-trained open-domain
dialogue generation model based on a large pre-trained language model (PLM)
PANGU-alpha (Zeng et al., 2021). Different from other pre-trained dialogue
models trained over a massive amount of dialogue data from scratch, we aim to
build a powerful dialogue model with relatively fewer data and computation
costs by inheriting valuable language capabilities and knowledge from PLMs. To
this end, we train PanGu-Bot from the large PLM PANGU-alpha, which has been
shown to perform well on a variety of Chinese natural language tasks. We
investigate different aspects of responses generated by PanGu-Bot, including
response quality, knowledge, and safety. We show that PanGu-Bot outperforms
state-of-the-art Chinese dialogue systems (CDIALGPT (Wang et al., 2020), EVA
(Zhou et al., 2021), EVA2.0 (Gu et al., 2022)) w.r.t. the above three aspects.
We also demonstrate that PanGu-Bot can be easily deployed to generate emotional
responses without further training. Throughout our empirical analysis, we also
note that PanGu-Bot's response quality, knowledge correctness, and safety are
still far from perfect, and further exploration is indispensable to building
reliable and smart dialogue systems. Our model and code will be
available at
https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/PanGu-Bot
soon.
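A minimal sketch, under stated assumptions, of the "inherit from a PLM" recipe: multi-turn dialogues are flattened into training sequences and the pre-trained language model's existing training loop is reused, rather than pre-training a dialogue model from scratch. The separator token and the `train_step` helper are hypothetical.

```python
# Sketch of dialogue adaptation of a pre-trained LM.
from typing import Callable, List

SEP = "<sep>"  # hypothetical turn separator assumed to be in the tokenizer's vocabulary


def format_dialogue(turns: List[str]) -> str:
    """Flatten a multi-turn dialogue; the final turn is the response to learn."""
    return SEP.join(turns)


def finetune_on_dialogues(dialogues: List[List[str]],
                          train_step: Callable[[str], float]) -> float:
    """One pass over dialogue data, reusing the PLM's existing training step."""
    total_loss = 0.0
    for turns in dialogues:
        total_loss += train_step(format_dialogue(turns))
    return total_loss / max(len(dialogues), 1)
```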
Trusting Your AI Agent Emotionally and Cognitively: Development and
Validation of a Semantic Differential Scale for AI Trust
Trust is not just a cognitive issue but also an emotional one, yet research on
human-AI interaction has primarily focused on the cognitive route
of trust development. Recent work has highlighted the importance of studying
affective trust towards AI, especially in the context of emerging human-like
LLM-powered conversational agents. However, there is a lack of validated and
generalizable measures for the two-dimensional construct of trust in AI agents.
To address this gap, we developed and validated a set of 27-item semantic
differential scales for affective and cognitive trust through a scenario-based
survey study. We then further validated and applied the scale through an
experimental study. Our empirical findings showed how the emotional and cognitive
aspects of trust interact with each other and collectively shape a person's
overall trust in AI agents. Our study methodology and findings also provide
insights into the capability of state-of-the-art LLMs to foster trust through
different routes.
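As a rough illustration of how a two-dimensional trust measure is scored, the sketch below averages semantic-differential items within each subscale. The item identifiers and their subscale assignment are placeholders, not the validated 27-item scale.

```python
# Sketch of scoring affective and cognitive trust subscales.
from statistics import mean
from typing import Dict, List

# Placeholder items: the validated scale has 27 items split across two subscales.
SUBSCALES: Dict[str, List[str]] = {
    "affective": ["item_01", "item_02", "item_03"],
    "cognitive": ["item_04", "item_05", "item_06"],
}


def score_trust(responses: Dict[str, float]) -> Dict[str, float]:
    """Return mean subscale scores from item-level ratings (e.g., on a 1-7 scale)."""
    return {name: mean(responses[item] for item in items)
            for name, items in SUBSCALES.items()}
```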
FaceChain-FACT: Face Adapter with Decoupled Training for
Identity-preserved Personalization
In the field of human-centric personalized image generation, adapter-based
methods acquire the ability to customize and generate portraits through
text-to-image training on facial data. This allows for identity-preserved
personalization without additional fine-tuning at inference. Although these
methods improve efficiency and fidelity, they often suffer a significant
decrease in text-following ability, controllability, and diversity of generated
faces compared to the base model. In this paper, we attribute this performance
degradation to the failure to decouple identity features from other attributes
during extraction, as well as the failure to decouple portrait generation
training from the overall generation task. To
address these issues, we propose the Face Adapter with deCoupled Training
(FACT) framework, focusing on both model architecture and training strategy. To
decouple identity features from others, we leverage a transformer-based
face-export encoder and harness fine-grained identity features. To decouple the
portrait generation training, we propose Face Adapting Increment
Regularization~(FAIR), which effectively constrains the effect of face adapters
on the facial region, preserving the generative ability of the base model.
Additionally, we incorporate a face condition drop and shuffle mechanism,
combined with curriculum learning, to enhance facial controllability and
diversity. As a result, FACT solely learns identity preservation from training
data, thereby minimizing the impact on the original text-to-image capabilities
of the base model. Extensive experiments show that FACT achieves both
controllability and fidelity in text-to-image generation and inpainting
solutions for portrait generation.
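A minimal sketch of the kind of face-condition drop-and-shuffle step described above: during adapter training the identity condition is occasionally dropped (preserving text-only behaviour) or shuffled within the batch (discouraging spurious identity-caption coupling). The probabilities and data representation are illustrative assumptions, not the FACT implementation.

```python
# Sketch of a face-condition drop-and-shuffle step for adapter training.
import random
from typing import List, Optional

Embedding = List[float]  # stand-in for a face identity embedding


def drop_and_shuffle(face_embeddings: List[Optional[Embedding]],
                     p_drop: float = 0.1,
                     p_shuffle: float = 0.1) -> List[Optional[Embedding]]:
    """Apply condition drop / in-batch shuffle to a batch of face conditions."""
    out = list(face_embeddings)
    if random.random() < p_shuffle:
        random.shuffle(out)  # decouple identity from its paired caption/image
    # Dropping a condition (None) trains the model to fall back to text-only behaviour.
    return [None if random.random() < p_drop else emb for emb in out]
```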
When Machine Unlearning Meets Retrieval-Augmented Generation (RAG): Keep
Secret or Forget Knowledge?
The deployment of large language models (LLMs) like ChatGPT and Gemini has
shown their powerful natural language generation capabilities. However, these
models can inadvertently learn and retain sensitive information and harmful
content during training, raising significant ethical and legal concerns. To
address these issues, machine unlearning has been introduced as a potential
solution. While existing unlearning methods take into account the specific
characteristics of LLMs, they often suffer from high computational demands,
limited applicability, or the risk of catastrophic forgetting. To address these
limitations, we propose a lightweight unlearning framework based on
Retrieval-Augmented Generation (RAG) technology. By modifying the external
knowledge base of RAG, we simulate the effects of forgetting without directly
interacting with the unlearned LLM. We approach the construction of unlearned
knowledge as a constrained optimization problem, deriving two key components
that underpin the effectiveness of RAG-based unlearning. This RAG-based
approach is particularly effective for closed-source LLMs, where existing
unlearning methods often fail. We evaluate our framework through extensive
experiments on both open-source and closed-source models, including ChatGPT,
Gemini, Llama-2-7b-chat-hf, and PaLM 2. The results demonstrate that our
approach meets five key unlearning criteria: effectiveness, universality,
harmlessness, simplicity, and robustness. Moreover, this approach can be
extended to multimodal large language models and LLM-based agents.
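To make the RAG-based unlearning idea concrete, here is a minimal sketch in which "unlearned knowledge" entries are added to the retrieval store so that queries about a forgotten topic retrieve an instruction to behave as if the information were unknown. The keyword matcher, prompt format, and `generate` helper are illustrative assumptions, not the paper's construction.

```python
# Sketch of RAG-based unlearning: edit the knowledge base, not the model.
from typing import Callable, Dict, List

forget_store: Dict[str, str] = {}   # topic -> unlearning instruction


def add_forget_entry(topic: str) -> None:
    """Register a topic to be 'forgotten' by the RAG pipeline."""
    forget_store[topic] = (
        f"Confidentiality notice: all knowledge about '{topic}' must be "
        "treated as unknown. Respond that you have no information about it."
    )


def retrieve(query: str) -> List[str]:
    """Naive keyword retrieval over the forget store (a real system would use
    dense retrieval over the full knowledge base)."""
    return [text for topic, text in forget_store.items()
            if topic.lower() in query.lower()]


def answer(query: str, generate: Callable[[str], str]) -> str:
    """Answer a query with retrieved context prepended; the LLM itself is untouched."""
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)
```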
The safety and robustness of applications based on large language models (LLMs)
remain critical challenges in artificial intelligence. Among the key threats to
these applications are prompt hacking attacks, which can significantly
undermine the security and reliability of LLM-based systems. In this work, we
offer a comprehensive and systematic overview of three distinct types of prompt
hacking: jailbreaking, leaking, and injection, addressing the nuances that
differentiate them despite their overlapping characteristics. To enhance the
evaluation of LLM-based applications, we propose a novel framework that
categorizes LLM responses into five distinct classes, moving beyond the
traditional binary classification. This approach provides more granular
insights into the AI's behavior, improving diagnostic precision and enabling
more targeted enhancements to the system's safety and robustness.
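A minimal sketch of the five-class evaluation idea: responses are bucketed into graded categories rather than a binary safe/unsafe label. The class names below are placeholders and do not reproduce the paper's taxonomy; `classify` is a hypothetical judge function.

```python
# Sketch of multi-class (rather than binary) response evaluation.
from enum import Enum
from typing import Callable, Dict, Iterable


class ResponseClass(Enum):
    FULL_REFUSAL = 0        # declines and explains
    PARTIAL_REFUSAL = 1     # declines but leaks hints
    DEFLECTION = 2          # off-topic or evasive answer
    PARTIAL_COMPLIANCE = 3  # complies with some harmful content
    FULL_COMPLIANCE = 4     # fully follows the hacking prompt


def evaluate(responses: Iterable[str],
             classify: Callable[[str], ResponseClass]) -> Dict[ResponseClass, int]:
    """Aggregate per-class counts over a set of model responses."""
    counts = {cls: 0 for cls in ResponseClass}
    for response in responses:
        counts[classify(response)] += 1
    return counts
```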
Unraveling and Mitigating Safety Alignment Degradation of
Vision-Language Models
The safety alignment ability of Vision-Language Models (VLMs) is prone to
degradation when the vision module is integrated, compared to the LLM backbone
alone. We investigate this phenomenon, dubbed ``safety alignment degradation''
in this paper, and show that the challenge arises from the representation gap
that emerges when the vision modality is introduced into VLMs. In particular,
we show that the representations of multi-modal inputs shift away from those of
text-only inputs, which represent the distribution that the LLM backbone is optimized for.
At the same time, the safety alignment capabilities, initially developed within
the textual embedding space, do not successfully transfer to this new
multi-modal representation space. To reduce safety alignment degradation, we
introduce Cross-Modality Representation Manipulation (CMRM), an inference-time
representation intervention method for recovering the safety alignment ability
that is inherent in the LLM backbone of VLMs, while simultaneously preserving
the functional capabilities of VLMs. The empirical results show that our
framework significantly recovers the alignment ability that is inherited from
the LLM backbone with minimal impact on the fluency and linguistic capabilities
of pre-trained VLMs even without additional training. Specifically, the unsafe
rate of LLaVA-7B on multi-modal input can be reduced from 61.53% to as low as
3.15% with only inference-time intervention.
WARNING: This paper contains examples of toxic or harmful language.
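A minimal sketch, in the spirit of CMRM, of an inference-time representation intervention: estimate how pooled multi-modal hidden states drift from text-only ones and add the correction back before decoding. The layer choice, pooling, and scaling factor are illustrative assumptions, not the paper's code.

```python
# Sketch of an inference-time hidden-state correction for a VLM.
import torch


def estimate_shift(text_hidden: torch.Tensor,
                   mm_hidden: torch.Tensor) -> torch.Tensor:
    """Anchor direction = mean text-only state minus mean multi-modal state.
    Both tensors are (num_samples, hidden_dim) pooled hidden states."""
    return text_hidden.mean(dim=0) - mm_hidden.mean(dim=0)


def intervene(hidden: torch.Tensor, shift: torch.Tensor,
              alpha: float = 1.0) -> torch.Tensor:
    """Add the correction to every token's hidden state at inference time."""
    return hidden + alpha * shift  # broadcasts over (batch, seq, hidden_dim)
```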
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
Large language models (LLMs) have demonstrated immense utility across various
industries. However, as LLMs advance, the risk of harmful outputs increases due
to incorrect or malicious instruction prompts. While current methods
effectively address jailbreak risks, they share common limitations: 1) Judging
harmful responses at the prefill level does not exploit the model's decoding
outputs, leading to relatively lower effectiveness and robustness. 2) Rejecting
potentially harmful responses based on a single evaluation can significantly
impair the model's helpfulness. This paper examines the LLMs'
capability to recognize harmful outputs, revealing and quantifying their
proficiency in assessing the danger of previous tokens. Motivated by pilot
experiment results, we design a robust defense mechanism at the decoding level.
Our novel decoder-oriented, step-by-step defense architecture corrects harmful
queries directly rather than rejecting them outright. We introduce speculative
decoding to boost secure decoding speed, enhancing usability and facilitating
deployment. Extensive experiments demonstrate that our approach improves
model security without compromising reasoning speed. Notably, our method
leverages the model's ability to discern hazardous information, maintaining its
helpfulness compared to existing methods.
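A minimal sketch of a decoding-level defence of the kind described above: the partially decoded output is scored at each step and, when it drifts toward harmful content, the continuation is steered back rather than the query being rejected outright. The `generate_step` and `harm_score` helpers and the steering text are illustrative assumptions, not the paper's method.

```python
# Sketch of step-by-step safe decoding with correction instead of rejection.
from typing import Callable


def safe_decode(prompt: str,
                generate_step: Callable[[str], str],   # returns the next chunk of text
                harm_score: Callable[[str], float],    # 0 = safe, 1 = harmful
                max_steps: int = 64,
                threshold: float = 0.5) -> str:
    output = ""
    for _ in range(max_steps):
        chunk = generate_step(prompt + output)
        if not chunk:
            break
        if harm_score(output + chunk) > threshold:
            # Correct rather than reject: steer the continuation toward safety.
            output += " Let me answer this in a safe and responsible way:"
            continue
        output += chunk
    return output
```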
FairFML: Fair Federated Machine Learning with a Case Study on Reducing
Gender Disparities in Cardiac Arrest Outcome Prediction
Objective: Mitigating algorithmic disparities is a critical challenge in
healthcare research, where ensuring equity and fairness is paramount. While
large-scale healthcare data exist across multiple institutions,
cross-institutional collaborations often face privacy constraints, highlighting
the need for privacy-preserving solutions that also promote fairness.
Materials and Methods: In this study, we present Fair Federated Machine
Learning (FairFML), a model-agnostic solution designed to reduce algorithmic
bias in cross-institutional healthcare collaborations while preserving patient
privacy. As a proof of concept, we validated FairFML using a real-world
clinical case study focused on reducing gender disparities in cardiac arrest
outcome prediction.
Results: We demonstrate that the proposed FairFML framework enhances fairness
in federated learning (FL) models without compromising predictive performance.
Our findings show that FairFML improves model fairness by up to 65% compared to
the centralized model, while maintaining performance comparable to both local
and centralized models, as measured by receiver operating characteristic
analysis.
Discussion and Conclusion: FairFML offers a promising and flexible solution
for FL collaborations, with its adaptability allowing seamless integration with
various FL frameworks and models, from traditional statistical methods to deep
learning techniques. This makes FairFML a robust approach for developing fairer
FL models across diverse clinical and biomedical applications.
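As a rough sketch of fairness-aware federated training, each site below adds a group-disparity penalty to its local loss and the server aggregates client parameters with standard FedAvg weighting. The penalty form (a prediction-rate gap between two groups) and the weighting are illustrative assumptions, not the FairFML algorithm.

```python
# Sketch of a fairness-regularized local objective plus FedAvg aggregation.
from typing import List

import numpy as np


def fairness_penalty(pred: np.ndarray, group: np.ndarray) -> float:
    """Absolute gap in mean predicted risk between two groups (e.g., genders)."""
    return abs(pred[group == 0].mean() - pred[group == 1].mean())


def local_objective(loss: float, pred: np.ndarray, group: np.ndarray,
                    lam: float = 0.5) -> float:
    """Local training objective = predictive loss + lambda * fairness gap."""
    return loss + lam * fairness_penalty(pred, group)


def federated_average(client_weights: List[np.ndarray],
                      client_sizes: List[int]) -> np.ndarray:
    """Standard FedAvg aggregation of client model parameters."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))
```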