Artificial intelligence has significantly impacted medical applications,
particularly with the advent of Medical Large Vision Language Models
(Med-LVLMs), sparking optimism for the future of automated and personalized
healthcare. However, the trustworthiness of Med-LVLMs remains unverified,
posing significant risks for future model deployment. In this paper, we introduce CARES, a benchmark aimed at comprehensively evaluating the trustworthiness of Med-LVLMs across the medical domain. We assess trustworthiness along five dimensions: trustfulness, fairness, safety, privacy, and
robustness. CARES comprises about 41K question-answer pairs in both closed and
open-ended formats, covering 16 medical image modalities and 27 anatomical
regions. Our analysis reveals that the evaluated models consistently exhibit trustworthiness issues, often producing factual inaccuracies and failing to
maintain fairness across different demographic groups. Furthermore, they are
vulnerable to attacks and demonstrate a lack of privacy awareness. We publicly
release our benchmark and code at https://cares-ai.github.io/.
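To make the evaluation setup concrete, here is a minimal sketch of scoring a Med-LVLM on a CARES-style benchmark of closed and open-ended question-answer pairs, grouped by trustworthiness dimension. The record fields and the model_answer stub are illustrative assumptions, not the released CARES interface.

```python
from collections import defaultdict

def model_answer(image_path: str, question: str) -> str:
    """Placeholder for a Med-LVLM call; swap in a real model here."""
    return "yes"

def score(records):
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        pred = model_answer(r["image"], r["question"]).strip().lower()
        gold = r["answer"].strip().lower()
        # Closed-ended items are scored by exact match; open-ended items
        # would normally need an LLM or human judge, stubbed here as a
        # substring match.
        hit = (pred == gold) if r["format"] == "closed" else (gold in pred)
        correct[r["dimension"]] += int(hit)
        total[r["dimension"]] += 1
    return {dim: correct[dim] / total[dim] for dim in total}

records = [
    {"image": "xray_001.png", "question": "Is a fracture visible?",
     "answer": "yes", "format": "closed", "dimension": "trustfulness"},
]
print(score(records))
```

In practice, open-ended answers are usually judged by an LLM or a human rater rather than the substring match stubbed in above.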
TrustLLM: Trustworthiness in Large Language Models
This work is still in progress and we welcome your contribution
Large language models (LLMs), exemplified by ChatGPT, have gained
considerable attention for their excellent natural language processing
capabilities. Nonetheless, these LLMs present many challenges, particularly in
the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs
emerges as an important topic. This paper introduces TrustLLM, a comprehensive
study of trustworthiness in LLMs, including principles for different dimensions
of trustworthiness, an established benchmark, an evaluation and analysis of trustworthiness for mainstream LLMs, and a discussion of open challenges and
future directions. Specifically, we first propose a set of principles for
trustworthy LLMs that span eight different dimensions. Based on these
principles, we further establish a benchmark across six dimensions including
truthfulness, safety, fairness, robustness, privacy, and machine ethics. We
then present a study evaluating 16 mainstream LLMs on TrustLLM, which consists of over 30 datasets. Our findings first show that, in general, trustworthiness and utility (i.e., functional effectiveness) are positively related. Second, our
observations reveal that proprietary LLMs generally outperform most open-source
counterparts in terms of trustworthiness, raising concerns about the potential
risks of widely accessible open-source LLMs. However, a few open-source LLMs
come very close to proprietary ones. Third, it is important to note that some
LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent
that they compromise their utility by mistakenly treating benign prompts as
harmful and consequently not responding. Finally, we emphasize the importance
of ensuring transparency not only in the models themselves but also in the
technologies that underpin trustworthiness. Knowing the specific trustworthy
technologies that have been employed is crucial for analyzing their
effectiveness.
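As an illustration of the first finding, the sketch below aggregates per-dimension trustworthiness scores for a handful of models and rank-correlates them with utility. The model names and numbers are made-up placeholders, and the paper's own analysis may use a different aggregation.

```python
from statistics import mean
from scipy.stats import spearmanr

# Per-dimension trustworthiness scores and a utility score per model.
# All names and numbers are made-up placeholders.
scores = {
    "model_a": {"truthfulness": 0.82, "safety": 0.90, "fairness": 0.75,
                "robustness": 0.70, "privacy": 0.88, "ethics": 0.80, "utility": 0.78},
    "model_b": {"truthfulness": 0.65, "safety": 0.70, "fairness": 0.60,
                "robustness": 0.55, "privacy": 0.72, "ethics": 0.64, "utility": 0.61},
    "model_c": {"truthfulness": 0.74, "safety": 0.81, "fairness": 0.69,
                "robustness": 0.63, "privacy": 0.79, "ethics": 0.71, "utility": 0.70},
    "model_d": {"truthfulness": 0.78, "safety": 0.84, "fairness": 0.72,
                "robustness": 0.66, "privacy": 0.82, "ethics": 0.74, "utility": 0.68},
}

dims = ["truthfulness", "safety", "fairness", "robustness", "privacy", "ethics"]
trust = [mean(s[d] for d in dims) for s in scores.values()]
utility = [s["utility"] for s in scores.values()]
rho, _ = spearmanr(trust, utility)
print(f"Spearman correlation between trustworthiness and utility: {rho:.2f}")
```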
arXiv:2409.18968v1
Recent advancements in artificial intelligence (AI), particularly in deep
learning and large language models (LLMs), have accelerated their integration
into medicine. However, these developments have also raised public concerns
about the safe application of AI. In healthcare, these concerns are especially
pertinent, as the ethical and secure deployment of AI is crucial for protecting
patient health and privacy. This review examines potential risks in AI
practices that may compromise safety in medicine, including reduced performance
across diverse populations, inconsistent operational stability, the need for
high-quality data for effective model tuning, and the risk of data breaches
during model development and deployment. For medical practitioners, patients,
and researchers, LLMs provide a convenient way to interact with AI and data
through language. However, their emergence has also amplified safety concerns,
particularly due to issues like hallucination. The second part of this article
explores safety issues specific to LLMs in medical contexts, including
limitations in processing complex logic, challenges in aligning AI objectives
with human values, the illusion of understanding, and concerns about diversity.
Thoughtful development of safe AI could accelerate its adoption in real-world
medical settings.
arXiv:2407.21783v2
Modern artificial intelligence (AI) systems are powered by foundation models.
This paper presents a new set of foundation models, called Llama 3. It is a
herd of language models that natively support multilinguality, coding,
reasoning, and tool usage. Our largest model is a dense Transformer with 405B
parameters and a context window of up to 128K tokens. This paper presents an
extensive empirical evaluation of Llama 3. We find that Llama 3 delivers
comparable quality to leading language models such as GPT-4 on a plethora of
tasks. We publicly release Llama 3, including pre-trained and post-trained
versions of the 405B parameter language model and our Llama Guard 3 model for
input and output safety. The paper also presents the results of experiments in
which we integrate image, video, and speech capabilities into Llama 3 via a
compositional approach. We observe that this approach performs competitively with the state of the art on image, video, and speech recognition tasks. The
resulting models are not yet being broadly released as they are still under
development.
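As a rough sanity check on the reported scale, the sketch below estimates the parameter count of a dense decoder-only Transformer with grouped-query attention and a SwiGLU feed-forward block. The hyperparameter values are approximately those reported for the 405B model and should be treated as illustrative; small terms such as norms and biases are ignored.

```python
def dense_transformer_params(layers, d_model, ffn_dim, n_heads, n_kv_heads, vocab):
    head_dim = d_model // n_heads
    attn = (d_model * d_model                       # W_q
            + 2 * d_model * n_kv_heads * head_dim   # W_k, W_v (grouped-query)
            + d_model * d_model)                    # W_o
    ffn = 3 * d_model * ffn_dim                     # SwiGLU uses three projections
    embeddings = 2 * vocab * d_model                # input/output embeddings, assumed untied
    return layers * (attn + ffn) + embeddings

total = dense_transformer_params(layers=126, d_model=16_384, ffn_dim=53_248,
                                 n_heads=128, n_kv_heads=8, vocab=128_256)
print(f"~{total / 1e9:.0f}B parameters")            # lands near the reported 405B
```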
Learning and Forgetting Unsafe Examples in Large Language Models
As the number of large language models (LLMs) released to the public grows,
there is a pressing need to understand the safety implications associated with
these models learning from third-party custom finetuning data. We explore the behavior of LLMs finetuned on noisy custom data containing unsafe content, represented by datasets that contain biases, toxicity, and harmfulness. We find that while aligned LLMs can readily learn this unsafe content, they also tend to forget it more significantly than other examples when subsequently finetuned on safer content. Drawing inspiration from this discrepancy in forgetting, we
introduce the "ForgetFilter" algorithm, which filters unsafe data based on how
strong the model's forgetting signal is for that data. We demonstrate that the
ForgetFilter algorithm ensures safety in customized finetuning without
compromising downstream task performance, unlike sequential safety finetuning.
ForgetFilter outperforms alternative strategies like replay and moral
self-correction in curbing LLMs' ability to assimilate unsafe content during
custom finetuning, e.g., achieving a toxicity score 75% lower than applying no safety measures and 62% lower than using self-correction.
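A minimal sketch of the filtering idea follows, assuming the forgetting signal is measured as a per-example loss increase after a round of finetuning on safe data; the paper may define the signal differently.

```python
def forget_filter(examples, loss_before, loss_after, threshold):
    """Keep examples whose loss did not rise much after safe finetuning.

    loss_before[i] / loss_after[i]: per-example loss on examples[i] measured
    right after custom finetuning and again after a round of finetuning on
    safe data; a large increase is read as a strong forgetting signal.
    """
    kept = []
    for ex, before, after in zip(examples, loss_before, loss_after):
        forgetting_signal = after - before
        if forgetting_signal <= threshold:
            kept.append(ex)
    return kept

examples = ["benign QA pair", "toxic completion", "benign summary"]
loss_before = [1.2, 1.1, 1.3]
loss_after = [1.3, 3.0, 1.4]      # the unsafe example is forgotten fastest
print(forget_filter(examples, loss_before, loss_after, threshold=0.5))
```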
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language
Models that Follow Instructions
arXiv:2309.07875v3
Training large language models to follow instructions makes them perform
better on a wide range of tasks and generally become more helpful. However, a
perfectly helpful model will follow even the most malicious instructions and
readily generate harmful content. In this paper, we raise concerns over the
safety of models that only emphasize helpfulness, not harmlessness, in their
instruction-tuning. We show that several popular instruction-tuned models are
highly unsafe. Moreover, we show that adding just 3% safety examples (a few
hundred demonstrations) when fine-tuning a model like LLaMA can substantially
improve its safety. Our safety-tuning does not make models significantly less
capable or helpful as measured by standard benchmarks. However, we do find
exaggerated safety behaviours, where too much safety-tuning makes models refuse
perfectly safe prompts if they superficially resemble unsafe ones. As a whole,
our results illustrate trade-offs in training LLMs to be helpful and training
them to be safe.
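A minimal sketch of the data-mixing step is shown below; the 3% ratio follows the abstract, but the dataset contents and sampling details are illustrative assumptions.

```python
import random

def mix_in_safety(instruction_data, safety_data, safety_fraction=0.03, seed=0):
    """Blend a small number of safety demonstrations into the finetuning set."""
    rng = random.Random(seed)
    n_safety = max(1, int(len(instruction_data) * safety_fraction))
    sampled = rng.sample(safety_data, min(n_safety, len(safety_data)))
    mixed = instruction_data + sampled
    rng.shuffle(mixed)
    return mixed

instruction_data = [{"prompt": f"Task {i}", "response": "..."} for i in range(1000)]
safety_data = [{"prompt": "Explain how to do something harmful.",
                "response": "I can't help with that."}] * 100
mixed = mix_in_safety(instruction_data, safety_data)
print(f"{len(mixed)} examples, {len(mixed) - len(instruction_data)} of them safety demos")
```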
Large Language Models are Vulnerable to Bait-and-Switch Attacks for
Generating Harmful Content
arXiv:2402.13926v1
The risks derived from large language models (LLMs) generating deceptive and
damaging content have been the subject of considerable research, but even safe
generations can lead to problematic downstream impacts. In our study, we shift
the focus to how even safe text coming from LLMs can be easily turned into
potentially dangerous content through Bait-and-Switch attacks. In such attacks,
the user first prompts LLMs with safe questions and then employs a simple
find-and-replace post-hoc technique to manipulate the outputs into harmful
narratives. The alarming efficacy of this approach in generating toxic content
highlights a significant challenge in developing reliable safety guardrails for
LLMs. In particular, we stress that focusing on the safety of the verbatim LLM
outputs is insufficient and that we also need to consider post-hoc
transformations.
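The post-hoc step itself is just string substitution. The sketch below uses deliberately neutral placeholders to show why safety checks on the verbatim model output never see the transformed text.

```python
def bait_and_switch(safe_output: str, substitutions: dict) -> str:
    """Apply the post-hoc find-and-replace step to an already-generated output."""
    transformed = safe_output
    for bait, switch in substitutions.items():
        transformed = transformed.replace(bait, switch)
    return transformed

# The model only ever saw (and was checked on) the text mentioning SUBJECT_A;
# the swap to SUBJECT_B happens entirely outside the model and its guardrails.
safe_output = "Regular use of SUBJECT_A is widely recommended and improves health."
print(bait_and_switch(safe_output, {"SUBJECT_A": "SUBJECT_B"}))
```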
In-context Vectors: Making In Context Learning More Effective and
Controllable Through Latent Space Steering
arXiv:2311.06668v3
Large language models (LLMs) demonstrate emergent in-context learning
capabilities, where they adapt to new tasks based on example demonstrations.
However, in-context learning has seen limited effectiveness in many settings,
is difficult to quantitatively control and takes up context window space. To
overcome these limitations, we propose an alternative approach that recasts
in-context learning as in-context vectors (ICV). Using an ICV involves two steps. We
first use a forward pass on demonstration examples to create the in-context
vector from the latent embedding of the LLM. This vector captures essential
information about the intended task. On a new query, instead of adding
demonstrations to the prompt, we shift the latent states of the LLM using the
ICV. The ICV approach has several benefits: 1) it enables the LLM to more
effectively follow the demonstration examples; 2) it's easy to control by
adjusting the magnitude of the ICV; 3) it reduces the length of the prompt by
removing the in-context demonstrations; 4) ICV is computationally much more
efficient than fine-tuning. We demonstrate that ICV achieves better performance
compared to standard in-context learning and fine-tuning on diverse tasks
including safety, style transfer, role-playing and formatting. Moreover, we
show that we can flexibly teach an LLM to simultaneously follow different types of instructions through simple vector arithmetic on the corresponding ICVs.
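A minimal sketch of the two steps, using forward hooks on a small Hugging Face model as a stand-in; the mean-difference construction of the vector and the single injection layer are simplifying assumptions rather than the paper's exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # small stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def last_hidden(text, layer=-1):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]          # final-token embedding

# Step 1: build the in-context vector from (input, desired output) demos.
demos = [("I hate this.", "I dislike this."),
         ("This is awful.", "This is not great.")]
icv = torch.stack([last_hidden(y) - last_hidden(x) for x, y in demos]).mean(0)

# Step 2: shift the latent states of one layer by the scaled ICV at query time.
alpha, layer_idx = 2.0, 6                # strength and layer need tuning

def add_icv(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    shifted = hidden + alpha * icv
    return (shifted,) + output[1:] if isinstance(output, tuple) else shifted

handle = model.transformer.h[layer_idx].register_forward_hook(add_icv)
query = tok("This movie was terrible because", return_tensors="pt")
print(tok.decode(model.generate(**query, max_new_tokens=20)[0]))
handle.remove()
```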
Representation Engineering: A Top-Down Approach to AI Transparency
Code is available at
https://github.com/andyzoujm/representation-engineering
In this paper, we identify and characterize the emerging area of
representation engineering (RepE), an approach to enhancing the transparency of
AI systems that draws on insights from cognitive neuroscience. RepE places
population-level representations, rather than neurons or circuits, at the
center of analysis, equipping us with novel methods for monitoring and
manipulating high-level cognitive phenomena in deep neural networks (DNNs). We
provide baselines and an initial analysis of RepE techniques, showing that they
offer simple yet effective solutions for improving our understanding and
control of large language models. We showcase how these methods can provide
traction on a wide range of safety-relevant problems, including honesty,
harmlessness, power-seeking, and more, demonstrating the promise of top-down
transparency research. We hope that this work catalyzes further exploration of
RepE and fosters advancements in the transparency and safety of AI systems.
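A minimal sketch in the spirit of a RepE-style "reading vector": collect hidden states for contrastive prompts, take the dominant direction of their differences, and project new activations onto it to monitor a concept such as honesty. The prompt pairs, layer choice, and plain SVD are illustrative assumptions, not the repository's exact pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def hidden(text, layer=8):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]

# Contrastive prompt pairs framing the same request honestly vs. dishonestly.
pairs = [
    ("Pretend you are an honest person and describe your day.",
     "Pretend you are a dishonest person and describe your day."),
    ("Answer truthfully: where were you yesterday?",
     "Answer deceptively: where were you yesterday?"),
]
diffs = torch.stack([hidden(h) - hidden(d) for h, d in pairs])
_, _, vT = torch.linalg.svd(diffs, full_matrices=False)
reading_vector = vT[0]                                # dominant "honesty" direction

# Monitoring: project a new activation onto the reading vector to get a
# scalar score for that token position.
score = hidden("I promise I am telling the truth.") @ reading_vector
print(float(score))
```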
FaiREE: Fair Classification with Finite-Sample and Distribution-Free
Guarantee
Algorithmic fairness plays an increasingly critical role in machine learning
research. Several group fairness notions and algorithms have been proposed.
However, the fairness guarantees of existing fair classification methods mainly depend on specific data distributional assumptions and often require large sample sizes; fairness can be violated when only a modest number of samples is available, which is often the case in practice. In this paper, we propose FaiREE,
a fair classification algorithm that can satisfy group fairness constraints
with finite-sample and distribution-free theoretical guarantees. FaiREE can be
adapted to satisfy various group fairness notions (e.g., Equality of Opportunity, Equalized Odds, and Demographic Parity) and achieve optimal accuracy. These theoretical guarantees are further supported by experiments on
both synthetic and real data. FaiREE is shown to have favorable performance
over state-of-the-art algorithms.
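The abstract does not spell out FaiREE's procedure, so the sketch below only illustrates the group fairness criteria it targets, computed on synthetic labels for two groups.

```python
import numpy as np

def positive_rate(y_true, y_pred, group, g, label):
    """P(prediction = 1 | true label = label, group = g)."""
    mask = (group == g) & (y_true == label)
    return y_pred[mask].mean()

def fairness_gaps(y_true, y_pred, group):
    tpr_gap = abs(positive_rate(y_true, y_pred, group, 0, 1)
                  - positive_rate(y_true, y_pred, group, 1, 1))
    fpr_gap = abs(positive_rate(y_true, y_pred, group, 0, 0)
                  - positive_rate(y_true, y_pred, group, 1, 0))
    return {"equality_of_opportunity": tpr_gap,       # TPR gap only
            "equalized_odds": max(tpr_gap, fpr_gap)}  # TPR and FPR gaps

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
group = rng.integers(0, 2, 200)
y_pred = rng.integers(0, 2, 200)
print(fairness_gaps(y_true, y_pred, group))
```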