The unprecedented performance of large language models (LLMs) necessitates
improvements in evaluations. Rather than merely exploring the breadth of LLM
abilities, we believe meticulous and thoughtful designs are essential to
thorough, unbiased, and applicable evaluations. Given the importance of world
knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark
(KoLA), in which we carefully design three crucial factors: (1) For
\textbf{ability modeling}, we mimic human cognition to form a four-level
taxonomy of knowledge-related abilities, covering 19 tasks. (2) For
\textbf{data}, to ensure fair comparisons, we use both Wikipedia, a corpus
widely used in LLM pre-training, and continuously collected emerging
corpora, aiming to evaluate the capacity to handle unseen data and evolving
knowledge. (3) For \textbf{evaluation criteria}, we adopt a contrastive system,
including overall standard scores for better numerical comparability across
tasks and models, and a unique self-contrast metric for automatically evaluating
knowledge-creating ability. We evaluate 28 open-source and commercial LLMs
and obtain some intriguing findings. The KoLA dataset and open-participation
leaderboard are publicly released at https://kola.xlore.cn and will be
continuously updated to provide references for developing LLMs and
knowledge-related systems.
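As a rough illustration of the standard-score idea, the sketch below standardizes raw per-task metrics across models so that tasks on different metric scales become numerically comparable. The function name `standardize_scores`, the z-score formula, and the example numbers are assumptions for illustration; KoLA's actual scoring procedure may differ.

```python
import numpy as np

def standardize_scores(raw: np.ndarray) -> np.ndarray:
    """Standardize raw per-task metrics across models (z-score per task),
    so that tasks with different metric scales become comparable.

    raw: array of shape (n_models, n_tasks) holding raw task metrics.
    Returns an array of the same shape with per-task standardized scores.
    """
    mean = raw.mean(axis=0, keepdims=True)
    std = raw.std(axis=0, keepdims=True) + 1e-8  # avoid division by zero
    return (raw - mean) / std

# Example: three models evaluated on two tasks with different metric ranges.
raw_scores = np.array([[0.62, 31.0],
                       [0.55, 28.5],
                       [0.71, 35.2]])
print(standardize_scores(raw_scores))
```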
arXiv:2406.14144v1
Large language models (LLMs) excel in various capabilities but also pose
safety risks such as generating harmful content and misinformation, even after
safety alignment. In this paper, we explore the inner mechanisms of safety
alignment from the perspective of mechanistic interpretability, focusing on
identifying and analyzing safety neurons within LLMs that are responsible for
safety behaviors. We propose generation-time activation contrasting to locate
these neurons and dynamic activation patching to evaluate their causal effects.
Experiments on multiple recent LLMs show that: (1) Safety neurons are sparse
and effective. We can restore 90% safety performance with intervention only
on about 5% of all the neurons. (2) Safety neurons encode transferable
mechanisms. They exhibit consistent effectiveness on different red-teaming
datasets. The identified safety neurons also help explain the "alignment tax": we
observe that the identified key neurons for safety and helpfulness
significantly overlap, but they require different activation patterns of the
shared neurons. Furthermore, we demonstrate an application of safety neurons in
detecting unsafe outputs before generation. Our findings may promote further
research on understanding LLM alignment. The source code will be publicly
released to facilitate future research.
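The sketch below illustrates the general idea of generation-time activation contrasting under simplifying assumptions: per-neuron activations collected from an aligned and an unaligned model on the same prompts are compared, and the most divergent ~5% of neurons are returned as candidate safety neurons. The function `contrast_activations` and the mean-difference ranking are illustrative stand-ins, not the paper's exact procedure.

```python
import torch

def contrast_activations(acts_aligned: torch.Tensor,
                         acts_base: torch.Tensor,
                         top_frac: float = 0.05) -> torch.Tensor:
    """Rank neurons by how differently they activate in an aligned vs. a base
    model on the same prompts, and return the indices of the top fraction as
    candidate "safety neurons".

    acts_*: tensors of shape (n_prompts, n_neurons) holding generation-time
            activations collected from each model.
    """
    # Absolute difference of mean activation per neuron across prompts.
    diff = (acts_aligned.mean(dim=0) - acts_base.mean(dim=0)).abs()
    k = max(1, int(top_frac * diff.numel()))
    return torch.topk(diff, k).indices

# Usage with random placeholders standing in for collected activations.
aligned = torch.randn(128, 4096)
base = torch.randn(128, 4096)
safety_neuron_ids = contrast_activations(aligned, base, top_frac=0.05)
print(safety_neuron_ids.shape)  # roughly 5% of 4096 neurons
```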
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for
Vision LLMs
This work focuses on the potential of Vision LLMs (VLLMs) in visual
reasoning. Different from prior studies, we shift our focus from evaluating
standard performance to introducing a comprehensive safety evaluation suite,
covering both out-of-distribution (OOD) generalization and adversarial
robustness. For the OOD evaluation, we present two novel VQA datasets, each
with one variant, designed to test model performance under challenging
conditions. In exploring adversarial robustness, we propose a straightforward
attack strategy that misleads VLLMs into producing responses unrelated to the visual input.
Moreover, we assess the efficacy of two jailbreaking strategies, targeting
either the vision or language component of VLLMs. Our evaluation of 21 diverse
models, ranging from open-source VLLMs to GPT-4V, yields interesting
observations: 1) Current VLLMs struggle with OOD texts but not images, unless
the visual information is limited; and 2) These VLLMs can be easily misled by
attacks that deceive the vision encoder alone, and their vision-language training often
compromises safety protocols. We release this safety evaluation suite at
https://github.com/UCSC-VLAA/vllm-safety-benchmark.
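One common way a vision-encoder-only attack of this kind can be realized is a PGD-style perturbation that pushes the image embedding away from its clean value, as sketched below. The function `vision_encoder_attack`, its hyperparameters, and the MSE objective are assumptions for illustration; the benchmark's actual attack may differ.

```python
import torch

def vision_encoder_attack(vision_encoder, image, eps=8/255, steps=10, step_size=2/255):
    """PGD-style perturbation that pushes the image embedding away from its
    clean value, so the language side receives misleading visual features.

    vision_encoder: differentiable callable mapping an image tensor to an embedding.
    image: tensor of shape (1, 3, H, W) with values in [0, 1].
    """
    with torch.no_grad():
        clean_emb = vision_encoder(image)
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        adv_emb = vision_encoder(image + delta)
        # Negative distance: minimizing this loss maximizes the embedding shift.
        loss = -torch.nn.functional.mse_loss(adv_emb, clean_emb)
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()
            delta.clamp_(-eps, eps)  # keep the perturbation within the budget
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()
```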
LittleMu: Deploying an Online Virtual Teaching Assistant via
Heterogeneous Sources Integration and Chain of Teach Prompts
Teaching assistants have played essential roles in the long history of
education. However, few MOOC platforms provide human or virtual teaching
assistants to support the massive number of online learners, owing to the
complexity of real-world online education scenarios and the lack of training
data. In this paper, we present LittleMu, a virtual MOOC teaching assistant that
requires minimal labeled training data and provides question answering and chit-chat
services. LittleMu consists of two interactive modules: heterogeneous retrieval and
language model prompting. It first integrates structured, semi-structured, and
unstructured knowledge sources to support accurate answers for a wide range of
questions. We then design demonstrations, named "Chain of Teach"
prompts, to exploit a large-scale pre-trained model to handle complex,
previously uncollected questions. Beyond question answering, we develop other
educational services such as knowledge-grounded chit-chat. We test the system's
performance via both offline evaluation and online deployment. Since May 2020,
our LittleMu system has served over 80,000 users with over 300,000 queries from
over 500 courses on the XuetangX MOOC platform, continuously contributing to
more convenient and equitable education. Our code, services, and dataset will be
available at https://github.com/THU-KEG/VTA.
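A minimal sketch of the described two-module flow, assuming retrievers that return (answer, confidence) pairs and a generic LLM callable: confident retrieval answers are returned directly, otherwise the model is prompted with teaching-style demonstrations. The helper `answer_question`, the confidence threshold, and the prompt format are hypothetical stand-ins, not the paper's implementation of "Chain of Teach" prompts.

```python
from typing import Callable, List, Tuple

def answer_question(question: str,
                    retrievers: List[Callable[[str], Tuple[str, float]]],
                    llm: Callable[[str], str],
                    demos: List[str],
                    min_confidence: float = 0.7) -> str:
    """Route a student question: try heterogeneous retrievers first, and fall
    back to prompting a large pre-trained model with teaching-style
    demonstrations when no retriever is confident enough.

    Each retriever returns (answer, confidence); `demos` are hand-written
    demonstrations prepended to the prompt.
    """
    best_answer, best_conf = "", 0.0
    for retrieve in retrievers:
        answer, conf = retrieve(question)
        if conf > best_conf:
            best_answer, best_conf = answer, conf
    if best_conf >= min_confidence:
        return best_answer
    # Fall back to demonstration-based prompting for uncollected questions.
    prompt = "\n\n".join(demos) + f"\n\nStudent question: {question}\nAnswer:"
    return llm(prompt)
```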
A Comprehensive Survey and Guide to Multimodal Large Language Models in
Vision-Language Tasks
arXiv:2411.06284v1
This survey and application guide to multimodal large language models (MLLMs)
explores the rapidly developing field of MLLMs, examining their architectures,
applications, and impact on AI and Generative Models. Starting with
foundational concepts, we delve into how MLLMs integrate various data types,
including text, images, video and audio, to enable complex AI systems for
cross-modal understanding and generation. The survey covers essential topics such as
training methods, architectural components, and practical applications in
various fields, from visual storytelling to enhanced accessibility. Through
detailed case studies and technical analysis, the text examines prominent MLLM
implementations while addressing key challenges in scalability, robustness, and
cross-modal learning. Concluding with a discussion of ethical considerations,
responsible AI development, and future directions, this authoritative resource
provides both theoretical frameworks and practical insights. It offers a
balanced perspective on the opportunities and challenges in the development and
deployment of MLLMs, and is highly valuable for researchers, practitioners, and
students interested in the intersection of natural language processing and
computer vision.
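As a toy illustration of the architectural pattern such surveys cover (a vision encoder, a projection into the language model's embedding space, and a decoder-only LLM over the concatenated tokens), consider the sketch below. The class `MinimalMLLM` and its interfaces are assumptions for illustration, not any specific model's API.

```python
import torch
import torch.nn as nn

class MinimalMLLM(nn.Module):
    """Toy illustration of a common MLLM pattern: a (typically frozen) vision
    encoder, a linear projector mapping visual features into the language
    model's embedding space, and an LLM consuming the concatenated sequence.
    """
    def __init__(self, vision_encoder, language_model, vis_dim: int, txt_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a ViT producing patch features
        self.projector = nn.Linear(vis_dim, txt_dim)
        self.language_model = language_model      # assumed to accept input embeddings

    def forward(self, image, text_embeds):
        vis_feats = self.vision_encoder(image)            # (B, N_img, vis_dim)
        vis_tokens = self.projector(vis_feats)            # (B, N_img, txt_dim)
        inputs = torch.cat([vis_tokens, text_embeds], 1)  # prepend image tokens
        return self.language_model(inputs)
```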
Fairness Without Harm: An Influence-Guided Active Sampling Approach
arXiv:2402.12789v3
The pursuit of fairness in machine learning (ML), ensuring that models do
not exhibit biases toward protected demographic groups, typically results in a
trade-off. This trade-off can be explained by a Pareto frontier:
given certain resources (e.g., data), reducing fairness violations
often comes at the cost of lowering model accuracy. In this work, we aim to
train models that mitigate group fairness disparity without causing harm to
model accuracy. Intuitively, acquiring more data is a natural and promising
approach to achieve this goal by reaching a better Pareto frontier of the
fairness-accuracy tradeoff. The current data acquisition methods, such as fair
active learning approaches, typically require annotating sensitive attributes.
However, these sensitive attribute annotations should be protected due to
privacy and safety concerns. In this paper, we propose a tractable active data
sampling algorithm that does not rely on group annotations for the training data,
requiring them only on a small validation set. Specifically, the
algorithm first scores each new example by its influence on fairness and
accuracy evaluated on the validation dataset, and then selects a certain number
of examples for training. We theoretically analyze how acquiring more data can
improve fairness without causing harm, and validate the feasibility of our
sampling approach in the context of risk disparity. We also provide upper
bounds on the generalization error and risk disparity, as well as the
connections between them. Extensive experiments on real-world data demonstrate the
effectiveness of our proposed algorithm. Our code is available at
https://github.com/UCSC-REAL/FairnessWithoutHarm.
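A minimal sketch of an influence-style scoring rule in the spirit described above, assuming a differentiable model and a small group-annotated validation batch: each candidate is scored by how well its gradient aligns with directions that reduce validation loss and a fairness-disparity term. The helper `influence_scores` and the first-order dot-product approximation are illustrative, not the paper's exact estimator.

```python
import torch

def influence_scores(model, loss_fn, fairness_fn, candidates, val_batch):
    """Score candidate examples by the first-order influence their gradient
    would have on validation accuracy and fairness if added to training.
    Higher scores mean the example is expected to reduce both validation loss
    and the fairness disparity measured on the validation set.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    def flat_grad(value):
        grads = torch.autograd.grad(value, params, retain_graph=True)
        return torch.cat([g.reshape(-1) for g in grads])

    val_x, val_y, val_group = val_batch
    g_acc = flat_grad(loss_fn(model(val_x), val_y))
    g_fair = flat_grad(fairness_fn(model(val_x), val_y, val_group))

    scores = []
    for x, y in candidates:
        g_cand = flat_grad(loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)))
        # A candidate helps if its gradient aligns with directions that
        # decrease both validation loss and fairness disparity.
        scores.append((g_cand @ (g_acc + g_fair)).item())
    return scores
```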
Alpha and Prejudice: Improving α-sized Worst-case Fairness via
Intrinsic Reweighting
arXiv:2411.03068v1
Worst-case fairness with off-the-shelf demographics achieves group parity by
maximizing the model utility of the worst-off group. Nevertheless, demographic
information is often unavailable in practical scenarios, which impedes the use
of such a direct max-min formulation. Recent advances have reframed this
learning problem by introducing a lower bound on the minimal partition ratio,
denoted as α, as side information, a setting we refer to as ``α-sized
worst-case fairness'' in this paper. We first justify the practical
significance of this setting by presenting noteworthy evidence from the data
privacy perspective, which has been overlooked by existing research. Without
imposing specific requirements on loss functions, we propose reweighting the
training samples based on their intrinsic importance to fairness. Given the
global nature of the worst-case formulation, we further develop a stochastic
learning scheme to simplify the training process without compromising model
performance. Additionally, we address the issue of outliers and provide a
robust variant to handle potential outliers during model training. Our
theoretical analysis and experimental observations reveal the connections
between the proposed approaches and existing ``fairness-through-reweighting''
studies, with extensive experimental results on fairness benchmarks
demonstrating the superiority of our methods.
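One standard formalization of an α-sized worst-case objective is a CVaR-style loss over the worst α-fraction of per-sample losses, sketched below as a point of reference; the paper's intrinsic reweighting scheme and its robust variant are more elaborate, and the function name and batch usage here are illustrative.

```python
import torch

def alpha_worstcase_loss(per_sample_losses: torch.Tensor, alpha: float) -> torch.Tensor:
    """Average the worst alpha-fraction of per-sample losses (CVaR-style).
    When every demographic group is known to contain at least an alpha
    fraction of the data, any such group's average loss is bounded by this
    quantity, so minimizing it controls the worst-off group's loss without
    needing group labels.
    """
    n = per_sample_losses.numel()
    k = max(1, int(alpha * n))
    worst_k, _ = torch.topk(per_sample_losses, k)
    return worst_k.mean()

# Usage inside a training step (stochastic variant: apply per mini-batch).
losses = torch.rand(256)                  # per-sample losses from a batch
loss = alpha_worstcase_loss(losses, 0.2)  # focus on the hardest 20%
```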
Manipulation Facing Threats: Evaluating Physical Vulnerabilities in
End-to-End Vision Language Action Models
arXiv:2409.13174v2
Recently, driven by advancements in Multimodal Large Language Models (MLLMs),
Vision Language Action Models (VLAMs) are being proposed to achieve better
performance in open-vocabulary scenarios for robotic manipulation tasks. Since
manipulation tasks involve direct interaction with the physical world, ensuring
robustness and safety during task execution is a critical issue. In this paper,
by synthesizing current safety research on MLLMs with the specific scenarios in
which manipulation tasks are deployed in the physical world, we comprehensively
evaluate VLAMs in the face of potential physical threats. Specifically, we
propose the Physical Vulnerability Evaluating Pipeline (PVEP), which
incorporates as many visual-modality physical threats as
possible for evaluating the physical robustness of VLAMs. The physical threats
in PVEP specifically include Out-of-Distribution, Typography-based Visual
Prompts, and Adversarial Patch Attacks. By comparing the performance
fluctuations of VLAMs before and after being attacked, we provide generalizable
analyses of how VLAMs respond to different physical security threats. Our
project page is available at
https://chaducheng.github.io/Manipulat-Facing-Threats/.
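The comparison such a pipeline reports can be summarized as relative performance fluctuations per threat type, as in the hypothetical helper below; the dictionary layout and the function `performance_fluctuation` are assumptions, not the authors' code.

```python
def performance_fluctuation(clean_scores: dict, attacked_scores: dict) -> dict:
    """Relative performance drop per (task, threat) pair.

    clean_scores: maps task -> success rate without attack.
    attacked_scores: maps (task, threat) -> success rate under that attack.
    Returns the fractional drop from the clean score for each pair.
    """
    return {
        (task, threat): (clean_scores[task] - score) / max(clean_scores[task], 1e-8)
        for (task, threat), score in attacked_scores.items()
    }

# Example with made-up numbers for one task and two threat types.
clean = {"pick_and_place": 0.80}
attacked = {("pick_and_place", "adversarial_patch"): 0.35,
            ("pick_and_place", "typography_prompt"): 0.60}
print(performance_fluctuation(clean, attacked))
```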
CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision
Language Models
Artificial intelligence has significantly impacted medical applications,
particularly with the advent of Medical Large Vision Language Models
(Med-LVLMs), sparking optimism for the future of automated and personalized
healthcare. However, the trustworthiness of Med-LVLMs remains unverified,
posing significant risks for future model deployment. In this paper, we
introduce CARES, which aims to comprehensively evaluate the trustworthiness of
Med-LVLMs across the medical domain. We assess the trustworthiness of Med-LVLMs
across five dimensions: trustfulness, fairness, safety, privacy, and
robustness. CARES comprises about 41K question-answer pairs in both closed and
open-ended formats, covering 16 medical image modalities and 27 anatomical
regions. Our analysis reveals that the models consistently exhibit concerns
regarding trustworthiness, often displaying factual inaccuracies and failing to
maintain fairness across different demographic groups. Furthermore, they are
vulnerable to attacks and demonstrate a lack of privacy awareness. We publicly
release our benchmark and code at https://cares-ai.github.io/.
FairSkin: Fair Diffusion for Skin Disease Image Generation
arXiv:2410.22551v2
Image generation is a prevailing technique for clinical data augmentation,
advancing diagnostic accuracy and reducing healthcare disparities. Diffusion
Model (DM) has become a leading method in generating synthetic medical images,
but it suffers from a critical twofold bias: (1) The quality of images
generated for Caucasian individuals is significantly higher, as measured by the
Fréchet Inception Distance (FID). (2) The ability of the downstream-task
learner to learn critical features from disease images varies across different
skin tones. These biases pose significant risks, particularly in skin disease
detection, where underrepresentation of certain skin tones can lead to
misdiagnosis or neglect of specific conditions. To address these challenges, we
propose FairSkin, a novel DM framework that mitigates these biases through a
three-level resampling mechanism, ensuring fairer representation across racial
and disease categories. Our approach significantly improves the diversity and
quality of generated images, contributing to more equitable skin disease
detection in clinical settings.
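As a simple point of reference for the resampling idea, the sketch below computes inverse-frequency sampling weights over joint (skin tone, disease) groups; the paper's three-level mechanism is more elaborate, and the helper `balanced_sampling_weights` and the example labels are illustrative only.

```python
from collections import Counter
from typing import List, Tuple

def balanced_sampling_weights(labels: List[Tuple[str, str]]) -> List[float]:
    """Per-example sampling weights that equalize the expected number of draws
    across joint (skin_tone, disease) groups, so rare groups are sampled more
    often during training or generation.
    """
    counts = Counter(labels)
    weights = [1.0 / counts[group] for group in labels]  # inverse group frequency
    total = sum(weights)
    return [w / total for w in weights]

# Usage, e.g. to feed torch.utils.data.WeightedRandomSampler.
labels = [("dark", "eczema"), ("light", "eczema"), ("light", "melanoma"),
          ("light", "melanoma"), ("light", "melanoma")]
print(balanced_sampling_weights(labels))
```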