Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank
Modifications
22 pages, 9 figures. Project page is available at
https://boyiwei.com/alignment-attribution/
Large language models (LLMs) show inherent brittleness in their safety
mechanisms, as evidenced by their susceptibility to jailbreaking and even
non-malicious fine-tuning. This study explores this brittleness of safety
alignment by leveraging pruning and low-rank modifications. We develop methods
to identify critical regions that are vital for safety guardrails, and that are
disentangled from utility-relevant regions at both the neuron and rank levels.
Surprisingly, the isolated regions we find are sparse, comprising about 3%
at the parameter level and 2.5% at the rank level. Removing these regions
compromises safety without significantly impacting utility, corroborating the
inherent brittleness of the model's safety mechanisms. Moreover, we show that
LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications
to the safety-critical regions are restricted. These findings underscore the
urgent need for more robust safety strategies in LLMs.
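The attribution idea can be illustrated with a minimal sketch. It assumes a Wanda-style importance score (weight magnitude scaled by calibration-set activation norms) computed separately on safety and utility data, and isolates weights that rank highly for safety but not for utility via a set difference; the function and tensor names are illustrative and do not reproduce the authors' implementation.

```python
import torch

def importance_scores(weight: torch.Tensor, act_norm: torch.Tensor) -> torch.Tensor:
    """Wanda-style score: |W| scaled by per-input-channel activation norms
    collected on a calibration set (safety or utility data)."""
    return weight.abs() * act_norm.unsqueeze(0)  # (out, in) * (1, in)

def isolate_safety_critical(weight, safety_act_norm, utility_act_norm, top_frac=0.03):
    """Boolean mask of weights ranking in the top `top_frac` for safety
    calibration data but not for utility data (set difference)."""
    k = max(1, int(top_frac * weight.numel()))

    def top_mask(scores):
        flat = scores.flatten()
        mask = torch.zeros_like(flat, dtype=torch.bool)
        mask[torch.topk(flat, k).indices] = True
        return mask.view_as(scores)

    safety_mask = top_mask(importance_scores(weight, safety_act_norm))
    utility_mask = top_mask(importance_scores(weight, utility_act_norm))
    return safety_mask & ~utility_mask  # important for safety, not for utility

# Toy usage: zeroing the isolated region is the "removal" whose effect on
# safety and utility is then measured.
W = torch.randn(8, 16)
mask = isolate_safety_critical(W, torch.rand(16), torch.rand(16))
W_pruned = W.masked_fill(mask, 0.0)
```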
What is in Your Safe Data? Identifying Benign Data that Breaks Safety
arXiv:2404.01099v2
Current Large Language Models (LLMs), even those tuned for safety and
alignment, are susceptible to jailbreaking. Some have found that just further
fine-tuning an aligned model with benign data (i.e., data without harmful
content) surprisingly leads to substantial degradation in safety. We delve into
the data-centric aspects of why benign fine-tuning inadvertently contributes to
jailbreaking. First, we represent fine-tuning data through two lenses:
representation and gradient spaces. Additionally, we propose a bi-directional
anchoring method that, during the selection process, prioritizes data points
that are close to harmful examples and far from benign ones. Our approach
effectively identifies subsets of benign data that are more likely to degrade
the model's safety after fine-tuning. Training on just 100 of these seemingly
benign datapoints surprisingly leads to the fine-tuned model affirmatively
responding to >70% of tested harmful requests, compared to <20% after
fine-tuning on randomly selected data. We also observe that the selected data
frequently appear as lists, bullet points, or math questions, indicating a
systematic pattern in fine-tuning data that contributes to jailbreaking.
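As a rough illustration of the bi-directional anchoring idea, the sketch below scores each benign candidate by its similarity to harmful anchor examples minus its similarity to safe anchors, then keeps the top 100. It assumes precomputed, L2-normalized embeddings (representation- or gradient-space features) and is not the authors' exact selection procedure.

```python
import numpy as np

def select_benign_but_risky(candidates, harmful_anchors, safe_anchors, k=100):
    """Rank benign candidates by closeness to harmful anchors minus closeness
    to safe anchors; return the indices of the top-k candidates.

    All inputs are (n, d) arrays of L2-normalized feature vectors."""
    sim_harm = candidates @ harmful_anchors.T   # cosine similarity to harmful anchors
    sim_safe = candidates @ safe_anchors.T      # cosine similarity to safe anchors
    score = sim_harm.max(axis=1) - sim_safe.max(axis=1)  # bi-directional criterion
    return np.argsort(-score)[:k]

# Toy usage with random unit vectors standing in for real embeddings.
rng = np.random.default_rng(0)
def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)
cand, harm, safe = (unit(rng.normal(size=(n, 32))) for n in (1000, 50, 50))
top100 = select_benign_but_risky(cand, harm, safe, k=100)
```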
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
arXiv:2310.06987v1
The rapid progress in open-source large language models (LLMs) is
significantly advancing AI development. Extensive efforts have been made before
model release to align their behavior with human values, with the primary goal
of ensuring their helpfulness and harmlessness. However, even carefully aligned
models can be manipulated maliciously, leading to unintended behaviors, known
as "jailbreaks". These jailbreaks are typically triggered by specific text
inputs, often referred to as adversarial prompts. In this work, we propose the
generation exploitation attack, an extremely simple approach that disrupts
model alignment by only manipulating variations of decoding methods. By
exploiting different generation strategies, including varying decoding
hyper-parameters and sampling methods, we increase the misalignment rate from
0% to more than 95% across 11 language models including LLaMA2, Vicuna, Falcon,
and MPT families, outperforming state-of-the-art attacks with 30× lower
computational cost. Finally, we propose an effective alignment method that
explores diverse generation strategies, which can reasonably reduce the
misalignment rate under our attack. Altogether, our study underscores a major
failure in current safety evaluation and alignment procedures for open-source
LLMs, strongly advocating for more comprehensive red teaming and better
alignment before releasing such models. Our code is available at
https://github.com/Princeton-SysML/Jailbreak_LLM.
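A minimal sketch of the decoding-variation idea follows, using the Hugging Face `generate` API to sweep temperature, top-p, and top-k settings for a single prompt. The model name is a placeholder, the harmful prompt is omitted, and the harmfulness check is only gestured at in a comment.

```python
from itertools import product
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the paper evaluates LLaMA2, Vicuna, Falcon, and MPT families.
model_name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "..."  # a harmful request from the evaluation set (omitted here)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Sweep decoding strategies: sampling with varied temperature / top-p / top-k,
# plus greedy decoding as a baseline.
configs = [dict(do_sample=True, temperature=t, top_p=p, top_k=k)
           for t, p, k in product((0.7, 1.0, 1.5), (0.7, 0.9, 1.0), (20, 50, 100))]
configs.append(dict(do_sample=False))

for cfg in configs:
    out = model.generate(**inputs, max_new_tokens=128, **cfg)
    text = tok.decode(out[0], skip_special_tokens=True)
    # A harmfulness classifier (not shown) would score `text`; any config that
    # elicits a harmful completion counts as a successful attack.
    print(cfg, text[:80])
```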
MABEL: Attenuating Gender Bias using Textual Entailment Data
Accepted to EMNLP 2022. Code and models are publicly available at
https://github.com/princeton-nlp...
Pre-trained language models encode undesirable social biases, which are
further exacerbated in downstream use. To this end, we propose MABEL (a Method
for Attenuating Gender Bias using Entailment Labels), an intermediate
pre-training approach for mitigating gender bias in contextualized
representations. Key to our approach is the use of a contrastive learning
objective on counterfactually augmented, gender-balanced entailment pairs from
natural language inference (NLI) datasets. We also introduce an alignment
regularizer that pulls identical entailment pairs along opposite gender
directions closer. We extensively evaluate our approach on intrinsic and
extrinsic metrics, and show that MABEL outperforms previous task-agnostic
debiasing approaches in terms of fairness. It also preserves task performance
after fine-tuning on downstream tasks. Together, these findings demonstrate the
suitability of NLI data as an effective means of bias mitigation, as opposed to
only using unlabeled sentences in the literature. Finally, we identify that
existing approaches often use evaluation settings that are insufficient or
inconsistent. We make an effort to reproduce and compare previous methods, and
call for unifying the evaluation settings across gender debiasing methods for
better future comparison.
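The contrastive objective can be sketched roughly as an InfoNCE-style loss over entailment pairs, with a toy stand-in for the alignment term that pulls the two gendered versions of a pair together. This is an assumption-laden simplification, not MABEL's exact training objective.

```python
import torch
import torch.nn.functional as F

def entailment_contrastive_loss(premise_emb, hypothesis_emb, temperature=0.05):
    """InfoNCE-style loss: each (premise, hypothesis) entailment pair is a
    positive; all other hypotheses in the batch are negatives.

    premise_emb, hypothesis_emb: (batch, dim) sentence embeddings of
    counterfactually augmented, gender-balanced entailment pairs."""
    p = F.normalize(premise_emb, dim=-1)
    h = F.normalize(hypothesis_emb, dim=-1)
    logits = p @ h.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(p.size(0), device=p.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)

def alignment_regularizer(emb_male, emb_female):
    """Toy stand-in for the alignment term: pull the two gendered versions of
    the same entailment pair toward each other."""
    return (1 - F.cosine_similarity(emb_male, emb_female, dim=-1)).mean()
```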
SIESEF-FusionNet: Spatial Inter-correlation Enhancement and
Spatially-Embedded Feature Fusion Network for LiDAR Point Cloud Semantic
Segmentation
The ambiguity at the boundaries of different semantic classes in point cloud
semantic segmentation often leads to incorrect decisions in intelligent
perception systems, such as autonomous driving. Hence, accurate delineation of
the boundaries is crucial for improving safety in autonomous driving. A novel
spatial inter-correlation enhancement and spatially-embedded feature fusion
network (SIESEF-FusionNet) is proposed in this paper, enhancing spatial
inter-correlation by combining inverse distance weighting and angular
compensation to extract more beneficial spatial information without causing
redundancy. Meanwhile, a new spatial adaptive pooling module is also designed,
embedding enhanced spatial information into semantic features for strengthening
the context-awareness of semantic features. Experimental results show that
SIESEF-FusionNet achieves 83.7% mIoU and 97.8% OA on the Toronto3D dataset,
outperforming other baseline methods, and reaches 61.1% mIoU on the
SemanticKITTI dataset, a marked improvement in segmentation performance. In
addition, the effectiveness and
plug-and-play capability of the proposed modules are further verified through
ablation studies.
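As a rough sketch of the inverse-distance-weighting component (the angular compensation term is omitted), the snippet below computes normalized inverse-distance weights over each point's k nearest neighbors; the brute-force neighbor search is for clarity only.

```python
import numpy as np

def inverse_distance_weights(points, k=16, eps=1e-8):
    """For each point, compute normalized inverse-distance weights over its
    k nearest neighbors. Returns (neighbor_idx, weights), each of shape (N, k).

    points: (N, 3) xyz coordinates. Brute-force O(N^2) search for clarity;
    a KD-tree would be used in practice."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)  # (N, N)
    np.fill_diagonal(d, np.inf)                  # exclude the point itself
    idx = np.argsort(d, axis=1)[:, :k]           # k nearest neighbors
    nd = np.take_along_axis(d, idx, axis=1)      # (N, k) neighbor distances
    w = 1.0 / (nd + eps)
    return idx, w / w.sum(axis=1, keepdims=True)  # normalize per point

# Aggregating neighbor features with these weights yields a spatially-weighted
# context vector per point.
```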
Towards Open Respiratory Acoustic Foundation Models: Pretraining and
Benchmarking
accepted by NeurIPS 2024 Track Datasets and Benchmarks
Respiratory audio, such as coughing and breathing sounds, has predictive
power for a wide range of healthcare applications, yet is currently
under-explored. The main problem for those applications arises from the
difficulty in collecting large labeled task-specific data for model
development. Generalizable respiratory acoustic foundation models pretrained
with unlabeled data would offer appealing advantages and possibly unlock this
impasse. However, given the safety-critical nature of healthcare applications,
it is pivotal to also ensure openness and replicability for any proposed
foundation model solution. To this end, we introduce OPERA, an OPEn Respiratory
Acoustic foundation model pretraining and benchmarking system, as the first
approach answering this need. We curate large-scale respiratory audio datasets
(~136K samples, over 400 hours), pretrain three pioneering foundation models,
and build a benchmark consisting of 19 downstream respiratory health tasks for
evaluation. Our pretrained models demonstrate superior performance (against
existing acoustic models pretrained with general audio on 16 out of 19 tasks)
and generalizability (to unseen datasets and new respiratory audio modalities).
This highlights the great promise of respiratory acoustic foundation models and
encourages more studies using OPERA as an open resource to accelerate research
on respiratory audio for health. The system is accessible from
https://github.com/evelyn0414/OPERA.
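Downstream evaluation of such a pretrained encoder typically follows a linear-probe protocol, sketched below under the assumption of a binary task; `pretrained_encoder` is a placeholder for the embedding interface, not OPERA's actual API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def linear_probe(pretrained_encoder, train_audio, train_labels, test_audio, test_labels):
    """Freeze the foundation model, embed each recording, and fit a lightweight
    classifier on the downstream health task (assumed binary here).

    `pretrained_encoder(audio) -> np.ndarray` is a placeholder embedding interface."""
    X_train = np.stack([pretrained_encoder(a) for a in train_audio])
    X_test = np.stack([pretrained_encoder(a) for a in test_audio])
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    return roc_auc_score(test_labels, clf.predict_proba(X_test)[:, 1])
```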
Post-translational modifications (PTMs) profoundly expand the complexity and
functionality of the proteome, regulating protein attributes and interactions
that are crucial for biological processes. Accurately predicting PTM sites and
their specific types is therefore essential for elucidating protein function
and understanding disease mechanisms. Existing computational approaches
predominantly focus on protein sequences to predict PTM sites, driven by the
recognition of sequence-dependent motifs. However, these approaches often
overlook protein structural contexts. In this work, we first compile a
large-scale sequence-structure PTM dataset, which serves as the foundation for
fair comparison. We introduce the MeToken model, which tokenizes the
micro-environment of each amino acid, integrating both sequence and structural
information into unified discrete tokens. This model not only captures the
typical sequence motifs associated with PTMs but also leverages the spatial
arrangements dictated by protein tertiary structures, thus providing a holistic
view of the factors influencing PTM sites. Designed to address the long-tail
distribution of PTM types, MeToken employs uniform sub-codebooks that ensure
even the rarest PTMs are adequately represented and distinguished. We validate
the effectiveness and generalizability of MeToken across multiple datasets,
demonstrating its superior performance in accurately identifying PTM types. The
results underscore the importance of incorporating structural data and
highlight MeToken's potential in facilitating accurate and comprehensive PTM
predictions, which could significantly impact proteomics research. The code and
datasets are available at https://github.com/A4Bio/MeToken.
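The tokenization step can be pictured as vector quantization of per-residue micro-environment features, sketched below with a single flat codebook; the uniform per-PTM-type sub-codebooks and the feature extractor itself are abstracted away, so the names here are illustrative.

```python
import torch

def tokenize_microenvironment(features, codebook):
    """Assign each residue's micro-environment feature vector to its nearest
    codeword (discrete token).

    features: (num_residues, dim) embeddings fusing sequence and structural context
    codebook: (num_codes, dim) learned codewords (conceptually the concatenation
              of per-PTM-type sub-codebooks)"""
    dists = torch.cdist(features, codebook)   # (num_residues, num_codes) distances
    return dists.argmin(dim=-1)               # integer token ids

# Toy usage with random features and codewords.
tokens = tokenize_microenvironment(torch.randn(128, 64), torch.randn(512, 64))
```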
CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision
Language Models
Artificial intelligence has significantly impacted medical applications,
particularly with the advent of Medical Large Vision Language Models
(Med-LVLMs), sparking optimism for the future of automated and personalized
healthcare. However, the trustworthiness of Med-LVLMs remains unverified,
posing significant risks for future model deployment. In this paper, we
introduce CARES and aim to comprehensively evaluate the trustworthiness of
Med-LVLMs across the medical domain. We assess the trustworthiness of Med-LVLMs
across five dimensions, including trustfulness, fairness, safety, privacy, and
robustness. CARES comprises about 41K question-answer pairs in both closed and
open-ended formats, covering 16 medical image modalities and 27 anatomical
regions. Our analysis reveals that the models consistently exhibit concerns
regarding trustworthiness, often displaying factual inaccuracies and failing to
maintain fairness across different demographic groups. Furthermore, they are
vulnerable to attacks and demonstrate a lack of privacy awareness. We publicly
release our benchmark and code at https://cares-ai.github.io/.
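For the closed-ended portion, per-dimension scoring reduces to accuracy aggregation over labeled question-answer records, as in the sketch below; the record schema is hypothetical, and the open-ended scoring used by CARES is more involved.

```python
from collections import defaultdict

def accuracy_by_dimension(records):
    """Aggregate closed-ended QA accuracy per trustworthiness dimension.

    `records` is a hypothetical list of dicts such as
    {"dimension": "fairness", "prediction": "B", "answer": "B"}."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["dimension"]] += 1
        correct[r["dimension"]] += int(r["prediction"].strip() == r["answer"].strip())
    return {dim: correct[dim] / total[dim] for dim in total}
```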
GPT-4o System Card
arXiv:2410.21276v1
GPT-4o is an autoregressive omni model that accepts as input any combination
of text, audio, image, and video, and generates any combination of text, audio,
and image outputs. It's trained end-to-end across text, vision, and audio,
meaning all inputs and outputs are processed by the same neural network. GPT-4o
can respond to audio inputs in as little as 232 milliseconds, with an average
of 320 milliseconds, which is similar to human response time in conversation.
It matches GPT-4 Turbo performance on text in English and code, with
significant improvement on text in non-English languages, while also being much
faster and 50% cheaper in the API. GPT-4o is especially better at vision and
audio understanding compared to existing models. In line with our commitment to
building AI safely and consistent with our voluntary commitments to the White
House, we are sharing the GPT-4o System Card, which includes our Preparedness
Framework evaluations. In this System Card, we provide a detailed look at
GPT-4o's capabilities, limitations, and safety evaluations across multiple
categories, focusing on speech-to-speech while also evaluating text and image
capabilities, and measures we've implemented to ensure the model is safe and
aligned. We also include third-party assessments on dangerous capabilities, as
well as discussion of potential societal impacts of GPT-4o's text and vision
capabilities.
FlexMol: A Flexible Toolkit for Benchmarking Molecular Relational
Learning
arXiv:2410.15010v1
Molecular relational learning (MRL) is crucial for understanding the
interaction behaviors between molecular pairs, a critical aspect of drug
discovery and development. However, the large feasible model space of MRL poses
significant challenges to benchmarking, and existing MRL frameworks face
limitations in flexibility and scope. To address these challenges, avoid
repetitive coding efforts, and ensure fair comparison of models, we introduce
FlexMol, a comprehensive toolkit designed to facilitate the construction and
evaluation of diverse model architectures across various datasets and
performance metrics. FlexMol offers a robust suite of preset model components,
including 16 drug encoders, 13 protein sequence encoders, 9 protein structure
encoders, and 7 interaction layers. With its easy-to-use API and flexibility,
FlexMol supports the dynamic construction of over 70,000 distinct combinations
of model architectures. Additionally, we provide detailed benchmark results and
code examples to demonstrate FlexMol's effectiveness in simplifying and
standardizing MRL model development and comparison.
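The combinatorial model space such a toolkit exposes can be illustrated generically, as below; the registries and configuration dictionary are hypothetical and do not reflect FlexMol's actual API or component names.

```python
from itertools import product

# Hypothetical component registries; the toolkit itself ships 16 drug encoders,
# 13 protein sequence encoders, 9 protein structure encoders, and 7 interaction layers.
DRUG_ENCODERS = ["GCN", "GIN", "MorganFP"]
PROT_SEQ_ENCODERS = ["CNN", "ESM", "Transformer"]
PROT_STRUCT_ENCODERS = ["GVP", "EGNN"]
INTERACTION_LAYERS = ["concat_mlp", "bilinear"]

def enumerate_models():
    """Yield one configuration per combination of preset components, the kind of
    combinatorial model space a toolkit like FlexMol makes easy to benchmark."""
    for drug, seq, struct, inter in product(
            DRUG_ENCODERS, PROT_SEQ_ENCODERS, PROT_STRUCT_ENCODERS, INTERACTION_LAYERS):
        yield {"drug_encoder": drug, "protein_seq_encoder": seq,
               "protein_struct_encoder": struct, "interaction": inter}

for cfg in list(enumerate_models())[:3]:
    print(cfg)
```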