JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
The camera-ready version of JailbreakBench v1.0 (accepted at the NeurIPS 2024
Datasets and Benchmarks ...
Jailbreak attacks cause large language models (LLMs) to generate harmful,
unethical, or otherwise objectionable content. Evaluating these attacks
presents a number of challenges, which the current collection of benchmarks and
evaluation techniques do not adequately address. First, there is no clear
standard of practice regarding jailbreaking evaluation. Second, existing works
compute costs and success rates in incomparable ways. And third, numerous works
are not reproducible, as they withhold adversarial prompts, involve
closed-source code, or rely on evolving proprietary APIs. To address these
challenges, we introduce JailbreakBench, an open-sourced benchmark with the
following components: (1) an evolving repository of state-of-the-art
adversarial prompts, which we refer to as jailbreak artifacts; (2) a
jailbreaking dataset comprising 100 behaviors -- both original and sourced from
prior work (Zou et al., 2023; Mazeika et al., 2023, 2024) -- which align with
OpenAI's usage policies; (3) a standardized evaluation framework at
https://github.com/JailbreakBench/jailbreakbench that includes a clearly
defined threat model, system prompts, chat templates, and scoring functions;
and (4) a leaderboard at https://jailbreakbench.github.io/ that tracks the
performance of attacks and defenses for various LLMs. We have carefully
considered the potential ethical implications of releasing this benchmark, and
believe that it will be a net positive for the community.
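To make the standardized scoring protocol concrete, the sketch below shows the basic loop such a framework fixes in place: each behavior is paired with an adversarial prompt (a jailbreak artifact), the target LLM's response is scored by a judge, and the attack success rate is the fraction of behaviors judged jailbroken. The helpers `query_target_model` and `judge_is_jailbroken` are hypothetical placeholders, not the JailbreakBench API.

```python
# Minimal sketch of a standardized jailbreak-scoring loop (hypothetical helpers,
# not the actual JailbreakBench API).
from typing import Callable

def attack_success_rate(
    behaviors: list[str],                        # e.g. the 100 benchmark behaviors
    prompts: dict[str, str],                     # behavior -> adversarial prompt ("artifact")
    query_target_model: Callable[[str], str],    # assumed: sends a prompt, returns a response
    judge_is_jailbroken: Callable[[str, str], bool],  # assumed: (behavior, response) -> verdict
) -> float:
    """Fraction of behaviors for which the judge deems the response a jailbreak."""
    successes = 0
    for behavior in behaviors:
        response = query_target_model(prompts[behavior])
        if judge_is_jailbroken(behavior, response):
            successes += 1
    return successes / len(behaviors)
```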
AI Risk Management Should Incorporate Both Safety and Security
arXiv:2405.19524v1
The exposure of security vulnerabilities in safety-aligned language models,
e.g., susceptibility to adversarial attacks, has shed light on the intricate
interplay between AI safety and AI security. Although the two disciplines now
come together under the overarching goal of AI risk management, they have
historically evolved separately, giving rise to differing perspectives.
Therefore, in this paper, we advocate that stakeholders in AI risk management
should be aware of the nuances, synergies, and interplay between safety and
security, and unambiguously take into account the perspectives of both
disciplines in order to devise maximally effective and holistic risk mitigation
approaches. Unfortunately, this vision is often obfuscated, as the definitions
of the basic concepts of "safety" and "security" themselves are often
inconsistent and lack consensus across communities. With AI risk management
being increasingly cross-disciplinary, this issue is particularly salient. In
light of this conceptual challenge, we introduce a unified reference framework
to clarify the differences and interplay between AI safety and AI security,
aiming to facilitate a shared understanding and effective collaboration across
communities.
RobustBench: a standardized adversarial robustness benchmark
The camera-ready version accepted at the NeurIPS'21 Datasets and
Benchmarks Track: 120+ evaluation...
As a research community, we still lack a systematic understanding of progress
on adversarial robustness, which often makes it hard to identify the most
promising ideas for training robust models. A key challenge in benchmarking
robustness is that its evaluation is often error-prone, leading to
robustness overestimation. Our goal is to establish a standardized benchmark of
adversarial robustness, which as accurately as possible reflects the robustness
of the considered models within a reasonable computational budget. To this end,
we start by considering the image classification task and introduce
restrictions (possibly loosened in the future) on the allowed models. We
evaluate adversarial robustness with AutoAttack, an ensemble of white- and
black-box attacks, which was recently shown in a large-scale study to improve
almost all robustness evaluations compared to the original publications. To
prevent overadaptation of new defenses to AutoAttack, we welcome external
evaluations based on adaptive attacks, especially where AutoAttack flags a
potential overestimation of robustness. Our leaderboard, hosted at
https://robustbench.github.io/, contains evaluations of 120+ models and aims at
reflecting the current state of the art in image classification on a set of
well-defined tasks in ℓ∞- and ℓ2-threat models and on common
corruptions, with possible extensions in the future. Additionally, we
open-source the library https://github.com/RobustBench/robustbench that
provides unified access to 80+ robust models to facilitate their downstream
applications. Finally, based on the collected models, we analyze the impact of
robustness on the performance on distribution shifts, calibration,
out-of-distribution detection, fairness, privacy leakage, smoothness, and
transferability.
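For reference, the snippet below sketches how a leaderboard model can be pulled from the library and evaluated with AutoAttack. It follows the interfaces documented in the robustbench and autoattack repositories, but exact signatures and the model identifier should be checked against the current docs.

```python
# Sketch of loading a RobustBench leaderboard model and evaluating it with
# AutoAttack; names follow the robustbench/autoattack READMEs but should be verified.
from robustbench.data import load_cifar10
from robustbench.utils import load_model
from autoattack import AutoAttack

x_test, y_test = load_cifar10(n_examples=100)          # small subset for a quick check
model = load_model(model_name="Carmon2019Unlabeled",   # an entry from the Linf leaderboard
                   dataset="cifar10", threat_model="Linf")

adversary = AutoAttack(model, norm="Linf", eps=8 / 255, version="standard")
x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=64)
```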
HYDRA: Pruning Adversarially Robust Neural Networks
In safety-critical but computationally resource-constrained applications,
deep learning faces two key challenges: lack of robustness against adversarial
attacks and large neural network size (often millions of parameters). While the
research community has extensively explored the use of robust training and
network pruning independently to address one of these challenges, only a few
recent works have studied them jointly. However, these works inherit a
heuristic pruning strategy that was developed for benign training, which
performs poorly when integrated with robust training techniques, including
adversarial training and verifiable robust training. To overcome this
challenge, we propose to make pruning techniques aware of the robust training
objective and let the training objective guide the search for which connections
to prune. We realize this insight by formulating the pruning objective as an
empirical risk minimization problem, which is solved efficiently using SGD. We
demonstrate that our approach, titled HYDRA, simultaneously achieves compressed
networks with state-of-the-art benign and robust accuracy. We demonstrate the
success of our approach across the CIFAR-10, SVHN, and ImageNet datasets with four
robust training techniques: iterative adversarial training, randomized
smoothing, MixTrain, and CROWN-IBP. We also demonstrate the existence of highly
robust sub-networks within non-robust networks. Our code and compressed
networks are publicly available at
https://github.com/inspire-group/compactness-robustness.
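The core idea of letting the robust training objective guide pruning can be sketched as score-based pruning: each weight gets a learnable importance score, the forward pass keeps only the top-scoring connections, and a straight-through estimator lets gradients of the robust loss reach the scores. The code below is a simplified illustration in this spirit, not the authors' implementation.

```python
# Simplified sketch of robustness-aware, score-based pruning (in the spirit of
# HYDRA, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMask(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scores, density):
        k = max(1, int(scores.numel() * density))
        mask = torch.zeros_like(scores)
        idx = scores.abs().flatten().topk(k).indices
        mask.view(-1)[idx] = 1.0
        return mask

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass the gradient to the scores unchanged.
        return grad_output, None

class PrunableLinear(nn.Module):
    def __init__(self, in_features, out_features, density=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)
        self.weight.requires_grad_(False)               # weights frozen during the pruning phase
        self.scores = nn.Parameter(self.weight.abs().clone())  # learnable importance scores
        self.density = density                          # fraction of connections to keep

    def forward(self, x):
        mask = TopKMask.apply(self.scores, self.density)
        return F.linear(x, self.weight * mask)

# During the pruning phase only `scores` are optimized, e.g. with SGD on a robust
# objective (adversarial or verifiable training loss); the surviving connections
# are then fine-tuned as usual.
```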
SORRY-Bench: Systematically Evaluating Large Language Model Safety
Refusal Behaviors
arXiv:2406.14598v1
Evaluating aligned large language models' (LLMs) ability to recognize and
reject unsafe user requests is crucial for safe, policy-compliant deployments.
Existing evaluation efforts, however, face three limitations that we address
with SORRY-Bench, our proposed benchmark. First, existing methods often use
coarse-grained taxonomies of unsafe topics and over-represent some fine-grained
topics. For example, among the ten existing datasets that we evaluated, tests
for refusals of self-harm instructions are over 3x less represented than tests
for fraudulent activities. SORRY-Bench improves on this
by using a fine-grained taxonomy of 45 potentially unsafe topics, and 450
class-balanced unsafe instructions, compiled through human-in-the-loop methods.
Second, the linguistic characteristics and formatting of prompts -- different
languages, dialects, and more -- are often overlooked or only implicitly
considered in existing evaluations. We supplement SORRY-Bench with 20
diverse linguistic augmentations to systematically examine these effects.
Third, existing evaluations rely on large LLMs (e.g., GPT-4) for evaluation,
which can be computationally expensive. We investigate design choices for
creating a fast, accurate automated safety evaluator. By collecting 7K+ human
annotations and conducting a meta-evaluation of diverse LLM-as-a-judge designs,
we show that fine-tuned 7B LLMs can achieve accuracy comparable to GPT-4 scale
LLMs, with lower computational cost. Putting these together, we evaluate over
40 proprietary and open-source LLMs on SORRY-Bench, analyzing their distinctive
refusal behaviors. We hope our effort provides a building block for systematic
evaluations of LLMs' safety refusal capabilities, in a balanced, granular, and
efficient manner.
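The meta-evaluation of automated judges reduces to a simple agreement computation: label each (unsafe instruction, response) pair with the candidate judge and measure how often it matches the human annotation. The sketch below uses a hypothetical `judge` callable and record format, not the SORRY-Bench code.

```python
# Sketch of an LLM-as-a-judge meta-evaluation (hypothetical helper and schema,
# not the SORRY-Bench implementation).
from typing import Callable

def judge_agreement(
    records: list[dict],                   # each: {"instruction", "response", "human_label"}
    judge: Callable[[str, str], bool],     # assumed: True if the unsafe request was fulfilled
) -> float:
    """Fraction of examples where the automated judge matches the human label."""
    correct = sum(
        judge(r["instruction"], r["response"]) == r["human_label"] for r in records
    )
    return correct / len(records)
```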
Fair Evaluation of Federated Learning Algorithms for Automated Breast
Density Classification: The Results of the 2022 ACR-NCI-NVIDIA Federated
Learning Challenge
The correct interpretation of breast density is important in the assessment
of breast cancer risk. AI has been shown capable of accurately predicting
breast density; however, due to differences in imaging characteristics
across mammography systems, models built using data from one system do not
generalize well to other systems. Though federated learning (FL) has emerged as
a way to improve the generalizability of AI without the need to share data, the
best way to preserve features from all training data during FL is an active
area of research. To explore FL methodology, the breast density classification
FL challenge was hosted in partnership with the American College of Radiology,
Harvard Medical School's Mass General Brigham, University of Colorado, NVIDIA,
and the National Institutes of Health National Cancer Institute. Challenge
participants were able to submit Docker containers capable of implementing FL
on three simulated medical facilities, each containing a unique large
mammography dataset. The breast density FL challenge ran from June 15 to
September 5, 2022, attracting seven finalists from around the world. The
winning FL submission reached a linear kappa score of 0.653 on the challenge
test data and 0.413 on an external testing dataset, scoring comparably to a
model trained on the same data in a central location.
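The challenge metric, a linearly weighted Cohen's kappa, penalizes predictions by how far they fall from the true ordinal category. A minimal sketch with scikit-learn follows; the 0-3 encoding of the density categories and the label values are illustrative assumptions.

```python
# Sketch of the linearly weighted kappa metric over ordinal breast-density
# categories (here assumed encoded 0-3); values are illustrative only.
from sklearn.metrics import cohen_kappa_score

y_true = [0, 1, 2, 3, 2, 1, 0, 3]   # radiologist labels (illustrative)
y_pred = [0, 1, 2, 2, 2, 0, 0, 3]   # model predictions  (illustrative)

linear_kappa = cohen_kappa_score(y_true, y_pred, weights="linear")
print(f"linear kappa: {linear_kappa:.3f}")
```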
MoDem-V2: Visuo-Motor World Models for Real-World Robot Manipulation
Robotic systems that aspire to operate in uninstrumented real-world
environments must perceive the world directly via onboard sensing. Vision-based
learning systems aim to eliminate the need for environment instrumentation by
building an implicit understanding of the world based on raw pixels, but
navigating the contact-rich high-dimensional search space from solely sparse
visual reward signals significantly exacerbates the challenge of exploration.
The applicability of such systems is thus typically restricted to simulated or
heavily engineered environments, since agent exploration in the real world
without the guidance of explicit state estimation and dense rewards can lead to
unsafe behavior and catastrophic safety faults. In this study, we
isolate the root causes behind these limitations to develop a system, called
MoDem-V2, capable of learning contact-rich manipulation directly in the
uninstrumented real world. Building on the latest algorithmic advancements in
model-based reinforcement learning (MBRL), demo-bootstrapping, and effective
exploration, MoDem-V2 can acquire contact-rich dexterous manipulation skills
directly in the real world. We identify key ingredients for leveraging
demonstrations in model learning while respecting real-world safety
considerations -- exploration centering, agency handover, and actor-critic
ensembles. We empirically demonstrate the contribution of these ingredients in
four complex visuo-motor manipulation problems in both simulation and the real
world. To the best of our knowledge, our work presents the first successful
system for demonstration-augmented visual MBRL trained directly in the real
world. Visit https://sites.google.com/view/modem-v2 for videos and more
details.
RAISE -- Radiology AI Safety, an End-to-end lifecycle approach
The integration of AI into radiology introduces opportunities for improved
clinical care provision and efficiency, but it demands a meticulous approach to
mitigate potential risks, as with any other new technology. Beginning with
rigorous pre-deployment evaluation and validation, the focus should be on
ensuring models meet the highest standards of safety, effectiveness and
efficacy for their intended applications. Input and output guardrails
implemented during production usage act as an additional layer of protection,
identifying and addressing individual failures as they occur. Continuous
post-deployment monitoring allows for tracking population-level performance
(data drift), fairness, and value delivery over time. Scheduling reviews of
post-deployment model performance and educating radiologists about new
algorithmic-driven findings is critical for AI to be effective in clinical
practice. Recognizing that no single AI solution can provide absolute assurance
even when limited to its intended use, the synergistic application of quality
assurance at multiple levels - regulatory, clinical, technical, and ethical -
is emphasized. Collaborative efforts between stakeholders spanning healthcare
systems, industry, academia, and government are imperative to address the
multifaceted challenges involved. Trust in AI is an earned privilege,
contingent on a broad set of goals, among them transparently demonstrating that
the AI adheres to the same rigorous safety, effectiveness and efficacy
standards as other established medical technologies. By doing so, developers
can instil confidence among providers and patients alike, enabling the
responsible scaling of AI and the realization of its potential benefits. The
roadmap presented herein aims to expedite the achievement of deployable,
reliable, and safe AI in radiology.
Integration and Implementation Strategies for AI Algorithm Deployment
with Smart Routing Rules and Workflow Management
This paper reviews the challenges hindering the widespread adoption of
artificial intelligence (AI) solutions in the healthcare industry, focusing on
computer vision applications for medical imaging, and how interoperability and
enterprise-grade scalability can be used to address these challenges. The
complex nature of healthcare workflows, intricacies in managing large and
secure medical imaging data, and the absence of standardized frameworks for AI
development pose significant barriers and require a new paradigm to address
them.
The role of interoperability is examined in this paper as a crucial factor in
connecting disparate applications within healthcare workflows. Standards such
as DICOM, Health Level 7 (HL7), and Integrating the Healthcare Enterprise (IHE)
are highlighted as foundational for common imaging workflows. A specific focus
is placed on the role of DICOM gateways, with Smart Routing Rules and Workflow
Management leading transformational efforts in this area.
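To illustrate the idea of a routing rule over imaging metadata, the sketch below inspects standard DICOM attributes with pydicom and picks a destination. The rule, destination names, and thresholds are hypothetical; real routing engines express such rules in their own configuration syntax.

```python
# Illustrative "smart routing" rule over DICOM metadata using pydicom
# (hypothetical rule and destination names, not a specific product's syntax).
import pydicom

def route_study(dicom_path: str) -> str:
    ds = pydicom.dcmread(dicom_path, stop_before_pixels=True)
    modality = getattr(ds, "Modality", "")
    body_part = str(getattr(ds, "BodyPartExamined", "")).upper()
    # Assumed rule: send chest CTs to an AI inference endpoint, everything else to PACS.
    if modality == "CT" and "CHEST" in body_part:
        return "ai-nodule-detector"     # assumed destination name
    return "pacs-archive"               # assumed destination name
```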
To drive enterprise scalability, new tools are needed. Project MONAI,
established in 2019, is introduced as an initiative aiming to redefine the
development of medical AI applications. The MONAI Deploy App SDK, a component
of Project MONAI, is identified as a key tool in simplifying the packaging and
deployment process, enabling repeatable, scalable, and standardized deployment
patterns for AI applications.
The abstract underscores the potential impact of successful AI adoption in
healthcare, offering physicians both life-saving and time-saving insights and
driving efficiencies in radiology department workflows. The collaborative
efforts between academia and industry are emphasized as essential for advancing
the adoption of healthcare AI solutions.
RoboAgent: Generalization and Efficiency in Robot Manipulation via
Semantic Augmentations and Action Chunking
arXiv:2309.01918v1
The grand aim of having a single robot that can manipulate arbitrary objects
in diverse settings is at odds with the paucity of robotics datasets. Acquiring
and growing such datasets is strenuous due to manual efforts, operational
costs, and safety challenges. A path toward such a universal agent would
require a structured framework capable of wide generalization but trained
within a reasonable data budget. In this paper, we develop an efficient system
(RoboAgent) for training universal agents capable of multi-task manipulation
skills using (a) semantic augmentations that can rapidly multiply existing
datasets and (b) action representations that can extract performant policies
with small yet diverse multi-modal datasets without overfitting. In addition,
reliable task conditioning and an expressive policy architecture enable our
agent to exhibit a diverse repertoire of skills in novel situations specified
using language commands. Using merely 7500 demonstrations, we are able to train
a single agent capable of 12 unique skills, and demonstrate its generalization
over 38 tasks spread across common daily activities in diverse kitchen scenes.
On average, RoboAgent outperforms prior methods by over 40% in unseen
situations while being more sample-efficient and amenable to capability
improvements and extensions through fine-tuning. Videos at
https://robopen.github.io/
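Action chunking, one of the action representations the abstract refers to, has the policy predict a short sequence of future actions from one observation and execute the whole chunk before querying the policy again. The sketch below assumes a generic gym-style environment and a hypothetical chunk-predicting policy; it is an illustration of the idea, not the RoboAgent code.

```python
# Sketch of action chunking at rollout time (hypothetical policy/env interfaces,
# assumed gym-style step() returning obs, reward, done, info).
import numpy as np

CHUNK = 4  # number of actions predicted per policy query (illustrative horizon)

def rollout(env, policy, episode_len=100):
    obs = env.reset()
    for _ in range(episode_len // CHUNK):
        action_chunk = policy(obs)            # assumed shape: (CHUNK, action_dim)
        for action in np.asarray(action_chunk):
            obs, reward, done, info = env.step(action)  # execute the chunk open-loop
            if done:
                return
```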