Authored by the Center for Research on Foundation Models (CRFM) at
the Stanford Institute for Human-Centered Artificial Intelligence (HAI).
Language models (LMs) are becoming the foundation for almost all major
language technologies, but their capabilities, limitations, and risks are not
well understood. We present Holistic Evaluation of Language Models (HELM) to
improve the transparency of language models. First, we taxonomize the vast
space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata)
that are of interest for LMs. Then we select a broad subset based on coverage
and feasibility, noting what's missing or underrepresented (e.g. question
answering for neglected English dialects, metrics for trustworthiness). Second,
we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration,
robustness, fairness, bias, toxicity, and efficiency) for each of 16 core
scenarios when possible (87.5% of the time). This ensures that metrics beyond
accuracy don't fall by the wayside, and that trade-offs are clearly exposed. We
also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze
specific aspects (e.g. reasoning, disinformation). Third, we conduct a
large-scale evaluation of 30 prominent language models (spanning open,
limited-access, and closed models) on all 42 scenarios, 21 of which were not
previously used in mainstream LM evaluation. Prior to HELM, models on average
were evaluated on just 17.9% of the core HELM scenarios, with some prominent
models not sharing a single scenario in common. We improve this to 96.0%: now
all 30 models have been densely benchmarked on the same core scenarios and
metrics under standardized conditions. Our evaluation surfaces 25 top-level
findings. For full transparency, we release all raw model prompts and
completions publicly for further analysis, as well as a general modular
toolkit. We intend for HELM to be a living benchmark for the community,
continuously updated with new scenarios, metrics, and models.
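The multi-metric design described above, in which every model is scored on every core scenario under the same metrics, amounts to filling a dense model x scenario x metric grid. The sketch below is a minimal illustration of that structure under assumptions; the scenario data, metric functions, and model callables are placeholders, not the HELM toolkit's API.

```python
# Minimal sketch of a dense model x scenario x metric evaluation grid, in the
# spirit of HELM's multi-metric approach. All names here (the metric functions,
# the model callables, the scenario data) are illustrative placeholders.
from typing import Callable, Dict, List, Tuple

def accuracy(preds: List[str], refs: List[str]) -> float:
    # Fraction of exact matches between model outputs and references.
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def toxicity_rate(preds: List[str], refs: List[str]) -> float:
    # Placeholder toxicity metric: share of outputs flagged by a simple filter.
    return sum("<flagged>" in p for p in preds) / len(preds)

METRICS: Dict[str, Callable[[List[str], List[str]], float]] = {
    "accuracy": accuracy,
    "toxicity": toxicity_rate,
}

def evaluate(models: Dict[str, Callable[[str], str]],
             scenarios: Dict[str, Tuple[List[str], List[str]]]) -> Dict:
    """Score every model on every scenario with every metric (a dense grid)."""
    table = {}
    for model_name, model in models.items():
        for scenario_name, (prompts, refs) in scenarios.items():
            preds = [model(p) for p in prompts]
            for metric_name, metric in METRICS.items():
                table[(model_name, scenario_name, metric_name)] = metric(preds, refs)
    return table
```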
Blue Sky Ideas in Artificial Intelligence Education from the EAAI 2017
New and Future AI Educator Program
Working paper in the 7th Symposium on Educational Advances in
Artificial Intelligence (EAAI-17)
The 7th Symposium on Educational Advances in Artificial Intelligence
(EAAI'17, co-chaired by Sven Koenig and Eric Eaton) launched the EAAI New and
Future AI Educator Program to support the training of early-career university
faculty, secondary school faculty, and future educators (PhD candidates or
postdocs who intend a career in academia). As part of the program, awardees
were asked to address one of the following "blue sky" questions:
* How could/should Artificial Intelligence (AI) courses incorporate ethics
into the curriculum?
* How could we teach AI topics at an early undergraduate or a secondary
school level?
* AI has the potential for broad impact across numerous disciplines. How could we
make AI education more interdisciplinary, specifically to benefit
non-engineering fields?
This paper is a collection of their responses, intended to help motivate
discussion around these issues in AI education.
UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on
Multimodal Large Language Models
Multimodal large language models (MLLMs) have revolutionized vision-language
understanding but are vulnerable to multimodal jailbreak attacks, where
adversaries meticulously craft inputs to elicit harmful or inappropriate
responses. We propose UniGuard, a novel multimodal safety guardrail that
jointly considers the unimodal and cross-modal harmful signals. UniGuard is
trained such that the likelihood of generating harmful responses in a toxic
corpus is minimized, and can be seamlessly applied to any input prompt during
inference with minimal computational costs. Extensive experiments demonstrate
that UniGuard generalizes across multiple modalities, attack strategies, and
state-of-the-art MLLMs, including LLaVA, Gemini Pro, GPT-4, MiniGPT-4, and
InstructBLIP, thereby broadening the scope of our solution.
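As a rough illustration of the stated training objective, minimizing the likelihood of harmful responses over a toxic corpus, the sketch below optimizes a small soft-prompt "guardrail" prepended to the input so that the model's log-likelihood of harmful continuations decreases. This is a sketch under assumptions, not UniGuard's implementation; `model`, `embed`, and `toxic_corpus` are stand-ins for a Hugging Face-style causal LM, its token-embedding layer, and paired (prompt, harmful response) token ids.

```python
# Hedged sketch of the objective described above: learn a guardrail (a soft
# prompt prepended to the input) whose presence lowers the model's likelihood
# of generating harmful continuations from a toxic corpus. Not UniGuard's code;
# `model` is a frozen Hugging Face-style causal LM, `embed` its embedding layer,
# and `toxic_corpus` yields (input_ids, harmful_ids) tensors of shape [1, L].
import torch

def train_guardrail(model, embed, toxic_corpus, prompt_len=8, steps=100, lr=1e-2):
    guard = torch.nn.Parameter(torch.randn(prompt_len, embed.embedding_dim) * 0.01)
    opt = torch.optim.Adam([guard], lr=lr)  # only the guardrail is trained
    for _ in range(steps):
        for input_ids, harmful_ids in toxic_corpus:
            inputs = torch.cat(
                [guard.unsqueeze(0), embed(input_ids), embed(harmful_ids)], dim=1)
            logits = model(inputs_embeds=inputs).logits
            # Log-likelihood of the harmful response tokens under the model.
            resp_logits = logits[:, -harmful_ids.size(1) - 1:-1, :]
            log_probs = torch.log_softmax(resp_logits, dim=-1)
            harmful_ll = log_probs.gather(-1, harmful_ids.unsqueeze(-1)).sum()
            opt.zero_grad()
            harmful_ll.backward()  # descend on the log-likelihood itself, i.e.
            opt.step()             # push the harmful likelihood down
    return guard.detach()
```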
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large
Language Models
The camera-ready version of JailbreakBench v1.0 (accepted at the NeurIPS
2024 Datasets and Benchmarks Track).
Jailbreak attacks cause large language models (LLMs) to generate harmful,
unethical, or otherwise objectionable content. Evaluating these attacks
presents a number of challenges, which the current collection of benchmarks and
evaluation techniques do not adequately address. First, there is no clear
standard of practice regarding jailbreaking evaluation. Second, existing works
compute costs and success rates in incomparable ways. And third, numerous works
are not reproducible, as they withhold adversarial prompts, involve
closed-source code, or rely on evolving proprietary APIs. To address these
challenges, we introduce JailbreakBench, an open-sourced benchmark with the
following components: (1) an evolving repository of state-of-the-art
adversarial prompts, which we refer to as jailbreak artifacts; (2) a
jailbreaking dataset comprising 100 behaviors -- both original and sourced from
prior work (Zou et al., 2023; Mazeika et al., 2023, 2024) -- which align with
OpenAI's usage policies; (3) a standardized evaluation framework at
https://github.com/JailbreakBench/jailbreakbench that includes a clearly
defined threat model, system prompts, chat templates, and scoring functions;
and (4) a leaderboard at https://jailbreakbench.github.io/ that tracks the
performance of attacks and defenses for various LLMs. We have carefully
considered the potential ethical implications of releasing this benchmark, and
believe that it will be a net positive for the community.
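To make the "incomparable costs and success rates" point concrete, the sketch below shows the kind of single, shared attack-success-rate computation a standardized framework enables. The function and argument names (`target_model`, `judge`, `jailbreak_artifacts`) are placeholders, not the JailbreakBench package's API.

```python
# Illustrative shared scoring routine for jailbreak attacks: one definition of
# attack success rate (ASR) and query cost applied to every attack and defense.
# `target_model` and `judge` are placeholders (a chat model and a harm
# classifier); this is not the JailbreakBench package's actual API.
def attack_success_rate(behaviors, jailbreak_artifacts, target_model, judge,
                        queries_per_behavior=1):
    """behaviors: list of harmful behaviors.
    jailbreak_artifacts: mapping from behavior to its adversarial prompt."""
    successes, total_queries = 0, 0
    for behavior in behaviors:
        prompt = jailbreak_artifacts[behavior]
        response = target_model(prompt)      # same chat template for every attack
        total_queries += queries_per_behavior
        if judge(behavior, response):        # True if the response exhibits the behavior
            successes += 1
    return {"asr": successes / len(behaviors), "total_queries": total_queries}
```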
You Never Know: Quantization Induces Inconsistent Biases in
Vision-Language Foundation Models
Workshop paper at NeurIPS 2024 RBFM. 6 pages, 3 figures
We study the impact of a standard practice in compressing foundation
vision-language models - quantization - on the models' ability to produce
socially-fair outputs. In contrast to prior findings with unimodal models that
compression consistently amplifies social biases, our extensive evaluation of
four quantization settings across three datasets and three CLIP variants yields
a surprising result: while individual models demonstrate bias, we find no
consistent change in bias magnitude or direction across a population of
compressed models due to quantization.
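A rough sketch of the experimental recipe implied by this abstract, quantize a model and compare a simple group-fairness gap before and after, is given below. `clip_model`, `zero_shot_predict`, and the `(image, label, group)` dataset format are assumptions for illustration, and dynamic int8 quantization of linear layers stands in for the paper's four quantization settings.

```python
# Hedged sketch: measure a crude group-fairness gap for a vision-language model
# before and after quantization. `clip_model`, `zero_shot_predict`, and the
# (image, label, group) dataset format are illustrative assumptions.
import torch

def group_accuracy_gap(model, dataset, zero_shot_predict):
    """Max zero-shot accuracy difference across demographic groups."""
    per_group = {}
    for image, label, group in dataset:
        correct = int(zero_shot_predict(model, image) == label)
        hits, n = per_group.get(group, (0, 0))
        per_group[group] = (hits + correct, n + 1)
    accs = [hits / n for hits, n in per_group.values()]
    return max(accs) - min(accs)

def bias_shift_under_quantization(clip_model, dataset, zero_shot_predict):
    fp32_gap = group_accuracy_gap(clip_model, dataset, zero_shot_predict)
    int8_model = torch.quantization.quantize_dynamic(
        clip_model, {torch.nn.Linear}, dtype=torch.qint8)
    int8_gap = group_accuracy_gap(int8_model, dataset, zero_shot_predict)
    # The paper's point: the sign and size of (int8_gap - fp32_gap) are not
    # consistent across models, datasets, or quantization settings.
    return fp32_gap, int8_gap
```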
GPT-4o System Card
arXiv:2410.21276v1
GPT-4o is an autoregressive omni model that accepts as input any combination
of text, audio, image, and video, and generates any combination of text, audio,
and image outputs. It's trained end-to-end across text, vision, and audio,
meaning all inputs and outputs are processed by the same neural network. GPT-4o
can respond to audio inputs in as little as 232 milliseconds, with an average
of 320 milliseconds, which is similar to human response time in conversation.
It matches GPT-4 Turbo performance on text in English and code, with
significant improvement on text in non-English languages, while also being much
faster and 50% cheaper in the API. GPT-4o is notably better at vision and
audio understanding than existing models. In line with our commitment to
building AI safely and consistent with our voluntary commitments to the White
House, we are sharing the GPT-4o System Card, which includes our Preparedness
Framework evaluations. In this System Card, we provide a detailed look at
GPT-4o's capabilities, limitations, and safety evaluations across multiple
categories, focusing on speech-to-speech while also evaluating text and image
capabilities, and measures we've implemented to ensure the model is safe and
aligned. We also include third-party assessments on dangerous capabilities, as
well as discussion of potential societal impacts of GPT-4o's text and vision
capabilities.
Avoiding Copyright Infringement via Large Language Model Unlearning
arXiv:2406.10952v2
Pre-trained Large Language Models (LLMs) have demonstrated remarkable
capabilities but also pose risks by learning and generating copyrighted
material, leading to significant legal and ethical concerns. In real-world
scenarios, model owners need to continuously address copyright infringement as
new requests for content removal emerge at different time points. This leads to
the need for sequential unlearning, where copyrighted content is removed
sequentially as new requests arise. Despite its practical relevance, sequential
unlearning in the context of copyright infringement has not been rigorously
explored in existing literature. To address this gap, we propose Stable
Sequential Unlearning (SSU), a novel framework designed to unlearn copyrighted
content from LLMs over multiple time steps. Our approach works by identifying
and removing specific weight updates in the model's parameters that correspond
to copyrighted content. We improve unlearning efficacy by introducing a random
labeling loss, and we ensure the model retains its general-purpose knowledge by
adjusting targeted parameters. Experimental results show that SSU achieves an
effective trade-off between unlearning efficacy and general-purpose language
abilities, outperforming existing baselines.
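The two ingredients named in the abstract, removing the weight updates associated with the copyrighted content and adding a random labeling loss, can be sketched as follows. This is an illustration under assumptions (a task-vector-style subtraction and cross-entropy against random targets), not the SSU reference implementation; all function and variable names are placeholders.

```python
# Hedged sketch of the two ingredients mentioned above: (i) subtract the weight
# update associated with the copyrighted content (a task-vector-style edit) and
# (ii) a random-labeling loss that discourages reproducing the passage verbatim.
import copy
import torch
import torch.nn.functional as F

def subtract_task_vector(model, model_finetuned_on_copyrighted, alpha=1.0):
    """theta_unlearned = theta - alpha * (theta_finetuned - theta)."""
    unlearned = copy.deepcopy(model)
    base = dict(model.named_parameters())
    tuned = dict(model_finetuned_on_copyrighted.named_parameters())
    with torch.no_grad():
        for name, param in unlearned.named_parameters():
            param -= alpha * (tuned[name] - base[name])
    return unlearned

def random_labeling_loss(model, input_ids, vocab_size):
    """Cross-entropy against randomly drawn target tokens on the copyrighted text."""
    random_targets = torch.randint(0, vocab_size, input_ids.shape,
                                   device=input_ids.device)
    logits = model(input_ids).logits          # Hugging Face-style causal LM assumed
    return F.cross_entropy(logits.view(-1, logits.size(-1)), random_targets.view(-1))
```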
Vision-Based Adaptive Robotics for Autonomous Surface Crack Repair
22 pages, 14 figures, submitted to Advanced Engineering Informatics
Surface cracks in infrastructure can lead to significant deterioration and
costly maintenance if not efficiently repaired. Manual repair methods are
labor-intensive, time-consuming, and imprecise and thus difficult to scale to
large areas. While advancements in robotic perception and manipulation have
progressed autonomous crack repair, existing methods still face three key
challenges: (i) accurate localization of cracks within the robot's coordinate
frame, (ii) adaptability to varying crack depths and widths, and (iii)
validation of the repair process under realistic conditions. This paper
presents an adaptive, autonomous system for surface crack detection and repair
using robotics with advanced sensing technologies to enhance precision and
safety for humans. The system uses an RGB-D camera for crack detection, a laser
scanner for precise measurement, and an extruder and pump for material
deposition. To address the first challenge, the laser scanner is used to
refine the crack coordinates for accurate localization. Furthermore, our
approach demonstrates that an adaptive crack-filling method is more efficient
and effective than a fixed-speed approach, with experimental results confirming
both precision and consistency. In addition, to ensure real-world applicability
and testing repeatability, we introduce a novel validation procedure using
3D-printed crack specimens that accurately simulate real-world conditions. This
research contributes to the evolving field of human-robot interaction in
construction by demonstrating how adaptive robotic systems can reduce the need
for manual labor, improve safety, and enhance the efficiency of maintenance
operations, ultimately paving the way for more sophisticated and integrated
construction robotics.
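As a toy illustration of the adaptive crack-filling idea, matching deposition to the crack geometry measured by the laser scanner instead of moving at one fixed speed, a volumetric planner might look like the sketch below. The constants, the rectangular cross-section model, and the function name are illustrative assumptions, not the paper's controller.

```python
# Toy sketch of adaptive crack filling: slow the nozzle down over wider/deeper
# sections so deposited volume tracks the measured crack volume, instead of
# moving at one fixed speed. Constants and the rectangular cross-section model
# are illustrative assumptions, not the paper's controller.
def adaptive_fill_plan(crack_profile, flow_max_ml_s=2.0, v_max_mm_s=50.0, v_min_mm_s=5.0):
    """crack_profile: (width_mm, depth_mm) samples from the laser scan along the path.
    Returns one (travel_speed_mm_s, flow_rate_ml_s) command per sample."""
    plan = []
    for width_mm, depth_mm in crack_profile:
        area_mm2 = width_mm * depth_mm               # assumed rectangular cross-section
        # Volume needed per second at speed v is area * v (mm^3/s); pick the
        # fastest speed the pump can keep up with, within the robot's limits.
        speed = min(v_max_mm_s, max(v_min_mm_s, flow_max_ml_s * 1000.0 / max(area_mm2, 1e-3)))
        flow = min(flow_max_ml_s, area_mm2 * speed / 1000.0)   # 1 ml = 1000 mm^3
        plan.append((speed, flow))
    return plan
```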
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
arXiv:2410.09024v2
The robustness of LLMs to jailbreak attacks, where users design prompts to
circumvent safety measures and misuse model capabilities, has been studied
primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents -- which
use external tools and can execute multi-stage tasks -- may pose a greater risk
if misused, but their robustness remains underexplored. To facilitate research
on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark
includes a diverse set of 110 explicitly malicious agent tasks (440 with
augmentations), covering 11 harm categories including fraud, cybercrime, and
harassment. In addition to measuring whether models refuse harmful agentic
requests, scoring well on AgentHarm requires jailbroken agents to maintain
their capabilities following an attack to complete a multi-step task. We
evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly
compliant with malicious agent requests without jailbreaking, (2) simple
universal jailbreak templates can be adapted to effectively jailbreak agents,
and (3) these jailbreaks enable coherent and malicious multi-step agent
behavior and retain model capabilities. To enable simple and reliable
evaluation of attacks and defenses for LLM-based agents, we publicly release
AgentHarm at https://huggingface.co/datasets/ai-safety-institute/AgentHarm.
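The scoring described above combines two signals per task: whether the agent refused, and how much capability a non-refusing (possibly jailbroken) agent retains across the multi-step task. The sketch below illustrates that bookkeeping; `run_agent`, `refusal_judge`, and `task.grade` are hypothetical stand-ins, not the released AgentHarm harness.

```python
# Hedged sketch of per-task scoring for harmful agent tasks: track refusals and
# the capability retained by agents that do comply. All callables and task
# attributes here are hypothetical placeholders, not the AgentHarm harness.
def evaluate_agent(tasks, run_agent, refusal_judge, jailbreak_template=None):
    refusals, capability_scores = 0, []
    for task in tasks:
        prompt = jailbreak_template.format(task=task.prompt) if jailbreak_template else task.prompt
        transcript = run_agent(prompt, tools=task.tools)      # full multi-step tool-use trace
        if refusal_judge(transcript):
            refusals += 1
            capability_scores.append(0.0)
        else:
            capability_scores.append(task.grade(transcript))  # 0..1 per-step completion score
    return {
        "refusal_rate": refusals / len(tasks),
        "harm_score": sum(capability_scores) / len(tasks),    # low only if agents refuse or fail
    }
```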
Multimodal Situational Safety
arXiv:2410.06172v1
Multimodal Large Language Models (MLLMs) are rapidly evolving, demonstrating
impressive capabilities as multimodal assistants that interact with both humans
and their environments. However, this increased sophistication introduces
significant safety concerns. In this paper, we present the first evaluation and
analysis of a novel safety challenge termed Multimodal Situational Safety,
which explores how safety considerations vary based on the specific situation
in which the user or agent is engaged. We argue that for an MLLM to respond
safely, whether through language or action, it often needs to assess the safety
implications of a language query within its corresponding visual context. To
evaluate this capability, we develop the Multimodal Situational Safety
benchmark (MSSBench) to assess the situational safety performance of current
MLLMs. The dataset comprises 1,820 language query-image pairs; in half of them
the image context is safe, and in the other half it is unsafe. We also develop an
evaluation framework that analyzes key safety aspects, including explicit
safety reasoning, visual understanding, and, crucially, situational safety
reasoning. Our findings reveal that current MLLMs struggle with this nuanced
safety problem in the instruction-following setting and cannot tackle these
situational safety challenges all at once, highlighting a key area for future
research. Furthermore, we develop multi-agent pipelines that coordinate to
solve safety challenges, yielding consistent improvements in safety over the
original MLLM responses. Code and data: mssbench.github.io.
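A minimal sketch of the situational-safety measurement described above, judging the same kind of language query against a safe or an unsafe visual context, might look like the following. `mllm`, `safety_judge`, and the keyword refusal heuristic are illustrative assumptions, not the MSSBench evaluation framework.

```python
# Minimal sketch of situational-safety scoring: a response is judged relative
# to the visual context it was given. `mllm` and `safety_judge` are placeholder
# callables, and the keyword refusal check is a crude stand-in for a
# model-based judge; this is not the MSSBench evaluation code.
def is_refusal(response: str) -> bool:
    return any(k in response.lower() for k in ("i can't", "i cannot", "not able to"))

def situational_safety_accuracy(pairs, mllm, safety_judge):
    """pairs: (query, image, context_is_unsafe) triples, mirroring the safe/unsafe split."""
    correct = 0
    for query, image, context_is_unsafe in pairs:
        response = mllm(query, image)
        if context_is_unsafe:
            # In an unsafe context the model should warn or decline.
            correct += int(safety_judge(query, image, response))
        else:
            # In a safe context it should simply help, not over-refuse.
            correct += int(not is_refusal(response))
    return correct / len(pairs)
```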