Authored by the Center for Research on Foundation Models (CRFM) at
the Stanford Institute for Human-Centered Artificial Intelligence (HAI)
Language models (LMs) are becoming the foundation for almost all major
language technologies, but their capabilities, limitations, and risks are not
well understood. We present Holistic Evaluation of Language Models (HELM) to
improve the transparency of language models. First, we taxonomize the vast
space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata)
that are of interest for LMs. Then we select a broad subset based on coverage
and feasibility, noting what's missing or underrepresented (e.g. question
answering for neglected English dialects, metrics for trustworthiness). Second,
we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration,
robustness, fairness, bias, toxicity, and efficiency) for each of 16 core
scenarios when possible (87.5% of the time). This ensures metrics beyond
accuracy don't fall by the wayside, and that trade-offs are clearly exposed. We
also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze
specific aspects (e.g. reasoning, disinformation). Third, we conduct a
large-scale evaluation of 30 prominent language models (spanning open,
limited-access, and closed models) on all 42 scenarios, 21 of which were not
previously used in mainstream LM evaluation. Prior to HELM, models on average
were evaluated on just 17.9% of the core HELM scenarios, with some prominent
models not sharing a single scenario in common. We improve this to 96.0%: now
all 30 models have been densely benchmarked on the same core scenarios and
metrics under standardized conditions. Our evaluation surfaces 25 top-level
findings. For full transparency, we release all raw model prompts and
completions publicly for further analysis, as well as a general modular
toolkit. We intend for HELM to be a living benchmark for the community,
continuously updated with new scenarios, metrics, and models.
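For concreteness, here is a minimal sketch of the dense scenario-by-metric evaluation grid described above. The scenario names, the `load_scenario` and `query_model` callables, and the single metric shown are hypothetical placeholders, not HELM's released toolkit interface.

```python
# Illustrative sketch of a dense scenario-x-metric evaluation grid, in the
# spirit of the multi-metric approach described above. `load_scenario`,
# `query_model`, and the metric functions are hypothetical stand-ins,
# not part of HELM's released toolkit.

from typing import Callable, Dict, List

SCENARIOS: List[str] = ["question_answering", "summarization", "sentiment"]
METRICS: Dict[str, Callable[[list, list], float]] = {
    "accuracy": lambda preds, refs: sum(p == r for p, r in zip(preds, refs)) / len(refs),
    # calibration, robustness, fairness, bias, toxicity, efficiency would slot in here
}

def evaluate(model, load_scenario, query_model) -> Dict[str, Dict[str, float]]:
    """Evaluate one model on every scenario under every metric."""
    results: Dict[str, Dict[str, float]] = {}
    for scenario in SCENARIOS:
        prompts, references = load_scenario(scenario)   # standardized conditions
        predictions = [query_model(model, p) for p in prompts]
        results[scenario] = {
            name: metric(predictions, references) for name, metric in METRICS.items()
        }
    return results
```

Running this grid for every model under the same prompts and references is what makes the resulting comparisons dense and head-to-head rather than scattered across incompatible setups.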
On the Opportunities and Risks of Foundation Models
Authored by the Center for Research on Foundation Models (CRFM) at
the Stanford Institute for Human-Centered Artificial Intelligence (HAI)
AI is undergoing a paradigm shift with the rise of models (e.g., BERT,
DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a
wide range of downstream tasks. We call these models foundation models to
underscore their critically central yet incomplete character. This report
provides a thorough account of the opportunities and risks of foundation
models, ranging from their capabilities (e.g., language, vision, robotics,
reasoning, human interaction) and technical principles (e.g., model
architectures, training procedures, data, systems, security, evaluation,
theory) to their applications (e.g., law, healthcare, education) and societal
impact (e.g., inequity, misuse, economic and environmental impact, legal and
ethical considerations). Though foundation models are based on standard deep
learning and transfer learning, their scale results in new emergent
capabilities, and their effectiveness across so many tasks incentivizes
homogenization. Homogenization provides powerful leverage but demands caution,
as the defects of the foundation model are inherited by all the adapted models
downstream. Despite the impending widespread deployment of foundation models,
we currently lack a clear understanding of how they work, when they fail, and
what they are even capable of due to their emergent properties. To tackle these
questions, we believe much of the critical research on foundation models will
require deep interdisciplinary collaboration commensurate with their
fundamentally sociotechnical nature.
No Subclass Left Behind: Fine-Grained Robustness in Coarse-Grained
Classification Problems
40 pages. Published as a conference paper at NeurIPS 2020
In real-world classification tasks, each class often comprises multiple
finer-grained "subclasses." As the subclass labels are frequently unavailable,
models trained using only the coarser-grained class labels often exhibit highly
variable performance across different subclasses. This phenomenon, known as
hidden stratification, has important consequences for models deployed in
safety-critical applications such as medicine. We propose GEORGE, a method to
both measure and mitigate hidden stratification even when subclass labels are
unknown. We first observe that unlabeled subclasses are often separable in the
feature space of deep neural networks, and exploit this fact to estimate
subclass labels for the training data via clustering techniques. We then use
these approximate subclass labels as a form of noisy supervision in a
distributionally robust optimization objective. We theoretically characterize
the performance of GEORGE in terms of the worst-case generalization error
across any subclass. We empirically validate GEORGE on a mix of real-world and
benchmark image classification datasets, and show that our approach boosts
worst-case subclass accuracy by up to 22 percentage points compared to standard
training techniques, without requiring any prior information about the
subclasses.
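A hedged sketch of the two stages described above: unlabeled subclasses are estimated by clustering each class's deep-feature vectors, and the estimates then drive a worst-subclass reweighting during training. The cluster count, step size, and exponentiated-weights update are illustrative choices, not GEORGE's exact formulation.

```python
# Illustrative two-stage sketch: (1) estimate subclass labels by clustering
# per-class deep features, (2) upweight the worst-performing estimated subclass
# with a group-DRO-style exponentiated-weights update. Hyperparameters here
# are placeholders rather than GEORGE's tuned settings.

import numpy as np
from sklearn.cluster import KMeans

def estimate_subclasses(features: np.ndarray, labels: np.ndarray, k: int = 2) -> np.ndarray:
    """Cluster each class's feature vectors separately; return global subclass ids."""
    subclass = np.zeros(len(labels), dtype=int)
    next_id = 0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        subclass[idx] = KMeans(n_clusters=k, n_init=10).fit_predict(features[idx]) + next_id
        next_id += k
    return subclass

def update_group_weights(per_example_loss: np.ndarray, subclass: np.ndarray,
                         q: np.ndarray, step_size: float = 0.01) -> np.ndarray:
    """One exponentiated-gradient step on per-subclass weights q (one weight per
    subclass id, summing to 1); subclasses with higher average loss gain weight."""
    groups = np.unique(subclass)  # sorted subclass ids
    group_loss = np.array([per_example_loss[subclass == g].mean() for g in groups])
    q = q * np.exp(step_size * group_loss)
    return q / q.sum()
```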
BARACK: Partially Supervised Group Robustness With Guarantees
While neural networks have shown remarkable success on classification tasks
in terms of average-case performance, they often fail to perform well on
certain groups of the data. Such group information may be expensive to obtain;
thus, recent works in robustness and fairness have proposed ways to improve
worst-group performance even when group labels are unavailable for the training
data. However, these methods generally underperform methods that utilize group
information at training time. In this work, we assume access to a small number
of group labels alongside a larger dataset without group labels. We propose
BARACK, a simple two-step framework to utilize this partial group information
to improve worst-group performance: train a model to predict the missing group
labels for the training data, and then use these predicted group labels in a
robust optimization objective. Theoretically, we provide generalization bounds
for our approach in terms of the worst-group performance, which scale with
respect to both the total number of training points and the number of training
points with group labels. Empirically, our method outperforms the baselines
that do not use group information, even when only 1-33% of points have group
labels. We provide ablation studies to support the robustness and extensibility
of our framework.
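A minimal sketch of the two-step framework, assuming plain NumPy arrays and a scikit-learn classifier as the group-label predictor; the actual model classes and robust objective used by BARACK may differ.

```python
# Hedged sketch of the two-step idea: fit a group predictor on the small
# group-labeled subset, impute groups for the rest, then optimize a
# worst-group objective over the imputed groups. Models and losses are
# placeholders, not BARACK's exact components.

import numpy as np
from sklearn.linear_model import LogisticRegression

def impute_groups(features: np.ndarray, group_labels: np.ndarray,
                  labeled_mask: np.ndarray) -> np.ndarray:
    """Step 1: predict missing group labels from features."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features[labeled_mask], group_labels[labeled_mask])
    imputed = group_labels.copy()
    imputed[~labeled_mask] = clf.predict(features[~labeled_mask])
    return imputed

def worst_group_loss(per_example_loss: np.ndarray, groups: np.ndarray) -> float:
    """Step 2 objective: the maximum average loss over any (imputed) group."""
    return max(per_example_loss[groups == g].mean() for g in np.unique(groups))
```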
Personalized Benchmarking with the Ludwig Benchmarking Toolkit
14 pages, 14 figures, 35th Conference on Neural Information
Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks
The rapid proliferation of machine learning models across domains and
deployment settings has given rise to various communities (e.g. industry
practitioners) which seek to benchmark models across tasks and objectives of
personal value. Unfortunately, these users cannot use standard benchmark
results to perform such value-driven comparisons as traditional benchmarks
evaluate models on a single objective (e.g. average accuracy) and fail to
facilitate a standardized training framework that controls for confounding
variables (e.g. computational budget), making fair comparisons difficult. To
address these challenges, we introduce the open-source Ludwig Benchmarking
Toolkit (LBT), a personalized benchmarking toolkit for running end-to-end
benchmark studies (from hyperparameter optimization to evaluation) across an
easily extensible set of tasks, deep learning models, datasets and evaluation
metrics. LBT provides a configurable interface for controlling training and
customizing evaluation, a standardized training framework for eliminating
confounding variables, and support for multi-objective evaluation. We
demonstrate how LBT can be used to create personalized benchmark studies with a
large-scale comparative analysis for text classification across 7 models and 9
datasets. We explore the trade-offs between inference latency and performance,
relationships between dataset attributes and performance, and the effects of
pretraining on convergence and robustness, showing how LBT can be used to
satisfy various benchmarking objectives.
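To illustrate the kind of multi-objective, confound-controlled study described above, here is a generic sketch that holds a training budget fixed and records both accuracy and per-example inference latency. This is not LBT's actual configuration schema or API; all names are placeholders.

```python
# Generic illustration of a multi-objective benchmark study (accuracy plus
# inference latency) under a fixed training budget. This is NOT LBT's actual
# configuration schema or API; see the LBT documentation for its real interface.

import time

STUDY = {
    "task": "text_classification",
    "datasets": ["dataset_a", "dataset_b"],         # placeholder names
    "models": ["model_x", "model_y"],               # placeholder names
    "budget": {"max_train_steps": 10_000},          # confound held fixed across runs
    "objectives": ["accuracy", "latency_ms"],
}

def run_study(train_fn, predict_fn, eval_fn, sample_inputs):
    """Train every model on every dataset under the same budget, then record
    accuracy and mean per-example latency for later trade-off analysis."""
    results = []
    for dataset in STUDY["datasets"]:
        for model_name in STUDY["models"]:
            model = train_fn(model_name, dataset, **STUDY["budget"])
            accuracy = eval_fn(model, dataset)
            start = time.perf_counter()
            for x in sample_inputs:
                predict_fn(model, x)
            latency_ms = (time.perf_counter() - start) * 1000 / len(sample_inputs)
            results.append({"dataset": dataset, "model": model_name,
                            "accuracy": accuracy, "latency_ms": latency_ms})
    return results
```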
Slice-based Learning: A Programming Model for Residual Learning in
Critical Data Slices
In real-world machine learning applications, data subsets correspond to
especially critical outcomes: vulnerable cyclist detections are safety-critical
in an autonomous driving task, and "question" sentences might be important to a
dialogue agent's language understanding for product purposes. While machine
learning models can achieve high quality performance on coarse-grained metrics
like F1-score and overall accuracy, they may underperform on critical
subsets---we define these as slices, the key abstraction in our approach. To
address slice-level performance, practitioners often train separate "expert"
models on slice subsets or use multi-task hard parameter sharing. We propose
Slice-based Learning, a new programming model in which the slicing function
(SF), a programming interface, specifies critical data subsets for which the
model should commit additional capacity. Any model can leverage SFs to learn
slice expert representations, which are combined with an attention mechanism to
make slice-aware predictions. We show that our approach maintains a
parameter-efficient representation while improving over baselines by up to 19.0
F1 on slices and 4.6 F1 overall on datasets spanning language understanding
(e.g. SuperGLUE), computer vision, and production-scale industrial systems.
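A minimal PyTorch-style rendering of the ingredients named above: slicing functions as heuristic predicates, per-slice expert heads, and an attention mechanism that combines them into slice-aware predictions. It is an illustrative sketch, not the authors' released implementation.

```python
# Illustrative PyTorch sketch of slice-aware prediction: slicing functions
# flag critical subsets, per-slice expert heads add capacity, and a learned
# attention over the active experts combines them. Not the authors' code.

import torch
import torch.nn as nn

# Slicing functions: heuristic predicates over raw examples (here, strings).
SLICING_FUNCTIONS = [
    lambda text: text.strip().endswith("?"),      # "question" slice
    lambda text: len(text.split()) < 5,           # "short input" slice
]

class SliceAwareHead(nn.Module):
    def __init__(self, hidden_dim: int, num_classes: int, num_slices: int):
        super().__init__()
        # One base head plus one expert head per slice.
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_dim, num_classes) for _ in range(num_slices + 1)]
        )
        self.slice_attention = nn.Linear(hidden_dim, num_slices + 1)

    def forward(self, features: torch.Tensor, slice_indicators: torch.Tensor) -> torch.Tensor:
        # features: [batch, hidden_dim]; slice_indicators: [batch, num_slices] in {0, 1}
        expert_logits = torch.stack([head(features) for head in self.experts], dim=1)
        # Mask attention so inactive slices get no weight (base head always active).
        base = torch.ones(features.size(0), 1, device=features.device)
        mask = torch.cat([base, slice_indicators], dim=1)
        attn = torch.softmax(
            self.slice_attention(features).masked_fill(mask == 0, -1e9), dim=1
        )
        return (attn.unsqueeze(-1) * expert_logits).sum(dim=1)  # slice-aware logits
```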
Medical device surveillance with electronic health records
arXiv:1904.07640v1
Post-market medical device surveillance is a challenge facing manufacturers,
regulatory agencies, and health care providers. Electronic health records are
valuable sources of real world evidence to assess device safety and track
device-related patient outcomes over time. However, distilling this evidence
remains challenging, as information is fractured across clinical notes and
structured records. Modern machine learning methods for machine reading promise
to unlock increasingly complex information from text, but face barriers due to
their reliance on large and expensive hand-labeled training sets. To address
these challenges, we developed and validated state-of-the-art deep learning
methods that identify patient outcomes from clinical notes without requiring
hand-labeled training data. Using hip replacements as a test case, our methods
accurately extracted implant details and reports of complications and pain from
electronic health records with up to 96.3% precision, 98.5% recall, and 97.4%
F1, improved classification performance by 12.7-53.0% over rule-based methods,
and detected over 6 times as many complication events compared to using
structured data alone. Using these events to assess complication-free
survivorship of different implant systems, we found significant variation
between implants, including for risk of revision surgery, which could not be
detected using coded data alone. Patients with revision surgeries had more hip
pain mentions in the post-hip-replacement, pre-revision period than
patients with no evidence of revision surgery (mean hip pain mentions 4.97 vs.
3.23; t = 5.14; p < 0.001). Some implant models were associated with higher or
lower rates of hip pain mentions. Our methods complement existing surveillance
mechanisms by requiring orders of magnitude less hand-labeled training data,
offering a scalable solution for national medical device surveillance.
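The abstract's central methodological point is training without hand-labeled data. One hedged guess at the spirit of such an approach is shown below: simple keyword heuristics vote on each clinical note, and the majority vote serves as a noisy training label. The rules here are invented for illustration and are not the study's actual labeling logic.

```python
# Hedged illustration of training without hand labels: keyword-based labeling
# heuristics vote on each clinical note, and the majority vote becomes a noisy
# training label for a downstream classifier. These heuristics are invented
# examples, not the labeling functions used in the study.

import re

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_pain_mention(note: str) -> int:
    return POSITIVE if re.search(r"\bhip pain\b", note, re.I) else ABSTAIN

def lf_negated_pain(note: str) -> int:
    return NEGATIVE if re.search(r"\b(no|denies)\s+(hip\s+)?pain\b", note, re.I) else ABSTAIN

LABELING_FUNCTIONS = [lf_pain_mention, lf_negated_pain]

def noisy_label(note: str) -> int:
    """Majority vote over non-abstaining labeling functions; ABSTAIN if none fire."""
    votes = [lf(note) for lf in LABELING_FUNCTIONS if lf(note) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return POSITIVE if sum(v == POSITIVE for v in votes) >= len(votes) / 2 else NEGATIVE

# The resulting (note, noisy_label) pairs, minus abstentions, would then train
# a text classifier in place of a hand-labeled training set.
```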
Bottom-Up and Top-Down Analysis of Values, Agendas, and Observations in
Corpora and LLMs
arXiv:2411.05040v1
Large language models (LLMs) generate diverse, situated, persuasive texts
from a plurality of potential perspectives, influenced heavily by their prompts
and training data. As part of LLM adoption, we seek to characterize, and
ideally manage, the socio-cultural values that they express, for reasons of
safety, accuracy, inclusion, and cultural fidelity. We present a validated
approach to automatically (1) extracting heterogeneous latent value
propositions from texts, (2) assessing resonance and conflict of values with
texts, and (3) combining these operations to characterize the pluralistic value
alignment of human-sourced and LLM-sourced textual data.
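A very rough pipeline sketch of the three operations listed above; every function is a hypothetical placeholder, since the paper's extractors and resonance scoring are not reproduced here.

```python
# Hedged pipeline sketch of the three operations named above. The extract_fn
# and resonance_fn callables are hypothetical placeholders for the paper's
# actual extraction and resonance/conflict scoring.

def extract_value_propositions(texts, extract_fn):
    """(1) Pull latent value propositions out of each text."""
    return [proposition for text in texts for proposition in extract_fn(text)]

def score_resonance(propositions, texts, resonance_fn):
    """(2) Score how strongly each text resonates or conflicts with each value."""
    return {(p, i): resonance_fn(p, text)
            for p in propositions for i, text in enumerate(texts)}

def characterize_alignment(human_texts, llm_texts, extract_fn, resonance_fn):
    """(3) Compare the value profiles of human-sourced and LLM-sourced corpora."""
    values = extract_value_propositions(human_texts + llm_texts, extract_fn)
    return (score_resonance(values, human_texts, resonance_fn),
            score_resonance(values, llm_texts, resonance_fn))
```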
In-context learning (ICL) is a powerful technique for getting language models
to perform complex tasks with no training updates. Prior work has established
strong correlations between the number of in-context examples provided and the
accuracy of the model's predictions. In this paper, we seek to explain this
correlation by showing that ICL approximates a Bayesian learner. This
perspective gives rise to a family of novel Bayesian scaling laws for ICL. In
experiments with GPT-2 models of different sizes, our scaling laws
exceed or match existing scaling laws in accuracy while also offering
interpretable terms for task priors, learning efficiency, and per-example
probabilities. To illustrate the analytic power that such interpretable scaling
laws provide, we report on controlled synthetic dataset experiments designed to
inform real-world studies of safety alignment. In our experimental protocol, we
use supervised fine-tuning (SFT) to suppress an unwanted existing model
capability and then use ICL to
try to bring that capability back (many-shot jailbreaking). We then experiment
on real-world instruction-tuned LLMs using capabilities benchmarks as well as a
new many-shot jailbreaking dataset. In all cases, Bayesian scaling laws
accurately predict the conditions under which ICL will cause the suppressed
behavior to reemerge, which sheds light on the ineffectiveness of post-training
at increasing LLM safety.
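The abstract does not state the functional form of its scaling laws, but the Bayesian-learner view it builds on can be written generically as a prior-weighted mixture over candidate tasks; the notation below (task set T, prior ρ, per-example likelihoods) is assumed for illustration and is not the paper's exact parameterization.

```latex
% Generic Bayesian-learner predictive for ICL (assumed notation, not the paper's):
% after n in-context examples D_n = {(x_i, y_i)}, prediction is a prior-weighted
% mixture over candidate tasks \tau in a set T, which concentrates on the task
% that best explains D_n as n grows.
\[
p(y_{n+1} \mid x_{n+1}, D_n)
  = \sum_{\tau \in T} p(y_{n+1} \mid x_{n+1}, \tau)\,
    \frac{\rho(\tau)\,\prod_{i=1}^{n} p(x_i, y_i \mid \tau)}
         {\sum_{\tau' \in T} \rho(\tau')\,\prod_{i=1}^{n} p(x_i, y_i \mid \tau')}
\]
```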
Whither Bias Goes, I Will Go: An Integrative, Systematic Review of
Algorithmic Bias Mitigation
Machine learning (ML) models are increasingly used for personnel assessment
and selection (e.g., resume screeners, automatically scored interviews).
However, concerns have been raised throughout society that ML assessments may
be biased and perpetuate or exacerbate inequality. Although organizational
researchers have begun investigating ML assessments from traditional
psychometric and legal perspectives, there is a need to understand, clarify,
and integrate fairness operationalizations and algorithmic bias mitigation
methods from the computer science, data science, and organizational research
literatures. We present a four-stage model of developing ML assessments and
applying bias mitigation methods, including 1) generating the training data, 2)
training the model, 3) testing the model, and 4) deploying the model. When
introducing the four-stage model, we describe potential sources of bias and
unfairness at each stage. Then, we systematically review definitions and
operationalizations of algorithmic bias, legal requirements governing personnel
selection from the United States and Europe, and research on algorithmic bias
mitigation across multiple domains and integrate these findings into our
framework. Our review provides insights for both research and practice by
elucidating possible mechanisms of algorithmic bias while identifying which
bias mitigation methods are legal and effective. This integrative framework
also reveals gaps in the knowledge of algorithmic bias mitigation that should
be addressed by future collaborative research between organizational
researchers, computer scientists, and data scientists. We provide
recommendations for developing and deploying ML assessments, as well as
recommendations for future research into algorithmic bias and fairness.
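As one concrete example of a fairness operationalization commonly discussed alongside U.S. personnel-selection law, the sketch below computes adverse-impact (four-fifths rule) selection-rate ratios; the data layout and group handling are illustrative and not drawn from the paper.

```python
# Illustrative computation of adverse-impact (four-fifths rule) ratios from
# selection decisions, one common fairness operationalization in personnel
# selection. Group names and data layout here are placeholders.

from collections import defaultdict

def selection_rates(decisions):
    """decisions: iterable of (group, selected_bool) pairs -> selection rate per group."""
    counts, selected = defaultdict(int), defaultdict(int)
    for group, is_selected in decisions:
        counts[group] += 1
        selected[group] += int(is_selected)
    return {g: selected[g] / counts[g] for g in counts}

def adverse_impact_ratios(decisions):
    """Ratio of each group's selection rate to the highest group's rate;
    ratios below 0.8 are commonly flagged under the four-fifths rule."""
    rates = selection_rates(decisions)
    top = max(rates.values())
    return {g: (rate / top if top > 0 else 0.0) for g, rate in rates.items()}
```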