arXiv:2406.16746v3
Foundation model development attracts a rapidly expanding body of
contributors, scientists, and applications. To help shape responsible
development practices, we introduce the Foundation Model Development
Cheatsheet: a growing collection of 250+ tools and resources spanning text,
vision, and speech modalities. We draw on a large body of prior work to survey
resources (e.g. software, documentation, frameworks, guides, and practical
tools) that support informed data selection, processing, and understanding,
precise and limitation-aware artifact documentation, efficient model training,
advance awareness of the environmental impact from training, careful model
evaluation of capabilities, risks, and claims, as well as responsible model
release, licensing and deployment practices. We hope this curated collection of
resources helps guide more responsible development. The process of curating
this list enabled us to review the AI development ecosystem, revealing what
tools are critically missing, misused, or over-used in existing practices. We
find that (i) tools for data sourcing, model evaluation, and monitoring are
critically under-serving ethical and real-world needs, (ii) evaluations for
model safety, capabilities, and environmental impact all lack reproducibility
and transparency, (iii) text and particularly English-centric analyses continue
to dominate over multilingual and multi-modal analyses, and (iv) evaluation of
systems, rather than just models, is needed so that capabilities and impact are
assessed in context.
The opportunities and risks of large language models in mental health
Global rates of mental health concerns are rising, and there is increasing
realization that existing models of mental health care will not adequately
expand to meet the demand. With the emergence of large language models (LLMs)
has come great optimism regarding their promise to create novel, large-scale
solutions to support mental health. Despite their nascence, LLMs have already
been applied to mental health related tasks. In this paper, we summarize the
extant literature on efforts to use LLMs to provide mental health education,
assessment, and intervention and highlight key opportunities for positive
impact in each area. We then highlight risks associated with LLMs' application
to mental health and encourage the adoption of strategies to mitigate these
risks. The urgent need for mental health support must be balanced with
responsible development, testing, and deployment of mental health LLMs. It is
especially critical to ensure that mental health LLMs are fine-tuned for mental
health, enhance mental health equity, and adhere to ethical standards, and that
people, including those with lived experience with mental health concerns, are
involved in all stages from development through deployment. Prioritizing these
efforts will minimize potential harms to mental health and maximize the
likelihood that LLMs will positively impact mental health globally.
In health, most large language model (LLM) research has focused on clinical
tasks. However, mobile and wearable devices, which are rarely integrated into
such tasks, provide rich, longitudinal data for personal health monitoring.
Here we present Personal Health Large Language Model (PH-LLM), fine-tuned from
Gemini for understanding and reasoning over numerical time-series personal
health data. We created and curated three datasets that test 1) production of
personalized insights and recommendations from sleep patterns, physical
activity, and physiological responses, 2) expert domain knowledge, and 3)
prediction of self-reported sleep outcomes. For the first task we designed 857
case studies in collaboration with domain experts to assess real-world
scenarios in sleep and fitness. Through comprehensive evaluation against
domain-specific rubrics, we observed that Gemini Ultra 1.0 and PH-LLM are not
statistically different from expert performance in fitness and, while experts
remain superior for sleep, fine-tuning PH-LLM provided significant improvements
in using relevant domain knowledge and personalizing information for sleep
insights. We evaluated PH-LLM's domain knowledge using multiple-choice sleep
medicine and fitness examinations. PH-LLM achieved 79% on sleep and 88% on
fitness, exceeding average scores from a sample of human experts. Finally, we
trained PH-LLM to predict self-reported sleep quality outcomes from textual and
multimodal encoding representations of wearable data, and demonstrated that
multimodal encoding is required to match the performance of specialized
discriminative models. Although further development and evaluation are
necessary in the safety-critical personal health domain, these results
demonstrate both the broad knowledge and capabilities of Gemini models and the
benefit of contextualizing physiological data for personal health applications
as done with PH-LLM.
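To make the textual-encoding idea concrete, the following is a minimal sketch (not the PH-LLM pipeline) of one way to serialize numerical wearable time series into a plain-text prompt that a general-purpose LLM can reason over. The field names, prompt wording, and example values are illustrative assumptions.

```python
# Hedged sketch: serialize nightly wearable summaries into a text prompt.
# Field names and prompt phrasing are assumptions, not the PH-LLM format.
from dataclasses import dataclass
from typing import List

@dataclass
class NightlySleep:
    date: str
    total_sleep_min: int      # total sleep duration in minutes
    deep_sleep_min: int       # deep (slow-wave) sleep in minutes
    resting_hr_bpm: int       # resting heart rate in beats per minute

def to_prompt(nights: List[NightlySleep], question: str) -> str:
    """Serialize a window of nightly summaries into a plain-text prompt."""
    lines = [
        f"{n.date}: slept {n.total_sleep_min} min "
        f"({n.deep_sleep_min} min deep), resting HR {n.resting_hr_bpm} bpm"
        for n in nights
    ]
    return (
        "You are given recent nightly wearable summaries:\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}\nAnswer with a short, personalized insight."
    )

if __name__ == "__main__":
    nights = [NightlySleep("2024-05-01", 412, 55, 58),
              NightlySleep("2024-05-02", 365, 41, 61)]
    print(to_prompt(nights, "What stands out about my recent sleep?"))
```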
arXiv:2404.18416v2
Excellence in a wide variety of medical applications poses considerable
challenges for AI, requiring advanced reasoning, access to up-to-date medical
knowledge and understanding of complex multimodal data. Gemini models, with
strong general capabilities in multimodal and long-context reasoning, offer
exciting possibilities in medicine. Building on these core strengths of Gemini,
we introduce Med-Gemini, a family of highly capable multimodal models that are
specialized in medicine with the ability to seamlessly use web search, and that
can be efficiently tailored to novel modalities using custom encoders. We
evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art
(SoTA) performance on 10 of them and surpassing the GPT-4 model family on every
benchmark where a direct comparison is viable, often by a wide margin. On the
popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves
SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search
strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU
(health & medicine), Med-Gemini improves over GPT-4V by an average relative
margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context
capabilities through SoTA performance on a needle-in-a-haystack retrieval task
from long de-identified health records and medical video question answering,
surpassing prior bespoke methods using only in-context learning. Finally,
Med-Gemini's performance suggests real-world utility by surpassing human
experts on tasks such as medical text summarization, alongside demonstrations
of promising potential for multimodal medical dialogue, medical research and
education. Taken together, our results offer compelling evidence for
Med-Gemini's potential, although further rigorous evaluation will be crucial
before real-world deployment in this safety-critical domain.
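As a rough illustration of the uncertainty-guided search idea, the sketch below samples several answers, treats disagreement as a signal of uncertainty, and retries with retrieved evidence when confidence is low. The helpers `generate_answers` and `web_search` and the 0.8 threshold are hypothetical stand-ins, not the Med-Gemini implementation or any real API.

```python
# Hedged sketch of an uncertainty-guided search loop; helpers are assumed.
from collections import Counter

def generate_answers(model, prompt: str, n: int = 5) -> list:
    """Assumed helper: sample n answers (e.g. multiple-choice letters)."""
    return [model(prompt) for _ in range(n)]

def web_search(query: str) -> str:
    """Assumed helper: return a snippet of retrieved evidence for the query."""
    raise NotImplementedError

def answer_with_uncertainty_guided_search(model, question: str,
                                          max_rounds: int = 3) -> str:
    prompt = question
    for _ in range(max_rounds):
        answers = generate_answers(model, prompt)
        top_answer, votes = Counter(answers).most_common(1)[0]
        # If the sampled answers largely agree, treat the model as confident.
        if votes / len(answers) >= 0.8:
            return top_answer
        # Otherwise retrieve external evidence and retry with augmented context.
        evidence = web_search(question)
        prompt = f"Context:\n{evidence}\n\nQuestion: {question}"
    return top_answer
```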
arXiv:2304.03243v1
Synthetic data are becoming a critical tool for building artificially
intelligent systems. Simulators provide a way of generating data systematically
and at scale. These data can then be used either exclusively, or in conjunction
with real data, for training and testing systems. Synthetic data are
particularly attractive in cases where the availability of "real" training
examples might be a bottleneck. While the volume of data in healthcare is
growing exponentially, creating datasets for novel tasks and/or that reflect a
diverse set of conditions and causal relationships is not trivial. Furthermore,
these data are highly sensitive and often patient specific. Recent research has
begun to illustrate the potential for synthetic data in many areas of medicine,
but no systematic review of the literature exists. In this paper, we present
the cases for physical and statistical simulations for creating data and the
proposed applications in healthcare and medicine. We discuss that, while
synthetic data can promote privacy, equity, safety, and continual and causal
learning, they also run the risk of introducing flaws and blind spots and of
propagating or exaggerating biases.
With the growing reliance on artificial intelligence (AI) for many different
applications, the sharing of code, data, and models is important to ensure the
replicability and democratization of scientific knowledge. Many high-profile
academic publishing venues expect code and models to be submitted and released
with papers. Furthermore, developers often want to release these assets to
encourage development of technology that leverages their frameworks and
services. A number of organizations have expressed concerns about the
inappropriate or irresponsible use of AI and have proposed ethical guidelines
around the application of such systems. While such guidelines can help set
norms and shape policy, they are not easily enforceable. In this paper, we
advocate the use of licensing to enable legally enforceable behavioral use
conditions on software and code and provide several case studies that
demonstrate the feasibility of behavioral use licensing. We envision how
licensing may be implemented in accordance with existing responsible AI
guidelines.
CausalCity: Complex Simulations with Agency for Causal Discovery and
Reasoning
arXiv:2106.13364v1
The ability to perform causal and counterfactual reasoning is a central
property of human intelligence. Decision-making systems that can perform
these types of reasoning have the potential to be more generalizable and
interpretable. Simulations have helped advance the state-of-the-art in this
domain, by providing the ability to systematically vary parameters (e.g.,
confounders) and generate examples of the outcomes in the case of
counterfactual scenarios. However, simulating complex temporal causal events in
multi-agent scenarios, such as those that exist in driving and vehicle
navigation, is challenging. To help address this, we present a high-fidelity
simulation environment that is designed for developing algorithms for causal
discovery and counterfactual reasoning in this safety-critical context. A core
component of our work is to introduce agency, so that complex scenarios can be
defined and created from high-level objective definitions. The vehicles then
operate with agency to complete these objectives, meaning
low-level behaviors need only be controlled if necessary. We perform
experiments with three state-of-the-art methods to create baselines and
highlight the affordances of this environment. Finally, we highlight challenges
and opportunities for future work.
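To illustrate what "high-level definitions with agency" could look like in practice, here is a purely hypothetical sketch (not the actual CausalCity API): agents are given objectives rather than frame-by-frame controls, and confounders are varied systematically to generate counterfactual scenario variants.

```python
# Hypothetical scenario definition; keys and values are illustrative only.
scenario = {
    "name": "unprotected_left_turn",
    "confounders": {"weather": "rain", "traffic_density": "high"},
    "agents": [
        {"id": "ego",   "type": "vehicle", "objective": "turn_left_at_junction_3"},
        {"id": "npc_1", "type": "vehicle", "objective": "drive_route_A"},
    ],
}

def counterfactuals(base: dict, factor: str, values: list) -> list:
    """Generate scenario variants that differ only in one confounder."""
    return [{**base, "confounders": {**base["confounders"], factor: v}}
            for v in values]

variants = counterfactuals(scenario, "weather", ["clear", "rain", "fog"])
```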
Incorporating Human Explanations for Robust Hate Speech Detection
Given the black-box nature and complexity of large transformer language
models (LMs), concerns about generalizability and robustness present ethical
implications for domains such as hate speech (HS) detection. Using the
content-rich Social Bias Frames dataset, containing human-annotated stereotypes,
intent, and targeted groups, we develop a three-stage analysis to evaluate whether
LMs faithfully assess hate speech. First, we observe the need for modeling
contextually grounded stereotype intents to capture implicit semantic meaning.
Next, we design a new task, Stereotype Intent Entailment (SIE), which
encourages a model to contextually understand stereotype presence. Finally,
through ablation tests and user studies, we find that an SIE objective improves
content understanding, but challenges remain in modeling implicit intent.
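For intuition, the sketch below shows one NLI-style way such a stereotype-entailment pairing could be scored (not the paper's implementation): the post is treated as the premise and an annotated stereotype as the hypothesis, scored by an off-the-shelf NLI checkpoint. The choice of roberta-large-mnli and the example strings are assumptions.

```python
# Hedged NLI-style scoring of (post, stereotype) pairs; illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "roberta-large-mnli"  # assumed off-the-shelf NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def entailment_scores(post: str, stereotype: str) -> dict:
    """Return contradiction/neutral/entailment probabilities for the pair."""
    inputs = tokenizer(post, stereotype, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    labels = [model.config.id2label[i] for i in range(probs.shape[0])]
    return dict(zip(labels, probs.tolist()))

scores = entailment_scores(
    "Example post text goes here.",
    "This post implies a stereotype about the targeted group.",
)
print(scores)
```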
Towards evaluations-based safety cases for AI scheming
arXiv:2411.03336v2
We sketch how developers of frontier AI systems could construct a structured
rationale -- a 'safety case' -- that an AI system is unlikely to cause
catastrophic outcomes through scheming. Scheming is a potential threat model
where AI systems could pursue misaligned goals covertly, hiding their true
capabilities and objectives. In this report, we propose three arguments that
safety cases could use in relation to scheming. For each argument we sketch how
evidence could be gathered from empirical evaluations, and what assumptions
would need to be met to provide strong assurance. First, developers of frontier
AI systems could argue that AI systems are not capable of scheming (Scheming
Inability). Second, one could argue that AI systems are not capable of posing
harm through scheming (Harm Inability). Third, one could argue that control
measures around the AI systems would prevent unacceptable outcomes even if the
AI systems intentionally attempted to subvert them (Harm Control).
Additionally, we discuss how safety cases might be supported by evidence that
an AI system is reasonably aligned with its developers (Alignment). Finally, we
point out that many of the assumptions required to make these safety arguments
have not been confidently satisfied to date and require making progress on
multiple open research problems.
2 pages, Accepted at the Network Mobility (NetMob) 2024 conference
This work examines the fairness of generative mobility models, addressing the
often overlooked dimension of equity in model performance across geographic
regions. Predictive models built on crowd flow data are instrumental in
understanding urban structures and movement patterns; however, they risk
embedding biases, particularly in spatiotemporal contexts where model
performance may reflect and reinforce existing inequities tied to geographic
distribution. We propose a novel framework for assessing fairness by measuring
the utility and equity of generated traces. Utility is assessed via the Common
Part of Commuters (CPC), a similarity metric comparing generated and real
mobility flows, while fairness is evaluated using demographic parity. By
reformulating demographic parity to reflect the difference in CPC distribution
between two groups, our analysis reveals disparities in how various models
encode biases present in the underlying data. We evaluated four models
(Gravity, Radiation, Deep Gravity, and Non-linear Gravity); our results indicate
that the traditional gravity and radiation models produce fairer outcomes,
although Deep
Gravity achieves higher CPC. This disparity underscores a trade-off between
model accuracy and equity, with the feature-rich Deep Gravity model amplifying
pre-existing biases in community representations. Our findings emphasize the
importance of integrating fairness metrics in mobility modeling to avoid
perpetuating inequities.
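The two quantities combined in this abstract can be made concrete with a small sketch, under assumptions: CPC as a utility metric between real and generated flows, and a demographic-parity-style gap computed as the difference in mean per-region CPC between two groups of regions. The grouping and aggregation below are illustrative, not the paper's exact formulation.

```python
# Hedged sketch of CPC and a CPC-based parity gap; toy data, assumed grouping.
import numpy as np

def cpc(real: np.ndarray, generated: np.ndarray) -> float:
    """Common Part of Commuters: 2*sum(min(r, g)) / (sum(r) + sum(g))."""
    return 2.0 * np.minimum(real, generated).sum() / (real.sum() + generated.sum())

def cpc_parity_gap(real_flows, gen_flows, group_labels) -> float:
    """Absolute difference in mean per-region CPC between two groups."""
    per_region = np.array([cpc(r, g) for r, g in zip(real_flows, gen_flows)])
    groups = np.asarray(group_labels)
    return abs(per_region[groups == 0].mean() - per_region[groups == 1].mean())

# Toy usage: 4 origin regions, each with outgoing flows to 3 destinations.
real = [np.array([10, 5, 0]), np.array([2, 8, 4]),
        np.array([7, 1, 1]), np.array([0, 3, 9])]
gen  = [np.array([8, 6, 1]),  np.array([3, 7, 2]),
        np.array([5, 2, 2]),  np.array([1, 2, 10])]
labels = [0, 0, 1, 1]  # e.g., two geographic groups
print(cpc_parity_gap(real, gen, labels))
```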