arXiv:2411.03395v1
Large language models (LLMs) have shown remarkable progress in encoding
clinical knowledge and responding to complex medical queries with appropriate
clinical reasoning. However, their applicability in subspecialist or complex
medical settings remains underexplored. In this work, we probe the performance
of AMIE, a research conversational diagnostic AI system, in the subspecialist
domain of breast oncology care without specific fine-tuning to this challenging
domain. To perform this evaluation, we curated a set of 50 synthetic breast
cancer vignettes representing a range of treatment-naive and
treatment-refractory cases and mirroring the key information available to a
multidisciplinary tumor board for decision-making (openly released with this
work). We developed a detailed clinical rubric for evaluating management plans,
including axes such as the quality of case summarization, safety of the
proposed care plan, and recommendations for chemotherapy, radiotherapy, surgery
and hormonal therapy. To improve performance, we enhanced AMIE with the
inference-time ability to perform web search retrieval to gather relevant and
up-to-date clinical knowledge and refine its responses with a multi-stage
self-critique pipeline. We compare the response quality of AMIE with that of internal
medicine trainees, oncology fellows, and general oncology attendings under both
automated and specialist clinician evaluations. In our evaluations, AMIE
outperformed trainees and fellows, demonstrating the potential of the system in
this challenging and important domain. We further demonstrate, through
qualitative examples, how systems such as AMIE might facilitate conversational
interactions to assist clinicians in their decision-making. However, AMIE's
performance was overall inferior to that of attending oncologists, suggesting that
further research is needed prior to consideration of prospective uses.
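The inference-time retrieval and multi-stage self-critique described above can be pictured as a simple loop: draft a plan from the case plus retrieved evidence, critique the draft, and revise. The sketch below illustrates that pattern only; `generate` and `web_search` are hypothetical stubs, not AMIE's actual interface.

```python
# Minimal sketch of a search-then-self-critique pipeline. The two stubs
# below are hypothetical placeholders so the sketch is self-contained;
# a real system would call an LLM and a search engine here.
def generate(prompt: str) -> str:
    return "stub model output"

def web_search(query: str) -> str:
    return "stub search snippet"

def manage_case(vignette: str, n_rounds: int = 2) -> str:
    # Gather up-to-date clinical evidence relevant to the case.
    evidence = web_search(f"current treatment guidelines for: {vignette}")
    draft = generate(
        f"Case:\n{vignette}\n\nEvidence:\n{evidence}\n\nPropose a management plan."
    )
    # Multi-stage self-critique: alternate critique and revision rounds.
    for _ in range(n_rounds):
        critique = generate(f"Critique this plan for safety and guideline adherence:\n{draft}")
        draft = generate(f"Revise the plan to address the critique:\n{draft}\n{critique}")
    return draft
```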
Advancing Multimodal Medical Capabilities of Gemini
arXiv:2405.03162v1
Many clinical tasks require an understanding of specialized data, such as
medical images and genomics, which is not typically found in general-purpose
large multimodal models. Building upon Gemini's multimodal models, we develop
several models within the new Med-Gemini family that inherit core capabilities
of Gemini and are optimized for medical use via fine-tuning with 2D and 3D
radiology, histopathology, ophthalmology, dermatology and genomic data.
Med-Gemini-2D sets a new standard for AI-based chest X-ray (CXR) report
generation based on expert evaluation, exceeding previous best results across
two separate datasets by absolute margins of 1% and 12%, where 57% and 96% of
AI reports on normal cases, and 43% and 65% on abnormal cases, are evaluated as
"equivalent or better" than the original radiologists' reports. We demonstrate
the first-ever large multimodal model-based report generation for 3D computed
tomography (CT) volumes using Med-Gemini-3D, with 53% of AI reports considered
clinically acceptable, although additional research is needed to meet expert
radiologist reporting quality. Beyond report generation, Med-Gemini-2D
surpasses the previous best performance in CXR visual question answering (VQA)
and performs well in CXR classification and radiology VQA, exceeding SoTA or
baselines on 17 of 20 tasks. In histopathology, ophthalmology, and dermatology
image classification, Med-Gemini-2D surpasses baselines across 18 out of 20
tasks and approaches task-specific model performance. Beyond imaging,
Med-Gemini-Polygenic outperforms the standard linear polygenic risk score-based
approach for disease risk prediction and generalizes to genetically correlated
diseases for which it has never been trained. Although further development and
evaluation are necessary in the safety-critical medical domain, our results
highlight the potential of Med-Gemini across a wide range of medical tasks.
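The headline CXR numbers above are fractions of AI reports that experts rated "equivalent or better" than the original radiologist's report, split by normal and abnormal cases. A minimal sketch of that tally; the ratings list and its field names are invented purely for illustration:

```python
# Sketch: share of AI reports rated "equivalent or better" than the
# original radiologist's report, split by case type. Illustrative data only.
ratings = [
    {"case": "normal",   "verdict": "equivalent"},
    {"case": "normal",   "verdict": "worse"},
    {"case": "abnormal", "verdict": "better"},
]

def share_equivalent_or_better(ratings: list[dict], case_type: str) -> float:
    subset = [r for r in ratings if r["case"] == case_type]
    hits = sum(r["verdict"] in ("equivalent", "better") for r in subset)
    return hits / len(subset)

print(share_equivalent_or_better(ratings, "normal"))    # 0.5
print(share_equivalent_or_better(ratings, "abnormal"))  # 1.0
```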
arXiv:2404.18416v2
Excellence in a wide variety of medical applications poses considerable
challenges for AI, requiring advanced reasoning, access to up-to-date medical
knowledge and understanding of complex multimodal data. Gemini models, with
strong general capabilities in multimodal and long-context reasoning, offer
exciting possibilities in medicine. Building on these core strengths of Gemini,
we introduce Med-Gemini, a family of highly capable multimodal models that are
specialized in medicine with the ability to seamlessly use web search, and that
can be efficiently tailored to novel modalities using custom encoders. We
evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art
(SoTA) performance on 10 of them, and surpassing the GPT-4 model family on every
benchmark where a direct comparison is viable, often by a wide margin. On the
popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves
SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search
strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU
(health & medicine), Med-Gemini improves over GPT-4V by an average relative
margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context
capabilities through SoTA performance on a needle-in-a-haystack retrieval task
from long de-identified health records and medical video question answering,
surpassing prior bespoke methods using only in-context learning. Finally,
Med-Gemini's performance suggests real-world utility by surpassing human
experts on tasks such as medical text summarization, alongside demonstrations
of promising potential for multimodal medical dialogue, medical research and
education. Taken together, our results offer compelling evidence for
Med-Gemini's potential, although further rigorous evaluation will be crucial
before real-world deployment in this safety-critical domain.
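The uncertainty-guided search strategy mentioned above can be read as: sample several candidate answers, treat their disagreement as an uncertainty signal, and invoke web search only when that signal is high. The sketch below illustrates this reading with hypothetical `generate` and `web_search` stubs; it is not the paper's exact algorithm.

```python
from collections import Counter

# Hypothetical stubs so the sketch is self-contained; a real system would
# sample an LLM (with temperature) and query a search engine here.
def generate(prompt: str) -> str:
    return "candidate answer"

def web_search(query: str) -> str:
    return "retrieved evidence"

def answer_with_uncertainty_guided_search(question: str,
                                          n_samples: int = 5,
                                          agreement_threshold: float = 0.6) -> str:
    # Sample several answers; their agreement is a cheap uncertainty proxy.
    samples = [generate(question) for _ in range(n_samples)]
    top_answer, top_count = Counter(samples).most_common(1)[0]
    if top_count / n_samples >= agreement_threshold:
        return top_answer  # high agreement: answer without retrieval
    # Low agreement: retrieve evidence and answer again conditioned on it.
    evidence = web_search(question)
    return generate(f"{question}\n\nEvidence:\n{evidence}")
```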
M-RewardBench: Evaluating Reward Models in Multilingual Settings
Reward models (RMs) have driven the state-of-the-art performance of LLMs
today by enabling the integration of human feedback into the language modeling
process. However, RMs are primarily trained and evaluated in English, and their
capabilities in multilingual settings remain largely understudied. In this
work, we conduct a systematic evaluation of several reward models in
multilingual settings. We first construct the first-of-its-kind multilingual RM
evaluation benchmark, M-RewardBench, consisting of 2.87k preference instances
for 23 typologically diverse languages, which tests the chat, safety, reasoning,
and translation capabilities of RMs. We then rigorously evaluate a wide range
of reward models on M-RewardBench, offering fresh insights into their
performance across diverse languages. We identify a significant gap in RM
performance between English and non-English languages and show that RM
preferences can change substantially from one language to another. We also
present several findings on how different multilingual aspects impact RM
performance. Specifically, we show that RM performance improves with better
translation quality. Similarly, we demonstrate that the models exhibit
better performance for high-resource languages. We release the M-RewardBench
dataset and codebase to facilitate a better understanding of RM evaluation in
multilingual settings.
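Benchmarks of this kind typically score an RM by whether it assigns the chosen response a higher reward than the rejected one, aggregated per language. The sketch below shows that accuracy computation; the `score` stub and the instances are invented stand-ins, not the M-RewardBench pipeline.

```python
# Sketch: per-language preference accuracy for a reward model. `score` is
# a placeholder; a real evaluation would run the RM on (prompt, response).
def score(prompt: str, response: str) -> float:
    return float(len(response))  # placeholder reward signal

# Each instance pairs a preferred ("chosen") and dispreferred ("rejected")
# response with its language. The data below is invented for illustration.
instances = [
    {"lang": "de", "prompt": "p1", "chosen": "a longer answer", "rejected": "short"},
    {"lang": "sw", "prompt": "p2", "chosen": "ok", "rejected": "a much longer answer"},
]

hits_by_lang: dict[str, list[int]] = {}
for ex in instances:
    correct = score(ex["prompt"], ex["chosen"]) > score(ex["prompt"], ex["rejected"])
    hits_by_lang.setdefault(ex["lang"], []).append(int(correct))

for lang, hits in hits_by_lang.items():
    print(lang, sum(hits) / len(hits))  # de 1.0, sw 0.0 with this stub
```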
arXiv:2410.21276v1
GPT-4o is an autoregressive omni model that accepts as input any combination
of text, audio, image, and video, and generates any combination of text, audio,
and image outputs. It's trained end-to-end across text, vision, and audio,
meaning all inputs and outputs are processed by the same neural network. GPT-4o
can respond to audio inputs in as little as 232 milliseconds, with an average
of 320 milliseconds, which is similar to human response time in conversation.
It matches GPT-4 Turbo performance on English text and code, with
significant improvement on text in non-English languages, while also being much
faster and 50% cheaper in the API. GPT-4o is notably better than existing models
at vision and audio understanding. In line with our commitment to
building AI safely and consistent with our voluntary commitments to the White
House, we are sharing the GPT-4o System Card, which includes our Preparedness
Framework evaluations. In this System Card, we provide a detailed look at
GPT-4o's capabilities, limitations, and safety evaluations across multiple
categories, focusing on speech-to-speech while also evaluating text and image
capabilities, and measures we've implemented to ensure the model is safe and
aligned. We also include third-party assessments on dangerous capabilities, as
well as discussion of potential societal impacts of GPT-4o's text and vision
capabilities.
Competency-Aware Planning for Probabilistically Safe Navigation Under
Perception Uncertainty
arXiv:2409.06111v2
Perception-based navigation systems are useful for unmanned ground vehicle
(UGV) navigation in complex terrains, where traditional depth-based navigation
schemes are insufficient. However, these data-driven methods are highly
dependent on their training data and can fail in surprising and dramatic ways
with little warning. To ensure the safety of the vehicle and the surrounding
environment, it is imperative that the navigation system is able to recognize
the predictive uncertainty of the perception model and respond safely and
effectively in the face of uncertainty. In an effort to enable safe navigation
under perception uncertainty, we develop a probabilistic and
reconstruction-based competency estimation (PaRCE) method to estimate the
model's level of familiarity with an input image as a whole and with specific
regions in the image. We find that the overall competency score reliably
distinguishes correctly classified, misclassified, and out-of-distribution (OOD)
samples. We also confirm that the regional competency maps can accurately
distinguish between familiar and unfamiliar regions across images. We then use
this competency information to develop a planning and control scheme that
enables effective navigation while maintaining a low probability of error. We
find that the competency-aware scheme greatly reduces the number of collisions
with unfamiliar obstacles, compared to a baseline controller with no competency
awareness. Furthermore, the regional competency information is particularly
valuable for enabling efficient navigation.
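Reconstruction-based competency estimation generally treats reconstruction error as a familiarity signal: inputs far from the training distribution reconstruct poorly. The sketch below illustrates that idea with a stand-in linear "autoencoder" built from principal directions; it is an illustrative assumption, not the PaRCE implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "autoencoder": project onto a low-dimensional subspace fit to
# training images and back. A real system would use a learned DNN.
train = rng.normal(size=(500, 64))             # flattened training images
_, _, components = np.linalg.svd(train, full_matrices=False)
basis = components[:8]                          # top-8 principal directions

def reconstruction_error(x: np.ndarray) -> float:
    recon = (x @ basis.T) @ basis
    return float(np.mean((x - recon) ** 2))

# Calibrate on training data, then score competency by how typical the
# input's error is relative to the training error distribution.
train_errors = np.array([reconstruction_error(x) for x in train])
mu, sigma = train_errors.mean(), train_errors.std()

def competency(x: np.ndarray) -> float:
    z = (reconstruction_error(x) - mu) / sigma
    return float(np.exp(-max(z, 0.0)))          # 1 = familiar, ->0 = unfamiliar

print(competency(train[0]))                      # high: in-distribution
print(competency(rng.normal(scale=5, size=64)))  # low: out-of-distribution
```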
Constrained Recurrent Bayesian Forecasting for Crack Propagation
arXiv:2410.14761v1
Predictive maintenance of railway infrastructure, especially railroads, is
essential to ensure safety. However, accurate prediction of crack evolution
represents a major challenge due to the complex interactions between intrinsic
and external factors, as well as measurement uncertainties. Effective modeling
requires a multidimensional approach and a comprehensive understanding of these
dynamics and uncertainties. Motivated by an industrial use case based on
collected real data containing measured crack lengths, this paper introduces a
robust Bayesian multi-horizon approach for predicting the temporal evolution of
crack lengths on rails. This model captures the intricate interplay between
various factors influencing crack growth. Additionally, the Bayesian approach
quantifies both epistemic and aleatoric uncertainties, providing a confidence
interval around predictions. To enhance the model's reliability for railroad
maintenance, specific constraints are incorporated. These constraints limit
non-physical crack propagation behavior and prioritize safety. The findings
reveal a trade-off between prediction accuracy and constraint compliance,
highlighting the nuanced decision-making process in model training. This study
offers insights into advanced predictive modeling for dynamic temporal
forecasting, particularly in railway maintenance, with potential applications
in other domains.
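One common way to limit non-physical crack propagation behavior is a soft constraint that penalizes forecasts in which the predicted crack length shrinks between horizons, since cracks do not heal. The loss sketch below shows that idea in plain NumPy; the penalty form and weight are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def constrained_forecast_loss(pred: np.ndarray,
                              target: np.ndarray,
                              penalty_weight: float = 10.0) -> float:
    """Multi-horizon loss = fit term + soft monotonicity constraint.

    pred, target: crack lengths over future horizons, shape (horizons,).
    The penalty fires only where the forecast decreases between steps,
    i.e. where the model predicts non-physical crack shrinkage.
    """
    fit = np.mean((pred - target) ** 2)
    decreases = np.clip(pred[:-1] - pred[1:], 0.0, None)  # positive = shrinkage
    return float(fit + penalty_weight * np.mean(decreases ** 2))

# A forecast that dips at step 3 is penalized more than a monotone one.
target = np.array([1.0, 1.2, 1.5, 1.9])
print(constrained_forecast_loss(np.array([1.0, 1.3, 1.1, 1.9]), target))  # larger
print(constrained_forecast_loss(np.array([1.0, 1.2, 1.4, 1.9]), target))  # smaller
```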
Mix Data or Merge Models? Optimizing for Diverse Multi-Task Learning
arXiv:2410.10801v1
Large Language Models (LLMs) have been adopted and deployed worldwide for a
broad variety of applications. However, ensuring their safe use remains a
significant challenge. Preference training and safety measures often overfit to
harms prevalent in Western-centric datasets, and safety protocols frequently
fail to extend to multilingual settings. In this work, we explore model merging
in a diverse multi-task setting, combining safety and general-purpose tasks
within a multilingual context. Each language introduces unique and varied
learning challenges across tasks. We find that objective-based merging is more
effective than mixing data, with improvements of up to 8% and 10% in general
performance and safety, respectively. We also find that language-based merging
is highly effective -- by merging monolingually fine-tuned models, we achieve a
4% increase in general performance and a 7% reduction in harm across all
languages, on top of the data mixtures method, using the same available data.
Overall, our comprehensive study of merging approaches provides a useful
framework for building strong and safe multilingual models.
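Model merging in this setting means combining the weights of separately fine-tuned checkpoints rather than mixing their training data; the simplest form is a weighted average of parameters. The sketch below shows such a merge over state dicts represented as NumPy arrays; the uniform weighting is an illustrative choice, not the specific scheme the paper found best.

```python
import numpy as np

def merge_checkpoints(state_dicts: list[dict[str, np.ndarray]],
                      weights: list[float] | None = None) -> dict[str, np.ndarray]:
    """Weighted average of parameter tensors across fine-tuned models."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# e.g. merging a safety-tuned and a general-purpose checkpoint 50/50
safety_model = {"layer.weight": np.ones((2, 2))}
general_model = {"layer.weight": np.zeros((2, 2))}
print(merge_checkpoints([safety_model, general_model])["layer.weight"])
# [[0.5 0.5]
#  [0.5 0.5]]
```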
arXiv:2410.01276v1
Machine unlearning (MU) aims to remove the influence of particular data
points from the learnable parameters of a trained machine learning model. This
is a crucial capability in light of data privacy requirements, trustworthiness,
and safety in deployed models. MU is particularly challenging for deep neural
networks (DNNs), such as convolutional nets or vision transformers, as such
DNNs tend to memorize a notable portion of their training dataset.
Nevertheless, the community lacks a rigorous and multifaceted study that looks
into the success of MU methods for DNNs. In this paper, we investigate 18
state-of-the-art MU methods across various benchmark datasets and models, with
each evaluation conducted over 10 different initializations, amounting to a
comprehensive evaluation spanning over 100K models. We show that, with proper
hyperparameters, Masked Small Gradients (MSG) and Convolution Transpose (CT)
consistently perform better in terms of model accuracy and run-time efficiency
across different models, datasets, and initializations, assessed by
population-based membership inference attacks (MIA) and per-sample unlearning
likelihood ratio attacks (U-LiRA). Furthermore, our benchmark highlights that
comparing an MU method only with commonly used baselines, such as
Gradient Ascent (GA) or Successive Random Relabeling (SRL), is inadequate, and
we need better baselines like Negative Gradient Plus (NG+) with proper
hyperparameter selection.
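A population-based MIA, as used above to judge unlearning, asks whether an attacker can tell forget-set examples from genuinely unseen ones using the unlearned model's outputs; if unlearning worked, the two should be indistinguishable. A minimal sketch with per-example losses as the attack signal follows; the loss data is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-example losses of the *unlearned* model. Invented data: if unlearning
# succeeded, forget-set losses should look like unseen-test losses.
forget_losses = rng.normal(loc=0.9, scale=0.3, size=1000)  # was in training
unseen_losses = rng.normal(loc=1.0, scale=0.3, size=1000)  # never seen

def mia_accuracy(forget: np.ndarray, unseen: np.ndarray) -> float:
    """Best threshold attack: label 'member' when loss is below a cutoff.
    0.5 = attacker cannot distinguish (ideal unlearning); 1.0 = total leak."""
    losses = np.concatenate([forget, unseen])
    labels = np.concatenate([np.ones_like(forget), np.zeros_like(unseen)])
    best = 0.5
    for t in np.unique(losses):
        acc = np.mean((losses <= t) == labels)
        best = max(best, acc, 1 - acc)
    return float(best)

print(mia_accuracy(forget_losses, unseen_losses))  # closer to 0.5 is better
```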
SynBench: A Synthetic Benchmark for Non-rigid 3D Point Cloud
Registration
arXiv:2409.14474v1
Non-rigid point cloud registration is a crucial task in computer vision.
Evaluating a non-rigid point cloud registration method requires a dataset with
challenges such as large deformation levels, noise, outliers, and
incompleteness. Despite the existence of several datasets for deformable point
cloud registration, the absence of a comprehensive benchmark with all
challenges makes it difficult to achieve fair evaluations among different
methods. This paper introduces SynBench, a new non-rigid point cloud
registration dataset created using SimTool, a toolset for soft body simulation
in Flex and Unreal Engine. SynBench provides the ground truth of corresponding
points between two point sets and encompasses key registration challenges,
including varying levels of deformation, noise, outliers, and incompleteness.
To the best of the authors' knowledge, compared to existing datasets, SynBench
possesses three particular characteristics: (1) it is the first benchmark that
provides various challenges for non-rigid point cloud registration, (2)
SynBench encompasses challenges of varying difficulty levels, and (3) it
includes ground truth corresponding points both before and after deformation.
The authors believe that SynBench enables fair comparison among future
non-rigid point cloud registration methods.
SynBench is publicly available at: https://doi.org/10.11588/data/R9IKCF.
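Ground-truth correspondences make evaluation straightforward: warp the source points with the method under test and measure how far each lands from its known counterpart in the target. A minimal sketch of that metric; the identity "registration" and the toy data are placeholders for a real method and real scans.

```python
import numpy as np

def registration_error(warped_source: np.ndarray,
                       target: np.ndarray,
                       correspondences: np.ndarray) -> float:
    """Mean distance between warped source points and their ground-truth
    counterparts. correspondences[i] = (source_idx, target_idx)."""
    src_idx, tgt_idx = correspondences[:, 0], correspondences[:, 1]
    dists = np.linalg.norm(warped_source[src_idx] - target[tgt_idx], axis=1)
    return float(dists.mean())

# Invented toy data: 100 corresponding points plus small noise in the target.
rng = np.random.default_rng(0)
source = rng.normal(size=(100, 3))
target = source + rng.normal(scale=0.01, size=(100, 3))
corr = np.stack([np.arange(100), np.arange(100)], axis=1)

# A perfect method would map source onto target; identity is our stand-in.
print(registration_error(source, target, corr))  # small, at the noise level
```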