In health, most large language model (LLM) research has focused on clinical
tasks. However, mobile and wearable devices, which are rarely integrated into
such tasks, provide rich, longitudinal data for personal health monitoring.
Here we present Personal Health Large Language Model (PH-LLM), fine-tuned from
Gemini for understanding and reasoning over numerical time-series personal
health data. We created and curated three datasets that test 1) production of
personalized insights and recommendations from sleep patterns, physical
activity, and physiological responses, 2) expert domain knowledge, and 3)
prediction of self-reported sleep outcomes. For the first task we designed 857
case studies in collaboration with domain experts to assess real-world
scenarios in sleep and fitness. Through comprehensive evaluation of
domain-specific rubrics, we observed that Gemini Ultra 1.0 and PH-LLM are not
statistically different from expert performance in fitness and, while experts
remain superior for sleep, fine-tuning PH-LLM provided significant improvements
in using relevant domain knowledge and personalizing information for sleep
insights. We evaluated PH-LLM domain knowledge using multiple choice sleep
medicine and fitness examinations. PH-LLM achieved 79% on sleep and 88% on
fitness, exceeding average scores from a sample of human experts. Finally, we
trained PH-LLM to predict self-reported sleep quality outcomes from textual and
multimodal encoding representations of wearable data, and demonstrate that
multimodal encoding is required to match performance of specialized
discriminative models. Although further development and evaluation are
necessary in the safety-critical personal health domain, these results
demonstrate both the broad knowledge and capabilities of Gemini models and the
benefit of contextualizing physiological data for personal health applications
as done with PH-LLM.
Advancing Multimodal Medical Capabilities of Gemini
arXiv:2405.03162v1
Many clinical tasks require an understanding of specialized data, such as
medical images and genomics, which is not typically found in general-purpose
large multimodal models. Building upon Gemini's multimodal models, we develop
several models within the new Med-Gemini family that inherit core capabilities
of Gemini and are optimized for medical use via fine-tuning with 2D and 3D
radiology, histopathology, ophthalmology, dermatology and genomic data.
Med-Gemini-2D sets a new standard for AI-based chest X-ray (CXR) report
generation based on expert evaluation, exceeding previous best results across
two separate datasets by an absolute margin of 1% and 12%, where 57% and 96% of
AI reports on normal cases, and 43% and 65% on abnormal cases, are evaluated as
"equivalent or better" than the original radiologists' reports. We demonstrate
the first ever large multimodal model-based report generation for 3D computed
tomography (CT) volumes using Med-Gemini-3D, with 53% of AI reports considered
clinically acceptable, although additional research is needed to meet expert
radiologist reporting quality. Beyond report generation, Med-Gemini-2D
surpasses the previous best performance in CXR visual question answering (VQA)
and performs well in CXR classification and radiology VQA, exceeding SoTA or
baselines on 17 of 20 tasks. In histopathology, ophthalmology, and dermatology
image classification, Med-Gemini-2D surpasses baselines across 18 out of 20
tasks and approaches task-specific model performance. Beyond imaging,
Med-Gemini-Polygenic outperforms the standard linear polygenic risk score-based
approach for disease risk prediction and generalizes to genetically correlated
diseases for which it has never been trained. Although further development and
evaluation are necessary in the safety-critical medical domain, our results
highlight the potential of Med-Gemini across a wide range of medical tasks.
arXiv:2410.21276v1
GPT-4o is an autoregressive omni model that accepts as input any combination
of text, audio, image, and video, and generates any combination of text, audio,
and image outputs. It's trained end-to-end across text, vision, and audio,
meaning all inputs and outputs are processed by the same neural network. GPT-4o
can respond to audio inputs in as little as 232 milliseconds, with an average
of 320 milliseconds, which is similar to human response time in conversation.
It matches GPT-4 Turbo performance on text in English and code, with
significant improvement on text in non-English languages, while also being much
faster and 50% cheaper in the API. GPT-4o is especially better at vision and
audio understanding compared to existing models. In line with our commitment to
building AI safely and consistent with our voluntary commitments to the White
House, we are sharing the GPT-4o System Card, which includes our Preparedness
Framework evaluations. In this System Card, we provide a detailed look at
GPT-4o's capabilities, limitations, and safety evaluations across multiple
categories, focusing on speech-to-speech while also evaluating text and image
capabilities, and measures we've implemented to ensure the model is safe and
aligned. We also include third-party assessments on dangerous capabilities, as
well as discussion of potential societal impacts of GPT-4o's text and vision
capabilities.
Advancing Healthcare: Innovative ML Approaches for Improved Medical
Imaging in Data-Constrained Environments
The healthcare industry faces challenges with rare diseases because samples
are limited. Artificial intelligence (AI) communities often work around this
by creating synthetic data, which raises ethical and privacy issues in the
medical domain. This research introduces the CAT-U-Net framework as a new
approach to overcome these limitations, which enhances feature extraction from
medical images without the need for large datasets. The proposed framework adds
an extra concatenation layer with downsampling parts, thereby improving its
ability to learn from limited data while maintaining patient privacy. To
validate the proposed framework's robustness, datasets for different medical
conditions were used, including COVID-19, brain tumors, and wrist fractures.
The framework achieved nearly 98% reconstruction accuracy, with a Dice
coefficient close to 0.946. The proposed CAT-U-Net has the potential to make a
big difference in medical image diagnostics in settings with limited data.
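The Dice coefficient reported above measures overlap between a predicted mask and a reference mask, defined as 2|A∩B| / (|A| + |B|). The following is a minimal sketch of that standard formula, not the authors' evaluation code; the function name `dice_coefficient` and the choice of flat 0/1 sequences as the mask representation are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two binary masks given as flat 0/1 sequences.

    Hypothetical helper for illustration; the abstract does not describe
    how the authors computed their score.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count positions where both masks are foreground.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground count across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total
```

A value of 1.0 means the masks coincide and 0.0 means they share no foreground element, so a score close to 0.946 indicates a large but imperfect overlap.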
Trustworthy Artificial Intelligence Framework for Proactive Detection
and Risk Explanation of Cyber Attacks in Smart Grid
The rapid growth of distributed energy resources (DERs), such as renewable
energy sources, generators, consumers, and prosumers in the smart grid
infrastructure, poses significant cybersecurity and trust challenges to the
grid controller. Consequently, it is crucial to identify adversarial tactics
and measure the strength of the attacker's DER. To enable a trustworthy smart
grid controller, this work investigates a trustworthy artificial intelligence
(AI) mechanism for proactive identification and explanation of the cyber risk
caused by the control/status messages of DERs. We therefore propose and develop a
trustworthy AI framework that facilitates the deployment of any AI algorithm for
detecting potential cyber threats and analyzing root causes based on Shapley
value interpretation while dynamically quantifying the risk of an attack based
on Ward's minimum variance formula. The experiment with a state-of-the-art
dataset establishes the proposed framework as a trustworthy AI by fulfilling
the capabilities of reliability, fairness, explainability, transparency,
reproducibility, and accountability.
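The Shapley value interpretation mentioned above attributes a model's output to individual inputs by averaging each input's marginal contribution over all orderings. The sketch below computes exact Shapley values by enumeration for a small set of players; it illustrates the general definition only and is not the framework's implementation. The function name `shapley_values` and the `value` callback are assumptions, and exact enumeration is only practical for a handful of features.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every ordering of the players.

    `players` is a list of hashable labels; `value` maps a frozenset of
    players to a number (a hypothetical coalition payoff function).
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins this coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}
```

For an additive payoff, where the coalition value is the sum of fixed per-player weights, each player's Shapley value equals its own weight, which is a convenient sanity check.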
Toward Fairness in Speech Recognition: Discovery and mitigation of
performance disparities
As with other forms of AI, speech recognition has recently been examined with
respect to performance disparities across different user cohorts. One approach
to achieve fairness in speech recognition is to (1) identify speaker cohorts
that suffer from subpar performance and (2) apply fairness mitigation measures
targeting the cohorts discovered. In this paper, we report on initial findings
with both discovery and mitigation of performance disparities using data from a
product-scale AI assistant speech recognition system. We compare cohort
discovery based on geographic and demographic information to a more scalable
method that groups speakers without human labels, using speaker embedding
technology. For fairness mitigation, we find that oversampling of
underrepresented cohorts, as well as modeling speaker cohort membership by
additional input variables, reduces the gap between top- and bottom-performing
cohorts, without deteriorating overall recognition accuracy.
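Oversampling of underrepresented cohorts, as described above, can be sketched as duplicating examples from smaller cohorts until every cohort matches the largest one. The abstract does not specify the sampling scheme, so the balancing target, the function name `oversample_cohorts`, and the `(cohort, item)` pair format are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def oversample_cohorts(samples, seed=0):
    """Balance (cohort, item) pairs by resampling smaller cohorts.

    Assumed scheme: draw with replacement from each cohort until it is
    as large as the largest cohort. Original items are always kept.
    """
    if not samples:
        return []
    rng = random.Random(seed)
    by_cohort = defaultdict(list)
    for cohort, item in samples:
        by_cohort[cohort].append(item)
    target = max(len(items) for items in by_cohort.values())
    balanced = []
    for cohort, items in by_cohort.items():
        balanced.extend((cohort, item) for item in items)
        # Add duplicates drawn at random to reach the target size.
        for _ in range(target - len(items)):
            balanced.append((cohort, rng.choice(items)))
    rng.shuffle(balanced)
    return balanced
```

Balancing to the largest cohort is the simplest choice; a milder target, such as a fixed minimum share per cohort, would duplicate fewer examples.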
Federated Learning using Smart Contracts on Blockchains, based on Reward
Driven Approach
In recent years, federated machine learning has continued to gain interest
and momentum wherever insights must be drawn from data while preserving
the data provider's privacy. However, one of the existing challenges in
the adoption of federated learning has been the lack of fair, transparent and
universally agreed incentivization schemes for rewarding the federated learning
contributors. Smart contracts on a blockchain network provide transparent,
immutable proofs that all participants of the network can verify
independently. We leverage this open and transparent nature of smart contracts on a
blockchain to define incentivization rules for the contributors, which are based
on a novel scalar quantity: federated contribution. Such a smart contract
based reward-driven model has the potential to revolutionize federated
learning adoption in enterprises. Our contribution is two-fold. First, we
show how a smart-contract-based blockchain can be a very natural communication
channel for federated learning. Second, leveraging this infrastructure, we
show how an intuitive measure of each agent's contribution can be built and
integrated with the life cycle of the training and reward process.