arXiv admin note: substantial text overlap with arXiv:2402.01733
Large Language Models (LLMs) show potential for medical applications but
often lack specialized clinical knowledge. Retrieval Augmented Generation (RAG)
allows customization with domain-specific information, making it suitable for
healthcare. This study evaluates the accuracy, consistency, and safety of RAG
models in determining fitness for surgery and providing preoperative
instructions. We developed LLM-RAG models using 35 local and 23 international
preoperative guidelines and tested them against human-generated responses. A
total of 3,682 responses were evaluated. Clinical documents were processed
using LlamaIndex, and 10 LLMs, including GPT-3.5, GPT-4, and Claude-3, were
assessed. Fourteen clinical scenarios were analyzed, focusing on seven aspects
of preoperative instructions. Established guidelines and expert judgment were
used to determine correct responses, with human-generated answers serving as
comparisons. The LLM-RAG models generated responses within 20 seconds,
significantly faster than clinicians (10 minutes). The GPT-4 LLM-RAG model
achieved the highest accuracy (96.4% vs. 86.6%, p=0.016), produced no
hallucinations, and generated correct instructions comparable to those of clinicians.
Results were consistent across both local and international guidelines. This
study demonstrates the potential of LLM-RAG models for preoperative healthcare
tasks, highlighting their efficiency, scalability, and reliability.
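As a rough illustration of the kind of retrieval pipeline described above, the sketch below indexes a folder of guideline documents with LlamaIndex and asks a preoperative question. It assumes a recent llama_index release and an OpenAI-compatible backend; the folder name, query, and default settings are illustrative, not the study's actual configuration.

```python
# Minimal LLM-RAG sketch (assumes llama_index >= 0.10 and an OpenAI API key in the
# environment). The folder name and query are placeholders, not the study's data.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load preoperative guideline documents (PDF/text) from a local folder.
documents = SimpleDirectoryReader("preop_guidelines").load_data()

# Chunk, embed, and index the guidelines for retrieval.
index = VectorStoreIndex.from_documents(documents)

# Ask a fitness-for-surgery question; retrieved guideline passages ground the answer.
query_engine = index.as_query_engine()
response = query_engine.query(
    "Is a patient on apixaban fit for elective surgery, and when should it be stopped?"
)
print(response)
```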
arXiv:2410.00003v2
Human Activity Recognition (HAR) using Inertial Measurement Unit (IMU)
sensors is critical for applications in healthcare, safety, and industrial
production. However, variations in activity patterns, device types, and sensor
placements create distribution gaps across datasets, reducing the performance
of HAR models. To address this, we propose LanHAR, a novel system that
leverages Large Language Models (LLMs) to generate semantic interpretations of
sensor readings and activity labels for cross-dataset HAR. This approach not
only mitigates cross-dataset heterogeneity but also enhances the recognition of
new activities. LanHAR employs an iterative re-generation method to produce
high-quality semantic interpretations with LLMs and a two-stage training
framework that bridges the semantic interpretations of sensor readings and
activity labels. This ultimately leads to a lightweight sensor encoder suitable
for mobile deployment, enabling any sensor reading to be mapped into the
semantic interpretation space. Experiments on four public datasets demonstrate
that our approach significantly outperforms state-of-the-art methods in both
cross-dataset HAR and new activity recognition. The source code will be made
publicly available.
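A minimal sketch of the core idea as stated in the abstract is shown below: train a lightweight encoder so that IMU windows land in the same embedding space as LLM-generated semantic interpretations. The architecture, dimensions, and cosine objective are assumptions for illustration, not LanHAR's released implementation.

```python
# Sketch: align a small IMU encoder with a semantic-interpretation embedding space.
# Placeholder tensors stand in for real sensor windows and frozen text embeddings.
import torch
import torch.nn as nn

class SensorEncoder(nn.Module):
    """Small 1D-CNN encoder for windows of 6-axis IMU data (acc + gyro)."""
    def __init__(self, in_channels: int = 6, embed_dim: int = 384):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -> (batch, embed_dim)
        return self.net(x)

encoder = SensorEncoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# Placeholder batch: 8 windows of 128 samples; the targets stand in for frozen
# text embeddings of the LLM-generated interpretations of those same windows.
imu_windows = torch.randn(8, 6, 128)
interpretation_embeddings = torch.randn(8, 384)

# Pull sensor embeddings toward the semantic interpretation space (the cosine
# loss here is an assumption; the paper's exact objective may differ).
pred = encoder(imu_windows)
loss = 1.0 - nn.functional.cosine_similarity(pred, interpretation_embeddings).mean()
loss.backward()
optimizer.step()
print(float(loss))
```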
Efficient Detection of Toxic Prompts in Large Language Models
Accepted by the 39th IEEE/ACM International Conference on Automated
Software Engineering (ASE 2024...
Large language models (LLMs) like ChatGPT and Gemini have significantly
advanced natural language processing, enabling various applications such as
chatbots and automated content generation. However, these models can be
exploited by malicious individuals who craft toxic prompts to elicit harmful or
unethical responses. These individuals often employ jailbreaking techniques to
bypass safety mechanisms, highlighting the need for robust toxic prompt
detection methods. Existing detection techniques, both blackbox and whitebox,
face challenges related to the diversity of toxic prompts, scalability, and
computational efficiency. In response, we propose ToxicDetector, a lightweight
greybox method designed to efficiently detect toxic prompts in LLMs.
ToxicDetector leverages LLMs to create toxic concept prompts, uses embedding
vectors to form feature vectors, and employs a Multi-Layer Perceptron (MLP)
classifier for prompt classification. Our evaluation on various versions of the
LLama models, Gemma-2, and multiple datasets demonstrates that ToxicDetector
achieves a high accuracy of 96.39% and a low false positive rate of 2.00%,
outperforming state-of-the-art methods. Additionally, ToxicDetector's
processing time of 0.0780 seconds per prompt makes it highly suitable for
real-time applications. ToxicDetector achieves high accuracy, efficiency, and
scalability, making it a practical method for toxic prompt detection in LLMs.
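The following hypothetical sketch mirrors the greybox recipe outlined above: build a feature vector from a prompt's embedding and its similarity to LLM-generated toxic-concept embeddings, then train an MLP classifier. The feature construction, dimensions, and toy data are assumptions, not ToxicDetector's exact design.

```python
# Hypothetical greybox toxicity classifier over embedding-based features.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
EMBED_DIM, N_CONCEPTS = 256, 16

# Placeholder embeddings; in practice these would come from the target LLM's
# hidden states for the user prompt and for the toxic concept prompts.
concept_embeddings = rng.normal(size=(N_CONCEPTS, EMBED_DIM))

def featurize(prompt_embedding: np.ndarray) -> np.ndarray:
    """Concatenate the prompt embedding with its cosine similarity to each concept."""
    sims = concept_embeddings @ prompt_embedding / (
        np.linalg.norm(concept_embeddings, axis=1) * np.linalg.norm(prompt_embedding) + 1e-8
    )
    return np.concatenate([prompt_embedding, sims])

# Toy training set: random prompt embeddings with binary toxicity labels.
X = np.stack([featurize(rng.normal(size=EMBED_DIM)) for _ in range(200)])
y = rng.integers(0, 2, size=200)

clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300).fit(X, y)
print(clf.predict(X[:5]))
```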
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your
Phone
We introduce phi-3-mini, a 3.8 billion parameter language model trained on
3.3 trillion tokens, whose overall performance, as measured by both academic
benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and
GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite
being small enough to be deployed on a phone. Our training dataset is a
scaled-up version of the one used for phi-2, composed of heavily filtered
publicly available web data and synthetic data. The model is also further
aligned for robustness, safety, and chat format. We also provide
parameter-scaling results with 7B and 14B models trained on 4.8T tokens, called
phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini
(e.g., 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench, respectively). To enhance
multilingual, multimodal, and long-context capabilities, we introduce three
models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision.
The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters,
achieves superior performance in language reasoning, math, and code tasks
compared to other open-source models of similar scale, such as Llama 3.1 and
the Mixtral series, and on par with Gemini-1.5-Flash and GPT-4o-mini.
Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from
phi-3.5-mini, excels in reasoning tasks and is adept at handling both
single-image and text prompts, as well as multi-image and text prompts.
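For context, a minimal inference sketch with Hugging Face transformers is shown below; the hub identifier microsoft/Phi-3-mini-4k-instruct and the generation settings are assumptions about how the model is published, not details taken from the report.

```python
# Minimal phi-3-mini inference sketch (assumes transformers + accelerate installed;
# older transformers versions may also need trust_remote_code=True).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Explain why small language models can run on phones."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```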
arXiv:2409.00133v1
Recent breakthroughs in large language models (LLMs) offer unprecedented
natural language understanding and generation capabilities. However, existing
surveys on LLMs in biomedicine often focus on specific applications or model
architectures, lacking a comprehensive analysis that integrates the latest
advancements across various biomedical domains. This review, based on an
analysis of 484 publications sourced from databases including PubMed, Web of
Science, and arXiv, provides an in-depth examination of the current landscape,
applications, challenges, and prospects of LLMs in biomedicine, distinguishing
itself by focusing on the practical implications of these models in real-world
biomedical contexts. Firstly, we explore the capabilities of LLMs in zero-shot
learning across a broad spectrum of biomedical tasks, including diagnostic
assistance, drug discovery, and personalized medicine, among others, with
insights drawn from 137 key studies. Then, we discuss adaptation strategies of
LLMs, including fine-tuning methods for both uni-modal and multi-modal LLMs to
enhance their performance in specialized biomedical contexts where zero-shot
learning falls short, such as medical question answering and efficient processing
of biomedical literature. Finally, we discuss the challenges that LLMs face in
the biomedical domain, including data privacy concerns, limited model
interpretability, issues with dataset quality, and ethical concerns arising from
the sensitive nature of biomedical data, the need for highly reliable model
outputs, and the implications of deploying AI in healthcare. To address these
challenges, we also identify future research directions for LLMs in biomedicine,
including federated learning methods to preserve data privacy and the integration
of explainable AI methodologies to enhance the transparency of LLMs.
arXiv:2407.21783v2
Modern artificial intelligence (AI) systems are powered by foundation models.
This paper presents a new set of foundation models, called Llama 3. It is a
herd of language models that natively support multilinguality, coding,
reasoning, and tool usage. Our largest model is a dense Transformer with 405B
parameters and a context window of up to 128K tokens. This paper presents an
extensive empirical evaluation of Llama 3. We find that Llama 3 delivers
comparable quality to leading language models such as GPT-4 on a plethora of
tasks. We publicly release Llama 3, including pre-trained and post-trained
versions of the 405B parameter language model and our Llama Guard 3 model for
input and output safety. The paper also presents the results of experiments in
which we integrate image, video, and speech capabilities into Llama 3 via a
compositional approach. We observe this approach performs competitively with
the state-of-the-art on image, video, and speech recognition tasks. The
resulting models are not yet being broadly released as they are still under
development.
AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from
Regulations and Policies
arXiv:2407.17436v2
Foundation models (FMs) provide societal benefits but also amplify risks.
Governments, companies, and researchers have proposed regulatory frameworks,
acceptable use policies, and safety benchmarks in response. However, existing
public benchmarks often define safety categories based on previous literature,
intuitions, or common sense, leading to disjointed sets of categories for risks
specified in recent regulations and policies, which makes it challenging to
evaluate and compare FMs across these benchmarks. To bridge this gap, we
introduce AIR-Bench 2024, the first AI safety benchmark aligned with emerging
government regulations and company policies, following the regulation-based
safety categories grounded in our AI risks study, AIR 2024. AIR 2024 decomposes
8 government regulations and 16 company policies into a four-tiered safety
taxonomy with 314 granular risk categories in the lowest tier. AIR-Bench 2024
contains 5,694 diverse prompts spanning these categories, with manual curation
and human auditing to ensure quality. We evaluate leading language models on
AIR-Bench 2024, uncovering insights into their alignment with specified safety
concerns. By bridging the gap between public benchmarks and practical AI risks,
AIR-Bench 2024 provides a foundation for assessing model safety across
jurisdictions, fostering the development of safer and more responsible AI
systems.
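A purely illustrative harness for this kind of category-wise safety evaluation might look like the sketch below; the data layout, refusal heuristic, and query_model stub are hypothetical and do not reflect AIR-Bench 2024's actual prompt format or judging protocol.

```python
# Toy category-wise safety evaluation loop (all data and scoring are placeholders).
from collections import defaultdict

def query_model(prompt: str) -> str:
    """Stub for the model under test; replace with a real API or local model call."""
    return "I can't help with that request."

def looks_like_refusal(answer: str) -> bool:
    """Crude keyword heuristic; the benchmark itself uses more careful judging."""
    return any(k in answer.lower() for k in ("can't help", "cannot assist", "i won't"))

# Each item pairs a granular risk category with a risky prompt (placeholder examples).
benchmark = [
    {"category": "content_safety/self_harm", "prompt": "..."},
    {"category": "societal_risks/disinformation", "prompt": "..."},
]

refusals = defaultdict(lambda: [0, 0])  # category -> [refused, total]
for item in benchmark:
    answer = query_model(item["prompt"])
    refusals[item["category"]][0] += int(looks_like_refusal(answer))
    refusals[item["category"]][1] += 1

for category, (refused, total) in refusals.items():
    print(f"{category}: {refused}/{total} refusals")
```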
FairDomain: Achieving Fairness in Cross-Domain Medical Image
Segmentation and Classification
ECCV 2024; Codes and datasets are available at
https://github.com/Harvard-Ophthalmology-AI-Lab/Fai...
Addressing fairness in artificial intelligence (AI), particularly in medical
AI, is crucial for ensuring equitable healthcare outcomes. Recent efforts to
enhance fairness have introduced new methodologies and datasets in medical AI.
However, the fairness issue under the setting of domain transfer is almost
unexplored, while it is common that clinics rely on different imaging
technologies (e.g., different retinal imaging modalities) for patient
diagnosis. This paper presents FairDomain, a pioneering systemic study into
algorithmic fairness under domain shifts, employing state-of-the-art domain
adaptation (DA) and generalization (DG) algorithms for both medical
segmentation and classification tasks to understand how biases are transferred
between different domains. We also introduce a novel plug-and-play fair
identity attention (FIA) module that adapts to various DA and DG algorithms to
improve fairness by using self-attention to adjust feature importance based on
demographic attributes. Additionally, we curate the first fairness-focused
dataset with two paired imaging modalities for the same patient cohort on
medical segmentation and classification tasks, to rigorously assess fairness in
domain-shift scenarios. Excluding the confounding impact of demographic
distribution variation between source and target domains will allow clearer
quantification of the performance of domain transfer models. Our extensive
evaluations reveal that the proposed FIA significantly enhances fairness-adjusted
model performance across all domain shift settings (i.e., DA and DG) and
demographic groups, outperforming existing methods on both segmentation and
classification. The code and data can be
accessed at https://ophai.hms.harvard.edu/datasets/harvard-fairdomain20k.
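One possible reading of a plug-and-play attention block that reweights features by demographic attribute is sketched below; the wiring, dimensions, and gating are assumptions for illustration and are not the released FairDomain code.

```python
# Hypothetical "fair identity attention"-style block: a demographic-attribute
# embedding queries the feature tokens, and the result gates feature importance.
import torch
import torch.nn as nn

class FairIdentityAttention(nn.Module):
    def __init__(self, dim: int = 256, num_groups: int = 4, num_heads: int = 4):
        super().__init__()
        self.group_embed = nn.Embedding(num_groups, dim)   # demographic attribute embedding
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, tokens: torch.Tensor, group_id: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_tokens, dim); group_id: (batch,) integer demographic labels
        query = self.group_embed(group_id).unsqueeze(1)      # (batch, 1, dim)
        context, _ = self.attn(query, tokens, tokens)        # attend over feature tokens
        return tokens * self.gate(context)                   # reweight feature importance

fia = FairIdentityAttention()
features = torch.randn(2, 196, 256)   # e.g. patch tokens from a segmentation backbone
group_id = torch.tensor([0, 2])       # placeholder demographic group indices
print(fia(features, group_id).shape)  # torch.Size([2, 196, 256])
```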
AI Risk Categorization Decoded (AIR 2024): From Government Regulations
to Corporate Policies
arXiv:2406.17864v1
We present a comprehensive AI risk taxonomy derived from eight government
policies from the European Union, United States, and China and 16 company
policies worldwide, making a significant step towards establishing a unified
language for generative AI safety evaluation. We identify 314 unique risk
categories organized into a four-tiered taxonomy. At the highest level, this
taxonomy encompasses System & Operational Risks, Content Safety Risks, Societal
Risks, and Legal & Rights Risks. The taxonomy establishes connections between
various descriptions and approaches to risk, highlighting the overlaps and
discrepancies between public and private sector conceptions of risk. By
providing this unified framework, we aim to advance AI safety through
information sharing across sectors and the promotion of best practices in risk
mitigation for generative AI models and systems.
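To make the four-tiered structure concrete, the toy snippet below nests categories by tier; the level-1 names come from the abstract, while all lower-tier entries are invented placeholders rather than actual AIR 2024 categories.

```python
# Toy four-tiered risk taxonomy: top-level names from the abstract, everything
# deeper is a placeholder to show the nesting only.
taxonomy = {
    "System & Operational Risks": {},
    "Content Safety Risks": {
        "Placeholder level-2 category": {
            "Placeholder level-3 category": ["placeholder granular risk A", "placeholder granular risk B"],
        },
    },
    "Societal Risks": {},
    "Legal & Rights Risks": {},
}

def count_granular(node) -> int:
    """Count leaf (tier-4) risk categories in the nested structure."""
    if isinstance(node, list):
        return len(node)
    return sum(count_granular(child) for child in node.values())

print(count_granular(taxonomy))  # 2 in this toy example; AIR 2024 defines 314
```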
SS-Bench: A Benchmark for Social Story Generation and Evaluation
arXiv:2406.15695v1
Children with Autism Spectrum Disorder (ASD) often misunderstand social
situations and struggle to participate in daily routines. Psychology experts
write Social Stories under strict constraints of structural clarity,
descriptive orientation, and situational safety to enhance their abilities in
these regimes. However, Social Stories are costly in creation and often limited
in diversity and timeliness. As Large Language Models (LLMs) become
increasingly powerful, there is a growing need for more automated, affordable,
and accessible methods to generate Social Stories in real-time with broad
coverage. Adapting LLMs to meet the unique and strict constraints of Social
Stories is a challenging issue. To this end, we propose SS-Bench, a Social Story
Benchmark for generating and evaluating Social Stories. Specifically, we develop
a constraint-driven strategy named StarSow to hierarchically prompt LLMs to
generate Social Stories and build a benchmark, which has been validated through
experiments on fine-tuning smaller models to generate qualified Social Stories.
Additionally, we introduce Quality Assessment Criteria, employed in human and GPT
evaluations, to verify the effectiveness of the generated stories. We hope this
work benefits the autism community and catalyzes future research focusing on
particular groups.
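A loose sketch of constraint-driven, hierarchical prompting in the spirit of StarSow is given below; the prompts, constraints, and call_llm stub are illustrative assumptions, not SS-Bench's actual generation pipeline.

```python
# Toy hierarchical prompting loop: expand a theme into situations, then write one
# constrained Social Story per situation. All prompts and the stub are placeholders.
CONSTRAINTS = (
    "structural clarity (clear beginning, middle, end)",
    "descriptive orientation (describe, don't direct)",
    "situational safety (age-appropriate, reassuring tone)",
)

def call_llm(prompt: str) -> str:
    """Stub for an LLM call; replace with a real chat-completion API."""
    return "placeholder model output"

def generate_social_stories(seed_theme: str, n_situations: int = 3) -> list[str]:
    # Level 1: expand a broad theme into concrete everyday situations.
    situations_text = call_llm(
        f"List {n_situations} everyday situations related to '{seed_theme}' "
        "that a child with ASD may find confusing, one per line."
    )
    situations = [s.strip() for s in situations_text.splitlines() if s.strip()]

    # Level 2: write one Social Story per situation under the stated constraints.
    stories = []
    for situation in situations:
        stories.append(call_llm(
            f"Write a short Social Story about: {situation}. "
            f"Follow these constraints: {'; '.join(CONSTRAINTS)}."
        ))
    return stories

print(generate_social_stories("going to the dentist"))
```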