The burgeoning field of Large Language Models (LLMs), exemplified by
sophisticated models like OpenAI's ChatGPT, represents a significant
advancement in artificial intelligence. These models, however, bring forth
substantial challenges in the high consumption of computational, memory,
energy, and financial resources, especially in environments with limited
resource capabilities. This survey aims to systematically address these
challenges by reviewing a broad spectrum of techniques designed to enhance the
resource efficiency of LLMs. We categorize methods along two axes: their
optimization focus (computational, memory, energy, financial, and network
resources) and their applicability across the stages of an LLM's lifecycle,
including architecture design, pretraining, finetuning, and system design. Additionally,
the survey introduces a nuanced categorization of resource efficiency
techniques by their specific resource types, which uncovers the intricate
relationships and mappings between various resources and corresponding
optimization techniques. A standardized set of evaluation metrics and datasets
is also presented to facilitate consistent and fair comparisons across
different models and techniques. By offering a comprehensive overview of the
current state of the art and identifying open research avenues, this survey
serves as a foundational reference for researchers and practitioners, aiding
them in developing more sustainable and efficient LLMs in a rapidly evolving
landscape.
arXiv:2410.21276v1
GPT-4o is an autoregressive omni model that accepts as input any combination
of text, audio, image, and video, and generates any combination of text, audio,
and image outputs. It's trained end-to-end across text, vision, and audio,
meaning all inputs and outputs are processed by the same neural network. GPT-4o
can respond to audio inputs in as little as 232 milliseconds, with an average
of 320 milliseconds, which is similar to human response time in conversation.
It matches GPT-4 Turbo performance on text in English and code, with
significant improvement on text in non-English languages, while also being much
faster and 50% cheaper in the API. GPT-4o is especially strong at vision and
audio understanding compared to existing models. In line with our commitment to
building AI safely and consistent with our voluntary commitments to the White
House, we are sharing the GPT-4o System Card, which includes our Preparedness
Framework evaluations. In this System Card, we provide a detailed look at
GPT-4o's capabilities, limitations, and safety evaluations across multiple
categories, focusing on speech-to-speech while also evaluating text and image
capabilities, and measures we've implemented to ensure the model is safe and
aligned. We also include third-party assessments on dangerous capabilities, as
well as discussion of potential societal impacts of GPT-4o's text and vision
capabilities.
Deconstructing The Ethics of Large Language Models from Long-standing
Issues to New-emerging Dilemmas: A Survey
arXiv:2406.05392v2
Large Language Models (LLMs) have achieved unparalleled success across
diverse language modeling tasks in recent years. However, this progress has
also intensified ethical concerns, impacting the deployment of LLMs in everyday
contexts. This paper provides a comprehensive survey of ethical challenges
associated with LLMs, from longstanding issues such as copyright infringement,
systematic bias, and data privacy, to emerging problems like truthfulness and
social norms. We critically analyze existing research aimed at understanding,
examining, and mitigating these ethical risks. Our survey underscores the
importance of integrating ethical standards and societal values into the design
of LLMs, thereby guiding the development of responsible and ethically aligned
language models.
DemoShapley: Valuation of Demonstrations for In-Context Learning
arXiv:2410.07523v1
Large language models (LLMs) leveraging in-context learning (ICL) have set
new benchmarks in few-shot learning across various tasks without needing
task-specific fine-tuning. However, extensive research has demonstrated that
the effectiveness of ICL is significantly influenced by the selection and
ordering of demonstrations. Considering the critical role of demonstration
selection in ICL, we introduce DemoShapley, which is inspired by the Data
Shapley valuation framework. This approach assesses the influence of individual
demonstration instances, distinguishing between those that contribute
positively and those that may hinder performance. Our findings reveal that
DemoShapley not only enhances model performance in terms of accuracy and
fairness but also generalizes well to queries from domains distinct from those
of the in-context demonstrations, highlighting its versatility and
effectiveness in optimizing ICL demonstration selection. Finally, DemoShapley
helps identify noisy data within the demonstration set.
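The abstract does not spell out the estimator itself. As a rough illustration of Data-Shapley-style valuation, the sketch below estimates each demonstration's value by Monte Carlo sampling over permutations; `utility` is a hypothetical scorer (e.g. validation accuracy of the LLM prompted with that ordered subset of demonstrations), not the paper's actual implementation.

```python
import random

def shapley_values(demos, utility, rounds=200, seed=0):
    """Monte Carlo estimate of each demonstration's Shapley value.

    `utility(subset)` scores the downstream model when prompted with the
    given ordered subset of demonstrations (a stand-in hook here).
    """
    rng = random.Random(seed)
    values = [0.0] * len(demos)
    for _ in range(rounds):
        order = list(range(len(demos)))
        rng.shuffle(order)                 # one random permutation
        prefix, prev = [], utility([])
        for i in order:
            prefix.append(demos[i])
            cur = utility(prefix)
            values[i] += cur - prev        # marginal contribution of demo i
            prev = cur
    return [v / rounds for v in values]
```

Demonstrations with negative estimated value are the ones that, on average, hinder performance and are candidates for removal as noisy data.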
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your
Phone
We introduce phi-3-mini, a 3.8 billion parameter language model trained on
3.3 trillion tokens, whose overall performance, as measured by both academic
benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and
GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite
being small enough to be deployed on a phone. Our training dataset is a
scaled-up version of the one used for phi-2, composed of heavily filtered
publicly available web data and synthetic data. The model is also further
aligned for robustness, safety, and chat format. We also provide
parameter-scaling results with 7B and 14B models trained on 4.8T tokens, called
phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini
(e.g., 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench, respectively). To enhance
multilingual, multimodal, and long-context capabilities, we introduce three
models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision.
The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters,
achieves superior performance in language reasoning, math, and code tasks
compared to other open-source models of similar scale, such as Llama 3.1 and
the Mixtral series, and on par with Gemini-1.5-Flash and GPT-4o-mini.
Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from
phi-3.5-mini, excels in reasoning tasks and is adept at handling both
single-image and text prompts, as well as multi-image and text prompts.
ChipExpert: The Open-Source Integrated-Circuit-Design-Specific Large
Language Model
arXiv:2408.00804v1
The field of integrated circuit (IC) design is highly specialized, presenting
significant barriers to entry and research and development challenges. Although
large language models (LLMs) have achieved remarkable success in various
domains, existing LLMs often fail to meet the specific needs of students,
engineers, and researchers. Consequently, the potential of LLMs in the IC
design domain remains largely unexplored. To address these issues, we introduce
ChipExpert, the first open-source, instructional LLM specifically tailored for
the IC design field. ChipExpert is trained on one of the best current
open-source base models (Llama-3 8B). The entire training process encompasses
several key stages, including data preparation, continued pre-training,
instruction-guided supervised fine-tuning, preference alignment, and
evaluation. In the data preparation stage, we construct multiple high-quality
custom datasets through manual selection and data synthesis techniques. In the
subsequent two stages, ChipExpert acquires a vast amount of IC design knowledge
and learns how to respond to user queries professionally. ChipExpert also
undergoes an alignment phase, using Direct Preference Optimization, to achieve
a high standard of ethical performance. Finally, to mitigate the hallucinations
of ChipExpert, we have developed a Retrieval-Augmented Generation (RAG) system,
based on the IC design knowledge base. We also release the first IC design
benchmark, ChipICD-Bench, to evaluate the capabilities of LLMs across multiple
IC design sub-domains. In comprehensive experiments on this benchmark,
ChipExpert demonstrates a high level of expertise in IC design
question-answering tasks.
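The RAG component is described only at a high level. Below is a minimal sketch of the retrieve-then-prompt pattern it names; bag-of-words cosine scoring stands in for whatever retriever ChipExpert actually uses, and the knowledge-base strings in the usage note are invented examples.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query, knowledge_base, k=2):
    """Return the k passages most similar to the query."""
    q = Counter(query.lower().split())
    return sorted(knowledge_base,
                  key=lambda d: cosine(q, Counter(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(query, knowledge_base):
    """Ground the model's answer in retrieved passages to curb hallucination."""
    context = "\n".join(f"- {p}" for p in retrieve(query, knowledge_base))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\nQuestion: {query}")
```

Grounding the prompt in retrieved passages is the standard way a RAG system mitigates hallucination: the model is instructed to answer from the supplied context rather than from parametric memory alone.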
arXiv:2402.05355v6
In the rapidly evolving landscape of artificial intelligence, multimodal
learning systems (MMLS) have gained traction for their ability to process and
integrate information from diverse modality inputs. Their expanding use in
vital sectors such as healthcare has made safety assurance a critical concern.
However, the absence of systematic research into their safety is a significant
barrier to progress in this field. To bridge the gap, we present the first
taxonomy that systematically categorizes and assesses MMLS safety. This
taxonomy is structured around four fundamental pillars that are critical to
ensuring the safety of MMLS: robustness, alignment, monitoring, and
controllability. Leveraging this taxonomy, we review existing methodologies,
benchmarks, and the current state of research, while also pinpointing the
principal limitations and gaps in knowledge. Finally, we discuss unique
challenges in MMLS safety. In illuminating these challenges, we aim to pave the
way for future research, proposing potential directions that could lead to
significant advancements in the safety protocols of MMLS.
Robust Stance Detection: Understanding Public Perceptions in Social
Media
arXiv:2309.15176v2
The abundance of social media data has presented opportunities for accurately
determining public and group-specific stances around policy proposals or
controversial topics. In contrast with sentiment analysis which focuses on
identifying prevailing emotions, stance detection identifies precise positions
(i.e., supportive, opposing, neutral) relative to a well-defined topic, such as
perceptions toward specific global health interventions during the COVID-19
pandemic. Traditional stance detection models, while effective within their
specific domain (e.g., attitudes towards masking protocols during COVID-19),
often lag in performance when applied to new domains and topics due to changes
in data distribution. This limitation is compounded by the scarcity of
domain-specific, labeled datasets, which are expensive and labor-intensive to
create. The solution we present in this paper combines counterfactual data
augmentation with contrastive learning to enhance the robustness of stance
detection across domains and topics of interest. We evaluate the performance of
current state-of-the-art stance detection models, including a prompt-optimized
large language model, against our proposed framework, STANCE-C3
(domain-adaptive Cross-target STANCE detection via Contrastive learning and
Counterfactual generation). Empirical evaluations demonstrate
STANCE-C3's consistent improvements over the baseline models with respect to
accuracy across domains and varying focal topics. Despite the increasing
prevalence of general-purpose models such as generative AI, specialized models
such as STANCE-C3 provide utility in safety-critical domains wherein precision
is highly valued, especially when a nuanced understanding of the concerns of
different population segments could result in crafting more impactful public
policies.
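The abstract names its two ingredients but not their mechanics. The toy sketch below illustrates the counterfactual-augmentation half: swapping the stance target synthesizes an out-of-domain variant, and each (original, variant) pair can then serve as positives for a contrastive objective. The target strings in the test are invented examples, and this is a much simpler transformation than whatever generation STANCE-C3 actually performs.

```python
import re

def counterfactual_swap(text, old_target, new_target):
    """Synthesize a cross-target variant by swapping mentions of the stance target."""
    return re.sub(re.escape(old_target), new_target, text, flags=re.IGNORECASE)

def contrastive_pairs(dataset, old_target, new_target):
    """Build (anchor, positive, label) triples for a contrastive objective.

    The stance label is assumed to be preserved under the target swap, so the
    anchor and its counterfactual variant are treated as a positive pair.
    """
    return [(text, counterfactual_swap(text, old_target, new_target), label)
            for text, label in dataset]
```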
Long-Term Fairness Inquiries and Pursuits in Machine Learning: A Survey
of Notions, Methods, and Challenges
arXiv:2406.06736v1
The widespread integration of Machine Learning systems in daily life,
particularly in high-stakes domains, has raised concerns about the fairness
implications. While prior works have investigated static fairness measures,
recent studies reveal that automated decision-making has long-term implications
and that off-the-shelf fairness approaches may not serve the purpose of
achieving long-term fairness. Additionally, the existence of feedback loops and
the interaction between models and the environment introduces additional
complexities that may deviate from the initial fairness goals. In this survey,
we review existing literature on long-term fairness from different perspectives
and present a taxonomy for long-term fairness studies. We highlight key
challenges and consider future research directions, analyzing both current
issues and avenues for further exploration.
arXiv:2404.18416v2
Excellence in a wide variety of medical applications poses considerable
challenges for AI, requiring advanced reasoning, access to up-to-date medical
knowledge and understanding of complex multimodal data. Gemini models, with
strong general capabilities in multimodal and long-context reasoning, offer
exciting possibilities in medicine. Building on these core strengths of Gemini,
we introduce Med-Gemini, a family of highly capable multimodal models that are
specialized in medicine with the ability to seamlessly use web search, and that
can be efficiently tailored to novel modalities using custom encoders. We
evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art
(SoTA) performance on 10 of them and surpassing the GPT-4 model family on every
benchmark where a direct comparison is viable, often by a wide margin. On the
popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves
SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search
strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU
(health & medicine), Med-Gemini improves over GPT-4V by an average relative
margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context
capabilities through SoTA performance on a needle-in-a-haystack retrieval task
from long de-identified health records and medical video question answering,
surpassing prior bespoke methods using only in-context learning. Finally,
Med-Gemini's performance suggests real-world utility by surpassing human
experts on tasks such as medical text summarization, alongside demonstrations
of promising potential for multimodal medical dialogue, medical research and
education. Taken together, our results offer compelling evidence for
Med-Gemini's potential, although further rigorous evaluation will be crucial
before real-world deployment in this safety-critical domain.
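The uncertainty-guided search strategy is only named in the abstract. One plausible, much-simplified reading is sketched below: sample several answers, and fall back to a search-grounded answer only when the samples disagree. `sample_fn` and `search_fn` are hypothetical hooks, not a real Gemini API, and the agreement threshold is an invented parameter.

```python
from collections import Counter

def uncertainty_guided_answer(question, sample_fn, search_fn, n=5, threshold=0.6):
    """Escalate to search-augmented answering when sampled answers disagree.

    `sample_fn(q)` draws one model answer; `search_fn(q)` returns an answer
    grounded in retrieved evidence. Agreement among samples is used as a
    cheap proxy for model confidence.
    """
    samples = [sample_fn(question) for _ in range(n)]
    _, count = Counter(samples).most_common(1)[0]
    agreement = count / n
    if agreement >= threshold:           # confident: majority vote suffices
        return Counter(samples).most_common(1)[0][0]
    return search_fn(question)           # uncertain: ground with search
```

The design choice mirrored here is that retrieval is expensive, so it is invoked only on the inputs where the model's own samples signal uncertainty.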