arXiv:2410.21276v1
GPT-4o is an autoregressive omni model that accepts as input any combination
of text, audio, image, and video, and generates any combination of text, audio,
and image outputs. It's trained end-to-end across text, vision, and audio,
meaning all inputs and outputs are processed by the same neural network. GPT-4o
can respond to audio inputs in as little as 232 milliseconds, with an average
of 320 milliseconds, which is similar to human response time in conversation.
It matches GPT-4 Turbo performance on text in English and code, with
significant improvement on text in non-English languages, while also being much
faster and 50% cheaper in the API. GPT-4o is markedly better at vision and
audio understanding than existing models. In line with our commitment to
building AI safely and consistent with our voluntary commitments to the White
House, we are sharing the GPT-4o System Card, which includes our Preparedness
Framework evaluations. In this System Card, we provide a detailed look at
GPT-4o's capabilities, limitations, and safety evaluations across multiple
categories, focusing on speech-to-speech while also evaluating text and image
capabilities, and measures we've implemented to ensure the model is safe and
aligned. We also include third-party assessments on dangerous capabilities, as
well as discussion of potential societal impacts of GPT-4o's text and vision
capabilities.
arXiv:2204.02311v5
Large language models have been shown to achieve remarkable performance
across a variety of natural language tasks using few-shot learning, which
drastically reduces the number of task-specific training examples needed to
adapt the model to a particular application. To further our understanding of
the impact of scale on few-shot learning, we trained a 540-billion parameter,
densely activated, Transformer language model, which we call Pathways Language
Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML
system which enables highly efficient training across multiple TPU Pods. We
demonstrate continued benefits of scaling by achieving state-of-the-art
few-shot learning results on hundreds of language understanding and generation
benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough
performance, outperforming the finetuned state-of-the-art on a suite of
multi-step reasoning tasks, and outperforming average human performance on the
recently released BIG-bench benchmark. A significant number of BIG-bench tasks
showed discontinuous improvements from model scale, meaning that performance
steeply increased as we scaled to our largest model. PaLM also has strong
capabilities in multilingual tasks and source code generation, which we
demonstrate on a wide array of benchmarks. We additionally provide a
comprehensive analysis on bias and toxicity, and study the extent of training
data memorization with respect to model scale. Finally, we discuss the ethical
considerations related to large language models and discuss potential
mitigation strategies.
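As a minimal illustration of the few-shot setting evaluated above, the sketch below builds a prompt from a handful of in-context examples instead of fine-tuning on task-specific data; the sentiment task and examples are invented for illustration and are not taken from the paper.

```python
# Minimal sketch of few-shot prompting: a few in-context examples stand in
# for task-specific fine-tuning. Task and examples are illustrative only.
EXAMPLES = [
    ("The movie was a delight.", "positive"),
    ("I want my money back.", "negative"),
]

def few_shot_prompt(query: str) -> str:
    shots = "\n".join(f"Review: {text}\nSentiment: {label}"
                      for text, label in EXAMPLES)
    return f"{shots}\nReview: {query}\nSentiment:"

print(few_shot_prompt("An instant classic."))  # prompt fed to the language model
```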
We introduce Codex, a GPT language model fine-tuned on publicly available
code from GitHub, and study its Python code-writing capabilities. A distinct
production version of Codex powers GitHub Copilot. On HumanEval, a new
evaluation set we release to measure functional correctness for synthesizing
programs from docstrings, our model solves 28.8% of the problems, while GPT-3
solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling
from the model is a surprisingly effective strategy for producing working
solutions to difficult prompts. Using this method, we solve 70.2% of our
problems with 100 samples per problem. Careful investigation of our model
reveals its limitations, including difficulty with docstrings describing long
chains of operations and with binding operations to variables. Finally, we
discuss the potential broader impacts of deploying powerful code generation
technologies, covering safety, security, and economics.
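The repeated-sampling strategy described above is usually reported with the pass@k metric introduced alongside HumanEval. Below is a small sketch of the unbiased pass@k estimator and a per-problem evaluation loop; `generate_samples` and `passes_unit_tests` are hypothetical harness hooks, not functions from the released evaluation code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples drawn from
    n generations (of which c pass the unit tests) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def evaluate_problem(generate_samples, passes_unit_tests, prompt,
                     n: int = 100, k: int = 1) -> float:
    # generate_samples / passes_unit_tests are hypothetical harness hooks.
    samples = generate_samples(prompt, n)          # n completions per problem
    c = sum(passes_unit_tests(prompt, s) for s in samples)
    return pass_at_k(n, c, k)
```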
Gathering Strength, Gathering Storms: The One Hundred Year Study on
Artificial Intelligence (AI100) 2021 Study Panel Report
In September 2021, the "One Hundred Year Study on Artificial Intelligence"
project (AI100) issued the second report of its planned long-term periodic
assessment of artificial intelligence (AI) and its impact on society. It was
written by a panel of 17 study authors, each of whom is deeply rooted in AI
research, chaired by Michael Littman of Brown University. The report, entitled
"Gathering Strength, Gathering Storms," answers a set of 14 questions probing
critical areas of AI development addressing the major risks and dangers of AI,
its effects on society, its public perception and the future of the field. The
report concludes that AI has made a major leap from the lab to people's lives
in recent years, which increases the urgency to understand its potential
negative effects. The questions were developed by the AI100 Standing Committee,
chaired by Peter Stone of the University of Texas at Austin, consisting of a
group of AI leaders with expertise in computer science, sociology, ethics,
economics, and other disciplines.
Accelerating Greedy Coordinate Gradient and General Prompt Optimization
via Probe Sampling
arXiv:2403.01251v3
Safety of Large Language Models (LLMs) has become a critical issue given
their rapid progress. Greedy Coordinate Gradient (GCG) has been shown to be
effective in constructing adversarial prompts that break aligned LLMs, but
GCG optimization is time-consuming. To reduce the time cost of GCG and
enable more comprehensive studies of LLM safety, in this work, we study a new
algorithm called Probe sampling. At the core of the algorithm is a
mechanism that dynamically determines how similar a smaller draft model's
predictions are to the target model's predictions for prompt candidates. When
the target model is similar to the draft model, we rely heavily on the draft
model to filter out a large number of potential prompt candidates. Probe
sampling achieves up to 5.6 times speedup using Llama2-7b-chat and leads to
equal or improved attack success rate (ASR) on the AdvBench. Furthermore, probe
sampling is also able to accelerate other prompt optimization techniques and
adversarial methods, leading to acceleration of 1.8× for AutoPrompt,
2.4× for APE and 2.4× for AutoDAN.
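As a rough sketch of the filtering mechanism described above: score every candidate with the cheap draft model, probe a small random subset with the target model to measure how well the two rankings agree, and keep fewer candidates for full target-model evaluation when agreement is high. The function names, Spearman-based agreement score, and keep-fraction schedule below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np
from scipy.stats import spearmanr

def probe_sampling_filter(candidates, draft_loss, target_loss,
                          probe_frac=0.1, keep_frac_range=(0.05, 0.5), rng=None):
    """draft_loss / target_loss map a list of candidate prompts to
    per-candidate losses (lower = more promising adversarial suffix)."""
    rng = rng or np.random.default_rng(0)
    cands = list(candidates)

    # 1. Score every candidate with the cheap draft model.
    d_losses = np.asarray(draft_loss(cands))

    # 2. Probe: score a small random subset with the expensive target model
    #    and measure rank agreement with the draft model's scores.
    n_probe = min(len(cands), max(2, int(probe_frac * len(cands))))
    probe_idx = rng.choice(len(cands), n_probe, replace=False)
    t_probe = np.asarray(target_loss([cands[i] for i in probe_idx]))
    rho, _ = spearmanr(d_losses[probe_idx], t_probe)
    agreement = 0.0 if np.isnan(rho) else max(rho, 0.0)

    # 3. High agreement -> trust the draft model and keep few candidates;
    #    low agreement -> keep more for full target-model evaluation.
    lo, hi = keep_frac_range
    k = max(1, int((hi - agreement * (hi - lo)) * len(cands)))
    keep_idx = np.argsort(d_losses)[:k]

    # 4. Evaluate the survivors with the target model and return the best.
    t_losses = np.asarray(target_loss([cands[i] for i in keep_idx]))
    return cands[keep_idx[int(np.argmin(t_losses))]]
```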
Excluding the Irrelevant: Focusing Reinforcement Learning through
Continuous Action Masking
arXiv:2406.03704v2
Continuous action spaces in reinforcement learning (RL) are commonly defined
as multidimensional intervals. While intervals usually reflect the action
boundaries for tasks well, they can be challenging for learning because the
typically large global action space leads to frequent exploration of irrelevant
actions. Yet, even limited task knowledge can suffice to identify significantly
smaller, state-specific sets of relevant actions. Focusing learning on these
relevant actions can significantly improve training efficiency and
effectiveness. In this paper, we propose to focus learning on the set of
relevant actions and introduce three continuous action masking methods for
exactly mapping the action space to the state-dependent set of relevant
actions. Thus, our methods ensure that only relevant actions are executed,
enhancing the predictability of the RL agent and enabling its use in
safety-critical applications. We further derive the implications of the
proposed methods on the policy gradient. Using proximal policy optimization
(PPO), we evaluate our methods on four control tasks, where the relevant action
set is computed based on the system dynamics and a relevant state set. Our
experiments show that the three action masking methods achieve higher final
rewards and converge faster than the baseline without action masking.
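One simple way to execute only state-relevant actions is to map the policy's output from the global interval onto a state-dependent relevant interval, as sketched below; this rescaling scheme is an illustrative stand-in rather than any of the paper's three masking methods.

```python
import numpy as np

def mask_action(raw_action, global_low, global_high, rel_low, rel_high):
    """Linearly map an action sampled over the global box
    [global_low, global_high] into the state-dependent relevant box
    [rel_low, rel_high], so only relevant actions reach the environment."""
    raw_action = np.asarray(raw_action, dtype=float)
    frac = (raw_action - global_low) / (global_high - global_low)
    return rel_low + frac * (rel_high - rel_low)

# Usage: PPO still samples in the global box; the mask narrows it per state.
a_exec = mask_action(raw_action=np.array([0.9, -0.3]),
                     global_low=-1.0, global_high=1.0,
                     rel_low=np.array([-0.2, -0.5]),
                     rel_high=np.array([0.4, 0.0]))
```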
Traffic and Safety Rule Compliance of Humans in Diverse Driving
Situations
The increasing interest in autonomous driving systems has highlighted the
need for an in-depth analysis of human driving behavior in diverse scenarios.
Analyzing human data is crucial for developing autonomous systems that
replicate safe driving practices and ensure seamless integration into
human-dominated environments. This paper presents a comparative evaluation of
human compliance with traffic and safety rules across multiple trajectory
prediction datasets, including Argoverse 2, nuPlan, Lyft, and DeepUrban. By
defining and leveraging existing safety and behavior-related metrics, such as
time to collision, adherence to speed limits, and interactions with other
traffic participants, we aim to provide a comprehensive understanding of each
dataset's strengths and limitations. Our analysis focuses on the distribution of
data samples, identifying noise, outliers, and undesirable behaviors exhibited
by human drivers in both the training and validation sets. The results
underscore the need for applying robust filtering techniques to certain
datasets due to high levels of noise and the presence of such undesirable
behaviors.
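To make the metrics mentioned above concrete, the sketch below computes a basic constant-velocity time-to-collision and a speed-limit compliance rate; both are simplified illustrations rather than the exact definitions used in the evaluation.

```python
import numpy as np

def time_to_collision(ego_pos, ego_vel, other_pos, other_vel, eps=1e-6):
    """Constant-velocity TTC for point agents; np.inf if the gap is not
    closing. Real metrics also account for vehicle extents and lanes."""
    rel_pos = np.asarray(other_pos, float) - np.asarray(ego_pos, float)
    rel_vel = np.asarray(other_vel, float) - np.asarray(ego_vel, float)
    dist = np.linalg.norm(rel_pos)
    closing_speed = -np.dot(rel_vel, rel_pos) / max(dist, eps)
    return dist / closing_speed if closing_speed > eps else np.inf

def speed_limit_compliance(speeds_mps, limit_mps):
    """Fraction of timesteps at or below the posted speed limit."""
    return float(np.mean(np.asarray(speeds_mps, float) <= limit_mps))
```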
arXiv:2404.01475v2
Large language models (LLMs) have gained widespread interest due to their
ability to process human language and perform tasks on which they have not been
explicitly trained.
However, we possess only a limited systematic understanding of the chemical
capabilities of LLMs, which would be required to improve models and mitigate
potential harm. Here, we introduce "ChemBench," an automated framework for
evaluating the chemical knowledge and reasoning abilities of state-of-the-art
LLMs against the expertise of chemists.
We curated more than 2,700 question-answer pairs, evaluated leading open- and
closed-source LLMs, and found that the best models outperformed the best human
chemists in our study on average. However, the models struggle with some basic
tasks and provide overconfident predictions.
These findings reveal LLMs' impressive chemical capabilities while
emphasizing the need for further research to improve their safety and
usefulness. They also suggest adapting chemistry education and show the value
of benchmarking frameworks for evaluating LLMs in specific domains.
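The sketch below shows the general shape of such an automated question-answer evaluation loop; `ask_model`, the question format, and exact-match scoring are illustrative assumptions and do not reflect ChemBench's actual API or grading rules.

```python
def evaluate(ask_model, questions):
    """questions: list of dicts with 'prompt' and 'answer' keys;
    ask_model: callable returning the model's answer as a string."""
    correct = 0
    for q in questions:
        prediction = ask_model(q["prompt"]).strip().lower()
        correct += prediction == q["answer"].strip().lower()
    return correct / len(questions)

# Usage with a stub model that always answers "b":
score = evaluate(lambda prompt: "b",
                 [{"prompt": "Which gas is O2? (a) ozone (b) oxygen", "answer": "b"}])
```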
Web Scraping for Research: Legal, Ethical, Institutional, and Scientific
Considerations
arXiv:2410.23432v1
Scientists across disciplines often use data from the internet to conduct
research, generating valuable insights about human behavior. However, as
generative AI relying on massive text corpora becomes increasingly valuable,
platforms have greatly restricted access to data through official channels. As
a result, researchers will likely engage in more web scraping to collect data,
introducing new challenges and concerns. This paper proposes a
comprehensive framework for web scraping in social science research for
U.S.-based researchers, examining the legal, ethical, institutional, and
scientific factors that researchers should consider when scraping the web. We
present an overview of the current regulatory environment impacting when and
how researchers can access, collect, store, and share data via scraping. We
then provide researchers with recommendations to conduct scraping in a
scientifically legitimate and ethical manner. We aim to equip researchers with
the relevant information to mitigate risks and maximize the impact of their
research amidst this evolving data access landscape.
CycleCrash: A Dataset of Bicycle Collision Videos for Collision
Prediction and Analysis
Self-driving research often underrepresents cyclist collisions and safety. To
address this, we present CycleCrash, a novel dataset consisting of 3,000
dashcam videos with 436,347 frames that capture cyclists in a range of critical
situations, from collisions to safe interactions. This dataset enables 9
different cyclist collision prediction and classification tasks focusing on
potentially hazardous conditions for cyclists and is annotated with
collision-related, cyclist-related, and scene-related labels. Next, we propose
VidNeXt, a novel method that leverages a ConvNeXt spatial encoder and a
non-stationary transformer to capture the temporal dynamics of videos for the
tasks defined in our dataset. To demonstrate the effectiveness of our method
and create additional baselines on CycleCrash, we apply and compare 7 models
along with a detailed ablation. We release the dataset and code at
https://github.com/DeSinister/CycleCrash/.
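To make the spatial-encoder-plus-temporal-transformer design concrete, the PyTorch sketch below pairs a small per-frame CNN with a vanilla transformer over frame embeddings; the tiny CNN and stationary transformer stand in for the paper's ConvNeXt encoder and non-stationary transformer, so this is not the released VidNeXt implementation.

```python
import torch
import torch.nn as nn

class SpatialTemporalClassifier(nn.Module):
    """Per-frame CNN encoder followed by a transformer over the frame
    embeddings and a clip-level classification head."""

    def __init__(self, embed_dim=128, num_classes=2, num_layers=2):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, embed_dim, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, video):                    # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        frames = self.spatial(video.flatten(0, 1)).view(b, t, -1)
        return self.head(self.temporal(frames).mean(dim=1))

# Usage: logits = SpatialTemporalClassifier()(torch.randn(2, 8, 3, 64, 64))
```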