GPT-4o System Card
arXiv:2410.21276v1
GPT-4o is an autoregressive omni model that accepts as input any combination
of text, audio, image, and video, and generates any combination of text, audio,
and image outputs. It is trained end-to-end across text, vision, and audio,
meaning all inputs and outputs are processed by the same neural network. GPT-4o
can respond to audio inputs in as little as 232 milliseconds, with an average
of 320 milliseconds, which is similar to human response time in conversation.
It matches GPT-4 Turbo performance on text in English and code, with
significant improvement on text in non-English languages, while also being much
faster and 50% cheaper in the API. GPT-4o is especially better at vision and
audio understanding compared to existing models. In line with our commitment to
building AI safely and consistent with our voluntary commitments to the White
House, we are sharing the GPT-4o System Card, which includes our Preparedness
Framework evaluations. In this System Card, we provide a detailed look at
GPT-4o's capabilities, limitations, and safety evaluations across multiple
categories, focusing on speech-to-speech while also evaluating text and image
capabilities, and measures we've implemented to ensure the model is safe and
aligned. We also include third-party assessments on dangerous capabilities, as
well as discussion of potential societal impacts of GPT-4o's text and vision
capabilities.
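As a rough illustration of the latency claim above, the sketch below times a streamed request to a chat-tuned model with the openai Python client. It is not the measurement protocol behind the 232 ms / 320 ms figures, which refer to audio responses; it only shows how one might observe time-to-first-output for a text request. The model identifier and the environment-based API key are assumptions.

import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o",  # assumed model identifier
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    stream=True,
)
first_chunk_ms = None
for chunk in stream:
    if first_chunk_ms is None:
        first_chunk_ms = (time.perf_counter() - start) * 1000  # time to first streamed chunk
total_ms = (time.perf_counter() - start) * 1000
print(f"first chunk after {first_chunk_ms:.0f} ms, full reply after {total_ms:.0f} ms")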
Fairness On The Ground: Applying Algorithmic Fairness Approaches to
Production Systems
Many technical approaches have been proposed for ensuring that decisions made
by machine learning systems are fair, but few of these proposals have been
stress-tested in real-world systems. This paper presents an example of one
team's approach to the challenge of applying algorithmic fairness approaches to
complex production systems within the context of a large technology company. We
discuss how we disentangle normative questions of product and policy design
(like, "how should the system trade off between different stakeholders'
interests and needs?") from empirical questions of system implementation (like,
"is the system achieving the desired tradeoff in practice?"). We also present
an approach for answering questions of the latter sort, which allows us to
measure how machine learning systems and human labelers are making these
tradeoffs across different relevant groups. We hope our experience integrating
fairness tools and approaches into large-scale and complex production systems
will be useful to other practitioners facing similar challenges, and
illuminating to academics and researchers looking to better address the needs
of practitioners.
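As an illustration of the empirical side of that split ("is the system achieving the desired tradeoff in practice?"), the hypothetical sketch below computes per-group error rates separately for model and human decisions. The data, column names, and metric choice are invented for illustration and are not the team's actual tooling.

import pandas as pd

# Hypothetical audit log: one row per decision, with reviewer type, the group
# the affected user belongs to, the decision taken, and a ground-truth label.
decisions = pd.DataFrame({
    "reviewer":  ["model", "model", "human", "human", "model", "human"],
    "group":     ["A", "B", "A", "B", "B", "A"],
    "flagged":   [1, 0, 1, 1, 1, 0],
    "violating": [1, 0, 0, 1, 0, 0],   # label from an expert re-review
})

def rates(df):
    """False positive and false negative rates for one slice of decisions."""
    fp = ((df.flagged == 1) & (df.violating == 0)).sum() / max((df.violating == 0).sum(), 1)
    fn = ((df.flagged == 0) & (df.violating == 1)).sum() / max((df.violating == 1).sum(), 1)
    return pd.Series({"false_positive_rate": fp, "false_negative_rate": fn, "n": len(df)})

# The empirical question: are model and human reviewers striking the same
# tradeoff for each group, and is that the tradeoff the policy intends?
print(decisions.groupby(["reviewer", "group"]).apply(rates))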
Risk Sources and Risk Management Measures in Support of Standards for
General-Purpose AI Systems
There is an urgent need to identify both short and long-term risks from newly
emerging types of Artificial Intelligence (AI), as well as available risk
management measures. In response, and to support global efforts in regulating
AI and writing safety standards, we compile an extensive catalog of risk
sources and risk management measures for general-purpose AI (GPAI) systems,
complete with descriptions and supporting examples where relevant. This work
involves identifying technical, operational, and societal risks across model
development, training, and deployment stages, as well as surveying established
and experimental methods for managing these risks. To the best of our
knowledge, this paper is the first of its kind to provide extensive
documentation of both GPAI risk sources and risk management measures that are
descriptive, self-contained and neutral with respect to any existing regulatory
framework. This work intends to help AI providers, standards experts,
researchers, policymakers, and regulators in identifying and mitigating
systemic risks from GPAI systems. For this reason, the catalog is released
under a public domain license for ease of direct use by stakeholders in AI
governance and standards.
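To make the shape of such a catalog concrete, here is a hypothetical Python schema for a single entry. The field names, identifiers, and example values are illustrative assumptions, not the structure the authors publish.

from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    """Hypothetical schema for one entry in a GPAI risk catalog."""
    identifier: str                      # e.g. "RS-12" for a risk source, "RM-07" for a measure
    kind: str                            # "risk_source" or "risk_management_measure"
    lifecycle_stage: str                 # "development", "training", or "deployment"
    category: str                        # "technical", "operational", or "societal"
    description: str
    examples: List[str] = field(default_factory=list)
    related_entries: List[str] = field(default_factory=list)

entry = CatalogEntry(
    identifier="RS-12",
    kind="risk_source",
    lifecycle_stage="training",
    category="technical",
    description="Training data contaminated with benchmark test sets.",
    examples=["Leaked evaluation prompts inflate measured capability scores."],
)
print(entry)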
TrustLLM: Trustworthiness in Large Language Models
This work is still in progress, and we welcome contributions.
Large language models (LLMs), exemplified by ChatGPT, have gained
considerable attention for their excellent natural language processing
capabilities. Nonetheless, these LLMs present many challenges, particularly in
the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs
emerges as an important topic. This paper introduces TrustLLM, a comprehensive
study of trustworthiness in LLMs that covers principles for different dimensions
of trustworthiness, an established benchmark, an evaluation and analysis of
trustworthiness for mainstream LLMs, and a discussion of open challenges and
future directions. Specifically, we first propose a set of principles for
trustworthy LLMs that span eight different dimensions. Based on these
principles, we further establish a benchmark across six dimensions including
truthfulness, safety, fairness, robustness, privacy, and machine ethics. We
then present a study evaluating 16 mainstream LLMs on TrustLLM, which comprises
over 30 datasets. Our findings firstly show that, in general, trustworthiness and
utility (i.e., functional effectiveness) are positively related. Secondly, our
observations reveal that proprietary LLMs generally outperform most open-source
counterparts in terms of trustworthiness, raising concerns about the potential
risks of widely accessible open-source LLMs. However, a few open-source LLMs
come very close to proprietary ones. Thirdly, it is important to note that some
LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent
that they compromise their utility by mistakenly treating benign prompts as
harmful and consequently not responding. Finally, we emphasize the importance
of ensuring transparency not only in the models themselves but also in the
technologies that underpin trustworthiness. Knowing the specific trustworthy
technologies that have been employed is crucial for analyzing their
effectiveness.
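The reported positive relationship between trustworthiness and utility can be checked with a simple correlation once per-model scores are aggregated. The sketch below uses invented scores purely to show the form of that analysis; it is not TrustLLM's evaluation code.

from statistics import correlation  # Pearson correlation, Python 3.10+

# Hypothetical per-model scores on a 0-100 scale; the real TrustLLM study
# aggregates results from over 30 datasets across six trust dimensions.
utility         = [82, 75, 68, 90, 60, 71]   # functional effectiveness
trustworthiness = [78, 70, 65, 88, 55, 74]   # mean of the trust dimensions

r = correlation(utility, trustworthiness)
print(f"Pearson r between utility and trustworthiness: {r:.2f}")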
The Visual Experience Dataset: Over 200 Recorded Hours of Integrated Eye
Movement, Odometry, and Egocentric Video
We introduce the Visual Experience Dataset (VEDB), a compilation of over 240
hours of egocentric video combined with gaze- and head-tracking data that
offers an unprecedented view of the visual world as experienced by human
observers. The dataset consists of 717 sessions, recorded by 58 observers
ranging in age from 6 to 49 years. This paper outlines the data collection,
processing, and labeling protocols undertaken to ensure a representative sample
and discusses the potential sources of error or bias within the dataset. The
VEDB's potential applications are vast, including improving gaze tracking
methodologies, assessing spatiotemporal image statistics, and refining deep
neural networks for scene and activity recognition. The VEDB is accessible
through established open science platforms and is intended to be a living
dataset with plans for expansion and community contributions. It is released
with an emphasis on ethical considerations, such as participant privacy and the
mitigation of potential biases. By providing a dataset grounded in real-world
experiences and accompanied by extensive metadata and supporting code, the
authors invite the research community to utilize and contribute to the VEDB,
facilitating a richer understanding of visual perception and behavior in
naturalistic settings.
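A typical first step with such a dataset is aligning gaze samples to video frame timestamps. The sketch below assumes hypothetical per-session CSV files and column names; the VEDB's actual file layout and loading code are documented with the released data.

import pandas as pd

# Hypothetical file layout; consult the released VEDB documentation for the
# actual session structure, file names, and column names.
gaze = pd.read_csv("session_0001/gaze.csv")            # assumed columns: timestamp, norm_x, norm_y
frames = pd.read_csv("session_0001/frame_times.csv")   # assumed columns: timestamp, frame_index

# Associate each video frame with the nearest-in-time gaze sample so that
# egocentric frames can be annotated with where the observer was looking.
aligned = pd.merge_asof(
    frames.sort_values("timestamp"),
    gaze.sort_values("timestamp"),
    on="timestamp",
    direction="nearest",
)
print(aligned.head())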
Introducing v0.5 of the AI Safety Benchmark from MLCommons
arXiv:2404.12241v2
This paper introduces v0.5 of the AI Safety Benchmark, which has been created
by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been
designed to assess the safety risks of AI systems that use chat-tuned language
models. We introduce a principled approach to specifying and constructing the
benchmark, which for v0.5 covers only a single use case (an adult chatting to a
general-purpose assistant in English), and a limited set of personas (i.e.,
typical users, malicious users, and vulnerable users). We created a new
taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark.
We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024.
The v1.0 benchmark will provide meaningful insights into the safety of AI
systems. However, the v0.5 benchmark should not be used to assess the safety of
AI systems. We have sought to fully document the limitations, flaws, and
challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes
(1) a principled approach to specifying and constructing the benchmark, which
comprises use cases, types of systems under test (SUTs), language and context,
personas, tests, and test items; (2) a taxonomy of 13 hazard categories with
definitions and subcategories; (3) tests for seven of the hazard categories,
each comprising a unique set of test items, i.e., prompts. There are 43,090
test items in total, which we created with templates; (4) a grading system for
AI systems against the benchmark; (5) an openly available platform, and
downloadable tool, called ModelBench that can be used to evaluate the safety of
AI systems on the benchmark; (6) an example evaluation report which benchmarks
the performance of over a dozen openly available chat-tuned language models;
(7) a test specification for the benchmark.
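Item (3) above, test items created from templates, can be illustrated in a few lines of Python. The templates, placeholder fragments, and personas below are benign stand-ins invented for this sketch; the real v0.5 items target the hazard categories and are distributed with the benchmark and ModelBench.

from itertools import product

# Illustrative only: the actual v0.5 templates, hazard categories, and personas
# are defined by the MLCommons AI Safety Working Group.
templates = [
    "How do I {action}?",
    "My friend wants to {action}. What should I tell them?",
]
actions = ["pick a strong password", "report a scam email"]  # benign placeholders
personas = ["typical user", "vulnerable user"]

test_items = [
    {"persona": persona, "prompt": template.format(action=action)}
    for persona, template, action in product(personas, templates, actions)
]
print(len(test_items), "items;", test_items[0])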
Can Fairness be Automated? Guidelines and Opportunities for
Fairness-aware AutoML
arXiv:2303.08485v2
The field of automated machine learning (AutoML) introduces techniques that
automate parts of the development of machine learning (ML) systems,
accelerating the process and reducing barriers for novices. However, decisions
derived from ML models can reproduce, amplify, or even introduce unfairness in
our societies, causing harm to (groups of) individuals. In response,
researchers have started to propose AutoML systems that jointly optimize
fairness and predictive performance to mitigate fairness-related harm. However,
fairness is a complex and inherently interdisciplinary subject, and solely
posing it as an optimization problem can have adverse side effects. With this
work, we aim to raise awareness among developers of AutoML systems about such
limitations of fairness-aware AutoML, while also calling attention to the
potential of AutoML as a tool for fairness research. We present a comprehensive
overview of different ways in which fairness-related harm can arise and the
ensuing implications for the design of fairness-aware AutoML. We conclude that
while fairness cannot be automated, fairness-aware AutoML can play an important
role in the toolbox of ML practitioners. We highlight several open technical
challenges for future work in this direction. Additionally, we advocate for the
creation of more user-centered assistive systems designed to tackle challenges
encountered in fairness work.
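To make the paper's starting point concrete, fairness posed as part of an optimization problem, the sketch below selects a model by a joint objective that rewards accuracy and penalizes a demographic parity gap. It uses synthetic scikit-learn data and an invented sensitive attribute, and it deliberately mirrors the narrow framing whose limitations the paper discusses.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
group = (X[:, 0] > 0).astype(int)  # synthetic sensitive attribute, for illustration only
X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(X, y, group, random_state=0)

def demographic_parity_gap(y_pred, g):
    """Absolute difference in positive-prediction rates between the two groups."""
    return abs(y_pred[g == 0].mean() - y_pred[g == 1].mean())

best = None
for C in [0.01, 0.1, 1.0, 10.0]:  # simple AutoML-style search space
    model = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    score = accuracy_score(y_te, pred) - demographic_parity_gap(pred, g_te)  # joint objective
    if best is None or score > best[0]:
        best = (score, C)
print("selected C:", best[1], "joint score:", round(best[0], 3))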
Disentangling and Operationalizing AI Fairness at LinkedIn
arXiv:2306.00025v1
Operationalizing AI fairness at LinkedIn's scale is challenging not only
because there are multiple mutually incompatible definitions of fairness but
also because determining what is fair depends on the specifics and context of
the product where AI is deployed. Moreover, AI practitioners need clarity on
what fairness expectations need to be addressed at the AI level. In this paper,
we present the evolving AI fairness framework used at LinkedIn to address these
three challenges. The framework disentangles AI fairness by separating out
equal treatment and equitable product expectations. Rather than imposing a
trade-off between these two commonly opposing interpretations of fairness, the
framework provides clear guidelines for operationalizing equal AI treatment
complemented with a product equity strategy. This paper focuses on the equal AI
treatment component of LinkedIn's AI fairness framework, shares the principles
that support it, and illustrates their application through a case study. We
hope this paper will encourage other big tech companies to join us in sharing
their approach to operationalizing AI fairness at scale, so that together we
can keep advancing this constantly evolving field.
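As a toy illustration of the equal-treatment question, the hypothetical sketch below compares model scores across groups within the same qualification level. The data and the notion of "qualified" are invented for this sketch; LinkedIn's actual measurement approach is described in the paper and its case study.

import pandas as pd

# Hypothetical scored candidates; "qualified" stands in for whatever
# job-relevant signal the product defines. The equal-treatment question is
# whether equally qualified members receive comparable scores regardless of group.
scored = pd.DataFrame({
    "group":     ["A", "A", "B", "B", "A", "B"],
    "qualified": [1, 0, 1, 0, 1, 1],
    "score":     [0.81, 0.35, 0.78, 0.40, 0.84, 0.80],
})

summary = scored.groupby(["qualified", "group"])["score"].agg(["mean", "count"])
print(summary)  # large per-group gaps within the same qualification level flag a treatment issue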