proposing "bidirectional human-AI alignment" framework after a
systematic review of over 400 align...
Recent advancements in general-purpose AI have highlighted the importance of
guiding AI systems towards the intended goals, ethical principles, and values
of individuals and groups, a concept broadly recognized as alignment. However,
the lack of clear definitions and scopes of human-AI alignment poses a
significant obstacle, hampering collaborative efforts across research domains
to achieve this alignment. In particular, ML- and philosophy-oriented alignment
research often views AI alignment as a static, unidirectional process (i.e.,
aiming to ensure that AI systems' objectives match those of humans) rather than an
ongoing, mutual alignment problem. This perspective largely neglects the
long-term interaction and dynamic changes of alignment. To understand these
gaps, we introduce a systematic review of over 400 papers published between
2019 and January 2024, spanning multiple domains such as Human-Computer
Interaction (HCI), Natural Language Processing (NLP), and Machine Learning (ML). We
characterize, define, and scope human-AI alignment. From this, we present a
conceptual framework of "Bidirectional Human-AI Alignment" to organize the
literature from a human-centered perspective. This framework encompasses both
1) conventional studies of aligning AI to humans, which ensure that AI produces the
intended outcomes determined by humans, and 2) a proposed concept of aligning
humans to AI, which aims to help individuals and society adjust to AI
advancements both cognitively and behaviorally. Additionally, we articulate the
key findings derived from the literature analysis, including literature gaps and
trends, human values, and interaction techniques. To pave the way for future
studies, we envision three key challenges and give recommendations for future
research.
Measuring and Addressing Indexical Bias in Information Retrieval
Information Retrieval (IR) systems are designed to deliver relevant content,
but traditional systems may not optimize rankings for fairness, neutrality, or
the balance of ideas. Consequently, IR can often introduce indexical biases, or
biases in the positional order of documents. Although indexical bias can
demonstrably affect people's opinions, voting patterns, and other behaviors,
these issues remain understudied as the field lacks reliable metrics and
procedures for automatically measuring indexical bias. Towards this end, we
introduce the PAIR framework, which supports automatic bias audits for ranked
documents or entire IR systems. After introducing DUO, the first
general-purpose automatic bias metric, we run an extensive evaluation of 8 IR
systems on a new corpus of 32k synthetic and 4.7k natural documents, with 4k
queries spanning 1.4k controversial issue topics. A human behavioral study
validates our approach, showing that our bias metric can help predict when and
how indexical bias will shift a reader's opinion.
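The abstract does not reproduce the DUO formula, so as a rough illustration of what a positional bias score can look like, here is a minimal sketch that discounts each document's stance label by its rank; the function name, the {-1, +1} stance labels, and the log-based discount are assumptions for illustration, not the paper's metric.

```python
# Minimal sketch of a positional (indexical) bias score for a ranked list.
# Assumption: each document is labeled with a stance in {-1, +1} on one
# controversial issue. This is an illustrative stand-in, not the DUO metric.
import math

def positional_bias(stances: list[int]) -> float:
    """Return a score in [-1, 1]; 0 means the two stances are balanced once
    each rank is discounted by 1/log2(rank + 2), and the sign indicates which
    stance dominates the top of the ranking."""
    weights = [1.0 / math.log2(rank + 2) for rank in range(len(stances))]
    weighted = sum(w * s for w, s in zip(weights, stances))
    return weighted / sum(weights) if weights else 0.0

# Pro-leaning documents stacked at the top of the ranking yield a positive score.
print(positional_bias([+1, +1, +1, -1, -1]))  # ~0.45
```

A score near zero would indicate that, once position is taken into account, the two sides of the issue are roughly evenly represented across the ranking.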
arXiv:2403.04893v1
Independent evaluation and red teaming are critical for identifying the risks
posed by generative AI systems. However, the terms of service and enforcement
strategies used by prominent AI companies to deter model misuse can
disincentivize good faith safety evaluations. This causes some researchers to
fear that conducting such research or releasing their findings will result in
account suspensions or legal reprisal. Although some companies offer researcher
access programs, they are an inadequate substitute for independent research
access, as they have limited community representation, receive inadequate
funding, and lack independence from corporate incentives. We propose that major
AI developers commit to providing a legal and technical safe harbor,
indemnifying public interest safety research and protecting it from the threat
of account suspensions or legal reprisal. These proposals emerged from our
collective experience conducting safety, privacy, and trustworthiness research
on generative AI systems, where norms and incentives could be better aligned
with public interests, without exacerbating model misuse. We believe these
commitments are a necessary step towards more inclusive and unimpeded community
efforts to tackle the risks of generative AI.
Can Large Language Models Transform Computational Social Science?
Large Language Models (LLMs) are capable of successfully performing many
language processing tasks zero-shot (without training data). If zero-shot LLMs
can also reliably classify and explain social phenomena like persuasiveness and
political ideology, then LLMs could augment the Computational Social Science
(CSS) pipeline in important ways. This work provides a road map for using LLMs
as CSS tools. Towards this end, we contribute a set of prompting best practices
and an extensive evaluation pipeline to measure the zero-shot performance of 13
language models on 25 representative English CSS benchmarks. On taxonomic
labeling tasks (classification), LLMs fail to outperform the best fine-tuned
models but still achieve fair levels of agreement with humans. On free-form
coding tasks (generation), LLMs produce explanations that often exceed the
quality of crowdworkers' gold references. We conclude that the performance of
today's LLMs can augment the CSS research pipeline in two ways: (1) serving as
zero-shot data annotators on human annotation teams, and (2) bootstrapping
challenging creative generation tasks (e.g., explaining the underlying
attributes of a text). In summary, LLMs are poised to meaningfully participate
in social science analysis in partnership with humans.
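As a hedged sketch of the "zero-shot data annotator" use case described above (not the paper's actual prompting setup or evaluation pipeline), the snippet below wraps an unspecified LLM behind a placeholder `call_llm` stub and scores its chance-corrected agreement with human labels; the prompt wording and label set are illustrative assumptions.

```python
# Minimal sketch of the "zero-shot annotator" use case: an LLM labels texts
# from a natural-language task description alone, and agreement with human
# annotators is measured. The prompt, label set, and `call_llm` stub are
# illustrative assumptions, not the paper's prompting best practices.
from sklearn.metrics import cohen_kappa_score

LABELS = ["persuasive", "not persuasive"]

def build_prompt(text: str) -> str:
    # Zero-shot: the task is described in words, with no training examples.
    return ("Label the following argument as 'persuasive' or 'not persuasive'. "
            "Answer with the label only.\n\nArgument: " + text)

def call_llm(prompt: str) -> str:
    # Placeholder for any chat/completions client; swap in a real API call here.
    raise NotImplementedError

def annotate(texts: list[str]) -> list[str]:
    preds = []
    for text in texts:
        answer = call_llm(build_prompt(text)).strip().lower()
        preds.append(answer if answer in LABELS else LABELS[1])  # fallback on parse failure
    return preds

def agreement(llm_labels: list[str], human_labels: list[str]) -> float:
    # Chance-corrected agreement between the LLM and a human annotator.
    return cohen_kappa_score(llm_labels, human_labels)
```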
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to
Challenge AI Safety by Humanizing LLMs
14 pages of main text; qualitative examples of jailbreaks may be harmful in
nature.
Most traditional AI safety research has approached AI models as machines and
centered on algorithm-focused attacks developed by security experts. As large
language models (LLMs) become increasingly common and competent, non-expert
users can also impose risks during daily interactions. This paper introduces a
new perspective to jailbreak LLMs as human-like communicators, to explore this
overlooked intersection between everyday language interaction and AI safety.
Specifically, we study how to persuade LLMs to jailbreak them. First, we
propose a persuasion taxonomy derived from decades of social science research.
Then, we apply the taxonomy to automatically generate interpretable persuasive
adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion
significantly increases the jailbreak performance across all risk categories:
PAP consistently achieves an attack success rate of over 92% on Llama 2-7b
Chat, GPT-3.5, and GPT-4 in 10 trials, surpassing recent algorithm-focused
attacks. On the defense side, we explore various mechanisms against PAP, find a
significant gap in existing defenses, and advocate for more fundamental
mitigations for highly interactive LLMs.
Multi-VALUE: A Framework for Cross-Dialectal English NLP
Dialect differences arising from regional, social, and economic barriers cause
performance discrepancies for many groups of language technology users.
Fair, inclusive, and equitable language technology must critically be dialect
invariant, meaning that performance remains constant over dialectal shifts.
Current English systems often fall significantly short of this ideal since they
are designed and tested on a single dialect: Standard American English. We
introduce Multi-VALUE -- a suite of resources for evaluating and achieving
English dialect invariance. We build a controllable rule-based translation
system spanning 50 English dialects and a total of 189 unique linguistic
features. Our translation maps Standard American English text to a synthetic
form of each dialect, using an upper bound on the natural density of features
in that dialect. First, we use this system to build stress tests for question
answering, machine translation, and semantic parsing tasks. Stress tests reveal
significant performance disparities for leading models on non-standard
dialects. Second, we use this system as a data augmentation technique to
improve the dialect robustness of existing systems. Finally, we partner with
native speakers of Chicano and Indian English to release new gold-standard
variants of the popular CoQA task.
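To make the idea of a rule-based dialect transformation concrete, here is a minimal sketch in the spirit of the system described above; the single regex rule is a toy approximation of copula absence before a progressive verb and is not one of Multi-VALUE's 189 curated feature rules.

```python
# Minimal sketch of a rule-based dialect perturbation in the spirit of the
# system described above. The single regex below is a toy approximation of
# copula absence before a progressive verb; the real system implements 189
# curated linguistic features with far more care.
import re

def drop_copula_before_progressive(text: str) -> str:
    """'She is running late' -> 'She running late' (illustrative only)."""
    return re.sub(r"\b(is|are)\s+(\w+ing)\b", r"\2", text)

def dialect_stress_example(example: dict, transform) -> dict:
    # Build a dialect-shifted copy of a QA example while keeping the gold answer.
    return {"question": transform(example["question"]), "answer": example["answer"]}

print(drop_copula_before_progressive("The model is failing on this dialect."))
# -> "The model failing on this dialect."
```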
In a fair world, people have equitable opportunities to pursue education, to conduct
scientific research, to publish, and to get credit for their work, regardless
of where they live. However, it is common knowledge among researchers that a
vast number of papers accepted at top NLP venues come from a handful of western
countries and (lately) China, whereas very few papers from Africa and South
America get published. Similar disparities are also believed to exist for paper
citation counts. In the spirit of "what we do not measure, we cannot improve",
this work asks a series of questions on the relationship between geographical
location and publication success (acceptance in top NLP venues and citation
impact). We first created a dataset of 70,000 papers from the ACL Anthology,
extracted their meta-information, and generated their citation network. We then
show that not only are there substantial geographical disparities in paper
acceptance and citation but also that these disparities persist even when
controlling for a number of variables such as venue of publication and
sub-field of NLP. Further, despite some steps taken by the NLP community to
improve geographical diversity, we show that the disparity in publication
metrics across locations has continued to increase since the early 2000s.
We release our code and dataset here:
https://github.com/iamjanvijay/acl-cite-net
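As a minimal sketch of the kind of citation analysis described above, the snippet below builds incoming-citation counts from toy paper records and aggregates them by region; the field names and records are assumptions for illustration, not the schema of the released dataset.

```python
# Minimal sketch: build incoming-citation counts from paper metadata and
# aggregate them by region. The field names and toy records are illustrative
# assumptions, not the schema of the released dataset.
from collections import defaultdict

papers = [
    {"id": "P1", "region": "North America", "cites": ["P3"]},
    {"id": "P2", "region": "Europe", "cites": ["P1", "P3"]},
    {"id": "P3", "region": "Africa", "cites": ["P1"]},
]

# Count incoming citations per paper from the outgoing "cites" lists.
incoming = defaultdict(int)
for paper in papers:
    for cited_id in paper["cites"]:
        incoming[cited_id] += 1

# Average citations received per paper, grouped by the authors' region.
totals, counts = defaultdict(int), defaultdict(int)
for paper in papers:
    totals[paper["region"]] += incoming[paper["id"]]
    counts[paper["region"]] += 1

for region in sorted(totals):
    print(region, totals[region] / counts[region])
```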
Causal Inference in Natural Language Processing: Estimation, Prediction,
Interpretation and Beyond
Accepted to Transactions of the Association for Computational
Linguistics (TACL)
A fundamental goal of scientific research is to learn about causal
relationships. However, despite its critical role in the life and social
sciences, causality has not had the same importance in Natural Language
Processing (NLP), which has traditionally placed more emphasis on predictive
tasks. This distinction is beginning to fade, with an emerging area of
interdisciplinary research at the convergence of causal inference and language
processing. Still, research on causality in NLP remains scattered across
domains without unified definitions, benchmark datasets and clear articulations
of the challenges and opportunities in the application of causal inference to
the textual domain, with its unique properties. In this survey, we consolidate
research across academic areas and situate it in the broader NLP landscape. We
introduce the statistical challenge of estimating causal effects with text,
encompassing settings where text is used as an outcome, treatment, or to
address confounding. In addition, we explore potential uses of causal inference
to improve the robustness, fairness, and interpretability of NLP models. We
thus provide a unified overview of causal inference for the NLP community.
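To make one of the surveyed settings concrete, here is a minimal sketch of backdoor adjustment when a text-derived variable (a topic label) acts as a confounder between a binary treatment and an outcome; the records and numbers are toy assumptions, not an estimator from the survey.

```python
# Minimal sketch of backdoor adjustment with a text-derived confounder (here a
# "topic" inferred from each document). The records and numbers are toy
# assumptions used only to show the arithmetic.
from collections import defaultdict

records = [
    {"treatment": 1, "outcome": 0.9, "topic": "health"},
    {"treatment": 0, "outcome": 0.6, "topic": "health"},
    {"treatment": 1, "outcome": 0.4, "topic": "politics"},
    {"treatment": 0, "outcome": 0.3, "topic": "politics"},
]

def adjusted_effect(records: list[dict]) -> float:
    """Average the within-stratum treated-minus-control outcome gap, weighted
    by each stratum's share of the data (backdoor adjustment over the topic)."""
    strata = defaultdict(lambda: {1: [], 0: []})
    for r in records:
        strata[r["topic"]][r["treatment"]].append(r["outcome"])
    effect = 0.0
    for outcomes in strata.values():
        gap = sum(outcomes[1]) / len(outcomes[1]) - sum(outcomes[0]) / len(outcomes[0])
        effect += gap * (len(outcomes[1]) + len(outcomes[0])) / len(records)
    return effect

print(adjusted_effect(records))  # ~0.2 in this toy example
```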
The Moral Integrity Corpus: A Benchmark for Ethical Dialogue Systems
Conversational agents have come increasingly closer to human competence in
open-domain dialogue settings; however, such models can reflect insensitive,
hurtful, or entirely incoherent viewpoints that erode a user's trust in the
moral integrity of the system. Moral deviations are difficult to mitigate
because moral judgments are not universal, and there may be multiple competing
judgments that apply to a situation simultaneously. In this work, we introduce
a new resource, not to authoritatively resolve moral ambiguities, but instead
to facilitate systematic understanding of the intuitions, values and moral
judgments reflected in the utterances of dialogue systems. The Moral Integrity
Corpus, MIC, is such a resource, which captures the moral assumptions of 38k
prompt-reply pairs, using 99k distinct Rules of Thumb (RoTs). Each RoT reflects
a particular moral conviction that can explain why a chatbot's reply may appear
acceptable or problematic. We further organize RoTs with a set of 9 moral and
social attributes and benchmark performance for attribute classification. Most
importantly, we show that current neural language models can automatically
generate new RoTs that reasonably describe previously unseen interactions, but
they still struggle with certain scenarios. Our findings suggest that MIC will
be a useful resource for understanding language models' implicit moral
assumptions and for flexibly benchmarking the integrity of conversational agents.
To download the data, see https://github.com/GT-SALT/mic
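As a rough illustration of how MIC-style examples might be structured and scored for attribute classification, here is a minimal sketch with a trivial majority-class baseline; the field names, attribute values, and baseline are assumptions for illustration, not the corpus's actual schema or benchmarks.

```python
# Minimal sketch of how MIC-style prompt/reply/RoT examples might be organized
# and scored for attribute classification. The field names, attribute values,
# and majority-class baseline are illustrative assumptions, not the corpus's
# actual schema or benchmarked models.
from collections import Counter

examples = [
    {"prompt": "Should I read my partner's messages?",
     "reply": "Sure, if you're curious.",
     "rot": "It is wrong to violate someone's privacy.",
     "severity": "high"},
    {"prompt": "Is it okay to skip a friend's party?",
     "reply": "Yes, just let them know in advance.",
     "rot": "It is good to be honest with friends.",
     "severity": "low"},
]

def majority_baseline(train, test, attribute):
    """Predict the most frequent attribute value seen in the training split."""
    majority = Counter(ex[attribute] for ex in train).most_common(1)[0][0]
    predictions = [majority for _ in test]
    accuracy = sum(p == ex[attribute] for p, ex in zip(predictions, test)) / len(test)
    return predictions, accuracy

print(majority_baseline(examples, examples, "severity"))  # (['high', 'high'], 0.5)
```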
Mitigating Racial Biases in Toxic Language Detection with an
Equity-Based Ensemble Framework
Accepted to ACM EAAMO '21: https://eaamo.org/accepted/ Code
available: https://github.com/matanhal...
Recent research has demonstrated how racial biases against users who write
African American English exist in popular toxic language datasets. While
previous work has focused on a single fairness criterion, we propose to use
additional descriptive fairness metrics to better understand the source of
these biases. We demonstrate that different benchmark classifiers, as well as
two in-process bias-remediation techniques, propagate racial biases even in a
larger corpus. We then propose a novel ensemble framework that uses a
specialized classifier that is fine-tuned to the African American English
dialect. We show that our proposed framework substantially reduces the racial
biases that the model learns from these datasets. We demonstrate how the
ensemble framework improves fairness metrics across all sample datasets with
minimal impact on the classification performance, and provide empirical
evidence of its ability to unlearn the annotation biases towards authors who
use African American English.
** Please note that this work may contain examples of offensive words and
phrases.
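As a hedged sketch of the general idea of an ensemble that blends in a dialect-specialized classifier (not the paper's exact framework), the snippet below routes text flagged as African American English through both a general and a specialized toxicity model and averages their scores; all three model stubs and the 50/50 weighting are illustrative assumptions.

```python
# Minimal sketch of an ensemble that blends in a dialect-specialized toxicity
# classifier when a dialect identifier flags African American English. The
# three stubs and the equal weighting are illustrative assumptions, not the
# paper's exact framework.

def general_toxicity(text: str) -> float:
    # Placeholder for a classifier trained on the full toxic-language corpus.
    raise NotImplementedError

def aae_toxicity(text: str) -> float:
    # Placeholder for a classifier fine-tuned on African American English data.
    raise NotImplementedError

def is_aae(text: str) -> bool:
    # Placeholder for a dialect identifier (lexical or probabilistic).
    raise NotImplementedError

def ensemble_score(text: str) -> float:
    """Blend in the specialized model only when the dialect is detected, so it
    can counteract the general model's bias on AAE text."""
    if is_aae(text):
        return 0.5 * general_toxicity(text) + 0.5 * aae_toxicity(text)
    return general_toxicity(text)
```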