The Thirty-eighth Conference on Neural Information Processing Systems
Datasets and Benchmarks Track...
Deep graph learning has gained great popularity in recent years due to
its versatility and success in representing graph data across a wide range of
domains. However, the pervasive issue of imbalanced graph data distributions,
where certain parts exhibit disproportionately abundant data while others remain
sparse, undermines the efficacy of conventional graph learning algorithms,
leading to biased outcomes. To address this challenge, Imbalanced Graph
Learning (IGL) has garnered substantial attention, enabling more balanced data
distributions and better task performance. Despite the proliferation of IGL
algorithms, the absence of consistent experimental protocols and fair
performance comparisons poses a significant barrier to understanding
advancements in this field. To bridge this gap, we introduce IGL-Bench, a
foundational and comprehensive benchmark for imbalanced graph learning,
encompassing 16 diverse graph datasets and 24 distinct IGL algorithms with
uniform data
processing and splitting strategies. Specifically, IGL-Bench systematically
investigates state-of-the-art IGL algorithms in terms of effectiveness,
robustness, and efficiency on node-level and graph-level tasks, covering both
class imbalance and topology imbalance. Extensive experiments demonstrate
the potential benefits of IGL algorithms under various imbalanced conditions,
offering insights and opportunities in the IGL field. Further, we have
developed an open-source, unified package to facilitate reproducible
evaluation and inspire further innovative research, which is available at
https://github.com/RingBDStack/IGL-Bench.
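As a minimal illustration of the node-level class imbalance the benchmark targets, the sketch below computes per-class node counts and the majority/minority imbalance ratio on a standard citation graph. It uses PyTorch Geometric's Planetoid loader purely for illustration and is not part of the IGL-Bench package.

```python
# Illustrative sketch (not IGL-Bench code): quantify class imbalance on a
# node-classification dataset by comparing per-class node counts.
import torch
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root="data/Cora", name="Cora")
labels = dataset[0].y                      # node labels, shape [num_nodes]

counts = torch.bincount(labels)            # nodes per class
imbalance_ratio = counts.max().item() / counts.min().item()

print("nodes per class:", counts.tolist())
print(f"class-imbalance ratio (majority/minority): {imbalance_ratio:.2f}")
```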
BehaviorGPT: Smart Agent Simulation for Autonomous Driving with
Next-Patch Prediction
Simulating realistic behaviors of traffic agents is pivotal for efficiently
validating the safety of autonomous driving systems. Existing data-driven
simulators primarily use an encoder-decoder architecture to encode the
historical trajectories before decoding the future. However, the heterogeneity
between encoders and decoders complicates the models, and the manual separation
of historical and future trajectories leads to low data utilization. Given
these limitations, we propose BehaviorGPT, a homogeneous and fully
autoregressive Transformer designed to simulate the sequential behavior of
multiple agents. Crucially, our approach discards the traditional separation
between "history" and "future" by modeling each time step as the "current" one
for motion generation, leading to a simpler, more parameter- and data-efficient
agent simulator. We further introduce the Next-Patch Prediction Paradigm (NP3)
to mitigate the negative effects of autoregressive modeling, in which models
are trained to reason at the patch level of trajectories and capture long-range
spatial-temporal interactions. Despite having merely 3M model parameters,
BehaviorGPT won first place in the 2024 Waymo Open Sim Agents Challenge with a
realism score of 0.7473 and a minADE score of 1.4147, demonstrating its
exceptional performance in traffic agent simulation.
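As a rough sketch of the next-patch idea (not the paper's implementation), the snippet below groups a trajectory into fixed-length patches and builds teacher-forced input/target pairs for a causal model; the batch size, horizon, and patch length are illustrative assumptions.

```python
# Illustrative sketch of patch-level autoregressive targets: a trajectory of
# T steps is grouped into patches of P steps, and the model predicts each
# next patch from the preceding ones.
import torch

B, T, D = 4, 80, 2          # batch, time steps, (x, y) per step -- assumed
P = 8                       # patch length -- assumed
traj = torch.randn(B, T, D)

patches = traj.reshape(B, T // P, P * D)            # [B, num_patches, P*D]
inputs, targets = patches[:, :-1], patches[:, 1:]   # teacher-forced next-patch pairs

# Any causal (autoregressive) Transformer over `inputs` can then be trained
# with a regression or likelihood loss against `targets`.
print(inputs.shape, targets.shape)                  # torch.Size([4, 9, 16]) twice
```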
LongSafetyBench: Long-Context LLMs Struggle with Safety Issues
arXiv:2411.06899v1
With the development of large language models (LLMs), the sequence length of
these models continues to increase, drawing significant attention to
long-context language models. However, the evaluation of these models has been
primarily limited to their capabilities, with a lack of research focusing on
their safety. Existing work, such as ManyShotJailbreak, has to some extent
demonstrated that long-context language models can exhibit safety concerns.
However, the methods used are limited and lack comprehensiveness. In response,
we introduce LongSafetyBench, the first benchmark designed to
objectively and comprehensively evaluate the safety of long-context models.
LongSafetyBench consists of 10 task categories, with an average length of
41,889 words. After testing eight long-context language models on
LongSafetyBench, we found that existing models generally exhibit insufficient
safety capabilities. The proportion of safe responses from most mainstream
long-context LLMs is below 50%. Moreover, models' safety performance in
long-context scenarios does not always align with that in short-context
scenarios. Further investigation revealed that long-context models tend to
overlook harmful content within lengthy texts. We also proposed a simple yet
effective solution, allowing open-source models to achieve performance
comparable to that of top-tier closed-source models. We believe that
LongSafetyBench can serve as a valuable benchmark for evaluating the safety
capabilities of long-context language models. We hope that our work will
encourage the broader community to pay attention to the safety of long-context
models and contribute to the development of solutions to improve the safety of
long-context LLMs.
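A minimal sketch of the headline metric, the proportion of safe responses, assuming each evaluated response has already been judged safe or unsafe; the data format is a placeholder, not the benchmark's actual schema.

```python
# Illustrative sketch: compute the proportion of safe responses from a list of
# per-item safety judgements (placeholder labels, not LongSafetyBench data).
from collections import Counter

judgements = ["safe", "unsafe", "safe", "safe", "unsafe"]

counts = Counter(judgements)
safe_rate = counts["safe"] / len(judgements)
print(f"proportion of safe responses: {safe_rate:.1%}")   # 60.0%
```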
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
arXiv:2411.04905v2
Large language models (LLMs) for code have become indispensable in various
domains, including code generation, reasoning tasks and agent systems. While
open-access code LLMs are increasingly approaching the performance levels of
proprietary models, high-quality code LLMs suitable for rigorous scientific
investigation, particularly those with reproducible data processing pipelines
and transparent training protocols, remain limited. The scarcity is due to
various challenges, including resource constraints, ethical considerations, and
the competitive advantages of keeping models advanced. To address the gap, we
introduce OpenCoder, a top-tier code LLM that not only achieves performance
comparable to leading models but also serves as an "open cookbook" for the
research community. Unlike most prior efforts, we release not only model
weights and inference code, but also the reproducible training data, complete
data processing pipeline, rigorous experimental ablation results, and detailed
training protocols for open scientific research. Through this comprehensive
release, we identify the key ingredients for building a top-tier code LLM: (1)
code-optimized heuristic rules for data cleaning and methods for data
deduplication, (2) recall of text corpora related to code, and (3) high-quality
synthetic data in both the annealing and supervised fine-tuning stages. By offering
this level of openness, we aim to broaden access to all aspects of a top-tier
code LLM, with OpenCoder serving as both a powerful model and an open
foundation to accelerate research and enable reproducible advancements in code
AI.
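As a toy illustration of one named ingredient, data deduplication, the sketch below drops exact duplicate files by content hash; OpenCoder's actual pipeline applies far more elaborate heuristic cleaning and fuzzy deduplication.

```python
# Illustrative sketch: exact file-level deduplication of a code corpus via
# content hashing (a simplified stand-in for a real dedup pipeline).
import hashlib

def dedup(files: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct file content."""
    seen, kept = set(), []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

corpus = ["print('hi')\n", "print('hi')\n", "def f():\n    return 1\n"]
print(len(dedup(corpus)))   # 2
```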
An Adversarial Perspective on Machine Unlearning for AI Safety
Large language models are finetuned to refuse questions about hazardous
knowledge, but these protections can often be bypassed. Unlearning methods aim
to completely remove hazardous capabilities from models, making them
inaccessible to adversaries. This work challenges the fundamental differences
between unlearning and traditional safety post-training from an adversarial
perspective. We demonstrate that existing jailbreak methods, previously
reported as ineffective against unlearning, can be successful when applied
carefully. Furthermore, we develop a variety of adaptive methods that recover
most supposedly unlearned capabilities. For instance, we show that finetuning
on 10 unrelated examples or removing specific directions in the activation
space can recover most hazardous capabilities for models edited with RMU, a
state-of-the-art unlearning method. Our findings challenge the robustness of
current unlearning approaches and question their advantages over safety
training.
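A minimal sketch of the "removing specific directions in the activation space" idea: given a unit direction vector, project it out of a hidden-state tensor. In practice the direction would be estimated from model activations; here it is random, and all shapes are assumptions.

```python
# Illustrative sketch: ablate one direction from hidden activations by
# subtracting each vector's component along that direction.
import torch

hidden = torch.randn(4, 16, 512)              # [batch, seq, hidden] -- assumed sizes
direction = torch.randn(512)
direction = direction / direction.norm()      # unit vector to ablate (random here)

coeff = hidden @ direction                    # component along the direction, [batch, seq]
ablated = hidden - coeff.unsqueeze(-1) * direction

# The remaining component along the ablated direction is (numerically) zero.
print((ablated @ direction).abs().max())
```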
Integrating Object Detection Modality into Visual Language Model for
Enhanced Autonomous Driving Agent
In this paper, we propose a novel framework for enhancing visual
comprehension in autonomous driving systems by integrating visual language
models (VLMs) with an additional visual perception module specialised in object
detection. We extend the Llama-Adapter architecture by incorporating a
YOLOS-based detection network alongside the CLIP perception network, addressing
limitations in object detection and localisation. Our approach introduces
camera ID-separators to improve multi-view processing, crucial for
comprehensive environmental awareness. Experiments on the DriveLM visual
question answering challenge demonstrate significant improvements over baseline
models, with enhanced performance in ChatGPT scores, BLEU scores, and CIDEr
metrics, indicating the closeness of model answers to the ground truth. Our method
represents a promising step towards more capable and interpretable autonomous
driving systems. Possible safety enhancements enabled by the detection modality
are also discussed.
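A purely illustrative sketch of the fusion idea: per-camera image tokens and detection tokens are concatenated, with a learnable camera-ID separator between views, before reaching the language model. The dimensions and module names are assumptions, not the paper's implementation.

```python
# Illustrative sketch: build a multi-view token sequence from per-camera
# perception tokens and detection tokens, separated by camera-ID embeddings.
import torch
import torch.nn as nn

num_cams, clip_tokens, det_tokens, d = 3, 10, 5, 768   # assumed sizes
cam_sep = nn.Embedding(num_cams, d)                     # learnable camera ID separators

def fuse(clip_feats, det_feats):
    """clip_feats, det_feats: lists of [tokens, d] tensors, one per camera."""
    parts = []
    for cam_id, (c, o) in enumerate(zip(clip_feats, det_feats)):
        sep = cam_sep(torch.tensor([cam_id]))           # [1, d]
        parts += [sep, c, o]
    return torch.cat(parts, dim=0)                      # fused multi-view sequence

clip_feats = [torch.randn(clip_tokens, d) for _ in range(num_cams)]
det_feats = [torch.randn(det_tokens, d) for _ in range(num_cams)]
print(fuse(clip_feats, det_feats).shape)                # torch.Size([48, 768])
```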
When AI Eats Itself: On the Caveats of AI Autophagy
arXiv:2405.09597v3
Generative Artificial Intelligence (AI) technologies and large models are
producing realistic outputs across various domains, such as images, text,
speech, and music. Creating these advanced generative models requires
significant resources, particularly large and high-quality datasets. To
minimise training expenses, many algorithm developers use data created by the
models themselves as a cost-effective training solution. However, not all
synthetic data effectively improve model performance, necessitating a strategic
balance in the use of real versus synthetic data to optimise outcomes.
Currently, the previously well-controlled integration of real and synthetic
data is becoming uncontrollable. The widespread and unregulated dissemination
of synthetic data online leads to the contamination of datasets traditionally
compiled through web scraping, now mixed with unlabeled synthetic data. This
trend, known as the AI autophagy phenomenon, suggests a future where generative
AI systems may increasingly consume their own outputs without discernment,
raising concerns about model performance, reliability, and ethical
implications. What will happen if generative AI continuously consumes itself
without discernment? What measures can we take to mitigate the potential
adverse effects? To address these research questions, this study examines the
existing literature, delving into the consequences of AI autophagy, analyzing
the associated risks, and exploring strategies to mitigate its impact. Our aim
is to provide a comprehensive perspective on this phenomenon, advocating for a
balanced approach that promotes the sustainable development of generative AI
technologies in the era of large models.
STAND-Guard: A Small Task-Adaptive Content Moderation Model
Content moderation, the process of reviewing and monitoring the safety of
generated content, is important for the development of welcoming online platforms
and responsible large language models. Content moderation contains various
tasks, each with its unique requirements tailored to specific scenarios.
Therefore, it is crucial to develop a model that can be easily adapted to novel
or customized content moderation tasks accurately without extensive model
tuning. This paper presents STAND-GUARD, a Small Task-Adaptive coNtent
moDeration model. The basic motivation is that, by performing instruction tuning on
various content moderation tasks, we can unleash the power of small language
models (SLMs) on unseen (out-of-distribution) content moderation tasks. We also
carefully study the effects of training tasks and model size on the efficacy of
the cross-task fine-tuning mechanism. Experiments demonstrate that STAND-Guard is
comparable to GPT-3.5-Turbo across over 40 public datasets, as well as
proprietary datasets derived from real-world business scenarios. Remarkably,
STAND-Guard achieved nearly equivalent results to GPT-4-Turbo on unseen English
binary classification tasks.
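As a hedged sketch of casting a moderation task as an instruction-tuning example, in the spirit of the cross-task tuning described above; the prompt template and labels are illustrative assumptions, not STAND-Guard's actual training format.

```python
# Illustrative sketch: turn a content-moderation task into an instruction-style
# (prompt, response) pair for fine-tuning a small language model.
def to_instruction_example(task_definition: str, text: str, label: str) -> dict:
    prompt = (
        f"Task: {task_definition}\n"
        f"Content: {text}\n"
        "Answer with 'violates' or 'does not violate'."
    )
    return {"prompt": prompt, "response": label}

example = to_instruction_example(
    task_definition="Flag content containing personal attacks.",
    text="You are completely useless at this.",
    label="violates",
)
print(example["prompt"])
```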
A multi-purpose automatic editing system based on lecture semantics for
remote education
arXiv:2411.04859v1
Remote teaching has become popular recently due to its convenience and
safety, especially under extreme circumstances like a pandemic. However, online
students usually have a poor experience since the information acquired from the
views provided by the broadcast platforms is limited. One potential solution is
to show more camera views simultaneously, but it is technically challenging and
distracting for the viewers. Therefore, an automatic multi-camera
directing/editing system, which aims to select the most relevant view at each
time instant to guide the attention of online students, is in urgent
demand. However, existing systems mostly make simple assumptions and focus on
tracking the position of the speaker instead of the real lecture semantics, and
therefore have limited capacities to deliver optimal information flow. To this
end, this paper proposes an automatic multi-purpose editing system based on the
lecture semantics, which can both direct the multiple video streams for
real-time broadcasting and edit the optimal video offline for review purposes.
Our system directs the views by semantically analyzing the class events while
following the professional directing rules, mimicking a human director to
capture the regions of interest from the viewpoint of the onsite students. We
conduct both qualitative and quantitative analyses to verify the effectiveness
of the proposed system and its components.
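A minimal sketch of the directing idea: each camera view is scored by the class events it currently shows, and the highest-scoring view is broadcast. The event set and weights are illustrative assumptions rather than the paper's directing rules.

```python
# Illustrative sketch: score each camera view by detected class events and
# pick the view with the highest score for broadcasting.
EVENT_WEIGHTS = {"writing_on_board": 3.0, "student_question": 2.5,
                 "speaking": 2.0, "idle": 0.1}

def select_view(events_per_view: dict[str, list[str]]) -> str:
    """events_per_view maps a camera name to the events detected in it."""
    score = lambda evs: sum(EVENT_WEIGHTS.get(e, 0.0) for e in evs)
    return max(events_per_view, key=lambda view: score(events_per_view[view]))

views = {
    "board_cam": ["writing_on_board"],
    "speaker_cam": ["speaking"],
    "audience_cam": ["idle"],
}
print(select_view(views))   # board_cam
```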
A Comparative Study of Deep Reinforcement Learning for Crop Production
Management
Crop production management is essential for optimizing yield and minimizing the
environmental impact on crop fields, yet it remains challenging due to
the complex and stochastic processes involved. Recently, researchers have
turned to machine learning to address these complexities. Specifically,
reinforcement learning (RL), a cutting-edge approach designed to learn optimal
decision-making strategies through trial and error in dynamic environments, has
emerged as a promising tool for developing adaptive crop management policies.
RL models aim to optimize long-term rewards by continuously interacting with
the environment, making them well-suited for tackling the uncertainties and
variability inherent in crop management. Studies have shown that RL can
generate crop management policies that compete with, and even outperform,
expert-designed policies within simulation-based crop models. In the gym-DSSAT
crop model environment, one of the most widely used simulators for crop
management, proximal policy optimization (PPO) and deep Q-networks (DQN) have
shown promising results. However, these methods have not yet been
systematically evaluated under identical conditions. In this study, we
evaluated PPO and DQN against static baseline policies across three different
RL tasks (fertilization, irrigation, and mixed management) provided by the
gym-DSSAT environment. To ensure a fair comparison, we used consistent default
parameters, identical reward functions, and the same environment settings. Our
results indicate that PPO outperforms DQN in fertilization and irrigation
tasks, while DQN excels in the mixed management task. This comparative analysis
provides critical insights into the strengths and limitations of each approach,
advancing the development of more effective RL-based crop management
strategies.
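As a rough sketch of such an experiment, assuming a Gym-registered gym-DSSAT task and default hyperparameters, one could train PPO with Stable-Baselines3 as below; the environment ID string is a placeholder, not the package's actual registration name.

```python
# Illustrative sketch: train PPO on a Gym-style crop-management environment
# with default hyperparameters, as in the comparative study described above.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("GymDssatFertilization-v0")   # placeholder env ID -- assumed
model = PPO("MlpPolicy", env, verbose=0)     # default hyperparameters
model.learn(total_timesteps=100_000)

# The trained policy can then be rolled out and compared against a static
# baseline policy under the same reward function and environment settings.
```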