Long-tailed data distributions pose challenges for a variety of domains like
e-commerce, finance, biomedical science, and cyber security, where the
performance of machine learning models is often dominated by head categories
while tail categories are inadequately learned. This work aims to provide a
systematic view of long-tailed learning with regard to three pivotal angles:
(A1) the characterization of data long-tailedness, (A2) the data complexity of
various domains, and (A3) the heterogeneity of emerging tasks. We develop
HeroLT, a comprehensive long-tailed learning benchmark integrating 18
state-of-the-art algorithms, 10 evaluation metrics, and 17 real-world datasets
across 6 tasks and 4 data modalities. With these novel angles and extensive
experiments (315 in total), HeroLT enables effective and fair evaluation of
newly proposed methods against existing baselines on varying dataset types.
Finally, we conclude by highlighting the significant applications of
long-tailed learning and identifying several promising future directions. For
accessibility and reproducibility, we open-source our benchmark HeroLT and
corresponding results at https://github.com/SSSKJ/HeroLT.
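To make angle (A1) concrete: long-tailedness is commonly summarized with simple statistics of the label distribution, such as the imbalance factor and the Gini coefficient of class frequencies. A minimal sketch (function and variable names are illustrative, not part of HeroLT's API):

```python
import numpy as np

def long_tail_stats(labels):
    """Summarize how long-tailed a label distribution is: the imbalance
    factor (largest / smallest class count) and the Gini coefficient of
    the class frequencies (0 = balanced, near 1 = extremely skewed)."""
    _, counts = np.unique(labels, return_counts=True)
    counts = np.sort(counts)                       # ascending
    imbalance_factor = counts[-1] / counts[0]
    cum_share = np.cumsum(counts) / counts.sum()   # cumulative frequency share
    n = len(counts)
    gini = (n + 1 - 2 * cum_share.sum()) / n
    return imbalance_factor, gini

# A toy long-tailed label set: two head classes, two tail classes.
labels = [0] * 5000 + [1] * 3000 + [2] * 50 + [3] * 10
print(long_tail_stats(labels))  # imbalance factor 500, Gini ~0.56
```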
Rethinking the Uncertainty: A Critical Review and Analysis in the Era of
Large Language Models
In recent years, Large Language Models (LLMs) have become fundamental to a
broad spectrum of artificial intelligence applications. As the use of LLMs
expands, precisely estimating the uncertainty in their predictions has become
crucial. Current methods often struggle to accurately identify, measure, and
address the true uncertainty, with many focusing primarily on estimating model
confidence. This discrepancy is largely due to an incomplete understanding of
where, when, and how uncertainties are injected into models. This paper
introduces a comprehensive framework specifically designed to identify and
understand the types and sources of uncertainty, aligned with the unique
characteristics of LLMs. Our framework enhances the understanding of the
diverse landscape of uncertainties by systematically categorizing and defining
each type, establishing a solid foundation for developing targeted methods that
can precisely quantify these uncertainties. We also provide a detailed
introduction to key related concepts and examine the limitations of current
methods in mission-critical and safety-sensitive applications. The paper
concludes with a perspective on future directions aimed at enhancing the
reliability and practical adoption of these methods in real-world scenarios.
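As a concrete illustration of the gap between model confidence and the uncertainty types the paper distinguishes, the standard entropy-based decomposition over an ensemble of predictive distributions splits total uncertainty into an aleatoric part (expected entropy) and an epistemic part (mutual information). A minimal numpy sketch, not tied to any specific LLM or to the paper's framework:

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=axis)

def decompose_uncertainty(member_probs):
    """member_probs: (n_members, n_classes) predictive distributions,
    e.g. from an ensemble or from repeated sampled LLM answers mapped
    onto a fixed set of options."""
    mean_p = member_probs.mean(axis=0)
    total = entropy(mean_p)                    # predictive entropy
    aleatoric = entropy(member_probs).mean()   # expected per-member entropy
    epistemic = total - aleatoric              # mutual information
    return total, aleatoric, epistemic

# Two members that disagree confidently -> mostly epistemic uncertainty.
probs = np.array([[0.9, 0.1], [0.1, 0.9]])
print(decompose_uncertainty(probs))
```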
There have been tremendous efforts over the past decades dedicated to the
generation of realistic graphs in a variety of domains, ranging from social
networks to computer networks, from gene regulatory networks to online
transaction networks. Despite the remarkable success, the vast majority of
these works are unsupervised in nature and are typically trained to minimize
the expected graph reconstruction loss, which would result in the
representation disparity issue in the generated graphs, i.e., the protected
groups (often minorities) contribute less to the objective and thus suffer from
systematically higher errors. In this paper, we aim to tailor graph generation
to downstream mining tasks by leveraging label information and user-preferred
parity constraints. In particular, we begin by investigating representation
disparity in the context of graph generative models. To mitigate
the disparity, we propose a fairness-aware graph generative model named
FairGen. Our model jointly trains a label-informed graph generation module and
a fair representation learning module by progressively learning the behaviors
of the protected and unprotected groups, from the 'easy' concepts to the 'hard'
ones. In addition, we propose a generic context sampling strategy for graph
generative models, which is proven to be capable of fairly capturing the
contextual information of each group with a high probability. Experimental
results on seven real-world data sets, including web-based graphs, demonstrate
that FairGen (1) obtains performance on par with state-of-the-art graph
generative models across nine network properties, (2) mitigates the
representation disparity issues in the generated graphs, and (3) substantially
boosts the model performance by up to 17% in downstream tasks via data
augmentation.
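The representation disparity discussed above can be quantified, in its simplest form, as the gap in mean reconstruction error between the protected group and the remaining nodes. An illustrative metric (not FairGen's exact objective):

```python
import numpy as np

def representation_disparity(recon_errors, group_ids, protected_group):
    """Mean reconstruction error of the protected group minus that of
    the rest; a positive gap means the protected group is systematically
    worse served by the generative model."""
    recon_errors = np.asarray(recon_errors, dtype=float)
    mask = np.asarray(group_ids) == protected_group
    return recon_errors[mask].mean() - recon_errors[~mask].mean()

errors = [0.9, 1.1, 0.2, 0.3, 0.25]   # per-node reconstruction loss
groups = ["minority", "minority", "majority", "majority", "majority"]
print(representation_disparity(errors, groups, "minority"))  # 0.75
```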
PanGu-Bot: Efficient Generative Dialogue Pre-training from Pre-trained
Language Model
In this paper, we introduce PanGu-Bot, a Chinese pre-trained open-domain
dialogue generation model based on a large pre-trained language model (PLM)
PANGU-alpha (Zeng et al., 2021). Unlike other pre-trained dialogue models
trained from scratch on massive amounts of dialogue data, we aim to build a
powerful dialogue model at relatively low data and computation cost by
inheriting valuable language capabilities and knowledge from PLMs. To this
end, we train PanGu-Bot from the large PLM PANGU-alpha, which has been shown
to perform well on a variety of Chinese natural language tasks. We
investigate different aspects of responses generated by PanGu-Bot, including
response quality, knowledge, and safety. We show that PanGu-Bot outperforms
state-of-the-art Chinese dialogue systems (CDIALGPT (Wang et al., 2020), EVA
(Zhou et al., 2021), EVA2.0 (Gu et al., 2022)) w.r.t. the above three aspects.
We also demonstrate that PanGu-Bot can be easily deployed to generate emotional
responses without further training. Throughout our empirical analysis, we also
point out that the PanGu-Bot response quality, knowledge correctness, and
safety are still far from perfect, and further explorations are indispensable
to building reliable and smart dialogue systems. Our model and code will be
available at
https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/PanGu-Bot
soon.
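For readers unfamiliar with the recipe of inheriting a PLM for dialogue, generating a response from a pre-trained causal language model takes only a few lines with the Hugging Face transformers API. The checkpoint name below is a placeholder; PanGu-Bot's own code is in the repository above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/your-chinese-plm"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Condition the PLM on the dialogue history and decode a response.
history = "User: I am feeling a bit down today.\nBot:"
inputs = tokenizer(history, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
print(response)
```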
BehaviorGPT: Smart Agent Simulation for Autonomous Driving with
Next-Patch Prediction
Simulating realistic behaviors of traffic agents is pivotal for efficiently
validating the safety of autonomous driving systems. Existing data-driven
simulators primarily use an encoder-decoder architecture to encode the
historical trajectories before decoding the future. However, the heterogeneity
between encoders and decoders complicates the models, and the manual separation
of historical and future trajectories leads to low data utilization. Given
these limitations, we propose BehaviorGPT, a homogeneous and fully
autoregressive Transformer designed to simulate the sequential behavior of
multiple agents. Crucially, our approach discards the traditional separation
between "history" and "future" by modeling each time step as the "current" one
for motion generation, leading to a simpler, more parameter- and data-efficient
agent simulator. We further introduce the Next-Patch Prediction Paradigm (NP3)
to mitigate the negative effects of autoregressive modeling, in which models
are trained to reason at the patch level of trajectories and capture long-range
spatial-temporal interactions. Despite having merely 3M model parameters,
BehaviorGPT won first place in the 2024 Waymo Open Sim Agents Challenge with a
realism score of 0.7473 and a minADE score of 1.4147, demonstrating its
exceptional performance in traffic agent simulation.
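The next-patch idea can be illustrated by grouping consecutive trajectory steps into patches and training a causal model to predict the next patch rather than the next step. A toy PyTorch sketch with illustrative shapes (not BehaviorGPT's actual code):

```python
import torch

def to_patches(traj, patch_len):
    """traj: (T, D) trajectory -> (T // patch_len, patch_len * D) patches."""
    T, D = traj.shape
    n = T // patch_len
    return traj[: n * patch_len].reshape(n, patch_len * D)

def next_patch_loss(model, traj, patch_len):
    """Autoregressive next-patch prediction with teacher forcing: each
    patch is the prediction target of the patch preceding it."""
    patches = to_patches(traj, patch_len)          # (N, patch_len * D)
    inputs, targets = patches[:-1], patches[1:]
    preds = model(inputs.unsqueeze(0)).squeeze(0)  # causal model over patches
    return torch.nn.functional.mse_loss(preds, targets)

# Stand-in for a causal Transformer: a per-patch linear map.
model = torch.nn.Linear(5 * 2, 5 * 2)
traj = torch.randn(50, 2)                          # 50 steps of (x, y)
print(next_patch_loss(model, traj, patch_len=5))
```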
Revisiting, Benchmarking and Understanding Unsupervised Graph Domain
Adaptation
Unsupervised Graph Domain Adaptation (UGDA) involves the transfer of
knowledge from a label-rich source graph to an unlabeled target graph under
domain discrepancies. Despite the proliferation of methods designed for this
emerging task, the lack of standard experimental settings and fair performance
comparisons makes it challenging to understand which models perform well, and
when, across different scenarios. To fill this gap, we present the first
comprehensive benchmark for unsupervised graph domain adaptation named
GDABench, which encompasses 16 algorithms across 5 datasets with 74 adaptation
tasks. Through extensive experiments, we observe that the performance of
current UGDA models varies significantly across different datasets and
adaptation scenarios. Specifically, we recognize that when the source and
target graphs face significant distribution shifts, it is imperative to
formulate strategies to effectively address and mitigate graph structural
shifts. We also find that with appropriate neighbourhood aggregation
mechanisms, simple GNN variants can even surpass state-of-the-art UGDA
baselines. To facilitate reproducibility, we have developed an easy-to-use
library PyGDA for training and evaluating existing UGDA methods, providing a
standardized platform in this community. Our source codes and datasets can be
found at: https://github.com/pygda-team/pygda.
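The finding about simple GNN variants can be illustrated with a parameter-free mean-aggregation scheme that repeatedly averages each node's features with those of its neighbours; neighbourhood aggregation of this kind is what the benchmark found surprisingly competitive. A toy numpy sketch (not PyGDA's implementation):

```python
import numpy as np

def mean_aggregate(adj, features, n_hops=2):
    """Parameter-free neighbourhood aggregation: repeatedly replace each
    node's features with the average over itself and its neighbours."""
    adj = adj + np.eye(adj.shape[0])             # add self-loops
    norm = adj / adj.sum(axis=1, keepdims=True)  # row-normalized averaging
    h = features
    for _ in range(n_hops):
        h = norm @ h
    return h

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(mean_aggregate(adj, x))
```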
LongSafetyBench: Long-Context LLMs Struggle with Safety Issues
With the development of large language models (LLMs), the sequence length of
these models continues to increase, drawing significant attention to
long-context language models. However, the evaluation of these models has been
primarily limited to their capabilities, with a lack of research focusing on
their safety. Existing work, such as ManyShotJailbreak, has to some extent
demonstrated that long-context language models can exhibit safety concerns.
However, the methods used are limited and lack comprehensiveness. In response,
we introduce LongSafetyBench, the first benchmark designed to
objectively and comprehensively evaluate the safety of long-context models.
LongSafetyBench consists of 10 task categories, with an average length of
41,889 words. After testing eight long-context language models on
LongSafetyBench, we found that existing models generally exhibit insufficient
safety capabilities. The proportion of safe responses from most mainstream
long-context LLMs is below 50%. Moreover, models' safety performance in
long-context scenarios does not always align with that in short-context
scenarios. Further investigation revealed that long-context models tend to
overlook harmful content within lengthy texts. We also propose a simple yet
effective solution that allows open-source models to achieve performance
comparable to that of top-tier closed-source models. We believe that
LongSafetyBench can serve as a valuable benchmark for evaluating the safety
capabilities of long-context language models. We hope that our work will
encourage the broader community to pay attention to the safety of long-context
models and contribute to the development of solutions to improve the safety of
long-context LLMs.
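The headline metric above is simply the fraction of model responses judged safe. A trivial sketch (how responses are judged safe or unsafe is up to the benchmark's protocol):

```python
def safe_response_rate(judgements):
    """judgements: booleans from a safety judge, one per model response."""
    return sum(judgements) / len(judgements)

# Below 0.5, as the benchmark reports for most mainstream long-context LLMs.
print(safe_response_rate([True, False, True, False, False]))  # 0.4
```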
This paper examines the issue of fairness in the estimation of graphical
models (GMs), particularly Gaussian, Covariance, and Ising models. These models
play a vital role in understanding complex relationships in high-dimensional
data. However, standard GMs can result in biased outcomes, especially when the
underlying data involves sensitive characteristics or protected groups. To
address this, we introduce a comprehensive framework designed to reduce bias in
the estimation of GMs related to protected attributes. Our approach involves
the integration of the pairwise graph disparity error and a tailored loss
function into a nonsmooth multi-objective optimization problem, striving to
achieve fairness across different sensitive groups while maintaining the
effectiveness of the GMs. Experimental evaluations on synthetic and real-world
datasets demonstrate that our framework effectively mitigates bias without
undermining GMs' performance.
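One simple proxy for the disparity such a framework penalizes is the distance between graphical models estimated separately for each sensitive group. The sketch below uses scikit-learn's graphical lasso and the Frobenius distance between per-group precision matrices (illustrative, not the paper's exact pairwise graph disparity error):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def pairwise_graph_disparity(X, groups, alpha=0.1):
    """Fit a Gaussian graphical model per group, then measure how far
    apart the estimated precision matrices (i.e., the graphs) are."""
    precisions = {}
    for g in np.unique(groups):
        precisions[g] = GraphicalLasso(alpha=alpha).fit(X[groups == g]).precision_
    keys = list(precisions)
    return {(a, b): np.linalg.norm(precisions[a] - precisions[b], "fro")
            for i, a in enumerate(keys) for b in keys[i + 1:]}

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
groups = rng.integers(0, 2, size=200)
print(pairwise_graph_disparity(X, groups))
```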
Counterfactual Fairness by Combining Factual and Counterfactual
Predictions
In high-stakes domains such as healthcare and hiring, the role of machine
learning (ML) in decision-making raises significant fairness concerns. This
work focuses on Counterfactual Fairness (CF), which posits that an ML model's
outcome on any individual should remain unchanged if they had belonged to a
different demographic group. Previous works have proposed methods that
guarantee CF. Nevertheless, their effects on the model's predictive
performance remain largely unclear. To fill this gap, we provide a
theoretical study on the inherent trade-off between CF and predictive
performance in a model-agnostic manner. We first propose a simple but effective
method to cast an optimal but potentially unfair predictor into a fair one
without losing optimality. By analyzing the excess risk incurred to achieve
CF, we quantify this inherent trade-off. We also analyze our method's
performance when only incomplete causal knowledge is available. Building on
this analysis, we propose a performant algorithm that can be applied in such
scenarios. Experiments on both synthetic and semi-synthetic datasets
demonstrate the validity of our analysis and methods.
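The step of casting an optimal predictor into a fair one can be sketched as averaging the predictor's outputs over the counterfactual versions of an input under every sensitive-attribute value: the average no longer depends on the individual's actual group. A toy sketch with a hand-specified causal model (the paper's exact construction may differ):

```python
import numpy as np

def cf_fair_predict(predictor, counterfactual_fn, x, sensitive_values):
    """counterfactual_fn(x, a) must return x's features had its
    sensitive attribute been a (this requires a causal model)."""
    preds = [predictor(counterfactual_fn(x, a)) for a in sensitive_values]
    return np.mean(preds, axis=0)

# Toy causal world: x = (feature, a), where attribute a shifts the feature by 2.
def counterfactual_fn(x, a):
    base = x[0] - 2 * x[1]               # strip a's causal effect
    return np.array([base + 2 * a, a])   # re-apply it under attribute a

predictor = lambda x: x @ np.array([1.0, 5.0])  # uses a directly: unfair
x = np.array([4.0, 1.0])                        # base = 2, actual a = 1
print(cf_fair_predict(predictor, counterfactual_fn, x, [0, 1]))  # 5.5
```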
A Retrospective on the Robot Air Hockey Challenge: Benchmarking Robust,
Reliable, and Safe Learning Techniques for Real-world Robotics
Machine learning methods have a groundbreaking impact in many application
domains, but their application on real robotic platforms is still limited.
Despite the many challenges associated with combining machine learning
technology with robotics, robot learning remains one of the most promising
directions for enhancing the capabilities of robots. When deploying
learning-based approaches on real robots, extra effort is required to address
the challenges posed by various real-world factors. To investigate the key
factors influencing real-world deployment and to encourage original solutions
from different researchers, we organized the Robot Air Hockey Challenge at the
NeurIPS 2023 conference. We selected the air hockey task as a benchmark,
encompassing low-level robotics problems and high-level tactics. Unlike in
other machine-learning-centric benchmarks, participants needed to tackle
practical challenges in robotics, such as the sim-to-real gap, low-level
control issues, safety problems, real-time requirements, and the limited
availability of real-world data. Furthermore, we focus on a dynamic
environment, removing the quasi-static-motion assumption typical of other
real-world benchmarks. The competition's results show that solutions combining
learning-based approaches with prior knowledge outperform those relying solely
on data when real-world deployment is challenging. Our ablation study reveals
which real-world factors may be overlooked when building a learning-based
solution. The successful real-world air hockey deployment of best-performing
agents sets the foundation for future competitions and follow-up research
directions.