arXiv:2409.19545v1
Labor market forecasting of talent demand and supply is essential for
business management and economic development. With accurate and timely
forecasts, employers can adapt their recruitment strategies to the evolving
labor market, and employees can plan their career paths proactively according
to future demand and supply. However, previous studies ignore the
interconnections among the demand-supply sequences of different companies and
positions when predicting variations. Moreover, companies are reluctant to share
their private human resource data for global labor market analysis due to
concerns over jeopardizing competitive advantage, security threats, and
potential ethical or legal violations. To this end, in this paper, we formulate
the Federated Labor Market Forecasting (FedLMF) problem and propose a
Meta-personalized Convergence-aware Clustered Federated Learning (MPCAC-FL)
framework to provide accurate and timely collaborative talent demand and supply
prediction in a privacy-preserving way. First, we design a graph-based
sequential model to capture the inherent correlations between demand and supply
sequences and across company-position pairs. Second, we adopt meta-learning
techniques to learn effective initial model parameters that can be shared across
companies, allowing personalized models to be optimized for forecasting
company-specific demand and supply, even when companies have heterogeneous
data. Third, we devise a Convergence-aware Clustering algorithm that dynamically
divides companies into groups according to model similarity and applies federated
aggregation within each group, alleviating heterogeneity for more stable
convergence and better performance. Extensive experiments demonstrate that
MPCAC-FL outperforms the compared baselines on three real-world datasets and
achieves over 97% of the performance of the state-of-the-art model, DH-GEM,
without exposing private company data.
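To picture the third step, group-wise aggregation can be sketched as clustering clients by model similarity and averaging parameters within each cluster. The sketch below is a minimal illustration under assumed inputs (per-company parameter lists); it omits the convergence-awareness of the actual algorithm, and the similarity threshold is an arbitrary assumption, not the paper's setting.

import numpy as np

def cosine(a, b):
    # cosine similarity between two flattened parameter vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def cluster_and_aggregate(client_params, sim_threshold=0.8):
    """client_params: list of per-company parameter lists (numpy arrays).
    Greedily group companies whose parameters are similar, then run
    federated averaging within each group. Illustrative only."""
    flats = [np.concatenate([p.ravel() for p in ps]) for ps in client_params]
    clusters = []  # each cluster is a list of client indices
    for i, f in enumerate(flats):
        for c in clusters:
            if cosine(f, flats[c[0]]) >= sim_threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    aggregated = []
    for c in clusters:
        group = [client_params[i] for i in c]
        # element-wise average of each parameter tensor across the group
        aggregated.append([np.mean(np.stack(ps), axis=0) for ps in zip(*group)])
    return clusters, aggregated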
LongSafetyBench: Long-Context LLMs Struggle with Safety Issues
arXiv:2411.06899v1
With the development of large language models (LLMs), the sequence length of
these models continues to increase, drawing significant attention to
long-context language models. However, the evaluation of these models has been
primarily limited to their capabilities, with a lack of research focusing on
their safety. Existing work, such as ManyShotJailbreak, has to some extent
demonstrated that long-context language models can exhibit safety concerns.
However, the methods used are limited and lack comprehensiveness. In response,
we introduce LongSafetyBench, the first benchmark designed to
objectively and comprehensively evaluate the safety of long-context models.
LongSafetyBench consists of 10 task categories, with an average length of
41,889 words. After testing eight long-context language models on
LongSafetyBench, we found that existing models generally exhibit insufficient
safety capabilities. The proportion of safe responses from most mainstream
long-context LLMs is below 50%. Moreover, models' safety performance in
long-context scenarios does not always align with that in short-context
scenarios. Further investigation revealed that long-context models tend to
overlook harmful content within lengthy texts. We also propose a simple yet
effective solution, allowing open-source models to achieve performance
comparable to that of top-tier closed-source models. We believe that
LongSafetyBench can serve as a valuable benchmark for evaluating the safety
capabilities of long-context language models. We hope that our work will
encourage the broader community to pay attention to the safety of long-context
models and contribute to the development of solutions to improve the safety of
long-context LLMs.
Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models
arXiv:2404.08254v2
Diffusion models have emerged as a robust framework for various generative
tasks, including tabular data synthesis. However, current tabular diffusion
models tend to inherit bias in the training dataset and generate biased
synthetic data, which may lead to discriminatory outcomes. In this research,
we introduce a novel tabular diffusion model that incorporates sensitive
guidance to generate fair synthetic data with balanced joint distributions of
the target label and sensitive attributes, such as sex and race. The empirical
results demonstrate that our method effectively mitigates bias in training data
while maintaining the quality of the generated samples. Furthermore, we provide
evidence that our approach outperforms existing methods for synthesizing
tabular data on fairness metrics such as demographic parity ratio and equalized
odds ratio, achieving improvements of over 10%. Our implementation is
available at https://github.com/comp-well-org/fair-tab-diffusion.
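For reference, the two fairness metrics named above can be computed from binary predictions roughly as follows; this is a minimal sketch assuming a binary sensitive attribute, not the paper's evaluation code.

import numpy as np

def demographic_parity_ratio(y_pred, s):
    """Ratio of positive-prediction rates between the two groups (min/max)."""
    rates = [y_pred[s == g].mean() for g in (0, 1)]
    return min(rates) / max(rates)

def equalized_odds_ratio(y_pred, y_true, s):
    """Worst-case min/max ratio over true-positive and false-positive rates."""
    ratios = []
    for y in (0, 1):  # y=1 compares TPRs, y=0 compares FPRs
        rates = [y_pred[(s == g) & (y_true == y)].mean() for g in (0, 1)]
        ratios.append(min(rates) / max(rates))
    return min(ratios)

Both ratios approach 1 as the joint distribution of the target label and the sensitive attribute becomes balanced.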
Narrative Feature or Structured Feature? A Study of Large Language
Models to Identify Cancer Patients at Risk of Heart Failure
Cancer treatments are known to introduce cardiotoxicity, negatively impacting
outcomes and survivorship. Identifying cancer patients at risk of heart failure
(HF) is critical to improving cancer treatment outcomes and safety. This study
examined machine learning (ML) models to identify cancer patients at risk of HF
using electronic health records (EHRs), including traditional ML, Time-Aware
Long Short-Term Memory (T-LSTM), and large language models (LLMs) using novel
narrative features derived from the structured medical codes. We identified a
cancer cohort of 12,806 patients from the University of Florida Health,
diagnosed with lung, breast, and colorectal cancers, among which 1,602
individuals developed HF after cancer. The LLM, GatorTron-3.9B, achieved the
best F1 scores, outperforming the traditional support vector machines by 39%,
the T-LSTM deep learning model by 7%, and a widely used transformer model,
BERT, by 5.6%. The analysis shows that the proposed narrative features
markedly increased feature density and improved performance.
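To make "narrative features derived from structured codes" concrete, one way to picture the idea is rendering coded EHR entries as short text that an LLM can consume. The mapping table, template, and function below are hypothetical illustrations, not the study's actual feature-construction pipeline.

# Hypothetical sketch: turn structured diagnosis codes into a narrative string.
CODE_DESCRIPTIONS = {      # assumed lookup from codes to plain-language terms
    "C34.9": "lung cancer",
    "I10": "essential hypertension",
    "E11.9": "type 2 diabetes",
}

def codes_to_narrative(visits):
    """visits: list of (visit_date, [codes]) tuples in chronological order."""
    sentences = []
    for date, codes in visits:
        terms = [CODE_DESCRIPTIONS.get(c, c) for c in codes]
        sentences.append(f"On {date}, the patient was diagnosed with {', '.join(terms)}.")
    return " ".join(sentences)

narrative = codes_to_narrative([
    ("2019-03-02", ["C34.9"]),
    ("2020-07-15", ["I10", "E11.9"]),
])
# The resulting text can then be passed to a clinical LLM (e.g., GatorTron)
# fine-tuned for HF risk classification; training details are omitted here.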
Improved Generation of Adversarial Examples Against Safety-aligned LLMs
arXiv:2405.20778v2
Adversarial prompts generated using gradient-based methods exhibit
outstanding performance in performing automatic jailbreak attacks against
safety-aligned LLMs. Nevertheless, due to the discrete nature of texts, the
input gradient of LLMs struggles to precisely reflect the magnitude of loss
change that results from token replacements in the prompt, leading to limited
attack success rates against safety-aligned LLMs, even in the white-box
setting. In this paper, we explore a new perspective on this problem,
suggesting that it can be alleviated by leveraging ideas from
transfer-based attacks that were originally proposed for attacking black-box
image classification models. For the first time, we adapt the core ideas of
effective transfer-based attacks, namely the Skip Gradient
Method and Intermediate Level Attack, to gradient-based adversarial prompt
generation and achieve significant performance gains without introducing
noticeable computational cost. By analyzing the mechanisms behind these
gains, we draw new insights and develop effective combinations of the methods.
Our empirical results show that 87% of the query-specific
adversarial suffixes generated by the developed combination can induce
Llama-2-7B-Chat to produce the output that exactly matches the target string on
AdvBench. This match rate is 33% higher than that of a very strong baseline
known as GCG, demonstrating advanced discrete optimization for adversarial
prompt generation against LLMs. In addition, without noticeable additional cost,
the combination achieves >30% absolute increase in attack success rates
compared with GCG when generating both query-specific (38% -> 68%) and
universal adversarial prompts (26.68% -> 60.32%) for attacking the
Llama-2-7B-Chat model on AdvBench. Code is available at
https://github.com/qizhangli/Gradient-based-Jailbreak-Attacks.
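For context, gradient-based prompt attacks of this family (including the GCG baseline mentioned above) rank candidate token replacements by the gradient of the target loss with respect to the one-hot token encoding. The sketch below shows only that generic core step under an assumed Hugging Face causal LM interface; it does not implement the paper's Skip Gradient Method or Intermediate Level Attack adaptations.

import torch

def top_candidate_tokens(model, input_ids, suffix_slice, target_slice, k=256):
    """Rank replacement tokens for each suffix position by the gradient of the
    loss (next-token prediction of the target string) w.r.t. a one-hot encoding.
    input_ids: 1-D LongTensor; suffix_slice/target_slice: Python slices."""
    embed_matrix = model.get_input_embeddings().weight                  # (V, d)
    one_hot = torch.zeros(input_ids.shape[0], embed_matrix.shape[0],
                          device=input_ids.device, dtype=embed_matrix.dtype)
    one_hot.scatter_(1, input_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)
    embeds = (one_hot @ embed_matrix).unsqueeze(0)                      # (1, L, d)
    logits = model(inputs_embeds=embeds).logits[0]                      # (L, V)
    loss = torch.nn.functional.cross_entropy(
        logits[target_slice.start - 1:target_slice.stop - 1],           # predict target tokens
        input_ids[target_slice])
    loss.backward()
    grad = one_hot.grad[suffix_slice]                                   # (len_suffix, V)
    # more negative gradient -> replacement expected to reduce the loss more
    return (-grad).topk(k, dim=1).indices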
FoundTS: Comprehensive and Unified Benchmarking of Foundation Models for
Time Series Forecasting
arXiv:2410.11802v3
Time Series Forecasting (TSF) is a key functionality in numerous fields,
including finance, weather services, and energy management. While new TSF
methods continue to emerge, many of them require domain-specific data
collection and model training and struggle with poor generalization performance
on new domains. Foundation models aim to overcome this limitation. Pre-trained
on large-scale language or time series data, they exhibit promising inference
capabilities on new or unseen data. This has spurred a surge in new TSF
foundation models. We propose a new benchmark, FoundTS, to enable thorough and
fair evaluation and comparison of such models. FoundTS covers a variety of TSF
foundation models, including those based on large language models and those
pretrained on time series. Next, FoundTS supports different forecasting
strategies, including zero-shot, few-shot, and full-shot, thereby facilitating
more thorough evaluations. Finally, FoundTS offers a pipeline that standardizes
evaluation processes such as dataset splitting, loading, normalization, and
few-shot sampling, thereby facilitating fair evaluations. Building on this, we
report on an extensive evaluation of TSF foundation models on a broad range of
datasets from diverse domains and with different statistical characteristics.
Specifically, we identify pros and cons and inherent limitations of existing
foundation models, and we outline directions for future model design. We make
our code and datasets available at
https://anonymous.4open.science/r/FoundTS-C2B0.
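As a rough picture of what such a standardized pipeline covers, a minimal sketch of dataset splitting, train-statistic normalization, and few-shot sampling is given below; the function and parameter names are illustrative assumptions, not FoundTS's actual API.

import numpy as np

def prepare_splits(series, train_ratio=0.7, val_ratio=0.1,
                   strategy="few-shot", few_shot_fraction=0.05, seed=0):
    """series: 1-D numpy array of observations in time order."""
    n = len(series)
    train_end = int(n * train_ratio)
    val_end = int(n * (train_ratio + val_ratio))
    train, val, test = series[:train_end], series[train_end:val_end], series[val_end:]

    # normalize with statistics from the training split only
    mean, std = train.mean(), train.std() + 1e-8
    norm = lambda x: (x - mean) / std

    if strategy == "zero-shot":
        train = train[:0]                       # no in-domain training data
    elif strategy == "few-shot":
        rng = np.random.default_rng(seed)
        k = max(1, int(len(train) * few_shot_fraction))
        start = int(rng.integers(0, len(train) - k + 1))
        train = train[start:start + k]          # contiguous few-shot window
    # "full-shot" keeps the entire training split
    return norm(train), norm(val), norm(test)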
SimGen: Simulator-conditioned Driving Scene Generation
arXiv:2406.09386v2
Controllable synthetic data generation can substantially lower the annotation
cost of training data. Prior works use diffusion models to generate driving
images conditioned on the 3D object layout. However, those models are trained
on small-scale datasets like nuScenes, which lack appearance and layout
diversity. Moreover, these models often overfit, generating images only for
layouts drawn from the validation set of the same dataset. In this work, we
introduce a simulator-conditioned scene
generation framework called SimGen that can learn to generate diverse driving
scenes by mixing data from the simulator and the real world. It uses a novel
cascade diffusion pipeline to address challenging sim-to-real gaps and
multi-condition conflicts. To enhance the generative diversity of SimGen, we
collect DIVA, a driving video dataset containing over 147.5 hours of
real-world driving videos from 73 locations worldwide as well as simulated driving
data from the MetaDrive simulator. SimGen achieves superior generation quality
and diversity while preserving controllability based on the text prompt and the
layout pulled from a simulator. We further demonstrate the improvements brought
by SimGen for synthetic data augmentation on BEV detection and segmentation
tasks and showcase its capability in safety-critical data generation.
Pretraining Data Detection for Large Language Models: A Divergence-based
Calibration Method
As the scale of training corpora for large language models (LLMs) grows,
model developers become increasingly reluctant to disclose details on their
data. This lack of transparency poses challenges to scientific evaluation and
ethical deployment. Recently, pretraining data detection approaches, which
infer whether a given text was part of an LLM's training data through black-box
access, have been explored. The Min-K% Prob method, which has achieved
state-of-the-art results, assumes that a non-training example tends to contain
a few outlier words with low token probabilities. However, its effectiveness
may be limited as it tends to misclassify non-training texts that contain many
common words with high probabilities predicted by LLMs. To address this issue,
we introduce a divergence-based calibration method, inspired by the
divergence-from-randomness concept, to calibrate token probabilities for
pretraining data detection. We compute the cross-entropy (i.e., the divergence)
between the token probability distribution and the token frequency distribution
to derive a detection score. We have developed a Chinese-language benchmark,
PatentMIA, to assess the performance of detection approaches for LLMs on
Chinese text. Experimental results on English-language benchmarks and PatentMIA
demonstrate that our proposed method significantly outperforms existing
methods. Our code and the PatentMIA benchmark are available at
https://github.com/zhang-wei-chao/DC-PDD.
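The scoring idea described above can be pictured as a small computation: a cross-entropy-style divergence between the probabilities the LLM assigns to the observed tokens and their frequencies in a reference corpus. The sketch below is one rough reading of that idea; the frequency source, weighting, and normalization are assumptions rather than the paper's exact procedure.

import math

def detection_score(token_probs, token_freqs, eps=1e-12):
    """token_probs: model probability of each observed token in the text.
    token_freqs: relative frequency of each token in a reference corpus.
    Returns a cross-entropy-style divergence; higher values suggest the
    text diverges more from what a pretrained model 'expects'."""
    total = 0.0
    for p, f in zip(token_probs, token_freqs):
        total += -f * math.log(p + eps)   # frequency-weighted log-probability
    return total / max(len(token_probs), 1)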
FedMABA: Towards Fair Federated Learning through Multi-Armed Bandits
Allocation
arXiv:2410.20141v1
The increasing concern for data privacy has driven the rapid development of
federated learning (FL), a privacy-preserving collaborative paradigm. However,
the statistical heterogeneity among clients in FL results in inconsistent
performance of the server model across clients: the server model may favor
certain clients while performing poorly for others, heightening the challenge
of fairness. In this paper, we reconsider the
inconsistency in client performance distribution and introduce the concept of
an adversarial multi-armed bandit to optimize the proposed objective with explicit
constraints on performance disparities. Practically, we propose a novel
multi-armed bandit-based allocation FL algorithm (FedMABA) to mitigate
performance unfairness among diverse clients with different data distributions.
Extensive experiments, in different Non-I.I.D. scenarios, demonstrate the
exceptional performance of FedMABA in enhancing fairness.
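As a rough illustration of how a bandit-style allocator can drive aggregation weights (a simplified exponential-weights variant, not FedMABA's actual algorithm), the server can up-weight clients on which the current model performs poorly:

import numpy as np

class ExpWeightsAllocator:
    """Simplified exponential-weights (Hedge-style) allocator: clients with
    higher validation loss receive more aggregation weight in the next round.
    Illustrative only."""
    def __init__(self, num_clients, lr=0.1):
        self.log_weights = np.zeros(num_clients)
        self.lr = lr

    def probabilities(self):
        w = np.exp(self.log_weights - self.log_weights.max())
        return w / w.sum()

    def update(self, client_losses):
        losses = np.asarray(client_losses, dtype=float)
        # scale losses to [0, 1] so the step size behaves consistently
        scaled = (losses - losses.min()) / (losses.max() - losses.min() + 1e-8)
        self.log_weights += self.lr * scaled
        return self.probabilities()

# usage: weights = allocator.update(per_client_losses); then aggregate client
# updates with these weights instead of uniform FedAvg averaging.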
Enhancing Safety in Reinforcement Learning with Human Feedback via
Rectified Policy Optimization
arXiv:2410.19933v1
Balancing helpfulness and safety (harmlessness) is a critical challenge in
aligning large language models (LLMs). Current approaches often decouple these
two objectives, training separate preference models for helpfulness and safety,
while framing safety as a constraint within a constrained Markov Decision
Process (CMDP) framework. However, these methods can lead to "safety
interference", where average-based safety constraints compromise the safety of
some prompts in favor of others. To address this issue, we propose
Rectified Policy Optimization (RePO), which replaces the average
safety constraint with stricter, per-prompt safety constraints. At the core of
RePO is a policy update mechanism driven by rectified policy gradients, which
penalizes the strict safety violation of every prompt, thereby enhancing safety
across nearly all prompts. Our experiments on Alpaca-7B demonstrate that RePO
improves safety alignment and reduces safety interference compared to
baseline methods. Code is available at https://github.com/pxyWaterMoon/RePO.
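To make the per-prompt "rectification" concrete, the penalty can be pictured as a hinge applied to each prompt's safety cost rather than to the batch average. The sketch below is a minimal illustration; the notation, limits, and weighting are assumptions, not the paper's exact objective.

import torch

def rectified_safety_penalty(safety_costs, limits, lam=1.0):
    """safety_costs: per-prompt expected safety cost under the current policy.
    limits: per-prompt allowed cost (same shape). Returns a scalar penalty."""
    violations = torch.clamp(safety_costs - limits, min=0.0)  # rectify per prompt
    return lam * violations.mean()

# An average-based constraint would instead penalize only
# max(0, safety_costs.mean() - limits.mean()), letting very safe prompts
# offset unsafe ones (the "safety interference" described above).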