The burgeoning field of Large Language Models (LLMs), exemplified by
sophisticated models like OpenAI's ChatGPT, represents a significant
advancement in artificial intelligence. These models, however, bring forth
substantial challenges in the high consumption of computational, memory,
energy, and financial resources, especially in environments with limited
resource capabilities. This survey aims to systematically address these
challenges by reviewing a broad spectrum of techniques designed to enhance the
resource efficiency of LLMs. We categorize methods based on their optimization
focus: computational, memory, energy, financial, and network resources and
their applicability across various stages of an LLM's lifecycle, including
architecture design, pretraining, finetuning, and system design. Additionally,
the survey introduces a nuanced categorization of resource efficiency
techniques by their specific resource types, which uncovers the intricate
relationships and mappings between various resources and corresponding
optimization techniques. A standardized set of evaluation metrics and datasets
is also presented to facilitate consistent and fair comparisons across
different models and techniques. By offering a comprehensive overview of the
current sota and identifying open research avenues, this survey serves as a
foundational reference for researchers and practitioners, aiding them in
developing more sustainable and efficient LLMs in a rapidly evolving landscape.Abstract
The RoboDepth Challenge: Methods and Advancements Towards Robust Depth
Estimation
Accurate depth estimation under out-of-distribution (OoD) scenarios, such as
adverse weather conditions, sensor failure, and noise contamination, is
desirable for safety-critical applications. Existing depth estimation systems,
however, suffer inevitably from real-world corruptions and perturbations and
are struggled to provide reliable depth predictions under such cases. In this
paper, we summarize the winning solutions from the RoboDepth Challenge -- an
academic competition designed to facilitate and advance robust OoD depth
estimation. This challenge was developed based on the newly established KITTI-C
and NYUDepth2-C benchmarks. We hosted two stand-alone tracks, with an emphasis
on robust self-supervised and robust fully-supervised depth estimation,
respectively. Out of more than two hundred participants, nine unique and
top-performing solutions have appeared, with novel designs ranging from the
following aspects: spatial- and frequency-domain augmentations, masked image
modeling, image restoration and super-resolution, adversarial training,
diffusion-based noise suppression, vision-language pre-training, learned model
ensembling, and hierarchical feature enhancement. Extensive experimental
analyses along with insightful observations are drawn to better understand the
rationale behind each design. We hope this challenge could lay a solid
foundation for future research on robust and reliable depth estimation and
beyond. The datasets, competition toolkit, workshop recordings, and source code
from the winning teams are publicly available on the challenge website.Abstract
Closed-Loop Data Transcription to an LDR via Minimaxing Rate Reduction
This work proposes a new computational framework for learning a structured
generative model for real-world datasets. In particular, we propose to learn a
closed-loop transcription between a multi-class multi-dimensional data
distribution and a linear discriminative representation (LDR) in the feature
space that consists of multiple independent multi-dimensional linear subspaces.
In particular, we argue that the optimal encoding and decoding mappings sought
can be formulated as the equilibrium point of a two-player minimax game between
the encoder and decoder. A natural utility function for this game is the
so-called rate reduction, a simple information-theoretic measure for distances
between mixtures of subspace-like Gaussians in the feature space. Our
formulation draws inspiration from closed-loop error feedback from control
systems and avoids expensive evaluating and minimizing approximated distances
between arbitrary distributions in either the data space or the feature space.
To a large extent, this new formulation unifies the concepts and benefits of
Auto-Encoding and GAN and naturally extends them to the settings of learning a
both discriminative and generative representation for multi-class and
multi-dimensional real-world data. Our extensive experiments on many benchmark
imagery datasets demonstrate tremendous potential of this new closed-loop
formulation: under fair comparison, visual quality of the learned decoder and
classification performance of the encoder is competitive and often better than
existing methods based on GAN, VAE, or a combination of both. Unlike existing
generative models, the so learned features of the multiple classes are
structured: different classes are explicitly mapped onto corresponding
independent principal subspaces in the feature space. Source code can be found
at https://github.com/Delay-Xili/LDR.Abstract
Is Your LLM Secretly a World Model of the Internet? Model-Based Planning
for Web Agents
Language agents have demonstrated promising capabilities in automating
web-based tasks, though their current reactive approaches still underperform
largely compared to humans. While incorporating advanced planning algorithms,
particularly tree search methods, could enhance these agents' performance,
implementing tree search directly on live websites poses significant safety
risks and practical constraints due to irreversible actions such as confirming
a purchase. In this paper, we introduce a novel paradigm that augments language
agents with model-based planning, pioneering the innovative use of large
language models (LLMs) as world models in complex web environments. Our method,
WebDreamer, builds on the key insight that LLMs inherently encode comprehensive
knowledge about website structures and functionalities. Specifically,
WebDreamer uses LLMs to simulate outcomes for each candidate action (e.g.,
"what would happen if I click this button?") using natural language
descriptions, and then evaluates these imagined outcomes to determine the
optimal action at each step. Empirical results on two representative web agent
benchmarks with online interaction -- VisualWebArena and Mind2Web-live --
demonstrate that WebDreamer achieves substantial improvements over reactive
baselines. By establishing the viability of LLMs as world models in web
environments, this work lays the groundwork for a paradigm shift in automated
web interaction. More broadly, our findings open exciting new avenues for
future research into 1) optimizing LLMs specifically for world modeling in
complex, dynamic environments, and 2) model-based speculative planning for
language agents.Abstract
Towards Open Respiratory Acoustic Foundation Models: Pretraining and
Benchmarking
accepted by NeurIPS 2024 Track Datasets and Benchmarks
Respiratory audio, such as coughing and breathing sounds, has predictive
power for a wide range of healthcare applications, yet is currently
under-explored. The main problem for those applications arises from the
difficulty in collecting large labeled task-specific data for model
development. Generalizable respiratory acoustic foundation models pretrained
with unlabeled data would offer appealing advantages and possibly unlock this
impasse. However, given the safety-critical nature of healthcare applications,
it is pivotal to also ensure openness and replicability for any proposed
foundation model solution. To this end, we introduce OPERA, an OPEn Respiratory
Acoustic foundation model pretraining and benchmarking system, as the first
approach answering this need. We curate large-scale respiratory audio datasets
(~136K samples, over 400 hours), pretrain three pioneering foundation models,
and build a benchmark consisting of 19 downstream respiratory health tasks for
evaluation. Our pretrained models demonstrate superior performance (against
existing acoustic models pretrained with general audio on 16 out of 19 tasks)
and generalizability (to unseen datasets and new respiratory audio modalities).
This highlights the great promise of respiratory acoustic foundation models and
encourages more studies using OPERA as an open resource to accelerate research
on respiratory audio for health. The system is accessible from
https://github.com/evelyn0414/OPERA.Abstract
Unsupervised Abnormal Stop Detection for Long Distance Coaches with
Low-Frequency GPS
arXiv:2411.04422v1 »Full PDF »In our urban life, long distance coaches supply a convenient yet economic
approach to the transportation of the public. One notable problem is to
discover the abnormal stop of the coaches due to the important reason, i.e.,
illegal pick up on the way which possibly endangers the safety of passengers.
It has become a pressing issue to detect the coach abnormal stop with
low-quality GPS. In this paper, we propose an unsupervised method that helps
transportation managers to efficiently discover the Abnormal Stop Detection
(ASD) for long distance coaches. Concretely, our method converts the ASD
problem into an unsupervised clustering framework in which both the normal stop
and the abnormal one are decomposed. Firstly, we propose a stop duration model
for the low frequency GPS based on the assumption that a coach changes speed
approximately in a linear approach. Secondly, we strip the abnormal stops from
the normal stop points by the low rank assumption. The proposed method is
conceptually simple yet efficient, by leveraging low rank assumption to handle
normal stop points, our approach enables domain experts to discover the ASD for
coaches, from a case study motivated by traffic managers. Datset and code are
publicly available at: https://github.com/pangjunbiao/IPPs.Abstract
Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models
arXiv:2404.08254v2 »Full PDF »Diffusion models have emerged as a robust framework for various generative
tasks, including tabular data synthesis. However, current tabular diffusion
models tend to inherit bias in the training dataset and generate biased
synthetic data, which may influence discriminatory actions. In this research,
we introduce a novel tabular diffusion model that incorporates sensitive
guidance to generate fair synthetic data with balanced joint distributions of
the target label and sensitive attributes, such as sex and race. The empirical
results demonstrate that our method effectively mitigates bias in training data
while maintaining the quality of the generated samples. Furthermore, we provide
evidence that our approach outperforms existing methods for synthesizing
tabular data on fairness metrics such as demographic parity ratio and equalized
odds ratio, achieving improvements of over 10%. Our implementation is
available at https://github.com/comp-well-org/fair-tab-diffusion.Abstract
Manipulation Facing Threats: Evaluating Physical Vulnerabilities in
End-to-End Vision Language Action Models
arXiv:2409.13174v2 »Full PDF »Recently, driven by advancements in Multimodal Large Language Models (MLLMs),
Vision Language Action Models (VLAMs) are being proposed to achieve better
performance in open-vocabulary scenarios for robotic manipulation tasks. Since
manipulation tasks involve direct interaction with the physical world, ensuring
robustness and safety during the execution of this task is always a very
critical issue. In this paper, by synthesizing current safety research on MLLMs
and the specific application scenarios of the manipulation task in the physical
world, we comprehensively evaluate VLAMs in the face of potential physical
threats. Specifically, we propose the Physical Vulnerability Evaluating
Pipeline (PVEP) that can incorporate as many visual modal physical threats as
possible for evaluating the physical robustness of VLAMs. The physical threats
in PVEP specifically include Out-of-Distribution, Typography-based Visual
Prompts, and Adversarial Patch Attacks. By comparing the performance
fluctuations of VLAMs before and after being attacked, we provide generalizable
Analyses of how VLAMs respond to different physical security threats. Our
project page is in this link:
https://chaducheng.github.io/Manipulat-Facing-Threats/.Abstract
Real-time and Downtime-tolerant Fault Diagnosis for Railway Turnout
Machines (RTMs) Empowered with Cloud-Edge Pipeline Parallelism
arXiv:2411.02086v1 »Full PDF »Railway Turnout Machines (RTMs) are mission-critical components of the
railway transportation infrastructure, responsible for directing trains onto
desired tracks. For safety assurance applications, especially in early-warning
scenarios, RTM faults are expected to be detected as early as possible on a
continuous 7x24 basis. However, limited emphasis has been placed on distributed
model inference frameworks that can meet the inference latency and reliability
requirements of such mission critical fault diagnosis systems. In this paper,
an edge-cloud collaborative early-warning system is proposed to enable
real-time and downtime-tolerant fault diagnosis of RTMs, providing a new
paradigm for the deployment of models in safety-critical scenarios. Firstly, a
modular fault diagnosis model is designed specifically for distributed
deployment, which utilizes a hierarchical architecture consisting of the prior
knowledge module, subordinate classifiers, and a fusion layer for enhanced
accuracy and parallelism. Then, a cloud-edge collaborative framework leveraging
pipeline parallelism, namely CEC-PA, is developed to minimize the overhead
resulting from distributed task execution and context exchange by strategically
partitioning and offloading model components across cloud and edge.
Additionally, an election consensus mechanism is implemented within CEC-PA to
ensure system robustness during coordinator node downtime. Comparative
experiments and ablation studies are conducted to validate the effectiveness of
the proposed distributed fault diagnosis approach. Our ensemble-based fault
diagnosis model achieves a remarkable 97.4% accuracy on a real-world dataset
collected by Nanjing Metro in Jiangsu Province, China. Meanwhile, CEC-PA
demonstrates superior recovery proficiency during node disruptions and speed-up
ranging from 1.98x to 7.93x in total inference time compared to its
counterparts.Abstract
Traffic and Safety Rule Compliance of Humans in Diverse Driving
Situations
The increasing interest in autonomous driving systems has highlighted the
need for an in-depth analysis of human driving behavior in diverse scenarios.
Analyzing human data is crucial for developing autonomous systems that
replicate safe driving practices and ensure seamless integration into
human-dominated environments. This paper presents a comparative evaluation of
human compliance with traffic and safety rules across multiple trajectory
prediction datasets, including Argoverse 2, nuPlan, Lyft, and DeepUrban. By
defining and leveraging existing safety and behavior-related metrics, such as
time to collision, adherence to speed limits, and interactions with other
traffic participants, we aim to provide a comprehensive understanding of each
datasets strengths and limitations. Our analysis focuses on the distribution of
data samples, identifying noise, outliers, and undesirable behaviors exhibited
by human drivers in both the training and validation sets. The results
underscore the need for applying robust filtering techniques to certain
datasets due to high levels of noise and the presence of such undesirable
behaviors.Abstract