The ambiguity at the boundaries of different semantic classes in point cloud
semantic segmentation often leads to incorrect decisions in intelligent
perception systems, such as autonomous driving. Hence, accurate delineation of
the boundaries is crucial for improving safety in autonomous driving. This
paper proposes a novel spatial inter-correlation enhancement and
spatially-embedded feature fusion network (SIESEF-FusionNet), which enhances
spatial inter-correlation by combining inverse distance weighting and angular
compensation, extracting more beneficial spatial information without
introducing redundancy. Meanwhile, a new spatial adaptive pooling module is also designed,
embedding enhanced spatial information into semantic features for strengthening
the context-awareness of semantic features. Experimental results demonstrate
that SIESEF-FusionNet achieves 83.7% mIoU and 97.8% OA on the Toronto3D
dataset, outperforming other baseline methods. It also reaches 61.1% mIoU on
the SemanticKITTI dataset, with a marked improvement in segmentation
performance. In addition, ablation studies further verify the effectiveness
and plug-and-play capability of the proposed modules.
arXiv:2407.21783v2
Modern artificial intelligence (AI) systems are powered by foundation models.
This paper presents a new set of foundation models, called Llama 3. It is a
herd of language models that natively support multilinguality, coding,
reasoning, and tool usage. Our largest model is a dense Transformer with 405B
parameters and a context window of up to 128K tokens. This paper presents an
extensive empirical evaluation of Llama 3. We find that Llama 3 delivers
comparable quality to leading language models such as GPT-4 on a plethora of
tasks. We publicly release Llama 3, including pre-trained and post-trained
versions of the 405B parameter language model and our Llama Guard 3 model for
input and output safety. The paper also presents the results of experiments in
which we integrate image, video, and speech capabilities into Llama 3 via a
compositional approach. We observe that this approach performs competitively
with the state of the art on image, video, and speech recognition tasks. The
resulting models are not yet broadly released, as they are still under
development.
AutoRT: Embodied Foundation Models for Large Scale Orchestration of
Robotic Agents
Foundation models that incorporate language, vision, and more recently
actions have revolutionized the ability to harness internet scale data to
reason about useful tasks. However, one of the key challenges of training
embodied foundation models is the lack of data grounded in the physical world.
In this paper, we propose AutoRT, a system that leverages existing foundation
models to scale up the deployment of operational robots in completely unseen
scenarios with minimal human supervision. AutoRT leverages vision-language
models (VLMs) for scene understanding and grounding, and further uses large
language models (LLMs) for proposing diverse and novel instructions to be
performed by a fleet of robots. Guiding data collection by tapping into the
knowledge of foundation models enables AutoRT to effectively reason about
autonomy tradeoffs and safety while significantly scaling up data collection
for robot learning. We demonstrate AutoRT proposing instructions to over 20
robots across multiple buildings and collecting 77k real robot episodes via
both teleoperation and autonomous robot policies. We experimentally show that
such "in-the-wild" data collected by AutoRT is significantly more diverse, and
that AutoRT's use of LLMs allows for instruction-following data-collection
robots that can align to human preferences.
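The orchestration loop the abstract describes — a VLM for scene understanding, an LLM for instruction proposal, safety filtering, then data collection across a fleet — can be sketched with hypothetical stubs. None of these functions are AutoRT's actual API; they only illustrate the control flow:

```python
def vlm_describe(image):
    """Hypothetical stand-in for a VLM scene-description call."""
    return "a table with a cup and a sponge"

def llm_propose(scene, n=3):
    """Hypothetical stand-in for an LLM proposing tasks for a scene."""
    return [f"task {i} involving {scene}" for i in range(n)]

def is_safe(task):
    """Hypothetical keyword-based safety filter; the paper's system
    reasons about autonomy tradeoffs and safety more broadly."""
    return "knife" not in task

def collect_episode(robot_id, task):
    """Record one (robot, task) data-collection episode."""
    return {"robot": robot_id, "task": task}

episodes = []
for robot_id in range(2):  # a tiny two-robot "fleet"
    scene = vlm_describe(image=None)
    for task in filter(is_safe, llm_propose(scene)):
        episodes.append(collect_episode(robot_id, task))
print(len(episodes))  # 2 robots × 3 safe tasks = 6 episodes
```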
Grounded Decoding: Guiding Text Generation with Grounded Models for
Embodied Agents
arXiv:2303.00855v2
Recent progress in large language models (LLMs) has demonstrated the ability
to learn and leverage Internet-scale knowledge through pre-training with
autoregressive models. Unfortunately, applying such models to settings with
embodied agents, such as robots, is challenging due to their lack of experience
with the physical world, inability to parse non-language observations, and
ignorance of rewards or safety constraints that robots may require. On the
other hand, language-conditioned robotic policies that learn from interaction
data can provide the necessary grounding that allows the agent to be correctly
situated in the real world, but such policies are limited by the lack of
high-level semantic understanding due to the limited breadth of the interaction
data available for training them. Thus, if we want to make use of the semantic
knowledge in a language model while still situating it in an embodied setting,
we must construct an action sequence that is both likely according to the
language model and also realizable according to grounded models of the
environment. We frame this as a problem similar to probabilistic filtering:
decode a sequence that both has high probability under the language model and
high probability under a set of grounded model objectives. We demonstrate how
such grounded models can be obtained across three simulation and real-world
domains, and that the proposed decoding strategy is able to solve complex,
long-horizon embodiment tasks in a robotic setting by leveraging the knowledge
of both models. The project's website can be found at
grounded-decoding.github.io.
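The decoding objective described above — a sequence likely under both the language model and the grounded models — can be sketched for a single decoding step. This is a generic product-of-probabilities token selection, not the paper's implementation, and the probability tables are made-up:

```python
import math

def grounded_decode_step(lm_probs, grounded_probs_list):
    """Pick the token maximizing the product of the LM probability and
    all grounded-model probabilities (equivalently, the sum of log-probs).

    lm_probs: dict token -> probability under the language model.
    grounded_probs_list: dicts token -> probability under each grounded
    model (e.g. an affordance or safety model); hypothetical inputs.
    """
    def score(tok):
        s = math.log(lm_probs[tok])
        for g in grounded_probs_list:
            s += math.log(g.get(tok, 1e-12))  # near-zero mass for unseen tokens
        return s
    return max(lm_probs, key=score)

lm = {"pick": 0.6, "fly": 0.4}        # the LM slightly prefers "pick"
afford = {"pick": 0.9, "fly": 0.01}   # grounded model: "fly" is infeasible
print(grounded_decode_step(lm, [afford]))  # → "pick"
```

This mirrors the probabilistic-filtering framing: the grounded models veto tokens that the LM alone would happily emit.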
Principles and Guidelines for Evaluating Social Robot Navigation
Algorithms
A major challenge to deploying robots widely is navigation in human-populated
environments, commonly referred to as social robot navigation. While the field
of social navigation has advanced tremendously in recent years, the fair
evaluation of algorithms that tackle social navigation remains hard because it
involves not just robotic agents moving in static environments but also dynamic
human agents and their perceptions of the appropriateness of robot behavior. In
contrast, clear, repeatable, and accessible benchmarks have accelerated
progress in fields like computer vision, natural language processing and
traditional robot navigation by enabling researchers to fairly compare
algorithms, revealing limitations of existing solutions and illuminating
promising new directions. We believe the same approach can benefit social
navigation. In this paper, we pave the road towards common, widely accessible,
and repeatable benchmarking criteria to evaluate social robot navigation. Our
contributions include (a) a definition of a socially navigating robot as one
that respects the principles of safety, comfort, legibility, politeness, social
competency, agent understanding, proactivity, and responsiveness to context,
(b) guidelines for the use of metrics, development of scenarios, benchmarks,
datasets, and simulators to evaluate social navigation, and (c) a design of a
social navigation metrics framework to make it easier to compare results from
different simulators, robots and datasets.
Federated Learning (FL) is a training approach for scenarios where users' data
cannot be shared across clients. Achieving fairness in FL is
imperative since training data in FL is inherently geographically distributed
among diverse user groups. Existing research on fairness predominantly assumes
access to the entire training data, making direct transfer to FL challenging.
Moreover, the limited existing research on fairness in FL does not effectively
address two key challenges: (CH1) Current methods fail to resolve the
inconsistency between fair optimization results obtained with surrogate
functions and fair classification results. (CH2) Directly aggregating local
fair models does not always yield a globally fair model due to
non-Identically and Independently Distributed (non-IID) data among clients. To address these
challenges, we propose a Wasserstein Fair Federated Learning framework, namely
WassFFed. To tackle CH1, we ensure that the outputs of local models, rather
than the loss calculated with surrogate functions or classification results
with a threshold, remain independent of various user groups. To resolve CH2, we
employ a Wasserstein barycenter calculation of all local models' outputs for
each user group, bringing local model outputs closer to the global output
distribution to ensure consistency between the global model and local models.
We conduct extensive experiments on three real-world datasets, demonstrating
that WassFFed outperforms existing approaches in striking a balance between
accuracy and fairness.
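The 1-D Wasserstein barycenter used to align local output distributions has a simple closed form for equally weighted empirical distributions with the same number of samples: sort each sample set and average the sorted values (quantile averaging). A minimal sketch of that general fact, not WassFFed's implementation:

```python
import numpy as np

def wasserstein_barycenter_1d(samples_list):
    """Wasserstein-2 barycenter of equally weighted 1-D empirical
    distributions with equal sample counts: average the quantile
    functions, i.e. average the sorted samples position-wise."""
    sorted_samples = [np.sort(np.asarray(s)) for s in samples_list]
    return np.mean(sorted_samples, axis=0)

# Hypothetical model output scores from two clients for one user group:
client_a = [0.2, 0.9, 0.5]
client_b = [0.1, 0.4, 0.8]
print(wasserstein_barycenter_1d([client_a, client_b]))
# sorted [0.2, 0.5, 0.9] and [0.1, 0.4, 0.8] → barycenter [0.15, 0.45, 0.85]
```

Pulling each client's outputs toward this barycenter is one way to make local output distributions consistent with a shared global one.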
Towards Open Respiratory Acoustic Foundation Models: Pretraining and
Benchmarking
accepted by NeurIPS 2024 Track Datasets and Benchmarks
Respiratory audio, such as coughing and breathing sounds, has predictive
power for a wide range of healthcare applications, yet is currently
under-explored. The main problem for those applications arises from the
difficulty in collecting large labeled task-specific data for model
development. Generalizable respiratory acoustic foundation models pretrained
with unlabeled data would offer appealing advantages and possibly unlock this
impasse. However, given the safety-critical nature of healthcare applications,
it is pivotal to also ensure openness and replicability for any proposed
foundation model solution. To this end, we introduce OPERA, an OPEn Respiratory
Acoustic foundation model pretraining and benchmarking system, as the first
approach answering this need. We curate large-scale respiratory audio datasets
(~136K samples, over 400 hours), pretrain three pioneering foundation models,
and build a benchmark consisting of 19 downstream respiratory health tasks for
evaluation. Our pretrained models demonstrate superior performance (against
existing acoustic models pretrained with general audio on 16 out of 19 tasks)
and generalizability (to unseen datasets and new respiratory audio modalities).
This highlights the great promise of respiratory acoustic foundation models and
encourages more studies using OPERA as an open resource to accelerate research
on respiratory audio for health. The system is accessible from
https://github.com/evelyn0414/OPERA.
From Word Vectors to Multimodal Embeddings: Techniques, Applications,
and Future Directions For Large Language Models
Word embeddings and language models have transformed natural language
processing (NLP) by facilitating the representation of linguistic elements in
continuous vector spaces. This review visits foundational concepts such as the
distributional hypothesis and contextual similarity, tracing the evolution from
sparse representations like one-hot encoding to dense embeddings including
Word2Vec, GloVe, and fastText. We examine both static and contextualized
embeddings, underscoring advancements in models such as ELMo, BERT, and GPT and
their adaptations for cross-lingual and personalized applications. The
discussion extends to sentence and document embeddings, covering aggregation
methods and generative topic models, along with the application of embeddings
in multimodal domains, including vision, robotics, and cognitive science.
Advanced topics such as model compression, interpretability, numerical
encoding, and bias mitigation are analyzed, addressing both technical
challenges and ethical implications. Additionally, we identify future research
directions, emphasizing the need for scalable training techniques, enhanced
interpretability, and robust grounding in non-textual modalities. By
synthesizing current methodologies and emerging trends, this survey offers
researchers and practitioners an in-depth resource to push the boundaries of
embedding-based language models.
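The evolution from sparse one-hot vectors to dense embeddings that the survey traces can be illustrated in a few lines; the embedding table below is randomly initialized purely for illustration, not trained:

```python
import numpy as np

vocab = ["cat", "dog", "car"]

def one_hot(word):
    """Sparse representation: one dimension per vocabulary entry,
    a single 1 — no notion of similarity between words."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))  # dense 4-d vectors

def embed(word):
    """Dense representation: a learned low-dimensional lookup, as in
    Word2Vec/GloVe/fastText (here random, so untrained)."""
    return embedding_table[vocab.index(word)]

print(one_hot("dog"))      # [0. 1. 0.] — dimension grows with the vocabulary
print(embed("dog").shape)  # (4,) — fixed size; similarity via dot products
```

Contextualized models such as ELMo and BERT go one step further, producing a different vector for the same word in each sentence rather than a single table entry.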
Post-translational modifications (PTMs) profoundly expand the complexity and
functionality of the proteome, regulating protein attributes and interactions
that are crucial for biological processes. Accurately predicting PTM sites and
their specific types is therefore essential for elucidating protein function
and understanding disease mechanisms. Existing computational approaches
predominantly focus on protein sequences to predict PTM sites, driven by the
recognition of sequence-dependent motifs. However, these approaches often
overlook protein structural contexts. In this work, we first compile a
large-scale sequence-structure PTM dataset, which serves as the foundation for
fair comparison. We introduce the MeToken model, which tokenizes the
micro-environment of each amino acid, integrating both sequence and structural
information into unified discrete tokens. This model not only captures the
typical sequence motifs associated with PTMs but also leverages the spatial
arrangements dictated by protein tertiary structures, thus providing a holistic
view of the factors influencing PTM sites. Designed to address the long-tail
distribution of PTM types, MeToken employs uniform sub-codebooks that ensure
even the rarest PTMs are adequately represented and distinguished. We validate
the effectiveness and generalizability of MeToken across multiple datasets,
demonstrating its superior performance in accurately identifying PTM types. The
results underscore the importance of incorporating structural data and
highlight MeToken's potential in facilitating accurate and comprehensive PTM
predictions, which could significantly impact proteomics research. The code and
datasets are available at https://github.com/A4Bio/MeToken.
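The tokenization idea — mapping each continuous micro-environment feature to a discrete codebook index — is, in its general form, nearest-neighbor vector quantization. A minimal sketch with a made-up two-entry codebook, not MeToken's model or sub-codebook scheme:

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codeword
    (Euclidean distance), yielding discrete tokens.

    features: (n, d) continuous vectors; codebook: (m, d) codewords.
    """
    # Pairwise distances via broadcasting: (n, 1, d) - (1, m, d) -> (n, m)
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)  # (n,) token ids

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])          # hypothetical codewords
feats = np.array([[0.1, -0.1], [0.9, 1.2]])            # hypothetical features
print(quantize(feats, codebook))  # → [0 1]
```

MeToken's uniform sub-codebooks address the long-tail PTM distribution by reserving capacity for rare types — something this flat codebook does not attempt.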
CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision
Language Models
Artificial intelligence has significantly impacted medical applications,
particularly with the advent of Medical Large Vision Language Models
(Med-LVLMs), sparking optimism for the future of automated and personalized
healthcare. However, the trustworthiness of Med-LVLMs remains unverified,
posing significant risks for future model deployment. In this paper, we
introduce CARES and aim to comprehensively evaluate the trustworthiness of
Med-LVLMs across the medical domain. We assess the trustworthiness of Med-LVLMs
across five dimensions, including trustfulness, fairness, safety, privacy, and
robustness. CARES comprises about 41K question-answer pairs in both closed and
open-ended formats, covering 16 medical image modalities and 27 anatomical
regions. Our analysis reveals that the models consistently exhibit concerns
regarding trustworthiness, often displaying factual inaccuracies and failing to
maintain fairness across different demographic groups. Furthermore, they are
vulnerable to attacks and demonstrate a lack of privacy awareness. We publicly
release our benchmark and code at https://cares-ai.github.io/.