We welcome questions, issues, and paper requests via
https://github.com/AtsuMiyai/Aw...
Detecting out-of-distribution (OOD) samples is crucial for ensuring the
safety of machine learning systems and has shaped the field of OOD detection.
Meanwhile, several other problems are closely related to OOD detection,
including anomaly detection (AD), novelty detection (ND), open set recognition
(OSR), and outlier detection (OD). To unify these problems, a generalized OOD
detection framework was proposed, taxonomically categorizing these five
problems. However, Vision Language Models (VLMs) such as CLIP have
significantly changed the paradigm and blurred the boundaries between these
fields, again confusing researchers. In this survey, we first present a
generalized OOD detection v2, encapsulating the evolution of AD, ND, OSR, OOD
detection, and OD in the VLM era. Our framework reveals that, as some fields
have become inactive or merged into others, the most demanding challenges now
lie in OOD detection and AD. We also highlight the significant shifts in
definitions, problem settings, and benchmarks, and accordingly provide a
comprehensive review of OOD detection methodology, including a discussion of
related tasks that clarifies their relationship to OOD detection. Finally, we explore the
advancements in the emerging Large Vision Language Model (LVLM) era, such as
GPT-4V. We conclude this survey with open challenges and future directions.
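To make the VLM-era setting above concrete, here is a minimal sketch of zero-shot OOD scoring in the style of CLIP-based methods such as MCM: an image embedding is compared against text embeddings of the in-distribution class names, and a low maximum softmax similarity suggests an OOD input. The feature shapes, temperature, and threshold are illustrative assumptions, not values taken from the survey.

```python
import torch
import torch.nn.functional as F

def mcm_score(image_feat: torch.Tensor, text_feats: torch.Tensor,
              tau: float = 0.01) -> torch.Tensor:
    """Maximum softmax over cosine similarities between an image embedding
    and in-distribution class-name text embeddings; low scores suggest OOD."""
    image_feat = F.normalize(image_feat, dim=-1)   # (D,)
    text_feats = F.normalize(text_feats, dim=-1)   # (C, D)
    sims = text_feats @ image_feat                 # (C,) cosine similarities
    return F.softmax(sims / tau, dim=-1).max()

# Usage with hypothetical precomputed CLIP features:
img = torch.randn(512)
txt = torch.randn(10, 512)             # 10 in-distribution class prompts
is_ood = mcm_score(img, txt) < 0.5     # threshold is dataset-dependent
```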
Accepted to TPAMI. Code is available at
https://github.com/zhoudw-zdw/CIL_Survey/
Deep models, e.g., CNNs and Vision Transformers, have achieved impressive
results in many vision tasks in the closed world. However, novel classes
emerge from time to time in our ever-changing world, requiring a learning
system to acquire new knowledge continually. Class-Incremental Learning (CIL)
enables the learner to incorporate the knowledge of new classes incrementally
and build a universal classifier over all seen classes. However, when
directly training the model on new class instances, a fatal problem occurs:
the model tends to catastrophically forget the characteristics of formerly learned classes,
and its performance drastically degrades. There have been numerous efforts to
tackle catastrophic forgetting in the machine learning community. In this
paper, we comprehensively survey recent advances in class-incremental learning
and summarize these methods from several aspects. We also provide a rigorous
and unified evaluation of 17 methods in benchmark image classification tasks to
find out the characteristics of different algorithms empirically. Furthermore,
we notice that the current comparison protocol ignores the influence of memory
budget in model storage, which may result in unfair comparison and biased
results. Hence, we advocate fair comparison by aligning the memory budget in
evaluation, as well as several memory-agnostic performance measures. The source
code is available at https://github.com/zhoudw-zdw/CIL_Survey/
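As a concrete illustration of one method family such surveys cover, here is a minimal sketch of exemplar rehearsal, a common baseline against catastrophic forgetting. The class names and the keep-first selection policy are illustrative assumptions, and the fixed total budget mirrors the memory-aligned comparison advocated above.

```python
import random
from collections import defaultdict

class ExemplarBuffer:
    """Fixed-budget rehearsal memory: keeps roughly equal numbers of stored
    exemplars per seen class, replayed alongside new-class data."""
    def __init__(self, budget: int):
        self.budget = budget
        self.store = defaultdict(list)   # class id -> list of samples

    def add_task(self, samples_by_class: dict):
        for c, xs in samples_by_class.items():
            self.store[c] = list(xs)
        per_class = self.budget // max(len(self.store), 1)
        for c in self.store:             # shrink all classes to fit the budget
            self.store[c] = self.store[c][:per_class]

    def replay_batch(self, k: int):
        pool = [x for xs in self.store.values() for x in xs]
        return random.sample(pool, min(k, len(pool)))
```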
Calib3D: Calibrating Model Preferences for Reliable 3D Scene
Understanding
Preprint; 37 pages, 8 figures, 11 tables; Code at
https://github.com/ldkong1205/Calib3D
Safety-critical 3D scene understanding tasks necessitate not only accurate
but also confident predictions from 3D perception models. This study introduces
Calib3D, a pioneering effort to benchmark and scrutinize the reliability of 3D
scene understanding models from an uncertainty estimation viewpoint. We
comprehensively evaluate 28 state-of-the-art models across 10 diverse 3D
datasets, uncovering instructive phenomena concerning both the aleatoric and
epistemic uncertainties in 3D scene understanding. We discover that, despite
achieving impressive levels of accuracy, existing models frequently fail to
provide reliable uncertainty estimates -- a pitfall that critically undermines
their applicability in safety-sensitive contexts. Through extensive analysis of
key factors such as network capacity, LiDAR representations, rasterization
resolutions, and 3D data augmentation techniques, we directly correlate these
aspects with model calibration efficacy. Furthermore, we introduce DeptS,
a novel depth-aware scaling approach aimed at enhancing 3D model calibration.
Extensive experiments across a wide range of configurations validate the
superiority of our method. We hope this work could serve as a cornerstone for
fostering reliable 3D scene understanding. Code and benchmark toolkits are
publicly available.
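For readers unfamiliar with calibration metrics, the reliability studied in such benchmarks is typically quantified with the expected calibration error (ECE); a minimal sketch follows. The equal-width binning scheme and bin count are conventional choices, not necessarily Calib3D's exact protocol.

```python
import numpy as np

def expected_calibration_error(conf: np.ndarray, correct: np.ndarray,
                               n_bins: int = 15) -> float:
    """Standard ECE: bin predictions by confidence and average the
    |accuracy - confidence| gap, weighted by bin population."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```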
Generalized Out-of-Distribution Detection: A Survey
Feel free to comment on our Overleaf manuscript:
https://www.overleaf.com/9899719915wmccvdtwpkct#c...
Out-of-distribution (OOD) detection is critical to ensuring the reliability
and safety of machine learning systems. For instance, in autonomous driving, we
would like the driving system to issue an alert and hand over the control to
humans when it detects unusual scenes or objects that it has never seen during
training time and cannot make a safe decision. The term OOD detection first
emerged in 2017 and has since received increasing attention from the
research community, leading to a plethora of methods, ranging from
classification-based to density-based to distance-based ones. Meanwhile,
several other problems, including anomaly detection (AD), novelty detection
(ND), open set recognition (OSR), and outlier detection (OD), are closely
related to OOD detection in terms of motivation and methodology. Despite common
goals, these topics develop in isolation, and their subtle differences in
definition and problem setting often confuse readers and practitioners. In this
survey, we first present a unified framework called generalized OOD detection,
which encompasses the five aforementioned problems, i.e., AD, ND, OSR, OOD
detection, and OD. Under our framework, these five problems can be seen as
special cases or sub-tasks, and are easier to distinguish. We then review each
of these five areas by summarizing their recent technical developments, with a
special focus on OOD detection methodologies. We conclude this survey with open
challenges and potential research directions.
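To ground the taxonomy above, the following schematic scores illustrate the three method families named in this abstract; in each case a test input is flagged OOD when its score falls below a threshold chosen on validation data. These are textbook simplifications, not the survey's reference implementations.

```python
import numpy as np

def msp_score(logits):
    # Classification-based: maximum softmax probability of a trained classifier.
    z = np.exp(logits - logits.max())
    return (z / z.sum()).max()

def density_score(log_likelihood):
    # Density-based: likelihood of the input under a generative model of ID data.
    return log_likelihood

def distance_score(feat, class_means, precision):
    # Distance-based: negative Mahalanobis distance to the nearest ID class mean.
    d = [(feat - m) @ precision @ (feat - m) for m in class_means]
    return -min(d)
```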
Gemini vs GPT-4V: A Preliminary Comparison and Combination of
Vision-Language Models Through Qualitative Cases
Project Page: https://github.com/Qi-Zhangyang/Gemini-vs-GPT4V. arXiv
admin note: substantial text ...
The rapidly evolving sector of Multi-modal Large Language Models (MLLMs) is
at the forefront of integrating linguistic and visual processing in artificial
intelligence. This paper presents an in-depth comparative study of two
pioneering models: Google's Gemini and OpenAI's GPT-4V(ision). Our study
involves a multi-faceted evaluation of both models across key dimensions such
as Vision-Language Capability, Interaction with Humans, Temporal Understanding,
and assessments in both Intelligence and Emotional Quotients. The core of our
analysis delves into the distinct visual comprehension abilities of each model.
We conducted a series of structured experiments to evaluate their performance
in various industrial application scenarios, offering a comprehensive
perspective on their practical utility. Our evaluation not only involves direct
performance comparisons but also includes adjustments in prompts and scenarios to ensure a
balanced and fair analysis. Our findings illuminate the unique strengths and
niches of both models. GPT-4V distinguishes itself with its precision and
succinctness in responses, while Gemini excels in providing detailed, expansive
answers accompanied by relevant imagery and links. These insights not
only shed light on the comparative merits of Gemini and GPT-4V but also
underscore the evolving landscape of multimodal foundation models, paving the
way for future advancements in this area. After the comparison, we attempted to
achieve better results by combining the two models. Finally, we would like to
express our profound gratitude to the teams behind GPT-4V and Gemini for their
pioneering contributions to the field. Our acknowledgments are also extended to
the comprehensive qualitative analysis presented in 'Dawn' by Yang et al. This
work, with its extensive collection of image samples, prompts, and
GPT-4V-related results, provided a foundational basis for our analysis.
Robo3D: Towards Robust and Reliable 3D Perception against Corruptions
The robustness of 3D perception systems under natural corruptions from
environments and sensors is pivotal for safety-critical applications. Existing
large-scale 3D perception datasets often contain data that are meticulously
cleaned. Such configurations, however, cannot reflect the reliability of
perception models during the deployment stage. In this work, we present Robo3D,
the first comprehensive benchmark for probing the robustness of 3D
detectors and segmentors under out-of-distribution scenarios against natural
corruptions that occur in real-world environments. Specifically, we consider
eight corruption types stemming from severe weather conditions, external
disturbances, and internal sensor failure. We uncover that, although promising
results have been progressively achieved on standard benchmarks,
state-of-the-art 3D perception models remain vulnerable to corruptions. We
draw key observations on the use of data representations, augmentation
schemes, and training strategies that can severely affect the
model's performance. To pursue better robustness, we propose a
density-insensitive training framework along with a simple, flexible
voxelization strategy to enhance model resiliency. We hope our benchmark
and approach could inspire future research in designing more robust and
reliable 3D perception models. Our robustness benchmark suite is publicly
available.
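As a flavor of what such corruptions look like in code, here is a toy stand-in that drops LiDAR returns and jitters point coordinates, two effects loosely analogous to sensor failure and measurement noise. It is illustrative only and not the Robo3D corruption protocol, whose corruption types and severity levels differ.

```python
import numpy as np

def corrupt_points(points: np.ndarray, drop_ratio: float = 0.3,
                   jitter_std: float = 0.02, rng=None) -> np.ndarray:
    """Toy corruption for a point cloud of shape (N, 3+): randomly drop
    returns, then add Gaussian jitter to the x/y/z coordinates."""
    rng = rng or np.random.default_rng(0)
    keep = rng.random(len(points)) > drop_ratio
    pts = points[keep].copy()
    pts[:, :3] += rng.normal(0.0, jitter_std, pts[:, :3].shape)
    return pts
```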
BiBench: Benchmarking and Analyzing Network Binarization
arXiv:2301.11233v2
Network binarization emerges as one of the most promising compression
approaches offering extraordinary computation and memory savings by minimizing
the bit-width. However, recent research has shown that applying existing
binarization algorithms to diverse tasks, architectures, and hardware in
realistic scenarios is still not straightforward. Common challenges of
binarization, such as accuracy degradation and efficiency limitation, suggest
that its attributes are not fully understood. To close this gap, we present
BiBench, a rigorously designed benchmark with in-depth analysis for network
binarization. We first carefully scrutinize the requirements of binarization
in actual production and define evaluation tracks and metrics for a
comprehensive and fair investigation. Then, we evaluate and analyze a series of
milestone binarization algorithms that operate at the operator level and have
had broad influence. Our benchmark reveals that 1) the binarized operator has a
crucial impact on the performance and deployability of binarized networks; 2)
the accuracy of binarization varies significantly across different learning
tasks and neural architectures; 3) binarization has demonstrated promising
efficiency potential on edge devices despite the limited hardware support. The
results and analysis also lead to a promising paradigm for accurate and
efficient binarization. We believe that BiBench will contribute to the broader
adoption of binarization and serve as a foundation for future research. The
code for our BiBench is released at https://github.com/htqin/BiBench.
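For context on what "operator-level" binarization means, the sketch below shows the classic primitive most such algorithms build on: sign binarization with a straight-through estimator (STE) so that gradients can flow through the non-differentiable sign. This is a generic BNN-style baseline, not a specific algorithm from the benchmark.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through estimator: the forward pass
    uses sign(x); the backward pass lets gradients through where |x| <= 1."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()

w = torch.randn(4, 4, requires_grad=True)
w_bin = BinarizeSTE.apply(w)   # values in {-1, 0, +1}; nonzero a.s.
w_bin.sum().backward()         # gradients flow through the STE
```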
Accepted by NeurIPS 2022 Datasets and Benchmarks Track. Codebase:
https://github.com/Jingkang50/Op...
Out-of-distribution (OOD) detection is vital to safety-critical machine
learning applications and has thus been extensively studied, with a plethora of
methods developed in the literature. However, the field currently lacks a
unified, strictly formulated, and comprehensive benchmark, which often results
in unfair comparisons and inconclusive results. From the problem setting
perspective, OOD detection is closely related to neighboring fields including
anomaly detection (AD), open set recognition (OSR), and model uncertainty,
since methods developed for one domain are often applicable to the others. To
help the community improve evaluation and advance the field, we build a unified,
well-structured codebase called OpenOOD, which implements over 30 methods
developed in relevant fields and provides a comprehensive benchmark under the
recently proposed generalized OOD detection framework. With a comprehensive
comparison of these methods, we are gratified to find that the field has
progressed significantly over the past few years, with both preprocessing
methods and orthogonal post-hoc methods showing strong potential.
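As an example of the post-hoc methods mentioned above, one representative score implemented in codebases like OpenOOD is the energy score, computed directly from a trained classifier's logits; the sketch below follows the standard formulation, with the temperature as a tunable assumption.

```python
import torch

def neg_energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    # Negative free energy: T * logsumexp(logits / T).
    # Larger values indicate in-distribution; smaller values suggest OOD.
    return T * torch.logsumexp(logits / T, dim=-1)
```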
Benchmarking and Analyzing 3D Human Pose and Shape Estimation Beyond
Algorithms
Submission to 36th Conference on Neural Information Processing
Systems (NeurIPS 2022) Track on Dat...
3D human pose and shape estimation (a.k.a. "human mesh recovery") has
achieved substantial progress. Researchers mainly focus on the development of
novel algorithms, while less attention has been paid to other critical factors
involved. This could lead to less optimal baselines, hindering the fair and
faithful evaluations of newly designed methodologies. To address this problem,
this work presents the first comprehensive benchmarking study from three
under-explored perspectives beyond algorithms. 1) Datasets. An analysis on 31
datasets reveals the distinct impacts of data samples: datasets featuring
critical attributes (e.g., diverse poses, shapes, camera characteristics,
backbone features) are more effective. Strategic selection and combination of
high-quality datasets can yield a significant boost to the model performance.
2) Backbones. Experiments with 10 backbones, ranging from CNNs to transformers,
show that knowledge learnt from a proximate task is readily transferable to
human mesh recovery. 3) Training strategies. Proper augmentation techniques and
loss designs are crucial. With the above findings, we achieve a PA-MPJPE of
47.3 mm on the 3DPW test set with a relatively simple model. More importantly,
we provide strong baselines for fair comparisons of algorithms, and
recommendations for building effective training configurations in the future.
Codebase is available at http://github.com/smplbody/hmr-benchmarks
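Since results here are reported in PA-MPJPE, a brief sketch of how that metric is computed may help: predicted joints are aligned to the ground truth by a similarity (Procrustes) transform, after which the mean per-joint Euclidean error is taken. The (J, 3) joint arrays and millimeter units are conventional assumptions.

```python
import numpy as np

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """PA-MPJPE: mean per-joint position error after a similarity
    (Procrustes) alignment of the prediction to the ground truth."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g                 # center both joint sets
    U, s, Vt = np.linalg.svd(p.T @ g)             # optimal rotation via SVD
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:                      # avoid reflections
        Vt[-1] *= -1
        s[-1] *= -1
        R = (U @ Vt).T
    scale = s.sum() / (p ** 2).sum()              # optimal isotropic scale
    aligned = scale * p @ R.T + mu_g
    return float(np.linalg.norm(aligned - gt, axis=1).mean())
```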
To appear in IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI), 2022. Extende...
Real world data often exhibits a long-tailed and open-ended (with unseen
classes) distribution. A practical recognition system must balance between
majority (head) and minority (tail) classes, generalize across the
distribution, and acknowledge the novelty of instances from unseen classes
(open classes). We define Open Long-Tailed Recognition++ (OLTR++) as learning
from such naturally distributed data and optimizing for the classification
accuracy over a balanced test set which includes both known and open classes.
OLTR++ handles imbalanced classification, few-shot learning, open-set
recognition, and active learning in one integrated algorithm, whereas existing
classification approaches often focus only on one or two aspects and perform
poorly over the entire spectrum. The key challenges are: 1) how to share visual
knowledge between head and tail classes, 2) how to reduce confusion between
tail and open classes, and 3) how to actively explore open classes with learned
knowledge. Our algorithm, OLTR++, maps images to a feature space such that
visual concepts can relate to each other through a memory association mechanism
and a learned metric (dynamic meta-embedding) that both respects the closed
world classification of seen classes and acknowledges the novelty of open
classes. Additionally, we propose an active learning scheme based on visual
memory, which learns to recognize open classes in a data-efficient manner for
future expansions. On three large-scale open long-tailed datasets we curated
from ImageNet (object-centric), Places (scene-centric), and MS1M (face-centric)
data, as well as three standard benchmarks (CIFAR-10-LT, CIFAR-100-LT, and
iNaturalist-18), our approach, as a unified framework, consistently
demonstrates competitive performance. Notably, our approach also shows strong
potential for the active exploration of open classes and the fairness analysis
of minority groups.Abstract