Exploring Out-of-Distribution Detection Using Vision Transformers
Date
2022
Authors
Publisher
Tartu Ülikool
Abstract
Current state-of-the-art artificial neural network (ANN) image classifiers perform
well on input data from the same distribution they were trained on, also known as
in-distribution (InD), yet perform worse on out-of-distribution (OOD) samples. An
input can be OOD for many reasons, such as containing a new concept
(e.g. a new class) or random noise generated by a sensor. Knowing whether a
new data point is OOD is necessary for deploying models in real-world safety-critical
applications (e.g. self-driving cars, healthcare) to make safer decisions. For example,
a self-driving car can slow down or hand control back to the human driver when it
detects an OOD object. The primary method for OOD detection is to use an ANN as a
feature extractor: the embedding of a new data point is compared to the embeddings
of the training data using a distance metric.
We use a Vision Transformer (ViT) as the ANN because there has been a push to use
large-scale pre-trained Transformers to improve a range of OOD tasks. Improvements
stem from ViT’s state-of-the-art performance as a feature extractor, which can be used
out-of-the-box for OOD detection, unlike convolutional neural networks (CNNs),
which require custom training methods and exposure to OOD data to reach similar results.
In this thesis, a ViT was used as a feature extractor, and OOD detection performance
was compared across various distance metrics to assess how robust ViT’s embedding
space is to the choice of metric and to identify the best one. Three separate experiments
were conducted with multiple datasets, methods, models and approaches. The experiments
showed that ViT is capable of OOD detection out-of-the-box without any custom
training methods or exposure to OOD data. However, none of the other distance metrics
noticeably improved on the OOD detection results obtained with the baseline Mahalanobis
distance. Nonetheless, ViT has considerably better OOD detection performance on most
datasets and is more generalisable than a similarly trained CNN. Furthermore, ViT is
more robust to the choice of distance metric, which indicates that the features extracted
from the model are good enough to discriminate between InD and OOD. Finally, it was shown
that ViT with Mahalanobis distance has the best OOD detection performance when
blending InD and OOD at various ratios. Future work could ensemble multiple
distance metrics to exploit the properties of each and apply the same
methodology to other ANN architectures.
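
The distance-based scoring described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes a common formulation of the Mahalanobis baseline (per-class means with one covariance matrix shared across classes) and uses synthetic Gaussian vectors as stand-ins for ViT embeddings. The function names `fit_class_gaussians` and `mahalanobis_ood_score` are hypothetical.

```python
import numpy as np


def fit_class_gaussians(embeddings, labels):
    """Estimate per-class means and a shared covariance from InD embeddings."""
    classes = np.unique(labels)
    means = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    # Centre each embedding on its own class mean, then pool the residuals.
    centred = embeddings - means[np.searchsorted(classes, labels)]
    cov = centred.T @ centred / len(embeddings)
    # Small ridge term keeps the matrix invertible (an assumption of this sketch).
    precision = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[1]))
    return means, precision


def mahalanobis_ood_score(x, means, precision):
    """Squared Mahalanobis distance to the nearest class mean, one score per row."""
    diff = x[:, None, :] - means[None, :, :]  # shape: (n, classes, dim)
    d2 = np.einsum("ncd,de,nce->nc", diff, precision, diff)
    return d2.min(axis=1)


# Synthetic stand-ins for embeddings: two InD classes and one far-away OOD cluster.
rng = np.random.default_rng(0)
train = np.concatenate([rng.normal(0, 1, (200, 8)), rng.normal(5, 1, (200, 8))])
train_labels = np.repeat([0, 1], 200)
means, precision = fit_class_gaussians(train, train_labels)

ind_scores = mahalanobis_ood_score(rng.normal(0, 1, (50, 8)), means, precision)
ood_scores = mahalanobis_ood_score(rng.normal(20, 1, (50, 8)), means, precision)
```

A higher score means the input lies further from every class seen in training, so thresholding the score gives an InD/OOD decision. In practice the embeddings would come from the feature extractor rather than a random generator, and the threshold would be chosen on held-out data.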
Keywords
deep learning, neural networks, vision transformer, out-of-distribution detection