AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models

3 Objecteye Inc., Beijing, China
4 Wuhan AI Research, Wuhan, China
[email protected]
{bingke.zhu,gbzhu,yingying.chen,tangm,jqwang}@nlpr.ia.ac.cn
Abstract
Large Vision-Language Models (LVLMs) such as MiniGPT-4 and LLaVA have demonstrated the capability of understanding images and achieved remarkable performance in various visual tasks. Despite their strong abilities in recognizing common objects due to extensive training datasets, they lack specific domain knowledge and have a weaker understanding of localized details within objects, which hinders their effectiveness in the Industrial Anomaly Detection (IAD) task. On the other hand, most existing IAD methods only provide anomaly scores and necessitate the manual setting of thresholds to distinguish between normal and abnormal samples, which restricts their practical implementation. In this paper, we explore the utilization of LVLMs to address the IAD problem and propose AnomalyGPT, a novel IAD approach based on LVLMs. We generate training data by simulating anomalous images and producing corresponding textual descriptions for each image. We also employ an image decoder to provide fine-grained semantics and design a prompt learner to fine-tune the LVLM using prompt embeddings. Our AnomalyGPT eliminates the need for manual threshold adjustments and thus directly assesses the presence and locations of anomalies. Additionally, AnomalyGPT supports multi-turn dialogues and exhibits impressive few-shot in-context learning capabilities. With only one normal shot, AnomalyGPT achieves state-of-the-art performance with an accuracy of 86.1%, an image-level AUC of 94.1%, and a pixel-level AUC of 95.3% on the MVTec-AD dataset. Code is available at https://github.com/CASIA-IVA-Lab/AnomalyGPT.

Figure 1. Comparison between our AnomalyGPT, existing IAD methods and existing LVLMs. Existing IAD methods can only provide anomaly scores and need manual threshold setting, while existing LVLMs cannot detect anomalies in the image. AnomalyGPT can not only provide information about the image but also indicate the presence and location of anomalies.

1. Introduction

Large Language Models (LLMs) like GPT-3.5 [19] and LLaMA [26] have demonstrated remarkable performance on a range of Natural Language Processing (NLP) tasks. More recently, novel methods including MiniGPT-4 [36], BLIP-2 [15], and PandaGPT [25] have further extended the ability of LLMs into visual processing by aligning visual features with text features, bringing a significant revolution in the domain of Artificial General Intelligence (AGI). While LVLMs are pre-trained on large amounts of data sourced from the Internet, their domain-specific knowledge is relatively limited and they lack sensitivity to local details within objects, which restricts their potential in the IAD task.

The IAD task aims to detect and localize anomalies in
Methods | Few-shot learning | Anomaly score | Anomaly localization | Anomaly judgement | Multi-turn dialogue
Traditional IAD methods | | ✓ | ✓ | |
Few-shot IAD methods | ✓ | ✓ | ✓ | |
LVLMs | ✓ | | | | ✓
AnomalyGPT (ours) | ✓ | ✓ | ✓ | ✓ | ✓

Table 1. Comparison between our AnomalyGPT and existing methods across various functionalities. The "Traditional IAD methods" in the table refers to "one-class-one-model" methods such as PatchCore [23], InTra [21], and PyramidFlow [13]. "Few-shot IAD methods" refers to methods that can perform few-shot learning like RegAD [10], Graphcore [29], and WinCLIP [11]. "LVLMs" represents general large vision-language models like MiniGPT-4 [36], LLaVA [17], and PandaGPT [25]. "Anomaly score" in the table represents just providing scores for anomaly detection, while "Anomaly judgement" indicates directly assessing the presence of anomalies.
industrial product images. Due to the rarity and unpredictability of real-world samples, models are required to be trained only on normal samples and to distinguish anomalous samples that deviate from normal samples. Current IAD methods [10, 11, 32] typically only provide anomaly scores for test samples and require manual specification of thresholds to distinguish between normal and anomalous instances for each class of items, which is not suitable for real production environments.

As illustrated in Figure 1 and Table 1, neither existing IAD methods nor LVLMs can address the IAD problem well, so we introduce AnomalyGPT, a novel IAD approach based on LVLM. AnomalyGPT can detect the presence and location of anomalies without the need for manual threshold settings. Moreover, our method can provide information about the image and allows for interactive engagement, enabling users to ask follow-up questions based on their needs and the provided answers. AnomalyGPT can also perform in-context learning with a small number of normal samples, enabling swift adaptation to previously unseen objects.

Specifically, we focus on fine-tuning the LVLM using synthesized anomalous visual-textual data, integrating IAD knowledge into the model. However, direct training with IAD data presents numerous challenges. The first is data scarcity. Methods like LLaVA [17] and PandaGPT [25] are pre-trained on 160k images with corresponding multi-turn dialogues, whereas existing IAD datasets [1, 37] contain only a few thousand samples, rendering direct fine-tuning prone to overfitting and catastrophic forgetting. To address this, we use prompt embeddings to fine-tune the LVLM instead of parameter fine-tuning. Additional prompt embeddings are added after image inputs, introducing supplementary IAD knowledge into the LVLM. The second challenge relates to fine-grained semantics. We propose a lightweight, visual-textual feature-matching-based decoder to generate pixel-level anomaly localization results. The decoder's outputs are introduced to the LVLM along with the original test images through prompt embeddings, which allows the LVLM to utilize both the raw image and the decoder's outputs to make anomaly determinations, improving the accuracy of its judgments.

Experimentally, we conduct extensive experiments on the MVTec-AD [1] and VisA [37] datasets. With unsupervised training on the MVTec-AD dataset, we achieve an accuracy of 93.3%, an image-level AUC of 97.4%, and a pixel-level AUC of 93.1%. When one-shot transferred to the VisA dataset, we achieve an accuracy of 77.4%, an image-level AUC of 87.4%, and a pixel-level AUC of 96.2%. Conversely, after unsupervised training on the VisA dataset, one-shot transfer to the MVTec-AD dataset results in an accuracy of 86.1%, an image-level AUC of 94.1%, and a pixel-level AUC of 95.3%.

Our contributions are summarized as follows:
• We present the pioneering utilization of LVLMs for addressing the IAD task. Our method not only detects and localizes anomalies without manual threshold adjustments but also supports multi-round dialogues. To the best of our knowledge, we are the first to successfully apply an LVLM to the domain of industrial anomaly detection.
• The lightweight, visual-textual feature-matching-based decoder in our work addresses the limitation of the LLM's weaker discernment of fine-grained semantics and alleviates the constraint of the LLM's restricted ability to generate only text outputs.
• We employ prompt embeddings for fine-tuning and train our model concurrently with the data utilized during LVLM pre-training, thus preserving the LVLM's inherent capabilities and enabling multi-turn dialogues.
• Our method retains robust transferability and is capable of engaging in in-context few-shot learning on new datasets, yielding outstanding performance.

2. Related Work

Industrial Anomaly Detection: Existing IAD methods can be categorized into reconstruction-based and feature embedding-based approaches. Reconstruction-based methods primarily aim to reconstruct anomalous samples to their corresponding normal counterparts and detect anomalies by calculating the reconstruction error. RIAD [33], SCADN [30], InTra [21] and AnoDDPM [28] employ different reconstruction network architectures, ranging from
autoencoder and Generative Adversarial Network (GAN) to Transformer and diffusion model.

Feature embedding-based methods focus on modeling the feature embeddings of normal samples. Approaches such as PatchSVDD [31] aim to find a hypersphere that tightly encapsulates normal samples. Cflow-AD [9] and PyramidFlow [13] use normalizing flows to project normal samples onto a Gaussian distribution. PatchCore [23] and CFA [12] establish a memory bank of patch embeddings from normal samples and detect anomalies by measuring the distance between a test sample embedding and its nearest normal embedding in the memory bank.

These methods typically follow the "one-class-one-model" learning paradigm, requiring plentiful normal samples for each object class to learn its distribution, making them impractical for novel object categories and less suitable for dynamic production environments. In contrast, our method facilitates in-context learning for novel object categories, enabling inference with only a few normal samples.

Zero-/Few-shot Industrial Anomaly Detection: Recent efforts have focused on methods utilizing minimal normal samples to accomplish the IAD task. PatchCore [23] constructs a memory bank using only a few normal samples, resulting in a noticeable performance decline. RegAD [10] trains an image registration network to align test images with normal samples, followed by similarity computation for corresponding patches. WinCLIP [11] leverages CLIP [22] to compute similarity between images and textual descriptions representing normal and anomalous semantics, distinguishing anomalies based on their relative scores. However, these methods can only provide anomaly scores for test samples during inference. To distinguish normal samples from anomalous ones, it is necessary to experimentally determine the optimal threshold on a test set, which contradicts the original intent of the IAD task of relying only on normal data. For instance, while PatchCore [23] achieves an image-level AUC of 99.3% on MVTec-AD in the unsupervised setting, its accuracy drops to 79.76% when using a unified threshold for inference. The detailed experimental results and analyses can be found in Appendix A. Our method, in contrast, enables the LVLM to directly assess test samples for the presence of anomalies and pinpoint their locations, demonstrating enhanced practicality.

Large Vision-Language Models: LLMs, traditionally successful in NLP, are now explored for visual tasks. BLIP-2 [15] leverages Q-Former to input visual features from Vision Transformer [7] into the Flan-T5 [4] model. MiniGPT-4 [36] connects the image segment of BLIP-2 and the Vicuna [3] model with a linear layer, performing a two-stage fine-tuning process using extensive image-text data. PandaGPT [25] establishes a connection between ImageBind [8] and the Vicuna [3] model via a linear layer, allowing for multi-modal input. These approaches showcase the potential of LLM-based polymathic models.

However, as mentioned earlier, these models are trained on general data and lack domain-specific expertise. In this paper, through the utilization of simulated anomaly data, an image decoder and prompt embeddings, AnomalyGPT is introduced as a novel approach that addresses the IAD task without the need for manually specified thresholds, while also enabling few-shot in-context learning. Table 1 illustrates a comparison between AnomalyGPT and existing methods across various functionalities.

3. Method

AnomalyGPT is a novel conversational IAD vision-language model, primarily designed for detecting anomalies in images of industrial artifacts and pinpointing their positions. We leverage a pre-trained image encoder and an LLM to align IAD images and their corresponding textual descriptions via simulated anomaly data. We introduce a decoder module and a prompt learner module to enhance IAD performance and achieve pixel-level localization output. Employing prompt tuning and alternate training with pre-training data preserves the LLM's transferability and prevents catastrophic forgetting. Our method exhibits robust few-shot transfer capability, enabling anomaly detection and localization for previously unseen items with merely one normal sample provided.

3.1. Model Architecture

Figure 2 illustrates the comprehensive architecture of AnomalyGPT. Given a query image $x \in \mathbb{R}^{H \times W \times C}$, the final features $F_{img} \in \mathbb{R}^{C_1}$ extracted by the image encoder are passed through a linear layer to obtain the image embedding $E_{img} \in \mathbb{R}^{C_{emb}}$, which is then fed into the LLM. In the unsupervised setting, the patch-level features extracted by intermediate layers of the image encoder are fed into the decoder together with text features to generate pixel-level anomaly localization results. In the few-shot setting, the patch-level features from normal samples are stored in memory banks and the localization result can be obtained by calculating the distance between query patches and their most similar counterparts in the memory bank. The localization result is subsequently transformed into prompt embeddings through the prompt learner, serving as a part of the LLM input. The LLM leverages the image input, prompt embeddings, and user-provided textual input to detect anomalies and identify their locations, thus generating responses for the user.

3.2. Decoder and Prompt Learner

Decoder To achieve pixel-level anomaly localization, we employ a lightweight feature-matching-based image decoder that supports both unsupervised IAD and few-shot IAD. The design of the decoder is primarily inspired by PatchCore [23], WinCLIP [11], and APRIL-GAN [2].
Figure 2. The architecture of AnomalyGPT. The query image is passed to the frozen image encoder and the patch-level features extracted from intermediate layers are fed into the image decoder to compute their similarity with normal and abnormal texts and obtain the localization result. The final features extracted by the image encoder are fed to a linear layer and then passed to the prompt learner along with the localization result. The prompt learner converts them into prompt embeddings suitable for input into the LLM together with user text inputs. In the few-shot setting, the patch-level features from normal samples are stored in memory banks and the localization result can be obtained by calculating the distance between query patches and their most similar counterparts in the memory bank.
As illustrated in the upper part of Figure 2, we partition the image encoder into 4 stages and obtain the intermediate patch-level features extracted by every stage, $F^i_{patch} \in \mathbb{R}^{H_i \times W_i \times C_i}$, where $i$ indicates the $i$-th stage. Following the idea from WinCLIP [11], a natural approach is to compute the similarity between $F^i_{patch}$ and the text features $F_{text} \in \mathbb{R}^{2 \times C_{text}}$ respectively representing normality and abnormality. Detailed texts representing normal and abnormal cases are presented in Appendix B. However, since these intermediate features have not undergone the final image-text alignment, they cannot be directly compared with text features. To address this, we introduce additional linear layers to project these intermediate features to $\tilde{F}^i_{patch} \in \mathbb{R}^{H_i \times W_i \times C_{text}}$ and align them with the text features representing normal and abnormal semantics. The localization result $M \in \mathbb{R}^{H \times W}$ can be obtained by Eq. (1):

$M = \mathrm{Upsample}\left(\sum_{i=1}^{4} \mathrm{softmax}\left(\tilde{F}^i_{patch} F_{text}^{T}\right)\right).$   (1)

For few-shot IAD, as illustrated in the lower part of Figure 2, we utilize the same image encoder to extract intermediate patch-level features from normal samples and store them in memory banks $B^i \in \mathbb{R}^{N \times C_i}$, where $i$ indicates the $i$-th stage. For patch-level features $F^i_{patch} \in \mathbb{R}^{H_i \times W_i \times C_i}$, we calculate the distance between each patch and its most similar counterpart in the memory bank, and the localization result $M \in \mathbb{R}^{H \times W}$ can be obtained by Eq. (2):

$M = \mathrm{Upsample}\left(\sum_{i=1}^{4} \left(1 - \max\left(F^i_{patch} \cdot B^{iT}\right)\right)\right).$   (2)

Prompt Learner To leverage fine-grained semantics from images and maintain semantic consistency between the LLM and decoder outputs, we introduce a prompt learner that transforms the localization result into prompt embeddings. Additionally, learnable base prompt embeddings, unrelated to decoder outputs, are incorporated into the prompt learner to provide extra information for the IAD task. Finally, these embeddings, along with the original image information, are fed into the LLM.

As illustrated in Figure 2, the prompt learner consists of the learnable base prompt embeddings $E_{base} \in \mathbb{R}^{n_1 \times C_{emb}}$ and a convolutional neural network. The network converts the localization result $M \in \mathbb{R}^{H \times W}$ into $n_2$ prompt embeddings $E_{dec} \in \mathbb{R}^{n_2 \times C_{emb}}$. $E_{base}$ and $E_{dec}$ form a set of $n_1 + n_2$ prompt embeddings $E_{prompt} \in \mathbb{R}^{(n_1+n_2) \times C_{emb}}$ that are combined with the image embedding and fed into the LLM.
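To make Eqs. (1) and (2) concrete, the snippet below sketches both localization paths in PyTorch. It is a minimal illustration under assumed choices, not the released implementation: the output resolution (224), square patch grids, unit-normalized features, bilinear upsampling, and the use of the "abnormal" softmax channel as the anomaly probability are all assumptions made here for clarity.

```python
import torch
import torch.nn.functional as F

def unsupervised_localization(patch_feats, text_feats, out_size=224):
    """Eq. (1): compare projected patch features with normal/abnormal text features.

    patch_feats: list of 4 tensors, each (H_i*W_i, C_text), already projected by the
                 per-stage linear layers and L2-normalized.
    text_feats:  (2, C_text) tensor for [normal, abnormal] text embeddings.
    Returns an (out_size, out_size) anomaly map summed over the 4 stages.
    """
    maps = []
    for f in patch_feats:
        hw = int(f.shape[0] ** 0.5)                      # assume a square patch grid
        sim = torch.softmax(f @ text_feats.t(), dim=-1)  # (H_i*W_i, 2)
        anomaly = sim[:, 1].reshape(1, 1, hw, hw)        # take the "abnormal" channel
        maps.append(F.interpolate(anomaly, size=out_size, mode="bilinear",
                                  align_corners=False))
    return torch.stack(maps).sum(0).squeeze()

def few_shot_localization(patch_feats, memory_banks, out_size=224):
    """Eq. (2): distance to the most similar normal patch stored in the memory bank.

    patch_feats:  list of 4 tensors, each (H_i*W_i, C_i), L2-normalized.
    memory_banks: list of 4 tensors, each (N, C_i), normal patch features per stage.
    """
    maps = []
    for f, bank in zip(patch_feats, memory_banks):
        hw = int(f.shape[0] ** 0.5)
        # Similarity to every stored normal patch; keep the closest match per patch.
        sim = (f @ bank.t()).max(dim=-1).values          # (H_i*W_i,)
        dist = (1.0 - sim).reshape(1, 1, hw, hw)         # large distance = anomalous
        maps.append(F.interpolate(dist, size=out_size, mode="bilinear",
                                  align_corners=False))
    return torch.stack(maps).sum(0).squeeze()

# Toy usage with random features (shapes are illustrative only).
feats = [F.normalize(torch.randn(196, 512), dim=-1) for _ in range(4)]
text = F.normalize(torch.randn(2, 512), dim=-1)
banks = [F.normalize(torch.randn(1000, 512), dim=-1) for _ in range(4)]
print(unsupervised_localization(feats, text).shape)   # torch.Size([224, 224])
print(few_shot_localization(feats, banks).shape)      # torch.Size([224, 224])
```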
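The prompt learner described above can likewise be sketched as a small module that couples learnable base embeddings with a convolutional network that condenses the localization map into $n_2$ embeddings. The sizes chosen below (n_base = 4, a 3 × 3 grid so n_dec = 9, an embedding width of 4096) and the CNN layout are assumptions for illustration only, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PromptLearnerSketch(nn.Module):
    """Illustrative prompt learner: map a localization map M (H x W) to n_dec prompt
    embeddings (E_dec) and prepend n_base learnable base embeddings (E_base)."""

    def __init__(self, n_base=4, n_dec=9, c_emb=4096):
        super().__init__()
        grid = int(n_dec ** 0.5)                                   # e.g. 3x3 grid
        self.base = nn.Parameter(torch.randn(n_base, c_emb) * 0.02)  # E_base
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=4, stride=4), nn.ReLU(),    # 224 -> 56
            nn.Conv2d(64, 256, kernel_size=4, stride=4), nn.ReLU(),  # 56 -> 14
            nn.AdaptiveAvgPool2d(grid),                              # 14 -> grid x grid
            nn.Conv2d(256, c_emb, kernel_size=1),                    # lift to c_emb
        )

    def forward(self, loc_map):
        # loc_map: (B, H, W) anomaly map produced by the decoder.
        x = loc_map.unsqueeze(1)                          # (B, 1, H, W)
        e_dec = self.net(x).flatten(2).transpose(1, 2)    # (B, n_dec, c_emb)
        e_base = self.base.unsqueeze(0).expand(x.size(0), -1, -1)
        return torch.cat([e_base, e_dec], dim=1)          # (B, n_base + n_dec, c_emb)

learner = PromptLearnerSketch()
prompts = learner(torch.rand(2, 224, 224))
print(prompts.shape)  # torch.Size([2, 13, 4096])
```

The resulting $(n_1 + n_2)$ embeddings are what the text denotes as $E_{prompt}$, concatenated with the image embedding before being passed to the LLM.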
3.3. Data for Image-Text Alignment

Anomaly Simulation We primarily adopt the approach proposed by NSA [24] to simulate anomalous data. The NSA method builds upon the Cut-paste [14] technique by incorporating the Poisson image editing [20] method to alleviate the discontinuity introduced by pasting image segments. Cut-paste [14] is a common technique in the IAD domain for generating simulated anomaly images. This method involves randomly cropping a block region from an image and then pasting it onto a random location in another image, thus creating a simulated anomalous portion. Simulated anomaly samples can significantly enhance the performance of IAD models, but this procedure often results in noticeable discontinuities, as illustrated in Figure 3. The Poisson editing method [20] has been developed to seamlessly clone an object from one image into another image by solving the Poisson partial differential equations.

Figure 3. Illustration of the comparison between cut-paste and Poisson image editing. The results of cut-paste exhibit evident discontinuities, while the results of Poisson image editing are more natural.

Question and Answer Content To conduct prompt tuning on the LVLM, we generate corresponding textual queries based on the simulated anomalous images. Specifically, each query consists of two components. The first part involves a description of the input image, providing information about the objects present in the image and their expected attributes, such as "This is a photo of leather, which should be brown and without any damage, flaw, defect, scratch, hole or broken part." The second part queries the presence of anomalies within the object, namely "Is there any anomaly in the image?" The LVLM first responds to whether anomalies are present. If anomalies are detected, the model continues to specify the number and location of the anomalous areas, such as "Yes, there is an anomaly in the image, at the bottom left of the image." or "No, there are no anomalies in the image." We divide the image into a grid of 3 × 3 distinct regions to facilitate the LVLM in verbally indicating the positions of anomalies, as shown in Figure 4. The descriptive content about the image furnishes the LVLM with foundational knowledge of the input image, aiding the model's better comprehension of the image contents. However, during practical applications, users may opt to omit this descriptive input, and the model is still capable of performing the IAD task based solely on the provided image input. Detailed descriptions for each category are provided in Appendix C.

Figure 4. Illustration of the 3 × 3 grid of the image, which is used to let the LLM verbally indicate the abnormal position.

Prompts fed to the LLM typically follow the format:

### Human: <Img>$E_{img}$</Img> $E_{prompt}$ [Image Description] Is there any anomaly in the image? ### Assistant:

$E_{img} \in \mathbb{R}^{C_{emb}}$ represents the image embedding processed through the image encoder and linear layer, $E_{prompt} \in \mathbb{R}^{(n_1+n_2) \times C_{emb}}$ refers to the prompt embeddings generated by the prompt learner, and [Image Description] corresponds to the textual description of the image.

3.4. Loss Functions

To train the decoder and prompt learner, we primarily employed three loss functions: cross-entropy loss, focal loss [16], and dice loss [18]. The latter two are primarily utilized to enhance the pixel-level localization accuracy of the decoder.

Cross-Entropy Loss Cross-entropy loss is commonly employed for training language models; it quantifies the disparity between the text sequence generated by the model and the target text sequence. The formula is as follows:

$L_{ce} = -\sum_{i=1}^{n} y_i \log(p_i),$   (3)

where $n$ is the number of tokens, $y_i$ is the true label for token $i$ and $p_i$ is the predicted probability for token $i$.

Focal Loss Focal loss [16] is commonly used in object detection and semantic segmentation to address the issue of class imbalance; it introduces an adjustable parameter $\gamma$ to modify the weight distribution of cross-entropy loss, emphasizing samples that are difficult to classify. In the IAD task, where most regions in anomaly images are still normal, employing focal loss can mitigate the problem of class imbalance. Focal loss can be calculated by Eq. (4):

$L_{focal} = -\frac{1}{n}\sum_{i=1}^{n} (1 - p_i)^{\gamma} \log(p_i),$   (4)
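A per-pixel version of the focal loss in Eq. (4), together with a standard soft dice loss in the spirit of the dice loss [18] mentioned above, can be sketched as follows. This is a generic illustration of these common losses, not the paper's training code: the interpretation of $p_i$ as the probability assigned to the correct class of pixel $i$, the value γ = 2, and the dice smoothing term are assumptions.

```python
import torch

def focal_loss(pred, target, gamma=2.0, eps=1e-6):
    """Binary pixel-wise focal loss following Eq. (4): -(1/n) * sum((1-p_i)^gamma * log(p_i)).

    pred:   (B, H, W) predicted anomaly probabilities in [0, 1].
    target: (B, H, W) ground-truth mask, 1 for anomalous pixels and 0 for normal ones.
    p_i is taken as the probability assigned to the correct class of pixel i.
    """
    p_correct = torch.where(target > 0.5, pred, 1.0 - pred).clamp(min=eps)
    return -((1.0 - p_correct) ** gamma * p_correct.log()).mean()

def dice_loss(pred, target, eps=1.0):
    """Standard soft dice loss (the exact variant used in the paper is not specified here)."""
    pred, target = pred.flatten(1), target.flatten(1)
    inter = (pred * target).sum(dim=1)
    union = pred.sum(dim=1) + target.sum(dim=1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

# Toy usage on a random localization map and a mostly-normal mask (class imbalance).
pred = torch.rand(2, 224, 224)
mask = (torch.rand(2, 224, 224) > 0.95).float()
print(focal_loss(pred, mask).item(), dice_loss(pred, mask).item())
```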
Setup | Method | MVTec-AD Image-AUC | MVTec-AD Pixel-AUC | MVTec-AD Accuracy | VisA Image-AUC | VisA Pixel-AUC | VisA Accuracy
1-shot | SPADE | 81.0 ± 2.0 | 91.2 ± 0.4 | - | 79.5 ± 4.0 | 95.6 ± 0.4 | -
1-shot | PaDiM | 76.6 ± 3.1 | 89.3 ± 0.9 | - | 62.8 ± 5.4 | 89.9 ± 0.8 | -
1-shot | PatchCore | 83.4 ± 3.0 | 92.0 ± 1.0 | - | 79.9 ± 2.9 | 95.4 ± 0.6 | -
1-shot | WinCLIP | 93.1 ± 2.0 | 95.2 ± 0.5 | - | 83.8 ± 4.0 | 96.4 ± 0.4 | -
1-shot | AnomalyGPT (ours) | 94.1 ± 1.1 | 95.3 ± 0.1 | 86.1 ± 1.1 | 87.4 ± 0.8 | 96.2 ± 0.1 | 77.4 ± 1.0
2-shot | SPADE | 82.9 ± 2.6 | 92.0 ± 0.3 | - | 80.7 ± 5.0 | 96.2 ± 0.4 | -
2-shot | PaDiM | 78.9 ± 3.1 | 91.3 ± 0.7 | - | 67.4 ± 5.1 | 92.0 ± 0.7 | -
2-shot | PatchCore | 86.3 ± 3.3 | 93.3 ± 0.6 | - | 81.6 ± 4.0 | 96.1 ± 0.5 | -
2-shot | WinCLIP | 94.4 ± 1.3 | 96.0 ± 0.3 | - | 84.6 ± 2.4 | 96.8 ± 0.3 | -
2-shot | AnomalyGPT (ours) | 95.5 ± 0.8 | 95.6 ± 0.2 | 84.8 ± 0.8 | 88.6 ± 0.7 | 96.4 ± 0.1 | 77.5 ± 0.3
4-shot | SPADE | 84.8 ± 2.5 | 92.7 ± 0.3 | - | 81.7 ± 3.4 | 96.6 ± 0.3 | -
4-shot | PaDiM | 80.4 ± 2.5 | 92.6 ± 0.7 | - | 72.8 ± 2.9 | 93.2 ± 0.5 | -
4-shot | PatchCore | 88.8 ± 2.6 | 94.3 ± 0.5 | - | 85.3 ± 2.1 | 96.8 ± 0.3 | -
4-shot | WinCLIP | 95.2 ± 1.3 | 96.2 ± 0.3 | - | 87.3 ± 1.8 | 97.2 ± 0.2 | -
4-shot | AnomalyGPT (ours) | 96.3 ± 0.3 | 96.2 ± 0.1 | 85.0 ± 0.3 | 90.6 ± 0.7 | 96.7 ± 0.1 | 77.7 ± 0.4

Table 2. Few-shot IAD results on the MVTec-AD and VisA datasets. Results are listed as the average of 5 runs and the best-performing method is in bold. The results for SPADE, PaDiM, PatchCore and WinCLIP are reported from [11].
Decoder | Prompt learner | LLM | LoRA | MVTec-AD (unsupervised) Image-AUC | Pixel-AUC | Accuracy | VisA (1-shot) Image-AUC | Pixel-AUC | Accuracy
 | | ✓ | | - | - | 72.2 | - | - | 56.5
 | | ✓ | ✓ | - | - | 73.4 | - | - | 56.6
 | ✓ | ✓ | | - | - | 79.8 | - | - | 63.4
✓ | | ✓ | | 97.1 | 90.9 | 72.2 | 85.8 | 96.2 | 56.5
✓ | | ✓ | ✓ | 97.1 | 90.9 | 84.2 | 85.8 | 96.2 | 64.7
✓ | ✓ | ✓ | ✓ | 96.0 | 88.1 | 83.9 | 85.8 | 96.5 | 72.7
✓ | | | | 97.1 | 90.9 | 90.3 | 85.8 | 96.2 | 75.4
✓ | ✓ | ✓ | | 97.4 | 93.1 | 93.3 | 87.4 | 96.2 | 77.4

Table 4. Results of ablation studies. A ✓ in the "Decoder" and "Prompt learner" columns indicates module inclusion. A ✓ in the "LLM" column denotes whether the LLM is used for inference, and a ✓ in the "LoRA" column denotes whether LoRA is used to fine-tune the LLM. In settings without the LLM, the maximum anomaly score from normal samples is used as the classification threshold. In settings without the decoder, the LLM produces only textual output, so image-level and pixel-level AUC cannot be computed.
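For the ablation rows without the LLM, the caption states that the maximum anomaly score observed on normal samples is used as the classification threshold. A minimal sketch of that rule is shown below; the function and variable names are illustrative.

```python
import torch

def judge_without_llm(normal_scores, test_scores):
    """Classify test images as anomalous if their image-level anomaly score exceeds
    the maximum score observed on normal samples (the no-LLM ablation setting)."""
    threshold = normal_scores.max()
    return test_scores > threshold  # True = predicted anomalous

normal_scores = torch.tensor([0.12, 0.18, 0.15])  # scores on held-out normal images
test_scores = torch.tensor([0.14, 0.35, 0.19])
print(judge_without_llm(normal_scores, test_scores))  # tensor([False,  True,  True])
```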
image content. In the 1-shot in-context learning setting, due to the absence of training, the model's localization performance is slightly lower than in the unsupervised setting. More qualitative examples can be found in Appendix D.

4.3. Ablation Studies

To prove the efficacy of each proposed module, extensive ablation experiments are conducted on both the MVTec-AD and VisA datasets. We primarily focus on four aspects: the decoder, the prompt learner, the usage of the LLM for inference, and the utilization of LoRA to fine-tune the LLM. The principal results are presented in Table 4. Unsupervised training and testing are carried out on the MVTec-AD dataset, while the one-shot performance is evaluated on the VisA dataset. It can be observed that the decoder demonstrates impressive pixel-level anomaly localization performance. Compared to manually-set thresholds, the LLM exhibits superior inference accuracy and provides additional functionality. Furthermore, prompt tuning outperforms LoRA in terms of accuracy and transferability.

5. Conclusion

We introduce AnomalyGPT, a novel conversational IAD vision-language model, leveraging the powerful capabilities of LVLMs. AnomalyGPT can determine whether an image contains anomalies and pinpoint their locations without the need for manually specified thresholds. Furthermore, AnomalyGPT enables multi-turn dialogues focused on anomaly detection and demonstrates remarkable performance in few-shot in-context learning. The effectiveness of AnomalyGPT is validated on two common datasets. Our work delves into the potential application of large vision-language models in anomaly detection, offering fresh ideas and possibilities for the field of industrial anomaly detection.

References
[1] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. MVTec AD: A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9592–9600, 2019.
[2] Xuhai Chen, Yue Han, and Jiangning Zhang. A zero-/few-shot anomaly classification and segmentation method for CVPR 2023 VAND workshop challenge tracks 1&2: 1st place on zero-shot AD and 4th place on few-shot AD. arXiv preprint arXiv:2305.17382, 2023.
[3] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
[4] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
[5] Niv Cohen and Yedid Hoshen. Sub-image anomaly detection with deep pyramid correspondences. arXiv preprint arXiv:2005.02357, 2020.
[6] Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Romaric Audigier. PaDiM: a patch distribution modeling framework for anomaly detection and localization. In International Conference on Pattern Recognition, pages 475–489. Springer, 2021.
[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[8] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. ImageBind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023.
[9] Denis Gudovskiy, Shun Ishizaka, and Kazuki Kozuka. CFLOW-AD: Real-time unsupervised anomaly detection with localization via conditional normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 98–107, 2022.
[10] Chaoqin Huang, Haoyan Guan, Aofan Jiang, Ya Zhang, Michael Spratling, and Yan-Feng Wang. Registration based few-shot anomaly detection. In European Conference on Computer Vision, pages 303–319. Springer, 2022.
[11] Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. WinCLIP: Zero-/few-shot anomaly classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19606–19616, 2023.
[12] Sungwook Lee, Seunghyun Lee, and Byung Cheol Song. CFA: Coupled-hypersphere-based feature adaptation for target-oriented anomaly localization. IEEE Access, 10:78446–78454, 2022.
[13] Jiarui Lei, Xiaobo Hu, Yue Wang, and Dong Liu. PyramidFlow: High-resolution defect contrastive localization using pyramid normalizing flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14143–14152, 2023.
[14] Chun-Liang Li, Kihyuk Sohn, Jinsung Yoon, and Tomas Pfister. CutPaste: Self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9664–9674, 2021.
[15] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
[16] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[17] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
[18] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571. IEEE, 2016.
[19] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[20] Patrick Pérez, Michel Gangnet, and Andrew Blake. Poisson image editing. In ACM SIGGRAPH 2003 Papers, pages 313–318. 2003.
[21] Jonathan Pirnay and Keng Chai. Inpainting transformer for anomaly detection. In International Conference on Image Analysis and Processing, pages 394–406. Springer, 2022.
[22] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[23] Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14318–14328, 2022.
[24] Hannah M Schlüter, Jeremy Tan, Benjamin Hou, and Bernhard Kainz. Natural synthetic anomalies for self-supervised anomaly detection and localization. In European Conference on Computer Vision, pages 474–489. Springer, 2022.
[25] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. PandaGPT: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023.
[26] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[27] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023.
[28] Julian Wyatt, Adam Leach, Sebastian M Schmon, and Chris G Willcocks. AnoDDPM: Anomaly detection with denoising diffusion probabilistic models using simplex noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 650–656, 2022.
[29] Guoyang Xie, Jingbao Wang, Jiaqi Liu, Feng Zheng, and Yaochu Jin. Pushing the limits of few-shot anomaly detection in industry vision: GraphCore. arXiv preprint arXiv:2301.12082, 2023.
[30] Xudong Yan, Huaidong Zhang, Xuemiao Xu, Xiaowei Hu, and Pheng-Ann Heng. Learning semantic context from normal samples for unsupervised anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3110–3118, 2021.
[31] Jihun Yi and Sungroh Yoon. Patch SVDD: Patch-level SVDD for anomaly detection and segmentation. In Proceedings of the Asian Conference on Computer Vision, 2020.
[32] Zhiyuan You, Lei Cui, Yujun Shen, Kai Yang, Xin Lu, Yu Zheng, and Xinyi Le. A unified model for multi-class anomaly detection. Advances in Neural Information Processing Systems, 35:4571–4584, 2022.
[33] Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj. Reconstruction by inpainting for visual anomaly detection. Pattern Recognition, 112:107706, 2021.
[34] Ying Zhao. Just noticeable learning for unsupervised anomaly localization and detection. In 2022 IEEE International Conference on Multimedia and Expo (ICME), pages 01–06. IEEE, 2022.
[35] Ying Zhao. OmniAL: A unified CNN framework for unsupervised anomaly localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3924–3933, 2023.
[36] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
[37] Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer. Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In European Conference on Computer Vision, pages 392–408. Springer, 2022.
Supplementary Material
AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models
[Figure 7: per-category plots of Accuracy (%) versus Threshold on MVTec-AD; the visible panel titles include Hazelnut, Leather, Metal_nut, Pill, Screw, Tile, Toothbrush, Transistor, Wood and Zipper.]

Figure 7. Experimental results of PatchCore [23] on the MVTec-AD [1] dataset across each category under different thresholds. The optimal threshold varies considerably for each category of objects.
[Figure 8: per-category plots of Accuracy (%) versus Threshold on MVTec-AD, with panels for Bottle, Cable, Capsule, Carpet, Grid, Hazelnut, Leather, Metal_nut, Pill, Screw, Tile, Toothbrush, Transistor, Wood and Zipper.]

Figure 8. Experimental results of WinCLIP [11] on the MVTec-AD [1] dataset across each category under different thresholds. The optimal threshold varies considerably for each category of objects.
Table 5. Lists of multi-level texts considered in this paper to present normal and abnormal semantics.
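Table 5 itself, i.e. the actual multi-level text lists used in the paper, is not reproduced here. Purely as an illustration of what such normal/abnormal text ensembles typically look like in CLIP-style anomaly detection (cf. WinCLIP [11]), a hypothetical sketch is shown below; these strings are examples invented for illustration, not the paper's prompts.

```python
# Hypothetical state- and template-level texts; the actual lists used in the paper
# are given in Table 5 and are not reproduced here.
state_level = {
    "normal":   ["flawless {}", "perfect {}", "{} without defect"],
    "abnormal": ["damaged {}", "{} with a flaw", "{} with a defect"],
}
template_level = ["a photo of a {}.", "a cropped photo of the {}.", "a close-up photo of a {}."]

def build_prompts(class_name):
    """Expand state- and template-level texts into full sentences for one object class."""
    prompts = {"normal": [], "abnormal": []}
    for label, states in state_level.items():
        for state in states:
            for template in template_level:
                prompts[label].append(template.format(state.format(class_name)))
    return prompts

print(build_prompts("bottle")["abnormal"][:3])
```

The two text features $F_{text}$ used in Eq. (1) would then typically be obtained by encoding each group of sentences with the text encoder and averaging the resulting embeddings.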
C. Detailed Image Description
As mentioned in the paper, prompts fed to the LLM typically follow the format:

### Human: <Img> $E_{img}$ </Img> $E_{prompt}$ [Image Description] Is there any anomaly in the image? ### Assistant:

The [Image Description] part involves a description of the input image, providing information about the objects present in the image and their expected attributes. Such a description furnishes the LVLM with foundational knowledge of the input image, aiding the model's better comprehension of the image contents. The detailed description of every category in the MVTec-AD [1] and VisA [37] datasets can be found in Table 6 and Table 7. Note that users can omit this descriptive input, and the model is still capable of performing the IAD task based solely on the provided image input.
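To make the prompt format above concrete, the following sketch shows one way the textual parts could be assembled around placeholder positions for $E_{img}$ and $E_{prompt}$, which are embeddings spliced into the LLM input rather than text. The tags and function below are illustrative, not the released implementation.

```python
def build_llm_prompt(image_description=None, question="Is there any anomaly in the image?"):
    """Assemble the textual prompt; <Img>...</Img> and the [E_prompt] slot mark positions
    where E_img and E_prompt are inserted as embeddings, not as literal text."""
    description = f"{image_description} " if image_description else ""
    return (
        "### Human: <Img>[E_img]</Img>[E_prompt] "
        f"{description}{question}"
        "###Assistant:"
    )

# With the optional image description ...
print(build_llm_prompt("This is a photo of leather, which should be brown and without "
                       "any damage, flaw, defect, scratch, hole or broken part."))
# ... and without it (the model still performs the IAD task from the image alone).
print(build_llm_prompt())
```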
Table 6. Detailed image description for every category in MVTec-AD dataset. The description will be added to the prompts of the
corresponding category during training to provide foundational knowledge of the input image.
Class | Image description
Candle | This is a photo of 4 candles for anomaly detection, every candle should be round, without any damage, flaw, defect, scratch, hole or broken part.
Capsules | This is a photo of many small capsules for anomaly detection, every capsule is green and should be without any damage, flaw, defect, scratch, hole or broken part.
Cashew | This is a photo of a cashew for anomaly detection, which should be without any damage, flaw, defect, scratch, hole or broken part.
Chewinggum | This is a photo of a chewinggum for anomaly detection, which should be white, without any damage, flaw, defect, scratch, hole or broken part.
Fryum | This is a photo of a fryum for anomaly detection on green background, which should be without any damage, flaw, defect, scratch, hole or broken part.
Macaroni1 | This is a photo of 4 macaronis for anomaly detection, which should be without any damage, flaw, defect, scratch, hole or broken part.
Macaroni2 | This is a photo of 4 macaronis for anomaly detection, which should be without any damage, flaw, defect, scratch, hole or broken part.
PCB1 | This is a photo of PCB for anomaly detection, which should be without any damage, flaw, defect, scratch, hole or broken part.
PCB2 | This is a photo of PCB for anomaly detection, which should be without any damage, flaw, defect, scratch, hole or broken part.
PCB3 | This is a photo of PCB for anomaly detection, which should be without any damage, flaw, defect, scratch, hole or broken part.
PCB4 | This is a photo of PCB for anomaly detection, which should be without any damage, flaw, defect, scratch, hole or broken part.
Pipe fryum | This is a photo of a pipe fryum for anomaly detection, which should be without any damage, flaw, defect, scratch, hole or broken part.
Table 7. Detailed image description for every category in VisA dataset. The description will be added to the prompts of the corresponding
category during training to provide foundational knowledge of the input image.
D. More Qualitative Examples
We compare our approach with several existing LVLMs, specifically selecting PandaGPT [25], MiniGPT-4 [36], and
LLaVA [17] for comparative analysis. We conduct experiments across various categories of both normal and anomalous
samples. The results are presented in Figure 9, Figure 10, Figure 11, Figure 12, Figure 13 and Figure 14. It can be observed
that only our method exhibits proficiency in both accurately answering questions related to anomaly detection and those about
image content. In contrast, the other models demonstrate suboptimal performance in discerning the presence of anomalies
and pinpointing their precise locations. Notably, PandaGPT and LLaVA show a marked tendency to misclassify all samples
as anomalous. Conversely, MiniGPT-4 tends to err on the side of caution, predominantly labeling samples as normal.
Figure 9. Comparison between AnomalyGPT, PandaGPT, LLaVA and MiniGPT-4. The input image is a top-view photo of a normal bottle.
AnomalyGPT is capable of detecting anomaly, pinpointing its location, providing pixel-level localization results and answering questions
about the image.
Figure 10. Comparison between AnomalyGPT, PandaGPT, LLaVA and MiniGPT-4. The input image is a photo of cut wood. AnomalyGPT is capable of detecting anomaly, pinpointing its location, providing pixel-level localization results and answering questions about the image.
Figure 11. Comparison between AnomalyGPT, PandaGPT, LLaVA and MiniGPT-4. The input image is a photo of a normal pill. Anoma-
lyGPT is capable of detecting anomaly, pinpointing its location, providing pixel-level localization results and answering questions about
the image.
Figure 12. Comparison between AnomalyGPT, PandaGPT, LLaVA and MiniGPT-4. The input image is a photo of a piece of fabric with a hole. AnomalyGPT is capable of detecting anomaly, pinpointing its location, providing pixel-level localization results and answering questions about the image.
Figure 13. Comparison between AnomalyGPT, PandaGPT, LLaVA and MiniGPT-4. The input image is a photo of normal metal grid.
AnomalyGPT is capable of detecting anomaly, pinpointing its location, providing pixel-level localization results and answering questions
about the image.
Figure 14. Comparison between AnomalyGPT, PandaGPT, LLaVA and MiniGPT-4. The input image is a photo of a cable with defect.
AnomalyGPT is capable of detecting anomaly, pinpointing its location, providing pixel-level localization results and answering questions
about the image.