We consider the problem of zero-shot anomaly detection in which a model is pre-trained to detect anomalies
in images belonging to seen classes and is expected to detect anomalies in unseen classes at test time. State-of-the-art anomaly detection (AD) methods often achieve exceptional results when training images are abundant, but they fail catastrophically in zero-shot scenarios, where no real examples of the target class are available. However, with the emergence of multimodal models such as CLIP, it is possible to use knowledge from other modalities (e.g. text) to compensate for the lack of visual information and improve AD performance.
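To illustrate the idea, below is a minimal sketch (not the authors' code) of how CLIP-style text prompts can score anomalies zero-shot: an image embedding is compared against the embeddings of a hypothetical 'normal' prompt and a hypothetical 'abnormal' prompt, and the softmax over the two similarities gives an anomaly probability. The feature dimension, prompt wordings, and temperature are illustrative assumptions.

# Sketch of prompt-based zero-shot anomaly scoring with CLIP-style embeddings.
import torch
import torch.nn.functional as F

def anomaly_score(image_feat: torch.Tensor,
                  normal_feat: torch.Tensor,
                  abnormal_feat: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """image_feat: (B, D); normal_feat/abnormal_feat: (D,) pooled prompt embeddings."""
    image_feat = F.normalize(image_feat, dim=-1)
    text_feats = F.normalize(torch.stack([normal_feat, abnormal_feat]), dim=-1)  # (2, D)
    logits = image_feat @ text_feats.t() / temperature                           # (B, 2)
    return logits.softmax(dim=-1)[:, 1]  # probability mass on the "abnormal" prompt

# Toy usage with random features standing in for CLIP encoder outputs.
img = torch.randn(4, 512)
normal = torch.randn(512)    # e.g. embedding of "a photo of a flawless <class>"
abnormal = torch.randn(512)  # e.g. embedding of "a photo of a damaged <class>"
print(anomaly_score(img, normal, abnormal))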
In this work, we propose PromptAD, a dual-branch framework that uses prior knowledge about both normal and abnormal behaviours, expressed as text prompts, to detect anomalies even in unseen classes. More specifically, it uses CLIP as a backbone encoder network together with an additional dual-branch vision-language decoding network that captures both normality and abnormality information. Guided by natural language text prompts, the normality branch establishes a profile of normality, while the abnormality branch models anomalous behaviours.
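The abstract does not spell out the decoder, so the following is a hedged sketch of one plausible dual-branch design: each branch cross-attends from its own prompt embeddings to image patch tokens from a frozen encoder. All module names, dimensions, and the attention-based pooling are our assumptions, not the published architecture.

# Hypothetical dual-branch, prompt-guided decoder on top of frozen image features.
import torch
import torch.nn as nn

class PromptGuidedBranch(nn.Module):
    """One branch: cross-attention from prompt embeddings to image patch tokens."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, prompts: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # prompts: (B, P, D) text-prompt embeddings; patches: (B, N, D) patch tokens
        out, _ = self.attn(query=prompts, key=patches, value=patches)
        return self.proj(out).mean(dim=1)  # (B, D) pooled branch representation

class DualBranchDecoder(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.normal_branch = PromptGuidedBranch(dim)
        self.abnormal_branch = PromptGuidedBranch(dim)

    def forward(self, normal_prompts, abnormal_prompts, patches):
        z_n = self.normal_branch(normal_prompts, patches)    # normality profile
        z_a = self.abnormal_branch(abnormal_prompts, patches)  # abnormality model
        return z_n, z_a

# Toy usage with random tensors standing in for CLIP outputs.
dec = DualBranchDecoder()
z_n, z_a = dec(torch.randn(2, 4, 512), torch.randn(2, 4, 512), torch.randn(2, 49, 512))
print(z_n.shape, z_a.shape)  # torch.Size([2, 512]) torch.Size([2, 512])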
As the two branches capture complementary information, or 'views', we propose a 'cross-view contrastive learning' (CCL) component which regularizes each view with additional reference information from the other view. We further propose a cross-view mutual interaction (CMI) strategy that promotes the mutual exchange of useful knowledge between the two branches.
We show that PromptAD outperforms existing baselines in zero-shot anomaly detection on key benchmark datasets, and we analyse the role of each component in ablation studies.
License type: Publisher Copyright
Funding Info: This research/project is supported by the A*STAR - AME Programmatic Funds (Grant Reference no.: A20H6b0151).