The recently introduced odd-one-out anomaly detection task involves identifying the odd-looking instances within a multi-object scene. This problem presents several challenges for modern deep learning models, demanding spatial reasoning across multiple views and relational reasoning to understand context and generalize across varying object categories and layouts. We argue that these challenges must be addressed with efficiency in mind. To this end, we propose a DINO-based model that reduces the number of parameters by one third and shortens training time by a factor of six compared to the current state-of-the-art, while maintaining competitive performance. Our experimental evaluation also introduces a Multimodal Large Language Model baseline, providing insights into its current limitations in structured visual reasoning tasks.
The schematic above illustrates our approach. It takes as input M views of the scene, which are processed by the DINOv2 encoder. The resulting features are projected onto a voxel grid and associated with each of the N=3 objects. A subsequent pooling step yields per-object representations, which are further refined by the context and residual anomaly heads, encoding both object-to-object similarity and the relative deviation of each object from the scene-specific average normalcy.
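A minimal sketch of the feature-lifting and pooling steps is given below, assuming camera intrinsics/extrinsics and per-voxel object assignments are available. The helper names (`lift_features`, `pool_objects`), the tensor shapes, and the intrinsics convention are illustrative assumptions, not the exact implementation.

```python
# Sketch only: back-project multi-view encoder features onto a voxel grid,
# then average-pool voxels into per-object embeddings.
import torch
import torch.nn.functional as F


def lift_features(feats, K, Rt, grid_pts):
    """Back-project M per-view feature maps onto a shared voxel grid.

    feats:    (M, C, H, W) patch features from the DINOv2 encoder
              (intrinsics assumed rescaled to this feature-map resolution)
    K, Rt:    (M, 3, 3) intrinsics and (M, 3, 4) world-to-camera extrinsics
    grid_pts: (V, 3) voxel-centre coordinates in world space
    returns:  (V, C) voxel features averaged over the views that see each voxel
    """
    M, C, H, W = feats.shape
    V = grid_pts.shape[0]
    homog = torch.cat([grid_pts, torch.ones(V, 1)], dim=-1)          # (V, 4)
    voxel_feats = torch.zeros(V, C)
    counts = torch.zeros(V, 1)
    for m in range(M):
        cam = (Rt[m] @ homog.T).T                                    # (V, 3) camera coords
        uv = (K[m] @ cam.T).T
        uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)                  # pixel coordinates
        # normalise to [-1, 1] for grid_sample; keep voxels in front of the camera
        norm = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                            2 * uv[:, 1] / (H - 1) - 1], dim=-1)
        visible = (cam[:, 2] > 0) & (norm.abs() <= 1).all(dim=-1)
        sampled = F.grid_sample(feats[m:m + 1], norm.view(1, V, 1, 2),
                                align_corners=True)                  # (1, C, V, 1)
        sampled = sampled.squeeze(0).squeeze(-1).transpose(0, 1)     # (V, C)
        voxel_feats += sampled * visible.float().unsqueeze(-1)
        counts += visible.float().unsqueeze(-1)
    return voxel_feats / counts.clamp(min=1)


def pool_objects(voxel_feats, voxel_to_object, num_objects):
    """Average-pool voxel features into one embedding per object.

    voxel_to_object: (V,) index of the object owning each voxel (-1 = background)
    returns:         (num_objects, C) per-object representations
    """
    obj_feats = torch.zeros(num_objects, voxel_feats.shape[1])
    for o in range(num_objects):
        mask = voxel_to_object == o
        if mask.any():
            obj_feats[o] = voxel_feats[mask].mean(dim=0)
    return obj_feats
```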
| Model | Toys Seen AUC | Toys Seen Accuracy | Toys Unseen AUC | Toys Unseen Accuracy | Parts AUC | Parts Accuracy |
|---|---|---|---|---|---|---|
| ImVoxelNet | 78.13 | 65.55 | 73.19 | 60.12 | 72.80 | 64.34 |
| DETR3D | 79.16 | 67.37 | 74.60 | 62.98 | 74.49 | 65.11 |
| OOO | 91.78 | 83.21 | 89.15 | 81.57 | 86.12 | 79.68 |
| MLLM baseline | - | 52.23 | - | 53.35 | - | 60.73 |
| Ours | 89.82 | 85.39 | 84.43 | 80.64 | 89.72 | 88.81 |
The results of our experimental analysis are reported in the table above. Our method performs on par with OOO on Toys Seen and slightly worse on Toys Unseen, while showing a significant advantage over the leading competitor on Parts Unseen. The difference between these two datasets lies in the geometric nature of their objects: Toys contains free-form shapes with high inter-class variability, whereas Parts consists of mechanical components characterized by low semantic diversity and predominantly rigid, angular geometries. Besides being more challenging due to the higher object count per scene and the fine-grained shape differences, the limited categorical diversity of the latter dataset better reflects industrial quality inspection scenarios and aligns well with our approach, which is specifically designed to prioritize efficiency.
| Head (Ours) | Toys Seen AUC | Toys Seen Acc. | Toys Unseen AUC | Toys Unseen Acc. | Parts AUC | Parts Acc. | Memory (GB) | Inf. time (ms) |
|---|---|---|---|---|---|---|---|---|
| Sparse Voxel Attn. | 86.11 | 78.79 | 85.48 | 77.32 | 86.32 | 84.06 | 1.23 | 337 |
| Context | 89.18 | 85.42 | 85.09 | 81.27 | 89.14 | 88.65 | 0.36 | 271 |
| Context + Residual | 89.82 | 85.39 | 84.43 | 80.64 | 89.72 | 88.81 | 0.54 | 286 |
Our model excels at detecting fine-grained anomalies such as material issues and fractures, though it shows limitations with complex 3D deformations such as missing parts or misalignments (Figure a). As shown by the metrics in Figure b, increasing the number of views improves performance, while robustness remains consistent across varying object counts. Ablation studies (Figure c) confirm that our dense attention mechanism is both more accurate and more efficient than sparse voxel-based alternatives, with the Residual head providing a slight yet consistent performance boost.
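For concreteness, the sketch below shows one plausible form of the two anomaly heads compared in the ablation: a Context head that lets each object attend to the others (object-to-object similarity) and a Residual head that scores each object's deviation from the scene mean (the "average normalcy"). The layer sizes, the use of `nn.MultiheadAttention`, and the additive combination of the two scores are assumptions made for illustration, not the exact design.

```python
# Hedged sketch of the Context and Residual anomaly heads; all design
# choices below are illustrative assumptions.
import torch
import torch.nn as nn


class AnomalyHeads(nn.Module):
    def __init__(self, dim=384, heads=4):
        super().__init__()
        # Context head: self-attention over the N object embeddings of one scene
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.context_score = nn.Linear(dim, 1)
        # Residual head: scores the difference between an object and the scene mean
        self.residual_score = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, obj_feats):
        # obj_feats: (B, N, dim) per-object representations from the pooling step
        ctx, _ = self.attn(obj_feats, obj_feats, obj_feats)          # object-to-object context
        s_ctx = self.context_score(ctx).squeeze(-1)                  # (B, N)
        residual = obj_feats - obj_feats.mean(dim=1, keepdim=True)   # deviation from scene mean
        s_res = self.residual_score(residual).squeeze(-1)            # (B, N)
        return s_ctx + s_res                                         # higher = more anomalous


# Example: score N=3 objects in 2 scenes with 384-d embeddings
scores = AnomalyHeads()(torch.randn(2, 3, 384))
print(scores.shape)  # torch.Size([2, 3])
```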
Qualitative results of our approach on Toys Unseen and Parts Unseen. We show three views for each scene. Green boxes mark normal objects and red boxes mark anomalous ones; ✓ and ✗ denote correct and wrong predictions, respectively.