The recently introduced odd-one-out anomaly detection task involves identifying the odd-looking instances within a multi-object scene. This problem presents several challenges for modern deep learning models, demanding spatial reasoning across multiple views and relational reasoning to understand context and generalize across varying object categories and layouts. We argue that these challenges must be addressed with efficiency in mind. To this end, we propose a DINO-based model that reduces the number of parameters by one third and shortens training time by a factor of six compared to the current state-of-the-art, while maintaining competitive performance. Our experimental evaluation also introduces a Multimodal Large Language Model baseline, providing insights into its current limitations in structured visual reasoning tasks.
The schematic above illustrates our approach. It takes as input M views of the scene, which are processed by the DINOv2 encoder. The resulting features are projected onto a voxel grid and associated with each of the N=3 objects. A subsequent pooling step yields per-object representations, which are further refined by the context and residual anomaly heads, encoding both object-to-object similarity and the relative deviation of each object from the scene-specific average normalcy.
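A minimal sketch of the feature-lifting and pooling steps is given below, assuming camera intrinsics/extrinsics and per-voxel object assignments are available. The helper names (`lift_features`, `pool_objects`), the tensor shapes, and the intrinsics convention are illustrative assumptions, not the exact implementation.

```python
# Sketch only: back-project multi-view encoder features onto a voxel grid,
# then average-pool voxels into per-object embeddings.
import torch
import torch.nn.functional as F


def lift_features(feats, K, Rt, grid_pts):
    """Back-project M per-view feature maps onto a shared voxel grid.

    feats:    (M, C, H, W) patch features from the DINOv2 encoder
              (intrinsics assumed rescaled to this feature-map resolution)
    K, Rt:    (M, 3, 3) intrinsics and (M, 3, 4) world-to-camera extrinsics
    grid_pts: (V, 3) voxel-centre coordinates in world space
    returns:  (V, C) voxel features averaged over the views that see each voxel
    """
    M, C, H, W = feats.shape
    V = grid_pts.shape[0]
    homog = torch.cat([grid_pts, torch.ones(V, 1)], dim=-1)          # (V, 4)
    voxel_feats = torch.zeros(V, C)
    counts = torch.zeros(V, 1)
    for m in range(M):
        cam = (Rt[m] @ homog.T).T                                    # (V, 3) camera coords
        uv = (K[m] @ cam.T).T
        uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)                  # pixel coordinates
        # normalise to [-1, 1] for grid_sample; keep voxels in front of the camera
        norm = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                            2 * uv[:, 1] / (H - 1) - 1], dim=-1)
        visible = (cam[:, 2] > 0) & (norm.abs() <= 1).all(dim=-1)
        sampled = F.grid_sample(feats[m:m + 1], norm.view(1, V, 1, 2),
                                align_corners=True)                  # (1, C, V, 1)
        sampled = sampled.squeeze(0).squeeze(-1).transpose(0, 1)     # (V, C)
        voxel_feats += sampled * visible.float().unsqueeze(-1)
        counts += visible.float().unsqueeze(-1)
    return voxel_feats / counts.clamp(min=1)


def pool_objects(voxel_feats, voxel_to_object, num_objects):
    """Average-pool voxel features into one embedding per object.

    voxel_to_object: (V,) index of the object owning each voxel (-1 = background)
    returns:         (num_objects, C) per-object representations
    """
    obj_feats = torch.zeros(num_objects, voxel_feats.shape[1])
    for o in range(num_objects):
        mask = voxel_to_object == o
        if mask.any():
            obj_feats[o] = voxel_feats[mask].mean(dim=0)
    return obj_feats
```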
| Model | Toys Seen AUC | Toys Seen Accuracy | Toys Unseen AUC | Toys Unseen Accuracy | Parts AUC | Parts Accuracy |
|---|---|---|---|---|---|---|
| ImVoxelNet | 78.13 | 65.55 | 73.19 | 60.12 | 72.80 | 64.34 |
| DETR3D | 79.16 | 67.37 | 74.60 | 62.98 | 74.49 | 65.11 |
| OOO | 91.78 | 83.21 | 89.15 | 81.57 | 86.12 | 79.68 |
| MLLM baseline | - | 52.23 | - | 53.35 | - | 60.73 |
| Ours | 89.82 | 85.39 | 84.43 | 80.64 | 89.72 | 88.81 |
The results of our experimental analysis are reported in the table above. Our method performs on par with OOO on Toys Seen and slightly worse on Toys Unseen, while showing a significant advantage over the leading competitor on Parts Unseen. The difference between these two datasets lies in the geometric nature of their objects: Toys contains free-form shapes with high inter-class variability, whereas Parts consists of mechanical components characterized by low semantic diversity and predominantly rigid, angular geometries. Besides being more challenging due to the higher object count per scene and the fine-grained shape differences, the limited categorical diversity of the latter dataset better reflects industrial quality inspection scenarios and aligns well with our approach, which is specifically designed to prioritize efficiency.
| Head (Ours) | Toys Seen AUC | Toys Seen Acc. | Toys Unseen AUC | Toys Unseen Acc. | Parts AUC | Parts Acc. | Memory (GB) | Inf. time (ms) |
|---|---|---|---|---|---|---|---|---|
| Sparse Voxel Attn. | 86.11 | 78.79 | 85.48 | 77.32 | 86.32 | 84.06 | 1.23 | 337 |
| Context | 89.18 | 85.42 | 85.09 | 81.27 | 89.14 | 88.65 | 0.36 | 271 |
| Context + Residual | 89.82 | 85.39 | 84.43 | 80.64 | 89.72 | 88.81 | 0.54 | 286 |
Our model excels at detecting fine-grained anomalies such as material issues and fractures, though it shows limitations with complex 3D deformations such as missing parts or misalignments (Figure a). As shown by the metrics in Figure b, increasing the number of views improves performance, while robustness remains consistent across varying object counts. Ablation studies (Figure c) confirm that our dense attention mechanism is both more accurate and more efficient than sparse voxel-based alternatives, with the Residual head providing a slight yet consistent performance boost.
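For concreteness, the sketch below shows one plausible form of the two anomaly heads compared in the ablation: a Context head that lets each object attend to the others (object-to-object similarity) and a Residual head that scores each object's deviation from the scene mean (the "average normalcy"). The layer sizes, the use of `nn.MultiheadAttention`, and the additive combination of the two scores are assumptions made for illustration, not the exact design.

```python
# Hedged sketch of the Context and Residual anomaly heads; all design
# choices below are illustrative assumptions.
import torch
import torch.nn as nn


class AnomalyHeads(nn.Module):
    def __init__(self, dim=384, heads=4):
        super().__init__()
        # Context head: self-attention over the N object embeddings of one scene
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.context_score = nn.Linear(dim, 1)
        # Residual head: scores the difference between an object and the scene mean
        self.residual_score = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, obj_feats):
        # obj_feats: (B, N, dim) per-object representations from the pooling step
        ctx, _ = self.attn(obj_feats, obj_feats, obj_feats)          # object-to-object context
        s_ctx = self.context_score(ctx).squeeze(-1)                  # (B, N)
        residual = obj_feats - obj_feats.mean(dim=1, keepdim=True)   # deviation from scene mean
        s_res = self.residual_score(residual).squeeze(-1)            # (B, N)
        return s_ctx + s_res                                         # higher = more anomalous


# Example: score N=3 objects in 2 scenes with 384-d embeddings
scores = AnomalyHeads()(torch.randn(2, 3, 384))
print(scores.shape)  # torch.Size([2, 3])
```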
Qualitative results of our approach on Toys Unseen and Parts Unseen. We show three views for each scene. Green boxes mark normal objects and red boxes mark anomalous ones; ✓ and ✗ denote correct and wrong predictions, respectively.