SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation

Page view(s)
4
Checked on Sep 05, 2025
SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation
Title:
SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation
Journal Title:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Publication Date:
04 August 2025
Citation:
Zhang, W., Ng, W. E., Ma, L., Wang, Y., Zhao, J., Koenecke, A., Li, B., & Wang, L. (2025). SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 11591–11609. https://doi.org/10.18653/v1/2025.acl-long.568
Abstract:
Current vision-language models may grasp basic spatial cues and simple directions (e.g. left, right, front, back), but struggle with the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications. To address this gap, we develop SPHERE (Spatial Perception and Hierarchical Evaluation of REasoning), a hierarchical evaluation framework supported by a new human-annotated dataset. SPHERE systematically probes models across increasing levels of complexity, from fundamental skills to multi-skill integration and high-level reasoning that combines spatial, visual, and logical understanding. Benchmark evaluation of state-of-the-art models reveals significant deficiencies, especially in reasoning about distance and proximity, understanding both egocentric and allocentric perspectives, and applying spatial logic in physical contexts. These findings expose critical blind spots in existing models and underscore the need for more advanced spatial reasoning techniques, driving the development of vision-language models that align more closely with human spatial cognition.
License type:
Attribution 4.0 International (CC BY 4.0)
Funding Info:
This research / project is supported by the National Research Foundation - National Research Foundation Fellowship
Grant Reference no. : NRF-NRFF13-2021-00
Description:
ACL materials are Copyright © 1963–2025 ACL; other materials are copyrighted by their respective copyright holders. Materials prior to 2016 here are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License. Permission is granted to make copies for the purposes of teaching and research. Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License.
ISBN:
NA
Files uploaded: