MMLandmarks: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding

🎉 Accepted at CVPR 2026 🎉

Oskar Kristoffersen^1,2, Alba Reinders Sánchez¹, Morten Rieger Hannemose^1,2, Anders Bjorholm Dahl^1,2, Dim P. Papadopoulos^1,2

¹Technical University of Denmark, ²Pioneer Center for AI

Paper arXiv Code Dataset Explore Dataset

TL;DR - What is MMLandmarks ?

✓Four modalities: Ground, Satellite, GPS and Text
✓18,557 unique landmark instances
✓Continental Scale: United States of America
✓Open Access - Publicly Available
✓CC / Public domain licenses

MMLandmarks is a large-scale multimodal benchmark for geo-spatial understanding, uniquely combining four complementary modalities—ground-level imagery, aerial/satellite imagery, GPS coordinates, and Wikipedia text—for the same set of landmarks. Unlike prior datasets that focus on a single view or modality, MMLandmarks enables cross-view and cross-modal retrieval at the instance level. All 18,557 landmarks are geographically distributed across the United States, sourced from OpenStreetMap and Wikimedia Commons under open licenses, making the dataset fully reproducible and suitable for academic research.

MMLandmarks: We present four distinct data modalities: ground-view images, aerial-view images, GPS coordinates, and textual descriptions, collected from 18,557 unique landmarks in the United States. Data sources are included alongside each modality.

Collection Pipeline: Tags from OpenStreetMaps are used to collect Wiki-identifiers, ensuring that landmarks have a Wikipedia and Wikimedia Commons page. If both are available, we check that the longest edge of the landmark’s bounding box is smaller than 400 meters to keep an even size distribution across the dataset. Every resulting landmark has a Wikimedia Commons page (ground), a Wikipedia page (text), a box size and center (coordinates), and associated aerial imagery (satellite).

**Dataset comparison.** Modality abbreviations: S - Satellite, G - Ground, T - Text, C - Coordinates, D - Drone. In the Scale column, the number in parentheses indicates the number of cities.
Task	Dataset	Year	Train(G/S)	Index(G/S)	Instances	Scale (Cities)	Modalities	Open-access	License
Geo localization	IM2GPS	2008	6.4M/-	-	-	Global	G,C	✗	N/A
	YFCC100M	2016	100M/-	-	-	Global	G,C	✓	Flickr TC
	PlaNet	2016	126M/-	-	-	Global	G,C	✗	N/A
	MP16	2017	4.7M/-	-	-	Global	G,C	✓	Flickr TC
	OSV-5M	2017	5.1M/-	-	-	Global	G,C	✓	CC-BY-SA
Cross-View Retrieval	CVUSA	2015	35k/35k	8.8k/8.8k	-	USA(1)	G,S	✓	Flickr TC
	Vo.	2016	450k/450k	70k/70k	-	USA(11)	G,S	✓	N/A
	CVACT	2019	44k/44k	92k/92k	-	Australia(1)	G,S	✓	N/A
	Uni-1652	2020	11.6k/701	5.5k/1652	1652	72 Universities	G,S,D	✓	N/A
	VIGOR	2021	51k/44k	53k/46k	-	USA(4)	G,S	✓	N/A
	CV-Cities	2024	162k/162k	61k/61k	-	Global(16)	G,S	✓	N/A
	CVGlobal	2024	134k/134k	-	-	Global(7)	G,S	✓	N/A
Landmark Retrieval	R-Oxford	2018	-	5k + 1M/-	11	Oxford	G	✓	Flickr TC/CC
	R-Paris	2018	-	6k + 1M/-	11	Paris	G	✓	Flickr TC/CC
	GLDv1	2018	1.2M/-	1.1M/-	30k	Global	G	✗	Multiple
	GLDv2	2020	4.1M/-	764k/-	200k	Global	G	✓	CC/Public-domain
	MMLandmarks	2026	329k/197k	714k/100k	18,557	USA	G,S,T,C	✓	CC/Public-domain

MMLandmarks: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding

Dataset

Examples

BibTeX