Map Data V: False Assumptions Programmers Make

Famously, engineers tend to make wrong assumptions about a lot of things: Names and time are two well-known examples. Maps are a rich source of edge cases. Especially when you start dealing with global maps, be prepared: That weird situation you assume does not exist? Somewhere in the world, it does.

This article is part of a series about the specific challenges of working with map data. Part one of the series can be found here.

In the following, let us consider some flawed assumptions.

Assuming That Countries Have Simple Shapes

The first wrong assumption is to think that a country shape can always be represented by a simple polygon. Most people will directly come up with counterexamples to this assumption: Countries can comprise islands, enclaves, or exclaves.

It gets more complex than that, however: The image above shows a snapshot of the towns of Baarle-Hertog (BE) and Baarle-Nassau (NL). Two things stand out: Not only is Baarle-Hertog a Belgian exclave with a very complicated shape, in some parts only covering a couple of houses, it also contains Dutch enclaves inside of the exclave.

Until 2015, there even existed a third-order enclave: Dahala Khagrabari was a piece of India within a piece of Bangladesh within a piece of India within Bangladesh. In 2015, the situation was simplified in an exchange of land between the two countries. I like to think that this decision was a consequence of the violent resistance of rioting cartographers.
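
To make this concrete, here is a minimal sketch of a territory model that survives exclaves and nested enclaves: a territory is a list of polygons, and each polygon is an exterior ring plus any number of holes. The class names and the square toy coordinates are my own assumptions for illustration and do not correspond to any real border data or library.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Ring = List[Point]


def point_in_ring(point: Point, ring: Ring) -> bool:
    """Ray-casting test: is the point inside the closed ring?"""
    x, y = point
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def square(x0: float, y0: float, x1: float, y1: float) -> Ring:
    """Helper for the toy data: an axis-aligned square ring."""
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]


class Polygon:
    def __init__(self, exterior: Ring, holes: Sequence[Ring] = ()):
        self.exterior = exterior
        self.holes = list(holes)

    def contains(self, point: Point) -> bool:
        # Inside the outer ring, but not inside any hole.
        return point_in_ring(point, self.exterior) and not any(
            point_in_ring(point, hole) for hole in self.holes
        )


class Territory:
    """A country as a collection of polygons (mainland, exclaves, islands)."""

    def __init__(self, polygons: Sequence[Polygon]):
        self.polygons = list(polygons)

    def contains(self, point: Point) -> bool:
        return any(polygon.contains(point) for polygon in self.polygons)


# Hypothetical toy layout: B owns an exclave inside A, and A owns an
# enclave inside that exclave.
country_a = Territory([
    Polygon(square(0, 0, 10, 10), holes=[square(2, 2, 8, 8)]),
    Polygon(square(4, 4, 6, 6)),
])
country_b = Territory([
    Polygon(square(20, 0, 30, 10)),
    Polygon(square(2, 2, 8, 8), holes=[square(4, 4, 6, 6)]),
])

print(country_a.contains((3, 3)), country_b.contains((3, 3)))
print(country_a.contains((5, 5)), country_b.contains((5, 5)))
```

The point (3, 3) lies in B's exclave, while (5, 5) lies in A's enclave inside that exclave. A single simple polygon per country could not express either case.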

Assuming That Everything Is A Country

Here is an interesting question: If I point you to any arbitrary point of land in the world, will you always be able to tell me what country it is in?

I already went through one of the counterexamples to this in a previous article on geopolitics: The answer may depend on the political stance of the spectator.

There are other counterexamples, however: Antarctica is claimed by countries in some but not all areas. However, these claims are not typically part of a political map.

There are areas in the world that are simply no man’s land (terra nullius, if you want to sound smart at the next cocktail party). Bir Tawil is an interesting case to study: This area between Egypt and Sudan is claimed by neither of the two bordering countries. It is the only geopolitically disputed area on earth in which both parties state “Nah, this is not mine, it’s yours!”.

The handling of dependent territories can sometimes be a source of misplaced assumptions. While it is politically speaking correct to state that Curaçao is part of the Kingdom of the Netherlands, if you run a query for the average temperature in the Netherlands, in many cases including data from Willemstad does not correspond to the intention of the query. Often, it is helpful to be very specific in which sense you use the word “country”.

There is also the question of what counts as a piece of land. Do man-made structures count? An offshore oil rig might not quite qualify for most people. But then, China has famously created islands in the South China Sea to reinforce its claim on these contested waters, which is a more complicated debate.
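
One way to avoid the ambiguity is to store the territory and the sovereign state as separate, optional fields and to make every query say which one it means. The following is only a sketch: the `Place` record, the `places_in` helper, and the three sample entries are assumptions of mine, not part of any real dataset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Place:
    name: str
    territory: Optional[str]  # ISO 3166-1 alpha-2 code, None for unclaimed land
    sovereign: Optional[str]  # code of the sovereign state, None for unclaimed land


# Hypothetical sample records for illustration only.
PLACES = [
    Place("Amsterdam", "NL", "NL"),
    Place("Willemstad", "CW", "NL"),
    Place("Bir Tawil", None, None),
]


def places_in(code: str, *, by_sovereign: bool) -> List[str]:
    """Return place names, matching either the territory or the sovereign code."""
    if by_sovereign:
        return [p.name for p in PLACES if p.sovereign == code]
    return [p.name for p in PLACES if p.territory == code]


print(places_in("NL", by_sovereign=False))
print(places_in("NL", by_sovereign=True))
```

Forcing the caller to pass `by_sovereign` explicitly is the point of the design: there is no default meaning of “country” that is right for every query, and a place may legitimately have no country at all.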

Assuming That Administrative Divisions Are Homogeneous

Most countries have some form of administrative partitioning: Germany is subdivided into federated states (Bundesländer) which in turn may be divided into districts and municipalities. Belgium is divided into regions, provinces, and municipalities and also has an orthogonal concept of language community which is part of its administrative subdivision.

The first mistake that we could make is to assume that such divisions are somewhat similar across countries. The reality is that each country applies a slightly different subdivision and the administrative layers have different meanings depending on the national context. This makes drawing sub-national divisions on a global map a tricky challenge. Take the following rendering of the OpenStreetMap data of a part of Europe:

Source: OpenStreetMap

This depiction could be taken to imply that French regions, German Bundesländer, Belgian regions, Swiss cantons, and the countries making up the UK are somewhat comparable entities. However, their political meaning is very different. We might not be too bothered by this in this particular image, where the administrative geometries serve more as orientation for the viewer than as a significant component of the map. It starts to play an important role, however, when you represent statistical data on the same map, as the dataset from Moldova may be hard to compare to the data from the UK.

Not every country has an administrative subdivision: The island nation of Kiribati and the Vatican are counterexamples that only have the national level.

You may be tempted to think that at least within a country, the administrative division should follow a homogeneous structure. However, you would be wrong again.

Canada is partitioned into ten provinces and three territories. Each province defines its own administrative subdivision.

Bosnia and Herzegovina is composed of two entities: The Federation of Bosnia and Herzegovina (not to be confused with the state) and the Republika Srpska. The former breaks down into cantons and then municipalities while the latter is directly subdivided into municipalities.
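
A practical consequence is that the hierarchy is better modeled as a tree of arbitrary depth than as a fixed set of columns such as “state, district, municipality”. Here is a minimal sketch of that idea. The `AdminArea` class is hypothetical, and the sample tree contains only one illustrative branch per entity.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class AdminArea:
    name: str
    kind: str  # free-form label, because levels differ between countries
    children: List["AdminArea"] = field(default_factory=list)


def leaf_paths(area: AdminArea, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield the chain of kinds from the root down to every leaf area."""
    path = prefix + (area.kind,)
    if not area.children:
        yield path
        return
    for child in area.children:
        yield from leaf_paths(child, path)


# Illustrative excerpt, not a complete administrative dataset.
bih = AdminArea("Bosnia and Herzegovina", "country", [
    AdminArea("Federation of Bosnia and Herzegovina", "entity", [
        AdminArea("Sarajevo Canton", "canton", [
            AdminArea("Centar", "municipality"),
        ]),
    ]),
    AdminArea("Republika Srpska", "entity", [
        AdminArea("Pale", "municipality"),
    ]),
])

for path in leaf_paths(bih):
    print(" > ".join(path))
```

The two printed paths have different lengths, so any code that indexes municipalities by “level 4” would silently miss one of the two entities.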

Can we at least be sure that administrative areas of a given level form a complete partitioning of a given country? Again, the answer is no. While the county-level subdivision completely covers the 50 US states (but not DC), the same is not true for the municipality level. Municipal borders could even overlap with county borders.

The image below illustrates this with OpenStreetMap administrative areas of levels 6-10 around the Bay Area:

Source: OpenStreetMap

As this is not very practical for the processing of statistical data, the United States Census Bureau defines subdivisions for statistical purposes that are decoupled from the administrative structure of a state.

Assuming That Addressing Schemes Follow A Simple Rule

Addresses are a well-known source of wrong assumptions. I want to highlight a number of striking examples you will come across when working with global map data.

Every country has its own convention with regard to address formatting. Here is a typically formatted address from the UK:

75 Belford Road
Edinburgh
EH4 3DR

Here is a typical address from Germany:

Durlacher Allee 97
76137 Karlsruhe

However, the differences go beyond different post-code conventions and number placement. Take Colombia as an example where addresses are identified through a grid system based on calles and carreras which can result in something like the following:

Carrera 94 C 
No. 129 A - 04
Bogotá

Japan also uses a grid and districts, but it is more complicated than that. Rural addresses in Bangladesh don’t have a postcode. In many countries, there are addresses that only consist of a place name and don’t have a street number or even a street name. Qatar and the UAE do not use postal codes and not all streets and buildings are numbered – mail is typically delivered to PO Boxes.

The more you look at address data, the more you arrive at the conclusion that the only thing addresses have in common is that they are a sequence of characters. While it is common to find elements like street name and house number, it is impossible to give any guarantees.
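
In code, this suggests treating every address component as optional and keeping the formatting rules per country. The sketch below covers only the UK and German examples from above. The `Address` class, the country codes used as keys, and the fallback behavior are my own assumptions rather than an established standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Address:
    country: str  # assumed here to be an ISO 3166-1 alpha-2 code
    street: Optional[str] = None
    house_number: Optional[str] = None
    postcode: Optional[str] = None
    city: Optional[str] = None
    free_text: Optional[str] = None  # for addresses that fit no known scheme


def _join(*parts: Optional[str]) -> str:
    """Join the parts that are present with single spaces."""
    return " ".join(p for p in parts if p)


def format_address(addr: Address) -> str:
    if addr.country == "GB":
        lines = [_join(addr.house_number, addr.street), addr.city, addr.postcode]
    elif addr.country == "DE":
        lines = [_join(addr.street, addr.house_number), _join(addr.postcode, addr.city)]
    else:
        # Unknown convention: keep whatever we have and guarantee nothing.
        lines = [
            addr.free_text or _join(addr.street, addr.house_number),
            _join(addr.postcode, addr.city),
        ]
    # Drop empty lines, since any component may be missing.
    return "\n".join(line for line in lines if line)


print(format_address(Address("GB", "Belford Road", "75", "EH4 3DR", "Edinburgh")))
print(format_address(Address("DE", "Durlacher Allee", "97", "76137", "Karlsruhe")))
```

Note that the formatter never assumes a component exists. An address consisting only of a place name still produces a valid, if short, result.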

Assuming That Time Zones Follow A Logical System

Time zones are a typical element that a digital map needs to provide. They are required for many use cases such as “what will be the local time when I arrive at the destination?”. OpenStreetMap supports time zones with the timezone tag. Importantly, time zones are also a well-known trap for unsuspecting programmers.

Time zones of the world (Source)

Faulty assumptions include the idea that time-zone offsets will remain constant (they don’t), or that daylight saving time is synchronized across time zones (it isn’t).

A well-known gotcha of time zones is the assumption that offsets are always multiples of a full hour. There are many counterexamples to this assumption: Iran Standard Time is defined as UTC +03:30 and the Chatham Islands in New Zealand use Chatham Standard Time, which is defined as UTC +12:45.

It would be tempting to assume that no city can be part of two time zones. However, from 2016 to 2017 Nicosia, Cyprus was de-facto split into one part following UTC+2 in winter and UTC+3 in summer and another part observing UTC+3 all year round.
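
The fractional offsets alone are enough to break any data model that stores a time zone as a whole number of hours. Here is a small sketch using only the Python standard library and the two offsets quoted above. It deliberately uses fixed offsets and ignores daylight saving time, so it illustrates the arithmetic only and is no substitute for a proper time zone database.

```python
from datetime import datetime, timedelta, timezone

# Standard-time offsets as quoted in the text, expressed with minute precision.
IRAN_STANDARD = timezone(timedelta(hours=3, minutes=30))
CHATHAM_STANDARD = timezone(timedelta(hours=12, minutes=45))


def offset_minutes(tz: timezone) -> int:
    """Offset from UTC in minutes; an integer number of hours cannot hold this."""
    return int(tz.utcoffset(None).total_seconds() // 60)


def local_time(utc_instant: datetime, tz: timezone) -> datetime:
    """Convert an aware UTC datetime to the given fixed-offset zone."""
    return utc_instant.astimezone(tz)


# Hypothetical arrival time, chosen for illustration.
arrival_utc = datetime(2023, 7, 1, 12, 0, tzinfo=timezone.utc)

print(offset_minutes(IRAN_STANDARD), offset_minutes(CHATHAM_STANDARD))
print(local_time(arrival_utc, IRAN_STANDARD).isoformat())
print(local_time(arrival_utc, CHATHAM_STANDARD).isoformat())
```

In real code I would resolve zones by their IANA identifier (for example with Python's `zoneinfo` module) instead of hard-coding offsets, precisely because offsets and daylight saving rules change over time.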

Assuming That Countries Do Not Change

It is easy to assume that country shapes are relatively stable. This assumption, if true, would simplify a lot in map data processing: You could cut your map into country-sized pieces and not worry about the shape of these cuts going forward. A newer version of one country could be stitched to an older version of a neighboring country.

Furthermore, you could easily cache operations that require information derived from a country. Here is an example: The country might determine the address scheme to be used and if the country shape was ever updated, you would need to invalidate any formatted addresses in the modified area. If country shapes remain stable, you could cache any formatted address until the address information itself changes.
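
If you cannot rely on stable borders, one option is to make the boundary version part of the cache key, so that a border update invalidates the affected entries implicitly. The following is a rough sketch of that idea. The `AddressCache` class, the integer `boundary_version`, and the toy formatter are all hypothetical.

```python
from typing import Callable, Dict, Tuple


class AddressCache:
    """Caches formatted addresses per (address, boundary version)."""

    def __init__(self, format_fn: Callable[[str, int], str]):
        self._format_fn = format_fn
        self._store: Dict[Tuple[str, int], str] = {}
        self.computations = 0  # counts cache misses, for demonstration

    def formatted(self, address_id: str, raw: str, boundary_version: int) -> str:
        key = (address_id, boundary_version)
        if key not in self._store:
            # Either a new address or the borders changed since it was cached.
            self.computations += 1
            self._store[key] = self._format_fn(raw, boundary_version)
        return self._store[key]


def toy_formatter(raw: str, boundary_version: int) -> str:
    # Stand-in for a real formatter that looks up the country first.
    return f"{raw} [boundaries v{boundary_version}]"


cache = AddressCache(toy_formatter)
cache.formatted("addr-1", "Durlacher Allee 97", boundary_version=1)
cache.formatted("addr-1", "Durlacher Allee 97", boundary_version=1)  # cache hit
cache.formatted("addr-1", "Durlacher Allee 97", boundary_version=2)  # recomputed
print(cache.computations)
```

A single global version is the bluntest possible choice, since it throws away every entry on any border edit. A finer-grained variant would version each country or tile separately.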

Unfortunately, this assumption is not exactly accurate. Above we already came across one example of a country border change: The land swap between India and Bangladesh. Country borders can also change because of armed conflict or political shifts. Many of the examples we talked about in the article on geopolitics can be sources of change.

Borders can also change for technical reasons. Often, for example, a border is defined by a natural landmark such as a river. Someone might improve the shape of that river, thereby slightly modifying the border. A different situation is a new border crossing road being added. For the reasons we talked about when we investigated the role of relations in a map, this modification would likely add a shape point on the border geometry, again slightly altering the shape.

How much does this happen? We can find out using the Ohsome API. The following query will count the number of geometry changes during the year 2021 to an administrative area with admin_level=2 (i.e. a country) in a somewhat arbitrary bounding box around Luxembourg:

curl -G 'https://api.ohsome.org/v1/contributions/count' \
  --data-urlencode 'bboxes=5.393,50.5,6.77,49.2' \
  --data-urlencode 'filter=type=boundary and boundary=administrative and admin_level=2' \
  --data-urlencode 'time=2021-01-01,2022-01-01' \
  --data-urlencode 'contributionType=geometryChange'

Returning

{
    "attribution": {
        "url": "https://ohsome.org/copyrights",
        "text": "© OpenStreetMap contributors"
    },
    "apiVersion": "1.6.3",
    "result": [
        {
            "fromTimestamp": "2021-01-01T00:00:00Z",
            "toTimestamp": "2022-01-01T00:00:00Z",
            "value": 138.0
        }
    ]
}

As you can see, there were 138 country geometry changes around Luxembourg in this time period.
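
If you want to process such a response programmatically, the count is in the `value` field of each entry of `result`. Here is a small sketch that parses the response shown above. The helper name `total_changes` is my own, and summing over all entries is an assumption that only matters if you query more than one time interval.

```python
import json

# The response body from above, pasted as a string.
response_text = """
{
    "attribution": {
        "url": "https://ohsome.org/copyrights",
        "text": "© OpenStreetMap contributors"
    },
    "apiVersion": "1.6.3",
    "result": [
        {
            "fromTimestamp": "2021-01-01T00:00:00Z",
            "toTimestamp": "2022-01-01T00:00:00Z",
            "value": 138.0
        }
    ]
}
"""


def total_changes(payload: dict) -> int:
    """Sum the counts over all time intervals in the response."""
    return int(sum(item["value"] for item in payload["result"]))


payload = json.loads(response_text)
print(total_changes(payload))
```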

Another faulty assumption you could make is that countries do not change names – the current events around the country formerly known as Turkey show that they do.

Assuming That Buildings Do Not Overlap

It is easy to make assumptions about how buildings and other geometric entities relate to each other. Can two buildings ever overlap each other or overlap with a street? Naively you might conclude that this is not possible.

However, a map is a two-dimensional representation of a three-dimensional world so it is absolutely possible that a building arches over a road or over another building.

Rue Wiertz in Brussels overlaps with the European Parliament building. (Source)
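
For validation code, this means that a 2D intersection alone is not proof of a data error. The sketch below separates “the footprints overlap on the map” from “the objects collide in the real world” by taking a vertical layer into account, similar in spirit to the `layer` tag in OpenStreetMap. The `Footprint` class and the rectangular toy geometries are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    name: str
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    layer: int = 0  # vertical ordering; 0 means ground level


def footprints_overlap(a: Footprint, b: Footprint) -> bool:
    """True if the two rectangles share some area in the 2D projection."""
    return (
        a.min_x < b.max_x and b.min_x < a.max_x
        and a.min_y < b.max_y and b.min_y < a.max_y
    )


def physically_conflict(a: Footprint, b: Footprint) -> bool:
    """Only objects on the same layer can actually collide."""
    return footprints_overlap(a, b) and a.layer == b.layer


# Hypothetical toy data: a building arching over a road.
road = Footprint("road", 0, 4, 10, 6, layer=0)
arching_building = Footprint("building", 3, 0, 7, 10, layer=1)

print(footprints_overlap(road, arching_building))
print(physically_conflict(road, arching_building))
```

The first check reports an overlap, the second reports no conflict, which matches the situation in the picture.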

Assuming That The Road Network Is A Connected Graph

Is routing between two points on the road network always possible? Islands are an obvious counterexample. While many larger islands with a road network do have ferry connections, not all do.

However, a decoupled routing graph can occur even without the presence of natural barriers: Not every piece of road is accessible to every vehicle type. Road networks that are part of factory zones are often private and closed to public traffic, some tunnels and bridges restrict the maximum height or weight of vehicles and some roads are simply closed to freight traffic.

These restrictions can create de-facto islands in the routing graph for certain vehicle classes.
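
The effect is easy to reproduce on a toy graph: filter the edges by what a given vehicle may use, then run a plain breadth-first search. In the sketch below, the node names, the single height restriction, and the vehicle heights are made up for illustration. A real router would consider many more attributes.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Optional, Set, Tuple

# (from node, to node, maximum vehicle height in metres or None if unrestricted)
Edge = Tuple[str, str, Optional[float]]


def reachable(edges: Iterable[Edge], start: str, vehicle_height: float) -> Set[str]:
    """Nodes reachable from start using only edges the vehicle fits through."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for a, b, max_height in edges:
        if max_height is None or vehicle_height <= max_height:
            graph[a].add(b)
            graph[b].add(a)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Hypothetical network: a low tunnel between B and C.
EDGES = [
    ("A", "B", None),
    ("B", "C", 3.0),
    ("C", "D", None),
]

print(sorted(reachable(EDGES, "A", vehicle_height=1.5)))
print(sorted(reachable(EDGES, "A", vehicle_height=4.0)))
```

For the car, all four nodes are reachable. For the tall truck, C and D form an island even though the underlying road network is connected.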

Assuming That There Are No Relevant Constructions Outside A Country’s Landmass

It is common practice to cut a map into country-sized pieces and stitch them back together as needed. We have already discussed that this can be problematic in the face of changing country definitions. It is also important to be careful with how you deal with the water belts around a country as it could be tempting to focus only on a country’s landmass.

Doing this, however, might end up cutting out important constructions in the water like the Øresund Bridge between Copenhagen and Malmö.

Øresund Bridge (Source)

Conclusion

In this series of articles, I aimed to show some of the particular challenges you will come across when working with map data. While many of the typical data processing challenges also apply to map data, there are some particularities that set it apart. In this article, we saw that one particular challenge of map data is that everyone has an intuitive understanding of what a map ought to look like and will likely make assumptions based on this understanding. More often than not, these assumptions turn out to be false. The world is a big place and that edge case that you assumed would never occur – somewhere it will.

Update: There is a discussion thread about this article on Hacker News containing a ton of fascinating examples.

14 thoughts on “Map Data V: False Assumptions Programmers Make”

  1. My information suggests that Kiribati is split into 3 units, then into 6 districts and then into 21 island councils (http://www.grcdi.nl/gsb/kiribati.html). Also, all Bangladeshi addresses have postal codes, as postal codes point to delivery post offices. Whether they are used or not is an entirely different matter. To my knowledge, there are no countries where postal codes are not universally applied. Postal pointers which apply only to large cities, for example, such as those used in Ireland before Eircodes (e.g. “Dublin 16”), are sorting codes rather than postal codes.