How to Count Large Scale Geohashes
As part of my team’s effort to crunch geospatial information, we needed to find a way to accurately count the number of geohashes in any given area. Knowing how many geohashes there are in any given area lets us extract information per region and compare areas with each other, for advanced analyses. This blog post will present a new way to count geohashes, even in areas as big as continents.
Let’s get started.
Let’s Talk Geohashes
For those of you who managed to stumble onto this blog-post because it’s rainy outside, here’s some background on geohashes to get you caught up. Wikipedia describes “Geohash” as a geocoding system, which means it represents global locations through letter and digit strings. Geohashes allow us to do two things:
- Represent locations around the globe efficiently and accurately. For example, I can send you a 9-digit geohash string and you would be able to locate me with the precision of a 2m by 4m rectangle on earth (Well, roughly – the distances vary according to where you are on the globe).
- Decide on the required accuracy of geo-identification according to your needs. By removing geohash digits from the right side of the representation you will still get pointed to the same location, just with less accuracy. For example, if we remove one digit from the 9-digit geohash string, we get an 8-digit representation that points us to a location with the precision of (roughly) a 20m by 19m rectangle on earth.
Take a look at these two pictures for further clarification:
This is a map of my hometown, Petah-Tikva, in Israel. The large square represents a level 7 geohash that can be referred to as “sv8y9ed”. For those of you who counted the digits, there are 7 of them. Hence, a level 7 geohash.
This map refers to the same geohash as before, however, it represents a level 8 geohash inside our original object. You’ll notice that as you dive into a deeper precision level, your original rectangle will be divided into 32 internal rectangles.
Each geohash level can be divided into 32 squares, with each smaller square representing a geohash of the next level (in the example above: level 8). There is no limit to the precision level you can reach. If you dive another two precision levels in, for example from level 7 to level 9, the number of rectangles inside the level 7 geohash will equal 32^2, and so on…
Going the other way around: removing a digit from the right side of the geohash representation will always point to the same location but with a smaller precision and a larger square around the desired area (i.e the parent geohash object).
Counting Geohashes in Countries and Cities: System Requirements
As mentioned before, we needed to count the number of geohashes of a certain level in a defined area. By doing so, we have the ability to conduct a more granular analysis of which geohashes comply with certain rules we want to examine, out of all the geohashes available in a “defined area”. The “defined area” can be a neighborhood, city, country, or really any area on the globe.
To find that elusive number, we needed a system that we could obtain the coordinates of an area of interest (a Polygon) and a geohash precision level. The output of the system should state the number of geohash objects within a specific area at a specific precision level.
Not really conductive to our original task of counting, but due to the nature of the algorithm, a welcomed side-effect would be a compressed or uncompressed list of all the geohash IDs in our defined area. This information might be pertinent to some businesses.
Someone Must Have Done it Already
Google, or Duckduckgo for the purists among us, should have some information available, as someone must have already had the need for counting geohashes and already implemented a solution. We found a few implementations from people who wrote code and shared it with the world, including a Python implementation and a Scala implementation that originally looked fit for the job.
However, when we tried to get a count of all level 7 geohashes in the US, for example, the Python implementation took way too long and died after a few hours! The Scala implementation ran for 24 hours but crashed for unknown reasons when we tried to fetch the results. Clearly, the people who wrote the code didn’t have continent sized polygons in mind.
The second worst thing a developer wants to do
The second worst thing a developer wants to do is to read someone else’s code (well known fact). Reading through the source code for the Python implementation, it became obvious that whoever wrote it (thanks, by the way) meant for it to run on fairly small polygons.
The way the algorithm worked also rang a few warning bells. What it did was as follows:
- Found the center of the required polygon.
- Placed a geohash object (of the requested precision) in the middle of the polygon,
- Started going through the 8 neighboring geohashes, tested if they were inside the polygon, and repeated the process for the neighboring areas.
- Finally, it returned all the geohashes that were found inside the polygon.
Why was this problematic?
- The center of the polygon could be outside the polygon! Imagine a crescent shaped island, for example.
- What if the polygon is a MultiPolygon? For example, the USA and Hawaii. We’d get a count only for the center mass.
- It’s terribly inefficient to run from the highest precision level right at the get-go. The US has approximately 18642894503 level 8’s inside it. That’s 18 Billion!
I can go on, but hey, the guy made an effort, solved it for small polygons, and shared it with the world!
The #1 worst thing a developer wants to do
What terrifies developers, and is considered the worst thing they need to do, is to write a solution from scratch. But hey, someone has to do it! Back to the drawing board.
Creating a Geohash Counting Solution
Ok, so I rewrote it. Here are the highlights:
- Define your polygon
- Choose a lower precision geohash (e.g level 2 or 3)
- Find all geohashes of said level in the polygon
- Decide what to do with intersecting geohashes
- Calculate the number of geohashes of a higher precision level
- Reaching a complete list
Let’s look at each step in more detail:
Define Your Polygon
Choose the area you want to check. In our example, we chose the United States. We do so by asking Shapely, a python library that performs all the geometry you will ever need, to supply us with the boundaries of the polygon. In the case of the US, it would include the mainland, Guam, Hawaii, and Alaska.
Start with lower precision GEOHASHES & find all Geohashes inside the polygon
Once we have the boundaries, we can create a grid of geohashes starting at the north-west extremity all the way to the south-east extremity. These would be the initial geohashes.
But which geohash level should we choose? One of the main drawbacks we had with the original implementation was that all the processing was done at a high level for the desired location. (Reminder: the US had 18 billion for precision level 8).
Checking if a geohash precision 2 is in the US and checking if geohash precision 8 is in the US, computationally costs the same, but if we get a result saying that geohash ‘9x‘ (not level 9, see map below) is completely inside the polygon, we can skip the testing of 32^6 precision 8 geohashes that are inside ‘9x‘ and mark them all as in.
This also gives us “compression”. Once ‘9x’ is stored as inside the US, we do not need to waste memory and storage by storing the 32^6 precision 8 objects. ‘9x’ inside the inner_objects list is sufficient. Hence, the “compression”. This also means that any precision geohash that starts with ‘9x’ is inside the US.
Therefore, in order to start the program, we’ll need to choose a group of low precision geohashes that completely cover the polygon. If we look at a map of the United States, we can see that a few geohash precision 2 objects fit within the borders since the country is so big.
Later on we will check which geohashes of the level we require are inside these level 2 geohashes, But first, what should we do with geohashes that intersect the polygon borders?
In, Out, Or In Between
Now that we have our initial set, we use Shapely, again, to test each of our initial geohash objects against the polygon.
There can be three possible responses:
- Test subject is inside the polygon – in that case, we add it to the result list.
- Test subject is outside the polygon – in that case, we discard it as it’s no longer interesting.
- Test subject is intersected by the polygon – This is where it gets interesting.
The meaning of result #3 is that the geohash has parts in the polygon but also parts that are outside. In this case, we need to test the 32 internal higher precision geohash objects against the polygon.
In order to do so, we break the large geohash object into 32 smaller object and push those into the queue of objects that are waiting to be analyzed. (The same queue we first pushed the initial geohashes into).
When recursing into data, it’s always important to remember when to stop. In our case, we will stop the recursion when we reach a precision level of n (n meaning the precision level for which we want the count). We’ll also stop if there are no more objects to test. Maybe the polygon is a rectangle that snuggly fits all tested geohashes.
This image is taken from the original implementation Github page and illustrates the difference between inner only and inner+outer result.
When doing inner only, you may miss some space in the polygon that is not covered. When using the outer option, though, you also get some real-estate that may belong to a neighboring polygon.
The implementation of this in our case is easy. When we reach the target precision, any intersecting object would be either added to the results (if “outer” is requested) or discarded if we requested “inner”.
How to count (efficiently)
We’ve officially recursed through our polygon and stored all geohashes (from various levels) that fit into the polygon. Now comes counting the results.
Considering that some of the final objects are not in the highest requested geohash level, it’s still easy to get the number of internal geohashes by counting it as 32^n where n = –
More for Developers
Otonomo is more than a car data exchange. Read these blogs written by developers, for developers, about coding, technology and culture.