How to Count Large Scale Geohashes
As part of my team’s effort to crunch geospatial information, we needed to find a way to accurately count the number of geohashes in any given area. Knowing how many geohashes there are in any given area lets us extract information per region and compare areas with each other, for advanced analyses. This blog post will present a new way to count geohashes, even in areas as big as continents.
Let’s get started.
Let’s Talk Geohashes
For those of you who managed to stumble onto this blog-post because it’s rainy outside, here’s some background on geohashes to get you caught up. Wikipedia describes “Geohash” as a geocoding system, which means it represents global locations through letter and digit strings. Geohashes allow us to do two things:
- Represent locations around the globe efficiently and accurately. For example, I can send you a 9-digit geohash string and you would be able to locate me with the precision of a 2m by 4m rectangle on earth (Well, roughly – the distances vary according to where you are on the globe).
- Decide on the required accuracy of geo-identification according to your needs. By removing geohash digits from the right side of the representation you will still get pointed to the same location, just with less accuracy. For example, if we remove one digit from the 9-digit geohash string, we get an 8-digit representation that points us to a location with the precision of (roughly) a 20m by 19m rectangle on earth.
Take a look at these two pictures for further clarification:
This is a map of my hometown, Petah-Tikva, in Israel. The large square represents a level 7 geohash that can be referred to as “sv8y9ed”. For those of you who counted the digits, there are 7 of them. Hence, a level 7 geohash.
This map refers to the same geohash as before, however, it represents a level 8 geohash inside our original object. You’ll notice that as you dive into a deeper precision level, your original rectangle will be divided into 32 internal rectangles.
Each geohash level can be divided into 32 squares, with each smaller square representing a geohash of the next level (in the example above: level 8). There is no limit to the precision level you can reach. If you dive another two precision levels in, for example from level 7 to level 9, the number of rectangles inside the level 7 geohash will equal 32^2, and so on…
Going the other way around: removing a digit from the right side of the geohash representation will always point to the same location but with a smaller precision and a larger square around the desired area (i.e the parent geohash object).
Counting Geohashes in Countries and Cities: System Requirements
As mentioned before, we needed to count the number of geohashes of a certain level in a defined area. By doing so, we have the ability to conduct a more granular analysis of which geohashes comply with certain rules we want to examine, out of all the geohashes available in a “defined area”. The “defined area” can be a neighborhood, city, country, or really any area on the globe.
To find that elusive number, we needed a system that we could obtain the coordinates of an area of interest (a Polygon) and a geohash precision level. The output of the system should state the number of geohash objects within a specific area at a specific precision level.
Not really conductive to our original task of counting, but due to the nature of the algorithm, a welcomed side-effect would be a compressed or uncompressed list of all the geohash IDs in our defined area. This information might be pertinent to some businesses.
Someone Must Have Done it Already
Google, or Duckduckgo for the purists among us, should have some information available, as someone must have already had the need for counting geohashes and already implemented a solution. We found a few implementations from people who wrote code and shared it with the world, including a Python implementation and a Scala implementation that originally looked fit for the job.
However, when we tried to get a count of all level 7 geohashes in the US, for example, the Python implementation took way too long and died after a few hours! The Scala implementation ran for 24 hours but crashed for unknown reasons when we tried to fetch the results. Clearly, the people who wrote the code didn’t have continent sized polygons in mind.
The second worst thing a developer wants to do
The second worst thing a developer wants to do is to read someone else’s code (well known fact). Reading through the source code for the Python implementation, it became obvious that whoever wrote it (thanks, by the way) meant for it to run on fairly small polygons.
The way the algorithm worked also rang a few warning bells. What it did was as follows:
- Found the center of the required polygon.
- Placed a geohash object (of the requested precision) in the middle of the polygon,
- Started going through the 8 neighboring geohashes, tested if they were inside the polygon, and repeated the process for the neighboring areas.
- Finally, it returned all the geohashes that were found inside the polygon.
Why was this problematic?
- The center of the polygon could be outside the polygon! Imagine a crescent shaped island, for example.
- What if the polygon is a MultiPolygon? For example, the USA and Hawaii. We’d get a count only for the center mass.
- It’s terribly inefficient to run from the highest precision level right at the get-go. The US has approximately 18642894503 level 8’s inside it. That’s 18 Billion!
I can go on, but hey, the guy made an effort, solved it for small polygons, and shared it with the world!
The #1 worst thing a developer wants to do
What terrifies developers, and is considered the worst thing they need to do, is to write a solution from scratch. But hey, someone has to do it! Back to the drawing board.
Creating a Geohash Counting Solution
Ok, so I rewrote it. Here are the highlights:
- Define your polygon
- Choose a lower precision geohash (e.g level 2 or 3)
- Find all geohashes of said level in the polygon
- Decide what to do with intersecting geohashes
- Calculate the number of geohashes of a higher precision level
- Reaching a complete list
Let’s look at each step in more detail:
Define Your Polygon
Choose the area you want to check. In our example, we chose the United States. We do so by asking Shapely, a python library that performs all the geometry you will ever need, to supply us with the boundaries of the polygon. In the case of the US, it would include the mainland, Guam, Hawaii, and Alaska.
Start with lower precision GEOHASHES & find all Geohashes inside the polygon
Once we have the boundaries, we can create a grid of geohashes starting at the north-west extremity all the way to the south-east extremity. These would be the initial geohashes.
But which geohash level should we choose? One of the main drawbacks we had with the original implementation was that all the processing was done at a high level for the desired location. (Reminder: the US had 18 billion for precision level 8).
Checking if a geohash precision 2 is in the US and checking if geohash precision 8 is in the US, computationally costs the same, but if we get a result saying that geohash ‘9x‘ (not level 9, see map below) is completely inside the polygon, we can skip the testing of 32^6 precision 8 geohashes that are inside ‘9x‘ and mark them all as in.
This also gives us “compression”. Once ‘9x’ is stored as inside the US, we do not need to waste memory and storage by storing the 32^6 precision 8 objects. ‘9x’ inside the inner_objects list is sufficient. Hence, the “compression”. This also means that any precision geohash that starts with ‘9x’ is inside the US.
Therefore, in order to start the program, we’ll need to choose a group of low precision geohashes that completely cover the polygon. If we look at a map of the United States, we can see that a few geohash precision 2 objects fit within the borders since the country is so big.
Later on we will check which geohashes of the level we require are inside these level 2 geohashes, But first, what should we do with geohashes that intersect the polygon borders?
In, Out, Or In Between
Now that we have our initial set, we use Shapely, again, to test each of our initial geohash objects against the polygon.
There can be three possible responses:
- Test subject is inside the polygon – in that case, we add it to the result list.
- Test subject is outside the polygon – in that case, we discard it as it’s no longer interesting.
- Test subject is intersected by the polygon – This is where it gets interesting.
The meaning of result #3 is that the geohash has parts in the polygon but also parts that are outside. In this case, we need to test the 32 internal higher precision geohash objects against the polygon.
In order to do so, we break the large geohash object into 32 smaller object and push those into the queue of objects that are waiting to be analyzed. (The same queue we first pushed the initial geohashes into).
When recursing into data, it’s always important to remember when to stop. In our case, we will stop the recursion when we reach a precision level of n (n meaning the precision level for which we want the count). We’ll also stop if there are no more objects to test. Maybe the polygon is a rectangle that snuggly fits all tested geohashes.
This image is taken from the original implementation Github page and illustrates the difference between inner only and inner+outer result.
When doing inner only, you may miss some space in the polygon that is not covered. When using the outer option, though, you also get some real-estate that may belong to a neighboring polygon.
The implementation of this in our case is easy. When we reach the target precision, any intersecting object would be either added to the results (if “outer” is requested) or discarded if we requested “inner”.
How to count (efficiently)
We’ve officially recursed through our polygon and stored all geohashes (from various levels) that fit into the polygon. Now comes counting the results.
Considering that some of the final objects are not in the highest requested geohash level, it’s still easy to get the number of internal geohashes by counting it as 32^n where n = < target geohash level > – < object geohash level >. This formula makes it much faster to count through the results, as many geohash objects would be in a much lower level than the target level. It’s likely that since our polygon is big, there would be many lower level geohash object that fit in.Here’s the code that counts target level objects regardless of their level, from the geohash list ‘inner_geohashes’:
Returning a full object list
To return a full list of geohashes from the requested level, we now need a code that goes through the result list. For every object that is not in the target level, it should return all the internal geohashes. This process is pretty easy considering that geohashes are equivalent to a base-32 counting mechanism (with a small hack because geohash uses a weird, non standard group of chars for its representation) . Here’s how I implemented it:
Running Geohash Counting Faster
What we’ve done so far is great and all, but with a high precision and a large polygon, even our new algorithm will take a very long time to process. One lucky find and a small tweak made it run like a rabbit on steroids.
Transforming Polygons to Prepped
As I was digging through a polygon library that does similar spacial tricks, I noticed that the author used a strange command to transform Shapely Polygon objects. Turns out there’s a way to transform a MultiPolygon/ Polygon object to a prepped object. This disables much of the methods associated with Polygons but makes intersection tests with other objects much, much faster. This provided the boost we needed to make the code run extremely fast.
It can be incredibly frustrating to look at the output of your top command and see one processor is at 100% utilisation while the remaining CPUs just slack around. I had to do something. Here is what I did:
- When choosing the initial geohashes, I recursed down the precision levels until I had more initial objects than available processors.
- Divided the initial objects to groups, numbering the count of available processors.
- Used multiprocessing to start a process for each of the initial object groups.
- Once all the processes were done, I combined the results and finished!
At the end, running over the US with geohash precision level 8 took 240 seconds to complete on my laptop! I hope this method will help you with your Geohash counting needs as well.
More for Developers