Itamar Landsman

Itamar Landsman is an Algorithm Engineer at Otonomo. He has over 20 years of experience in networks, environments and data science. From RF systems, networking solutions, designing self-deploying OpenStack based cloud environments to dabbling with data-science problems, Itamar holds extensive knowledge in the fields of *nix OSs, networking and anything complicated in general. Itamar worked for 9 years for a company that manufactured point to multi-point wireless systems, and today that company is manufacturing 5th gen cellular equipment.

How to Count Large Scale Geohashes

June 3, 2019

As part of my team’s effort to crunch geospatial information, we needed to find a way to accurately count the number of geohashes in any given area. Knowing how many geohashes there are in any given area lets us extract information per region and compare areas with each other, for advanced analyses. This blog post will present a new way to count geohashes, even in areas as big as continents.

Let’s get started.

Let’s Talk Geohashes

For those of you who managed to stumble onto this blog-post because it’s rainy outside, here’s some background on geohashes to get you caught up. Wikipedia describes “Geohash” as a geocoding system, which means it represents global locations through letter and digit strings. Geohashes allow us to do two things:

Represent locations around the globe efficiently and accurately. For example, I can send you a 9-digit geohash string and you would be able to locate me with the precision of a 2m by 4m rectangle on earth (Well, roughly – the distances vary according to where you are on the globe).
Decide on the required accuracy of geo-identification according to your needs. By removing geohash digits from the right side of the representation you will still get pointed to the same location, just with less accuracy. For example, if we remove one digit from the 9-digit geohash string, we get an 8-digit representation that points us to a location with the precision of (roughly) a 20m by 19m rectangle on earth.

Take a look at these two pictures for further clarification:

This is a map of my hometown, Petah-Tikva, in Israel. The large square represents a level 7 geohash that can be referred to as “sv8y9ed”. For those of you who counted the digits, there are 7 of them. Hence, a level 7 geohash.

This map refers to the same geohash as before, however, it represents a level 8 geohash inside our original object. You’ll notice that as you dive into a deeper precision level, your original rectangle will be divided into 32 internal rectangles.

Each geohash level can be divided into 32 squares, with each smaller square representing a geohash of the next level (in the example above: level 8). There is no limit to the precision level you can reach. If you dive another two precision levels in, for example from level 7 to level 9, the number of rectangles inside the level 7 geohash will equal 32^2, and so on…

Going the other way around: removing a digit from the right side of the geohash representation will always point to the same location but with a smaller precision and a larger square around the desired area (i.e the parent geohash object).

Counting Geohashes in Countries and Cities: System Requirements

As mentioned before, we needed to count the number of geohashes of a certain level in a defined area. By doing so, we have the ability to conduct a more granular analysis of which geohashes comply with certain rules we want to examine, out of all the geohashes available in a “defined area”. The “defined area” can be a neighborhood, city, country, or really any area on the globe.

To find that elusive number, we needed a system that we could obtain the coordinates of an area of interest (a Polygon) and a geohash precision level. The output of the system should state the number of geohash objects within a specific area at a specific precision level.

Not really conductive to our original task of counting, but due to the nature of the algorithm, a welcomed side-effect would be a compressed or uncompressed list of all the geohash IDs in our defined area. This information might be pertinent to some businesses.

Someone Must Have Done it Already

Google, or Duckduckgo for the purists among us, should have some information available, as someone must have already had the need for counting geohashes and already implemented a solution. We found a few implementations from people who wrote code and shared it with the world, including a Python implementation and a Scala implementation that originally looked fit for the job.

However, when we tried to get a count of all level 7 geohashes in the US, for example, the Python implementation took way too long and died after a few hours! The Scala implementation ran for 24 hours but crashed for unknown reasons when we tried to fetch the results. Clearly, the people who wrote the code didn’t have continent sized polygons in mind.

The second worst thing a developer wants to do

The second worst thing a developer wants to do is to read someone else’s code (well known fact). Reading through the source code for the Python implementation, it became obvious that whoever wrote it (thanks, by the way) meant for it to run on fairly small polygons.

The way the algorithm worked also rang a few warning bells. What it did was as follows:

Found the center of the required polygon.
Placed a geohash object (of the requested precision) in the middle of the polygon,
Started going through the 8 neighboring geohashes, tested if they were inside the polygon, and repeated the process for the neighboring areas.
Finally, it returned all the geohashes that were found inside the polygon.

Why was this problematic?

The center of the polygon could be outside the polygon! Imagine a crescent shaped island, for example.
What if the polygon is a MultiPolygon? For example, the USA and Hawaii. We’d get a count only for the center mass.
It’s terribly inefficient to run from the highest precision level right at the get-go. The US has approximately 18642894503 level 8’s inside it. That’s 18 Billion!

I can go on, but hey, the guy made an effort, solved it for small polygons, and shared it with the world!

The #1 worst thing a developer wants to do

What terrifies developers, and is considered the worst thing they need to do, is to write a solution from scratch. But hey, someone has to do it! Back to the drawing board.

Creating a Geohash Counting Solution

Ok, so I rewrote it. Here are the highlights:

Define your polygon
Choose a lower precision geohash (e.g level 2 or 3)
Find all geohashes of said level in the polygon
Decide what to do with intersecting geohashes
Calculate the number of geohashes of a higher precision level
Reaching a complete list

Let’s look at each step in more detail:

Define Your Polygon

Choose the area you want to check. In our example, we chose the United States. We do so by asking Shapely, a python library that performs all the geometry you will ever need, to supply us with the boundaries of the polygon. In the case of the US, it would include the mainland, Guam, Hawaii, and Alaska.

Start with lower precision GEOHASHES & find all Geohashes inside the polygon

Once we have the boundaries, we can create a grid of geohashes starting at the north-west extremity all the way to the south-east extremity. These would be the initial geohashes.

But which geohash level should we choose? One of the main drawbacks we had with the original implementation was that all the processing was done at a high level for the desired location. (Reminder: the US had 18 billion for precision level 8).

Checking if a geohash precision 2 is in the US and checking if geohash precision 8 is in the US, computationally costs the same, but if we get a result saying that geohash ‘9x‘ (not level 9, see map below) is completely inside the polygon, we can skip the testing of 32^6 precision 8 geohashes that are inside ‘9x‘ and mark them all as in.

This also gives us “compression”. Once ‘9x’ is stored as inside the US, we do not need to waste memory and storage by storing the 32^6 precision 8 objects. ‘9x’ inside the inner_objects list is sufficient. Hence, the “compression”. This also means that any precision geohash that starts with ‘9x’ is inside the US.

Therefore, in order to start the program, we’ll need to choose a group of low precision geohashes that completely cover the polygon. If we look at a map of the United States, we can see that a few geohash precision 2 objects fit within the borders since the country is so big.

Later on we will check which geohashes of the level we require are inside these level 2 geohashes, But first, what should we do with geohashes that intersect the polygon borders?

In, Out, Or In Between

Now that we have our initial set, we use Shapely, again, to test each of our initial geohash objects against the polygon.

There can be three possible responses:

Test subject is inside the polygon – in that case, we add it to the result list.
Test subject is outside the polygon – in that case, we discard it as it’s no longer interesting.
Test subject is intersected by the polygon – This is where it gets interesting.

The meaning of result #3 is that the geohash has parts in the polygon but also parts that are outside. In this case, we need to test the 32 internal higher precision geohash objects against the polygon.

In order to do so, we break the large geohash object into 32 smaller object and push those into the queue of objects that are waiting to be analyzed. (The same queue we first pushed the initial geohashes into).

When recursing into data, it’s always important to remember when to stop. In our case, we will stop the recursion when we reach a precision level of n (n meaning the precision level for which we want the count). We’ll also stop if there are no more objects to test. Maybe the polygon is a rectangle that snuggly fits all tested geohashes.

This image is taken from the original implementation Github page and illustrates the difference between inner only and inner+outer result.

When doing inner only, you may miss some space in the polygon that is not covered. When using the outer option, though, you also get some real-estate that may belong to a neighboring polygon.

The implementation of this in our case is easy. When we reach the target precision, any intersecting object would be either added to the results (if “outer” is requested) or discarded if we requested “inner”.

How to count (efficiently)

We’ve officially recursed through our polygon and stored all geohashes (from various levels) that fit into the polygon. Now comes counting the results.

Considering that some of the final objects are not in the highest requested geohash level, it’s still easy to get the number of internal geohashes by counting it as 32^n where n = –

More for Developers

Otonomo is more than a car data exchange. Read these blogs written by developers, for developers, about coding, technology and culture.

Spark Cache Applied at Large Scale – Challenges, Pitfalls and Solutions

The ultimate guide for Spark cache and Spark memory. Learn to apply Spark caching on production with confidence, for large-scales of data. Everything Spark cache.

Ofek Hod

November 18, 2021

@Otonomo: An Innovative Approach to Software Delivery

In our Behind the Scenes Otonomo series, we talk to people from across the Otonomo family to hear what makes their job unique, and the innovative ways they take on their role within the company.

Nir Nahum - Software Engineering Team Leader

June 15, 2021

How We Run CI/CD in Our Development Process new

We developed a CI/CD pipeline to assist our R&D save time when merging to the master branch. Learn about our environment challenges, cloud pricing, and more

Danny Gitelman

August 12, 2019

Luigi, Airflow, Pinball, and Chronos: Comparing Workflow Management Systems

A comparison of Luigi, Airflow, Pinball and Chronos. Choose the best workflow management system for your automated jobs based on features and abilities.

Hilla Shapira

June 5, 2019

How to Count Large Scale Geohashes

A brand new effective way to count geohashes in any given region at any level, even in continents. Learn how you can now analyze geohashes properly.

Itamar Landsman

June 3, 2019

Deleting Code Matters

Deleting parts of your code is hard but necessary. Read how keeping your code short is better for code maintenance, reducing bugs etc., and best practices.

Tzahi Furmanski

May 28, 2019

Redis, Kafka or RabbitMQ: Which MicroServices Message Broker To Choose?

Choose the best microservices message broker for your communication needs. Read this comparison of Redis, Kafka and RabbitMQ and become an expert.

Sefi Itzkovich - CTO

May 20, 2019

How We Run CI/CD in Our Development Process

We developed a CI/CD pipeline to assist our R&D save time when merging to the master branch. Learn about our environment challenges, cloud pricing, and more

Danny Gitelman

August 12, 2019

Reshaped and Insightful

We developed a CI/CD pipeline to assist our R&D save time when merging to the master branch. Learn about our environment challenges, cloud pricing, and more

Danny Gitelman

August 12, 2019

Detailed Reports and Analytics

We developed a CI/CD pipeline to assist our R&D save time when merging to the master branch. Learn about our environment challenges, cloud pricing, and more

Danny Gitelman

August 12, 2019

Put data to work with innovative mobility service solutions

Ready
for Connected
Car Data?

Enrich traffic info
with real time &
historical vehicle data

Data-Driven Driving:
Consumer attitudes
about shared mobility

Otonomo is hiring!

How to Count Large Scale Geohashes

Let’s Talk Geohashes

Counting Geohashes in Countries and Cities: System Requirements

Someone Must Have Done it Already