Sunday, 1 February 2015

GeoZip System for Encoding, Decoding, and Searching Geographic Coordinates

Inventor and Author: Rahim Sikander Allana


The GeoZip System for Encoding, Decoding, and Searching Geographic Coordinates


GeoZip is a system that includes methods for encoding and decoding the latitude and longitude values of any given geographic coordinate into a single numeric value representing a hierarchical spatial data structure.  Starting from the high order digits and moving towards the low order ones, each pair of digits in a GeoZip code represents a smaller subdivision of the previous subdivision, until it eventually narrows in on the exact coordinate that was encoded in the first place.

Description of Methods

This system requires that the latitude, and longitude values of geographic coordinates being encoded have the same number of decimal digits that cannot be increased throughout the life of a particular implementation of this system.  Let us consider an example in which our geographical coordinates will not exceed 6 decimal digits.  By way of example, let us also assume that the geographic coordinate that we are considering have a latitude value of -34.783467, and a longitude value of 128.294109.

Space Shift
The geographic coordinate space is divided into four parts. In order for us to setup an implementation of this system, we must first shift the entire space into one of the quadrants which will of course extend to the size of the entire space.  Although, for any given implementation, we can make this shift to any of the quadrants, and adjust the formulae accordingly, the most natural one, for our example, seems to be the quadrant that has positive ranges for both the latitude, and longitude values.  Hence, To shift the entire space into the quadrant with positive values, the -90, -180 coordinate will be shifted to where the 0, 0 coordinate is, and it would become the new 0, 0 coordinate, the old 0, 0 will shift to 90, 180, and the old 90, 180 will shift to 180, 360, thus extending the quadrant in question to accommodate all the four quadrants of the old space.  This space shift has been illustrated in Figure 1.

All coordinates being encoded into this system will exist in this shifted space.  Hence, in order to encode any given coordinate, one must shift its latitude, and longitude values, just as we have shifted our space.  Given our current example, shown below are the formulae to shift any given geographic coordinate:
Shifted_Latitude = Latitude + 90
Shifted_Longitude = Longitude + 180
Hence, the latitude, and longitude values in our example will be shifted to 55.216533, 308.294109.
Zipping Latitude and Longitude into a Single Integer Value
We zip the latitude and longitude values together by interleaving the digits in each using the following method.  Our shifted space has a latitude range of 180 degrees, and a longitude range of 360 degrees.  Regardless of the number of decimal digits being used we will require at least 6 high order digits in order to interleave the integer portion of the latitude, and longitude.  For latitudes, or longitudes that have integer portions with digits less than 3, zeros will be substituted for the missing digits.  Hence, a value of 45 will be represented as 045, a value of 7, will be represented as 007, and 0 will be represented as 000, etc.  Given the fact that the number of digits being used for the integer and the decimal portions to represent any latitude, and longitude will be fixed in an implemented GeoZip system, the decimal point itself can be eliminated.  In our example, we are using 6 decimal digits, which means that we will need 18 digits to represent the interleaved values of both our shifted latitude, and longitude.
To zip the latitude, and longitude values, we take the first high order digit from the latitude value, and make it the first high order digit in the resulting GeoZip code.  The first high order digit from the longitude value will become the second high order digit in the GeoZip code.  The second high order digit in the latitude value will become the third consecutive high order digit in our GeoZip code, and the second high order digit in our longitude value will become the fourth consecutive high order digit in our GeoZip code.  We will keep adding pairs to our GeoZip code till all the digits in our latitude, and longitude values have been exhausted.  Hence, as far as our example is concerned the resulting GeoZip code will be 035058221964513039.  In this resulting integer value every pair of digits starting from the high order numbers, and moving towards the low order numbers will narrow our space down to the coordinate under consideration.  Figure 2 illustrates this method of zipping the two values together.

Unzipping the Latitude and Longitude Out of a GeoZip code
This method is simply a reversal of the encoding method.  The first step is to check if the given GeoZip code has the maximum number of encoded digits that have been designated for a given implementation.  The missing digits would be assumed to be zeros in high order position.  For our example, if the number of digits in the  GeoZip code is less than 18 than we must assume that the missing digits are zeros in the high order position that were left out perhaps because of the GeoZip code being saved in a numeric data field or variable.  These missing zeros can be assumed to exist, or they can be added to the beginning of the GeoZip code, depending on the way in which the system is implemented.
The next step is to unzip the GeoZip code digits, starting from the first high order digit which will be the first high order digit in the resulting latitude value.  The next digit will be the first high order digit in the resulting longitude value, and so on till we reach the last low order digit.  As far as our example is concerned , unzipping the GeoZip code will result in a 9 digit latitude, and a 9 digit longitude value.  During this step we must also add the missing decimal to both the latitude, and longitude values after the third high order digit.
Shift Back to Original Space
As our final step we must shift the resulting latitude, and longitude back to their original space.  The formulae used to shift the coordintate during the encoding step is simply reversed as shown below:
Latitude = Shifted_Latitude  - 90
Longitude = Shifted_Longitude + 180

Search Methods
Proximity Search
Searches are usually faster if you can divide out a chunk of data that is likely to have the item you are searching for.  In a GeoZip code, starting from the high order digit, every pair of digits represents a rectangle that contains the point we are searching for.  As we move towards the low order digits, every consecutive rectangle is smaller than the previous rectangle, and is encapsulated within it.  These rectangles become smaller and smaller as we reach the last pair of low order digits which leaves us with the exact latitude and longitude values of the geographic coordinate we are searching for.
In general, the GeoZip codes that share the most number of high order digits are closer together.  Of course this is not the case at the poles, and at the seam in our space at -180 degree longitude.  However, if our application affords us the possibility of not compensating for this deficiency, we might realize that the percentage of the world population that lives at the poles, or at the -180 degree longitude meridian is relatively low.  Furthermore, one has to account for the edge cases that fall into adjacent rectangles or buckets.  Let us consider the example of a user of a mobile phone, or cellphone app who wants to know the names of all the restaurants in a mile around him/her.  One way of performing a proximity search would be to figure out the number of contiguous high order digit pairs that represents the smallest rectangular space containing the points we are looking for.  
Although, the GeoZip code can be used as text string in a computer system, It should be noted that, because the GeoZip system provides a single numeric value, it might result in greater flexibility, and / or performance if we can fit the GeoZip codes in our implementation into a single numeric field, or variable.  Incidentally, the 6 decimal digit precision we have used in our example does allow us to save our GeoZip into a 64 bit integer value that is available in various databases.  Hence, in our example, if the following set of high order digits represent the area that is of interest to us: 03506872893441, then we could simply search for all the numbers in the range starting from 035068728934410000 and ending at 035068728934419999.  The zero in the high order position can be ignored for a numeric field.
The area defined by the GeoZip digit pairs provide a rough bound for the values we are searching for, however, we could perform a Bounding Rectangle Search on the result to filter down the data, before we check the distance between our users’ coordinates, and each data point in our narrowed down area.
Bounding Rectangle Search
Given a geographic coordinate, and the distance we want to cover from it, there exist established methods for finding a minimum bounding rectangle which encapsulates a circle that is centered at the coordinate we are searching against and has a radius equal to the distance we are searching against.  Hence, the minimum bounding rectangle is a good way to extract a chunk of data that, for the most part, satisfies our distance requirements.  Given our example, a one mile minimum bounding rectangle around our user would give us all the restaurants that are within a mile around our user, however, it would also include some that are farther away due to the four corners of the minimum bounding rectangle that are outside the one mile circle around our user.
In order to perform a bounding rectangle search on a list of GeoZip codes we need the geographic coordinates of the upper right corner, and the lower left corner of the bounding rectangle.  All the GeoZip codes of the points in our dataset that are greater than the GeoZip code of the lower left corner, and are less than the GeoZip code of the upper right corner are within the bounding rectangle we are searching against.  To include the boundaries of our rectangle we can include equality to our comparison operators.


Creative Commons License



    There was a typing error in the formula for shifting the longitude back to the original space. The "+", should be a "-". Hence, the correct formula, as expected, is: Longitude = Shifted_Longitude - 180

  2. Thanks! I like the simplicity of it and wrote an implementation in Javascript

  3. I tried implementing this method and found that by using the lower left and the upper right I am actually getting points outside of the bounding rectangle.

    For example:
    Lower Left:(37.9,-91.1) --> 10287899
    Upper Right:(39.4,-89.3) --> 10299047

    Point outside the bounding rectangle: (37.5,-80.7) --> 10297953

    This point which is clearly outside of the bounding rectangle has a GeoZip value between the LL and UR GeoZip values. The range between the two values graphically looks roughly like this:

  4. Bounding Rectangle Search was a last minute addition to this article, and due to personal hardships I was unable to validate the hypothesis extensively. My motivation for this hasty addition to this article came from my desire to preserve such a possibility for the community.

    The following link to a PDF document: shows the encoded values for a 20 x 20 grid. It demonstrates that the Bounding Rectangle Search only works if the encoded values of lower left, and upper right corners of the Bounding Rectangle are within the same Proximity Bucket pointed out by the same digit pairs. In this case the digit pairs are 00, 10, 01, and 11. This is the very reason why perhaps the Proximity Search, based on the digit pairs, is expected to work.

    I believe the jury is still out on the validity of the Encoding, and Decoding System itself, as well as the Proximity Search. I would really appreciate any contribution you can make in that regard.